<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fgene.2021.716132</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Predicting Protein Therapeutic Candidates for Bovine Babesiosis Using Secondary Structure Properties and Machine Learning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Goodswen</surname> <given-names>Stephen J.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/543266/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Kennedy</surname> <given-names>Paul J.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/598477/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Ellis</surname> <given-names>John T.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/541484/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Life Sciences, University of Technology Sydney</institution>, <addr-line>Ultimo, NSW</addr-line>, <country>Australia</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Computer Science, Faculty of Engineering and Information Technology and the Australian Artificial Intelligence Institute, University of Technology Sydney</institution>, <addr-line>Ultimo, NSW</addr-line>, <country>Australia</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Ka-Chun Wong, City University of Hong Kong, Hong Kong</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Mohamed Abdo Rizk, Mansoura University, Egypt; Estrella Montero, Carlos III Health Institute (ISCIII), Spain</p></fn>
<corresp id="c001">&#x002A;Correspondence: John T. Ellis, <email>john.ellis@uts.edu.au</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>23</day>
<month>07</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>12</volume>
<elocation-id>716132</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>05</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>06</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2021 Goodswen, Kennedy and Ellis.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Goodswen, Kennedy and Ellis</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Bovine babesiosis causes significant annual global economic loss in the beef and dairy cattle industry. It is a disease caused by infection of red blood cells by haemoprotozoan parasites of the genus <italic>Babesia</italic> in the phylum Apicomplexa. Principal species are <italic>Babesia bovis, Babesia bigemina</italic>, and <italic>Babesia divergens.</italic> There is no subunit vaccine. Potential therapeutic targets against babesiosis include members of the exportome. This study investigates the novel use of protein secondary structure characteristics and machine learning algorithms to predict exportome membership probabilities. The premise of the approach is to detect characteristic differences that can help distinguish one protein type from another. Structural properties such as a protein&#x2019;s local conformational classification states, backbone torsion angles &#x03D5; (phi) and &#x03C8; (psi), solvent-accessible surface area, contact number, and half-sphere exposure are explored here as potential distinguishing protein characteristics. The presented methods that exploit these structural properties via machine learning are shown to distinguish exportome from non-exportome <italic>Babesia bovis</italic> proteins with 86&#x2013;92% accuracy (based on 10-fold cross validation and independent testing). These methods are encapsulated in freely available Linux pipelines set up for automated, high-throughput processing. Furthermore, proposed therapeutic candidates for laboratory investigation are provided for <italic>B. bovis, B. bigemina</italic>, and two other haemoprotozoan species, <italic>Babesia canis</italic>, and <italic>Plasmodium falciparum.</italic></p>
</abstract>
<kwd-group>
<kwd><italic>Babesia bovis</italic></kwd>
<kwd><italic>Babesia bigemina</italic></kwd>
<kwd><italic>Babesia canis</italic></kwd>
<kwd>machine learning</kwd>
<kwd>exportome</kwd>
<kwd>vaccine</kwd>
<kwd>protein secondary structure</kwd>
</kwd-group>
<contract-sponsor id="cn001">Australian Research Council<named-content content-type="fundref-id">10.13039/501100000923</named-content></contract-sponsor>
<counts>
<fig-count count="4"/>
<table-count count="6"/>
<equation-count count="0"/>
<ref-count count="73"/>
<page-count count="17"/>
<word-count count="0"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1">
<title>Introduction</title>
<p>The underlying procedure to identify protein candidates in an <italic>in silico</italic> vaccine discovery pipeline is to find and exploit differences between proteins. This procedure rests on the plausible assumption that proteins inducing an immune response in a host must differ from those that induce no response. More specifically, immunogenic proteins are expected to contain regions that can trigger a cellular immune response mediated by T or B cells, namely epitopes. An epitope is the minimal structure necessary to invoke an immune response and must come from proteins accessible to the immune system (<xref ref-type="bibr" rid="B71">Vivona et al., 2008</xref>). Several bioinformatic programs (<xref ref-type="bibr" rid="B47">Krogh et al., 2001</xref>; <xref ref-type="bibr" rid="B43">Kall et al., 2004</xref>; <xref ref-type="bibr" rid="B39">Horton et al., 2007</xref>; <xref ref-type="bibr" rid="B4">Armenteros et al., 2017</xref>, <xref ref-type="bibr" rid="B5">2019a</xref>) have been developed to predict various protein characteristics given a protein&#x2019;s primary structure represented by a linear sequence of amino acids. Detecting characteristic differences can help distinguish one protein type from another. For example, a characteristic such as whether a newly synthesised protein is targeted to the secretory pathway can be predicted from the presence of a secretory signal peptide (SP) encoded in its primary structure (<xref ref-type="bibr" rid="B16">Emanuelsson et al., 2007</xref>). Previous studies (<xref ref-type="bibr" rid="B27">Goodswen et al., 2013a</xref>,<xref ref-type="bibr" rid="B28">b</xref>) have collated these various predicted characteristics and then trained machine learning (ML) models to computationally detect differences, which effectively epitomises the current state-of-the-art approach to <italic>in silico</italic> vaccine discovery against eukaryotic pathogens.</p>
<p>Prediction of protein secondary structure (SS) presents further protein characterisation opportunities to help classify one protein type from another. Protein SS denotes the local conformation of a protein&#x2019;s polypeptide backbone, i.e., helical and sheet hydrogen bonding patterns in a biopolymer (<xref ref-type="bibr" rid="B73">Yang et al., 2018</xref>). The two most common SS conformations are &#x03B1;-helix and &#x03B2;-sheet. A common SS characterisation standard (<xref ref-type="bibr" rid="B42">Kabsch and Sander, 1983</xref>) defines the conformation in 3 or 8 classification states according to hydrogen-bonding patterns. The 3 classes are helix, sheet, and coil commonly designated H, E, and C, respectively; and the 8 classes comprise three types for helix (G for 3<sub>10</sub> helix, H for &#x03B1;-helix, and I for &#x03C0;-helix), two types for sheet (E for &#x03B2;-sheet and B for &#x03B2;-bridge), and three types for coil (T for &#x03B2;-turn, S for high curvature loop, and C for irregular).</p>
<p>Several bioinformatic programs (<xref ref-type="bibr" rid="B41">Jones, 1999</xref>; <xref ref-type="bibr" rid="B50">Magnan and Baldi, 2014</xref>; <xref ref-type="bibr" rid="B13">Drozdetskiy et al., 2015</xref>; <xref ref-type="bibr" rid="B72">Wang et al., 2016</xref>; <xref ref-type="bibr" rid="B36">Heffernan et al., 2018</xref>; <xref ref-type="bibr" rid="B34">Hanson et al., 2019</xref>; <xref ref-type="bibr" rid="B46">Klausen et al., 2019</xref>; <xref ref-type="bibr" rid="B70">Torrisi et al., 2019</xref>) have been developed to predict 3 and/or 8 classes given primary sequences. Most of these <italic>ab initio</italic> predictors use a combination of ML and evolutionary profiles. Variants of neural networks are the predominant ML algorithms. An evolutionary profile is derived from a multiple sequence alignment of homologous sequences (<xref ref-type="bibr" rid="B73">Yang et al., 2018</xref>), mainly from the position-specific scoring matrix (PSSM) (<xref ref-type="bibr" rid="B41">Jones, 1999</xref>) calculated by Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST) (<xref ref-type="bibr" rid="B3">Altschul et al., 1997</xref>). Predictions are typically evaluated and reported as a Q3 or Q8 accuracy for 3 or 8 classes, respectively, which represents the percentage of residues correctly predicted. Although accuracies for SS predictions have steadily increased over the decades, a theoretical prediction limit of 88&#x2013;90% for Q3 has been estimated (<xref ref-type="bibr" rid="B63">Rost, 2001</xref>; <xref ref-type="bibr" rid="B52">Martin et al., 2005</xref>). 
The published predictor accuracies range from 72.5% to 87% for Q3, and 60% to 77% for Q8 (<xref ref-type="bibr" rid="B41">Jones, 1999</xref>; <xref ref-type="bibr" rid="B50">Magnan and Baldi, 2014</xref>; <xref ref-type="bibr" rid="B13">Drozdetskiy et al., 2015</xref>; <xref ref-type="bibr" rid="B72">Wang et al., 2016</xref>; <xref ref-type="bibr" rid="B36">Heffernan et al., 2018</xref>; <xref ref-type="bibr" rid="B34">Hanson et al., 2019</xref>; <xref ref-type="bibr" rid="B46">Klausen et al., 2019</xref>; <xref ref-type="bibr" rid="B70">Torrisi et al., 2019</xref>).</p>
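<p>As an illustration only, the Q3 and Q8 measures described above can be sketched in a few lines of Python. The function names and the 10-residue example strings are hypothetical and are not taken from any of the cited predictors; the 8-to-3 mapping follows the grouping of helix, sheet, and coil types given in the text.</p>
<preformat>
```python
# Mapping of the 8 classification states onto the 3 states (helix, sheet, coil),
# following the grouping described in the text.
EIGHT_TO_THREE = {
    "G": "H", "H": "H", "I": "H",   # helix types
    "E": "E", "B": "E",             # sheet types
    "T": "C", "S": "C", "C": "C",   # coil types
}


def reduce_8_to_3(states):
    """Convert a string of 8-class states into the 3-class alphabet."""
    return "".join(EIGHT_TO_THREE[s] for s in states)


def q_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state.

    Gives Q3 when both strings use the 3-class alphabet and Q8 when both
    use the 8-class alphabet.
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example in the 3-class alphabet.
observed = "CHHHHEEECC"
predicted = "CHHHCEEECC"   # one residue (position 5) is mispredicted
print(q_accuracy(predicted, observed))
```
</preformat>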
<p>Other structural properties such as backbone torsion (dihedral) angles &#x03D5; (phi) and &#x03C8; (psi), solvent-accessible surface area (ASA), contact number (CN), and half-sphere exposure (HSE) are also potential distinguishing protein characteristics. Torsion angles phi and psi provide the flexibility required for the polypeptide backbone to adopt a certain fold, and therefore determine the conformation of the backbone (<xref ref-type="bibr" rid="B61">Ramachandran et al., 1963</xref>). ASA distinguishes residues that are buried (low ASA) from those that are exposed to solvent (high ASA) in the folded protein (<xref ref-type="bibr" rid="B34">Hanson et al., 2019</xref>). CN is another solvent exposure measure that counts spatially close residues within a distance cut-off to a target residue (<xref ref-type="bibr" rid="B60">Pollastri et al., 2002</xref>). The distances are based on the positions of alpha carbon (C&#x03B1;) or beta carbon (C&#x03B2;) atoms (<xref ref-type="bibr" rid="B35">Heffernan et al., 2016</xref>). HSE is a 2D measure of a residue&#x2019;s solvent exposure and adds directionality to CN by splitting the sphere defined by the distance cut-off into two halves, designated up and down (<xref ref-type="bibr" rid="B33">Hamelryck, 2005</xref>; <xref ref-type="bibr" rid="B35">Heffernan et al., 2016</xref>). Several programs (<xref ref-type="bibr" rid="B18">Fang et al., 2019</xref>; <xref ref-type="bibr" rid="B34">Hanson et al., 2019</xref>; <xref ref-type="bibr" rid="B46">Klausen et al., 2019</xref>) provide phi and psi angles, ASA, CN, and HSE values in their output in addition to 3 and 8 class predictions.</p>
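<p>To make the relationship between CN and HSE concrete, the following minimal Python sketch computes both for a toy set of coordinates. It assumes that distances are measured between C&#x03B1; atoms and that the up half-sphere is the one into which the C&#x03B1;-to-C&#x03B2; vector points; the coordinates, the cut-off of 13 and the function names are hypothetical choices made for illustration, not values used by the cited programs.</p>
<preformat>
```python
import math


def contact_number(ca, i, cutoff):
    """Count residues whose C-alpha atom lies within `cutoff` of residue i."""
    return sum(
        1
        for j, other in enumerate(ca)
        if j != i and cutoff >= math.dist(ca[i], other)
    )


def half_sphere_exposure(ca, cb, i, cutoff):
    """Split the contact number of residue i into (up, down) counts.

    A neighbour is counted as 'up' when it lies in the half-sphere that the
    C-alpha to C-beta vector of residue i points into, otherwise as 'down'.
    """
    axis = [b - a for a, b in zip(ca[i], cb[i])]
    up = down = 0
    for j, other in enumerate(ca):
        if j == i or math.dist(ca[i], other) > cutoff:
            continue
        offset = [o - a for a, o in zip(ca[i], other)]
        if sum(x * y for x, y in zip(axis, offset)) >= 0:
            up += 1
        else:
            down += 1
    return up, down


# Hypothetical C-alpha coordinates for five residues (arbitrary units).
ca = [(0, 0, 0), (3, 0, 0), (-3, 0, 0), (0, 4, 0), (20, 0, 0)]
# Hypothetical C-beta coordinates; only the entry for residue 0 matters here.
cb = [(1, 0, 0), (3, 1, 0), (-3, 1, 0), (0, 5, 0), (20, 1, 0)]

cn = contact_number(ca, 0, cutoff=13)
up, down = half_sphere_exposure(ca, cb, 0, cutoff=13)
print(cn, up, down)
```
</preformat>
<p>In this sketch the up and down counts always sum to the contact number, which reflects the description of HSE as CN with added directionality.</p>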
<p><italic>Babesia bovis</italic> is a tick-transmitted, obligate intracellular, haemoprotozoan parasite of the phylum Apicomplexa (<xref ref-type="bibr" rid="B38">Homer et al., 2000</xref>). <italic>Babesia</italic> infection of erythrocytes (red blood cells) can cause a severe disease called babesiosis in susceptible hosts (<xref ref-type="bibr" rid="B40">Hunfeld et al., 2008</xref>). This disease is of interest to the current study because there is no subunit vaccine (<xref ref-type="bibr" rid="B9">Brayton et al., 2007</xref>) and the annual global economic loss in the beef and dairy cattle industry due to babesiosis is significant and of great concern (<xref ref-type="bibr" rid="B68">Suarez and Noh, 2011</xref>). Current vaccines against <italic>B. bovis</italic> are based on live formulations, whilst subunit vaccines are deemed safer, and easier to handle and produce (<xref ref-type="bibr" rid="B20">Florin-Christensen et al., 2014</xref>). Several reviews describe background and current insights into the research, detection and treatment of bovine babesiosis (<xref ref-type="bibr" rid="B68">Suarez and Noh, 2011</xref>; <xref ref-type="bibr" rid="B54">Mosqueda et al., 2012</xref>; <xref ref-type="bibr" rid="B62">Rathinasamy et al., 2019</xref>; <xref ref-type="bibr" rid="B67">Suarez et al., 2019</xref>). Potential vaccine targets against bovine babesiosis include members of the exportome, i.e., those proteins exported outside the parasite into the host&#x2019;s erythrocyte cytoplasm and/or the erythrocyte membrane (<xref ref-type="bibr" rid="B25">Gohil et al., 2010</xref>). An unknown subset of the exportome is thought to mediate the pathogenesis of babesiosis by altering structural and functional properties of parasitised erythrocytes, and such a subset contains potential therapeutic targets (<xref ref-type="bibr" rid="B24">Gohil et al., 2013</xref>). 
Furthermore, exported proteins exposed to the immune system provide target potential for vaccine development (<xref ref-type="bibr" rid="B62">Rathinasamy et al., 2019</xref>). <xref ref-type="fig" rid="F1">Figure 1</xref> shows a 3D model of the SS of two <italic>B. bovis</italic> proteins, one expected and the other not expected to be exportome members [images generated by Phyre2 (<xref ref-type="bibr" rid="B44">Kelley et al., 2015</xref>)].</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>3D model of the secondary structure of two <italic>Babesia bovis</italic> T2Bo proteins. The two images were generated by the Phyre2 web portal for protein modelling, prediction and analysis. Protein folding is coloured by rainbow from the N- to the C-terminus. BBOV_III002350 is a small open reading frame (smORF) protein. It is expected to be an exportome member because smORF proteins are known to have an association with the erythrocyte membrane. BBOV_III002880 is a KE2 family protein. This protein is expected to be a non-exportome member because its subcellular location is in the cytoplasm. Note that the Phyre2 protein model reliability depends on the extent of homology between a user-supplied sequence and a sequence of known structure in the Protein Data Bank (PDB). In this case, the BBOV_III002350 sequence has 23% coverage (33 out of 142 residues) with a known PDB molecule (nicotinate-nucleotide adenylyltransferase), and the BBOV_III002880 sequence has 83% coverage (101 out of 122 residues) with a PDB molecule (chaperone prefoldin subunit 4). Therefore, only the homologous residues are displayed in the images. Secondary-structure predictions for the three predominant states (&#x03B1;-helix, &#x03B2;-sheet, and coil) are shown below the modelling images for the entire sequence lengths of BBOV_III002350 and BBOV_III002880.</p></caption>
<graphic xlink:href="fgene-12-716132-g001.tif"/>
</fig>
<p>In a previous study, we used ML with a protein&#x2019;s primary structure, principally in the input format of amino-acid composition, to predict <italic>B. bovis</italic> exported proteins (<xref ref-type="bibr" rid="B29">Goodswen et al., 2021</xref>). In this study, we investigated the novel use of protein SS characteristics to predict exportome membership as a complementary method. More specifically, trained ML models were used to detect differences in 3 and 8 state local conformations, phi and psi angles, CN, ASA, and HSE between expected exportome and non-exportome proteins. Apicomplexan pathogens, such as <italic>B. bovis</italic>, are complex biological systems consisting of thousands of proteins. The presented ML-SS approach predicts exportome membership with 86&#x2013;92% accuracy (based on 10-fold cross validation and independent testing), and therefore identifies, out of thousands, those proteins most worthy of further laboratory investigation. Furthermore, the approach was tested for its universal effectiveness on other parasites of the phylum Apicomplexa &#x2013; three haemoprotozoan species, namely <italic>Babesia bigemina, Babesia canis</italic>, and <italic>Plasmodium falciparum</italic>. <italic>Toxoplasma gondii</italic>, considered the model organism for Apicomplexa (<xref ref-type="bibr" rid="B45">Kim and Weiss, 2004</xref>), was also tested in the current study as an outlier to <italic>Babesia</italic> and <italic>Plasmodium</italic> because it invades only nucleated cells (<xref ref-type="bibr" rid="B66">Sibley, 2003</xref>). Proposed candidates for laboratory investigation are provided for <italic>B. bovis</italic> and the three other haemoprotozoan species. Furthermore, Linux pipelines implementing the ML-SS approach are made freely available for download.</p>
</sec>
<sec id="S2">
<title>Results</title>
<sec id="S2.SS1">
<title>Predicted <italic>Babesia bovis</italic> T2Bo Exportome Members Using Rule-Based Method</title>
<p>There is currently no laboratory-verified list of exportome proteins for <italic>B. bovis, B. bigemina, and B. canis.</italic> The current state-of-the-art prediction method for exportome membership is a rule-based bioinformatics approach proposed by Gohil et al. (<xref ref-type="bibr" rid="B24">Gohil et al., 2013</xref>). The rules are based on known characteristics of <italic>Plasmodium</italic> exportome proteins, such as the presence of an SP together with the absence of transmembrane (TM) domains and glycosylphosphatidylinositol (GPI) anchors.</p>
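<p>A minimal sketch of such a rule-based filter is given below. It encodes only the three characteristics named above (SP present, no TM domain, no GPI anchor); the full published rule set may contain further criteria. The record type, field names and example proteins are hypothetical, and in practice the three fields would be filled from the output of the relevant prediction programs.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass
class ProteinAnnotation:
    """Predicted characteristics of one protein (all field names hypothetical)."""
    protein_id: str
    has_signal_peptide: bool
    tm_domain_count: int
    has_gpi_anchor: bool


def meets_rule_based_criteria(protein):
    """True when a signal peptide is present and TM domains and GPI anchor are absent."""
    return (
        protein.has_signal_peptide
        and protein.tm_domain_count == 0
        and not protein.has_gpi_anchor
    )


# Hypothetical annotations; the identifiers are placeholders, not real gene IDs.
proteins = [
    ProteinAnnotation("protein_A", True, 0, False),   # passes all three rules
    ProteinAnnotation("protein_B", False, 1, False),  # no SP and one TM domain
    ProteinAnnotation("protein_C", True, 0, True),    # GPI anchor predicted
    ProteinAnnotation("protein_D", True, 2, False),   # two TM domains
]

selected = [p.protein_id for p in proteins if meets_rule_based_criteria(p)]
print(selected)
```
</preformat>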
<p><xref ref-type="supplementary-material" rid="TS1">Supplementary Table 1</xref> lists 276 out of a possible 3706 <italic>Babesia bovis T2Bo</italic> proteins that meet the Gohil rule-based selection criteria. Only laboratory testing can definitively confirm whether any of these 276 are exportome members. However, previous studies have provided indications of protein types likely to have an association with the erythrocyte membrane. For example, spherical body proteins (SBP) are believed to be responsible for host cell modifications (<xref ref-type="bibr" rid="B69">Terkawi et al., 2011</xref>; <xref ref-type="bibr" rid="B30">Gubbels and Duraisingh, 2012</xref>); heat shock proteins (HSP70 and HSP90) are known to be exported into the erythrocyte cytoplasm (<xref ref-type="bibr" rid="B51">Maier et al., 2009</xref>; <xref ref-type="bibr" rid="B48">Kuelzer et al., 2012</xref>); variant erythrocyte surface antigen (VESA) proteins are postulated to play a role in cytoadhesion, sequestration, and immune evasion (<xref ref-type="bibr" rid="B2">Allred et al., 2000</xref>; <xref ref-type="bibr" rid="B57">O&#x2019;Connor and Allred, 2000</xref>; <xref ref-type="bibr" rid="B9">Brayton et al., 2007</xref>); and small open reading frame (smORF) proteins play a role in VESA protein biology (<xref ref-type="bibr" rid="B9">Brayton et al., 2007</xref>; <xref ref-type="bibr" rid="B19">Ferreri et al., 2012</xref>). The reliability of annotated protein names for <italic>Babesia</italic> species is considered poor by the current study (see the &#x201C;Discussion&#x201D; section). Despite this, proteins are assessed here on their names. <xref ref-type="table" rid="T1">Table 1</xref> shows a breakdown of the 276 proteins into protein types based on the annotated protein name. For example, there are four SBPs in the available 3706 <italic>B. bovis</italic> T2Bo proteins. Three of these fulfil the rule-based selection criteria. 
Two of these three were previously reported to localise to the infected erythrocyte membrane (<xref ref-type="bibr" rid="B37">Hines et al., 1995</xref>; <xref ref-type="bibr" rid="B64">Ruef et al., 2000</xref>): BBOV_II002880 (SBP 1) and BBOV_I004210 (SBP 3). No SP and one TM were predicted for BBOV_II000740 (SBP 2) and consequently this protein did not meet the selection criteria. Notable findings are that most SBP, HSP70, HSP90, and smORF proteins were selected, whereas almost all VESA proteins were not selected because they lack an SP.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Breakdown of protein types meeting the rule-based exportome selection criteria.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Protein type</td>
<td valign="top" align="center">Available<sup>a</sup></td>
<td valign="top" align="center">Selected<sup>b</sup></td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Spherical body proteins (SBP)<sup>c</sup></td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">3</td>
</tr>
<tr>
<td valign="top" align="left">Small open reading frame (smORF)<sup>c</sup></td>
<td valign="top" align="center">44</td>
<td valign="top" align="center">42</td>
</tr>
<tr>
<td valign="top" align="left">Heat shock proteins (HSP70)<sup>c</sup></td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">Heat shock proteins (HSP90)<sup>c</sup></td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td valign="top" align="left">Variant erythrocyte surface antigen-1 (VESA); family protein<sup>c</sup></td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">Variant erythrocyte surface antigen-1 (VESA); alpha subunit<sup>c</sup></td>
<td valign="top" align="center">71</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td valign="top" align="left">Variant erythrocyte surface antigen-1 (VESA); beta subunit<sup>c</sup></td>
<td valign="top" align="center">43</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">Variant erythrocyte surface antigen-1 (VESA); putative<sup>c</sup></td>
<td valign="top" align="center">14</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">Hypothetical proteins</td>
<td valign="top" align="center">1309</td>
<td valign="top" align="center">95</td>
</tr>
<tr>
<td valign="top" align="left">Membrane proteins (putative)</td>
<td valign="top" align="center">171</td>
<td valign="top" align="center">46</td>
</tr>
<tr>
<td valign="top" align="left">Erythrocyte membrane-associated antigen<sup>c</sup></td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td valign="top" align="left">Conserved hypothetical proteins</td>
<td valign="top" align="center">539</td>
<td valign="top" align="center">20</td>
</tr>
<tr>
<td valign="top" align="left">Other</td>
<td valign="top" align="center">1501</td>
<td valign="top" align="center">62</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Total</bold></td>
<td valign="top" align="center">3706</td>
<td valign="top" align="center"><bold>276</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic><sup><italic>a</italic></sup>Total number of available proteins of a particular protein type given 3706 <italic>Babesia bovis</italic> T2Bo proteins currently available.</italic></attrib>
<attrib><italic><sup><italic>b</italic></sup>Number of proteins from available that fulfil the rule-based selection criteria for an exportome member.</italic></attrib>
<attrib><italic><sup><italic>c</italic></sup>Protein types reported in studies to likely have an association with the erythrocyte membrane.</italic></attrib>
</table-wrap-foot>
</table-wrap>
<p><xref ref-type="supplementary-material" rid="TS1">Supplementary Table 1</xref> also lists 196 proteins (&#x223C;70%) taken from the 276 fulfilling the rule-based selection criteria to represent the &#x2018;positives&#x2019; for ML training data. The remaining 80 (&#x223C;30%) formed the independent dataset to test the ML model&#x2019;s performance. Two further datasets comprising 196 and 80 proteins were selected using the Python random module from 3430 <italic>B. bovis</italic> T2Bo proteins that did not meet the rule-based selection criteria. These latter datasets represented the &#x2018;negatives&#x2019; for ML training and testing, respectively.</p>
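<p>The construction of these four datasets can be sketched with the Python random module as follows. The dataset sizes (196 and 80 positives, 196 and 80 negatives drawn from 3430 proteins) come from the text; the placeholder identifiers, the seed values, the function names and the assumption that the positives were also divided at random are illustrative only.</p>
<preformat>
```python
import random


def split_positives(ids, n_train, seed):
    """Shuffle the positives and divide them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def sample_negatives(ids, n_train, n_test, seed):
    """Draw non-overlapping training and test negatives without replacement."""
    rng = random.Random(seed)
    chosen = rng.sample(list(ids), n_train + n_test)
    return chosen[:n_train], chosen[n_train:]


# Placeholder identifiers standing in for the real protein IDs.
positives = [f"pos_{i}" for i in range(276)]    # meet the rule-based criteria
negatives = [f"neg_{i}" for i in range(3430)]   # do not meet the criteria

train_pos, test_pos = split_positives(positives, 196, seed=1)
train_neg, test_neg = sample_negatives(negatives, 196, 80, seed=2)

print(len(train_pos), len(test_pos), len(train_neg), len(test_neg))
```
</preformat>
<p>Fixing the seed, as done here, is one way to make the random selection reproducible between runs.</p>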
</sec>
<sec id="S2.SS2">
<title>Predicted Exportome Members From Test Species Using Rule-Based Method</title>
<p><xref ref-type="supplementary-material" rid="TS2">Supplementary Table 2</xref> lists 277 out of a possible 5077 proteins currently available for <italic>Babesia bigemina</italic> BOND, 133 out of 3467 <italic>Babesia canis</italic> BcH-CHIPZ, 264 out of 5460 <italic>Plasmodium falciparum</italic> 3D7, and 318 out of 8322 <italic>Toxoplasma gondii</italic> ME49 proteins that meet the Gohil rule-based selection criteria.</p>
<p>Protein names of interest among those selected are ones with reported exportome associations. For example, selected <italic>B. bigemina</italic> proteins include HSP70, HSP90, SBP3, SBP4, and proteins containing a membrane attack complex (MAC)/perforin or DnaJ domain. Heat-shock proteins containing a DnaJ domain (previously known as HSP40s) are reported to be exported into the erythrocyte cytoplasm and to be involved in the transport of parasite proteins, including <italic>P. falciparum</italic> erythrocyte membrane protein 1 (PfEMP1) (<xref ref-type="bibr" rid="B51">Maier et al., 2009</xref>). Apicomplexan MAC/perforin proteins have been shown to be secreted from micronemes during the intraerythrocytic parasite stage and to bind to the parasitised erythrocyte membrane, where they facilitate the egress of the parasite (<xref ref-type="bibr" rid="B58">Paoletta et al., 2021</xref>).</p>
<p>Selected <italic>B. canis</italic> proteins include HSP90 and MAC/perforin. Selected <italic>P. falciparum</italic> proteins include HSP70, HSP90, HSP110, HSP20-like chaperone, and <italic>Plasmodium</italic> exported proteins containing helical interspersed subtelomeric (PHIST) or HYP domains. Proteins containing PHIST and HYP domains are exported to the infected erythrocyte membrane (<xref ref-type="bibr" rid="B56">Oberli et al., 2014</xref>; <xref ref-type="bibr" rid="B65">Schulze et al., 2015</xref>).</p>
<p>A notable protein type not selected was PfEMP1 because of the absence of SPs. PfEMP1 is known to play a role in erythrocyte modification (<xref ref-type="bibr" rid="B12">Cooke et al., 2006</xref>). Selected proteins containing PHIST domains are known, however, to bind to PfEMP1 (<xref ref-type="bibr" rid="B56">Oberli et al., 2014</xref>). Other selection exceptions are proteins from two <italic>Plasmodium</italic> protein families, repetitive interspersed family (RIFIN) and subtelomeric variable open reading frame family (STEVOR), which are thought to play roles in export and display of virulence proteins (<xref ref-type="bibr" rid="B51">Maier et al., 2009</xref>; <xref ref-type="bibr" rid="B31">Haase and de Koning-Ward, 2010</xref>). Although most RIFIN and STEVOR proteins have SPs, all but one STEVOR-like protein fail the selection criteria due to the presence of at least one TM.</p>
<p>Selected <italic>T. gondii</italic> proteins include dense granule (GRA) and rhoptry (ROP) proteins, which are categorised as excreted/secreted proteins rather than exportome members because <italic>T. gondii</italic> parasites do not live in erythrocytes. GRAs and ROPs are known to be excreted/secreted into the parasitophorous vacuole and/or host cell from their respective subcellular organelles, dense granules and rhoptries (<xref ref-type="bibr" rid="B32">Hakimi et al., 2017</xref>). Only 12 out of 37 GRA and ROP proteins meet the selection criteria.</p>
</sec>
<sec id="S2.SS3">
<title>Predictors for 3 and 8 Classes</title>
<p>Nine 3 class and seven 8 class conformational state predictors were used in this study. The output from each predictor includes a 3 and/or 8 class structural classification for every amino acid in the primary input sequence, e.g., each amino acid is classified H, E, or C for 3 classes and G, H, I, E, B, T, S, or C for 8 classes. The predictors were evaluated by comparing a consensus from all predictors with the individual predictor&#x2019;s classifications given the 392 training data protein sequences as input (i.e., 196 positives + 196 negatives). <xref ref-type="table" rid="T2">Table 2</xref> shows the percentage of classifications for each predictor that matched the consensus classification. For example, 94.0% of Porter 5 classifications for 3 class predictions matched the consensus classifications derived from all nine predictors. The true secondary structures of the training proteins are unknown and consequently the consensus accuracies are unknown. The percentages in <xref ref-type="table" rid="T2">Table 2</xref> are therefore not a true indication of a predictor&#x2019;s accuracy. However, the assumption here is that predictors making the most similar predictions are more accurate than outlier predictors. Under this assumption, Porter 5 is the most accurate and DeepCNF the least accurate predictor for both 3 and 8 class predictions.</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Percentage of secondary structure classifications matching a consensus classification.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Predictor</td>
<td valign="top" align="center">Class 3 (%)</td>
<td valign="top" align="center">Class 8 (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">DeepCNF</td>
<td valign="top" align="center">77.4</td>
<td valign="top" align="center">70.8</td>
</tr>
<tr>
<td valign="top" align="left">Spider3<sup>a</sup></td>
<td valign="top" align="center">79.7</td>
<td valign="top" align="center">71.7</td>
</tr>
<tr>
<td valign="top" align="left">Jpred 4<sup>b</sup></td>
<td valign="top" align="center">83.8</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">SSpro</td>
<td valign="top" align="center">86.3</td>
<td valign="top" align="center">77.0</td>
</tr>
<tr>
<td valign="top" align="left">NetsurfP</td>
<td valign="top" align="center">87.1</td>
<td valign="top" align="center">82.1</td>
</tr>
<tr>
<td valign="top" align="left">PSIPRED<sup>b</sup></td>
<td valign="top" align="center">90.2</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">MUFold</td>
<td valign="top" align="center">91.2</td>
<td valign="top" align="center">87.2</td>
</tr>
<tr>
<td valign="top" align="left">SPOT-1D</td>
<td valign="top" align="center">92.0</td>
<td valign="top" align="center">88.5</td>
</tr>
<tr>
<td valign="top" align="left">Porter 5</td>
<td valign="top" align="center">94.0</td>
<td valign="top" align="center">90.4</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic>Class 3, predictors that predict three class states &#x2013; helix, sheet, and coil; Class 8, predictors that predict eight class states &#x2013; three types for helix (3<sub>10</sub> helix, &#x03B1;-helix, and &#x03C0;-helix), two types for sheet (&#x03B2;-sheet and &#x03B2;-bridge), and three types for coil (&#x03B2;-turn, high curvature loop, and irregular).</italic></attrib>
<attrib><italic><sup><italic>a</italic></sup>Single-sequence-based prediction version (i.e., uses no multiple sequence alignment of homologous sequences).</italic></attrib>
<attrib><italic><sup><italic>b</italic></sup>These programs do not predict 8 Class secondary structure states.</italic></attrib>
</table-wrap-foot>
</table-wrap>
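<p>The consensus comparison summarised in <xref ref-type="table" rid="T2">Table 2</xref> can be illustrated with a minimal sketch. It assumes the consensus is a per-residue majority vote with ties resolved in favour of the first predictor listed; the source does not state the voting or tie-breaking rule. The predictor names and the toy strings are hypothetical.</p>
<preformat>
```python
from collections import Counter


def consensus(predictions):
    """Per-residue majority vote across equal-length prediction strings.

    Ties go to the class seen first among the predictors (an assumption).
    """
    columns = zip(*predictions.values())
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


def agreement(predictions):
    """Percentage of each predictor's classifications matching the consensus."""
    cons = consensus(predictions)
    return {
        name: 100.0 * sum(a == b for a, b in zip(pred, cons)) / len(cons)
        for name, pred in predictions.items()
    }


# Hypothetical 3 class predictions for a six-residue sequence.
toy = {
    "predictor_a": "CHHHEC",
    "predictor_b": "CHHHCC",
    "predictor_c": "CHHEEC",
}
toy_consensus = consensus(toy)
toy_agreement = agreement(toy)
```
</preformat>
<p>In this toy case the first predictor matches the consensus at every residue, whereas the other two each differ at one of six positions.</p>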
</sec>
<sec id="S2.SS4">
<title>Predicted Exportome Members Using Machine Learning With 3 and 8 Class Predictions</title>
<p><xref ref-type="fig" rid="F2">Figure 2</xref> shows the steps taken to classify 392 <italic>B. bovis</italic> T2Bo proteins as either an exportome (positive) or non-exportome (negative) using ML and protein SS classifications. Protein sequences from 392 proteins, which represent the training data, were input into nine 3 class and seven 8 class predictors. Consensus classifications were derived from the predicted 3 and 8 classifications from each predictor. These consensus classifications were subsequently used to train the ML algorithms, adaptive boosting (adaBoost) and Random Forest (RF). Different representations of data input to the ML algorithms were evaluated using 10-fold cross validation (Materials and methods describes each representation). The best performances were derived when using only the first 40 classifications from the N-terminal and a proportional class count for the remaining classifications. <xref ref-type="table" rid="T3">Table 3</xref> shows the ML performance measures obtained from 10-fold cross validation. For example, an ensemble of ML algorithms consisting of adaBoost and RF given 3 and 8 class consensus predictions achieved 86.99% and 86.73% accuracies, respectively, in classifying 392 <italic>B. bovis</italic> proteins as either a positive or negative. <xref ref-type="supplementary-material" rid="DS1">Supplementary Data 1</xref> shows the ML performances for all the data input representations.</p>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>Machine learning performance measures for predicting exportome membership using 3 and 8 class conformational state predictions.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="3">Class 3<hr/></td>
<td valign="top" align="center" colspan="3">Class 8<hr/></td>
</tr>
<tr>
<td valign="top" align="left">Performance measures (%)</td>
<td valign="top" align="center">adaBoost</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">Ensemble</td>
<td valign="top" align="center">adaBoost</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">Ensemble</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Accuracy</td>
<td valign="top" align="center">86.73</td>
<td valign="top" align="center">87.24</td>
<td valign="top" align="center">86.99</td>
<td valign="top" align="center">86.48</td>
<td valign="top" align="center">85.71</td>
<td valign="top" align="center">86.73</td>
</tr>
<tr>
<td valign="top" align="left">Error rate</td>
<td valign="top" align="center">13.27</td>
<td valign="top" align="center">12.76</td>
<td valign="top" align="center">13.01</td>
<td valign="top" align="center">13.52</td>
<td valign="top" align="center">14.29</td>
<td valign="top" align="center">13.27</td>
</tr>
<tr>
<td valign="top" align="left">Sensitivity</td>
<td valign="top" align="center">88.27</td>
<td valign="top" align="center">89.29</td>
<td valign="top" align="center">87.76</td>
<td valign="top" align="center">88.27</td>
<td valign="top" align="center">87.24</td>
<td valign="top" align="center">87.76</td>
</tr>
<tr>
<td valign="top" align="left">False positive rate</td>
<td valign="top" align="center">14.80</td>
<td valign="top" align="center">14.80</td>
<td valign="top" align="center">13.78</td>
<td valign="top" align="center">15.31</td>
<td valign="top" align="center">15.82</td>
<td valign="top" align="center">14.29</td>
</tr>
<tr>
<td valign="top" align="left">Specificity</td>
<td valign="top" align="center">85.20</td>
<td valign="top" align="center">85.20</td>
<td valign="top" align="center">86.22</td>
<td valign="top" align="center">84.69</td>
<td valign="top" align="center">84.18</td>
<td valign="top" align="center">85.71</td>
</tr>
<tr>
<td valign="top" align="left">Positive predictive value</td>
<td valign="top" align="center">85.64</td>
<td valign="top" align="center">85.78</td>
<td valign="top" align="center">86.43</td>
<td valign="top" align="center">85.22</td>
<td valign="top" align="center">84.65</td>
<td valign="top" align="center">86.00</td>
</tr>
<tr>
<td valign="top" align="left">Negative predictive value</td>
<td valign="top" align="center">87.89</td>
<td valign="top" align="center">88.83</td>
<td valign="top" align="center">87.56</td>
<td valign="top" align="center">87.83</td>
<td valign="top" align="center">86.84</td>
<td valign="top" align="center">87.50</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic>adaBoost, adaptive boosting; RF, Random Forest; Ensemble, final classifications derived from the average of adaBoost and RF classification probabilities; Class 3, predictions based on three class states; Class 8, predictions based on eight class states.</italic></attrib>
</table-wrap-foot>
</table-wrap>
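<p>The performance measures in <xref ref-type="table" rid="T3">Table 3</xref> all derive from the four counts of a 2 &#x00D7; 2 confusion matrix. The sketch below shows the standard definitions. The counts used (173 true positives, 23 false negatives, 167 true negatives, and 29 false positives) are not reported in the source; they are inferred here from the adaBoost Class 3 percentages and the 196 + 196 class sizes, and should be read as an illustration only.</p>
<preformat>
```python
def performance(tp, fn, tn, fp):
    """Standard binary classification measures, as percentages."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100 * (tp + tn) / total,
        "error_rate": 100 * (fp + fn) / total,
        "sensitivity": 100 * tp / (tp + fn),
        "false_positive_rate": 100 * fp / (fp + tn),
        "specificity": 100 * tn / (tn + fp),
        "positive_predictive_value": 100 * tp / (tp + fp),
        "negative_predictive_value": 100 * tn / (tn + fn),
    }


# Counts inferred from the reported percentages (an assumption, see lead-in).
measures = {k: round(v, 2) for k, v in performance(tp=173, fn=23, tn=167, fp=29).items()}
```
</preformat>
<p>With these inferred counts the rounded values reproduce the adaBoost Class 3 column (86.73, 13.27, 88.27, 14.80, 85.20, 85.64, and 87.89).</p>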
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Schematic of steps taken to determine the exportome membership of <italic>Babesia bovis</italic> T2Bo proteins using machine learning and three-class state predictions for protein secondary structure. (1) Following a rule-based selection procedure, protein sequences from 276 proteins fulfilling the selection criteria (positives) + 196 failing the selection criteria (negatives) are input to nine 3-class conformational state predictors; (2) a consensus of the nine individual predictions is determined; (3) consensus characters are converted to numeric values in preparation for machine learning input; (4) a &#x2018;1&#x2019; or a &#x2018;0&#x2019; is appended to the numerical consensus as an indication it represents positive or negative training data, respectively; (5) a training dataset is collated comprising 196 positives and 196 negatives; (6) machine learning algorithms, Random Forest and adaBoost, are trained using the training dataset; (7) the exportome membership of 80 positives (276 &#x2013; 196) is predicted using the trained machine learning models.</p></caption>
<graphic xlink:href="fgene-12-716132-g002.tif"/>
</fig>
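<p>The best-performing input representation described above (the first 40 consensus classifications from the N-terminal plus a proportional class count for the remainder) can be sketched as follows. The numeric codes assigned to H, E, and C, and the zero padding for proteins shorter than 40 residues, are assumptions made for illustration; the actual encoding is described in Materials and methods.</p>
<preformat>
```python
CLASS_CODES = {"H": 1, "E": 2, "C": 3}  # hypothetical numeric encoding


def featurise(consensus, n_first=40):
    """First n_first classes as numeric codes, then class proportions of the rest."""
    head, tail = consensus[:n_first], consensus[n_first:]
    codes = [CLASS_CODES[c] for c in head]
    codes += [0] * (n_first - len(codes))  # pad short proteins with 0 (assumption)
    proportions = [tail.count(c) / len(tail) if tail else 0.0 for c in CLASS_CODES]
    return codes + proportions


# Toy 60-residue consensus: 10 coil, 30 helix, then 5 sheet and 15 coil.
toy = "C" * 10 + "H" * 30 + "E" * 5 + "C" * 15
features = featurise(toy)
```
</preformat>
<p>Every protein then yields the same number of features (43 for the 3 class case in this sketch), regardless of its length.</p>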
</sec>
<sec id="S2.SS5">
<title>Predicted Exportome Members Using Machine Learning With Phi and Psi Angles</title>
<p>Torsion angles phi and psi determine protein conformation, which in turn determines the protein function (<xref ref-type="bibr" rid="B18">Fang et al., 2019</xref>). The premise here is that a different conformation exists between exportome and non-exportome proteins, and therefore a difference in psi and phi angles. A &#x2018;Psi vs. Phi&#x2019; angle plot is shown in <xref ref-type="supplementary-material" rid="FS1">Supplementary Figure 1</xref>. This plot is similar to a Ramachandran plot (<xref ref-type="bibr" rid="B61">Ramachandran et al., 1963</xref>).</p>
<p>Three predictors (SPOT-1D, NetsurfP, and MUFold) were used to predict psi and phi angles for the 392 <italic>B. bovis</italic> T2Bo training proteins. Outputs from the predictors are values between &#x2212;180 and +180 for each amino acid in the input sequence and represent the psi and phi angles of the protein. The mean psi and phi angles at each amino acid were determined using the values from each predictor. As a predictor comparison measure, the absolute difference between the mean angle and the predictor&#x2019;s predicted angle was determined at each amino acid and summed. MUFold had the least total deviation from the mean for psi angles, followed by SPOT-1D, then NetsurfP; and SPOT-1D had the least total deviation for phi angles, followed by NetsurfP, then MUFold.</p>
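<p>The predictor comparison measure described above can be sketched as below. The sketch assumes a plain arithmetic mean of the predicted angles at each residue, with no special handling of the wrap-around at &#x00B1;180 degrees; the source does not say whether a circular mean was used. The predictor names and angle values are hypothetical.</p>
<preformat>
```python
def total_deviation(angles):
    """Sum over residues of |predicted angle - mean angle| for each predictor.

    angles maps a predictor name to a list of angles in degrees (equal lengths).
    """
    n_residues = len(next(iter(angles.values())))
    means = [
        sum(values[i] for values in angles.values()) / len(angles)
        for i in range(n_residues)
    ]
    return {
        name: sum(abs(v - m) for v, m in zip(values, means))
        for name, values in angles.items()
    }


# Hypothetical psi predictions for a three-residue sequence.
toy_psi = {
    "predictor_a": [-62, -120, 90],
    "predictor_b": [-64, -114, 96],
    "predictor_c": [-54, -126, 84],
}
deviation = total_deviation(toy_psi)
ranking = sorted(deviation, key=deviation.get)
```
</preformat>
<p>The predictor with the smallest total is the one closest to the mean across the whole sequence.</p>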
<p>No difference was detected in the mean angles between the 196 exportome and 196 non-exportome proteins, which motivated the application of ML here. Various representations of the two sets of angles (psi and phi) as a uniform set of features for ML input were assessed with 10-fold cross validation (Materials and methods describes each representation). The best accuracy of 86.73% was obtained by using the psi and phi angles from the first 40 amino acids (AAs), giving in effect 80 features per protein. <xref ref-type="table" rid="T4">Table 4</xref> shows the ML performance measures obtained from 10-fold cross validation for this representation when classifying the 392 training proteins as either positives or negatives. <xref ref-type="supplementary-material" rid="DS1">Supplementary Data 1</xref> shows the ML performances for all the data input representations.</p>
<table-wrap position="float" id="T4">
<label>TABLE 4</label>
<caption><p>Machine learning performance measures for predicting exportome membership using backbone torsion angles &#x03D5; (phi) and &#x03C8; (psi), half-sphere exposure (HSE) upper sphere predictions, and solvent-accessible surface area (ASA).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="3">&#x03D5; (phi) and &#x03C8; (psi)<hr/></td>
<td valign="top" align="center" colspan="3">HSE_u<hr/></td>
<td valign="top" align="center" colspan="3">ASA<hr/></td>
</tr>
<tr>
<td valign="top" align="left">Performance measures (%)</td>
<td valign="top" align="center">ada</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">Ens</td>
<td valign="top" align="center">ada</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">Ens</td>
<td valign="top" align="center">ada</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">Ens</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Accuracy</td>
<td valign="top" align="center">86.73</td>
<td valign="top" align="center">84.95</td>
<td valign="top" align="center">86.73</td>
<td valign="top" align="center">89.54</td>
<td valign="top" align="center">91.84</td>
<td valign="top" align="center">90.31</td>
<td valign="top" align="center">90.31</td>
<td valign="top" align="center">88.78</td>
<td valign="top" align="center">90.31</td>
</tr>
<tr>
<td valign="top" align="left">Error rate</td>
<td valign="top" align="center">13.27</td>
<td valign="top" align="center">15.05</td>
<td valign="top" align="center">13.27</td>
<td valign="top" align="center">10.46</td>
<td valign="top" align="center">8.16</td>
<td valign="top" align="center">9.69</td>
<td valign="top" align="center">9.69</td>
<td valign="top" align="center">11.22</td>
<td valign="top" align="center">9.69</td>
</tr>
<tr>
<td valign="top" align="left">Sensitivity</td>
<td valign="top" align="center">86.73</td>
<td valign="top" align="center">80.10</td>
<td valign="top" align="center">86.22</td>
<td valign="top" align="center">92.86</td>
<td valign="top" align="center">93.37</td>
<td valign="top" align="center">93.37</td>
<td valign="top" align="center">93.37</td>
<td valign="top" align="center">90.82</td>
<td valign="top" align="center">93.37</td>
</tr>
<tr>
<td valign="top" align="left">False positive rate</td>
<td valign="top" align="center">13.27</td>
<td valign="top" align="center">10.20</td>
<td valign="top" align="center">12.76</td>
<td valign="top" align="center">13.78</td>
<td valign="top" align="center">9.69</td>
<td valign="top" align="center">12.76</td>
<td valign="top" align="center">12.76</td>
<td valign="top" align="center">13.27</td>
<td valign="top" align="center">12.76</td>
</tr>
<tr>
<td valign="top" align="left">Specificity</td>
<td valign="top" align="center">86.73</td>
<td valign="top" align="center">89.80</td>
<td valign="top" align="center">87.24</td>
<td valign="top" align="center">86.22</td>
<td valign="top" align="center">90.31</td>
<td valign="top" align="center">87.24</td>
<td valign="top" align="center">87.24</td>
<td valign="top" align="center">86.73</td>
<td valign="top" align="center">87.24</td>
</tr>
<tr>
<td valign="top" align="left">Positive predictive value</td>
<td valign="top" align="center">86.73</td>
<td valign="top" align="center">88.70</td>
<td valign="top" align="center">87.11</td>
<td valign="top" align="center">87.08</td>
<td valign="top" align="center">90.59</td>
<td valign="top" align="center">87.98</td>
<td valign="top" align="center">87.98</td>
<td valign="top" align="center">87.25</td>
<td valign="top" align="center">87.98</td>
</tr>
<tr>
<td valign="top" align="left">Negative predictive value</td>
<td valign="top" align="center">86.73</td>
<td valign="top" align="center">81.86</td>
<td valign="top" align="center">86.36</td>
<td valign="top" align="center">92.35</td>
<td valign="top" align="center">93.16</td>
<td valign="top" align="center">92.93</td>
<td valign="top" align="center">92.93</td>
<td valign="top" align="center">90.43</td>
<td valign="top" align="center">92.93</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic>&#x03D5; (phi) and &#x03C8; (psi), backbone torsion (dihedral) angles; HSE_u, half-sphere exposure (HSE) upper sphere; ASA, solvent-accessible surface area; ada, adaptive boosting; RF, Random Forest; Ens, ensemble = final classifications derived from the average of adaBoost and RF classification probabilities.</italic></attrib>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="S2.SS6">
<title>Predicted Exportome Members Using Machine Learning With the Secondary Structure Properties ASA, CN, and HSE</title>
<p>Predicted SS properties of ASA, CN, and HSE (for both upper and down spheres) were used in turn as ML input features to classify the 392 <italic>B. bovis</italic> T2Bo proteins. ASA and HSE-upper predictions from the first 40 AAs achieved the best ML performance measures as determined by 10-fold cross validation, with an equal accuracy of 90.31% (see <xref ref-type="table" rid="T4">Table 4</xref>). HSE-down features obtained the next best accuracy, followed by CN (shown in <xref ref-type="supplementary-material" rid="DS1">Supplementary Data 1</xref>).</p>
</sec>
<sec id="S2.SS7">
<title>Comparison of Secondary Structure Derived Machine Learning Methods</title>
<p>The exportome membership probabilities predicted by each of the best performing ML SS prediction methods during 10-fold cross validation and testing are shown in <xref ref-type="supplementary-material" rid="TS3">Supplementary Table 3</xref>. The five best performing methods comprise 3 and 8 classes, psi and phi angles, ASA, and HSE-upper &#x2013; referred to henceforth as the ML-SS methods. For comparative purposes, each protein was classified using a 0.5 probability threshold and the outcome was compared with its expected class. The percentage of classifications per method that differed from the expected class is shown in <xref ref-type="table" rid="T5">Table 5</xref>. For example, the least percentage of positive misclassifications observed during cross validation was 6.1% for both ASA and HSE-upper predictions, and the least percentage of negative misclassifications was 12.8% for ASA, HSE-upper, and psi and phi angles predictions. The percentage of misclassifications reduces to 4.6% and 2.0% for positives and negatives, respectively, when classifications are based on the average of the exportome membership probabilities from all five methods. Using the &#x2018;average&#x2019; effectively increases the prediction accuracy to 96.7%. A misclassification consensus was also determined, e.g., &#x2018;0&#x2019; indicated all five methods classified a protein as expected, and &#x2018;5&#x2019; indicated all five methods misclassified a protein (see <xref ref-type="supplementary-material" rid="TS3">Supplementary Table 3</xref>). One positive protein (BBOV_IV011310 &#x2013; membrane protein; putative) was misclassified by all five methods; and 71.4% and 62.5% of positives and negatives, respectively, were classified as expected by all five methods.</p>
<table-wrap position="float" id="T5">
<label>TABLE 5</label>
<caption><p>Percentage of prediction misclassifications and the accuracy per secondary structure prediction method for <italic>Babesia bovis</italic> T2Bo training and test data.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="3">Training data <sup>b</sup><hr/></td>
<td valign="top" align="center" colspan="3">Test data <sup>c</sup><hr/></td>
</tr>
<tr>
<td valign="top" align="left">Prediction method</td>
<td valign="top" align="center">Positives (%)</td>
<td valign="top" align="center">Negatives (%)</td>
<td valign="top" align="center">Accuracy (%)</td>
<td valign="top" align="center">Positives (%)</td>
<td valign="top" align="center">Negatives (%)</td>
<td valign="top" align="center">Accuracy (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">3 classes</td>
<td valign="top" align="center">12.2</td>
<td valign="top" align="center">14.1</td>
<td valign="top" align="center">87.0</td>
<td valign="top" align="center">10.0</td>
<td valign="top" align="center">11.3</td>
<td valign="top" align="center">89.4</td>
</tr>
<tr>
<td valign="top" align="left">8 classes</td>
<td valign="top" align="center">12.2</td>
<td valign="top" align="center">14.2</td>
<td valign="top" align="center">86.7</td>
<td valign="top" align="center">8.8</td>
<td valign="top" align="center">15.0</td>
<td valign="top" align="center">86.3</td>
</tr>
<tr>
<td valign="top" align="left">Phi and psi angles</td>
<td valign="top" align="center">13.8</td>
<td valign="top" align="center">12.8</td>
<td valign="top" align="center">86.7</td>
<td valign="top" align="center">13.8</td>
<td valign="top" align="center">11.3</td>
<td valign="top" align="center">87.5</td>
</tr>
<tr>
<td valign="top" align="left">Solvent-accessible surface area</td>
<td valign="top" align="center">6.1</td>
<td valign="top" align="center">12.8</td>
<td valign="top" align="center">90.3</td>
<td valign="top" align="center">3.8</td>
<td valign="top" align="center">10.0</td>
<td valign="top" align="center">93.1</td>
</tr>
<tr>
<td valign="top" align="left">Half-sphere exposure &#x2013; upper sphere</td>
<td valign="top" align="center">6.1</td>
<td valign="top" align="center">12.8</td>
<td valign="top" align="center">90.3</td>
<td valign="top" align="center">2.5</td>
<td valign="top" align="center">12.5</td>
<td valign="top" align="center">92.5</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Average <sup>a</sup></bold></td>
<td valign="top" align="center"><bold>4.6</bold></td>
<td valign="top" align="center"><bold>2.0</bold></td>
<td valign="top" align="center"><bold>96.7</bold></td>
<td valign="top" align="center"><bold>3.8</bold></td>
<td valign="top" align="center"><bold>5.0</bold></td>
<td valign="top" align="center"><bold>95.6</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic><sup><italic>a</italic></sup>Average = average exportome probabilities for all five prediction methods (3 classes, 8 classes, phi and psi angles, solvent-accessible surface area, and half-sphere exposure &#x2013; upper sphere).</italic></attrib>
<attrib><italic><sup><italic>b</italic></sup>Training data = 196 <italic>Babesia bovis T2Bo</italic> proteins that meet the Gohil rule-based selection criteria (positives), and 196 <italic>Babesia bovis T2Bo</italic> proteins that fail the Gohil rule-based selection criteria (negatives).</italic></attrib>
<attrib><italic><sup><italic>c</italic></sup>Test data = 80 <italic>Babesia bovis T2Bo</italic> proteins that meet the Gohil rule-based selection criteria but were not included in training data (positives), and 80 <italic>Babesia bovis T2Bo</italic> proteins that fail the Gohil rule-based selection criteria but were not included in training data (negatives).</italic></attrib>
<attrib><italic>Positives (%) = percentage of positive classifications per method different to that expected; Negatives (%) = percentage of negative classifications per method different to that expected.</italic></attrib>
</table-wrap-foot>
</table-wrap>
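<p>The &#x2018;average&#x2019; classification and the misclassification consensus reported in <xref ref-type="table" rid="T5">Table 5</xref> and <xref ref-type="supplementary-material" rid="TS3">Supplementary Table 3</xref> can be sketched as follows. The method keys and the example probabilities are hypothetical, and treating a probability exactly equal to 0.5 as an exportome call is an assumption.</p>
<preformat>
```python
METHODS = ("class3", "class8", "phi_psi", "asa", "hse_upper")  # hypothetical keys


def summarise(probabilities, is_positive, threshold=0.5):
    """Average the five method probabilities and count per-method misclassifications."""
    calls = {m: probabilities[m] >= threshold for m in METHODS}
    n_misclassified = sum(call != is_positive for call in calls.values())
    average = sum(probabilities[m] for m in METHODS) / len(METHODS)
    return {
        "average": average,
        "average_call": average >= threshold,
        "n_misclassified": n_misclassified,  # 0 = all as expected, 5 = all wrong
    }


# Hypothetical probabilities for one expected positive protein.
example = summarise(
    {"class3": 0.42, "class8": 0.55, "phi_psi": 0.48, "asa": 0.81, "hse_upper": 0.77},
    is_positive=True,
)
```
</preformat>
<p>In this toy case two individual methods misclassify the protein, yet the averaged probability still places it on the expected side of the threshold, which is the effect the &#x2018;average&#x2019; row is intended to capture.</p>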
<p><xref ref-type="table" rid="T5">Table 5</xref> also shows the percentage of misclassifications per prediction method when using <italic>B. bovis</italic> T2Bo test data as input to the five methods. Test data consisted of 80 positive and 80 negative proteins not used in training. All misclassification percentages were expected to slightly decrease due to using the full training data and this expectation was observed. Similarly, the accuracies for each method increased as expected. Using the &#x2018;average&#x2019; to represent all five prediction methods reduced the percentage of misclassifications and improved the overall accuracy to 95.6%. No proteins were misclassified by all five methods, but one positive (BBOV_IV000520 &#x2013; importin beta subunit, putative) and three negative proteins (BBOV_II005340 &#x2013; cytidine triphosphate synthetase, putative; BBOV_III005850 &#x2013; membrane protein, putative; and BBOV_III003780 &#x2013; conserved hypothetical protein) were misclassified by four methods. Based on the average score, the highest scoring positive was a smORF protein, whereas the lowest scoring positives have protein names not known to be associated with the erythrocyte membrane.</p>
<p>As a further evaluation to verify that the results did not occur by chance but were attributable to input values at specific locations, the input values to the ML-SS methods were randomly shuffled. The prediction accuracies based on random shuffling obtained from 10-fold cross validation were 56.1%, 57.9%, 54.9%, 73.2%, and 71.7% for 3 class, 8 class, psi and phi angles, ASA, and HSE-upper, respectively.</p>
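<p>The shuffling control can be sketched as below. The sketch assumes that the values are permuted across positions within each protein&#x2019;s feature vector while the class labels are left untouched; the source does not specify the exact shuffling scheme, and the function name and seed are hypothetical.</p>
<preformat>
```python
import random


def shuffle_positions(rows, seed=0):
    """Return copies of the feature rows with values permuted within each row."""
    rng = random.Random(seed)  # fixed seed so the control is repeatable
    shuffled = []
    for row in rows:
        permuted = list(row)  # copy, so the original features are left intact
        rng.shuffle(permuted)
        shuffled.append(permuted)
    return shuffled


# Two toy feature vectors of eight positions each.
rows = [[1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]]
control = shuffle_positions(rows, seed=1)
```
</preformat>
<p>Each shuffled row keeps the same set of values but loses the link between a value and its sequence position, so any accuracy retained after retraining cannot be attributed to position-specific signal.</p>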
<p><xref ref-type="fig" rid="F3">Figure 3</xref> shows a comparison of each feature contribution toward the ML-SS methods&#x2019; prediction accuracies. For example, SS prediction values for the first 40 AAs from the N-terminal represent 40 input features to the ML-SS methods. Features in positions 10&#x2013;15 have the most predictive importance. The predicted SP cleavage sites for the 276 expected exportome members range from 13 to 37 AAs measured from the N-terminal (average 22.5 AAs). The feature at position 21 makes the least contribution, with the greatest contributions from features in the SP regions. Interestingly, however, if ML input features are restricted to only the first 25, the prediction accuracies reduce to 84.7%, 85.7%, 85.3%, 89.0%, and 88.8% for 3 class, 8 class, psi and phi angles, ASA, and HSE-upper, respectively. This suggests that regions beyond the SP cleavage sites contain additional, albeit weaker, signals that contribute to differentiating between positives and negatives, especially positions 28&#x2013;33.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Feature importance from secondary structure prediction methods. The bar chart shows the feature importance from each of the secondary structure prediction methods: 3 Class, three local conformational states; 8 Class, eight local conformational states; PHI angles, torsional (dihedral) angle &#x03D5; (phi); PSI angles, torsional (dihedral) angle &#x03C8; (psi); ASA, solvent-accessible surface area; HSE, half-sphere exposure (upper). Position from N-terminal is equivalent to the amino acid number in a protein sequence. Each position is a machine learning input feature. There are 40 features in this example. The Random Forest algorithm has a built-in feature importance function that computes &#x2018;mean decrease accuracy&#x2019;, which is based on how much the accuracy decreases when the feature is excluded, i.e., a measure of the feature contribution toward the Random Forest model&#x2019;s prediction accuracy &#x2013; the greater the contribution, the higher the importance.</p></caption>
<graphic xlink:href="fgene-12-716132-g003.tif"/>
</fig>
</sec>
<sec id="S2.SS8">
<title>Predicting the Presence of Transmembrane Domains Using Machine Learning and Secondary Structure Characteristics</title>
<p>All five SS prediction methods revealed during the 10-fold cross validation and testing that the first 40 AAs encode the strongest signal for differentiating between positives and negatives. Most TMs are not located in the first 40 AAs. The presence or absence of a TM therefore provided little or no contributing signal to the overall prediction outcomes. The expectation still remains that a protein with one or more predicted TMs is a less worthy exportome candidate than a protein with no TMs. We therefore investigated the use of SS predictions and ML to specifically predict the presence of TMs. <xref ref-type="supplementary-material" rid="TS4">Supplementary Table 4</xref> lists all <italic>B. bovis</italic> proteins containing at least one TM as predicted by TMHMM, and a TM presence probability. This ML-derived probability was obtained by counting the number of AAs per protein that fall into a particular location on a &#x2018;Psi vs. Phi&#x2019; angle plot, although other variations of fixed sets of values for ML input were evaluated (see the section &#x201C;Materials and Methods&#x201D;). The training data was collated from predictions determined by TMHMM because of the limited experimental evidence for TMs in <italic>B. bovis</italic> proteins. Using predicted TMs may be deemed counterintuitive, but it served the investigative purpose of determining whether the presence or absence of TMs could be represented by SS predictions. <xref ref-type="supplementary-material" rid="TS5">Supplementary Table 5</xref> shows the ML performance measures. The best achieved accuracy was 76.25%. It is concluded that the &#x2018;Psi vs. Phi&#x2019; ML-derived probabilities could be used to complement the ML-SS methods by providing a TM presence indicator.</p>
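<p>Counting the AAs that fall into particular locations on a &#x2018;Psi vs. Phi&#x2019; plot gives every protein a fixed-length feature vector irrespective of its length. A minimal sketch is given below. It assumes the plot is divided into a regular grid of 60 degree bins; the actual regions used are described in Materials and methods, and the function name and angle values here are hypothetical.</p>
<preformat>
```python
def angle_plot_counts(phi, psi, bin_deg=60):
    """Count residues per (phi, psi) grid cell; angles in degrees from -180 to 180."""
    n_bins = 360 // bin_deg
    counts = [0] * (n_bins * n_bins)
    for p, s in zip(phi, psi):
        # Shift to the 0-360 range; +180 itself is clamped into the last bin.
        i = min(int((p + 180) // bin_deg), n_bins - 1)
        j = min(int((s + 180) // bin_deg), n_bins - 1)
        counts[i * n_bins + j] += 1
    return counts


# Hypothetical predicted angles for a four-residue sequence.
toy_counts = angle_plot_counts(phi=[-60, -65, -120, 180], psi=[-45, -40, 130, 180])
```
</preformat>
<p>With 60 degree bins each protein is represented by 36 counts, which could then serve as ML input.</p>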
</sec>
<sec id="S2.SS9">
<title>Predicted Exportome Members for All <italic>Babesia bovis</italic> T2Bo Proteins Using Machine Learning Methods</title>
<p>All 3706 <italic>B. bovis</italic> T2Bo proteins minus the 392 training proteins (equating to 3314) were input into the ML-SS methods. Only Spider3 (<xref ref-type="bibr" rid="B36">Heffernan et al., 2018</xref>) was used to generate the required SS prediction inputs (3 and 8 classes, psi and phi angles, ASA, and HSE-upper). Spider3 is a single-sequence-based prediction method that does not require evolutionary information from multiple sequence alignments. The 3314 input included the 80 positive and 80 negative test proteins. <xref ref-type="supplementary-material" rid="TS6">Supplementary Table 6</xref> shows the percentage of prediction misclassifications when using Spider3 only as a comparison to multiple predictor inputs (as per <xref ref-type="table" rid="T5">Table 5</xref>). The 3 and 8 class predictions and the psi and phi angle predictions underlying <xref ref-type="table" rid="T5">Table 5</xref> were obtained from a consensus of multiple predictors, whereas the ASA and HSE-upper predictions were also from Spider3. In summary, the accuracy per method for the test proteins when using Spider3 inputs was 82.5% (3 classes), 83.1% (8 classes, and psi and phi angles), 91.9% (HSE-upper), and 93.1% (ASA). None of these accuracies exceeds those shown in <xref ref-type="table" rid="T5">Table 5</xref> (the ASA accuracy is identical). Despite lower accuracies for individual methods when using only Spider3 inputs, the overall accuracy was comparable at 96.2% when using the &#x2018;average&#x2019; to represent all five prediction methods. These accuracies were computed with a 0.5 threshold. We propose more stringent selection criteria when using the presented ML-SS methods for predicting therapeutic candidates for laboratory investigation: an exportome membership probability greater than or equal to 0.7, a TM presence indicator less than 0.5 (i.e., no predicted TM), and a consensus of four or more SS prediction methods. Such selection criteria applied to the 160 test proteins would incorrectly filter out 25% of positives and incorrectly include 1.3% of negatives. However, fewer false positives in the laboratory would be expected.</p>
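<p>The stringent selection criteria proposed above amount to a three-part filter, sketched below. The function and parameter names are hypothetical, and applying the 0.7 cut-off to the averaged exportome membership probability is an assumption.</p>
<preformat>
```python
def is_candidate(exportome_prob, tm_prob, n_methods_agree,
                 min_prob=0.7, max_tm=0.5, min_methods=4):
    """Apply the three proposed criteria to one protein.

    exportome_prob: averaged exportome membership probability (assumption)
    tm_prob: ML-derived TM presence indicator
    n_methods_agree: number of ML-SS methods (out of five) predicting exportome
    """
    return (
        exportome_prob >= min_prob      # probability of at least 0.7
        and max_tm > tm_prob            # TM indicator below 0.5
        and n_methods_agree >= min_methods  # four or more methods agree
    )


# Hypothetical proteins: one passing, one rejected on the TM indicator.
passes = is_candidate(0.82, 0.10, 5)
rejected_tm = is_candidate(0.82, 0.60, 5)
```
</preformat>
<p>Raising or lowering the three defaults trades the number of proposed candidates against the expected number of false positives.</p>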
<p><xref ref-type="supplementary-material" rid="TS3">Supplementary Table 3</xref> shows the predicted exportome probabilities for each of the 3314 <italic>B. bovis</italic> proteins. Out of 3154 proteins (i.e., 3314 &#x2013; 160 testing proteins = 3154), 101 candidates were selected based on the proposed stringent criteria. <xref ref-type="supplementary-material" rid="TS7">Supplementary Table 7</xref> shows a breakdown of the 3314 selection pool in terms of predicted SPs and TMs. Out of the selected 101 candidates, 75 have no SPs and TMs. Also, 19 of the 101 (18.8%) contain more than 1 TM as predicted by TMHMM, which are considered here as possible false positives. Most of the candidate protein names are hypothetical, putative or are not known to be associated with erythrocyte membranes. However, one candidate exception is a smORF protein.</p>
<p><xref ref-type="supplementary-material" rid="TS3">Supplementary Table 3</xref> also lists, as a comparison, the predicted exportome membership probabilities derived from both the ML-SS methods and an alternative independent method. This alternative method uses ML with input comprising amino acid composition and delivery signals as described in a previous study (<xref ref-type="bibr" rid="B29">Goodswen et al., 2021</xref>). Out of 3109 <italic>B. bovis</italic> proteins, 85.37% have the same exportome or non-exportome predicted outcome based on a 0.5 threshold (the comparison excluded all proteins used in training). When using an average of both methods&#x2019; probabilities, 71 proteins have an average greater than or equal to 0.7. A smORF and a SBP2 are the only high average probability proteins with known exportome names.</p>
</sec>
<sec id="S2.SS10">
<title>Predicted Exportome Members From Test Species Using Machine Learning and Secondary Structure Properties</title>
<p><xref ref-type="supplementary-material" rid="TS8">Supplementary Table 8</xref> lists predicted exportome membership probabilities of all proteins that meet the Gohil rule-based selection criteria from the study&#x2019;s four Apicomplexa test species. These probabilities were predicted by the ML-SS methods with the <italic>B. bovis</italic> T2Bo training data (i.e., 196 positives + 196 negatives). <xref ref-type="supplementary-material" rid="TS9">Supplementary Table 9</xref> compares summarised probability counts from the four test species. Based on an average exportome membership probability greater than 0.5, 92.1% of the <italic>B. bigemina</italic> BOND, 87.9% of the <italic>B. canis</italic> BcH-CHIPZ, and 98.1% of the <italic>P. falciparum</italic> 3D7 rule-based derived exportome proteins are predicted exportome members by the ML-SS methods. The results suggest that patterns present within the SS properties that define expected exportome and non-exportome proteins in <italic>B. bovis</italic> are universal to the test species. For the outlier species <italic>T. gondii</italic>, 86.5% of the rule-based selected proteins had a probability greater than 0.5.</p>
<p><italic>Plasmodium</italic> RIFIN and STEVOR proteins are of interest because of their association in export and display of virulence proteins. No RIFINs and only one STEVOR protein fulfils the rule-based selection criteria. Interestingly, the ML-SS methods predicted 155 out of 157 RIFINs (52 with no SP) and 30 out of 33 STEVORs (11 with no SP) to be exportome members.</p>
<p><xref ref-type="supplementary-material" rid="TS8">Supplementary Table 8</xref> also lists exportome membership probabilities for every currently available protein from the four test species. These probabilities were predicted by the ML-SS methods with the <italic>B. bovis</italic> T2Bo training data and Spider3 input. <xref ref-type="supplementary-material" rid="TS10">Supplementary Table 10</xref> summarises the number of proteins per species proposed as candidates worthy of further investigation. The numbers presented are governed by the thresholds applied to both exportome membership and TM presence probabilities. For example, there is a greater number of proposed candidates but with potentially more false positives than negatives when using lower thresholds. With the stringent selection criteria previously defined, we propose 327 <italic>B. bigemina</italic>, 155 <italic>B. canis</italic>, and 372 <italic>P. falciparum</italic> candidates for further investigation (see <xref ref-type="supplementary-material" rid="TS8">Supplementary Table 8</xref>). These candidates consist of 39.0% with no SP as predicted by SignalP and 15.6% with a TM as predicted by TMHMM.</p>
</sec>
</sec>
<sec id="S3">
<title>Discussion</title>
<p>A subunit vaccine is urgently required to alleviate the significant annual global economic loss in the beef and dairy cattle industry due to babesiosis. The foremost objective of the current study was to identify, using an <italic>in silico</italic> approach, the most worthy therapeutic candidates from potentially thousands of <italic>B. bovis</italic> proteins. More specifically, the aim was to identify candidate members of the exportome, which are expected therapeutic targets against babesiosis. Candidate proteins accessible to the immune system are potential subunit vaccines. The identified candidates provide an important starting impetus for downstream laboratory investigations. To appropriately place this objective into perspective, it needs to be emphasised that <italic>B. bovis</italic> is a complex biological system with a multifaceted life cycle that infects an even more complex biological system in the form of cattle. The reality of such an infection is the interaction of a multitude of specialised molecules in a three-dimensional (3D) environment. Our objective essentially attempts to predict from a digital linear abstraction of a protein molecule, represented as a sequence of letters, whether it will induce memory helper T and B cells when incorporated in a vaccine formulation.</p>
<p>The current state-of-the-art approach to <italic>in silico</italic> vaccine discovery against eukaryotic pathogens is to use trained ML models to detect differences between predicted protein characteristics representing candidates (positives) and non-candidates (negatives). This study investigated whether differences in predicted SS characteristics between proteins representing exportome and non-exportome members could be detected with ML. Using SS characteristics for this purpose is a novel approach that is considered complementary to the current one. SS properties such as &#x03B1;-helixes, &#x03B2;-sheets, torsion angles phi and psi, ASA, CN, and HSE are related to the 3D structure of a protein. A protein&#x2019;s 3D structure determines its function and accessibility to the immune system. The premise of the presented approach is that exportome and non-exportome members have different 3D structures and these differences may be detected in their underlying SS properties.</p>
<p>Our pathogen of interest is <italic>B. bovis</italic> for which there is no subunit vaccine. Furthermore, there is currently no laboratory verified list of exportome proteins for <italic>B. bovis</italic> or even for closely related species <italic>B. bigemina</italic> and <italic>B. canis</italic>. This presented an unavoidable challenge to the study in that there are no verified data for ML training or validating the predictions. One solution considered was to use protein types shown in previous studies to have an association with the erythrocyte membrane, i.e., use proteins &#x2018;expected&#x2019; to be exportome members. However, the number of these expected proteins reported in published studies is limited and provides an insufficient number for ML training. The cyclic conundrum is that a sufficient number of verified target candidates are required to predict target candidates. It therefore should be acknowledged that our approach, as is the initial case with all <italic>in silico</italic> vaccine discovery approaches, requires iterative cycles of ML predictions, laboratory feedback and training data adjustment. Our study provides the ML predictions to help initiate this required cyclic approach.</p>
<p>Predicting &#x2018;expected&#x2019; exportome proteins using extant bioinformatic programs was deemed the best solution for obtaining the initiating training data. Expected proteins are those exported outside the parasite. The presence of SPs and the absence of TMs and GPI-anchors provide indications that a protein may be secreted beyond the parasite membrane. In this study, the programs SignalP, TMHMM, and PredGPI predicted SPs, TMs, and GPI-anchors, respectively. A rule-based approach applied to these program outputs determined the positive and negative training data classification. We acknowledge the following limitations of our approach to obtaining this training data: (1) unknown misclassifications owing to the inherent imprecise nature of all prediction programs; (2) the limitation of rule-based systems <italic>per se</italic> because they tend to fail outside test scenarios with unseen data; (3) the ambiguity of the SP rule when the presence of an SP is only an indication a protein is targeted to the secretory pathway and not necessarily beyond the parasite membrane, and SP-containing proteins are known to be of two distinct types: those secreted from organelles during erythrocyte invasion, and those exported and involved in erythrocyte modification; and (4) the approach excludes proteins without SPs that are exported by distinct non-classical secretion pathways, although non-classical pathway <italic>Babesia</italic> proteins associated with erythrocyte membranes are yet to be reported.</p>
<p>The limitations of the rule-based approach further exacerbated the challenge of validating the predictions derived from the ML-SS methods. That is, there was an unavoidable and undesirable scenario in which unverified rule-based predictions were used not only for ML training but also for validation. Conversely, these rule-based limitations in the current approach for predicting exportome membership motivated our use of ML with SS properties as an alternative approach. Our premise was that the feasibility of using SS properties could still be evaluated despite an unknown percentage of misclassified training data because some ML algorithms have the capacity to detect informative classification signals despite noisy or inconsistent data.</p>
<p>Nine SS predictors were evaluated and used in varying degrees in this study. As highlighted in <xref ref-type="table" rid="T2">Table 2</xref>, SS predictions varied for each predictor from 6% to 22.6% from a consensus prediction, given the training data sequences as input. These prediction variations are supported by the published predictor accuracies that range from 72.5% to 87% for Q3, and 60% to 77% for Q8. These inaccuracies <italic>per se</italic> add to an increasing accumulation of inaccuracy commencing from the genome sequencing to the translation of predicted genes to protein sequences. Our premise, however, is that the same level of SS prediction inaccuracies exist in both negatives and positives, irrespective of magnitude. In other words, the role of the SS predictions here is to represent structural patterns for differentiating between protein types and not for an accurate study of a protein&#x2019;s true conformation, e.g., SS predictions such as for phi and psi angles, ASA, CN, and HSE are continuous numerical values. The importance to this study is how adequately these values of a particular predictor represent structural patterns, regardless whether a value itself is more or less accurate than another predictor&#x2019;s value.</p>
<p>The training data protein sequences were input into various SS property predictors. Predictions for seven types of properties (3 and 8 classes, phi and psi angles, ASA, CN, and HSE down and upper spheres) were formatted into one file per property and evaluated separately. An appropriate format for ML input requires a uniform set of features. <italic>Babesia bovis</italic> proteins vary in length from 38 to 4820 AAs. This necessitated either fixed length inputs from each protein or property counts for the entire protein to fulfil the ML input format requirement. Variations of fixed sets of features comprising different representations of the seven property types were evaluated using 10-fold cross validation. The best performances on the independent test dataset in terms of accuracy per property type were, in ascending order: CN (83.7%), HSE down sphere (83.9%), 8 classes (86.3%), phi and psi angles (87.5%), 3 classes (89.4%), HSE upper sphere (92.5%), and ASA (92.5%). The prediction accuracy increases to 95.6% when classifications are based on the average of the exportome membership probabilities from the best five property types.</p>
<p>The ML input representation achieving the best performance for each property type was when using the first 40 property values from the N-terminal with a proportional count of values for the remaining protein length. This finding elicits two important issues. SPs are mainly located in the first 60 AAs, and most TMs are not located in the first 40 AAs. This implies that the ML algorithms are differentiating between positives and negatives based purely on the presence or absence of SPs. An implication that is not unexpected because all the positive training data comprises proteins with SPs. In fact, of the 368 <italic>B. bovis</italic> proteins predicted to contain an SP irrespective of TMs, 276 are used in the training and test data. This presents a challenge in determining whether there is an encoded signal specific to exportome proteins in addition to SPs. A proposal for future research when many more verified exportome proteins are known is to use an equal proportion of SP-containing proteins in both positives and negatives. For instance, positives would consist of known exportome proteins, where a proportion is expected to have SPs; and negatives would be non-exportome proteins but with an equal proportion containing SPs, e.g., those SP-containing proteins that invade erythrocytes. Given ML training data with a proportional number of SPs, a specific exportome signal should be unambiguously detectable, if one existed.</p>
<p>Exportome membership probabilities were predicted using the ML-SS methods for every available <italic>B. bovis</italic> protein minus those used for training and testing. From these predictions, 101 candidates were selected on stringent criteria as the most worthy for further investigation. Interestingly, 75 out of the 101 do not have a predicted SP, including a smORF protein. It is possible these 75 proteins were incorrectly predicted to have no SP by SignalP, but it is equally possible that they possess SS properties similar to those in true exportome proteins. Furthermore, 26 out of 101 candidates have a TM. Likewise, these TMs may have been incorrectly predicted by TMHMM, and the ML algorithms may have correctly detected SS patterns similar to those presented by exportome proteins.</p>
<p>A further challenge to the study was the uncertainty in the <italic>B. bovis</italic> annotation quality of protein names (see Annotation analysis in section &#x201C;Materials and Methods&#x201D;). The poor annotation had two implications. First, protein names appeared to contradict expectations following the selection of the training data based on the rule-based approach. For example, VESA named proteins were in both negative and positive datasets. Second, appraising the prediction methods based on protein names is potentially inaccurate given the poor annotation. Consequently, protein sequences took precedence over names in this study, despite sequences having their own levels of inaccuracies. The high scoring exportome proteins presented in the results must therefore come with a caveat that their names may be misleading with regard to their sequence signals encoded and true function.</p>
<p>Most of the SS predictors use PSI-BLAST to create an evolutionary profile. PSI-BLAST can take about 30 minutes (especially for SPOT-1D) to process a short protein of around 100 AAs and up to multiple hours for sequences greater than 1000 AAs (performed on an HPC computer with 64 bit kernel, 32 MB memory, and 8 cores). The average <italic>B. bovis</italic> protein length is 500 AAs. Computational time is therefore a considerable drawback for the desired high-throughput processing unless lengthy run times are not an issue. Our proposal is to use only Spider3, which does not use PSI-BLAST and can process thousands of proteins in minutes. Spider3 predictions (especially 3 and 8 classes, and psi and phi angles) were observed to be less accurate in comparison to the other eight SS predictors. Nonetheless, our premise that the same level of SS prediction inaccuracy exists in both negatives and positives appears to be upheld, i.e., an appropriate level of SS pattern differences was detectable, as supported by a binary classification accuracy of 96.2% when using the test dataset.</p>
<p>The ML-SS methods with <italic>B. bovis</italic> T2Bo training data were used to predict exportome membership probabilities for every available protein from four Apicomplexa test species. Classification accuracies between 86.5 and 92.1% on proteins fulfilling the Gohil rule-based selection criteria suggested that patterns presented within SS properties defining <italic>B. bovis</italic> expected exportome and non-exportome members are universal to the test species. Furthermore, the large percentage (39.0%) of proteins with exportome membership probabilities &#x003E;0.7 and no SP supports the possibility there is a SS pattern specific to exportome proteins in addition to a SS pattern representing SPs.</p>
<p>We conclude that representing the &#x2018;secondary&#x2019; structure of proteins as a set of features for ML algorithms, in particular RF and AdaBoost, provides the potential to classify apicomplexan exported from non-exported proteins with an accuracy of 86&#x2013;92% (based on 10-fold cross validation and an independent test dataset). It is problematic, however, to decisively claim here that the presented ML-SS methods are superior to the rule-based approach due to the rule-based origins of the training data. The lack of verified <italic>B. bovis</italic> exportome proteins for training and testing posed the study&#x2019;s main challenge. Paradoxically, the lack of known candidates and the urgency for a <italic>B. bovis</italic> vaccine provided the motivation for the study. At least three <italic>B. bovis</italic> proteins (BBOV_II007340, BBOV_II002880, and BBOV_I004210) have been experimentally verified in a previous study (<xref ref-type="bibr" rid="B24">Gohil et al., 2013</xref>) to be associated with the infected erythrocyte membrane. All three proteins were predicted by the ML-SS methods to have exportome membership probabilities greater than 0.7. This further supports, albeit with a small sample, the potential of SS representations and ML.</p>
<p>The current study focused on proteins from the exportome of <italic>B. bovis</italic> with the aim of identifying therapeutic candidates. We acknowledge, however, there are other protein types that are potential candidates, e.g., proteins involved in the invasion of erythrocytes. Proteins of this type are accessible to the immune system and several representatives of the <italic>Babesia</italic> species have been shown to induce an immune response, namely merozoite surface antigen-1 (MSA-1) (<xref ref-type="bibr" rid="B15">Elisa Rodriguez et al., 2014</xref>), thrombospondin-related anonymous protein (TRAP) (<xref ref-type="bibr" rid="B23">Gaffar et al., 2004</xref>; <xref ref-type="bibr" rid="B26">Gonzalez et al., 2019</xref>), rhoptry associated protein 1 (RAP-1) (<xref ref-type="bibr" rid="B55">Norimine et al., 2003</xref>; <xref ref-type="bibr" rid="B26">Gonzalez et al., 2019</xref>), and Erythrocyte binding protein (<xref ref-type="bibr" rid="B1">Abd El-Salam El-Sayed et al., 2017</xref>).</p>
<p>We make available via GitHub five independent pipelines that use five different predicted SS properties to predict exportome membership. The five properties are: 3- and 8-state SS predictions, phi and psi angles, and HSE-upper. Furthermore, exportome membership probabilities are provided for every available <italic>B. bovis</italic> T2Bo, <italic>B. bigemina</italic> BOND, <italic>B. canis</italic> BcH-CHIPZ, and <italic>P. falciparum</italic> 3D7 protein. The expectation is that a desired percentage of high probability candidates can be selected to suit laboratory capability and budget. These candidates help initiate the required iterative cycles of laboratory testing, training data adjustment (i.e., adding or removing verified proteins), and further ML-SS predictions.</p>
</sec>
<sec id="S4" sec-type="materials|methods">
<title>Materials and Methods</title>
<sec id="S4.SS1">
<title>Data Source</title>
<p>All 3706 and 5077 currently available protein sequences for <italic>B. bovis</italic> T2Bo and <italic>B. bigemina</italic> BOND, respectively, were downloaded in a FASTA format from PiroPlasmaDB (release 47), which is a database member of Eukaryotic Pathogen Databases (EuPathDB) (<xref ref-type="bibr" rid="B7">Aurrecoechea et al., 2010</xref>). Sequences for 3467 <italic>B. canis</italic> BcH-CHIPZ proteins were extracted from a Supplementary Excel spreadsheet created from the <italic>B. canis</italic> genome sequencing study (<xref ref-type="bibr" rid="B14">Eichenberger et al., 2017</xref>). Sequences for all 5460 <italic>P. falciparum</italic> (strain 3D7) and 8322 <italic>T. gondii</italic> (strain ME49) proteins were downloaded in a FASTA format from PlasmoDB (release 47) and ToxoDB (release 47), respectively, which are also database members of EuPathDB. Sequences from all five species were used in a FASTA format as primary input for each of the presented ML-SS methods.</p>
</sec>
<sec id="S4.SS2">
<title>Rule-Based Method</title>
<p>SignalP 5.0 (<xref ref-type="bibr" rid="B6">Armenteros et al., 2019b</xref>) predicted the presence of SPs, TMHMM 2.0 (<xref ref-type="bibr" rid="B47">Krogh et al., 2001</xref>) predicted TMs, and PredGPI 1.0 (<xref ref-type="bibr" rid="B59">Pierleoni et al., 2008</xref>) predicted GPI-anchors. A protein is classified as an exportome member if SignalP score &#x2265; 0.5, TMHMM predicted number of TMs = 0, and PredGPI predicted GPI FPrate (False Positive rate) &#x2265; 0.005; otherwise it is classified as a non-exportome member. Note that GPI FPrate &#x003C; 0.001 is highly probable, &#x003C;0.005 is probable, &#x003C;0.01 is weakly probable, and &#x2265;0.01 is not GPI-anchored.</p>
<p>Two additional programs were used to compare the prediction outputs from SignalP and TMHMM: TargetP (<xref ref-type="bibr" rid="B16">Emanuelsson et al., 2007</xref>) (predicts SPs) and Phobius (<xref ref-type="bibr" rid="B43">Kall et al., 2004</xref>) (predicts SPs and TMs). A warning is given in the TMHMM 2.0 User&#x2019;s guide that a predicted TM helix in the first 60 AAs of the N-terminal could be a SP. Note that 21 <italic>B. bovis</italic> proteins with a SignalP score &#x2265; 0.5 and TMHMM predicted number of TMs = 1 were classified as exportome members, but only when the TM was predicted in the first 60 AAs. Phobius predictions supported that 17 of the 21 contained a SP but no TM.</p>
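<p>The rule-based classification can be expressed compactly in code. The following Python function is a minimal sketch rather than the study&#x2019;s implementation: the function name, its arguments, and the Boolean flag <italic>tm_within_first_60</italic> are hypothetical, and it assumes that the GPI criterion also applies to proteins with a single TM predicted in the first 60 AAs.</p>
<preformat>
```python
def is_exportome(signalp_score, n_tm, gpi_fp_rate, tm_within_first_60=False):
    """Return True if the rule-based criteria classify a protein as exportome.

    All argument names are illustrative; the values would come from parsed
    SignalP, TMHMM, and PredGPI outputs.
    """
    has_sp = signalp_score >= 0.5
    # A GPI FPrate of 0.005 or more means a GPI-anchor is not "probable".
    not_gpi_anchored = gpi_fp_rate >= 0.005
    # A single TM helix in the first 60 AAs is treated as a likely SP
    # (assumption: the GPI criterion still applies in this case).
    no_true_tm = n_tm == 0 or (n_tm == 1 and tm_within_first_60)
    return has_sp and not_gpi_anchored and no_true_tm


if __name__ == "__main__":
    print(is_exportome(0.9, 0, 0.5))
    print(is_exportome(0.9, 0, 0.001))
```
</preformat>
<p>Under these assumptions, a protein with a SignalP score of 0.9, no predicted TM, and a GPI FPrate of 0.5 is classified as an exportome member, whereas the same protein with a GPI FPrate of 0.001 is not.</p>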
</sec>
<sec id="S4.SS3">
<title>Training Input Sequences for Machine Learning</title>
<p>The 196 proteins representing the training positives (i.e., exportome members) were selected from the 3706 <italic>B. bovis</italic> T2Bo proteins following the rule-based bioinformatics approach. The 196 proteins representing the training negatives were randomly chosen using the Python random module (implements a Mersenne Twister (<xref ref-type="bibr" rid="B53">Matsumoto and Nishimura, 1998</xref>) as the core generator) from 3430 proteins predicted by the rule-based bioinformatics approach to be non-exportome members. The ML input training file for each method therefore consisted of 392 sequences representing the positives and negatives datasets. <xref ref-type="supplementary-material" rid="TS1">Supplementary Table 1</xref> lists the rule-based selected proteins along with the SP, TMHMM, and GPI predicted characteristics on sheets &#x2018;positives&#x2019; and &#x2018;negatives&#x2019;.</p>
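<p>The random selection of a balanced negative set can be sketched as follows. This is an illustration only: the function name, the placeholder protein identifiers, and the optional seed are assumptions (no seed is reported for the study), while the Python random module is the one named above.</p>
<preformat>
```python
import random


def sample_negatives(candidate_ids, n_positives, seed=None):
    """Randomly choose as many negatives as there are positives.

    random.Random uses the Mersenne Twister as its core generator.
    The seed argument is an assumption added here for reproducibility.
    """
    rng = random.Random(seed)
    # sample() draws without replacement, so no protein is chosen twice.
    return rng.sample(list(candidate_ids), n_positives)


if __name__ == "__main__":
    # Placeholder identifiers standing in for the 3430 rule-based negatives.
    candidates = [f"negative_{i}" for i in range(3430)]
    negatives = sample_negatives(candidates, 196, seed=1)
    print(len(negatives))
```
</preformat>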
<p>The program CD-HIT (cluster database at high identity with tolerance) (<xref ref-type="bibr" rid="B49">Li and Godzik, 2006</xref>) was used to determine whether any of the training and test sequences had 100% similarity (i.e., a check for redundant sequences). No identical sequences were detected, but four clusters of positive proteins had similarities &#x003E;90% (see <xref ref-type="supplementary-material" rid="DS1">Supplementary Data 2</xref>). These proteins were assumed to be isoforms rather than the same proteins incorrectly assigned with unique IDs.</p>
</sec>
<sec id="S4.SS4">
<title>Programs for Predicting Protein Secondary Structure Properties</title>
<p>Nine programs were selected that met the following requirements: standalone or at least had high-throughput processing capability, worked in a Linux environment, and generated an appropriate output from which SS properties could be extracted. The nine programs were Porter 5 (<xref ref-type="bibr" rid="B70">Torrisi et al., 2019</xref>), PSIPRED 4.0 (<xref ref-type="bibr" rid="B41">Jones, 1999</xref>; <xref ref-type="bibr" rid="B11">Buchan and Jones, 2019</xref>), NetsurfP 2.0 (<xref ref-type="bibr" rid="B46">Klausen et al., 2019</xref>), SSpro (<xref ref-type="bibr" rid="B50">Magnan and Baldi, 2014</xref>), SPOT-1D (<xref ref-type="bibr" rid="B34">Hanson et al., 2019</xref>), Spider3 (<xref ref-type="bibr" rid="B36">Heffernan et al., 2018</xref>), DeepCNF (<xref ref-type="bibr" rid="B72">Wang et al., 2016</xref>), MUFold (<xref ref-type="bibr" rid="B18">Fang et al., 2019</xref>, <xref ref-type="bibr" rid="B17">2020</xref>), and Jpred 4 (<xref ref-type="bibr" rid="B13">Drozdetskiy et al., 2015</xref>). Predictors Jpred 4 and SPOT-1D have an 800 and MUFold a 700 AA length limit for input. <xref ref-type="table" rid="T6">Table 6</xref> describes the algorithms used and the type of SS characteristics predicted by the nine programs. To enable high-throughput processing nine separate pipelines were created. Some predictors use different versions of the same Python modules and it was therefore not possible to create one generic pipeline suitable for all predictors. Furthermore, the outputs from these programs were different from each other and required the creation of nine Python scripts to extract, transform, and present the relevant SS in a consistent format, e.g., a series of H, E, or C for 3 class states and G, H, I, E, B, T, S, or C for 8 class states for each processed protein. The predictions varied considerably and therefore a consensus of the predicted classifications was created based on a majority rule approach applied at each amino acid. 
In the instance of a draw (e.g., due to a missing classification creating an equal number of classifications at a particular amino acid), predicted classifications were consecutively dropped from each predictor until a majority classification was achieved. The classifications from each predictor were dropped in the following order: DeepCNF, Spider3, Jpred 4, SSpro, NetsurfP, PSIPRED, SPOT-1D, and Porter 5. This order, from least to most accurate, was determined from evaluating the predictors (see <xref ref-type="table" rid="T2">Table 2</xref>).</p>
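<p>The majority rule with tie-breaking described above can be sketched in Python as follows. This is a hedged illustration, not the study&#x2019;s script: the function names and the use of &#x2018;-&#x2019; to mark a missing classification are assumptions, and predictors absent from the stated drop order (e.g., MUFold) are simply never dropped in this sketch.</p>
<preformat>
```python
from collections import Counter

# Drop order as stated in the text, from least to most accurate.
DROP_ORDER = ["DeepCNF", "Spider3", "Jpred 4", "SSpro",
              "NetsurfP", "PSIPRED", "SPOT-1D", "Porter 5"]


def consensus_at_position(votes):
    """votes maps predictor name to a class letter, or None when missing."""
    active = {name: cls for name, cls in votes.items() if cls is not None}
    to_drop = [name for name in DROP_ORDER if name in active]
    while active:
        ranked = Counter(active.values()).most_common()
        # A unique top count means a majority classification exists.
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0]
        if not to_drop:
            return None  # the draw cannot be broken
        # Drop the least accurate remaining predictor and recount.
        del active[to_drop.pop(0)]
    return None


def consensus(predictions):
    """predictions maps predictor name to an equal-length class string.

    Assumption: '-' marks a missing classification at a position.
    """
    length = len(next(iter(predictions.values())))
    result = []
    for i in range(length):
        votes = {name: (seq[i] if seq[i] != "-" else None)
                 for name, seq in predictions.items()}
        result.append(consensus_at_position(votes) or "-")
    return "".join(result)


if __name__ == "__main__":
    example = {"Porter 5": "HHEC", "SPOT-1D": "HHEC",
               "PSIPRED": "HECC", "Spider3": "CEC-"}
    print(consensus(example))
```
</preformat>
<p>In the example, the second and third residues are drawn two against two; dropping the Spider3 classification, the least accurate predictor present, resolves both draws.</p>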
<table-wrap position="float" id="T6">
<label>TABLE 6</label>
<caption><p>Publicly available software for predicting protein secondary structure characteristics.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Program</td>
<td valign="top" align="left">Main algorithms</td>
<td valign="top" align="left">Evolutionary profile</td>
<td valign="top" align="left">Predicted SS characteristics</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">NetSurfP</td>
<td valign="top" align="left">Large Long Short-Term Memory (LSTM) network in a Bidirectional Recurrent Neural Network (BRNN) &#x2013; trained on solved protein structures</td>
<td valign="top" align="left">PSSM created by PSIBLAST against UniRef90, and HMM profile created by HHblits given UniRef90</td>
<td valign="top" align="left">3 and 8 classes, ASA, CN, HSE, phi and psi</td>
</tr>
<tr>
<td valign="top" align="left">SPOT-1D</td>
<td valign="top" align="left">An ensemble of residual convolutional networks (ResNets) and Long-Short-Term Memory Cells in Bidirectional Recurrent Neural Networks (LSTM-BRNNs) with predicted contact maps input from SPOT-contact</td>
<td valign="top" align="left">PSSM created by PSIBLAST against UniRef90, and HMM profile created by HHblits given UniRef90</td>
<td valign="top" align="left">3 and 8 classes, ASA, CN, HSE, phi and psi</td>
</tr>
<tr>
<td valign="top" align="left">PSIPRED</td>
<td valign="top" align="left">Deep neural network architecture with two hidden layers, and with rectifier activations</td>
<td valign="top" align="left">PSSM created by PSIBLAST against UniRef90</td>
<td valign="top" align="left">3 classes</td>
</tr>
<tr>
<td valign="top" align="left">MUFOLD-SS and MUFOLDAngle</td>
<td valign="top" align="left">Variants of inception networks</td>
<td valign="top" align="left">PSSM created by PSIBLAST against UniRef90, and HMM profile created by HHblits given UniRef90</td>
<td valign="top" align="left">3 and 8 classes, phi and psi</td>
</tr>
<tr>
<td valign="top" align="left">Porter5</td>
<td valign="top" align="left">Ensembles of cascaded bidirectional recurrent neural networks and Convolutional Neural Networks (CNN)</td>
<td valign="top" align="left">PSSM created by PSIBLAST against UniRef90, and HMM profile created by HHblits given UniRef90</td>
<td valign="top" align="left">3 and 8 state</td>
</tr>
<tr>
<td valign="top" align="left">DeepCNF</td>
<td valign="top" align="left">Combines Conditional Neural Fields (CNF) and Deep Convolutional Neural Networks (DCNN)</td>
<td valign="top" align="left">PSSM created by PSIBLAST against UniRef90, and HMM profile created by HHblits given UniRef90</td>
<td valign="top" align="left">3 and 8 classes</td>
</tr>
<tr>
<td valign="top" align="left">SSPRO</td>
<td valign="top" align="left">An ensemble of 100 Bidirectional Recursive Neural Networks (BRNNs)</td>
<td valign="top" align="left">PSSM created by PSIBLAST against UniRef50</td>
<td valign="top" align="left">3 and 8 classes</td>
</tr>
<tr>
<td valign="top" align="left">SPIDER3 single</td>
<td valign="top" align="left">Long-Short-Term Memory Cells in Bidirectional Recurrent Neural Networks (LSTM-BRNNs)</td>
<td valign="top" align="left">Single-sequence-based prediction, i.e., no evolutionary profile used</td>
<td valign="top" align="left">3 and 8 classes, ASA, CN, HSE, phi and psi</td>
</tr>
<tr>
<td valign="top" align="left">Jpred4</td>
<td valign="top" align="left">Online tool using JNet algorithm</td>
<td valign="top" align="left">PSSM created by PSIBLAST against UniRef90</td>
<td valign="top" align="left">3 classes, ASA</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic>SS, protein secondary structure; 3 classes, 3 classes of secondary structure conformation; 8 classes, 8 classes of secondary structure conformation; phi, backbone torsion angle &#x03D5;; psi, backbone torsion angle &#x03C8;; ASA, solvent-accessible surface area; CN, contact number; HSE, half-sphere exposure; PSSM, position specific substitution matrix; PSI-BLAST, Position-Specific Iterative Basic Local Alignment Search Tool; HMM, hidden Markov models; HHblits, a HMM-HMM-based iterative sequence search tool; UniRef50 and UniRef90, UniProt Reference Clusters from the UniProt Knowledgebase (UniProtKB) comprised of clustered sets of sequences with 50% or 90% sequence identity levels, respectively.</italic></attrib>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="S4.SS5">
<title>Machine Learning Algorithms</title>
<p>Six supervised ML algorithms were evaluated in this study for predicting exportome membership: adaptive boosting (AdaBoost), random forest (RF), <italic>k</italic>-nearest neighbour classifier, naive Bayes classifier, neural network, and support vector machines. AdaBoost and RF were the only two algorithms selected for final exportome predictions based on their superior 10-fold cross validation performances. AdaBoost (<xref ref-type="bibr" rid="B21">Freund and Schapire, 1997</xref>) and RF (<xref ref-type="bibr" rid="B10">Breiman, 2001</xref>) were implemented via the R functions <italic>ada</italic> (<xref ref-type="bibr" rid="B22">Friedman et al., 2000</xref>) and <italic>randomForest</italic>, respectively. Both R functions used at least two arguments: a data frame of numeric variables (i.e., a training dataset) and a numerical class vector, i.e., a vector representing the target label, which had two classes: 1 (positive) and 0 (negative). Each algorithm generates a probability that the binary classification is correct. Furthermore, both algorithms were used as an ensemble of classifiers, i.e., for each protein, classification probabilities from each algorithm in the ensemble were averaged to determine the final classification probability. Default ML parameters were used throughout except for the RF parameters &#x2018;ntree&#x2019; and &#x2018;mtry&#x2019; (changed to 300 and 3, respectively) and AdaBoost parameter &#x2018;iter&#x2019; changed to 300.</p>
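<p>The ensemble step, in which the classification probabilities from each algorithm are averaged per protein, can be illustrated with a short Python sketch. The function names are hypothetical, and treating an averaged probability greater than or equal to the threshold as positive is an assumption; the study&#x2019;s models were implemented in R.</p>
<preformat>
```python
def ensemble_probability(probabilities):
    """Average the positive-class probabilities from each algorithm."""
    return sum(probabilities) / len(probabilities)


def classify(probabilities, threshold=0.5):
    """Return 1 (positive) or 0 (negative) from the averaged probability.

    Assumption: a probability at or above the threshold is positive.
    """
    return 1 if ensemble_probability(probabilities) >= threshold else 0


if __name__ == "__main__":
    # Hypothetical AdaBoost and RF probabilities for one protein.
    print(ensemble_probability([0.9, 0.7]), classify([0.9, 0.7]))
```
</preformat>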
</sec>
<sec id="S4.SS6">
<title>Creating Machine Learning Training Data With 3 and 8 Class State Predictions</title>
<p>The training data protein sequences (196 negatives + 196 positives) were used as input to nine 3 class and seven 8 class predictors. Predicted classifications varied considerably between predictors as illustrated in <xref ref-type="table" rid="T2">Table 2</xref>. Therefore, a consensus of classifications was derived as previously described for each input protein (see <xref ref-type="fig" rid="F2">Figure 2</xref>). The consensus classifications comprising letter characters were also converted to numerical values: C &#x2192; 0, E &#x2192; 1, H &#x2192; 2 for 3 classes, and C &#x2192; 0, S &#x2192; 1, T &#x2192; 2, B &#x2192; 3, E &#x2192; 4, I &#x2192; 5, G &#x2192; 6, H &#x2192; 7 for 8 classes (although each class was assigned a consistent value, the actual chosen value is arbitrary). Each consensus varies in length because proteins vary in length, whereas ML algorithms require a uniform set of features as input. The consensus classifications were therefore limited to various fixed-length sections, such that the ML models were evaluated with different fixed-length inputs. For instance, only a set number of classifications (features) from the consensus start and end, plus a set number of mid-section features, were used as ML input (see <xref ref-type="fig" rid="F4">Figure 4</xref>). The mid-section in this instance comprises either 3 or 8 features, each representing the number of classifications for a particular structural class divided by the total number of mid-section classifications, e.g., if a mid-section contains 500 3 class classifications, of which 250 are C, 50 are E, and 200 are H, the three mid-section feature values are 0.5, 0.1, and 0.4.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>An illustration of a machine learning input representation for 3 classification state predictions of protein conformation. The 3 classes are helix, sheet, and coil, designated H, E, and C, respectively. &#x2018;<italic>x</italic>&#x2019; classifications at the mid-section denote that a varying number of classifications exist between a nominated start and end number of classifications (in this instance, 30) because the source proteins vary in length. S1, S2, etc. are the consecutively numbered feature names at the start, and E1, E2, etc. are the consecutively numbered feature names at the end. M0, M1, and M2 are the mid-section feature names that contain the total number of classifications for each structural class divided by the number of mid-section classifications (e.g., for 8 classification state predictions the mid-section feature names are M0, M1, M2, M3, M4, M5, M6, and M7). The target is what the machine learning (ML) algorithm attempts to predict, i.e., exportome (positive) or non-exportome (negative), and features are what the ML algorithm uses to help make the prediction. In this illustration, the total number of features used to represent varying length proteins is 63 per protein (30 start + 3 mid-section + 30 end).</p></caption>
<graphic xlink:href="fgene-12-716132-g004.tif"/>
</fig>
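<p>The fixed-length representation illustrated in Figure 4 can be sketched in Python as follows. This is a minimal illustration for the 3-class case only, not the authors&#x2019; pipeline code: the function name, the feature order (start, mid-section, end), and the decision to reject proteins that are too short for the chosen start and end lengths are assumptions.</p>
<preformat>
```python
CLASS_TO_VALUE = {"C": 0, "E": 1, "H": 2}  # 3-class mapping given in the text


def encode_consensus(consensus, n_ends=30, classes="CEH"):
    """Encode a consensus string as start, mid-section, and end features."""
    if 2 * n_ends >= len(consensus):
        # How short proteins were handled is not stated; rejecting is assumed.
        raise ValueError("protein too short for the chosen start/end length")
    start = [CLASS_TO_VALUE[c] for c in consensus[:n_ends]]
    end = [CLASS_TO_VALUE[c] for c in consensus[-n_ends:]]
    middle = consensus[n_ends:-n_ends]
    # Fraction of each structural class within the mid-section (M0, M1, M2).
    mid = [middle.count(c) / len(middle) for c in classes]
    return start + mid + end


# Worked example from the text: a 500-residue mid-section with
# 250 C, 50 E, and 200 H, flanked by 30 start and 30 end classifications.
consensus = "C" * 30 + "C" * 250 + "E" * 50 + "H" * 200 + "H" * 30
features = encode_consensus(consensus)
```
</preformat>
<p>With 30 start and 30 end classifications, the sketch yields 63 features per protein, and the three mid-section values for the worked example are 0.5, 0.1, and 0.4.</p>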
<p>Different variations of data input to the ML algorithms were evaluated using 10-fold cross validation. The variations evaluated were: start, middle, and end; start and middle; middle and end; start only; middle only; end only; start and remaining; start classes, middle classes, and end classes; start classes and middle classes; middle classes and end classes; start classes only; middle classes only; and end classes only. Here, &#x2018;start&#x2019; and &#x2018;end&#x2019; are 30, 40, 50, 60, 75, or 100 classifications (equivalent to the number of AAs) measured from the N- or C-terminal, respectively; &#x2018;middle&#x2019; is the remaining classifications between the two ends (the 3 or 8 classification structures are counted for the middle and divided by the number of mid-section classifications); &#x2018;remaining&#x2019; is the remaining classifications after the start section and is calculated in the same way as &#x2018;middle&#x2019;; and &#x2018;classes&#x2019; are the 3 or 8 classification structure counts divided by the length of the relevant section. Normalisation or standardisation was also applied to all count features, and the impact on ML performance was compared. The formulae used were Normalised <italic>x</italic> = (<italic>x</italic> &#x2013; minimum <italic>x</italic>) / (maximum <italic>x</italic> &#x2013; minimum <italic>x</italic>) and Standardised <italic>x</italic> = (<italic>x</italic> &#x2013; &#x03BC;) / &#x03C3;, where &#x2018;<italic>x</italic>&#x2019; is the count value, minimum and maximum are the minimum and maximum count values across the entire dataset (e.g., 3706 <italic>B. bovis</italic> proteins), &#x2018;&#x03BC;&#x2019; is the sample mean, and &#x2018;&#x03C3;&#x2019; is the standard deviation. As an additional test, the predicted classifications were randomised and all the above variations were re-evaluated, i.e., all input values were randomly shuffled using the shuffle method of the Python random module. This test helps check whether the classification at a particular amino acid position makes a difference to the ML performance.</p>
</sec>
<sec id="S4.SS7">
<title>Creating Machine Learning Training Data With Psi and Phi Angles</title>
<p>Predictors SPOT-1D, NetsurfP, and MUFold predict two sets of angles (psi and phi) between &#x2212;180 and +180 for each amino acid as part of their output. The mean of the predicted angles at each amino acid was determined. A challenge was how best to represent two sets of angles from proteins of varying length as a uniform set of values for ML input. The following input representations were evaluated with different fixed lengths (e.g., 40, 50, 60, and 75 AAs) from the N-terminal of the training sequences: (1) counting the number of AAs that fall into a particular angle range, e.g., a 30 step range &#x2013; Range 1: &#x2264;&#x2212;150; Range 2: &#x003E;&#x2212;150 and &#x2264;&#x2212;120; Range 3: &#x003E;&#x2212;120 and &#x2264;&#x2212;90; Range 4: &#x003E;&#x2212;90 and &#x2264;&#x2212;60, etc. creating six ranges per angle type (psi or phi), and therefore 12 features in total; (2) counting the number of AAs that fall into a particular region on a &#x2018;Psi vs. Phi&#x2019; angle plot (see <xref ref-type="supplementary-material" rid="FS1">Supplementary Figure 1</xref>). The plot is divided into four quadrants and each quadrant is then subdivided into regions. Each region represents a feature for ML input, whereby the feature value is the number of AAs within the region, e.g., if one square region in the plot has a dimension of 30, Region 1 in quadrant one is defined by Phi angles &#x003C; 0 and &#x2265;&#x2212;30 <italic>and</italic> Psi angles &#x003E; 0 and &#x2264;30. There are 36 regions per quadrant and therefore 144 features in total for the entire plot; (3) using angles directly as the features, e.g., if the fixed length is 60 AAs then there are 60 features for phi and 60 for psi, and therefore 120 features in total; and (4) using angles directly as the features but combining the psi and phi angles into one feature by either multiplying or adding the two angles, e.g., if the fixed length is 60 AAs then there are 60 features in total, each combining the phi and psi angles of one amino acid. Normalisation, standardisation, or no feature scaling was applied to all counts when evaluating representations. Note that for representations #3 and #4, the assumption is that the psi and phi angles recorded for amino acid #1 on a positive protein can be compared to the psi and phi angles recorded for amino acid #1 on a negative protein, and so on for each consecutive amino acid.</p>
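<p>Representations (1) and (2) amount to counting residues per angle range and per square region of the &#x2018;Psi vs. Phi&#x2019; plot. The Python sketch below is one possible way to do this and is not the authors&#x2019; code: the function names, the treatment of each upper bound as inclusive, and the row-by-row ordering of regions are assumptions. In this sketch, a step of 30 across the full span from &#x2212;180 to +180 gives 12 ranges per angle type and 144 square regions; the step is left as a parameter so that other binnings can be produced.</p>
<preformat>
```python
import math


def bin_index(angle, step=30):
    """Index of the range holding `angle`; each upper bound is inclusive."""
    # Range 1 (index 0) holds angles up to and including -180 + step.
    return max(math.ceil((angle + 180) / step) - 1, 0)


def range_counts(angles, step=30):
    """Representation 1: number of residues per angle range."""
    counts = [0] * (360 // step)
    for angle in angles:
        counts[bin_index(angle, step)] += 1
    return counts


def region_counts(phi, psi, step=30):
    """Representation 2: number of residues per square region of the plot."""
    n = 360 // step
    counts = [0] * (n * n)
    for p, s in zip(phi, psi):
        # Regions are numbered row by row (psi) and then by column (phi).
        counts[bin_index(s, step) * n + bin_index(p, step)] += 1
    return counts


# Hypothetical mean angles for three residues.
phi = [-20.0, -10.0, 100.0]
psi = [10.0, 25.0, -100.0]
phi_features = range_counts(phi)
regions = region_counts(phi, psi)
```
</preformat>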
</sec>
<sec id="S4.SS8">
<title>Creating Machine Learning Training Data With Structural Properties ASA, CN, and HSE</title>
<p>Predictors SPOT-1D and Spider3 were used independently to predict, for each amino acid in the input sequence, the structural properties ASA, CN, and HSE for the upper and down half-spheres. SPOT-1D and Spider3 save these properties, along with other data, in one output file per protein with the extension &#x2018;spot1d&#x2019; and &#x2018;i1&#x2019;, respectively. Each property (ASA, CN, HSE-upper, and HSE-down) was used in turn as features for ML input with different fixed lengths (e.g., 40, 50, 60, and 75 AAs) from the N-terminal of the training sequences.</p>
</sec>
<sec id="S4.SS9">
<title><italic>Babesia bovis</italic> Annotation Analysis</title>
<p>UniProtKB (<xref ref-type="bibr" rid="B8">Bateman et al., 2015</xref>) provides a heuristic measure of annotation quality, although the curators state that they cannot define the &#x2018;correct annotation&#x2019; for any given protein<sup><xref ref-type="fn" rid="footnote1">1</xref></sup>. UniProtKB assigns every protein an annotation score from one to five, where five denotes the best-annotated entries (annotations with experimental evidence score higher than equivalent predicted/inferred annotations). With the understanding that UniProtKB annotation scores are only a guideline to annotation quality, we checked the scores for all <italic>B. bovis</italic> proteins: 89.1% scored 1, 10% scored 2, 0.87% scored 3, and 0.03% scored 4.</p>
</sec>
<sec id="S4.SS10">
<title>Predicting the Presence of Transmembrane Domains Using Machine Learning and Secondary Structure Characteristics</title>
<p>We investigated the use of SS predictions, namely 3 and 8 classes, psi and phi angles, ASA, and HSE-upper, to predict the presence of TM domains. The start and end of single or multiple TMs can occur anywhere within a protein sequence. For example, TMHMM predicts that 677 <italic>B. bovis</italic> T2Bo proteins contain between one and 22 TMs per protein, starting anywhere from 2 AAs to 3405 AAs from the N-terminal. Furthermore, the lengths of all 3706 <italic>B. bovis</italic> proteins vary between 38 and 4820 AAs. These varying factors presented a challenge because ML requires a fixed set of values per protein for input. <xref ref-type="supplementary-material" rid="TS11">Supplementary Table 11</xref> provides a breakdown of the number of <italic>B. bovis</italic> T2Bo proteins containing SPs and/or TM domains. For example, 575 of the 677 TM-containing proteins have no predicted SP. An ML training dataset comprising 539 positives and 539 negatives was created using predictions determined by TMHMM. Positives were proteins containing at least one TM beyond 60 AAs from the N-terminal. The first 60 AAs were ignored to avoid the influence of possible SPs. Negatives were proteins with no predicted TMs (see <xref ref-type="supplementary-material" rid="TS4">Supplementary Table 4</xref>).</p>
<p>Spider3 provided the SS predictions. Different variations of fixed sets of values for ML input were evaluated using 10-fold cross validation. For each protein, the variations evaluated included: counting the 3 or 8 classification structures; counting the number of AAs that fall into a particular psi or phi angle range; and counting the number of AAs that fall into a particular location on a &#x2018;Psi vs. Phi&#x2019; angle plot (see <xref ref-type="supplementary-material" rid="FS1">Supplementary Figure 1</xref>). All counts per protein were divided by the total number of characteristics counted (i.e., protein length &#x2013; 60 AAs). Normalisation, standardisation, or no feature scaling was applied to all counts during evaluation.</p>
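<p>The labelling rule and the length-normalised counts described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline_TM implementation: the function names are hypothetical, a TM is taken to be &#x2018;beyond 60 AAs&#x2019; when its predicted start position is greater than 60, proteins whose only TMs lie within the first 60 AAs are left unlabelled, and only the 3-class counts are shown.</p>
<preformat>
```python
def label_protein(tm_starts, offset=60):
    """Return 1 (positive), 0 (negative), or None (not used for training)."""
    if not tm_starts:
        return 0  # no predicted TMs: negative
    if any(start > offset for start in tm_starts):
        return 1  # at least one TM beyond the first `offset` AAs: positive
    return None  # TMs only within the first `offset` AAs (handling assumed)


def class_fractions(ss, classes="CEH", offset=60):
    """Class counts after the first `offset` AAs, divided by length - offset."""
    tail = ss[offset:]
    if not tail:
        raise ValueError("protein is not longer than the offset")
    return [tail.count(c) / len(tail) for c in classes]


# Hypothetical 100-residue protein with one predicted TM starting at AA 70.
label = label_protein([70])
fractions = class_fractions("C" * 60 + "H" * 30 + "E" * 10)
```
</preformat>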
</sec>
<sec id="S4.SS11">
<title>ML-SS Pipelines</title>
<p>The presented ML-SS methods have been implemented in five Linux pipelines. These pipelines consist of linked Python and Linux shell scripts, and R functions. The pipelines are designed to facilitate an automated, high-throughput computational approach to predict exportome membership probabilities. The five pipelines are named pipeline_ss (for 3 and 8 class predictions), pipeline_angles (for psi and phi angles predictions), pipeline_prop (for ASA and HSE-upper), pipeline_all (for 3 and 8 classes, psi and phi angles, ASA and HSE-upper) and pipeline_TM (for TM presence predictions). Four additional pipelines are also provided to conduct automated 10-fold cross validation of the ML-SS methods: pipeline_CV_ss, pipeline_CV_angles, pipeline_CV_prop, and pipeline_CV_TM. The pipelines were designed for a Linux operating system and have only been tested on Red Hat Enterprise Linux 7.7 but are expected to work on most Linux distributions. They are freely available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/goodswen/ML-SS_Methods">https://github.com/goodswen/ML-SS_Methods</ext-link>. All pipelines are provided with a ReadMe file with instructions, and test and training data for <italic>B. bovis</italic> T2Bo. Note that the SS predictor programs are not packaged with the ML-SS pipelines and need to be downloaded and installed independently. The pipelines only require the raw output from the SS predictors.</p>
</sec>
</sec>
<sec id="S5">
<title>Data Availability Statement</title>
<p>All ML-SS methods presented in the article are freely available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/goodswen/ML-SS_Methods">https://github.com/goodswen/ML-SS_Methods</ext-link>.</p>
</sec>
<sec id="S6">
<title>Author Contributions</title>
<p>JE and SG contributed to the conception and methodology of the study. SG wrote the software, validated results, and performed the statistical analysis. JE and PK provided supervision. JE handled the project administration and funding acquisition. SG prepared the original draft. All authors contributed to manuscript revision, read, and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="S15">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="financial-disclosure">
<p><bold>Funding.</bold> This research was funded by Australian Research Council Discovery, Grant Number DP180102584.</p>
</fn>
</fn-group>
<sec id="S8" sec-type="supplementary material">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fgene.2021.716132/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fgene.2021.716132/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="DS1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Data_Sheet_2.PDF" id="DS2" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Image_1.PDF" id="FS1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_1.XLSX" id="TS1" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_2.XLSX" id="TS2" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_3.XLSX" id="TS3" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_4.XLSX" id="TS4" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_5.PDF" id="TS5" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_6.PDF" id="TS6" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_7.PDF" id="TS7" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_8.XLSX" id="TS8" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_9.PDF" id="TS9" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_10.PDF" id="TS10" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_11.PDF" id="TS11" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abd El-Salam El-Sayed</surname> <given-names>S.</given-names></name> <name><surname>Rizk</surname> <given-names>M. A.</given-names></name> <name><surname>Terkawi</surname> <given-names>M. A.</given-names></name> <name><surname>Yokoyama</surname> <given-names>N.</given-names></name> <name><surname>Igarashi</surname> <given-names>I.</given-names></name></person-group> (<year>2017</year>). <article-title>Molecular identification and antigenic characterization of Babesia divergens Erythrocyte Binding Protein (BdEBP) as a potential vaccine candidate.</article-title> <source><italic>Parasitol. Int.</italic></source> <volume>66</volume> <fpage>721</fpage>&#x2013;<lpage>726</lpage>. <pub-id pub-id-type="doi">10.1016/j.parint.2017.07.004</pub-id> <pub-id pub-id-type="pmid">28743470</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Allred</surname> <given-names>D. R.</given-names></name> <name><surname>Carlton</surname> <given-names>J. M. R.</given-names></name> <name><surname>Satcher</surname> <given-names>R. L.</given-names></name> <name><surname>Long</surname> <given-names>J. A.</given-names></name> <name><surname>Brown</surname> <given-names>W. C.</given-names></name> <name><surname>Patterson</surname> <given-names>P. E.</given-names></name><etal/></person-group> (<year>2000</year>). <article-title>The ves multigene family of B. bovis encodes components of rapid antigenic variation at the infected erythrocyte surface.</article-title> <source><italic>Mol. Cell</italic></source> <volume>5</volume> <fpage>153</fpage>&#x2013;<lpage>162</lpage>. <pub-id pub-id-type="doi">10.1016/s1097-2765(00)80411-6</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname> <given-names>S. F.</given-names></name> <name><surname>Madden</surname> <given-names>T. L.</given-names></name> <name><surname>Schaffer</surname> <given-names>A. A.</given-names></name> <name><surname>Zhang</surname> <given-names>J. H.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Miller</surname> <given-names>W.</given-names></name><etal/></person-group> (<year>1997</year>). <article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>25</volume> <fpage>3389</fpage>&#x2013;<lpage>3402</lpage>. <pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id> <pub-id pub-id-type="pmid">9254694</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Armenteros</surname> <given-names>J. J. A.</given-names></name> <name><surname>Sonderby</surname> <given-names>C. K.</given-names></name> <name><surname>Sonderby</surname> <given-names>S. K.</given-names></name> <name><surname>Nielsen</surname> <given-names>H.</given-names></name> <name><surname>Winther</surname> <given-names>O.</given-names></name></person-group> (<year>2017</year>). <article-title>DeepLoc: prediction of protein subcellular localization using deep learning.</article-title> <source><italic>Bioinformatics</italic></source> <volume>33</volume> <fpage>3387</fpage>&#x2013;<lpage>3395</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btx431</pub-id> <pub-id pub-id-type="pmid">29036616</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Armenteros</surname> <given-names>J. J. A.</given-names></name> <name><surname>Tsirigos</surname> <given-names>K. D.</given-names></name> <name><surname>Sonderby</surname> <given-names>C. K.</given-names></name> <name><surname>Petersen</surname> <given-names>T. N.</given-names></name> <name><surname>Winther</surname> <given-names>O.</given-names></name> <name><surname>Brunak</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2019a</year>). <article-title>SignalP 5.0 improves signal peptide predictions using deep neural networks.</article-title> <source><italic>Nat. Biotechnol.</italic></source> <volume>37</volume> <fpage>420</fpage>&#x2013;<lpage>423</lpage>. <pub-id pub-id-type="doi">10.1038/s41587-019-0036-z</pub-id> <pub-id pub-id-type="pmid">30778233</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Armenteros</surname> <given-names>J. J. A.</given-names></name> <name><surname>Tsirigos</surname> <given-names>K. D.</given-names></name> <name><surname>Sonderby</surname> <given-names>C. K.</given-names></name> <name><surname>Petersen</surname> <given-names>T. N.</given-names></name> <name><surname>Winther</surname> <given-names>O.</given-names></name> <name><surname>Brunak</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2019b</year>). <article-title>SignalP 5.0 improves signal peptide predictions using deep neural networks.</article-title> <source><italic>Nat. Biotechnol.</italic></source> <volume>37</volume> <fpage>420</fpage>&#x2013;<lpage>423</lpage>.</citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aurrecoechea</surname> <given-names>C.</given-names></name> <name><surname>Brestelli</surname> <given-names>J.</given-names></name> <name><surname>Brunk</surname> <given-names>B. P.</given-names></name> <name><surname>Fischer</surname> <given-names>S.</given-names></name> <name><surname>Gajria</surname> <given-names>B.</given-names></name> <name><surname>Gao</surname> <given-names>X.</given-names></name><etal/></person-group> (<year>2010</year>). <article-title>EuPathDB: a portal to eukaryotic pathogen databases.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>38</volume> <fpage>D415</fpage>&#x2013;<lpage>D419</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkp941</pub-id> <pub-id pub-id-type="pmid">19914931</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bateman</surname> <given-names>A.</given-names></name> <name><surname>Martin</surname> <given-names>M. J.</given-names></name> <name><surname>O&#x2019;Donovan</surname> <given-names>C.</given-names></name> <name><surname>Magrane</surname> <given-names>M.</given-names></name> <name><surname>Apweiler</surname> <given-names>R.</given-names></name> <name><surname>Alpi</surname> <given-names>E.</given-names></name><etal/></person-group> (<year>2015</year>). <article-title>UniProt: a hub for protein information.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>43</volume> <fpage>D204</fpage>&#x2013;<lpage>D212</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gku989</pub-id> <pub-id pub-id-type="pmid">25348405</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brayton</surname> <given-names>K. A.</given-names></name> <name><surname>Lau</surname> <given-names>A. O. T.</given-names></name> <name><surname>Herndon</surname> <given-names>D. R.</given-names></name> <name><surname>Hannick</surname> <given-names>L.</given-names></name> <name><surname>Kappmeyer</surname> <given-names>L. S.</given-names></name> <name><surname>Berens</surname> <given-names>S. J.</given-names></name><etal/></person-group> (<year>2007</year>). <article-title>Genome sequence of Babesia bovis and comparative analysis of apicomplexan hemoprotozoa.</article-title> <source><italic>PLoS Pathog.</italic></source> <volume>3</volume>:<fpage>1401</fpage>&#x2013;<lpage>1413</lpage>. <pub-id pub-id-type="doi">10.1371/journal.ppat.0030148</pub-id> <pub-id pub-id-type="pmid">17953480</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breiman</surname> <given-names>L.</given-names></name></person-group> (<year>2001</year>). <article-title>Random forests.</article-title> <source><italic>Machine Learn.</italic></source> <volume>45</volume> <fpage>5</fpage>&#x2013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1023/a:1010933404324</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buchan</surname> <given-names>D. W. A.</given-names></name> <name><surname>Jones</surname> <given-names>D. T.</given-names></name></person-group> (<year>2019</year>). <article-title>The PSIPRED protein analysis workbench: 20 years on.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>47</volume> <fpage>W402</fpage>&#x2013;<lpage>W407</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkz297</pub-id> <pub-id pub-id-type="pmid">31251384</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cooke</surname> <given-names>B. M.</given-names></name> <name><surname>Buckingham</surname> <given-names>D. W.</given-names></name> <name><surname>Glenister</surname> <given-names>F. K.</given-names></name> <name><surname>Fernandez</surname> <given-names>K. M.</given-names></name> <name><surname>Bannister</surname> <given-names>L. H.</given-names></name> <name><surname>Marti</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2006</year>). <article-title>A Maurer&#x2019;s cleft-associated protein is essential for expression of the major malaria virulence antigen on the surface of infected red blood cells.</article-title> <source><italic>J. Cell Biol.</italic></source> <volume>172</volume> <fpage>899</fpage>&#x2013;<lpage>908</lpage>. <pub-id pub-id-type="doi">10.1083/jcb.200509122</pub-id> <pub-id pub-id-type="pmid">16520384</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Drozdetskiy</surname> <given-names>A.</given-names></name> <name><surname>Cole</surname> <given-names>C.</given-names></name> <name><surname>Procter</surname> <given-names>J.</given-names></name> <name><surname>Barton</surname> <given-names>G. J.</given-names></name></person-group> (<year>2015</year>). <article-title>JPred4: a protein secondary structure prediction server.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>43</volume> <fpage>W389</fpage>&#x2013;<lpage>W394</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkv332</pub-id> <pub-id pub-id-type="pmid">25883141</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eichenberger</surname> <given-names>R. M.</given-names></name> <name><surname>Ramakrishnan</surname> <given-names>C.</given-names></name> <name><surname>Russo</surname> <given-names>G.</given-names></name> <name><surname>Deplazes</surname> <given-names>P.</given-names></name> <name><surname>Hehl</surname> <given-names>A. B.</given-names></name></person-group> (<year>2017</year>). <article-title>Genome-wide analysis of gene expression and protein secretion of Babesia canis during virulent infection identifies potential pathogenicity factors.</article-title> <source><italic>Sci. Rep.</italic></source> <volume>7</volume>:<issue>3357</issue>. <pub-id pub-id-type="doi">10.1038/s41598-017-03445-x</pub-id> <pub-id pub-id-type="pmid">28611446</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elisa Rodriguez</surname> <given-names>A.</given-names></name> <name><surname>Florin-Christensen</surname> <given-names>M.</given-names></name> <name><surname>Agustina Flores</surname> <given-names>D.</given-names></name> <name><surname>Echaide</surname> <given-names>I.</given-names></name> <name><surname>Suarez</surname> <given-names>C. E.</given-names></name> <name><surname>Schnittger</surname> <given-names>L.</given-names></name></person-group> (<year>2014</year>). <article-title>The glycosylphosphatidylinositol-anchored protein repertoire of Babesia bovis and its significance for erythrocyte invasion.</article-title> <source><italic>Ticks Tick Borne Dis.</italic></source> <volume>5</volume> <fpage>343</fpage>&#x2013;<lpage>348</lpage>. <pub-id pub-id-type="doi">10.1016/j.ttbdis.2013.12.011</pub-id> <pub-id pub-id-type="pmid">24642346</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Emanuelsson</surname> <given-names>O.</given-names></name> <name><surname>Brunak</surname> <given-names>S.</given-names></name> <name><surname>von Heijne</surname> <given-names>G.</given-names></name> <name><surname>Nielsen</surname> <given-names>H.</given-names></name></person-group> (<year>2007</year>). <article-title>Locating proteins in the cell using TargetP, SignalP and related tools.</article-title> <source><italic>Nat. Protocols</italic></source> <volume>2</volume> <fpage>953</fpage>&#x2013;<lpage>971</lpage>. <pub-id pub-id-type="doi">10.1038/nprot.2007.131</pub-id> <pub-id pub-id-type="pmid">17446895</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fang</surname> <given-names>C.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Xu</surname> <given-names>D.</given-names></name> <name><surname>Shang</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>MUFold-SSW: a new web server for predicting protein secondary structures, torsion angles and turns.</article-title> <source><italic>Bioinformatics</italic></source> <volume>36</volume> <fpage>1293</fpage>&#x2013;<lpage>1295</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btz712</pub-id> <pub-id pub-id-type="pmid">31532508</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fang</surname> <given-names>C.</given-names></name> <name><surname>Shang</surname> <given-names>Y.</given-names></name> <name><surname>Xu</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>Prediction of protein backbone torsion angles using deep residual inception neural networks.</article-title> <source><italic>IEEE ACM Trans. Comput. Biol. Bioinform.</italic></source> <volume>16</volume> <fpage>1020</fpage>&#x2013;<lpage>1028</lpage>. <pub-id pub-id-type="doi">10.1109/tcbb.2018.2814586</pub-id> <pub-id pub-id-type="pmid">29994074</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ferreri</surname> <given-names>L. M.</given-names></name> <name><surname>Brayton</surname> <given-names>K. A.</given-names></name> <name><surname>Sondgeroth</surname> <given-names>K. S.</given-names></name> <name><surname>Lau</surname> <given-names>A. O. T.</given-names></name> <name><surname>Suarez</surname> <given-names>C. E.</given-names></name> <name><surname>McElwain</surname> <given-names>T. F.</given-names></name></person-group> (<year>2012</year>). <article-title>Expression and strain variation of the novel &#x201C;small open reading frame&#x201D; (smorf) multigene family in Babesia bovis.</article-title> <source><italic>Int. J. Parasitol.</italic></source> <volume>42</volume> <fpage>131</fpage>&#x2013;<lpage>138</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijpara.2011.10.004</pub-id> <pub-id pub-id-type="pmid">22138017</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Florin-Christensen</surname> <given-names>M.</given-names></name> <name><surname>Suarez</surname> <given-names>C. E.</given-names></name> <name><surname>Rodriguez</surname> <given-names>A. E.</given-names></name> <name><surname>Flores</surname> <given-names>D. A.</given-names></name> <name><surname>Schnittger</surname> <given-names>L.</given-names></name></person-group> (<year>2014</year>). <article-title>Vaccines against bovine babesiosis: where we are now and possible roads ahead.</article-title> <source><italic>Parasitology</italic></source> <volume>141</volume> <fpage>1563</fpage>&#x2013;<lpage>1592</lpage>. <pub-id pub-id-type="doi">10.1017/s0031182014000961</pub-id> <pub-id pub-id-type="pmid">25068315</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Freund</surname> <given-names>Y.</given-names></name> <name><surname>Schapire</surname> <given-names>R. E.</given-names></name></person-group> (<year>1997</year>). <article-title>A decision-theoretic generalization of on-line learning and an application to boosting.</article-title> <source><italic>J. Comput. Syst. Sci.</italic></source> <volume>55</volume> <fpage>119</fpage>&#x2013;<lpage>139</lpage>. <pub-id pub-id-type="doi">10.1006/jcss.1997.1504</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Friedman</surname> <given-names>J.</given-names></name> <name><surname>Hastie</surname> <given-names>T.</given-names></name> <name><surname>Tibshirani</surname> <given-names>R.</given-names></name></person-group> (<year>2000</year>). <article-title>Additive logistic regression: a statistical view of boosting.</article-title> <source><italic>Ann. Statist.</italic></source> <volume>28</volume> <fpage>337</fpage>&#x2013;<lpage>374</lpage>. <pub-id pub-id-type="doi">10.1214/aos/1016218223</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gaffar</surname> <given-names>F. R.</given-names></name> <name><surname>Yatsuda</surname> <given-names>A. P.</given-names></name> <name><surname>Franssen</surname> <given-names>F. F. J.</given-names></name> <name><surname>de Vries</surname> <given-names>E.</given-names></name></person-group> (<year>2004</year>). <article-title>A Babesia bovis merozoite protein with a domain architecture highly similar to the thrombospondin-related anonymous protein (TRAP) present in Plasmodium sporozoites.</article-title> <source><italic>Mol. Biochem. Parasitol.</italic></source> <volume>136</volume> <fpage>25</fpage>&#x2013;<lpage>34</lpage>. <pub-id pub-id-type="doi">10.1016/j.molbiopara.2004.02.006</pub-id> <pub-id pub-id-type="pmid">15138064</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gohil</surname> <given-names>S.</given-names></name> <name><surname>Kats</surname> <given-names>L. M.</given-names></name> <name><surname>Seemann</surname> <given-names>T.</given-names></name> <name><surname>Fernandez</surname> <given-names>K. M.</given-names></name> <name><surname>Siddiqui</surname> <given-names>G.</given-names></name> <name><surname>Cooke</surname> <given-names>B. M.</given-names></name></person-group> (<year>2013</year>). <article-title>Bioinformatic prediction of the exportome of Babesia bovis and identification of novel proteins in parasite-infected red blood cells.</article-title> <source><italic>Int. J. Parasitol.</italic></source> <volume>43</volume> <fpage>409</fpage>&#x2013;<lpage>416</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijpara.2013.01.002</pub-id> <pub-id pub-id-type="pmid">23395698</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gohil</surname> <given-names>S.</given-names></name> <name><surname>Kats</surname> <given-names>L. M.</given-names></name> <name><surname>Sturm</surname> <given-names>A.</given-names></name> <name><surname>Cooke</surname> <given-names>B. M.</given-names></name></person-group> (<year>2010</year>). <article-title>Recent insights into alteration of red blood cells by Babesia bovis: moovin&#x2019; forward.</article-title> <source><italic>Trends Parasitol.</italic></source> <volume>26</volume> <fpage>591</fpage>&#x2013;<lpage>599</lpage>. <pub-id pub-id-type="doi">10.1016/j.pt.2010.06.012</pub-id> <pub-id pub-id-type="pmid">20598944</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gonzalez</surname> <given-names>L. M.</given-names></name> <name><surname>Estrada</surname> <given-names>K.</given-names></name> <name><surname>Grande</surname> <given-names>R.</given-names></name> <name><surname>Jimenez-Jacinto</surname> <given-names>V.</given-names></name> <name><surname>Vega-Alvarado</surname> <given-names>L.</given-names></name> <name><surname>Sevilla</surname> <given-names>E.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>Comparative and functional genomics of the protozoan parasite Babesia divergens highlighting the invasion and egress processes.</article-title> <source><italic>PLoS Negl. Trop. Dis.</italic></source> <volume>13</volume>:<issue>e0007680</issue>. <pub-id pub-id-type="doi">10.1371/journal.pntd.0007680</pub-id> <pub-id pub-id-type="pmid">31425518</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goodswen</surname> <given-names>S. J.</given-names></name> <name><surname>Kennedy</surname> <given-names>P. J.</given-names></name> <name><surname>Ellis</surname> <given-names>J. T.</given-names></name></person-group> (<year>2013a</year>). <article-title>A guide to in silico vaccine discovery for eukaryotic pathogens.</article-title> <source><italic>Brief. Bioinform.</italic></source> <volume>14</volume> <fpage>753</fpage>&#x2013;<lpage>774</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbs066</pub-id> <pub-id pub-id-type="pmid">23097412</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goodswen</surname> <given-names>S. J.</given-names></name> <name><surname>Kennedy</surname> <given-names>P. J.</given-names></name> <name><surname>Ellis</surname> <given-names>J. T.</given-names></name></person-group> (<year>2013b</year>). <article-title>A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms.</article-title> <source><italic>BMC Bioinform.</italic></source> <volume>14</volume>:<issue>315</issue>. <pub-id pub-id-type="doi">10.1186/1471-2105-14-315</pub-id> <pub-id pub-id-type="pmid">24180526</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goodswen</surname> <given-names>S. J.</given-names></name> <name><surname>Kennedy</surname> <given-names>P. J.</given-names></name> <name><surname>Ellis</surname> <given-names>J. T.</given-names></name></person-group> (<year>2021</year>). <article-title>Applying machine learning to predict the exportome of bovine and canine babesia species that cause babesiosis.</article-title> <source><italic>Pathogens</italic></source> <volume>10</volume>:<issue>660</issue>. <pub-id pub-id-type="doi">10.3390/pathogens10060660</pub-id> <pub-id pub-id-type="pmid">34071992</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gubbels</surname> <given-names>M.-J.</given-names></name> <name><surname>Duraisingh</surname> <given-names>M. T.</given-names></name></person-group> (<year>2012</year>). <article-title>Evolution of apicomplexan secretory organelles.</article-title> <source><italic>Int. J. Parasitol.</italic></source> <volume>42</volume> <fpage>1071</fpage>&#x2013;<lpage>1081</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijpara.2012.09.009</pub-id> <pub-id pub-id-type="pmid">23068912</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Haase</surname> <given-names>S.</given-names></name> <name><surname>de Koning-Ward</surname> <given-names>T. F.</given-names></name></person-group> (<year>2010</year>). <article-title>New insights into protein export in malaria parasites.</article-title> <source><italic>Cell. Microbiol.</italic></source> <volume>12</volume> <fpage>580</fpage>&#x2013;<lpage>587</lpage>. <pub-id pub-id-type="doi">10.1111/j.1462-5822.2010.01455.x</pub-id> <pub-id pub-id-type="pmid">20180801</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hakimi</surname> <given-names>M.-A.</given-names></name> <name><surname>Olias</surname> <given-names>P.</given-names></name> <name><surname>Sibley</surname> <given-names>L. D.</given-names></name></person-group> (<year>2017</year>). <article-title>Toxoplasma effectors targeting host signaling and transcription.</article-title> <source><italic>Clin. Microbiol. Rev.</italic></source> <volume>30</volume> <fpage>615</fpage>&#x2013;<lpage>645</lpage>. <pub-id pub-id-type="doi">10.1128/cmr.00005-17</pub-id> <pub-id pub-id-type="pmid">28404792</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hamelryck</surname> <given-names>T.</given-names></name></person-group> (<year>2005</year>). <article-title>An amino acid has two sides: a new 2D measure provides a different view of solvent exposure.</article-title> <source><italic>Proteins Struct. Funct. Bioinform.</italic></source> <volume>59</volume> <fpage>38</fpage>&#x2013;<lpage>48</lpage>. <pub-id pub-id-type="doi">10.1002/prot.20379</pub-id> <pub-id pub-id-type="pmid">15688434</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hanson</surname> <given-names>J.</given-names></name> <name><surname>Paliwal</surname> <given-names>K.</given-names></name> <name><surname>Litfin</surname> <given-names>T.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Zhou</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks.</article-title> <source><italic>Bioinformatics</italic></source> <volume>35</volume> <fpage>2403</fpage>&#x2013;<lpage>2410</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty1006</pub-id> <pub-id pub-id-type="pmid">30535134</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heffernan</surname> <given-names>R.</given-names></name> <name><surname>Dehzangi</surname> <given-names>A.</given-names></name> <name><surname>Lyons</surname> <given-names>J.</given-names></name> <name><surname>Paliwal</surname> <given-names>K.</given-names></name> <name><surname>Sharma</surname> <given-names>A.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Highly accurate sequence-based prediction of half-sphere exposures of amino acid residues in proteins.</article-title> <source><italic>Bioinformatics</italic></source> <volume>32</volume> <fpage>843</fpage>&#x2013;<lpage>849</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv665</pub-id> <pub-id pub-id-type="pmid">26568622</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heffernan</surname> <given-names>R.</given-names></name> <name><surname>Paliwal</surname> <given-names>K.</given-names></name> <name><surname>Lyons</surname> <given-names>J.</given-names></name> <name><surname>Singh</surname> <given-names>J.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Zhou</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning.</article-title> <source><italic>J. Comput. Chem.</italic></source> <volume>39</volume> <fpage>2210</fpage>&#x2013;<lpage>2216</lpage>. <pub-id pub-id-type="doi">10.1002/jcc.25534</pub-id> <pub-id pub-id-type="pmid">30368831</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hines</surname> <given-names>S. A.</given-names></name> <name><surname>Palmer</surname> <given-names>G. H.</given-names></name> <name><surname>Brown</surname> <given-names>W. C.</given-names></name> <name><surname>McElwain</surname> <given-names>T. F.</given-names></name> <name><surname>Suarez</surname> <given-names>C. E.</given-names></name> <name><surname>Vidotto</surname> <given-names>O.</given-names></name><etal/></person-group> (<year>1995</year>). <article-title>Genetic and antigenic characterization of Babesia bovis merozoite spherical body protein Bb-1.</article-title> <source><italic>Mol. Biochem. Parasitol.</italic></source> <volume>69</volume> <fpage>149</fpage>&#x2013;<lpage>159</lpage>. <pub-id pub-id-type="doi">10.1016/0166-6851(94)00200-7</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Homer</surname> <given-names>M. J.</given-names></name> <name><surname>Aguilar-Delfin</surname> <given-names>I.</given-names></name> <name><surname>Telford</surname> <given-names>S. R.</given-names></name> <name><surname>Krause</surname> <given-names>P. J.</given-names></name> <name><surname>Persing</surname> <given-names>D. H.</given-names></name></person-group> (<year>2000</year>). <article-title>Babesiosis.</article-title> <source><italic>Clin. Microbiol. Rev.</italic></source> <volume>13</volume> <fpage>451</fpage>&#x2013;<lpage>469</lpage>. <pub-id pub-id-type="doi">10.1128/cmr.13.3.451-469.2000</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Horton</surname> <given-names>P.</given-names></name> <name><surname>Park</surname> <given-names>K.-J.</given-names></name> <name><surname>Obayashi</surname> <given-names>T.</given-names></name> <name><surname>Fujita</surname> <given-names>N.</given-names></name> <name><surname>Harada</surname> <given-names>H.</given-names></name> <name><surname>Adams-Collier</surname> <given-names>C. J.</given-names></name><etal/></person-group> (<year>2007</year>). <article-title>WoLF PSORT: protein localization predictor.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>35</volume> <fpage>W585</fpage>&#x2013;<lpage>W587</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkm259</pub-id> <pub-id pub-id-type="pmid">17517783</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hunfeld</surname> <given-names>K. P.</given-names></name> <name><surname>Hildebrandt</surname> <given-names>A.</given-names></name> <name><surname>Gray</surname> <given-names>J. S.</given-names></name></person-group> (<year>2008</year>). <article-title>Babesiosis: recent insights into an ancient disease.</article-title> <source><italic>Int. J. Parasitol.</italic></source> <volume>38</volume> <fpage>1219</fpage>&#x2013;<lpage>1237</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijpara.2008.03.001</pub-id> <pub-id pub-id-type="pmid">18440005</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jones</surname> <given-names>D. T.</given-names></name></person-group> (<year>1999</year>). <article-title>Protein secondary structure prediction based on position-specific scoring matrices.</article-title> <source><italic>J. Mol. Biol.</italic></source> <volume>292</volume> <fpage>195</fpage>&#x2013;<lpage>202</lpage>. <pub-id pub-id-type="doi">10.1006/jmbi.1999.3091</pub-id> <pub-id pub-id-type="pmid">10493868</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kabsch</surname> <given-names>W.</given-names></name> <name><surname>Sander</surname> <given-names>C.</given-names></name></person-group> (<year>1983</year>). <article-title>Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.</article-title> <source><italic>Biopolymers</italic></source> <volume>22</volume> <fpage>2577</fpage>&#x2013;<lpage>2637</lpage>. <pub-id pub-id-type="doi">10.1002/bip.360221211</pub-id> <pub-id pub-id-type="pmid">6667333</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kall</surname> <given-names>L.</given-names></name> <name><surname>Krogh</surname> <given-names>A.</given-names></name> <name><surname>Sonnhammer</surname> <given-names>E. L. L.</given-names></name></person-group> (<year>2004</year>). <article-title>A combined transmembrane topology and signal peptide prediction method.</article-title> <source><italic>J. Mol. Biol.</italic></source> <volume>338</volume> <fpage>1027</fpage>&#x2013;<lpage>1036</lpage>. <pub-id pub-id-type="doi">10.1016/j.jmb.2004.03.016</pub-id> <pub-id pub-id-type="pmid">15111065</pub-id></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kelley</surname> <given-names>L. A.</given-names></name> <name><surname>Mezulis</surname> <given-names>S.</given-names></name> <name><surname>Yates</surname> <given-names>C. M.</given-names></name> <name><surname>Wass</surname> <given-names>M. N.</given-names></name> <name><surname>Sternberg</surname> <given-names>M. J. E.</given-names></name></person-group> (<year>2015</year>). <article-title>The Phyre2 web portal for protein modeling, prediction and analysis.</article-title> <source><italic>Nat. Protocols</italic></source> <volume>10</volume> <fpage>845</fpage>&#x2013;<lpage>858</lpage>. <pub-id pub-id-type="doi">10.1038/nprot.2015.053</pub-id> <pub-id pub-id-type="pmid">25950237</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>K.</given-names></name> <name><surname>Weiss</surname> <given-names>L. M.</given-names></name></person-group> (<year>2004</year>). <article-title>Toxoplasma gondii: the model apicomplexan.</article-title> <source><italic>Int. J. Parasitol.</italic></source> <volume>34</volume> <fpage>423</fpage>&#x2013;<lpage>432</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijpara.2003.12.009</pub-id> <pub-id pub-id-type="pmid">15003501</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klausen</surname> <given-names>M. S.</given-names></name> <name><surname>Jespersen</surname> <given-names>M. C.</given-names></name> <name><surname>Nielsen</surname> <given-names>H.</given-names></name> <name><surname>Jensen</surname> <given-names>K. K.</given-names></name> <name><surname>Jurtz</surname> <given-names>V. I.</given-names></name> <name><surname>Sonderby</surname> <given-names>C. K.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning.</article-title> <source><italic>Proteins Struct. Funct. Bioinform.</italic></source> <volume>87</volume> <fpage>520</fpage>&#x2013;<lpage>527</lpage>. <pub-id pub-id-type="doi">10.1002/prot.25674</pub-id> <pub-id pub-id-type="pmid">30785653</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krogh</surname> <given-names>A.</given-names></name> <name><surname>Larsson</surname> <given-names>B.</given-names></name> <name><surname>von Heijne</surname> <given-names>G.</given-names></name> <name><surname>Sonnhammer</surname> <given-names>E. L. L.</given-names></name></person-group> (<year>2001</year>). <article-title>Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.</article-title> <source><italic>J. Mol. Biol.</italic></source> <volume>305</volume> <fpage>567</fpage>&#x2013;<lpage>580</lpage>. <pub-id pub-id-type="doi">10.1006/jmbi.2000.4315</pub-id> <pub-id pub-id-type="pmid">11152613</pub-id></citation></ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kuelzer</surname> <given-names>S.</given-names></name> <name><surname>Charnaud</surname> <given-names>S.</given-names></name> <name><surname>Dagan</surname> <given-names>T.</given-names></name> <name><surname>Riedel</surname> <given-names>J.</given-names></name> <name><surname>Mandal</surname> <given-names>P.</given-names></name> <name><surname>Pesce</surname> <given-names>E. R.</given-names></name><etal/></person-group> (<year>2012</year>). <article-title>Plasmodium falciparum-encoded exported hsp70/hsp40 chaperone/co-chaperone complexes within the host erythrocyte.</article-title> <source><italic>Cell. Microbiol.</italic></source> <volume>14</volume> <fpage>1784</fpage>&#x2013;<lpage>1795</lpage>. <pub-id pub-id-type="doi">10.1111/j.1462-5822.2012.01840.x</pub-id> <pub-id pub-id-type="pmid">22925632</pub-id></citation></ref>
<ref id="B49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>W.</given-names></name> <name><surname>Godzik</surname> <given-names>A.</given-names></name></person-group> (<year>2006</year>). <article-title>Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.</article-title> <source><italic>Bioinformatics</italic></source> <volume>22</volume> <fpage>1658</fpage>&#x2013;<lpage>1659</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btl158</pub-id> <pub-id pub-id-type="pmid">16731699</pub-id></citation></ref>
<ref id="B50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Magnan</surname> <given-names>C. N.</given-names></name> <name><surname>Baldi</surname> <given-names>P.</given-names></name></person-group> (<year>2014</year>). <article-title>SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity.</article-title> <source><italic>Bioinformatics</italic></source> <volume>30</volume> <fpage>2592</fpage>&#x2013;<lpage>2597</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btu352</pub-id> <pub-id pub-id-type="pmid">24860169</pub-id></citation></ref>
<ref id="B51"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Maier</surname> <given-names>A. G.</given-names></name> <name><surname>Cooke</surname> <given-names>B. M.</given-names></name> <name><surname>Cowman</surname> <given-names>A. F.</given-names></name> <name><surname>Tilley</surname> <given-names>L.</given-names></name></person-group> (<year>2009</year>). <article-title>Malaria parasite proteins that remodel the host erythrocyte.</article-title> <source><italic>Nat. Rev. Microbiol.</italic></source> <volume>7</volume> <fpage>341</fpage>&#x2013;<lpage>354</lpage>. <pub-id pub-id-type="doi">10.1038/nrmicro2110</pub-id> <pub-id pub-id-type="pmid">19369950</pub-id></citation></ref>
<ref id="B52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Martin</surname> <given-names>J.</given-names></name> <name><surname>Letellier</surname> <given-names>G.</given-names></name> <name><surname>Marin</surname> <given-names>A.</given-names></name> <name><surname>Taly</surname> <given-names>J. F.</given-names></name> <name><surname>de Brevern</surname> <given-names>A. G.</given-names></name> <name><surname>Gibrat</surname> <given-names>J. F.</given-names></name></person-group> (<year>2005</year>). <article-title>Protein secondary structure assignment revisited: a detailed analysis of different assignment methods.</article-title> <source><italic>BMC Struct. Biol.</italic></source> <volume>5</volume>:<issue>17</issue>. <pub-id pub-id-type="doi">10.1186/1472-6807-5-17</pub-id> <pub-id pub-id-type="pmid">16164759</pub-id></citation></ref>
<ref id="B53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Matsumoto</surname> <given-names>M.</given-names></name> <name><surname>Nishimura</surname> <given-names>T.</given-names></name></person-group> (<year>1998</year>). <article-title>Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator.</article-title> <source><italic>ACM Trans. Model. Comput. Simul.</italic></source> <volume>8</volume> <fpage>3</fpage>&#x2013;<lpage>30</lpage>. <pub-id pub-id-type="doi">10.1145/272991.272995</pub-id></citation></ref>
<ref id="B54"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mosqueda</surname> <given-names>J.</given-names></name> <name><surname>Olvera-Ramirez</surname> <given-names>A.</given-names></name> <name><surname>Aguilar-Tipacamu</surname> <given-names>G.</given-names></name> <name><surname>Canto</surname> <given-names>G. J.</given-names></name></person-group> (<year>2012</year>). <article-title>Current advances in detection and treatment of babesiosis.</article-title> <source><italic>Curr. Med. Chem.</italic></source> <volume>19</volume> <fpage>1504</fpage>&#x2013;<lpage>1518</lpage>. <pub-id pub-id-type="doi">10.2174/092986712799828355</pub-id> <pub-id pub-id-type="pmid">22360483</pub-id></citation></ref>
<ref id="B55"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Norimine</surname> <given-names>J.</given-names></name> <name><surname>Mosqueda</surname> <given-names>J.</given-names></name> <name><surname>Suarez</surname> <given-names>C.</given-names></name> <name><surname>Palmer</surname> <given-names>G. H.</given-names></name> <name><surname>McElwain</surname> <given-names>T. F.</given-names></name> <name><surname>Mbassa</surname> <given-names>G.</given-names></name><etal/></person-group> (<year>2003</year>). <article-title>Stimulation of T-helper cell gamma interferon and immunoglobulin G responses specific for Babesia bovis rhoptry-associated protein 1 (RAP-1) or a RAP-1 protein lacking the carboxy-terminal repeat region is insufficient to provide protective immunity against virulent B. bovis challenge.</article-title> <source><italic>Infect. Immun.</italic></source> <volume>71</volume> <fpage>5021</fpage>&#x2013;<lpage>5032</lpage>. <pub-id pub-id-type="doi">10.1128/iai.71.9.5021-5032.2003</pub-id> <pub-id pub-id-type="pmid">12933845</pub-id></citation></ref>
<ref id="B56"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oberli</surname> <given-names>A.</given-names></name> <name><surname>Slater</surname> <given-names>L. M.</given-names></name> <name><surname>Cutts</surname> <given-names>E.</given-names></name> <name><surname>Brand</surname> <given-names>F.</given-names></name> <name><surname>Mundwiler-Pachlatko</surname> <given-names>E.</given-names></name> <name><surname>Rusch</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2014</year>). <article-title>A Plasmodium falciparum PHIST protein binds the virulence factor PfEMP1 and comigrates to knobs on the host cell surface.</article-title> <source><italic>FASEB J.</italic></source> <volume>28</volume> <fpage>4420</fpage>&#x2013;<lpage>4433</lpage>. <pub-id pub-id-type="doi">10.1096/fj.14-256057</pub-id> <pub-id pub-id-type="pmid">24983468</pub-id></citation></ref>
<ref id="B57"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x2019;Connor</surname> <given-names>R. M.</given-names></name> <name><surname>Allred</surname> <given-names>D. R.</given-names></name></person-group> (<year>2000</year>). <article-title>Selection of Babesia bovis-infected erythrocytes for adhesion to endothelial cells coselects for altered variant erythrocyte surface antigen isoforms.</article-title> <source><italic>J. Immunol.</italic></source> <volume>164</volume> <fpage>2037</fpage>&#x2013;<lpage>2045</lpage>. <pub-id pub-id-type="doi">10.4049/jimmunol.164.4.2037</pub-id> <pub-id pub-id-type="pmid">10657656</pub-id></citation></ref>
<ref id="B58"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paoletta</surname> <given-names>M. S.</given-names></name> <name><surname>Laughery</surname> <given-names>J. M.</given-names></name> <name><surname>Arias</surname> <given-names>L. S. L.</given-names></name> <name><surname>Ortiz</surname> <given-names>J. M. J.</given-names></name> <name><surname>Montenegro</surname> <given-names>V. N.</given-names></name> <name><surname>Petrigh</surname> <given-names>R.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>The key to egress? Babesia bovis perforin-like protein 1 (PLP1) with hemolytic capacity is required for blood stage replication and is involved in the exit of the parasite from the host cell.</article-title> <source><italic>Int. J. Parasitol.</italic></source> <volume>51</volume> <fpage>643</fpage>&#x2013;<lpage>658</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijpara.2020.12.010</pub-id> <pub-id pub-id-type="pmid">33753093</pub-id></citation></ref>
<ref id="B59"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pierleoni</surname> <given-names>A.</given-names></name> <name><surname>Martelli</surname> <given-names>P. L.</given-names></name> <name><surname>Casadio</surname> <given-names>R.</given-names></name></person-group> (<year>2008</year>). <article-title>PredGPI: a GPI-anchor predictor.</article-title> <source><italic>BMC Bioinform.</italic></source> <volume>9</volume>:<issue>392</issue>. <pub-id pub-id-type="doi">10.1186/1471-2105-9-392</pub-id> <pub-id pub-id-type="pmid">18811934</pub-id></citation></ref>
<ref id="B60"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pollastri</surname> <given-names>G.</given-names></name> <name><surname>Baldi</surname> <given-names>P.</given-names></name> <name><surname>Fariselli</surname> <given-names>P.</given-names></name> <name><surname>Casadio</surname> <given-names>R.</given-names></name></person-group> (<year>2002</year>). <article-title>Prediction of coordination number and relative solvent accessibility in proteins.</article-title> <source><italic>Proteins Struct. Funct. Bioinform.</italic></source> <volume>47</volume> <fpage>142</fpage>&#x2013;<lpage>153</lpage>. <pub-id pub-id-type="doi">10.1002/prot.10069</pub-id> <pub-id pub-id-type="pmid">11933061</pub-id></citation></ref>
<ref id="B61"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ramachandran</surname> <given-names>G. N.</given-names></name> <name><surname>Ramakrishnan</surname> <given-names>C.</given-names></name> <name><surname>Sasisekharan</surname> <given-names>V.</given-names></name></person-group> (<year>1963</year>). <article-title>Stereochemistry of polypeptide chain configurations.</article-title> <source><italic>J. Mol. Biol.</italic></source> <volume>7</volume> <fpage>95</fpage>&#x2013;<lpage>99</lpage>. <pub-id pub-id-type="doi">10.1016/s0022-2836(63)80023-6</pub-id></citation></ref>
<ref id="B62"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rathinasamy</surname> <given-names>V.</given-names></name> <name><surname>Poole</surname> <given-names>W. A.</given-names></name> <name><surname>Bastos</surname> <given-names>R. G.</given-names></name> <name><surname>Suarez</surname> <given-names>C. E.</given-names></name> <name><surname>Cooke</surname> <given-names>B. M.</given-names></name></person-group> (<year>2019</year>). <article-title>Babesiosis vaccines: lessons learned, challenges ahead, and future glimpses.</article-title> <source><italic>Trends Parasitol.</italic></source> <volume>35</volume> <fpage>622</fpage>&#x2013;<lpage>635</lpage>. <pub-id pub-id-type="doi">10.1016/j.pt.2019.06.002</pub-id> <pub-id pub-id-type="pmid">31281025</pub-id></citation></ref>
<ref id="B63"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rost</surname> <given-names>B.</given-names></name></person-group> (<year>2001</year>). <article-title>Review: protein secondary structure prediction continues to rise.</article-title> <source><italic>J. Struct. Biol.</italic></source> <volume>134</volume> <fpage>204</fpage>&#x2013;<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1006/jsbi.2001.4336</pub-id> <pub-id pub-id-type="pmid">11551180</pub-id></citation></ref>
<ref id="B64"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ruef</surname> <given-names>B. J.</given-names></name> <name><surname>Dowling</surname> <given-names>S. C.</given-names></name> <name><surname>Conley</surname> <given-names>P. G.</given-names></name> <name><surname>Perryman</surname> <given-names>L. E.</given-names></name> <name><surname>Brown</surname> <given-names>W. C.</given-names></name> <name><surname>Jasmer</surname> <given-names>D. P.</given-names></name><etal/></person-group> (<year>2000</year>). <article-title>A unique <italic>Babesia bovis</italic> spherical body protein is conserved among geographic isolates and localizes to the infected erythrocyte membrane.</article-title> <source><italic>Mol. Biochem. Parasitol.</italic></source> <volume>105</volume> <fpage>1</fpage>&#x2013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1016/s0166-6851(99)00167-x</pub-id></citation></ref>
<ref id="B65"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schulze</surname> <given-names>J.</given-names></name> <name><surname>Kwiatkowski</surname> <given-names>M.</given-names></name> <name><surname>Borner</surname> <given-names>J.</given-names></name> <name><surname>Schlueter</surname> <given-names>H.</given-names></name> <name><surname>Bruchhaus</surname> <given-names>I.</given-names></name> <name><surname>Burmester</surname> <given-names>T.</given-names></name><etal/></person-group> (<year>2015</year>). <article-title>The <italic>Plasmodium falciparum</italic> exportome contains non-canonical PEXEL/HT proteins.</article-title> <source><italic>Mol. Microbiol.</italic></source> <volume>97</volume> <fpage>301</fpage>&#x2013;<lpage>314</lpage>. <pub-id pub-id-type="doi">10.1111/mmi.13024</pub-id> <pub-id pub-id-type="pmid">25850860</pub-id></citation></ref>
<ref id="B66"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sibley</surname> <given-names>L. D.</given-names></name></person-group> (<year>2003</year>). <article-title><italic>Toxoplasma gondii</italic>: perfecting an intracellular life style.</article-title> <source><italic>Traffic</italic></source> <volume>4</volume> <fpage>581</fpage>&#x2013;<lpage>586</lpage>. <pub-id pub-id-type="doi">10.1034/j.1600-0854.2003.00117.x</pub-id> <pub-id pub-id-type="pmid">12911812</pub-id></citation></ref>
<ref id="B67"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suarez</surname> <given-names>C. E.</given-names></name> <name><surname>Alzan</surname> <given-names>H. F.</given-names></name> <name><surname>Silva</surname> <given-names>M. G.</given-names></name> <name><surname>Rathinasamy</surname> <given-names>V.</given-names></name> <name><surname>Poole</surname> <given-names>W. A.</given-names></name> <name><surname>Cooke</surname> <given-names>B. M.</given-names></name></person-group> (<year>2019</year>). <article-title>Unravelling the cellular and molecular pathogenesis of bovine babesiosis: is the sky the limit?</article-title> <source><italic>Int. J. Parasitol.</italic></source> <volume>49</volume> <fpage>183</fpage>&#x2013;<lpage>197</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijpara.2018.11.002</pub-id> <pub-id pub-id-type="pmid">30690089</pub-id></citation></ref>
<ref id="B68"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suarez</surname> <given-names>C. E.</given-names></name> <name><surname>Noh</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>Emerging perspectives in the research of bovine babesiosis and anaplasmosis.</article-title> <source><italic>Vet. Parasitol.</italic></source> <volume>180</volume> <fpage>109</fpage>&#x2013;<lpage>125</lpage>. <pub-id pub-id-type="doi">10.1016/j.vetpar.2011.05.032</pub-id> <pub-id pub-id-type="pmid">21684084</pub-id></citation></ref>
<ref id="B69"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Terkawi</surname> <given-names>M. A.</given-names></name> <name><surname>Seuseu</surname> <given-names>F. J.</given-names></name> <name><surname>Eko-Wibowo</surname> <given-names>P.</given-names></name> <name><surname>Nguyen Xuan</surname> <given-names>H.</given-names></name> <name><surname>Minoda</surname> <given-names>Y.</given-names></name> <name><surname>AbouLaila</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2011</year>). <article-title>Secretion of a new spherical body protein of <italic>Babesia bovis</italic> into the cytoplasm of infected erythrocytes.</article-title> <source><italic>Mol. Biochem. Parasitol.</italic></source> <volume>178</volume> <fpage>40</fpage>&#x2013;<lpage>45</lpage>. <pub-id pub-id-type="doi">10.1016/j.molbiopara.2011.02.006</pub-id> <pub-id pub-id-type="pmid">21406202</pub-id></citation></ref>
<ref id="B70"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Torrisi</surname> <given-names>M.</given-names></name> <name><surname>Kaleel</surname> <given-names>M.</given-names></name> <name><surname>Pollastri</surname> <given-names>G.</given-names></name></person-group> (<year>2019</year>). <article-title>Deeper profiles and cascaded recurrent and convolutional neural networks for state-of-the-art protein secondary structure prediction.</article-title> <source><italic>Sci. Rep.</italic></source> <volume>9</volume>:<issue>12374</issue>. <pub-id pub-id-type="doi">10.1038/s41598-019-48786-x</pub-id> <pub-id pub-id-type="pmid">31451723</pub-id></citation></ref>
<ref id="B71"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vivona</surname> <given-names>S.</given-names></name> <name><surname>Gardy</surname> <given-names>J. L.</given-names></name> <name><surname>Ramachandran</surname> <given-names>S.</given-names></name> <name><surname>Brinkman</surname> <given-names>F. S. L.</given-names></name> <name><surname>Raghava</surname> <given-names>G. P. S.</given-names></name> <name><surname>Flower</surname> <given-names>D. R.</given-names></name><etal/></person-group> (<year>2008</year>). <article-title>Computer-aided biotechnology: from immuno-informatics to reverse vaccinology.</article-title> <source><italic>Trends Biotechnol.</italic></source> <volume>26</volume> <fpage>190</fpage>&#x2013;<lpage>200</lpage>. <pub-id pub-id-type="doi">10.1016/j.tibtech.2007.12.006</pub-id> <pub-id pub-id-type="pmid">18291542</pub-id></citation></ref>
<ref id="B72"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Peng</surname> <given-names>J.</given-names></name> <name><surname>Ma</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Protein secondary structure prediction using deep convolutional neural fields.</article-title> <source><italic>Sci. Rep.</italic></source> <volume>6</volume>:<issue>18962</issue>. <pub-id pub-id-type="doi">10.1038/srep18962</pub-id> <pub-id pub-id-type="pmid">26752681</pub-id></citation></ref>
<ref id="B73"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Gao</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Heffernan</surname> <given-names>R.</given-names></name> <name><surname>Hanson</surname> <given-names>J.</given-names></name> <name><surname>Paliwal</surname> <given-names>K.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Sixty-five years of the long march in protein secondary structure prediction: the final stretch?</article-title> <source><italic>Brief. Bioinform.</italic></source> <volume>19</volume> <fpage>482</fpage>&#x2013;<lpage>494</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbw129</pub-id> <pub-id pub-id-type="pmid">28040746</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="footnote1">
<label>1</label>
<p><ext-link ext-link-type="uri" xlink:href="https://www.uniprot.org/help/annotation_score">https://www.uniprot.org/help/annotation_score</ext-link></p></fn>
</fn-group>
</back>
</article>