<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<?covid-19-tdm?>
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Med.</journal-id>
<journal-title>Frontiers in Medicine</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Med.</abbrev-journal-title>
<issn pub-type="epub">2296-858X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmed.2022.1025887</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Medicine</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>MP-VHPPI: Meta predictor for viral host protein-protein interaction prediction in multiple hosts and viruses</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Asim</surname> <given-names>Muhammad Nabeel</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/910239/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Fazeel</surname> <given-names>Ahtisham</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Ibrahim</surname> <given-names>Muhammad Ali</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Dengel</surname> <given-names>Andreas</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Ahmed</surname> <given-names>Sheraz</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computer Science, Technical University of Kaiserslautern</institution>, <addr-line>Kaiserslautern</addr-line>, <country>Germany</country></aff>
<aff id="aff2"><sup>2</sup><institution>German Research Center for Artificial Intelligence GmbH</institution>, <addr-line>Kaiserslautern</addr-line>, <country>Germany</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Yu-Dong Zhang, University of Leicester, United Kingdom</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Jose Arturo Molina Mora, University of Costa Rica, Costa Rica; Mallur Srivatsan Madhusudhan, Indian Institute of Science Education and Research, Pune, India</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Muhammad Nabeel Asim <email>muhammad_nabeel.asim&#x00040;dfki.de</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Precision Medicine, a section of the journal Frontiers in Medicine</p></fn></author-notes>
<pub-date pub-type="epub">
<day>16</day>
<month>11</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>9</volume>
<elocation-id>1025887</elocation-id>
<history>
<date date-type="received">
<day>23</day>
<month>08</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>10</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Asim, Fazeel, Ibrahim, Dengel and Ahmed.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Asim, Fazeel, Ibrahim, Dengel and Ahmed</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Viral-host protein-protein interaction (VHPPI) prediction is essential to decoding the molecular mechanisms of viral pathogens and host immunity processes, which eventually helps to control the propagation of viral diseases and to design optimized therapeutics. Multiple AI-based predictors have been developed to predict diverse VHPPIs across a wide range of viruses and hosts; however, these predictors perform well only for specific types of hosts and viruses. The prime objective of this research is to develop a robust meta predictor (MP-VHPPI) capable of more accurately predicting VHPPIs across multiple hosts and viruses. The proposed meta predictor makes use of two well-known encoding methods, Amphiphilic Pseudo-Amino Acid Composition (APAAC) and Quasi-sequence (QS) order, which capture amino acid sequence order and distributional information to generate numerical representations of complete raw viral-host protein sequences. A feature agglomeration method is utilized to transform the original feature space into a more informative feature space. Random forest (RF) and Extra tree (ET) classifiers are trained on the optimized feature space of the APAAC and QS order encodings separately and on the combination of both encodings. The predictions of both classifiers are then fed to a Support Vector Machine (SVM) classifier that makes the final predictions. The proposed meta predictor is evaluated over 7 different benchmark datasets, where it outperforms existing VHPPI predictors by an average margin of 3.07, 6.07, 2.95, and 2.85% in terms of accuracy, Matthews correlation coefficient, precision, and sensitivity, respectively. To facilitate the scientific community, the MP-VHPPI web server is available at <ext-link ext-link-type="uri" xlink:href="https://sds_genetic_analysis.opendfki.de/MP-VHPPI/">https://sds_genetic_analysis.opendfki.de/MP-VHPPI/</ext-link>.</p></abstract>
<kwd-group>
<kwd>virus-host protein-protein interaction</kwd>
<kwd>meta predictor</kwd>
<kwd>feature agglomeration</kwd>
<kwd>SARS-CoV-2</kwd>
<kwd>Ebola virus</kwd>
<kwd>H1N1 virus</kwd>
</kwd-group>
<counts>
<fig-count count="4"/>
<table-count count="6"/>
<equation-count count="13"/>
<ref-count count="79"/>
<page-count count="20"/>
<word-count count="13902"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Viruses have a long history of posing a threat to living organisms (<xref ref-type="bibr" rid="B1">1</xref>), having caused more than 300 million deaths worldwide (<xref ref-type="bibr" rid="B2">2</xref>). The recently emerged Coronavirus is an example of an acute virus that caused a global pandemic (<xref ref-type="bibr" rid="B3">3</xref>). According to the World Health Organization, Coronavirus has caused more than 400 million infections and 6 million deaths across the globe (<xref ref-type="bibr" rid="B4">4</xref>). Similarly, the Ebola virus was responsible for an epidemic that caused more than 11 thousand deaths in Africa (<xref ref-type="bibr" rid="B5">5</xref>).</p>
<p>Viruses are microscopic particles that contain genetic material (DNA or RNA) surrounded by a protein coat (<xref ref-type="bibr" rid="B1">1</xref>). These particles are considered non-living because, lacking specific proteins, they cannot reproduce or perform any other biological function on their own (<xref ref-type="bibr" rid="B6">6</xref>). However, once they enter a host cell, they interact with the proteins available in the cell and become able to reproduce (<xref ref-type="bibr" rid="B7">7</xref>). To enter a host cell, viruses first interact with the host cell receptor proteins (<xref ref-type="bibr" rid="B8">8</xref>) and then replicate by injecting their genetic material into the cell&#x00027;s genome (<xref ref-type="bibr" rid="B9">9</xref>). After entering the cell, viruses aim to interact with diverse types of proteins through which they can control the cell cycle, particle assembly, apoptosis, and cell metabolism (<xref ref-type="bibr" rid="B7">7</xref>, <xref ref-type="bibr" rid="B10">10</xref>). The relationships between host and virus proteins are termed virus-host protein-protein interactions (<xref ref-type="bibr" rid="B11">11</xref>).</p>
<p>To prevent viruses from interacting with host proteins, hosts have sophisticated mechanisms to recognize and confine the viruses, such as dendritic cells, &#x003B2;-cells, T-cells, and the major histocompatibility complex (<xref ref-type="bibr" rid="B12">12</xref>). In response, viruses adapt efficiently by interacting with specific host proteins and cellular pathways that are important for evading or inactivating factors detrimental to viral growth (<xref ref-type="bibr" rid="B7">7</xref>). Meanwhile, developing efficient vaccines and drugs that enhance immunity against viruses is difficult because the mechanisms adopted by viruses are poorly understood and because viruses are frequently transmitted from cell to cell or from species to species (<xref ref-type="bibr" rid="B13">13</xref>). Consequently, analyses of virus-host PPIs are essential to explore their effects on diverse types of biological functions and to design antiviral strategies (<xref ref-type="bibr" rid="B14">14</xref>). Furthermore, through such analyses, essential viral proteins and viral dependencies on host proteins can be identified as drug targets to halt viral replication by pharmacological inhibition (<xref ref-type="bibr" rid="B15">15</xref>).</p>
<p>Multiple experimental techniques have been utilized to identify virus-host protein-protein interactions (VHPPIs), such as protease assay (<xref ref-type="bibr" rid="B16">16</xref>), surface plasmon resonance (SPR) (<xref ref-type="bibr" rid="B17">17</xref>), F&#x000F6;rster resonance energy transfer (FRET) (<xref ref-type="bibr" rid="B18">18</xref>), yeast two-hybrid screening (Y2H) (<xref ref-type="bibr" rid="B19">19</xref>), and affinity purification mass spectrometry (AP-MS) (<xref ref-type="bibr" rid="B20">20</xref>). Such conventional wet lab methods are expensive, time-consuming, and error-prone, which impedes large-scale inter- and intra-species proteomics sequence analyses. To overcome the shortcomings of experimental approaches, the development of machine learning applications for efficient proteomics sequence analyses across different species (e.g., humans, viruses) is an active area of research (<xref ref-type="bibr" rid="B21">21</xref>&#x02013;<xref ref-type="bibr" rid="B23">23</xref>). Researchers have developed machine learning-based clustering applications to distinguish several microbial pathogens (<xref ref-type="bibr" rid="B24">24</xref>, <xref ref-type="bibr" rid="B25">25</xref>) and classification applications to categorize the genes associated with the survival of pathogens under certain environmental conditions, antibiotics, or other disturbances (<xref ref-type="bibr" rid="B26">26</xref>). Similarly, researchers have developed classification applications to determine VHPPIs that play a key role in understanding the functional paradigms of viruses as well as host responses (<xref ref-type="bibr" rid="B27">27</xref>, <xref ref-type="bibr" rid="B28">28</xref>). With the aim of providing cheap, fast, and accurate virus-host protein-protein interaction analyses, around 13 AI-based predictors (<xref ref-type="bibr" rid="B21">21</xref>&#x02013;<xref ref-type="bibr" rid="B23">23</xref>, <xref ref-type="bibr" rid="B27">27</xref>&#x02013;<xref ref-type="bibr" rid="B36">36</xref>) have been proposed to date.</p>
<p>Recently, Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) proposed a VHPPI predictor that uses position-specific scoring matrices to statistically represent virus and host protein sequences, which are then passed to a Siamese convolutional neural network (CNN) for VHPPI prediction. The predictor was evaluated on VHPPI data of human proteins and 8 different viruses. A similar predictor, Deep Viral (<xref ref-type="bibr" rid="B35">35</xref>), used one hot vector encoding (OHE) for the discretization of sequences and a convolutional neural network architecture for VHPPI prediction. Deep Viral was evaluated on VHPPIs of humans and 12 different viruses. The Deep-VHPPI (<xref ref-type="bibr" rid="B27">27</xref>) predictor also used OHE, together with an attention mechanism and a CNN, for VHPPI prediction. The predictor was evaluated on VHPPI data related to humans and 4 different viruses.</p>
<p>Ding et al. (<xref ref-type="bibr" rid="B34">34</xref>) proposed a VHPPI predictor based on a long short-term memory (LSTM) neural network. At the preprocessing stage, they generated statistical representations of viral and host proteins using 3 different encoders, namely the relative frequency of amino acid triplets (RFAT), frequency difference of amino acid triplets (FDAT), and amino acid composition (AC). The predictor (<xref ref-type="bibr" rid="B34">34</xref>) was evaluated on VHPPIs across proteins belonging to 137 different viruses and 13 hosts. Denovo (<xref ref-type="bibr" rid="B29">29</xref>) used amino acid properties such as dipoles and volumes of side chains to group the 20 amino acids (AAs) into only 7 clusters, reducing the diversity of amino acids. The sequences were then encoded based on the normalized k-mer frequencies of the 7 unique clusters. The Denovo predictor used an SVM classifier and was evaluated on a dataset of 10 viruses and human proteins. HOPITOR (<xref ref-type="bibr" rid="B37">37</xref>) used an encoding method similar to that of Denovo (<xref ref-type="bibr" rid="B29">29</xref>). HOPITOR used an SVM classifier and was evaluated on 10 different viruses and human proteins. Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) proposed InterSPPI-HVPPI, which utilized Doc2vec embeddings and a random forest (RF) classifier for VHPPI prediction. The predictor (<xref ref-type="bibr" rid="B31">31</xref>) was evaluated on data related to 12 viruses and human proteins. Karabulut et al. (<xref ref-type="bibr" rid="B28">28</xref>) proposed a meta predictor (ML-AdVInfect) that combined 4 existing predictors, namely HOPITOR (<xref ref-type="bibr" rid="B37">37</xref>), InterSPPI-HVPPI (<xref ref-type="bibr" rid="B31">31</xref>), VHPPI, and Denovo (<xref ref-type="bibr" rid="B29">29</xref>). Specifically, the authors passed the predictions of the existing predictors to an SVM classifier for the final VHPPI prediction.</p>
<p>Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) proposed a VHPPI predictor that utilized an RF classifier and statistical vectors generated through 4 different encoding methods, namely average domain-domain association score, virus methionine, virus serine, and virus valine. The predictor was evaluated on VHPPI data related to human proteins and 5 different viruses. Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) used 7 sequence encoding methods, i.e., RFAT, FDAT, AC, composition, transition, and distribution of amino acid groups. The approach (<xref ref-type="bibr" rid="B30">30</xref>) used an SVM classifier for VHPPI predictions across the proteins of 332 viruses and 29 hosts. Alguwaizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) combined statistical vectors of 4 different encoders, namely amino acid repeats, the sum of the squared length of single amino acid repeats (SARs), the maximum of the sum of the squared length of SARs in a window of 6 residues, and the composition of amino acids in 5 partitions of the protein sequence. The predictor used an SVM classifier, and experimentation was performed on VHPPI data related to 6 hosts and 5 viruses. Recently, Asim et al. proposed the LCGA-VHPPI predictor (<xref ref-type="bibr" rid="B38">38</xref>), which made use of a local-global residue context-aware sequence encoding scheme and a deep forest model. The authors evaluated their predictor on data related to 23 viruses and human proteins.</p>
<p>Following the success of neural word embedding approaches in natural language processing and bioinformatics, Tsukiyama et al. proposed LSTM-PHV (<xref ref-type="bibr" rid="B21">21</xref>), which transforms viral-host protein sequences into statistical vectors by learning statistical representations of k-mers in an unsupervised manner using the Word2vec approach. The study (<xref ref-type="bibr" rid="B21">21</xref>) used a bidirectional LSTM for VHPPI prediction and data of proteins belonging to 332 viruses and 29 hosts. Similarly, the MTT (<xref ref-type="bibr" rid="B23">23</xref>) predictor utilized randomly initialized embeddings and an LSTM-based classifier. The MTT predictor was evaluated on data related to 16 viruses and human proteins. Hangyu et al. (<xref ref-type="bibr" rid="B33">33</xref>) developed a VHPPI predictor based on the Node2vec and Word2vec embedding methods and a multilayer perceptron (MLP) classifier. The authors performed experimentation over 7 variants of the SARS virus and 16 different host proteins.</p>
<p>The working paradigm of existing VHPPI predictors can be broadly categorized into two different stages. In the first stage, raw sequences are transformed into statistical vectors where the aim is to capture distributional information of 21 unique amino acids. In the second stage, a machine or deep learning classifier is utilized to discriminate interactive viral-host protein pairs from non-interactive ones.</p>
<p>In the first stage, while transforming raw sequences to statistical vectors, 2 predictors (<xref ref-type="bibr" rid="B27">27</xref>, <xref ref-type="bibr" rid="B35">35</xref>) make use of the one hot vector encoding method, which lacks information related to correlations of amino acids. Moreover, 3 predictors use word embedding generation approaches (<xref ref-type="bibr" rid="B21">21</xref>, <xref ref-type="bibr" rid="B23">23</xref>, <xref ref-type="bibr" rid="B35">35</xref>) that capture k-mer to k-mer associations but lack information related to the distribution of amino acids. To capture the distribution and various patterns of amino acids, other predictors utilized 10 different mathematical encoders (<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B29">29</xref>, <xref ref-type="bibr" rid="B31">31</xref>, <xref ref-type="bibr" rid="B34">34</xref>, <xref ref-type="bibr" rid="B36">36</xref>); however, these methods do not capture sequence order or amino acid correlation information. Such information is crucial for the analyses of protein sequences, as reported in existing studies (<xref ref-type="bibr" rid="B39">39</xref>&#x02013;<xref ref-type="bibr" rid="B42">42</xref>) that use sequence encoders such as APAAC and QS order. Despite the promising performance shown by the APAAC and QS order encoders for subcellular location prediction (<xref ref-type="bibr" rid="B39">39</xref>), Cyclin protein classification (<xref ref-type="bibr" rid="B40">40</xref>), and protein-protein interaction prediction (<xref ref-type="bibr" rid="B41">41</xref>, <xref ref-type="bibr" rid="B42">42</xref>) tasks, no researcher has explored their potential to effectively generate numerical representations of viral-host protein sequences.</p>
<p>In the second stage, 4 predictors (<xref ref-type="bibr" rid="B27">27</xref>, <xref ref-type="bibr" rid="B34">34</xref>&#x02013;<xref ref-type="bibr" rid="B36">36</xref>) utilize CNNs, 2 predictors (<xref ref-type="bibr" rid="B21">21</xref>, <xref ref-type="bibr" rid="B34">34</xref>) make use of the LSTM architecture, and 8 predictors (<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B23">23</xref>, <xref ref-type="bibr" rid="B28">28</xref>&#x02013;<xref ref-type="bibr" rid="B33">33</xref>) use traditional classifiers. Because these predictors have shown good performance only across a limited set of hosts and viruses, they cannot be generalized across multiple hosts and viruses. For instance, LSTM-PHV, the most recent predictor, produced better performance for human and Coronavirus related VHPPIs but failed to produce similar performance over the Zhou et al. (<xref ref-type="bibr" rid="B43">43</xref>) datasets, which contain multiple hosts and viruses. To date, only one meta predictor (<xref ref-type="bibr" rid="B28">28</xref>) has been developed as a generic predictor capable of accurately predicting interactions across multiple hosts and viruses. However, this meta predictor relies on the predictions of 4 existing VHPPI predictors that have their own drawbacks at the sequence encoding and classification levels.</p>
<p>With the aim of developing a more accurate and generic meta predictor, this paper makes the following contributions. i) It makes use of two physicochemical property-based sequence encoding methods, namely APAAC and QS order. Unlike other protein sequence analysis tasks, where these encoders generate numerical representations of complete raw protein sequences from a combination of different physicochemical properties, this paper proposes an effective way to generate numerical representations from a precise subset of physicochemical properties. ii) Because considering different physicochemical properties in both encoders yields some irrelevant and redundant features, it transforms the original feature space into a reduced and more discriminative feature space by utilizing a dimensionality reduction method named feature agglomeration. iii) Using separate and combined statistical vectors generated through APAAC and QS order, it generates a more effective and discriminative probabilistic feature space by fusing the predictions of two different classifiers. This optimized probabilistic feature space is fed to the SVM classifier, which makes the final predictions. iv) It reports large-scale experimentation over 7 public benchmark datasets and a performance comparison of the proposed meta predictor with existing predictors. v) To facilitate researchers and practitioners, it provides a web application based on the proposed meta predictor.</p>
</sec>
<sec sec-type="materials and methods" id="s2">
<title>2. Materials and methods</title>
<p>This section briefly describes the working paradigm of the proposed predictor, benchmark datasets, and diverse types of evaluation measures.</p>
<sec>
<title>2.1. Proposed meta-predictor</title>
<p>Machine learning classifiers cannot directly operate on raw sequences because they depend on statistical representations. While transforming raw protein sequences into statistical vectors, the aim is to encode positional and discriminative information about amino acids. To represent viral and host protein sequences by extracting both types of information, the proposed meta predictor makes use of two sequence encoders, namely APAAC and QS order. The statistical vectors generated by these methods depend on certain physicochemical properties. For example, the APAAC (<xref ref-type="bibr" rid="B44">44</xref>) encoder uses three different physicochemical properties, namely hydrophobicity, hydrophilicity, and side chain mass, whereas QS order (<xref ref-type="bibr" rid="B39">39</xref>) has two content matrices, namely Schneider and Grantham. Rather than utilizing all the available properties, it is important to investigate which particular properties of both encoders are appropriate for generating more comprehensive statistical vectors.</p>
<p>To fully utilize the potential of both encoders, a strategy similar to the forward feature selection method is adopted to find the most appropriate physicochemical properties. For instance, for the 3 properties of the APAAC encoder, we first generate statistical vectors using one property and compute the performance of the RF classifier. We repeat the same process for the second and third properties and record the performance of the RF classifier for each. We then take the statistical vectors of the best-performing property and combine them with those of the second-best-performing property. The combined features are evaluated in turn: if they do not yield any performance gain, the iterative process stops and the statistical vectors of the single best-performing property are selected. In contrast, if the combination yields a performance gain, the combined encodings are retained and utilized further. A similar procedure is used to generate statistical representations using QS order.</p>
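<p>The property selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors&#x00027; implementation: the function name <italic>select_properties</italic>, the callable <italic>score_fn</italic> (standing in for training and evaluating the RF classifier on training data), and the toy scores are all hypothetical.</p>
<preformat>
```python
# Greedy, forward-selection-style choice of physicochemical properties.
# score_fn is a hypothetical callable: it takes a tuple of property names and
# returns a validation score (for example RF accuracy on training data only).

def select_properties(properties, score_fn):
    # Rank properties by the score each one achieves on its own.
    ranked = sorted(properties, key=lambda p: score_fn((p,)), reverse=True)
    selected = [ranked[0]]
    best_score = score_fn(tuple(selected))
    # Try adding the next-best property; stop at the first non-improvement.
    for candidate in ranked[1:]:
        trial = selected + [candidate]
        trial_score = score_fn(tuple(trial))
        if trial_score > best_score:
            selected, best_score = trial, trial_score
        else:
            break
    return selected, best_score


# Toy scores, invented purely for illustration (not results from the paper).
TOY_SCORES = {
    frozenset(["hydrophobicity"]): 0.80,
    frozenset(["hydrophilicity"]): 0.78,
    frozenset(["side_chain_mass"]): 0.70,
    frozenset(["hydrophobicity", "hydrophilicity"]): 0.83,
    frozenset(["hydrophobicity", "hydrophilicity", "side_chain_mass"]): 0.82,
}


def toy_score(subset):
    return TOY_SCORES[frozenset(subset)]


chosen, score = select_properties(
    ["hydrophobicity", "hydrophilicity", "side_chain_mass"], toy_score
)
print(chosen, score)
```
</preformat>
<p>With these toy scores, the two best properties are combined because their joint score exceeds the best individual score, and the third property is rejected because adding it does not improve the score further.</p>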
<p>The statistical vectors generated from the encoders may contain irrelevant and redundant features. To remove such features and retain only the most informative ones, we utilize a dimensionality reduction algorithm named feature agglomeration (<xref ref-type="bibr" rid="B45">45</xref>). While reducing the dimensions of the original feature space, it is important to find the target dimension of the reduced feature space. To find an appropriate feature space, we reduce the dimension of the original feature space from 40 to 95% with a step size of 5%. We then choose the most appropriate feature space based on the performance of the RF classifier. Notably, both the property selection and the selection of the reduced feature space are performed using training data only.</p>
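<p>The sweep over reduced feature spaces can be made concrete with a short Python sketch. It assumes that &#x0201C;40 to 95%&#x0201D; denotes the fraction of the original dimensions that is removed, which gives 12 candidate target dimensions; the function names, the original dimension of 200, and the toy scoring function are hypothetical. In practice, each candidate dimension could be passed as the number of clusters to a feature agglomeration implementation (for example <italic>FeatureAgglomeration</italic> in scikit-learn), and the score would come from the RF classifier.</p>
<preformat>
```python
# Candidate target dimensions for a reduction sweep of 40% to 95% in 5% steps.
# Assumption: the percentage is the share of original features removed.

def candidate_dimensions(n_features, start=40, stop=95, step=5):
    dims = []
    for percent in range(start, stop + 1, step):
        kept = n_features * (100 - percent) // 100
        dims.append((percent, max(1, kept)))  # keep at least one feature
    return dims


def pick_dimension(n_features, score_fn):
    # score_fn is a hypothetical callable mapping a target dimension to a
    # validation score (for example RF performance on training data).
    return max(candidate_dimensions(n_features), key=lambda pc: score_fn(pc[1]))


candidates = candidate_dimensions(200)
print(candidates)

# Toy score that peaks at 50 retained features, for illustration only.
best = pick_dimension(200, lambda k: -abs(k - 50))
print(best)
```
</preformat>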
<p>In the current study, the training of the meta predictor is a two-stage process. In the first stage, the statistical vectors generated for virus-host protein sequences using APAAC and QS order are separately passed through two machine learning classifiers, i.e., RF and ET (<xref ref-type="bibr" rid="B46">46</xref>). The two representations are then concatenated and passed again through the RF and ET classifiers. In the second stage, the predictions of both classifiers on the individual and combined encodings are used to create a new feature space, on which the SVM classifier is trained to make the final predictions.</p>
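<p>The shape of the second-stage feature space can be illustrated with the following Python sketch. It assumes that each of the two classifiers contributes one interaction probability per encoding (APAAC, QS order, and their concatenation), so that every protein pair is described by 6 values before the SVM is applied. The first-stage models here are toy stand-in functions rather than trained RF and ET classifiers, and all names are hypothetical.</p>
<preformat>
```python
# Assemble the probabilistic feature vector that feeds the final classifier.
# Assumption: 3 encodings x 2 first-stage classifiers = 6 probabilities.

ENCODINGS = ("apaac", "qs_order", "combined")
CLASSIFIERS = ("rf", "et")


def meta_features(apaac_vec, qs_vec, first_stage):
    # first_stage maps (encoding, classifier) to a callable that returns an
    # interaction probability for one encoded protein pair.
    views = {
        "apaac": list(apaac_vec),
        "qs_order": list(qs_vec),
        "combined": list(apaac_vec) + list(qs_vec),  # concatenated encodings
    }
    return [
        first_stage[(enc, clf)](views[enc])
        for enc in ENCODINGS
        for clf in CLASSIFIERS
    ]


# Toy stand-ins for trained models, for illustration only.
def toy_mean(vec):
    return sum(vec) / len(vec)


def toy_max(vec):
    return max(vec)


FIRST_STAGE = {
    (enc, clf): (toy_mean if clf == "rf" else toy_max)
    for enc in ENCODINGS
    for clf in CLASSIFIERS
}

row = meta_features([0.25, 0.75], [0.5], FIRST_STAGE)
print(row)
```
</preformat>
<p>Each such row would then serve as one training example for the second-stage SVM classifier.</p>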
<p><xref ref-type="fig" rid="F1">Figure 1</xref> illustrates the workflow of the proposed meta predictor. The encoding methods are described in detail in Section 2.1.1, and the dimensionality reduction method is explained in Section 2.1.2. In addition, details of the machine learning classifiers are provided in Section 2.3.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The overall working paradigm of the proposed VHPPI predictor. <bold>Dataset Construction</bold> To begin with, different datasets are collected from existing studies based on VHPPIs from several databases such as HPID, IntAct, and VirusMentha. <bold>Feature Representation</bold> The obtained protein sequences are encoded using two physicochemical property-based protein sequence encoders, i.e., QS order and APAAC. <bold>Feature Analyses</bold> Appropriate physicochemical properties are selected for APAAC and QS order on the basis of feature analyses. <bold>Model Construction</bold> The VHPPI predictor is an SVM model built on the probabilistic vectors obtained from the RF and ET classifiers. Finally, a web server is established for fast and easy on-the-go analyses of VHPPIs.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-09-1025887-g0001.tif"/>
</fig>
<sec>
<title>2.1.1. Protein sequence encoding</title>
<p>The following subsections briefly illustrate the working paradigm of APAAC and QS order sequence encoding methods.</p>
<sec>
<title>2.1.1.1. Amphiphilic Pseudo-Amino Acid Composition (APAAC)</title>
<p>Chou (<xref ref-type="bibr" rid="B44">44</xref>, <xref ref-type="bibr" rid="B47">47</xref>) proposed an APAAC encoder that makes use of pre-computed physicochemical values of hydrophobicity, hydrophilicity, and side chain mass (<xref ref-type="bibr" rid="B44">44</xref>, <xref ref-type="bibr" rid="B47">47</xref>). Each physicochemical property contains 20 float values associated with 20 unique amino acids (<xref ref-type="supplementary-material" rid="SM1">Supplementary Table S1</xref>). These values are computed based on diverse types of information related to protein folding, and protein&#x00027;s interactions with the environment and other molecules. For each of the three quantitative properties, the values of its corresponding amino acids are normalized to zero mean and unit standard deviation through Equation 1.</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>M</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>20</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>[</mml:mo><mml:mi>A</mml:mi><mml:msub><mml:mi>A</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mn>20</mml:mn></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>20</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>[</mml:mo><mml:mi>A</mml:mi><mml:msub><mml:mi>A</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>M</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mo 
stretchy='false'>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mn>20</mml:mn></mml:mrow></mml:mfrac></mml:mrow></mml:msqrt><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>[</mml:mo><mml:mi>A</mml:mi><mml:msub><mml:mi>A</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>M</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mtext>&#x02009;</mml:mtext></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mo>&#x0007B;</mml:mo><mml:mn>1.</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mo>,</mml:mo><mml:mn>20</mml:mn><mml:mo>&#x0007D;</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd 
columnalign='left'><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mo>&#x0007B;</mml:mo><mml:mtext>hydrophobicity,&#x000A0;hydrophilicity,&#x000A0;side&#x000A0;chain&#x000A0;mass</mml:mtext><mml:mo>&#x0007D;</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>Here, <italic>p</italic><sub><italic>i</italic></sub>[<italic>AA</italic><sub><italic>k</italic></sub>] denotes the value of physicochemical property <italic>p</italic><sub><italic>i</italic></sub> (hydrophobicity, hydrophilicity, or side chain mass) for amino acid <italic>AA</italic><sub><italic>k</italic></sub>. In Equation 1, Mean[<italic>p</italic><sub><italic>i</italic></sub>] is the mean of the property over the 20 amino acids, S[<italic>p</italic><sub><italic>i</italic></sub>] is the corresponding standard deviation, and P[<italic>p</italic><sub><italic>i</italic></sub>] is the resulting normalized value.</p>
<p>For each physicochemical property, the normalized values of all 20 amino acids are then used to capture the order of amino acids within the host and viral protein sequences through a lag-based scheme.</p>
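<p>As a minimal sketch of the normalization in Equation 1, the following Python snippet standardizes one property over the 20 amino acids. The names normalize_property and toy_property are illustrative assumptions, and the property values are placeholders rather than a real physicochemical scale.</p>
<preformat><![CDATA[
```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder values only; these are NOT a real hydrophobicity scale.
toy_property = {aa: float(rank) for rank, aa in enumerate(AMINO_ACIDS, start=1)}


def normalize_property(values):
    """Standardize one property over the 20 amino acids (Equation 1)."""
    n = len(values)
    mean = sum(values.values()) / n
    # Population standard deviation: the sum is divided by n (here 20).
    std = sqrt(sum((v - mean) ** 2 for v in values.values()) / n)
    return {aa: (v - mean) / std for aa, v in values.items()}


normalized = normalize_property(toy_property)
```
]]></preformat>
<p>After this step, the normalized values of a property have zero mean and unit standard deviation across the 20 amino acids.</p>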
<p>For instance, consider a raw sequence S=<italic>R</italic><sub>1</sub>, <italic>R</italic><sub>2</sub>, <italic>R</italic><sub>3</sub>, <italic>R</italic><sub>4</sub>, &#x022EF;&#x02009;, <italic>R</italic><sub><italic>L</italic></sub>, where each of <italic>R</italic><sub>1, &#x022EF;&#x02009;, <italic>L</italic></sub> is one of the 20 unique amino acids. For lag=1, adjacent amino acids are paired, i.e., <italic>S</italic><sub><italic>lag</italic>1</sub> &#x0003D; <italic>R</italic><sub>1</sub><italic>R</italic><sub>2</sub>, <italic>R</italic><sub>2</sub><italic>R</italic><sub>3</sub>, <italic>R</italic><sub>3</sub><italic>R</italic><sub>4</sub>, <italic>R</italic><sub>4</sub><italic>R</italic><sub>5</sub>, &#x022EF;; for lag=2, amino acids are paired by skipping 1 amino acid, i.e., <italic>S</italic><sub><italic>lag</italic>2</sub> &#x0003D; <italic>R</italic><sub>1</sub><italic>R</italic><sub>3</sub>, <italic>R</italic><sub>2</sub><italic>R</italic><sub>4</sub>, <italic>R</italic><sub>3</sub><italic>R</italic><sub>5</sub>, &#x022EF;; and for lag=3, amino acids are paired by skipping 2 amino acids, i.e., <italic>S</italic><sub><italic>lag</italic>3</sub> &#x0003D; <italic>R</italic><sub>1</sub><italic>R</italic><sub>4</sub>, <italic>R</italic><sub>2</sub><italic>R</italic><sub>5</sub>, &#x022EF;, and so on. The bigrams in <italic>S</italic><sub><italic>lag</italic>1</sub>, <italic>S</italic><sub><italic>lag</italic>2</sub>, and <italic>S</italic><sub><italic>lag</italic>3</sub> are then processed iteratively: in each bigram, the physicochemical values of the two amino acids are multiplied using the correlation function shown in Equation 2.</p>
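<p>The lag-based pairing described above can be sketched in Python as follows. The function name lag_bigrams and the short example sequence are assumptions made for illustration only.</p>
<preformat><![CDATA[
```python
def lag_bigrams(sequence, lag):
    """Pair each residue with the residue `lag` positions downstream."""
    return [sequence[i] + sequence[i + lag] for i in range(len(sequence) - lag)]


seq = "MKVLA"  # hypothetical five-residue sequence R1..R5
print(lag_bigrams(seq, 1))  # ['MK', 'KV', 'VL', 'LA']
print(lag_bigrams(seq, 2))  # ['MV', 'KL', 'VA']
print(lag_bigrams(seq, 3))  # ['ML', 'KA']
```
]]></preformat>
<p>A sequence of length L yields L minus lag bigrams for a given lag, so the lag must be smaller than the sequence length.</p>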
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtext class="textrm" mathvariant="normal">hydrophobicity, hydrophilicity, side chain mass</mml:mtext></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>After computing the correlation functions for a property across N lags, a single floating-point value is obtained by averaging the correlation values over all the lag-based amino acid bigrams.</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mfrac><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mo>-</mml:mo><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Furthermore, the amino acid distributional information is normalized with respect to the sequence order information using Equation (4).</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable class="aligned"><mml:mtr><mml:mtd columnalign="right"><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>A</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>f</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>y</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:mi>A</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, <italic>w</italic> is a weight parameter that varies from 0.1 to 1. Similarly, normalization is applied to the original sequence order information by using Equation (5),</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>w</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Once the amino acid distribution and the sequence order information are encoded, the final statistical representation is obtained by concatenating the amino acid distributions with the correlations among amino acids, which represent the sequence order information of a protein sequence.</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>A</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x02225;</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:msub><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>For a single physicochemical property, the final statistical vector has 20 &#x0002B; lag dimensions, and for 3 physicochemical properties it has (20 &#x0002B; lag) &#x000D7; 3 dimensions. The first 20 values are the normalized amino acid frequencies, and the remaining values reflect the amphiphilic amino acid correlations along the protein chain.</p>
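<p>The encoding for a single property can be sketched end to end as follows. This is a hedged illustration rather than the authors' implementation: it assumes one averaged correlation value per lag, uses the sum of these values as Enc[p_i] in the denominators of Equations 4 and 5, takes the amino acid frequency as count divided by sequence length, and fixes w = 0.5 arbitrarily within the stated range. The function names, the example sequence, and the placeholder scale are all hypothetical.</p>
<preformat><![CDATA[
```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(values):
    """Z-score one property over the 20 amino acids (Equation 1)."""
    mean = sum(values.values()) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values.values()) / len(values))
    return {aa: (v - mean) / std for aa, v in values.items()}


def lag_correlation(sequence, prop, lag):
    """Average product of property values over all bigrams of one lag."""
    products = [prop[sequence[i]] * prop[sequence[i + lag]]
                for i in range(len(sequence) - lag)]
    return sum(products) / len(products)


def apaac_single_property(sequence, prop, max_lag, w=0.5):
    """Return a (20 + max_lag)-dimensional vector for one property."""
    thetas = [lag_correlation(sequence, prop, lag)
              for lag in range(1, max_lag + 1)]
    denominator = 1.0 + w * sum(thetas)
    frequencies = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    composition = [f / denominator for f in frequencies]   # Equation 4
    order = [w * theta / denominator for theta in thetas]  # Equation 5
    return composition + order                             # Equation 6


# Placeholder scale (NOT a real physicochemical scale), for illustration.
toy_scale = standardize({aa: float(i) for i, aa in enumerate(AMINO_ACIDS)})
vector = apaac_single_property("MKVLAAGIVGLLLAQWERTY", toy_scale, max_lag=3)
```
]]></preformat>
<p>Under these assumptions the 20 composition values and the lag-based values share one denominator, so the entries of the vector sum to 1. Repeating the computation for each of the 3 properties and concatenating the results gives the (20 &#x0002B; lag) &#x000D7; 3 dimensional representation.</p>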
</sec>
<sec>
<title>2.1.1.2. Quasi-sequence (QS) order</title>
<p>Following ideas similar to those of APAAC, QS order also encodes the sequence order and discriminative information based on different physicochemical properties (<xref ref-type="bibr" rid="B39">39</xref>). To incorporate richer sequence order information, QS order makes use of pre-computed values of 4 different physicochemical properties, namely hydrophobicity, hydrophilicity, polarity, and side chain volume, to compute the coupling factors among the amino acids of a protein sequence (<xref ref-type="bibr" rid="B39">39</xref>). These physicochemical properties describe protein folding and its structural features, particularly surface physical chemistry. These pre-computed values have been averaged and, on the basis of the Manhattan distance, converted into pairwise distance values (20 &#x000D7; 20 = 400) provided by Schneider et al. (<xref ref-type="bibr" rid="B48">48</xref>) and Grantham et al. (<xref ref-type="bibr" rid="B49">49</xref>) (for details refer to <xref ref-type="supplementary-material" rid="SM1">Supplementary Tables S2</xref>, <xref ref-type="supplementary-material" rid="SM1">S3</xref>).</p>
<p>In QS order, the bigrams of amino acids are first generated on the basis of a lag value, which is loosely analogous to the stride size in a CNN. To compute a coupling factor <italic>P</italic>[<italic>B</italic>], distance values between two amino acids are taken from <xref ref-type="supplementary-material" rid="SM1">Supplementary Tables S2</xref>, <xref ref-type="supplementary-material" rid="SM1">S3</xref>, with respect to the bigrams generated <italic>via</italic> the lag value. The coupling factor <italic>P</italic>[<italic>B</italic>] can be written as:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>a</mml:mi><mml:mi>m</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>D</italic><sub><italic>i</italic></sub> is the distance value taken from the Schneider or Grantham distance matrix and <italic>B</italic> denotes a bigram of amino acids. The corresponding encoding value for a lag can be computed by averaging the physicochemical distance values over all bigrams of that lag,</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M8"><mml:mrow><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mtext>&#x02009;</mml:mtext><mml:msub><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mi>g</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mtext>&#x02009;</mml:mtext><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>P</mml:mi><mml:msub><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mtext>&#x02009;</mml:mtext><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula>
<p>To obtain a single floating-point value for the encoding, the per-lag values are aggregated over all lags up to the chosen lag. For example, for lag=3, the bigrams are first generated with lag=1, 2, and 3; the corresponding encodings for these bigrams are then computed and combined using the following equation.</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:msub><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>These computed encoding values are normalized along with a weight factor <italic>w</italic>,</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:msub><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>w</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:msub><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>To incorporate the distribution of amino acids, normalized frequencies of 20 different amino acids are computed, according to the following equation,</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>f</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>y</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>A</mml:mi><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi><mml:mi>u</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Finally, a (20 &#x0002B; lag) &#x000D7; 2 dimensional statistical vector is formed by concatenating the 20 amino acid distribution values with the lag coupling factors that carry the sequence order information, computed separately for the distance values provided by Schneider and Grantham.</p>
<disp-formula id="E12"><label>(12)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>q</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>A</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x02225;</mml:mo><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:msub><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>Encoding</italic>[<italic>AA</italic>] represents the normalized frequency values of 20 different amino acids, and <italic>Encoding</italic>[<italic>D</italic><sub><italic>i</italic></sub>]<italic>lag</italic><sub><italic>i</italic></sub> refers to the sequence order information.</p>
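<p>A hedged sketch of the QS order encoding for one distance matrix is given below. It assumes that each per-lag coupling value is the mean squared distance over the bigrams of that lag, that the per-lag values are summed to form the term in the denominators, and that w = 0.1. The toy_distance function is a placeholder and does not reproduce the Schneider or Grantham matrices; all function names are hypothetical.</p>
<preformat><![CDATA[
```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def toy_distance(a, b):
    """Placeholder distance in [0, 1]; NOT the Schneider or Grantham matrix."""
    return abs(INDEX[a] - INDEX[b]) / (len(AMINO_ACIDS) - 1)


def coupling(sequence, lag, distance):
    """Mean squared distance over all bigrams of one lag (Equations 7-8)."""
    squared = [distance(sequence[k], sequence[k + lag]) ** 2
               for k in range(len(sequence) - lag)]
    return sum(squared) / len(squared)


def qs_order_single_matrix(sequence, max_lag, distance, w=0.1):
    """Return a (20 + max_lag)-dimensional vector for one distance matrix."""
    taus = [coupling(sequence, lag, distance) for lag in range(1, max_lag + 1)]
    denominator = 1.0 + w * sum(taus)                      # uses Equation 9
    frequencies = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    composition = [f / denominator for f in frequencies]   # Equation 11
    order = [w * tau / denominator for tau in taus]        # Equation 10
    return composition + order                             # Equation 12


qs_vector = qs_order_single_matrix("MKVLAAGIVGLLLAQWERTY", 3, toy_distance)
```
]]></preformat>
<p>Running the same function once with each of the two distance matrices and concatenating the outputs would give the (20 &#x0002B; lag) &#x000D7; 2 dimensional vector described above.</p>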
</sec>
</sec>
<sec>
<title>2.1.2. Dimensionality reduction <italic>via</italic> feature agglomeration clustering</title>
<p>Hierarchical clustering (HC) is a well-known family of clustering algorithms that construct clusters on the basis of similarities among the data samples. The goal of HC is to form clusters that are well separated from each other while the data samples within a single cluster remain similar to each other. Feature agglomeration inherits the same idea, except that the grouping is applied to the features of the data rather than to the data samples. In feature agglomeration, two steps, namely distance computation and pooling, are followed iteratively to reach the required dimension of the feature space. First, the distance among all the features is computed using the Euclidean or Manhattan distance (<xref ref-type="bibr" rid="B50">50</xref>). The two features with the minimum distance are then combined using a pooling function, which can be the mean of the respective features. This process is repeated until the features are reduced to the desired dimension.</p>
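<p>The two iterative steps can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: Euclidean distance between feature columns, mean pooling of the merged pair, and a brute-force search for the closest pair. The function names and the toy matrix are hypothetical.</p>
<preformat><![CDATA[
```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two feature columns."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate_features(rows, n_features):
    """Merge the two closest feature columns by their mean until
    only `n_features` columns remain."""
    columns = [list(col) for col in zip(*rows)]  # one list per feature
    while len(columns) > n_features:
        # Step 1: distance computation over all pairs of features.
        pairs = [(euclidean(columns[i], columns[j]), i, j)
                 for i in range(len(columns))
                 for j in range(i + 1, len(columns))]
        _, i, j = min(pairs)
        # Step 2: pooling of the closest pair (mean of the two columns).
        merged = [(a + b) / 2 for a, b in zip(columns[i], columns[j])]
        columns = [c for k, c in enumerate(columns) if k not in (i, j)]
        columns.append(merged)
    return [list(row) for row in zip(*columns)]  # back to samples x features


# Toy data: 3 samples with 4 features, reduced to 2 features.
data = [[1.0, 1.1, 5.0, 9.0],
        [2.0, 2.1, 6.0, 8.0],
        [3.0, 3.1, 7.0, 7.0]]
reduced = agglomerate_features(data, 2)
```
]]></preformat>
<p>In practice a library routine such as the FeatureAgglomeration class in scikit-learn would normally be preferred; the sketch above is only meant to make the merge loop explicit.</p>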
</sec>
<sec>
<title>2.1.3. Iterative representation learning</title>
<p>Iterative representation learning is a crucial step for improving the performance of ML models and is inspired by the layer-wise training of deep learning models. In the current study, the proposed meta-predictor works in a two-stage process based on iterative representation learning. In the first stage, the statistical vectors generated for virus-host protein sequences by APAAC and QS order are separately passed through two machine learning models, i.e., RF and ET. The APAAC and QS order representations are then concatenated and passed again through the RF and ET classifiers. As a result, a total of 6 different positive class probabilities are obtained for the protein sequences. In the second stage, these probabilistic values are concatenated to form a new 6-D feature vector for the protein sequences. This probabilistic feature representation is used as the input of a support vector machine classifier that provides the final VHPPI predictions.</p>
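<p>The data flow of the first stage can be sketched as follows. The sketch only shows how 3 feature views and 2 classifiers yield a 6-D probabilistic vector; the trained RF and ET models are replaced by hypothetical stand-in callables that return fixed probabilities, and the function and key names are assumptions.</p>
<preformat><![CDATA[
```python
VIEWS = ("apaac", "qs", "combined")
CLASSIFIERS = ("rf", "et")


def stage_one_features(apaac_vec, qs_vec, models):
    """Build the 6-D probabilistic vector for one protein pair.

    `models` maps (view, classifier) to a callable that returns the
    positive-class probability for a feature vector.
    """
    views = {
        "apaac": apaac_vec,
        "qs": qs_vec,
        "combined": apaac_vec + qs_vec,  # concatenated representation
    }
    return [models[(view, clf)](views[view])
            for view in VIEWS
            for clf in CLASSIFIERS]


# Hypothetical stand-ins for trained RF/ET models: each returns a fixed
# probability so that the data flow can be run end to end.
fixed_probabilities = (0.9, 0.8, 0.7, 0.6, 0.85, 0.75)
keys = [(view, clf) for view in VIEWS for clf in CLASSIFIERS]
stub_models = {key: (lambda vec, p=p: p)
               for key, p in zip(keys, fixed_probabilities)}

meta_features = stage_one_features([0.1, 0.2], [0.3], stub_models)
# `meta_features` would be the input of the second-stage SVM classifier.
```
]]></preformat>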
</sec>
</sec>
<sec>
<title>2.2. Benchmark datasets</title>
<p>In order to develop and evaluate AI-based predictors for virus-host protein-protein interaction prediction, several datasets have been developed in the existing studies (<xref ref-type="bibr" rid="B11">11</xref>, <xref ref-type="bibr" rid="B21">21</xref>, <xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B30">30</xref>, <xref ref-type="bibr" rid="B32">32</xref>, <xref ref-type="bibr" rid="B36">36</xref>). We have collected 7 publicly available benchmark datasets from 4 different studies. These datasets have been extensively utilized in the development/evaluation of the most recent VHPPI predictors (<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B29">29</xref>, <xref ref-type="bibr" rid="B30">30</xref>, <xref ref-type="bibr" rid="B36">36</xref>).</p>
<p>One dataset is taken from the study of Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>), which contains VHPPIs across human and 4 viruses i.e., HIV-1, simian virus 40 (SV40), HBV, HCV, papilloma virus. These VHPPIs were downloaded from the VirusMint database (<xref ref-type="bibr" rid="B51">51</xref>), whereas negative samples were collected from UniProt (<xref ref-type="bibr" rid="B52">52</xref>) based on their dissimilarity with the true VHPPIs.</p>
<p>Similarly, another dataset is taken from the work of Fatma et al. (<xref ref-type="bibr" rid="B29">29</xref>), which contains VHPPIs of humans and 173 viruses from the families Paramyxoviridae, Filoviridae, Bunyaviridae, Flaviviridae, Adenoviridae, Orthomyxoviridae, Chordopoxviridae, Papillomaviridae, Herpesviridae, and Retroviridae. These VHPPIs were collected from VirusMentha (<xref ref-type="bibr" rid="B53">53</xref>) and UniProt (<xref ref-type="bibr" rid="B52">52</xref>). Negative class samples were generated by a random dissimilarity algorithm, which assumed the condition that two viral proteins composed of similar amino acid sequences could not interact with the same host protein. The similarity between two proteins was decided through a distance (dissimilarity) score based on normalized global alignment bit scores. Furthermore, once unique viral proteins were obtained, their interactions were decided based on a dissimilarity (distance) score &#x0003E;0.8 with host proteins.</p>
<p>The dataset of coronavirus and human proteins is taken from the work of Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>), where the interactions were collected from the HPID (<xref ref-type="bibr" rid="B54">54</xref>), VirusHostNet (<xref ref-type="bibr" rid="B55">55</xref>), PHISTO (<xref ref-type="bibr" rid="B56">56</xref>), and PDB (<xref ref-type="bibr" rid="B57">57</xref>) databases. Moreover, negative samples were generated by dissimilarity-based negative sampling across the PPIs retrieved from UniProt (<xref ref-type="bibr" rid="B36">36</xref>, <xref ref-type="bibr" rid="B52">52</xref>).</p>
<p>To make the predictor generic and capable of predicting interactions for new viruses, we collected 4 datasets from the study of Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>). These datasets contain interactions related to 29 different hosts and 332 different viruses. To collect raw sequences and interactions, the authors utilized 5 different databases, namely PSICQUIC (<xref ref-type="bibr" rid="B58">58</xref>), APID (<xref ref-type="bibr" rid="B59">59</xref>), IntAct (<xref ref-type="bibr" rid="B60">60</xref>), Mentha (<xref ref-type="bibr" rid="B53">53</xref>), and UniProt (<xref ref-type="bibr" rid="B52">52</xref>). Furthermore, for negative data, the authors obtained protein sequences of 4 major hosts, namely human, non-human animal, plant, and bacteria, from UniProt (<xref ref-type="bibr" rid="B52">52</xref>), and removed sequences with a sequence similarity higher than 80% to any positive data using CD-HIT-2D (<xref ref-type="bibr" rid="B61">61</xref>). Moreover, in order to assess the applicability to new/unseen viruses, the authors distributed the VHPPIs of 29 hosts and 332 viruses into 4 different train and 2 test sets. The distribution of viruses and hosts in these datasets is given below:</p>
<list list-type="simple">
<list-item><p><bold>TR1:</bold> PPIs between human and any virus except H1N1 virus.</p></list-item>
<list-item><p><bold>TR2:</bold> PPIs between human and any virus except Ebola virus.</p></list-item>
<list-item><p><bold>TR3:</bold> PPIs between any host and any virus except H1N1 virus.</p></list-item>
<list-item><p><bold>TR4:</bold> PPIs between any host and any virus except Ebola virus.</p></list-item>
<list-item><p><bold>TS1:</bold> PPIs between human and H1N1 virus.</p></list-item>
<list-item><p><bold>TS2:</bold> PPIs between human and Ebola virus.</p></list-item>
</list>
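<p>The held-out-virus scheme listed above can be expressed as a simple filter. The following is an illustrative sketch only: the record layout (dictionaries with host and virus keys), the function name split_by_virus, and the toy records are assumptions and do not reflect the actual file format of the benchmark datasets.</p>
<preformat><![CDATA[
```python
def split_by_virus(interactions, held_out_virus, human_only):
    """Return (train, test) lists following the TR/TS scheme.

    Train: every interaction whose virus is not the held-out virus,
    optionally restricted to the human host (TR1/TR2 vs. TR3/TR4).
    Test: interactions between human and the held-out virus (TS1/TS2).
    """
    train = [rec for rec in interactions
             if rec["virus"] != held_out_virus
             and (not human_only or rec["host"] == "human")]
    test = [rec for rec in interactions
            if rec["virus"] == held_out_virus and rec["host"] == "human"]
    return train, test


# Toy records (hypothetical), one dictionary per interaction.
records = [
    {"host": "human", "virus": "H1N1"},
    {"host": "human", "virus": "Ebola"},
    {"host": "human", "virus": "HIV-1"},
    {"host": "mouse", "virus": "H1N1"},
    {"host": "mouse", "virus": "HIV-1"},
]

tr1, ts1 = split_by_virus(records, "H1N1", human_only=True)
tr4, ts2 = split_by_virus(records, "Ebola", human_only=False)
```
]]></preformat>
<p>Because the held-out virus never appears in the training set, performance on the test set indicates how well a model transfers to a virus it has not seen.</p>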
<p>Furthermore, <xref ref-type="fig" rid="F2">Figure 2</xref> summarizes the statistics of the datasets in terms of the number of positive and negative samples. The selected datasets are appropriate for experimentation for several reasons. Recent VHPPI predictors report their performance scores on these datasets, which makes it possible to compare our proposed VHPPI predictor with existing predictors directly. The datasets contain sufficient VHPPIs to train machine learning models effectively. Furthermore, they contain diverse VHPPIs across a broad selection of viruses and hosts, which allows testing the generalizability of the model across multiple hosts and viruses for the task of VHPPI prediction.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Summary statistics of 7 datasets utilized in this study. For each study, the respective numbers of positive and negative samples are shown.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-09-1025887-g0002.tif"/>
</fig>
</sec>
<sec>
<title>2.3. Virus-host protein-protein interaction prediction</title>
<p>The following section summarizes the machine learning (ML) classifiers used to predict virus-host protein-protein interactions.</p>
<sec>
<title>2.3.1. Support vector machine</title>
<p>The SVM classifier finds hyperplane(s) in an N-dimensional feature space that can discriminate between interactive and non-interactive pairs of host and viral proteins (<xref ref-type="bibr" rid="B62">62</xref>). Specifically, it seeks the hyperplane that maximizes the margin, i.e., the distance between the hyperplane and the nearest data points of the corresponding classes. Furthermore, to handle a non-linear feature space, the SVM classifier supports the kernel trick, where the non-linear feature space is transformed into a feature space in which the classes are linearly separable. The SVM classifier has shown promising predictive performance in various proteomics sequence analysis tasks, including coronavirus survival prediction (<xref ref-type="bibr" rid="B63">63</xref>), hepatitis B-related hepatocellular carcinoma recurrence prediction (<xref ref-type="bibr" rid="B64">64</xref>), and sulfenylation sites prediction (<xref ref-type="bibr" rid="B65">65</xref>), with an average performance of more than 80% across these tasks. Considering this, we have used an SVM classifier to distinguish interactive virus-host protein sequences from non-interactive ones.</p>
</sec>
<sec>
<title>2.3.2. Random forest classifier</title>
<p>A random forest classifier is based on decision trees (DT), which serve as the base learners (<xref ref-type="bibr" rid="B66">66</xref>). To begin with, a root node is selected according to the feature with the lowest Gini impurity or the maximum information gain (<xref ref-type="bibr" rid="B67">67</xref>). The samples are then partitioned according to the values of the selected feature. The process is repeated until all nodes are homogeneous, i.e., every node contains samples of only one class.</p>
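<p>The two splitting criteria named above can be illustrated with a minimal sketch. The function names and the toy labels are assumptions made for this illustration and are not taken from the implementation of the proposed predictor.</p>
<preformat preformat-type="code">
```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of a node, the basis of information gain."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity=gini_impurity):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Toy node with two interacting (1) and two non-interacting (0) pairs
mixed = [1, 1, 0, 0]
print(gini_impurity(mixed))               # 0.5
print(entropy(mixed))                     # 1.0
print(weighted_impurity([1, 1], [0, 0]))  # 0.0, a perfectly separating split
```
</preformat>
<p>A candidate split is preferred when the weighted impurity of its children is lower than that of competing splits. The information gain is the parent entropy minus the weighted child entropy.</p>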
<p>A random forest (RF) classifier is a collection of hundreds of trees such that each tree is grown using a bootstrap sample of the original data (<xref ref-type="bibr" rid="B68">68</xref>). In a random forest, each tree is grown in a nondeterministic way by inducing randomness at two different stages. First, randomness occurs at the tree level, as different trees receive different bootstrap samples. Second, at the node level, randomization is introduced by selecting a random subset of features for finding the best split rather than growing a tree on the complete feature set. This randomness helps in decorrelating the individual trees such that the whole forest has a low variance. Finally, an averaged or a voted decision is formulated based on the individual predictions of the DTs. The RF classifier has been successful in distinct proteomics sequence analysis tasks such as host disease classification (<xref ref-type="bibr" rid="B69">69</xref>), urine proteome profiling (<xref ref-type="bibr" rid="B70">70</xref>), and hydroxyproline and hydroxylysine site prediction (<xref ref-type="bibr" rid="B71">71</xref>), where it achieves an average performance of more than 80%. We therefore utilize the RF classifier to compute a discriminative and informative feature space using the separate statistical representations of the APAAC and Qsorder sequence encoders as well as their combined statistical representations.</p>
</sec>
<sec>
<title>2.3.3. Extra trees classifier</title>
<p>The extremely randomized trees or extra trees (ET) classifier makes predictions in a manner similar to the RF classifier (<xref ref-type="bibr" rid="B72">72</xref>). Multiple trees are trained and each tree is exposed to all training data. ET drops the idea of bootstrapping and takes a random subset of features without replacement. Unlike the RF classifier, the ET classifier creates the splits at each node <italic>via</italic> random splitting rather than by searching for the best split. Therefore, the ET classifier provides less correlated trees, which can yield better accuracy scores and low variance across different classes. ET classifiers are increasingly used in proteomics sequence analysis tasks, including glutarylation sites prediction (<xref ref-type="bibr" rid="B73">73</xref>), non-coding RNA-protein interaction prediction (<xref ref-type="bibr" rid="B74">74</xref>), and protein stability changes estimation (<xref ref-type="bibr" rid="B75">75</xref>), where the ET classifier achieves an average performance of more than 80%. We therefore use the ET classifier to assist the RF classifier in the generation of an effective feature space using statistical representations of the two physico-chemical properties based sequence encoders.</p>
</sec>
</sec>
</sec>
<sec id="s3">
<title>3. Evaluation criteria</title>
<p>To evaluate the integrity, effectiveness, and prediction performance of the proposed virus-host PPI meta predictor in a reliable manner, we follow the evaluation criteria of existing studies (<xref ref-type="bibr" rid="B21">21</xref>, <xref ref-type="bibr" rid="B30">30</xref>, <xref ref-type="bibr" rid="B36">36</xref>) and utilize 8 different evaluation measures: accuracy (ACC), specificity (SP), sensitivity (SN), precision (PR), F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Matthews correlation coefficient (MCC).</p>
<p>Accuracy (ACC) (<xref ref-type="bibr" rid="B30">30</xref>) measures the proportion of correct predictions with respect to total predictions. Specificity (SP) or True Negative Rate (TNR) is the ratio between the true negative predictions and all actual negative class samples. Similarly, Recall/Sensitivity (<xref ref-type="bibr" rid="B21">21</xref>) is the ratio of correctly predicted positive class samples to all actual positive class samples, i.e., the sum of true positives and false negatives. Precision (PR) is the ratio between the correct predictions of positive class samples and all samples that the predictor labeled as the positive class. The area under the receiver operating characteristic curve (AUROC) (<xref ref-type="bibr" rid="B32">32</xref>) calculates the performance score by using the true positive rate (TPR) and false positive rate (FPR) at different thresholds, whereas the area under the precision-recall curve (AUPRC) (<xref ref-type="bibr" rid="B32">32</xref>) calculates the performance score using precision (P) and recall (R) at different thresholds. The F1 score combines precision (P) and recall (R) into a single measure by taking their harmonic mean. MCC (<xref ref-type="bibr" rid="B21">21</xref>) measures the correlation of the true classes with the predicted classes by considering all predictions related to positive and negative class samples. The mathematical equations of the aforementioned measures are given as,</p>
<disp-formula id="E13"><label>(13)</label><mml:math id="M13"><mml:mrow><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>ACC</mml:mtext><mml:mo>=</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>/</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>Specificity(SP)</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo>/</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>Sensitivity(SN)orRecall(R)</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>/</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd 
columnalign='left'><mml:mrow><mml:mtext>Precision(P)</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>/</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>TruePositiveRate(TPR)</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>/</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>FalsePositiveRate(FPR)</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>/</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>F1score</mml:mtext><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy='false'>(</mml:mo><mml:mi>P</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>R</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>/</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>R</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd 
columnalign='left'><mml:mrow><mml:mtext>MCC</mml:mtext><mml:mo>=</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>/</mml:mo><mml:mi>E</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>E</mml:mtext><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msqrt></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>In the above equations, <italic>T</italic><sub><italic>P</italic></sub> and <italic>T</italic><sub><italic>N</italic></sub> denote the correct predictions on the positive and negative classes, respectively. <italic>F</italic><sub><italic>P</italic></sub> denotes negative class samples that are incorrectly predicted as positive, and <italic>F</italic><sub><italic>N</italic></sub> denotes positive class samples that are incorrectly predicted as negative.</p>
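<p>As a worked illustration, the threshold-based measures can be computed from the four confusion counts as sketched below. The function name and the counts are hypothetical and chosen only for this example, and the sketch assumes that no denominator is zero. AUROC and AUPRC are omitted because they require predicted scores at multiple thresholds rather than a single set of counts.</p>
<preformat preformat-type="code">
```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute the count-based evaluation measures from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity (SN), identical to the TPR
    # E is the square-root normaliser in the denominator of the MCC
    e = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SP": tn / (tn + fp),
        "SN": recall,
        "PR": precision,
        "FPR": fp / (tn + fp),
        "F1": 2 * precision * recall / (precision + recall),
        "MCC": (tp * tn - fp * fn) / e,
    }


# Hypothetical counts: 100 positive and 100 negative test pairs
scores = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```
</preformat>
<p>With these counts, the accuracy is 0.85, the specificity 0.80, and the sensitivity 0.90. Note that the MCC numerator is evaluated in full before the division by <italic>E</italic>.</p>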
<p>The performance of the proposed predictor in terms of the aforementioned evaluation measures can be computed under two different settings: k-fold cross-validation and independent test set based evaluation. In k-fold cross-validation, the corpus sequences are split into k folds, where iteratively k-1 folds are used for training and the remaining fold is used for testing the predictive pipeline. In this setting, train-test split bias does not exist, as each sequence participates in both training and evaluation. In the independent test set based evaluation, standard train and test splits of sequences are already available; hence, the training sequences are used to train the predictive pipeline and the test sequences are used to test it. There is a possibility that the authors of benchmark datasets may have placed hard sequences in the training set and simple sequences in the test set, or vice versa. However, we assume that the authors of benchmark datasets have carefully partitioned the sequences into train and test sets to avoid any bias and to allow a fair evaluation. In our study, following previous work (<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B29">29</xref>&#x02013;<xref ref-type="bibr" rid="B31">31</xref>), which has performed independent test set based evaluation on all datasets except the Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) dataset, we also perform independent test set based evaluation to make a fair performance comparison. In contrast, we evaluate our proposed predictor on the Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) dataset using 5-fold cross-validation, as done by the previous benchmark study.</p>
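<p>The k-fold procedure described above can be sketched as follows. This is a minimal illustration with a hypothetical helper, not the splitting code used in this study. It assigns samples to folds in an interleaved order and neither shuffles nor stratifies them, both of which are common in practice.</p>
<preformat preformat-type="code">
```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Interleaved assignment: fold sizes differ by at most one sample
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # The remaining k-1 folds together form the training set
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Toy corpus of 10 sequences evaluated with 5-fold cross-validation
splits = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, test_idx)
```
</preformat>
<p>Every sample index appears in exactly one test fold and in the training set of the other k-1 iterations, which is why each sequence participates in both training and evaluation.</p>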
</sec>
<sec id="s4">
<title>4. Experimental setup</title>
<p>The proposed meta-predictor is implemented in Python, and the protein sequence encoders, i.e., APAAC and Qsorder, are taken from iLearnPlus (<xref ref-type="bibr" rid="B76">76</xref>). The classifiers are implemented using Scikit-Learn (<xref ref-type="bibr" rid="B77">77</xref>). To determine the optimal hyperparameters of the encoding methods and classifiers, we utilize a grid search strategy (<xref ref-type="bibr" rid="B78">78</xref>). Grid search evaluates every parameter combination in a predefined search space and selects the best-performing one.</p>
<p>In order to determine the optimal parameters &#x003BB; for the APAAC encoder and lag for the Qsorder encoder, <italic>en</italic><sub>&#x003BB;, <italic>lag</italic></sub> &#x0003D; [1, &#x022EF;&#x02009;, 5] is used as the grid search space with a stride size of 1. For the Qsorder encoder, lag=1 is chosen for the Denovo, TR3-TS1, and Coronavirus datasets, lag=2 for the Barman and TR1-TS1 datasets, lag=3 for TR2-TS2, and lag=4 for TR4-TS2. In addition, for the APAAC encoder, &#x003BB; &#x0003D; (4, 4, 4, 4, 3, 3, 1) are chosen for the Barman, Denovo, TR2-TS2, TR4-TS2, Coronavirus, TR1-TS1, and TR3-TS1 datasets, respectively.</p>
<p>In terms of ML classifiers, the performance of the SVM classifier is greatly influenced by the base kernel, along with the regularization parameter <italic>C</italic>, which controls the trade-off between a wide margin and misclassified training samples. The parameter <italic>d</italic> represents the degree of the polynomial kernel, which affects the flexibility of the decision boundary of the SVM. Similarly, the performance of the tree-based classifiers (RF, ET) relies on the number of base estimators, the maximum number of features considered for the best split, and the splitting criterion, i.e., Gini impurity or entropy. To ensure the reproducibility of results, the grid search space and the selected hyperparameters are shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Grid search parameters along with optimal values for 3 different machine learning classifiers used for virus-host protein-protein interaction prediction.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left" colspan="3" style="border-bottom: thin solid #000000;"><bold>Selected parameters</bold></th>
<th valign="top" align="left" colspan="3" style="border-bottom: thin solid #000000;"><bold>Grid search space</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>Dataset</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>RF</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>ET</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>SVM</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>RF</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>ET</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>SVM</bold></td>
</tr> <tr>
<td valign="top" align="left">Barman</td>
<td valign="top" align="left">(100, gini, auto)</td>
<td valign="top" align="left">(100, entropy)</td>
<td valign="top" align="left">(poly, 1, 100, 0.001, true)</td>
<td valign="top" align="left"><italic>n</italic><sub><italic>be</italic></sub>=[30, 60, 100, 150, 200,<break/> 250, 300, 350, 400, 450,<break/> 500], c=[gini, entropy] <break/> max<sub><italic>fea</italic></sub>=[auto, <italic>log</italic><sub>2</sub>, sqrt]</td>
<td valign="top" align="left"><italic>n</italic><sub><italic>be</italic></sub>=[30, 60, 100, 150, 200,<break/> 250, 300, 350, 400, 450, 500],<break/> c=[gini, entropy]</td>
<td valign="top" align="left">K=[rbf, poly, linear, sigmoid],<break/> d=[1, 2, 3, 4, 5],<break/> C=[1, &#x022EF;&#x02009;, 100], &#x003B3;=[auto, scale], <break/> prob=[True, False]</td>
</tr>
<tr>
<td valign="top" align="left">Denovo</td>
<td valign="top" align="left">(300, gini, null)</td>
<td valign="top" align="left">(300, gini)</td>
<td valign="top" align="left">(poly, 1, 100, 0.01, true)</td>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">Coronavirus</td>
<td valign="top" align="left">(100, gini, auto)</td>
<td valign="top" align="left">(100, entropy)</td>
<td valign="top" align="left">(linear, 3, 5, 0.0001, true)</td>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">TR1-TS1</td>
<td valign="top" align="left">(100, gini, auto)</td>
<td valign="top" align="left">(30, entropy)</td>
<td valign="top" align="left">(poly, 2, 100, 0.0001, true)</td>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">TR2-TS2</td>
<td valign="top" align="left">(100, gini, auto)</td>
<td valign="top" align="left">(30, entropy)</td>
<td valign="top" align="left">(poly, 3, 10, 0.1, true)</td>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">TR3-TS1</td>
<td valign="top" align="left">(100, gini, auto)</td>
<td valign="top" align="left">(30, entropy)</td>
<td valign="top" align="left">(poly, 2, 100, 0.001, true)</td>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">TR4-TS2</td>
<td valign="top" align="left">(100, gini, auto)</td>
<td valign="top" align="left">(30, entropy)</td>
<td valign="top" align="left">(poly, 2, 100, 0.001, true)</td>
<td/>
<td/>
<td/>
</tr>
</tbody>
</table>
</table-wrap>
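<p>To give a sense of the size of the tree-based search spaces in Table 1, the sketch below enumerates every parameter combination. The dictionary keys are Scikit-Learn parameter names that we assume correspond to the abbreviations in the table, and the helper function is hypothetical. The value auto is reproduced from Table 1 and may not be accepted by every Scikit-Learn version. The SVM space is omitted because the table does not state the step size for <italic>C</italic>.</p>
<preformat preformat-type="code">
```python
from itertools import product

# Search spaces for the tree-based classifiers as listed in Table 1.
# Key names are assumed Scikit-Learn equivalents of n_be, c and max_fea.
n_estimators = [30, 60, 100, 150, 200, 250, 300, 350, 400, 450, 500]
rf_grid = {
    "n_estimators": n_estimators,
    "criterion": ["gini", "entropy"],
    "max_features": ["auto", "log2", "sqrt"],
}
et_grid = {
    "n_estimators": n_estimators,
    "criterion": ["gini", "entropy"],
}


def enumerate_grid(grid):
    """Return one dict per parameter combination in the grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


rf_candidates = enumerate_grid(rf_grid)
et_candidates = enumerate_grid(et_grid)
print(len(rf_candidates))  # 11 * 2 * 3 = 66 combinations
print(len(et_candidates))  # 11 * 2 = 22 combinations
```
</preformat>
<p>An exhaustive grid search trains and scores one model per combination, so its cost grows multiplicatively with every added parameter.</p>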
</sec>
<sec sec-type="results" id="s5">
<title>5. Results</title>
<p>This section briefly describes the performance of the proposed meta predictor at different levels of ensembling. Furthermore, it compares the performance of the proposed meta predictor with existing predictors (<xref ref-type="bibr" rid="B11">11</xref>, <xref ref-type="bibr" rid="B21">21</xref>&#x02013;<xref ref-type="bibr" rid="B23">23</xref>, <xref ref-type="bibr" rid="B29">29</xref>, <xref ref-type="bibr" rid="B30">30</xref>, <xref ref-type="bibr" rid="B32">32</xref>, <xref ref-type="bibr" rid="B36">36</xref>) over 7 different benchmark datasets (<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B29">29</xref>, <xref ref-type="bibr" rid="B36">36</xref>, <xref ref-type="bibr" rid="B43">43</xref>).</p>
<sec>
<title>5.1. Performance analysis of proposed meta predictor using different representations at property level and encoder level</title>
<p>The impact of different physicochemical properties and dimensionality reduction is explored by analyzing the performance of the RF and SVM classifiers on the TR4-TS2 dataset. <xref ref-type="table" rid="T2">Table 2</xref> shows the performance values of the RF classifier, in terms of 8 different evaluation measures, using statistical representations generated through the APAAC and Qsorder encoders with individual properties and combinations of properties. It also shows the performance values of the classifier using the combined statistical vectors of both encoders. To illustrate the performance impact of dimensionality reduction, it shows the performance of the RF classifier when the feature agglomeration method is applied to the statistical vectors produced by the individual encoders (APAAC, Qsorder) and by a combination of both encoders. Finally, to illustrate the performance gains achieved through iterative representation learning, in which the second stage classifier is trained on the probabilities predicted by the first stage classifiers, it shows the performance of the SVM classifier.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Performance comparison of different statistical representations across 1st stage RF classifier and with iterative feature learning based 2nd stage SVM classifier.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>Encoder</bold></th>
<th valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>Properties</bold></th>
<th valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>DR</bold></th>
<th valign="top" align="left" colspan="8" style="border-bottom: thin solid #000000;"><bold>Random forest classifier</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"/>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"/>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"/>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>ACC</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>PR</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>F1</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>SP</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>SN</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>AUPRC</bold></td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;"><bold>AUROC</bold></td>
<td valign="top" align="center"><bold>MCC</bold></td>
</tr> <tr>
<td valign="top" align="left">QSOrder</td>
<td valign="top" align="left"><italic>p</italic><sub>1</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">84.16</td>
<td valign="top" align="left">85.46</td>
<td valign="top" align="left">84.01</td>
<td valign="top" align="left">84.06</td>
<td valign="top" align="left">91.01</td>
<td valign="top" align="left">98.02</td>
<td valign="top" align="left">97.55</td>
<td valign="top" align="center">69.91</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>2</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">83.89</td>
<td valign="top" align="left">85.75</td>
<td valign="top" align="left">83.68</td>
<td valign="top" align="left">83.89</td>
<td valign="top" align="left">90.23</td>
<td valign="top" align="left">98.16</td>
<td valign="top" align="left">97.74</td>
<td valign="top" align="center">69.62</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>1</sub>&#x0002B; <italic>p</italic><sub>2</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">84.23</td>
<td valign="top" align="left">85.99</td>
<td valign="top" align="left">84.03</td>
<td valign="top" align="left">84.23</td>
<td valign="top" align="left">91.77</td>
<td valign="top" align="left"><bold>98.24</bold></td>
<td valign="top" align="left"><bold>97.89</bold></td>
<td valign="top" align="center">70.20</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>1</sub>&#x0002B; <italic>p</italic><sub>2</sub></td>
<td valign="top" align="left">yes</td>
<td valign="top" align="left"><bold>85.23</bold></td>
<td valign="top" align="left"><bold>86.72</bold></td>
<td valign="top" align="left"><bold>85.08</bold></td>
<td valign="top" align="left"><bold>85.23</bold></td>
<td valign="top" align="left"><bold>92.30</bold></td>
<td valign="top" align="left">97.90</td>
<td valign="top" align="left">97.49</td>
<td valign="top" align="center"><bold>71.94</bold></td>
</tr>
<tr>
<td valign="top" align="left">APAAC</td>
<td valign="top" align="left"><italic>p</italic><sub>1</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">83.22</td>
<td valign="top" align="left">85.80</td>
<td valign="top" align="left">82.91</td>
<td valign="top" align="left">83.22</td>
<td valign="top" align="left">90.22</td>
<td valign="top" align="left">98.09</td>
<td valign="top" align="left">97.49</td>
<td valign="top" align="center">68.97</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>2</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">82.89</td>
<td valign="top" align="left">85.05</td>
<td valign="top" align="left">82.62</td>
<td valign="top" align="left">82.89</td>
<td valign="top" align="left">89.45</td>
<td valign="top" align="left">98.04</td>
<td valign="top" align="left">97.35</td>
<td valign="top" align="center">67.90</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>3</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">84.56</td>
<td valign="top" align="left">86.71</td>
<td valign="top" align="left">84.34</td>
<td valign="top" align="left">84.56</td>
<td valign="top" align="left">91.45</td>
<td valign="top" align="left"><bold>98.30</bold></td>
<td valign="top" align="left">97.88</td>
<td valign="top" align="center">71.24</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>3</sub>&#x0002B; <italic>p</italic><sub>1</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">82.89</td>
<td valign="top" align="left">85.30</td>
<td valign="top" align="left">82.59</td>
<td valign="top" align="left">82.89</td>
<td valign="top" align="left">89.45</td>
<td valign="top" align="left">98.10</td>
<td valign="top" align="left">97.49</td>
<td valign="top" align="center">68.15</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>3</sub>&#x0002B;<italic>p</italic><sub>2</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">85.57</td>
<td valign="top" align="left">87.65</td>
<td valign="top" align="left">85.37</td>
<td valign="top" align="left">85.57</td>
<td valign="top" align="left">92.59</td>
<td valign="top" align="left">98.24</td>
<td valign="top" align="left">97.75</td>
<td valign="top" align="center">73.19</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>3</sub>&#x0002B;<italic>p</italic><sub>2</sub></td>
<td valign="top" align="left">yes</td>
<td valign="top" align="left"><bold>86.24</bold></td>
<td valign="top" align="left"><bold>88.36</bold></td>
<td valign="top" align="left"><bold>86.05</bold></td>
<td valign="top" align="left"><bold>86.24</bold></td>
<td valign="top" align="left"><bold>92.98</bold></td>
<td valign="top" align="left">98.26</td>
<td valign="top" align="left"><bold>97.96</bold></td>
<td valign="top" align="center"><bold>74.57</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><italic>p</italic><sub>1</sub>&#x0002B;<italic>p</italic><sub>2</sub>&#x0002B;<italic>p</italic><sub>3</sub></td>
<td valign="top" align="left">no</td>
<td valign="top" align="left">85.23</td>
<td valign="top" align="left">87.17</td>
<td valign="top" align="left">85.04</td>
<td valign="top" align="left">85.23</td>
<td valign="top" align="left">92.23</td>
<td valign="top" align="left">98.16</td>
<td valign="top" align="left">97.63</td>
<td valign="top" align="center">72.38</td>
</tr>
<tr>
<td valign="top" align="left"><bold>APAAC&#x0002B;QSorder</bold></td>
<td/>
<td valign="top" align="left">no</td>
<td valign="top" align="left">86.24</td>
<td valign="top" align="left">87.88</td>
<td valign="top" align="left">86.09</td>
<td valign="top" align="left">86.24</td>
<td valign="top" align="left">92.90</td>
<td valign="top" align="left">98.10</td>
<td valign="top" align="left">97.49</td>
<td valign="top" align="center">74.10</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">yes</td>
<td valign="top" align="left">86.24</td>
<td valign="top" align="left">87.88</td>
<td valign="top" align="left">86.09</td>
<td valign="top" align="left">86.24</td>
<td valign="top" align="left">92.91</td>
<td valign="top" align="left"><bold>98.24</bold></td>
<td valign="top" align="left"><bold>97.80</bold></td>
<td valign="top" align="center">74.10</td>
</tr> <tr>
<td valign="top" align="left" colspan="3" style="border-bottom: thin solid #000000; border-top: thin solid #000000;"><bold>2</bold><italic>nd</italic> <bold>Stage predictors</bold></td>
<td valign="top" align="left" colspan="8" style="border-bottom: thin solid #000000; border-top: thin solid #000000;"><bold>SVM classifier</bold></td>
</tr> <tr>
<td valign="top" align="left" colspan="3">Qsorder-DR-RF, APAAC-DR-RF, <break/>Qsorder-DR&#x0002B;APAAC-DR-RF, Qsorder-DR-ET,<break/> APAAC-DR-ET, Qsorder-DR&#x0002B;APAAC-DR-ET</td>
<td valign="top" align="left"><bold>93.62</bold></td>
<td valign="top" align="left"><bold>93.64</bold></td>
<td valign="top" align="left"><bold>93.62</bold></td>
<td valign="top" align="left"><bold>93.62</bold></td>
<td valign="top" align="left"><bold>96.71</bold></td>
<td valign="top" align="left"><bold>98.50</bold></td>
<td valign="top" align="left"><bold>98.14</bold></td>
<td valign="top" align="left"><bold>87.27</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Bold values denote the highest performance figures.</p>
</table-wrap-foot>
</table-wrap>
<p>In <xref ref-type="table" rid="T2">Table 2</xref>, for the Qsorder encoder, <italic>p</italic><sub>1</sub> represents the Schneider-Wrede property and <italic>p</italic><sub>2</sub> denotes the Grantham property. Similarly, for the APAAC encoder, <italic>p</italic><sub>1</sub>, <italic>p</italic><sub>2</sub>, and <italic>p</italic><sub>3</sub> denote the hydrophobicity, hydrophilicity, and side chain mass properties, respectively. The RF classifier with statistical vectors generated through Qsorder produces 84.16% accuracy using the <italic>p</italic><sub>1</sub> property and 83.89% accuracy using the <italic>p</italic><sub>2</sub> property. Hence, the RF classifier performs differently when it is fed with statistical vectors generated through the Qsorder encoder using the two different physicochemical properties <italic>p</italic><sub>1</sub> and <italic>p</italic><sub>2</sub>. This performance difference suggests that the two properties extract and encode different types of information while generating statistical vectors. The performance of the classifier improves when it is fed with the combined statistical vectors generated through both properties. Its accuracy improves further when the combined vectors of both properties are reduced through the feature agglomeration method. This improvement suggests that the two properties extract some redundant features, and that removing this redundancy in the newly generated feature space improves performance.</p>
<p>Similarly, for the APAAC encoder, among the 3 statistical vectors generated through 3 different properties, the RF classifier produces the best performance with the <italic>p</italic><sub>3</sub> property and the lowest performance with the <italic>p</italic><sub>2</sub> property. Thus, according to the working paradigm of the proposed property selection method, the vectors of the top-performing property <italic>p</italic><sub>3</sub> are combined iteratively with the <italic>p</italic><sub>1</sub> and <italic>p</italic><sub>2</sub> property vectors. Of these two concatenations, the classifier achieves a slight performance gain only with the <italic>p</italic><sub>3</sub> and <italic>p</italic><sub>2</sub> concatenation. Furthermore, when the <italic>p</italic><sub>3</sub> and <italic>p</italic><sub>2</sub> property vectors are combined with the <italic>p</italic><sub>1</sub> property, the performance of the classifier decreases as compared to its performance with the <italic>p</italic><sub>2</sub> and <italic>p</italic><sub>3</sub> combination, so the property selection method selects <italic>p</italic><sub>2</sub> and <italic>p</italic><sub>3</sub> as the two optimal properties. These results reveal that, to fully utilize the potential of the APAAC encoder, it is essential to utilize the best combination of properties. Furthermore, concatenating the statistical vectors generated through the APAAC and Qsorder encoders with their selected best properties fails to improve the performance of the RF classifier as compared to its performance on the individual statistical representations.</p>
<p>Applying dimensionality reduction to the vectors of the individual encoders improves the performance of the RF classifier compared with its performance on the same encoders without dimensionality reduction. However, on the combined vectors of the APAAC and Qsorder encoders, the classifier performs almost the same with and without dimensionality reduction.</p>
<p>To gain further performance enhancement, in the second stage we utilize the positive class probabilities predicted by the ET and RF classifiers on three feature agglomeration-based optimized representations: the statistical vectors of the individual APAAC and Qsorder encoders and the combined vectors of both encoders. The SVM classifier is trained on the resulting 6D probabilistic feature space (2 classifiers &#x000D7; 3 representations), where it achieves higher performance than the RF and ET classifiers. In comparison to the performance of the RF classifier with sequence representations obtained by applying dimensionality reduction to the APAAC and Qsorder combined vectors (APAAC&#x0002B;Qsorder, DR=yes), it achieves performance improvements of 7.38% in accuracy, 5.76% in precision, 7.53% in F1 score, 7.38% in specificity, 3.1% in sensitivity, 0.26% in AUPRC, 0.34% in AUROC, and 13.17% in MCC. In comparison to the performance of the RF classifier with sequence representations generated through (<italic>p</italic><sub>1</sub>&#x0002B;<italic>p</italic><sub>2</sub>, DR=yes) of Qsorder and (<italic>p</italic><sub>3</sub>&#x0002B;<italic>p</italic><sub>2</sub>, DR=yes) of APAAC, it achieves performance improvements with an average margin of 6.10% across all the evaluation measures. Therefore, it is inferred that the SVM classifier combined with iterative representation learning leads to the highest performance for virus-host protein-protein interaction prediction.</p>
<sec>
<title>5.1.1. Proposed MP-VHPPI predictor performance comparison with existing predictors on Barman&#x00027;s dataset</title>
<p><xref ref-type="table" rid="T3">Table 3</xref> shows the performance values of the proposed meta predictor and 6 existing VHPPI predictors (<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B30">30</xref>&#x02013;<xref ref-type="bibr" rid="B32">32</xref>, <xref ref-type="bibr" rid="B38">38</xref>) on Barman et al.&#x00027;s (<xref ref-type="bibr" rid="B22">22</xref>) dataset in terms of 7 different evaluation measures. Among the 6 existing predictors, the LGCA-VHPPI predictor of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) achieves the best performance in terms of accuracy (82%), specificity (89.37%), F1 score (81.47%), MCC (63.99%), and AU-ROC (88%), whereas the predictor of Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) produces the best precision (82.46%). The RF-based predictor of Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) achieves the highest sensitivity (89.08%) among the existing predictors but does not lead on any other evaluation measure. Comparatively, the proposed meta-predictor outperforms the 6 previously mentioned predictors (<xref ref-type="bibr" rid="B22">22</xref>, <xref ref-type="bibr" rid="B30">30</xref>&#x02013;<xref ref-type="bibr" rid="B32">32</xref>, <xref ref-type="bibr" rid="B38">38</xref>) in terms of 6 distinct evaluation measures. Overall, the proposed meta predictor achieves improvements of 0.9% in accuracy, 1.79% in sensitivity, 1.62% in precision, 1.27% in F1 score, 2.97% in MCC, and 0.17% in AU-ROC.</p>
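<p>The margins quoted above are differences between the proposed predictor and the best existing predictor for each measure. The following Python sketch reproduces that arithmetic from the Table 3 values; the dictionary layout, the shortened predictor labels, and the helper name <monospace>margins_over_best</monospace> are illustrative assumptions rather than part of the original study.</p>
<preformat>
```python
# Table 3 values (percent) on Barman's dataset, copied from the article.
EXISTING = {
    "Yang RF": {"ACC": 79.17, "SN": 81.85, "SP": 76.45, "PR": 77.83,
                "F1": 79.79, "MCC": 58.40, "AU-ROC": 87.1},
    "Alguwzizani SVM": {"ACC": 78.6, "SN": 73.72, "SP": 83.48, "PR": 81.69,
                        "F1": 77.50, "MCC": 57.50, "AU-ROC": 84.70},
    "Barman SVM": {"ACC": 71.00, "SN": 67.00, "SP": 74.00, "PR": 72.00,
                   "F1": 69.41, "MCC": 44.0, "AU-ROC": 73.00},
    "Barman RF": {"ACC": 72.41, "SN": 89.08, "SP": 55.66, "PR": 82.26,
                  "F1": 66.39, "MCC": 48.00, "AU-ROC": 76.00},
    "Zhou SVM": {"ACC": 79.95, "SN": 76.14, "SP": 83.77, "PR": 82.46,
                 "F1": 79.17, "MCC": 60.1, "AU-ROC": 85.8},
    "Asim LGCA-VHPPI": {"ACC": 82.00, "SN": 82.00, "SP": 89.37, "PR": 82.40,
                        "F1": 81.47, "MCC": 63.99, "AU-ROC": 88.00},
}
PROPOSED = {"ACC": 82.90, "SN": 90.87, "SP": 82.90, "PR": 84.08,
            "F1": 82.74, "MCC": 66.96, "AU-ROC": 88.17}


def margins_over_best(existing, proposed):
    """Return {metric: (best existing predictor, margin in percentage points)}."""
    result = {}
    for metric, value in proposed.items():
        # Best existing predictor for this metric.
        best_name = max(existing, key=lambda name: existing[name][metric])
        # Positive margin: the proposed predictor is ahead on this metric.
        margin = round(value - existing[best_name][metric], 2)
        result[metric] = (best_name, margin)
    return result


MARGINS = margins_over_best(EXISTING, PROPOSED)
for metric, (name, margin) in MARGINS.items():
    print(f"{metric}: {margin:+.2f} vs {name}")
```
</preformat>
<p>A negative margin marks a measure on which an existing predictor remains ahead; on this dataset that is the case only for specificity.</p>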
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Performance comparison of the proposed MP-VHPPI with existing viral-host PPI predictors on the benchmark Barman dataset in terms of 7 different evaluation measures.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Approach</bold></th>
<th valign="top" align="center"><bold>ACC</bold></th>
<th valign="top" align="center"><bold>SN</bold></th>
<th valign="top" align="center"><bold>SP</bold></th>
<th valign="top" align="center"><bold>PR</bold></th>
<th valign="top" align="center"><bold>F1</bold></th>
<th valign="top" align="center"><bold>MCC</bold></th>
<th valign="top" align="center"><bold>AU-ROC</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) RF</td>
<td valign="top" align="center">79.17</td>
<td valign="top" align="center">81.85</td>
<td valign="top" align="center">76.45</td>
<td valign="top" align="center">77.83</td>
<td valign="top" align="center">79.79</td>
<td valign="top" align="center">58.40</td>
<td valign="top" align="center">87.1</td>
</tr>
<tr>
<td valign="top" align="left">Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM</td>
<td valign="top" align="center">78.6</td>
<td valign="top" align="center">73.72</td>
<td valign="top" align="center">83.48</td>
<td valign="top" align="center">81.69</td>
<td valign="top" align="center">77.50</td>
<td valign="top" align="center">57.50</td>
<td valign="top" align="center">84.70</td>
</tr>
<tr>
<td valign="top" align="left">Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) SVM</td>
<td valign="top" align="center">71.00</td>
<td valign="top" align="center">67.00</td>
<td valign="top" align="center">74.00</td>
<td valign="top" align="center">72.00</td>
<td valign="top" align="center">69.41</td>
<td valign="top" align="center">44.0</td>
<td valign="top" align="center">73.00</td>
</tr>
<tr>
<td valign="top" align="left">Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) RF</td>
<td valign="top" align="center">72.41</td>
<td valign="top" align="center">89.08</td>
<td valign="top" align="center">55.66</td>
<td valign="top" align="center">82.26</td>
<td valign="top" align="center">66.39</td>
<td valign="top" align="center">48.00</td>
<td valign="top" align="center">76.00</td>
</tr>
<tr>
<td valign="top" align="left">Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) SVM</td>
<td valign="top" align="center">79.95</td>
<td valign="top" align="center">76.14</td>
<td valign="top" align="center">83.77</td>
<td valign="top" align="center">82.46</td>
<td valign="top" align="center">79.17</td>
<td valign="top" align="center">60.1</td>
<td valign="top" align="center">85.8</td>
</tr>
<tr>
<td valign="top" align="left">Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI</td>
<td valign="top" align="center">82.00</td>
<td valign="top" align="center">82.00</td>
<td valign="top" align="center"><bold>89.37</bold></td>
<td valign="top" align="center">82.40</td>
<td valign="top" align="center">81.47</td>
<td valign="top" align="center">63.99</td>
<td valign="top" align="center">88.00</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Proposed MP-VHPPI</bold></td>
<td valign="top" align="center"><bold>82.90</bold></td>
<td valign="top" align="center"><bold>90.87</bold></td>
<td valign="top" align="center">82.90</td>
<td valign="top" align="center"><bold>84.08</bold></td>
<td valign="top" align="center"><bold>82.74</bold></td>
<td valign="top" align="center"><bold>66.96</bold></td>
<td valign="top" align="center"><bold>88.17</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Performance figures of Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) SVM, Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) RF, Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM, and Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) RF are taken from Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) work. Bold values denote the highest performance figures.</p>
</table-wrap-foot>
</table-wrap>
<p>In terms of robustness on Barman&#x00027;s dataset, the proposed and existing predictors fall into two categories based on the difference between their specificity and sensitivity scores: less biased predictors, which have a small difference, and more biased predictors, which have a large difference. The sensitivity and specificity differences are 5.4, 9.76, 7, 33.42, 7.63, 7.37, and 7.97% for Yang et al.&#x00027;s (<xref ref-type="bibr" rid="B31">31</xref>) RF, Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM, Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) SVM, Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) RF, Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) SVM, Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI, and the proposed meta-predictor, respectively. On the basis of these values, Yang et al.&#x00027;s RF (<xref ref-type="bibr" rid="B31">31</xref>), Barman et al. (<xref ref-type="bibr" rid="B22">22</xref>) SVM, Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) SVM, Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI, and the proposed meta-predictor can be considered less biased, as they have small differences (&#x0003C; 8%) between their specificity and sensitivity scores. In contrast, the other two predictors, Barman et al.&#x00027;s (<xref ref-type="bibr" rid="B22">22</xref>) RF and Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM, have large differences between sensitivity and specificity scores and are biased toward either type I or type II error. Type I error arises when a predictor is prone to false positive predictions due to low specificity and high sensitivity scores (<italic>T</italic><sub><italic>I</italic></sub><italic>E</italic> &#x0003D; 1&#x02212;<italic>SP</italic>), whereas type II error arises when a predictor is prone to false negative predictions due to low sensitivity and high specificity scores (<italic>T</italic><sub><italic>II</italic></sub><italic>E</italic> &#x0003D; 1&#x02212;<italic>SN</italic>). Barman et al.&#x00027;s (<xref ref-type="bibr" rid="B22">22</xref>) RF is more prone to type I error due to its high sensitivity and lower specificity scores, whereas Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM is more prone to type II error due to its higher specificity and lower sensitivity scores.</p>
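<p>The categorisation above can be sketched in a few lines of Python. The sensitivity and specificity values are copied from Table 3, and the 8-point cut-off follows the text; the function name <monospace>bias_profile</monospace>, the label strings, and the returned dictionary layout are assumptions made only for this illustration.</p>
<preformat>
```python
# (sensitivity, specificity) in percent, copied from Table 3.
SN_SP = {
    "Yang RF": (81.85, 76.45),
    "Alguwzizani SVM": (73.72, 83.48),
    "Barman SVM": (67.00, 74.00),
    "Barman RF": (89.08, 55.66),
    "Zhou SVM": (76.14, 83.77),
    "Asim LGCA-VHPPI": (82.00, 89.37),
    "Proposed MP-VHPPI": (90.87, 82.90),
}

GAP_THRESHOLD = 8.0  # percentage points; the cut-off used in the text


def bias_profile(sn, sp, threshold=GAP_THRESHOLD):
    """Summarise the SN/SP gap and the implied error rates for one predictor."""
    gap = round(abs(sn - sp), 2)
    if threshold > gap:
        label = "less biased"
    elif sn > sp:
        label = "type I"   # high sensitivity, low specificity: false positives
    else:
        label = "type II"  # high specificity, low sensitivity: false negatives
    return {
        "gap": gap,
        "type_I_error": round(100.0 - sp, 2),   # 1 - SP, in percent
        "type_II_error": round(100.0 - sn, 2),  # 1 - SN, in percent
        "label": label,
    }


PROFILES = {name: bias_profile(sn, sp) for name, (sn, sp) in SN_SP.items()}
for name, profile in PROFILES.items():
    print(name, profile)
```
</preformat>
<p>Changing <monospace>GAP_THRESHOLD</monospace> shifts the boundary between the two categories, so the labels should be read relative to the chosen cut-off rather than as absolute properties of a predictor.</p>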
</sec>
<sec>
<title>5.1.2. Proposed MP-VHPPI predictor performance comparison with existing predictors on the DeNovo dataset</title>
<p><xref ref-type="table" rid="T4">Table 4</xref> shows the performance values of the proposed meta predictor and 7 existing VHPPI predictors, namely Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) RF, Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM, Fatma et al. (<xref ref-type="bibr" rid="B29">29</xref>) SVM, Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) CNN, Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) SVM, Dong et al. (<xref ref-type="bibr" rid="B23">23</xref>) LSTM, and Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI, on the DeNovo dataset in terms of 7 different evaluation measures.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Performance comparison of proposed MP-VHPPI with existing viral-host PPI predictors over benchmark DeNovo dataset (<xref ref-type="bibr" rid="B29">29</xref>) in terms of 7 different evaluation measures.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Approach</bold></th>
<th valign="top" align="center"><bold>ACC</bold></th>
<th valign="top" align="center"><bold>SN</bold></th>
<th valign="top" align="center"><bold>SP</bold></th>
<th valign="top" align="center"><bold>PR</bold></th>
<th valign="top" align="center"><bold>F1</bold></th>
<th valign="top" align="center"><bold>MCC</bold></th>
<th valign="top" align="center"><bold>AU-ROC</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) RF</td>
<td valign="top" align="center">93.23</td>
<td valign="top" align="center">90.33</td>
<td valign="top" align="center">96.17</td>
<td valign="top" align="center">95.99</td>
<td valign="top" align="center">93.07</td>
<td valign="top" align="center">86.60</td>
<td valign="top" align="center">98.10</td>
</tr>
<tr>
<td valign="top" align="left">Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM</td>
<td valign="top" align="center">86.47</td>
<td valign="top" align="center">86.35</td>
<td valign="top" align="center">86.59</td>
<td valign="top" align="center">86.56</td>
<td valign="top" align="center">86.46</td>
<td valign="top" align="center">72.90</td>
<td valign="top" align="center">92.60</td>
</tr>
<tr>
<td valign="top" align="left">Fatma et al. (<xref ref-type="bibr" rid="B29">29</xref>) SVM</td>
<td valign="top" align="center">81.90</td>
<td valign="top" align="center">80.71</td>
<td valign="top" align="center">83.06</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
</tr>
<tr>
<td valign="top" align="left">Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) CNN</td>
<td valign="top" align="center">94.12</td>
<td valign="top" align="center">90.82</td>
<td valign="top" align="center"><bold>97.41</bold></td>
<td valign="top" align="center"><bold>97.23</bold></td>
<td valign="top" align="center">93.92</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
</tr>
<tr>
<td valign="top" align="left">Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) SVM</td>
<td valign="top" align="center">84.47</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">88.94</td>
<td valign="top" align="center">87.86</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">62.92</td>
<td valign="top" align="center">89.7</td>
</tr>
<tr>
<td valign="top" align="left">Dong et al. (<xref ref-type="bibr" rid="B23">23</xref>) LSTM</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">84.12</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">83.92</td>
<td valign="top" align="center">84.02</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">92.21</td>
</tr>
<tr>
<td valign="top" align="left">Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI</td>
<td valign="top" align="center">94.24</td>
<td valign="top" align="center">94.24</td>
<td valign="top" align="center">96.47</td>
<td valign="top" align="center">94.32</td>
<td valign="top" align="center">94.23</td>
<td valign="top" align="center">88.56</td>
<td valign="top" align="center"><bold>98.49</bold></td>
</tr>
<tr>
<td valign="top" align="left"><bold>Proposed MP-VHPPI</bold></td>
<td valign="top" align="center"><bold>94.59</bold></td>
<td valign="top" align="center"><bold>97.23</bold></td>
<td valign="top" align="center">94.59</td>
<td valign="top" align="center">94.73</td>
<td valign="top" align="center"><bold>94.58</bold></td>
<td valign="top" align="center"><bold>89.32</bold></td>
<td valign="top" align="center">98.16</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Performance figures of DeNovo SVM (<xref ref-type="bibr" rid="B29">29</xref>), Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM, and Yang et al. RF on DeNovo Dataset (<xref ref-type="bibr" rid="B29">29</xref>) are taken from Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) work. Bold values denote the highest performance figures.</p>
</table-wrap-foot>
</table-wrap>
<p>Among the 7 existing predictors, the LGCA-VHPPI predictor of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) achieves the best performance in terms of accuracy (94.24%), sensitivity (94.24%), F1 score (94.23%), MCC (88.56%), and AU-ROC (98.49%), whereas the predictor of Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) achieves the highest specificity (97.41%) and precision (97.23%). Among all existing predictors, the predictor of Fatma et al. (<xref ref-type="bibr" rid="B29">29</xref>) shows the lowest performance. In comparison to these predictors, the proposed meta predictor offers performance improvements across 4 different evaluation measures: it achieves a gain of 0.35% in both accuracy and F1 score, 2.99% in sensitivity, and 0.76% in MCC.</p>
<p>As for Barman&#x00027;s dataset, the predictors on the DeNovo dataset can be divided into two categories on the basis of the differences between their specificity and sensitivity scores. The differences are 5.84, 6.59, 2.35, 2.23, and 2.64% for Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) RF, Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) CNN, Fatma et al. (<xref ref-type="bibr" rid="B29">29</xref>) SVM, Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI, and the proposed meta-predictor, respectively; from the values in <xref ref-type="table" rid="T4">Table 4</xref>, the corresponding difference for Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>) SVM is 0.24%. Owing to their small differences (&#x0003C; 3%) between specificity and sensitivity scores, Alguwzizani et al. (<xref ref-type="bibr" rid="B32">32</xref>), Fatma et al. (<xref ref-type="bibr" rid="B29">29</xref>), Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI, and the proposed meta-predictor can be considered less biased toward type I and type II errors than the other two predictors, i.e., Yang et al. (<xref ref-type="bibr" rid="B31">31</xref>) RF and Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) CNN, which are more biased toward type II error due to their high specificity and low sensitivity scores.</p>
</sec>
<sec>
<title>5.1.3. Proposed MP-VHPPI predictor performance comparison with existing predictors on coronavirus dataset</title>
<p>Due to the recent Coronavirus pandemic, it is important to analyze the performance of a predictor on Coronavirus and human proteins. <xref ref-type="table" rid="T5">Table 5</xref> shows the performance values of the proposed meta predictor, Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) CNN, and Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI on the Coronavirus and human proteins dataset (<xref ref-type="bibr" rid="B36">36</xref>) in terms of 8 distinct evaluation measures.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Performance comparison of the proposed predictor with the existing Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) CNN and Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI predictors over the Coronavirus dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Approach</bold></th>
<th valign="top" align="center"><bold>ACC</bold></th>
<th valign="top" align="center"><bold>SN</bold></th>
<th valign="top" align="center"><bold>SP</bold></th>
<th valign="top" align="center"><bold>PR</bold></th>
<th valign="top" align="center"><bold>F1</bold></th>
<th valign="top" align="center"><bold>MCC</bold></th>
<th valign="top" align="center"><bold>AU-PRC</bold></th>
<th valign="top" align="center"><bold>AU-ROC</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) CNN</td>
<td valign="top" align="center">90.64</td>
<td valign="top" align="center">16.37</td>
<td valign="top" align="center"><bold>98.06</bold></td>
<td valign="top" align="center">45.81</td>
<td valign="top" align="center">24.12</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">32.9</td>
<td valign="top" align="center">-</td>
</tr>
<tr>
<td valign="top" align="left">Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI</td>
<td valign="top" align="center">90.11</td>
<td valign="top" align="center">93.6</td>
<td valign="top" align="center">50.04</td>
<td valign="top" align="center">85.67</td>
<td valign="top" align="center">85.07</td>
<td valign="top" align="center"><bold>22.21</bold></td>
<td valign="top" align="center">38.01</td>
<td valign="top" align="center">80.0</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Proposed MP-VHPPI</bold></td>
<td valign="top" align="center"><bold>91.18</bold></td>
<td valign="top" align="center"><bold>95.58</bold></td>
<td valign="top" align="center">51.74</td>
<td valign="top" align="center"><bold>86.01</bold></td>
<td valign="top" align="center"><bold>87.27</bold></td>
<td valign="top" align="center">10.08</td>
<td valign="top" align="center"><bold>47.07</bold></td>
<td valign="top" align="center"><bold>82.95</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Bold values denote the highest performance figures.</p>
</table-wrap-foot>
</table-wrap>
<p>Of the two existing predictors, the CNN-based predictor of Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) achieves the better accuracy (90.64%), whereas the LGCA-VHPPI predictor of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) shows better performance in terms of sensitivity (93.6%), precision (85.67%), F1 score (85.07%), and AU-PRC (38.01%). Due to the highly imbalanced number of samples for the interactive and non-interactive classes in the Coronavirus dataset, the predictor of Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) performs poorly, as evidenced by its extremely low sensitivity, precision, F1, and AU-PRC scores. The proposed meta predictor outperforms the existing predictors by margins of 0.54% in accuracy, 1.98% in sensitivity, 0.34% in precision, 2.2% in F1 score, 9.06% in AU-PRC, and 2.95% in AU-ROC. Only in terms of MCC does the LGCA-VHPPI predictor of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) perform better than the proposed meta predictor, by a margin of approximately 12% (22.21 vs. 10.08%).</p>
<p>The differences between specificity and sensitivity scores are 81.69, 43.56, and 43.84% for the Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) predictor, the Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI predictor, and the proposed meta predictor, respectively. On this basis, it can be inferred that the proposed meta-predictor and the Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI predictor are comparatively less biased toward type I and type II errors, whereas the Yang et al. (<xref ref-type="bibr" rid="B36">36</xref>) predictor is strongly biased toward type II error due to its high specificity and low sensitivity scores.</p>
</sec>
</sec>
<sec>
<title>5.2. Proposed predictor performance comparison with existing predictors on unseen viruses test sets</title>
<p>To assess the applicability of the VHPPI predictors to unseen viruses, predictors are trained on different types of viruses and evaluated on test sets that contain viruses (Influenza A virus subtype H1N1, and Ebola virus EBV) that are not part of the training sets. <xref ref-type="table" rid="T6">Table 6</xref> compares the performance values of the proposed meta-predictor with those of 4 existing predictors, i.e., Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) SVM, Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>) LSTM-PHV, Dong et al. (<xref ref-type="bibr" rid="B23">23</xref>) predictor, and Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Performance comparison of the proposed MP-VHPPI with existing virus-host PPI predictors over 4 datasets developed by Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) to assess applicability to unseen viruses.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="left"><bold>Approach</bold></th>
<th valign="top" align="center"><bold>ACC</bold></th>
<th valign="top" align="center"><bold>SN</bold></th>
<th valign="top" align="center"><bold>SP</bold></th>
<th valign="top" align="center"><bold>PR</bold></th>
<th valign="top" align="center"><bold>F1</bold></th>
<th valign="top" align="center"><bold>MCC</bold></th>
<th valign="top" align="center"><bold>AU-ROC</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>TR1-TS1</bold></td>
<td valign="top" align="left">Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) (SVM)</td>
<td valign="top" align="center">77.95</td>
<td valign="top" align="center">89.76</td>
<td valign="top" align="center">66.14</td>
<td valign="top" align="center">72.61</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">57.5</td>
<td valign="top" align="center">88.6</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>) LSTM-PHV</td>
<td valign="top" align="center">86.7</td>
<td valign="top" align="center">90.6</td>
<td valign="top" align="center">82.9</td>
<td valign="top" align="center">84.1</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">73.7</td>
<td valign="top" align="center">91.2</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Dong et al. (<xref ref-type="bibr" rid="B23">23</xref>) LSTM</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">86.51</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">86.28</td>
<td valign="top" align="center">86.40</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">94.61</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI</td>
<td valign="top" align="center">83.82</td>
<td valign="top" align="center">91.48</td>
<td valign="top" align="center">83.82</td>
<td valign="top" align="center">85.34</td>
<td valign="top" align="center">83.64</td>
<td valign="top" align="center">69.14</td>
<td valign="top" align="center">94.0</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Proposed MP-VHPPI</bold></td>
<td valign="top" align="center"><bold>90.26</bold></td>
<td valign="top" align="center"><bold>95.06</bold></td>
<td valign="top" align="center"><bold>90.26</bold></td>
<td valign="top" align="center"><bold>91.44</bold></td>
<td valign="top" align="center"><bold>90.19</bold></td>
<td valign="top" align="center"><bold>81.69</bold></td>
<td valign="top" align="center"><bold>96.70</bold></td>
</tr>
<tr>
<td valign="top" align="left"><bold>TR2-TS2</bold></td>
<td valign="top" align="left">Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) (SVM)</td>
<td valign="top" align="center">78.00</td>
<td valign="top" align="center">90.67</td>
<td valign="top" align="center">65.33</td>
<td valign="top" align="center">72.34</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">57.9</td>
<td valign="top" align="center">86.7</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>) LSTM-PHV</td>
<td valign="top" align="center">84.0</td>
<td valign="top" align="center">93.3</td>
<td valign="top" align="center">74.7</td>
<td valign="top" align="center">78.7</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">69.2</td>
<td valign="top" align="center">94.1</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Dong et al. (<xref ref-type="bibr" rid="B23">23</xref>) LSTM</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">92.53</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">90.93</td>
<td valign="top" align="center">91.23</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">96.80</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI</td>
<td valign="top" align="center">86.58</td>
<td valign="top" align="center">93.11</td>
<td valign="top" align="center">86.57</td>
<td valign="top" align="center">88.35</td>
<td valign="top" align="center">86.42</td>
<td valign="top" align="center">74.9</td>
<td valign="top" align="center">96.0</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Proposed MP-VHPPI</bold></td>
<td valign="top" align="center"><bold>94.30</bold></td>
<td valign="top" align="center"><bold>97.07</bold></td>
<td valign="top" align="center"><bold>94.30</bold></td>
<td valign="top" align="center"><bold>94.39</bold></td>
<td valign="top" align="center"><bold>94.29</bold></td>
<td valign="top" align="center"><bold>88.69</bold></td>
<td valign="top" align="center"><bold>97.77</bold></td>
</tr>
<tr>
<td valign="top" align="left"><bold>TR3-TS1</bold></td>
<td valign="top" align="left">Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) (SVM)</td>
<td valign="top" align="center">77.43</td>
<td valign="top" align="center">88.98</td>
<td valign="top" align="center">65.88</td>
<td valign="top" align="center">72.28</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">56.4</td>
<td valign="top" align="center">88.4</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>) LSTM-PHV</td>
<td valign="top" align="center">85.7</td>
<td valign="top" align="center">89.2</td>
<td valign="top" align="center">82.2</td>
<td valign="top" align="center">83.3</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">71.6</td>
<td valign="top" align="center">92.1</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI</td>
<td valign="top" align="center">83.29</td>
<td valign="top" align="center">91.2</td>
<td valign="top" align="center">83.28</td>
<td valign="top" align="center">85.31</td>
<td valign="top" align="center">83.05</td>
<td valign="top" align="center">68.57</td>
<td valign="top" align="center">94.0</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Proposed MP-VHPPI</bold></td>
<td valign="top" align="center"><bold>90.53</bold></td>
<td valign="top" align="center"><bold>95.06</bold></td>
<td valign="top" align="center"><bold>90.53</bold></td>
<td valign="top" align="center"><bold>90.78</bold></td>
<td valign="top" align="center"><bold>90.51</bold></td>
<td valign="top" align="center"><bold>81.31</bold></td>
<td valign="top" align="center"><bold>95.98</bold></td>
</tr>
<tr>
<td valign="top" align="left"><bold>TR4-TS2</bold></td>
<td valign="top" align="left">Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) (SVM)</td>
<td valign="top" align="center">81.67</td>
<td valign="top" align="center">94.67</td>
<td valign="top" align="center">68.67</td>
<td valign="top" align="center">75.13</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">65.6</td>
<td valign="top" align="center">89.0</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>) LSTM-PHV</td>
<td valign="top" align="center">90.0</td>
<td valign="top" align="center">91.3</td>
<td valign="top" align="center">88.7</td>
<td valign="top" align="center">89.0</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">80.0</td>
<td valign="top" align="center">95.6</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) LGCA-VHPPI</td>
<td valign="top" align="center">85.57</td>
<td valign="top" align="center">92.59</td>
<td valign="top" align="center">85.57</td>
<td valign="top" align="center">87.65</td>
<td valign="top" align="center">85.37</td>
<td valign="top" align="center">73.19</td>
<td valign="top" align="center">96.0</td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Proposed MP-VHPPI</bold></td>
<td valign="top" align="center"><bold>93.62</bold></td>
<td valign="top" align="center"><bold>96.71</bold></td>
<td valign="top" align="center"><bold>93.62</bold></td>
<td valign="top" align="center"><bold>93.64</bold></td>
<td valign="top" align="center"><bold>93.62</bold></td>
<td valign="top" align="center"><bold>87.27</bold></td>
<td valign="top" align="center"><bold>98.14</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The performance values of the existing predictors i.e., Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) and Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>) (LSTM-PHV) are taken from their corresponding studies (<xref ref-type="bibr" rid="B21">21</xref>, <xref ref-type="bibr" rid="B30">30</xref>).</p>
</table-wrap-foot>
</table-wrap>
<p>On the TR1-TS1 dataset, among the 4 existing predictors, LSTM-PHV of Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>) achieves the highest accuracy (86.7%) and MCC (73.7%), whereas the predictor of Dong et al. (<xref ref-type="bibr" rid="B23">23</xref>) shows the highest precision (86.28%), F1 score (86.40%), and AUROC (94.61%). LGCA-VHPPI of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) shows the highest specificity and sensitivity, i.e., 83.82 and 91.48%. The predictor of Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) shows the lowest performance across all evaluation measures except sensitivity. The proposed meta predictor outperforms the existing predictors across all 7 evaluation measures. It achieves an increase of 3.56% in accuracy, 6.44% in specificity, 3.58% in sensitivity, 5.16% in precision, 3.79% in F1 score, 7.99% in MCC, and 2.09% in AUROC. Three of the 4 existing predictors, namely LSTM-PHV of Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>), the SVM of Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>), and LGCA-VHPPI of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>), are biased toward type I error because of their lower specificity (82.9, 66.14, and 83.82%) and higher sensitivity (90.6, 89.76, and 91.48%), with differences of 7.7, 23.62, and 7.66%. In comparison, the proposed meta predictor is more robust and generalizable, as shown by the small difference of 4.8% between its specificity and sensitivity scores and by its overall higher sensitivity, specificity, AUROC, accuracy, and MCC scores.</p>
<p>On the TR2-TS2 dataset, among the 4 existing predictors, LSTM-PHV of Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>) achieves the highest sensitivity (93.3%), whereas the predictor of Dong et al. (<xref ref-type="bibr" rid="B23">23</xref>) achieves the highest precision (90.93%), F1 score (91.23%), and AUROC (96.80%). LGCA-VHPPI of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) performs better in terms of accuracy (86.58%), specificity (86.57%), and MCC (74.9%). The predictor of Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) shows the lowest performance across all evaluation metrics except sensitivity, where it reaches 90.67%. The proposed meta predictor outperforms the existing predictors across all evaluation measures. Overall, it achieves a gain of 7.72% in accuracy, 3.77% in sensitivity, 7.73% in specificity, 3.46% in precision, 3.06% in F1 score, 13.79% in MCC, and 0.97% in AUROC. Among the existing predictors, those of Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>), Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>), and Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) are prone to type I error because of their high sensitivity and low specificity scores. For instance, the difference between the specificity and sensitivity scores is 25.34% for the predictor of Zhou et al., 18.6% for LSTM-PHV of Tsukiyama et al. (<xref ref-type="bibr" rid="B21">21</xref>), and 6.54% for LGCA-VHPPI of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>). Because of these large differences, these predictors do not generalize well to the human and Ebola virus protein data. In contrast, the proposed meta-predictor has a smaller difference of 2.77% between its specificity and sensitivity values, which makes it more generalizable than the existing predictors.</p>
<p>On the TR3-TS1 dataset, among the 3 existing predictors, LSTM-PHV achieves the highest accuracy (85.7%) and MCC (71.6%). Similarly, LGCA-VHPPI of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>) shows better performance in terms of sensitivity (91.2%), specificity (83.28%), precision (85.31%), and AUROC (94.0%). The proposed meta predictor outperforms the existing predictors on all 7 evaluation measures by considerable margins. It achieves an increase of 4.83% in accuracy, 3.86% in sensitivity, 7.25% in specificity, 5.47% in precision, 9.71% in MCC, 7.46% in F1 score, and 1.98% in AUROC. As in the previous cases, the existing predictors are again prone to type I error because of their high sensitivity and low specificity scores, with differences of 23.1, 7, and 7.92% for the predictors of Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>), LSTM-PHV (<xref ref-type="bibr" rid="B21">21</xref>), and LGCA-VHPPI (<xref ref-type="bibr" rid="B38">38</xref>), respectively. Comparatively, the proposed meta-predictor has a smaller difference of 4.53% between its specificity and sensitivity scores, which makes it more suitable for VHPPI prediction.</p>
<p>On the TR4-TS2 dataset, among the 3 existing predictors, LSTM-PHV (<xref ref-type="bibr" rid="B21">21</xref>) achieves the best accuracy (90.0%), specificity (88.7%), precision (89.0%), and MCC (80.0%). LGCA-VHPPI (<xref ref-type="bibr" rid="B38">38</xref>) excels in terms of AUROC (96.0%), whereas the SVM of Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) shows the best sensitivity score of 94.67%. The proposed predictor achieves performance gains of 3.62% in accuracy, 2.04% in sensitivity, 4.92% in specificity, 4.64% in precision, 8.25% in F1 score, 7.27% in MCC, and 2.14% in AUROC. The difference between the sensitivity and specificity scores is 26% for the predictor of Zhou et al. and 7.02% for LGCA-VHPPI of Asim et al. (<xref ref-type="bibr" rid="B38">38</xref>), which makes both predictors biased toward type I error because of their high sensitivity and lower specificity. Comparatively, LSTM-PHV and the proposed meta predictor have a difference of less than 3.1% between specificity and sensitivity, which suggests that on the TR4-TS2 dataset both predictors generalize well over positive and negative class samples.</p>
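<p>The sensitivity-specificity differences quoted for TR4-TS2 can be recomputed directly from the table. The short sketch below does this arithmetic; the dictionary name and the helper function are illustrative, and the assignment of table columns to sensitivity and specificity is inferred from the text, since the table header is not repeated in this part of the article.</p>
<preformat>
```python
# (sensitivity, specificity) in percent for TR4-TS2, read from the table above.
# Column order is an assumption inferred from the accompanying text.
TR4_TS2 = {
    "Zhou et al. (SVM)": (94.67, 68.67),
    "LSTM-PHV": (91.3, 88.7),
    "LGCA-VHPPI": (92.59, 85.57),
    "MP-VHPPI": (96.71, 93.62),
}


def sens_spec_gap(sensitivity, specificity):
    """Sensitivity minus specificity, in percentage points."""
    return round(sensitivity - specificity, 2)


gaps = {name: sens_spec_gap(*scores) for name, scores in TR4_TS2.items()}

for name, gap in gaps.items():
    # A large positive gap means the false-positive rate exceeds the
    # false-negative rate, i.e., the predictor leans toward type I error.
    print(f"{name}: {gap:.2f}")
```
</preformat>
<p>A gap close to zero indicates that a predictor treats positive and negative samples similarly; it says nothing about the absolute level of either score, so it should be read together with the table.</p>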
</sec>
</sec>
<sec sec-type="discussion" id="s6">
<title>6. Discussion</title>
<p>Over the last decade, the development of machine and deep learning-based computational approaches for virus-host protein-protein interaction prediction has been an active area of research (<xref ref-type="bibr" rid="B21">21</xref>, <xref ref-type="bibr" rid="B22">22</xref>). In the race to develop robust computational VHPPI predictors, the aim of each newly developed predictor has been to utilize raw virus-host protein sequences and precisely discriminate interactive viral-host protein sequences from non-interactive ones. However, most predictors have been evaluated on a limited range of viruses and hosts. For example, 6 predictors have been evaluated on the Barman dataset, which contains 5 different viruses with human proteins as hosts; 7 predictors have been evaluated on the Denovo dataset, which comprises 10 different viruses with human proteins as hosts; and 2 predictors have been evaluated on the Coronavirus dataset. Only 4 predictors have been evaluated on the Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) datasets, which consist of 332 viruses and 29 host proteins. These datasets are more suitable for evaluating the robustness, generalizability, and predictive performance of a computational predictor, because they were developed with the objective of training models on different types of viruses and evaluating them on particular viruses that were not part of the training set.</p>
<p>On datasets of unseen virus-host protein-protein interactions, the performance of existing predictors is low relative to their performance on the Barman and Denovo datasets. Recently, we developed a machine learning-based predictor named LGCA-VHPPI (<xref ref-type="bibr" rid="B38">38</xref>), which produced state-of-the-art performance on both the Barman and Denovo datasets. When we evaluated this predictor on the Zhou et al. (<xref ref-type="bibr" rid="B30">30</xref>) datasets, it showed lower performance than on the Barman and Denovo datasets. This motivated us to develop an improved predictor that makes the best use of raw viral-host protein sequences, so that it performs better not only on the Barman and Denovo datasets but also achieves similar performance on unseen viral-host protein-protein interactions.</p>
<p>In viral and host protein sequences, the overall distribution of amino acids is nearly identical across the interactive and non-interactive classes. However, the amino acid that occurs at a given position varies between the sequences of the two classes. The prime reason behind the bias of existing viral-host protein-protein interaction predictors toward type I or type II errors is their inability to capture this position-specific discriminative distribution of amino acids across both classes. To illustrate this phenomenon, we perform an amino acid distribution analysis across both classes with the help of Two Sample Logo (<xref ref-type="bibr" rid="B79">79</xref>). Virus and host protein sequences are long, and visualizing the amino acid distribution across entire sequences is not feasible. Hence, for the purpose of visualization, we take 20 amino acids from the start of the virus protein sequences and 20 amino acids from the start of the host protein sequences. Using these reduced sub-sequences of 40 amino acids, <xref ref-type="fig" rid="F3">Figure 3</xref> illustrates the amino acid distribution across interactive and non-interactive classes for 7 different datasets. It can be seen that the distribution of amino acids is approximately similar in the interactive and non-interactive classes. Considering the Barman dataset as an example (<xref ref-type="fig" rid="F3">Figure 3A</xref>), interactive and non-interactive samples have overlapping amino acids at every position. For instance, at position 2, interactive samples contain one of the amino acids H, A, E, G, or S, whereas non-interactive samples contain one of A, E, G, or S. Four amino acids therefore occur in both classes, while a few samples of the interactive class contain amino acid H; a similar trend exists at other positions as well. Furthermore, the other datasets show a distribution of amino acids similar to that of the Barman dataset. Across all 7 datasets, we thus observe a limited discriminative distribution of amino acids. Existing predictors lack in performance because they rely on sub-optimal sequence encoding methods, which generate statistical vectors while neglecting most of the discriminative features of the amino acid distribution in the interactive and non-interactive classes. It is worth noting that the prime goal of visualizing the amino acid distribution across both classes is to demonstrate the significance of effectively characterizing viral-host protein sequences. In our study, however, we use entire viral-host protein sequences to generate statistical vectors with physicochemical-property-based sequence encoders, an informative feature space with a dimensionality reduction algorithm, a discriminative feature space with tree-based classifiers, and the final prediction with an SVM classifier. A comprehensive performance comparison of the proposed predictor with existing VHPPI predictors indicates that the proposed predictor manages to capture the position-specific discriminative distribution of amino acids across both classes.</p>
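<p>The position-wise comparison described above can be approximated with a few lines of code. The following is a minimal sketch, not the Two Sample Logo procedure itself: it concatenates the leading residues of each virus and host sequence and tabulates residue frequencies per position for the two classes. The function names and the toy sequences are hypothetical, and a window of 3 residues is used instead of 20 to keep the example readable.</p>
<preformat>
```python
from collections import Counter


def pair_window(virus_seq, host_seq, window=20):
    """Concatenate the first `window` residues of the virus and host sequences."""
    return virus_seq[:window] + host_seq[:window]


def position_frequencies(windows):
    """Return one dict per position mapping residue to its relative frequency."""
    freqs = []
    for column in zip(*windows):  # zip stops at the shortest window
        counts = Counter(column)
        freqs.append({aa: n / len(column) for aa, n in counts.items()})
    return freqs


# Hypothetical toy (virus, host) pairs; real inputs would be full protein sequences.
interactive = [("MHK", "MAS"), ("MAK", "MGS")]
non_interactive = [("MAK", "MAS"), ("MEK", "MGT")]

pos_freq = position_frequencies([pair_window(v, h, 3) for v, h in interactive])
neg_freq = position_frequencies([pair_window(v, h, 3) for v, h in non_interactive])

# Residues seen in both classes, and residues seen only in the interactive class.
shared = [sorted(set(p).intersection(n)) for p, n in zip(pos_freq, neg_freq)]
interactive_only = [sorted(set(p).difference(n)) for p, n in zip(pos_freq, neg_freq)]

for i, (both, only) in enumerate(zip(shared, interactive_only), start=1):
    print(i, both, only)
```
</preformat>
<p>In this toy input, position 2 shares residue A between the classes while H appears only in the interactive class, which mirrors the kind of pattern reported for the Barman dataset.</p>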
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Distribution of amino acid sequences across 7 different datasets. For each dataset the distribution of amino acids is shown across interactive and non-interactive protein samples, <bold>(A)</bold> Barman dataset <bold>(B)</bold> Denovo dataset <bold>(C)</bold> Coronavirus dataset <bold>(D)</bold> TR1-TS1 dataset <bold>(E)</bold> TR2-TS2 dataset <bold>(F)</bold> TR3-TS1 dataset <bold>(G)</bold> TR4-TS2 dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-09-1025887-g0003.tif"/>
</fig>
<p>Furthermore, it is important to mention that all amino acids are either polar or non-polar in nature and that some carry a charge. Of the 20 standard amino acids, 11 are polar: 3 carry a positive charge (R, H, K), 2 carry a negative charge (D, E), and 6 are neutral (C, N, Q, S, T, Y). The remaining 9 amino acids are non-polar (A, G, I, L, M, F, P, W, V). Irrespective of position-aware occurrences, charges can be computed from the overall distribution of amino acids in a protein sequence by using their physicochemical properties. The overall charge information of amino acids, together with their distribution information, allows more discriminative patterns to be extracted and encoded.</p>
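<p>As a simple illustration of position-independent charge information, the sketch below computes the residue composition and a crude net charge of a sequence. It assumes the common textbook assignment in which R, H, and K count as positive and D and E as negative; it ignores pH and the partial protonation of histidine, and it is not the APAAC or Qsorder encoding used in this study. The function names and the toy sequence are hypothetical.</p>
<preformat>
```python
from collections import Counter

POSITIVE = frozenset("RHK")  # arginine, histidine, lysine
NEGATIVE = frozenset("DE")   # aspartate, glutamate


def composition(seq):
    """Fraction of each residue in the sequence, ignoring position."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def net_charge(seq):
    """Crude net charge: +1 per R/H/K residue and -1 per D/E residue."""
    return sum((aa in POSITIVE) - (aa in NEGATIVE) for aa in seq)


toy = "MKRDEHAK"  # hypothetical sequence
print(composition(toy))
print(net_charge(toy))
```
</preformat>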
<p><xref ref-type="fig" rid="F4">Figure 4</xref> shows different clusters of the 7 benchmark datasets for the intrinsic analysis of the statistical vectors generated through the APAAC and Qsorder sequence encoders. These clusters are computed by first reducing the dimensions of the statistical vectors through principal component analysis (PCA) and then applying t-distributed stochastic neighbor embedding (t-SNE). In <xref ref-type="fig" rid="F4">Figures 4A,B</xref>, rows represent clusters of interactive and non-interactive classes based on statistical vectors generated through the individual encoders (APAAC, Qsorder) and a combination of both encoders, whereas columns represent the 7 benchmark datasets, namely Barman, Denovo, Coronavirus, TR1-TS1, TR2-TS2, TR3-TS1, and TR4-TS2. Overall, statistical vectors from APAAC and Qsorder without dimensionality reduction lead to the formation of overlapping clusters for interactive and non-interactive classes. This overlap reveals that the generated statistical vectors are very similar and contain little discriminative information about the interactive and non-interactive classes, as shown in <xref ref-type="fig" rid="F4">Figure 4A</xref>. Furthermore, this overlap arises because different physicochemical properties extract some irrelevant and redundant features. To eliminate such information, we utilize the feature agglomeration method with the objective of transforming the generated statistical vectors into a more informative and discriminative feature space. Comparatively, statistical representations of APAAC and Qsorder with dimensionality reduction lead to the formation of slightly more distinct yet still heavily dependent clusters, as shown in <xref ref-type="fig" rid="F4">Figure 4B</xref>. Although these encodings could be used for classification, the performance would still not be very promising. In addition, the clusters do not appear independent because a single human protein that interacts with some viral proteins might not interact with other viral proteins. This means that positive and negative class samples can have very similar representations due to the presence of such proteins. Although dimensionality reduction produces a better feature space, the clusters are still not well separated. To further improve the performance of the predictor, we perform iterative representation learning, where we pass 3 different statistical representations separately to RF and ET classifiers and take their predicted class probabilities to develop a new feature space. The generated feature space leads to the formation of unique and independent clusters, as shown in <xref ref-type="fig" rid="F4">Figure 4C</xref>, which suggests the presence of comprehensive discriminatory features for interactive and non-interactive VHPPI pairs. Because this newly generated feature space is discriminative and informative, we use it to train the SVM classifier for virus-host protein-protein interaction prediction.</p>
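<p>The representation-learning steps described above can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions rather than the exact MP-VHPPI implementation: the input matrix is synthetic and merely stands in for APAAC and Qsorder vectors, the number of agglomerated features, the forest sizes, and the SVM kernel are arbitrary, and the use of out-of-fold probabilities for the training split is our choice to avoid label leakage, which the text does not specify.</p>
<preformat>
```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for encoded virus-host pairs (NOT real APAAC/Qsorder output).
X, y = make_classification(n_samples=300, n_features=60, n_informative=12,
                           random_state=0)
apaac, qsorder = X[:, :30], X[:, 30:]  # hypothetical split into two "encoders"
views = [apaac, qsorder, np.hstack([apaac, qsorder])]  # 3 representations

idx_train, idx_test = train_test_split(np.arange(len(y)), test_size=0.25,
                                       stratify=y, random_state=0)


def meta_features(view):
    """Positive-class probabilities from RF and ET for one representation."""
    reducer = FeatureAgglomeration(n_clusters=10)  # arbitrary cluster count
    train = reducer.fit_transform(view[idx_train])
    test = reducer.transform(view[idx_test])
    cols_train, cols_test = [], []
    for base in (RandomForestClassifier(n_estimators=100, random_state=0),
                 ExtraTreesClassifier(n_estimators=100, random_state=0)):
        # Out-of-fold probabilities for the training split (assumption, see above).
        oof = cross_val_predict(base, train, y[idx_train], cv=5,
                                method="predict_proba")[:, 1]
        cols_train.append(oof)
        # Refit on the full training split to score the held-out test split.
        cols_test.append(base.fit(train, y[idx_train]).predict_proba(test)[:, 1])
    return np.column_stack(cols_train), np.column_stack(cols_test)


parts = [meta_features(view) for view in views]
meta_train = np.hstack([p[0] for p in parts])  # 3 views x 2 classifiers = 6 columns
meta_test = np.hstack([p[1] for p in parts])

# The SVM makes the final prediction from the probability feature space.
svm = SVC(kernel="rbf").fit(meta_train, y[idx_train])
accuracy = svm.score(meta_test, y[idx_test])
print(meta_train.shape, meta_test.shape, round(accuracy, 3))
```
</preformat>
<p>Each sample ends up described by six probabilities, one per combination of representation and tree-based classifier, which corresponds to the feature space visualized in Figure 4C.</p>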
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Formation of clusters with representations of protein sequences based on <bold>(A)</bold> APAAC and Qsorder encoders without dimensionality reduction <bold>(B)</bold> APAAC and Qsorder encoders with dimensionality reduction <bold>(C)</bold> the positive class probabilities through ET and RF classifiers.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-09-1025887-g0004.tif"/>
</fig>
<p>Overall, compared with state-of-the-art predictors, the proposed predictor shows a slight performance improvement on the Barman and Denovo datasets and significant performance improvements on the SARS-CoV-2 datasets and the other 4 datasets, namely TR1-TS1, TR2-TS2, TR3-TS1, and TR4-TS2. We believe that the performance of the proposed predictor can be further improved by incorporating representations learned through diverse language models such as BERT and XLNet.</p>
</sec>
<sec id="s7">
<title>7. Web-server</title>
<p>To support the biological community, we provide an interactive and user-friendly web server for the proposed meta-predictor, which is available at <ext-link ext-link-type="uri" xlink:href="https://sds_genetic_analysis.opendfki.de/MP-VHPPI/">https://sds_genetic_analysis.opendfki.de/MP-VHPPI/</ext-link>. This web server can be used to predict virus-host protein-protein interactions across 7 different datasets by using raw human and virus protein sequences. In addition, the web server allows users to train the proposed meta-predictor from scratch on all of the datasets. It also provides the 7 benchmark datasets used in this study.</p>
</sec>
<sec sec-type="conclusions" id="s8">
<title>8. Conclusion</title>
<p>The prime objective of this research is the development of a robust machine learning-based computational framework capable of precisely predicting viral-host protein-protein interactions across a wide range of hosts and viruses. The proposed meta predictor makes use of the APAAC and Qsorder sequence encoders to generate statistical representations and of the feature agglomeration method to refine the feature space. Furthermore, the meta predictor feeds the predictions of random forest and extra trees classifiers to the SVM classifier that makes the final predictions. Experimental results reveal the competence of the APAAC and Qsorder encoders for generating numerical representations of sequences that capture amino acid sequence order and distributional information. We have observed that the dimensionality reduction method removes irrelevant and redundant information, which slightly improves the performance of the classifiers. The process of iterative representation learning, in which the predictions of the RF and ET classifiers are passed to the SVM classifier, significantly improves the accuracy of interaction predictions. The proposed meta predictor has been evaluated on 7 benchmark datasets, where it outperforms existing predictors by margins of 3.07, 6.07, 2.95, and 2.85% in terms of accuracy, MCC, precision, and sensitivity, respectively. We believe that the deployment of the proposed meta-predictor as a web interface will assist researchers and practitioners in analyzing the complex phenomenon of VHPPIs at a larger scale to unravel substantial drug targets and optimize antiviral strategies.</p>
</sec>
<sec id="s9">
<title>9. Limitations</title>
<p>Comprehensive performance analysis reveals that the proposed MP-VHPPI predictor outperforms existing viral-host protein-protein interaction predictors across 7 benchmark datasets. Although the use of different strategies at the level of representation learning reduces the prediction error, the proposed model still lacks robustness because it is biased toward type II error. In the future, we will optimize the predictive pipeline of the proposed MP-VHPPI predictor with the aim of enhancing its robustness.</p>
</sec>
<sec sec-type="data-availability" id="s10">
<title>Data availability statement</title>
<p>The benchmark datasets for this study can be found at: <ext-link ext-link-type="uri" xlink:href="https://sds_genetic_analysis.opendfki.de/MP-VHPPI/Download/">https://sds_genetic_analysis.opendfki.de/MP-VHPPI/Download/</ext-link>.</p>
</sec>
<sec id="s11">
<title>Author contributions</title>
<p>MA conceptualized the presented idea, prepared the graphics, and developed the web server. MA and MI performed data curation, formal analysis, validation, and investigation. MA and AF prepared the original draft and final manuscript under the supervision of AD and SA. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s12">
<title>Funding</title>
<p>This study was supported by Sartorius Artificial Intelligence Lab.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s13">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack><p>We gratefully acknowledge the support of the Sartorius Artificial Intelligence Lab with the allocation of TITAN Xp GPU for this research.</p>
</ack>
<sec sec-type="supplementary-material" id="s14">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fmed.2022.1025887/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fmed.2022.1025887/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Data_Sheet_2.PDF" id="SM2" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Evans</surname> <given-names>H</given-names></name> <name><surname>Shapiro</surname> <given-names>M</given-names></name></person-group>. <article-title>Viruses</article-title>. In: <source>Manual of Techniques in Insect Pathology</source>. <publisher-loc>Elsevier</publisher-loc> (<year>1997</year>). p. <fpage>17</fpage>&#x02013;<lpage>53</lpage>.</citation>
</ref>
<ref id="B2">
<label>2.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LePan</surname> <given-names>N</given-names></name></person-group>. <article-title>Visualizing the history of pandemics</article-title>. <source>Vis Capit</source>. (<year>2020</year>) <fpage>14</fpage>.</citation>
</ref>
<ref id="B3">
<label>3.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nowosad</surname> <given-names>A</given-names></name> <name><surname>Turanli</surname> <given-names>&#x000DC;</given-names></name> <name><surname>Lorenc</surname> <given-names>K</given-names></name></person-group>. <article-title>The coronavirus SARS-CoV-2 and its impact on the world</article-title>. In: <source>The Socioeconomic Impact of COVID-19 on Eastern European Countries</source>. <publisher-loc>Routledge</publisher-loc> (<year>2022</year>).</citation>
</ref>
<ref id="B4">
<label>4.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Myoung</surname> <given-names>J</given-names></name></person-group>. <source>Two Years of COVID-19 Pandemic: Where Are we Now</source>?. <publisher-name>Springer</publisher-name> (<year>2022</year>).<pub-id pub-id-type="pmid">35235176</pub-id></citation></ref>
<ref id="B5">
<label>5.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carroll</surname> <given-names>MW</given-names></name> <name><surname>Matthews</surname> <given-names>DA</given-names></name> <name><surname>Hiscox</surname> <given-names>JA</given-names></name> <name><surname>Elmore</surname> <given-names>MJ</given-names></name> <name><surname>Pollakis</surname> <given-names>G</given-names></name> <name><surname>Rambaut</surname> <given-names>A</given-names></name> <etal/></person-group>. <article-title>Temporal and spatial analysis of the 2014-2015 Ebola virus outbreak in West Africa</article-title>. <source>Nature</source>. (<year>2015</year>) <volume>524</volume>:<fpage>97</fpage>&#x02013;<lpage>101</lpage>. <pub-id pub-id-type="doi">10.1038/nature14594</pub-id><pub-id pub-id-type="pmid">26083749</pub-id></citation></ref>
<ref id="B6">
<label>6.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Villarreal</surname> <given-names>LP</given-names></name></person-group>. <article-title>Are viruses alive?</article-title> <source>Sci Am</source>. (<year>2004</year>) <volume>291</volume>:<fpage>100</fpage>&#x02013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1038/scientificamerican1204-100</pub-id><pub-id pub-id-type="pmid">15597986</pub-id></citation></ref>
<ref id="B7">
<label>7.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davey</surname> <given-names>NE</given-names></name> <name><surname>Trav&#x000E9;</surname> <given-names>G</given-names></name> <name><surname>Gibson</surname> <given-names>TJ</given-names></name></person-group>. <article-title>How viruses hijack cell regulation</article-title>. <source>Trends Biochem Sci</source>. (<year>2011</year>) <volume>36</volume>:<fpage>159</fpage>&#x02013;<lpage>69</lpage>. <pub-id pub-id-type="doi">10.1016/j.tibs.2010.10.002</pub-id><pub-id pub-id-type="pmid">21146412</pub-id></citation></ref>
<ref id="B8">
<label>8.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dimitrov</surname> <given-names>DS</given-names></name></person-group>. <article-title>Virus entry: molecular mechanisms and biomedical applications</article-title>. <source>Nat Rev Microbiol</source>. (<year>2004</year>) <volume>2</volume>:<fpage>109</fpage>&#x02013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1038/nrmicro817</pub-id><pub-id pub-id-type="pmid">15043007</pub-id></citation></ref>
<ref id="B9">
<label>9.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Louten</surname> <given-names>J</given-names></name></person-group>. <article-title>Virus replication</article-title>. In: <source>Essential Human Virology.</source> (<year>2016</year>). p. 49.</citation>
</ref>
<ref id="B10">
<label>10.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thaker</surname> <given-names>SK</given-names></name> <name><surname>Ch&#x00027;ng</surname> <given-names>J</given-names></name> <name><surname>Christofk</surname> <given-names>HR</given-names></name></person-group>. <article-title>Viral hijacking of cellular metabolism</article-title>. <source>BMC Biol</source>. (<year>2019</year>) <volume>17</volume>:<fpage>1</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1186/s12915-019-0678-9</pub-id><pub-id pub-id-type="pmid">31319842</pub-id></citation></ref>
<ref id="B11">
<label>11.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>S</given-names></name> <name><surname>Fu</surname> <given-names>C</given-names></name> <name><surname>Lian</surname> <given-names>X</given-names></name> <name><surname>Dong</surname> <given-names>X</given-names></name> <name><surname>Zhang</surname> <given-names>Z</given-names></name></person-group>. <article-title>Understanding human-virus protein-protein interactions using a human protein complex-based analysis framework</article-title>. <source>MSystems</source>. (<year>2019</year>) <volume>4</volume>:<fpage>e00303</fpage>&#x02013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1128/mSystems.00303-18</pub-id><pub-id pub-id-type="pmid">30984872</pub-id></citation></ref>
<ref id="B12">
<label>12.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chaplin</surname> <given-names>DD</given-names></name></person-group>. <article-title>1. Overview of the human immune response</article-title>. <source>J Allergy Clin Immunol</source>. (<year>2006</year>) <volume>117</volume>:<fpage>S430</fpage>&#x02013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1016/j.jaci.2005.09.034</pub-id><pub-id pub-id-type="pmid">16455341</pub-id></citation></ref>
<ref id="B13">
<label>13.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rampersad</surname> <given-names>S</given-names></name> <name><surname>Tennant</surname> <given-names>P</given-names></name></person-group>. <article-title>Replication and expression strategies of viruses</article-title>. <source>Viruses</source>. (<year>2018</year>) <fpage>55</fpage>&#x02013;<lpage>82</lpage>. <pub-id pub-id-type="doi">10.1016/B978-0-12-811257-1.00003-6</pub-id></citation>
</ref>
<ref id="B14">
<label>14.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perrin-Cocon</surname> <given-names>L</given-names></name> <name><surname>Diaz</surname> <given-names>O</given-names></name> <name><surname>Jacquemin</surname> <given-names>C</given-names></name> <name><surname>Barthel</surname> <given-names>V</given-names></name> <name><surname>Ogire</surname> <given-names>E</given-names></name> <name><surname>Rami&#x000E8;re</surname> <given-names>C</given-names></name> <etal/></person-group>. <article-title>The current landscape of coronavirus-host protein-protein interactions</article-title>. <source>J Transl Med</source>. (<year>2020</year>) <volume>18</volume>:<fpage>1</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1186/s12967-020-02480-z</pub-id><pub-id pub-id-type="pmid">32811513</pub-id></citation></ref>
<ref id="B15">
<label>15.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Llano</surname> <given-names>M</given-names></name> <name><surname>Pe&#x000F1;a-Hernandez</surname> <given-names>MA</given-names></name></person-group>. <article-title>Defining pharmacological targets by analysis of virus-host protein interactions</article-title>. <source>Adv Protein Chem Struct Biol</source>. (<year>2018</year>) <volume>111</volume>:<fpage>223</fpage>&#x02013;<lpage>42</lpage>. <pub-id pub-id-type="doi">10.1016/bs.apcsb.2017.11.001</pub-id><pub-id pub-id-type="pmid">29459033</pub-id></citation></ref>
<ref id="B16">
<label>16.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Munier</surname> <given-names>S</given-names></name> <name><surname>Rolland</surname> <given-names>T</given-names></name> <name><surname>Diot</surname> <given-names>C</given-names></name> <name><surname>Jacob</surname> <given-names>Y</given-names></name> <name><surname>Naffakh</surname> <given-names>N</given-names></name></person-group>. <article-title>Exploration of binary virus-host interactions using an infectious protein complementation assay</article-title>. <source>Mol Cell Proteomics</source>. (<year>2013</year>) <volume>12</volume>:<fpage>2845</fpage>&#x02013;<lpage>55</lpage>. <pub-id pub-id-type="doi">10.1074/mcp.M113.028688</pub-id><pub-id pub-id-type="pmid">23816991</pub-id></citation></ref>
<ref id="B17">
<label>17.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rusnati</surname> <given-names>M</given-names></name> <name><surname>Chiodelli</surname> <given-names>P</given-names></name> <name><surname>Bugatti</surname> <given-names>A</given-names></name> <name><surname>Urbinati</surname> <given-names>C</given-names></name></person-group>. <article-title>Bridging the past and the future of virology: surface plasmon resonance as a powerful tool to investigate virus/host interactions</article-title>. <source>Crit Rev Microbiol</source>. (<year>2015</year>) <volume>41</volume>:<fpage>238</fpage>&#x02013;<lpage>60</lpage>. <pub-id pub-id-type="doi">10.3109/1040841X.2013.826177</pub-id><pub-id pub-id-type="pmid">24059853</pub-id></citation></ref>
<ref id="B18">
<label>18.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xing</surname> <given-names>S</given-names></name> <name><surname>Wallmeroth</surname> <given-names>N</given-names></name> <name><surname>Berendzen</surname> <given-names>KW</given-names></name> <name><surname>Grefen</surname> <given-names>C</given-names></name></person-group>. <article-title>Techniques for the analysis of protein-protein interactions in vivo</article-title>. <source>Plant Physiol</source>. (<year>2016</year>) <volume>171</volume>:<fpage>727</fpage>&#x02013;<lpage>58</lpage>. <pub-id pub-id-type="doi">10.1104/pp.16.00470</pub-id><pub-id pub-id-type="pmid">27208310</pub-id></citation></ref>
<ref id="B19">
<label>19.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Br&#x000FC;ckner</surname> <given-names>A</given-names></name> <name><surname>Polge</surname> <given-names>C</given-names></name> <name><surname>Lentze</surname> <given-names>N</given-names></name> <name><surname>Auerbach</surname> <given-names>D</given-names></name> <name><surname>Schlattner</surname> <given-names>U</given-names></name></person-group>. <article-title>Yeast two-hybrid, a powerful tool for systems biology</article-title>. <source>Int J Mol Sci</source>. (<year>2009</year>) <volume>10</volume>:<fpage>2763</fpage>&#x02013;<lpage>88</lpage>. <pub-id pub-id-type="doi">10.3390/ijms10062763</pub-id><pub-id pub-id-type="pmid">19582228</pub-id></citation></ref>
<ref id="B20">
<label>20.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Georges</surname> <given-names>AA</given-names></name> <name><surname>Frappier</surname> <given-names>L</given-names></name></person-group>. <article-title>Affinity purification-mass spectroscopy methods for identifying Epstein-Barr virus-host interactions</article-title>. <source>Methods Mol Biol</source>. (<year>2017</year>) <volume>1532</volume>:<fpage>79</fpage>&#x02013;<lpage>92</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-4939-6655-4_5</pub-id><pub-id pub-id-type="pmid">27873268</pub-id></citation></ref>
<ref id="B21">
<label>21.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tsukiyama</surname> <given-names>S</given-names></name> <name><surname>Hasan</surname> <given-names>MM</given-names></name> <name><surname>Fujii</surname> <given-names>S</given-names></name> <name><surname>Kurata</surname> <given-names>H</given-names></name></person-group>. <article-title>LSTM-PHV: prediction of human-virus protein-protein interactions by LSTM with word2vec</article-title>. <source>Brief Bioinform</source>. (<year>2021</year>) <volume>22</volume>:<fpage>bbab228</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbab228</pub-id><pub-id pub-id-type="pmid">34160596</pub-id></citation></ref>
<ref id="B22">
<label>22.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barman</surname> <given-names>RK</given-names></name> <name><surname>Saha</surname> <given-names>S</given-names></name> <name><surname>Das</surname> <given-names>S</given-names></name></person-group>. <article-title>Prediction of interactions between viral and host proteins using supervised machine learning methods</article-title>. <source>PLoS ONE</source>. (<year>2014</year>) <volume>9</volume>:<fpage>e112034</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0112034</pub-id><pub-id pub-id-type="pmid">25375323</pub-id></citation></ref>
<ref id="B23">
<label>23.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>TN</given-names></name> <name><surname>Brogden</surname> <given-names>G</given-names></name> <name><surname>Gerold</surname> <given-names>G</given-names></name> <name><surname>Khosla</surname> <given-names>M</given-names></name></person-group>. <article-title>A multitask transfer learning framework for the prediction of virus-human protein-protein interactions</article-title>. <source>BMC Bioinform</source>. (<year>2021</year>) <volume>22</volume>:<fpage>1</fpage>&#x02013;<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1186/s12859-021-04484-y</pub-id><pub-id pub-id-type="pmid">34837942</pub-id></citation></ref>
<ref id="B24">
<label>24.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Molina-Mora</surname> <given-names>JA</given-names></name> <name><surname>Gonz&#x000E1;lez</surname> <given-names>A</given-names></name> <name><surname>Jim&#x000E9;nez-Morgan</surname> <given-names>S</given-names></name> <name><surname>Cordero-Laurent</surname> <given-names>E</given-names></name> <name><surname>Brenes</surname> <given-names>H</given-names></name> <name><surname>Soto-Garita</surname> <given-names>C</given-names></name> <etal/></person-group>. <article-title>Clinical profiles at the time of diagnosis of SARS-CoV-2 infection in Costa Rica during the pre-vaccination period using a machine learning approach</article-title>. <source>Phenomics</source>. (<year>2022</year>) <volume>2</volume>:<fpage>312</fpage>&#x02013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1007/s43657-022-00058-x</pub-id><pub-id pub-id-type="pmid">35692458</pub-id></citation></ref>
<ref id="B25">
<label>25.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Palma</surname> <given-names>SI</given-names></name> <name><surname>Traguedo</surname> <given-names>AP</given-names></name> <name><surname>Porteira</surname> <given-names>AR</given-names></name> <name><surname>Frias</surname> <given-names>MJ</given-names></name> <name><surname>Gamboa</surname> <given-names>H</given-names></name> <name><surname>Roque</surname> <given-names>AC</given-names></name></person-group>. <article-title>Machine learning for the meta-analyses of microbial pathogens&#x00027; volatile signatures</article-title>. <source>Sci Rep</source>. (<year>2018</year>) <volume>8</volume>:<fpage>1</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-018-21544-1</pub-id><pub-id pub-id-type="pmid">29679073</pub-id></citation></ref>
<ref id="B26">
<label>26.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mora</surname> <given-names>JAM</given-names></name> <name><surname>Montero-Manso</surname> <given-names>P</given-names></name> <name><surname>Garc&#x000ED;a-Bat&#x000E1;n</surname> <given-names>R</given-names></name> <name><surname>Campos-S&#x000E1;nchez</surname> <given-names>R</given-names></name> <name><surname>Vilar-Fern&#x000E1;ndez</surname> <given-names>J</given-names></name> <name><surname>Garc&#x000ED;a</surname> <given-names>F</given-names></name></person-group>. <article-title>A first perturbome of <italic>Pseudomonas aeruginosa</italic>: identification of core genes related to multiple perturbations by a machine learning approach</article-title>. <source>Biosystems</source>. (<year>2021</year>) <volume>205</volume>:<fpage>104411</fpage>. <pub-id pub-id-type="doi">10.1016/j.biosystems.2021.104411</pub-id><pub-id pub-id-type="pmid">33757842</pub-id></citation></ref>
<ref id="B27">
<label>27.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lanchantin</surname> <given-names>J</given-names></name> <name><surname>Weingarten</surname> <given-names>T</given-names></name> <name><surname>Sekhon</surname> <given-names>A</given-names></name> <name><surname>Miller</surname> <given-names>C</given-names></name> <name><surname>Qi</surname> <given-names>Y</given-names></name></person-group>. <article-title>Transfer learning for predicting virus-host protein interactions for novel virus sequences</article-title>. In: <source>Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics</source>. (<year>2021</year>). p. <fpage>1</fpage>&#x02013;<lpage>10</lpage>.</citation>
</ref>
<ref id="B28">
<label>28.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karabulut</surname> <given-names>OC</given-names></name> <name><surname>Karpuzcu</surname> <given-names>BA</given-names></name> <name><surname>T&#x000FC;rk</surname> <given-names>E</given-names></name> <name><surname>Ibrahim</surname> <given-names>AH</given-names></name> <name><surname>S&#x000FC;zek</surname> <given-names>BE</given-names></name></person-group>. <article-title>ML-AdVInfect: a machine-learning based adenoviral infection predictor</article-title>. <source>Front Mol Biosci</source>. (<year>2021</year>) <volume>8</volume>:<fpage>647424</fpage>. <pub-id pub-id-type="doi">10.3389/fmolb.2021.647424</pub-id><pub-id pub-id-type="pmid">34026828</pub-id></citation></ref>
<ref id="B29">
<label>29.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eid</surname> <given-names>FE</given-names></name> <name><surname>ElHefnawi</surname> <given-names>M</given-names></name> <name><surname>Heath</surname> <given-names>LS</given-names></name></person-group>. <article-title>DeNovo: virus-host sequence-based protein-protein interaction prediction</article-title>. <source>Bioinformatics</source>. (<year>2016</year>) <volume>32</volume>:<fpage>1144</fpage>&#x02013;<lpage>50</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv737</pub-id><pub-id pub-id-type="pmid">26677965</pub-id></citation></ref>
<ref id="B30">
<label>30.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>X</given-names></name> <name><surname>Park</surname> <given-names>B</given-names></name> <name><surname>Choi</surname> <given-names>D</given-names></name> <name><surname>Han</surname> <given-names>K</given-names></name></person-group>. <article-title>A generalized approach to predicting protein-protein interactions between virus and host</article-title>. <source>BMC Genomics</source>. (<year>2018</year>) <volume>19</volume>:<fpage>69</fpage>&#x02013;<lpage>77</lpage>. <pub-id pub-id-type="doi">10.1186/s12864-018-4924-2</pub-id><pub-id pub-id-type="pmid">30367586</pub-id></citation></ref>
<ref id="B31">
<label>31.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>X</given-names></name> <name><surname>Yang</surname> <given-names>S</given-names></name> <name><surname>Li</surname> <given-names>Q</given-names></name> <name><surname>Wuchty</surname> <given-names>S</given-names></name> <name><surname>Zhang</surname> <given-names>Z</given-names></name></person-group>. <article-title>Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method</article-title>. <source>Comput Struct Biotechnol J</source>. (<year>2020</year>) <volume>18</volume>:<fpage>153</fpage>&#x02013;<lpage>61</lpage>. <pub-id pub-id-type="doi">10.1016/j.csbj.2019.12.005</pub-id><pub-id pub-id-type="pmid">31969974</pub-id></citation></ref>
<ref id="B32">
<label>32.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alguwaizani</surname> <given-names>S</given-names></name> <name><surname>Park</surname> <given-names>B</given-names></name> <name><surname>Zhou</surname> <given-names>X</given-names></name> <name><surname>Huang</surname> <given-names>DS</given-names></name> <name><surname>Han</surname> <given-names>K</given-names></name></person-group>. <article-title>Predicting interactions between virus and host proteins using repeat patterns and composition of amino acids</article-title>. <source>J Healthc Eng</source>. (<year>2018</year>) <volume>2018</volume>:<fpage>1391265</fpage>. <pub-id pub-id-type="doi">10.1155/2018/1391265</pub-id><pub-id pub-id-type="pmid">29854357</pub-id></citation></ref>
<ref id="B33">
<label>33.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Du</surname> <given-names>H</given-names></name> <name><surname>Chen</surname> <given-names>F</given-names></name> <name><surname>Liu</surname> <given-names>H</given-names></name> <name><surname>Hong</surname> <given-names>P</given-names></name></person-group>. <article-title>Network-based virus-host interaction prediction with application to SARS-CoV-2</article-title>. <source>Patterns</source>. (<year>2021</year>) <volume>2</volume>:<fpage>100242</fpage>. <pub-id pub-id-type="doi">10.1016/j.patter.2021.100242</pub-id><pub-id pub-id-type="pmid">33817672</pub-id></citation></ref>
<ref id="B34">
<label>34.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>L</given-names></name> <name><surname>Zhao</surname> <given-names>J</given-names></name> <name><surname>Zhang</surname> <given-names>J</given-names></name></person-group>. <article-title>Predict the protein-protein interaction between virus and host through hybrid deep neural network</article-title>. In: <source>2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</source>. <publisher-name>IEEE</publisher-name> (<year>2020</year>). p. <fpage>11</fpage>&#x02013;<lpage>16</lpage>.</citation>
</ref>
<ref id="B35">
<label>35.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu-Wei</surname> <given-names>W</given-names></name> <name><surname>Kafkas</surname> <given-names>&#x0015E;</given-names></name> <name><surname>Chen</surname> <given-names>J</given-names></name> <name><surname>Dimonaco</surname> <given-names>NJ</given-names></name> <name><surname>Tegn&#x000E9;r</surname> <given-names>J</given-names></name> <name><surname>Hoehndorf</surname> <given-names>R</given-names></name></person-group>. <article-title>DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes</article-title>. <source>Bioinformatics</source>. (<year>2021</year>) <volume>37</volume>:<fpage>2722</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btab147</pub-id><pub-id pub-id-type="pmid">33682875</pub-id></citation></ref>
<ref id="B36">
<label>36.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>X</given-names></name> <name><surname>Yang</surname> <given-names>S</given-names></name> <name><surname>Lian</surname> <given-names>X</given-names></name> <name><surname>Wuchty</surname> <given-names>S</given-names></name> <name><surname>Zhang</surname> <given-names>Z</given-names></name></person-group>. <article-title>Transfer learning via multi-scale convolutional neural layers for human-virus protein-protein interaction prediction</article-title>. <source>Bioinformatics</source>. (<year>2021</year>) <volume>37</volume>:<fpage>4771</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btab533</pub-id><pub-id pub-id-type="pmid">34273146</pub-id></citation></ref>
<ref id="B37">
<label>37.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Basit</surname> <given-names>AH</given-names></name> <name><surname>Abbasi</surname> <given-names>WA</given-names></name> <name><surname>Asif</surname> <given-names>A</given-names></name> <name><surname>Gull</surname> <given-names>S</given-names></name> <name><surname>Minhas</surname> <given-names>FUAA</given-names></name></person-group>. <article-title>Training host-pathogen protein-protein interaction predictors</article-title>. <source>J Bioinform Comput Biol</source>. (<year>2018</year>) <volume>16</volume>:<fpage>1850014</fpage>. <pub-id pub-id-type="doi">10.1142/S0219720018500142</pub-id><pub-id pub-id-type="pmid">30060698</pub-id></citation></ref>
<ref id="B38">
<label>38.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Asim</surname> <given-names>MN</given-names></name> <name><surname>Ibrahim</surname> <given-names>MA</given-names></name> <name><surname>Malik</surname> <given-names>MI</given-names></name> <name><surname>Dengel</surname> <given-names>A</given-names></name> <name><surname>Ahmed</surname> <given-names>S</given-names></name></person-group>. <article-title>LGCA-VHPPI: a local-global residue context aware viral-host protein-protein interaction predictor</article-title>. <source>PLoS ONE</source>. (<year>2022</year>) <volume>17</volume>:<fpage>e0270275</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0270275</pub-id><pub-id pub-id-type="pmid">35789333</pub-id></citation></ref>
<ref id="B39">
<label>39.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname> <given-names>KC</given-names></name></person-group>. <article-title>Prediction of protein subcellular locations by incorporating quasi-sequence-order effect</article-title>. <source>Biochem Biophys Res Commun</source>. (<year>2000</year>) <volume>278</volume>:<fpage>477</fpage>&#x02013;<lpage>83</lpage>. <pub-id pub-id-type="doi">10.1006/bbrc.2000.3815</pub-id><pub-id pub-id-type="pmid">11097861</pub-id></citation></ref>
<ref id="B40">
<label>40.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>JN</given-names></name> <name><surname>Yang</surname> <given-names>HY</given-names></name> <name><surname>Yao</surname> <given-names>J</given-names></name> <name><surname>Ding</surname> <given-names>H</given-names></name> <name><surname>Han</surname> <given-names>SG</given-names></name> <name><surname>Wu</surname> <given-names>CY</given-names></name> <etal/></person-group>. <article-title>Prediction of cyclin protein using two-step feature selection technique</article-title>. <source>IEEE Access</source>. (<year>2020</year>) <volume>8</volume>:<fpage>109535</fpage>&#x02013;<lpage>542</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2020.2999394</pub-id></citation>
</ref>
<ref id="B41">
<label>41.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>QY</given-names></name> <name><surname>You</surname> <given-names>ZH</given-names></name> <name><surname>Li</surname> <given-names>S</given-names></name> <name><surname>Zhu</surname> <given-names>Z</given-names></name></person-group>. <article-title>Using Chou&#x00027;s amphiphilic Pseudo-Amino Acid Composition and Extreme Learning Machine for prediction of Protein-protein interactions</article-title>. In: <source>2014 International Joint Conference on Neural Networks (IJCNN)</source>. <publisher-loc>Beijing</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2014</year>). p. <fpage>2952</fpage>&#x02013;<lpage>6</lpage>.</citation>
</ref>
<ref id="B42">
<label>42.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tian</surname> <given-names>B</given-names></name> <name><surname>Wu</surname> <given-names>X</given-names></name> <name><surname>Chen</surname> <given-names>C</given-names></name> <name><surname>Qiu</surname> <given-names>W</given-names></name> <name><surname>Ma</surname> <given-names>Q</given-names></name> <name><surname>Yu</surname> <given-names>B</given-names></name></person-group>. <article-title>Predicting protein-protein interactions by fusing various Chou&#x00027;s pseudo components and using wavelet denoising approach</article-title>. <source>J Theor Biol</source>. (<year>2019</year>) <volume>462</volume>:<fpage>329</fpage>&#x02013;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.1016/j.jtbi.2018.11.011</pub-id><pub-id pub-id-type="pmid">30452960</pub-id></citation></ref>
<ref id="B43">
<label>43.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>ZH</given-names></name> <name><surname>Feng</surname> <given-names>J</given-names></name></person-group>. <article-title>Deep forest</article-title>. <source>arXiv preprint arXiv:1702.08835</source>. (<year>2017</year>).</citation>
</ref>
<ref id="B44">
<label>44.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname> <given-names>KC</given-names></name></person-group>. <article-title>Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes</article-title>. <source>Bioinformatics</source>. (<year>2005</year>) <volume>21</volume>:<fpage>10</fpage>&#x02013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bth466</pub-id><pub-id pub-id-type="pmid">15308540</pub-id></citation></ref>
<ref id="B45">
<label>45.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Feurer</surname> <given-names>M</given-names></name> <name><surname>Klein</surname> <given-names>A</given-names></name> <name><surname>Eggensperger</surname> <given-names>K</given-names></name> <name><surname>Springenberg</surname> <given-names>J</given-names></name> <name><surname>Blum</surname> <given-names>M</given-names></name> <name><surname>Hutter</surname> <given-names>F</given-names></name></person-group>. <article-title>Efficient and robust automated machine learning</article-title>. In: <source>Advances in Neural Information Processing Systems, Vol. 28</source>. (<year>2015</year>).</citation>
</ref>
<ref id="B46">
<label>46.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Alpaydin</surname> <given-names>E</given-names></name></person-group>. <source>Machine Learning</source>. <publisher-name>MIT Press</publisher-name> (<year>2021</year>).</citation>
</ref>
<ref id="B47">
<label>47.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chou</surname> <given-names>KC</given-names></name></person-group>. <article-title>Prediction of protein cellular attributes using pseudo-amino acid composition</article-title>. <source>Proteins</source>. (<year>2001</year>) <volume>43</volume>:<fpage>246</fpage>&#x02013;<lpage>55</lpage>. <pub-id pub-id-type="doi">10.1002/prot.1035</pub-id><pub-id pub-id-type="pmid">11288174</pub-id></citation></ref>
<ref id="B48">
<label>48.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schneider</surname> <given-names>G</given-names></name> <name><surname>Wrede</surname> <given-names>P</given-names></name></person-group>. <article-title>The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site</article-title>. <source>Biophys J</source>. (<year>1994</year>) <volume>66</volume>:<fpage>335</fpage>&#x02013;<lpage>44</lpage>. <pub-id pub-id-type="doi">10.1016/S0006-3495(94)80782-9</pub-id><pub-id pub-id-type="pmid">8161687</pub-id></citation></ref>
<ref id="B49">
<label>49.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grantham</surname> <given-names>R</given-names></name></person-group>. <article-title>Amino acid difference formula to help explain protein evolution</article-title>. <source>Science</source>. (<year>1974</year>) <volume>185</volume>:<fpage>862</fpage>&#x02013;<lpage>4</lpage>. <pub-id pub-id-type="doi">10.1126/science.185.4154.862</pub-id><pub-id pub-id-type="pmid">4843792</pub-id></citation></ref>
<ref id="B50">
<label>50.</label>
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Sasirekha</surname> <given-names>K</given-names></name> <name><surname>Baby</surname> <given-names>P</given-names></name></person-group>. <article-title>Agglomerative hierarchical clustering algorithm-a</article-title>. <source>Int J Sci Res Publ</source>. (<year>2013</year>) <volume>3</volume>:<fpage>1</fpage>&#x02013;<lpage>3</lpage>. <ext-link ext-link-type="uri" xlink:href="https://www.ijsrp.org/research-paper-0313/ijsrp-p1515.pdf">https://www.ijsrp.org/research-paper-0313/ijsrp-p1515.pdf</ext-link><pub-id pub-id-type="pmid">34737487</pub-id></citation></ref>
<ref id="B51">
<label>51.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chatr-Aryamontri</surname> <given-names>A</given-names></name> <name><surname>Ceol</surname> <given-names>A</given-names></name> <name><surname>Peluso</surname> <given-names>D</given-names></name> <name><surname>Nardozza</surname> <given-names>A</given-names></name> <name><surname>Panni</surname> <given-names>S</given-names></name> <name><surname>Sacco</surname> <given-names>F</given-names></name> <etal/></person-group>. <article-title>VirusMINT: a viral protein interaction database</article-title>. <source>Nucleic Acids Res</source>. (<year>2009</year>) <volume>37</volume>:<fpage>D669</fpage>&#x02013;<lpage>73</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkn739</pub-id><pub-id pub-id-type="pmid">18974184</pub-id></citation></ref>
<ref id="B52">
<label>52.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Consortium</surname> <given-names>U</given-names></name></person-group>. <article-title>UniProt: a worldwide hub of protein knowledge</article-title>. <source>Nucleic Acids Res</source>. (<year>2019</year>) <volume>47</volume>:<fpage>D506</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gky1049</pub-id><pub-id pub-id-type="pmid">30395287</pub-id></citation></ref>
<ref id="B53">
<label>53.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Calderone</surname> <given-names>A</given-names></name> <name><surname>Cesareni</surname> <given-names>G</given-names></name></person-group>. <article-title>Mentha: the interactome browser</article-title>. <source>EMBnet J</source>. (<year>2012</year>) <volume>18</volume>:<fpage>128</fpage>. <pub-id pub-id-type="doi">10.14806/ej.18.A.455</pub-id></citation>
</ref>
<ref id="B54">
<label>54.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ammari</surname> <given-names>MG</given-names></name> <name><surname>Gresham</surname> <given-names>CR</given-names></name> <name><surname>McCarthy</surname> <given-names>FM</given-names></name> <name><surname>Nanduri</surname> <given-names>B</given-names></name></person-group>. <article-title>HPIDB 2.0: a curated database for host-pathogen interactions</article-title>. <source>Database</source>. (<year>2016</year>) <volume>2016</volume>:<fpage>baw103</fpage>. <pub-id pub-id-type="doi">10.1093/database/baw103</pub-id><pub-id pub-id-type="pmid">27374121</pub-id></citation></ref>
<ref id="B55">
<label>55.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guirimand</surname> <given-names>T</given-names></name> <name><surname>Delmotte</surname> <given-names>S</given-names></name> <name><surname>Navratil</surname> <given-names>V</given-names></name></person-group>. <article-title>VirHostNet 2.0: surfing on the web of virus/host molecular interactions data</article-title>. <source>Nucleic Acids Res</source>. (<year>2015</year>) <volume>43</volume>:<fpage>D583</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gku1121</pub-id><pub-id pub-id-type="pmid">25392406</pub-id></citation></ref>
<ref id="B56">
<label>56.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Durmu&#x0015F; Tekir</surname> <given-names>S</given-names></name> <name><surname>&#x000C7;ak&#x00131;r</surname> <given-names>T</given-names></name> <name><surname>Ard&#x00131;&#x000E7;</surname> <given-names>E</given-names></name> <name><surname>Say&#x00131;l&#x00131;rba&#x0015F;</surname> <given-names>AS</given-names></name> <name><surname>Konuk</surname> <given-names>G</given-names></name> <name><surname>Konuk</surname> <given-names>M</given-names></name> <etal/></person-group>. <article-title>PHISTO: pathogen-host interaction search tool</article-title>. <source>Bioinformatics</source>. (<year>2013</year>) <volume>29</volume>:<fpage>1357</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt137</pub-id><pub-id pub-id-type="pmid">23515528</pub-id></citation></ref>
<ref id="B57">
<label>57.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sussman</surname> <given-names>JL</given-names></name> <name><surname>Lin</surname> <given-names>D</given-names></name> <name><surname>Jiang</surname> <given-names>J</given-names></name> <name><surname>Manning</surname> <given-names>NO</given-names></name> <name><surname>Prilusky</surname> <given-names>J</given-names></name> <name><surname>Ritter</surname> <given-names>O</given-names></name> <etal/></person-group>. <article-title>Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules</article-title>. <source>Acta Crystallogr D Biol Crystallogr</source>. (<year>1998</year>) <volume>54</volume>:<fpage>1078</fpage>&#x02013;<lpage>84</lpage>. <pub-id pub-id-type="doi">10.1107/S0907444998009378</pub-id><pub-id pub-id-type="pmid">10089483</pub-id></citation></ref>
<ref id="B58">
<label>58.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Del-Toro</surname> <given-names>N</given-names></name> <name><surname>Dumousseau</surname> <given-names>M</given-names></name> <name><surname>Orchard</surname> <given-names>S</given-names></name> <name><surname>Jimenez</surname> <given-names>RC</given-names></name> <name><surname>Galeota</surname> <given-names>E</given-names></name> <name><surname>Launay</surname> <given-names>G</given-names></name> <etal/></person-group>. <article-title>A new reference implementation of the PSICQUIC web service</article-title>. <source>Nucleic Acids Res</source>. (<year>2013</year>) <volume>41</volume>:<fpage>W601</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkt392</pub-id><pub-id pub-id-type="pmid">23671334</pub-id></citation></ref>
<ref id="B59">
<label>59.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alonso-Lopez</surname> <given-names>D</given-names></name> <name><surname>Guti&#x000E9;rrez</surname> <given-names>MA</given-names></name> <name><surname>Lopes</surname> <given-names>KP</given-names></name> <name><surname>Prieto</surname> <given-names>C</given-names></name> <name><surname>Santamar&#x000ED;a</surname> <given-names>R</given-names></name> <name><surname>De Las Rivas</surname> <given-names>J</given-names></name></person-group>. <article-title>APID interactomes: providing proteome-based interactomes with controlled quality for multiple species and derived networks</article-title>. <source>Nucleic Acids Res</source>. (<year>2016</year>) <volume>44</volume>:<fpage>W529</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkw363</pub-id><pub-id pub-id-type="pmid">27131791</pub-id></citation></ref>
<ref id="B60">
<label>60.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hermjakob</surname> <given-names>H</given-names></name> <name><surname>Montecchi-Palazzi</surname> <given-names>L</given-names></name> <name><surname>Lewington</surname> <given-names>C</given-names></name> <name><surname>Mudali</surname> <given-names>S</given-names></name> <name><surname>Kerrien</surname> <given-names>S</given-names></name> <name><surname>Orchard</surname> <given-names>S</given-names></name> <etal/></person-group>. <article-title>IntAct: an open source molecular interaction database</article-title>. <source>Nucleic Acids Res</source>. (<year>2004</year>) <volume>32</volume>:<fpage>D452</fpage>&#x02013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkh052</pub-id><pub-id pub-id-type="pmid">14681455</pub-id></citation></ref>
<ref id="B61">
<label>61.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fu</surname> <given-names>L</given-names></name> <name><surname>Niu</surname> <given-names>B</given-names></name> <name><surname>Zhu</surname> <given-names>Z</given-names></name> <name><surname>Wu</surname> <given-names>S</given-names></name> <name><surname>Li</surname> <given-names>W</given-names></name></person-group>. <article-title>CD-HIT: accelerated for clustering the next-generation sequencing data</article-title>. <source>Bioinformatics</source>. (<year>2012</year>) <volume>28</volume>:<fpage>3150</fpage>&#x02013;<lpage>2</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bts565</pub-id><pub-id pub-id-type="pmid">23060610</pub-id></citation></ref>
<ref id="B62">
<label>62.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Byvatov</surname> <given-names>E</given-names></name> <name><surname>Schneider</surname> <given-names>G</given-names></name></person-group>. <article-title>Support vector machine applications in bioinformatics</article-title>. <source>Appl Bioinform</source>. (<year>2003</year>) <volume>2</volume>:<fpage>67</fpage>&#x02013;<lpage>77</lpage>. <pub-id pub-id-type="pmid">15130823</pub-id></citation></ref>
<ref id="B63">
<label>63.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Demichev</surname> <given-names>V</given-names></name> <name><surname>Tober-Lau</surname> <given-names>P</given-names></name> <name><surname>Nazarenko</surname> <given-names>T</given-names></name> <name><surname>Lemke</surname> <given-names>O</given-names></name> <name><surname>Kaur Aulakh</surname> <given-names>S</given-names></name> <name><surname>Whitwell</surname> <given-names>HJ</given-names></name> <etal/></person-group>. <article-title>A proteomic survival predictor for COVID-19 patients in intensive care</article-title>. <source>PLoS Digit Health</source>. (<year>2022</year>) <volume>1</volume>:<fpage>e0000007</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pdig.0000007</pub-id><pub-id pub-id-type="pmid">35967328</pub-id></citation></ref>
<ref id="B64">
<label>64.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Feng</surname> <given-names>G</given-names></name> <name><surname>He</surname> <given-names>N</given-names></name> <name><surname>Xia</surname> <given-names>HHX</given-names></name> <name><surname>Mi</surname> <given-names>M</given-names></name> <name><surname>Wang</surname> <given-names>K</given-names></name> <name><surname>Byrne</surname> <given-names>CD</given-names></name> <etal/></person-group>. <article-title>Machine learning algorithms based on proteomic data mining accurately predicting the recurrence of hepatitis B-related hepatocellular carcinoma</article-title>. <source>J Gastroenterol Hepatol</source>. (<year>2022</year>) <pub-id pub-id-type="doi">10.1111/jgh.15940</pub-id><pub-id pub-id-type="pmid">35816347</pub-id></citation></ref>
<ref id="B65">
<label>65.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Al-Barakati</surname> <given-names>HJ</given-names></name> <name><surname>McConnell</surname> <given-names>EW</given-names></name> <name><surname>Hicks</surname> <given-names>LM</given-names></name> <name><surname>Poole</surname> <given-names>LB</given-names></name> <name><surname>Newman</surname> <given-names>RH</given-names></name> <name><surname>Kc</surname> <given-names>DB</given-names></name></person-group>. <article-title>SVM-SulfoSite: a support vector machine based predictor for sulfenylation sites</article-title>. <source>Sci Rep</source>. (<year>2018</year>) <volume>8</volume>:<fpage>1</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-018-29126-x</pub-id><pub-id pub-id-type="pmid">30050050</pub-id></citation></ref>
<ref id="B66">
<label>66.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>James</surname> <given-names>G</given-names></name> <name><surname>Witten</surname> <given-names>D</given-names></name> <name><surname>Hastie</surname> <given-names>T</given-names></name> <name><surname>Tibshirani</surname> <given-names>R</given-names></name></person-group>. <source>An Introduction to Statistical Learning. Vol. 112</source>. <publisher-name>Springer</publisher-name> (<year>2013</year>).</citation>
</ref>
<ref id="B67">
<label>67.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>PN</given-names></name> <name><surname>Steinbach</surname> <given-names>M</given-names></name> <name><surname>Kumar</surname> <given-names>V</given-names></name></person-group>. <source>Introduction to Data Mining</source>. <publisher-name>Addison Wesley</publisher-name> (<year>2005</year>).</citation>
</ref>
<ref id="B68">
<label>68.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X</given-names></name> <name><surname>Ishwaran</surname> <given-names>H</given-names></name></person-group>. <article-title>Random forests for genomic data analysis</article-title>. <source>Genomics</source>. (<year>2012</year>) <volume>99</volume>:<fpage>323</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1016/j.ygeno.2012.04.003</pub-id><pub-id pub-id-type="pmid">22546560</pub-id></citation></ref>
<ref id="B69">
<label>69.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x00027;Leary</surname> <given-names>OE</given-names></name> <name><surname>Schoetzau</surname> <given-names>A</given-names></name> <name><surname>Amruthalingam</surname> <given-names>L</given-names></name> <name><surname>Geber-Hollbach</surname> <given-names>N</given-names></name> <name><surname>Plattner</surname> <given-names>K</given-names></name> <name><surname>Jenoe</surname> <given-names>P</given-names></name> <etal/></person-group>. <article-title>Tear proteomic predictive biomarker model for ocular graft versus host disease classification</article-title>. <source>Transl Vis Sci Technol</source>. (<year>2020</year>) <volume>9</volume>:<fpage>3</fpage>. <pub-id pub-id-type="doi">10.1167/tvst.9.9.3</pub-id><pub-id pub-id-type="pmid">32879760</pub-id></citation></ref>
<ref id="B70">
<label>70.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>C</given-names></name> <name><surname>Leng</surname> <given-names>W</given-names></name> <name><surname>Sun</surname> <given-names>C</given-names></name> <name><surname>Lu</surname> <given-names>T</given-names></name> <name><surname>Chen</surname> <given-names>Z</given-names></name> <name><surname>Men</surname> <given-names>X</given-names></name> <etal/></person-group>. <article-title>Urine proteome profiling predicts lung cancer from control cases and other tumors</article-title>. <source>EBioMedicine</source>. (<year>2018</year>) <volume>30</volume>:<fpage>120</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1016/j.ebiom.2018.03.009</pub-id><pub-id pub-id-type="pmid">29576497</pub-id></citation></ref>
<ref id="B71">
<label>71.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>Q</given-names></name> <name><surname>Chen</surname> <given-names>X</given-names></name> <name><surname>Wang</surname> <given-names>Y</given-names></name> <name><surname>Li</surname> <given-names>J</given-names></name> <name><surname>Liu</surname> <given-names>H</given-names></name> <name><surname>Xie</surname> <given-names>Y</given-names></name> <etal/></person-group>. <article-title>Hydloc: a tool for hydroxyproline and hydroxylysine sites prediction in the human proteome</article-title>. <source>Chemometr Intell Lab Syst</source>. (<year>2020</year>) <volume>202</volume>:<fpage>104035</fpage>. <pub-id pub-id-type="doi">10.1016/j.chemolab.2020.104035</pub-id></citation>
</ref>
<ref id="B72">
<label>72.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geurts</surname> <given-names>P</given-names></name> <name><surname>Ernst</surname> <given-names>D</given-names></name> <name><surname>Wehenkel</surname> <given-names>L</given-names></name></person-group>. <article-title>Extremely randomized trees</article-title>. <source>Mach Learn</source>. (<year>2006</year>) <volume>63</volume>:<fpage>3</fpage>&#x02013;<lpage>42</lpage>. <pub-id pub-id-type="doi">10.1007/s10994-006-6226-1</pub-id></citation>
</ref>
<ref id="B73">
<label>73.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arafat</surname> <given-names>ME</given-names></name> <name><surname>Ahmad</surname> <given-names>MW</given-names></name> <name><surname>Shovan</surname> <given-names>S</given-names></name> <name><surname>Dehzangi</surname> <given-names>A</given-names></name> <name><surname>Dipta</surname> <given-names>SR</given-names></name> <name><surname>Hasan</surname> <given-names>MAM</given-names></name> <etal/></person-group>. <article-title>Accurately predicting glutarylation sites using sequential bi-peptide-based evolutionary features</article-title>. <source>Genes</source>. (<year>2020</year>) <volume>11</volume>:<fpage>1023</fpage>. <pub-id pub-id-type="doi">10.3390/genes11091023</pub-id><pub-id pub-id-type="pmid">32878321</pub-id></citation></ref>
<ref id="B74">
<label>74.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname> <given-names>L</given-names></name> <name><surname>Yuan</surname> <given-names>R</given-names></name> <name><surname>Shen</surname> <given-names>L</given-names></name> <name><surname>Gao</surname> <given-names>P</given-names></name> <name><surname>Zhou</surname> <given-names>L</given-names></name></person-group>. <article-title>LPI-EnEDT: an ensemble framework with extra tree and decision tree classifiers for imbalanced lncRNA-protein interaction data classification</article-title>. <source>BioData Min</source>. (<year>2021</year>) <volume>14</volume>:<fpage>1</fpage>&#x02013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1186/s13040-021-00277-4</pub-id><pub-id pub-id-type="pmid">34861891</pub-id></citation></ref>
<ref id="B75">
<label>75.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Iqbal</surname> <given-names>S</given-names></name> <name><surname>Ge</surname> <given-names>F</given-names></name> <name><surname>Li</surname> <given-names>F</given-names></name> <name><surname>Akutsu</surname> <given-names>T</given-names></name> <name><surname>Zheng</surname> <given-names>Y</given-names></name> <name><surname>Gasser</surname> <given-names>RB</given-names></name> <etal/></person-group>. <article-title>PROST: AlphaFold2-aware sequence-based predictor to estimate protein stability changes upon missense mutations</article-title>. <source>J Chem Inf Model</source>. (<year>2022</year>) <volume>62</volume>:<fpage>4270</fpage>&#x02013;<lpage>82</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.2c00799</pub-id><pub-id pub-id-type="pmid">35973091</pub-id></citation></ref>
<ref id="B76">
<label>76.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Z</given-names></name> <name><surname>Zhao</surname> <given-names>P</given-names></name> <name><surname>Li</surname> <given-names>C</given-names></name> <name><surname>Li</surname> <given-names>F</given-names></name> <name><surname>Xiang</surname> <given-names>D</given-names></name> <name><surname>Chen</surname> <given-names>YZ</given-names></name> <etal/></person-group>. <article-title>iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization</article-title>. <source>Nucleic Acids Res</source>. (<year>2021</year>) <volume>49</volume>:<fpage>e60</fpage>. <pub-id pub-id-type="doi">10.1093/nar/gkab122</pub-id><pub-id pub-id-type="pmid">33660783</pub-id></citation></ref>
<ref id="B77">
<label>77.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hao</surname> <given-names>J</given-names></name> <name><surname>Ho</surname> <given-names>TK</given-names></name></person-group>. <article-title>Machine learning made easy: a review of scikit-learn package in python programming language</article-title>. <source>J Educ Behav Stat</source>. (<year>2019</year>) <volume>44</volume>:<fpage>348</fpage>&#x02013;<lpage>61</lpage>. <pub-id pub-id-type="doi">10.3102/1076998619832248</pub-id></citation>
</ref>
<ref id="B78">
<label>78.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shekar</surname> <given-names>B</given-names></name> <name><surname>Dagnew</surname> <given-names>G</given-names></name></person-group>. <article-title>Grid search-based hyperparameter tuning and classification of microarray cancer data</article-title>. In: <source>2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP)</source>. <publisher-loc>Gangtok</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2019</year>). p. <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation>
</ref>
<ref id="B79">
<label>79.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vacic</surname> <given-names>V</given-names></name> <name><surname>Iakoucheva</surname> <given-names>LM</given-names></name> <name><surname>Radivojac</surname> <given-names>P</given-names></name></person-group>. <article-title>A graphical representation of the differences between two sets of sequence alignments</article-title>. <source>Bioinformatics</source>. (<year>2006</year>) <volume>22</volume>:<fpage>1536</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btl151</pub-id><pub-id pub-id-type="pmid">16632492</pub-id></citation></ref>
</ref-list> 
</back>
</article> 