<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fgene.2018.00717</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier</article-title>
</title-group>
<contrib-group> 
<contrib contrib-type="author">
<name><surname>Tkachev</surname> <given-names>Victor</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/656091/overview"/>
</contrib> 
<contrib contrib-type="author">
<name><surname>Sorokin</surname> <given-names>Maxim</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/500181/overview"/>
</contrib> 
<contrib contrib-type="author">
<name><surname>Mescheryakov</surname> <given-names>Artem</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/665354/overview"/>
</contrib> 
<contrib contrib-type="author">
<name><surname>Simonov</surname> <given-names>Alexander</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/658527/overview"/>
</contrib> 
<contrib contrib-type="author">
<name><surname>Garazha</surname> <given-names>Andrew</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/70918/overview"/>
</contrib> 
<contrib contrib-type="author">
<name><surname>Buzdin</surname> <given-names>Anton</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/64369/overview"/>
</contrib> 
<contrib contrib-type="author">
<name><surname>Muchnik</surname> <given-names>Ilya</given-names></name>
<xref ref-type="aff" rid="aff5"><sup>5</sup></xref>
</contrib> 
<contrib contrib-type="author" corresp="yes">
<name><surname>Borisov</surname> <given-names>Nicolas</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/127081/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Bioinformatics and Molecular Networks, OmicsWay Corporation</institution>, <addr-line>Walnut, CA</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry</institution>, <addr-line>Moscow</addr-line>, <country>Russia</country></aff>
<aff id="aff3"><sup>3</sup><institution>Yandex N.V. Corporation</institution>, <addr-line>Moscow</addr-line>, <country>Russia</country></aff>
<aff id="aff4"><sup>4</sup><institution>I.M. Sechenov First Moscow State Medical University (Sechenov University)</institution>, <addr-line>Moscow</addr-line>, <country>Russia</country></aff>
<aff id="aff5"><sup>5</sup><institution>Hill Center, Rutgers University</institution>, <addr-line>Piscataway, NJ</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Guilherme De Alencar Barreto, Universidade Federal do Cear&#x00E1;, Brazil; Firoz Ahmed, Jeddah University, Saudi Arabia</p></fn>
<corresp id="c001">&#x002A;Correspondence: Nicolas Borisov, <email>borisov@oncobox.com</email></corresp>
<fn fn-type="other" id="fn002"><p>This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics</p></fn></author-notes>
<pub-date pub-type="epub">
<day>15</day>
<month>01</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>9</volume>
<elocation-id>717</elocation-id>
<history>
<date date-type="received">
<day>01</day>
<month>09</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>21</day>
<month>12</month>
<year>2018</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2019 Tkachev, Sorokin, Mescheryakov, Simonov, Garazha, Buzdin, Muchnik and Borisov.</copyright-statement>
<copyright-year>2019</copyright-year>
<copyright-holder>Tkachev, Sorokin, Mescheryakov, Simonov, Garazha, Buzdin, Muchnik and Borisov</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Here, we propose a heuristic data trimming technique for SVM termed <italic>FLOating Window Projective Separator</italic> (<italic>FloWPS</italic>), tailored for personalized predictions based on molecular data. This procedure can operate on high-throughput genetic datasets such as gene expression or mutation profiles. Its application prevents SVM from extrapolating by excluding non-informative features. FloWPS requires training on the data for individuals with known clinical outcomes to create a clinically relevant classifier. The genetic profiles linked with the outcomes are split as usual into training and validation datasets. The unique property of FloWPS is that irrelevant features in the <italic>validation</italic> dataset that do not have a significant number of neighboring hits in the <italic>training</italic> dataset are removed from further analyses. Next, similarly to the <italic>k</italic> nearest neighbors (kNN) method, for each point of the <italic>validation</italic> dataset, FloWPS takes into account only the proximal points of the <italic>training</italic> dataset. Thus, for every point of the <italic>validation</italic> dataset, the <italic>training</italic> dataset is adjusted to form a <italic>floating window</italic>. FloWPS performance was tested on ten gene expression datasets for 992 cancer patients who either responded or did not respond to different types of chemotherapy. We confirmed experimentally by leave-one-out cross-validation that FloWPS significantly increases the quality of classifiers built on classical SVM in most of the applications, particularly for polynomial kernels.</p>
</abstract>
<kwd-group>
<kwd>bioinformatics</kwd>
<kwd>machine learning</kwd>
<kwd>oncology</kwd>
<kwd>gene expression</kwd>
<kwd>support vector machines</kwd>
<kwd>personalized medicine</kwd>
</kwd-group>
<contract-sponsor id="cn001">Russian Science Foundation<named-content content-type="fundref-id">10.13039/501100006769</named-content></contract-sponsor>
<counts>
<fig-count count="6"/>
<table-count count="3"/>
<equation-count count="0"/>
<ref-count count="48"/>
<page-count count="12"/>
<word-count count="0"/>
</counts>
</article-meta>
</front>
<body>
<sec><title>Introduction</title>
<p>The support vector machine (SVM) is one of the most popular machine learning methods in biomedical sciences, with constantly growing impact and more than 11,000 citations in the PubMed-indexed literature<sup><xref ref-type="fn" rid="fn01">1</xref></sup>, of which &#x223C;2,300 appeared in 2017 and the first 6 months of 2018 alone. This method has been successfully applied to a wide variety of biomedical problems, such as searching for Dicer RNase cleavage sites on pre-miRNA (<xref ref-type="bibr" rid="B3">Ahmed et al., 2013</xref>), prediction of miRNA guide strands (<xref ref-type="bibr" rid="B1">Ahmed et al., 2009a</xref>), identification of poly(A) signals in genomic DNA (<xref ref-type="bibr" rid="B2">Ahmed et al., 2009b</xref>), and finding conformational B-cell epitopes in antigens by nucleotide sequence (<xref ref-type="bibr" rid="B6">Ansari and Raghava, 2010</xref>). More recent developments include drug design according to physicochemical properties (<xref ref-type="bibr" rid="B47">Yosipof et al., 2018</xref>), learning on transcriptomic profiles for age recognition (<xref ref-type="bibr" rid="B31">Mamoshina et al., 2018</xref>), and prediction of drug toxicities and other side effects (<xref ref-type="bibr" rid="B48">Zhang et al., 2018</xref>).</p>
<p>The performance quality of the classifiers based on these methods may reach a value of 0.80 or higher for metrics such as ROC AUC<sup><xref ref-type="fn" rid="fn02">2</xref></sup> and/or accuracy rate, e.g., for problems of age recognition (<xref ref-type="bibr" rid="B31">Mamoshina et al., 2018</xref>) and drug compound selection (<xref ref-type="bibr" rid="B47">Yosipof et al., 2018</xref>). However, although generally helpful, the SVM approach frequently demonstrates insufficient performance in several applications for separating groups of patients with different clinical outcomes (<xref ref-type="bibr" rid="B32">Mulligan et al., 2007</xref>; <xref ref-type="bibr" rid="B34">Ray and Zhang, 2009</xref>; <xref ref-type="bibr" rid="B8">Babaoglu et al., 2010</xref>; <xref ref-type="bibr" rid="B25">Kim et al., 2018</xref>). These failures were most likely caused by an insufficient number of preceding clinical cases, which provokes overtraining of all machine learning algorithms. In particular, the sparsity of training points in the feature space leads to frequent extrapolations, and the SVM method is known to be highly vulnerable to such conditions (<xref ref-type="bibr" rid="B7">Arimoto et al., 2005</xref>; <xref ref-type="bibr" rid="B9">Balabin and Lomakina, 2011</xref>; <xref ref-type="bibr" rid="B10">Balabin and Smirnov, 2012</xref>; <xref ref-type="bibr" rid="B12">Betrie et al., 2013</xref>).</p>
<p>To increase the performance of SVM for distinguishing between clinically relevant features, such as degrees of response to cancer therapies, we propose here a new data trimming method termed <italic>FloWPS</italic> that generalizes the SVM technique by precluding extrapolation in the feature space. FloWPS acts by selecting for further analysis only those features that lie within the intervals of data projections from the training dataset. This approach avoids extrapolation in favor of interpolation and thus increases the prediction quality of the output data. In a sense, FloWPS combines two methods, SVM and kNN (<xref ref-type="bibr" rid="B4">Altman, 1992</xref>), where kNN plays the particular role of extracting informative features. The idea of combining feature extraction methods with SVM is well known (<xref ref-type="bibr" rid="B38">Tan and Gilbert, 2003</xref>; <xref ref-type="bibr" rid="B26">Kourou et al., 2015</xref>; <xref ref-type="bibr" rid="B39">Tan, 2016</xref>; <xref ref-type="bibr" rid="B29">Liu et al., 2017</xref>; <xref ref-type="bibr" rid="B40">Tarek et al., 2017</xref>). The approach proposed in this paper, however, is novel in principle, at least because its selection is performed individually for every single point available for prediction.</p>
<p>We tested FloWPS on ten published gene expression datasets comprising a total of 992 cancer patients with known clinical outcomes who were treated with different types of chemotherapy. In all the cases, the classifiers built using FloWPS outperformed standard SVM classifiers.</p>
</sec>
<sec><title>Results</title>
<sec><title>Data Sources and Feature Selection</title>
<p>In this study, we investigated gene expression features associated with the responses to chemotherapy. The gene expression profiles were extracted from the datasets summarized in Table <xref ref-type="table" rid="T1">1</xref>. The clinical outcome information was related to the response to different chemotherapy regimens and was linked with high-throughput gene expression profiles for the individual patients.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Clinically annotated gene expression datasets.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Reference</th>
<th valign="top" align="center">Dataset ID</th>
<th valign="top" align="center">Disease type</th>
<th valign="top" align="center">Treatment type</th>
<th valign="top" align="center">Experimental platform</th>
<th valign="top" align="center">Number of samples</th>
<th valign="top" align="center">Number of <italic>core marker genes</italic></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B20">Hatzis et al., 2011</xref>; <xref ref-type="bibr" rid="B23">Itoh et al., 2014</xref></td>
<td valign="top" align="left">GSE25066</td>
<td valign="top" align="left">Breast cancer with different hormonal and HER2 status</td>
<td valign="top" align="left">Neoadjuvant taxane + anthracycline</td>
<td valign="top" align="left">Affymetrix Human Genome U133 Array</td>
<td valign="top" align="left">235 (118 responders, 117 non-responders)</td>
<td valign="top" align="center">20</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B21">Horak et al., 2013</xref></td>
<td valign="top" align="left">GSE41998</td>
<td valign="top" align="left">Breast cancer with different hormonal and HER2 status</td>
<td valign="top" align="left">Neoadjuvant doxorubicin + cyclophosphamide, followed by paclitaxel</td>
<td valign="top" align="left">Affymetrix Human Genome U133 Array</td>
<td valign="top" align="left">68 (34 responders, 34 non-responders)</td>
<td valign="top" align="center">11</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B32">Mulligan et al., 2007</xref></td>
<td valign="top" align="left">GSE9782</td>
<td valign="top" align="left">Multiple myeloma</td>
<td valign="top" align="left">Bortezomib</td>
<td valign="top" align="left">Affymetrix Human Genome U133 Array</td>
<td valign="top" align="left">169 (85 responders, 84 non-responders)</td>
<td valign="top" align="center">18</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B16">Chauhan et al., 2012</xref></td>
<td valign="top" align="left">GSE39754</td>
<td valign="top" align="left">Multiple myeloma</td>
<td valign="top" align="left">Vincristine + adriamycin + dexamethasone followed by ASCT</td>
<td valign="top" align="left">Affymetrix Human Exon 1.0 ST Array</td>
<td valign="top" align="left">124 (62 responders, 62 non-responders)</td>
<td valign="top" align="center">16</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B41">Terragna et al., 2016</xref></td>
<td valign="top" align="left">GSE68871</td>
<td valign="top" align="left">Multiple myeloma</td>
<td valign="top" align="left">Bortezomib-thalidomide-dexamethasone (VTD)</td>
<td valign="top" align="left">Affymetrix Human Genome U133 Plus</td>
<td valign="top" align="left">98 (49 responders, 49 non-responders)</td>
<td valign="top" align="center">12</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B5">Amin et al., 2014</xref></td>
<td valign="top" align="left">GSE55145</td>
<td valign="top" align="left">Multiple myeloma</td>
<td valign="top" align="left">Bortezomib followed by ASCT</td>
<td valign="top" align="left">Affymetrix Human Exon 1.0 ST Array</td>
<td valign="top" align="left">56 (28 responders, 28 non-responders)</td>
<td valign="top" align="center">14</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>; <xref ref-type="bibr" rid="B45">Walz et al., 2015</xref></td>
<td valign="top" align="left">TARGET-50</td>
<td valign="top" align="left">Childhood kidney Wilms tumor</td>
<td valign="top" align="left">Vincristine sulfate + non-target drugs + conventional surgery + radiation therapy</td>
<td valign="top" align="left">Illumina HiSeq 2000</td>
<td valign="top" align="left">72 (36 responders, 36 non-responders)</td>
<td valign="top" align="center">14</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>; <xref ref-type="bibr" rid="B42">Tricoli et al., 2016</xref></td>
<td valign="top" align="left">TARGET-10</td>
<td valign="top" align="left">Childhood B acute lymphoblastic leukemia</td>
<td valign="top" align="left">Vincristine sulfate + non-target drugs</td>
<td valign="top" align="left">Illumina HiSeq 2000</td>
<td valign="top" align="left">60 (30 responders, 30 non-responders)</td>
<td valign="top" align="center">14</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref></td>
<td valign="top" align="left">TARGET-20</td>
<td valign="top" align="left">Childhood acute myeloid leukemia</td>
<td valign="top" align="left">Non-target drugs including busulfan and cyclophosphamide</td>
<td valign="top" align="left">Illumina HiSeq 2000</td>
<td valign="top" align="left">46 (23 responders, 23 non-responders)</td>
<td valign="top" align="center">10</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref></td>
<td valign="top" align="left">TARGET-20</td>
<td valign="top" align="left">Childhood acute myeloid leukemia</td>
<td valign="top" align="left">Non-target drugs excluding busulfan and cyclophosphamide</td>
<td valign="top" align="left">Illumina HiSeq 2000</td>
<td valign="top" align="left">124 (62 responders, 62 non-responders)</td>
<td valign="top" align="center">16</td></tr>
</tbody>
</table>
</table-wrap>
<p>Each patient was primarily labeled as either a responder or a non-responder to a treatment. For all the datasets taken from the GEO repository, we used the response criteria formulated in the original papers that first published these data. Namely, for the two breast cancer datasets, GSE25066 (<xref ref-type="bibr" rid="B20">Hatzis et al., 2011</xref>; <xref ref-type="bibr" rid="B23">Itoh et al., 2014</xref>) and GSE41998 (<xref ref-type="bibr" rid="B21">Horak et al., 2013</xref>), we considered <italic>partial responders</italic> as responders. For the first multiple myeloma dataset, GSE9782 (<xref ref-type="bibr" rid="B32">Mulligan et al., 2007</xref>), we took the (non)responder classification used by the authors, where patients with <italic>complete</italic> and <italic>partial response</italic> were annotated as responders, and those with <italic>no change</italic> and <italic>progressive disease</italic> as non-responders. For the three other multiple myeloma datasets, GSE39754 (<xref ref-type="bibr" rid="B16">Chauhan et al., 2012</xref>), GSE68871 (<xref ref-type="bibr" rid="B41">Terragna et al., 2016</xref>), and GSE55145 (<xref ref-type="bibr" rid="B5">Amin et al., 2014</xref>), we considered <italic>complete</italic>, <italic>near-complete</italic> and <italic>very good partial responders</italic> as responders, and <italic>partial</italic>, <italic>minor</italic> and <italic>worse</italic> responders as non-responders. For the datasets of pediatric Wilms kidney tumor, ALL and AML, extracted from the TARGET gene expression repository of the National Cancer Institute (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>), the cases were classified according to the distribution of the event-free survival time, which appeared to have two modes with different slopes (Supplementary Figure <xref ref-type="supplementary-material" rid="SM5">S1</xref>).</p>
<p>To preclude any bias in the performance of machine-learning classifiers due to unequal representation of samples in the two classes (clinical responders and non-responders), the numbers of responding and non-responding cases were equalized within each dataset. Equalization was done by taking the <italic>smaller</italic> of the two classes (responders/non-responders) in full and then randomly selecting the same number of samples from the <italic>bigger</italic> class. Thus, each resulting dataset contained equal numbers of cases classified as responders and non-responders.</p>
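<p>The equalization step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name <italic>balance_classes</italic>, the 0/1 label encoding, and the <italic>seed</italic> argument are assumptions introduced here so that the example is reproducible.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random


def balance_classes(labels, seed=0):
    """Return sorted sample indices with equal numbers of both classes.

    labels: list of 0/1 values (1 = responder, 0 = non-responder).
    seed:   hypothetical parameter, used only to make the draw reproducible.
    """
    rng = random.Random(seed)
    responders = [i for i, y in enumerate(labels) if y == 1]
    non_responders = [i for i, y in enumerate(labels) if y == 0]
    smaller, bigger = sorted([responders, non_responders], key=len)
    # Keep the smaller class in full; draw the same number of samples
    # at random (without replacement) from the bigger class.
    kept = smaller + rng.sample(bigger, len(smaller))
    return sorted(kept)
```
]]></preformat>
<p>For a label list with five responders and three non-responders, the function would keep all three non-responders and three randomly chosen responders.</p>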
<p>To engineer a plausible feature space in which SVM can be applied efficiently, we proposed to select, from tens of thousands of individual gene expression features, only a few that produce a good separation of clinical responders from non-responders. To do so, for every dataset under investigation we selected its particular top 30 genes, whose expression levels, taken one by one, had the highest ROC AUC values for distinguishing responder and non-responder profiles. We set the number of top informative features to 30 because the usual number of samples in the considered datasets was not lower than 50 (a direct heuristic number for degrees of freedom). These <italic>30 top marker genes</italic> and the response statuses (100 for a responder, 0 for a non-responder) for all selected patients from all datasets are listed in Supplementary Table <xref ref-type="supplementary-material" rid="SM1">S1</xref>.</p>
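<p>A single-gene ROC AUC ranking of this kind could look as follows. The sketch is an illustration under stated assumptions, not the original pipeline: the function names, the dictionary layout of the expression data, and the 0/1 label encoding are hypothetical, and genes are ranked by raw AUC (the text does not state whether genes with AUC far below 0.5 were also treated as informative).</p>
<preformat preformat-type="code"><![CDATA[
```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (responder, non-responder) pairs in which
    the responder has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_genes(expression, labels, n_top=30):
    """Rank genes by their individual ROC AUC and return the best n_top.

    expression: dict mapping gene name -> list of expression values,
                one value per sample (assumed layout).
    labels:     list of 0/1 response labels in the same sample order.
    """
    ranked = sorted(expression,
                    key=lambda gene: roc_auc(expression[gene], labels),
                    reverse=True)
    return ranked[:n_top]
```
]]></preformat>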
<p>To make the feature selection more robust, for each dataset having, say, <italic>N</italic> samples, a leave-one-out procedure was performed. Each individual sample was removed from the investigation one at a time, so that <italic>N</italic> subdatasets, each having <italic>N</italic>-1 individuals, were generated. For each subdataset, the ROC AUC test was performed between responders and non-responders for each gene. The genes were next sorted according to their ROC AUC, and the top 30 were selected for each subdataset. The final list of such <italic>core informative</italic> genes was generated as the intersection of the top 30 genes selected for all <italic>N</italic> subdatasets. For every dataset under investigation, these final core sets are listed in Supplementary Table <xref ref-type="supplementary-material" rid="SM2">S2</xref>; the number of core marker genes is also shown in Table <xref ref-type="table" rid="T1">1</xref>.</p>
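<p>The leave-one-out intersection described above can be outlined as below. This is only a sketch: <italic>core_marker_genes</italic> and the <italic>select_top</italic> callable are hypothetical names, and <italic>select_top</italic> stands for any routine that returns the top-ranked genes (for example, the top 30 by ROC AUC) for a given subset of samples.</p>
<preformat preformat-type="code"><![CDATA[
```python
def core_marker_genes(n_samples, select_top):
    """Intersect the top-gene lists of all leave-one-out subdatasets.

    n_samples:  number of samples N in the dataset.
    select_top: hypothetical callable; takes the list of kept sample
                indices and returns the top-ranked genes for them.
    """
    core = None
    for left_out in range(n_samples):
        # Subdataset with N-1 samples: everything except `left_out`.
        kept = [i for i in range(n_samples) if i != left_out]
        top = set(select_top(kept))
        core = top if core is None else core & top
    return core if core is not None else set()
```
]]></preformat>
<p>A gene survives only if it is ranked among the top genes in every one of the <italic>N</italic> subdatasets, which is why the core sets in Table <xref ref-type="table" rid="T1">1</xref> are smaller than 30.</p>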
</sec>
<sec><title>Data Trimming for Application in SVM</title>
<p>We developed a first-in-class data trimming<sup><xref ref-type="fn" rid="fn03">3</xref></sup> tool termed FloWPS that has the potential to improve the performance of machine learning methods. Since extrapolation is a widely recognized Achilles heel of SVM (<xref ref-type="bibr" rid="B7">Arimoto et al., 2005</xref>; <xref ref-type="bibr" rid="B9">Balabin and Lomakina, 2011</xref>; <xref ref-type="bibr" rid="B10">Balabin and Smirnov, 2012</xref>; <xref ref-type="bibr" rid="B12">Betrie et al., 2013</xref>), FloWPS avoids it by using rectangular projections along all irrelevant expression features that cause extrapolation during the SVM-based predictions for every validation point.</p>
<p>In this section we describe and investigate our data trimming procedure (FloWPS) as a preprocessing step for SVM application.</p>
<p>Since the number of samples in most of the datasets used here was relatively low, we tested our classifier using the leave-one-out cross-validation method, which introduces smaller errors than the standard five-bin cross-validation scheme generally applied to bigger datasets. According to the leave-one-out approach, each sample <italic>i</italic> = 1,&#x2026;,<italic>N</italic> serves in turn as a validation case whose response to the treatment has to be predicted, whereas all remaining samples, <italic>j</italic> = 1,&#x2026;,(<italic>i</italic>-1),(<italic>i</italic>+1),&#x2026;,<italic>N</italic>, collectively act as a training dataset, and this procedure is repeated for all the samples. For machine learning without data trimming, in a predefined feature space <bold>F</bold> = (<italic>f</italic><sub>1</sub>,&#x2026;, <italic>f<sub>s</sub></italic>), every test sample <italic>i</italic> is classified by a classifier constructed on the (<italic>N</italic>-1) samples used for training.</p>
<p>According to the current data trimming approach, instead of a fixed space <bold>F</bold> for all <italic>N</italic> testing samples, we propose using an individual space <bold>F</bold><italic><sub>i</sub></italic>, which contains individually adapted training data (of <italic>N</italic>-1 samples) for the testing sample <italic>i</italic>. It can be implemented using the following heuristics (Figure <xref ref-type="fig" rid="F1">1</xref>).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Data trimming pipeline. <bold>(A)</bold> Selection of relevant features in FloWPS according to the <italic>m</italic>-condition. A violet dot shows the position of a validation point. Turquoise dots stand for the points from the training dataset. The features (here: <italic>f<sub>1</sub></italic> and <italic>f<sub>2</sub></italic>) are considered relevant when they satisfy the criterion that at least <italic>m</italic> flanking training points must be present on both sides of the validation point along the feature-specific axis. In the example shown, the <italic>m</italic>-condition is satisfied for feature <italic>f</italic><sub>1</sub> only when <italic>m</italic> = 0, and for feature <italic>f</italic><sub>2</sub> when <italic>m</italic> &#x2264; 5. <bold>(B)</bold> After selection of the relevant features, only the <italic>k</italic> nearest neighbors in the training set are selected to construct the SVM model. In the figure, <italic>k</italic> = 4, although values of <italic>k</italic> starting from 20 were used in our calculations to build the SVM model.</p></caption>
<graphic xlink:href="fgene-09-00717-g001.tif"/>
</fig>
<p>(1) From the whole predefined feature space <bold>F</bold> = (<italic>f</italic><sub>1</sub>,&#x2026;, <italic>f</italic><sub>s</sub> ) we extract a subset <bold>F</bold><italic><sub>i</sub></italic> (<italic>m</italic>), where <italic>m</italic> is a parameter. A feature <italic>f<sub>j</sub></italic> is kept in <bold>F</bold><italic><sub>i</sub></italic>(<italic>m</italic>) if on its axis there are at least <italic>m</italic> projections from training samples, which are larger than <italic>f<sub>j</sub></italic> (<italic>i</italic>), and, at the same time, at least <italic>m</italic>, which are smaller than <italic>f<sub>j</sub></italic> (<italic>i</italic>). The procedure for extraction of a subset <bold>F</bold><italic><sub>i</sub></italic>(<italic>m</italic>) is illustrated in Figure <xref ref-type="fig" rid="F1">1A</xref> for a two-dimensional space <bold>F</bold> = (<italic>f</italic><sub>1</sub>, <italic>f</italic><sub>2</sub>). A violet point stands for the validation sample in the feature space. Turquoise dots represent scattering of the training points. For example, the <italic>m</italic>-condition for the feature <italic>f</italic><sub>2</sub> is satisfied when <italic>m</italic> = 0,1,2,3,4,5 (projection of the training set on <italic>f</italic><sub>2</sub> axis has five points both below and above the validation point), whereas for the feature <italic>f</italic><sub>1</sub> it is satisfied only for <italic>m</italic> = 0 (projection of the validation point on axis <italic>f</italic><sub>1</sub> lies outside of the cloud of training points).</p>
<p>(2) In <bold>F</bold><italic><sub>i</sub></italic> (<italic>m</italic>) we keep for training only the <italic>k</italic> closest samples (from the given (<italic>N</italic>-1) samples); <italic>k</italic> is also a parameter (Figure <xref ref-type="fig" rid="F1">1B</xref>; note that although for the sake of simplicity <italic>k</italic> = 4 in the picture, in the computational trials we varied <italic>k</italic> from 20 to <italic>N</italic>-1).</p>
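<p>Steps (1) and (2) can be illustrated with the following Python sketch. It is not the authors' implementation: the function names are hypothetical, training data are assumed to be a list of equal-length numeric rows, and the Euclidean distance is an assumption, since the distance metric for selecting the <italic>k</italic> closest samples is not specified here.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def relevant_features(train, point, m):
    """Step (1): indices of features satisfying the m-condition, i.e. with
    at least m training values strictly above and at least m strictly
    below the validation point along that feature axis."""
    keep = []
    for j, value in enumerate(point):
        above = sum(1 for row in train if row[j] > value)
        below = sum(1 for row in train if row[j] < value)
        if above >= m and below >= m:
            keep.append(j)
    return keep


def trim(train, point, m, k):
    """Steps (1)+(2): return the kept feature indices and the indices of
    the k training samples closest to the point in the reduced space.
    Euclidean distance is an assumption of this sketch."""
    feats = relevant_features(train, point, m)

    def dist(row):
        return math.sqrt(sum((row[j] - point[j]) ** 2 for j in feats))

    order = sorted(range(len(train)), key=lambda r: dist(train[r]))
    return feats, order[:k]
```
]]></preformat>
<p>With <italic>m</italic> = 0 and <italic>k</italic> = <italic>N</italic>-1 nothing is trimmed, so the subsequent classifier sees the full training set in the full feature space.</p>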
<p>Hence, for every individual <italic>i</italic> = 1,&#x2026;,<italic>N</italic> and every pair of <italic>m</italic> and <italic>k</italic> parameter values, a predicted classification value is obtained [i.e., <italic>predictions</italic> <italic>P<sub>i</sub></italic>(<italic>m</italic>,<italic>k</italic>), <italic>i</italic> = 1,&#x2026;,<italic>N</italic>]. Considering the known response status of each sample <italic>i</italic>, it is possible to calculate the AUC value for the whole set of samples as a function over the whole range of the parameters <italic>m</italic> and <italic>k</italic> (Figure <xref ref-type="fig" rid="F2">2B</xref>). Since the predicted classification efficiencies depend upon the chosen values of <italic>m</italic> and <italic>k</italic>, it is possible to interrogate the AUC values over the full lattice of all possible (<italic>m</italic>, <italic>k</italic>) pairs.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Optimization of data trimming parameters <italic>m</italic> and <italic>k</italic> for a given individual. <bold>(A)</bold> Overall scheme of prediction for an individual sample <italic>i</italic> = 1,&#x2026;,<italic>N</italic>. All individuals but one serve as a training dataset. For the training dataset at the fitting step, the AUC of the classifier prediction is calculated and plotted <bold>(B)</bold> as a function of the data trimming parameters <italic>m</italic> and <italic>k</italic>. Positions of this AUC topogram where AUC > <italic>p</italic> &#x22C5; max(AUC), <italic>p</italic> = 0.95, are considered <italic>prediction-accountable</italic> (highlighted in bright yellow) and form the prediction-accountable set <italic>S</italic>. This AUC topogram, as well as the set <italic>S</italic>, is individual for every validation point <italic>i</italic>.</p></caption>
<graphic xlink:href="fgene-09-00717-g002.tif"/>
</fig>
<p>We propose an algorithm for finding the optimal (<italic>m</italic>,<italic>k</italic>)-settings for a final classifier (Figure <xref ref-type="fig" rid="F2">2A</xref>). The AUC threshold (&#x03B8;) is set to &#x03B8; = <italic>p</italic> &#x22C5; max(AUC), where max(AUC) is the maximal value of AUC taken over the set of all possible (<italic>m</italic>, <italic>k</italic>) pairs, and the parameter <italic>p</italic> is a user-defined confidence threshold. To illustrate the performance of this approach, we took two alternative values, <italic>p</italic> = 0.95 and <italic>p</italic> = 0.90, and then considered all the (<italic>m</italic>,<italic>k</italic>) pair positions on the AUC(<italic>m</italic>,<italic>k</italic>) topogram. We next screened for the positions where AUC exceeded the threshold &#x03B8;, and the full collection of these positions was taken as the <italic>prediction-accountable set S</italic> (Figure <xref ref-type="fig" rid="F2">2B</xref>; prediction-accountable positions are shown in yellow). The final prediction of FloWPS (<italic>P<sub>F</sub></italic>) for a certain validation case is calculated by averaging the SVM predictions, <italic>P</italic>(<italic>m,k</italic>), over all positions belonging to the prediction-accountable set <italic>S</italic>, according to the formula: <italic>P<sub>F</sub></italic> = <italic>mean<sub>S</sub></italic>(<italic>P</italic>(<italic>m,k</italic>)).</p>
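<p>The thresholding and averaging rule can be written compactly as below. The sketch assumes that the AUC topogram and the SVM predictions for one validation case are already available as dictionaries keyed by (<italic>m</italic>, <italic>k</italic>) pairs; the function name and this data layout are hypothetical and serve only to illustrate the formula for <italic>P<sub>F</sub></italic>.</p>
<preformat preformat-type="code"><![CDATA[
```python
def flowps_prediction(auc, predictions, p=0.95):
    """Average the predictions over the prediction-accountable set S.

    auc:         dict mapping (m, k) -> AUC value of the topogram.
    predictions: dict mapping (m, k) -> SVM prediction P(m, k) for the
                 validation case.
    p:           user-defined threshold factor (0.95 or 0.90 in the text).
    Returns (P_F, S), where S is the list of accountable (m, k) pairs.
    """
    theta = p * max(auc.values())
    # S contains every lattice position whose AUC exceeds the threshold.
    accountable = [mk for mk, value in auc.items() if value > theta]
    if not accountable:
        # Only possible in degenerate cases, e.g. when max(AUC) is 0.
        raise ValueError("prediction-accountable set is empty")
    p_f = sum(predictions[mk] for mk in accountable) / len(accountable)
    return p_f, accountable
```
]]></preformat>
<p>Averaging over a plateau of near-optimal (<italic>m</italic>, <italic>k</italic>) positions, rather than taking the single best position, makes the final prediction less sensitive to noise in any one cell of the topogram.</p>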
<p>The usual SVM method, i.e., without FloWPS data trimming, corresponds to the bottom right corner of the AUC(<italic>m</italic>,<italic>k</italic>) topogram (Figure <xref ref-type="fig" rid="F2">2B</xref>), with the parameter settings <italic>m</italic> = 0, <italic>k</italic> = <italic>N</italic> - 1. In the example shown in Figure <xref ref-type="fig" rid="F2">2B</xref>, the classical SVM provides substantially lower accuracy than FloWPS.</p>
</sec>
<sec><title>FloWPS Performance for Default SVM Settings</title>
<p>We first investigated the performance of FloWPS on ten cancer gene expression datasets (Table <xref ref-type="table" rid="T1">1</xref>) with the default SVM settings (linear kernel and cost/penalty parameter <italic>C</italic> = 1). In our calculations, the FloWPS classifier was first fitted on the training dataset without the sample (say, <italic>i</italic>) to be classified. For all these (<italic>N</italic>-1) samples, AUC<italic><sub>i</sub></italic>(<italic>m</italic>,<italic>k</italic>) was calculated as a function of the data trimming parameters <italic>m</italic> and <italic>k</italic> (see Figure <xref ref-type="fig" rid="F2">2A</xref>). This made it possible to find the prediction-accountable set <italic>S<sub>i</sub></italic> in the AUC<italic><sub>i</sub></italic>(<italic>m</italic>,<italic>k</italic>) topogram (in Figure <xref ref-type="fig" rid="F2">2B</xref>, the set is marked in bright yellow). The <italic>m</italic> and <italic>k</italic> values from the set <italic>S<sub>i</sub></italic> were then used for data trimming and classification of the single sample <italic>i</italic>. In parallel, we applied the standard SVM algorithm with leave-one-out cross-validation without data trimming, i.e., <italic>m</italic> = 0, <italic>k</italic> = <italic>N</italic>-1 for each training sub-dataset. The comparison is shown in Table <xref ref-type="table" rid="T2">2</xref>, Supplementary Table <xref ref-type="supplementary-material" rid="SM3">S3</xref>, and Figures <xref ref-type="fig" rid="F3">3</xref>, <xref ref-type="fig" rid="F4">4</xref>.</p>
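<p>The untrimmed baseline (standard SVM with leave-one-out cross-validation) can be sketched with <italic>sklearn</italic> as follows. The sketch runs on synthetic data rather than on the datasets of this study; the helper name <italic>loo_svm_scores</italic> is hypothetical, and the use of the SVM decision value as the per-sample prediction is an assumption made for illustration.</p>
<preformat>
```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC


def loo_svm_scores(X, y, C=1.0, kernel="linear"):
    """Return one held-out SVM decision value per sample (no data trimming)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    scores = np.empty(len(y), dtype=float)
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel=kernel, C=C)
        clf.fit(X[train_idx], y[train_idx])     # all samples except one
        scores[test_idx] = clf.decision_function(X[test_idx])
    return scores


# Synthetic stand-in for an expression matrix: 40 samples, 5 features,
# with the second class shifted so that the classes are separable.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 20 + [1] * 20)
X_demo = rng.normal(size=(40, 5))
X_demo[y_demo == 1] += 1.5

demo_scores = loo_svm_scores(X_demo, y_demo, C=1.0, kernel="linear")
demo_auc = roc_auc_score(y_demo, demo_scores)
print("leave-one-out ROC AUC:", round(demo_auc, 2))
```
</preformat>
<p>Each sample is scored by a model that never saw it, so the resulting AUC is comparable in spirit to the SVM columns of the comparison, although the numbers here come from simulated data.</p>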
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Performance of clinical response classifiers for clinically annotated gene expression datasets.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset</th>
<th valign="top" align="center" colspan="6">Top 30 marker genes<hr/></th>
<th valign="top" align="center" colspan="6">Core marker genes<hr/></th>
</tr>
<tr>
<td valign="top" align="left"></td>
<th valign="top" align="center" colspan="2">SVM</th>
<th valign="top" align="center" colspan="2">FloWPS<break/><italic>p</italic> = 0.95</th>
<th valign="top" align="center" colspan="2">FloWPS<break/><italic>p</italic> = 0.90</th>
<th valign="top" align="center" colspan="2">SVM</th>
<th valign="top" align="center" colspan="2">FloWPS<break/><italic>p</italic> = 0.95</th>
<th valign="top" align="center" colspan="2">FloWPS<break/><italic>p</italic> = 0.90</th>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="2"><hr/></td>
<td valign="top" align="center" colspan="2"><hr/></td>
<td valign="top" align="center" colspan="2"><hr/></td>
<td valign="top" align="center" colspan="2"><hr/></td>
<td valign="top" align="center" colspan="2"><hr/></td>
<td valign="top" align="center" colspan="2"><hr/></td>
</tr>
<tr>
<td valign="top" align="left"></td>
<th valign="top" align="left">AUC</th>
<th valign="top" align="left">FDR</th>
<th valign="top" align="left">AUC</th>
<th valign="top" align="left">FDR</th>
<th valign="top" align="left">AUC</th>
<th valign="top" align="left">FDR</th>
<th valign="top" align="left">AUC</th>
<th valign="top" align="left">FDR</th>
<th valign="top" align="left">AUC</th>
<th valign="top" align="left">FDR</th>
<th valign="top" align="left">AUC</th>
<th valign="top" align="left">FDR</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">GSE25066 (<xref ref-type="bibr" rid="B20">Hatzis et al., 2011</xref>;<break/><xref ref-type="bibr" rid="B23">Itoh et al., 2014</xref>)</td>
<td valign="top" align="left"><bold>0.70</bold></td>
<td valign="top" align="left">0.28</td>
<td valign="top" align="left"><bold>0.76</bold></td>
<td valign="top" align="left">0.10</td>
<td valign="top" align="left"><bold>0.77</bold></td>
<td valign="top" align="left">0.13</td>
<td valign="top" align="left"><bold>0.73</bold></td>
<td valign="top" align="left">0.26</td>
<td valign="top" align="left"><bold>0.76</bold></td>
<td valign="top" align="left">0.25</td>
<td valign="top" align="left"><bold>0.76</bold></td>
<td valign="top" align="left">0.23</td>
</tr>
<tr>
<td valign="top" align="left">GSE41998 (<xref ref-type="bibr" rid="B21">Horak et al., 2013</xref>)</td>
<td valign="top" align="left"><bold>0.79</bold></td>
<td valign="top" align="left">0.25</td>
<td valign="top" align="left"><bold>0.87</bold></td>
<td valign="top" align="left">0.14</td>
<td valign="top" align="left"><bold>0.91</bold></td>
<td valign="top" align="left">0.14</td>
<td valign="top" align="left"><bold>0.87</bold></td>
<td valign="top" align="left">0.14</td>
<td valign="top" align="left"><bold>0.89</bold></td>
<td valign="top" align="left">0.15</td>
<td valign="top" align="left"><bold>0.92</bold></td>
<td valign="top" align="left">0.12</td>
</tr>
<tr>
<td valign="top" align="left">GSE9782 (<xref ref-type="bibr" rid="B32">Mulligan et al., 2007</xref>)</td>
<td valign="top" align="left"><bold>0.73</bold></td>
<td valign="top" align="left">0.28</td>
<td valign="top" align="left"><bold>0.78</bold></td>
<td valign="top" align="left">0.22</td>
<td valign="top" align="left"><bold>0.76</bold></td>
<td valign="top" align="left">0.17</td>
<td valign="top" align="left"><bold>0.68</bold></td>
<td valign="top" align="left">0.33</td>
<td valign="top" align="left"><bold>0.71</bold></td>
<td valign="top" align="left">0.33</td>
<td valign="top" align="left"><bold>0.72</bold></td>
<td valign="top" align="left">0.34</td>
</tr>
<tr>
<td valign="top" align="left">GSE39754 (<xref ref-type="bibr" rid="B16">Chauhan et al., 2012</xref>)</td>
<td valign="top" align="left"><bold>0.65</bold></td>
<td valign="top" align="left">0.36</td>
<td valign="top" align="left"><bold>0.68</bold></td>
<td valign="top" align="left">0.27</td>
<td valign="top" align="left"><bold>0.71</bold></td>
<td valign="top" align="left">0.34</td>
<td valign="top" align="left"><bold>0.65</bold></td>
<td valign="top" align="left">0.36</td>
<td valign="top" align="left"><bold>0.68</bold></td>
<td valign="top" align="left">0.36</td>
<td valign="top" align="left"><bold>0.72</bold></td>
<td valign="top" align="left">0.35</td>
</tr>
<tr>
<td valign="top" align="left">GSE68871 (<xref ref-type="bibr" rid="B41">Terragna et al., 2016</xref>)</td>
<td valign="top" align="left"><bold>0.66</bold></td>
<td valign="top" align="left">0.35</td>
<td valign="top" align="left"><bold>0.75</bold></td>
<td valign="top" align="left">0.25</td>
<td valign="top" align="left"><bold>0.74</bold></td>
<td valign="top" align="left">0.27</td>
<td valign="top" align="left"><bold>0.68</bold></td>
<td valign="top" align="left">0.33</td>
<td valign="top" align="left"><bold>0.78</bold></td>
<td valign="top" align="left">0.20</td>
<td valign="top" align="left"><bold>0.77</bold></td>
<td valign="top" align="left">0.24</td>
</tr>
<tr>
<td valign="top" align="left">GSE55145 (<xref ref-type="bibr" rid="B5">Amin et al., 2014</xref>)</td>
<td valign="top" align="left"><bold>0.84</bold></td>
<td valign="top" align="left">0.19</td>
<td valign="top" align="left"><bold>0.86</bold></td>
<td valign="top" align="left">0.11</td>
<td valign="top" align="left"><bold>0.90</bold></td>
<td valign="top" align="left">0.11</td>
<td valign="top" align="left"><bold>0.77</bold></td>
<td valign="top" align="left">0.24</td>
<td valign="top" align="left"><bold>0.81</bold></td>
<td valign="top" align="left">0.19</td>
<td valign="top" align="left"><bold>0.82</bold></td>
<td valign="top" align="left">0.06</td>
</tr>
<tr>
<td valign="top" align="left">TARGET-50 (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>; <xref ref-type="bibr" rid="B45">Walz et al., 2015</xref>)</td>
<td valign="top" align="left"><bold>0.64</bold></td>
<td valign="top" align="left">0.35</td>
<td valign="top" align="left"><bold>0.75</bold></td>
<td valign="top" align="left">0.13</td>
<td valign="top" align="left"><bold>0.78</bold></td>
<td valign="top" align="left">0.16</td>
<td valign="top" align="left"><bold>0.72</bold></td>
<td valign="top" align="left">0.26</td>
<td valign="top" align="left"><bold>0.81</bold></td>
<td valign="top" align="left">0.08</td>
<td valign="top" align="left"><bold>0.82</bold></td>
<td valign="top" align="left">0.09</td>
</tr>
<tr>
<td valign="top" align="left">TARGET-10 (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>; <xref ref-type="bibr" rid="B42">Tricoli et al., 2016</xref>)</td>
<td valign="top" align="left"><bold>0.85</bold></td>
<td valign="top" align="left">0.16</td>
<td valign="top" align="left"><bold>0.86</bold></td>
<td valign="top" align="left">0.14</td>
<td valign="top" align="left"><bold>0.87</bold></td>
<td valign="top" align="left">0.12</td>
<td valign="top" align="left"><bold>0.87</bold></td>
<td valign="top" align="left">0.13</td>
<td valign="top" align="left"><bold>0.94</bold></td>
<td valign="top" align="left">0.07</td>
<td valign="top" align="left"><bold>0.94</bold></td>
<td valign="top" align="left">0.04</td>
</tr>
<tr>
<td valign="top" align="left">TARGET-20 (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>) with busulfan and cyclophosphamide</td>
<td valign="top" align="left"><bold>0.74</bold></td>
<td valign="top" align="left">0.26</td>
<td valign="top" align="left"><bold>0.79</bold></td>
<td valign="top" align="left">0.16</td>
<td valign="top" align="left"><bold>0.79</bold></td>
<td valign="top" align="left">0.17</td>
<td valign="top" align="left"><bold>0.76</bold></td>
<td valign="top" align="left">0.23</td>
<td valign="top" align="left"><bold>0.77</bold></td>
<td valign="top" align="left">0.22</td>
<td valign="top" align="left"><bold>0.83</bold></td>
<td valign="top" align="left">0.00</td>
</tr>
<tr>
<td valign="top" align="left">TARGET-20 (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>) w/o busulfan and cyclophosphamide</td>
<td valign="top" align="left"><bold>0.73</bold></td>
<td valign="top" align="left">0.28</td>
<td valign="top" align="left"><bold>0.76</bold></td>
<td valign="top" align="left">0.30</td>
<td valign="top" align="left"><bold>0.76</bold></td>
<td valign="top" align="left">0.27</td>
<td valign="top" align="left"><bold>0.74</bold></td>
<td valign="top" align="left">0.26</td>
<td valign="top" align="left"><bold>0.77</bold></td>
<td valign="top" align="left">0.13</td>
<td valign="top" align="left"><bold>0.79</bold></td>
<td valign="top" align="left">0.11</td></tr>
<tr>
<td valign="top" align="left"></td></tr></tbody></table>
<table-wrap-foot>
<attrib><italic>Area-under-curve (AUC) and false discovery rate (FDR) values calculated for each version of a classifier are shown. All calculations were made using leave-one-out cross-validation approach.</italic></attrib>
</table-wrap-foot>
</table-wrap>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Distribution (violin plots, with each instance shown as a red/green dot) of FloWPS predictions (<italic>P<sub>F</sub></italic>) for patients without (red plots and dots) and with (green plots and dots) positive clinical response to chemotherapy treatment. For FloWPS, <italic>core marker genes</italic> and <italic>p</italic> = 0.90 settings were used. The black horizontal line shows the discrimination threshold (&#x03C4;) between responders and non-responders for each classifier. Panels represent different data sources, <bold>(A)</bold> GSE25066; <bold>(B)</bold> GSE41998; <bold>(C)</bold> GSE9782; <bold>(D)</bold> GSE39754; <bold>(E)</bold> GSE68871; <bold>(F)</bold> GSE55145; <bold>(G)</bold> TARGET-50; <bold>(H)</bold> TARGET-10; <bold>(I)</bold> and <bold>(J)</bold>: TARGET-20 with and without busulfan and cyclophosphamide, respectively.</p></caption>
<graphic xlink:href="fgene-09-00717-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Receiver operating characteristic (ROC) curves showing the dependence of sensitivity (<italic>Sn</italic>) on specificity (<italic>Sp</italic>) for the FloWPS-based classifier of treatment response for datasets with <italic>core marker genes</italic>. Red dots: confidence parameter <italic>p</italic> = 0.95, blue dots: <italic>p</italic> = 0.90. Panels represent different clinically annotated datasets, <bold>(A)</bold> GSE25066; <bold>(B)</bold> GSE41998; <bold>(C)</bold> GSE9782; <bold>(D)</bold> GSE39754; <bold>(E)</bold> GSE68871; <bold>(F)</bold> GSE55145; <bold>(G)</bold> TARGET-50; <bold>(H)</bold> TARGET-10; <bold>(I,J)</bold> TARGET-20 with and without busulfan and cyclophosphamide, respectively.</p></caption>
<graphic xlink:href="fgene-09-00717-g004.tif"/>
</fig>
<p>The discrimination threshold (&#x03C4;), which is shown as a black horizontal line in Figure <xref ref-type="fig" rid="F3">3</xref>, was set to minimize the sum of false positive (FP) and false negative (FN) predictions. Any sample with a FloWPS prediction value above &#x03C4; is classified as a responder, and any sample below it as a non-responder.</p>
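<p>One straightforward way to choose such a threshold is an exhaustive scan over candidate cut points, sketched below in plain Python. The function name <italic>best_threshold</italic>, the use of midpoints between neighbouring prediction values as candidates, and the rule for breaking ties (the lowest candidate wins) are assumptions of this sketch; the source specifies only the criterion being minimized.</p>
<preformat>
```python
def best_threshold(scores, labels):
    """Return (tau, errors) minimizing FP + FN.

    labels: 1 for responders, 0 for non-responders.
    A sample is called a responder when its score is strictly above tau.
    """
    values = sorted(set(scores))
    # Candidates: one value below all scores, the midpoints between
    # neighbouring distinct scores, and one value above all scores.
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)

    best = None
    for tau in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and not s > tau)
        if best is None or best[1] > fp + fn:   # keep the first minimum
            best = (tau, fp + fn)
    return best


# Toy predictions: one non-responder (0.6) scores above one responder (0.35),
# so at least one error is unavoidable.
toy_scores = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1]
tau, errors = best_threshold(toy_scores, toy_labels)
print("tau:", tau, "FP + FN:", errors)
```
</preformat>
<p>In the toy data two cut points reach the minimum of one error (one with a single FP, one with a single FN), which shows that the criterion alone does not always determine &#x03C4; uniquely.</p>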
<p>For every dataset, confidence parameter <italic>p</italic>, and gene selection scheme, the FloWPS classifier showed a ROC AUC exceeding the corresponding value for the classical SVM (Table <xref ref-type="table" rid="T2">2</xref>). For three datasets out of ten, the AUC for the classical SVM with the top 30 marker genes was between 0.64 and 0.68. In all these cases, applying FloWPS with confidence level <italic>p</italic> = 0.90 yielded substantially better AUC values, ranging from 0.71 to 0.78.</p>
<p>Comparison of classifier quality by another metric, the FDR<sup><xref ref-type="fn" rid="fn04">4</xref></sup>, gave similar results: FDR was lower for FloWPS than for classical SVM in almost all cases (Table <xref ref-type="table" rid="T2">2</xref>, columns without boldface font). Other metrics, such as sensitivity (Sn), specificity (Sp), accuracy rate (ACC) and MCC<sup><xref ref-type="fn" rid="fn05">5</xref></sup>, also tend to be higher for FloWPS than for classical SVM without data trimming (Supplementary Table <xref ref-type="supplementary-material" rid="SM3">S3</xref>).</p>
</sec>
<sec><title>FloWPS Performance at Different Settings and Comparison With Alternative Data Reduction Approach</title>
<p>Although classifier quality tended to be higher with data trimming than with default SVM settings, the size of the advantage differed between cancer datasets. We therefore investigated FloWPS performance for different SVM kernels (linear vs. polynomial) and different values of the cost/penalty parameter <italic>C</italic> (ranging from 0.1 to 1000; Figure <xref ref-type="fig" rid="F5">5</xref> and Supplementary Table <xref ref-type="supplementary-material" rid="SM4">S4</xref>). These calculations were done for the core marker gene datasets and FloWPS confidence parameter <italic>p</italic> = 0.90. The advantage of FloWPS over SVM is more pronounced under conditions prone to SVM overtraining, e.g., for the linear kernel with high values of the cost/penalty parameter (<italic>C</italic> = 100 or 1000) or for the polynomial kernel, where SVM may be easily overfitted. FloWPS counteracts such overfitting, thus raising AUC and decreasing FDR. The same pattern was also seen for the Sn, Sp, ACC and MCC values (Supplementary Table <xref ref-type="supplementary-material" rid="SM4">S4</xref>).</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>AUC and FDR for the (non)responder classifier as a function of the cost/penalty parameter <italic>C</italic> for classical SVM (without data trimming) and FloWPS, for both linear and polynomial kernels. Calculations were done for core marker gene datasets and confidence parameter <italic>p</italic> = 0.90. Different panels represent different datasets, <bold>(A)</bold> GSE25066; <bold>(B)</bold> GSE41998; <bold>(C)</bold> GSE9782; <bold>(D)</bold> GSE39754; <bold>(E)</bold> GSE68871; <bold>(F)</bold> GSE55145; <bold>(G)</bold> TARGET-50; <bold>(H)</bold> TARGET-10; <bold>(I,J)</bold> TARGET-20 with and without busulfan and cyclophosphamide, respectively. <bold>(K)</bold> Legend showing FloWPS and SVM modifications.</p></caption>
<graphic xlink:href="fgene-09-00717-g005.tif"/>
</fig>
<p>Note that FloWPS is not the only possible data reduction/feature selection method that may be used for preprocessing to improve classifier quality. To try a simple alternative to FloWPS, which is, however, not specific to individual samples, we performed calculations based on principal components (PCs) obtained by principal component analysis (PCA) rather than on the original features. The number of PCs used to build the SVM model may act as a parameter that is optimized in a manner similar to the optimization of <italic>m</italic> and <italic>k</italic> for FloWPS. Namely, the maximum of AUC as a function of the number of PCs is found, and the corresponding number is then used as the optimal number of PCs for SVM-based prediction.</p>
<p>Thus, we compared classifier quality for three methods, namely classical SVM without data reduction, PCA-assisted SVM with a pre-trained number of PCs, and FloWPS with the confidence parameter <italic>p</italic> = 0.90 (Table <xref ref-type="table" rid="T3">3</xref>; note that both classical SVM and FloWPS calculations were done using gene expression features rather than PCs). The calculations were done for core marker gene datasets and cost/penalty SVM parameters <italic>C</italic> = 1 and 100. For the linear kernel, several datasets had comparable AUC for simple PCA-assisted data reduction and for FloWPS (Table <xref ref-type="table" rid="T3">3</xref>). For the polynomial kernel, however, FloWPS outperformed PCA-assisted data reduction in nearly all cases, most likely because of the higher risk of overtraining for SVM with nonlinear kernels.</p>
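<p>The PCA-assisted baseline can be sketched with <italic>sklearn</italic> as below. The sketch is a simplified illustration on synthetic data: the helper name <italic>best_pc_number</italic> is hypothetical, and selecting the number of PCs by leave-one-out AUC over the whole dataset is an assumption, since the exact way the PC number was pre-trained is not detailed here.</p>
<preformat>
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def best_pc_number(X, y, max_pcs, C=1.0, kernel="linear"):
    """Return (n_pcs, auc) maximizing leave-one-out AUC over 1..max_pcs."""
    best_n, best_auc = None, -np.inf
    for n in range(1, max_pcs + 1):
        # PCA is refitted inside every fold, so the held-out sample
        # never influences the projection it is evaluated in.
        model = make_pipeline(PCA(n_components=n), SVC(kernel=kernel, C=C))
        scores = cross_val_predict(model, X, y, cv=LeaveOneOut(),
                                   method="decision_function")
        auc = roc_auc_score(y, scores)
        if auc > best_auc:
            best_n, best_auc = n, auc
    return best_n, best_auc


# Synthetic data: 30 samples, 6 features, classes differ in two features.
rng = np.random.default_rng(1)
y_pca = np.array([0] * 15 + [1] * 15)
X_pca = rng.normal(size=(30, 6))
X_pca[y_pca == 1, :2] += 2.0

n_pcs, pca_auc = best_pc_number(X_pca, y_pca, max_pcs=4, C=1.0)
print("selected number of PCs:", n_pcs, "AUC:", round(pca_auc, 2))
```
</preformat>
<p>Because the same AUC is used both to pick the number of PCs and to report quality, the estimate from this sketch is optimistic; a nested scheme that tunes the PC number on the training part only would avoid this.</p>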
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>AUC of (non)responder classifier for classical SVM without data reduction (SVM), PCA-assisted SVM (PCA) and FloWPS with confidence parameter <italic>p</italic> = 0.90.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset</th>
<th valign="top" align="center" colspan="6">Linear kernel<hr/></th>
<th valign="top" align="center" colspan="6">Polynomial kernel<hr/></th>
</tr>
<tr>
<td valign="top" align="left"></td>
<th valign="top" align="center" colspan="3"><italic>C</italic> = 1</th>
<th valign="top" align="center" colspan="3"><italic>C</italic> = 100</th>
<th valign="top" align="center" colspan="3"><italic>C</italic> = 1</th>
<th valign="top" align="center" colspan="3"><italic>C</italic> = 100</th>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="3"><hr/></td>
<td valign="top" align="center" colspan="3"><hr/></td>
<td valign="top" align="center" colspan="3"><hr/></td>
<td valign="top" align="center" colspan="3"><hr/></td>
</tr>
<tr>
<td valign="top" align="left"></td>
<th valign="top" align="center">SVM</th>
<th valign="top" align="center">PCA</th>
<th valign="top" align="center">FloWPS</th>
<th valign="top" align="center">SVM</th>
<th valign="top" align="center">PCA</th>
<th valign="top" align="center">FloWPS</th>
<th valign="top" align="center">SVM</th>
<th valign="top" align="center">PCA</th>
<th valign="top" align="center">FloWPS</th>
<th valign="top" align="center">SVM</th>
<th valign="top" align="center">PCA</th>
<th valign="top" align="center">FloWPS</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">GSE25066 (<xref ref-type="bibr" rid="B20">Hatzis et al., 2011</xref>;<break/><xref ref-type="bibr" rid="B23">Itoh et al., 2014</xref>)</td>
<td valign="top" align="center">0.73</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.63</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.63</td>
<td valign="top" align="center">0.66</td>
<td valign="top" align="center">0.75</td>
</tr>
<tr>
<td valign="top" align="left">GSE41998 (<xref ref-type="bibr" rid="B21">Horak et al., 2013</xref>)</td>
<td valign="top" align="center">0.87</td>
<td valign="top" align="center">0.84</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.88</td>
<td valign="top" align="center">0.86</td>
<td valign="top" align="center">0.60</td>
<td valign="top" align="center">0.62</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.81</td>
</tr>
<tr>
<td valign="top" align="left">GSE9782 (<xref ref-type="bibr" rid="B32">Mulligan et al., 2007</xref>)</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.60</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.62</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.73</td>
<td valign="top" align="center">0.64</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.76</td>
</tr>
<tr>
<td valign="top" align="left">GSE39754 (<xref ref-type="bibr" rid="B16">Chauhan et al., 2012</xref>)</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.56</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.71</td>
<td valign="top" align="center">0.66</td>
<td valign="top" align="center">0.61</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">0.61</td>
<td valign="top" align="center">0.68</td>
</tr>
<tr>
<td valign="top" align="left">GSE68871 (<xref ref-type="bibr" rid="B41">Terragna et al., 2016</xref>)</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.64</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.74</td>
</tr>
<tr>
<td valign="top" align="left">GSE55145 (<xref ref-type="bibr" rid="B5">Amin et al., 2014</xref>)</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.84</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.84</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">0.63</td>
<td valign="top" align="center">0.73</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.73</td>
<td valign="top" align="center">0.83</td>
</tr>
<tr>
<td valign="top" align="left">TARGET-50 (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>; <xref ref-type="bibr" rid="B45">Walz et al., 2015</xref>)</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.81</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.64</td>
<td valign="top" align="center">0.73</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.74</td>
</tr>
<tr>
<td valign="top" align="left">TARGET-10 (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>; <xref ref-type="bibr" rid="B42">Tricoli et al., 2016</xref>)</td>
<td valign="top" align="center">0.87</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">0.78</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.86</td>
</tr>
<tr>
<td valign="top" align="left">TARGET-20 (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>) with busulfan and cyclophosphamide</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.78</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.70</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.63</td>
<td valign="top" align="center">0.63</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.82</td>
</tr>
<tr>
<td valign="top" align="left">TARGET-20 (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>) w/o busulfan and cyclophosphamide</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.81</td>
<td valign="top" align="center">0.79</td>
<td valign="top" align="center">0.65</td>
<td valign="top" align="center">0.79</td>
<td valign="top" align="center">0.79</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.68</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.79</td>
</tr>
<tr>
<td valign="top" align="left"></td></tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec><title>Discussion</title>
<p>It has been observed previously that SVM sometimes fails when used to distinguish fine biomedical properties, such as disease progression prognosis or the clinical efficiency of drugs for an individual patient, from high throughput molecular data, e.g., complete DNA mutation or gene expression profiles (<xref ref-type="bibr" rid="B34">Ray and Zhang, 2009</xref>; <xref ref-type="bibr" rid="B8">Babaoglu et al., 2010</xref>). In particular, for many biologically relevant applications, SVM either proved entirely unable to predict drug sensitivity (<xref ref-type="bibr" rid="B43">Turki and Wei, 2016</xref>) or performed worse than competing machine learning methods (<xref ref-type="bibr" rid="B18">Davoudi et al., 2017</xref>; <xref ref-type="bibr" rid="B17">Cho et al., 2018</xref>; <xref ref-type="bibr" rid="B24">Jeong et al., 2018</xref>; <xref ref-type="bibr" rid="B28">Leite et al., 2018</xref>; <xref ref-type="bibr" rid="B35">Sauer et al., 2018</xref>; <xref ref-type="bibr" rid="B47">Yosipof et al., 2018</xref>). Thus, a tool for improving SVM performance is needed.</p>
<p>In this study, we investigated ten sets of gene expression data for cancer patients treated with different anti-cancer drugs with known clinical outcomes, where the original dimensionality of each sample (patient) is many hundreds of times larger than the number of patients. The first problem in such applications was therefore to extract an appropriate number of features, in whose space one could build a classifier-predictor of high quality. Many authors have addressed this preprocessing problem (<xref ref-type="bibr" rid="B38">Tan and Gilbert, 2003</xref>; <xref ref-type="bibr" rid="B26">Kourou et al., 2015</xref>; <xref ref-type="bibr" rid="B39">Tan, 2016</xref>; <xref ref-type="bibr" rid="B29">Liu et al., 2017</xref>; <xref ref-type="bibr" rid="B40">Tarek et al., 2017</xref>). Some feature selection methods, like the DWFS wrapping tool (<xref ref-type="bibr" rid="B37">Soufan et al., 2015</xref>), use sophisticated approaches such as genetic algorithms to improve classifier quality. In this paper we proposed one more method, FloWPS, which differs substantially from the known ones. Its critical characteristic is that, for every new sample whose class has to be predicted, the method extracts an individual feature subspace and, moreover, selects within that subspace an appropriate subset of samples as training data.</p>
<p>The FloWPS data trimming method combines the advantages of both <italic>global</italic> (like SVM) and <italic>local</italic> (like kNN) (<xref ref-type="bibr" rid="B4">Altman, 1992</xref>) methods of machine learning, and acts successfully even when purely local and global approaches fail. The failure of SVM, which we observed for at least 3 out of 10 datasets in the current study (Table <xref ref-type="table" rid="T2">2</xref>), means that there is no strict <italic>distant order</italic> in the placement of responder and non-responder points in the space of gene expression features. Yet, the lack of <italic>distant</italic> order does not necessarily mean the absence of <italic>local</italic> order (Figure <xref ref-type="fig" rid="F6">6</xref>). The latter may be detected using <italic>local</italic> methods such as kNN, which has been confirmed by our FloWPS results (Table <xref ref-type="table" rid="T2">2</xref> and Figures <xref ref-type="fig" rid="F3">3</xref>, <xref ref-type="fig" rid="F5">5</xref>). The FloWPS advantages are more apparent for SVM with the polynomial kernel than with the linear kernel, owing to the higher risk of overtraining for such models (Figure <xref ref-type="fig" rid="F5">5</xref> and Table <xref ref-type="table" rid="T3">3</xref>).</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p><bold>(A)</bold> Global machine learning methods, such as SVM, may fail to separate classes in datasets without global order. <bold>(B)</bold> Machine-learning with data trimming works locally and may separate classes more accurately.</p></caption>
<graphic xlink:href="fgene-09-00717-g006.tif"/>
</fig>
<p>We hypothesize that FloWPS and data trimming may also be helpful for improving other learning methods based on multi-omics data, including the currently flourishing deep learning approaches (<xref ref-type="bibr" rid="B11">Bengio et al., 2013</xref>; <xref ref-type="bibr" rid="B27">LeCun et al., 2015</xref>; <xref ref-type="bibr" rid="B36">Schmidhuber, 2015</xref>).</p>
</sec>
<sec id="s1" sec-type="materials|methods">
<title>Materials and Methods</title>
<sec><title>Preprocessing of Gene Expression Data</title>
<p>For the datasets obtained on Affymetrix microarray hybridization platforms, gene expression data were taken from the series matrices deposited in the GEO public repository and then quantile-normalized (<xref ref-type="bibr" rid="B14">Bolstad et al., 2003</xref>) using the R package <italic>preprocessCore</italic> (<xref ref-type="bibr" rid="B13">Bolstad, 2018</xref>). All pediatric datasets taken from the TARGET database (<xref ref-type="bibr" rid="B19">Goldman et al., 2015</xref>) contained results of NGS mRNA profiling on the Illumina HiSeq 2000 platform; they were normalized using the R package <italic>DESeq2</italic> (<xref ref-type="bibr" rid="B30">Love et al., 2014</xref>).</p>
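<p>For readers unfamiliar with quantile normalization, its core idea can be sketched in a few lines of Python with <italic>numpy</italic>. This is an illustration only and not the <italic>preprocessCore</italic> implementation used in this study: the function name <italic>quantile_normalize</italic> is hypothetical, the matrix is assumed to be genes by samples without missing values, and tied values are handled by order of appearance rather than by averaging.</p>
<preformat>
```python
import numpy as np


def quantile_normalize(expr):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Every column is given the same empirical distribution: the mean of the
    sorted columns. Simplification: ties are broken by order of appearance.
    """
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(expr, axis=0, kind="stable")   # per-sample ranking
    reference = np.sort(expr, axis=0).mean(axis=1)    # mean value per rank
    normalized = np.empty_like(expr)
    for j in range(expr.shape[1]):
        # the smallest value of sample j receives reference[0], and so on
        normalized[order[:, j], j] = reference
    return normalized


# Toy matrix: 4 genes (rows) measured in 3 samples (columns).
toy_expr = np.array([[5.0, 4.0, 3.0],
                     [2.0, 1.0, 4.0],
                     [3.0, 4.0, 6.0],
                     [4.0, 2.0, 8.0]])
toy_norm = quantile_normalize(toy_expr)
print(np.round(toy_norm, 2))
```
</preformat>
<p>After the transformation, every sample contains exactly the same set of values, so that differences between samples reflect only the ranking of genes within each sample.</p>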
</sec>
<sec><title>SVM Calculations</title>
<p>All the SVM calculations with linear and polynomial kernels were performed using the Python package <italic>sklearn</italic> (<xref ref-type="bibr" rid="B33">Pedregosa et al., 2012</xref>) that employs the C++ library &#x2018;libsvm&#x2019; (<xref ref-type="bibr" rid="B15">Chang and Lin, 2011</xref>). The penalty parameter <italic>C</italic> varied from 0.1 to 1000 for different calculations. Other SVM parameters had the default settings for the <italic>sklearn</italic> package.</p>
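<p>A minimal sketch of such a calculation with <italic>sklearn</italic> is given below. It is an illustration rather than the code used in this study: the synthetic data, the particular grid of <italic>C</italic> values within the 0.1 to 1000 range, and the use of five-fold cross-validated AUC as the score are all assumptions made for the example.</p>
<preformat>
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix (samples x features);
# real responder/non-responder data would be loaded here instead.
X, y = make_classification(n_samples=80, n_features=20,
                           n_informative=5, random_state=0)

# Hypothetical grid of penalty values inside the 0.1 to 1000 range.
c_grid = [0.1, 1.0, 10.0, 100.0, 1000.0]

results = {}
for kernel in ("linear", "poly"):
    for c in c_grid:
        # All other SVC parameters keep their sklearn defaults.
        clf = SVC(kernel=kernel, C=c)
        # Five-fold cross-validated AUC (an assumed evaluation scheme).
        scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
        results[(kernel, c)] = scores.mean()

for (kernel, c), auc in sorted(results.items()):
    print(kernel, c, round(auc, 3))
```
</preformat>
<p>Comparing the two kernels across the same values of <italic>C</italic> shows how sensitive each model is to the penalty parameter on a given dataset.</p>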
</sec>
<sec><title>Plot Preparations</title>
<p>AUC(<italic>m</italic>,<italic>k</italic>) topograms, like Figure <xref ref-type="fig" rid="F2">2B</xref>, were plotted using the <italic>matplotlib</italic> Python library (<xref ref-type="bibr" rid="B22">Hunter, 2007</xref>). Violin plots of FloWPS predictions for responders and non-responders (see Figure <xref ref-type="fig" rid="F3">3</xref>) were plotted using the <italic>ggplot2</italic> R package (<xref ref-type="bibr" rid="B46">Wilkinson, 2011</xref>).</p>
</sec>
</sec>
<sec><title>Availability of Data and Materials</title>
<p>The datasets analyzed during the current study are available in the GEO and TARGET repositories:</p>
<p><ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25066">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25066</ext-link></p>
<p><ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41998">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41998</ext-link></p>
<p><ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9782">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9782</ext-link></p>
<p><ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39754">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39754</ext-link></p>
<p><ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68871">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68871</ext-link></p>
<p><ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55145">https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55145</ext-link></p>
<p><ext-link ext-link-type="uri" xlink:href="ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/WT/mRNA-seq/">ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/WT/mRNA-seq/</ext-link></p>
<p><ext-link ext-link-type="uri" xlink:href="ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/AML/mRNA-seq/">ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/AML/mRNA-seq/</ext-link></p>
<p><ext-link ext-link-type="uri" xlink:href="ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/ALL/mRNA-seq/">ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/ALL/mRNA-seq/</ext-link></p>
<p>The Python module that performs data trimming according to the FloWPS method for different values of the parameters <italic>m</italic> and <italic>k</italic>, the R code that makes FloWPS predictions from the results obtained with the Python module, and a README manual explaining how to use this code have been deposited on GitLab and are available at: <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/oncobox/flowps">https://gitlab.com/oncobox/flowps</ext-link>.</p>
</sec>
<sec><title>Ethics Statement</title>
<p>The current research did not involve any new human material. All gene expression data used in this research were taken from the publicly available repositories Gene Expression Omnibus (GEO) and TARGET, and had previously been anonymized by the teams that had worked with them.</p>
</sec>
<sec><title>Author Contributions</title>
<p>NB designed the overall research, suggested the principles of data trimming and the prediction-accountable set, and wrote most parts of the manuscript. VT performed most of the calculations. MS suggested datasets with clinical responders and non-responders and performed feature selection. AM wrote the initial version of the computational code. AS adapted this code for parallel calculations. AG tested and debugged the computational code. IM and AB substantially improved the manuscript after the draft version had been prepared. AB performed the overall scientific supervision of the project.</p>
</sec>
<sec><title>Conflict of Interest Statement</title>
<p>VT, MS, AS, AG, AB, and NB were employed by OmicsWay Corporation, Walnut, CA, United States. AM was employed by Yandex N.V. Corporation, Moscow, Russia. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="financial-disclosure">
<p><bold>Funding.</bold> This work was supported by Amazon and Microsoft Azure grants for cloud-based computational facilities. We thank the Oncobox/OmicsWay research program in machine learning and digital oncology for providing software and pathway databases for this study. Financial support was provided by the Russian Science Foundation grant no. 18-15-00061.</p>
</fn>
</fn-group>
<sec sec-type="supplementary material">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fgene.2018.00717/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fgene.2018.00717/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Table_1.xlsx" id="SM1" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_2.xlsx" id="SM2" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_3.xlsx" id="SM3" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table_4.xlsx" id="SM4" mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Data_Sheet_1.docx" id="SM5" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahmed</surname> <given-names>F.</given-names></name> <name><surname>Ansari</surname> <given-names>H. R.</given-names></name> <name><surname>Raghava</surname> <given-names>G. P. S.</given-names></name></person-group> (<year>2009a</year>). <article-title>Prediction of guide strand of microRNAs from its sequence and secondary structure.</article-title> <source><italic>BMC Bioinformatics</italic></source> <volume>10</volume>:<issue>105</issue>. <pub-id pub-id-type="doi">10.1186/1471-2105-10-105</pub-id> <pub-id pub-id-type="pmid">19358699</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahmed</surname> <given-names>F.</given-names></name> <name><surname>Kumar</surname> <given-names>M.</given-names></name> <name><surname>Raghava</surname> <given-names>G. P. S.</given-names></name></person-group> (<year>2009b</year>). <article-title>Prediction of polyadenylation signals in human DNA sequences using nucleotide frequencies.</article-title> <source><italic>In Silico Biol.</italic></source> <volume>9</volume> <fpage>135</fpage>&#x2013;<lpage>148</lpage>. <pub-id pub-id-type="pmid">19795571</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahmed</surname> <given-names>F.</given-names></name> <name><surname>Kaundal</surname> <given-names>R.</given-names></name> <name><surname>Raghava</surname> <given-names>G. P. S.</given-names></name></person-group> (<year>2013</year>). <article-title>PHDcleav: a SVM based method for predicting human Dicer cleavage sites using sequence and secondary structure of miRNA precursors.</article-title> <source><italic>BMC Bioinformatics</italic></source> <volume>14(Suppl. 14)</volume>:<issue>S9</issue>. <pub-id pub-id-type="doi">10.1186/1471-2105-14-S14-S9</pub-id> <pub-id pub-id-type="pmid">24267009</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altman</surname> <given-names>N. S.</given-names></name></person-group> (<year>1992</year>). <article-title>An introduction to kernel and nearest-neighbor nonparametric regression.</article-title> <source><italic>Am. Stat.</italic></source> <volume>46</volume> <fpage>175</fpage>&#x2013;<lpage>185</lpage>. <pub-id pub-id-type="doi">10.1080/00031305.1992.10475879</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Amin</surname> <given-names>S. B.</given-names></name> <name><surname>Yip</surname> <given-names>W.-K.</given-names></name> <name><surname>Minvielle</surname> <given-names>S.</given-names></name> <name><surname>Broyl</surname> <given-names>A.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Hanlon</surname> <given-names>B.</given-names></name><etal/></person-group> (<year>2014</year>). <article-title>Gene expression profile alone is inadequate in predicting complete response in multiple myeloma.</article-title> <source><italic>Leukemia</italic></source> <volume>28</volume> <fpage>2229</fpage>&#x2013;<lpage>2234</lpage>. <pub-id pub-id-type="doi">10.1038/leu.2014.140</pub-id> <pub-id pub-id-type="pmid">24732597</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ansari</surname> <given-names>H. R.</given-names></name> <name><surname>Raghava</surname> <given-names>G. P.</given-names></name></person-group> (<year>2010</year>). <article-title>Identification of conformational B-cell epitopes in an antigen from its primary sequence.</article-title> <source><italic>Immunome Res.</italic></source> <volume>6</volume>:<issue>6</issue>. <pub-id pub-id-type="doi">10.1186/1745-7580-6-6</pub-id> <pub-id pub-id-type="pmid">20961417</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arimoto</surname> <given-names>R.</given-names></name> <name><surname>Prasad</surname> <given-names>M.-A.</given-names></name> <name><surname>Gifford</surname> <given-names>E. M.</given-names></name></person-group> (<year>2005</year>). <article-title>Development of CYP3A4 inhibition models: comparisons of machine-learning techniques and molecular descriptors.</article-title> <source><italic>J. Biomol. Screen.</italic></source> <volume>10</volume> <fpage>197</fpage>&#x2013;<lpage>205</lpage>. <pub-id pub-id-type="doi">10.1177/1087057104274091</pub-id> <pub-id pub-id-type="pmid">15809315</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Babaoglu</surname> <given-names>&#x0130;.</given-names></name> <name><surname>Findik</surname> <given-names>O.</given-names></name> <name><surname>&#x00DC;lker</surname> <given-names>E.</given-names></name></person-group> (<year>2010</year>). <article-title>A comparison of feature selection models utilizing binary particle swarm optimization and genetic algorithm in determining coronary artery disease using support vector machine.</article-title> <source><italic>Expert Syst. Appl.</italic></source> <volume>37</volume> <fpage>3177</fpage>&#x2013;<lpage>3183</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2009.09.064</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Balabin</surname> <given-names>R. M.</given-names></name> <name><surname>Lomakina</surname> <given-names>E. I.</given-names></name></person-group> (<year>2011</year>). <article-title>Support vector machine regression (LS-SVM)&#x2014;an alternative to artificial neural networks (ANNs) for the analysis of quantum chemistry data?</article-title> <source><italic>Phys. Chem. Chem. Phys.</italic></source> <volume>13</volume> <fpage>11710</fpage>&#x2013;<lpage>11718</lpage>. <pub-id pub-id-type="doi">10.1039/c1cp00051a</pub-id> <pub-id pub-id-type="pmid">21594265</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Balabin</surname> <given-names>R. M.</given-names></name> <name><surname>Smirnov</surname> <given-names>S. V.</given-names></name></person-group> (<year>2012</year>). <article-title>Interpolation and extrapolation problems of multivariate regression in analytical chemistry: benchmarking the robustness on near-infrared (NIR) spectroscopy data.</article-title> <source><italic>Analyst</italic></source> <volume>137</volume> <fpage>1604</fpage>&#x2013;<lpage>1610</lpage>. <pub-id pub-id-type="doi">10.1039/c2an15972d</pub-id> <pub-id pub-id-type="pmid">22337290</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name> <name><surname>Vincent</surname> <given-names>P.</given-names></name></person-group> (<year>2013</year>). <article-title>Representation learning: a review and new perspectives.</article-title> <source><italic>IEEE Trans. Pattern Anal. Mach. Intell.</italic></source> <volume>35</volume> <fpage>1798</fpage>&#x2013;<lpage>1828</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2013.50</pub-id> <pub-id pub-id-type="pmid">23787338</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Betrie</surname> <given-names>G. D.</given-names></name> <name><surname>Tesfamariam</surname> <given-names>S.</given-names></name> <name><surname>Morin</surname> <given-names>K. A.</given-names></name> <name><surname>Sadiq</surname> <given-names>R.</given-names></name></person-group> (<year>2013</year>). <article-title>Predicting copper concentrations in acid mine drainage: a comparative analysis of five machine learning techniques.</article-title> <source><italic>Environ. Monit. Assess.</italic></source> <volume>185</volume> <fpage>4171</fpage>&#x2013;<lpage>4182</lpage>. <pub-id pub-id-type="doi">10.1007/s10661-012-2859-7</pub-id> <pub-id pub-id-type="pmid">22983612</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bolstad</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title><italic>preprocessCore: A Collection of Pre-Processing Functions</italic>.</article-title> <source><italic>R package.</italic></source> Available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/bmbolstad/preprocessCore">https://github.com/bmbolstad/preprocessCore</ext-link></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bolstad</surname> <given-names>B. M.</given-names></name> <name><surname>Irizarry</surname> <given-names>R. A.</given-names></name> <name><surname>Astrand</surname> <given-names>M.</given-names></name> <name><surname>Speed</surname> <given-names>T. P.</given-names></name></person-group> (<year>2003</year>). <article-title>A comparison of normalization methods for high density oligonucleotide array data based on variance and bias.</article-title> <source><italic>Bioinformatics</italic></source> <volume>19</volume> <fpage>185</fpage>&#x2013;<lpage>193</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/19.2.185</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>C.-C.</given-names></name> <name><surname>Lin</surname> <given-names>C.-J.</given-names></name></person-group> (<year>2011</year>). <article-title>LIBSVM: a library for support vector machines.</article-title> <source><italic>ACM Trans. Intell. Syst. Technol.</italic></source> <volume>2</volume> <fpage>1</fpage>&#x2013;<lpage>27</lpage>. <pub-id pub-id-type="doi">10.1145/1961189.1961199</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chauhan</surname> <given-names>D.</given-names></name> <name><surname>Tian</surname> <given-names>Z.</given-names></name> <name><surname>Nicholson</surname> <given-names>B.</given-names></name> <name><surname>Kumar</surname> <given-names>K. G. S.</given-names></name> <name><surname>Zhou</surname> <given-names>B.</given-names></name> <name><surname>Carrasco</surname> <given-names>R.</given-names></name><etal/></person-group> (<year>2012</year>). <article-title>A small molecule inhibitor of ubiquitin-specific protease-7 induces apoptosis in multiple myeloma cells and overcomes bortezomib resistance.</article-title> <source><italic>Cancer Cell</italic></source> <volume>22</volume> <fpage>345</fpage>&#x2013;<lpage>358</lpage>. <pub-id pub-id-type="doi">10.1016/j.ccr.2012.08.007</pub-id> <pub-id pub-id-type="pmid">22975377</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cho</surname> <given-names>H.-J.</given-names></name> <name><surname>Lee</surname> <given-names>S.</given-names></name> <name><surname>Ji</surname> <given-names>Y. G.</given-names></name> <name><surname>Lee</surname> <given-names>D. H.</given-names></name></person-group> (<year>2018</year>). <article-title>Association of specific gene mutations derived from machine learning with survival in lung adenocarcinoma.</article-title> <source><italic>PLoS One</italic></source> <volume>13</volume>:<issue>e0207204</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0207204</pub-id> <pub-id pub-id-type="pmid">30419062</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davoudi</surname> <given-names>A.</given-names></name> <name><surname>Ozrazgat-Baslanti</surname> <given-names>T.</given-names></name> <name><surname>Ebadi</surname> <given-names>A.</given-names></name> <name><surname>Bursian</surname> <given-names>A. C.</given-names></name> <name><surname>Bihorac</surname> <given-names>A.</given-names></name> <name><surname>Rashidi</surname> <given-names>P.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x201C;Delirium prediction using machine learning models on predictive electronic health records data,&#x201D; in</article-title> <source><italic>Proceedings of the 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE)</italic></source> (<publisher-loc>Washington, DC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>568</fpage>&#x2013;<lpage>573</lpage>. <pub-id pub-id-type="doi">10.1109/BIBE.2017.00014</pub-id> <pub-id pub-id-type="pmid">30393788</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goldman</surname> <given-names>M.</given-names></name> <name><surname>Craft</surname> <given-names>B.</given-names></name> <name><surname>Swatloski</surname> <given-names>T.</given-names></name> <name><surname>Cline</surname> <given-names>M.</given-names></name> <name><surname>Morozova</surname> <given-names>O.</given-names></name> <name><surname>Diekhans</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2015</year>). <article-title>The UCSC cancer genomics browser: update 2015.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>43</volume> <fpage>D812</fpage>&#x2013;<lpage>D817</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gku1073</pub-id> <pub-id pub-id-type="pmid">25392408</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hatzis</surname> <given-names>C.</given-names></name> <name><surname>Pusztai</surname> <given-names>L.</given-names></name> <name><surname>Valero</surname> <given-names>V.</given-names></name> <name><surname>Booser</surname> <given-names>D. J.</given-names></name> <name><surname>Esserman</surname> <given-names>L.</given-names></name> <name><surname>Lluch</surname> <given-names>A.</given-names></name><etal/></person-group> (<year>2011</year>). <article-title>A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer.</article-title> <source><italic>JAMA</italic></source> <volume>305</volume> <fpage>1873</fpage>&#x2013;<lpage>1881</lpage>. <pub-id pub-id-type="doi">10.1001/jama.2011.593</pub-id> <pub-id pub-id-type="pmid">21558518</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Horak</surname> <given-names>C. E.</given-names></name> <name><surname>Pusztai</surname> <given-names>L.</given-names></name> <name><surname>Xing</surname> <given-names>G.</given-names></name> <name><surname>Trifan</surname> <given-names>O. C.</given-names></name> <name><surname>Saura</surname> <given-names>C.</given-names></name> <name><surname>Tseng</surname> <given-names>L.-M.</given-names></name><etal/></person-group> (<year>2013</year>). <article-title>Biomarker analysis of neoadjuvant doxorubicin/cyclophosphamide followed by ixabepilone or Paclitaxel in early-stage breast cancer.</article-title> <source><italic>Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res.</italic></source> <volume>19</volume> <fpage>1587</fpage>&#x2013;<lpage>1595</lpage>. <pub-id pub-id-type="doi">10.1158/1078-0432.CCR-12-1359</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hunter</surname> <given-names>J. D.</given-names></name></person-group> (<year>2007</year>). <article-title>Matplotlib: a 2D graphics environment.</article-title> <source><italic>Comput. Sci. Eng.</italic></source> <volume>9</volume> <fpage>90</fpage>&#x2013;<lpage>95</lpage>. <pub-id pub-id-type="doi">10.1109/MCSE.2007.55</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Itoh</surname> <given-names>M.</given-names></name> <name><surname>Iwamoto</surname> <given-names>T.</given-names></name> <name><surname>Matsuoka</surname> <given-names>J.</given-names></name> <name><surname>Nogami</surname> <given-names>T.</given-names></name> <name><surname>Motoki</surname> <given-names>T.</given-names></name> <name><surname>Shien</surname> <given-names>T.</given-names></name><etal/></person-group> (<year>2014</year>). <article-title>Estrogen receptor (ER) mRNA expression and molecular subtype distribution in ER-negative/progesterone receptor-positive breast cancers.</article-title> <source><italic>Breast Cancer Res. Treat.</italic></source> <volume>143</volume> <fpage>403</fpage>&#x2013;<lpage>409</lpage>. <pub-id pub-id-type="doi">10.1007/s10549-013-2763-z</pub-id> <pub-id pub-id-type="pmid">24337596</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jeong</surname> <given-names>E.</given-names></name> <name><surname>Park</surname> <given-names>N.</given-names></name> <name><surname>Choi</surname> <given-names>Y.</given-names></name> <name><surname>Park</surname> <given-names>R. W.</given-names></name> <name><surname>Yoon</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>Machine learning model combining features from algorithms with different analytical methodologies to detect laboratory-event-related adverse drug reaction signals.</article-title> <source><italic>PLoS One</italic></source> <volume>13</volume>:<issue>e0207749</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0207749</pub-id> <pub-id pub-id-type="pmid">30462745</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>Y. R.</given-names></name> <name><surname>Kim</surname> <given-names>D.</given-names></name> <name><surname>Kim</surname> <given-names>S. Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Prediction of acquired taxane resistance using a personalized pathway-based machine learning method.</article-title> <source><italic>Cancer Res. Treat.</italic></source> <pub-id pub-id-type="doi">10.4143/crt.2018.137</pub-id> [Epub ahead of print]. <pub-id pub-id-type="pmid">30092623</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kourou</surname> <given-names>K.</given-names></name> <name><surname>Exarchos</surname> <given-names>T. P.</given-names></name> <name><surname>Exarchos</surname> <given-names>K. P.</given-names></name> <name><surname>Karamouzis</surname> <given-names>M. V.</given-names></name> <name><surname>Fotiadis</surname> <given-names>D. I.</given-names></name></person-group> (<year>2015</year>). <article-title>Machine learning applications in cancer prognosis and prediction.</article-title> <source><italic>Comput. Struct. Biotechnol. J.</italic></source> <volume>13</volume> <fpage>8</fpage>&#x2013;<lpage>17</lpage>. <pub-id pub-id-type="doi">10.1016/j.csbj.2014.11.005</pub-id> <pub-id pub-id-type="pmid">25750696</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning.</article-title> <source><italic>Nature</italic></source> <volume>521</volume> <fpage>436</fpage>&#x2013;<lpage>444</lpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id> <pub-id pub-id-type="pmid">26017442</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Leite</surname> <given-names>D. M. C.</given-names></name> <name><surname>Brochet</surname> <given-names>X.</given-names></name> <name><surname>Resch</surname> <given-names>G.</given-names></name> <name><surname>Que</surname> <given-names>Y.-A.</given-names></name> <name><surname>Neves</surname> <given-names>A.</given-names></name> <name><surname>Pe&#x00F1;a-Reyes</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>Computational prediction of inter-species relationships through omics data analysis and machine learning.</article-title> <source><italic>BMC Bioinformatics</italic></source> <volume>19</volume>:<issue>420</issue>. <pub-id pub-id-type="doi">10.1186/s12859-018-2388-7</pub-id> <pub-id pub-id-type="pmid">30453987</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Cheng</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>Tumor gene expression data classification via sample expansion-based deep learning.</article-title> <source><italic>Oncotarget</italic></source> <volume>8</volume> <fpage>109646</fpage>&#x2013;<lpage>109660</lpage>. <pub-id pub-id-type="doi">10.18632/oncotarget.22762</pub-id> <pub-id pub-id-type="pmid">29312636</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Love</surname> <given-names>M. I.</given-names></name> <name><surname>Huber</surname> <given-names>W.</given-names></name> <name><surname>Anders</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.</article-title> <source><italic>Genome Biol.</italic></source> <volume>15</volume>:<issue>550</issue>. <pub-id pub-id-type="doi">10.1186/s13059-014-0550-8</pub-id> <pub-id pub-id-type="pmid">25516281</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mamoshina</surname> <given-names>P.</given-names></name> <name><surname>Volosnikova</surname> <given-names>M.</given-names></name> <name><surname>Ozerov</surname> <given-names>I. V.</given-names></name> <name><surname>Putin</surname> <given-names>E.</given-names></name> <name><surname>Skibina</surname> <given-names>E.</given-names></name> <name><surname>Cortese</surname> <given-names>F.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Machine learning on human muscle transcriptomic data for biomarker discovery and tissue-specific drug target identification.</article-title> <source><italic>Front. Genet.</italic></source> <volume>9</volume>:<issue>242</issue>. <pub-id pub-id-type="doi">10.3389/fgene.2018.00242</pub-id> <pub-id pub-id-type="pmid">30050560</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mulligan</surname> <given-names>G.</given-names></name> <name><surname>Mitsiades</surname> <given-names>C.</given-names></name> <name><surname>Bryant</surname> <given-names>B.</given-names></name> <name><surname>Zhan</surname> <given-names>F.</given-names></name> <name><surname>Chng</surname> <given-names>W. J.</given-names></name> <name><surname>Roels</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2007</year>). <article-title>Gene expression profiling and correlation with outcome in clinical trials of the proteasome inhibitor bortezomib.</article-title> <source><italic>Blood</italic></source> <volume>109</volume> <fpage>3177</fpage>&#x2013;<lpage>3188</lpage>. <pub-id pub-id-type="doi">10.1182/blood-2006-09-044974</pub-id> <pub-id pub-id-type="pmid">17185464</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pedregosa</surname> <given-names>F.</given-names></name> <name><surname>Varoquaux</surname> <given-names>G.</given-names></name> <name><surname>Gramfort</surname> <given-names>A.</given-names></name> <name><surname>Michel</surname> <given-names>V.</given-names></name> <name><surname>Thirion</surname> <given-names>B.</given-names></name> <name><surname>Grisel</surname> <given-names>O.</given-names></name><etal/></person-group> (<year>2012</year>). <article-title>Scikit-learn: machine learning in Python.</article-title> <source><italic>arXiv</italic></source> [Preprint]. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1201.0490">arXiv:1201.0490</ext-link></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ray</surname> <given-names>M.</given-names></name> <name><surname>Zhang</surname> <given-names>W.</given-names></name></person-group> (<year>2009</year>). <article-title>Integrating gene expression and phenotypic information to analyze Alzheimer&#x2019;s disease.</article-title> <source><italic>J. Alzheimers Dis.</italic></source> <volume>16</volume> <fpage>73</fpage>&#x2013;<lpage>84</lpage>. <pub-id pub-id-type="doi">10.3233/JAD-2009-0917</pub-id> <pub-id pub-id-type="pmid">19158423</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sauer</surname> <given-names>C. M.</given-names></name> <name><surname>Sasson</surname> <given-names>D.</given-names></name> <name><surname>Paik</surname> <given-names>K. E.</given-names></name> <name><surname>McCague</surname> <given-names>N.</given-names></name> <name><surname>Celi</surname> <given-names>L. A.</given-names></name> <name><surname>S&#x00E1;nchez Fern&#x00E1;ndez</surname> <given-names>I.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Feature selection and prediction of treatment failure in tuberculosis.</article-title> <source><italic>PLoS One</italic></source> <volume>13</volume>:<issue>e0207491</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0207491</pub-id> <pub-id pub-id-type="pmid">30458029</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning in neural networks: an overview.</article-title> <source><italic>Neural Netw. Off. J. Int. Neural Netw. Soc.</italic></source> <volume>61</volume> <fpage>85</fpage>&#x2013;<lpage>117</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2014.09.003</pub-id> <pub-id pub-id-type="pmid">25462637</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Soufan</surname> <given-names>O.</given-names></name> <name><surname>Kleftogiannis</surname> <given-names>D.</given-names></name> <name><surname>Kalnis</surname> <given-names>P.</given-names></name> <name><surname>Bajic</surname> <given-names>V. B.</given-names></name></person-group> (<year>2015</year>). <article-title>DWFS: a wrapper feature selection tool based on a parallel genetic algorithm.</article-title> <source><italic>PLoS One</italic></source> <volume>10</volume>:<issue>e0117988</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0117988</pub-id> <pub-id pub-id-type="pmid">25719748</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>A. C.</given-names></name> <name><surname>Gilbert</surname> <given-names>D.</given-names></name></person-group> (<year>2003</year>). <article-title>Ensemble machine learning on gene expression data for cancer classification.</article-title> <source><italic>Appl. Bioinformatics</italic></source> <volume>2</volume> <fpage>S75</fpage>&#x2013;<lpage>S83</lpage>.</citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Prediction of anti-cancer drug response by kernelized multi-task learning.</article-title> <source><italic>Artif. Intell. Med.</italic></source> <volume>73</volume> <fpage>70</fpage>&#x2013;<lpage>77</lpage>. <pub-id pub-id-type="doi">10.1016/j.artmed.2016.09.004</pub-id> <pub-id pub-id-type="pmid">27926382</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tarek</surname> <given-names>S.</given-names></name> <name><surname>Abd Elwahab</surname> <given-names>R.</given-names></name> <name><surname>Shoman</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Gene expression based cancer classification.</article-title> <source><italic>Egypt. Inform. J.</italic></source> <volume>18</volume> <fpage>151</fpage>&#x2013;<lpage>159</lpage>. <pub-id pub-id-type="doi">10.1016/j.eij.2016.12.001</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Terragna</surname> <given-names>C.</given-names></name> <name><surname>Remondini</surname> <given-names>D.</given-names></name> <name><surname>Martello</surname> <given-names>M.</given-names></name> <name><surname>Zamagni</surname> <given-names>E.</given-names></name> <name><surname>Pantani</surname> <given-names>L.</given-names></name> <name><surname>Patriarca</surname> <given-names>F.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>The genetic and genomic background of multiple myeloma patients achieving complete response after induction therapy with bortezomib, thalidomide and dexamethasone (VTD).</article-title> <source><italic>Oncotarget</italic></source> <volume>7</volume> <fpage>9666</fpage>&#x2013;<lpage>9679</lpage>. <pub-id pub-id-type="doi">10.18632/oncotarget.5718</pub-id> <pub-id pub-id-type="pmid">26575327</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tricoli</surname> <given-names>J. V.</given-names></name> <name><surname>Blair</surname> <given-names>D. G.</given-names></name> <name><surname>Anders</surname> <given-names>C. K.</given-names></name> <name><surname>Bleyer</surname> <given-names>W. A.</given-names></name> <name><surname>Boardman</surname> <given-names>L. A.</given-names></name> <name><surname>Khan</surname> <given-names>J.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Biologic and clinical characteristics of adolescent and young adult cancers: acute lymphoblastic leukemia, colorectal cancer, breast cancer, melanoma, and sarcoma: biology of AYA Cancers.</article-title> <source><italic>Cancer</italic></source> <volume>122</volume> <fpage>1017</fpage>&#x2013;<lpage>1028</lpage>. <pub-id pub-id-type="doi">10.1002/cncr.29871</pub-id> <pub-id pub-id-type="pmid">26849082</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Turki</surname> <given-names>T.</given-names></name> <name><surname>Wei</surname> <given-names>Z.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x201C;Learning approaches to improve prediction of drug sensitivity in breast cancer patients,&#x201D; in</article-title> <source><italic>Proceedings of the 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)</italic></source> (<publisher-loc>Orlando, FL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3314</fpage>&#x2013;<lpage>3320</lpage>. <pub-id pub-id-type="doi">10.1109/EMBC.2016.7591437</pub-id> <pub-id pub-id-type="pmid">28269014</pub-id></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Turkiewicz</surname> <given-names>K. L.</given-names></name></person-group> (<year>2017</year>). <source><italic>The SAGE Encyclopedia of Communication Research Methods.</italic></source> <publisher-loc>Thousand Oaks, CA</publisher-loc>: <publisher-name>SAGE Publications, Inc</publisher-name>. <pub-id pub-id-type="doi">10.4135/9781483381411.n130</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Walz</surname> <given-names>A. L.</given-names></name> <name><surname>Ooms</surname> <given-names>A.</given-names></name> <name><surname>Gadd</surname> <given-names>S.</given-names></name> <name><surname>Gerhard</surname> <given-names>D. S.</given-names></name> <name><surname>Smith</surname> <given-names>M. A.</given-names></name> <name><surname>Guidry Auvil</surname> <given-names>J. M.</given-names></name><etal/></person-group> (<year>2015</year>). <article-title>Recurrent DGCR8, DROSHA, and SIX homeodomain mutations in favorable histology Wilms tumors.</article-title> <source><italic>Cancer Cell</italic></source> <volume>27</volume> <fpage>286</fpage>&#x2013;<lpage>297</lpage>. <pub-id pub-id-type="doi">10.1016/j.ccell.2015.01.003</pub-id> <pub-id pub-id-type="pmid">25670082</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wilkinson</surname> <given-names>L.</given-names></name></person-group> (<year>2011</year>). <article-title>ggplot2: elegant graphics for data analysis by WICKHAM, H.</article-title> <source><italic>Biometrics</italic></source> <volume>67</volume> <fpage>678</fpage>&#x2013;<lpage>679</lpage>. <pub-id pub-id-type="doi">10.1111/j.1541-0420.2011.01616.x</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yosipof</surname> <given-names>A.</given-names></name> <name><surname>Guedes</surname> <given-names>R. C.</given-names></name> <name><surname>Garc&#x00ED;a-Sosa</surname> <given-names>A. T.</given-names></name></person-group> (<year>2018</year>). <article-title>Data mining and machine learning models for predicting drug likeness and their disease or organ category.</article-title> <source><italic>Front. Chem.</italic></source> <volume>6</volume>:<issue>162</issue>. <pub-id pub-id-type="doi">10.3389/fchem.2018.00162</pub-id></citation></ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Ai</surname> <given-names>H.</given-names></name> <name><surname>Hu</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Zhao</surname> <given-names>J.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Applications of machine learning methods in drug toxicity prediction.</article-title> <source><italic>Curr. Top. Med. Chem.</italic></source> <volume>18</volume> <fpage>987</fpage>&#x2013;<lpage>997</lpage>. <pub-id pub-id-type="doi">10.2174/1568026618666180727152557</pub-id> <pub-id pub-id-type="pmid">30051792</pub-id></citation></ref>
</ref-list>
<glossary>
<title>Abbreviations</title>
<def-list id="DL1">
<def-item>
<term>ALL</term>
<def>
<p>acute lymphoblastic leukemia</p>
</def>
</def-item>
<def-item>
<term>AML</term>
<def>
<p>acute myelogenous leukemia</p>
</def>
</def-item>
<def-item>
<term>ASCT</term>
<def>
<p>allogeneic stem cell transplantation</p>
</def>
</def-item>
<def-item>
<term>AUC</term>
<def>
<p>area under the curve</p>
</def>
</def-item>
<def-item>
<term>FDR</term>
<def>
<p>false discovery rate</p>
</def>
</def-item>
<def-item>
<term>FloWPS</term>
<def>
<p>floating window projective separator</p>
</def>
</def-item>
<def-item>
<term>FP</term>
<def>
<p>false positive</p>
</def>
</def-item>
<def-item>
<term>FN</term>
<def>
<p>false negative</p>
</def>
</def-item>
<def-item>
<term>GEO</term>
<def>
<p>Gene Expression Omnibus</p>
</def>
</def-item>
<def-item>
<term>GSE</term>
<def>
<p>GEO series</p>
</def>
</def-item>
<def-item>
<term>HER2</term>
<def>
<p>human epidermal growth factor receptor 2</p>
</def>
</def-item>
<def-item>
<term>kNN</term>
<def>
<p><italic>k</italic> nearest neighbors</p>
</def>
</def-item>
<def-item>
<term>MCC</term>
<def>
<p>Matthews correlation coefficient</p>
</def>
</def-item>
<def-item>
<term>mRNA</term>
<def>
<p>messenger ribonucleic acid</p>
</def>
</def-item>
<def-item>
<term>NGS</term>
<def>
<p>next-generation sequencing</p>
</def>
</def-item>
<def-item>
<term>PC</term>
<def>
<p>principal component</p>
</def>
</def-item>
<def-item>
<term>PCA</term>
<def>
<p>principal component analysis</p>
</def>
</def-item>
<def-item>
<term>ROC</term>
<def>
<p>receiver operating characteristic</p>
</def>
</def-item>
<def-item>
<term>SVM</term>
<def>
<p>support vector machine</p>
</def>
</def-item>
<def-item>
<term>TN</term>
<def>
<p>true negative</p>
</def>
</def-item>
<def-item>
<term>TP</term>
<def>
<p>true positive</p>
</def>
</def-item>
<def-item>
<term>VTD</term>
<def>
<p>Velcade (bortezomib), thalidomide and dexamethasone</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group>
<fn id="fn01"><label>1</label><p>This is the result of a PubMed query: <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/pubmed/?term=support+vector+machine_">https://www.ncbi.nlm.nih.gov/pubmed/?term=support+vector+machine_</ext-link>.</p></fn>
<fn id="fn02"><label>2</label><p>The ROC (receiver operating characteristic) curve is a widely used graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The area under the ROC curve, called ROC AUC or simply AUC, is routinely used to assess classifier quality.</p></fn>
<fn id="fn03"><label>3</label><p>Data trimming is the process of removing or excluding extreme values, or outliers, from a dataset (<xref ref-type="bibr" rid="B44">Turkiewicz, 2017</xref>).</p></fn>
<fn id="fn04"><label>4</label><p>FDR is the proportion of <italic>false positive</italic> (FP) predictions among all samples classified as positive: FDR = FP/(FP + TP), where TP denotes <italic>true positive</italic> predictions.</p></fn>
<fn id="fn05"><label>5</label><p>MCC can be calculated from the confusion matrix, <inline-formula><mml:math id="M1"><mml:mrow><mml:mi>M</mml:mi><mml:mi>C</mml:mi><mml:mi>C</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow></mml:math></inline-formula></p></fn>
</fn-group>
</back>
</article>