<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fgene.2018.00515</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Predicting Diabetes Mellitus With Machine Learning Techniques</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Zou</surname> <given-names>Quan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/531759/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Qu</surname> <given-names>Kaiyang</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Luo</surname> <given-names>Yamei</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Yin</surname> <given-names>Dehui</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Ju</surname> <given-names>Ying</given-names></name>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Tang</surname> <given-names>Hua</given-names></name>
<xref ref-type="aff" rid="aff5"><sup>5</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/625623/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Computer Science and Technology, Tianjin University</institution>, <addr-line>Tianjin</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China</institution>, <addr-line>Chengdu</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>School of Medical Information and Engineering, Southwest Medical University</institution>, <addr-line>Luzhou</addr-line>, <country>China</country></aff>
<aff id="aff4"><sup>4</sup><institution>School of Information Science and Technology, Xiamen University</institution>, <addr-line>Xiamen</addr-line>, <country>China</country></aff>
<aff id="aff5"><sup>5</sup><institution>Department of Pathophysiology, School of Basic Medicine, Southwest Medical University</institution>, <addr-line>Luzhou</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Jianbo Pan, Johns Hopkins Medicine, United States; Zhu-Hong You, Xinjiang Technical Institute of Physics &#x0026; Chemistry (CAS), China; Chao Pang, Columbia University Medical Center, United States</p></fn>
<corresp id="c001">&#x002A;Correspondence: Quan Zou, <email>zouquan@nclab.net</email> Hua Tang, <email>huatang@swmu.edu.cn</email></corresp>
<fn fn-type="other" id="fn002"><p>This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>06</day>
<month>11</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection">
<year>2018</year>
</pub-date>
<volume>9</volume>
<elocation-id>515</elocation-id>
<history>
<date date-type="received">
<day>29</day>
<month>07</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>10</month>
<year>2018</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2018 Zou, Qu, Luo, Yin, Ju and Tang.</copyright-statement>
<copyright-year>2018</copyright-year>
<copyright-holder>Zou, Qu, Luo, Yin, Ju and Tang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Diabetes mellitus is a chronic disease characterized by hyperglycemia, and it may cause many complications. Given the growing morbidity in recent years, the number of people with diabetes worldwide is expected to reach 642 million by 2040, which means that one in ten adults will then be suffering from diabetes. There is no doubt that this alarming figure deserves great attention. With its rapid development, machine learning has been applied to many aspects of medical health. In this study, we used decision tree, random forest and neural network to predict diabetes mellitus. The dataset consists of hospital physical examination data from Luzhou, China, and contains 14 attributes. Five-fold cross validation was used to examine the models. To verify the universal applicability of the methods, we chose the methods with the better performance for independent test experiments. We randomly selected 68994 healthy people&#x2019;s and diabetic patients&#x2019; records, respectively, as the training set; because the data were unbalanced, we randomly extracted the data 5 times, and the reported result is the average of these five experiments. We used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality. The results showed that prediction with random forest reached the highest accuracy (ACC = 0.8084) when all the attributes were used.</p>
</abstract>
<kwd-group>
<kwd>diabetes mellitus</kwd>
<kwd>random forest</kwd>
<kwd>decision tree</kwd>
<kwd>neural network</kwd>
<kwd>machine learning</kwd>
<kwd>feature ranking</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="8"/>
<equation-count count="10"/>
<ref-count count="59"/>
<page-count count="11"/>
<word-count count="0"/>
</counts>
</article-meta>
</front>
<body>
<sec><title>Introduction</title>
<p>Diabetes is a common chronic disease that poses a great threat to human health. Its hallmark is a blood glucose level higher than normal, caused by defective insulin secretion, impaired biological effects of insulin, or both (<xref ref-type="bibr" rid="B29">Lonappan et al., 2007</xref>). Diabetes can lead to chronic damage and dysfunction of various tissues, especially the eyes, kidneys, heart, blood vessels and nerves (<xref ref-type="bibr" rid="B22">Krasteva et al., 2011</xref>). Diabetes can be divided into two categories: type 1 diabetes (T1D) and type 2 diabetes (T2D). Patients with type 1 diabetes are normally younger, mostly under 30 years old. The typical clinical symptoms are increased thirst, frequent urination and high blood glucose levels (<xref ref-type="bibr" rid="B12">Iancu et al., 2008</xref>). This type of diabetes cannot be cured effectively with oral medications alone, and patients require insulin therapy. Type 2 diabetes occurs more commonly in middle-aged and elderly people and is often associated with obesity, hypertension, dyslipidemia, arteriosclerosis and other diseases (<xref ref-type="bibr" rid="B40">Robertson et al., 2011</xref>).</p>
<p>With the improvement of living standards, diabetes is increasingly common in people&#x2019;s daily life. Therefore, how to diagnose and analyze diabetes quickly and accurately is a topic worth studying. In medicine, the diagnosis of diabetes is based on fasting blood glucose, glucose tolerance and random blood glucose levels (<xref ref-type="bibr" rid="B12">Iancu et al., 2008</xref>; <xref ref-type="bibr" rid="B6">Cox and Edelman, 2009</xref>; <xref ref-type="bibr" rid="B2">American Diabetes Association, 2012</xref>). The earlier a diagnosis is obtained, the easier the disease is to control. Machine learning can help people make a preliminary judgment about diabetes mellitus from their daily physical examination data, and it can serve as a reference for doctors (<xref ref-type="bibr" rid="B23">Lee and Kim, 2016</xref>; <xref ref-type="bibr" rid="B1">Alghamdi et al., 2017</xref>; <xref ref-type="bibr" rid="B18">Kavakiotis et al., 2017</xref>). For machine learning methods, selecting valid features and the correct classifier are the most important problems.</p>
<p>Recently, numerous algorithms have been used to predict diabetes, including traditional machine learning methods (<xref ref-type="bibr" rid="B18">Kavakiotis et al., 2017</xref>), such as support vector machine (SVM), decision tree (DT), logistic regression and so on. <xref ref-type="bibr" rid="B33">Polat and G&#x00FC;nes (2007)</xref> distinguished diabetic patients from normal people by using principal component analysis (PCA) and neuro fuzzy inference. <xref ref-type="bibr" rid="B56">Yue et al. (2008)</xref> used the quantum particle swarm optimization (QPSO) algorithm and the weighted least squares support vector machine (WLS-SVM) to predict type 2 diabetes. <xref ref-type="bibr" rid="B7">Duygu and Esin (2011)</xref> proposed a system to predict diabetes, called LDA-MWSVM, in which Linear Discriminant Analysis (LDA) is used to reduce the dimensions and extract the features. To deal with high dimensional datasets, <xref ref-type="bibr" rid="B38">Razavian et al. (2015)</xref> built prediction models based on logistic regression for different onsets of type 2 diabetes. <xref ref-type="bibr" rid="B9">Georga et al. (2013)</xref> focused on glucose and used support vector regression (SVR) to predict diabetes, treating it as a multivariate regression problem. Moreover, more and more studies have used ensemble methods to improve the accuracy (<xref ref-type="bibr" rid="B18">Kavakiotis et al., 2017</xref>). <xref ref-type="bibr" rid="B31">Ozcift and Gulten (2011)</xref> proposed a new ensemble approach, namely rotation forest, which combines 30 machine learning methods. <xref ref-type="bibr" rid="B11">Han et al. (2015)</xref> proposed a machine learning method that changed the SVM prediction rules.</p>
<p>Machine learning methods are widely used in predicting diabetes, and they achieve good results. Decision tree is one of the popular machine learning methods in the medical field, and it has strong classification power. Random forest generates many decision trees and aggregates their outputs. Neural network is a recently popular machine learning method that performs well in many respects. In this study, we therefore used decision tree, random forest (RF) and neural network to predict diabetes.</p>
</sec>
<sec id="s1" sec-type="materials|methods">
<title>Materials and Methods</title>
<sec><title>Data</title>
<p>The dataset was obtained from hospital physical examination data in Luzhou, China, and is divided into two parts: healthy people and diabetic patients. There are two sets of healthy people&#x2019;s physical examination data. We used one of them, which contains 164431 instances, for the training set; from the other set, 13700 samples were randomly selected as an independent test set. The physical data include 14 physical examination indexes: age, pulse rate, breathing, left systolic pressure (LSP), right systolic pressure (RSP), left diastolic pressure (LDP), right diastolic pressure (RDP), height, weight, physique index, fasting glucose, waistline, low density lipoprotein (LDL), and high density lipoprotein (HDL). The training dataset contains many missing values, so we deleted the abnormal and incomplete samples to reduce the impact of data processing on the result. Consequently, we obtained 151598 diabetic physical records and 69082 healthy people&#x2019;s physical records. We then randomly selected 68994 healthy people&#x2019;s and diabetic patients&#x2019; records, respectively, as the training set. Because the data were unbalanced, we repeated this random extraction 5 times; the final result is the mean value of the 5 experiments. The 13,700 physical examination records randomly selected as the independent test set were different from the previous five sets used for training.</p>
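The balanced-sampling scheme described above (draw equal numbers of records from each class several times, then average the results over the draws) can be sketched in Python. This is an illustration only: the record lists and the function name are hypothetical stand-ins, since the hospital data are not public.

```python
import random

def balanced_training_sets(healthy, diabetic, n_per_class, n_draws=5, seed=0):
    """Draw n_per_class records from each class, n_draws times.

    Each draw yields one balanced training set; in the study, the model
    is trained on each draw and the reported result is the mean over draws.
    """
    rng = random.Random(seed)
    sets = []
    for _ in range(n_draws):
        # sample without replacement within each draw
        batch = rng.sample(healthy, n_per_class) + rng.sample(diabetic, n_per_class)
        sets.append(batch)
    return sets
```

The final metric would then simply be `sum(accs) / len(accs)` over the per-draw accuracies.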
<p>Another dataset is the Pima Indian diabetes dataset (<xref ref-type="bibr" rid="B14">Jegan, 2014</xref>). All patients in this dataset are females of Pima Indian heritage who are at least 21 years old. The dataset contains 8 attributes: number of pregnancies, plasma glucose concentration after a 2-h oral glucose tolerance test, diastolic blood pressure, triceps skin fold thickness, 2-h serum insulin, body mass index, diabetes pedigree function and age. The original 768 records reduce to 392 after the records with missing data are deleted.</p>
</sec>
<sec><title>Classification</title>
<p>In this section, we used decision tree, RF and neural network as the classifiers. Decision tree and RF can be implemented in WEKA, a free, non-commercial, open-source machine learning and data mining software based on the JAVA environment. Neural network can be implemented in MATLAB, a commercial mathematics software developed by MathWorks, Inc. MATLAB is used for algorithm development, data visualization and data analysis, and it provides an advanced computational language and an interactive environment for numerical calculation.</p>
<sec><title>Decision Tree</title>
<p>Decision tree is a basic classification and regression method. A decision tree model has a tree structure that describes the process of classifying instances based on features (<xref ref-type="bibr" rid="B35">Quinlan, 1986</xref>). It can be considered as a set of if-then rules, or as conditional probability distributions defined on the feature space and class space.</p>
<p>A decision tree begins with a single node representing the training samples (<xref ref-type="bibr" rid="B8">Friedl and Brodley, 1997</xref>; <xref ref-type="bibr" rid="B10">Habibi et al., 2015</xref>; <xref ref-type="bibr" rid="B26">Liao et al., 2018</xref>). If the samples are all in the same class, the node becomes a leaf labeled with that class. Otherwise, the algorithm chooses the most discriminatory attribute as the current node of the decision tree. According to the values of the current decision node&#x2019;s attribute, the training samples are divided into several subsets, each of which forms a branch; several attribute values thus form several branches (<xref ref-type="bibr" rid="B35">Quinlan, 1986</xref>; <xref ref-type="bibr" rid="B20">Kohabi, 1996</xref>). For each subset or branch obtained in the previous step, these steps are repeated, recursively forming a decision tree on each of the partitioned samples (<xref ref-type="bibr" rid="B35">Quinlan, 1986</xref>; <xref ref-type="bibr" rid="B8">Friedl and Brodley, 1997</xref>; <xref ref-type="bibr" rid="B10">Habibi et al., 2015</xref>).</p>
<p>The typical decision tree algorithms are ID3, C4.5, CART and so on. In this study, we used the J48 decision tree in WEKA. J48, also known as C4.8, is an upgrade of C4.5. J48 (<xref ref-type="bibr" rid="B42">Salzberg, 1994</xref>; <xref ref-type="bibr" rid="B20">Kohabi, 1996</xref>) uses a top-down, recursive divide-and-conquer strategy. The method selects an attribute to be the root node, generates a branch for each possible attribute value, and divides the instances into multiple subsets, each corresponding to a branch of the root node; the process is then repeated recursively on each branch (<xref ref-type="bibr" rid="B20">Kohabi, 1996</xref>). When all instances at a node have the same classification, the algorithm stops. In J48, the nodes are decided by information gain: according to the following formulas, in each iteration J48 calculates the information gain of each attribute and selects the attribute with the largest information gain as the node of that iteration (<xref ref-type="bibr" rid="B36">Quinlan, 1996a</xref>,<xref ref-type="bibr" rid="B37">b</xref>; <xref ref-type="bibr" rid="B43">Sharma et al., 2014</xref>).</p>
<p>Attribute <italic>A</italic> information gain:</p>
<disp-formula id="E1"><mml:math id="M1"><mml:mrow><mml:mi mathvariant='normal'>G</mml:mi><mml:mi mathvariant='normal'>a</mml:mi><mml:mi mathvariant='normal'>i</mml:mi><mml:mi mathvariant='normal'>n</mml:mi><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='normal'>A</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>=</mml:mo><mml:mi mathvariant='normal'>I</mml:mi><mml:mi mathvariant='normal'>n</mml:mi><mml:mi mathvariant='normal'>f</mml:mi><mml:mi mathvariant='normal'>o</mml:mi><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='normal'>D</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant='italic'>I</mml:mi><mml:mi mathvariant='italic'>n</mml:mi><mml:mi mathvariant='italic'>f</mml:mi><mml:mi mathvariant='italic'>o</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='normal'>A</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>D</mml:mi><mml:mo mathvariant='normal'>)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>Pre-segmentation information entropy:</p>
<disp-formula id="E2"><mml:math id="M2"><mml:mrow><mml:mrow><mml:mi mathvariant='normal'>I</mml:mi><mml:mi mathvariant='normal'>n</mml:mi><mml:mi mathvariant='normal'>f</mml:mi><mml:mi mathvariant='normal'>o</mml:mi><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='normal'>D</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>=</mml:mo><mml:mi mathvariant='normal'>E</mml:mi><mml:mi mathvariant='normal'>n</mml:mi><mml:mi mathvariant='normal'>t</mml:mi><mml:mi mathvariant='normal'>r</mml:mi><mml:mi mathvariant='normal'>o</mml:mi><mml:mi mathvariant='normal'>p</mml:mi><mml:mi mathvariant='normal'>y</mml:mi><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='normal'>D</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>=</mml:mo><mml:mo mathvariant='normal'>&#x2212;</mml:mo><mml:munder><mml:mrow><mml:mi mathvariant='normal'>&#x03a3;</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='normal'>j</mml:mi></mml:mrow></mml:munder><mml:mi mathvariant='italic'>p</mml:mi><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>j</mml:mi><mml:mo mathvariant='normal'>|</mml:mo><mml:mi mathvariant='italic'>D</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mi mathvariant='italic'>log</mml:mi><mml:mo mathvariant='italic'>&#x2061;</mml:mo><mml:mi mathvariant='italic'>p</mml:mi></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>j</mml:mi><mml:mo mathvariant='normal'>|</mml:mo><mml:mi mathvariant='italic'>D</mml:mi><mml:mo mathvariant='normal'>)</mml:mo></mml:mrow></mml:math></disp-formula>
<p>Distributed information entropy:</p>
<disp-formula id="E3"><mml:math id="M3"><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant='italic'>I</mml:mi><mml:mi mathvariant='italic'>n</mml:mi><mml:mi mathvariant='italic'>f</mml:mi><mml:mi mathvariant='italic'>o</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='normal'>A</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>D</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>=</mml:mo><mml:mrow><mml:munderover><mml:mrow><mml:mi mathvariant='normal'>&#x03a3;</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='normal'>i</mml:mi><mml:mo mathvariant='normal'>=</mml:mo><mml:mn mathvariant='normal'>1</mml:mn></mml:mrow><mml:mrow><mml:mi mathvariant='normal'>v</mml:mi></mml:mrow></mml:munderover><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant='italic'>n</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='normal'>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>n</mml:mi></mml:mrow></mml:mfrac><mml:mi mathvariant='italic'>I</mml:mi><mml:mi mathvariant='italic'>n</mml:mi><mml:mi mathvariant='italic'>f</mml:mi><mml:mi mathvariant='italic'>o</mml:mi></mml:mrow></mml:mrow></mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant='italic'>D</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='normal'>i</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant='normal'>)</mml:mo></mml:mrow></mml:math></disp-formula>
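The attribute-selection rule defined by the three formulas above can be illustrated with a small Python sketch. This is a toy illustration of information gain, not the WEKA J48 implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_j p(j|D) * log2 p(j|D)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D).

    Info_A(D) is the entropy of each partition induced by attribute A,
    weighted by the partition size n_i / n.
    """
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a
```

J48 would evaluate `information_gain` for every candidate attribute and split on the one with the largest value.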
</sec>
<sec><title>Random Forest</title>
<p>RF performs classification by using many decision trees. The algorithm was proposed by Breiman (<xref ref-type="bibr" rid="B4">Breiman, 2001</xref>). RF is a multifunctional machine learning method that can perform both prediction and regression tasks. In addition, RF is based on Bagging and plays an important role in ensemble machine learning (<xref ref-type="bibr" rid="B4">Breiman, 2001</xref>; <xref ref-type="bibr" rid="B28">Lin et al., 2014</xref>; <xref ref-type="bibr" rid="B46">Svetnik et al., 2015</xref>). RF has been employed in several biomedicine studies (<xref ref-type="bibr" rid="B57">Zhao et al., 2014</xref>; <xref ref-type="bibr" rid="B25">Liao et al., 2016</xref>).</p>
<p>RF generates many decision trees, which makes it very different from a single decision tree algorithm (<xref ref-type="bibr" rid="B32">Pal, 2005</xref>). When RF predicts a new object based on some attributes, each tree in the forest gives its own classification result as a &#x2018;vote,&#x2019; and the overall output of the forest is the class that receives the largest number of votes. In regression problems, the RF output is the average of the outputs of all decision trees (<xref ref-type="bibr" rid="B27">Liaw and Wiener, 2002</xref>; <xref ref-type="bibr" rid="B46">Svetnik et al., 2015</xref>).</p>
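The voting and averaging rules described above can be sketched as follows; simple callables stand in for trained trees, since the aggregation step is independent of how each tree was built:

```python
from collections import Counter

def forest_predict(trees, x):
    """Classification: each tree 'votes' and the forest outputs the
    class with the largest number of votes."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

def forest_regress(trees, x):
    """Regression: the forest output is the mean of the tree outputs."""
    outs = [tree(x) for tree in trees]
    return sum(outs) / len(outs)
```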
</sec>
<sec><title>Neural Network</title>
<p>Neural network is a mathematical model that imitates the behavior of biological neural networks. The model relies on the complexity of the system, adjusting the relationships between internal nodes, to process information (<xref ref-type="bibr" rid="B30">Mukai et al., 2012</xref>). According to the connection style, neural network models can be divided into feed-forward networks and feedback networks. In this paper, we used the Neural Pattern Recognition app in MATLAB, which is a two-layer feed-forward network with sigmoid hidden neurons and softmax output neurons. The network structure is shown in Figure <xref ref-type="fig" rid="F1">1</xref>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>The structure of the two-layer feed-forward network in MATLAB. This figure is from MATLAB and describes the network&#x2019;s working principle, where <italic>W</italic> represents the weights and <italic>b</italic> is the bias variable.</p></caption>
<graphic xlink:href="fgene-09-00515-g001.tif"/>
</fig>
<p>A neural network has three important parts: the input layer, the hidden layer and the output layer. The input layer accepts the input data, and the results are obtained from the output layer. The layers between the input layer and the output layer are called hidden layers, because they are invisible from the outside. There are no connections between neurons in the same layer. In this network, the number of hidden neurons was set to 10, which gives a better performance. Suppose the input vector is <inline-formula><mml:math id="M11"><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant='italic'>x</mml:mi></mml:mrow><mml:mrow><mml:mo mathvariant='italic'>&#x2192;</mml:mo></mml:mrow></mml:mover></mml:mrow></mml:math></inline-formula>, the weight vector is <inline-formula><mml:math id="M12"><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant='italic'>w</mml:mi></mml:mrow><mml:mrow><mml:mo mathvariant='italic'>&#x2192;</mml:mo></mml:mrow></mml:mover></mml:mrow></mml:math></inline-formula>, and the activation function is a sigmoid function; then the output is:</p>
<disp-formula id="E4"><mml:math id="M4"><mml:mrow><mml:mi mathvariant='normal'>y</mml:mi><mml:mo mathvariant='normal'>=</mml:mo><mml:mi mathvariant='normal'>s</mml:mi><mml:mi mathvariant='normal'>i</mml:mi><mml:mi mathvariant='normal'>g</mml:mi><mml:mi mathvariant='normal'>m</mml:mi><mml:mi mathvariant='normal'>o</mml:mi><mml:mi mathvariant='normal'>i</mml:mi><mml:mi mathvariant='normal'>d</mml:mi><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:msup><mml:mrow><mml:mover><mml:mrow><mml:mi mathvariant='italic'>w</mml:mi></mml:mrow><mml:mrow><mml:mo mathvariant='normal'>&#x2192;</mml:mo></mml:mrow></mml:mover></mml:mrow><mml:mrow><mml:mi mathvariant='normal'>T</mml:mi></mml:mrow></mml:msup><mml:mo mathvariant='normal'>&#x22c5;</mml:mo><mml:mover><mml:mrow><mml:mi mathvariant='italic'>x</mml:mi></mml:mrow><mml:mrow><mml:mo mathvariant='normal'>&#x2192;</mml:mo></mml:mrow></mml:mover><mml:mo mathvariant='normal'>)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>and the sigmoid is:</p>
<disp-formula id="E5"><mml:math id="M5"><mml:mrow><mml:mi mathvariant='normal'>s</mml:mi><mml:mi mathvariant='normal'>i</mml:mi><mml:mi mathvariant='normal'>g</mml:mi><mml:mi mathvariant='normal'>m</mml:mi><mml:mi mathvariant='normal'>o</mml:mi><mml:mi mathvariant='normal'>i</mml:mi><mml:mi mathvariant='normal'>d</mml:mi><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='normal'>x</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>=</mml:mo><mml:mfrac><mml:mrow><mml:mn mathvariant='normal'>1</mml:mn></mml:mrow><mml:mrow><mml:mn mathvariant='normal'>1</mml:mn><mml:mo mathvariant='normal'>+</mml:mo><mml:msup><mml:mrow><mml:mi mathvariant='italic'>e</mml:mi></mml:mrow><mml:mrow><mml:mo mathvariant='normal'>&#x2212;</mml:mo><mml:mi mathvariant='normal'>x</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:mrow></mml:math></disp-formula>
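The single-neuron output defined by the two formulas above can be written directly in Python (an illustration of the formulas, not the MATLAB network itself):

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(w, x):
    """y = sigmoid(w^T . x): weighted sum of inputs passed through
    the sigmoid activation."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
```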
</sec>
</sec>
<sec><title>Model Validation</title>
<p>In many studies, two validation methods are often used to evaluate the capability of a model, namely the hold-out method and the k-fold cross validation method (<xref ref-type="bibr" rid="B21">Kohavi, 1995</xref>; <xref ref-type="bibr" rid="B3">Bengio and Grandvalet, 2005</xref>; <xref ref-type="bibr" rid="B19">Kim, 2009</xref>; <xref ref-type="bibr" rid="B5">Chen et al., 2016</xref>; <xref ref-type="bibr" rid="B39">Refaeilzadeh et al., 2016</xref>; <xref ref-type="bibr" rid="B54">Yang et al., 2016</xref>, <xref ref-type="bibr" rid="B53">2018</xref>; <xref ref-type="bibr" rid="B45">Su et al., 2018</xref>; <xref ref-type="bibr" rid="B47">Tang H. et al., 2018</xref>). According to the goal of each problem and the size of the data, different methods can be chosen. In the hold-out method, the dataset is divided into two parts, a training set and a test set; the training set is used to train the machine learning algorithm, and the disjoint test set is used to evaluate the model (<xref ref-type="bibr" rid="B19">Kim, 2009</xref>). In this study, we used this method to verify the universal applicability of the methods. In the k-fold cross validation method, the whole dataset is used to train and test the classifier (<xref ref-type="bibr" rid="B19">Kim, 2009</xref>). First, the dataset is evenly divided into <italic>k</italic> sections, called folds. In each round of training, <italic>k</italic>-1 folds are used to train the model and the remaining fold is used to test it. This process is repeated <italic>k</italic> times, so that each fold serves as the test set once. The final result is the average of the test performance over all folds (<xref ref-type="bibr" rid="B21">Kohavi, 1995</xref>). The advantage of this method is that every sample in the dataset is used for both training and testing, which avoids higher variance (<xref ref-type="bibr" rid="B39">Refaeilzadeh et al., 2016</xref>; <xref ref-type="bibr" rid="B18">Kavakiotis et al., 2017</xref>). In this study, we used the five-fold cross validation method.</p>
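The k-fold procedure can be sketched as a plain Python generator (an illustration of the splitting scheme; the experiments themselves used WEKA and MATLAB):

```python
def k_fold_splits(indices, k=5):
    """Partition sample indices into k roughly equal folds.

    Yields (train, test) pairs: each fold is the test set exactly once,
    and the remaining k-1 folds form the training set.
    """
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Averaging a metric over the k `(train, test)` pairs gives the cross-validated estimate described above.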
</sec>
<sec><title>Feature Selection</title>
<p>Feature selection methods reduce the number of attributes and thereby avoid redundant features. There are many feature selection methods; in this study, we used PCA and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality.</p>
<sec><title>Principal Component Analysis</title>
<p>PCA (<xref ref-type="bibr" rid="B50">Wang and Paliwal, 2003</xref>; <xref ref-type="bibr" rid="B33">Polat and G&#x00FC;nes, 2007</xref>; <xref ref-type="bibr" rid="B55">You et al., 2018</xref>) obtains <italic>K</italic> eigenvalues and unit eigenvectors by solving the characteristic equation of the correlation matrix of the observed variables. The eigenvalues are sorted from large to small, each representing the variance of the observed variables explained by the corresponding one of the <italic>K</italic> principal components (<xref ref-type="bibr" rid="B44">Smith, 2002</xref>).</p>
<p>The model for extracting principal component factors is:</p>
<disp-formula id="E6"><mml:math id="M6"><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant='italic'>F</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>i</mml:mi></mml:mrow></mml:msub><mml:mo mathvariant='normal'>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>i</mml:mi><mml:mn mathvariant='normal'>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi mathvariant='italic'>X</mml:mi></mml:mrow><mml:mrow><mml:mn mathvariant='normal'>1</mml:mn></mml:mrow></mml:msub><mml:mo mathvariant='normal'>+</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>i</mml:mi><mml:mn mathvariant='normal'>2</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi mathvariant='italic'>X</mml:mi></mml:mrow><mml:mrow><mml:mn mathvariant='normal'>2</mml:mn></mml:mrow></mml:msub><mml:mo mathvariant='normal'>+</mml:mo><mml:mo mathvariant='normal'>&#x22EF;</mml:mo><mml:mo mathvariant='normal'>+</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>i</mml:mi><mml:mi mathvariant='normal'>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi mathvariant='italic'>X</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>k</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>i</mml:mi><mml:mo mathvariant='normal'>=</mml:mo><mml:mn mathvariant='normal'>1</mml:mn><mml:mo mathvariant='normal'>,</mml:mo><mml:mn mathvariant='normal'>2</mml:mn><mml:mo mathvariant='normal'>,</mml:mo><mml:mn mathvariant='normal'>...</mml:mn><mml:mo mathvariant='normal'>,</mml:mo><mml:mi mathvariant='italic'>m</mml:mi><mml:mo mathvariant='normal'>)</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <italic>F<sub>i</sub></italic> is the <italic>i</italic>-th principal component factor; <italic>T<sub>ij</sub></italic> is the loading of the <italic>i</italic>-th principal component factor on the <italic>j</italic>-th indicator; <italic>m</italic> is the number of principal component factors; and <italic>k</italic> is the number of indicators.</p>
<p>The PCA method reduces the original multiple indicators to a small number of comprehensive indicators. These comprehensive indicators reflect the vast majority of the information carried by the original indicators, are uncorrelated with each other, and thus avoid redundant information (<xref ref-type="bibr" rid="B13">Jackson, 1993</xref>; <xref ref-type="bibr" rid="B17">Jolliffe, 1998</xref>). At the same time, the reduced number of indicators facilitates further calculation, analysis, and evaluation.</p>
<p>We used Statistical Product and Service Solutions (SPSS) to implement the PCA algorithm. SPSS is a family of software products and related services from IBM, used mainly for statistical analysis, data mining, and predictive analytics. It has a friendly visual interface and is easy to operate.</p>
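<p>The principal component extraction described above can be sketched as follows. This is a minimal NumPy illustration of the eigendecomposition of the correlation matrix, not the SPSS procedure we actually used; the function and variable names are our own.</p>

```python
import numpy as np

def pca_factors(X, m):
    """Project data X (n samples x k indicators) onto its first m principal components."""
    # Standardize so that the covariance matrix of Z is the correlation matrix of X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    R = np.cov(Z, rowvar=False)              # correlation matrix of the observed variables
    eigvals, eigvecs = np.linalg.eigh(R)     # solve the characteristic equation
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues from large to small
    T = eigvecs[:, order[:m]]                # loadings T_ij of the first m components
    F = Z @ T                                # F_i = T_i1*X_1 + T_i2*X_2 + ... + T_ik*X_k
    return F, eigvals[order]
```

<p>The resulting factors are mutually uncorrelated, and the sorted eigenvalues show how much variance each component explains.</p>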
</sec>
<sec><title>Minimum Redundancy Maximum Relevance</title>
<p>mRMR (<xref ref-type="bibr" rid="B13">Jackson, 1993</xref>; <xref ref-type="bibr" rid="B41">Sakar et al., 2012</xref>; <xref ref-type="bibr" rid="B24">Li et al., 2016</xref>; <xref ref-type="bibr" rid="B49">Wang et al., 2018</xref>) selects features that have the maximum Euclidean distances from each other, or the minimum pairwise correlations. The minimum redundancy criterion is usually complemented by a maximum relevance criterion, such as maximal mutual information with the target phenotype. This brings two benefits. First, with the same number of features, an mRMR feature set represents the target phenotype more broadly and therefore generalizes better. Second, a smaller mRMR feature set can effectively cover the same space as a larger conventional feature set. For categorical variables, the similarity between features is measured by mutual information, and minimum redundancy means choosing the features that differ most from one another. Similar to mRMR, researchers also developed Maximum Relevance Maximum Distance (MRMD) (<xref ref-type="bibr" rid="B59">Zou et al., 2016b</xref>) for feature ranking, and both have been employed in several biomedical studies (<xref ref-type="bibr" rid="B58">Zou et al., 2016a</xref>; <xref ref-type="bibr" rid="B15">Jia et al., 2018</xref>; <xref ref-type="bibr" rid="B48">Tang W. et al., 2018</xref>; <xref ref-type="bibr" rid="B52">Wei et al., 2018</xref>).</p>
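<p>A greedy mRMR ranking for discrete features can be sketched as below. This is our own minimal illustration, scoring each candidate by mutual-information relevance to the target minus average mutual-information redundancy with the already-selected features; it is not the exact implementation used in the cited work.</p>

```python
import numpy as np
from collections import Counter

def mutual_info(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr(features, target, m):
    """Greedily rank m features: maximize relevance to target, minimize redundancy."""
    selected = []
    while len(selected) < m:
        best, best_score = None, -np.inf
        for name, values in features.items():
            if name in selected:
                continue
            relevance = mutual_info(values, target)
            redundancy = (np.mean([mutual_info(values, features[s])
                                   for s in selected]) if selected else 0.0)
            if relevance - redundancy > best_score:
                best, best_score = name, relevance - redundancy
        selected.append(best)
    return selected
```

<p>The returned ranking is what we used to pick the top five (Luzhou) and top three (Pima Indians) attributes.</p>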
</sec>
</sec>
<sec><title>Measurement</title>
<p>In this study, we used sensitivity (SN), specificity (SP), accuracy (ACC), and the Matthews correlation coefficient (MCC) to measure classification effectiveness. The formulas are as follows:</p>
<disp-formula id="E7"><mml:math id="M7"><mml:mrow><mml:mi mathvariant='normal'>S</mml:mi><mml:mi mathvariant='normal'>N</mml:mi><mml:mo mathvariant='normal'>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>N</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<disp-formula id="E8"><mml:math id="M8"><mml:mrow><mml:mi mathvariant='normal'>S</mml:mi><mml:mi mathvariant='normal'>P</mml:mi><mml:mo mathvariant='normal'>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>N</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>P</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<disp-formula id="E9"><mml:math id="M9"><mml:mrow><mml:mi mathvariant='normal'>A</mml:mi><mml:mi mathvariant='normal'>C</mml:mi><mml:mi mathvariant='normal'>C</mml:mi><mml:mo mathvariant='normal'>=</mml:mo><mml:mfrac><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>P</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>N</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<disp-formula id="E10"><mml:math id="M10"><mml:mrow><mml:mi mathvariant='normal'>M</mml:mi><mml:mi mathvariant='normal'>C</mml:mi><mml:mi mathvariant='normal'>C</mml:mi><mml:mo mathvariant='normal'>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>&#x00d7;</mml:mo><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>)</mml:mo></mml:mrow><mml:mo mathvariant='normal'>&#x2212;</mml:mo><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>&#x00d7;</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mrow><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>&#x00d7;</mml:mo><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>&#x00d7;</mml:mo><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>T</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>P</mml:mi><mml:mo mathvariant='normal'>)</mml:mo><mml:mo mathvariant='normal'>&#x00d7;</mml:mo><mml:mo mathvariant='normal'>(</mml:mo><mml:mi mathvariant='italic'>T</mml:mi><mml:mi 
mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>+</mml:mo><mml:mi mathvariant='italic'>F</mml:mi><mml:mi mathvariant='italic'>N</mml:mi><mml:mo mathvariant='normal'>)</mml:mo></mml:mrow></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>where true positive (TP) is the number of samples in the positive set identified as positive; true negative (TN) is the number of samples in the negative set classified as negative; false positive (FP) is the number of samples in the negative set identified as positive; and false negative (FN) is the number of samples in the positive set identified as negative. These quantities are often used to evaluate the quality of classification models. Accuracy is defined as the ratio of the number of samples correctly classified by the classifier to the total number of samples. In medical statistics, two basic characteristics are sensitivity (SN), the true positive rate, and specificity (SP), the true negative rate. The MCC is a correlation coefficient between the actual classification and the predicted classification, with values in [-1, 1]. An MCC of 1 indicates a perfect prediction; an MCC of 0 indicates a prediction no better than random; and an MCC of -1 means the predicted classification is completely inconsistent with the actual classification.</p>
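<p>The four measures follow directly from the confusion-matrix counts; a minimal sketch of the formulas above:</p>

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute SN, SP, ACC, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                     # sensitivity: true positive rate
    sp = tn / (tn + fp)                     # specificity: true negative rate
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = ((tp * tn) - (fn * fp)) / denom if denom else 0.0
    return sn, sp, acc, mcc
```

<p>For example, a classifier with 50 TP, 40 TN, 10 FP, and 0 FN has perfect sensitivity but imperfect specificity, and the MCC summarizes both in a single coefficient.</p>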
</sec>
</sec>
<sec><title>Results and Discussion</title>
<p>In the tables, Luzhou represents the dataset of hospital physical examination data from Luzhou, China, and Pima Indians represents the Pima Indians diabetes dataset. The two datasets contain 14 and 8 attributes, respectively.</p>
<p>For comparison, we first used all features to predict diabetes. The results are shown in Table <xref ref-type="table" rid="T1">1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Predicting diabetes using all features.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset</th>
<th valign="top" align="center">Classifier</th>
<th valign="top" align="center">ACC</th>
<th valign="top" align="center">SN</th>
<th valign="top" align="center">SP</th>
<th valign="top" align="center">MCC</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Luzhou</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">0.8084</td>
<td valign="top" align="center">0.8495</td>
<td valign="top" align="center">0.7673</td>
<td valign="top" align="center">0.6189</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">J48</td>
<td valign="top" align="center">0.7853</td>
<td valign="top" align="center">0.8153</td>
<td valign="top" align="center">0.7563</td>
<td valign="top" align="center">0.5726</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Neural network</td>
<td valign="top" align="center">0.7841</td>
<td valign="top" align="center">0.8231</td>
<td valign="top" align="center">0.7451</td>
<td valign="top" align="center">0.5699</td>
</tr>
<tr>
<td valign="top" align="left">Pima Indians</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">0.7604</td>
<td valign="top" align="center">0.7578</td>
<td valign="top" align="center">0.7631</td>
<td valign="top" align="center">0.5210</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">J48</td>
<td valign="top" align="center">0.7275</td>
<td valign="top" align="center">0.7027</td>
<td valign="top" align="center">0.7523</td>
<td valign="top" align="center">0.4569</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Neural network</td>
<td valign="top" align="center">0.7667</td>
<td valign="top" align="center">0.7828</td>
<td valign="top" align="center">0.7508</td>
<td valign="top" align="center">0.5349</td>
</tr>
<tr>
<td valign="top" align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As Table <xref ref-type="table" rid="T1">1</xref> shows, reasonably good results can be obtained with all features. RF has the best result among the three classifiers on the Luzhou physical examination dataset, while on the Pima Indians dataset random forest performs similarly to the neural network. The decision tree structures learned from the Luzhou and Pima Indians datasets are shown in Figures <xref ref-type="fig" rid="F2">2</xref>, <xref ref-type="fig" rid="F3">3</xref>, respectively. In both trees the root node is glucose, showing that glucose has the maximum information gain; this agrees with common sense and with the clinical basis for diagnosis. However, the Luzhou dataset contains diabetic patients whose fasting blood glucose is below 6.8; a possible reason is that they injected insulin before the physical examination to control their blood sugar levels.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Decision tree structure using all features and the Luzhou dataset. The figure shows that fasting blood sugar is an important index for predicting diabetes; weight and age also have high information gain and play vital roles in this method.</p></caption>
<graphic xlink:href="fgene-09-00515-g002.tif"/>
</fig>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Decision tree structure using all features and the Pima Indians dataset. Glucose is the root node, indicating that this index has the highest information gain; insulin and age also play important roles in this method.</p></caption>
<graphic xlink:href="fgene-09-00515-g003.tif"/>
</fig>
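<p>That glucose becomes the root node follows from the information-gain criterion used by decision trees such as J48: the root is the attribute whose split most reduces the entropy of the class labels. A minimal sketch for discrete features (our own illustration, not Weka's implementation):</p>

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction from splitting `labels` on a discrete `feature`."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

<p>An attribute that perfectly separates the classes attains the maximum gain (the full label entropy), while a constant attribute gains nothing, which is why the most informative index appears at the root.</p>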
<p>From the relevant literature, we know there are three indicators for diagnosing diabetes mellitus: fasting blood glucose, random blood glucose, and blood glucose tolerance. Because the Luzhou dataset only contains fasting blood glucose and the Pima Indians dataset only contains blood glucose tolerance, we used fasting blood glucose and blood glucose tolerance for prediction, respectively. The results are shown in Table <xref ref-type="table" rid="T2">2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Predicting diabetes using blood glucose only.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset</th>
<th valign="top" align="center">Classifier</th>
<th valign="top" align="center">ACC</th>
<th valign="top" align="center">SN</th>
<th valign="top" align="center">SP</th>
<th valign="top" align="center">MCC</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Luzhou</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">0.7597</td>
<td valign="top" align="center">0.8795</td>
<td valign="top" align="center">0.6400</td>
<td valign="top" align="center">0.5350</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">J48</td>
<td valign="top" align="center">0.7610</td>
<td valign="top" align="center">0.8818</td>
<td valign="top" align="center">0.6401</td>
<td valign="top" align="center">0.5379</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Neural network</td>
<td valign="top" align="center">0.7572</td>
<td valign="top" align="center">0.8870</td>
<td valign="top" align="center">0.6274</td>
<td valign="top" align="center">0.5327</td>
</tr>
<tr>
<td valign="top" align="left">Pima Indians</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">0.6728</td>
<td valign="top" align="center">0.6765</td>
<td valign="top" align="center">0.6692</td>
<td valign="top" align="center">0.3461</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">J48</td>
<td valign="top" align="center">0.6895</td>
<td valign="top" align="center">0.7320</td>
<td valign="top" align="center">0.6355</td>
<td valign="top" align="center">0.3733</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Neural network</td>
<td valign="top" align="center">0.7198</td>
<td valign="top" align="center">0.6950</td>
<td valign="top" align="center">0.7446</td>
<td valign="top" align="center">0.4411</td>
</tr>
<tr>
<td valign="top" align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>According to Table <xref ref-type="table" rid="T2">2</xref>, on the Luzhou dataset J48 performs better than the other classifiers, with an accuracy above 0.76. On the Pima Indians dataset, using blood glucose tolerance alone does not perform well.</p>
<p>Then, we used mRMR to select features and obtained a score for each feature. Based on these scores, we chose the first five features of the Luzhou dataset (height, HDL, fasting glucose, breathe, and LDL) and the first three attributes of the Pima Indians dataset (glucose, 2-h serum insulin, and age) to predict diabetes. The results are shown in Table <xref ref-type="table" rid="T3">3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Predicting diabetes after using mRMR to reduce dimensionality.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset</th>
<th valign="top" align="center">Classifier</th>
<th valign="top" align="center">ACC</th>
<th valign="top" align="center">SN</th>
<th valign="top" align="center">SP</th>
<th valign="top" align="center">MCC</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Luzhou</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">0.7508</td>
<td valign="top" align="center">0.8334</td>
<td valign="top" align="center">0.6681</td>
<td valign="top" align="center">0.5085</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">J48</td>
<td valign="top" align="center">0.7613</td>
<td valign="top" align="center">0.8795</td>
<td valign="top" align="center">0.6431</td>
<td valign="top" align="center">0.5379</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Neural network</td>
<td valign="top" align="center">0.7570</td>
<td valign="top" align="center">0.8828</td>
<td valign="top" align="center">0.6313</td>
<td valign="top" align="center">0.5312</td>
</tr>
<tr>
<td valign="top" align="left">Pima Indians</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">0.7721</td>
<td valign="top" align="center">0.7458</td>
<td valign="top" align="center">0.7985</td>
<td valign="top" align="center">0.5451</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">J48</td>
<td valign="top" align="center">0.7534</td>
<td valign="top" align="center">0.7228</td>
<td valign="top" align="center">0.7846</td>
<td valign="top" align="center">0.5095</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Neural network</td>
<td valign="top" align="center">0.7390</td>
<td valign="top" align="center">0.8073</td>
<td valign="top" align="center">0.6708</td>
<td valign="top" align="center">0.4837</td>
</tr>
<tr>
<td valign="top" align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>On the Luzhou dataset, J48 has the best performance, but the results are not better than those obtained with all features. On the Pima Indians dataset, this method with RF as the classifier performs best.</p>
<p>Then we used PCA to reduce the features. Because height and weight are related physical indexes, we excluded them from PCA on the Luzhou dataset. We used SPSS to analyze the factors. According to the KMO and Bartlett tests, both datasets are suitable for PCA, and we obtained the composition matrix and eigenvalues. From the composition matrix and the total variance explained, we derived five new features for the Luzhou dataset and three for the Pima Indians dataset. We used the new features to conduct the experiment, and the results are shown in Table <xref ref-type="table" rid="T4">4</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Predicting diabetes after using PCA to reduce dimensionality.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset</th>
<th valign="top" align="left">Classifier</th>
<th valign="top" align="center">ACC</th>
<th valign="top" align="center">SN</th>
<th valign="top" align="center">SP</th>
<th valign="top" align="center">MCC</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Luzhou</td>
<td valign="top" align="left">RF</td>
<td valign="top" align="center">0.7395</td>
<td valign="top" align="center">0.7435</td>
<td valign="top" align="center">0.7354</td>
<td valign="top" align="center">0.4790</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">J48</td>
<td valign="top" align="center">0.7388</td>
<td valign="top" align="center">0.7335</td>
<td valign="top" align="center">0.7441</td>
<td valign="top" align="center">0.4777</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">Neural network</td>
<td valign="top" align="center">0.7414</td>
<td valign="top" align="center">0.7370</td>
<td valign="top" align="center">0.7457</td>
<td valign="top" align="center">0.4828</td>
</tr>
<tr>
<td valign="top" align="left">Pima Indians</td>
<td valign="top" align="left">RF</td>
<td valign="top" align="center">0.7144</td>
<td valign="top" align="center">0.7057</td>
<td valign="top" align="center">0.7231</td>
<td valign="top" align="center">0.4291</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">J48</td>
<td valign="top" align="center">0.7167</td>
<td valign="top" align="center">0.7381</td>
<td valign="top" align="center">0.6954</td>
<td valign="top" align="center">0.4353</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">Neural network</td>
<td valign="top" align="center">0.7475</td>
<td valign="top" align="center">0.7381</td>
<td valign="top" align="center">0.7569</td>
<td valign="top" align="center">0.4968</td>
</tr>
<tr>
<td valign="top" align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The ACC on the Luzhou dataset is lower than with the above methods, showing that PCA is not suitable for these data. On the Pima Indians dataset, the accuracy is better than using glucose alone, and in this setting the neural network performs best.</p>
<p>To explore the importance of the other indexes in predicting diabetes, we designed the following experiments using the Luzhou dataset. First, we used all features except blood glucose to predict diabetes; the results are shown in Table <xref ref-type="table" rid="T5">5</xref>.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Predicting diabetes using all features except blood glucose.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset</th>
<th valign="top" align="center">Classifier</th>
<th valign="top" align="center">ACC</th>
<th valign="top" align="center">SN</th>
<th valign="top" align="center">SP</th>
<th valign="top" align="center">MCC</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Luzhou</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">0.7225</td>
<td valign="top" align="center">0.7228</td>
<td valign="top" align="center">0.7222</td>
<td valign="top" align="center">0.4450</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">J48</td>
<td valign="top" align="center">0.6917</td>
<td valign="top" align="center">0.6880</td>
<td valign="top" align="center">0.6953</td>
<td valign="top" align="center">0.3834</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Neural network</td>
<td valign="top" align="center">0.6986</td>
<td valign="top" align="center">0.6646</td>
<td valign="top" align="center">0.7326</td>
<td valign="top" align="center">0.3981</td>
</tr>
<tr>
<td valign="top" align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Then, we removed blood glucose, LDL, and HDL, which require testing at a hospital, leaving 11 features. The results are shown in Table <xref ref-type="table" rid="T6">6</xref>.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Predicting diabetes using 11 features.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Dataset</th>
<th valign="top" align="center">Classifier</th>
<th valign="top" align="center">ACC</th>
<th valign="top" align="center">SN</th>
<th valign="top" align="center">SP</th>
<th valign="top" align="center">MCC</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Luzhou</td>
<td valign="top" align="center">RF</td>
<td valign="top" align="center">0.7104</td>
<td valign="top" align="center">0.7082</td>
<td valign="top" align="center">0.7125</td>
<td valign="top" align="center">0.4207</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">J48</td>
<td valign="top" align="center">0.6916</td>
<td valign="top" align="center">0.6880</td>
<td valign="top" align="center">0.6953</td>
<td valign="top" align="center">0.3833</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Neural network</td>
<td valign="top" align="center">0.6983</td>
<td valign="top" align="center">0.6685</td>
<td valign="top" align="center">0.7281</td>
<td valign="top" align="center">0.3973</td>
</tr>
<tr>
<td valign="top" align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>According to Tables <xref ref-type="table" rid="T5">5</xref>, <xref ref-type="table" rid="T6">6</xref>, RF predicts diabetes better than the other classifiers. Although the accuracy is not the best, the predictions can serve as a reference.</p>
<p>We summarized the above results in Figures <xref ref-type="fig" rid="F4">4</xref>, <xref ref-type="fig" rid="F5">5</xref>, which show the accuracy of each method more clearly for comparison.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Results on the Luzhou dataset. The method using all features with random forest has the best performance, while the methods without blood glucose perform poorly.</p></caption>
<graphic xlink:href="fgene-09-00515-g004.tif"/>
</fig>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Results on the Pima Indians dataset. mRMR works well for this dataset, while using glucose alone is not suitable.</p></caption>
<graphic xlink:href="fgene-09-00515-g005.tif"/>
</fig>
<p>From Figures <xref ref-type="fig" rid="F4">4</xref>, <xref ref-type="fig" rid="F5">5</xref>, PCA is not well suited to either dataset, whereas using all features performs well, especially on the Luzhou dataset. There is not much difference among random forest, decision tree, and neural network when the feature set contains blood glucose. Without blood glucose, random forest performs best and, relatively speaking, the neural network performs poorly.</p>
<p>Based on Figure <xref ref-type="fig" rid="F4">4</xref>, we selected the three best-performing methods (all features, mRMR, and blood glucose) and conducted independent test experiments on the Luzhou dataset. The results are shown in Table <xref ref-type="table" rid="T7">7</xref>.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Predicting diabetes on independent test data.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Method</th>
<th valign="top" align="left">Classifier</th>
<th valign="top" align="center">ACC</th>
<th valign="top" align="center">SN</th>
<th valign="top" align="center">SP</th>
<th valign="top" align="center">MCC</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">mRMR</td>
<td valign="top" align="left">RF</td>
<td valign="top" align="center">0.8857</td>
<td valign="top" align="center">0.9568</td>
<td valign="top" align="center">0.8146</td>
<td valign="top" align="center">0.7794</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">J48</td>
<td valign="top" align="center">0.7547</td>
<td valign="top" align="center">0.8647</td>
<td valign="top" align="center">0.6447</td>
<td valign="top" align="center">0.5223</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">Neural network</td>
<td valign="top" align="center">0.7470</td>
<td valign="top" align="center">0.8655</td>
<td valign="top" align="center">0.6284</td>
<td valign="top" align="center">0.5085</td>
</tr>
<tr>
<td valign="top" align="left">All features</td>
<td valign="top" align="left">RF</td>
<td valign="top" align="center">0.8963</td>
<td valign="top" align="center">0.9226</td>
<td valign="top" align="center">0.8700</td>
<td valign="top" align="center">0.7937</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">J48</td>
<td valign="top" align="center">0.8011</td>
<td valign="top" align="center">0.8135</td>
<td valign="top" align="center">0.7887</td>
<td valign="top" align="center">0.6025</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">Neural network</td>
<td valign="top" align="center">0.7725</td>
<td valign="top" align="center">0.7942</td>
<td valign="top" align="center">0.7508</td>
<td valign="top" align="center">0.5455</td>
</tr>
<tr>
<td valign="top" align="left">Blood glucose</td>
<td valign="top" align="left">RF</td>
<td valign="top" align="center">0.7537</td>
<td valign="top" align="center">0.8704</td>
<td valign="top" align="center">0.6371</td>
<td valign="top" align="center">0.5218</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">J48</td>
<td valign="top" align="center">0.7535</td>
<td valign="top" align="center">0.8713</td>
<td valign="top" align="center">0.6358</td>
<td valign="top" align="center">0.5218</td>
</tr>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">Neural network</td>
<td valign="top" align="center">0.5010</td>
<td valign="top" align="center">0.9388</td>
<td valign="top" align="center">0.0631</td>
<td valign="top" align="center">0.0040</td>
</tr>
<tr>
<td valign="top" align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>According to Table <xref ref-type="table" rid="T7">7</xref>, the method using all features still gives the best result, and the method using only blood glucose performs poorly, especially with the neural network as classifier. The reason may be that blood glucose alone carries too little information.</p>
<p>Because the Luzhou dataset was collected by ourselves, it cannot be used for comparison with other studies. To compare with the methods in other papers, we ran 10-fold cross-validation experiments on the Pima Indians dataset. The results are shown in Table <xref ref-type="table" rid="T8">8</xref>.</p>
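<p>The 10-fold protocol behind these comparisons can be sketched as follows. This generic NumPy skeleton, with a trivial majority-class baseline standing in for the real classifiers, only illustrates the evaluation procedure; all function names are our own.</p>

```python
import numpy as np

def k_fold_accuracy(X, y, train_and_predict, k=10, seed=0):
    """Average test accuracy over k-fold cross validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = train_and_predict(X[train], y[train], X[test])
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))

def majority_baseline(X_train, y_train, X_test):
    """Predict the most frequent training label for every test sample."""
    vals, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_test), vals[np.argmax(counts)])
```

<p>Replacing the baseline with a real classifier's train-and-predict routine yields cross-validated accuracies of the kind reported in Table <xref ref-type="table" rid="T8">8</xref>.</p>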
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Comparison with other methods on the Pima Indians dataset (10-fold cross validation).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left">Method</th>
<th valign="top" align="center">ACC</th>
<th valign="top" align="left">Reference</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">mRMR (RF)</td>
<td valign="top" align="center">0.7852</td>
<td valign="top" align="left">Our study</td>
</tr>
<tr>
<td valign="top" align="left">mRMR (J48)</td>
<td valign="top" align="center">0.7806</td>
<td valign="top" align="left">Our study</td>
</tr>
<tr>
<td valign="top" align="left">All feature (RF)</td>
<td valign="top" align="center">0.7604</td>
<td valign="top" align="left">Our study</td>
</tr>
<tr>
<td valign="top" align="left">All feature (J48)</td>
<td valign="top" align="center">0.7275</td>
<td valign="top" align="left">Our study</td>
</tr>
<tr>
<td valign="top" align="left">AWAIS(10xCV)</td>
<td valign="top" align="center">0.7587</td>
<td valign="top" align="left"><xref ref-type="bibr" rid="B34">Polat and Kodaz, 2005</xref></td>
</tr>
<tr>
<td valign="top" align="left">NNEE</td>
<td valign="top" align="center">0.7557</td>
<td valign="top" align="left"><xref ref-type="bibr" rid="B16">Jiang and Zhou, 2004</xref></td>
</tr>
<tr>
<td valign="top" align="left">AIRS(13xCV)</td>
<td valign="top" align="center">0.7410</td>
<td valign="top" align="left"><xref ref-type="bibr" rid="B51">Watkins and Boggess, 2002</xref></td>
</tr>
</tbody>
</table>
</table-wrap>
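The 10-fold cross-validation protocol used for the comparison above can be sketched as follows. This is an illustrative Python outline only: a toy one-feature threshold classifier on synthetic "glucose" readings stands in for the actual random forest and J48 models, and all names and data here are hypothetical, not the study's code.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit, predict, k=10):
    """Mean accuracy over k rounds; each fold serves once as the test set."""
    folds = k_fold_indices(len(X), k)
    accs = []
    for test_idx in folds:
        test = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in test]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accs.append(hits / len(test_idx))
    return sum(accs) / len(accs)

# Toy stand-in classifier: threshold on a single "fasting glucose" feature,
# placed halfway between the two class means seen in the training data.
def fit_threshold(X, y):
    pos = [x[0] for x, t in zip(X, y) if t == 1]
    neg = [x[0] for x, t in zip(X, y) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x[0] >= threshold else 0

# Synthetic two-class data (diabetic vs. healthy "glucose" readings).
rng = random.Random(1)
X = [[rng.gauss(140, 25)] for _ in range(100)] + \
    [[rng.gauss(100, 25)] for _ in range(100)]
y = [1] * 100 + [0] * 100
acc = cross_val_accuracy(X, y, fit_threshold, predict_threshold, k=10)
```

Every sample is tested exactly once, so the averaged accuracy uses all of the data for both training and testing, which is why 10-fold cross-validation is the standard protocol for comparing accuracies across studies on the same dataset.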
</sec>
<sec><title>Conclusion</title>
<p>Diabetes mellitus is a disease that can cause many complications, so how to accurately predict and diagnose it with machine learning is worth studying. From all the above experiments, we found that the accuracy obtained with PCA is not good, whereas using all features or the mRMR-selected features gives better results. Using fasting glucose alone also performs well, especially on the Luzhou dataset, which indicates that fasting glucose is the most important index for prediction; however, fasting glucose by itself cannot achieve the best result, so accurate prediction requires additional indexes. In addition, comparing the results of the three classifiers shows little overall difference among random forest, decision tree, and neural network, although random forest is clearly better than the other classifiers for some feature sets. The best accuracy is 0.8084 for the Luzhou dataset and 0.7721 for the Pima Indians dataset, which indicates that machine learning can be used to predict diabetes, but that finding suitable attributes, classifiers, and data mining methods is very important. Because of the limits of the data, we cannot predict the type of diabetes; in the future, we aim to predict the type of diabetes and to explore the weight of each indicator, which may further improve the accuracy of diabetes prediction. We have uploaded the Pima Indians dataset at <ext-link ext-link-type="uri" xlink:href="http://121.42.167.206/PIMAINDIANS/data.html">http://121.42.167.206/PIMAINDIANS/data.html</ext-link>.</p>
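The mRMR (minimum-redundancy maximum-relevance) selection credited above with the best results can be sketched as a greedy ranking: at each step, pick the feature with the highest relevance to the label, measured by mutual information, minus its mean redundancy with the features already chosen. The following is an illustrative pure-Python version with a simple discrete mutual-information estimate; the toy data and all function names are ours, not the paper's implementation.

```python
import math
from collections import Counter

def mutual_info(a, b):
    """I(A;B) in nats for two discrete sequences of equal length."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def mrmr_rank(features, labels):
    """Greedy mRMR ranking: maximize relevance I(f; y) minus the mean
    redundancy I(f; s) over already-selected features s."""
    remaining = list(range(len(features)))
    chosen = []
    while remaining:
        def score(j):
            rel = mutual_info(features[j], labels)
            red = (sum(mutual_info(features[j], features[s]) for s in chosen)
                   / len(chosen)) if chosen else 0.0
            return rel - red
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: feature 0 is highly relevant, feature 1 is an exact copy of it
# (redundant), feature 2 is weakly relevant but not redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [0, 0, 0, 0, 1, 1, 1, 0],  # strongly relevant
    [0, 0, 0, 0, 1, 1, 1, 0],  # redundant duplicate of feature 0
    [1, 0, 0, 0, 1, 1, 0, 1],  # weakly relevant, mostly independent
]
ranking = mrmr_rank(features, labels)
```

The duplicate feature is ranked last despite its high relevance, because its redundancy with the already-selected copy cancels its contribution; this is exactly why mRMR can beat plain relevance ranking when attributes overlap.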
</sec>
<sec><title>Author Contributions</title>
<p>QZ designed the experiments. KQ and YL performed the experiments. KQ wrote the paper. DY and YJ analyzed the data. HT provided the data.</p>
</sec>
<sec><title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="financial-disclosure">
<p><bold>Funding.</bold> The work was supported by the National Key R&#x0026;D Program of China (SQ2018YFC090002), the Natural Science Foundation of China (Nos. 61771331 and 61702430), the Scientific Research Foundation of the Health Department of Sichuan Province (120373), the Scientific Research Foundation of the Education Department of Sichuan Province (11ZB122), and the Scientific Research Foundation of Luzhou city (2012-S-36).</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alghamdi</surname> <given-names>M.</given-names></name> <name><surname>Al-Mallah</surname> <given-names>M.</given-names></name> <name><surname>Keteyian</surname> <given-names>S.</given-names></name> <name><surname>Brawner</surname> <given-names>C.</given-names></name> <name><surname>Ehrman</surname> <given-names>J.</given-names></name> <name><surname>Sakr</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: the henry ford exercise testing (FIT) project.</article-title> <source><italic>PLoS One</italic></source> <volume>12</volume>:<issue>e0179805</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0179805</pub-id> <pub-id pub-id-type="pmid">28738059</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><collab>American Diabetes Association</collab> (<year>2012</year>). <article-title>Diagnosis and classification of diabetes mellitus.</article-title> <source><italic>Diabetes Care</italic></source> <volume>35(Suppl. 1)</volume>, <fpage>S64</fpage>&#x2013;<lpage>S71</lpage>. <pub-id pub-id-type="doi">10.2337/dc12-s064</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Grandvalet</surname> <given-names>Y.</given-names></name></person-group> (<year>2005</year>). <source><italic>Bias in Estimating the Variance of K -Fold Cross-Validation.</italic></source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>, <fpage>75</fpage>&#x2013;<lpage>95</lpage>. <pub-id pub-id-type="doi">10.1007/0-387-24555-3_5</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breiman</surname> <given-names>L.</given-names></name></person-group> (<year>2001</year>). <article-title>Random forest.</article-title> <source><italic>Mach. Learn.</italic></source> <volume>45</volume> <fpage>5</fpage>&#x2013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X. X.</given-names></name> <name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>W. C.</given-names></name> <name><surname>Wu</surname> <given-names>H.</given-names></name> <name><surname>Chen</surname> <given-names>W.</given-names></name> <name><surname>Ding</surname> <given-names>H.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Identification of bacterial cell wall lyases via pseudo amino acid composition.</article-title> <source><italic>Biomed. Res. Int.</italic></source> <volume>2016</volume>:<issue>1654623</issue>. <pub-id pub-id-type="doi">10.1155/2016/1654623</pub-id> <pub-id pub-id-type="pmid">27437396</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cox</surname> <given-names>M. E.</given-names></name> <name><surname>Edelman</surname> <given-names>D.</given-names></name></person-group> (<year>2009</year>). <article-title>Tests for screening and diagnosis of type 2 diabetes.</article-title> <source><italic>Clin. Diabetes</italic></source> <volume>27</volume> <fpage>132</fpage>&#x2013;<lpage>138</lpage>. <pub-id pub-id-type="doi">10.2337/diaclin.27.4.132</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Duygu</surname> <given-names>&#x00E7;.</given-names></name> <name><surname>Esin</surname> <given-names>D.</given-names></name></person-group> (<year>2011</year>). <article-title>An automatic diabetes diagnosis system based on LDA-wavelet support vector machine classifier.</article-title> <source><italic>Expert Syst. Appl.</italic></source> <volume>38</volume> <fpage>8311</fpage>&#x2013;<lpage>8315</lpage>. <pub-id pub-id-type="pmid">17138215</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Friedl</surname> <given-names>M. A.</given-names></name> <name><surname>Brodley</surname> <given-names>C. E.</given-names></name></person-group> (<year>1997</year>). <article-title>Decision tree classification of land cover from remotely sensed data.</article-title> <source><italic>Remote Sens. Environ.</italic></source> <volume>61</volume> <fpage>399</fpage>&#x2013;<lpage>409</lpage>. <pub-id pub-id-type="pmid">24310365</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Georga</surname> <given-names>E. I.</given-names></name> <name><surname>Protopappas</surname> <given-names>V. C.</given-names></name> <name><surname>Ardigo</surname> <given-names>D.</given-names></name> <name><surname>Marina</surname> <given-names>M.</given-names></name> <name><surname>Zavaroni</surname> <given-names>I.</given-names></name> <name><surname>Polyzos</surname> <given-names>D.</given-names></name><etal/></person-group> (<year>2013</year>). <article-title>Multivariate prediction of subcutaneous glucose concentration in type 1 diabetes patients based on support vector regression.</article-title> <source><italic>IEEE J. Biomed. Health Inform.</italic></source> <volume>17</volume> <fpage>71</fpage>&#x2013;<lpage>81</lpage>. <pub-id pub-id-type="doi">10.1109/TITB.2012.2219876</pub-id> <pub-id pub-id-type="pmid">23008265</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Habibi</surname> <given-names>S.</given-names></name> <name><surname>Ahmadi</surname> <given-names>M.</given-names></name> <name><surname>Alizadeh</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Type 2 diabetes mellitus screening and risk factors using decision tree: results of data mining.</article-title> <source><italic>Glob. J. Health Sci.</italic></source> <volume>7</volume> <fpage>304</fpage>&#x2013;<lpage>310</lpage>. <pub-id pub-id-type="doi">10.5539/gjhs.v7n5p304</pub-id> <pub-id pub-id-type="pmid">26156928</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>L.</given-names></name> <name><surname>Luo</surname> <given-names>S.</given-names></name> <name><surname>Yu</surname> <given-names>J.</given-names></name> <name><surname>Pan</surname> <given-names>L.</given-names></name> <name><surname>Chen</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Rule extraction from support vector machines using ensemble learning approach: an application for diagnosis of diabetes.</article-title> <source><italic>IEEE J. Biomed. Health Inform.</italic></source> <volume>19</volume> <fpage>728</fpage>&#x2013;<lpage>734</lpage>. <pub-id pub-id-type="doi">10.1109/JBHI.2014.2325615</pub-id> <pub-id pub-id-type="pmid">24860043</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Iancu</surname> <given-names>I.</given-names></name> <name><surname>Mota</surname> <given-names>M.</given-names></name> <name><surname>Iancu</surname> <given-names>E.</given-names></name></person-group> (<year>2008</year>). <article-title>&#x201C;Method for the analysing of blood glucose dynamics in diabetes mellitus patients,&#x201D; in</article-title> <source><italic>Proceedings of the 2008 IEEE International Conference on Automation, Quality and Testing, Robotics</italic></source>, <publisher-loc>Cluj-Napoca</publisher-loc>. <pub-id pub-id-type="doi">10.1109/AQTR.2008.4588883</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jackson</surname> <given-names>D. A.</given-names></name></person-group> (<year>1993</year>). <article-title>Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches.</article-title> <source><italic>Ecology</italic></source> <volume>74</volume> <fpage>2204</fpage>&#x2013;<lpage>2214</lpage>. <pub-id pub-id-type="doi">10.2307/1939574</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jegan</surname> <given-names>C.</given-names></name></person-group> (<year>2014</year>). <article-title>Classification of diabetes disease using support vector machine.</article-title> <source><italic>Microcomput. Dev.</italic></source> <volume>3</volume> <fpage>1797</fpage>&#x2013;<lpage>1801</lpage>.</citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jia</surname> <given-names>C.</given-names></name> <name><surname>Zuo</surname> <given-names>Y.</given-names></name> <name><surname>Zou</surname> <given-names>Q.</given-names></name></person-group> (<year>2018</year>). <article-title>O-GlcNAcPRED-II: an integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique.</article-title> <source><italic>Bioinformatics</italic></source> <volume>34</volume> <fpage>2029</fpage>&#x2013;<lpage>2036</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty039</pub-id> <pub-id pub-id-type="pmid">29420699</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Zhou</surname> <given-names>Z. H.</given-names></name></person-group> (<year>2004</year>). <article-title>Editing training data for kNN classifiers with neural network ensemble.</article-title> <source><italic>Lect. Notes Comput. Sci.</italic></source> <volume>3173</volume> <fpage>356</fpage>&#x2013;<lpage>361</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-540-28647-9_60</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jolliffe</surname> <given-names>I. T.</given-names></name></person-group> (<year>1998</year>). <article-title>&#x201C;Principal components analysis,&#x201D; in</article-title> <source><italic>Proceedings of the International Conference on Document Analysis and Recognition</italic></source> (<publisher-loc>Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>).</citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kavakiotis</surname> <given-names>I.</given-names></name> <name><surname>Tsave</surname> <given-names>O.</given-names></name> <name><surname>Salifoglou</surname> <given-names>A.</given-names></name> <name><surname>Maglaveras</surname> <given-names>N.</given-names></name> <name><surname>Vlahavas</surname> <given-names>I.</given-names></name> <name><surname>Chouvarda</surname> <given-names>I.</given-names></name></person-group> (<year>2017</year>). <article-title>Machine learning and data mining methods in diabetes research.</article-title> <source><italic>Comput. Struct. Biotechnol. J.</italic></source> <volume>15</volume> <fpage>104</fpage>&#x2013;<lpage>116</lpage>. <pub-id pub-id-type="doi">10.1016/j.csbj.2016.12.005</pub-id> <pub-id pub-id-type="pmid">28138367</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>J. H.</given-names></name></person-group> (<year>2009</year>). <article-title>Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap.</article-title> <source><italic>Comput. Stat. Data Anal.</italic></source> <volume>53</volume> <fpage>3735</fpage>&#x2013;<lpage>3745</lpage>. <pub-id pub-id-type="doi">10.1016/j.csda.2009.04.009</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kohabi</surname> <given-names>R.</given-names></name></person-group> (<year>1996</year>). <article-title>&#x201C;Scaling up the accuracy of naive-bayes classifiers : a decision-tree hybrid,&#x201D; in</article-title> <source><italic>Proceedings of the Second International Conference on Knowledge Discovery and Data Mining</italic></source>, <publisher-loc>Portland, OR</publisher-loc>.</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kohavi</surname> <given-names>R.</given-names></name></person-group> (<year>1995</year>). <article-title>&#x201C;A study of cross-validation and bootstrap for accuracy estimation and model selection,&#x201D; in</article-title> <source><italic>Proceedings of the 14th International Joint Conference on Artificial Intelligence</italic></source>, <publisher-loc>Montreal</publisher-loc>.</citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krasteva</surname> <given-names>A.</given-names></name> <name><surname>Panov</surname> <given-names>V.</given-names></name> <name><surname>Krasteva</surname> <given-names>A.</given-names></name> <name><surname>Kisselova</surname> <given-names>A.</given-names></name> <name><surname>Krastev</surname> <given-names>Z.</given-names></name></person-group> (<year>2011</year>). <article-title>Oral cavity and systemic diseases&#x2014;<italic>Diabetes Mellitus</italic>.</article-title> <source><italic>Biotechnol. Biotechnol. Equip.</italic></source> <volume>25</volume> <fpage>2183</fpage>&#x2013;<lpage>2186</lpage>. <pub-id pub-id-type="doi">10.5504/BBEQ.2011.0022</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>B. J.</given-names></name> <name><surname>Kim</surname> <given-names>J. Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Identification of type 2 diabetes risk factors using phenotypes consisting of anthropometry and triglycerides based on machine learning.</article-title> <source><italic>IEEE J. Biomed. Health Inform.</italic></source> <volume>20</volume> <fpage>39</fpage>&#x2013;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.1109/JBHI.2015.2396520</pub-id> <pub-id pub-id-type="pmid">25675467</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>B. Q.</given-names></name> <name><surname>Zheng</surname> <given-names>L. L.</given-names></name> <name><surname>Feng</surname> <given-names>K. Y.</given-names></name> <name><surname>Hu</surname> <given-names>L. L.</given-names></name> <name><surname>Huang</surname> <given-names>G. H.</given-names></name> <name><surname>Chen</surname> <given-names>L.</given-names></name></person-group> (<year>2016</year>). <article-title>Prediction of linear B-cell epitopes with mRMR feature selection and analysis.</article-title> <source><italic>Curr. Bioinform.</italic></source> <volume>11</volume> <fpage>22</fpage>&#x2013;<lpage>31</lpage>. <pub-id pub-id-type="doi">10.2174/1574893611666151119215131</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liao</surname> <given-names>Z.</given-names></name> <name><surname>Ju</surname> <given-names>Y.</given-names></name> <name><surname>Zou</surname> <given-names>Q.</given-names></name></person-group> (<year>2016</year>). <article-title>Prediction of G protein-coupled receptors with SVM-Prot features and random forest.</article-title> <source><italic>Scientifica</italic></source> <volume>2016</volume>:<issue>8309253</issue>. <pub-id pub-id-type="doi">10.1155/2016/8309253</pub-id> <pub-id pub-id-type="pmid">27529053</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liao</surname> <given-names>Z. J.</given-names></name> <name><surname>Wan</surname> <given-names>S.</given-names></name> <name><surname>He</surname> <given-names>Y.</given-names></name> <name><surname>Zou</surname> <given-names>Q.</given-names></name></person-group> (<year>2018</year>). <article-title>Classification of small GTPases with hybrid protein features and advanced machine learning techniques.</article-title> <source><italic>Curr. Bioinform.</italic></source> <volume>13</volume> <fpage>492</fpage>&#x2013;<lpage>500</lpage>. <pub-id pub-id-type="doi">10.2174/1574893612666171121162552</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liaw</surname> <given-names>A.</given-names></name> <name><surname>Wiener</surname> <given-names>M.</given-names></name></person-group> (<year>2002</year>). <article-title>Classification and regression by randomForest.</article-title> <source><italic>R. News</italic></source> <volume>2</volume> <fpage>18</fpage>&#x2013;<lpage>22</lpage>.</citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>C.</given-names></name> <name><surname>Chen</surname> <given-names>W.</given-names></name> <name><surname>Qiu</surname> <given-names>C.</given-names></name> <name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Krishnan</surname> <given-names>S.</given-names></name> <name><surname>Zou</surname> <given-names>Q.</given-names></name></person-group> (<year>2014</year>). <article-title>LibD3C: ensemble classifiers with a clustering and dynamic selection strategy.</article-title> <source><italic>Neurocomputing</italic></source> <volume>123</volume> <fpage>424</fpage>&#x2013;<lpage>435</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2013.08.004</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lonappan</surname> <given-names>A.</given-names></name> <name><surname>Bindu</surname> <given-names>G.</given-names></name> <name><surname>Thomas</surname> <given-names>V.</given-names></name> <name><surname>Jacob</surname> <given-names>J.</given-names></name> <name><surname>Rajasekaran</surname> <given-names>C.</given-names></name> <name><surname>Mathew</surname> <given-names>K. T.</given-names></name></person-group> (<year>2007</year>). <article-title>Diagnosis of diabetes mellitus using microwaves.</article-title> <source><italic>J. Electromagnet. Wave.</italic></source> <volume>21</volume> <fpage>1393</fpage>&#x2013;<lpage>1401</lpage>. <pub-id pub-id-type="doi">10.1163/156939307783239429</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mukai</surname> <given-names>Y.</given-names></name> <name><surname>Tanaka</surname> <given-names>H.</given-names></name> <name><surname>Yoshizawa</surname> <given-names>M.</given-names></name> <name><surname>Oura</surname> <given-names>O.</given-names></name> <name><surname>Sasaki</surname> <given-names>T.</given-names></name> <name><surname>Ikeda</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>A computational identification method for GPI-anchored proteins by artificial neural network.</article-title> <source><italic>Curr. Bioinform.</italic></source> <volume>7</volume> <fpage>125</fpage>&#x2013;<lpage>131</lpage>. <pub-id pub-id-type="doi">10.2174/157489312800604390</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ozcift</surname> <given-names>A.</given-names></name> <name><surname>Gulten</surname> <given-names>A.</given-names></name></person-group> (<year>2011</year>). <article-title>Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms.</article-title> <source><italic>Comput. Methods Programs Biomed.</italic></source> <volume>104</volume> <fpage>443</fpage>&#x2013;<lpage>451</lpage>. <pub-id pub-id-type="doi">10.1016/j.cmpb.2011.03.018</pub-id> <pub-id pub-id-type="pmid">21531475</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pal</surname> <given-names>M.</given-names></name></person-group> (<year>2005</year>). <article-title>Random forest classifier for remote sensing classification.</article-title> <source><italic>Int. J. Remote Sens.</italic></source> <volume>26</volume> <fpage>217</fpage>&#x2013;<lpage>222</lpage>. <pub-id pub-id-type="doi">10.1080/01431160412331269698</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Polat</surname> <given-names>K.</given-names></name> <name><surname>G&#x00FC;nes</surname> <given-names>S.</given-names></name></person-group> (<year>2007</year>). <article-title>An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease.</article-title> <source><italic>Digit. Signal Process.</italic></source> <volume>17</volume> <fpage>702</fpage>&#x2013;<lpage>710</lpage>. <pub-id pub-id-type="doi">10.1016/j.dsp.2006.09.005</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Polat</surname> <given-names>K.</given-names></name> <name><surname>Kodaz</surname> <given-names>H.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x201C;The medical applications of attribute weighted artificial immune system (AWAIS): diagnosis of heart and diabetes diseases,&#x201D; in</article-title> <source><italic>Proceedings of the 4th International Conference on Artificial Immune Systems</italic></source>, <publisher-loc>Banff</publisher-loc>.</citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Quinlan</surname> <given-names>J. R.</given-names></name></person-group> (<year>1986</year>). <article-title>Induction on decision tree.</article-title> <source><italic>Mach. Learn.</italic></source> <volume>1</volume> <fpage>81</fpage>&#x2013;<lpage>106</lpage>. <pub-id pub-id-type="doi">10.1007/BF00116251</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Quinlan</surname> <given-names>J. R.</given-names></name></person-group> (<year>1996a</year>). <article-title>&#x201C;Bagging, boosting, and C4.5,&#x201D; in</article-title> <source><italic>Proceedings of the Thirteenth National Conference on Artificial Intelligence</italic></source> (<publisher-loc>Menlo Park, CA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>725</fpage>&#x2013;<lpage>730</lpage>.</citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Quinlan</surname> <given-names>J. R.</given-names></name></person-group> (<year>1996b</year>). <article-title>Improved use of continuous attributes in C4.5.</article-title> <source><italic>J. Artif. Intell. Res.</italic></source> <volume>4</volume> <fpage>77</fpage>&#x2013;<lpage>90</lpage>. <pub-id pub-id-type="doi">10.1613/jair.279</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Razavian</surname> <given-names>N.</given-names></name> <name><surname>Blecker</surname> <given-names>S.</given-names></name> <name><surname>Schmidt</surname> <given-names>A. M.</given-names></name> <name><surname>Smith-McLallen</surname> <given-names>A.</given-names></name> <name><surname>Nigam</surname> <given-names>S.</given-names></name> <name><surname>Sontag</surname> <given-names>D.</given-names></name></person-group> (<year>2015</year>). <article-title>Population-level prediction of type 2 diabetes from claims data and analysis of risk factors.</article-title> <source><italic>Big Data</italic></source> <volume>3</volume> <fpage>277</fpage>&#x2013;<lpage>287</lpage>. <pub-id pub-id-type="doi">10.1089/big.2015.0020</pub-id> <pub-id pub-id-type="pmid">27441408</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Refaeilzadeh</surname> <given-names>P.</given-names></name> <name><surname>Tang</surname> <given-names>L.</given-names></name> <name><surname>Liu</surname> <given-names>H.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x201C;Cross-validation,&#x201D; in</article-title> <source><italic>Encyclopedia of Database Systems</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>T&#x00D6;zsu</surname> <given-names>M.</given-names></name></person-group> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>532</fpage>&#x2013;<lpage>538</lpage>.</citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Robertson</surname> <given-names>G.</given-names></name> <name><surname>Lehmann</surname> <given-names>E. D.</given-names></name> <name><surname>Sandham</surname> <given-names>W.</given-names></name> <name><surname>Hamilton</surname> <given-names>D.</given-names></name></person-group> (<year>2011</year>). <article-title>Blood glucose prediction using artificial neural networks trained with the AIDA diabetes simulator: a proof-of-concept pilot study.</article-title> <source><italic>J. Electr. Comput. Eng.</italic></source> <volume>2011</volume>:<issue>681786</issue>. <pub-id pub-id-type="doi">10.1155/2011/681786</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sakar</surname> <given-names>C. O.</given-names></name> <name><surname>Kursun</surname> <given-names>O.</given-names></name> <name><surname>Gurgen</surname> <given-names>F.</given-names></name></person-group> (<year>2012</year>). <article-title>A feature selection method based on kernel canonical correlation analysis and the minimum redundancy-maximum relevance filter method.</article-title> <source><italic>Expert Syst. Appl.</italic></source> <volume>39</volume> <fpage>3432</fpage>&#x2013;<lpage>3437</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2011.09.031</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Salzberg</surname> <given-names>S. L.</given-names></name></person-group> (<year>1994</year>). <article-title>C4.5: programs for machine learning by J. Ross Quinlan. Morgan Kaufmann publishers, Inc., 1993.</article-title> <source><italic>Mach. Learn.</italic></source> <volume>16</volume> <fpage>235</fpage>&#x2013;<lpage>240</lpage>.</citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>S.</given-names></name> <name><surname>Agrawal</surname> <given-names>J.</given-names></name> <name><surname>Sharma</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>Classification through machine learning technique: C4.5 algorithm based on various entropies.</article-title> <source><italic>Int. J. Comput. Appl.</italic></source> <volume>82</volume> <fpage>28</fpage>&#x2013;<lpage>32</lpage>.</citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Smith</surname> <given-names>L. I.</given-names></name></person-group> (<year>2002</year>). <article-title>A tutorial on principal components analysis.</article-title> <source><italic>Inform. Fusion</italic></source> <volume>51</volume>:<issue>52</issue>.</citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Su</surname> <given-names>Z. D.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>Z. Y.</given-names></name> <name><surname>Zhao</surname> <given-names>Y. W.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Chen</surname> <given-names>W.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC.</article-title> <source><italic>Bioinformatics</italic></source> <pub-id pub-id-type="doi">10.1093/bioinformatics/bty508</pub-id> <comment>[Epub ahead of print]</comment>. <pub-id pub-id-type="pmid">29931187</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Svetnik</surname> <given-names>V.</given-names></name> <name><surname>Liaw</surname> <given-names>A.</given-names></name> <name><surname>Tong</surname> <given-names>C.</given-names></name> <name><surname>Culberson</surname> <given-names>J. C.</given-names></name> <name><surname>Sheridan</surname> <given-names>R. P.</given-names></name> <name><surname>Feuston</surname> <given-names>B. P.</given-names></name></person-group> (<year>2003</year>). <article-title>Random forest: a classification and regression tool for compound classification and QSAR modeling.</article-title> <source><italic>J. Chem. Inf. Comput. Sci.</italic></source> <volume>43</volume> <fpage>1947</fpage>&#x2013;<lpage>1958</lpage>. <pub-id pub-id-type="doi">10.1021/ci034160g</pub-id> <pub-id pub-id-type="pmid">14632445</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Zhao</surname> <given-names>Y. W.</given-names></name> <name><surname>Zou</surname> <given-names>P.</given-names></name> <name><surname>Zhang</surname> <given-names>C. M.</given-names></name> <name><surname>Chen</surname> <given-names>R.</given-names></name> <name><surname>Huang</surname> <given-names>P.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>HBPred: a tool to identify growth hormone-binding proteins.</article-title> <source><italic>Int. J. Biol. Sci.</italic></source> <volume>14</volume> <fpage>957</fpage>&#x2013;<lpage>964</lpage>. <pub-id pub-id-type="doi">10.7150/ijbs.24174</pub-id> <pub-id pub-id-type="pmid">29989085</pub-id></citation></ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>W.</given-names></name> <name><surname>Wan</surname> <given-names>S.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name> <name><surname>Teschendorff</surname> <given-names>A. E.</given-names></name> <name><surname>Zou</surname> <given-names>Q.</given-names></name></person-group> (<year>2018</year>). <article-title>Tumor origin detection with tissue-specific miRNA and DNA methylation markers.</article-title> <source><italic>Bioinformatics</italic></source> <volume>34</volume> <fpage>398</fpage>&#x2013;<lpage>406</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btx622</pub-id> <pub-id pub-id-type="pmid">29028927</pub-id></citation></ref>
<ref id="B49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>S. P.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Cai</surname> <given-names>Y. D.</given-names></name></person-group> (<year>2018</year>). <article-title>Analysis and prediction of nitrated tyrosine sites with the mRMR method and support vector machine algorithm.</article-title> <source><italic>Curr. Bioinform.</italic></source> <volume>13</volume> <fpage>3</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.2174/1574893611666160608075753</pub-id></citation></ref>
<ref id="B50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Paliwal</surname> <given-names>K. K.</given-names></name></person-group> (<year>2003</year>). <article-title>Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition.</article-title> <source><italic>Pattern Recogn.</italic></source> <volume>36</volume> <fpage>2429</fpage>&#x2013;<lpage>2439</lpage>. <pub-id pub-id-type="doi">10.1016/S0031-3203(03)00044-X</pub-id></citation></ref>
<ref id="B51"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Watkins</surname> <given-names>A. B.</given-names></name> <name><surname>Boggess</surname> <given-names>L.</given-names></name></person-group> (<year>2002</year>). <article-title>&#x201C;A resource limited artificial immune classifier,&#x201D; in</article-title> <source><italic>Proceedings of the 2002 Congress on Evolutionary Computation (CEC2002)</italic></source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE Press</publisher-name>), <fpage>926</fpage>&#x2013;<lpage>931</lpage>. <pub-id pub-id-type="doi">10.1109/CEC.2002.1007049</pub-id></citation></ref>
<ref id="B52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>L.</given-names></name> <name><surname>Xing</surname> <given-names>P.</given-names></name> <name><surname>Shi</surname> <given-names>G.</given-names></name> <name><surname>Ji</surname> <given-names>Z. L.</given-names></name> <name><surname>Zou</surname> <given-names>Q.</given-names></name></person-group> (<year>2018</year>). <article-title>Fast prediction of protein methylation sites using a sequence-based feature selection technique.</article-title> <source><italic>IEEE/ACM Trans. Comput. Biol. Bioinform.</italic></source> <pub-id pub-id-type="doi">10.1109/TCBB.2017.2670558</pub-id> <comment>[Epub ahead of print]</comment>. <pub-id pub-id-type="pmid">28222000</pub-id></citation></ref>
<ref id="B53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>H.</given-names></name> <name><surname>Qiu</surname> <given-names>W. R.</given-names></name> <name><surname>Liu</surname> <given-names>G.</given-names></name> <name><surname>Guo</surname> <given-names>F. B.</given-names></name> <name><surname>Chen</surname> <given-names>W.</given-names></name> <name><surname>Chou</surname> <given-names>K. C.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>iRSpot-Pse6NC: identifying recombination spots in <italic>Saccharomyces cerevisiae</italic> by incorporating hexamer composition into general PseKNC.</article-title> <source><italic>Int. J. Biol. Sci.</italic></source> <volume>14</volume> <fpage>883</fpage>&#x2013;<lpage>891</lpage>. <pub-id pub-id-type="doi">10.7150/ijbs.24616</pub-id> <pub-id pub-id-type="pmid">29989083</pub-id></citation></ref>
<ref id="B54"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>H.</given-names></name> <name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Chen</surname> <given-names>X. X.</given-names></name> <name><surname>Zhang</surname> <given-names>C. J.</given-names></name> <name><surname>Zhu</surname> <given-names>P. P.</given-names></name> <name><surname>Ding</surname> <given-names>H.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Identification of secretory proteins in <italic>Mycobacterium tuberculosis</italic> using pseudo amino acid composition.</article-title> <source><italic>Biomed. Res. Int.</italic></source> <volume>2016</volume>:<issue>5413903</issue>. <pub-id pub-id-type="doi">10.1155/2016/5413903</pub-id> <pub-id pub-id-type="pmid">27597968</pub-id></citation></ref>
<ref id="B55"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>You</surname> <given-names>Y.</given-names></name> <name><surname>Cai</surname> <given-names>H. M.</given-names></name> <name><surname>Chen</surname> <given-names>J. Z.</given-names></name></person-group> (<year>2018</year>). <article-title>Low rank representation and its application in bioinformatics.</article-title> <source><italic>Curr. Bioinform.</italic></source> <volume>13</volume> <fpage>508</fpage>&#x2013;<lpage>517</lpage>. <pub-id pub-id-type="doi">10.2174/1574893612666171121155347</pub-id></citation></ref>
<ref id="B56"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Yue</surname> <given-names>C.</given-names></name> <name><surname>Xin</surname> <given-names>L.</given-names></name> <name><surname>Kewen</surname> <given-names>X.</given-names></name> <name><surname>Chang</surname> <given-names>S.</given-names></name></person-group> (<year>2008</year>). <article-title>&#x201C;An intelligent diagnosis to type 2 diabetes based on QPSO algorithm and WLS-SVM,&#x201D; in</article-title> <source><italic>Proceedings of the 2008 IEEE International Symposium on Intelligent Information Technology Application Workshops</italic></source>, <publisher-loc>Washington, DC</publisher-loc>. <pub-id pub-id-type="doi">10.1109/IITA.Workshops.2008.36</pub-id></citation></ref>
<ref id="B57"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Zou</surname> <given-names>Q.</given-names></name> <name><surname>Liu</surname> <given-names>B.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name></person-group> (<year>2014</year>). <article-title>Exploratory predicting protein folding model with random forest and hybrid features.</article-title> <source><italic>Curr. Proteom.</italic></source> <volume>11</volume> <fpage>289</fpage>&#x2013;<lpage>299</lpage>. <pub-id pub-id-type="doi">10.2174/157016461104150121115154</pub-id></citation></ref>
<ref id="B58"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zou</surname> <given-names>Q.</given-names></name> <name><surname>Wan</surname> <given-names>S.</given-names></name> <name><surname>Ju</surname> <given-names>Y.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Zeng</surname> <given-names>X.</given-names></name></person-group> (<year>2016a</year>). <article-title>Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy.</article-title> <source><italic>BMC Syst. Biol.</italic></source> <volume>10(Suppl. 4)</volume>:<issue>114</issue>. <pub-id pub-id-type="doi">10.1186/s12918-016-0353-5</pub-id> <pub-id pub-id-type="pmid">28155714</pub-id></citation></ref>
<ref id="B59"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zou</surname> <given-names>Q.</given-names></name> <name><surname>Zeng</surname> <given-names>J.</given-names></name> <name><surname>Cao</surname> <given-names>L.</given-names></name> <name><surname>Ji</surname> <given-names>R.</given-names></name></person-group> (<year>2016b</year>). <article-title>A novel features ranking metric with application to scalable visual and bioinformatics data classification.</article-title> <source><italic>Neurocomputing</italic></source> <volume>173</volume> <fpage>346</fpage>&#x2013;<lpage>354</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2014.12.123</pub-id></citation></ref>
</ref-list>
</back>
</article>