Feature Selection Using Approximate Conditional Entropy Based on Fuzzy Information Granule for Gene Expression Data Classification

Classification is widely used in gene expression data analysis. Feature selection is usually performed before classification because of the large number of genes and the small sample size in gene expression data. In this article, a novel feature selection algorithm using approximate conditional entropy based on fuzzy information granules is proposed, and the soundness of the method is established by proving the monotonicity of the entropy. Firstly, the fuzzy relation matrix is established by the Laplacian kernel. Secondly, the approximately equal relation on fuzzy sets is defined. Then, the approximate conditional entropy based on fuzzy information granules and the importance of internal attributes are defined. Approximate conditional entropy can measure the uncertainty of knowledge from the two different perspectives of information theory and algebra. Finally, a greedy algorithm based on the approximate conditional entropy is designed for feature selection. Experimental results on six large-scale gene data sets show that our algorithm not only greatly reduces the dimensionality of the data sets, but also outperforms five state-of-the-art algorithms in terms of classification accuracy.


INTRODUCTION
The development of DNA microarray technology has brought about a large amount of gene expression data. Analyzing and mining the knowledge behind these data is a hot topic in bioinformatics (Sun et al., 2019b). As the most basic data mining method, classification is widely used in the analysis of gene expression data. Due to the small sample size and high dimensionality of gene expression data, traditional classification methods are often ineffective when applied to such data directly (Fu and Wang, 2003; Mitra et al., 2011; Phan et al., 2012; Konstantina et al., 2015). It has become a consensus in the academic community to reduce the dimensionality before classification. Feature selection is the most widely used dimensionality reduction method for gene expression data because it maintains the biological meaning of each feature. Feature selection can not only reduce the time and space complexity of classification learning algorithms, avoid the curse of dimensionality, and improve classification accuracy, but also help to explain biological phenomena.
Feature selection methods are generally divided into three categories: filter, wrapper, and embedded methods (Hu et al., 2018). The filter method obtains the optimal feature subset by judging the relevance between the features and the objective function based on the statistical characteristics of the data. The wrapper method uses a specific model to carry out multiple rounds of training; after each round, several features are removed according to the score of the objective function, and the next round of training is carried out on the new feature set. This recursion is repeated until the number of remaining features reaches the required number. The embedded method first uses a machine learning algorithm to obtain a weight coefficient for each feature, and then selects features in descending order of these coefficients. Wrapper and embedded methods have a heavy computational burden and are not suitable for large-scale gene data sets. Our feature selection method belongs to the filter category: a heuristic search algorithm finds an optimal feature subset using approximate conditional entropy based on fuzzy information granules for gene expression data classification.
Attribute reduction is a fundamental research topic and an important application of granular computing (Dong et al., 2018; Wang et al., 2019), and it can be used for feature selection. Granular computing is a new concept and computing paradigm of information processing, mainly used to deal with fuzzy and uncertain information (Qian et al., 2011). Pawlak (1982) proposed rough set theory, a mathematical tool for dealing with fuzziness and uncertainty; granular computing is one of its important research directions. Being based on equivalence relations, classical rough set theory is only suitable for dealing with the discrete data widely existing in real life. When attribute reduction is applied to continuous data in classical rough set theory, discretization is often used to convert continuous data into discrete data, but discretization inevitably leads to information loss (Dai and Xu, 2012). To overcome this drawback, Hu et al. proposed a neighborhood rough set model (Hu et al., 2008, 2011). Using the neighborhood rough set model to select attributes from a decision table containing continuous data preserves classification ability well and requires no discretization. The existing neighborhood rough set attribute reduction methods are based on the perspective of either algebra or information theory. The definition of attribute significance based on algebra theory only describes the influence of attributes on the definite classification subsets contained in the universe, whereas the definition based on information theory only describes their influence on the uncertain classification subsets; a single perspective is not comprehensive (Jiang et al., 2015). Zadeh (1979) proposed the concept of information granulation based on fuzzy set theory.
Objects in the universe are granulated into a set of fuzzy information granules by a fuzzy-binary relation (Tsang et al., 2008;Jensen and Shen, 2009).
In this article, a heuristic feature selection algorithm based on fuzzy information granules and approximate conditional entropy is designed to improve the classification performance of gene expression data sets. The experimental results for several gene expression data sets show that the proposed algorithm can find optimal reduction sets with few genes and high classification accuracy.
The remainder of this article is organized as follows. Section "Materials and Methods" gives the gene expression datasets for the experiment and our feature selection algorithm. Section "Experimental Results and Analysis" shows and analyzes the experimental results. Section "Conclusion and Discussion" summarizes this study and discusses future research focus.

Gene Expression Data Sets
The following six gene expression datasets are used in this article.
(1) Leukemia1 dataset consists of 7129 genes and 72 samples with two subtypes: patients and healthy people (Sun et al., 2019a).
The six gene expression datasets are summarized in Table 1.

Fuzzy Sets and Fuzzy-Binary Relation
Let U = {x_1, x_2, ..., x_n} be a nonempty finite set, called a universe, let I = [0, 1], and let I^U denote all fuzzy sets on U. Fuzzy sets are regarded as extensions of classical sets (Zadeh, 1965).
F is a fuzzy set on U if F : U → I, i.e., F assigns each x ∈ U a membership degree F(x) ∈ [0, 1]. Fuzzy-binary relations are fuzzy sets on the product of two universes; I^(U×U) denotes all fuzzy-binary relations on U × U.
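As a minimal illustration of the mapping F : U → I, a fuzzy set can be represented as a membership map; the universe elements and degrees below are hypothetical, not from the article.

```python
# A fuzzy set on a universe U is just a membership map F: U -> I = [0, 1].
F = {"x1": 0.2, "x2": 0.9, "x3": 1.0}

def is_fuzzy_set(F):
    """Check that every membership degree lies in the unit interval I = [0, 1]."""
    return all(0.0 <= m <= 1.0 for m in F.values())
```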

Information Systems and Rough Sets
Definition 2.1. Let U be a set of objects and A a set of attributes. Suppose that U and A are finite sets. If each attribute a ∈ A determines an information function a : U → V_a, where V_a is the set of function values of attribute a, then the pair (U, A) is called an information system. Moreover, if A = C ∪ D, where C is a condition attribute set and D is a decision attribute set, then the pair (U, A) is called a decision information system.
If (U, A) is an information system and P ⊆ A, then an equivalence relation (or indiscernibility relation) ind(P) can be defined by (x, y) ∈ ind(P) ⇔ ∀a ∈ P, a(x) = a(y).
Usually, [x] ind(P) and U/ind(P) are briefly denoted by [x] P and U/P, respectively.
According to rough set theory, for P ⊆ A, a subset X ⊆ U is characterized by P̲(X) and P̄(X), where P̲(X) = {x ∈ U | [x]_P ⊆ X} and P̄(X) = {x ∈ U | [x]_P ∩ X ≠ ∅}. P̲(X) and P̄(X) are referred to as the lower and upper approximations of X, respectively.
X is crisp if P̲(X) = P̄(X) and X is rough if P̲(X) ≠ P̄(X).
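The classical (crisp) constructions above can be sketched directly: partition U by the indiscernibility relation ind(P), then collect the blocks inside X and the blocks touching X. The toy decision-table layout (a dict of attribute-value maps) is an assumption of this sketch, not a structure from the article.

```python
from collections import defaultdict

def partition(U, P, value):
    """U/P: equivalence classes of ind(P), where (x, y) in ind(P)
    iff a(x) == a(y) for every attribute a in P."""
    classes = defaultdict(set)
    for x in U:
        classes[tuple(value[a][x] for a in P)].add(x)
    return list(classes.values())

def lower_upper(U, P, value, X):
    """Pawlak lower and upper approximations of X under ind(P)."""
    lower, upper = set(), set()
    for block in partition(U, P, value):
        if block <= X:      # [x]_P entirely contained in X
            lower |= block
        if block & X:       # [x]_P intersects X
            upper |= block
    return lower, upper
```

For instance, with four objects split into blocks {1, 2} and {3, 4} by a single attribute, X = {1, 2, 3} has lower approximation {1, 2} and upper approximation {1, 2, 3, 4}, so X is rough.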

The Approximately Equal Relation on Fuzzy Sets
Given F, G ∈ I^U and x ∈ U, F(x) ∈ [0, 1] and G(x) ∈ [0, 1] are the membership degrees of x in the fuzzy sets F and G, respectively. In practice, it is very difficult to ensure that the equation F(x) = G(x) holds exactly. For this reason, we propose the following approximately equal relation on fuzzy sets.

Definition 2.2. Given F, G ∈ I^U, F and G are said to be approximately equal on U if their membership degrees coincide at every x ∈ U up to a prescribed tolerance.

Definition 2.3. Given a fuzzy relation R ∈ I^(U×U) and x ∈ U, x_R is referred to as the fuzzy set in which R(x, a) is the membership degree of each a ∈ U with respect to x.

Definition 2.4. [x]_R is referred to as the fuzzy equal class of x induced by the fuzzy relation R on U.

Definition 2.5. [x_i]_R (i = 1, 2, ..., |U|) is named the fuzzy information granule induced by the fuzzy relation R on U.

Definition 2.6. K(R) = {[x_1]_R, [x_2]_R, ..., [x_n]_R} is referred to as the fuzzy-binary granular structure of the universe U induced by R.
Fuzzy-Binary Relation Based on Laplacian Kernel
Hu et al. (2010) found that there are some relationships between rough sets and the Gaussian kernel method, so the Gaussian kernel is often used to obtain fuzzy relations. Compared with the Gaussian kernel, the Laplacian kernel has a higher peak, decays faster near the origin, and has a heavier tail. Therefore, the Laplacian kernel describes the similarity between objects better than the Gaussian kernel. In this article, we use the Laplacian kernel k(x_i, x_j) = exp(−||x_i − x_j|| / σ) to extract the similarity between two objects from a decision information system, where ||x_i − x_j|| is the Euclidean distance between objects x_i and x_j. In general, σ is a given positive value.
Obviously, k(x_i, x_j) satisfies reflexivity (k(x_i, x_i) = 1) and symmetry (k(x_i, x_j) = k(x_j, x_i)). Let R = (r_ij)_{n×n} with r_ij = k(x_i, x_j); then R is called the fuzzy relation matrix induced by the Laplacian kernel.
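The fuzzy relation matrix can be computed in a few lines; a sketch with NumPy, assuming the Euclidean-distance Laplacian kernel described above:

```python
import numpy as np

def laplacian_relation(X, sigma):
    """Fuzzy relation matrix R = (r_ij) with r_ij = exp(-||x_i - x_j|| / sigma),
    where ||.|| is the Euclidean distance between samples (rows of X)."""
    diff = X[:, None, :] - X[None, :, :]    # pairwise differences, shape (n, n, d)
    dist = np.linalg.norm(diff, axis=2)     # Euclidean distance matrix, shape (n, n)
    return np.exp(-dist / sigma)
```

The resulting matrix has ones on the diagonal (reflexivity) and is symmetric, as required of a fuzzy similarity relation.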

Feature Selection Using Approximate Conditional Entropy Based on Fuzzy Information Granule
Approximate Accuracy and Approximate Conditional Entropy
Definition 2.7. Given a decision information system (U, C ∪ D), ∀X ⊆ U, X ≠ ∅, the approximate accuracy of X under an attribute subset B ⊆ C is defined as

a_B(X) = |R̲_B(X)| / |R̄_B(X)|,

where |·| denotes the cardinality of a set. Obviously, 0 ≤ a_B(X) ≤ 1.

Definition 2.8. Given a decision information system (U, C ∪ D), ∀B ⊆ C, let [x]_{R_B} be the fuzzy information granule of object x under B and let {X_1, X_2, ..., X_k} be the partition of U derived from D. The conditional entropy of D relative to B is defined as

H(D/B) = − Σ_{i=1}^{|U|} Σ_{j=1}^{k} (|[x_i]_{R_B} ∩ X_j| / |[x_i]_{R_B}|) log (|[x_i]_{R_B} ∩ X_j| / |[x_i]_{R_B}|),

where R_B denotes the fuzzy relation based on attribute set B and log is the base-2 logarithm.

The approximate accuracy effectively measures the imprecision of a set caused by its boundary region, while the conditional entropy effectively measures the knowledge uncertainty caused by information granularity. We combine the two to propose the approximate conditional entropy.

Definition 2.9. Let (U, C ∪ D) be a decision information system, ∀B ⊆ C, let [x]_{R_B} be the fuzzy information granule of object x under B, {X_1, X_2, ..., X_k} the partition of U derived from D, and a_B(X_j) the approximate accuracy of X_j under R_B. The approximate conditional entropy of D relative to B is defined as

H_ace(D/B) = − Σ_{i=1}^{|U|} Σ_{j=1}^{k} (1 − a_B(X_j)) (|[x_i]_{R_B} ∩ X_j| / |[x_i]_{R_B}|) log (|[x_i]_{R_B} ∩ X_j| / |[x_i]_{R_B}|).

Theorem 2.1. Let (U, C ∪ D) be a decision information system and ∀B ⊆ C, with [x]_{R_B} the fuzzy information granule of object x under B. Then (1) H_ace(D/B) attains the maximum value |U| log |U| if and only if [x_i]_{R_B} = U (i = 1, 2, ..., n) and |X_j| = 1 (j = 1, 2, ..., k; k = n). Proof.
The converse is also true.
Theorem 2.2. Let (U, C ∪ D) be a decision information system. If M ⊆ L ⊆ C, then H_ace(D/M) ≥ H_ace(D/L).
Proof. Since M ⊆ L ⊆ C, we have R̲_M(X) ⊆ R̲_L(X) and R̄_M(X) ⊇ R̄_L(X). Then a_M(X) ≤ a_L(X) according to Definition 2.7. By M ⊆ L and U/D = {X_1, X_2, ..., X_k}, we have [x]_{R_L} ⊆ [x]_{R_M} for every x ∈ U. Consequently, H_ace(D/M) ≥ H_ace(D/L) according to Definition 2.9.
Theorem 2.2 shows that H ace (D/B) decreases monotonically with the increase of the number of attributes in B, which is very important for constructing forward greedy algorithm of attributes reduction.
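As a crisp illustration of the approximate accuracy of Definition 2.7, here is a sketch assuming the standard Pawlak form a(X) = |lower approximation| / |upper approximation|; the fuzzy case replaces set cardinality with fuzzy cardinality (the sum of membership degrees).

```python
def approx_accuracy(blocks, X):
    """a(X) = |lower| / |upper| for a nonempty X, where `blocks` is a crisp
    granulation of the universe (a list of disjoint sets covering U)."""
    lower = {x for b in blocks if b <= X for x in b}   # blocks inside X
    upper = {x for b in blocks if b & X for x in b}    # blocks touching X
    return len(lower) / len(upper)
```

Note that refining the granulation never decreases the accuracy, which mirrors the monotonicity used in Theorem 2.2.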
Definition 2.10. Let (U, C ∪ D) be a decision information system and B ⊆ C. If H_ace(D/B) = H_ace(D/C) and H_ace(D/(B − {b})) > H_ace(D/C) for every b ∈ B, then B is called a reduction of C relative to D.
The first condition guarantees that the selected attribute subset has the same amount of information as the whole attribute set. The second condition guarantees that there is no redundancy in the attribute reduction set.
Definition 2.11. Assume that (U, C ∪ D) is a decision information system. For ∀c ∈ C, define the indicator IIA(c, C, D) = H_ace(D/(C − {c})) − H_ace(D/C); then IIA(c, C, D) is called the importance of internal attribute of c in C relative to D.
Definition 2.12. Assume that (U, C ∪ D) is a decision information system. For ∀c ∈ C, if IIA(c, C, D) > 0, then attribute c is called a core attribute of C relative to D.
Definition 2.13. Assume that (U, C ∪ D) is a decision information system, B ⊆ C, and ∀d ∈ C − B. Define the indicator IEA(d, B, D) = H_ace(D/B) − H_ace(D/(B ∪ {d})); then IEA(d, B, D) is called the importance of external attribute of d relative to B and D.
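The two importance indicators are simple entropy differences; a sketch assuming an entropy oracle `H` that maps an attribute subset (a frozenset) to H_ace(D/B), with the names `iia`/`iea` chosen here for illustration.

```python
def iia(c, C, H):
    """Importance of internal attribute: how much the approximate conditional
    entropy rises when c is removed from the full condition attribute set C."""
    return H(frozenset(C) - {c}) - H(frozenset(C))

def iea(d, B, H):
    """Importance of external attribute: how much the approximate conditional
    entropy drops when d is added to the current subset B."""
    return H(frozenset(B)) - H(frozenset(B) | {d})
```

By the monotonicity of Theorem 2.2, both indicators are nonnegative; a core attribute is one with a strictly positive internal importance.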

Feature Selection Algorithm Using Approximate Conditional Entropy
In this article, a novel feature selection algorithm using approximate conditional entropy (FSACE) is proposed and described as follows.
Input: A decision information system (U, C ∪ D) and σ.
Output:A selected gene subset B.
Step 4. If B = φ, then turn to step 5. If B ≠ φ, compute H_ace(D/B). If H_ace(D/B) = H_ace(D/C), then turn to step 6; otherwise, turn to step 5.
Step 5. For each d ∈ C − B, compute the importance of external attribute of d relative to B and D, add the attribute with the maximum value to B, and turn to step 4.
Step 6. Output the selected gene subset B; the algorithm ends.
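The forward greedy loop can be sketched as follows, assuming an entropy oracle `H` mapping an attribute frozenset to H_ace(D/B); this sketch omits the core computation and the redundancy check of Definition 2.10, so it is an outline of the search, not the full FSACE.

```python
def fsace(C, H, core=frozenset()):
    """Forward greedy selection: start from the core attributes, repeatedly
    add the attribute whose addition most reduces H_ace(D/B), and stop once
    the selected subset reaches the entropy of the full attribute set C."""
    C = frozenset(C)
    target = H(C)
    B = set(core)
    while not B or H(frozenset(B)) != target:
        # importance of external attribute: entropy drop when d joins B
        d = max(C - B, key=lambda d: H(frozenset(B)) - H(frozenset(B | {d})))
        B.add(d)
    return B
```

Each iteration evaluates H_ace once per remaining attribute, which is where the O(|U|^2 |C|^2) overall cost of the algorithm comes from.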

EXPERIMENTAL RESULTS AND ANALYSIS
All experiments are performed on a personal computer running Windows 10 with an Intel(R) Core(TM) i7-4790 CPU operating at 3.60 GHz with 8 GB memory using MATLAB R2019a. The classifiers (KNN, CART, and SVM) are selected to verify the classification accuracy, where the parameter k = 3 in KNN and Gaussian kernel function is selected in SVM. Other parameters of the three algorithms are the default values of the software.

Influence of Different Values of σ on Classification Performance
In this part, the classification accuracy of different Laplacian kernel parameters values of σ is tested. For gene expression data, feature selection aims to improve classification accuracy by eliminating redundant genes. The different values of σ influence the size of granulated gene data, which affects the classification accuracy of selected genes. Therefore, the different values of σ should be set in the process of feature selection of gene expression data sets. Moreover, the different values of σ also affect the composition of the selected gene subset. To obtain a suitable σ and a good gene subset, the classification accuracy of the selected gene subset for different values of σ should be discussed in detail.
The corresponding experiments are performed to graphically illustrate the classification accuracy of FSACE under different values of σ. The results are shown in Figure 1, where the horizontal axis denotes σ ∈ [0.05, 1] at intervals of 0.05, and the vertical axis represents the classification accuracy. Figure 1 shows that σ greatly influences the classification performance of FSACE; σ is therefore set to the value that maximizes classification accuracy. From Figures 1A-F, the classification accuracy is highest at σ = 0.95 for Leukemia1, σ = 0.55 for Leukemia2, σ = 0.80 for Brain Tumor, σ = 0.75 for 9-tumors, σ = 0.60 for Robert, and σ = 0.75 for Ting. Therefore, the appropriate values of σ for the different data sets are determined.

The Feature Selection Results and Classification Performance of FSACE
Table 2 shows the classification results of FSACE on the test data obtained with the three classifiers (KNN, CART, and SVM) under 10-fold cross-validation. Table 2 shows that FSACE not only greatly reduces the dimensionality of all six gene expression data sets, but also improves the classification accuracy.
The feature genes selected by FSACE from the six gene expression data sets are shown in Table 3.

Comparison of the Classification Performance of Several Entropy-Based Feature Selection Algorithms
To evaluate the performance of FSACE in terms of classification accuracy, FSACE algorithm is compared with several state-of-the-art feature selection algorithms, including EGGS (Chen et al., 2017), EGGS-FS (Yang et al., 2016), MEAR (Xu et al., 2009), Fisher (Saqlain et al., 2019), and Lasso (Tibshirani, 1996). According to the change trend of Fisher scores of six gene datasets, we select the top-200 genes as the reduction set for Fisher algorithm.
Tables 4-9 show the experimental results of six gene expression data sets using six different feature selection methods.
As shown in Tables 4, 5, FSACE has the highest average classification accuracy for Leukemia1 and Leukemia2, and exhibits better classification performance than the other five algorithms. As shown in Tables 6, 7, MEAR cannot work on the Brain Tumor and 9-tumors data sets; its results are denoted by the sign -. FSACE obtains the highest average classification accuracy among the five feature selection algorithms for the Brain Tumor and 9-tumors data sets. Tables 8, 9 show that MEAR still cannot work on the Robert and Ting data sets, which indicates that the algorithm is not stable. Our algorithm still has the highest classification accuracy among all the algorithms. Although the classification accuracy of our algorithm is only slightly higher than that of the Lasso algorithm, the number of attributes selected by our algorithm is much smaller than that of Lasso.
Tables 4-9 show that the average number of attributes selected by our algorithm is slightly more than that of MEAR, EGGS, and EGGS-FS, but its average classification accuracy is much higher than that of these three algorithms. Therefore, FSACE can not only effectively remove noise and redundant data from the original data, but also improve the classification accuracy of gene expression data sets.

CONCLUSION AND DISCUSSION
Firstly, the concept of approximate conditional entropy is given and its monotonicity is proved in this article. Approximate conditional entropy can describe the uncertainty of knowledge from the two aspects of the boundary region and the information granule. Then, a novel feature selection algorithm, FSACE, is proposed based on the approximate conditional entropy. Finally, the effectiveness of the proposed algorithm is verified on several gene expression data sets. Experimental results show that, compared with several state-of-the-art feature selection algorithms, the proposed algorithm can not only obtain a compact feature subset but also improve classification performance. The time complexity of FSACE is O(|U|^2 |C|^2). Because gene expression data sets usually contain a large number of genes, this time complexity is high. In addition, FSACE does not consider the interaction between attributes. Therefore, reducing the time complexity of FSACE and seeking a more efficient feature selection algorithm that considers attribute interaction are two issues that we will study in the future.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: http://portals.broadinstitute. org/cgi-bin/cancer/datasets.cgi (cancer Program Legacy Publication Resources).