MSC-CSMC: A multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints for gene expression data

Many clustering techniques have been proposed to group genes based on gene expression data. Among these methods, semi-supervised clustering techniques aim to improve clustering performance by incorporating supervisory information in the form of pairwise constraints. However, noisy constraints inevitably exist in the constraint set obtained from a practical unlabeled dataset, which degrades the performance of semi-supervised clustering. Moreover, multiple information sources are not integrated into multi-source constraints to improve clustering quality. To this end, this study proposes a new multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints (MSC-CSMC) for unlabeled gene expression data. The proposed method first uses the gene expression data and the gene ontology (GO), which describes gene annotation information, to form multi-source constraints. Then, the multi-source constraints are applied to the clustering by improving the constraint violation penalty weight in the semi-supervised clustering objective function. Furthermore, the constraints selection and cluster prototypes are put into the multi-objective evolutionary framework by adopting a mixed chromosome encoding strategy, which selects pairwise constraints suitable for clustering tasks through synergistic optimization and thereby reduces the negative influence of noisy constraints. The proposed MSC-CSMC algorithm is tested on five benchmark gene expression datasets, and the results show that it achieves superior performance.


Introduction
The rapid development of microarray technology has generated a large amount of gene expression data, and mining the inherent patterns in these massive data is a major challenge in the current bioinformatics field (Bandyopadhyay et al., 2007; Pirooznia et al., 2008). As an important unsupervised data mining method, clustering has become a powerful tool for gene expression data analysis. One of the main tasks of gene expression data clustering is to identify groups of co-expressed genes, which supports further research on gene function (Bandyopadhyay et al., 2007; Chen et al., 2019). Compared with unsupervised clustering methods, semi-supervised clustering methods use prior information, in the form of data labels or pairwise constraints, to guide the clustering process, which can effectively improve clustering performance (Wagstaff et al., 2001; Bilenko et al., 2004; Yin et al., 2010).
For semi-supervised clustering algorithms, pairwise constraints are usually used to describe whether two data points belong to the same cluster. Specifically, the must-link constraint (ML) means that two data points must be assigned to the same cluster, and the cannot-link constraint (CL) means that two data points must be assigned to different clusters. The quality of the selected pairwise constraints is of vital importance, as it significantly affects the performance of semi-supervised clustering algorithms (Grira et al., 2008; Vu et al., 2012; Masud et al., 2019; Abin and Vu, 2020). The pairwise constraints can be generated by directly using part of the known data labels (Lai et al., 2021) or by using an active learning method (Masud et al., 2019). In practice, most gene expression data are unlabeled, so it is impossible to obtain pairwise constraints based on labels. Vu et al. (2012) indicated that the generation of the pairwise constraints should mainly focus on the data samples on the cluster boundaries, which are more likely to be misclassified. To this end, Basu et al. (2004) developed an active learning method based on a farthest-first traversal scheme to obtain pairwise constraints. However, this method has been reported to be sensitive to noise (Davidson and Qi, 2008). Grira et al. (2008) proposed an active learning method that generates pairwise constraints by determining cluster boundary data using the memberships obtained by fuzzy clustering. Vu et al. (2012) identified data in sparse regions based on k-nearest neighbor graphs and constructed pairwise constraints. However, it was claimed that some pairwise constraints might not be generated by this method (Abin and Vu, 2020). Liu et al. (2018) proposed an entropy-based query strategy to select the most uncertain pairwise constraints. Abin (2018) proposed a random walk approach on the adjacency graph of data for querying informative constraints. Masud et al. (2019) used local density estimation to identify the most informative objects as pairwise constraints. Abin and Vu (2020) proposed a density tracking method that takes into account the density relationship between data and uses the information about the boundaries and skeleton of clusters to generate the pairwise constraints.
Although the above methods can automatically mine and learn the pairwise constraints of unlabeled datasets through different approaches, the obtained pairwise constraints inevitably contain noisy constraints, i.e., constraints inconsistent with the ground-truth clusters (Yin et al., 2010; Lai et al., 2021). However, the existing semi-supervised clustering algorithms are mostly based on the assumption that pairwise constraints conform to the real cluster information, and they are usually susceptible to noisy constraints. Therefore, it is necessary to implement constraints selection, in which noisy constraints are filtered out and only pairwise constraints that are beneficial for semi-supervised clustering are retained. In addition, most semi-supervised clustering algorithms based on pairwise constraints were developed for single-source constraints, i.e., the pairwise constraints are obtained only from the data itself. In real-world applications, many data also possess related domain information. For example, Gene Ontology (GO) (Ashburner et al., 2000), which describes gene products in terms of their associated biological processes, cellular components and molecular functions, can provide gene annotation information for gene expression data. In this paper, the multi-source constraints are the pairwise constraints formed by the data itself and the domain information. Compared with single-source pairwise constraints based solely on gene expression data, multi-source constraints that also incorporate gene ontology can provide more comprehensive information about the structure of gene clusters and help to guide semi-supervised clustering toward more accurate results.
To handle unlabeled gene expression data while reducing the negative impact of noisy constraints and integrating multi-source constraints, this research proposes a multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints (MSC-CSMC). First, the proposed algorithm uses gene expression data and GO information to generate multi-source pairwise constraints. Then, under the multi-objective optimization framework of the Non-dominated Sorting Genetic Algorithm-II (NSGA-II), the constraints selection and the cluster prototypes are collaboratively optimized. This selects the pairwise constraints suitable for clustering from the multi-source constraints and improves the accuracy of semi-supervised clustering of gene expression data by reducing the negative impact of noisy constraints.

Methods
In this section, the details of our proposed MSC-CSMC algorithm are described. Our proposed method consists of two parts. Firstly, multi-source pairwise constraints are generated by integrating gene expression and gene ontology (GO) information. Then, by using the improved penalty weights as well as mixed chromosome encoding strategy of cluster prototype and constraints selection, multi-objective semi-supervised clustering based on constraints selection and multi-source constraints is performed to identify co-expressed gene groups. The workflow of MSC-CSMC is shown in Figure 1.

Generation of multi-source pairwise constraints
Gene expression data and gene ontology (GO) describe gene-related information from the perspectives of mRNA abundance and gene annotation, respectively. Compared with methods that use only gene expression data, combining these two kinds of information can help to further improve the clustering accuracy of gene expression data (Giri and Saha, 2020). In this paper, we use gene expression data and gene ontology information to generate multi-source pairwise constraints for semi-supervised clustering. In view of the superior performance of the density tracking method (Abin and Vu, 2020), we use this method to generate the initial gene expression constraint set. The method consists of three steps: density estimation, density following, and constraints generation.
Frontiers in Genetics frontiersin.org
Let $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, denote a $d$-dimensional gene expression dataset with $n$ genes. The density of gene $x_i$, $Density(x_i)$, is obtained by Formula 1, where $N_b(x_i)$ is the set of the $b$ nearest genes of gene $x_i$ and $\|\cdot\|_2$ is the Euclidean distance. Based on the density in Formula 1, the density tracking method constructs density chains according to the density relationship between data. Specifically, starting from each gene $x_i$, the closest gene $x_j \in N_b(x_i)$ whose density is greater than that of $x_i$ is selected, and the relation between them is recorded as the density chain $x_i \to x_j$. The tracking then continues from gene $x_j$ in the same way until no gene has a density greater than that of the gene at the end of the chain. Consequently, the density chain $Chains(x_i)$ can be denoted as the sequence $x_i \to x_j \to \cdots \to x_e$, where $x_e$ is the gene at the end of the chain. After constructing all the density chains, the total number of times gene $x_i$ appears in all the chains is referred to as its centrality and denoted by $Centrality(x_i)$. The sum of the centralities of all genes in a density chain is used as the centrality of the density chain.
All density chains with a common endpoint are considered connected density chains, and the points belonging to them are considered to be in the same density group. In addition, the impurity of gene $x_i$, $Impurity(x_i)$, is defined from the density groups, with $|Groups|$ being the total number of groups, $Group(x_j)$ the group index of $x_j$, and $I$ the indicator function.
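The density estimation and density following steps described above can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation. The density estimate (inverse of the mean Euclidean distance to the b nearest genes) is an assumption standing in for Formula 1, and the function name `density_chains` is hypothetical. Impurity is left out because its exact definition is not reproduced here.

```python
import numpy as np

def density_chains(X, b=10):
    """Sketch of density estimation and density following.

    ASSUMPTION: density = 1 / (mean distance to the b nearest genes).
    The paper's Formula 1 may use a different expression.
    Requires b < number of genes.
    """
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a gene is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :b]    # N_b(x_i), nearest first
    mean_d = np.take_along_axis(dist, neighbors, axis=1).mean(axis=1)
    density = 1.0 / (mean_d + 1e-12)

    # next_gene[i]: closest neighbor that is denser than gene i, or -1 at a chain end
    next_gene = np.full(n, -1)
    for i in range(n):
        for j in neighbors[i]:
            if density[j] > density[i]:
                next_gene[i] = j
                break

    # Follow the links from every gene; density strictly increases, so chains end
    chains = []
    for i in range(n):
        chain = [i]
        while next_gene[chain[-1]] != -1:
            chain.append(int(next_gene[chain[-1]]))
        chains.append(chain)

    # Centrality(x_i): number of appearances of gene i over all chains
    centrality = np.zeros(n, dtype=int)
    for chain in chains:
        for g in chain:
            centrality[g] += 1
    # Centrality of a chain: sum of the centralities of its genes
    chain_centrality = np.array([centrality[c].sum() for c in chains])
    # Chains sharing an endpoint form one density group (labeled by the endpoint)
    group = np.array([c[-1] for c in chains])
    return density, chains, centrality, chain_centrality, group
```

The endpoint index is used as the group label purely for convenience; any relabeling of the groups would serve equally well.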
According to the density, impurity, density chain, and density group of the data, the density tracking method proposes three assumptions for mining informative pairwise constraints. Let Ω denote the pairwise constraint set, whose elements satisfy the following key assumptions: (1) providing feasible information about the boundary data of clusters; (2) providing feasible information about the boundary between various clusters; (3) providing feasible information about the skeleton of clusters. Among them, assumptions (1) and (3) are used to generate the must-link constraint set Ω ML , assumption (2) is used to generate the cannot-link constraint set Ω CL . With the subsets Ω ML and Ω CL , the penalization can be constructed for the cost function of the clustering. The workflow of density tracking method is given in Figure 2. The initial gene expression constraint set Ω = Ω ML ∪ Ω CL is generated as follows.
1. For each gene $x_i$, calculate $Density(x_i)$ and $Impurity(x_i)$. Construct the density chain $Chains(x_i)$ and the density group $Group(x_i)$, and obtain the centrality of each density chain. Initialize $\Omega_{ML} = \emptyset$, $\Omega_{CL} = \emptyset$;
2. Select gene $x_i$ in descending order of $Impurity(x_i)$, query the nearest neighbor gene $x_j$ that is not in its density group $Group(x_i)$, and add the pairwise constraint $(x_i, x_j)$ to the cannot-link constraint set, i.e., $\Omega_{CL} = \Omega_{CL} \cup \{(x_i, x_j)\}$;
3. Select gene $x_i$ in descending order of $Impurity(x_i)$, and find the next gene $x_j$ along its density chain $Chains(x_i)$. Let $\varepsilon > 0$ denote the density drop rate. If $Density(x_j) \ge \varepsilon \times Density(x_e)$, then add the pairwise constraint $(x_i, x_j)$ to the must-link constraint set, i.e., $\Omega_{ML} = \Omega_{ML} \cup \{(x_i, x_j)\}$;
4. Select the density chain $Chains(x_i)$ in descending order of the centrality of the density chain, start from the starting gene $x_i$, select the gene $x_j$ at an interval along the chain, and add the pairwise constraint $(x_i, x_j)$ to the must-link constraint set, i.e., $\Omega_{ML} = \Omega_{ML} \cup \{(x_i, x_j)\}$.
For a set of genes to be analyzed, each gene can be annotated with several GO terms. Thus, the functional similarity between genes can be deduced from the term similarity. In the proposed MSC-CSMC algorithm, we adopt the aggregate information content (AIC) (Song et al., 2014) to measure the semantic similarity of GO terms $t_1$ and $t_2$. Here, $T_t$ is the set of ancestors of term $t$ in the GO graph, $p(t)$ is the frequency of the term appearing in the GO database, and $IC(t) = -\log p(t)$ is the information content of term $t$. The higher the annotation frequency, the more general the information contained and the smaller the corresponding IC value. $SW(t)$ normalizes the knowledge reflected by $1/IC(t)$ and describes the semantic weight of term $t$. Consequently, the functional similarity of genes $x_i$ and $x_j$ can be obtained from the similarities between a gene and the GO terms annotating the other gene (for example, the similarity of gene $x_i$ and term $t_2$), where $ann(x_i)$ and $ann(x_j)$ represent the sets of GO terms that annotate the two genes, respectively. The cardinalities of $ann(x_i)$ and $ann(x_j)$ are denoted by $|ann(x_i)|$ and $|ann(x_j)|$, respectively.
The gene function similarity obtained through GO can also reflect the pairwise constraint relationship between genes to a certain extent. In the proposed MSC-CSMC algorithm, gene pairs with a similarity of more than 0.9 constitute the GO must-link constraint set $\Omega^*_{ML}$, and gene pairs with a similarity of less than 0.1 constitute the GO cannot-link constraint set $\Omega^*_{CL}$; together they form the GO pairwise constraint set $\Omega^* = \Omega^*_{ML} \cup \Omega^*_{CL}$. Finally, the gene expression pairwise constraint set $\Omega$ and the gene ontology pairwise constraint set $\Omega^*$ together constitute the multi-source constraints for gene clustering.
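The thresholding rule for the GO constraints can be illustrated with a short Python sketch. It assumes that a symmetric matrix of gene functional similarities has already been computed (the AIC computation itself is not shown), and the function name `go_constraints` and its parameter names are hypothetical.

```python
import numpy as np

def go_constraints(sim, ml_threshold=0.9, cl_threshold=0.1):
    """Build the GO must-link and cannot-link sets from a similarity matrix.

    sim[i, j] is assumed to hold the functional similarity of genes i and j.
    Returns two sets of index pairs (i, j) with i < j.
    """
    n = sim.shape[0]
    must_link, cannot_link = set(), set()
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > ml_threshold:      # highly similar function
                must_link.add((i, j))
            elif sim[i, j] < cl_threshold:    # clearly dissimilar function
                cannot_link.add((i, j))
    return must_link, cannot_link

# Toy example with three genes
sim = np.array([[1.00, 0.95, 0.05],
                [0.95, 1.00, 0.50],
                [0.05, 0.50, 1.00]])
ml_go, cl_go = go_constraints(sim)
```

Gene pairs with intermediate similarity, such as genes 1 and 2 in the toy matrix, contribute no GO constraint.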

Semi-supervised clustering objective functions based on multi-source constraints
At present, multi-objective optimization has gradually become a mainstream approach to gene expression data clustering, and it can achieve better clustering results on gene expression data than single-objective optimization methods. In the unsupervised multi-objective clustering of gene expression data, the cluster validity indices $J_{FCM}$ (Bezdek et al., 1981) and XB (Xie and Beni, 1991), which measure the intra-cluster compactness and inter-cluster separation, respectively, are commonly used as objective functions to evolve the decision variables based on two conflicting objectives (Bandyopadhyay et al., 2007; Maulik et al., 2009; Mukhopadhyay et al., 2013). In this paper, the proposed MSC-CSMC algorithm uses XB and a function based on quadratic-regularized fuzzy c-means with a constraint violation penalty, namely $J_P$ (Mei, 2019), as the objective functions. Furthermore, the constraint violation penalty weights in $J_P$ are improved to achieve semi-supervised clustering of gene expression data based on the multi-source constraints in the NSGA-II framework. The objective functions XB and $J_P$ are given by Formulas 5 and 6, where $v_c$ is the $c$th cluster prototype, $k$ is the number of clusters, and the parameters $\eta$ and $\beta$ control the level of fuzziness and the contribution of the penalty term during clustering, respectively. $u_{ic}$ is the membership degree of the datum $x_i$ belonging to the $c$th cluster, obtained by Formulas 7-9, where $w_{ij} \in W$ is the penalty weight for violating the pairwise constraint $(x_i, x_j)$. In order to consider both the gene expression constraint set $\Omega = \Omega_{ML} \cup \Omega_{CL}$ and the gene ontology constraint set $\Omega^* = \Omega^*_{ML} \cup \Omega^*_{CL}$, that is, the multi-source constraints proposed in this paper, we improve the constraint violation penalty weights through the following analysis: (1) if the pairwise constraint $(x_i, x_j)$ exists in both $\Omega_{ML}$ and $\Omega^*_{ML}$, or in both $\Omega_{CL}$ and $\Omega^*_{CL}$, the same category information of the gene pair $(x_i, x_j)$ is obtained from gene expression and gene annotation, so the weight of violating this constraint should be increased; (2) if the pairwise constraint $(x_i, x_j)$ exists in both $\Omega_{ML}$ and $\Omega^*_{CL}$, or in both $\Omega_{CL}$ and $\Omega^*_{ML}$, it should be regarded as a contradictory constraint and removed from the constraint sets $\Omega$ and $\Omega^*$. Based on this idea, the MSC-CSMC algorithm improves the constraint violation penalty weight as given by Formula 10, with $\theta > 0$ being the GO action parameter. The improved penalty weights integrate the gene expression and Gene Ontology information and provide a reasonable violation penalty for pairwise constraints in semi-supervised clustering.

FIGURE 2
Workflow of density tracking method.
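The two merging rules (reinforce agreeing constraints, remove contradictory ones) can be sketched in Python as follows. The concrete weight values are an assumption made for illustration only, since Formula 10 is not reproduced here: a weight of 1 for a constraint found only in the expression set, `theta` for one found only in the GO set, and `1 + theta` when both sources agree. The function name `merge_constraints` is hypothetical.

```python
def merge_constraints(ml, cl, ml_go, cl_go, theta=0.5):
    """Combine expression and GO constraints into one weighted dictionary.

    ASSUMPTION (not taken from the paper): weights are 1 (expression only),
    theta (GO only) and 1 + theta (both sources agree).
    Returns {(i, j): (kind, weight)} with kind in {"ML", "CL"} and i < j.
    """
    def normalize(pairs):
        # Treat (i, j) and (j, i) as the same constraint
        return {tuple(sorted(p)) for p in pairs}

    ml, cl, ml_go, cl_go = map(normalize, (ml, cl, ml_go, cl_go))

    # Rule (2): a pair that is must-link in one source and cannot-link
    # in the other is contradictory and is dropped from both sets
    contradictory = (ml & cl_go) | (cl & ml_go)

    weighted = {}
    for kind, expr, go in (("ML", ml, ml_go), ("CL", cl, cl_go)):
        for pair in (expr | go) - contradictory:
            # Rule (1): agreement between the two sources raises the weight
            weight = (1.0 if pair in expr else 0.0) + (theta if pair in go else 0.0)
            weighted[pair] = (kind, weight)
    return weighted

# Toy example: (0, 1) is confirmed by both sources, (2, 3) and (6, 7) conflict
weights = merge_constraints(
    ml={(0, 1), (2, 3)}, cl={(4, 5), (6, 7)},
    ml_go={(1, 0), (6, 7)}, cl_go={(2, 3), (8, 9)},
    theta=0.5,
)
```

Whatever the exact form of the weight, the structural part of the sketch (set intersections to detect agreement and contradiction) follows directly from rules (1) and (2).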

Mixed chromosome encoding strategy used in MSC-CSMC
To co-optimize the constraints selection and clustering in the process of multi-objective evolution, a mixed encoding strategy combining the constraints selection and the cluster prototypes is adopted, as shown in Figure 3. Let $P$ denote the genetic population, $N$ the population size, and $s$ the number of pairwise constraints to be selected. Considering the existence of noisy constraints in the initial pairwise constraint set, and to improve the search efficiency of the algorithm, $2s$ constraints are randomly selected from the initial pairwise constraint set to generate the candidate constraint set $\Omega_p$, and a serial number is assigned to each pairwise constraint. For a gene expression dataset $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, with $k$ clusters, the $r$th individual in the $l$th generation, $P_r(l)$, consists of two parts: the cluster prototypes $P_r^{(v)}(l)$ and the constraints selection $P_r^{(set)}(l)$. Among them, $P_r^{(v)}(l) = [v_{r,1}, v_{r,2}, \ldots, v_{r,k}]$ encodes the $k$ cluster prototypes $v_{r,c} = [v_{r,c1}, v_{r,c2}, \ldots, v_{r,cd}]$ $(1 \le c \le k)$ with real numbers, and $P_r^{(set)}(l) = [g_{r,1}, g_{r,2}, \ldots, g_{r,s}]$ encodes the serial numbers $g_{r,j}$ $(1 \le g_{r,j} \le 2s,\ 1 \le j \le s)$ of the $s$ pairwise constraints selected from $\Omega_p$ with integers.
In the proposed algorithm, the two parts of the chromosomes are initialized separately. For the cluster prototype part, in order to ensure initialization quality and population diversity, half of the individuals are encoded as the k cluster prototypes obtained by the density peak method (Rodriguez and Laio, 2014), and the other half are encoded from the randomly generated cluster prototypes. For the constraints selection part of each individual, the components are initialized with non-repeated random integers in [1, 2s].
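A minimal Python sketch of this initialization is given below. It is an illustration under stated assumptions: the density peak step is not implemented, so the prototypes for the first half of the population are passed in through the hypothetical argument `seed_prototypes`; the function name `init_population` is likewise hypothetical; and the data are assumed to be normalized to [0, 1].

```python
import numpy as np

def init_population(N, k, d, s, seed_prototypes=None, rng=None):
    """Create N mixed chromosomes (prototypes, constraint serial numbers).

    seed_prototypes: optional (k, d) array, e.g. from a density peak method
    (not implemented here); used for the first half of the population.
    """
    rng = np.random.default_rng() if rng is None else rng
    population = []
    for r in range(N):
        if seed_prototypes is not None and r < N // 2:
            prototypes = np.array(seed_prototypes, dtype=float).copy()
        else:
            # Random real-valued prototypes inside the normalized data range
            prototypes = rng.random((k, d))
        # s distinct serial numbers drawn from the 2s candidate constraints
        selection = rng.choice(np.arange(1, 2 * s + 1), size=s, replace=False)
        population.append((prototypes, selection))
    return population

pop = init_population(N=6, k=3, d=4, s=5, rng=np.random.default_rng(1))
```

Each individual is stored as a pair of arrays rather than one flat vector, which keeps the real-valued and integer-valued parts separate for the different genetic operators.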

Genetic operations
In the genetic evolution process of the MSC-CSMC algorithm, the roulette wheel strategy is first used to implement the selection. Since the NSGA-II algorithm tends to select individuals with lower non-domination ranks, the selection probability of the $r$th individual $P_r(l)$ of the $l$th generation is calculated from its non-domination rank (Zhou and Zhu, 2018). Here, $\alpha \in (0, 1)$ is the selection parameter and $f_{rank}$ is the non-domination rank of individual $P_r(l)$.
For the parent individuals $P_{r1}(l)$ and $P_{r2}(l)$, let the crossover probability be $p_c$; different crossover operators are used for the cluster prototypes and the constraints selection. $P_{r1}^{(v)}(l)$ and $P_{r2}^{(v)}(l)$ generate the offspring cluster prototypes through the normal distribution crossover operator (Zhang and Luo, 2009), in which $N(0, 1)$ is a normally distributed random variable. The constraints selections $P_{r1}^{(set)}(l)$ and $P_{r2}^{(set)}(l)$ adopt the single-point crossover operator: for a random integer $rand_c$ in $[1, s]$, the offspring constraints selections are

$$offsp^{(set)}_1 = [g_{r1,1}, \ldots, g_{r1,rand_c}, g_{r2,rand_c+1}, \ldots, g_{r2,s}]$$

$$offsp^{(set)}_2 = [g_{r2,1}, \ldots, g_{r2,rand_c}, g_{r1,rand_c+1}, \ldots, g_{r1,s}]$$

If repeated pairwise constraints appear after crossover, non-repeated pairwise constraints are randomly selected from the candidate constraint set $\Omega_p$ as replacements.
For individual $P_r(l)$, different mutation operators are adopted for the two parts. The polynomial mutation operator (Rousseeuw, 1987) is applied to $P_r^{(v)}(l)$, where site $v_{r,ci}$ mutates with probability $p_m$ according to a perturbation $\delta$ and the upper and lower bounds $v_u$ and $v_l$ of the cluster prototype. For normalized gene expression data, the bounds are set to 1 and 0. $\delta$ is determined following Deb and Tiwari (2008), where $\eta_m$ is the distribution index and $rand_m$ is a random number in $[0, 1]$. For $P_r^{(set)}(l)$, random mutation is used: a position in $P_r^{(set)}(l)$ is selected at random, and its value is replaced with a random integer in $[1, 2s]$ that does not repeat any of the other values.

FIGURE 3
The mixed chromosome encoding strategy used in MSC-CSMC.

In summary, the procedure of the MSC-CSMC algorithm is as follows:
Input: Gene expression dataset $X$, number of neighbors $b$, density drop rate $\varepsilon$, population size $N$, maximal number of generations $L_{max}$, number of clusters $k$, fuzzy parameter $\eta$, penalty parameter $\beta$, constraint number $s$, GO action parameter $\theta$, selection parameter $\alpha$, crossover probability $p_c$, mutation probability $p_m$, and distribution index $\eta_m$.
Step 1: Generate the gene expression pairwise constraint set $\Omega$ based on the density tracking method.
Step 2: Calculate the functional similarity of genes based on AIC, and generate the gene ontology pairwise constraint set $\Omega^*$. Then delete the contradictory constraints, and determine the penalty weight matrix $W$ corresponding to the multi-source constraints based on Formula 10.
Step 3: Randomly select $2s$ pairwise constraints from the initial constraint set to construct the candidate constraint set $\Omega_p$, and initialize the population.
Step 4: At generation $l$ $(l = 1, 2, \ldots, L_{max})$, for each individual $P_r(l)$ $(1 \le r \le N)$, decode it to obtain the cluster prototypes and the selected pairwise constraints. Update the membership degrees according to Formulas 7-9, and calculate the individual fitness values based on Formulas 5-6.
Step 5: According to the individual fitness values, calculate the non-domination rank and crowding distance of each individual.
Step 6: Apply selection, crossover, and mutation based on Formulas 11-17, and update the individual fitness values according to Formulas 5-6.
Step 7: Merge the parent and offspring populations, and select the next generation according to the elite retention strategy.
Step 8: If $l = 0.5 \times L_{max}$ or $l = 0.8 \times L_{max}$, update the penalty parameter $\beta = 2 \times \beta$ to increase the penalty for violating the currently selected constraints.
Step 9: Set $l = l + 1$, and repeat Steps 4-8 until the maximal number of generations $L_{max}$ is reached.
Output: The Pareto optimal solutions.
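The genetic operators on the constraints selection part of a chromosome (single-point crossover with repair of repeated constraints, and random mutation) can be sketched in Python as follows. This is a sketch under the description given in the genetic operations, not the authors' code; the function names `repair`, `crossover_selection` and `mutate_selection` are hypothetical, and the operators on the cluster prototype part are not shown.

```python
import numpy as np

def repair(child, s, rng):
    """Replace repeated serial numbers with unused ones from [1, 2s]."""
    child = child.copy()
    present = set(child.tolist())
    unused = [g for g in range(1, 2 * s + 1) if g not in present]
    rng.shuffle(unused)
    seen = set()
    for idx, g in enumerate(child.tolist()):
        if g in seen:
            child[idx] = unused.pop()   # random non-repeated replacement
        else:
            seen.add(g)
    return child

def crossover_selection(p1, p2, rng):
    """Single-point crossover at rand_c in [1, s], followed by repair."""
    s = len(p1)
    cut = int(rng.integers(1, s + 1))
    c1 = np.concatenate([p1[:cut], p2[cut:]])
    c2 = np.concatenate([p2[:cut], p1[cut:]])
    return repair(c1, s, rng), repair(c2, s, rng)

def mutate_selection(p, rng):
    """Replace one random position with a value not used at other positions."""
    s = len(p)
    child = p.copy()
    pos = int(rng.integers(0, s))
    others = set(np.delete(p, pos).tolist())
    candidates = [g for g in range(1, 2 * s + 1) if g not in others]
    child[pos] = rng.choice(candidates)
    return child

rng = np.random.default_rng(0)
parent1 = np.array([1, 2, 3, 4])
parent2 = np.array([3, 4, 5, 6])
child1, child2 = crossover_selection(parent1, parent2, rng)
mutant = mutate_selection(parent1, rng)
```

Because at most s of the 2s serial numbers are in use, the repair step always has enough unused candidates to draw from.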

Datasets
In this study, five benchmark gene expression datasets, namely, Yeast Galactose Metabolism, Yeast Cell Cycle, Yeast Sporulation, Serum, and Arabidopsis are used for the experiment.
The Yeast Galactose Metabolism dataset (Ideker et al., 2001) is composed of 205 genes whose expression patterns reflect four functional categories. The gene expression profiles were measured with four replicate assays across 20 time points. The Yeast Cell Cycle dataset (Cho et al., 1998) contains the expression levels of 384 genes involved in yeast cell cycle regulation at 17 time points, and these data are related to five phases of the cell cycle. The Yeast Sporulation dataset (Chu et al., 1998) contains the expression levels of more than 6,000 genes measured during the sporulation process of budding yeast across seven time points. The genes that showed no significant changes in expression during the harvesting were excluded, and the resulting set consists of 474 genes. The Serum dataset (Iyer et al., 1999) contains the expression levels of 517 human genes. The dataset has 13 dimensions corresponding to 12 time points and 1 unsynchronized sample. The Arabidopsis dataset (Reymond et al., 2000) consists of 138 Arabidopsis thaliana genes. Each gene has eight expression values that correspond to eight time points. The details of the datasets are shown in Table 1.

Model evaluation criteria and parameter assignment
In order to evaluate the effectiveness of the model, the silhouette index (Rousseeuw, 1987) is chosen as the evaluation criterion for the clustering results. For gene $x_i$, the silhouette width $S(i)$ is calculated as follows:

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

Here, $a(i)$ is the average distance from gene $x_i$ to the other genes in the same cluster, and $b(i)$ is the minimum average distance between gene $x_i$ and the genes in each of the other clusters. The silhouette index SI of dataset $X$ is the mean value of the silhouette widths of all genes, with SI ∈ [−1, 1]. A greater SI value indicates better clustering quality. In addition, as suggested by Saha and Bandyopadhyay (2013), the final solution of MSC-CSMC is selected from the Pareto optimal solutions by using the silhouette index.
According to Mei (2019) and Abin and Vu (2020), the parameters of MSC-CSMC are assigned as follows: ε = 0.8, b = 10, η = 0.001, β = 0.1, N = 100, L max = 300, α = 0.3, η m = 5, p c = 0.8, p m = 0.1. The number of pairwise constraints s is chosen as 0, 5, 10, 15, 20, and 25. In gene expression data analysis, the determination of the number of clusters k is an open problem. Generally, there are two approaches to determine the value of k. One is to directly set it as the true number of clusters (Yu et al., 2018; Zhao et al., 2021; Liu et al., 2022; Wu and Ma, 2022). The other is applicable when the true number of clusters is unknown: the variation range of k is determined first, and the k corresponding to the optimal value of an index (Silhouette index, Dunn index, Davies-Bouldin index, etc.) is chosen as the optimal number of clusters (Gao et al., 2019; Acharya et al., 2020; López-Cortés et al., 2020; Zhang et al., 2022). In this paper, we adopt the first approach, and the number of clusters k is selected according to Table 1. In order to analyze the impact of the GO action parameter θ, we vary θ from 0.1 to 0.9 at intervals of 0.1 with the number of pairwise constraints fixed at 15. The results are shown in Figure 4. It can be seen that the value of SI barely changes as θ increases, which means that the algorithm is not very sensitive to the value of θ. For Yeast Galactose Metabolism, Yeast Cell Cycle, Yeast Sporulation, Serum, and Arabidopsis, the θ values are set to 0.4, 0.7, 0.6, 0.5, and 0.4, respectively, which lead to the optimal clustering performances.
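The silhouette index used as the evaluation criterion can be computed as in the following Python sketch. It follows the standard definition of Rousseeuw (1987) with Euclidean distances; the function name `silhouette_index` is hypothetical, crisp cluster labels are assumed (for a fuzzy partition, each gene would first be assigned to its highest-membership cluster), and at least two clusters are required.

```python
import numpy as np

def silhouette_index(X, labels):
    """Mean silhouette width over all genes (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    widths = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the gene itself
        if not same.any():
            continue                         # singleton cluster: width 0 by convention
        a = dist[i, same].mean()             # a(i): mean distance within own cluster
        b = min(dist[i, labels == c].mean()  # b(i): nearest other cluster on average
                for c in clusters if c != labels[i])
        widths[i] = (b - a) / max(a, b)
    return widths.mean()

# Two well-separated pairs of points: the natural partition scores higher
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
si_good = silhouette_index(X, [0, 0, 1, 1])
si_bad = silhouette_index(X, [0, 1, 0, 1])
```

In the toy example, the partition that groups nearby points together obtains the larger SI, which is the behavior relied on when the final solution is picked from the Pareto optimal set.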

Result analysis and model comparison
To examine the performance of the proposed MSC-CSMC algorithm, several advanced semi-supervised clustering algorithms based on single-source constraints, including COP-Kmeans (Wagstaff et al., 2001), PCKMeans, MPCKMeans (Bilenko et al., 2004), PCCA (Grira et al., 2008), PCFCMq (Mei, 2019) and MSC-CS (Zhao and Li, 2022), are used for comparison. Among them, the MSC-CS algorithm is the single-source constrained version of MSC-CSMC, which does not consider the annotation information provided by GO. In the above algorithms, the pairwise constraints are randomly selected from the initial gene expression constraint set Ω. To avoid the influence of randomness, each method is run ten times under the same number of pairwise constraints, and the mean value of the clustering results is taken as the final result. The SI values of all seven algorithms on the five datasets are shown in Tables 2-6; the optimal solutions in each row are highlighted in bold.
According to Tables 2-6, the proposed MSC-CSMC algorithm and its single-source constraint version MSC-CS always achieve the optimal and suboptimal clustering results on the five gene expression datasets, demonstrating the effectiveness of the constraints selection. The mixed chromosome encoding strategy combining the constraint selection and cluster prototypes can find the pairwise constraints suitable for clustering in the co-evolution process and improve clustering accuracy, and the highly accurate clustering results can in turn further improve the constraint selection ability of the algorithm. Conversely, the algorithms used for comparison are based on the assumption that the pairwise constraints conform to the real cluster information and are easily affected by noisy constraints. This is consistent with the analysis of the negative effects of noisy constraints by Yin et al. (2010) and Lai et al. (2021). In addition, the MSC-CSMC algorithm is better than MSC-CS in most cases, indicating that using multi-source constraints can improve the performance of semi-supervised clustering. The gene ontology used to generate multi-source pairwise constraints in our MSC-CSMC algorithm can explain gene expression profiles from the perspective of gene function. By effectively integrating the gene expression and Gene Ontology information, the proposed penalty weights can provide a reasonable violation penalty for pairwise constraints.

FIGURE 4
The impact of parameter θ on SI tested on different datasets. (A) Yeast Galactose Metabolism (B) Yeast Cell Cycle (C) Yeast Sporulation (D) Serum (E) Arabidopsis.

The bold values in Tables 2-6 indicate the optimal solutions in each row.

In the case of s = 0, that is, there is no pairwise constraint, both MSC-CSMC and MSC-CS degenerate into unsupervised multiobjective clustering methods, turning out the same result. Compared with PCFCMq, which uses J P as the single objective function, the better performance of MSC-CSMC and MSC-CS shows the advantages of using multi-objective optimization in clustering gene expression data.
Among the comparison algorithms, the performance of the PCFCMq algorithm, which is based on fuzzy clustering, is generally better than that of the hard clustering-based COP-Kmeans, PCKMeans, and MPCKMeans algorithms. According to Gasch and Eisen (2002), genes may be co-expressed with different groups of genes under different measurement conditions, and there is usually overlap between gene clusters. Therefore, compared with hard clustering algorithms, fuzzy clustering algorithms are more suitable for analyzing gene expression data. Furthermore, owing to the proposed constraints selection and multi-source constraint fusion strategy, the MSC-CSMC algorithm achieves better clustering results than the PCFCMq algorithm. In terms of the robustness of the clustering results, the performances of the semi-supervised clustering algorithms used for comparison fluctuate as the number of pairwise constraints increases, which is mainly due to the quality of the randomly selected pairwise constraints. As stated by Lai et al. (2021), even non-noisy constraints that conform to the real cluster information may have a negative impact on the clustering results, which further illustrates the necessity of constraints selection in semi-supervised clustering algorithms. The proposed MSC-CSMC algorithm selects pairwise constraints suitable for clustering based on the co-evolution of the cluster prototypes and constraints selection, which improves both the accuracy and the stability of the clustering results.
To illustrate the consistency of the gene clusters obtained by the MSC-CSMC algorithm, the Eisen plots and cluster profile plots corresponding to the clustering results of the five datasets are shown in Figure 5 and Figure 6. In the Eisen plots, each row corresponds to a gene, each column to a time point (sample), and each cell is colored according to the expression level of a gene at a specific time point. To show the gene clusters obtained by MSC-CSMC more clearly, the genes partitioned into the same cluster are placed together. In the cluster profile plots, the X- and Y-axes represent the time points and gene expression values, respectively. The expression values of genes partitioned into the same cluster are plotted in the same subplot. In the subplots, each green line indicates the normalized expression values of a gene over all time points, and the black line represents the mean expression level of the genes in the corresponding cluster. It can be seen in the Eisen plots that the color patterns (expression levels) of genes in the same cluster are similar to each other, while genes in different clusters show different color patterns. According to Figure 6, the cluster profiles of different clusters differ from each other, and the gene profiles within a cluster are consistent.
To inspect the biological significance of the gene clusters obtained by the MSC-CSMC algorithm, enrichment analysis is carried out using the GO annotation database, which yields the significant GO terms shared by the genes in each cluster and their corresponding p-values. Taking the Yeast Sporulation dataset with 15 pairwise constraints as an example, we focus on the three most significant GO terms (those with the three lowest p-values) in each of the six clusters obtained by each algorithm. Figure 7 plots the average p-values. To make the differences easier to see, the p-values are negative log-transformed and the clusters are sorted in descending order of the transformed values. Table 7 reports the three most significant GO terms and the corresponding p-values in each cluster obtained by MSC-CSMC.
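The text does not state which statistical test produces the enrichment p-values. A common choice for GO term enrichment is the one-sided hypergeometric test, and the sketch below assumes it purely for illustration; the function name, argument names, and numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population   -- total number of genes in the dataset
    annotated    -- genes in the dataset annotated with the GO term
    cluster_size -- number of genes in the cluster
    hits         -- genes in the cluster annotated with the GO term
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    # Sum the probability of observing k annotated genes for every k >= hits.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Toy numbers (hypothetical): 20 genes, 5 carry the term, cluster of 5 genes.
p_all_hits = enrichment_p_value(20, 5, 5, 5)
p_no_hits = enrichment_p_value(20, 5, 5, 0)
```

A small p-value means that the cluster contains more genes annotated with the term than random sampling would be expected to produce.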
From Figure 7, it can be seen that the curve corresponding to MSC-CSMC is higher than those of the other algorithms, indicating that the clusters obtained by MSC-CSMC have the highest biological significance.
Moreover, all the p-values of the significant GO terms listed in Table 7 are far less than 0.01, indicating that the MSC-CSMC algorithm can identify biologically relevant gene clusters.
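The summary plotted in Figure 7 can be sketched as below, under two stated assumptions: the three lowest p-values of a cluster are averaged before the transform, and the logarithm is base 10. Neither detail is specified in the text, and the p-values in the example are made up.

```python
from math import log10

def ranked_cluster_scores(p_values_by_cluster, top_n=3):
    """Score each cluster by -log10 of the mean of its top_n lowest p-values,
    then sort the clusters in descending order of that score."""
    scores = []
    for cluster, p_values in p_values_by_cluster.items():
        top = sorted(p_values)[:top_n]  # most significant terms first
        scores.append((cluster, -log10(sum(top) / len(top))))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical p-values for three clusters (not taken from Table 7).
example = {
    "c1": [1e-10, 1e-8, 1e-6, 1e-2],
    "c2": [1e-20, 1e-18, 1e-15],
    "c3": [1e-4, 1e-3, 1e-3],
}
ranked = ranked_cluster_scores(example)
```

A higher score corresponds to a lower average p-value, which is why a higher curve indicates greater biological significance. A p-value of exactly zero would need special handling before the logarithm is taken.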

Conclusion
Current semi-supervised clustering methods based on pairwise constraints are easily affected by noisy constraints and do not take the fusion of multi-source constraints into account. To address these problems, this paper proposes a multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints (MSC-CSMC). The proposed algorithm uses gene expression data and GO information to generate multi-source pairwise constraints and applies the multi-source constraints to the semi-supervised clustering process through improved constraint violation penalty weights. On this basis, a collaborative multi-objective optimization framework for constraints selection and cluster prototypes is constructed, and the negative impact of noisy constraints is reduced by selecting pairwise constraints suitable for clustering. Experimental results on multiple gene expression datasets show that the MSC-CSMC algorithm effectively improves the performance of semi-supervised clustering. The validity of the proposed method is not limited to the cluster analysis of gene expression data. Other semi-supervised clustering studies that involve multi-source information or require constraint selection may also benefit from it.
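The joint optimization of cluster prototypes and constraints selection relies on a mixed chromosome. A minimal sketch of one possible layout is given below: real-valued prototype coordinates followed by one binary gene per candidate constraint. The layout, the `decode` function, and the toy values are assumptions made for illustration, not the encoding actually used by MSC-CSMC.

```python
def decode(chromosome, n_clusters, n_features, constraints):
    """Split a mixed chromosome into cluster prototypes and selected constraints.

    The first n_clusters * n_features genes are real-valued prototype
    coordinates; the remaining genes are 0/1 flags, one per candidate constraint.
    """
    split = n_clusters * n_features
    if len(chromosome) != split + len(constraints):
        raise ValueError("chromosome length does not match the assumed layout")
    reals, bits = chromosome[:split], chromosome[split:]
    prototypes = [
        reals[i * n_features:(i + 1) * n_features] for i in range(n_clusters)
    ]
    # Keep only the constraints whose selection bit is switched on.
    selected = [c for c, bit in zip(constraints, bits) if bit == 1]
    return prototypes, selected

# Hypothetical candidates: (gene index, gene index, constraint type).
constraints = [(0, 1, "must"), (0, 2, "cannot"), (1, 3, "must")]
chromosome = [0.1, 0.2, 0.8, 0.9, 1, 0, 1]  # 2 prototypes in 2-D + 3 selection bits
prototypes, selected = decode(chromosome, 2, 2, constraints)
```

Under this assumed layout, the number of decision variables is `n_clusters * n_features` plus the number of candidate constraints, so the search space grows as either part grows.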
The effectiveness of the algorithm has been verified on small and medium-sized gene expression datasets. As the data size increases, the growth in the number of decision variables in the multi-objective evolution process will reduce the efficiency and optimization performance of the algorithm. Therefore, the next step is to use decision variable analysis and other methods to design a multi-objective evolution strategy for the algorithm, so as to further improve its applicability to practical clustering problems. In addition, we will try to use various evaluation indices and design a multi-objective optimization framework with variable coding length (Rodríguez-Méndez et al., 2019) to optimize the number of clusters for gene expression data.

Data availability statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.