
METHODS article

Front. Genet., 07 October 2022

Sec. Computational Genomics

Volume 13 - 2022 | https://doi.org/10.3389/fgene.2022.996941

An enhanced adaptive Bi-clustering algorithm through building a shielding complex sub-matrix

  • 1. School of Electronic Engineering, Xidian University, Xi’an, China

  • 2. School of Management, Hefei University of Technology, Hefei, China

  • 3. School of Optoelectronic Engineering, Xidian University, Xi’an, China

  • 4. State Key Laboratory of Applied Optics, Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun, China

Abstract

Bi-clustering refers to the task of finding sub-matrices (indexed by a group of columns and a group of rows) within a matrix of data such that the elements of each sub-matrix (data and features) are related in a particular way, for instance, that they are similar with respect to some metric. In this paper, we analyze the well-known Cheng and Church bi-clustering algorithm, which has proved to be an effective tool for mining co-expressed genes, and summarize its limitations (such as interference of random numbers in the greedy strategy and ignoring overlapping bi-clusters). We then propose a novel enhancement of the adaptive bi-clustering algorithm, in which a shielding complex sub-matrix is constructed to shield the bi-clusters that have been obtained and to discover overlapping bi-clusters. In the shielding complex sub-matrix, the imaginary and the real parts are used to shield and extend the new bi-clusters, respectively, and to form a series of optimal bi-clusters. To ensure that newly obtained bi-clusters have no effect on the bi-clusters already produced, a unit impulse signal is introduced to adaptively detect and shield the constructed bi-clusters. Meanwhile, to effectively shield the null data (zero-size data), another unit impulse signal is set for adaptive detection and shielding. In addition, we add a shielding factor to adjust the mean squared residue score of the rows (or columns) that contain shielded data of the sub-matrix, in order to decide whether to retain them. We offer a thorough analysis of the developed scheme. The experimental results are in agreement with the theoretical analysis. Results obtained on a publicly available real microarray dataset show that the proposed method improves the quality of the bi-clusters.

1 Introduction

As important tools for data mining and information granule construction (Xu et al., 2022), traditional clustering algorithms analyze only the properties of the data samples (the number and types of the attributes or variables), but do not focus on the components of the data such as Data-table, Data-column, Data-relation, etc. This is a major issue affecting clustering performance (Abe and Yadohisa, 2019), especially when dealing with high-dimensional gene expression data, and it motivates the development of bi-clustering algorithms. Bi-clustering is able not only to reveal the global structure in data (as the traditional methods do), but also to discover local information (it can discover clusters in the feature space and the data space simultaneously). In addition, bi-clustering is regarded as an effective tool for dealing with high-dimensional data. Since its introduction, bi-clustering has received much attention and has become one of the focal points of the data mining community. Bi-clustering was first introduced by Hartigan (Hu et al., 2019), and has been further developed since Cheng and Church proposed a bi-clustering algorithm based on variance and applied it to gene expression data (Cheng and Church, 2000). Their work remains the most important contribution to the field of bi-clustering research.

At present, bi-clustering is most widely used in the field of bioinformatics. Unlike traditional clustering methods, which treat similarity (distance-based measures) as a function of pairs of genes or pairs of conditions and are therefore not applicable in high-dimensional space, the bi-clustering model measures coherence within a subset of genes and conditions. This model may be particularly useful in disclosing the involvement of genes or conditions in multiple pathways, some of which can only be discovered under the dominance of more consistent ones (Yang et al., 2003). The coherence score (Cheng and Church, 2000) is defined as a symmetric function of the genes and conditions involved, and therefore bi-clustering is a process of simultaneous grouping of genes and conditions. The so-called Mean Squared Residue (MSR) (Cheng and Church, 2000) is employed and applied to expression data transformed by a logarithm and augmented by the additive inverse. Bi-clustering is also referred to in the literature as co-clustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining.

Popular bi-clustering algorithms, such as the Cheng and Church (CC) algorithm, FLOC (Yang et al., 2005), Plaid (Lazzeroni and Owen, 2000), OPSM (Ben-Dor et al., 2003), ISA (Bergmann et al., 2003), Spectral (Kluger et al., 2003), xMOTIFs (Murali and Kasif, 2003), and BiMax (Prelic et al., 2006), have drawn much attention in the literature. Newer algorithms, such as Bayesian Bi-clustering (Gu and Liu, 2007), COALESCE (Huttenhower et al., 2009), CPB (Bozdag et al., 2009), QUBIC (Li et al., 2009), and FABIA (Hochreiter et al., 2010), have not been extensively studied. Among them, the CC algorithm is the earliest and most studied one, and the newer algorithms are mostly based on the idea of the CC algorithm.

A number of bi-clustering algorithms have been proposed so far; however, research on bi-clustering is still at an early stage. To enrich and further develop bi-clustering algorithms, in this paper we first present a brief analysis of the well-known CC algorithm and elaborate on some related bi-clustering concepts, and we discuss the advantages and drawbacks of the CC algorithm. To address these drawbacks (such as interference of random numbers in the greedy strategy and ignoring overlapping bi-clusters), we design an improved adaptive bi-clustering algorithm that builds a shielding complex sub-matrix to adaptively shield the obtained bi-clusters and to discover new ones.

In the implementation of the new bi-clustering algorithm, we add an imaginary part to the constructed bi-clusters to increase their MSR and also set a shielding factor to adjust the MSR increments of the row (or column) which contains elements in the constructed bi-clusters to effectively shield the constructed bi-clusters. In addition, we set two unit impulse signals to adaptively detect the constructed bi-clusters and the null data (zero-size data) of the dataset to avoid the over-shielding and the shielding failure, so as to adaptively improve the bi-clusters. A detailed analysis and a comprehensive suite of experiments are provided. Obviously, the proposed scheme is also applicable to other similar bi-clustering techniques. The experimental studies demonstrate that the proposed approach achieves better performance compared with that of the well-known CC algorithm. To the best of our knowledge, the idea of the proposed approach has not been considered in the previous studies.

This paper is organized as follows. The CC algorithm and bi-clustering related ideas are briefly reviewed in Section 2. A novel enhancement of the adaptive bi-clustering algorithm is detailed in Section 3. Section 4 includes experimental setup and covers an analysis of completed experiments. Section 5 contains some conclusions.

2 Bi-clustering: Definitions and problem formulation

Consider a sample-feature expression matrix $A = (a_{ij})_{n \times m}$, where the n rows represent n samples (data), the m columns represent m features, and the entry aij denotes the expression level of feature j in sample i. Let $S = \{S_1, S_2, \ldots, S_n\}$ be the sample set, where $S_i = (a_{i1}, a_{i2}, \ldots, a_{im})$ is called the feature vector of sample i. Similarly, the feature set is denoted by $F = \{F_1, F_2, \ldots, F_m\}$, with each vector $F_j = (a_{1j}, a_{2j}, \ldots, a_{nj})^T$ being a column vector. Thus, we have $A = (S_1, S_2, \ldots, S_n)^T = (F_1, F_2, \ldots, F_m)$. A bi-cluster is a sub-matrix of the data matrix, denoted by Bk=(Sk, Fk), satisfying Sk ⊆ S and Fk ⊆ F, where an entry aij of Bk is the intersection of the corresponding row (sample) and column (feature) in both A and Bk. Assume that K bi-clusters are found in data matrix A; the set of bi-clusters is denoted by $B = \{B_k : k = 1, 2, \ldots, K\}$. Usually, we use (Sk, F) to denote a cluster of rows (samples) and (S, Fk) a cluster of columns (features). Additionally, |Sk| denotes the cardinality of Sk, i.e., the number of samples in bi-cluster Bk=(Sk, Fk), while |Fk| denotes the number of features. Clearly, we have |S| = n and |F| = m.

Given a data matrix A, the bi-clustering problem is to design algorithms that find bi-clusters of it, i.e., a set of sub-matrices of A such that the samples (rows, Sk) of each bi-cluster Bk exhibit similar behavior under the corresponding features (columns, Fk). In other words, the bi-clustering problem is to identify a set of bi-clusters Bk=(Sk, Fk) such that each bi-cluster Bk satisfies some specific characteristics of homogeneity (Xhafa et al., 2011).

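For a bi-cluster Bk=(Sk, Fk), several means based on the bi-cluster are defined. The mean of row i of Bk is (Fan et al., 2010)

$$\mu_{ik}^{(r)} = \frac{1}{|F_k|} \sum_{j \in F_k} a_{ij}, \tag{1}$$

the mean of column j of Bk is

$$\mu_{jk}^{(c)} = \frac{1}{|S_k|} \sum_{i \in S_k} a_{ij}, \tag{2}$$

and the mean of all the entries in Bk is

$$\mu_k = \frac{1}{|S_k||F_k|} \sum_{i \in S_k} \sum_{j \in F_k} a_{ij}. \tag{3}$$

The residue (Hu et al., 2019) of the entry aij in bi-cluster Bk is

$$r_{ij} = a_{ij} - \mu_{ik}^{(r)} - \mu_{jk}^{(c)} + \mu_k, \tag{4}$$

the variance of bi-cluster Bk is

$$\mathrm{Var}(B_k) = \sum_{i \in S_k} \sum_{j \in F_k} \left(a_{ij} - \mu_k\right)^2, \tag{5}$$

and the mean squared residue (MSR) score (Xhafa et al., 2011) of the bi-cluster Bk is

$$H_k = \frac{1}{|S_k||F_k|} \sum_{i \in S_k} \sum_{j \in F_k} r_{ij}^2. \tag{6}$$

A sub-matrix Bk is called a δ-bi-cluster if Hk ≤ δ for some δ ≥ 0. When calculating the MSR of a single row or a single column in data matrix A, Eq. 6 is converted to the following expressions:

$$H_i = \frac{1}{|F_k|} \sum_{j \in F_k} r_{ij}^2, \tag{7}$$

$$H_j = \frac{1}{|S_k|} \sum_{i \in S_k} r_{ij}^2. \tag{8}$$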
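The residue and MSR definitions of this section can be illustrated with a short script. What follows is a minimal sketch in plain Python, assuming the data matrix is stored as a list of lists and a bi-cluster is given by two lists of 0-based row and column indices. The function names (`residues`, `msr`, `row_msr`, `col_msr`) are illustrative and are not taken from the article.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def residues(A, rows, cols):
    """Residue of every entry of the sub-matrix of A indexed by rows x cols."""
    row_mean = {i: mean(A[i][j] for j in cols) for i in rows}
    col_mean = {j: mean(A[i][j] for i in rows) for j in cols}
    all_mean = mean(A[i][j] for i in rows for j in cols)
    return {(i, j): A[i][j] - row_mean[i] - col_mean[j] + all_mean
            for i in rows for j in cols}


def msr(A, rows, cols):
    """Mean squared residue of the whole sub-matrix."""
    r = residues(A, rows, cols)
    return sum(v * v for v in r.values()) / (len(rows) * len(cols))


def row_msr(A, i, rows, cols):
    """Mean squared residue of row i within the sub-matrix."""
    r = residues(A, rows, cols)
    return sum(r[i, j] ** 2 for j in cols) / len(cols)


def col_msr(A, j, rows, cols):
    """Mean squared residue of column j within the sub-matrix."""
    r = residues(A, rows, cols)
    return sum(r[i, j] ** 2 for i in rows) / len(rows)


# An additive pattern a_ij = 10*i + j is perfectly coherent: every residue is 0.
A = [[10 * i + j for j in range(4)] for i in range(3)]
rows, cols = [0, 1, 2], [0, 1, 2, 3]
print("MSR of the additive matrix:", msr(A, rows, cols))

# Perturbing a single entry breaks the coherence and raises the scores.
A[1][2] += 5
print("MSR after perturbation:", msr(A, rows, cols))
print("row 1 score:", row_msr(A, 1, rows, cols))
print("column 2 score:", col_msr(A, 2, rows, cols))
```

Averaging the row scores over all rows (or the column scores over all columns) gives back the MSR of the sub-matrix, which is why the single-row and single-column scores can be compared directly with the overall score when deciding what to delete.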

Bi-clusters can thus be seen as sub-matrices of a matrix representing features of elements. It should be noted that bi-clusters need be neither exclusive nor exhaustive (Xhafa et al., 2011). The well-known CC algorithm obtains one optimum bi-cluster at a time (as large a δ-bi-cluster as possible) by adding and deleting rows or columns of the original data matrix to reduce the MSR of the whole matrix. Each bi-cluster found is then shielded by replacing its entries with random numbers. However, these random numbers interfere with later searches, which in turn impacts the discovery of high-quality bi-clusters (Cheng and Church, 2000; Yang et al., 2005). As a result, some elements that do not satisfy the bi-clustering condition may be mistakenly clustered; in addition, overlapping bi-clusters are ignored.
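To make the greedy strategy concrete, here is a simplified sketch of two ingredients of the CC procedure: single-node deletion (drop the row or column with the largest score until the MSR is at most δ) and masking of the found bi-cluster with random numbers. It omits the multiple-node deletion and node addition phases, and the function names, the toy matrix, and the random range [0, 100] are assumptions made for illustration.

```python
import random


def residues(A, rows, cols):
    """Residues of the sub-matrix of A indexed by rows x cols."""
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(A[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    return {(i, j): A[i][j] - row_mean[i] - col_mean[j] + all_mean
            for i in rows for j in cols}


def single_node_deletion(A, delta):
    """Greedily drop the worst row or column until the MSR is at most delta."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    while len(rows) > 1 and len(cols) > 1:
        r = residues(A, rows, cols)
        h = sum(v * v for v in r.values()) / (len(rows) * len(cols))
        if h <= delta:
            break
        row_score = {i: sum(r[i, j] ** 2 for j in cols) / len(cols) for i in rows}
        col_score = {j: sum(r[i, j] ** 2 for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


def mask_with_random(A, rows, cols, low, high, rng):
    """Overwrite the found bi-cluster in place with uniform random numbers."""
    for i in rows:
        for j in cols:
            A[i][j] = rng.uniform(low, high)


A = [[0, 1, 2, 3],
     [10, 11, 12, 13],
     [20, 21, 22, 23],
     [50, -20, 70, 5]]  # the last row does not follow the additive pattern
rows, cols = single_node_deletion(A, delta=0.5)
print("bi-cluster rows:", rows, "columns:", cols)

# The masking step destroys the original values of the found bi-cluster,
# which is the source of the interference discussed in the text.
mask_with_random(A, rows, cols, 0, 100, random.Random(0))
```

Because the masked entries no longer carry the original data, any later bi-cluster that touches them is scored on random values rather than on measurements.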

An example of the random number interference is shown in Table 1. Assume that the shaded entries 0 and 90 were clustered in a previous iteration and replaced by the random numbers 6 and 9 (shown in brackets). This replacement makes the sub-matrix Bk=(Sk, Fk) with Sk = {2, 4, 5} and Fk = {1, 2, 4}, which is not a bi-cluster on the original data, satisfy the bi-cluster condition. This illustrates the unreasonable aspect of the algorithm.

TABLE 1. An example of the random number interference.

         Col 1   Col 2   Col 3   Col 4    …   …   Col m
Row 1    Data    Data    Data    Data     …   …   Data
Row 2    1       2       0       3        …   …   Data
Row 3    Data    Data    Data    Data     …   …   Data
Row 4    4       5       10      0 (6)    …   …   Data
Row 5    7       8       15      90 (9)   …   …   Data
:        :       :       :       :        …   …   Data
Row n    Data    Data    Data    Data     …   …   Data
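The effect shown in Table 1 can be checked numerically. The sketch below computes the MSR of the sub-matrix formed by rows {2, 4, 5} and columns {1, 2, 4}, once with the true values and once with the random replacements. Using δ = 300 as the threshold is an assumption borrowed from the experimental setup of Section 4.

```python
def msr(sub):
    """Mean squared residue of a dense sub-matrix given as a list of rows."""
    n_rows, n_cols = len(sub), len(sub[0])
    row_mean = [sum(row) / n_cols for row in sub]
    col_mean = [sum(row[j] for row in sub) / n_rows for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            total += (sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
    return total / (n_rows * n_cols)


# Rows {2, 4, 5} and columns {1, 2, 4} of Table 1.
original = [[1, 2, 3], [4, 5, 0], [7, 8, 90]]   # true values
replaced = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]    # 0 -> 6 and 90 -> 9

delta = 300  # assumed threshold, taken from the experiments in Section 4
print("original is a delta-bi-cluster:", msr(original) <= delta)
print("replaced is a delta-bi-cluster:", msr(replaced) <= delta)
```

With the true values the score is 3148/9 (about 349.8), which exceeds the threshold, whereas the replaced sub-matrix is perfectly additive and scores 0. The random numbers therefore create a bi-cluster that the data do not support.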

3 Complex sub-matrix shielding model

To address the issue described above, this paper presents a new complex sub-matrix shielding model. The proposed approach focuses on improving the iteration step of the CC algorithm, which must use random numbers to replace the bi-clustering results. The new complex sub-matrix shielding instead lets the algorithm complete the bi-clustering while avoiding the interference of random numbers in the greedy strategy.

3.1 Construction of a shielding complex sub-matrix

Suppose that Bk=(Sk, Fk) is the bi-cluster found in the kth bi-clustering search. To find a new, (k+1)th bi-cluster, the first k bi-clusters that have been obtained should be shielded. When searching for the (k+1)th bi-cluster based on the first k shielded bi-clusters, on the one hand, it is desired that the first k bi-clusters be temporarily ignored (to find a new bi-cluster); on the other hand, it is desired that the (k+1)th bi-cluster can contain elements that have already been clustered in the first k shielded bi-clusters and that satisfy the prespecified condition of bi-clustering. To meet these requirements, a shielding complex sub-matrix is built as

where 1j is the imaginary unit and the shielding factor adjusts the MSR increments of the row (or column) to be shielded. The MSRs increase as the shielding factor increases, but usually they only need to exceed a previously set threshold value. The symbol * denotes the Schur product (Hadamard product) (Hanyu et al., 2022; Tian et al., 2022; Xu et al., 2019) of matrices, and the remaining function is the unit impulse response function. It is necessary to point out that, in the process of shielding, the former part of (9) prevents an entry from being shielded a second time, and the latter part of (9) prevents unsuccessful shielding when 0-valued data are encountered.
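The role of the imaginary part can be illustrated with a toy example. The rule used here is a simplified stand-in and is not the construction of Eq. 9: it assumes that a shielded entry keeps its real part and receives an imaginary part equal to the shielding factor times its value (or the shielding factor itself when the value is 0), that an entry with a non-zero imaginary part is never shielded again, and that the MSR is computed from squared moduli. The names `shield`, `msr` and `lam` are hypothetical.

```python
def shield(C, rows, cols, lam):
    """Return a copy of C with the block rows x cols shielded.

    Assumed rule (an illustration, not the article's Eq. 9): keep the real
    part and add the imaginary part lam * value, or lam if the value is 0.
    """
    out = [list(row) for row in C]
    for i in rows:
        for j in cols:
            z = complex(out[i][j])
            if z.imag != 0:
                continue  # already shielded: do not shield a second time
            base = z.real if z.real != 0 else 1.0  # zero-valued data still get shielded
            out[i][j] = complex(z.real, lam * base)
    return out


def msr(C, rows, cols):
    """MSR based on squared moduli, so imaginary parts add to the score."""
    row_mean = {i: sum(C[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(C[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(C[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    total = sum(abs(C[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))


A = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # additive, so its MSR is 0
everything = [0, 1, 2]

once = shield(A, [0, 1], [0, 1], lam=10.0)      # shield a found 2 x 2 bi-cluster
twice = shield(once, [0, 1], [0, 1], lam=10.0)  # a second pass changes nothing

print("MSR before shielding:", msr(A, everything, everything))
print("MSR after shielding:", msr(once, everything, everything))
print("second shielding is a no-op:", once == twice)
```

Under these assumptions the real parts stay untouched, so the original data remain available for extending later bi-clusters, while the imaginary parts raise the scores of the rows and columns that cross the shielded block.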

3.2 Implementation

By using the shielding complex sub-matrix, a series of more satisfactory bi-clusters can be discovered. The specific process is as follows:

First, the first bi-cluster B1=(S1, F1) is found with the CC algorithm and shielded by using the proposed approach. When searching for the kth new bi-cluster Bk, the following two expressions are employed to calculate the MSR of a single row or a single column in data matrix A and to decide whether to delete it.

It is easy to see that and , due to the effect of the shielding complex sub-matrix. When the ith row (or jth column) contains elements in the first k-1 bi-clusters, then

Meanwhile, with the aid of the shielding factor, the first k-1 bi-clusters are ignored in the search for the kth bi-cluster. To delete the shielded (already obtained) bi-clusters quickly, a new parameter is introduced. If the MSRs of the rows and columns satisfy the following conditions, they are deleted collectively.
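Collective deletion can be sketched as follows. The exact conditions are those of the equations referred to above; this sketch assumes the threshold rule of CC-style multiple-node deletion, in which every row (then every column) whose score exceeds `alpha` times the current MSR is removed at once. The parameter name `alpha`, the requirement `alpha >= 1`, and the toy matrix are assumptions.

```python
def scores(A, rows, cols):
    """Return (H, row_scores, col_scores) for the sub-matrix rows x cols."""
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(A[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    # abs() gives the modulus, so complex (shielded) entries are handled too.
    sq = {(i, j): abs(A[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
          for i in rows for j in cols}
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    return h, row_score, col_score


def multiple_node_deletion(A, rows, cols, delta, alpha):
    """Drop all rows, then all columns, whose score exceeds alpha * H.

    Assumes alpha >= 1, which guarantees that at least one row and one
    column survive each pass (H is the average of the scores).
    """
    rows, cols = list(rows), list(cols)
    while True:
        h, row_score, col_score = scores(A, rows, cols)
        if h <= delta:
            break
        kept_rows = [i for i in rows if row_score[i] <= alpha * h]
        if len(kept_rows) < len(rows):
            rows = kept_rows
            continue  # recompute the scores before looking at columns
        kept_cols = [j for j in cols if col_score[j] <= alpha * h]
        if len(kept_cols) < len(cols):
            cols = kept_cols
            continue
        break  # nothing exceeds the threshold; finer deletion would be needed
    return rows, cols


A = [[0, 1, 2],
     [10, 11, 12],
     [20, 21, 22],
     [100, -100, 0]]  # the last row is incoherent with the others
rows, cols = multiple_node_deletion(A, range(4), range(3), delta=0.5, alpha=1.2)
print("kept rows:", rows, "kept columns:", cols)
```

Since shielded rows and columns carry large scores by construction, a rule of this kind removes them in a single pass instead of one node at a time.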

After finishing the shielding, a new (kth) bi-cluster is obtained. To make the new bi-cluster contain all the elements in the dataset that satisfy the prespecified condition, what we need to do next is to add the rows and columns that satisfy the prespecified condition. To avoid the disturbance coming from the shielded data, the rows and columns can be added without violating the requirements, i.e.,

In the processes of deletion and addition of the rows and columns, the MSR is constantly recalculated and improved until the new optimal bi-cluster is obtained. Obviously, the proposed method can avoid the phenomenon of random interference which impacts the discovery of high quality bi-clusters, and can discover overlapping bi-clusters.
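The addition step can be sketched in the same spirit. This sketch assumes the CC-style criterion (a candidate column or row is added when its score does not exceed the current MSR) and, following the statement that the real parts are used to extend bi-clusters, it scores candidates on real parts only so that shielded entries can still join a new, overlapping bi-cluster. The function name, the tolerance `tol`, and the toy matrix are assumptions.

```python
def node_addition(C, rows, cols, tol=1e-12):
    """Add columns, then rows, whose score does not exceed the current MSR."""
    # Assumption: only the real parts are used when extending a bi-cluster.
    A = [[complex(v).real for v in row] for row in C]
    rows, cols = list(rows), list(cols)
    n, m = len(A), len(A[0])

    def stats(rows, cols):
        row_mean = [sum(A[i][j] for j in cols) / len(cols) for i in range(n)]
        col_mean = [sum(A[i][j] for i in rows) / len(rows) for j in range(m)]
        all_mean = sum(A[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
        h = sum((A[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                for i in rows for j in cols) / (len(rows) * len(cols))
        return row_mean, col_mean, all_mean, h

    # Column addition: score each outside column against the current rows.
    row_mean, col_mean, all_mean, h = stats(rows, cols)
    new_cols = []
    for j in range(m):
        if j in cols:
            continue
        score = sum((A[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                    for i in rows) / len(rows)
        if score <= h + tol:
            new_cols.append(j)
    cols = sorted(cols + new_cols)

    # Row addition: recompute the statistics, then score each outside row.
    row_mean, col_mean, all_mean, h = stats(rows, cols)
    new_rows = []
    for i in range(n):
        if i in rows:
            continue
        score = sum((A[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                    for j in cols) / len(cols)
        if score <= h + tol:
            new_rows.append(i)
    rows = sorted(rows + new_rows)
    return rows, cols


C = [[0, 1, 2, 50],
     [10, 11, 12, -7],
     [20, complex(21, 210), 22, 3],  # one entry was shielded earlier
     [100, -100, 0, 9]]
rows, cols = node_addition(C, rows=[0, 1], cols=[0, 1])
print("extended rows:", rows, "extended columns:", cols)
```

In this toy case the row holding the shielded entry is still added, because its real parts fit the additive pattern; this is how an overlapping bi-cluster can be recovered.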

4 Experimental studies

The following experiments are designed to test the performance of the proposed approach. We use the well-known yeast gene expression dataset (http://arep.med.harvard.edu/biclustering/yeast.matrix), which is one of the most commonly used datasets in bi-clustering (Cheng and Church, 2000). Two further gene datasets, west-2001 and laiho-2007 (Li and Wong, 2019), are also used.

The methods try to discover 50 bi-clusters (co-expressed) with an MSR score not larger than 300. The MSR score and the sizes of the bi-clusters (co-expressed genes), which are commonly used to evaluate (and validate) the performance of bi-clustering algorithms, are used in the experiments.

For each dataset, the algorithms are run 10 times, and the means and standard deviations of the experimental results are recorded. The experimental results are plotted in Figures 1–4. It is clear that the proposed algorithm is effective in discovering high-quality bi-clusters with low MSR scores. In addition, the bi-clusters obtained by the proposed method are larger than those discovered by the CC method. Thus, the experimental results are in agreement with the theoretical analysis, and compared with the well-known CC method, the proposed method exhibits visible advantages.

FIGURE 1. Results of the MSR scores of the yeast dataset.

FIGURE 2. Results of the sizes of the co-expressed genes.

FIGURE 3. Results of the MSR scores of the west-2001 dataset.

FIGURE 4. Results of the MSR scores of the west-2001 dataset.

5 Conclusion

In this study, we designed a novel enhanced adaptive bi-clustering algorithm. In the design, a shielding complex sub-matrix is built, in which two signals are set to detect the characteristics of the dataset and adaptively improve the bi-clusters. We conducted a theoretical analysis and a comprehensive suite of experiments, and both support the validity of the proposed method. The experimental results show that the proposed algorithm outperforms the CC algorithm in finding bi-clusters. On the one hand, the proposed algorithm can discover overlapping bi-clusters, and it reduces the MSR while increasing the sizes of the bi-clusters. On the other hand, the proposed algorithm is very stable. To the best of our knowledge, this is the first scheme of its kind to steadily improve the performance of bi-clustering. Our result opens a specific way for bi-clustering research and suggests a far-reaching question for further research: how can multiple bi-clusters be discovered during a single search process? This may open up a new direction for future research.

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author contributions

All the authors made significant contributions to the work. The idea was proposed by KX; XT and RZ simulated the algorithm; XY designed the experiments and polished the English; KX and XT wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62101400, 72101075, and 61971349, and in part by the National Key R&D Program of China (2021YFF0704600), the Natural Science Foundation of Anhui Province of China under Grant No. 2108085QG289, the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2020A1515111012, and the Fundamental Research Funds for the Central Universities under Grant No. JZ2022HGTB0286.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Abe, H., and Yadohisa, H. (2019). Orthogonal nonnegative matrix tri-factorization based on Tweedie distributions. Adv. Data Anal. Classif. 13 (4), 825–853. doi:10.1007/s11634-018-0348-8
2. Ben-Dor, A., Chor, B., Karp, R., and Yakhini, Z. (2003). Discovering local structure in gene expression data: The order-preserving submatrix problem. J. Comput. Biol. 10 (3), 373–384. doi:10.1089/10665270360688075
3. Bergmann, S., Ihmels, J., and Barkai, N. (2003). Iterative signature algorithm for the analysis of large-scale gene expression data. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 67 (1), 031902. doi:10.1103/PhysRevE.67.031902
4. Bozdag, D., Parvin, J. D., and Catalyurek, U. V. (2009). “A biclustering method to discover co-regulated genes using diverse gene expression datasets,” in Proceedings of the 1st International Conference on Bioinformatics and Computational Biology, Niagara Falls New York, August 2 - 4, 2010, 151–163.
5. Cheng, Y., and Church, G. M. (2000). Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 93–103. ISM.
6. Fan, N., Boyko, N., and Pardalos, P. M. (2010). “Recent advances of data biclustering with application in computational neuroscience,” in Computational Neuroscience. (New York, USA: Springer), 85–112.
7. Gu, J., and Liu, J. S. (2007). Bayesian biclustering of gene expression data. BMC Genomics 1, 25–28. doi:10.1186/1471-2164-9-S1-S4
8. Hanyu, E., Cui, Y., Pedrycz, W., and Li, Z. (2022). Fuzzy relational matrix factorization and its granular characterization in data description. IEEE Trans. Fuzzy Syst. 30 (3), 794–804. doi:10.1109/tfuzz.2020.3048577
9. Hochreiter, S., Bodenhofer, U., Heusel, M., Mayr, A., Mitterecker, A., Kasim, A., et al. (2010). FABIA: Factor analysis for bicluster acquisition. Bioinformatics 26 (12), 1520–1527. doi:10.1093/bioinformatics/btq227
10. Hu, H., Wang, H., Bai, Y., and Liu, M. (2019). Determination of endometrial carcinoma with gene expression based on optimized Elman neural network. Appl. Math. Comput. 341, 204–214. doi:10.1016/j.amc.2018.09.005
11. Huttenhower, C., Mutungu, K. T., Indik, N., Yang, W., Schroeder, M., Forman, J. J., et al. (2009). Detailing regulatory networks through large scale data integration. Bioinformatics 25 (24), 3267–3274. doi:10.1093/bioinformatics/btp588
12. Kluger, Y., Basri, R., Chang, T., and Gerstein, M. (2003). Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Res. 13 (4), 703–716. doi:10.1101/gr.648603
13. Lazzeroni, L., and Owen, A. (2000). Plaid models for gene expression data. Stat. Sin. 12 (1), 61–86.
14. Li, G., Ma, Q., Tang, H., Paterson, A. H., and Xu, Y. (2009). QUBIC: A qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Res. 37 (15), e101. doi:10.1093/nar/gkp491
15. Li, X., and Wong, K.-C. (2019). Evolutionary multiobjective clustering and its applications to patient stratification. IEEE Trans. Cybern. 49 (5), 1680–1693. doi:10.1109/TCYB.2018.2817480
16. Murali, T. M., and Kasif, S. (2003). “Extracting conserved gene expression motifs from gene expression data,” in Pacific Symp. Biocomputing, Lihue, Hawaii, January 3-7, 2003, 77–88.
17. Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Buhlmann, P., Gruissem, W., et al. (2006). A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22 (9), 1122–1129. doi:10.1093/bioinformatics/btl060
18. Tian, G., Yuan, G., Aleksandrov, A., Zhang, T., Li, Z., Fathollahi-Fard, A. M., et al. (2022). Recycling of spent lithium-ion batteries: A comprehensive review for identification of main challenges and future research trends. Sustain. Energy Technol. Assessments 53, 102447. doi:10.1016/j.seta.2022.102447
19. Xhafa, F., Caballe, S., and Barolli, L. (2011). “Using bi-clustering algorithm for analyzing online users activity in a virtual campus,” in International Conference on Intelligent Networking and Collaborative Systems, Thessaloniki, Greece, 24-26 November 2010, 214–221.
20. Xu, K. J., Pedrycz, W., Li, Z. W., and Nie, W. K. (2019). High-accuracy signal subspace separation algorithm based on Gaussian kernel soft partition. IEEE Trans. Ind. Electron. 66 (1), 491–499. doi:10.1109/tie.2018.2823666
21. Xu, K., Pedrycz, W., and Li, Z. (2022). Granular computing: An augmented scheme of degranulation through a modified partition matrix. Fuzzy Sets Syst. 440 (6), 131–148. doi:10.1016/j.fss.2021.06.001
22. Yang, J., Wang, H., and Wang, W. (2003). “Enhanced biclustering on expression data,” in Third IEEE Symposium on Bioinformatics and Bioengineering, Bethesda, MD, USA, 12-12 March 2003, 321–327.
23. Yang, J., Wang, H. X., Wang, W., and Yu, P. S. (2005). An improved biclustering method for analyzing gene expression profiles. Int. J. Artif. Intell. Tools 14 (5), 771–789. doi:10.1142/s0218213005002387

Keywords: bi-clustering, adaptive control, shielding factor, mean squared residue (MSR), co-expressed genes

Citation: Xu K, Tang X, Yin X and Zhang R (2022) An enhanced adaptive Bi-clustering algorithm through building a shielding complex sub-matrix. Front. Genet. 13:996941. doi: 10.3389/fgene.2022.996941

Received: 19 July 2022; Accepted: 21 September 2022; Published: 07 October 2022

Edited by: Yinghua Shen, Chongqing University, China

Reviewed by: Hengrong Ju, Nantong University, China; Hanyu E, University of Alberta, Canada; Yongming He, National University of Defense Technology, China

*Correspondence: Xiaoan Tang; Xukun Yin

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics
