An enhanced adaptive Bi-clustering algorithm through building a shielding complex sub-matrix

Bi-clustering refers to the task of finding sub-matrices (indexed by a group of columns and a group of rows) within a matrix of data such that the elements of each sub-matrix (data and features) are related in a particular way, for instance, that they are similar with respect to some metric. In this paper, we analyze the well-known Cheng and Church bi-clustering algorithm, which has been proved to be an effective tool for mining co-expressed genes, and summarize its limitations (such as interference of random numbers in the greedy strategy and ignoring overlapping bi-clusters). We then propose a novel enhancement of the adaptive bi-clustering algorithm, where a shielding complex sub-matrix is constructed to shield the bi-clusters that have been obtained and to discover the overlapping bi-clusters. In the shielding complex sub-matrix, the imaginary and the real parts are used to shield and extend the new bi-clusters, respectively, and to form a series of optimal bi-clusters. To ensure that the newly obtained bi-clusters have no effect on the bi-clusters already produced, a unit impulse signal is introduced to adaptively detect and shield the constructed bi-clusters. Meanwhile, to effectively shield the null data (zero-size data), another unit impulse signal is set for adaptive detection and shielding. In addition, we add a shielding factor to adjust the mean squared residue score of the rows (or columns) that contain the shielded data of the sub-matrix, to decide whether to retain them or not. We offer a thorough analysis of the developed scheme. The experimental results are in agreement with the theoretical analysis. The results obtained on a publicly available real microarray dataset show that the proposed method improves bi-clustering performance.


I. INTRODUCTION
The traditional clustering algorithms analyze only the properties of the data samples (the number and types of the attributes or variables), but do not focus on the components of data such as Data-table, Data-column, Data-relation, etc. This is a major issue affecting the clustering performance [1], especially when dealing with high-dimensional gene expression data, which motivates the development of the bi-clustering algorithm. Bi-clustering is able not only to reveal the global structure in data (as the traditional methods do), but also to discover local information (it can discover clusters in the feature space and the data space simultaneously). In addition, bi-clustering is considered to be an effective tool for dealing with high-dimensional data. Since its introduction, bi-clustering has received much attention and has become one of the focal points in the data mining community. Bi-clustering was first introduced by Hartigan [2], and has been further developed since Cheng and Church proposed a bi-clustering algorithm based on variance and applied it to gene expression data [3]. Their work remains the most important contribution to the bi-clustering field.
At present, bi-clustering is the most widely used technology in the field of bioinformatics. Traditional clustering methods treat similarity (distance-based measures) as a function of pairs of genes or pairs of conditions, which is not applicable in high-dimensional space; the bi-clustering model instead measures coherence within a subset of genes and conditions. This model may be particularly useful in disclosing the involvement of genes or conditions in multiple pathways, some of which can only be discovered under the dominance of more consistent ones [4]. The coherence score [3] is defined as a symmetric function of the genes and conditions involved, and therefore bi-clustering is a process of simultaneous grouping of genes and conditions. The so-called mean squared residue (MSR) [3] is employed and applied to expression data transformed by a logarithm and augmented by the additive inverse. Bi-clustering is also referred to in the literature as co-clustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining.
So far, many bi-clustering algorithms have been proposed; however, research on bi-clustering is still at an early stage. To enrich and develop the bi-clustering algorithms, in this paper we first present a brief analysis of the well-known Cheng and Church (CC) algorithm and elaborate on some related bi-clustering concepts. We also discuss the advantages and drawbacks of the CC algorithm. To address the drawbacks of the CC algorithm (such as interference of random numbers in the greedy strategy and ignoring overlapping bi-clusters), we design an improved adaptive bi-clustering algorithm that builds a shielding complex sub-matrix to adaptively shield the obtained bi-clusters and to discover new ones.
In the implementation of the new bi-clustering algorithm, we add an imaginary part to the constructed bi-clusters to increase their MSR, and we set a shielding factor to adjust the MSR increments of each row (or column) that contains elements of the constructed bi-clusters, so that these bi-clusters are effectively shielded. In addition, we set two unit impulse signals to adaptively detect the constructed bi-clusters and the null data (zero-size data) of the dataset, which avoids over-shielding and shielding failure and thus adaptively improves the bi-clusters. A detailed analysis and a comprehensive suite of experiments are provided. The experimental studies demonstrate that the proposed approach achieves better performance than the well-known CC algorithm. To the best of our knowledge, the idea of the proposed approach has not been considered in previous studies.
This paper is organized as follows. The CC algorithm and bi-clustering related ideas are briefly reviewed in Section II. A novel enhancement of the adaptive bi-clustering algorithm is detailed in Section III. Section IV includes the experimental setup and covers an analysis of the completed experiments. Section V contains some conclusions.

II. BI-CLUSTERING: DEFINITIONS AND PROBLEM FORMULATION
Consider a sample-feature expression matrix $A = (a_{ij})_{n \times m}$, where there are n rows representing n samples (data), m columns representing m features, and the entry $a_{ij}$ denotes the expression level of feature j in sample i. Let $S = \{s_1, s_2, \ldots, s_n\}$ be the sample set, where $s_i = (a_{i1}, a_{i2}, \ldots, a_{im})$ is called the feature vector of sample i. Similarly, the feature set is denoted by $F = \{f_1, f_2, \ldots, f_m\}$, with each vector $f_j = (a_{1j}, a_{2j}, \ldots, a_{nj})^T$ being a column vector. Thus, we have $A = (s_1, s_2, \ldots, s_n)^T = (f_1, f_2, \ldots, f_m)$. A bi-cluster is a sub-matrix of the data matrix, denoted by $B_k = (S_k, F_k)$, satisfying $S_k \subseteq S$ and $F_k \subseteq F$, and an entry $a_{ij}$ of $B_k$ is an intersection entry whose row (sample) and column (feature) belong to $S_k$ and $F_k$, respectively. Assume that there are K bi-clusters found in data matrix A; the set of bi-clusters is denoted by $\mathcal{B} = \{B_k : k = 1, 2, \ldots, K\}$. Usually, we use $S_k$ to denote a cluster of rows (samples) and $F_k$ a cluster of columns (features). Additionally, $|S_k|$ denotes the cardinality of $S_k$, i.e., the number of samples in bi-cluster $B_k$, while $|F_k|$ denotes the number of features. Clearly, we have $|S_k| \le n$ and $|F_k| \le m$.
Given a data matrix A, the bi-clustering problem is to design algorithms to find bi-clusters of it, i.e., a set of sub-matrices of A such that the samples (rows $S_k$) of each bi-cluster exhibit some similar behavior under the corresponding features (columns $F_k$). In other words, the bi-clustering problem is to identify a set of bi-clusters such that each bi-cluster satisfies some specific characteristics of homogeneity [17].
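For a bi-cluster $B_k = (S_k, F_k)$ with row set $S_k$ and column set $F_k$ (samples and features are identified with their row and column indices), several means based on the bi-cluster are defined. The mean of row i of $B_k$ is [18]

$$\mu^{(r)}_{ik} = \frac{1}{|F_k|} \sum_{j \in F_k} a_{ij}, \tag{1}$$

the mean of column j of $B_k$ is

$$\mu^{(c)}_{jk} = \frac{1}{|S_k|} \sum_{i \in S_k} a_{ij}, \tag{2}$$

and the mean of all the entries in $B_k$ is

$$\mu_k = \frac{1}{|S_k|\,|F_k|} \sum_{i \in S_k} \sum_{j \in F_k} a_{ij}. \tag{3}$$

The residue [3] of the entry $a_{ij}$ in bi-cluster $B_k$ is

$$r_{ij} = a_{ij} - \mu^{(r)}_{ik} - \mu^{(c)}_{jk} + \mu_k, \tag{4}$$

the variance of bi-cluster $B_k$ is

$$\mathrm{Var}(B_k) = \sum_{i \in S_k} \sum_{j \in F_k} \left(a_{ij} - \mu_k\right)^2, \tag{5}$$

and the mean squared residue (MSR) score [17] of the bi-cluster is

$$H_k = \frac{1}{|S_k|\,|F_k|} \sum_{i \in S_k} \sum_{j \in F_k} r_{ij}^2. \tag{6}$$

A sub-matrix $B_k$ is called a δ-bi-cluster if $H_k \le \delta$ for some $\delta \ge 0$. When calculating the MSR of a single row or a single column in data matrix A, (6) is converted to the following expressions:

$$d(i) = \frac{1}{|F_k|} \sum_{j \in F_k} r_{ij}^2, \tag{7}$$

$$d(j) = \frac{1}{|S_k|} \sum_{i \in S_k} r_{ij}^2. \tag{8}$$

Bi-clusters can thus be seen as sub-matrices of a matrix representing features of elements. It should be noted that bi-clusters need be neither exclusive nor exhaustive [17]. The well-known CC algorithm obtains an optimum bi-cluster (as large a δ-bi-cluster as possible) each time by adding and deleting rows or columns of the original data matrix to reduce the MSR of the whole matrix. The bi-clustering result produced each time is shielded by random numbers. However, these random numbers cause interference, which in turn impairs the discovery of high-quality bi-clusters [3,5]. As a result of the random numbers, some elements that do not satisfy the bi-clustering condition may be mistakenly clustered; in addition, overlapping bi-clusters are ignored.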
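As an illustration, the row mean, column mean, overall mean, residue and MSR defined above can be computed as in the following minimal sketch. It uses plain Python lists to stay dependency-free; the function names (`residues`, `msr`, `row_msr`, `col_msr`) and the small matrix `A` are hypothetical and chosen only for this example.

```python
def residues(A, rows, cols):
    """Residue a_ij - row_mean_i - col_mean_j + overall_mean on A[rows][cols]."""
    B = [[A[i][j] for j in cols] for i in rows]
    n, m = len(rows), len(cols)
    row_mean = [sum(r) / m for r in B]
    col_mean = [sum(B[r][c] for r in range(n)) / n for c in range(m)]
    all_mean = sum(row_mean) / n  # equals the mean of all entries
    return [[B[r][c] - row_mean[r] - col_mean[c] + all_mean
             for c in range(m)] for r in range(n)]


def msr(A, rows, cols):
    """Mean squared residue of the whole sub-matrix."""
    R = residues(A, rows, cols)
    return sum(x * x for row in R for x in row) / (len(rows) * len(cols))


def row_msr(A, rows, cols):
    """Mean squared residue of each row of the sub-matrix."""
    R = residues(A, rows, cols)
    return [sum(x * x for x in row) / len(cols) for row in R]


def col_msr(A, rows, cols):
    """Mean squared residue of each column of the sub-matrix."""
    R = residues(A, rows, cols)
    return [sum(row[c] ** 2 for row in R) / len(rows) for c in range(len(cols))]


# Hypothetical data: the first two rows differ by a constant shift,
# the third row breaks the pattern.
A = [[1, 2, 3],
     [2, 3, 4],
     [5, 1, 9]]

perfect = msr(A, [0, 1], [0, 1, 2])    # additive pattern, no residue
whole = msr(A, [0, 1, 2], [0, 1, 2])   # third row raises the score
print(perfect, whole)
```

Rows that differ only by an additive shift give a zero score, so the first two rows form a δ-bi-cluster for any δ ≥ 0, while including the third row increases the MSR.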
An example of the random number interference is shown in Table I. Assume that the values 0 and 90 in the shadowed entries were clustered in the previous iteration and replaced by the random numbers 6 and 9 (shown in brackets). This replacement makes a sub-matrix that is not a bi-cluster satisfy the condition of a bi-cluster, which shows the unreasonable aspect of the algorithm.
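The effect can be reproduced numerically with a small sketch. The 3×3 matrix below is hypothetical and is not the data of Table I; only the values 0 and 90 and their replacements 6 and 9 are taken from the text, and the threshold 300 is the one used in Section IV.

```python
def msr(A):
    """Mean squared residue of a full matrix given as a list of rows."""
    n, m = len(A), len(A[0])
    row_mean = [sum(r) / m for r in A]
    col_mean = [sum(A[i][j] for i in range(n)) / n for j in range(m)]
    all_mean = sum(row_mean) / n
    return sum((A[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
               for i in range(n) for j in range(m)) / (n * m)


DELTA = 300  # MSR threshold of a delta-bi-cluster

# Hypothetical sub-matrix; the entries 0 and 90 belong to a bi-cluster
# found in an earlier iteration.
original = [[4, 5, 0],
            [7, 8, 90],
            [5, 6, 7]]

# Random masking as in the CC algorithm: 0 -> 6 and 90 -> 9.
masked = [[4, 5, 6],
          [7, 8, 9],
          [5, 6, 7]]

print(msr(original) <= DELTA)  # False: the true data are not coherent
print(msr(masked) <= DELTA)    # True: the random values fake coherence
```

With the true values the sub-matrix fails the δ test, but after masking it passes, so a later iteration could report it as a bi-cluster although the underlying data do not support it.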

III. COMPLEX SUB-MATRIX SHIELDING MODEL
To address the issue described above, this paper presents a new complex sub-matrix shielding model. The proposed approach focuses on improving the iteration of the CC algorithm, which must use random numbers to replace the bi-clustering results. In contrast, the new complex sub-matrix shielding lets the algorithm complete the bi-clustering while avoiding the interference of random numbers in the greedy strategy.

A. Construction of a shielding complex sub-matrix
Suppose that is a bi-cluster of the kth bi-clustering searching. To find a (k+1)th new bi-cluster, the first k bi-clusters that have been obtained should be shielded. When searching for the (k+1)th bi-cluster based on the first k shielded bi-clusters, on the one hand, it is desired that the first k bi-clusters be temporarily ignored (to find a new bi-cluster); on the other hand, it is desired that the (k+1)th bi-cluster contain the elements that have been clustered in the first k shielded bi-clusters and that satisfy the prespecified condition of bi-clustering. To meet the above requirements, a shielding complex sub-matrix is built as (9) where is the imaginary unit, is a shielding factor whose role is to adjust the MSR increments of the row (or column) to be shielded. The MSRs increase with the increase of the shielding factor, but usually we just need them to exceed a previously set threshold value. denotes the Schur product (Hadamard product) [19] of matrices, and is the unit pulse response function. It is necessary to point out that in the process of the shielding, the action of in (9) is to avoid being shielded a second time, and that of the latter part of (9) is to avoid unsuccessful shielding when encountering 0-value data.
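The following sketch illustrates the idea of shielding through an imaginary part. It is one possible form chosen for illustration, not a reproduction of (9): the function `shield`, the parameter `lam` (standing in for the shielding factor) and the way the two impulse tests are combined are all assumptions. The real part is left untouched; the imaginary part is the shielding factor times the entry value, one impulse test skips entries that already carry an imaginary part, and a second impulse test gives zero-valued entries a non-zero imaginary part.

```python
def shield(Z, rows, cols, lam):
    """Return a copy of Z in which the entries of a found bi-cluster
    (rows x cols) receive an imaginary part (illustrative form only)."""
    out = [[complex(z) for z in row] for row in Z]
    for i in rows:
        for j in cols:
            z = out[i][j]
            # Impulse test 1 (assumed): shield only entries not shielded yet.
            not_shielded = 1 if z.imag == 0 else 0
            # Impulse test 2 (assumed): a zero entry would otherwise get a
            # zero imaginary part, i.e. the shielding would fail.
            is_zero = 1 if z.real == 0 else 0
            out[i][j] = z + 1j * lam * not_shielded * (z.real + is_zero)
    return out


A = [[4, 5, 0],
     [7, 8, 90],
     [5, 6, 7]]

# Shield a hypothetical bi-cluster on rows {0, 1} and columns {1, 2}.
Z1 = shield(A, [0, 1], [1, 2], lam=10)
# Shielding the same block again leaves the matrix unchanged.
Z2 = shield(Z1, [0, 1], [1, 2], lam=10)

print(Z1[0])     # [(4+0j), (5+50j), 10j]
print(Z2 == Z1)  # True
```

In this form the original data remain available in the real part, so shielded elements can still join later (overlapping) bi-clusters, whereas random masking overwrites them.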

B. Implementation
By using the shielding complex sub-matrix , a series of more satisfactory bi-clusters will be discovered. The specific process is as follows. First, the first bi-cluster is found with the CC algorithm and shielded by using the proposed approach. When searching for the kth new bi-cluster, the following two expressions are employed to calculate the MSR of a single row or a single column in data matrix and to decide whether to delete it or not.
(10) (11) It is easy to see that and , due to the effect of the shielding complex sub-matrix. When the ith row (or jth column) contains elements in the first k-1 bi-clusters, then Meanwhile, with the aid of the shielding factor , the first k-1 bi-clusters will be ignored in the search for the kth bi-cluster.
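To make the effect on the row scores concrete, the sketch below assumes that the MSR of a complex matrix is taken as the mean squared modulus of the complex residue; this reading is an assumption, and the actual expressions (10) and (11) may differ. The function `row_scores` and the example matrices are hypothetical. Because the squared modulus is the sum of the squared real and imaginary parts, the score of a row can only grow when an imaginary part is added.

```python
def row_scores(Z, rows, cols):
    """Mean squared modulus of the residue for every row of Z[rows][cols]."""
    B = [[complex(Z[i][j]) for j in cols] for i in rows]
    n, m = len(rows), len(cols)
    row_mean = [sum(r) / m for r in B]
    col_mean = [sum(B[r][c] for r in range(n)) / n for c in range(m)]
    all_mean = sum(row_mean) / n
    return [sum(abs(B[r][c] - row_mean[r] - col_mean[c] + all_mean) ** 2
                for c in range(m)) / m
            for r in range(n)]


# Real data and the same data with two shielded entries in row 0
# (imaginary parts chosen by hand for the example).
A = [[4, 5, 0],
     [7, 8, 90],
     [5, 6, 7]]
Z = [[4, 5 + 50j, 10j],
     [7, 8, 90],
     [5, 6, 7]]

rows = cols = [0, 1, 2]
plain = row_scores(A, rows, cols)
shielded = row_scores(Z, rows, cols)
increase = [s - p for s, p in zip(shielded, plain)]
print(increase)
```

The row that contains the shielded entries receives the largest increase, and a larger shielding factor would enlarge it further, which is what pushes such rows over the deletion threshold.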
To delete the shielded (already obtained) bi-clusters quickly, a new parameter is introduced. If the MSRs of the rows and columns satisfy the following equations, they will be deleted collectively.
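A collective deletion step of this kind can be sketched as follows. The sketch follows the multiple node deletion rule of the CC algorithm, in which every row whose score exceeds a multiple `alpha` of the overall MSR is removed at once; treating the new parameter as such a multiplier, and using the squared modulus for complex entries, are assumptions made for this example, and the names `residues` and `delete_rows` are hypothetical.

```python
def residues(Z, rows, cols):
    """Complex residues of the sub-matrix Z[rows][cols]."""
    B = [[complex(Z[i][j]) for j in cols] for i in rows]
    n, m = len(rows), len(cols)
    row_mean = [sum(r) / m for r in B]
    col_mean = [sum(B[r][c] for r in range(n)) / n for c in range(m)]
    all_mean = sum(row_mean) / n
    return [[B[r][c] - row_mean[r] - col_mean[c] + all_mean
             for c in range(m)] for r in range(n)]


def delete_rows(Z, rows, cols, alpha):
    """Keep only the rows whose score does not exceed alpha times the MSR."""
    R = residues(Z, rows, cols)
    m = len(cols)
    scores = [sum(abs(x) ** 2 for x in row) / m for row in R]
    H = sum(scores) / len(scores)  # MSR of the whole sub-matrix
    return [i for i, d in zip(rows, scores) if d <= alpha * H]


# Row 0 holds two shielded entries; the real parts of all rows are coherent.
Z = [[1, 2 + 20j, 3 + 30j],
     [2, 3, 4],
     [3, 4, 5],
     [4, 5, 6]]

kept = delete_rows(Z, [0, 1, 2, 3], [0, 1, 2], alpha=1.2)
print(kept)  # [1, 2, 3]
```

The shielded row is removed in a single pass even though its real part fits the pattern, which is the intended temporary exclusion of data that already belong to an earlier bi-cluster. Columns can be treated in the same way.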
After finishing the shielding, a new (kth) bi-cluster is obtained. To make the new bi-cluster contain all the elements in the dataset that satisfy the prespecified condition, the next step is to add the rows and columns that satisfy this condition. To avoid the disturbance coming from the shielded data, the rows and columns can be added without violating the requirements, i.e., (15) (16) In the processes of deletion and addition of the rows and columns, the MSR is constantly recalculated and improved until the new optimal bi-cluster is obtained. The proposed method thus avoids the random interference that impairs the discovery of high-quality bi-clusters, and it can discover overlapping bi-clusters.
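The addition step can be sketched in the same spirit. The sketch follows the node addition rule of the CC algorithm (a row is added when its score with respect to the current bi-cluster does not exceed the current MSR) and, as an assumption based on the statement that the real part is used to extend the bi-clusters, evaluates the scores on the real parts only. The function `add_rows` and the data are hypothetical, and the exact conditions (15) and (16) may differ.

```python
def add_rows(Z, rows, cols):
    """Add every outside row whose real-part score fits the current bi-cluster."""
    A = [[complex(z).real for z in row] for row in Z]  # real parts only
    n, m = len(rows), len(cols)
    col_mean = [sum(A[i][j] for i in rows) / n for j in cols]
    all_mean = sum(col_mean) / m

    def score(i):
        # Mean squared residue of row i relative to the current bi-cluster.
        row_mean = sum(A[i][j] for j in cols) / m
        return sum((A[i][j] - row_mean - cm + all_mean) ** 2
                   for j, cm in zip(cols, col_mean)) / m

    H = sum(score(i) for i in rows) / n  # MSR of the current bi-cluster
    added = [i for i in range(len(A))
             if i not in rows and score(i) <= H + 1e-12]
    return sorted(rows + added)


# Row 0 was shielded earlier, row 4 does not follow the pattern.
Z = [[1, 2 + 20j, 3 + 30j],
     [2, 3, 4],
     [3, 4, 5],
     [4, 5, 6],
     [9, 0, 4]]

extended = add_rows(Z, [1, 2, 3], [0, 1, 2])
print(extended)  # [0, 1, 2, 3]
```

Row 0 is added although two of its entries belong to an earlier bi-cluster, so the new bi-cluster overlaps the old one, while the incoherent row 4 stays outside.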

IV. EXPERIMENTAL STUDIES
The following experiments are designed to test the performance of the proposed approach. They use the well-known yeast gene expression dataset (http://arep.med.harvard.edu/biclustering/yeast.matrix), which is one of the most commonly used datasets in bi-clustering [3]. The methods try to discover 50 bi-clusters (co-expressed genes) with an MSR score not larger than 300. The MSR score and the sizes of the bi-clusters (co-expressed genes), which are commonly used to estimate (and validate) the performance of bi-clustering algorithms, are used in the experiments. For each dataset the algorithms are repeated 10 times, and the means and the standard deviations of the experimental results are recorded. The experimental results are plotted in Figs. 1 and 2. It is clear that the proposed algorithm is effective in discovering high-quality bi-clusters with low MSR scores. In addition, the bi-clusters obtained by the proposed method are larger than those discovered by the CC method. Thus, the experimental results are in agreement with the theoretical analysis, and compared with the well-known CC method, the proposed method exhibits visible advantages.

V. CONCLUSIONS
In this research, we designed a novel enhanced adaptive bi-clustering algorithm. During the design process, a shielding complex sub-matrix is built, in which two signals are set to detect the characteristics of the dataset and adaptively improve the bi-clusters. We conduct a theoretical analysis and offer a comprehensive suite of experiments. Both the theoretical and the experimental results are presented to verify the validity of the proposed method. The experimental results show that the proposed algorithm outperforms the CC algorithm in finding bi-clusters. On the one hand, the proposed algorithm can discover overlapping bi-clusters, and it reduces the MSR while increasing the sizes of the bi-clusters. On the other hand, the proposed algorithm is very stable. To the best of our knowledge, this is the first scheme of this kind that steadily improves the performance of bi-clustering. Our result opens a specific way for bi-clustering research and suggests a far-reaching question for further research: how can multiple bi-clusters be discovered during a search process? This may open up a new direction of future research pursuits.

Fig. 2. Results of the sizes of the co-expressed genes.

Table I: An example of the random number interference