Optimal ranking and directional signature classification using the integral strategy of multi-objective optimization-based association rule mining of multi-omics data

Introduction: Association rule mining (ARM) is a powerful tool for exploring the informative relationships among multiple items (genes) in any dataset. The main problem of ARM is that it generates many rules containing different rule-informative values, which becomes a challenge for the user to choose the effective rules. In addition, few works have been performed on the integration of multiple biological datasets and variable cutoff values in ARM. Methods: To solve all these problems, in this article, we developed a novel framework MOOVARM (multi-objective optimized variable cutoff-based association rule mining) for multi-omics profiles. Results: In this regard, we identified the positive ideal solution (PIS), which maximized the profit and minimized the loss, and negative ideal solution (NIS), which minimized the profit and maximized the loss for all gene sets (item sets), belonging to each extracted rule. Thereafter, we computed the distance (d +) from PIS and distance (d −) from NIS for each gene set or product. These two distances played an important role in determining the optimized associations among various pairs of genes in the multi-omics dataset. We then globally estimated the relative closeness to PIS for ranking the gene sets. When the relative closeness score of the rule is greater than or equal to the pre-defined threshold value, the rule can be considered a final resultant rule. Moreover, MOOVARM evaluated the relative score of the rule based on the status of all genes instead of individual genes. Conclusions: MOOVARM produced the final rank of the extracted (multi-objective optimized) rules of correlated genes which had better disease classification than the state-of-the-art algorithms on gene signature identification.


Introduction
A microarray (Bandyopadhyay et al., 2014;Bandyopadhyay and Mallik, 2016) has been widely used to measure a large number of genes for determining differences between two groups (e.g., cases versus control samples), including gene expression profile, methylation, and genotype-based association studies. Methylation of cytosine (Navarro et al., 2012) changes the structure of DNA by introducing a methyl group (−CH3) at the carbon5 position of cytosine without altering the underlying DNA sequences. Methylation changes the gene expression, which pathologically leads to cancer. In general, methylation decreases the gene expression level. At present, the association rule mining (ARM) method plays a vital role in generating the significant relationships between two genes (items) in the research field of bioinformatics and biomedical sciences (Mallik, 2013). The rule representing the format is as follows: {G1+, G2−, G4+ 0 G3−, G5+ }, where G1, G2, and G4 are the cause variables (antecedent) and G3 and G5 are effective variables (consequent). Of note, here, "+" symbolizes upregulation and "−" denotes downregulation. The aforementioned rule states that when the genes G1 is upregulated, G2 is downregulated, and G4 is upregulated concurrently, it is likely that G3 will be downregulated and G5 will be upregulated simultaneously. The support of a rule {A 0 B} (where A and B are items) is defined as the fraction of the number of transactions that contains A and B to the total number of transactions in the database, whereas the confidence of the rule is defined as the ratio of the support of the whole gene set (i.e., A and B) to the support of the antecedent/left-hand side (i.e., A). If the support of the gene set is higher than the user-defined minimum support, then the gene set is called frequent. A useful and fundamental association rule mining method, Apriori, was introduced by Agrawal et al. (1993) for identifying the association among the genes in the gene expression data or other similar kinds of data. Apriori and Eclat are the two benchmark algorithms used for mining the frequent item sets. The Apriori algorithm is introduced by Agrawal et al. (1993), while Eclat was developed by Zaki (2000) (Alves et al., 2010). The basic steps of the Apriori algorithm are as follows: i) obtaining the support (frequency) value for each individual item (feature), ii) filtering out non-frequent items by using a user-defined support threshold, iii) selecting frequent k-item sets (k = 1,2,×), iv) then converting all frequent item sets into association rules, and v) finally, estimating two more rule interestingness measures, viz., confidence and lift. However, the Eclat algorithm (Zaki, 2000) is somewhat different from Apriori (Agrawal et al., 1993). Apriori is basically a join-based algorithm, while Eclat is a tree-based algorithm. In other words, Apriori follows a breadth-first search (horizontal search), while Eclat follows a depth-first search (vertical search). Eclat is faster than Apriori. The Eclat algorithm requires only the support metric. Both the algorithms use static support and static confidence thresholds.
This basic ARM method has been updated and modified depending on the problem types to overcome various limitations by the researchers, such as Han et al. (2004), Creighton and Hanash (2003), Georgii et al. (2005), McIntosh and Chawla (2007), and Martinez et al. (2008). Those updated techniques help us manage the critical problems which arise in our daily life like medical diagnosis, marketing, and traveling. The genes of the gene sets have different types of priority. However, the basic rule mining algorithms treat all genes of the gene sets as belonging to the same class equivalence (quality). To overcome this challenge, the following researchers introduced weighted ARM methods for the classification of genes: Ramkumar (1998), Cai (1998), Wang (2000), Tao (2003), Yun and Leggett (2005), Tseng (2010), and Mallik et al. (2015). The weighted ARM methods were further modified and considered multiple weighted factors for solving transaction data-related problems (Liu et al., 1999;Su et al., 2008;Liu et al., 2011). Some clustering-and biclustering-based techniques were invented for studying gene expression data by Cheng and Church (2000), Pei (2003), Jiang et al. (2004), Madeira and Oliveira (2004), Thalamuthu et al. (2006), and Prelic et al. (2006). StatBicRM , another classification analysis, was also developed for this reason, in which Bhasin and Raghava (2004), Paziewska et al. (2014), Martella (2009), Xu (2009), andGeorgii et al. (2005) used a half-space concept for extracting quantitative association rules from numeric microarray datasets without using discretization. The limitation to this approach is that it was unable to find the complete set of significant rules from the microarray data. The GenMiner technique was proposed by Martinez et al. (2008) for finding association rules from a set of gene expression data and the online available terms that were linked to Gene Ontology (i.e., GO-terms). Bhadra et al. (2017), Mallik et al. (2013), and Bhadra et al. (2018) proposed a new concept where the cutoff (threshold) value was considered dynamical and altered for each gene set according to the quality/importance of the whole gene set rather than the quantification property. Some latest works are also based on the ARM/optimization method. Theilhaber et al. (2020) provided a tool in two-arm clinical studies. The methodology was based on the construction and optimization of a predictive multivariate gene signature that can predict the differential survival of patients undergoing anti-cancer therapies. Theilhaber et al. (2020) applied enhanced binary particle swarm optimization (EBPSO) in clinical transcriptomic cohorts to identify accurate, crisp, and significantly prognostic unique candidate signatures. The gene regulator within this signature yields biological insights into the relevant functions that were strongly correlated with their cancer type (Murphy et al., 2022). Nivedhitha et al. (2020) conducted survey research by categorizing different feature selection algorithms under supervised, unsupervised, and semi-supervised learning. This survey presented some latest tools of dimensionality reduction for tumor detection and also analyzed their performances and highlighted limitations and direction of future research to handle the high-dimension and less sample size data. On 2020, Ganguly and Mukherjee (2020) provided a modeling, simulation, and performance analysis study for an isolated hybrid power system (IHPS) which contained the solar thermal power plant, diesel engine generator (DEG), and wind turbine generator (WTG). To achieve better results for the studied IHPS model, authors applied the quasi-oppositional-based whale optimization algorithm and obtained better controller gain than other benchmark algorithms. In addition, there are some existing works of association rule mining which are based on fuzzy or rough theory (Sharmila and Vijayarani, 2019;Singh and Ganesh Wayal, 2012). However, the outcome rules are not good enough. Inclusion of the multi-objective optimization technique is an efficient step to improving the performance of association rule mining. After surveying the literature, we obtained some recently developed multi-objective optimization techniques that were presented by Mudi et al. (2019), Mudi et al. (2021a), Mudi et al. (2021b), Mudi et al., (2022), Ganguly et al. (2018), Ganguly et al. (2017), and Ganguly and Mukherjee (2020). In this article, we developed multi-objective optimized variable cutoff-based association rule mining (MOOVARM) for multi-omics profiles based on the minimum distance from the positive ideal solution (PIS) and that from the negative ideal solution (NIS). In this regard, we first identified (PIS) and (NIS) with respect to all gene sets. Therefore, we calculated the distance (d +) from PIS and distance (d −) from NIS for each product/item set. According to our proposed method, we calculated the relative closeness score value based on those two distances d + and d − for ranking the gene sets. If the relative closeness score of any rule was greater than or equal to the pre-defined cutoff value, the rule could be considered the final resultant rule. The proposed method calculated the relative closeness score globally instead of individual genes. Last, we made the ranking of the rules based on the relative score which had better disease classification performance than the state-of-the-art algorithms in disease diagnosis and therapeutic response.

Shortest distance-based cutoffs
The distance-based variable supports (denoted by D b VS) cutoff technique proposed by Mallik and Zhao (2017b) was introduced to obtain some attractive rules from multi-omics datasets by combining co-expression, co-methylation, and protein-protein interactions. The normalized combined correlation score was calculated by the integration of co-expression and co-methylation values (say CECM exm ) between the expression and methylation profiles containing a specific number of genes which are both differentially expressed and methylated. Basically, CECM exm measures the similarity of expression and methylation patterns between the two genes. The expression/methylation data of all the diseased and control values are denoted by a gene vector G. Let p and q be two genes, and CECM exm between p and q is denoted by CECM exm (p, q). This is computed as follows: where G ex (p) and G m (p) are two vectors consisting of expression and methylation values, respectively, across all samples for the pth gene. Pearson's correlation coefficient (Mallik, 2013) between the two groups is denoted by r (·, ·), where PCB(·, ·) processes the multiplication of Pearson's correlation score and the BioSIM score (Bandyopadhyay and Bhattacharyya, 2011) between any two genes. Here, the normalization technique is denoted by norm (·) which followed the min-max normalization concept. The lower and upper limits CECM exm (·, ·) were 0 and 1, respectively. Thereafter, the corresponding dissimilarity scores (say D simt ) were computed with the help of CECM exm scores, i.e., D simt (p, q) = (1 − CECM exm (p, q)). Thereafter, we determined protein-protein interactions from the Human Protein Resource Database (HPRD) and selected the interactions of the interactive protein-oriented genes among the set of genes which are differentially expressed and methylated. H is the protein-protein interaction matrix for the selected differentially expressed and methylated genes. In every gene pair (p, q), we multiplied the interaction value in H and the corresponding weighted distance value in D simt and subsequently calculated the resultant value, DijStP (p, q). The expression of DijStP (p, q) is given as DijStP (p, q) = (H (p, q)*D simt (p, q)). To compute DijStP (p,q) for all gene pairs (p, q), we selected the weighted distance for every gene pair that contained the interactions in their corresponding protein levels among each other. This resulted in a similarity and symmetric matrix. Using this matrix, we constructed a weighted transcriptomic gene regulatory network. Dijkstra's shortest path algorithm was then used on the gene regulatory network, and the relative weighted shortest distance matrix was generated (denoted by W e SD). According to the fundamental biological theory, the biological functions or biological pathways of two genes are the same if the distance between two genes is low. In this work, we utilized the shortest distance between every two genes belonging to the network. Thereafter, we calculated different distances among all gene pairs belonging to the W e SD matrix such as the maximum weighted shortest distance (W e SD mx ), minimum weighted shortest distance (W e SD mn ), and average weighted shortest distance (W e SD avg ) that were computed without considering the diagonal elements of the underlying matrix. The distance-based variable supports threshold within the gene set (GS) D b VS(GS) is defined as follows: where where ngp indicates the total number of possible gene pairs within GS and c1 and c2 are two constant terms. The value for c1 is set at 0.10, while c2 is a constant scaling factor whose value is set at 1.4826 for the assumption of a Gaussian distribution pattern to utilize any parametric test. Similarly, another two different types of thresholds, viz., distance-oriented variable confidence (denoted by D b VC) and distance-oriented variable lift (denoted by D b VL), are defined as follows: where where UVminC depicts the user-mentioned minimum confidence threshold, and where where UDminL represents the user-mentioned minimum lift threshold value, while c1 and c2 denote the constant values for scaling the fractional part.
Frontiers in Bioinformatics frontiersin.org 03 3 Multi-objective optimized association rule mining for the multi-omics dataset In this section, we developed a novel algorithm called MOOVARM for multi-omics profiles. Here, we integrated the gene expression, methylation, and protein-protein interaction data based on the idea of multi-objective optimization and weighted shortest distance to produce interesting rules for multi-omics profiles. The three basic steps of this algorithm are explained in the following sections. All abbreviations of Model parameters are discussed in Table 1.

Finding significant genes
Initially, matched genes and matched samples between gene expression and methylation data were found. Using the zero-mean normalization (Bandyopadhyay et al., 2014) technique, the gene expression/methylation data were normalized genewise. The empirical Bayes test using the limma package (Mallik and Zhao, 2017b;Mallik and Zhao, 2017a;Smyth, 2004) on both normalized expression and methylation data was executed for finding differentially expressed and methylated genes. limma was used because of its effectiveness on normalized gene expression/methylation data for any data distribution and any number of samples. Numerous pairs of genes in the normalized expression/methylation dataset comprised more than one probe. We applied limma for every gene probe individually and found the differentially expressed/methylated probes in terms of the significant Benjamini-Hochberg (BH) corrected p-value. The probes for which the Benjamini-Hochberg (BH) corrected p-value is less than the standard cutoff 0.05, the expression/methylation data are treated as differentially expressed/methylated gene probes. Then, we selected the probe of each gene for which the corresponding Benjamini-Hochberg (BH) corrected p-value generated using the limma tool was the lowest among all probes of each gene. The remaining probes of those genes were deducted from the corresponding dataset. Last, only those genes containing single probes were obtained which were both differentially expressed and methylated and whose respective proteins had interactions in the HPRD.

Discretization and post-discretization formats
Assuming that N referred to the set of genes which had both the differential expression and differential methylation profiles and which were involved in the protein-protein interaction, while n denotes the number of genes that are both differentially expressed and differentially methylated (N). Let M denote the set of matched samples between the expression and methylation data, while m denotes the number of Frontiers in Bioinformatics frontiersin.org 04 matched samples between the expression and methylation data (M). The normalized expression and methylation data matrices of the genes belonging to N are symbolized as DiE norm and DiM norm , respectively. The row of data matrices represents the gene, whereas the column indicates the transaction (sample). The binary representation of DiE norm and DiM norm is essential for the association rule mining. When DiE norm was normalized applying the zero-mean normalization technique, the rowwise (i.e., genewise) mean values became zero. If the value of expression data was greater than 0, the value was treated as upregulation (denoted as UpR), and thus, it was converted into 1 at the time of discretization, whereas any value which was less than 0 denoting downregulation (denoted by DwR) was turned into 0 during discretization. On the other hand, in the methylation data, the value that was greater than 0 indicating hyper-methylation (denoted as HperM) was converted into 0, whereas any value that was less than 0 indicating hypo-methylation (denoted as HpoM) was converted into 1. The aforementioned discretization procedure for the expression and methylation datasets is described in the following equations, respectively: where DDiE norm and DDiM norm indicate the discretized expression and methylation data matrices, respectively. The range of i and j values are 1-n and 1-m, respectively. Then, all the resultant discretized matrices are transposed as follows: and During post-discretization, the transposed discretized expression data (denoted by DDiTE norm in Eq. 7) and methylation data (denoted by DDiTM norm in Eq. 8) were merged into a single binary matrix (denoted by PDiD em ), with the size of [m × (2*n)]. The integration of the expression and methylation data produced four types of genes, viz., 1) upregulated and hypo-methylated genes, 2) upregulated and hypermethylated genes, 3) downregulated and hyper-methylated genes, and 4) downregulated and hypo-methylated genes. As gene expression and methylation are inversely proportional to each other, the first and third categories of gene sets (i.e., categories denoted by (i) and (iii)) were selected. As mentioned previously, the column length (gene area) of post-discretization is twice that of the column length (gene area) of the transposed discretized expression/methylation matrix, i.e., the size of PDiD em is [m × (2*n)].
The first half of the column vector of PDiD em is for type (i) upregulation and hypo-methylation, while the second half of the column vector of PDiD em is for type (iii) downregulation and hyper-methylation. Therefore, if the particular cell/house value (say cell at the jth sample and the ith gene) of the transposed discretized expression data matrix DDiTE norm is 1 (i.e., the so-called upregulated) and the same cell/house value of the transposed discretized methylation data matrix DDiTM norm is 1 (i.e., the so-called hypomethylation), it satisfies type (i) upregulation and hypo-methylation. We place a symbol "1" at the same cell/location of the first half of the post-discretized matrix (i.e., cell at the jth sample and the ith gene of PDiD em that are seen as the first joint condition of Eq. 9) that indicated type (i) both upregulated and hypo-methylated genes, and simultaneously we also place a symbol "0" at the same cell/location of the second half of the post-discretized matrix (i.e., cell at the jth sample and the ith gene of PDiD em that are seen as the first joint condition of Eq. 10) which is just the negation of "1." On the other hand, when both the transposed discretized scores for the same cell/ house were 0 (downregulation and hyper-methylation), the resultant post-discretized value for the second half of the post-discretized matrix would be 1 (see the second joint condition of Eq. 10), whereas the same value for the first half of the post-discretized matrix would automatically be the negation of 1 (viz., 0) (see the second joint condition of Eq. 9). In addition, for all the other combinations of the transposed discretized expression value and the transposed discretized methylation value [e.g., (0 and 1), (1 and 0)], the post-discretized values for both the first and second half would be 0. and Flowchart of the proposed rule mining method.
Frontiers in Bioinformatics frontiersin.org 05 However, we illustrated some examples of aforementioned computations in Figure 1. After the post-discretization step, we carried out transpose on the resultant post-discretized matrix for the next step.

Proposed association rule mining approach
The MOOVARM approach changed the traditional concept of using the static support threshold and a static confidence threshold which were generally applied to maintain these same thresholds across all item sets (i.e., gene sets). In our method, after postdiscretization, the association rule mining algorithm utilized the weighted shortest distance depending on multiple minimum support thresholds, multiple minimum confidence thresholds, and multiple minimum lift thresholds instead of the static support threshold and the static confidence threshold. Those multiple minimum thresholds were formed through the integration of gene expression, methylation, and protein-protein interaction profiles. The MOOVARM method worked on three different types of profiles: gene expression, methylation, and protein-protein interaction profiles concurrently instead of the individual dataset, like gene expression or DNA methylation or any other data, and produced multi-objective multi-prolific association rules. The six main steps of this MOOVARM method were as follows: 1) determination of frequency of every gene (item) contained in the post-discretized data; 2) computation of WV msc , WV mcc , and WV mlc scores; 3) formation of a gene set tree (GSTR); 4) generation of gene sets (item sets); 5) determination of D b VS, D b VC, and D b VL; and 6) production of the top significant relationdependent association rules.
In the first step, the binary matrix denoted by PDiD em was transformed into the transactional matrix TRDiM, which contained transactions associated with several genes IDs per transaction. The number of transactions that existed in TRDiM was denoted by TR n . Both the user-mentioned minimum support cutoff (UD min S) and usermentioned minimum confidence cutoff (UD min C) were to be described. UD min L (user-defined minimum lift cutoff) was kept at the value 1. Then, the frequency of every gene from the TRDiM dataset was determined. The frequency of the genes was greater than or equal to UD min S and were considered frequent genes. The frequent genes were arranged according to their frequency (from high to low order). In the second step, the generated cutoff WV msc (;) was computed for every pair of genes by combining H (;), CECM exm (;), and UD min S. Similarly, WV mcc (;) was evaluated by applying H (;), CECM exm (;), and  UD min C, and WV mlc (;) was computed by integrating H (;), CECM exm (;), and UD min L. In the next two phases, the GSTR was first obtained, and the important gene sets were then generated consecutively by following the same steps used in the typical FP-Growth association rule mining method. Next, in the fifth phase, the distance-based cutoff (i.e., D b VS, D b VC, and D b VL) scores were evaluated for every resultant gene set by the initially computed WV msc , WV mcc , and WV mlc matrices, successively. In the final phase, the support of every resultant gene set was first identified. The frequent gene sets (i.e., the gene sets whose support scores were greater than or equal to the respective individual D b VS threshold instead of the user-specified support threshold UD min S) were then identified. Next, the rules were obtained with respect to the frequent gene sets, and the confidences and lifts of the respective rules were computed. From the aforementioned set of rules, we chose only those rules for which both the confidence and lift scores were greater than or equal to their individual D b VC and D b VL cutoffs instead of UD min C and UD min L, respectively. A flowchart of the proposed MOOVARM rule mining method is illustrated in Figure 2.

Multi-criteria (multi-objective optimization) decision-making technique
Multi-criteria decision-making (MCDM) (Das et al., 2013) is a procedure used to select the best alternative of the set of finite alternatives with respect to multiple criteria. The MCDM technique has various applications in different fields such as economy, management, engineering, and medical diagnosis and helps the decision maker in selecting the best alternative in conflicting situations.   (Saaty, 1990), ELECTRE (Roy and Vanderpooten, 1996), and TOPSIS (Hwang and Yoon, 1981). Those methods are considered a decision-making procedure depending on the problem behaviors such as ranking, scoring, selecting, ordering, and surrounding environments such as available data type and size, processing/execution time, internal consistency, and logical relations. The TOPSIS method is the most suitable MCDM method under two cases: first, in case of problems related to the large number of criteria and alternatives; second, in case of availability of objective and quantity data. First, the TOPSIS method identifies the positive ideal alternative which has the extreme performance on each criterion. It also identifies the negative ideal alternative that produced the worst performance on each criterion. The positive ideal solution is the solution that maximizes the benefit criterion and minimizes the cost criterion, whereas the negative ideal solution maximizes the cost criterion and minimizes the benefit criterion. Next, the method finds the alternative, depending on the closest distance from the positive ideal solution and farthest distance from the negative ideal solution. The classical TOPSIS method was based on the information of the criteria that was collected from the expert opinions and quantitative data, whereas the generated solution was concentrated on evaluation, prioritization, and selection ( Figure 3).
The TOPSIS method calculates relative closeness and ranking through the following steps.
Step 1: Constructing the decision matrix.Let M (p ij ) nxm correspond to a decision matrix, where p ij indicates the choice value of the ith alternative and jth criteria.
Step 2: Determining the positive ideal solution (PIS + ) and negative ideal solution (NIS − ). The positive ideal solution (PIS + ) is denoted as follows: The negative ideal solution (NIS − ) is denoted as follows: where K is associated with the benefit criteria and L is associated with the cost criteria.
Step 3: Calculating the distance from the positive ideal solution and negative ideal solution. The distance of the ith alternative from the positive ideal solution DIS + i is then calculated accordingly as follows: while the distance of the ith alternative from the negative ideal solution DIS − i is then computed as follows: Step 4: Calculating the relative closeness to the positive ideal solution S i as follows: Step 5: Ranking the preference order, and selecting the alternative close to 1. Ranking of the alternatives depending on the S i score was made in descending order.
Notably, see Algorithm 1 for the major steps of the proposed algorithm MOOVARM (Figure 4).

FIGURE 4
Comparative study between our proposed method MOOVARM and other well-known related rule mining methods, Apriori and Eclat, in terms of mean classification accuracy of the generated rules obtained by (A) MOOVARM, (B) Apriori, and (C) Eclat, where "+" (red symbol) denotes the average classification accuracies and the bold line signifies their median.

Experimental datasets and results
In the experiment, integrative data consisting of DNA methylation and gene expression high-grade soft tissue sarcoma (HSTS) profiles (NCBI ID: GSE52392) (Renner et al., 2013;Chudasama et al., 2017) were utilized. At the initial stage, the methylation profile had 27,578 methylation probes, whereas the gene expression profile consisted of a total of 48,645 genes. Of note, we selected those samples which contained both the values that consisted of two categories of samples: (i) undifferentiated pleomorphic liposarcoma (UdPLs) (diseased samples) and (ii) normal tumor cell line (nrTCL) (i.e., control samples). The profile had 13 UdPLs samples and 13 nrTCL samples. Thereafter, we chose the matched genes (i.e., 12,438) that consisted of both methylation and expression values.
During the experiment, we first selected the genes that contained both DNA methylation and expression values. Since genes had more than one single probe for methylation and expression profiles, we preliminarily filtered out those probes containing the missing values. The limma R tool (Smyth, 2004) was then applied on each probe to know whether the probe was differentially expressed/methylated or not ( Figure 5).
The probe with the best significance (minimal corrected p-value) among all the probes for every gene was selected for the next analysis, whereas all the remaining probes for every gene were simply omitted from the methylation profile and the expression profile. Next, we conducted the intersection between the set of differentially methylated genes, the set of differentially expressed genes, and the set of genes whose respective proteins interacted with one another in the HPRD (Peri et al., 2003). Herein, we identified many such common genes. For each dataset, we constructed a protein-protein interaction (PPI) network where each protein denoted a gene in the respective intersected set of genes. Next, we calculated the degree of each node (gene) in the PPI network and rearranged the genes with respect to the high to low order of their degree values. Thereafter, we conducted the discretization and postdiscretization steps, respectively. Then, we used our proposed rule mining method, MOOVARM, and obtained multi-objective optimized variable support-based association rule mining. Table 2 shows the resultant rules. Notably, using the four measures (confidence, support, lift, and WeSD) of each rule, we optimized the rules through computing the relative score in optimization where confidence, support, and lift were used to maximize their values and WeSD was used to minimize their values. Then, we ranked the rules according to the relative score from high to low. For the HSTS dataset, the topmost rule {STAT3+, TP53-→ MAPK3+} states that if the gene STAT3 is both upregulated and hypo-methylated and the gene TP53 is both downregulated and hyper-methylated, then it is likely that the gene MAPK3 is upregulated and hypo-methylated. The confidence, support, lift, avg. WeSD, and relative score values of this rule are 0.01, 0.00269, 0.02275, 0.00543, and 0.36, respectively. Its previous rank before optimization was 4, but after optimization, it secured the first rank since it has the highest relative score among all the rules. The next top four ranked optimized rules are {STAT3+ → MAPK3+}, {JUN+, STAT3+, TP53-→ MAPK3+}, {ESR1+ → MAPK3+}, and {JUN+, STAT3+ → MAPK3+}, whose relative scores are 0.3596, 0.3588, 0.3565, and 0.355, respectively (in Table 2). All details of the different rule interestingness measures, WeSD, relative scores, and the ranks prior to and after optimization for these top genes generated by MOOVARM are described in Table 2.
In order to validate the significance of each of the top 10 rules (in Table 2) generated from DTFP-Growth, we used and executed the PAM classifier for comparing the classification performance of the different rules obtained from MOOVARM, Apriori, and Eclat rule mining methods toward the samples. For this purpose, we considered only the participating features (genes) from both sides of each individual rule of the top 10 rules and then ran 10-fold cross-validation on the data with the help of the PAM classifier with the default parameters to evaluate the importance of the combination of all genes participating in FIGURE 5 ROC curves of average accuracies and AUC in terms of sensitivity vs. specificity to classify the respective disease using the gene sets belonging to the topmost two rules by MOOVARM (of Table 3) from the respective dataset. (A) ROC curve of avg. accuracies to classify the respective disease using the gene sets belonging to the first rule, (B) AUC to classify the respective disease using the gene sets belonging to the first rule, (C) ROC curve of avg. accuracies to classify the respective disease using the gene sets belonging to the second rule, and (D) AUC to classify the respective disease using the gene sets belonging to the second rule. Of note, the blue curve denotes the empirical line, while the light black line indicates the smoothed control (X:Y = 1:1) line. Herein, "0.550 (0.900, 0.808)" denotes the optimal threshold point (=0.550) that is closest to the top left part of the plot (for rule 1), where the respective specificity and sensitivity are 0.900 and 0.808, respectively. Similarly, "0.536 (0.792, 0.877)" signifies the optimal threshold point (=0.536) that is closest to the top left part of the plot (for rule 2), where the respective specificity and sensitivity are 0.792 and 0.877, respectively.

Frontiers in Bioinformatics
frontiersin.org the rule. We repeated the entire procedure 10 times in every occasion. The obtained classification accuracies of the top 10 rules of the HSTS dataset are presented in Table 3. Similarly, accuracy values of the top 10 rules as per the Apriori and Eclat algorithms are displayed in Table 4 and Table 5, respectively. The graphical plot for the top 10 rules classification accuracy measures obtained by three methods, namely, MOOVARM, Apriori, and Eclat is presented in Figure 3. According to the top 10 classification accuracy metrics, it is clear that the overall accuracy of the proposed method MOOVARM is higher than the other two methods. We also computed the average values of the classification accuracies of the top 10 rules which were 85.08% (±0.03), 81.96% (±0.02), and 81.65% (±0.02) for the methods, MOOVARM, Apriori and Eclat, respectively (in Figure 4). In addition, the AUC values in terms of sensitivity vs. specificity for classifying the respective disease using the gene sets belonging to the topmost two rules by MOOVARM (from Table 3) from the respective dataset were found as 0.926 and 0.920, respectively (in Figure 5). Moreover, in summary, the average AUC values of the top 10 rules of the same dataset using those methods (MOOVARM, Apriori, and Eclat) were 0.909, 0.861, and 0.859, respectively. Herein, we used an open-source R package "pROC" (Robin et al., 2011) to illustrate the ROC and AUC curves as depicted in Figure 5. Furthermore, we performed the KEGG pathway and Gene Ontology (GO) analyses using the participating genes belonging to the top rules generated by MOOVARM, and then we identified the GO-terms with significant p-values. Table 6 and Table 7 summarize enrichment results for GO terms: molecular function (GO: MF) and cellular component (GO: CC), respectively, containing the resultant rules of MOOVARM, whereas Table 8 provides the enrichment result for Gene Ontology: biological processing (GO: BP) terms containing the resultant rules of MOOVARM. Table 9 describes the KEGG pathways having the resultant rules of MOOVARM. For example, hsa05161: hepatitis B KEGG pathway (p-value = 6.20E-06) contained five genes and eight rules out of the 24 evolved rules obtained from MOOVARM as shown in Table 2. These five genes are GRB2, JUN, MAPK3, TP53, and STAT3, while these eight rules are {STAT3+ → *"+" denotes upregulated and hypo-methylated genes; "−" represents downregulated and hyper-methylated genes. Association rule mining is related to the directional signature and its effects on disease discovery. The top association rule is STAT3+, TP53-→ MAPK3+, where STAT3 and TP53 play opposing roles in cellular pathway regulation. According to the literature survey, the **sd, standard deviation; "+"denotes upregulated and hypo-methylated genes; "−"represents downregulated and hyper-methylated genes. **sd, standard deviation; "+" denotes upregulated and hypo-methylated genes; "−" represents downregulated and hyper-methylated genes.
Frontiers in Bioinformatics frontiersin.org 11 activation function of STAT3 upregulates the survival pathway, whereas p53 activates the apoptotic pathway. STAT3 contributes to cancer cell proliferation and is associated with tumor malignancy. Similarly, TP53 is a well-known tumor suppressor gene. TP53 provides protection against DNA damage by inducing cell cycle arrest, DNA repair, or apoptosis. Mutation of p53 is often observed in cancer, especially in late events in malignant progression (Pham et al., 2020). The rule says where if antecedent genes (STAT3+, TP53-) are expressed/methylated in a specified manner together, then it is likely that consequent genes (MAPK3+) will also be expressed/methylated in a specified manner together. According to the literature survey, MARK3 regulates the proliferation and bone metastasis of human breast cancer cells (Du et al., 2020).
To illustrate the efficiency of our top 10 association rules, we conducted literature mining. Our first association rule {STAT3+, TP53-→ MAPK3+} says that if antecedent genes (STAT3+, TP53-) are expressed/methylated in a specified manner together, then it is *"+" denotes upregulated and hypo-methylated genes; "−" represents downregulated and hyper-methylated genes.   Table S1 for more details.
Frontiers in Bioinformatics frontiersin.org likely that consequent genes (MAPK3+) will also be expressed/ methylated in a specified manner together. According to Yang et al. (2021), Yang et al. (2022), andLiu et al. (2022), we obtained these three genes, namely, STAT3, TP53, and MAPK3 together as core target genes or highest degree hub genes related to several diseases like gastric cancer and type 2 diabetes mellitus. According to Zu et al. (2021), the three genes of the second association rule, JUN, TP53, and MAPK3 together, are related to gastric cancer. According to Santh Rani (2023), the three genes associated with rule 8, ESR1, MAPK3, and STAT3 together, are found as the top hub protein target genes in the PPI network analysis. Therefore, we can conclude that the genes associated with our association rules also jointly played several roles in recent literature studies. However, in the conceptual prospective, the MOOVARM approach modifies the traditional concept of using the static support threshold and a static confidence threshold which were generally applied to maintain these same thresholds across all item sets (i.e., gene sets) in the traditional algorithms like Apriori and Eclat. In MOOVARM, after post-discretization, the association rule mining algorithm utilizes the weighted shortest distance that depended on multiple minimum support thresholds, multiple minimum confidence thresholds, and multiple minimum lift thresholds instead of the static support threshold and the static confidence threshold. Those multiple/dynamic minimum thresholds were estimated by the integration of gene expression, methylation, and protein-protein interaction profiles and a weighted shortest distance-based scheme. The MOOVARM method works on all three different types of profiles: gene expression, methylation, and protein-protein interaction profiles instead of individual datasets like gene expression or DNA methylation or any other data, and produced multi-objective multi-prolific association rules. We also applied a multi-objective optimization technique, TOPSIS, which is named the multi-criteria decision-making technique. It is the procedure to select the best alternative of the set of finite alternatives with respect to multiple criteria. Herein, we ranked the association rules using multiple criteria (such as weighted support, weighted confidence, and weighted lift) and chose the top-ranked association rules through the multi-objective optimization technique. Thus, in a single word, the traditional rule mining algorithms like Apriori and Eclat use static user-defined threshold values and there is no optimized ranking of the estimated rules, while MOOVARM follows  dynamic thresholds and generates optimized association rules using multi-objective optimization on various objectives/criteria/rule interestingness values for each individual rule.
Furthermore, to explain this relationship between association rule mining and directional gene signature and its effects on disease discovery, we took the topmost optimized association rule {STAT3+,  (Table 3) for example, which is a directional gene signature. The rule states that if the antecedent genes express and methylate in a specified manner together (i.e., STAT3 is upregulated and hypo-methylated, and TP53 is downregulated and hyper-methylated, concurrently), then it is likely that consequent genes will be expressed and methylated in a specified manner together (i.e., MAPK3 will be upregulated and hypo-methylated). The combined effects of the rule create a directional three-gene signature since the total number of the participating genes in the associated rule is three here.

Conclusion
In this article, we proposed a unique associated rule mining method denoted as MOOVARM to find the most acceptable and appropriate rule for multi-omics profiles. To produce the interesting rules for multi-omics profiles, we used and integrated gene expression, methylation, and protein-protein interaction data based on the idea of multi-objective optimization and weighted shortest distance. For this purpose, we identified PIS and NIS with respect to all gene sets. PIS maximized the profit and minimized the loss. Alternatively, NIS maximized the loss and minimized the profit. Then, we calculated the distances d + and d − from PIS and NIS, respectively, for each gene set. Then, with the help of these two distances, we measured the relative closeness to PIS for ranking the gene sets. In this proposed method, we computed relative closeness scores globally instead of individual genes. Finally, MOOVARM generated the final rank of the extracted (multiobjective optimized) rules of correlated genes which may play a significant role in better disease classification performance than the state-of-the-art algorithms in disease discovery as well as therapeutic value. However, the limitation to this work is that MOOVARM works on the multi-omics RNAseq/microarray dataset consisting of DNA methylation, gene expression dataset for the same set of patients/ samples, and protein-protein interaction dataset in this framework. MOOVARM cannot work on the single-omics data. Furthermore, our method might not work on single-cell sequencing data without the usage of the matrix imputation prior to pre-filtering steps.
As a future work, we will include more datasets with the advanced added mechanism. In addition, we are interested to further use our proposed model for determining the directional optimized gene signatures in hub gene findings in the multimolecular regulation study (i.e., the regulation among the long non-coding RNAs, transcription factors, microRNAs, and target genes). Moreover, we also want to use this method in single-cell RNA sequencing and single-cell ATAC sequencing data to detect directional gene signatures for cancer detection.
Furthermore, we have checked some state-of-the-art works of association rule mining based on the fuzzy or rough set theory, but the outcome rules are not good enough, which means that the outcomes are not always beneficial using fuzzy/rough set-based association rule mining (Sharmila and Vijayarani, 2019;Singh and Ganesh Wayal, 2012). Comparatively, as our proposed method MOOVARM used the multi-objective optimization technique, the outcome rules are optimized and efficient enough. Therefore, only the inclusion of the fuzzy/rough set is not always beneficial. Thus, we assume that to improve the performance of the fuzzy/rough theory-based method, the inclusion of multi-objective optimization technique together with fuzzy/rough set-based rule mining will be an efficient step. Therefore, as our future work, we will extend our proposed framework by including both fuzzy/rough theory and multi-objective optimization to produce better and more effective association rule mining from multi-omics data.

Data availability statement
The original contributions presented in the study are publicly available. This data can be found here: https://www.ncbi.nlm.nih. gov/geo/query/acc.cgi?acc=GSE52392 Author contributions SM and SS conceptualized the method and design. SM, SS, and AS conducted the experiment and analyzed the results. SM, SS, and TB wrote the manuscript. SM and ZZ reviewed and edited the manuscript. All authors contributed to the article and approved the submitted version.
Funding ZZ was partially supported by the Cancer Prevention and Research Institute of Texas (CPRIT RP170668 and RP180734) (to ZZ). Publication costs were funded by ZZ's Professorship Fund. The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.