Robust Co-clustering to Discover Toxicogenomic Biomarkers and Their Regulatory Doses of Chemical Compounds Using Logistic Probabilistic Hidden Variable Model

Detection of biomarker genes and their regulatory doses of chemical compounds (DCCs) is one of the most important tasks in toxicogenomic studies as well as in drug design and development. There is an online computational platform “Toxygates” to identify biomarker genes and their regulatory DCCs by co-clustering approach. Nevertheless, the algorithm of that platform based on hierarchical clustering (HC) does not share gene-DCC two-way information simultaneously during co-clustering between genes and DCCs. Also it is sensitive to outlying observations. Thus, this platform may produce misleading results in some cases. The probabilistic hidden variable model (PHVM) is a more effective co-clustering approach that share two-way information simultaneously, but it is also sensitive to outlying observations. Therefore, in this paper we have proposed logistic probabilistic hidden variable model (LPHVM) for robust co-clustering between genes and DCCs, since gene expression data are often contaminated by outlying observations. We have investigated the performance of the proposed LPHVM co-clustering approach in a comparison with the conventional PHVM and Toxygates co-clustering approaches using simulated and real life TGP gene expression datasets, respectively. Simulation results show that the proposed method improved the performance over the conventional PHVM in presence of outliers; otherwise, it keeps equal performance. In the case of real life TGP data analysis, three DCCs (glibenclamide-low, perhexilline-low, and hexachlorobenzene-medium) for glutathione metabolism pathway dataset as well as two DCCs (acetaminophen-medium and methapyrilene-low) for PPAR signaling pathway dataset were incorrectly co-clustered by the Toxygates online platform, while only one DCC (hexachlorobenzene-low) for glutathione metabolism pathway was incorrectly co-clustered by the proposed LPHVM approach. Our findings from the real data analysis are also supported by the other findings in the literature.


INTRODUCTION
Toxicogenomics studies combines toxicology with several omics technologies (genomics, transcriptomics, proteomics, and metabolomics) to assess the risk of toxins (small molecules, peptides, or proteins) and chemical agents (drugs, gasoline, alcohol, pesticides, fuel oil, and cosmetics) in organism (NRC, 2007;Afshari et al., 2011). Through integration of these omics technologies with bioinformatics, toxicogenomics can be used to suggest the molecular mechanism of toxicity. This can reduce the cost in terms of time, labor, compound synthesis, and animal use which are main limitations of traditional toxicology work (Nuwaysir et al., 1999;Chen et al., 2012). In drug discovery and development, it is also necessary to assess the doses of chemical compounds (DCCs) toxicity administering these DCCs on individuals for measuring drugs' safety. This assessment can be done by toxicogenomic biomarkers those are upregulated or downregulated by the influence of a set of DCCs on individuals. These toxicogenomic biomarkers can be identified from the extensive gene-treatment expression dataset of target organs of individuals (Fielden et al., 2007;Uehara et al., 2008;Igarashi et al., 2015).
An online toxicogenomic data analysis platform "ToxDB" increases its predictive power based on the pathway level gene expression data (Hardt et al., 2016). It calculates the pathway scores for a chemical compound to identify significant biomarker genes using t-statistic from different pathways. Nevertheless, there is no facility in this platform to study another interesting problem of relationship between gene groups and DCCs groups asserted by Afshari et al. (2011). To address this problem another online platform "Toxygates" produces coclusters between genes and DCCs using hierarchical clustering (HC) (Nyström-Persson et al., 2017). But HC does not use twoway (gene-DCC) information simultaneously for co-clustering and it is sensitive to outlying observations (García-Escudero et al., 2010). Probabilistic hidden variable model (PHVM) has been developed for co-clustering between words and documents in a text mining problem (Hofmann, 2001). It uses two-way (row-column) information simultaneously during co-clustering. It was also successfully used in detecting hidden patters of biological profiling datasets (Joung et al., 2006;Bicego et al., 2010). Therefore, PHVM would be more effective approach than HC for co-clustering between genes and DCCs which is also supported by Joung et al. (2006). However, the PHVM algorithm is sensitive to outlying observations of gene expression. These outlying observations often occur in the gene expression dataset due to several steps involve in the data generating processes from hybridization to image analysis including scratches or dust on the surface, imperfections in the glass or imperfections in the array production (Gottardo et al., 2006;Upton et al., 2009). The outliers in the dataset may arise following Tukey-Huber contamination model (THCM; Agostinelli et al., 2015) or independent contamination model (ICM; Alqallaf et al., 2009). To overcome the robustness problems of conventional PHVM approach an attempt is made to propose logistic PHVM approach called as LPHVM for robust co-clustering between genes and DCCs to discover toxicogenomic biomarkers and their regulatory DCCs.

METHODS AND MATERIALS
Let us consider a toxicogenomic experimental design as described in Figure 1 that reflects Japanese Toxicogenomics Project (TGP) (Uehara, 2010) experiment for a single time point from which the toxygates (Nyström-Persson et al., 2013) data were collected. According to this design, gene expression data of both treatment and control group of animal samples are assumed to be generated. Then the fold change gene expression data for a single time point are computed from the treatment and control group of animals. It can measure the actual treatment (DCCs) effects on the genes. The fold change gene-expression value of a gene is defined as follows: where Y tlq is the fold change expression value of a gene for the qth (q = 1, 2, 3) sample under lth (l = Low, Middle, High) dose level of the tth (t = 1, 2, · · · , T) chemical compound, x tlq is the expression value of that gene of mentioned sample under the treatment group and x tlq ′ is the expression value of the same gene of the respective control sample. The effect of compound-dose combination or treatment/DCCs on the animal can be measured by Y tl. which is the average fold change value over the samples. In this paper, our objective is to robust co-clustering between genes and DCCs to discover toxicogenomic biomarkers and their regulatory DCCs from the fold change gene expression data using the proposed LPHVM.

Logistic Transformation of Fold Change Gene Expression Data
There are two ways to obtain robust estimates in presence of outlying observations (1) applying the robust methods (2) applying conventional methods on the modified dataset. The modification of the outlier contaminated dataset can be done deleting the outlying observations from the dataset or applying transformation on the dataset. Nonetheless, application of robust methods is complicated than using the conventional methods and deletion of outlying observations loses the information of the dataset. Hence, transformation is the better option for reducing outlier effects. Several authors (Box and Cox, 1964;Atkinson, 1982;Carroll, 1982) have been proved that transformation based robust methods outperform the conventional methods in reducing outlier effects. Thus, in this paper we consider logistic transformation for reducing outlier effects from the dataset. Before application of logistic transformation in the dataset we have taken average value (Y tl. ) of the fold change gene expression (Y tlq ) over the samples. We denote this average value by F G i , C j for the convenience of further use. In toxicogenomic data the expression profile of a subset of genes is highly correlated across a subset of conditions/treatments (Madeira and Oliveira, 2004;Bicego et al., 2010;Afshari et al., 2011). Interestingly, in the FIGURE 1 | A typical toxicogenomic experimental model for a single time point according to which gene expression data of the animal samples can be collected. In the figure there is a treatment group of animals and a control group of animals from which the fold change gene expression data can be obtained. gene expression or average fold change gene expression data there is a subset of genes which consists an upregulated and a downregulated clusters of genes which is highly correlated over a subset of DCCs. Therefore, we take absolute of the average fold change expression data to merge upregulated and downregulated clusters of genes into a single cluster/subset which are regulated by a subset of DCCs. Thereafter, the subset of genes forms a cocluster with its regulatory subset of DCCs. Since in this study, we consider all the biomarker and non-biomarker genes (genes are not affected by DCCs) in a pathway, the non-biomarker genes make another co-cluster together with non-regulatory DCCs (which do not affect the expression patterns of the genes in a specific pathway). The term co-cluster refers to the clustering of correlated row (genes) and column (DCCs) simultaneously. Now we apply logistic transformation on the ( F G i , C j ). If there are extreme values of F G i , C j the logistic transformation bring them within the range of 0-1. The other transformation methods like Box-Cox family of power transformation returns unbounded value for the extreme one. The observed n×m (gene-DCCs) fold change gene expression data matrix consisting of G = (G 1 , G 2 , . . . , G n ) genes and C = (C 1 , C 2 , . . . , C m ) DCCs is transformed using logistic function Similar to other works (Joung et al., 2006;Bicego et al., 2010) we assume the transformed value #(G i , C j ) as the count value for applying PHVM.

Number of Co-clusters (k) Prediction
As we see from the previous section "logistic transformation of fold change gene expression data" in toxicogenomic dataset there are hidden patters or co-clusters between genes and DCCs. Thus the number of clusters in the DCCs is equal to the number of clusters in the genes. Before applying PHVM it is required to know the number of co-clusters in the dataset. Therefore, in this study, we consider gap statistic (Tibshirani et al., 2001) the most popular and reliable algorithm for predicting the number of coclusters in the dataset. We use R function "fviz_nbclust" which required packages "factoextra" and "NbClust" (Malika et al., 2014) in order to predict number of co-clusters in the dataset via gap statistic. The detail algorithm of gap statistic is given in the Supplementary Material.

Robust Co-clustering Using Logistic Probabilistic Hidden Variable Model
In order to perform robust co-clustering between genes and DCCs we propose LPHVM approach. We define LPHVM as the application of PHVM on the count valued dataset which is obtained transforming absolute value of the fold change gene expression data by logistic transformation. For this standpoint, let us consider n × m gene-DCC count valued fold change gene expression data matrix consisting of G = (G 1 , G 2 , . . . , G n ) genes and C = (C 1 , C 2 , . . . , C m ) DCCs. LPHVM assumes that there prevail a certain number of unobserved hidden co-clusters or clusters underlying the gene-DCC count valued data matrix. We have estimated the number of co-clusters (k) in the dataset using gap statistic algorithm proposed by Tibshirani et al. (2001).
Introducing the hidden variable H = (H 1 , H 2 , . . . , H k ; r = 1, 2, . . . , k) the model quantifies the relationships Pr (G i |H r ), Pr(C j |H r ), and Pr G i , C j . The following are the probability definition and underlying assumptions of LPHVM accordingly: (1) Pr(H r ) is the probability of the rth co-cluster/cluster and k r = 1 Pr(H r ) = 1.
(2) Pr (G i |H r ) is the probability of the ith gene over the rth co-cluster and ∀H r ; is the probability of the jth DCC over the rth co-cluster and ∀H r ; m j = 1 Pr(C j |H r ) = 1. (4) Pr G i , C j is the joint probability of the ith gene and the jth DCC and n i = 1 m j = 1 Pr G i , C j = 1. Based on these definition and assumptions we obtain the joint probability of the gene-DCC observed pair (G i , C j ) considering hidden co-cluster H r as follows: Where, Applying Bayes' rule, the gene-DCC joint probability Pr G i , C j can be written as So as to estimate the parameters of the model, we need to maximize the total likelihood of the observations: We have applied the widely used Expectation-Maximization (EM) algorithm (Dempster et al., 1977) for estimating the maximum likelihood parameters of the proposed model. The EM algorithm starts with a random set of initial parameter values and iterates both the expectation (E-step) and maximization (M-step) step alternatively until a certain convergence criteria is satisfied. For this study, we have taken the values of initial parameters from dirichlet distribution and the stopping condition for EM estimation was set to <0.00001 (difference between two log likelihood of successive EM iteration). The E and M-step for the total likelihood can be given as follows: E-step: Once the parameters Pr (G i |H r ) and Pr(C j |H r ) have been estimated the genes and DCCs are clustered independently and co-clustered simultaneously. The gene (G i ) and DCC (C j ) will belong to co-cluster r if · · · , n; r = 1, 2, . . . , k and Pr C j H r = argmax r ′ Pr C j H r ′ ; j = 1, 2, . . . , m; r = 1, 2, . . . , k At the same time, if the gene (G i ) and the DCC (C j ) is grouped into a co-cluster (r) and this pair has the highest joint probability Pr G i , C j in that co-cluster (Figure 3).

Extraction of Toxicogenomic Biomarker Genes and Their Regulatory Doses of Chemical Compounds
As described in section "logistic transformation of gene expression data" the biomarker genes form co-clusters with their respective regulatory DCCs. Additionally, the nonbiomarker genes in a pathway form another co-cluster with non-regulatory DCCs. The LPHVM grouped the genes and DCCs simultaneously to their respective co-clusters. Zhu et al. (2005) has shown that the PHVM generated cooccurrence probabilities between correlated genes and chemical compounds which co-occur more frequently are higher than others. Biological relationship among these correlated genes and chemical compounds is also stronger. Therefore, we ranked the co-clusters based on the average LPHVM generated joint probability Pr G i , C j of gene-DCC within the co-clusters. The co-cluster having largest average joint probability contains most important biomarker genes and their regulatory DCCs and so on. The non-biomarker genes and non-regulatory DCCs in a dataset of a particular pathway are filtered in a co-cluster by LPHVM which have the smallest average joint probability. Except this co-cluster (co-cluster having smallest average joint probability) others are the co-clusters of biomarker genes and their regulatory DCCs and we define these co-clusters as biomarker co-clusters. We extract the toxicogenomic biomarker genes and their regulatory DCCs from these biomarker coclusters.

Up/Down-Regulated Biomarker Genes and Ranking of Doses of Chemical Compounds
The biomarker co-clusters consisting of biomarker genes and their regulatory DCCs are separated from the whole gene-DCC fold change data matrix which is discussed in the previous section. Within this co-clustering matrix a subset of biomarker genes may be upregulated corresponding to a subset of DCCs or downregulated corresponding to another subset of DCCs. These can be observed from the average fold change value (Y tl. ) of the co-clustering matrix. For example, a biomarker is define as up or down-regulated gene corresponding to the l th dose level of the t th chemical compounds if Y tl. > 0 or Y tl. < 0. Then this dose of chemical compound is said to be a regulatory DCC. Furthermore, for ranking the biomarker gene regulatory DCCs and their relationships with biomarker genes we have separated a sub matrix of biomarker genes and their regulatory DCCs (biomarker co-clusters) from the LPHVM generated gene-DCC joint probability Pr G i , C j matrix. The biomarker gene regulatory DCCs are ranked according to their average joint probability value over all biomarkers. We also rank the relationships among biomarker genes and their regulatory DCCs based on their joint probability. The ranking is made considering the formula: where Z j is the average joint probability of a DCC over the biomarkers or Z i,j is the joint probability of gene G i and DCC C j within the biomarker co-clusters.

Robustness of the Proposed Algorithm
We investigate the robustness of the proposed (LPHVM) algorithm and conventional PHVM using simulated datasets in absence and presence of outliers in the dataset based on the co-clustering /clustering error rate (ER). The genes and DCCs which are considered in one co-cluster/cluster in the simulated data are incorrectly assigned in another cocluster/cluster by the PHVM or LPHVM is considered as the miss co-clustered/clustered observations. The ER is the percentage of miss co-clustered/clustered observations which is calculated as: tolal miss co − clustered/clustered observations Total observations × 100

Computational Steps of LPHVM at a Glance
For detecting the toxicogenomic biomarker genes and their regulatory DCCs from the pathway level toxicogenomic dataset using LPHVM the following steps are to be considered for desired outputs: Step 1: Obtain gene expression data of treatment and control group of animals from the toxicogenomic experiment (Figure 1). Thereafter, compute fold change gene expression data using Equation (1) and then make it absolute.
Step 2: Apply logistic transformation on the dataset obtain from step 1 and assume the transformed value as count value.
Step 3: Estimate the number of co-clusters in the dataset which is obtained from step 2.
Step 4: Obtain robust co-clusters applying PHVM on the dataset obtained from step 2 using the number of co-clusters which we get from step 3.
Step 5: Calculate average joint probability of gene-DCC within the co-clusters and ranked them.
Step 6: Separate the co-clusters of biomarker genes and their regulatory DCCs from the co-cluster which have smallest average joint probability of gene-DCC.
Step 7: The genes and DCCs in the separated co-clusters which we get from step 6 are the toxicogenomic biomarkers and their regulatory DCCs.
Step 8: A biomarker gene obtains from step 7 may be upregulated corresponding to a DCC or downregulated corresponding to another DCC. A biomarker gene is said to be a up or down-regulated if its average fold change value corresponding to the lth dose level of the tth chemical compound is Y tl. > 0 or Y tl. < 0.

Simulated Datasets
To investigate the performance of the proposed LPHVM algorithm over the conventional PHVM we have simulated two sets of pathway level fold change gene expression data D 1 (n = 50 × m = 30) and D 2 (n = 50 × m = 60) imitating the toxicogenomic experiment given in Figure 1. Alongside these a pathway level dataset considering all time points of toxygates data are analyzed in the real data section. According to this experiment the fold change gene expression data (Y tlq ) have been generated using the following model: In the above model, +F11and +F21 represent the fold change expression values for upregulated genes under the DCCs group 1 and 2, respectively. Similarly, -F12, and -F22 represent the fold change expression values for the downregulated genes under the DCCs group 1 and 2, respectively. The 0s represent there is no compound effects on the respective gene group and N(0, σ 2 ) represents the random error term generated from normal distribution with mean 0 and variance σ 2 . Now if we take absolute value of the fold change gene expression data generated from the above data generating model (2), the fold change gene expression data +F11 and -F12 will merge into a single gene group-1 and make a co-cluster with their correlated DCCs group-1. Accordingly, +F21 and -F22 will merge into a single gene group-2 and make a co-cluster with their correlated DCCs group-2. The rest of the genes which are not regulated by any DCCs make a gene group-3 and the DCCs that do not regulate the expression pattern of genes make a DCCs group-3. The gene group-3 and DCCs group-3 together will make another co-cluster. These co-clusters can be retrieved by the LPHVM. In the simulated datasets n represents the number of genes (G i ; i = 1, 2, . . . , n) and m represents the number of DCCs (C j ; j = 1, 2, . . . , m). The data generation procedures for D 1 and D 2 datasets are given in the Supplementary Material.

Real Datasets
Several studies proved that molecular network or pathway based analysis improved the predictive power of gene expression data (Yildirimman et al., 2011;Hofree et al., 2013). Hardt et al. (2016) also analyzed the pathway level data from in vitro and in vivo experiment of human and rat model. Presently, pathway based analysis in cancer research has also advanced promptly since pathway level analysis able to produce more stable biomarkers (Kim, 2017). Since performance of any method cannot be measured without known dataset. Besides the simulation study, to investigate the performance of the proposed method compare to other existing methods we use two known datasets of glutathione metabolism and PPAR signaling pathways.

Simulation Study
We investigate the performance of our proposed method (LPHVM) by comparing it with the conventional PHVM using simulated datasets D 1 and D 2 in absence and presence of outlying observations for robust co-clustering between genes and DCCs to discover biomarker genes and their regulatory DCCs. The number of co-clusters/clusters for both of the simulated datasets is estimated as 3 via gap statistic as per the datasets are simulated ( Figure S1). For calculating average co-clustering and clustering ER we have simulated each of the datasets 100 times. Every time of data simulation outliers are introduced in the dataset using the data contamination methods THCM and ICM at the same time ER are calculated for PHVM and LPHVM applying these methods on the datasets. The description of the data contamination by outliers, THCM and ICM are given in the Supplementary Material. Here it should be mentioned that in the case of THCM we have contaminated the simulated datasets by 5-50% rate of outliers. Similarly, in the case of ICM we have considered the range of probability of at least one component of the dataset is to be contaminated is 0.14-0.60 for D 1 dataset and 0.165-0.5962 for D 2 dataset. Figure 2 visualizes the average co-clustering ER between genes and DCCs for datasets D 1 and D 2 in absence and presence of outliers when the datasets are contaminated by outliers using the THCM. The Table 1 shows the average co-clustering ER between genes and DCCs in absence and presence of outliers for the simulated datasets D 1 and D 2 when the datasets are contaminated by outliers using ICM. Figure S2 and Table S1 in the Supplementary Material show the average clustering ER for gene and DCCs. It is observed from the mentioned figures and tables that in absence of outlier both of the proposed LPHVM and conventional PHVM approaches produce 0 ER. However, in presence of outlaying observations in the datasets the proposed approach produce far smaller ER than the conventional approach for both of the data contamination methods (THCM and ICM). The simulated data structure, structure of the data when row (gene) and column (DCCs) entities are randomly allocated and proposed method recovered structure of the data are given in the Supplementary Material (Figures S3, S4) for the datasets D 1 and D 2 . From these figures it is observed that the proposed algorithm is efficient for co-clustering between genes and DCCs of the pathway level fold change gene expression data. Figure S3C represents the dataset D 1 where all the genes and DCCs are grouped into three co-clusters (co-clusters 1, 2, and 3) and within co-cluster average joint probability of gene-DCC are given in Table 3. From where it is found that co-cluster-1 produces the smallest average joint probability of gene-DCC. Therefore, co-cluster 2 and 3 are the co-cluster of biomarker genes and their regulatory DCCs for the dataset D 1 . Similarly, for D 2 dataset co-cluster-3 produces the smallest average joint probability of gene-DCC (Table 3). Thus, co-cluster 1 and 2 are the biomarker co-clusters consisting of biomarker genes and their regulatory DCCs. The biomarker genes and their regulatory DCCs that we get from the biomarker coclusters of the simulated datasets are given in the Table S9.
Ranking of the biomarker regulatory DCCs are performed based on the biomarker gene-DCC joint probability matrix of biomarker co-clusters following the raking method described in sub section (Up/Down-regulated Biomarker Genes and Ranking of Doses of Chemical Compounds). The results are given in

Analysis of Glutathione Metabolism Pathway Data
Reactive oxygen species (ROS) are produced by living organisms as a normal product as a result of normal cellular metabolism. However, in presence of environmental pollutants or toxic chemical the production of ROS increased dramatically. It is highly reactive molecules and can damage cell structures such as carbohydrates, nucleic acids, lipids, and proteins and alter their functions. In the liver, glutathione is an important antioxidant; a major detoxification player which scavenges ROS. Thus imbalance in the abundance of ROS and glutathione/antioxidant in favor of ROS in the liver in presence of toxic chemicals/drugs causes' drug induced liver injury. Subsequently, gene expression changes occur simultaneously in response to the glutathione depletion or after the glutathione depletion (Gao et al., 2010;Birben et al., 2012;Nyström-Persson et al., 2013). In order to identify glutathione depletion related biomarker genes and their regulatory DCCs as well as to investigate the performance of the proposed LPHVM approach we use known fold change gene expression dataset of glutathione metabolism pathway. The fold change gene expression dataset consists 62 glutathione metabolism pathway genes, three glutathione depleting compounds (acetaminophen, methapyrilene, and nitrofurazone) and seven non-glutathione depleting compounds (erythromycin, hexachlorobenzene, isoniazid, gentamicin, glibenclamide, penicillamine, and perhexilline) (Nyström-Persson et al., 2013) along with the dose levels (low, middle, and high) for 24 h time point. The number of co-clusters which is required in applying LPHVM for this dataset is estimated as 2 ( Figure S1) via gap statistic. Figure 3A shows actual co-clusters in the glutathione metabolism pathway dataset. The genes and DCCs in the co-clusters are given in the Table S2. The average joint probabilities of gene-DCC within the co-clusters are 0.0006196723 and 0.0005331547 (Table 3), respectively for co-cluster-1 and co-cluster-2. Thus, Co-cluster-1 is the co-cluster of biomarker genes and glutathione depleting DCCs as it produces highest average joint probability. The biomarker genes and their regulatory DCCs in co-cluster-1 are given in Table 2. Additionally, the upregulated and downregulated biomarker genes corresponding to their regulatory DCCs are presented in the Figure S7A. For the same dataset the clustering results (heatmap) produced by toxygates are given in Figure S5 where glibenclamide-low, perhexilline-low, and hexachlorobenzene-medium dose level are incorrectly co-clustered whereas only hexachlorobenzenelow dose is incorrectly co-clustered by the proposed LPHVM approach according to Nyström-Persson et al. (2013). The biomarker genes in co-cluster-1 are functionally annotated by the online database DAVID (Huang da et al., 2009) and the results are given in the Tables S5, S6. The results show that the biomarker genes are significant in different biological functions or processes including glutathione metabolism pathway. Ranking of biomarker gene regulatory DCCs and top 20 gene-DCCs relationship along with their ranking score for glutathione metabolism pathway dataset are given in Tables 4,  5. From the tables it is observed that acetaminophen_High, nitrofurazone_High, and acetaminophen_Middle dose etc. are the most important glutathione depleting compounds and Gsta5, G6pd, Gpx2, Gsr, Mgst2, Gstp1, Gclc etc. are the most important biomarker genes. The detail ranked relationships results are given in Table S12. Besides this we have analyzed the same dataset considering all time points (3, 6, 9, and 24 h) by LPHVM to know about toxicity mechanism of the glutathione depleting compounds in other time points. The co-clusters produced by LPHVM are given in Figure 3C. The detail analyzed results of this dataset are given in Tables S4, S11. The proposed LPHVM identified 25 genes for the dataset at 24 h time points and 21 genes for the dataset where all time points are considered as biomarker in the glutathione metabolism pathway among which 18 are common.

Analysis of PPAR Signaling Pathway Data
Peroxisome proliferator-activated receptors (PPARs) PPAR ∝ , PPAR β/δ , and PPAR γ are transcription factors which are activated by ligand/drug. They regulate the expression of target genes in response to endogenous and exogenous ligands/chemicals. The PPAR ligands may produce toxicity via receptor-dependent and/or off-target-mediated mechanism(s) (Peraza et al., 2006). To discover PPARs regulated biomarker genes and their regulatory DCCs as well as to investigate the performance of the proposed LPHVM approach we consider known dataset consisting 88 PPAR signaling pathway genes and PPARs related gene regulatory compounds (WY-14643, clofibrate, gemfibrozil, benzbromarone, and aspirin) (Kiyosawa et al., 2006) and some other randomly selected compounds (cisplatin, diltiazem, methapyrilene, phenobarbital, and triazolam) along with their dose levels low, middle and high. The number of hidden co-clusters for this dataset is 2 estimated via gap statistic ( Figure S1). The LPHVM generates co-clusters of the PPAR signaling pathway dataset which is shown in Figure 3B. The average joint probabilities of gene-DCC within co-clusters are 0.0004471087 and 0.0003704091 where co-cluster-1 has the larger value than the co-cluster-2. Therefore, co-cluster-1 is the biomarker co-cluster of biomarker genes and their regulatory DCCs. The non-regulated genes and non-regulatory DCCs consist in co-cluster-2. The detail co-clustering results are given in the Table S3. The biomarker genes and their regulatory DCCs in co-cluster-1 are given in Table 2. Additionally, up/down-regulated biomarker genes corresponding to their regulatory DCCs are depicted in the Figure S7B For the same dataset the toxygates coclustering result using HC given in Figure S6 which shows that acetaminophen-middle and methapyrilene-low are incorrectly co-clustered whereas our proposed method properly co-cluster the DCCs (Table 2) according to the statement of Kiyosawa et al. (2006). Biomarker genes in co-cluster-1 are functionally annotated via DAVID the results are given in the Tables S7,  S8. WY14643-High, WY14643-Middle and clofibrate-High are the top most DCCs for regulating PPARs related biomarker genes for detail see Table 4. Top 20 (ranked) relationships between biomarker genes and their regulatory DCCs are given in Table 5 from where it is observed that Ehhadh, Cyp4a1, Acaa1a, Plin5 etc. are the most important biomarker genes and WY14643_High, clofibrate_High, benzbromarone_High, aspirin_High etc. are their important regulatory DCCs in PPAR signaling pathway. The detail results of these relationships are given in the Table S13.

DISCUSSION AND CONCLUSIONS
Identification of biomarker genes and their regulatory DCCs is one of the most important tasks in the toxicogenomics studies as well as in drug design and development as mentioned before. In this article, we have proposed a robust co-clustering approach based on logistic probabilistic hidden variable model (LPHVM) to detect important biomarker genes and their regulatory DCCs. The proposed LPHVM approach is robust against outlying gene expressions and more flexible and effective than the application of one-way classical clustering approaches (e.g., k-means, fuzzy, HC, etc.) for co-clustering. The proposed method produces robust results by using the logistic transformation of fold-change gene expression data into the conventional PHVM approach. The logistic transformation reduces unusual/outlying observations into the reasonable space without changing the original hidden patterns of genes and DCCs in the dataset. Thus the proposed LPHVM approach produces robust results.
We investigated the performance of the proposed LPHVM method in a comparison with the traditional PHVM and Toxygates online computational platform using simulated and real life TGP gene expression data, respectively. The simulation results showed that the proposed method improves the performance over the conventional PHVM in presence of outlying observations; otherwise, they perform equally. We also demonstrated the performance of the proposed method in a comparison with the online computational platform "Toxygates" using the real life pathway based fold change gene expression datasets collected from the "Toxygates" database. We observed that three DCCs (glibenclamide-low, perhexilline-low, and hexachlorobenzene-medium) for glutathione metabolism pathway dataset as well as two DCCs (acetaminophenmedium and methapyrilene-low) for PPAR signaling pathway dataset were incorrectly co-clustered by the Toxygates online platform, while only one DCC (hexachlorobenzene-low) for glutathione metabolism pathway was incorrectly co-clustered by the proposed LPHVM approach. Our findings from the real life data analysis are also supported by the other findings in the literature (Kiyosawa et al., 2006;Nyström-Persson et al., 2013). Thus the proposed LPHVM outperform over the classical PHVM and "Toxygates" online coputational platform to detect toxicogenomic biomarkers and their regulatory DCCs.

DATA AVAILABILITY
The demo data and R-code are provided at http://www.bbcba. org/softwares/rCoClust.zip.

AUTHOR CONTRIBUTIONS
MH and MM worked together to develop the algorithm. MH analyzed the data and drafted the manuscript. MM coordinated and supervised the project. MMR, AB, and MR attended at the meeting regarding this article as well as read and approved the final version of the manuscript.