Asymmetric author-topic model for knowledge discovering of big data in toxicogenomics

The advancement of high-throughput screening technologies facilitates the generation of massive amount of biological data, a big data phenomena in biomedical science. Yet, researchers still heavily rely on keyword search and/or literature review to navigate the databases and analyses are often done in rather small-scale. As a result, the rich information of a database has not been fully utilized, particularly for the information embedded in the interactive nature between data points that are largely ignored and buried. For the past 10 years, probabilistic topic modeling has been recognized as an effective machine learning algorithm to annotate the hidden thematic structure of massive collection of documents. The analogy between text corpus and large-scale genomic data enables the application of text mining tools, like probabilistic topic models, to explore hidden patterns of genomic data and to the extension of altered biological functions. In this paper, we developed a generalized probabilistic topic model to analyze a toxicogenomics dataset that consists of a large number of gene expression data from the rat livers treated with drugs in multiple dose and time-points. We discovered the hidden patterns in gene expression associated with the effect of doses and time-points of treatment. Finally, we illustrated the ability of our model to identify the evidence of potential reduction of animal use.


Introduction
As first introduced in 1999, toxicogenomics has emerged as a new subdiscipline of toxicology to take advantage of the newly available genomics profiling technique to gain an enhanced understanding of toxicity at the molecular level (Schena et al., 1995;Derisi et al., 1996;Nuwaysir et al., 1999). Since then, toxicogenomics significantly contributes to toxicological research and has provided an avenue for joining of multidisciplinary sciences including engineering and informatics into traditional toxicological research (Afshari et al., 2011). On the other hand, due to high computational cost and lack of advanced knowledge discovery as well as data mining tools, the pace of toxicogenomics has been tardy in recent years . First, a significant deterrent has been the enormous size of toxicogenomic datasets. With perhaps thousands of samples and tens of thousands of genes, the tremendous size of the toxicogenomic database often is cumbersome to handle, analyze and interpret. Gene selection (i.e., selecting relevant genes) and grouping genes (i.e., dealing only partial data at a time) has often been used to reduce complexity and make analyses more tractable (Rogers et al., 2005). However, both gene selection and grouping run the risk of losing valuable information contained in excluded data. Hence, a method that can efficiently handle the entire data without losing potentially valuable information is desirable. Second, any given biological phenomenon normally involves multiple biological pathways and mechanisms. Currently, some existing clustering algorithms like hierarchical cluster analysis and k-means only allow individuals to be assigned into mutually exclusive clusters. To capture the reality of biological phenomena in gene expression data, we need an algorithm to assign individuals into multiple clusters and to give each cluster a summary of most important genes. One might argue that some fuzzy clustering algorithms (Pal and Bezdek, 1995;Fu and Medico, 2007) are able to assign multiple clusters, yet very few existing algorithm provide much interpretability for clusters. In order to thoroughly utilize the rich interaction in a large database, we desire to organize our samples into meaningful clusters which can be directly linked by actual biological pathways.
The introduction of Latent Dirichlet Allocation (LDA) (Blei et al., 2003) along with its predecessor Probabilistic Latent Semantic Analysis (Hofmann, 1999) provide a new type of statistical models, namely, probabilistic topic models that have become a standard approach to analyze large collections of unstructured text documents. For a large corpus, probabilistic topic models assume the existence of latent variables (i.e., topics) that govern the likelihood of appearance for each word. Topics are defined as distributions over a fixed vocabulary. Based on the most likely words in each topic, we are able to interpret the meanings of topics. This intuition can be seamlessly transformed into genomics datasets. For a large toxicogenomic data, we assume that there exist latent biological processes that govern alteration of gene expression levels after samples are treated with drugs at various dose levels and time-points. Each latent biological process is characterized by a distribution of a fixed number of genes. By annotating the mostly likely differentially expressed genes in a latent biological process, we then can link the latent variable with a real biological pathway. In recent years, probabilistic topic models have spawn many similar works on genomic data, noticeably in population genetics (Pritchard et al., 2000), chemogenomic profiling (Flaherty et al., 2005) and microarray data (Rogers et al., 2005;Bicego et al., 2012;Yu et al., 2014). However, most of the previous works of probabilistic topic models on microarray data either have limited size of samples, or probabilistic topic models are used merely for their clustering ability. The versatility of probabilistic topic models has not been fully assessed.
We proposed a probabilistic topic model that was tailored to the structure of a dataset and applied the model to a large toxicogenomics database recently made publicly available. This socalled asymmetric author-topic model (ATT model) combines author-topic model (Rosen-Zvi et al., 2004) with asymmetric prior (Wallach et al., 2009). In Section Materials and Methods, we outlined our data, the proposed model and its application to toxicogenomic data. In Section Results, we presented the analysis results. Analyses were done with MALLET (McCallum, 2002) that contains the option for asymmetric prior distributions.

Toxicogenomic Data
The Japanese Toxicogenomics Project (Uehara et al., 2010;Chen et al., 2012) is a 10-year collaborative project involving two Japanese government institutes and 18 private companies (Igarashi et al., 2015). The project produced a comprehensive gene expression database, called Open TG-GATEs for the effects of 170 compounds (drugs) on liver and kidney as primary target organs in both in vivo and in vitro experiments. Specifically, in the in vivo experiment, animals are treated at three different doses (low, middle, and high) of drugs once every day for four different treatment durations (3, 7, 14, and 28 days). In addition, control animals are concurrent with all the 12 combinations of doses and durations. More details on the animals and experimental design have been described previously (Uehara et al., 2011). Microarray based gene expression data were generated using the GeneChip R Rat Genome 230 2.0 Arrays (Affymetrix, Santa Clara, CA, USA) that contains 31,042 probe sets. The data used in this study is obtained from the Annual International Conference on Critical Assessment of Massive Data Analysis (CAMDA) 1 2013 (http://dokuwiki.bioinf.jku.at/doku.php/tgp_prepro). In this paper, only the data from in vivo repeated dose experiment was used.

Data Preprocessing
Similar to others (Rogers et al., 2005;Bicego et al., 2010;Yu et al., 2014), our first step of analysis was to obtain a "document-word" matrix for gene expression data to apply topic model. Instead of the sample-gene expression matrix used in others' works, we created treatment-fold change matrix for our studies. This was due to the fact that TG-GATEs has multiple treated samples for one treatment (a unique drug-time-dose combination) along with controlled group. Therefore, we were able to apply a more refined treatment-fold change matrix as our inputs. Here, all fold change values of gene expressions between treated and control samples were calculated and used as the value of elements of the matrix. Genes with absolute fold change greater than 1.5 were considered as differentially expressed genes (DEGs) and set the fold change values zeros for the non-DEG. The final product is a treatmentfold change matrix where each column represents a treatment and each row represents a gene.

Latent Dirichlet Allocation in Microarray Data
The fundamental concept of probabilistic topic modeling is the assumption of the existence of latent variables. In Latent Dirichlet Allocation (LDA) (Blei et al., 2003), the latent variables are referred as "topics" and words in documents are chosen based on what topics the document are related to. "Topics" then stands for groups of words that are likely to co-occur in a document. Similar to the previous studies (Bicego et al., 2010;Yu et al., 2014), we referred latent variables in toxicogenomics as "latent biological process" and words in documents were replaced by genes. The elements of document-word matrix, which usually are frequencies of occurrences of words in text mining, were transformed to the fold change values in our treatment-gene matrix. Hence, the latent biological processes represent the groups of genes that are significantly co-expressed (or often have high fold change values within groups). Unlike (Rogers et al., 2005) which alters the original assumption of LDA model, we utilized the original assumption of LDA and this enabled us to implement our models via existing resources of LDA (i.e., MALLET, the opensource software used in our analysis). Therefore, similar to LDA, the model inferences were primarily focused on two probability distributions. In the context of TG-GATEs data, the probability distribution of latent biological processes for each treatment is P(Z|Tr), where Z is defined as latent process assignment while Tr is defined as treatment to describe biological processes that are activated in a specific treatment. Meanwhile, the probability distribution of gene for each latent biological process is P(Ge|Z), where Ge is defined as genes that are DEGs from which we are able to associate the latent process to biological pathways. The ability of linking latent process to biological pathway is a definite advantage over other clustering algorithms and we explored its applications in Section Functional Annotation and Similarity Ranking.

Asymmetric Author-topic Model
Although LDA could be used for treatment-centric analysis, it doesn't take many unique features of the TG-GATEs data into account. In addition to examine the treatment-centric view, drug-centric and/or time-dose-centric analysis were another important component of this study. The author-topic model (Rosen-Zvi et al., 2004) is a proper methodology to incorporate other aspects of data into model construction. Authorship in author-topic model can be seen as a regrouping of all the documents. While both models are essentially identical, authortopic model groups documents together and give LDA model an author-oriented view for inferences. In other words, once the regrouping is done, the whole process can be seen as an LDA model again. For TG-GATEs data, treatment is defined as a unique drug-time-dose combination, thus we can regroup treatments based on their drug or time-dose to provide a drugcentric or a time-dose-wise analysis. The inferences on models are the same except treatment is replaced by either drug or timedose. Furthermore, P(Z|Tr) is replaced by P(Z|Dr) (Dr stands for Drug) and P(Z|DoTi) (DoTi stands for time-dose) respectively. Table 1 summarizes the total number of individuals in each setting.
As Wallach et al. (2009) pointed out, asymmetric prior on the probability distribution of topic for a document substantially increases the robustness of LDA, yet only adds negligible model complexity and computational cost. Therefore, we further improved author-topic model by introducing an asymmetric prior. The asymmetry of priors can be easily achieved since the chosen software MALLET has a build-in option in the command. More information about MALLET can be found on their website (http://mallet.cs.umass.edu/).

Functional Annotation and Similarity Ranking
One essential aspect of any clustering algorithm is to organize individuals into their respective clusters. However, the clusters often are difficult to interpret. Through asymmetric authortopic model, individuals are clustered to multiple latent biological processes based on the probability distribution P(Z|Tr) (or P(Z|Dr), P(Z|DoTi)). For each latent biological process, probability distribution P(Ge|Z) controls how likely each gene is differentially expressed (i.e., a DEG). According to our results, there are often fewer than 200 genes (out of 31,042 total genes) that have positive probability in each latent biological process while other genes have probability of zeros. We then annotate the found list of DEGs in each latent biological process through online database DAVID (Huang Da et al., 2009). Consequently, every feature (i.e., treatment, drug, or time-dose) in the database is automatically connected to annotated biological pathways. The ability of our proposed model to link from the latent biological processes to functional annotation, such as real biological pathways, is a significant advantage over other existing methods.
Another application of author-topic model is to find most similar feature to a given one. We can quantitatively measure the similarity between a pair of features by calculating the symmetric Kullback-Leibler divergence (sKL) (Rosen-Zvi et al., 2004) between a pair of P(Z|Tr) (or P(Z|Dr), P(Z|DoTi)). For instance, by finding the sKL between P(Z|Dr 1 ) and P(Z|Dr 2 ), we can tell how similar Drug 1 and Drug 2 is (i.e., a low sKL score indicates that two drugs exhibit similar topic distributions). Given a drug, our model is able to recommend a list of drugs ranked by the similarity score sKL. Due to (1) the similarity is based on P(Z|Dr), the probability of latent biological processes given drugs, and (2) all the latent biological processes are able to annotated to biological pathways, we know which drugs are similar as well as exactly which pathways link them together.

Model Selection
We run all three of our models on MALLET, whose model inference is based on Gibbs sampling algorithm. One common concern using Gibbs sampling is the convergence of the model. Generally, convergence of the model is monitored via tracking the probability of the likelihood function after burn-in. After the likelihood probability stabilizes, we can deem convergence to be adequate. We run 3000 iterations for all models and observe stability after about 1500 iterations. We also perform sensitivity analyses for major parameters, including number of latent biological processes, and the initial values for hyperpriors. Hyperpriors are usually not big factors in the model as they are constantly revised during rounds of Gibbs sampling inference. On the other hand, the number of latent biological processes is important. While there is no way to know how many biological processes are involved in the whole database, we can estimate the number based on perplexity performance (Blei et al., 2003). In addition, asymmetric topic models have been shown to be robust to variations in the number of topics (Wallach et al., 2009). All the parameters are chosen based on 10-fold cross-validation. For model 1 (treatment), the number of latent biological processes is 200. For models 2 and 3 (drug and time-dose) the number of latent biological processes is 100.

Application on Glutathione Depletion
One proven application of TGP database is detection of glutathione depletion (Uehara et al., 2010). Taking well-known hepatotoxin acetaminophen as an example, it was reported that glutathione metabolism was related to acetaminophen-induced hepatotoxicity and the mechanisms that underline such liver injury (Agarwal et al., 2011;Ben-Shachar et al., 2012). For instance, James et al. (2003) pointed out that acetaminophen could induce potentially fatal, hepatic centrilobular necrosis when taken in overdose, since the amount of active metabolite overwhelmed the detoxification capacity of intracellular glutathione. Among our proposed models, model 1 gives us a treatment-centric view of the TGP database. Table 2 shows P(Z|Tr) from model 1 that represents the most likely latent biological processes that encode biological phenomena associated with acetaminophen. Latent process 161 is identified in 8 out of 12 time-dose combinations for acetaminophen, as early as the 3-day treatment with the middle dose of 600 mg. Furthermore, the list of most probable DEGs for latent process 161 is extracted from P(Ge|Z) and functionally annotated by online database DAVID. As seen on Table 3, glutathione metabolism pathway is significantly identified in the KEGG database, which is consistent with the previous findings.
In model 2, the drug-centric view of the TGP database, we observe similar results. Again, the most likely active latent process for acetaminophen is latent process 92 (Table 4) and it is once again significantly identified as glutathione metabolism pathway in the KEGG database ( Table 5). In addition, by simply searching the drugs that also have No. 92 among the top ranked latent processes, we find that bromobenzene, chlormezanone, coumarin, methimazole, and ticlopidine strongly link with glutathione metabolism pathway (Table 4), and hence presumably become causes of glutathione depletion. Such hepatotoxicity associated with these 6 drugs through the glutathione metabolism pathway is well supported in other papers (Jollow et al., 1974;Thor et al., 1979;Wright et al., 1996;Mizutani et al., 2000;Uehara et al., 2010;Shimizu et al., 2011). Overall, our results indicate that the construction of our proposed model indeed matches with the well-known biological processes and hence the model is able to detect potential treatments or drugs that cause glutathione depletion.

Application of Drug Similarity and Potential Reduction of Animal Use
Through sKL score (described in Section Functional Annotation and Similarity Ranking), functional similarity of drugs can be explored. As an example, we can obtain the most functionally similar drugs to acetaminophen as shown in Table 6. The drugs that have smaller sKL score with acetaminophen (i.e., a pair-wise score) will exhibit most similar latent biological processes. We can observe that bromobenzene and coumarin, which linked through glutathione depletion pathway, are on the list.
Another application of sKL score is to be used as potential evidence of reduction of animal use. Reducing, replacing and refining animal use (3Rs) has been increasingly a goal in toxicogenomics (Russell et al., 1959;Workman et al., 2010). While dose      level and time-point are expected to be important, there is generally no easy way to determine which treatment is ignorable for a given drug. sKL scores measure the similarity between a pair of treatments. The idea is to see if either dose or time in treatments of a drug does not play a significant role to affect sKL score. If one of them is not significant to sKL score, then there exists the potential to reduce the number of treatments without compromising study goals. Similar to multivariate analysis of variance (MANOVA), the importance of dose and time can be attained with generalized linear models on sKL scores as the following: Here, X Dose is defined as a categorical variable that includes six different dose pairs (i.e., Low-Low, Low-Middle, Low-High, Middle-Middle, Middle-High, and High-High). X Time is defined as a continuous non-negative variable that represents the difference between two time-points. By fitting the generalized linear model using various common model criteria (e.g., adjusted R-square, AIC, and BIC), we can compare dose and/or time significance regarding to sKL score. A level of feature that has no significant impact on sKL score can be potentially reduced. While only having 12 individuals, model 3 can be used to detect the overall significance of dose and time. Unsurprisingly, dose and time generally are both significant to sKL score as seen in Table 7.
It is naïve to think we can remove any treatment regardless which drug is been tested, yet there might be specific drugs that fit our assumption. As examples, we chose acetaminophen, coumarin, and benzbromarone to be tested in the generalized linear models. Among all, only benzbromarone consistently demonstrate the superiority of dose only model under all three model criteria. Therefore, it is possible to combine time-points for treatments of benzbromarone due to the insignificance of time regarding to sKL score.

Discussion
Our proposed asymmetric author-topic model is useful in the large-scale genomics data set analysis because of their ability to handle large numbers of potentially interrelated variables, and because of their ability to discern statistical relationships between drugs and their inner pathways. In this paper, we first give our rationale on why a probabilistic topic model is suitable for genomic profiling expression, such as the Japanese Toxicogenomics Project database. We have demonstrated that our asymmetric author-topic model can be implemented to explore hidden relationships among different features (treatment, drug, and time-dose) and genes through latent biological processes. The straightforward data preprocessing makes the transition of data format manageable and easy to expand. In fact, the same principle of data preprocessing can also be applied to next-generation sequencing (NGS) technology since microarray expression intensity can be simply replaced by read counts in NGS (Yu et al., 2014). Since our model enhances the traditional probabilistic topic modeling approach without altering the core assumptions, our framework can be easily adapted for new probabilistic topic model. For example, if we have labels or classes attached to each treatment, we can again enhanced supervised topic models (Blei and McAuliffe, 2007) with asymmetric priors and applied the model on database with same feature-centric analysis capacity. Because of the popularity of probabilistic topic modeling, there are many existing and wellbuilt software packages ready to be used, including MALLET. Therefore, the implementation of newer probability topic models should also be straightforward in the future. Moreover, other models can also potentially improve some of the limitation our model has. Although changing a continuous value (i.e., fold change values) into a discrete value (i.e., counts) has been done before (Flaherty et al., 2005), this process ultimately decrease the precision of the data. Models like Gaussian mixture model that supports continuous outcome will eliminate the need of altering data. Another limitation of our model is the need to determine the number of latent biological processes in advanced. While the perplexity analysis ensures a relatively proper number of latent processes were chosen initially, finding an optimal number of latent processes is still difficult and costly. Many nonparametric Bayesian models has been developed, including Hierarchical Dirichlet Processes (Teh et al., 2006), and Hierarchical Pachinko Allocation (Mimno et al., 2007), and the number of latent processes is automatically determined within the algorithm. One definite advantage of asymmetric author-topic model is the ability to connect the latent biological processes with functional annotation. By connecting our finding with KEGG pathways via DAVID, we further increase the interpretability of latent biological processes. Therefore, we are able to browse and interact with TGP data through meaningful and interpretable biological pathway (i.e., glutathione metabolism). Regarding the application on glutathione depletion, acetaminophen is a wellknown drug that can potentially cause fatal liver injury due to an overdose. Through our approach, we identify that the alteration of glutathione metabolism at even the middle dose (600 mg) of acetaminophen as early as treatment day three. The conclusion of linkages among pathway glutathione metabolism, acetaminophen, and other 5 drugs are found and confirmed in other papers. This demonstrates the possibility of finding existing or new pathway-like annotation through our proposed model, and the ability to cluster drugs with similar mechanisms of action. It is possible to even predict potential pathways for a new drug by estimating the probability distribution of latent biological processes under this framework. Our model also has the capability to adapt analysis that put focus on different features of data. We show how to identify the dominant factor in dose and time combinations in our second application through generalized linear model. As animal reduction in experiment becomes a global trend, the outcome of similarity of time-dose combination is a viable approach to reducing animals needed for future study. Overall, the asymmetric author-topic model has demonstrated potential to be an accessible and flexible approach for finding hidden patterns in large toxicogenomic data. and Education (ORISE). MC would like to acknowledge Binsheng Gong for his assist on data manipulation and functional annotation. MC also would like to acknowledge Roger Perkins for providing insightful comments and English editing for the manuscript.

Supplementary Material
The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fphar. 2015.00081/abstract