Conditional Generative Adversarial Networks for Individualized Treatment Effect Estimation and Treatment Selection

Treatment response is heterogeneous. However, the classical methods treat the treatment response as homogeneous and estimate the average treatment effects. The traditional methods are difficult to apply to precision oncology. Artificial intelligence (AI) is a powerful tool for precision oncology. It can accurately estimate the individualized treatment effects and learn optimal treatment choices. Therefore, the AI approach can substantially improve progress and treatment outcomes of patients. One AI approach, conditional generative adversarial nets for inference of individualized treatment effects (GANITE) has been developed. However, GANITE can only deal with binary treatment and does not provide a tool for optimal treatment selection. To overcome these limitations, we modify conditional generative adversarial networks (MCGANs) to allow estimation of individualized effects of any types of treatments including binary, categorical and continuous treatments. We propose to use sparse techniques for selection of biomarkers that predict the best treatment for each patient. Simulations show that MCGANs outperform seven other state-of-the-art methods: linear regression (LR), Bayesian linear ridge regression (BLR), k-Nearest Neighbor (KNN), random forest classification [RF (C)], random forest regression [RF (R)], logistic regression (LogR), and support vector machine (SVM). To illustrate their applications, the proposed MCGANs were applied to 256 patients with newly diagnosed acute myeloid leukemia (AML) who were treated with high dose ara-C (HDAC), Idarubicin (IDA) and both of these two treatments (HDAC+IDA) at M. D. Anderson Cancer Center. Our results showed that MCGAN can more accurately and robustly estimate the individualized treatment effects than other state-of-the art methods. Several biomarkers such as GSK3, BILIRUBIN, SMAC are identified and a total of 30 biomarkers can explain 36.8% of treatment effect variation.

Treatment response is heterogeneous. However, the classical methods treat the treatment response as homogeneous and estimate the average treatment effects. The traditional methods are difficult to apply to precision oncology. Artificial intelligence (AI) is a powerful tool for precision oncology. It can accurately estimate the individualized treatment effects and learn optimal treatment choices. Therefore, the AI approach can substantially improve progress and treatment outcomes of patients. One AI approach, conditional generative adversarial nets for inference of individualized treatment effects (GANITE) has been developed. However, GANITE can only deal with binary treatment and does not provide a tool for optimal treatment selection. To overcome these limitations, we modify conditional generative adversarial networks (MCGANs) to allow estimation of individualized effects of any types of treatments including binary, categorical and continuous treatments. We propose to use sparse techniques for selection of biomarkers that predict the best treatment for each patient. Simulations show that MCGANs outperform seven other state-of-the-art methods: linear regression (LR), Bayesian linear ridge regression (BLR), k-Nearest Neighbor (KNN), random forest classification [RF (C)], random forest regression [RF (R)], logistic regression (LogR), and support vector machine (SVM). To illustrate their applications, the proposed MCGANs were applied to 256 patients with newly diagnosed acute myeloid leukemia (AML) who were treated with high dose ara-C (HDAC), Idarubicin (IDA) and both of these two treatments (HDAC+IDA) at M. D. Anderson Cancer Center. Our results showed that MCGAN can more accurately and robustly estimate the individualized treatment effects than other state-of-the art methods. Several biomarkers such as GSK3, BILIRUBIN, SMAC are identified and a total of 30 biomarkers can explain 36.8% of treatment effect variation.
Keywords: causal inference, generative adversarial networks, counterfactuals, treatment estimation, precision medicine INTRODUCTION Traditional clinical management estimates the average treatment effects from observational data, assuming that the complex disease is homogeneous (Rosenbaum and Rubin, 1983;Hansen, 2004;Diamond and Sekhon, 2013;Kennedy et al., 2017;Liu et al., 2018;Luo and Zhu, 2020). Alternatives to traditional clinical management, "precision medicine" or "precision oncology" attempts to match the most accurate and effective treatments with the individual patient (Shin et al., 2017;Ali and Aittokallio, 2019), rather than using monotherapy that treats all patients. In the real world, treatment response is heterogeneous. Therapy should be tailored with the best response possible and highest safety margin to ensure that the right therapy is offered to "the right patient at the right time" (Subbiah and Kurzrock, 2018). Precision oncology can substantially improve progress and treatment outcomes of patients. It plays a central role in revolutionizing cancer research. Consequently, alternative to calculating the average effect of an intervention over a population, many recent methods attempt to estimate individualized treatment effects (ITEs) or conditional average treatment effects from observational data (Makar et al., 2019). To accurately estimate the individualized treatment effects and learn optimal treatment choices are key issues for precision oncology. More accurate estimation of individualized treatment effects, which provides information to guide the individual selection of the target therapies, is essential for the success of precision medicine (Kornblau et al., 2009).
Methods for estimation of individualized treatment effects (ITEs) using observational data largely differ from standard statistical estimation methods. Estimating of ITEs and learning optimal treatment strategies raise a great challenge for the following reasons. First, a common framework for treatment effect estimation is the potential outcomes assumptions (Ray and Szabo, 2019) where every individual has two "potential outcomes" covering the hypothesized individual's outcomes with and without treatment. Estimation of ITEs requires estimation of both factual and counterfactual outcomes for each individual. However, only the factual outcome is actually observed. We never observe the counterfactual outcomes (Rosenbaum and Rubin, 1983;Chen and Paschalidis, 2018;Yoon et al., 2018a).
If the effect of each treatment in the subpopulation which is separately estimated is taken as an individual effect, this can create large biases. The estimated effect of each treatment in the subpopulation is still the average effect of the treatment in that subpopulation and is not an individualized treatment effect in the subpopulation.
Second, clinical data often have many missing values. Simultaneously imputing both counterfactual values and missing values is not easy. Third, the function forms of the treatment effects which are often non-linear functions are unknown (Ray and Szabo, 2019). Statistical methods and computational algorithms that can efficiently deal with unknown forms of non-linear functions are still lacking (Lengerich et al., 2019).
Classical works such as random forest and hierarchical models are adapted to estimate heterogeneous treatment effects (Wager and Athey, 2015). Recently, machine learning and neural network methods are used to move away from average treatment effect estimation to personalized estimation Shalit et al., 2016;Alaa and van der Schaar, 2017). AI and causal inferences are becoming a driving force for innovation in precision oncology (Seyhan and Carini, 2019). A key issue for ITE estimation is to learn unobserved (missing) counterfactuals. The idea of using generative adversarial networks (GANs) for handling missing data is a very promising approach to imputing counterfactual (Goodfellow et al., 2014;Ding and Li, 2017;Yoon et al., 2018a). Using conditional GAN (CGAN) to estimate the individualized treatment effects (GANITE) has been developed (Yoon et al., 2018a,b). The CGANs consist of a generator and a discriminator. The generator (G) observes the factual part of real data and imputes the counterfactuals (missing part) conditioned on observed factual data, and outputs the complete dataset. The discriminator (D) inputs the real dataset and tries to determine which part was actually observed and which part was imputed counterfactuals. The discriminator enforces the generator to learn the desired distribution (hidden data distribution) (Yoon et al., 2018b).
However, the original GANITE was designed for estimation of the effects of binary treatment and cannot be applied to continuous and categorical treatments. The treatment variable in the original GANITE is a binary variable which only represents the presence and absence of treatment. Therefore, the treatment variable in the original GANITE is unable to quantify the dosage of the treatment, and hence the original GANITE cannot be applied to continuous treatment. To overcome this limitation, we introduce a treatment assignment indicator variable and treatment quantity variable. The treatment quantity variable can represent binary treatment, categorical treatment, and continuous treatment. We change mathematical formulations of the generator and discriminator and extend GANITE from binary treatment to all types of treatments including binary, categorical, and continuous treatments. The modified GANITE is abbreviated as MGANITE.
GANITE or in general, CGAN has not systematically investigated the estimation of ITE for chemotherapy and other types of treatments in cancer and compared the results from causal inference using observed data with the results of randomized clinical trials. One of our goals in this manuscript is to examine whether MGANITE still works well in cancer research.
In MGANITE, biomarkers that serve as conditioned variables, will be used to estimate the ITEs of both single and multiple treatments (Mirza and Osindero, 2014;Yoon et al., 2018a). Sparse techniques will be employed to select biomarkers for prediction of treatment effects and to learn optimal treatment choices of patients (Emmert-Streib and Dehmer, 2019).
In summary, The novelty of modified GANITE (MGANITE) is summarized below.

Potential Outcome Framework for Estimation of Treatment Effects
We assume the Rubin causal model for estimation of treatment effects (Rubin, 1974) and modifies the approach to the individualized treatment effect estimation in Yoon et al., 2018a). The original GANITE only can estimate ITE of binary treatments, but it cannot be applied to categorical and continuous treatments. We develop MGANITE which can estimate ITE of all types of treatments including binary, categorical, and continuous treatments by introducing a treatment assignment indicator variable and changing the formulation of the generator and discriminator. Consider K treatments. Let T k be the k th treatment variable that can be binary, categorical or continuous, and T = [T 1 , . . . , T K ] T be the treatment vector. We assume that there is precisely one non-zero component of the treatment vector T, which is denoted by T η , where η is the index of this component. Each sample has one and only one assigned treatment T η . To extend the binary treatment to include categorical and continuous treatments, we define the treatment assignment For the sample with no treatment, we have is the potential outcome of the sample under the treatment T k . When K = 2, the potential outcome Y(T 1 ) corresponds to the widely used notation for one treatment Y 1 , the potential outcome of the treated sample, while the potential outcome Y(T 2 ) corresponds to Y 0 , the potential outcome of the untreated sample. Only one of the potential outcomes can be observed. The observed outcome that corresponds to the potential outcome of the individual receiving the treatment T η is denoted by Y(T η ). The observed outcome is called the factual outcome and the unobserved potential outcomes are called counterfactual outcomes, or simply counterfactuals. For the convenience of notation, the factual outcome is also denoted by Y f and the counterfactuals are denoted by Y cf .
The observed outcome Y f can be expressed as When K = 2, we have M 2 = 1−M 1 . The above equation becomes which coincides with the standard expression of the observed outcome for one treatment. Let X = [X 1 , . . . , X q ] T be the q-dimensional feature vector. Assume that n individuals are sampled. Let . . , n be the treatment vector, the vector of potential outcomes, and feature vector of the i th individual, respectively.
The most widely used measure of the treatment effect for the multiple treatment is the pair-wise treatment effect. The individual effect ξ (i) jk between the pairwise treatments: T j and T k is defined as jk . The average pairwise treatment effect τ jk|T j on the patients treated with T j is defined as τ jk|T j = E ξ (i) jk |T j . The focus of this paper is on the conditional distribution of treatment effect, given the feature vector X. Let F Y|X (T k ) be the conditional distribution of the potential outcome Y (T k ) under the treatment T k , given the feature vector X, and F Y|X (T) be the conditional joint distribution of the potential outcome vector Y(T) under the K treatment T, given the feature vector X. Assume that n individuals are sampled. For the i th individual, T η f be the observed feature vector and the observed potential outcome of the i th individual. Therefore, the observed dataset is given by . . , n). The factual and counterfactual outcomes of the i th individual are denoted by y (i) f and y (i) cf , respectively. To estimate the treatment effects, we often make the following three assumptions (Rubin, 1974;Yoon et al., 2018a): Assumption 1 (Ignorability Assumption). Conditional on X, the potential outcomes, Y(T) and the treatment T are independent, This assumption requires no unmeasured confounding variables. Assumption 2 (Common Support). For the feature vector X and all treatment, Assumption 3 (Stable Unit Treatment Value Assumption). No interference (units do not interfere with each other).

Conditional Generative Adversarial Networks as a General Framework for Estimation of Individualized Treatment Effects
The key issue for the estimation of individualized treatment effects is unbiased counterfactual estimation. Counterfactuals will never be observed and cannot be tested by data. The true counterfactuals are unknown. Recently developed generative adversarial networks (GANs) started a revolution in deep learning (Luo and Zhu, 2020). GANs are a perfect tool for missing data imputation. An incredible potential of GANs is to accurately generate the hidden (missing) data distribution given some of the features in the data. Therefore, we can use GANs to generate counterfactual outcomes. GANs consist of two parts: the "generative" part that is called the generator and "adversarial" part that is called the discriminator. Both the generator and discriminator are implemented by neural networks. Typically, a K-dimensional noise vector is input into the generator network that converts the noise vector to a new fake data instance. Then the generated new data instance is input into the discriminator network to evaluate them for authenticity. The generator constantly learns to generate better fake data instances while the discriminator constantly obtains both real data and fake data and improves accuracy of evaluation for authenticity.

Architecture of Conditional Generative Adversarial Networks (CGANs) for Generating Potential Outcomes
Features provide essential information for estimation of counterfactual outcomes. Therefore, we use conditional generative adversarial networks (CGANs) (Mirza and Osindero, 2014) as a general framework for individualized treatment effect (ITE) estimation. The CGANs for ITE estimation consist of two blocks. The first imputation block is to impute the counterfactual outcomes. The second ITE block is to estimate distribution of the treatment effects using the complete dataset that is generated in the imputation block. The architecture of CGANs is shown in Figure 1.
Both the generator and discriminator are implemented by feedforward neural networks. The architectures of the neural networks are described as follows. The generator consists of seven layers of feedforward neural network. The first layer is the covariate input layer that input a vector X of covariates. The second and third layers are hidden layers, each layer with 64 nodes. The fourth layer concatenates the output of the third layer, the response vector Y, treatment vector T and treatment assignment indicator vector M and noise vector Z. The fifth and sixth layers are hidden layers, each layer with 64 nodes. Finally, the seventh layer is the output layer. All activation functions of the neurons were sigmoid function. The architecture of the discriminator is similar to the architecture of the generator except for adding one more output layer with sigmoid non-linear activation function.

Imputation Block
To extend GANTITE from binary treatments to all types of treatments, we introduce the treatment assignment vector and change some mathematical formulation of the generator. A counterfactual generator in the imputation block is a non-linear function of the feature vector, treatment vector T, treatment assignment indicator vector M, observed factual outcome y f and where outputỸ represents a sample of G. It can take binary values, categorical values or continuous values. 1 is a vector of 1, ⊙ denotes element-wise multiplication, and θ G is the parameters in the generator. We use Y to denote the complete dataset that is obtained by replacingỸ η with Y f . The distribution ofỸ depends on the determinant of the Jacobian matrix of the transformation function G X, Y f , T, M, z G , θ G . Changing the transformation function can change the distribution of the generated counterfactual outcomes. Let P Y|x,t,m,y f (y) be the conditional distribution of the potential outcomes, given X = x, T = t, M = m, Y f = y f . The goal of the generator is to learn the neural network G such that Unlike the discriminator in the standard CGANs where the discriminator evaluates the input data for their authenticity (real or fake data), the counterfactual discriminator D G that maps pairs (x, y) to vectors in [0, 1] k attempts to distinguish the factual component from the counterfactual components. The output of the counterfactual discriminator D G is a vector of probabilities that the component represents the factual outcome. Let D G (x,ỹ, t, m, θ d ) i represent the probability that the i th component ofỹ is the factual outcome, i.e., i = η, where θ d denotes the parameters in the discriminator. The goal of the counterfactual discriminator is to maximize the probability D G (x,ỹ, t, m, θ d ) i for correctly identifying the factual component η via changing the parameters in the discriminator neural network D G .

Loss Function
The imputation block in MGANITE attempts to impute counterfactual outcomes by extending the loss function of the binary treatment in GANITE (Yoon et al., 2018a) to all types of treatments: binary, categorical or continuous treatments. We define the loss function V(D G , G) as where log is an element-wise operation. The goal of the imputation block is to maximize the counterfactual discriminator D G and then minimize the counterfactual generator G: In other words, we train the counterfactual discriminator D G to maximize the probability of correctly identifying the assigned treatment M η and the quantity of the treatment T η or Y f (Y η ), and then train the counterfactual generator G to minimize the probability of correctly identifying M η and T η . After the imputation block is performed, the counterfactual generator G produces the complete dataset D = {x, y}. Next, we use the imputed complete dataset D = {X, Y} to generate the distribution of potential outcomes and to estimate the ITE via CGANs which is called the ITE block.

ITE Block
The CGANs consist of three parts: generator, discriminator and loss function which are summarized as follows (Yoon et al., 2018a).

ITE Generator
Unlike the ITE in GANITE where the ITE generator is a nonlinear transform function of only X and Z I , the ITE generator G I in MGANITE is a non-linear transform function of X, T and Z I : whereŶ is the generated K-dimensional vector of potential outcomes, X is a feature vector, T is a treatment vector, and Z I is a K-dimensional vector of random variables and follows the uniform distribution Z I ∼ u((−1, 1) K ). The ITE generator attempts to find the transformationŶ = G I (X, T, Z I, θ g I ) such thatŶ ∼ P Y|X,T (y).

ITE Discriminator
Following the CGANs, we define a discriminator D I as a nonlinear classifier with (X, T, Y * = Y) or (X, T, Y * =Ŷ) as input and a scalar that outputs the probability of Y * being from the complete dataset D.

Loss Function
Again, unlike the loss function in GANITE where the decision function is D I (X, Y * ), a decision function in MGANITE is defined as D(X, T, Y * ). The loss function for the ITE block in MGANITE is then defined as where D I (X, T, Y * ) is the non-linear classifier that determines whether Y * is from the complete dataset D or from generator G I .The goal of the ITE block is to maximize the probability of correctly identifying that Y * is from the complete dataset D and to minimize the probability of a correct classification. Mathematically, the ITE attempts The algorithms for numerically solving the optimization problems (4) and (7) are summarized in the Supplementary Note. The learning parameters for the feedforward neural networks are given below. We set batch size equal to 16. We assumed that the learning rates for the discriminator and generator were 0.0001 and 0.001, respectively. We further assume that the decay rate was 0.1. The learning rate decayed (exponentially) to 10% of the starting learning rate during 70% of the total batches, and stayed at 10% during the last 30% batches. The total number of batches was 1,000,000. Adam Optimizer was used to perform optimization. We assume that 20% of the nodes were dropped randomly during the training process.

Sparse Techniques for Biomarker Identification
The LASSO (least absolute shrinkage and selection operator) that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the results can be used to select biomarkers for optimal treatment selection (Ali and Aittokallio, 2019). Let Y i k and X (i) denote the estimated effect of the k th treatment and feature vector of the i th individual, respectively. Let The outputs of the neural networks are in general a continuous function even if the potential outcomes are binary. For the where . F is the Frobenius norm of the matrix. Non-zero elements β jl = 0 predict treatment effect variation and hence its correspondence X j = X (1) j · · · X (n) j T can be used as biomarkers for investigation of the l th treatment. For the continuous treatment, we define the treatment matrix T and its associated coefficient matrix Ŵ: where λ 1 , λ 2 are penalty parameters.

Biomarker Identification for Optimal Treatment Selection
Consider K treatments. LetŶ i = Ŷ i 1 · · ·Ŷ i K T be the Kdimensional vector of the estimated potential outcomes for the i th individual and z i = argmax 1,...,k {Ŷ i 1 , . . . ,Ŷ i K } be the index for the optimal potential outcomes of the i th individual. To select biomarkers for optimal treatment selection, we define the following LASSO: Solving the above categorical LASSO problem, we obtain a set of non-zero coefficients that are denoted asα l = 0, l = 1, . . . , L.
The covariates that correspond to the non-zero coefficients of the LASSO solution are chosen as biomarkers for optimal treatment selection. Again, for the continuous treatment, Equation (10) needs to be changed tô

Data Collection
The proposed MGANITE was applied to 256 newly diagnosed acute myeloid leukemia (AML) patients, treated with high dose ara-C (HDAC), Idarubicin (IDA), and HDAC+IDA at M. D. Anderson Cancer Center. There were 212 valid samples and 85 useable features (14 discrete and 71 continuous), including 51 total and phosphoprotein from several biological processes such as apoptosis, cell-cycle, and signal transduction pathways (Kornblau et al., 2009 . Prediction accuracy was defined as the proportions of correctly predicted potential outcomes. The false positive rate was defined as the proportion of individuals who were wrongly classified as having a positive treatment response. Discriminator accuracy is defined as the proportion of correctly classified real or fake samples. Replication error is defined as cross entropy −y f logŷ f whereŷ f = G x, t, t * , y f , z G , θ g , t = t * and separate distance is defined as

Simulations
We first examine the performance of MGANITE in estimating the ITE of binary treatment using simulations. A synthetic dataset is generated as follows. A total of 10,000 individuals with 30-dimentinal feature vectors follow the normal distributions N(0, I). Let where i is a sample index.
Then, the potential outcomes are generated as Treatment is assigned by the Bernoulli distribution: The true potential outcomes with treatment Y 1 and estimated potential outcomesŷ 1 using MGANITE, where the x axis denoted a value of covariate X 1 , the y axis denoted the potential outcome, a blue color dot represented the true outcome Y 1 and a red color dot represented the estimated outcomesŷ 1 . (B) The true potential outcomes without treatment Y 0 and estimated potential outcomesŷ 0 using MGANITE, where the x axis denoted a value of covariate X 1 , the y axis denoted the potential outcome, a blue color dot represented the true outcome Y 0 and a red color dot represented the estimated outcomesŷ 0 .
where t is a treatment index, W T t ∼ u(−0.1, 0.1) 30×1 , n t ∼ N(0, 0.1), and Bern represents the Bernoulli distribution. When one sample has only one treatment assigned, then t = i.
Treatment effect can take three values 1, 0, and −1. In other words, We  (Breiman, 2001). We use six methods: MGANITE, LR, LogR, SVM, kNN, and RF (C) to estimate the counterfactual potential outcomes and calculate the mean square error (MSE) between the estimated treatment effect and the true treatment effect, standard deviation (STD) and prediction accuracy. Table 1 presents MSE, STD, and prediction accuracy of six methods to fit the generated data. We observe that MGANITE more accurately estimate the potential outcomes than the other five state-of-the-art methods. Figure 2 presents the true counterfactuals and estimates counterfactuals using MGANITE. We observe that MGANITE reaches remarkably high accuracy for estimating counterfactuals. The treatment effect estimation of eight methods [MGANITE, LR, LogR, SVM, KNN (5,10), BLR, RF (C), RF (R)] are summarized in Table 2. Table 2 shows that MGANITE has the highest accuracy of estimation of all treatment effects: average treatment effect (ATE), average treatment effects on the treated (ATT), and average treatment effect on the control (ATC), followed by RF (R) or RF (C). We observe that the estimations of ATE using all methods are inflated. The inflation rates of ATE using MGANITE and RF (C) are 3.9 and 7.9%, respectively. The SVM reaches the inflation rate of the estimation of ATE as high as 29.8%. All inflation rates of estimation of ATE using LR, LogR, SVM, KNN, and BLR are very high. The simulations also show that the false positive rates using MGANITE, LR, LogR, SVC, Next we examine the performance of MGANITE in estimating the ITE of continuous treatment using simulations. A synthetic dataset is generated as follows.
1. Draw the covariate variable X from the standard normal distribution for 10,000 individuals. 2. The treatment T is exponentially distributed as ., 10, 000, where n 0 i is a randomly sampled noise variable from a normal distribution N(0, 0.01).

Treatment assignment indicator variable M i is drawn from a
Bernoulli distribution with P = 0.5 for each subject.  Figures 3A,B plot the true ITE and estimated ITE for insamples and out-of-samples data, using six methods: MGANITE, LR, KNN, BLR, RF (R), and SVM, respectively, where a dash straight line indicates that the true ITE and the estimated ITE are equal. We observe from Figures 3A,B that many green cross points for both in-sample and out-of-sample data are much closer to the dash straight line than other types of points. This shows that the estimated ITE points using MGANITE are much closer to the true ITE point than using the other five methods. In other words, the estimator of the ITE using MGANITE is more accurate than that of using the other five methods. The results clearly demonstrate that MGANITE outperforms the 5 other state-of-the-art treatment effect estimation methods.
To further evaluate the performance of MGANITE, we provide Figure 4 that plots the receiver operating characteristic (ROC) curve for evaluation of the ability of MGANITE to predict potential outcomes of treatment. Our calculation shows that area under the ROC curve (AUC) for MGANITE reaches 0.98, which is a very high value. The ROC curve and AUC value demonstrate that the power of MGANITE for prediction of the potential outcomes of the treatments is very high.

Real Data Analysis
MGANITE is applied to 256 newly diagnosed acute myeloid leukemia (AML) patients from the clinical trial dataset (Kornblau et al., 2009). We first present the results of treatment using FIGURE 3 | (A) True ITE and estimated ITE for in-sample data using six methods: MGANITE, LR, KNN, BLR, RF (R), and SVM, where MGANTE was denoted by a green cross point, LR was denoted by an orange point, KNN was denoted by a green point, BLR was denoted by a red point, RF (R) was denoted by a purple point and SVM was denoted by a dark red point, the x axis denoted the true ITE and the y axis denoted the estimated ITE. (B) True ITE and estimated ITE for out-of-sample data using six methods: MGANITE, LR, KNN, BLR, RF (R), and SVM, where MGANTE was denoted by a green cross point, LR was denoted by a orange point, KNN was denoted by a green point, BLR was denoted by a red point, RF (R) was denoted by a purple point and SVM was denoted by a dark red point, the x axis denoted the true ITE and the y axis denoted the estimated ITE.
HDAC, HDAC+IDA (101) vs. all other drugs (111). A key issue for MGANITE is how to train MGANITE. To track the training process of MGANITE, we present Figure 5 that shows ATE, discriminator accuracy, replication error, and separate distance curves as a function of the number of batches. FIGURE 4 | ATE, discriminator accuracy, replication error and separate distance curves as a function of the number of batches where the x axis denoted the number of batches, the y axis denoted values of the ATE, discriminator accuracy, replication error, and separation distance for ATE, discriminator, replication, and separation curves, respectively, red, orange, blue and green curves were ATE, discriminator, replication and separation curves, respectively.  We observe from Figure 5 that discriminator accuracy converges to 1, replication error converges to zero, separation distance converges to a constant, and ATE converges to a stable value. Figure 4 demonstrates that MGANITE is trained very well.
Next we compared the treatment effect estimations using nine methods: MGANITE, LR, LogR, SVM, KNN (5), KNN (10), BLR, RF (C), and RF (R) where 5 and 10 are the number of neighbors. Treatment with HDAC or HDAC+IDA, and 85 protein expressions and other geographical variables are used as covariates. The response status (response or no response) is used as the outcome. Table 3 summarizes results of the estimation of HDAC treatment effect using MGANITE and other eight methods where individuals with HDAC or HDAC+IDA are taken as the treated population and individuals with other drugs are taken as the control population. Comparison of treatment effect estimation algorithms on real data analysis is not easy because of the lack of ground truth treatment effects and small sample sizes. In general, using MGANITE, we observe that the majority of individuals who are treated by other drugs do not show any response and that 65% of the individuals who are treated by HDAC or HDAC+IDA respond. Only 13.5% of individuals who are treated by other drugs respond. To illustrate the difference between the estimated treatment effect and treatment response, we present Figure 6 that shows the histogram of the estimated effects of the treatments HDAC or HDAC+IDA vs. other drugs using MGANITE (Figure 6A), and observe the number of responses of the individuals in the population who are treated with HDAC or HDAC+IDA vs. other drugs (Figure 6B). ITE is calculated based on both the factual and counterfactual. We observe that ITE = 0 consists of two scenarios: (1) no response of the patients to any drugs and (2) response of the patients to both HDAC or HDAC+IDA, and other drugs. A proportion of the patients with response to HDAC or HDAC+IDA on the right side of Figure 6B and the patient with response to other drugs on the left side of Figure 6B has ITE = 0. The observed response of the patients to one drug does not imply that these patients would not respond to other drugs. However, ITE = 1 or ITE = 0 implies that the patients respond to only one type of drug. To further compare the performance of MGANITE and other methods for evaluation of ITE, we split a given data set into an in-sample dataset (190 samples), used for the initial parameter estimation and model selection, and an outof-sample dataset (22 samples), used to evaluate performance of ITE estimation. The results are summarized in Table 4. We observe that the difference in the estimated ATT, ATC, ATE and proportions of the ITE between in-samples and out-of-samples using MGANITE are much smaller than using other methods. This shows that the ITE estimation using MGANITE is more robust than using other methods. We calculate the Kullback-Leibler (K-L) divergence between the distributions of the ITE using in-sample and out-of-samples, and using nine methods. The results are summarized in Table 5. Table 5 shows that K-L divergence using MGANITE is much smaller than that using other methods, which implies that MGANITE is more robust than the other eight methods.
LASSO is used to identify biomarkers for prediction of treatment effect and treatment selection. Table 6 lists the top 30 biomarkers identified by LASSO. All top 30 biomarkers explain 36.82% of the variation of HDAC or HDAC+IDA treatment effect. The top Gene GSK3 accounts for 4.4% of the explanation of treatment effect variation.
Garson's algorithm (Garson, 1991;Siu, 2017;Zhang et al., 2018) that describes the relative magnitude of the importance of input variables (biomarkers) in its connection with outcome variables (ITE) of the neural network can also be used to identify biomarkers for predicting the ITE. The top 30 biomarkers identified by the Garson algorithm are listed in Supplementary Table 1 where the relative contribution of each biomarker to the ITE variation and cumulative contribution of biomarkers to the ITE variation are also listed in Supplementary Table 1. The correlation coefficient between the importance ranking of the markers using the Garson algorithm and LASSO is only −0.05.
Next, we study the joint estimation of effects of the multiple treatments. The number of individuals that are treated with HDAC, HDAC+IDA, and other drugs are 37, 54, and 121, respectively. The widely used treatment estimation methods with multiple treatments are simultaneous estimations of the effects of pairwise treatments. We estimate the effects of the pairwise treatments HDAC vs. HDAC+IDA, HDAC vs. other drugs, and HDAC+IDA vs. other drugs. The results are summarized in Table 7. Pairwise comparisons listed in Table 7 does not present the results of the treatment compared with a placebo (without using any drugs). We compare the effect of one treatment with another treatment. Specifically, we make pairwise comparisons: HDAC vs. other drugs, HDAC+IDA vs. other drugs, and HDAC+IDA vs. HDAC. The average treatment effects (ATE) of these three pairwise treatments: HDAC vs. other drugs, HDAC+IDA vs. other drugs, and HDAC+IDA vs. HDAC using MGANITE, are 0.1001, 0.2311 and 0.1310, respectively. This demonstrates that on the average, the effect of the HDAC+IDA is the largest among the three treatments: HDAC+IDA, HDAC, and other drugs, followed by the treatment HDAC. In other words, the treatment HDAC is better than other drugs, in turn, the combination of HDAC and IDA is better than HDAC. It is also noted that the effect of HDAC+IDA vs. other drugs-effect of HDAC vs. other drugs = 0.2311-0.1001 = 0.1310 = effect of HDAC+IDA vs. HDAC.
However, using LR, LogR, SVM, RF (C), and RF (R), we observe that HDAC is the best treatment. This conclusion violates the biological interpretation. We explain the reasons that causes this incorrect conclusion as follows. The traditional methods for treatment estimation are mainly based on the population average of the treatment responses. The number of observed responses and no responses of the individuals treated with other drugs is 66 and 55, respectively. The average response rate of the other drugs is 0.545. The number of observed responses and no responses of the individuals treated with HDAC is 29 and 8, respectively. The average response rate for HDAC is 0.784. The number of observed response and no response of individuals treated with HDAC + IDA is 33 and 21, respectively. The average response rate for HDAC +IDA is  (Jassal et al., 2020) to assess whether the number of identified biomarkers associated with the Reactome pathway is over-represented more than expected. The original P-value from the hypergeometric test is then adjusted by FDR for multiple test correction. We find that top ranking biomarkers for the explanation of treatment effect variation are enriched in multiple cancer related pathways (Figure 7A), including the intrinsic pathway for apoptosis (R-HSA-109606, P = 2.86 × 10 −14 ), Signaling by Interleukins (R-HSA-449147, P = 2.86 × 10 −14 ), Programmed Cell Death (R-HSA-5357801, P = 9.7 × 10 −11 ), PIP3 activates AKT signaling (R-HSA-1257604, P = 2.98 × 10 −8 ), RUNX3 regulates WNT signaling (R-HSA-8951430, P = 1.03 × 10 −5 ), and RNA Polymerase II Transcription (R-HSA-73857, P = 9.4 × 10 −5 ). In addition, we find that the drug target of idarubicin (TOP2A) and Cytarabine (POLB) form a significant protein-protein interaction network (P < 1.0 × 10 −16 ) (Szklarczyk et al., 2019), indicating that the predictive biomarkers work as the direct interactive proteins of cancer drug targets ( Figure 7B).

DISCUSSION
In this paper, we present MGANITE coupled with sparse techniques as a framework to estimate the ITEs and select FIGURE 7 | Reactome pathway analysis and protein-protein interaction (PPI) network analysis to top ranking biomarkers for explanation of treatment effect variation.
(A) Enrichment analysis to the top 44 ranking biomarkers for explanation of treatment effect variation with the Reactome pathway database by hypergeometric test to assess whether the number of identified biomarkers associated with the Reactome pathway was over-represented more than expected. The original P-value from the hypergeometric test was then adjusted by FDR for multiple test correction. The top 15 most significantly enriched pathways was shown. (B) PPI network analysis was performed by String 11.0 to show the protein-protein interaction among top ranking biomarkers. We found that these proteins were highly interacted which was consistent with pathway enrichment analysis (PPI enrichment P-value is 1.0e-16).
the optimal treatments. We demonstrate that the proposed MGANITE has several remarkable features. First, MGANITE extends GANITE from binary treatment to all types of treatments: binary, categorical, and continuous treatments. We show that MGANITE has a much higher accuracy for estimation of ITE than other state-of-theart methods.
Second, in-sample and out-of-sample analysis show that the K-L divergence between the distributions of ITE for in-sample and out-of-samples for MGANITE is much smaller than that of other methods, which implies that MGANITE is more robust than other state-of-the art methods.
Third, unlike many popular methods that are usually used to estimate the average effect of the single treatment, MGANITE not only can estimate the ITE of a single treatment, but also can accurately and jointly estimate the ITE of multiple treatments. We also show that the results of the joint estimation of multiple treatments using other classical methods are inconsistent and might violate the biological interpretation.
Fourth, precision oncology is the identification of the right treatment for the right patient. The essential aim is to discover biomarkers that can accurately predict individual treatment effect for each individual. Our results show that MGANITE with sparse techniques can identify a set of biomarkers with significant biological features. The following identified biomarkers are such typical examples.
GSK3 is a kinase so adaptable that it has been recruited evolutionarily to phosphorylate over 100 substrates, and can regulate numerous cellular functions (Beurel et al., 2015). GSK3 phosphorylates HDAC3 and promotes its activity, including the neurotoxic activity of HDAC3 (Bardai and D'Mello, 2011). GSK3 also phosphorylates HDAC6 to modify its activity and the link between GSK3beta and HDAC6 involved in neurodegenerative disorders (Chen et al., 2010).
Bilirubin is a reddish yellow pigment generated when the normal red blood cells break. Normal levels range from 0.2 to 1.2 mg/dL (Davis, 2020). In adults, indirect hyperbilirubinemia can be due to overproduction, impaired liver uptake or abnormalities of conjugation (Gondal and Aronsohn, 2016 (Pollyea et al., 2019). The most common treatment-related adverse events are indirect hyperbilirubinemia (31%), nausea (23%), and fatigue (Steinwascher et al., 2015). Therefore, bilirubin is an important biomarker for monitoring adverse effect in AML patients who receive treatment.
Preclinical studies have discovered that Smac mimetics can directly cause cancer cell death, or make tumor cells become more sensitive to various cytotoxic treatment agents, including conventional chemotherapy, radiotherapy, or new drugs (Fulda, 2015). There is synergistic interaction of Smac mimetic and HDAC inhibitors in AML cell lines, and Smac mimetic and HDAC inhibitors can trigger necroptosis when caspase activation is blocked (Meng et al., 2016).
AKT.p308 and Src.p527 are phosphorylated signal transduction proteins. These two proteins are found to have lower expression in M0, M1, M2, but they have higher levels in the other AML French-American-British (FAB) types. The expression of those two proteins, together with 22 other proteins, can be used to define distinct signatures for each FAB type (Kornblau et al., 2009).
PTEN is a tumor suppressor protein. Promising anti-cancer agents, HDAC inhibitors, particularly trichostatin A (TSA), can promote PTEN membrane translocation. Meng et al. (2016) reveals that non-selective HDAC inhibitors, such as TSA or suberoylanilide hydroxamic acid (SAHA), induces PTEN membrane translocation through PTEN acetylation at K163 by inhibiting HDAC67. Similarly, treatment with an HDAC6 inhibitor alone promoted PTEN membrane translocation and correspondingly dephosphorylated AKT. The combination of celecoxib and an HDAC6 inhibitor synergistically increases PTEN membrane translocation and inactivated AKT (Zhang and Gan, 2017).
Our results show that multiple treatments improve efficiency of drugs for curing AML. This can be biologically explained. HDAC inhibitors have emerged as a potent and promising strategy for the treatment of leukemia via inducing differentiation and apoptosis in tumor cells (Jin et al., 2016). A phase II study with 37 refractory acute myelogenous leukemia (AML) patients shows only minimal activity of Vorinostat (HDACi), and Vorinostat fails to control the leukocyte count among most AML patients (Schaefer et al., 2009). A preclinical study reveals that the combination regimen of chidamide (a novel orally active HDAC inhibitor) and IDA could rapidly diminish the tumor burden in patients with refractory or relapsed AML . A Phase II trial of Vorinostat with idarubicin (IDA) and Ara-C for patients with newly diagnosed AML or myelodysplastic syndrome reveals good activity with overall response rates of 85%. No excess toxicity due to Vorinostat is observed (Garcia-Manero et al., 2012). Taken together, HDACs in combination therapy with IDA or other chemotherapeutic drugs show encouraging clinical activity in different hematologic malignancies. This explains that the combination of HDAC and IDA is the best treatment.
Although MGANITE shows remarkable features in ITE for estimation and optimal treatment selection; the results in this paper are very preliminary. Training stable GANs is a challenging task. The training process is inherently unstable, resulting in the inaccurate estimation of ITEs. In this study, we ignore unobserved confounders, unmeasured variables that affect both patients' medical prescription and their outcome. Overlooking the presence of unobserved confounders may lead to biased results. The main purpose of this paper is to stimulate discussion about how to use AI as a powerful tool to improve the estimation of ITEs and optimal treatment selection. We hope that our results will greatly increase the confidence in using AI as a driving force to facilitate the development of precision oncology.

DATA AVAILABILITY STATEMENT
The data analyzed in this study is subject to the following licenses/restrictions: data were obtained from Department of Biostatistics, The University of Texas MD Anderson Cancer Center. Requests to access these datasets should be directed to Data access need to request from Xuelin Huang, xlhuang@mdanderson.org.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
MX and WL: conception and design. QG, MX, and SF: development of methodology. XH: acquisition of data. QG, SG, YL, and SF: analysis and interpretation of data. MX, QG, SG, SF, YL, WL, and XH: writing, review, and/or revision of the manuscript. All authors: contributed to the article and approved the submitted version.