Conditional Generative Adversarial Networks for Individualized Treatment Effect Estimation and Treatment Selection

Ge, Qiyang; Huang, Xuelin; Fang, Shenying; Guo, Shicheng; Liu, Yuanyuan; Lin, Wei; Xiong, Momiao

doi:10.3389/fgene.2020.585804

METHODS article

Front. Genet., 11 December 2020

Sec. Statistical Genetics and Methodology

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.585804

This article is part of the Research TopicApplication of Novel Statistical and Machine-learning Methods to High-dimensional Clinical Cancer and (Multi-)Omics dataView all 11 articles

Conditional Generative Adversarial Networks for Individualized Treatment Effect Estimation and Treatment Selection

Qiyang Ge^1,2

Wei Lin²

Momiao Xiong¹^*

¹Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
²School of Mathematical Sciences, Fudan University, Shanghai, China
³Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
⁴Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
⁵Department of Medical Genetics, University of Wisconsin-Madison, Madison, WI, United States

Treatment response is heterogeneous. However, the classical methods treat the treatment response as homogeneous and estimate the average treatment effects. The traditional methods are difficult to apply to precision oncology. Artificial intelligence (AI) is a powerful tool for precision oncology. It can accurately estimate the individualized treatment effects and learn optimal treatment choices. Therefore, the AI approach can substantially improve progress and treatment outcomes of patients. One AI approach, conditional generative adversarial nets for inference of individualized treatment effects (GANITE) has been developed. However, GANITE can only deal with binary treatment and does not provide a tool for optimal treatment selection. To overcome these limitations, we modify conditional generative adversarial networks (MCGANs) to allow estimation of individualized effects of any types of treatments including binary, categorical and continuous treatments. We propose to use sparse techniques for selection of biomarkers that predict the best treatment for each patient. Simulations show that MCGANs outperform seven other state-of-the-art methods: linear regression (LR), Bayesian linear ridge regression (BLR), k-Nearest Neighbor (KNN), random forest classification [RF (C)], random forest regression [RF (R)], logistic regression (LogR), and support vector machine (SVM). To illustrate their applications, the proposed MCGANs were applied to 256 patients with newly diagnosed acute myeloid leukemia (AML) who were treated with high dose ara-C (HDAC), Idarubicin (IDA) and both of these two treatments (HDAC+IDA) at M. D. Anderson Cancer Center. Our results showed that MCGAN can more accurately and robustly estimate the individualized treatment effects than other state-of-the art methods. Several biomarkers such as GSK3, BILIRUBIN, SMAC are identified and a total of 30 biomarkers can explain 36.8% of treatment effect variation.

Introduction

Traditional clinical management estimates the average treatment effects from observational data, assuming that the complex disease is homogeneous (Rosenbaum and Rubin, 1983; Hansen, 2004; Diamond and Sekhon, 2013; Kennedy et al., 2017; Liu et al., 2018; Luo and Zhu, 2020). Alternatives to traditional clinical management, “precision medicine” or “precision oncology” attempts to match the most accurate and effective treatments with the individual patient (Shin et al., 2017; Ali and Aittokallio, 2019), rather than using monotherapy that treats all patients. In the real world, treatment response is heterogeneous. Therapy should be tailored with the best response possible and highest safety margin to ensure that the right therapy is offered to “the right patient at the right time” (Subbiah and Kurzrock, 2018). Precision oncology can substantially improve progress and treatment outcomes of patients. It plays a central role in revolutionizing cancer research. Consequently, alternative to calculating the average effect of an intervention over a population, many recent methods attempt to estimate individualized treatment effects (ITEs) or conditional average treatment effects from observational data (Makar et al., 2019). To accurately estimate the individualized treatment effects and learn optimal treatment choices are key issues for precision oncology. More accurate estimation of individualized treatment effects, which provides information to guide the individual selection of the target therapies, is essential for the success of precision medicine (Kornblau et al., 2009).

Methods for estimation of individualized treatment effects (ITEs) using observational data largely differ from standard statistical estimation methods. Estimating of ITEs and learning optimal treatment strategies raise a great challenge for the following reasons. First, a common framework for treatment effect estimation is the potential outcomes assumptions (Ray and Szabo, 2019) where every individual has two “potential outcomes” covering the hypothesized individual's outcomes with and without treatment. Estimation of ITEs requires estimation of both factual and counterfactual outcomes for each individual. However, only the factual outcome is actually observed. We never observe the counterfactual outcomes (Rosenbaum and Rubin, 1983; Chen and Paschalidis, 2018; Yoon et al., 2018a).

If the effect of each treatment in the subpopulation which is separately estimated is taken as an individual effect, this can create large biases. The estimated effect of each treatment in the subpopulation is still the average effect of the treatment in that subpopulation and is not an individualized treatment effect in the subpopulation.

Second, clinical data often have many missing values. Simultaneously imputing both counterfactual values and missing values is not easy. Third, the function forms of the treatment effects which are often non-linear functions are unknown (Ray and Szabo, 2019). Statistical methods and computational algorithms that can efficiently deal with unknown forms of non-linear functions are still lacking (Lengerich et al., 2019).

Classical works such as random forest and hierarchical models are adapted to estimate heterogeneous treatment effects (Wager and Athey, 2015). Recently, machine learning and neural network methods are used to move away from average treatment effect estimation to personalized estimation (Johansson et al., 2016; Shalit et al., 2016; Alaa and van der Schaar, 2017). AI and causal inferences are becoming a driving force for innovation in precision oncology (Seyhan and Carini, 2019). A key issue for ITE estimation is to learn unobserved (missing) counterfactuals. The idea of using generative adversarial networks (GANs) for handling missing data is a very promising approach to imputing counterfactual (Goodfellow et al., 2014; Ding and Li, 2017; Yoon et al., 2018a). Using conditional GAN (CGAN) to estimate the individualized treatment effects (GANITE) has been developed (Yoon et al., 2018a,b). The CGANs consist of a generator and a discriminator. The generator (G) observes the factual part of real data and imputes the counterfactuals (missing part) conditioned on observed factual data, and outputs the complete dataset. The discriminator (D) inputs the real dataset and tries to determine which part was actually observed and which part was imputed counterfactuals. The discriminator enforces the generator to learn the desired distribution (hidden data distribution) (Yoon et al., 2018b).

However, the original GANITE was designed for estimation of the effects of binary treatment and cannot be applied to continuous and categorical treatments. The treatment variable in the original GANITE is a binary variable which only represents the presence and absence of treatment. Therefore, the treatment variable in the original GANITE is unable to quantify the dosage of the treatment, and hence the original GANITE cannot be applied to continuous treatment. To overcome this limitation, we introduce a treatment assignment indicator variable and treatment quantity variable. The treatment quantity variable can represent binary treatment, categorical treatment, and continuous treatment. We change mathematical formulations of the generator and discriminator and extend GANITE from binary treatment to all types of treatments including binary, categorical, and continuous treatments. The modified GANITE is abbreviated as MGANITE.

GANITE or in general, CGAN has not systematically investigated the estimation of ITE for chemotherapy and other types of treatments in cancer and compared the results from causal inference using observed data with the results of randomized clinical trials. One of our goals in this manuscript is to examine whether MGANITE still works well in cancer research.

In MGANITE, biomarkers that serve as conditioned variables, will be used to estimate the ITEs of both single and multiple treatments (Mirza and Osindero, 2014; Yoon et al., 2018a). Sparse techniques will be employed to select biomarkers for prediction of treatment effects and to learn optimal treatment choices of patients (Emmert-Streib and Dehmer, 2019).

In summary, The novelty of modified GANITE (MGANITE) is summarized below.

1. The previous conditional generative adversarial network (CGAN)-based causal inference methods (GANITE) only can estimate the individualized effects of binary treatment and cannot estimate the individualized effects of continuous treatments. The proposed MGANITE is the first time to use modified CGANs for estimation of individualized effects of continuous treatments.

2. We develop new network structures for the generator and discriminator in the CGANs.

3. We combined sparse techniques for selection of biomarkers with MGANITE to predict the best treatment for each patient.

To evaluate its performance for estimating ITEs, simulations are conducted to estimate ITEs using simulated data and MGANITE, and to compare its estimation accuracy with five other state-of-the-art methods (LR, KNN, BLR, random forest, and SVM). To further evaluate its performance, MGANITE is applied to 256 newly diagnosed acute myeloid leukemia (AML) patients, treated with high dose ara-C (HDAC), Idarubicin (IDA), and HDAC+IDA at M. D. Anderson Cancer Center to estimate ITEs and identify the optimal treatment strategy for each patient. Preliminary results from simulations and real data analysis show that MGANITE outperforms five other state-of-the-art methods. A program for implementing the proposed MGANITE for ITE estimation and optimal treatment selection can be downloaded from our website https://sph.uth.edu/research/centers/hgc/software/xiong/.

Materials and Methods

Potential Outcome Framework for Estimation of Treatment Effects

We assume the Rubin causal model for estimation of treatment effects (Rubin, 1974) and modifies the approach to the individualized treatment effect estimation in Yoon et al., 2018a). The original GANITE only can estimate ITE of binary treatments, but it cannot be applied to categorical and continuous treatments. We develop MGANITE which can estimate ITE of all types of treatments including binary, categorical, and continuous treatments by introducing a treatment assignment indicator variable and changing the formulation of the generator and discriminator. Consider K treatments. Let T_k be the k^th treatment variable that can be binary, categorical or continuous, and $T = {[T_{1}, \dots, T_{K}]}^{T}$ be the treatment vector. We assume that there is precisely one non-zero component of the treatment vector T, which is denoted by T_η, where η is the index of this component. Each sample has one and only one assigned treatment T_η. To extend the binary treatment to include categorical and continuous treatments, we define the treatment assignment indicator vector $M = {[M_{1}, \dots, M_{k}, \dots, M_{K}]}^{T}$ as

\begin{array}{l} M_{k} = {\begin{matrix} 1 k = η \\ 0 otherwise \end{matrix} \end{array}

where $\sum_{k = 1}^{K} M_{k} = 1$ .

For example, if

\begin{array}{l} T = [\begin{matrix} 0 \\ T_{2} \\ 0 \end{matrix}] \end{array}

then η = 2 and

\begin{array}{l} M = [\begin{matrix} 0 \\ 1 \\ 0 \end{matrix}] \end{array}

If we consider treated and untreated cases, then K = 2. Let T₁ denote the treatment and T₂ denote no treatment where T₂ = 1. For the sample with the treatment, we have

\begin{array}{l} T = [\begin{matrix} T_{1} \\ 0 \end{matrix}] and M = [\begin{matrix} 1 \\ 0 \end{matrix}] \end{array}

For the sample with no treatment, we have

\begin{array}{l} T = [\begin{matrix} 0 \\ T_{2} \end{matrix}] and M = [\begin{matrix} 0 \\ 1 \end{matrix}] . \end{array}

Define the vector of potential outcome $Y (T) = {[Y (T_{1}), \dots, Y (T_{K})]}^{T}$ , where Y(T_k) is the potential outcome of the sample under the treatment T_k. When K = 2, the potential outcome Y(T₁) corresponds to the widely used notation for one treatment Y¹, the potential outcome of the treated sample, while the potential outcome Y(T₂) corresponds to Y⁰, the potential outcome of the untreated sample. Only one of the potential outcomes can be observed. The observed outcome that corresponds to the potential outcome of the individual receiving the treatment T_η is denoted by Y(T_η). The observed outcome is called the factual outcome and the unobserved potential outcomes are called counterfactual outcomes, or simply counterfactuals. For the convenience of notation, the factual outcome is also denoted by Y_f and the counterfactuals are denoted by Y_cf.

The observed outcome Y_f can be expressed as

\begin{array}{l} Y_{f} = Y_{η} = \sum_{k = 1}^{K} M_{k} Y (T_{k}) \end{array}

When K = 2, we have M₂ = 1−M₁. The above equation becomes

\begin{array}{l} Y_{f} = M_{1} Y (T_{1}) + (1 - M_{1}) Y (T_{2}) = M_{1} Y^{1} + (1 - M_{1}) Y^{0} \end{array}

which coincides with the standard expression of the observed outcome for one treatment.

Let $X = {[X_{1}, \dots, X_{q}]}^{T}$ be the q-dimensional feature vector. Assume that n individuals are sampled. Let $T^{(i)} = {[T_{1}^{(i)}, \dots, T_{K}^{(i)}]}^{T}, Y^{(i)} = {[Y^{(i)} (T_{1}^{(i)}), \dots, Y^{(i)} (T_{K}^{(i)})]}^{T}$ and $X^{(i)} = {[X_{1}^{(i)}, \dots, X_{q}^{(i)}]}^{T}, i = 1, \dots, n$ be the treatment vector, the vector of potential outcomes, and feature vector of the i^th individual, respectively.

The most widely used measure of the treatment effect for the multiple treatment is the pair-wise treatment effect. The individual effect $ξ_{j k}^{(i)}$ between the pairwise treatments: T_j and T_k is defined as $ξ_{j k}^{(i)} = Y^{(i)} (T_{j}^{(i)}) - Y^{(i)} (T_{k}^{i})$ , the average pairwise treatment effect $τ_{j k} = E [ξ_{j k}^{(i)}]$ . The average pairwise treatment effect τ_{jk|_T_j} on the patients treated with T_j is defined as $τ_{j k | T_{j}} = E [ξ_{j k}^{(i)} | T_{j}]$ .

The focus of this paper is on the conditional distribution of treatment effect, given the feature vector X. Let F_Y|X(T_k) be the conditional distribution of the potential outcome Y(T_k) under the treatment T_k, given the feature vector X, and F_Y|X(T) be the conditional joint distribution of the potential outcome vector Y(T) under the K treatment T, given the feature vector X. Assume that n individuals are sampled. For the i^th individual, T_η treatment (M_η = 1) is assigned. Let X⁽ⁱ⁾ and $Y_{η}^{(i)} (T_{η}^{(i)}) = Y_{f}^{(i)}$ be the observed feature vector and the observed potential outcome of the i^th individual. Therefore, the observed dataset is given by $D = (X^{(i)}, T^{(i)}, Y_{η}^{(i)}, i = 1, \dots, n)$ . The factual and counterfactual outcomes of the i^th individual are denoted by $y_{f}^{(i)}$ and $y_{c f}^{(i)}$ , respectively.

To estimate the treatment effects, we often make the following three assumptions (Rubin, 1974; Yoon et al., 2018a):

Assumption 1 (Ignorability Assumption). Conditional on X, the potential outcomes, Y(T) and the treatment T are independent,

\begin{array}{l} Y (T) = (Y (T_{1}), \dots, Y (T_{K})) T | X & (1) \end{array}

This assumption requires no unmeasured confounding variables.

Assumption 2 (Common Support). For the feature vector X and all treatment,

\begin{array}{l} 0 < P (T_{k} = t_{k} | X) < 1 & (2) \end{array}

Assumption 3 (Stable Unit Treatment Value Assumption). No interference (units do not interfere with each other).

Conditional Generative Adversarial Networks as a General Framework for Estimation of Individualized Treatment Effects

The key issue for the estimation of individualized treatment effects is unbiased counterfactual estimation. Counterfactuals will never be observed and cannot be tested by data. The true counterfactuals are unknown. Recently developed generative adversarial networks (GANs) started a revolution in deep learning (Luo and Zhu, 2020). GANs are a perfect tool for missing data imputation. An incredible potential of GANs is to accurately generate the hidden (missing) data distribution given some of the features in the data. Therefore, we can use GANs to generate counterfactual outcomes.

GANs consist of two parts: the “generative” part that is called the generator and “adversarial” part that is called the discriminator. Both the generator and discriminator are implemented by neural networks. Typically, a K-dimensional noise vector is input into the generator network that converts the noise vector to a new fake data instance. Then the generated new data instance is input into the discriminator network to evaluate them for authenticity. The generator constantly learns to generate better fake data instances while the discriminator constantly obtains both real data and fake data and improves accuracy of evaluation for authenticity.

Architecture of Conditional Generative Adversarial Networks (CGANs) for Generating Potential Outcomes

Features provide essential information for estimation of counterfactual outcomes. Therefore, we use conditional generative adversarial networks (CGANs) (Mirza and Osindero, 2014) as a general framework for individualized treatment effect (ITE) estimation. The CGANs for ITE estimation consist of two blocks. The first imputation block is to impute the counterfactual outcomes. The second ITE block is to estimate distribution of the treatment effects using the complete dataset that is generated in the imputation block. The architecture of CGANs is shown in Figure 1.

FIGURE 1

Figure 1. Scheme of MGANITE for the estimation of potential outcomes.

Both the generator and discriminator are implemented by feedforward neural networks. The architectures of the neural networks are described as follows. The generator consists of seven layers of feedforward neural network. The first layer is the covariate input layer that input a vector X of covariates. The second and third layers are hidden layers, each layer with 64 nodes. The fourth layer concatenates the output of the third layer, the response vector Y, treatment vector T and treatment assignment indicator vector M and noise vector Z. The fifth and sixth layers are hidden layers, each layer with 64 nodes. Finally, the seventh layer is the output layer. All activation functions of the neurons were sigmoid function. The architecture of the discriminator is similar to the architecture of the generator except for adding one more output layer with sigmoid non-linear activation function.

Imputation Block

To extend GANTITE from binary treatments to all types of treatments, we introduce the treatment assignment vector and change some mathematical formulation of the generator. A counterfactual generator in the imputation block is a non-linear function of the feature vector, treatment vector T, treatment assignment indicator vector M, observed factual outcome y_f and K dimensional random vector z_G with uniform distribution $z_{G} ~ U ({(- 1, 1)}^{K})$ where Y_f = Y_η. The generator is denoted by

\begin{array}{l} \tilde{Y} = G (X, Y_{f}, {T ⊙ M, (1 - M) ⊙ z}_{G}, θ_{G}) & (3) \end{array}

where output Ỹ represents a sample of G. It can take binary values, categorical values or continuous values. 1 is a vector of 1, ⊙ denotes element-wise multiplication, and θ_G is the parameters in the generator. We use $\bar{} Y$ to denote the complete dataset that is obtained by replacing Ỹ_ηwith Y_f.

The distribution of Ỹ depends on the determinant of the Jacobian matrix of the transformation function G(X, Y_f, T, M, z_G, θ_G). Changing the transformation function can change the distribution of the generated counterfactual outcomes. Let P_{Y|x, t,_{m, y}_f}(y) be the conditional distribution of the potential outcomes, given X = x, T = t, M = m, Y_f = y_f. The goal of the generator is to learn the neural network G such that G(x, y_f, t, m, z_G, θ_G) ~ P_{Y|x, t,_{m, y}_f}(y).

Unlike the discriminator in the standard CGANs where the discriminator evaluates the input data for their authenticity (real or fake data), the counterfactual discriminator D_G that maps pairs (x,y) to vectors in [0, 1]^k attempts to distinguish the factual component from the counterfactual components. The output of the counterfactual discriminator D_G is a vector of probabilities that the component represents the factual outcome. Let D_G(x, ỹ, t, m,_{θ_d)i} represent the probability that the i^thcomponent of ỹ is the factual outcome, i.e., i = η, where θ_d denotes the parameters in the discriminator. The goal of the counterfactual discriminator is to maximize the probability D_G(x, ỹ, t, m,_{θ_d)i} for correctly identifying the factual component η via changing the parameters in the discriminator neural network D_G.

Loss Function

The imputation block in MGANITE attempts to impute counterfactual outcomes by extending the loss function of the binary treatment in GANITE (Yoon et al., 2018a) to all types of treatments: binary, categorical or continuous treatments. We define the loss function V(D_G, G) as

\begin{array}{l} E_{(x, t, m, y_{f}) ~ P_{d a t a} (x, t, m, y_{f})} E_{z_{G} ~ u ({(- 1, 1)}^{K})} \\ [M^{T} log D_{G} (X, \tilde{Y}, T, M) + {(1 - M)}^{T} log (1 - D_{G} (X, \tilde{Y}, T, M))] \end{array}

where log is an element-wise operation. The goal of the imputation block is to maximize the counterfactual discriminator D_G and then minimize the counterfactual generator G:

\begin{array}{l} {min}_{G} {max}_{D G} V (D_{G}, G, θ_{d}) & (4) \end{array}

In other words, we train the counterfactual discriminator D_G to maximize the probability of correctly identifying the assigned treatment M_η and the quantity of the treatment T_η or Y_f(Y_η), and then train the counterfactual generator G to minimize the probability of correctly identifying M_η and T_η. After the imputation block is performed, the counterfactual generator G produces the complete dataset $\bar{D} = {x, \bar{y}}$ . Next, we use the imputed complete dataset $\bar{D} = {X, \bar{Y}}$ to generate the distribution of potential outcomes and to estimate the ITE via CGANs which is called the ITE block.

ITE Block

The CGANs consist of three parts: generator, discriminator and loss function which are summarized as follows (Yoon et al., 2018a).

ITE Generator

Unlike the ITE in GANITE where the ITE generator is a non-linear transform function of only X and Z_I, the ITE generator G_I in MGANITE is a non-linear transform function of X, T and Z_I:

\begin{array}{l} \hat{Y} = G_{I} (X, T, Z_{I}, θ_{g_{I}}) & (5) \end{array}

where Ŷ is the generated K-dimensional vector of potential outcomes, X is a feature vector, T is a treatment vector, and Z_I is a K-dimensional vector of random variables and follows the uniform distribution $Z_{I} ~ u ({(- 1, 1)}^{K})$ . The ITE generator attempts to find the transformation Ŷ = G_I(X, T, Z_I,θ_{g_I}) such that Ŷ ~ P_{Y|X, T}(y).

ITE Discriminator

Following the CGANs, we define a discriminator D_I as a non-linear classifier with $(X, T, Y^{*} = \bar{} Y)$ or (X, T, Y^* = Ŷ) as input and a scalar that outputs the probability of Y^* being from the complete dataset $\bar{} D$ .

Loss Function

Again, unlike the loss function in GANITE where the decision function is $D_{I} (X, Y^{*})$ , a decision function in MGANITE is defined as D(X, T, Y^*). The loss function for the ITE block in MGANITE is then defined as

\begin{array}{l} V_{I} (D_{I}, G_{I}) = E_{X, T ~ P (x, T)} \\ [E_{Y^{*} ~ P_{Y | X, T}} (y) [log D_{I} (X, T, Y^{*})] + E_{Z_{I} ~ u ({(- 1, 1)}^{K})} [log (1 - D_{I} X, T, Y^{*})]] & (6) \end{array}

where $D_{I} (X, {T, Y}^{*})$ is the non-linear classifier that determines whether Y^* is from the complete dataset $\bar{D}$ or from generator G_I.The goal of the ITE block is to maximize the probability of correctly identifying that Y^* is from the complete dataset $\bar{D}$ and to minimize the probability of a correct classification. Mathematically, the ITE attempts

\begin{array}{l} {min}_{G I} {max}_{D I} V_{I} (D_{I}, G_{I}) & (7) \end{array}

The algorithms for numerically solving the optimization problems (4) and (7) are summarized in the Supplementary Note.

The learning parameters for the feedforward neural networks are given below. We set batch size equal to 16. We assumed that the learning rates for the discriminator and generator were 0.0001 and 0.001, respectively. We further assume that the decay rate was 0.1. The learning rate decayed (exponentially) to 10% of the starting learning rate during 70% of the total batches, and stayed at 10% during the last 30% batches. The total number of batches was 1,000,000. Adam Optimizer was used to perform optimization. We assume that 20% of the nodes were dropped randomly during the training process.

Sparse Techniques for Biomarker Identification

The LASSO (least absolute shrinkage and selection operator) that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the results can be used to select biomarkers for optimal treatment selection (Ali and Aittokallio, 2019). Let $Y_{k}^{i}$ and X⁽ⁱ⁾denote the estimated effect of the k^th treatment and feature vector of the i^th individual, respectively. Let

\begin{array}{l} Y_{T} = [\begin{matrix} Y_{1}^{1} \dots Y_{K}^{1} \\ \dots \\ \dots \\ \dots \\ Y_{1}^{n} \dots Y_{K}^{n} \end{matrix}], X = [\begin{matrix} x_{1}^{(1)} \dots x_{q}^{(1)} \\ \dots \\ \dots \\ \dots \\ x_{1}^{(n)} \dots x_{q}^{(n)} \end{matrix}], β = [\begin{matrix} β_{11} \dots β_{1 K} \\ \dots \\ \dots \\ \dots \\ β_{q 1} \dots β_{q K} \end{matrix}] \end{array}

The outputs of the neural networks are in general a continuous function even if the potential outcomes are binary. For the convenience of presentation, we assume that the treatment effects are continuous regardless if the potential outcomes are binary, categorical or continuous.

The LASSO estimators for identifying biomarkers that predict treatment effects are given by

\begin{array}{l} {\hat{β}}_{λ} = argmi n_{β} | | Y_{T} - X β | |_{F}^{2} + λ \sum_{j = 1}^{q} \sum_{l = 1}^{K} | β_{j l} | & (8) \end{array}

where ||.||_F is the Frobenius norm of the matrix. Non-zero elements β_jl≠0 predict treatment effect variation and hence its correspondence $X_{j} = {[\begin{matrix} X_{j}^{(1)} \dots X_{j}^{(n)} \end{matrix}]}^{T}$ can be used as biomarkers for investigation of the l^th treatment. For the continuous treatment, we define the treatment matrix T and its associated coefficient matrix Γ:

\begin{array}{l} T = [\begin{matrix} T_{1}^{(1)} \dots T_{K}^{(1)} \\ \dots \\ \dots \\ \dots \\ T_{1}^{(n)} \dots T_{K}^{(n)} \end{matrix}], Γ = [\begin{matrix} γ_{11} \dots γ_{1 K} \\ \dots \\ \dots \\ \dots \\ γ_{K 1} \dots γ_{K K} \end{matrix}] \end{array}

Equation (8) should be changed to

\begin{array}{l} {[{\hat{γ}}_{λ_{1},} \hat{β}}_{λ_{2}}] = argmi n_{γ, β} | | Y_{T} - T Γ - X β | |_{F}^{2} + λ_{1} \sum_{j = 1}^{K} \sum_{l = 1}^{K} | γ_{j l} | + λ_{2} \sum_{j = 1}^{q} \sum_{l = 1}^{K} | β_{j l} | & (9) \end{array}

where λ₁, λ₂ are penalty parameters.

Biomarker Identification for Optimal Treatment Selection

Consider K treatments. Let $Ŷ^{i} = {[\begin{matrix} Ŷ_{1}^{i} \dots Ŷ_{K}^{i} \end{matrix}]}^{T}$ be the K-dimensional vector of the estimated potential outcomes for the i^th individual and $z_{i} = argma x_{1, . . ., k} {Ŷ_{1}^{i}, \dots, Ŷ_{K}^{i}}$ be the index for the optimal potential outcomes of the i^th individual. To select biomarkers for optimal treatment selection, we define the following LASSO:

\begin{array}{l} {\hat{Y}}_{z_{i}}^{i} = \sum_{j = 1}^{q} x_{j}^{(i)} α_{j} + λ \sum_{j = 1}^{q} | α_{j} |, i = 1, \dots, n & (10) \end{array}

Solving the above categorical LASSO problem, we obtain a set of non-zero coefficients that are denoted as ${\hat{α}}_{l} \neq 0, l = 1, \dots, L$ . The covariates that correspond to the non-zero coefficients of the LASSO solution are chosen as biomarkers for optimal treatment selection. Again, for the continuous treatment, Equation (10) needs to be changed to

\begin{array}{l} {\hat{Y}}_{z_{i}}^{i} = \sum_{l = 1}^{K} T_{l}^{(i)} δ_{l} + \sum_{j = 1}^{q} x_{j}^{(i)} α_{j} + λ_{1} \sum_{l = 1}^{K} | δ_{l} | + λ_{2} \sum_{j = 1}^{q} | α_{j} |, i = 1, \dots, n . & (11) \end{array}

Data Collection

The proposed MGANITE was applied to 256 newly diagnosed acute myeloid leukemia (AML) patients, treated with high dose ara-C (HDAC), Idarubicin (IDA), and HDAC+IDA at M. D. Anderson Cancer Center. There were 212 valid samples and 85 useable features (14 discrete and 71 continuous), including 51 total and phosphoprotein from several biological processes such as apoptosis, cell-cycle, and signal transduction pathways (Kornblau et al., 2009). Among the 212 valid samples, 37 were treated with HDAC, 9 were treated with IDA and 54 were treated with HDAC+IDA, and 112 were treated with other drugs. Data were downloaded from the M. D. Anderson Cancer Center database (http://bioinformatics.mdanderson.org/Supplements/Kornblau-AML-RPPA/aml-rppa.xls) and (https://pubmed.ncbi.nlm.nih.gov/18840713/).

Prediction accuracy was defined as the proportions of correctly predicted potential outcomes. The false positive rate was defined as the proportion of individuals who were wrongly classified as having a positive treatment response. Discriminator accuracy is defined as the proportion of correctly classified real or fake samples. Replication error is defined as cross entropy −y_flogŷ_f where $ŷ_{f} = G (x, t, t_{^{*}}, y_{f}, z_{G}, θ_{g}), t = t_{^{*}}$ and separate distance is defined as

\begin{array}{l} \frac{1}{n} \sum_{i = 1}^{n} | y_{i f} - {\hat{y}}_{i f} | \end{array}

where $ŷ_{i f} = G (x, t, t_{^{*}}, y_{f}, z_{G}, θ_{g}), t \neq t_{^{*}}$ .

Results

Simulations

We first examine the performance of MGANITE in estimating the ITE of binary treatment using simulations. A synthetic dataset is generated as follows. A total of 10,000 individuals with 30-dimentinal feature vectors follow the normal distributions N(0, I). Let

\begin{array}{l} {\hat{y}}_{i}^{0} = 0.05 + 0.4 x_{i 1}^{2} + 0.25 x_{i 2} + n_{i 0}, n_{i 0} ~ N (0, 0.05) \end{array}

and

\begin{array}{l} {\hat{y}}_{i}^{1} = 0.15 + {0.5 x}_{i 1}^{2} + 0.25 x_{i 1} x_{i 2} + 0.25 x_{i 2} + n_{i 1} \\ i = 1, 2, \dots, 10, 000, n_{i 1} ~ N (0, 0.05), \end{array}

where i is a sample index.

Then, the potential outcomes are generated as

y_{i}^{0} = {\begin{matrix} 1 {\hat{y}}_{i}^{0} \geq 0.5 \\ 0 {\hat{y}}_{i}^{0} < 0.5 \end{matrix} and y_{i}^{1} = {\begin{matrix} 1 {\hat{y}}_{i}^{1} \geq 0.5 \\ 0 {\hat{y}}_{i}^{1} < 0.5 \end{matrix}

Treatment is assigned by the Bernoulli distribution:

\begin{array}{l} M = T | X ~ B e r n (s i g m o i d (W_{t}^{T} X + n_{t})) \end{array}

where t is a treatment index, $W_{t}^{T} ~ u {(- 0.1, 0.1)}^{30 \times 1}$ , n_t~N(0, 0.1), and Bern represents the Bernoulli distribution. When one sample has only one treatment assigned, then t = i.

Treatment effect can take three values 1, 0, and −1. In other words,

ξ_{i} = {\begin{array}{l} 1 & y_{i}^{1} = 1, y_{i}^{0} = 0 \\ \begin{matrix} 0 y_{i}^{1} = 1, y_{i}^{0} = 1 \end{matrix} o r & y_{i}^{1} = 0, y_{i}^{0} = 0 \\ - 1 & y_{i}^{1} = 0, y_{i}^{0} = 1 \end{array}

We compare MGANITE with linear regression (LR) (Makar et al., 2019), logistic regression (LogR) (Emmert-Streib and Dehmer, 2019; Makar et al., 2019), support vector machine (SVM) (Makar et al., 2019), k- nearest neighbor (k-NN) (Crump et al., 2008), Bayesian linear regression (BLR) (Johansson et al., 2016), causal forest (CForest) ( Wager and Athey, 2015), and random forest classification [RF (C)] (Breiman, 2001). We use six methods: MGANITE, LR, LogR, SVM, kNN, and RF (C) to estimate the counterfactual potential outcomes and calculate the mean square error (MSE) between the estimated treatment effect and the true treatment effect, standard deviation (STD) and prediction accuracy. Table 1 presents MSE, STD, and prediction accuracy of six methods to fit the generated data. We observe that MGANITE more accurately estimate the potential outcomes than the other five state-of-the-art methods. Figure 2 presents the true counterfactuals and estimates counterfactuals using MGANITE. We observe that MGANITE reaches remarkably high accuracy for estimating counterfactuals.

TABLE 1

Table 1. Performance of six methods for estimating the potential outcomes.

FIGURE 2

Figure 2. (A) The true potential outcomes with treatment Y¹ and estimated potential outcomes ŷ¹ using MGANITE, where the x axis denoted a value of covariate X₁, the y axis denoted the potential outcome, a blue color dot represented the true outcome Y¹ and a red color dot represented the estimated outcomes ŷ¹. (B) The true potential outcomes without treatment Y⁰and estimated potential outcomes ŷ⁰ using MGANITE, where the x axis denoted a value of covariate X₁, the y axis denoted the potential outcome, a blue color dot represented the true outcome Y⁰ and a red color dot represented the estimated outcomes ŷ⁰.

The treatment effect estimation of eight methods [MGANITE, LR, LogR, SVM, KNN (5,10), BLR, RF (C), RF (R)] are summarized in Table 2. Table 2 shows that MGANITE has the highest accuracy of estimation of all treatment effects: average treatment effect (ATE), average treatment effects on the treated (ATT), and average treatment effect on the control (ATC), followed by RF (R) or RF (C). We observe that the estimations of ATE using all methods are inflated. The inflation rates of ATE using MGANITE and RF (C) are 3.9 and 7.9%, respectively. The SVM reaches the inflation rate of the estimation of ATE as high as 29.8%. All inflation rates of estimation of ATE using LR, LogR, SVM, KNN, and BLR are very high. The simulations also show that the false positive rates using MGANITE, LR, LogR, SVC, KNN (5), KNN (10), BLR, RF (R), and RF (C) are 3.9, 24.7, 28.1, 29.8, 28/1, 19.7, 25.3, 9, and 8.4%, respectively. The results show that false positive rates of LR, LogR, SVM, KNN, and BLR for prediction of positive treatment response are too high to be applied to treatment selection. Even RF (R) reaches the false positive rate as high as 8.4%. Table 2 also shows that the number of individuals that show positive treatment effects increases while the number of individuals that show no treatment effect decreases from ground truth.

TABLE 2

Table 2. Treatment effects estimated for simulation data using nine methods.

Next we examine the performance of MGANITE in estimating the ITE of continuous treatment using simulations. A synthetic dataset is generated as follows.

1. Draw the covariate variable X from the standard normal distribution for 10,000 individuals.

2. The treatment T is exponentially distributed as P(t) = e^−(t−1), t ≥ 1. Define g(t) = 0.1t².

3. Define a non-linear function $f (x) = \frac{1}{2 + e x p (- 20 (x - \frac{1}{3}))}$ .

4. Define $y_{i}^{0} = 0.3 + f (x) + n_{i}^{0}, i = 1, \dots, 10, 000$ , where $n_{i}^{0}$ is a randomly sampled noise variable from a normal distribution N(0, 0.01).

5. Define $y_{i}^{1} = 0.3 + f (x) + g (t) + n_{i}^{1}, i = 1, \dots, 10, 000$ , where $n_{i}^{1}$ is a randomly sampled noise variable from a normal distribution N(0, 0.01).

6. Treatment assignment indicator variable M_i is drawn from a Bernoulli distribution with P = 0.5 for each subject.

The mean square errors (MSE) for MGANITE, Linear Regression, KNN, Bayesian ridge regression, RF (R), and SVM regression are 0.011004916, 0.08500695, 0.012520364, 0.085007192, 0.014281599, 0.013962992, respectively. Figures 3A,B plot the true ITE and estimated ITE for in-samples and out-of-samples data, using six methods: MGANITE, LR, KNN, BLR, RF (R), and SVM, respectively, where a dash straight line indicates that the true ITE and the estimated ITE are equal. We observe from Figures 3A,B that many green cross points for both in-sample and out-of-sample data are much closer to the dash straight line than other types of points. This shows that the estimated ITE points using MGANITE are much closer to the true ITE point than using the other five methods. In other words, the estimator of the ITE using MGANITE is more accurate than that of using the other five methods. The results clearly demonstrate that MGANITE outperforms the 5 other state-of-the-art treatment effect estimation methods.

FIGURE 3

Figure 3. (A) True ITE and estimated ITE for in-sample data using six methods: MGANITE, LR, KNN, BLR, RF (R), and SVM, where MGANTE was denoted by a green cross point, LR was denoted by an orange point, KNN was denoted by a green point, BLR was denoted by a red point, RF (R) was denoted by a purple point and SVM was denoted by a dark red point, the x axis denoted the true ITE and the y axis denoted the estimated ITE. (B) True ITE and estimated ITE for out-of-sample data using six methods: MGANITE, LR, KNN, BLR, RF (R), and SVM, where MGANTE was denoted by a green cross point, LR was denoted by a orange point, KNN was denoted by a green point, BLR was denoted by a red point, RF (R) was denoted by a purple point and SVM was denoted by a dark red point, the x axis denoted the true ITE and the y axis denoted the estimated ITE.

To further evaluate the performance of MGANITE, we provide Figure 4 that plots the receiver operating characteristic (ROC) curve for evaluation of the ability of MGANITE to predict potential outcomes of treatment. Our calculation shows that area under the ROC curve (AUC) for MGANITE reaches 0.98, which is a very high value. The ROC curve and AUC value demonstrate that the power of MGANITE for prediction of the potential outcomes of the treatments is very high.

FIGURE 4

Figure 4. ATE, discriminator accuracy, replication error and separate distance curves as a function of the number of batches where the x axis denoted the number of batches, the y axis denoted values of the ATE, discriminator accuracy, replication error, and separation distance for ATE, discriminator, replication, and separation curves, respectively, red, orange, blue and green curves were ATE, discriminator, replication and separation curves, respectively.

Real Data Analysis

MGANITE is applied to 256 newly diagnosed acute myeloid leukemia (AML) patients from the clinical trial dataset (Kornblau et al., 2009). We first present the results of treatment using HDAC, HDAC+IDA (101) vs. all other drugs (111). A key issue for MGANITE is how to train MGANITE. To track the training process of MGANITE, we present Figure 5 that shows ATE, discriminator accuracy, replication error, and separate distance curves as a function of the number of batches.

FIGURE 5

Figure 5. Receiver operating characteristic (ROC) curve for evaluation of performance of MGANITE.

We observe from Figure 5 that discriminator accuracy converges to 1, replication error converges to zero, separation distance converges to a constant, and ATE converges to a stable value. Figure 4 demonstrates that MGANITE is trained very well.

Next we compared the treatment effect estimations using nine methods: MGANITE, LR, LogR, SVM, KNN (5), KNN (10), BLR, RF (C), and RF (R) where 5 and 10 are the number of neighbors. Treatment with HDAC or HDAC+IDA, and 85 protein expressions and other geographical variables are used as covariates. The response status (response or no response) is used as the outcome.

Table 3 summarizes results of the estimation of HDAC treatment effect using MGANITE and other eight methods where individuals with HDAC or HDAC+IDA are taken as the treated population and individuals with other drugs are taken as the control population. Comparison of treatment effect estimation algorithms on real data analysis is not easy because of the lack of ground truth treatment effects and small sample sizes. In general, using MGANITE, we observe that the majority of individuals who are treated by other drugs do not show any response and that 65% of the individuals who are treated by HDAC or HDAC+IDA respond. Only 13.5% of individuals who are treated by other drugs respond. To illustrate the difference between the estimated treatment effect and treatment response, we present Figure 6 that shows the histogram of the estimated effects of the treatments HDAC or HDAC+IDA vs. other drugs using MGANITE (Figure 6A), and observe the number of responses of the individuals in the population who are treated with HDAC or HDAC+IDA vs. other drugs (Figure 6B). ITE is calculated based on both the factual and counterfactual. We observe that ITE = 0 consists of two scenarios: (1) no response of the patients to any drugs and (2) response of the patients to both HDAC or HDAC+IDA, and other drugs. A proportion of the patients with response to HDAC or HDAC+IDA on the right side of Figure 6B and the patient with response to other drugs on the left side of Figure 6B has ITE = 0. The observed response of the patients to one drug does not imply that these patients would not respond to other drugs. However, ITE = 1 or ITE = 0 implies that the patients respond to only one type of drug. To further compare the performance of MGANITE and other methods for evaluation of ITE, we split a given data set into an in-sample dataset (190 samples), used for the initial parameter estimation and model selection, and an out-of-sample dataset (22 samples), used to evaluate performance of ITE estimation. The results are summarized in Table 4. We observe that the difference in the estimated ATT, ATC, ATE and proportions of the ITE between in-samples and out-of-samples using MGANITE are much smaller than using other methods. This shows that the ITE estimation using MGANITE is more robust than using other methods. We calculate the Kullback-Leibler (K-L) divergence between the distributions of the ITE using in-sample and out-of-samples, and using nine methods. The results are summarized in Table 5. Table 5 shows that K-L divergence using MGANITE is much smaller than that using other methods, which implies that MGANITE is more robust than the other eight methods.

TABLE 3

Table 3. Treatment effects estimated for AML dataset using nine methods.

FIGURE 6

Figure 6. (A) Histogram of estimated drug treatment effect using MGANITE, where the x axis denoted the value of ITE and the y axis denoted the number of patients, ITE = +1 denoted the ITE of patients treated with HDAC or HDAC+IDA, ITE = −1 denoted the ITE of patients treated with other drugs, and ITE = 0 denoted the ITE of two groups of patients: one group of the patients treated with HDAC or HDAC+IDA and another group of the patients treated with other drugs. (B) Histogram of observed drug treatment response where the x axis indicated three scenarios as described in (B) and the y axis denoted the number of patients, the right side in the (B) denoted the number of patients only responding to the HDAC or HDAC+IDA, the middle denoted the number of the patients that responds to both (HDAC or HDAC+IDA) and other drugs or did not respond to both (HDAC or HDAC+IDA) and other drugs, and the left side denoted the number of patients only responding to the other drugs.

TABLE 4

Table 4. Treatment effects estimated for AML dataset using nine methods.

TABLE 5

Table 5. K-L divergence between the distribution of ITEs using in-samples and out-of-samples.

LASSO is used to identify biomarkers for prediction of treatment effect and treatment selection. Table 6 lists the top 30 biomarkers identified by LASSO. All top 30 biomarkers explain 36.82% of the variation of HDAC or HDAC+IDA treatment effect. The top Gene GSK3 accounts for 4.4% of the explanation of treatment effect variation.

TABLE 6

Table 6. Top ranking variables for explanation of treatment effect variation.

Garson's algorithm (Garson, 1991; Siu, 2017; Zhang et al., 2018) that describes the relative magnitude of the importance of input variables (biomarkers) in its connection with outcome variables (ITE) of the neural network can also be used to identify biomarkers for predicting the ITE. The top 30 biomarkers identified by the Garson algorithm are listed in Supplementary Table 1 where the relative contribution of each biomarker to the ITE variation and cumulative contribution of biomarkers to the ITE variation are also listed in Supplementary Table 1. The correlation coefficient between the importance ranking of the markers using the Garson algorithm and LASSO is only −0.05.

Next, we study the joint estimation of effects of the multiple treatments. The number of individuals that are treated with HDAC, HDAC+IDA, and other drugs are 37, 54, and 121, respectively. The widely used treatment estimation methods with multiple treatments are simultaneous estimations of the effects of pairwise treatments. We estimate the effects of the pairwise treatments HDAC vs. HDAC+IDA, HDAC vs. other drugs, and HDAC+IDA vs. other drugs. The results are summarized in Table 7. Pairwise comparisons listed in Table 7 does not present the results of the treatment compared with a placebo (without using any drugs). We compare the effect of one treatment with another treatment. Specifically, we make pairwise comparisons: HDAC vs. other drugs, HDAC+IDA vs. other drugs, and HDAC+IDA vs. HDAC. The average treatment effects (ATE) of these three pairwise treatments: HDAC vs. other drugs, HDAC+IDA vs. other drugs, and HDAC+IDA vs. HDAC using MGANITE, are 0.1001, 0.2311 and 0.1310, respectively. This demonstrates that on the average, the effect of the HDAC+IDA is the largest among the three treatments: HDAC+IDA, HDAC, and other drugs, followed by the treatment HDAC. In other words, the treatment HDAC is better than other drugs, in turn, the combination of HDAC and IDA is better than HDAC. It is also noted that the effect of HDAC+IDA vs. other drugs—effect of HDAC vs. other drugs = 0.2311–0.1001 = 0.1310 = effect of HDAC+IDA vs. HDAC.

TABLE 7

Table 7. Multiple treatment effects estimated for AML dataset using nine methods.

However, using LR, LogR, SVM, RF (C), and RF (R), we observe that HDAC is the best treatment. This conclusion violates the biological interpretation. We explain the reasons that causes this incorrect conclusion as follows. The traditional methods for treatment estimation are mainly based on the population average of the treatment responses. The number of observed responses and no responses of the individuals treated with other drugs is 66 and 55, respectively. The average response rate of the other drugs is 0.545. The number of observed responses and no responses of the individuals treated with HDAC is 29 and 8, respectively. The average response rate for HDAC is 0.784. The number of observed response and no response of individuals treated with HDAC + IDA is 33 and 21, respectively. The average response rate for HDAC +IDA is 0.611. Therefore, estimators of ATE for the treatment of HDAC vs. other drugs using LR, LogR, SVM, RF (C), and RF (R) are higher than the estimators of ATE for the HDAC + IDA treatment. However, the individuals treated with HDAC+IDA usually do not respond to HDAC treatment. Therefore, the number of individuals with no response should be adjusted to 62. After adjustment, the response rate of HDAC is changed to 0.319. Therefore, after adjustment, the ATE of HDAC vs. other drugs is smaller than the ATE for HDAC +IDA. Then, the estimators of the pair-wise treatments using MGANITE are consistent with the treatment responses after the data are adjusted. This example shows that these traditional methods that are designed for single treatment effect estimation should be modified when they are applied to multiple treatment effect estimation.

Enrichment analysis to top ranking variables for explanation of treatment effect variation is performed by the hypergeometric test via the Reactome Pathway Database (RPD) (Jassal et al., 2020) to assess whether the number of identified biomarkers associated with the Reactome pathway is over-represented more than expected. The original P-value from the hypergeometric test is then adjusted by FDR for multiple test correction. We find that top ranking biomarkers for the explanation of treatment effect variation are enriched in multiple cancer related pathways (Figure 7A), including the intrinsic pathway for apoptosis (R-HSA-109606, P = 2.86 × 10⁻¹⁴), Signaling by Interleukins (R-HSA-449147, P = 2.86 × 10⁻¹⁴), Programmed Cell Death (R-HSA-5357801, P = 9.7 × 10⁻¹¹), PIP3 activates AKT signaling (R-HSA-1257604, P = 2.98 × 10⁻⁸), RUNX3 regulates WNT signaling (R-HSA-8951430, P = 1.03 × 10⁻⁵), and RNA Polymerase II Transcription (R-HSA-73857, P = 9.4 × 10⁻⁵). In addition, we find that the drug target of idarubicin (TOP2A) and Cytarabine (POLB) form a significant protein-protein interaction network (P < 1.0 × 10⁻¹⁶) (Szklarczyk et al., 2019), indicating that the predictive biomarkers work as the direct interactive proteins of cancer drug targets (Figure 7B).

FIGURE 7

Figure 7. Reactome pathway analysis and protein-protein interaction (PPI) network analysis to top ranking biomarkers for explanation of treatment effect variation. (A) Enrichment analysis to the top 44 ranking biomarkers for explanation of treatment effect variation with the Reactome pathway database by hypergeometric test to assess whether the number of identified biomarkers associated with the Reactome pathway was over-represented more than expected. The original P-value from the hypergeometric test was then adjusted by FDR for multiple test correction. The top 15 most significantly enriched pathways was shown. (B) PPI network analysis was performed by String 11.0 to show the protein-protein interaction among top ranking biomarkers. We found that these proteins were highly interacted which was consistent with pathway enrichment analysis (PPI enrichment P-value is 1.0e-16).

Discussion

In this paper, we present MGANITE coupled with sparse techniques as a framework to estimate the ITEs and select the optimal treatments. We demonstrate that the proposed MGANITE has several remarkable features.

First, MGANITE extends GANITE from binary treatment to all types of treatments: binary, categorical, and continuous treatments. We show that MGANITE has a much higher accuracy for estimation of ITE than other state-of-the-art methods.

Second, in-sample and out-of-sample analysis show that the K-L divergence between the distributions of ITE for in-sample and out-of-samples for MGANITE is much smaller than that of other methods, which implies that MGANITE is more robust than other state-of-the art methods.

Third, unlike many popular methods that are usually used to estimate the average effect of the single treatment, MGANITE not only can estimate the ITE of a single treatment, but also can accurately and jointly estimate the ITE of multiple treatments. We also show that the results of the joint estimation of multiple treatments using other classical methods are inconsistent and might violate the biological interpretation.

Fourth, precision oncology is the identification of the right treatment for the right patient. The essential aim is to discover biomarkers that can accurately predict individual treatment effect for each individual. Our results show that MGANITE with sparse techniques can identify a set of biomarkers with significant biological features. The following identified biomarkers are such typical examples.

GSK3 is a kinase so adaptable that it has been recruited evolutionarily to phosphorylate over 100 substrates, and can regulate numerous cellular functions (Beurel et al., 2015). GSK3 phosphorylates HDAC3 and promotes its activity, including the neurotoxic activity of HDAC3 (Bardai and D'Mello, 2011). GSK3 also phosphorylates HDAC6 to modify its activity and the link between GSK3beta and HDAC6 involved in neurodegenerative disorders (Chen et al., 2010).

Bilirubin is a reddish yellow pigment generated when the normal red blood cells break. Normal levels range from 0.2 to 1.2 mg/dL (Davis, 2020). In adults, indirect hyperbilirubinemia can be due to overproduction, impaired liver uptake or abnormalities of conjugation (Gondal and Aronsohn, 2016). For AML patients,[[Inline Image]][[Inline Image]] enasidenib is an inhibitor of mutant IDH2 proteins used to treat newly diagnosed mutant-IDH2 AML patients (Pollyea et al., 2019). The most common treatment-related adverse events are indirect hyperbilirubinemia (31%), nausea (23%), and fatigue (Steinwascher et al., 2015). Therefore, bilirubin is an important biomarker for monitoring adverse effect in AML patients who receive treatment.

Preclinical studies have discovered that Smac mimetics can directly cause cancer cell death, or make tumor cells become more sensitive to various cytotoxic treatment agents, including conventional chemotherapy, radiotherapy, or new drugs (Fulda, 2015). There is synergistic interaction of Smac mimetic and HDAC inhibitors in AML cell lines, and Smac mimetic and HDAC inhibitors can trigger necroptosis when caspase activation is blocked (Meng et al., 2016).

AKT.p308 and Src.p527 are phosphorylated signal transduction proteins. These two proteins are found to have lower expression in M0, M1, M2, but they have higher levels in the other AML French-American-British (FAB) types. The expression of those two proteins, together with 22 other proteins, can be used to define distinct signatures for each FAB type (Kornblau et al., 2009).

PTEN is a tumor suppressor protein. Promising anti-cancer agents, HDAC inhibitors, particularly trichostatin A (TSA), can promote PTEN membrane translocation. Meng et al. (2016) reveals that non-selective HDAC inhibitors, such as TSA or suberoylanilide hydroxamic acid (SAHA), induces PTEN membrane translocation through PTEN acetylation at K163 by inhibiting HDAC67. Similarly, treatment with an HDAC6 inhibitor alone promoted PTEN membrane translocation and correspondingly dephosphorylated AKT. The combination of celecoxib and an HDAC6 inhibitor synergistically increases PTEN membrane translocation and inactivated AKT (Zhang and Gan, 2017).

Our results show that multiple treatments improve efficiency of drugs for curing AML. This can be biologically explained. HDAC inhibitors have emerged as a potent and promising strategy for the treatment of leukemia via inducing differentiation and apoptosis in tumor cells (Jin et al., 2016). A phase II study with 37 refractory acute myelogenous leukemia (AML) patients shows only minimal activity of Vorinostat (HDACi), and Vorinostat fails to control the leukocyte count among most AML patients (Schaefer et al., 2009). A preclinical study reveals that the combination regimen of chidamide (a novel orally active HDAC inhibitor) and IDA could rapidly diminish the tumor burden in patients with refractory or relapsed AML (Li et al., 2017). A Phase II trial of Vorinostat with idarubicin (IDA) and Ara-C for patients with newly diagnosed AML or myelodysplastic syndrome reveals good activity with overall response rates of 85%. No excess toxicity due to Vorinostat is observed (Garcia-Manero et al., 2012). Taken together, HDACs in combination therapy with IDA or other chemotherapeutic drugs show encouraging clinical activity in different hematologic malignancies. This explains that the combination of HDAC and IDA is the best treatment.

Although MGANITE shows remarkable features in ITE for estimation and optimal treatment selection; the results in this paper are very preliminary. Training stable GANs is a challenging task. The training process is inherently unstable, resulting in the inaccurate estimation of ITEs. In this study, we ignore unobserved confounders, unmeasured variables that affect both patients' medical prescription and their outcome. Overlooking the presence of unobserved confounders may lead to biased results. The main purpose of this paper is to stimulate discussion about how to use AI as a powerful tool to improve the estimation of ITEs and optimal treatment selection. We hope that our results will greatly increase the confidence in using AI as a driving force to facilitate the development of precision oncology.

Data Availability Statement

The data analyzed in this study is subject to the following licenses/restrictions: data were obtained from Department of Biostatistics, The University of Texas MD Anderson Cancer Center. Requests to access these datasets should be directed to Data access need to request from Xuelin Huang, eGxodWFuZ0BtZGFuZGVyc29uLm9yZw==.

Ethics Statement

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

Author Contributions

MX and WL: conception and design. QG, MX, and SF: development of methodology. XH: acquisition of data. QG, SG, YL, and SF: analysis and interpretation of data. MX, QG, SG, SF, YL, WL, and XH: writing, review, and/or revision of the manuscript. All authors: contributed to the article and approved the submitted version.

Funding

WL was supported by the National Key R&D Program of China (Grant No. 2018YFC0116600), the National Natural Science Foundation of China (Grant No. 11925103), and by the STCSM (Grant Nos. 18DZ1201000, 19511101404).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer SJ declared a shared affiliation, with no collaboration, with the authors XH and SF to the handling editor at the time of review.

Acknowledgments

The authors thank Sara Barton for editing the manuscript and the Texas Advanced Computing Center for computation support.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.585804/full#supplementary-material

References

Alaa, A. M., and van der Schaar, M. (2017). “Bayesian inference of individualized treatment effects using multi-task gaussian processes,” in Proceedings of the 31st International Conference on Neural Information Processing Systems. 3427–3435.

Ali, M., and Aittokallio, T. (2019). Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys. Rev. 11, 31–39. doi: 10.1007/s12551-018-0446-z

PubMed Abstract | CrossRef Full Text | Google Scholar

Bardai, F. H., and D'Mello, S. R. (2011). Selective toxicity by HDAC3 in neurons: regulation by Akt and GSK3beta. J. Neurosci. 31, 1746–1751. doi: 10.1523/JNEUROSCI.5704-10.2011

PubMed Abstract | CrossRef Full Text | Google Scholar

Beurel, E., Grieco, S. F., and Jope, R. S. (2015). Glycogen synthase kinase-3 (GSK3): regulation, actions, and diseases. Pharmacol. Ther. 148, 114–131. doi: 10.1016/j.pharmthera.2014.11.016

PubMed Abstract | CrossRef Full Text | Google Scholar

Breiman, L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324

CrossRef Full Text | Google Scholar

Chen, S., Owens, G. C., Makarenkova, H., and Edelman, D. B. (2010). HDAC6 regulates mitochondrial transport in hippocampal neurons. PLoS ONE 5:e10848. doi: 10.1371/journal.pone.0010848

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, R., and Paschalidis, I. (2018). Learning optimal personalized treatment rules using robust regression informed K-NN. arXiv Preprint arXiv:1811.06083.

Google Scholar

Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. A. (2008). Nonparametric tests for treatment effect heterogeneity. Rev. Econ. Stat. 90, 389–405. doi: 10.1162/rest.90.3.389

CrossRef Full Text | Google Scholar

Davis, C. P. (2020). Bilirubin and Bilirubin Blood Test. MedicineNet. Available online at: https://www.medicinenet.com/bilirubin_and_bilirubin_blood_test/article.htm

Diamond, A., and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies. Rev. Econ. Stat. 95, 932–945. doi: 10.1162/REST_a_00318

CrossRef Full Text | Google Scholar

Ding, P., and Li, F. (2017). Causal inference: a missing data perspective. Statist. Sci. 33, 214–237.

Google Scholar

Emmert-Streib, F., and Dehmer, M. (2019). High-dimensional LASSO-based computational regression models: regularization, shrinkage, and selection. Mach. Learn. Knowl. Extract. 1, 359–383. doi: 10.3390/make1010021

CrossRef Full Text | Google Scholar

Fulda, S. (2015). Promises and challenges of smac mimetics as cancer therapeutics. Clin. Cancer Res. 21, 5030–5036. doi: 10.1158/1078-0432.CCR-15-0365

PubMed Abstract | CrossRef Full Text | Google Scholar

Garcia-Manero, G., Tambaro, F. P., Bekele, N. B., Yang, H., Ravandi, F., Jabbour, E., et al. (2012). Phase II trial of vorinostat with idarubicin and cytarabine for patients with newly diagnosed acute myelogenous leukemia or myelodysplastic syndrome. J. Clin. Oncol. 30, 2204–2210. doi: 10.1200/JCO.2011.38.3265

CrossRef Full Text | Google Scholar

Garson, G. D. (1991). Interpreting neural network connection weights. Artif. Intell. Expert. 6, 47–51.

Google Scholar

Gondal, B., and Aronsohn, A. (2016). A systematic approach to patients with Jaundice. Semin. Intervent. Radiol. 33, 253–258. doi: 10.1055/s-0036-1592331

PubMed Abstract | CrossRef Full Text | Google Scholar

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). “Generative adversarial networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems (Montreal, QC). 2, 2672–2680.

Google Scholar

Hansen, B. (2004). Full matching in an observational study of coaching for the SAT. J. Am. Stat. Assoc. 99, 609–618. doi: 10.1198/016214504000000647

CrossRef Full Text | Google Scholar

Jassal, B., Jupe, S., Matthews, L., Sidiropoulos, K., Gillespie, M., Garapati, P., et al. (2020). The reactome pathway knowledgebase. Nucl.Acids Res. 48, 498–503. doi: 10.1093/nar/gkv1351

PubMed Abstract | CrossRef Full Text | Google Scholar

Jin, Y., Yao, Y., Chen, L., Zhu, X., Jin, B., Shen, Y., et al. (2016). Depletion of γ-catenin by histone deacetylase inhibition confers elimination of CML stem cells in combination with imatinib. Theranostics 6, 1947–1962. doi: 10.7150/thno.16139

PubMed Abstract | CrossRef Full Text | Google Scholar

Johansson, F. D., Shalit, U., and Sontag, D. (2016). “Learning representations for counterfactual inference,” in Proceedings of the 33rd International Conference on Machine Learning. 48, 3020–3029.

Google Scholar

Kennedy, E. H., Ma, Z., McHugh, M. D., and Small, D. S. (2017). Nonparametric methods for doubly robust estimation of continuous treatment effects. J. R. Stat. Soc. Series B Stat. Methodol. 79, 1229–1245. doi: 10.1111/rssb.12212

PubMed Abstract | CrossRef Full Text | Google Scholar

Kornblau, S. M., Tibes, R., Qiu, Y. H., Chen, W., Kantarjian, H. M., Andreeff, M., et al. (2009). Functional proteomic profiling of AML predicts response and survival. Blood 113, 154–164. doi: 10.1182/blood-2007-10-119438

PubMed Abstract | CrossRef Full Text | Google Scholar

Lengerich, B., Aragam, B., and Xing, E. P. (2019). Learning sample-specific models with low-rank personalized regression. Advances in Neural Information Processing Systems 3575–3585.

Google Scholar

Li, Y., Wang, Y., Zhou, Y., Li, J., Chen, K., Zhang, L., et al. (2017). Cooperative effect of chidamide and chemotherapeutic drugs induce apoptosis by DNA damage accumulation and repair defects in acute myeloid leukemia stem and progenitor cells. Clin. Epigenetics 9, 83–83. doi: 10.1186/s13148-017-0377-8

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, J., Ma, Y., and Wang, L. (2018). An alternative robust estimator of average treatment effect in causal inference. Biometrics 74, 910–923. doi: 10.1111/biom.12859

PubMed Abstract | CrossRef Full Text | Google Scholar

Luo, W., and Zhu, Y. (2020). Matching using sufficient dimension reduction for causal inference. J. Business Econ. Stat. 38, 888–900. doi: 10.1080/07350015.2019.1609974

CrossRef Full Text | Google Scholar

Makar, M., Swaminathan, A., and Kiciman, E. (2019). A distillation approach to data efficient individual treatment effect estimation. Proc. AAAI Conf. Artif. Intellig. 33, 4544–4551. doi: 10.1609/aaai.v33i01.33014544

CrossRef Full Text | Google Scholar

Meng, Z., Jia, L. F., and Gan, Y. H. (2016). PTEN activation through K163 acetylation by inhibiting HDAC6 contributes to tumour inhibition. Oncogene 35, 2333–2344. doi: 10.1038/onc.2015.293

PubMed Abstract | CrossRef Full Text | Google Scholar

Mirza, M., and Osindero, S. (2014). Conditional generative adversarial nets. arXiv.

Google Scholar

Pollyea, D. A., Tallman, M. S., de Botton, S., Kantarjian, H. M., Collins, R., Stein, A. S., et al. (2019). Enasidenib, an inhibitor of mutant IDH2 proteins, induces durable remissions in older patients with newly diagnosed acute myeloid leukemia. Leukemia 33, 2575–2584. doi: 10.1038/s41375-019-0472-2

PubMed Abstract | CrossRef Full Text | Google Scholar

Ray, K., and Szabo, B. (2019). Debiased Bayesian inference for average treatment effects. Advances in Neural Information Processing Systems, 11952–11962.

Google Scholar

Rosenbaum, P. R., and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55. doi: 10.1093/biomet/70.1.41

CrossRef Full Text | Google Scholar

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66, 688. doi: 10.1037/h0037350

CrossRef Full Text | Google Scholar

Schaefer, E. W., Loaiza-Bonilla, A., Juckett, M., DiPersio, J. F., Roy, V., Slack, J., et al. (2009). A phase 2 study of vorinostat in acute myeloid leukemia. Haematologica 94, 1375–1382. doi: 10.3324/haematol.2009.009217

PubMed Abstract | CrossRef Full Text | Google Scholar

Seyhan, A. A., and Carini, C. (2019). Are innovation and new technologies in precision medicine paving a new era in patients centric care? J. Transl. Med. 17:114. doi: 10.1186/s12967-019-1864-9

PubMed Abstract | CrossRef Full Text | Google Scholar

Shalit, U., Johansson, F. D., and Sontag, D. (2016). “Estimating individual treatment effect: generalization bounds and algorithms,” in Proceedings of the 34th International Conference on Machine Learning. 70, 3076–3085.

Google Scholar

Shin, S. H., Bode, A. M., and Dong, Z. (2017). Addressing the challenges of applying precision oncology. NPJ Precision Oncol. 1:28. doi: 10.1038/s41698-017-0032-z

PubMed Abstract | CrossRef Full Text | Google Scholar

Siu, C. (2017). Day33: Garson's Algorithm. Available online at: https://csiu.github.io/blog/update/2017/03/29/day33.html (accessed March 29, 2017).

Steinwascher, S., Nugues, A. L., Schoeneberger, H., and Fulda, S. (2015). Identification of a novel synergistic induction of cell death by Smac mimetic and HDAC inhibitors in acute myeloid leukemia cells. Cancer Lett. 366, 32–43. doi: 10.1016/j.canlet.2015.05.020

PubMed Abstract | CrossRef Full Text | Google Scholar

Subbiah, V., and Kurzrock, R. (2018). Challenging standard-of-care paradigms in the precision oncology era. Trends Cancer 4, 101–109. doi: 10.1016/j.trecan.2017.12.004

PubMed Abstract | CrossRef Full Text | Google Scholar

Szklarczyk, D., Gable, A. L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., et al. (2019). STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucl. Acids Res. 47, 607–613. doi: 10.1093/nar/gky1131

PubMed Abstract | CrossRef Full Text | Google Scholar

Wager, S., and Athey, S. (2015). Estimation and inference of heterogeneous treatment effects using random forests. Stat Med. 37, 3309–3324.

Yoon, J., Jordon, J., and van der Schaar, M. (2018a). GAIN: missing data imputation using generative adversarial nets. Proceedings of the 35th International Conference on Machine Learning. 80, 5689–5698.

Google Scholar

Yoon, J., Jordon, J., and van der Schaar, M. (2018b). GANITE: Estimation of Individualized Treatment Effects Using Generative Adversarial Nets. ICLR.

Google Scholar

Zhang, G., and Gan, Y. H. (2017). Synergistic antitumor effects of the combined treatment with an HDAC6 inhibitor and a COX-2 inhibitor through activation of PTEN. Oncol. Rep. 38, 2657–2666. doi: 10.3892/or.2017.5981

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, Z., Beck, M. W., Winkler, D. A., Huang, B., Sibanda, W., et al. (2018). Opening the black box of neural networks: methods for interpreting neural network models in clinical applications. Ann. Transl. Med. 6:216. doi: 10.21037/atm.2018.05.32

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: causal inference, generative adversarial networks, counterfactuals, treatment estimation, precision medicine

Citation: Ge Q, Huang X, Fang S, Guo S, Liu Y, Lin W and Xiong M (2020) Conditional Generative Adversarial Networks for Individualized Treatment Effect Estimation and Treatment Selection. Front. Genet. 11:585804. doi: 10.3389/fgene.2020.585804

Received: 21 July 2020; Accepted: 18 November 2020;
Published: 11 December 2020.

Edited by:

Chao Xu, University of Oklahoma Health Sciences Center, United States

Reviewed by:

Shuangxi Ji, University of Texas MD Anderson Cancer Center, United States
Chuan Qiu, Tulane University, United States

Copyright © 2020 Ge, Huang, Fang, Guo, Liu, Lin and Xiong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Momiao Xiong, bW9taWFvLnhpb25nQHV0aC50bWMuZWR1

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.