An Explainable Bayesian Decision Tree Algorithm

Bayesian Decision Trees provide a probabilistic framework that reduces the instability of Decision Trees while maintaining their explainability. While Markov Chain Monte Carlo methods are typically used to construct Bayesian Decision Trees, here we provide a deterministic Bayesian Decision Tree algorithm that eliminates the sampling and does not require a pruning step. This algorithm generates the greedy-modal tree (GMT), which is applicable to both regression and classification problems. We tested the algorithm on various benchmark classification data sets and obtained accuracies similar to other known techniques. Furthermore, we show that we can statistically analyze how the GMT was derived from the data, and we demonstrate this analysis with a financial example. Notably, the GMT allows for a technique that provides simpler, explainable models, which is often a prerequisite for applications in finance or the medical industry.


INTRODUCTION
The success of machine learning techniques applied to financial and medical problems can be encumbered by the inherent noise in the data. When the noise is not properly accounted for, there is a risk of overfitting the data, generating unnecessarily complex models that may lead to incorrect interpretations. Thus, there have been many efforts aimed at increasing model interpretability in machine learning applications [1][2][3][4][5].
Decision Trees (DT) are popular machine learning models applied to both classification and regression tasks with known training algorithms such as CART [6], C4.5 [7], and boosted trees [8]. With fewer nodes than other node-based models, DT are considered explainable models. In addition, the tree structure can return the output with considerably fewer computations than other more complex models. However, as discussed by Linero in [9], greedily constructed trees are unstable. To improve the stability, newer algorithms utilize tree ensembles such as bagged trees [10], Random Forests (RF) [11], and XGBoost (XG) [12]. But increasing the number of trees also increases the number of nodes and therefore the complexity of the model.
The Bayesian approach was introduced to solve the DT instability issue while producing a single tree model that accounts for the noise in the data. The first techniques, also known as Bayesian Decision Trees, were introduced in [13], BCART [14,15], and BART [16]. The first article proposed a deterministic algorithm, while the others are based on Markov Chain Monte Carlo convergence. Some recent studies have improved upon these algorithms (for a review see [9]) and include a detailed interpretability analysis of the model [17]. While most of the Bayesian work is based on Markov Chain convergence, here we take a deterministic approach that: 1) considers the noise in the data, 2) generates less complex models measured in terms of the number of nodes, and 3) provides a statistical framework to understand how the model is constructed.
The proposed algorithm takes [13] as its starting point, introduces the trivial partition to avoid the pruning step, and generalizes the approach to employ any conjugate prior. Although this approach is Bayesian, the resulting tree is deterministic given the input data and model parameters. Since it is deterministic, one can easily analyze the statistical reasons behind the choice of each node. We start with an overview of Bayesian Decision Trees in Section 2. Section 3 describes the building block of our algorithm, namely the partition probability space, and provides the algorithms to construct the greedy-modal tree (GMT). Section 4 benchmarks the GMT against common techniques, showing that the GMT works well for various publicly available data sets. Finally, a trading example is discussed in Section 5, followed by concluding remarks in Section 6.

BAYESIAN DECISION TREES OVERVIEW
A Decision Tree is a directed acyclic graph. Every node has a parent node except the root, which has none. The level ℓ ∈ ℕ₀ of a node is its number of ancestors, starting from 0 at the root. We classify the nodes as either sprouts or leaves. While sprouts point to two child nodes in the case of binary trees, leaves are terminal nodes containing the model information. Each sprout contains a rule used to choose one of its children. To query the tree, we start at the root node and apply the rules to an input, selecting child nodes until we reach a leaf.
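To make the query procedure concrete, here is a minimal sketch in Python; the class names and the string payloads are illustrative, not the paper's implementation:

```python
class Leaf:
    def __init__(self, payload):
        self.payload = payload  # e.g. posterior hyper-parameters


class Sprout:
    def __init__(self, r, h, left, right):
        self.r, self.h = r, h            # rule: split dimension and location
        self.left, self.right = left, right


def query(node, x):
    """Follow the sprout rules from the root until a leaf is reached."""
    while isinstance(node, Sprout):
        node = node.left if x[node.r] <= node.h else node.right
    return node.payload


# A depth-1 binary tree: one sprout splitting dimension 0 at h = 0.5.
tree = Sprout(0, 0.5, Leaf("low"), Leaf("high"))
```

Querying `tree` with an input whose first feature is below 0.5 reaches the left leaf; otherwise the right one.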
We can use Decision Trees to partition R^d and assign an output to each subset of the partition. In this work, we restrict ourselves to finite partitions of R^d: each leaf of the tree corresponds to exactly one subset of the partition, one-to-one and onto. Our approach starts from the Bayesian Decision Tree framework, which assigns a marginal likelihood to a Decision Tree based on some input data. We define the input data as D = {(x_i, y_i)}_{i=1}^n with n independent observations. A point x = (x^1, . . . , x^d) in R^d contains the features of each observation, whose outcome y is randomly sampled from a random field Y_x. The Bayesian Decision Tree assumes the distribution of Y_x is constant within each leaf. Given x, the tree returns the posterior distribution of the parameters θ generating Y within the leaf that x belongs to. In practice, the distribution of Y determines the type of problem we are solving: a discrete random variable translates into a classification problem, whereas a continuous random variable translates into a regression problem.
The probability of such a Bayesian Decision Tree, namely T, can be computed with the usual Bayes approach,

p(T|D) = p(D|T) p(T) / p(D),    (1)

where p(T) is the prior distribution over the tree space. To compute the marginal likelihood p(D|T), we consider the partition {D_1, . . . , D_k} of D induced by T and take the product of the marginal likelihoods at each leaf,

p(D|T) = ∏_{j=1}^{k} L(D_j),    L(D_j) = ∫ ∏_{(x_i, y_i) ∈ D_j} p(y_i|θ) p(θ) dθ.    (2)

The probability p(θ) from Eq. 2 is the prior distribution of the parameters θ. In this article, we assume for simplicity that p(θ) is independent of T, although this is not a requirement. The purpose of p(θ) is therefore two-fold: to obtain the tree probability from Eq. 1, and to compute the posterior distribution of the parameters generating Y at each leaf. More information about conjugate priors and marginal likelihoods can be found in [18].
In an attempt to build explainable Bayesian Decision Trees, we define a greedy construction that does not apply Markov Chain Monte Carlo. This construction balances the greedy approach from [6] with the Bayesian approach discussed in [9,[14][15][16][17]. For this, we compute the probability of each split at every node and choose the modal split. This results in a model that performs well with different data sets as shown in Section 4.

FROM THE PARTITION PROBABILITY SPACE TO BAYESIAN DECISION TREES
The building block of the GMT algorithm is the partition probability space. For this space, we only consider binary partitions of R^d of the form S_{r,h} = {{x : x^r ≤ h}, {x : x^r > h}}, defined by a dimension r and a location h. Any partition of this form will induce a partition {D_1, D_2} of D. Note that either of these two subsets is allowed to be the empty set. Finally, for each dimension we identify all partitions that result in the same nonempty D_1 and D_2. All partitions that leave D_1 or D_2 empty are identified with the trivial partition S_0. After identification, we have 1 + Σ_{r=1}^{d} (n_r − 1) different partitions S, n_r being the number of distinct features along dimension r. Following the minimum margin classifier idea, the representative location h of each partition is placed at the mid-point between contiguous distinct features along a dimension. The partition probability space is the finite probability space defined by the partitions S and their probabilities p(S|D). These probabilities can be computed using Eqs 1 and 2 by replacing T with S. In practice, we work with ln[p(D|S)p(S)], the log-probabilities from Eq. 1 omitting the constant normalizing factor p(D):

ln[p(D|S_{r,h}) p(S_{r,h})] = ln L(D_1) + ln L(D_2) + ln p(S_{r,h}),    (3)

and ln[p(D|S_0) p(S_0)] = ln L(D) + ln p(S_0) for the trivial partition. We will also need the sorted indices of the input features for computation and visualization purposes, namely i^r_1, . . . , i^r_n such that x^r_{i^r_j} ≤ x^r_{i^r_{j+1}} for all r = 1, . . . , d and j = 1, . . . , n − 1. An example of the partition probability space is shown in Figure 2. In this example, the inputs x live in R^2 and the outcomes are drawn from Bernoulli random variables. The points x are generated from a mixture of two equally likely Gaussian distributions, i.e., we draw from each distribution with probability 0.5. The first distribution is a multivariate Gaussian with mean (−1, −1)^t and covariance 2I; points sampled from it have a probability of 0.25 of being green. The second distribution is a multivariate Gaussian with mean (1, 3)^t and covariance 0.5I; in this case, the probability of being green is 0.75. Because the means of these Gaussian distributions are further apart along the x^2 axis, the most probable partitions given the data are found along this dimension.
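As an illustration of the identification step, the candidate split locations along one dimension are the mid-points between contiguous distinct feature values (a sketch; the function name is ours):

```python
def candidate_locations(values):
    """Mid-points between contiguous distinct feature values along one
    dimension. Returns n_r - 1 locations, n_r being the number of distinct
    values; a dimension with a single distinct value yields no candidates."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
```

For example, the values [1, 3, 3, 5] have three distinct features and therefore two candidate locations, 2 and 4.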
Each partition S can be encoded into a tree node N: if the partition is S_0, the node becomes a leaf and stores the posterior hyper-parameters of θ; for any other S_{r,h}, the node becomes a sprout and stores the values r and h.

FIGURE 2 | Example of a data set whose outcomes are either green or red. The location of the points is sampled from a mixture of two Gaussian distributions with equal probability. One distribution draws outcomes from a Bernoulli distribution with probability 0.25, while the other from a Bernoulli with probability 0.75. On the left: log-probabilities of all possible non-trivial partitions given the data set. On the right: actual probability of a point being green and modal splits along each dimension.

Frontiers in Applied Mathematics and Statistics | www.frontiersin.org March 2021 | Volume 7 | Article 598833

Among all partitions, the mode and its node are of particular interest. Algorithm 1 returns the modal node in the general case. For the classification problem, we also provide Algorithm 2 with O(dn) cost, assuming p(θ) follows a Dirichlet conjugate prior. Both algorithms start by computing ln[p(D|S_0)p(S_0)] and initializing N to be a leaf. Then, they loop through each dimension and the sorted features to verify whether there exists a new node with higher log-probability. Because the features are sorted, at most one observation moves from D_2 to D_1 when j increases by one.
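A single-dimension sketch of the sweep used in Algorithm 2 for classification, assuming a Dirichlet conjugate prior and omitting the p(S) prior terms for brevity (the function names are ours, not the paper's code):

```python
from math import lgamma


def log_marginal_dirichlet(counts, alpha):
    """ln of the Dirichlet-categorical marginal likelihood for class counts."""
    a0, n = sum(alpha), sum(counts)
    out = lgamma(a0) - lgamma(a0 + n)
    for c, a in zip(counts, alpha):
        out += lgamma(a + c) - lgamma(a)
    return out


def best_split_1d(xs, ys, k, alpha):
    """Sweep the sorted features once, moving one class count from D_2 to D_1
    per step, and return (best log marginal, location) over non-trivial splits.
    Duplicate feature values are skipped: they are identified with the same
    partition as the next distinct value."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    c1, c2 = [0] * k, [0] * k
    for y in ys:
        c2[y] += 1
    best = None
    for j in range(len(order) - 1):
        i = order[j]
        c1[ys[i]] += 1
        c2[ys[i]] -= 1
        if xs[order[j]] == xs[order[j + 1]]:
            continue
        h = (xs[order[j]] + xs[order[j + 1]]) / 2  # mid-point location
        lp = log_marginal_dirichlet(c1, alpha) + log_marginal_dirichlet(c2, alpha)
        if best is None or lp > best[0]:
            best = (lp, h)
    return best
```

Each of the n − 1 sweep steps is O(k) once the features are sorted, which is where the O(dn) cost over d dimensions comes from.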
With the partition space and modal node defined, we can introduce the GMT construction. We start by finding the modal node N for our initial data set D. This node is the root of the tree and will be returned by the train method in Algorithm 3. If N is a leaf, the GMT is complete. Otherwise, we split D into D_1 and D_2 according to N. We repeat the process for the new input data sets D_1 and D_2, and link the resulting nodes N_1 and N_2 to their parent N. The recursion is defined in the grow_tree method from Algorithm 3. Note that Algorithms 1 and 2 are just two implementations of find_modal_node; one can replace this method with any other that returns the desired node based on the partition space. In addition, one can easily compute ln[p(D|T)] for the GMT by adding the leaves' ln[L(D_j)] calculated in Algorithm 1 line 2, or Algorithm 2 line 2. In practice, we observed that the GMT marginal log-likelihood ln[p(D|T)] tends to be the highest among the trees obtained when exploring different possible roots.
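The grow_tree recursion just described can be sketched as follows, with find_modal_node passed in as a callable; the Node structure is a simplified stand-in for the paper's implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    is_leaf: bool
    r: int = 0                      # split dimension (sprouts only)
    h: float = 0.0                  # split location (sprouts only)
    payload: object = None          # posterior hyper-parameters (leaves only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def grow_tree(D, level, find_modal_node):
    """Recursion of Algorithm 3 (sketch): find the modal node for D; if it is
    a sprout, split D by its rule and grow both children one level deeper."""
    node = find_modal_node(D, level)
    if node.is_leaf:
        return node
    D1 = [(x, y) for (x, y) in D if x[node.r] <= node.h]
    D2 = [(x, y) for (x, y) in D if x[node.r] > node.h]
    node.left = grow_tree(D1, level + 1, find_modal_node)
    node.right = grow_tree(D2, level + 1, find_modal_node)
    return node
```

Any function with the find_modal_node signature can be plugged in, which is the flexibility the text mentions.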
The average cost of Algorithm 3 is O[c(n) ln(n)], where c(n) is the cost of find_modal_node. If we choose find_modal_node to be Algorithm 2, the average cost of Algorithm 3 becomes O[dn ln(n)]. While Algorithms 1 and 2 only look one level ahead, we could improve the greedy exploration by looking several levels ahead as suggested in [13]. Looking m levels ahead comes at the expense of increasing the order of c(n), for instance c(n) = O[(dn)^m] in the case of Algorithm 2. Section 4 shows that the GMT constructed by looking only one level ahead performs well in practice.

Decision Trees, Random Forests, XGBoost, and GMT
In this section we use Algorithms 2 and 3 to construct the GMT. We assume that the outcomes, 0 or 1, are drawn from Bernoulli random variables. The prior distribution p(θ) is chosen to be Beta(10, 10), and each tree returns the expected probability of drawing the outcome 0. The prior probabilities for each partition are p(S_0) = 1 − 0.9^{1+ℓ} and p(S_{r,h}) = 0.9^{1+ℓ}/(d n_r), where ℓ is the level and n_r the number of non-trivial partitions along r. Note that the denominator d in p(S_{r,h}) implicitly assumes a uniform prior distribution over the dimensions. One could also project the probabilities onto each dimension to visualize which features are most informative. As an alternative to the suggested p(S), one can use the partition margin weighted approach from [13].
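These priors can be computed as in the following sketch (the function name is ours); note that the probabilities sum to one at each level, since the q^{1+ℓ} mass is spread uniformly over the d dimensions and the n_r non-trivial splits of each:

```python
def split_priors(level, d, n_r, q=0.9):
    """Priors used in the experiments (sketch).

    level -- node level ℓ
    d     -- number of feature dimensions
    n_r   -- list of non-trivial partition counts, one per dimension
    q     -- base of the depth penalty (0.9 in the experiments)
    """
    p_trivial = 1.0 - q ** (1 + level)                      # p(S_0)
    p_split = [q ** (1 + level) / (d * nr) for nr in n_r]   # p(S_{r,h}) per dim
    return p_trivial, p_split
```

Deeper levels make p(S_0) larger, so more statistical evidence is needed to keep splitting; this is what replaces an explicit pruning step.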
The accuracy is measured as the percentage of correct predictions. Each prediction is simply the highest probability outcome; if there is a tie, we choose category 0 by default. We compare the GMT results to DT [6,19], RF [11,19], and XG [12]. For reproducibility purposes, we set all random seeds to 0. In the case of RF, we enable bootstrapping to improve its performance. We also fix the number of trees to five for both RF and XG. We provide the GMT Python module with integration into scikit-learn in [20]. We test the GMT on a selection of data sets from the University of California, Irvine (UCI) database [21]. We compute the accuracy of the DT, RF, XG, and GMT with a shuffled 10-fold cross validation. We do not perform any parameter tuning and keep the same p(θ) and p(S) for all examples. Accuracy is shown in Table 1, while training time and node count are shown in Table 2.
The results reveal some interesting properties of the GMT. Noticeably, the GMT seems to perform well in general. In all cases, the DT accuracy is lower than the RF accuracy. The only case in which RF considerably outperforms the GMT is the EEG data set. One reason may be that some information is hidden at the lower levels, i.e., feature correlation information that is hard to extract by looking only one level ahead. The accuracy difference between GMT and RF indicates that these two techniques may work well for different data sets. Interestingly, XG and GMT yield similar accuracies. Finally, in most cases the GMT takes more time to train than the other three techniques, which is caused by the feature sorting overhead. Notably, the node count in Table 2 shows that we successfully simplified the models while producing similar accuracy. Note that for four of the seven data sets, the average number of nodes is less than ten with slightly better accuracies than RF. Ten nodes implies fewer than five sprouts on average, which can easily be analyzed by a human. This highlights the importance of the priors p(S) and p(θ) in avoiding a pruning step. The strength of these two priors determines how much statistical evidence we require from our data to produce a meaningful split. In Section 5, we take a deeper look and explain the reasons behind the GMT construction with a finance application.

Bayesian Decision Trees and GMT
In this section we analyze the GMT on the Wisconsin breast-cancer data set studied in [9,14], which is available at the University of California, Irvine (UCI) database [21]. Although this data set contains 699 observations, we use the 683 that are complete. Each observation contains nine features, and the outcome classifies the tumor as benign or malignant. We test the GMT for p(S_0) = 1 − q^{1+ℓ}, q ∈ {0.75, 0.8, 0.85, 0.90, 0.95, 0.97}, and a Dirichlet prior with parameters (α_1, α_2) ∈ {1, 2, 3, 4, 5, 10}^2. For each of the 216 parameter sets we perform a 10-fold cross validation and plot the average accuracy in Figure 3. The results display a lower average accuracy compared to the 98.4% reported for BCART [14] with nine or more leaves and the 96.8% for BART [9]. When we run the methods and parameters from Section 4.1 we obtain 95.3% for DT, 95.8% for RF, and 95.5% for XG. We were unable to compare the BCART and BART performance on the data sets from Section 4.1 due to the lack of available software.

TRADING EXAMPLE
We consider three stocks, A, B, and C, whose prices follow a multidimensional Ornstein-Uhlenbeck process [22]. Using the notation from [22], we can sample the prices by applying Euler's discretization,

X_{t+Δt} = μ + e^{−θΔt}(X_t − μ) + G.

We assume that σ is a unitary matrix; therefore the random vector G follows a normal distribution with mean 0 and covariance (θ + θ^T)^{−1}[I − e^{−(θ+θ^T)Δt}]. For this example, we set the parameters to the values given in Eq. 4. Our goal is to train the GMT to predict the best portfolio configuration. Given that we have three stocks, we consider the following eight buy/sell configurations: +A/−B/−C (buy one stock A, sell one stock B, sell one stock C), −A/−B/−C, −A/+B/−C, +A/+B/−C, −A/−B/+C, +A/−B/+C, −A/+B/+C, +A/+B/+C. At each time step, we take the three stock prices X_{t_i} as inputs. The outcome is defined as the configuration that corresponds to the next price move, i.e., sign(X_{t_{i+1}} − X_{t_i}) componentwise. For example, if the prices are (100, 105, 110) at t_i and (110, 100, 120) at t_{i+1}, the features are (100, 105, 110), the outcome is +A/−B/+C, and the profit between t_i and t_{i+1} for this portfolio is +(110 − 100) − (100 − 105) + (120 − 110) = 25. Each portfolio configuration is identified with an integer from 0 to 7. We sample 10,000 time steps, train on the first 8,000 observations, and test on the next 2,000.
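A scalar (one-stock) sketch of the sampling step above, with σ = 1 and illustrative parameters since the values of Eq. 4 are not reproduced here; in one dimension the covariance reduces to (2θ)^{−1}(1 − e^{−2θΔt}):

```python
import numpy as np


def sample_ou(mu, theta, dt, n_steps, x0, rng):
    """Sample a 1-D Ornstein-Uhlenbeck path with the exact discretization
    X_{t+dt} = mu + exp(-theta*dt) * (X_t - mu) + G,
    where G ~ N(0, (1 - exp(-2*theta*dt)) / (2*theta))."""
    var = (1.0 - np.exp(-2.0 * theta * dt)) / (2.0 * theta)
    decay = np.exp(-theta * dt)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        g = rng.normal(0.0, np.sqrt(var))
        x[t + 1] = mu + decay * (x[t] - mu) + g
    return x
```

The path mean-reverts toward mu, which is the behavior the GMT rules below end up exploiting for stocks A and B.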
We treat this problem as an eight-class classification problem. The GMT is trained with the p(S) from Section 4 and a Dirichlet conjugate prior p(θ) with hyper-parameters (1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8). To benchmark the results, we train a 10 × 10 node neural network using the MLPClassifier from [19]. During the test phase, our predictions are the expected modal portfolio configuration: if the input x returns the leaf posterior hyper-parameters α*, we predict the portfolio arg max_k {α*_k / Σ_p α*_p}. The test results are shown in Figure 4. The GMT we obtained by training on the first 8,000 observations has only four leaves: if the price of stock A is below 99.96 and the price of stock B below 109.85, we choose +A/+B/−C; if the price of A is below 99.96 and the price of B above 109.85, we choose +A/−B/−C; if the price of A is above 99.96 and the price of B below 110.14, we choose −A/+B/−C; and if the price of A is above 99.96 and the price of B above 110.14, we choose −A/−B/+C. Although the mean reversion of stock C is not captured in this model, we successfully recovered simple rules to trade the mean reversion of A and B. Since the price of C is more volatile by Eq. 4, the current price of C alone is not enough to recover the mean reversion decision logic; some filtering of the price of C would allow capturing its mean reversion. In the neural network case, the over-parametrization makes it difficult to recover this simple model. The deterministic nature of Algorithm 3 provides a practical framework to explain how the GMT was constructed. We look at each of the nodes to understand how the modal nodes were chosen. The resulting GMT model contains three sprouts (nodes 0, 1, and 2) and four leaves (nodes 3, 4, 5, and 6). Figure 5 shows the log-probability (Eq. 3) of splitting our data set at a particular price of each stock for the three sprouts. At the root level, node 0, we consider the whole data set.
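The prediction rule, the expected modal configuration given a leaf's posterior hyper-parameters α*, can be sketched as (the function name is ours):

```python
def predict_portfolio(alpha_star):
    """Return arg max_k alpha*_k / sum_p alpha*_p, the configuration with the
    highest posterior expected probability; ties resolve to the lowest index,
    matching the category-0 default used in the benchmarks."""
    total = sum(alpha_star)
    return max(range(len(alpha_star)), key=lambda k: alpha_star[k] / total)
```

For example, a leaf whose posterior counts favor configuration 3 predicts index 3, while a perfectly flat posterior falls back to index 0.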
In this case, one can increase the GMT likelihood the most by choosing S_{0,99.96}, i.e., splitting the data according to stock A's price at 99.96. After this node becomes a sprout, the input data is split into two subsets of sizes 3,640 (inputs with stock A's price below 99.96) and 4,360 (inputs with stock A's price above 99.96). These two subsets' partition log-probabilities are then shown in the node 1 and node 2 plots, respectively. By looking at the two figures, we conclude that we can maximize the log-probability by splitting at stock B's price 109.85 for node 1 and at stock B's price 110.14 for node 2.
The black horizontal line in each figure marks the log-probability of S_0, i.e., the stopping condition. When every possible split log-probability is below this line, the node is chosen to be a leaf, as happens in this example for nodes 3, 4, 5, and 6. Finally, note that by symmetry, the blue, orange, and green lines should look periodic because the extreme splits only separate one input point from the data set. In addition, the green line looks convex, which indicates it is better not to split the data based on stock C's price.

DISCUSSION AND FUTURE WORK
The proposed GMT is a deterministic Bayesian Decision Tree algorithm that reduces the training time by avoiding both Markov Chain Monte Carlo sampling and a pruning step. The numerical examples show accuracies similar to other known techniques. This approach may be most useful where the ability to explain the model is a requirement; hence, a key advantage of the GMT is that it can be easily understood. Furthermore, the ability to specify p(θ) and p(S) may be particularly suitable for noisy problems. However, it is not clear whether the hyper-parameters used in the examples are optimal for each data set. Future work will explore the sensitivity and tuning of different prior distributions. It also remains to find a more efficient deterministic way to explore meaningful trees, as Markov Chain Monte Carlo based Bayesian Decision Trees do.
As an extension, we would like to assess the performance of this algorithm on regression problems and experiment with larger partition spaces such as the SVM hyperplanes. Another computational advantage not explored is parallelization, which would allow for a more exhaustive exploration of the tree probability space from Eq. 1.