A Tree-Based Multiscale Regression Method

Cai, Haiyan; Jiang, Qingtang

doi:10.3389/fams.2018.00063

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 21 December 2018

Sec. Mathematics of Computation and Data Science

Volume 4 - 2018 | https://doi.org/10.3389/fams.2018.00063

Haiyan Cai^*

Qingtang Jiang

The Department of Mathematics and Computer Science, University of Missouri–St. Louis, St. Louis, MO, United States

A tree-based method for regression is proposed. In a high dimensional feature space, the method has the ability to adapt to the lower intrinsic dimension of data if the data possess such a property so that reliable statistical estimates can be performed without being hindered by the “curse of dimensionality.” The method is also capable of producing a smoother estimate for a regression function than those from standard tree methods in the region where the function is smooth and also being more sensitive to discontinuities of the function than smoothing splines or other kernel methods. The estimation process in this method consists of three components: a random projection procedure that generates partitions of the feature space, a wavelet-like orthogonal system defined on a tree that allows for a thresholding estimation of the regression function based on that tree and, finally, an averaging process that averages a number of estimates from independently generated random projection trees.

1. Introduction

We consider the problem of estimating the unknown function in the model:

\begin{array}{l} Y = f (X) + ε, \quad X \in ℝ^{p}, \quad Y \in ℝ & (1) \end{array}

with a sample (x_i, y_i), i = 1, …, n, where ε is a random variable with E(ε) = 0 and Var(ε) = σ². We are mainly interested in the case where the dimension p of the feature variable X is large, either relative to the sample size n of data or in itself, although the method of this paper also applies well to lower dimensional cases. When p is large, we assume that the domain of the function f is actually restricted to a much lower dimensional but unknown sub-manifold of ℝ^p. We refer to the dimension of this sub-manifold as the intrinsic dimension of the regression problem.

An important question to ask in this regression setting is that without having to learn the sub-manifold first, can one find an estimator for f that automatically adapts to the intrinsic low dimensional structure of data in the sense that it can still provide efficient and reliable estimates, without being hindered by the “curse of dimensionality” [1, 2] due to the large p? Bickel and Li [3] provide an affirmative answer to this question by showing that local polynomial regressions can achieve such a property under regularity conditions so that the decay of the prediction error in sample size depends only on the intrinsic dimension rather than p.

It turns out that local polynomial regression is not the only approach capable of adapting to the intrinsic dimension of data; this is a more general phenomenon. Tree-based methods are useful alternatives and can be computationally more efficient. The work of Binev et al. [4, 5] provides arguments implying that certain tree-based approximations are adaptive. Later, Dasgupta and Freund [6] and Kpotufe and Dasgupta [7] demonstrated such adaptability explicitly for their random projection trees when the intrinsic dimension is defined to be the so-called doubling dimension. A tree-based regression [7–11] recursively partitions the feature space ℝ^p into finer and finer subsets and then estimates f locally within each finest subset. Such a process has a natural hierarchical tree structure, and the corresponding tree is called a partitioning tree. An extension of the tree methods is the random-forest approach [8], in which an average of individual tree estimates based on bootstrap samples is used. In addition to the adaptive property, Dasgupta and Freund [6] and Kpotufe and Dasgupta [7] argue that tree methods can be computationally more efficient than kernel smoothing methods (including local polynomial regression) or k-nearest neighbor methods: a tree method needs only O(log n) steps, the height of a balanced tree, to reach a prediction, whereas a kernel-type method needs Ω(n) and a k-nearest neighbor method needs Ω(n^{1/2}).

We note that most of the tree-based regression methods depend exclusively on the means of y_i's over the partitioning subsets in constructing their partitioning trees, and use the means of the tree leaves for final estimates. While such estimates can be MLEs of the corresponding sampling distributions, they need not be optimal in terms of minimax theory [12]. Better nonparametric estimates can often be obtained through shrinkage and thresholding of the mean values. Some recent developments provide a framework for a wavelet-like representation and analysis for tree-structure data. These methods adapt naturally to the shrinkage and thresholding processes in estimation. The unbalanced Haar orthogonal systems (also called the Haar-type wavelet orthogonal systems) on a given tree were constructed by several authors [13–15]. We note that the unbalanced Haar wavelets were first constructed in Mitrea [16] for dyadic cubes in ℝ^p and were later generalized in Girardi and Sweldens [15] to more general trees.

With the background above, we propose in this paper to integrate the wavelet-like analysis on trees with an efficient tree partitioning method so that an estimator for f that adapts to the intrinsic low dimension of data can be obtained. However, our numerical experiments show that a simple combination of those ideas does not work well. Note that the ideas and results discussed above are mostly theoretical, and there is little discussion in the literature of the actual numerical performance of those approaches. For example, the partition methods discussed in Binev et al. [4, 5], while possessing very nice analytical properties (the rates of convergence), would have a computational complexity that grows exponentially in p. Our numerical experiments also indicate that, when compared to standard methods such as random-forest, support vector machines, or even classical CART, neither the random projection tree regression of Dasgupta and Freund [6] and Kpotufe and Dasgupta [7] nor a wavelet soft-thresholding method based on trees generated through various partitioning procedures, including CART and random projection, shows much, if any, improvement in prediction error. A better approach is needed in order to fulfill the theoretical potential of these ideas. We propose a regression estimator in this paper that mitigates the difficulties. Our focus in this paper is mainly the methodology.

We only consider binary trees in this paper. The same ideas can be easily applied to more general trees. The tree-based estimator proposed here consists of three components. The first component is a partitioning algorithm. In principle any reasonable binary classifier, supervised or unsupervised, can be a candidate. The random projection algorithm of Dasgupta and Freund [6] and Kpotufe and Dasgupta [7] and the uniform partition of Binev et al. [4, 5] are unsupervised. For the kind of data we have, however, we prefer supervised classifiers so that the information from the y_i's can be utilized. We propose one such classifier below. The second component is a wavelet-like tree-based estimation process that incorporates thresholding and shrinkage operations. It turns out that this can be done in a statistically rather intuitive manner. The final component, a necessary and crucial one, is an averaging process that takes the average of the estimates across several independently generated partitioning trees on the same training data. This averaging process is different from that of a random-forest. In random-forest, the average is taken over bootstrapped samples, and our experiments indicate that it does not work well in some problems. We will call our method “averaging random tree regression.”

While proposed as a computationally feasible algorithm for regression with low intrinsic dimensional data, our method also possesses other good properties. We compare this estimator through numerical experiments with some standard non-parametric estimators using higher dimensional data. We also compare it in a low dimensional setting with a smoothing spline and random-forest. In low dimensional experiments, we observe that the proposed estimator is capable of producing visually smoother estimates than CART-based estimators while, at the same time, being much more sensitive than a smoothing spline estimator to discontinuities of f, as would be expected.

This paper is organized as follows. Section 2 gives a detailed description of the regression method and demonstrates some interesting features of the method. Section 3 gives some examples of our method. Section 4 provides a formulation of the method in terms of wavelet-like multilevel analysis on a tree. It includes a formal description of decomposition and reconstruction algorithms for the soft-thresholding estimation and an analytical study that establishes properties necessary for the thresholding method to work.

2. The Averaging Random Tree Regression

Suppose a dataset $\{(x_{i}, y_{i}), x_{i} \in ℝ^{p}, y_{i} \in ℝ, i = 1, \dots, n\}$ is given. Let $X = \{x_{1}, \dots, x_{n}\}$ and $Y = \{y_{1}, \dots, y_{n}\}$ . For any subset $B \subset X$ , we will write ȳ_B for the mean of the y_i's in B: $\sum_{x_{i} \in B} y_{i} / n_{B}$ , where n_B denotes the size of B. Our tree-based method consists of three parts. We give a description for each of them below.

2.1. A Classifier for Space Partition and the Partitioning Tree

We first need a binary classifier which acts on the subsets of $X$ and partitions any given subset $A \subset X$ of at least two points into two finer subsets based on some classification criterion. Note that the partitioning method of the classical CART regression can certainly be one such classifier, but it can be computationally inefficient for higher dimensional data. Here we propose a different classifier based on the combination of an extension of the random projection proposed in Dasgupta and Freund [6] and Kpotufe and Dasgupta [7] and the minimum sum of squares criterion of CART. The original random projection partition method does not utilize any information from the y_i's. This is a weakness of that method and we try to avoid it in the current approach. The idea is to choose, among several random projection partitions, the one that has the least mean square error for the y_i's. The partitioning classifier works as follows.

Let M be a preset positive integer. For any subset A of $X$ with r points, we generate M unit directional vectors v₁, …, v_M in ℝ^p from a prior distribution (a uniform distribution, for example). For each v_j, j = 1, …, M, we project all the p-dimensional points x in A onto this unit vector to obtain a set of r scalars $v_{j}^{T} x$ . Let m_{v_j} be the median of these scalars. Let

A_{L, j} = \{x \in A : v_{j}^{T} x < m_{v_{j}}\} \quad \text{and} \quad A_{R, j} = \{x \in A : v_{j}^{T} x > m_{v_{j}}\} .

If there are one or more points x_i ∈ A, say s ≥ 1 of them, such that $v_{j}^{T} x_{i} = m_{v_{j}}$ , we split these points randomly into subsets of ⌊s/2⌋ and ⌈s/2⌉ points, assign the two subsets with equal probability to A_{L, j} or A_{R, j}, and still denote the resulting subsets as A_{L, j} and A_{R, j}. In this way we obtain the partition A = A_{L, j} ∪ A_{R, j}. Next we find the sum of squares for this directional vector v_j:

\begin{array}{l} S (v_{j}) = \sum_{x_{i} \in A_{L, j}} {(y_{i} - ȳ_{A_{L, j}})}^{2} + \sum_{x_{i} \in A_{R, j}} {(y_{i} - ȳ_{A_{R, j}})}^{2} . & (2) \end{array}

We determine the vector v ∈ {v₁, …, v_M} with the smallest S(v_j) value:

v = \arg\min_{v_{j}, \, j = 1, \dots, M} S (v_{j}),

and choose the corresponding partitioning subsets A_L and A_R as the final partition of A and let m_v be the corresponding dividing median. Let us denote the above process of obtaining a binary partition for a set A as

\begin{array}{l} π_{M} (A) = \{A_{L}, A_{R}, v, m_{v}\}, & (3) \end{array}

where M is an adjustable parameter. Clearly, the outcomes of π_M are random, depending on realizations of the random vectors v₁, …, v_M. The reason for using median as the splitting point for partition is to keep the corresponding partition tree balanced. A balanced tree possesses some nice analytical properties for wavelet-like analysis on the tree, as we will see later.
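The classifier π_M can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`random_unit_vector`, `sum_of_squares`, `pi_M`) are hypothetical, the directions are drawn by normalizing Gaussian vectors, and ties at the median are broken by a random sort key, which simplifies the tie rule described above.

```python
import math
import random

def random_unit_vector(p, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]

def sum_of_squares(ys):
    """Sum of squared deviations from the mean; 0 for an empty list."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def pi_M(points, ys, M, rng):
    """Split a node of r >= 2 points along the best of M random directions.

    Returns (left_idx, right_idx, v, m_v), where the index lists refer to
    positions in `points`. Assumption: ties are ordered by a random key, and
    for odd r the median point goes left or right with equal probability.
    """
    r, p = len(points), len(points[0])
    best = None
    for _ in range(M):
        v = random_unit_vector(p, rng)
        proj = [sum(a * b for a, b in zip(v, x)) for x in points]
        order = sorted(range(r), key=lambda i: (proj[i], rng.random()))
        sorted_proj = [proj[i] for i in order]
        mid = r // 2
        # median of the projections
        m_v = sorted_proj[mid] if r % 2 else 0.5 * (sorted_proj[mid - 1] + sorted_proj[mid])
        cut = mid
        if r % 2 == 1 and rng.random() < 0.5:
            cut += 1  # the median point joins the left child
        left, right = order[:cut], order[cut:]
        # criterion (2): within-child sum of squares of the responses
        s = sum_of_squares([ys[i] for i in left]) + sum_of_squares([ys[i] for i in right])
        if best is None or s < best[0]:
            best = (s, left, right, v, m_v)
    return best[1], best[2], best[3], best[4]
```

Because each candidate split cuts at the median, the two children differ in size by at most one, whichever direction wins.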

With the partitioning classifier π_M, we construct recursively a partitioning tree $T$ so that each node in the tree $T$ is a subset of $X$ and the children nodes are partitioning subsets of their parent nodes. In other words, we set the root of $T$ to be the whole point set $X$ and, starting from the root, we keep partitioning each node into two children nodes with about the same size until a node becomes a singleton set. The leaves of $T$ are singleton sets of $X$ .

In this construction, each non-leaf node is associated with a pair (v, m_v) of a unit directional vector and the corresponding dividing median. All the pairs are pre-calculated and stored, and will serve as the parameters for partitioning the whole feature space ℝ^p. The space ℝ^p can now be partitioned into disjoint regions identified by the leaves of $T$ as follows. To begin with, we classify every point x ∈ ℝ^p as an “ $X$ point.” Next, for any x ∈ ℝ^p, if it is already classified as an “A point” for some tree node $A \subset X$ of at least two points and if π_M(A) = {A_L, A_R, v, m_v}, we calculate the projection of x onto v and then classify x further as either an “A_L point” if $v^{T} x < m_{v}$ , or an “A_R point” if $v^{T} x > m_{v}$ . In the case of $v^{T} x = m_{v}$ , we classify x with equal probability into either A_L or A_R. This classification process goes on and will eventually classify x into one and only one of the leaf nodes. If x is ultimately classified as a leaf node “{x_i} point” for some $x_{i} \in X$ , we write $T (x) = x_{i}$ . For every $x_{i} \in X$ , let

\begin{array}{l} A_{i} := \{x \in ℝ^{p} : T (x) = x_{i}\} . & (4) \end{array}

Then A_i, i = 1, …, n forms a partition of ℝ^p according to the partitioning tree $T$ .
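As a rough sketch of how the tree construction and the induced partition of ℝ^p might be coded, the following builds a maximal tree and routes an arbitrary point to a leaf. The names `build_tree` and `route` and the dictionary layout are hypothetical. One deliberate deviation is labeled in the code: the stored threshold is the midpoint between the two projections adjacent to the cut, which equals the median for even-sized nodes; this assumption makes every training point route back to its own leaf.

```python
import math
import random

def unit_vector(p, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def sse(vals):
    m = sum(vals) / len(vals)
    return sum((y - m) ** 2 for y in vals)

def build_tree(idx, X, y, M, rng):
    """Recursively split the index list idx until every leaf is a singleton."""
    if len(idx) == 1:
        return {"idx": idx}
    best = None
    for _ in range(M):
        v = unit_vector(len(X[0]), rng)
        order = sorted(idx, key=lambda i: (dot(v, X[i]), rng.random()))
        cut = len(idx) // 2
        left, right = order[:cut], order[cut:]
        # Assumption: threshold halfway between the projections adjacent to the cut
        m_v = 0.5 * (dot(v, X[left[-1]]) + dot(v, X[right[0]]))
        s = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if best is None or s < best[0]:
            best = (s, left, right, v, m_v)
    _, left, right, v, m_v = best
    return {"idx": idx, "v": v, "m": m_v,
            "L": build_tree(left, X, y, M, rng),
            "R": build_tree(right, X, y, M, rng)}

def route(tree, x, rng):
    """Follow the stored (v, m_v) pairs from the root to a leaf; return the leaf's sample index."""
    node = tree
    while "v" in node:
        t = dot(node["v"], x)
        if t < node["m"]:
            node = node["L"]
        elif t > node["m"]:
            node = node["R"]
        else:  # exact tie: choose a side with equal probability
            node = node["L"] if rng.random() < 0.5 else node["R"]
    return node["idx"][0]
```

The index returned by `route(tree, x, rng)` plays the role of i in T(x) = x_i, so the set of all x that return i corresponds to the region A_i in (4).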

These trees are maximal trees. In order to preserve more local features of f in data, we do not prune them as CART does. In fact, a default pruning process in CART destroys its ability to be adaptive to the intrinsic low dimension of data. On the other hand, as an implementation issue, if all the y_i's in a node A have very close values so that S(v_j) is small in all M selected directions, we can stop splitting this node, treat it as a leaf, and use the mean of the y_i's as the y-value for this node. All computations below can still be carried out without change. The overfitting issue caused by a large tree will be addressed in the last step of our estimation process, to be described later.

2.2. A Multiscale Soft-Thresholding Estimate of f(X_i)

As we will see in section 4, the procedure described below is actually a modified version of a common wavelet denoising process with an unbalanced Haar wavelet on a tree. But since this process can also be described in a statistically more intuitive manner with simpler notations, we give the following algorithmic description first.

Suppose a partitioning tree $T$ is obtained. A hierarchical representation of the data $X \times Y = \{(x_{i}, y_{i}), i = 1, \dots, n\}$ can be obtained according to the tree $T$ as follows. For each non-leaf node in the tree $A \subset X$ , we find the mean ȳ_A of the node and the difference of the means of its children nodes A_L and A_R:

\begin{array}{l} d_{A} : = ȳ_{A_{L}} - ȳ_{A_{R}} . & (5) \end{array}

If A is a leaf, we set d_A = 0. The original data can now be represented with the set of the numbers

D = \{ȳ_{X}\} \cup \{d_{A}, A \in T\}

based on which regression estimates will be obtained.

To estimate $\hat{f} (x_{i})$ , i = 1, …, n, we first apply a soft-thresholding operation with a given α ≥ 0 to d_A for all the non-leaf nodes A according to the formula

\begin{array}{l} {\hat{d}}_{A} = \operatorname{sign} (d_{A}) \max \left\{0, | d_{A} | - α \sqrt{\frac{1}{| A_{L} |^{2}} + \frac{1}{| A_{R} |^{2}}}\right\} . & (6) \end{array}

Next, we calculate estimates ŷ_A of E(ȳ_A|X = x) for all nodes A of $T$ based on the data

\hat{D} = \{{\bar{y}}_{X}\} \cup \{{\hat{d}}_{A}, A \in T\}

as follows. We start with setting

ŷ_{X} = ȳ_{X} .

For each node A, once ŷ_A is obtained, we calculate the estimates for its two children nodes, ŷ_{A_L} and ŷ_{A_R}, according to the formula:

\begin{array}{l} ŷ_{A_{L}} = ŷ_{A} + \frac{| A_{R} |}{| A |} {\hat{d}}_{A} & (7) \end{array}

and

\begin{array}{l} ŷ_{A_{R}} = ŷ_{A} - \frac{| A_{L} |}{| A |} {\hat{d}}_{A} . & (8) \end{array}

Repeat this process until one reaches all the leaves. If A = {x_i} is a leaf, we use

\hat{f} (x_{i}) := ŷ_{\{x_{i}\}}

as an estimate of f (x_i). We will write ŷ_i for ŷ_{\{x_i\}} for short below.

Note that, if α = 0 in (6), then ${\hat{d}}_{A} = d_{A}$ for all nodes A and, as a consequence, ŷ_A = ȳ_A for all $A \in T$ . To see this, we note that $ŷ_{X} = ȳ_{X}$ , and if we have ŷ_A = ȳ_A and ${\hat{d}}_{A} = d_{A}$ , then

ŷ_{A_{L}} = ȳ_{A} + \frac{| A_{R} |}{| A |} (ȳ_{A_{L}} - ȳ_{A_{R}}) = ȳ_{A_{L}}

and

ŷ_{A_{R}} = ȳ_{A} - \frac{| A_{L} |}{| A |} (ȳ_{A_{L}} - ȳ_{A_{R}}) = ȳ_{A_{R}} .

Also note that |A_L| and |A_R| differ by at most 1 and therefore, for large nodes, |A_L|/|A| ≃ |A_R|/|A| ≃ 1/2. The smoothing parameter α can be determined through cross-validation.
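The top-down pass defined by (5)–(8) can be illustrated with a small Python sketch. The tree encoding (a leaf is a bare integer index, an internal node is a pair of subtrees) and the function names are assumptions made for this illustration only.

```python
import math

def leaves(node):
    """Indices of the sample points under a node; a leaf is a bare int."""
    return [node] if isinstance(node, int) else leaves(node[0]) + leaves(node[1])

def mean(vals):
    return sum(vals) / len(vals)

def soft_threshold(d, alpha, n_left, n_right):
    """Formula (6): shrink d toward zero by alpha * sqrt(1/|A_L|^2 + 1/|A_R|^2)."""
    t = alpha * math.sqrt(1.0 / n_left ** 2 + 1.0 / n_right ** 2)
    return math.copysign(max(0.0, abs(d) - t), d)

def reconstruct(node, y, alpha, y_hat_node=None, out=None):
    """Apply (7) and (8) from the root down; returns {index: y_hat_i}."""
    if out is None:
        out = {}
        y_hat_node = mean([y[i] for i in leaves(node)])  # root: y_hat equals y_bar
    if isinstance(node, int):
        out[node] = y_hat_node
        return out
    left, right = leaves(node[0]), leaves(node[1])
    d = mean([y[i] for i in left]) - mean([y[i] for i in right])      # (5)
    d_hat = soft_threshold(d, alpha, len(left), len(right))           # (6)
    n = len(left) + len(right)
    reconstruct(node[0], y, alpha, y_hat_node + len(right) / n * d_hat, out)  # (7)
    reconstruct(node[1], y, alpha, y_hat_node - len(left) / n * d_hat, out)   # (8)
    return out
```

For the tree `((0, 1), (2, 3))` with responses `[1.0, 3.0, 10.0, 12.0]`, calling `reconstruct` with α = 0 returns the data unchanged, in line with the remark above, while a very large α collapses every estimate to the overall mean 6.5. For any α the estimates keep the overall mean, since |A_L| ŷ_{A_L} + |A_R| ŷ_{A_R} = |A| ŷ_A at every node.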

In the framework of wavelet analysis, the $d_{A}$ are the wavelet coefficients and (7)–(8) is the fast wavelet reconstruction algorithm to be derived in section 4.

2.3. The Averaging Random Tree Regression

Now to estimate f (x) for any x ∈ ℝ^p, we can in principle use

\begin{array}{l} \hat{f} (x) = \sum_{x_{i} \in X} ŷ_{i} χ_{A_{i}} (x) . & (9) \end{array}

In other words, a search along the tree $T$ (with O(log₂ n) steps) allows us to determine, for each x ∈ ℝ^p, which partitioning subset A_i, as defined in (4), it belongs to, and we then use ŷ_i as an estimate for f (x). However, despite some nice theoretical properties, this estimate by itself does not perform very well in our simulation experiments. It turns out that a significant improvement can be achieved by the averaging procedure described below.

For a given integer K, we repeat the process described in Section 2.1 K times, independently, on the same data to obtain trees $T_{k}$ , k = 1, …, K. For each tree $T_{k}$ , we calculate estimates ${\hat{f}}_{k} (x)$ according to (9). Finally, we take the average

\begin{array}{l} {\hat{f}}_{*} (x) = \frac{1}{K} \sum_{k = 1}^{K} {\hat{f}}_{k} (x) & (10) \end{array}

as the final estimate of f (x). We call this estimate an “Averaging Random Tree Regression” estimate, or ARTR estimate for short. This averaging step significantly improves the accuracy of the estimates of the regression function. The resulting estimates can adapt to the lower intrinsic dimension of data. They can also be visually smoother than those of other tree-based regression methods in two or three dimensional cases, even with piecewise constant Haar wavelets, while remaining sensitive to discontinuities. The averaging step also addresses the overfitting problem efficiently.

We summarize the process into the following steps:

1. Generate a partitioning tree $T$ according to classifier π_M.

2. Obtain $ȳ_{X}$ and d_A = ȳ_{A_L} − ȳ_{A_R} for all non-leaf nodes $A \in T$ and A_L, A_R ∈ π_M(A).

3. Obtain $\hat{D}$ from $D$ based on (6) for a given α and ŷ_i, i = 1, …, n, through recursively applying (7) and (8).

4. Calculate the estimate $\hat{f}$ of f using (9).

5. Repeat steps 1–4 K times and take the average according to (10).

There are three tuning parameters in this process: M, the number of random projection directions for each partition in generating a partitioning tree; K, the number of trees to generate for averaging; and α, a factor in the threshold for smoothing. Cross-validation can be used to choose the values of these parameters.
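The five steps can be put together in a compact sketch. This is a minimal illustration under assumptions, not the authors' code: all function names are hypothetical, the split threshold is the midpoint between the two projections adjacent to the cut (equal to the median for even-sized nodes), and a query point that lands exactly on a threshold is sent to the right child rather than to a random side.

```python
import math
import random

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def mean(vals):
    return sum(vals) / len(vals)

def sse(vals):
    m = mean(vals)
    return sum((v - m) ** 2 for v in vals)

def build(idx, X, y, M, rng):
    """Step 1: random-projection partitioning tree with singleton leaves."""
    if len(idx) == 1:
        return {"idx": idx}
    best = None
    for _ in range(M):
        v = [rng.gauss(0.0, 1.0) for _ in X[0]]
        norm = math.sqrt(dot(v, v))
        v = [c / norm for c in v]
        order = sorted(idx, key=lambda i: dot(v, X[i]))
        cut = len(idx) // 2
        left, right = order[:cut], order[cut:]
        score = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if best is None or score < best[0]:
            m_v = 0.5 * (dot(v, X[left[-1]]) + dot(v, X[right[0]]))
            best = (score, left, right, v, m_v)
    _, left, right, v, m_v = best
    return {"idx": idx, "v": v, "m": m_v,
            "L": build(left, X, y, M, rng), "R": build(right, X, y, M, rng)}

def fit_values(node, y, alpha, y_hat):
    """Steps 2-3: threshold d_A by (6) and push estimates down by (7)-(8)."""
    node["y_hat"] = y_hat
    if "v" not in node:
        return
    L, R = node["L"]["idx"], node["R"]["idx"]
    d = mean([y[i] for i in L]) - mean([y[i] for i in R])
    t = alpha * math.sqrt(1.0 / len(L) ** 2 + 1.0 / len(R) ** 2)
    d_hat = math.copysign(max(0.0, abs(d) - t), d)
    n = len(L) + len(R)
    fit_values(node["L"], y, alpha, y_hat + len(R) / n * d_hat)
    fit_values(node["R"], y, alpha, y_hat - len(L) / n * d_hat)

def predict_one(tree, x):
    """Step 4: route x to a leaf and return that leaf's estimate, as in (9)."""
    node = tree
    while "v" in node:
        node = node["L"] if dot(node["v"], x) < node["m"] else node["R"]
    return node["y_hat"]

def artr_fit(X, y, M=10, K=36, alpha=2.0, seed=0):
    """Step 5: K independently generated trees on the same training data."""
    rng = random.Random(seed)
    trees = []
    for _ in range(K):
        tree = build(list(range(len(X))), X, y, M, rng)
        fit_values(tree, y, alpha, mean(y))
        trees.append(tree)
    return trees

def artr_predict(trees, x):
    """Average of the per-tree estimates, as in (10)."""
    return mean([predict_one(t, x) for t in trees])
```

A call such as `artr_predict(artr_fit(X, y), x_new)` returns the averaged estimate at a new point. The defaults M = 10, K = 36, and α = 2.0 mirror the values used in the examples of this paper; in practice they would be chosen by cross-validation.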

Simulation studies show that ARTR estimation outperforms some standard methods, as can be seen through the examples in the following section. Note that our averaging approach is similar in appearance to, but different in principle from, that of random forests. In fact, the random forests approach, which averages tree estimates based on resampled datasets, does not produce good results in our experiments. Formally, our estimate can be viewed as a kernel method which takes a weighted average of nearby data points to obtain an estimate.

3. Examples of Applications to Some Statistical Problems

We give three examples demonstrating potential applications of the ARTR method. In all computations below, we set the number of random projections for partitioning M = 10 and the number of partition trees for averaging K = 36.

Example 1: Regression on a Lower Intrinsic Dimensional Manifold

This example is about high dimensional regression where the values of predictors are actually from an embedded unknown sub-manifold with a much lower dimension. In our example, this sub-manifold is the surface of a two-dimensional “Swiss roll” given by the parametric equations: x₁ = u cos u, x₂ = u sin u, x₃ = v, for u, v ∈ [0, 4π]. The regression function is

f (x_{1}, x_{2}, x_{3}) = {(x_{3} - \sqrt{x_{1}^{2} + x_{2}^{2}})}^{2} / 2 = {(v - u)}^{2} / 2

and ε ~ N(0, 1). Our data do not contain exact values of (x₁, x₂, x₃). Rather, all the 3-dimensional points are embedded isometrically (with an arbitrary and unknown isometric transformation) into the space ℝ^{4,000}. Therefore, each feature point is recorded as a p = 4,000 dimensional point. We apply support vector machine regression (SVMR), random forests regression (RFR), extreme gradient boost regression (XGBR) with the parameter nrounds set to 100, and our ART regression (ARTR) without averaging (K = 1) and with averaging (K = 36) to a training dataset of n = 1,000 points, and then apply the estimated models from these different methods to an independent testing dataset of 1,000 points. Predictions of the values of the function are obtained at these 1,000 testing points. The approximation error at a point x is the difference between the predicted value at x and the true value of the function at x. Table 1 shows the mean squared approximation errors in this computation. We see that, in terms of the mean squared approximation errors, ARTR with K = 36 is better than SVMR, and significantly better than RFR or XGBR. We also see that, without averaging, ARTR with K = 1 has the worst performance. In ARTR we have used α = 2.0.
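A data set of this kind could be generated roughly as follows. The sketch rests on assumptions the text does not state: (u, v) is drawn uniformly on [0, 4π]², and the isometry is taken to be linear, built from a random orthonormal frame via Gram–Schmidt. All function names are hypothetical.

```python
import math
import random

def swiss_roll_sample(n, rng):
    """Assumption: (u, v) uniform on [0, 4*pi]^2. Returns points in R^3, f values, and (u, v)."""
    pts, f_vals, params = [], [], []
    for _ in range(n):
        u = rng.uniform(0.0, 4.0 * math.pi)
        v = rng.uniform(0.0, 4.0 * math.pi)
        x1, x2, x3 = u * math.cos(u), u * math.sin(u), v
        pts.append([x1, x2, x3])
        f_vals.append((x3 - math.sqrt(x1 * x1 + x2 * x2)) ** 2 / 2.0)
        params.append((u, v))
    return pts, f_vals, params

def orthonormal_frame(p, k, rng):
    """k orthonormal vectors in R^p via Gram-Schmidt on Gaussian vectors."""
    frame = []
    while len(frame) < k:
        w = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for q in frame:
            c = sum(a * b for a, b in zip(w, q))
            w = [a - c * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > 1e-12:
            frame.append([a / norm for a in w])
    return frame

def embed(pts, frame):
    """Linear isometry R^3 -> R^p: x maps to sum_j x_j * frame[j]."""
    p = len(frame[0])
    return [[sum(x[j] * frame[j][i] for j in range(len(frame))) for i in range(p)]
            for x in pts]

def noisy_responses(f_vals, rng, sigma=1.0):
    """y_i = f(x_i) + eps_i with eps_i ~ N(0, sigma^2)."""
    return [f + rng.gauss(0.0, sigma) for f in f_vals]
```

Since the frame vectors are orthonormal, pairwise Euclidean distances among the embedded points equal those among the original 3-dimensional points, which is what makes the intrinsic dimension of the embedded data remain two.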

Table 1. The mean squared approximation errors in example 1.

Example 2: Discovering Mixtures in a Higher Dimensional Space

This is a higher dimensional example. The problem in this example can be described in more general terms as follows. Suppose S₁ and S₂ are two unknown lower dimensional sub-manifolds embedded in ℝ^p. Suppose that data are sampled from S₁ ∪ S₂ and that the y_i's sampled from S₁ have distribution $N (μ_{1}, σ^{2})$ and the y_i's from S₂ have distribution $N (μ_{2}, σ^{2})$ . In other words,

y = μ_{1} χ_{S_{1}} (x) + μ_{2} χ_{S_{2}} (x) + ε, x \in S_{1} \cup S_{2},

where ε ~ N(0, σ²). We ask: can we discover from the data that the y_i's are a mixture of two different distributions? We show in the following experiment that ARTR can achieve this while some standard methods fail.

In this example, S₁ and S₂ are two intersecting unit 2-spheres embedded into a p = 6,000 dimensional space ℝ^{6,000}. The centers of the 2-spheres are set to be 0.35 apart. We set μ₁ = 0, μ₂ = 2, and σ = 1. The data consist of n = 4,000 sample points. A histogram of the y_i's would reveal no sign that the y_i's come from a mixture of two distributions. Four different methods are applied to fit the data: SVMR, RFR, XGBR, all with the default settings in R packages “svm,” “randomForest,” and “xgboost” (nrounds = 100), and our ART regression. Estimates ŷ_i of f(x_i) are obtained from the methods and their histograms are displayed in Figure 1. For the ARTR we choose K = 36 and α = 2.0.

Figure 1. Histograms of fitted data from different methods.

We see from the histograms in Figure 1 that SVMR fails to recognize correctly the underlying lower dimensional structure of the data. It mistakenly treats the central region bounded by the intersection of the two spheres as the third region in which y_i's have a mean value (μ₁+μ₂)/2 = 1. A much larger sample size is needed for SVMR to achieve a correct estimate, demonstrating the effect of the curse of dimensionality on the method. The RFR is capable of noticing the mixture and XGBR is better than RFR, but ARTR is clearly much more powerful than all others in detecting the mixture. This comparison between ARTR and RFR or XGBR supports the comment we make in the introduction section that a mean-based estimate can be significantly improved through shrinkage and thresholding operations.

The mean squared errors, calculated using ŷ_i, i = 1, …, 4,000, and the true means μ₁ = 0 and μ₂ = 2, are listed in Table 2. It shows that ARTR has the smallest mean squared error among the four methods. We also notice that, while the mean squared error of SVMR is smaller than those of RFR or XGBR, the estimates from SVMR can be totally misleading.

Table 2. The mean squared errors of the estimates in example 2.

Example 3: Smoothness and Sensitivity to Discontinuities

This is a two dimensional example to show smoothness and sensitivity to discontinuities of ARTR estimates. We do this through comparing ARTR to the thin plate splines (TPS) and random forests regression (RFR). The regression function we use is

f (x) = f_{1} (x) + f_{2} (x) + f_{3} (x) + f_{4} (x)

with

\begin{array}{l} f_{1} (x) = 10 e^{- 2.5 [x_{1}^{2} + {(x_{2} - 0.5)}^{2}]} \\ f_{2} (x) = 7 e^{- 3 [{(x_{1} - 0.5)}^{2} + {(x_{2} - 0.5)}^{2}]} \\ f_{3} (x) = 4 e^{- 4 [{(x_{1} - 0.5)}^{2} + {(x_{2} + 0.5)}^{2}]} \\ f_{4} (x) = 4 I (x_{1} + x_{2} < - 0.5) \end{array}

and ε ~ N(0, 1). An image of f (x) is given in Figure 2 (in these figures, the valleys are shown in red and the peaks are shown in white). The data consist of n = 4,000 points, with the points in $X$ generated uniformly inside a two-dimensional unit square. Setting α = 2.0, we obtain ARTR estimates at 100 × 100 grid-points, which are compared to the estimates at the same grid-points from the TPS and the RFR. In Figure 2, we observe that both TPS and ARTR produce smoother estimates than RFR does. Furthermore, the estimates from TPS and ARTR look more similar in local regions.

Figure 2. The regression function and its estimates from different methods.

We note that the surface of the true regression function has a discontinuity line. Figure 3 provides a comparison among the three estimates. The two figures in the top row are the plots of errors from ARTR estimates and TPS estimates, respectively. A difference between the two estimates along the discontinuity line is visible. The left plot in the second row of Figure 3 shows an even more significant difference. This figure displays a difference in sensitivity to discontinuities between the two methods. Note that, while such a property is what one would expect from a wavelet-type method, estimates from ARTR are smoother than those from a Haar-type wavelet. In contrast, the difference between the TPS estimates and the RFR estimates, while visible, is noisier and less definite. Overall, ARTR performs best in this example. The “ARTR vs TPS” figure in Figure 3 suggests that one might even consider using such a figure, obtained completely from data, as a means of detecting discontinuities of a noisy surface.

Figure 3. Differences among the estimates.

The mean squared approximation errors of the three methods from the same simulation computation are listed in Table 3. We see that, in terms of approximation errors, ARTR is similar to (or slightly better than) TPS, and RFR is the worst among the three.

Table 3. The mean squared approximation errors in example 3.

4. Wavelet-Like Analysis on Trees

The tree-based method proposed above can be formulated and applied within a more general context in which $X$ can be an arbitrary point set with a given partitioning tree $T$ . In particular, $X$ need not be a subset of ℝ^p. An important concept that makes an analysis possible in this setting is the tree metric [14], which characterizes the smoothness of the functions defined on $X$ in terms of their tree wavelet coefficients. In this section we first give a formal description of a Haar wavelet-like orthogonal system on the tree $T$ , and then we present a fast algorithm for decomposition and reconstruction of functions defined on $X$ according to the tree $T$ . This algorithm includes the algorithm in section 2 as a special case. The discussion below is for more general trees in which a leaf node can have more than one point; in this case the corresponding y-value is the average of the y_i's in the node.

4.1. Wavelet-Like Orthogonal Systems on $T$

Without loss of generality, we can assume that the binary partitioning tree $T$ has L levels, with the root $X$ at level 0 and all the leaves at level L. This would be exactly the case for a tree $T$ constructed in the previous section if $X$ consists of 2^L points. To achieve this for an arbitrary $X$ and tree $T$ , if A is a node of the tree at level ℓ < L which is a leaf node, we simply define its offspring at level ℓ+1 to be the node itself.

For each ℓ = 0, 1, …, L, we index all the nodes in $T$ at level ℓ with an index set I_ℓ and let P_ℓ be the set of these nodes:

P_{ℓ} = \{A_{ℓ, j}, j \in I_{ℓ}\}, \quad 0 \leq ℓ \leq L .

Then P_ℓ forms a partition of $X$ : A_{ℓ, i} ∩ A_{ℓ, j} = ∅ for i ≠ j and $X = ⋃_{j \in I_{ℓ}} A_{ℓ, j}$ . Furthermore, P_{ℓ + 1} is a refinement of P_ℓ with

\begin{array}{l} A_{ℓ, j} = A_{ℓ + 1, j^{'}} \cup A_{ℓ + 1, j^{″}}, A_{ℓ + 1, j^{'}} \cap A_{ℓ + 1, j^{″}} = \emptyset & (11) \end{array}

for some $A_{ℓ + 1, j^{'}}, A_{ℓ + 1, j^{″}} \in P_{ℓ + 1}$ if |A_{ℓ, j}| > 1, and

A_{ℓ, j} = A_{ℓ + 1, j^{'}}

for some $A_{ℓ + 1, j^{'}} \in P_{ℓ + 1}$ if |A_{ℓ, j}| = 1. With these notations, $A_{0, 0} = X$ is the root and A_{L, j}, j ∈ I_L are the leaves of $T$ .

A wavelet-like orthogonal system can now be defined on $T$ as follows. Let

V = \{f \mid f : X \to ℝ\}

be the space of all functions defined on $X,$ equipped with the inner product:

\begin{array}{l} 〈 f, g 〉 = \frac{1}{n} \sum_{x \in X} f (x) g (x), f, g \in V . & (12) \end{array}

Let ν be the empirical probability measure of X induced by the set $X$ of sample feature points:

\begin{array}{l} ν (A) = \frac{| A \cap X |}{| X |} = \frac{1}{n} | A \cap X |, \forall A \subset C, & (13) \end{array}

with the understanding that ν depends on the sample size n. For a function f on $X$ ,

\| f \| := {(\int_{X} | f (x) |^{2} d ν (x))}^{\frac{1}{2}}

is just $\sqrt{〈 f, f 〉}$ .

An orthogonal system on V with respect to the inner product in (12) can now be constructed based on the partitioning tree $T$ . The following describes the construction of Haar-type wavelets on $T$ , [13–15].

For each ℓ = 0, …, L and P_ℓ = {A_{ℓ, j}:j ∈ I_ℓ}, let

V_{ℓ} := \{f \in V : f |_{A_{ℓ, j}} \text{ is a constant for every } j \in I_{ℓ}\} .

Clearly, we have V_ℓ ⊂ V_{ℓ + 1}.

Let ϕ_{ℓ, j}, 0 ≤ ℓ ≤ L and j ∈ I_ℓ, be functions on $X$ defined by

\begin{array}{l} ϕ_{ℓ, j} (x) : = χ_{_{A_{ℓ, j}}} (x) . & (14) \end{array}

For each ℓ, 0 ≤ ℓ ≤ L−1, let J_ℓ ⊂ I_ℓ denote the index set for those nodes A_{ℓ, j} with |A_{ℓ, j}| > 1. For each j ∈ J_ℓ, with $A_{ℓ, j} = A_{ℓ + 1, j^{'}} \cup A_{ℓ + 1, j^{″}}$ , let ψ_{ℓ, j} be functions defined by

\begin{array}{l} ψ_{ℓ, j} (x) : = \frac{ν (A_{ℓ + 1, j^{″}})}{ν (A_{ℓ, j})} χ_{_{A_{ℓ + 1, j^{'}}}} (x) - \frac{ν (A_{ℓ + 1, j^{'}})}{ν (A_{ℓ, j})} χ_{_{A_{ℓ + 1, j^{″}}}} (x) \\ = \frac{ν (A_{ℓ + 1, j^{″}})}{ν (A_{ℓ, j})} ϕ_{ℓ + 1, j^{'}} (x) - \frac{ν (A_{ℓ + 1, j^{'}})}{ν (A_{ℓ, j})} ϕ_{ℓ + 1, j^{″}} (x) . & (15) \end{array}

Then we have

\begin{array}{l} 〈 ϕ_{ℓ, j}, ψ_{ℓ^{'}, j^{'}} 〉 = 0, for ℓ \leq ℓ^{'}, j \in I_{ℓ}, j^{'} \in J_{ℓ^{'}}, \\ 〈 ψ_{ℓ, j}, ψ_{ℓ, j^{'}} 〉 = 0, for j \neq j^{'}, j, j^{'} \in J_{ℓ} . \end{array}

Denote

W_{ℓ} = span {ψ_{ℓ, j}, j \in J_{ℓ}} .

The fact V_ℓ = span{ϕ_{ℓ, j} : j ∈ I_ℓ} and the orthogonality of ϕ_{ℓ, j} and ψ_{ℓ, j} imply that W_ℓ ⊂ V_{ℓ + 1} and V_ℓ ⊥ W_ℓ. In addition, by a direct calculation, we have, for A_{ℓ, j}, $A_{ℓ + 1, j^{'}}$, and $A_{ℓ + 1, j^{″}}$ as related in (11) and j ∈ J_ℓ,

\begin{array}{l} ϕ_{ℓ + 1, j^{'}} (x) = \frac{ν (A_{ℓ + 1, j^{'}})}{ν (A_{ℓ, j})} ϕ_{ℓ, j} (x) + ψ_{ℓ, j} (x), \\ ϕ_{ℓ + 1, j^{″}} (x) = \frac{ν (A_{ℓ + 1, j^{″}})}{ν (A_{ℓ, j})} ϕ_{ℓ, j} (x) - ψ_{ℓ, j} (x), \end{array}

which implies that V_{ℓ + 1} ⊆ V_ℓ + W_ℓ. Therefore, W_ℓ is the orthogonal complement of V_ℓ in V_{ℓ + 1}, i.e., $V_{ℓ + 1} = V_{ℓ} \oplus^{⊥} W_{ℓ}$, where ⊕^⊥ denotes the orthogonal sum. Hence

V_{L} = W_{L - 1} \oplus^{⊥} V_{L - 1} = W_{L - 1} \oplus^{⊥} W_{L - 2} \oplus^{⊥} V_{L - 2} = \dots

and finally

\begin{array}{l} V_{L} = V_{0} \oplus^{⊥} W_{0} \oplus^{⊥} \dots \oplus^{⊥} W_{L - 1} . & (16) \end{array}

From (16), we see $W_{ℓ} ⊥ W_{ℓ^{'}}$ for ℓ ≠ ℓ′. Hence ψ_{ℓ, j}, 0 ≤ ℓ ≤ L − 1, j ∈ J_ℓ are orthogonal to each other. More precisely, we have

〈 ψ_{ℓ, j}, ψ_{ℓ^{'}, i} 〉 = B_{ℓ, j} δ (ℓ - ℓ^{'}) δ (j - i),

for all $0 \leq ℓ, ℓ^{'} \leq L - 1, j \in J_{ℓ}, i \in J_{ℓ^{'}}$ , where

\begin{array}{l} B_{ℓ, j} : = 〈 ψ_{ℓ, j}, ψ_{ℓ, j} 〉 = \frac{ν (A_{ℓ + 1, j^{'}}) ν (A_{ℓ + 1, j^{″}})}{ν (A_{ℓ, j})} . & (17) \end{array}
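For readers who want the intermediate step, the value in (17) can be checked directly from definition (15). The following is a short worked derivation, using only the fact that ψ_{ℓ, j} is constant on each child and that ν(A_{ℓ, j}) = ν(A_{ℓ + 1, j′}) + ν(A_{ℓ + 1, j″}):

```latex
\begin{aligned}
\langle \psi_{\ell,j}, \psi_{\ell,j} \rangle
&= \left(\frac{\nu(A_{\ell+1,j''})}{\nu(A_{\ell,j})}\right)^{2} \nu(A_{\ell+1,j'})
 + \left(\frac{\nu(A_{\ell+1,j'})}{\nu(A_{\ell,j})}\right)^{2} \nu(A_{\ell+1,j''}) \\
&= \frac{\nu(A_{\ell+1,j'})\,\nu(A_{\ell+1,j''})\,
   \bigl(\nu(A_{\ell+1,j'}) + \nu(A_{\ell+1,j''})\bigr)}{\nu(A_{\ell,j})^{2}} \\
&= \frac{\nu(A_{\ell+1,j'})\,\nu(A_{\ell+1,j''})}{\nu(A_{\ell,j})}.
\end{aligned}
```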

We summarize this into the following theorem.

Theorem 1. The system of Haar-type wavelets

\begin{array}{l} ψ_{ℓ, j} (x), 0 \leq ℓ \leq L - 1, j \in J_{ℓ}, & (18) \end{array}

together with ϕ_{0, 0}(x) ≡ 1 forms an orthogonal basis of V_L. More precisely, V_L can be decomposed as (16) with f ∈ V_L represented as

\begin{array}{l} f (x) = 〈 f, ϕ_{0, 0} 〉 ϕ_{0, 0} (x) + \sum_{ℓ = 0}^{L - 1} \sum_{j \in J_{ℓ}} \frac{1}{B_{ℓ, j}} 〈 f, ψ_{ℓ, j} 〉 ψ_{ℓ, j} (x), & (19) \end{array}

where $B_{ℓ, j} = | | ψ_{ℓ, j} | |^{2}$ is given by (17).

Since V_L = V, (19) is a representation of all functions f on $X$ .
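Theorem 1 can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the sample points are identified with the indices 0, …, n − 1, and each node is split into two halves, which is an arbitrary rule chosen for convenience and not the random projection rule of section 2.1. The helper names `build_splits`, `inner`, and `haar_system` are hypothetical.

```python
import numpy as np

def build_splits(n):
    """Split {0, ..., n-1} recursively; return (parent, left, right) index lists.

    Halving is an arbitrary rule used only for illustration; any binary
    partitioning tree whose leaves are single points would do.
    """
    splits, level = [], [list(range(n))]
    while any(len(A) > 1 for A in level):
        nxt = []
        for A in level:
            if len(A) > 1:
                left, right = A[:len(A) // 2], A[len(A) // 2:]
                splits.append((A, left, right))
                nxt += [left, right]
            else:
                nxt.append(A)  # singleton node carried to the next level
        level = nxt
    return splits

def inner(f, g):
    """Inner product (12): (1/n) * sum over sample points of f(x) g(x)."""
    return float(np.mean(f * g))

def haar_system(n):
    """Wavelets psi from (15) and their squared norms B from (17)."""
    psis, B = [], []
    for A, A1, A2 in build_splits(n):
        nu, nu1, nu2 = len(A) / n, len(A1) / n, len(A2) / n
        psi = np.zeros(n)
        psi[A1] = nu2 / nu    # value on the first child
        psi[A2] = -nu1 / nu   # value on the second child
        psis.append(psi)
        B.append(nu1 * nu2 / nu)
    return psis, B

n = 7
psis, B = haar_system(n)
phi00 = np.ones(n)

# Gram matrix of the wavelets; Theorem 1 says it is diagonal with entries B.
gram = np.array([[inner(p, q) for q in psis] for p in psis])

# Expansion (19) of an arbitrary function on the sample points.
f = np.random.default_rng(0).normal(size=n)
recon = inner(f, phi00) * phi00 + sum(inner(f, p) / b * p for p, b in zip(psis, B))
```

In this sketch there are n − 1 wavelets, since every split adds one node to the partition; together with ϕ_{0, 0} they span the n-dimensional space V, and `recon` agrees with `f` up to rounding.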

4.2. Fast Multiresolution Algorithm For Wavelet Transform

The orthogonal system we discussed above allows for a fast algorithm for computing wavelet coefficients 〈f, ψ_{ℓ, j}〉.

Let f ∈ V_L be the input data given by

\begin{array}{l} f (x) = \sum_{j \in I_{L}} a_{L, j} ϕ_{L, j} (x) & (20) \end{array}

with a_{L, j} : = n〈f, ϕ_{L, j}〉/|A_{L, j}| = the average value of f (x) at the leaf node A_{L, j}.

From

V_{L} = V_{L - 1} \oplus^{⊥} W_{L - 1} = \dots = V_{0} \oplus^{⊥} W_{0} \oplus^{⊥} \dots \oplus^{⊥} W_{L - 1}

and the fact that, for any L₀ with 0 ≤ L₀ ≤ L − 1, the functions ϕ_{L₀, i}, ψ_{ℓ, j}, L₀ ≤ ℓ ≤ L − 1, i ∈ I_{L₀}, j ∈ J_ℓ, form an orthogonal basis for V_L, we know that f ∈ V_L can also be represented as

\begin{array}{l} f (x) = \sum_{j \in I_{L - 1}} a_{L - 1, j} ϕ_{L - 1, j} (x) + \sum_{j \in J_{L - 1}} d_{L - 1, j} ψ_{L - 1, j} (x) \\ = \sum_{j \in I_{L - 2}} a_{L - 2, j} ϕ_{L - 2, j} (x) + \sum_{j \in J_{L - 2}} d_{L - 2, j} ψ_{L - 2, j} (x) \\ + \sum_{j \in J_{L - 1}} d_{L - 1, j} ψ_{L - 1, j} (x) \\ = \dots \\ = a_{0, 0} ϕ_{0, 0} (x) + \sum_{ℓ = 0}^{L - 1} \sum_{j \in J_{ℓ}} d_{ℓ, j} ψ_{ℓ, j} (x), & (21) \end{array}

where $a_{0, 0} = \frac{1}{n} \sum_{x \in X} f (x)$ , and the wavelet coefficients d_{ℓ, j} are given by

d_{ℓ, j} = \frac{1}{B_{ℓ, j}} 〈 f, ψ_{ℓ, j} 〉 .

A multiscale fast algorithm to compute the wavelet coefficients can be obtained based on the refinement of the scaling function ϕ_{ℓ, j}. Next, let us look at the decomposition algorithm for calculating a_{L−1, j}, d_{L−1, j} from a_{L, j}, and the reconstruction algorithm for recovering a_{L, j} from a_{L−1, j}, d_{L−1, j}.

Clearly, if k ∈ I_{L−1}\J_{L−1}, then $a_{L - 1, k} = a_{L, k^{'}}$, where $k^{'} \in I_{L}$ is the index such that $A_{L, k^{'}} = A_{L - 1, k}$. Next we consider k ∈ J_{L−1}, and let $A_{L, k^{'}}$ and $A_{L, k^{″}}$ be the two children of A_{L−1, k}. From (20), (21), the orthogonality of ϕ_{L−1, j}, ψ_{L−1, j} and the fact that supp $(ϕ_{L - 1, k}) = A_{L - 1, k} = A_{L, k^{'}} ⋃ A_{L, k^{″}}$, we have

\begin{array}{l} a_{L - 1, k} | | ϕ_{L - 1, k} | |^{2} = 〈 ϕ_{L - 1, k}, \sum_{j \in I_{L - 1}} a_{L - 1, j} ϕ_{L - 1, j} + \sum_{j \in J_{L - 1}} d_{L - 1, j} ψ_{L - 1, j} 〉 \\ = 〈 f, ϕ_{L - 1, k} 〉 = 〈 ϕ_{L - 1, k}, \sum_{j \in I_{L}} a_{L, j} ϕ_{L, j} 〉 \\ = a_{L, k^{'}} 〈 ϕ_{L - 1, k}, ϕ_{L, k^{'}} 〉 + a_{L, k^{″}} 〈 ϕ_{L - 1, k}, ϕ_{L, k^{″}} 〉 \\ = a_{L, k^{'}} ν (A_{L, k^{'}}) + a_{L, k^{″}} ν (A_{L, k^{″}}) . \end{array}

With $| | ϕ_{L - 1, k} | |^{2} = ν (A_{L - 1, k}) = ν (A_{L, k^{'}}) + ν (A_{L, k^{″}})$ , we have

a_{L - 1, k} = \frac{ν (A_{L, k^{'}})}{ν (A_{L, k^{'}}) + ν (A_{L, k^{″}})} a_{L, k^{'}} + \frac{ν (A_{L, k^{″}})}{ν (A_{L, k^{'}}) + ν (A_{L, k^{″}})} a_{L, k^{″}} .

Similarly, we have

\begin{array}{l} d_{L - 1, k} ‖ ψ_{L - 1, k} ‖^{2} = 〈 ψ_{L - 1, k}, f 〉 = 〈 ψ_{L - 1, k}, \sum_{j \in I_{L}} a_{L, j} ϕ_{L, j} 〉 \\ = a_{L, k^{'}} 〈 ψ_{L - 1, k}, ϕ_{L, k^{'}} 〉 + a_{L, k^{″}} 〈 ψ_{L - 1, k}, ϕ_{L, k^{″}} 〉 \\ = a_{L, k^{'}} \int_{A_{L, k^{'}}} ψ_{L - 1, k} (x) d ν (x) + a_{L, k^{″}} \int_{A_{L, k^{″}}} ψ_{L - 1, k} (x) d ν (x) \\ = a_{L, k^{'}} \frac{ν (A_{L, k^{'}}) ν (A_{L, k^{″}})}{ν (A_{L - 1, k})} - a_{L, k^{″}} \frac{ν (A_{L, k^{'}}) ν (A_{L, k^{″}})}{ν (A_{L - 1, k})} \\ = (a_{L, k^{'}} - a_{L, k^{″}}) ‖ ψ_{L - 1, k} ‖^{2} . \end{array}

Thus, we have

d_{L - 1, k} = a_{L, k^{'}} - a_{L, k^{″}} .

Combining these calculations, we obtain the following decomposition algorithm

[\begin{matrix} a_{L - 1, k} \\ d_{L - 1, k} \end{matrix}] = [\begin{matrix} \frac{ν (A_{L, k^{'}})}{ν (A_{L, k^{'}}) + ν (A_{L, k^{″}})} & \frac{ν (A_{L, k^{″}})}{ν (A_{L, k^{'}}) + ν (A_{L, k^{″}})} \\ 1 & - 1 \end{matrix}] [\begin{matrix} a_{L, k^{'}} \\ a_{L, k^{″}} \end{matrix}]

Proceeding as above, using the refinement relation $ϕ_{L - 1, j} = ϕ_{L, j^{'}} + ϕ_{L, j^{″}}$ and the orthogonality of the ϕ_{L, j}, one obtains the following reconstruction algorithm (which can also be obtained directly by inverting the above decomposition algorithm):

[\begin{matrix} a_{L, k^{'}} \\ a_{L, k^{″}} \end{matrix}] = [\begin{matrix} 1 & \frac{ν (A_{L, k^{″}})}{ν (A_{L, k^{'}}) + ν (A_{L, k^{″}})} \\ 1 & - \frac{ν (A_{L, k^{'}})}{ν (A_{L, k^{'}}) + ν (A_{L, k^{″}})} \end{matrix}] [\begin{matrix} a_{L - 1, k} \\ d_{L - 1, k} \end{matrix}]

We can obtain the algorithms in the same way for all other a_{ℓ, k}, d_{ℓ, k}. To summarize, we have the following theorem.

Theorem 2. Let a_{ℓ, k}, d_{ℓ, k} be coefficients of f ∈ V_L with the wavelet expansion. Then for every non-leaf node A_{ℓ−1, k}, k ∈ J_{ℓ−1}, and its children nodes $A_{ℓ, k^{'}}$ and $A_{ℓ, k^{″}}$, where 1 ≤ ℓ ≤ L, we have the decomposition algorithm:

\begin{array}{l} [\begin{matrix} a_{ℓ - 1, k} \\ d_{ℓ - 1, k} \end{matrix}] = [\begin{matrix} \frac{ν (A_{ℓ, k^{'}})}{ν (A_{ℓ, k^{'}}) + ν (A_{ℓ, k^{″}})} & \frac{ν (A_{ℓ, k^{″}})}{ν (A_{ℓ, k^{'}}) + ν (A_{ℓ, k^{″}})} \\ 1 & - 1 \end{matrix}] [\begin{matrix} a_{ℓ, k^{'}} \\ a_{ℓ, k^{″}} \end{matrix}] & (22) \end{array}

and the reconstruction algorithm:

\begin{array}{l} [\begin{matrix} a_{ℓ, k^{'}} \\ a_{ℓ, k^{″}} \end{matrix}] = [\begin{matrix} 1 & \frac{ν (A_{ℓ, k^{″}})}{ν (A_{ℓ, k^{'}}) + ν (A_{ℓ, k^{″}})} \\ 1 & - \frac{ν (A_{ℓ, k^{'}})}{ν (A_{ℓ, k^{'}}) + ν (A_{ℓ, k^{″}})} \end{matrix}] [\begin{matrix} a_{ℓ - 1, k} \\ d_{ℓ - 1, k} \end{matrix}] . & (23) \end{array}

It can be easily verified that here a_{ℓ, k} is exactly the mean of f (x_i)'s over the subset A_{ℓ, k} and d_{ℓ, k} is the difference of the means of f (x_i) over the children subsets $A_{ℓ + 1, k^{'}}$ and $A_{ℓ + 1, k^{″}}$ , respectively. This justifies (5), (7), and (8) without shrinking.
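One step of (22) and (23) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `decompose` and `reconstruct` are hypothetical, and the child sizes n1 = |A_{ℓ, k′}| and n2 = |A_{ℓ, k″}| are used in place of ν, which is valid because ν(A) = |A|/n and only ratios of ν appear in the two matrices.

```python
def decompose(a1, a2, n1, n2):
    """One step of (22): children means a1, a2 -> parent mean a and detail d."""
    w1 = n1 / (n1 + n2)
    w2 = n2 / (n1 + n2)
    return w1 * a1 + w2 * a2, a1 - a2

def reconstruct(a, d, n1, n2):
    """One step of (23): parent mean a and detail d -> children means."""
    w1 = n1 / (n1 + n2)
    w2 = n2 / (n1 + n2)
    return a + w2 * d, a - w1 * d

# A node with 4 responses, split into children of sizes 3 and 1.
left, right = [1.0, 2.0, 3.0], [10.0]
a1 = sum(left) / len(left)      # mean over the first child
a2 = sum(right) / len(right)    # mean over the second child

a, d = decompose(a1, a2, len(left), len(right))
b1, b2 = reconstruct(a, d, len(left), len(right))
```

Here `a` equals the mean of all four responses and `d` equals the difference of the two children means, in line with the remark above; `reconstruct` returns the children means again.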

After the wavelet coefficients d_{ℓ, k} are thresholded with

\begin{array}{l} {\hat{d}}_{ℓ, k} = sign (d_{ℓ, k}) max {0, | d_{ℓ, k} | - α \sqrt{\frac{1}{| A_{ℓ, k^{'}} |^{2}} + \frac{1}{| A_{ℓ, k^{″}} |^{2}}}} & (24) \end{array}

for some α > 0, the estimation of f (x) is obtained:

\begin{array}{l} \hat{f} (x) = 〈 f, ϕ_{0, 0} 〉 ϕ_{0, 0} (x) + \sum_{ℓ = 0}^{L - 1} \sum_{j \in J_{ℓ}} {\hat{d}}_{ℓ, j} ψ_{ℓ, j} (x) . & (25) \end{array}
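The thresholding rule (24) acts on one coefficient at a time. A minimal sketch, transcribing (24) as it is written here, is given below; the function name `soft_threshold` and its argument names are hypothetical, with n1 and n2 standing for the child sizes |A_{ℓ, k′}| and |A_{ℓ, k″}|.

```python
import math

def soft_threshold(d, n1, n2, alpha):
    """Soft-threshold a wavelet coefficient d as in (24).

    The threshold alpha * sqrt(1/n1^2 + 1/n2^2) depends on the sizes n1, n2
    of the two children of the node that d belongs to.
    """
    t = alpha * math.sqrt(1.0 / n1**2 + 1.0 / n2**2)
    # Shrink |d| by t, never below zero, and keep the sign of d.
    return math.copysign(max(0.0, abs(d) - t), d)

big = soft_threshold(-8.0, 3, 1, 2.0)    # large coefficient: shrunk toward 0
small = soft_threshold(0.5, 3, 1, 2.0)   # small coefficient: set to 0
same = soft_threshold(-8.0, 3, 1, 0.0)   # alpha = 0 leaves d unchanged
```

Coefficients smaller in magnitude than the threshold are set to zero, so the two children receive the same estimate as their parent; larger coefficients are pulled toward zero by the threshold amount.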

A fast algorithm to evaluate the estimation $ŷ_{j} : = \hat{f} (x_{j})$ can be given as follows.

Set â_{0, 0} = a_{0, 0} = 〈f, ϕ_{0, 0}〉. Assume â_{ℓ−1, k} for k ∈ I_{ℓ−1} have been obtained. Define â_{ℓ, k} for k ∈ I_ℓ as follows. If k ∈ I_{ℓ−1}\J_{ℓ−1}, then the corresponding node A_{ℓ−1, k} is a leaf node, and we let $â_{ℓ, k^{'}} = â_{ℓ - 1, k}$, where k′ is the index such that $A_{ℓ, k^{'}} = A_{ℓ - 1, k}$. If k ∈ J_{ℓ−1}, then the corresponding node A_{ℓ−1, k} has two children, denoted by $A_{ℓ, k^{'}}$ and $A_{ℓ, k^{″}}$, and we define

\begin{array}{l} [\begin{matrix} â_{ℓ, k^{'}} \\ â_{ℓ, k^{″}} \end{matrix}] = [\begin{matrix} 1 & \frac{ν (A_{ℓ, k^{″}})}{ν (A_{ℓ, k^{'}}) + ν (A_{ℓ, k^{″}})} \\ 1 & - \frac{ν (A_{ℓ, k^{'}})}{ν (A_{ℓ, k^{'}}) + ν (A_{ℓ, k^{″}})} \end{matrix}] [\begin{matrix} â_{ℓ - 1, k} \\ {\hat{d}}_{ℓ - 1, k} \end{matrix}] . & (26) \end{array}

Theorem 3. Let â_{L, j}, j ∈ I_L be the scalars defined above iteratively with ℓ = 1, 2, ⋯ , L and let $\hat{f} (x)$ be the estimation for f (x) given in (25). Then $\hat{f} (x_{j}) = â_{L, j}$. More precisely, $\hat{f} (x)$ in (25) can be represented as

\begin{array}{l} \hat{f} (x) = \sum_{j \in I_{L}} â_{L, j} χ_{A_{L, j}} (x) . & (27) \end{array}

Clearly, (26) is actually the wavelet reconstruction algorithm (23). Thus we can use a fast algorithm to evaluate $\hat{f} (x)$ .

The representation (27) for $\hat{f} (x)$ defined by (25) can be proved easily by applying the following claim.

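Claim 1. For k ∈ J_{ℓ−1} with $A_{ℓ - 1, k} = A_{ℓ, k^{'}} \cup A_{ℓ, k^{″}}$, where 1 ≤ ℓ ≤ L, we have

â_{ℓ - 1, k} ϕ_{ℓ - 1, k} (x) + {\hat{d}}_{ℓ - 1, k} ψ_{ℓ - 1, k} (x) = â_{ℓ, k^{'}} ϕ_{ℓ, k^{'}} (x) + â_{ℓ, k^{″}} ϕ_{ℓ, k^{″}} (x) .

Proof of Claim 1: By the definitions of ϕ_{ℓ−1, k}(x) and ψ_{ℓ−1, k}(x), we have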

\begin{array}{l} â_{ℓ - 1, k} ϕ_{ℓ - 1, k} (x) + {\hat{d}}_{ℓ - 1, k} ψ_{ℓ - 1, k} (x) \\ = â_{ℓ - 1, k} (ϕ_{ℓ, k^{'}} (x) + ϕ_{ℓ, k^{″}} (x)) + {\hat{d}}_{ℓ - 1, k} (\frac{ν (A_{ℓ, k^{″}})}{ν (A_{ℓ - 1, k})} ϕ_{ℓ, k^{'}} (x) - \frac{ν (A_{ℓ, k^{'}})}{ν (A_{ℓ - 1, k})} ϕ_{ℓ, k^{″}} (x)) \\ = (â_{ℓ - 1, k} + {\hat{d}}_{ℓ - 1, k} \frac{ν (A_{ℓ, k^{″}})}{ν (A_{ℓ - 1, k})}) ϕ_{ℓ, k^{'}} (x) + (â_{ℓ - 1, k} - {\hat{d}}_{ℓ - 1, k} \frac{ν (A_{ℓ, k^{'}})}{ν (A_{ℓ - 1, k})}) ϕ_{ℓ, k^{″}} (x) \\ = â_{ℓ, k^{'}} ϕ_{ℓ, k^{'}} (x) + â_{ℓ, k^{″}} ϕ_{ℓ, k^{″}} (x), \end{array}

as desired. □

Observe that (26) is the algorithm (7) and (8) we use in section 2, and (27) is the representation (9) we also use in section 2 for estimation of f (x).
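The whole evaluation of $\hat{f}$ on a single tree can be sketched as one top-down recursion. The code below is a simplified illustration under stated assumptions, not the authors' implementation: the responses are assumed to be already ordered so that every node is a contiguous block, each node is split into two halves (a stand-in for the partition rule of section 2.1), and each coefficient d is computed directly as a difference of children means rather than by the bottom-up recursion (22). The function name `estimate` is hypothetical.

```python
import numpy as np

def estimate(y, alpha):
    """Fitted values y_hat from one tree via thresholding (24) and recursion (26)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.empty_like(y)

    def descend(lo, hi, a_hat):
        # The node holds the points lo, ..., hi - 1; a_hat is its current estimate.
        if hi - lo == 1:
            y_hat[lo] = a_hat            # leaf: the estimate is the fitted value
            return
        mid = (lo + hi) // 2
        n1, n2 = mid - lo, hi - mid
        # Wavelet coefficient: difference of the children means (Theorem 2).
        d = y[lo:mid].mean() - y[mid:hi].mean()
        # Soft thresholding as in (24).
        t = alpha * np.sqrt(1.0 / n1**2 + 1.0 / n2**2)
        d_hat = np.sign(d) * max(0.0, abs(d) - t)
        # Reconstruction step (26) for the two children.
        descend(lo, mid, a_hat + n2 / (n1 + n2) * d_hat)
        descend(mid, hi, a_hat - n1 / (n1 + n2) * d_hat)

    descend(0, len(y), y.mean())         # root estimate is the overall mean
    return y_hat

y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 5.2])
no_shrink = estimate(y, 0.0)     # alpha = 0: no thresholding
full_shrink = estimate(y, 1e6)   # huge alpha: every coefficient set to 0
partial = estimate(y, 0.5)       # intermediate alpha
```

With α = 0 the recursion reproduces the data exactly, and with a very large α every fitted value collapses to the overall mean. For any α the mean of the fitted values equals the mean of the responses, because each step of (26) keeps the weighted average of the two children estimates equal to the parent estimate.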

5. Discussion

The regression method discussed in this paper is based on a tree-based representation of data and a wavelet-like multiscale analysis. The tree-based representation organizes data into hierarchically related partitioning subsets of feature points together with the differences of means of the response variables over the partitioning children subsets. With this representation, a wavelet soft-thresholding and reconstruction procedure allows us to fit the data to the tree structure. For normal data, the soft-thresholding is equivalent to shrinking a t-confidence interval about the origin to the origin on the real line.

Through the examples (section 3), we see that this tree regression method can be an effective alternative to CART, random-forest, smoothing splines, or support vector machines in various circumstances. Its ability to adapt to the low intrinsic dimension of data allows it to detect some hidden features of data, as is shown in Example 2, where standard methods like support vector machines fail to achieve this. It outperforms another popular tree-based method, random-forest, in terms of prediction error in our regression example (Example 1) in a high dimensional feature space with low intrinsic dimension of data. When applied to lower dimensional data (Example 3), it again shows lower prediction error than CART or random-forest, and it outperforms other smoothing methods when the regression function has discontinuities.

Other partitioning trees could be used in subsection 2.1. Rules for stopping further splitting a node can also be considered, provided that the local structures of the regression function in data can be optimally preserved. Another feature of this regression is that, unlike a standard wavelet analysis, the unbalanced Haar orthogonal system here is data dependent.

For large and high dimensional datasets, our method takes significantly more computation time than the other algorithms we used in our examples above. How to improve the speed of computation in our method is a challenge. One possible direction in searching for a faster algorithm is to use smaller and random subsets of the features for splitting each node in growing a tree.

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Funding

The work was partially supported by Simons Foundation (Grant No. 353185).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Stone CJ. Optimal rates of convergence for nonparametric estimators. Ann Stat. (1980) 8:1348–60.

2. Stone CJ. Optimal global rates of convergence for nonparametric regression. Ann Stat. (1982) 10:1040–53.

3. Bickel P, Li B. Local polynomial regression on unknown manifolds. In: Liu R, Strawderman W, Zhang CH. Complex Datasets and Inverse Problems: Tomography, Networks, and Beyond. IMS Lecture Notes Monograph Series. Vol. 54. Bethesda, MD: Institute of Mathematical Statistics (2007). p. 177–86.

4. Binev P, Cohen A, Dahmen W, DeVore R, Temlyakov V. Universal algorithms for learning theory part I: piecewise constant functions. J Mach Learn Res. (2005) 6:1297–321.

5. Binev P, Cohen A, Dahmen W, DeVore R. Universal algorithms for learning theory part II: piecewise polynomial functions. Construct Approx. (2007) 26:127–52. doi: 10.1007/s00365-006-0658-z

6. Dasgupta S, Freund Y. Random projection trees and low dimensional manifolds. In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing. New York, NY (2008) p. 537–46.

7. Kpotufe S, Dasgupta S. A tree-based regressor that adapts to intrinsic dimension. J Comput Syst Sci. (2012) 78:1496–515. doi: 10.1016/j.jcss.2012.01.002

8. Breiman L. Random forests. Mach Learn. (2001) 45:5–32. doi: 10.1023/A:1010933404324

9. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Belmont, CA: Wadsworth (1984).

10. Loh WY. Fifty years of classification and regression trees. Int Stat Rev. (2014) 82:329–48. doi: 10.1111/insr.12016

11. Morgan JN, Sonquist JA. Problems in the analysis of survey data, and a proposal. J Am Stat Assoc. (1963) 58:415–34.

12. Wasserman L. All of Nonparametric Statistics. Springer (2006).

13. Chui CK, Filbir F, Mhaskar HN. Representation of functions on big data: graphs and trees. Appl Comput Harmon Anal. (2015) 38:489–509. doi: 10.1016/j.acha.2014.06.006

14. Gavish M, Nadler B, Coifman RR. Multiscale wavelets on trees, graphs and high dimensional data: theory and applications to semi supervised learning. In: Proceedings of the 27th International Conference on Machine Learning. Haifa (2010). p. 367–74.

15. Girardi M, Sweldens W. A new class of unbalanced Haar wavelets that form an unconditional basis for L_p on general measure spaces. J Fourier Anal Appl. (1997) 3:457–74.

16. Mitrea M. Singular integrals, Hardy spaces and Clifford wavelets. Lecture Notes in Mathematics. Vol. 1575. Berlin; Heidelberg: Springer-Verlag (1994). doi: 10.1007/BFb0073556

Keywords: regression, non-linear, high dimension data, tree methods, multiscale (MS) modeling, manifold learning

Citation: Cai H and Jiang Q (2018) A Tree-Based Multiscale Regression Method. Front. Appl. Math. Stat. 4:63. doi: 10.3389/fams.2018.00063

Received: 07 October 2018; Accepted: 07 December 2018;
Published: 21 December 2018.

Edited by:

Xiaoming Huo, Georgia Institute of Technology, United States

Reviewed by:

Don Hong, Middle Tennessee State University, United States
Shao-Bo Lin, Wenzhou University, China

Copyright © 2018 Cai and Jiang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Haiyan Cai, caih@umsl.edu

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.