Original Research ARTICLE

Front. Appl. Math. Stat., 11 September 2019 | https://doi.org/10.3389/fams.2019.00046

Deep Net Tree Structure for Balance of Capacity and Approximation Ability

  • 1Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong
  • 2Department of Statistics, Stanford University, Stanford, CA, United States
  • 3Department of Mathematics, Wenzhou University, Wenzhou, China
  • 4Department of Mathematics, School of Data Science, City University of Hong Kong, Kowloon, Hong Kong

Deep learning has been successfully used in various applications including image classification, natural language processing and game theory. The heart of deep learning is to adopt deep neural networks (deep nets for short) with certain structures to build up the estimator. Depth and structure of deep nets are two crucial factors in promoting the development of deep learning. In this paper, we propose a novel tree structure to equip deep nets, in order to compensate for the capacity drawback of deep fully connected neural networks (DFCN) and to enhance the approximation ability of deep convolutional neural networks (DCNN). Based on an empirical risk minimization algorithm, we derive fast learning rates for deep nets.

1. Introduction

Deep learning [1], a learning strategy based on deep neural networks (deep nets), has recently made significant breakthroughs on the bottlenecks of classical learning schemes, such as support vector machines, random forests and boosting algorithms, by demonstrating remarkable success in research areas such as computer vision [2], speech recognition [3], and game theory [4]. Understanding the theory of deep learning has recently triggered enormous research activity in the communities of statistics, optimization, approximation theory, and learning theory. The continual rapid development of deep learning methodology, together with the verification of its rationale, is gradually lifting its veil of mystery.

Depth and structure of deep nets are two crucial factors in promoting the development of deep learning [5]. The necessity of depth has been rigorously verified from the viewpoints of approximation theory and representation theory, by showing the advantages of deep nets in localized approximation [6], sparse approximation in the frequency domain [7, 8], sparse approximation in the spatial domain [9], manifold learning [10, 11], grasping hierarchical structures [12, 13], realizing piecewise smoothness [14], universality with a bounded number of parameters [15, 16] and protecting rotation invariance [17]. We refer the readers to Pinkus [18] and Poggio et al. [19] for details on the theoretical advantages of deep nets over shallow neural networks (shallow nets). The gain in approximation and feature extraction inevitably leads to a large capacity of deep nets, making the derived estimators sensitive to the noise accumulated from the significantly increased amount of computation. In particular, under capacity measurements such as the number of linear regions [20], Betti numbers [21], and number of monomials [22], it is well known that the capacity of deep nets increases exponentially with respect to depth and polynomially with respect to width, while the increase in depth brings additional risk in stability, additional difficulty in designing learning algorithms, and may result in large variance. In this regard, we would like to point out that although the neural networks presented in Figure 1 have the same number of free parameters, the capacity of the network in Figure 1A is much larger than that in Figure 1B.

Figure 1. Deep nets vs. shallow nets. (A) A special deep net. (B) Shallow net.

Fortunately, the structure, reflected by the layer-to-layer connection rule, compensates for the capacity drawback of deep nets and makes deep learning feasible and even practical. Two dominant structures of deep nets, as shown in Figure 2, are the deep fully connected neural networks (DFCN) and deep convolutional neural networks (DCNN). The advantage of DFCN is its excellent approximation ability, since all connections are considered in this structure; its disadvantage, however, lies in its extremely large capacity, which leads to scalability difficulties and large variance from the learning theory viewpoint [23]. On the other hand, the advantage of DCNN is its small number of free parameters as a result of sparse connectivity and weight-sharing mechanisms. For example, there are 2 free parameters in each layer for a DCNN with filter length 2 (see Figure 2B). Such a parameter reduction certainly brings the benefit of stability and consequently small variance. However, it is questionable whether DCNN can maintain the attractive approximation ability of DFCN. Indeed, with the exception of the universal approximation property and approximation rate estimates [24, 25], there is insufficient theoretical study assessing the approximation capability of DCNN. Thus, equipping deep nets with an appropriate structure that reduces the number of parameters of DFCN while enhancing the approximation ability of DCNN requires a desirable balance between bias and variance in the learning process.

Figure 2. Structures for deep nets. (A) Deep fully-connected nets. (B) Convolutional neural networks.

In this paper, we propose an appropriate structure to equip deep nets with a combination of the smaller variance provided by DCNN and the smaller bias provided by DFCN. Two important ingredients of our approach are feature grouping via dimensionality-leveraging and tree-type feature extraction. Our construction is motivated by the structures of deep nets presented in Chui et al. [6] and Lin [9] for the realization of locality and sparsity features. As shown in Figure 3A, to capture the position information for $x \in \mathbb{R}^d$ among 4 candidates, a dimensionality-leveraging, from $d$ to $8d$, is used to group each piece of position information via $2d$ neurons. With the help of the neural networks in dimensionality-leveraging, features are coupled in a group of neurons, and then the tree structure, instead of the convolutional structure, is sufficient to capture such features. Thus, we will use the first hidden layer to group the features via dimensionality-leveraging, and will then utilize the tree structure to extract the features, as exhibited in Figure 3B.

Figure 3. Deep nets with tree structures. (A) Deep nets for locality. (B) Deep nets with tree structures.

It is important to emphasize that the aim of the present paper is not to pursue the advantages of deep nets with tree structures in approximation, since this has been the subject of investigation in a vast amount of literature (see for example [6, 9, 10, 13, 15, 17, 26]), but to show the benefit of tree structures in deriving small variance. In particular, using the tree structures, we are able to decouple deep nets, layer by layer, and derive a tight covering number [27] estimate by using the Lipschitz property of the activation function. Since, with the same number of neurons, there are far fewer free parameters in deep nets with tree structures than in DFCN, the covering number of the former is smaller than that of the latter, resulting in smaller variance of deep nets with tree structures. We will then derive fast learning rates for the generalization error of empirical risk minimization on deep nets. Deep nets with tree structures, as revealed by our study, possess three theoretical advantages: the capacity, as measured by the covering number, is much smaller than that of DFCN; the approximation capability is comparable with that of DFCN; and a fast learning rate is achieved by applying an empirical risk minimization algorithm.

2. Deep Nets With Tree Structures

In image processing, a standard approach is to leverage a low-dimensional image to a high-dimensional pixel-scale image. Leveraging is a brute-force approach that loses image features such as sparsity, locality and symmetry, and makes the variables highly inter-related; machine learning is one method to capture this structural information by grouping adjacent variables. In particular, DCNN with numerous hidden layers, as exhibited in Figure 4, has been utilized, with the underlying intuition that the convolutional structure can extract missing features by deepening the network. The problem is, however, that with the exception of being able to extract translation-invariance features [28], there is no theoretical verification that DCNN could outperform other neural network structures in feature extraction. Motivated by the application of DCNN in image processing, we propose a novel structure to equip deep nets for feature extraction and learning. Our basic idea is to group different features via several neurons in the first hidden layer rather than by brute-force leveraging. In this way, each group is independent and thus a tree-structured feature extraction is sufficient to extract the grouped feature, as Figure 3B illustrates.

Figure 4. DCNN in image processing.

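In the following, we present the detailed definition of deep nets with tree structures. Let $\mathbb{I} := [-1, 1]$, $x = (x^{(1)}, \dots, x^{(d)}) \in \mathbb{I}^d = [-1, 1]^d$, and let $L \in \mathbb{N}$ denote the number of hidden layers. Also let $\phi_k : \mathbb{R} \to \mathbb{R}$, $k = 0, 1, \dots, L$, be univariate activation functions. Let $N_0 = d$ and, for each $j = 1, \dots, L$, denote by $N_j \ge 2$ the size of the tree in the $j$-th hidden layer. Set

$$H_{\alpha_0,0}(x)=\sum_{j=1}^{N_0}a_{j,\alpha_0,0}\,\phi_0\big(w_{j,\alpha_0,0}x^{(j)}+b_{j,\alpha_0,0}\big),\quad x=(x^{(1)},\dots,x^{(d)}),\quad \alpha_0\in\prod_{i=1}^{L}\{1,2,\dots,N_i\}. \tag{1}$$

Then a deep net with the tree structure of $L$ layers can be formulated recursively by

$$H_{\alpha_k,k}(x)=\sum_{j=1}^{N_k}a_{j,\alpha_k,k}\,\phi_k\big(H_{j,\alpha_k,k-1}(x)+b_{j,\alpha_k,k}\big),\quad 1\le k\le L,\quad \alpha_k\in\prod_{i=k+1}^{L}\{1,2,\dots,N_i\}, \tag{2}$$

where $a_{j,\alpha_k,k}, b_{j,\alpha_k,k}, w_{j,\alpha_0,0}\in\mathbb{R}$ for each $j \in \{1, 2, \dots, N_k\}$, $k \in \{0, 1, \dots, L\}$, $\prod_{i=L+1}^{L}\{1,2,\dots,N_i\}=\emptyset$ and $H_{\alpha_{k-1},k-1}(x)=\big(H_{1,\alpha_k,k-1}(x),\dots,H_{N_k,\alpha_k,k-1}(x)\big)$. Let $\mathcal{H}_L^{tree}$ denote the set of output functions $H_L=H_{\alpha_L,L}$ for $\alpha_L\in\emptyset$ at the $L$-th layer. For $0 \le k \le L-1$ and $\alpha_k\in\prod_{i=k+1}^{L}\{1,2,\dots,N_i\}$, denote by $\mathcal{H}_{\alpha_k,k}^{tree}$ the set of functions $H_{\alpha_k,k}$ defined in (2).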
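To make the recursion in (1) and (2) concrete, the following is a minimal Python sketch of evaluating a tree-structured net. The function names `init_params` and `tree_net`, the dictionary layout of the parameters, and the random initialization are illustrative assumptions, not part of the paper. A multi-index $\alpha_k$ is stored as a tuple, and the child $H_{j,\alpha_k,k-1}$ is addressed by prepending $j$ to that tuple.

```python
import random
from itertools import product


def init_params(widths, bound, seed=0):
    """Draw parameters uniformly from [-bound, bound] (an illustrative choice).

    widths = [N_0, N_1, ..., N_L] with N_0 = d.
    params[(k, alpha, j)] = (a, b, w); w is used only in layer 0.
    """
    rng = random.Random(seed)
    L = len(widths) - 1
    params = {}
    for k in range(L + 1):
        # alpha_k ranges over the product of {1..N_i} for i = k+1, ..., L
        ranges = [range(1, widths[i] + 1) for i in range(k + 1, L + 1)]
        for alpha in product(*ranges):
            for j in range(1, widths[k] + 1):
                a = rng.uniform(-bound, bound)
                b = rng.uniform(-bound, bound)
                w = rng.uniform(-bound, bound) if k == 0 else None
                params[(k, alpha, j)] = (a, b, w)
    return params


def tree_net(x, widths, params, phis, k=None, alpha=()):
    """Evaluate H_{alpha_k, k}(x) by the recursion (1)-(2)."""
    if k is None:
        k = len(widths) - 1  # start from the output layer L
    total = 0.0
    for j in range(1, widths[k] + 1):
        a, b, w = params[(k, alpha, j)]
        if k == 0:
            inner = w * x[j - 1] + b  # equation (1): acts on coordinate x^(j)
        else:
            # equation (2): the child index alpha_{k-1} is (j,) + alpha_k
            inner = tree_net(x, widths, params, phis, k - 1, (j,) + alpha) + b
        total += a * phis[k](inner)
    return total
```

For example, with `widths = [2, 2, 2]`, every node in layer $k$ has its own private subtree in layer $k-1$, so no neuron is shared between branches; this is the property that the covering number argument later exploits.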

By setting $\phi_0(t) = t$ and $b_{j,\alpha_0,0}=0$, it is easy to see that $\mathcal{H}_1^{tree}$ reduces to the classical shallow net. In view of the tree structure, it follows from (1), (2) and Figure 5 that there are a total of

$$A_L:=2\sum_{k=0}^{L}\prod_{\ell=0}^{L-k}N_{L-\ell}+\prod_{\ell=0}^{L}N_{\ell} \tag{3}$$

free parameters for $H_L\in\mathcal{H}_L^{tree}$. For $\alpha, R\ge 1$, we introduce the notation

$$\mathcal{H}_{L,\alpha,R}^{tree}:=\Big\{H_L\in\mathcal{H}_L^{tree}: |a_{j,\alpha_k,k}|,|b_{j,\alpha_k,k}|,|w_{j,\alpha_k,0}|\le R(A_L)^{\alpha},\ 0\le k\le L,\ 1\le j\le N_k,\ \alpha_k\in\prod_{i=k+1}^{L}\{1,2,\dots,N_i\}\Big\}. \tag{4}$$

With the restrictions imposed by (4) on deep nets, the parameters are bounded. This is indeed a necessary condition, since it can be found in Guo et al. [29] and Maiorov and Pinkus [15] that there exists some $h\in\mathcal{H}_{2,\infty,\infty}^{tree}$ with finitely many neurons but infinite capacity (covering number).
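As a quick numerical illustration of the count in (3), the sketch below computes $A_L$ from the tree sizes and compares it with the parameter count of a fully connected net that has the same number of neurons in each hidden layer. The function names and the specific fully connected layout (weights, biases, and one outer weight vector) are assumptions made for the comparison only.

```python
from math import prod


def tree_param_count(widths):
    """A_L from (3); widths = [N_0, N_1, ..., N_L] with N_0 = d."""
    L = len(widths) - 1
    # prod(widths[k:]) = N_k * N_{k+1} * ... * N_L, the number of (a, b) pairs in layer k
    ab = 2 * sum(prod(widths[k:]) for k in range(L + 1))
    w = prod(widths)  # inner weights w, present only in layer 0
    return ab + w


def dfcn_param_count(widths):
    """Parameters of a fully connected net with the same neurons per layer (assumed layout)."""
    L = len(widths) - 1
    d = widths[0]
    sizes = [prod(widths[k:]) for k in range(L + 1)]  # neurons in layers 0..L
    count = d * sizes[0] + sizes[0]                   # input weights and biases
    for prev, cur in zip(sizes, sizes[1:]):
        count += prev * cur + cur                     # dense weights and biases
    return count + sizes[-1]                          # outer weights
```

With `widths = [3, 2, 2]` the tree-structured net has 48 free parameters while the fully connected layout above has 112, and the gap widens quickly as the tree sizes grow.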

Figure 5. Size of tree in deep nets.

3. Advantages of Deep Nets With Tree Structures

The study of the advantages of deep nets over shallow nets in approximation is a classical topic, and several theoretical benefits of deep nets have been revealed in a large body of literature. We refer the readers to the review paper [18] for more details. Due to their concise mathematical formulation, deep nets with tree structures are among the most popular structures in approximation theory. This dates back to Mhaskar [26], where it was proved that deep nets with tree structures can be constructed to overcome the saturation phenomenon of shallow nets, that is, the phenomenon that the approximation rate cannot go beyond a certain level when the regularity of the target function increases. In Chui et al. [6], deep nets with two hidden layers and tree structures were constructed to provide localized approximation, which is beyond the capability of shallow nets. In Maiorov and Pinkus [15], a deep net with tree structures, two hidden layers and finitely many neurons was demonstrated to possess the universal approximation property. Furthermore, in our recent papers Chui et al. [10, 17], deep nets with tree structures were proved to be capable of extracting the manifold structure feature and the rotation-invariance feature, respectively.

Most importantly, it is clear from the above-mentioned results that deep nets with tree structures do not degrade the approximation performance of DFCN, while the sparse connections between neurons significantly reduce the number of free parameters. In the following, we will show that deep nets with tree structures have an overall advantage over DFCN by deriving tight covering number estimates. Let $\mathbb{B}$ be a Banach space and $V$ be a subset of $\mathbb{B}$. Denote by $\mathcal{N}(\varepsilon,V,\mathbb{B})$ the $\varepsilon$-covering number of $V$ under the metric of $\mathbb{B}$ [27], defined as the minimal number of elements in an $\varepsilon$-net of $V$. For $\mathbb{B}=L^{\infty}(\mathbb{I}^d)$, we set $\mathcal{N}(\varepsilon,V):=\mathcal{N}(\varepsilon,V,L^{\infty}(\mathbb{I}^d))$ for brevity. The objective of this consideration is to establish the following theorem, which exhibits a tight bound for the covering numbers of $\mathcal{H}_{L,\alpha,R}^{tree}$.
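The notion of an $\varepsilon$-net can be illustrated on the simplest set that appears later in the analysis, a bounded parameter interval $[-A, A]$. The sketch below (the function name `interval_net` is hypothetical) places equally spaced centers so that every point of the interval lies within distance less than $\varepsilon$ of some center, using about $2A/\varepsilon$ centers up to rounding.

```python
import math


def interval_net(A, eps):
    """Return centers of an eps-net of [-A, A] with ceil(2A/eps) points."""
    n = max(1, math.ceil(2 * A / eps))
    step = 2 * A / n  # step <= eps, so each cell has half-width <= eps / 2
    return [-A + step * (i + 0.5) for i in range(n)]
```

The covering number is the size of the smallest such net, so any explicit construction of this kind only provides an upper bound.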

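Theorem 1. Assume that

$$|\phi_j(t)-\phi_j(t')|\le c_1|t-t'|,\quad\text{and}\quad |\phi_j(t)|\le 1,\quad \forall\, t,t'\in\mathbb{R},\ j=0,\dots,L. \tag{5}$$

Then for any $0 < \varepsilon \le 1$,

$$\mathcal{N}\big(\varepsilon,\mathcal{H}_{L,\alpha,R}^{tree}\big)\le\left(\frac{2^{L+5/2}\,c_1^{L+3/2}\,A_{R,\alpha,L}^{L+1}}{\varepsilon}\right)^{2A_L}, \tag{6}$$

where $A_{R,\alpha,L}:=R(A_L)^{\alpha}$ and $A_L$ is defined by (3).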
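To get a feel for the size of the bound (6), it helps to work with its logarithm, $2A_L\big[(L+\tfrac52)\log 2+(L+\tfrac32)\log c_1+(L+1)\log A_{R,\alpha,L}+\log\tfrac1\varepsilon\big]$. The following sketch evaluates this expression; the function name and the sample inputs are illustrative assumptions.

```python
import math
from math import prod


def log_cover_bound(widths, R, alpha, c1, eps):
    """Natural logarithm of the right-hand side of (6).

    widths = [N_0, ..., N_L]; R, alpha >= 1; c1 is the Lipschitz constant in (5).
    """
    L = len(widths) - 1
    A_L = 2 * sum(prod(widths[k:]) for k in range(L + 1)) + prod(widths)  # (3)
    A = R * A_L ** alpha                                                   # A_{R,alpha,L}
    return 2 * A_L * (
        (L + 2.5) * math.log(2)
        + (L + 1.5) * math.log(c1)
        + (L + 1) * math.log(A)
        - math.log(eps)
    )
```

The logarithm grows linearly in the number of free parameters $A_L$ and, through the factor $(L+1)\log A_{R,\alpha,L}$, only linearly in the depth $L$ for a fixed parameter budget.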

The proof of Theorem 1 is deferred to section 5. We remark that assumption (5) is mild. Indeed, almost all widely used activation functions, including the logistic function $\phi(t)=\frac{1}{1+e^{-t}}$, the hyperbolic tangent sigmoidal function $\phi(t)=\frac12(\tanh(t)+1)$ with $\tanh(t) = (e^{2t}-1)/(e^{2t}+1)$, the arctan sigmoidal function $\phi(t)=\frac{1}{\pi}\arctan(t)+\frac12$, the Gompertz function $\phi(t) = e^{-ae^{-bt}}$ with $a, b > 0$, and the Gaussian function $\sigma(t) = e^{-t^2}$, satisfy this assumption. We also remark that numerous quantities, such as the number of linear regions [20], Betti numbers [21], VC-dimension [30], and number of monomials [22], have been employed to measure the capacity of deep nets. Compared with these measurements, covering numbers possess three advantages. Firstly, the covering number is close to the coding length in information theory according to the encode-decode theory proposed by Donoho [31]. Thus, it is a powerful capacity measurement to show the expressivity of deep nets. Secondly, covering numbers determine the limitations of the approximation ability of deep nets [17, 29]. Therefore, studying covering numbers of deep nets facilitates the verification of the optimality of the existing approximation results in Chui et al. [6, 10, 17] and Mhaskar [26]. Finally, covering numbers usually correspond to some oracle inequalities [23] and can reflect the stability of learning algorithms. All these features suggest the rationality of adopting the covering number to measure the capacity of deep nets.
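Assumption (5) can be checked numerically for the activation functions listed above. The sketch below estimates $\sup|\phi|$ and the largest difference quotient on a uniform grid; the grid range, the choice $a=b=1$ for the Gompertz function, and the helper name `check_assumption` are assumptions for illustration, and a grid check is of course only evidence, not a proof.

```python
import math

ACTIVATIONS = {
    "logistic": lambda t: 1.0 / (1.0 + math.exp(-t)),
    "tanh_sigmoid": lambda t: 0.5 * (math.tanh(t) + 1.0),
    "arctan_sigmoid": lambda t: math.atan(t) / math.pi + 0.5,
    "gompertz": lambda t: math.exp(-math.exp(-t)),  # a = b = 1 (illustrative)
    "gaussian": lambda t: math.exp(-t * t),
}


def check_assumption(phi, lo=-20.0, hi=20.0, n=40001):
    """Return (max |phi|, max difference quotient) over a uniform grid."""
    ts = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    vals = [phi(t) for t in ts]
    h = (hi - lo) / (n - 1)
    sup = max(abs(v) for v in vals)
    lip = max(abs(v2 - v1) for v1, v2 in zip(vals, vals[1:])) / h
    return sup, lip
```

On this grid every listed function stays bounded by 1, and the estimated Lipschitz constants are all below 1, so (5) is consistent with taking $c_1 = 1$ for these choices.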

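Under the Lipschitz assumption (5) for the activation function, a bound of the covering number for the set

$$\mathcal{F}:=\big\{f=\sigma(w\cdot x+b): w\in\mathbb{R}^d,\ b\in\mathbb{R},\ \|f\|_*\le 1\big\}$$

with $\|\cdot\|_*$ denoting some norm including the uniform norm was derived in Kůrková and Sanguineti [32]. Based on this, Maiorov [33] presented a tight estimate for shallow nets as

$$\log\mathcal{N}\big(\varepsilon,\mathcal{S}_{\sigma,n}^*\big)=\mathcal{O}\left(nd\log\frac{\Gamma_n}{\varepsilon}\right), \tag{7}$$

where

$$\mathcal{S}_{\sigma,n}^*:=\Big\{\sum_{j=1}^{n}c_j\sigma(w_j\cdot x+\theta_j): |c_j|,|w_j^{(i)}|,|\theta_j|\le\Gamma_n,\ 1\le j\le n,\ 1\le i\le d\Big\}$$

and $\Gamma_n > 0$ depends on $n$.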

Estimates of covering numbers for deep nets were first studied in Kohler and Krzyżak [34], where a tight bound for covering numbers of deep nets with tree structures and two hidden layers was derived. Using a similar approach, Kohler and Krzyżak [34] and Lin [9] presented an upper bound estimate for deep nets with tree structures and five hidden layers, without the Lipschitz assumption (5) on the activation function. Recently, Kohler and Krzyżak [13] provided an estimate for covering numbers of deep nets with $L$ hidden layers, $L \in \mathbb{N}$. Furthermore, covering numbers for deep nets with arbitrary structures and bounded parameters were deduced in Guo et al. [29]. Our result, exhibited in Theorem 1, establishes a covering number estimate for deep nets with arbitrarily many hidden layers and tree structures. This result improves the estimate in Guo et al. [29] by reducing the exponent of $A_{R,\alpha,L}$ from $L^2$ to $L+1$, which is a substantial gain since $A_{R,\alpha,L}>1$ is usually very large. The main tool in our analysis is to use the Lipschitz property of the activation function and the boundedness of the free parameters to decouple the depth layer by layer, which is possible due to the tree structures. It should be mentioned that Theorem 1 also removes the monotonically increasing assumption on the activation function while exhibiting a covering number estimate similar to that of Anthony and Bartlett [35, Theorem 14.5]. Due to the boundedness assumption in (5), our result does not cover deep nets with the widely used rectified linear unit (ReLU). Using the technique in Guo et al. [29, Lemma 1], we can derive upper bound estimates for deep ReLU nets in different layers, but this leads to an additional power $L$ on $A_{R,\alpha,L}^{L+1}$ in (6), i.e., $A_{R,\alpha,L}^{L^2+L}$. Thus, a novel technique is required to derive the same covering number estimate for deep ReLU nets as in Theorem 1. We leave this for future work.

4. Generalization Error Estimates for Deep Nets

In this section, we present the generalization error estimates for empirical risk minimization on deep nets in the framework of learning theory [23]. In this framework, samples $D_m=\{(x_i,y_i)\}_{i=1}^m$ are assumed to be drawn independently according to a Borel probability measure $\rho$ on $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ with $\mathcal{X}=\mathbb{I}^d$ and $\mathcal{Y}\subseteq[-M,M]$ for some $M > 0$. The primary objective is to estimate the regression function

$$f_\rho(x)=\int_{\mathcal{Y}}y\,d\rho(y|x),\quad x\in\mathcal{X},$$

which minimizes the generalization error

$$\mathcal{E}(f):=\int_{\mathcal{Z}}(f(x)-y)^2\,d\rho,$$

where $\rho(y|x)$ denotes the conditional distribution at $x$ induced by $\rho$. Let $\rho_X$ be the marginal distribution of $\rho$ on $\mathcal{X}$ and $(L^2_{\rho_X},\|\cdot\|_\rho)$ be the Hilbert space of $\rho_X$ square-integrable functions on $\mathcal{X}$. For $f\in L^2_{\rho_X}$, we have [23]

$$\mathcal{E}(f)-\mathcal{E}(f_\rho)=\|f-f_\rho\|_\rho^2. \tag{8}$$

Denote by $\mathcal{E}_D(f):=\frac1m\sum_{i=1}^m(f(x_i)-y_i)^2$ the empirical risk for the estimator $f$. Before presenting the generalization error for deep nets with tree structures, we derive an oracle inequality based on covering numbers for the empirical risk minimization (ERM) algorithm, i.e.,

$$f_{D,\mathcal{H}}=\arg\min_{f\in\mathcal{H}}\mathcal{E}_D(f), \tag{9}$$

where $\mathcal{H}$ is a set of continuous functions on $\mathcal{X}$ and is $\mathcal{H}_{L,\alpha,R}^{tree}$ in our study. Since $|y| \le M$ almost everywhere, we have $|f_\rho(x)| \le M$. It is natural to project an output function $f:\mathcal{X}\to\mathbb{R}$ onto the interval $[-M, M]$ by the projection operator

$$\pi_Mf(x):=\begin{cases}f(x), & \text{if } -M\le f(x)\le M,\\ M, & \text{if } f(x)>M,\\ -M, & \text{if } f(x)<-M.\end{cases}$$

Thus, the estimator we study in this paper is $\pi_Mf_{D,\mathcal{H}}$. The following theorem presents the oracle inequality for ERM based on covering numbers.
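The estimator $\pi_Mf_{D,\mathcal{H}}$ combines two simple operations, which the following sketch illustrates. Here the hypothesis space is replaced by a finite list of candidate functions, purely as a stand-in for $\mathcal{H}$; the function names `project`, `empirical_risk` and `erm` are hypothetical. The sketch also shows the fact used later in the proof of Theorem 2: when all $|y_i|\le M$, truncation never increases the empirical risk.

```python
def project(f, M):
    """Return the truncated function pi_M f, clipping values to [-M, M]."""
    return lambda x: max(-M, min(M, f(x)))


def empirical_risk(f, sample):
    """Mean squared error of f over sample = [(x_1, y_1), ..., (x_m, y_m)]."""
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)


def erm(candidates, sample):
    """Empirical risk minimizer (9) over a finite candidate set (illustrative)."""
    return min(candidates, key=lambda f: empirical_risk(f, sample))
```

For instance, with the sample `[(0.0, 0.5), (1.0, -0.5), (-1.0, 1.0)]` and $M = 1$, the function $x \mapsto 3x$ has empirical risk $9.5$, while its truncation has empirical risk $6.5/3$.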

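Theorem 2. Suppose there exist $n, U>0$, such that

$$\log\mathcal{N}(\varepsilon,\mathcal{H})\le n\log\frac{U}{\varepsilon},\quad\forall\,\varepsilon>0. \tag{10}$$

Then for any $h\in\mathcal{H}$ and $\varepsilon > 0$,

$$\mathrm{Prob}\big\{\|\pi_Mf_{D,\mathcal{H}}-f_\rho\|_\rho^2>\varepsilon+2\|h-f_\rho\|_\rho^2\big\}\le\exp\left\{n\log\frac{16UM}{\varepsilon}-\frac{3m\varepsilon}{512M^2}\right\}+\exp\left\{\frac{-3m\varepsilon^2}{16\big(3M+\|h\|_{L^\infty(\mathcal{X})}\big)^2\big(6\|h-f_\rho\|_\rho^2+\varepsilon\big)}\right\}.$$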

The proof of Theorem 2 will be given in the next section. Theorem 2 shows that the covering number plays an important role in deducing the generalization error. As a result of this theorem and Theorem 1, we can derive tight generalization error bounds for ERM on deep nets with tree structures. Suppose that there exist some $\beta > 0$, $\tilde{c}>0$, $R>0$ and $\alpha > 0$, such that

$$\min_{g\in\mathcal{H}_{L,\alpha,R}^{tree}}\|f_\rho-g\|_{L^\infty(\mathbb{I}^d)}\le\tilde{c}\,A_L^{-\beta}. \tag{11}$$

Define

$$f_{D,L}=\arg\min_{f\in\mathcal{H}_{L,\alpha,R}^{tree}}\mathcal{E}_D(f). \tag{12}$$

We then derive the following generalization error estimate for (12).

Theorem 3. Let $0 < \delta < 1$. Suppose that there exist some $\beta,\tilde{c},\alpha,R>0$ such that (11) holds. If (5) holds and $C'm^{1/(2\beta+1)}\le LA_L\le C''m^{1/(2\beta+1)}$, then with confidence at least $1-\delta$, we have

$$\mathcal{E}(\pi_Mf_{D,L})-\mathcal{E}(f_\rho)\le CL^{2\beta}m^{-\frac{2\beta}{2\beta+1}}\log m\log\frac{3}{\delta}, \tag{13}$$

where $C$, $C'$, $C''$ are constants independent of $A_L$, $L$, $N_1, \dots, N_L$, $m$, or $\delta$.

The proof of Theorem 3 will be given in the next section. Assumption (11) describes the expressivity of $\mathcal{H}_{L,\alpha,R}^{tree}$. For some constants $\alpha, R$, the exponent $\beta$ in (11) reflects the regularity of the regression function $f_\rho$. In particular, it can be found in Chui et al. [17] and Guo et al. [29] that the Lipschitz continuity and the radial property of $f_\rho$ correspond to $\beta = 1/d$ and $\beta = 1$, respectively. The bound (13) contains an additional factor depending on $L$, which is different from the generalization errors of shallow nets [36] and deep nets with fixed depth [10]. The main reason is that there is an additional $L$ in the exponent of the covering number bound (6) for $\mathcal{H}_{L,\alpha,R}^{tree}$. With the same number of parameters, a larger depth of deep nets with tree structures usually leads to larger variance, as shown in (13). However, it was also shown in Chui et al. [6, 10, 17], Guo et al. [29], Lin [9, 37], Mhaskar and Poggio [12], and Pinkus [18] that depth is necessary for improving the performance of deep nets. It would be of some interest to study the smallest depth of deep nets with tree structures needed to extract specific features. This study is left for future work.
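The trade-off expressed by Theorem 3 can be explored numerically. The sketch below evaluates the right-hand side of (13) with the unknown constant $C$ set to 1, and the parameter budget $A_L \approx m^{1/(2\beta+1)}/L$ implied by the condition on $LA_L$ with the constants $C'$, $C''$ also set to 1. Both simplifications, and the function names, are assumptions for illustration only.

```python
import math


def rate(m, beta, L, delta):
    """Right-hand side of (13) with the constant C set to 1 (illustrative)."""
    return (
        L ** (2 * beta)
        * m ** (-2 * beta / (2 * beta + 1))
        * math.log(m)
        * math.log(3 / delta)
    )


def param_budget(m, beta, L):
    """Order of A_L suggested by L * A_L ~ m^{1/(2 beta + 1)} (constants set to 1)."""
    return m ** (1 / (2 * beta + 1)) / L
```

For a fixed sample size $m$, increasing the depth $L$ shrinks the admissible number of parameters and inflates the bound by the factor $L^{2\beta}$, which matches the remark above that larger depth with the same parameter budget tends to increase variance.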

5. Proofs of Main Results

To facilitate our proof of Theorem 1, let us first establish the following lemma:

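Lemma 1. Let $\iota \in \mathbb{N}$, $\mathbb{A} \subseteq \mathbb{R}^\iota$, $\mathcal{B}$ be a Banach space of functions on $\mathbb{A}$ and $R_1,R_2>0$. For $\mathcal{F},\mathcal{G}\subseteq\mathcal{B}$, set $\mathcal{F}\oplus\mathcal{G}:=\{f+f^*: f\in\mathcal{F}, f^*\in\mathcal{G}\}$ and $\mathcal{F}\otimes\mathcal{G}:=\{f\cdot f^*: f\in\mathcal{F}, f^*\in\mathcal{G}\}$. Then it follows that for any $\varepsilon, \nu > 0$,

$$\mathcal{N}(\varepsilon+\nu,\mathcal{F}\oplus\mathcal{G},\mathcal{B})\le\mathcal{N}(\varepsilon,\mathcal{F},\mathcal{B})\,\mathcal{N}(\nu,\mathcal{G},\mathcal{B}). \tag{14}$$

In addition, if $\max_{x\in\mathbb{A}}|f(x)|\le R_1$, $\max_{x\in\mathbb{A}}|f^*(x)|\le R_2$ for all $f\in\mathcal{F}$ and $f^*\in\mathcal{G}$, and $\mathcal{F}\otimes\mathcal{G}\subseteq\mathcal{B}$, then

$$\mathcal{N}(\varepsilon+\nu,\mathcal{F}\otimes\mathcal{G},\mathcal{B})\le\mathcal{N}(\varepsilon/R_2,\mathcal{F},\mathcal{B})\,\mathcal{N}(\nu/R_1,\mathcal{G},\mathcal{B}). \tag{15}$$

Proof. Let $\{f_1, \dots, f_N\}$ and $\{f_1^*,\dots,f_{N'}^*\}$ be an $\varepsilon$-cover of $\mathcal{F}$ and a $\nu$-cover of $\mathcal{G}$, respectively, with

$$N=\mathcal{N}(\varepsilon,\mathcal{F},\mathcal{B}),\quad\text{and}\quad N'=\mathcal{N}(\nu,\mathcal{G},\mathcal{B}). \tag{16}$$

Then, for every $f\in\mathcal{F}$ and $f^*\in\mathcal{G}$, there exist $k \in \{1, \dots, N\}$ and $\ell \in \{1, \dots, N'\}$, such that

$$\|f-f_k\|_{\mathcal{B}}<\varepsilon,\quad \|f^*-f_\ell^*\|_{\mathcal{B}}<\nu.$$

By the triangle inequality, we have

$$\|f+f^*-f_k-f_\ell^*\|_{\mathcal{B}}\le\|f-f_k\|_{\mathcal{B}}+\|f^*-f_\ell^*\|_{\mathcal{B}}<\varepsilon+\nu.$$

Thus, $\{f_k+f_\ell^*: 1\le k\le N, 1\le\ell\le N'\}$ is an $(\varepsilon+\nu)$-cover of $\mathcal{F}\oplus\mathcal{G}$. Therefore, (16) implies

$$\mathcal{N}(\varepsilon+\nu,\mathcal{F}\oplus\mathcal{G},\mathcal{B})\le NN'=\mathcal{N}(\varepsilon,\mathcal{F},\mathcal{B})\,\mathcal{N}(\nu,\mathcal{G},\mathcal{B}).$$

This establishes (14).

To prove (15), let $\{f_1, \dots, f_{N^*}\}$ and $\{f_1^*,\dots,f_{N_*}^*\}$ be an $\varepsilon/R_2$-cover of $\mathcal{F}$ and a $\nu/R_1$-cover of $\mathcal{G}$, respectively, with

$$N^*=\mathcal{N}(\varepsilon/R_2,\mathcal{F},\mathcal{B}),\quad\text{and}\quad N_*=\mathcal{N}(\nu/R_1,\mathcal{G},\mathcal{B}). \tag{17}$$

Then, for every $f\in\mathcal{F}$ and $f^*\in\mathcal{G}$, there exist $k \in \{1, \dots, N^*\}$ and $\ell\in\{1,\dots,N_*\}$ that satisfy $\max_{x\in\mathbb{A}}|f_k(x)|\le R_1$ and $\max_{x\in\mathbb{A}}|f_\ell^*(x)|\le R_2$ such that

$$\|f-f_k\|_{\mathcal{B}}<\varepsilon/R_2,\quad \|f^*-f_\ell^*\|_{\mathcal{B}}<\nu/R_1.$$

It then follows from the triangle inequality that

$$\|f\cdot f^*-f_k\cdot f_\ell^*\|_{\mathcal{B}}\le\|f\cdot f^*-f\cdot f_\ell^*\|_{\mathcal{B}}+\|f\cdot f_\ell^*-f_k\cdot f_\ell^*\|_{\mathcal{B}}\le R_1\|f^*-f_\ell^*\|_{\mathcal{B}}+R_2\|f-f_k\|_{\mathcal{B}}<\nu+\varepsilon,$$

which implies that $\{f_kf_\ell^*: 1\le k\le N^*, 1\le\ell\le N_*\}$ is an $(\varepsilon+\nu)$-cover of $\mathcal{F}\otimes\mathcal{G}$. This together with (17) implies

$$\mathcal{N}(\varepsilon+\nu,\mathcal{F}\otimes\mathcal{G},\mathcal{B})\le N^*N_*=\mathcal{N}(\varepsilon/R_2,\mathcal{F},\mathcal{B})\,\mathcal{N}(\nu/R_1,\mathcal{G},\mathcal{B}).$$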

This completes the proof of Lemma 1.

We are now ready to prove Theorem 1 as follows.

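Proof of Theorem 1: Define, for $k \in \{0, 1, \dots, L\}$ and $\alpha_k\in\prod_{i=k+1}^{L}\{1,2,\dots,N_i\}$,

$$\mathcal{H}_{k,\alpha,R,L,\alpha_k}^{tree}:=\Big\{H_k\in\mathcal{H}_{\alpha_k,k}^{tree}: |a_{j,\alpha_\ell,\ell}|,|b_{j,\alpha_\ell,\ell}|,|w_{j,\alpha_\ell,0}|\le A_{R,\alpha,L},\ 0\le\ell\le k,\ 1\le j\le N_\ell,\ \alpha_\ell\in\prod_{i=\ell+1}^{k}\{1,2,\dots,N_i\}\Big\}. \tag{18}$$

Then, (14) implies that for $\varepsilon > 0$,

$$\mathcal{N}\big(\varepsilon,\mathcal{H}_{k,\alpha,R,L,\alpha_k}^{tree}\big)\le\Big(\max_{1\le j\le N_k}\mathcal{N}\big(\varepsilon/N_k,\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,*}\big)\Big)^{N_k}, \tag{19}$$

where for $1 \le j \le N_k$,

$$\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,*}:=\Big\{f_j^*(x)=a_{j,\alpha_k,k}\,\phi_k\big(H_{j,\alpha_k,k-1}(x)+b_{j,\alpha_k,k}\big): |a_{j,\alpha_\ell,\ell}|,|b_{j,\alpha_\ell,\ell}|\le A_{R,\alpha,L},\ H_{j,\alpha_k,k-1}\in\mathcal{H}_{k-1,\alpha,R,L,\alpha_{k-1}}^{tree},\ 0\le\ell\le k,\ \alpha_\ell\in\prod_{i=\ell+1}^{k}\{1,2,\dots,N_i\}\Big\}.$$

For each $j \in \{1, \dots, N_k\}$, since $|a_{j,\alpha_k,k}|\le A_{R,\alpha,L}$ and $\|\phi_k\|_{L^\infty(\mathbb{R})} \le 1$, we obtain, from (15) with $\iota = 1$, $\mathcal{B} = L^\infty(\mathbb{R})$, $R_1=A_{R,\alpha,L}$ and $R_2=1$, that

$$\mathcal{N}\big(\varepsilon/N_k,\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,*}\big)\le\mathcal{N}\big(\varepsilon/N_k,\{a_{j,\alpha_k,k}: |a_{j,\alpha_k,k}|\le A_{R,\alpha,L}\}\big)\,\mathcal{N}\big(\varepsilon/(N_kA_{R,\alpha,L}),\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,**}\big), \tag{20}$$

where

$$\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,**}:=\Big\{f_j^{**}(x)=\phi_k\big(H_{j,\alpha_k,k-1}(x)+b_{j,\alpha_k,k}\big): |b_{j,\alpha_k,k}|\le A_{R,\alpha,L},\ H_{j,\alpha_k,k-1}\in\mathcal{H}_{k-1,\alpha,R,L,\alpha_{k-1}}^{tree},\ 0\le\ell\le k-1,\ \alpha_\ell\in\prod_{i=\ell+1}^{k-1}\{1,2,\dots,N_i\}\Big\}.$$

Since $\phi_k$ satisfies (5), it follows from the definition of the covering number that

$$\mathcal{N}\big(\varepsilon/N_k,\{a_{j,\alpha_k,k}: |a_{j,\alpha_k,k}|\le A_{R,\alpha,L}\}\big)\le\frac{2N_kA_{R,\alpha,L}}{\varepsilon} \tag{21}$$

and

$$\mathcal{N}\big(\varepsilon/(N_kA_{R,\alpha,L}),\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,**}\big)\le\mathcal{N}\big(\varepsilon/(c_1N_kA_{R,\alpha,L}),\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,***}\big), \tag{22}$$

where

$$\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,***}:=\Big\{f_j^{***}(x)=H_{j,\alpha_k,k-1}(x)+b_{j,\alpha_k,k}: |b_{j,\alpha_k,k}|\le A_{R,\alpha,L},\ H_{j,\alpha_k,k-1}\in\mathcal{H}_{k-1,\alpha,R,L,\alpha_{k-1}}^{tree},\ 0\le\ell\le k-1,\ \alpha_\ell\in\prod_{i=\ell+1}^{k-1}\{1,2,\dots,N_i\}\Big\}.$$

Using (14) again, we have

$$\begin{aligned}\mathcal{N}\big(\varepsilon/(c_1N_kA_{R,\alpha,L}),\mathcal{H}_{k,j,\alpha,R,L,\alpha_k}^{tree,***}\big)&\le\mathcal{N}\big(\varepsilon/(2c_1N_kA_{R,\alpha,L}),\{b_{j,\alpha_k,k}: |b_{j,\alpha_k,k}|\le A_{R,\alpha,L}\}\big)\,\mathcal{N}\big(\varepsilon/(2c_1N_kA_{R,\alpha,L}),\mathcal{H}_{k-1,\alpha,R,L,\alpha_{k-1}}^{tree}\big)\\&\le\frac{4c_1N_kA_{R,\alpha,L}}{\varepsilon}\,\mathcal{N}\big(\varepsilon/(2c_1N_kA_{R,\alpha,L}),\mathcal{H}_{k-1,\alpha,R,L,\alpha_{k-1}}^{tree}\big).\end{aligned} \tag{23}$$

Combining (19), (20), (21), (22), and (23), we get

$$\mathcal{N}\big(\varepsilon,\mathcal{H}_{k,\alpha,R,L,\alpha_k}^{tree}\big)\le\left(\frac{8c_1N_k^2A_{R,\alpha,L}^2}{\varepsilon^2}\right)^{N_k}\Big(\mathcal{N}\big(\varepsilon/(2c_1N_kA_{R,\alpha,L}),\mathcal{H}_{k-1,\alpha,R,L,\alpha_{k-1}}^{tree}\big)\Big)^{N_k}. \tag{24}$$

Using (24), we have

$$\begin{aligned}\mathcal{N}\big(\varepsilon,\mathcal{H}_{L,\alpha,R,L,\alpha_L}^{tree}\big)&\le\left(\frac{8c_1N_L^2A_{R,\alpha,L}^2}{\varepsilon^2}\right)^{N_L}\left[\mathcal{N}\left(\frac{\varepsilon}{2c_1N_LA_{R,\alpha,L}},\mathcal{H}_{L-1,\alpha,R,L,\alpha_{L-1}}^{tree}\right)\right]^{N_L}\\&\le\left(\frac{8c_1N_L^2A_{R,\alpha,L}^2}{\varepsilon^2}\right)^{N_L}\left(\frac{8c_1(2c_1)^2N_{L-1}^2N_L^2A_{R,\alpha,L}^4}{\varepsilon^2}\right)^{N_LN_{L-1}}\times\left[\mathcal{N}\left(\frac{\varepsilon}{(2c_1)^2A_{R,\alpha,L}^2N_LN_{L-1}},\mathcal{H}_{L-2,\alpha,R,L,\alpha_{L-2}}^{tree}\right)\right]^{N_LN_{L-1}},\end{aligned}$$

which implies by induction

$$\begin{aligned}\mathcal{N}\big(\varepsilon,\mathcal{H}_{L,\alpha,R,L,\alpha_L}^{tree}\big)\le{}&(2c_1)^{2\sum_{k=1}^{L-1}(L-k)\prod_{\ell=0}^{L-k}N_{L-\ell}}\times(A_{R,\alpha,L})^{2\sum_{k=1}^{L}(L-k+1)\prod_{\ell=0}^{L-k}N_{L-\ell}}\prod_{k=1}^{L}\Big(\prod_{\ell=k}^{L}N_\ell\Big)^{2\prod_{\ell=k}^{L}N_\ell}\left(\frac{8c_1}{\varepsilon^2}\right)^{\sum_{k=1}^{L}\prod_{\ell=0}^{L-k}N_{L-\ell}}\\&\times\left[\mathcal{N}\left(\frac{\varepsilon}{(2c_1)^LA_{R,\alpha,L}^LN_LN_{L-1}N_{L-2}\cdots N_1},\mathcal{H}_{0,\alpha,R,L,\alpha_0}^{tree}\right)\right]^{N_LN_{L-1}N_{L-2}\cdots N_1}.\end{aligned} \tag{25}$$

For arbitrary $\nu > 0$, using the same arguments as those in proving (24), we get

$$\mathcal{N}\big(\nu,\mathcal{H}_{0,\alpha,R,L,\alpha_0}^{tree}\big)\le\left(\frac{8c_1N_0^2A_{R,\alpha,L}^2}{\nu^2}\right)^{N_0}\times\Big(\max_{1\le j\le N_0}\mathcal{N}\big(\nu/(2c_1N_0A_{R,\alpha,L}),\{w_{j,\alpha_0,0}x^{(j_0)}+b_{j,\alpha_0,0}: |w_{j,\alpha_0,0}|,|b_{j,\alpha_0,0}|\le A_{R,\alpha,L}\}\big)\Big)^{N_0}.$$

For $j \in \{1, \dots, N_0\}$ and $0\le x^{(j_0)}\le 1$, noting that $\{w_{j,\alpha_0,0}x^{(j_0)}+b_{j,\alpha_0,0}: |w_{j,\alpha_0,0}|,|b_{j,\alpha_0,0}|\le A_{R,\alpha,L}\}$ is contained in a two-dimensional linear space whose elements are bounded by $2A_{R,\alpha,L}$, we get

$$\mathcal{N}\big(\nu/(2c_1N_0A_{R,\alpha,L}),\{w_{j,\alpha_0,0}x^{(j_0)}+b_{j,\alpha_0,0}: |w_{j,\alpha_0,0}|,|b_{j,\alpha_0,0}|\le A_{R,\alpha,L}\}\big)\le\left(\frac{8c_1N_0A_{R,\alpha,L}^2}{\nu}\right)^2.$$

This implies

$$\mathcal{N}\left(\frac{\varepsilon}{(2c_1)^LA_{R,\alpha,L}^LN_LN_{L-1}N_{L-2}\cdots N_1},\mathcal{H}_{0,\alpha,R,L,\alpha_0}^{tree}\right)\le\left(\frac{8c_1(2c_1)^{2L}A_{R,\alpha,L}^{2L+2}N_L^2N_{L-1}^2\cdots N_0^2}{\varepsilon^2}\right)^{N_0}\left(\frac{8c_1(2c_1)^LA_{R,\alpha,L}^{L+1}N_LN_{L-1}N_{L-2}\cdots N_1N_0}{\varepsilon}\right)^{2N_0}.$$

Inserting this estimate into (25), we have

$$\begin{aligned}\mathcal{N}\big(\varepsilon,\mathcal{H}_{L,\alpha,R,L,\alpha_L}^{tree}\big)\le{}&(2c_1)^{2\sum_{k=0}^{L-1}(L-k)\prod_{\ell=0}^{L-k}N_{L-\ell}}(A_{R,\alpha,L})^{2\sum_{k=0}^{L}(L-k+1)\prod_{\ell=0}^{L-k}N_{L-\ell}}\times\prod_{k=0}^{L}\Big(\prod_{\ell=k}^{L}N_\ell\Big)^{2\prod_{\ell=k}^{L}N_\ell}\left(\frac{8c_1}{\varepsilon^2}\right)^{\sum_{k=0}^{L}\prod_{\ell=0}^{L-k}N_{L-\ell}}\\&\times\left(\frac{8c_1(2c_1)^LA_{R,\alpha,L}^{L+1}N_LN_{L-1}N_{L-2}\cdots N_1N_0}{\varepsilon}\right)^{2N_0N_1\cdots N_L}.\end{aligned}$$

Recalling (3) and $N_k \ge 2$ for arbitrary $k \in \{0, 1, \dots, L\}$, we have

$$2\sum_{k=0}^{L}(L-k+1)\prod_{\ell=0}^{L-k}N_{L-\ell}+2(L+1)\prod_{\ell=0}^{L}N_\ell\le 2(L+1)A_L,\qquad 2\sum_{k=0}^{L}\prod_{\ell=0}^{L-k}N_{L-\ell}+2\prod_{\ell=0}^{L}N_\ell\le 2A_L$$

and

$$\left[\prod_{k=0}^{L}\Big(\prod_{\ell=k}^{L}N_\ell\Big)^{2\prod_{\ell=k}^{L}N_\ell}\right]\left[\Big(\prod_{k=0}^{L}N_k\Big)^{2\prod_{k=0}^{L}N_k}\right]\le\left[\Big(\prod_{k=0}^{L}N_k\Big)^{(2L+4)\prod_{k=0}^{L}N_k}\right]\le A_L^{(L+1)A_L}.$$

Thus, $A_L\le A_{R,\alpha,L}$ yields

$$\mathcal{N}\big(\varepsilon,\mathcal{H}_{L,\alpha,R}^{tree}\big)=\mathcal{N}\big(\varepsilon,\mathcal{H}_{L,\alpha,R,L,\alpha_L}^{tree}\big)\le(2c_1A_{R,\alpha,L})^{(2L+2)A_L}\left(\frac{2\sqrt{2c_1}}{\varepsilon}\right)^{2A_L}=\left(\frac{2\sqrt{2c_1(2c_1A_{R,\alpha,L})^{2L+2}}}{\varepsilon}\right)^{2A_L}=\left(\frac{2^{L+5/2}c_1^{L+3/2}A_{R,\alpha,L}^{L+1}}{\varepsilon}\right)^{2A_L}.$$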

This completes the proof of Theorem 1.

The proof of Theorem 2 depends on the following two concentration inequalities, which can be found in Cucker and Zhou [23], Wu and Zhou [38], and Zhou and Jetter [39], respectively.

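Lemma 2 (B-Inequality). Let $\xi$ be a random variable in a probability space $\mathcal{Z}$ with mean $E(\xi)$ and variance $\sigma^2(\xi) = \sigma^2$. If $|\xi(z) - E(\xi)| \le M_\xi$ for almost all $z\in\mathcal{Z}$, then for any $\varepsilon > 0$,

$$\mathrm{Prob}\left\{\frac1m\sum_{i=1}^{m}\xi(z_i)-E(\xi)>\varepsilon\right\}\le\exp\left\{-\frac{m\varepsilon^2}{2\big(\sigma^2+\frac13M_\xi\varepsilon\big)}\right\}.$$

Lemma 3 (C-Inequality). Let $\mathcal{G}$ be a set of continuous functions on $\mathcal{Z}$ such that, for some $B>0$, $\tilde{c}>0$, $|f^*-E(f^*)| \le B$ almost surely and $E((f^*)^2)\le\tilde{c}E(f^*)$ for all $f^*\in\mathcal{G}$. Then for every $\varepsilon > 0$,

$$\mathrm{Prob}\left\{\sup_{f^*\in\mathcal{G}}\frac{E(f^*)-\frac1m\sum_{i=1}^{m}f^*(z_i)}{\sqrt{E(f^*)+\varepsilon}}>\sqrt{\varepsilon}\right\}\le\mathcal{N}\big(\varepsilon,\mathcal{G},L^\infty(\mathcal{X})\big)\exp\left\{-\frac{m\varepsilon}{2\tilde{c}+\frac{2B}{3}}\right\}.$$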

We now turn to the proof of Theorem 2.

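Proof of Theorem 2: For $h\in\mathcal{H}$, from (9) we have $\mathcal{E}_D(f_{D,\mathcal{H}})\le\mathcal{E}_D(h)$, which together with $\mathcal{E}_D(\pi_Mf_{D,\mathcal{H}})\le\mathcal{E}_D(f_{D,\mathcal{H}})$, implies

$$\mathcal{E}(\pi_Mf_{D,\mathcal{H}})-\mathcal{E}(f_\rho)\le\mathcal{E}(h)-\mathcal{E}(f_\rho)+\mathcal{E}_D(h)-\mathcal{E}(h)+\mathcal{E}(\pi_Mf_{D,\mathcal{H}})-\mathcal{E}_D(\pi_Mf_{D,\mathcal{H}}).$$

In the following we set, for convenience,

$$\mathcal{D}(\mathcal{H}):=\mathcal{E}(h)-\mathcal{E}(f_\rho)=\|h-f_\rho\|_\rho^2,\qquad \mathcal{S}_1(m,\mathcal{H}):=\{\mathcal{E}_D(h)-\mathcal{E}_D(f_\rho)\}-\{\mathcal{E}(h)-\mathcal{E}(f_\rho)\}$$

and

$$\mathcal{S}_2(m,\mathcal{H}):=\{\mathcal{E}(\pi_Mf_{D,\mathcal{H}})-\mathcal{E}(f_\rho)\}-\{\mathcal{E}_D(\pi_Mf_{D,\mathcal{H}})-\mathcal{E}_D(f_\rho)\}.$$

Then we have

$$\mathcal{E}(\pi_Mf_{D,\mathcal{H}})-\mathcal{E}(f_\rho)\le\mathcal{D}(\mathcal{H})+\mathcal{S}_1(m,\mathcal{H})+\mathcal{S}_2(m,\mathcal{H}). \tag{26}$$

To apply the B-Inequality in Lemma 2, let the random variable $\xi$ on $\mathcal{Z}$ be defined by

$$\xi(z)=(y-h(x))^2-(y-f_\rho(x))^2.$$

Then since $|y| \le M$ and $|f_\rho(x)| \le M$ almost surely, we have

$$|\xi(z)|\le M_\xi:=\big(3M+\|h\|_{L^\infty(\mathcal{X})}\big)^2,\quad |\xi-E\xi|\le 2M_\xi,\quad\text{and}\quad \sigma^2\le E(\xi^2)\le M_\xi\mathcal{D}(\mathcal{H})$$

almost surely. It then follows from the B-Inequality, with $M_\xi$ replaced by $2M_\xi$, that

$$\mathcal{S}_1(m,\mathcal{H})\le\varepsilon \tag{27}$$

holds with confidence at least

$$1-\exp\left\{-\frac{m\varepsilon^2}{2\big(\sigma^2+\frac13\cdot 2M_\xi\varepsilon\big)}\right\}\ge 1-\exp\left\{-\frac{m\varepsilon^2}{2\big(3M+\|h\|_{L^\infty(\mathcal{X})}\big)^2\big(\mathcal{D}(\mathcal{H})+\frac23\varepsilon\big)}\right\}. \tag{28}$$

On the other hand, for

$$\mathcal{G}:=\big\{f^*=(\pi_Mf(x)-y)^2-(f_\rho(x)-y)^2: f\in\mathcal{H}\big\}$$

and any (fixed) $f^*\in\mathcal{G}$, there exists an $f\in\mathcal{H}$ such that $f^*(z)=(\pi_Mf(x)-y)^2-(f_\rho(x)-y)^2$. Therefore, it follows from (8) that

$$E(f^*)=\mathcal{E}(\pi_Mf)-\mathcal{E}(f_\rho)=\|\pi_Mf-f_\rho\|_\rho^2,\qquad \frac1m\sum_{i=1}^{m}f^*(z_i)=\mathcal{E}_D(\pi_Mf)-\mathcal{E}_D(f_\rho),$$

and

$$f^*(z)=(\pi_Mf(x)-f_\rho(x))\big[(\pi_Mf(x)-y)+(f_\rho(x)-y)\big].$$

Since $|y| \le M$ and $|f_\rho(x)| \le M$ almost surely, we have

$$|f^*(z)|\le(M+M)(M+3M)\le 8M^2,$$

which implies

$$|f^*(z)-E(f^*)|\le B:=16M^2,\quad\text{and}\quad E((f^*)^2)\le 16M^2\|\pi_Mf-f_\rho\|_\rho^2=16M^2E(f^*).$$

Hence, we may apply the C-Inequality to $\mathcal{G}$, with $B=\tilde{c}=16M^2$, to conclude that

$$\sup_{f\in\mathcal{H}}\frac{\mathcal{E}(\pi_Mf)-\mathcal{E}(f_\rho)-\big(\mathcal{E}_D(\pi_Mf)-\mathcal{E}_D(f_\rho)\big)}{\sqrt{\mathcal{E}(\pi_Mf)-\mathcal{E}(f_\rho)+\varepsilon}}\le\sqrt{\varepsilon} \tag{29}$$

holds with confidence at least

$$1-\mathcal{N}\big(\varepsilon,\mathcal{G},L^\infty(\mathcal{X}\times\mathcal{Y})\big)\exp\left\{-\frac{3m\varepsilon}{128M^2}\right\}.$$

For any $f_1,f_2\in\mathcal{H}$, we have

$$\big|(\pi_Mf_1(x)-y)^2-(\pi_Mf_2(x)-y)^2\big|\le 4M|\pi_Mf_1(x)-\pi_Mf_2(x)|\le 4M|f_1(x)-f_2(x)|.$$

Thus, an $\frac{\varepsilon}{4M}$-covering of $\mathcal{H}$ provides an $\varepsilon$-covering of $\mathcal{G}$ for any $\varepsilon > 0$. This implies that

$$\mathcal{N}\big(\varepsilon,\mathcal{G},L^\infty(\mathcal{X}\times\mathcal{Y})\big)\le\mathcal{N}\big(\varepsilon/(4M),\mathcal{H},L^\infty(\mathcal{X})\big).$$

This together with (10) implies

$$\mathcal{N}\big(\varepsilon,\mathcal{G},L^\infty(\mathcal{X}\times\mathcal{Y})\big)\le\exp\left\{n\log\frac{4MU}{\varepsilon}\right\}.$$

Hence, (29), combined with the elementary inequality $\sqrt{\varepsilon}\sqrt{t+\varepsilon}\le\frac12t+\varepsilon$ for $t\ge 0$, implies that

$$\mathcal{S}_2(m,\mathcal{H})\le\frac12\big(\mathcal{E}(\pi_Mf_{D,\mathcal{H}})-\mathcal{E}(f_\rho)\big)+\varepsilon \tag{30}$$

holds with confidence at least

$$1-\exp\left\{n\log\frac{4MU}{\varepsilon}-\frac{3m\varepsilon}{128M^2}\right\}. \tag{31}$$

Inserting (27), (28), (30), and (31) into (26), we conclude that

$$\mathcal{E}(\pi_Mf_{D,\mathcal{H}})-\mathcal{E}(f_\rho)\le 2\mathcal{D}(\mathcal{H})+4\varepsilon$$

holds with confidence at least

$$1-\exp\left\{-\frac{m\varepsilon^2}{2\big(3M+\|h\|_{L^\infty(\mathcal{X})}\big)^2\big(\mathcal{D}(\mathcal{H})+\frac23\varepsilon\big)}\right\}-\exp\left\{n\log\frac{4MU}{\varepsilon}-\frac{3m\varepsilon}{128M^2}\right\}.$$

This completes the proof of Theorem 2 by re-scaling $4\varepsilon$ to $\varepsilon$.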

To complete the discussion in this paper, we now prove Theorem 3 by applying Theorem 1 and Theorem 2, as follows.

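Proof of Theorem 3: Due to (11), there exists some $h\in\mathcal{H}_{L,\alpha,R}^{tree}$ such that

$$\|f_\rho-h\|_\rho^2\le\tilde{c}^2A_L^{-2\beta},\quad \|h\|_{L^\infty(\mathbb{I}^d)}\le M+\tilde{c}.$$

Since (5) holds, Theorem 1 implies

$$\log\mathcal{N}\big(\varepsilon,\mathcal{H}_{L,\alpha,R}^{tree}\big)\le 2A_L(L+3)\log\left(\frac{2c_1A_{R,\alpha,L}}{\varepsilon}\right).$$

Applying Theorem 2 with $n=2A_L(L+3)$, $U=2c_1A_{R,\alpha,L}$ to $\mathcal{H}_{L,\alpha,R}^{tree}$ and setting $LA_L=\big[C_1^*m^{\frac{1}{2\beta+1}}\big]$ with $C_1^*$ given below, we have, for any $\varepsilon$ satisfying

$$\varepsilon\ge 2\tilde{c}^2A_L^{-2\beta}\log A_L\ge 2\|h-f_\rho\|_\rho^2, \tag{32}$$

that

$$\begin{aligned}\mathrm{Prob}\big\{\|\pi_Mf_{D,L}-f_\rho\|_\rho^2>2\varepsilon\big\}&\le\mathrm{Prob}\big\{\|\pi_Mf_{D,L}-f_\rho\|_\rho^2>\varepsilon+2\|h-f_\rho\|_\rho^2\big\}\\&\le\exp\left\{2A_L(L+3)\log\frac{2c_1A_{R,\alpha,L}}{\varepsilon}-\frac{3m\varepsilon}{512M^2}\right\}+\exp\left\{-\frac{3m\varepsilon^2}{16(4M+\tilde{c})^2\big(6\tilde{c}^2A_L^{-2\beta}+\varepsilon\big)}\right\}\\&\le\exp\left\{\tilde{c}_1LA_L\log A_L-\frac{3m\varepsilon}{512M^2}\right\}+\exp\left\{-\frac{3m\varepsilon}{112(4M+\tilde{c})^2}\right\},\end{aligned}$$

where $\tilde{c}_1$ is a constant independent of $A_L$ or $L$. Setting $C_1^*$ to be a constant independent of $L$ or $A_L$ such that $LA_L=\big[C_1^*m^{\frac{1}{2\beta+1}}\big]$, $\varepsilon\ge 2\tilde{c}^2A_L^{-2\beta}\log A_L$ and $\tilde{c}_1LA_L\log A_L\le\frac{3m\varepsilon}{1024M^2}$, we have

$$\mathrm{Prob}\big\{\|\pi_Mf_{D,L}-f_\rho\|_\rho^2>2\varepsilon\big\}\le\exp\left\{-\frac{3m\varepsilon}{1024M^2}\right\}+\exp\left\{-\frac{3m\varepsilon}{112(4M+\tilde{c})^2}\right\}\le 2\exp\left\{-\frac{3m\varepsilon}{112(4M+\tilde{c})^2}\right\}\le 3\exp\left\{-\frac{3m^{\frac{2\beta}{2\beta+1}}\varepsilon}{112(4M+\tilde{c})^2\log A_L}\right\}. \tag{33}$$

Then setting

$$3\exp\left\{-\frac{3m^{\frac{2\beta}{2\beta+1}}\varepsilon}{112(4M+\tilde{c})^2\log A_L}\right\}=\delta,$$

we obtain

$$2\tilde{c}^2A_L^{-2\beta}\log A_L\le\varepsilon\le\frac{112}{3}\Big((C_1^*)^{2\beta}4M+\tilde{c}\Big)^2L^{2\beta}m^{-\frac{2\beta}{2\beta+1}}\log A_L\log\frac{3}{\delta}.$$

Thus, it follows from (33) that with confidence of at least $1-\delta$, we have

$$\|\pi_Mf_{D,L}-f_\rho\|_\rho^2\le C_2^*L^{2\beta}m^{-\frac{2\beta}{2\beta+1}}\log m\log\frac{3}{\delta},$$

where $C_2^*:=\frac{112}{3}\Big((C_1^*)^{2\beta}4M+\tilde{c}\Big)^2$. This completes the proof of Theorem 3.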

6. Conclusion

In this paper, we provided a novel tree structure to equip deep nets and studied its theoretical advantages. Our study showed that deep nets with tree structures succeed in reducing the number of free parameters of deep fully connected nets without sacrificing their excellent approximation ability. Under this circumstance, implementing the well-known empirical risk minimization on deep nets with tree structures yields fast learning rates.

Data Availability

All datasets generated and analyzed for this study are included in the manuscript and the supplementary files.

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Funding

The research of CC was partially supported by Hong Kong Research Council [Grant Nos. 12300917 and 12303218] and Hong Kong Baptist University [Grant No. HKBU-RC-ICRS/16-17/03]. The research of S-BL was supported by the National Natural Science Foundation of China [Grant No. 61876133], and the research of D-XZ was partially supported by the Research Grant Council of Hong Kong [Project No. CityU 11306617].

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Hinton GE, Osindero S, Teh YW. A fast learning algorithm for deep belief nets. Neural Comput. (2006) 18:1527–54. doi: 10.1162/neco.2006.18.7.1527

2. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Neural Information Processing Systems. Lake Tahoe (2012). p. 1097–105.

3. Lee H, Pham P, Largman Y, Ng AY. Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Neural Information Processing Systems. Vancouver, BC (2010). p. 469–77.

4. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, et al. Mastering the game of Go with deep neural networks and tree search. Nature. (2016) 529:484–9. doi: 10.1038/nature16961

5. Goodfellow I, Bengio Y, Courville A. Deep Learning. London, UK: MIT Press (2016).

6. Chui CK, Li X, Mhaskar HN. Neural networks for localized approximation. Math Comput. (1994) 63:607–23. doi: 10.2307/2153285

7. Lin HW, Tegmark M, Rolnick D. Why does deep and cheap learning work so well? J Stat Phys. (2017) 168:1223–47. doi: 10.1007/s10955-017-1836-5

8. Schwab C, Zech J. Deep learning in high dimension: neural network expression rates for generalized polynomial chaos expansions in UQ. Anal Appl. (2018). doi: 10.1142/S0219530518500203

9. Lin SB. Generalization and expressivity for deep nets. IEEE Trans Neural Netw Learn Syst. (2019) 30:1392–406. doi: 10.1109/TNNLS.2018.2868980

10. Chui CK, Lin SB, Zhou DX. Construction of neural networks for realization of localized deep learning. Front Appl Math Stat. (2018) 4:14. doi: 10.3389/fams.2018.00014

11. Shaham U, Cloninger A, Coifman RR. Provable approximation properties for deep neural networks. Appl Comput Harmon Anal. (2018) 44:537–57. doi: 10.1016/j.acha.2016.04.003

12. Mhaskar H, Poggio T. Deep vs. shallow networks: an approximation theory perspective. Anal Appl. (2016) 14:829–48. doi: 10.1142/S0219530516400042

13. Kohler M, Krzyżak A. Nonparametric regression based on hierarchical interaction models. IEEE Trans Inform Theory. (2017) 63:1620–30. doi: 10.1109/TIT.2016.2634401

14. Petersen P, Voigtlaender F. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. (2018) 108:296–330. doi: 10.1016/j.neunet.2018.08.019

15. Maiorov V, Pinkus A. Lower bounds for approximation by MLP neural networks. Neurocomputing. (1999) 25:81–91. doi: 10.1016/S0925-2312(98)00111-8

16. Ismailov VE. On the approximation by neural networks with bounded number of neurons in hidden layers. J Math Anal Appl. (2014) 417:963–9. doi: 10.1016/j.jmaa.2014.03.092

17. Chui CK, Lin SB, Zhou DX. Deep neural networks for rotation-invariance approximation and learning. Anal Appl. arXiv:1904.01814.

18. Pinkus A. Approximation theory of the MLP model in neural networks. Acta Numer. (1999) 8:143–95. doi: 10.1017/S0962492900002919

19. Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. Int J Auto Comput. (2017) 14:503–19. doi: 10.1007/s11633-017-1054-2

20. Montúfar G, Pascanu R, Cho K, Bengio Y. On the number of linear regions of deep neural networks. In: Neural Information Processing Systems. Montréal, QC (2014). p. 2924–32.

21. Bianchini M, Scarselli F. On the complexity of neural network classifiers: a comparison between shallow and deep architectures. IEEE Trans Neural Netw Learn Syst. (2014) 25:1553–65. doi: 10.1109/TNNLS.2013.2293637

22. Delalleau O, Bengio Y. Shallow vs. deep sum-product networks. In: Advances in Neural Information Processing Systems. Granada (2011). p. 666–74.

23. Cucker F, Zhou DX. Learning Theory: An Approximation Theory Viewpoint. Cambridge: Cambridge University Press (2007).

24. Zhou DX. Deep distributed convolutional neural networks: universality. Anal Appl. (2018) 16:895–919. doi: 10.1142/S0219530518500124

25. Zhou DX. Universality of deep convolutional neural networks. Appl Comput Harmonic Anal. arXiv:1805.10769.

26. Mhaskar H. Approximation properties of a multilayered feedforward artificial neural network. Adv Comput Math. (1993) 1:61–80. doi: 10.1007/BF02070821

27. Zhou DX. Capacity of reproducing kernel spaces in learning theory. IEEE Trans Inform Theory. (2003) 49:1743–52. doi: 10.1109/TIT.2003.813564

28. Bruna J, Mallat S. Invariant scattering convolution networks. IEEE Trans Patt Anal Mach Intel. (2013) 35:1872–86. doi: 10.1109/TPAMI.2012.230

29. Guo ZC, Shi L, Lin SB. Realizing data features by deep nets. arXiv:1901.00130.

30. Harvey N, Liaw C, Mehrabian A. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In: Conference on Learning Theory. Amsterdam (2017). p. 1064–8.

31. Donoho DL. Unconditional bases are optimal bases for data compression and for statistical estimation. Appl Comput Harmonic Anal. (1993) 1:100–15. doi: 10.1006/acha.1993.1008

32. Kůrková V, Sanguineti M. Estimates of covering numbers of convex sets with slowly decaying orthogonal subsets. Discrete Appl Math. (2007) 155:1930–42. doi: 10.1016/j.dam.2007.04.007

33. Maiorov V. Pseudo-dimension and entropy of manifolds formed by affine-invariant dictionary. Adv Comput Math. (2006) 25:435–50. doi: 10.1007/s10444-004-7645-9

34. Kohler M, Krzyżak A. Adaptive regression estimation with multilayer feedforward neural networks. J Nonparametric Stat. (2005) 17:891–913. doi: 10.1080/10485250500309608

35. Anthony M, Bartlett PL. Neural Network Learning: Theoretical Foundations. Cambridge: Cambridge University Press (2009).

36. Maiorov V. Approximation by neural networks and learning theory. J Complex. (2006) 22:102–17. doi: 10.1016/j.jco.2005.09.001

37. Lin SB. Limitations of shallow nets approximation. Neural Netw. (2017) 94:96–102. doi: 10.1016/j.neunet.2017.06.016

38. Wu Q, Zhou DX. SVM soft margin classifiers: linear programming versus quadratic programming. Neural Comput. (2005) 17:1160–87. doi: 10.1162/0899766053491896

39. Zhou DX, Jetter K. Approximation with polynomial kernels and SVM classifiers. Adv Comput Math. (2006) 25:323–44. doi: 10.1007/s10444-004-7206-2

Keywords: deep nets, learning theory, deep learning, tree structure, empirical risk minimization

Citation: Chui CK, Lin S-B and Zhou D-X (2019) Deep Net Tree Structure for Balance of Capacity and Approximation Ability. Front. Appl. Math. Stat. 5:46. doi: 10.3389/fams.2019.00046

Received: 20 June 2019; Accepted: 27 August 2019;
Published: 11 September 2019.

Edited by:

Lucia Tabacu, Old Dominion University, United States

Reviewed by:

Jianjun Wang, Southwest University, China
Jinshan Zeng, Jiangxi Normal University, China

Copyright © 2019 Chui, Lin and Zhou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Shao-Bo Lin, sblin1983@gmail.com

These authors have contributed equally to this work