Construction of neural networks for realization of localized deep learning

The subject of deep learning has recently attracted users of machine learning from various disciplines, including: medical diagnosis and bioinformatics, financial market analysis and online advertisement, speech and handwriting recognition, computer vision and natural language processing, time series forecasting, and search engines. However, the theoretical development of deep learning is still in its infancy. The objective of this paper is to introduce a deep neural network (also called a deep-net) approach to localized manifold learning, with each hidden layer endowed with a specific learning task. For the purpose of illustration, we focus only on deep-nets with three hidden layers, with the first layer for dimensionality reduction, the second layer for bias reduction, and the third layer for variance reduction. A feedback component is also designed to eliminate outliers. The main theoretical result of this paper is the order $\mathcal O\left(m^{-2s/(2s+d)}\right)$ of approximation of the regression function with regularity $s$, in terms of the number $m$ of sample points, where the (unknown) manifold dimension $d$ replaces the dimension $D$ of the sampling (Euclidean) space for shallow nets.


Introduction
The continual rapid growth in data acquisition and data updating has recently posed crucial challenges to the machine learning community on developing learning schemes that match or outperform human learning capability. Fortunately, the introduction of deep learning (see, for example, [21]) has made it feasible to get around the bottleneck of classical learning strategies, such as support vector machines and boosting algorithms, based on classical neural networks (see, for example, [6,11,17,31]), by demonstrating remarkable successes in many applications, particularly computer vision [25] and speech recognition [27], and more recently in other areas, including: natural language processing, medical diagnosis and bioinformatics, financial market analysis and online advertisement, time series forecasting, and search engines. Furthermore, the exciting recent advances of deep learning schemes for such applications have motivated the current interest in revisiting the development of classical neural networks (to be called "shallow nets" in later discussions) by allowing multiple hidden layers between the input and output layers. Such neural networks are called "deep" neural nets or, simply, deep nets (DN). Indeed, the advantages of DN's over shallow nets, at least in applications, have led to various popular research directions in the academic communities of Approximation Theory and Learning Theory. Explicit results on the existence of functions that are expressible by DN's but cannot be approximated by shallow nets with a comparable number of parameters are generally regarded as strong evidence of the advantage of DN's in Approximation Theory. The first theoretical understanding of such results dates back to our early work [7], where, by using the Heaviside activation function, it was shown that DN's with two hidden layers already provide localized approximation, while shallow nets fail.
Later explicit results on DN approximation [14,37,39,40,44] further reveal various other advantages of DN's over shallow nets.
From approximation to learning, the tug of war between bias and variance [10] indicates that explicit constructions of DN's are insufficient to demonstrate their success in machine learning: besides the bias, one must also account for the variance induced by the capacity of DN's. In this direction, the capacity of DN's, as measured by the number of linear regions, Betti numbers, neuron transitions, and DN trajectory length, was studied in [38], [3], and [40], showing that DN's allow for many more functionalities than shallow nets. Although these results certainly show the benefits of deep nets, they also pose more difficulties in analyzing deep learning performance, since large capacity usually implies large variance and requires more elaborate learning algorithms. One of the main difficulties is the development of a satisfactory learning rate analysis for DN learning, which has been well studied for shallow nets (see, for example, [34]). In this paper, we present an analysis of the advantages of DN's in the framework of learning theory [10], taking into account the trade-off between bias and variance.
Our starting point is to assume that the samples are located approximately on some unknown manifold in the sample ($D$-dimensional Euclidean) space. For simplicity, consider the set of inputs of samples $x_1, \ldots, x_m \in \mathcal X \subseteq [-1,1]^D$, with a corresponding set of outputs $y_1, \ldots, y_m \in \mathcal Y \subseteq [-M, M]$ for some positive number $M$, where $\mathcal X$ is an unknown data-dependent $d$-dimensional connected $C^\infty$ Riemannian manifold (without boundary). We will call $S_m = \{(x_i, y_i)\}_{i=1}^m$ the sample set, and construct a DN with three hidden layers: the first for dimensionality reduction, the second for bias reduction, and the third for variance reduction. The main tools for our construction are the "local manifold learning" for deep nets in [9], the "localized approximation" for deep nets in [7], and the "local average" in [19]. We will also introduce a feedback procedure to eliminate outliers during the learning process. Our constructions justify the common consensus that deep nets are intuitively capable of capturing data features via their architectural structures [2]. In addition, we will prove that the constructed DN can well approximate the so-called regression function [10] within an accuracy of $\mathcal O\left(m^{-2s/(2s+d)}\right)$ in expectation, where $s$ denotes the order of smoothness (or regularity) of the regression function. Noting that the best existing learning rates for shallow nets are $\mathcal O\left(m^{-2s/(2s+D)} \log^2 m\right)$ [34] and $\mathcal O\left(m^{-s/(8s+4d)} (\log m)^{s/(4s+2d)}\right)$ [46], we observe the power of deep nets over shallow nets, at least theoretically, in the framework of Learning Theory.
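To see the gap between these exponents numerically, the following snippet compares them for hypothetical values of $s$, $d$ and $D$ chosen only for illustration (they are not taken from the paper):

```python
# Compare the learning-rate exponents cited above (illustrative values;
# s = smoothness, d = manifold dimension, D = ambient dimension).
s, d, D = 1.0, 3, 100

deep_net_exp = 2 * s / (2 * s + d)   # this paper: O(m^{-2s/(2s+d)})
shallow_exp = 2 * s / (2 * s + D)    # shallow nets [34]: O(m^{-2s/(2s+D)} log^2 m)
ye_zhou_exp = s / (8 * s + 4 * d)    # shallow nets [46]: O(m^{-s/(8s+4d)} ...)

print(f"deep net exponent:    {deep_net_exp:.4f}")
print(f"shallow net exponent: {shallow_exp:.4f}")
print(f"Ye-Zhou exponent:     {ye_zhou_exp:.4f}")
assert deep_net_exp > shallow_exp and deep_net_exp > ye_zhou_exp
```

With $s = 1$, $d = 3$, $D = 100$, the exponent improves from roughly $0.02$ (ambient-dimension rate) to $0.4$ (manifold-dimension rate), which is the point of the comparison above.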
The organization of this paper is as follows. In the next section, we present a detailed construction of the proposed deep net. The main results of the paper will be stated in Section 3, where tight learning rates of the constructed deep net are also deduced. Discussions of our contributions, along with comparisons with related work, and proofs of the main results will be presented in Sections 4 and 5, respectively.

Construction of Deep Nets
In this section, we present a construction of deep neural networks (called deep nets, for simplicity) with three hidden layers to realize certain deep learning algorithms, by applying the mathematical tools of localized approximation in [7], local manifold learning in [9], and local averaging in [19]. Throughout this paper, we consider only two activation functions: the Heaviside function $\sigma_0$ and the square rectifier $\sigma_2$, where the standard notation $t_+ = \max\{0, t\}$ is used to define $\sigma_n(t) = t_+^n = (t_+)^n$ for any non-negative integer $n$.

Localized approximation and localized manifold learning
Performance comparison between deep nets and shallow nets is a classical topic in Approximation Theory. It is well known from numerous publications (see, for example, [7,14,40,44]) that various functions can be well approximated by deep nets but not by any shallow net with the same order of magnitude in the number of neurons. In particular, it was proved in [7] that deep nets can provide localized approximation, while shallow nets fail. For $r, q \in \mathbb N$, let $\mathbb N_{2q}^r$ denote the index set of the $(2q)^r$ congruent subcubes of $[-1,1]^r$, with centers $\zeta_j \in (-1,1)^r$ for $j \in \mathbb N_{2q}^r$. For $a > 0$ and $\zeta \in \mathbb R^r$, let us denote by $A_{r,a,\zeta} = \zeta + \left[-\frac a2, \frac a2\right]^r$ the cube in $\mathbb R^r$ with center $\zeta$ and side length $a$. Furthermore, we define the Heaviside networks $N_{1,r,q,\zeta_j}$ by (1). In what follows, the standard notation $I_A$ for the indicator function of a set (or an event) $A$ will be used. It follows from the definition (1) that $N_{1,r,q,\zeta_j}$ is the indicator function of the cube $\zeta_j + [-1/(2q), 1/(2q)]^r = A_{r,1/q,\zeta_j}$. Thus, the following proposition, which describes the localized approximation property of $N_{1,r,q,\zeta_j}$, can be easily deduced by applying Theorem 2.3 in [7].
Proposition 1 Let $r, q \in \mathbb N$ be arbitrarily given. Then $N_{1,r,q,\zeta_j} = I_{A_{r,1/q,\zeta_j}}$ for all $j \in \mathbb N_{2q}^r$.
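Since the display defining (1) did not survive typesetting here, the following sketch shows one construction in the spirit of [7]: a two-hidden-layer network of Heaviside units realizing the indicator of a cube. The specific weights and thresholds are illustrative assumptions, not necessarily those of (1).

```python
import numpy as np

def sigma0(t):
    # Heaviside activation, with the convention sigma0(0) = 1
    return np.where(t >= 0, 1.0, 0.0)

def N1(x, zeta, q):
    """Two-hidden-layer Heaviside net equal (with these illustrative
    weights) to the indicator of A_{r,1/q,zeta} = zeta + [-1/(2q), 1/(2q)]^r."""
    x, zeta = np.asarray(x, float), np.asarray(zeta, float)
    r = x.size
    # First hidden layer: 2r units test each coordinate against the two
    # faces of the cube; each bracket equals 2 iff the coordinate is inside.
    faces = sigma0(x - zeta + 1 / (2 * q)) + sigma0(zeta - x + 1 / (2 * q))
    # Second hidden layer: fires iff all 2r face tests pass (sum reaches 2r).
    return sigma0(faces.sum() - 2 * r + 0.5)

zeta = np.array([0.25, -0.5])      # center of a cube of side 1/q = 1/2
print(N1([0.3, -0.45], zeta, q=2))  # a point inside the cube
print(N1([0.9, -0.45], zeta, q=2))  # outside in the first coordinate
```

Inside the cube every face test passes, so the inner sum is $2r$ and the outer Heaviside fires; if any coordinate leaves the cube, the sum is at most $2r - 1$ and the output is $0$, mirroring the localized approximation of Proposition 1.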
On the other hand, it was proposed in [1,12], with practical arguments, that deep nets can tackle data on highly curved manifolds, while any shallow net fails. These arguments were theoretically verified in [9,41], with the implication that adding hidden layers to shallow nets should enable the neural networks to process massive data in a high-dimensional space from samples on lower-dimensional manifolds. More precisely, it follows from [13,41] that, for a lower $d$-dimensional connected and compact $C^\infty$ Riemannian submanifold $\mathcal X \subseteq [-1,1]^D$ (without boundary), isometrically embedded in $\mathbb R^D$ and endowed with the geodesic distance $d_G$, there exists some $\delta > 0$ such that the geodesic and Euclidean distances are equivalent for any $x, x' \in \mathcal X$ with $d_G(x, x') < \delta$, where, for any $r \in \mathbb N$, $\|\cdot\|_r$ denotes, as usual, the Euclidean norm of $\mathbb R^r$. In the following, let $B_G(\xi_0, \tau)$, $B_D(\xi_0, \tau)$, and $B_d(\xi_0, \tau)$ denote the closed geodesic ball, the $D$-dimensional Euclidean ball, and the $d$-dimensional Euclidean ball, respectively, with center $\xi_0$ and radius $\tau > 0$. Then the following proposition is a brief summary of Theorem 2.2, Theorem 2.3 and Remark 2.1 in [9], with the implication that a neural network can be used as a dimensionality-reduction tool.

Learning via deep nets
Our construction of deep nets depends on the localized approximation and dimensionality-reduction techniques presented in Propositions 1 and 2. To describe the learning process, we first select a suitable $q^*$ so that (5) holds for every $j \in \mathbb N_{2q^*}^D$. To this end, we need a constant $C_0 \ge 1$ relating the Euclidean and geodesic distances on $\mathcal X$; the existence of such a constant is proved in the literature (see, for example, [46]). Also, in view of the compactness of $\mathcal X$, since the geodesic balls $\{B_G(\xi, \delta_\xi)\}_{\xi \in \mathcal X}$ cover $\mathcal X$, there exist finitely many points $\xi_1^*, \ldots, \xi_N^* \in \mathcal X$ whose balls $B_G(\xi_i^*, \delta_{\xi_i^*})$ still cover $\mathcal X$. With this choice, we claim that (5) holds. Indeed, if $A_{D,1/q^*,\zeta_{j,q^*}} \cap \mathcal X = \emptyset$, then (5) obviously holds for any choice of $\xi \in \mathcal X$. On the other hand, if $A_{D,1/q^*,\zeta_{j,q^*}} \cap \mathcal X \neq \emptyset$, then the inclusion property, together with (7), yields $A_{D,1/q^*,\zeta_{j,q^*}} \cap \mathcal X \subset B_G(\xi_{i^*}^*, \delta_{\xi_{i^*}^*})$, which verifies our claim (5) with the choice $\xi_j^* = \xi_{i^*}^*$.
Observe that, for every $j \in \mathbb N_{2q^*}^D$, we may choose the point $\xi_j^* \in \mathcal X$ to define the dimensionality-reducing network $N_{2,j}$, and apply Proposition 2, (5), and (3) to obtain the following.
As a result of Propositions 1 and 3, we now present the construction of the deep nets for the proposed learning purpose. Start with selecting $(2n)^d$ uniformly spaced points $t_k = t_{k,n} \in (-1,1)^d$, $k \in \mathbb N_{2n}^d$, to define the networks $N_{3,k,j}$ in (11). Then the desired deep net estimator with three hidden layers may be defined by (12), where we set $N_3(x) = 0$ if the denominator is zero.
Observe that, in the above construction, there is a totality of three hidden layers performing three separate tasks: the first hidden layer reduces the dimension of the input space, while the second and third hidden layers perform localized approximation on $\mathbb R^d$ and data variance reduction by local averaging [19], respectively.
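To make this division of labor concrete, the following simplified sketch mimics the second and third layers (localized approximation followed by local averaging) in the special case $\mathcal X = [-1,1]^d$, where the dimensionality-reducing first layer may be taken as the identity. The function names and the half-open tie-breaking are our own simplifications, not the exact networks of (11) and (12).

```python
import numpy as np

def cell_index(x, n):
    """Index of the half-open partition cell of [-1, 1)^d, side 1/n, that
    contains x. (A stand-in for the localization done by N_{3,k,j};
    boundary tie-breaking differs from the paper's closed cells.)"""
    return tuple(np.minimum(((np.asarray(x, float) + 1) * n).astype(int),
                            2 * n - 1))

def N3(x, xs, ys, n):
    """Partition-based local average: average the y_i whose x_i fall in the
    same cell as x, mimicking the variance-reducing third layer; returns 0
    when the cell is empty, as stipulated for (12)."""
    k = cell_index(x, n)
    hits = [y for xi, y in zip(xs, ys) if cell_index(xi, n) == k]
    return sum(hits) / len(hits) if hits else 0.0

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=(200, 2))
ys = [np.sin(np.pi * xi[0]) + 0.1 * rng.standard_normal() for xi in xs]
print(N3([0.2, 0.3], xs, ys, n=4))
```

Averaging within a cell of side $1/n$ trades bias (cell width) against variance (samples per cell), which is exactly the balance tuned by the choice of $n$ in Section 3.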

Fine-tuning
For each $x \in \mathcal X$, it follows from $\mathcal X = \bigcup_{j \in \mathbb N_{2q^*}^D} A_{D,1/q^*,\zeta_{j,q^*}}$ that there is some $j \in \mathbb N_{2q^*}^D$ such that $x \in A_{D,1/q^*,\zeta_{j,q^*}}$, which implies that $N_{2,j}(x) \in [-1,1]^d$. For each $j \in \mathbb N_{2q^*}^D$, since $A_{D,1/q^*,\zeta_{j,q^*}}$ is a cube in $\mathbb R^D$, the cardinality of the set $\{j : x \in A_{D,1/q^*,\zeta_{j,q^*}}\}$ is at most $2^D$, and the number of the corresponding integers $k$ is bounded by $2^d$. For each $x \in \mathcal X$, we consider the non-empty subset $\Lambda_x$ defined in (13). Also, for each $x \in \mathcal X$, we further define $\Lambda_{x,S}$ and $\Lambda'_{x,S}$ as in (15) and (16). Then it follows from (15) and (16) that $|\Lambda'_{x,S}| \le |\Lambda_{x,S}|$, and it is easy to see that, if $|\Lambda'_{x,S}| < |\Lambda_{x,S}|$ (and this is possible when some $x_i$ lies on the boundary of $H_{k,j}$ for some $(j,k) \in \mathbb N_{2q^*}^D \times \mathbb N_{2n}^d$), then the estimator $N_3$ in (12) might perform badly, and this happens even for training data: the predicted value can be much smaller than the target when $|\Lambda'_{x,S}| < |\Lambda_{x,S}|$, since there are only $|\Lambda'_{x,S}|$ nonzero summands in the numerator. Noting that the Riemannian measure of the boundary of $\bigcup_{(j,k) \in \mathbb N_{2q^*}^D \times \mathbb N_{2n}^d} H_{k,j}$ is zero, we regard the above phenomenon as outliers. Fine-tuning, often referred to as feedback in the deep learning literature [2], can essentially improve the learning performance of deep nets [26]. We observe that fine-tuning can also be applied to avoid outliers for our constructed deep net in (12), by counting the cardinalities of $\Lambda_{x,S}$ and $\Lambda'_{x,S}$. In the training process, besides computing $N_3(x)$ for a query point $x$, we may also record $|\Lambda_{x,S}|$ and $|\Lambda'_{x,S}|$. If the estimator is not large enough, we propose to multiply by a correcting factor depending on these cardinalities. In this way, the deep net estimator with feedback can be mathematically represented by (17), where $\Phi_{k,j} = \Phi_{k,j,D,q^*,n} : \mathcal X \times \mathcal X \to \mathbb R$ is defined as in the construction above and, as before, we set $N_3^F(x) = 0$ if the denominator vanishes.
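Since the definitions (13), (15) and (16) are only referenced above, the following sketch illustrates the counting idea behind the feedback step under an assumed axis-aligned partition: a point on a cell boundary belongs to more than one closed cell, which is exactly the situation in which $|\Lambda'_{x,S}|$ can fall below $|\Lambda_{x,S}|$. The function name and the tolerance are hypothetical.

```python
import numpy as np

def cells_containing(x, n):
    """All closed partition cells (side 1/n on [-1,1]^d) containing x;
    more than one only when x lies on a cell boundary. A hypothetical
    stand-in for the index set Lambda_x of (13)."""
    x = np.asarray(x, float)
    lows = np.floor((x + 1) * n - 1e-12).astype(int)
    highs = np.floor((x + 1) * n + 1e-12).astype(int)
    cells = {()}
    for lo, hi in zip(lows, highs):
        cells = {c + (k,)
                 for c in cells
                 for k in range(max(lo, 0), min(hi, 2 * n - 1) + 1)}
    return cells

print(len(cells_containing([0.0, 0.3], n=5)))  # first coordinate on a face
print(len(cells_containing([0.1, 0.3], n=5)))  # interior point
```

A boundary query point is shared by two (or, at a corner, up to $2^d$) cells; if some of those cells contain no sample, the numerator of (12) undercounts, which is the outlier phenomenon the feedback factor corrects.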

Learning Rate Analysis
We consider a standard regression setting in learning theory [10] and assume that the sample set $S_m = \{(x_i, y_i)\}_{i=1}^m$ of size $m$ is drawn independently according to some Borel probability measure $\rho$ on $Z = \mathcal X \times \mathcal Y$. The regression function is then defined by $f_\rho(x) = \int_{\mathcal Y} y \, d\rho(y|x)$, where $\rho(y|x)$ denotes the conditional distribution at $x$ induced by $\rho$. Let $\rho_X$ be the marginal distribution of $\rho$ on $\mathcal X$ and $(L^2_{\rho_X}, \|\cdot\|_\rho)$ be the Hilbert space of square-integrable functions with respect to $\rho_X$ on $\mathcal X$. Our goal is to estimate the distance between the output function $N_3$ and the regression function $f_\rho$, measured by $\|N_3 - f_\rho\|_\rho$, as well as the distance $\|N_3^F - f_\rho\|_\rho$. We say that a function $f$ on $\mathcal X$ is $(s, c_0)$-Lipschitz (continuous), with positive exponent $s \le 1$ and constant $c_0 > 0$, if the condition (18), namely $|f(x) - f(x')| \le c_0 \left(d_G(x, x')\right)^s$ for all $x, x' \in \mathcal X$, holds, and denote by $Lip^{(s,c_0)} := Lip^{(s,c_0)}(\mathcal X)$ the family of all $(s, c_0)$-Lipschitz functions that satisfy (18). Our error analysis of $N_3$ will be carried out based on the following two assumptions.

Assumption 1 $f_\rho \in Lip^{(s,c_0)}$.
Assumption 2 $\rho_X$ is absolutely continuous with respect to the Riemannian measure induced by the geodesic distance $d_G$ on the manifold $\mathcal X$.
Note that Assumption 2, which concerns the geometrical structure of $\rho_X$, is slightly weaker than the distortion assumption in [43,49] but somewhat similar to the assumption considered in [35]. This assumption also serves to describe the functionality of fine-tuning.
We are now ready to state the main results of this paper. In the first theorem below, we present an upper bound on the learning rate of the constructed deep net $N_3$.
Theorem 1 Let $m$ be the number of samples and set $n = \lceil m^{1/(2s+d)} \rceil$, where $1/(2n)$ is the uniform spacing of the points $t_k = t_{k,n} \in (-1,1)^d$ in the definition of $N_3$ in (11). Then, under Assumptions 1 and 2, $E\left[\|N_3 - f_\rho\|_\rho^2\right] \le C_1 m^{-2s/(2s+d)}$ for some positive constant $C_1$ independent of $m$.
Observe that Theorem 1 provides a fast learning rate for the constructed deep net, depending on the manifold dimension $d$ instead of the sample space dimension $D$. In the second theorem below, we show the necessity of the fine-tuning process presented in (17) when Assumption 2 is removed.
Theorem 2 Let $m$ be the number of samples and set $n = \lceil m^{1/(2s+d)} \rceil$, where $1/(2n)$ is the uniform spacing of the points $t_k = t_{k,n} \in (-1,1)^d$ in the definition of $N_3$ in (11), which is used to define $N_3^F$ in (17). Then, under Assumption 1, $E\left[\|N_3^F - f_\rho\|_\rho^2\right] \le C_2 m^{-2s/(2s+d)}$ for some positive constant $C_2$ independent of $m$.
Observe that while Assumption 2 is needed in Theorem 1, it is not necessary for the validity of Theorem 2, which theoretically shows the significance of fine-tuning in our construction. The proofs of these two theorems will be presented in the final section of this paper.
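For a concrete instance of the tuning in Theorems 1 and 2 (with illustrative values of $s$, $d$ and $m$, not taken from the paper), one may compute the prescribed network width parameter and the resulting error order:

```python
import math

# Illustrative values: smoothness s = 1, manifold dimension d = 3, m = 10^6.
s, d, m = 1.0, 3, 10**6
n = math.ceil(m ** (1 / (2 * s + d)))  # n = ceil(m^{1/(2s+d)}), as in Theorems 1-2
rate = m ** (-2 * s / (2 * s + d))     # resulting error order m^{-2s/(2s+d)}
print(n, rate)
```

With these values, $n = 16$ cells per coordinate suffice, and the expected squared error decays like $m^{-2/5}$, independently of the ambient dimension $D$.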

Related Work and Discussions
The success in practical applications, especially in the fields of computer vision [25] and speech recognition [27], has triggered enormous research activity on deep learning. Several other encouraging results, such as object recognition [12], unsupervised training [15], and artificial intelligence architecture [2], have been obtained to demonstrate the significance of deep learning. We refer the interested reader to the 2016 MIT Press monograph "Deep Learning" [18], by Goodfellow, Bengio and Courville, for further study of this exciting subject, which is only at the infancy of its development.
Indeed, deep learning has already created several challenges for the machine learning community. Among the main challenges are to show the necessity of using deep nets and to justify theoretically the advantages of deep nets over shallow nets. This is essentially a classical topic in Approximation Theory. In particular, dating back to the early 1990's, it was already proved that deep nets can provide localized approximation while shallow nets fail (see, for example, [7]). Furthermore, it was also shown that deep nets provide high approximation orders that are not restricted by the lower error bounds for shallow nets (see [8,33]). More recently, stimulated by the enthusiasm for deep learning, numerous advantages of deep nets were revealed from the point of view of function approximation. In particular, certain functions discussed in [14] can be represented by deep nets but cannot be approximated by shallow nets; it was shown in [37] that deep nets, but not shallow nets, can approximate compositions of functions; it was exhibited in [39] that deep nets can avoid the curse of dimensionality suffered by shallow nets; a probability argument was given in [30] to show that deep nets have better approximation performance than shallow nets with high confidence; and it was demonstrated in [9,41] that deep nets can improve the approximation capability of shallow nets when the data are located on data-dependent manifolds. All of these results give theoretical explanations of the significance of deep nets from the Approximation Theory point of view.
As a departure from the work mentioned above, our present paper is devoted to exploring the better performance of deep nets over shallow nets in the framework of Learning Theory. In particular, we are concerned not only with the approximation accuracy but also with the cost of attaining such accuracy. In this regard, learning rates of certain deep nets have been analyzed in [23], in which Kohler and Krzyżak provided near-optimal learning rates for a fairly complex regularization scheme, with the hypothesis space being the family of deep nets with two hidden layers proposed in [36]. More precisely, they derived a learning rate of order $\mathcal O\left(m^{-2s/(2s+D)}(\log m)^{4s/(2s+D)}\right)$ for functions $f_\rho \in Lip^{(s,c_0)}$. This is close to the optimal learning rate for shallow nets in [34], differing only by a logarithmic factor. Hence, the study in [23] theoretically showed that deep nets at least do not degrade the learning performance of shallow nets. In comparison with [23], our study is focused on answering the question: "What is to be gained by deep learning?" The deep net constructed in our paper possesses a learning rate of order $\mathcal O\left(m^{-2s/(2s+d)}\right)$ when $\mathcal X$ is an unknown $d$-dimensional connected $C^\infty$ Riemannian manifold (without boundary). This rate coincides with the optimal learning rate [19, Chapter 3] for the special case of the cube $\mathcal X = [-1,1]^d$ under a similar condition, and it is smaller than the optimal learning rates for shallow nets [34]. Another line of related work is [46,47], where Ye and Zhou deduced learning rates for regularized least squares over shallow nets in the same setting as our paper. They derived a learning rate of $\mathcal O\left(m^{-s/(8s+4d)}(\log m)^{s/(4s+2d)}\right)$, which is slower than the rate established in our paper. It should be mentioned that, in a more recent work [24], some advantages of deep nets are revealed from the learning theory viewpoint.
However, the results in [24] require a hierarchical interaction structure, which is totally different from what is presented in our present paper.
Due to the high degrees of freedom of deep nets, the number and variety of parameters for deep nets are much greater than those of shallow nets. Thus, it should be of great interest to develop scalable algorithms to reduce the computational burden of deep learning. Distributed learning based on a divide-and-conquer strategy [28,48] could be a fruitful approach for this purpose. It is also of interest to establish results similar to those of Theorems 1 and 2 for deep nets with rectifier neurons, using the rectifier (or ramp) function $\sigma_1(t) = t_+$ as activation, since the rectifier is one of the most widely used activations in the deep learning literature. Our research in these directions is postponed to a later work.

Proofs of the main results
To facilitate our proofs of the theorems stated in Section 3, we first establish the following two lemmas.
Observe from Proposition 1 and the definition (11) of the function $N_{3,k,j}$ that (21) holds. For $j \in \mathbb N_{2q^*}^D$ and $k \in \mathbb N_{2n}^d$, define a random function $T_{k,j} : Z^m \to \mathbb R$ in terms of the random sample by (22), so that (23) holds. 

Lemma 1 Let $\Lambda^* \subseteq \mathbb N_{2q^*}^D \times \mathbb N_{2n}^d$ be a non-empty subset, $(j,k) \in \Lambda^*$, and let $T_{k,j}(S)$ be defined as in (22). Then (24) holds, where we set the left-hand side to zero if $\sum_{(j,k) \in \Lambda^*} T_{k,j}(S) = 0$. 

Proof. Observe that it follows from (23) that $T_{k,j}(S) \in \{0, 1, \ldots, m\}$. Since, by the definition of the fraction, the term with $\ell = 0$ vanishes, the summation may start from $\ell = 1$. On the other hand, from (23), note that $\sum_{(j,k) \in \Lambda^*} T_{k,j}(S) = \ell$ is equivalent to $x_i \in \bigcup_{(j,k) \in \Lambda^*} H_{k,j}$ for exactly $\ell$ indices $i$ from $\{1, \ldots, m\}$. Combining these observations, we obtain the desired inequality (24). This completes the proof of Lemma 1.
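The display (24) is not reproduced above; bounds of this form in the local-averaging literature (see, for example, [19]) state that $E\left[I_{\{T>0\}}/T\right] \le 2/((m+1)p)$ for a binomial random variable $T \sim \mathrm{Bin}(m, p)$. Assuming (24) is of this type, the following Monte Carlo sanity check illustrates it; the parameter values are arbitrary.

```python
import random

random.seed(1)
m, p, trials = 50, 0.3, 20000

# T ~ Binomial(m, p) plays the role of sum_{(j,k)} T_{k,j}(S): the number
# of sample points falling into the region of interest.
acc = 0.0
for _ in range(trials):
    T = sum(random.random() < p for _ in range(m))
    acc += 1.0 / T if T > 0 else 0.0   # convention: the fraction is 0 when T = 0
empirical = acc / trials

bound = 2.0 / ((m + 1) * p)
print(f"E[I(T>0)/T] ~ {empirical:.4f}  <=  {bound:.4f}")
assert empirical <= bound
```

Intuitively, $T$ concentrates around $mp$, so $E[I_{\{T>0\}}/T]$ behaves like $1/(mp)$, and the factor $2$ absorbs the fluctuations; this is precisely what makes the local average in (12) stable once each cell receives enough samples.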
Lemma 2 For any Borel probability measure $\mu$ on $\mathcal X$, the identity (25) holds.
Proof. Since $f_\rho(x)$ is the conditional mean of $y$ given $x \in \mathcal X$, the first equality follows. Thus, along with the inner-product expression, the above equality yields the desired result (25). This completes the proof of Lemma 2.
We are now ready to prove the two main results of the paper.
Proof of Theorem 1. We divide the proof into four steps, namely: error decomposition, sampling error estimation, approximation error estimation, and learning rate deduction.
Step 1: Error decomposition. Let $\mathring H_{k,j}$ be the set of interior points of $H_{k,j}$. For arbitrarily fixed $k', j'$ and $x \in \mathring H_{k',j'}$, (26) follows from (21). If, in addition, each $x_i \in \mathring H_{k,j}$ for some $(j,k) \in \mathbb N_{2q^*}^D \times \mathbb N_{2n}^d$, then (27) follows from (12). In view of Assumption 2, for an arbitrary subset $A \subset \mathbb R^D$, $\lambda_G(A) = 0$ implies $\rho_X(A) = 0$, where $\lambda_G(A)$ denotes the Riemannian measure of $A$ on the manifold $\mathcal X$. In particular, taking $A = H_{k,j} \setminus \mathring H_{k,j}$ in the above analysis, we have $\rho_X(H_{k,j} \setminus \mathring H_{k,j}) = 0$, which implies that (26) almost surely holds. Next, with an appropriate auxiliary function, it follows from Lemma 2, with $\mu = \rho_X$, that the decomposition (28) holds. In what follows, the two terms on the right-hand side of (28) will be called the sampling error and the approximation error, respectively.
Step 2: Sampling error estimation. Due to Assumption 2, the boundaries of the cells $H_{k,j}$ have $\rho_X$-measure zero. Moreover, (26) and (27) together imply that, almost surely for $x \in \mathring H_{k,j}$, the following estimate holds:

where $E[y_i | x_i] = f_\rho(x_i)$ was used in the second equality, and $I_{H_{k,j}}^2(x_i) = I_{H_{k,j}}(x_i)$ together with $|y_i| \le M$ (almost surely) in the inequality. It then follows from Lemma 1 and Assumption 2 that the stated sampling error bound holds.
Proof of Theorem 2. Similar to the proof of Theorem 1, we also divide this proof into four steps.
Step 1: Error decomposition. From (17), we may write $N_3^F$ in terms of a function $h_x : \mathcal X \times \mathcal X \to \mathbb R$, defined for $x, u \in \mathcal X$ by the corresponding quotient, with $h_x(x, u) = 0$ when the denominator vanishes.
Then it follows from Lemma 2, with $\mu = \rho_X$, that (36) holds. In what follows, the terms on the right-hand side of (36) will be called the sampling error and the approximation error, respectively. By (21), for each $x \in \mathcal X$ and $i \in \{1, \ldots, m\}$, we have $\Phi_{k,j}(x, x_i) = I_{H_{k,j}}(x_i) N_{3,k,j}(x) = I_{H_{k,j}}(x_i)$ for $(j,k) \in \Lambda_x$ and $\Phi_{k,j}(x, x_i) = 0$ for $(j,k) \notin \Lambda_x$, where $\Lambda_x$ is defined by (13). This, together with (35), (33) and (34), yields both of the displayed identities. 

Step 2: Sampling error estimation. First consider the sampling error term. For each $x \in H_{k,j}$, since $E[y|x] = f_\rho(x)$, it follows from (37) and $|y| \le M$ that the pointwise bound almost surely holds. Hence, since $\sum_{i=1}^m I_{H_{k,j}}(x_i) = T_{k,j}(S)$, we may apply the Schwarz inequality to $\sum_{(j,k) \in \Lambda_x} I_{H_{k,j}}(x_i)$ to obtain the desired estimate.
Thus, the corresponding bound follows from Lemma 1 and (14).
This, along with (39), implies the sampling error bound. 

Step 3: Approximation error estimation. For each $x \in \mathcal X$, set $A_1(x)$ and $A_2(x)$ as above, and observe that the approximation error splits accordingly. Let us first consider $\int_{\mathcal X} A_1(x) \, d\rho_X$. Since $N_3^F(x) = 0$ when $\sum_{(j,k) \in \Lambda_x} T_{k,j}(S) = 0$, we have, from $|f_\rho(x)| \le M$, the corresponding bound. We next consider $\int_{\mathcal X} A_2(x) \, d\rho_X$. Let $x \in \mathcal X$ be such that $\sum_{(j,k) \in \Lambda_x} T_{k,j}(S) \ge 1$. Then $x_i \in H_x := \bigcup_{(j,k) \in \Lambda_x} H_{k,j}$ for at least one $i \in \{1, 2, \ldots, m\}$. For those $x_i \notin H_x$, we have $\sum_{(j,k) \in \Lambda_x} I_{H_{k,j}}(x_i) = 0$. For $x_i \in H_x$, we have $x_i \in H_{k,j}$ for some $(j,k) \in \Lambda_x$; but $x \in H_{k,j}$ as well, so that (10) applies.