
ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 17 May 2018
Sec. Mathematics of Computation and Data Science
Volume 4 - 2018 | https://doi.org/10.3389/fams.2018.00014

Construction of Neural Networks for Realization of Localized Deep Learning

  • 1Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong
  • 2Department of Statistics, Stanford University, Stanford, CA, United States
  • 3Department of Mathematics, Wenzhou University, Wenzhou, China
  • 4Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong

The subject of deep learning has recently attracted users of machine learning from various disciplines, including: medical diagnosis and bioinformatics, financial market analysis and online advertisement, speech and handwriting recognition, computer vision and natural language processing, time series forecasting, and search engines. However, the theoretical development of deep learning is still in its infancy. The objective of this paper is to introduce a deep neural network (also called deep-net) approach to localized manifold learning, with each hidden layer endowed with a specific learning task. For the purpose of illustration, we focus only on deep-nets with three hidden layers, with the first layer for dimensionality reduction, the second layer for bias reduction, and the third layer for variance reduction. A feedback component is also designed to deal with outliers. The main theoretical result in this paper is the order $O(m^{-2s/(2s+d)})$ of approximation of the regression function with regularity s, in terms of the number m of sample points, where the (unknown) manifold dimension d replaces the dimension D of the sampling (Euclidean) space for shallow nets.

1. Introduction

The continually rapid growth in data acquisition and data updating has recently posed crucial challenges to the machine learning community on developing learning schemes to match or outperform human learning capability. Fortunately, the introduction of deep learning (see for example [1]) has led to the feasibility of getting around the bottleneck of classical learning strategies, such as the support vector machine and boosting algorithms, based on classical neural networks (see for example [2–5]), by demonstrating remarkable successes in many applications, particularly computer vision [6] and speech recognition [7], and more recently in other areas, including: natural language processing, medical diagnosis and bioinformatics, financial market analysis and online advertisement, time series forecasting and search engines. Furthermore, the exciting recent advances of deep learning schemes for such applications have motivated the current interest in re-visiting the development of classical neural networks (to be called “shallow nets” in later discussions), by allowing multiple hidden layers between the input and output layers. Such neural networks are called “deep” neural nets, or simply, deep nets. Indeed, the advantages of deep nets over shallow nets, at least in applications, have led to various popular research directions in the academic communities of Approximation Theory and Learning Theory. Explicit results on the existence of functions that are expressible by deep nets but cannot be approximated by shallow nets with a comparable number of parameters are generally regarded as powerful evidence of the advantage of deep nets in Approximation Theory. The first theoretical understanding of such results dates back to our early work [8], where, by using the Heaviside activation function, it was shown that deep nets with two hidden layers already provide localized approximation, while shallow nets fail. Explicit results on neural network approximation derived in Eldan and Shamir [9], Mhaskar and Poggio [10], Poggio et al. [11], Raghu et al. [12], Shaham et al. [13], and Telgarsky [14] further reveal various advantages of deep nets over shallow nets. For example, the power of depth of neural networks in approximating hierarchical functions was shown in Mhaskar and Poggio [10] and Poggio et al. [11], and the fact that deep nets can improve the approximation capability of shallow nets when the data are located on a manifold was demonstrated in Shaham et al. [13].

From approximation to learning, the tug of war between bias and variance [15] indicates that explicit approximation results for deep nets are insufficient to show their success in machine learning, in that, besides the bias, one must also take into account the capacity of deep nets, which governs the variance. In this direction, the capacity of deep nets, as measured by the Betti number, the number of linear regions, and the number of neuron transitions, was studied in Bianchini and Scarselli [16], Montúfar et al. [17], and Raghu et al. [12], respectively, showing that deep nets allow for many more functionalities than shallow nets. Although these results certainly show the benefits of deep nets, they also pose more difficulties in analyzing deep learning performance, since large capacity usually implies large variance and requires more elaborate learning algorithms. One of the main difficulties is the development of a satisfactory learning rate analysis for deep net learning, a topic that has been well studied for shallow nets (see for example [18]). In this paper, we present an analysis of the advantages of deep nets in the framework of learning theory [15], taking into account the trade-off between bias and variance.

Our starting point is to assume that the samples are located approximately on some unknown manifold in the sample (D-dimensional Euclidean) space. For simplicity, consider the set of sample inputs $x_1,\ldots,x_m \in X \subseteq [-1,1]^D$, with a corresponding set of outputs $y_1,\ldots,y_m \in Y \subseteq [-M,M]$ for some positive number M, where X is an unknown d-dimensional connected $C^\infty$ Riemannian manifold (without boundary). We will call $S_m=\{(x_i,y_i)\}_{i=1}^m$ the sample set, and construct a deep net with three hidden layers, with the first for dimensionality reduction, the second for bias reduction, and the third for variance reduction. The main tools for our construction are the “local manifold learning” for deep nets in Chui and Mhaskar [19], the “localized approximation” for deep nets in Chui et al. [8], and the “local average” in Györfy et al. [20]. We will also introduce a feedback procedure to eliminate outliers during the learning process. Our constructions justify the common consensus that deep nets are intuitively capable of capturing data features via their architectural structures [21]. In addition, we will prove that the constructed deep net can approximate the so-called regression function [15] within an accuracy of $O(m^{-2s/(2s+d)})$ in expectation, where s denotes the order of smoothness (or regularity) of the regression function. Noting that the best existing learning rates for shallow nets are $O(m^{-2s/(2s+D)}\log^2 m)$ in Maiorov [18] and $O(m^{-s/(8s+4d)}(\log m)^{s/(4s+2d)})$ in Ye and Zhou [22], we observe the power of deep nets over shallow nets, at least theoretically, in the framework of Learning Theory.

The organization of this paper is as follows. In the next section, we present a detailed construction of the proposed deep net. The main results of the paper will be stated in section 3, where tight learning rates of the constructed deep net are also deduced. Discussions of our contributions along with comparison with some related work and proofs of the main results will be presented in sections 4 and 5, respectively.

2. Construction of Deep Nets

In this section, we present a construction of deep neural networks with three hidden layers to realize certain deep learning algorithms, by applying the mathematical tools of localized approximation in Chui et al. [8], local manifold learning in Chui and Mhaskar [19], and local average arguments in Györfy et al. [20]. Throughout this paper, we will consider only two activation functions: the Heaviside function σ0 and the square rectifier σ2, where the standard notation t+ = max{0, t} is used to define $\sigma_n(t) = t_+^n = (t_+)^n$ for any non-negative integer n.

2.1. Localized Approximation and Localized Manifold Learning

Performance comparison between deep nets and shallow nets is a classical topic in Approximation Theory. It is well-known from numerous publications (see for example [8, 9, 12, 14]) that various functions can be well approximated by deep nets but not by any shallow net with the same order of magnitude in the numbers of neurons. In particular, it was proved in Chui et al. [8] that deep nets can provide localized approximation, while shallow nets fail.

For r, q ∈ ℕ and an arbitrary $j=(j^{(\ell)})_{\ell=1}^r \in \mathbb{N}_{2q}^r$, where $\mathbb{N}_{2q}^r=\{1,2,\ldots,2q\}^r$, let

$$\zeta_j=\zeta_{j,q}=(\zeta_j^{(\ell)})_{\ell=1}^r \quad\text{with}\quad \zeta_j^{(\ell)}=-1+\frac{2j^{(\ell)}-1}{2q}\in(-1,1).$$

For a > 0 and ζ ∈ ℝ^r, let us denote by $A_{r,a,\zeta}=\zeta+\left[-\frac{a}{2},\frac{a}{2}\right]^r$ the cube in ℝ^r with center ζ and width a. Furthermore, we define $N_{1,r,q,\zeta_j}:\mathbb{R}^r\to\mathbb{R}$ by

$$N_{1,r,q,\zeta_j}(\xi)=\sigma_0\Bigg\{\sum_{\ell=1}^{r}\sigma_0\Big[\frac{1}{2q}+\xi^{(\ell)}-\zeta_j^{(\ell)}\Big]+\sum_{\ell=1}^{r}\sigma_0\Big[\frac{1}{2q}-\xi^{(\ell)}+\zeta_j^{(\ell)}\Big]-2r+\frac{1}{2}\Bigg\}.$$    (1)

In what follows, the standard notation $I_A$ for the indicator function of a set (or an event) A will be used. For x ∈ ℝ, since

$$\sigma_0\Big[\frac{1}{2q}+x\Big]+\sigma_0\Big[\frac{1}{2q}-x\Big]-2=I_{[-1/(2q),\infty)}(x)+I_{(-\infty,1/(2q)]}(x)-2=\begin{cases}0, & \text{if } x\in[-1/(2q),1/(2q)],\\ -1, & \text{otherwise},\end{cases}$$

we observe that

$$\sum_{\ell=1}^{r}\sigma_0\Big[\frac{1}{2q}+\xi^{(\ell)}-\zeta_j^{(\ell)}\Big]+\sum_{\ell=1}^{r}\sigma_0\Big[\frac{1}{2q}-\xi^{(\ell)}+\zeta_j^{(\ell)}\Big]-2r+\frac{1}{2}\ \begin{cases}=\frac{1}{2}, & \text{for } \xi\in\zeta_j+[-1/(2q),1/(2q)]^r,\\ \le-\frac{1}{2}, & \text{otherwise}.\end{cases}$$

This implies that $N_{1,r,q,\zeta_j}$ as introduced in (1) is the indicator function of the cube $\zeta_j+[-1/(2q),1/(2q)]^r=A_{r,1/q,\zeta_j}$. Thus, the following proposition, which describes the localized approximation property of $N_{1,r,q,\zeta_j}$, can be easily deduced by applying Theorem 2.3 in Chui et al. [8].

Proposition 1. Let r, q ∈ ℕ be arbitrarily given. Then $N_{1,r,q,\zeta_j}=I_{A_{r,1/q,\zeta_j}}$ for all $j\in\mathbb{N}_{2q}^r$.
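
As a quick sanity check of Proposition 1 (not part of the paper's construction), the following Python sketch evaluates the two-layer Heaviside network (1) and compares it with the indicator of the cube $A_{r,1/q,\zeta_j}$; all function and variable names here are ours.

```python
import numpy as np

def sigma0(t):
    # Heaviside activation: 1 for t >= 0, 0 otherwise.
    return (np.asarray(t, dtype=float) >= 0).astype(float)

def N1(xi, zeta, q):
    # N_{1,r,q,zeta}(xi) from Eq. (1): an inner layer of 2r Heaviside units
    # followed by a single outer Heaviside unit.
    xi, zeta = np.asarray(xi, dtype=float), np.asarray(zeta, dtype=float)
    inner = sigma0(1.0 / (2 * q) + xi - zeta) + sigma0(1.0 / (2 * q) - xi + zeta)
    return sigma0(inner.sum() - 2 * len(xi) + 0.5)

# Check that N1 coincides with the indicator of zeta_j + [-1/(2q), 1/(2q)]^r.
rng = np.random.default_rng(0)
r, q = 3, 4
j = rng.integers(1, 2 * q + 1, size=r)          # multi-index in {1,...,2q}^r
zeta = -1.0 + (2 * j - 1) / (2 * q)             # cube center zeta_j
for _ in range(1000):
    x = rng.uniform(-1, 1, size=r)
    in_cube = np.all(np.abs(x - zeta) <= 1.0 / (2 * q))
    assert N1(x, zeta, q) == float(in_cube)
```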

On the other hand, it was proposed in Basri and Jacobs [23] and DiCarlo and Cox [24], with practical arguments, that deep nets can tackle data on highly-curved manifolds, while shallow nets fail. These arguments were theoretically verified in Chui and Mhaskar [19] and Shaham et al. [13], with the implication that adding hidden layers to shallow nets should enable the neural networks to process massive data in a high-dimensional space from samples lying on lower dimensional manifolds. More precisely, it follows from do Carmo [25] and Shaham et al. [13] that for a lower d-dimensional connected and compact $C^\infty$ Riemannian submanifold $X\subseteq[-1,1]^D$ (without boundary), isometrically embedded in ℝ^D and endowed with the geodesic distance $d_G$, there exists some δ > 0, such that for any $x,x'\in X$ with $d_G(x,x')<\delta$,

$$\frac{1}{2}d_G(x,x')\le\|x-x'\|_D\le 2d_G(x,x'),$$    (2)

where, for any r ∈ ℕ, $\|\cdot\|_r$ denotes, as usual, the Euclidean norm of ℝ^r. In the following, let $B_G(\xi_0,\tau)$, $B_D(\xi_0,\tau)$, and $B_d(\xi_0,\tau)$ denote the closed geodesic ball, the D-dimensional Euclidean ball, and the d-dimensional Euclidean ball, respectively, with center ξ0 and radius τ > 0. Noting that $t^2=\sigma_2(t)-\sigma_2(-t)$, the following proposition is a brief summary of Theorem 2.2 and Remark 2.1 in Chui and Mhaskar [19], with the implication that neural networks can be used as a dimensionality-reduction tool.

Proposition 2. For each ξX, there exist a positive number δξ and a neural network

$$\Phi_\xi=\big(\Phi_\xi^{(\ell)}\big)_{\ell=1}^d: X\to\mathbb{R}^d$$

with

$$\Phi_\xi^{(\ell)}(x)=\sum_{k=1}^{(D+2)(D+1)}a_{k,\xi,\ell}\,\sigma_2\big(w_{k,\xi,\ell}\cdot x+b_{k,\xi,\ell}\big),\qquad w_{k,\xi,\ell}\in\mathbb{R}^D,\ a_{k,\xi,\ell},b_{k,\xi,\ell}\in\mathbb{R},$$    (3)

that maps $B_G(\xi,\delta_\xi)$ diffeomorphically onto $[-1,1]^d$ and satisfies

$$\alpha_\xi\,d_G(x,x')\le\|\Phi_\xi(x)-\Phi_\xi(x')\|_d\le\beta_\xi\,d_G(x,x'),\qquad\forall\,x,x'\in B_G(\xi,\delta_\xi)$$    (4)

for some αξ, βξ > 0.
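
The sketch below only illustrates the parametric form (3) of the chart network $\Phi_\xi$: each output coordinate is a shallow $\sigma_2$-network with (D+2)(D+1) units. Proposition 2 is an existence statement, so the coefficients used here are random placeholders rather than the ones guaranteed by the construction in Chui and Mhaskar [19]; the helper name make_chart is ours.

```python
import numpy as np

def sigma2(t):
    # Square rectifier: sigma_2(t) = (max(0, t))^2.
    return np.maximum(t, 0.0) ** 2

def make_chart(D, d, a, w, b):
    # Parametric form of Phi_xi in Eq. (3).  The coefficient arrays have shapes
    # a: (d, K), w: (d, K, D), b: (d, K) with K = (D+2)(D+1); they are inputs here,
    # since only their existence is guaranteed by Proposition 2.
    def Phi(x):
        x = np.asarray(x, dtype=float)
        return np.array([np.dot(a[l], sigma2(w[l] @ x + b[l])) for l in range(d)])
    return Phi

# Architecture-only illustration with random placeholder coefficients.
D, d = 10, 2
K = (D + 2) * (D + 1)
rng = np.random.default_rng(1)
Phi = make_chart(D, d, rng.normal(size=(d, K)), rng.normal(size=(d, K, D)),
                 rng.normal(size=(d, K)))
print(Phi(rng.uniform(-1, 1, size=D)).shape)    # -> (2,), i.e. a point in R^d
```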

2.2. Learning via Deep Nets

Our construction of deep nets depends on the localized approximation and dimensionality-reduction technique, as presented in Propositions 1 and 2. To describe the learning process, firstly select a suitable q*, so that for every $j\in\mathbb{N}_{2q^*}^D$, there exists some point $\xi_j^*$ in a finite set that satisfies

$$A_{D,1/q^*,\zeta_{j,q^*}}\cap X\subseteq B_G\big(\xi_j^*,\delta_{\xi_j^*}\big).$$    (5)

To this end, we need a constant C0 ≥ 1, such that

$$d_G(x,x')\le C_0\|x-x'\|_D,\qquad\forall\,x,x'\in X.$$    (6)

The existence of such a constant is proved in the literature (see for example [22]). Also, in view of the compactness of X, since the collection of open balls $\{x\in X: d_G(x,\xi)<\delta_\xi/2\}$, $\xi\in X$, is an open covering of X, there exists a finite set of points $\{\xi_i^*\}_{i=1}^{F_X}\subseteq X$, such that $X\subseteq\bigcup_{i=1}^{F_X}B_G(\xi_i^*,\delta_{\xi_i^*}/2)$. Hence, q* ∈ ℕ may be chosen to satisfy

$$q^*\ge\frac{2C_0\sqrt{D}}{\min_{1\le i\le F_X}\delta_{\xi_i^*}}.$$    (7)

With this choice, we claim that (5) holds. Indeed, if $A_{D,1/q^*,\zeta_{j,q^*}}\cap X=\emptyset$, then (5) obviously holds for any choice of $\xi_j^*\in X$. On the other hand, if $A_{D,1/q^*,\zeta_{j,q^*}}\cap X\ne\emptyset$, then from the inclusion property $X\subseteq\bigcup_{i=1}^{F_X}B_G(\xi_i^*,\delta_{\xi_i^*}/2)$, it follows that there is some $i^*\in\{1,\ldots,F_X\}$, depending on $j\in\mathbb{N}_{2q^*}^D$, such that

$$A_{D,1/q^*,\zeta_{j,q^*}}\cap B_G\big(\xi_{i^*}^*,\delta_{\xi_{i^*}^*}/2\big)\ne\emptyset.$$    (8)

Next, let $\eta^*\in A_{D,1/q^*,\zeta_{j,q^*}}\cap B_G(\xi_{i^*}^*,\delta_{\xi_{i^*}^*}/2)$. By (6), we have, for any $x\in A_{D,1/q^*,\zeta_{j,q^*}}\cap X$,

$$d_G(x,\eta^*)\le C_0\|x-\eta^*\|_D\le C_0\sqrt{D}\,\frac{1}{q^*}.$$

Therefore, it follows from (7) that

$$d_G(x,\xi_{i^*}^*)\le d_G(x,\eta^*)+d_G(\eta^*,\xi_{i^*}^*)\le C_0\sqrt{D}\,\frac{1}{q^*}+\frac{\delta_{\xi_{i^*}^*}}{2}\le\delta_{\xi_{i^*}^*}.$$

This implies that $A_{D,1/q^*,\zeta_{j,q^*}}\cap X\subseteq B_G(\xi_{i^*}^*,\delta_{\xi_{i^*}^*})$ and verifies our claim (5) with the choice $\xi_j^*=\xi_{i^*}^*$.
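
As a small numerical illustration of this choice (with made-up values of the manifold-dependent quantities $C_0$, D, and the radii $\delta_{\xi_i^*}$, which the construction assumes to be known), the snippet below computes the smallest integer $q^*$ satisfying (7) and checks the bound $C_0\sqrt{D}/q^*\le\min_i\delta_{\xi_i^*}/2$ used in the triangle-inequality step above.

```python
import math

def choose_q_star(C0, D, deltas):
    # Smallest integer q* satisfying Eq. (7); C0 comes from Eq. (6) and deltas are
    # the chart radii delta_{xi_i^*} of the finite geodesic cover.
    return math.ceil(2 * C0 * math.sqrt(D) / min(deltas))

C0, D, deltas = 1.5, 10, [0.3, 0.25, 0.4]       # illustrative values only
q_star = choose_q_star(C0, D, deltas)
print(q_star, C0 * math.sqrt(D) / q_star <= min(deltas) / 2)   # -> 38 True
```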

Observe that for every $j\in\mathbb{N}_{2q^*}^D$ we may choose the point $\xi_j^*\in X$ to define $N_{2,j}=\big(N_{2,j}^{(\ell)}\big)_{\ell=1}^d: X\to\mathbb{R}^d$ by setting

$$N_{2,j}^{(\ell)}(x):=\Phi_{\xi_j^*}^{(\ell)}(x)=\sum_{k=1}^{(D+2)(D+1)}a_{k,\xi_j^*,\ell}\,\sigma_2\big(w_{k,\xi_j^*,\ell}\cdot x+b_{k,\xi_j^*,\ell}\big),\qquad\ell=1,\ldots,d,$$    (9)

and apply (5) and (3) to obtain the following.

Proposition 3. For each $j\in\mathbb{N}_{2q^*}^D$, $N_{2,j}$ maps $A_{D,1/q^*,\zeta_{j,q^*}}\cap X$ diffeomorphically into $[-1,1]^d$ and

$$\alpha\,d_G(x,x')\le\|N_{2,j}(x)-N_{2,j}(x')\|_d\le\beta\,d_G(x,x'),\qquad\forall\,x,x'\in A_{D,1/q^*,\zeta_{j,q^*}}\cap X,$$    (10)

where $\alpha:=\min_{1\le i\le F_X}\alpha_{\xi_i^*}$ and $\beta:=\max_{1\le i\le F_X}\beta_{\xi_i^*}$.

As a result of Propositions 1 and 3, we now present the construction of the deep nets for the proposed learning purpose. Start with selecting $(2n)^d$ points $t_k=t_{k,n}\in(-1,1)^d$, $k\in\mathbb{N}_{2n}^d$ and n ∈ ℕ, with $t_k=(t_k^{(1)},\ldots,t_k^{(d)})$, where $t_k^{(\ell)}=-1+\frac{2k^{(\ell)}-1}{2n}\in(-1,1)$. Denote $C_k=A_{d,1/n,t_k}$ and $H_{k,j}=\{x\in X\cap A_{D,1/q^*,\zeta_{j,q^*}}: N_{2,j}(x)\in C_k\}$. In view of Proposition 3, it follows that $H_{k,j}$ is well defined, $X\subseteq\bigcup_{j\in\mathbb{N}_{2q^*}^D}A_{D,1/q^*,\zeta_{j,q^*}}$, and $\bigcup_{k\in\mathbb{N}_{2n}^d}H_{k,j}=X\cap A_{D,1/q^*,\zeta_{j,q^*}}$. We also define $N_{3,k,j}: X\to\mathbb{R}$ by

$$N_{3,k,j}(x)=N_{1,d,n,t_k}\circ N_{2,j}(x)=\sigma_0\Bigg\{\sum_{\ell=1}^{d}\sigma_0\Big[\frac{1}{2n}+N_{2,j}^{(\ell)}(x)-t_k^{(\ell)}\Big]+\sum_{\ell=1}^{d}\sigma_0\Big[\frac{1}{2n}-N_{2,j}^{(\ell)}(x)+t_k^{(\ell)}\Big]-2d+\frac{1}{2}\Bigg\}.$$    (11)

Then the desired deep net estimator with three hidden layers may be defined by

$$N_3(x)=\frac{\sum_{j\in\mathbb{N}_{2q^*}^D}\sum_{k\in\mathbb{N}_{2n}^d}\sum_{i=1}^{m}N_{1,D,q^*,\zeta_j}(x_i)\,N_{3,k,j}(x_i)\,y_i\,N_{3,k,j}(x)}{\sum_{j\in\mathbb{N}_{2q^*}^D}\sum_{k\in\mathbb{N}_{2n}^d}\sum_{i=1}^{m}N_{1,D,q^*,\zeta_j}(x_i)\,N_{3,k,j}(x_i)},$$    (12)

where we set N3(x) = 0 if the denominator is zero.

For a d-dimensional submanifold X and an x in $A_{D,1/q^*,\zeta_{j,q^*}}$, it is clear from (9) that the task of the first hidden layer, $N_{2,j}(x)$, is to map X into [−1, 1]^d. On the other hand, the second hidden layer is intended to search for the location of $N_{2,j}(x)$ in [−1, 1]^d. Indeed, it follows from (11) that large values of the parameter n narrow down to a small region that contains x, thereby reducing the bias. Furthermore, observe that $N_3(x)$ in (12) is a kind of local average, based on $N_{3,k,j}(x)$ and the small region that contains x. This is a standard local averaging strategy for reducing variance in statistics [20]. In summary, there is a totality of three hidden layers in the above construction, performing three separate tasks: the first hidden layer reduces the dimension of the input space, while, by applying local averaging [20], the second and third hidden layers reduce bias and data variance, respectively.

2.3. Fine-Tuning

For each x ∈ X, it follows from $X\subseteq\bigcup_{j\in\mathbb{N}_{2q^*}^D}A_{D,1/q^*,\zeta_{j,q^*}}$ that there is some $j\in\mathbb{N}_{2q^*}^D$ such that $x\in A_{D,1/q^*,\zeta_{j,q^*}}$, which implies that $N_{2,j}(x)\in[-1,1]^d$. Since each $A_{D,1/q^*,\zeta_{j,q^*}}$ is a cube in ℝ^D, the cardinality of the set $\{j: x\in A_{D,1/q^*,\zeta_{j,q^*}}\}$ is at most $2^D$. Also, because $[-1,1]^d=\bigcup_{k\in\mathbb{N}_{2n}^d}A_{d,1/n,t_k}$, for each such j there exists some $k\in\mathbb{N}_{2n}^d$ such that $N_{2,j}(x)\in A_{d,1/n,t_k}$, implying that $N_{3,k,j}(x)=N_{1,d,n,t_k}\circ N_{2,j}(x)=1$; moreover, the number of such indices k is bounded by $2^d$. For each x ∈ X, we consider the non-empty subset

$$\Lambda_x=\big\{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d:\ x\in A_{D,1/q^*,\zeta_{j,q^*}},\ N_{3,k,j}(x)=1\big\}$$    (13)

of $\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d$, with cardinality

$$|\Lambda_x|\le 2^{D+d},\qquad\forall\,x\in X.$$    (14)

Also, for each x ∈ X, we further define $S_{\Lambda_x}=\big(\bigcup_{(j,k)\in\Lambda_x}H_{k,j}\big)\cap\{x_i\}_{i=1}^m$, as well as

$$\Lambda_{x,S}=\big\{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d:\ N_{1,D,q^*,\zeta_j}(x_i)N_{3,k,j}(x_i)=1,\ x_i\in S_{\Lambda_x}\big\},$$    (15)

and

$$\Lambda'_{x,S}=\big\{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d:\ N_{1,D,q^*,\zeta_j}(x_i)N_{3,k,j}(x_i)N_{3,k,j}(x)=1,\ x_i\in S_{\Lambda_x}\big\}.$$    (16)

Then it follows from (15) and (16) that $|\Lambda'_{x,S}|\le|\Lambda_{x,S}|$, and it is easy to see that if each $x_i\in S_{\Lambda_x}$ is an interior point of some $H_{k,j}$, then $|\Lambda'_{x,S}|=|\Lambda_{x,S}|$. In this way, $N_3$ is a local average estimator. However, if $|\Lambda'_{x,S}|\ne|\Lambda_{x,S}|$ (and this is possible when some $x_i$ lies on the boundary of $H_{k,j}$ for some $(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d$), then the estimator $N_3$ in (12) might perform badly, and this happens even for training data. Note that to predict for some $x_j\in S_m$ which is an interior point of $H_{k_0,j_0}$, we have

$$N_3(x_j)=\frac{\sum_{i=1}^{m}N_{1,D,q^*,\zeta_{j_0}}(x_i)\,N_{3,k_0,j_0}(x_i)\,y_i}{|\Lambda_{x_j,S}|},$$

which might be far away from $y_j$ when $|\Lambda'_{x_j,S}|<|\Lambda_{x_j,S}|$. The reason is that there are only $|\Lambda'_{x_j,S}|$ summands in the numerator. Noting that the Riemannian measure of the boundary of $\bigcup_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}H_{k,j}$ is zero, we consider the above phenomenon as outliers.

Fine-tuning, often referred to as feedback in the literature of deep learning [21], can essentially improve the learning performance of deep nets [26]. We observe that fine-tuning can also be applied to handle outliers for our constructed deep net in (12), by counting the cardinalities of $\Lambda_{x,S}$ and $\Lambda'_{x,S}$. In the training process, besides computing $N_3(x)$ for some query point x, we may also record $|\Lambda_{x,S}|$ and $|\Lambda'_{x,S}|$. If the estimator turns out to be too small, we propose to multiply $N_3(x)$ by the factor $|\Lambda_{x,S}|/|\Lambda'_{x,S}|$. In this way, the deep net estimator with feedback can be mathematically represented by

$$N_3^F(x)=\frac{|\Lambda_{x,S}|}{|\Lambda'_{x,S}|}\,N_3(x)=\frac{\sum_{j\in\mathbb{N}_{2q^*}^D}\sum_{k\in\mathbb{N}_{2n}^d}\sum_{i=1}^{m}y_i\,\Phi_{k,j}(x,x_i)}{\sum_{j\in\mathbb{N}_{2q^*}^D}\sum_{k\in\mathbb{N}_{2n}^d}\sum_{i=1}^{m}\Phi_{k,j}(x,x_i)},$$    (17)

where $\Phi_{k,j}=\Phi_{k,j,D,q^*,n}: X\times X\to\mathbb{R}$ is defined by

$$\Phi_{k,j}(x,u)=N_{1,D,q^*,\zeta_j}(u)\,N_{3,k,j}(u)\,N_{3,k,j}(x);$$

and, as before, we set $N_3^F(x)=0$ if the denominator $\sum_{j\in\mathbb{N}_{2q^*}^D}\sum_{k\in\mathbb{N}_{2n}^d}\sum_{i=1}^{m}\Phi_{k,j}(x,x_i)$ vanishes.
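
To make the construction tangible, the following sketch assembles the fine-tuned estimator (17) directly from the indicator identity (21) below. It is a toy implementation under explicit assumptions: the chart maps $N_{2,j}$ are supplied as callables (in the usage example they are identity maps with D = d, so the dimension-reduction layer is trivial), and the indicators are evaluated directly rather than through Heaviside networks. All names are ours, not the paper's.

```python
import numpy as np
from itertools import product

def cube_center(j, q):
    # Center zeta_j of the cube A_{r,1/q,zeta_j} for a multi-index j in {1,...,2q}^r.
    return -1.0 + (2 * np.asarray(j, dtype=float) - 1) / (2 * q)

def in_cube(u, center, width):
    # Indicator of the closed cube center + [-width/2, width/2]^r.
    return float(np.all(np.abs(np.asarray(u, dtype=float) - center) <= width / 2))

def N3F(x, X, y, charts, q_star, n, d):
    # Fine-tuned estimator of Eq. (17), using Phi_{k,j}(x, u) =
    # I_{A_{D,1/q*,zeta_j}}(u) * I_{C_k}(N_{2,j}(u)) * I_{C_k}(N_{2,j}(x)).
    D = len(x)
    num = den = 0.0
    for j in product(range(1, 2 * q_star + 1), repeat=D):
        zeta_j, N2j = cube_center(j, q_star), charts[j]
        for k in product(range(1, 2 * n + 1), repeat=d):
            t_k = cube_center(k, n)
            if in_cube(N2j(x), t_k, 1.0 / n) == 0.0:
                continue                        # N_{3,k,j}(x) = 0: no contribution
            for xi, yi in zip(X, y):
                phi = in_cube(xi, zeta_j, 1.0 / q_star) * in_cube(N2j(xi), t_k, 1.0 / n)
                num += yi * phi
                den += phi
    return num / den if den > 0 else 0.0

# Toy usage with D = d = 2 and identity charts; N3F then reduces to a local average
# of the y_i whose inputs fall in the same small cell as the query point.
rng = np.random.default_rng(3)
q_star, n, D, d = 1, 4, 2, 2
charts = {j: (lambda u: np.asarray(u, dtype=float))
          for j in product(range(1, 2 * q_star + 1), repeat=D)}
X = rng.uniform(-1, 1, size=(500, D))
y = np.sin(np.pi * X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=500)
print(N3F(np.array([0.3, -0.4]), X, y, charts, q_star, n, d))
```

Since (17) states that $N_3^F$ and $N_3$ differ only by the counting factor $|\Lambda_{x,S}|/|\Lambda'_{x,S}|$, the same skeleton can be adapted to the plain estimator (12).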

3. Learning Rate Analysis

We consider a standard least squares regression setting in learning theory [15] and assume that the sample set $S=S_m=\{(x_i,y_i)\}_{i=1}^m$ of size m is drawn independently according to some Borel probability measure ρ on $Z=X\times Y$. The regression function is then defined by

$$f_\rho(x)=\int_Y y\,d\rho(y|x),\qquad x\in X,$$

where ρ(y|x) denotes the conditional distribution at x induced by ρ. Let $\rho_X$ be the marginal distribution of ρ on X and $(L^2_{\rho_X},\|\cdot\|_\rho)$ the Hilbert space of square-integrable functions with respect to $\rho_X$ on X. Our goal is to estimate the distance between the output function $N_3$ and the regression function $f_\rho$, measured by $\|N_3-f_\rho\|_\rho$, as well as the distance between $N_3^F$ and $f_\rho$.

We say that a function f on X is (s, c0)-Lipschitz (continuous) with positive exponent s ≤ 1 and constant c0 > 0, if

$$|f(x)-f(x')|\le c_0\big(d_G(x,x')\big)^s,\qquad\forall\,x,x'\in X;$$    (18)

and denote by $Lip^{(s,c_0)}=Lip^{(s,c_0)}(X)$ the family of all (s, c0)-Lipschitz functions that satisfy (18). Our error analysis of $N_3$ will be carried out based on the following two assumptions.

Assumption 1. There exist an s ∈ (0, 1] and a constant c0 ∈ ℝ+ such that $f_\rho\in Lip^{(s,c_0)}$.

This smoothness assumption is standard in learning theory for regression functions (see for example [15, 18, 20, 2735]).

Assumption 2. ρX is continuous with respect to the geodesic distance dG of the Riemannian manifold.

Note that Assumption 2, which concerns the geometrical structure of ρX, is slightly weaker than the distortion assumption in Shi [36] and Zhou and Jetter [37], but similar to the assumption considered in Meister and Steinwart [38]. The objective of this assumption is to describe the functionality of fine-tuning.

We are now ready to state the main results of this paper. In the first theorem below, we obtain a learning rate for the constructed deep nets N3.

Theorem 1. Let m be the number of samples and set $n=\lceil m^{1/(2s+d)}\rceil$, where 1/(2n) is the uniform spacing of the points $t_k=t_{k,n}\in(-1,1)^d$ in the definition of $N_{3,k,j}$ in (11). Then, under Assumptions 1 and 2,

$$E\big[\|N_3-f_\rho\|_\rho^2\big]\le C_1\,m^{-\frac{2s}{2s+d}}$$    (19)

for some positive constant C1 independent of m.

Observe that Theorem 1 provides a fast learning rate for the constructed deep net which depends on the manifold dimension d instead of the sample space dimension D. In the second theorem below, we show the necessity of the fine-tuning process as presented in (17), when Assumption 2 is removed.
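
For a concrete feel of the statement, the lines below compute the prescribed grid parameter $n=\lceil m^{1/(2s+d)}\rceil$ and compare the exponent in (19) with the ambient-dimension exponent $-2s/(2s+D)$ of the shallow-net benchmark [18]; the values of m, s, d, and D are arbitrary illustrations.

```python
import math

m, s, d, D = 10_000, 1.0, 3, 100                 # illustrative values only
n = math.ceil(m ** (1.0 / (2 * s + d)))          # grid parameter of the third layer
rate_deep = m ** (-2 * s / (2 * s + d))          # O(m^{-2s/(2s+d)}) of Theorem 1
rate_shallow = m ** (-2 * s / (2 * s + D))       # rate depending on the ambient D
print(n, rate_deep, rate_shallow)                # n = 7; the deep-net rate is far smaller
```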

Theorem 2. Let m be the number of samples and set $n=\lceil m^{1/(2s+d)}\rceil$, where 1/(2n) is the uniform spacing of the points $t_k=t_{k,n}\in(-1,1)^d$ in the definition of $N_{3,k,j}$ in (11), which is used to define $N_3^F$ in (17). Then, under Assumption 1,

$$E\big[\|N_3^F-f_\rho\|_\rho^2\big]\le C_2\,m^{-\frac{2s}{2s+d}}$$    (20)

for some positive constant C2 independent of m.

Observe that while Assumption 2 is needed in Theorem 1, it is not necessary for the validity of Theorem 2, which theoretically shows the significance of fine-tuning in our construction. The proofs of these two theorems will be presented in the final section of this paper.

4. Related Work and Discussions

The success in practical applications, especially in the fields of computer vision [6] and speech recognition [7], has triggered enormous research activities on deep learning. Several other encouraging results, such as object recognition [24], unsupervised training [39], and artificial intelligence architecture [21], have been obtained to further demonstrate the significance of deep learning. We refer the interested readers to the 2016 MIT Press monograph “Deep Learning” [40], by Goodfellow, Bengio, and Courville, for further study of this exciting subject, which is still in the infancy of its development.

Indeed, deep learning has already created several challenges to the machine learning community. Among the main challenges are to show the necessity of the usage of deep nets and to justify theoretically the advantages of deep nets over shallow nets. This is essentially a classical topic in Approximation Theory. In particular, dating back to the early 1990's, it was already proved that deep nets can provide localized approximation while shallow nets fail (see for example [8]). Furthermore, it was also shown that deep nets provide high approximation orders, which are certainly not restricted by the lower error bounds for shallow nets (see [41, 42]). More recently, stimulated by the avid enthusiasm for deep learning, numerous advantages of deep nets were also revealed from the point of view of function approximation. In particular, certain functions discussed in Eldan and Shamir [9] can be represented by deep nets but cannot be approximated by shallow nets with a polynomially increasing number of neurons; it was shown in Mhaskar and Poggio [10] that deep nets, but not shallow nets, can efficiently approximate functions that are compositions of bivariate functions; it was exhibited in Poggio et al. [11] that deep nets can avoid the curse of dimensionality suffered by shallow nets; a probability argument was given in Lin [43] to show that deep nets have better approximation performance than shallow nets with high confidence; it was demonstrated in Chui and Mhaskar [19] and Shaham et al. [13] that deep nets can improve the approximation capability of shallow nets when the data are located on data-dependent manifolds; and so on. All of these results give theoretical explanations of the significance of deep nets from the Approximation Theory point of view.

As a departure from the work mentioned above, our present paper is devoted to exploring the better performance of deep nets over shallow nets in the framework of Learning Theory. In particular, we are concerned not only with the approximation accuracy but also with the cost of attaining such accuracy. In this regard, learning rates of certain deep nets have been analyzed in Kohler and Krzyżak [32], where near-optimal learning rates are provided for a fairly complex regularization scheme, with the hypothesis space being the family of deep nets with two hidden layers proposed in Mhaskar [44]. More precisely, they derived a learning rate of order $O(m^{-2s/(2s+D)}(\log m)^{4s/(2s+D)})$ for functions $f_\rho\in Lip^{(s,c_0)}$. This is close to the optimal learning rate for shallow nets in Maiorov [18], differing only by a logarithmic factor. Hence, the study in Kohler and Krzyżak [32] shows theoretically that deep nets at least do not downgrade the learning performance of shallow nets. In comparison with Kohler and Krzyżak [32], our study is focused on answering the question: “What is to be gained by deep learning?” The deep net constructed in our paper possesses a learning rate of order $O(m^{-2s/(2s+d)})$ when X is an unknown d-dimensional connected $C^\infty$ Riemannian manifold (without boundary). This rate coincides with the optimal learning rate [20, Chapter 3] for the special case of the cube $X=[-1,1]^d$ under a similar condition, and it is better than the optimal learning rates for shallow nets [18]. Another line of related work is Ye and Zhou [22, 45], where learning rates are deduced for regularized least squares over shallow nets in the same setting as our paper. They derived a learning rate of $O(m^{-s/(8s+4d)}(\log m)^{s/(4s+2d)})$, which is worse than the rate established in our paper. It should be mentioned that in the more recent work Kohler and Krzyżak [46], some advantages of deep nets are revealed from the learning theory viewpoint. However, the results in Kohler and Krzyżak [46] require a hierarchical interaction structure, which is totally different from what is presented in our present paper.

Due to the high degrees of freedom of deep nets, the number and type of parameters of deep nets far exceed those of shallow nets. Thus, it should be of great interest to develop scalable algorithms to reduce the computational burden of deep learning. Distributed learning based on a divide-and-conquer strategy [47, 48] could be a fruitful approach for this purpose. It is also of interest to establish results similar to Theorems 1 and 2 for deep nets with rectifier neurons, using the rectifier (or ramp) function σ1(t) = t+ as activation, since the rectifier is one of the most widely used activations in the literature on deep learning. Our research in these directions is postponed to a later work.

5. Proofs of the Main Results

To facilitate our proofs of the theorems stated in section 3, we first establish the following two lemmas.

Observe from Proposition 1 and the definition (11) of the function $N_{3,k,j}$ that

$$N_{1,D,q^*,\zeta_j}(x)\,N_{3,k,j}(x)=I_{A_{D,1/q^*,\zeta_j}}(x)\,I_{A_{d,1/n,t_k}}\big(N_{2,j}(x)\big)=I_{H_{k,j}}(x).$$    (21)

For $j\in\mathbb{N}_{2q^*}^D$ and $k\in\mathbb{N}_{2n}^d$, define a random function $T_{k,j}: Z^m\to\mathbb{R}$ in terms of the random sample $S=\{(x_i,y_i)\}_{i=1}^m$ by

$$T_{k,j}(S)=\sum_{i=1}^{m}N_{1,D,q^*,\zeta_j}(x_i)\,N_{3,k,j}(x_i),$$    (22)

so that

$$T_{k,j}(S)=\sum_{i=1}^{m}I_{H_{k,j}}(x_i).$$    (23)

Lemma 1. Let $\Lambda^*\subseteq\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d$ be a non-empty subset and, for $(j,k)\in\Lambda^*$, let $T_{k,j}(S)$ be defined as in (22). Then

$$E_S\left[\frac{I_{\{z\in Z^m:\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(z)>0\}}(S)}{\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)}\right]\le\frac{2}{(m+1)\,\rho_X\big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\big)},$$    (24)

where, if $\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)=0$, we set

$$\frac{I_{\{z\in Z^m:\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(z)>0\}}(S)}{\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)}=0.$$

Proof. Observe from (23) that $T_{k,j}(S)\in\{0,1,\ldots,m\}$ and

$$E_S\left[\frac{I_{\{z\in Z^m:\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(z)>0\}}(S)}{\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)}\right]=\sum_{\ell=0}^{m}E_S\left[\frac{I_{\{z\in Z^m:\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(z)>0\}}(S)}{\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)}\,\Bigg|\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)=\ell\right]\Pr\left[\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)=\ell\right].$$

By the definition of the fraction $\frac{I_{\{z\in Z^m:\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(z)>0\}}(S)}{\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)}$, the term with ℓ = 0 above vanishes, so

$$E_S\left[\frac{I_{\{z\in Z^m:\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(z)>0\}}(S)}{\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)}\right]=\sum_{\ell=1}^{m}E\left[\frac{1}{\ell}\,\Bigg|\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)=\ell\right]\Pr\left[\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)=\ell\right]=\sum_{\ell=1}^{m}\frac{1}{\ell}\Pr\left[\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)=\ell\right].$$

On the other hand, note from (23) that $\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)=\ell$ is equivalent to $x_i\in\bigcup_{(j,k)\in\Lambda^*}H_{k,j}$ for exactly ℓ indices i from {1, ⋯, m}, which in turn implies that

$$\Pr\left[\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)=\ell\right]=\binom{m}{\ell}\left[\rho_X\Big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\Big)\right]^{\ell}\left[1-\rho_X\Big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\Big)\right]^{m-\ell}.$$

Thus, we obtain

$$E_S\left[\frac{I_{\{z\in Z^m:\,\sum_{(j,k)\in\Lambda^*}T_{k,j}(z)>0\}}(S)}{\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)}\right]=\sum_{\ell=1}^{m}\frac{1}{\ell}\binom{m}{\ell}\left[\rho_X\Big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\Big)\right]^{\ell}\left[1-\rho_X\Big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\Big)\right]^{m-\ell}\le\sum_{\ell=1}^{m}\frac{2}{\ell+1}\binom{m}{\ell}\left[\rho_X\Big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\Big)\right]^{\ell}\left[1-\rho_X\Big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\Big)\right]^{m-\ell}=\frac{2}{(m+1)\,\rho_X\big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\big)}\sum_{\ell=1}^{m}\binom{m+1}{\ell+1}\left[\rho_X\Big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\Big)\right]^{\ell+1}\left[1-\rho_X\Big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\Big)\right]^{m-\ell}.$$

Since the last sum does not exceed 1 by the binomial theorem, the desired inequality (24) follows. This completes the proof of Lemma 1. □
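
The following Monte Carlo sketch (not part of the proof) illustrates the estimate just established: writing B for $\sum_{(j,k)\in\Lambda^*}T_{k,j}(S)$, which the proof treats as a Binomial(m, p) variable with $p=\rho_X\big(\bigcup_{(j,k)\in\Lambda^*}H_{k,j}\big)$, the empirical mean of $I_{\{B>0\}}/B$ stays below $2/((m+1)p)$. The chosen m and p are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, trials = 200, 0.03, 200_000
B = rng.binomial(m, p, size=trials).astype(float)
# E[ 1_{B>0} / B ] estimated empirically, with the 1/B term set to 0 when B = 0.
lhs = np.mean(np.divide(1.0, B, out=np.zeros_like(B), where=B > 0))
rhs = 2.0 / ((m + 1) * p)
print(lhs, rhs, lhs <= rhs)                      # the empirical mean respects the bound
```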

Lemma 2. Let $S=\{(x_i,y_i)\}_{i=1}^m$ be a sample set drawn independently according to ρ. If $f_S(x)=\sum_{i=1}^m y_i\,h_{\mathbf{x}}(x,x_i)$ with a measurable function $h_{\mathbf{x}}: X\times X\to\mathbb{R}$ that depends on $\mathbf{x}:=\{x_i\}_{i=1}^m$, then

$$E\big[\|f_S-f_\rho\|_\mu^2\,\big|\,\mathbf{x}\big]=E\left[\Big\|f_S-\sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i)\Big\|_\mu^2\,\Bigg|\,\mathbf{x}\right]+\Big\|\sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i)-f_\rho\Big\|_\mu^2$$    (25)

for any Borel probability measure μ on X.

Proof. Since $f_\rho(x)$ is the conditional mean of y given x ∈ X, we have from $f_S(x)=\sum_{i=1}^m y_i h_{\mathbf{x}}(x,x_i)$ that $E[f_S|\mathbf{x}]=\sum_{i=1}^m f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i)$. Hence,

$$E\left[\Big\langle f_S-\sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i),\ \sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i)-f_\rho\Big\rangle_\mu\,\Bigg|\,\mathbf{x}\right]=\Big\langle E[f_S|\mathbf{x}]-\sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i),\ \sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i)-f_\rho\Big\rangle_\mu=0.$$

Thus, along with the inner-product expression

$$\|f_S-f_\rho\|_\mu^2=\Big\|f_S-\sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i)\Big\|_\mu^2+\Big\|\sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i)-f_\rho\Big\|_\mu^2+2\Big\langle f_S-\sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i),\ \sum_{i=1}^{m}f_\rho(x_i)h_{\mathbf{x}}(\cdot,x_i)-f_\rho\Big\rangle_\mu,$$

the above equality yields the desired result (25). This completes the proof of Lemma 2. □
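
As a quick numerical illustration of Lemma 2 (again, not part of the proof), the sketch below fixes a design $x_1,\ldots,x_m$, uses a simple normalized box kernel for $h_{\mathbf{x}}$ and the empirical measure on a grid for μ, and checks that the conditional mean squared error matches the sum of the variance and squared-bias terms in (25) up to Monte Carlo error; the choices of $f_\rho$, the kernel width, and the noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, grid = 50, np.linspace(-1, 1, 201)
xs = rng.uniform(-1, 1, size=m)                   # fixed design: we condition on x
f_rho = lambda t: np.sin(np.pi * t)
H = (np.abs(grid[:, None] - xs[None, :]) <= 0.2).astype(float)
H /= np.maximum(H.sum(axis=1, keepdims=True), 1.0)   # weights h(t, x_i), rows normalized
g = H @ f_rho(xs)                                 # sum_i f_rho(x_i) h(., x_i)
sq = lambda v: np.mean(v ** 2)                    # squared norm under mu (grid average)

lhs = var_term = 0.0
for _ in range(5000):
    y = f_rho(xs) + 0.3 * rng.normal(size=m)      # only the noise in y is resampled
    fS = H @ y                                    # f_S = sum_i y_i h(., x_i)
    lhs += sq(fS - f_rho(grid))
    var_term += sq(fS - g)
bias_term = sq(g - f_rho(grid))
print(lhs / 5000, var_term / 5000 + bias_term)    # the two sides nearly coincide
```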

We are now ready to prove the two main results of the paper.

Proof of Theorem 1. We divide the proof into four steps, namely: error decomposition, sampling error estimation, approximation error estimation, and learning rate deduction.

Step 1: Error decomposition. Let $\dot{H}_{k,j}$ denote the set of interior points of $H_{k,j}$. For arbitrarily fixed k′, j′ and $x\in\dot{H}_{k',j'}$, it follows from (21) that

$$\sum_{j\in\mathbb{N}_{2q^*}^D}\sum_{k\in\mathbb{N}_{2n}^d}\sum_{i=1}^{m}N_{1,D,q^*,\zeta_j}(x_i)\,N_{3,k,j}(x_i)\,y_i\,N_{3,k,j}(x)=\sum_{i=1}^{m}y_i\,N_{1,D,q^*,\zeta_{j'}}(x_i)\,N_{3,k',j'}(x_i)=\sum_{i=1}^{m}y_i\,I_{H_{k',j'}}(x_i).$$

If, in addition, for each i ∈ {1, …, m}, $x_i\in\dot{H}_{k,j}$ for some $(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d$, then from (12) we have

$$N_3(x)=\frac{\sum_{i=1}^{m}y_i\,I_{H_{k',j'}}(x_i)}{\sum_{i=1}^{m}I_{H_{k',j'}}(x_i)}=\frac{\sum_{i=1}^{m}y_i\,I_{H_{k',j'}}(x_i)}{T_{k',j'}(S)}.$$    (26)

In view of Assumption 2, for an arbitrary subset $A\subseteq\mathbb{R}^D$, $\lambda_G(A)=0$ implies $\rho_X(A)=0$, where $\lambda_G(A)$ denotes the Riemannian measure of A. In particular, for $A=H_{k',j'}\setminus\dot{H}_{k',j'}$ in the above analysis, we have $\rho_X(H_{k',j'}\setminus\dot{H}_{k',j'})=0$, which implies that (26) holds almost surely. Next, set

$$\tilde{N}_3=E[N_3\,|\,\mathbf{x}].$$    (27)

Then it follows from Lemma 2, with μ = ρX, that

$$E\big[\|N_3-f_\rho\|_\rho^2\big]=E\big[\|N_3-\tilde{N}_3\|_\rho^2\big]+E\big[\|\tilde{N}_3-f_\rho\|_\rho^2\big].$$    (28)

In what follows, the two terms on the right-hand side of (28) will be called sampling error and approximation error, respectively.

Step 2: Sampling error estimation. Due to Assumption 2, we have

$$E\big[\|N_3-\tilde{N}_3\|_\rho^2\big]=\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\int_{\dot{H}_{k,j}}E\big[(N_3(x)-\tilde{N}_3(x))^2\big]\,d\rho_X.$$    (29)

On the other hand, (26) and (27) together imply that

$$N_3(x)-\tilde{N}_3(x)=\frac{\sum_{i=1}^{m}\big(y_i-f_\rho(x_i)\big)\,I_{H_{k,j}}(x_i)}{T_{k,j}(S)}$$

almost surely for $x\in\dot{H}_{k,j}$, and that

$$E\big[(N_3(x)-\tilde{N}_3(x))^2\,\big|\,\mathbf{x}\big]=\frac{\sum_{i=1}^{m}\int_Y\big(y-f_\rho(x_i)\big)^2 d\rho(y|x_i)\,I^2_{H_{k,j}}(x_i)}{[T_{k,j}(S)]^2}\le\frac{4M^2\,I_{\{z:\,T_{k,j}(z)>0\}}(S)}{T_{k,j}(S)},$$

where we used $E[y_i|x_i]=f_\rho(x_i)$ in the equality, and $I^2_{H_{k,j}}(x_i)=I_{H_{k,j}}(x_i)$ together with $|y_i|\le M$ almost surely in the inequality. It then follows from Lemma 1 and Assumption 2 that

$$E\big[(N_3(x)-\tilde{N}_3(x))^2\big]\le\frac{8M^2}{(m+1)\,\rho_X(H_{k,j})}.$$

This, together with (29), implies that

$$E\big[\|N_3-\tilde{N}_3\|_\rho^2\big]\le\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\int_{\dot{H}_{k,j}}\frac{8M^2}{(m+1)\,\rho_X(H_{k,j})}\,d\rho_X\le\frac{8(2q^*)^D(2n)^d M^2}{m+1}.$$    (30)

Step 3: Approximation error estimation. According to Assumption 2, we have

$$E\big[\|f_\rho-\tilde{N}_3\|_\rho^2\big]=\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\int_{\dot{H}_{k,j}}E\big[(f_\rho(x)-\tilde{N}_3(x))^2\big]\,d\rho_X.$$    (31)

For $x\in\dot{H}_{k,j}$, it follows from Assumption 1, (26) and (27) that

$$|f_\rho(x)-\tilde{N}_3(x)|\le\frac{\sum_{i=1}^{m}|f_\rho(x)-f_\rho(x_i)|\,I_{H_{k,j}}(x_i)}{T_{k,j}(S)}\le c_0\Big(\max_{x,x'\in H_{k,j}}d_G(x,x')\Big)^s$$

almost surely holds. We then have, from (10) and $N_{2,j}(x),N_{2,j}(x')\in A_{d,1/n,t_k}$, that

$$\max_{x,x'\in H_{k,j}}d_G(x,x')\le\max_{x,x'\in H_{k,j}}\alpha^{-1}\|N_{2,j}(x)-N_{2,j}(x')\|_d.$$

Now, since $\max_{t,t'\in A_{d,1/n,t_k}}\|t-t'\|_d\le\frac{2\sqrt{d}}{n}$, we obtain

$$\max_{x,x'\in H_{k,j}}d_G(x,x')\le 2d^{1/2}\alpha^{-1}n^{-1},$$

so that

$$|f_\rho(x)-\tilde{N}_3(x)|\le c_0 2^s d^{s/2}\alpha^{-s}n^{-s}$$

holds almost surely. Inserting the above estimate into (31), we obtain

$$E\big[\|f_\rho-\tilde{N}_3\|_\rho^2\big]\le\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\rho_X(\dot{H}_{k,j})\,c_0^2 4^s d^s\alpha^{-2s}n^{-2s}\le c_0^2 4^s d^s\alpha^{-2s}n^{-2s}.$$    (32)

Step 4: Learning rate deduction. Inserting (32) and (30) into (28), we obtain

$$E\big[\|N_3-f_\rho\|_\rho^2\big]\le\frac{8(2q^*)^D(2n)^d M^2}{m+1}+c_0^2 4^s d^s\alpha^{-2s}n^{-2s}.$$

Since $n=\lceil m^{1/(2s+d)}\rceil$, we have

$$E\big[\|N_3-f_\rho\|_\rho^2\big]\le C_1\,m^{-\frac{2s}{2s+d}}$$

with

$$C_1:=8(2q^*)^D 2^d M^2+c_0^2 4^s d^s\alpha^{-2s}.$$

As q* depends only on X, C1 is independent of m or n. This completes the proof of Theorem 1. □

Proof of Theorem 2. As in the proof of Theorem 1, we divide this proof into four steps.

Step 1: Error decomposition. From (17), we have

$$N_3^F(x)=\sum_{i=1}^{m}y_i\,h_{\mathbf{x}}(x,x_i),$$    (33)

where $h_{\mathbf{x}}: X\times X\to\mathbb{R}$ is defined, for $x,u\in X$, by

$$h_{\mathbf{x}}(x,u)=\frac{\sum_{j\in\mathbb{N}_{2q^*}^D}\sum_{k\in\mathbb{N}_{2n}^d}\Phi_{k,j}(x,u)}{\sum_{j\in\mathbb{N}_{2q^*}^D}\sum_{k\in\mathbb{N}_{2n}^d}\sum_{i=1}^{m}\Phi_{k,j}(x,x_i)},$$    (34)

and $h_{\mathbf{x}}(x,u)=0$ when the denominator vanishes. Define $\tilde{N}_3^F: X\to\mathbb{R}$ by

$$\tilde{N}_3^F(x)=E\big[N_3^F(x)\,\big|\,\mathbf{x}\big]=\sum_{i=1}^{m}f_\rho(x_i)\,h_{\mathbf{x}}(x,x_i).$$    (35)

Then it follows from Lemma 2 with μ = ρX, that

$$E\big[\|N_3^F-f_\rho\|_\rho^2\big]=E\big[\|N_3^F-\tilde{N}_3^F\|_\rho^2\big]+E\big[\|\tilde{N}_3^F-f_\rho\|_\rho^2\big].$$    (36)

In what follows, the terms on the right-hand side of (36) will be called sampling error and approximation error, respectively. By (21), for each x ∈ X and i ∈ {1, ⋯, m}, we have $\Phi_{k,j}(x,x_i)=I_{H_{k,j}}(x_i)\,N_{3,k,j}(x)=I_{H_{k,j}}(x_i)$ for $(j,k)\in\Lambda_x$ and $\Phi_{k,j}(x,x_i)=0$ for $(j,k)\notin\Lambda_x$, where $\Lambda_x$ is defined by (13). This, together with (35), (33), and (34), yields

$$N_3^F(x)-\tilde{N}_3^F(x)=\frac{\sum_{i=1}^{m}\big(y_i-f_\rho(x_i)\big)\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)}{\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)},\qquad x\in X$$    (37)

and

$$\tilde{N}_3^F(x)-f_\rho(x)=\frac{\sum_{i=1}^{m}\big[f_\rho(x_i)-f_\rho(x)\big]\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)}{\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)},\qquad x\in X,$$    (38)

where $T_{k,j}(S)=\sum_{i=1}^{m}I_{H_{k,j}}(x_i)$.

Step 2: Sampling error estimation. First consider

$$E\big[\|N_3^F-\tilde{N}_3^F\|_\rho^2\big]\le\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\int_{H_{k,j}}E\big[(N_3^F(x)-\tilde{N}_3^F(x))^2\big]\,d\rho_X.$$    (39)

For each $x\in H_{k,j}$, since $E[y|x]=f_\rho(x)$, it follows from (37) and |y| ≤ M that

$$E\big[(N_3^F(x)-\tilde{N}_3^F(x))^2\,\big|\,\mathbf{x}\big]=E\left[\left(\frac{\sum_{i=1}^{m}(y_i-f_\rho(x_i))\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)}{\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)}\right)^2\,\Bigg|\,\mathbf{x}\right]=E\left[\frac{\sum_{i=1}^{m}(y_i-f_\rho(x_i))^2\Big(\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)\Big)^2}{\Big(\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)\Big)^2}\,\Bigg|\,\mathbf{x}\right]\le 4M^2\,\frac{\sum_{i=1}^{m}\Big(\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)\Big)^2}{\Big(\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)\Big)^2}$$

holds almost surely. Since $\sum_{i=1}^{m}I_{H_{k,j}}(x_i)=T_{k,j}(S)$, we apply the Schwarz inequality to $\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)$ to obtain

$$E\big[(N_3^F(x)-\tilde{N}_3^F(x))^2\,\big|\,\mathbf{x}\big]\le 4M^2|\Lambda_x|\,\frac{\sum_{(j,k)\in\Lambda_x}\sum_{i=1}^{m}I^2_{H_{k,j}}(x_i)}{\Big(\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)\Big)^2}=4M^2|\Lambda_x|\,\frac{I_{\{z\in Z^m:\,\sum_{(j,k)\in\Lambda_x}T_{k,j}(z)>0\}}(S)}{\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)}.$$

Thus, from Lemma 1 and (14) we have

$$E\big[(N_3^F(x)-\tilde{N}_3^F(x))^2\big]=E\Big[E\big[(N_3^F(x)-\tilde{N}_3^F(x))^2\,\big|\,\mathbf{x}\big]\Big]\le\frac{8M^2 2^{D+d}}{(m+1)\,\rho_X\big(\bigcup_{(j,k)\in\Lambda_x}H_{k,j}\big)}.$$

This, along with (39), implies that

$$E\big[\|N_3^F-\tilde{N}_3^F\|_\rho^2\big]\le\frac{2^{D+d+3}M^2}{m+1}\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\int_{H_{k,j}}\frac{1}{\rho_X\big(\bigcup_{(j,k)\in\Lambda_x}H_{k,j}\big)}\,d\rho_X\le\frac{2^{D+d+3}M^2}{m+1}\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\int_{H_{k,j}}\frac{1}{\rho_X(H_{k,j})}\,d\rho_X\le\frac{2^{D+d+3}(2q^*)^D M^2(2n)^d}{m+1}.$$    (40)

Step 3: Approximation error estimation. For each x ∈ X, set

$$A_1(x)=E\bigg[\big(\tilde{N}_3^F(x)-f_\rho(x)\big)^2\,\bigg|\,\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)=0\bigg]\,\Pr\bigg[\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)=0\bigg]$$

and

$$A_2(x)=E\bigg[\big(\tilde{N}_3^F(x)-f_\rho(x)\big)^2\,\bigg|\,\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)\ge 1\bigg]\,\Pr\bigg[\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)\ge 1\bigg];$$

and observe that

$$E\big[\|\tilde{N}_3^F-f_\rho\|_\rho^2\big]=\int_X E\big[(\tilde{N}_3^F(x)-f_\rho(x))^2\big]\,d\rho_X=\int_X A_1(x)\,d\rho_X+\int_X A_2(x)\,d\rho_X.$$    (41)

Let us first consider $\int_X A_1(x)\,d\rho_X$. Since $\tilde{N}_3^F(x)=0$ when $\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)=0$, we have, from $|f_\rho(x)|\le M$, that

$$E\bigg[\big(\tilde{N}_3^F(x)-f_\rho(x)\big)^2\,\bigg|\,\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)=0\bigg]\le M^2.$$

On the other hand, since

$$\Pr\bigg[\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)=0\bigg]=\bigg[1-\rho_X\Big(\bigcup_{(j,k)\in\Lambda_x}H_{k,j}\Big)\bigg]^m,$$

it follows from the elementary inequality

$$v(1-v)^m\le v e^{-mv}\le\frac{1}{em},\qquad 0\le v\le 1,$$

that

$$\int_X A_1(x)\,d\rho_X\le\int_X M^2\bigg[1-\rho_X\Big(\bigcup_{(j,k)\in\Lambda_x}H_{k,j}\Big)\bigg]^m d\rho_X\le M^2\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\int_{H_{k,j}}\bigg[1-\rho_X\Big(\bigcup_{(j,k)\in\Lambda_x}H_{k,j}\Big)\bigg]^m d\rho_X\le M^2\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\int_{H_{k,j}}\big[1-\rho_X(H_{k,j})\big]^m d\rho_X=M^2\sum_{(j,k)\in\mathbb{N}_{2q^*}^D\times\mathbb{N}_{2n}^d}\big[1-\rho_X(H_{k,j})\big]^m\rho_X(H_{k,j})\le\frac{(2n)^d(2q^*)^D M^2}{em}.$$    (42)

We next consider $\int_X A_2(x)\,d\rho_X$. Let x ∈ X satisfy $\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)\ge 1$. Then $x_i\in H_x:=\bigcup_{(j,k)\in\Lambda_x}H_{k,j}$ for at least one i ∈ {1, 2, …, m}. For those $x_i\notin H_x$, we have $\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)=0$, so that

$$|\tilde{N}_3^F(x)-f_\rho(x)|\le\frac{\sum_{i:\,x_i\in H_x}|f_\rho(x_i)-f_\rho(x)|\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)}{\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)}.$$

For $x_i\in H_x$, we have $x_i\in H_{k,j}$ for some $(j,k)\in\Lambda_x$. But $x\in H_{k,j}$ as well, so that

$$|\tilde{N}_3^F(x)-f_\rho(x)|\le\max_{u,u'\in H_{k,j}}|f_\rho(u)-f_\rho(u')|\le c_0\max_{u,u'\in H_{k,j}}\big[d_G(u,u')\big]^s,\qquad x\in X.$$

But (10) implies that

$$\max_{u,u'\in H_{k,j}}\big[d_G(u,u')\big]^s\le\max_{u,u'\in H_{k,j}}\alpha^{-s}\|N_{2,j}(u)-N_{2,j}(u')\|_d^s\le\alpha^{-s}\max_{t,t'\in A_{d,1/n,t_k}}\|t-t'\|_d^s\le 2^s d^{s/2}\alpha^{-s}n^{-s}.$$

Hence, for x ∈ X with $\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)\ge 1$, we have

$$|\tilde{N}_3^F(x)-f_\rho(x)|\le c_0 2^s d^{s/2}\alpha^{-s}n^{-s}\,\frac{\sum_{i:\,x_i\in H_x}\sum_{(j,k)\in\Lambda_x}I_{H_{k,j}}(x_i)}{\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)}\le c_0 2^s d^{s/2}\alpha^{-s}n^{-s},$$

and thereby

$$\int_X A_2(x)\,d\rho_X\le\int_X E\bigg[\big(\tilde{N}_3^F(x)-f_\rho(x)\big)^2\,\bigg|\,\sum_{(j,k)\in\Lambda_x}T_{k,j}(S)\ge 1\bigg]\,d\rho_X\le c_0^2 4^s d^s\alpha^{-2s}n^{-2s}.$$    (43)

Therefore, putting (42) and (43) into (41), we have

$$E\big[\|\tilde{N}_3^F-f_\rho\|_\rho^2\big]\le c_0^2 4^s d^s\alpha^{-2s}n^{-2s}+\frac{M^2(2n)^d(2q^*)^D}{em}.$$    (44)

Step 4: Learning rate deduction. By inserting (40) and (44) into (36), we obtain

$$E\big[\|N_3^F-f_\rho\|_\rho^2\big]\le\frac{2^{D+d+3}(2q^*)^D M^2(2n)^d}{m+1}+c_0^2 4^s d^s\alpha^{-2s}n^{-2s}+\frac{M^2(2n)^d(2q^*)^D}{em}.$$

Hence, in view of $n=\lceil m^{1/(2s+d)}\rceil$, we have

$$E\big[\|N_3^F-f_\rho\|_\rho^2\big]\le C_2\,m^{-\frac{2s}{2s+d}}$$

with

$$C_2:=2^{D+d+4}(2q^*)^D 2^d M^2+c_0^2 4^s d^s\alpha^{-2s}.$$

This completes the proof of Theorem 2, since q* depends only on X, so that C2 is independent of m or n. □

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The research of CC is partially supported by U.S. ARO Grant W911NF-15-1-0385, Hong Kong Research Council (Grant No. 12300917), and Hong Kong Baptist University (Grant No. HKBU-RC-ICRS/16-17/03). The research of S-BL is partially supported by the National Natural Science Foundation of China (Grant No. 61502342). The work of D-XZ is supported partially by the Research Grants Council of Hong Kong [Project No. CityU 11303915] and by National Natural Science Foundation of China under Grant 11461161006. Part of the work was done during the third author's visit to Shanghai Jiaotong University (SJTU), for which the support from SJTU and the Ministry of Education is greatly appreciated.

References

1. Hinton GE, Osindero S, Teh YW. A fast learning algorithm for deep belief nets. Neural Comput. (2006) 18:1527–54. doi: 10.1162/neco.2006.18.7.1527

2. Chui CK, Li X. Approximation by ridge functions and neural networks with one hidden layer. J Approx Theory (1992) 70:131–41. doi: 10.1016/0021-9045(92)90081-X

3. Cybenko G. Approximation by superpositions of a sigmoidal function. Math Control Signals Syst. (1989) 2:303–14. doi: 10.1007/BF02551274

4. Funahashi KI. On the approximate realization of continuous mappings by neural networks. Neural Netw. (1989) 2:183–92. doi: 10.1016/0893-6080(89)90003-8

5. Lippmann RP. An introduction to computing with neural nets. IEEE ASSP Mag. (1987) 4:4–22. doi: 10.1109/MASSP.1987.1165576

6. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Neural Information Processing Systems. (2012). p. 1097–1105.

7. Lee H, Pham P, Largman Y, Ng AY. Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Neural Information Processing Systems. Vancouver, BC (2010). p. 469–77.

8. Chui CK, Li X, Mhaskar HN. Neural networks for localized approximation. Math Comput. (1994) 63:607–23. doi: 10.1090/S0025-5718-1994-1240656-2

9. Eldan R, Shamir O. The power of depth for feedforward neural networks. In: Conference on Learning Theory. New York, NY (2016). p. 907–40.

10. Mhaskar H, Poggio T. Deep vs. shallow networks: an approximation theory perspective. Anal Appl. (2016) 14:829–48. doi: 10.1142/S0219530516400042

11. Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. Int J Autom Comput. (2017) 14:503–19. doi: 10.1007/s11633-017-1054-2

12. Raghu M, Poole B, Kleinberg J, Ganguli S, Sohl-Dickstein J. On the expressive power of deep neural networks. In: Proceedings of the 34th International Conference on Machine Learning, PMLR, Vol. 70 (2017). p. 2847–54.

13. Shaham U, Cloninger A, Coifman RR. Provable approximation properties for deep neural networks. Appl Comput Harmon Anal. (2018) 44:537–57. doi: 10.1016/j.acha.2016.04.003

14. Telgarsky M. Benefits of depth in neural networks. In: 29th Annual Conference on Learning Theory, PMLR, Vol. 49 (2016). p. 1517–39.

15. Cucker F, Zhou DX. Learning Theory: An Approximation Theory Viewpoint. Cambridge: Cambridge University Press (2007).

16. Bianchini M, Scarselli F. On the complexity of neural network classifiers: a comparison between shallow and deep architectures. IEEE Trans Neural Netw Learn Syst. (2014) 25:1553–65. doi: 10.1109/TNNLS.2013.2293637

17. Montúfar G, Pascanu R, Cho K, Bengio Y. On the number of linear regions of deep neural networks. In: Neural Information Processing Systems. (2014). p. 2924–32.

18. Maiorov V. Approximation by neural networks and learning theory. J Complex. (2006) 22:102–17. doi: 10.1016/j.jco.2005.09.001

19. Chui CK, Mhaskar HN. Deep nets for local manifold learning. Front Appl Math Stat. (2016). arXiv:1607.07110.

20. Györfy L, Kohler M, Krzyzak A, Walk H. A Distribution-Free Theory of Nonparametric Regression. Berlin: Springer (2002).

21. Bengio Y. Learning deep architectures for AI. Found Trends Mach Learn. (2009) 2:1–127. doi: 10.1561/2200000006

22. Ye GB, Zhou DX. Learning and approximation by Gaussians on Riemannian manifolds. Adv Comput Math. (2008) 29:291–310. doi: 10.1007/s10444-007-9049-0

23. Basri R, Jacobs D. Efficient representation of low-dimensional manifolds using deep networks. (2016). arXiv:1602.04723.

24. DiCarlo J, Cox D. Untangling invariant object recognition. Trends Cogn Sci. (2007) 11:333–41. doi: 10.1016/j.tics.2007.06.010

25. do Carmo M. Riemannian Geometry. Boston, MA: Birkhäuser (1992).

26. Larochelle H, Bengio Y, Louradour J, Lamblin R. Exploring strategies for training deep neural networks. J Mach Learn Res. (2009) 10:1–40.

27. Chang X, Lin SB, Wang Y. Divide and conquer local average regression. Electron J Stat. (2017) 11:1326–50. doi: 10.1214/17-EJS1265

28. Christmann A, Zhou DX. On the robustness of regularized pairwise learning methods based on kernels. J Complex. (2017) 37:1–33. doi: 10.1016/j.jco.2016.07.001

29. Fan J, Hu T, Wu Q, Zhou DX. Consistency analysis of an empirical minimum error entropy algorithm. Appl Comput Harmon Anal. (2016) 41:164–89. doi: 10.1016/j.acha.2014.12.005

30. Guo ZC, Xiang DH, Guo X, Zhou DX. Thresholded spectral algorithms for sparse approximations. Anal Appl. (2017) 15:433–55. doi: 10.1142/S0219530517500026

31. Hu T, Fan J, Wu Q, Zhou DX. Regularization schemes for minimum error entropy principle. Anal Appl. (2015) 13:437–55. doi: 10.1142/S0219530514500110

32. Kohler M, Krzyżak A. Adaptive regression estimation with multilayer feedforward neural networks. J Nonparametr Stat. (2005) 17:891–913. doi: 10.1080/10485250500309608

33. Lin SB, Zhou DX. Distributed kernel-based gradient descent algorithms. Constr Approx. (2018) 47:249–76. doi: 10.1007/s00365-017-9379-1

34. Shi L, Feng YL, Zhou DX. Concentration estimates for learning with l1-regularizer and data dependent hypothesis spaces. Appl Comput Harmon Anal. (2011) 31:286–302. doi: 10.1016/j.acha.2011.01.001

35. Wu Q, Zhou DX. Learning with sample dependent hypothesis space. Comput Math Appl. (2008) 56:2896–907. doi: 10.1016/j.camwa.2008.09.014

36. Shi L. Learning theory estimates for coefficient-based regularized regression. Appl Comput Harmon Anal. (2013) 34:252–65. doi: 10.1016/j.acha.2012.05.001

37. Zhou DX, Jetter K. Approximation with polynomial kernels and SVM classifiers. Adv Comput Math. (2006) 25:323–44. doi: 10.1007/s10444-004-7206-2

38. Meister M, Steinwart I. Optimal learning rates for localized SVMs. J Mach Learn Res. (2016) 17:1–44.

39. Erhan D, Bengio Y, Courville A, Manzagol P, Vincent P, Bengio S. Why does unsupervised pre-training help deep learning? J Mach Learn Res. (2010) 11:625–60.

40. Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge, MA: MIT Press (2016).

41. Chui CK, Li X, Mhaskar HN. Limitations of the approximation capabilities of neural networks with one hidden layer. Adv Comput Math. (1996) 5:233–43. doi: 10.1007/BF02124745

42. Maiorov V, Pinkus A. Lower bounds for approximation by MLP neural networks. Neurocomputing (1999) 25:81–91. doi: 10.1016/S0925-2312(98)00111-8

43. Lin SB. Limitations of shallow nets approximation. Neural Netw. (2017) 94:96–102. doi: 10.1016/j.neunet.2017.06.016

44. Mhaskar H. Approximation properties of a multilayered feedforward artificial neural network. Adv Comput Math. (1993) 1:61–80. doi: 10.1007/BF02070821

45. Ye GB, Zhou DX. SVM learning and Lp approximation by Gaussians on Riemannian manifolds. Anal Appl. (2009) 7:309–39. doi: 10.1142/S0219530509001384

46. Kohler M, Krzyżak A. Nonparametric regression based on hierarchical interaction models. IEEE Trans Inform Theory (2017) 63:1620–30. doi: 10.1109/TIT.2016.2634401

47. Lin SB, Guo X, Zhou DX. Distributed learning with least square regularization. J Mach Learn Res. (2017) 18:1–31.

48. Zhang YC, Duchi J, Wainwright M. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. J Mach Learn Res. (2015) 16:3299–340.

Keywords: deep nets, learning theory, deep learning, manifold learning, feedback

Citation: Chui CK, Lin S-B and Zhou D-X (2018) Construction of Neural Networks for Realization of Localized Deep Learning. Front. Appl. Math. Stat. 4:14. doi: 10.3389/fams.2018.00014

Received: 30 January 2018; Accepted: 26 April 2018;
Published: 17 May 2018.

Edited by:

Lixin Shen, Syracuse University, United States

Reviewed by:

Sivananthan Sampath, Indian Institutes of Technology, India
Ashley Prater, United States Air Force Research Laboratory, United States

Copyright © 2018 Chui, Lin and Zhou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Shao-Bo Lin, sblin1983@gmail.com
