Kernel-Based Analysis of Massive Data

Mhaskar, Hrushikesh N.

doi:10.3389/fams.2020.00030

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 20 October 2020

Sec. Mathematics of Computation and Data Science

Volume 6 - 2020 | https://doi.org/10.3389/fams.2020.00030

This article is part of the Research Topic: Fundamental Mathematical Topics in Data Science

Hrushikesh N. Mhaskar^*

Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA, United States

Dealing with massive data is a challenging task for machine learning. An important aspect of machine learning is function approximation. In the context of massive data, some of the commonly used tools for this purpose are sparsity, divide-and-conquer, and distributed learning. In this paper, we develop a very general theory of approximation by networks, which we have called eignets, to achieve local, stratified approximation. The very massive nature of the data allows us to use these eignets to solve inverse problems, such as finding a good approximation to the probability law that governs the data and finding the local smoothness of the target function near different points in the domain. In fact, we develop a wavelet-like representation using our eignets. Our theory is applicable to approximation on a general locally compact metric measure space. Special examples include approximation by periodic basis functions on the torus, zonal function networks on a Euclidean sphere (including smooth ReLU networks), Gaussian networks, and approximation on manifolds. We construct pre-fabricated networks so that no data-based training is required for the approximation.

1. Introduction

Rapid advances in technology have led to the availability of, and the need to analyze, massive data. The problem arises in almost every area of life, from medical science to homeland security to finance. An immediate problem in dealing with a massive data set is that it is not possible to store it in computer memory; we therefore have to deal with the data piecemeal to keep access to external memory to a minimum. The other challenge is to devise efficient numerical algorithms to overcome difficulties, for example, in solving the customary optimization problems in machine learning. On the other hand, the very availability of a massive data set should also lead to opportunities to solve some problems heretofore considered unmanageable. For example, deep learning often requires a large amount of training data, which, in turn, helps us to figure out the granularity in the data. Apart from deep learning, distributed learning is also a popular way of dealing with big data. A good survey with a taxonomy for dealing with massive data was recently conducted by Zhou et al. [1].

As pointed out in Cucker and Smale [2], Cucker and Zhou [3], and Girosi and Poggio [4], the main task in machine learning can be viewed as one of approximation of functions based on noisy values of the target function, sampled at points that are themselves sampled from an unknown distribution. It is therefore natural to seek approximation theory techniques to solve the problem. However, most of the classical approximation theory results are either not constructive or study function approximation only on known domains. In this century, there is a new paradigm to consider function approximation on data-defined manifolds; a good introduction to the subject is in the special issue [5] of Applied and Computational Harmonic Analysis, edited by Chui and Donoho. In this theory, one assumes the manifold hypothesis, i.e., that the data is sampled from a probability distribution μ^* supported on a smooth, compact, and connected Riemannian manifold; for simplicity, one even assumes that μ^* is the Riemannian volume measure for the manifold, normalized to be a probability measure. Following, e.g., [6–10], one first constructs a “graph Laplacian” from the data and finds its eigendecomposition. It is proved in the abovementioned papers that as the size of the data tends to infinity, the graph Laplacian converges to the Laplace-Beltrami operator on the manifold, and the eigenvalues (eigenvectors) converge to the corresponding quantities on the manifold. A great deal of work is devoted to studying the geometry of this unknown manifold (e.g., [11, 12]) based on the so-called heat kernel. The theory of function approximation on such manifolds is also well-developed (e.g., [13–17]).

A bottleneck in this theory is the computation of the eigendecomposition of a matrix, which is necessarily huge in the case of big data. Kernel-based methods have been used also in connection with approximation on manifolds (e.g., [18–22]). The kernels used in this method are constructed typically as a radial basis function (RBF) in the ambient space, and the methods are traditional machine learning methods involving optimization. As mentioned earlier, massive data poses a big challenge for the solution of these optimization problems. The theoretical results in this connection assume a Mercer's expansion in terms of the Laplacian eigenfunctions for the kernel, satisfying certain conditions. In this paper, we develop a general theory including several RBF kernels in use in different contexts (examples are discussed in section 2). Rather than using optimization-based techniques, we will provide a direct construction of the approximation based on what we have called eignets. An eignet is defined directly using the eigendecomposition on the manifold. We thus focus directly on the properties of Mercer expansion in an abstract and unified manner that enables us to construct local approximations suitable for working with massive data without using optimization.

It is also possible that the manifold hypothesis does not hold, and there is a recent work [23] by Fefferman et al. proposing an algorithm to test this hypothesis. On the other hand, our theory for function approximation does not necessarily use the full strength of Riemannian geometry. In this paper, we have therefore decided to work with a general locally compact metric measure space, isolating those properties which are needed for our analysis and substituting some that are not applicable in the current setting.

Our motivation comes from some recent works on distributed learning by Zhou et al. [24–26] as well as our own work on deep learning [27, 28]. For example, in Lin et al. [26], the approximation is done on the Euclidean sphere using a localized kernel introduced in Mhaskar [29], where the massive data is divided into smaller parts, each dense on the sphere, and the resulting polynomial approximations are added to get the final result. In Chui et al. [24], the approximation takes place on a cube, and exploits any known sparsity in the representation of the target function in terms of spline functions. In Mhaskar and Poggio [28] and Mhaskar [27], we have argued that from a function approximation point of view, the observed superiority of deep networks over shallow ones results from the ability of deep networks to exploit any compositional structure in the target function. For example, in image analysis, one may divide the image into smaller patches, which are then combined in a hierarchical manner, resulting in a tree structure [30]. By putting a shallow network at each node to learn those aspects of the target function that depend upon the pixels seen up to that level, one can avoid the curse of dimensionality. In some sense, this is a divide-and-conquer strategy, not so much on the data set itself but on the dimension of the input space.

The highlights of this paper are the following.

• In order to avoid an explicit, data-dependent eigendecomposition, we introduce the notion of an eignet, which generalizes several radial basis function and zonal function networks. We construct pre-fabricated eignets, whose linear combinations can be constructed just by using the noisy values of the target function as the coefficients, to yield the desired approximation.

• Our theory generalizes the results in a number of examples used commonly in machine learning, some of which we will describe in section 2.

• The use of optimization methods, such as empirical risk minimization, has an intrinsic difficulty: the minimizer of this risk may have no connection with the approximation error. There are also other problems, such as local minima, saddle points, speed of convergence, etc., that need to be taken into account, and the massive nature of the data makes this an even more challenging task. Our results do not depend upon any kind of optimization in order to determine the necessary approximation.

• We develop a theory for local approximation using eignets so that only a relatively small amount of data is used in order to approximate the target function in any ball of the space, the data being sub-sampled using a distribution supported on a neighborhood of that ball. The accuracy of approximation adjusts itself automatically depending upon the local smoothness of the target function on the ball.

• In normal machine learning algorithms, it is customary to assume a prior on the target function, called a smoothness class in approximation theory parlance. Our theory demonstrates clearly how massive data can actually help to solve the inverse problem of determining the local smoothness of the target function using a wavelet-like representation based solely on the data.

• Our results allow one to solve the inverse problem of estimating the probability density from which the data is chosen. In contrast to the statistical approaches that we are aware of, there is no limitation on how accurate the approximation can be asymptotically in terms of the number of samples; the accuracy is determined entirely by the smoothness of the density function.

• All our estimates are given in terms of probability of the error being small rather than the expected value of some loss function being small.

This paper is abstract, theoretical, and technical. In section 2, we present a number of examples that are generalized by our set-up. The abstract set-up, together with the necessary definitions and assumptions, is discussed in section 3. The main results are stated in section 4 and proved in section 8. The proofs require a great deal of preparation, which is presented in sections 5–7. The results in these sections are not all new; many of them are new only in some nuance. For example, we have proven in section 7 the quadrature formulas required in the construction of our pre-fabricated networks in a probabilistic setting, and we have replaced the estimate on the gradients used in our previous works by a certain Lipschitz condition, which makes sense without a differentiability structure on the manifold. Our Theorem 7.1 generalizes most of our previous results in this direction with the exception of [31, Theorem 2.3]. We have striven to give as many proofs as possible, partly for the sake of completeness and partly because the results were not stated earlier in exactly the same form as needed here. In Appendix A, we give a short proof of the fact that the Gaussian upper bound for the heat kernel holds for arbitrary smooth, compact, connected manifolds. We could not find a reference for this fact. In Appendix B, we state the main probability theory estimates that are used ubiquitously in the paper.

2. Motivating Examples

In this paper, we aim to develop a unifying theory applicable to a variety of kernels and domains. In this section, we describe some examples which have motivated the abstract theory to be presented in the rest of the paper. In the following examples, q ≥ 1 is a fixed integer.

Example 2.1. Let 𝕋^q = ℝ^q/(2πℤ^q) be the q-dimensional torus. The distance between points x = (x₁, ⋯, x_q) and y = (y₁, ⋯, y_q) is defined by ${max}_{1 \leq k \leq q} | (x_{k} - y_{k}) mod 2 π |$ . The trigonometric monomial system {exp(ik · ○) : k ∈ ℤ^q} is orthonormal with respect to the Lebesgue measure normalized to be a probability measure on 𝕋^q. We recall that the periodization of a function f :ℝ^q → ℝ is defined formally by $f^{○} (x) = \sum_{k \in ℤ^{q}} f (x + 2 k π)$ . When f is integrable then the Fourier transform of f at k ∈ ℤ^q is the same as the k-th Fourier coefficient of f^○. This Fourier coefficient will be denoted by $\hat{f^{○}} (k) = \hat{f} (k)$ . A periodic basis function network has the form $x \mapsto \sum_{k = 1}^{n} a_{k} G (x - x_{k})$ , where G is a periodic function called the activation function. The examples of the activation functions in which we are interested in this paper include:

1. Periodization of the Gaussian.

\begin{array}{l} G (x) = \sum_{k \in ℤ^{q}} exp (- | x - 2 π k |_{2}^{2} / 2), \\ \hat{G} (k) = {(2 π)}^{q / 2} exp (- | k |_{2}^{2} / 2) . \end{array}

2. Periodization of the Hardy multiquadric¹.

\begin{array}{l} G (x) = \sum_{k \in ℤ^{q}} {(α^{2} + | x - 2 π k |_{2}^{2})}^{- 1}, \\ \hat{G} (k) = \frac{π^{(q + 1) / 2}}{Γ (\frac{q + 1}{2}) α} exp (- α | k |_{2}), α > 0 . □ \end{array}
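As a numerical sanity check on the periodized Gaussian, the following minimal sketch treats the case q = 1. It compares the truncated periodization with its Fourier series and evaluates a small periodic basis function network. The normalization G(x) = (2π)^(−1) Σ_k Ĝ(k) exp(ikx) used in `fourier_side` is an assumption about the convention behind the displayed coefficients; the function names, truncation levels, centers, and coefficients are all hypothetical choices made for illustration.

```python
import math

def periodized_gaussian(x, terms=10):
    """Truncated periodization: sum over k of exp(-(x - 2*pi*k)^2 / 2)."""
    return sum(math.exp(-(x - 2 * math.pi * k) ** 2 / 2)
               for k in range(-terms, terms + 1))

def gaussian_coefficient(k):
    """G-hat(k) = (2*pi)^(1/2) * exp(-k^2 / 2), the q = 1 case of the formula above."""
    return math.sqrt(2 * math.pi) * math.exp(-k * k / 2)

def fourier_side(x, terms=40):
    """Assumed normalization: G(x) = (2*pi)^(-1) * sum_k G-hat(k) * exp(i*k*x)."""
    total = gaussian_coefficient(0)
    for k in range(1, terms + 1):
        # Terms k and -k combine into a cosine because G-hat is even.
        total += 2 * gaussian_coefficient(k) * math.cos(k * x)
    return total / (2 * math.pi)

def network(x, centers, coeffs):
    """Periodic basis function network x -> sum_k a_k G(x - x_k)."""
    return sum(a * periodized_gaussian(x - c) for a, c in zip(coeffs, centers))

if __name__ == "__main__":
    for x in (0.0, 0.5, 2.0):
        print(x, periodized_gaussian(x), fourier_side(x))
    print(network(0.3, centers=[0.0, 1.0], coeffs=[1.0, -2.0]))
```

Both sides converge very quickly, so modest truncation levels suffice for this check; for higher q the same comparison would be done coordinate-wise, since the Gaussian factorizes.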

Example 2.2. If $x = (x_{1}, \dots, x_{q}) \in {[- 1, 1]}^{q}$ , there exists a unique $θ = (θ_{1}, \dots, θ_{q}) \in {[0, π]}^{q}$ such that x = cos(θ). Therefore, [−1, 1]^q can be thought of as a quotient space of 𝕋^q where all points of the form ε ⊙ θ = (ε₁θ₁, ⋯, ε_qθ_q), $ε = (ε_{1}, \dots, ε_{q}) \in \{- 1, 1\}^{q}$ , are identified. Any function on [−1, 1]^q can then be lifted to 𝕋^q, and this lifting preserves all the smoothness properties of the function. Our set-up below includes [−1, 1]^q, where the distance and the measure are defined via the mapping to the torus, and suitably weighted Jacobi polynomials are considered to be the orthonormalized family of functions. In particular, if G is a periodic activation function, x = cos(θ), y = cos(ϕ), then the function $G^{□} (x, y) = \sum_{ε \in \{- 1, 1\}^{q}} G (ε ⊙ (θ - ϕ))$ is an activation function on [−1, 1]^q with an expansion $\sum_{k \in ℤ_{+}^{q}} b_{k} T_{k} (x) T_{k} (y)$ , where the T_k's are tensor product, orthonormalized, Chebyshev polynomials. Furthermore, the b_k's have the same asymptotic behavior as the $\hat{G}$ (k)'s. □

Example 2.3. Let $𝕊^{q} = \{x \in ℝ^{q + 1} : | x |_{2} = 1\}$ be the unit sphere in ℝ^{q+1}. The dimension of 𝕊^q as a manifold is q. We equip 𝕊^q with the geodesic distance ρ and the volume measure μ^*, normalized to be a probability measure. We refer the reader to Müller [33] for details, describing here only the essentials to get a “what-it-is-all-about” introduction. The set of (equivalence classes of) restrictions of polynomials in q + 1 variables with total degree < n to 𝕊^q are called spherical polynomials of degree < n. The set of restrictions of homogeneous harmonic polynomials of degree ℓ to 𝕊^q is denoted by ℍ_ℓ with dimension d_ℓ. There is an orthonormal basis $\{Y_{ℓ, k}\}_{k = 1}^{d_{ℓ}}$ for each ℍ_ℓ that satisfies an addition formula

\begin{array}{l} \sum_{k = 1}^{d_{ℓ}} Y_{ℓ, k} (x) Y_{ℓ, k} (y) = ω_{q - 1}^{- 1} p_{ℓ} (1) p_{ℓ} (x \cdot y), \end{array}

where ω_{q−1} is the volume of 𝕊^{q−1}, and p_ℓ is the degree ℓ ultraspherical polynomial so that the family {p_ℓ} is orthonormalized with respect to the weight (1 − x²)^{(q−2)/2} on (−1, 1). A zonal function on the sphere has the form x ↦ G(x · y), where the activation function G:[−1, 1] → ℝ has a formal expansion of the form

\begin{array}{l} G (t) = ω_{q - 1}^{- 1} \sum_{ℓ = 0}^{\infty} \hat{G} (ℓ) p_{ℓ} (1) p_{ℓ} (t) . \end{array}

In particular, formally, $G (x \cdot y) = \sum_{ℓ = 0}^{\infty} \hat{G} (ℓ) \sum_{k = 1}^{d_{ℓ}} Y_{ℓ, k} (x) Y_{ℓ, k} (y)$ . The examples of the activation functions in which we are interested in this paper include

1. The function

\begin{array}{l} G_{r} (x) : = {(1 - 2 r x + r^{2})}^{- (q - 1) / 2}, x \in [- 1, 1], 0 < r < 1 . \end{array}

It is shown in Müller [33, Lemma 18] that

\begin{array}{l} \hat{G_{r}} (ℓ) = \frac{(q - 1) ω_{q}}{2 ℓ + q - 1} r^{ℓ}, ℓ = 1, 2, \dots . \end{array}

2. The function

\begin{array}{l} G_{r}^{E} (x) : = exp (r x), x \in [- 1, 1], r > 0 . \end{array}

It is shown in Mhaskar et al. [34, Lemma 5.1] that

\begin{array}{l} \hat{G_{r}^{E}} (ℓ) = \frac{ω_{q} r^{ℓ}}{2^{ℓ} Γ (ℓ + \frac{q + 1}{2})} (1 + O (1 / ℓ)) . \end{array}

3. The smooth ReLU function $G (t) = log (1 + e^{t}) = t_{+} + O (e^{- | t |})$ . The function G has an analytic extension to the strip ℝ + (−π, π)i of the complex plane. So, Bernstein approximation theorem [35, Theorem 5.4.2] can be used to show that

\begin{array}{l} \underset{ℓ \to \infty}{lim sup} | \hat{G} (ℓ) |^{1 / ℓ} = 1 / π . □ \end{array}
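To make the zonal expansion concrete, the sketch below specializes to q = 2, where the ultraspherical polynomials are Legendre polynomials P_ℓ. Assuming that ω_q denotes the surface area of 𝕊^q (so ω₁ = 2π and ω₂ = 4π) and that p_ℓ = ((2ℓ + 1)/2)^(1/2) P_ℓ, the formal expansion of G_r with the coefficients quoted above reduces to the classical generating function Σ_ℓ r^ℓ P_ℓ(t). The code compares this series with the closed form; the function names and the truncation degree are hypothetical.

```python
import math

def legendre_values(t, max_degree):
    """Legendre polynomials P_0(t), ..., P_max_degree(t) via the three-term recurrence."""
    values = [1.0, t]
    for ell in range(1, max_degree):
        values.append(((2 * ell + 1) * t * values[ell] - ell * values[ell - 1]) / (ell + 1))
    return values[: max_degree + 1]

def zonal_closed_form(t, r):
    """G_r(t) = (1 - 2 r t + r^2)^(-(q - 1)/2) with q = 2."""
    return (1 - 2 * r * t + r * r) ** -0.5

def zonal_series(t, r, max_degree=200):
    """Formal expansion specialized to q = 2, where it reduces to sum_l r^l P_l(t)."""
    return sum(r ** ell * p for ell, p in enumerate(legendre_values(t, max_degree)))

if __name__ == "__main__":
    for t, r in ((0.3, 0.5), (-0.8, 0.7)):
        print(t, r, zonal_closed_form(t, r), zonal_series(t, r))
```

The geometric decay r^ℓ of the coefficients is what makes a modest truncation accurate here; as r approaches 1, more terms are needed.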

Example 2.4. Let 𝕏 be a smooth, compact, connected Riemannian manifold (without boundary), ρ be the geodesic distance on 𝕏, μ^* be the Riemannian volume measure normalized to be a probability measure, {λ_k} be the sequence of eigenvalues of the (negative) Laplace-Beltrami operator on 𝕏, and ϕ_k be the eigenfunction corresponding to the eigenvalue λ_k; in particular, ϕ₀ ≡ 1. This example, of course, includes Examples 2.1–2.3. An eignet in this context has the form $x \mapsto \sum_{k = 1}^{n} a_{k} G (x, x_{k})$ , where the activation function G has a formal expansion of the form $G (x, y) = \sum_{k} b (λ_{k}) ϕ_{k} (x) ϕ_{k} (y)$ . One interesting example is the heat kernel:

\begin{array}{l} \sum_{k = 0}^{\infty} exp (- λ_{k}^{2} t) ϕ_{k} (x) ϕ_{k} (y) . \end{array}

□

Example 2.5. Let 𝕏 = ℝ^q, ρ be the ℓ^∞ norm on 𝕏, μ^* be the Lebesgue measure. For any multi-integer $k \in ℤ_{+}^{q}$ , the (multivariate) Hermite function ϕ_k is defined via the generating function

\begin{array}{l} \sum_{k \in ℤ_{+}^{q}} \frac{ϕ_{k} (x)}{\sqrt{2^{| k |_{1}} k!}} w^{k} = π^{- 1 / 4} exp (- \frac{1}{2} | x - w |_{2}^{2} + | w |_{2}^{2} / 4), w \in ℂ^{q} . & (2.1) \end{array}

The system {ϕ_k} is orthonormal with respect to μ^*, and satisfies

\begin{array}{l} Δ ϕ_{k} (x) - | x |_{2}^{2} ϕ_{k} (x) = - (2 | k |_{1} + 1) ϕ_{k} (x), x \in ℝ^{q}, \end{array}

where Δ is the Laplacian operator. As a consequence of the so called Mehler identity, one obtains [36] that

\begin{array}{l} exp (- | x - \frac{\sqrt{3}}{2} y |_{2}^{2}) exp (- | y |_{2}^{2} / 4) = {(\frac{3}{2 π})}^{- q / 2} \sum_{k \in ℤ_{+}^{q}} ϕ_{k} (x) ϕ_{k} (y) 3^{- | k |_{1} / 2} . & (2.2) \end{array}

A Gaussian network is a network of the form $x \mapsto \sum_{k = 1}^{n} a_{k} exp (- | x - z_{k} |_{2}^{2})$ , where it is convenient to think of $z_{k} = \frac{\sqrt{3}}{2} y_{k}$ . □
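The identity (2.2) can be checked numerically in the case q = 1. The sketch below generates the orthonormal Hermite functions by the standard three-term recurrence (an implementation choice, not taken from the text) and compares a truncated version of the right-hand side of (2.2) with the Gaussian on the left. The function names and the truncation level are hypothetical.

```python
import math

def hermite_functions(x, count):
    """Orthonormal Hermite functions phi_0(x), ..., phi_{count-1}(x) by recurrence."""
    phi = [math.pi ** -0.25 * math.exp(-x * x / 2)]
    if count > 1:
        phi.append(math.sqrt(2.0) * x * phi[0])
    for k in range(1, count - 1):
        # phi_{k+1} = sqrt(2/(k+1)) x phi_k - sqrt(k/(k+1)) phi_{k-1}
        phi.append(math.sqrt(2.0 / (k + 1)) * x * phi[k]
                   - math.sqrt(k / (k + 1)) * phi[k - 1])
    return phi

def gaussian_side(x, y):
    """Left-hand side of (2.2) for q = 1."""
    return math.exp(-(x - math.sqrt(3) / 2 * y) ** 2) * math.exp(-y * y / 4)

def hermite_side(x, y, count=120):
    """Right-hand side of (2.2) for q = 1, truncated to `count` terms."""
    px, py = hermite_functions(x, count), hermite_functions(y, count)
    series = sum(a * b * 3.0 ** (-k / 2) for k, (a, b) in enumerate(zip(px, py)))
    return (3 / (2 * math.pi)) ** -0.5 * series

if __name__ == "__main__":
    for x, y in ((0.7, -0.4), (1.5, 2.0)):
        print(x, y, gaussian_side(x, y), hermite_side(x, y))
```

In this reading, each Gaussian exp(−|x − z_k|²) with z_k = (√3/2) y_k is, up to the factor exp(|y_k|²/4), a kernel with the geometrically decaying coefficients 3^(−|k|₁/2) in the Hermite system.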

3. The Set-Up and Definitions

3.1. Data Spaces

Let 𝕏 be a connected, locally compact metric space with metric ρ. For r > 0, x ∈ 𝕏, we denote

\begin{array}{l} 𝔹 (x, r) = \{y \in 𝕏 : ρ (x, y) \leq r\}, Δ (x, r) = closure (𝕏 \setminus 𝔹 (x, r)) . \end{array}

If K ⊆ 𝕏 and x ∈ 𝕏, we write as usual $ρ (K, x) = {inf}_{y \in K} ρ (y, x)$ . It is convenient to denote the set {x ∈ 𝕏 : ρ(K, x) ≤ r} by 𝔹(K, r). The diameter of K is defined by $diam (K) = {sup}_{x, y \in K} ρ (x, y)$ .

For a Borel measure ν on 𝕏 (signed or positive), we denote by |ν| its total variation measure defined for Borel subsets K ⊂ 𝕏 by

\begin{array}{l} | ν | (K) = sup_{\mathcal{U}} \sum_{U \in \mathcal{U}} | ν (U) |, \end{array}

where the supremum is over all countable measurable partitions $\mathcal{U}$ of K. In the sequel, the term measure will mean a signed or positive, complete, sigma-finite, Borel measure. Terms such as measurable will mean Borel measurable. If f:𝕏 → ℝ is measurable, K ⊂ 𝕏 is measurable, and ν is a measure, we define²

\begin{array}{l} {‖ f ‖}_{p, ν, K} = \left\{ \begin{matrix} {(\int_{K} | f (x) |^{p} d | ν | (x))}^{1 / p}, & if 1 \leq p < \infty, \\ | ν | - \underset{x \in K}{ess sup} | f (x) |, & if p = \infty . \end{matrix} \right. \end{array}

The symbol L^p(ν, K) denotes the set of all measurable functions f for which ‖f‖_{p, ν, K} < ∞, with the usual convention that two functions are considered equal if they are equal |ν|-almost everywhere on K. The set C₀(K) denotes the set of all uniformly continuous functions on K vanishing at ∞. In the case when K = 𝕏, we will omit the mention of K, unless it is necessary to mention it to avoid confusion.

We fix a non-decreasing sequence $\{λ_{k}\}_{k = 0}^{\infty}$ , with λ₀ = 0 and λ_k ↑ ∞ as k → ∞. We also fix a positive sigma-finite Borel measure μ^* on 𝕏, and a system of orthonormal functions $\{ϕ_{k}\}_{k = 0}^{\infty} \subset L^{1} (μ^{*}, 𝕏) \cap C_{0} (𝕏)$ , such that ϕ₀(x) > 0 for all x ∈ 𝕏. We define

\begin{array}{l} Π_{n} = span \{ϕ_{k} : λ_{k} < n\}, n > 0 . & (3.1) \end{array}

It is convenient to write Π_n = {0} if n ≤ 0 and $Π_{\infty} = \bigcup_{n > 0} Π_{n}$ . It will be assumed in the sequel that Π_∞ is dense in C₀ (and, thus, in every L^p, 1 ≤ p < ∞). We will often refer to the elements of Π_∞ as diffusion polynomials in keeping with [13].

Definition 3.1. We will say that a sequence {a_n} (or a function f :[0, ∞) → ℝ) is fast decreasing if $lim_{n \to \infty} n^{S} a_{n} = 0$ (respectively, $lim_{x \to \infty} x^{S} f (x) = 0$ ) for every S > 0. A sequence {a_n} has polynomial growth if there exist c₁, c₂ > 0 such that $| a_{n} | \leq c_{1} n^{c_{2}}$ for all n ≥ 1, and similarly for functions.

Definition 3.2. The space 𝕏 (more precisely, the tuple $Ξ = (𝕏, ρ, μ^{*}, \{λ_{k}\}_{k = 0}^{\infty}, \{ϕ_{k}\}_{k = 0}^{\infty})$ ) is called a data space if each of the following conditions is satisfied.

1. For each x ∈ 𝕏, r > 0, 𝔹(x, r) is compact.

2. (Ball measure condition) There exist q ≥ 1 and κ > 0 with the following property: for each x ∈ 𝕏, r > 0,

\begin{array}{l} μ^{*} (𝔹 (x, r)) = μ^{*} ({y \in 𝕏 : ρ (x, y) < r}) \leq κ r^{q} . & (3.2) \end{array}

(In particular, μ^*({y ∈ 𝕏: ρ(x, y) = r}) = 0.)

3. (Gaussian upper bound) There exist κ₁, κ₂ > 0 such that for all x, y ∈ 𝕏, 0 < t ≤ 1,

\begin{array}{l} | \sum_{k = 0}^{\infty} exp (- λ_{k}^{2} t) ϕ_{k} (x) ϕ_{k} (y) | \leq κ_{1} t^{- q / 2} exp (- κ_{2} \frac{ρ {(x, y)}^{2}}{t}) . & (3.3) \end{array}

4. (Essential compactness) For every n ≥ 1, there exists a compact set 𝕂_n ⊂ 𝕏 such that the function n ↦ diam(𝕂_n) has polynomial growth, while the functions

\begin{array}{l} n \mapsto sup_{x \in 𝕏 \setminus 𝕂_{n}} \sum_{λ_{k} < n} ϕ_{k} {(x)}^{2} \end{array}

and

\begin{array}{l} n \mapsto \int_{𝕏 \setminus 𝕂_{n}} {(\sum_{λ_{k} < n} ϕ_{k} {(x)}^{2})}^{1 / 2} d μ^{*} (x) \end{array}

are both fast decreasing. (Necessarily, $n \mapsto μ^{*} (𝕂_{n})$ has polynomial growth as well.)
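The Gaussian upper bound (3.3) can be explored numerically in the simplest compact case. The sketch below works on the circle with q = 1, under the assumption that λ_k = k, so that the kernel in (3.3) becomes 1 + 2 Σ_{k≥1} exp(−k²t) cos(kρ) with ρ = ρ(x, y). It then divides the kernel by t^(−1/2) exp(−κ₂ρ²/t) for the illustrative choice κ₂ = 1/4. The grid of values of t and ρ, the truncation level, and the names are hypothetical; this is a sanity check on a grid, not a proof.

```python
import math

def heat_kernel_circle(t, rho, terms=200):
    """Kernel in (3.3) on the circle, assuming lambda_k = k and the orthonormal system
    1, sqrt(2) cos(kx), sqrt(2) sin(kx); rho is the geodesic distance between x and y."""
    return 1.0 + 2.0 * sum(math.exp(-k * k * t) * math.cos(k * rho)
                           for k in range(1, terms + 1))

def gaussian_bound_ratio(t, rho, kappa2=0.25):
    """Kernel divided by t^(-q/2) exp(-kappa2 rho^2 / t) with q = 1."""
    return heat_kernel_circle(t, rho) / (t ** -0.5 * math.exp(-kappa2 * rho * rho / t))

# Largest ratio over a grid of 0 < t <= 1 and 0 <= rho <= pi; a bounded value is
# consistent with (3.3) holding for some kappa_1 when kappa_2 = 1/4.
worst = max(gaussian_bound_ratio(t, math.pi * j / 50)
            for t in (1.0, 0.5, 0.25, 0.1) for j in range(51))

if __name__ == "__main__":
    print(worst)
```

Much smaller values of t are deliberately avoided: for large ρ²/t the kernel falls below floating-point resolution of the alternating series, and the computed ratio would no longer be meaningful.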

Remark 3.1. We assume without loss of generality that 𝕂_n ⊆ 𝕂_m for all n < m and that $μ^{*} (𝕂_{1}) > 0$ . □

Remark 3.2. If 𝕏 is compact, then the first condition as well as the essential compactness condition are automatically satisfied. We may take 𝕂_n = 𝕏 for all n. In this case, we will assume tacitly that μ^* is a probability measure, and ϕ₀ ≡ 1. □

Example 3.1. (Manifold case) This example points out that our notion of data space generalizes the set-ups in Examples 2.1–2.4. Let 𝕏 be a smooth, compact, connected Riemannian manifold (without boundary), ρ be the geodesic distance on 𝕏, μ^* be the Riemannian volume measure normalized to be a probability measure, {λ_k} be the sequence of eigenvalues of the (negative) Laplace-Beltrami operator on 𝕏, and ϕ_k be the eigenfunction corresponding to the eigenvalue λ_k; in particular, ϕ₀ ≡ 1. If the condition (3.2) is satisfied, then $(𝕏, ρ, μ^{*}, \{λ_{k}\}_{k = 0}^{\infty}, \{ϕ_{k}\}_{k = 0}^{\infty})$ is a data space. Of course, the assumption of essential compactness is satisfied trivially (see Appendix A for the Gaussian upper bound). □

Example 3.2. (Hermite case) We illustrate how Example 2.5 is included in our definition of a data space. Accordingly, we assume the set-up as in that example. For a > 0, let $ϕ_{k, a} (x) = a^{- q / 2} ϕ_{k} (a x)$ . With $λ_{k} = \sqrt{| k |_{1}}$ , the system $Ξ_{a} = (ℝ^{q}, ρ, μ^{*}, \{λ_{k}\}, \{ϕ_{k, a}\})$ is a data space. When a = 1, we will omit its mention from the notation in this context. The first two conditions are obvious. The Gaussian upper bound follows by the multivariate Mehler identity [37, Equation 4.27]. The assumption of essential compactness is satisfied with 𝕂_n = 𝔹(0, cn) for a suitable constant c (cf. [38, Chapter 6]). □

In the rest of this paper, we assume 𝕏 to be a data space. Different theorems will require some additional assumptions, two of which we now enumerate. Not every theorem will need all of these; we will state explicitly which theorem uses which assumptions, apart from 𝕏 being a data space.

The first of these deals with the product of two diffusion polynomials. We do not know of any situation where it is not satisfied but are not able to prove it in general.

Definition 3.3. (Product assumption) There exist A^* ≥ 1 and a family $\{R_{j, k, n} \in Π_{A^{*} n}\}$ such that for every S > 0,

\begin{array}{l} lim_{n \to \infty} n^{S} (max_{λ_{k}, λ_{j} < n, p = 1, \infty} ‖ ϕ_{k} ϕ_{j} - R_{j, k, n} ϕ_{0} ‖_{p}) = 0 . & (3.4) \end{array}

We say that the strong product assumption is satisfied if, instead of (3.4), we have for every n > 0 and P, Q ∈ Π_n, $P Q \in Π_{A^{*} n}$ .
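As a simple illustration, consider the circle (Example 2.1 with q = 1), and assume for this illustration only that λ_k = |k|, so that Π_n is the span of the exponentials exp(ik∘) with |k| < n. If P, Q ∈ Π_n have coefficients a_j and b_k, respectively, then

\begin{array}{l} P (x) Q (x) = \sum_{| j | < n} \sum_{| k | < n} a_{j} b_{k} exp (i (j + k) x), | j + k | < 2 n, \end{array}

so that PQ ∈ Π_{2n}. In this special case, the strong product assumption therefore holds with A^* = 2.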

Example 3.3. In the setting of Example 3.2, if P, Q ∈ Π_n, then PQ = Rϕ₀ for some R ∈ Π_{2n}. So, the product assumption holds trivially. The strong product assumption does not hold. However, if P, Q ∈ Π_n, then $P Q \in span \{ϕ_{k, \sqrt{2}} : λ_{k} < n \sqrt{2}\}$ . The manifold case is discussed below in Remark 3.3. □

Remark 3.3. One of the referees of our paper has pointed out three recent references [39–41] on the subject of the product assumption. The first two of these deal with the manifold case (Example 3.1). The paper [41] extends the results in Lu et al. [40] to the case when the functions ϕ_k are eigenfunctions of a more general elliptic operator. Since the results in these two papers are similar qualitatively, we will comment on Lu et al. [40] and Steinerberger [39].

In this remark only, let $K_{t} (x, y) = \sum_{k} exp (- λ_{k}^{2} t) ϕ_{k} (x) ϕ_{k} (y)$ . Let λ_k, λ_j < n. Steinerberger [39] relates E_{An}(2, ϕ_kϕ_j) [see (3.6) below for the definition] with

\begin{array}{l} ‖ \int_{𝕏} K_{t} (○, y) (ϕ_{k} (y) - ϕ_{k} (○)) (ϕ_{j} (y) - ϕ_{j} (○)) d μ^{*} (y) ‖_{2, μ^{*}} . \end{array}

While this gives some insight into the product assumption, the results are inconclusive about the product assumption as stated. Also, it is hard to verify whether the conditions mentioned in the paper are satisfied for a given manifold.

In Lu et al. [40], it is shown that for any ϵ, δ > 0, there exists a subspace V of dimension $O_{δ} (ϵ^{- δ} n^{1 + δ})$ such that for all ϕ_k, ϕ_j ∈ Π_n, ${inf}_{P \in V} ‖ ϕ_{k} ϕ_{j} - P ‖_{2, μ^{*}} \leq ϵ$ . The subspace V does not have to be Π_{An} for any A. Since the dimension of span{ϕ_kϕ_j} is O(n²), the result is meaningful only if 0 < δ < 1 and ϵ ≥ n^{1−1/δ}.

In Geller and Pesenson [42, Theorem 6.1], it is shown that the strong product assumption (and, thus, also the product assumption) holds in the manifold case when the manifold is a compact homogeneous manifold. We have extended this theorem in Filbir and Mhaskar [17, Theorem A.1] for the case of eigenfunctions of general elliptic partial differential operators on arbitrary compact, smooth manifolds provided that the coefficient functions in the operator satisfy some technical conditions. □

In our results in section 4, we will need the following condition, which serves the purpose of gradient in many of our earlier theorems on manifolds.

Definition 3.4. We say that the system Ξ satisfies the Bernstein-Lipschitz condition if for every n > 0, there exists B_n > 0 such that

\begin{array}{l} | P (x) - P (y) | \leq B_{n} ρ (x, y) ‖ P ‖_{\infty}, x, y \in 𝕏, P \in Π_{n} . & (3.5) \end{array}

Remark 3.4. Both in the manifold case and the Hermite case, B_n = cn for some constant c > 0. A proof in the Hermite case can be found in Mhaskar [43] and in the manifold case in Filbir and Mhaskar [44]. □
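On the circle, the classical Bernstein inequality for trigonometric polynomials gives (3.5) with B_n = n. The following sketch checks this numerically for one sample polynomial; the polynomial, the grid sizes, and the assumption λ_k = k (so that Π_4 contains the frequencies 0 through 3) are hypothetical choices made for illustration.

```python
import math

def trig_poly(x):
    """A sample element of Pi_4 on the circle (frequencies 0..3), chosen for illustration."""
    return 0.3 + math.cos(x) + 0.2 * math.cos(2 * x) - 0.5 * math.sin(3 * x)

def circle_distance(x, y):
    """Geodesic distance on the circle R / (2*pi*Z)."""
    d = abs(x - y) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

n = 4
grid = [2 * math.pi * j / 400 for j in range(400)]
# Grid approximation of the uniform norm of the sample polynomial.
sup_norm = max(abs(trig_poly(x)) for x in grid)
# Largest observed value of |P(x) - P(y)| / (rho(x, y) ||P||) over pairs of grid points.
worst_ratio = max(abs(trig_poly(x) - trig_poly(y)) / (circle_distance(x, y) * sup_norm)
                  for x in grid[::4] for y in grid if x != y)

if __name__ == "__main__":
    print(worst_ratio, "should not exceed", n)
```

The observed ratio only bounds B_n from below for this particular polynomial; (3.5) is a statement about all of Π_n simultaneously.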

3.2. Smoothness Classes

We define next the smoothness classes of interest here.

Definition 3.5. A function w:𝕏 → ℝ will be called a weight function if $w ϕ_{k} \in C_{0} (𝕏) \cap L^{1} (𝕏)$ for all k. If w is a weight function, we define

\begin{array}{l} E_{n} (w; p, f) = min_{P \in Π_{n}} ‖ f - P w ‖_{p, μ^{*}}, n > 0, 1 \leq p \leq \infty, f \in L^{p} (𝕏) . & (3.6) \end{array}

We will omit the mention of w if w ≡ 1 on 𝕏.

We find it convenient to denote by X^p the space $\{f \in L^{p} (𝕏) : {lim}_{n \to \infty} E_{n} (p, f) = 0\}$ ; i.e., X^p = L^p(𝕏) if 1 ≤ p < ∞ and $X^{\infty} = C_{0} (𝕏)$ .

Definition 3.6. Let 1 ≤ p ≤ ∞, γ > 0, and w be a weight function.

(a) For f ∈ L^p(𝕏), we define

\begin{array}{l} ‖ f ‖_{W_{γ, p, w}} = ‖ f ‖_{p, μ^{*}} + sup_{n > 0} n^{γ} E_{n} (w; p, f), & (3.7) \end{array}

and note that

\begin{array}{l} ‖ f ‖_{W_{γ, p, w}} ~ ‖ f ‖_{p, μ^{*}} + sup_{n \in ℤ_{+}} 2^{n γ} E_{2^{n}} (w; p, f) . & (3.8) \end{array}

The space W_{γ,p,w} comprises all f for which ‖f‖_{W_{γ,p,w}} < ∞.

(b) We write $C_{w}^{\infty} = ⋂_{γ > 0} W_{γ, \infty, w}$ . If B is a ball in 𝕏, $C_{w}^{\infty} (B)$ comprises functions $f \in C_{w}^{\infty}$ that are supported on B.

(c) If x₀ ∈ 𝕏, the space W_{γ,p,w}(x₀) comprises functions f such that there exists r > 0 with the property that, for every $ϕ \in C_{w}^{\infty} (𝔹 (x_{0}, r))$ , ϕf ∈ W_{γ,p,w}.

Remark 3.5. In both the manifold case and the Hermite case, characterizations of the smoothness classes W_γ,p are available in terms of constructive properties of the functions, such as the number of derivatives, estimates on certain moduli of smoothness or K-functionals, etc. In particular, the class C^∞ coincides with the class of infinitely differentiable functions vanishing at infinity. □

We can now state another assumption that will be needed in studying local approximation.

Definition 3.7. (Partition of unity) For every r > 0, there exists a countable family $F_{r} = \{ψ_{k, r}\}_{k = 0}^{\infty}$ of functions in C^∞ with the following properties:

1. Each $ψ_{k, r} \in F_{r}$ is supported on 𝔹(x_k, r) for some x_k ∈ 𝕏.

2. For every $ψ_{k, r} \in F_{r}$ and x ∈ 𝕏, 0 ≤ ψ_{k, r}(x) ≤ 1.

3. For every x ∈ 𝕏, there exists a finite subset $F_{r} (x) \subseteq F_{r}$ such that

\begin{array}{l} \sum_{ψ_{k, r} \in F_{r} (x)} ψ_{k, r} (y) = 1, y \in 𝔹 (x, r) . & (3.9) \end{array}

We record some simple observations about the partition of unity, omitting the elementary proof.

Proposition 3.1. Let r > 0, $F_{r}$ be a partition of unity.

(a) Necessarily, $\sum_{ψ_{k, r} \in F_{r} (x)} ψ_{k, r}$ is supported on 𝔹(x, 3r).

(b) For x ∈ 𝕏, $\sum_{ψ_{k, r} \in F_{r}} ψ_{k, r} (x) = 1$ .

The constant convention. In the sequel, c, c₁, ⋯ will denote generic positive constants depending only on the fixed quantities under discussion, such as Ξ, q, κ, κ₁, κ₂, the various smoothness parameters, and the filters to be introduced. Their value may be different at different occurrences, even within a single formula. The notation A ~ B means c₁A ≤ B ≤ c₂A. □

We end this section by defining a kernel that plays a central role in this theory.

Let H :[0, ∞) → ℝ be a compactly supported function. In the sequel, we define

\begin{array}{l} Φ_{N} (H; x, y) = \sum_{k = 0}^{\infty} H (λ_{k} / N) ϕ_{k} (x) ϕ_{k} (y), N > 0, x, y \in 𝕏 . & (3.10) \end{array}

If S ≥ 1 is an integer, and H is S times continuously differentiable, we introduce the notation

\begin{array}{l} |\!|\!| H |\!|\!|_{S} : = max_{0 \leq k \leq S} max_{x \in ℝ} | H^{(k)} (x) | . \end{array}

The following proposition recalls an important property of these kernels. Proposition 3.2 is proven in Maggioni and Mhaskar [13] and more recently in much greater generality in Mhaskar [45, Theorem 4.3].

Proposition 3.2. Let S > q be an integer, H :ℝ → ℝ be an even, S times continuously differentiable, compactly supported function. Then, for every x, y ∈ 𝕏, N > 0,

\begin{array}{l} | Φ_{N} (H; x, y) | \leq \frac{c N^{q} |\!|\!| H |\!|\!|_{S}}{\max (1, {(N ρ (x, y))}^{S})} . & (3.11) \end{array}
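The localization expressed by (3.11) can be observed numerically. The sketch below works on the circle with q = 1, assuming λ_k = k and the orthonormal system 1, √2 cos(kx), √2 sin(kx), so that Φ_N(H; x, y) = 1 + 2 Σ_{k≥1} H(k/N) cos(kρ) with ρ = ρ(x, y). The particular infinitely differentiable cutoff `cutoff` (equal to 1 on [−1/2, 1/2] and 0 outside (−1, 1)) is one hypothetical admissible choice of H, built from the standard function exp(−1/t).

```python
import math

def smooth_step(t):
    """exp(-1/t) for t > 0 and 0 otherwise; infinitely differentiable on the real line."""
    return math.exp(-1.0 / t) if t > 0 else 0.0

def cutoff(t):
    """Even C-infinity function equal to 1 on [-1/2, 1/2] and 0 outside (-1, 1)."""
    t = abs(t)
    up, down = smooth_step(1.0 - t), smooth_step(t - 0.5)
    return up / (up + down)

def localized_kernel(N, rho):
    """Phi_N(H; x, y) on the circle as a function of rho = rho(x, y), assuming lambda_k = k."""
    total = 1.0  # the k = 0 term, since H(0) = 1 and phi_0 = 1
    for k in range(1, int(math.ceil(N))):
        total += 2.0 * cutoff(k / N) * math.cos(k * rho)
    return total

if __name__ == "__main__":
    for N in (16, 32, 64):
        print(N, localized_kernel(N, 0.0), abs(localized_kernel(N, 1.5)))
```

On the diagonal the kernel grows like N (that is, N^q with q = 1), while at a fixed positive distance it becomes small as N grows, in line with the factor max(1, (Nρ)^S) in the denominator of (3.11).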

In the sequel, let h :ℝ → [0, 1] be a fixed, infinitely differentiable, even function, non-increasing on [0, ∞), with h(t) = 1 if |t| ≤ 1/2 and h(t) = 0 if |t| ≥ 1. If ν is any measure with a bounded total variation on 𝕏, we define

\begin{array}{l} σ_{n} (ν, h; f) (x) = \int_{𝕏} Φ_{n} (h; x, y) f (y) d ν (y) . & (3.12) \end{array}

We will omit the mention of h in the notations; e.g., write Φ_n(x, y) = Φ_n(h; x, y), and the mention of ν if ν = μ^*. In particular,

\begin{array}{l} σ_{n} (f) (x) = \sum_{k = 0}^{\infty} h (\frac{λ_{k}}{n}) \hat{f} (k) ϕ_{k} (x), \\ \begin{array}{l} n > 0, x \in 𝕏, f \in L^{1} (𝕏) + C_{0} (𝕏), \end{array} & (3.13) \end{array}

where for $f \in L^{1} + C_{0}$ , we write

\begin{array}{l} \hat{f} (k) = \int_{𝕏} f (y) ϕ_{k} (y) d μ^{*} (y) & (3.14) \end{array}

3.3. Measures

In this section, we describe the terminology involving measures.

Definition 3.8. Let d ≥ 0. A measure $ν \in M$ will be called d–regular if

\begin{array}{l} | ν | (𝔹 (x, r)) \leq c {(r + d)}^{q}, x \in 𝕏 . & (3.15) \end{array}

The infimum of all constants c that work in (3.15) will be denoted by |||ν|||_{R, d}, and the class of all d-regular measures will be denoted by $R_{d}$ .

For example, μ^* itself is in R₀ with $‖ | μ^{*} ‖ |_{R, 0} \leq κ$ [cf. (3.2)]. More generally, if w ∈ C₀(𝕏) then the measure wdμ^* is in R₀ with $‖ | w d μ^{*} ‖ |_{R, 0} \leq κ ‖ w ‖_{\infty, μ^{*}}$ .

Definition 3.9. (a) A sequence {ν_n} of measures on 𝕏 is called an admissible quadrature measure sequence if the sequence {|ν_n|(𝕏)} has polynomial growth and

\begin{array}{l} \int_{𝕏} P d ν_{n} = \int_{𝕏} P d μ^{*}, P \in Π_{n}, n \geq 1 . & (3.16) \end{array}

(b) A sequence {ν_n} of measures on 𝕏 is called an admissible product quadrature measure sequence if the sequence {|ν_n|(𝕏)} has polynomial growth and

\begin{array}{l} \int_{𝕏} P_{1} P_{2} d ν_{n} = \int_{𝕏} P_{1} P_{2} d μ^{*}, P_{1}, P_{2} \in Π_{n}, n \geq 1 . & (3.17) \end{array}

(c) By abuse of terminology, we will say that a measure ν_n is an admissible quadrature measure (respectively, an admissible product quadrature measure) of order n if $| ν_{n} | (𝕏) \leq c_{1} n^{c}$ (with constants independent of n) and (3.16) [respectively, (3.17)] holds.

In the case when 𝕏 is compact, a well-known theorem called Tchakaloff's theorem [46, Exercise 2.5.8, p. 100] shows the existence of admissible product quadrature measures (even finitely supported probability measures). However, in order to construct such measures, it is much easier to prove the existence of admissible quadrature measures, as we will do in Theorem 7.1, and then use one of the product assumptions to derive admissible product quadrature measures.
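The difference between (3.16) and (3.17) is visible already on the circle. The following sketch is an illustration under stated assumptions, not the construction of Theorem 7.1: it takes the trigonometric system on [0, 2π) with Π_n spanned by the frequencies below n, and the equal-weight rule on M equispaced nodes. The helper names are hypothetical.

```python
import numpy as np


def equispaced_rule(M):
    """Nodes and equal weights 2*pi/M on the circle [0, 2*pi)."""
    nodes = 2.0 * np.pi * np.arange(M) / M
    weights = np.full(M, 2.0 * np.pi / M)
    return nodes, weights


def basis(n, x):
    """Real trigonometric basis of Pi_n (frequencies < n); rows are basis functions."""
    rows = [np.ones_like(x)]
    for k in range(1, n):
        rows += [np.cos(k * x), np.sin(k * x)]
    return np.array(rows)


def quadrature_defect(n, M, products=False):
    """Largest error of the M-point rule on the basis of Pi_n, or on pairwise products."""
    x, w = equispaced_rule(M)
    B = basis(n, x)
    if products:
        approx = (B * w) @ B.T  # sum_k w_k P1(z_k) P2(z_k) for all basis pairs
        exact = np.diag([2.0 * np.pi] + [np.pi] * (2 * n - 2))  # Gram matrix in L^2(dx)
    else:
        approx = B @ w  # sum_k w_k P(z_k)
        exact = np.zeros(2 * n - 1)
        exact[0] = 2.0 * np.pi  # only the constant has a non-zero integral
    return np.max(np.abs(approx - exact))


print(quadrature_defect(5, 5))                  # exact on Pi_5
print(quadrature_defect(5, 5, products=True))   # fails on products, e.g. cos(2x)cos(3x)
print(quadrature_defect(5, 10, products=True))  # exact again once M >= 2n
```

In this special case, a rule that is exact on Π_{2n} is a product quadrature rule of order n, which matches the passage from ν_n to ν_{A^* n} with A^* = 2.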

Example 3.4. In the manifold case, let the strong product assumption hold as in Remark 3.3. If n ≥ 1 and $C \subset 𝕏$ is a finite subset satisfying the assumptions of Theorem 7.1, then the theorem asserts the existence of an admissible quadrature measure supported on $C$ . If {ν_n} is an admissible quadrature measure sequence, then $\{ν_{A^{*} n}\}$ is an admissible product quadrature measure sequence. In particular, there exist finitely supported admissible product quadrature measures of order n for every n ≥ 1. □

Example 3.5. We consider the Hermite case as in Example 3.2. For every a > 0 and n ≥ 1, Theorem 7.1 applied with the system Ξ_a yields admissible quadrature measures of order n supported on finite subsets of ℝ^q (in fact, of [−cn, cn]^q for an appropriate c). In particular, an admissible quadrature measure of order $n \sqrt{2}$ for $Ξ_{\sqrt{2}}$ is an admissible product quadrature measure of order n for Ξ = Ξ₁. □

3.4. Eignets

The notion of an eignet defined below is a generalization of the various kernels described in the examples in section 2.

Definition 3.10. A function b:[0, ∞) → (0, ∞) is called a smooth mask if b is non-increasing, and there exists B^* = B^*(b) ≥ 1 such that the mapping t ↦ b(B^*t)/b(t) is fast decreasing. A function G:𝕏 × 𝕏 → ℝ is called a smooth kernel if there exists a measurable function W = W(G) :𝕏 → ℝ such that we have a formal expansion (with a smooth mask b)

\begin{array}{l} W (y) G (x, y) = \sum_{k} b (λ_{k}) ϕ_{k} (x) ϕ_{k} (y), x, y \in 𝕏 . & (3.18) \end{array}

If m ≥ 1 is an integer, an eignet with m neurons is a function of the form $x \mapsto \sum_{k = 1}^{m} a_{k} G (x, y_{k})$ for y_k ∈ 𝕏.

Example 3.6. In the manifold case, the notion of eignet includes all the examples stated in section 2 with W ≡ 1, except for the example of smooth ReLU function described in Example 2.3. In the Hermite case, (2.2) shows that the kernel $G (x, y) = exp (- | x - \frac{\sqrt{3}}{2} y |_{2}^{2})$ defined on ℝ^q × ℝ^q is a smooth kernel, with λ_k = |k|₁, ϕ_k as in Example 2.5, and $b (t) = {(\frac{3}{2 π})}^{- q / 2} 3^{- t / 2}$ . The function W here is $W (y) = exp (- | y |_{2}^{2} / 4)$ . □

Remark 3.6. It is possible to relax the conditions on the mask in Definition 3.10. Firstly, the condition that b should be non-increasing is made only to simplify our proofs. It is not difficult to modify them without this assumption. Secondly, let b₀ :[0, ∞) → ℝ satisfy |b₀(t)| ≤ b₁(t) for a smooth mask b₁ as stipulated in that definition. The function b₂ = b₀ + 2b₁ is then a smooth mask and so is b₁. Let $G_{j} (x, y) = \sum_{k = 0}^{\infty} b_{j} (λ_{k}) ϕ_{k} (x) ϕ_{k} (y)$ , j = 0, 1, 2. Then G₀(x, y) = G₂(x, y) − 2G₁(x, y). Therefore, all of the results in sections 4 and 8 can be applied once with G₂ and once with G₁ to obtain a corresponding result for G₀ with different constants. For this reason, we will simplify our presentation by assuming the apparently restrictive conditions stipulated in Definition 3.10. In particular, this includes the example of the smooth ReLU network described in Example 2.3. □

Definition 3.11. Let ν be a measure on 𝕏 (signed or having bounded variation), and G ∈ C₀(𝕏 × 𝕏). We define

\begin{array}{l} D_{G, n} (x, y) = \sum_{k = 0}^{\infty} h (λ_{k} / n) b {(λ_{k})}^{- 1} ϕ_{k} (x) ϕ_{k} (y), n \geq 1, x, y \in 𝕏, & (3.19) \end{array}

and

\begin{array}{l} 𝔾_{n} (ν; x, y) = \int_{𝕏} G (x, z) W (z) D_{G, n} (z, y) d ν (z) . & (3.20) \end{array}

Remark 3.7. Typically, we will use an admissible product quadrature measure sequence in place of the measure ν, where each of the measures in the sequence is finitely supported, to construct a sequence of networks. In the case when 𝕏 is compact, Tchakaloff's theorem shows that there exists an admissible product quadrature measure of order m supported on ${(dim (Π_{m}) + 1)}^{2}$ points. Using this measure in place of ν, one obtains a pre-fabricated eignet 𝔾_n(ν) with ${(dim (Π_{m}) + 1)}^{2}$ neurons. However, this is not an actual construction. In the presence of the product assumption, Theorem 7.1 leads to the pre-fabricated networks 𝔾_n in a constructive manner with the number of neurons as stipulated in that theorem. □
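To make (3.19) and (3.20) concrete, the following sketch builds a pre-fabricated eignet in a toy setting. All choices are assumptions made for illustration only: the circle with the trigonometric system, W ≡ 1, the mask b(t) = exp(−t), and the equal-weight rule on M equispaced points as the measure ν. With an (essentially) exact product quadrature, 𝔾_n(ν; x, y) should reproduce Φ_n(x, y), and the code checks this.

```python
import numpy as np


def smooth_cutoff(t):
    """Even C-infinity filter: equals 1 for |t| <= 1/2 and 0 for |t| >= 1."""
    u = np.clip(2.0 * np.abs(np.asarray(t, dtype=float)) - 1.0, 0.0, 1.0)

    def g(s):
        return np.where(s > 0, np.exp(-1.0 / np.where(s > 0, s, 1.0)), 0.0)

    return g(1.0 - u) / (g(1.0 - u) + g(u))


def mask(t):
    """Hypothetical smooth mask b(t) = exp(-t): positive and non-increasing."""
    return np.exp(-np.asarray(t, dtype=float))


def cosine_kernel(coeff, dist, kmax):
    """sum_k coeff(lambda_k) phi_k(x) phi_k(y) on the circle, truncated at kmax."""
    k = np.arange(1, kmax + 1)
    series = np.tensordot(coeff(k), np.cos(np.multiply.outer(k, dist)), axes=1)
    return (coeff(0.0) + 2.0 * series) / (2.0 * np.pi)


def prefabricated_eignet(n, x, y, M=64, kmax=100):
    """G_n(nu; x, y) of (3.20) with nu the equal-weight rule on M equispaced nodes z."""
    z = 2.0 * np.pi * np.arange(M) / M
    w = 2.0 * np.pi / M
    G = cosine_kernel(mask, x[:, None] - z[None, :], kmax)  # G(x, z), see (3.18)
    D = cosine_kernel(lambda t: smooth_cutoff(t / n) / mask(t),
                      z[:, None] - y[None, :], int(np.ceil(n)))  # D_{G,n}(z, y), see (3.19)
    return w * (G @ D)  # for each y: a combination of the M neurons G(., z_m)


x = np.linspace(0.0, 6.0, 7)
y = np.linspace(0.3, 5.0, 5)
direct = cosine_kernel(lambda t: smooth_cutoff(t / 8), x[:, None] - y[None, :], 8)
print(np.max(np.abs(prefabricated_eignet(8, x, y) - direct)))
```

For each fixed y, the function x ↦ 𝔾_n(ν; x, y) is an eignet with M neurons whose coefficients w D_{G,n}(z_m, y) do not depend on any data.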

4. Main Results

In this section, we assume the Bernstein-Lipschitz condition (Definition 3.4) in all the theorems. We note that the measure μ^* may not be a probability measure. Therefore, we use an auxiliary function f₀ to define a probability measure as follows. Let f₀ ∈ C₀(𝕏), f₀(x) ≥ 0 for all x ∈ 𝕏, and $d ν^{*} = f_{0} d μ^{*}$ be a probability measure. Necessarily, ν^* is 0-regular, and $‖ | ν^{*} ‖ |_{R, 0} \leq κ ‖ f_{0} ‖_{\infty, μ^{*}}$ . We assume noisy data of the form (y, ϵ), with a joint probability distribution τ defined for Borel subsets of 𝕏 × Ω for some measure space Ω, and with ν^* being the marginal distribution of y with respect to τ. Let $F (y, ϵ)$ be a random variable following the law τ, and denote

\begin{array}{l} f (y) = 𝔼_{τ} (F (y, ϵ) | y) . & (4.1) \end{array}

It is easy to verify using Fubini's theorem that if $F$ is integrable with respect to τ, then, for any x ∈ 𝕏,

\begin{array}{l} 𝔼_{τ} (F (y, ϵ) Φ_{n} (x, y)) = σ_{n} (ν^{*}; f) (x) : = \int_{𝕏} f (y) Φ_{n} (x, y) d ν^{*} (y) . & (4.2) \end{array}
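The identity (4.2) suggests replacing the expectation by a sample mean. The following Monte Carlo sketch checks this in a toy case; everything specific in it is an assumption made for illustration: the circle with the trigonometric system, the uniform marginal ν^* (so f₀ = 1/(2π)), the target f(y) = cos(y), and additive Gaussian noise.

```python
import numpy as np


def smooth_cutoff(t):
    """Even C-infinity filter: equals 1 for |t| <= 1/2 and 0 for |t| >= 1."""
    u = np.clip(2.0 * np.abs(np.asarray(t, dtype=float)) - 1.0, 0.0, 1.0)

    def g(s):
        return np.where(s > 0, np.exp(-1.0 / np.where(s > 0, s, 1.0)), 0.0)

    return g(1.0 - u) / (g(1.0 - u) + g(u))


def phi_kernel(n, dist):
    """Phi_n(x, y) on the circle as a function of dist = x - y (any array shape)."""
    k = np.arange(1, int(np.ceil(n)) + 1)
    series = np.tensordot(smooth_cutoff(k / n), np.cos(np.multiply.outer(k, dist)), axes=1)
    return (1.0 + 2.0 * series) / (2.0 * np.pi)


def empirical_sigma(n, x, y, F):
    """(1/|Y|) sum_j F_j Phi_n(x, y_j): a sample version of the left side of (4.2)."""
    return phi_kernel(n, x[:, None] - y[None, :]) @ F / len(y)


rng = np.random.default_rng(0)
y = rng.uniform(0.0, 2.0 * np.pi, size=20000)      # marginal nu^*: uniform on the circle
F = np.cos(y) + 0.1 * rng.standard_normal(y.size)  # noisy samples with f(y) = cos(y)
x = np.linspace(0.0, 2.0 * np.pi, 50)

# sigma_4(nu^*; f) = f0 * f here, so multiplying by 2*pi = 1/f0 should recover cos(x).
estimate = 2.0 * np.pi * empirical_sigma(4, x, y, F)
error = np.max(np.abs(estimate - np.cos(x)))
print(error)
```

The sample mean uses the function values themselves as coefficients; no optimization is involved.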

Let Y be a random sample from τ, and {ν_n} be an admissible product quadrature sequence in the sense of Definition 3.9. We define [cf. (3.20)]

\begin{array}{l} G_{n} (Y; F) (x) = G_{n} (ν_{B^{*} n}, Y; F) (x) \\ \begin{array}{l} = \frac{1}{| Y |} \sum_{(y, ϵ) \in Y} F (y, ϵ) 𝔾_{n} (ν_{B^{*} n}; x, y), x \in 𝕏, n = 1, 2, \dots, \end{array} & (4.3) \end{array}

where B^* is as in Definition 3.10.

Remark 4.1. We note that the networks 𝔾_n are prefabricated independently of the data. The network $G_{n}$ therefore has only |Y| terms depending upon the data. □

Our first theorem describes local function recovery using local sampling. We may interpret it in the spirit of distributed learning as in Chui et al. [24] and Lin et al. [26], where we are taking a linear combination of pre-fabricated networks 𝔾_n using the function values themselves as the coefficients. The networks 𝔾_n have essentially the same localization property as the kernels Φ_n (cf. Theorem 8.2).

Theorem 4.1. Let x₀ ∈ 𝕏 and r > 0. We assume the partition of unity and find a function ψ ∈ C^∞ supported on 𝔹(x₀, 3r), which is equal to 1 on 𝔹(x₀, r). Let $𝔪 = \int_{𝕏} ψ d μ^{*}$ , f₀ = ψ/𝔪, and $d ν^{*} = f_{0} d μ^{*}$ . We assume the rest of the set-up as described. If f₀f ∈ W_{γ, ∞}, then for 0 < δ < 1, and $| Y | \geq c n^{q + 2 γ} r^{q} log (n B_{n} / δ)$ ,

\begin{array}{l} {Prob}_{τ} ( \{ ‖ \frac{𝔪}{| Y |} \sum_{(y, ϵ) \in Y} F (y, ϵ) 𝔾_{n} (ν_{B^{*} n}; \circ, y) - f ‖_{\infty, μ^{*}, 𝔹 (x_{0}, r)} \geq c_{3} n^{- γ} \} ) \leq δ . & (4.4) \end{array}

Remark 4.2. If {y₁, ⋯, y_M} is a random sample from some probability measure supported on 𝕏, $s = \sum_{ℓ = 1}^{M} f_{0} (y_{ℓ})$ , and we construct a sub-sample using the distribution that associates the mass f₀(y_j)/s with each y_j, then the probability of selecting points outside of the support of f₀ is 0. This leads to a sub-sample Y. If $M \geq c n^{q + 2 γ} log (n B_{n} / δ)$ , then the Chernoff bound, Proposition B.1(b), can be used to show that |Y| is large, as stipulated in Theorem 4.1. □

Next, we state two inverse theorems. Our first theorem obtains accuracy on the estimation of the density f₀ using eignets instead of positive kernels.

Theorem 4.2. With the set-up as in Theorem 8.3, let γ > 0, f₀ ∈ W_{γ, ∞}, and

\begin{array}{l} | Y | \geq ‖ f_{0} ‖_{\infty, μ^{*}} n^{q + 2 γ} log (\frac{n B_{n}}{δ}) . \end{array}

Then, with $F \equiv 1$ ,

\begin{array}{l} {Prob}_{τ} ( \{ ‖ \frac{1}{| Y |} \sum_{(y, ϵ) \in Y} 𝔾_{n} (ν_{B^{*} n}; \circ, y) - f_{0} ‖_{\infty} \geq c_{3} n^{- γ} \} ) \leq δ . & (4.5) \end{array}

Remark 4.3. Unlike density estimation using positive kernels, there is no inherent limit on the accuracy predicted by (4.5) on the estimation of f₀. □
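A toy version of the density estimator in (4.5) is sketched below. The assumptions are ours: the circle with the trigonometric system, the kernel Φ_n used directly in place of 𝔾_n, and the density f₀(x) = (1 + cos x)/(2π), sampled by a simple rejection step. Since the kernel changes sign, the estimate is not constrained to be non-negative.

```python
import numpy as np


def smooth_cutoff(t):
    """Even C-infinity filter: equals 1 for |t| <= 1/2 and 0 for |t| >= 1."""
    u = np.clip(2.0 * np.abs(np.asarray(t, dtype=float)) - 1.0, 0.0, 1.0)

    def g(s):
        return np.where(s > 0, np.exp(-1.0 / np.where(s > 0, s, 1.0)), 0.0)

    return g(1.0 - u) / (g(1.0 - u) + g(u))


def phi_kernel(n, dist):
    """Phi_n(x, y) on the circle as a function of dist = x - y (any array shape)."""
    k = np.arange(1, int(np.ceil(n)) + 1)
    series = np.tensordot(smooth_cutoff(k / n), np.cos(np.multiply.outer(k, dist)), axes=1)
    return (1.0 + 2.0 * series) / (2.0 * np.pi)


def sample_density(size, rng):
    """Rejection sampling from f0(x) = (1 + cos x) / (2 pi) on [0, 2 pi)."""
    out = np.empty(0)
    while out.size < size:
        cand = rng.uniform(0.0, 2.0 * np.pi, size=2 * size)
        keep = rng.uniform(size=cand.size) < 0.5 * (1.0 + np.cos(cand))
        out = np.concatenate([out, cand[keep]])
    return out[:size]


rng = np.random.default_rng(0)
y = sample_density(20000, rng)
x = np.linspace(0.0, 2.0 * np.pi, 50)

# (1/|Y|) sum_j Phi_n(x, y_j), i.e., the estimator of (4.5) with F identically 1.
estimate = phi_kernel(4, x[:, None] - y[None, :]).mean(axis=1)
truth = (1.0 + np.cos(x)) / (2.0 * np.pi)
max_error = np.max(np.abs(estimate - truth))
print(max_error)
```

Here f₀ is a trigonometric polynomial of degree 1, so σ_4(f₀) = f₀ and the only error is the sampling error.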

The following theorem gives a complete characterization of the local smoothness classes using eignets. In particular, Part (b) of the following theorem gives a solution to the inverse problem of determining what smoothness class the target function belongs to near each point of 𝕏. In theory, this leads to a data-based detection of singularities and sparsity analogous to what is assumed in Chui et al. [24] but in a much more general setting.

Theorem 4.3. Let f₀ ∈ C₀(𝕏), f₀(x) ≥ 0 for all x ∈ 𝕏, and $d ν^{*} = f_{0} d μ^{*}$ be a probability measure, and let τ, $F$ , and f be as described above. We assume the partition of unity and the product assumption. Let S ≥ q + 2, 0 < γ ≤ S, x₀ ∈ 𝕏, 0 < δ < 1. For each j ≥ 0, suppose that Y_j is a random sample from τ with $| Y_{j} | \geq 2 c_{1} 2^{j (q + 2 S)} ‖ | ν^{*} ‖ |_{R, 0} log (c 2^{2 j} B_{2^{j}} / δ)$ . Then with τ-probability ≥ 1 − δ,

(a) If f₀f ∈ W_{γ, ∞}(x₀) then there exists a ball 𝔹 centered at x₀ such that

\begin{array}{l} sup_{j \geq 1} 2^{j γ} ‖ G_{2^{j}} (Y_{j}; F) - G_{2^{j - 1}} (Y_{j}; F) ‖_{\infty, μ^{*}, 𝔹} < \infty . & (4.6) \end{array}

(b) If there exists a ball 𝔹 centered at x₀ for which (4.6) holds, then f₀f ∈ W_{γ, ∞, ϕ₀}(x₀).

5. Preparatory Results

We prove a lower bound on μ^*(𝔹(x, r)) for x ∈ 𝕏 and 0 < r ≤ 1 (cf. [47]).

Proposition 5.1. We have

\begin{array}{l} μ^{*} (𝔹 (x, r)) \geq c r^{q}, 0 < r \leq 1, x \in 𝕏 . & (5.1) \end{array}

In order to prove the proposition, we recall a lemma, proved in Mhaskar [14, Proposition 5.1].

Lemma 5.1. Let d ≥ 0 and ν ∈ R_d. If g₁:[0, ∞) → [0, ∞) is a non-increasing function, then, for any N > 0, r > 0, x ∈ 𝕏,

\begin{array}{l} N^{q} \int_{Δ (x, r)} g_{1} (N ρ (x, y)) d | ν | (y) \leq \\ \begin{array}{l} c \frac{2^{q} (1 + {(d / r)}^{q}) q}{1 - 2^{- q}} ‖ | ν ‖ |_{R, d} \int_{r N / 2}^{\infty} g_{1} (u) u^{q - 1} d u . \end{array} & (5.2) \end{array}

PROOF OF PROPOSITION 5.1.

Let x ∈ 𝕏, r > 0 be fixed in this proof, although the constants will not depend upon these. In this proof, we write

\begin{array}{l} K_{t} (x, y) = \sum_{k = 0}^{\infty} exp (- λ_{k}^{2} t) ϕ_{k} (x) ϕ_{k} (y) . \end{array}

The Gaussian upper bound (3.3) shows that for t > 0,

\begin{array}{l} \int_{Δ (x, r)} | K_{t} (x, y) | d μ^{*} (y) \leq κ_{1} t^{- q / 2} \int_{Δ (x, r)} exp (- κ_{2} ρ {(x, y)}^{2} / t) d μ^{*} (y) . & (5.3) \end{array}

Using Lemma 5.1 with d = 0, dν = dμ^*, $g_{1} (u) = exp (- u^{2})$ , $N = \sqrt{κ_{2} / t}$ , we obtain for $r^{2} / t \geq (q - 2) / κ_{2}$ :

\begin{array}{l} \int_{Δ (x, r)} | K_{t} (x, y) | d μ^{*} (y) \\ \leq c \int_{N r / 2}^{\infty} u^{q - 1} exp (- u^{2}) d u = c_{1} \int_{{(N r / 2)}^{2}}^{\infty} u^{q / 2 - 1} e^{- u} d u \\ \begin{array}{l} \leq c_{2} {(r^{2} / t)}^{(q - 2) / 2} exp (- κ_{2} r^{2} / (4 t)) . \end{array} & (5.4) \end{array}

Therefore, denoting in this proof only that κ₀ = ‖ϕ₀‖_∞, we obtain that

\begin{array}{l} 1 = \int_{𝕏} K_{t} (x, y) ϕ_{0} (y) d μ^{*} (y) \leq κ_{0} \int_{𝕏} | K_{t} (x, y) | d μ^{*} (y) \\ \leq κ_{0} κ_{1} t^{- q / 2} μ^{*} (𝔹 (x, r)) + c_{3} {(r^{2} / t)}^{(q - 2) / 2} exp (- κ_{2} r^{2} / (4 t)) . & (5.5) \end{array}

We now choose t ~ r² so that $c_{3} {(r^{2} / t)}^{(q - 2) / 2} exp (- κ_{2} r^{2} / (4 t)) \leq 1 / 2$ to obtain (5.1) for r ≤ c₄. The estimate is clear for c₄ < r ≤ 1. □

Next, we prove some results about the system {ϕ_k}.

Lemma 5.2. For n ≥ 1, we have

\begin{array}{l} \sum_{λ_{k} < n} ϕ_{k} {(x)}^{2} \leq c n^{q}, x \in 𝕏 . & (5.6) \end{array}

and

\begin{array}{l} dim (Π_{n}) \leq c n^{q} μ^{*} (𝕂_{n}) . & (5.7) \end{array}

In particular, the function n ↦ dim(Π_n) has polynomial growth.

PROOF. The Gaussian upper bound with x = y implies that

\begin{array}{l} \sum_{k = 0}^{\infty} exp (- λ_{k}^{2} t) ϕ_{k} {(x)}^{2} \leq c t^{- q / 2}, 0 < t \leq 1, x \in 𝕏 . \end{array}

The estimate (5.6) follows from a Tauberian theorem [44, Proposition 4.1]. The essential compactness now shows that for any R > 0,

\begin{array}{l} \int_{𝕏 \ 𝕂_{n}} \sum_{λ_{k} < n} ϕ_{k} {(x)}^{2} d μ^{*} (x) \leq \{ sup_{x \in 𝕏 \ 𝕂_{n}} \sum_{λ_{k} < n} ϕ_{k} {(x)}^{2} \}^{1 / 2} \int_{𝕏 \ 𝕂_{n}} {(\sum_{λ_{k} < n} ϕ_{k} {(x)}^{2})}^{1 / 2} d μ^{*} (x) \\ \leq c n^{- R} . \end{array}

In particular,

\begin{array}{l} dim (Π_{n}) = \int_{𝕏} \sum_{λ_{k} < n} ϕ_{k} {(x)}^{2} d μ^{*} (x) \\ \leq \int_{𝕂_{n}} \sum_{λ_{k} < n} ϕ_{k} {(x)}^{2} d μ^{*} (x) + c n^{- R} \leq c n^{q} μ^{*} (𝕂_{n}) . \end{array}

□

Next, we prove some properties of the operators σ_n and diffusion polynomials. The following proposition follows easily from Lemma 5.1 and Proposition 3.2 (cf. [14, 48]).

Proposition 5.2. Let S, H be as in Proposition 3.2, d > 0, $ν \in R_{d}$ , and x ∈ 𝕏.

(a) If r ≥ 1/N, then

\begin{array}{l} \int_{Δ (x, r)} | Φ_{N} (H; x, y) | d | ν | (y) \leq c (1 + {(d N)}^{q}) {(r N)}^{- S + q} ‖ | ν ‖ |_{R, d} ‖ | H ‖ |_{S} . & (5.8) \end{array}

(b) We have

\begin{array}{l} \int_{𝕏} | Φ_{N} (H; x, y) | d | ν | (y) \leq c (1 + {(d N)}^{q}) ‖ | ν ‖ |_{R, d} ‖ | H ‖ |_{S}, & (5.9) \end{array}

\begin{array}{l} ‖ Φ_{N} (H; x, \circ) ‖_{ν; 𝕏, p} \leq c N^{q / p^{'}} {(1 + {(d N)}^{q})}^{1 / p} ‖ | ν ‖ |_{R, d}^{1 / p} ‖ | H ‖ |_{S}, & (5.10) \end{array}

and

\begin{array}{l} {‖ \int_{𝕏} | Φ_{N} (H; ∘, y) | d | ν | (y) ‖}_{p} \leq c {(1 + {(d N)}^{q})}^{1 / p^{'}} {‖ | ν | ‖}_{R, d}^{1 / p^{'}} {(| ν | (𝕏))}^{1 / p} ‖ | H | ‖_{S} . & (5.11) \end{array}

The following lemma is well-known; a proof is given in Mhaskar [15, Lemma 5.3].

Lemma 5.3. Let (Ω₁, ν), (Ω₂, τ) be sigma–finite measure spaces, Ψ : Ω₁ × Ω₂ → ℝ be ν × τ–integrable,

\begin{array}{l} M_{\infty} : = ν - \underset{x \in Ω_{1}}{ess\,sup} \int_{Ω_{2}} | Ψ (x, y) | d τ (y) < \infty, \\ M_{1} : = τ - \underset{y \in Ω_{2}}{ess\,sup} \int_{Ω_{1}} | Ψ (x, y) | d ν (x) < \infty, & (5.12) \end{array}

and formally, for τ–measurable functions f : Ω₂ → ℝ,

T (f, x) : = \int_{Ω_{2}} f (y) Ψ (x, y) d τ (y), x \in Ω_{1} .

Let 1 ≤ p ≤ ∞. If $f \in L^{p} (τ; Ω_{2})$ then T(f, x) is defined for ν–almost all x ∈ Ω₁, and

\begin{array}{l} ‖ T f ‖_{ν; Ω_{1}, p} \leq M_{1}^{1 / p} M_{\infty}^{1 / p^{'}} ‖ f ‖_{τ; Ω_{2}, p}, f \in L^{p} (Ω_{2}, τ) . & (5.13) \end{array}

Theorem 5.1. Let n > 0. If P ∈ Π_{n/2}, then σ_n(P) = P. Also, for any p with 1 ≤ p ≤ ∞,

\begin{array}{l} ‖ σ_{n} (f) ‖_{p} \leq c ‖ f ‖_{p}, f \in L^{p} . & (5.14) \end{array}

If 1 ≤ p ≤ ∞, and f ∈ L^p (𝕏), then

\begin{array}{l} E_{n} (p, f) \leq ‖ f - σ_{n} (f) ‖_{p, μ^{*}} \leq c E_{n / 2} (p, f) . & (5.15) \end{array}

PROOF. The fact that σ_n(P) = P for all P ∈ Π_{n/2} is verified easily using the fact that h(t) = 1 for 0 ≤ t ≤ 1/2. Using (5.9) with μ^* in place of |ν| and 0 in place of d, we see that

\begin{array}{l} sup_{x \in 𝕏} \int_{𝕏} | Φ_{n} (x, y) | d μ^{*} (y) \leq c . \end{array}

The estimate (5.14) follows using Lemma 5.3. The estimate (5.15) is now routine to prove. □
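The reproduction property σ_n(P) = P can be checked directly in a discrete toy setting. The sketch below assumes the circle with the trigonometric system (λ_k = |k|), functions given by equispaced samples, and the FFT as a stand-in for the coefficients (3.14); the function names are hypothetical.

```python
import numpy as np


def smooth_cutoff(t):
    """Even C-infinity filter: equals 1 for |t| <= 1/2 and 0 for |t| >= 1."""
    u = np.clip(2.0 * np.abs(np.asarray(t, dtype=float)) - 1.0, 0.0, 1.0)

    def g(s):
        return np.where(s > 0, np.exp(-1.0 / np.where(s > 0, s, 1.0)), 0.0)

    return g(1.0 - u) / (g(1.0 - u) + g(u))


def sigma_n(n, samples):
    """Discrete analogue of (3.13): multiply the k-th Fourier coefficient by h(|k|/n)."""
    M = len(samples)
    coeffs = np.fft.fft(samples)
    k = np.fft.fftfreq(M, d=1.0 / M)  # integer frequencies 0, 1, ..., -1
    return np.real(np.fft.ifft(smooth_cutoff(np.abs(k) / n) * coeffs))


M = 256
x = 2.0 * np.pi * np.arange(M) / M
P = 1.0 + np.cos(3 * x) - 2.0 * np.sin(7 * x)  # frequencies < 8, so P is in Pi_{16/2}
high = np.cos(20 * x)                          # frequency 20 > 16 is removed entirely

print(np.max(np.abs(sigma_n(16, P) - P)))
print(np.max(np.abs(sigma_n(16, high))))
```

Frequencies between n/2 and n are damped rather than cut off abruptly, which is what keeps the operator norms in (5.14) bounded independently of n.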

Proposition 5.3. For n ≥ 1, P ∈ Π_n, 1 ≤ p ≤ ∞, and S > 0, we have

\begin{array}{l} ‖ P ‖_{p, μ^{*}, 𝕏 \ 𝕂_{2 n}} \leq c (S) n^{- S} ‖ P ‖_{p, μ^{*}, 𝕏} . & (5.16) \end{array}

PROOF. In this proof, all constants will depend upon S. Using the Schwarz inequality and essential compactness, it is easy to deduce that

\begin{array}{l} sup_{x \in 𝕏 \ 𝕂_{2 n}} \int_{𝕏} | Φ_{2 n} (x, y) | d μ^{*} (y) \leq c_{1} n^{- S}, \\ sup_{y \in 𝕏} \int_{𝕏 \ 𝕂_{2 n}} | Φ_{2 n} (x, y) | d μ^{*} (x) \leq c_{1} n^{- S} . & (5.17) \end{array}

Therefore, a use of Lemma 5.3 shows that

‖ σ_{2 n} (f) ‖_{p, μ^{*}, 𝕏 \ 𝕂_{2 n}} \leq c n^{- S} ‖ f ‖_{p} .

We use P in place of f to obtain (5.16). □

Proposition 5.4. Let n ≥ 1, P ∈ Π_n, 0 < p < r ≤ ∞. Then

\begin{array}{l} ‖ P ‖_{r} \leq c n^{q (1 / p - 1 / r)} ‖ P ‖_{p}, ‖ P ‖_{p} \leq c μ^{*} {(𝕂_{2 n})}^{1 / p - 1 / r} ‖ P ‖_{r} . & (5.18) \end{array}

PROOF. The first part of (5.18) is proved in Mhaskar [15, Lemma 5.4]. In that paper, the measure μ^* is assumed to be a probability measure, but this assumption was not used in this proof. The second estimate follows easily from Proposition 5.3. □

Lemma 5.4. Let R, n > 0, P₁, P₂ ∈ Π_n, 1 ≤ p, r, s ≤ ∞. If the product assumption holds, then

\begin{array}{l} E_{A^{*} n} (ϕ_{0}; p, P_{1} P_{2}) \leq c n^{- R} ‖ P_{1} ‖_{r} ‖ P_{2} ‖_{s} . & (5.19) \end{array}

PROOF. In view of essential compactness, Proposition 5.4 implies that for any P ∈ Π_n, 1 ≤ r ≤ ∞, $‖ P ‖_{2} \leq c_{1} n^{c} ‖ P ‖_{r}$ . Therefore, using the Schwarz inequality, the Parseval identity, and Lemma 5.2, we conclude that

\begin{array}{l} \sum_{k} | \hat{P} (k) | \leq {(dim (Π_{n}))}^{1 / 2} ‖ P ‖_{2} \leq c_{1} n^{c} ‖ P ‖_{r} . & (5.20) \end{array}

Now, the product assumption implies that for p = 1, ∞, and λ_k, λ_j < n, there exists $R_{j, k, n} \in Π_{A^{*} n}$ such that for any R > 0,

\begin{array}{l} ‖ ϕ_{k} ϕ_{j} - R_{j, k, n} ϕ_{0} ‖_{p} \leq c n^{- R - 2 c}, & (5.21) \end{array}

where c is the constant appearing in (5.20). The convexity inequality

\begin{array}{l} ‖ f ‖_{p} \leq ‖ f ‖_{\infty}^{1 / p^{'}} ‖ f ‖_{1}^{1 / p} \end{array}

shows that (5.21) is valid for all p, 1 ≤ p ≤ ∞. So, using (5.20), we conclude that

\begin{array}{l} ‖ P_{1} P_{2} - \sum_{k, j} \hat{P_{1}} (k) \hat{P_{2}} (j) R_{j, k, n} ϕ_{0} ‖_{p} \leq c n^{- R - 2 c} (\sum_{k} | \hat{P_{1}} (k) |) (\sum_{j} | \hat{P_{2}} (j) |) \\ \leq c n^{- R} ‖ P_{1} ‖_{r} ‖ P_{2} ‖_{s} . \end{array}

□

6. Local Approximation by Diffusion Polynomials

In the sequel, we write g(t) = h(t) − h(2t), and

\begin{array}{l} τ_{j} (f) = \left\{ \begin{array}{ll} σ_{1} (f), & if j = 0, \\ σ_{2^{j}} (f) - σ_{2^{j - 1}} (f), & if j = 1, 2, \dots . \end{array} \right. & (6.1) \end{array}

We note that

\begin{array}{l} τ_{j} (f) (x) = σ_{2^{j}} (μ^{*}, g; f) (x) = \int_{𝕏} f (y) Φ_{2^{j}} (g; x, y) d μ^{*} (y), j = 1, 2, \dots . & (6.2) \end{array}

It is clear from Theorem 5.1 that for any p, 1 ≤ p ≤ ∞,

\begin{array}{l} f = \sum_{j = 0}^{\infty} τ_{j} (f), f \in X^{p}, & (6.3) \end{array}

with convergence in the sense of L^p.
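The decomposition (6.3) behaves like a wavelet expansion: the size of τ_j(f) on a ball reflects the smoothness of f near that ball. The following sketch illustrates this in a toy setting chosen by us (the circle, equispaced samples, FFT coefficients, and the test function |sin(x/2)|, which has a kink at x = 0 and is smooth elsewhere).

```python
import numpy as np


def smooth_cutoff(t):
    """Even C-infinity filter: equals 1 for |t| <= 1/2 and 0 for |t| >= 1."""
    u = np.clip(2.0 * np.abs(np.asarray(t, dtype=float)) - 1.0, 0.0, 1.0)

    def g(s):
        return np.where(s > 0, np.exp(-1.0 / np.where(s > 0, s, 1.0)), 0.0)

    return g(1.0 - u) / (g(1.0 - u) + g(u))


def sigma_n(n, samples):
    """Discrete analogue of (3.13) on the circle, computed with the FFT."""
    M = len(samples)
    k = np.fft.fftfreq(M, d=1.0 / M)
    return np.real(np.fft.ifft(smooth_cutoff(np.abs(k) / n) * np.fft.fft(samples)))


def tau(j, samples):
    """The frequency band tau_j(f) of (6.1)."""
    if j == 0:
        return sigma_n(1, samples)
    return sigma_n(2 ** j, samples) - sigma_n(2 ** (j - 1), samples)


M = 4096
x = 2.0 * np.pi * np.arange(M) / M
f = np.abs(np.sin(x / 2.0))                      # kink at x = 0 (mod 2 pi)
near = (x < 0.5) | (x > 2.0 * np.pi - 0.5)       # ball around the kink
far = np.abs(x - np.pi) < 0.5                    # ball where f is smooth

for j in range(3, 8):
    band = np.abs(tau(j, f))
    print(j, band[near].max(), band[far].max())
```

The local maxima near the kink decay at a slow, polynomial rate in 2^j, while those on the far ball are negligible by comparison; this is the behavior that Theorem 6.1 turns into a characterization of local smoothness.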

Theorem 6.1. Let 1 ≤ p ≤ ∞, γ > 0, f ∈ X^p, x₀ ∈ 𝕏. We assume the partition of unity and the product assumption.

(a) If 𝔹 is a ball centered at x₀, then

\begin{array}{l} sup_{n \geq 0} 2^{n γ} ‖ f - σ_{2^{n}} (f) ‖_{p, μ^{*}, 𝔹} ~ sup_{j \geq 0} 2^{j γ} ‖ τ_{j} (f) ‖_{p, μ^{*}, 𝔹} . & (6.4) \end{array}

(b) If there exists a ball 𝔹 centered at x₀ such that

\begin{array}{l} sup_{n \geq 0} 2^{n γ} ‖ f - σ_{2^{n}} (f) ‖_{p, μ^{*}, 𝔹} ~ sup_{j \geq 0} 2^{j γ} ‖ τ_{j} (f) ‖_{p, μ^{*}, 𝔹} < \infty, & (6.5) \end{array}

then f ∈ W_{γ, p, ϕ₀}(x₀).

(c) If f ∈ W_{γ, p}(x₀), then there exists a ball 𝔹 centered at x₀ such that (6.5) holds.

Remark 6.1. In the manifold case (Example 3.1), ϕ₀ ≡ 1. So, the statements (b) and (c) in Theorem 6.1 provide necessary and sufficient conditions for f ∈ W_{γ, p}(x₀) in terms of the local rate of convergence of the globally defined operator σ_n(f) and the growth of the local norms of the operators τ_j, respectively. In the Hermite case (Example 3.2), it is shown in Mhaskar [49] that f ∈ W_{γ, p, ϕ₀} if and only if f ∈ W_{γ, p}. Therefore, the statements (b) and (c) in Theorem 6.1 provide similar necessary and sufficient conditions for f ∈ W_{γ, p}(x₀) in this case as well. □

The proof of Theorem 6.1 is routine, but we sketch a proof for the sake of completeness.

PROOF OF THEOREM 6.1.

Part (a) is easy to prove using the definitions.

In the rest of this proof, we fix S > γ + q + 2. To prove part (b), let ϕ ∈ C^∞ be supported on 𝔹. Then there exists $\{R_{n} \in Π_{2^{n}}\}_{n = 0}^{\infty}$ such that

\begin{array}{l} ‖ ϕ - R_{n} ‖_{\infty} \leq c (ϕ) 2^{- n S} . & (6.6) \end{array}

Further, Lemma 5.4 yields a sequence $\{Q_{n} \in Π_{A^{*} 2^{n}}\}$ such that

\begin{array}{l} ‖ R_{n} σ_{2^{n}} (f) - ϕ_{0} Q_{n} ‖_{p} \leq c 2^{- n S} ‖ R_{n} ‖_{\infty} ‖ σ_{2^{n}} (f) ‖_{p} \leq c (ϕ) 2^{- n S} ‖ f ‖_{p} . & (6.7) \end{array}

Hence,

\begin{array}{l} E_{A^{*} 2^{n}} (ϕ_{0}; p, f ϕ) \\ \leq ‖ f ϕ - ϕ_{0} Q_{n} ‖_{p} \leq c (ϕ) 2^{- n S} ‖ f ‖_{p} + ‖ f ϕ - σ_{2^{n}} (f) R_{n} ‖_{p} \\ \leq c (ϕ) 2^{- n S} ‖ f ‖_{p} + ‖ (f - σ_{2^{n}} (f)) ϕ ‖_{p} + ‖ σ_{2^{n}} (f) (ϕ - R_{n}) ‖_{p} \\ \leq c (ϕ) \{ 2^{- n S} ‖ f ‖_{p} + ‖ f - σ_{2^{n}} (f) ‖_{p, μ^{*}, 𝔹} + ‖ σ_{2^{n}} (f) ‖_{p} ‖ ϕ - R_{n} ‖_{\infty} \} \\ \leq c (ϕ) 2^{- n S} ‖ f ‖_{p} + c (ϕ, f) {(A^{*} 2^{- n})}^{γ} . \end{array}

Thus, fϕ ∈ W_{γ, p, ϕ₀} for every ϕ ∈ C^∞ supported on 𝔹, and part (b) is proved.

To prove part (c), we observe that there exists r > 0 such that for any $ϕ \in C^{\infty} (𝔹 (x_{0}, 6 r))$ , fϕ ∈ W_{γ, p}. Using partition of unity [cf. Proposition 3.1(a)], we find $ψ \in C^{\infty} (𝔹 (x_{0}, 6 r))$ such that ψ(x) = 1 for all x ∈ 𝔹(x₀, 2r), and we let 𝔹 = 𝔹(x₀, r). In view of Proposition 3.2, $| Φ_{2^{n}} (x, y) | \leq c (r) 2^{- n (S - q)}$ for all x ∈ 𝔹 and y ∈ 𝕏\𝔹(x₀, 2r). Hence,

\begin{array}{l} ‖ σ_{2^{n}} ((1 - ψ) f) ‖_{p} \leq ‖ \int_{𝕏} | (1 - ψ (y)) f (y) Φ_{2^{n}} (\circ, y) | d μ^{*} (y) ‖_{p} \\ = ‖ \int_{𝕏 \ 𝔹 (x_{0}, 2 r)} | (1 - ψ (y)) f (y) Φ_{2^{n}} (\circ, y) | d μ^{*} (y) ‖_{p} \\ \leq c (ψ, r) 2^{- n (S - q)} ‖ f ‖_{p} . & (6.8) \end{array}

Recalling that ψ(x) = 1 for x ∈ 𝔹 and S − q ≥ γ + 2, we deduce that

\begin{array}{l} ‖ f - σ_{2^{n}} (f) ‖_{p, μ^{*}, 𝔹} = ‖ ψ f - σ_{2^{n}} (f) ‖_{p, μ^{*}, 𝔹} \\ \leq ‖ ψ f - σ_{2^{n}} (ψ f) ‖_{p, μ^{*}, 𝔹} + ‖ σ_{2^{n}} ((1 - ψ) f) ‖_{p} \\ \leq c E_{2^{n}} (ψ f) + c (ψ, r) 2^{- n (S - q)} ‖ f ‖_{p} \\ \leq c (r, ψ, f) 2^{- n γ} . \end{array}

This proves part (c). □

Let {Ψ_n: 𝕏 × 𝕏 → ℝ} be a family of kernels (not necessarily symmetric). With a slight abuse of notation, we define when possible, for any measure ν with bounded total variation on 𝕏,

\begin{array}{l} σ (ν, Ψ_{n}; f) (x) = \int_{𝕏} f (y) Ψ_{n} (x, y) d ν (y), \\ \begin{array}{l} x \in 𝕏, f \in L^{1} (𝕏) + C_{0} (𝕏), \end{array} & (6.9) \end{array}

and

\begin{array}{l} τ_{j} (ν, \{Ψ_{n}\}; f) = \left\{ \begin{array}{ll} σ (ν, Ψ_{1}; f), & if j = 0, \\ σ (ν, Ψ_{2^{j}}; f) - σ (ν, Ψ_{2^{j - 1}}; f), & if j = 1, 2, \dots . \end{array} \right. & (6.10) \end{array}

As usual, we will omit the mention of ν when ν = μ^*.

Corollary 6.1. Let the assumptions of Theorem 6.1 hold, and {Ψ_n:𝕏 × 𝕏 → ℝ} be a sequence of kernels (not necessarily symmetric) with the property that both of the following functions of n are decreasing rapidly.

\begin{array}{l} sup_{x \in 𝕏} \int_{𝕏} | Ψ_{n} (x, y) - Φ_{n} (x, y) | d μ^{*} (y), \\ \begin{array}{l} sup_{y \in 𝕏} \int_{𝕏} | Ψ_{n} (x, y) - Φ_{n} (x, y) | d μ^{*} (x) . \end{array} & (6.11) \end{array}

(a) If 𝔹 is a ball centered at x₀, then

\begin{array}{l} sup_{n \geq 0} 2^{n γ} ‖ f - σ (Ψ_{2^{n}}; f) ‖_{p, μ^{*}, 𝔹} ~ sup_{j \geq 0} 2^{j γ} ‖ τ_{j} (\{Ψ_{n}\}; f) ‖_{p, μ^{*}, 𝔹} . & (6.12) \end{array}

(b) If there exists a ball 𝔹 centered at x₀ such that

\begin{array}{l} sup_{n \geq 0} 2^{n γ} ‖ f - σ (Ψ_{2^{n}}; f) ‖_{p, μ^{*}, 𝔹} ~ sup_{j \geq 0} 2^{j γ} ‖ τ_{j} (\{Ψ_{n}\}; f) ‖_{p, μ^{*}, 𝔹} < \infty, & (6.13) \end{array}

then f ∈ W_{γ, p, ϕ₀}(x₀).

(c) If f ∈ W_{γ, p}(x₀), then there exists a ball 𝔹 centered at x₀ such that (6.13) holds.

PROOF. In view of Lemma 5.3, the assumption about the functions in (6.11) implies that ‖σ(Ψ_n; f) − σ_n(f)‖_p is decreasing rapidly. □

7. Quadrature Formula

The purpose of this section is to prove the existence of admissible quadrature measures in the general set-up as in this paper. The ideas are mostly developed already in our earlier works [17, 36, 43, 44, 50, 51] but always require an estimate on the gradient of diffusion polynomials. Here, we use the Bernstein-Lipschitz condition (Definition 3.4) instead.

If $C \subset K \subset 𝕏$ , we denote

\begin{array}{l} δ (K, C) = sup_{x \in K} inf_{y \in C} ρ (x, y), η (C) = inf_{x, y \in C, x \neq y} ρ (x, y) . & (7.1) \end{array}

If K is compact, ϵ > 0, a subset $C \subset K$ is ϵ-distinguishable if ρ(x, y) ≥ ϵ for every $x, y \in C$ , x ≠ y. The cardinality of the maximal ϵ-distinguishable subset of K will be denoted by H_ϵ(K).

Remark 7.1. If $C_{1} \subset C$ is a maximal $δ (K, C)$ -distinguishable subset of $C$ , then it is easy to deduce that

\begin{array}{l} δ (K, C) \leq η (C_{1}) \leq 2 δ (K, C), δ (K, C) \leq δ (K, C_{1}) \leq 2 δ (K, C) . \end{array}

In particular, by replacing $C$ by $C_{1}$ , we can always assume that

\begin{array}{l} (1 / 2) δ (K, C) \leq η (C) \leq 2 δ (K, C) . & (7.2) \end{array}
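The quantities in (7.1) and the thinning step of Remark 7.1 are easy to compute for finite sets. The sketch below makes simplifying assumptions of our own: K is a fine grid on [0, 1] with ρ(x, y) = |x − y|, C is a random subset, and the maximal distinguishable subset is built greedily (one of several possible choices).

```python
import numpy as np


def mesh_norm(K, C):
    """delta(K, C): largest distance from a point of K to the nearest point of C."""
    return np.max(np.min(np.abs(K[:, None] - C[None, :]), axis=1))


def separation(C):
    """eta(C): smallest distance between two distinct points of C."""
    d = np.abs(C[:, None] - C[None, :])
    d[np.diag_indices_from(d)] = np.inf  # ignore the zero distances x = y
    return np.min(d)


def greedy_distinguishable(C, eps):
    """Greedy maximal eps-distinguishable subset: keep a point if it is eps-far
    from everything kept so far. Every discarded point is then within eps of a
    kept point, so no further point of C can be added."""
    chosen = []
    for c in C:
        if all(abs(c - z) >= eps for z in chosen):
            chosen.append(c)
    return np.array(chosen)


rng = np.random.default_rng(0)
K = np.linspace(0.0, 1.0, 2001)
C = np.sort(rng.uniform(0.0, 1.0, size=200))

delta = mesh_norm(K, C)
C1 = greedy_distinguishable(C, delta)
print(len(C), len(C1), delta, separation(C), separation(C1), mesh_norm(K, C1))
```

After thinning, the separation of C₁ is at least δ(K, C) while its mesh norm is at most 2δ(K, C), so the two quantities become comparable as in (7.2).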

Theorem 7.1. We assume the Bernstein-Lipschitz condition. Let n > 0, $C_{1} = \{z_{1}, \dots, z_{M}\} \subset 𝕂_{2 n}$ be a finite subset, ϵ > 0.

(a) There exists a constant c(ϵ) with the following property: if $δ (𝕂_{2 n}, C_{1}) \leq c (ϵ) min (1 / n, 1 / B_{2 n})$ , then there exist non-negative numbers W_k satisfying

\begin{array}{l} 0 \leq W_{k} \leq c δ {(𝕂_{2 n}, C_{1})}^{q}, \sum_{k = 1}^{M} W_{k} \leq c μ^{*} (𝔹 (𝕂_{2 n}, 4 δ (𝕂_{2 n}, C_{1}))), & (7.3) \end{array}

such that for every P ∈ Π_n,

\begin{array}{l} | \sum_{k = 1}^{M} W_{k} | P (z_{k}) | - \int_{𝕏} | P (x) | d μ^{*} (x) | \leq ϵ \int_{𝕏} | P (x) | d μ^{*} (x) . & (7.4) \end{array}

(b) Let the assumptions of part (a) be satisfied with ϵ = 1/2. There exist real numbers w₁, ⋯, w_M such that |w_k| ≤ 2W_k, k = 1, ⋯, M, in particular,

\begin{array}{l} \sum_{k = 1}^{M} | w_{k} | \leq c μ^{*} (𝔹 (𝕂_{2 n}, 4 δ (𝕂_{2 n}, C_{1}))), & (7.5) \end{array}

and

\begin{array}{l} \sum_{k = 1}^{M} w_{k} P (z_{k}) = \int_{𝕏} P (x) d μ^{*} (x), P \in Π_{n} . & (7.6) \end{array}

(c) Let δ > 0, $C_{1}$ be a random sample from the probability law $μ_{𝕂_{2 n}}^{*}$ given by

\begin{array}{l} μ_{𝕂_{2 n}}^{*} (B) = \frac{μ^{*} (B \cap 𝕂_{2 n})}{μ^{*} (𝕂_{2 n})}, \end{array}

and ϵ_n = min(1/n, 1/B_{2n}). If

\begin{array}{l} | C_{1} | \geq c ϵ_{n}^{- q} μ^{*} (𝕂_{2 n}) log (\frac{μ^{*} (𝔹 (𝕂_{2 n}, ϵ_{n}))}{δ ϵ_{n}^{q}}), \end{array}

then the statements (a) and (b) hold with $μ_{𝕂_{2 n}}^{*}$ -probability exceeding 1−δ.
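Theorem 7.1(b) is an existence statement. For a numerical illustration of the exactness condition (7.6), one can compute weights on scattered points by solving the underdetermined linear system directly. The sketch below does this on the circle with the trigonometric basis of Π_n; note that it swaps in a minimum-norm least-squares solution (`numpy.linalg.lstsq`) for the Hahn-Banach argument of the paper, so it does not certify the bound |w_k| ≤ 2W_k. All names are hypothetical.

```python
import numpy as np


def basis(n, x):
    """Real trigonometric basis of Pi_n (frequencies < n); rows are basis functions."""
    rows = [np.ones_like(x)]
    for k in range(1, n):
        rows += [np.cos(k * x), np.sin(k * x)]
    return np.array(rows)


def quadrature_weights(n, z):
    """Weights w with sum_k w_k P(z_k) = int P dx for all P in Pi_n.

    The system has 2n - 1 equations and len(z) unknowns; lstsq returns the
    solution of smallest Euclidean norm when the system is underdetermined.
    """
    A = basis(n, z)
    rhs = np.zeros(A.shape[0])
    rhs[0] = 2.0 * np.pi  # only the constant function has a non-zero integral
    w, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return w


rng = np.random.default_rng(1)
z = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=60))  # scattered nodes z_1, ..., z_M
w = quadrature_weights(5, z)


def P(x):
    """A test polynomial in Pi_5 with integral 4*pi over the circle."""
    return 2.0 + np.cos(3 * x) - np.sin(4 * x)


print(abs(w @ P(z) - 4.0 * np.pi), np.sum(np.abs(w)))
```

The second printed number is the total variation Σ|w_k| of the resulting discrete measure, the quantity bounded in (7.5).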

In order to prove Theorem 7.1, we first recall the following theorem [52, Theorem 5.1], applied to our context. The statement of Mhaskar [52, Theorem 5.1] seems to require that μ^* is a probability measure, but this fact is not required in the proof. It is required only that μ^*(𝔹(x, r)) ≥ cr^q for 0 < r ≤ 1.

Theorem 7.2. Let τ be a positive measure supported on a compact subset of 𝕏, ϵ > 0, $A$ be a maximal ϵ-distinguishable subset of supp(τ), and $𝕂 = 𝔹 (A, 2 ϵ)$ . There then exists a subset $C \subseteq A \subseteq supp (τ)$ and a partition $\{Y_{y}\}_{y \in C}$ of 𝕂 with each of the following properties.

1. (volume property) For $y \in C$ , Y_y ⊆ 𝔹(y, 18ϵ), $(κ_{1} / κ_{2}) 7^{- q} ϵ^{q} \leq μ^{*} (Y_{y}) \leq κ_{2} {(18 ϵ)}^{q}$ , and $τ (Y_{y}) \geq (κ_{1} / κ_{2}) 19^{- q} min_{y \in A} τ (𝔹 (y, ϵ)) > 0$ .

2. (density property) $η (C) \geq ϵ$ , $δ (𝕂, C) \leq 18 ϵ$ .

3. (intersection property) Let K₁ ⊆ 𝕂 be a compact subset. Then

\begin{array}{l} | \{y \in C : Y_{y} \cap K_{1} \neq \emptyset\} | \leq (κ_{2}^{2} / κ_{1}) {(133)}^{q} H_{ϵ} (K_{1}) . \end{array}

PROOF OF THEOREM 7.1 (a), (b).

We observe first that it is enough to prove this theorem for sufficiently large values of n. In view of Proposition 5.3, we may choose n large enough so that for any P ∈ Π_n,

\begin{array}{l} ‖ P ‖_{1, μ^{*}, 𝕏 \ 𝕂_{2 n}} \leq n^{- S} ‖ P ‖_{1} \leq (ϵ / 3) ‖ P ‖_{1} . & (7.7) \end{array}

In this proof, we will write $δ = δ (𝕂_{2 n}, C_{1})$ so that $𝕂_{2 n} \subset 𝔹 (C_{1}, δ)$ . We use Theorem 7.2 with τ being the measure associating the mass 1 with each element of $C_{1}$ , and δ in place of ϵ. If $A$ is a maximal δ-distinguishable subset of $C_{1}$ , then we denote in this proof, $𝕂 = 𝔹 (A, 2 δ)$ and observe that $𝕂_{2 n} \subset 𝔹 (C_{1}, δ) \subset 𝕂 \subset 𝔹 (𝕂_{2 n}, 4 δ)$ . We obtain a partition {Y_y} of 𝕂 as in Theorem 7.2. The volume property implies that each Y_y contains at least one element of $C_{1}$ . We construct a subset $C$ of $C_{1}$ by choosing exactly one element of $Y_{y} \cap C_{1}$ for each y. We may then re-index $C_{1}$ so that, without loss of generality, $C = \{z_{1}, \dots, z_{N}\}$ for some N ≤ M, and re-index {Y_y} as {Y_k}, so that z_k ∈ Y_k, k = 1, ⋯, N. To summarize, we have a subset $\{z_{1}, \dots, z_{N}\} \subseteq C_{1}$ , and a partition $\{Y_{k}\}_{k = 1}^{N}$ of 𝕂 ⊃ 𝕂_{2n} such that each Y_k ⊂ 𝔹(z_k, 36δ) and $μ^{*} (Y_{k}) ~ δ^{q}$ . In particular (cf. (7.7)), for any P ∈ Π_n,

\begin{array}{l} ‖ P ‖_{1} - ‖ P ‖_{1, μ^{*}, 𝕂} \leq (ϵ / 3) ‖ P ‖_{1} . & (7.8) \end{array}

We now let $W_{k} = μ^{*} (Y_{k})$ , k = 1, ⋯, N, and W_k = 0, k = N + 1, ⋯, M.

The next step is to prove that if δ ≤ c(ϵ) min(1/n, 1/B_{2n}), then

\begin{array}{l} sup_{y \in 𝕏} \sum_{k = 1}^{N} \int_{Y_{k}} | Φ_{2 n} (z_{k}, y) - Φ_{2 n} (x, y) | d μ^{*} (x) \leq 2 ϵ / 3 . & (7.9) \end{array}

In this part of the proof, the constants denoted by c₁, c₂, ⋯ will retain their value until (7.9) is proved. Let y ∈ 𝕏. We let r ≥ δ to be chosen later, and write in this proof, $N = \{k : dist (y, Y_{k}) < r\}$ , $L = \{k : dist (y, Y_{k}) \geq r\}$ and for j = 0, 1, ⋯, $L_{j} = \{k : 2^{j} r \leq dist (y, Y_{k}) < 2^{j + 1} r\}$ . Since r ≥ δ, and each Y_k ⊂ 𝔹(z_k, 36δ), there are at most $c_{1} {(r / δ)}^{q}$ elements in $N$ . Using the Bernstein-Lipschitz condition and the fact that $‖ Φ_{2 n} (\circ, y) ‖_{\infty} \leq c_{2} n^{q}$ , we deduce that

\begin{array}{l} \sum_{k \in N} \int_{Y_{k}} | Φ_{2 n} (z_{k}, y) - Φ_{2 n} (x, y) | d μ^{*} (x) \leq c_{3} μ^{*} (Y_{k}) n^{q} B_{2 n} δ {(r / δ)}^{q} \\ \leq c_{3} μ^{*} (𝔹 (z_{k}, 36 δ)) n^{q} B_{2 n} δ {(r / δ)}^{q} \leq c_{4} {(n r)}^{q} B_{2 n} δ . & (7.10) \end{array}

Next, since $μ^{*} (Y_{k}) \sim δ^{q}$, we see that the number of elements in each $L_{j}$ is $\sim {(2^{j} r / δ)}^{q}$. Using Proposition 3.2 and the fact that S > q, we deduce that if r ≥ 1/n, then

\begin{array}{l} \sum_{k \in L} \int_{Y_{k}} | Φ_{2 n} (z_{k}, y) - Φ_{2 n} (x, y) | d μ^{*} (x) \\ = \sum_{j = 0}^{\infty} \sum_{k \in L_{j}} \int_{Y_{k}} | Φ_{2 n} (z_{k}, y) - Φ_{2 n} (x, y) | d μ^{*} (x) \\ \leq c_{5} n^{q} {(n r)}^{- S} \sum_{j = 0}^{\infty} 2^{- j S} {\sum_{k \in L_{j}} μ^{*} (Y_{k})} \\ \leq c_{6} {(n r)}^{q - S} . & (7.11) \end{array}

Since S > q, we may choose $r \sim_{ϵ} 1 / n$ such that $c_{6} {(n r)}^{q - S} \leq ϵ / 3$, and we then require δ ≤ min(r, c₇(ϵ)/B_2n) so that, in (7.10), $c_{4} {(n r)}^{q} B_{2 n} δ \leq ϵ / 3$. Then (7.10) and (7.11) lead to (7.9). The proof of (7.9) being completed, we resume the constant convention as usual.

Next, we observe that for any P ∈ Π_n,

\begin{array}{l} P (x) = \int_{𝕏} P (y) Φ_{2 n} (x, y) d μ^{*} (y), x \in 𝕏 . \end{array}

We therefore conclude, using (7.9), that

\begin{array}{l} | \sum_{k = 1}^{N} μ^{*} (Y_{k}) | P (z_{k}) | - \int_{K} | P (x) | d μ^{*} (x) | \\ = | \sum_{k = 1}^{N} \int_{Y_{k}} (| P (z_{k}) | - | P (x) |) d μ^{*} (x) | \leq \sum_{k = 1}^{N} \int_{Y_{k}} | P (z_{k}) \\ - P (x) | d μ^{*} (x) \leq \sum_{k = 1}^{N} \int_{Y_{k}} | \int_{𝕏} P (y) {Φ_{2 n} (z_{k}, y) \\ - Φ_{2 n} (x, y)} d μ^{*} (y) | d μ^{*} (x) \\ \leq \int_{𝕏} | P (y) | {\sum_{k = 1}^{N} \int_{Y_{k}} | Φ_{2 n} (z_{k}, y) - Φ_{2 n} (x, y) | d μ^{*} (x)} d μ^{*} (y) \\ \leq (2 ϵ / 3) \int_{𝕏} | P (y) | d μ^{*} (y) . \end{array}

Together with (7.8), this leads to (7.4). From the definition of $W_{k} = μ^{*} (Y_{k})$ , k = 1, ⋯, N, $W_{k} \leq c δ^{q}$ , and $\sum_{k = 1}^{N} W_{k} = μ^{*} (𝕂) = μ^{*} (𝔹 (𝕂_{2 n}, 4 δ))$ . Since W_k = 0 if k ≥ N + 1, we have now proven (7.3), and we have thus completed the proof of part (a).

Having proved part (a), the proof of part (b) is by now a routine application of the Hahn-Banach theorem [cf. [17, 44, 50, 51]]. We apply part (a) with ϵ = 1/2. Continuing the notation in the proof of part (a), we then have

\begin{array}{l} (1 / 2) ‖ P ‖_{1} \leq \sum_{k = 1}^{N} W_{k} | P (z_{k}) | \leq (3 / 2) ‖ P ‖_{1}, P \in Π_{n} . & (7.12) \end{array}

We now equip ℝ^N with the norm $‖ | (a_{1}, \dots, a_{N}) ‖ | = \sum_{k = 1}^{N} W_{k} | a_{k} |$ and consider the sampling operator $S : Π_{n} \to ℝ^{N}$ given by $S (P) = (P (z_{1}), \dots, P (z_{N}))$. Let V be the range of this operator, and define a linear functional x^* on V by $x^{*} (S (P)) = \int_{𝕏} P d μ^{*}$. The estimate (7.12) shows that the norm of this functional is ≤ 2. The Hahn-Banach theorem yields a norm-preserving extension X^* of x^* to ℝ^N, which, in turn, can be identified with a vector $(w_{1}, \dots, w_{N}) \in ℝ^{N}$. We set w_k = 0 if k ≥ N + 1. Formula (7.6) then expresses the fact that X^* is an extension of x^*. The preservation of norms shows that |w_k| ≤ 2W_k if k = 1, ⋯, N, and it is clear that for k = N + 1, ⋯, M, |w_k| = 0 = W_k. This completes the proof of part (b). □
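The step from "the functional has norm ≤ 2" to "|w_k| ≤ 2W_k" rests on the fact that the dual of the weighted ℓ¹ norm $\sum_{k} W_{k} | a_{k} |$ is the weighted maximum norm $\max_{k} | w_{k} | / W_{k}$. The following minimal Python sketch illustrates this on a toy example; the function names and the sample vectors are hypothetical and are not part of the paper.

```python
def weighted_l1_norm(a, W):
    """Norm |||a||| = sum_k W_k |a_k| on R^N (all W_k assumed > 0)."""
    return sum(Wk * abs(ak) for ak, Wk in zip(a, W))


def pairing(w, a):
    """Value of the linear functional a -> sum_k w_k a_k."""
    return sum(wk * ak for wk, ak in zip(w, a))


def dual_norm(w, W):
    """Operator norm of a -> sum_k w_k a_k with respect to |||.|||."""
    return max(abs(wk) / Wk for wk, Wk in zip(w, W))


# Toy data (illustrative only): weights W_k and a functional w.
W = [0.5, 0.25, 0.25]
w = [0.6, -0.5, 0.1]

# The functional is bounded by dual_norm(w, W) * |||a||| for every a ...
a = [1.0, -2.0, 3.0]
assert abs(pairing(w, a)) <= dual_norm(w, W) * weighted_l1_norm(a, W)

# ... and the bound is attained at a scaled coordinate vector.
extremal = [0.0, -1.0 / W[1], 0.0]
assert weighted_l1_norm(extremal, W) == 1.0
assert pairing(w, extremal) == dual_norm(w, W)
```

In particular, a functional of norm at most 2 must satisfy |w_k| ≤ 2W_k for every k with W_k > 0, which is how the bound on the quadrature weights is read off above.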

Part (c) of Theorem 7.1 follows immediately from the first two parts and the following lemma.

Lemma 7.1. Let ν^* be a probability measure on 𝕏, 𝕂 ⊂ supp(ν^*) be a compact set. Let ϵ, δ ∈ (0, 1], $C$ be a maximal ϵ/2-distinguished subset of K, and $ν_{ϵ} = {min}_{x \in C} ν^{*} (𝔹 (x, ϵ / 2))$ . If

\begin{array}{l} M \geq c ν_{ϵ}^{- 1} log (c_{1} μ^{*} (𝔹 (K, ϵ)) / (δ ϵ^{q})), \end{array}

and {z₁, ⋯, z_M} are random samples from the probability law ν^*, then

\begin{array}{l} {Prob}_{ν^{*}} ({δ (K, {z_{1}, \dots, z_{M}}) > ϵ}) \leq δ . & (7.13) \end{array}

PROOF. If δ(K, {z₁, ⋯, z_M}) > ϵ, then there exists at least one $x \in C$ such that 𝔹(x, ϵ/2)∩{z₁, ⋯, z_M} = ∅. For every $x \in C$, $p_{x} = ν^{*} (𝔹 (x, ϵ / 2)) \geq ν_{ϵ}$. We consider the random variable Z_j to be equal to 1 if z_j ∈ 𝔹(x, ϵ/2) and 0 otherwise. Using (B.2) with t = 1, we see that

\begin{array}{l} Prob (𝔹 (x, ϵ / 2) \cap \{z_{1}, \dots, z_{M}\} = \emptyset) \leq exp (- M p_{x} / 2) \leq exp (- c M ν_{ϵ}) . \end{array}

Since $| C | \leq c_{1} μ^{*} (𝔹 (K, ϵ)) / ϵ^{q}$ ,

\begin{array}{l} Prob ({δ (K, {z_{1}, \dots, z_{M}}) > ϵ}) \leq c_{1} \frac{μ^{*} (𝔹 (K, ϵ))}{ϵ^{q}} exp (- c M ν_{ϵ}) . \end{array}

We set the right-hand side above to δ and solve for M to prove the lemma. □
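The final step of the proof of Lemma 7.1 can be made concrete as follows. This is a minimal sketch, assuming the explicit constant 1/2 from the Chernoff bound (B.2) with t = 1 in place of the generic constant c, and taking the cardinality of the distinguished set as an input `num_balls` rather than bounding it by $c_{1} μ^{*} (𝔹 (K, ϵ)) / ϵ^{q}$. The function names are hypothetical.

```python
import math


def failure_bound(M: int, nu_eps: float, num_balls: int) -> float:
    """Union bound num_balls * exp(-M * nu_eps / 2) on the probability
    that at least one of the balls contains none of the M samples."""
    return num_balls * math.exp(-M * nu_eps / 2.0)


def samples_for_covering(nu_eps: float, num_balls: int, delta: float) -> int:
    """Smallest integer M >= 1 with failure_bound(M, ...) <= delta.

    nu_eps    : lower bound on the probability of each ball (0 < nu_eps <= 1)
    num_balls : number of balls to be hit, i.e., the size of the
                distinguished set (assumed known here)
    delta     : allowed failure probability (0 < delta <= 1)
    """
    if not (0.0 < nu_eps <= 1.0 and 0.0 < delta <= 1.0 and num_balls >= 1):
        raise ValueError("need 0 < nu_eps <= 1, 0 < delta <= 1, num_balls >= 1")
    return max(1, math.ceil(2.0 * math.log(num_balls / delta) / nu_eps))


# Illustrative numbers only: 500 balls, each of probability at least 0.01.
M = samples_for_covering(nu_eps=0.01, num_balls=500, delta=0.05)
print(M, failure_bound(M, 0.01, 500))
```

The sample size grows like $ν_{ϵ}^{- 1}$ times a logarithm, which is the form of the condition on M stated in the lemma.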

8. Proofs of the Results in Section 4

We assume the set-up as in section 4. Our first goal is to prove the following theorem.

Theorem 8.1. Let τ, ν^*, $F$, f be as described in section 4. We assume the Bernstein-Lipschitz condition. Let 0 < δ < 1. We assume further that $| F (y, ϵ) | \leq 1$ for all y ∈ 𝕏, ϵ ∈ Ω. There exist constants c₁, c₂, such that if $M \geq c_{1} n^{q} ‖ | ν^{*} ‖ |_{R, 0} log (c n B_{n} / δ)$, and {(y₁, ϵ₁), ⋯, (y_M, ϵ_M)} is a random sample from τ, then

\begin{array}{l} {Prob}_{ν^{*}} ({{‖ \frac{1}{M} \sum_{j = 1}^{M} F (y_{j}, ϵ_{j}) Φ_{n} (○, y_{j}) - σ_{n} (ν^{*}; f) ‖}_{\infty} \\ \geq c_{3} \sqrt{\frac{n^{q} {‖ | ν^{*} | ‖}_{R, 0} log (c n B_{n} {‖ | ν^{*} | ‖}_{R, 0} / δ)}{M}}}) \leq \frac{δ}{{‖ | ν^{*} | ‖}_{R, 0}} . & (8.1) \end{array}

In order to prove this theorem, we record an observation. The following lemma is an immediate corollary of the Bernstein-Lipschitz condition and Proposition 5.3.

Lemma 8.1. Let the Bernstein-Lipschitz condition be satisfied. Then for every n > 0 and ϵ > 0, there exists a finite set $C_{n, ϵ} \subset 𝕂_{2 n}$ such that $| C_{n, ϵ} | \leq c B_{n}^{q} ϵ^{- q} μ^{*} (𝔹 (𝕂_{2 n}, ϵ))$ and for any P ∈ Π_n,

\begin{array}{l} | max_{x \in C_{n, ϵ}} | P (x) | - ‖ P ‖_{\infty} | \leq ϵ ‖ P ‖_{\infty} . & (8.2) \end{array}

PROOF OF THEOREM 8.1.

Let x ∈ 𝕏. We consider the random variables

\begin{array}{l} Z_{j} = F (y_{j}, ϵ_{j}) Φ_{n} (x, y_{j}), j = 1, \dots, M . \end{array}

Then in view of (4.2), $𝔼_{τ} (Z_{j}) = σ_{n} (ν^{*}; f) (x)$ for every j. Further, Proposition 3.2 shows that for each j, $| Z_{j} | \leq c n^{q}$ . Using (5.10) with ν^* in place of ν, N = n, d = 0, we see that for each j,

\begin{array}{l} \int_{𝕏 \times Ω} | Z_{j} |^{2} d τ \leq \int_{𝕏} | Φ_{n} (x, y) |^{2} d ν^{*} (y) \leq c n^{q} ‖ | ν^{*} ‖ |_{R, 0} . \end{array}

Therefore, the Bernstein concentration inequality (B.1) implies that for any t ∈ (0, 1),

\begin{array}{l} Prob ({| \frac{1}{M} \sum_{j = 1}^{M} F (y_{j}, ϵ_{j}) Φ_{n} (x, y_{j}) - σ_{n} (ν^{*}; f) (x) | \geq t / 2}) \\ \begin{array}{l} \leq 2 exp (- c \frac{t^{2} M}{n^{q} ‖ | ν^{*} ‖ |_{R, 0}}); \end{array} & (8.3) \end{array}

We now note that Z_j, $σ_{n} (ν^{*}; f)$ are all in Π_n. Taking a finite set $C_{n, 1 / 2}$ as in Lemma 8.1, so that $| C_{n, 1 / 2} | \leq c B_{n}^{q} μ^{*} (𝔹 (𝕂_{2 n}, 1 / 2)) \leq c_{1} n^{c} B_{n}^{q}$ , we deduce that

\begin{array}{l} max_{x \in C_{n, 1 / 2}} | \frac{1}{M} \sum_{j = 1}^{M} F (y_{j}, ϵ_{j}) Φ_{n} (x, y_{j}) - σ_{n} (ν^{*}; f) (x) | \\ \geq (1 / 2) ‖ \frac{1}{M} \sum_{j = 1}^{M} F (y_{j}, ϵ_{j}) Φ_{n} (○, y_{j}) - σ_{n} (ν^{*}; f) ‖_{\infty} . \end{array}

Then (8.3) leads to

\begin{array}{l} Prob ({‖ \frac{1}{M} \sum_{j = 1}^{M} F (y_{j}, ϵ_{j}) Φ_{n} (○, y_{j}) - σ_{n} (ν^{*}; f) ‖_{\infty} \geq t}) \\ \begin{array}{l} \leq c_{1} B_{n}^{q} n^{c} exp (- c_{2} \frac{t^{2} M}{n^{q} | ‖ ν^{*} ‖ |_{R, 0}}) . \end{array} & (8.4) \end{array}

We set the right-hand side above equal to $δ / | ‖ ν^{*} ‖ |_{R, 0}$ and solve for t to obtain (8.1) (with different values of c, c₁, c₂). □
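The step "solve for t" can be written out as follows. This is only a sketch: c₁, c₂ are the constants of (8.4), and the last line assumes that n, B_n, and $‖ | ν^{*} | ‖_{R, 0} / δ$ are bounded from below by a positive constant, so that the logarithm of a product of their powers is at most a constant multiple of the logarithm appearing in (8.1).

```latex
\begin{aligned}
c_{1} B_{n}^{q} n^{c} \exp\left(- c_{2} \frac{t^{2} M}{n^{q} ‖|ν^{*}|‖_{R, 0}}\right)
  = \frac{δ}{‖|ν^{*}|‖_{R, 0}}
&\iff
t^{2} = \frac{n^{q} ‖|ν^{*}|‖_{R, 0}}{c_{2} M}
  \log\left(\frac{c_{1} B_{n}^{q} n^{c} ‖|ν^{*}|‖_{R, 0}}{δ}\right) \\
&\implies
t \leq c_{3} \sqrt{\frac{n^{q} ‖|ν^{*}|‖_{R, 0}
  \log\left(c\, n B_{n} ‖|ν^{*}|‖_{R, 0} / δ\right)}{M}} .
\end{aligned}
```

Since (8.3) was stated only for t ∈ (0, 1), the value of t obtained in this way has to be less than 1; this appears to be the role of the lower bound on M in the hypothesis of the theorem.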

Before starting to prove results regarding eignets, we first record the continuity and smoothness of a “smooth kernel” G as defined in Definition 3.10.

Proposition 8.1. If G is a smooth kernel, then (x, y) ↦ W(y)G(x, y) is in $C_{0} (𝕏 \times 𝕏) \cap L^{1} (μ^{*} \times μ^{*}; 𝕏 \times 𝕏)$ . Further, for any p, 1 ≤ p ≤ ∞, and Λ ≥ 1,

\begin{array}{l} sup_{x \in 𝕏} ‖ W (○) G (x, ○) - \sum_{k : λ_{k} < Λ} b (λ_{k}) ϕ_{k} (x) ϕ_{k} (○) ‖_{p} \leq c_{1} Λ^{c} b (Λ) . & (8.5) \end{array}

In particular, for every x, y ∈ 𝕏, W(○)G(x, ○) and W(y)G(○, y) are in C^∞.

PROOF. Let b be the smooth mask corresponding to G. For any S ≥ 1, $b (n) \leq c n^{- S} b (n / B^{*}) \leq c n^{- S} b (0)$. Thus, b itself is decreasing rapidly. Next, let r > 0. Then remembering that B^* ≥ 1 and b is non-increasing, we obtain that for S > 0, $b (B^{*} Λ u) \leq c {(Λ u)}^{- S - r - 1} b (Λ u)$, and

\begin{array}{l} \begin{array}{l} \int_{Λ}^{\infty} t^{r} b (t) d t = {(B^{*} Λ)}^{r + 1} \int_{1 / B^{*}}^{\infty} u^{r} b (B^{*} Λ u) d u \\ \leq c Λ^{- S} \int_{1 / B^{*}}^{\infty} u^{- S - 1} b (Λ u) d u \\ \leq c Λ^{- S} \int_{1}^{\infty} u^{- S - 1} b (Λ u) d u \leq c Λ^{- S} b (Λ) . \end{array} & (8.6) \end{array}

In this proof, let $s (t) = \sum_{k : λ_{k} < t} ϕ_{k} {(x)}^{2}$ , so that s(t) ≤ ct^q, t ≥ 1. If Λ ≥ 1, then, integrating by parts, we deduce (remembering that b is non-increasing) that for any x ∈ 𝕏,

\begin{array}{l} \begin{array}{l} \sum_{k : λ_{k} \geq Λ} b (λ_{k}) ϕ_{k} {(x)}^{2} \\ = \int_{Λ}^{\infty} b (t) d s (t) = - b (Λ) s (Λ) - \int_{Λ}^{\infty} s (t) d b (t) \\ \leq c_{1} {Λ^{q} b (Λ) - \int_{Λ}^{\infty} t^{q} d b (t)} \leq c_{2} {Λ^{q} b (Λ) \\ + \int_{Λ}^{\infty} t^{q - 1} b (t) d t} \leq c_{3} Λ^{q} b (Λ) . \end{array} & (8.7) \end{array}

Using Schwarz inequality, we conclude that

\begin{array}{l} sup_{x, y \in 𝕏} \sum_{k : λ_{k} \geq Λ} b (λ_{k}) | ϕ_{k} (x) ϕ_{k} (y) | \leq c_{3} Λ^{q} b (Λ) . & (8.8) \end{array}

In particular, since b is fast decreasing, W(○)G(x, ○) ∈ C₀(𝕏) (and in fact, W(y)G(x, y) ∈ C₀(𝕏 × 𝕏)) and (8.5) holds with p = ∞. Next, for any j ≥ 0, essential compactness implies that

\begin{array}{l} \int_{𝕏 \setminus 𝕂_{2^{j + 1} Λ}} {(\sum_{k : λ_{k} \in [2^{j} Λ, 2^{j + 1} Λ)} b (λ_{k}) ϕ_{k} {(y)}^{2})}^{1 / 2} d μ^{*} (y) \leq c Λ^{- S - q} b {(2^{j} Λ)}^{1 / 2} . \end{array}

So, there exists r ≥ q such that

\begin{array}{l} \int_{𝕏} {(\sum_{k : λ_{k} \in [2^{j} Λ, 2^{j + 1} Λ)} b (λ_{k}) ϕ_{k} {(y)}^{2})}^{1 / 2} d μ^{*} (y) \\ \leq \int_{𝕂_{2^{j + 1} Λ}} {(\sum_{k : λ_{k} \in [2^{j} Λ, 2^{j + 1} Λ)} b (λ_{k}) ϕ_{k} {(y)}^{2})}^{1 / 2} d μ^{*} (y) + c Λ^{- S - q} b {(2^{j} Λ)}^{1 / 2} \\ \leq c {({(2^{j} Λ)}^{q} b (2^{j} Λ))}^{1 / 2} μ^{*} (𝕂_{2^{j + 1} Λ}) \leq c {({(2^{j} Λ)}^{r} b (2^{j} Λ))}^{1 / 2} . \end{array}

Hence, for any x ∈ 𝕏,

\begin{array}{l} \int_{𝕏} \sum_{k : λ_{k} \geq Λ} b (λ_{k}) | ϕ_{k} (x) ϕ_{k} (y) | d μ^{*} (y) \\ = \sum_{j = 0}^{\infty} \int_{𝕏} \sum_{k : λ_{k} \in [2^{j} Λ, 2^{j + 1} Λ)} b (λ_{k}) | ϕ_{k} (x) ϕ_{k} (y) | d μ^{*} (y) \\ \leq \sum_{j = 0}^{\infty} {\sum_{k : λ_{k} \in [2^{j} Λ, 2^{j + 1} Λ)} b (λ_{k}) ϕ_{k} {(x)}^{2}}^{1 / 2} \\ \int_{𝕏} {(\sum_{k : λ_{k} \in [2^{j} Λ, 2^{j + 1} Λ)} b (λ_{k}) ϕ_{k} {(y)}^{2})}^{1 / 2} d μ^{*} (y) \\ \leq c \sum_{j = 0}^{\infty} {(2^{j} Λ)}^{r} b (2^{j} Λ) \leq c \sum_{j = 0}^{\infty} \int_{2^{j - 1} Λ}^{2^{j} Λ} t^{r - 1} b (t) d t \\ = c \int_{Λ / 2}^{\infty} t^{r - 1} b (t) d t \leq c Λ^{- S} b (Λ) . & (8.9) \end{array}

This shows that

\begin{array}{l} sup_{x \in 𝕏} ‖ \sum_{k : λ_{k} \geq Λ} b (λ_{k}) | ϕ_{k} (x) ϕ_{k} (○) | ‖_{1} \leq c Λ^{- S} b (Λ) . & (8.10) \end{array}

In view of the convexity inequality,

‖ f ‖_{p} \leq ‖ f ‖_{\infty}^{1 - 1 / p} ‖ f ‖_{1}^{1 / p}, 1 < p < \infty,

(8.8) and (8.10) lead to

sup_{x \in 𝕏} ‖ \sum_{k : λ_{k} \geq Λ} b (λ_{k}) | ϕ_{k} (x) ϕ_{k} (○) | ‖_{p} \leq c_{1} Λ^{c} b (Λ), 1 \leq p \leq \infty .

In turn, this implies that WG(x, ○) ∈ L^p for all x ∈ 𝕏, and (8.5) holds. □
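The convexity inequality used at the end of the proof above can be checked numerically on a finite measure space. The following minimal Python sketch does this for a function given by finitely many values with point masses; the helper names and the test data are hypothetical and serve only as an illustration.

```python
import math


def lp_norm(values, weights, p):
    """L^p norm of a function on a finite set carrying point masses `weights`.

    For p = infinity, the essential supremum ignores points of zero mass.
    """
    if math.isinf(p):
        return max(abs(v) for v, w in zip(values, weights) if w > 0)
    return sum(w * abs(v) ** p for v, w in zip(values, weights)) ** (1.0 / p)


def convexity_bound(values, weights, p):
    """Right-hand side ||f||_inf^(1 - 1/p) * ||f||_1^(1/p)."""
    sup_norm = lp_norm(values, weights, math.inf)
    l1_norm = lp_norm(values, weights, 1.0)
    return sup_norm ** (1.0 - 1.0 / p) * l1_norm ** (1.0 / p)


# Illustrative data: four points with unequal masses.
values = [3.0, -1.0, 0.5, 2.0]
weights = [0.2, 1.0, 0.7, 0.1]
for p in (1.5, 2.0, 4.0, 10.0):
    print(p, lp_norm(values, weights, p), convexity_bound(values, weights, p))
```

The inequality follows from $| f |^{p} \leq ‖ f ‖_{\infty}^{p - 1} | f |$, and it becomes an equality when |f| is constant on the support of the measure.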

A fundamental fact that relates the kernels Φ_n and the pre-fabricated eignets 𝔾_n's is the following theorem.

Theorem 8.2. Let G be a smooth kernel and {ν_n} be an admissible product quadrature measure sequence. Then, for 1 ≤ p ≤ ∞,

{sup_{x \in 𝕏} ‖ 𝔾_{n} (ν_{B^{*} n}; x, ○) - Φ_{n} (x, ○) ‖_{p}}

is fast decreasing. In particular, for every S > 0

\begin{array}{l} | 𝔾_{n} (ν_{B^{*} n}; x, y) | \leq c (S) \left\{\frac{n^{q}}{max (1, {(n ρ (x, y))}^{S})} + n^{- 2 S}\right\} . & (8.11) \end{array}

PROOF. Let x ∈ 𝕏. In this proof, we define P_n = P_{n, x} by $P_{n} (z) = \sum_{k : λ_{k} < B^{*} n} b (λ_{k}) ϕ_{k} (x) ϕ_{k} (z)$ , z ∈ 𝕏, and note that $P_{n} \in Π_{B^{*} n}$ . In view of Proposition 8.1, the expansion in (3.18) converges in $C_{0} (𝕏 \times 𝕏) \cap L^{1} (μ^{*} \times μ^{*}; 𝕏 \times 𝕏)$ , so that term-by-term integration can be made to deduce that for y ∈ 𝕏,

\begin{array}{l} \int_{𝕏} G (x, z) W (z) D_{G, n} (z, y) d μ^{*} (z) = \int_{𝕏} P_{n} (z) D_{G, n} (z, y) d μ^{*} (z) \\ + \sum_{k : λ_{k} \geq B^{*} n} b (λ_{k}) ϕ_{k} (x) \int_{𝕏} ϕ_{k} (z) D_{G, n} (z, y) d μ^{*} (z) . \end{array}

By definition, $D_{G, n} (○, y) \in Π_{n}$, and, hence, each of the summands in the last expression above is equal to 0. Therefore, recalling that h(λ_k/n) = 0 if λ_k > n, we obtain

\begin{array}{l} \int_{𝕏} G (x, z) W (z) D_{G, n} (z, y) d μ^{*} (z) = \int_{𝕏} P_{n} (z) D_{G, n} (z, y) d μ^{*} (z) \\ = \sum_{k : λ_{k} < B^{*} n} b (λ_{k}) ϕ_{k} (x) \int_{𝕏} ϕ_{k} (z) D_{G, n} (z, y) d μ^{*} (z) \\ = \sum_{k : λ_{k} < B^{*} n} b (λ_{k}) ϕ_{k} (x) h (λ_{k} / n) b {(λ_{k})}^{- 1} ϕ_{k} (y) \\ = \sum_{k} h (λ_{k} / n) ϕ_{k} (x) ϕ_{k} (y) \\ \begin{array}{l} = Φ_{n} (x, y) . \end{array} & (8.12) \end{array}

Since $D_{G, n} (z, ○) \in Π_{n} \subset Π_{B^{*} n}$, and $ν_{B^{*} n}$ is an admissible product quadrature measure of order B^*n, this implies that

\begin{array}{l} Φ_{n} (x, y) = \int_{𝕏} P_{n} (z) D_{G, n} (z, y) d ν_{B^{*} n} (z), y \in 𝕏 . & (8.13) \end{array}

Therefore, for y ∈ 𝕏,

\begin{array}{l} 𝔾_{n} (ν_{B^{*} n}; x, y) - Φ_{n} (x, y) \\ = \int_{𝕏} {W (z) G (x, z) - P_{n} (z)} D_{G, n} (z, y) d ν_{B^{*} n} (z) . \end{array}

Using Proposition 8.1 (used with Λ = B^*n) and the fact that ${| ν_{B^{*} n} | (𝕏)}$ has polynomial growth, we deduce that

\begin{array}{l} ‖ 𝔾_{n} (ν_{B^{*} n}; x, ○) - Φ_{n} (x, ○) ‖_{p} \leq | ν_{B^{*} n} | (𝕏) \\ \times ‖ W (○) G (x, ○) - P_{n} ‖_{\infty} sup_{z \in 𝕏} ‖ D_{G, n} (z, ○) ‖_{p} \\ \begin{array}{l} \leq c_{1} n^{c} b (B^{*} n) sup_{z \in 𝕏} ‖ D_{G, n} (z, ○) ‖_{p} . \end{array} & (8.14) \end{array}

In view of Proposition 5.4 and Proposition 5.2, we see that for any z ∈ 𝕏,

\begin{array}{l} ‖ D_{G, n} (z, ○) ‖_{p}^{2} \leq c_{1} n^{2 c} ‖ D_{G, n} (z, ○) ‖_{2}^{2} \\ = c_{1} n^{2 c} \sum_{k : λ_{k} < n} {(h (λ_{k} / n) b {(λ_{k})}^{- 1} ϕ_{k} (z))}^{2} \\ \leq c_{1} n^{2 c} b {(n)}^{- 2} ‖ Φ_{n} (z, ○) ‖_{2}^{2} \leq c_{1} n^{c} b {(n)}^{- 2} ‖ Φ_{n} (z, ○) ‖_{1}^{2} \\ \leq c_{1} n^{c} b {(n)}^{- 2} . \end{array}

We now conclude from (8.14) that

‖ 𝔾_{n} (ν_{B^{*} n}; x, ○) - Φ_{n} (x, ○) ‖_{p} \leq c_{1} n^{c} \frac{b (B^{*} n)}{b (n)} .

Since {b(B^*n)/b(n)} is fast decreasing, this completes the proof. □

The theorems in section 4 all follow from the following basic theorem.

Theorem 8.3. We assume the strong product assumption and the Bernstein-Lipschitz condition. With the set-up just described, we have

\begin{array}{l} \begin{array}{l} {Prob}_{ν^{*}} ({‖ G_{n} (Y; F) - σ_{n} (f_{0} f) ‖_{\infty} \\ \begin{array}{l} \geq c_{3} \sqrt{\frac{n^{q} ‖ | ν^{*} | ‖_{R, 0} log (c n B_{n} ‖ | ν^{*} | ‖_{R, 0} / δ)}{| Y |}}}) \leq \frac{δ}{‖ | ν^{*} | ‖_{R, 0}} . \end{array} \end{array} & (8.15) \end{array}

In particular, if $f \in X^{\infty} (𝕏)$, then

\begin{array}{l} {Prob}_{ν^{*}} ({{‖ 𝔾_{n} (Y; F) - f_{0} f ‖}_{\infty} \\ \geq c_{3} (\sqrt{\frac{n^{q} {‖ | ν^{*} | ‖}_{R, 0} log (c n B_{n} {‖ | ν^{*} | ‖}_{R, 0} / δ)}{| Y |}} + E_{n / 2} (\infty, f_{0} f))}) \\ \leq \frac{δ}{{‖ | ν^{*} | ‖}_{R, 0}} . & (8.16) \end{array}

PROOF. Theorems 8.1 and 8.2 together lead to (8.15). Since $σ_{n} (ν^{*}; f) = σ_{n} (f_{0} f)$, the estimate (8.16) follows from Theorem 5.1 used with p = ∞. □

PROOF OF THEOREM 4.1.

We observe that with the choice of f₀ as in this theorem, $| ‖ ν^{*} ‖ |_{R, 0} \leq ‖ f_{0} ‖_{\infty} \leq 1 / 𝔪$ . Using 𝔪δ in place of δ, we obtain Theorem 4.1 directly from Theorem 8.3 by some simple calculations. □

PROOF OF THEOREM 4.2.

This follows directly from Theorem 8.3 by choosing $F \equiv 1$ . □

PROOF OF THEOREM 4.3.

In view of Theorem 8.3, our assumptions imply that for each j ≥ 0,

{Prob}_{ν^{*}} (\{‖ G_{2^{j}} (Y; F) - σ_{2^{j}} (f_{0} f) ‖_{\infty} \geq c 2^{- j S}\}) \leq δ / 2^{j + 1} .

Consequently, with probability ≥ 1 − δ, we have for each j ≥ 1,

‖ G_{2^{j}} (Y; F) - G_{2^{j - 1}} (Y_{j}; F) - τ_{j} (f_{0} f) ‖_{\infty} \leq c 2^{- j S} .

Hence, the theorem follows from Theorem 6.1. □
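The passage from the probability estimates for the individual scales to the simultaneous estimate combines a union bound with the triangle inequality. The sketch below spells this out under the assumption, not restated in this section, that $τ_{j} (f_{0} f) = σ_{2^{j}} (f_{0} f) - σ_{2^{j - 1}} (f_{0} f)$. The shorthand $A_{j}$ for the difference between the network at scale $2^{j}$ and $σ_{2^{j}} (f_{0} f)$ is introduced only for this illustration.

```latex
\begin{aligned}
&\mathrm{Prob}\Big(\bigcup_{j \geq 0} \big\{‖A_{j}‖_{\infty} \geq c\, 2^{-jS}\big\}\Big)
  \leq \sum_{j = 0}^{\infty} \frac{δ}{2^{j + 1}} = δ, \\
&‖A_{j} - A_{j - 1}‖_{\infty}
  \leq ‖A_{j}‖_{\infty} + ‖A_{j - 1}‖_{\infty}
  \leq c\, 2^{-jS} + c\, 2^{-(j - 1)S}
  = c\, (1 + 2^{S})\, 2^{-jS}, \qquad j \geq 1 .
\end{aligned}
```

The second line holds on the complement of the union in the first line, that is, with probability at least 1 − δ, and $A_{j} - A_{j - 1}$ is the quantity whose norm is estimated in the last display of the proof.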

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

1. ^A Hardy multiquadric is a function of the form $x \to {(α^{2} + | x |_{2}^{2})}^{- 1}$, x ∈ ℝ^q. It is one of the most often used functions in the theory and applications of radial basis function networks. For a survey, see the paper [32] of Hardy.

2. ^|ν|−ess sup_{x ∈ 𝕂}|f(x)| = inf{t : |ν|({x ∈ 𝕂:|f(x)| > t}) = 0}

References

1. Zhou L, Pan S, Wang J, Vasilakos AV. Machine learning on big data: opportunities and challenges. Neurocomputing. (2017) 237:350–61. doi: 10.1016/j.neucom.2017.01.026

2. Cucker F, Smale S. On the mathematical foundations of learning. Bull Am Math Soc. (2002) 39:1–49. doi: 10.1090/S0273-0979-01-00923-5

3. Cucker F, Zhou DX. Learning Theory: An Approximation Theory Viewpoint, Vol. 24. Cambridge: Cambridge University Press (2007).

4. Girosi F, Poggio T. Networks and the best approximation property. Biol Cybernet. (1990) 63:169–76. doi: 10.1007/BF00195855

5. Chui CK, Donoho DL. Special issue: diffusion maps and wavelets. Appl Comput Harm Anal. (2006) 21:1–2. doi: 10.1016/j.acha.2006.05.005

6. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. (2003) 15:1373–96. doi: 10.1162/089976603321780317

7. Belkin M, Niyogi P. Towards a theoretical foundation for Laplacian-based manifold methods. J Comput Syst Sci. (2008) 74:1289–308. doi: 10.1016/j.jcss.2007.08.006

8. Belkin M, Niyogi P. Semi-supervised learning on Riemannian manifolds. Mach Learn. (2004) 56:209–39. doi: 10.1023/B:MACH.0000033120.25363.1e

9. Lafon SS. Diffusion maps and geometric harmonics (Ph.D. thesis), Yale University, New Haven, CT, United States (2004).

10. Singer A. From graph to manifold Laplacian: the convergence rate. Appl Comput Harm Anal. (2006) 21:128–34. doi: 10.1016/j.acha.2006.03.004

11. Jones PW, Maggioni M, Schul R. Universal local parametrizations via heat kernels and eigenfunctions of the Laplacian. Ann Acad Sci Fenn Math. (2010) 35:131–74. doi: 10.5186/aasfm.2010.3508

12. Liao W, Maggioni M. Adaptive geometric multiscale approximations for intrinsically low-dimensional data. arXiv. (2016) 1611.01179.

13. Maggioni M, Mhaskar HN. Diffusion polynomial frames on metric measure spaces. Appl Comput Harm Anal. (2008) 24:329–53. doi: 10.1016/j.acha.2007.07.001

14. Mhaskar HN. Eignets for function approximation on manifolds. Appl Comput Harm Anal. (2010) 29:63–87. doi: 10.1016/j.acha.2009.08.006

15. Mhaskar HN. A generalized diffusion frame for parsimonious representation of functions on data defined manifolds. Neural Netw. (2011) 24:345–59. doi: 10.1016/j.neunet.2010.12.007

16. Ehler M, Filbir F, Mhaskar HN. Locally learning biomedical data using diffusion frames. J Comput Biol. (2012) 19:1251–64. doi: 10.1089/cmb.2012.0187

17. Filbir F, Mhaskar HN. Marcinkiewicz-Zygmund measures on manifolds. J Complexity. (2011) 27:568–96. doi: 10.1016/j.jco.2011.03.002

18. Rosasco L, Belkin M, Vito ED. On learning with integral operators. J Mach Learn Res. (2010) 11:905–34. Available online at: http://jmlr.org/papers/v11/rosasco10a.html.

19. Rudi A, Carratino L, Rosasco L. Falkon: an optimal large scale kernel method. arXiv. (2017) 1705.10958.

20. Lu S, Pereverzev SV. Regularization Theory for Ill-Posed Problems. Berlin: de Gruyter (2013).

21. Mhaskar H, Pereverzyev SV, Semenov VY, Semenova EV. Data based construction of kernels for semi-supervised learning with less labels. Front Appl Math Stat. (2019) 5:21. doi: 10.3389/fams.2019.00021

22. Pereverzyev SV, Tkachenko P. Regularization by the linear functional strategy with multiple kernels. Front Appl Math Stat. (2017) 3:1. doi: 10.3389/fams.2017.00001

23. Fefferman C, Mitter S, Narayanan H. Testing the manifold hypothesis. J Am Math Soc. (2016) 29:983–1049. doi: 10.1090/jams/852

24. Chui CK, Lin S-B, Zhang B, Zhou DX. Realization of spatial sparseness by deep relu nets with massive data. arXiv. (2019) 1912.07464.

25. Guo ZC, Lin SB, Zhou DX. Learning theory of distributed spectral algorithms. Inverse Probl. (2017) 33:074009. doi: 10.1088/1361-6420/aa72b2

26. Lin SB, Wang YG, Zhou DX. Distributed filtered hyperinterpolation for noisy data on the sphere. arXiv. (2019) 1910.02434.

27. Mhaskar HN, Poggio T. Deep vs. shallow networks: an approximation theory perspective. Anal Appl. (2016) 14:829–48. doi: 10.1142/S0219530516400042

28. Mhaskar H, Poggio T. Function approximation by deep networks. arXiv. (2019) 1905.12882.

29. Mhaskar HN. On the representation of smooth functions on the sphere using finitely many bits. Appl Comput Harm Anal. (2005) 18:215–33. doi: 10.1016/j.acha.2004.11.004

30. Smale S, Rosasco L, Bouvrie J, Caponnetto A, Poggio T. Mathematics of the neural response. Foundat Comput Math. (2010) 10:67–91. doi: 10.1007/s10208-009-9049-1

31. Mhaskar HN. On the representation of band limited functions using finitely many bits. J Complexity. (2002) 18:449–78. doi: 10.1006/jcom.2001.0637

32. Hardy RL. Theory and applications of the multiquadric-biharmonic method 20 years of discovery 1968–1988. Comput Math Appl. (1990) 19:163–208. doi: 10.1016/0898-1221(90)90272-L

33. Müller A. Spherical Harmonics, Vol. 17. Berlin: Springer (2006).

34. Mhaskar HN, Narcowich FJ, Ward JD. Approximation properties of zonal function networks using scattered data on the sphere. Adv Comput Math. (1999) 11:121–37. doi: 10.1023/A:1018967708053

35. Timan AF. Theory of Approximation of Functions of a Real Variable: International Series of Monographs on Pure and Applied Mathematics, Vol. 34. New York, NY: Dover Publications (2014).

36. Chui CK, Mhaskar HN. A unified method for super-resolution recovery and real exponential-sum separation. Appl Comput Harmon Anal. (2019) 46:431–51. doi: 10.1016/j.acha.2017.12.007

37. Chui CK, Mhaskar HN. A Fourier-invariant method for locating point-masses and computing their attributes. Appl Comput Harmon Anal. (2018) 45:436–52. doi: 10.1016/j.acha.2017.08.010

38. Mhaskar HN. Introduction to the Theory of Weighted Polynomial Approximation, Vol. 56. Singapore: World Scientific Singapore (1996).

39. Steinerberger S. On the spectral resolution of products of laplacian eigenfunctions. arXiv. (2017) 1711.09826.

40. Lu J, Sogge CD, Steinerberger S. Approximating pointwise products of laplacian eigenfunctions. J Funct Anal. (2019) 277:3271–82. doi: 10.1016/j.jfa.2019.05.025

41. Lu J, Steinerberger S. On pointwise products of elliptic eigenfunctions. arXiv. (2018) 1810.01024.

42. Geller D, Pesenson IZ. Band-limited localized Parseval frames and Besov spaces on compact homogeneous manifolds. J Geometr Anal. (2011) 21:334–71. doi: 10.1007/s12220-010-9150-3

43. Mhaskar HN. Local approximation using Hermite functions. In: N. K. Govil, R. Mohapatra, M. A. Qazi, G. Schmeisser eds. Progress in Approximation Theory and Applicable Complex Analysis. Cham: Springer (2017). p. 341–62. doi: 10.1007/978-3-319-49242-1_16

44. Filbir F, Mhaskar HN. A quadrature formula for diffusion polynomials corresponding to a generalized heat kernel. J Fourier Anal Appl. (2010) 16:629–57. doi: 10.1007/s00041-010-9119-4

45. Mhaskar HN. A unified framework for harmonic analysis of functions on directed graphs and changing data. Appl Comput Harm Anal. (2018) 44:611–44. doi: 10.1016/j.acha.2016.06.007

46. Rivlin TJ. The Chebyshev Polynomials. New York, NY: John Wiley and Sons (1974).

47. Grigor'yan A. Heat kernels on metric measure spaces with regular volume growth. Handb Geometr Anal. (2010) 2. Available online at: https://www.math.uni-bielefeld.de/~grigor/hga.pdf.

48. Mhaskar HN. Approximate quadrature measures on data-defined spaces. In: Dick J, Kuo FY, Wozniakowski H, editors. Festschrift for the 80th Birthday of Ian Sloan. Berlin: Springer (2017). p. 931–62. doi: 10.1007/978-3-319-72456-0_41

49. Mhaskar HN. On the degree of approximation in multivariate weighted approximation. In: M. D. Buhmann, and D. H. Mache, eds. Advanced Problems in Constructive Approximation. Basel: Birkhäuser (2003). p. 129–41. doi: 10.1007/978-3-0348-7600-1_10

50. Mhaskar HN. Approximation theory and neural networks. In: Proceedings of the International Workshop in Wavelet Analysis and Applications. Delhi (1999). p. 247–89.

51. Mhaskar HN, Narcowich FJ, Ward JD. Spherical Marcinkiewicz-Zygmund inequalities and positive quadrature. Math Comput. (2001) 70:1113–30. doi: 10.1090/S0025-5718-00-01240-0

52. Mhaskar HN. Dimension independent bounds for general shallow networks. Neural Netw. (2020) 123:142–52. doi: 10.1016/j.neunet.2019.11.006

53. Hörmander L. The spectral function of an elliptic operator. Acta Math. (1968) 121:193–218. doi: 10.1007/BF02391913

54. Shubin MA. Pseudodifferential Operators and Spectral Theory. Berlin: Springer (1987).

55. Grigor'yan A. Gaussian upper bounds for the heat kernel on arbitrary manifolds. J Diff Geom. (1997) 45:33–52. doi: 10.4310/jdg/1214459753

56. Boucheron S, Lugosi G, Massart P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press (2013).

57. Hagerup T, Rüb C. A guided tour of Chernoff bounds. Inform Process Lett. (1990) 33:305–8. doi: 10.1016/0020-0190(90)90214-I

Appendix

A. Gaussian Upper Bound on Manifolds

Let 𝕏 be a compact and connected smooth q-dimensional manifold, g(x) = (g_{i, j}(x)) be its metric tensor, and (g^{i, j}(x)) be the inverse of g(x). The Laplace-Beltrami operator on 𝕏 is defined by

Δ (f) (x) = \frac{1}{\sqrt{| g (x) |}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} \partial_{i} (\sqrt{| g (x) |} g^{i, j} (x) \partial_{j} f),

where |g| = det(g). The symbol of Δ is given by

a (x, ξ) = \frac{1}{\sqrt{| g (x) |}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} (\sqrt{| g (x) |} g^{i, j} (x)) ξ_{i} ξ_{j} .

Then a(x, ξ) ≥ c|ξ|². Therefore, Hörmander's theorem [53, Theorem 4.4], [54, Theorem 16.1] shows that for x ∈ 𝕏,

\begin{array}{l} \sum_{λ_{k} < λ} ϕ_{k} {(x)}^{2} \leq c λ^{q}, λ \geq 1 . & (A.1) \end{array}

In turn, [44, Proposition 4.1] implies that

\sum_{k = 0}^{\infty} exp (- λ_{k}^{2} t) ϕ_{k} {(x)}^{2} \leq c t^{- q / 2}, t \in (0, 1], x \in 𝕏 .

Then [55, Theorem 1.1] shows that (3.3) is satisfied.

B. Probabilistic Estimates

We need the following basic facts from probability theory. Proposition B.1(a) below is a reformulation of Boucheron et al. [56, section 2.1, 2.7]. A proof of Proposition B.1(b) below is given in Hagerup and Rüb [57, Equation (7)].

Proposition B.1. (a) (Bernstein concentration inequality) Let Z₁, ⋯, Z_M be independent real valued random variables such that for each j = 1, ⋯, M, |Z_j| ≤ R, and $𝔼 (Z_{j}^{2}) \leq V$ . Then, for any t > 0,

\begin{array}{l} Prob (| \frac{1}{M} \sum_{j = 1}^{M} (Z_{j} - 𝔼 (Z_{j})) | \geq t) \leq 2 exp (- \frac{M t^{2}}{2 (V + R t)}) . & (B.1) \end{array}

(b) (Chernoff bound) Let M ≥ 1, 0 ≤ p ≤ 1, and Z₁, ⋯, Z_M be independent random variables taking values in {0, 1}, with Prob(Z_k = 1) = p. Then for t ∈ (0, 1],

\begin{array}{l} Prob (\sum_{k = 1}^{M} Z_{k} \leq (1 - t) M p) \leq exp (- t^{2} M p / 2), \\ \begin{array}{l} Prob (| \sum_{k = 1}^{M} Z_{k} - M p | \geq t M p) \leq 2 exp (- t^{2} M p / 2) . \end{array} & (B.2) \end{array}
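The two bounds of Proposition B.1 are easy to evaluate numerically. The following minimal Python sketch (requiring Python 3.8 or later for `math.comb`) computes the right-hand sides of the Bernstein and Chernoff bounds and compares the first Chernoff bound with the exact lower tail of a binomial distribution, assuming that the Z_k are independent. The function names and the parameter values are hypothetical.

```python
import math


def bernstein_tail_bound(M: int, t: float, R: float, V: float) -> float:
    """Right-hand side of the Bernstein inequality: 2 exp(-M t^2 / (2 (V + R t)))."""
    return 2.0 * math.exp(-M * t * t / (2.0 * (V + R * t)))


def chernoff_lower_tail_bound(M: int, p: float, t: float) -> float:
    """Right-hand side of the first Chernoff bound: exp(-t^2 M p / 2)."""
    return math.exp(-t * t * M * p / 2.0)


def exact_binomial_lower_tail(M: int, p: float, t: float) -> float:
    """Exact Prob(sum_k Z_k <= (1 - t) M p) for independent Bernoulli(p) variables."""
    # A small tolerance guards against rounding just below an integer threshold.
    threshold = math.floor((1.0 - t) * M * p + 1e-9)
    return sum(
        math.comb(M, k) * p**k * (1.0 - p) ** (M - k)
        for k in range(threshold + 1)
    )


# Illustrative parameter choices; t = 1 is the case used in the proof of Lemma 7.1.
for M, p, t in [(100, 0.3, 0.5), (50, 0.1, 1.0), (200, 0.5, 0.2)]:
    exact = exact_binomial_lower_tail(M, p, t)
    bound = chernoff_lower_tail_bound(M, p, t)
    print(f"M={M}, p={p}, t={t}: exact={exact:.3e}, bound={bound:.3e}")
```

For t = 1 the exact probability is ${(1 - p)}^{M}$, the probability that none of the M samples falls in a set of probability p, and the bound reduces to exp(−Mp/2).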

Keywords: Kernel based approximation, distributed learning, machine learning, inverse problems, probability estimation

Citation: Mhaskar HN (2020) Kernel-Based Analysis of Massive Data. Front. Appl. Math. Stat. 6:30. doi: 10.3389/fams.2020.00030

Received: 29 March 2020; Accepted: 03 July 2020;
Published: 20 October 2020.

Edited by:

Ke Shi, Old Dominion University, United States

Reviewed by:

Jianjun Wang, Southwest University, China
Alex Cloninger, University of California, San Diego, United States

Copyright © 2020 Mhaskar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Hrushikesh N. Mhaskar, hrushikesh.mhaskar@cgu.edu
