Kernel-Based Analysis of Massive Data

Dealing with massive data is a challenging task for machine learning. An important aspect of machine learning is function approximation. In the context of massive data, some of the commonly used tools for this purpose are sparsity, divide-and-conquer, and distributed learning. In this paper, we develop a very general theory of approximation by networks, which we have called eignets, to achieve local, stratified approximation. The very massive nature of the data allows us to use these eignets to solve inverse problems, such as finding a good approximation to the probability law that governs the data and finding the local smoothness of the target function near different points in the domain. In fact, we develop a wavelet-like representation using our eignets. Our theory is applicable to approximation on a general locally compact metric measure space. Special examples include approximation by periodic basis functions on the torus, zonal function networks on a Euclidean sphere (including smooth ReLU networks), Gaussian networks, and approximation on manifolds. We construct pre-fabricated networks so that no data-based training is required for the approximation.


Introduction
Rapid advances in technology have led to both the availability of massive data and the need to analyze it. The problem arises in almost every area of life, from medical science to homeland security to finance. An immediate problem in dealing with a massive data set is that it cannot be stored in computer memory, so one has to deal with the data piecemeal, keeping access to external memory to a minimum. The other challenge is to devise efficient numerical algorithms that overcome the difficulties, for example, in using the customary optimization formulations of machine learning. On the other hand, the very availability of a massive data set should also create opportunities to solve problems heretofore considered unmanageable. For example, deep learning often requires a large amount of training data, which in turn helps to figure out the granularity in the data. Apart from deep learning, distributed learning is also a popular way of dealing with big data. A good survey, with a taxonomy for dealing with massive data, is given in [48].
As pointed out in [15,9,10], the main task in machine learning can be viewed as one of approximating functions based on noisy values of the target function, sampled at points which are themselves sampled from an unknown distribution. Therefore, it is natural to seek approximation theory techniques to solve the problem. However, most of the classical approximation theory results are either not constructive, or else study function approximation only on known domains. In this century, a new paradigm has emerged to consider function approximation on data-defined manifolds; a good introduction to the subject is in the special issue [5] of Applied and Computational Harmonic Analysis, edited by Chui and Donoho. In this theory, one assumes the manifold hypothesis, i.e., that the data is sampled from a probability distribution µ* supported on a smooth, compact, connected, Riemannian manifold; for simplicity, even that µ* is the Riemannian volume measure of the manifold, normalized to be a probability measure. Following, e.g., [1,3,2,23,45], one first constructs a "graph Laplacian" from the data, and finds its eigen-decomposition. It is proved in the above mentioned papers that, as the size of the data tends to infinity, the graph Laplacian converges to the Laplace-Beltrami operator on the manifold, and the eigenvalues (respectively, eigenvectors) converge to the corresponding quantities on the manifold. A great deal of work is devoted to studying the geometry of this unknown manifold (e.g., [22,24]), based on the so-called heat kernel. The theory of function approximation on such manifolds is also well developed (e.g., [26,33,34,11,14]). A bottleneck in this theory is the computation of the eigen-decomposition of a matrix, which is necessarily huge in the case of big data. It is also possible that the manifold hypothesis does not hold, and there is recent work [12] by Fefferman, Mitter, and Narayanan proposing an algorithm to test this hypothesis. On the other hand, our theory for function approximation does not necessarily use the full strength of Riemannian geometry. In this paper, we have therefore decided to work with a general locally compact metric measure space, isolating those properties which are needed for our analysis, and substituting some which are not applicable in the current setting.
Our motivation comes from some recent works on distributed learning by Zhou and his collaborators [6,18,25], as well as our own work on deep learning [41,27]. For example, in [25], the approximation is done on the Euclidean sphere using a localized kernel introduced in [32], where the massive data is divided into smaller parts, each dense on the sphere, and the resulting polynomial approximations are added to get the final result. In [6], the approximation takes place on a cube, and exploits any known sparsity in the representation of the target function in terms of spline functions. In [41,27], we have argued that, from a function approximation point of view, the observed superiority of deep networks over shallow ones results from the ability of deep networks to exploit any compositional structure in the target function. For example, in image analysis, one may divide the image into smaller patches, which are then combined in a hierarchical manner, resulting in a tree structure [46]. By putting a shallow network at each node to learn those aspects of the target function which depend upon the pixels seen up to that level, one can avoid the curse of dimensionality. In some sense, this is a divide-and-conquer strategy, not so much on the data set itself, but on the dimension of the input space.
The highlights of this paper are the following.
• In order to avoid an explicit, data-dependent eigen-decomposition, we introduce the notion of an eignet, which generalizes several radial basis function and zonal function networks. We construct pre-fabricated eignets, whose linear combinations can be constructed just by using the noisy values of the target function as the coefficients, to yield the desired approximation.
• Our theory generalizes the results in a number of examples used commonly in machine learning, some of which we will describe in Section 2.
• Our results do not depend upon any kind of optimization in order to determine the necessary approximation. Therefore, the problems associated with that part of the theory are absent.
• We develop a theory for local approximation using eignets, so that only a relatively small amount of data is used in order to approximate the target function in any ball of the space, the data sub-sampled using a distribution supported on a neighborhood of that ball.The accuracy of approximation adjusts itself automatically depending upon the local smoothness of the target function on the ball.
• In usual machine learning algorithms, it is customary to assume a prior on the target function, called a smoothness class in approximation theory parlance. Our theory demonstrates clearly how massive data can actually help to solve the inverse problem of determining the local smoothness of the target function using a wavelet-like representation, based solely on the data.
• Our results allow one to solve the inverse problem of estimating the probability density from which the data is chosen. In contrast to the statistical approaches that we are aware of, there is no limitation on how accurate the approximation can be asymptotically in terms of the number of samples; the accuracy is determined entirely by the smoothness of the density function.
• All our estimates are given in terms of probability of the error being small, rather than the expected value of some loss function being small.
Necessarily, the paper is very abstract, theoretical, and technical. In Section 2, we present a number of examples which are generalized by our set-up. The abstract set-up, together with the necessary definitions and assumptions, is discussed in Section 3. The main results are stated in Section 4 and proved in Section 8. The proofs require a great deal of preparation, which is presented in Sections 5, 6, and 7. The results in these sections are not all new; many of them are new only in some nuance. For example, in Section 7 we have proved the quadrature formulas required in the construction of our pre-fabricated networks in a probabilistic setting, substituting for an estimate on the gradients a certain Lipschitz condition, which makes sense without the differentiability structure on the manifold that we had used in our previous works. Our Theorem 7.1 generalizes most of our previous results in this direction, except for [30, Theorem 2.3]. We have strived to give as many proofs as possible, partly for the sake of completeness and partly because the results were not stated earlier in exactly the form needed here. In Appendix A, we give a short proof of the fact that the Gaussian upper bound for the heat kernel holds for arbitrary smooth, compact, connected manifolds; we could not find a reference for this fact. In Appendix B, we state the main probability theory estimates that are used ubiquitously in the paper.

Motivating examples
In this paper, we aim to develop a unifying theory applicable to a variety of kernels and domains. In this section, we describe some examples which have motivated the abstract theory to be presented in the rest of the paper. In the following examples, q ≥ 1 is a fixed integer.
Example 2.1. Let T^q = R^q/(2πZ^q) be the q-dimensional torus. The distance between points x = (x_1, ..., x_q) and y = (y_1, ..., y_q) is defined by max_{1≤k≤q} |(x_k − y_k) mod 2π|. The trigonometric monomial system {exp(ik·◦) : k ∈ Z^q} is orthonormal with respect to the Lebesgue measure normalized to be a probability measure on T^q. We recall that the periodization of a function f : R^q → R is defined formally by f°(x) = Σ_{k∈Z^q} f(x + 2kπ). When f is integrable, the Fourier transform of f at k ∈ Z^q is the same as the k-th Fourier coefficient of f°; this Fourier coefficient will be denoted by f̂°(k). The networks of interest here are of the form x ↦ Σ_j a_j G(x − y_j), where G is a periodic function, called the activation function. The examples of the activation functions in which we are interested in this paper include: 1. Periodization of the Gaussian.
2. Periodization of the Hardy multiquadric.

Example 2.2. The cube [−1, 1]^q can be thought of as a quotient space of T^q, where all points of the form ε⊙θ = (ε_1θ_1, ..., ε_qθ_q), ε ∈ {−1, 1}^q, are identified. A function on [−1, 1]^q can then be lifted to T^q, and this lifting preserves all the smoothness properties of the function. Our set-up below includes [−1, 1]^q, where the distance and the measure are defined via the mapping to the torus, and suitably weighted Jacobi polynomials are considered to be the orthonormalized family of functions. In particular, if G is a periodic activation function, x = cos(θ), y = cos(φ), then the function G°(x, y) = Σ_{ε∈{−1,1}^q} G(ε ⊙ (θ − φ)) is an activation function on [−1, 1]^q with an expansion in terms of the T_k's, the tensor product, orthonormalized Chebyshev polynomials. Furthermore, the coefficients b_k in this expansion have the same asymptotic behavior as the Ĝ(k)'s.
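To illustrate the periodization in Example 2.1, the following sketch (our own, with q = 1 and names of our choosing) checks numerically that the Fourier coefficients of the periodized Gaussian agree with samples of the Gaussian's Fourier transform, taken with respect to the normalized Lebesgue measure on the torus.

```python
import numpy as np

# Sketch: periodize g(x) = exp(-x^2/2) and verify that the k-th Fourier
# coefficient of g° equals the Fourier transform of g at k, namely
# ĝ(k) = exp(-k^2/2)/sqrt(2*pi) under the normalized measure dx/(2*pi).

def periodization(f, x, terms=20):
    # f°(x) = sum_k f(x + 2*pi*k), truncated symmetrically
    return sum(f(x + 2*np.pi*k) for k in range(-terms, terms + 1))

g = lambda x: np.exp(-x**2 / 2)

N = 256
x = 2*np.pi*np.arange(N)/N
vals = periodization(g, x)
coeffs = np.fft.fft(vals) / N        # coefficients w.r.t. dx/(2*pi)

for k in range(4):
    expected = np.exp(-k**2/2) / np.sqrt(2*np.pi)
    assert abs(coeffs[k].real - expected) < 1e-10
print("periodized-Gaussian Fourier coefficients match the Fourier transform")
```

The trapezoidal rule on a periodic grid is spectrally accurate here, which is why the agreement is essentially to machine precision.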
Example 2.3. Let S^q = {x ∈ R^{q+1} : |x|_2 = 1} be the unit sphere in R^{q+1}. The dimension of S^q as a manifold is q.
We assume the geodesic distance ρ on S^q, and the volume measure µ* normalized to be a probability measure. We refer the reader to [42] for details, describing here only the essentials to get a "what-it-is-all-about" introduction. The (equivalence classes of) restrictions to S^q of polynomials in q + 1 variables with total degree < n are called spherical polynomials of degree < n. The set of restrictions of homogeneous harmonic polynomials of degree ℓ to S^q is denoted by H_ℓ, with dimension d_ℓ. There is an orthonormal basis {Y_{ℓ,k}}, k = 1, ..., d_ℓ, for each H_ℓ that satisfies an addition formula, where ω_{q−1} is the volume of S^{q−1}, and p_ℓ is the degree ℓ ultraspherical polynomial, normalized so that the family {p_ℓ} is orthonormal with respect to the measure (1 − x²)^{(q−2)/2} dx on (−1, 1). A zonal function on the sphere has the form x ↦ G(x · y), where the activation function G : [−1, 1] → R has a formal expansion in terms of the p_ℓ's. The examples of the activation functions in which we are interested in this paper include the smooth ReLU function; the relevant coefficient estimates are shown in [42, Lemma 18].

Example 2.4. One interesting example is the heat kernel.

Example 2.5. Let X = R^q, ρ be the ℓ^∞ norm on X, and µ* be the Lebesgue measure. For any multi-integer k ∈ Z^q_+, the (multivariate) Hermite function φ_k is defined via the generating function (2.1). The system {φ_k} is orthonormal with respect to µ*, and satisfies an eigenfunction relation involving the Laplacian operator ∆. As a consequence of the so-called Mehler identity, one obtains ([8]) a closed-form expression for the associated summability kernels.

The set-up and definitions

Data spaces
Let X be a connected, locally compact metric space with metric ρ. For r > 0, x ∈ X, we denote by B(x, r) the ball {y ∈ X : ρ(x, y) ≤ r}. If K ⊆ X and x ∈ X, we write as usual ρ(K, x) = inf_{y∈K} ρ(x, y). It is convenient to denote the set {y ∈ X : ρ(K, y) ≤ r} by B(K, r).
For a Borel measure ν on X (signed or positive), we denote by |ν| its total variation measure, defined for Borel subsets K ⊂ X by |ν|(K) = sup Σ_{U∈U} |ν(U)|, where the supremum is over all countable measurable partitions U of K. In the sequel, the term measure will mean a signed or positive, complete, sigma-finite Borel measure; terms such as measurable will mean Borel measurable. If f : X → R is measurable, K ⊂ X is measurable, and ν is a measure, we define ‖f‖_{p,ν,K} = (∫_K |f|^p d|ν|)^{1/p} for 1 ≤ p < ∞, and ‖f‖_{∞,ν,K} as the |ν|-essential supremum of |f| on K. The symbol L^p(ν, K) denotes the set of all measurable functions f for which ‖f‖_{p,ν,K} < ∞, with the usual convention that two functions are considered equal if they are equal |ν|-almost everywhere on K. The set C_0(K) denotes the set of all uniformly continuous functions on K vanishing at ∞. In the case when K = X, we will omit the mention of K, unless it is necessary to avoid confusion. We fix a non-decreasing sequence {λ_k}, k = 0, 1, ..., with λ_0 = 0 and λ_k ↑ ∞ as k → ∞. We also fix a positive, sigma-finite, Borel measure µ* on X, and a system of functions {φ_k} orthonormal with respect to µ*; we write Π_n for the span of {φ_k : λ_k < n}, and Π_∞ for the union of the Π_n's. It will be assumed in the sequel that Π_∞ is dense in C_0 (and hence, in every L^p, 1 ≤ p < ∞). We will often refer to the elements of Π_∞ as diffusion polynomials, in keeping with [26].
Definition 3.1. We will say that a sequence {a_n} (or a function of a real variable) has polynomial growth if |a_n| ≤ c_1 n^c for all n ≥ 1, and is fast decreasing if for every S > 0, |a_n| ≤ c(S)n^{−S} for all n ≥ 1; similarly for functions.
Definition 3.2. The space X (more precisely, the tuple Ξ = (X, ρ, µ*, {λ_k}, {φ_k})) is called a data space if each of the following conditions is satisfied.

(Gaussian upper bound)
4. (Essential compactness) For every n ≥ 1 there exists a compact set K_n ⊂ X such that the function n ↦ diam(K_n) has polynomial growth, while the associated functions are both fast decreasing. (Necessarily, n ↦ µ*(K_n) has polynomial growth as well.)

Remark 3.1. It is clear that if X is compact, then the first condition as well as the essential compactness condition are automatically satisfied; we may take K_n = X for all n. In this case, we will assume tacitly that µ* is a probability measure, and φ_0 ≡ 1.
Example 3.1. (Manifold case) This example points out that our notion of a data space generalizes the set-ups in Examples 2.1, 2.2, 2.3, and 2.4. Let X be a smooth, compact, connected Riemannian manifold (without boundary), ρ be the geodesic distance on X, µ* be the Riemannian volume measure normalized to be a probability measure, {λ_k} be the sequence of eigenvalues of the (negative) Laplace-Beltrami operator on X, and φ_k be the eigenfunction corresponding to the eigenvalue λ_k; then Ξ = (X, ρ, µ*, {λ_k}, {φ_k}) is a data space. Of course, the assumption of essential compactness is satisfied trivially. (See Appendix A for the Gaussian upper bound.)

Example 3.2. (Hermite case) We illustrate how Example 2.5 is included in our definition of a data space. Accordingly, we assume the set-up as in that example. For a > 0, let φ_{k,a}(x) = a^{−q/2} φ_k(ax). With λ_k = |k|_1, the system Ξ_a = (R^q, ρ, µ*, {λ_k}, {φ_{k,a}}) is a data space. When a = 1, we will omit its mention from the notation in this context. The first two conditions are obvious. The Gaussian upper bound follows from the multivariate Mehler identity [7, Equation (4.27)]. The assumption of essential compactness is satisfied with K_n = B(0, cn) for a suitable constant c (cf. [28, Chapter 6]).

In the rest of this paper, we assume X to be a data space. Different theorems will require some additional assumptions, two of which we now enumerate. Not every theorem will need all of these; we will state explicitly which theorem uses which assumptions, apart from X being a data space.
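As a quick numerical sanity check of the orthonormality underlying the Hermite case (Example 3.2), the following sketch (ours; q = 1, a = 1, with names of our choosing) generates the Hermite functions by the standard three-term recurrence and verifies their Gram matrix against the identity.

```python
import numpy as np

# Sketch: univariate Hermite functions phi_0..phi_{K-1}, orthonormal with
# respect to Lebesgue measure, via the recurrence
#   phi_k(x) = sqrt(2/k) x phi_{k-1}(x) - sqrt((k-1)/k) phi_{k-2}(x).

def hermite_functions(K, x):
    phi = np.zeros((K, len(x)))
    phi[0] = np.pi**-0.25 * np.exp(-x**2 / 2)
    if K > 1:
        phi[1] = np.sqrt(2.0) * x * phi[0]
    for k in range(2, K):
        phi[k] = np.sqrt(2.0/k)*x*phi[k-1] - np.sqrt((k-1)/k)*phi[k-2]
    return phi

# trapezoidal quadrature on [-12, 12]; spectrally accurate for these
# rapidly decaying analytic functions
x = np.linspace(-12, 12, 4001)
w = np.full_like(x, x[1]-x[0]); w[0] = w[-1] = (x[1]-x[0])/2
phi = hermite_functions(6, x)
G = (phi * w) @ phi.T                 # Gram matrix of inner products
assert np.allclose(G, np.eye(6), atol=1e-8)
print("Hermite system is orthonormal")
```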
The first of these deals with the product of two diffusion polynomials.We do not know of any situation where it is not satisfied, but are not able to prove it in general.

Definition 3.3. (Product assumption)
We say that the strong product assumption is satisfied if, instead of (3.4), we have for every n > 0 and P, Q ∈ Π_n that PQ ∈ Π_{A*n}, for some constant A* ≥ 2 independent of n, P, and Q.

Example 3.3. In the setting of Example 3.2, if P, Q ∈ Π_n, then PQ = Rφ_0 for some R ∈ Π_{2n}. So, the product assumption holds trivially. The strong product assumption does not hold; however, if P, Q ∈ Π_n, then PQ lies in the span of the system {φ_{k,√2}} (cf. Example 3.5). Similarly, in the setting of Example 3.1, we may assume that φ_0 ≡ 1 on X. It is shown in [14, Theorem A.1], under certain assumptions on Ξ, that there exists A* ≥ 2 such that for every P, Q ∈ Π_n, PQ ∈ Π_{A*n}. Thus, the strong product assumption holds (and hence, so does the product assumption).
In our results in Section 4, we will need the following condition, which serves the purpose of the gradient in many of our earlier theorems on manifolds.

Definition 3.4. We say that the system Ξ satisfies the Bernstein-Lipschitz condition if for every n > 0, there exists B_n > 0 such that |P(x) − P(y)| ≤ B_n ρ(x, y) ‖P‖_∞ for all x, y ∈ X and P ∈ Π_n.

Both in the manifold case and the Hermite case, B_n = cn for some constant c > 0. A proof in the Hermite case can be found in [36], and in the manifold case in [13].

Smoothness classes
We define next the smoothness classes of interest here.
Definition 3.5. A function w : X → R will be called a weight function if wφ_k ∈ C_0(X) ∩ L^1(X) for all k. If w is a weight function, we define the corresponding smoothness norms below. We will omit the mention of w if w ≡ 1 on X.
We find it convenient to denote by X^p the corresponding subspace of L^p(X). The space W_{γ,p,w} comprises all f for which ‖f‖_{W_{γ,p,w}} < ∞.
the space W_{γ,p,w}(x_0) comprises functions f such that there exists r > 0 with the property that for every φ ∈ C^∞_w(B(x_0, r)), φf ∈ W_{γ,p,w}.

Remark 3.3. In both the manifold case and the Hermite case, characterizations of the smoothness classes W_{γ,p} are available in terms of constructive properties of the functions, such as the number of derivatives, estimates on certain moduli of smoothness or K-functionals, etc. In particular, the class C^∞ coincides with the class of infinitely differentiable functions vanishing at infinity.
We can now state another assumption, which will be needed in studying local approximation.

Definition 3.7. (Partition of unity) For every r > 0, there exists a countable family F_r = {ψ_{k,r}}, k = 0, 1, ..., of functions in C^∞ with the following properties: 3. For every x ∈ X, there exists a finite subset F_r(x) ⊆ F_r such that the sum of the functions in F_r(x) is identically equal to 1 in a neighborhood of x. We record some obvious observations about the partition of unity, omitting the simple proof.
Proposition 3.1.Let r > 0, F r be a partition of unity.
(a) Necessarily, the family F_r is locally finite.

The constant convention. In the sequel, c, c_1, ... will denote generic positive constants depending only on the fixed quantities under discussion, such as Ξ, q, κ, κ_1, κ_2, the various smoothness parameters, and the filters to be introduced. Their value may be different at different occurrences, even within a single formula. The notation A ≲ B means A ≤ cB, and A ∼ B means A ≲ B ≲ A.

We end this section by defining a kernel which plays a central role in this theory. Let H : [0, ∞) → R be a compactly supported function. In the sequel, we define

Φ_n(H; x, y) = Σ_k H(λ_k/n) φ_k(x) φ_k(y),  n > 0, x, y ∈ X.

If S ≥ 1 is an integer, and H is S times continuously differentiable, we introduce a notation for the corresponding norm involving the derivatives of H up to order S. The following proposition recalls an important property of these kernels. Proposition 3.2 is proved in [26], and more recently in much greater generality in [37, Theorem 4.3].
Proposition 3.2. Let S > q be an integer, H : R → R be an even, S times continuously differentiable, compactly supported function. Then for every x, y ∈ X, N > 0,

|Φ_N(H; x, y)| ≤ c N^q / max(1, (Nρ(x, y))^S).

In the sequel, let h : R → [0, 1] be a fixed, infinitely differentiable, even function, non-increasing on [0, ∞), with h(t) = 1 if |t| ≤ 1/2 and h(t) = 0 if t ≥ 1. If ν is any measure having bounded total variation on X, we define (3.12)

σ_n(ν; f, x) = ∫_X Φ_n(h; x, y) f(y) dν(y).

We will omit the mention of h in the notations; e.g., we write Φ_n(x, y) = Φ_n(h; x, y), and omit the mention of ν if ν = µ*; in particular, σ_n(f, x) = σ_n(µ*; f, x).
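To make the localization in Proposition 3.2 concrete, the following sketch (ours) evaluates Φ_n in the torus case of Example 2.1 with q = 1, where φ_k are complex exponentials, using an explicit C^∞ cutoff of the stated type (the particular formula for h is our choice). The kernel concentrates near the diagonal, and the relative size of the tail decreases as n grows.

```python
import numpy as np

def h(t):
    """C-infinity cutoff: 1 on [0, 1/2], 0 on [1, inf); our choice of h."""
    t = np.abs(np.asarray(t, dtype=float))
    s = np.clip(2*t - 1, 0.0, 1.0)      # 0 below 1/2, 1 at or above 1
    e = lambda a: np.where(a > 0, np.exp(-1.0/np.where(a > 0, a, 1.0)), 0.0)
    return e(1 - s) / (e(1 - s) + e(s))

def Phi(n, u):
    """Phi_n(x, y) = sum_k h(|k|/n) e^{ik(x-y)} as a function of u = x - y."""
    k = np.arange(-n, n + 1)
    return np.real(np.exp(1j * np.outer(u, k)) @ h(k / n))

u = np.linspace(-np.pi, np.pi, 201)     # u[100] = 0 is the diagonal
frac = []
for n in (16, 64):
    vals = Phi(n, u)
    frac.append(np.max(np.abs(vals[np.abs(u) > 1.0])) / vals[100])
assert frac[1] < frac[0] < 0.1          # tails shrink relative to the peak
print("relative tail sizes:", [round(f, 5) for f in frac])
```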

Measures
In this section, we describe the terminology involving measures.
The infimum of all constants c which work in (3.15) will be denoted by |||ν||| R,d , and the class of all d-regular measures will be denoted by R d .
(c) By abuse of terminology, we will say that a measure ν_n is an admissible quadrature measure (respectively, an admissible product quadrature measure) of order n if |ν_n|(X) ≤ c_1 n^c (with constants independent of n) and (3.16) (respectively, (3.17)) holds.
In the case when X is compact, a well-known theorem, called Tchakaloff's theorem [43, Exercise 2.5.8, p. 100], shows the existence of admissible product quadrature measures (even finitely supported probability measures). However, in order to construct such measures, it is much easier to prove the existence of admissible quadrature measures, as we will do in Theorem 7.1, and then use one of the product assumptions to derive admissible product quadrature measures.
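An admissible-quadrature-style computation can be illustrated numerically. The sketch below (ours, not part of the formal development) works on the torus with q = 1: given scattered points that are dense enough, it solves the moment conditions by least squares so that the resulting discrete measure integrates every trigonometric polynomial of degree < n exactly with respect to the normalized measure.

```python
import numpy as np

# Sketch: find weights w_j with  sum_j w_j e^{ik z_j} = delta_{k,0}
# for |k| < n; the discrete measure sum_j w_j delta_{z_j} then integrates
# all trigonometric polynomials of degree < n exactly.

rng = np.random.default_rng(0)
n, N = 8, 60
z = np.sort(rng.uniform(0, 2*np.pi, N))

k = np.arange(-(n-1), n)                      # frequencies |k| < n
A = np.exp(1j * np.outer(k, z))               # moment matrix
b = (k == 0).astype(float)                    # exact moments of dx/(2*pi)
w = np.linalg.lstsq(A, b, rcond=None)[0].real # real part still solves the system

# exactness check on a random trigonometric polynomial of degree < n
c = rng.standard_normal(len(k)) + 1j*rng.standard_normal(len(k))
P = lambda x: np.real(np.exp(1j*np.outer(x, k)) @ c)
true_integral = c[n-1].real                   # coefficient of the frequency k = 0
assert abs(w @ P(z) - true_integral) < 1e-8
print("quadrature is exact on trigonometric polynomials of degree < n")
```

Taking the real part of the least-squares solution is legitimate here: conjugating the moment equations maps the equation for k to the one for −k, so the real part of any solution is again a solution.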
Example 3.4. In the manifold case, let the strong product assumption hold as in Example 3.3. If n ≥ 1 and C ⊂ X is a finite subset satisfying the assumptions of Theorem 7.1, then the theorem asserts the existence of an admissible quadrature measure supported on C. If {ν_n} is an admissible quadrature measure sequence, then {ν_{A*n}} is an admissible product quadrature measure sequence. In particular, there exist finitely supported admissible product quadrature measures of order n for every n ≥ 1.
Example 3.5. We consider the Hermite case as in Example 3.2. For every a > 0 and n ≥ 1, Theorem 7.1 applied with the system Ξ_a yields admissible quadrature measures of order n supported on finite subsets of R^q (in fact, of [−cn, cn]^q for an appropriate c). In particular, an admissible quadrature measure of order √2·n for Ξ_{√2} is an admissible product quadrature measure of order n for Ξ = Ξ_1.

Eignets
The notion of an eignet defined below is a generalization of the various kernels described in the examples in Section 2.
Therefore, all of the results in Sections 4 and 8 can be applied once with G_2 and once with G_1 to obtain a corresponding result for G_0, with different constants. For this reason, we will simplify our presentation by assuming the apparently restrictive conditions stipulated in Definition 3.10. In particular, this includes the example of the smooth ReLU network described in Example 2.3.

Definition 3.11. Let ν be a measure on X (signed or having bounded total variation), and G ∈ C_0(X × X). We define the corresponding networks G_n(ν) (cf. (3.20)).

Remark 3.5. Typically, we will use an admissible product quadrature measure sequence in place of the measure ν, where each of the measures in the sequence is finitely supported, to construct a sequence of networks. In the case when X is compact, Tchakaloff's theorem shows that there exists an admissible product quadrature measure of order m supported on (dim(Π_m) + 1)² points. Using this measure in place of ν, one obtains a pre-fabricated eignet G_n(ν) with (dim(Π_m) + 1)² neurons. However, this is not an actual construction. In the presence of the product assumption, Theorem 7.1 leads to the pre-fabricated networks G_n in a constructive manner, with the number of neurons as stipulated in that theorem.

Main results
In this section, we assume the Bernstein-Lipschitz condition (Definition 3.4) in all the theorems. We note that the measure µ* need not be a probability measure; therefore, we take the help of an auxiliary function f_0 to define a probability measure as follows. Let f_0 ∈ C_0(X), f_0(x) ≥ 0 for all x ∈ X, and dν* = f_0 dµ* be a probability measure. Necessarily, ν* is 0-regular, and |||ν*|||_{R,0} ≤ ‖f_0‖_{∞,µ*}. We assume noisy data of the form (y, ǫ), with a joint probability distribution τ defined for Borel subsets of X × Ω for some measure space Ω, and with ν* being the marginal distribution of y with respect to τ. Let F(y, ǫ) be a random variable following the law τ, and let f denote the corresponding conditional expectation. It is easy to verify using Fubini's theorem that, if F is integrable with respect to τ, then the corresponding expectations agree for any x ∈ X. Let Y be a random sample from τ, and {ν_n} be an admissible product quadrature sequence in the sense of Definition 3.9. We define the data-based estimator accordingly (cf. (3.20)), where B* is as in Definition 3.10.
Remark 4.1. We note that the networks G_n are pre-fabricated, independently of the data. Therefore, effectively, the network G_n has only |Y| terms depending upon the data.
Our first theorem describes local function recovery using local sampling. We may interpret it in the spirit of distributed learning as in [6,25]: we are taking a linear combination of pre-fabricated networks G_n using the function values themselves as the coefficients. The networks G_n have essentially the same localization property as the kernels Φ_n (cf. Theorem 8.2).

Theorem 4.1. Let x_0 ∈ X and r > 0. We assume the partition of unity, and find a function ψ ∈ C^∞ supported on B(x_0, 3r) which is equal to 1 on B(x_0, r); let m = ∫_X ψ dµ*, f_0 = ψ/m, and dν* = f_0 dµ*. We assume the rest of the set-up as described above. If f_0 f ∈ W_{γ,∞}, then for 0 < δ < 1 and |Y| ≥ cn^{q+2γ}r^q log(nB_n/δ), the stated estimate holds with probability at least 1 − δ.

Remark 4.2. If {y_1, ..., y_M} is a random sample from some probability measure supported on X, s = Σ_{ℓ=1}^M f_0(y_ℓ), and we construct a sub-sample using the distribution that associates the mass f_0(y_j)/s with each y_j, then the probability of selecting points outside of the support of f_0 is 0. This leads to a sub-sample Y. If M ≥ cn^{q+2γ} log(nB_n/δ), then the Chernoff bound, Proposition B.1(b), can be used to show that |Y| is as large as stipulated in Theorem 4.1.
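The sub-sampling scheme of Remark 4.2 is easy to simulate. The sketch below (ours; the toy f_0 and all names are our choices) draws an ambient sample, re-weights it by f_0, and confirms that the sub-sample never leaves the support of f_0.

```python
import numpy as np

# Sketch of Remark 4.2: sub-sample y_1..y_M with probabilities f0(y_j)/s.
# Points where f0 vanishes have zero selection probability.

rng = np.random.default_rng(1)
M = 20000
y = rng.uniform(-1.0, 1.0, M)                        # ambient sample

x0, r = 0.25, 0.1
f0 = np.clip(1 - np.abs(y - x0)/(3*r), 0.0, None)    # toy f0 supported on B(x0, 3r)
s = f0.sum()

Y = rng.choice(y, size=2000, replace=True, p=f0/s)   # the sub-sample

assert np.all(np.abs(Y - x0) <= 3*r + 1e-12)         # never leaves supp(f0)
print("sub-sample is confined to B(x0, 3r)")
```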
Next, we state two inverse theorems.Our first theorem obtains accuracy on the estimation of the density f 0 using eignets instead of positive kernels.
Then, with F ≡ 1, the estimate (4.5) holds.

Remark 4.3. Unlike density estimation using positive kernels, there is no inherent limit on the accuracy predicted by (4.5) for the estimation of f_0.
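A toy version of the density estimation with F ≡ 1 (our sketch, on the circle with q = 1, with our own choice of f_0 and of the cutoff h) averages the localized kernel over the samples; since the kernel is a low-pass filter rather than a fixed-bandwidth positive kernel, a smooth density is recovered without the usual bias floor.

```python
import numpy as np

# Sketch: estimate f0 via fhat(x) = (1/M) sum_j Phi_n(x, y_j), which equals
# a smoothly filtered empirical Fourier expansion on the circle.

rng = np.random.default_rng(2)
f0 = lambda x: 1.0 + np.cos(x)          # density w.r.t. dx/(2*pi)

# rejection sampling from f0 (acceptance probability f0/2)
y = rng.uniform(0, 2*np.pi, 400000)
y = y[rng.uniform(0, 2, len(y)) < f0(y)][:100000]

def h(t):
    t = np.abs(np.asarray(t, float))
    s = np.clip(2*t - 1, 0.0, 1.0)
    e = lambda a: np.where(a > 0, np.exp(-1.0/np.where(a > 0, a, 1.0)), 0.0)
    return e(1 - s)/(e(1 - s) + e(s))

def fhat(x, n):
    k = np.arange(-n, n + 1)
    emp = np.exp(-1j*np.outer(k, y)).mean(axis=1)   # empirical Fourier coefficients
    return np.real(np.exp(1j*np.outer(x, k)) @ (h(k/n)*emp))

x = np.linspace(0, 2*np.pi, 64, endpoint=False)
err = np.max(np.abs(fhat(x, 8) - f0(x)))
assert err < 0.08                        # only sampling noise remains
print("max error of the kernel density estimate:", round(err, 4))
```

For this f_0 the filter passes the two nonzero Fourier modes unchanged, so the error is pure sampling noise of order M^{-1/2}.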
The following theorem gives a complete characterization of the local smoothness classes using eignets. In particular, part (b) of the following theorem gives a solution to the inverse problem of determining which smoothness class the target function belongs to near each point of X. In theory, this leads to a data-based detection of singularities and sparsity, analogous to what is assumed in [6], but in a much more general setting.

Theorem 4.3. Let f_0 ∈ C_0(X), f_0(x) ≥ 0 for all x ∈ X, and dν* = f_0 dµ* be a probability measure; let τ, F, and f be as described above. We assume the partition of unity and the product assumption. Let S ≥ q + 2, 0 < γ ≤ S, x_0 ∈ X, 0 < δ < 1. For each j ≥ 0, suppose that Y_j is a random sample from τ. (b) If there exists a ball B centered at x_0 for which (4.6) holds, then f_0 f ∈ W_{γ,∞,φ_0}(x_0).
Next, we prove some results about the system {φ k }.
Proof. The Gaussian upper bound with x = y implies the first assertion. The estimate (5.6) follows from a Tauberian theorem [13, Proposition 4.1]. The essential compactness now yields the required fast decrease for any R > 0. Next, we prove some properties of the operators σ_n and diffusion polynomials. The following proposition follows easily from Lemma 5.1 and Proposition 3.2 (cf. [35,33]).
(b) We have (5.8) and (5.11). The following lemma is well known; a proof is given in [34, Lemma 5.3].
We use P in place of f to obtain (5.16).
If the product assumption holds, then the following estimates are valid.

Proof. In view of essential compactness, Proposition 5.4 implies that for any P ∈ Π_n, 1 ≤ r ≤ ∞, ‖P‖_2 ≤ c_1 n^c ‖P‖_r. Therefore, using the Schwarz inequality, the Parseval identity, and Lemma 5.2, we conclude the first bound. Now, the product assumption implies that for p = 1, ∞, and λ_k, λ_j < n, there exists R_{j,k,n} ∈ Π_{A*n} such that the corresponding estimate holds for any R > 0, where c is the constant appearing in (5.20). The convexity inequality shows that (5.21) is valid for all p, 1 ≤ p ≤ ∞. So, using (5.20), we conclude the proof.

Local approximation by diffusion polynomials
In the sequel, we write g(t) = h(t) − h(2t), and define the operators τ_j accordingly; thus, τ_j(f) = σ_{2^j}(f) − σ_{2^{j−1}}(f) for j ≥ 1. It is clear from Theorem 5.1 that for any p, f = σ_1(f) + Σ_{j≥1} τ_j(f), with convergence in the sense of L^p.
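The wavelet-like decomposition rests on the telescoping identity h(t/2^J) = h(t) + Σ_{j=1}^J g(t/2^j), which is what makes the partial sums of the τ_j operators collapse to a single σ operator. The sketch below (ours, with an explicit smooth h of the stated type; the particular h is our choice) verifies the identity numerically.

```python
import numpy as np

# Sketch: check h(t/2^J) = h(t) + sum_{j=1..J} g(t/2^j) with g(t) = h(t) - h(2t).
# Each g(t/2^j) = h(t/2^j) - h(t/2^{j-1}), so the sum telescopes exactly.

def h(t):
    t = np.abs(np.asarray(t, float))
    s = np.clip(2*t - 1, 0.0, 1.0)
    e = lambda a: np.where(a > 0, np.exp(-1.0/np.where(a > 0, a, 1.0)), 0.0)
    return e(1 - s)/(e(1 - s) + e(s))

g = lambda t: h(t) - h(2*t)

t = np.linspace(0, 40, 2001)
J = 6
lhs = h(t / 2**J)
rhs = h(t) + sum(g(t / 2**j) for j in range(1, J + 1))
assert np.allclose(lhs, rhs, atol=1e-12)
print("telescoping identity behind the tau_j operators holds")
```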
We assume the partition of unity and the product assumption.
(a) If B is a ball centered at x_0, then the corresponding direct estimate holds. (b) If there exists a ball B centered at x_0 such that this estimate holds, then there exists a ball B centered at x_0 such that (6.5) holds.
Remark 6.1. In the case of Example 3.1, φ_0 ≡ 1. So, statements (b) and (c) in Theorem 6.1 provide necessary and sufficient conditions for f ∈ W_{γ,p}(x_0) in terms of the local rate of convergence of the globally defined operators σ_n(f), respectively the growth of the local norms of the operators τ_j. In the case of Example 3.2, it is shown in [31] that f ∈ W_{γ,p,φ_0} if and only if f ∈ W_{γ,p}. Therefore, statements (b) and (c) of Theorem 6.1 provide similar necessary and sufficient conditions for f ∈ W_{γ,p}(x_0) in this case as well.
The proof of Theorem 6.1 is routine, but we sketch a proof for the sake of completeness.
Proof of Theorem 6.1. Part (a) is easy to prove using the definitions. In the rest of this proof, we fix S > γ + q + 2. To prove part (b), let φ ∈ C^∞ be supported on B. Lemma 5.4 then yields a sequence of diffusion polynomials approximating fφ at the required rate. Thus, fφ ∈ W_{γ,p,φ_0} for every φ ∈ C^∞ supported on B, and part (b) is proved.
Let {Ψ_n : X × X → R} be a family of kernels (not necessarily symmetric). With a slight abuse of notation, we define, when possible, for any measure ν with bounded total variation on X, σ(Ψ_n; ν; f, x) = ∫_X Ψ_n(x, y) f(y) dν(y). As usual, we will omit the mention of ν when ν = µ*.

Corollary 6.1. Let the assumptions of Theorem 6.1 hold, and {Ψ_n : X × X → R} be a sequence of kernels (not necessarily symmetric) with the property that both of the following functions of n are fast decreasing.
(b) If there exists a ball B centered at x_0 for which the analogous estimate holds, then there exists a ball B centered at x_0 such that (6.13) holds.
Proof. In view of Lemma 5.3, the assumption about the functions in (6.11) implies that ‖σ(Ψ_n; f) − σ_n(f)‖_p is fast decreasing.

Quadrature formula
The purpose of this section is to prove the existence of admissible quadrature measures in the general set-up of this paper. The ideas are mostly developed already in our earlier works [29,40,13,14,36,8], but those always required an estimate on the gradient of diffusion polynomials; here, we use the Bernstein-Lipschitz condition (Definition 3.4) instead.
If C ⊂ K ⊂ X, we denote δ(K, C) = sup_{x∈K} ρ(C, x). In particular, by replacing C by C_1, we can always assume this normalization.

Theorem 7.1. We assume the Bernstein-Lipschitz condition. Let n > 0. (a) There exists a constant c with the following property: if δ(K_{2n}, C_1) ≤ c(ǫ) min(1/n, 1/B_{2n}), then there exist non-negative numbers W_k such that (7.3) and (7.4) hold for every P ∈ Π_n. (c) Under the sampling condition stated there, the statements (a) and (b) hold with µ*_{K_{2n}}-probability exceeding 1 − δ.

In order to prove Theorem 7.1, we first recall the following theorem, [38, Theorem 5.1], applied to our context. The statement of [38, Theorem 5.1] seems to require that µ* is a probability measure, but this fact is not required in the proof; it is required only that µ*(B(x, r)) ≥ cr^q for 0 < r ≤ 1.
Theorem 7.2. Let τ be a positive measure supported on a compact subset of X, ǫ > 0, A be a maximal ǫ-distinguishable subset of supp(τ), and K = B(A, 2ǫ). Then there exists a subset C ⊆ A ⊆ supp(τ) and a partition {Y_y}, y ∈ C, of K with each of the following properties.

(intersection property)
Let K_1 ⊆ K be a compact subset; then the stated estimate holds.

Proof of Theorem 7.1 (a), (b).
We observe first that it is enough to prove this theorem for sufficiently large values of n. In view of Proposition 5.3, we may choose n large enough so that for any P ∈ Π_n,

‖P‖_{1,µ*,X\K_{2n}} ≤ n^{−S}‖P‖_1 ≤ (ǫ/3)‖P‖_1.  (7.7)

In this proof, we will write δ = δ(K_{2n}, C_1), so that K_{2n} ⊂ B(C_1, δ). We use Theorem 7.2 with τ the measure associating the mass 1 with each element of C_1, and δ in place of ǫ. If A is a maximal δ-distinguishable subset of C_1, then we denote in this proof K = B(A, 2δ), and observe that K_{2n} ⊂ B(C_1, δ) ⊂ K ⊂ B(K_{2n}, 4δ). We obtain a partition {Y_y} of K as in Theorem 7.2. The volume property implies that each Y_y contains at least one element of C_1. We construct a subset C of C_1 by choosing exactly one element of Y_y ∩ C_1 for each y. We may then re-index C_1 so that, without loss of generality, C = {z_1, ..., z_N} for some N ≤ M, and re-index the partition accordingly, with Y_k ⊂ B(z_k, 36δ) and µ*(Y_k) ∼ δ^q. In particular, for any P ∈ Π_n,

‖P‖_1 − ‖P‖_{1,µ*,K} ≤ (ǫ/3)‖P‖_1.  (7.8)

We now define the weights accordingly. In this part of the proof, the constants denoted by c_1, c_2, ... will retain their value until (7.9) is proved. Let y ∈ X.
We let r ≥ δ, to be chosen later, and write in this proof N = {k : ρ(y, z_k) ≤ r}. Since r ≥ δ, and each Y_k ⊂ B(z_k, 36δ), there are at most c_1(r/δ)^q elements in N. Using the Bernstein-Lipschitz condition and the fact that ‖Φ_2n(·, y)‖_∞ ≤ c_2 n^q, we deduce that the total contribution of these terms is at most c_3(r/δ)^q n^q B_2n δ^{q+1} ≤ c_4 (nr)^q B_2n δ.
Therefore, we conclude using (7.9) that, together with (7.8), (7.4) holds. From the definition of the weights W_k, we have now proved (7.3), and thus completed the proof of part (a).
Having proved part (a), the proof of part (b) is by now a routine application of the Hahn-Banach theorem (cf. [29, 40, 13, 14]). We apply part (a) with ǫ = 1/2. Continuing the notation in the proof of part (a), we consider the sampling operator S : Π_n → R^N given by S(P) = (P(z_1), …, P(z_N)), let V be the range of this operator, and define a linear functional x* on V by x*(S(P)) = ∫_X P dµ*. The estimate (7.11) shows that the norm of this functional is ≤ 2. The Hahn-Banach theorem yields a norm-preserving extension X* of x* to R^N, which in turn can be identified with a vector (w_1, …, w_N) ∈ R^N. We set w_k = 0 for k ≥ N + 1. Formula (7.6) then expresses the fact that X* is an extension of x*. The preservation of norms yields the required bound on (w_1, …, w_N).

Part (c) of Theorem 7.1 follows immediately from the first two parts and the following lemma.
Lemma 7.1. Let ν* be a probability measure on X, K ⊂ supp(ν*) be a compact set, ǫ, δ ∈ (0, 1], C be a maximal ǫ/2-distinguishable subset of K, and {z_1, …, z_M} be random samples from the probability law ν*. Then

Proof. For x ∈ C, we consider the random variable which is equal to 1 if z_j ∈ B(x, ǫ/2), and to 0 otherwise. Using (B.2) with t = 1, we see that the right-hand side bounds the probability of failure. We set the right-hand side above equal to δ and solve for M to prove the lemma.
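The last step of the lemma can be sketched explicitly. The following is only a sketch, assuming that (B.2) has the standard Chernoff-type form for the probability that a ball receives no sample, and that ν*(B(x, ǫ/2)) ≥ c_1 ǫ^q as in the regularity condition recalled earlier for µ*; the constants c, c_1 are generic:

```latex
% For a fixed x \in C, each sample lands in B(x,\epsilon/2) with probability
% p = \nu^*(B(x,\epsilon/2)) \ge c_1\epsilon^q.  Assumed form of (B.2) with t=1:
\Pr\Bigl(\sum_{j=1}^{M} z_j = 0\Bigr)
  \le 2\exp\bigl(-c\,M\,\nu^*(B(x,\epsilon/2))\bigr)
  \le 2\exp(-c\,c_1\,M\,\epsilon^q).
% A union bound over the at most |C| centers, with the right-hand side set to \delta:
2\,|C|\exp(-c\,c_1\,M\,\epsilon^q) = \delta
\quad\Longleftrightarrow\quad
M = \frac{1}{c\,c_1\,\epsilon^q}\,\log\frac{2|C|}{\delta}.
```

Thus M need only grow logarithmically in 1/δ and in the number of centers.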
8 Proofs of the results in Section 4.
We assume the set-up as in Section 4. Our first goal is to prove the following theorem.
Theorem 8.1. Let τ, ν*, F, f be as described in Section 4. We assume the Bernstein-Lipschitz condition. Let 0 < δ < 1. We assume further that |F(y, ǫ)| ≤ 1 for all y ∈ X, ǫ ∈ Ω. There exist constants c_1, c_2 such that if

In order to prove this theorem, we record an observation. The following lemma is an immediate corollary of the Bernstein-Lipschitz condition and Proposition 5.3.

Lemma 8.1. Let the Bernstein-Lipschitz condition be satisfied. Then for every n > 0 and ǫ > 0, there exists a finite set C_{n,ǫ} ⊂ K_2n such that |C_{n,ǫ}| ≤ c B_n^q ǫ^{−q} µ*(B(K_2n, ǫ)) and for any

Proof of Theorem 8.1.
Let x ∈ X. We consider the random variables Z_j. Then, in view of (4.2), E_τ(Z_j) = σ_n(ν*; f)(x) for every j. Further, Proposition 3.2 shows that for each j, |Z_j| ≤ cn^q. Using (5.10) with ν* in place of ν, N = n, d = 0, we obtain a variance bound for each j. Therefore, the Bernstein concentration inequality (B.1) implies a tail bound for any t ∈ (0, 1). We now note that Z_j and σ_n(ν*; f) are all in Π_n, and take a finite set C_{n,1/2} as in Lemma 8.1. We set the right-hand side above equal to δ/|||ν*|||_{R,0} and solve for t to obtain (8.1) (with different values of c, c_1, c_2).

Before starting to prove results regarding eignets, we first record the continuity and smoothness of a "smooth kernel" G as defined in Definition 3.10. In this proof, let

s(t) = Σ_{k : λ_k < t} φ_k(x)², so that s(t) ≤ c t^q, t ≥ 1. (8.6)

If Λ ≥ 1, then, integrating by parts, we deduce (remembering that b is non-increasing) that for any x ∈ X, the relevant sum is ≤ c (2^j Λ)^q b(2^j Λ). Hence, for any x ∈ X, in view of the convexity inequality, this implies in turn that W G(x, ·) ∈ L^p for all x ∈ X, and (8.5) holds.
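The step of setting the right-hand side equal to δ/|||ν*|||_{R,0} and solving for t can be sketched as follows; this is only a sketch, assuming the concentration estimate has the generic sub-Gaussian shape dictated by the range bound |Z_j| ≤ cn^q recorded above (the exact constants are immaterial):

```latex
% Assumed shape of the bound after the union bound over C_{n,1/2}:
\Pr\Bigl(\max_{x\in C_{n,1/2}}
  \Bigl|\frac1M\sum_{j=1}^M Z_j - \sigma_n(\nu^*;f)(x)\Bigr| \ge t\Bigr)
  \le c\,|C_{n,1/2}|\exp\Bigl(-\frac{c_1 M t^2}{n^q}\Bigr).
% Setting the right-hand side equal to \delta/|||\nu^*|||_{R,0}
% and solving for t:
t = c_2\, n^{q/2}\sqrt{\frac{1}{M}\,
    \log\Bigl(\frac{c\,|C_{n,1/2}|\;|||\nu^*|||_{R,0}}{\delta}\Bigr)}.
```

The point is that t decays like M^{−1/2} up to a logarithmic factor in 1/δ and in |C_{n,1/2}|.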
A fundamental fact that relates the kernels Φ_n and the pre-fabricated eignets G_n is the following theorem.

Since {b(B*n)/b(n)} is fast decreasing, this completes the proof.
The theorems in Section 4 all follow from the following basic theorem.
Theorem 8.3. We assume the strong product assumption and the Bernstein-Lipschitz condition. With the set-up just described, we have (8.15). In particular, for f ∈ X^∞(X), we have (8.16).

Proof. Theorems 8.1 and 8.2 together lead to (8.15). Since σ_n(ν*; f) = σ_n(f_0 f), the estimate (8.16) follows from Theorem 5.1 used with p = ∞.
Proof of Theorem 4.1.

Definition 3.9. (a) A sequence {ν_n} of measures on X is called an admissible quadrature measure sequence if the sequence {|ν_n|(X)} has polynomial growth and

∫_X P dν_n = ∫_X P dµ*,  P ∈ Π_n, n ≥ 1. (3.16)

(b) A sequence {ν_n} of measures on X is called an admissible product quadrature measure sequence if the sequence {|ν_n|(X)} has polynomial growth and

∫_X P_1 P_2 dν_n = ∫_X P_1 P_2 dµ*,  P_1, P_2 ∈ Π_n, n ≥ 1.
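Definition 3.9 can be illustrated concretely on the torus, the periodic case mentioned in the introduction: the uniform discrete measures on equispaced nodes form an admissible product quadrature measure sequence. The following is a minimal numerical sketch, assuming Π_n = span{e^{ikx} : |k| < n} and µ* the normalized Lebesgue measure on [0, 2π); the variable names are ours:

```python
import numpy as np

# On the torus, the uniform measure on m >= 2n-1 equispaced points integrates
# every product P1*P2 with P1, P2 in Pi_n exactly, since the product has
# frequencies of absolute value < 2n-1.  These discrete measures therefore
# form an admissible product quadrature measure sequence.
n = 5
m = 2 * n - 1                          # 2n-1 nodes suffice
nodes = 2 * np.pi * np.arange(m) / m

rng = np.random.default_rng(0)
k = np.arange(-(n - 1), n)             # frequencies of Pi_n
c1 = rng.normal(size=k.size) + 1j * rng.normal(size=k.size)
c2 = rng.normal(size=k.size) + 1j * rng.normal(size=k.size)
P1 = lambda x: (c1 * np.exp(1j * np.outer(x, k))).sum(axis=1)
P2 = lambda x: (c2 * np.exp(1j * np.outer(x, k))).sum(axis=1)

# Exact integral of P1*P2 against normalized mu*: only k + k' = 0 survives.
exact = np.sum(c1 * c2[::-1])
quad = np.mean(P1(nodes) * P2(nodes))  # discrete quadrature with equal weights
assert abs(quad - exact) < 1e-10
```

The total mass of each discrete measure is 1, so {|ν_n|(X)} trivially has polynomial growth.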

Definition 3.10. A function b : [0, ∞) → (0, ∞) is called a smooth mask if b is non-increasing, and there exists B* = B*(b) ≥ 1 such that the mapping t → b(B*t)/b(t) is fast decreasing. A function G : X × X → R is called a smooth kernel if there exists a measurable function W = W(G) : X → R such that we have a formal expansion (with a smooth mask b)

W(y)G(x, y) = Σ_k b(λ_k) φ_k(x) φ_k(y),  x, y ∈ X. (3.18)

If m ≥ 1 is an integer, an eignet with m neurons is a function of the form x → Σ_{k=1}^m a_k G(x, y_k) for y_k ∈ X.

Example 3.6. In the manifold case, the notion of eignet includes all the examples stated in Section 2 with W ≡ 1, except for the example of the smooth ReLU function described in Example 2.3. In the Hermite case, (2.2) shows that the kernel G(x, y) = exp(−|x − (√3/2)y|_2^2), defined on R^q × R^q, is a smooth kernel, with λ_k = |k|_1, φ_k as in Example 2.5, and b(t) = 3^{−t/2}. The function W here is W(y) = exp(−|y|_2^2/4).

Remark 3.4. It is possible to relax the conditions on the mask in Definition 3.10. Firstly, the condition that b should be non-increasing is made only to simplify our proofs; it is not difficult to modify them without this assumption. Secondly, let b_0 : [0, ∞) → R satisfy |b_0(t)| ≤ b_1(t) for a smooth mask b_1 as stipulated in that definition. Then the function b_2 = b_0 + 2b_1 is a smooth mask, and so is b_1.
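The Hermite-case expansion can be checked numerically via the classical Mehler identity. The following is a sketch in one variable (q = 1), assuming the orthonormal Hermite functions of Example 2.5, and with the normalizing constant √(3/(2π)) written out explicitly (in the formal expansion (3.18) it is absorbed into the kernel):

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval

def phi(n, x):
    """Orthonormal Hermite function: H_n(x) e^{-x^2/2} / sqrt(2^n n! sqrt(pi))."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    norm = math.sqrt(2.0**n * math.factorial(n) * math.sqrt(math.pi))
    return hermval(x, c) * math.exp(-x * x / 2.0) / norm

x, y = 0.7, -0.3
# Left-hand side of the expansion with b(t) = 3^{-t/2}, lambda_n = n (q = 1):
lhs = sum(3.0**(-n / 2.0) * phi(n, x) * phi(n, y) for n in range(60))
# W(y) G(x, y) with G(x, y) = exp(-(x - (sqrt(3)/2) y)^2), W(y) = exp(-y^2/4),
# times the normalizing constant sqrt(3/(2*pi)) coming from Mehler's formula:
rhs = (math.sqrt(3.0 / (2.0 * math.pi))
       * math.exp(-(x - math.sqrt(3.0) / 2.0 * y) ** 2)
       * math.exp(-y * y / 4.0))
assert abs(lhs - rhs) < 1e-10
```

Since b(n) = 3^{−n/2} decays geometrically, truncating the series at 60 terms already agrees with the closed form to near machine precision.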
Let X be a smooth, compact, connected Riemannian manifold (without boundary), ρ the geodesic distance on X, µ* the Riemannian volume measure normalized to be a probability measure, {λ_k} the sequence of eigenvalues of the (negative) Laplace-Beltrami operator on X, and φ_k the eigenfunction corresponding to the eigenvalue λ_k; in particular, φ_0 ≡ 1. This example, of course, includes Examples 2.1, 2.2, and 2.3. An eignet in this context has the form x → Σ_{k=1}^m a_k G(x, y_k).

Let γ > 0 and w be a weight function. (a) For f ∈ L^p(X), we define

‖f‖_{Wγ,p,w} = ‖f‖_{p,µ*} + sup_{n>0} n^γ E_n(w; p, f), (3.7)

and note that

‖f‖_{Wγ,p,w} ∼ ‖f‖_{p,µ*} + sup_{n∈Z_+} 2^{nγ} E_{2^n}(w; p, f). (3.8)