Exact heat kernel on a hypersphere and its applications in kernel SVM

Many contemporary statistical learning methods assume a Euclidean feature space, however, the"curse of dimensionality"associated with high feature dimensions is particularly severe for the Euclidean distance. This paper presents a method for defining similarity based on hyperspherical geometry and shows that it often improves the performance of support vector machine compared to other competing similarity measures. Specifically, the idea of using heat diffusion on a hypersphere to measure similarity has been proposed and tested by \citet{Lafferty:2015uy}, demonstrating promising results based on an approximate heat kernel, however, the exact hyperspherical heat kernel hitherto remains unknown. In this paper, we derive an exact form of the heat kernel on a unit hypersphere in terms of a uniformly and absolutely convergent series in high-dimensional angular momentum eigenmodes. Being a natural measure of similarity between sample points dwelling on a hypersphere, the exact kernel often shows superior performance in kernel SVM classifications applied to text mining, tumor somatic mutation imputation, and stock market analysis. The improvement in classification accuracy compared with kernels based on Euclidean geometry may arise from ameliorating the curse of dimensionality on compact manifolds.


Introduction
As the techniques for analyzing large data sets continue to grow, diverse quantitative sciences -including computational biology, observation astronomy, and high energy physicsare becoming increasingly data driven. Moreover, modern business decision making critically depends on quantitative analyses such as community detection and consumer behavior prediction. Consequently, statistical learning has become an indispensable tool for modern data analysis. Data acquired from various experiments are usually organized into an n × m matrix, where the number n of features typically far exceeds the number m of samples. In this view, the m samples, corresponding to the columns of the data matrix, are naturally interpreted as points in a high-dimensional feature space R n . Traditional statistical modeling approaches often lose their power when the feature dimension is high; this performance decline is one of the symptoms of the "the curse of dimensionality." To ameliorate this problem, Lafferty and Lebanon proposed a multinomial interpretation of non-negative feature vectors and an accompanying transformation of the multinomial simplex to a hypersphere, demonstrating that using the heat kernel on this hypersphere may improve the performance of kernel support vector machine (SVM) (Lafferty and Lebanon, 2005;Hastie et al., 2013;Evgeniou and Pontil, 2001;Boser et al., 1992;Cortes and Vapnik, 1995;Freund and Schapire, 1999;Guyon et al., 1993). Despite the interest that this idea has attracted, only approximate heat kernel is known to date. We here present an exact form of the heat kernel on a hypersphere of arbitrary dimension and study its performance in kernel SVM classifications of text mining, genomic, and stock price data sets.
To date, sparse data clouds have been extensively analyzed in the flat Euclidean space endowed with the L 2 -norm using traditional statistical learning algorithms, including KMeans, hierarchical clustering, SVM, and neural network (Hastie et al., 2013;Kaufman and Rousseeuw, 2009;Evgeniou and Pontil, 2001;Boser et al., 1992;Cortes and Vapnik, 1995;Freund and Schapire, 1999;Guyon et al., 1993); however, the flat geometry of the Euclidean space often poses severe challenges in clustering and classification problems when the data clouds take non-trivial geometric shapes or class labels are spatially mixed. Manifold learning and kernel-based embedding methods attempt to address these challenges by estimating the intrinsic geometry of a putative submanifold from which the data points were sampled and by embedding the data into an abstract Hilbert space using a nonlinear map implicitly induced by the chosen kernel, respectively (Belkin et al., 2006;Aronszajn, 1950;Paulsen and Raghupathi, 2016). The geometry of these curved spaces may then provide novel information about the structure and organization of original data points.
Heat equation on the data submanifold or transformed feature space offers an especially attractive idea of measuring similarity between data points by using the physical model of diffusion of relatedness ("heat") on curved space, where the diffusion process is driven by the intrinsic geometry of the underlying space. Even though such diffusion process has been successfully approximated as a discrete-time, discrete-space random walk on complex networks, its continuous formulation is rarely analytically solvable and usually requires complicated asymptotic expansion techniques from differential geometry (Berger et al., 1971). An analytic solution, if available, would thus provide a valuable opportunity for comparing its performance with approximate asymptotic solutions and rigorously testing the power of heat diffusion for geometric data analysis.
Given that a Riemannian manifold of dimension d is locally homeomorphic to R d , and that the heat kernel is a solution to the heat equation with a point source initial condition, one may assume in the short diffusion time limit (t 1) that most of the heat is localized within the vicinity of the initial point and that the heat kernel on a Riemannian manifold locally resembles the Euclidean heat kernel. This idea forms the motivation behind the parametrix expansion, where the heat kernel in curved space is approximated as a product of the Euclidean heat kernel in normal coordinates and an asymptotic series involving the diffusion time and normal coordinates. In particular, for a unit hypersphere, the parametrix expansion in the limit t 1 involves a modified Euclidean heat kernel with the Euclidean distance x replaced by the geodesic arc length θ. Computing this parametrix expansion is, however, technically challenging; even when the computation is tractable, applying the approximation directly to high-dimensional clustering and classification problems may have limitations. For example, in order to be able to group samples robustly, one needs the diffusion time t to be not too small; otherwise, the sample relatedness may be highly localized and decay too fast away from each sample. Moreover, the leading order term in the asymptotic series is an increasing function of θ and diverges as θ approaches π, yielding an incorrect conclusion that two antipodal points are highly similar. For these reasons, the machine learning community has been using only the Euclidean diffusion term without the asymptotic series correction; how this resulting kernel, called the parametrix kernel (Lafferty and Lebanon, 2005), compares with the exact heat kernel on a hypersphere remains an outstanding question, which is addressed in this paper.
Analytically solving the diffusion equation on a Riemannian manifold is challenging (Hsu, 2002;Varopoulos, 1987;Berger et al., 1971). Unlike the discrete analogues -such as spectral clustering (Ng et al., 2002) and diffusion map (Coifman and Lafon, 2006), where eigenvectors of a finite dimensional matrix can be easily obtained -the eigenfunctions of the Laplace operator on a Riemannian manifold are usually intractable. Fortunately, the high degree of symmetry of a hypersphere allows the explicit construction of eigenfunctions, called hyperspherical harmonics, via the projection of homogeneous polynomials (Atkinson and Han, 2012;Wen and Avery, 1985). The exact heat kernel is then obtained as a convergent power series in these eigenfunctions. In this paper, we compare the analytic behavior of this exact heat kernel with that of the parametrix kernel and analyze their performance in classification.

Results
The heat kernel is the fundamental solution to the heat equation (∂ t − ∆ x )u(x, t) = 0 with an initial point source (Stone and Goldbart, 2009); the amount of heat emanating from the source that has diffused to a neighborhood during time t > 0 is used to measure the similarity between the source and proximal points. The heat conduction depends on the geometry of feature space, and the main idea behind the application of hyperspherical geometry to data analysis relies on the following projective map from a non-negative feature space to a unit hypersphere: We will henceforth denote the image of a feature vector x under the hyperspherical projective map asx. The notion of neighborhood requires a well-defined measurement of distance on the hypersphere, which is naturally the great arc length -the geodesic on a hypersphere. Both parametrix approximation and exact solution employ the great arc length, which is related to the following definition of cosine similarity: Definition 2 The generic cosine similarity between two feature vectors x, y ∈ R n \ {0} is where · is the Euclidean L 2 -norm, and θ ∈ [0, π] is the great arc length on S n−1 . For unit vectorsx = ϕ(x) andŷ = ϕ(y) obtained from non-negative feature vectors x, y ∈ R n ≥0 \ {0} via the hyperspherical projective map, the cosine similarity reduces to the dot product cos θ = x ·ŷ; the non-negativity of x and y guarantees that θ ∈ [0, π/2] in this case.

Parametrix expansion
The parametrix kernel K prx previously used in the literature is just a Gaussian RBF function with θ = arccosx ·ŷ as the radial distance (Lafferty and Lebanon, 2005): The parametrix kernel is a non-negative function defined for t > 0 and attaining global maximum 1 at θ = 0.
The normalization factor (4πt) − n−1 2 is numerically unstable as t ↓ 0 and complicates hyperparameter tuning; as a global scaling factor of the kernel can be absorbed into the misclassification C-parameter in SVM, this overall normalization term is ignored in this paper. Importantly, the parametrix kernel K prx is merely the Gaussian multiplicative factor without any asymptotic expansion terms in the full parametrix expansion G prx of the exact heat kernel (Lafferty and Lebanon, 2005;Berger et al., 1971). The full G prx suffers from numerous problems, as described in detail in Appendix B.5 and briefly sketched here. For example, the zeroth order correction term u 0 = (sin θ/θ) − n−2 2 diverges at θ = π; this behavior is not a major problem if θ is restricted to the range [0, π 2 ]. Moreover, G prx is also unphysical as θ ↓ 0, when (n − 2)t > 3; this condition on dimension and time is obtained by expanding e −θ 2 /4t = 1 − θ 2 4t + O(θ 4 ) and (sin θ/θ) − n−2 2 = 1 + θ 2 12 (n − 2) + O(θ 3 ), and noting that the leading order θ 2 term in the product of the two factors is a non-decreasing function of distance θ when n−2 12 ≥ 1 4t , corresponding to the unphysical situation of nearby points being hotter than the heat source itself. As the feature dimension n is typically very large, the restriction (n − 2)t < 3 implies that we need to take the diffusion time to be very small, thus making the similarity measure captured by G prx decay too fast away from each data point for use in clustering applications.
In this work, we computed the first and second order correction terms, denoted u 1 and u 2 in Eq. 8 and Eq. 9, respectively, and found that they both diverged at θ = π and failed to remove the aforementioned unphysical behavior as θ ↓ 0 in high dimensions. Furthermore, these correction terms vanished identically when the intrinsic dimension d ≡ n − 1 was 1 and became constant when d = 3. (In Appendix B.5, we used r to denote the geodesic distance θ and d the intrinsic dimension n − 1.) Hence, the only remaining part of G prx still applicable to SVM classification is the Gaussian factor, which is clearly not a heat kernel on the hypersphere. The failure of this perturbative expansion using the Euclidean heat kernel as a starting point suggests that diffusion in R d and S d are fundamentally different and that the exact hyperspherical heat kernel derived from a non-perturbative approach will likely yield better insights into the diffusion process.

Exact hyperspherical heat kernel
By definition, the exact heat kernel G ext (x,ŷ; t) is the fundamental solution to heat equation ∂ t u +L 2 u = 0 where −L 2 is the hyperspherical Laplacian (Stone and Goldbart, 2009;Grigor'yan, 1999;Hsu, 2002;Varopoulos, 1987). In the language of operator theory, G ext (x,ŷ; t) is an integral kernel, or Green's function, for the operator exp{−L 2 t} and has an associated eigenfunction expansion. BecauseL 2 and exp{−L 2 t} share the same eigenfunctions, obtaining the eigenfunction expansion of G ext (x,ŷ; t) amounts to solving for the complete basis of eigenfunctions ofL 2 . The spectral decomposition of the Laplacian is in turn facilitated by embedding S n−1 in R n and utilizing the global rotational symmetry of S n−1 in R n . The Euclidean space harmonic functions, which are the solutions to the Laplace equation ∇ 2 u = 0 in R n , can be projected to the unit hypersphere S n−1 through the usual separation of radial and angular variables (Atkinson and Han, 2012;Wen and Avery, 1985). In this formalism, the hyperspherical Laplacian −L 2 on S n−1 naturally arises as the angular part of the Euclidean Laplacian on R n , andL 2 can be interpreted as the squared angular momentum operator in R n (Wen and Avery, 1985).
The resulting eigenfunctions ofL 2 are known as the hyperspherical harmonics and generalize the usual spherical harmonics in R 3 to higher dimensions. Each hyperspherical harmonic is equipped with a triplet of parameters or "quantum numbers" ( , {m i }, α): the degree , magnetic quantum numbers {m i } and α = n 2 − 1. In the eigenfunction expansion of exp{−L 2 t}, we use the addition theorem of hyperspherical harmonics to sum over the magnetic quantum number {m i } and obtain the following main result: Theorem 1 The exact hyperspherical heat kernel G ext (x,ŷ; t) can be expanded as a uniformly and absolutely convergent power series in the intervalx ·ŷ ∈ [−1, 1] and for t > 0, where C α (w) are the Gegenbauer polynomials and A S n−1 = 2π is the surface area of S n−1 . Since the kernel depends onx andŷ only throughx ·ŷ, we will write G ext (x,ŷ; t) = G ext (x ·ŷ; t).
The proof of the theorem can be found in Appendix B.6. As before, we will rescale the kernel by self-similarity and define: Definition 4 The exact kernel K ext (x,ŷ; t) is the exact heat kernel normalized by selfsimilarity: which is defined for t > 0, is non-negative, and attains global maximum 1 atx ·ŷ = 1.
Note that unlike K prx (x,ŷ; t), K ext (x,ŷ; t) explicitly depends on the feature dimension n. In general, SVM kernel hyperparameter tuning can be computationally costly for a data set with both high feature dimension and large sample size. In particular, choosing an appropriate diffusion time scale is an important challenge. On the one hand, choosing a very large value of t will make the series converge rapidly; but, then, all points will become uniformly similar, and the kernel will not be very useful. On the other hand, a too small value of t will make most data pairs too dissimilar, again limiting the applicability of the kernel. In practice, we thus need a guideline for a finite time scale at which the degree of "self-relatedness" is not singular, but still larger than the "relatedness" averaged over the whole hypersphere. Examining the asymptotic behavior of the exact heat kernel in high feature dimension n shows that an appropriate time scale is t ∼ O(log n/n); in this regime the numerical sum in Theorem 1 satisfies a stopping condition at low orders in and the sample points are in moderate diffusion proximity to each other so that they can be accurately classified (Appendix B.6). Figure 1(a) illustrates the diffusion process captured by our exact kernel K ext (x,ŷ; t) in three feature dimensions at time t = t * log 3/3, for t * = 0.5, 1.0, 2.0. In Figure 1(b), we systematically compared the behavior of (1) dimension-independent parametrix kernel K prx at time t = 0.5, 1.0, 2.0 and (2) exact kernel K ext on S n−1 at t = t * log n/n for t * = 0.5, 1.0, 2.0 and n = 3, 100, 200. By symmetry, the slope of K ext vanished at the south pole θ = π for any time t and dimension n. In sharp contrast, K prx had a negative slope at θ = π, again highlighting a singular behavior of the parametrix kernel. The "relatedness" measured by K ext at the sweet spot t = log n/n was finite over the whole hypersphere with sufficient contrast between nearby and far away points. Moreover, the characteristic behavior of K ext at t = log n/n did not change significantly for different values of the feature dimension n, confirming that the optimal t for many classification applications will likely reside near the "sweet spot" t = log n/n.

SVM classifications
We evaluated the performance of kernel SVM using the on two independent data sets: (1) WebKB data of websites from four universities (WebKB-4-University) (Craven et al., 1998), and (2) glioblastoma multiforme (GBM) mutation data from The Cancer Genome Atlas (TCGA) with 5-fold cross-validations (CV) (Appendix A). The WebKB-4-University data contained 4199 documents in total comprising four classes: student (1641), faculty (1124), course (930), and project (504); in our analysis, however, we selected an equal number of representative samples from each class, so that the training and testing sets had balanced classes. Table 1 shows the average optimal prediction accuracy scores of the five kernels for a varying number of representative samples, using 393 most frequent word features (Appendix A). The exact kernel outperformed the Gaussian RBF and parametrix kernel, reducing the error by 41% ∼ 45% and by 1% ∼ 7%, respectively. Changing the feature dimension did not affect the performance much (Table 2). In the TCGA-GBM data, there were 497 samples, and we aimed to impute the mutation status of one gene -i.e., mutant or wild-type -from the mutation counts of other genes. For each imputation target, we first counted the number m r of mutant samples and then selected an equal number of wild-type samples for 5-fold CV. Imputation tests were performed for top 102 imputable genes (Appendix A).  Table 2: WebKB-4-University Document Classification. Comparison of kernel SVMs on the WebKB-4-University data with a fixed sample size m r , but varying feature dimension n. To account for the randomness in selecting the representative samples using KMeans (Appendix A), we performed fives runs of representative selection, and then performed CV using the training and test sets obtained from each run. Finally, we averaged the five mean CV scores to assess the performance of each classifier on the imbalanced WebKB-4-University data set. The exact (ext) and cosine (cos) kernels outperformed the Gaussian RBF (rbf) and linear (lin) kernels in various feature dimensions n = 393, 726, 1023, and 1312, with fixed and balanced class size m r = 400. A word was selected as a feature if its total count was greater than 1/10, 1/20, 1/30 or 1/40 times the total number of web pages in the WebKB-4-University data set, with the different thresholds corresponding to the different rows in the table. The exact kernel reduced the errors of Gaussian RBF and parametrix kernels by 45 ∼ 48% and 1 ∼ 6%, respectively; the cosine kernel reduced the errors of linear kernel by 36 ∼ 49%. scores for 5 biologically interesting genes known to be important for cancer (Hanahan and Weinberg, 2011): 1. ZMYM4 (m r = 33) is implicated in an antiapoptotic activity; (Smedley, 1999;Shchors et al., 2002); 2. ADGRB3 (m r = 37) is a brain-specific angiogenesis inhibitor (Zohrabian et al., 2007;Kaur et al., 2010;Hamann et al., 2015);  Table 3: TCGA-GBM Genotype Imputation. Performance test on binary classification of mutant vs. wild-type in TCGA-GBM mutation count data. The rows are different genes, the mutation statuses of which were imputed using m r samples in each mutant and wild-type class. The entries show the average of optimal 5-fold crossvalidation mean accuracy scores of five runs.
For the remaining genes, the exact kernel generally outperformed the linear, cosine and parametrix kernels ( Figure 2). However, even though the exact kernel dramatically outperformed the Gaussian RBF in the WebKB-4-University classification problem, the advantage of the exact kernel in this mutation analysis was not evident (Figure 2). It is possible that the radial degree of freedom n i=1 x i in this case, corresponding to the genomewide mutation load in each sample, contained important covariate information not captured by the hyperspherical heat kernel. The difference in accuracy between the hyperspherical kernels (cos, prx, and ext) and the Euclidean kernels (lin and rbf) also hinted some weak dependence on class size m r (Figure 2), or equivalently the sample size m = 2m r . In fact, the level of accuracy showed much stronger correlation with the "effective sample size"m related to the empirical Vapnik-Chervonenkis (VC) dimension (Vapnik, 2013;Boser et al., 1992;Guyon et al., 1993;Vapnik et al., 1994;Paliouras et al., 2003) of a kernel SVM classifier (Figure 3(a-e)); moreover, the advantage of the exact kernel over the Guassian RBF kernel grew with the effective sample size ratiom cos /m lin (Figure 3(f), Appendix B.6.5).
By construction, our definition of the hyperspherical map exploits only the positive portion of the whole hypersphere, where the parametrix and exact heat kernels seem to have similar performances. However, if we allow the data set to assume negative values, i.e. the feature space is the usual R n \{0} instead of R n ≥0 \{0}, then we may apply the usual projective map, where each vector in the Euclidean space is normalized by its L 2 -norm. As shown in Figure 1(b), the parametrix kernel is singular at θ = π and qualitatively deviates from the exact kernel for large values of θ. Thus, when data points populate the whole ext/rbf R 2 = 0. 8%, y = − 1. 9 × 10 −4 x + 0. 98 Figure 2: Comparison of the classification accuracy of SVM using linear (lin), cosine (cos), Gaussian RBF (rbf), parametrix (prx), and exact (ext) kernels on TCGA mutation count data. The plots show the ratio of accuracy scores for two different kernels. For visualization purpose, we excluded one gene with m r = 250. The ratios rbf/lin, prx/cos, and ext/cos were essentially constant in class size m r and greater than 1; in other words, the Gaussian RBF (rbf) kernel outperformed the linear (lin) kernel, while the exact (ext) and parametrix (prx) kernels outperformed the cosine (cos) kernel uniformly over all values of class size m r . However, the more negative slope in the linear fit of cos/lin hints that the accuracy scores of cosine and linear kernels may depend on the class size m r ; the exact kernel also tended to outperform Gaussian RBF kernel when m r was small. The scatter plots of accuracy scores for cosine (cos), linear (lin), exact (ext), and Gaussian RBF (rbf) kernels vs. the effective sample sizem = 2m r /µ * VC ; the accuracy scores of exact and cosine kernels increased with the effective sample size, whereas those of Gaussian RBF and linear kernels tended to decrease with the effective sample size. (f) The ratio of ext vs. rbf accuracy scores is positively correlated with the ratiom cos /m lin of effective sample sizes.
hypersphere, we expect to find more significant differences in performance between the exact and parametrix kernels. For example, Table 4 shows the kernel SVM classifications of 91 S&P500 Financials stocks against 64 Information Technology stocks (m = 155) using their log-return instances between January 5, 2015 and November 18, 2016 as features. As long as the number of features was greater than sample size, n > m, the exact kernel outperformed all other kernels and reduced the error of Gaussian RBF by 29 ∼ 51% and that of parametrix kernel by 17 ∼ 51%.  We uniformly subsampled the instances to generate variations in the feature dimension n. Here, we report the mean 5-fold CV accuracy score for each kernel. Although the two classes were slightly imbalanced, all scores were much larger than the "random score" 91/155 ≈ 58.7%, calculated from the majority class size and sample size. For n > m, the exact (ext) kernel outperformed all other kernels and reduced the errors of Gaussian RBF (rbf) and parametrix (prx) kernels by 29 ∼ 51% and 17 ∼ 51%, respectively. When n < m, the exact kernel started to lose its advantage over the Gaussian RBF kernel.

Discussion
This paper has constructed the exact hyperspherical heat kernel using the complete basis of high-dimensional angular momentum eigenfunctions and tested its performance in kernel SVM. We have shown that the exact kernel and cosine kernel, both of which employ the hyperspherical projections, often outperform the Gaussian RBF and linear kernels. The advantage of using hyperspherical kernels likely arises from the hyperspherical projections of feature space, and the exact kernel may further improve the decision boundary flexibility of the raw cosine kernel. To be specific, the hyperspherical maps project out the less informative radial degree of freedom in a nonlinear fashion and compactify the Euclidean feature space into a unit hypersphere where all data points may then be enclosed within a finite radius. By contrast, our numerical estimations using TCGA-GBM data show that for linear kernel SVM, the margin M tends to be much smaller than the data range R in order to accommodate the separation of strongly mixed data points of different class labels; as a result, the ratio R/M was much larger than that for cosine kernel SVM. This insight may be summarized by the fact that the upper bound on the empirical VC-dimension of linear kernel SVM tends to be much larger than that for cosine kernel SVM, especially in high dimensions, suggesting that the cosine kernel SVM is less sensitive to noise and more generalizable to unseen data. The exact kernel is equipped with an additional tunable hyperparameter, namely the diffusion time t, which adjusts the curvature of nonlinear decision boundary and thus adds to the advantage of hyperspherical projections. Moreover, the hyperspherical kernels often have larger effective sample sizes than their Euclidean counterparts and, thus, may be especially useful for analyzing data with a small sample size in high feature dimensions. Theoretically, the failure of the parametrix expansion of heat kernel, especially in dimensions n 3, signals a dramatic difference between diffusion in a non-compact space and that on a compact manifold; it is possibly related to the fact that Brownian motions on compact manifolds are recurrent in any dimension, but transient in R n for n ≥ 3. It remains to be examined how these differences in diffusion process, random walk and topology between non-compact Euclidean spaces and compact manifolds, like a hypersphere, contribute to ameliorating the "curse of dimensionality," as hinted by the results of this paper.
The WebKB-4-University raw webpage data were downloaded from http://www.cs.cmu. edu/afs/cs/project/theo-20/www/data/ and processed with the python packages Beautiful Soup and Natural Language Toolkit (NLTK). Our feature extraction excluded punctuation marks and included only letters and numerals where capital letters were all converted to lower case and each individual digit 0-9 was represented by a "#." Very infrequent words, such as misspelled words, non-English words, and words mixed with special characters, were filtered out. We selected top 393 most frequent words as features in our classification tests; the cutoff was chosen to select frequent words whose counts across all webpage documents are greater than 10% of the total number of documents. There were 4199 documents in total: student (1641), faculty (1124), course (930), and project (504).
The TCGA-GBM data were downloaded from the GDC Data Portal under the name TCGA-GBM Aggregated Somatic Mutation. The mutation count data set was extracted from the MAF file, while ignoring the detailed types of mutations and counting only the total number of mutations in each gene. Very infrequently, mutated genes were filtered out if the total number of mutations in one gene across all samples is less than 10% of the total number of samples (m = 497 samples and n = 439 genes). We imputed the mutation status of one gene, mutant or wild-type, from the mutation counts of the remaining genes. The most imputable genes were selected using 5-fold cross-validation linear kernel SVM. Most of the mutant and wild-type samples were highly unbalanced, the ratio being typically around 1 : 9; therefore, unthresholded area-under-the-curve (AUC) of the receiver operating characteristic (ROC) curve was used to quantify the classification performance of the linear kernel SVM. Mutated genes with AUC greater than 60% were selected for the subsequent imputation tests.
To balance the sample size between classes, we performed K-means clustering of samples within each class, with a specified number m r of centroids and took the samples closest to each centroid as representatives. For the WebKB document classifications, we used m r ≤ min{m student , m faculty , m course , m project }, and K-means clustering was performed in each of the four classes separately; for the TCGA-GBM data, m r was chosen to be the number of samples in each mutant (minority) class, and K-means clustering was performed in the wild-type (majority) class. Since K-means might depend on the random initialization, we performed the clustering 50 times and selected the top m r most frequent representatives. Five-fold stratified cross-validations (CV) were performed on the resulting balanced data sets, where training and test samples were drawn without replacement from each class. The mean CV accuracy scores across the five folds were recorded.

B.1 Laplacian on a Riemannian manifold
The Laplacian on a Riemannian manifold M with metric g µν is the operator where g = | det g|, and the Einstein summation convention is used. It can be also written in terms of the covariant derivative ∇ µ as The covariant derivative satisfies the following properties αβ is the Levi-Civita connection satisfying Γ λ αβ = Γ λ βα and ∇ λ g µν = 0. To show Eq. 2, recall that the Levi-Civita connection is uniquely determined by the geometry, or the metric tensor, as Using the formula for determinant differentiation we can thus write Γ λ λµ = ∂ µ log √ g.
Hence, for any f ∈ C ∞ (M), proving the equivalence of Eq. 1 and Eq. 2.

B.3 Laplacian in geodesic polar coordinates
In geodesic polar coordinates (r, ξ) around a point, one can show using Eq. 2 that the Laplacian on a d-dimensional Riemannian manifold M takes the form is the Laplacian induced on the geodesic sphere S d−1 r of radius r. If function f depends only on the geodesic distance r from the fixed point, then where denotes the radial derivative. For the special case when M is S n−1 , the coordinates θ 1 , . . . , θ n−1 described above correspond to the geodesic polar coordinates around the north pole, with r = θ 1 . From Eq. 3, we get log g(x) = (n − 2) log sin r + (n − 3) log sin θ 2 + · · · + log sin θ n−2 .
Note that only the first terms contributes to the radial derivative.

B.4 Euclidean heat kernel
Heat kernels in general are solutions to the heat equation with a point-source (Dirac delta) initial condition. The heat kernel in R d is easily found to be where K is known as the Gaussian RBF kernel with parameter γ = 1/4t. G(x, y; t) is the solution to the heat equation satisfying the initial condition G(x, y; 0) = δ(x−y). Note that formally, G(x, y; t) = e t∆ δ(x − y); using the Fourier transform representation of the right-hand side then yields the expression in Eq. 5.

B.5 Parametrix expansion
Because of the second term containing metric derivatives in Eq. 4, the canonical solution on a manifold M of dimension d does not satisfy the heat equation. That is, (∆−∂ t )G(r, t) = 0. For sufficiently small time t and geodesic distance r, the parametrix expansion of the heat kernel proposes an approximate solution where the functions u i should be found such that K p satisfies the heat equation to order t p−d/2 , which is small for t 1 and p > d/2; more precisely, we seek u i such that Taking time derivative of K p yields while the Laplacian of K p is ∆K p = (u 0 + u 1 t + · · · + u p t p ) ∆G+G∆ (u 0 + u 1 t + · · · + u p t p )+2G (u 0 + u 1 t + · · · + u p t p ) .
One can easily compute The left-hand side of Eq. 6 is thus equal to G multiplied by and we need to solve for u i such that all the coefficients of t q in this expression, for q < p, vanish. For q = −1, we need to solve Integrating with respect to r yields where we implicitly take only the radial part of log √ g. Thus, we get as the zeroth-order term in the parametrix expansion. Using this expression of u 0 , the remaining terms become r (u 1 + u 2 t + · · · ) (log u 0 ) − u 1 + u 2 t + · · · + + (∆u 0 + t∆u 1 + · · · ) − (u 1 + 2u 2 t + · · · ) , and we obtain the recursion relation Algebraic manipulations show that from which we get u k+1 r k+1 u 0 = r (k+1)−1 u −1 0 ∆u k .
Integrating this equation and rearranging terms, we finally get Setting k = 0 in this recursion equation, we find the second correction term to be From our previously obtained solution for u 0 , we find Substituting these expressions into the recursion relation for u 1 yields For the hypersphere S d , where g = const. × sin 2(d−1) r, we have Thus, Notice that u 1 (r) = 0 when d = 1 and u 1 (r) = u 0 (r) when d = 3. For d = 2, u 1 /u 0 is an increasing function in r and diverges to ∞ at r = π. By contrast, for d > 3, u 1 /u 0 is a decreasing function in r and diverges to −∞ at r = π; u 1 /u 0 is relatively constant for r < π and starts to decrease rapidly only near π. Therefore, the first order correction is not able to remove the unphysical behavior near r = 0 in high dimensions where, according to the first order parametrix kernel, the surrounding area is hotter than the heat source. Next, we apply Eq. 7 again to obtain u 2 as After some cumbersome algebraic manipulations, we find Again, d = 1 and d = 3 are special dimensions, where u 2 (r) = 0 for d = 1, and u 2 (r) = u 0 /2 for d = 3; for other dimensions, u 2 (r) is singular at both r = 0 and π. Note that on S 1 , the metric in geodesic polar coordinate is g 11 = 1, so all parametrix expansion coefficients u k (r) must vanish identically, as we have explicitly shown above. In high dimensions, the divergence of u 1 /u 0 and u 2 /u 0 at r = π is not a major problem, as we expect the expansion to be valid only in the vicinity r ↓ 0; however, the divergence of u 2 /u 0 at r = 0 (to −∞ in high dimensions) is pathological, and thus, we truncate our approximation to O(t 2 ). Since u 1 (r) is not able to correct the unphysical behavior of the parametrix kernel near r = 0 in high dimensions, we conclude that the parametrix approximation fails in high dimensions.

B.6 Exact hyperspherical heat kernel
We treat the hypersphere S n−1 as being embedded in R n and use the induced metric on S n−1 to define the Laplacian. The Laplacian in R n takes the usual form where the differential operatorL 2 depends only on the angular coordinates. −L 2 is the spherical Laplacian operator. (Wen and Avery, 1985) B.6.1 Spherical Laplacian and its eigenfunctions For n = 3, the Laplacian on R 3 is whereL 2 is the squared orbital angular momentum operator in quantum mechanics. Restricted to r = 1, the Laplacian reduces to the spherical Laplacian on S 2 , which is exactly the operator −L 2 whose eigenfunctions are the spherical harmonics Y lm (θ, φ) with eigenvalue − ( + 1). In this setting, Y lm (θ, φ) can be viewed as the angular component of homogeneous harmonic polynomials in R 3 , and this perspective will be used in the subsequent discussion of hyperspherical Laplacian. By convention, our spherical harmonics satisfy the normalization condition m=− |Y m (θ, φ)| 2 = 2 + 1 4π and the completeness condition Analogous to the Euclidean case, applying the evolution operator exp(−L 2 t) on the initial delta distribution yields the following eigenfunction expansion of the heat kernel on S 2 : Applying the addition theorem of spherical harmonics, B.6.2 Generalization to S n−1 Similar to the spherical harmonics, the hyperspherical harmonics arise as the angular part of degree-homogeneous harmonic polynomials h that satisfy ∆h = 0. In spherical coordinates (r, ξ), we can decompose h (x) = r Ỹ (ξ) (Atkinson and Han, 2012;Wen and Avery, 1985), whereỸ (ξ) is the desired hyperspherical harmonic. Using the spherical coordinate Laplacian in R n shown in Eq. 10, we get which can be simplified to yield the following eigenvalue equation for the hyperspherical Laplacian: where the set {m} indexes the degenerate eigenstates.

B.6.3 Eigenfunction expansion and proof of convergence
To construct the eigenfunction expansion of the exact heat kernel and prove its convergence, we need the following lemmas (Atkinson and Han, 2012;Wen and Avery, 1985;Lorch, 1984): The hyperspherical harmonics are complete on S n−1 and resolve the δ-function Lemma 2 The hyperspherical harmonics satisfy the generalized addition theorem where C ν (w) are the Gegenbauer polynomials and A S n−1 = 2π n/2 /Γ n 2 is the surface area of S n−1 .
Proof. Applying the addition theorem to Eq. 11, we get Next, we apply time evolution operator e −tL 2 on this initial state to generate the heat kernel To show that it is a uniformly and absolutely convergent series for t > 0, note that where w =x ·ŷ. The terms involving Gegenbauer polynomials can be bounded by using Lemma 3 as C n−2 2 (w) ≤ w 2 Γ( + n − 2) Γ(n − 2)Γ( + 1) We thus have But, in the large limit, we have for any t > 0. The sequence {Q } is thus convergent, and hence, the Weiestrass M-test implies that the eigenfunction expansion of the heat kernel is uniformly and absolutely convergent in the indicated intervals. Q.E.D.

B.6.4 The sweet spot of t
Choosing an appropriate diffusion time t for the heat kernel is important for machine learning applications. Here, we use the degree of self-similarity measured by the heat kernel as a function of t, and propose a choice for which the self-similarity is neither too large nor too small. If t is too large, then the self-similarity is roughly the uniform similarity 1/A S n−1 , thereby losing contrast between neighbors and outliers. By contrast, as t approaches 0, the self-similarity becomes infinite, and the sense of neighborhood becomes too localized. We thus need an intermediate value of t, for which the self-similarity interpolates between the two limits. The self-similarity is a special value of the heat kernel .
Because the series converges rapidly for sufficiently large t, we can truncate the series at = max ; i.e.
In the large n limit, we can bound the sum as To keep the self-similarity finite, but larger than the uniform similarity, suggests the choice for t of order log n/n, at which the self-similarity is roughly e/A S n−1 . We thus search for an optimal value of t around log n/n.

B.6.5 SVM Classification
In the main text, we denoted the parametrix and exact heat kernels normalized by selfsimilarity as the "parametrix kernel" and "exact kernel," respectively. We then used the linear (lin), Gaussian RBF (rbf), cosine (cos), parametrix (prx), and exact (ext) kernels in SVM to (1) classify WebKB-4-University web pages into four classes: student, faculty, course, and project; and (2) impute the binary mutation status of genes in TCGA-GBM data. The kernel SVM classification results shown in the main text indicated that the cosine kernel usually outperformed the linear kernel, most likely as a pure consequence of the hyperspherical geometry, as we argue below. The exact kernel outperformed the Gaussian RBF kernel for the WebKB document data, but the advantage of exact kernel diminished in the TCGA mutation count data. Figure 2 compares the accuracy of SVM using different kernels on the TCGA-GBM data, where the accuracy ratios rbf/lin, cos/lin, ext/lin, prx/cos, and ext/cos were greater than 1 for most class sizes m r . Interestingly, the ratio cos/lin showed some dependence on the sample size m r , and the exact kernel also tended to outperform the Gaussian RBF kernel when m r was small; in general, we noted that the hyperspherical kernels tended to outperform the Euclidean kernels in small-sample-size classification problems. This pattern may be understood by examining the generalization error of kernel SVM as follows. Intuitively, if a generic classifier were closely acquainted with the population distribution of data through a large sample size, then its predictions would be more generalizable to unseen samples. The "largeness" of sample size m, however, is not explicitly quantifiable unless we have a natural unit for it. Statistical learning theory (Vapnik, 2013;Vapnik et al., 1994;Guyon et al., 1993) provides such a unit associated with a probabilistic upper bound on generalization errors. That is, with probability at least 1 − η, the generalization error of a binary SVM classification is bounded from above by where µ VC is the VC-dimension of the classifier, andm = m/µ VC is the effective sample size. The derivative of F (m; µ VC , η) with respect tom is proportional to a positive factor times − log [(2m) µ VC 4/η]. Thus, the upper bound decreases withm when (2m) µ VC > η/4, and increases otherwise; the critical effective sample sizem crt = 1 2 · (η/4) 1/µ VC ≈ 1 2 for typical values of µ VC > 100 and η ∈ [10 −3 , 0.1]. The VC dimension of a linear kernel SVM can be estimated using an empirical upper bound (Vapnik et al., 1994;Paliouras et al., 2003) µ V C ≤ µ * V C = min n, where n is the feature space dimension, R is the radius of the smallest ball in feature space that encloses all data points, and M is the SVM margin. We evaluated the bound µ * VC for the TCGA-GBM mutation count data with C = 1, and found that the linear kernel had R 2 /M 2 ≈ 6 × 10 3 and thus that µ * lin VC = n + 1 ≈ 4 × 10 2 . By contrast, the cosine kernel, which is a linear kernel in the hyperspherically transformed space with R ≤ 1, had µ * cos VC approximately in the range 20 ∼ 100 µ * lin VC , as shown in Figure 3(a). This reduction in the VC-dimension is likely responsible for the classification improvement of the cosine kernel over the linear kernel. We thus found thatm cos = 2m r /µ * cos VC >m crt , whilẽ m lin = 2m r /µ * lin VC <m crt for the TCGA-GBM data, and that the cosine kernel accuracy increased with effective sample size, whereas the linear kernel accuracy tended to decrease (Figure 3(b,c)), consistent with the analysis of the upper bound on generalization error F (m; µ VC , η). In addition, the Gaussian RBF and exact kernels followed similar trends as the linear and cosine kernels, respectively (Figure 3(d,e)). Similar to the cosine kernel, the exact kernel likely inherited the reduction in VC-dimension from the hyperspherical map; as a result, the accuracy of the exact kernel also increased withm cos , but with slightly higher accuracy due to the additional tunable parameter t that can adjust the curvature of nonlinear decision boundaries. Moreover, the cases of small sample size where the exact kernel outperformed the Gaussian RBF kernel corresponded to the cases of larger effective sample size ratiom cos /m lin (Figure 3(f)).