Optimisation of Spectral Wavelets for Persistence-based Graph Classification

A graph's spectral wavelet signature determines a filtration, and consequently an associated set of extended persistence diagrams. We propose a framework that optimises the choice of wavelet for a dataset of graphs, such that their associated persistence diagrams capture features of the graphs that are best suited to a given data science problem. Since the spectral wavelet signature of a graph is derived from its Laplacian, our framework encodes geometric properties of graphs in their associated persistence diagrams and can be applied to graphs without a priori node attributes. We apply our framework to graph classification problems and obtain performances competitive with other persistence-based architectures. To provide the underlying theoretical foundations, we extend the differentiability result for ordinary persistent homology to extended persistent homology.

1. Introduction

1.1. Background. Graph classification is a challenging problem in machine learning. Unlike data represented in Euclidean space, there is no easily computable notion of distance or similarity between graphs. As such, graph classification requires techniques beyond mainstream machine learning methods focused on Euclidean data. Much research has been conducted on methods such as graph neural networks (GNNs) [Xu+19] and graph kernels [Vis+10; She+09] that embed graphs in Euclidean space in a consistent manner.
Recently, persistent homology [ZC05; EH08] has been applied as a feature map that explicitly represents topological and geometric features of a graph as a set of persistence diagrams (a.k.a. barcodes). In the context of our discussion, the persistent homology of a graph G = (V, E) depends on a vertex function f : V → R. In the case where a vertex function is not given with the data, several schemes have been proposed in the literature to assign vertex functions to graphs in a consistent way. For example, vertex functions can be constructed using local geometric descriptions of vertex neighbourhoods, such as discrete curvature [ZW19], heat kernel signatures [Car+20b] and Weisfeiler-Lehman graph kernels [RBB19].
However, it is often difficult to know a priori whether a heuristic vertex assignment scheme will perform well in addressing different data science problems. For a single graph, we can optimise the vertex function over the |V| degrees of freedom in R^V. In recent years, there have been many other examples of persistence optimisation in data science applications. The first two examples of persistence optimisation are the computation of the Fréchet mean of barcodes using gradients on Alexandrov spaces [Tur+14], and that of point cloud inference [GHO16], where a point cloud is optimised so that its barcode fits a fixed target barcode. The latter is an instance of topological inverse problems (see [OS20] for a recent overview). Another inverse problem is that of surface reconstruction [Brü+20]. Further, in the context of shape matching [PSO18], persistence optimisation is used to learn an adequate function between shapes. Finally, there are also many recent applications of persistence optimisation in machine learning, such as the incorporation of topological information in generative modelling [Moo+20; Hof+19; Gab+20] or in image segmentation [Hu+19; Clo+19], and the design of topological losses for regularisation in supervised learning [Che+19] or for dimension reduction [Kac20].
Each of these applications can be thought of as minimising a certain loss function L : M → R over a manifold M of parameters, where L factors through the space Bar^N of N-tuples of barcodes. The aim is to find the parameter θ that best fits the application at hand. Gradient descent is a very popular approach in minimisation, but it requires the ability to differentiate the loss function. In fact, [LOT19] provide notions of differentiability for maps in and out of Bar that are compatible with smooth calculus, and show that the loss functions L corresponding to the applications cited in the above paragraph are generically differentiable. The use of (stochastic) gradient descent is further legitimated by [Car+20a], where convergence guarantees on persistence optimisation problems are devised, using a recent study of stratified non-smooth optimisation problems [Dav+20]. In practice, the minimisation of L can be unstable due to its non-convexity and partial non-differentiability. Some research has been conducted in order to smooth and regularise the optimisation procedure [SWB20; CD20].
In a supervised learning setting, we want to optimise our vertex function assignment scheme over many individual graphs in a dataset. Since graphs may not share the same vertex set and come in different sizes, optimising over the |V| degrees of freedom of any one graph is not conducive to learning a vertex function assignment scheme that can generalise to another graph. The degrees of freedom in any practical vertex assignment scheme should be independent of the number of vertices of a graph. However, a framework for parametrising and optimising the vertex functions of many graphs over a common parameter space M is not immediately apparent.
The first instance of a graph persistence optimisation framework (GFL) [Hof+20] uses a one layer graph isomorphism network (GIN) [Xu+19] to parametrise vertex functions. The GIN learns a vertex function by exploiting the local topology around each vertex. In this paper, we propose a different framework for assigning and parametrising vertex functions, based on a graph's Laplacian operator. Using the Laplacian, we can explicitly take both local and global structures of the graph into consideration in an interpretable and transparent manner.
1.2. Outline and Contributions. We address the issue of vertex function parametrisation and optimisation using wavelet signatures. Wavelet signatures are vertex functions derived from the eigenvalues and eigenvectors of the graph Laplacian and encode multiscale geometric information about the graph [LH13]. The wavelet signature of a graph depends on a choice of wavelet g : R → R, a function on the eigenvalues of the graph's Laplacian matrix. We can thus obtain a parametrisation of vertex functions F : M → R^V for any graph by parametrising g. Consequently, the extended persistence of a graph, which has only four non-trivial persistence diagrams, can be varied over the parameter space M. If we have a function Out : Bar^4 → R on persistence diagrams that we wish to minimise, we can optimise over M to minimise the loss function L = Out ∘ EPH ∘ F. If L is generically differentiable, we can optimise the wavelet signature parameters θ ∈ M using gradient descent methods. We illustrate an application of this framework to a graph classification problem in Figure 1, where the loss function L is the classification error of a graph classification prediction model based on the graph's extended persistence diagrams.
In Section 2, we describe the assignment of vertex functions F : M → R^V by reviewing the definition of wavelet signatures. While spectral wavelets have been used in graph neural network architectures that predict vertex features [Xu+19] and compress vertex functions [RG19], they have not been considered in a persistent homology framework for graph classification. We describe several ways to parametrise wavelets. We also show in Proposition 2.2 that the wavelet signature is independent of the choice of eigenbasis of the graph Laplacian from which it is derived, ensuring that it is well-defined. We prove this result in Section B.
In Section 3, we describe the theoretical basis for optimising the extended persistent homology of a vertex function EPH : R^V → Bar^4 and elucidate what it means for L to be differentiable. In Proposition 3.3, we generalise the differentiability formalism of ordinary persistence [LOT19] to extended persistence. We prove this result in Section A. Finally, in Section 4, we apply our framework to graph classification problems on several benchmark datasets. We show that our model is competitive with state-of-the-art persistence-based models. In particular, optimising the vertex function appreciably improves the prediction accuracy on some datasets.

2. Filter Function Parametrisation
We describe our recipe for assigning vertex functions to any simplicial graph G = (V, E) based on a parametrised spectral wavelet; this is the first part F of the loss function L = Out ∘ EPH ∘ F. Our recipe is based on a graph's wavelet signature, a vertex function derived from the graph's Laplacian.
The wavelet signature also depends on a so-called wavelet function g : R → R, which is independent of the graph. By modulating the wavelet, we can jointly vary the wavelet signature across many graphs. We parametrise the wavelet using a finite linear combination of basis functions, such that the wavelet signature can be manipulated in a computationally tractable way. In the following section, we define the wavelet signature and describe our linear approach to wavelet parametrisation.
2.1. Wavelet Signatures. The wavelet signature is a vertex function initially derived from wavelet transforms of vertex functions on graphs [HVG11], a generalisation of wavelet transforms for square integrable functions on Euclidean space [Gra95; Chu16] for signal analysis [Aka+01]. Wavelet signatures for graphs have been applied to encode geometric information about meshes of 3D shapes [Aka+01; LH13]. Special cases of wavelet signatures, such as the heat kernel signature [SOG09] and wave kernel signature [ASC11], have also been applied to describe graphs and 3D shapes [BK10; HRG14].
The wavelet signature of a graph is constructed from the graph's Laplacian operator. A graph's normalised Laplacian L ∈ R^{V×V} is a symmetric positive semi-definite matrix, whose entries are given by
$$ L_{uv} = \begin{cases} 1 & \text{if } u = v \text{ and } k_u \neq 0, \\ -\dfrac{1}{\sqrt{k_u k_v}} & \text{if } \{u, v\} \in E, \\ 0 & \text{otherwise,} \end{cases} $$
where k_u is the degree of vertex u.

Definition 2.1. Let λ_1 ≤ ⋯ ≤ λ_{|V|} be the eigenvalues of L, with a corresponding orthonormal basis of eigenvectors φ_1, …, φ_{|V|}. The wavelet signature of a wavelet g : R → R is the vertex function W(g) ∈ R^V defined by
$$ W(g)_v = \sum_{i=1}^{|V|} g(\lambda_i)\, (\varphi_i)_v^2, \qquad (3) $$
where (φ_i)_v denotes the component of eigenvector φ_i corresponding to vertex v.
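The spectral definition above can be computed directly from an eigendecomposition of the normalised Laplacian. The following minimal NumPy sketch (an illustration, not the paper's implementation; the 4-cycle and the heat kernel wavelet are example choices) assembles L and evaluates W(g):

```python
import numpy as np

def normalised_laplacian(adj):
    """Normalised Laplacian L = I - D^{-1/2} A D^{-1/2} of a simple graph."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    mask = deg > 0
    inv_sqrt[mask] = 1.0 / np.sqrt(deg[mask])
    return np.eye(len(deg)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def wavelet_signature(adj, g):
    """W(g)_v = sum_i g(lambda_i) (phi_i)_v^2, cf. Equation (3)."""
    lam, phi = np.linalg.eigh(normalised_laplacian(adj))  # eigenvectors as orthonormal columns
    return (phi ** 2) @ g(lam)

# Example: a 4-cycle.  All vertices are equivalent, so the signature is constant.
C4 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)
hks = wavelet_signature(C4, lambda lam: np.exp(-0.1 * lam))  # heat kernel signature
```

Note that W applied to the constant wavelet g ≡ 1 returns the constant function 1, since the rows of an orthogonal eigenvector matrix have unit norm.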
If the eigenvalues of L have geometric multiplicity one (i.e. their eigenspaces are one dimensional), then the orthonormal eigenvectors are uniquely defined up to a choice of sign. It is then apparent from Equation (3) that the wavelet signature is independent of the choice of sign. However, if some eigenvalues have geometric multiplicity greater than one, then the orthonormal eigenvectors of L are only defined up to orthonormal transformations of the individual eigenspaces. Nevertheless, the wavelet signature is well-defined even when the multiplicities of eigenvalues are greater than one. This is the content of the next proposition, whose proof is deferred to Section B.
Proposition 2.2. The wavelet signature of a graph is independent of the choice of orthonormal eigenbasis for the Laplacian.
Remark 2.3. In addition to the traditional view of wavelets from a spectral signal processing perspective [HVG11], we can also relate the wavelet signature at a vertex v to the degrees of vertices in some neighbourhood of v prescribed by g. Consider a wavelet g : [0, 2] → R. On a finite graph G, the normalised Laplacian L has at most |V| distinct eigenvalues. As such, there exists a polynomial $\hat g(x) = \sum_{n=0}^{p} a_n x^n$ of finite order that interpolates g at the eigenvalues: g(λ_i) = ĝ(λ_i). Therefore, W(g) = W(ĝ). Moreover, the vertex values assigned by W(ĝ) are the diagonal entries of the matrix polynomial ĝ(L):
$$ W(\hat g)_v = \hat g(L)_{vv}. $$
Furthermore, we can also write the matrix polynomial ĝ(L) as a matrix polynomial in A = I − L, the normalised adjacency matrix. From the definition of L, we can compute the diagonal entry of a monomial A^r corresponding to vertex v as an inverse degree weighted count of paths [v_0, v_1, …, v_r] on the graph which begin and end on vertex v = v_0 = v_r [New18]:
$$ (A^r)_{vv} = \sum_{[v_0, v_1, \ldots, v_r]} \prod_{i=0}^{r-1} \frac{1}{k_{v_i}}. $$
By expressing the wavelet signature as a matrix polynomial in A, we see that g controls how information at different length scales of the graph contributes to the wavelet signature. For instance, if g were an order p polynomial, then W(g)_v only takes into account the degrees of vertices that are at most ⌊p/2⌋ steps away from v. As a corollary, since W(g) can be specified by replacing g with a polynomial ĝ of order at most |V| − 1, the wavelet signature at a vertex v is only dependent on the subgraph of G that is within ⌊(|V| − 1)/2⌋ steps of v.
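The equivalence in Remark 2.3 between the spectral definition of W(g) and the diagonal of the interpolating matrix polynomial ĝ(L) can be checked numerically. A sketch (the path graph on four vertices and the wavelet g(x) = e^{−2x} are illustrative choices):

```python
import numpy as np

# Path graph on 4 vertices
A = np.diag([1.0, 1.0, 1.0], 1)
A = A + A.T
deg = A.sum(axis=1)
dinv = 1.0 / np.sqrt(deg)
L = np.eye(4) - dinv[:, None] * A * dinv[None, :]
lam, phi = np.linalg.eigh(L)

g = lambda x: np.exp(-2.0 * x)            # example wavelet
W_spec = (phi ** 2) @ g(lam)              # spectral definition of W(g)

# Polynomial ghat interpolating g at the distinct eigenvalues
lam_d = np.unique(np.round(lam, 10))
coeffs = np.polyfit(lam_d, g(lam_d), len(lam_d) - 1)  # highest-degree coefficient first
ghat_L = np.zeros_like(L)                 # Horner evaluation of the matrix polynomial ghat(L)
for c in coeffs:
    ghat_L = ghat_L @ L + c * np.eye(4)
W_poly = np.diag(ghat_L)                  # W(ghat)_v = ghat(L)_{vv}
```

Since the path graph has a reflection symmetry swapping its endpoints, the resulting signature also takes equal values on symmetric vertices.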
2.2. Parametrising the Wavelet. We see from Remark 2.3 that the choice of wavelet g determines how the topology and geometry of the graph is reflected in the vertex function. Though the space of wavelets is potentially infinite dimensional, here we only consider wavelets g_θ(x) that are parametrised by parameters θ in a finite dimensional manifold, so that we can easily optimise them using computational methods. In particular, we focus on wavelets written as a linear combination of m basis functions h_1, …, h_m:
$$ g_\theta(x) = \sum_{j=1}^{m} \theta_j h_j(x). $$
This parametrisation of wavelets in turn defines a parametrisation of vertex functions F : R^m → R^V for our optimisation pipeline in eq. (1). Since W(g) is a linear function of the wavelet g, F is a linear transformation:
$$ F(\theta) = \sum_{j=1}^{m} \theta_j W(h_j). $$
We can write F as a |V| × m matrix acting on a vector [θ_1, …, θ_m]^T ∈ R^m, whose columns are the vertex functions W(h_j).
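Because F is linear, it can be materialised as the |V| × m matrix whose j-th column is W(h_j). A short sketch (the triangle graph and the monomial basis are illustrative choices, not the paper's):

```python
import numpy as np

def parametrisation_matrix(lam, phi, basis):
    """|V| x m matrix F whose j-th column is the wavelet signature W(h_j)."""
    return np.column_stack([(phi ** 2) @ h(lam) for h in basis])

# Normalised Laplacian of a triangle graph (every vertex has degree 2)
L = np.eye(3) - (np.ones((3, 3)) - np.eye(3)) / 2.0
lam, phi = np.linalg.eigh(L)

basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]
F = parametrisation_matrix(lam, phi, basis)

# Linearity: F @ theta equals the signature of the combined wavelet g_theta
theta = np.array([0.5, -1.0, 2.0])
g_theta = lambda x: theta[0] + theta[1] * x + theta[2] * x ** 2
```

The check `F @ theta == W(g_theta)` below is exactly the statement that W, and hence F, is linear in the wavelet.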
Example 2.4 (Chebyshev Polynomials). Any Lipschitz continuous function on an interval can be well approximated by truncating its Chebyshev series at some finite order [TB97]. The Chebyshev polynomials form an orthogonal family of functions with respect to a suitable weight. We can thus consider h_j(λ) = T_j(λ − 1), j ∈ {0, 2, 3, …, m}, as a naïve basis for wavelets. We exclude T_1 from the linear combination, as W(T_1(λ − 1)) = 0 for graphs without self loops.
Example 2.5 (Radial Basis Functions). In the machine learning community, a radial function refers loosely to a continuous monotonically decreasing function ρ : R_{≥0} → R_{≥0}. There are many possible choices for ρ; an example is the inverse multiquadric
$$ \rho(r) = \frac{1}{\sqrt{1 + (r/\varepsilon)^2}}, $$
where ε > 0 is a width parameter. We can obtain a naïve wavelet basis h_j(x) = ρ(|x − x_j|) using copies of ρ offset by a collection of centroids x_j ∈ R along R. In general, the centroids are parameters that could be optimised, but we fix them in this study. This parametrisation can be considered as a radial basis function network; such networks are well studied in function approximation and machine learning, and we refer readers to [CCG91; PS91] for further details.
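A radial basis for wavelets can be tabulated as follows. This sketch assumes one common normalisation of the inverse multiquadric (the paper's exact form is not given verbatim), with centroids spread over [0, 2], the spectral range of the normalised Laplacian:

```python
import numpy as np

def imq(r, eps):
    """Inverse multiquadric (an assumed common normalisation)."""
    return 1.0 / np.sqrt(1.0 + (r / eps) ** 2)

# Centroids spread over [0, 2]; width = centroid spacing
centroids = np.linspace(0.0, 2.0, 12)
eps = centroids[1] - centroids[0]
basis = [lambda x, c=c: imq(np.abs(x - c), eps) for c in centroids]

lam = np.linspace(0.0, 2.0, 50)                 # a sampled spectrum
H = np.column_stack([h(lam) for h in basis])    # 50 x 12 matrix of basis values
```

Each column of H is a bump peaking at its centroid and decaying monotonically away from it, which is what makes the raw basis smooth but, as discussed next, potentially ill-conditioned.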

The Choice of Wavelet Basis. The choice of basis functions determines the space of wavelet signatures, and also the numerical stability of the basis function coefficients which serve as the wavelet signature parameters. The stability of the parametrisation depends on the graphs as much as on the choice of wavelet basis h_1, …, h_m. We can analyse the stability of a parametrisation F by its singular value decomposition
$$ F = \sum_{k=1}^{r} \sigma_k u_k v_k^T, \qquad (11) $$
where σ_1, …, σ_r are the non-zero singular values of the matrix, and u_k ∈ R^{|V|} and v_k ∈ R^m are orthonormal sets of vectors respectively. If the distribution of singular values spans many orders of magnitude, we say the parametrisation is ill-conditioned. An ill-conditioned parametrisation interferes with the convergence of gradient descent algorithms on a loss function evaluated on wavelet signatures. We discuss the relationship between the conditioning of F and the stability of gradient descent in detail in Remark 2.7. We empirically observe that the coefficients of a naïve choice of basis functions, such as Chebyshev polynomials or radial basis functions, are numerically ill-conditioned. In Figure 3, we can see that the singular values of radial basis function and Chebyshev polynomial parametrisations respectively are distributed across a large range on the logarithmic scale for some datasets of graphs in machine learning. We address this problem by picking out a new wavelet basis
$$ \tilde h_k = \frac{1}{\sigma_k} \sum_{j=1}^{m} (v_k)_j h_j, \qquad k = 1, \ldots, r, \qquad (12) $$
where σ_k are the singular values of F and v_k are the associated vectors in R^m from the singular value decomposition of the matrix F in eq. (11). The resulting parametrisation F̃ : R^r → R^V then has all singular values equal to one, since each new basis function is sent to one of the orthonormal vectors u_k ∈ R^{|V|}:
$$ W(\tilde h_k) = \frac{1}{\sigma_k} \sum_{j=1}^{m} (v_k)_j W(h_j) = \frac{1}{\sigma_k} F v_k = u_k. $$
As an example, we plot the new wavelet basis h̃_k derived from a twelve parameter radial basis function parametrisation for the MUTAG dataset in Figure 4 in Section B.
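The reconditioning step is a plain SVD computation. A NumPy sketch, with a synthetic ill-conditioned matrix standing in for a real parametrisation F:

```python
import numpy as np

def recondition(F, tol=1e-12):
    """Replace the basis so that the parametrisation has unit singular values.

    Returns F_tilde (whose columns are the u_k) together with the matrix C of
    coefficients expressing the new basis h~_k = (1/sigma_k) sum_j (v_k)_j h_j.
    """
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))      # numerical rank
    C = Vt[:r].T / s[:r]                 # column k is v_k / sigma_k
    return F @ C, C                      # F @ C equals U[:, :r]

rng = np.random.default_rng(0)
F = rng.normal(size=(30, 12)) * np.logspace(0, -6, 12)   # deliberately ill-conditioned
F_tilde, C = recondition(F)
s_new = np.linalg.svd(F_tilde, compute_uv=False)
```

All singular values of the reconditioned parametrisation are one, i.e. its columns are orthonormal, which is exactly the property used in Remark 2.7.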
Remark 2.6 (Learning a Wavelet Basis for Wavelet Signatures on Multiple Graphs). In the case where the wavelet coefficients parametrise the wavelet signatures over graphs G_1, …, G_N, we can view the maps F_1, …, F_N, which map wavelet basis coefficients to vertex functions of graphs G_1, …, G_N respectively, as a single parametrisation for the disjoint union ⊔_i G_i:
$$ F : \theta \mapsto (F_1(\theta), \ldots, F_N(\theta)) \in \mathbb{R}^{V_1} \times \cdots \times \mathbb{R}^{V_N}. $$
We can then perform a singular value decomposition of the parametrisation F on ⊔_i G_i and derive a new, well-conditioned basis.
Remark 2.7 (Why the Conditioning of F Matters). Let us optimise a loss function L on the parameter space of wavelet coefficients θ using a gradient descent algorithm. In a gradient descent step of step size s, the wavelet coefficients are updated as θ → θ − s∇_θ L. Using the singular value decomposition of F (eq. (11)), we can write
$$ \nabla_\theta L = F^T \nabla_f L = \sum_{k} \sigma_k \langle \nabla_f L, u_k \rangle v_k. \qquad (16) $$
The change in the vertex function is simply the matrix F applied to the change in wavelet parameters. Hence the vertex function is updated as f → f − sF∇_θ L, where
$$ F \nabla_\theta L = \sum_{k} \sigma_k^2 \langle \nabla_f L, u_k \rangle u_k. \qquad (17) $$
If the loss function L has large second derivatives, for example due to nonlinearities in the function on persistence diagrams Out : Bar^4 → R, the projections ⟨∇_f L, u_k⟩ in eqs. (16) and (17) may change dramatically from one gradient descent update to another. If the smallest singular value is much smaller than the largest, then updates to the wavelet signature can be especially unstable throughout the optimisation process. This source of instability can be removed if we choose a parametrisation with uniform singular values σ_k = 1. In this case, the update to f is simply the projection of ∇_f L onto the space of wavelet signatures spanned by u_1, …, u_r, without any distortion introduced by non-uniform singular values:
$$ F \nabla_\theta L = \sum_{k} \langle \nabla_f L, u_k \rangle u_k. $$
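The effect of the singular values on the update can be verified numerically: pushing a gradient step through f = Fθ weights each component ⟨∇_f L, u_k⟩ of the gradient by σ_k². A NumPy check with a random parametrisation and gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(20, 5))                  # a stand-in parametrisation matrix
U, s, Vt = np.linalg.svd(F, full_matrices=False)

grad_f = rng.normal(size=20)                  # gradient of the loss w.r.t. the vertex function
grad_theta = F.T @ grad_f                     # chain rule through f = F theta
update_f = F @ grad_theta                     # induced change in f (up to the step size)

# The same update, expressed through the singular triples of F:
via_svd = sum(s[k] ** 2 * (grad_f @ U[:, k]) * U[:, k] for k in range(len(s)))
```

Since F F^T = Σ_k σ_k² u_k u_k^T, the two expressions for the update agree, and with σ_k = 1 the update reduces to a plain orthogonal projection.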

3. Extended Persistent Homology
The homology of a given graph is a computable vector space whose dimension counts the number of connected components or cycles in the graph. Finer information can be retained by filtering the graph and analysing the evolution of the homology throughout the filtration. This evolution is described by a set of extended persistence diagrams (a.k.a. extended barcodes), a multiset of points (b, d) that record the birth b and death d of homological features in the filtration. In this section, we begin by summarising these constructions. We refer the reader to [ZC05], [EH08], and [CEH07] for full treatments of the theory of persistence.
Compared to ordinary persistence, extended persistence is a more informative and convenient feature map for graphs. Extended persistence encodes strictly more information than ordinary persistence. For instance, the cycles of a graph are represented as points with d = ∞ in ordinary persistence. Thus, only the birth coordinate b of such points contains useful information about the cycles. In contrast, the corresponding points in extended persistence are each endowed with a finite death time d, thus associating extra information to the cycles. The points at infinity in ordinary persistence also introduce obstacles to vectorisation procedures, as arbitrary finite cutoffs are often needed to 'tame' the persistence diagrams before vectorisation.
3.1. Extended Persistent Homology. Let G = (V, E) be a finite graph without double edges and self-loops. For the purposes of this paper, the associated extended persistent homology is a map EPH : R^V → Bar^4 from functions f ∈ R^V on its vertices to the space of four persistence diagrams or barcodes, which we define below. The map arises from a filtration of the graph, a sequential attachment of vertices and edges in ascending or descending order of f. We extend f to each edge e = (v, v′) by the maximal value of f over the vertices v and v′, and we then let G_t ⊆ G be the subgraph induced by vertices taking value less than t. Then we have the following sequence of inclusions for s ≤ t:
$$ G_s \subseteq G_t \subseteq G. \qquad (19) $$
Similarly, the subgraphs G^t ⊆ G induced by vertices taking value greater than t assemble into a sequence of inclusions for s ≤ t:
$$ G^t \subseteq G^s \subseteq G. \qquad (20) $$
The changes in the topology of the graph along the filtration in ascending and descending order of f can be detected by its extended persistence module, indexed over the poset R ∪ {∞} ∪ R^op:
$$ V_p(f) : \; H_p(G_t) \longrightarrow H_p(G) \longrightarrow H_p(G, G^t), \qquad (21) $$
where H_p is the singular (relative) homology functor in degree p ∈ {0, 1} with coefficients in a fixed field, chosen to be Z/2Z in practice. In general terms, the modules V_0(f) and V_1(f) together capture the evolution of the connected components and loops in the subgraphs of G induced by the function f. Each module V_p(f) is completely characterised by a finite multiset EPH_p(f) of pairs of real numbers (b, d) called intervals, representing the birth and death of homological features. Following [CEH09], the intervals in EPH_p(f) are further partitioned according to the type of homological feature they represent:
$$ \mathrm{EPH}_p(f) = \mathrm{EPH}^{\mathrm{ord}}_p(f) \sqcup \mathrm{EPH}^{\mathrm{ext}}_p(f) \sqcup \mathrm{EPH}^{\mathrm{rel}}_p(f). $$
Each of the three finite multisets EPH^k_p(f), for k ∈ {ord, ext, rel}, is an element in the space Bar of so-called barcodes or persistence diagrams. However, EPH^rel_0(f) and EPH^ord_1(f) being trivial for graphs, we refer to the collection of four remaining persistence diagrams as the extended barcode or extended persistence diagram of f. We have thus defined the extended persistence map EPH : R^V → Bar^4.
Remark 3.1. If we only apply homology to the filtration of Eq. (19), we get an ordinary persistence module indexed over the real line, which is essentially the first row in Eq. (21). This module is characterised by a unique barcode PH_p(f) ∈ Bar. We refer to the map
$$ \mathrm{PH} : \mathbb{R}^V \to \mathrm{Bar}^2 \qquad (24) $$
as the ordinary persistence map.
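For intuition, the degree-0 ordinary sublevel-set persistence of a vertex function on a graph can be computed with a simple union-find sweep. This is a self-contained sketch using the elder rule (the paper itself relies on the GUDHI library for persistence computations):

```python
def sublevel_ph0(f, edges):
    """Degree-0 sublevel-set persistence of a vertex function on a graph.

    Returns (birth, death) pairs; the essential class of each connected
    component gets death = inf.  Elder rule: when two components merge,
    the one with the larger (younger) birth value dies.  Zero-persistence
    pairs are kept.
    """
    n = len(f)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    bars = []
    # An edge enters the filtration at the larger of its endpoint values
    for u, v in sorted(edges, key=lambda e: max(f[e[0]], f[e[1]])):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        if (f[ru], ru) < (f[rv], rv):       # keep the older root as survivor
            ru, rv = rv, ru
        bars.append((f[ru], max(f[u], f[v])))
        parent[ru] = rv
    bars += [(f[r], float("inf")) for r in {find(x) for x in range(n)}]
    return sorted(bars)

# f(v0) = 0, f(v1) = 2, f(v2) = 1 on the path v0 - v1 - v2
bars = sublevel_ph0([0.0, 2.0, 1.0], [(0, 1), (1, 2)])
```

In the example, the component born at f = 1 (vertex v2) dies when it merges into the older component born at f = 0, which itself survives as the essential class with death = ∞.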

3.2. Differentiability of Extended Persistence. The extended persistence map can be shown to be locally Lipschitz by the Stability theorem [CEH09]. The Rademacher theorem states that any real-valued locally Lipschitz function is differentiable on a full measure set. Thus, so is our loss function, as long as Out and F are smooth or locally Lipschitz. If a loss function L is locally Lipschitz, we can use stochastic gradient descent as a paradigm for optimisation. Nonetheless, the theorem above does not rule out dense sets of non-differentiability in general.
In this section, we show that the set where EPH is not differentiable is not pathological. Namely, we show that EPH is generically differentiable, i.e. differentiable on an open dense subset. This property guarantees that local gradients yield reliable descent directions in a neighbourhood of the current iterate. We recall from [LOT19] the definition of differentiability for maps to barcodes.
We call a map F : M → R^V a parametrisation, as it corresponds to a selection of filter functions over G parametrised by the manifold M. Then B := EPH ∘ F is the barcode valued map whose differentiability properties are of interest in applications.

Definition 3.2. A map B : M → Bar on a smooth manifold M is said to be differentiable at θ ∈ M if for some neighbourhood U of θ, there exists a finite collection of differentiable maps b_1, d_1, …, b_n, d_n : U → R such that, for all θ̃ ∈ U, B(θ̃) is the multiset {(b_i(θ̃), d_i(θ̃)) : 1 ≤ i ≤ n}. For N ∈ N, we say that a map B : M → Bar^N is differentiable at θ if all its components are so.
In [LOT19], it is proven that the composition PH ∘ F is generically differentiable as long as F is. It is possible to show that EPH ∘ F is generically differentiable along the same lines, but we instead provide an alternative argument in the appendix. Namely, we rely on the fact that the extended persistence of G can be decoded from the ordinary persistence of the cone complex C(G), a connection first noted in [CEH09] for computational purposes.
Proposition 3.3. Let F : M → R^V be a generically differentiable parametrisation. Then the composition EPH ∘ F is generically differentiable.
For completeness, the proof provided in the appendix treats the general case of a finite simplicial complex K of arbitrary dimension.

4. Binary Graph Classification
We investigate whether optimising the extended persistence of wavelet signatures can be usefully applied to graph classification problems, where persistence diagrams are used as features to predict discrete, real life attributes of networks. In this setting, we aim to learn θ ∈ M that minimises the classification error of graphs over a training dataset.
We apply our wavelet optimisation framework to classification problems on the graph datasets MUTAG, COX2, DHFR, NCI1, PROTEINS, and IMDB-B. The former five datasets are biochemical molecules and IMDB-B is a collection of social ego networks. In our models, we use persistence images [Ada+17] as a fixed vectorisation method and use a feed forward neural network to map the persistence images to classification labels. We also include the eigenvalues of the graph Laplacian as additional features; model particulars are described in the sections below.
To illustrate the effect of wavelet optimisation on different classification problems, we also perform a set of control experiments where, for the same model architecture, we fix the wavelet and only optimise the parameters of the neural network. The control experiment functions as a baseline against which we assess the efficacy of wavelet optimisation.
We benchmark our results against two existing persistence-based architectures, PersLay [Car+20b] and GFL [Hof+20]. Perslay optimises the vectorisation parameters and uses two heat kernel signatures as fixed, rather than optimisable, vertex functions for computing extended persistence. GFL optimises and parametrises vertex functions using a graph isomorphism network [Xu+19], and computes ordinary sublevel and superlevel set persistence instead of extended persistence.
4.1. Model Architecture. We give a high level description of our model and relegate details and hyperparameter choices of the vectorisation method and neural network architecture to Section C. In our setting, the extended persistence diagrams of the optimisable wavelet signatures for each graph are vectorised as persistence images. We also include the static persistence images of a fixed heat kernel signature, W(e^{−0.1x}), as an additional set of features, alongside some non-persistence features. Both the optimised and static persistence diagrams are transformed into persistence images using identical hyperparameters. We feed the optimisable and static persistence images into two separate convolutional neural networks (CNNs) with the same architecture. Similarly, we feed the non-persistence features as a vector into a separate multilayer perceptron. The outputs of the CNNs are concatenated with the outputs of the multilayer perceptron. Finally, an affine transformation sends the concatenated vector to a real number whose sign determines the binary classification.

4.1.1. Wavelet Parametrisation. We choose a space of wavelets spanned by 12 inverse multiquadric radial basis functions whose centroids x_j are located at x_j = 2(j − 1)/9, j = 0, …, 11. The width parameter is chosen to be the distance between the centroids, ε = 2/9. On each dataset, we derive a numerically stable parametrisation using the procedure described in Section 2.2; the parameters we optimise are the coefficients of the new basis given by eq. (12). We initialise the parameters by fitting them via least squares to the heat kernel signature W(e^{−10x}) on the whole dataset of graphs.
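The least-squares initialisation of the wavelet coefficients can be sketched as follows. This is a minimal illustration: the uniformly sampled spectrum and the inverse multiquadric basis stand in for the pooled eigenvalues and the reconditioned basis actually used in the paper, and the heat kernel e^{−10λ} is the fitting target:

```python
import numpy as np

def imq_basis(centroids, eps):
    """Inverse multiquadric bumps centred at the given points (assumed normalisation)."""
    return [lambda x, c=c: 1.0 / np.sqrt(1.0 + ((x - c) / eps) ** 2) for c in centroids]

def fit_initial_coeffs(eigenvalues, basis, target):
    """Least-squares fit of theta so that sum_j theta_j h_j approximates the target wavelet."""
    H = np.column_stack([h(eigenvalues) for h in basis])
    y = target(eigenvalues)
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return theta, H, y

lam = np.linspace(0.0, 2.0, 200)                        # stand-in for pooled eigenvalues
basis = imq_basis(np.linspace(0.0, 2.0, 12), 2.0 / 11.0)
theta, H, y = fit_initial_coeffs(lam, basis, lambda x: np.exp(-10.0 * x))
```

By construction, the fitted residual is no worse than that of the zero wavelet, and in practice the twelve bumps track the fast initial decay of the heat kernel.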
4.1.2. Non-Persistence Features. We also incorporate the eigenvalues of the normalised Laplacian as additional, fixed features of the graph. Since the number of eigenvalues for a given graph is equal to the number of vertices, it differs between graphs in the same dataset. To encode the information represented in the eigenvalues as a fixed length vector, we first sort the eigenvalues into a time series; we then compute the log path signature of the time series up to level four, which is a fixed length vector in R^8. The log-signature captures the geometric features of the path; we refer readers to [CK16] for details about path signatures. For IMDB-B in particular, we also include the maxima and minima of the heat kernel signatures W(e^{−10x}) and W(e^{−0.1x}) respectively of each graph.

4.2. Experimental Set Up.
We employ a 10 ten-fold test-train split scheme on each dataset to measure the accuracy of our model. Each ten-fold is a set of ten experiments, corresponding to a random partition of the dataset into ten portions. In each experiment, a different portion is selected as the test set while the model is trained on the remaining nine portions. We perform 10 ten-folds to obtain a total of 10 × 10 experiments, and report the accuracy of the classifier as the average accuracy over the 100 experiments. The epochs at which the accuracies were measured are specified in Table 3.
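The 10 × ten-fold protocol amounts to ten independent random partitions of the dataset into ten folds. A sketch of the index bookkeeping (the dataset size of 100 is illustrative):

```python
import numpy as np

def ten_by_tenfold(n, repeats=10, folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` independent ten-folds."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)
        for chunk in np.array_split(perm, folds):
            test = np.sort(chunk)
            train = np.setdiff1d(perm, chunk)   # the remaining nine portions
            yield train, test

splits = list(ten_by_tenfold(100))              # 10 repeats x 10 folds = 100 experiments
```

Within each repeat, every sample appears in exactly one test fold, so the ten test sets of a repeat partition the dataset.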
Across all experiments, we use binary cross entropy as the loss function. We use the Adam optimiser [KB14] with learning rate lr = 1e-3 to optimise the parameters of the neural network. The wavelet parameters are updated using stochastic gradient descent with learning rate lr = 1e-2 for all datasets except IMDB-B, where the learning rate is set to lr = 1e-1. The batch sizes for each experiment are shown in Table 4. In all experiments, we stop the optimisation of wavelet parameters at epoch 50, while the neural network parameters continue to be optimised.
We use the GUDHI library to compute persistence, and make use of the optimisation and machine learning library PyTorch for the construction of the graph classification models.

4.3. Results and Discussion.
In Table 2, we present the classification accuracies of our model. For each dataset, we perform four experiments using our model, varying whether the wavelet parameter is optimised and whether additional features are included. In Table 1, we show the test accuracy of our model alongside two persistence-based graph classification architectures, Perslay and GFL, as well as other state-of-the-art graph classification architectures.
We first compare the performances of our model between the cases where we optimise and where we fix the wavelets. In Table 2, we see that on MUTAG and DHFR, optimising the wavelet improves the classification accuracy regardless of whether extra features are included. On NCI1, wavelet optimisation improves the classification accuracy only when persistence features alone are included. When we include non-persistence features in the model, the performances of the optimised and control models are statistically indistinguishable for NCI1, suggesting that the non-persistence features play a more significant role in the classification. As for COX2, PROTEINS, and IMDB-B, optimising the wavelet coefficients does not bring about statistically significant improvements. This indicates that the initial wavelet signature, the heat kernel signature W(e^{−10x}), is a locally optimal choice of wavelet for our neural network classifier.
We now compare our architecture to the other persistence-based architectures, Perslay and GFL, where node attributes are excluded from their vertex function models. Except on PROTEINS, our wavelet optimised model matches or exceeds Perslay. While our model architecture and choice of wavelet initialisation are similar to those of Perslay, we differ in two important respects. Perslay fixes the vertex functions but optimises the weights assigned to points on the persistence diagrams, as well as the parameters of the persistence images. Our improvements on Perslay for MUTAG, DHFR, and IMDB-B indicate that on some datasets of graphs, vertex function optimisation yields improvements that cannot be obtained through vectorisation optimisation alone.
Compared to GFL (without node attributes), both Perslay and our architecture achieve similar or higher classification accuracies on PROTEINS and NCI1. This supports wavelet signatures being viable models for vertex functions on those datasets. On the other hand, both Perslay and our model lag behind GFL on IMDB-B. We attribute this to the fact that IMDB-B, unlike the bioinformatics datasets, consists of densely connected graphs. The graphs in IMDB-B have diameter at most two and 14% of the graphs are cliques. This fact has two consequences. First, we expect the one-layer GIN used in GFL, a local topology summary, to be highly effective in optimising for the salient features of a graph with small diameter. Second, the extended persistence modules for cliques have zero persistence, since all vertices are assigned the same function value due to symmetry. In contrast, the ordinary persistence used in GFL is able to capture the cycles in a complete graph as points with infinite persistence.
Compared to the non-persistence state-of-the-art architectures in Table 2, our model achieves competitive accuracies on MUTAG, COX2, and DHFR. For NCI1 and PROTEINS, all the persistence-based architectures listed that exclude additional node attributes perform poorly in comparison, though PWL achieves leading results when node attributes are included.
All in all, we observe that wavelet signatures can be an effective parametrisation of vertex functions when extended persistence is used to generate features for graph classification. In particular, on some bioinformatics datasets, we show that optimising the wavelet signature leads to improvements in classification accuracy. The wavelet signature approach is complementary to the GFL approach to vertex function parametrisation, as the two show strengths on different datasets.

Conclusion
We have presented a framework for equipping any graph G with a set of extended persistence diagrams EPH • F : M → Bar^4 parametrised over a manifold M, a parameter space for the graph's wavelet signature. We described how wavelet signatures can be parametrised and interpreted. Given a differentiable function Out : Bar^4 → R on extended persistence diagrams, we have shown that the loss function L = Out • EPH • F is generically differentiable with respect to θ ∈ M. Thus, we can apply gradient descent methods to optimise the extended persistence diagrams of a graph to minimise L.
We applied this framework to a graph classification architecture where the wavelet signature is optimised for classification accuracy. We demonstrate an increase in accuracy on several benchmark datasets when the wavelet is optimised, and perform competitively with state-of-the-art persistence-based graph classification architectures.

Funding
Ka Man Yim is funded by the EPSRC Centre For Doctoral Training in Industrially Focused Mathematical Modelling (EP/L015803/1) with industrial sponsorship from Elsevier. Jacob Leygonie is funded by the EPSRC grant EP/R513295/1. Both authors are members of the Centre for Topological Data Analysis, which is supported by the EPSRC grant New Approaches to Data Science: Application Driven Topological Data Analysis EP/R018472/1.

A. Differentiability of the extended persistence map
Let K be a finite simplicial complex with vertex set V and dimension d ∈ N. A vertex function f ∈ R^V extends to the whole complex via f(σ) := max_{v∈σ} f(v). Filtrations, persistence modules and barcodes are then defined analogously to the case of a graph. The extended barcode of a function f now consists of 3(d + 1) barcodes: the ordinary, relative and extended barcodes in each dimension 0 ≤ p ≤ d. We then have the extended persistence map and the ordinary persistence map as in Remark 3.1.

Proposition A.1. Let K be a finite simplicial complex, and let F : M → R^V be a generically differentiable parametrisation. Then the composition EPH • F is generically differentiable.
In particular, taking the parameter space M to be the space R^V of vertex functions, we obtain the generic differentiability of the extended persistence map EPH itself. Note, however, that we could not have directly deduced the generic differentiability of an arbitrary composition EPH • F from the generic differentiability of EPH alone: the image of a parametrisation F might lie entirely in the set where EPH is not differentiable.
The idea of our proof is to view the extended persistence of a vertex function f ∈ R^V as the ordinary persistence of an extension of f over the cone complex C(K). We note that this point of view has proven to be particularly useful for computing extended persistence in practice [CEH09]. The relationship between EPH and PH can then be described by a commutative diagram whose horizontal maps are EPH and PH and whose vertical maps are differentiable. Thus, we can deduce the differentiability of the extended persistence map EPH from the results of [LOT19] about the ordinary persistence map PH.
Proof of Proposition A.1. Let f ∈ R^V be a vertex function. Let K^t (resp. K_t) be the maximal subcomplex of K induced by the vertices taking values greater (resp. less) than t. For 0 ≤ p ≤ d, the associated p-th extended persistent homology module V_p(f) consists of the ordinary homology groups H_p(K_t) for t ∈ R, the group H_p(K) at ∞, and the relative homology groups H_p(K, K^t) for t ∈ R^op. As such, V_p(f) is a module indexed over the extended real line R ⊔ {∞} ⊔ R^op. We construct an equivalent module V_{p,R}(f) over the simpler, compact poset [−R, 3R], where R > 0 is a large enough constant chosen hereafter. For this, we consider the poset map that collapses R ⊔ {∞} ⊔ R^op onto [−R, 3R] as in fig. 2. Formally, the poset map is defined on R as the canonical retraction onto [−R, R], on R^op as the canonical retraction onto [−R, R] followed by the symmetry t ↦ 2R − t, and it sends the point ∞ to R.
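The collapse just described can be sketched directly. The three-way `part` flag below is a hypothetical encoding of the disjoint domain R ⊔ {∞} ⊔ R^op, with the retraction applied before the reflection t ↦ 2R − t:

```python
def collapse(t, R, part='ord'):
    """Collapse R ⊔ {inf} ⊔ R^op onto the compact poset [-R, 3R] (a sketch).

    part selects the component of the domain: 'ord' for the copy of R,
    'rel' for R^op, and 'inf' for the point at infinity.
    """
    if part == 'inf':
        return R                       # the point at infinity is sent to R
    clamped = max(-R, min(t, R))       # canonical retraction onto [-R, R]
    if part == 'ord':
        return clamped
    if part == 'rel':
        return 2 * R - clamped         # reflection lands in [R, 3R]
    raise ValueError("part must be 'ord', 'rel' or 'inf'")
```

The 'ord' branch lands in [−R, R], the 'rel' branch in [R, 3R], and both meet at R, exactly as in fig. 2.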
The extended module V_p(f) is essentially equivalent to the ordinary module V_{p,R}(f), since we can retrieve the extended barcode EPH_p(f) of V_p(f) from the barcode of V_{p,R}(f): intervals contained in [−R, R] yield the ordinary barcode, intervals crossing R yield the extended barcode, and intervals contained in [R, 3R] yield the relative barcode, after reparametrising endpoints beyond R by the symmetry t ↦ 2R − t. We denote this decoding map by Dec_R : Bar → Bar^3. We next take advantage of working with the ordinary module V_{p,R}(f) by viewing it as the sublevel set persistent homology module of a function defined on the cone C(K).
Note that the relative homology groups of V_{p,R}(f) in the second row of Eq. (28) may be replaced with the ordinary (reduced) homology groups of the cones C(K^{2R−t}), via the functorial isomorphism between the relative homology of the pair and the reduced homology of the cone. We denote by ω the distinguished vertex of such cones. It is then clear that V_{p,R}(f) equals the ordinary p-th sublevel set persistent (reduced) homology module of the function f̃_R : C(K) → R defined by f̃_R(σ) := f(σ) and f̃_R(σ ∪ {ω}) := 2R − min_{v∈σ} f(v) for any simplex σ ∈ K, and f̃_R(ω) := −R by convention. Plugging these constructions together, we connect the ordinary and extended persistence maps in the commutative diagram (29). Note that this diagram only makes sense for parameters θ such that F̃_R(θ) is a function whose sublevel sets are subcomplexes of C(K), as PH(F̃_R(θ)) is undefined otherwise. This requirement is satisfied whenever the inequality sup_σ |F(θ)(σ)| < R holds. For simplicity, we assume that R can be chosen large enough for the inequality to hold for all parameters θ, so that the diagram (29) makes sense globally on M. One can always avoid this restriction by working locally on compact neighbourhoods in M.
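The extension f̃_R is simple enough to sketch concretely. In the snippet below, simplices are encoded as vertex tuples and the apex ω by the symbol 'w'; both encodings are illustrative choices, not the paper's implementation.

```python
def cone_extension(f, simplices, R):
    """Sketch of the extension f~_R on the cone C(K) from the proof.

    f: dict mapping vertices to function values; simplices: list of vertex
    tuples of K. Returns a dict of filtration values on C(K), where the
    coned copy of sigma is encoded by appending the apex symbol 'w'.
    """
    filt = {('w',): -R}  # the apex omega gets value -R by convention
    for sigma in simplices:
        vals = [f[v] for v in sigma]
        filt[sigma] = max(vals)                   # f~_R(sigma) := f(sigma)
        filt[sigma + ('w',)] = 2 * R - min(vals)  # coned simplex
    return filt

# Toy example: a single edge with f(a) = 0, f(b) = 1, and R = 2.
filt = cone_extension({'a': 0, 'b': 1}, [('a',), ('b',), ('a', 'b')], R=2)
```

For t in [−R, R] the sublevel sets of f̃_R recover the sublevel sets of f, while for t in [R, 3R] they attach the cones over superlevel sets, as in the proof.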
From [LOT19, Theorem 4.9], the subset M′ ⊆ M where the parametrisation F is differentiable and induces a locally constant pre-order on the simplices of K is a generic submanifold. In turn, all the maps θ ↦ min_{v∈σ} F(θ)(v) and θ ↦ max_{v∈σ} F(θ)(v), for σ ∈ K a simplex, are differentiable over M′. Therefore F̃_R : M → R^{C(K)} is differentiable over the generic submanifold M′.
Since F̃_R is generically differentiable, so is PH • F̃_R [LOT19, Theorem 4.9], i.e. we generically have local coordinate systems as in Def. 3.2. Since the decoding map Dec_R in diagram (29) merely applies an affine transformation to the local coordinate systems and then splits them into three parts (the splitting being constant), we obtain local coordinate systems for EPH • F. Therefore, EPH • F is generically differentiable.

B. The Wavelet Signature is Well-defined
In Definition 2.1, we defined the wavelet signature using the eigenvalues and eigenvectors of a graph Laplacian L. The wavelet signature is only well defined if it is independent of the choice of eigenbasis for L; ambiguity could occur if L has eigenvalues with multiplicity greater than one.
Proposition 2.2.The wavelet signature of a graph is independent of the choice of orthonormal eigenbasis for the Laplacian.
Proof. Let Spec(L) ⊂ R denote the spectrum of L and φ_1, …, φ_{|V|} be a set of orthonormal eigenvectors of L. Let Φ(λ) denote the |V| × m matrix whose m columns are the eigenvectors φ_{i_1}, …, φ_{i_m} with eigenvalue λ, where m is the geometric multiplicity of λ. Then we can rewrite the wavelet signature eq. (3) as

W(g)(v) = Σ_{λ ∈ Spec(L)} g(λ) (Φ(λ)Φ(λ)^T)_{vv}.    (30)

Suppose we have another choice of eigenbasis of L. Without loss of generality, for λ ∈ Spec(L), the new basis φ′_{i_1}, …, φ′_{i_m} for eig(λ) is related to the previous eigenbasis φ_{i_1}, …, φ_{i_m} by an orthogonal transformation U(λ) ∈ R^{m×m} on Φ(λ): Φ′(λ) = Φ(λ)U(λ). (Strictly speaking, the decoding map of Appendix A should furthermore forget the unique unbounded interval [b, +∞) in the barcode PH(f̃_R), since the ordinary persistence map PH computes the barcode of a module made of non-reduced homology groups.)
5 As L is symmetric and hence diagonalisable, the geometric and algebraic multiplicities of its eigenvalues agree.
Since Φ′(λ)Φ′(λ)^T = Φ(λ)U(λ)U(λ)^T Φ(λ)^T = Φ(λ)Φ(λ)^T, the V × V matrix Φ(λ)Φ(λ)^T is independent of the choice of eigenbasis, and so the wavelet signature given on the right hand side of eq. (30) must also be independent of the choice of eigenbasis.
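The proposition can be checked numerically on the 4-cycle, whose normalised Laplacian has eigenvalue 1 with multiplicity two. The signature form (Φ²)g(λ) used below matches the right hand side of eq. (30); the rotation angle is arbitrary.

```python
import numpy as np

# Cycle graph C4: the normalised Laplacian I - A/2 has spectrum {0, 1, 1, 2},
# so the eigenbasis of the eigenvalue-1 eigenspace is only defined up to an
# orthogonal transformation U -- exactly the ambiguity in Proposition 2.2.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.eye(4) - A / 2.0                  # all degrees equal 2
lam, Phi = np.linalg.eigh(L)             # lam is sorted: [0, 1, 1, 2]
g = lambda x: np.exp(-10.0 * x)

sig1 = (Phi ** 2) @ g(lam)               # signature in the computed basis

# Rotate the two eigenvalue-1 eigenvectors (columns 1 and 2) by an arbitrary
# planar rotation and recompute the signature.
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Phi2 = Phi.copy()
Phi2[:, 1:3] = Phi[:, 1:3] @ U
sig2 = (Phi2 ** 2) @ g(lam)

assert np.allclose(sig1, sig2)           # basis-independent, as claimed
```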

C. Experimental Details
C.1. Persistence Image Parameters. We vectorised each of the three persistence diagrams EPH_0, EPH_1^rel and EPH_1^ext as a persistence image. Prior to vectorising the persistence diagrams, we apply a fixed, identical affine transformation to the values of the vertex functions across all graphs in the dataset concerned, such that the maximum and minimum values of the initial vertex function prior to optimisation, taken across all graphs in the dataset, are scaled to 1 and 0 respectively. The persistence image is sampled on a 20 × 20 grid whose grid points are placed equidistantly, σ = 1/17 apart, on the square [−σ, 1 + σ]² of the persistence diagram, where σ is also the width of the Gaussians. The Gaussian centred on the birth and persistence coordinates (b, p) of each point is weighted according to its persistence: ω(p) = sin²((π/2) min(p/σ, 1)).
Points with persistence p ≥ σ are thus assigned a uniform weight ω = 1, while the weight of other points diminishes to zero as p → 0.
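A minimal numpy sketch of this vectorisation, with the grid and weighting as described (the function name and array layout are illustrative):

```python
import numpy as np

def persistence_image(points, res=20, sigma=1.0 / 17.0):
    """Persistence image on a res x res grid over [-sigma, 1 + sigma]^2.

    points: iterable of (birth, persistence) pairs. Each point contributes
    a Gaussian of width sigma, weighted by
    w(p) = sin^2((pi/2) * min(p / sigma, 1)).
    """
    xs = np.linspace(-sigma, 1.0 + sigma, res)  # grid spacing is 1/17
    gb, gp = np.meshgrid(xs, xs)
    img = np.zeros((res, res))
    for b, p in points:
        w = np.sin(0.5 * np.pi * min(p / sigma, 1.0)) ** 2
        img += w * np.exp(-((gb - b) ** 2 + (gp - p) ** 2)
                          / (2 * sigma ** 2))
    return img
```

A point with zero persistence receives weight zero and leaves the image blank, while points with p ≥ σ contribute at full weight.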
C.2. Convolutional Neural Network Architecture for Persistence Images. We feed each set of three persistence images, belonging to either the optimisable or the static persistence diagrams, as a three-channel image into a convolutional neural network that outputs a 22 × 22 image. The function Conv denotes a convolutional layer with kernel size 2, stride 1 and padding 1; BN2D denotes a 2D batch normalisation layer; and DO denotes a dropout layer with dropout probability p = 0.5.
C.3. Multilayer Perceptron for Non-Persistence Features. We feed the non-persistence features as a vector of length n = #features into a multilayer perceptron in which Aff : R^n → R^n denotes an affine transformation and BN denotes a batch normalisation layer.
C.4. Path Encoding of Laplacian Eigenvalues. For MUTAG, COX2, DHFR, and NCI1, we sort the Laplacian eigenvalues in ascending order and transform the one-dimensional sequence into a two-dimensional time series via a delay embedding. For IMDB-B, we instead incorporate a fictitious time coordinate t_j = 2(j − 1)/(N − 1) for j = 1, …, N as the second coordinate in place of a 'delayed' eigenvalue.
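Both encodings are easy to sketch. Note that the exact lag of the delay embedding was elided in the text above, so the lag-1 pairing in the first function is a placeholder assumption; the fictitious-time variant follows the stated formula.

```python
import numpy as np

def delay_embedding(eigvals):
    """Hypothetical 2D delay embedding of sorted eigenvalues as pairs
    (lambda_j, lambda_{j+1}); the lag used here is an assumption."""
    lam = np.sort(np.asarray(eigvals, dtype=float))
    return np.column_stack([lam[:-1], lam[1:]])

def fictitious_time_embedding(eigvals):
    # IMDB-B variant: second coordinate t_j = 2(j - 1)/(N - 1), j = 1..N.
    lam = np.sort(np.asarray(eigvals, dtype=float))
    N = len(lam)
    t = 2.0 * np.arange(N) / (N - 1)
    return np.column_stack([lam, t])
```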

Figure 1 .
Figure 1. Given a wavelet g : R → R, we can equip any graph with a non-trivial vertex function. This allows us to compute the extended persistence diagrams of a graph and use the diagrams as features of the graph to predict a graph's classification in some real world setting. The wavelet g can be optimised to improve the classification accuracy of a graph classification pipeline based on the extended persistence diagrams of a graph's vertex function.

Figure 2 .
Figure 2. Collapsing the dotted part of the left poset yields the compact poset on the right.
Figure 3. We consider the parametrisations of wavelet signatures on some datasets of graphs in machine learning, namely MUTAG, COX2, DHFR, NCI1, PROTEINS and IMDB-B, using coefficients of 12 radial basis functions (see eq. (25)) and a degree 13 Chebyshev polynomial respectively. For each dataset, we plot the distribution of the singular values σ of the map F in eq. (15) from the basis function coefficients θ ∈ R^12 to the wavelet signature on the whole dataset of graphs, as a fraction of the largest singular value σ_max of F. We observe that for both parametrisations, the singular values span many orders of magnitude across different datasets. Note that the singular values of F depend not only on the choice of basis but also on the dataset of graphs.

Figure 4 .
Figure 4. The functions shown are the new, stable wavelet basis h_1, …, h_12 (eq. (12)) for the MUTAG dataset, derived from an initial numerically unstable parametrisation using twelve inverse multiquadric radial basis functions (eq. (25)). We parametrise the wavelet as a linear combination of these basis functions.
Let L ∈ R^{V×V} be the normalised Laplacian of a simple graph G = (V, E). Let φ_1, …, φ_{|V|} be an orthonormal eigenbasis for L and λ_1, …, λ_{|V|} be their corresponding eigenvalues. The wavelet signature W : R^{[0,2]} → R^V maps a function g : [0, 2] → R, which we refer to as a wavelet, to a vertex function W(g) ∈ R^V linearly, where the value of W(g) on vertex v is given by

W(g)(v) = Σ_{i=1}^{|V|} g(λ_i) φ_i(v)².

Table 1 .
Binary classification accuracy on datasets of graphs. The best accuracy among persistence-based models that do not use node attributes is shown in bold for each dataset. The performance of our model is reported in the column Wavelet Opt. on the right hand side. The accuracies of the control model, where the wavelet parameters are fixed to their initial values, are shown in the column Control. Both these models use additional features (see Section 4.1.2). The accuracies of our model are the means over 10 ten-folds, recorded at the epochs reported in Table 3. We also provide the standard deviations of the 10 mean accuracies of each ten-fold. For other architectures, we indicate in the bottom row of the table whether their accuracies were reported as averages over 1 ten-fold or 10 ten-folds. To avoid confusion, we leave out the errors reported for P-SAN, GIN and GFL and refer the reader to the original sources, as they were calculated using a different formula. Errors were not reported in [VZ17] for FGSD.

Table 2 .
Binary classification accuracy of our model as we vary whether non-persistence features are included and whether the wavelet is optimised. The reported accuracies are the means over 10 ten-folds, recorded at the epochs reported in Table 3. We also provide the standard deviations of the 10 mean accuracies of each ten-fold. See Section 4.1.2 for the particulars of the non-persistence features.

Table 4 .
Batch sizes used in the graph classification experiments for the datasets described in Section 4.