A Comparative Study of Machine Learning Methods for Persistence Diagrams

Many and varied methods currently exist for featurization, the process of mapping persistence diagrams to Euclidean space while preserving as much structure as possible. However, to our knowledge, there are presently no methodical comparisons of existing approaches, nor a standardized collection of test data sets. This paper provides a comparative study of several such methods. In particular, we review, evaluate, and compare the stable multi-scale kernel, persistence landscapes, persistence images, the ring of algebraic functions, template functions, and adaptive template systems. Using these approaches for feature extraction, we apply and compare popular machine learning methods on five data sets: MNIST, Shape Retrieval of Non-Rigid 3D Human Models (SHREC14), extracts from the Protein Classification Benchmark Collection (Protein), the MPEG7 shape matching data set, and the HAM10000 skin lesion data set. These data sets are commonly used in the featurization literature, and we use them to evaluate predictive utility in real-world applications.


INTRODUCTION
Persistence diagrams are an increasingly useful shape descriptor from Topological Data Analysis. One of their more popular uses to date has been as features for machine learning tasks, with success in several applications to science and engineering. Though many methods and heuristics exist for performing learning with persistence diagrams, evaluating their relative merits is still largely unexplored. Our goal here is to contribute to the comparative analysis of machine learning methods with persistence diagrams.
Starting with topological descriptors of datasets, in the form of persistence diagrams, we provide examples and methodology to create features from these diagrams to be used in machine learning algorithms. We provide the necessary background and mathematical justification for six different methods (in chronological order): the Multi-Scale Kernel, Persistence Landscapes, Persistence Images, Adcock-Carlsson Coordinates, Template Systems, and Adaptive Template Systems. To thoroughly evaluate these methods, we have researched five different data sets and the relevant methods to compute persistence diagrams from them. The data sets, the persistence diagrams, and the code used to compute them are readily available for academic use.
As part of this review, we also provide a user guide for these methods, including comparisons and evaluations across the different types of datasets. After computing the six types of features, we compared the predictive accuracy of ridge regression, random forest, and support vector machine models to assess which type of featurization is most useful in predictive models. The code developed for this analysis is available, with some functions developed specifically for use in machine learning applications, and easy-to-use Jupyter notebooks showing examples of each function with multiple dataset types.
Of these methods, Persistence Landscapes, Adcock-Carlsson Coordinates, and Template Systems are quite accurate and create features for large datasets quickly. Adaptive Template Systems and Persistence Images took somewhat longer to run; however, the Adaptive Template Systems featurization method did improve accuracy over other methods. The Multi-Scale Kernel was the most computationally intensive, and during our evaluation we did not observe instances of it outperforming other methods.

BACKGROUND
Algebraic topology is the branch of mathematics concerned with the study of shape in abstract spaces. Its main goal is to quantify the presence of features which are invariant under continuous deformations; these include properties like the number of connected components in the space, the existence of holes, and whether or not the space is orientable. As an example, Figure 1 shows two spaces: the 2-dimensional sphere on the left, which is the set S² = {x ∈ R³ : ‖x‖ = 1} of 3-dimensional vectors with unit norm, and the Möbius band M = [−1, 1] × [−1, 1]/(−1, y) ∼ (1, −y) on the right. The latter can be thought of as the result of gluing the right and left edges of the square [−1, 1] × [−1, 1] with opposite orientations.
The aforementioned properties of shape for these spaces are as follows. Both S² and M are connected, while S² is orientable but M is not. Moreover, any closed curve drawn on the surface of S² bounds a 2-dimensional spherical cap, and thus we say that the sphere has no 1-dimensional holes. The equator {(x, 0) : |x| ≤ 1} of the Möbius band, on the other hand, is a closed curve in M which is not the boundary of any 2-dimensional region, and therefore we say that M has one 1-dimensional hole. Finally, S² is itself a closed 2-dimensional surface bounding a 3-dimensional void (thus the sphere is said to have a 2-dimensional hole), but M has no such features.
The homology of a space is one way in which topologists have formalized measuring the presence of n-dimensional holes in a space (Hatcher, 2002). Indeed, for a space X (e.g., the sphere or the Möbius band), an integer n ≥ 0 and a field F (like the integers modulo a prime p, denoted Z_p), the n-th homology of X with coefficients in F is a vector space over F denoted H_n(X; F). The main point is that the dimension of this vector space corresponds roughly to the number of essentially distinct n-dimensional holes in X. Going back to the examples from Figure 1: the dimension of H_0(X; F) corresponds to the number of connected components in X, the dimension of H_1(X; F) represents the number of 1-dimensional holes, and so on for H_n(X; F) and n ≥ 1. It is entirely possible that different choices of F result in different dimensions for H_n(X; F); this is an indication of intricate topological structure in X, but the metaphor of holes is still useful.

Persistent Homology
There are several learning tasks where each point in a data set has shape or geometric information relevant to the problem at hand. Indeed, in shape retrieval, database elements are often 3D meshes discretizing physical objects, and the ensuing learning tasks are often related to pose-invariant classification (Pickup et al., 2014). In computational chemistry and drug design, databases of chemical compounds are mined in order to discover new targets with desirable functional properties. In this case, the shape of each molecule (i.e., of the collection of its constituent atoms) is closely related to molecular function, and thus shape features can be useful in said data analysis tasks (Bai et al., 2009).
If homology is what topologists use to measure the shape of abstract spaces, then persistent homology is how the shape of a geometric data set can be quantified (Perea, 2019). Persistent homology takes as input an increasing sequence of spaces X : X_0 ⊂ X_1 ⊂ ⋯ ⊂ X_L; any such sequence is called a filtration. The definition of persistent homology relies on two facts: first, that one can compute homology for each space separately, i.e., H_n(X_ℓ; F) for each 0 ≤ ℓ ≤ L, and second, that each inclusion X_ℓ ⊂ X_{ℓ+1} induces a linear transformation H_n(X_ℓ; F) → H_n(X_{ℓ+1}; F) between the corresponding vector spaces. The n-th persistent homology of the filtration X, denoted PH_n(X; F), is this sequence of vector spaces and induced linear transformations. The evolution of features in PH_n(X; F), which is the main point of interest, can be encoded and visualized through a persistence diagram. In a nutshell, if each H_n(X_j; F) is finite dimensional and β_n^{j,ℓ}(X; F) denotes the rank of the linear transformation H_n(X_j; F) → H_n(X_ℓ; F) induced by the inclusion X_j ⊂ X_ℓ, j ≤ ℓ, then each pair (j, ℓ) is assigned the multiplicity μ_n^{j,ℓ} = (β_n^{j,ℓ−1} − β_n^{j,ℓ}) − (β_n^{j−1,ℓ−1} − β_n^{j−1,ℓ}), and the persistence diagram of PH_n(X; F), denoted dgm_n(X; F), is the collection of pairs (j, ℓ) with nonzero multiplicity (i.e., number of repeats).
See section VII.1 of Edelsbrunner and Harer (2010) for more details. In other words, dgm_n(X; F) is a multiset (i.e., a set whose elements appear with multiplicity) of pairs, where each (j, ℓ) ∈ dgm_n(X; F) encodes μ_n^{j,ℓ} homological features of the filtration X which appear at X_j (i.e., j is the birth time) and disappear entering X_ℓ (ℓ is the death time). The persistence ℓ − j of (j, ℓ) is often used as a measure of prominence across the filtration X, but short-lived features can be quite informative for learning purposes as well [see for instance Bendich et al. (2016)].

Filtrations From Point Cloud Data
There are several ways of constructing filtrations from geometric data. Indeed, let X be a finite set and d_X a measure of distance between its elements. The pair (X, d_X) is often referred to as point cloud data, and the running hypothesis is that it is the result of sampling X from an unknown continuous space. The ensuing inference problem in Topological Data Analysis is to use (X, d_X) to estimate shape/homological features of the unknown underlying space. A popular strategy is to compute the Vietoris-Rips complex R_ϵ(X) = {σ ⊂ X nonempty : d_X(x, x′) ≤ ϵ for all x, x′ ∈ σ}, where ϵ ≥ 0; a singleton {x} is thought of as a vertex at x, a set with two elements {x_0, x_1} represents an edge between x_0 and x_1, a set {x_0, x_1, x_2} spans a triangle, and so on. This construction is motivated by the fact that R_ϵ(X) is known to approximate the topology of the underlying space from which X was sampled, under various conditions on X and ϵ (Latschev, 2001). In practice, however, an optimal choice of scale ϵ ≥ 0 is unclear at best, so one instead considers the Vietoris-Rips filtration R(X) : R_{ϵ_0}(X) ⊂ R_{ϵ_1}(X) ⊂ ⋯ ⊂ R_{ϵ_L}(X) for 0 ≤ ϵ_0 < ϵ_1 < ⋯ < ϵ_L. The ϵ_ℓ's can be chosen, for instance, to be the distinct values of the distance function d_X. Figure 2 shows an example of this construction for X ⊂ C sampled around the unit circle S¹ = {z ∈ C : |z| = 1}, and four scales ϵ ≥ 0. The persistent homology of the Vietoris-Rips filtration, i.e., PH_n(R(X); F), can then be used to measure the shape of the space underlying the point cloud. An important point is that even though homology is invariant under continuous deformations, the Vietoris-Rips complex is a metric-based construction. Thus, the resulting Vietoris-Rips persistence diagrams dgm_n^R(X), whose points (ϵ_j, ϵ_ℓ) appear with multiplicity μ_n^{j,ℓ} > 0, often encode features such as density and curvature, in addition to the presence of holes and voids. Figure 3 shows the Vietoris-Rips persistence diagrams in dimensions n = 0, 1 for the data sampled around the unit circle in Figure 2.
The persistence of a point in a persistence diagram can be visualized as its vertical distance to the diagonal. This measures how likely it is for said feature to correspond to a feature of the underlying space, instead of being a reflection of sampling artifacts [see for instance Theorem 5.3 in Oudot (2015)]. The fact that there is one highly persistent point for n = 0 indicates that the data has one cluster (i.e., one connected component), while the presence of one highly persistent point for n = 1 indicates that there is a strong 1-dimensional hole in the data. Both are consistent with, and suggest, that the circle is the underlying space.
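To make the Rips construction concrete, the following is a minimal sketch (our own illustration, not the code accompanying this paper) of 0-dimensional Vietoris-Rips persistence: sorting pairwise distances and merging clusters with a union-find is a run of Kruskal's algorithm, and each merge produces one birth-death pair. The function name `rips_h0` and the toy point cloud are ours.

```python
# Minimal sketch: 0-dimensional Vietoris-Rips persistence of a point
# cloud via a union-find / Kruskal pass. Every merge of two clusters
# kills one component, yielding a (birth = 0, death = epsilon) pair;
# one component never dies.
import math
from itertools import combinations

def rips_h0(points):
    edges = sorted(
        (math.dist(p, q), i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
    )
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    diagram = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                 # merge: one class dies at eps
            diagram.append((0.0, eps))
    diagram.append((0.0, math.inf))         # the surviving component
    return diagram

# Two well-separated pairs of points: the largest finite death is the
# gap between the two clusters.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
dgm = rips_h0(pts)
```

On this toy input the diagram has one point per input point (three finite deaths and one essential class), and the last finite death occurs at the inter-cluster distance 4.9.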

Filtrations From Scalar Functions and Image Data
If X is a topological space and f : X → R is a function, then the sublevel sets X_a = f^{−1}((−∞, a]), a ∈ R, define the so-called sublevel set filtration of X. If X is a 3D mesh, for example, then one can compute estimates of curvature at every vertex, and then extend said function linearly (via barycentric coordinates) to the triangular faces. The persistent homology of the sublevel set filtration is often called sublevel set persistence, and it is useful in quantifying shape properties of geometric objects which are endowed with scalar functions. See Figure 4 for an application of this idea. The corresponding persistence diagrams are denoted dgm_n(f). Images provide another data modality where sublevel set persistence can be useful. Indeed, an image can be thought of as a function on a grid of pixels; if the image is in grey scale, then we have a single scalar-valued function, and if the image is multi-channel (like RGB color representations) then each channel is analyzed independently. The grid yields a triangulated space via a Freudenthal triangulation of the plane, and the values of pixel intensity in each channel can be extended via convex combinations to the faces [see Lemma 1 of Kuhn (1960)]. We will apply this methodology later on to the MNIST handwritten digit database (Figure 5). This approach to computing persistent homology from images is not unique in the literature; other popular methods such as cubical homology (Kaczynski et al., 2004) have been used for this same purpose. This work, however, deals exclusively with simplicial homology as it is the standard approach in many applications.
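As an illustration of sublevel set persistence in the simplest setting, here is a hedged sketch (helper names such as `sublevel_h0` are ours) of 0-dimensional sublevel-set persistence for a function sampled on a line graph, e.g., one row of pixels. It implements the elder rule: when two components merge, the one born at the higher function value dies.

```python
# Sketch: 0-dimensional sublevel-set persistence of a scalar function on
# a line graph, via the elder rule. Vertices enter in increasing order
# of function value; merging two components kills the younger one.
import math

def sublevel_h0(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    comp = {}               # vertex -> parent in the union-find forest
    birth = {}              # root vertex -> birth value of its component
    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]
            i = comp[i]
        return i
    diagram = []
    for i in order:
        comp[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):            # neighbors on the line graph
            if j in comp:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # elder rule: the root born later dies at values[i]
                    young, old = (ri, rj) if birth[ri] >= birth[rj] else (rj, ri)
                    diagram.append((birth[young], values[i]))
                    comp[young] = old
    roots = {find(i) for i in comp}
    for r in roots:
        diagram.append((birth[r], math.inf))
    return diagram

# W-shaped function: two basins born at heights 0 and 1, merging at 2.
f = [3, 0, 2, 1, 3]
dgm = sublevel_h0(f)
```

Here the basin born at height 1 dies when it merges with the deeper basin at height 2, and the global minimum persists forever.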

The Space of Persistence Diagrams
Persistence diagrams have been shown to be a powerful tool for quantifying shape in geometric data (Carlsson, 2014). Moreover, one of their key properties is their stability with respect to perturbations in the input, which is crucial when dealing with noisy measurements. Indeed, two persistence diagrams D and D′ are said to be δ-matched, δ > 0, if there exists a bijection m : A → A′ between subsets A ⊂ D and A′ ⊂ D′ so that ‖x − m(x)‖_∞ ≤ δ for every x ∈ A, and every unmatched point (b, d) of D or D′ satisfies d − b ≤ 2δ. The bottleneck distance d_B(D, D′) is the infimum over all δ > 0 so that D and D′ are δ-matched; this defines a metric on the set D_0 of all finite persistence diagrams. The stability theorem of Cohen-Steiner et al. (2007) for sublevel set persistence contends that if X is a finitely triangulated space and f, g : X → R are tame and continuous, then d_B(dgm_n(f), dgm_n(g)) ≤ ‖f − g‖_∞ for every integer n ≥ 0. We note that the theorem is still true if continuous is replaced by piecewise linear. Similarly, if (X, d_X) and (Y, d_Y) are finite metric spaces, then the stability of Rips persistent homology (Chazal et al., 2014, Theorem 5.2) says that d_B(dgm_n^R(X), dgm_n^R(Y)) ≤ 2 d_GH(X, Y), where d_GH(·, ·) denotes the Gromov-Hausdorff distance (Gromov, 2007). In order to develop the mathematical foundations needed for doing machine learning with persistence diagrams, it has been informative to first study the structure of the space they form. Indeed, if D_0 denotes the space of finite persistence diagrams, then we will let D denote its metric completion with respect to the bottleneck distance d_B. It readily follows that d_B extends to a metric on D. See Blumberg et al. (2014) for an explicit description of what the elements of D are. In addition to the bottleneck distance, the Wasserstein metric from optimal transport suggests another way of measuring similarity between persistence diagrams. Indeed, for each integer p ≥ 1 and D, D′ ∈ D, their p-th Wasserstein distance is d_{W_p}(D, D′) = inf_m ( Σ_{x∈A} ‖x − m(x)‖_∞^p + Σ_{(b,d)} ((d − b)/2)^p )^{1/p}, where the infimum runs over all multiset bijections m : A → A′, for A ⊂ D and A′ ⊂ D′, and the second sum runs over the points of D and D′ left unmatched by m.
One can show that d_{W_p} defines a metric on the set D_p of persistence diagrams for which the above quantity is finite, and that (D_p, d_{W_p}) is a complete separable metric space (Mileyko et al., 2011), with d_{W_p} → d_B as p → ∞. Doing statistics and machine learning directly on the space of persistence diagrams turns out to be quite difficult. Indeed, (D, d_B) does not have unique geodesics, and thus the Fréchet mean of general collections of persistence diagrams is not unique (Turner et al., 2014). Since computing averages, and more generally doing linear algebra, is not available on persistence diagrams, several authors have proposed mapping (D, d_B) to topological vector spaces where further analysis can be done. These methods are the main focus of this review. The theory of vectorization of persistence diagrams is an active area of research, with recent results showing the impossibility of full embeddability. Indeed, even though the space of persistence diagrams with exactly n points can be coarsely embedded in a Hilbert space (Mitra and Virk, 2021), this ceases to be true if the number of points is allowed to vary (Wagner, 2019; Bubenik and Wagner, 2020). That said, partial featurization is still useful, as we will demonstrate here.
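The bottleneck distance can be made concrete with a brute-force sketch for tiny diagrams (our own illustration; it is exponential in the number of points and not how practical libraries compute it): augment each diagram with diagonal slots for the other's points and minimize the worst-case ∞-norm cost over all bijections.

```python
# Illustrative brute force (tiny diagrams only): a point may be matched
# to a point of the other diagram or to the diagonal; enumerate all
# bijections of the diagonally augmented diagrams and minimize the
# maximum infinity-norm matching cost.
import math
from itertools import permutations

def _cost(p, q):
    if p is None and q is None:      # diagonal matched to diagonal: free
        return 0.0
    if p is None or q is None:       # point matched to the diagonal
        x, y = p if q is None else q
        return (y - x) / 2.0         # infinity-norm distance to diagonal
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def bottleneck(D1, D2):
    A = list(D1) + [None] * len(D2)  # None stands for a diagonal slot
    B = list(D2) + [None] * len(D1)
    best = math.inf
    for perm in permutations(range(len(B))):
        c = max(_cost(A[i], B[perm[i]]) for i in range(len(A)))
        best = min(best, c)
    return best

d1 = [(0.0, 2.0)]
d2 = [(0.0, 2.1), (1.0, 1.2)]
```

For these two diagrams the optimal matching pairs (0, 2) with (0, 2.1) and sends the low-persistence point (1, 1.2) to the diagonal, giving a bottleneck distance of 0.1.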

FEATURIZATION METHODS
For each of the methods below, we start with a collection of persistence diagrams. A persistence diagram can be represented in either the birth-death plane or the birth-lifetime plane; some methods require birth-death coordinates and others birth-lifetime coordinates. The birth-death plane represents each feature as a pair (x, y), where x is the birth time and y the death time of the feature in the persistence diagram. The birth-lifetime plane is obtained via the change of coordinates (x, y) ↦ (x, y − x); in this manner, we define the lifetime of a feature (x, y) as its persistence y − x. The persistence diagrams of a particular geometric object can be calculated in a variety of ways, which will be made explicit for each dataset at time of evaluation.

Multi-Scale Kernel
The Multi-Scale Kernel of Reininghaus et al. (2015) defines a kernel over the space of persistence diagrams, which can then be used in various types of kernel learning methods. In general, a kernel k is by definition a symmetric and positive definite function of two variables. Mathematically, from Reininghaus et al. (2015), given a set X, a function k : X × X → R is a kernel if there exists a Hilbert space H, called the feature space, and a map Φ : X → H, called the feature map, such that k(x, y) = 〈Φ(x), Φ(y)〉_H for all x, y ∈ X. The kernel induces a distance on X defined as d_k(x, y) = (k(x, x) + k(y, y) − 2k(x, y))^{1/2}. Reininghaus et al. (2015) propose a multi-scale kernel on D as follows. Given F, G ∈ D, the persistence scale space kernel k_σ is k_σ(F, G) = 〈Φ_σ(F), Φ_σ(G)〉_{L²(Ω)}, where Φ_σ : D → L²(Ω) is the associated feature map, and Ω ⊂ R² is the closed half-plane above the diagonal. Deriving the solution of a distribution-analogue of the heat equation with boundary conditions in Definition 1 of Reininghaus et al. (2015), the closed form expression of the multi-scale kernel is k_σ(F, G) = (1/(8πσ)) Σ_{p∈F, q∈G} [exp(−‖p − q‖²/(8σ)) − exp(−‖p − q̄‖²/(8σ))], where q̄ = (y, x) denotes the mirror image of q = (x, y) across the diagonal. The multi-scale kernel is shown to be stable w.r.t. the 1-Wasserstein distance by Theorem 2 of Reininghaus et al. (2015), which is a desirable property for classification algorithms. However, by Theorem 3 of Reininghaus et al. (2015), the multi-scale kernel is not stable in the Wasserstein sense for 1 < p ≤ ∞.
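The closed form above can be sketched in a few lines. This is an illustrative implementation of the stated formula, not the optimized code used in our experiments.

```python
# Persistence scale space kernel (closed form of Reininghaus et al.,
# 2015): each point q = (b, d) of one diagram contributes a Gaussian,
# and its mirror image (d, b) across the diagonal contributes with
# opposite sign, so diagrams supported on the diagonal contribute zero.
import math

def multiscale_kernel(F, G, sigma=1.0):
    total = 0.0
    for p in F:
        for (b, d) in G:
            total += math.exp(-((p[0] - b)**2 + (p[1] - d)**2) / (8 * sigma))
            total -= math.exp(-((p[0] - d)**2 + (p[1] - b)**2) / (8 * sigma))
    return total / (8 * math.pi * sigma)

F = [(0.0, 1.0)]
G = [(0.0, 1.0), (0.2, 0.9)]
```

A quick sanity check: the kernel is symmetric, a diagram evaluated against itself gives a positive value, and a diagram whose only point lies on the diagonal contributes exactly zero.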

Persistence Landscapes
Persistence landscapes map persistence diagrams into a function space that is either a Banach space or a Hilbert space (Bubenik, 2020). Advantages of persistence landscapes are that they are invertible, stable, parameter-free, and nonlinear. Persistence landscapes can be computed from a persistence diagram D as follows: to each point (b, d) ∈ D associate the function f_{(b,d)}(t) = max(0, min(t − b, d − t)), and for each k ≥ 1 set λ_k(t) = kmax{f_{(b,d)}(t) : (b, d) ∈ D}, with kmax denoting the k-th largest element. The persistence landscape is the sequence of piecewise linear functions λ_1, λ_2, . . . : R → R. Bubenik shows desirable properties for working with persistence landscapes in statistical modeling; in particular, even though unique means do not exist in the set of persistence diagrams, persistence landscapes do have unique means, and the mean landscape converges to the expected persistence landscape. Figure 6 shows an example of persistence landscapes from the MPEG7 dataset (described in the data section), with each color corresponding to a different landscape λ_k, k = 1, 2, 3.
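A minimal sketch of landscape evaluation on a grid (our own helper, not the paper's code): each point (b, d) contributes a tent of height (d − b)/2 over [b, d], and the k-th landscape at t is the k-th largest tent value.

```python
# Evaluate the k-th persistence landscape lambda_k at grid points ts:
# lambda_k(t) is the k-th largest of max(0, min(t - b, d - t)) over the
# diagram points (b, d), and 0 if fewer than k points are present.
def landscape(diagram, k, ts):
    out = []
    for t in ts:
        vals = sorted((max(0.0, min(t - b, d - t)) for b, d in diagram),
                      reverse=True)
        out.append(vals[k - 1] if k <= len(vals) else 0.0)
    return out

dgm = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
lam1 = landscape(dgm, 1, grid)  # outer tent
lam2 = landscape(dgm, 2, grid)  # inner tent
```

For this nested pair of diagram points, λ_1 traces the big tent peaking at height 2, while λ_2 traces the smaller tent peaking at height 1.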

Persistence Images
From Adams et al. (2015), persistence images are a mapping sending a persistence diagram to an integrable function, called a persistence surface. Fixing a grid on R², integrating the surface over each grid box yields pixel values forming the persistence image. Advantages of persistence images include a representation in R^n, stability, and ease of computation. When calculating the persistence image, a resolution, a distribution, and a weighting function are required as parameters. It is worth noting that the resolution (i.e., the number of pixels) determines the number of features computed by the persistence image. More explicitly, let D be a persistence diagram in birth-lifetime coordinates, and let ϕ_u : R² → R be a differentiable probability distribution, for instance the Gaussian distribution with mean u and variance σ². The persistence surface is then ρ_D(z) = Σ_{u∈D} f(u) ϕ_u(z), with f : R² → R a nonnegative weighting function that is zero along the horizontal axis, continuous, and piecewise differentiable. The persistence image is the collection of pixel values I(ρ_D)_p = ∬_p ρ_D dy dx, where the integration is over each pixel p in the fixed grid on R². This creates an image depicting high and low density areas of the defined grid, represented as a high-dimensional vector for use in machine learning algorithms. An example taken from the MNIST dataset is shown in Figure 7.
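The construction can be sketched as follows. For simplicity this illustrative version evaluates the surface at pixel centers rather than integrating over pixels, and uses a linear-in-lifetime weight; both simplifications are our assumptions, though common in practice.

```python
# Persistence image sketch in birth-lifetime coordinates: a weight that
# vanishes on the horizontal axis (here, the lifetime itself) times an
# isotropic Gaussian centered at each diagram point, sampled at the
# pixel centers (xs x ys) instead of integrated over pixels.
import math

def persistence_image(diagram_bl, xs, ys, sigma=0.5):
    def surface(x, y):
        s = 0.0
        for b, life in diagram_bl:
            w = life                      # weight: zero for zero lifetime
            g = math.exp(-((x - b)**2 + (y - life)**2) / (2 * sigma**2))
            s += w * g / (2 * math.pi * sigma**2)
        return s
    return [[surface(x, y) for x in xs] for y in ys]

dgm = [(0.0, 1.0)]                        # one point: birth 0, lifetime 1
img = persistence_image(dgm, xs=[0.0, 1.0], ys=[0.0, 1.0])
```

The brightest pixel sits at the diagram point itself, with intensity equal to the weight times the Gaussian peak 1/(2πσ²).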

Adcock-Carlsson Coordinates: The Ring of Algebraic Functions on Persistence Diagrams
This method is explored by Adcock et al. (2016) where the authors highlight the fact that any persistence diagram with exactly n points can be described by a vector of the form (x 1 , y 1 , x 2 , y 2 , . . . , x n , y n ) where x i denotes the birth of the i-th class and y i the corresponding death time. Since this specific representation imposes an arbitrary ordering of the elements in the persistence diagram, one can more precisely identify the set of persistence diagrams with exactly n points with elements of the n-symmetric product of R 2 , denoted Sp n (R 2 ).
The inclusions Sp^n(R²) ↪ Sp^{n+1}(R²) thus produce an inverse system of affine coordinate rings, which provides the basis for studying algebraic functions on the space of persistence diagrams.
With this setting in mind, the main goal of Adcock et al. (2016) is to determine free generating sets for the subalgebra of A[Sp ∞ (R 2 )] comprised of elements which are invariant under adjoining a point of zero persistence to a persistence diagram. The following theorem is an answer to this question (see Theorem 1 Adcock et al. (2016)).
Theorem 1. The subalgebra of 2-multisymmetric functions invariant under adding points with zero persistence is freely generated over R by the set of elements of the form p_{a,b}(x, y) = Σ_i x_i^a (y_i − x_i)^b, for integers a ≥ 0 and b ≥ 1. These are the features we call Adcock-Carlsson coordinates.
Using this method, we chose features of this form for both the 0-dimensional and 1-dimensional persistence diagrams, following the choices suggested in Adcock et al. (2016) when analyzing the MNIST data set.
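As an illustration of these coordinates, the following sketch evaluates features of the form Σ_i x_i^a (y_i − x_i)^b on a birth-death diagram. The exponent pairs used below are ours for demonstration, not necessarily the ones used in the experiments.

```python
# Adcock-Carlsson style polynomial features: sum_i x_i^a * (y_i - x_i)^b
# over a birth-death diagram, with a >= 0 and b >= 1 so that adding a
# zero-persistence point (x, x) leaves every feature unchanged.
def ac_coordinate(diagram, a, b):
    return sum((x ** a) * ((y - x) ** b) for x, y in diagram)

def ac_features(diagram, exponents=((0, 1), (1, 1), (0, 2), (1, 2))):
    # exponent pairs here are illustrative, not the paper's exact choice
    return [ac_coordinate(diagram, a, b) for a, b in exponents]

dgm = [(0.0, 2.0), (1.0, 3.0)]
feats = ac_features(dgm)
```

Note the invariance promised by Theorem 1: appending a zero-persistence point such as (5, 5) does not change any of the features, since b ≥ 1 kills its contribution.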

Template Systems
The goal of this method is to find features for persistence diagrams by finding dense subsets of C(D, R). To accomplish this we rely on the fact that, given a persistence diagram D ∈ D and a continuous and compactly supported real-valued function f ∈ C_c(W) on W = {(x, y) ∈ R² : 0 ≤ x < y}, one obtains a feature ν_f(D) = Σ_{x∈D} f(x). This injective featurization allows us to define a template system for D as a collection T ⊂ C_c(W) such that the resulting features {ν_f : f ∈ T} separate the points of D. The advantage of working with these template systems is that they can be used to approximate real-valued functions on the space of persistence diagrams, as proven by the following theorem.
Theorem 2 (Perea et al., 2019). Let T ⊂ C_c(W) be a template system, let C ⊂ D be compact, and let F : C → R be continuous. Then, for every ϵ > 0, there exist template functions f_1, . . . , f_N ∈ T and a polynomial p ∈ R[z_1, . . . , z_N] so that |F(D) − p(ν_{f_1}(D), . . . , ν_{f_N}(D))| < ϵ for all D ∈ C.
Even though this theorem provides the theoretical underpinnings to guarantee the existence of solutions to supervised machine learning problems, it does not provide the specifics for doing so. In particular, one question to answer is how to choose suitable families of template functions. In our evaluations we will explore both prescribed families of template systems, as well as data-driven or adaptive ones.
On the prescribed front we have the tent functions described below; see also Figure 8. In the birth-lifetime plane, given a point x = (a, b) ∈ W and a discretization scale 0 < δ < b, the associated tent function on W is g_{x,δ}(x′, y′) = max(0, 1 − (1/δ) max(|x′ − a|, |y′ − b|)). Given a persistence diagram D ∈ D in birth-death coordinates, the value of the tent feature is obtained by converting each point to birth-lifetime coordinates and summing g_{x,δ} over the result.
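A short sketch of tent-function features (helper names are ours), evaluating the tent in the birth-lifetime plane on a diagram given in birth-death coordinates:

```python
# Tent function in birth-lifetime coordinates: a bump of height 1 at the
# template point (a, b), supported on a box of half-width delta. The
# diagram feature sums the tent over the diagram's points.
def tent(a, b, delta, x, y):
    return max(0.0, 1.0 - max(abs(x - a), abs(y - b)) / delta)

def tent_feature(diagram_bd, a, b, delta):
    # convert birth-death pairs to birth-lifetime before evaluating
    return sum(tent(a, b, delta, x, d - x) for x, d in diagram_bd)

dgm = [(0.0, 2.0), (1.0, 1.5)]   # birth-death coordinates
val = tent_feature(dgm, a=0.0, b=2.0, delta=1.0)
```

Only diagram points inside the template's box contribute: here the point (0, 2) (lifetime 2) hits the tent's apex, while (1, 1.5) (lifetime 0.5) falls outside the support.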

Adaptive Template Systems
The Adaptive Template Systems methodology of Polanco and Perea (2019a) concerns itself with improving and furthering some of the work presented in Perea et al. (2019). The goal is to produce template systems that are attuned, or adaptive, to the input data set and the supervised classification problem at hand. One shortcoming of prescribed template systems, like tent functions, when applied via Theorem 2, is that without prior knowledge about the compact set C ⊂ D, the number of template functions that carry no information relevant to the problem can be high. By reducing this overhead, adaptive templates improve computation times and accuracy in some specific problems. The relationship between template systems and adaptive template systems is demonstrated in Figure 4, which shows how adaptive template systems depend on the density of the data. To construct them, given a compact set C ⊂ D we consider the set S = ∪_{D∈C} D ⊂ W, along with algorithms such as Gaussian mixture models (GMM) (Reynolds, 2009), Hierarchical density-based spatial clustering of applications with noise (HDBSCAN) (Campello et al., 2013), and Cover-Tree Entropy Reduction (CDER) (Smith et al., 2017), to define a family of ellipsoidal domains {z ∈ R² : (z − x)^T A(z − x) ≤ 1} in W fitting the density distribution of S. Here A is a 2 × 2 symmetric matrix and x ∈ R².
Once this family of ellipsoidal domains is computed, we use it to define adaptive template functions supported on the corresponding ellipsoids.
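One plausible concrete form of such a function (our reconstruction for illustration: a bump equal to 1 at the center x and vanishing outside the fitted ellipsoid) is max(0, 1 − (z − x)^T A (z − x)):

```python
# Illustrative adaptive template: given a symmetric positive definite
# 2x2 matrix A and a center x, evaluate max(0, 1 - (z - x)^T A (z - x)),
# which is 1 at the center and 0 on and outside the ellipsoid boundary.
def ellipsoid_template(A, x, z):
    dx, dy = z[0] - x[0], z[1] - x[1]
    q = A[0][0]*dx*dx + (A[0][1] + A[1][0])*dx*dy + A[1][1]*dy*dy
    return max(0.0, 1.0 - q)

A = [[4.0, 0.0], [0.0, 1.0]]     # axis-aligned ellipse, radii 1/2 and 1
center = (1.0, 1.0)
```

As with tent features, summing this function over the (birth-lifetime) points of a diagram yields one feature per fitted ellipsoid.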

Other Approaches
The featurization methods presented in this section are by no means an exhaustive list of what is available in the literature. Here are some others that the interested reader may find useful: • The persistent homology rank functions of Robins and Turner (2016) are similar in spirit to persistence landscapes, in that they provide an injective inclusion of D_0 into a Hilbert space of functions where techniques like functional Principal Component Analysis are available. Indeed, for a filtration X, its n-th persistent rank function sends each pair (a, b), a ≤ b, to the rank of the map H_n(X_a; F) → H_n(X_b; F). This is equivalent, for a persistence diagram D ∈ D_0, to defining the function β_D(a, b) = #{(x, y) ∈ D : x ≤ a and b < y}, where # is multiset cardinality. The Hilbert space in question is the weighted L²-space L²(W, ϕ), where ϕ : [0, ∞) → [0, ∞) satisfies ∫_0^∞ ϕ(t) dt < ∞, and the inner product of rank functions is 〈f, g〉_ϕ = ∬_W f(x, y) g(x, y) ϕ(y − x) dx dy. This approach has been shown to be effective in analyzing point processes and sphere packing patterns.
• The persistence curve (Chung et al., 2018; Giusti et al., 2015) provides another functional summary closely related to persistent rank functions. In its simplest form (the Betti curve), the persistence curve of a diagram D ∈ D_0 is the function β_D(t) = #{(b, d) ∈ D : b ≤ t < d}, counting the features alive at time t; the framework of Chung et al. (2018) allows other summary statistics of the alive points as well. Discretizations of these curves have been useful in computer vision tasks (Chung and Lawson, 2020), as well as in neuroscience applications (Giusti et al., 2015).
• Other kernel methods, besides the Multi-Scale Kernel of Reininghaus et al. (2015), have appeared in the literature; they correspond to other choices of kernel function k : D_0 × D_0 → R. For instance, the Persistence Weighted Gaussian Kernel of Kusano et al. (2016) is defined as k(D, D′) = Σ_{x∈D} Σ_{y∈D′} ω(x) ω(y) exp(−‖x − y‖²/(2σ²)), where ω : R² → R is a weight function emphasizing points away from the diagonal.
• Finally, neural-network-based vectorizations send a diagram D to op{ω(x)ϕ(x) : x ∈ D}, where op(·) is a permutation invariant operator (e.g., max, min, sum, etc.), ω : R² → R is a weight function, and ϕ : R² → R^q is a representation function. By optimizing ω and ϕ in a parametric family, i.e., ω = ω_u and ϕ = ϕ_v, the training of the network can lead to vectorizations attuned to specific learning tasks.
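The first two summaries above can be sketched directly; the boundary conventions (born no later than a, still alive strictly after b or at t) are ours.

```python
# Sketches of two functional summaries of a birth-death diagram:
# the rank function counts features born by a and still alive past b,
# and the Betti curve counts the features alive at time t.
def rank_function(diagram, a, b):
    # rank of H_n(X_a) -> H_n(X_b) for a <= b
    return sum(1 for x, y in diagram if x <= a and y > b)

def betti_curve(diagram, ts):
    return [sum(1 for b, d in diagram if b <= t < d) for t in ts]

dgm = [(0.0, 2.0), (1.0, 3.0)]
curve = betti_curve(dgm, [0.0, 0.5, 1.5, 2.5, 3.5])
```

Discretizing either function on a grid yields a fixed-length vector ready for standard learning pipelines.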

DATASETS
The five different datasets considered in this work were chosen from a collection of experiments presented in the literature of topological methods for machine learning. We acknowledge that this selection is inherently biased towards datasets on which specific topological methods perform favorably. Nevertheless, we counterbalance this by applying every evaluated featurization method to every data set considered here and comparing classification results across all the presented methodologies. This comparative work showcases how the variation between methods leaves the user needing to find a suitable combination of featurization method and parameter tuning to obtain optimal results on a given dataset. As such, readers should view this as a resource for their own analysis, and not as a recommendation for specific techniques. For all datasets and methods, parameter tuning was done using a grid search on a subset of data that was not used to report final results, and parameters were chosen based on the performance of ridge regression, random forest, and support-vector machine (SVM) models. A weakness of the analysis worth noting is that the same parameters were used in the feature set calculation for all reported models, and run with a single split; this was due to the time required for feature calculation.
The ridge regression and random forest classifiers were run with default parameters, and the support-vector machine was run using the radial basis function (RBF) kernel with some tuning of the cost parameter (C). The exception is the Multi-Scale Kernel feature set, for which we only fit a support-vector machine model. It is important to highlight that results for ridge regression with (polynomial and radial basis function) kernel methods are not included in this work, as they increase computational times while the classification results do not improve significantly compared to the ones presented here. Each dataset was sampled for 10 or 100 trials depending on size, with the exception of the Protein Classification dataset, which included indices for predefined problems.
Random forest classifiers as presented in Breiman (2001) are used to solve the same classification problems presented for each data set. Parameters such as number of trees in each forest and the size of each tree are chosen based on performance and tuned on the testing set.

MNIST
The MNIST dataset from LeCun and Cortes (1999) is a database of 70,000 handwritten single digits of numbers zero to nine. An example image from the MNIST database is shown in Figure 7.
The calculation of persistence diagrams for the MNIST dataset is as in Adcock et al. (2016). This method creates a base of eight different persistence diagrams to use in the application of methods. The persistence diagrams are calculated using a "sweeping" motion in one of four directions (left to right, right to left, top to bottom, or bottom to top), computing the 0-dimensional and 1-dimensional persistence diagrams for each direction. To compute this filtration, pixels are converted to a binary response, with intensity calculated based on position. This has the effect that, depending on the direction of the sweep, features will have different birth and death times, providing distinct information for each direction. The number of topological features available for model fitting is dependent on the method: for the Persistence Images, Persistence Landscapes, and Template Systems there are eight features each, the Multi-Scale Kernel produces eight different kernel matrices, and for the Adcock-Carlsson Coordinates, 32 different features were computed from these persistence diagrams. Figure 5 shows the various calculations of persistence diagrams for an example number eight. Both 0-dimensional and 1-dimensional persistence diagrams were used for the MNIST dataset; some observations had no 1-dimensional features, and their diagrams were filled with a single point with birth-death coordinates (0, 0.01).
For the MNIST dataset, a random sample of 1,000 images was used to tune parameters, with 80% used for the training portion, and 20% used for the testing portion. We used the set of 60,000 images corresponding to the training set of MNIST to create our own training and testing sets for model fitting and evaluation. For this set of 60,000, 10 trials were run with an 80% training and 20% testing split to determine model performance.

SHREC14
We evaluated the SHREC 2014 dataset (Pickup et al., 2014) in the same manner as the authors of Polanco and Perea (2019a). To compute the topological features, the authors of Reininghaus et al. (2015) describe using a heat kernel signature to compute both the 0-dimensional and 1-dimensional persistence diagrams. The dataset consists of 15 different labels, corresponding to five different poses for each of the three classes of male, female, and child figures.
As noted in Polanco and Perea (2019a), different parameter choices in the heat kernel signature calculation define different classification problems on this dataset; for this evaluation we focused on the problem with the highest accuracy as reported in Polanco and Perea (2019a).
For the SHREC14 dataset, a random sample of 90 images (30% of the data) was used to tune the model and determine appropriate model parameters. The remaining 210 observations were split into 80% training and 20% testing for 100 trials to report the final model fit. Persistence diagrams for 0-dimensional and 1-dimensional homology were computed for this dataset. Table 1 shows complete results for the SHREC 2014 dataset.

Protein Classification
We use the Protein Classification Benchmark dataset PCB00019 (Sonego et al., 2007) as another type of data to evaluate the topological methods above. This set contains information for 1,357 proteins across 55 classification problems; we report on 54 of the problems, using one to tune parameters. Training and testing indices are provided with the dataset, and mean accuracy was reported for both training and testing sets using these indices. Table 2 shows results from our experiments using the provided indices. Persistence diagrams for this dataset were computed for each protein by considering its 3-D structure [provided in wwPDB consortium (2018)] as a point cloud in R^3, built from the x, y, and z position of each atom in the molecule. With this information, the 0-dimensional and 1-dimensional persistent homology is computed using Ripser (Tralie et al., 2018).
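For intuition on the 0-dimensional computation, the H0 persistence of a Vietoris-Rips filtration on a point cloud coincides with the edge lengths of a Euclidean minimum spanning tree. The paper uses Ripser for the actual computation; the SciPy-based sketch below (our own illustration, not the pipeline code) recovers the same 0-dimensional death times:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def h0_diagram(points):
    """0-dimensional Vietoris-Rips persistence of a point cloud: every
    component is born at 0, and the death times are exactly the edge
    lengths of a Euclidean minimum spanning tree (the one essential
    class, which never dies, is omitted here)."""
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist).toarray()
    deaths = np.sort(mst[mst > 0])
    return np.column_stack([np.zeros_like(deaths), deaths])
```

For the 1-dimensional diagrams there is no such shortcut, which is why a dedicated library such as Ripser is used.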

MPEG7
The MPEG7 dataset from Bai et al. (2009) is a database of black-and-white object shapes, with 1,400 shapes in 70 classes. An example from the original dataset is shown in Figure 4, along with the contour as described below.
To compute persistence diagrams, the image contour is first extracted and its points ordered into a sequence. A distance curve is then computed as the distance from each contour point to the center of the shape. Sublevel set persistence is computed on the resulting distance curves. Persistence diagrams for both 0-dimensional and 1-dimensional homology were computed for this dataset.
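These two steps can be sketched as follows, assuming the contour is already available as an ordered array of 2-D points (the contour-extraction step itself is omitted); the 0-dimensional sublevel persistence is computed with a standard union-find, elder-rule pairing:

```python
import numpy as np

def distance_curve(contour):
    """Distance from each ordered contour point to the shape's centroid."""
    return np.linalg.norm(contour - contour.mean(axis=0), axis=1)

def sublevel_h0(f):
    """0-dimensional sublevel-set persistence of a sampled 1-D function,
    pairing local minima with merge values by the elder rule."""
    n = len(f)
    parent = np.full(n, -1)          # -1: index not yet entered
    birth = np.full(n, np.inf)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    pairs = []
    for i in np.argsort(f, kind="stable"):  # add samples by increasing value
        parent[i] = i
        birth[i] = f[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                if birth[ri] > birth[rj]:   # elder rule: younger dies
                    ri, rj = rj, ri
                if f[i] > birth[rj]:        # skip zero-persistence pairs
                    pairs.append((birth[rj], f[i]))
                parent[rj] = ri
    pairs.append((float(np.min(f)), np.inf))  # essential class
    return pairs
```
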
We used this dataset for a timing comparison of the featurization methods and do not report model results for it. An example notebook for MPEG7 is provided using only four shapes: apple, children, dog, and bone. This restriction is due to the initial difficulty in getting accurate models for the full dataset. Because of the small number of samples (80 total) and the lack of repeated sampling, accuracy estimates for this dataset are not stable and are not reported.

HAM10000
The HAM10000 dataset provided by Tschandl et al. (2018) is a collection of 10,000 images of skin lesions, each with one of seven potential classifications: actinic keratoses and intraepithelial carcinoma, basal cell carcinoma, benign keratosis, dermatofibroma, melanocytic nevi, melanoma, and vascular skin lesions. A total of 18 persistence diagrams per image were calculated using the methods outlined in Chung et al. (2018): 9 corresponding to 0-dimensional homology and 9 to 1-dimensional homology.
To obtain such diagrams, first a mask is computed by implementing the methodology proposed in Chung et al. (2018). In general terms, this method creates a filtration of binary images obtained from different thresholds to convert the gray scale image into a binary one. Once this binary filtration is obtained, the center most region of the image is computed using the "persistence" of each point in the binary filtration. An example image and this process is shown in Figure 9.
Once the mask is computed, it is applied to the original image, which is then converted into three different color models: RGB, HSV, and XYZ. Each color model is split into its channels, and for each channel we use a sublevel set filtration to obtain 0-dimensional and 1-dimensional persistence diagrams. In total, for each image in the dataset we obtain 18 persistence diagrams: 9 in homological dimension 0 and 9 in homological dimension 1.
To tune the models, a random sample of 250 images was taken, and ridge regression, random forest, and support vector machine models were fit to determine parameters. The remaining 9,750 images were split into an 80% training and 20% testing set to report final results.
To evaluate the HAM10000 dataset, subsampling of persistent features was required due to the large number of birth-death pairs in each persistence diagram. Each observation in a dataset, for example an image, yields 18 persistence diagrams corresponding to homological features of that observation. In the HAM10000 dataset, each persistence diagram contained an average of 5,783 birth-death pairs. This made the vectorization methods, even adaptive templates, impractical to compute, so each persistence diagram was subsampled as follows.
The subsampling method has two steps: highly persistent features are always included, and a uniform random sample (without replacement) is drawn from the remaining points. The lifetime threshold and the number of points to sample were chosen to preserve the distribution of points in each persistence diagram. As a result, features with a lifetime of five or more were automatically included, along with 5% of the remaining points. This resulted in sampled persistence diagrams with an average of 290 points each (Table 3).
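A minimal sketch of this two-step subsampling (the function name and defaults are ours, mirroring the thresholds described above):

```python
import numpy as np

def subsample_diagram(dgm, life_thresh=5.0, frac=0.05, seed=0):
    """Keep all points with lifetime >= life_thresh; uniformly sample
    (without replacement) a fraction of the remaining points."""
    rng = np.random.default_rng(seed)
    life = dgm[:, 1] - dgm[:, 0]
    keep = dgm[life >= life_thresh]          # always retained
    rest = dgm[life < life_thresh]
    k = int(round(frac * len(rest)))
    idx = rng.choice(len(rest), size=k, replace=False)
    return np.vstack([keep, rest[idx]])
```
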

Available Functions
As part of the available code, a function for each method is included. Each function requires two sets of persistence diagrams, a training set and a testing set, and parameters specific to the function. The function returns two feature sets for that method, corresponding to the training and test set respectively. Each function also prints the time in seconds taken at the end of each run. In this section of the user guide each function is described, along with the required parameters for the function.
The Multi-Scale Kernel feature matrix can be computed using the function kernel_features or fast_kernel_features; we recommend fast_kernel_features due to computation time. Both functions require a parameter sigma, denoted s in the function, with a default value of 4. In Reininghaus et al. (2015) this parameter is referred to as the scale parameter. From the closed form of the Multi-Scale Kernel we note that as sigma, σ, increases, the kernel value decreases. Increasing sigma results in a less diffuse kernel matrix, while decreasing sigma results in a more diffuse kernel matrix. Due to the time required for the Multi-Scale Kernel, there are two additional sets of functions that use Numba (Siu Kwan Lam and Seibert, 2015) for significantly faster computation. In the current implementation, these cannot be combined with multi-core processing (MPI, for example) and have a different format than the other functions included. These functions are provided in the GitHub repository for this project and were used to compute the Multi-Scale Kernel results for the MNIST dataset.
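For reference, the kernel can be sketched directly from the closed form in Reininghaus et al. (2015): each diagram point contributes a Gaussian of scale σ together with a negatively signed mirror image reflected across the diagonal. A small NumPy illustration (our own, not the repository implementation):

```python
import numpy as np

def multiscale_kernel(F, G, s=4.0):
    """Persistence scale space kernel (Reininghaus et al., 2015): sum of
    Gaussians between diagram points, minus Gaussians to the diagonal
    mirror images, so the heat flow vanishes on the diagonal."""
    F, G = np.asarray(F, float), np.asarray(G, float)
    Gbar = G[:, ::-1]                                   # (b, d) -> (d, b)
    d1 = ((F[:, None] - G[None]) ** 2).sum(-1)          # pairwise sq. dists
    d2 = ((F[:, None] - Gbar[None]) ** 2).sum(-1)       # dists to mirrors
    return (np.exp(-d1 / (8 * s)) - np.exp(-d2 / (8 * s))).sum() / (8 * np.pi * s)
```

The value is symmetric in its two arguments, and, consistent with the remark above, decreases as s grows.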
The Persistence Landscapes features can be computed using the function landscape_features, which requires two parameters: the number of landscapes, n, and the resolution, r. The parameter n controls the number of landscapes used, and r controls the number of samples used to compute the landscapes. The defaults are n = 5 and r = 100.
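Conceptually, the k-th landscape evaluated at a grid point t is the k-th largest value of the tent functions min(t − b, d − t)+ over the diagram's points. A compact NumPy sketch of this sampling (our own illustration; the fixed grid range is an assumption, and the repository uses Gudhi's implementation):

```python
import numpy as np

def landscape_features(dgm, n=5, r=100, t_range=(0.0, 1.0)):
    """First n persistence landscapes sampled on an r-point grid and
    concatenated into a single feature vector."""
    ts = np.linspace(t_range[0], t_range[1], r)
    b, d = dgm[:, :1], dgm[:, 1:]
    tents = np.clip(np.minimum(ts - b, d - ts), 0, None)   # (points, r)
    if len(tents) < n:                                     # pad with zeros
        tents = np.vstack([tents, np.zeros((n - len(tents), r))])
    tents = -np.sort(-tents, axis=0)    # sort each column descending
    return tents[:n].ravel()            # lambda_1, ..., lambda_n concatenated
```
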
The Persistence Images can be computed using the function persistence_image_features, which requires two parameters, pixels and spread. The pixels parameter, p, is a pair that defines the number of pixels along each axis. The spread, s, is the standard deviation of the Gaussian kernel used to generate the persistence image. The implementation here uses the Gaussian kernel; however, other distributions could be chosen, in which case s would correspond to parameters of the chosen distribution. Additionally, the weighting function is constant in this implementation. Increasing the spread increases the variance of each distribution, resulting in larger "hot spots". Increasing pixels provides a smoother image, whereas decreasing pixels yields a coarser one; note that increasing pixels also increases computation time. This is demonstrated in Figure 7 in the methods section.
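A minimal sketch of a persistence image with a Gaussian kernel and constant weighting, as described above (the grid ranges are illustrative assumptions; the repository uses the implementation from Nathaniel Saul (2019)):

```python
import numpy as np

def persistence_image(dgm, pixels=(20, 20), spread=1.0,
                      birth_range=(0., 10.), pers_range=(0., 10.)):
    """Persistence image with a Gaussian kernel and constant weighting,
    sampled on a pixels[0] x pixels[1] grid over (birth, lifetime)."""
    B, P = np.meshgrid(np.linspace(*birth_range, pixels[0]),
                       np.linspace(*pers_range, pixels[1]), indexing="ij")
    img = np.zeros(pixels)
    for b, d in dgm:   # one Gaussian per diagram point, in (b, d - b) coords
        img += np.exp(-((B - b) ** 2 + (P - (d - b)) ** 2) / (2 * spread ** 2))
    return img / (2 * np.pi * spread ** 2)
```
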
The Adcock-Carlsson Coordinates features can be computed using the function carlsson_coordinates, which does not require any parameters. This function returns four features for every type of persistence diagram provided, so for datasets with persistence diagrams for 0-dimensional and 1-dimensional homology, eight features are returned for machine learning. The features returned correspond to the four coordinates calculated in Adcock et al. (2016).
The Template Systems features can be computed using the function tent_features, which has two parameters: d, the number of bins along each axis, and padding, which controls the padding around the collection of persistence diagrams. This function returns a training and a testing set, computing the tent features from Perea et al. (2019).
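A sketch of the tent-function idea from Perea et al. (2019): place a grid of tent functions over (birth, lifetime) coordinates and sum each tent over the diagram's points. The grid box and scaling below are illustrative assumptions, not the repository's exact conventions:

```python
import numpy as np

def tent_features(dgm, d=5, padding=0.05, box=(0.0, 1.0, 0.0, 1.0)):
    """(d+1) x (d+1) grid of tent functions over (birth, lifetime);
    each feature sums one tent over the diagram's points."""
    b = dgm[:, 0]
    l = dgm[:, 1] - dgm[:, 0]          # lifetime coordinate
    x0, x1, y0, y1 = box
    xs = np.linspace(x0 - padding, x1 + padding, d + 1)
    ys = np.linspace(y0 - padding, y1 + padding, d + 1)
    delta = xs[1] - xs[0]              # tent half-width = grid spacing
    feats = []
    for a in xs:
        for c in ys:
            t = np.maximum(0, 1 - np.maximum(np.abs(b - a),
                                             np.abs(l - c)) / delta)
            feats.append(t.sum())
    return np.array(feats)
```

Increasing d refines the grid, which is the source of the computation-time growth discussed later for the Tent Functions.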
The Adaptive Template Systems features can be computed with the function adaptive_features, which requires the labels for the training set. Users can choose three different types of adaptive templates: Gaussian Mixture Modeling (GMM), Cover-Tree Entropy Reduction (CDER), and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). The parameter d refers to the number of components when using the GMM model type; this should be at least the number of classes in the data, and ideally closer to the number of distributions underlying the observations. Details on these methods can be found in Polanco and Perea (2019a), as well as the original references linked in the methods section. During this evaluation, we evaluated adaptive templates using both the GMM and CDER methods, but did not formally evaluate HDBSCAN, as the algorithm failed to complete for some datasets. For those datasets we were able to complete, we did not notice an improvement over the other adaptive methods.
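The GMM variant can be sketched with scikit-learn: fit d Gaussians to the pooled training points in (birth, lifetime) coordinates, then featurize each diagram by summing each component's density over its points. This is our own simplified illustration of the idea in Polanco and Perea (2019a), not their implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def gmm_template_features(train_dgms, d=2, seed=0):
    """Fit d Gaussian templates to all training diagram points; return
    a featurizer mapping a diagram to a length-d feature vector."""
    to_bl = lambda D: np.column_stack([D[:, 0], D[:, 1] - D[:, 0]])
    gmm = GaussianMixture(n_components=d, random_state=seed).fit(
        np.vstack([to_bl(D) for D in train_dgms]))

    def featurize(D):
        pts = to_bl(D)
        # sum each fitted component's density over the diagram's points
        return np.array([multivariate_normal(gmm.means_[k],
                                             gmm.covariances_[k]).pdf(pts).sum()
                         for k in range(d)])
    return featurize
```

In this sketch the templates adapt to where diagram points actually concentrate, which is the key difference from the fixed grid used by tent functions.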

RESULTS
One consideration we must make before analyzing the results comes from the computation of Multi-Scale Kernel features. As explained for each dataset in Section 4, we typically compute multiple persistence diagrams per data point in a given dataset. These persistence diagrams correspond to 0-dimensional, 1-dimensional, and in some cases 2-dimensional persistent homology (see details in Section 2). To compute the Multi-Scale Kernel as given by Eq. 6 we require pairs of persistence diagrams. Since this kernel provides a notion of similarity between persistence diagrams (Reininghaus et al., 2015), we require it to be computed between diagrams of the same homological dimension and method type. For example, the kernel matrix corresponding to the 0-dimensional homology of a dataset is computed by evaluating the persistence scale space kernel between two sets of persistence diagrams that each represent 0-dimensional homology. This means that for a dataset with both 0-dimensional and 1-dimensional homology persistence diagrams, two kernel matrices were returned (one per dimension).
The kernel matrix used in our models is the sum of the available kernels, which differ based on the persistence diagrams available for each dataset. While this significantly improves accuracy over individual kernel matrices, other methods of combining kernels were not explored in this paper; the interested reader may consult Gönen and Alpaydin (2011). The parameter sigma is held consistent across all types of diagrams in our evaluation.
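The unweighted combination can be sketched as follows, with synthetic stand-ins for the per-dimension kernels (in our experiments the summands are Multi-Scale Kernel matrices, one per homological dimension; here they are linear kernels on random features purely for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# hypothetical stand-ins for features derived from 0- and 1-dim diagrams
X0 = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(4, 1, (20, 5))])
X1 = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(4, 1, (20, 5))])
y = np.repeat([0, 1], 20)

# unweighted sum of the two kernel matrices, then a precomputed-kernel SVM
K = X0 @ X0.T + X1 @ X1.T
clf = SVC(kernel="precomputed").fit(K, y)
```

A sum of valid kernels is again a valid kernel, which is what makes this simple combination admissible for kernel machines.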
For each of the other methods (Persistence Landscapes, Persistence Images, Adcock-Carlsson Coordinates, Template Systems, and Adaptive Template Systems), a feature matrix was constructed for each relevant set of diagrams, and all topological types were used in fitting the same model.
The datasets used in this analysis were of varying size, both in terms of observations and the size of sets of persistence diagrams. As noted in the descriptions of data, the types of persistence diagrams calculated also differs. A summary of characteristics for each dataset is included in Table 3.

MNIST
The Multi-Scale Kernel features yielded eight different kernel matrices, and the final kernel matrix was calculated as the unweighted sum of these kernels, as in Gönen and Alpaydin (2011). Due to the time needed to compute the Multi-Scale Kernel, a smaller set of 12,000 observations was used to report final results, and a version of the kernel computation using Numba with a GPU target was necessary.

SHREC14
Results are reported in Table 1. Adaptive Template Systems and Persistence Landscapes were the two methods with highest classification accuracy on the test dataset, with Template Systems and the Multi-Scale Kernel performing nearly as well.

Protein Classification
Nearly all of the topological methods in this paper provided similar classification accuracy for this dataset. We observe that the testing accuracy is higher than the training accuracy for this dataset, and the results are similar to those in Polanco and Perea (2019b). The Multi-Scale Kernel, however, did not perform as well and, as shown in Figure 10, is the most computationally intensive. Results are reported in Table 2.

HAM10000
Due to run time for the large number of points in each persistence diagram, even after subsampling, results were not reported for the Multi-Scale Kernel or Adaptive Template Systems.
Results are listed in Table 5. The HAM10000 dataset presented the largest computational challenge during this review, and is a continued area of research.

COMPUTATION TIME OF FEATURES
Formal timings were captured for all features for the 0-dimensional persistence diagrams of the MPEG7 and Protein datasets; a comparison is shown in Figure 10. The timing reported is for the generation of features from one type of persistence diagram for a dataset of that size. When computing training and testing feature sets for multiple types of persistence diagrams, the expected time to generate features can therefore be significantly longer. For example, in the MNIST dataset we compute four different types of persistence diagrams with both 0-dimensional and 1-dimensional homology, giving eight sets of features that can be generated for that dataset. Specific to the Multi-Scale Kernel method, the timing reported is for a symmetric feature matrix that is n × n, where n is the number of observations in the dataset; as a result, the training feature set requires less computation time than a testing feature set of comparable size.
Additionally, during the review of these methods, we did not encounter significant issues with model fitting, hence formal timings were not recorded for this portion of the analysis. Conclusions from these timings are addressed in the discussion section.

Data Availability
The datasets, persistence diagrams (or code to compute the diagrams), and all other associated code for this study can be found in the machine learning methods for persistence diagrams github repository https://github.com/barnesd8/machine_learning_for_persistence.
For each of the five datasets, the following code is available:
• A Jupyter notebook that loads and formats the persistence diagrams, including images, and does a preliminary model fitting on a subset of the data
• A Python script that calculates the persistence diagrams from the original dataset (some of these are written using MPI, depending on the size of the dataset)
• A Python script that fits models for random samples of the data to get mean estimates of accuracy for both the training and test dataset
These scripts reference modules included in the GitHub repository, including a persistence methods script that calculates features from persistence diagrams for a training and test dataset. This uses a combination of other available functions and functions written specifically for this paper.
The Template Systems and Adaptive Template Systems methods use functions from https://github.com/lucho8908/ adaptive_template_systems, which is the corresponding code repository for Polanco and Perea (2019a). The available methods in our extension include Tent Functions and Adaptive Template Functions (GMM, CDER, and HDBScan methods).
The Adcock-Carlsson Coordinates method is a function developed specifically for this paper, and includes the calculation of the four different features used in our analysis. The Persistence Landscapes method uses the persistence landscape calculation from the Gudhi package (Dlotko, 2021). The Multi-Scale Kernel method has two included implementations: one from Nathaniel Saul (2019), which is slower to compute, and a faster implementation that can be used on larger datasets. All of the results in this paper were reported using the implementation written specifically for this paper. The Persistence Images features are also from Nathaniel Saul (2019). Additionally, many functions from Pedregosa et al. (2011) are used throughout.
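For concreteness, here is a sketch of one common choice of the four Adcock-Carlsson coordinates, following Adcock et al. (2016); treat the exact polynomial forms below as an illustrative assumption rather than the repository's verbatim code:

```python
import numpy as np

def carlsson_coordinates(dgm):
    """Four polynomial features of a diagram in the style of Adcock
    et al. (2016): b = births, d = deaths, l = lifetimes."""
    b, d = dgm[:, 0], dgm[:, 1]
    l = d - b
    dmax = d.max()
    return np.array([
        np.sum(b * l),                    # total birth-weighted lifetime
        np.sum((dmax - d) * l),           # weight toward early deaths
        np.sum(b ** 2 * l ** 4),          # higher-order variant
        np.sum((dmax - d) ** 2 * l ** 4), # higher-order variant
    ])
```
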

DISCUSSION
Adcock-Carlsson Coordinates, Tent Functions, and Persistence Landscapes scale well, and perform well even for large datasets. Note, however, that parameter choice affects computation time. This was especially notable for the Template Features (Tent Functions): as the number of tent functions increases, the time to compute features also increases. We observed a superlinear increase; even so, computation time was not a barrier to analysis.
Persistence Images and Adaptive Template Functions do not scale as well; however, they provide good featurizations for accurate models and should be considered depending on the dataset. In particular, the Adaptive Template Functions computation did not complete for the full HAM10000 dataset due to computation time.
When using these methods, note that the Multi-Scale Kernel is computationally intensive and does not scale well. Additionally, its accuracy is not better than that of the other methods for the datasets in this paper.

CONCLUSION
This paper reviews six methods for topological feature extraction for use in machine-learning algorithms. Persistence Landscapes, Adcock-Carlsson Coordinates, Template Systems, and Adaptive Template Systems perform consistently with minimal differences between datasets and types of persistence diagrams. These methods are also less expensive in terms of execution time. A main contribution of this paper is the availability of datasets, persistence diagrams, and code for others to use and contribute to the research community.

AUTHOR CONTRIBUTIONS
JP coordinated the experimental framework and co-wrote the paper with significant contributions to the background sections. LP curated datasets, contributed to the software code and co-wrote the paper with significant contributions to the methods sections. DB curated datasets, ran the experimental results, developed the accompanying software and co-wrote the paper with significant contributions in most areas. All authors reviewed results and the manuscript, with DB approving and reviewing the final version.

FUNDING
This work was partially supported by the National Science Foundation through grants CCF-2006661, and CAREER award DMS-1943758.

ACKNOWLEDGMENTS
This work was supported in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research. Special thanks to Elizabeth Munch for providing background information for computing persistence diagrams using a directional transform. We also thank the reviewers for critical feedback that helped improve our review.