Applications of Topological Data Analysis in Oncology

The emergence of the information age in the last few decades brought with it an explosion of biomedical data. But with great power comes great responsibility: there is now a pressing need for new data analysis algorithms to be developed to make sense of the data and transform this information into knowledge which can be directly translated into the clinic. Topological data analysis (TDA) provides a promising path forward: using tools from the mathematical field of algebraic topology, TDA provides a framework to extract insights into the often high-dimensional, incomplete, and noisy nature of biomedical data. Nowhere is this more evident than in the field of oncology, where patient-specific data is routinely presented to clinicians in a variety of forms, from imaging to single cell genomic sequencing. In this review, we focus on applications involving persistent homology, one of the main tools of TDA. We describe some recent successes of TDA in oncology, specifically in predicting treatment responses and prognosis, tumor segmentation and computer-aided diagnosis, disease classification, and cellular architecture determination. We also provide suggestions on avenues for future research including utilizing TDA to analyze cancer time-series data such as gene expression changes during pathogenesis, investigation of the relation between angiogenic vessel structure and treatment efficacy from imaging data, and experimental confirmation that geometric and topological connectivity implies functional connectivity in the context of cancer.


INTRODUCTION
With the advent of next-generation high-throughput sequencing (Roychowdhury et al., 2011;Reuter et al., 2015), improved medical imaging (Wang, 2016;Tahmassebi et al., 2018;Aiello et al., 2019), and an increased focus on personalized medicine (Dilsizian and Siegel, 2014;Gu and Taylor, 2014;Alyass et al., 2015;Suwinski et al., 2019), more data is being collected than ever before. Efficient data analysis techniques are critically needed to convert this data into meaningful, clinically translatable information. Topological data analysis (TDA) focuses on the shape of data, identifying both local and global structures at multiple scales. Consider a trivial example: suppose data points lie on a circle. The data points could represent customers' preferences or patient gene expression. In this case if a product or drug were targeted to the average person, the target would be the center of the circle and would thus miss the data set entirely. While this is a simple made-up example, it illustrates the importance of understanding the shape of data. TDA can be applied to high-dimensional and noisy data. While the output of TDA can be affected by incomplete data, it is still effective at distinguishing between data sets that have different shapes.
TDA has been successfully applied in a variety of medical contexts including to discover phenotype-biomarker associations in traumatic brain injury (Nielson et al., 2017), identify diagnostic factors for pulmonary embolism (Rucco et al., 2015), discriminate between healthy patients and those with diabetic retinopathy from retinal imaging (Garside et al., 2019), map human recombination at fine scales (Camara et al., 2016), identify novel pathological phenotypes of asthma (Siddiqui et al., 2018), and characterize the structure of chromatin conformation inside the nucleus . In this review, we shall focus our attention on some recent applications of persistent homology, a main tool of TDA, to oncology. We specifically discuss treatment responses, clinical outcomes, disease classification, biomarker identification, and cellular architecture in cancer. We will also provide insights into possible future fruitful avenues of research, including analysis of time-series data to help with disease classification and identification of selection events, investigation of the relation between angiogenic vessel structure and treatment efficacy from imaging data, and experimental confirmation that geometric and topological connectivity implies functional connectivity in the context of cancer. Though we focus on persistent homology here, it is worth noting that there have been many notable successes of the application of other TDA methods, such as the Mapper algorithm (Singh et al., 2007). For example, Mapper was recently used to extract information from high-throughput microarray data and define a new subtype of breast cancer, c-MYB+, characterized by high c-MYB expression and low levels of innate inflammatory genes, with corresponding patients exhibiting 100% survival and no metastasis (Nicolau et al., 2007). In another study, Mapper was used to discover 38 new cancerassociated genes across tumor types, some of which were then confirmed to play a key role in tumorigenesis in mouse models . Before delving into the applications of persistent homology in cancer, we introduce some of the key mathematical underpinnings needed to understand these results.

WHAT IS PERSISTENT HOMOLOGY?
The mathematical definition of homology/homologous is very precise and often differs from the English common usage. Homology uses algebra to detect topological shapes. Topology is sometimes called rubber sheet geometry as two objects are topologically equivalent to each other if one can be deformed into the other without tearing or puncturing the objects. For example, the spherical and cubical surfaces are topologically equivalent per Figure 1A. The sphere is topologically different from the 3-dimensional ball that the sphere bounds. Homology detects this difference by noting that the 2-dimensional spherical surface bounds a void while the 3-dimensional ball is solid and thus does not bound any voids.
To describe homology, we will first focus on two quantities: β 0 = the number of connected components and β 1 = the number of 1-dimensional holes (a circle that has not been filled in). One does not need to understand the algebra of homology in order to understand the basics of persistent homology, thus we will only briefly introduce some concepts for the interested reader. Two points are homologous if they are in the same connected component. Thus, β 0 = 1 if the object is connected. To describe β 1 , we will focus on Figure 1B. We can use addition to represent topological objects. For example, the rectangle in Figure 1 is represented by the sum of edges: e 1 + e 2 + e 3 + e 4 . Two 1dimensional cycles are homologous to each other if they form the boundary of a surface. Thus, the rectangle is homologous to the cycle e 5 + e 8 + e 10 + e 11 since these two cycles bound the green surface. The cycles e 5 +e 6 +e 7 +e 8 and e 9 +e 10 +e 11 +e 12 are also homologous since they bound the light green surface consisting of two crescent moons. In fact all these cycles are homologous to the rectangle e 1 +e 2 +e 3 +e 4 . One can see that this object contains many cycles, many of which are homologous to the rectangle (or a multiple of the rectangle, for example, 12 i=5 e i is homologous to 2 4 i=1 e i ). A 1-dimensional cycle is homologous to 0 if it bounds a surface. Thus the cycles e 5 + e 9 + e 12 + e 8 and e 6 + e 7 + e 11 + e 10 are both homologous to 0 since they each form the boundary of a surface (the two crescent moons, waning or waxing, respectively). Since each of the cycles in this figure are homologous to 0 or to a multiple of the rectangle, its homology is generated by a single cycle (for example, the rectangle) and thus β 1 = 1.
The intuitive definition of homology is that β n equals the number of n−dimensional holes 1 . Per the Figure 1 caption, homology can be used to distinguish the following objects from each other: solid ball, sphere, higher dimensional balls and spheres, solid torus, and torus. Homology cannot distinguish all objects that are topologically different. For example,the 1dimensional circle, the 2-dimensional surface in Figure 1B, and the 3-dimensional solid torus ( Figure 1C) all have the same homology. For more on the mathematical definition of homology (please see Munkres, 1984;Hatcher, 2002;Ghrist, 2014).
We will illustrate with an elementary example how persistent homology can detect shape at multiple scales by noting the birth and death of topological features. Our dataset will consist of 5 points from a circle as shown in Figure 2. To detect the circle, we need to connect these points in some manner. For example, we could connect all points whose distance is less than some fixed ǫ. If one can visualize the data set, then the choice of ǫ may be clear. But more often, there is no obvious choice, so instead we analyze the data at multiple scales using persistent homology. The first box in Figure 2 shows the five data points. At this stage, we have five components, one for each data point (β 0 = 5). These components are represented by the five red lines in the top part of this figure. These five red lines along with the blue segment is called the barcode for the data set. The barcode keeps track of the number of components (red bars) and number of 1-dimensional holes (blue bar) as the threshold for connecting FIGURE 1 | (A) The solid ball and solid cube are topologically equivalent and thus have the same homology. Their surface boundaries also have the same homology since these surfaces are topologically equivalent. The solid ball has one connected component and thus β 0 = 1. The solid ball does not contain any voids, and thus β i = 0 for all i > 0. The sphere, which is the boundary of the ball, has β 0 = 1 since it is connected, and β 2 = 1 since the 2-dimensional sphere bounds a void, while β i = 0 for all other i since there are no lower or higher dimensional voids. For an n + 1-dimensional ball (for example, all points of distance less than or equal to 1 from the origin in R n+1 ), β i = 0 for all i > 0 since it does not contain any voids. The n-dimensional sphere, which is the boundary of the n + 1-dimensional ball, has β i = 1 for i = 0, n and β i = 0 for all other i. Since the n-dimensional sphere contains a void, it is the n-dimensional object that generates β n . (B) A surface with boundary that is topologically equivalent to an annulus. The annulus is a 2-dimensional surface that has the same homology as a 1-dimensional circle. Since this object has one connected component, β 0 = 1. We can use addition to represent a cycle. The cycle e 5 + e 9 + e 12 + e 8 = e 5 + e 8 + e 9 + e 12 is homologous to 0 since it bounds a surface (the light green waning crescent moon). Since all 1-dimensional cycles are either homologous to 0 or to (a multiple of) the rectangle cycle e 1 + e 2 + e 3 + e 4 , β 1 = 1. Since this object lives in the 2-dimensional plane, β i = 0 for all i > 1. (C) The solid torus has β i = 1 for i = 0, 1 and β i = 0 for all other i while its boundary, the torus, has β i = 1 for i = 0, 2, β 1 = 2, and β i = 0 for all other i. The thick blue cycle is a 1-dimensional homology generator for both the solid torus and its boundary. The thiner black cycle is a homologous to 0 in the solid torus as it bounds a meridinal disk, while this black circle is a homology generator in the torus which is not homologous to the blue circle. The torus surface generates the 2-dimensional homology.
FIGURE 2 | A barcode captures topological features in a dataset at multiple scales. The topology of a dataset at a fixed scale is determined by joining pairs of data points with an edge if the distances between the pair of points is less than the fixed scale. If three edges form a triangle, then the triangle is filled in. This process is shown in the seven boxes as the scale for joining vertices increases from box 1 to box 7. The corresponding barcode is shown at the top of the figure. The persistence of a feature over multiple scales determines the length of the bar corresponding to that feature. The number of components (β 0 ) that exist at a particular scale is represented by the number of red bars that exists at the corresponding Rips diameter. The creation of the 1-dimensional cycle in box 6 is represented by the birth of the blue bar. The blue bar dies when this cycle is filled in with triangles (box 7). This figure was created by modifying the output of the R package TDAstat (Wadhwa et al., 2018) and latex code written by Catalina Betancourt. data points increases. We can visualize the increasing threshold (or proximity parameter) by growing balls around each data point and connecting pairs of points as soon as their respective balls intersect. Thus, in the second box, an edge joins the two closest points, reducing the number of connected components by one. Thus, one bar ends (dies), and only 4 bars (β 0 = 4) continue past this threshold. Observe that every time an edge joins two components, a bar dies (and β 0 reduces by one). In the timepoint just before 1.5 (box labeled 5), two edges are added. One connects two components, but the third forms a triangle with two previously created edges. These three edges surround a small hole, but we fill in this hole (shaded in pink) as we only want to detect large holes. We are forming a Rips complex where whenever a triangle is formed, it is immediately filled in and thus triangles do not contribute to β 1 . In the timepoint after 1.5 (box labeled 6), a cycle containing four edges is formed. This is indicated in the barcode by the start (birth) of the blue bar. As more edges are added, eventually this region is divided into two triangles and the blue bar dies at timepoint close to 2 (corresponding to box labeled 7). Note we have one infinitely long bar (top red bar with arrow) since after time 1.5 we have one connected component.
To summarize, this example of a TDA pipeline consists of taking a dataset, creating a sequence of Rips complexes, and outputting a barcode (Edelsbrunner et al., 2002;Carlsson et al., 2005;Zomorodian and Carlsson, 2005). A Rips complex is a generalization of a graph. While in our example we only looked at adding edges and triangles, we can also add higher dimensional simplices. A n-simplex in a Rips complex is a collection of n + 1 points where each pair of points is connected by an edge. Thus an edge is a 1-simplex, a triangle is a 2-simplex, and a tetrahedron is a 3-simplex. In our circle example, when all pairs of the 5 points are connected by edges, we add a 4-simplex even though the data set lives in 2-dimensions. The existence of an n-simplex means that (all pairs of) n + 1 points are close together according to a given threshold. The Rips complex is also called a clique complex, the latter term coming from graph theory where a clique is a graph where every pair of vertices is connected. Thus, our simplices correspond to clique subgraphs. Other names for Rips complex include Vietoris-Rips complex and flag complex. There are other ways to form a simplicial complex from data. For the Rips complex, an n-simplex is formed at threshold r when all pairs of n + 1 points are of distance less than r (so that each pair of points is connected by an edge). This is equivalent to requiring every pair of balls of radius r centered around the n + 1 points to intersect. If we require the intersection of all these balls to be nonempty in order to form an n-simplex, we instead form theČech complex. Thus, to form a 2-simplex (triangle), the Rips complex only requires non-empty pairwise intersection of three balls while theČech complex requires the intersection of all three balls to be nonempty. Thus, theČech complex is similar to the Rips complex, but an n-simplex is formed at a slightly larger threshold in theČech complex. Under certain conditions, theČech complex is guaranteed to have the same homology as the union of all balls of radius r centered around data points (Hatcher, 2002). But the Rips complex has much smaller computer memory requirements as only the edges need to be stored to determine the Rips complex, and thus the Rips complex is normally used when calculating persistent homology. A very different TDA technique called Mapper uses a completely different method to create a simplicial complex from data (Singh et al., 2007). For Mapper, each vertex represents a cluster of data points. If n+1 of these clusters have a common intersection, then an n-simplex is formed. Mapper can be used to reduce the size of a data set and to visualize it.
The example in Figure 2 focused on β 0 and β 1 . For data that lives in a higher dimensional space, we can similarly calculate β n = the number of n-dimensional holes. For example, β 2 = 1 for both the sphere and torus as these are 2-dimensional surfaces that bound voids in space. For more details regarding persistent homology and barcodes (please see Ghrist, 2008;Carlsson, 2009;Edelsbrunner and Harer, 2010;Otter et al., 2017).
In order to use persistent homology in machine learning, we need a distance between barcodes. We first convert barcodes to persistence diagrams as described in the next section and use these diagrams to define a distance between barcodes. In this section, we show how persistent homology is stable with respect to noise: small perturbations in the data have only a small effect on the barcode (Cohen-Steiner et al., 2007). In section 2.2, we discuss the advantages/disadvantages of persistent homology with regard to how it handles noise, incomplete data, and computational complexity. In section 2.3, we discuss one method (persistent images) of converting a persistence diagram into a vector that can be used in machine learning. We also give references to many other methods for using persistent homology in machine learning.
While we have discussed the basic method for converting Euclidean data into barcodes, there are a number of other methods for obtaining barcodes from data. All one needs is a method to determine when to add an edge between pairs of data points. Thus, the data do not need to live in Euclidean space. We also assumed that small holes correspond to noise, but there are applications where the point of using persistent homology is to detect small holes (Bendich et al., 2016). We also had only one infinite bar corresponding to the one connected component we obtained when all our data points were connected by edges. If one is working with Euclidean data, eventually all holes will be filed in and thus eventually a Rips complex with only one component and no holes will be formed. But in other applications, holes may persist forever, resulting in infinite bars. One can also obtain additional information by looking at the group structure of the filtered homology groups, and prove stability properties using interleaving distance (Bauer and Lesnick, 2014;Bubenik and Scott, 2014;Oudot, 2015;Chazal et al., 2016).

Persistence Diagrams and Stability
While barcodes are useful for visualizing changes in homology, barcodes are generally converted into persistence diagrams for statistical and machine learning analysis (Edelsbrunner et al., 2002;Mileyko et al., 2011). The start of a bar represents the birth of a cycle while the end represents its death. The plot of the points (birth time, death time) in 2-dimensional space is called the persistent diagram (PD). The persistent diagram corresponding to the barcode in Figure 2 is shown in Figure 3. A persistence diagram also includes the diagonal as shown in this figure as the diagonal is used when computing distances between PDs. A PD can be a multiset if multiple bars have the same birth time b and death time d, so that the point (b, d) occurs multiple times in the PD.
The formula for the bottleneck distance for a fixed β i between two persistence diagrams, P 1 and P 2 , is d B (P 1 , P 2 ) : = inf To compute this distance we first create a matching γ between these diagrams for the fixed β i as shown in Figure 4. In this figure the blue triangles represent features with the fixed β i from one data set while the purple stars represent features from a different data set for the same β i . A matching γ : P 1 → P 2 is a bijective function from P 1 to P 2 where both persistence diagrams include the diagonal. Features that are close to the diagonal get matched to the diagonal unless they are closer to another feature that does not have a better FIGURE 4 | Two persistence diagram, P 1 and P 2 , are shown for a single dimension (for example, β 1 ). The blue triangles correspond to P 1 while purple stars are used for P 2 . Both persistence diagrams include the diagonal. A matching between P 1 and P 2 is shown where the red dotted lines indicate features that have been matched where some of the features are matched to the diagonal. The length of the thicker dark red dotted lines indicate the distance between matched features. The distance between a feature and the diagonal is the persistence of the feature, d − b, where b = birth time and d = death time of that feature. If feature (b, d) is matched with feature (β, δ), then the distance between these features is max(|b − β|, |d − δ|). Since the best matching is shown, d B (P 1 , P 2 ) equals the length of the longest of the thick dark red dotted lines. Any other matching would have a matched pair of features with larger distance.
matching than to the diagonal. If x = (b, d) ∈ P 1 is matched to the point (β, δ), then the distance between these features is . To find the distance for a particular matching γ , we calculate sup x∈P 1 x − γ (x) ∞ = the largest distance between a point x in P 1 and its match γ (x) in P 2 . The bottleneck distance is obtained by taking the infimum of this distance over all possible matchings. In Figure 4, red dotted lines indicate best matches between features from P 1 and P 2 .
If P 1 is the PD for the data set X and P 2 is the PD for the data set Y, the stability theorem states that Cohen-Steiner et al., 2007). In other words, if each data point is perturbed by at most a distance ǫ, then the persistence of a feature will change by at most 2ǫ since the birth and death times can change by at most ǫ. Features with persistence <2ǫ may disappear, while new features with persistence less than 2ǫ may be created.

Benefits and Limitations of Persistent Homology
That persistent homology is stable with respect to noise is, of course, a major advantage. But any method that uses Euclidean distance is affected by the curse of dimensionality due to the effect of noise on distance. For example, suppose a data point should be at the origin, but due to noise, each coordinate is perturbed by 0.01 units, then the point which should be at the origin is now n i=1 (0.01 2 ) units away from the origin if the data lives in R n . Thus, for example if n = 10, 000, then the data point is perturbed by a distance 10,000 i=1 (0.01 2 ) = 1. While the change in persistent homology is bounded by the distance between the original data set and the perturbed data set, the latter can be quite large, depending on the amount of noise and the dimension of the dataset. Thus, performing PCA or t-SNE or other dimension reduction technique first may lead to stronger results.
In order to recover the shape of an object, one must have sufficient coverage. Some holes detected by persistent homology may be due to incomplete data. If these are small, then they only result in short bars which may be considered noise. But in high dimensional spaces, one has many degrees of freedom, so even recovering the shape of simple objects in high dimensions can be impossible as obtaining a sufficient number of data points may not be feasible. However, differences between data sets may still be detected even if coverage is lacking. For example, one may have insufficient coverage to recover the topology of a torus if one uniformly under-samples data points from a torus. However, the resulting barcode will likely be very different than the barcode obtained from uniformly under-sampling points from a sphere. Also, coverage can be less of an issue if you have some information regarding the shape of the data such as periodicity (for example, Dequeant et al., 2008). Thus, in practice, topological data analysis has proven to be quite robust. For more on complexity and topological inference (see Weinberger, 2014).
Due to computational complexity, most analysis using TDA restricts to the use of β i for i ≤ 4. Often only β 0 and β 1 are used, but faster algorithms such as Ripser (Bauer, 2019) are becoming available. To calculate persistent homology of a point cloud, one first needs to create simplicial complexes. The number of simplices grows rapidly with the number of data points as well as the homology dimension (not the dimension of the data set, but the dimension of the holes one wishes to detect-in order to calculate β i , one needs i-dimensional and i + 1 dimensional simplices). The TDA pipeline also requires the computation of distances between data points. The dimension in which the data lives can affect this step, but after distances are calculated, it is the shape of the data that can have the largest effect, sometimes even larger than the number of data points as there are several algorithms that can greatly simplify the simplicial complex (Zomorodian, 2010;Mischaikow and Nanda, 2013;Wilkerson et al., 2014;Boissonnat and Pritam, 2020). The effectiveness of these simplification algorithms depends on both the topology and geometry of the data set. For example, suppose one takes n data points equally spaced on a straight line. The topology of the line is the same as the topology of a point. Thus, to calculate the homology of the line, one can remove all simplices except for a single vertex. For more on computational complexity of persistent homology (see Otter et al., 2017).
If all the data points enter at time 0, the β 0 bars all start at time 0. Thus the barcode for β 0 can be created from a single linkage hierarchical clustering dendrogram as the merge heights of the dendrogram become the lengths of the β 0 bars. Hence the β 0 barcode contains less information than a single linkage hierarchical clustering dendrogram. However, there are applications where the data points enter at different times such as time series data. Thus, the β 0 barcode can be applied to a wider variety of applications than standard clustering techniques. Clustering also cannot capture holes and voids; the higher dimensional barcodes capture structure that other methods such as clustering miss.

Persistent Homology and Machine Learning
The barcode can be used as a topological signature to identify structure in data. While homology is built to detect topology and not geometry, persistent homology can be implemented in a variety of ways to distinguish geometrical shapes (e.g., Turner et al., 2014;Li et al., 2018;. Machine learning can be applied to a collection of persistent diagrams to distinguish between data sets with different structures. Many machine learning algorithms take a vector as input. There are many ways to create a vector from persistent homology. A pipeline to create a vector using persistence images (Adams et al., 2017) is illustrated in Figure 5. A persistent diagram is first rotated by 45 • so that the diagonal becomes the horizontal axis (2nd panel of Figure 5). Thus the horizontal axis represents the birth time, while the vertical axis represents persistence = deathbirth. A heat map is then created using a Gaussian distribution (or other weight function) about each point (3rd panel). The height of the Gaussian distribution is indicated with color in the heat map and is dependent on the persistence of the feature. Points closest to the diagonal are considered to be the result of noise and are thus given no intensity. Hence the bottom of the heat map will always have the color corresponding to zero intensity, in this case blue. In other words, points close to the diagonal have no effect on the heat map. Observe that the point furthest from the diagonal in the first panel corresponds to a feature with the largest persistence per second panel. Thus, in the heat map in the 3rd panel, the color at this point is given the highest intensity (yellow). As shown in the fourth panel, the heat map is discretized by partitioning the heat map into n × n squares where the color of each square corresponds to the average value of the corresponding square in the heat map. In the discretized heat map (4th panel), the yellowish region from the 3rd panel corresponding to the most persistent feature is partitioned between two squares with the yellow square in the top row of this heat map containing a larger portion than the pinkish square next to it in the same row. In the final panel, an n 2 -dimensional vector is created by concatenating the rows of the discretized heat map.

TREATMENT RESPONSES AND PROGNOSIS
What impedes the success of cancer therapies is often the coexistence of therapy resistant cells along with therapy sensitive tumor cell populations. When administered separately, all currently adopted therapeutic strategies-ranging from cytotoxic chemotherapies to molecular targeted therapiesimpose a dramatic, yet homogeneous selective pressure on an often heterogeneous group of tumor cells. Despite varying resistance mechanisms contingent upon therapy-type and tumor composition, every therapeutic intervention inevitably selects for resistant cells, which expand and become the dominant cell type of recurrent tumors, that cease to respond to therapy (Maley and Reid, 2005;Aparicio and Caldas, 2013;Bukkuri, 2020). The increased resolution on the clonal architecture of intermixed tumor cell populations that has just now become available calls for prognostic and therapeutic benefits. High intra-tumor diversity in pre-malignant lesions has been shown to predict progression to malignant growths and poor outcome (Maley et al., 2006;Laurie et al., 2012). The therapeutic significance of intratumoral heterogeneity (ITH) is exemplified in a recent study that measured genetic and transcriptional diversity of breast cancer tumors before and after therapy based on four genetic markers and two transcriptional markers. The study provided proof-of-principle that therapy-induced phenotypic changes can be predicted based on the characterization of coexisting tumor subpopulations (Almendro et al., 2014). Another recent study used RNA interference to model heterogeneous tumors and tested the efficacy of predicted drug combinations in eliminating coexisting tumor subpopulations (Zhao et al., 2014). Their findings suggest that the most effective drug combination for a given tumor cannot be achieved by targeting the predominant subpopulation alone, but requires detailed characterization of the genetic makeup of branched subpopulations and their contribution to the tumor bulk.
Techniques from computational homology have been used to develop a new algorithm to characterize comparative genomic hybridization (CGH) profiles and identify the frequency of cancer recurrence in early stage breast cancer patients through identification of recurrent copy number aberrations (CNAs) in cancer (DeWoskin et al., 2010), which serve as markers of genomic instability and thus cancer prognosis (Hanahan and Weinberg, 2000;Han et al., 2006). Specifically, the method uses a sliding window algorithm to associate a set of point clouds to each array CGH. Different window sizes allow one to analyze the data at various scales by considering different dimensional point clouds. Then, persistent homology is applied to these point clouds for classification. It was found, in accordance with prior results (Climent et al., 2007), that the Betti numbers of the zero dimensional homology groups (β 0 ) can distinguish between recurrent and non-recurrent groups in patients who did not receive anthracycline-based chemotherapy after surgery but not in patients who were treated with anthracycline. Note that, in this approach, no segmentation of the data was required.
In another study, a novel statistic called the smooth Euler characteristic transform (SECT), which allows shape information to be integrated into traditional statistical models, was developed and applied to predict disease free survival in glioblastoma multiforme (GBM) based on tumor shape from post-contrast T1 axial magnetic resonance imaging (MRI) (Crawford et al., 2020). SECT is a variation of the persistent homology transform (PHT) introduced in Turner et al. (2014) that was created to overcome the difficulties in integration with traditional statistical models. Specifically, the output of SECT is a collection of smooth vectors, while the output of PHT is a collection of persistence diagrams (Edelsbrunner et al., 2002), thus having a complicated representation and geometry which does not lend itself easily into integration with statistical models. In the GBM application, the statistical model used was a Bayesian linear mixed model (BLMM) (Ishwaran and Rao, 2005;Guan and Stephens, 2011;Zhou et al., 2013). When this topological approach was applied to the GBM MRI data, it was found to outperform gene expression, volumetric, and morphological summaries in predicting disease free survival.
Clinically, there is a great importance in the identification of biomarkers which can serve as predictors for metastasis and patient prognosis in cancer. To this end, researchers have recently used persistent homology techniques, in an exploratory data analysis fashion, to identify biologically meaningful geometric properties of single cell data (Lockwood and Krishnamoorthy, 2015). In this method, data was first transposed and analyzed in its dual space with each gene represented in a much lower dimensional sample space, thus circumventing the problem of high dimensionality that is typical of single cell data. A small set of genes (120-200) were then selected as landmarks (De Silva and Carlsson, 2004) and a family of nested simplicial complexes was constructed, indexed by a proximity parameter. Unlike many other methods which focus on the analysis of zero dimensional homology groups (DeWoskin et al., 2010;Nicolau et al., 2011), thus performing analyses which are topologically equivalent to clustering, this study focused their efforts on identifying loops of one dimensional homology groups which persist over a large range of values of the proximity parameter, hypothesizing that connections around holes imply nontrivial interactions among genes and biological functions which could have implications for tumorigenesis. Repeating this process for various landmarks, features which remain stable over large ranges of both the proximity parameter and number of landmarks could be detected. Applying these techniques to five different cancer data sets from brain, breast, ovarian, and acute myeloid leukemia cancers, many members of the significant loops in the one dimensional homology groups that were found have been previously shown to be accurate biomarkers for cancer biogenesis, while others serve as potential new markers which have yet to be experimentally validated.

TUMOR SEGMENTATION AND COMPUTER-AIDED DIAGNOSIS
Computerized methods can efficiently and effectively identify quantitative image features that are otherwise difficult to spot by manual inspection (Yu et al., 2016). Quantitative morphological features extracted from H&E stained slides, such as Zernike shape features, have been shown to predict survival in lung adeno-and squamous cell carcinoma (Yu et al., 2016). Recent advances in next-generation sequencing technologies gave rise to a plethora of approaches that quantify and characterize the genotypic diversity within a given tumor. Evidence supporting a quantitative relation between genotypic and morphological ITH followed. A quantitative image analysis approach that complements genomic profiling with geographical information was developed (Yuan et al., 2012;Andor et al., 2016). Furthermore, the authors characterized cellular heterogeneity by distinguishing between well-defined cell-populations (stromal cells, lymphocytes, cancer cells). However, so far qualitative details of how this diversity in morphology is structured (i.e., how many subpopulations are present and what their geographical boundaries are on the H&E slide) are unknown.
As a step toward a computer-aided cancer diagnosis system, persistent homology has been used to develop an automated tumor segmentation approach for Hematoxylin & Eosin (H&E) stained colorectal cancer histology whole slide images (WSI) (Qaiser et al., 2016). The authors exploit the fact that nuclei in tumor regions have atypical characteristics such as nonuniform chromatin texture, irregularity in shape and size, and clustering of nuclei, and use persistent homology profiles to characterize the degree of connectivity among nuclei and to classify cancerous regions based on this information. Specifically, once a WSI has been obtained, it is first divided into patches, each of which has a persistent homology profile. Given two patches, the symmetrized Kullback-Leibler divergence (KLD) can be computed between the respective persistent homology profiles, which serves as a metric for interpatch distance. Then an input patch is classified as cancerous or non-cancerous by a kNN classifier, based on KLD distances between its persistent homology patch and those of each representative patches. These exemplar patches are chosen by training a CNN and selecting patches whose activation during training is large (separately for cancerous and non-cancerous classes). The benefit of this approach over previous approaches is that only the subset of highly activated patches from the convolutional layers are used as exemplars rather than the set of all patches in the training data. This method was compared against standard CNN and HyMaP (Khan et al., 2013) approaches on 74 H&E stained WSIs of colorectal cancers; in addition to being computationally less expensive than the other two methods, it was also shown to have better precision and segmentation accuracy.
Another example of tumor segmentation and algorithmic diagnosis is a recent study which aimed to segment a diseased area of skin and classify the type of skin lesion into one of seven classes in a given dermatoscopic image (Tschandl et al., 2018) using persistent homology (Chung et al., 2018). Like the colorectal image segmentation study (Qaiser et al., 2016), the segmentation algorithm used is a concept similar to persistent homology (Edelsbrunner et al., 2002). Linear support vector machines (SVMs) were used for classification on the persistence statistics (Chung et al., 2018) and persistence curves (Chung and Lawson, 2019) were derived from persistence diagrams. Specifically, given an image, a segmentation algorithm was first implemented to obtain an image mask: a binary image in which each pixel is colored either white (if it part of the healthy skin) or black (if it is part of a lesion). Once the mask was applied to the original image, the RGB color space is transformed into an RGB, HSV, or XYZ color space and each channel was extracted. Persistent homology software was then used to compute persistence diagrams for each channel; from each diagram, persistence statistics and curves were computed as features. Finally, a multi-class SVM was used to classify the input into one of the seven types of skin lesions. When this approach was applied to a validation set of 5,000 images, the highest resulting accuracy scores were 65.6, 66, and 67.2%.
Similar persistent homology techniques were used to classify H&E stained stage T3 and stage T4 colorectal adenocarcinomas images as benign or malignant (Chittajallu et al., 2018). To do this, given an image, it was first color normalized (Reinhard et al., 2001) and the nuclear stain and minimum cross entropy thresholding (Li and Tam, 1998) for nuclear foreground segmentation were extracted using an unsupervised color deconvolution method (Macenko et al., 2009). Then, a fast difference-of-Gaussian implementation of the scale-adaptive Laplacian-of-Gaussian filter of Al-Kofahi et al. (2010) was performed to detect nuclei centroids. Then, by considering the set of nuclei centroids as a point cloud, the persistence diagram of its Vietoris-Rips filtration for the one dimensional homology groups (loops) was computed using a fast multiscale approach (Doyle et al., 2008). Then, persistence landscape (Bubenik, 2015) and image (Adams et al., 2017) representations were computed and used as features to characterize loops formed by glandular epithelial cell nuclei. Then given training images with benign/malignant labels, a random forest classifier was trained using these topological features. PCA was used to reduce the dimensionality of each feature group so as to preserve 99% of the variance. Hyperparameter optimization was also performed via cross-validation using a tree-structured parzen estimator (Bergstra et al., 2011). When this method was applied to testing data consisting of 80 images, an accuracy of 85%, AUC of 0.85, precision of 78%, and recall of 95% was obtained, an improvement over the traditional cell graph property approach in all areas (Doyle et al., 2008).

DISEASE CLASSIFICATION
Cancers of unknown primary represent 3-5% of all cancer cases, whereby physicians find one or multiple metastases but fail to locate the primary tumor. Pathologic evaluation of a metastatic biopsy often does not provide a definitive answer. Molecular data ranging from gene expression to somatic mutations have been shown to significantly aid classification of metastatic biopsies to their corresponding primary tumor site (Ferracin et al., 2011;Marquard et al., 2015;Vikeså et al., 2015;Moran et al., 2016;Søndergaard et al., 2017).
One study used persistent homology on 150 non-contrastenhanced fat-suppressed 3D T1-weighted magnetic resonance (MR) images to classify hepatic tumors into three classes: hepatocellular carcinomas (HCC), metastatic tumors (MT), and hepatic hemangiomas (HH) (Oyama et al., 2019). To do this, for each image, a 3D region of interest (ROI) in the shape of a rectangular solid enclosing the entire lesion was created by an experienced radiologist. Then, gray-scale values of the voxels in each ROI were normalized and persistence diagrams were created for dimensions 0, 1, and 2 using HomCloud (Kimura et al., 2018;Obayashi and Hiraoka, 2018). These diagrams were vectorized into persistence images (Adams et al., 2015). Feature vectors were then obtained from these images and inputted into logistic regression with an elastic net penalty and extreme gradient boosting machine learning models for classification. The results from classification showed that dimension 1 persistence images had the highest accuracy rates: 85% for classifying HCC and MT, 84% for HCC and HH, and 74% for HH and MT.
An alternative method to accurately classify tumor subtypes is through the use of high throughput genomics (Nutt et al., 2003;Freije et al., 2004). Aiming to produce more robust algorithms than traditional classification methods, given gene expression profile data, researchers used statistical invariants and persistent homology to identify core patient groups associated with the classical, mesenchymal, and proneural subtypes of GBM and a compact set of genes most useful for this partitioning (Seemann et al., 2012). To do this, a sufficient, but compact, panel of genes to be used for clustering was predetermined using nondimensionalized standard deviation (to ensure bimodality of gene expression distribution across patient samples; Phillips et al., 2006;Verhaak et al., 2010) and persistent homology (to find groups of genes whose expression levels change coherently among patient samples; Carlsson, 2009;Horak et al., 2009). Then, a hierarchical partitioning of patient samples based on gene expression levels is performed using persistent homology; specifically, samples are repeatedly bisected until further partitioning is not possible, thus obtaining the number of clusters that exists and some notion of genetic proximity of the clusters. Each bisection was implemented using 30 genes. A predictive model was then implemented to assign cancer subtypes to each cluster. Applying this approach to the 20 GBM test samples, fifteen predictions were in accordance with results from standard clustering calculations (Verhaak et al., 2010), five of which were unassigned by both algorithms. Of the remaining five samples, four were classified as "neural" by the clustering algorithm, but were unassigned by this approach since the neural group was not found in a single cluster.
Another example of the use of persistent and computational homology on gene expression data is in Arsuaga et al. (2012), whereby, upon application to a breast cancer gene expression dataset, the algorithm was able to distinguish among most breast cancer subtypes. This paper extended the work of DeWoskin et al. (2010) to gene expression data, under the assumption that gene expression is a measure of the underlying copy number changes (Neve et al., 2006;Horlings et al., 2010). Before applying the sliding window algorithm developed in DeWoskin et al. (2010) to gene expression data, theoretical work was done to show that under idealized conditions, the point cloud defined by the algorithm is a good representation of the original data. Hence, analysis of the point cloud is applicable to the original data set. This was done using Taken's embedding theorem, an extension of Whitney's embedding theorem to dynamical systems theory, and a circularization technique. To apply the sliding window algorithm to gene expression data, instead of preselecting differentially expressed genes like traditional clustering algorithms, all genes were ordered by their location in the genomes. Then, the sliding window algorithm was applied to generate point clouds, upon which topological and statistical analysis was performed. It was shown that when only β 0 was used, the algorithm could distinguish between less aggressive subtypes, like normal and luminal-A, and more aggressive ones, such as luminal B, basal-like, and Her2. It was also noted that the algorithm could not distinguish luminal B from Her2 and basallike, implying the close similarities among these subtypes. Thus, it was noted that breast cancer subtypes can not only be classified by specific sets of genes, but also by certain global relationships among all genes.

CELLULAR ARCHITECTURE
Imaging is an essential part of cancer clinical protocols, providing physicians with morphological, structural, and metabolic information about patient tumors, thereby assisting in clinical decision making and treatment planning (Fass, 2008). The development of new image segmentation tools (Zhang et al., 2001;Hong and Brady, 2003;Xiaohua et al., 2004) and quantitative multiplex immunofluorescence (Stack et al., 2014;Dimitriou et al., 2019;Abousamra et al., 2020) have set the stage for topological data analysis and persistent homology techniques to be harnessed for interpretation of high-dimensional information in histopathological imaging data.
One example of this is using persistent homology techniques to investigate architectural characteristics of cellular organization and nuclear arrangements from microarray tissue samples to distinguish among genetically derived breast cancer subtypes (Basal, Luminal A, Luminal B, and HER2;Singh et al., 2014). This was done through distinct topological characterizations such as nuclear connectivity (generators of zero dimensional homology groups) and loops (generators of one dimensional homology groups) based on Vietoris-Rips filtration of nuclei centers (Mischaikow and Nanda, 2013). When its performance was compared to a standard distance weighted discrimination classifier (Marron and Todd, 2007), nearly a four times improvement in classification accuracy was noted. Furthermore, for certain combinations of feature weightings, it was shown that topological features provide complementary information to patch based image appearance features. By using such topological features, they solve/address two main challenges in obtaining accurate cellular architectural characterization: the heterogeneity of spatial arrangements, both among patients and within single tumor samples, and differences in stain intensity which require manually determined phenotypic thresholds (Engers, 2007;Truesdale et al., 2011;Goodman et al., 2012;Helpap et al., 2012;Truong et al., 2013;Epstein et al., 2016;Evans et al., 2016). This improves performance over existing standard classifiers, which are more sensitive to noise, cannot model stain concentration variations, and have issues with larger cell arrangements (Aukerman et al., 2020).
In another paper, researchers used TDA to cluster prostate cancer histology into architectural groups consistent with the continuum of Gleason patterns, the most widely accepted system for evaluating prostate cancer architecture (Humphrey, 2004;Lawson et al., 2019). Persistent homology was used to compute persistence intensity diagrams (of zero and one dimensional components) of purely graded prostate cancer histopathology images of Gleason patterns 3-5. This revealed key insights into characteristics such as nuclei density, glandular shape, and inter-glandular arrangement. Furthermore, persistent homology was able to cluster these images into architectural groups through a rank descending persistence vector-the six resulting clusters provided a stable architectural continuum from well differentiated to poorly differentiated adenocarcinoma at an even finer level than the standard Gleason scale. Persistent homology has also been used to characterize the spatial arrangement of immune and epithelial (tumor) cells within the breast cancer immune microenvironment from quantitative multiplex immunofluorescence (qmIF) imaging (Aukerman et al., 2020). Stain intensities and spatial coordinates of individual cells were collected from qmIF through nuclear segmentation, cytoplasmic definition, and stain quantification. In order to incorporate these stain intensities, instead of directly using a Rips or Cech filtration on the point cloud data (Chazal et al., 2009), a discretization process was first implemented to convert the point cloud data with stain intensity values into an image. Then, persistence diagrams were created from these images by using the opposite of the pixel stain intensity as the filter function. These diagrams were assessed as potential biomarkers of cancer subtype and prognostic biomarkers of overall survival using kernel mean embeddings (Gretton et al., 2012) with the sliced Wasserstein kernel (Carrière et al., 2017) and were shown to outperform the standard nearest neighbor analysis with a standard Gaussian kernel. Furthermore, a correlation analysis using constrained covariance (Herbrich et al., 2005) showed that the correlation between nearest neighbor and persistence diagrams were always <0.1, implying the features are nearly statistically independent and thus complementary.

DISCUSSION
As we have seen in this paper, TDA has proven to be a powerful tool, yielding critical insights in the treatment prognosis, tumor segmentation and diagnosis, disease classification, and cellular architecture of cancer. But despite the many recent successes of TDA in the field of oncology, it is still a nascent field with much fruitful work yet to be done. Experimentally, to biologically validate the TDA methodology and results, it would be worth performing thorough studies to assess whether geometric and topological connectivity implies functional connectivity. Computationally, one area which deserves further exploration is the use of TDA to analyze time-series data (Ravishanker and Chen, 2019) in cancer. This has been done extensively in several other fields including climate analysis (Berwald et al., 2014), tracking stability of dynamical systems (Khasawneh and Munch, 2016), clustering populations of Tribolium flour beetles (Pereira and de Mello, 2015), analyzing motion sensor data during sports activities (Stolz et al., 2017), and financial time series data (Gidea, 2017;Truong, 2017;Gidea and Katz, 2018;Gidea et al., 2020). Though time series oncological data have been analyzed with varying degrees of success (Aoto et al., 2018;Kourou et al., 2020), TDA techniques of any sort have yet to be applied. Applying persistent homology techniques to time series microarray, cell anatomy imaging, or gene/pathway expression data, for example, may further help in disease classification, identifying intra-tumoral selection events, and contribute to a greater understanding of tumorigenesis. Another possible avenue of research is to investigate the process of angiogenesis, an inherently geometric and spatially dependent process, using persistent homology techniques. Specifically, we anticipate that TDA will help us understand the changes that occur in tumor vasculature morphology during cancer progression and under treatments. More importantly, we hope that connections between cancer vessel network and treatment prognosis can be found, such as by testing vessel normalization theory (Jain, 2005). In addition to the ideas presented above, it is worth noting that research into the use of TDA in oncology is sparse and, as such, there is much important and clinically relevant work to be done in simply applying well-understood persistent homology algorithms to broader classes of cancer data sets (note that most TDA analyses have been concentrated in just melanoma, brain, breast, and colorectal cancers) and in performing longitudinal studies across several cancer types.

AUTHOR CONTRIBUTIONS
AB conceptualized the project and wrote the sections 3-7. AB and ID wrote the section 1. ID wrote the section 2. NA wrote the sections 1, 3, 4, and 5. All authors contributed to the article and approved the submitted version.