Observability of Complex Systems by Means of Relative Distances Between Homological Groups

It is common to adopt a data-intensive strategy to develop a systemic and quantitative analysis of complex systems, so that data collection, sampling, standardization, visualization, and interpretation determine how causal relationships are identified and incorporated into mathematical models. Collecting sufficiently large datasets seems to be a good strategy for reducing bias in the collected data; however, persistent and dynamic anomalies in the data structure, generated by variations in intrinsic mechanisms, can induce persistent entropy and thus affect the overall validity of quantitative models. In this research, we introduce a method based on the definition of homological groups that evaluates this persistent entropy as a complexity measure to estimate the observability of the system. The method identifies patterns with persistent topology, extracted from the combination of different time series, and clusters them to identify persistent bias in the data. We tested this method on accumulated data from patients using mobile sensors to measure the response to physical exercise in real-world conditions outside the lab. With this method, we aim to better stratify time series and customize models in complex biological systems.


INTRODUCTION
The quantitative description of complex systems often makes use of time series because their relationships and correlations can be used to infer causal connections between observations [1]. In the end, a robust quantitative description must fulfill the condition of observability of the system, that is, the internal states of the system must be accessible from the data, such that a mathematical model can be extrapolated or used to make predictions about future states of the system.
In this research work, we face the problem of estimating the persistent entropy generated by all the internal processes and states in a complex system, which could compromise the stability of its quantitative description. Previous research has focused on the definition of causality tests using time series [1], for example, using transfer entropy [2]. However, understanding the causal relationships that lead to the successful implementation of models requires a sound analysis of the influence of the sampled data [3]. In some cases, causality inference can be complicated by a bias when estimating from a limited amount of possibly noisy data [1]. This causality inference is based on the notion of cooperative behavior of complex coupled systems, where synchronization and related phenomena have been observed, for example, in physical and biological systems [4].
However, there are constant individual variations between organisms that challenge this approach. For example, a bird flying in a forest calculates its trajectory according to the distribution of the trees in the environment. Bees also "compute" and create a model of their environment [5]. Likewise, a cancer cell adapts its response to changes in its microenvironment as well as to internal changes in its regulatory systems, for instance, depending on the acidity of the tissue, the presence of toxic chemicals [6], or the existence of landscapes with complex attractors in stem cells that depend on different molecular signatures (for a study on this topic, see, e.g., Ref. 7). Such representations are useful considering that an environment and changes in internal constraints like molecular shapes or boundary conditions are not static: a storm can change the distribution of trees in a forest and thereby affect an ecosystem, and a change of diet can introduce substances capable of modifying the microenvironments or regulation mechanisms of cells in tissues affected by cancer cells [8].
The reason why individual biological variations take place is thus not easy to pinpoint; the important fact is that this problem permanently challenges the construction of models [9]. In effect, the myriad of possible interactions motivates a continuous update of the information registered in databases. Thus, while some canonical pathways are well known, many other interactions, and possible variations, are still unknown and must be constantly updated when these mechanisms are reconstructed [8]. Therefore, a good strategy is the identification of individual deviations that might require individual modeling or an update in the database.
These individual deviations generate persistent entropy that can be estimated by analyzing persistent patterns in time series. For this reason, we make use of persistent homology groups to qualitatively assess persistent incoherencies and imbalances in the sampled data associated with the trajectories Γ. This method is useful to detect and recognize shapes in high-dimensional data [10] and has recently been used in different fields in biology, from the analysis of cancer tumors [11] to the analysis of time series in biology [12], as well as in physics, for instance, in the analysis of folding structures of proteins in soft matter [13], the analysis of the structure of complex networks [14], or in combination with machine learning techniques for the identification of novel materials and structures from molecular databases [15]. Since this methodology is robust against noise [16], it is well suited to detect persistent defects in the sampled Γ's. Such imbalances are more than errors in the sampling of data and can be identified as persistent and inherent characteristics of the trajectories Γ. A qualitative assessment is relevant not only for the optimization of modeling methods, for example, by avoiding expensive training of models (mechanistic or based on machine learning), but also for ensuring the safe use of models by recognizing when a sample of data from a biological system or organism can be represented with a common underlying model or instead requires a customized mathematical representation. This is helpful, for instance, to determine whether personalization of the relevant mathematical models is required for diagnosis and therapy in medicine [17]. Furthermore, our methodology aims to be an alternative method to perform signal analysis in this context.
In Topological Methods for the Assessment of Bias in the Sampled Data, we introduce the mathematical background of our methodology, which is tested in Supplementary Appendix 3 with synthetic data generated from a simple model on a population of chemotactic cells with different response mechanisms. In Proof of Concept for Data Analysis, we perform a test on real data with the mhealth dataset, which contains data of patients wearing Internet of things (IoT) sensors connected to internet devices to measure electrocardiograms (ECGs) and acceleration while they were performing physical exercise in normal and noncontrolled conditions [18]. Finally, in "Discussion" and "Conclusion," we discuss the results and their future perspectives.

Topological Methods for the Assessment of Bias in the Sampled Data
As a starting point, we consider different biological/physiological data (e.g., number of individuals in a population, nerve impulses, concentration of chemicals, etc.), recorded as different time series that can be coupled in a path $\Gamma_i(t)$ defined in a phase space Γ, as shown in Figure 3.
Under similar conditions, all organisms must have similar responses, so that an average value of the data points in the phase space can be sampled and used to train models represented by a function f, shown in Figure 1 for organisms A and B. These models can be mechanistic, like network models with physical constraints, or black-box and statistical models defined using machine learning. With this assumption, the function f is not only descriptive, representing the distribution of different data points associated with an average path Γ, but is also predictive, helping to estimate future responses.
In the modeling process, there is a statistical error and a bias associated with the way the researcher selects and validates the model. The more data points are sampled in Γ, the smaller the error of the model f. This approach is the basis of methods using big data that attempt to detect regularities in sampled datasets. However, subtle differences between datasets can be much more than just statistical deviations or outliers in average data samples. Such deviations may indicate different physical constraints originating in changes in the organism's environment or its internal regulation mechanisms, as shown in the example of the two organisms A and B in Figure 1. In this example, a separate analysis helps to discover subtle changes in the trajectory, implying that two different models for two completely different trajectories, $\Gamma_A$ and $\Gamma_B$, are required.
For this reason, a method that goes beyond mere statistical variations is required to extract relative variations generated by changes in physical constraints. Hence, we make use of the variance and bias to assess differences and effectively cluster trajectories with similar responses, leading to the concept of persistent bias, which in turn is related to this persistent variability of physical constraints.
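The argument that pooling more data does not help when two organisms follow different constraints can be illustrated with a toy example. The sketch below is not taken from the original study: the two linear responses, the helper names `fit_line` and `mse`, and all numbers are hypothetical. It fits one pooled line to both organisms and compares its error on organism A with the error of a model fitted to A alone.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mse(model, xs, ys):
    """Mean squared error of the line model (a, b) on the data."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(50)]
# Hypothetical organisms: A responds with slope +1, B with slope -1.
ys_a = [1.0 * x for x in xs]
ys_b = [-1.0 * x + 5.0 for x in xs]

pooled = fit_line(xs + xs, ys_a + ys_b)      # one model for both organisms
own_a = fit_line(xs, ys_a)                   # customized model for A

err_pooled_a = mse(pooled, xs, ys_a)
err_own_a = mse(own_a, xs, ys_a)
print(err_pooled_a, err_own_a)
```

In this toy setting the pooled fit averages the two opposite responses, so its error on A stays large however many points are sampled, while the customized model fits A essentially exactly.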

Definition of Persistent Bias
According to the bias-variance decomposition, the error of a model f, Error(f), is composed of three terms: a bias that depends on the definitions of the researcher, a variance term, and an unavoidable irreducible error term [19]:

$$\mathrm{Error}(f) = \mathrm{Bias}(f)^2 + \mathrm{Var}(f) + \sigma^2,$$

where $\mathrm{Bias}(f)$ is the bias of the model f, $\mathrm{Var}(f)$ is its variance, and $\sigma^2$ is the irreducible error. This bias is the result of false assumptions in the parameters used in the learning algorithm. But individual reactions of the organism induce a persistent bias in the data structure, for instance, through how internal regulatory processes in an organism k are defined and how they differ from those of other organisms l. Therefore, the variability of the estimated error of a model, $\Delta_{kl}\mathrm{Error}(f)$, is defined in Eq. 1 through the expectation value of X (see Supplementary Appendix 1). Considering that $\Gamma_k$ and $\Gamma_l$ are sets of discrete points (as shown in Figure 3), $\Gamma_{kl}$ and $(\Gamma_{kl})^2$ are also sets of discrete points, such that their expectation values are sums over these points weighted by P(X) (Eq. 2). Here, P(X) is the probability of occurrence of X. This is basically a perturbation of the error with respect to the trajectories of other organisms. When systems are observable, that is, when it is possible to extract the internal states of the system, then $\Delta_{kl}\mathrm{Error}(f) \to 0$, such that f can describe these internal states and could eventually fulfill the theorem of observability (see, e.g., Ref. 20).
In this case, we can use these probabilities to define the persistent entropy of the system (Eq. 3). Therefore, a persistent bias is not a mere statistical error originating from the observer or the sampled data but is the quantity that generates persistent entropy originating from variations in the internal states of the system or organisms.

Frontiers in Physics | www.frontiersin.org December 2020 | Volume 8 | Article 465982
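For orientation, the standard discrete forms of an expectation value and of an entropy built from the same probabilities are sketched below. This is an assumption about the general shape of such expressions, not a reproduction of the original equations: the bracket notation, the symbol S, the index i over the discrete points, and the choice of a Shannon-type entropy are all introduced here for illustration.

```latex
\langle X \rangle = \sum_i X_i \, P(X_i), \qquad
S(X) = -\sum_i P(X_i) \, \log P(X_i), \qquad
\sum_i P(X_i) = 1 .
```

Under these assumed forms, S vanishes when a single outcome has probability one and grows as the probabilities spread over more points, which matches the reading of entropy as a complexity measure used in the text.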

Topological Persistence: Separation of Internal Bias from Statistical Error and Modeling
In order to estimate the entropy in Eq. 3, we analyze the structure of $\Gamma_{kl}$ and $(\Gamma_{kl})^2$ and compute an observable similar to a persistent entropy [21]. The strategy we propose is to assess the topological structure of the data before a model or regression is performed, ideally combining different trajectories in a phase space.
In the end, we construct point clouds generated from the trajectory sample $\{\Gamma_k\}$ of the organism or system k (as shown in Figures 2 and 3). A point cloud consists of a large but finite set of points sampled from the underlying form.
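The construction of a point cloud from time series can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `couple` and `delay_embed`, the sine and cosine signals, and the lag of 15 samples are hypothetical and not taken from the original analysis. The first helper pairs two simultaneously sampled series into points of a phase space; the second builds a time-delay reconstruction from a single series, a common variant when only one signal is available.

```python
import math


def couple(series_a, series_b):
    """Pair two equally sampled series into points of a 2-D phase space."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    return list(zip(series_a, series_b))


def delay_embed(series, dim=2, lag=1):
    """Return delay vectors (x[t], x[t+lag], ..., x[t+(dim-1)*lag])."""
    n = len(series) - (dim - 1) * lag
    if n <= 0:
        raise ValueError("series too short for this dim/lag")
    return [tuple(series[t + j * lag] for j in range(dim)) for t in range(n)]


# Hypothetical signals: a sine wave and its quarter-period shifted copy.
t = [0.1 * i for i in range(100)]
x = [math.sin(v) for v in t]
y = [math.cos(v) for v in t]

cloud_xy = couple(x, y)                       # trajectory in the (x, y) plane
cloud_delay = delay_embed(x, dim=2, lag=15)   # reconstruction from x alone
print(len(cloud_xy), len(cloud_delay))
```

For a periodic signal, both clouds trace a closed curve, which is the kind of loop that a first-order homology group is expected to pick up.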
In this approach, the combination of the time series of the trajectories Γ, including the time delay of the time series, can recover the dynamics of the system [22]. Furthermore, the presence of harmonic structures in the data represented in point clouds, related to these dynamics, can be explored by analyzing persistent homology [23]. Persistent homology, a tool in algebraic topology, is particularly useful in situations where the "scale" is not known a priori. Persistence theory, as considered by H. Edelsbrunner [24], starts with a space X equipped with a finite filtration rather than represented by smooth manifolds using real-valued functions [25]; thus, it can be seen as a generalization of hierarchical grouping of topological characteristics of higher order that leads to a type of invariant represented by bar codes [23,24,26]. (For a more extensive introduction to this methodology, see Refs. 24-27. For an overview of its application, see Pun et al. [28] as well as Pereira et al. for the application of topological persistence in different fields in biology and medicine [29].) We sample a collection of points in a metric space into a global object defined as the vertices of a combinatorial graph whose edges are defined by proximity [26]. While the graph captures the connectivity of the data, it allows the construction of a filtration of simplices using the values of the function and the computation of the persistent homology of the filtration, as in the example illustrated in Figure 2 for the discovery of different homologies in almost similar clouds of points.

FIGURE 2 | Exemplary estimation of topological persistence using a proximity parameter. A set of points can, for instance, be completed to a Čech complex [10]. In the example to the left, connected points and a surface with homologies 0 and 1 can be extracted. On the right, the same procedure allows the discovery of a second structure with homology 1.

FIGURE 3 | Homology group and bar diagram for two different datasets based on the example in Figure 1. Each concentric circle on each data point represents a distance function $d_K$. For S1, the persistent diagram delivers a homology group H0, that is, a group with genus 1. For S2, the increase in the number of points allows the definition of a homology group H1, that is, a surface with connected points. On the left, we schematically represent the difference between both persistent diagrams.
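To make the idea of a proximity-based filtration concrete, the following sketch computes the zeroth-order (connected component) bars of a point cloud with a union-find structure. It is an illustration only, under several assumptions: it uses a Vietoris-Rips style filtration with the edge length as the scale, it covers H0 only, and the function name `h0_barcode` and the example cloud are hypothetical. It is not the pipeline used in the original study.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """H0 bars of a Rips-style filtration: every point is born at scale 0 and
    a component dies at the edge length that merges it into another one."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # two components merge: one bar ends here
            parent[ri] = rj
            bars.append((0.0, length))
    bars.append((0.0, math.inf))     # the last component never dies
    return bars


# Hypothetical cloud with two well separated clusters.
cloud = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
bars = h0_barcode(cloud)
print(bars)
```

Short bars correspond to points that merge quickly inside a cluster; the single long finite bar records the scale at which the two clusters join, which is the kind of persistent feature the bar code is meant to expose.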
$\Gamma_\lambda$ has a topology that reflects the periodic behavior of a signal with Euler characteristics; this means that $\Gamma_\lambda$ has a function g with a compact subset of $\mathbb{R}^D$, with $d_{\Gamma_k}: \mathbb{R}^D \to \mathbb{R}$ the distance function of $\Gamma_k$ (see Figure 3).
Here, we consider $L = \{\delta : d_{\Gamma_k}(\delta) \leq \varepsilon\}$ as a set of persistent bars $\delta_\varepsilon$ that estimates the length of the topological feature. For example, for a first-order homology group $H_1$, that is, a loop in the data cloud, $\delta_\varepsilon|_{H_1}$ in the persistence bar is a measure of how well a data point is clustered in the group, obtained by measuring the distance of the data to the group with respect to the distance parameter ε. In this context, a bar code is the persistence analogue of a Betti number. Recall that the kth Betti number of a complex acts as a coarse numerical measure of $H_k$. Key topological features $H_k$ include the zeroth-order (connected points) and first-order (loops) homology (see Figure 3). In the following equations, we use the notation provided by Fasy et al. [30] (see Supplementary Appendix 2 for a detailed explanation of the interpretation of the persistence bars).
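The relation between a bar code and Betti numbers can be sketched in a few lines: at a given scale ε, the Betti number of a homology dimension is the number of bars alive at that scale. The helper names and the example barcode below are hypothetical and serve only as an illustration.

```python
import math


def betti_number(bars, eps):
    """Count the bars (birth, death) that are alive at scale eps."""
    return sum(1 for birth, death in bars if birth <= eps < death)


def bar_lengths(bars):
    """Lengths of the finite bars; infinite bars are skipped."""
    return [death - birth for birth, death in bars if math.isfinite(death)]


# Hypothetical H0 barcode of five points forming two clusters.
h0 = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 13.5), (0.0, math.inf)]

b0_small = betti_number(h0, 0.5)    # all points still separate
b0_mid = betti_number(h0, 2.0)      # two clusters
b0_large = betti_number(h0, 20.0)   # one connected component
print(b0_small, b0_mid, b0_large, bar_lengths(h0))
```

The bar lengths returned by `bar_lengths` are the quantities that later enter the comparison between trajectories.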
Therefore, the estimation of these equivalences helps to characterize differences between trajectories as well as the differences of the topological signatures. Using persistent homology groups m as the difference of the clusters $H_m$, the difference of the topological signatures can then be measured as the sum over all the topological characteristics (Eq. 4; see Figures 3 and 4 and Supplementary Appendix 2), where $\delta^k_{i,m}$ is the persistence bar for the corresponding topological feature m, or homology group $H_m$, of the trajectory $\Gamma_k$, and $\Delta_{kl}|_{H_m}$ is the total difference of the persistent bars $\delta^k_m$ and $\delta^l_m$ associated with $H_m$, as presented in Figure 3.
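One possible way to turn this description into a number is sketched below. The pairing rule is an assumption made for illustration, not the definition used in the original work: bars of the two trajectories are sorted by length within each homology dimension, paired by rank, and unmatched bars are compared with zero. The function names and the example signatures are hypothetical.

```python
def bar_difference(lengths_k, lengths_l):
    """Sum of absolute differences between bar lengths, paired by rank.
    The shorter list is padded with zeros, so an unmatched bar counts fully."""
    a = sorted(lengths_k, reverse=True)
    b = sorted(lengths_l, reverse=True)
    size = max(len(a), len(b))
    a += [0.0] * (size - len(a))
    b += [0.0] * (size - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))


def signature_distance(sig_k, sig_l):
    """Total difference summed over all homology dimensions m (dict keys)."""
    dims = set(sig_k) | set(sig_l)
    return sum(bar_difference(sig_k.get(m, []), sig_l.get(m, [])) for m in dims)


# Hypothetical bar lengths per homology dimension for trajectories k and l.
sig_k = {0: [1.0, 1.0, 0.5], 1: [2.0]}
sig_l = {0: [1.0, 0.8], 1: [2.0, 0.3]}

d_kl = signature_distance(sig_k, sig_l)
print(d_kl)
```

With this choice the distance is symmetric and vanishes for identical signatures; other pairings, such as an optimal matching between diagrams, would be equally defensible.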
Bar codes are intuitive, but their statistical analysis is rather complex. To perform a useful statistical analysis of persistent homology for small samples, we need a real number that encapsulates the information contained in the bar code. Following a definition similar to that of a persistent entropy [21], we define this number as a function of the lengths of the persistence bars; using this definition, we define the entropy for the difference of the persistence bars from Eq. 4. Given that any difference in the trajectories contains topological signatures, if $M(\Gamma_{kl}) \geq 0$ and $S(\Gamma_{kl}) \geq 0$, then $E(\Gamma_{kl}) \geq 0$ and $\mathrm{Bias}_{kl}(\Gamma) \geq 0$.
This condition implies a persistent internal bias, that is, a persistent entropy that originates from variations in the internal states of the system or organisms. Otherwise, if $M(\Gamma_{kl}) \to 0$ and $S(\Gamma_{kl}) \to 0$, then $E(\Gamma_{kl}) \to 0$ and $\mathrm{Bias}_{kl}(\Gamma) \to 0$.
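A common form of persistent entropy in the literature normalizes the bar lengths to probabilities and takes their Shannon entropy. The sketch below follows that form as an assumption; whether the original work uses exactly this normalization is not stated in the text, and the function name and inputs are hypothetical.

```python
import math


def persistent_entropy(lengths):
    """Shannon entropy (in nats) of the normalized bar lengths
    p_i = l_i / sum(l); bars of zero length are ignored."""
    total = sum(lengths)
    if total <= 0:
        return 0.0
    probs = [length / total for length in lengths if length > 0]
    return -sum(p * math.log(p) for p in probs)


e_uniform = persistent_entropy([1.0, 1.0, 1.0, 1.0])   # all bars equal
e_single = persistent_entropy([5.0])                   # one dominant bar
print(e_uniform, e_single)
```

Applied to the differences of bar lengths between two trajectories, a value near zero indicates that the difference is concentrated in very few features, while larger values indicate differences spread over many features.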
The matrix $M_{kl}$ will be called, in what follows, a distortion matrix. Finally, when both $M(\Gamma_{kl}^2) \to 0$ and $M(\Gamma_{kl}) \to 0$, then, according to Eq. 1, $\Delta_{kl}\mathrm{Error}(f) \to 0$, implying that the system is observable, since a model can be defined and its parameters can be identified. Accordingly, low relative persistence of the data, that is, $M(\Gamma_{kl}) \geq 0$, implies a persistent intrinsic entropy and complexity, with a high probability that a customized model $f_k$ is required; that is, $f_l$ will probably not completely fit the sampled data of k.
Thus, the goal is to estimate both the distortion matrix $M(\Gamma_{kl})$ and the entropy $S(\Gamma_{kl})$ to assess whether the system can be modeled with a function f and whether this function can account for the internal states of the system. This method is illustrated in Figure 4.
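The assembly of a distortion matrix from pairwise comparisons can be sketched as follows. The distance used here, the gap between summed bar lengths, is a deliberately simple hypothetical stand-in, and the rescaling to the range from 0 to 1 is an assumption made so that the matrix can be read as a heat map; neither is claimed to be the original definition.

```python
def distortion_matrix(signatures, distance):
    """Symmetric matrix of pairwise distances, rescaled to the range [0, 1]."""
    n = len(signatures)
    m = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(k + 1, n):
            m[k][l] = m[l][k] = distance(signatures[k], signatures[l])
    top = max(max(row) for row in m)
    if top > 0:
        m = [[v / top for v in row] for row in m]
    return m


def total_length_gap(a, b):
    """Hypothetical stand-in distance: gap between summed bar lengths."""
    return abs(sum(a) - sum(b))


# Hypothetical bar lengths for three trajectories; the third one deviates.
signatures = [[1.0, 0.5], [1.0, 0.6], [3.0, 2.0]]
M = distortion_matrix(signatures, total_length_gap)
for row in M:
    print([round(v, 3) for v in row])
```

Entries close to zero suggest that two trajectories could share a model, while entries close to one point to a pair for which a common model is unlikely to fit.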
As a reference, we have performed a simple test of the methodology using synthetic data from a predator/prey system of chemotactic cells with two kinds of responses (Supplementary Appendix 3). There we show how, with this method, we can stratify the distance between the different background mechanisms generating the population dynamics, and how the estimated entropy accounts for the persistent biases associated with the differences in the intrinsic mechanisms of the chemotactic cells.
In the next section, we present our main results for the mhealth dataset.

PROOF OF CONCEPT FOR DATA ANALYSIS
From the example presented in Supplementary Appendix 3, we learn that the methodology groups systems with similar topological signatures, suggesting that the underlying mechanisms and causal relationships are similar between systems 1 and 2. Of course, the method is able to detect the fact that system 1 (switching model) generates fewer topological signatures than system 2, which affects the size of the error bars. But within the period where the data are analyzed, the method correctly stratifies both datasets and identifies a low distortion in the data, suggesting that systems 1 and 2 have similar causal relationships.
In this section, we test the methodology using data containing physiological signals of patients. As has been suggested in other studies in animals, physical activity is associated with changes in different physiological signals (like heart rate, arterial pressure, etc.) [31]. Furthermore, the heart response to exercise (macroscopic scale) originates in complex molecular mechanisms. For instance, among subjects undergoing investigation for angina, some individuals with a low chronotropic index (a measure of heart rate response that corrects for exercise capacity) had impaired endothelial function, raised markers of systemic inflammation, and raised concentrations of N-terminal pro-brain natriuretic peptide (NT-proBNP) compared with those with a normal heart rate response [32].
Based on these notions, we assume that each individual generates a unique pattern for this integrated response (in this case, the response to physical exercise) due to the individual capacity, across several scales, to adapt and/or accommodate to changes in the environment, similar to the case presented in Figure 4.
The data used in this analysis have been obtained from the mobile health (mhealth) dataset [18], which comprises body motion and vital signs recordings of ten volunteers with diverse profiles while performing several physical activities. Sensors placed on the subject's chest, right wrist, and left ankle are used to measure the motion experienced by diverse body parts, namely, acceleration, rate of turn, and magnetic field orientation. The sensor positioned on the chest also provides 2-lead ECG measurements, which can potentially be used for basic heart monitoring, checking for various arrhythmias, or looking at the effects of exercise on the ECG. These activities were monitored and collected in an out-of-lab environment with no constraints on how they had to be executed, with the exception that the subjects should try their best when executing them (see Figure 5).
Ideally, if this system is observable, then low persistent entropy (low inherent complexity) must lead to a quantitative description of the accelerations and ECGs.
The final raw data are analyzed once they are sampled in a phase space as trajectories Γ (from which the heat maps in Figure 6 are constructed).
The homology groups were computed with the Dionysus software, as provided through the TDA package for the R language; the persistent homology was measured over a triangular grid using the Gaussian kernel density estimator.
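The density step of such a computation can be sketched independently of the R tooling. The example below is a plain Python stand-in, not the implementation used in the study: it evaluates a two-dimensional Gaussian kernel density estimate on a rectangular (rather than triangular) grid, and the function name, the bandwidth of 0.5, the grid spacing, and the point cloud are hypothetical. The persistent homology of the resulting density values would be computed in a separate step that is not shown here.

```python
import math


def gaussian_kde_grid(points, xs, ys, bandwidth):
    """Evaluate a 2-D Gaussian kernel density estimate on a rectangular grid.
    Returns a list of rows, one per value in ys."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    grid = []
    for gy in ys:
        row = []
        for gx in xs:
            s = sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2)
                         / (2.0 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * s)
        grid.append(row)
    return grid


# Hypothetical point cloud: two nearby points and one distant point.
points = [(0.0, 0.0), (0.2, -0.1), (3.0, 3.0)]
axis = [-2.0 + 0.1 * i for i in range(71)]    # grid from -2 to 5
density = gaussian_kde_grid(points, axis, axis, bandwidth=0.5)
print(max(max(row) for row in density))
```

The bandwidth and grid size play the role of the tuning parameters mentioned in the discussion: a bandwidth that is too large merges separate structures, and one that is too small turns every point into its own peak.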
The relative distance between the homology groups, $M(\Gamma_{kl})$, is shown in detail in Supplementary Appendix 4. We perform an analysis linking the acceleration measured with a chest sensor to the ECG and apply the methodology described in the previous section and in Figure 4. The final $M(\Gamma_{kl})$ and $M(\Gamma_{kl}^2)$ are represented with heat maps defined from 0 to 1, as shown in Figure 6. In these figures, we discover a relatively rich structure, with larger variations on the x axis (see also the box plots in Supplementary Appendix 4). According to these results, there is a relatively low distortion of the response between patients, which is more evident on the y and z axes. On the other hand, we can extract groups of patients with high relative distortion, which are listed in Table 1.
We found that the seventh patient overlaps all the groups; that is, any quantitative prediction based on the rest of the population will deliver $\Delta_{7l}\mathrm{Error}(f) > 0$, so that this patient might require a customized observation.
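The step from a normalized distortion matrix to a list of subjects that may need a customized model can be sketched as a simple thresholding rule. The rule, the threshold of 0.5, the function name, and the matrix values below are hypothetical and only illustrate the idea; they are not the values or the grouping procedure reported in the study.

```python
def flag_outliers(matrix, threshold=0.5):
    """Indices whose mean distortion to all other subjects exceeds threshold."""
    flagged = []
    for k, row in enumerate(matrix):
        others = [v for l, v in enumerate(row) if l != k]
        if sum(others) / len(others) > threshold:
            flagged.append(k)
    return flagged


# Hypothetical normalized distortion matrix for four subjects.
M = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.1, 1.0],
    [0.2, 0.1, 0.0, 0.8],
    [0.9, 1.0, 0.8, 0.0],
]
outliers = flag_outliers(M)
print(outliers)
```

In this toy matrix only the last subject is far from everyone else, mirroring the situation in which a single patient appears in every high-distortion group.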
However, a complete assessment additionally requires the analysis of this relative entropy (Figure 7), which we computed using the package "entropy" in R [33].
After computing the overlapping results between the groups, we find that patients 2, 4, and 7 also recur in all these groups. However, remnant differences in the entropy values indicate that persistent entropy remains for several patient groups, that is, that there is a persistent difference in the mechanisms leading to the response of each patient to physical exercise (Table 2).
These results are relevant when the mhealth data are used in the definition and training of predictive models. For example, activity recognition (AR) systems are typically built to recognize a predefined set of physical activities common in different applications, such as patient surveillance or as support systems to help individuals change or modify their habits. To this end, the data from the mhealth collection has been used to extract features and train AR models for the recognition of different physical activities such as walking, sitting, etc. [34]. However, when building a model, it is necessary to know whether the feature extraction can be generalized for any dataset and any new observations (low persistent entropy), or whether the model can only be generalized locally for selected datasets or observations [35]. Therefore, it is relevant to know if the dataset can be used to train models that can be validated over an entire patient population and be extrapolated to any new patient (extreme generalization), or can only be effectively used for specific subsamples of data (local generalization).
Our result helps to identify the degree of generalization of trained models and indicates that an AR model [34] can in principle be used for any patient except individuals with topological signatures similar to those of the seventh individual (and possibly the second and fourth patients). These patients require a personalized approach; that is, persistent entropy in the data may be an indication of heart failure or similar physiological impairments, which implies that AR models require additional features to account for such individuals.

DISCUSSION
The extraction of topological features is useful for pattern recognition and is an alternative to methods like 1-dimensional convolutional neural networks (CNNs) [36]. This methodology has already been used in different fields, particularly in biology and medicine, from the analysis and classification of tissue structures [11] to the analysis of time series in physiology [29], and can thus be considered a kind of unsupervised learning machine with advantages such as the following:
• It does not require large data samples to detect patterns in the data.
• It is much more transparent than CNNs in the way patterns are computed.
• It is robust against noise and data variations.
Topological persistence aims to identify structures in data and is suitable for pattern recognition. This technique is indeed used in Uniform Manifold Approximation and Projection for Dimension Reduction or in topological autoencoders [37] for the optimization of deep learning methods. These approaches are based on local manifold approximations and patch together their local fuzzy simplicial set representations to construct a topological representation of high-dimensional data [38], or they identify topological signatures and use them as topological constraints while training deep neural networks [37]. Therefore, in these approaches, the estimation of topological signatures is used as a constraint for the efficient training of neural networks, for instance, for image recognition [37], eventually improving the training and performance of deep learning models. The present study follows a different strategy, since we do not aim to implement a topological analysis to outperform currently established methods for the training of deep neural networks but to analyze and cluster topological signatures in the data, in this case, time series. Thus, the analysis of the topological signatures of the sampled data helps to better assess how a model can be generalized, for example, to estimate how other modeling methods for time series, like long short-term memory models [36], can be generalized for the analysis of novel datasets or for the extrapolation of predictions. We therefore implemented this kind of pattern recognition to analyze the structure of sampled time series and to find relative differences in order to
• assess the structure of time series,
• get hints about possible differences in underlying causal relations and intrinsic mechanisms, and
• help to drive the construction of predictive models, since it allows the detection of implicit bias in the sampled data.
Therefore, instead of accumulating and managing very large datasets, it seems a better strategy to first recognize which data collections are appropriate and balanced for training models that can be validated and are reliable for further extrapolation, improving the safety (reliability) of the conclusions derived from models while minimizing the amount of data used for model training. This means that an appropriate customization of models ab initio, after assessing persistent bias, is more efficient than the training of universal models on several datasets, which will be problematic to validate [39] (see footnote 2). We demonstrate that our method allows the analysis of sampled data, which in turn helps to find individual structures that can be interpreted as intrinsic bias. We tested this method on data collected from individuals performing physical activity (see Figures 6, 7). Based on this result, we estimate the persistent entropy in synthetic vs. real data, which helps us to assess whether a model can be defined. The results suggest that a few individuals probably require a customized model, that is, that the system is not completely observable. In this way, our method complements traditional modeling methods, such as the search for causal structures and the deduction of network models [40] or the use of artificial intelligence techniques, to distinguish organisms that potentially cannot be reduced to canonical models [9].
Footnote 2: Therefore, we think that any research in machine learning would do a better job by dealing with the natural symbiosis between information and life sciences rather than by trying to simulate or imitate human cognitive capabilities.
However, the methodology also has some disadvantages:
• It generates large datasets for the analysis.
• It requires the fine tuning of parameters like the grid size where the analysis is performed.
• It is computationally intensive.
For this reason, extended analysis and optimization of the use of this technique in several datasets is required to further improve and standardize its applicability.
In this way, this method can be used either for direct pattern recognition and analysis of data structures or paired with other machine learning methods, a promising perspective to increase the effectiveness and safety of trained models [15], as already shown in autoencoders for image recognition [37]. This will be the subject of future research, in particular for automated workflows to autonomously estimate the generalization of a model.

CONCLUSION
The quantitative description of complex systems is limited by how well the internal states of the system can be inferred from accessible data, which are in practice limited to a subset of variables. A system is called observable if we can reconstruct the system's complete internal state from its outputs [20]. Under this assumption, it should even be possible to define an optimal number of measurements in order to develop such quantitative descriptions.
In this research work, we have developed a method to qualitatively detect data imbalances by measuring the variability of the modeling error. If the data obtained from any organism's trajectory has a persistent structure, that is, having low persistent entropy, then the variability of the modeling error is low, implying that a model can be identified and trained.
Otherwise, the errors in the model can be assigned not only to the sampling techniques and model selection but also to persistent entropy originating from constant intersystem variations in internal states (between individuals, organisms, or systems in general). This has an impact on the way models and theoretical approaches are developed in any field, not only in biophysics but also in other complex fields like sociology and economics (in particular, in econophysics and sociophysics, for instance, with complex networks), since persistent entropy values generated by intrinsic mechanisms limit the observation of the system. This type of qualitative analysis prior to any data processing serves to better understand the data to be analyzed, as well as to avoid costly model training. To detect persistent structures in trajectories, we have implemented methods using persistent topology for the analysis of time series [41], which have become a promising way to detect patterns in data that differs from entropy-based methods, combined with a clustering analysis.
This methodology complements other methodologies, such as measuring the complexity of the data to be analyzed using, for instance, a Kolmogorov or a Chaitin complexity measurement [42] (see also Ref. 1; Discussion), together with the design of alternative learning architectures.
Our aim and vision was to use this method to alleviate problems like bias and disparity in the big datasets used to train machines, as well as the ever-increasing use of resources in modeling and machine learning: the increasing processing and storage of information requires a lot of energy and resources, which end up in the atmosphere in the form of greenhouse gases [44]. In addition, this method provides the capability to better select data for training and indicates the possibility of introducing methods such as intelligent bias into the modeling process to reduce the amount of training data [39]. The concrete application of this method in the analysis of physiological data helps to characterize structural deviations of the integrated data of a single individual from the rest of the population, which is relevant in machine learning and mathematical modeling in biology and medicine [9].
Of course, it is necessary to extensively test this methodology on different datasets and in different problems to achieve a better standardization. However, we have managed to demonstrate that with this method it is possible to recognize structures in training data and thereby better assess possible differences in causal relationships, which is relevant information for the derivation of models in complex systems (for instance, in biology and medicine) and, in general, for various applications in the field of artificial intelligence [3].

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: https://archive.ics.uci.edu/ml/datasets/MHEALTH+Dataset.

AUTHOR CONTRIBUTIONS
The author designed and implemented the theory and data analysis and wrote the article.