Preventing Failures by Dataset Shift Detection in Safety-Critical Graph Applications

Dataset shift refers to the problem where the input data distribution may change over time (e.g., between training and test stages). Since this can be a critical bottleneck in several safety-critical applications such as healthcare, drug-discovery, etc., dataset shift detection has become an important research issue in machine learning. Though several existing efforts have focused on image/video data, applications with graph-structured data have not received sufficient attention. Therefore, in this paper, we investigate the problem of detecting shifts in graph structured data through the lens of statistical hypothesis testing. Specifically, we propose a practical two-sample test based approach for shift detection in large-scale graph structured data. Our approach is very flexible in that it is suitable for both undirected and directed graphs, and eliminates the need for equal sample sizes. Using empirical studies, we demonstrate the effectiveness of the proposed test in detecting dataset shifts. We also corroborate these findings using real-world datasets, characterized by directed graphs and a large number of nodes.


INTRODUCTION
Most machine learning (ML) applications, e.g., healthcare, drug-discovery, etc., encounter dataset shift when operating in the real-world. The reason for this comes from the bias in the testing conditions compared to the training environment introduced by experimental design. It is well known that ML systems are highly susceptible to such dataset shifts, which often leads to unintended and potentially harmful behavior. For example, in ML-based electronic health record systems, input data is often characterized by shifting demographics, where clinical and operational practices evolve over time and a wrong prediction can threaten human safety.
Although dataset shift is a frequent cause of failure of ML systems, very few ML systems inspect incoming data for a potential distribution shift (Bulusu et al., 2020). While some practical methods such as (Rabanser et al., 2019) have been proposed for detecting shifts in applications with Euclidean structured data (speech, images, or video), there are limited efforts in solving such issues for graph structured data that naturally arises in several scientific and engineering applications. In recent years there has been a surge of interest in applying ML techniques to structured data, e.g. graphs, trees, manifolds etc. In particular, graph structured data is becoming prevalent in several high-impact applications including bioinformatics, neuroscience, healthcare, molecular chemistry and computer graphics. In this paper, we investigate the problem of detecting distribution shifts in graph-structured datasets for responsible deployment of ML in safety-critical applications. Specifically, we propose to solve the problem of detecting shifts in graph-structured data through the lens of statistical two-sample testing. Broadly, the objective in two-sample testing for graphs is to test whether two populations of random graphs are different or not based on the samples generated from each of them.
Two-sample testing has been of significant research interest due to its broad applicability. An important class of testing methods relies on summary metrics that quantify the topological differences between networks. For example, in brain network analysis, commonly adopted topological summary metrics include the global efficiency (Ginestet et al., 2011) and network modularity (Ginestet et al., 2014). An inherent challenge with these approaches is that the topological characteristics depend directly on the number of edges in the graph, and can be insufficient in practice. An alternative class of methods is based on comparing the structure of subgraphs to produce a similarity score (Shervashidze et al., 2009;Macindoe and Richards, 2010). For example, Shervashidze et al. (2009) used the earth mover's distance between the distributions of feature summaries of their constituent subgraphs.
While these heuristic methods are reasonably effective for comparing real-world graphs, not until recently that a principled analysis of hypothesis testing with random graphs was carried out. In this spirit, Ginestet et al. (2017) developed a test statistic based on a precise geometric characterization of the space of graph Laplacian matrices. Most of these approaches for graph testing based on classical two-sample tests are only applicable to the restrictive lowdimensional setting, where the population size (number of graphs) is larger than the size of the graphs (number of vertices). To overcome this challenge, Tang et al. (2017a) proposed a semi-parametric two-sample test for a class of latent position random graphs, and studied the problem of testing whether two dot product random graphs are drawn from the same population or not. Other testing approaches that focused on hypothesis testing for specific scenarios, such as sparse networks (Ghoshdastidar et al., 2017a) and networks with a large number of nodes (Ghoshdastidar et al., 2017b), have been developed. More recently, Ghoshdastidar and von Luxburg (2018) developed a novel testing framework for random graphs, particularly for the cases with small sample sizes and the large number of nodes, and studied its optimality. More specifically, this test statistic was based on the asymptotic null distributions under certain model assumptions.
Unfortunately, all these approaches are limited to testing undirected graphs under the equal sample size (for two graph populations) setting. In real-world dataset shift detection problems, these assumptions are extremely restrictive, making existing approaches inapplicable to several applications. In order to circumvent these crucial shortcomings, we develop a novel approach based on hypothesis testing for detecting shifts in graph-structured data, which is more flexible (i.e., accommodates 1) both undirected and directed graphs and 2) unequal sample size cases). Moreover, it is highly effective even when the sample size grows. Notice that, similar to the setting in Ghoshdastidar and von Luxburg (2018), we also consider scenarios where all networks are defined from the same vertex set, which is common to several real-world applications. The main contributions of this paper are summarized below: • We propose a new test statistic that can be applied to undirected graphs as well as directed graphs and/or unweighted graphs as well as weighted graphs, while eliminating the equal sample size requirement. The asymptotic distribution for the proposed statistic, based on the well-known U-statistic, is derived. • A practical permutation approach based on a simplified form of the statistic is also proposed. • We compare the new approach with existing methods for graph testing in diverse simulation settings, and show that the proposed statistic is more flexible and achieves significant performance improvements. • In order to demonstrate the usefulness of the proposed method in challenging real-world problems, we consider several applications (including a healthcare application), and show the effectiveness of our approach.

PRELIMINARIES
We consider the following two-sample setting. Let two random graph populations with d vertices be denoted as A 1 , . . . , A m from P ∈[0, 1] d×d and B 1 , . . . , B n from Q ∈[0, 1] d×d with their adjacency matrices A 1 , . . . , A m and B 1 , . . . , B n , respectively. We are concerned with testing hypotheses: Notice that we consider the cases where each population consists of independent and identically distributed samples, which encompasses a wide-range of network analysis problems, see, e.g., Holland et al. (1983), Newman andGirvan (2004), Newman (2006). In contrast to existing formulations, e.g., Ghoshdastidar and von Luxburg (2018), we consider a more flexible setup where 1) the sample sizes m and n are allowed to be different and 2) the graphs in p and Q can be weighted and/or directed.
While there have several efforts to two-sample testing of graphs (Bubeck et al., 2016;Gao and Lafferty, 2017;Maugis et al., 2017), recent works such as Tang et al. (2017a), Tang et al. (2017b); Ginestet et al. (2017) have focused on designing more general testing methods that are applicable to practical settings. For example, Ginestet et al. (2017) proposed a practical test statistic based on the correspondence between an undirected graph and its Laplacian under the inhomogeneous Erd} os-Rényi (IER) assumption, which means all nodes are independently generated from a Bernoulli distribution (see details in Section 3). The test statistic, under the assumption of equal sample sizes m, can be described as follows: where The authors showed that T gin converges to a chi-square distribution as m → ∞ under H 0 . However, this statistic can be interpreted as Hotelling's T 2 statistic for multivariate data, thus leading to no performance guarantees for "small m and large d" scenario. This is because the variance estimates used in Eq. 2 are not stable for small m and large d, especially when graphs are sparse.
Recently, Ghoshdastidar and von Luxburg (2018) proposed a new class of test statistics, designed for different scenarios under the IER model assumption. More specifically, they focused on cases with small m and large d. For cases with m > 1, the following test statistic was used: While it was suggested by the authors to perform this test using bootstraps from the aggregated data, this could be challenging for sparse graphs, since it is difficult to construct bootstrapped statistics from an operator norm. Hence, they considered an alternate test statistic based on the Frobeniusnorm as follows: It was shown that this test is provably effective and more reliable. Furthermore, they derived the asymptotic normality of T fro as d → ∞ to make the method instantly applicable without the bootstrap procedure. Despite the good properties of this method, this test can be used only when the two sample sizes are equal, and when graphs are undirected. In the rest of this paper, we develop a new test statistic which addresses these two crucial limitations.

PROPOSED TEST
To carry out two-sample testing, we want to measure the distance between two populations. Here, we utilize the Frobenius distance as the evidence for discrepancy between two populations: Next, we provide finite sample estimates of this quantity. To accommodate more general settings for random graphs, the new test statistic is defined as follows: where Note that the proposed test statistic accommodates scenarios where 1) the sample sizes m and n are different and 2) the graphs in p and Q are weighted and/or directed.
Next, we analyze the theoretical properties of the proposed test. For the ease of theoretical analysis, we focus on the case where graphs are unweighted and undirected. However, the proposed test and algorithmic tools are applicable to weighted and/or directed graph scenarios which is the main focus of the paper and is considered in our experimental evaluations. More specifically, in our theoretical analysis, we assume that graphs are drawn from the inhomogeneous Erd} os-Rényi (IER) random graph process, which is considered as an extended version of the Erd} os-Rényi (ER) model from Bollobás et al. (2007). In other words, we consider unweighted and undirected random graphs, where edges occur independently without any additional structural assumption on the population adjacency matrix. Note, the IER model encompasses other models studied in the literature including random dot product graphs (Tang et al., 2017b) and stochastic block models (Lei et al., 2016). A graph G ∈ [0, 1] d×d from a population symmetric adjacency p with zero diagonal is considered to be an IER graph if Here, d denotes the cardinality of the vertex set. Next we analyze the theoretical properties of the proposed test under IER assumption.
LEMMA 3.1. T new is an unbiased empirical estimate of T, that is, PROOF. Under the IER assumptions, for all i, j 1, . . . , d, we have since A k and B l are mutually independent (k 1, . . . , m, l 1, . . . , n). Then, In the form of T ij , the first term and the second term represent a similarity (closeness) within two samples, and the last term represents similarity between two samples. Hence, a relatively large value of T new is the evidence against the null hypothesis. Note that the proposed statistic does not require equal sample sizes and undirected graphs assumptions.
When m n, we have a simpler form of the estimate. where and Since the proposed estimate has a form of U-statistics, which provides a minimum-variance unbiased estimator for T (Hoeffding, 1992;Serfling, 2009), the asymptotic distribution of T new can be derived based on the asymptotic results of U-statistics.
where σ 2 var z (E z' h(z, z ′ ) ij ). Under H 0 , the U-statistic is degenerate and where ξ u i.i.d N(0, 1) and λ u are the solutions of PROOF. These results can be obtained by applying the asymptotic properties of U-statistics as given in Serfling (2009) and the IER assumptions.
Having devised the test statistic, our next aim is to determine whether the new test statistic T new is large enough to be outside the 1 − α quantile of the limiting null distribution in Eq. 11, where a is the significance level of the test. One difficulty in implementing this test is that the asymptotic null distribution 11) and its a quantile do not have an analytic form unless λ u 0 or 1. Therefore, in order to estimate this quantile, we propose a permutation approach on the aggregated data. The main advantage of this method is that it yields a valid level a test in finite-sample scenarios (Lehmann and Romano, 2006). To this end, we first consider a simpler form of the test statistic (based on T new ) defined as follows: where Although we do not use the last term of T ij in the definition of T ′ ij , the performance of the test statistic T ′ new achieved by incorporating similarities in two samples is still maintained in the permutation framework. The permutation test is summarized in Algorithm 1; its computational cost is O(R(m∨n) 2 ), where (m∨n) indicates the maximum among m and n.
Algorithm 1Permutation test using T ′ new .  (2018) where the test is reliable even for a small number of samples, due to its asymptotic distribution, our test procedure needs a reasonable number of samples to implement the permutation test. Based on simulations, we see that as low as four samples are sufficient to obtain reliable results.

EXPERIMENTS
Here, we first examine the performance of the new test statistics under diverse settings through simulation studies. Later, we will apply the new test to real-world applications.

Simulated Data
To evaluate the performance of the new test, we examine sparse graphs from stochastic block models with two communities as studied in Tang  (2018). Specifically, we consider sparse graphs with d nodes where the same d/2 size community is constructed with an edge probability p and d/2 size different community with an edge probability q. In other words, we define p and Q as follows: P : p q q p d×d vs Q : p + ϵ q q p+ ϵ d×d .
We generate m samples from p and n samples from Q. Under the null, ϵ 0, implying P Q, whereas ϵ > 0 under H 1 , implying P ≠ Q. Following Ghoshdastidar and von Luxburg (2018), we set p 0.1, q 0.05, and ϵ 0 for null, whereas ϵ 0.04 for the alternative hypothesis. We examine the performance of the new test for different choices of d ∈ {100, 200, 300, 400, 500}.
The performance of the test based on T ′ new is studied and compared to existing methods. T fro in Ghoshdastidar and von Luxburg (2018) is the bootstrap test based on T fro , and T asymp denotes the normal dominance test based on the asymptotic distribution of T fro (also from Ghoshdastidar and von Luxburg (2018)). We denote the new test which is the permutation test based on T ′ new as T ′ new. The estimated power is calculated as the number of null rejections at α 0.05 level out of 100 independent trials for each of these methods. For T fro and T new, p-values are determined by 1,000 permutation runs to have a reliable comparison. Figure 1 shows results for the undirected graph case under different settings. When two sample sizes are equal (upper panels), where existing methods can be applied, we see that the proposed test outperforms all other methods. Note that, when the sample size of two graph populations are different (i.e., m ≠ n), the existing methods cannot be applied. We see that the proposed test still performs well under sample imbalance and the large d regime.
We also evaluate the performance of the new test for directed graphs under various configurations. (Figure 2). The existing methods are not applicable to directed graphs, but we transform T fro so that it can be applied to directed graphs. The results show that the new test also has better power than the existing method in two-sample testing for directed graph and works well for large graphs.
Next, we examine the effect of the sparsity on the performance of the tests. To this end, we consider the same setting as above, but with different choices of ϵ ∈ {0.02, 0.03, 0.04} for each of methods. Small ϵ implies that there is small difference between p and Q, making the tests more difficult to detect discrepancy between two samples. Table 1 shows results for undirected graphs with variations in the sparsity level ϵ. We see that, in general, the proposed method is consistently superior to existing methods. This indicates that our test statistic is more effective in detecting the inhomogeneity between two samples than the existing methods. The effect of a sparsity level ϵ on the performance of the proposed test for directed graphs can be found in Table 2. We see that the proposed test also performs better than the existing method for directed graph settings, and as expected, the power increases as ϵ or the number of samples increases. This observation becomes particularly evident when we have a large number of samples. To this end, we study how the performance of the tests is affected by the number of samples. For this study, we consider m n ∈ {10, 20, 50} with relatively small graphs d ∈ {50, 100, 150, 200} and fix ϵ 0.02. This analysis is designed to reveal the potential impact of sample size in high-dimensional settings. Tables 3, 4 report numerical results for the performance of the tests with varying number of samples. We see that the proposed test in general outperforms the existing tests for both undirected and directed graphs. Hence, we can claim that the new test works well in high-dimensional settings.

Phone-Call Network
The MIT Media Laboratory conducted a study following 87 subjects who used mobile phones with a pre-installed device that can record call logs. The study lasted for 330°days from July 2004 to June 2005 (Eagle et al., 2009). Given the richness of this dataset, one question of interest to answer is that whether the phone call patterns among subjects are different between weekends and weekdays. These patterns can be viewed as a representation of the personal relationship and professional relationships of a subject. Removing days with no calls among subjects, there are t 299 networks in total (corresponding to number of days) and 87 subjects (or nodes) with adjacency matrices N t with value one for element (i, j) if subject i called j on day t and 0 otherwise. This in turn comprises of 85°days in weekends and 214°days in weekdays. This is an example of unweighted directed graphs with imbalanced sample sizes. The test statistic and corresponding p-value are shown in Table 5. We see that the new test rejects the null hypothesis of equal distribution at 0.05 significance level. This outcome is intuitively plausible as phone call patterns in weekends (personal) can be different from the patterns in weekdays (work).

Safety-Critical Healthcare Application
Modeling relationships between functional or structural regions in the brain is a significant step toward understanding, diagnosing, and eventually treating a gamut of neurological conditions including epilepsy, stroke, and autism. A variety of sensing mechanisms, such as functional-MRI, Electroencephalography (EEG), and Electrocorticography (ECoG), are commonly adopted to uncover patterns in both brain structure and function. In particular, the resting state fMRI (Kelly et al., 2008) has been proven effective in identifying diagnostic biomarkers for mental health conditions such as the Alzheimer disease (Chen et al., 2011) and autism (Plitt et al., 2015). At the core of these neuropathology studies is predictive models that map variations in brain functionality, obtained as timeseries measurements in regions of interest, to clinical scores. For example, the Autism Brain Imaging Data Exchange (ABIDE) is a collaborative effort (Di Martino et al., 2014), which seeks to build a data-driven approach for autism diagnosis. Further, several published studies have reported that predictive models can reveal patterns in brain activity that act as effective biomarkers for classifying patients with mental illness (Plitt et al., 2015). Following current practice (Parisot et al., 2017), graphs are natural data structures to model the functional connectivity of human brain (e.g. fMRI), where nodes correspond to the different functional regions in the brain and edges represent the functional correlations between the regions. The problem of defining appropriate metrics to compare these graphs and thereby identify suitable biomarkers for autism severity has been of significant research interest. We show that the proposed two-sample test is highly effective at characterizing stratification based on demographics (e.g. age, gender) as well as autism severity states (normal vs abnormal) across a large population of brain networks.    In the dataset, there are total 871 graphs and each graph consists of 111 nodes (functional regions). Through this example, we study the effectiveness of our approach under the weighted and undirected graph setting. In particular, we focus on detecting variations across stratification arising from demographics (gender, age). Specifically, groups of normal control subjects as well as those diagnosed with Autism Spectrum Disorders (ADS) are further sub-divided according to their gender (Male or Female) and age (under 20 or over 20), and we compare these sub-groups using the proposed test. Table 6 shows the distribution of graphs in the dataset and Figure 3 shows an example of the network structure of normal-male and normal-female groups.
We conduct the two-sample test based on T ′ new for each group with 10,000 permutations and the results are summarized in Table 7. We see that the new test rejects the null hypothesis of homogeneity in groups with respect to the treatment and age at 5% significance 6 | Distribution of graphs. "M" and "F" indicate male and female, respectively. '<20' and '>20' represent age less than 20 and over 20, respectively.  level (Normal>20 vs ADS<20 and Normal<20 vs ADS>20). In addition, the new test rejects the null hypothesis of homogeneity in both normal and ADS groups with respect to the age difference (Normal<20 vs Normal>20 and ADS<20 vs ADS>20).
This conclusion indicates there is a dataset shift even within the same normal and ADS groups, depending on the age. Hence, the fact that normal and ADS groups are considered differently by age may affect the machine learning subjects classification and prediction task in population. Moreover, with the dataset in which the normal group and ADS group are determined differently by age and not by gender, the machine learning classification and prediction model may not be reliable. Hence, detecting dataset shift shed some light on the machine learning task for more reliable results.
We also compare the new test with the existing method T fro to this example. Note that the existing method T asymp may not be reliable due to the small number of nodes. Since T fro is only applicable to the balanced sample sizes, we randomly choose 54 graphs from each group as the smallest sample size among the groups is 54. We run the tests 100 times at the significance level 5%. The test powers are shown in Table 8. We see that the new test in general outperforms T fro. Compared to the results in Table 7, some examples show inconsistent performance of the tests. This is because we only consider a subset of graphs due to the limitation of the existing approaches in that they cannot be applied to unbalanced sample size examples.

CONCLUSION
We propose the new two-sample test statistic for graph-structured data. Unlike the existing methods, the new test statistic is more versatile, which is applicable to directed graphs, imbalanced sample size cases, and even weighted graphs. The asymptotic distribution of the test statistic is presented and a practical testing procedure is proposed. The performance of the new method is studied under a number of settings. Experiments demonstrate that the new test in general outperforms state-of-the-art tests. The proposed test is also applied to two real datasets (including a safety-critical healthcare application), and we reveal that the new approach is effective to detecting the heterogeneity between disparate samples.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
HS developed the main method and proposed the testing procedure based on the new test statistic. He conducted the simulation experiments and real data analysis. JJ and BK provided the intuition and the direction of the method and worked on simulation experiments with HS. JJ provided the real dataset, and JJ and BK discussed about the results with HS. HS, JJ, and BK generated the paper together.