Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks

Developing data-driven solutions that address real-world problems requires understanding of these problems' causes and how their interaction affects the outcome, often with only observational data. Causal Bayesian Networks (BNs) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a "Causal Datasheet" that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made depending on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.


INTRODUCTION
The supplementary material is composed of five items:
A) Data Collection example: survey design of a study of sexual and reproductive health, corresponding to section 3.1 in the main text
B) Existing Datasets example: analysis of an existing global health survey, corresponding to section 3.2
C) Existing Datasets example: ALARM, corresponding to section 3.3
D) Non-uniform α Estimation and Meta-Feature Similarity
E) Additional Uniform α Estimates

Items A, B, and C correspond to the datasheets discussed in the results section of the main text; guidance on how these should be interpreted can be found in section 2.7. Item D is the work we have performed to address the limitations discussed in section 4 of the main text, along with some initial results. Item E provides extra results not given in the main text.

Supplementary Material A: Causal Datasheet for Data Collection: Sample Size 5,000-15,000, Number of Variables 30-60

Recommendations
The following is a list of recommendations for using the Dataset to learn a Bayesian network:
• PCOR summary
• Skeleton precision and recall: the minimum combination required to reach a given threshold
• V-structure performance

Proportion of Correct Odds Ratios
The Proportion of Correct Odds Ratios (PCOR) measures the proportion of interventional odds ratios that a learnt BN correctly estimates. This metric is calculated by first splitting odds ratios into three types of effects: protective (less than 1), detrimental (greater than 1), and neutral (uncertainty interval crosses 1). A matrix of these three effects is constructed for both the learnt and true odds ratios, as shown below.

Learnt \ True   Protective    Neutral        Detrimental
Protective      Correct       Not counted    Incorrect
Neutral         Incorrect     Not counted    Incorrect
Detrimental     Incorrect     Not counted    Correct

The above matrix is the one used in this evaluation. If the true effect is protective or detrimental, it is counted as correct when the learnt odds ratio matches. Matching a neutral effect is not considered a correct odds ratio, as we have found neutral effects can artificially inflate the results and create a false correlation.
The table below shows the median PCOR of different algorithms for each combination of specified parameters. The surface plot graphs expected PCOR performance for different combinations of algorithm, number of variables, and sample size. The color of the graph surface indicates the algorithm that is expected to have the highest performance for a given combination of parameters.
Below is a summary of expected PCOR performance:
• Item 1
• Insight 2
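As an illustration of the PCOR computation described above, here is a minimal sketch. It assumes each interventional odds ratio comes with lower and upper uncertainty bounds, and that pairs whose true effect is neutral are excluded from the denominator; the function names are illustrative, not part of any published tool:

```python
def classify_effect(odds_ratio, ci_low, ci_high):
    """Classify an interventional odds ratio as protective, neutral,
    or detrimental, following the three-way scheme above."""
    if ci_low <= 1.0 <= ci_high:  # uncertainty interval crosses 1
        return "neutral"
    return "protective" if odds_ratio < 1.0 else "detrimental"


def pcor(learnt, true):
    """Proportion of Correct Odds Ratios.

    learnt, true: lists of (odds_ratio, ci_low, ci_high) triples for the
    same interventions. True-neutral effects are skipped, since matching
    a neutral effect is not counted as correct.
    """
    correct = total = 0
    for l_or, t_or in zip(learnt, true):
        true_class = classify_effect(*t_or)
        if true_class == "neutral":
            continue
        total += 1
        correct += classify_effect(*l_or) == true_class
    return correct / total if total else float("nan")


# Example: two true effects (protective, detrimental); the learnt BN
# recovers the first but estimates the second as neutral.
print(pcor([(0.6, 0.4, 0.9), (0.9, 0.7, 1.2)],
           [(0.5, 0.3, 0.8), (1.8, 1.3, 2.4)]))
# -> 0.5
```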

Skeleton Precision
Precision, also known as the positive predictive value, is defined as the number of correct edge predictions made divided by all edge predictions made. The upper bound of precision is one, which corresponds to all predicted edges being correct. The lower bound is zero, which indicates that none of the predicted edges were correct.
The table below shows the median Skeleton Precision of different algorithms for each combination of specified parameters. The surface plot graphs expected Skeleton Precision performance for different combinations of algorithm, number of variables, and sample size. The color of the graph surface indicates the algorithm that is expected to have the highest performance for a given combination of parameters.
Below is a summary of expected Skeleton Precision performance:
• At any of the given operating characteristics, very high precision can be expected.
• The PC algorithm should be used if precision is a priority, as it has the lowest IQR overall.

Skeleton Recall
Recall is the proportion of true edges that are correctly predicted. The upper bound of recall is one, corresponding to all true edges being recovered. The lower bound is zero, indicating that none of the true edges have been captured.
The table below shows the median Skeleton Recall of different algorithms for each combination of specified parameters. The surface plot graphs expected Skeleton Recall performance for different combinations of algorithm, number of variables, and sample size. The color of the graph surface indicates the algorithm that is expected to have the highest performance for a given combination of parameters.
Below is a summary of expected Skeleton Recall performance:
• The OrderMCMC (qNML) algorithm obtains the best skeleton recall at every operating characteristic.
• At 5,000 samples, recall reduces from 1 at 30 variables to 0.971 at 60 variables.
• Increasing the number of samples from 5,000 to 15,000 improves the recall to a median of 1 at 60 variables.
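To make the skeleton precision and recall definitions above concrete, here is a minimal sketch. It assumes learnt and true skeletons are given as collections of node-pair edges; the function name is illustrative:

```python
def skeleton_metrics(learnt_edges, true_edges):
    """Compute skeleton precision and recall over undirected edge sets.

    learnt_edges, true_edges: iterables of (u, v) node pairs; frozensets
    make the comparison orientation-free, as skeletons are undirected.
    """
    learnt = {frozenset(e) for e in learnt_edges}
    true = {frozenset(e) for e in true_edges}
    tp = len(learnt & true)  # correctly predicted edges
    precision = tp / len(learnt) if learnt else float("nan")
    recall = tp / len(true) if true else float("nan")
    return precision, recall


# Example: true skeleton A-B, B-C; learnt skeleton A-B, A-C.
print(skeleton_metrics([("A", "B"), ("A", "C")], [("A", "B"), ("B", "C")]))
# -> (0.5, 0.5)
```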

V-Structure Precision
The precision with respect to V-Structures is the proportion of learnt V-Structures which are present in the true CPDAG.
The table below shows the median V-Structure Precision of different algorithms for each combination of specified parameters. In general, users should select combinations of algorithm, number of variables, and sample size that maximize V-Structure Precision and minimize IQR. The surface plot graphs expected V-Structure Precision performance for different combinations of algorithm, number of variables, and sample size. The color of the graph surface indicates the algorithm that is expected to have the highest performance for a given combination of parameters.

Below is a summary of expected V-Structure Precision performance:
• OrderMCMC (BIC) should be used in order to maximize V-Structure Precision.

V-Structure Recall
Recall of V-Structures is the proportion of the true V-Structures that are present in the learnt CPDAG.
The table below shows the median V-Structure Recall of different algorithms for each combination of specified parameters. The surface plot graphs expected V-Structure Recall performance for different combinations of algorithm, number of variables, and sample size. The color of the graph surface indicates the algorithm that is expected to have the highest performance for a given combination of parameters.
Below is a summary of expected V-Structure Recall performance:
• OrderMCMC (qNML) should be used in order to maximize V-Structure Recall.
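As an illustration of how the V-Structure precision and recall above can be computed, below is a minimal sketch. It assumes DAGs are represented as sets of directed (parent, child) edges, extracts colliders a → c ← b with non-adjacent a and b, and scores precision and recall by set overlap; the names are illustrative:

```python
from itertools import combinations


def v_structures(dag_edges):
    """Extract V-structures (a -> c <- b with a, b non-adjacent) from a
    DAG given as a set of directed (parent, child) edges."""
    edges = set(dag_edges)
    adjacent = {frozenset(e) for e in edges}  # adjacency ignores direction
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in adjacent:  # unshielded collider
                found.add((frozenset((a, b)), child))
    return found


def v_structure_metrics(learnt_edges, true_edges):
    """V-Structure precision and recall between learnt and true DAGs."""
    learnt, true = v_structures(learnt_edges), v_structures(true_edges)
    tp = len(learnt & true)
    precision = tp / len(learnt) if learnt else float("nan")
    recall = tp / len(true) if true else float("nan")
    return precision, recall


# Example: true DAG A -> C <- B; learnt DAG recovers it plus an extra edge.
print(v_structure_metrics([("A", "C"), ("B", "C"), ("C", "D")],
                          [("A", "C"), ("B", "C")]))
# -> (1.0, 1.0)
```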

Supplementary Material B: Causal Datasheet for an Existing Dataset: Maternal Health in Uttar Pradesh

Correctness of Causal Effects
The Proportion of Correct Odds Ratios (PCOR) measures the proportion of interventional odds ratios that a learnt BN correctly estimates. This metric is calculated by first splitting odds ratios into three types of effects: protective (less than 1), detrimental (greater than 1), and neutral (uncertainty interval crosses 1). A matrix of these three effects is constructed for both the learnt and true odds ratios, as shown below.

Learnt \ True   Protective    Neutral        Detrimental
Protective      Correct       Not counted    Incorrect
Neutral         Incorrect     Not counted    Incorrect
Detrimental     Incorrect     Not counted    Correct

The above matrix is the one used in this evaluation. If the true effect is protective or detrimental, it is counted as correct when the learnt odds ratio matches. Matching a neutral effect is not considered a correct odds ratio, as we have found neutral effects can artificially inflate the results and create a false correlation.

Skeleton Precision
Precision, also known as the positive predictive value, is defined as the number of correct edge predictions made divided by all edge predictions made. The upper bound of precision is one, which corresponds to all predicted edges being correct. The lower bound is zero, which indicates that none of the predicted edges were correct.

Skeleton Recall
Recall is the proportion of true edges that are correctly predicted. The upper bound of recall is one, corresponding to all true edges being recovered. The lower bound is zero, indicating that none of the true edges have been captured.

Learning the Direction
The following plots show the V-structure learning performance to expect given the Maternal Health in Uttar Pradesh dataset's properties:

V-Structure Precision
The precision with respect to V-Structures is the proportion of learnt V-Structures which are present in the true CPDAG.

Improving with More Samples
If the performance shown in this datasheet is not acceptable, improvements can generally be made by adding more samples.

Supplementary Material C: Causal Datasheet for an Existing Dataset: ALARM

Recommendations
The following is a list of recommendations for using the ALARM Dataset to learn a Bayesian network:
• Given the 0.80 PCOR threshold supplied by the user, the ALARM Dataset can be used for learning a Bayesian network without further samples, constraints, or expert input.
• It is recommended that the OrderMCMC algorithm is used, as it obtains the best performance.
• PC should be used if skeleton precision is the priority, while OrderMCMC should be used if skeleton recall is the priority.
• OrderMCMC should be used if successfully capturing V-Structures is a priority.

Correctness of Causal Effects
The Proportion of Correct Odds Ratios (PCOR) measures the proportion of interventional odds ratios that a learnt BN correctly estimates.

Learning the Skeleton
The following plots show the skeleton learning performance to expect given the ALARM Dataset's properties:

Skeleton Precision
Precision, also known as the positive predictive value, is defined as the number of correct edge predictions made divided by all edge predictions made.

Skeleton Recall
Recall is the proportion of true edges that are correctly predicted.

Learning the Direction
The following plots show the V-structure learning performance to expect given the ALARM Dataset's properties:

V-Structure Precision
The precision with respect to V-Structures is the proportion of learnt V-Structures which are present in the true CPDAG.

V-Structure Recall
Recall of V-Structures is the proportion of the true V-Structures that are present in the learnt CPDAG.

Improving with More Samples
If the performance shown in this datasheet is not acceptable, improvements can generally be made by adding more samples. The following plot shows the expected performance gain with further samples for the ALARM Dataset:

Improving with Fewer Variables
If the performance shown in this datasheet is not acceptable, improvements can generally be made by reducing the number of variables. The following plot shows the expected performance gain from using fewer variables for the ALARM Dataset:

Supplementary Material D: Preliminary Results on Non-Uniform α Estimation and Meta-Feature Similarity
In this section we provide some insight into the work we are performing to resolve the limitations discussed in the main text, along with some initial results.

Non-uniform α Estimation
We have observed in the ALARM dataset that we can overestimate performance when the parameters are non-uniform. It follows that, to improve the current performance estimates, we should attempt to capture this case in our synthetic generation. In our preliminary test of this idea, we employ the following algorithm to estimate the non-uniform alpha:

Algorithm 1: Alpha-Estimation Synthetic Generation Scheme
Given:
• An initial estimate BN B of a real dataset
• A synthetic DAG
1. For each real node:
   a. Extract the corresponding parameters θ_i.
   b. Stack this parameter estimate t times, adding normally distributed noise N(0, v). Both t and v are hyper-parameters which control the amount of regularization on the alpha estimates.
   c. Using Maximum Likelihood Estimation, estimate alpha for each CPT in B (Huang, 2005).
2. For each synthetic node:
   a. Assign each alpha to a synthetic node inversely proportionate to the in-degree, with some random noise.
   b. Sample, using the assigned alpha, from a Dirichlet distribution to acquire the synthetic parameters.

As described in the algorithm, this method requires an estimate DAG in order to function. How to best form this estimate, whether that be an agreement graph (as in the Intersection-validation paper (Viinikka et al., 2018)), a high-recall DAG, or a high-precision DAG, is future work. In the following experiments we provide the oracle BN to the algorithm. Figure S1 shows the comparison between the CDG-T estimates and the real ALARM performance. There is generally a closer alignment compared to the estimates formed using a uniform alpha value, particularly for V-Structure recall (Figure S1d), which no longer estimates performance with a median of 1.
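To make Algorithm 1's estimation step concrete, below is a minimal sketch of steps 1b-1c and 2b for a single CPT distribution. It substitutes Minka's (2000) fixed-point Dirichlet MLE for the Huang (2005) estimator cited above, and the hyper-parameter defaults are illustrative assumptions, not the values used in our experiments:

```python
import numpy as np
from scipy.special import digamma, polygamma


def _inv_digamma(y, n_steps=20):
    """Invert the digamma function via Newton's method (Minka, 2000)."""
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(n_steps):
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x


def estimate_alpha(theta, t=20, v=0.05, n_iter=200, seed=0):
    """Estimate a Dirichlet concentration vector for one CPT distribution.

    theta: 1-D array of multinomial parameters from the initial-estimate
    BN (step 1a). Step 1b stacks theta t times with N(0, v) noise; step 1c
    fits alpha by maximum likelihood (fixed-point iteration)."""
    rng = np.random.default_rng(seed)
    stacked = np.clip(theta + rng.normal(0.0, v, size=(t, len(theta))),
                      1e-6, None)
    stacked /= stacked.sum(axis=1, keepdims=True)  # keep valid distributions

    log_p_bar = np.log(stacked).mean(axis=0)
    alpha = np.ones(len(theta))
    for _ in range(n_iter):
        alpha = _inv_digamma(digamma(alpha.sum()) + log_p_bar)
    return alpha


# Step 2b: sample synthetic CPT parameters from the assigned alpha.
alpha_hat = estimate_alpha(np.array([0.7, 0.2, 0.1]))
synthetic_theta = np.random.default_rng(1).dirichlet(alpha_hat)
```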
To further test the alpha estimation we use another expert-designed dataset with a known ground truth: Insurance (Binder et al., 1997). Because Insurance is a dataset designed to capture the probability of car theft, it contains events which are very low frequency. Much like ALARM, some CPTs of the Insurance dataset have a high marginal imbalance. Therefore, this dataset suits our purposes as another test case for our alpha estimation technique. Compared to synthetic datasets which assume a uniform alpha, there is overall much better alignment when we use the alpha estimation technique, bringing both skeleton and V-structure estimates further into alignment with the ground truth (Figure S2; results from the uniform estimates can be found in Supplementary Material E).
Finally, we test our alpha estimation algorithm on another expert-designed BN: Child (Dawid, 1992). This dataset does not have any notable characteristics in its CPTs, and could generally be generated using Dirichlet distributions with a uniform alpha. As anticipated, our calculated performance estimates for this dataset using the uniform alpha and the estimated alpha have minimal differences (Figure S3). All of the expert-designed BNs can be found in the BNlearn repository: https://www.bnlearn.com/bnrepository/ (Scutari, 2009).

Meta-feature Similarity Score
In presenting performance estimates from synthetic data, we assume that they are similar enough to a given real dataset to be valid. Because a practitioner must specify some set of unobservable BN characteristics, further guidance on whether these selections were reasonable is desirable.
We borrow the concept of defining meta-features and computing dataset similarities from the Bayesian optimization hyperparameter initialization literature (Wistuba et al., 2015). Currently, because our intended datasets are categorical, a small set of information-theoretic meta-features (i.e., Shannon's entropy and the Concentration Coefficient) are computed for both the real and synthetic datasets (Michie et al., 1994; Alexandros and Melanie, 2001). Similarity between the synthetic and real datasets can then be assessed via a meta-feature similarity score, which is simply the mean cosine similarity between the meta-features of the real and synthetic datasets:

\mathrm{MF\text{-}Similarity}(R, S) = \frac{1}{|S|} \sum_{s \in S} \frac{M(R) \cdot M(s)}{\lVert M(R) \rVert \, \lVert M(s) \rVert}

where M is a function computing the meta-features for a dataset, R is a real dataset, and S is the set of synthetic datasets. Each meta-feature is computed as the set {min, median, max} over the variables. Careful selection of meta-features is crucial in order to obtain a reliable score. One should avoid selecting meta-features with any overlap to the user-specified properties (e.g., sample size, number of variables). As a gut check, we computed the meta-feature similarity for the datasets generated using the non-uniform alpha estimates, as well as the datasets generated assuming a uniform alpha vector, for some common datasets (Beinlich et al., 1989; Binder et al., 1997; Dawid, 1992) and the Surgo Household dataset (Table S1; meta-features computed using the PyMFE library (Alcobaça et al., 2020)). As expected, synthetic datasets are more similar to the source datasets when non-uniform alpha estimates are used. While there appears to be a ceiling effect, we hope this provides a basis by which estimates in a causal datasheet can be used practically.

Table S1. Meta-feature similarity scores for each dataset using different parameter generation schemes. The estimate DAG provided to the alpha estimation scheme for the Surgo Household dataset was learnt using the OrderMCMC + qNML algorithm.

Figure S5. The performance alignment for CDG-T and Child using uniform alpha estimates.
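The following sketch illustrates the meta-feature similarity computation. For simplicity it uses only the Shannon-entropy meta-feature (computed by hand rather than through PyMFE) and treats datasets as integer-coded categorical arrays; the names and toy data are illustrative:

```python
import numpy as np


def meta_features(data):
    """{min, median, max} of Shannon's entropy over the columns of an
    integer-coded categorical dataset (rows = samples, columns = variables)."""
    entropies = []
    for col in data.T:
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        entropies.append(float(-(p * np.log2(p)).sum()))
    return np.array([min(entropies), float(np.median(entropies)),
                     max(entropies)])


def mf_similarity(real, synthetic_sets):
    """Mean cosine similarity between the real dataset's meta-features
    and those of each synthetic dataset."""
    m_r = meta_features(real)
    sims = []
    for s in synthetic_sets:
        m_s = meta_features(s)
        sims.append(m_r @ m_s / (np.linalg.norm(m_r) * np.linalg.norm(m_s)))
    return float(np.mean(sims))


# Toy usage: five synthetic replicates of a random categorical dataset.
rng = np.random.default_rng(0)
real = rng.integers(0, 3, size=(500, 10))
synthetic = [rng.integers(0, 3, size=(500, 10)) for _ in range(5)]
print(mf_similarity(real, synthetic))
```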