Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-sourced software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Measuring fidelity is based on statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distances to closest record with respect to the training data. By showing that the synthetic samples are just as close to the training as to the holdout data, we yield strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and compare these then to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup is made available open-source. The results highlight the need to systematically assess the fidelity just as well as the privacy of these emerging class of synthetic data generators.


INTRODUCTION
Self-supervised generative AI has made significant progress over the past years, with algorithms capable of creating "shockingly" realistic synthetic data across a wide range of domains. Illustrations like those presented in Figures 1, 2 are particularly impressive within domains of unstructured data, like images (Karras et al., 2017) and text (Brown et al., 2020). These samples demonstrate that it is becoming increasingly difficult for us humans, as well as for machines, to discriminate actual from machine-generated fake data. While less prominent, similar progress is made within structured data domains, such as synthesizing medical health records (Choi et al., 2017;Goncalves et al., 2020;Krauland et al., 2020), census data (Freiman et al., 2017), human genoms (Yelmen et al., 2021), website traffic (Lin et al., 2020) or financial transactions (Assefa, 2020). These advances are particularly remarkable considering that they do not build upon our own human understanding of the world, but "merely" require a flexible, scalable self-supervised learning algorithm that teaches itself to create novel records based on a sufficient amount of training data. These AI-based approaches, with Generative Adversarial Networks (Goodfellow et al., 2014) and Variational Autoencoders (Kingma and Welling, 2013) being two prominent representatives, have in common that they fit high-capacity deep neural networks to training data, that can then be leveraged for sampling an unlimited amount of new records. This is in contrast to traditional synthetization techniques, that either rely on expert-engineered generation mechanisms or on the perturbation of existing data (Muralidhar et al., 1999;Reiter, 2010;Wieringa et al., 2021).
We focus on the data sharing use cases, where data owners seek to provide highly accurate, yet truly anonymous statistical representations of datasets. AI-based approaches for generating synthetic data provide a promising novel tool box for data stewards in the field of statistical disclosure control (SDC) (Drechsler, 2011), but just as more traditional methodologies also share the fundamental need to balance data utility against disclosure risk. One can maximize utility by releasing the full original dataset, but would thereby expose the privacy of all contained data subjects. On the other hand, one can easily minimize the risk by releasing no data at all, which naturally yields zero utility. It is this privacy-utility trade-off that we seek to FIGURE 1 | Progress in synthetic face generation due to advances in self-supervised generative AI methods (Source: Tweet by Ian Goodfellow).
FIGURE 2 | Sample text generated by Generative Pre-trained Transformer 2 (GPT-2), a large-scale open-source generative language model created by OpenAI (Radford et al., 2019). quantify for mixed-type synthetic data. To this end, in this paper we introduce and empirically demonstrate a novel, flexible and easy-to-use framework for measuring the fidelity as well as the privacy risk entailed in synthetic data in mixed-type tabular data setting. After briefly discussing the background we present the building blocks of the proposed framework in section Framework. This will then allow us to compare the performance of generative models from the rapidly growing field of synthetic data approaches against each other, as well as against alternative SDC techniques in section Empirical Demonstration.

RELATED WORK
The field of generative AI gained strong momentum ever since the introduction of Generative Adversarial Networks (Goodfellow et al., 2014) and its application to image synthesis. This seminal and widely cited paper assessed synthetic data quality by fitting Gaussian Parzen windows to the generated samples in order to estimate the log-likelihood of holdout samples. At that time the authors already called out for further research to assess synthetic data, as they highlighted the limitations of Parzen window estimates for higher dimensional domains. Theis et al. (2015) further confirmed the fundamental shortcomings of likelihood-based measures as quality metrics, as they were easily able to construct counter examples where these two do not align.
In addition to quantitative assessments, nearly all of the research advances for image synthesis also present non-cherry picked synthetic samples as an indicator for quality (see e.g., Radford et al., 2015;Liu and Tuzel 2016;Karras et al., 2017). While these allow to visually judge plausibility of the generated data, they do not allow to capture a generator's ability to faithfully represent the full variety and richness of a dataset, i.e., its datasetlevel statistics. On the contrary, by overly focusing on "realistic" sample records in the assessment, one will potentially favor generators that bias toward conservative, safe-bet samples, at the cost of diversity and representativeness. Note that methods like temperature-based sampling (Ackley et al., 1985), top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2019) are all techniques to make such trade-offs explicitly, and are commonly applied for synthetic text generation.
For mixed-type tabular data a popular and intuitive approach is to visually compare histograms and correlation plots (see e.g., Howe et al., 2017;Beaulieu-Jones et al., 2019;Lu et al., 2019). While this does allow to capture representativeness, it is typically applied to only a small subset of statistics and misses out on systematically quantifying any discrepancies thereof.
A popular assessment technique within structured as well as unstructured domains is known as "Train on Synthetic, Test on Real" (TSTR) method (Esteban et al., 2017). Using this technique, a supervised machine learning task is trained on the generated synthetic data to then see how its predictive accuracy compares against the same model being trained on real data (Jordon et al., 2018;Xu et al., 2019). By validating against an actual holdout dataset, that is not used for the data synthesis itself, one gets an indication for the information loss for a specific relationship within the data incurred due to the synthesis. If the chosen predictive task is difficult enough and a capable downstream machine learning model is used, this can indeed yield a strong measure. However, results will depend on both of these assumptions, and will vary even for the same dataset from predicted target to predicted target, as it tests only for a singular relationship within the high dimensional data distribution. And more importantly, the measure again does not allow statements regarding the overall statistical representativeness. Any accidentally introduced bias, any artifacts, any misrepresentations within the generated data might remain unnoticed. Yet, all of these are of particular importance when a data owner seeks to disseminate granularlevel information with highest possible accuracy, without needing to restrict or even to know the downstream application.
No accuracy assessment of a synthetic data solution can be complete, if it does not include some measurement of its ability to produce truly novel samples, rather than merely memorizing and recreating actual data. Closely related, users of synthetic data solutions seek to establish the privacy of a generated dataset. i.e., whether the synthetic dataset is considered to be anonymous, non-personal data in a legal sense. With data protection regulations varying from country to country, and industry to industry, any ultimate assessment requires legal expertise and can only be done with respect to a given regulation. However, there are a growing number of technical definitions and assessments of privacy being introduced, that serve practitioners well to make the legal case. Two commonly used concepts within the context of synthetic data are empirical attribute disclosure assessments (Taub et al., 2018;Hittmeir et al., 2020), and Differential Privacy (Dwork et al., 2006). Both of these have proven to be useful in establishing trust in the safety of synthetic data, yet come with their own challenges in practice. While the former requires computationally intensive, case-specific repeated synthetization re-runs that can become infeasible to perform on a continuous base, the latter requires the inspection of the algorithms as well as their actual implementations for these to be validated.

FRAMEWORK
We seek to close these existing gaps for evaluating data synthesizers by offering 1) a flexible, model-free and easy to reason empirical assessment framework of data fidelity, and 2) an easy to compute summary statistic for the empirical assessment of privacy for mixed-type tabular data. Grace to their purely datadriven, non-parametric nature both measure neither require any a priori domain specific knowledge nor assumptions of the investigated synthetization process. This framework allows for a systematic assessment of black-box synthetic data solutions that can be performed on a continuous base and thus shall help to establish transparency and ultimately trust in this new technology.

Fidelity
We motivate our introduced fidelity measure by visualizing selected distributions and cross-tabulations for the "adult" dataset, which we will use later in our empirical demonstration study. Figure 3 exhibits the distribution of four selected numeric attributes and shows the wide variety of shapes that can occur in real-world datasets. For example, the numeric attribute "age" ranges from 17 to 90, with a small group of subjects that are exactly 90 years old, while hardly any subject is between 85 and 89 years old. The numeric attribute "fnlwgt" spans a much wider range, with nearly all observed values being unique within the dataset. Thus these values need to be binned to adequately visualize the shape of the variable's distribution. Attribute "hours-per-week" is characterized by specific outstanding integer values, while "capital-gain" is dominated by zeros with only a few exceptions that themselves can range up to 100,000. Since we would want to see synthesizers faithfully retaining any of these different types and shapes of univariate distributional patterns we also require an accurate fidelity measure to capture any such discrepancies as well.
However, we expect from synthetic data that they are not only representative for the distribution of individual attributes, but for all multivariate combinations and relationships among the set of attributes. For example, Figure 4 displays three selected bivariate distributions for the dataset "adult," each with distinct patterns and insights that are to be retained and assessed. The challenge for deriving a metric that accommodates these empirical interdependencies in an adequate way is that the number of relationships to investigate grows quickly with the number of attributes. More specifically, a dataset with m attributes results in ( m k ) combinations of k-way interactions. For example, for 50 attributes this yields 1,225 two-way, and 19,600 threeway interactions. Ideally, we would want to compare the full joint empirical distributions (m k) between the actual and synthetic data, but that is, except for the most trivial cases, infeasible in practice.
The curse of dimensionality (Bellman, 1966) strikes here again; i.e.,the number ofcross-combinationsof attribute values grows exponentially as more attributes are considered, resulting in the available data becoming too sparse in a highdimensional data space. While binning and grouping of attribute values mitigates the issue for the lower-level interactions, this fundamental principle cannot be defeated for deeper levels. Thus we propose as a non-parametric, model-and assumption-free approach to empirically measure the fidelity of a synthetic dataset with respect to a target dataset by averaging across the total variation distances. 1 Further, (TVD) of the corresponding discretized, lower-level empirical marginal distributions.
The construction of our proposed fidelity metric is as follows: Let's consider a random split of the available records into a training dataset T and a holdout dataset H. We only expose the training data T to a synthesizer which yields a synthetic dataset S of arbitrary size n S . Further, let's transform each of the m attributes of these datasets into categorical variables, that have a fixed upper bound c for their cardinality. For those categorical FIGURE 3 | Selected univariate marginal distributions for dataset "adult" demonstrating the broad spectrum of distributional patterns encountered in real-world datasets.
variables that have cardinality c j > c, we merge the (c j − c + 1) least frequent values into a single group. For numeric variables we apply quantile binning, i.e., we cut the range of values into a maximum of c ranges, based on their c quantiles. Any date or datetime variable is to be converted first into a numeric representation before applying the same transformation as suggested above. Any missing values are treated as yet another categorical value, thus can increase cardinality to c + 1 for those variables that contain missing values. Note, that the required statistics for the discretization, i.e., the list of least frequent values as well as the quantiles, are to be determined based on the training dataset T alone, and then reused for the discretization of the other datasets.
We then proceed in calculating relative frequencies for all k-way interactions for the discretized m attributes and do so for both the training dataset T and the synthetic dataset S. For each k-way interaction we calculate the TVD between the two corresponding empirical marginal distributions and then average across all m k combinations. This yields a measure F k (T,S), which quantifies the fidelity of synthetic dataset S with respect to original training dataset T. Formally, the TVD between a specific k-way combination v for datasets T and S is half the L1 distance between the empirical distributions: with f A v denoting the empirical marginal distribution for a dataset A, v being any of the ( m k ) k-element combinations of the set of m attributes and i being any of the occurring attribute values of v. The introduced fidelity metric of dataset S with respect to dataset T is then the average across the TVDs for all possible k-way combinations and can be written as follows: In order to get a sense of how much information is lost due to the synthetization and how much discrepancies are expected due to sampling noise we need to compare F k (T, S) with the fidelity measure of the holdout dataset, F k (T, H). This serves us as a reference for what we aim for when retaining statistics that generalize beyond the individuals. This relationship can be easily quantified as the ratio F k ratio (T, H, S) : F k (T, S)/F k (T, H). Note that a ratio of 1 would be optimal as it indicates that the synthetic dataset S is just as close to the training dataset T as a holdout dataset H is with respect to T due to the sampling noise. On the other hand, a ratio smaller than 1 would indicate that the synthetic dataset is systematically "too close" and contains information that represents training data specific information.

Privacy
While fidelity is assessed at the dataset-level, we need to look at individual-level distances for making the case that none of the training subjects is exposed by any of the generated synthetic records. A simplistic approach is to check for identical matches, i.e., records from the training set that are also contained in the synthetic set. However, the occurrence of identical matches is neither a required nor a sufficient condition for detecting a leakage of privacy. Just as any dataset can contain duplicate records, we shall expect a similar relative occurrence within a representative synthetic dataset. Further, and analogous to that metaphorical monkey typing the complete works of William Shakespeare by hitting random keys on a typewriter for an infinite time (also known as the "infinite monkey theorem"), even an uninformed random data generator will eventually end up generating any "real" data record. More importantly, these identical matches must not be removed from the synthetic output, as such a rejection filter actually leaks privacy, since it would reveal the presence of a specific record in the training data by it being absent from a sufficiently large generated synthetic dataset.
The concept of identical matches is commonly generalized toward measuring the distance to closest records (DCR) (Park et al., 2018;Lu et al., 2019). These are the individual-level distances of synthetic records with respect to their corresponding nearest neighboring records from the training dataset. The distance measure itself is interchangeable, whereas in line with the discrete perspective we took in our fidelity assessment we opt for the Hamming distance applied to the discretized dataset as an easyto-compute distance metric that fulfills the conditions of nonnegativity and symmetry. However, we note that the very same framework can be just as well applied on top of alternative distance metrics, including ones based on more meaningful learned representations of domain-specific embedding spaces. A DCR of 0 corresponds to an identical match. But as argued above, also that metric in itself does not reveal anything regarding the leakage of individual-level information, but is rather a statistic of the data distribution we seek to retain. Therefore, to provide meaning and to facilitate interpretation the measured DCRs need to be put into the context of their expected value, which can be estimated based on an actual holdout dataset.
As illustrated in Figure 5, we therefore propose to calculate for each synthetic record its DCR with respect to the training data T as well as with respect to an equally sized holdout dataset H. The share of records that are then closer to a training than to a holdout record serves us as our proposed privacy risk measure. Any ties are to be distributed equally between these two datasets. If that resulting share is then close to 50%, we gain empirical evidence of the training and holdout data being interchangeable with respect to the synthetic data. 2 This in turn allows to make a strong case for plausible deniability for any individual, as the synthetic data records do not allow to conjecture whether an individual was or was not contained in the training dataset. Even for cases of a strong resemblance of a particular record with a real-world subject, it can be argued that such a resemblance can occur for unseen subjects just as well. Translated into the world of motion pictures this idea would correspond to the proverbial disclaimer that "any resemblance to persons living or dead is purely coincidental."

EMPIRICAL DEMONSTRATION
To demonstrate the usefulness of the presented framework for assessing fidelity and privacy of synthetic data solutions, we apply it to four publicly available, mixed-type tabular datasets from the UCI Machine Learning repository (Dua and Graff, 2017) and synthesize them using seven publicly available data synthesizers.
The datasets cover a broad range of scenarios and are commonly used in the data synthesis literature (Park et al., 2018;Xu et al., 2019;Zhao et al., 2021) as well as by commercial and open-source software providers 3 to demonstrate the effectiveness of the proposed methods. Each dataset is representative of a common business scenario, where privacy-sensitive data assets are to be shared for analytical tasks. Every record within these datasets corresponds to a single person, whose privacy shall be protected, while the statistical information of the overall dataset shall be retained.
Each of the four datasets is randomly split into an equally sized training and holdout dataset. The seven generative models are fitted to the training data to then generate 50,000 synthetic records for each dataset. All synthesizers are run with their default settings unchanged, i.e., no parameter tuning is being performed.
To provide further context to the assessment of the synthesized datasets, we also generate additional datasets by simply perturbating the training data with a varying degree of noise. We do so by drawing 50,000 records with replacement from the dataset and then decide for each value of each record with a given probability (ranging from 10% up to 90%) whether to keep it or to replace it with a value from a different record. This approach adds noise to existing records, yet retains the univariate marginal distributions. With more noise being added, one expects privacy to be increasingly protected, while also more statistical relations to be distorted. This allows us to compare the newly emerging class of data synthesizers with a simpler, yet more established method in the field of statistical disclosure control (see also Muralidhar et al. (1999) or Muralidhar and Sarathy (2006) for similar approaches).
Figures 6-8 visualize the resulting distributions of selected univariate, bivariate and three-way attribute interactions for the "adult" dataset across the synthetic datasets generated by the various synthesizers included in our study as well as the two purturbated datasets (i.e., Flip 10% and Flip 90%). While the visual inspection already allows to spot some qualitative differences with respect to the goodness of representativeness of the training data, it is the corresponding fidelity metric F (reported as percentages in brackets) that provides us with a quantitative summary statistic. The reported fidelity measure for the holdout data serves as a reference, as the derived distributions should not be systematically closer to the training data than what is expected from the holdout data. For example, an F 1 (T, S) fidelity score coming close to the 2.7% reported for the variable "age" in the holdout (which is due to the sampling noise) can be considered as an accurate representation of the underlying distribution in the training data. However, visually inspecting 15 univariate, 105 bivariate. and 455 three-way interactions for dataset "adult" is prohibitive, but the proposed summary statistics which average across these yield a condensed but informative fidelity assessment of synthetic data. Figure 9 reports the proposed fidelity measures across all four datasets, the used generative synthetization methods and various degrees of perturbation. 6 It is interesting to note that the rankings with respect to fidelity among synthesizers are relatively consistent across all datasets, showing that these metrics indeed serve as general-purpose measures for the quality of a synthesizer. The reported numbers for the perturbated datasets exhibit the expected relationship between noise level and fidelity. Among the benchmarked synthesizers "synthpop" and "MOSTLY" exhibit the highest fidelity score with the caveat that the former is systematically too close to the training data compared to what is expected based on the holdout. Figure 10, on the other hand, contains the results for the proposed privacy risk measure. For each dataset and synthesizer the share of synthetic records that is closer to a training record than to a holdout record is being reported. In addition, the average DCRs are displayed, once with respect to the training and once with respect to the holdout data. With the notable exception of "synthpop" all of the presented synthesizers exhibit almost identical DCR distributions for the training as well as for the holdout records. This indicates that no individual-level FIGURE 5 | Construction of the holdout-based privacy risk measure. For each synthetic record we determine whether the nearest neighbor within training data T is or is not smaller than the nearest neighbor with respect to holdout data H. The share of records closer to a training than to a holdout data serves for evaluating privacy risk.
information of the training subjects has been exposed beyond what is attainable from the underlying distribution and thus makes a strong case for the generated data preserving the privacy of the training subjects. In contrast, the reported numbers for the perturbated datasets reveal a severe exposure of the training subjects, even if a high amount of noise is being added. Only at a level where most of the utility of these datasets is being destroyed, the privacy measures start to align with the holdout dataset.
Based on these results we can further visualize the uncovered empirical relationship between privacy and fidelity. The x-axes in Figure 11 represent the three-way fidelity measure in relation to its corresponding value for the holdout dataset, i.e., the proposed fidelity ratio F 3 (T,S)/F 3 (T,H). The y-axes represent the reported share of records that are closer to training than to the holdout. Presented this way, the holdout dataset serves us as a "north star" for truly privacy-respecting data synthesizers in the upper right corner. The orange dots represent the range of perturbated datasets and reveal the difficulties of basic obfuscation techniques to protect privacy without sacrificing fidelity, particularly for higher-dimensional datasets. The turquoise marks on the other hand represent the performance metrics for a broad range of emerging synthesizers, whereas all except one exhibit DCR shares close to 50%, and with some getting already very close to representing the characteristics of a true holdout dataset.

DISCUSSION AND FUTURE RESEARCH
The field of supervised machine learning benefited from having commonly used benchmark datasets and metrics in place to measure performance across methods as well as progress over time. The emerging field of privacy-preserving structured synthetic data is still to converge onto commonly agreed fidelity and privacy measures, as well as to a set of canonical datasets to benchmark on. This research aims at complementing already existing methods by introducing a practical, assumptionfree and easy to reason empirical assessment framework that can be applied for any black-box synthetization method and thus shall help to objectively capture and measure the progress in the field. In addition, the reported findings from the empirical benchmark experiments demonstrate the promise of AI-based data synthesis when compared to simpler data perturbation techniques. However, they also highlight the need to not only assess fidelity but just as well the privacy risk of these newly emerging, powerful data generators.
We hope this contribution fosters further empirical benchmarks across a broad range of datasets and thus helps to establish transparency as well as facilitate comparability across the various synthetization methods. To the best of our knowledge this paper is the first model-free, non-parametric approach to quantify the fidelity and privacy risk of synthetic data by using two easy-to-compute summary statistics. We acknowledge the fact that TVD is sensitive to the cardinality and the actual shape of the underlying distributions. Furthermore, in our approach we need to discretize the data by specifying an upper bound c on the cardinality of all variables. While we find that our results and substantive findings are robust to the choice of c and the distance measures, future research could further explore the relationship between these aspects and the power of our proposed fidelity and privacy metrics.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found at https:// github.com/mostly-ai/paper-fidelity-accuracy.

AUTHOR CONTRIBUTIONS
MP concepted, implemented, and empirically applied assessment framework. TR contributed, challenged, and edited during the whole process.