CTAB-GAN+: enhancing tabular data synthesis

The usage of synthetic data is gaining momentum, in part due to the unavailability of original data for privacy and legal reasons, and in part due to its utility as an augmentation to authentic data. Generative adversarial networks (GANs), a paragon of generative models first developed for images and subsequently applied to tabular data, have contributed many of the state-of-the-art synthesizers. As GANs improve, the synthesized data increasingly resemble the real data, raising the risk of privacy leakage. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility. Striking the best trade-off remains a challenging research question. In this study, we propose CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon the state of the art by (i) adding downstream losses to the conditional GAN for higher-utility synthetic data in both classification and regression domains; (ii) using the Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with unbalanced or skewed data; and (iv) training with DP stochastic gradient descent to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ on statistical similarity and machine learning utility against state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 21.9% higher machine learning utility (i.e., F1-score) across multiple datasets and learning tasks under a given privacy budget.


I. INTRODUCTION
Many companies nowadays discover valuable insights from various internal and external data sources. However, the deep knowledge extracted from big data often violates personal privacy and can lead to unjustified analyses [22]. To prevent the abuse of data and the risk of privacy breaches, the European Commission introduced the European General Data Protection Regulation (GDPR) and enforced strict data protection measures. This, however, poses a new challenge for data-driven industries: finding scientific solutions that can empower big discoveries while respecting the constraints of data privacy and governmental regulation.
An emerging solution is to leverage synthetic data [21], which statistically resembles real data and can comply with GDPR due to its synthetic nature. The Generative Adversarial Network (GAN) [12] is one of the emerging data synthesizing methodologies. Beyond its success in generating images [26], [21], [25], [33], [39] have recently applied GANs to generate tabular data. However, recent studies have shown that GANs may fall prey to membership inference attacks, which greatly endanger the personal information present in the real training data [7], [27]. Therefore, it is imperative to safeguard the training of tabular GANs such that synthetic data can be generated without causing harm. To address these issues, prior work [16], [19], [28], [29] relies on differential privacy (DP) [9]. DP is a mathematical framework that provides theoretical guarantees bounding the statistical difference between any resulting ML model trained with or without a particular individual's information in the original training dataset. Typically, this can be achieved by injecting calibrated statistical noise while updating the parameters of a network during back-propagation, i.e., DP Stochastic Gradient Descent (DP-SGD) [1], [6], [32], or by injecting noise while aggregating teacher ensembles using the PATE framework [16], [24].
However, state-of-the-art (SOTA) tabular GAN algorithms only focus on two types of variables, namely continuous and categorical, overlooking the important class of mixed data types. In addition, it is unclear whether existing solutions can efficiently handle highly imbalanced or skewed variables. Furthermore, most SOTA DP GANs are evaluated on images, and their efficacy on tabular datasets remains to be verified. Existing DP GANs also offer no consensus on which DP framework (i.e., DP-SGD or PATE) is optimal for training tabular GANs. Moreover, DP GAN algorithms such as GS-WGAN [6] and PATE-GAN [16] change the original GAN structure from one discriminator to multiple discriminators, which increases the complexity of the algorithm, while DP-WGAN [32] and RDP-GAN [28] use weight clipping to bound gradients, which introduces instability into GAN training.
In this paper, we extend CTAB-GAN [39] into a new algorithm, CTAB-GAN+. The objectives of CTAB-GAN+ are twofold: (1) to further improve synthetic data quality in terms of machine learning utility and statistical similarity; and (2) to implement efficient DP in tabular GAN training and control its performance under different privacy budgets. To achieve the first goal, CTAB-GAN+ introduces a new feature encoder for variables following a single-mode Gaussian distribution. Moreover, CTAB-GAN+ adopts the Wasserstein distance plus gradient penalty (hereinafter referred to as Was+GP) loss [13] to further enhance the stability and effectiveness of GAN training. Finally, CTAB-GAN+ adds a new auxiliary component to improve the synthesis performance for regression tasks. To achieve the second goal, CTAB-GAN+ uses the DP-SGD algorithm to train a single discriminator instead of multiple ones as in PATE-GAN and GS-WGAN, which reduces the complexity of the algorithm. Additionally, CTAB-GAN+ reduces the privacy cost by accounting for the sub-sampling [31] of smaller subsets from the full dataset used to train the discriminator.

A. Motivation
We empirically demonstrate how prior SOTA methods fall short on challenges posed by industrial datasets. The detailed experimental setup can be found in Sec. IV-A.
Single Gaussian variables. Single-mode Gaussian distributions are very common. Fig. 1(a) shows the histogram of the variable bmi (i.e., body mass index) in the Insurance dataset and the synthetic data generated for this variable by 4 SOTA algorithms. The distribution of the real data is close to a single-mode Gaussian distribution, but except for TableGAN, none of the SOTA algorithms recovers this distribution in their synthetic data. CTGAN uses a variational Gaussian mixture (VGM) to model all continuous variables. However, VGM is needlessly complex for single-mode Gaussian distributions, as it initially approximates the distribution with multiple Gaussian mixture components by default. CWGAN and MedGAN use min-max normalization to scale the original data to [0, 1]. TableGAN also uses min-max normalization but scales the original data to [-1, 1] to better match the output of a generator using tanh as activation function. The reason that min-max normalization works for TableGAN but not for MedGAN and CWGAN is that the training convergence of the latter two algorithms is less stable than that of TableGAN. However, since TableGAN applies min-max normalization to all variables, it is at a disadvantage when modelling columns with complex multi-modal distributions.
Mixed data type variables. To the best of our knowledge, existing GAN-based tabular generators only consider table columns as either categorical or continuous. However, in reality, a variable can be a mix of these two types, and often variables have missing values. The Mortgage variable from the Loan dataset is a good example of a mixed variable. Fig. 1(b) shows the distribution of the original data and the synthetic data generated for this variable by 4 SOTA algorithms. According to the data description, a loan holder can either have no mortgage (value 0) or a mortgage (any positive value). In appearance, this variable is not categorical due to the numeric nature of the data, so all 4 SOTA algorithms treat it as continuous without capturing the special meaning of the value zero. Hence, all 4 algorithms generate values around 0 instead of exactly 0, and the generated negative values for Mortgage have no meaning in the real world.
Long tail distributions. Much real-world data has long-tail distributions, where most occurrences happen near the initial value of the distribution and rare cases appear towards the end. Fig. 1(c) plots the cumulative frequency for the original (top) and synthetic (bottom) data generated by 4 SOTA algorithms for the Amount variable in the Credit dataset. This variable represents the transaction amount when using credit cards. One can imagine that most transactions have small amounts, ranging from a few to thousands of dollars, while there definitely exists a very small number of transactions with large amounts. Note that for ease of comparison both plots use the same x-axis, but the real data has no negative values. The real data clearly has 99% of occurrences at the start of the range, yet the distribution extends until around 25000. In comparison, none of the synthetic data generators is able to learn and imitate this behavior.
Skewed multi-mode continuous variables. The term multi-mode is borrowed from variational Gaussian mixtures (VGM); more details are given in Sec. III-C. The intuition behind using multiple modes can be easily captured from Fig. 1(d). The figure plots, in each row, the distribution of the working Hours-per-week variable from the Adult dataset. This is not a typical Gaussian distribution: there is an obvious peak at 40 hours, but with several other lower peaks, e.g., at 50, 20 and 45. Also, the number of people working 20 hours per week is higher than those working 10 or 30 hours per week. This behavior is difficult to capture for the SOTA data generators (see the subsequent rows in Fig. 1(d)). The closest results are obtained by CTGAN, which uses Gaussian mixture estimation for continuous variables. However, CTGAN loses some modes compared to the original distribution.
The above examples show the shortcomings of current SOTA GAN-based tabular data generation algorithms and motivate the design of our proposed CTAB-GAN+.

II. RELATED WORK
We divide the related work on GAN-based tabular data generation into three categories: (i) based on GAN, (ii) based on conditional GAN, and (iii) based on DP GAN.
GAN-based generator. Several studies extend GANs to accommodate categorical variables by augmenting the GAN architecture. MedGAN [8] combines an auto-encoder with a GAN. It can generate continuous or discrete variables and has been applied to generate synthetic electronic health record (EHR) data. CrGAN-Cnet [21] uses a GAN for airline passenger name record generation. It integrates the Cramér distance [4] and the Cross-Net architecture [30] into the algorithm. In addition to generating continuous and categorical data types, CrGAN-Cnet can also handle missing values in the table by adding new variables. TableGAN [25] introduces an information loss and a classifier into the GAN framework. It specifically adopts Convolutional Neural Networks (CNNs) for the generator, discriminator and classifier. Although the aforementioned algorithms can generate tabular data, they cannot be directed to generate data from a specific class of a particular variable. For example, it is not possible to generate health records only for users whose sex is female.
Conditional GAN-based generator. Due to this limitation in controlling the generated data, conditional GANs are increasingly used; their conditional vector can specify a particular class of data to generate. This feature is important when the available data is limited and highly skewed and we need synthetic data of a specific class to re-balance the distribution, for instance, when preparing the starting dataset for online learning scenarios [35], [37], [38]. CW-GAN [11] applies the Wasserstein distance [2] to the conditional GAN framework. It leverages the conditional vector to oversample the minority class to address imbalanced tabular data generation. CTGAN [33] integrates the PacGAN [18] structure in its discriminator and uses the generator loss and the WGAN loss plus gradient penalty (GP) [13] to train a conditional GAN framework. It also adopts a strategy called training-by-sampling, which takes advantage of the conditional vector, to deal with imbalanced categorical variables.
CTAB-GAN+ not only focuses on modelling both continuous and categorical variables, but also covers the mixed data type (i.e., variables that contain both categorical and continuous values, or even missing values). We effectively combine the strengths of the prior art, such as the Was+GP, classifier, information and generator losses, along with effective encodings. Furthermore, we proactively address the pain points of single Gaussian and long-tail variable distributions and propose a new conditional vector structure to better deal with imbalanced datasets.
Differentially Private Tabular GANs. To avoid leaking sensitive information on single individuals, previous studies explore multiple differentially private learning techniques applied to GANs. Table I provides an overview. PATE-GAN [16] uses PATE [24], which relies on output sanitization: it perturbs the output of an ensemble of teacher discriminators via Laplacian noise to train a student discriminator scoring the generated samples. One key limitation is that the student discriminator only sees synthetic data. Since this data is potentially unrealistic, the provided feedback can be unreliable. DP-WGAN [32], GS-WGAN [6] and RDP-GAN [28] use differentially private stochastic gradient descent (DP-SGD) coupled with the Wasserstein loss. Moreover, DP-WGAN uses a moments accountant, whereas GS-WGAN and RDP-GAN use a Rényi Differential Privacy (RDP) accountant. The Wasserstein loss is known to be more effective against mode collapse than the KL divergence [3]. The RDP accountant provides tighter bounds on the privacy costs, improving the privacy-utility trade-off. To incorporate differential privacy guarantees and make the training compatible with the Wasserstein loss, [28], [32] use weight clipping to enforce the Lipschitz constraint. The drawback is the need for careful tuning of the clipping parameter (see Sec. III-G). To overcome this issue, [6] enforces the Lipschitz constraint via a gradient penalty term as suggested by [14], but addresses only images, which are a better fit for GANs, and studies its efficacy only for training the generator network.
The proposed CTAB-GAN+ leverages RDP-based privacy accounting, in contrast to the PATE framework used by PATE-GAN. Like DP-WGAN, CTAB-GAN+ uses one discriminator instead of the multiple ones trained by PATE-GAN and GS-WGAN. Since CTAB-GAN+ adopts the Was+GP loss, it intrinsically constrains the gradient norm, allowing it to forgo the weight clipping used in DP-WGAN. This leads to more stable training. In a nutshell, by training only one discriminator with the Was+GP loss, CTAB-GAN+ is a more stable DP GAN algorithm than the SOTA alternatives.

III. CTAB-GAN+
CTAB-GAN+ is a tabular data generator designed to overcome the challenges outlined in Sec. I-A. CTAB-GAN+ adopts a re-designed min-max scaler to normalize single-mode Gaussian variables. We also propose a novel Mixed-type Encoder which can better represent mixed categorical-continuous variables as well as missing values. CTAB-GAN+ is based on a conditional GAN (CGAN) to efficiently treat minority classes, with the addition of the Was+GP, downstream, information and generator losses ([13], [23], [25], [33]) to improve data quality and training stability. Moreover, we leverage a log-frequency sampler to overcome the mode collapse problem for imbalanced variables. Finally, differentially private SGD training is implemented for the discriminator to achieve strict privacy guarantees.
A. Technical Background

1) Tabular GAN: GANs are a popular method to generate synthetic data, first applied with great success to images [17] and later adapted to tabular data [34]. GANs leverage an adversarial game between a generator trying to synthesize realistic data and a discriminator trying to discern synthetic from real samples.
To address the problem of dataset imbalance, we leverage the conditional generator and training-by-sampling methods from CTGAN. The idea is to use an additional vector, termed the conditional vector, to represent the classes of categorical variables. This vector is both fed to the generator and used to bound the sampling of the real training data to subsets satisfying the condition. We can leverage the condition to resample all classes, giving minority classes higher chances of being seen during training. To improve the stability of GAN training, CTAB-GAN+ adopts the Was+GP loss [13]. The earlier WGAN [2] already improves the stability of GAN training, but its use of weight clipping leads to issues such as exploding and vanishing gradients. Compared to WGAN, WGAN-GP replaces weight clipping with a constraint on the gradient norm of the discriminator to enforce Lipschitz continuity. This further stabilizes the training of the network and requires less hyper-parameter tuning. Another distinctive feature of WGAN-GP is that its discriminator is updated 5 times per mini-batch for each single update of the generator. This influences our differential privacy budget (see details in Sec. III-G). To enhance the generation quality, we incorporate three extra terms into the loss function of the generator: the information loss [25], the downstream loss (referred to as classification loss in [23] for classification problems) and the generator loss [33]. The information loss penalizes the discrepancy between the statistics of the generated data and the real data. This helps to generate data which is statistically closer to the real one. The downstream loss requires adding to the GAN architecture an auxiliary classifier (or regressor) in parallel to the discriminator. For each synthesized record the classifier (or regressor) outputs a predicted value. The downstream loss quantifies the discrepancy between the synthesized and predicted values in the downstream analysis. This helps increase the semantic integrity of synthetic records. For instance, for a classification dataset, (sex=female, disease=prostate cancer) is not a semantically correct record as women do not have a prostate; no such record appears in the original data and hence the combination is not learnt by the classifier. The generator loss measures the difference between the given conditions and the output classes of the generator. This loss helps the generator learn to produce exactly the classes given as conditions. The downstream loss is used by TableGAN but not by CTGAN, since CTGAN does not contain a classifier; conversely, the generator loss is implemented by CTGAN but not by TableGAN, as TableGAN is not a conditional GAN. Both only treat classification problems.
To counter complex distributions in continuous variables we embrace the Mode-Specific Normalization (MSN) idea [33] which encodes each value as a value-mode pair stemming from the Gaussian mixture model.
2) Differential Privacy: DP is becoming the standard solution for privacy protection and has even been adopted by the US census department to bolster the privacy of citizens [15]. DP protects against privacy attacks by minimizing the influence of any individual data point based on a given privacy budget. In this work, we leverage Rényi Differential Privacy (RDP) [20] as it provides stricter bounds on the privacy budget. A randomized mechanism $\mathcal{M}$ is $(\lambda, \epsilon)$-RDP with order $\lambda$ if, for all adjacent datasets $S$ and $S'$,

$$D_{\lambda}\big(\mathcal{M}(S) \,\|\, \mathcal{M}(S')\big) = \frac{1}{\lambda-1} \log \mathbb{E}_{x \sim \mathcal{M}(S')}\left[\left(\frac{\Pr[\mathcal{M}(S)=x]}{\Pr[\mathcal{M}(S')=x]}\right)^{\lambda}\right] \leq \epsilon.$$

In addition, a $(\lambda, \epsilon)$-RDP mechanism $\mathcal{M}$ also satisfies $(\epsilon', \delta)$-DP for any $0 < \delta < 1$ with

$$\epsilon' = \epsilon + \frac{\log(1/\delta)}{\lambda - 1}. \quad (1)$$

For the purpose of this work, $\mathcal{M}$ corresponds to a tabular GAN model with privacy budget $(\lambda, \epsilon)$.
RDP is a strictly stronger privacy definition than DP, as it provides tighter bounds for tracking the cumulative privacy loss over a sequence of mechanisms via the composition theorem [20]. Let $\circ$ denote the composition operator. For $\mathcal{M}_1, \ldots, \mathcal{M}_k$ all being $(\lambda, \epsilon_i)$-RDP, the composition $\mathcal{M}_1 \circ \cdots \circ \mathcal{M}_k$ is

$$\Big(\lambda,\; \sum\nolimits_{i=1}^{k} \epsilon_i\Big)\text{-RDP}. \quad (2)$$

Additionally, a Gaussian mechanism [10] $\mathcal{M}_{\sigma}$ parameterized by $\sigma$ is defined as

$$\mathcal{M}_{\sigma}(x) = f(x) + \mathcal{N}(0, \sigma^2 I), \quad (3)$$

where $f$ denotes an arbitrary function with sensitivity $\Delta_2 f = \max_{S, S'} \|f(S) - f(S')\|_2$ over all adjacent datasets $S$ and $S'$, and $\mathcal{N}$ represents a Gaussian distribution with zero mean and covariance $\sigma^2 I$ ($I$ being the identity matrix). $\mathcal{M}_{\sigma}$ satisfies $\big(\lambda, \lambda (\Delta_2 f)^2 / (2\sigma^2)\big)$-RDP [20]. Lastly, two more theorems are key to this work. The post-processing theorem [10] states that if $\mathcal{M}$ satisfies $(\epsilon, \delta)$-DP, then $F \circ \mathcal{M}$ also satisfies $(\epsilon, \delta)$-DP, where $F$ can be any arbitrary randomized function. Hence, it suffices to train one of the two networks in the GAN architecture with DP guarantees to ensure that the overall GAN is compatible with differential privacy. RDP for subsampled mechanisms [31] quantifies the reduction in privacy budget when sub-sampling private data. Formally, let $X$ be a dataset with $n$ data points and let subsample return $m \leq n$ subsamples without replacement from $X$ (subsampling rate $\gamma = m/n$). For all integers $\lambda \geq 2$, if a randomized mechanism $\mathcal{M}$ is $(\lambda, \epsilon(\lambda))$-RDP, then $\mathcal{M} \circ \text{subsample}$ is $(\lambda, \epsilon'(\lambda))$-RDP with

$$\epsilon'(\lambda) \leq \frac{1}{\lambda-1} \log\bigg(1 + \gamma^2 \binom{\lambda}{2} \min\Big\{4\big(e^{\epsilon(2)}-1\big),\; e^{\epsilon(2)} \min\big\{2, (e^{\epsilon(\infty)}-1)^2\big\}\Big\} + \sum_{j=3}^{\lambda} \gamma^j \binom{\lambda}{j} e^{(j-1)\epsilon(j)} \min\big\{2, (e^{\epsilon(\infty)}-1)^j\big\}\bigg). \quad (4)$$
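As a minimal illustration of how (1) and (2) are used together, the sketch below (our own, with purely illustrative numbers, and ignoring the sub-sampling amplification of (4)) composes per-step RDP costs of a Gaussian mechanism and converts the total to $(\epsilon, \delta)$-DP by minimizing over the order $\lambda$:

```python
import math

def rdp_to_dp(rdp_eps, delta):
    """Convert cumulative RDP costs {lambda: eps(lambda)} to (eps, delta)-DP via (1)."""
    return min(eps + math.log(1.0 / delta) / (lam - 1.0)
               for lam, eps in rdp_eps.items())

# Gaussian mechanism with sensitivity 2C and C = 1 (see Sec. III-G):
# each update costs eps(lambda) = 2*lambda/sigma^2 in RDP, composed via (2).
sigma, steps, delta = 50.0, 100, 1e-5
orders = range(2, 4097)                      # exploration span of lambda
total = {lam: steps * 2.0 * lam / sigma**2 for lam in orders}
print(f"eps = {rdp_to_dp(total, delta):.2f} at delta = {delta}")  # ~2.0
```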

B. Architecture of CTAB-GAN+
The structure of CTAB-GAN+ is shown in Fig. 2. It comprises three blocks: the generator G, the discriminator D and an auxiliary component C (either a classifier or a regressor). Since our algorithm is based on a conditional GAN, the generator requires a noise vector plus a conditional vector. Details on the conditional vector are given in Sec. III-D. Before feeding data to D and C, variables are encoded via different feature encoders depending on the variable type and characteristics. Details of the encoders are provided in Sec. III-C, III-E and III-F.
GANs are trained via a zero-sum min-max game where the discriminator tries to maximize the objective while the generator tries to minimize it. The game can be seen as a mentor (D) providing feedback to a student (G) on the quality of their work. Here, we introduce additional feedback for G based on the information loss, the downstream loss and the generator loss. The information loss matches the first-order (i.e., mean) and second-order (i.e., standard deviation) statistics of synthesized and real records. This leads the synthetic records to have the same statistical characteristics as the real records. The downstream loss captures the correlation between the target variable and the values of the other variables. This helps to check the semantic integrity and penalizes synthesized records in which the combination of values is semantically incorrect. Finally, the generator loss is the cross-entropy between the given conditional vector and the generated output classes. It enforces the conditional generator to produce exactly the classes given by the conditional vector. These three losses are added to the default loss term (i.e., Was+GP) of G during training. G and D are implemented as CNNs with the same structure as in [25]. CNNs are good at capturing the relations between the pixels of an image, which in our case can help to increase the semantic integrity of synthetic data. To process row records stored as vectors with CNNs, we wrap the row data into the closest square matrix dimensions, i.e., d × d where d is the ceiled square root of the row data dimensionality, and pad the missing entries with zeros. C uses a multi-layer perceptron (MLP) with four 256-neuron hidden layers. The classifier is trained on the original data to better interpret the semantic integrity. Hence, synthetic data are reverse-transformed from their matrix encoding to vectors (details in Sec. III-C), and real data is encoded (details in Sec. III-C and III-F) before being used as input for C to create the class label predictions.
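As a concrete illustration of the row-to-matrix wrapping, the following minimal sketch (our own, not the authors' released code) pads an encoded row with zeros and reshapes it into the closest d × d square:

```python
import math
import numpy as np

def wrap_square(row: np.ndarray) -> np.ndarray:
    """Wrap a flat encoded row into the closest d x d square, zero-padded."""
    d = math.ceil(math.sqrt(row.size))        # ceiled square root of the length
    padded = np.zeros(d * d, dtype=row.dtype)
    padded[:row.size] = row
    return padded.reshape(d, d)

assert wrap_square(np.arange(7)).shape == (3, 3)  # 7 values -> 3 x 3 matrix
```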
Let $f_x$ and $f_{G(z)}$ denote the features fed into the softmax layer of D for a real sample $x$ and a sample generated from the latent value $z$, respectively. The information loss for G is expressed as

$$\mathcal{L}^{G}_{\text{info}} = \big\|\, \mathbb{E}[f_x]_{x \sim p_{\text{data}}(x)} - \mathbb{E}[f_{G(z)}]_{z \sim p(z)} \,\big\|_2 + \big\|\, \mathbb{SD}[f_x]_{x \sim p_{\text{data}}(x)} - \mathbb{SD}[f_{G(z)}]_{z \sim p(z)} \,\big\|_2,$$

where $p_{\text{data}}(x)$ and $p(z)$ denote the prior distributions of the real data and the latent variable, and $\mathbb{E}$ and $\mathbb{SD}$ denote the mean and standard deviation of the features, respectively. The downstream loss is given by

$$\mathcal{L}^{G}_{\text{dstream}} = \mathbb{E}\big[\, |l(G(z)) - C(f_e(G(z)))| \,\big]_{z \sim p(z)},$$

where $l(\cdot)$ returns the target variable and $f_e(\cdot)$ returns the input features of a given row. Finally, the generator loss is given by $\mathcal{L}^{G}_{\text{generator}} = H(m_i, \hat{m}_i)$, where $m_i$ and $\hat{m}_i$ are the given and generated conditional vector bits corresponding to column $i$, and $H(\cdot,\cdot)$ is the cross-entropy loss. Columns are selected using the training-by-sampling procedure (see Sec. III-D for details).
Let $\mathcal{L}^{D}_{\text{default}}$ and $\mathcal{L}^{G}_{\text{default}}$ denote the GAN losses of the discriminator and generator from Was+GP, whose distinguishing component is the gradient penalty

$$\lambda_{gp}\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big],$$

where $P_{\hat{x}}$ is defined by sampling uniformly along straight lines between pairs of points sampled from the real data distribution $P_r$ and the generator distribution $P_g$. For G, the complete training objective is

$$\mathcal{L}^{G} = \mathcal{L}^{G}_{\text{default}} + \mathcal{L}^{G}_{\text{info}} + \mathcal{L}^{G}_{\text{dstream}} + \mathcal{L}^{G}_{\text{generator}}.$$

The training objective for D is unchanged. Finally, the loss used to train the auxiliary component C mirrors the downstream loss of the generator, i.e., $\mathcal{L}^{C} = \mathbb{E}\big[\, |l(x) - C(f_e(x))| \,\big]_{x \sim p_{\text{data}}(x)}$.
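For concreteness, the PyTorch sketch below (our own illustration, not the authors' released code) assembles these loss terms; `D`, the feature tensors and the auxiliary predictions are assumed placeholders for the components described above:

```python
import torch
import torch.nn.functional as F

def gradient_penalty(D, real, fake, lam_gp=10.0):
    # Sample x_hat uniformly on lines between real/fake pairs (P_x_hat).
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grad, = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)
    return lam_gp * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def d_loss(D, real, fake):
    # L_D_default: Wasserstein critic loss plus the gradient penalty.
    return D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)

def g_loss(D, fake, feat_real, feat_fake, cond_idx, cond_logits, aux_pred, aux_true):
    default = -D(fake).mean()                                  # L_G_default
    info = (feat_real.mean(0) - feat_fake.mean(0)).norm(2) \
         + (feat_real.std(0) - feat_fake.std(0)).norm(2)       # L_G_info
    dstream = (aux_pred - aux_true).abs().mean()               # L_G_dstream
    generator = F.cross_entropy(cond_logits, cond_idx)         # L_G_generator
    return default + info + dstream + generator
```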

C. Mixed-type Encoder
The tabular data is encoded variable by variable. We distinguish three types of variables: categorical, continuous and mixed. We define variables as mixed if they contain both categorical and continuous values, or continuous values with missing values. We propose the new Mixed-type Encoder to deal with such variables. With this encoder, values of mixed variables are seen as concatenated value-mode pairs. We illustrate the encoding via the exemplary distribution of a mixed variable shown in red in Fig. 3(a). One can see that values can either be exactly $\mu_0$ or $\mu_3$ (the categorical part) or distributed around two peaks at $\mu_1$ and $\mu_2$ (the continuous part). We treat the continuous part by adapting the Mode-Specific Normalization (MSN) idea from [33]: we use a variational Gaussian mixture model (VGM) [5] to estimate the number of modes $k$, e.g., $k = 2$ in our example, and fit a Gaussian mixture. The learned Gaussian mixture is

$$\mathbb{P}(\tau) = \sum_{i=1}^{k} \omega_i\, \mathcal{N}(\tau; \mu_i, \sigma_i),$$

where $\mathcal{N}$ is the normal distribution and $\omega_i$, $\mu_i$ and $\sigma_i$ are the weight, mean and standard deviation of each mode, respectively.
To encode values in the continuous region of the variable distribution, we associate and normalize each value with the mode having the highest probability (see Fig. 3(b)). Given $\rho_1$ and $\rho_2$ being the probability densities of the two modes at the variable value $\tau$ to encode, we select the mode with the highest probability. In our example $\rho_1$ is higher, so we use mode 1 to normalize $\tau$. The normalized value is $\alpha = \frac{\tau - \mu_1}{4\sigma_1}$. Moreover, we keep track of the mode $\beta$ used to encode $\tau$ via one-hot encoding, e.g., $\beta = [0, 1, 0, 0]$ in our example. The final encoding of a value is given by the concatenation $\alpha \oplus \beta$; a full row with continuous/mixed variables $1, \ldots, N$ and categorical variables with one-hot encodings $\gamma_1, \ldots, \gamma_M$ is encoded as

$$\alpha_1 \oplus \beta_1 \oplus \cdots \oplus \alpha_N \oplus \beta_N \oplus \gamma_1 \oplus \cdots \oplus \gamma_M. \quad (5)$$
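A minimal sketch of this encoding, assuming scikit-learn's BayesianGaussianMixture as a stand-in for the paper's VGM fitting procedure:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_vgm(values: np.ndarray, max_modes: int = 10) -> BayesianGaussianMixture:
    # The weight concentration prior prunes unused mixture components.
    vgm = BayesianGaussianMixture(n_components=max_modes,
                                  weight_concentration_prior=1e-3)
    return vgm.fit(values.reshape(-1, 1))

def encode(tau: float, vgm: BayesianGaussianMixture):
    mu = vgm.means_.ravel()
    sigma = np.sqrt(vgm.covariances_).ravel()
    i = vgm.predict(np.array([[tau]]))[0]     # mode with the highest probability
    alpha = (tau - mu[i]) / (4 * sigma[i])    # normalized value
    beta = np.eye(vgm.n_components)[i]        # one-hot mode indicator
    return alpha, beta
```

The categorical values of a mixed column (e.g., exactly 0 for Mortgage) would get their own one-hot positions with $\alpha = 0$, alongside the fitted Gaussian modes.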

D. Counter Imbalanced Training Datasets
In CTAB-GAN+, we use a conditional GAN to counter imbalanced training datasets via training-by-sampling [33], extended to include the modes of continuous and mixed columns. When we sample real data, we use the conditional vector to filter and rebalance the training data. The conditional vector V is a bit vector given by the concatenation of all mode one-hot encodings $\beta$ (for continuous and mixed variables) and all class one-hot encodings $\gamma$ (for categorical variables) of all variables present in Eq. (5). Each conditional vector specifies a single mode or class. In more detail, V is a zero vector with a single one at the position corresponding to the selected variable and its selected mode/class. Fig. 4 shows an example with three variables, one continuous ($C_1$), one mixed ($C_2$) and one categorical ($C_3$), with class 2 selected on $C_3$.
To rebalance the dataset, each time we need a conditional vector during training, we first randomly choose a variable with uniform probability. Then we calculate the probability distribution over the modes (or classes for categorical variables) of that variable, using frequency as a proxy, and sample a mode based on the logarithm of its probability. Using the log probability instead of the raw frequency gives minority modes/classes higher chances to appear during training. This helps to alleviate the collapse issue for rare modes/classes. Extending the conditional vector to include continuous and mixed variables helps to deal with imbalance in the frequency of the modes used to represent them. Moreover, since the generator is conditioned on all data types during training, this enhances the learned correlation between all variables.
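A minimal sketch of this sampling step (our own illustration; `mode_freqs` is an assumed pre-computed list of per-variable mode/class counts):

```python
import numpy as np

def sample_cond_vector(mode_freqs, rng=np.random.default_rng()):
    """mode_freqs[i]: occurrence counts of the modes/classes of variable i."""
    col = rng.integers(len(mode_freqs))          # pick a variable uniformly
    logp = np.log(mode_freqs[col] + 1)           # log-frequency of its modes
    mode = rng.choice(len(logp), p=logp / logp.sum())
    offsets = np.cumsum([0] + [len(f) for f in mode_freqs])
    v = np.zeros(offsets[-1])
    v[offsets[col] + mode] = 1.0                 # single active bit in V
    return v

v = sample_cond_vector([np.array([900, 90, 10]),   # e.g., an imbalanced variable
                        np.array([500, 500])])
```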

E. General Transform
CTAB-GAN originally adopts the mode-specific normalization (MSN) from CTGAN to encode all continuous variables. MSN uses VGM to estimate the distribution of continuous variables. Fig. 1(a) shows that VGM is not suitable for simple distributions such as a single Gaussian. Another problem is the dimensionality explosion caused by one-hot encoding categorical variables with a high number of categories. To counter both problems we propose the general transform (GT). GT is an effective approach to limit the complexity of our algorithm.
The main idea of GT is to encode columns into the range (−1, 1). This makes the encoding directly compatible with the output range of a generator using the tanh activation function. It is achieved via a shifted and scaled min-max normalization. Mathematically, given a data point $x_i$ of a continuous variable $x$, the transformed value is

$$x^t_i = 2 \cdot \frac{x_i - \min(x)}{\max(x) - \min(x)} - 1,$$

where $\min(x)$ and $\max(x)$ represent the minimum and maximum values of the variable. Inversely, an encoded or generated value $x^t_i$ is reverse-transformed as

$$x_i = \frac{x^t_i + 1}{2} \cdot \big(\max(x) - \min(x)\big) + \min(x).$$

Continuous variables are directly treated with the above formulas for normalization and denormalization. Categorical variables are first encoded as integers before the normalization and rounded to integers after the denormalization.
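A minimal sketch of GT (our own illustration of the two formulas above):

```python
import numpy as np

def gt_encode(x: np.ndarray):
    """Shift and scale a column into (-1, 1) to match the generator's tanh output."""
    lo, hi = x.min(), x.max()
    return 2 * (x - lo) / (hi - lo) - 1, (lo, hi)

def gt_decode(xt: np.ndarray, bounds, categorical: bool = False):
    lo, hi = bounds
    x = (xt + 1) / 2 * (hi - lo) + lo
    # Integer-coded categorical columns are rounded back after denormalization.
    return np.rint(x).astype(int) if categorical else x
```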
A similar transform was first introduced by TableGAN, but TableGAN applies it to all variables, which is not optimal. From our experiments, we find that this technique only works well for continuous columns with simple distributions such as a single-mode Gaussian and does not cater to more complex distributions. By default, CTAB-GAN+ therefore treats continuous variables with MSN and only selectively uses GT for single-mode Gaussian variables. Similarly, categorical columns should prefer the default one-hot encoding over GT. Using GT removes the mode indicator, e.g., $\beta_1$ in Fig. 4, from the conditional vector, forgoing the ability to enhance the correlation between variables for specific categories. Moreover, using integers instead of one-hot vectors imposes artificial distances between the different categories which do not reflect reality. Therefore, we recommend using GT for categorical variables only when they contain so many categories that the available machines cannot train on the one-hot encoded data.

F. Treat Long Tails
We encode continuous values using variational Gaussian mixtures to treat multi-mode data distributions (details in Sec. III-C). However, Gaussian mixtures cannot deal with all types of data distributions, notably long-tail distributions where a few rare points lie far from the bulk of the data. VGM has difficulty encoding the values towards the tail. To counter this issue, we pre-process variables with long-tail distributions using a logarithm transformation. For such a variable with lower bound $l$, we replace each value $\tau$ with a compressed $\tau^c$:

$$\tau^c = \begin{cases} \log(\tau) & \text{if } l > 0, \\ \log(\tau - l + \epsilon) & \text{if } l \leq 0, \text{ with } \epsilon > 0. \end{cases}$$

The log-transform compresses the distance between the tail and the bulk of the data, making it easier for VGM to encode all values, including those in the tail. We show the effectiveness of this simple yet performant method in Sec. IV-E.
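A minimal sketch of this pre-processing (our own illustration; the inverse is applied to the synthetic output, and `eps` is an assumed small constant):

```python
import numpy as np

def log_compress(tau: np.ndarray, eps: float = 1e-3):
    lower = tau.min()
    # Shift so the log argument stays positive when the lower bound l <= 0.
    tau_c = np.log(tau) if lower > 0 else np.log(tau - lower + eps)
    return tau_c, lower

def log_decompress(tau_c: np.ndarray, lower: float, eps: float = 1e-3):
    return np.exp(tau_c) if lower > 0 else np.exp(tau_c) + lower - eps
```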

G. Differentially Private Training

DP-SGD [1] is the central framework used to provide DP guarantees in this work. DP-SGD uses noisy stochastic gradient descent to limit the influence of individual training samples $x_i$. After computing the gradient $g(x_i)$, the gradient is clipped based on a clipping parameter $C$ and its L2 norm, i.e., $\bar{g}(x_i) \leftarrow g(x_i) / \max\big(1, \|g(x_i)\|_2 / C\big)$, and Gaussian noise is added, i.e., $\tilde{g}(x_i) \leftarrow \bar{g}(x_i) + \mathcal{N}(0, \sigma^2 C^2 I)$. $\tilde{g}$ is then used in place of $g$ to update the network parameters as in traditional SGD.
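A minimal per-sample sketch of this clip-and-noise step (our own illustration, not a complete DP-SGD implementation):

```python
import torch

def dp_sgd_step(per_sample_grads: torch.Tensor, C: float = 1.0, sigma: float = 5.0):
    """per_sample_grads: (B, P) matrix of flattened per-sample gradients g(x_i)."""
    norms = per_sample_grads.norm(2, dim=1, keepdim=True)
    clipped = per_sample_grads / torch.clamp(norms / C, min=1.0)   # g_bar(x_i)
    noisy = clipped + sigma * C * torch.randn_like(clipped)        # g_tilde(x_i)
    return noisy.mean(0)     # averaged noisy gradient used for the SGD update
```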
One of the biggest challenges with DP-SGD is tuning the clipping parameter C, since clipping greatly degrades the information stored in the original gradients [6]. Choosing an optimal clipping value that does not significantly impact utility is crucial. However, tuning the clipping parameter is laborious as the optimal value fluctuates depending on the network hyper-parameters (i.e., model architecture, learning rate) [1].
To avoid an intensive hyper-parameter search, [6] proposes to use the Wasserstein loss with a gradient penalty term.This term ensures that the discriminator generates bounded gradient norms which are close to 1 under real and generated distributions.Therefore, an optimal clipping threshold of C = 1 is obtained implicitly.
CTAB-GAN+ trains the discriminator using DP-SGD, where the number of training iterations is determined by the total privacy budget $(\epsilon, \delta)$. Thus, to compute the number of iterations, the privacy budget spent in every iteration must be bounded and accumulated. For this purpose we use the subsampled RDP analytical moments accountant technique.
Proof 1. Let $f = \text{clip}(\bar{g}_D, C)$ be the clipped gradient of the discriminator before adding noise. The sensitivity is derived via the triangle inequality:

$$\Delta_2 f = \max_{S, S'} \|f(S) - f(S')\|_2 \leq 2C.$$

Since $C = 1$ as a consequence of the Wasserstein loss with gradient penalty, and by using (3), the Gaussian mechanism used within the DP-SGD procedure, denoted as $\mathcal{M}_{\sigma}$ and parameterized by noise scale $\sigma$, is $(\lambda, 2\lambda/\sigma^2)$-RDP.
Furthermore, each discriminator update for a batch of real data points $\{x_1, \ldots, x_B\}$ can be represented as

$$\theta_D \leftarrow \theta_D - \eta \cdot \frac{1}{B} \sum_{i=1}^{B} \tilde{g}_D(x_i),$$

where $\tilde{g}_D$ and $\theta_D$ represent the perturbed gradients and the weights of the discriminator network, respectively, and $\eta$ is the learning rate. This may be regarded as a composition of B Gaussian mechanisms and treated via (2). The privacy cost of a single gradient update step for the discriminator can thus be expressed as $(\lambda, \sum_{i=1}^{B} 2\lambda/\sigma^2)$, or equivalently $(\lambda, 2B\lambda/\sigma^2)$. Note that $\mathcal{M}_{\sigma}$ is only applied to those gradients that are computed with respect to the real training dataset [1], [36]. Hence, the gradients computed with respect to the synthetic data and the gradient penalty term are left undisturbed. Next, to further amplify the privacy protection of the discriminator, we rely on (4) with subsampling rate $\gamma = B/N$, where B is the batch size and N is the size of the training dataset. Intuitively, subsampling adds another layer of randomness and enhances privacy by decreasing the chances of leaking information about particular individuals who are not included in any given subsample of the dataset.
Lastly, it is worth mentioning that the Was+GP training objective has one major pitfall with respect to the privacy cost: it encourages the use of a stronger discriminator network to provide more meaningful gradient updates to the generator. This requires performing multiple updates to the discriminator for each corresponding update to the generator, leading to a faster consumption of the overall privacy budget.

IV. EXPERIMENTAL ANALYSIS FOR DATA UTILITY
To show the efficacy of the proposed CTAB-GAN+, we select seven commonly used machine learning datasets and compare against four SOTA GAN-based tabular data generators and CTAB-GAN. We evaluate the effectiveness of CTAB-GAN+ in terms of the resulting ML utility and statistical similarity to the real data. Moreover, we provide ablation analyses to highlight the efficacy of the unique components of CTAB-GAN+.

A. Experimental Setup
Datasets. Our algorithm is tested on seven commonly used machine learning datasets. Three of them, Adult, Covertype and Intrusion, are from the UCI machine learning repository 1. Credit and Loan are from Kaggle 2. These five tabular datasets are used for classification tasks with a categorical target variable. To also consider regression tasks, we use two more datasets from Kaggle 3, Insurance and King, where the target variable is continuous.
Due to computing resource limitations, 50K rows of data are sampled randomly in a stratified manner with respect to the target variable for the Covertype, Credit and Intrusion datasets. The Adult, Loan, Insurance and King datasets are taken in their entirety. The details of each dataset are shown in Tab. II. We assume that the data type of each variable is known before training; [33] makes the same assumption.
Baselines. CTAB-GAN+ is compared with CTAB-GAN and 4 other SOTA GAN-based tabular data generators: CTGAN, TableGAN, CWGAN and MedGAN. For a fair comparison, all algorithms are coded using PyTorch, with the generator and discriminator structures matching the descriptions provided in their respective papers. For the Gaussian mixture estimation of continuous variables, we use the same setting as the evaluation of CTGAN, i.e., 10 modes. All algorithms are trained for 150 epochs on the Adult, Covertype, Credit and Intrusion datasets, and for 300 epochs on the Loan, Insurance and King datasets; the latter three datasets are smaller than the others and require more epochs to converge. Lastly, each experiment is repeated 3 times.
Environment.Experiments are run under Ubuntu 20.04 on a machine equipped with 32 GB memory, a GeForce RTX 2080 Ti GPU and a 10-core Intel i9 CPU.

B. Evaluation Metrics
The evaluation is conducted on two dimensions: (1) machine learning (ML) utility, and (2) statistical similarity.They measure if the synthetic data can be used as a good proxy of the original data.
1) Machine Learning Utility: The ML utility of classification and regression tasks is quantified differently. For classification, we quantify the ML utility via the performance, i.e., accuracy, F1-score and AUC, achieved by 5 widely used machine learning algorithms trained on real versus synthetic data: decision tree classifier, linear support vector machine (SVM), random forest classifier, multinomial logistic regression and MLP. Fig. 5 shows the evaluation process for classification datasets. The training dataset and synthetic dataset are of the same size. The aim is to show the difference in ML utility when an ML model is trained on synthetic versus real data. We use different classification performance metrics: accuracy is the most commonly used, but does not cope well with imbalanced target variables; F1-score and AUC are more stable metrics for such cases. AUC ranges from 0 to 1. For regression tasks, we quantify the ML utility in a similar manner but using 4 common regression algorithms (linear regression, ridge regression, lasso regression and Bayesian ridge regression) and 3 regression metrics (mean absolute percentage error (MAPE), explained variance score (EVS) and $R^2$ score). All algorithms are implemented using scikit-learn 0.24.2 with default parameters, except max-depth 28 for decision tree and random forest, and 128 neurons for MLP. For a fair comparison, hyper-parameters are fixed across all datasets. Due to this, our results can slightly differ from [33], where the authors use different ML models and hyper-parameters for different datasets.
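A minimal sketch of the evaluation flow in Fig. 5 for one of the five models (our own illustration; the macro-averaged F1 is an assumption, and the dataframes are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def utility_gap(X_real, y_real, X_synth, y_synth, X_test, y_test):
    # Train the same model once on real and once on synthetic data,
    # then test both on the same held-out real data.
    m_real = RandomForestClassifier(max_depth=28).fit(X_real, y_real)
    m_synth = RandomForestClassifier(max_depth=28).fit(X_synth, y_synth)
    f1_real = f1_score(y_test, m_real.predict(X_test), average="macro")
    f1_synth = f1_score(y_test, m_synth.predict(X_test), average="macro")
    return f1_real - f1_synth   # a smaller gap means higher ML utility
```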
2) Statistical Similarity: Three metrics are used to quantify the statistical similarity between real and synthetic data.
Jensen-Shannon divergence (JSD). The JSD quantifies the difference between the probability mass distributions of individual categorical variables belonging to the real and synthetic datasets, respectively. Moreover, this metric is bounded between 0 and 1 and is symmetric, allowing for an easy interpretation of results.

Wasserstein distance (WD). In a similar vein, the Wasserstein distance captures how well the distributions of individual continuous/mixed variables are emulated by the synthetically produced datasets with respect to the real datasets. We use the WD because we found the JSD metric numerically unstable for evaluating the quality of continuous variables, especially when there is no overlap between the synthetic and original datasets; hence we resort to the more stable Wasserstein distance.
Difference in pair-wise correlation (Diff. Corr.). To evaluate how well feature interactions are preserved in the synthetic datasets, we first compute the pair-wise correlation matrix of the columns within the real and synthetic datasets individually. The Pearson correlation coefficient is used between any two continuous variables; it ranges between [−1, +1]. Similarly, the Theil uncertainty coefficient is used to measure the correlation between any two categorical features; it ranges between [0, 1]. The correlation ratio, also ranging between [0, 1], is used between categorical and continuous variables. Note that the dython library is used to compute these metrics. Finally, the difference between the pair-wise correlation matrices of the real and synthetic datasets is computed.
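A minimal sketch of the three metrics (our own illustration, assuming recent versions of scipy and dython; `real` and `synth` are pandas DataFrames):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance
from dython.nominal import associations

def avg_jsd(real, synth, cat_cols):
    def pmf(s, cats):
        return s.value_counts(normalize=True).reindex(cats, fill_value=0)
    return np.mean([jensenshannon(pmf(real[c], real[c].unique()),
                                  pmf(synth[c], real[c].unique()))
                    for c in cat_cols])

def avg_wd(real, synth, num_cols):
    return np.mean([wasserstein_distance(real[c], synth[c]) for c in num_cols])

def corr_diff(real, synth):
    # Theil's U for categorical pairs, per the paper's choice.
    cr = associations(real, nom_nom_assoc="theil", compute_only=True)["corr"]
    cs = associations(synth, nom_nom_assoc="theil", compute_only=True)["corr"]
    return np.linalg.norm(cr.values - cs.values)
```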

C. Results Analysis
We first discuss the results on ML utility before addressing statistical similarity.
ML Utility. Tab. III shows the results for the classification datasets. A better synthetic dataset is expected to show small differences in ML utility between models trained on real and synthetic data. It can be seen that CTAB-GAN+ outperforms all other SOTA methods and CTAB-GAN on all the metrics: CTAB-GAN+ decreases the AUC difference from 0.094 (best baseline) to 0.041 (a 56.4% reduction), and the accuracy difference from 8.9% (best baseline) to 5.23% (a 41.2% reduction). The improvement over CTAB-GAN shows that the general transform and the Was+GP loss indeed help enhance the feature representation and GAN training. Tab. IV shows the results for the regression datasets. The results of CTAB-GAN and CTAB-GAN+ are far better than those of all other baselines. This shows the effectiveness of the feature engineering. Additionally, as CTAB-GAN+ adds the auxiliary regressor which explicitly enhances the regression analysis, the overall downstream performance of CTAB-GAN+ is better than that of CTAB-GAN. We note that CTAB-GAN uses the auxiliary classification loss for the classification analysis and disables it for the regression analysis.

Statistical similarity. Statistical similarity results for the classification datasets are reported in Tab. III and for the regression datasets in Tab. IV. CTAB-GAN+ stands out again across all baselines in both groups of datasets. For classification datasets, CTAB-GAN+ outperforms CTAB-GAN, CTGAN and TableGAN by 37.1%, 44.3% and 51.3% in average JSD, respectively. This is due to the use of the conditional vector, the log-frequency sampling and the extra losses, which work well for both balanced and imbalanced distributions. For continuous variables, the average WD column shows some extreme numbers, such as 46257 and 238155, compared to 484 for CTAB-GAN+. The reason is that these algorithms generate extremely large values for long-tail variables. Compared to CTAB-GAN, the significant improvement comes from the use of the general transform to model continuous columns with simple distributions, which were originally modelled with MSN in CTAB-GAN and CTGAN. For regression datasets, CTAB-GAN+ outperforms CTAB-GAN by 63.4% and 74.5% in average JSD and average WD, respectively. Besides JSD and WD, the synthetic regression datasets maintain a much better correlation than all baselines. This result confirms the efficacy of the auxiliary regressor.

D. Ablation Analysis
For the sake of simplicity, the ablation analyses are only conducted on the classification datasets. We focus on an ablation study that analyses the impact of the different components of CTAB-GAN and CTAB-GAN+.
1) With CTAB-GAN: To illustrate the efficiency of each strategy, we implement four ablation studies which remove the components of CTAB-GAN one by one: (1) w/o C: Classifier C and the corresponding classification loss for generator G are removed from CTAB-GAN; (2) w/o I. loss (information loss): we remove the information loss from CTAB-GAN; (3) w/o MSN: we substitute the VGM-based mode-specific normalization of continuous variables with min-max normalization and use simple one-hot encoding for categorical variables; here the conditional vector is the same as for CTGAN; (4) w/o LT (long tail): the long-tail treatment is no longer applied. This only affects datasets with long-tail columns, i.e., Credit and Intrusion.
The results are compared with the reference CTAB-GAN implementing all strategies. All experiments are repeated 3 times, and results are evaluated with the same 5 machine learning algorithms introduced in Sec. IV-B1. The test datasets and evaluation flow are the same as in Sec. IV-A and Sec. IV-B. Tab. V shows the results in terms of the F1-score difference between each ablation and CTAB-GAN. Each component of CTAB-GAN has a different impact on different datasets. For instance, w/o C has a negative impact for all datasets except Credit; since Credit has only 30 continuous variables and one target variable, the semantic check cannot be very effective. w/o information loss has a positive impact for Loan, but results degenerate for all other datasets; it can even make the model unusable, e.g., for Intrusion. w/o MSN performs badly for Covertype, but has little impact for Intrusion. For Credit, w/o MSN performs better than the original CTAB-GAN. This is because out of 30 continuous variables, 28 are nearly single-mode Gaussian distributed; the high initial number of modes, i.e., 10 for each continuous variable (the same setting as in CTGAN), degrades the estimation quality. w/o LT has the biggest impact on Intrusion, since it contains 2 long-tail columns which are important predictors for the target column. For Credit, the influence is limited: even though the long-tail treatment fits the Amount column well (see Sec. IV-E), this variable is not a strong predictor for the target column.

2) With CTAB-GAN+: To show the efficacy of the general transform and the Was+GP loss in CTAB-GAN+, we conduct two ablation studies: (1) w/o GT, which disables the general transform in CTAB-GAN+; all continuous variables use MSN and all categorical variables use one-hot encoding; (2) w/o Was+GP, which switches the default GAN training loss from Was+GP to the original GAN loss defined in [12]. It is worth noting that the information, downstream and generator losses are still present in this experiment. The other experimental settings are the same as in Sec. IV-D1. Tab. VI shows the results in terms of the F1-score difference among the different versions of CTAB-GAN+. For the Covertype, Credit and Intrusion datasets, the effects of GT and Was+GP are all positive. GT significantly boosts performance on the Covertype and Credit datasets, but for Adult it worsens the result. The reason is that the Adult dataset contains only one GT column, age. Since this column is strongly correlated with other columns, the original MSN encoding can better capture this interdependence. The positive impact of Was+GP, on the other hand, is limited but consistent across all datasets. The only exception is the Loan dataset, where GT and Was+GP have minor impacts. This is due to the fact that Loan has fewer variables than the other datasets, which makes it easier to capture the correlation between columns. CTAB-GAN already performs well on Loan; therefore, GT and Was+GP cannot further improve performance on this dataset.

E. Results for Motivation Cases
After reviewing all the metrics, let us recall the four motivation cases from Sec. I-A.
Single Gaussian variables. Fig. 6(a) shows the real and CTAB-GAN+ generated bmi variable. CTAB-GAN+ reproduces the distribution with only minor differences, which shows the effectiveness of the general transform in modelling variables with a single Gaussian distribution.
Mixed data type variables. Fig. 6(b) compares the real and CTAB-GAN+ generated Mortgage variable in the Loan dataset. CTAB-GAN+ encodes this variable as a mixed type. We can see that CTAB-GAN+ generates exact 0 values with a frequency close to the real data.
Long tail distributions. Fig. 6(c) compares the cumulative frequency graphs for the Amount variable in Credit. This variable has a typical long-tail distribution. One can see that CTAB-GAN+ recovers the real distribution almost perfectly. Due to the log-transform pre-processing, CTAB-GAN+ learns this structure significantly better than the SOTA methods shown in Fig. 1(c).

V. EXPERIMENT ANALYSIS FOR DIFFERENTIAL PRIVACY
In this section, we show the effect of adding DP to CTAB-GAN+ and compare CTAB-GAN+ with three SOTA DP GAN algorithms.

A. Experiment Setup
Datasets. For the sake of simplicity, we only use the classification datasets: Adult, Covertype, Intrusion, Credit and Loan.
Metrics. We use the same ML utility metrics as in Section IV-B under two privacy budgets, i.e., $\epsilon = 1$ and $\epsilon = 100$.
Baselines. CTAB-GAN+ is compared against 3 SOTA architectures: PATE-GAN [16], DP-WGAN [32] and GS-WGAN [6]. The code of PATE-GAN and DP-WGAN is taken from the Private Data Generation Toolbox 5, which already adapts them for tabular data synthesis. We extend GS-WGAN to the tabular domain by converting each data row into a bitmap image: we first normalize all values to the range [0, 1] and re-shape rows in the form of square images, filling missing entries (if any) with zeros. The re-shaped rows are fed into the algorithm and the generated images are transformed back into data rows by reversing the two operations. All hyper-parameters are kept at their default values, except for the network architecture, which is adjusted to the spatial dimensions of the tabular datasets.
Privacy accounting. To compute the privacy cost in a fair manner, we use the RDP accountant for all approaches that employ DP-SGD, i.e., CTAB-GAN+, DP-WGAN and GS-WGAN, as it provides tighter privacy guarantees than the moments accountant [31]. PATE-GAN uses the moments accountant by default. We set $\delta = 10^{-5}$ for all experiments. We follow the example of DP-WGAN and set the exploration span of $\lambda$ to [2, 4096]. We use (1) to convert the overall cumulative privacy cost computed in terms of RDP back to $(\epsilon, \delta)$-DP.

B. Results Analysis
ML Utility. Tab. VII presents the differences in ML utility between models trained on the original and synthetic data; lower is better. CTAB-GAN+ outperforms all other SOTA algorithms under both privacy budgets. With a looser privacy budget, i.e., higher $\epsilon$, almost all metrics improve for all algorithms. The only exception is the AUC of GS-WGAN, but the difference is minor. These results are in line with our expectation, because a higher privacy budget means training the model with less injected noise and more training epochs before exhausting the privacy budget. The superior performance of CTAB-GAN+ compared to the other baselines can be explained by its more sophisticated architecture, i.e., a conditional GAN with an improved training objective and the capacity to better deal with the challenges of the tabular domain, such as imbalanced categorical columns and mixed data types. This also explains the poor results of GS-WGAN, which is not designed to handle these specific issues and achieves the worst overall performance.
Statistical Similarity. Tab. VIII summarizes the statistical similarity results. Among all DP models, CTAB-GAN+ and GS-WGAN consistently improve across all metrics when the privacy budget is increased, but the performance of GS-WGAN is significantly worse than that of CTAB-GAN+. With a higher privacy budget, the average WD of PATE-GAN slightly increases, and the correlation difference of DP-WGAN increases too. This highlights the inability of these methods to capture the statistical distributions during training despite a higher privacy budget. This can be explained by the lack of an effective training framework for dealing with the complex statistical distributions present in the tabular domain, which arise from imbalances in categorical columns and skews in continuous columns commonly found in real-world tabular datasets.

Figure 1: Challenges of modeling industrial datasets using existing GAN-based table generators: (a) single Gaussian, (b) mixed type, (c) long tail distribution, and (d) skewed data

Figure 3: Encoding for a mixed data type variable

Figure 4: Conditional vector: the example selects class 2 from the third variable out of three

Figure 5: Evaluation flow for the ML utility of classification datasets

Figure 6: Modeling the motivation cases with CTAB-GAN+: (a) single Gaussian, (b) mixed type, (c) long tail distribution, and (d) skewed data

Table II: Description of datasets

Table III: Difference in ML utility and statistical similarity for classification between original and synthetic data, averaged over five datasets

Table IV: Difference in ML utility and statistical similarity for regression between original and synthetic data, averaged over two datasets

Table VI: Ablation analysis for CTAB-GAN+ (F1-score difference)

Table VII: Difference in accuracy (%), F1-score, AUC and AP between original and synthetic data: average over 5 ML models and 5 datasets with privacy budgets $\epsilon = 1$ and $\epsilon = 100$

Table VIII: Statistical similarity metrics between original and synthetic data: average over 5 datasets with privacy budgets $\epsilon = 1$ and $\epsilon = 100$