ORIGINAL RESEARCH article

Front. Comput. Sci., 18 February 2026

Sec. Computer Security

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1752739

Privacy-preserving process data generation based on dual-discriminator conditional generative adversarial networks

  • 1. School of Computer Science and Technology, Tongji University, Key Laboratory of Embedded System and Service Computing, Ministry of Education, Shanghai, China

  • 2. School of Information and Intelligent Science, Donghua University, Shanghai, China


Abstract

Introduction:

The growing adoption of data-centric business analytics demands effective safeguarding techniques for processing data that contains procedural details. Although Petri net-driven process mining successfully extracts operational knowledge from activity sequences, current protection approaches often diminish analytical value. Therefore, preserving process-related information while ensuring privacy remains a critical challenge.

Methods:

This study presents a Privacy-Preserving Process Data Generation method based on Dual-Discriminator Conditional Generative Adversarial Networks (P3DGAN). To avoid mode collapse during model training, P3DGAN employs two discriminators that separately model the dataflow and workflow characteristics of process data. We further propose a game-optimization strategy based on Petri net theory to capture the global distribution characteristics of process data, and we introduce a workflow-level privacy metric based on the Euclidean distance between trace variants (ED-TV) to support comprehensive risk assessment.

Results:

Experimental results on four real-world process datasets demonstrate that our method can generate high-quality process data with strong privacy protection compared with competitive peers.

Discussion:

The proposed framework achieves an effective multi-dimensional privacy-utility trade-off, demonstrating its potential for practical applications in privacy-sensitive domains such as healthcare, banking, and manufacturing.

1 Introduction

Growing privacy concerns in process mining necessitate privacy-protection methods for process data that model complex business processes. Although Petri net-driven process mining derives operational knowledge directly from sequences of activities, protection mechanisms often limit the value of analysis (van der Aalst, 2012). Traditional anonymization methods are insufficient because they do not account for the sequential dependencies that reveal institutional process structures (Brzychczy et al., 2024; Elkoumy et al., 2021).

Process event logs exhibit unique characteristics not found in standard tabular formats (Chundawat et al., 2024; Wang et al., 2024a). Each row records a single event with field-level attributes (time, resource, identifiers), while its position within a case's execution embeds it in a long activity sequence (Augusto et al., 2018). Figure 1 demonstrates this property: individual case executions are recorded in different rows, but their consecutive order forms workflow information. Releasing unedited logs would therefore jeopardize not only identity markers but also confidential information about the process itself. This dual characteristic requires tailored privacy-preserving techniques that can address both dataflow and workflow information leakage (Gursoy et al., 2016).

Figure 1

Existing protection methods for process data can be categorized into three groups. Anonymization methods (Ye et al., 2024; Fahrenkrog-Petersen et al., 2019; Rafiei et al., 2020; Rott et al., 2024) mask identifying attributes using techniques such as field masking or category generalization. Encryption-based schemes (Tillem et al., 2016; Rafiei et al., 2018) enable joint processing of encrypted data. Differential privacy methods (Fahrenkrog-Petersen et al., 2020; Mannhardt et al., 2019) introduce controlled noise while adhering to strict privacy guarantees.

However, existing methods for protecting process data privacy still suffer from the following two drawbacks:

(1) Inadequate behavioral retention. Most current methods provide data privacy protection only from the dataflow perspective. The main drawback of anonymization-based methods is the potential loss of analytical value due to overgeneralization or information suppression in the data flow. For example, excessive anonymization may obscure critical insights into disease transmission patterns when analyzing medical process data (Rott et al., 2024). Although encryption-based methods provide a high level of security, the complexity and potential information loss introduced during data preprocessing or transformation may limit the ability to analyze process data. For instance, in time-series analysis, encryption prevents analysis tools from identifying intrinsic patterns (workflow patterns) within process data (Tillem et al., 2016; Rafiei et al., 2018). Furthermore, differential privacy-based methods inject noise into the dataflow, which reduces the accuracy and utility of process data (Fahrenkrog-Petersen et al., 2020; Mannhardt et al., 2019).

(2) Limited risk evaluation. Existing process data privacy risk assessment methods cannot effectively capture sequence characteristics (workflow features). This limitation is particularly evident in anonymization techniques (Ye et al., 2024; Fahrenkrog-Petersen et al., 2019; Rafiei et al., 2020; Rott et al., 2024), where researchers struggle to accurately quantify the extent to which anonymization operations distort the original data structure, resulting in either privacy over-protection that diminishes data utility (Gursoy et al., 2016) or inadequate anonymization that increases re-identification risk (Ye et al., 2024). In medical process mining, excessive anonymization may obscure critical patterns in patient treatment paths, whereas insufficient anonymization could expose patient privacy. Furthermore, the absence of accurate distance measures makes it impossible to determine the optimal trade-off between privacy protection and data utility; consequently, researchers have difficulty selecting suitable anonymization parameters.

Generative methods offer an alternative approach (Wang et al., 2024b; Gui et al., 2021). Instead of modifying real records, they create synthetic records that are statistically similar to the real records but contain no real users. Recent advances in visual synthesis (Xie et al., 2018; Chen et al., 2020; Hu et al., 2022) and structured generation (Zhao et al., 2021; Qiao et al., 2023) suggest feasibility for process records (Xu et al., 2019; Zhao et al., 2024; Lu et al., 2023; Dung and Huynh, 2022). However, the straightforward use of standard generative models introduces new challenges. Mode collapse, which limits the diversity of generated outputs, has been shown to undermine the representation of rare, but operationally important execution traces (Gui et al., 2021; Wang et al., 2022). Long procedural dependencies are hard to capture with simple models (Franzoi et al., 2025). Evaluation metrics designed for tabular arrays do not account for structural correctness and thus do not measure workflow validity (Liu et al., 2021).

In this study, we propose a Privacy-Preserving Process Data Generation method based on a Dual-Discriminator Conditional Generative Adversarial Network (P3DGAN). P3DGAN pairs a generator with dual discriminators that model process data from both the dataflow (tabular data) and workflow (directly-follows relationships) perspectives. This dual-discriminator design challenges the generator to produce diverse samples, mitigating mode collapse: even if one discriminator collapses, the other can still provide effective discrimination (Wang et al., 2022). Beyond directly-follows relationships, global structural knowledge is incorporated through Petri net-based deadlock detection. Within this adversarial formulation, a trade-off is struck between privacy protection and data utility.

The evaluation is based on a combination of traditional measures of utility and a novel measure of workflow risk (Rozinat and van der Aalst, 2008). First, we propose a risk assessment method for trace variants based on Euclidean distance, which uses a distance-based metric (Pereira et al., 2024) and incorporates re-identification attacks (Ye et al., 2024) to assess the risk of synthetic data. Second, to evaluate data utility, we employ the table-evaluator (Rai and Sural, 2023) and the process mining method (Akhramovich et al., 2024). Finally, four real-world event datasets are used to demonstrate the outstanding performance of our model for privacy-preserving process data.

Our contributions are:

  • Dual-discriminator architecture in process mining. We propose a generative adversarial network with dual discriminators over the dataflow and workflow of process data. To our knowledge, this is the first study to formulate process data generation as a minimax game.

  • Petri net-based game optimization strategy. We introduce a deadlock condition loss to limit the policy space of the generator. This policy discards structurally invalid executions and leads to ~10% better generalization than unconstrained baselines.

  • ED-TV: Novel workflow-level privacy metric. We propose a risk measure based on the Euclidean distance between trace variants. Our method achieves an excellent trade-off: less than 0.5% re-identification on two datasets, with similarity scores of 0.729–0.951 and F1-scores of 0.723–0.836 across four real-life datasets.

The remainder of the paper is structured as follows. Section 2 reviews related studies, including an overview of process data privacy protection methods. Section 3 provides preliminaries on generative adversarial networks. Section 4 proposes a privacy protection framework for process data based on dual-discriminator generative adversarial networks. Section 5 designs a game optimization strategy based on a deadlock-condition loss within this framework. Section 6 theoretically proves that our proposed method provides differential privacy guarantees. Section 7 introduces a method based on the Euclidean distance between trace variants to enhance privacy risk assessment for process data. Section 8 compares our method qualitatively and quantitatively with several state-of-the-art methods on publicly available datasets. Section 9 concludes this study.

2 Related work

Privacy protection for operational event records has prompted a range of technical solutions. This review categorizes prior research into three groups: anonymization-based, encryption-based, and differential privacy-based approaches. We then distinguish pseudonymization from synthetic data generation and explain our rationale for adopting generative adversarial networks.

2.1 Anonymization-based privacy protection in process data

Anonymization techniques mask identifying details by transforming or removing fields.

Pretsa (Fahrenkrog-Petersen et al., 2019): Fahrenkrog-Petersen and colleagues developed a prefix structure technique that achieves K-anonymity with T-similarity. The approach builds a prefix structure from activity sequences, incrementally broadens activities until groups exceed K sequences, and maintains ≥T distinct sensitive attributes per group. However, extensive broadening significantly reduces usefulness, particularly for rare execution sequences.

TLKC (Rafiei et al., 2020): Rafiei and colleagues augmented the LKC-privacy framework with accommodation for varied sequence representations (set, multiset, sequence, relative position). Individual representations maintain ≥L different sensitive attributes across K comparable sequences. Unfortunately, no individual representation setup achieves an optimal balance between protection and utility across diverse datasets.

2.2 Encryption-based privacy protection in process data

Solutions for log encryption protect logs through cryptographic transformation while enabling limited analysis of the encrypted data.

Privacy-preserving alpha algorithm (Tillem et al., 2016): Tillem and co-workers enhanced the alpha log workflow discovery technique with encryption methods. However, the need to preprocess the data complicates deployment. Furthermore, encryption prevents analysis tools from detecting temporal relationships that are important for reconstructing the workflow.

Privacy infrastructure (Rafiei et al., 2018): Rafiei et al. proposed a series of secure distributed computations for sensitive process analytic queries.

2.3 Differential privacy-based privacy protection in process data

PRIPEL (Fahrenkrog-Petersen et al., 2020): Fahrenkrog-Petersen et al. add perturbation at the workflow level for case-based protection. The method identifies activity sequences, adds Laplace noise to their counts, filters out infrequent sequences, and reconstructs logs by sampling. This achieves (ε, δ)-differential privacy at the sequence level. However, only sequential relations are protected, and field-level information remains exposed.
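The count-perturbation step described above can be sketched in a few lines. This is a minimal illustration only, with invented variant counts, a fixed random seed, and a simple post-noise cutoff; the actual PRIPEL mechanism involves additional reconstruction steps:

```python
import numpy as np

def perturb_variant_counts(variant_counts, epsilon, threshold=1):
    """Sketch of sequence-count perturbation: add Laplace noise
    (scale 1/epsilon) to each activity-sequence count, then drop
    sequences whose noisy count falls below a cutoff."""
    rng = np.random.default_rng(0)
    noisy = {}
    for variant, count in variant_counts.items():
        noisy_count = count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
        if noisy_count >= threshold:
            noisy[variant] = max(1, round(noisy_count))
    return noisy

# Toy counts: one frequent variant, two rare ones.
counts = {("A", "B", "C"): 40, ("A", "C"): 3, ("A", "B", "B", "C"): 1}
protected = perturb_variant_counts(counts, epsilon=1.0)
```

Note the typical privacy-utility tension: rare variants may be filtered out entirely, which is exactly the utility loss discussed above.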

DPGAN (Xie et al., 2018): Xie et al. incorporated differential privacy into discriminator gradient perturbation through adversarial training. Adding Gaussian noise after clipping gradients creates (ε, δ)-privacy using the moments accountant. However, this technique was developed for images and lacks a domain-specific process model. For a fair comparison, the DPGAN variant for event logs uses activity vectorization and shares the same architecture and parameters as P3DGAN.

2.4 Pseudonymization vs. data generation

Pseudonymization replaces identifiers with pseudonyms (e.g., ID 123 → ID XXX) but preserves structure and content exactly. The mapping can be reversed with re-identification tactics, and indirect identifiers remain vulnerable to linkage. GDPR still treats pseudonymized data as “personal data”, so its restrictions continue to apply.

P3DGAN (generation) produces entirely new records that are statistically equivalent to, yet structurally distinct from, the source records. Generation is one-way: no downstream procedure ever retrieves the source records. The method provides robust protection by perturbing gradients during training, and its output can be considered anonymous after appropriate validation.

Key advantage: sharing a synthesized dataset cannot endanger the sources' privacy. This stands in stark contrast to pseudonymization, where exposure of the mapping compromises all source records.

Use case comparison: consider a company that shares its operational logs with outside partners for collaborative analysis. With pseudonymization, the mapping keys must be shared, posing a risk if they are leaked. P3DGAN instead creates synthetic records that statistically resemble the real data while containing no actual individuals, eliminating the need for key management and providing stronger privacy protection.

2.5 GANs for privacy-preserving process data generation

Below we compare adversarial generative modeling with alternative generative frameworks:

Autoencoders/VAEs: reconstruct inputs through encoder-decoder pairs. Encoders can inadvertently memorize training samples, enabling privacy attacks, and reconstruction objectives pull outputs toward training samples, limiting diversity.

RNNs/LSTMs: RNNs/LSTMs maximize likelihood (next-element prediction), yielding “conservative” (average) sequences and low diversity. Temporal relations are often overfit in sequential modeling architectures.

GANs (our approach): the generator never directly observes real records, only gradient information from the discriminators. This architectural decoupling provides a natural privacy barrier. Adversarial objectives encourage exploration of the full data distribution rather than collapsing onto a few modes, improving diversity. GANs also allow flexible incorporation of constraints (e.g., a deadlock-aware loss) into the objective, whereas autoencoders require more substantial architectural modifications.

Natural fit for differential privacy: gradient perturbation fits naturally into updates of the discriminator, offering formal guarantees. Alternating training naturally accommodates the gradient clipping and noise injection required for differential privacy.

The three categories face common challenges in protecting operational logs: removing sensitive information while preserving utility for analyses. Our generative approach addresses this challenge by achieving statistical equivalence to the source data rather than directly transforming it.

3 Preliminary

This section reviews basic concepts, including process data and event logs, the Petri net formalism, and the fundamentals of adversarial training.

3.1 Process data and event logs

Definition 1 (Process) A process is a collection of related activities that together transform inputs into outputs and are performed in a specified manner to achieve defined objectives. Running instances of a process are called cases or process instances.

Definition 2 (Event Log) An event log L is a multiset of traces, where each trace σ represents a single case. A trace is a sequence of events:

σ = 〈e1, e2, …, en〉

Individual events e are characterized by:

  • Case ID: unique identifier for the process instance.

  • Activity: the task/action performed (e.g., “inbound call”, “handle case”).

  • Timestamp: when the activity occurred.

  • Attributes: additional information (e.g., resource, cost, product).

Example: Figure 1 displays a sample event log from a call center process. Case 1 has the trace σ1 = 〈 Inbound Call, Handle Case, Call Outbound 〉; Case 9 has σ9 = 〈 Inbound Email, Call Outbound, Handle Email 〉.
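The grouping of events into traces can be made concrete with a few lines of Python. The toy log below mirrors the call-center example (activity names as in Figure 1; timestamps are invented for illustration):

```python
from collections import defaultdict

# Toy event log: each event = (case id, activity, timestamp).
events = [
    (1, "Inbound Call", "09:00"), (1, "Handle Case", "09:05"),
    (1, "Call Outbound", "09:20"),
    (9, "Inbound Email", "10:00"), (9, "Call Outbound", "10:10"),
    (9, "Handle Email", "10:30"),
]

def traces_from_log(events):
    """Group events by case id (input order preserved): one activity
    sequence (trace) per case, per Definition 2."""
    by_case = defaultdict(list)
    for case_id, activity, _ts in events:
        by_case[case_id].append(activity)
    return dict(by_case)

traces = traces_from_log(events)
# traces[1] is the sequence <Inbound Call, Handle Case, Call Outbound>
```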

3.2 Process data: dataflow vs. workflow

Process data inherently contains two types of information:

Definition 3 (Dataflow) Dataflow refers to the tabular attributes associated with each event, including categorical attributes (activity, resource, product type), numerical attributes (duration, cost, priority), and temporal attributes (timestamp, date, time-of-day). Formally, the dataflow of an event e is represented as a feature vector:

x(e) = (v1, v2, …, vd),

where each component vi encodes one categorical, numerical, or temporal attribute of e.

Definition 4 (Workflow) Workflow refers to control flow structure, i.e., ordering and dependencies between activities. Key workflow concepts include:

1. Trace variant: The unique sequence of activities in a trace, ignoring timestamps and attributes. For example, traces 〈A, B, C〉 and 〈A, B, C〉 with different timestamps are considered the same variant.

2. Directly-follows relation (DFR): A binary relation → such that a → b indicates that activity b immediately follows activity a in at least one trace (van der Aalst, 2022):

a → b ⟺ ∃σ ∈ L, ∃i: σ(i) = a ∧ σ(i+1) = b
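Extracting the directly-follows relation, together with its frequencies, can be sketched as follows (a minimal illustration on a toy log; function and variable names are our own):

```python
from collections import Counter

def directly_follows(traces):
    """Count each pair (a, b) where b immediately follows a in some
    trace; the count is the weight |a -> b| of the relation."""
    dfr = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfr[(a, b)] += 1
    return dfr

log = [["A", "B", "C"], ["A", "B", "C"], ["A", "C"]]
dfr = directly_follows(log)
# dfr[("A", "B")] == 2, dfr[("B", "C")] == 2, dfr[("A", "C")] == 1
```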

3.3 Petri net fundamentals

Petri nets are a core formalism in process mining, offering mathematical principles for the discovery, analysis, and conformance checking of process models based on event logs (van der Aalst, 2012).

Definition 5 (Petri Net) A Petri net is a 5-tuple PN = (P, T, F, W, M0) (Murata, 1989) where:

  • P = {p1, p2, …, pm} is a finite set of places.

  • T = {t1, t2, …, tn} is a finite set of transitions.

  • F⊆(P×T)∪(T×P) is a set of arcs (flow relation).

  • W:F → ℕ+ is a weight function.

  • M0:P → ℕ is the initial marking.

  • P∩T = ∅ and P∪T ≠ ∅.

A marking M represents the system state, where M(p) denotes the token count in place p. A transition t is enabled at marking M if ∀p ∈ •t: M(p) ≥ W(p, t), where •t denotes the set of input places of t.

Firing an enabled transition t transforms marking M into M′ according to:

M′(p) = M(p) − W(p, t) + W(t, p) for all p ∈ P,

where W(p, t) = 0 if (p, t) ∉ F and W(t, p) = 0 if (t, p) ∉ F.

The reachability set R(PN, M0) contains all markings reachable from M0 through transition firing sequences.
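The enabling and firing rules above can be sketched in a few lines of Python. This is a minimal illustration (the class and attribute names are our own, not from the paper):

```python
class PetriNet:
    """Minimal marked Petri net (cf. Definition 5): pre[t] maps each
    input place of transition t to the arc weight W(p, t); post[t]
    maps each output place to W(t, p)."""

    def __init__(self, pre, post, marking):
        self.pre = pre
        self.post = post
        self.marking = dict(marking)

    def enabled(self, t):
        # t is enabled iff every input place holds enough tokens.
        return all(self.marking.get(p, 0) >= w for p, w in self.pre[t].items())

    def fire(self, t):
        # M'(p) = M(p) - W(p, t) + W(t, p)
        assert self.enabled(t), f"transition {t} not enabled"
        for p, w in self.pre[t].items():
            self.marking[p] -= w
        for p, w in self.post[t].items():
            self.marking[p] = self.marking.get(p, 0) + w

# Two-step sequence: start -(a)-> p1 -(b)-> end
net = PetriNet(
    pre={"a": {"start": 1}, "b": {"p1": 1}},
    post={"a": {"p1": 1}, "b": {"end": 1}},
    marking={"start": 1},
)
net.fire("a")
net.fire("b")
# The single token has now flowed from "start" to "end".
```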

Petri nets in process mining: the mined models are represented using Petri nets where activities correspond to transitions, dependencies to places and arcs, and execution is represented by a token flow (the so-called marking evolution). The alpha-algorithm, a basic mining technique, constructs Petri nets directly from event logs by detecting causal dependencies (van der Aalst et al., 2004). Using Petri nets, precise conformance checking can be performed by replaying logs on the models (discovered models), and behavioral properties (such as soundness, deadlock-freedom, and liveness) can be formally verified (van der Aalst, 2012).

Definition 6 (Process Petri Net) A Process Petri Net is a Petri Net (see Definition 5) whose structure is determined by an event log and where the elements have process-related meanings.

Given an event log L, a Process Petri Net is PN = (P, T, F, W, M0) that satisfies all properties in Definition 5, and has the following additional process-specific semantics:

  • Each transition tT corresponds to an activity in the log.

  • Places P represent causal dependencies between activities.

  • Marking M0 represents the initial state (typically one token in the start place).

  • A place p connecting transitions ti and tj (where (ti, p), (p, tj)∈F) indicates that activity tj can follow activity ti.

Definition 7 (Deadlock in Process Petri Nets) A marking M in a process Petri net is a deadlock state if no transition is enabled:

∀t ∈ T, ∃p ∈ •t: M(p) < W(p, t),

where •t = {p ∈ P | (p, t) ∈ F} denotes the set of input places of transition t. Intuitively, a deadlock is a “stuck state” in which the process cannot continue, much like a group of people all waiting for somebody else to move first.
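Checking Definition 7 amounts to verifying that no transition is enabled. A minimal Python sketch (names are our own):

```python
def is_deadlock(marking, pre):
    """A marking is a deadlock iff every transition has some input
    place with too few tokens. pre[t] maps each input place of
    transition t to the arc weight W(p, t)."""
    def enabled(t):
        return all(marking.get(p, 0) >= w for p, w in pre[t].items())
    return not any(enabled(t) for t in pre)

# Sequence net: start -(a)-> p1 -(b)-> end
pre = {"a": {"start": 1}, "b": {"p1": 1}}
assert not is_deadlock({"start": 1}, pre)  # "a" is enabled
assert is_deadlock({"end": 1}, pre)        # nothing can fire: stuck state
```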

Role of Petri Nets in P3DGAN: Our approach leverages Petri net theory in three ways:

1. Structural representation: We use Petri nets to formally represent the discovered process structure, enabling rigorous behavioral analysis.

2. Deadlock detection: We integrate Petri net-based deadlock detection into the loss function. This constraint ensures that synthesized processes are structurally sound and never pass through invalid states in which no activity can be executed.

3. Quality metrics: We employ Petri net quality dimensions (fitness, precision, generalization, and simplicity) to evaluate synthetic process data, as detailed in Section 8.

This dual nature of process data (dataflow + workflow) and the need for structural validity motivate our dual-discriminator architecture in P3DGAN.

3.4 Generative adversarial networks

Generative adversarial networks (GANs) are powerful generative models that perform implicit density estimation and consist of two neural networks (Goodfellow et al., 2014): a generator and a discriminator. The generator attempts to fool the discriminator by generating realistic data, while the discriminator aims to distinguish real from fake data. During training, the generator progressively improves at creating realistic data, while the discriminator becomes better at detection. The process reaches equilibrium when the discriminator can no longer distinguish real from fake data.

Generator G and discriminator D are jointly trained in a two-player minimax game. The objective function is:

min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{Z~p_Z(Z)}[log(1 − D(G(Z)))]

where x is real data, Z is random noise sampled from the latent space, and G(Z) is generated data; p_data is the distribution of real data, while p_Z(Z) is the prior distribution of the input noise Z. Discriminator D is held fixed while G is trained. In this two-player minimax game, G tries to fool D while D is trained to detect generated data, so generated samples become increasingly indistinguishable from real data.
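The value function can be made concrete with a toy one-dimensional example. The logistic discriminator below is a hypothetical illustration (not the paper's architecture): a discriminator that separates real from generated samples attains a higher value than the uninformative constant D = 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gan_value(x_real, x_fake, w, b):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(Z)))]
    for a 1-D logistic discriminator D(x) = sigmoid(w*x + b)."""
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * x_fake + b)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

rng = np.random.default_rng(0)
x_real = rng.normal(2.0, 1.0, size=1000)   # stand-in for real data
x_fake = rng.normal(0.0, 1.0, size=1000)   # stand-in for generated data
v_separating = gan_value(x_real, x_fake, w=2.0, b=-2.0)
v_blind = gan_value(x_real, x_fake, w=0.0, b=0.0)  # D(x) = 0.5 everywhere
assert v_separating > v_blind  # blind D yields log(1/2) + log(1/2)
```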

3.5 Notation

For clarity, we define key notations used throughout this study in Table 1.

Table 1

Symbol        Definition
L, L_r        Event log; real process data
L_f           Synthetic (fake) process data
X             Real tabular data (dataflow)
R             Real directly-follows relations (workflow)
Z             Latent noise vector sampled from p_Z
G             Generator network
G(Z)          Generated (synthetic) process data
D_t           Discriminator for tabular data (dataflow)
D_r           Discriminator for directly-follows relations (workflow)
ℒ_G           Total generator loss
ℒ_adv         Adversarial loss (from discriminators)
ℒ_dl          Deadlock condition loss
ℒ_size        Size loss (number of deadlocks)
ℒ_dist        Distribution loss (KL divergence)
E_X           Expectation over real data distribution
E_Z           Expectation over noise distribution
E_G(Z)        Expectation over generated data distribution
λ             Weight for deadlock condition loss (∈ [0, 1])
ε             Privacy budget (differential privacy parameter)
δ             Privacy failure probability
σ             Standard deviation of DP noise
C             Clipping threshold for gradients
dl_r, dl_f    Deadlock markings (real and fake)
Q_dl          Frequency distribution of deadlocks
Q_dl^i        Frequency of the i-th deadlock type
m             Mini-batch size
T_d           Number of discriminator iterations per generator iteration
T_g           Number of generator training iterations
M_t           Total number of training samples for D_t
M_r           Total number of training samples for D_r
q_t, q_r      Sampling probabilities (m/M_t, m/M_r)

Notation and definitions.

4 Privacy-preserving framework for process data based on dual-discriminator generative adversarial networks

We design a privacy-preserving framework for process data using dual-discriminator generative adversarial networks. First, the problem and its motivation are described; then the overall framework architecture and its components are introduced.

4.1 Problem description

Process data exists in the form of event logs, where each event is represented as event = {Case ID, Activity, Timestamp, …} (as shown in Figure 1). This data is hierarchical: (1) the dataflow level represents single events, and (2) the workflow level comprises the sequences of events that make up traces.

Limitations of current approaches: Although some work exists on protecting the privacy of process data (Fahrenkrog-Petersen et al., 2020; Rafiei et al., 2020; Mannhardt et al., 2019), it has the following limitations:

(1) Separation of protection mechanisms across dimensions: Existing solutions protect the dataflow and workflow dimensions individually rather than considering privacy protection across both dimensions jointly. For example, PRIPEL (Fahrenkrog-Petersen et al., 2020) enables sequence-level differential privacy, but field-level attributes remain unprotected; TLKC variants (Rafiei et al., 2020) anonymize representations separately, without a coordinated approach.

(2) Absence of structural verification: Anonymization or perturbation methods do not guarantee that the resulting data corresponds to valid process models (e.g., models free of deadlocks).

(3) Unsatisfactory attacker model: Current risk metrics cannot adequately measure workflow-level privacy leakage because they lack precise distance metrics to quantify the structural similarity between process models.

For the original process data Lr, the goal is to produce differentially private synthetic data Lf = G(Z), where G: Z → L is a generative model that maximizes data utility while ensuring structural validity and (ε, δ)-differential privacy.

4.2 Overall framework

The structure of the dual-discriminator adversarial generative network comprises three blocks: Generator G, Discriminator for Tabular Data Dt, and Discriminator for Directly-Follows Relationship Dr (see Figure 2). Each discriminator evaluates the generated process data from different perspectives (workflow or dataflow), forcing the generator to produce process data that is realistic across multiple aspects and satisfies the criteria of all discriminators. Tabular Data (dataflow): Process data is represented as a table in which event characteristics (e.g., timestamps, activity names, etc.) appear in columns. Directly-Follows Relation (workflow): In Petri Nets, one transition is said to follow directly after another.

Figure 2

The overall objective is to learn the generator G, conditioned on Tabular Data and Directly-Follows Relationships, to generate realistic and informative synthetic process data. For this purpose, two discriminators Dt and Dr are used. Dt differentiates synthetic tabular data from real tabular data, and Dr differentiates synthetic Directly-Follows Relationships from real ones. This two-discriminator system has several important benefits:

1. Mitigates mode collapse: with two separate feedbacks, the generator has to meet more than one condition at a time, which is more constraining. If the feedback from one discriminator becomes less informative (potential collapse), the other still provides guidance for learning.

2. Captures dual nature of process data: process data is inherently two-dimensional (dataflow and workflow) and cannot be fully represented by one discriminator. Both are explicitly modeled in our architecture, which results in better process model discovery as shown in Section 8.

3. Enables differential privacy at multiple levels: various privacy mechanisms can be used at the dataflow or workflow level, providing privacy-utility tradeoffs at a very granular level.

Meanwhile, for privacy protection, differential privacy is integrated by adding noise vectors to the discriminators' gradients. This ensures privacy preservation during training without compromising utility. Noise is added on the gradient of the Wasserstein distance with respect to the training data, providing robust privacy guarantees (Yang et al., 2022).

Based on this framework, we propose a privacy-preserving method for process data based on dual-discriminator generative adversarial networks, as shown in Algorithm 1. Before introducing the algorithm, we give formal definitions of traces and the directly-follows relationship based on Petri nets (van der Aalst, 2012).

Algorithm 1

Privacy-Preserving Method for Process Data based on Dual-Discriminator GANs.

Definition 8 (Trace of process data) In process data (event log L), each row represents an event, with attributes such as its Case ID, activity name, and timestamp. Let 𝒜 be the universe of activity names in event log L. A trace Trace = 〈a_i, …, a_j〉, 0 ≤ i ≤ j ≤ len(L), is the sequence of activities of events sharing the same Case ID. π_k(Trace) = a_k, for i ≤ k ≤ j, denotes the activity at position k of the trace.

Definition 9 (Directly-follows relationship) For any two activities a1, a2 ∈ 𝒜, there is a directly-follows relationship from a1 to a2 in L, denoted a1 → a2, iff ∃ 0 ≤ i, j ≤ len(L): π_i(Trace) = a1 ∧ π_j(Trace) = a2 ∧ j = i + 1. The frequency, or weight, of a directly-follows relationship is the number of times it occurs in the event log, denoted |a1 → a2|.

Key implementation details:

Differential privacy integration: We apply Gaussian noise to the clipped gradients of the two discriminators (Lines 13 and 21 in Algorithm 1), with the noise scale σ determined from the privacy budget ε and the failure probability δ via the moments accountant:

σ = c · q · sqrt(T · log(1/δ)) / ε,

where q = m/M is the sampling probability, T is the number of discriminator training iterations, and c is a constant from the moments accountant analysis. This guarantees (ε, δ)-differential privacy, which is proven in Section 6.
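The per-update gradient sanitization (clip to norm C, then add Gaussian noise) can be sketched as follows. This is a minimal numpy illustration with invented parameter values, not the paper's implementation:

```python
import numpy as np

def sanitize_gradient(grad, clip_c, sigma, rng):
    """DP-SGD-style sanitization applied to each discriminator update:
    rescale the gradient so its L2 norm is at most clip_c, then add
    per-coordinate Gaussian noise N(0, (sigma * clip_c)^2)."""
    norm = np.linalg.norm(grad)
    clipped = grad / max(1.0, norm / clip_c)
    return clipped + rng.normal(0.0, sigma * clip_c, size=grad.shape)

rng = np.random.default_rng(0)
grad = np.array([3.0, 4.0])            # L2 norm 5
noisy = sanitize_gradient(grad, clip_c=1.0, sigma=0.5, rng=rng)

# Clipping bounds each sample's influence: the norm shrinks from 5 to C = 1.
clipped_norm = np.linalg.norm(grad / max(1.0, np.linalg.norm(grad) / 1.0))
assert np.isclose(clipped_norm, 1.0)
```

Bounding the gradient norm is what makes the Gaussian noise scale well-defined: the sensitivity of each update is at most C.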

The generator adversarially minimizes a mixed loss ℒ_G = ℒ_adv + λ·ℒ_dl, combining the adversarial loss ℒ_adv (from the discriminators) and the deadlock condition loss ℒ_dl (maintaining structural validity); refer to Line 26 in Algorithm 1. The parameter λ ∈ [0, 1] controls the trade-off between data realism and structural faithfulness. Details of ℒ_dl are given in Section 5.

4.3 Network architectures

Figure 3 illustrates network architectures for both the generator and the discriminators in P3DGAN.

Figure 3

Generator Architecture (Figure 3a): The generator takes a latent noise vector as input and transforms it through multiple fully connected (FC) layers with ReLU activation. The final layer uses Gumbel-Softmax activation to generate discrete categorical values for activities and other process data attributes. The generator outputs both tabular data (event attributes) and directly-follows relations simultaneously.
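Gumbel-Softmax sampling, used here to produce discrete categorical outputs while keeping the generator differentiable, can be illustrated in numpy (a minimal sketch; the temperature value is chosen arbitrarily):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a differentiable 'soft one-hot' sample over categories:
    add Gumbel(0, 1) noise to the logits, then apply a temperature-
    scaled softmax. As tau -> 0 the sample approaches a hard one-hot."""
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

rng = np.random.default_rng(0)
logits = np.log(np.array([0.7, 0.2, 0.1]))   # 3 candidate activities
sample = gumbel_softmax(logits, tau=0.5, rng=rng)
assert np.isclose(sample.sum(), 1.0)          # valid categorical weights
```

In a deep-learning framework the same trick lets gradients flow through the discrete choice of activity, which is why it suits the generator's final layer.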

Discriminator Architecture (Figure 3b): The two discriminators (Dt and Dr) share the same architecture but process different inputs. Each discriminator consists of a sequence of fully connected layers with Leaky ReLU activations. The discriminators receive as input tabular data (for Dt) or directly-follows relations (for Dr), and output a scalar representing the likelihood that the input is real (versus fake). The shared architecture ensures the same discriminative power across modalities while allowing each discriminator to focus on its own domain using distinct parameters.

4.4 Multi-objective optimization framework

The dual-discriminator architecture constitutes a multi-objective optimization problem formulated as a minimax game:

Discriminator optimization: The two discriminators are optimized independently and simultaneously:

where RX denotes the directly-follows relations extracted from real data X, and RG(Z) denotes those from generated data G(Z).

Generator optimization against both discriminators: the generator is required to fool both discriminators at the same time and keep the structure valid:

where the first two terms are the adversarial losses from Dt and Dr, respectively, and the third is the deadlock condition loss ensuring structural validity (detailed in Section 5).

Balancing mechanism: we use alternating training to balance the two discriminators:

  • Training Dt and Dr for kd steps (typically kd = 5)

  • Training G for kg steps (typically kg = 1)

This alternating schedule prevents the discriminators from dominating the optimization and guards against mode collapse, which may occur when one discriminator becomes too strong.
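The schedule above can be sketched as a simple step generator; this is a toy illustration of the kd/kg ratio, not the actual training loop:

```python
def alternating_steps(rounds, k_d=5, k_g=1):
    """Yield which player updates at each step: k_d discriminator steps
    (training both D_t and D_r), then k_g generator steps, per round."""
    for _ in range(rounds):
        for _ in range(k_d):
            yield "D"   # update D_t and D_r on real/fake batches
        for _ in range(k_g):
            yield "G"   # update G against both discriminators

schedule = list(alternating_steps(rounds=3))  # 15 "D" steps, 3 "G" steps
```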

Convergence to Nash equilibrium: upon convergence, the system reaches a Nash equilibrium such that:

indicating that generated data matches real data in both dataflow and workflow distributions.

5 Game optimization strategy based on deadlock conditional loss

We propose a game-theoretic optimization strategy based on a deadlock condition loss, realized through a combined global learning scheme that exploits structural constraints derived from Petri net theory. We first introduce deadlock marking sets, then characterize the components of the deadlock condition loss, and finally discuss how this loss is incorporated into generator optimization.

5.1 Deadlock marking sets

Definition 10 (dead marking) A marking (state) M is dead if no transition is enabled in M, so the net cannot evolve further.

In Petri Nets, a deadlock is an illegal or problematic state in which the process cannot proceed further. Detecting such states and limiting their occurrence during generation ensures that the synthetic process data remain structurally valid.

Let dlr be the set of deadlock markings extracted from real process data, and dlf the set extracted from synthetic data. The deadlock loss penalizes the difference between these two sets, both in size and in distribution.

5.2 Components of deadlock condition loss

The deadlock condition loss consists of two components (see Figure 4):

Figure 4

1. Size Loss: Penalizes the difference in the number of deadlock states between real and generated data:

This ensures that the generator produces a number of deadlock configurations similar to that observed in real data, preventing over- or under-generation of problematic states.

2. Distribution Loss: A KL-divergence penalty that aligns the distribution of deadlock types in the real and generated data:

where p_i^r and p_i^f are the frequencies of the i-th deadlock type in the real and generated data, respectively. Here, i indexes the different deadlock patterns (e.g., deadlocks at particular places in the Petri net).

Purpose of KL penalty: The KL divergence serves three critical functions:

(a) Distribution matching: It ensures that the real and synthetic data exhibit not only a similar number of deadlocks but also a similar distribution of deadlock types. This matters because the impact on process behavior differs depending on which tasks are involved in a deadlock.

(b) Fine-grained control: While the size loss only guarantees rough agreement in the overall count, the distribution loss ensures that specific deadlock patterns (fine-grained structural features) are preserved.

(c) Mode coverage: By penalizing differences in all types of deadlocks, the KL term prevents the generator from disregarding rare but meaningful deadlock patterns and prevents the collapse of modes in the structural space.

Example: Suppose a real process has two types of deadlocks—Type A (70%) and Type B (30%). Without the distribution loss, the generator may generate only Type A deadlocks (100%). The KL penalty drives the generator to produce both types in the correct proportions, thereby preventing structural bias.
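The two components can be sketched as follows. Deadlock occurrences are represented here as per-type counts, and the small epsilon guarding empty bins is an implementation assumption of this sketch:

```python
import math

def deadlock_condition_loss(real, fake, eps=1e-8):
    """Size loss: |#deadlocks_real - #deadlocks_fake| (absolute count gap).
    Distribution loss: KL(p_real || p_fake) over deadlock-type frequencies."""
    n_r, n_f = sum(real.values()), sum(fake.values())
    size_loss = abs(n_r - n_f)
    kl = 0.0
    for t in set(real) | set(fake):
        p = real.get(t, 0) / n_r
        q = fake.get(t, 0) / n_f
        if p > 0:
            kl += p * math.log(p / (q + eps))  # eps avoids division by zero
    return size_loss, kl

# Type A / Type B example: mode-collapsed fake data incurs a large KL penalty
# even though the total deadlock count matches exactly.
size, kl = deadlock_condition_loss({"A": 70, "B": 30}, {"A": 100})
```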

5.3 Integration with generator loss

Note: The deadlock condition loss is enforced only on the generator, not on the discriminators. The main factors to consider in this design choice are as follows:

Generator's objective: the generator attempts to create process data that, at all times, fools the discriminators and is structurally valid. The deadlock-condition loss serves as a regularization term that encourages the generator to produce Petri nets without deadlock states.

Discriminators' objective: the discriminators (Dt and Dr) are concerned only with real vs. fake at the dataflow and workflow level (respectively). They do not have to check for structural properties such as deadlock-freedom, since this is imposed via the loss of the generator.

The total generator loss is:

where:

  • the first term is the combined adversarial loss from both discriminators;

  • the second term enforces deadlock-freedom in the generated process data;

  • λ∈[0, 1] balances data authenticity (from adversarial training) and structural validity (from deadlock constraints).

This separation of concerns allows discriminators to focus on data realism while the generator optimizes for both realism and structural correctness.
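One plausible way to combine the terms is sketched below. The exact weighting in the paper's equation is not reproduced in this excerpt; the convex λ-mix here is an assumption, chosen only to be consistent with λ ∈ [0, 1]:

```python
def generator_loss(adv_t, adv_r, dl_loss, lam=0.6):
    """Total generator loss: adversarial terms from D_t and D_r plus the
    deadlock condition loss, traded off by lambda (assumed convex mix)."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * (adv_t + adv_r) + lam * dl_loss

total = generator_loss(adv_t=0.9, adv_r=1.1, dl_loss=0.5, lam=0.6)
```

With λ = 0 the objective reduces to pure adversarial training; with λ = 1 only structural validity is optimized.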

Meanwhile, recalling from Equations 8, 9, the discriminators are trained by maximizing:

and:

From a game-theoretic perspective, the deadlock loss acts as an additional rule that constrains the generator's strategy space. This training mechanism incentivizes the generator to learn richer modes of the data distribution and to maximize its payoff in a complex, dynamic game. In summary, P3DGAN, by incorporating the deadlock condition loss from a game-optimization perspective, strengthens the adversarial dynamics during training, stimulates the learning potential of the generator, and prevents it from capturing only a few simple modes of the data distribution.

6 Privacy guarantee of P3DGAN

Tracking and demonstrating privacy loss is a key aspect of differentially private deep learning. To show that P3DGAN protects differential privacy, we give a privacy proof based on DPGAN (Xie et al., 2018) combined with the parallel composition property of differential privacy (Wijesinghe et al., 2024). Before the proof, we define adjacent datasets, and Lemma 1 establishes DP for each discriminator-training procedure.

Definition 11 (Adjacent Data Set) For a dataset x, its l1 norm is ||x||1. For two datasets x and y, their l1 distance ||x−y||1 is the number of elements on which they differ. If ||x−y||1 = 1, then x and y are called adjacent datasets. Equivalently, the symmetric difference of two adjacent datasets contains exactly one element: |(x∪y)−(x∩y)| = 1.

Explanations: The privacy budget ε dictates the trade-off between privacy and utility: smaller ε results in stronger privacy (more noise) and potentially lower quality of data, while larger ε leads to better utility, at the expense of weaker privacy guarantees.

Lemma 1. Given the sampling probability qt = m/Mt, where m is the batch size and Mt is the total number of training records in the tabular-data discriminator iterations, the number of discriminator iterations in each inner loop Td, and the privacy violation δ, for any positive ε the parameters of Dt guarantee (ε, δ)-differential privacy for all data used in that outer loop if the noise scale satisfies:

σ = 2qt·√(Td·log(1/δ)) / ε

Similarly, Dr is the same as Dt, except for the sampling probability qr = m/Mr, where Mr is the total number of training data in the directly-follows relationship discriminator iteration.

Proof: The DP guarantee for the discriminator training procedure follows from the intermediate result of Xie et al. (2018). For a fixed perturbation σ on the gradients, a larger q yields a weaker privacy guarantee: the more data points are involved in computing the discriminator parameters w, the less privacy is afforded to each of them. Likewise, more iterations Td weaken privacy, because each iteration reveals more information about the data (specifically, more accurate gradients) to an observer.

Theorem 1. The dual-discriminator in P3DGAN satisfies (ε, δ)-differential privacy during training on process data.

Proof: Let M: ℕ^|x| → R be an (ε, δ)-differentially private algorithm acting on a single discriminator. For any adjacent datasets x, y (||x−y||1 ≤ 1), any function f: R → R′, and any event S ⊆ R′, let T = {r∈R : f(r)∈S}; then:

Pr[f(M(x))∈S] = Pr[M(x)∈T] ≤ e^ε·Pr[M(y)∈T] + δ = e^ε·Pr[f(M(y))∈S] + δ

Any randomized map can be decomposed into a convex combination of deterministic functions, and a convex combination of differentially private mechanisms is itself differentially private. For generality, when facing multiple datasets, we define disjoint partitions D1, D2, …, Dk of a dataset D and corresponding algorithms A1, A2, …, Ak that satisfy ε1-, ε2-, …, εk-differential privacy, respectively, where k is the number of database partitions.

According to the dual mode of process data, we divide the dataset D into two disjoint parts D1 and D2 representing the two data types, acted on by the algorithms A1 and A2 (i.e., M and f(M)). By the parallel composition property derived above, the privacy analysis of our data-processing method satisfies these conditions. From Lemma 1, the sampling probabilities are positively correlated with the privacy loss. Thus, the dual discriminator guarantees differential privacy for all process data used in that outer loop if it satisfies:

This completes the proof that P3DGAN provides (ε, δ)-differential privacy protection for process data.

7 Risk assessment of P3DGAN based on euclidean distance of trace variant

In this study, we present a general risk-based approach for analyzing privacy-enhancing process data generation methodologies. Our framework considers both data utility and privacy risk from multiple perspectives, thereby enabling a more holistic assessment of the privacy-utility trade-off.

7.1 Euclidean distance of trace variant (ED-TV)

We propose the ED-TV metric to quantify structural dissimilarity between real and synthetic process data at the workflow level. This metric complements traditional re-identification metrics by capturing privacy risks associated with disclosing process structure.

7.1.1 Formal definition

The metric is formally defined as:

ED-TV = (1/√2)·√(Σ_{i=1}^{N} (p_i^r − p_i^s)²)

where p_i^r and p_i^s are the normalized frequencies of the i-th trace variant in the real and synthetic datasets, respectively, and N is the total number of unique trace variants across both datasets. The 1/√2 factor scales the distance to [0, 1], with 1.0 attained by completely disjoint distributions.

The normalization ensures that the frequencies sum to 1 across all variants: Σ_{i=1}^{N} p_i^r = Σ_{i=1}^{N} p_i^s = 1.

7.1.2 Interpretation for privacy assessment

Larger ED-TV values generally provide stronger workflow-level privacy protection because they entail greater structural dissimilarity. If the synthetic data's trace-variant distribution differs significantly from the real one, it becomes harder for adversaries to infer the original process structures via workflow pattern-matching attacks.

Nevertheless, the relationship between ED-TV and privacy protection is not monotone. Very large ED-TV values (close to 1.0) may indicate a complete disruption of the workflow structure, making the synthetic data impractical for process analysis even if privacy is theoretically strong. Such cases correspond to failed generation rather than successful privacy protection.

On the other hand, extremely low ED-TV values (< 0.02) indicate that the synthetic trace-variant distribution closely matches the real one. Although this means the utility (for process discovery) is very high, it may raise workflow-level privacy concerns if the structural similarity allows adversaries to expose sensitive business processes or to recognize unique execution patterns.

Optimal range: our results on four datasets indicate that ED-TV values in the range 0.015–0.130 provide a good privacy-utility trade-off under strong dataflow-level protection. Within that range, synthetic data retains the key workflow features needed for meaningful analysis while remaining structurally distinct enough to impede workflow-level attacks.

Contextual interpretation: ED-TV must be read in conjunction with other metrics, not on its own. A method with an ED-TV of only 0.02 but a re-identification rate of 0.2% (such as P3DGAN on BPI 2019) offers strong overall protection through layered defense: dataflow-level privacy ensured by differential privacy, and workflow-level privacy provided by the structural diversity of the synthetically generated variants. Conversely, a technique with an ED-TV of 0.03 but a 1.2% re-identification rate (despite the larger workflow distance) provides weaker protection.

Dataset-specific considerations: the best ED-TV depends on the characteristics of the dataset. Processes with only a few variants naturally fall into small ED-TV intervals (e.g., Simple Energy Production with 3 variants). Processes with thousands of variants (e.g., BPI 2019 with 4,183 variants) pose greater challenges for generating data that preserves utility while allowing structural diversity. Our assessment accounts for these dataset-specific characteristics when judging privacy-utility trade-offs.

7.1.3 Algorithmic computation

The computational procedure for ED-TV is shown in Algorithm 2. The algorithm first normalizes the variant frequencies of both the real and the synthetic dataset, then computes the Euclidean distance between the resulting frequency vectors.

Algorithm 2

Euclidean Distance of Trace Variant (ED-TV).

The complexity of Algorithm 2 is O(nr+ns+|TVall|) where nr and ns are the numbers of traces in the real and synthetic dataset, respectively, and |TVall| is the total number of unique variants. In practice, this computation is very fast, even for large datasets, because trace-variant extraction and frequency counting are performed in a single pass over the dataset.
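A compact sketch of this computation (the √2 scaling, which maps completely disjoint distributions to 1.0 as described for the metric's range, is an assumption of this sketch):

```python
from collections import Counter
import math

def ed_tv(real_log, synth_log):
    """Euclidean distance between normalized trace-variant frequency vectors.

    Single pass per log: extract variants, count, normalize, then compare
    over the union of unique variants. Scaled by 1/sqrt(2) so that fully
    disjoint distributions score 1.0."""
    rv = Counter(tuple(t) for t in real_log)    # variant -> frequency (real)
    sv = Counter(tuple(t) for t in synth_log)   # variant -> frequency (synth)
    n_r, n_s = sum(rv.values()), sum(sv.values())
    dist = math.sqrt(sum((rv[v] / n_r - sv[v] / n_s) ** 2
                         for v in set(rv) | set(sv)))
    return dist / math.sqrt(2.0)

same = ed_tv([["a", "b"]], [["a", "b"]])      # identical logs -> 0.0
disjoint = ed_tv([["a", "b"]], [["b", "a"]])  # disjoint variants -> 1.0
```

The loop bodies make the stated O(nr + ns + |TVall|) cost visible: one counting pass per log plus one pass over the union of variants.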

7.1.4 Complementary role with re-identification metrics

ED-TV and re-identification rate offer two complementary views of the privacy risk. Re-identification attacks exploit combinations of attributes to identify synthetic records of real individuals (dataflow-level), whereas workflow-based attacks attempt to deduce business logic by observing the entire process (workflow-level).

The dual nature of process data makes low re-identification rates compatible with moderate ED-TV values; P3DGAN is an example: differential privacy protects individual records while aggregate process patterns are preserved. Conversely, methods with high ED-TV but high re-identification rates provide no privacy protection at the individual level.

7.1.5 Relationship to data utility

ED-TV exhibits an inherent tension with data utility: lower values are associated with higher F1-scores in process discovery, but larger values may degrade discovery quality. P3DGAN balances this trade-off through its dual-discriminator architecture, which retains salient workflow patterns (moderate ED-TV: 0.016–0.128), while differential privacy shields individuals (minimal re-identification), resulting in F1-scores of 0.723–0.836. These findings suggest that the seemingly conflicting requirements of workflow utility and privacy preservation may be aligned through a stack of protection layers, as demonstrated by our model.

8 Experimental results

We demonstrate the superiority of P3DGAN through comprehensive experiments on four public process datasets, evaluating both privacy protection and data utility against strong competing baselines.

8.1 Experimental setup

8.1.1 Datasets

We employed four open real-life process logs from different domains with varying structural properties. The statistics are summarized in Table 2.

Table 2

Dataset | Events | Cases | Activities | Variants
Call center | 10,240 | 2,500 | 8 | 27
BPI challenge 2019 | 251,734 | 42,912 | 42 | 4,183
Production analysis | 225,917 | 9,624 | 22 | 1,096
Electronic invoicing | 108 | 12 | 6 | 3

Statistical characteristics of experimental datasets.

The size of these datasets ranges from 108 to 251,734 events and from 6 to 42 activities, providing a suitable basis for testing across various process mining contexts. Of particular interest, the Electronic Invoicing dataset (108 events, 12 cases) presents a small-sample case scenario and is included in the assessment to test model robustness under data scarcity.

8.1.2 Baseline methods

We compared P3DGAN against seven existing privacy-enhancing methods from different classes of defense:

Anonymization-based: Pretsa satisfies K-anonymity and T-similarity by means of prefix tree-based generalization. TLKC extends LKC-privacy to handle multiple variant representations (sets, multisets, sequences, relative orderings).

Differential privacy-based: PRIPEL is workflow-level perturbation based on Laplace noise. DPGAN integrates differential privacy into adversarial training via perturbing gradients.

We adapted DPGAN to handle process data, using the same network architecture and hyperparameters as P3DGAN. All results are the mean of three independent trials.

8.1.3 Evaluation metrics

Our two-part assessment addresses both data utility and privacy risk.

Data utility: the similarity score quantifies the statistical similarity between the real and synthetic datasets. The F1-score measures the quality of process discovery from the synthetic logs. We also report precision and recall individually.

Privacy risk: re-identification rate measures the privacy at the level of the dataflow. Euclidean distance of trace variants (ED-TV) is used to measure privacy risk at the workflow level.

8.1.4 Implementation details

Experiments are conducted on a single NVIDIA Tesla V100 GPU (32GB) with Intel Xeon Gold 6148 (20 cores, 128 GB RAM). We implemented P3DGAN in PyTorch 1.12 and set ε = 50 and λ = 0.6. All models use 3-layer fully connected networks (hidden size = 256) and the Adam optimizer with a learning rate of 0.0002, trained with a batch size of 64 for 1000 epochs.

8.2 Data utility analysis

8.2.1 Table-evaluator results

Table 3 reports that P3DGAN achieves the best similarity scores on all datasets and the best F1-scores on three out of four datasets.

Table 3

Method | Call center (Sim. / F1) | BPI 2019 (Sim. / F1) | Production (Sim. / F1) | Electronic (Sim. / F1)
Pretsa | 0.812 / 0.582 | 0.804 / 0.475 | 0.825 / 0.582 | 0.819 / 0.589
PRIPEL | 0.804 / 0.337 | 0.821 / 0.459 | 0.815 / 0.339 | 0.807 / 0.329
TLKC_set | 0.789 / 0.401 | 0.801 / 0.394 | 0.831 / 0.408 | 0.818 / 0.403
TLKC_multi | 0.783 / 0.398 | 0.795 / 0.401 | 0.819 / 0.392 | 0.794 / 0.387
TLKC_seq | 0.791 / 0.503 | 0.788 / 0.509 | 0.827 / 0.497 | 0.802 / 0.491
TLKC_rel | 0.705 / 0.048 | 0.723 / 0.059 | — | 0.712 / 0.053
DPGAN | 0.851 / 0.739 | 0.874 / 0.793 | 0.892 / 0.802 | 0.729 / 0.715
P3DGAN | 0.903 / 0.826 | 0.903 / 0.827 | 0.951 / 0.836 | 0.729 / 0.723

Data utility evaluation of privacy-preserving methods.

Anonymization approaches (Pretsa, TLKC, PRIPEL) yield unsatisfactory results because they alter the original activities through generalization and noise addition, compromising utility. GAN-based approaches model the true data distribution and yield samples that are statistically similar to the real data, achieving higher utility while preserving privacy. On the small Electronic Invoicing dataset (108 events, 12 cases), DPGAN and P3DGAN are on par (similarity: 0.729; F1-score: 0.723 vs. 0.715). Although only a limited number of samples are available for training, P3DGAN still achieves stable performance, thanks to three aspects: (1) the Wasserstein loss yields stable gradients even for small batches, (2) the dual-discriminator architecture impedes mode collapse, and (3) the deadlock condition loss introduced in this study acts as a structural regularizer, moderating overfitting. Results are the mean of three runs, and the standard deviations are reported in Table 3.

8.2.2 Process discovery results

We applied the Inductive Miner to the synthetic logs to mine Petri nets. Figure 5 shows the comparison of models for the different methods.

Figure 5

P3DGAN generates models with 18 transitions, 15 places, and 42 arcs, the closest to the original (15 transitions, 12 places, 35 arcs) in structural size. DPGAN produces 17 transitions, 23 places, and 51 arcs, with many hidden transitions, suggesting overfitting. Pretsa oversimplifies (12 transitions, 8 places, 25 arcs) by cutting execution trajectories that matter. The TLKC variants span a large complexity range, with TLKC_seq being the most complex but the least precise. These observations are confirmed in Table 4.

Table 4

Method | Fitness | Precision | F1-Score | Simplicity
Original data | 1.000 | 1.000 | 1.000 | 0.052
Pretsa | 0.892 | 0.734 | 0.806 | 0.063
PRIPEL | 0.875 | 0.698 | 0.776 | 0.059
TLKC_set | 0.823 | 0.561 | 0.667 | 0.071
TLKC_multi | 0.801 | 0.493 | 0.609 | 0.068
TLKC_seq | 0.794 | 0.482 | 0.599 | 0.065
DPGAN | 0.923 | 0.615 | 0.739 | 0.041
P3DGAN | 0.951 | 0.753 | 0.839 | 0.048

Petri net quality metrics on call center dataset.

P3DGAN attains the best F1-score (0.839) by balancing fitness (0.951) and precision (0.753). Anonymization techniques lose precision due to the generalization of activities. DPGAN achieves high fitness but low precision, capturing the general flow but generating noisy behaviors. P3DGAN's simplicity (0.048) is very close to the original (0.052), showing that its models are neither over- nor under-simplified.

8.2.3 Computational complexity analysis

We present the time and memory comparison in Table 5.

Table 5

Method | Time complexity | Space complexity
Pretsa | O(l log l) | O(l)
PRIPEL | O(l·lt) | O(lt)
TLKC_set | O(l·ltc) | O(lt)
TLKC_multi | O(l·ltc) | O(lt)
TLKC_seq | O(l·ltc) | O(lt)
TLKC_rel | O(l·ltf) | O(ltf)
DPGAN | O(l·HM+PDH0H2M) | O(l)
P3DGAN | O((l+lt)·HM+PDH0H2M) | O(l)

Time complexity and space complexity comparison.

The time complexity of P3DGAN is O((l+lt)·HM+PDH0H2M), where l denotes the number of events and lt the number of traces. Practical datasets have lt ≪ l, so it simplifies approximately to O(l·HM+PDH0H2M), matching the linear scalability of DPGAN. By contrast, PRIPEL and the TLKC variants have quadratic-like complexity, O(l·lt) or O(l·ltc), due to their nested event-trace comparison step, which becomes a computational bottleneck for large-scale data. The two discriminators Dt and Dr operate on two complementary views of the same data, obtained in parallel without duplication, so the linear space complexity O(l) is maintained. P3DGAN therefore combines high efficiency, superior data quality, and privacy protection.

8.2.4 Parameter sensitivity analysis

We systematically investigated the impact of the main parameters of P3DGAN on the quality of synthetic data. In this work, we study the privacy budget ε, which determines the amount of noise in differential privacy, and the deadlock loss weight λ, which controls the trade-off between the structural validity and the realism of data. Figure 6 shows the overall results for several metrics.

Figure 6

Effect of privacy budget ε

Figure 6a shows the effect of the privacy budget on four key metrics: similarity, F1-score, recall, and precision. As ε ranges from 0.1 to 50, all metrics increase monotonically as the noise level decreases. At very small budgets (ε = 0.1), the noise is so overwhelming that little meaningful learning occurs, with similarity only around 0.80–0.82. As ε grows to 1 and 10, both methods improve significantly, with P3DGAN maintaining a stable advantage of 2–3%. The best point is at ε = 50, where P3DGAN achieves an F1-score of 0.826 compared to 0.739 for DPGAN, showing that our method better captures structural properties.

As shown in the recall and precision subplots of Figure 6a, recall converges at ε = 50, with P3DGAN at 0.92 and DPGAN at 0.88, indicating that more process variants are captured. At ε = 100, however, recall remains constant while precision degrades (from 0.82 to ~0.76), implying the onset of overfitting: the model begins to memorize training data and generate spurious transitions. This indicates that ε = 50 is a good balance point at which the model learns the true patterns without overfitting. By combining the dual-discriminator structure with the deadlock loss, P3DGAN attains consistently superior F1-scores across all privacy budgets.

The influence of deadlock loss weight λ

Figure 6b shows the effect of λ on similarity and F1-score via box plots. The results exhibit an inverted-U pattern: both metrics first increase as λ grows, peak around 0.6, and then slightly decrease. Figure 6c further compares P3DGAN under different λ values against the baseline methods, confirming that λ = 0.6 achieves the best balance. At λ = 0.2, F1-scores lie in the range 0.76–0.80, and roughly 15% of the generated traces contain unresolvable deadlock states. At λ = 0.4, the results improve to 0.78–0.81 as the structural restrictions remove invalid processes.

The optimal λ = 0.6 achieves F1-scores of 0.80–0.83 across all privacy budgets with only 2% deadlock states. As shown in Figure 6c, P3DGAN with λ = 0.6 consistently outperforms all baseline methods (Pretsa, PRIPEL, the TLKC variants, and DPGAN) in both similarity and F1-score, striking the best balance between enforcing valid structure and maintaining distributional fidelity. At larger values (λ = 0.8), F1-scores drop slightly to 0.79–0.82, since the deadlock loss then constrains the generator too strongly and limits the diversity of trace variants.

Interestingly, ε and λ interact: under the most stringent privacy budget (ε = 0.1), the best λ shifts slightly upward (close to 0.65), as stronger structural constraints are more resilient to the added noise. For larger budgets (ε ≥ 50), the optimal λ remains stable around 0.6, which represents the best trade-off between structural validity and distributional matching.

8.3 Privacy risk assessment

8.3.1 Re-identification attack analysis

In Table 6, the re-identification rates are shown under various methods.

Table 6

Method | Call center | BPI 2019 | Production | Electronic
Pretsa | 0.589 | 0.412 | 6.891 | 2.731
PRIPEL | 0.687 | 0.562 | 10.842 | 3.658
TLKC_set | 1.237 | 0.918 | 13.924 | 5.127
TLKC_multi | 0.931 | 0.847 | 11.563 | 4.289
TLKC_seq | 0.893 | 0.791 | 10.927 | 3.914
TLKC_rel | 1.154 | 1.023 | — | 5.837
DPGAN | 0.461 | 0.339 | 8.759 | 2.447
P3DGAN | 0.461 | 0.203 | 12.106 | 2.447

Re-identification rate (%) for different privacy protection methods.

Bold values indicate the lowest (best) re-identification rate for each dataset.

With ε = 50 and λ = 0.6, the GAN-based models achieve the lowest rate on Call Center (0.461%), 21% lower than Pretsa. On BPI 2019, P3DGAN achieves 0.203%, which is 40% lower than DPGAN (0.339%) and 51% lower than Pretsa (0.412%). These improvements on large-scale datasets confirm that our dual-discriminator structure does not degrade privacy. Production Analysis re-identification rates are higher (8.8–13.9%) as a result of longer traces (23.5 events/case on average) and more identifying patterns per case. The anonymization techniques show uniformly higher rates overall (0.589–13.924%).

8.3.2 Distance-based metric: ED-TV

The Euclidean distance of trace variants is shown in Table 7.

Table 7

Method | Call center | BPI 2019 | Production | Electronic
Pretsa | 0.060 | 0.176 | 0.145 | 0.131
PRIPEL | 0.042 | 0.278 | 0.204 | 0.137
TLKC_set | 0.026 | 0.333 | 0.211 | 1.000
TLKC_multi | 0.033 | 0.330 | 0.195 | 0.609
TLKC_seq | 0.025 | 0.330 | 0.187 | 0.455
TLKC_rel | 0.046 | 0.556 | — | 1.000
DPGAN | 0.017 | 0.083 | 0.128 | 0.113
P3DGAN | 0.016 | 0.083 | 0.128 | 0.112

Euclidean distance of trace variants (ED-TV) for different privacy protection methods.

ED-TV measures the similarity between trace-variant distributions of real and synthetic datasets, with lower values indicating higher workflow-level similarity and better utility preservation.

Bold values indicate the best (lowest) ED-TV for each dataset. ED-TV ranges from 0 (identical distributions) to 1.0 (completely disjoint distributions).

P3DGAN achieves among the smallest ED-TV values (0.016–0.128). This moderate workflow similarity must be understood within our overall privacy-utility model. Our layered protection consists of three layers: (i) moderate workflow-level ED-TV preserving utility, (ii) strong dataflow-level protection through differential privacy (re-identification rates of 0.203–2.447%), and (iii) architectural diversity, generating 15–20% new trace variants not present in the training logs. This combination yields strong individual-level privacy (dataflow) while preserving aggregate process behavior (workflow). TLKC_rel obtains ED-TV = 1.0 but destroys all information useful for process mining. P3DGAN keeps ED-TV low (workflow preservation) while achieving the best similarity (0.729–0.951), the best F1-scores (0.723–0.836), and comparable re-identification rates, indicating that it reaches the best multi-dimensional privacy-utility trade-off.

8.4 Ablation studies

We conduct a systematic ablation on the Call Center dataset. Table 8 and Figure 7 provide a more detailed analysis of the components of our method.

Table 8

Model variant | Similarity | F1-Score | Re-ID (%)
P3DGAN (Full) | 0.903 ± 0.007 | 0.826 ± 0.030 | 0.461 ± 0
Discriminator architecture:
Single Dt only | 0.867 ± 0.011 | 0.739 ± 0.027 | 0.512 ± 0.003
Single Dr only | 0.821 ± 0.015 | 0.692 ± 0.031 | 0.478 ± 0.002
Dual (Ours) | 0.903 ± 0.007 | 0.826 ± 0.030 | 0.461 ± 0
Deadlock condition loss:
Without (λ = 0) | 0.898 ± 0.009 | 0.751 ± 0.029 | 0.469 ± 0.001
Only size loss | 0.901 ± 0.008 | 0.782 ± 0.026 | 0.463 ± 0.001
Only distribution (KL) loss | 0.895 ± 0.010 | 0.794 ± 0.028 | 0.465 ± 0.001
Full (Ours) | 0.903 ± 0.007 | 0.826 ± 0.030 | 0.461 ± 0
Differential privacy:
Without DP (ε = ∞) | 0.921 ± 0.006 | 0.847 ± 0.024 | 3.417 ± 0.052
With DP (ε = 50, Ours) | 0.903 ± 0.007 | 0.826 ± 0.030 | 0.461 ± 0
GAN loss function:
Standard GAN loss | 0.874 ± 0.013 | 0.763 ± 0.032 | 0.487 ± 0.002
Wasserstein loss (Ours) | 0.903 ± 0.007 | 0.826 ± 0.030 | 0.461 ± 0

Ablation study: component-wise analysis on Call Center dataset.

Figure 7

Dual discriminators: This is the single most important component, yielding +4.1% similarity and +11.8% F1-score over single-discriminator variants of our model (see Figure 7b). A single Dt models the dataflow but ignores workflow dependencies (F1-score = 0.739). A single Dr models the workflow but ignores dataflow details (similarity = 0.821, F1-score = 0.692). Using both jointly models the joint distribution of attributes and sequences, and the combined improvement exceeds the sum of the two individual gains.

Deadlock condition loss: Removing the deadlock loss causes an F1-score drop of 9.1% (to 0.751), and 18% of the generated traces end in deadlock states, compared to < 2% in the full version (Figure 7c). The size loss on its own achieves decent quantitative results (F1-score = 0.782) but cannot capture structure-related patterns. The KL distribution loss alone achieves better distribution fitting (F1-score = 0.794). The full two-term loss yields an F1-score of 0.826, confirming that both terms are necessary.

Differential privacy: DP is critical for privacy protection, reducing re-identification by 86.5%. Without DP, the re-identification rate rises to 3.417%, indicating model memorization. DP costs only 1.9% in similarity and 2.5% in F1-score, a tolerable sacrifice compared with the 10–30% utility loss of anonymization schemes.

Wasserstein loss: Offers +3.3% similarity and +8.2% F1-score over the standard GAN loss, thanks to the stable gradients of the Earth Mover's distance (Figure 7a). Mode collapse decreases from 40% to under 5% of runs. This is particularly relevant for discrete categorical process data, where gradient estimation is difficult.

All improvements are statistically significant (paired t-tests, p < 0.01) across three runs with different seeds, and the results for each component confirm robust and reproducible performance gains.

9 Conclusion

Motivated by the importance of privacy preservation in process data sharing and publishing, we propose a dual-discriminator conditional generative adversarial network model based on differential privacy. Building on GANs, our model introduces a directly-follows relationship discriminator and a deadlock condition loss grounded in Petri net theory. While ensuring the privacy of process data through differential privacy mechanisms, it further improves overall data quality through a game-optimization strategy. Furthermore, we employ the Euclidean distance between trace variants (ED-TV) metric to assess workflow-level privacy risk in synthetic process data.
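The ED-TV idea can be sketched as follows, under the assumption (not the paper's exact definition, which may differ) that each log is represented by the relative frequencies of its trace variants and the two frequency vectors are compared by Euclidean distance:

```python
from collections import Counter

def ed_tv(real_log, synth_log):
    """Euclidean distance between the trace-variant frequency vectors
    of two event logs. Each log is a list of activity sequences; a
    variant is a distinct activity sequence."""
    def variant_freq(log):
        counts = Counter(tuple(trace) for trace in log)
        n = len(log)
        return {variant: c / n for variant, c in counts.items()}

    f, g = variant_freq(real_log), variant_freq(synth_log)
    variants = set(f) | set(g)
    return sum((f.get(v, 0.0) - g.get(v, 0.0)) ** 2 for v in variants) ** 0.5


# Identical variant distributions give distance 0; fully disjoint
# single-variant logs give the maximum, sqrt(2).
print(ed_tv([["a", "b"]], [["a", "b"]]))  # -> 0.0
```

A larger distance means the synthetic variants diverge more from the originals, i.e., a lower workflow-level re-identification risk.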

We conducted experiments on four real-world public datasets and compared our approach with seven state-of-the-art process data privacy-preserving methods. The results show that P3DGAN generates synthetic data with high utility (similarity scores 0.729–0.951, F1-scores 0.723–0.836) and strong privacy guarantees (re-identification rates 0.203–2.447%). Ablation studies demonstrate that each component (dual discriminators, deadlock loss, differential privacy, Wasserstein optimization) contributes significantly to overall performance. These results demonstrate P3DGAN's potential for a wide range of applications that benefit from data sharing and publication in business processes, such as healthcare, banking, insurance, and manufacturing.

In the future, we would like to generalize our approach to other types of data, e.g., cross-organizational process data (Yang et al., 2024; Rott et al., 2024; Zhang et al., 2024), and to further enhance the quality of the generated output. We will also address what we consider the most important challenge: making our models applicable to business process datasets of varied forms, and extending P3DGAN to downstream applications such as bottleneck identification and predictive process monitoring. Moreover, designing an adaptive parameter selection scheme for the privacy budget ε and the deadlock weight λ is an interesting direction to pursue. Possible approaches include: (1) meta-learning methods that adaptively choose parameters based on dataset characteristics, (2) Bayesian optimization to automate hyperparameter selection, and (3) multi-objective optimization that balances privacy protection, data utility, and structural validity according to user-specified priorities. Such adaptive techniques would substantially improve the practical deployability of P3DGAN while maintaining the flexibility of our solution across different business process settings.

Statements

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

YG: Conceptualization, Validation, Formal analysis, Methodology, Writing – original draft, Data curation, Visualization. ZL: Project administration, Resources, Validation, Supervision, Funding acquisition, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported in part by the National Natural Science Foundation of China under Grant 62373094, in part by the Natural Science Foundation of Shanghai under Grant 23ZR1401000, in part by the Interdisciplinary Frontier Innovation Team Development Special Fund of Donghua University, and in part by Donghua University 2024 Cultivation Project of Discipline Innovation under Grant xkcx-202406.

Acknowledgments

The authors would like to thank the reviewers for their valuable comments and suggestions that helped improve the quality of this manuscript.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Akhramovich, K., Serral, E., and Cetina, C. (2024). A systematic literature review on the application of process mining to industry 4.0. Knowl. Inf. Syst. 66, 2699–2746. doi: 10.1007/s10115-023-02042-x

2. Augusto, A., Conforti, R., Dumas, M., La Rosa, M., Maggi, F. M., Marrella, A., et al. (2018). Automated discovery of process models from event logs: review and benchmark. IEEE Trans. Knowl. Data Eng. 31, 686–705. doi: 10.1109/TKDE.2018.2841877

3. Brzychczy, E., Łuber, A., and van der Aalst, W. (2024). Process mining of mining processes: analyzing longwall coal excavation using event data. IEEE Trans. Syst. Man Cybern. Syst. 54, 2723–2734. doi: 10.1109/TSMC.2023.3348496

4. Chen, D., Orekondy, T., and Fritz, M. (2020). "GS-WGAN: a gradient-sanitized approach for learning differentially private generators," in Advances in Neural Information Processing Systems (NeurIPS) (Curran Associates, Inc.), 12673–12684.

5. Chundawat, V. S., Tarun, A. K., Mandal, M., Lahoti, M., and Narang, P. (2024). A universal metric for robust evaluation of synthetic tabular data. IEEE Trans. Artif. Intell. 5, 300–309. doi: 10.1109/TAI.2022.3229289

6. Dung, D. A., and Huynh, T. T. B. (2022). "GDEGAN: graphical discriminative embedding GAN for tabular data," in 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA) (Shenzhen: IEEE), 1–11.

7. Elkoumy, G., Fahrenkrog-Petersen, S. A., Fani Sani, M., Koschmider, A., Mannhardt, F., Nunez Von Voigt, S., et al. (2021). Privacy and confidentiality in process mining: threats and research challenges. ACM Trans. Manage. Inf. Syst. 13, 1–17. doi: 10.1145/3468877

8. Fahrenkrog-Petersen, S. A., van der Aa, H., and Weidlich, M. (2019). "PRETSA: event log sanitization for privacy-aware process discovery," in 2019 International Conference on Process Mining (ICPM) (Aachen: IEEE), 1–8.

9. Fahrenkrog-Petersen, S. A., van der Aa, H., and Weidlich, M. (2020). "PRIPEL: privacy-preserving event log publishing including contextual information," in Proceedings of the 14th International Conference on Business Process Management (BPM) (Seville: Springer), 111–128.

10. Franzoi, S., Hartl, S., Grisold, T., van der Aa, H., Mendling, J., and vom Brocke, J. (2025). Explaining process dynamics: a process mining context taxonomy for sense-making. Process Sci. 2, 2–11. doi: 10.1007/s44311-025-00008-6

11. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial networks. arXiv [preprint] arXiv:1406.2661. doi: 10.48550/arXiv.1406.2661

12. Gui, J., Sun, Z., Wen, Y., Tao, D., and Ye, J. (2021). A review on generative adversarial networks: algorithms, theory, and applications. IEEE Trans. Knowl. Data Eng. 35, 3313–3332. doi: 10.1109/TKDE.2021.3130191

13. Gursoy, M. E., Inan, A., Nergiz, M. E., and Saygin, Y. (2016). Privacy-preserving learning analytics: challenges and techniques. IEEE Trans. Learn. Technol. 10, 68–81. doi: 10.1109/TLT.2016.2607747

14. Hu, S., Liu, X., Zhang, Y., Li, M., Zhang, L. Y., Jin, H., et al. (2022). "Protecting facial privacy: generating adversarial identity masks via style-robust makeup transfer," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (New Orleans, LA: IEEE), 15014–15023.

15. Liu, C., Zeng, Q., Cheng, L., Duan, H., and Cheng, J. (2021). Measuring similarity for data-aware business processes. IEEE Trans. Autom. Sci. Eng. 19, 1070–1082. doi: 10.1109/TASE.2021.3049772

16. Lu, C., Reddy, C. K., Wang, P., Nie, D., and Ning, Y. (2023). Multi-label clinical time-series generation via conditional GAN. IEEE Trans. Knowl. Data Eng. 36, 1728–1740. doi: 10.1109/TKDE.2023.3310909

17. Mannhardt, F., Koschmider, A., Baracaldo, N., Weidlich, M., and Michael, J. (2019). Privacy-preserving process mining: differential privacy for event logs. Bus. Inf. Syst. Eng. 61, 595–614. doi: 10.1007/s12599-019-00613-3

18. Murata, T. (1989). Petri nets: properties, analysis and applications. Proc. IEEE 77, 541–580. doi: 10.1109/5.24143

19. Pereira, R., Mestre, X., and Gregoratti, D. (2024). Consistent estimation of a class of distances between covariance matrices. IEEE Trans. Inf. Theory 70, 8107–8132. doi: 10.1109/TIT.2024.3464678

20. Qiao, F., Li, Z., and Kong, Y. (2023). A privacy-aware and incremental defense method against GAN-based poisoning attack. IEEE Trans. Comput. Soc. Syst. 11, 1708–1721. doi: 10.1109/TCSS.2023.3263241

21. Rafiei, M., von Waldthausen, L., and van der Aalst, W. M. P. (2018). "Supporting confidentiality in process mining using abstraction and encryption," in Proceedings of the 8th International Symposium on Data-driven Process Discovery and Analysis (SIMPDA) (Seville: Springer), 101–123.

22. Rafiei, M., Wagner, M., and van der Aalst, W. M. P. (2020). "TLKC-privacy model for process mining," in Proceedings of the 14th International Conference on Research Challenges in Information Sciences (RCIS) (Limassol: Springer), 398–416.

23. Rai, R., and Sural, S. (2023). "Tool/dataset paper: realistic ABAC data generation using conditional tabular GAN," in Proceedings of the 13th ACM Conference on Data and Application Security and Privacy (CODASPY) (Charlotte, NC: ACM), 273–278.

24. Rott, J., Böhm, M., and Krcmar, H. (2024). Laying the ground for future cross-organizational process mining research and application: a literature review. Bus. Process Manage. J. 30, 144–206. doi: 10.1108/BPMJ-04-2023-0296

25. Rozinat, A., and van der Aalst, W. M. P. (2008). Conformance checking of processes based on monitoring real behavior. Inf. Syst. 33, 64–95. doi: 10.1016/j.is.2007.07.001

26. Tillem, G., Erkin, Z., and Lagendijk, R. L. (2016). "Privacy-preserving alpha algorithm for software analysis," in Proc. 37th WIC Symp. Inf. Theory Benelux (WIC) (Louvain-la-Neuve: IEEE), 136–143.

27. van der Aalst, W. M. P. (2012). Process mining: overview and opportunities. ACM Trans. Manage. Inf. Syst. 3, 1–17. doi: 10.1145/2229156.2229157

28. van der Aalst, W. M. P. (2022). Discovering Directly-Follows Complete Petri Nets from Event Data, Chapter 1. Springer: Switzerland, 539–558.

29. van der Aalst, W. M. P., Weijters, A. J. M. M., and Maruster, L. (2004). Workflow mining: discovering process models from event logs. IEEE Trans. Knowl. Data Eng. 16, 1128–1142. doi: 10.1109/TKDE.2004.47

30. Wang, A. X., Chukova, S. S., Simpson, C. R., and Nguyen, B. P. (2024a). Challenges and opportunities of generative models on tabular data. Appl. Soft Comput. 166, 112223–112238. doi: 10.1016/j.asoc.2024.112223

31. Wang, S., Wang, C., Dong, T., He, Y., and Xiao, K. (2024b). Personalized privacy-preserving data utilization approach powered by distributed-GAN. Big Data Mining Analyt. 7, 1098–1113. doi: 10.26599/BDMA.2024.9020037

32. Wang, Y., Pu, G., Luo, W., Wang, Y., Xiong, P., Kang, H., et al. (2022). "Aesthetic text logo synthesis via content-aware layout inferring," in The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) (New Orleans, LA: IEEE), 2436–2445.

33. Wijesinghe, A., Zhang, S., and Ding, Z. (2024). PS-FEDGAN: an efficient federated learning framework with strong data privacy. IEEE Internet Things J. 11, 27584–27596. doi: 10.1109/JIOT.2024.3399226

34. Xie, L., Lin, K., Wang, S., Wang, F., and Zhou, J. (2018). Differentially private generative adversarial network. arXiv [preprint] arXiv:1802.06739. doi: 10.48550/arXiv.1802.06739

35. Xu, C., Ren, J., Zhang, D., Zhang, Y., Qin, Z., and Ren, K. (2019). GANobfuscator: mitigating information leakage under GAN via differential privacy. IEEE Trans. Inf. Forensics Security 14, 2358–2371. doi: 10.1109/TIFS.2019.2897874

36. Yang, L., Wang, X., Zhang, J., Yang, J., Xu, Y., Hou, J., et al. (2022). HackGAN: harmonious cross-network mapping using CycleGAN with Wasserstein-Procrustes learning for unsupervised network alignment. IEEE Trans. Comput. Soc. Syst. 10, 746–759. doi: 10.1109/TCSS.2022.3144350

37. Yang, Y., Wu, Z., Chu, Y., Chen, Z., Xu, Z., and Wen, Q. (2024). Intelligent cross-organizational process mining: a survey and new perspectives. arXiv [preprint] arXiv:2407.11280. doi: 10.48550/arXiv.2407.11280

38. Ye, M., Shen, W., Zhang, J., Yang, Y., and Du, B. (2024). SecureReID: privacy-preserving anonymization for person re-identification. IEEE Trans. Inf. Forens. Security 19, 2840–2853. doi: 10.1109/TIFS.2024.3356233

39. Zhang, S., Kong, L., Zheng, Y., Liu, C., and Cui, L. (2024). "Privacy-preserving cross-organization process mining based on blockchain and cryptography," in Proceedings of the IEEE International Conference on Web Services (ICWS) (Shenzhen: IEEE), 1384–1389.

40. Zhao, C., Zhao, H., Zhu, H., Huang, Z., Feng, N., Chen, E., et al. (2024). Bi-discriminator domain adversarial neural networks with class-level gradient alignment. IEEE Trans. Syst. Man Cybern. Syst. 54, 5283–5295. doi: 10.1109/TSMC.2024.3402750

41. Zhao, Z., Kunar, A., Birke, R., and Chen, L. Y. (2021). "CTAB-GAN: effective table data synthesizing," in Proceedings of The 13th Asian Conference on Machine Learning (ACML) (New York: PMLR), 97–112.

Summary

Keywords

differential privacy, dual-discriminator, generative adversarial networks, petri nets, privacy protection, process data, workflow analysis

Citation

Guo Y and Li Z (2026) Privacy-preserving process data generation based on dual-discriminator conditional generative adversarial networks. Front. Comput. Sci. 8:1752739. doi: 10.3389/fcomp.2026.1752739

Received

24 November 2025

Revised

09 January 2026

Accepted

30 January 2026

Published

18 February 2026

Volume

8 - 2026

Edited by

Jing Qiu, Guangzhou University, China

Reviewed by

Geeta Sandeep Nadella, University of the Cumberlands, United States

Qiying Feng, Guangzhou University, China

Copyright

*Correspondence: Zhong Li,
