Abstract
Introduction:
Intrusion Detection Systems (IDS) for Internet of Things (IoT) and edge environments require datasets with unambiguous labels, yet existing datasets often mix benign and malicious traffic within the same capture window, producing ambiguous flow labels that may distort model evaluation.
Methods:
This work introduces the TRUST Lab dataset, a flow-based traffic collection generated in an operational testbed reproducing enterprise-grade services and modern application interfaces. The dataset follows a single-class session policy, whereby each capture contains exclusively benign traffic or a single attack family, preventing temporal overlap and ensuring label integrity at the bi-flow level. The dataset includes 15 attack families spanning volumetric flooding, reconnaissance, application-layer exploits, protocol manipulation, evasive techniques, and persistence vectors. Traffic was processed into 16 single-class files totaling approximately 4.6 million bi-flows with 80 features per flow.
Results:
Comprehensive statistical analyses confirm the presence of discriminative signals without requiring payload inspection. A baseline binary classifier achieved an Area Under the Receiver Operating Characteristic Curve (ROC-AUC) of 0.9676 and a recall of 0.95, supporting the dataset’s utility for lightweight, edge-oriented IDS evaluation. The multiclass benchmark further reported per-family precision, recall, and F1-scores, with the main residual confusion concentrated in low-and-slow and HTTP-based vectors.
Discussion:
By enforcing session-level class separation and preserving bi-flow label integrity, TRUST Lab provides a reproducible dataset for evaluating IDS models in IoT and edge environments. The dataset is publicly available to support further research.
1 Introduction
Machine-learning (ML) IDS have become a core enabler of AI-driven operations in communication networks, where security decisions increasingly need to be automated, low-latency, and deployable as part of closed-loop control (e.g., detect-classify-mitigate-validate) across heterogeneous edge and IoT segments (Neto et al., 2023; Qaddos et al., 2024; Villafranca and Cano, 2025). This setting imposes two constraints: (i) models must rely on observables collectible at scale without payload inspection, and (ii) training labels must be reliable enough for threshold calibration and cross-scenario evaluation. In practice, many public datasets are built from capture windows where benign and malicious activities overlap temporally, producing ambiguous flow labels that inflate offline metrics and degrade deployment performance (Pekar and Jozsa, 2024).
This issue is especially acute in IoT and edge environments, where traffic is highly heterogeneous (short-lived micro-flows, intermittent bursts, device-specific periodicity), predominantly encrypted, and processed under tight resource and latency budgets (Ferrag et al., 2022). To meet these constraints, the state-of-the-art in lightweight IoT intrusion detection increasingly shifts away from Deep Packet Inspection (DPI), which is computationally expensive and hindered by modern encryption, toward flow-based architectures. Consequently, many deployments adopt tabular representations extracted solely from packet headers and timestamps (e.g., CICFlowMeter-style bidirectional flows). Bidirectional flow extraction is lightweight: it updates per-session state counters (packet frequencies, byte volumes, inter-arrival times) without payload storage, fitting the resource constraints of edge gateways.
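As a minimal illustrative sketch (not CICFlowMeter itself), the lightweight per-session state update described above can be expressed as follows; the field names and the toy packet trace are assumptions for illustration, and only header-derived lengths and timestamps are retained, never payload bytes:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FlowState:
    """Per-session counters of the kind a bi-flow exporter maintains."""
    pkts: int = 0                      # packet frequency counter
    bytes_: int = 0                    # byte volume counter
    last_ts: Optional[float] = None    # timestamp of the previous packet
    iats: List[float] = field(default_factory=list)  # inter-arrival times

    def update(self, ts: float, pkt_len: int) -> None:
        # Update state from header metadata only; payload is never stored.
        if self.last_ts is not None:
            self.iats.append(ts - self.last_ts)
        self.last_ts = ts
        self.pkts += 1
        self.bytes_ += pkt_len

# One FlowState per session key; three illustrative packets.
flow = FlowState()
for ts, length in [(0.00, 60), (0.05, 1500), (0.30, 60)]:
    flow.update(ts, length)
print(flow.pkts, flow.bytes_)  # → 3 1620
```

Because each update is a constant-time counter increment, this style of extraction fits the memory and CPU budgets of edge gateways, as the paragraph above notes.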
Although benign and malicious traffic naturally coexist in production networks, this overlap becomes a methodological flaw during dataset creation: if both are active within the same capture interval, CICFlowMeter aggregates concurrent packets into bi-flows whose statistics blend benign and malicious behavior. A single label assigned to such flows is semantically ambiguous, causing models to learn blurred decision boundaries.
To address the limitations of label ambiguity and feature inconsistency highlighted in recent literature (Sarhan et al., 2022b; Pekar and Jozsa, 2024), we propose the TRUST Lab dataset. It is designed under a single-class session paradigm to explicitly eliminate temporal and semantic contamination at the flow level. Unlike existing mixed-window benchmarks, each capture session contains exclusively benign traffic or a single attack family, preventing overlap. This enforces unambiguous per-flow labels after packet capture (PCAP)-to-flow conversion, directly aligning with the need for reliable, CICFlowMeter-compatible data in modern IDS research. The dataset supports two evaluation granularities without data reinterpretation, namely (i) binary (Benign vs. Attack) and (ii) family-level multiclass for post-detection triage (Ullah and Mahmoud, 2020; Albulayhi et al., 2021), enabling binary, multiclass, and cascaded pipelines without class remapping.
A second limitation is the combination of severe class imbalance and inflexible composition. Training on benign-dominated priors leads to majority-class bias, while standard balancing (oversampling, undersampling) can introduce artifacts. Moreover, many datasets package multiple families into monolithic mixed files with fixed prevalences, preventing researchers from constructing distributions matched to their own IoT/edge priors or from systematically studying prior shift (performance under varying attack prevalence). TRUST Lab dataset addresses this by distributing traffic as one Comma-Separated Values (CSV) file per class, with a homogeneous CICFlowMeter feature schema across all files. This modular organization enables reproducible recomposition of class priors and session-level partitioning that avoids train/test leakage, consistent with recent calls for dataset comparability and feature-set consistency in ML-based IDS (Sarhan et al., 2022a, 2022b).
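Because the corpus ships as one CSV per class, recomposing class priors reduces to a simple sampling computation. The sketch below uses made-up per-class counts (not the released figures) to derive how many flows to draw from each class file in order to hit a target prior without exceeding what each file contains:

```python
def sample_sizes(available, prior):
    """Rows to sample per class so the result matches a target prior.

    available: flows present in each single-class CSV (illustrative counts).
    prior: desired class proportions, summing to 1.
    """
    # Largest achievable total consistent with the target prior and the
    # number of flows actually available in each class file.
    total = min(available[c] / prior[c] for c in prior)
    return {c: int(total * p) for c, p in prior.items()}

# Hypothetical counts and a benign-dominated target prior.
available = {"Benign": 2_570_000, "Portscan": 400_000, "Slowloris": 90_000}
target_prior = {"Benign": 0.90, "Portscan": 0.08, "Slowloris": 0.02}
print(sample_sizes(available, target_prior))
```

Sampling should then be done at session/file level (as recommended in Section 3.1) rather than row level, so that train/test splits never straddle a capture session.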
Specifically, the TRUST Lab dataset was generated in an operational testbed reproducing enterprise-grade services (HTTP/S, DNS, email, SSH, SNMP, NTP, MySQL) and modern programmable interfaces (REST, GraphQL, SOAP), and it includes 15 attack families covering volumetric flooding, reconnaissance, credential attacks, application-layer/API exploits, DNS abuse, Man-in-the-Middle (MitM), evasion, tunneling/exfiltration, C2/beaconing, TLS/SSL anomalies, buffer overflow, and Slowloris-style exhaustion. Traffic was processed into ~4.6 million bidirectional flows exported as 16 single-class CSV files (Benign plus 15 families) with 80 CICFlowMeter features per flow, enabling evaluation of lightweight, flow-only IDS intended for edge deployment. This dataset design is consistent with recent edge-oriented IDS architectures that rely on flow-level representations to meet latency and resource constraints (Villafranca and Cano, 2025).
In summary, the main contributions of this article are:
A High-Integrity IoT/Edge Dataset: A CICFlowMeter-compatible, flow-based dataset built under a strict single-class session policy to explicitly guarantee label integrity at the bi-flow level and eliminate temporal contamination.
Modular File Organization: A per-family CSV structure that inherently prevents train/test data leakage and enables researchers to flexibly recompose class priors (e.g., balanced, realistic, or rare-attack scenarios).
Modern Protocol and Threat Coverage: Comprehensive inclusion of modern API traffic (REST, GraphQL, SOAP) alongside traditional network services, mapped to a taxonomy designed to support both binary and multiclass IDS evaluations.
The rest of the paper is organized as follows. Section 2 reviews related IDS datasets and motivates the need for label-safe, flow-compatible corpora. Section 3 details the TRUST Lab dataset methodology, including the testbed, campaign design, labeling policy, and CICFlowMeter feature extraction. Section 4 provides an exploratory statistical characterization of the dataset. Section 5 reports baseline IDS evaluation using a two-stage binary-to-multiclass pipeline and discusses implications against existing benchmarks. Section 6 concludes and outlines future directions.
2 Related works
Research on machine-learning-based IDS has long depended on public datasets for training, benchmarking, and reproducible comparison. Early work largely relied on KDD Cup’99 and its revision NSL-KDD. In Tavallaee et al. (2009), the authors demonstrated that KDD’99 contains substantial redundancies and biases that can inflate reported performance, and subsequent surveys have repeatedly emphasized that dataset design directly conditions the validity and transferability of IDS models. In particular, Zuech et al. (2015) and Leevy and Khoshgoftaar (2020) underline that unrealistic traffic composition, outdated attack taxonomies, and artifacts in collection pipelines undermine claims of generalization beyond the capture domain.
To modernize the benchmark landscape, more contemporary enterprise-like datasets incorporated newer attack families and richer traffic mixes. UNSW-NB15 (Moustafa and Slay, 2015) was proposed as a modern replacement for KDD/NSL-KDD, explicitly including multiple attack categories and contemporary traffic generation. The authors note that, despite improved realism and diversity, it still reflects a specific laboratory topology and a fixed class distribution. CICIDS2017 (Sharafaldin et al., 2018) adopts CICFlowMeter-style extraction and simulates benign and malicious sessions in a corporate-like environment. Their authors highlight the richness of its flow statistics, but IoT/edge protocol coverage and deployment conditions differ substantially from typical IoT/edge settings, a gap that often becomes evident when evaluating cross-domain transfer.
The growing focus on IoT and Industrial IoT (IIoT) motivated datasets tailored to edge devices and IoT traffic dynamics. N-BaIoT (Meidan et al., 2018) captures real infections (Mirai/Bashlite) on commercial devices and provides highly realistic home-IoT botnet traffic, but it is largely restricted to that threat vector. It is positioned as a benchmark for rapid IoT detection with strong intra-domain separability and limited coverage of non-botnet attack classes. BoT-IoT (Koroniotis et al., 2019) broadens the spectrum by generating multiple attack families in an IoT/industrial testbed. It reports very large flow volumes and diverse attacks, but also strong dominance of certain classes, typically requiring rebalancing to mitigate majority-class bias.
In the same direction, ToN-IoT (Alsaedi et al., 2020) integrates system, network, and service telemetry in an edge/fog/cloud architecture. It was designed for evaluating distributed IDS, but its heterogeneous sources and mixed representations can complicate direct transfer to purely flow-based pipelines when a consistent CICFlowMeter-like schema is required. IoTID20 (Ullah and Mahmoud, 2020) was proposed explicitly for IoT scenarios with a flow-based perspective. Its authors emphasize its suitability for both binary and multiclass classification, albeit with fewer attack families than large enterprise benchmarks. Edge-IIoTset (Ferrag et al., 2022) extends device and protocol diversity across edge/fog/cloud layers. It is presented as a realistic resource for IoT/IIoT IDS evaluation, although its dimensionality and class structure often motivate careful feature selection to maintain stable learning.
A complementary direction focuses on longitudinal captures of real malware activity. IoT-23 (García et al., 2020) provides real infection scenarios on IoT devices, though its PCAP/Zeek representation generally requires adaptation when a CICFlowMeter-style flow schema is targeted. CICIoT2023 (Neto et al., 2023) attempts broader coverage with CIC-style extraction, but its scale and attack mixture often lead authors to compose controlled subsets for comparable training and evaluation.
To address the rapidly evolving threat landscape, the most recent generation of IoT security research has focused on expanding device heterogeneity and attack complexity. For example, contemporary datasets such as IDSIoT2024 (Koppula and Leo Joseph, 2025) capture interactions across multiple smart devices under various cyberattack conditions. However, recent critical reviews of the domain (Salem et al., 2024; Apejoye et al., 2025) emphasize that despite the massive increase in data volume, fundamental methodological issues remain unresolved. Specifically, these studies highlight that inconsistencies in testbed configurations, ambiguous feature extraction methods, and the temporal overlapping of mixed traffic continue to severely hinder the generalizability and real-world applicability of AI-driven IDS. This contemporary consensus reinforces the urgent need for methodologically strict, single-class collections.
As these datasets became widely adopted, a recurring limitation emerged: models trained on one dataset/domain often degrade sharply when evaluated under a different benign profile, attack taxonomy, or feature-extraction regime (domain shift). Essop et al. (2021) show that even constructing edge-specific subsets from ToN-IoT can expose severe imbalance and require controlled subsampling before training. Comparing ToN-IoT, UNSW-NB15, and Edge-IIoTset, Tareq et al. (2022) observed that models reporting 94–99% accuracy within their native domain lose robustness under cross-dataset testing, consistent with the broader trend that high intra-domain metrics do not guarantee operational transfer. The use of distributed deep models in IoT further illustrates this pattern: performance is typically strong in-domain but remains sensitive to dataset-specific traffic composition and labeling assumptions (Diro and Chilamkurti, 2018).
To mitigate comparability barriers, feature-standardization efforts have been proposed. Sarhan et al. (2022b) argue that the absence of a common feature set across datasets (CICFlowMeter, NetFlow, Zeek, etc.) is a major driver of poor generalization and an obstacle to fair IDS comparison. Their subsequent work (Sarhan et al., 2022a) empirically shows that, even after aligning to a minimal common feature set, domain divergence still impacts performance, motivating datasets that remain compatible with widely used schemas (e.g., CICFlowMeter) while preserving clean labeling. In parallel, applied work on IoT detection (Al-Sarem et al., 2021) highlights that maintaining high edge performance frequently depends on feature selection (Analysis of Variance (ANOVA), mutual information, or Principal Component Analysis (PCA)), implying that when datasets are not designed for schema stability and label integrity, preprocessing becomes a critical and costly stage.
Beyond feature-schema mismatch, class imbalance remains a cross-cutting limitation in modern IDS datasets. Qaddos et al. (2024) quantify sensitivity to imbalance using IoTID20 and UNSW-NB15, and IoT-focused surveys (Albulayhi et al., 2021; Gyamfi and Jurcut, 2022) agree that many studies must resort to balancing and normalization to prevent trivial benign-biased predictors. Synthetic oversampling methods such as the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002) are common, and extensive analyses emphasize that oversampling/cleaning choices materially affect decision boundaries and can introduce noise or class overlap if applied without care (Batista et al., 2004; Haixiang et al., 2017; Fernandez et al., 2018). Consequently, many published pipelines conflate dataset properties with preprocessing decisions, making it difficult to isolate whether gains stem from model architecture or from distribution engineering.
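To make the boundary-level effect concrete, the toy sketch below reproduces SMOTE's core interpolation step in a simplified nearest-neighbour form (an illustration only, not the full algorithm of Chawla et al., 2002): each synthetic point is placed on the segment between a minority sample and a neighbour, which is precisely how synthetic samples can land in regions already occupied by the majority class.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Toy SMOTE-style oversampling on 2-D points (nearest-neighbour only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # Nearest other minority point (brute force, squared Euclidean).
        b = min((p for p in minority if p != a),
                key=lambda p: (p[0] - a[0]) ** 2 + (p[1] - a[1]) ** 2)
        gap = rng.random()  # position along the segment between a and b
        synthetic.append((a[0] + gap * (b[0] - a[0]),
                          a[1] + gap * (b[1] - a[1])))
    return synthetic

# Three illustrative minority samples; every synthetic point lies on a
# segment between two of them, possibly inside the majority-class region.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new = smote_like(minority, 5)
```

This is why the cited analyses stress that the oversampling step itself (not only the model) shapes the learned decision boundary.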
In summary, existing datasets exhibit three recurrent limitations: domain shift from heterogeneous taxonomies, incompatible feature schemas, and severe class imbalance. TRUST Lab dataset addresses these gaps through single-class session capture and a homogeneous CICFlowMeter representation.
3 Methodology
3.1 Objective and scope of the dataset
TRUST Lab dataset provides a controlled, unambiguously labeled traffic corpus for evaluating flow-based IDS in IoT/edge environments, addressing the label ambiguity caused by temporally overlapping campaigns identified in Section 1.
To meet this objective, the dataset design is grounded in three core principles: (i) single-class experimental control—rather than merely segmenting mixed traffic into broad day/time slots (e.g., ‘Monday: Benign, Tuesday: Attacks’) where background noise still overlaps with malicious actions, our approach enforces strict isolation: each capture session is an isolated temporal window containing exclusively benign traffic or a single attack family, guaranteeing zero temporal overlap and ensuring unambiguous bi-flow labels; (ii) comprehensive threat coverage—including volumetric (DDoS/DoS), reconnaissance (port scanning), application-layer (web/API attacks), protocol manipulation (DNS, MitM), evasive techniques, and persistence vectors (C2/beaconing, exfiltration); and (iii) reproducibility and edge alignment—using version-controlled scripts with fixed seeds, homogeneous CICFlowMeter feature extraction, and CSV export compatible with resource-constrained IoT gateways.
The dataset supports binary classification (Benign vs. Attack) and family-level multiclass analysis. The binary label is derived directly from the multiclass taxonomy without data transformation, as detailed in Section 3.4.
Finally, the dataset is captured in a laboratory topology with legitimate and offensive nodes and is delivered as separate CSV files per class, all following the same CICFlowMeter feature schema. This organization not only preserves independence between campaigns (with session/file-level partitioning recommended), but also allows users to compose tailored distributions: from realistic priors for operational evaluation to balanced settings or scenarios focused on specific attack subsets, without the dataset imposing a fixed prevalence. Consequently, TRUST Lab dataset enables controlled studies of sensitivity and precision under prevalence shifts (prior shift) without requiring complex reconstructions of the original traffic.
3.2 Testbed and capture topology
The capture infrastructure of TRUST Lab dataset was deployed within a private IPv4 network (192.168.56.0/24), designed to reproduce a mixed IoT/edge environment with common network services and a clear separation of roles. Figure 1 shows the full topology of the testbed, including service nodes, benign clients, offensive nodes, and observability components connected through mirror ports. This layout enables realistic end-to-end traffic capture without interfering with campaign execution, while also isolating each session to guarantee class uniqueness.
Figure 1
The testbed was implemented using VirtualBox 7.0 over a host machine with an Intel Core i7-10700K (8 cores, 16 threads), 64 GB DDR4 RAM, and Ubuntu 22.04 LTS. Virtual machines were configured with 2–4 vCPUs and 4–8 GB RAM depending on their role, running Ubuntu 20.04/22.04 LTS for service/client nodes and Kali Linux 2023.1 for offensive nodes. Network connectivity was provided through a virtual switch with promiscuous mode enabled and Switched Port Analyzer (SPAN) port mirroring configured to replicate all inter-VM traffic to the observability node, ensuring complete capture without packet loss.
While fully ‘wild’ or uncontrolled network captures offer high background realism, they suffer from inherently flawed ground-truth labeling because the exact nature of every background flow cannot be definitively verified. In contrast, our controlled, virtualized topology represents a methodological improvement for ML datasets because it guarantees absolute label integrity. By orchestrating both benign and offensive actors within a deterministic, segmented environment, we eliminate the label noise typical of wild captures, ensuring that every flow used to train or evaluate an IDS belongs unequivocally to its assigned class.
Functionally, the testbed is composed of four blocks: (i) Services—dedicated servers hosting HTTP/HTTPS, DNS, SMTP, IMAP, POP3, SSH, SNMP, NTP, and MySQL, covering protocols typically found in corporate networks and IoT gateways interacting with external backends; (ii) APIs—nodes dedicated to modern application traffic, with REST endpoints on port 3000, GraphQL on 4000, and SOAP on 5000, allowing both benign and offensive campaigns to generate flows representative of contemporary edge environments (microservices, telemetry, and automation); (iii) Benign clients—machines that automatically generate legitimate traffic toward these services using pseudo-random timing controlled by fixed seeds and metadata logging for full traceability; (iv) Offensive nodes—machines exclusively dedicated to launching single-class attack campaigns, each temporally isolated from benign traffic and from all other families, preventing overlap or cross-contamination.
Traffic capture is performed in an observability block connected to SPAN ports of the virtual switch. These SPAN interfaces feed traffic in PCAP format to CICFlowMeter (version 4.0), which processes it into bidirectional flows and associated statistics, exporting one CSV file per session following the homogeneous 80-feature schema plus label. CICFlowMeter was configured with a flow timeout of 120 s and activity timeout of 5 s, following standard CICIDS practices (Sharafaldin et al., 2018). This architecture clearly separates the generation plane from the analysis plane: instrumentation observes all relevant traffic without modifying it, and feature extraction occurs out-of-band, avoiding latency or packet loss that could distort the flow distribution.
Finally, a temporal control plane is incorporated through an internal NTP server that synchronizes all nodes in the testbed. This synchronization makes it possible to unambiguously associate each PCAP file with its corresponding single-class session, verify that no benign–attack mixtures occur within the same window, and record campaign metadata.
Capture campaigns were conducted over a period of approximately 6 weeks, with individual sessions ranging from 5 min (burst attacks) to 30 min (low-and-slow vectors), accumulating ~80 h of total traffic.
3.3 Generation of benign traffic and attack campaigns
Traffic generation in TRUST Lab dataset follows a unified experimental policy aimed at producing clean flows with unambiguous labels. First, each capture corresponds to a single category: either benign traffic or one specific attack family. This single-class, non-overlapping session rule prevents pattern mixing within observation windows, eliminating the ambiguous bi-flow labels that commonly affect public datasets with coexisting campaigns. Second, both benign and malicious traffic are generated through reproducible scripts with fixed seeds, concurrency control, and rate limiting, ensuring realistic distributions of sizes, rates, and inter-arrival times without appreciable packet loss.
Benign traffic was designed to reflect routine activity in IoT/enterprise networks connected to edge services. Legitimate sessions were generated for web browsing (HTTP/HTTPS), recursive DNS queries, remote administration (SSH), email (SMTP/IMAP/POP3), database queries (MySQL), multimedia traffic (VoIP SIP/RTP and streaming), and lightweight transfers typical of automated clients. This automation is achieved through custom Python and Bash scripts that orchestrate standard network tools (e.g., curl, wget, dig, sshpass, sipp, vlc, as detailed in Table 1). These scripts are designed to simulate both periodic machine-to-machine (M2M) telemetry typical of IoT devices and pseudo-random human browsing patterns. This orchestration introduces temporal variability through pseudo-random intervals controlled by fixed seeds and provides traceability via activity logs (source, destination, protocol, ports, and operational status), ensuring that each “benign” window contains sufficient diversity without inadvertently introducing offensive anomalies.
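The seeded timing policy described above can be sketched as follows; this is an illustrative fragment, not the released orchestration scripts, and the tool invocations (curl, dig, etc.) are elided so that only the reproducible schedule generation is shown:

```python
import random

def benign_schedule(n_requests, mean_gap_s, seed):
    """Reproducible request timestamps for a benign-client session.

    A fixed seed makes repeated runs emit identical timing, which is the
    traceability property the generation scripts rely on.
    """
    rng = random.Random(seed)  # fixed seed -> deterministic pseudo-randomness
    t, schedule = 0.0, []
    for _ in range(n_requests):
        # Exponential gaps give Poisson-like arrivals, a common stand-in for
        # both M2M telemetry jitter and human browsing pauses (assumption).
        t += rng.expovariate(1.0 / mean_gap_s)
        schedule.append(round(t, 3))
    return schedule

# Same seed, same schedule: reruns of a campaign are bit-for-bit comparable.
assert benign_schedule(5, 2.0, seed=7) == benign_schedule(5, 2.0, seed=7)
```

Each scheduled instant would then trigger one tool invocation (e.g., a curl fetch or dig query), with source, destination, protocol, and status written to the activity log.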
Table 1
| Category | Sub-attacks/Examples | Flow-level signals | Implementation/Tools |
|---|---|---|---|
| Benign | HTTP/HTTPS, recursive DNS, SSH, SMTP/IMAP/POP3, MySQL, VoIP (SIP/RTP), streaming | Expected domains/ports, stable rates, coherent RTTs | curl, wget, dig, sshpass, CLI mail clients, MySQL client, sipp, vlc |
| DDoS | L7 HTTP flood; UDP amplification (NTP/SSDP/SNMP); ICMP flood | Extreme PPS, bursts, UDP reflectors | ab, curl, hping3, ping |
| DoS | UDP flood, controlled ICMP, TCP SYN flood, slow-HTTP | Half-open connections, timeouts, anomalous windows | hping3, slowhttptest |
| Portscan | Horizontal/vertical, FIN/NULL/XMAS, decoys, fragmentation, TLS scans | SYN/ACK to many hosts/ports, uniform TTL, bursts | nmap, masscan |
| Bruteforce | SSH/FTP/HTTP Basic/RDP; SNMP community | Repeated retries, regular timing | hydra, medusa, ncrack, snmpwalk, Metasploit |
| Webbased | SQLi (union/blind), XSS, LFI/RFI, CSRF, command injection | Repeated short flows, 4xx/5xx errors, bursts to endpoints | sqlmap, nikto, commix, curl |
| API | REST mass/JSON tampering, GraphQL injection, SOAP XML/XXE | Atypical 4xx/5xx, short API series, overrides | custom REST/GraphQL/SOAP scripts, httpie, jq, JWT tools |
| DNS | Amplification, tunneling, spoofing, Domain Generation Algorithm (DGA) | High volume of A/MX/TXT, synthetic subdomains, inconsistent responses | dig, Scapy, DGA generator, IP aliasing tools |
| MitM | ARP/LLMNR/NBT-NS poisoning, TCP hijack, SSL stripping | Duplicate ARP, out-of-context RST/ACK, CN mismatch | arpspoof, responder, Scapy, sslstrip |
| Evasion | Overlapping IP fragmentation, TTL creeping, invalid TCP flags, anomalous TCP options | Extreme TTL, contradictory sizes/flags, overlaps | Scapy IP/TCP crafting |
| TLS/SSL | Heartbleed, POODLE, BEAST, renegotiation, certificate anomalies | Deviated handshakes, abnormal heartbeats, obsolete versions | nmap, testssl.sh, openssl, sslyze, sslscan |
| Exploitation | SMB EternalBlue/Relay, RDP BlueKeep, LDAP injection, SMTP VRFY/EXPN | Unusual MSRPC, malicious LDAP filters, SMTP enumeration | nmap, impacket (ntlmrelayx), crackmapexec, ldapsearch, smtp-user-enum |
| Exfiltration | DNS/ICMP tunnels, SMTP attachments, FTP/NFS/TFTP, HTTP chunked | Long outbound sessions, repeated chunks, atypical TXT | dig, hping3, FTP clients, TFTP clients, NFS tools, HTTP clients |
| Bufferoverflow | NOP-sled, raw shellcode, Return-Oriented Programming (ROP) chains | Long payloads, 0x90 patterns, shellcode signatures | msfvenom, Scapy, HTTP clients, DNS tools |
| Slowloris | Slowloris HTTP L7 | Half-open connections and long latencies | slowloris, slowhttptest |
| C2Beac | HTTP/S beacons, DGA DNS, IRC/MQTT, persistent WebSocket | Regular IAT periodicity, low throughput, rotating destinations | curl, DGA generator, Python MQTT/IRC clients, WebSocket clients |
TRUST Lab dataset taxonomy, sub-attacks, and generation procedure (single-class sessions).
Offensive campaigns are executed without concurrent benign traffic. Each family uses standard tools parameterized to produce representative flow behavior independent of payload content. Intensities and durations are controlled to reproduce both burst and low-and-slow patterns.
Operationally, sessions of 5–10 min were defined for burst campaigns (e.g., DNS amplification, port scans), while extended windows of 15–30 min were utilized for slower families (e.g., C2/beaconing, Slowloris, or exfiltration). These duration values were not chosen to model a single attack execution, such as a single Slow DoS cycle, which might default to ~24 s, but rather to ensure the accumulation of a statistically significant volume of bidirectional flows representing a sustained offensive campaign. Because machine learning models require thousands of examples to generalize effectively, capturing 15–30 min of low-and-slow activity ensures that enough flows reach the CICFlowMeter flow timeout (120 s) and are correctly exported. This provides the ML algorithms with sufficient data to learn the long-term periodicity, variance, and subtle statistical footprint of these stealthy attacks without triggering rate collapse or packet loss.
Table 1 summarizes the final taxonomy of TRUST Lab dataset and, for each family, lists examples of sub-attacks executed, the flow-level observable signal, and the specific generation tools used. This table reflects the design rationale: to cover diverse offensive mechanisms (volumetric, reconnaissance, application-layer, protocol manipulation, evasive techniques, and persistence vectors) while ensuring that each class leaves a statistically distinguishable signature within the CICFlowMeter feature space.
The selection of sub-attacks follows a criterion of functional coverage rather than an exhaustive enumeration. Variants were included that represent attack patterns currently observable in IoT/edge environments and that, in many public datasets, are underrepresented or entirely absent. For example, beyond classical volumetric families (DDoS/DoS or port scanning), application-layer and API attacks (REST, GraphQL, and SOAP) were incorporated because modern edge deployments rely on microservices, telemetry, and gateways exposing programmable interfaces; without this dimension, a dataset cannot capture the type of abuse that real IoT systems routinely face.
Similarly, evasive families, tunneling/exfiltration, C2/beaconing, and low-and-slow DoS attacks (e.g., Slowloris) are included. Recent literature emphasizes that low-and-slow attacks pose a particularly severe threat to resource-constrained IoT and edge environments. Because they rely on protocol semantic manipulation rather than volumetric flooding, they easily masquerade as legitimate nodes experiencing high latency or poor connectivity, successfully evading traditional flood-centric IDS (Ilango et al., 2022; Balaji et al., 2025). Despite their critical impact on operational continuity, these stealthy vectors remain underrepresented in many contemporary datasets, which motivated their explicit inclusion in our taxonomy. The variety of sub-attacks per family ensures that each class captures different operational modes (burst vs. low-and-slow, horizontal vs. vertical reconnaissance), yielding more robust distributions independent of any single tool.
3.4 Labeling and taxonomy
Labels are assigned at the bi-flow level during PCAP → CSV conversion: since each session contains a single class (Section 3.3), all bi-flows from a given PCAP inherit an unambiguous label.
Under this criterion, 16 final classes were defined: Benign, DDoS, DoS, Portscan, Bruteforce, Webbased, API, DNS, MitM, Evasion, TLS/SSL, Exploitation, Exfiltration, Bufferoverflow, Slowloris, and C2Beac. Each class is distributed as an independent single-class CSV file with the same feature schema, ensuring label uniqueness per file and facilitating auditing, recomposition, and campaign-level splits.
To support operational detection scenarios, a binary view is also included, derived directly from the multiclass taxonomy without altering the original data. The binary target variable is defined as:

Label_binary = 0 (Benign) if the multiclass label is Benign; Label_binary = 1 (Attack) otherwise.
This allows the same corpus to be used for training binary classifiers (benign vs. attack) or family-level multiclass models while maintaining full traceability.
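A minimal sketch of this derivation (illustrative code, using the class names of Section 3.4) shows that the binary view is a pure relabeling with no transformation of the flows themselves:

```python
def to_binary(multiclass_label: str) -> int:
    """Map the family-level label to the binary target: Benign -> 0, else 1."""
    return 0 if multiclass_label == "Benign" else 1

# Example labels; the flow feature rows are untouched by the mapping.
labels = ["Benign", "Portscan", "Slowloris", "Benign"]
binary = [to_binary(lbl) for lbl in labels]
print(binary)  # → [0, 1, 1, 0]
```

Because the mapping is deterministic and invertible given the multiclass column, the same CSV files serve binary, multiclass, and cascaded pipelines with full traceability.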
Label integrity is reinforced through systematic checks before and after each capture. All machines in the testbed synchronize their clocks using an internal NTP server; generation scripts fix random seeds and record metadata (source, destination, ports, protocol, duration, and launch parameters). After each session, the following are verified: (i) absence of temporal overlap with other campaigns; (ii) consistency of expected protocols and ports for the corresponding family; (iii) statistical coherence checks at the flow level (e.g., high burst and SYN-rate for Portscan, extreme TTL values and anomalous fragmentation for Evasion, strong periodicity for C2Beac, or deviated handshakes for TLS/SSL); and (iv) discarding of sessions with losses detected by capture counters. This process ensures that the published CSV files reflect only the nominal class of the campaign, without benign or malicious residuals out of context.
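Check (i) above—absence of temporal overlap between campaigns—amounts to verifying that session windows recorded in the campaign metadata are pairwise disjoint. The following is an illustrative sketch with made-up timestamps, not the actual validation tooling:

```python
def no_temporal_overlap(sessions):
    """True iff no two (start, end, class) session windows overlap in time."""
    ordered = sorted(sessions, key=lambda s: s[0])
    for (_, end_prev, _), (start_next, _, _) in zip(ordered, ordered[1:]):
        if start_next < end_prev:  # next session starts before previous ends
            return False
    return True

# Hypothetical NTP-synchronized session windows (seconds since epoch offset).
sessions = [(0, 300, "Benign"), (360, 900, "Portscan"), (960, 2760, "C2Beac")]
assert no_temporal_overlap(sessions)
assert not no_temporal_overlap(sessions + [(100, 200, "DoS")])
```

Sessions failing this check (or checks (ii)–(iv)) would be discarded rather than repaired, so published CSV files contain only verified single-class captures.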
Figure 2 shows the multiclass distribution (Y-axis in ×10⁶, linear scale). The benign-dominated imbalance is intentional: it reflects real-world prevalence and can be recomposed via the modular CSV structure. Under binary aggregation (Figure 3), the dataset contains ~2.57 M benign and ~1.99 M malicious bi-flows (ratio ≈ 1.3:1). No synthetic balancing is applied; the dataset is delivered as a raw baseline.
Figure 2
Figure 3
For readability, the manuscript uses standardized English class names, while the released CSV/file identifiers retain the internal labels used during dataset generation.
3.5 Feature extraction with CICFlowMeter
Once traffic is captured in PCAP format from the mirror (SPAN) ports, all processing in TRUST Lab dataset is performed at the bi-flow level using CICFlowMeter, following the philosophy of the CICIDS datasets (Sharafaldin et al., 2018). A bi-flow comprises the 5-tuple (source IP, destination IP, source port, destination port, protocol) and the packet exchange time interval. CICFlowMeter groups packets in both directions under a single flow identifier and considers the flow terminated when an inactivity timeout is reached or when session-closing events are observed. The result of this process is a set of CSV files in which each row represents a complete bi-flow and each column corresponds to a statistical feature derived from header fields and timestamps.
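The grouping idea can be sketched as follows; this is a simplified illustration (no timeouts or TCP teardown handling), not CICFlowMeter's actual implementation:

```python
from collections import defaultdict

def flow_key(src_ip, dst_ip, src_port, dst_port, proto):
    """Canonicalize the 5-tuple so both directions share one bi-flow id."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)

flows = defaultdict(list)
packets = [
    ("10.0.0.1", "10.0.0.2", 51000, 80, "TCP"),  # forward direction
    ("10.0.0.2", "10.0.0.1", 80, 51000, "TCP"),  # backward direction
]
for pkt in packets:
    flows[flow_key(*pkt)].append(pkt)
# Both packets fall under a single bi-flow identifier.
```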
TRUST Lab dataset uses the standard configuration of CICFlowMeter and exports a total of 80 numerical features per bi-flow, to which the label column described in section 3.4 is added. These features cover, in a structured manner, the main dimensions of network traffic:
Volume and rate: total number of packets and bytes in the flow (global and direction-specific for forward and backward), as well as byte/packet rates per second.
Temporal behavior: flow duration, Inter-Arrival Times (IAT) globally and per direction, and metrics describing active/idle periods.
Packet size: maximum, minimum, mean, and standard deviation of packet sizes in each direction.
Sub-flows: forward and backward sub-flow counters, useful for capturing bursts within the same connection.
Control and flags: counts of TCP flags (SYN, ACK, FIN, RST, PSH, URG), as well as proportions of packets exhibiting certain control patterns.
Although the 80 features are not listed individually in this article, they are organized under these families, which facilitates later interpretation and ensures compatibility with other CICFlowMeter-based work (Sharafaldin et al., 2018; Sarhan et al., 2022a, 2022b). The choice of this representation directly reflects the IoT/edge scenario for which the dataset is designed. By relying solely on headers and timing information, features can be computed on resource-constrained edge nodes without inspecting payloads or storing large volumes of raw packets.
This significantly reduces computation, memory, and storage costs compared to packet-based approaches, while still retaining sufficient discriminative power to distinguish between benign traffic and multiple attack families (volumetric, reconnaissance, application-layer, evasive, exfiltration, or C2). Furthermore, the use of a fixed, homogeneous 80-column schema across all single-class CSV files simplifies integration with existing pipelines and with other CICFlowMeter-based datasets, enabling any researcher to directly apply feature selection, normalization, or model optimization techniques without additional effort in feature engineering or schema alignment.
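To make the header-only nature of these features concrete, the following sketch derives two of them (flow bytes/s and mean inter-arrival time) from packet timestamps and sizes alone; variable names and the zero-duration guard are assumptions of this sketch, not CICFlowMeter's exact formulas:

```python
# One bi-flow as (timestamp_s, size_bytes) pairs; values are synthetic.
packets = [(0.00, 60), (0.10, 1500), (0.25, 60), (0.40, 1500)]

duration = packets[-1][0] - packets[0][0]
total_bytes = sum(size for _, size in packets)
iats = [b[0] - a[0] for a, b in zip(packets, packets[1:])]

# Guard against zero-duration flows (illustrative policy).
flow_bytes_per_s = total_bytes / duration if duration > 0 else 0.0
iat_mean = sum(iats) / len(iats) if iats else 0.0
```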
3.6 Minimal preprocessing and final dataset organization
The design of TRUST Lab dataset is grounded in a key principle: avoiding any manipulation that would artificially alter the original distribution of flows. For this reason, the preprocessing applied after CICFlowMeter extraction is intentionally minimal, limited to operations strictly necessary to ensure numerical coherence, reproducibility, and compatibility with machine-learning pipelines. First, after the PCAP → CSV conversion, a basic cleanup of values is performed:
NaN values, produced by legitimate absence of packets in one direction (typical in FIN/NULL/XMAS scans, unidirectional operations, or attacks with partially open connections), are replaced with 0, preserving the semantics of “absence” without introducing artificial imputations.
±∞ values, arising from division by zero (e.g., rate calculations when flow duration is extremely small or when one direction contains zero packets), are also set to 0, following common practice in flow-based studies to avoid overflows and maintain compatibility with standard scalers.
In general, these transformations do not modify the statistical signal of the traffic; they simply ensure that all columns contain valid numerical values comparable across sessions, and are limited to numerical sanitization for reproducibility and compatibility with standard machine-learning pipelines. We note, however, that replacing NaN and ±∞ with 0 may compress some extreme values in duration- or rate-derived features when such values arise from zero packets in one direction or near-zero effective durations. For this reason, the TRUST Lab dataset is distributed as a minimally processed baseline, and downstream users should document any alternative handling policy adopted for sensitivity analyses.
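The cleanup described above can be sketched with pandas as follows (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy flow table with the two problematic value types discussed above.
df = pd.DataFrame({
    "flow_bytes_s": [1200.5, np.inf, -np.inf],  # ±inf from near-zero durations
    "bwd_iat_mean": [np.nan, 0.3, 0.0],         # NaN from empty backward direction
})

# Minimal sanitization: ±inf -> 0, then NaN -> 0 (no other imputation).
clean = df.replace([np.inf, -np.inf], 0.0).fillna(0.0)
```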
Second, no balancing, synthetic oversampling (e.g., SMOTE), undersampling, or global normalization is applied to the delivered distribution. This choice is motivated by two reasons: (i) to maintain the dataset as a neutral and reproducible baseline, avoiding artifacts introduced by specific preprocessing decisions; and (ii) to allow each researcher to apply their own preprocessing strategy (normalization, column-wise scaling, balancing, feature selection) depending on the model or scenario they intend to evaluate. In contrast, many preprocessed public datasets impose scalers or mix sessions before distribution, masking temporal structure and making it difficult to replicate experiments. TRUST Lab dataset deliberately avoids such interventions to preserve campaign-level traceability.
The final organization of the dataset follows the same philosophy. Instead of a single CSV file with mixed classes, TRUST Lab dataset is distributed as 16 single-class CSV files, each corresponding to Benign or one of the 15 attack families. All files share exactly 81 columns: 80 numerical CICFlowMeter features and an additional label column. This modular structure provides several practical advantages:
Prevents temporal leakage: since each file corresponds to a complete session, splits can be performed at the file level, ensuring that test flows originate from campaigns different from those used for training.
Enables flexible compositions: users can create realistic, balanced, or focused distributions depending on the experiment, without reconstructing PCAPs or manually filtering individual flows.
Facilitates comparative experiments: different models or architectures can be trained on the same class selection without introducing noise from accidental session mixing.
Supports binary and multiclass evaluation directly: researchers can simply select the required CSVs and derive the binary label if a benign-vs-attack model is desired.
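As a sketch of this modular recomposition (file contents are simulated in memory; in practice each entry would come from `pd.read_csv` on the corresponding single-class file, and column names are illustrative):

```python
import pandas as pd

# Tiny in-memory stand-ins for three of the 16 single-class tables,
# sharing the same schema (names illustrative).
tables = {
    "Benign":    pd.DataFrame({"flow_duration": [1.2, 3.4], "label": "Benign"}),
    "DoS":       pd.DataFrame({"flow_duration": [0.1],      "label": "DoS"}),
    "Slowloris": pd.DataFrame({"flow_duration": [90.0],     "label": "Slowloris"}),
}

# Focused binary experiment: Benign vs. one attack family, no session mixing.
subset = pd.concat([tables["Benign"], tables["Slowloris"]], ignore_index=True)
subset["label_binary"] = (subset["label"] != "Benign").astype(int)
```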
The dataset contains approximately 4.6 million bi-flows, distributed across the 16 single-class CSV files. This format efficiently packages the information required for flow-based pipelines without losing detail in the temporal or semantic structure of each campaign. Indeed, the dataset’s modularity is one of its distinguishing elements compared to other public resources, enabling the reproduction of complex experiments without label ambiguity or uncontrolled temporal mixing.
4 Analysis of the TRUST Lab dataset
To ensure clarity and provide a structured narrative, the empirical validation of the TRUST Lab dataset is organized into two sequential phases across the following sections. First, the current section presents an exploratory statistical characterization of the data. Its goal is to demonstrate that our single-class session methodology successfully captures distinct, separable statistical signatures for different attack families without relying on payload inspection. This is validated through univariate distributions, correlation structures, ANOVA feature ranking, and Principal Component Analysis (PCA). Second, Section 5 evaluates the dataset in a downstream task, establishing a baseline detection performance using a standard ML pipeline. The overarching objective of this extensive evaluation is not to propose a novel IDS architecture, but to empirically prove that the dataset provides a reliable, high-integrity ground truth capable of supporting both fine-grained statistical analysis and robust flow-based intrusion detection.
4.1 General distribution and descriptive analysis of classes
The first step in characterizing TRUST Lab dataset is to quantify how traffic volume is distributed across classes. Figure 2 reports the multiclass distribution as the absolute number of bi-flows per label, with the Y-axis shown in millions (×10⁶) on a linear scale. As expected in realistic IoT/edge settings, Benign traffic dominates the dataset, while the 15 attack families comprise the remaining volume with markedly different levels of support. This imbalance is consistent with operational environments where normal activity is prevalent, and attacks appear as temporally bounded, isolated segments rather than as persistent background traffic.
TRUST Lab dataset does not enforce equal sample sizes across attack categories; however, it also avoids the opposite extreme—attack families represented by only a few dozen samples. Each attack category includes enough bi-flows to allow flow-based models to estimate meaningful distributions of durations, rates, sizes, and inter-arrival times. In practice, the dataset reflects the imbalance typical of real deployments while ensuring that each class has statistical support on the order of thousands or tens of thousands of flows, preventing any family from collapsing into anecdotal noise. This design choice is especially relevant for low-frequency but high-impact vectors such as exfiltration, C2/beaconing, and certain evasion techniques: they remain underrepresented relative to Benign traffic, but not to a degree that would prevent models from encountering them during training.
To support binary analyses, Figure 3 aggregates all malicious families into a single Attack label and contrasts its volume against Benign, again expressed in millions of bi-flows. Under this aggregation, TRUST Lab dataset contains approximately 2.57 million benign bi-flows and 1.99 million attack bi-flows, corresponding to a moderate imbalance (Benign/Attack ≈ 1.3:1). This proportion has several practical implications:
Binary baselines and accuracy pitfalls: a trivial classifier that always predicts “Benign” would achieve roughly 56% accuracy. Any meaningful model must exceed this baseline, and evaluations should account for the risk that accuracy-oriented training may still under-detect attacks under moderate imbalance. Standard imbalance-aware practices remain applicable (e.g., class weighting and threshold adjustment).
Calibration: the moderate imbalance supports direct reporting of F1, F2, ROC-AUC, and PR-AUC without mandatory rebalancing, while the modular CSV structure allows simulation of alternative prevalences.
Metric stability at scale and per-family analysis: the dataset size reduces the statistical variance of evaluation metrics. With approximately 4.6 million bi-flows, even typical split strategies (e.g., 70% train, 15% validation, 15% test per campaign) retain hundreds of thousands of samples for evaluation. This supports precise estimation of global metrics as well as per-family metrics (recall, precision, F1), and facilitates analysis of class-specific behaviours—particularly useful in later sections focused on feature patterns and confusion among attack families.
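The trivial-baseline figure quoted above follows directly from the reported class counts:

```python
# Approximate bi-flow counts under binary aggregation (rounded).
benign, attack = 2_570_000, 1_990_000

# An "always Benign" classifier is right exactly on the benign share.
baseline_accuracy = benign / (benign + attack)  # ≈ 0.564, i.e., ~56%
```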
This distributional profile sets the context for the feature-level analysis in the following subsections.
4.2 Univariate analysis of features
Univariate analysis provides a first-order view of the statistical behavior of each CICFlowMeter feature in isolation, enabling an initial assessment of how individual attributes contribute to separating benign from malicious traffic. Figure 4 presents a representative set of univariate histograms computed over the full dataset (≈4.6 million bi-flows). Distributions are normalized for comparability and plotted on a linear scale, covering representative attributes associated with volume, size, temporal metrics, and flow direction.
Figure 4
Across features, the histograms reveal heavy tails and pronounced skewness, particularly for volume-related metrics (e.g., flow_bytes/s), total packet-count statistics, and maximum packet-size indicators. This behavior should not be interpreted as noise; it is an intrinsic characteristic of real traffic in which rare but extreme events occur. In TRUST Lab dataset, these extremes are strongly driven by volumetric families such as DDoS/DoS, which generate values that are markedly separated from benign traffic, while benign flows remain concentrated within comparatively moderate and stable ranges. This separation is especially visible in features where attacks produce high-rate bursts or sudden increases in packet counts over short intervals.
In contrast, application-layer families (Webbased, API, Exploitation) and stealth-oriented families (C2Beac, Exfiltration, Slowloris) diverge from benign traffic more subtly at the univariate level. This is consistent with their operational nature: rather than inducing large shifts in global volume or rate, these campaigns often manifest as short sequences, irregular timing behavior, or minimal payload transfer. Consequently, their univariate distributions exhibit partial overlap with Benign, indicating that reliable discrimination typically requires multivariate combinations of features rather than single-attribute thresholds.
Figure 4 also shows abrupt cut-offs in the tails of certain histograms. These cut-offs arise from CICFlowMeter’s flow reconstruction logic—specifically, termination after extended inactivity or protocol-specific limits—and do not introduce artifacts. On the contrary, they can encode meaningful behavioral information, such as persistently long flows with no data transfer or very short, unidirectional packet sequences.
Overall, the univariate profiles in Figure 4 confirm that TRUST Lab dataset simultaneously contains (i) clearly separable signals, typical of floods and reconnaissance-style behavior, and (ii) more subtle patterns, typical of low-and-slow and application-layer activity. This combination supports evaluation of IDS models under both high-separability conditions and regimes where decision boundaries are inherently less distinct.
4.3 Correlation and redundancy between features
Understanding feature correlation is essential for assessing the practical utility of a flow-based dataset. In systems such as CICFlowMeter, many statistics are derived from one another (e.g., rates computed from packet counts and duration, or forward/backward metrics describing the same dimension), so a certain degree of structural collinearity is expected. Figure 5 presents the full correlation matrix of the 80 TRUST Lab dataset features, computed over a stratified sample of millions of flows to preserve inter-class variability and avoid imbalance-driven bias.
Figure 5
The matrix reveals well-defined blocks of high correlation that reflect mechanical dependencies among CICFlowMeter attributes. First, volume-related features (total_fwd_packets, total_bwd_packets, total_length_fwd_packets, flow_bytes/s) form a tightly correlated cluster, as they represent different projections of the same physical quantity (flow throughput). Variations in volume propagate almost linearly into rates and derived transformations. This redundancy benefits robust models (deep neural networks (DNN), tree-based models, ensembles), but can cause instability in linear models unless regularization is applied.
Second, a consistent block emerges between forward and backward variables. This reflects the natural symmetry of bidirectional flows: benign transactions generally show balanced patterns between directions, while aggressive attacks (e.g., floods, one-direction scans, repeated HTTP injections) produce a strong forward bias that simultaneously alters multiple dependent metrics. This structure helps multivariate models detect asymmetric behaviors that would not be identifiable through univariate attributes alone.
The third major cluster corresponds to temporal features (flow_duration, iat_mean, iat_std, active_mean, idle_mean, and others). Correlations show coherent temporal regimes: long flows tend to display wide variation in inter-arrival times, whereas volumetric attacks show short durations with very low IATs. Low-and-slow attacks (Slowloris, C2Beac) exhibit a different pattern—long durations paired with heterogeneous IATs—captured clearly in the matrix and providing an important source of discrimination.
The block of TCP flags exhibits moderate but meaningful correlations in specific campaigns. SYN and ACK correlate in stable benign traffic, but FIN/NULL/XMAS scan attacks disrupt this structure, reducing or inverting the correlation. This appears as a fragmented pattern, indicating that flags contribute useful signal but do not dominate the dataset’s separability.
A key observation is that, despite strong correlations in several groups, the matrix does not collapse into a single dominant cluster. This confirms that TRUST Lab dataset does not suffer from excessive redundancy nor derive from a single statistical mode: each attack family produces characteristic variations in distinct subgroups, which is essential for multiclass classification. The coexistence of strong correlations (supporting robust learning in high-capacity models) with weak or null correlations (providing independent signals) indicates a rich structure suitable for both regularized linear models and more expressive nonlinear architectures.
Overall, the correlation structure is diverse enough for multiclass discrimination yet sufficiently coherent for efficient learning without mandatory feature engineering. As depicted in the full correlation matrix (Figure 5), the goal is to observe the macroscopic clustering of highly correlated variables rather than individual pairwise values, demonstrating the inherent redundancy within the 80-feature schema.
4.4 Statistical significance of features (ANOVA F-test)
To quantify the individual discriminative contribution of each CICFlowMeter attribute, we apply an ANOVA F-test, which measures—feature by feature—the ratio of between-class variability to within-class variability. Unlike the correlation analysis in Section 4.3, which highlights structural dependencies among attributes, ANOVA directly evaluates how strongly each feature separates the labeled families when considered in isolation.
Figure 6 reports the resulting F-values for the main TRUST Lab dataset features, computed on a stratified sample of millions of bi-flows that preserves the real prevalence of each family. The scale of the analysis (millions of flows) reduces estimation variance and avoids artifacts associated with very small subsamples.
Figure 6
The ranking exhibits a consistent structure: the most discriminative variables cluster into four functional groups.
Volume and throughput (e.g., flow_bytes/s, total_length_forward, total_length_backward): these dominate because volumetric attacks (DDoS/DoS) and certain scans generate throughput far above benign traffic, producing very large between-class variance relative to within-class variance. This matches the univariate behavior observed in Figure 4.
Fine-grained temporal metrics (e.g., iat_mean, iat_std, flow_duration): low-and-slow families (Slowloris, C2Beac) induce timing patterns that diverge from benign traffic. Benign flows are comparatively stable, whereas evasive/persistent campaigns show wider oscillations, which elevates IAT-based features in the F-ranking.
Directional asymmetries (e.g., fwd/bwd packet ratio, fwd/bwd bytes ratio): unidirectional behavior from scanning, beaconing, or exfiltration yields strong forward/backward imbalances. These features are particularly useful for threats that do not necessarily raise global volume but reshape the directional structure of the flow.
Dispersion metrics (e.g., std., packet-size variance, length deviation): multiple families generate anomalous size distributions (e.g., uniform payloads in repeated injections or abrupt size changes during exfiltration), improving separability relative to benign traffic when dispersion is measured explicitly.
ANOVA also highlights features with low univariate discriminative power, including certain TCP flags and attributes that are largely redundant linear combinations of others. Low F-values do not imply these variables are useless in multivariate models; they indicate only that they do not separate classes meaningfully on their own.
For statistical testing, significance is assessed at α = 0.001 to account for multiple comparisons. Given the dataset scale, all features are statistically significant (p < 0.001), although practical significance varies substantially. To control the family-wise error rate across the 80 tested features, Bonferroni correction is applied, yielding adjusted α = 0.001/80 ≈ 1.25 × 10⁻⁵. Features whose F-values exceed the Bonferroni-based critical threshold are treated as robustly discriminative; the ordering in Figure 6 reflects both statistical significance and effect size, with volume and temporal metrics providing the strongest separation.
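The per-feature test and Bonferroni adjustment can be sketched with scikit-learn on synthetic data (three features, one strongly shifted; the data and shift sizes are illustrative):

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Synthetic two-class sample: feature 0 has a large mean shift,
# feature 1 a tiny one, feature 2 none.
rng = np.random.default_rng(0)
X_benign = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
X_attack = rng.normal(loc=[3.0, 0.1, 0.0], scale=1.0, size=(500, 3))
X = np.vstack([X_benign, X_attack])
y = np.array([0] * 500 + [1] * 500)

# Per-feature ANOVA F-values and p-values.
F, p = f_classif(X, y)

# Bonferroni-adjusted significance level over the tested features.
alpha_adjusted = 0.001 / X.shape[1]
significant = p < alpha_adjusted
```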
Overall, the ANOVA results confirm that discrimination in TRUST Lab dataset relies on complementary signals—spanning volume, timing, directionality, and dispersion—supporting both regularized linear models (where high-F attributes are particularly influential) and nonlinear models that can exploit interactions among moderately ranked features.
4.5 Dimensionality analysis using PCA
Principal Component Analysis (PCA) is used to inspect the global structure of the 80-dimensional CICFlowMeter feature space without supervision. Whereas ANOVA ranks the individual attributes that separate classes, PCA shows how flows arrange when projected onto a lower-dimensional subspace that maximizes retained variance.
Figure 7 reports the explained-variance curve: each component contributes a fraction of total variance, and the cumulative curve quantifies how much of the dataset can be represented as dimensions are added. The curve shows substantial variance captured early, yet no single dominant component, reflecting structural diversity. In particular, the first principal component explains ≈51.9% of the variance, the second adds 11.2%, and the third contributes 5.6%. To retain 95% of variance, approximately 19 components are required, and 23 components are needed to reach 96% (dashed lines). The gradual, stepwise accumulation indicates that relevant information is distributed across many attributes—consistent with the correlation and ANOVA results showing complementary signals from volume, timing, flags, directionality, and dispersion, with none dominating the space. Consequently, projections to only 2–3 components inevitably lose information, although they remain useful for visualization.
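The explained-variance computation can be sketched as follows, using synthetic low-rank data in place of the real 80-feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with 5 underlying factors mixed into 20 observed features.
rng = np.random.default_rng(1)
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(2000, 20))

# Cumulative explained variance and the number of components for 95%.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
```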
Figure 7
To illustrate separability under strong compression, Figure 8 shows a two-dimensional PCA projection (first two components) of a stratified subset, enabling qualitative inspection without labels being used to construct the projection. Several patterns emerge. Volumetric families (DDoS, DoS) occupy extreme regions along the first component, which groups volume- and rate-related measures, and remain separated from Benign even in 2D—confirming their strong statistical signature. Portscan forms a compact but shifted cluster relative to Benign, consistent with directional asymmetries and repeating short-packet patterns captured by CICFlowMeter. In contrast, application-layer families (Webbased, API, Exploitation) overlap more with Benign: their changes in volume/rate are moderate, and separability largely depends on non-linear feature interactions. Stealthy/low-and-slow families (C2Beac, Exfiltration, Slowloris) appear in intermediate or diffuse regions reflecting long flows, low rates, and irregular timing; they may overlap visually after dimensionality reduction, even if separable in the original 80-dimensional space.
Figure 8
Overall, PCA confirms that—despite information loss in 2D—TRUST Lab dataset still exhibits distinguishable structures for several attack families, while the remaining overlaps (especially for subtle campaigns) reinforce that effective detection requires models that capture multivariate interactions, aligned with the dataset’s objective of supporting evaluation under both easy-to-separate and intrinsically low-separability conditions.
4.6 Distributions by class, boxplots and statistical profiles
This subsection compares how selected critical CICFlowMeter features distribute across the 16 classes, to assess whether each family leaves a distinctive statistical signature. Figure 9 summarizes this using multiclass boxplots for representative attributes spanning volume, temporal behavior, dispersion, and flow direction, highlighting differences in median, interquartile range, and extremes without relying on per-feature histograms.
Figure 9
The first global observation is the strong heterogeneity of scale across families. Some classes concentrate tightly near low values (notably reconnaissance such as Portscan), while others show wide ranges with pronounced upper tails (e.g., DDoS and DoS). This confirms that flow magnitude—packet counts, duration, and total bytes—remains discriminative even before any multivariate modeling.
Temporal boxplots (flow_duration and related metrics) reveal class-specific regimes: Slowloris and C2Beac exhibit unusually long durations, consistent with persistence aimed at exhausting resources or maintaining covert channels; Exfiltration shows moderate but consistent temporal dispersion, with flows longer than typical benign traffic but less persistent than long-lived attacks; and volumetric attacks tend to produce short durations with high-rate peaks—compressed duration ranges paired with strong upper tails in volume-related metrics.
Directional-asymmetry features are most salient in campaigns dominated by one-way communication. In Portscan and Exfiltration, forward/backward medians diverge sharply and ranges remain narrow, reflecting repetitive, directional behavior. Benign traffic, by contrast, is typically more balanced between directions, consistent with regular client–server exchanges.
For packet-size dispersion (e.g., variance or length deviation), application-layer injection/manipulation families (Webbased, API, Exploitation) show higher dispersion due to mixed HTTP responses, exceptions, and error codes generated during attacks, whereas DDoS/DoS show lower dispersion because packets are repetitive and synthetic, produced by automated tools.
Finally, boxplots indicate internal modes within labels: for example, upper tails in C2Beac and dispersion patterns in Exfiltration suggest that different beaconing styles or data-extraction methods generate distinct statistical profiles under the same family label.
5 TRUST Lab dataset in the evaluation of IDS
5.1 Reference model and evaluation protocol
To illustrate TRUST Lab dataset in the evaluation of flow-based IDS, we adopt as reference a two-stage architecture inspired by the Edge-DL system of Villafranca and Cano (2025). In that work, a lightweight DNN is trained per dataset; here, those DNNs are reused as individual experts, and we build on top of them a stacking layer and a second-stage XGBoost classifier. The objective is not to introduce a novel IDS, but to provide a representative, up-to-date pipeline to exercise TRUST Lab dataset on binary and multiclass tasks while keeping the focus on the dataset’s behavior.
Figure 10 summarizes the reference model. Incoming bi-flows, represented by the 80 CICFlowMeter features, are processed in Phase 1 by DNN experts previously trained on public datasets (e.g., CICIDS2017, UNSW-NB15, BoT-IoT, IoTID20) following the design and training methodology in Villafranca and Cano (2025). Each expert outputs an attack probability; these scores are concatenated into a single score vector and fed to a low-complexity decision-tree stacking meta-classifier (a Random Forest–type model). The meta-classifier produces an aggregated attack probability, which is calibrated using isotonic regression to obtain well-behaved posterior estimates; a fixed operating threshold is then applied to generate the Benign/Attack decision on the TRUST Lab dataset, using the binary label derived in section 3.4.
Figure 10
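A minimal sketch of the Phase 1 chain follows; the expert scores are simulated stand-ins for the pre-trained DNN outputs, and the sizes, seeds, and calibration split are illustrative (calibrating on the evaluation labels here is a simplification, not the reference protocol):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression

# Simulated labels and expert probabilities (4 experts, noisily informative).
rng = np.random.default_rng(7)
n = 4000
y = rng.integers(0, 2, size=n)  # 0 = Benign, 1 = Attack
experts = np.clip(y[:, None] * 0.6 + rng.normal(0.2, 0.2, size=(n, 4)), 0, 1)

# Decision-tree stacking meta-classifier over the concatenated scores.
stacker = RandomForestClassifier(n_estimators=50, random_state=0)
stacker.fit(experts[:3000], y[:3000])
raw = stacker.predict_proba(experts[3000:])[:, 1]

# Isotonic calibration of the aggregated probability, then a fixed threshold.
iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(raw, y[3000:])
threshold = 0.445  # fixed operating point, as in the reference model
decisions = (calibrated >= threshold).astype(int)
```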
Bi-flows that Phase 1 flags as attacks are processed in Phase 2 by a multiclass XGBoost classifier trained specifically on TRUST Lab dataset. This model uses the same 80 CICFlowMeter features to discriminate among the 15 offensive families, leveraging the taxonomy in section 3.4. TRUST Lab dataset is therefore exploited in two complementary ways: as labelled traffic for binary detection and as a testbed for fine-grained attack classification using flow-only information.
For internal evaluation on TRUST Lab dataset, both the Phase 1 stacker and the Phase 2 XGBoost model are trained with an 80/20 stratified split at the bi-flow level, mixing all classes while preserving in both subsets the real proportions of Benign and each offensive family (80% training, 20% testing). This scheme leverages the dataset volume to yield stable metrics without sacrificing per-class support. At the same time, the modular single-class CSV organization in section 3.6 enables future work to define stricter variants based on campaign- or session-level splits to study temporal concept shift or cross-scenario generalization.
Evaluated metrics include accuracy, precision, recall, and F1-score (global and per-class), plus F2-score and ROC and Precision–Recall curves for the binary task. The purpose is to characterize the dataset’s behavior—which families are separable, where errors concentrate, and how imbalance affects detection—rather than to optimize the architecture.
5.2 Results on the TRUST Lab dataset
Because the TRUST Lab dataset is intended to support both detection and attribution, this section reports not only a binary benign-versus-attack baseline but also a full Phase-2 multiclass benchmark, including per-class precision, recall, F1-score, support, and a confusion-matrix-based family-wise difficulty analysis. All results correspond to the reference model of section 5.1, trained and evaluated exclusively on the TRUST Lab dataset using the 80/20 stratified bi-flow split. For the binary Benign/Attack task (Phase 1), the test set contains 1,034,194 bi-flows (436,332 benign and 597,862 malicious). Table 2 summarizes the results: accuracy = 0.8963, F2 = 0.9350 (higher weight on false negatives), and ROC–AUC = 0.9676. For the Attack class, precision = 0.8802, recall = 0.9498, and F1 ≈ 0.91, with 567,844 TP and 30,018 FN. For Benign, specificity = 0.8229, with 359,053 TN and 77,279 FP, corresponding to a false-alarm rate of ~7.5% of all flows.
Table 2
| Class | Support | Precision | Recall | F1 |
|---|---|---|---|---|
| Benign | 436,332 | ≈0.923 | 0.823 | ≈0.87 |
| Attack | 597,862 | 0.8802 | 0.9498 | ≈0.91 |
| Overall | 1,034,194 | – | – | Accuracy = 0.8963; F2 = 0.9350; ROC–AUC = 0.9676 |
Performance of the binary benign/attack stacker on the TRUST Lab dataset test set (80/20 stratified).
Figure 11 shows the distribution of the stacker output: Benign traffic concentrates near 0, most attacks near 1, and the overlap lies roughly in [0.4, 0.5]. The operating threshold is set to t = 0.445, slightly above the value that maximizes F2, to reduce false alarms while maintaining an attack detection rate close to 95%. The overlap region is dominated by slow/low-volume attacks (e.g., mild DoS, Slowloris) and atypical benign traffic, explaining most FP/FN cases; outside this narrow band, Figure 11 exhibits a clear bimodal separation.
Figure 11
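The threshold choice described above can be reproduced with a simple sweep over candidate cut-offs. The sketch below uses synthetic bimodal scores purely for illustration; the score distribution and function name are invented, not the stacker's actual output:

```python
import random

def best_f2_threshold(scores, labels, steps=200):
    """Scan candidate thresholds and return the one maximizing F2
    (labels: 1 = attack, 0 = benign; scores: estimated P(attack))."""
    best_t, best_f2 = 0.5, -1.0
    for i in range(1, steps):
        t = i / steps
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue
        p, r = tp / (tp + fp), tp / (tp + fn)
        f2 = 5 * p * r / (4 * p + r)  # F-beta with beta = 2
        if f2 > best_f2:
            best_t, best_f2 = t, f2
    return best_t, best_f2

# Toy bimodal scores: benign mass near 0, attack mass near 1, mid overlap
rng = random.Random(0)
benign = [(rng.betavariate(2, 8), 0) for _ in range(500)]
attack = [(rng.betavariate(8, 2), 1) for _ in range(500)]
scores, labels = zip(*(benign + attack))
t, f2 = best_f2_threshold(scores, labels)
print(f"best threshold={t:.3f}, F2={f2:.3f}")
```

In deployment, the operating threshold is then nudged above this F2-optimal point, trading a small amount of recall for a lower false-alarm rate, exactly as done with t = 0.445 here.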
Figure 12 reports the Phase 1 ROC curve. Together, ROC–AUC = 0.9676 and F2 = 0.9350 indicate that the model exploits the discriminative signal in the CICFlowMeter features to a degree close to the dataset's inherent statistical limit. The ROC's steep initial slope is consistent with IoT/edge settings where false negatives are costly: at a 10% false-positive rate, the true-positive rate exceeds 97%.
Figure 12
Flows classified as attacks by Phase 1 are forwarded to the Phase 2 multiclass XGBoost classifier. In total, 645,123 bi-flows are passed to XGBoost. Phase 2 serves two purposes: (i) correcting Phase 1 false positives by reclassifying flows statistically compatible with benign traffic, and (ii) assigning one of the 15 offensive families to true attacks. The multiclass model reclassifies 75,515 flows (≈11.7% of Phase 1 “attacks”) back to Benign, reducing the system-wide false-alarm rate from ~7.5% to ~2.2%; the remaining 569,608 flows are retained as malicious and distributed across the 15 families.
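The two-stage decision logic can be sketched as follows. Here `phase1_score` and `phase2_predict` are hypothetical stand-ins for the calibrated binary stacker and the multiclass XGBoost model; only the threshold value is taken from the text:

```python
T_PHASE1 = 0.445  # operating threshold of the binary stage

def classify_flow(features, phase1_score, phase2_predict):
    """Hierarchical IDS decision: a recall-oriented binary gate,
    then multiclass refinement that may restore Benign."""
    if phase1_score(features) < T_PHASE1:
        return "Benign"                 # accepted by the binary stage
    return phase2_predict(features)     # a family label, or Benign again
                                        # (Phase 2 can undo a Phase-1 FP)

# Toy stand-ins: score by a single 'rate' feature, family by a rule
p1 = lambda f: min(f["rate"] / 1000, 1.0)
p2 = lambda f: "Benign" if f["rate"] < 500 else ("DDoS" if f["rate"] > 5000 else "DoS")

print(classify_flow({"rate": 100}, p1, p2))   # → Benign (Phase 1 gate)
print(classify_flow({"rate": 460}, p1, p2))   # → Benign (Phase 2 correction)
print(classify_flow({"rate": 9000}, p1, p2))  # → DDoS
```

The second call illustrates the refinement mechanism: the flow passes the recall-oriented gate but is reclassified as Benign by the multiclass stage, which is how the system-wide false-alarm rate drops from ~7.5% to ~2.2%.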
Figure 13 shows the final post-Phase-2 distribution. A moderate imbalance remains. DDoS (56,645) and DNS (50,429) account for a large fraction, while Evasion (25,312) and Bruteforce (27,444) are smaller but still provide sufficient support for reliable metrics; this reflects realistic IoT/edge prevalence where volumetric and DNS-based attacks are more frequent than sophisticated evasion or exploitation attempts.
Figure 13
Table 3 reports per-class precision/recall/F1/support and yields three profiles: (i) F1 ≥ 0.99 for 9/15 families (API, C2Beac, DDoS, DNS, Evasion, Exfiltration, MitM, Portscan, TLS/SSL), where flow rate, IAT, directional asymmetries, and packet-size patterns form strong signatures; (ii) intermediate performance for Exploitation (F1 ≈ 0.92; support = 42,838), with errors mainly toward Benign (≈1,990) and Webbased (≈870) due to shared HTTP semantics; and (iii) challenging DoS (F1 ≈ 0.74; support = 44,966) and Slowloris (F1 ≈ 0.68; support = 32,635), where low-and-slow or intermediate-rate behavior overlaps statistically with slow benign flows or other families.
Table 3
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| API | 1.00 | 1.00 | 1.00 | 41,395 |
| Benign (FP) | 0.96 | 0.94 | 0.95 | 75,515 |
| Bruteforce | 0.97 | 0.98 | 0.97 | 27,444 |
| Bufferoverflow | 0.95 | 0.98 | 0.96 | 36,019 |
| C2Beac | 1.00 | 1.00 | 1.00 | 34,472 |
| DDoS | 0.99 | 1.00 | 1.00 | 56,645 |
| DNS | 1.00 | 1.00 | 1.00 | 50,429 |
| DoS | 0.77 | 0.71 | 0.74 | 44,966 |
| Evasion | 1.00 | 1.00 | 1.00 | 25,312 |
| Exfiltration | 1.00 | 1.00 | 1.00 | 32,318 |
| Exploitation | 0.92 | 0.92 | 0.92 | 42,838 |
| MitM | 1.00 | 1.00 | 1.00 | 31,821 |
| Portscan | 1.00 | 0.99 | 0.99 | 41,758 |
| Slowloris | 0.64 | 0.71 | 0.68 | 32,635 |
| TLS/SSL | 1.00 | 1.00 | 1.00 | 35,073 |
| Webbased | 0.99 | 0.99 | 0.99 | 34,719 |
Per-class metrics for Phase 2 (XGBoost) on the TRUST Lab dataset.
Figure 14 (confusion matrix) confirms that high-F1 classes concentrate mass on the diagonal, while the main confusions occur between families with similar statistical behavior: Exploitation → Webbased (870 cases) and → Bufferoverflow (142 cases); Slowloris → Benign for roughly 9,310 flows (~28%), as quiet campaigns resemble slow legitimate connections; and DoS ↔ DDoS (1,306 cases), reflecting the boundary between intermediate-rate and high-rate flooding, where distinctions depend on deployment thresholds. Overall, these results show that the TRUST Lab dataset supports simultaneous evaluation of binary detection under moderate imbalance and multiclass separation among attack families using flow-only attributes in a realistic IoT/edge context, while explicitly exposing the hard cases where flow-level information is intrinsically limiting.
Figure 14
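The relative weight of each confusion is easy to recover from the counts quoted above; the short check below uses only numbers reported in the text and in Table 3:

```python
# Class supports from Table 3 and off-diagonal counts from Figure 14
support = {"Exploitation": 42_838, "Slowloris": 32_635, "DoS": 44_966}
leaks = {
    ("Exploitation", "Webbased"): 870,
    ("Exploitation", "Bufferoverflow"): 142,
    ("Slowloris", "Benign"): 9_310,
    ("DoS", "DDoS"): 1_306,
}
for (src, dst), n in leaks.items():
    print(f"{src} -> {dst}: {n / support[src]:.1%} of {src} flows")
# Exploitation -> Webbased: 2.0% of Exploitation flows
# Exploitation -> Bufferoverflow: 0.3% of Exploitation flows
# Slowloris -> Benign: 28.5% of Slowloris flows
# DoS -> DDoS: 2.9% of DoS flows
```

The arithmetic confirms the qualitative reading: Exploitation leakage is marginal (~2%), whereas Slowloris loses more than a quarter of its flows to Benign, which is the single dominant hard case at the flow level.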
5.3 Discussion, implications and future work
From a technical standpoint, the results clarify where a purely flow-based IDS can exploit the TRUST Lab dataset with confidence and where the dataset’s true decision boundaries appear. Phase-1 binary performance (Table 2 and Figures 11, 12) is strong under moderate Benign/Attack imbalance, with ROC–AUC = 0.9676 (i.e., the model ranks a random attack flow above a random benign flow with 96.76% probability). At t = 0.445, the system reaches ~95% attack recall with a ~7.5% Phase-1 false-alarm rate, which is reduced to ~2.2% after Phase-2 refinement; operationally, this corresponds to ~220 false alarms/h for an edge gateway processing 10,000 flows/h, a rate considered acceptable for validation. Figure 11 exhibits two well-separated probability modes near 0 and 1, with a narrow overlap band around the operating threshold. Flows in this band correspond to low-intensity, medium-duration traffic that can plausibly arise from slow legitimate services or low-profile attacks; this region marks an inherent limit of header-only detection that cannot be resolved without payload inspection or contextual sources (e.g., user behavior profiles, application-layer logs).
At the multiclass level, Phase-2 (XGBoost) provides additional evidence of campaign consistency and label quality. The post-Phase-2 distribution (Figure 13) confirms that all families have substantial support (no marginal classes), enabling stable per-class metrics. Table 3 and Figure 14 show that nine families achieve F1 ≥ 0.99 (API, C2Beac, DDoS, DNS, Evasion, Exfiltration, MitM, Portscan, TLS/SSL), reflecting consistent statistical signatures (e.g., floods with extreme rates, scans with unilateral patterns, C2 beacons with regular periodicity, DNS attacks with distinctive size anomalies). These families form a practical benchmark for estimating the upper bound of flow-based IDS performance on the TRUST Lab dataset. Conversely, the intermediate and hard classes identify genuinely ambiguous regions rather than construction artefacts: Exploitation errors toward Webbased and Benign indicate that some exploitation attempts produce HTTP traffic statistically close to legitimate requests or other application-layer abuse when only headers/timings are observed; and DoS/Slowloris leakage into Benign and into each other reflects that the boundary between “slow legitimate connection” and “low-and-slow attack” is weak in aggregate flow metrics (semi-idle resource-exhaustion attempts can resemble real user sessions at low request rates). The measured reduction from ~7.5% to ~2.2% false alarms across stages further supports the TRUST Lab dataset as a suitable benchmark for hierarchical detection (recall-oriented stage → refinement and family separation) and for studying thresholding, calibration, and multi-stage architectures in IoT/edge settings.
Beyond performance interpretation, the TRUST Lab dataset can be positioned against established benchmarks. Table 4 compares TRUST Lab with four widely adopted datasets: CICIDS2017 (enterprise-focused), Edge-IIoTset, BoT-IoT, and the recent CICIoT2023 (IoT/edge-focused). Based on this expanded comparison, the TRUST Lab dataset represents a clear methodological improvement over existing resources due to three explicit distinctions:
Zero Label Ambiguity (Perfect Ground Truth): It is the only dataset among these benchmarks that strictly guarantees zero temporal overlap between benign and malicious traffic through a single-class session policy, completely eliminating the label noise that corrupts feature extraction in mixed-window datasets.
Balanced Operational Priors: Unlike datasets such as BoT-IoT or CICIoT2023, which are extremely attack-dominated, TRUST Lab provides a realistic, moderately benign-heavy baseline (1.3:1) without enforcing hidden dataset-level synthetic rebalancing.
Modern API Threat Coverage and Traceable Baseline: It systematically covers modern programmable interfaces (REST, GraphQL, SOAP)—bridging a critical gap ignored by pure IoT corpora—while strictly adhering to the standard 80-feature CICFlowMeter schema, enabling fairer cross-dataset comparisons and direct model transfer.
Table 4
| Feature | TRUST Lab dataset | CICIDS2017 | Edge-IIoTset | BoT-IoT | CICIoT2023 |
|---|---|---|---|---|---|
| Total bi-flows | ~4.6 M | ~2.8 M | ~13.9 M | ~73.3 M | ~33.9 M |
| Attack families | 15 | 14 | 14 | 4 | 33 |
| Feature extraction | CICFlowMeter (80) | CICFlowMeter (80) | Custom (61) | Argus (43) | CIC-derived (47) |
| Single-class sessions | Yes | No | No | No | No |
| Benign/Attack ratio | 1.3:1 | ~4:1 | ~5:1 | ~1:7687 | ~1:25 |
| IoT/edge protocols | HTTP/S, DNS, SSH, SMTP, APIs (REST/GraphQL/SOAP), SNMP, NTP, MySQL | HTTP/S, FTP, SSH, email | MQTT, CoAP, Modbus, HTTP | MQTT, HTTP, CoAP | MQTT, CoAP, AMQP, HTTP, DNS |
| Modern API attacks | Yes (REST, GraphQL, SOAP) | No | No | No | No |
| Low-and-slow attacks | Yes (Slowloris, C2/beaconing, exfiltration) | Limited | Limited | Limited | Limited |
| Label ambiguity | None (single-class) | High (temporal overlap) | Moderate | Moderate | High (temporal overlap) |
| Reproducibility | Full (versioned scripts, seeds) | Partial | Partial | Partial | Partial |
| Dataset size (CSV) | ~1.42 GB | ~6.5 GB | ~28 GB | ~16.7 GB | ~33.8 GB |
Comparison of the TRUST Lab dataset against widely adopted flow-based and IoT/edge benchmarks.
From an innovation and technology-transfer perspective, datasets of this kind behave as enabling infrastructure: they lower experimentation costs, standardize evaluation, and accelerate iteration across research and industry. General data stewardship principles such as FAIR (Findable, Accessible, Interoperable, Reusable) frame why well-packaged datasets and reproducible workflows increase downstream value by reducing friction in discovery, reuse, and automation across tooling (Wilkinson et al., 2016). In parallel, empirical evidence shows that openly available datasets can increase reuse and visibility (including measurable citation advantages for studies that share data), reinforcing the incentive alignment for maintaining dataset quality and traceability over time (Piwowar and Vision, 2013).
From a business/innovation lens, open and well-documented data assets also function as a catalyst for new products, benchmarking services, and security validation pipelines, because they allow companies and researchers to compare methods under shared assumptions and to quantify trade-offs (e.g., recall vs. false alarms) with fewer hidden variables. Policy and economics studies on open data (Open Data for Economic Growth, 2014; Open Data: Unlocking Innovation and Performance With Liquid Information, 2026) repeatedly emphasize that value is unlocked when data can be recombined, reused, and integrated into decision workflows—conditions that depend strongly on modularity, documentation, and reproducibility rather than on raw scale alone. For security and AI governance specifically, guidance such as the National Institute of Standards and Technology (NIST) AI Risk Management Framework (Tabassi, 2023) highlights transparency and documentation as prerequisites for trustworthy deployment; in IDS contexts, this translates into dataset designs that support auditability of labels, splits, and operating thresholds.
Finally, the scope of the TRUST Lab dataset should be interpreted explicitly. It is a flow-based dataset for network-level IDS evaluation, not a comprehensive security testbed: it does not cover physical-layer attacks (side-channel, power analysis, hardware tampering), firmware/supply-chain threats (malicious updates, compromised provisioning), advanced persistent threats (multi-stage adaptive campaigns), zero-day exploits, or emerging IoT protocols (BACnet, Zigbee, Z-Wave, LoRaWAN), as these require different instrumentation (e.g., system calls, memory dumps, control-flow integrity) beyond header-only traffic analysis. The controlled laboratory setup with synthetic generators and scripted attacks ensures reproducibility and label integrity, but limits realism in three specific ways: benign traffic may underrepresent certain legitimate behaviors (intermittent connectivity, mobile handoffs, application updates), the testbed runs within a single IPv4 /24 network without WAN latency, packet loss, or congestion, and scripted attacks follow deterministic parameters whereas real adversaries adapt timing and combine vectors. The TRUST Lab dataset was captured over ~6 weeks in 2025, so models trained on it may degrade under temporal drift as tools, defenses, and firmware evolve; the dataset is a stable baseline for controlled comparison, not a guarantee of long-term robustness. In addition, the 15 families are broad categories (e.g., “Webbased” includes SQLi, XSS, LFI, CSRF), and flow-level features are insufficient for finer sub-family discrimination (e.g., union-based vs. blind SQLi), which would require payload features or application logs.
Several extensions would further enhance utility. First, expanding protocol coverage to include MQTT, CoAP, Modbus/TCP, BACnet, and other automation protocols would enable systematic analysis of domain shift across home, industrial, and smart-city networks and help assess whether the current 80-feature CICFlowMeter schema remains sufficient or must be complemented with application-layer attributes (e.g., MQTT topic patterns, CoAP resource paths). Second, defining standardized reference splits would support diverse evaluation scenarios: campaign-based splits for temporal drift and out-of-distribution generalization, in-domain/out-of-domain configurations for cross-domain transfer (e.g., train home-IoT, test industrial-IoT), and extreme-imbalance regimes emulating rare attacks (e.g., 100:1 Benign/Attack). Third, the clean labels and modular structure support unsupervised/self-supervised research: flow-only anomaly detection (autoencoders, one-class SVMs, isolation forests), contrastive pre-training (SimCLR, MoCo), and next-flow prediction framed as sequence modeling [transformers, Long Short-Term Memory (LSTM)]. Finally, the binary→multiclass pipeline can be extended to multi-stage IDS benchmarking, including selective rejection (abstention under low confidence), cost-sensitive learning (family-dependent weighting, e.g., prioritizing C2/beaconing over Portscan), and federated learning (distributed IDS across edge nodes).
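As an illustration of the extreme-imbalance regime mentioned above, such an evaluation set can be derived from the existing files by subsampling attacks against the full benign population. The sketch below is a minimal stdlib illustration; the record layout and function name are hypothetical:

```python
import random

def make_imbalanced_eval(benign, attacks, ratio=100, seed=7):
    """Emulate a rare-attack regime: keep all benign flows and
    subsample attacks to one per `ratio` benign flows."""
    rng = random.Random(seed)
    n_attack = max(1, len(benign) // ratio)
    return benign + rng.sample(attacks, min(n_attack, len(attacks)))

# Toy example with labeled placeholder records
benign = [("flow", "Benign")] * 10_000
attacks = [("flow", "DoS")] * 3_000
eval_set = make_imbalanced_eval(benign, attacks)
n_mal = sum(1 for _, y in eval_set if y != "Benign")
print(len(eval_set), n_mal)  # → 10100 100
```

Because the dataset's single-class CSV files keep families physically separate, such rebalanced or campaign-restricted variants can be built without any relabeling.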
A direct cross-dataset transfer experiment is outside the scope of the present dataset paper and is therefore left for future work. Nevertheless, the TRUST Lab dataset was explicitly designed to facilitate such studies through its homogeneous CICFlowMeter schema, modular single-class organization, and clean family labels. Fair transfer evaluation against other public resources requires additional label harmonization and split control, since differences in attack taxonomies, benign priors, and collection protocols can otherwise confound the effect of domain shift itself.
6 Conclusion
The TRUST Lab dataset is a CICFlowMeter-based flow dataset derived from realistic IoT/edge traffic and scripted attack campaigns, released as 16 single-class CSV files (Benign plus 15 attack families). The single-class capture policy preserves label integrity by explicitly preventing label ambiguity. In operational terms, label ambiguity occurs when concurrent benign and malicious packets are aggregated by feature-extraction tools (such as CICFlowMeter) into a single bidirectional flow. Because this statistically mixed flow is forced to receive a single, deterministic binary label, the resulting ground truth becomes corrupted, causing machine-learning models to learn blurred decision boundaries. By eliminating this temporal overlap between benign and malicious activity within the same session, our dataset ensures unambiguous labeling while keeping preprocessing minimal to avoid artefacts and maintain temporal traceability. Exploratory characterization confirms a moderate Benign/Attack imbalance with sufficient support across all families, and shows that separability arises from complementary signals in volume, timing, directionality, and dispersion rather than from a single dominant statistical mode. ANOVA, PCA, and class-wise profiles confirm that multiple families retain distinctive signatures across the feature space. A representative two-stage reference pipeline (binary stacking with calibration followed by multiclass XGBoost) illustrates how the dataset supports evaluation of flow-based IDS under IoT/edge conditions. Results show strong binary detection under moderate imbalance and near-perfect performance for families with consistent statistical fingerprints, while explicitly exposing the intrinsically hard cases, especially application-layer and low-and-slow activity, where flow-level information can remain ambiguous.
Overall, the TRUST Lab dataset provides a realistic, diverse, and reproducible benchmark for algorithm comparison, operating-threshold and calibration studies, and hierarchical detection architectures, and it offers a stable baseline that can be extended with additional protocols and standardized splits for broader edge-oriented IDS research.
Statements
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://doi.org/10.82432/10317/21203.
Author contributions
AV: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing. IT: Investigation, Supervision, Validation, Writing – original draft, Writing – review & editing. M-DC: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work has been supported by grant PID2023-148214OB-C21 funded by MICIU/AEI/10.13039/501100011033 and is part of the project R&D&I Lab in cybersecurity, privacy, and secure communications (TRUST Lab), financed by European Union NextGeneration-EU, the Recovery Plan, Transformation and Resilience, through INCIBE.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that Generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Albulayhi, K., Smadi, A. A., Sheldon, F. T., and Abercrombie, R. K. (2021). IoT intrusion detection taxonomy, reference architecture, and analyses. Sensors 21:6432. doi: 10.3390/s21196432
2. Alsaedi, A., Moustafa, N., Tari, Z., Mahmood, A., and Anwar, A. (2020). TON_IoT telemetry dataset: a new generation dataset of IoT and IIoT for data-driven intrusion detection systems. IEEE Access 8, 165130–165150. doi: 10.1109/ACCESS.2020.3022862
3. Al-Sarem, M., Saeed, F., Alkhammash, E. H., and Alghamdi, N. S. (2021). An aggregated mutual information based feature selection with machine learning methods for enhancing IoT botnet attack detection. Sensors 22:185. doi: 10.3390/s22010185
4. Apejoye, O., Ajienka, N., He, J., and Ma, X. (2025). Critical review of network intrusion detection benchmark datasets for practical IoT security. Comput. Netw. Commun. 3, 182–208. doi: 10.37256/cnc.3220257228
5. Balaji, P., Babu, S., Ma, M., Fang, Z., Rahayu, S. B., Bivi, M. A., et al. (2025). Renovated random attribute-based fennec fox optimized deep learning framework in low-rate DoS attack detection in IoT. CMC 84, 5831–5858. doi: 10.32604/cmc.2025.065260
6. Batista, G. E. A. P. A., Prati, R. C., and Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Expl. Newsletter 6, 20–29. doi: 10.1145/1007730.1007735
7. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. doi: 10.1613/jair.953
8. Diro, A. A., and Chilamkurti, N. (2018). Distributed attack detection scheme using deep learning approach for internet of things. Futur. Gener. Comput. Syst. 82, 761–768. doi: 10.1016/j.future.2017.08.043
9. Essop, I., Ribeiro, J. C., Papaioannou, M., Zachos, G., Mantas, G., and Rodriguez, J. (2021). Generating datasets for anomaly-based intrusion detection systems in IoT and industrial IoT networks. Sensors 21:1528. doi: 10.3390/s21041528
10. Fernandez, A., Garcia, S., Herrera, F., and Chawla, N. V. (2018). SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905. doi: 10.1613/jair.1.11192
11. Ferrag, M. A., Friha, O., Hamouda, D., Maglaras, L., and Janicke, H. (2022). Edge-IIoTset: a new comprehensive realistic cyber security dataset of IoT and IIoT applications for centralized and federated learning. IEEE Access 10, 40281–40306. doi: 10.1109/ACCESS.2022.3165809
12. García, S., Parmisano, A., and Erquiaga, M. J. (2020). IoT-23: a labeled dataset with malicious and benign IoT network traffic (version 1.0.0) [data set]. Zenodo. doi: 10.5281/zenodo.4743746
13. Gyamfi, E., and Jurcut, A. (2022). Intrusion detection in internet of things systems: a review on design approaches leveraging multi-access edge computing, machine learning, and datasets. Sensors 22:3744. doi: 10.3390/s22103744
14. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., and Bing, G. (2017). Learning from class-imbalanced data: review of methods and applications. Expert Syst. Appl. 73, 220–239. doi: 10.1016/j.eswa.2016.12.035
15. Ilango, H. S., Ma, M., and Su, R. (2022). A FeedForward–convolutional neural network to detect low-rate DoS in IoT. Eng. Appl. Artif. Intell. 114:105059. doi: 10.1016/j.engappai.2022.105059
16. Koroniotis, N., Moustafa, N., Sitnikova, E., and Turnbull, B. (2019). Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-IoT dataset. Futur. Gener. Comput. Syst. 100, 779–796. doi: 10.1016/j.future.2019.05.041
17. Koppula, M., and Leo Joseph, L. M. I. (2025). “A real-world dataset “IDSIoT2024” for machine learning/deep learning based cyber attack detection system for IoT architecture,” in 2025 3rd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT) (IEEE), 1–6.
18. Leevy, J. L., and Khoshgoftaar, T. M. (2020). A survey and analysis of intrusion detection models based on CSE-CIC-IDS2018 big data. J. Big Data 7:104. doi: 10.1186/s40537-020-00382-x
19. Meidan, Y., Bohadana, M., Mathov, Y., Mirsky, Y., Shabtai, A., Breitenbacher, D., et al. (2018). N-BaIoT—network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Comput. 17, 12–22. doi: 10.1109/MPRV.2018.03367731
20. Moustafa, N., and Slay, J. (2015). “UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set),” in 2015 Military Communications and Information Systems Conference (MilCIS) (IEEE), 1–6.
21. Neto, E. C. P., Dadkhah, S., Ferreira, R., Zohourian, A., Lu, R., and Ghorbani, A. A. (2023). CICIoT2023: a real-time dataset and benchmark for large-scale attacks in IoT environment. Sensors 23:5941. doi: 10.3390/s23135941
22. Open Data for Economic Growth (2014). Available online at: https://documents1.worldbank.org/curated/en/131621468154792082/pdf/896060REVISED000for0Economic0Growth.pdf (accessed February 3, 2026).
23. Open Data: Unlocking Innovation and Performance With Liquid Information (2026). Available online at: https://www.mckinsey.com/~/media/mckinsey/business%20functions/mckinsey%20digital/our%20insights/open%20data%20unlocking%20innovation%20and%20performance%20with%20liquid%20information/mgi_open_data_fullreport_oct2013.pdf (accessed February 3, 2026).
24. Pekar, A., and Jozsa, R. (2024). Evaluating ML-based anomaly detection across datasets of varied integrity: a case study. Comput. Netw. 251:110617. doi: 10.1016/j.comnet.2024.110617
25. Piwowar, H. A., and Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ 1:e175. doi: 10.7717/peerj.175
26. Qaddos, A., Yaseen, M. U., Al-Shamayleh, A. S., Imran, M., Akhunzada, A., and Alharthi, S. Z. (2024). A novel intrusion detection framework for optimizing IoT security. Sci. Rep. 14:21789. doi: 10.1038/s41598-024-72049-z
27. Salem, A. H., Azzam, S. M., Emam, O. E., and Abohany, A. A. (2024). Advancing cybersecurity: a comprehensive review of AI-driven detection techniques. J. Big Data 11:105. doi: 10.1186/s40537-024-00957-y
28. Sarhan, M., Layeghy, S., and Portmann, M. (2022a). Evaluating standard feature sets towards increased generalisability and explainability of ML-based network intrusion detection. Big Data Res. 30:100359. doi: 10.1016/j.bdr.2022.100359
29. Sarhan, M., Layeghy, S., and Portmann, M. (2022b). Towards a standard feature set for network intrusion detection system datasets. Mobile Netw. Appl. 27, 357–370. doi: 10.1007/s11036-021-01843-0
30. Sharafaldin, I., Habibi Lashkari, A., and Ghorbani, A. A. (2018). “Toward generating a new intrusion detection dataset and intrusion traffic characterization,” in Proceedings of the 4th International Conference on Information Systems Security and Privacy (SCITEPRESS - Science and Technology Publications), 108–116.
31. Tabassi, E. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). Gaithersburg, MD: National Institute of Standards and Technology (U.S.).
32. Tareq, I., Elbagoury, B. M., El-Regaily, S., and El-Horbaty, E.-S. M. (2022). Analysis of ToN-IoT, UNW-NB15, and edge-IIoT datasets using DL in cybersecurity for IoT. Appl. Sci. 12:9572. doi: 10.3390/app12199572
33. Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A. (2009). “A detailed analysis of the KDD CUP 99 data set,” in 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications (IEEE), 1–6.
34. Ullah, I., and Mahmoud, Q. H. (2020). A Scheme for Generating a Dataset for Anomalous Activity Detection in IoT Networks, 508–520. Gaithersburg, MD: National Institute of Standards and Technology (U.S.).
35. Villafranca, A., and Cano, M.-D. (2025). A lightweight edge-DL intrusion detection system for IoT sustainable smart-agriculture. IoT 34:101818. doi: 10.1016/j.iot.2025.101818
36. Wilkinson, M. D., Dumontier, M., Aalbersberg, I., Appleton, G., Axton, M., Baak, A., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3:160018. doi: 10.1038/sdata.2016.18
37. Zuech, R., Khoshgoftaar, T. M., and Wald, R. (2015). Intrusion detection and big heterogeneous data: a survey. J. Big Data 2:3. doi: 10.1186/s40537-015-0013-4
Summary
Keywords
CICFlowMeter, cybersecurity, dataset, edge computing, internet of things, intrusion detection system
Citation
Villafranca A, Tasic I and Cano M-D (2026) TRUSTLab dataset: a real-world CICFlowMeter dataset for IoT/edge intrusion detection. Front. Comput. Sci. 8:1803271. doi: 10.3389/fcomp.2026.1803271
Received
03 February 2026
Revised
18 April 2026
Accepted
21 April 2026
Published
05 May 2026
Volume
8 - 2026
Edited by
Elisa Rojas, University of Alcalá, Spain
Reviewed by
Andy Reed, The Open University, United Kingdom
Sri Hari Nallamala, Vasireddy Venkatadri Institute of Technology, India
Copyright
© 2026 Villafranca, Tasic and Cano.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Maria-Dolores Cano, mdolores.cano@upct.es