<?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0">
      <channel xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <title>Frontiers in Big Data | New and Recent Articles</title>
        <link>https://www.frontiersin.org/journals/big-data</link>
        <description>RSS Feed for Frontiers in Big Data | New and Recent Articles</description>
        <language>en-us</language>
        <generator>Frontiers Feed Generator,version:1</generator>
        <pubDate>2026-05-31T19:07:54.78+00:00</pubDate>
        <ttl>60</ttl>
        <item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1813265</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1813265</link>
        <title><![CDATA[Interpretable intrusion detection for IoT: a CNN-BiLSTM permutation importance framework for deep feature selection]]></title>
        <pubdate>2026-05-22T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Ibrahim Al-Shibly</author><author>Llorenç Burgas</author><author>Joaquim Massana</author>
        <description><![CDATA[Industrial intrusion detection systems (IDS) in Industrial Internet of Things (IIoT) environments have to address the problem of handling multi-feature temporally correlated network traffic and dynamic changes in attack patterns. Traditional filter-based feature selection methods, like Mutual Information (MI), only consider individual feature performance and may not be effective in dealing with non-linear feature dependencies. This may degrade detection performance, especially in class-imbalanced problems. To mitigate such challenges, this paper proposes a deep feature selection (DFS) framework that utilizes a hybrid Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) model. The proposed framework assesses the importance of native features using permutation importance. In the proposed framework, the CNN model detects local features in the data, whereas the BiLSTM model detects bidirectional temporal features in the data. The importance of each feature is computed by measuring the degradation in model performance when time-aware perturbations are applied to that feature. The most relevant features are then used to train lightweight traditional machine learning models like decision tree, K-nearest neighbor (KNN), logistic regression, naïve Bayes, and random forest. This makes the approach easy to deploy in resource-constrained IIoT environments. The approach is tested on the CIC IIoT 2025 dataset. Experimental results show that the CNN-BiLSTM DFS framework improves recall and F1-score compared to other feature selection approaches like MI, especially in imbalanced settings. Decoupling offline feature selection from edge-side inference provides a balance between detection accuracy, robustness, and deployability in real-world IIoT settings.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1829960</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1829960</link>
        <title><![CDATA[KATENA: a verifiable governance architecture for encrypted cloud storage systems]]></title>
        <pubdate>2026-05-15T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Jesús F. Rodríguez-Aragón</author><author>Carolina Zato</author><author>Francisco Pinto-Santos</author><author>Lorena Sánchez-Pravos</author>
        <description><![CDATA[Modern data-intensive infrastructures increasingly rely on cloud storage and client-controlled encryption to protect the confidentiality of outsourced information. However, while encryption prevents providers from accessing plaintext data, governance operations such as sharing, revocation, and policy updates typically remain opaque to users and auditors. This creates a structural gap between strong data confidentiality and verifiable governance in cloud environments that manage large volumes of sensitive information. This paper introduces KATENA (Key Architecture for Trustworthy Encrypted Networked Archives), an architectural model that enables client-verifiable governance in encrypted cloud storage systems. The proposed approach combines hierarchical key orchestration, transparency-based governance logging, and cryptographically verifiable governance artifacts so that clients can independently validate governance events without relying on provider-side trust. By integrating accountability mechanisms directly into encrypted storage architectures, the work provides a governance-by-design framework that bridges client-controlled encryption with verifiable data governance in modern data-intensive cloud systems.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1762571</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1762571</link>
        <title><![CDATA[Definitional ambiguity in cognitive warfare: a critical and systematic conceptual review through ideal-type analysis]]></title>
        <pubdate>2026-05-15T00:00:00Z</pubdate>
        <category>Systematic Review</category>
        <author>Per-Erik Nilsson</author><author>Andreas Haga</author><author>Kristina Hellström</author>
        <description><![CDATA[Cognitive warfare is a relatively new concept in both military and academic discourse. The article's purpose is to advance conceptual clarity regarding cognitive warfare and to support future policy-oriented and academic research that strengthens the field's conceptual and methodological foundations, understood here as the broader domain of communication and defense studies concerned with informational and cognitive forms of contestation. This article examines how the notion is conceptualized within the emerging body of research, drawing on a systematic literature review. With support from LLM-assisted analysis, the study employs an exploratory methodology to identify both conceptual commonalities and points of divergence. The review indicates that cognitive warfare remains an underdeveloped research field, characterized by broad assumptions and limited scientific rigor. While the concept may represent a reframing of long-standing practices, it may also serve a political function by drawing renewed attention to forms of influence and conflict that have been overshadowed in recent decades. The article concludes by outlining avenues for future interdisciplinary research, emphasizing the need for conceptual clarity, empirical operationalization, and a more nuanced understanding of how adversaries themselves articulate and employ cognitive warfare.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1769948</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1769948</link>
        <title><![CDATA[Quantifying energy and accuracy trade-offs of federated learning on wearable health devices]]></title>
        <pubdate>2026-05-15T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Rupaak S.</author><author>Ganesh Khekare</author><author>Yash Kumar</author><author>Gaurav Soni</author>
        <description><![CDATA[The rapid development of wearable health tools has made it possible to continuously monitor physiological conditions for preventive care. However, stringent privacy laws, including HIPAA and GDPR, require decentralized methods such as federated learning (FL) to safeguard personal patient information. Empirical profiling in this paper finds that typical FL implementations are plagued by a serious performance trilemma: a naive federated model attains 35.3% energy savings (3.84 vs. 5.93 kJ for the centralized model), but at the cost of an accuracy penalty of 13.87 percentage points (84.94% vs. 98.81% for the centralized model). This shortfall is largely due to the on-device computational load of 4.24 MFLOPs per training sample, resulting in a “straggler” bottleneck that increases the total training duration to 1,066.26 s, almost 70 times longer than centralized training. To address this, the hybrid hierarchical federated split learning (H-FedSL) architecture strategically splits the neural network at a cut layer to divide the workload between the wearable and nearby edge servers. The methodology offloads the heavy, deep-layer computations to the edge server, leaves the shallow feature extraction on the device, and sends only privacy-preserving abstractions (the smashed data) rather than raw signals. The integration of asynchronous protocols helps manage device heterogeneity and resource-aware client selection, in pursuit of H-FedSL's aim of restoring the gold-standard accuracy of 98.81% while retaining the 35.3% energy savings of the federated model. This provides a technically and economically feasible pathway for deploying medical-grade AI on resource-constrained Internet of Medical Things (IoMT) devices.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1799073</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1799073</link>
        <title><![CDATA[The role of statistical methods and artificial intelligence in inventory management for manufacturing industries: a systematic literature review]]></title>
        <pubdate>2026-05-15T00:00:00Z</pubdate>
        <category>Systematic Review</category>
        <author>Arvia Dwi Royani</author><author>Mahfud Sholihin</author><author>Dewi Dewi</author><author>Novika Novika</author><author>Annisa Sorayya</author><author>Wahyu Nur Hanifah</author><author>Rizki Ramadhani Arif Trilana</author><author>Paolina Buton</author>
        <description><![CDATA[Inventory management is a critical business process that affects the operational efficiency and competitiveness of manufacturing companies. Inaccurate inventory decisions can result in significant financial losses for companies. Demand variability poses a challenge in determining inventory levels, requiring more sophisticated, flexible forecasting methods. This study was conducted to examine the roles of statistical methods and Artificial Intelligence (AI) in inventory decision-making in the manufacturing industry, analyze the conditions under which each method is suitable, and evaluate the potential of a hybrid approach integrating statistical methods and AI. This study uses the Systematic Literature Review method with the PRISMA 2020 framework to ensure research transparency and accuracy. This study identifies articles from reputable databases indexed in Scopus. The findings show a significant shift in inventory management research. In the last decade, AI technology has dominated the literature at 62.5%, while statistical methods account for 25%, and hybrid methods have begun to emerge but remain limited to 12.5%. Based on the review of selected papers, statistical methods have proven to remain effective for consistent historical data and stable demand patterns. Conversely, in dynamic operational environments with large-scale data and complex nonlinear patterns, AI technology is superior. This study also found that the hybrid approach has great potential to balance accuracy, interpretability, and decision support, although the relevant literature remains limited. The implementation of technology in the manufacturing industry faces several obstacles, including limited data quality, a skills gap in technology, and the black-box nature of complex AI. 
This review provides a systematic and critical synthesis of methodological patterns and operational fit in the use of statistical, AI, and hybrid methods for manufacturing inventory management. Future research is recommended to focus on the development of interpretable AI, modular hybrid frameworks, and the use of real industry data to ensure that academic innovations can be applied in the manufacturing industry.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1817120</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1817120</link>
        <title><![CDATA[Cheatomaly: weakly supervised video anomaly ranking for exam cheating detection using vision transformers]]></title>
        <pubdate>2026-05-12T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>El Mehdi Alaoui Mrani</author><author>Anas Bouayad</author><author>Khalid Fardousse</author>
        <description><![CDATA[Detecting cheating in classroom examinations is challenging because suspicious behaviors are often subtle, temporally sparse, and context-dependent. To address the lack of dedicated benchmarks for this setting, we introduce Cheatomaly, a curated video dataset assembled from publicly available classroom examination material and annotated to support weakly supervised anomaly detection. We formulate cheating detection as a weakly supervised video anomaly ranking task using Multiple Instance Learning (MIL) with Vision Transformer features. Videos are divided into temporal segments, and segment-level representations are built using mean pooling and a Mean, Standard Deviation, and Temporal Difference (MSD) formulation. A margin-based ranking objective is used to prioritize anomalous videos and suspicious temporal segments using only video-level labels during training. Experimental results on Cheatomaly show strong video-level discrimination and meaningful frame-level localization across repeated runs. Ablation, baseline, statistical, and sensitivity analyses indicate that temporal aggregation affects the trade-off between ranking and localization but does not produce consistent statistically significant gains. Overall, Cheatomaly provides a realistic benchmark for studying subtle cheating-related anomalies in classroom examinations, and the results highlight that the main challenge lies in modeling context-dependent temporal behavior rather than feature aggregation alone.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1764468</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1764468</link>
        <title><![CDATA[Fairness across domains: a unified fairness-aware framework for domain generalization and unsupervised adaptation]]></title>
        <pubdate>2026-05-11T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Kai Jiang</author><author>Chen Zhao</author><author>Haoliang Wang</author><author>Xintao Wu</author><author>Latifur Khan</author><author>Christan Grant</author><author>Feng Chen</author>
        <description><![CDATA[Fairness in machine learning remains a critical challenge, particularly in the presence of domain shift. We propose a unified fairness-aware framework for both domain generalization (DG) and unsupervised domain adaptation (UDA), which jointly addresses domain shift and sensitive-attribute bias through disentangled representation learning. The framework disentangles content, style, and sensitive factors, and uses them to generate augmented samples that reduce bias while maintaining predictive reliability. Extensive experiments on four datasets demonstrate that the proposed method achieves state-of-the-art performance in both DG and UDA settings. Moreover, it yields a stronger balance between classification accuracy and fairness across diverse domains and sensitive subgroups. By incorporating unlabeled target-domain data, our framework extends prior fairness-aware approaches that were limited to DG and provides new insight into fairness-aware learning under unsupervised adaptation. Overall, this work offers a practical step toward scalable and robust fairness-aware learning in multi-domain environments.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1821612</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1821612</link>
        <title><![CDATA[A longitudinal multimodal big data infrastructure for precision poultry monitoring]]></title>
        <pubdate>2026-05-11T00:00:00Z</pubdate>
        <category>Methods</category>
        <author>Daniel Essien</author><author>Yashan Dhaliwal</author><author>Suresh Neethirajan</author>
        <description><![CDATA[Livestock systems are increasingly instrumented with heterogeneous sensors, yet the resulting data remain fragmented, short-lived, and rarely documented as integrated infrastructures. This gap limits the development of robust multimodal artificial intelligence under real production conditions. Here we present a longitudinal multimodal data infrastructure for poultry monitoring, spanning 22 consecutive weeks across five commercial-style barns. The dataset combines continuous RGB video (1080p, 30 fps), continuous audio (48 kHz), periodic radiometric thermal imaging, and twice-daily environmental measurements, yielding 10.2 terabytes of temporally heterogeneous data. Rather than focusing on a specific predictive task, the study addresses the underlying data-engineering challenge: how to acquire, synchronize, store, and preprocess multimodal streams at production scale. We detail a reproducible system architecture for distributed sensing, local buffering, secure transfer, and cloud-based organization, together with standardized preprocessing pipelines for illumination correction, acoustic denoising, and radiometric temperature extraction. Temporal alignment is achieved through timestamp-based normalization across asynchronous modalities, with explicit characterization of alignment granularity and missing data under real-world constraints. This work positions multimodal livestock sensing as a data-systems problem. The resulting dataset supports longitudinal analysis, cross-modal querying, and the development and evaluation of machine learning and multimodal fusion approaches at appropriate temporal scales. By releasing both data and workflows, we provide a transparent and extensible foundation for building and evaluating AI systems in precision agriculture.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1768571</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1768571</link>
        <title><![CDATA[Optimizing large-scale graph database ingestion through edge value ranking: a proposed framework]]></title>
        <pubdate>2026-05-11T00:00:00Z</pubdate>
        <category>Hypothesis and Theory</category>
        <author>Phanindra Reddy Madduru</author><author>Bijo Thomas</author>
        <description><![CDATA[This paper proposes a preprocessing framework for optimizing large-scale graph database ingestion through intelligent edge filtering based on value ranking. We combine adapted PageRank algorithms with business-specific metrics and edge type importance to evaluate and rank edges, enabling selective retention of high-value relationships. The framework introduces three PageRank variants (maximum weight normalization, weighted average, and log-based normalization) with type-specific business value normalization to handle heterogeneous graphs. Current graph database ingestion approaches struggle with scale: loading 6.2 TB of data (38 billion objects) requires over 3 weeks, forcing organizations to limit historical data retention. Our approach addresses this through preprocessing-stage filtering before database ingestion. While requiring experimental validation, preliminary analysis suggests potential for 40%–80% data volume reduction depending on graph characteristics, with corresponding improvements in loading efficiency and storage costs. The paper details the theoretical framework, computational complexity analysis, formal property preservation guarantees, and comprehensive validation methodology. This work represents a novel direction in graph database optimization: value-based preprocessing rather than runtime query optimization.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1786859</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1786859</link>
        <title><![CDATA[Detection and classification of lung cancer using sequential hybridization of CNN and RNN type architectures]]></title>
        <pubdate>2026-05-08T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Maheswari Vutukuri</author><author>Parveen Sultana Habibullah</author>
        <description><![CDATA[Introduction: Early and accurate lung cancer detection from computed tomography (CT) images remains a challenging task because of the complex morphology of lung nodules, class imbalance, variation in image quality, and the risk of overfitting in deep learning models. Conventional manual interpretation is time-consuming and may be affected by inter-observer variability. Therefore, an automated and reliable CT-based classification framework is required to support early identification of benign, malignant, and normal lung conditions. Methods: This study proposes a sequential hybrid deep learning framework that integrates convolutional and recurrent neural network components for multiclass lung cancer classification. A dataset of 1,600 CT-scan images collected from multiple hospital data repositories across Bengaluru was used and divided in an approximate 70:30 ratio for training and validation. The preprocessing pipeline includes contrast enhancement using Contrast Limited Adaptive Histogram Equalization (CLAHE), morphological operations for lung segmentation, and nodule-focused masking to isolate diagnostically relevant lung regions. Data augmentation and transfer learning were applied to improve model generalization and reduce overfitting. DenseNet201 was used for feature extraction, while a bidirectional gated recurrent unit (BiGRU) module was incorporated for sequential representation learning. Hyperparameter optimization and early stopping were used to improve training stability and classification performance. Results: The proposed DenseNet201-BiGRU sequential hybrid architecture achieved an overall accuracy of 95.8%. The class-wise accuracies were 97.33% for benign cases, 93.33% for malignant cases, and 96.67% for normal cases. Precision, recall, and F1-score values further demonstrated that the model maintained reliable classification performance across diverse and imbalanced CT image classes. Discussion: The results indicate that sequential hybridization of DenseNet201-based feature extraction with BiGRU-based representation learning provides an efficient, precise, and robust framework for CT-based lung cancer classification. The proposed method improves classification reliability by combining enhanced preprocessing, focused lung-region extraction, transfer learning, and recurrent modeling. However, further validation using larger, multi-center datasets and additional clinical testing is required before real-world deployment and broader diagnostic application.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1807184</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1807184</link>
        <title><![CDATA[Explainable gradient convolutional vector fuzzy pattern analysis based on ensemble model for facial expression recognition]]></title>
        <pubdate>2026-05-08T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Lakshmi Sarvani Videla</author><author>Babu Reddy Mukamalla</author>
        <description><![CDATA[Facial expression recognition using machine learning involves training algorithms to identify and categorize human emotions based on visual cues from facial features. Explainable AI (XAI) enhances this process by providing transparency into how these algorithms arrive at their predictions. While machine learning algorithms provide the capability to recognize facial expressions, explainable AI offers the crucial ability to understand and interpret these recognition processes, leading to more robust, fair, and trustworthy systems. The aim of this research is to propose a novel method in facial expression recognition using segmentation by an ensemble machine learning algorithm and explainable AI model. The input consists of facial expression images, which are first processed for noise removal and normalization. The processed images are then segmented using the Explainable Gradient Convolutional Vector Fuzzy Pattern Recognition (ExGrConVFuzPR) model. The proposed method was evaluated on the JAFFE, CK, and AFLW datasets. The model achieved promising results with an accuracy of 97%, precision of 96%, recall of 96%, F1-score of 97%, and RMSE of 0.043. These outcomes demonstrate that the suggested approach provides good performance along with improved interpretability.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1733733</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1733733</link>
        <title><![CDATA[Strategic cyber intelligence with advanced analytics in Latin America: a perspective]]></title>
        <pubdate>2026-05-01T00:00:00Z</pubdate>
        <category>Perspective</category>
        <author>Tamara Briones-Lascano</author><author>Vanessa Vergara-Lozano</author><author>Alex Miranda-Andrade</author>
        <description><![CDATA[Digital transformation in Latin America has widened the attack surface while exposing long-standing gaps in policy, capability, and data stewardship. In this Perspective, we argue that the region can move from reactive cybersecurity to strategic cyber intelligence by embedding advanced analytics into an intelligence cycle that connects multisource data, governed models, and operational playbooks with clear accountability. We synthesize the demonstrated technical gains, diagnose implementation constraints, and outline a near-term agenda that includes a regional maturity index, comparative outcome studies, and decision research on explainability and bias. Our position is practical: analytics amplifies a good strategy only when governance, trustworthy data, and skilled teams are in place. This Perspective contributes a strategic analytical framework that links advanced analytics, governance, and decision-making to strengthen cyber intelligence and digital resilience in Latin America.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1821270</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1821270</link>
        <title><![CDATA[Novel approach of encrypted network traffic classification using deep convolutional neural network with Artificial Bee Colony and Genetic Algorithm]]></title>
        <pubdate>2026-04-28T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Sujan Kumar Mohanty</author><author>Satyajit Rath</author><author>Satya Ranjan Sahu</author><author>Bikram Kumar Parida</author><author>Rakesh Chandra Balabantaray</author>
        <description><![CDATA[Encrypted network traffic makes successful, dynamic classification difficult. This paper presents a hybrid model, applied to the publicly available QUIC dataset, that classifies VPN and non-VPN encrypted traffic using a Deep Convolutional Neural Network (DCNN) and a Long Short-Term Memory (LSTM) network, optimized by the Artificial Bee Colony (ABC) algorithm and a Genetic Algorithm (GA). The method involves multi-stage processing: preprocessing, Min-Max normalization, and feature selection using correlation analysis, Fisher Score, and mutual information to obtain a small but meaningful feature set (Size, Batch Cache, Delta Previous Packet). The chosen features are converted to 2D tensors through a sliding time window of consecutive packets, which allows the spatio-temporal DCNN+LSTM architecture to represent intra- and inter-packet feature associations as well as intra- and inter-packet time dynamics. The limitations of single-method optimization are overcome with a dual metaheuristic strategy in which ABC performs global hyperparameter exploration and GA performs structural optimization. Class imbalance is reduced with weighted loss functions and stratified data splitting. The model achieves 99.66% accuracy, 0.994 ROC-AUC, 0.987 PR-AUC, and an MCC of 0.963, outperforming traditional classifiers (Decision Tree, Random Forest, SVM, KNN), individual deep-learning models (CNN, LSTM), and the image-based FlowPic method. Stratified cross-validation indicates consistent generalization (99.53% ± 0.09% mean accuracy), and an ablation study confirms the value of each component. The findings indicate that the presented framework can be applied to real-time monitoring of encrypted traffic on security-sensitive networks.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1768366</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1768366</link>
        <title><![CDATA[Beyond performance metrics: evaluating the unique value of generative AI in hybrid cybersecurity threat detection]]></title>
        <pubdate>2026-04-24T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Juan Antonio González-Ramos</author><author>Pablo Chamoso</author>
        <description><![CDATA[Introduction: This study examines the role of generative artificial intelligence (GenAI) in cybersecurity threat detection, focusing on its usefulness in workflows that support human decision-making. Methods: Experiments were performed on the BODMAS dataset (134,435 samples) and a smaller exploratory subset of UNSW-NB15. State-of-the-art machine learning (ML) classifiers were compared with a zero-shot large language model (LLM) using standard classification metrics, while also considering latency, cost, and hallucination risk. Results: ML classifiers consistently outperformed the LLM-based system on standard detection metrics. However, the LLM showed value in cases of ambiguity, where it could provide short plain-language explanations, organize alert-related context, and generate initial interpretations for instances that did not match learned classes. Discussion: GenAI is unlikely to replace ML-based detection methods, but it can provide useful interpretive support for ambiguous or unfamiliar alerts. A hybrid pipeline is therefore proposed, in which ML handles high-confidence and time-sensitive decisions, while the LLM is used selectively for low-confidence cases or when explanatory support is needed. Human oversight remains necessary to address hallucination risk and ensure reliability.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1761377</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1761377</link>
        <title><![CDATA[The designing of a transparent hybrid machine learning framework for water leak detection: a systematic review]]></title>
        <pubdate>2026-04-24T00:00:00Z</pubdate>
        <category>Systematic Review</category>
        <author>Chinemerem M. Anozie</author><author>Tite Tuyikeze</author><author>Ibidun C. Obagbuwa</author><author>Fezile Matsebula</author>
        <description><![CDATA[Introduction: Global water scarcity is increasingly exacerbated by substantial water losses, with approximately 30% of treated water lost annually due to leaks in aging Water Distribution Networks (WDNs). Addressing this challenge requires advanced and reliable leak detection mechanisms. This study investigates the design of a transparent hybrid machine learning framework aimed at improving the accuracy and effectiveness of water leak detection systems. Methods: A systematic literature review was conducted following PRISMA guidelines. A total of 27 relevant studies were analyzed, focusing on hybrid deep learning approaches that incorporate data fusion, mixed models, and ensemble techniques for leak detection in WDNs. Results: The findings indicate that hybrid and ensemble learning techniques are becoming more important in the identification of water leaks. Several studies reported exceptionally high performance, with some models achieving up to 99% balanced accuracy by leveraging multiple data modalities. These approaches demonstrate strong resilience and adaptability across varying operational conditions. Discussion: Despite their high performance, the complexity and “black-box” nature of hybrid models limit their practical deployment. The study highlights the importance of integrating Explainable Artificial Intelligence (XAI) techniques to enhance transparency, interpretability, and user trust. The review concludes that future intelligent leak management systems should combine high-performing hybrid models with XAI to develop efficient, interpretable, and trustworthy decision-support systems that support sustainable water resource management.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1772101</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1772101</link>
        <title><![CDATA[Generation of Kazakhstan's unified national testing variants using AI: a platform for automatic task creation with expert control]]></title>
        <pubdate>2026-04-17T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Bolatbek Abdrasilov</author><author>Talgat Niyazov</author><author>Lyazzat Shinetova</author><author>Shugyla Altybayeva</author><author>Kenzhekul Turalbayeva</author><author>David Orlov</author>
        <description><![CDATA[This study examines the use of artificial intelligence for Automatic Item Generation (AIG) in the context of Kazakhstan's Unified National Testing (UNT) and presents a human-in-the-loop platform for scalable, expert-controlled test development. The objective is to evaluate whether large language models (LLMs) can reliably generate high-quality, isomorphic mathematics test items in the Kazakh language while preserving psychometric and pedagogical requirements. A hybrid AI system combining a local and a cloud-based LLM was implemented to perform semantic deconstruction of prototype items and constrained isomorphic generation of new variants. The pipeline included structured prompt engineering, parallel generation, and automated symbolic validation using Python and SymPy, followed by double-blind expert review. A stratified sample of 120 UNT mathematics items served as prototypes, from which 200 AI-generated clones were produced and validated. Six qualified subject-matter experts conducted independent evaluations using standardized criteria. Inter-rater reliability reached a substantial level (Cohen's κ = 0.78). Results show that 97.5% of generated items were recommended for use after review, with 50.5% accepted without revision and 47.0% accepted after corrections. The most frequent revision needs involved difficulty calibration, wording clarity, and factual or curricular alignment. Expert interviews confirmed that AI generation significantly reduces development time but remains limited in higher-order cognitive item design and pedagogically grounded distractor construction, especially in a low-resource, morphologically complex language environment. The findings support a hybrid augmentation model in which AI accelerates large-scale item production while experts ensure linguistic, cultural, and psychometric validity. 
The proposed framework demonstrates practical potential for multilingual, high-stakes assessment systems and provides implementation guidelines for responsible AI integration in test development.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1744885</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1744885</link>
        <title><![CDATA[Evaluating the impact of OMOP-CDM on data quality insight generation in respiratory disease management]]></title>
        <pubdate>2026-04-10T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Brenda Mbouamba Yankam</author><author>Fankoua Tchaptchet Luc Baudoin</author><author>Pauline Andeso</author><author>François Anicet Onana Akoa</author><author>Jean Blaise Ebimbe</author><author>Miranda Barasa</author><author>Mbele Onana</author><author>Samuel Iddi</author><author>Agnes Kiragga</author><author>Bertrand Hugo Mbatchou Ngahane</author><author>Data Science Without Borders Project</author>
        <description><![CDATA[The increasing volume and heterogeneity of patient care data present significant challenges for comprehensive analysis and the generation of insights, particularly in specific areas such as respiratory diseases. Standardizing diverse health data is crucial for enabling large-scale observational research and ensuring data readiness. The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) provides a widely adopted standard for harmonizing such data. However, evaluating the quality of data transformed into the OMOP CDM format is a critical step before its use in research or clinical decision support. This study evaluates the impact of the OMOP CDM standardization process on generating data quality insights for a respiratory disease dataset. The source dataset was initially paper-based, converted to an electronic format, and translated from French into English. This historical dataset covers the years 2009–2023 and contains 108 variables and 2,154 records. The data underwent the standard Extract, Transform, and Load (ETL) process to convert into the OMOP CDM format. Following this transformation, the quality of the resulting OMOP CDM instance was assessed. We utilized the Data Quality Dashboard (DQD) to evaluate the quality of the OMOP CDM database before and after ETL verification. DQD performs validation checks on the data based on key data quality dimensions, including completeness, plausibility, and conformance. Overall, the assessment conducted 2,344 checks, of which 2,269 passed, and 75 failed, resulting in a corrected pass rate of 96% for the Respiratory Diseases Inpatients data before ETL verification. After ETL verification, the assessment conducted 2,374 checks, of which 2,356 passed, and 40 failed, resulting in a 100% corrected pass rate. Standardizing respiratory disease data using the OMOP CDM enabled a structured and transparent evaluation of data quality. 
Through the application of the DQD, this study demonstrated the utility of OMOP CDM in generating meaningful data quality insights. These findings highlight the model's potential to enhance data readiness and support evidence-based decision-making in respiratory disease management.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1811110</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1811110</link>
        <title><![CDATA[A reinforcement learning-guided interpretable method for postoperative sepsis prediction with Hilbert-Schmidt Independence Criterion]]></title>
        <pubdate>2026-04-07T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Kunhua Zhong</author><author>Han Chen</author><author>Qilong Sun</author><author>Peng Wang</author><author>Zhenbei Liu</author><author>Yuwen Chen</author>
        <description><![CDATA[Background: Sepsis is a major cause of postoperative morbidity and mortality, and early risk stratification from perioperative electronic health records (EHR) is a representative large-scale, high-dimensional data processing problem that requires models to be accurate, efficient, and clinically interpretable. However, many existing sepsis prediction methods operate as black boxes and rely on extensive temporal monitoring streams, which increases feature dimensionality and computation while limiting transparency. Methods: We propose a reinforcement learning-guided, interpretable feature engineering framework for postoperative sepsis prediction that targets scalable learning on heterogeneous perioperative data. Within an Actor-Critic formulation, feature selection is treated as an action: an Actor network produces a stochastic feature mask over preoperative static variables and intraoperative statistical summaries, while a Critic network performs downstream prediction using a self-attention-based classifier. To benchmark and stabilize learning, we introduce an auxiliary baseline model that incorporates intraoperative temporal signals extracted by a temporal convolutional network (TCN) and regularized using the Hilbert-Schmidt Independence Criterion (HSIC) to encourage non-redundant representations between statistical and temporal feature views. The Actor is optimized to achieve comparable predictive performance to the baseline while using a reduced feature set, improving computational efficiency and supporting instance-level interpretability. Results: Experiments on a real-world surgical cohort from Southwest Hospital (2014-2018) demonstrate that the proposed framework attains performance comparable to or better than competitive machine learning baselines while selecting fewer input features.
On this dataset, our method achieved perfect scores of 1.00 for F1-score, Sensitivity, and Specificity. Conclusion: The proposed method accurately predicts the occurrence of postoperative sepsis and provides effective instance-level post hoc explanations. These findings offer a novel perspective for postoperative sepsis prediction.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1814157</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1814157</link>
        <title><![CDATA[A disease potential-driven graph attention model for comorbidity risk prediction of hypertension]]></title>
        <pubdate>2026-04-02T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Leming Zhou</author><author>Hanshu Qin</author><author>Yanmei Yang</author><author>Gang Huang</author><author>Zhigang Liu</author>
        <description><![CDATA[Hypertension is associated with an increased risk of serious complications. However, current comorbidity risk prediction methods that rely solely on data-driven learning may produce clinically implausible associations and reduce model interpretability. In addition, how to capture the fused features of patients and identify the differences among them to facilitate risk prediction remains to be addressed. To overcome these challenges, we propose a Disease Potential-Driven Graph Attention (DP-GA) model for comorbidity risk prediction of hypertension, which rests on three ideas: (a) constructing a fusion mechanism for the correlation between patients' disease features and structural information, thus integrating feature attention and structural attention effectively; (b) introducing a similarity-difference balance mechanism to further identify the relationships among patients; and (c) designing a disease potential-driven attention mechanism to calculate the disease potential and construct masks, thus preserving the effective associations from high-risk patients to low-risk patients. Experimental results demonstrate that our proposed DP-GA model achieves a significant improvement in comorbidity risk prediction for patients with hypertension across three comorbidity datasets collected by the research group, compared with both the baseline and state-of-the-art peer methods. We also analyze the comorbidity network to predict the risk of hypertension comorbidity, thereby improving interpretability and early prediction of such comorbidities.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1594374</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1594374</link>
        <title><![CDATA[Toward robust social media sentiment for SMEs: a comparative study of dictionary-based and machine learning approaches with insights for hybrid methodologies]]></title>
        <pubdate>2026-04-01T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Heru Susanto</author><author>Aida Sari Omar</author><author>Alifya Kayla Shafa Susanto</author><author>Desi Setiana</author><author>Leu Fang-Yie</author><author>Junaid M. Shaikh</author><author>Asep Insani</author><author>Uus Khusni</author><author>Rachmat Hidayat</author><author>Akbari Indra Basuki</author><author>Iwan Setiawan</author>
        <description><![CDATA[Small and Medium-sized Enterprises (SMEs) increasingly rely on social media to engage customers, promote products, and enhance workplace collaboration. Customer opinions expressed through comments and posts on platforms such as Facebook and Instagram represent valuable insights, yet their informal and context-specific nature—often characterized by slang, misspellings, and bilingual usage—poses challenges for automated sentiment analysis. This study addresses this gap by comparatively evaluating dictionary-based and machine learning approaches to sentiment classification for SMEs' social media content. Data were collected from a diverse set of SMEs across multiple industries, with a substantial volume of customer comments extracted and pre-processed through tokenization, normalization, stop-word removal, and stemming. A customized dictionary was developed to account for local language variations, while Naïve Bayes and Support Vector Machine (SVM) models were employed as supervised classifiers. The findings indicate that dictionary-based methods, while simple and interpretable, struggle with accuracy when processing informal and localized language, whereas machine learning approaches deliver higher overall performance but require extensive preprocessing and tuning. Moreover, the study highlights the potential of hybrid frameworks that combine the interpretability of dictionary-based models with the adaptability of machine learning classifiers. This research contributes both practically and theoretically by (i) demonstrating the limitations of applying generic sentiment analysis tools in localized SME contexts, (ii) proposing a hybrid sentiment analysis framework tailored to SMEs, and (iii) offering empirical evidence to support digital transformation strategies for SMEs in resource-constrained environments. 
Ultimately, accurate sentiment analysis can enable SMEs to refine business strategies, strengthen customer engagement, and achieve sustainable growth in the digital economy.]]></description>
      </item>
      </channel>
    </rss>