<?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0">
      <channel xmlns:content="http://purl.org/rss/1.0/modules/content/">
        <title>Frontiers in Big Data | New and Recent Articles</title>
        <link>https://www.frontiersin.org/journals/big-data</link>
        <description>RSS Feed for Frontiers in Big Data | New and Recent Articles</description>
        <language>en-us</language>
        <generator>Frontiers Feed Generator,version:1</generator>
        <pubDate>Wed, 29 Apr 2026 09:01:41 GMT</pubDate>
        <ttl>60</ttl>
        <item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1821270</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1821270</link>
        <title><![CDATA[Novel approach of encrypted network traffic classification using deep convolutional neural network with Artificial Bee Colony and Genetic Algorithm]]></title>
        <pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Sujan Kumar Mohanty</author><author>Satyajit Rath</author><author>Satya Ranjan Sahu</author><author>Bikram Kumar Parida</author><author>Rakesh Chandra Balabantaray</author>
        <description><![CDATA[Encrypted network traffic is difficult to classify accurately and dynamically. This paper presents a hybrid model, evaluated on the publicly available QUIC dataset, for classifying VPN and non-VPN encrypted traffic based on a Deep Convolutional Neural Network (DCNN) and a Long Short-Term Memory (LSTM) network, optimized by the Artificial Bee Colony (ABC) and Genetic Algorithm (GA). The method involves multi-stage processing - preprocessing, Min-Max normalization, and feature selection with correlation analysis, Fisher Score, and mutual information - to obtain a small but meaningful feature set (Size, Batch Cache, Delta Previous Packet). The selected features are translated into 2D tensors through a sliding time window over consecutive packets, which allows the spatio-temporal DCNN+LSTM architecture to represent intra- and inter-packet feature associations as well as intra- and inter-packet temporal dynamics. The disadvantages of single-optimizer approaches are overcome using a dual metaheuristic optimization strategy, whereby ABC performs global hyperparameter exploration and GA performs structural optimization. Class imbalance is reduced with weighted loss functions and stratified data splitting. The model achieves 99.66% accuracy, 0.994 ROC-AUC, 0.987 PR-AUC, and an MCC of 0.963, exceeding traditional classifiers (Decision Tree, Random Forest, SVM, KNN), individual deep-learning models (CNN, LSTM), and the image-based FlowPic method. Stratified cross-validation demonstrates consistent generalization (99.53% ± 0.09% mean accuracy), and an ablation study confirms the contribution of each component. The findings show that the presented framework can be applied to real-time monitoring of encrypted traffic in security-sensitive networks.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1768366</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1768366</link>
        <title><![CDATA[Beyond performance metrics: evaluating the unique value of generative AI in hybrid cybersecurity threat detection]]></title>
        <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Juan Antonio González-Ramos</author><author>Pablo Chamoso</author>
        <description><![CDATA[Introduction: This study examines the role of generative artificial intelligence (GenAI) in cybersecurity threat detection, focusing on its usefulness in workflows that support human decision-making. Methods: Experiments were performed on the BODMAS dataset (134,435 samples) and a smaller exploratory subset of UNSW-NB15. State-of-the-art machine learning (ML) classifiers were compared with a zero-shot large language model (LLM) using standard classification metrics, while also considering latency, cost, and hallucination risk. Results: ML classifiers consistently outperformed the LLM-based system on standard detection metrics. However, the LLM showed value in cases of ambiguity, where it could provide short plain-language explanations, organize alert-related context, and generate initial interpretations for instances that did not match learned classes. Discussion: GenAI is unlikely to replace ML-based detection methods, but it can provide useful interpretive support for ambiguous or unfamiliar alerts. A hybrid pipeline is therefore proposed, in which ML handles high-confidence and time-sensitive decisions, while the LLM is used selectively for low-confidence cases or when explanatory support is needed. Human oversight remains necessary to address hallucination risk and ensure reliability.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1761377</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1761377</link>
        <title><![CDATA[The designing of a transparent hybrid machine learning framework for water leak detection: a systematic review]]></title>
        <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
        <category>Systematic Review</category>
        <author>Chinemerem M. Anozie</author><author>Tite Tuyikeze</author><author>Ibidun C. Obagbuwa</author><author>Fezile Matsebula</author>
        <description><![CDATA[Introduction: Global water scarcity is increasingly exacerbated by substantial water losses, with approximately 30% of treated water lost annually due to leaks in aging Water Distribution Networks (WDNs). Addressing this challenge requires advanced and reliable leak detection mechanisms. This study investigates the design of a transparent hybrid machine learning framework aimed at improving the accuracy and effectiveness of water leak detection systems. Methods: A systematic literature review was conducted following PRISMA guidelines. A total of 27 relevant studies were analyzed, focusing on hybrid deep learning approaches that incorporate data fusion, mixed models, and ensemble techniques for leak detection in WDNs. Results: The findings indicate that hybrid and ensemble learning techniques are becoming more important in the identification of water leaks. Several studies reported exceptionally high performance, with some models achieving up to 99% balanced accuracy by leveraging multiple data modalities. These approaches demonstrate strong resilience and adaptability across varying operational conditions. Discussion: Despite their high performance, the complexity and “black-box” nature of hybrid models limit their practical deployment. The study highlights the importance of integrating Explainable Artificial Intelligence (XAI) techniques to enhance transparency, interpretability, and user trust. The review concludes that future intelligent leak management systems should combine high-performing hybrid models with XAI to develop efficient, interpretable, and trustworthy decision-support systems that support sustainable water resource management.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1772101</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1772101</link>
        <title><![CDATA[Generation of Kazakhstan's unified national testing variants using AI: a platform for automatic task creation with expert control]]></title>
        <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Bolatbek Abdrasilov</author><author>Talgat Niyazov</author><author>Lyazzat Shinetova</author><author>Shugyla Altybayeva</author><author>Kenzhekul Turalbayeva</author><author>David Orlov</author>
        <description><![CDATA[This study examines the use of artificial intelligence for Automatic Item Generation (AIG) in the context of Kazakhstan's Unified National Testing (UNT) and presents a human-in-the-loop platform for scalable, expert-controlled test development. The objective is to evaluate whether large language models (LLMs) can reliably generate high-quality, isomorphic mathematics test items in the Kazakh language while preserving psychometric and pedagogical requirements. A hybrid AI system combining a local and a cloud-based LLM was implemented to perform semantic deconstruction of prototype items and constrained isomorphic generation of new variants. The pipeline included structured prompt engineering, parallel generation, and automated symbolic validation using Python and SymPy, followed by double-blind expert review. A stratified sample of 120 UNT mathematics items served as prototypes, from which 200 AI-generated clones were produced and validated. Six qualified subject-matter experts conducted independent evaluations using standardized criteria. Inter-rater reliability reached a substantial level (Cohen's κ = 0.78). Results show that 97.5% of generated items were recommended for use after review, with 50.5% accepted without revision and 47.0% accepted after corrections. The most frequent revision needs involved difficulty calibration, wording clarity, and factual or curricular alignment. Expert interviews confirmed that AI generation significantly reduces development time but remains limited in higher-order cognitive item design and pedagogically grounded distractor construction, especially in a low-resource, morphologically complex language environment. The findings support a hybrid augmentation model in which AI accelerates large-scale item production while experts ensure linguistic, cultural, and psychometric validity. 
The proposed framework demonstrates practical potential for multilingual, high-stakes assessment systems and provides implementation guidelines for responsible AI integration in test development.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1744885</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1744885</link>
        <title><![CDATA[Evaluating the impact of OMOP-CDM on data quality insight generation in respiratory disease management]]></title>
        <pubDate>Fri, 10 Apr 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Brenda Mbouamba Yankam</author><author>Fankoua Tchaptchet Luc Baudoin</author><author>Pauline Andeso</author><author>François Anicet Onana Akoa</author><author>Jean Blaise Ebimbe</author><author>Miranda Barasa</author><author>Mbele Onana</author><author>Samuel Iddi</author><author>Agnes Kiragga</author><author>Bertrand Hugo Mbatchou Ngahane</author><author>Data Science Without Borders Project </author>
        <description><![CDATA[The increasing volume and heterogeneity of patient care data present significant challenges for comprehensive analysis and the generation of insights, particularly in specific areas such as respiratory diseases. Standardizing diverse health data is crucial for enabling large-scale observational research and ensuring data readiness. The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) provides a widely adopted standard for harmonizing such data. However, evaluating the quality of data transformed into the OMOP CDM format is a critical step before its use in research or clinical decision support. This study evaluates the impact of the OMOP CDM standardization process on generating data quality insights for a respiratory disease dataset. The source dataset was initially paper-based, converted to an electronic format, and translated from French into English. This historical dataset covers the years 2009–2023 and contains 108 variables and 2,154 records. The data underwent the standard Extract, Transform, and Load (ETL) process to convert it into the OMOP CDM format. Following this transformation, the quality of the resulting OMOP CDM instance was assessed. We utilized the Data Quality Dashboard (DQD) to evaluate the quality of the OMOP CDM database before and after ETL verification. DQD performs validation checks on the data based on key data quality dimensions, including completeness, plausibility, and conformance. Overall, the assessment conducted 2,344 checks, of which 2,269 passed and 75 failed, resulting in a corrected pass rate of 96% for the Respiratory Diseases Inpatients data before ETL verification. After ETL verification, the assessment conducted 2,374 checks, of which 2,356 passed and 40 failed, resulting in a 100% corrected pass rate. Standardizing respiratory disease data using the OMOP CDM enabled a structured and transparent evaluation of data quality. Through the application of the DQD, this study demonstrated the utility of OMOP CDM in generating meaningful data quality insights. These findings highlight the model's potential to enhance data readiness and support evidence-based decision-making in respiratory disease management.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1811110</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1811110</link>
        <title><![CDATA[A reinforcement learning-guided interpretable method for postoperative sepsis prediction with Hilbert-Schmidt Independence Criterion]]></title>
        <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Kunhua Zhong</author><author>Han Chen</author><author>Qilong Sun</author><author>Peng Wang</author><author>Zhenbei Liu</author><author>Yuwen Chen</author>
        <description><![CDATA[Background: Sepsis is a major cause of postoperative morbidity and mortality, and early risk stratification from perioperative electronic health records (EHR) is a representative large-scale, high-dimensional data processing problem that requires models to be accurate, efficient, and clinically interpretable. However, many existing sepsis prediction methods operate as black boxes and rely on extensive temporal monitoring streams, which increases feature dimensionality and computation while limiting transparency. Methods: We propose a reinforcement learning-guided, interpretable feature engineering framework for postoperative sepsis prediction that targets scalable learning on heterogeneous perioperative data. Within an Actor-Critic formulation, feature selection is treated as an action: an Actor network produces a stochastic feature mask over preoperative static variables and intraoperative statistical summaries, while a Critic network performs downstream prediction using a self-attention-based classifier. To benchmark and stabilize learning, we introduce an auxiliary baseline model that incorporates intraoperative temporal signals extracted by a temporal convolutional network (TCN) and regularized using the Hilbert-Schmidt Independence Criterion (HSIC) to encourage non-redundant representations between statistical and temporal feature views. The Actor is optimized to achieve predictive performance comparable to the baseline while using a reduced feature set, improving computational efficiency and supporting instance-level interpretability. Results: Experiments on a real-world surgical cohort from Southwest Hospital (2014-2018) demonstrate that the proposed framework attains performance comparable to or better than competitive machine learning baselines while selecting fewer input features. On this dataset, our method achieved perfect scores of 1.00 for F1-score, Sensitivity, and Specificity. Conclusion: The proposed method accurately predicts the occurrence of postoperative sepsis and provides effective instance-level post hoc explanations. These findings offer a novel perspective for postoperative sepsis prediction.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1814157</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1814157</link>
        <title><![CDATA[A disease potential-driven graph attention model for comorbidity risk prediction of hypertension]]></title>
        <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Leming Zhou</author><author>Hanshu Qin</author><author>Yanmei Yang</author><author>Gang Huang</author><author>Zhigang Liu</author>
        <description><![CDATA[Hypertension is associated with an increased risk of serious complications. However, current methods for comorbidity risk prediction face the challenge that purely data-driven prediction may produce clinically implausible associations and reduce model interpretability. In addition, capturing fused patient features and identifying differences among patients to support risk prediction remains an open problem. To overcome these challenges, we propose a Disease Potential-Driven Graph Attention (DP-GA) model for comorbidity risk prediction of hypertension, built on three ideas: (a) constructing a fusion mechanism that correlates patients' disease features with structural information, thereby integrating feature attention and structural attention effectively; (b) introducing a similarity-difference balance mechanism to further identify the relationships among patients; and (c) designing a disease potential-driven attention mechanism that calculates disease potential and constructs masks, thus preserving the effective associations from high-risk to low-risk patients. Experimental results demonstrate that the proposed DP-GA model achieves a significant improvement in comorbidity risk prediction for patients with hypertension across three comorbidity datasets collected by the research group, compared with both baseline and state-of-the-art peer methods. We also analyze the comorbidity network to predict the risk of hypertension comorbidity, thereby improving interpretability and early prediction of such comorbidities.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1594374</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1594374</link>
        <title><![CDATA[Toward robust social media sentiment for SMEs: a comparative study of dictionary-based and machine learning approaches with insights for hybrid methodologies]]></title>
        <pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Heru Susanto</author><author>Aida Sari Omar</author><author>Alifya Kayla Shafa Susanto</author><author>Desi Setiana</author><author>Leu Fang-Yie</author><author>Junaid M. Shaikh</author><author>Asep Insani</author><author>Uus Khusni</author><author>Rachmat Hidayat</author><author>Indra Akbari</author><author>Iwan Basuki</author>
        <description><![CDATA[Small and Medium-sized Enterprises (SMEs) increasingly rely on social media to engage customers, promote products, and enhance workplace collaboration. Customer opinions expressed through comments and posts on platforms such as Facebook and Instagram represent valuable insights, yet their informal and context-specific nature—often characterized by slang, misspellings, and bilingual usage—poses challenges for automated sentiment analysis. This study addresses this gap by comparatively evaluating dictionary-based and machine learning approaches to sentiment classification for SMEs' social media content. Data were collected from a diverse set of SMEs across multiple industries, with a substantial volume of customer comments extracted and pre-processed through tokenization, normalization, stop-word removal, and stemming. A customized dictionary was developed to account for local language variations, while Naïve Bayes and Support Vector Machine (SVM) models were employed as supervised classifiers. The findings indicate that dictionary-based methods, while simple and interpretable, struggle with accuracy when processing informal and localized language, whereas machine learning approaches deliver higher overall performance but require extensive preprocessing and tuning. Moreover, the study highlights the potential of hybrid frameworks that combine the interpretability of dictionary-based models with the adaptability of machine learning classifiers. This research contributes both practically and theoretically by (i) demonstrating the limitations of applying generic sentiment analysis tools in localized SME contexts, (ii) proposing a hybrid sentiment analysis framework tailored to SMEs, and (iii) offering empirical evidence to support digital transformation strategies for SMEs in resource-constrained environments. 
Ultimately, accurate sentiment analysis can enable SMEs to refine business strategies, strengthen customer engagement, and achieve sustainable growth in the digital economy.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1778363</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1778363</link>
        <title><![CDATA[Tree-based machine learning methods for predicting vehicle insurance claim size]]></title>
        <pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Edossa Merga Terefe</author><author>Merga Abdissa Aga</author>
        <description><![CDATA[Vehicle insurance claim severity modeling requires accurate and interpretable methods that can handle skewed and heterogeneous loss data. This study provides a structured empirical comparison between classical parametric regression models and tree-based ensemble learning approaches for predicting claim size conditional on claim occurrence. The analysis is conducted within a cross-sectional conditional severity framework using real-world motor insurance data. We implement and compare ordinary least squares (OLS), a Tweedie generalized linear model (GLM), and three ensemble methods: bagging, random forests (RFs), and gradient boosting. Model performance is evaluated using out-of-sample root mean square error (RMSE), and variable importance measures assess the relative contribution of predictors. The results indicate that tree-based ensemble methods achieve modest improvements in predictive accuracy relative to classical parametric models. The Tweedie GLM remains a competitive, flexible parametric benchmark for skewed positive claim amounts. Variable importance analysis consistently identifies premium and insured value as key determinants of claim severity. Overall, the findings suggest that ensemble learning methods can complement traditional actuarial models, offering additional flexibility in capturing non-linear effects while maintaining comparable predictive performance in moderate-complexity severity data.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1737043</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1737043</link>
        <title><![CDATA[Fairer non-negative matrix factorization]]></title>
        <pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Lara Kassab</author><author>Erin George</author><author>Deanna Needell</author><author>Haowen Geng</author><author>Nika Jafar Nia</author><author>Aoxi Li</author>
        <description><![CDATA[There has been a recent critical need to study fairness and bias in machine learning (ML) algorithms. Since there is clearly no one-size-fits-all solution to fairness, ML methods should be developed alongside bias mitigation strategies that are practical and approachable to the practitioner. Motivated by recent work on “fair” PCA, here we consider the more challenging method of non-negative matrix factorization (NMF) as both a showcasing example and a method that is important in its own right for both topic modeling tasks and feature extraction for other ML tasks. We demonstrate that a modification of the objective function, using a min-max formulation, may sometimes offer an improvement in fairness for groups in the population. We derive two methods for the objective minimization, a multiplicative update rule as well as an alternating minimization scheme, and discuss implementation practicalities. We include a suite of synthetic and real experiments that show how the method may improve fairness, while also highlighting that it may sometimes increase error for some individuals; fairness is not a rigid definition, and the choice of method should depend strongly on the application at hand.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1676922</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1676922</link>
        <title><![CDATA[FunduScope: a human-centered, machine learning–based interactive tool for training junior ophthalmologists in diabetic retinopathy detection]]></title>
        <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Sara-Jane Bittner</author><author>Michael Barz</author><author>Daniel Sonntag</author>
        <description><![CDATA[Interpreting fundus images is an essential skill for detecting eye diseases, such as diabetic retinopathy (DR), one of the leading causes of visual impairment. However, the training of junior doctors relies on experienced ophthalmologists, who often lack the time for teaching, or on printed training materials that lack variability in examples. In this work, we present FunduScope, an interactive human-centered learning tool for training junior ophthalmologists, which is based on a pre-trained ML model for classifying DR. In a qualitative pre-study, we investigated the needs of junior doctors and identified gaps in current learning procedures. In the main mixed-methods study, we examined the experience of 10 junior doctors with the tool and its impact on cognitive load, usability, and additional factors relevant to e-learning tools. Despite technical constraints, our results confirm the potential of using an ML-based learning tool in medical education, addressing the time constraints of ophthalmologists and providing learning independence for junior doctors. However, future work could extend the learning tool by using explainable artificial intelligence (XAI) to further support the clinical decision making of learners and by extending the scope of this proof of concept to other ophthalmic diseases.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1752142</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1752142</link>
        <title><![CDATA[Jingdezhen ceramic culture in the digital era: a qualitative inquiry into digital dissemination and platform innovation]]></title>
        <pubDate>Mon, 09 Mar 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Qiuyang Huang</author><author>Zhengjun Chen</author>
        <description><![CDATA[Introduction: Digital platforms have increasingly reshaped the ways in which traditional craft cultures are produced, circulated, and interpreted. While prior research has examined digital heritage broadly, limited attention has been paid to how platform-based dissemination transforms ceramic culture in historically significant craft centers such as Jingdezhen. Methods: This study adopts a qualitative research design, combining semi-structured interviews with 32 ceramic practitioners and digital ethnography of 58 ceramic-related livestreaming sessions on Douyin. Results: The findings reveal three key dynamics: (1) the reconfiguration of craft authority through platform visibility; (2) the emergence of hybrid artisan–educator–entrepreneur identities; and (3) persistent tensions between cultural authenticity and commercial logic in platform-mediated environments. Discussion: By integrating cultural ecology and platform ecosystem theory, this study contributes to scholarship on digital heritage and provides practical insights for cultural practitioners and heritage institutions navigating digital platform ecosystems.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1779935</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1779935</link>
        <title><![CDATA[GFTrans: an on-the-fly static analysis framework for code performance profiling]]></title>
        <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Jie Li</author><author>Yunbao Wen</author><author>Jingxin Liu</author><author>Biqing Zeng</author><author>Seyedali Mirjalili</author>
        <description><![CDATA[Improving software efficiency is crucial for maintenance, but pinpointing runtime bottlenecks becomes increasingly difficult as systems expand. Traditional dynamic profiling tools require full build-execution cycles, creating significant latency that impedes agile development. To address this, we introduce GFTrans, a static analysis framework that predicts C program performance without execution. GFTrans utilizes a Transformer architecture with a novel “anchor-based embedding” technique to integrate control flow and data dependencies into a unified sequence. Additionally, a dynamic gating mechanism fuses these semantic representations with 16 handcrafted statistical features to comprehensively capture code complexity. Evaluated on a dataset of real-world GitHub C functions with high-precision runtime labels, GFTrans outperforms baseline models such as Random Forest and Code2Vec, achieving 78.64% accuracy. The system identifies potential bottlenecks in milliseconds, enabling developers to perform optimization effectively during the coding phase.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1770989</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1770989</link>
        <title><![CDATA[Spatiotemporal deep learning framework for predictive behavioral threat detection in surveillance footage]]></title>
        <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
        <category>Original Research</category>
        <author>Asha Aruna Sheela Matta</author><author>Venkata Purna Chandra Sekhara Rao Manukonda</author>
        <description><![CDATA[Anomaly detection in video surveillance remains a challenging problem due to complex human behaviors, temporal variability, and limited annotated data. This study proposes an optimized spatiotemporal deep learning (DL) framework that integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Long Short-Term Memory (LSTM) network for temporal dependency modeling. The CNN processes frame-level appearance information, while the LSTM captures sequential motion patterns across video frames, enabling effective representation of anomalous activities. Hyperparameter optimization and regularization strategies are employed to improve convergence stability and generalization performance. The proposed model is evaluated on the DCSASS surveillance dataset, and the experimental results demonstrate that the optimized CNN-LSTM framework achieves an accuracy of 98.1%, with consistently high precision, recall, and F1-score across 3-fold, 5-fold, and 10-fold cross-validation settings. Comparative analysis shows that the proposed method outperforms conventional machine learning models and recent deep learning baselines, highlighting its effectiveness and robustness for practical video-based anomaly detection in surveillance environments.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1681382</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1681382</link>
        <title><![CDATA[Federated learning for teacher data privacy protection: a study in the context of the PIPL]]></title>
        <pubdate>2026-02-09T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Shanwei Chen</author><author>Xiu Zhi Qi</author><author>Xue Hui Han</author><author>Zhao Chen Fan</author><author>Le Le Wang</author>
        <description><![CDATA[Background: The Personal Information Protection Law (PIPL) in China imposes strict requirements on personal data handling, particularly in educational contexts where teacher data privacy is critical. Traditional centralized machine learning approaches pose significant risks of data breaches and non-compliance. Federated Learning (FL) offers a promising decentralized alternative by enabling collaborative model training without sharing raw data. Methods: This study combines quantitative simulations and qualitative compliance analysis to evaluate FL frameworks under PIPL principles, with a focus on Differential Privacy as the primary empirically validated mechanism for noise addition and privacy guarantees. Other techniques, such as Secure Multi-Party Computation (SMC), are analyzed theoretically for their alignment with PIPL requirements such as data minimization, anonymization, and encrypted transmission. Results: Experimental simulations demonstrate that FL effectively reduces data breach risks compared to centralized methods. It achieves principle-level compliance with PIPL through local data processing, differential privacy mechanisms, and secure aggregation, leading to improved privacy preservation while maintaining model performance. Conclusion: FL conceptually supports teacher data privacy protection under the PIPL framework. This study proposes a tailored compliance framework that integrates FL with privacy-enhancing technologies, offering theoretical foundations and practical recommendations for educational institutions and technology implementers to deploy privacy-preserving machine learning solutions.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1782461</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1782461</link>
        <title><![CDATA[A genetic algorithm-based framework for online sparse feature selection in data streams]]></title>
        <pubdate>2026-02-09T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Guanyu Liu</author><author>Jinhang Liu</author><author>Guifan He</author><author>Yifan Liu</author><author>Huabo Bai</author><author>Min Zhou</author>
        <description><![CDATA[High-dimensional streaming data implementations commonly utilize online streaming feature selection (OSFS) techniques. In practice, however, incomplete data due to equipment failures and technical constraints often poses a significant challenge. Online Sparse Streaming Feature Selection (OS2FS) tackles this issue by performing missing data imputation via latent factor analysis. Nevertheless, existing OS2FS approaches exhibit considerable limitations in feature evaluation, resulting in degraded performance. To address these shortcomings, this paper introduces a novel genetic algorithm-based online sparse streaming feature selection (GA-OS2FS) method for data streams, which integrates two key innovations: (1) imputation of missing values using a latent factor analysis model, and (2) application of a genetic algorithm to assess feature importance. Comprehensive experiments conducted on six real-world datasets show that GA-OS2FS surpasses state-of-the-art OSFS and OS2FS methods, consistently attaining higher accuracy through the selection of optimal feature subsets.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1697392</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1697392</link>
        <title><![CDATA[Dynamic transfer learning with co-occurrence-guided multi-source fusion for urban spatio-temporal crime prediction]]></title>
        <pubdate>2026-02-05T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Chen Cui</author><author>Ziwan Zheng</author><author>Hao Du</author><author>Wen Wang</author>
        <description><![CDATA[Spatio-temporal crime prediction is crucial for optimizing police resource allocation, but it faces two key challenges: data sparsity, which hinders models from extracting effective patterns and limits robustness, and the underutilization of cross-type crime co-occurrence correlations. To address these issues, we propose a transfer learning approach that explores underlying cross-type relationships, enabling the sharing of spatio-temporal features across crime types and alleviating data sparsity. An adaptive weight updating mechanism is incorporated to enhance the perception of distinct crime categories, while the impacts of points of interest (POIs), meteorological factors, and other features are also analyzed. Experiments on real-world data from a Chinese city show that our model comprehensively captures latent features across crime types, thereby enhancing predictive performance and robustness, particularly for crime types with sparse data. Moreover, it effectively incorporates environmental features, further improving crime prediction performance.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1651290</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1651290</link>
        <title><![CDATA[Depression detection through dual-stream modeling with large language models: a fusion-based transfer learning framework integrating BERT and T5 representations]]></title>
        <pubdate>2026-02-04T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Na Wang</author><author>Weijia Zhang</author><author>Raja Kamil</author><author>Ian Renner</author><author>Syed Abdul Rahman Al-Haddad</author><author>Normala Ibrahim</author><author>Zhen Zhao</author>
        <description><![CDATA[Millions of people around the world suffer from depression. While early diagnosis is essential for timely intervention, it remains a significant challenge due to limited access to clinically diagnosed data and privacy restrictions on mental health records. These limitations hinder the training of robust AI models for depression detection. To address this, the article proposes a parallel transfer learning framework for depression detection that integrates BERT and T5 through a fusion mechanism, combining the complementary advantages of these two large language models (LLMs). By integrating their semantic embeddings, the method captures a broader range of linguistic cues from transcribed speech. These embeddings are processed by a model with two parallel branches, one a one-dimensional convolutional neural network and the other a dense neural network; each branch produces a preliminary prediction, and the two are then fused for the final prediction. Evaluations on the E-DAIC dataset demonstrate that the proposed method outperforms baseline models, achieving a 3.0% increase in accuracy (91.3%), a 6.9% increase in precision (95.2%), and a 1.7% improvement in F1-score (90.0%). The experimental results verify the effectiveness of BERT and T5 fusion in enhancing depression detection performance and highlight the potential of transfer learning for scalable and privacy-conscious mental health applications.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1750906</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1750906</link>
        <title><![CDATA[Algorithmic recourse in sequential decision-making for long-term fairness]]></title>
        <pubdate>2026-02-04T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Francisco Gumucio</author><author>Lu Zhang</author>
        <description><![CDATA[Long-term fairness in sequential decision-making is critical yet challenging, as decisions at each time step influence future opportunities and outcomes, potentially exacerbating existing disparities over time. While existing methods primarily achieve fairness by directly adjusting decision models, in this work, we study a complementary perspective based on sequential algorithmic recourse, in which fairness is pursued through actionable interventions for individuals. We introduce Sequential Causal Algorithmic Recourse for Fairness (SCARF), a causally grounded framework that generates temporally coherent recourse trajectories by integrating structural causal modeling with sequential generative modeling. By explicitly incorporating both short-term and long-term fairness constraints, as well as practical budget limitations, SCARF generates personalized recourse plans that effectively mitigate disparities over multiple decision cycles. Through experiments on synthetic and semi-synthetic datasets, we empirically examine how different recourse strategies influence fairness dynamics over time, illustrating the trade-offs between short-term and long-term fairness under sequential interventions. The results demonstrate that SCARF provides a practical and informative framework for analyzing long-term fairness in dynamic decision-making settings.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1718710</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1718710</link>
        <title><![CDATA[Modeling household adoption of IoT-based home security in Dhaka: a PLS–machine learning framework]]></title>
        <pubdate>2026-02-04T00:00:00Z</pubdate>
        <category>Original Research</category>
        <author>Arif Mahmud</author><author>Ashikur Rahman</author><author>Fahmid Al Farid</author><author>Jia Uddin</author><author>Hezerul Bin Abdul Karim</author>
        <description><![CDATA[Introduction: Despite several strategies, Bangladesh has a low rate of Internet of Things (IoT) deployment. This study therefore investigates the factors shaping IoT adoption for residential security in Dhaka and analyzes their respective contributions. Method: The study combined two important theories, protection motivation theory (PMT) and the attitude-social influence-self-efficacy (ASE) model, in a hybrid PLS-machine learning approach used to identify both linear and nonlinear relationships with high predictive accuracy. A snowball sampling method was used to collect 348 valid responses from a survey of household heads. The complete assessment procedure comprised partial least squares (PLS) analysis followed by artificial neural networks (ANN) and machine learning (ML) classifiers. Results: The variables that affected intention, explaining 34.9% of the variance with 74.28% accuracy, were severity, vulnerability, response efficacy, response cost, and attitude. Vulnerability was the most significant predictor, followed by response cost, attitude, response efficacy, self-efficacy, social influence, and severity. Discussion: The theoretical contribution of this study lies in its novel integration of the PMT and ASE models, offering new insights into their combined effect on technology adoption in emerging markets. The findings also contribute to the literature by increasing public awareness of home security, which can enhance Dhaka's overall state of public order and safety. Moreover, the findings may offer valuable insights for companies and entrepreneurs, as incorporating these factors into marketing strategies and investment initiatives is likely to foster greater consumer adoption.]]></description>
      </item>
      </channel>
    </rss>