<?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <channel>
        <title>Frontiers in Big Data | Machine Learning and Artificial Intelligence section | New and Recent Articles</title>
        <link>https://www.frontiersin.org/journals/big-data/sections/machine-learning-and-artificial-intelligence</link>
        <description>RSS Feed for Machine Learning and Artificial Intelligence section in the Frontiers in Big Data journal | New and Recent Articles</description>
        <language>en-us</language>
        <generator>Frontiers Feed Generator, version 1</generator>
        <pubDate>15 Apr 2026 13:17:23 +0000</pubDate>
        <ttl>60</ttl>
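        <!--
          The items below follow standard RSS 2.0 structure (guid, link, title, pubDate,
          category, dc:creator, description). A minimal sketch of consuming this feed with
          Python's standard library; the feed URL is an assumed endpoint, not confirmed by
          this document.

            # Minimal RSS consumer sketch using only the Python standard library.
            import urllib.request
            import xml.etree.ElementTree as ET

            FEED_URL = ("https://www.frontiersin.org/journals/big-data/sections/"
                        "machine-learning-and-artificial-intelligence/rss")  # assumed endpoint

            with urllib.request.urlopen(FEED_URL) as resp:
                root = ET.fromstring(resp.read())

            for item in root.iter("item"):
                title = item.findtext("title")
                link = item.findtext("link")
                date = item.findtext("pubDate")
                print(f"{date}  {title}\n  {link}")
        -->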
        <item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1778363</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1778363</link>
        <title><![CDATA[Tree-based machine learning methods for predicting vehicle insurance claim size]]></title>
        <pubDate>23 Mar 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Edossa Merga Terefe</dc:creator><dc:creator>Merga Abdissa Aga</dc:creator>
        <description><![CDATA[Vehicle insurance claim severity modeling requires accurate and interpretable methods that can handle skewed and heterogeneous loss data. This study provides a structured empirical comparison between classical parametric regression models and tree-based ensemble learning approaches for predicting claim size conditional on claim occurrence. The analysis is conducted within a cross-sectional conditional severity framework using real-world motor insurance data. We implement and compare ordinary least squares (OLS), a Tweedie generalized linear model (GLM), and three ensemble methods: bagging, random forests (RFs), and gradient boosting. Model performance is evaluated using out-of-sample root mean square error (RMSE), and variable importance measures assess the relative contribution of predictors. The results indicate that tree-based ensemble methods achieve modest improvements in predictive accuracy relative to classical parametric models. The Tweedie GLM remains a competitive, flexible parametric benchmark for skewed positive claim amounts. Variable importance analysis consistently identifies premium and insured value as key determinants of claim severity. Overall, the findings suggest that ensemble learning methods can complement traditional actuarial models, offering additional flexibility in capturing non-linear effects while maintaining comparable predictive performance in moderate-complexity severity data.]]></description>
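        <!--
          As an illustration of the model families this abstract compares (not the authors'
          code), a minimal sketch fitting a Tweedie GLM and two tree ensembles on synthetic
          skewed claim data and comparing out-of-sample RMSE; the data-generating process and
          all hyperparameters are assumptions.

            import numpy as np
            from sklearn.linear_model import TweedieRegressor
            from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import mean_squared_error

            rng = np.random.default_rng(0)
            X = rng.normal(size=(5000, 6))                   # rating factors (illustrative)
            mu = np.exp(0.3 * X[:, 0] + 0.2 * X[:, 1] ** 2)  # non-linear severity signal
            y = rng.gamma(shape=2.0, scale=mu / 2.0)         # skewed positive claim sizes

            X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
            models = {
                "tweedie_glm": TweedieRegressor(power=1.5, alpha=0.0, max_iter=1000),
                "random_forest": RandomForestRegressor(n_estimators=300, random_state=0),
                "gradient_boosting": GradientBoostingRegressor(random_state=0),
            }
            for name, model in models.items():
                model.fit(X_tr, y_tr)
                rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
                print(name, round(rmse, 3))
        -->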
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1676922</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1676922</link>
        <title><![CDATA[FunduScope: a human-centered, machine learning–based interactive tool for training junior ophthalmologists in diabetic retinopathy detection]]></title>
        <pubDate>13 Mar 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Sara-Jane Bittner</dc:creator><dc:creator>Michael Barz</dc:creator><dc:creator>Daniel Sonntag</dc:creator>
        <description><![CDATA[Interpreting fundus images is an essential skill for detecting eye diseases, such as diabetic retinopathy (DR), one of the leading causes of visual impairment. However, the training of junior doctors relies on experienced ophthalmologists, who often lack the time for teaching, or on printed training materials that lack variability in examples. In this work, we present FunduScope, an interactive, human-centered learning tool for training junior ophthalmologists, which is based on a pre-trained ML model for classifying DR. In a qualitative pre-study, we investigated the needs of junior doctors and identified gaps in current learning procedures. In the main mixed-methods study, we examined the experience of 10 junior doctors with the tool and its impact on cognitive load, usability, and additional factors relevant to e-learning tools. Despite technical constraints, our results confirm the potential of an ML-based learning tool in medical education, addressing the time constraints of ophthalmologists and providing learning independence for junior doctors. Future work could extend the learning tool with explainable artificial intelligence (XAI) to further support learners' clinical decision making and could broaden the scope of this proof of concept to other ophthalmic diseases.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1651290</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1651290</link>
        <title><![CDATA[Depression detection through dual-stream modeling with large language models: a fusion-based transfer learning framework integrating BERT and T5 representations]]></title>
        <pubDate>04 Feb 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Na Wang</dc:creator><dc:creator>Weijia Zhang</dc:creator><dc:creator>Raja Kamil</dc:creator><dc:creator>Ian Renner</dc:creator><dc:creator>Syed Abdul Rahman Al-Haddad</dc:creator><dc:creator>Normala Ibrahim</dc:creator><dc:creator>Zhen Zhao</dc:creator>
        <description><![CDATA[Millions of people around the world suffer from depression. While early diagnosis is essential for timely intervention, it remains a significant challenge due to limited access to clinically diagnosed data and privacy restrictions on mental health records. These limitations hinder the training of robust AI models for depression detection. To address this, this article proposes a parallel transfer learning framework for depression detection that integrates BERT and T5 through a fusion mechanism, combining the complementary advantages of these two large language models (LLMs). By integrating their semantic embeddings, the method captures a broader range of linguistic cues from transcribed speech. These embeddings are processed by a model with two parallel branches, one built on a one-dimensional convolutional neural network and the other on a dense neural network; each branch produces a preliminary prediction, and the two are then fused for the final prediction. Evaluations on the E-DAIC dataset demonstrate that the proposed method outperforms baseline models, achieving a 3.0% increase in accuracy (91.3%), a 6.9% increase in precision (95.2%), and a 1.7% improvement in F1-score (90.0%). The experimental results verify the effectiveness of BERT and T5 fusion in enhancing depression detection performance and highlight the potential of transfer learning for scalable and privacy-conscious mental health applications.]]></description>
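        <!--
          A minimal sketch of the embedding-level fusion idea described above, assuming
          bert-base-uncased and t5-base as stand-in checkpoints; the paper's parallel CNN
          and dense branches are not reproduced here.

            # Sketch: fuse BERT and T5 sentence embeddings by concatenation.
            import torch
            from transformers import AutoTokenizer, AutoModel, T5EncoderModel

            bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
            bert = AutoModel.from_pretrained("bert-base-uncased").eval()
            t5_tok = AutoTokenizer.from_pretrained("t5-base")
            t5 = T5EncoderModel.from_pretrained("t5-base").eval()

            def embed(text: str) -> torch.Tensor:
                with torch.no_grad():
                    b_in = bert_tok(text, return_tensors="pt", truncation=True)
                    b_vec = bert(**b_in).last_hidden_state.mean(dim=1)  # (1, 768)
                    t_in = t5_tok(text, return_tensors="pt", truncation=True)
                    t_vec = t5(**t_in).last_hidden_state.mean(dim=1)    # (1, 768)
                return torch.cat([b_vec, t_vec], dim=1)                 # (1, 1536) fused

            features = embed("transcribed speech segment goes here")
            print(features.shape)
        -->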
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1718710</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1718710</link>
        <title><![CDATA[Modeling household adoption of IoT-based home security in Dhaka: a PLS–machine learning framework]]></title>
        <pubDate>04 Feb 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Arif Mahmud</dc:creator><dc:creator>Ashikur Rahman</dc:creator><dc:creator>Fahmid Al Farid</dc:creator><dc:creator>Jia Uddin</dc:creator><dc:creator>Hezerul Bin Abdul Karim</dc:creator>
        <description><![CDATA[Introduction: Despite several strategies, Bangladesh has a low rate of Internet of Things (IoT) deployment. This study therefore investigates the factors shaping IoT adoption for residential security in Dhaka and analyzes their respective contributions.
Method: This study combined two established theories, protection motivation theory (PMT) and the attitude-social influence-self-efficacy (ASE) model, in a hybrid PLS-machine learning approach that identifies both linear and nonlinear relationships with high predictive accuracy. A snowball sampling method was used to collect 348 valid responses from a survey of household heads. The complete assessment procedure consisted of partial least squares (PLS), followed by artificial neural networks (ANNs) and machine learning (ML) classifiers.
Results: Severity, vulnerability, response efficacy, response cost, and attitude affected intention, explaining 34.9% of its variance with a predictive accuracy of 74.28%. Vulnerability was the most significant predictor, followed by response cost, attitude, response efficacy, self-efficacy, social influence, and severity.
Discussion: The theoretical contribution of this study lies in its novel integration of the PMT and ASE models, offering new insights into their combined effect on technology adoption in emerging markets. The findings also contribute to the literature by raising public awareness of home security, which can enhance Dhaka's overall state of public order and safety. Moreover, they may offer valuable insights for companies and entrepreneurs, as incorporating these factors into marketing strategies and investment initiatives is likely to foster greater consumer adoption.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1750906</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1750906</link>
        <title><![CDATA[Algorithmic recourse in sequential decision-making for long-term fairness]]></title>
        <pubDate>04 Feb 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Francisco Gumucio</dc:creator><dc:creator>Lu Zhang</dc:creator>
        <description><![CDATA[Long-term fairness in sequential decision-making is critical yet challenging, as decisions at each time step influence future opportunities and outcomes, potentially exacerbating existing disparities over time. While existing methods primarily achieve fairness by directly adjusting decision models, in this work, we study a complementary perspective based on sequential algorithmic recourse, in which fairness is pursued through actionable interventions for individuals. We introduce Sequential Causal Algorithmic Recourse for Fairness (SCARF), a causally grounded framework that generates temporally coherent recourse trajectories by integrating structural causal modeling with sequential generative modeling. By explicitly incorporating both short-term and long-term fairness constraints, as well as practical budget limitations, SCARF generates personalized recourse plans that effectively mitigate disparities over multiple decision cycles. Through experiments on synthetic and semi-synthetic datasets, we empirically examine how different recourse strategies influence fairness dynamics over time, illustrating the trade-offs between short-term and long-term fairness under sequential interventions. The results demonstrate that SCARF provides a practical and informative framework for analyzing long-term fairness in dynamic decision-making settings.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1699561</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1699561</link>
        <title><![CDATA[Explainable attrition risk scoring for managerial retention decisions in human resource analytics]]></title>
        <pubDate>12 Jan 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>M. S. Pavithran</dc:creator><dc:creator>S. M. Vadivel</dc:creator>
        <description><![CDATA[Introduction: Employee turnover remains a significant challenge for organizations, making it difficult to retain employees and operate efficiently. With the assistance of predictive analytics, HR managers can anticipate and reduce potential turnover. Conventional research has focused on the effectiveness of technical models, yet few studies investigate the interpretability and reliability of managerial forecasts.
Methods: This research used the Employee Attrition dataset and applied several pre-processing methods, including label encoding, feature scaling, and SMOTE for class balancing. Machine learning models were trained and optimized using grid search with stratified cross-validation. The best-performing model was calibrated using the sigmoid method to improve the accuracy of the predicted probabilities. LIME enabled local interpretability, providing practical insights into individual employees' attrition risks. Permutation feature importance analysis and SHAP summary plots aided understanding of the model by showing which features contributed to the attrition probability.
Results: The Random Forest classifier achieved the highest AUC-ROC score of 97.37%. Risk distribution visualizations highlight employees with the highest attrition probability, and calibration reduced the Brier score from 0.03873 to 0.03480.
Discussion: The study concludes that a calibrated, interpretable, and risk-stratified model can enhance HR decision-making by prioritizing interventions and increasing the accuracy of retention strategies. The framework helps HR leaders move from reactive to proactive workforce management by leveraging data-driven insights.]]></description>
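        <!--
          The calibration step described above corresponds to a standard scikit-learn idiom;
          a minimal sketch on synthetic imbalanced data (the dataset and the tuned model are
          stand-ins for the study's).

            # Sketch: sigmoid (Platt) calibration of a random forest, checked by Brier score.
            from sklearn.datasets import make_classification
            from sklearn.ensemble import RandomForestClassifier
            from sklearn.calibration import CalibratedClassifierCV
            from sklearn.model_selection import train_test_split
            from sklearn.metrics import brier_score_loss

            X, y = make_classification(n_samples=3000, weights=[0.84], random_state=0)
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

            raw = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
            cal = CalibratedClassifierCV(
                RandomForestClassifier(n_estimators=300, random_state=0),
                method="sigmoid", cv=5,
            ).fit(X_tr, y_tr)

            for name, model in [("raw", raw), ("calibrated", cal)]:
                p = model.predict_proba(X_te)[:, 1]
                print(name, round(brier_score_loss(y_te, p), 5))
        -->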
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1670833</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1670833</link>
        <title><![CDATA[Decoding deception: state-of-the-art approaches to deep fake detection]]></title>
        <pubDate>09 Jan 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Tarak Hussain</dc:creator><dc:creator>B. Tirapathi Reddy</dc:creator><dc:creator>Kondaveti Phanindra</dc:creator><dc:creator>Sailaja Terumalasetti</dc:creator><dc:creator>Ghufran Ahmad Khan</dc:creator>
        <description><![CDATA[Deepfake technology is evolving at an alarming pace, threatening information integrity and social trust. We present a new multimodal deepfake detection framework that exploits cross-domain inconsistencies by utilizing audio-visual consistency. Its core is the Synchronization-Aware Feature Fusion (SAFF) architecture combined with Cross-Modal Graph Attention Networks (CM-GAN), which explicitly address temporal misalignments for improved detection accuracy. Across eight models and five benchmark datasets with 93,750 test samples, the framework obtains 98.76% accuracy and substantial robustness across multiple compression levels. Statistical analysis shows that synchronized audio-visual inconsistencies are highly discriminative (Cohen's d = 1.87). With contributions centering on a cross-modal feature extraction pipeline, a graph-based attention mechanism for inter-modal reasoning, and extensive ablation studies validating the fusion strategy, the paper also provides statistically sound insights to guide future research in this area. With a 17.85% generalization advantage over unimodal methods, the framework represents a new state of the art and introduces a self-supervised pre-training strategy that requires 65% less labeled data.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1686452</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1686452</link>
        <title><![CDATA[Bias in AI systems: integrating formal and socio-technical approaches]]></title>
        <pubDate>08 Jan 2026 00:00:00 +0000</pubDate>
        <category>Review</category>
        <dc:creator>Amar Ahmad</dc:creator><dc:creator>Yvonne Vallès</dc:creator><dc:creator>Youssef Idaghdour</dc:creator>
        <description><![CDATA[Artificial Intelligence (AI) systems are increasingly embedded in high-stakes decision-making across domains such as healthcare, finance, criminal justice, and employment. Evidence has accumulated showing that these systems can reproduce and amplify structural inequities, raising ethical, social, and technical concerns. In this review, formal mathematical definitions of bias are integrated with socio-technical perspectives to examine its origins, manifestations, and impacts. Bias is categorized into four interrelated families (historical/representational, selection/measurement, algorithmic/optimization, and feedback/emergent), and its operation is illustrated through case studies in facial recognition, large language models, credit scoring, healthcare, employment, and criminal justice. Current mitigation strategies are critically evaluated, including dataset diversification, fairness-aware modeling, post-deployment auditing, regulatory frameworks, and participatory design. An integrated framework is proposed in which statistical diagnostics are coupled with governance mechanisms to enable bias mitigation across the entire AI lifecycle. By bridging technical precision with sociological insight, guidance is offered for the development of AI systems that are equitable, accountable, and responsive to the needs of diverse populations.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1683786</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1683786</link>
        <title><![CDATA[Hybrid deep learning models for fake news detection: case study on Arabic and English languages]]></title>
        <pubDate>06 Jan 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Baqer M. Merzah</dc:creator><dc:creator>Jafar Razmara</dc:creator><dc:creator>Zolfaghar Salmanian</dc:creator>
        <description><![CDATA[Introduction: Fake news has become a significant threat to public discourse due to the swift spread of online content and the difficulty of detecting it and distinguishing it from real news. This challenge is further amplified by society's increasing dependence on online social networks. Many researchers have developed machine learning and deep learning models to combat the spread of misinformation and identify fake news. However, most studies have focused on a single language and achieved low accuracy, especially for Arabic, which poses challenges due to resource constraints and linguistic intricacies.
Methods: This paper introduces an effective deep-learning technique for fake news detection (FND) in Arabic and English. The proposed model integrates a multi-channel Convolutional Neural Network (CNN) and dual Bidirectional Long Short-Term Memory (BiLSTM) networks, capturing in parallel the semantic and local textual features embedded by a pre-trained FastText model. A global max-pooling layer was then added to reduce dimensionality and extract salient features from the sequential output. Finally, the model classifies news as fake or real. The model is trained and evaluated on three benchmark datasets: AFND and ANS (Arabic) and WELFake (English).
Results: Experimental results highlight the model's effectiveness and performance superiority over state-of-the-art (SOTA) approaches, with accuracies of 94.43 ± 0.19%, 71.63 ± 1.45%, and 98.85 ± 0.03% on AFND, ANS, and WELFake, respectively.
Discussion: This work provides a robust approach to combating misinformation, offering practical applications in enhancing the reliability of information on social networks.]]></description>
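        <!--
          A minimal sketch of the architecture described above, a multi-channel CNN plus dual
          BiLSTM with global max pooling; the vocabulary size, sequence length, kernel sizes,
          and unit counts are assumptions, and the FastText embedding initialization is omitted.

            from tensorflow.keras import layers, Model

            VOCAB, MAXLEN, EMB = 50_000, 300, 300   # assumed sizes (FastText dim is 300)

            inp = layers.Input(shape=(MAXLEN,))
            emb = layers.Embedding(VOCAB, EMB)(inp)  # would be FastText-initialized

            # Parallel CNN channels with different kernel sizes.
            convs = [layers.Conv1D(128, k, activation="relu", padding="same")(emb)
                     for k in (3, 4, 5)]
            cnn = layers.Concatenate()(convs)

            # Dual stacked BiLSTM branch.
            rnn = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(emb)
            rnn = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(rnn)

            merged = layers.Concatenate()([cnn, rnn])
            pooled = layers.GlobalMaxPooling1D()(merged)
            out = layers.Dense(1, activation="sigmoid")(pooled)  # fake vs. real

            model = Model(inp, out)
            model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
            model.summary()
        -->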
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1676054</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1676054</link>
        <title><![CDATA[Adaptive model for rate of penetration prediction based on the dynamic correlation of influencing factors]]></title>
        <pubDate>05 Jan 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Yonggang Deng</dc:creator><dc:creator>Xiaojing Zhou</dc:creator><dc:creator>Zixuan Feng</dc:creator><dc:creator>Xin Li</dc:creator><dc:creator>Hui Li</dc:creator>
        <description><![CDATA[Introduction: Accurately predicting the rate of penetration (ROP) is a critical benchmark for evaluating operational efficiency in drilling operations, making it necessary to optimize drilling parameters and construct an accurate ROP prediction model. At present, the correlations between drilling operation parameters and the ROP are commonly evaluated with a static assessment, which overlooks dynamic changes in parameter correlations during drilling.
Method: An adaptive ROP prediction model that incorporates depth-varying correlations of influential parameters is constructed. The model automatically identifies the dynamic correlations of the modeling parameters at different depths of well sections, and the optimal modeling parameters for adaptive training are selected based on the ranking of the correlation coefficients.
Results: We analyzed 33 drilling parameters across 4,837 data records collected from four wellbores in Sichuan. The comparison revealed that, in different well sections, the dynamic correlation coefficient of each parameter deviates significantly from the overall correlation coefficient. The proposed model can dynamically select key parameters and update itself from real-time data streams, avoiding the defect of traditional fixed-parameter models that ignore dynamic changes across well sections.
Discussion: Modeling comparison revealed that, in multiple rounds of prediction based on dynamic correlations, the prediction accuracy exceeded that of the overall correlation in 93% of rounds, indicating that the adaptive ROP prediction model with dynamic correlations has high application value.]]></description>
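        <!--
          A minimal sketch of the core idea, ranking parameters by their correlation with ROP
          within each depth section rather than over the whole well; the column names, section
          count, and synthetic data are assumptions.

            import numpy as np
            import pandas as pd

            rng = np.random.default_rng(0)
            df = pd.DataFrame(rng.normal(size=(4837, 5)),
                              columns=["wob", "rpm", "torque", "flow", "rop"])
            df["depth"] = np.sort(rng.uniform(0, 6000, size=len(df)))

            TOP_K = 3  # parameters to keep per section
            for section, chunk in df.groupby(pd.cut(df["depth"], bins=6), observed=True):
                corr = chunk.drop(columns="depth").corr()["rop"].drop("rop").abs()
                best = corr.sort_values(ascending=False).head(TOP_K)
                print(section, list(best.index))  # retrain the model on these per section
        -->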
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1706417</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1706417</link>
        <title><![CDATA[Posterior averaging with Gaussian naive Bayes and the R package RandomGaussianNB for big-data classification]]></title>
        <pubDate>11 Dec 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Patchanok Srisuradetchai</dc:creator>
        <description><![CDATA[RandomGaussianNB is an open-source R package implementing the posterior-averaging Gaussian naive Bayes (PAV-GNB) algorithm, a scalable ensemble extension of the classical GNB classifier. The method introduces posterior averaging to mitigate correlation bias and enhance stability in high-dimensional settings while maintaining interpretability and computational efficiency. Theoretical results establish the variance of the ensemble posterior, which decreases inversely with ensemble size, and a margin-based generalization bound that connects posterior variance with classification error. Together, these results provide a principled understanding of the bias–variance trade-off in PAV-GNB. The package delivers a fully parallel, reproducible framework for large-scale classification. Simulation studies under big-data conditions—large samples, many features, and multiple classes—show consistent accuracy, low variance, and agreement with theoretical predictions. Scalability experiments demonstrate near-linear runtime improvement with multi-core execution, and a real-world application on the Pima Indians Diabetes dataset validates PAV-GNB's reliability and computational efficiency as an interpretable, statistically grounded approach for ensemble naive Bayes classification.]]></description>
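        <!--
          A minimal sketch of posterior averaging over Gaussian naive Bayes members trained on
          random feature subsets; this illustrates the general PAV idea in Python and is not
          the RandomGaussianNB R package's interface, which this abstract does not specify.

            import numpy as np
            from sklearn.datasets import load_breast_cancer
            from sklearn.model_selection import train_test_split
            from sklearn.naive_bayes import GaussianNB

            X, y = load_breast_cancer(return_X_y=True)
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

            rng = np.random.default_rng(0)
            members, subsets = [], []
            for _ in range(25):                                  # ensemble size B
                cols = rng.choice(X.shape[1], size=10, replace=False)
                members.append(GaussianNB().fit(X_tr[:, cols], y_tr))
                subsets.append(cols)

            # Average the members' posterior probabilities, then take the argmax class.
            posterior = np.mean(
                [m.predict_proba(X_te[:, cols]) for m, cols in zip(members, subsets)],
                axis=0)
            print(round(np.mean(posterior.argmax(axis=1) == y_te), 3))
        -->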
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1677331</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1677331</link>
        <title><![CDATA[Parameter-efficient fine-tuning for low-resource text classification: a comparative study of LoRA, IA3, and ReFT]]></title>
        <pubDate>02 Dec 2025 00:00:00 +0000</pubDate>
        <category>Brief Research Report</category>
        <dc:creator>Steve Nwaiwu</dc:creator>
        <description><![CDATA[The successful application of large-scale transformer models in Natural Language Processing (NLP) is often hindered by the substantial computational cost and data requirements of full fine-tuning. This challenge is particularly acute in low-resource settings, where standard fine-tuning can lead to catastrophic overfitting and model collapse. To address this, Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a promising solution. However, a direct comparative analysis of their trade-offs under unified low-resource conditions is lacking. This study provides a rigorous empirical evaluation of three prominent PEFT methods: Low-Rank Adaptation (LoRA), Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3), and a Representation Fine-Tuning (ReFT) strategy. Using a DistilBERT base model on low-resource versions of the AG News and Amazon Reviews datasets, the present work compares these methods against a full fine-tuning baseline across accuracy, F1 score, trainable parameters, and GPU memory usage. The findings reveal that while all PEFT methods dramatically outperform the baseline, LoRA consistently achieves the highest F1 scores (0.909 on Amazon Reviews). Critically, ReFT delivers nearly identical performance (~98% of LoRA's F1 score) while training only ~3% of the parameters, establishing it as the most efficient method. This research demonstrates that PEFT is not merely an efficiency optimization, but a necessary tool for robust generalization in data-scarce environments, providing practitioners with a clear guide to navigate the performance-efficiency trade-off. By unifying these evaluations under controlled conditions, this study advances beyond fragmented prior research and offers a systematic framework for selecting PEFT strategies.]]></description>
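        <!--
          Among the compared methods, LoRA has the most established tooling; a minimal sketch
          of attaching LoRA adapters to a DistilBERT classifier with the Hugging Face peft
          library, where the rank, alpha, dropout, and target modules are assumptions rather
          than the study's configuration.

            # Sketch: parameter-efficient fine-tuning of DistilBERT with LoRA adapters.
            from transformers import AutoModelForSequenceClassification
            from peft import LoraConfig, TaskType, get_peft_model

            base = AutoModelForSequenceClassification.from_pretrained(
                "distilbert-base-uncased", num_labels=4)         # e.g., AG News classes

            config = LoraConfig(
                task_type=TaskType.SEQ_CLS,
                r=8, lora_alpha=16, lora_dropout=0.1,
                target_modules=["q_lin", "v_lin"],               # DistilBERT attention proj.
            )
            model = get_peft_model(base, config)
            model.print_trainable_parameters()  # tiny fraction of the full model
            # ...then train with the usual Trainer / torch loop on the low-resource subset.
        -->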
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1697478</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1697478</link>
        <title><![CDATA[Adaptive deep Q-networks for accurate electric vehicle range estimation]]></title>
        <pubDate>27 Nov 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Urvashi Khekare</dc:creator><dc:creator>Rajay Vedaraj I. S.</dc:creator>
        <description><![CDATA[It is critical that electric vehicles estimate the remaining driving range after charging, as this has direct implications for drivers' range anxiety and thus for large-scale EV adoption. Traditional machine learning approaches to range prediction rely heavily on large amounts of vehicle-specific data and are therefore neither scalable nor adaptable. In this paper, a deep reinforcement learning framework is proposed, utilizing big data from 103 EV models from 31 manufacturers. This dataset combines several operational variables (state of charge, voltage, current, temperature, vehicle speed, and discharge characteristics) that reflect highly dynamic driving states. First, outliers in this heterogeneous data were reduced through a hybrid fuzzy k-means clustering approach, enhancing the quality of the training data. Second, a pathfinder meta-heuristic approach was applied to optimize the reward function of the deep Q-learning algorithm, thereby accelerating convergence and improving accuracy. Experimental validation reveals that the proposed framework halves the range error to [−0.28, 0.40] under independent testing and [−0.23, 0.34] under 10-fold cross-validation. The proposed approach outperforms traditional machine learning and transformer-based approaches in Mean Absolute Error (by 61.86% and 4.86%, respectively) and in Root Mean Square Error (by 6.36% and 3.56%, respectively). This highlights the robustness of the proposed framework under complex, dynamic EV data and its ability to enable scalable, intelligent range prediction, fostering innovation in infrastructure and climate-conscious mobility.]]></description>
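        <!--
          A minimal sketch of clustering-based outlier filtering, with plain k-means standing
          in for the paper's hybrid fuzzy k-means; the cluster count, threshold, and synthetic
          data are assumptions.

            import numpy as np
            from sklearn.cluster import KMeans

            rng = np.random.default_rng(0)
            X = rng.normal(size=(10_000, 6))        # SOC, voltage, current, temp, speed, ...

            km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
            dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
            keep = dist < np.quantile(dist, 0.98)   # drop the farthest 2% as outliers
            X_clean = X[keep]
            print(X.shape, X_clean.shape)
        -->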
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1667284</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1667284</link>
        <title><![CDATA[Intelligent leak monitoring of oil pipeline based on distributed temperature and vibration fiber signals]]></title>
        <pubDate>20 Nov 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Xiaobin Liang</dc:creator><dc:creator>Yonghong Deng</dc:creator><dc:creator>Yibin Wang</dc:creator><dc:creator>Hongtao Li</dc:creator><dc:creator>Weifeng Ma</dc:creator><dc:creator>Ke Wang</dc:creator><dc:creator>Junjie Ren</dc:creator><dc:creator>Ruijiao Ma</dc:creator><dc:creator>Shuai Zhang</dc:creator><dc:creator>Jiawei Liu</dc:creator><dc:creator>Wei Wu</dc:creator>
        <description><![CDATA[Due to long-term usage, natural disasters, and human factors, pipeline leaks or ruptures may occur, with serious consequences. It is therefore of great significance to monitor pipelines and detect leaks in real time. Mainstream methods for pipeline leak monitoring mostly rely on a single signal, which has significant limitations: temperature alone is susceptible to environmental temperature interference, leading to misjudgment, while vibration alone is affected by pipeline operating noise. To address this, this research built a distributed optical fiber system as an experimental platform for temperature and vibration monitoring, obtaining 3,530 sets of real-time synchronized spatial-temporal temperature and vibration signals. A dual-parameter fusion residual neural network was constructed to extract characteristic signals from the original spatial-temporal temperature and vibration signals obtained from the monitoring system, achieving a classification accuracy of 92.16% for pipeline leak status and a leak localization accuracy of 1 m. This addresses the insufficient feature extraction and weak anti-interference ability of single-signal monitoring: by fusing the original temperature and vibration signals, more leakage features can be extracted. Compared with single-signal monitoring, this study therefore improves the accuracy of leak identification and localization, reduces misjudgments caused by single-signal interference, and provides a basis for pipeline leak monitoring and real-time warning in the oil industry.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1686479</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1686479</link>
        <title><![CDATA[Enhanced SQL injection detection using chi-square feature selection and machine learning classifiers]]></title>
        <pubDate>19 Nov 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Emanuel Casmiry</dc:creator><dc:creator>Neema Mduma</dc:creator><dc:creator>Ramadhani Sinde</dc:creator>
        <description><![CDATA[In the face of increasing cyberattacks, Structured Query Language (SQL) injection remains one of the most common and damaging types of web threats, accounting for over 20% of global cyberattack costs. However, due to its dynamic and variable nature, current detection methods often suffer from high false positive rates and low accuracy. This study proposes an enhanced SQL injection detection approach using chi-square feature selection (FS) and machine learning models. A combined dataset was assembled by merging a custom dataset with the SQLiV3.csv file from the Kaggle repository. A Jensen–Shannon divergence (JSD) analysis revealed moderate domain variation (overall JSD = 0.5775), with class-wise divergence of 0.1340 for SQLi and 0.5320 for benign queries. Term Frequency-Inverse Document Frequency (TF-IDF) was used to convert SQL queries into feature vectors, followed by chi-square feature selection to retain the most statistically significant features. Five classifiers, namely multinomial Naïve Bayes, support vector machine, logistic regression, decision tree, and K-nearest neighbors, were tested before and after feature selection. The results reveal that chi-square feature selection improves classification performance across all models by reducing noise and eliminating redundant features. Notably, the Decision Tree and K-Nearest Neighbors (KNN) models, which initially performed poorly, showed substantial improvements after feature selection. The Decision Tree improved from being the second-worst performer before feature selection to the best classifier afterward, achieving the highest accuracy of 99.73%, precision of 99.72%, recall of 99.70%, F1-score of 99.71%, a false positive rate (FPR) of 0.25%, and a misclassification rate of 0.27%. These findings highlight the crucial role of feature selection in high-dimensional data environments. Future research will investigate how feature selection impacts deep learning architectures, adaptive feature selection, incremental learning, and robustness against adversarial attacks, and will evaluate model transferability across production web environments to ensure real-time detection reliability, establishing feature selection as a vital step in developing reliable SQL injection detection systems.]]></description>
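        <!--
          The described pipeline maps onto standard scikit-learn components; a minimal sketch
          where the analyzer settings, k, and the toy queries are assumptions.

            from sklearn.pipeline import Pipeline
            from sklearn.feature_extraction.text import TfidfVectorizer
            from sklearn.feature_selection import SelectKBest, chi2
            from sklearn.tree import DecisionTreeClassifier

            queries = [
                "SELECT name FROM users WHERE id = 42",
                "' OR '1'='1' ; DROP TABLE users",   # classic injection payload
            ]
            labels = [0, 1]                          # 0 = benign, 1 = SQLi

            clf = Pipeline([
                ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
                ("chi2", SelectKBest(chi2, k=500)),
                ("tree", DecisionTreeClassifier(random_state=0)),
            ])
            # clf.fit(train_queries, train_labels)   # fit on the real, merged dataset
        -->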
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1682984</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1682984</link>
        <title><![CDATA[Enhancing Bangla handwritten character recognition using Vision Transformers, VGG-16, and ResNet-50: a performance analysis]]></title>
        <pubDate>14 Nov 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>A. H. M. Shahariar Parvez</dc:creator><dc:creator>Md. Samiul Islam</dc:creator><dc:creator>Fahmid Al Farid</dc:creator><dc:creator>Tashida Yeasmin</dc:creator><dc:creator>Md. Monirul Islam</dc:creator><dc:creator>Md. Shafiul Azam</dc:creator><dc:creator>Jia Uddin</dc:creator><dc:creator>Hezerul Abdul Karim</dc:creator>
        <description><![CDATA[Bangla Handwritten Character Recognition (BHCR) remains challenging due to a complex alphabet and handwriting variations. In this study, we present a comparative evaluation of three deep learning architectures, Vision Transformer (ViT), VGG-16, and ResNet-50, on the CMATERdb 3.1.2 dataset comprising 24,000 images of 50 basic Bangla characters. Our work highlights the effectiveness of ViT in capturing global context and long-range dependencies, leading to improved generalization. Experimental results show that ViT achieves a state-of-the-art accuracy of 98.26%, outperforming VGG-16 (94.54%) and ResNet-50 (93.12%). We also analyze model behavior, discuss overfitting in CNNs, and provide insights into character-level misclassifications. This study demonstrates the potential of transformer-based architectures for robust BHCR and offers a benchmark for future research.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1676477</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1676477</link>
        <title><![CDATA[LLM-supported collaborative ontology design for data and knowledge management platforms]]></title>
        <pubDate>12 Nov 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Janis Kampars</dc:creator><dc:creator>Guntis Mosans</dc:creator><dc:creator>Tushar Jogi</dc:creator><dc:creator>Franz Roters</dc:creator><dc:creator>Napat Vajragupta</dc:creator>
        <description><![CDATA[The management of vast, heterogeneous, and multidisciplinary data presents a critical challenge across scientific domains, hindering interoperability and slowing scientific progress. This paper addresses this challenge by presenting a pragmatic extension to the NeOn iterative ontology engineering framework, a well-established methodology for collaborative ontology design, which integrates Large Language Models (LLMs) to accelerate key tasks while retaining domain expert-in-the-loop validation. The methodology was applied within the HyWay project, an EU-funded research initiative on hydrogen-materials interactions, to develop the Hydrogen-Material Interaction Ontology (HMIO), a domain-specific ontology covering 29 experimental methods and 14 simulation types for assessing interactions between hydrogen and advanced metallic materials. A key result is the successful integration of the HMIO into a Data and Knowledge Management Platform (DKMP), where it drives the automated generation of data entry forms, ensuring that all captured data is Findable, Accessible, Interoperable, and Reusable (FAIR) and HMIO compliant by design. The validation of this approach demonstrates that this hybrid human-machine workflow for ontology engineering and further integration with the DKMP is an effective and efficient strategy for creating and operationalising complex scientific ontologies, thereby providing a scalable solution to advance data-driven research in materials science and other complex scientific domains.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1705587</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1705587</link>
        <title><![CDATA[PHTFNet-RPM: a probabilistic hybrid network with RPM for tobacco root disease forecasting]]></title>
        <pubDate>10 Nov 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Yunhong Bu</dc:creator><dc:creator>Tingshan Yao</dc:creator><dc:creator>Shaowu Geng</dc:creator><dc:creator>Renjie Huang</dc:creator>
        <description><![CDATA[Introduction: Tobacco growers face particular challenges in predicting the risk of tobacco root diseases due to complex pathogenesis, concealed early symptoms, and heterogeneous farm conditions.
Methods: To address this problem, we proposed a flexible Probabilistic Hybrid Temporal Fusion Network with Random Period Mask (PHTFNet-RPM), designed to forecast multi-day disease incidences and indices. It incorporates a hybrid input structure that handles configurable static management variables and time-series data of weather factors and disease metrics, using the RPM to simulate diverse absences of historical observations. The model's hierarchically aggregated internal modules learn cross-variable and cross-temporal feature representations to model complex non-linear relationships. Furthermore, probability theory-based uncertainty quantification is built in to enhance the model's credibility and reliability.
Results: The proposed PHTFNet-RPM was validated on a large-scale time-series dataset of tobacco root diseases compiled from 20 years of meteorological and disease survey records in Chuxiong Prefecture, Yunnan Province. Extensive comparative experiments demonstrated that our model achieves a 4.44%–16.43% lower mean absolute error (MAE) than existing models (including LR, SVR, CNN-LSTM, and LSTM-Attention).
Discussion: The results confirm that the model can reliably forecast disease progression trends under different configurations, even when relying solely on historical weather observations. The integrated uncertainty quantification provides a robust tool for assessing prediction reliability, offering significant practical value for disease management.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1604887</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1604887</link>
        <title><![CDATA[Finding the needle in the haystack—An interpretable sequential pattern mining method for classification problems]]></title>
        <pubDate>24 Oct 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <dc:creator>Alexander Grote</dc:creator><dc:creator>Anuja Hariharan</dc:creator><dc:creator>Christof Weinhardt</dc:creator>
        <description><![CDATA[Introduction: The analysis of discrete sequential data, such as event logs and customer clickstreams, is often challenged by the vast number of possible sequential patterns. This complexity makes it difficult to identify meaningful sequences and derive actionable insights.
Methods: We propose a novel feature selection algorithm that integrates unsupervised sequential pattern mining with supervised machine learning. Unlike existing interpretable machine learning methods, we determine important sequential patterns during the mining process, eliminating the need for post-hoc classification to assess their relevance. In contrast to existing interestingness measures, we introduce a local, class-specific interestingness measure that is inherently interpretable.
Results: We evaluated the algorithm on three diverse datasets (churn prediction, malware sequence analysis, and a synthetic dataset) covering different sizes, application domains, and feature complexities. Our method achieved classification performance comparable to established feature selection algorithms while maintaining interpretability and reducing computational costs.
Discussion: This study demonstrates a practical and efficient approach for uncovering important sequential patterns in classification tasks. By combining interpretability with competitive predictive performance, our algorithm provides practitioners with an interpretable and efficient alternative to existing methods, paving the way for new advances in sequential data analysis.]]></description>
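        <!--
          The abstract does not spell out the interestingness measure, so the sketch below
          only illustrates the general notion of a local, class-specific statistic: a
          pattern's support in the target class contrasted with its support elsewhere.

            def support(pattern, sequences):
                """Fraction of sequences containing `pattern` as a subsequence."""
                def contains(seq):
                    it = iter(seq)
                    return all(sym in it for sym in pattern)
                return sum(contains(s) for s in sequences) / max(len(sequences), 1)

            def class_interest(pattern, pos_seqs, neg_seqs):
                """Support in the target class minus support elsewhere (range [-1, 1])."""
                return support(pattern, pos_seqs) - support(pattern, neg_seqs)

            churned = [["login", "search", "cancel"], ["login", "cancel"]]
            retained = [["login", "search", "buy"], ["login", "buy", "buy"]]
            print(class_interest(("login", "cancel"), churned, retained))  # 1.0
        -->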
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1623883</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1623883</link>
        <title><![CDATA[Study on coal and gas outburst prediction technology based on multi-model fusion]]></title>
        <pubDate>20 Oct 2025 00:00:00 +0000</pubDate>
        <category>Methods</category>
        <dc:creator>Qian Xie</dc:creator><dc:creator>Junsheng Yan</dc:creator><dc:creator>Zhenhua Dai</dc:creator><dc:creator>Wengang Du</dc:creator><dc:creator>Xuefei Wu</dc:creator>
        <description><![CDATA[The rapid advancement of artificial intelligence (AI) and machine learning (ML) technologies has opened up novel avenues for predicting coal and gas outbursts in coal mines. This study proposes a novel prediction framework that integrates advanced AI methodologies through a multi-model fusion strategy based on ensemble learning and model stacking. The proposed model leverages the diverse data interpretation capabilities and distinct training mechanisms of various algorithms, thereby capitalizing on the complementary strengths of each constituent learner. Specifically, a stacking-based ensemble model is constructed, incorporating Support Vector Machines (SVM), Random Forests (RF), and k-Nearest Neighbors (KNN) as base learners. An attention mechanism is then employed to adaptively weight the outputs of these base learners, harnessing their complementary strengths. The meta-learner, primarily built upon the XGBoost algorithm, integrates these weighted outputs to generate the final prediction. The model's performance is rigorously evaluated using real-world coal and gas outburst data collected from a mine in Pingdingshan, China, with evaluation metrics including the F1-score and other standard classification indicators. The results reveal that individual models, such as XGBoost, SVM, and RF, can effectively quantify input feature importance through their inherent mechanisms. Furthermore, the ensemble model significantly outperforms single-model approaches, particularly when the base learners are both strong and mutually uncorrelated. The proposed ensemble framework achieves a markedly higher F1-score, demonstrating its robustness and effectiveness in the complex task of coal and gas outburst prediction.]]></description>
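        <!--
          A minimal sketch of stacking SVM, RF, and KNN base learners under a boosted
          meta-learner, with scikit-learn's GradientBoostingClassifier standing in for
          XGBoost; the paper's attention-based weighting of base outputs is not reproduced.

            from sklearn.ensemble import (GradientBoostingClassifier,
                                          RandomForestClassifier, StackingClassifier)
            from sklearn.neighbors import KNeighborsClassifier
            from sklearn.pipeline import make_pipeline
            from sklearn.preprocessing import StandardScaler
            from sklearn.svm import SVC

            stack = StackingClassifier(
                estimators=[
                    ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
                    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
                    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
                ],
                final_estimator=GradientBoostingClassifier(random_state=0),
                stack_method="predict_proba", cv=5,
            )
            # stack.fit(X_train, y_train); stack.score(X_test, y_test)
        -->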
      </item>
      </channel>
    </rss>