<?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
      <channel>
        <title>Frontiers in Big Data | Data Science section | New and Recent Articles</title>
        <link>https://www.frontiersin.org/journals/big-data/sections/data-science</link>
        <description>RSS Feed for Data Science section in the Frontiers in Big Data journal | New and Recent Articles</description>
        <language>en-us</language>
        <generator>Frontiers Feed Generator,version:1</generator>
        <pubDate>Thu, 16 Apr 2026 18:31:32 +0000</pubDate>
        <ttl>60</ttl>
        <item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2026.1737043</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2026.1737043</link>
        <title><![CDATA[Fairer non-negative matrix factorization]]></title>
        <pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Lara Kassab</author><author>Erin George</author><author>Deanna Needell</author><author>Haowen Geng</author><author>Nika Jafar Nia</author><author>Aoxi Li</author>
        <description><![CDATA[There is a critical need to study fairness and bias in machine learning (ML) algorithms. Since there is clearly no one-size-fits-all solution to fairness, ML methods should be developed alongside bias mitigation strategies that are practical and approachable for the practitioner. Motivated by recent work on “fair” PCA, here we consider the more challenging method of non-negative matrix factorization (NMF), both as a showcasing example and as a method that is important in its own right for topic modeling and for feature extraction in other ML tasks. We demonstrate that a modification of the objective function, using a min-max formulation, may sometimes offer an improvement in fairness across groups in the population. We derive two methods for the objective minimization, a multiplicative update rule and an alternating minimization scheme, and discuss implementation practicalities. We include a suite of synthetic and real experiments that show how the method may improve fairness, while also highlighting that it may sometimes increase error for some individuals, that fairness is not a rigid definition, and that method choice should depend strongly on the application at hand.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1723155</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1723155</link>
        <title><![CDATA[Big data approaches to bovine bioacoustics: a FAIR-compliant dataset and scalable ML framework for precision livestock welfare]]></title>
        <pubDate>Fri, 16 Jan 2026 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Mayuri Kate</author><author>Suresh Neethirajan</author>
        <description><![CDATA[The convergence of IoT sensing, edge computing, and machine learning is revolutionizing precision livestock farming. Yet bioacoustic data streams remain underexploited due to computational-complexity and ecological-validity challenges. We present one of the most comprehensive bovine vocalization datasets to date: 569 expertly curated clips spanning 48 behavioral classes, recorded across three commercial dairy farms using multi-microphone arrays and expanded to 2,900 samples through domain-informed data augmentation. This FAIR-compliant resource addresses key Big Data challenges: volume (90 h of raw recordings, 65.6 GB), variety (multi-farm, multi-zone acoustic environments), velocity (real-time processing requirements), and veracity (noise-robust feature-extraction pipelines). A modular data-processing workflow combines denoising (implemented both in iZotope RX 11 for quality control and in an equivalent open-source Python pipeline using noisereduce), multi-modal synchronization (audio-video alignment), and standardized feature engineering (24 acoustic descriptors via Praat, librosa, and openSMILE) to enable scalable welfare monitoring. Preliminary machine-learning benchmarks reveal distinct class-wise acoustic signatures across estrus detection, distress classification, and maternal-communication recognition. The dataset's ecological realism, embracing authentic barn acoustics rather than controlled conditions, ensures deployment-ready model development. This work establishes the foundation for animal-centered AI, where bioacoustic streams enable continuous, non-invasive welfare assessment at industrial scale. By releasing a Zenodo-hosted, FAIR-compliant dataset (restricted access) and an open-source preprocessing pipeline on GitHub, together with comprehensive metadata schemas, we advance reproducible research at the intersection of Big Data analytics, sustainable agriculture, and precision livestock management. The framework directly supports UN SDG 9, demonstrating how data science can transform traditional farming into intelligent, welfare-optimized production systems capable of meeting global food demands while maintaining ethical animal-care standards.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1648730</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1648730</link>
        <title><![CDATA[Structure and dynamics mapping of illicit firearms trafficking using artificial intelligence models]]></title>
        <pubDate>Thu, 25 Sep 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Willy A. Valdivia-Granda</author>
        <description><![CDATA[Illicit firearms trafficking imposes severe social and economic costs, eroding public safety, distorting markets, and weakening state capacity while affecting vulnerable populations. Despite its profound consequences for global health, trade, and security, the network structure and dynamics of illicit firearms trafficking remain among the most elusive dimensions of transnational organized crime. News reports documenting these events are fragmented across countries, languages, and outlets with different levels of quality and bias. Motivated by the disproportionate impact in Latin America, this study operationalizes the International Classification of Crime for Statistical Purposes (ICCS) to convert multilingual news into structured and auditable indicators through a three-part analytic pipeline using a BERT architecture and zero-shot prompts for entity resolution. This analytical approach generated outputs enriched with named entities, geocodes, and timestamps, stored as structured JSON to enable reproducible analysis. The results of this implementation identified 8,171 firearms trafficking reports published from 2014 through July 2024. The number of firearms-related reports rose sharply over the decade: incidents increased roughly tenfold, and the geographic footprint expanded from about twenty to more than eighty countries, with a one hundred fifty-five percent increase from 2022 to 2023. Correlation analysis links firearms trafficking to twelve other ICCS Level 1 categories, including drug trafficking, human trafficking, homicide, terrorism, and environmental crimes. Entity extraction and geocoding show a clear maritime bias; ports are referenced about six times more often than land or air routes. The analysis yielded eighty-five distinct points of entry or exit and forty-one named transnational criminal organizations, though attribution appears in only about forty percent of reports. This is the first automated and multilingual application of ICCS to firearms trafficking using modern language technologies. The outputs enable early warning through signals associated with ICCS categories, cross-border coordination focused on recurrent routes and high-risk ports, and evaluation of interventions. In short, embedding ICCS in a reproducible pipeline transforms fragmented media narratives into comparable evidence for strategic, tactical, and operational environments.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1640539</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1640539</link>
        <title><![CDATA[Enhancing intelligence source performance management through two-stage stochastic programming and machine learning techniques]]></title>
        <pubDate>Mon, 22 Sep 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Lucas Wafula Wekesa</author><author>Stephen Korir</author>
        <description><![CDATA[Introduction: The effectiveness of intelligence operations depends heavily on the reliability and performance of human intelligence (HUMINT) sources. Yet, source behavior is often unpredictable, deceptive, or shaped by operational context, complicating resource allocation and tasking decisions. Methods: This study developed a hybrid framework combining Machine Learning (ML) techniques and Two-Stage Stochastic Programming (TSSP) for HUMINT source performance management under uncertainty. A synthetic dataset reflecting HUMINT operational patterns was generated and used to train classification and regression models. Extreme Gradient Boosting (XGBoost) and Support Vector Machines (SVM) were applied for behavioral classification and prediction of reliability and deception scores. The predictive outputs were then transformed into scenario probabilities and integrated into the TSSP model to optimize task allocation under varying behavioral uncertainties. Results: The classifiers achieved 98% overall accuracy, with XGBoost exhibiting higher precision and SVM demonstrating superior recall for rare but operationally significant categories. The regression models achieved R-squared scores of 93% for reliability and 81% for deception. These predictive outputs were transformed into scenario probabilities for integration into the TSSP model, optimizing task allocation under varying behavioral risks. When compared to a deterministic optimization baseline, the hybrid framework delivered a 16.8% reduction in expected tasking costs and a 19.3% improvement in mission success rates. Discussion and conclusion: The findings demonstrated that scenario-based probabilistic planning offers significant advantages over static heuristics in managing uncertainty in HUMINT operations. While the simulation results are promising, validation through field data is required before operational deployment.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1599704</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1599704</link>
        <title><![CDATA[Collaborative filtering based on nonnegative/binary matrix factorization]]></title>
        <pubDate>Tue, 29 Jul 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Yukino Terui</author><author>Yuka Inoue</author><author>Yohei Hamakawa</author><author>Kosuke Tatsumura</author><author>Kazue Kudo</author>
        <description><![CDATA[Collaborative filtering generates recommendations by exploiting user-item similarities based on rating data, which often contains numerous unrated items. To predict scores for unrated items, matrix factorization techniques such as nonnegative matrix factorization (NMF) are often employed. Nonnegative/binary matrix factorization (NBMF), which is an extension of NMF, approximates a nonnegative matrix as the product of nonnegative and binary matrices. While previous studies have applied NBMF primarily to dense data such as images, this paper proposes a modified NBMF algorithm tailored for collaborative filtering with sparse data. In the modified method, unrated entries in the rating matrix are masked, enhancing prediction accuracy. Furthermore, utilizing a low-latency Ising machine in NBMF is advantageous in terms of computation time, which adds to the practicality of the proposed method.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1596615</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1596615</link>
        <title><![CDATA[Conceptualization and scale development for big data-based learning organization capability]]></title>
        <pubDate>Thu, 19 Jun 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Nesrin Alkan</author><author>Deniz Ersan Yilmaz</author><author>Bilal Baris Alkan</author>
        <description><![CDATA[Introduction: In today's competitive business landscape, organizations must enhance learning and adaptability to gain a strategic edge. While big data significantly influences organizational learning, a comprehensive tool to measure this capability has been lacking in the literature. This study aims to develop a valid and reliable scale to assess big data-based learning organization capability. Methods: A two-phase research design was employed. In the first phase, Exploratory Factor Analysis (EFA) was conducted on data collected from 232 managers, identifying 22 items across three underlying factors. In the second phase, Confirmatory Factor Analysis (CFA) was applied to an independent sample (n = 128) to validate the scale's structure and its alignment with the theoretical model. Results: The EFA results revealed a clear three-factor structure, and the CFA confirmed the model's fit to the data, demonstrating good psychometric properties. The final big data-based learning organization capability (BD-LOC) scale shows high internal consistency and construct validity. Discussion: The BD-LOC scale provides organizations with a valuable tool to assess their big data-driven learning capabilities. It supports strategic decision-making, fosters innovation, and enhances operational efficiency. This study fills a significant gap in the literature and contributes to the effective implementation of digital transformation strategies in organizations.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1569623</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1569623</link>
        <title><![CDATA[The climate gluing protests: analyzing their development and framing in media since 1986 using sentiment analyses and frame detection models]]></title>
        <pubDate>Mon, 19 May 2025 00:00:00 +0000</pubDate>
        <category>Brief Research Report</category>
        <author>Markus Hadler</author><author>Alexander Ertl</author><author>Beate Klösch</author><author>Markus Reiter-Haas</author><author>Elisabeth Lex</author>
        <description><![CDATA[Recent climate-related protests by social movements such as Extinction Rebellion, Just Stop Oil, and others have included actions like defacing artwork and gluing oneself to objects and streets. Using sentiment analysis and frame detection models, we analyze a corpus of all available English-language news articles in LexisNexis, with the first recorded instance of a gluing protest appearing in 1986. Our study traces the development of this protest tactic over time and addresses three central questions from social movement literature: the use of glue in protests, the geographical spread of this tactic, and the framing of these actions. We find that gluing protests were initially associated with a range of issues—including abortion, criminal justice, and environmental concerns—but in recent years have become more strongly linked to climate activism. Media coverage of these protests is predominantly negative, although public media tends to be comparatively less so. Moreover, protesters' prognostic frames—suggestions for what should be done—are relatively rare, with discourse more often centering on policy and security concerns. From a data science perspective, we explore the use of various Natural Language Processing (NLP) methods. The discussion and conclusion section highlights challenges encountered when working with our corpus and NLP models, and suggests ways to address them in future research. We also consider how recent advancements in large language models (LLMs) could refine or extend these analyses while acknowledging important concerns related to their use.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1542483</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1542483</link>
        <title><![CDATA[An oversampling-undersampling strategy for large-scale data linkage]]></title>
        <pubDate>Wed, 23 Apr 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Hossein Hassani</author><author>Mohammad Reza Entezarian</author><author>Sara Zaeimzadeh</author><author>Leila Marvian</author><author>Nadejda Komendantova</author>
        <description><![CDATA[Effective record linkage in big data, particularly in imbalanced datasets, is a critical yet highly challenging task due to the inherent complexity involved. This article utilizes an oversampling-undersampling strategy to address linkage imbalances, enabling more accurate and efficient record linkage within large-scale datasets. The strategy increases the instances of the minority class and reduces the dominance of the majority classes to produce a more balanced dataset for training and testing. Sensitivity testing was carried out by varying the training-test ratio and the degree of imbalance.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1455442</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1455442</link>
        <title><![CDATA[Impact of imbalanced features on large datasets]]></title>
        <pubDate>Thu, 13 Mar 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Waleed Albattah</author><author>Rehan Ullah Khan</author>
        <description><![CDATA[The exponential growth of image and video data motivates the need for practical real-time content-based searching algorithms. Features play a vital role in identifying objects within images. However, feature-based classification faces a challenge due to uneven class instance distribution. Ideally, each class should have an equal number of instances and features to ensure optimal classifier performance. However, real-world scenarios often exhibit class imbalances. Thus, this article explores the classification framework based on image features, analyzing balanced and imbalanced distributions. Through extensive experimentation, we examine the impact of class imbalance on image classification performance, primarily on large datasets. The comprehensive evaluation shows that all models perform better with balanced data than with an imbalanced dataset, underscoring the importance of dataset balancing for model accuracy. Distributed Gaussian (D-GA) and Distributed Poisson (D-PO) are found to be the most effective techniques, especially in improving Random Forest (RF) and SVM models. The deep learning experiments also show a similar improvement.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2025.1477911</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2025.1477911</link>
        <title><![CDATA[Balancing act: Europeans' privacy calculus and security concerns in online CSAM detection]]></title>
        <pubDate>Wed, 22 Jan 2025 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Răzvan Rughiniş</author><author>Simona-Nicoleta Vulpe</author><author>Dinu Ţurcanu</author><author>Daniel Rosner</author>
        <description><![CDATA[This study examines privacy calculus in online child sexual abuse material (CSAM) detection across Europe, using Flash Eurobarometer 532 data. Drawing on theories of structuration and risk society, we analyze how individual agency and institutional frameworks interact in shaping privacy attitudes in high-stakes digital scenarios. Multinomial regression reveals age as a significant individual-level predictor, with younger individuals prioritizing privacy more. Country-level analysis shows Central and Eastern European nations have higher privacy concerns, reflecting distinct institutional and cultural contexts. Notably, the Digital Economy and Society Index (DESI) shows a positive association with privacy concerns in regression models when controlling for Augmented Human Development Index (AHDI) components, contrasting its negative bivariate correlation. Life expectancy emerges as the strongest country-level predictor, negatively associated with privacy concerns, suggesting deep institutional mechanisms shape privacy attitudes beyond individual factors. This dual approach reveals that both individual factors and national contexts are shaping privacy calculus in CSAM detection. The study contributes to a better understanding of privacy calculus in high-stakes scenarios, with implications for policy development in online child protection.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2024.1466391</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2024.1466391</link>
        <title><![CDATA[A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning]]></title>
        <pubDate>Tue, 21 Jan 2025 00:00:00 +0000</pubDate>
        <category>Technology and Code</category>
        <author>Shivika Prasanna</author><author>Ajay Kumar</author><author>Deepthi Rao</author><author>Eduardo J. Simoes</author><author>Praveen Rao</author>
        <description><![CDATA[Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and analyze the rich genomic data, knowledge graphs (KGs) and graph machine learning (GML) can be regarded as enabling technologies. In this article, we present a scalable tool called VariantKG for analyzing genomic variants of humans modeled using KGs and GML. Specifically, we used publicly available genome sequencing data from patients with COVID-19. VariantKG extracts variant-level genetic information output by a variant calling pipeline, annotates the variant data with additional metadata, and converts the annotated variant information into a KG represented using the Resource Description Framework (RDF). The resulting KG is further enhanced with patient metadata and stored in a scalable graph database that enables efficient RDF indexing and query processing. VariantKG employs the Deep Graph Library (DGL) to perform GML tasks such as node classification. A user can extract a subset of the KG and perform inference tasks using DGL. The user can monitor the training and testing performance and hardware utilization. We tested VariantKG for KG construction by using 1,508 genome sequences, leading to 4 billion RDF statements. We evaluated GML tasks using VariantKG by selecting a subset of 500 sequences from the KG and performing node classification using well-known GML techniques such as GraphSAGE, Graph Convolutional Network (GCN) and Graph Transformer. VariantKG has intuitive user interfaces and features enabling a low barrier to entry for KG construction, model inference, and model interpretation on genomic variants of humans.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2024.1501154</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2024.1501154</link>
        <title><![CDATA[Enhancing sentiment and intent analysis in public health via fine-tuned Large Language Models on tobacco and e-cigarette-related tweets]]></title>
        <pubDate>Thu, 28 Nov 2024 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Sherif Elmitwalli</author><author>John Mehegan</author><author>Allen Gallagher</author><author>Raouf Alebshehy</author>
        <description><![CDATA[Background: Accurate sentiment analysis and intent categorization of tobacco and e-cigarette-related social media content are critical for public health research, yet they necessitate specialized natural language processing approaches. Objective: To compare pre-trained and fine-tuned Flan-T5 models for intent classification and sentiment analysis of tobacco and e-cigarette tweets, demonstrating the effectiveness of fine-tuning a lightweight large language model for domain-specific tasks. Methods: Three Flan-T5 classification models were developed: (1) tobacco intent, (2) e-cigarette intent, and (3) sentiment analysis. Domain-specific datasets with tobacco and e-cigarette tweets were created using GPT-4 and validated by tobacco control specialists using a rigorous evaluation process. A standardized rubric and consensus mechanism involving domain specialists ensured high-quality datasets. The Flan-T5 Large Language Models were fine-tuned using Low-Rank Adaptation and evaluated against pre-trained baselines on the datasets using accuracy performance metrics. To further assess model generalizability and robustness, the fine-tuned models were evaluated on real-world tweets collected around the COP9 event. Results: In every task, fine-tuned models performed much better than pre-trained models. Compared to the pre-trained model's accuracy of 0.33, the fine-tuned model achieved an overall accuracy of 0.91 for tobacco intent classification. The fine-tuned model achieved an accuracy of 0.93 for e-cigarette intent, which is higher than the accuracy of 0.36 for the pre-trained model. The fine-tuned model significantly outperformed the pre-trained model's accuracy of 0.65 in sentiment analysis, achieving an accuracy of 0.94 for sentiment classification. Conclusion: The effectiveness of lightweight Flan-T5 models in analyzing tweets associated with tobacco and e-cigarettes is significantly improved by domain-specific fine-tuning, providing highly accurate instruments for tracking public conversation on tobacco and e-cigarettes. The involvement of domain specialists in dataset validation ensured that the generated content accurately represented real-world discussions, thereby enhancing the quality and reliability of the results. Research on tobacco control and the formulation of public policy could be informed by these findings.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2024.1469981</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2024.1469981</link>
        <title><![CDATA[Prediction and classification of obesity risk based on a hybrid metaheuristic machine learning approach]]></title>
        <pubDate>Mon, 30 Sep 2024 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Zarindokht Helforoush</author><author>Hossein Sayyad</author>
        <description><![CDATA[Introduction: As the global prevalence of obesity continues to rise, it has become a major public health concern requiring more accurate prediction methods. Traditional regression models often fail to capture the complex interactions between genetic, environmental, and behavioral factors contributing to obesity. Methods: This study explores the potential of machine-learning techniques to improve obesity risk prediction. Various supervised learning algorithms, including the novel ANN-PSO hybrid model, were applied following comprehensive data preprocessing and evaluation. Results: The proposed ANN-PSO model achieved a remarkable accuracy rate of 92%, outperforming traditional regression methods. SHAP was employed to analyze feature importance, offering deeper insights into the influence of various factors on obesity risk. Discussion: The findings highlight the transformative role of advanced machine-learning models in public health research, offering a pathway for personalized healthcare interventions. By providing detailed obesity risk profiles, these models enable healthcare providers to tailor prevention and treatment strategies to individual needs. The results underscore the need to integrate innovative machine-learning approaches into global public health efforts to combat the growing obesity epidemic.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2024.1441869</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2024.1441869</link>
        <title><![CDATA[When we talk about Big Data, What do we really mean? Toward a more precise definition of Big Data]]></title>
        <pubDate>Tue, 10 Sep 2024 00:00:00 +0000</pubDate>
        <category>Review</category>
        <author>Xiaoyao Han</author><author>Oskar Josef Gstrein</author><author>Vasilios Andrikopoulos</author>
        <description><![CDATA[Despite the lack of consensus on an official definition of Big Data, research and studies have continued to progress based on this “no consensus” stance over the years. However, the lack of a clear definition and scope for Big Data results in scientific research and communication lacking a common ground. Even with the popular “V” characteristics, Big Data remains elusive. The term is broad and is used differently in research, often referring to entirely different concepts, which is rarely stated explicitly in papers. While many studies and reviews attempt to draw a comprehensive understanding of Big Data, there has been little systematic research on the position and practical implications of the term Big Data in research environments. To address this gap, this paper presents a Systematic Literature Review (SLR) on secondary studies to provide a comprehensive overview of how Big Data is used and understood across different scientific domains. Our objective was to monitor the application of the Big Data concept in science, identify which technologies are prevalent in which fields, and investigate the discrepancies between the theoretical understanding and practical usage of the term. Our study found that various Big Data technologies are being used in different scientific fields, including machine learning algorithms, distributed computing frameworks, and other tools. These manifestations of Big Data can be classified into four major categories: abstract concepts, large datasets, machine learning techniques, and the Big Data ecosystem. This study revealed that despite the general agreement on the “V” characteristics, researchers in different scientific fields have varied implicit understandings of Big Data. These implicit understandings significantly influence the content and discussions of studies involving Big Data, although they are often not explicitly stated. We call for a clearer articulation of the meaning of Big Data in research to facilitate smoother scientific communication.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2024.1363978</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2024.1363978</link>
        <title><![CDATA[Stable tensor neural networks for efficient deep learning]]></title>
        <pubDate>Thu, 30 May 2024 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Elizabeth Newman</author><author>Lior Horesh</author><author>Haim Avron</author><author>Misha E. Kilmer</author>
        <description><![CDATA[Learning from complex, multidimensional data has become central to computational mathematics, and among the most successful high-dimensional function approximators are deep neural networks (DNNs). Training DNNs is posed as an optimization problem to learn network weights or parameters that well-approximate a mapping from input to target data. Multiway data or tensors arise naturally in myriad ways in deep learning, in particular as input data and as high-dimensional weights and features extracted by the network, with the latter often being a bottleneck in terms of speed and memory. In this work, we leverage tensor representations and processing to efficiently parameterize DNNs when learning from high-dimensional data. We propose tensor neural networks (t-NNs), a natural extension of traditional fully-connected networks, that can be trained efficiently in a reduced, yet more powerful parameter space. Our t-NNs are built upon matrix-mimetic tensor-tensor products, which retain algebraic properties of matrix multiplication while capturing high-dimensional correlations. Mimeticity enables t-NNs to inherit desirable properties of modern DNN architectures. We exemplify this by extending recent work on stable neural networks, which interpret DNNs as discretizations of differential equations, to our multidimensional framework. We provide empirical evidence of the parametric advantages of t-NNs on dimensionality reduction using autoencoders and classification using fully-connected and stable variants on benchmark imaging datasets MNIST and CIFAR-10.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2024.1357926</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2024.1357926</link>
        <title><![CDATA[Sentiment analysis of COP9-related tweets: a comparative study of pre-trained models and traditional techniques]]></title>
        <pubDate>Wed, 20 Mar 2024 00:00:00 +0000</pubDate>
        <category>Original Research</category>
        <author>Sherif Elmitwalli</author><author>John Mehegan</author>
        <description><![CDATA[Introduction: Sentiment analysis has become a crucial area of research in natural language processing in recent years. The study aims to compare the performance of various sentiment analysis techniques, including lexicon-based, machine learning, Bi-LSTM, BERT, and GPT-3 approaches, using two commonly used datasets, IMDB reviews and Sentiment140. The objective is to identify the best-performing technique for an exemplar dataset, tweets associated with the WHO Framework Convention on Tobacco Control Ninth Conference of the Parties in 2021 (COP9). Methods: A two-stage evaluation was conducted. In the first stage, various techniques were compared on standard sentiment analysis datasets using standard evaluation metrics such as accuracy, F1-score, and precision. In the second stage, the best-performing techniques from the first stage were applied to partially annotated COP9 conference-related tweets. Results: In the first stage, BERT achieved the highest F1-scores (0.9380 for IMDB and 0.8114 for Sentiment140), followed by GPT-3 (0.9119 and 0.7913) and Bi-LSTM (0.8971 and 0.7778). In the second stage, GPT-3 performed the best for sentiment analysis on partially annotated COP9 conference-related tweets, with an F1-score of 0.8812. Discussion: The study demonstrates the effectiveness of pre-trained models like BERT and GPT-3 for sentiment analysis tasks, outperforming traditional techniques on standard datasets. Moreover, the better performance of GPT-3 on the partially annotated COP9 tweets highlights its ability to generalize well to domain-specific data with limited annotations. This provides researchers and practitioners with a viable option for using pre-trained models for sentiment analysis in scenarios with limited or no annotated data across different domains.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2023.1296508</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2023.1296508</link>
        <title><![CDATA[CTAB-GAN+: enhancing tabular data synthesis]]></title>
        <pubDate>Mon, 08 Jan 2024 00:00:00 +0000</pubDate>
        <category>Methods</category>
        <author>Zilong Zhao</author><author>Aditya Kunar</author><author>Robert Birke</author><author>Hiek Van der Scheer</author><author>Lydia Y. Chen</author>
        <description><![CDATA[The usage of synthetic data is gaining momentum, in part due to the unavailability of original data because of privacy and legal considerations, and in part due to its utility as an augmentation to authentic data. Generative adversarial networks (GANs), a paragon of generative models, initially for images and subsequently for tabular data, have contributed many of the state-of-the-art synthesizers. As GANs improve, the synthesized data increasingly resemble the real data, risking privacy leakage. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility. Striking the best trade-off remains a challenging research question. In this study, we propose CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon the state of the art by (i) adding downstream losses to the conditional GAN for higher-utility synthetic data in both classification and regression domains; (ii) using Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with unbalanced or skewed data; and (iv) training with DP stochastic gradient descent to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ on statistical similarity and machine learning utility against state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 21.9% higher machine learning utility (i.e., F1-score) across multiple datasets and learning tasks under a given privacy budget.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2023.1282541</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2023.1282541</link>
        <title><![CDATA[Hybridization of long short-term memory neural network in fractional time series modeling of inflation]]></title>
        <pubDate>Thu, 04 Jan 2024 00:00:00 +0000</pubDate>
        <category>Methods</category>
        <author>Erman Arif</author><author>Elin Herlinawati</author><author>Dodi Devianto</author><author>Mutia Yollanda</author><author>Dony Permana</author>
        <description><![CDATA[Inflation can significantly impact monetary policy, emphasizing the need for accurate forecasts to guide decisions aimed at stabilizing inflation rates. Given the significant relationship between inflation and monetary policy, it becomes feasible to detect long-memory patterns within the data. To capture these long-memory patterns, the Autoregressive Fractionally Integrated Moving Average (ARFIMA) model was developed as a valuable tool in data mining. Due to the challenges posed by the residual assumptions, the time series model had to be extended to address heteroscedasticity. Consequently, the implementation of a suitable model was imperative to rectify this effect within the ARFIMA residuals. In this context, a novel hybrid model was proposed, with Generalized Autoregressive Conditional Heteroscedasticity (GARCH) replaced by a Long Short-Term Memory (LSTM) neural network. The network was used as an iterative model to address this issue and achieve optimal parameters. Through a sensitivity analysis using mean absolute percentage error (MAPE), mean squared error (MSE), and mean absolute error (MAE), the performance of the ARFIMA, ARFIMA-GARCH, and ARFIMA-LSTM models was assessed. The results showed that ARFIMA-LSTM excelled in simulating the inflation rate. This provided further evidence that inflation data show characteristics of long memory, and the accuracy of the model was improved by integrating the LSTM neural network.]]></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2023.1344345</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2023.1344345</link>
        <title><![CDATA[Corrigendum: Towards an understanding of global brain data governance: ethical positions that underpin global brain data governance discourse]]></title>
        <pubDate>Tue, 19 Dec 2023 00:00:00 +0000</pubDate>
        <category>Correction</category>
        <author>Paschal Ochang</author><author>Damian Eke</author><author>Bernd Carsten Stahl</author>
        <description></description>
      </item><item>
        <guid isPermaLink="true">https://www.frontiersin.org/articles/10.3389/fdata.2023.1343108</guid>
        <link>https://www.frontiersin.org/articles/10.3389/fdata.2023.1343108</link>
        <title><![CDATA[Corrigendum: Do you hear the people sing? Comparison of synchronized URL and narrative themes in 2020 and 2023 French protests]]></title>
        <pubDate>Tue, 12 Dec 2023 00:00:00 +0000</pubDate>
        <category>Correction</category>
        <author>Lynnette Hui Xian Ng</author><author>Kathleen M. Carley</author>
        <description></description>
      </item>
      </channel>
    </rss>