
BRIEF RESEARCH REPORT article

Front. Bioinform.

Sec. Data Visualization

Volume 5 - 2025 | doi: 10.3389/fbinf.2025.1528515

This article is part of the Research Topic: Good Practice in Data Analysis and Integration

Adaptive sampling methods facilitate determining reliable data set sizes for evidence-based modeling

Provisionally accepted
  • Julius Maximilian University of Würzburg, Würzburg, Germany

The final, formatted version of the article will be published soon.

How can we be sure that there is sufficient data for a model, such that its predictions remain reliable on unseen data and the conclusions drawn from the fitted model would not change much if a different sample of the same size were used? We address these and related questions with a systematic approach that relates the data size to the accuracy gain achieved. Assuming the sample data are drawn from a data pool without data drift, the law of large numbers ensures that a model converges to its ground truth accuracy. The approach provides a heuristic method for investigating the speed of convergence with respect to the size of the data sample. Because this relation is estimated by sampling, the convergence speed results vary between runs. To stabilize the results so that conclusions do not depend on the run, and to extract the most reliable information about convergence speed encoded in the available data, the presented method automatically determines the number of repetitions needed to reduce the deviations between samplings below a predefined threshold, so that conclusions about the required amount of data are reliable.
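The idea of repeating the sampling until the results stabilize can be illustrated with a minimal sketch. It is not the authors' implementation. The synthetic one-dimensional data pool, the toy threshold classifier, the hypothetical function names (`make_pool`, `fit_threshold`, `adaptive_accuracy`), the parameter values, and the choice of the standard error of the mean as the deviation measure are all assumptions made for illustration.

```python
import random
import statistics


def make_pool(n, rng):
    """Synthetic two-class pool: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1) (assumption)."""
    pool = []
    for _ in range(n):
        label = rng.randint(0, 1)
        pool.append((rng.gauss(1.0 if label else -1.0, 1.0), label))
    return pool


def fit_threshold(train):
    """Toy model: decision threshold halfway between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if not xs0 or not xs1:
        # Degenerate sample containing one class only: fall back to the overall mean.
        return statistics.fmean([x for x, _ in train])
    return (statistics.fmean(xs0) + statistics.fmean(xs1)) / 2.0


def accuracy(threshold, data):
    """Fraction of points whose side of the threshold matches their label."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)


def adaptive_accuracy(pool, test, size, rng, tol=0.005, min_reps=10, max_reps=2000):
    """Draw subsamples of `size` until the standard error of the mean accuracy < tol."""
    scores = []
    while len(scores) < max_reps:
        sample = rng.sample(pool, size)  # sampling without replacement
        scores.append(accuracy(fit_threshold(sample), test))
        if len(scores) >= min_reps:
            sem = statistics.stdev(scores) / len(scores) ** 0.5
            if sem < tol:
                break
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return statistics.fmean(scores), sem, len(scores)


rng = random.Random(0)  # fixed seed so the run is reproducible
pool = make_pool(5000, rng)
test = make_pool(2000, rng)  # held-out data from the same distribution (no drift)

results = {size: adaptive_accuracy(pool, test, size, rng) for size in (10, 50, 250, 1000)}
for size, (mean_acc, sem, reps) in results.items():
    print(f"n={size:5d}  mean accuracy={mean_acc:.3f}  SEM={sem:.4f}  repetitions={reps}")
```

In this sketch, the number of repetitions is not fixed in advance. Each sample size is resampled until the estimated mean accuracy is stable to within `tol`, or until `max_reps` is reached. Comparing the mean accuracies across sizes then gives a rough picture of how quickly the accuracy gain levels off.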

Keywords: model reliability, data size estimation, stochastic convergence to ground truth properties, stability of sampling properties, reliable alternative hypothesis formulation

Received: 14 Nov 2024; Accepted: 08 Jul 2025.

Copyright: © 2025 Breitenbach and Dandekar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Tim Breitenbach, Julius Maximilian University of Würzburg, Würzburg, Germany
