ORIGINAL RESEARCH article
Front. Vet. Sci.
Sec. Veterinary Epidemiology and Economics
Volume 12 - 2025 | doi: 10.3389/fvets.2025.1584864
This article is part of the Research Topic: Sentinels of Health: Advancements in Monitoring and Surveillance of Vector-Borne Diseases in Domestic and Wild Animals and Vectors
Predicting vector distribution in Europe: At what sample size are species distribution models reliable?
Provisionally accepted
- 1Deanery of Biomedical Sciences, College of Medicine and Veterinary Medicine, University of Edinburgh, Edinburgh, United Kingdom
- 2Research & Development Department, Avia-GIS bvba (Belgium), Zoersel, Belgium
- 3Spatial Epidemiology Lab (SpELL), Université libre de Bruxelles, Brussels, Belgium
Introduction: Species distribution models can predict the spatial distribution of vector-borne diseases by forming associations between known vector distribution and environmental variables. In response to a changing climate and increasing rates of vector-borne diseases in Europe, model predictions for vector distribution can be used to improve surveillance. However, the field lacks standardisation, with little consensus as to what sample size produces reliable models.
Objective: To determine the optimum sample size for models developed with the Random Forest machine learning algorithm under different sample ratios.
Materials and methods: To overcome limitations with real vector data, a simulated vector with a fully known distribution in 10 test sites across Europe was used to randomly generate samples of different sizes. The test sites accounted for varying habitat suitability and the vector’s relative occurrence area. In total, 9000 Random Forest models were developed with 24 different sample sizes (between 10 and 5000) and three ratios of presence to absence data (50:50, 20:80, and 40:60). Model performance was evaluated using five metrics: percentage correctly classified, sensitivity, specificity, Cohen’s Kappa, and Area Under the Curve. The metrics were grouped by sample size and ratio. The optimum sample size was taken as the point at which the 25th percentile of the metric values met the thresholds for excellent performance, defined as 0.605–0.804 for Cohen’s Kappa and 0.795–0.894 for the remaining metrics (to three decimal places).
Results: For balanced sample ratios, the optimum sample size for reliable models fell within the range of 750–1000. Estimates increased to 1100–1300 for unbalanced samples with a 40:60 ratio of presence to absence data. By contrast, unbalanced samples with a 20:80 ratio of presence to absence data did not produce reliable models at any of the sample sizes considered.
Conclusion: To our knowledge, this is the first study to use a simulated vector to identify the optimum sample size for Random Forest models at this resolution (≤1 km²) and extent (≥10,000 km²). These results may improve the reliability of model predictions, optimise field sampling, and enhance vector surveillance in response to changing climates. Further research may seek to refine these estimates and confirm transferability to real vectors.
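The selection rule described in the methods can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that "meeting the threshold" means the 25th percentile reaches the lower bound of each excellent band (0.605 for Cohen's Kappa, 0.795 for the other metrics), and the names `confusion_metrics`, `first_quartile`, and `optimum_sample_size` are hypothetical. Area Under the Curve is omitted from `confusion_metrics` because it needs predicted probabilities rather than a confusion matrix.

```python
from statistics import quantiles

# Lower bounds of the "excellent" bands quoted in the abstract.
# ASSUMPTION: the lower bound is used as a pass/fail cut-off.
THRESHOLDS = {
    "pcc": 0.795,
    "sensitivity": 0.795,
    "specificity": 0.795,
    "kappa": 0.605,
    "auc": 0.795,
}


def confusion_metrics(tp, fp, tn, fn):
    """Four of the five metrics, from a presence/absence confusion matrix."""
    n = tp + fp + tn + fn
    pcc = (tp + tn) / n              # percentage correctly classified
    sensitivity = tp / (tp + fn)     # true presences correctly predicted
    specificity = tn / (tn + fp)     # true absences correctly predicted
    # Agreement expected by chance, needed for Cohen's Kappa
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (pcc - p_chance) / (1 - p_chance)
    return {
        "pcc": pcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


def first_quartile(values):
    """25th percentile using linear interpolation between data points."""
    return quantiles(values, n=4, method="inclusive")[0]


def optimum_sample_size(scores):
    """Smallest sample size whose 25th percentile passes every metric.

    scores maps sample size -> {metric name: [one value per fitted model]}.
    Returns None if no sample size passes (hypothetical convention).
    """
    for size in sorted(scores):
        passed = all(
            first_quartile(values) >= THRESHOLDS[metric]
            for metric, values in scores[size].items()
        )
        if passed:
            return size
    return None
```

Applied separately to each sample ratio, a rule of this kind would return `None` when no sample size passes, which mirrors the outcome reported for the 20:80 ratio.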
Keywords: vector-borne diseases, sample size, sample ratio, virtual species, species distribution model (SDM), machine learning, Random Forest, surveillance
Received: 05 Mar 2025; Accepted: 30 Apr 2025.
Copyright: © 2025 Mitchel, Hendrickx, MacLeod and Marsboom. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Cedric Marsboom, Research & Development Department, Avia-GIS bvba (Belgium), Zoersel, Belgium
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.