AUTHOR=Mitchel Lianne , Hendrickx Guy , MacLeod Ewan T. , Marsboom Cedric TITLE=Predicting vector distribution in Europe: at what sample size are species distribution models reliable? JOURNAL=Frontiers in Veterinary Science VOLUME=Volume 12 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/veterinary-science/articles/10.3389/fvets.2025.1584864 DOI=10.3389/fvets.2025.1584864 ISSN=2297-1769 ABSTRACT=IntroductionSpecies distribution models can predict the spatial distribution of vector-borne diseases by forming associations between known vector distribution and environmental variables. In response to a changing climate and increasing rates of vector-borne diseases in Europe, model predictions for vector distribution can be used to improve surveillance. However, the field lacks standardisation with little consensus as to what sample size produces reliable models.ObjectiveDetermine the optimum sample size for models developed with the machine learning algorithm, Random Forest, and different sample ratios.Materials and methodsTo overcome limitations with real vector data, a simulated vector with a fully known distribution in 10 test sites across Europe was used to randomly generate different samples sizes. The test sites accounted for varying habitat suitability and the vector’s relative occurrence area. 9,000 Random Forest models were developed with 24 different sample sizes (between 10–5,000) and three sample ratios with varying proportions of presence and absence data (50:50, 20:80, and 40:60, respectively). Model performance was evaluated using five metrics: percentage correctly classified, sensitivity, specificity, Cohen’s Kappa, and Area Under the Curve. The metrics were grouped by sample size and ratio. The optimum sample size was determined when the 25th percentile met thresholds for excellent performance, defined as: 0.605–0.804 for Cohen’s Kappa and 0.795–0.894 for the remaining metrics (to three decimal places).ResultsFor balanced sample ratios, the optimum sample size for reliable models fell within the range of 750–1,000. Estimates increased to 1,100–1,300 for unbalanced samples with a 40:60 ratio of presence and absence data, respectively. Comparatively, unbalanced samples with a 20:80 ratio of presence and absence data did not produce reliable models with any of the sample sizes considered.ConclusionTo our knowledge, this is the first study to use a simulated vector to identify the optimum sample size for Random Forest models at this resolution (≤1 km2) and extent (≥10,000 km2). These results may improve the reliability of model predictions, optimise field sampling, and enhance vector surveillance in response to changing climates. Further research may seek to refine these estimates and confirm transferability to real vectors.