AUTHOR=Mitchel Lianne, Hendrickx Guy, MacLeod Ewan T., Marsboom Cedric

TITLE=Predicting vector distribution in Europe: at what sample size are species distribution models reliable?

JOURNAL=Frontiers in Veterinary Science

VOLUME=12

YEAR=2025

URL=https://www.frontiersin.org/journals/veterinary-science/articles/10.3389/fvets.2025.1584864

DOI=10.3389/fvets.2025.1584864

ISSN=2297-1769

ABSTRACT=Introduction: Species distribution models can predict the spatial distribution of vector-borne diseases by forming associations between known vector distribution and environmental variables. In response to a changing climate and increasing rates of vector-borne diseases in Europe, model predictions for vector distribution can be used to improve surveillance. However, the field lacks standardisation, with little consensus as to what sample size produces reliable models.
Objective: To determine the optimum sample size for models developed with the machine learning algorithm Random Forest at different sample ratios.
Materials and methods: To overcome limitations with real vector data, a simulated vector with a fully known distribution in 10 test sites across Europe was used to randomly generate samples of different sizes. The test sites accounted for varying habitat suitability and the vector’s relative occurrence area. In total, 9,000 Random Forest models were developed with 24 different sample sizes (between 10 and 5,000) and three sample ratios of presence to absence data (50:50, 20:80, and 40:60). Model performance was evaluated using five metrics: percentage correctly classified, sensitivity, specificity, Cohen’s Kappa, and Area Under the Curve. The metrics were grouped by sample size and ratio. The optimum sample size was determined when the 25th percentile met thresholds for excellent performance, defined as 0.605–0.804 for Cohen’s Kappa and 0.795–0.894 for the remaining metrics (to three decimal places).
Results: For balanced sample ratios, the optimum sample size for reliable models fell within the range of 750–1,000. Estimates increased to 1,100–1,300 for unbalanced samples with a 40:60 ratio of presence to absence data. By contrast, unbalanced samples with a 20:80 ratio of presence to absence data did not produce reliable models at any of the sample sizes considered.
Conclusion: To our knowledge, this is the first study to use a simulated vector to identify the optimum sample size for Random Forest models at this resolution (≤1 km²) and extent (≥10,000 km²). These results may improve the reliability of model predictions, optimise field sampling, and enhance vector surveillance in response to changing climates. Further research may seek to refine these estimates and confirm transferability to real vectors.
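
The evaluation rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the reading of "met thresholds" as "at or above the lower bound of each quoted band", and the percentile interpolation method are all assumptions made for illustration. It computes the five metrics for one model and then checks whether the 25th percentile across replicate models clears each threshold.

```python
from statistics import quantiles

# Lower bounds of the "excellent" bands quoted in the abstract.
# Treating the lower bound as the pass mark is an assumption of this sketch.
THRESHOLDS = {"pcc": 0.795, "sensitivity": 0.795, "specificity": 0.795,
              "auc": 0.795, "kappa": 0.605}

def confusion_metrics(y_true, y_pred):
    """PCC, sensitivity, specificity and Cohen's Kappa for binary labels (1 = presence)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn
    pcc = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {"pcc": pcc,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "kappa": (pcc - pe) / (1 - pe)}

def auc(y_true, scores):
    """Rank-based AUC: chance that a random presence outscores a random absence."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def lower_quartile(values):
    """25th percentile (inclusive interpolation; the method is an assumption)."""
    return quantiles(values, n=4, method="inclusive")[0]

def is_reliable(runs):
    """runs: one metric dict per replicate model at a given sample size and ratio."""
    return all(lower_quartile([r[m] for r in runs]) >= thr
               for m, thr in THRESHOLDS.items())
```

Under these assumptions, the optimum sample size for a given ratio would be the smallest size at which `is_reliable` returns true for the group of replicate models; each group needs at least two replicates for the quartile to be defined.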