C. imicola occurrence prediction in Italy using Machine-learning and satellite data
Experimental Zooprophylactic Institute of Abruzzo and Molise G. Caporale, Italy
In recent years, satellite-derived products have increased in both quantity and quality, and so has their use in species distribution modelling. At the same time, machine-learning (ML) approaches with high prediction accuracy have been developed and applied in several fields. Environmental factors such as temperature and water availability due to precipitation are the most important drivers of the life cycle of vectors such as mosquitoes and Culicoides. Culicoides imicola is the most important Bluetongue (BT) vector in the Mediterranean basin, and its distribution in Italy has been successfully modelled using satellite data and classical approaches such as spatial logistic regression (Conte et al. 2007, Thibaut et al. 2012). However, accurate prediction in both space and time remains a major challenge, because environmental factors interact and evolve over time, affecting the species' presence through complex relationships. The present work aims to answer two questions: can ML algorithms improve the accuracy of C. imicola occurrence prediction in space and time using freely available and commonly used satellite data? What accuracy is achievable, and can it be trusted?
Data on Culicoides imicola presence, along with the date and geographical coordinates of each catch, were derived from the entomological surveillance plan in place since 2000 (Goffredo and Meiswinkel, 2004). The dataset includes about 4,500 catch sites sampled repeatedly, although not evenly, since 2000, resulting in around 150,000 catches (12,000 of which were positive for C. imicola). Selected satellite data were downloaded and processed to account for their availability and their differences in spatial and temporal resolution: MOD11A2 day and night land surface temperature (LSTD and LSTN; 1 km spatial resolution, 8-day temporal resolution); MOD13Q1 enhanced vegetation index (EVI) and normalized difference vegetation index (NDVI) (250 m spatial resolution, 16-day temporal resolution); and TRMM precipitation data (1° longitude/latitude spatial resolution, 1-day temporal resolution).
To account for satellite data availability, we considered only catches performed after May 2001. For each catch, the satellite images falling within the preceding 80 days were selected, resulting in 10 images for LSTD, 10 for LSTN, 10 for TRMM, 5 for EVI and 5 for NDVI, for a total of 40 variables. Two further variables were included along with the satellite values: the longitude and latitude of the catch site. The number of catches included in the final dataset was further reduced, to 91,467 records, by missing data in the satellite images (mostly due to cloud cover). A graphical description of the final C. imicola dataset used in the study, in terms of seasonality and spatial distribution, is shown in Fig. 1.
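The look-back window can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name `images_in_window` and the regular composite calendars are assumptions made for illustration (real MODIS calendars restart at the beginning of each year), and it only shows how an 80-day window yields 10 eight-day images and 5 sixteen-day images.

```python
from datetime import date, timedelta

def images_in_window(catch_date, image_dates, window_days=80):
    """Return the image dates in the window_days before catch_date, most recent first."""
    start = catch_date - timedelta(days=window_days)
    selected = [d for d in image_dates if start <= d < catch_date]
    return sorted(selected, reverse=True)

# Hypothetical regular composite calendars for one year (assumption for illustration)
first = date(2018, 1, 1)
lst_dates = [first + timedelta(days=8 * i) for i in range(46)]   # 8-day composites
vi_dates = [first + timedelta(days=16 * i) for i in range(23)]   # 16-day composites

catch = date(2018, 10, 15)
lst_lags = images_in_window(catch, lst_dates)
vi_lags = images_in_window(catch, vi_dates)

# Name the variables by lag, 1 being the most recent image (naming is an assumption)
lstn_names = [f"LSTN-{k}" for k in range(1, len(lst_lags) + 1)]
print(len(lst_lags), len(vi_lags), lstn_names[-1])
```

Each selected image would then contribute one pixel value at the catch site, giving one column per product and lag.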
The study has been structured into two steps.
In the first step, we evaluated the performance of four machine-learning algorithms: random forest, XGBoost tree, k-nearest neighbour (kNN) and multi-layer perceptron (MLP).
Data were randomly split into training and test sets (about 80% and 20%, respectively), preserving prevalence and avoiding spatial overlap between the two sets (no catch site appears in both). Given the class imbalance, and to prevent over-fitting, negatives were down-sampled within a ten-fold cross-validation (ten repetitions) for all algorithms except MLP, for which imbalance was handled during network training with a weighted loss function. Hyper-parameters were tuned through grid search or random search, and AUC was used as the model selection metric (Bradley, 1997).
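The two ideas in this paragraph, splitting by catch site and down-sampling negatives, can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' caret workflow: the helper names and the toy records are hypothetical, the split is not stratified by prevalence, and in practice the down-sampling would be applied to the training portion of each cross-validation fold only.

```python
import random

def split_by_site(records, test_fraction=0.2, seed=1):
    """Assign whole catch sites to train or test so that no site appears in both."""
    rng = random.Random(seed)
    sites = sorted({r["site"] for r in records})
    rng.shuffle(sites)
    n_test = max(1, round(len(sites) * test_fraction))
    test_sites = set(sites[:n_test])
    train = [r for r in records if r["site"] not in test_sites]
    test = [r for r in records if r["site"] in test_sites]
    return train, test

def downsample_negatives(records, seed=1):
    """Keep all positives and an equally sized random sample of negatives."""
    rng = random.Random(seed)
    pos = [r for r in records if r["present"]]
    neg = [r for r in records if not r["present"]]
    return pos + rng.sample(neg, min(len(pos), len(neg)))

# Toy data (hypothetical): 50 sites, 20 catches each, 2 positive catches per site
records = [
    {"site": s, "catch": k, "present": k % 10 == 0}
    for s in range(50)
    for k in range(20)
]

train, test = split_by_site(records)
balanced = downsample_negatives(train)
print(len(train), len(test), len(balanced))
```

Splitting on sites rather than on individual catches is what keeps repeated samples from the same location out of both sets at once.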
Model performance was measured on the test dataset in terms of AUC, accuracy, sensitivity and specificity. The ability of the best model to predict occurrence in space (at least one occurrence over time) was also evaluated.
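For reference, these metrics can be computed from true labels, predicted classes and predicted scores as in the following Python sketch. The function names and the toy values are hypothetical; AUC is computed here through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which is one standard way to obtain it and not necessarily the routine used in the study.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and positive predictive value for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
        "ppv": tp / (tp + fp),
    }

def auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical values), classes assigned with a 0.5 threshold
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = confusion_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, round(auc_value, 3))
```

With imbalanced data, a model can combine high sensitivity and specificity with a modest positive predictive value, which is why the latter is reported separately.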
Given the strong seasonality of C. imicola presence (Fig. 1), we investigated the spatial distribution of the probability of presence in spring (a critical transition period when abundance starts rising after winter) and in autumn (when abundance reaches its maximum). Using the best-performing model, we created two rasters of the probability of C. imicola presence for the whole country (1 km spatial resolution), for 1 April 2018 and 15 October 2018.
In the second step, we used the best model identified in the first step to evaluate how well it generalizes when both the spatial and the temporal links between training and test data are reduced. Given the amount of data available, we applied more restrictive criteria to the train-test split. First, we ensured a non-overlapping selection of sites over wider regions (in the first step sites do not overlap, but may be very close to each other). For this purpose, the Italian territory was divided into 19 cells (2.5° longitude/latitude spatial resolution) to be sampled independently. Second, to avoid temporal overlap, data until 2015 were used for training whilst data later than 2016 were used for testing.
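A minimal Python sketch of this spatio-temporal assignment follows. The function names, the toy coordinates and the exact year boundaries are assumptions: "later than 2016" is read literally as years after 2016, and records that satisfy neither criterion are assumed to be dropped.

```python
import math

CELL_SIZE = 2.5  # degrees, as stated in the text

def cell_id(lon, lat, size=CELL_SIZE):
    """Index (column, row) of the grid cell that contains a point."""
    return (math.floor(lon / size), math.floor(lat / size))

def assign(record, test_cells, train_until=2015, test_after=2016):
    """Return 'train', 'test' or None when the record fits neither set."""
    cell = cell_id(record["lon"], record["lat"])
    if cell not in test_cells and record["year"] <= train_until:
        return "train"
    if cell in test_cells and record["year"] > test_after:
        return "test"
    return None

# Hypothetical records: coordinates and years are illustrative only
records = [
    {"lon": 13.70, "lat": 42.66, "year": 2010},
    {"lon": 13.70, "lat": 42.66, "year": 2018},
    {"lon": 9.10, "lat": 39.20, "year": 2012},
    {"lon": 9.10, "lat": 39.20, "year": 2017},
]
test_cells = {cell_id(9.10, 39.20)}
labels = [assign(r, test_cells) for r in records]
print(labels)
```

Under this reading, a test record differs from every training record in both its region and its period.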
A brute-force search over all possible combinations of six to nine cells assigned to the test dataset was performed under the spatial and temporal non-overlap criteria, resulting in about 140,000 possible splits. However, most of these were extremely unbalanced in terms of prevalence and train-test size. We retained those with a difference in prevalence between test and train of less than 2% and a test set size between 22.5% and 27.5%. This process of data splitting and selection resulted in 277 feasible sample splits (Fig. 2 shows the results of this procedure).
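The search and filtering can be sketched in Python on a toy grid. Everything here is hypothetical: six cells instead of 19, made-up record counts, and the assumption that the test set size is expressed as a share of train plus test records. Each cell carries its counts of early-period records (candidate training data) and late-period records (candidate test data).

```python
from itertools import combinations

# cell -> (early records, early positives, late records, late positives); toy values
cells = {
    "A": (300, 30, 200, 20),
    "B": (300, 30, 200, 20),
    "C": (300, 30, 200, 20),
    "D": (300, 30, 200, 20),
    "E": (300, 30, 200, 20),
    "F": (300, 30, 200, 60),  # much higher late-period prevalence
}

def feasible_splits(cells, k_values, max_prev_diff=0.02, size_range=(0.225, 0.275)):
    """Enumerate test-cell combinations and keep the balanced ones."""
    candidates, kept = 0, []
    for k in k_values:
        for test_cells in combinations(sorted(cells), k):
            candidates += 1
            n_train = sum(v[0] for c, v in cells.items() if c not in test_cells)
            p_train = sum(v[1] for c, v in cells.items() if c not in test_cells)
            n_test = sum(cells[c][2] for c in test_cells)
            p_test = sum(cells[c][3] for c in test_cells)
            test_share = n_test / (n_train + n_test)
            prev_diff = abs(p_test / n_test - p_train / n_train)
            if prev_diff < max_prev_diff and size_range[0] <= test_share <= size_range[1]:
                kept.append(test_cells)
    return candidates, kept

n_candidates, splits = feasible_splits(cells, k_values=(2, 3))
print(n_candidates, len(splits))
```

In this toy case the three-cell combinations fail the size criterion and every pair containing cell F fails the prevalence criterion, leaving only the pairs drawn from the five homogeneous cells.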
Finally, we ran the best model on the selected sample splits and evaluated its predictive capability on the test dataset, summarizing the distributions of sensitivity, specificity and AUC. Because this sampling design is deliberately severe, we expect that a model performing well under it has captured the factors that really drive the presence of C. imicola. Variable importance was ranked for each model and summarized as overall importance through the median rank.
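Summarizing importance by median rank can be sketched as follows. This Python snippet is illustrative only: the function name, the variable names and the importance scores are hypothetical and do not reproduce the study's values.

```python
from statistics import median

def median_rank(importances):
    """Rank variables within each model (1 = most important), then take the median rank."""
    ranks = {}
    for imp in importances:
        ordered = sorted(imp, key=imp.get, reverse=True)
        for rank, var in enumerate(ordered, start=1):
            ranks.setdefault(var, []).append(rank)
    return sorted((median(r), var) for var, r in ranks.items())

# Hypothetical importance scores from three fitted models
importances = [
    {"latitude": 0.40, "longitude": 0.30, "LSTN-10": 0.20, "TRMM-1": 0.10},
    {"latitude": 0.30, "longitude": 0.35, "LSTN-10": 0.25, "TRMM-1": 0.10},
    {"latitude": 0.40, "longitude": 0.20, "LSTN-10": 0.30, "TRMM-1": 0.10},
]

overall = median_rank(importances)
print(overall)
```

Working on ranks rather than raw scores keeps models with differently scaled importance values comparable, and the median limits the influence of a few atypical splits.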
All analyses were performed using R (R Core Team, 2019). The MODIS and raster packages (Mattiuzzi 2018, Hijmans 2018) were used to download and process satellite data. The dplyr package (Wickham 2018) was used to prepare data for the ML algorithms. The doSNOW package (Weston 2017) was used to parallelize tasks. All algorithms were run using the caret package (Kuhn 2018), except MLP, which was implemented using keras (Allaire 2018).
In the first step, all methods classified well: AUC values ranged from 0.982 (kNN) to 0.984 (XGBoost) (Fig. 3, panel a). The XGBoost tree algorithm combined the highest AUC with the shortest computing time, making it the model of choice.
Spatial prediction (neglecting the timing of catches) using the XGBoost tree model is shown in Fig. 3, panel b, for absence sites (left) and presence sites (right) in the test dataset. Red and green dots represent predicted presence and absence, respectively. Class prediction in space correctly classified the true positive sites except for a few points (sensitivity of 89%). Those false negative points, however, are located in high-prevalence areas. As expected, several true negative sites were predicted as positive (positive predictive value of 45%).
Fig. 4 shows the predicted probability of occurrence for 1 April and 15 October 2018. The prediction appears to capture the seasonal trend of C. imicola well.
The characteristics of the XGBoost tree algorithm made it well suited to the second step, in which training and testing were repeated for all 277 selected sample splits. The distributions of AUC, sensitivity (Se) and specificity (Sp) are shown in Fig. 5, panel a. Despite the severe sampling design, the method reached median AUC, Se and Sp values of 0.9, 0.8 and 0.92, respectively. The variable importance ranking (Fig. 5, panel b) showed that latitude and longitude are the most important variables for predicting C. imicola, followed by night land surface temperature at the highest lag considered (LSTN-10). Moreover, when variables are grouped independently of the temporal lag, the temperature group is clearly more important than the TRMM precipitation group.
The study highlighted the advantages of using ML methods to predict C. imicola presence in space and time in Italy. The adopted experimental design supports the reliability of the results and the capability of the trained model to extend its performance to data unseen in both space and time. This is particularly useful for targeting surveillance programs, detecting the presence of vectors in unsampled regions and predicting future occurrence.
Future work will implement the trained model in a near-real-time, web-distributed GIS framework to support policy makers in taking targeted decisions on BT surveillance, with the aim of reducing the sampling rate and, consequently, costs and resources.
Such a method might be successfully applied to other vectors (mosquitoes, ticks, etc.) and to other countries wherever data are abundantly available, even if only in selected regions.
References
1. Conte A., Goffredo M., Ippoliti C. and Meiswinkel R., 2007. Influence of biotic and abiotic factors on the distribution and abundance of Culicoides imicola and the Obsoletus Complex in Italy. Vet. Par. Vol 150/4 pp 333-344.
2. Goffredo M., Meiswinkel R., 2004. Entomological surveillance of bluetongue in Italy: methods of capture, catch analysis and identification of Culicoides biting midges. Vet. Ital. 40, 260–265.
3. Bradley A.P., 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7), 1145–1159.
4. R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
5. Matteo Mattiuzzi and Florian Detsch (2018). MODIS: Acquisition and Processing of MODIS Products. R package version 1.1.4.
6. Robert J. Hijmans (2018). raster: Geographic Data Analysis and Modeling. R package version 2.8-4.
7. Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2018). dplyr: A Grammar of Data Manipulation. R package version 0.7.8.
8. Microsoft Corporation and Stephen Weston (2017). doSNOW: Foreach Parallel Adaptor for the 'snow' Package. R package version 1.0.16.
9. Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2018). caret: Classification and Regression Training. R package version 6.0-81.
10. JJ Allaire and François Chollet (2018). keras: R Interface to 'Keras'. R package version 2.2.4.
Keywords:
machine learning, satellite data, space-time prediction, accuracy, C. imicola
Conference:
GeoVet 2019. Novel spatio-temporal approaches in the era of Big Data, Davis, United States, 8 Oct - 10 Oct, 2019.
Presentation Type:
Regular oral presentation
Topic:
Spatio-temporal surveillance and modeling approaches
Citation:
Candeloro L, Salini R, Goffredo M, Quaglia M and Conte A (2019). C. imicola occurrence prediction in Italy using Machine-learning and satellite data. Front. Vet. Sci. Conference Abstract: GeoVet 2019. Novel spatio-temporal approaches in the era of Big Data. doi: 10.3389/conf.fvets.2019.05.00112
Copyright:
The abstracts in this collection have not been subject to any Frontiers peer review or checks, and are not endorsed by Frontiers.
They are made available through the Frontiers publishing platform as a service to conference organizers and presenters.
The copyright in the individual abstracts is owned by the author of each abstract or his/her employer unless otherwise stated.
Each abstract, as well as the collection of abstracts, are published under a Creative Commons CC-BY 4.0 (attribution) licence (https://creativecommons.org/licenses/by/4.0/) and may thus be reproduced, translated, adapted and be the subject of derivative works provided the authors and Frontiers are attributed.
For Frontiers’ terms and conditions please see https://www.frontiersin.org/legal/terms-and-conditions.
Received:
10 Jun 2019;
Published Online:
27 Sep 2019.
*
Correspondence:
Dr. Luca Candeloro, Experimental Zooprophylactic Institute of Abruzzo and Molise G. Caporale, Teramo, Abruzzo, Italy, l.candeloro@izs.it