Robust Ensemble Classification Methodology for I123-Ioflupane SPECT Images and Multiple Heterogeneous Biomarkers in the Diagnosis of Parkinson's Disease

In recent years, several approaches to developing an effective Computer-Aided-Diagnosis (CAD) system for Parkinson's Disease (PD) have been proposed. Most of these methods have focused almost exclusively on brain images, using Machine-Learning algorithms suited to characterizing structural or functional patterns. Such patterns provide enough information about the status and/or progression of PD at intermediate and advanced stages; nevertheless, this information could be insufficient at early stages of the pathology. The Parkinson's Progression Markers Initiative (PPMI) database includes neurological images along with multiple biomedical tests, which opens up the possibility of comparing classification results across different biomarkers. As the data come from heterogeneous sources, some of these biomarkers can be combined in order to obtain new information about the pathology. Based on that idea, this work presents an Ensemble Classification model with Performance Weighting. This proposal has been tested comparing Healthy Control (HC) subjects vs. patients with PD (considering both PD and SWEDD labeled subjects as the same class). The model combines several Support-Vector-Machine (SVM) classifiers with linear kernels for different groups of biomedical tests—including CerebroSpinal Fluid (CSF), RNA, and Serum tests—and pre-processed neuroimaging features (Voxels-As-Features and a list of defined Morphological Features) from PPMI database subjects. The proposed methodology makes use of all data sources and selects the most discriminant features (mainly from neuroimages). Using this performance-weighted ensemble classification model, classification rates of up to 96% were obtained.


INTRODUCTION
Parkinson's Disease (PD) is defined as a chronic, degenerative neurological disorder that affects the motor system. The origins or triggers of PD are still unknown, although several studies have linked it to the destruction of pigmented neurons in the substantia nigra (Zetterström et al., 1997; Kordower et al., 2013). Its most frequent symptoms are tremor, rigidity, and bradykinesia, but it also produces cognitive alterations, lack of emotional expressiveness (Pohl et al., 2017), and autonomy problems (Fauci et al., 2008).
One of the most widespread tools for PD diagnosis is the use of I123-Ioflupane SPECT (Single Photon Emission Computerized Tomography) images (Neumeyer et al., 1991; Sixel-Döring et al., 2011). These images, also known as FP-CIT or DaTSCAN, make use of the Iodine-123-fluoropropyl-carbomethoxy-3-beta-(4-iodophenyl)tropane radioligand, which presents a high binding affinity for presynaptic dopamine transporters (DAT) in the brain. As a marked reduction of dopaminergic neurons in the striatal region is the most significant feature of PD, DaTSCAN images give a quantitative measure of the spatial distribution of the transporters in the striatum. This information is used to differentiate Healthy Control (HC) subjects from patients with Parkinson's Disease (PD) (Marek et al., 2001).
However, medical images are not the only effective biomarker that could be used in the diagnosis of PD. In recent years, several works have established the relation between neurodegenerative disorders and different Biomedical Tests (BT) (Andersen et al., 2017; Dukart et al., 2017; Santiago and Potashkin, 2017). As Handels et al. (2017) point out in their study of Mild Cognitive Impairment (MCI), although some biomarkers can be used for classification purposes (increasing accuracy in many cases), it is not easy to determine whether such improvements are clinically relevant. In fact, we can easily find works with opposing views on the use of biomarkers as predictive indicators of PD progression (Farotti et al., 2017; Mollenhauer et al., 2017). However, the recent emergence of datasets containing both biomarker data and neuroimages has opened up possibilities for analyses searching for the origins and triggers of PD progression.
Recently, there has been increasing interest in the application of multivariate analysis strategies, such as those based on Machine Learning (ML), to describe between-group differences in terms of the ability to discriminate between populations, beyond classical statistical analysis. One of the major problems of ML algorithms is overfitting in high-dimensional settings (dimensionality d) with a small sample size (l), where the designed classifiers are inevitably over-adjusted to the training set. Unfortunately, in neuroscience this situation is the rule rather than the exception, since the dimensionality of each observation (millions of variables) relative to the number of available samples (hundreds of acquisitions) implies a high risk of overfitting. This risk can also be explained in terms of the high probability that the training set is separable by a given surface in high-dimensional spaces. The solution to this problem is multi-fold. The situation can be mitigated by increasing l with resampling methods such as boosting (Hastie et al., 2001) and bagging (Breiman et al., 1984), or by decreasing d using feature extraction and selection (FES) approaches (Ramírez et al., 2009; Segovia et al., 2010, 2012; Górriz et al., 2017b). In addition, to protect complex models from overfitting, some well-established cross-validation solutions can be adopted. In this sense, several authors have studied numerous accuracy estimation methods using complex classifiers and cross-validation strategies such as leave-one-out cross-validation (Efron, 1983; Kohavi, 1995).
In neuroimaging, multiple Computer-Aided-Diagnosis (CAD) systems have been developed for the automatic diagnosis of Parkinson's Disease (Martinez-Murcia et al., 2014; Augimeri et al., 2016; Segovia et al., 2017b). Most of these systems start from the information collected from medical images: VAF (Voxels-As-Features), textural patterns, or extracted morphological features, among others. Then, using ML techniques such as Support-Vector-Machines (SVM), Artificial Neural Networks (ANN), classification trees, Bayesian classifiers, or kernel methods, they classify whether a patient is probably suffering from the disease or not, even in its early stages.
Joining these two ideas, we wondered how to implement an ensemble classification method (Segovia et al., 2014; Badoud et al., 2016) that mixes information from clinical test markers with patterns extracted from images. With this aim, we propose a robust system which combines multiple heterogeneous data sources and weights those that are more discriminative. Mathematically, this work also addresses how these combinations affect the final classification, and whether multiple sources provide genuinely significant hints such as relationships between heterogeneous sources. We believe that combinations of new promising biomarkers will provide information about indicative factors of Parkinson's Disease progression and diagnosis even before the disease has clearly manifested.
For all individual classifications carried out in this work per feature category (note that none of the classifiers mixes data from heterogeneous information sources), we have used linear SVM classifiers (Vapnik, 1998). Additional experiments were also performed using K-Nearest Neighbor (KNN) classifiers (Blanzieri and Melgani, 2008). As the linear SVMs showed better results, they were selected as our reference classifiers.

Spatial Normalization
All DaTSCAN images have been spatially registered using the SPM (Statistical Parametric Mapping) tool. Specifically, for this work, we have used the SPM12 software package available from: www.fil.ion.ucl.ac.uk/spm/software/spm12/. Its documentation and manuals are also available from this website. Once registration was performed, we checked that the matching between voxels and anatomical structures was unaltered. After being co-registered and averaged, each cerebral image was reoriented into a standard image grid. The resulting images had dimensions of 79 × 95 × 78 voxels and a voxel size of 2.0 × 2.0 × 2.0 mm.

Intensity Normalization
The full dataset from the PPMI was used to normalize the intensity of each image. For this purpose, an intensity normalization method based on α-Stable distributions was used, as described in Salas-Gonzalez et al. (2009) and Castillo-Barnes et al. (2017). This approach has been shown to be more effective for homogenizing information from SPECT images than other approaches, such as the currently widely used intensity normalization based on the Binding Ratio or the equivalent Gaussian model, as demonstrated in Salas-Gonzalez et al. (2013).
Mathematically, intensity normalization based on α-Stable distributions uses a linear transformation

I_N(x) = a · I(x) + b,   (1)

where the coefficients are obtained from the α-Stable parameters as

a = γ*/γ,   b = δ* − (γ*/γ) · δ,   (2)

with γ* and δ* representing the mean of the γ (dispersion) and δ (location) parameters, respectively, computed over the whole database. In short, the steps to perform intensity normalization using the α-Stable distribution schema can be summarized as follows:
• Step 1: A mask is applied to the source images in order to consider only brain voxels outside the striatum (Brahim et al., 2015). This reduces the computational load without losing much accuracy.
• Step 2: For each image, we compute the histogram of the voxels selected in the previous step and fit an α-Stable distribution, obtaining the α, β, γ, and δ parameters of that image.
• Step 3: Once all the α-Stable distributions have been fitted, we calculate the γ* and δ* parameters as the mean of all the γ and δ parameters.
A comparison between original and intensity-normalized images is presented in Figure 1.
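As a minimal sketch of the linear mapping above, the following Python snippet applies the rescaling a = γ*/γ, b = δ* − a·δ to an image whose α-Stable parameters are assumed to have already been estimated (the fitting step itself, e.g. by maximum likelihood on the masked histogram, is not shown; the target values 5.41 and 28.42 are taken from the Results section, and the function name is ours):

```python
import numpy as np

def alpha_stable_normalize(image, gamma_i, delta_i, gamma_star, delta_star):
    """Map this image's dispersion/location (gamma_i, delta_i) onto the
    database-wide targets (gamma_star, delta_star) via I_N = a*I + b."""
    a = gamma_star / gamma_i           # dispersion correction
    b = delta_star - a * delta_i       # location correction
    return a * image + b

# Toy "image" whose intensity histogram is centered at 20 with spread 4
rng = np.random.default_rng(0)
img = rng.normal(loc=20.0, scale=4.0, size=(8, 8, 8))
norm = alpha_stable_normalize(img, gamma_i=4.0, delta_i=20.0,
                              gamma_star=5.41, delta_star=28.42)
```

After the transformation, the histogram of `norm` is centered near the database-wide location with the database-wide dispersion, which is exactly what makes intensities comparable across subjects.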

Region of Interest (ROI)
In this work, we considered the striatum area and the non-striatum area as significant regions for both intensity normalization and VAF classification purposes.
To obtain a realistic map of the striatum, a segmentation/extraction process was carried out for each image using the AAL (Automated Anatomical Labeling) template (Tzourio-Mazoyer et al., 2002). Thus, we selected the regions that compose the striatum according to the labels from this template.
CSF, Plasma, RNA, and Serum Biomarkers
Biomedical tests were extracted from the Biospecimen_Analysis_Results.csv file, and those populated enough are summarized in Table 2.
More specific information about each BT, such as definitions, units, or extraction procedures, is also available in the Biospecimen Analysis Methods section of the https://ida.loni.usc.edu/ website.

Morphological Features
Several morphological features were extracted from the DaTSCAN images, and their performance was compared to a VAF model that uses the striatum region as reference. This set of features provides another classifier for our ensemble model and makes it more robust against misclassifications. Besides, relevant information about structural or functional shapes may be indicative of PD progression (Garg et al., 2015), so it was considered important to include them in this work.
The morphological features obtained from the normalized DaTSCAN images are:
• Intensity means - Mean intensity values in the striatum region. This is a 1-by-9 vector containing: the average intensity of all/left-hemisphere/right-hemisphere voxels in the striatum region, the average intensity of the 1% most intense all/left-hemisphere/right-hemisphere voxels in the striatum region, and the average intensity of the 1% least intense all/left-hemisphere/right-hemisphere voxels in the striatum region.
• Center of mass (CoM) - Given a system of particles, its center of mass is defined as the unique point around which the weighted relative positions of the distributed mass sum to zero; equivalently, its coordinates are the average of the mass-weighted position coordinates. In this work, the same idea has been used to define a center of intensities instead of mass. Given the relative positions (x, y, z) and the intensities I(x, y, z) of the N voxels which form the striatum, we compute the point with respect to which all intensity-weighted positions sum to zero. N is the number of voxels that form the striatum region according to the AAL template. The center of mass has been computed as

r_CoM = Σ_i I_i · r_i / Σ_i I_i,   (3)

where r_i denotes the position of voxel i and I_i its intensity. Due to the shape of the striatum, the center of mass has been calculated separately for the left hemisphere (LH) and the right hemisphere (RH), as shown in Figure 2.
• Projections - As explained and performed in Segovia et al. (2017a), given a DaTSCAN image, we project the N most intense voxels in the three directions (x, y, and z). Thus, we obtain three two-dimensional images corresponding to the axial projection (calculated as the maximum along the z-axis direction), the coronal projection (maximum along the y-axis direction), and the sagittal projection (maximum along the x-axis direction).
For each projected image, as illustrated in Figure 2, we calculated the following features:
− Area - Number of voxels in the left/right hemisphere projection.
− Eccentricity - Ratio of the distance between the center of the ellipse [with general expression x²/a² + y²/b² = 1, as presented in (4)] and each focus to the length of the semimajor axis a.
− Major axis length - Length (in voxels) of the major axis (2a) of the ellipse that has the same normalized second central moments as the region.
− Minor axis length - Length (in voxels) of the minor axis (2b) of the ellipse that has the same normalized second central moments as the region.
− Orientation - Angle between the major axis of the ellipse and the x-axis.
• Volumes - An HC subject is expected to present a highly illuminated and approximately homogeneous striatum region. For this reason, counting the number of voxels which exceed an intensity threshold may indicate whether a patient meets these specifications. We have calculated the number of voxels exceeding a series of thresholds, defined as 10%, 20%, 30%, ... up to 100% of the average intensity registered at the 1% most intense voxels in the striatum region. This measure is expected to be indicative of how quickly DATs decrease in the striatum.
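Two of the simpler descriptors above can be illustrated with a short NumPy sketch: an intensity-weighted center of mass and the thresholded volume counts. The mask, the 1% rule, and the 10%–100% threshold grid follow the description in the text; function names and the toy volume are ours:

```python
import numpy as np

def center_of_intensity(img, mask):
    """Intensity-weighted centroid of the masked (e.g., striatum) voxels."""
    coords = np.argwhere(mask)                       # (N, 3) voxel positions
    weights = img[mask]
    return (coords * weights[:, None]).sum(axis=0) / weights.sum()

def volume_features(img, mask, top_fraction=0.01):
    """Counts of masked voxels above 10%, 20%, ..., 100% of the mean
    intensity of the brightest `top_fraction` of masked voxels."""
    vals = img[mask]
    top = np.sort(vals)[-max(1, int(top_fraction * vals.size)):]
    ref = top.mean()
    return [int((vals > f * ref).sum()) for f in np.linspace(0.1, 1.0, 10)]

# Toy volume: a bright 2x2x2 block inside an otherwise dark 8x8x8 image
img = np.zeros((8, 8, 8))
img[2:4, 2:4, 2:4] = 10.0
mask = np.ones_like(img, dtype=bool)
com = center_of_intensity(img, mask)    # centroid of the bright block
vols = volume_features(img, mask)       # 10 counts, one per threshold
```

On this toy volume the centroid lands at (2.5, 2.5, 2.5), the center of the bright block, and the count drops to zero at the 100% threshold since no voxel strictly exceeds the reference intensity.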

Ensemble Classification
Ensemble classification refers to the process of combining classifiers in order to provide a single, unified classification for an unseen instance (Rokach, 2010). There are two major ways of classifying new instances: fusion and selection. The first approach combines the outputs of several classifiers, whereas selection only picks the output of a single member following a specified, previously defined criterion. In this paper, we have worked with the fusion approach for two reasons: several classifiers were available, and none of them affects the individual response of any other.
Assume that the output of each classifier i is a k-long vector (p_{i,1}, · · · , p_{i,k}), where the term p_{i,j} represents the support that instance x belongs to class j according to classifier i; it can be assumed that

0 ≤ p_{i,j} ≤ 1 and Σ_{j=1..k} p_{i,j} = 1.   (5)
In a weighting method, the classification results of all members are combined using weights that indicate their effect on the final classification. These weights can be fixed or dynamically determined. A commonly accepted choice is to make the weight of each classifier (w_i) proportional to its accuracy performance (α_i) on a validation set (Opitz et al., 1996):

w_i = α_i / Σ_j α_j.   (6)

Once the weights for each classifier are computed, the class with the highest score is selected by means of

class(x) = argmax_{c_j} Σ_i w_i · g(y_i(x), c_j),   (7)

where y_i(x) represents the classification given by the i-th classifier and g(y, c) is an indicator function defined as

g(y, c) = 1 if y = c, 0 otherwise.   (8)
Since the weights are normalized and sum to 1, it is possible to interpret the sum in Equation (7) as the probability that x is classified into c_j. When several classifiers (but not all) present low accuracies, the sum of several misclassifications can become comparable to the good ones. In that case, we need a method able to give more weight to the high-scoring classifiers. To do that, we have used a Windowing technique that increases the contribution of classifiers with high accuracy rates:

w_i = f(α_i) / Σ_j f(α_j),   (9)

where f(α_i) can be a linear, quadratic, or exponential function (among others), as reflected in (10). The only two conditions these expressions should satisfy are f(α_i) = 1 when α_i = 1 and f(α_i) = 0 when α_i = 0.5, so (10) can be rewritten as (11) assuming that a = 1 in the quadratic and exponential cases.
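One family of windowing functions satisfying the two boundary conditions above (f(0.5) = 0, f(1) = 1) can be sketched as follows. The exact expressions of (10)–(11) are not reproduced in this text, so the linear, quadratic, and exponential forms below are illustrative assumptions built on the rescaling x = 2α − 1, and the fusion rule follows the performance-weighted vote of (7)–(9):

```python
import math
from collections import Counter

def window(alpha, kind="quadratic"):
    """Map a validation accuracy alpha onto a contribution f(alpha) with
    f(0.5) = 0 and f(1) = 1; below-chance classifiers are clamped to 0.
    These exact functional forms are assumptions, not the paper's Eq. (11)."""
    x = max(0.0, 2.0 * alpha - 1.0)          # rescale [0.5, 1] -> [0, 1]
    if kind == "linear":
        return x
    if kind == "quadratic":
        return x ** 2
    if kind == "exponential":
        return (math.exp(x) - 1.0) / (math.e - 1.0)
    raise ValueError(kind)

def ensemble_vote(predictions, accuracies, kind="quadratic"):
    """Fusion by performance weighting: each classifier votes for its
    predicted class with weight proportional to f(accuracy)."""
    f = [window(a, kind) for a in accuracies]
    total = sum(f) or 1.0
    scores = Counter()
    for pred, fi in zip(predictions, f):
        scores[pred] += fi / total
    return scores.most_common(1)[0][0]

# Two strong image-based classifiers outweigh three near-chance biomarker ones
label = ensemble_vote(predictions=["PD", "PD", "HC", "HC", "HC"],
                      accuracies=[0.94, 0.91, 0.55, 0.52, 0.58])
```

With the quadratic window, the three classifiers close to chance level contribute almost nothing, so the fused label follows the two accurate classifiers even though they are outnumbered.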
All individual classifications have been performed using an SVM classifier with a linear kernel. Different kernel functions or similarity matrices were not considered necessary, as in a multimodal analysis (Tong et al., 2017; Li et al., 2018). In this case, a simple two-class (binary) classifier is considered sufficient to separate HC subjects from patients labeled as PD or SWEDD.

Cross-Validation Strategy
In order to validate the results, the dataset has been split into two groups: a training data group, used to train the prediction model, and a test data group, used to measure the classifier's performance through the selected cross-validation strategy. Due to the reduced number of subjects available for each classification, a leave-one-out cross-validation strategy was selected instead of an N-fold cross-validation strategy (Kohavi, 1995).
Classification results were analyzed considering the following performance metrics: correct rate or accuracy (Acc), sensitivity or true positive rate (Sens), specificity or true negative rate (Spec), and precision (Prec), defined as

Acc = (TP + TN) / (TP + TN + FP + FN), Sens = TP / (TP + FN), Spec = TN / (TN + FP), Prec = TP / (TP + FP),   (12)

where TP is the number of PD patients correctly classified (true positives), TN is the number of healthy subjects correctly classified (true negatives), FP is the number of healthy subjects classified as PD (false positives), and FN is the number of PD patients classified as healthy (false negatives).
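These four metrics are standard, so they can be computed directly from the confusion-matrix counts; the small helper below (ours, not from the paper) does exactly that:

```python
def performance_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics with PD as the positive class."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),
    }

# Hypothetical confusion counts for a 194-subject experiment
m = performance_metrics(tp=90, tn=88, fp=6, fn=10)
```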

Permutation Tests
Non-parametric permutation tests, as described in Lehmann and Romano (2005) and Good (2006), were performed to assess the statistical significance of the accuracy rates obtained for each group of patients.
To compute the permutation test, we first performed a classification with the original labels (diagnoses) of the observations from the PPMI database, which yields a reference accuracy. Then, we randomly rearranged the labels and computed the classification again. The process was repeated until obtaining the distribution of classification results (R_{Acc,Perm_i}) for a large number of possible rearrangements (n, with 1 ≤ i ≤ n). Looking at the histogram of all possible results, ideally the original accuracy rate should be as far as possible from the center of the distribution. This would mean that the original labels give a better classification result than any other randomized combination of tags and, consequently, that our classifier has been able to classify using only representative patterns from the input data. On the contrary, if the original labels had given a result near the central point of the histogram (where most of the cases are supposed to fall), it would be a sign that our classifier has not been able to find a significant pattern. In this last case, misclassification mistakes would be significant.
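The procedure above can be sketched generically: given any function that returns a cross-validated accuracy, the permutation p-value is the fraction of label shufflings that match or beat the reference accuracy (the classifier and data below are toy stand-ins, not the paper's SVM pipeline):

```python
import random

def permutation_p_value(classify, features, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose accuracy matches or beats the
    accuracy obtained with the true labels (add-one smoothed).
    `classify(features, labels)` must return a cross-validated accuracy."""
    rng = random.Random(seed)
    ref = classify(features, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if classify(features, shuffled) >= ref:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy classifier: accuracy of a fixed threshold rule on a 1-D feature
def toy_classify(xs, ys):
    correct = sum((x > 0.5) == y for x, y in zip(xs, ys))
    return correct / len(ys)

xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [False] * 4 + [True] * 4            # perfectly separable labels
p = permutation_p_value(toy_classify, xs, ys, n_perm=200)
```

Because the true labels are perfectly separable here, almost no random rearrangement reaches the reference accuracy and the estimated p-value is small, mirroring the "far from the center of the histogram" situation described above.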

General Diagram
A diagram including all steps is depicted in Figure 3, and a detailed flowchart showing the ensemble classification model is included in Figure 4. This flowchart is similar to the one presented in Dai et al. (2012) and consists of two classification loops:
• First of all, the preprocessed input features are split into two parts: a training data set and a test data set.
• As we use a leave-one-out cross-validation scheme for both the external and internal loops, the training data set consists of N − 1 samples whereas the test set contains only 1 sample.
• The training set is used in two loops:
• A nested loop which obtains the accuracies of several linear SVM classifiers. It uses N − 2 samples to fit a data model and cross-validates it with the remaining sample. This results in a weight w_i obtained by evaluating each individual classifier (VAF, Morp, and biomedical tests - CSF, Plasma, RNA, and Serum).
• An external loop that fits a model for each data source. This scheme uses the original training data with N − 1 samples to fit the model, as reflected in Figure 4.
• Once all the models are created and evaluated on the test data, and the nested loop has returned the weights w_i, the ensemble classification is performed. For that, the main loop, which also follows a leave-one-out validation scheme, applies the proposed windowing technique and obtains the fusion performance metrics (accuracy, sensitivity, specificity, and precision) using the remaining test sample.
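The two-loop scheme above can be sketched as follows. Here `fit`, `predict`, and `window` are placeholders for the per-source SVM training, prediction, and windowing steps; the toy threshold classifier and toy data are ours, used only to exercise the loop structure:

```python
def loo_ensemble(sources, labels, fit, predict, window):
    """For each held-out subject (external LOO loop), an inner LOO pass over
    the N-1 training samples yields a per-source accuracy (hence a weight
    w_i); then per-source models fitted on all N-1 samples cast weighted
    votes on the held-out subject."""
    n = len(labels)
    correct = 0
    for test in range(n):                              # external LOO loop
        train = [i for i in range(n) if i != test]
        weights, votes = [], []
        for X in sources:                              # one classifier per source
            hits = 0
            for val in train:                          # internal LOO loop
                inner = [i for i in train if i != val]
                model = fit([X[i] for i in inner], [labels[i] for i in inner])
                hits += predict(model, X[val]) == labels[val]
            weights.append(window(hits / len(train)))
            votes.append(predict(fit([X[i] for i in train],
                                     [labels[i] for i in train]), X[test]))
        score = sum(w for w, v in zip(weights, votes) if v)
        correct += (score > sum(weights) / 2) == labels[test]
    return correct / n

# Toy per-source classifier: threshold halfway between the two class means
def toy_fit(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def toy_predict(threshold, x):
    return x > threshold

informative = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]   # separable source
noisy = [0.5, 0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6]         # uninformative source
labels = [False] * 4 + [True] * 4
acc = loo_ensemble([informative, noisy], labels, toy_fit, toy_predict,
                   window=lambda a: max(0.0, 2 * a - 1) ** 2)
```

The internal loop never sees the held-out test subject, so the weights are computed without leaking test information, which is the point of nesting the two leave-one-out loops.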
Note that, thanks to the flexibility of our proposal, different kinds of classifiers and cross-validation schemes may be used instead of linear SVM classifiers and/or leave-one-out.

RESULTS
The proposed methodology has been tested using 388 different SPECT images (194 HC, 168 PD, and 26 SWEDD subjects) at baseline (BL), as summarized in Table 1.
All images have been spatially normalized, and the intensity normalization approach explained above has also been applied. After intensity normalization, the histograms of the intensity values follow an α-Stable distribution centered on location δ = 28.42 and with dispersion γ = 5.41. The final intensity distributions are shown in Figure 5. The striatum volume for VAF was calculated using the AAL template, including both the caudate nucleus and putamen areas. Nevertheless, for reasons of anatomical relationship with the nigrostriatal pathway, the following structures were also included: globus pallidus, thalamus, olfactory cortex, amygdala, hippocampus, and inferior temporal gyrus. Consequently, the final volume considered as ROI contained N = 21,981 voxels in total.
A total of 68 BT were processed from the Biospecimen_Analysis_Results.csv file. Some of these tests are given in terms of average and standard deviation, as reflected in Table 2. Despite this, all values were included as input features in the matrix used for classification. Note that, due to the lack of some medical test findings for some patients, we decided to restrict the number of BT under study from 68 to 39.
To further reduce the number of experiments not providing relevant information to the ensemble methodology, a feature ranking procedure based on Welch's U-Test was performed for the biomedical tests. Thus, we estimated the significance of each biomarker according to its most significant value (the smallest p-value), as can be checked in Table 3. The final selection is given in Table 4, where the number of features considered for each data source is also indicated. Even at this point, the list of final biomarker features could have been reduced further by ranking the features and selecting those with better performance. However, as there was not much clinical information available for all the patients, we finally decided to use as many tests as possible, and feature selection was performed only with regard to their number.
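A minimal sketch of this ranking step is given below. The text calls the procedure Welch's U-Test; we sketch the unequal-variance (Welch) t statistic and rank features by its magnitude, which is equivalent to ranking by p-value within the same test. The helper names and toy data are ours:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def rank_biomarkers(hc, pd):
    """Rank feature indices by |t|, most discriminative first; hc and pd
    hold one sample list per feature."""
    scores = [abs(welch_t(h, p)) for h, p in zip(hc, pd)]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Feature 1 separates the groups; feature 0 has identical group means
hc = [[5.0, 5.2, 4.9, 5.1], [1.0, 1.1, 0.9, 1.0]]
pd = [[5.1, 4.8, 5.0, 5.3], [2.0, 2.1, 1.9, 2.0]]
order = rank_biomarkers(hc, pd)   # most discriminative feature first
```

In practice one would convert |t| to a p-value using the Welch–Satterthwaite degrees of freedom (e.g., `scipy.stats.ttest_ind(..., equal_var=False)`), but the ranking itself only needs the statistic.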
As represented in Figure 3, once the data sources have been properly pre-processed, the next step is to classify/diagnose subjects through the proposed ensemble classification model. For that, the nested loop in Figure 4 consists of SVM classifiers with linear kernels for VAF, Morp, CSF, RNA, and Serum. Then, in order to validate the results of each classifier, a leave-one-out validation strategy has been carried out. Individual accuracy, sensitivity, specificity, and precision are summarized in Table 5. Note that for VAF, only voxels from the striatum area were considered as input features.
For greater reliability, a non-parametric permutation test was performed for all sets of medical biomarkers (CSF, RNA, and Serum) to assess the statistical difference between the accuracy rates obtained using the SVM classifiers with linear kernels. 1,000 sets of random diagnostic labels (each of them with the same length as the original) were generated; each classifier was then trained with these random labels and its accuracy estimated. Histograms of the results were generated and subsequently compared to the original SVM results, as shown in Figure 6.
A one-sample t-test was also performed a posteriori. As shown in Table 6, the results rejected the null hypotheses. This means that the data in each permutation test do not come from a normal distribution with mean equal to the accuracy obtained by the respective original classification.
Once the nested loop is fully iterated, the individual classifications are performed and the ensemble classification methodology can be carried out.
Different ensemble classification approaches, most of them based on Performance Weighting (PW), have been evaluated, as shown in Table 7. Final results including the individual classifications and the ensemble fusion method are presented in Figure 7. Although all classifications were performed using linear SVM classifiers, as mentioned in the Introduction, a second battery of simulations was also performed making use of K-Nearest Neighbor (KNN) classifiers. The results of these simulations are included as Supplementary Material. Due to the worse classification rates obtained with this kind of classifier, their use was discarded.
Finally, to highlight the difference between the sets of medical tests (CSF, RNA, and Serum), the image features, and the ensemble model that combines all of them, a further comparison was performed by means of Receiver Operating Characteristic (ROC) curves (Zweig and Campbell, 1993) for the seven experiments (see Figure 8).
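The area under a ROC curve can be computed without explicitly tracing the curve, via its rank (Mann-Whitney) formulation: the probability that a randomly chosen positive subject scores higher than a randomly chosen negative one. This equivalent formulation (not the specific tooling of the paper) is sketched below:

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Higher scores mostly, but not always, correspond to PD (positive) subjects
auc = roc_auc(scores=[0.9, 0.8, 0.7, 0.4, 0.3, 0.6],
              labels=[True, True, False, True, False, False])
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which is why the ensemble's curve can be compared directly against the single-source curves in Figure 8.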

DISCUSSION AND CONCLUSIONS
Despite the interest, many questions remain open surrounding Parkinson's Disease. As a general view (Meireles and Massano, 2012), it is expected that the combination of different data sources will provide the necessary keys to determine precisely the origins and predictive factors of PD.
Although medical science has begun to consider neuroimaging analysis as the reference test in the diagnosis of Parkinson's Disease (Salvatore et al., 2014), results like VAF analyses with accuracies of up to 95% in many studies are hard to improve even by employing advanced Machine Learning techniques. In these terms, this work presents several significant strengths: a robust classification methodology that combines an effective intensity normalization technique based on the use of α-Stable distributions; a classification scheme that maximizes the models obtained for each group of markers; a multimodal CAD system that combines multiple heterogeneous data sources; and an ensemble classifier that selects the most reliable characteristics from the input sources, as indicated in Tables 5, 7.
If we compare our final proposal (Performance Weighting with Quadratic Windowing) with the baseline method (Majority Voting), as shown in Table 7, we obtain an average improvement of 7.46%. This fact reinforces our main idea: if we use better (more discriminative) biomarkers, ensemble classification rates will increase. As can be checked, biomedical tests with poor classification rates in the internal cross-validation loop are strongly penalized by the windowing technique, so the final classification (external loop) makes little use of them. In fact, for this work, only the image-based classifiers (VAF and Morp), with average accuracies of 94.38 and 90.64%, respectively, have proven good enough for the final ensemble classification. Their importance is explained through the quadratic windowing method described in (11). For example, comparing the results from experiment 2, the CSF and RNA tests resulted in weights of w_CSF = 0.10 and w_RNA = 0.14, whereas VAF obtained a weight of w_VAF = 0.90 and Morp obtained w_Morp = 0.77. As the image-based markers presented higher weights, the final classification result is similar to theirs. For this study, the results issued by Welch's U-Test are consistent with the current state-of-the-art, as reflected in Gallegos et al. (2015), Klettner et al. (2016), Xu et al. (2017), Hu et al. (2017), Vanle et al. (2017), and Abbasi et al. (2018), particularly for the CSF and RNA tests (CSF Alpha-synuclein, pτ181P, Total-τ, and GAPDH). We confirm this hypothesis since we obtain better ensemble classification results when those biomarkers are included in our multimodal experiments. However, as the weights obtained from these biomedical tests were rather small, the ensemble methodology has not been able to take advantage of them. Only features with individual classification rates equal to or above 50% are useful for our classification purposes.
Though it could be seen as a disadvantage, discarding the groups of tests that are not well related to the disease prognosis also decreases computation costs and lets us focus on those biomedical tests that really matter.
Experiments involving Serum tests presented high accuracy rates. Nevertheless, they do not provide a reliable source of information, as reflected in the ROC curves (Figure 8), whose AUC values for the ensemble model are substantially below those of VAF or Morp alone. A direct consequence of this fact may be the need, in future works, to discard this type of test defined by the PPMI in a previous phase.
In view of the obtained results, and as can be seen in Figure 6 in relation to the biomedical tests, no general conclusions can be drawn for the experiments that presented p-values above the 5% significance level (none of the experiments presented a p-value under 0.05, and only experiments 2 and 4 had p-values between 0.05 and 0.1). In comparison with Welch's U-Test in Table 3, RNA and CSF features with p-values below 0.05 should be enough to discern between PD and HC subjects. However, this idea is not reflected in the permutation tests. The main reason could be the small sample size of the groups: if the variance of the accuracy distribution increases, the p-value also increases.
This CAD system can be used to support an early diagnosis or to track the evolution of Parkinson's Disease. Subjects' information from the last 5, 10, 15, or 20 years may be used to determine how the disease has progressed. In this sense, if we could work with longitudinal information, we would face Parkinson's Disease from a different perspective: not only confirming whether a subject shows signs of suffering the neurological disorder, but also whether that person may develop the pathology in the future.
Though there are not many works on the use of ensemble classification methodologies for the study of neurodegenerative diseases, the use of Neural Networks or Tree-Based Models with different kinds of classifiers as ensemble approaches is quite prominent. Works like those presented in Khan et al. (2016) and Li and Wang (2017), which made use of datasets based on speech recordings, were able to reach accuracies of up to 90%. Other works, like Challa et al. (2016), also combine different imaging biomarkers with biomedical tests to model the disease. In this sense, we could also cite the work presented in Latourelle et al. (2017), which performs a longitudinal study of Parkinsonism based on the use of different clinical, molecular, and genetic data. The small size of the datasets used in some of these studies and the computation costs in several cases may be some of their strongest disadvantages with respect to our proposal. Only the proposal presented in Ramírez et al. (2018), for Alzheimer's Disease diagnosis, makes use of a multi-level robust ensemble classification model.
One last point to close this section concerns the most important problem we have had to face: the lack of results for all medical tests for all patients. Although our study was designed to work with the entire PPMI database, due to these missing tests our experiments have not been able to count on all subjects. In this sense, the following ideas have been suggested for future works:
• The inclusion of Missing Data (MD) techniques, which are already being implemented in fields like wireless networks or data mining (Magán-Carrión et al., 2015).
• The addition of new promising biomarkers, as referred to in Saiki et al. (2017) and Delgado-Alvarado et al. (2017), or the study of relations between existing ones (Constantinides et al., 2017; Fereshtehnejad et al., 2017).
• The inclusion of new image markers, as stated in Saeed et al. (2017), or the combined use of different image sources, as done by Segovia et al. (2017b).
• The design of a dynamic feature selection procedure for the internal loop which may also be used by the external ensemble loop.
Regarding its easy adaptation, the methodology presented in this work can also be used with many other databases such as ADNI (http://adni.loni.usc.edu/) or DIAN (https://dian.wustl.edu/). Moreover, extending this proposal with semi-supervised learning procedures or data imputation techniques will help cope with the lack of complete tests.