Computational Intelligence Technique for Prediction of Multiple Sclerosis Based on Serum Cytokines

Multiple sclerosis (MS) is a neurodegenerative disease characterized by lesions in the central nervous system (CNS). Inflammation and demyelination are the leading causes of neuronal death and brain lesion formation. Immune reactivity is believed to be essential to the neuronal damage in MS. Cytokines play an important role in the differentiation of Th cells and the recruitment of auto-reactive B and T lymphocytes, which leads to neuron demyelination and death. Several cytokines have been linked with MS pathogenesis. In the present study, serum levels of eight cytokines (IL-1β, IL-2, IL-4, IL-8, IL-10, IL-13, IFN-γ, and TNF-α) were analyzed in USA and Russian MS cohorts to identify predictors of the disease. Further, the model was extended to classify MS into remitting and non-remitting by adding age, gender, disease duration, Expanded Disability Status Scale (EDSS), and Multiple Sclerosis Severity Score (MSSS) to the cytokine dataset of the Russian cohort. The individual serum cytokine data for the USA cohort were generated by the Z score percentile method using RStudio, while serum cytokines of the Russian cohort were analyzed using multiplex immunoassay. Datasets were divided into training (70%) and testing (30%) subsets and used as input to four machine learning models (support vector machine, decision tree, random forest, and neural network) available in the R programming language. The random forest (RF) model was identified as the best model for diagnosis of MS, as it performed remarkably well on all the considered criteria, i.e., Gini, accuracy, specificity, AUC, and sensitivity. The RF model also performed best in predicting remitting and non-remitting MS. The present study suggests that the concentration of serum cytokines could be used as prognostic markers for the prediction of MS.


INTRODUCTION
Multiple sclerosis (MS) is a chronic disease of the central nervous system (CNS) caused by chronic inflammation and an autoimmune response. Based on the onset of symptoms and their progression, MS can be classified into relapsing-remitting (symptoms appearing and disappearing), primary progressive (progressive symptom elevation), and secondary progressive (relapsing-remitting MS developing into progressive MS) multiple sclerosis. The disease is characterized by demyelinating areas in the brain and spinal cord which appear as plaques or lesions in the white and gray matter (1,2). The blood-brain barrier (BBB) was shown to be affected, which explains the presence of circulating leukocytes in the brain matter (3). Auto-reactive T lymphocytes penetrating the BBB could target neuroglia, leading to more damage within the brain and thus exposing myelin antigens. These auto-reactive T cells can cause deterioration of the myelin sheath, which is essential for signal transmission within the brain (4). Depending on the location of lesions in the brain, the clinical symptoms of MS vary and include vision loss, numbness, fatigue, movement difficulties, and many more (5).
Neuronal damage and neuroglial activation could cause the secretion of various cytokines which are involved in the differentiation of Th1, Th2, Th9, and Th17 lymphocytes (6). Studies have shown changes in the levels of various cytokines in serum and cerebrospinal fluid (CSF) of MS patients as compared to controls (7)(8)(9). These cytokines are associated with Th1 (IFN-γ, TNF-α, IL-2) and Th2 (IL-4, IL-5, IL-13, IL-6) type immune responses. Also, activation of Th17 and Th9 cells, secreting IL-17 and IL-9, respectively, was shown to play a role in the progression of MS (10). Interestingly, loss of natural regulatory T cell (Treg) function was demonstrated to be one of the factors leading to MS (11,12). It is believed that suppression of the Treg population can lead to proliferation of auto-reactive T cells in MS (11).
The analysis of body fluids such as blood, saliva, cerebrospinal fluid, and urine is often used to diagnose various diseases at an early stage. This analysis can be highly accurate and more cost-effective than conventional diagnostic techniques such as computed tomography (CT), magnetic resonance imaging (MRI) scans, and tissue biopsies. Body fluids are commonly analyzed to determine changes in biomolecules which are either directly or indirectly associated with disease progression. Since blood cytokines are known to be affected in MS, we propose that changes in cytokine levels could be used as prognostic markers for MS diagnosis.
Machine learning approaches have been successfully employed for prediction of Alzheimer's disease, diabetes, and inflammatory bowel disease, and for diagnosis of glaucoma (13)(14)(15)(16). Recently, a machine learning approach was applied to a demographic dataset to predict MS disease course (17). Martins et al. analyzed thirteen inflammatory cytokines in 833 MS patients and 117 controls from the USA population (18). Eight of the thirteen cytokines were found to differ significantly in MS as compared to controls (18). These eight cytokines were also analyzed in MS patients and controls of the Russian cohort. In the current study, four machine learning models were applied to predict MS using data on these eight cytokines (IL-1β, IL-2, IL-4, IL-8, IL-10, IL-13, IFN-γ, and TNF-α) from the USA and Russian cohorts. Further, machine learning models were also used to classify MS into remitting and non-remitting based on the serum levels of the eight cytokines, age, gender, disease duration, Expanded Disability Status Scale (EDSS), and Multiple Sclerosis Severity Score (MSSS).

Dataset Selection
Concentration data for eight cytokines (IL-1β, IL-2, IL-4, IL-8, IL-10, IL-13, IFN-γ, and TNF-α) in the serum of MS patients and controls were selected from studies of USA and Russian populations. Of the two independent USA studies, one analyzed the concentration of serum cytokines in 833 MS patients and 117 healthy volunteers using multiplex immunoassay (18), while the other analyzed the concentrations of serum cytokines in 26 MS patients and 11 controls (19). Data on the same eight serum cytokines in 97 MS patients and 71 controls of the Russian cohort were also included in the analysis. The Russian control cohort comprised 53 females and 18 males with an average age of 28.6 ± 8.8 years. The demographic and clinical features of the 97 Russian MS patients are summarized in Table 1.

Dataset Generation
The dataset for the USA population was generated using the Z score percentile based method, while the Russian cytokine data were obtained using multiplex magnetic bead-based antibody detection assays.

Z Score Percentile Method
Cytokine data from the two previously published USA studies were reported as mean ± standard deviation (SD) or standard error of the mean (SEM). To convert SEM into SD, the SEM was multiplied by the square root of the sample size (n). One of the major challenges was to generate individual cytokine data from the reported values, as the data were mostly available only as mean ± SD/SEM. Two methods of data generation were considered: solving a series of non-linear equations and a Z score percentile based approach.
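The SEM-to-SD conversion described above can be illustrated with a minimal sketch. It is written in Python rather than the R used in the study, and the function name `sem_to_sd` and the example numbers are hypothetical, chosen only for illustration.

```python
import math


def sem_to_sd(sem: float, n: int) -> float:
    """Convert a standard error of the mean to a standard deviation.

    Uses the relation SD = SEM * sqrt(n), where n is the sample size.
    """
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return sem * math.sqrt(n)


# Hypothetical example: an SEM of 2.0 reported for a group of 25 subjects.
example_sd = sem_to_sd(2.0, 25)
print(example_sd)
```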
To choose the best method for data analysis, random values of 50 cytokines were taken, and the actual values were compared with the values generated by the Z score method and by the system of non-linear equations (data not shown). The data generated by the Z score method were found to be more accurate. Hence, to generate the raw data from mean ± SD/SEM, the Z score percentile method was used, where the population was presumed to follow the normal distribution (20). The Z score percentile method was implemented in R (an open source software licensed under GNU GPL) to calculate individual data. In this method, 99.7% of the total population was included and the remaining 0.3% was considered outliers and was excluded from the analysis (Supplementary Figure 1).

Serum cytokines (IL-1β, IL-2, IL-4, IL-8, IL-10, IL-13, IFN-γ, and TNF-α) were analyzed using Pro Human Cytokine 27plex Bio-Plex (Bio-Rad, Hercules, CA, USA) multiplex magnetic bead-based antibody detection kits following the manufacturer's instructions. Serum aliquots (50 µl) were used for the analysis with a minimum of 50 beads per analyte acquired. Median fluorescence intensities were measured using a Luminex 200 analyzer. The collected data were analyzed with MasterPlex CT control software and MasterPlex QT analysis software (Hitachi Software, San Bruno, CA, USA). A standard curve for each analyte was generated using standards provided by the manufacturer.
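One possible reading of the Z score percentile method is sketched below. This is an assumption-laden illustration, not the authors' R implementation: it assumes that individual values are placed at evenly spaced percentiles within the central 99.7% of a normal distribution and mapped back through value = mean + z × SD. The function name `generate_individual_values` and the `coverage` parameter are hypothetical, and Python is used in place of R.

```python
from statistics import NormalDist


def generate_individual_values(mean, sd, n, coverage=0.997):
    """Generate n synthetic individual values from a reported mean and SD.

    Assumption: the population is normal, and values are placed at evenly
    spaced percentiles inside the central `coverage` fraction, so the
    outer 0.3% (by default) is never sampled.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    tail = (1.0 - coverage) / 2.0      # probability mass cut from each tail
    standard_normal = NormalDist()     # mean 0, SD 1
    values = []
    for i in range(n):
        # Midpoint percentile of the i-th of n equal slices of the kept range.
        percentile = tail + coverage * (i + 0.5) / n
        z = standard_normal.inv_cdf(percentile)  # Z score for that percentile
        values.append(mean + z * sd)
    return values


# Hypothetical example: 117 controls with a reported mean of 10.0 and SD of 2.0.
synthetic = generate_individual_values(10.0, 2.0, 117)
print(min(synthetic), max(synthetic))
```

Because the percentiles are symmetric around 50%, the generated values reproduce the reported mean, and all of them lie within three SDs of it. The sketch does not truncate at zero, so a small mean with a large SD could yield negative concentrations that would need separate handling.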

Machine Learning Methods
Four machine learning models, Random Forest (RF) (21), Decision Tree (DT), Support Vector Machine (SVM) (22), and Neural Network (NN) (23), were used in the study. The required packages and the tuning parameters used to obtain optimum results with these models are summarized in Table 2. The models were trained using a formula that includes the factors required to predict target 1 (MS vs. control) or to classify target 2 (remitting vs. non-remitting MS).

Model Evaluation
The performance of models was evaluated using various parameters such as Gini, AUC, accuracy, specificity, and sensitivity (24). The following equations were used to calculate these parameters:
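A minimal sketch of these parameters follows, assuming the standard confusion-matrix definitions of accuracy, sensitivity, and specificity, and the common relation Gini = 2 × AUC - 1. The function names are hypothetical and the code is in Python rather than the R used in the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, and specificity from confusion counts.

    tp/fn: MS patients predicted as MS / as control.
    tn/fp: controls predicted as control / as MS.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def gini_from_auc(auc):
    """Gini coefficient derived from the area under the ROC curve."""
    return 2.0 * auc - 1.0


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, tn=15, fp=5, fn=10)
print(metrics, gini_from_auc(0.9))
```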

Repeated K-Fold Cross Validation
Repeated K-fold cross validation was performed to test the robustness of the proposed model by increasing the number of model runs. In this method, K-fold cross validation is repeated n times to trace fluctuations in model accuracy. If the variation in accuracy is low, the model is considered robust and its predictions reliable. In the present study, the dataset was divided into six equal portions and 6-fold cross validation was repeated three times to avoid discrepancies.
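The splitting scheme can be sketched as follows, assuming a fresh random shuffle before each repeat. The function name `repeated_kfold_indices` and the `seed` parameter are hypothetical, and Python is used here in place of R. With 6 folds and 3 repeats, the sketch yields 18 train/test splits whose accuracies can then be compared for variation.

```python
import random


def repeated_kfold_indices(n_samples, k=6, repeats=3, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated K-fold CV.

    Each repeat reshuffles the sample indices and splits them into k folds;
    every fold serves once as the test set while the others form the training set.
    """
    rng = random.Random(seed)
    for repeat in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Striding gives k folds whose sizes differ by at most one sample.
        folds = [indices[i::k] for i in range(k)]
        for fold_id in range(k):
            test_idx = folds[fold_id]
            train_idx = [
                idx
                for other_id, fold in enumerate(folds)
                if other_id != fold_id
                for idx in fold
            ]
            yield repeat, fold_id, train_idx, test_idx


# Hypothetical example: 20 samples, 6 folds, 3 repeats.
for repeat, fold_id, train_idx, test_idx in repeated_kfold_indices(20):
    print(repeat, fold_id, len(train_idx), len(test_idx))
```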

The Proposed Predictive Model
The proposed algorithm to predict and classify MS is summarized in Figure 2. The model is based on the serum levels of eight cytokines in MS patients and controls. Datasets of cytokine levels, age, and gender were used as input for the machine learning model to predict whether a person has MS. Once MS is diagnosed, the model is able to classify it into remitting and non-remitting MS based on serum cytokines, age, gender, disease duration, EDSS, and MSSS.
Four machine learning models were employed to predict MS using a dataset including 910 MS patients and 199 controls. The dataset was prepared by random shuffling of the USA and Russian cohorts and was then divided into training (70%) and testing (30%) subsets. The data were divided as follows: 900 (training dataset) and 209 (testing dataset). The training dataset consisted of unbalanced data on MS patients (750) and controls (150), which was further distributed by dividing the patient data into five subsets to create a balance between the patient and control datasets. All four machine learning models were trained separately using each balanced dataset. All five trained models were then tested using the test dataset. Predictions generated by the five trained models were combined using the majority voting ensemble technique. Using SVM, DT, and RF, fair accuracy of MS prediction was demonstrated (83-91%). When the additional parameters used for the analysis (Gini, AUC, specificity, and sensitivity) were considered, the RF model demonstrated the best performance as compared to the other models. Therefore, RF was selected as the model for the prediction of MS and used for validation (Table 3).

The prediction of MS was also performed with the inclusion of age and gender along with cytokine values in the Russian cohort, where datasets were divided into training (70%) and testing (30%) subsets. The accuracy of MS diagnosis for the different models was within the range of 89-99% (Figure 3). The RF model demonstrated 70% accuracy in classifying remitting and non-remitting MS, while the accuracy of the DT, NN, and SVM models was 63, 54, and 47%, respectively (Figure 4).

The Russian MS cohort included 97 patients, 22 of whom were taking medication. Therefore, to assess the effect of MS treatment on MS prediction accuracy, the 97 MS patients were compared with the 75 MS patients without treatment. Data analysis did not reveal a difference between these two groups (Figures 3, 4). Thus, it was concluded that the MS prediction accuracy is not affected by the inclusion of patients undergoing treatment.
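The balancing and majority voting steps described in this section can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: it assumes that each of the five patient subsets is paired with the full set of training controls, and the function names `balanced_subsets` and `majority_vote` are hypothetical. Model training itself is omitted.

```python
import random
from collections import Counter


def balanced_subsets(patients, controls, n_subsets=5, seed=0):
    """Split the patient records into n_subsets parts, each paired with all controls.

    With 750 patients and 150 controls, this gives five balanced
    training sets of 150 patients and 150 controls each.
    """
    rng = random.Random(seed)
    shuffled = list(patients)
    rng.shuffle(shuffled)
    chunks = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [(chunk, list(controls)) for chunk in chunks]


def majority_vote(predictions):
    """Combine per-model label lists into one label per test sample.

    predictions: one list of predicted labels per trained model, all of
    equal length. An odd number of models avoids ties for binary labels.
    """
    voted = []
    for labels_for_sample in zip(*predictions):
        most_common_label, _count = Counter(labels_for_sample).most_common(1)[0]
        voted.append(most_common_label)
    return voted


# Hypothetical record identifiers stand in for the actual training data.
subsets = balanced_subsets(range(750), range(150))
print([(len(p), len(c)) for p, c in subsets])

# Hypothetical predictions from three models on three test samples.
print(majority_vote([
    ["MS", "MS", "control"],
    ["MS", "control", "control"],
    ["control", "MS", "control"],
]))
```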
IL-6 and IFN-α were shown to play a role in MS pathogenesis (25,26). Therefore, we included these cytokines in the dataset and calculated the MS prediction accuracy. We found that inclusion of these cytokines did not improve the accuracy of MS prediction and classification (Figures 5, 6).

Validation of the Proposed Model
To demonstrate that the trained model is not overfitted, underfitted, or biased, repeated 6-fold cross validation was performed, and the accuracy of the proposed model was evaluated (Figure 7). The receiver operating characteristic (ROC) curve represents the true positive rate (sensitivity) against the false positive rate (1 - specificity) of a model; the sensitivity and specificity are calculated for each data point to plot the graph. The area under the curve (AUC) can be considered a criterion for measuring the ability of the model to discriminate between patients and controls. ROC curves were generated for each model to demonstrate its performance (Figure 8). The RF model performed well as compared with the other models (Figure 8).
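How ROC points and the AUC follow from per-sample scores can be sketched as below. The sketch assumes that each model outputs a score where higher values indicate MS, and it integrates the curve with the trapezoidal rule. The function names and the example data are hypothetical, and Python is used instead of the R packages employed in the study.

```python
def roc_curve_points(labels, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    labels: 1 for MS, 0 for control. scores: higher means more likely MS.
    Each distinct score is used once as a decision threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and scores, for illustration only.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_trapezoid(roc_curve_points(example_labels, example_scores)))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of patients and controls.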

DISCUSSIONS
The pathogenesis of MS is complex and involves multiple factors, which makes prediction and early diagnosis of the disease challenging. Recently, different computational methods were applied to develop interactive design and optimisation of a synthetic biological system to study the pathogenesis of diabetes (27). The present study was designed to develop novel approaches for diagnosis of MS, because early diagnosis could significantly increase the success rate of current treatment. Artificial intelligence holds great potential for early diagnosis and prediction of treatment outcome. Several machine learning models have been developed to predict the development of heart disease, Parkinson's disease, and breast cancer (28)(29)(30). In this study, the RF model was identified as the best for predicting MS based on the serum levels of eight cytokines. The RF model also showed good accuracy in classifying MS into remitting and non-remitting.
MS is a neurological disease highly prevalent in many European countries, the USA, Canada, and Australia (31). Clinically, MS is characterized by neurological dysfunction which often leads to disability (32). Despite the advances made in our understanding of MS pathogenesis, prognostic markers for prediction of the disease remain largely unknown. Cytokines were shown to be consistently affected in the serum of MS patients (18). Also, multiple studies have demonstrated that cytokines play a crucial role in the pathogenesis of MS (33,34). For example, Martins et al. have shown that seven cytokines (IL-2, IL-4, IL-10, IL-13, IL-1β, IFN-γ, and TNF-α) were significantly elevated in MS patients, while IL-8 was significantly lower in MS as compared to controls (18). Interestingly, serum levels of IL-2, IL-4, IL-10, IL-13, IL-1β, IFN-γ, and TNF-α were found to be elevated in Russian MS patients as compared to controls, which was similar to the findings in the USA cohort. These data suggest that the pathogenesis of MS in the Russian and USA populations could be similar. The only exception was the change in the serum level of IL-8, which was lower in USA MS patients and higher in Russian MS patients as compared to the respective controls. IL-8 is a polypotent cytokine involved in the regulation of inflammation, recruiting neutrophils, basophils, T lymphocytes, and NK cells as well as enhancing the permeability of the endothelial barrier (35)(36)(37)(38). The difference in IL-8 serum level between the Russian and USA MS cohorts could reflect dissimilarities in disease pathogenesis, which could be related to genetic predisposition, sun exposure, vitamin D production, smoking, etc.
We suggest that changes in serum cytokine levels could be used as predictors or diagnostic biomarkers for MS. Data on serum cytokine levels in the USA MS cohort were used in our study to develop the machine learning model. To increase the number of samples, data from another report on USA MS serum cytokine levels were included in the analysis (19). The raw data from these two studies were calculated via the Z score percentile method. The real experimental data obtained by multiplex immunoassay from the Russian cohort were added to the resulting synthetic data to improve the quality of prediction. Four machine learning models were trained to predict MS, where prediction was based on the combined effect of the serum levels of eight cytokines. Three models (SVM, DT, and RF) showed good accuracy for MS prediction. The model performance was further evaluated using additional factors (Gini, AUC, specificity, and sensitivity). The RF model showed the best performance on each evaluation parameter. These data suggest that RF analysis of the serum levels of eight cytokines (IL-1β, IL-2, IL-4, IL-8, IL-10, IL-13, IFN-γ, and TNF-α) could be used to predict MS. The RF model showed an accuracy of 70% in classifying MS into remitting vs. non-remitting when age, gender, disease duration, EDSS, and MSSS were included as classification parameters in addition to cytokine levels. These data corroborate a previous report where the accuracy of predicting MS disease course was 60-70% when demographic (age, disease onset, gender, and smoking history) and clinical factors (expanded disability status scale, visual disability score, and mental disability score) were included in the prediction model (17).
IL-6 and IFN-α are inflammatory cytokines that are also affected in MS (25,26). Therefore, an MS prediction and classification algorithm was designed including these cytokines. Interestingly, adding IL-6 and IFN-α did not improve the accuracy of MS diagnosis and classification. This suggests that although IL-6 and IFN-α contribute to MS pathogenesis, data on the serum levels of eight cytokines (IL-1β, IL-2, IL-4, IL-8, IL-10, IL-13, IFN-γ, and TNF-α) provide sufficient input to diagnose and classify MS.
Analysis of cerebrospinal fluid (CSF) demonstrated an association between cytokines and MS pathogenesis; however, the data remain inconsistent (39). In our previous report, ten (IL-2RA, CCL5, CCL11, CXCL1, CXCL10, CXCL12, MIF, IFN-γ, TRAIL, and SCF) out of forty-eight cytokines were found to be elevated in MS as compared to non-MS controls (40). Of the cytokines used in our prediction model, only IFN-γ was found to be increased in the CSF of MS patients in that study, while the remaining cytokines (IL-1β, IL-2, IL-4, IL-8, IL-10, IL-13, and TNF-α) did not change significantly as compared to controls. Therefore, we did not include CSF cytokine data in our prediction model. Additionally, CSF collection is a painful and invasive procedure requiring highly trained personnel. Also, CSF analysis is not always required for MS diagnosis. In contrast, MS serum samples are often collected for routine clinical analysis, making them readily available for cytokine detection. The current approach could also be applied to differentiate MS from other neuro-inflammatory diseases.

CONCLUSION
Early diagnosis of MS remains a challenge since the disease develops slowly and clinical symptoms are often identified when brain tissue is already damaged. In the present study, the RF model was found to have an accuracy of 91%, which suggests that it could be applied to predict MS using the serum levels of eight cytokines (IL-1β, IL-2, IL-4, IL-8, IL-10, IL-13, IFN-γ, and TNF-α). Further, the accuracy of MS classification into remitting vs. non-remitting by RF was observed to be 70% with the inclusion of age, gender, disease duration, EDSS, and MSSS in addition to serum cytokines. This is the first study where the serum levels of eight cytokines were used to predict MS in two distinct cohorts of patients.

ETHICS STATEMENT
Informed consent was obtained from each subject according to the clinical and experimental research protocol, approved by the Biomedicine Ethic Expert Committee of Republican Clinical Neurological Center, Republic of Tatarstan, Russian Federation (No.218; 11.15.2012).

AUTHOR CONTRIBUTIONS
MG: original idea generation, literature reviews for MS cytokines data, computational work that includes generation of USA MS cytokines data, compilation of results and figures, manuscript writing. DK: running of machine learning models and results generation of the models. PR: supervised the research work of machine learning models. TK and EM: involved in collection of MS and control samples and MS clinical data. SK: cytokines analysis of Russian population and manuscript editing. AR: arranging the work of Russian cytokines data analysis that includes the MS and control samples. MB: formulation of idea, overall responsible for coordinating the research project and managing multisite collaboration, writing the manuscript.

ACKNOWLEDGMENTS
AR and SK were supported by the Russian Government Program of Competitive Growth of Kazan Federal University. AR was personally supported by state assignments 20.5175.2017/6.7 and 17.9783.2017/8.9 of the Ministry of Science and Higher Education of Russian Federation.