AUTHOR=Wang Jianan, Gong Xiaoxian, Chen Hongfang, Zhong Wansi, Chen Yi, Zhou Ying, Zhang Wenhua, He Yaode, Lou Min TITLE=Causative Classification of Ischemic Stroke by the Machine Learning Algorithm Random Forests JOURNAL=Frontiers in Aging Neuroscience VOLUME=Volume 14 - 2022 YEAR=2022 URL=https://www.frontiersin.org/journals/aging-neuroscience/articles/10.3389/fnagi.2022.788637 DOI=10.3389/fnagi.2022.788637 ISSN=1663-4365 ABSTRACT=Background: Prognosis, recurrence rate, and secondary prevention strategies differ by etiology in acute ischemic stroke. However, identifying the cause of a stroke is challenging. Objective: To develop a model that identifies the cause of stroke using machine learning (ML) methods and to test its accuracy. Methods: We retrospectively reviewed data from CASE-II (NCT04487340) for patients with a determined etiology as defined by the TOAST (Trial of ORG 10172 in Acute Stroke Treatment) classification, and used them to train and evaluate six ML models [Random Forests (RF), Logistic Regression (LR), Extreme Gradient Boosting (XGBoost), K-Nearest Neighbor (KNN), AdaBoost, Gradient Boosting Machine (GBM)] for the detection of cardioembolism (CE), large-artery atherosclerosis (LAA) and small-artery occlusion (SAO). Between October 2016 and April 2020, patients were consecutively enrolled for algorithm development (phase one). Between June 2020 and December 2020, patients were consecutively enrolled in the test set for algorithm testing (phase two). Area under the curve (AUC), precision, recall, accuracy and F1 score were calculated for the prediction model. Results: A total of 18209 patients were enrolled in phase one, of whom 13590 (6089 CE, 4539 LAA, 2962 SAO) were included in the model, and a total of 3688 patients were enrolled in phase two, of whom 3070 (1103 CE, 1269 LAA, 698 SAO) were included in the model. Among the six models, RF, XGBoost and GBM performed best, and we chose the RF model as our final model.
On the test set, the AUCs of the RF model for predicting CE, LAA and SAO were 0.981 (95% CI, 0.978-0.986), 0.919 (95% CI, 0.911-0.928), and 0.918 (95% CI, 0.908-0.927), respectively. The most important features for identifying CE, LAA and SAO were atrial fibrillation and the degree of stenosis of the intracranial arteries. Conclusions: The proposed RF model could be a useful diagnostic tool to help neurologists classify the etiology of stroke. Trial Registration: https://www.clinicaltrials.gov; unique identifier: NCT04487340
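
Illustrative note: the abstract reports AUC, precision, recall, accuracy and F1 score for three classes (CE, LAA, SAO). The sketch below shows one common way such per-class metrics can be computed, treating each class in turn as positive against the other two (one-vs-rest). It is not the authors' code; the function names, the toy labels and the toy scores are assumptions made up for illustration and are not data from the study.

```python
from typing import Dict, Sequence


def one_vs_rest_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                        positive: str) -> Dict[str, float]:
    """Precision, recall and F1 for one class treated as positive vs. the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of cases whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def one_vs_rest_auc(y_true: Sequence[str], scores: Sequence[float],
                    positive: str) -> float:
    """AUC as the probability that a random positive case gets a higher
    score than a random negative case; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, NOT study data).
y_true = ["CE", "CE", "LAA", "LAA", "SAO", "SAO"]
y_pred = ["CE", "CE", "LAA", "SAO", "SAO", "LAA"]
ce_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.4]  # hypothetical model scores for class CE

for label in ("CE", "LAA", "SAO"):
    print(label, one_vs_rest_metrics(y_true, y_pred, label))
print("accuracy", accuracy(y_true, y_pred))
print("AUC (CE vs. rest)", one_vs_rest_auc(y_true, ce_scores, "CE"))
```

The pairwise AUC above is quadratic in the number of cases, which is fine for a small illustration; on cohorts of the size reported here, a rank-based implementation from an established statistics library would be the more practical choice.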