AUTHOR=Zeng Lixuan , Liu Lei , Chen Dongxin , Lu Henghui , Xue Yang , Bi Hongjie , Yang Weiwei TITLE=The innovative model based on artificial intelligence algorithms to predict recurrence risk of patients with postoperative breast cancer JOURNAL=Frontiers in Oncology VOLUME=Volume 13 - 2023 YEAR=2023 URL=https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2023.1117420 DOI=10.3389/fonc.2023.1117420 ISSN=2234-943X ABSTRACT=Purpose This study aimed to develop a machine learning model that predicts the recurrence risk of breast cancer patients after surgery by extracting the clinicopathological features of tumors from unstructured clinical electronic health record (EHR) data. Methods This retrospective cohort included 1,841 breast cancer patients who underwent surgical treatment. To extract the principal features associated with recurrence risk, the clinical notes and histopathology reports of patients were collected and processed with feature engineering. Predictive models were then constructed from these features. All algorithms were implemented in Python. The accuracy of the prediction models was further verified in the test cohort. The area under the curve (AUC), precision, recall, and F1 score were used to evaluate the performance of each model. Results The training cohort comprised 1,289 patients and the test cohort 552 patients. A total of 1,841 textual reports from 2011 to 2019 were included. For the prediction of recurrence risk, LSTM, XGBoost, and SVM achieved accuracies of 0.89, 0.86, and 0.78, respectively. The AUC values of the micro-average ROC curve for LSTM, XGBoost, and SVM were 0.98 ± 0.01, 0.97 ± 0.03, and 0.92 ± 0.06, respectively. The LSTM model in particular outperformed the other models: its accuracy, F1 score, macro-average F1 score (0.87), and weighted-average F1 score (0.89) were the highest. All P values were statistically significant. 
Patients in the high-risk group predicted by our model were more resistant to DNA-damaging and microtubule-targeting drugs than those in the intermediate-risk group. No statistically significant difference was found between the predicted low-risk patients and the intermediate- or high-risk patients, owing to the small sample size (188 patients were predicted to be low-risk by our model, and only two of them received chemotherapy alone after surgery). The prognosis of patients predicted by our model was consistent with the actual follow-up records. Conclusions The constructed model accurately predicted the recurrence risk of breast cancer patients from EHR data, and evaluated the chemoresistance and prognosis of patients. Therefore, our model could help clinicians formulate individualized management plans for breast cancer patients.
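
The accuracy, macro-average F1, and weighted-average F1 scores reported in the abstract follow standard definitions for multi-class classification. As a minimal sketch, assuming three risk classes (low, intermediate, high) and invented toy labels rather than the study's data, these metrics could be computed in plain Python as shown here. The function names `per_class_scores` and `summarize` are hypothetical and do not come from the paper.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def summarize(y_true, y_pred, labels):
    """Accuracy plus macro- and weighted-average F1 over the given labels."""
    scores = per_class_scores(y_true, y_pred, labels)
    support = Counter(y_true)  # number of true samples per class
    n = len(y_true)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    # Macro average: every class counts equally, regardless of its size.
    macro_f1 = sum(scores[label][2] for label in labels) / len(labels)
    # Weighted average: each class's F1 is weighted by its true-sample count.
    weighted_f1 = sum(scores[label][2] * support[label] for label in labels) / n
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


# Toy labels for illustration only (NOT data from the study).
LABELS = ["low", "intermediate", "high"]
y_true = ["low", "low",
          "intermediate", "intermediate", "intermediate",
          "high", "high", "high", "high", "high"]
y_pred = ["low", "intermediate",
          "intermediate", "intermediate", "high",
          "high", "high", "high", "high", "low"]

result = summarize(y_true, y_pred, LABELS)
print(result)
```

Because the macro average treats each class equally while the weighted average follows class sizes, the two can differ when the risk groups are imbalanced, which is one plausible reason the abstract reports both.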