Artificial Intelligence in Decision-Making for Colorectal Cancer Treatment Strategy: An Observational Study of Implementing Watson for Oncology in a 250-Case Cohort

Background Personalized and novel evidence-based clinical treatment strategy consulting for colorectal cancer has been available through various artificial intelligence (AI) supporting systems such as Watson for Oncology (WFO) from IBM. However, the potential effects of this supporting tool in cancer care have not been thoroughly explored in real-world studies. This research aims to investigate the concordance between treatment recommendations for colorectal cancer patients made by WFO and a multidisciplinary team (MDT) at a major comprehensive gastrointestinal cancer center.
Methods In this prospective study, both WFO and the blinded MDT’s treatment recommendations were provided concurrently for enrolled colorectal cancers of stages II to IV between March 2017 and January 2018 at Shanghai Minimally Invasive Surgery Center. Concordance was achieved if the cancer team’s decisions were listed in the “recommended” or “for consideration” classification in WFO. A review was carried out after 100 cases for all non-concordant patients to explain the inconsistency, and corresponding feedback was given to WFO’s database. The concordance of the subsequent cases was analyzed to evaluate both the performance and learning ability of WFO.
Results Overall, 250 patients met the inclusion criteria and were recruited in the study. Eighty-one were diagnosed with colon cancer and 189 with rectal cancer. The concordances for colon cancer, rectal cancer, and overall were all 91%. The overall rates were 83, 94, and 88% in subgroups of stages II, III, and IV. When categorized by treatment strategy, concordances were 97, 93, 89, 87, and 100% for neoadjuvant, surgery, adjuvant, first line, and second line treatment groups, respectively. After the main factors causing discordance were analyzed, relevant updates were made in the database accordingly, which led to the concordance curve rising in most groups compared with the initial rates.
Conclusion Clinical recommendations made by WFO and the cancer team were highly matched for colorectal cancer. Patient age, cancer stage, and the consideration of previous therapy details had a significant influence on concordance. Addressing these perspectives will facilitate the use of the cancer decision-support systems to help oncologists achieve the promise of precision medicine.


INTRODUCTION
Colorectal cancer (CRC) is the third most commonly diagnosed cancer in both men and women worldwide (1). Its incidence and mortality rates have been increasing in China for several decades (2). The rapid expansion of clinical databases and massive genetic profiling programs has created tremendous challenges for oncologists, who have insufficient time to keep track of treatment-related information (3).
Early clinical decision-support systems, called expert systems (4), are computer programs that help clinicians manage the growing volume of relevant information. These systems collect and analyze knowledge in ways that allow algorithms to simulate human reasoning to assist decision-making. AI systems in cancer care have generally focused on obtaining information from unstructured data such as text (using natural language processing) or large structured datasets (using machine-learning methods) (5). However, a cognitive-support computer program for cancer treatment did not, as far as we know, emerge until the development of IBM's Watson for Oncology (WFO).
Although substantial computer science and clinical expertise, mainly from Memorial Sloan Kettering Cancer Center (MSKCC), guided the development of IBM WFO, which holds promise for improving the value of cancer care delivery, the prospects for its use in patients outside the US have not been clearly examined. According to reports from oncologists in China and other countries, concordance between treatment decisions made by physicians and WFO varies by cancer type: results for breast cancer (5), lung cancer (6), and gastric cancer (7) tended to be highly concordant, whereas the results in other studies (8,9) were not.
Hence, we carried out this prospective study to assess the level of agreement regarding colorectal cancer treatment between WFO and a multidisciplinary cancer team in a major comprehensive gastrointestinal cancer center in Shanghai, China. We report the results of decision concordance using the AI system and performed an in-depth analysis on patients where concordance was absent to update the AI model and discuss the potential value of the technology as a clinical adviser and a learning system in cancer treatment.

Study Design
This is a prospective, double-blind, and self-controlled trial to evaluate the clinical conformance between WFO and the multidisciplinary team of Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine (henceforward the RJ MDT) in patients undergoing colorectal cancer therapy in the gastrointestinal center. The clinical information of patients was entered into WFO with the patients' consent, and the results were compared with the actual clinical treatment plans made by the RJ MDT (Figure 1). This study was approved by the ethics committee of Ruijin Hospital.

Patients
Patients admitted to Ruijin Hospital between March 2017 and January 2018 were eligible for the trial if they were aged between 18 years and 90 years and were diagnosed with colorectal cancer proven by colonoscopy biopsy. All patients provided informed written consent and were advised of their extensive rights to know about related information of the study.

Comparison
WFO and the doctors in charge of running the system were blinded to the treatment strategies that had been made by the RJ MDT. Concordance was assessed based on how the MDT's therapy strategy was categorized in WFO's recommendation list. If the MDT's decision matched the "recommended" or "for consideration" categories, it was designated as concordant. If the decision was either in the "not recommended" table or not listed, the case was defined as non-concordant.
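The concordance rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the study: the names `WfoCategory`, `is_concordant`, `percent_agreement`, and the regimen labels in `example_options` are hypothetical, and WFO's real output format is not described in this article.

```python
from enum import Enum


class WfoCategory(Enum):
    """The three WFO recommendation categories named in the text."""
    RECOMMENDED = "recommended"
    FOR_CONSIDERATION = "for consideration"
    NOT_RECOMMENDED = "not recommended"


# Categories that count as agreement with the MDT decision.
CONCORDANT_CATEGORIES = {WfoCategory.RECOMMENDED, WfoCategory.FOR_CONSIDERATION}


def is_concordant(mdt_decision, wfo_options):
    """Return True if the MDT decision is "recommended" or "for consideration".

    wfo_options maps a treatment label to its WfoCategory. A treatment that
    is missing from the mapping is treated as "not listed", which the text
    defines as non-concordant.
    """
    category = wfo_options.get(mdt_decision)
    return category in CONCORDANT_CATEGORIES


def percent_agreement(flags):
    """Percentage of concordant cases among all evaluated cases."""
    flags = list(flags)
    if not flags:
        raise ValueError("at least one case is required")
    return 100.0 * sum(flags) / len(flags)


# Hypothetical WFO output for one case (labels are illustrative only).
example_options = {
    "CAPOX": WfoCategory.RECOMMENDED,
    "FOLFOX": WfoCategory.FOR_CONSIDERATION,
    "FOLFIRI": WfoCategory.NOT_RECOMMENDED,
}
```

With these assumptions, an MDT decision of "FOLFOX" would count as concordant, while "FOLFIRI" (not recommended) or a treatment absent from the list would not.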

Statistics and Analysis
Descriptive statistics of patients' characteristics were presented using Microsoft Excel. Concordance was presented as percent agreement. Overall survival (OS) was calculated by the Kaplan-Meier method; the difference between survival curves was determined by the log-rank test. The difference was considered statistically significant when the P value was less than 0.05. All analyses were conducted with IBM SPSS 22.0 for macOS (IBM, Chicago, USA).
There was also a relatively small proportion of cases who underwent neoadjuvant and second-line therapy (Table 1).

Concordance of WFO Treatment Recommendations With the RJ MDT's Opinions
Of the 250 patients evaluated by both the RJ MDT experts and WFO, the overall concordance was 91% (Figure 2A). Concordance rates varied by cancer stage. Overall, stage III cases exhibited higher concordance (94%) than stage II (83%) and stage IV cases (88%; Figure 2A), while stage II colon cancer cases exhibited higher concordance (94%) than stage III (92%) and stage IV colon cancers (88%; Figure 2B). In contrast, stage II rectal cancer cases showed a lower concordance rate (72%) than stage III (94%) and stage IV rectal cancers (89%; Figure 2C). When exploring concordance by treatment strategy, we noticed that the second-line group had the highest concordance rate, 100% for both cancer types (Figure 3A). Furthermore, cases recommended for neoadjuvant therapy and surgery had higher concordance (97 and 93%, respectively; Figure 3A) than the adjuvant and first-line groups. Similar results were seen when colon and rectal cancers were analyzed separately: concordance rates in the surgery groups were 96% for colon cancer (Figure 3B) and 91% for rectal cancer (Figure 3C). Adjuvant therapy showed a 91% concordance rate in colon cancer and 88% in rectal cancer (Figures 3B, C). The decisions and recommendations for second-line treatment were fully consistent, at 100% in both cancers (Figures 3B, C).
We also examined whether outcomes differed between patients with consistent and inconsistent results. Overall survival was better in the consistent group than in the inconsistent group (p = 0.0049), as shown in Figure 4. In the inconsistent group, we observed a median overall survival of 29 months, whereas the median had not yet been reached in the consistent group (Figure 4).
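To illustrate why a median overall survival can be reported for one group and remain unavailable for another, the following is a minimal Kaplan-Meier sketch in plain Python. It is not the SPSS analysis used in the study, it does not implement the log-rank test, and the function names and the toy follow-up data in the usage note are assumptions made purely for illustration.

```python
def kaplan_meier(times, events):
    """Return the Kaplan-Meier curve as a list of (time, survival) pairs.

    times  : follow-up time for each patient (for example, in months)
    events : 1 if death was observed at that time, 0 if the patient was censored
    Only times with at least one observed death produce a step in the curve.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which estimated survival is at or below 50%, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

For example, with the invented data `kaplan_meier([6, 12, 18, 24, 30], [1, 1, 1, 0, 1])` the estimated survival falls below 50% at 18 months, so `median_survival` returns 18. With `kaplan_meier([10, 20, 30, 36, 36], [1, 0, 0, 0, 0])` the curve never drops to 50%, so the median is returned as `None`, that is, not yet reached.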

Factors Affecting Concordance and Corresponding Updates of WFO
Continuous training was thought to be fundamental to improving the capability of WFO. While applying WFO, we discussed the main reasons for discordance and gave feedback to the platform accordingly. In March 2017, we suggested that WFO avoid adjuvant therapy in patients over 80 years and received a positive response from the platform's support team (Table 2). When treating postoperative high-risk stage II colorectal cancers, we found that WFO recommended an observation strategy, which was against the CSCO (Chinese Society of Clinical Oncology) guidelines. The reason might be the absence of an evaluation of high-risk factors in such cases. Although a few proposals remained unresolved, most problems we reported were addressed by an update soon after (Table 2).
To evaluate the performance of WFO as its database was continuously updated, we analyzed the concordance rate in every 50 cases grouped by treatment strategy. Noticeable rising curves were found in most subgroups of the various therapy strategies. Though concordance declined to varying degrees in the last 50 patients in the neoadjuvant, surgery, and adjuvant groups, the overall rates were higher than those obtained with earlier versions of WFO (Figure 5).
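The block-wise analysis described above can be sketched as follows. This is a hedged illustration only: it assumes that blocks are formed from consecutive cases in overall enrollment order and then split by treatment strategy, which the text does not state explicitly, and the function name `concordance_by_block` and the input format are hypothetical.

```python
from collections import defaultdict


def concordance_by_block(cases, block_size=50):
    """Percent concordance per treatment strategy within consecutive blocks.

    cases      : iterable of (strategy, concordant) pairs in enrollment order
    block_size : number of consecutive cases per block (50 in the text)
    Returns {strategy: {block_index: percent_concordant}}; a strategy with no
    cases in a given block has no entry for that block.
    """
    # counts[strategy][block] holds [concordant_cases, total_cases].
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for position, (strategy, concordant) in enumerate(cases):
        block = position // block_size
        tally = counts[strategy][block]
        tally[0] += int(bool(concordant))
        tally[1] += 1
    return {
        strategy: {
            block: 100.0 * hits / total
            for block, (hits, total) in sorted(blocks.items())
        }
        for strategy, blocks in counts.items()
    }
```

Plotting the per-block percentages for each strategy against the block index would give curves of the kind the text refers to in Figure 5.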

DISCUSSION
The validity and timeliness of clinical guidelines and other therapeutic information an oncologist uses in practice are critical to cancer treatment. With the trends of delegating information-intensive tasks to technologies such as machine learning algorithms, physicians and computer companies are seeking a balance in utilizing evidence-based decision-making support systems in modern clinical practice. While some physicians applied them as a powerful resource, others, especially patients, believed the recommendations they made were already equal to those of the experts. This reflected not only the perspectives and expectations of patients regarding these tools but, more importantly, indicated the concerns of oncologists regarding the validity of the AI-made options. It has been a long time since we introduced such decision-making support systems in real life (4), and the exploration of the most proper model has never ceased.
For such purposes, by examining the concordance between the advice made by WFO, a decision support tool to provide personalized medical recommendations, and an experienced multidisciplinary cancer team, we observed broad agreement and recognized the unfulfilled potential of the self-learning machine, as prior studies (11,12) have suggested. Nevertheless, as we expected, several aspects need improvement. In the early cases, we observed inconsistency in WFO's recommendations with respect to guidelines. As the classical chemotherapy regimen, FOLFIRI was no longer recommended for adjuvant
Factors resulting in non-concordance could also come from variations in the aggressiveness of treatment approaches in patient subpopulations based on age. We found in our trial that patients over 80, who were not recommended for aggressive strategies such as chemotherapy in our clinical practice, were likely to be discordant because WFO still recommended standard systemic therapy for this subpopulation. However, the health status of patients at this age should be rigorously evaluated to weigh the benefits and risks of chemotherapy.
Our study also demonstrated that inconsistency between WFO and the RJ MDT occurred in 9% of cases, mainly because of treatments available in China that were not included in the oncology advisor.
China has the largest cancer population with a particular cancer spectrum. The different local conditions and customs of national medicine form different therapeutic experiences and considerations. Since WFO was an AI based on NCCN guidelines and trained on MSKCC experience, deviation from local therapeutic guidelines was inevitable. We suggest that, in the process of localizing WFO or developing similar products in China or other places outside the US, it is necessary to take more diverse patients treated in varying care settings into consideration (14). As for the poorer survival of patients with inconsistent results, worse and more complex disease status and older age probably played a crucial role in the difference. But it also indicates the possibility that the AI-powered supporting system could be used as a clinical assistant to help make decisions with better outcomes.
Despite the ongoing debate about responsibility in AI-assisted clinical decision-making systems (15), the great potential of computerized decision support tools has been demonstrated in medical practice, and many modern technologies are expanding into this area. Google has developed a deep learning system that can detect diabetic retinopathy and diabetic macular edema (16). Microsoft is exploring new technology for automated analysis of radiological images (17). Current and potential AI applications cover not only clinical practice, such as diagnosis and robotic surgery, and translational research, such as drug discovery and repurposing, but also several basic biomedical research fields, including gene function annotation and automated experiments (18). Multi-gene panel testing has been taken into consideration for prognostic cancer staging in conjunction with the American Joint Committee on Cancer (AJCC) staging (10). By combining genomic factors with conventional TNM staging, some anatomically classified groups (such as T2N0M0, stage 2A) were down- or upgraded and assigned to more suitable therapy in clinical practice. Because of the trend towards relying more on molecular characteristics, supplementary decision support might be needed (19). KRAS, which was first included in the NCCN guidelines for colorectal cancer in 2008, has proven to be a key biomarker in applying EGFR-targeted therapies. Though KRAS and BRAF mutations were optional considerations in WFO, the decisions it made did not always match standard treatment well. In our study, metastatic rectal cancer cases with RAS wt were treated with cetuximab according to NCCN guidelines (Version 17.3), and this option was absent from WFO's options. This may be due to the different treatment strategy of Memorial Sloan Kettering Cancer Center, where WFO was trained.
Additionally, the evolving clinical value of genetic assays may create an unprecedented situation in which a given mutation is not actionable at the time of initial diagnosis but later becomes relevant as research progresses (20). Therefore, tracking a cancer's somatic mutations and reanalyzing them in an updated data pool would seem to be a potential ability of AI-based technology such as WFO to achieve precision medicine.
Patient perspectives are integral to the advanced use of WFO in the clinical workflow. Though modern societies, especially in China, hold optimistic views of applying cutting-edge technology in daily life, its use in health care raises concerns about both data security and decision precision. Therefore, achieving higher levels of patient acceptance of WFO through systematic upgrades will not only improve oncology practice but also help enhance the relationship between cancer patients and physicians. Given that WFO is not yet commonly used in practice at the hospital, future studies should explore the experiences of physicians, as well as patients, in using WFO in clinical practice.
There are notable limitations to this study. First, the study design was observational and self-controlled with a relatively small sample size, which may make the results susceptible to bias from unmeasured factors. Second, the patients who participated in our study were treated at one comprehensive gastrointestinal cancer center on China's east coast. Adding cases treated in community-based clinics might widen the gap between WFO and clinician responses and lower the concordance, but it might also improve the value of computer-aided decision support in minimizing medical disparities across different regions.
Many who were glad to accept WFO as a resource to provide oncologists with cutting-edge medical research and knowledge believed the ideal model of such tools in clinical practice is to be used as "a tool, not a crutch" (21). By addressing such perspectives, we wish to facilitate the use of WFO and other decision support tools, to help realize the promise of more effective clinical and precision healthcare.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding authors.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Ruijin Hospital Ethics Committee. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
BA: Investigation, methodology, software, writing-original draft, and editing. PX: Resources, data curation, formal analysis, validation, investigation, and writing-original draft. HH: Resources, formal analysis, methodology, and editing. HJ: Resources, formal analysis, methodology, and editing. CW: Resources, software, formal analysis, and editing. SL: Resources and methodology. LH: Resources, formal analysis, and methodology. XD: Resources. HZ: Resources. GC: Resources. AL: Resources. LX: Resources, methodology, and formal analysis. MZ: Conceptualization, data curation, supervision, acquisition, validation, methodology, writing-original draft, writing-review and editing. HL: Conceptualization, data curation, supervision, acquisition, validation, methodology, writing-original draft, writing-review and editing. JS: Conceptualization, data curation, formal analysis, supervision, acquisition, validation, methodology, investigation, writing-original draft, writing-review and editing. All authors contributed to the article and approved the submitted version.