<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2023.1124182</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Using machine learning for healthcare treatment planning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Dubey</surname> <given-names>Snigdha</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2277110/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Tiwari</surname> <given-names>Gaurav</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2268989/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Singh</surname> <given-names>Sneha</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Goldberg</surname> <given-names>Saveli</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1205063/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Pinsky</surname> <given-names>Eugene</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2128311/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computer Science, Metropolitan College, Boston University</institution>, <addr-line>Boston, MA</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Radiation Oncology Mass General Hospital</institution>, <addr-line>Boston, MA</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Vladimir Brusic, The University of Nottingham Ningbo, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Bayram Akdemir, Konya Technical University, T&#x000FC;rkiye; Tianyi Qiu, Fudan University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Eugene Pinsky <email>epinsky&#x00040;bu.edu</email></corresp></author-notes>
<pub-date pub-type="epub">
<day>25</day>
<month>04</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>6</volume>
<elocation-id>1124182</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>12</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>03</day>
<month>04</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Dubey, Tiwari, Singh, Goldberg and Pinsky.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Dubey, Tiwari, Singh, Goldberg and Pinsky</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>We present a methodology for using machine learning to plan treatments. As a case study, we apply the proposed methodology to breast cancer. Most applications of machine learning to breast cancer have focused on diagnosis and early detection. By contrast, our paper focuses on applying machine learning to suggest treatment plans for patients with different disease severity. While the need for surgery, and even its type, is often obvious to a patient, the need for chemotherapy and radiation therapy is not. With this in mind, the following treatment plans were considered in this study: chemotherapy, radiation, chemotherapy with radiation, and none of these options (only surgery). We use real data from more than 10,000 patients over 6 years that includes detailed cancer information, treatment plans, and survival statistics. Using this data set, we construct machine learning classifiers to suggest treatment plans. Our emphasis in this effort is not only on suggesting a treatment plan but on explaining and defending a particular treatment choice to the patient.</p>
<kwd-group>
<kwd>machine learning</kwd>
<kwd>ML in healthcare treatment</kwd>
<kwd>nearest neighbor classification</kwd>
<kwd>explainable AI</kwd>
<kwd>ML in healthcare environments</kwd>
</kwd-group>
<counts>
<fig-count count="10"/>
<table-count count="9"/>
<equation-count count="3"/>
<ref-count count="18"/>
<page-count count="14"/>
<word-count count="6162"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Medicine and Public Health</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Breast cancer is a leading cause of cancer-related deaths among women worldwide. Early detection and accurate diagnosis are crucial for improving patient outcomes and reducing mortality rates. Breast cancer is the most commonly diagnosed cancer type, accounting for 1 in 8 cancer diagnoses worldwide (CDC, <xref ref-type="bibr" rid="B6">2022</xref>). According to the World Health Organization, in 2020 there were about 2.3 million new cases of breast cancer globally and about 685,000 deaths from this disease, with large geographical variations observed between countries and world regions (World health organization, <xref ref-type="bibr" rid="B18">2022</xref>). Doctors often use additional tests to find or diagnose breast cancer. How breast cancer is treated depends on the kind of breast cancer and how far it has spread; patients often receive more than one of the following kinds of treatment:</p>
<list list-type="order">
<list-item><p>Surgery</p></list-item>
<list-item><p>Chemotherapy</p></list-item>
<list-item><p>Hormonal therapy</p></list-item>
<list-item><p>Biological therapy</p></list-item>
<list-item><p>Radiation therapy</p></list-item>
</list>
<p>Machine learning (ML) algorithms have shown promise in aiding clinicians in the diagnosis, prognosis, and treatment of breast cancer. Among the different ML algorithms, logistic regression (LR), random forest (RF), and K-nearest neighbors (KNN) are widely used for breast cancer classification and prediction (e.g., Rajbharath and Sankari, <xref ref-type="bibr" rid="B11">2017</xref>). These models have been shown to improve the accuracy and efficiency of breast cancer diagnosis, prognosis, and treatment planning.</p>
<p>An extensive literature (e.g., Ak, <xref ref-type="bibr" rid="B1">2020</xref>) compares the performance of multiple machine learning algorithms, including deep learning methods, in predicting breast cancer recurrence or classification. The context of our paper is different: it aims at a patient-centric approach that supports a dialogue between the physician and the patient. With that focus, we concentrate on machine learning algorithms that are both explainable and accurate, such as logistic regression.</p>
<p>The peculiarity of our study lies in the fact that the object of our approach is not a doctor offering the optimal treatment method but a patient considering options for the treatment offered to him/her. This condition implies the following requirements and restrictions:</p>
<list list-type="order">
<list-item><p>The need for dialogue between the patient, the doctor, and the AI.</p></list-item>
<list-item><p>The need to explain the AI decision.</p></list-item>
<list-item><p>The explanation of the decision should be given in a form and in terms understandable to the patient.</p></list-item>
</list>
<p>In this regard, we used an open database based on common and understandable symptoms and types of treatment. KNN was chosen as the explanation mechanism because it is the simplest and most understandable form of explanation for the patient.</p>
</sec>
<sec sec-type="methods" id="s2">
<title>2. Methodology</title>
<p>In this paper, we suggest a methodology of using machine learning to help patients and doctors identify the appropriate treatment plan. For the case study, we have used breast cancer (Reddy et al., <xref ref-type="bibr" rid="B12">2018</xref>; Song et al., <xref ref-type="bibr" rid="B15">2021</xref>; van de Sande et al., <xref ref-type="bibr" rid="B17">2021</xref>).</p>
<p>Our methodology for using machine learning consists of two stages:</p>
<p><bold>Stage 1:</bold> In the first stage, we use state-of-the-art ML algorithms (logistic regression and random forest) (Bishop, <xref ref-type="bibr" rid="B4">2016</xref>; Hastie, <xref ref-type="bibr" rid="B9">2018</xref>) to predict the status of the patient after 5 years under the doctor&#x00027;s suggested treatment. We want a classifier that is both sufficiently accurate and explainable; logistic regression is a good choice. On the other hand, there are classifiers, such as random forest, that often provide higher accuracy but are not explainable. Unless the accuracy of logistic regression is insufficient, it is the classifier of choice. For our particular case study, the random forest classifier gives only marginally better results, and therefore logistic regression is used.</p>
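<p>To illustrate the Stage 1 comparison, the sketch below (our own illustrative code, not the study&#x00027;s pipeline; the synthetic data merely stands in for the encoded SEER features) trains both classifiers and compares their AUC:</p>

```python
# Illustrative sketch: compare logistic regression and random forest
# on a synthetic, imbalanced binary "status after 5 years" label.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=13,
                           weights=[0.3, 0.7], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=42)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf = RandomForestClassifier(max_depth=10, random_state=42).fit(X_tr, y_tr)

auc_lr = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])
auc_rf = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
# When the gap between auc_rf and auc_lr is marginal, the explainable
# logistic regression is preferred.
```

<p>If random forest wins by only a small margin, the explainable model is kept.</p>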
<p><bold>Stage 2:</bold> In the second stage, we need to examine alternative treatment plans for a patient. To that end, we examine <italic>k</italic>-Nearest Neighbors (Cover and Hart, <xref ref-type="bibr" rid="B7">1967</xref>; Sarkar and Leong, <xref ref-type="bibr" rid="B13">2000</xref>; Bagui et al., <xref ref-type="bibr" rid="B3">2003</xref>). We use statistics for these neighbors to examine alternative treatment plans such as:</p>
<list list-type="order">
<list-item><p>Is chemotherapy required?</p></list-item>
<list-item><p>What is the best radiation sequence with surgery?</p></list-item>
<list-item><p>What radiation recode should be proposed?</p></list-item>
</list>
<p>The idea is to help the doctor and the patient choose a treatment that maximizes the chances of survival after 5 years. We take <italic>k</italic> &#x0003D; 25 neighboring patients, as we believe this number is sufficient to compute statistics on alternative treatments and, at the same time, small enough to allow the physician to examine these &#x0201C;neighbor&#x0201D; patients in detail and to explain the predicted results of alternative treatments.</p>
<p>We do not focus on <italic>k</italic>-nearest neighbors as an algorithm for breast cancer diagnosis, as considered in Medjahed et al. (<xref ref-type="bibr" rid="B10">2013</xref>). Our usage of nearest neighbors is to help the physician explain alternative treatments and outcomes once the prediction in Stage 1 is established. We should also note that <italic>k</italic>-nearest neighbors requires a distance metric, and one can get different results depending on the distance metric and classification rule (Medjahed et al., <xref ref-type="bibr" rid="B10">2013</xref>). In Stage 2, we considered a distance metric in which all features have the same weight. It is up to the physician to assign different weights depending on her/his expertise; however, it is easier for patients to understand the similarity if the weights are the same. In general, to use KNN to explain a solution, one needs a concept of proximity from the user&#x00027;s point of view (Goldberg and Pinsky, <xref ref-type="bibr" rid="B8">2022</xref>). In our case, the user is a patient, and the absence of weighting of signs and symptoms in KNN may be necessary to start a dialogue with the doctor when explaining the proposed treatment.</p>
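<p>A minimal sketch of the Stage 2 neighbor lookup, assuming standardized, equally weighted features (the data and variable names here are synthetic and hypothetical, not the SEER extract):</p>

```python
# Illustrative sketch: retrieve the 25 most similar past patients and
# compute survival statistics under alternative treatments.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))            # encoded patient features (synthetic)
alive = rng.integers(0, 2, size=500)      # 5-year status of past patients
chemo = rng.integers(0, 2, size=500)      # chemotherapy recode

Xs = StandardScaler().fit_transform(X)    # equal weight for every feature
nn = NearestNeighbors(n_neighbors=25).fit(Xs)
_, idx = nn.kneighbors(Xs[:1])            # neighbors of the first patient

neigh_alive, neigh_chemo = alive[idx[0]], chemo[idx[0]]
# Survival rate among neighbors with and without chemotherapy:
surv_chemo = neigh_alive[neigh_chemo == 1].mean() if (neigh_chemo == 1).any() else None
surv_no_chemo = neigh_alive[neigh_chemo == 0].mean() if (neigh_chemo == 0).any() else None
```

<p>These neighbor-level statistics are what the physician can walk through with the patient.</p>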
<p>This work also includes an interactive model where patients can enter their details and use the model to predict their probability of staying alive based on a combination of Radiation Sequence, Radiation Recode, and Chemotherapy Recode. For a survey of ML techniques in breast cancer prediction, see Boeri et al. (<xref ref-type="bibr" rid="B5">2020</xref>); Alaa et al. (<xref ref-type="bibr" rid="B2">2021</xref>), and Sugimoto et al. (<xref ref-type="bibr" rid="B16">2021</xref>).</p>
</sec>
<sec id="s3">
<title>3. The breast cancer dataset</title>
<p>For our research purpose, we requested access to Surveillance, Epidemiology, and End Results (SEER) custom breast cancer databases with the radiation and chemotherapy records&#x00027; fields (SEER, <xref ref-type="bibr" rid="B14">2022</xref>).</p>
<p>The data was accessed after signing the Data Use Agreement for SEER Radiation Therapy and Chemotherapy Information. Using the SEER&#x0002A;Stat tool, we selected the database named &#x0201C;November 2018 specialized databases,&#x0201D; which had additional treatment fields.</p>
<p>A case listing session was created to fetch individual cancer records and patient histories, which gave us access to the actual values stored in the database. We filtered the listings by specifying the cancer site to include only breast cancer cases.</p>
<p>We also segregated the data based on the year of diagnosis and selected the number of intervals to be 5 years. We only included cases with a year of diagnosis &#x0003E; 2010 so that values for the treatment-related fields were available.</p>
<p>The final step was to select the attributes/variables related to demographics, diagnosis, and treatment. The SEER&#x0002A;Stat query was executed, and the results were exported and saved as the Breast Cancer dataset.</p>
<p>The dataset can be accessed here: <ext-link ext-link-type="uri" xlink:href="https://github.com/snehasi2703/BreastCancerSurvivalDataset">https://github.com/snehasi2703/BreastCancerSurvivalDataset</ext-link>.</p>
<p>The resulting Breast Cancer dataset consists of 35,349 rows and 19 attributes (SEER, <xref ref-type="bibr" rid="B14">2022</xref>). This dataset is imbalanced: of the 35,349 rows, 23,404 correspond to patients alive at the end of 5 years, and 11,945 to patients who died after different intervals (<xref ref-type="fig" rid="F1">Figure 1</xref>). The dataset summary in <xref ref-type="fig" rid="F2">Figure 2</xref> shows the quartiles, medians, minima, maxima, and means of all the features considered at the time of data cleaning.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Dataset balance.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0001.tif"/>
</fig>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Dataset summary.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0002.tif"/>
</fig>
<sec>
<title>3.1. Data cleaning</title>
<p>Of the 35,349 records, 23,404 patients had Alive status while 11,945 had Dead status. The dataset was first cleansed to remove all NA values. The field CS Tumor Size contained the value 999, meaning that the tumor size was unknown or not stated for that patient record; such rows were removed so that only instances with a known tumor size were considered. Similarly, the field Regional Nodes Positive contained the value 99, meaning that it is unknown whether the nodes are positive, or that this is not applicable or not stated in the patient record; such instances were also removed from the dataset used for model building. After data cleansing, the dataset contained 32,922 rows with 19 columns. Of these, 22,889 (69.5%) patient records had the status &#x0201C;Alive&#x0201D; while 10,033 (30.5%) had the status &#x0201C;Dead,&#x0201D; as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. A detailed description of the feature variables is provided in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
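<p>The cleaning rules above can be sketched in pandas as follows (a hypothetical miniature table with shortened column names, not the actual SEER extract):</p>

```python
import pandas as pd

# Toy rows illustrating the three cleaning rules described above.
df = pd.DataFrame({
    "cs_tumor_size": [12, 999, 30, 7],           # 999 = size unknown/not stated
    "regional_nodes_positive": [2, 0, 99, 1],    # 99 = unknown/not applicable
    "status_5_years": ["Alive", "Dead", "Alive", None],
})
df = df.dropna()                                 # drop NA values
df = df[df["cs_tumor_size"] != 999]              # drop unknown tumor sizes
df = df[df["regional_nodes_positive"] != 99]     # drop unknown node counts
```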
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Dataset details.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Attributes</bold></th>
<th valign="top" align="left"><bold>Attributes description</bold></th>
<th valign="top" align="left"><bold>Attributes type</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Age at diagnosis</td>
<td valign="top" align="left">Age at the beginning of treatment</td>
<td valign="top" align="left">Numeric</td>
</tr>
<tr>
<td valign="top" align="left">Regional nodes positive (1988&#x0002B;)</td>
<td valign="top" align="left">No. of regional lymph nodes positive, Above 90 unknown</td>
<td valign="top" align="left">Numeric</td>
</tr>
<tr>
<td valign="top" align="left">Total &#x00023; of in situ/malignant tumors for patient</td>
<td valign="top" align="left">No. of malignant tumors for patient</td>
<td valign="top" align="left">Numeric</td>
</tr>
<tr>
<td valign="top" align="left">Radiation recode</td>
<td valign="top" align="left">Radiation type</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">Chemotherapy recode</td>
<td valign="top" align="left">Chemotherapy done?</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">Radiation sequence with surgery</td>
<td valign="top" align="left">Radiation sequence</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">ER Status Recode Breast Cancer (1990&#x0002B;)</td>
<td valign="top" align="left">Estrogen receptor info</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">PR Status Recode Breast Cancer (1990&#x0002B;)</td>
<td valign="top" align="left">Progesterone receptors info</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">CS tumor size (2004-2015)</td>
<td valign="top" align="left">Tumor size</td>
<td valign="top" align="left">Numeric</td>
</tr>
<tr>
<td valign="top" align="left">Derived HER2 Recode (2010&#x0002B;)</td>
<td valign="top" align="left">Joint hormone receptor</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">Regional nodes examined (1988&#x0002B;)</td>
<td valign="top" align="left">Records the total &#x00023; of regional lymph nodes that were removed</td>
<td valign="top" align="left">Numeric</td>
</tr>
<tr>
<td valign="top" align="left">COD to site recode</td>
<td valign="top" align="left">Cause of Death</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">Race Recode</td>
<td valign="top" align="left">Race categories</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">Sex</td>
<td valign="top" align="left">Sex of patient</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">Vital status recode (study cutoff used)</td>
<td valign="top" align="left">Patient Status</td>
<td valign="top" align="left">Categorical</td>
</tr>
<tr>
<td valign="top" align="left">Diagnosis_year</td>
<td valign="top" align="left">Year when diagnosis started</td>
<td valign="top" align="left">Numerical</td>
</tr>
<tr>
<td valign="top" align="left">Last_fu _year</td>
<td valign="top" align="left">Last Year of contact for treatment</td>
<td valign="top" align="left">Numerical</td>
</tr>
<tr>
<td valign="top" align="left">Interval_years</td>
<td valign="top" align="left">Number of Intervals the screening was done</td>
<td valign="top" align="left">Numerical</td>
</tr>
<tr>
<td valign="top" align="left">Status_5_years</td>
<td valign="top" align="left">Status of the patient during 5 years</td>
<td valign="top" align="left">Categorical</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>3.2. Feature selection</title>
<p>Feature importance was also taken into account to avoid discarding important features. Of the 19 features, we removed all features that were highly correlated with the class label, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. These attributes, being directly associated with the target variable, inflated the model&#x00027;s accuracy to 99 percent; after their removal, accuracy was 96 percent. Removing them improves the reliability of the model. Correlation details of the features with the target variable are shown in <xref ref-type="table" rid="T2">Table 2</xref>. Feature importances and their scores are shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
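<p>The removal of label-leaking attributes can be sketched as follows (synthetic data; the leaking column imitates a field such as Interval_years, which is derived from the outcome itself):</p>

```python
# Illustrative sketch: drop features whose correlation with the target
# is suspiciously high (they leak the label).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
target = rng.integers(0, 2, size=300)
df = pd.DataFrame({
    "age_at_diagnosis": rng.normal(60, 10, size=300),
    "interval_years": target * 4 + rng.normal(0, 0.1, size=300),  # leaks the label
    "status_5_years": target,
})
corr = df.corr()["status_5_years"].drop("status_5_years")
leaky = corr[corr.abs() > 0.5].index.tolist()    # flagged for removal
features = [c for c in df.columns if c not in leaky + ["status_5_years"]]
```

<p>The 0.5 threshold here is an illustrative choice; the study flags the attributes marked ** in Table 2.</p>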
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Correlation heatmap.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0003.tif"/>
</fig>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Attribute importance and correlation with target variable.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Attribute</bold></th>
<th valign="top" align="center"><bold>Importance</bold></th>
<th valign="top" align="center"><bold>Correlation with target variable</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Regional nodes positive (1988&#x0002B;)</td>
<td valign="top" align="center">0.34</td>
<td valign="top" align="center">0.45</td>
</tr>
<tr>
<td valign="top" align="left">Radiation sequence with surgery</td>
<td valign="top" align="center">0.18</td>
<td valign="top" align="center">&#x02013;0.24</td>
</tr>
<tr>
<td valign="top" align="left">PR Status Recode Breast Cancer (1990&#x0002B;)</td>
<td valign="top" align="center">0.07</td>
<td valign="top" align="center">&#x02013;0.12</td>
</tr>
<tr>
<td valign="top" align="left">ER Status Recode Breast Cancer (1990&#x0002B;)</td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">&#x02013;0.10</td>
</tr>
<tr>
<td valign="top" align="left">Age at diagnosis</td>
<td valign="top" align="center">0.05</td>
<td valign="top" align="center">0.29</td>
</tr>
<tr>
<td valign="top" align="left">Total number of in situ/malignant tumors for patient</td>
<td valign="top" align="center">0.05</td>
<td valign="top" align="center">0.06</td>
</tr>
<tr>
<td valign="top" align="left">CS tumor size (2004-2015)</td>
<td valign="top" align="center">0.05</td>
<td valign="top" align="center">0.07</td>
</tr>
<tr>
<td valign="top" align="left">Race recode</td>
<td valign="top" align="center">0.05</td>
<td valign="top" align="center">&#x02013;0.09</td>
</tr>
<tr>
<td valign="top" align="left">Radiation recode</td>
<td valign="top" align="center">0.03</td>
<td valign="top" align="center">0.19</td>
</tr>
<tr>
<td valign="top" align="left">Chemotherapy recode</td>
<td valign="top" align="center">0.03</td>
<td valign="top" align="center">&#x02013;0.03</td>
</tr>
<tr>
<td valign="top" align="left">Derived HER2 Recode (2010&#x0002B;)</td>
<td valign="top" align="center">0.03</td>
<td valign="top" align="center">0.04</td>
</tr>
<tr>
<td valign="top" align="left">Regional nodes examined (1988&#x0002B;)</td>
<td valign="top" align="center">0.03</td>
<td valign="top" align="center">0.14</td>
</tr>
<tr>
<td valign="top" align="left">Sex</td>
<td valign="top" align="center">0.03</td>
<td valign="top" align="center">0.04</td>
</tr>
<tr>
<td valign="top" align="left">COD to site recode</td>
<td valign="top" align="center">&#x0002A;&#x0002A;</td>
<td valign="top" align="center">0.65</td>
</tr>
<tr>
<td valign="top" align="left">Diagnosis_year</td>
<td valign="top" align="center">&#x0002A;&#x0002A;</td>
<td valign="top" align="center">0.56</td>
</tr>
<tr>
<td valign="top" align="left">Last_fu _year</td>
<td valign="top" align="center">&#x0002A;&#x0002A;</td>
<td valign="top" align="center">&#x02013;0.72</td>
</tr>
<tr>
<td valign="top" align="left">Interva_years</td>
<td valign="top" align="center">&#x0002A;&#x0002A;</td>
<td valign="top" align="center">&#x02013;0.86</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p><sup>&#x0002A;&#x0002A;</sup>Feature not considered for study due to high correlation with Target Variable.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Feature importance.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0004.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<title>4. ML models and performance evaluation</title>
<p>We chose logistic regression for its explainability and the fact that it requires no hyperparameter tuning, and random forest to support our claim that logistic regression is a good classifier whose results are on par with random forest. KNN was used for explanation: it identifies past patients with similar characteristics, allowing us to answer the patient&#x00027;s questions by looking at previously evaluated patients and what worked in their cases.</p>
<sec>
<title>4.1. Logistic regression</title>
<p>Logistic regression is a statistical method used in machine learning to predict the probability of an outcome being one of two possible classes, given one or more independent variables. It is a binary classification algorithm that estimates the relationship between the independent variables and the dependent variable, using a logistic function to transform the output into a probability value.</p>
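<p>A toy illustration of this mapping (our own sketch, not study code): the fitted model passes a linear combination of the inputs through the logistic (sigmoid) function to obtain a probability.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature, binary outcome: the model learns a boundary near x = 2.5.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)

p = clf.predict_proba([[2.5]])[0, 1]             # P(class 1 | x = 2.5)
z = clf.intercept_[0] + clf.coef_[0, 0] * 2.5    # linear combination
assert abs(p - 1.0 / (1.0 + np.exp(-z))) < 1e-9  # p = sigmoid(z)
```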
</sec>
<sec>
<title>4.2. Random forest</title>
<p>Random Forest is a machine learning algorithm used for both classification and regression problems. It is an ensemble learning method that combines multiple decision trees, each trained on a subset of the available features and data. Random Forest randomly selects features and data samples for each decision tree and aggregates the output of all the trees to make a final prediction.</p>
<p>The individual decision trees in a Random Forest model are trained using a technique called bagging (bootstrap aggregating), which involves resampling the data with replacement to create multiple training sets. The output of each decision tree is combined to produce a more accurate and stable prediction than a single decision tree.</p>
<p>The model was initialized with default parameters; we then tuned them to achieve the best performance, settling on a max_depth of 10.</p>
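<p>A minimal sketch of this setup (illustrative, with synthetic data in place of the encoded SEER features):</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=13, random_state=42)

# Default parameters except the tuned tree depth (max_depth = 10).
rf = RandomForestClassifier(max_depth=10, random_state=42).fit(X, y)
score = rf.score(X, y)                   # training accuracy
importances = rf.feature_importances_    # one score per feature
```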
</sec>
<sec>
<title>4.3. K nearest neighbors</title>
<p>K Nearest Neighbors (KNN) is a simple machine-learning algorithm for classification and regression problems. It predicts the value of an input data point based on the most frequent class or average value of the k nearest neighbors in the training set. KNN is often used for small data sets with complex relationships between input and output variables. However, it can be computationally expensive and sensitive to the choice of distance metric and the number of neighbors <italic>k</italic>.</p>
<p>We took the 25 nearest neighbors because this number is large enough to compute meaningful statistics yet small enough for the physician to manually verify the prediction by examining the 25 patients closest to the given patient.</p>
</sec>
<sec>
<title>4.4. Model training and performance evaluation</title>
<p>There were a total of 13 features (after removing COD to Site Recode, Diagnosis_year, Last_fu_year, Interval_years, and Vital Status Recode) that we used to build our ML model for predicting the status of the patient after 5 years. A detailed description of these features is given in <xref ref-type="table" rid="T1">Table 1</xref>. We encoded the categorical features using a label encoder and then split the data 50/50 with stratified sampling on status_5_years and a random_state of 42.</p>
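<p>The encoding and split can be sketched as follows (a hypothetical miniature dataset; the real one has 13 features):</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "chemotherapy_recode": ["Yes", "No/Unknown"] * 50,   # categorical
    "age_at_diagnosis": list(range(100)),                # numeric
    "status_5_years": ["Alive", "Dead"] * 50,            # target
})
# Label-encode the categorical columns, including the target.
for col in ["chemotherapy_recode", "status_5_years"]:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns="status_5_years")
y = df["status_5_years"]
# 50/50 split, stratified on the target, with random_state = 42.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42)
```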
<p>We built two prediction models: one using a random forest classifier and another using logistic regression. Both models utilized these 13 features to predict the patient&#x00027;s status after 5 years. The performance of the models was evaluated using the confusion matrix, the area under the curve (AUC), and the <italic>F</italic><sub>1</sub> score. We chose the AUC and <italic>F</italic><sub>1</sub> score to understand broadly how each model performs on each label.</p>
<p>AUC is a measure of the performance of a classification model that quantifies how well the model can distinguish between positive and negative classes. The AUC represents the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds (Bishop, <xref ref-type="bibr" rid="B4">2016</xref>). Its formula is given by</p>
<disp-formula id="E1"><mml:math id="M1"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mtext>AUC</mml:mtext><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:mrow><mml:msubsup><mml:mo>&#x0222B;</mml:mo><mml:mn>1</mml:mn><mml:mn>0</mml:mn></mml:msubsup><mml:mrow><mml:mtext>TPR</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mtext>FP</mml:mtext><mml:msup><mml:mtext>R</mml:mtext><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mi>d</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mrow></mml:mstyle></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The AUC ranges from 0 to 1, with a higher AUC indicating a better performance of the model. An AUC of 0.5 indicates that the model performs no better than random guessing, while an AUC of 1 indicates perfect classification.</p>
<p>By contrast, the <italic>F</italic><sub>1</sub>-score is a measure of the balance between the precision and recall of a classification model. It is the harmonic mean of precision and recall, with a value between 0 and 1, where a higher value indicates better performance. The <italic>F</italic><sub>1</sub>-score is particularly useful when the data is imbalanced. Its formula is</p>
<disp-formula id="E2"><mml:math id="M2"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>F</mml:mi><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>&#x000B7;</mml:mo><mml:mfrac><mml:mrow><mml:mtext>precision&#x000A0;</mml:mtext><mml:mo>&#x000D7;</mml:mo><mml:mtext>&#x000A0;recall</mml:mtext></mml:mrow><mml:mrow><mml:mtext>precision&#x000A0;</mml:mtext><mml:mo>&#x0002B;</mml:mo><mml:mtext>&#x000A0;recall</mml:mtext></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where:</p>
<disp-formula id="E3"><mml:math id="M3"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mtext>precision</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac><mml:mtext>&#x02003;and&#x02003;recall</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>A confusion matrix is a table used to evaluate the performance of a classification model by comparing the actual and predicted classes of a set of data. The table has four entries: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).</p>
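<p>As an illustration, these metrics can be computed with scikit-learn; the following is a minimal sketch on synthetic labels and scores (the variable names and data are ours, not from the study):</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                          # ground-truth labels
scores = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)   # noisy model scores
y_pred = (scores >= 0.5).astype(int)                            # thresholded predictions

auc = roc_auc_score(y_true, scores)   # area under the ROC curve
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
# Confusion-matrix entries for binary labels, in scikit-learn's row order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```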
<p>The detailed results are shown in the results section. We also wanted to understand how the data points are distributed across different age groups and how the model&#x00027;s performance changes across age groups of patient records. For that, we split the data into three age ranges: 0&#x02013;45, 45&#x02013;65, and over 65 years.</p>
<p>We also examined the cross-validation scores of the models to verify their robustness and performance across the different age groups.</p>
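<p>The age split described above can be sketched with pandas; the column name "age" and the toy values below are illustrative assumptions, not the study's data:</p>

```python
import pandas as pd

# Hypothetical patient table; in the study this is the SEER-derived dataset.
df = pd.DataFrame({"age": [34, 50, 70, 44, 66, 59]})

# Three age bands used in the paper: 0-45, 45-65, and over 65 years.
bins = [0, 45, 65, float("inf")]
labels = ["0-45", "45-65", ">65"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# One sub-DataFrame per age band, for fitting and evaluating per group.
groups = {name: g for name, g in df.groupby("age_group", observed=True)}
```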
</sec>
</sec>
<sec id="s5">
<title>5. Results and discussion</title>
<p><xref ref-type="table" rid="T3">Tables 3</xref>, <xref ref-type="table" rid="T5">5</xref> present the performance metrics of the models for the full dataset, as well as for different age groups.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Logistic regression model results.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Specifications</bold></th>
<th valign="top" align="center"><bold>True positives</bold></th>
<th valign="top" align="center"><bold>True negatives</bold></th>
<th valign="top" align="center"><bold>False positives</bold></th>
<th valign="top" align="center"><bold>False negatives</bold></th>
<th valign="top" align="center"><bold>Recall</bold></th>
<th valign="top" align="center"><bold>Specificity</bold></th>
<th valign="top" align="center"><bold>FPR</bold></th>
<th valign="top" align="center"><bold>FNR</bold></th>
<th valign="top" align="center"><bold>Precision</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
<th valign="top" align="center"><bold>AUC</bold></th>
<th valign="top" align="center"><bold>F1 Score</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Full dataset</td>
<td valign="top" align="center">10,658</td>
<td valign="top" align="center">2,375</td>
<td valign="top" align="center">787</td>
<td valign="top" align="center">2,641</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.25</td>
<td valign="top" align="center">0.20</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">0.79</td>
<td valign="top" align="center">0.70</td>
<td valign="top" align="center">0.58</td>
</tr>
<tr>
<td valign="top" align="left">0-45 Years age</td>
<td valign="top" align="center">1,573</td>
<td valign="top" align="center">113</td>
<td valign="top" align="center">82</td>
<td valign="top" align="center">323</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.58</td>
<td valign="top" align="center">0.42</td>
<td valign="top" align="center">0.17</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.81</td>
<td valign="top" align="center">0.60</td>
<td valign="top" align="center">0.36</td>
</tr>
<tr>
<td valign="top" align="left">45-65 Years age</td>
<td valign="top" align="center">5,886</td>
<td valign="top" align="center">622</td>
<td valign="top" align="center">250</td>
<td valign="top" align="center">1,065</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">0.71</td>
<td valign="top" align="center">0.29</td>
<td valign="top" align="center">0.15</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.66</td>
<td valign="top" align="center">0.49</td>
</tr>
<tr>
<td valign="top" align="left">Age &#x0003E; 65 years</td>
<td valign="top" align="center">3,062</td>
<td valign="top" align="center">1,792</td>
<td valign="top" align="center">592</td>
<td valign="top" align="center">1,102</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.25</td>
<td valign="top" align="center">0.26</td>
<td valign="top" align="center">0.84</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.73</td>
<td valign="top" align="center">0.68</td>
</tr></tbody>
</table>
</table-wrap>
<p>The specifications column lists the different subsets of the data based on age, while the remaining columns present the following performance metrics:</p>
<list list-type="bullet">
<list-item><p><bold>True Positives:</bold> The number of individuals who were correctly identified as having the condition.</p></list-item>
<list-item><p><bold>True Negatives:</bold> The number of individuals who were correctly identified as not having the condition.</p></list-item>
<list-item><p><bold>False Positives:</bold> The number of individuals who were incorrectly identified as having the condition (also known as Type I error).</p></list-item>
<list-item><p><bold>False Negatives:</bold> The number of individuals who were incorrectly identified as not having the condition (also known as Type II error).</p></list-item>
<list-item><p><bold>Recall:</bold> The proportion of true positives out of all actual positives. This metric measures how well the model identifies individuals with the condition.</p></list-item>
<list-item><p><bold>Specificity:</bold> The proportion of true negatives out of all actual negatives. This metric measures how well the model identifies individuals without the condition.</p></list-item>
<list-item><p><bold>FPR (False Positive Rate):</bold> The proportion of false positives out of all actual negatives. This is the complement of specificity.</p></list-item>
<list-item><p><bold>FNR (False Negative Rate):</bold> The proportion of false negatives out of all actual positives. This is the complement of the recall.</p></list-item>
<list-item><p><bold>Precision:</bold> The proportion of true positives out of all predicted positives. This metric measures how many of the predicted positives are actually true positives.</p></list-item>
<list-item><p><bold>Accuracy:</bold> The proportion of correctly classified individuals out of all individuals.</p></list-item>
<list-item><p><bold>AUC (Area Under the Curve):</bold> The area under the ROC curve measures the trade-off between recall and specificity for different classification thresholds. A higher AUC indicates better classification performance.</p></list-item>
<list-item><p><bold>F1 Score:</bold> The harmonic mean of precision and recall. This metric provides a balanced measure of the model&#x00027;s ability to identify true positives while limiting both false positives and false negatives.</p></list-item>
</list>
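<p>As a consistency check, the full-dataset row of Table 3 can be reproduced from its four confusion-matrix counts alone. The sketch below is ours (values rounded to two decimals as in the table); note that the table's F1 of 0.58 appears to correspond to the negative (deceased) class rather than the positive class:</p>

```python
tp, tn, fp, fn = 10_658, 2_375, 787, 2_641   # full-dataset row of Table 3

recall = tp / (tp + fn)                      # 0.80
specificity = tn / (tn + fp)                 # 0.75
fpr = fp / (fp + tn)                         # 0.25 = 1 - specificity
fnr = fn / (fn + tp)                         # 0.20 = 1 - recall
precision = tp / (tp + fp)                   # 0.93
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.79

# The harmonic mean of the precision and recall above is ~0.86; the 0.58
# reported in the table matches the F1 of the negative class, computed
# from the negative-class precision (TN / (TN + FN)) and specificity.
precision_neg = tn / (tn + fn)
f1_neg = 2 * precision_neg * specificity / (precision_neg + specificity)  # ~0.58
```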
<sec>
<title>5.1. Logistic regression performance</title>
<p><xref ref-type="table" rid="T3">Table 3</xref> shows the results of a logistic regression model that was used to classify individuals as either alive or dead. The confusion matrix for the same model for the different age groups is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Confusion matrix of logistic regression. <bold>(A)</bold> Full dataset using LR. <bold>(B)</bold> Dataset of age &#x0003C; 45 years using LR. <bold>(C)</bold> Dataset of 45&#x0003C;age&#x0003C;65 years using LR. <bold>(D)</bold> Dataset of age &#x0003E; 65 years using LR.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0005.tif"/>
</fig>
<p>From the results, we can see that the full dataset achieved an accuracy of 0.79, with an AUC of 0.70. The recall and specificity were 0.80 and 0.75, respectively, indicating a reasonable balance between identifying true positives and true negatives. The precision was high at 0.93, indicating that a large majority of the predicted positives were actually true positives.</p>
<p>The table also shows the results for different age groups. We can see that the model performed relatively well for all age groups, with the highest performance in the 45&#x02013;65 age group. This group had the highest recall and specificity, as well as the highest AUC and F1 score. The age group above 65 years had the lowest recall, while the 0&#x02013;45 age group had the lowest precision and F1 score.</p>
<p>10-fold cross-validation is a common training and validation method. It randomly divides the dataset into ten subsets; in each of ten rounds, one subset is held out as the test set while the remaining nine form the training set. The average accuracy (or error rate) over the ten rounds serves as the estimate of the algorithm&#x00027;s accuracy.</p>
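<p>A minimal sketch of this procedure with scikit-learn, using synthetic data as a stand-in for the patient features and the 5-year survival label:</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the patient feature matrix and survival labels.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
# Ten fold accuracies, one per held-out subset.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()   # average of the ten fold accuracies
```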
<p>To assess robustness, we also cross-validated our models; the results for the different age groups are shown in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Logistic regression model cross validation results.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="center"><bold>Specifications</bold></th>
<th valign="top" align="center"><bold>Cross-val score</bold></th>
<th valign="top" align="center"><bold>Average accuracy</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">Full dataset</td>
<td valign="top" align="center">0.78, 0.78, 0.76, 0.80, 0.79, 0.80, 0.78, 0.78, 0.79, 0.79</td>
<td valign="top" align="center">0.78</td>
</tr>
<tr>
<td valign="top" align="center">0-45 Years age</td>
<td valign="top" align="center">0.82, 0.83, 0.83, 0.81, 0.81, 0.81, 0.81, 0.79, 0.81, 0.80</td>
<td valign="top" align="center">0.81</td>
</tr>
<tr>
<td valign="top" align="center">45-65 Years age</td>
<td valign="top" align="center">0.84, 0.83, 0.83, 0.83, 0.82, 0.83, 0.83, 0.82, 0.84, 0.83</td>
<td valign="top" align="center">0.83</td>
</tr>
<tr>
<td valign="top" align="center">Age &#x0003E; 65 years</td>
<td valign="top" align="center">0.77, 0.73, 0.74, 0.73, 0.75, 0.74, 0.75, 0.72, 0.74, 0.77</td>
<td valign="top" align="center">0.74</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>5.2. Random forest performance</title>
<p><xref ref-type="table" rid="T5">Table 5</xref> shows the results of a Random Forest model that was used to classify individuals as either alive or dead. The confusion matrix for the same model for the different age groups is shown in <xref ref-type="fig" rid="F6">Figure 6</xref>.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Random forest model results.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Specifications</bold></th>
<th valign="top" align="center"><bold>True positives</bold></th>
<th valign="top" align="center"><bold>True negatives</bold></th>
<th valign="top" align="center"><bold>False positives</bold></th>
<th valign="top" align="center"><bold>False negatives</bold></th>
<th valign="top" align="center"><bold>Recall</bold></th>
<th valign="top" align="center"><bold>Specificity</bold></th>
<th valign="top" align="center"><bold>FPR</bold></th>
<th valign="top" align="center"><bold>FNR</bold></th>
<th valign="top" align="center"><bold>Precision</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
<th valign="top" align="center"><bold>AUC</bold></th>
<th valign="top" align="center"><bold>F1 Score</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Full dataset</td>
<td valign="top" align="center">10,644</td>
<td valign="top" align="center">2,858</td>
<td valign="top" align="center">801</td>
<td valign="top" align="center">2,158</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.78</td>
<td valign="top" align="center">0.22</td>
<td valign="top" align="center">0.17</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.66</td>
</tr>
<tr>
<td valign="top" align="left">0-45 Years age</td>
<td valign="top" align="center">1,572</td>
<td valign="top" align="center">171</td>
<td valign="top" align="center">83</td>
<td valign="top" align="center">265</td>
<td valign="top" align="center">0.86</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.33</td>
<td valign="top" align="center">0.14</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.67</td>
<td valign="top" align="center">0.50</td>
</tr>
<tr>
<td valign="top" align="left">45-65 Years age</td>
<td valign="top" align="center">5,871</td>
<td valign="top" align="center">803</td>
<td valign="top" align="center">265</td>
<td valign="top" align="center">884</td>
<td valign="top" align="center">0.87</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.25</td>
<td valign="top" align="center">0.13</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.58</td>
</tr>
<tr>
<td valign="top" align="left">Age &#x0003E; 65 years</td>
<td valign="top" align="center">3,121</td>
<td valign="top" align="center">1,884</td>
<td valign="top" align="center">533</td>
<td valign="top" align="center">1,010</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.78</td>
<td valign="top" align="center">0.22</td>
<td valign="top" align="center">0.24</td>
<td valign="top" align="center">0.85</td>
<td valign="top" align="center">0.76</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.71</td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Confusion matrix of random forest. <bold>(A)</bold> Full dataset using RF. <bold>(B)</bold> Dataset of age &#x0003C; 45 years using RF. <bold>(C)</bold> Dataset of 45&#x0003C;age&#x0003C;65 years using RF. <bold>(D)</bold> Dataset of age &#x0003E; 65 years using RF.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0006.tif"/>
</fig>
<p>The full dataset has a recall of 0.83, a specificity of 0.78, an FPR of 0.22, an FNR of 0.17, a precision of 0.93, an accuracy of 0.82, an AUC of 0.75, and an F1 score of 0.66. The 0&#x02013;45 years age group has a recall of 0.86, a specificity of 0.67, an FPR of 0.33, an FNR of 0.14, a precision of 0.95, an accuracy of 0.83, an AUC of 0.67, and an F1 score of 0.50. The 45&#x02013;65 years age group has a recall of 0.87, a specificity of 0.75, an FPR of 0.25, an FNR of 0.13, a precision of 0.96, an accuracy of 0.85, an AUC of 0.72, and an F1 score of 0.58. Finally, the over-65 years group has a recall of 0.76, a specificity of 0.78, an FPR of 0.22, an FNR of 0.24, a precision of 0.85, an accuracy of 0.76, an AUC of 0.75, and an F1 score of 0.71.</p>
<p>The results suggest that the model performs well overall, with high recall and precision scores. However, the model performs slightly better for the 45&#x02013;65 years age group, with higher scores on most performance metrics than the other age groups.</p>
<p>To assess robustness, we also cross-validated our models; the results for the different age groups are shown in <xref ref-type="table" rid="T6">Table 6</xref>.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Random forest model cross validation results.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Specifications</bold></th>
<th valign="top" align="center"><bold>Cross-val score</bold></th>
<th valign="top" align="center"><bold>Average accuracy</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Full dataset</td>
<td valign="top" align="center">0.82, 0.80, 0.79, 0.82, 0.83, 0.82, 0.82, 0.81, 0.81, 0.82</td>
<td valign="top" align="center">0.81</td>
</tr>
<tr>
<td valign="top" align="left">0&#x02013;45 Years age</td>
<td valign="top" align="center">0.86, 0.87, 0.87, 0.81, 0.81, 0.87, 0.83, 0.82, 0.80, 0.84</td>
<td valign="top" align="center">0.85</td>
</tr>
<tr>
<td valign="top" align="left">45&#x02013;65 Years age</td>
<td valign="top" align="center">0.87, 0.85, 0.85, 0.85, 0.85, 0.86, 0.86, 0.86, 0.86, 0.85</td>
<td valign="top" align="center">0.86</td>
</tr>
<tr>
<td valign="top" align="left">Age &#x0003E; 65 years</td>
<td valign="top" align="center">0.77, 0.76, 0.75, 0.76, 0.77, 0.77, 0.77, 0.77, 0.77, 0.77</td>
<td valign="top" align="center">0.77</td>
</tr></tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>5.3. Comparing both models</title>
<p>Comparing the results of the logistic regression in <xref ref-type="table" rid="T3">Table 3</xref> with those of the Random Forest classifier in <xref ref-type="table" rid="T5">Table 5</xref>, we notice that logistic regression gives competitive results. Model accuracy for the different age groups for the Logistic Regression and Random Forest models is shown in <xref ref-type="fig" rid="F7">Figure 7</xref>, a comparison of AUC scores in <xref ref-type="fig" rid="F8">Figure 8</xref>, and a comparison of F1 scores in <xref ref-type="fig" rid="F9">Figure 9</xref>. These figures illustrate that the performance of the Logistic Regression model is on par with that of the Random Forest model. The number of true positives of logistic regression on the full dataset is higher than that of Random Forest (10,658 vs. 10,644). Unlike the Random Forest classifier, logistic regression is explainable and allows us to rank the importance of features. Moreover, the results of the Logistic Regression model can be replicated across runs, whereas Random Forest predictions can vary from run to run unless the random seed is fixed, which makes us hesitant to use Random Forest for medical cases.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Models accuracy comparison.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0007.tif"/>
</fig>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Models AUC score comparison.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0008.tif"/>
</fig>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Models F1 score comparison.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0009.tif"/>
</fig>
<p>Therefore, we decided to use the logistic regression for Stage 1 of our analysis.</p>
<p>Once the patient status is predicted, we consider the second stage: we implemented <italic>k</italic>-NN to find the 25 nearest neighbors of the patient and advise the patient regarding chemotherapy and radiation sequence treatments. For this, we included only the &#x0201C;Alive&#x0201D; patient dataset to increase the chances of survival for the patient. We chose 25 nearest neighbors because this number is large enough to be informative, yet small enough for a physician to verify manually by reviewing the 25 patients closest to the patient in question.</p>
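<p>This second stage can be sketched as follows; the cohort matrix, feature count, and outcome labels are placeholder assumptions, not the study's data:</p>

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_cohort = rng.random((500, 14))          # historical patients, 14 features
survived = rng.integers(0, 2, size=500)   # 1 = alive after 5 years (placeholder)

# Index the cohort for 25-nearest-neighbor queries.
nn = NearestNeighbors(n_neighbors=25).fit(X_cohort)

x_new = rng.random((1, 14))               # the patient in question
_, idx = nn.kneighbors(x_new)             # indices of the 25 closest patients

# Empirical survival probability among the 25 nearest neighbors.
p_survive = survived[idx[0]].mean()
```

In practice the physician can then inspect those 25 records directly, which is what makes the neighbor count manageable.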
<p>To illustrate our approach, we consider two patients, Patient_1 and Patient_2. Their features are summarized in <xref ref-type="table" rid="T7">Table 7</xref>.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Sample data for two patients.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Feature</bold></th>
<th valign="top" align="center"><bold>Patient_1</bold></th>
<th valign="top" align="center"><bold>Patient_2</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Age at diagnosis</td>
<td valign="top" align="center">67</td>
<td valign="top" align="center">57</td>
</tr>
<tr>
<td valign="top" align="left">Regional nodes positive (1988&#x0002B;)</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">Total &#x00023; of in situ/malignant tumors</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">Radiation recode</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">Chemotherapy recode</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">Radiation Sequence with Surgery</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">3</td>
</tr>
<tr>
<td valign="top" align="left">ER Status Recode Breast Cancer (1990&#x0002B;)</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">PR Status Recode Breast Cancer (1990&#x0002B;)</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">CS tumor size (2004&#x02013;2015)</td>
<td valign="top" align="center">70</td>
<td valign="top" align="center">11</td>
</tr>
<tr>
<td valign="top" align="left">Derived HER2 Recode (2010&#x0002B;)</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td valign="top" align="left">Regional nodes examined (1988&#x0002B;)</td>
<td valign="top" align="center">21</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="left">Race recode</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">2</td>
</tr>
<tr>
<td valign="top" align="left">Sex</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0</td>
</tr>
<tr>
<td valign="top" align="left">Interval years</td>
<td valign="top" align="center">5</td>
<td valign="top" align="center">5</td>
</tr></tbody>
</table>
</table-wrap>
<p>We used logistic regression to predict the survival of the patient after 5 years, both when the patient follows the doctor&#x00027;s recommendations and when the patient decides not to undergo radiation and chemotherapy. Then, using <italic>k</italic>-Nearest Neighbors, we check the probability of survival of the patient for various combinations of chemotherapy, type of radiation, and radiation sequence.</p>
<p>We start with the first patient, Patient_1. According to our model, the doctor has advised the patient to undergo beam radiation and chemotherapy. Based on the logistic regression, the patient would be alive after 5 years whether the patient follows or refuses the doctor&#x00027;s advice.</p>
<p>The <italic>k</italic> &#x0003D; 25 nearest neighbors put the probability of the patient being alive after 5 years at 68%. But if the patient refuses the beam radiation, the model predicts the probability of being alive after 5 years as 0%. If the patient refuses any radiation and also wants no chemotherapy, the probability of being alive after 5 years reaches 4%.</p>
<p>Now, consider the second patient, Patient_2. For this patient, according to our model, the doctor has advised the patient to undergo beam radiation and also to take chemotherapy. Based on the logistic regression, the patient would be dead after 5 years whether the patient follows or refuses the doctor&#x00027;s advice. The <italic>k</italic> &#x0003D; 25 nearest neighbors put the probability of the patient being alive after 5 years at 84%. But if the patient refuses the chemotherapy, the model predicts the probability of being alive after 5 years as 76%. If the patient refuses any radiation and also wants no chemotherapy, the probability of being alive after 5 years reaches 4%.</p>
<p>Using <italic>k</italic> &#x0003D; 25 nearest neighbors, patients and doctors can examine several combinations of &#x0201C;Chemotherapy Recode,&#x0201D; &#x0201C;Radiation Recode,&#x0201D; and &#x0201C;Radiation Sequence&#x0201D; and obtain the probabilities of survival after 5 years using the interactive model. This allows them to decide on the most appropriate treatment, as shown in <xref ref-type="table" rid="T8">Tables 8</xref>, <xref ref-type="table" rid="T9">9</xref>.</p>
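<p>Enumerating treatment scenarios in this way can be sketched by re-running the neighbor query with the treatment fields of the patient's feature vector overridden; the column positions and scenario encodings below are illustrative assumptions, not the study's actual recodes:</p>

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_cohort = rng.random((500, 14))          # placeholder historical cohort
survived = rng.integers(0, 2, size=500)   # placeholder 5-year outcome labels
nn = NearestNeighbors(n_neighbors=25).fit(X_cohort)

patient = rng.random(14)                  # the patient's feature vector
# Assumed positions of the three treatment fields in the feature vector.
RAD, CHEMO, RAD_SEQ = 3, 4, 5

# Illustrative treatment codes: (radiation, chemotherapy, radiation sequence).
scenarios = [(1.0, 1.0, 3.0), (1.0, 0.0, 3.0), (0.0, 0.0, 0.0)]
probs = {}
for rad, chemo, seq in scenarios:
    x = patient.copy()
    x[[RAD, CHEMO, RAD_SEQ]] = rad, chemo, seq   # override the treatment plan
    _, idx = nn.kneighbors(x.reshape(1, -1))
    probs[(rad, chemo, seq)] = survived[idx[0]].mean()
```

Each entry of `probs` plays the role of one row of the scenario tables: the empirical survival fraction among the 25 neighbors under that treatment combination.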
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>Scenarios for Patient_1.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Scenario for patient 1</bold></th>
<th valign="top" align="left"><bold>Radiation recode</bold></th>
<th valign="top" align="left"><bold>Chemotherapy recode</bold></th>
<th valign="top" align="left"><bold>Radiation sequence with surgery</bold></th>
<th valign="top" align="center"><bold>Probability</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Scenario 1</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Radiation after Surgery</td>
<td valign="top" align="center">68</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 2</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Intraoperative rad with other rad before/after surgery</td>
<td valign="top" align="center">20</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 3</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Intraoperative radiation</td>
<td valign="top" align="center">56</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 4</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Radiation before and after surgery</td>
<td valign="top" align="center">8</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 5</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Radiation prior to surgery</td>
<td valign="top" align="center">40</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 6</td>
<td valign="top" align="left">Refused</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Radiation after surgery</td>
<td valign="top" align="center">4</td>
</tr></tbody>
</table>
</table-wrap>
<table-wrap position="float" id="T9">
<label>Table 9</label>
<caption><p>Scenarios for Patient_2.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Scenario for patient 2</bold></th>
<th valign="top" align="left"><bold>Radiation recode</bold></th>
<th valign="top" align="left"><bold>Chemotherapy recode</bold></th>
<th valign="top" align="left"><bold>Radiation sequence with surgery</bold></th>
<th valign="top" align="center"><bold>Probability</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Scenario 1</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Radiation after Surgery</td>
<td valign="top" align="center">84</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 2</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Radiation after Surgery</td>
<td valign="top" align="center">76</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 3</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Intraoperative rad with other rad before/after surgery</td>
<td valign="top" align="center">20</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 4</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Intraoperative radiation</td>
<td valign="top" align="center">56</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 5</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Radiation before and after surgery</td>
<td valign="top" align="center">8</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 6</td>
<td valign="top" align="left">Beam Radiation</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Radiation prior to surgery</td>
<td valign="top" align="center">40</td>
</tr>
<tr>
<td valign="top" align="left">Scenario 7</td>
<td valign="top" align="left">Refused</td>
<td valign="top" align="left">No/Unknown</td>
<td valign="top" align="left">Radiation after surgery</td>
<td valign="top" align="center">4</td>
</tr></tbody>
</table>
</table-wrap>
<p>This model was hence able to predict how different treatments and combinations of treatment plans change the predicted survival probability of a patient, and to suggest the best treatment plan based on the nearest neighbors. The interactive model we built is shown in <xref ref-type="fig" rid="F10">Figure 10</xref>; it shows how doctors can enter a patient&#x00027;s details to view the data of the 25 previous patients nearest to that patient.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>Interactive model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-06-1124182-g0010.tif"/>
</fig>
<p>One limitation of our study is the lack of biomarker and genetic data, which might have yielded better results. Still, this study can serve as a foundation for future work in which features such as genetics and biomarkers are taken into account, along with the patient&#x00027;s understanding.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusion</title>
<p>This article proposed a methodology for developing treatment plans and explaining them to patients using machine learning. We illustrated the application of this methodology, focusing on the treatment of breast cancer. A distinguishing feature of our approach is that the user is the patient, and this imposes some restrictions on the type and form of the proposed solutions. For this, a combination of logistic regression and k-nearest neighbors is used. Logistic Regression is initially used to compute survival probabilities and explain the importance of features. We then find <italic>k</italic>-Nearest Neighbors and use them to explain the choice of treatment plans based on similar patients. We believe that using KNN allows the physician to justify his/her choice of treatment and makes it possible for the patient to understand potential risks and outcomes. Our future work will test this approach in real-world conditions to help expand and improve this methodology with further methods of explaining the results and treatment options.</p>
</sec>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found at: <ext-link ext-link-type="uri" xlink:href="https://github.com/snehasi2703/BreastCancerSurvivalDataset">https://github.com/snehasi2703/BreastCancerSurvivalDataset</ext-link>.</p>
</sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.</p>
</sec>
</body>
<back>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ak</surname> <given-names>M. F.</given-names></name></person-group> (<year>2020</year>). <article-title>A comparative analysis of breast cancer detection and diagnosis using data visualization and machine learning applications</article-title>. <source>Healthcare</source> <volume>8</volume>, <fpage>111</fpage>. <pub-id pub-id-type="doi">10.3390/healthcare8020111</pub-id><pub-id pub-id-type="pmid">32357391</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alaa</surname> <given-names>A. M.</given-names></name> <name><surname>Gurdasani</surname> <given-names>D.</given-names></name> <name><surname>Harris</surname> <given-names>A. L.</given-names></name> <name><surname>Rashbass</surname> <given-names>J.</given-names></name> <name><surname>van der Schaar</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>Machine learning to guide the use of adjuvant therapies for breast cancer</article-title>. <source>Nat. Mach. Intell</source>. <volume>3</volume>, <fpage>716</fpage>&#x02013;<lpage>726</lpage>. <pub-id pub-id-type="doi">10.1038/s42256-021-00353-8</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bagui</surname> <given-names>S. C.</given-names></name> <name><surname>Pal</surname> <given-names>K.</given-names></name> <name><surname>Pal</surname> <given-names>N.</given-names></name></person-group> (<year>2003</year>). <article-title>Breast cancer detection using rank nearest neighbor classification rules</article-title>. <source>Pattern Recognit</source>. <volume>36</volume>, <fpage>25</fpage>&#x02013;<lpage>34</lpage>. <pub-id pub-id-type="doi">10.1016/S0031-3203(02)00044-4</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bishop</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <source>Pattern Recognition and Machine Learning</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boeri</surname> <given-names>C.</given-names></name> <name><surname>Chiappa</surname> <given-names>C.</given-names></name> <name><surname>Galli</surname> <given-names>F.</given-names></name> <name><surname>Berardinis</surname> <given-names>V.</given-names></name> <name><surname>Bardelli</surname> <given-names>L.</given-names></name> <name><surname>Carcano</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Machine learning techniques in breast cancer prognosis prediction: a primary evaluation</article-title>. <source>Cancer Med</source>. <volume>9</volume>, <fpage>3234</fpage>&#x02013;<lpage>3243</lpage>. <pub-id pub-id-type="doi">10.1002/cam4.2811</pub-id><pub-id pub-id-type="pmid">32154669</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="web"><person-group person-group-type="author"><collab>CDC</collab></person-group> (<year>2022</year>). <source>Centers for disease control and prevention: Breast cancer</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.cdc.gov/cancer/breast/basic_info/index.htm/">https://www.cdc.gov/cancer/breast/basic_info/index.htm/</ext-link> (accessed December 7, 2022).</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cover</surname> <given-names>T.</given-names></name> <name><surname>Hart</surname> <given-names>P.</given-names></name></person-group> (<year>1967</year>). <article-title>Nearest neighbor pattern classification</article-title>. <source>IEEE Trans. Inf. Theory</source> <volume>13</volume>, <fpage>21</fpage>&#x02013;<lpage>27</lpage>. <pub-id pub-id-type="doi">10.1109/TIT.1967.1053964</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goldberg</surname> <given-names>S.</given-names></name> <name><surname>Pinsky</surname> <given-names>E.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Building a meta-agent for human-machine dialogue in machine learning systems,&#x0201D;</article-title> in <source>Advances in Information and Communication, FICC 2022</source>, ed A. Kohei (New York, NY: Springer), <fpage>474</fpage>&#x02013;<lpage>487</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hastie</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <source>The Elements of Statistical Learning</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Medjahed</surname> <given-names>S.</given-names></name> <name><surname>Saadi</surname> <given-names>T.</given-names></name> <name><surname>Benyettou</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules</article-title>. <source>Int. J. Comput. Appl</source>. <volume>61</volume>, <fpage>1</fpage>&#x02013;<lpage>5</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rajbharath</surname> <given-names>R.</given-names></name> <name><surname>Sankari</surname> <given-names>L. K. S.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Predicting breast cancer using random forest and logistic regression,&#x0201D;</article-title> in <source>International Journal Of Engineering Science and Computing IJESC</source>, Vol. 7.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reddy</surname> <given-names>J.</given-names></name> <name><surname>Lindsay</surname> <given-names>W.</given-names></name> <name><surname>Berlind</surname> <given-names>C.</given-names></name> <name><surname>Smith</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>Applying a machine learning approach to predict acute toxicities during radiation for breast cancer patients</article-title>. <source>Int. J. Radiat. Oncol. Biol. Phys</source>. 102, 559. <pub-id pub-id-type="doi">10.1016/j.ijrobp.2018.06.167</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sarkar</surname> <given-names>M.</given-names></name> <name><surname>Leong</surname> <given-names>T.</given-names></name></person-group> (<year>2000</year>). <article-title>&#x0201C;Application of k-nearest neighbors algorithm on breast cancer diagnosis problem,&#x0201D;</article-title> in <source>AMIA Annual Proceedings</source>, <fpage>759</fpage>&#x02013;<lpage>763</lpage>.<pub-id pub-id-type="pmid">11079986</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="web"><person-group person-group-type="author"><collab>SEER</collab></person-group> (<year>2022</year>). <source>National Cancer Institute: Surveillance, Epidemiology, and Results Program: Variable and recode definitions</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://seer.cancer.gov/analysis/">https://seer.cancer.gov/analysis/</ext-link> (accessed December 7, 2022).</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>D.</given-names></name> <name><surname>Man</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>Q.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Du</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>A decision-making supporting prediction method for breast cancer neoadjuvant chemotherapy</article-title>. <source>Front. Oncol</source>. 10, 592556. <pub-id pub-id-type="doi">10.3389/fonc.2020.592556</pub-id><pub-id pub-id-type="pmid">33469514</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sugimoto</surname> <given-names>M.</given-names></name> <name><surname>Hikichi</surname> <given-names>S.</given-names></name> <name><surname>Takada</surname> <given-names>M.</given-names></name> <name><surname>Toi</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Machine learning techniques for breast cancer diagnosis and treatment: a narrative review,&#x0201D;</article-title> in <source>Annals of Breast Surgery Vol. 7</source>, <fpage>1</fpage>&#x02013;<lpage>13</lpage>.<pub-id pub-id-type="pmid">35771379</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>van de Sande</surname> <given-names>D.</given-names></name> <name><surname>Sharabiani</surname> <given-names>M.</given-names></name> <name><surname>Bluemink</surname> <given-names>H.</given-names></name> <name><surname>Kneepkens</surname> <given-names>E.</given-names></name> <name><surname>Bakx</surname> <given-names>N.</given-names></name> <name><surname>Hagelaar</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Artificial intelligence based treatment planning of radiotherapy for locally advanced breast cancer</article-title>. <source>Phys. Imaging Radiat. Oncol</source>. <volume>20</volume>, <fpage>111</fpage>&#x02013;<lpage>116</lpage>. <pub-id pub-id-type="doi">10.1016/j.phro.2021.11.007</pub-id><pub-id pub-id-type="pmid">34917779</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="web"><person-group person-group-type="author"><collab>World Health Organization</collab></person-group> (<year>2022</year>). <source>World Health Organization: International Agency for Research in Cancer</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.iarc.who.int/cancer-topics/">https://www.iarc.who.int/cancer-topics/</ext-link> (accessed December 7, 2022).</citation>
</ref>
</ref-list> 
</back>
</article>