<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2020.00006</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Robust Machine Learning for Colorectal Cancer Risk Prediction and Stratification</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Nartowt</surname> <given-names>Bradley J.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/522165/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Hart</surname> <given-names>Gregory R.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/532820/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Muhammad</surname> <given-names>Wazir</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/74238/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liang</surname> <given-names>Ying</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Stark</surname> <given-names>Gigi F.</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Deng</surname> <given-names>Jun</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/431624/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Therapeutic Radiology, Yale University</institution>, <addr-line>New Haven, CT</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Radiation Oncology, Medical College of Wisconsin</institution>, <addr-line>Milwaukee, WI</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Statistics &#x00026; Data Science, Yale University</institution>, <addr-line>New Haven, CT</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Katrina M. Waters, Pacific Northwest National Laboratory (DOE), United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Maria Chan, Memorial Sloan Kettering Cancer Center, United States; Avinash Parnandi, Langone Medical Center, New York University, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jun Deng <email>jun.deng&#x00040;yale.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Medicine and Public Health, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>03</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>3</volume>
<elocation-id>6</elocation-id>
<history>
<date date-type="received">
<day>10</day>
<month>10</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>01</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Nartowt, Hart, Muhammad, Liang, Stark and Deng.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Nartowt, Hart, Muhammad, Liang, Stark and Deng</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>While colorectal cancer (CRC) is third in prevalence and mortality among cancers in the United States, there is no effective method to screen the general public for CRC risk. In this study, to identify an effective mass screening method for CRC risk, we evaluated seven supervised machine learning algorithms: linear discriminant analysis, support vector machine, naive Bayes, decision tree, random forest, logistic regression, and artificial neural network. Models were trained and cross-tested with the National Health Interview Survey (NHIS) and the Prostate, Lung, Colorectal, Ovarian Cancer Screening (PLCO) datasets. Six imputation methods were used to handle missing data: mean, Gaussian, Lorentzian, one-hot encoding, Gaussian expectation-maximization, and listwise deletion. Among all of the model configurations and imputation method combinations, the artificial neural network with expectation-maximization imputation emerged as the best, having a concordance of 0.70 &#x000B1; 0.02, sensitivity of 0.63 &#x000B1; 0.06, and specificity of 0.82 &#x000B1; 0.04. In stratifying CRC risk in the NHIS and PLCO datasets, only 2% of negative cases were misclassified as high risk and 6% of positive cases were misclassified as low risk. In modeling the CRC-free probability with Kaplan-Meier estimators, low-, medium-, and high CRC-risk groups have statistically-significant separation. Our results indicated that the trained artificial neural network can be used as an effective screening tool for early intervention and prevention of CRC in large populations.</p></abstract>
<kwd-group>
<kwd>colorectal cancer</kwd>
<kwd>risk stratification</kwd>
<kwd>neural network</kwd>
<kwd>concordance</kwd>
<kwd>self-reportable health data</kwd>
<kwd>external validation</kwd>
</kwd-group>
<counts>
<fig-count count="6"/>
<table-count count="2"/>
<equation-count count="5"/>
<ref-count count="31"/>
<page-count count="12"/>
<word-count count="7728"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Of all new cancer incidences in the United States, 8.1% are colorectal cancer (CRC) (Falco et al., <xref ref-type="bibr" rid="B10">2018</xref>; National Cancer Institute, <xref ref-type="bibr" rid="B20">2018</xref>). The 5-year survival rate for CRC ranges from 14% for a distant stage to 90% for a localized stage. CRC is responsible for 8.3% of all cancer deaths, and is especially deadly and recurrent when coincident with diabetes and hypertension (Yang et al., <xref ref-type="bibr" rid="B31">2012</xref>). However, there exists little knowledge of the primary causes of CRC. Thus, current screening recommendations are only based on family history of CRC and age. Specifically, the United States Preventative Services Task Force (USPSTF) recommends screening for individuals between ages 50 and 75 while the American Cancer Society recommends screening for individuals between ages 45 and 75 (Collins et al., <xref ref-type="bibr" rid="B8">2015</xref>; Bibbins-Domingo et al., <xref ref-type="bibr" rid="B5">2016</xref>). Both guidelines recommend screening for anyone with one or more primary relatives who have ever had CRC. While screening according to these guidelines indisputably saves lives, high-risk individuals with no CRC family history and/or aged 18&#x02013;49 would clearly benefit from a model that better detects their risk. Low-risk individuals that are flagged for screening under a new model (Collins et al., <xref ref-type="bibr" rid="B8">2015</xref>; Bibbins-Domingo et al., <xref ref-type="bibr" rid="B5">2016</xref>), would also be given information to help them choose whether they want to be subject to invasive, expensive, and injurious (Benard et al., <xref ref-type="bibr" rid="B2">2018</xref>; National Cancer Institute, <xref ref-type="bibr" rid="B21">2019</xref>) screening. Hence, it is important to develop an effective method to estimate CRC risk non-invasively and cost-effectively.</p>
<p>Many CRC-risk models that do not involve biomarkers have been developed previously (Usher-Smith et al., <xref ref-type="bibr" rid="B30">2016</xref>). Using only professionally-collected routine data (biological sex, use of non-steroidal anti-inflammatory drugs (NSAIDs), form of recruitment, non-specific abdominal pain, bowel habit, age, BMI, cholesterol, and triglycerides), Betes et al. achieved a concordance of &#x0007E;0.7 using a multiple logistic regression model (Betes et al., <xref ref-type="bibr" rid="B4">2003</xref>). Using data from a self-completed questionnaire asking about CRC in first-degree relatives, BMI, screening, diet (multivitamin, alcohol, vegetable, and red-meat consumption), height, physical activity, pharmaceutical use (prophylactic aspirin, other NSAIDs, and post-menopausal hormones), tobacco use, and inflammatory bowel disease, Colditz et al. built a multiple logistic regression model of similar concordance &#x0007E;0.7 (Colditz et al., <xref ref-type="bibr" rid="B7">2000</xref>). Both models were externally tested, i.e., built from one dataset with performance reported on a dataset from a separate study (Collins et al., <xref ref-type="bibr" rid="B8">2015</xref>). However, beyond such simple logistic regression models, there has been no systematic study developing more advanced machine learning models for CRC risk prediction and stratification in a large population that also considers various imputation methods.</p>
<p>Hence in this work, we aim to identify an effective mass screening method for CRC risk based solely on personal health data. We trained and cross-tested various machine learning models with two large national databases, reporting performance in terms of the concordance, a performance metric that is biased but standard (Hanley and McNeil, <xref ref-type="bibr" rid="B14">1982</xref>; Hosmer and Lemeshow, <xref ref-type="bibr" rid="B16">2000</xref>; Fawcett, <xref ref-type="bibr" rid="B11">2005</xref>; Hajian-Tilaki, <xref ref-type="bibr" rid="B13">2013</xref>). A variety of imputation methods were explored in handling the missing data. Additionally, a component of cross-uncertainty is incorporated to the total uncertainty reported, adding stringency to our testing that to our knowledge has not been used before. Finally, we furnish some ideas on how our model can be deployed for real world applications.</p>
</sec>
<sec sec-type="materials and methods" id="s2">
<title>Materials and Methods</title>
<sec>
<title>Two Datasets From Separate Studies</title>
<p>The National Health Interview Survey (NHIS) dataset<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> is a cross-sectional study of the overall health status of the United States. Each year, roughly 30,000 adults are interviewed on a range of current and past personal health conditions. The first NHIS survey after a significant revision was administered in 1997, and the next such redesign was scheduled for 2019, so data from years 1997 to 2017 were used. Our other study is the longitudinal Prostate, Lung, Colorectal, Ovarian (PLCO) Cancer Screening dataset from the National Cancer Institute<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>. The PLCO dataset is a randomized, controlled longitudinal study on the efficacy of screening for prostate, lung, colorectal, and ovarian cancer. Between November 1993 and July 2001, participants were randomized, entered into the trial, answered a baseline questionnaire (BQ), and were followed for up to 14 years, exiting the trial early if they were diagnosed with any cancer or if they died. To match the PLCO data with the NHIS dataset, we assumed that answering the PLCO BQ was equivalent to participating in the NHIS&#x00027;s interview.</p>
<p>In the NHIS, data was marked by 7 for responses of &#x0201C;Refused,&#x0201D; 8 for &#x0201C;Not ascertained,&#x0201D; and 9 for &#x0201C;Don&#x00027;t know&#x0201D;<sup>1</sup>; all these responses were assumed to indicate data missing completely at random (MCAR) (Little and Rubin, <xref ref-type="bibr" rid="B18">2014</xref>). This is distinguished from data not missing at random, which is marked by a table entry that is actually blank (e.g., all pregnancy data has a blank entry for male respondents). The PLCO uses the same scheme: digit entries mark data that is MCAR, while data missing not at random is left blank.</p>
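<p>As an illustration, this recoding scheme can be sketched as follows. This is a Python analog of our MATLAB preprocessing, not the actual implementation; the field names and the dictionary representation are hypothetical, and in practice the marker codes must be interpreted per-field, since 7&#x02013;9 are legitimate values for some continuous variables.</p>

```python
# Hypothetical sketch of recoding NHIS/PLCO-style missingness markers.
# Codes 7/8/9 ("Refused", "Not ascertained", "Don't know") mark MCAR entries;
# a blank string marks data not missing at random and is left untouched.
MCAR_CODES = {7, 8, 9}

def recode(record, coded_fields):
    """Map MCAR marker codes to None in the fields that use them."""
    out = dict(record)
    for field in coded_fields:
        if out.get(field) in MCAR_CODES:
            out[field] = None  # MCAR: eligible for imputation later
    return out
```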
<p>The United States Preventative Services Task Force guidelines currently recommend anyone with family history of CRC and/or aged 50&#x02013;75 years for screening (Bibbins-Domingo et al., <xref ref-type="bibr" rid="B5">2016</xref>), while screening at ages 76&#x0002B; is up to the individual. Thus, ages 18&#x02013;49 and ages 50&#x02013;75 form sub-demographics of data that are of interest. To assess performance in these sub-demographics, we trained and tested models on these age splits of the data as well as on all ages.</p>
<p>Some factors appear in the NHIS dataset but are missing in the PLCO dataset, and vice versa. Specifically, factors appearing in the NHIS but not in the PLCO are alcohol use, vigorous exercise frequency, functional limitations, kidney comorbidity, and incidence of angina. Factors appearing in the PLCO but not in the NHIS are non-steroidal anti-inflammatory drug (NSAID) use, gallbladder inflammation, and incidence of diverticulitis. To preserve the rigor and robustness of TRIPOD level 3 cross-testing between separate datasets, none of these factors were used in our study.</p>
</sec>
<sec>
<title>CRC vs. Never Cancer</title>
<p>The NHIS records each respondent&#x00027;s age at the time of the survey, and the age(s) at which the respondent was diagnosed with cancer of the colon and/or rectum, if at all. Respondents were counted as positive cases of CRC if their diagnosis happened &#x0003C;4 years prior to the survey. In each study, a small fraction of the respondents were recently diagnosed with CRC. We considered CRC in survey respondents ages 18&#x02013;85.</p>
<p>In PLCO and NHIS, the following four types of respondents were discarded: (1) those diagnosed with CRC more than 4 years prior to taking the survey (NHIS) or answering the questionnaire (PLCO), (2) those non-CRC respondents diagnosed with any other cancer at any time, (3) those CRC respondents diagnosed with a cancer other than CRC at a time before their CRC diagnosis, and (4) those CRC respondents having CRC at a time before randomization (PLCO only). Members of the first group were discarded because their reported personal health data was considered irrelevant to their CRC diagnosis. Those in the second and third groups were discarded because those diagnosed with any cancer already receive heightened screening attention, defeating the purpose of assessing their risk. The fourth group is discarded because before randomization the time (in days) of CRC diagnosis is not known. Thus, the negative examples were those who were never diagnosed with any cancer and are referred to as &#x0201C;never-cancer&#x0201D; (NC) while the positive examples were those recently diagnosed with CRC and are referred to as &#x0201C;CRC.&#x0201D;</p>
<p>To be considered a positive case of CRC in the PLCO data remaining after the deletion described above, the respondent needed to meet both of the following conditions: (1) they were diagnosed with CRC within 4 years of the BQ and (2) CRC was the first cancer they had. Respondents not meeting both conditions were considered part of the non-cancer population remaining after the above discarding was carried out. Hence, the outcome variable used in both datasets was the respondent&#x00027;s cancer status coded to a 0 or a 1. A value of 0 indicated that the respondent was never diagnosed with cancer (CRC or any other cancer); it is assumed that a respondent previously diagnosed with any kind of cancer would already have been flagged for screening, defeating the purpose of risk-scoring. A value of 1 indicated that the respondent was diagnosed with CRC within four (4) years of answering either the PLCO BQ or the NHIS interview questions. All respondents who fit neither of these criteria were treated as data not missing at random (Little and Rubin, <xref ref-type="bibr" rid="B18">2014</xref>) and thus discarded (never subject to any imputation methods).</p>
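<p>The discarding and labeling rules above can be sketched as follows; this is a hypothetical Python rendering, with illustrative field names (not actual NHIS/PLCO variable names) and ages at diagnosis standing in for diagnosis dates.</p>

```python
# Hypothetical sketch of the CRC vs. never-cancer outcome coding.
# Each respondent is a dict of ages in years, or None where not applicable.
def label_respondent(r):
    """Return 1 (recent CRC), 0 (never-cancer), or None (discarded)."""
    crc, other, survey = r["crc_age"], r["other_cancer_age"], r["survey_age"]
    if crc is not None:
        if survey - crc > 4:
            return None            # rule 1: CRC diagnosed >4 years before survey
        if other is not None and other < crc:
            return None            # rule 3: another cancer preceded the CRC
        return 1                   # CRC within 4 years and first cancer
    if other is not None:
        return None                # rule 2: non-CRC respondent with another cancer
    return 0                       # never diagnosed with any cancer
```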
<p>Performance after training with such an outcome variable is not relative to the sensitivity and specificity of any gold standard. In our work, the gold standard of CRC diagnosis is colonoscopy. Unfortunately, colonoscopy data is missing not at random for a significant portion of data. Specifically, only NHIS questionnaires from years 2000, 2005, 2010, and 2015 asked the respondent if they had ever been screened by the gold standard (sigmoidoscopy, colonoscopy, or proctoscopy). We therefore assumed that neither dataset contained any false positive or false negative cases.</p>
</sec>
<sec>
<title>Data Preparation</title>
<p>For reproducibility, we describe how the raw data was mapped to the datasets used to train and test the machine learning algorithms (MLAs). The factors of ever having hypertension, ulcers, a stroke, any liver comorbidity, arthritis, bronchitis, coronary heart disease, myocardial infarction, and/or emphysema are binary variables and mapped to 0 for &#x0201C;no&#x0201D; and 1 for &#x0201C;yes.&#x0201D; Diabetic status has one of three discrete values: not diabetic, pre-diabetic/borderline, and diabetic; these conditions were mapped to 0, 0.5, and 1, respectively. The age factor is continuous and equals the age at response to the NHIS or PLCO BQ for negative cases and the age at CRC diagnosis for positive cases. Body mass index (BMI) is likewise continuous. All such continuous factors were unitized to the interval [0, 1]. The sex factor is 0 for women and 1 for men. The variable of Hispanic ethnicity was given a value of 0 for a response of &#x0201C;Not Hispanic/Spanish origin&#x0201D; and 1 otherwise. The variable of race was set to 1 for responses of &#x0201C;Black/African American only,&#x0201D; &#x0201C;American Indian only,&#x0201D; &#x0201C;Other race,&#x0201D; or &#x0201C;Multiple race,&#x0201D; and 0 otherwise. The smoking status had a value of 1 for an everyday smoker, 0.66 for a some-day smoker, 0.33 for a former smoker, and 0 for a never smoker. The NHIS defines a &#x0201C;never smoker&#x0201D; as someone who has smoked 100 cigarettes or fewer over their entire lifetime, and a &#x0201C;former smoker&#x0201D; as a smoker who quit at least 6 months prior to the survey; these same definitions were used to score PLCO respondents&#x00027; smoking status using equivalent fields. The variable of family history represents the number of first-degree relatives who have had CRC, and was capped at 3. The family history variable values of 0, 1, 2, and 3 were mapped to 0, 0.33, 0.66, and 1, respectively.</p>
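<p>The factor mappings above can be sketched as follows; this is a Python analog of our MATLAB preprocessing, with hypothetical field names and illustrative unitizing bounds (ages 18&#x02013;85).</p>

```python
# Sketch of the factor mappings described above; field names and the
# unitizing bounds are illustrative, not actual NHIS/PLCO codebook values.
DIABETES = {"not diabetic": 0.0, "borderline": 0.5, "diabetic": 1.0}
SMOKING = {"never": 0.0, "former": 0.33, "some-day": 0.66, "everyday": 1.0}

def unitize(x, lo, hi):
    """Scale a continuous factor onto [0, 1] using dataset-wide bounds."""
    return (x - lo) / (hi - lo)

def prepare(r):
    return {
        "hypertension": 1.0 if r["hypertension"] == "yes" else 0.0,
        "diabetes": DIABETES[r["diabetes"]],
        "smoking": SMOKING[r["smoking"]],
        # number of first-degree relatives with CRC, capped at 3, then
        # scaled to 0, 1/3, 2/3, 1 (the 0.33/0.66 values above, up to rounding)
        "family_history": min(r["crc_relatives"], 3) / 3.0,
        "age": unitize(r["age"], 18, 85),
        "sex": 1.0 if r["sex"] == "male" else 0.0,
    }
```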
</sec>
<sec>
<title>The Levels of TRIPOD and the Cross-Testing Uncertainty</title>
<p>Below, we use the terms &#x0201C;training,&#x0201D; &#x0201C;validation,&#x0201D; and &#x0201C;testing&#x0201D; to describe increasingly-general model performances. Any portion of data designated as &#x0201C;training&#x0201D; is used to directly adjust the parameters of the model (e.g., by iterations of gradient-descent in the space of model parameters for an artificial neural network). Any portion of data designated as &#x0201C;validation&#x0201D; is not involved in direct adjustment of model parameters, but is used to stop further iterations of an algorithm based on whether overfitting happens (e.g., stopping iterations of gradient descent if the training fitting error is decreasing but the validation fitting error is increasing). Finally, any portion of data designated as &#x0201C;testing&#x0201D; is data used for neither training nor validation. In the literature, the term &#x0201C;validation&#x0201D; is sometimes used to describe what is actually testing, often by way of the term &#x0201C;cross-validation&#x0201D; (Picard and Cook, <xref ref-type="bibr" rid="B22">1984</xref>). In this work, we use the term &#x0201C;cross-testing&#x0201D; to avoid any possible confusion.</p>
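<p>The role of the validation data in stopping iterations can be sketched schematically as follows; the <monospace>step</monospace> and <monospace>val_error</monospace> callables and the <monospace>patience</monospace> threshold are placeholders, not part of our actual MATLAB implementation.</p>

```python
# Schematic early-stopping skeleton: training data drives each update via
# step(), while validation data is used only to decide when to stop.
def train_with_early_stopping(step, val_error, max_iters=1000, patience=5):
    """Run step() until validation error fails to improve `patience` times."""
    best, bad = float("inf"), 0
    for _ in range(max_iters):
        step()                  # one training update (e.g., a gradient step)
        err = val_error()       # fitting error on the held-out validation split
        if err < best:
            best, bad = err, 0  # validation error still improving
        else:
            bad += 1
            if bad >= patience:
                break           # overfitting suspected: stop iterating
    return best
```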
<p>We reported concordance, a performance metric, at level 3 of the hierarchy proposed by the Transparent Reporting of Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines (Collins et al., <xref ref-type="bibr" rid="B8">2015</xref>). TRIPOD level 1a corresponds to testing upon the same dataset used for training (leaving any overfitting undetected). TRIPOD level 1b corresponds to <italic>n</italic>-fold cross-validation (Picard and Cook, <xref ref-type="bibr" rid="B22">1984</xref>). TRIPOD levels 2a and 2b each correspond to a trained model tested upon or cross-tested between splits of the data involved neither in training nor overfitting-detection (&#x0201C;validation&#x0201D;). Level 2a corresponds to random splits of the data, accordingly yielding normally-distributed random error (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>) in Equation (2). Level 2b corresponds to non-random splits of the data, yielding systematically-distributed error in Equation (2). TRIPOD level 3 is where a model trained by data from one study is tested upon or cross-tested between data from a separate study. TRIPOD level 4 corresponds to testing a published model on a separate dataset (<xref ref-type="sec" rid="A1">Appendix</xref>).</p>
<p>Our model has a TRIPOD level of 3, as it was trained upon a dataset from a longitudinal study and tested on a dataset from a cross-sectional study and vice versa. Throughout this paper, cross-testing shall refer to training on NHIS/PLCO and testing upon PLCO/NHIS, respectively. In this case, the systematic error from Equation (2) arises from the distributional disparity [in the Bayesian perspective (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>), where each data entry in the NHIS and PLCO is assumed to be drawn from separate probability distributions with unknown parameters] between the PLCO and NHIS datasets due to (among other things) the fact that the NHIS is cross-sectional while PLCO is longitudinal. Reporting performance at TRIPOD level 3 demonstrates generalizability of the model&#x00027;s predictive capacity.</p>
</sec>
<sec>
<title>Seven Machine Learning Algorithms</title>
<p>The MLAs used in this work are an artificial neural network (ANN), logistic regression (LR), naive Bayes (NB), decision tree (DT), random forest (RF), linear-kernel support-vector machine (SVM), and linear discriminant analysis (LDA), each with automatic optimization of hyper-parameters (Fisher, <xref ref-type="bibr" rid="B12">1936</xref>; Morgan and Sonquist, <xref ref-type="bibr" rid="B19">1963</xref>; Rumelhart et al., <xref ref-type="bibr" rid="B27">1986</xref>; Cortes and Vapnik, <xref ref-type="bibr" rid="B9">1995</xref>; Hosmer and Lemeshow, <xref ref-type="bibr" rid="B16">2000</xref>; Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>). The LR, NB, DT, SVM, and LDA MLAs were invoked, respectively, by the &#x0201C;fitglm,&#x0201D; &#x0201C;fitcnb,&#x0201D; &#x0201C;fitctree,&#x0201D; &#x0201C;fitcsvm,&#x0201D; and &#x0201C;fitcdiscr&#x0201D; MATLAB functions. The SVM, LDA, and DT MLAs yielded a CRC risk score via Platt scaling (Platt, <xref ref-type="bibr" rid="B23">1999</xref>). The ANN was implemented with previously-developed in-house MATLAB code.</p>
<p>ANNs are a method of regression (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>), as they determine function parameters that minimize fitting error using iterations of stochastic gradient descent in parameter space (Bishop, <xref ref-type="bibr" rid="B6">2006</xref>; Andoni et al., <xref ref-type="bibr" rid="B1">2014</xref>; Kingma and Ba, <xref ref-type="bibr" rid="B17">2015</xref>) through a process called backpropagation (Rumelhart et al., <xref ref-type="bibr" rid="B27">1986</xref>), and are similar to logistic regressions (LRs) (Hosmer and Lemeshow, <xref ref-type="bibr" rid="B16">2000</xref>). Specifically, an ANN with a logistic activation function and zero hidden layers is a logistic regression. With their hidden layers, ANNs model inter-factor coupling through logistic or hyperbolic-tangent functions; these are called the activation functions of the ANN. ANNs that use logistic activation functions are multilinear generalizations of LRs.</p>
<p>The in-house MATLAB-coded ANN has two hidden layers with logistic activation and deploys adaptive gradient descent via the &#x0201C;Adam&#x0201D; learning rate. It also uses both early stopping and automatic hyperparameter optimization to minimize overfitting. There is one input neuron for each factor used, and each hidden layer has one neuron for each input neuron. Each neuron is associated with a single weight <italic>W</italic> and a single bias <italic>B</italic>, which, respectively, are the slope and intercept for the linear function <italic>z</italic> &#x0003D; <italic>z</italic>(<italic>X</italic>) &#x0003D; <italic>WX</italic> &#x0002B; <italic>B</italic> with argument <italic>X</italic>. The linear function itself is then fed into the neuron&#x00027;s sigmoidal activation function (<italic>e</italic><sup>&#x02212;<italic>z</italic></sup> &#x0002B; 1)<sup>&#x02212;1</sup>. The weights and biases are, respectively, determined by iterations of the equations <inline-formula><mml:math id="M1"><mml:msup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>W</mml:mi><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula> and <inline-formula><mml:math id="M2"><mml:msup><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>B</mml:mi><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>B</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula> for fitting error <inline-formula><mml:math id="M3"><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">S</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo class="qopname">ln</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msubsup><mml:mrow><mml:mo class="qopname">&#x0220F;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x00232;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x00232;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> between the subject&#x00027;s risk-score and their actual cancer status in a total of <italic>N</italic> subjects. In our backpropagation, we iterated until |<italic>W</italic>&#x02032; &#x02212; <italic>W</italic>|, |<italic>B</italic>&#x02032; &#x02212; <italic>B</italic>| &#x02264; &#x003B5; for a chosen small tolerance &#x003B5;.</p>
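<p>A zero-hidden-layer instance of this update rule (which, as noted above, reduces to a logistic regression) can be sketched in Python as follows; the toy data, learning rate &#x003B1;, and tolerance &#x003B5; are illustrative, and S is taken as the mean cross-entropy fitting error.</p>

```python
import math

# Toy sketch of the updates W' = W - a*dS/dW and B' = B - a*dS/dB for a
# single-input neuron with sigmoidal activation, iterated until both
# parameter changes fall below the tolerance eps.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, alpha=0.5, eps=1e-6, max_iters=200000):
    w = b = 0.0
    n = len(xs)
    for _ in range(max_iters):
        # gradients of the mean cross-entropy S with respect to W and B
        dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        db = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w_new, b_new = w - alpha * dw, b - alpha * db
        converged = abs(w_new - w) <= eps and abs(b_new - b) <= eps
        w, b = w_new, b_new
        if converged:
            break
    return w, b
```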
<p>The NB method modeled the conditional probability of having CRC by constructing a Gaussian distribution with a conditional sample mean and conditional sample variance (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>). This conditioning was based on whether each respondent was drawn from the CRC or the never-cancer population. That is, the conditional probability <italic>P</italic> = P(C|&#x003A6;) of the event &#x003A6; of having a set of features (e.g., hypertension, diabetes, body-mass index) resulting in the event C of having CRC was given by Bayes&#x00027; theorem as P(C|&#x003A6;) = P(&#x003A6;|C)P(C)/P(&#x003A6;). The NB method thus incorporated inter-factor coupling, though as a multiplicative model that assumed the factors to be distributed independently. Although this assumption of independence is almost always incorrect, the NB method&#x00027;s performance was competitive with those of more advanced MLAs (Rish, <xref ref-type="bibr" rid="B24">2001</xref>).</p>
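<p>For a single continuous feature, this Gaussian conditioning can be sketched as follows; the toy values are illustrative, and this is a schematic analog rather than our MATLAB &#x0201C;fitcnb&#x0201D; invocation.</p>

```python
import math

def gauss(x, mu, var):
    """Gaussian probability density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def nb_posterior(x, pos, neg):
    """P(C | x) via Bayes' theorem with class-conditional Gaussians."""
    def stats(v):
        mu = sum(v) / len(v)
        var = sum((t - mu) ** 2 for t in v) / (len(v) - 1)  # sample variance
        return mu, var
    prior = len(pos) / (len(pos) + len(neg))        # prior P(C)
    like_c = gauss(x, *stats(pos)) * prior          # P(x | C) P(C)
    like_n = gauss(x, *stats(neg)) * (1 - prior)    # P(x | not C) P(not C)
    return like_c / (like_c + like_n)
```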
<p>The LDA and SVM calculated a decision boundary between the positive and negative populations that maximized a likelihood function (Fisher, <xref ref-type="bibr" rid="B12">1936</xref>; Cortes and Vapnik, <xref ref-type="bibr" rid="B9">1995</xref>). The LDA assumed homoscedasticity, the absence of multicollinearity, and that the responses were random variables drawn from completely independent Gaussian distributions. The SVM method similarly calculated a decision boundary, except without assuming that the feature-values were drawn from a Gaussian distribution. In general, decision-boundary methods are effective because they resist the effects of outliers.</p>
<p>The DT method constructed a flowchart of factors leading to CRC. The DT used the variable of lowest entropy (Shannon, <xref ref-type="bibr" rid="B28">1948</xref>; Morgan and Sonquist, <xref ref-type="bibr" rid="B19">1963</xref>) to construct the base of the tree, and used increasingly less informative variables at higher branches. Such a flowchart can be easily understood by a human, and is thus highly desirable in a clinical setting. Finally, we tested a bootstrap-aggregated (&#x0201C;bagged&#x0201D;) collection of random trees, better known as a random forest (RF). RFs resist the overfitting that DTs are prone to, but lack the transparency and interpretability that DTs provide in making their classifications.</p>
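<p>The entropy-based choice of the root variable can be sketched as follows; this is a schematic Python illustration of Shannon entropy and information gain for binary factors, not our MATLAB &#x0201C;fitctree&#x0201D; call.</p>

```python
import math

# Shannon entropy of a list of binary outcome labels, in bits.
def entropy(labels):
    h, n = 0.0, len(labels)
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

# Reduction in outcome entropy from splitting on one binary factor; the
# factor with the largest gain is placed at the base of the tree.
def information_gain(rows, labels, factor):
    gain = entropy(labels)
    for value in (0, 1):
        sub = [y for r, y in zip(rows, labels) if r[factor] == value]
        if sub:
            gain -= len(sub) / len(labels) * entropy(sub)
    return gain
```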
</sec>
<sec>
<title>Six Imputation Methods to Handle Missing Data</title>
<p>To handle data that are missing completely at random (Little and Rubin, <xref ref-type="bibr" rid="B18">2014</xref>), the datasets were subjected to six imputation methods: mean, Gaussian, Lorentzian, one-hot encoding, expectation-maximization (EM), and listwise deletion, some of which over-represent distributional moments. The six examined imputation methods have different strengths and weaknesses. Imputation by mean over-represents the mean. Imputation by drawing from a Gaussian random variable over-represents the variance about the mean. Imputation by drawing from a Cauchy (Lorentzian) random variable does not over-represent the mean or variance. Imputation by one-hot encoding (Bishop, <xref ref-type="bibr" rid="B6">2006</xref>) uses the actual missingness of a data-entry as a feature. Finally, imputation by the (multivariate Gaussian) expectation-maximization (EM) iterative method over-represents the covariance between features (e.g., the covariance of diabetes with hypertension). The methods that draw from the Gaussian and Cauchy distributions used MATLAB&#x00027;s random number generator via the function &#x0201C;rand,&#x0201D; and thus are stochastic. Imputation by mean, one-hot encoding, the EM algorithm, and listwise deletion, on the other hand, are deterministic.</p>
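<p>Five of these schemes can be sketched on a single feature column as follows; this is a Python analog of the MATLAB routines, with <monospace>None</monospace> marking an MCAR entry, and the Cauchy location/scale choices are illustrative.</p>

```python
import math
import random
import statistics

# Sketches of five imputation schemes on one feature column, with None
# marking an entry that is missing completely at random.
def impute_mean(col):
    m = statistics.mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def impute_gaussian(col, rng=random):
    obs = [v for v in col if v is not None]
    mu, sd = statistics.mean(obs), statistics.stdev(obs)
    return [rng.gauss(mu, sd) if v is None else v for v in col]

def impute_cauchy(col, rng=random):
    obs = [v for v in col if v is not None]
    loc, scale = statistics.median(obs), statistics.stdev(obs)
    # draw from a Cauchy (Lorentzian) via the inverse-CDF transform
    return [loc + scale * math.tan(math.pi * (rng.random() - 0.5))
            if v is None else v for v in col]

def one_hot_missing(col):
    """Use missingness itself as a feature: (zero-filled value, indicator)."""
    return [(0.0 if v is None else v, 1 if v is None else 0) for v in col]

def listwise_delete(rows):
    """Discard any respondent (row) with even one missing entry."""
    return [r for r in rows if None not in r]
```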
<p>The version of the EM algorithm that we chose assumed that all variables in each dataset were drawn from a multivariate Gaussian distribution (Bishop, <xref ref-type="bibr" rid="B6">2006</xref>). Iterations of the algorithm imputed the MCAR entries with values that over-represented the covariance of each data column with the others. Since the multivariate Gaussian distribution is completely specified by its mean and covariance, imputation by this method is incorrect only if the data are not drawn from a multivariate Gaussian or are not MCAR. Because both the NHIS and PLCO datasets distinguish between data that are MCAR and data not missing at random, the effect of non-Gaussian missingness was minimized.</p>
<p>The multivariate-Gaussian EM algorithm is just one of many EM algorithms, as other data distributions (e.g., a multivariate multinomial) may be assumed. Because our data contain a mixture of continuous and binary data-fields, and because the closed-form properties of the multivariate Gaussian are well-known (Bishop, <xref ref-type="bibr" rid="B6">2006</xref>), we used Gaussian expectation-maximization for convenience. Categorical survey fields are multinomial, and the sum of a sufficiently large number of such multinomial random variables is approximately Gaussian by the central limit theorem (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>). Ordinal survey fields have distributions that in general have non-zero skewness and kurtosis, and thus are not exactly Gaussian. To avoid calculating the covariance of a multivariate Gaussian distribution with a non-Gaussian distribution, we used a multivariate Gaussian for all fields. &#x0201C;Multivariate-Gaussian EM imputation&#x0201D; is referred to as simply &#x0201C;EM imputation&#x0201D; throughout this paper.</p>
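<p>A minimal sketch of multivariate-Gaussian EM imputation, assuming the standard regression form of the E-step (this is an illustration under that assumption, not the study&#x00027;s implementation):</p>

```python
import numpy as np

def em_impute(X, iters=20):
    """Multivariate-Gaussian EM imputation (sketch).

    E-step: replace each row's missing entries with their conditional
    mean under the current Gaussian fit, given that row's observed
    entries. M-step: re-estimate the mean vector and covariance from
    the filled data. This is how the method propagates (and tends to
    over-represent) the covariance between features.
    """
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Initialise gaps with column means of the observed values.
    col_mean = np.nanmean(X, axis=0)
    X[missing] = np.take(col_mean, np.where(missing)[1])
    for _ in range(iters):
        mu = X.mean(axis=0)
        # Small ridge keeps the observed-block covariance invertible.
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            # Conditional mean of missing given observed entries.
            coef = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
            X[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
    return X
```

On toy data where one column is twice the other, a gap in the second column is pulled toward the value implied by the first, rather than toward the column mean.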
<p>About 1.2% of all data entries across 795,215 respondents were missing completely at random. However, about 16% of those 795,215 respondents had one or more missing entries. Listwise deletion discards any respondent with even one missing entry, so about 16% of respondents would be lost by this method.</p>
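<p>The gap between 1.2% of entries and 16% of respondents follows from the number of fields per respondent. Assuming (as an illustration; the paper does not state this model) that each of k fields is independently missing with probability p, the chance a respondent has at least one gap is 1 - (1 - p)<sup>k</sup>:</p>

```python
# Chance that a respondent with k fields has at least one MCAR entry,
# assuming each field is independently missing with probability p.
p = 0.012  # the reported 1.2% cell-level missingness
for k in (5, 10, 15, 30):
    print(k, round(1 - (1 - p) ** k, 3))
```

With roughly 15 fields per respondent this reproduces the reported ~16%; the independence assumption is ours, introduced only to make the arithmetic explicit.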
</sec>
<sec>
<title>Model Evaluation</title>
<p>A popular metric of performance in discriminating CRC incidence from non-CRC incidence is concordance, sometimes known as the area under the curve (AUC) of the receiver operating characteristic (ROC) plot (Hanley and McNeil, <xref ref-type="bibr" rid="B14">1982</xref>; Hosmer and Lemeshow, <xref ref-type="bibr" rid="B16">2000</xref>; Fawcett, <xref ref-type="bibr" rid="B11">2005</xref>). We reported concordances from training on NHIS/PLCO and testing upon PLCO/NHIS (&#x0201C;cross-testing&#x0201D;), which gives a TRIPOD level (Collins et al., <xref ref-type="bibr" rid="B8">2015</xref>) of 3. Total uncertainty in concordance across cross-testing (Picard and Cook, <xref ref-type="bibr" rid="B22">1984</xref>) is calculated using Equation (3).</p>
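<p>Although Equations (1)&#x02013;(3) are not reproduced here, the combination of a within-population AUC uncertainty (Hanley and McNeil, <xref ref-type="bibr" rid="B14">1982</xref>) with the between-fold variance from cross-testing can be sketched via the law of total variance (a Python sketch under our reading of the text; the exact forms of the paper&#x00027;s equations may differ):</p>

```python
import statistics

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC estimate (Hanley and McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc) + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return var**0.5

def total_uncertainty(aucs, ses):
    """Law of total variance across cross-testing folds:
    mean within-fold variance plus the variance of the fold means."""
    within = statistics.mean(se**2 for se in ses)
    between = statistics.pvariance(aucs)
    return (within + between)**0.5

# Hypothetical two-fold cross-test at illustrative sample sizes.
se = hanley_mcneil_se(0.70, 100, 1000)
total = total_uncertainty([0.68, 0.72], [se, se])
```

By construction the total uncertainty is never smaller than the within-fold component alone, which is why a high mean concordance with large fold-to-fold variance can still rank poorly.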
<p>For individuals ages 18&#x02013;49 the PLCO dataset<sup>2</sup> has a sharply different prevalence of CRC (379 positives, 12 negatives) compared to the NHIS dataset<sup>1</sup> (114 positives and 76,676 negatives when family history data were used; 562 positives and 398,222 negatives when they were not). Thus, for this age range, models were cross-tested between the two folds formed by the following non-random split: (1) the combination of all PLCO data with NHIS years 1997&#x02013;2006 and (2) the remaining NHIS years 2007&#x02013;2017. This reduces the testing level for individuals ages 18&#x02013;49 from TRIPOD 3 to TRIPOD 2b.</p>
</sec>
<sec>
<title>Stratifying CRC Risk</title>
<p>The ANN with EM imputation was used to stratify subjects into low-, medium-, and high CRC-risk groups. The ANN was trained on NHIS data, and this model was used to stratify the PLCO subjects into these risk categories. The PLCO dataset records the time in days at which a participant was diagnosed with CRC, and that time was used to build a forecast in the form of a Kaplan-Meier (KM) survival plot. Performance in risk-stratification was reported both to illustrate immediate clinical application and to provide a performance metric that is less biased than concordance (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>; Hajian-Tilaki, <xref ref-type="bibr" rid="B13">2013</xref>).</p>
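<p>The KM forecast can be sketched as follows (an illustrative Python implementation of the standard estimator, not the study&#x00027;s code; ties are handled sequentially, which is equivalent to the usual grouped form when the tied subjects are all events):</p>

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the CRC-free probability S(t).

    times  - follow-up time for each subject
    events - 1 if the subject was diagnosed at that time, 0 if censored
    Returns (event_times, survival), stepping down at each diagnosis;
    censored subjects only shrink the at-risk set.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, out_t, out_s = 1.0, [], []
    for i in order:
        if events[i]:
            surv *= 1 - 1 / at_risk
            out_t.append(times[i])
            out_s.append(surv)
        at_risk -= 1
    return out_t, out_s

# Four subjects: one censored at t=1, diagnoses at t=2, 3, and 5.
t, s = kaplan_meier([5, 1, 3, 2], [1, 0, 1, 1])
# t == [2, 3, 5]; s steps down from 2/3 to 1/3 to 0.
```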
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>Results</title>
<sec>
<title>Concordance Statistics of Seven Machine Learning Algorithms</title>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> is a ROC plot of the seven MLAs used with datasets subject to EM imputation. The standard deviation was formed from the variance from cross-testing between the NHIS and PLCO datasets and the variance from screened/unscreened sub-populations (Hanley and McNeil, <xref ref-type="bibr" rid="B14">1982</xref>) using Equation (3). Taking the mean concordance minus the total uncertainty (Equation 3) as the metric of performance, the top performer was the ANN, with the SVM and NB as equally-performing runners-up. LR (Hosmer and Lemeshow, <xref ref-type="bibr" rid="B16">2000</xref>) offered fourth-place performance. Our ANN used the same logistic activation function (Bishop, <xref ref-type="bibr" rid="B6">2006</xref>) as the LR; indeed, our LR was our ANN with no hidden layers, suggesting the importance of inter-factor coupling possibly corresponding to complications. The good performance of the SVM came from not assuming a particular underlying distribution for the data, whereas LDA assumed that the NHIS and PLCO data were drawn from Gaussian distributions. The good performance of the NB came from its multiplicative incorporation of inter-factor coupling. The ANN&#x00027;s good performance was also roughly insensitive to which imputation method was used. The SVM and LDA performed well with one-hot-encoded data owing to their resistance to overfitting and outliers. RFs offered slightly improved performance over the DT, but worse than the ANN.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Comparison of ROC curves of all seven MLAs, with the mean concordance and its uncertainty reported. Expectation maximization was used to impute missing data.</p></caption>
<graphic xlink:href="fdata-03-00006-g0001.tif"/>
</fig>
<p>The concordance statistics for cross-testing for all combinations of MLAs and imputation methods are summarized in <xref ref-type="table" rid="T1">Table 1</xref>, showing relevant divisions of the datasets by age, as well as the effect of including family history data in the model vs. leaving it out. The ANN offered performance (mean concordance minus the uncertainty) that was not only better than the other MLAs but also insensitive to which imputation method was used. It can also be seen that in the group ages 18&#x02013;49, among whom recent diagnosis of CRC is rarer (due in part to the greater share of individuals below the ages covered by the USPSTF&#x00027;s by-age screening recommendations), concordance was driven up by the increased true negative rate (specificity); the opposite effect was observed in the group ages 50&#x02013;75. Including family history data improved performance, but this was offset by the fact that it could only be included for a smaller subset of the data. Finally, it can be seen that the Gaussian EM algorithm tended to give the best concordance, with one-hot encoding performing similarly well.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Mean concordance (standard deviation), multiplied by 100, for various machine learning algorithms, imputation methods, age groups, and with or without family history of CRC data.</p></caption>
<graphic xlink:href="fdata-03-00006-i0001.tif"/>
<table-wrap-foot>
<p><italic>Models for all ages and ages 50&#x02013;75 were conducted at TRIPOD level 3 and models for ages 18&#x02013;49 at TRIPOD level 2b. The standard deviation reported has a component of population-uncertainty from Equation (1) and a cross-uncertainty from Equation (2). The cell shading scheme was determined by subtracting the standard deviation from the mean concordance statistic, so that darker shading indicates a concordance statistic that not only had a higher mean value, but also a lower uncertainty</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>Testing the ANN at TRIPOD Level 3</title>
<p>In <xref ref-type="fig" rid="F2">Figure 2</xref> the ANN with EM imputation performed consistently well as the incorporation of family history data and age range of subjects varied. The AUCs were greatest for individuals ages 18&#x02013;49. The prevalence of CRC was lower in this group and <xref ref-type="fig" rid="F2">Figure 2</xref> shows that the concordance was driven up by the low-cutoff portion of the ROC curve where the sensitivity of the ANN with EM imputation can be seen to rise sharply. This sharp rise is due to the high probability of any negative call being correct in a dataset with such low prevalence of CRC. As the cutoff increases in <xref ref-type="fig" rid="F2">Figure 2</xref>, the sensitivity exhibits several sharp drop-offs. In the high-cutoff portion of <xref ref-type="fig" rid="F2">Figure 2</xref>, the performance of the ANN with EM imputation becomes insensitive to the age-demographic, or even the incorporation of family history data in the model. This trend is in sharp contrast to the low-cutoff portion of the ROC, where performance in the group of individuals ages 18&#x02013;49 was significantly better than in the high-prevalence group of those ages 50&#x02013;75. These results make the case that the concordance is a good measure of the performance of the ANN with EM imputation relative to other MLAs.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>The ROC curves of the ANNs averaged across cross-testing for different sub-demographics and with/without family history data.</p></caption>
<graphic xlink:href="fdata-03-00006-g0002.tif"/>
</fig>
<p>The improvement in concordance in <xref ref-type="fig" rid="F2">Figure 2</xref> and <xref ref-type="table" rid="T1">Table 1</xref> is due to the ANN making more negative calls when trained on data from individuals ages 18&#x02013;49. The high number of correct negative calls in this low-prevalence age group gives a high specificity, and thus a high concordance.</p>
</sec>
<sec>
<title>Risk Stratification by ANN at TRIPOD Level 3</title>
<p>We stratified survey-respondents by the risk score calculated by the ANN with EM imputation. Such stratification has been completed at TRIPOD level 2b in previous work (Hart et al., <xref ref-type="bibr" rid="B15">2018</xref>; Rofman et al., <xref ref-type="bibr" rid="B26">2018</xref>), and was done here at TRIPOD level 3. <xref ref-type="fig" rid="F3">Figure 3</xref> illustrates the stratification of individuals into three risk score categories. <xref ref-type="table" rid="T2">Table 2</xref> shows how many survey-respondents (CRC and never-cancer) ended up in each category.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Stratification of individuals into low-, medium-, and high CRC-risk groups by the ANN with EM imputation. Risk categories are defined by the requirement that no more than 1% of positive cases be classified as low risk, and no more than 1% of negative cases be classified as high risk.</p></caption>
<graphic xlink:href="fdata-03-00006-g0003.tif"/>
</fig>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparison of our ANN with EM imputation with USPSTF screening guidelines in stratifying PLCO and NHIS respondents into low-, medium-, and high CRC-risk groups.</p></caption>
<graphic xlink:href="fdata-03-00006-i0002.tif"/>
</table-wrap>
<p>In <xref ref-type="fig" rid="F3">Figure 3</xref>, only relative (rather than absolute) values of risk are relevant, and thus the numbering of the horizontal axis is not comparable between plots. The risk axis is not unitized to the interval [0, 1]; doing so would be misleading because the minimum and maximum risk values needed for unitization differ between models. In stratifying NHIS/PLCO using a PLCO/NHIS-trained ANN, the risk-boundaries so constructed were in general not equal, and the interval formed by this disparity is demarcated by vertical black dotted lines. Cumulative functions and complement-cumulative functions of the negative and positive populations are plotted; dotted and solid lines of the same color are complementary cumulative distributions summing to 100%.</p>
</sec>
<sec>
<title>Predicting CRC Incidence in the Never-Cancer Population</title>
<p>The Kaplan-Meier plots of <xref ref-type="fig" rid="F4">Figure 4</xref> show the estimated probability of the never-cancer PLCO population developing CRC as a function of time in years, taking the CRC population as a Bayesian given. A cone of uncertainty is indicated. This cone, which widens at later times, suggests that the never-cancer population flagged as high risk (see <xref ref-type="fig" rid="F3">Figure 3</xref> and <xref ref-type="table" rid="T2">Table 2</xref>) has an appreciable probability of developing CRC at a later time. Accordingly, this group, though regarded as &#x0201C;false positives,&#x0201D; would actually benefit from screening. Because these false positives drive down the sensitivity and positive predictive value, this builds the case that concordance is better suited as a relative rather than an absolute metric of performance.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Plot of the Kaplan-Meier estimator of the CRC-free probability of the PLCO respondents vs. time for the low-, medium-, and high CRC-risk groups stratified by ANN with EM imputation model. The shaded regions are 95% confidence intervals at TRIPOD level 1a.</p></caption>
<graphic xlink:href="fdata-03-00006-g0004.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="F4">Figure 4</xref>, while the risk stratification into three categories is done at TRIPOD level 3, the confidence intervals are at TRIPOD level 1a. This is because they contain only a population-uncertainty calculated from an expression analogous to Equation (1).</p>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>Discussion</title>
<sec>
<title>Machine Learning Algorithms for CRC Prediction</title>
<p>Obtaining a concordance of 0.70 &#x000B1; 0.02 on training an ANN with EM imputation gives a test that is competitive with the tests using routine data itemized in the review by Usher-Smith et al. (<xref ref-type="bibr" rid="B30">2016</xref>), including Betes (Betes et al., <xref ref-type="bibr" rid="B4">2003</xref>) (TRIPOD level 3) and even the self-completed questionnaire used by Colditz (Colditz et al., <xref ref-type="bibr" rid="B7">2000</xref>) (TRIPOD level 3). Our model combines routine data (involving no biomarkers) into a score of CRC risk that is both discriminating and generalizable.</p>
<p>Like other clinical tests, the negative calls made by the ANN with EM imputation have a greater probability of being correct than its positive calls. This trend can be seen in the model&#x00027;s strong performance among individuals ages 18&#x02013;49, and in the fact that its sensitivity (0.63 &#x000B1; 0.06) and rate of misclassifying CRC as low risk are significantly worse than its specificity (0.82 &#x000B1; 0.04) and rate of misclassifying non-CRC as high risk. Among individuals ages 18&#x02013;49, concordance was driven up by the increase in specificity, which likely resulted from the greater number of respondents for whom the ANN with EM imputation could make correct negative calls. Likewise, better performance was observed when testing the PLCO-trained model upon NHIS data than when testing the NHIS-trained model upon PLCO data: NHIS has &#x0007E;10<sup>6</sup> respondents of which &#x0007E;10<sup>3</sup> have CRC, whereas PLCO has &#x0007E;10<sup>5</sup> respondents of which &#x0007E;10<sup>3</sup> have CRC, so for the PLCO dataset there were (an order of magnitude) fewer specificity-boosting negative calls. In typical clinical practice, a negative call from such a test leads to a recommendation for no further testing, while a positive call leads to further testing by a more accurate (and costly) test (Simundic, <xref ref-type="bibr" rid="B29">2017</xref>).</p>
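<p>The asymmetry between positive and negative calls follows from Bayes&#x00027; rule at the reported operating point (sensitivity 0.63, specificity 0.82); the prevalence values below are illustrative, not taken from the datasets:</p>

```python
def ppv_npv(sens, spec, prevalence):
    """Probability that a positive/negative call is correct (Bayes' rule)."""
    tp = sens * prevalence
    fp = (1 - spec) * (1 - prevalence)
    tn = spec * (1 - prevalence)
    fn = (1 - sens) * prevalence
    return tp / (tp + fp), tn / (tn + fn)

# Reported operating point, at two illustrative prevalences.
for prev in (0.001, 0.01):
    ppv, npv = ppv_npv(0.63, 0.82, prev)
    print(f"prevalence={prev}: PPV={ppv:.3f}, NPV={npv:.4f}")
```

At low prevalence nearly every negative call is correct while most positive calls are not, which is why the specificity-boosting negative calls dominate the concordance.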
<p>This paper reported uncertainty at TRIPOD level 3 wherever possible. Reporting uncertainty is crucial for determining optimal performance because concordance can be misleadingly high even when averaged across cross-testing. For instance, imputation of missing data with the average of a data-field gave a concordance of almost 0.80, but with an accompanying uncertainty of 0.20. The performance for which the mean minus the uncertainty was greatest was 0.70 &#x000B1; 0.02 (see <xref ref-type="table" rid="T1">Table 1</xref>), obtained when the ANN with EM imputation was used.</p>
</sec>
<sec>
<title>Improving Performance With Additional Relevant Factors</title>
<p>The input predictors to the MLAs were selected based on availability in both the NHIS and PLCO datasets, what Rubin calls the file-matching problem (Little and Rubin, <xref ref-type="bibr" rid="B18">2014</xref>). Because of this selection criterion, some of the stronger factor-correlations with CRC (e.g., NSAIDs, such as aspirin and ibuprofen; Rodriguez and Huerta-Alvarez, <xref ref-type="bibr" rid="B25">2001</xref>; Betes et al., <xref ref-type="bibr" rid="B4">2003</xref>) had to be omitted from the model, as data on NSAID use were only available in the NHIS dataset for years 2000, 2005, 2010, and 2015. The risk-stratification demonstrated in <xref ref-type="table" rid="T2">Table 2</xref> would likely be even more effective if these stronger predictors were used. Indeed, a data-driven approach to detecting CRC risk in the general public would prioritize recording these strong predictors more regularly.</p>
</sec>
<sec>
<title>ANN With EM Imputation</title>
<p>The concordance of 0.70 &#x000B1; 0.02 of our ANN with EM imputation is competitive with previous externally-tested (TRIPOD 3) risk models using routine data (Betes et al., <xref ref-type="bibr" rid="B4">2003</xref>; Usher-Smith et al., <xref ref-type="bibr" rid="B30">2016</xref>) as input. To our knowledge, calculating an uncertainty by the law of total variance (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>) so as to incorporate both the population-uncertainty (Hanley and McNeil, <xref ref-type="bibr" rid="B14">1982</xref>; Fawcett, <xref ref-type="bibr" rid="B11">2005</xref>) of Equation (1) and the cross-uncertainty due to variance in performance across cross-testing (Picard and Cook, <xref ref-type="bibr" rid="B22">1984</xref>) of Equation (2) has never been done before. Incorporating this additional component of cross-uncertainty demonstrates the advantage of using the ANN. The advantage of the ANN over LR is not in having a high mean concordance, but rather in having a much lower uncertainty, which demonstrates the generalizability of the model. Because of better generalizability, the ANN with EM imputation is considered the best among all the model/imputation configurations.</p>
</sec>
<sec>
<title>Clinical Deployment</title>
<p>In this work, the developed ANN with EM imputation is used to predict colorectal cancer risk for individuals based on their personal health data. The output of the model, a colorectal cancer risk score, can help clinicians make screening decisions. Generally speaking, true positives require further screening and true negatives require no screening. False positives still stand to benefit from our model, which tracks an individual&#x00027;s cancer risk as their personal health habits change over time. Improvements in those habits, such as quitting smoking or treating diabetes, will be reflected as a drop in the individual&#x00027;s risk score, providing positive feedback. Furthermore, high-risk never-cancer false positives warrant heightened screening attention, as demonstrated by the sharply decreasing Kaplan-Meier probability of such individuals remaining free of CRC over time. In general, the temporal trend of cancer risk will determine the next step for the individual.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>Conclusion</title>
<p>In this comparative study, we evaluated seven machine learning algorithms in combination with six methods of handling missing data, all trained and cross-tested with the NHIS and PLCO datasets. Among these combinations, the artificial neural network with Gaussian expectation-maximization imputation was found to be optimal, with a concordance of 0.70 &#x000B1; 0.02, a sensitivity of 0.63 &#x000B1; 0.06, and a specificity of 0.82 &#x000B1; 0.04. In CRC risk stratification this optimal model had a never-cancer misclassification rate of only 2% and a CRC misclassification rate of only 6%. As a TRIPOD level 3 study with low uncertainty, our model can serve as a non-invasive and cost-effective tool to screen for CRC risk in large populations using only personal health data.</p>
</sec>
<sec sec-type="data-availability-statement" id="s6">
<title>Data Availability Statement</title>
<p>The code used in this study is not publicly available due to a concern of intellectual property proprietary to Yale University. Requests to access the NHIS datasets should be directed to the Centers for Disease Control and Prevention (CDC) at <ext-link ext-link-type="uri" xlink:href="https://www.cdc.gov/nchs/nhis/">https://www.cdc.gov/nchs/nhis/</ext-link>. Requests to access the PLCO datasets should be directed to the National Cancer Institute (NCI) at <ext-link ext-link-type="uri" xlink:href="https://biometry.nci.nih.gov/cdas/plco/">https://biometry.nci.nih.gov/cdas/plco/</ext-link>.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>BN analyzed the data, produced the results, and wrote the technical details. GH, WM, YL, and GS produced the technical details, and reviewed the manuscript. JD generated the research ideas and reviewed the manuscript.</p>
<sec>
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ack><p>The authors gratefully acknowledge the facilities provided by the Yale Department of Therapeutic Radiology at which this work was carried out.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Andoni</surname> <given-names>A.</given-names></name> <name><surname>Panigrahy</surname> <given-names>R.</given-names></name> <name><surname>Valiant</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name></person-group> (<year>2014</year>). <article-title>Learning polynomials with neural networks</article-title>. <source>JMLR</source> <fpage>32</fpage>.</citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Benard</surname> <given-names>F.</given-names></name> <name><surname>Barkun</surname> <given-names>A. N.</given-names></name> <name><surname>Martel</surname> <given-names>M.</given-names></name> <name><surname>von Renteln</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>Systematic review of colorectal cancer screening guidelines for average-risk adults: Summarizing the current global recommendations</article-title>. <source>World J. Gastroenterol</source>. <volume>24</volume>, <fpage>124</fpage>&#x02013;<lpage>138</lpage>. <pub-id pub-id-type="doi">10.3748/wjg.v24.i1.124</pub-id><pub-id pub-id-type="pmid">29358889</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bertsekas</surname> <given-names>D. P.</given-names></name> <name><surname>Tsitsiklis</surname> <given-names>J. N.</given-names></name></person-group> (<year>2008</year>). <source>Introduction to Probability.</source> <publisher-loc>Belmont, MA</publisher-loc>: <publisher-name>Athena Scientific</publisher-name>.</citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Betes</surname> <given-names>M.</given-names></name> <name><surname>Munoz-Navas</surname> <given-names>M. A.</given-names></name> <name><surname>Duque</surname> <given-names>J.</given-names></name> <name><surname>Angos</surname> <given-names>R.</given-names></name> <name><surname>Macias</surname> <given-names>E.</given-names></name> <name><surname>Subtil</surname> <given-names>J. C.</given-names></name> <etal/></person-group>. (<year>2003</year>). <article-title>Use of colonoscopy as a primary screening test for colorectal cancer in average risk people</article-title>. <source>Am. J. Gastroenterol.</source> <volume>98</volume>, <fpage>2648</fpage>&#x02013;<lpage>2654</lpage>. <pub-id pub-id-type="doi">10.1111/j.1572-0241.2003.08771.x</pub-id><pub-id pub-id-type="pmid">14687811</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bibbins-Domingo</surname> <given-names>K.</given-names></name> <name><surname>Grossman</surname> <given-names>D. C.</given-names></name> <name><surname>Curry</surname> <given-names>S. J.</given-names></name> <name><surname>Davidson</surname> <given-names>K. W.</given-names></name> <name><surname>Epling</surname> <given-names>J. W.</given-names> <suffix>Jr.</suffix></name> <name><surname>Garc&#x000ED;a</surname> <given-names>F. A. R.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Screening for colorectal cancer us preventive services task force recommendation statement</article-title>. <source>J. Am. Med. Assoc</source>. <volume>315</volume>, <fpage>2564</fpage>&#x02013;<lpage>2575</lpage>. <pub-id pub-id-type="doi">10.1001/jama.2016.5989</pub-id><pub-id pub-id-type="pmid">27304597</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bishop</surname> <given-names>C. M.</given-names></name></person-group> (<year>2006</year>). <source>Pattern Recognition and Machine Learning.</source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colditz</surname> <given-names>G. A.</given-names></name> <name><surname>Atwood</surname> <given-names>K. A.</given-names></name> <name><surname>Emmons</surname> <given-names>K.</given-names></name> <name><surname>Monson</surname> <given-names>R. R.</given-names></name> <name><surname>Willett</surname> <given-names>W. C.</given-names></name> <name><surname>Trichopoulos</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2000</year>). <article-title>Harvard report on cancer prevention volume 4: Harvard cancer risk index</article-title>. <source>Cancer Causes Control</source> <volume>11</volume>, <fpage>477</fpage>&#x02013;<lpage>488</lpage>. <pub-id pub-id-type="doi">10.1023/A:1008984432272</pub-id><pub-id pub-id-type="pmid">10880030</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Collins</surname> <given-names>G. S.</given-names></name> <name><surname>Reitsma</surname> <given-names>J. B.</given-names></name> <name><surname>Altman</surname> <given-names>D. G.</given-names></name> <name><surname>Moons</surname> <given-names>K. G.</given-names></name></person-group> (<year>2015</year>). <article-title>Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement</article-title>. <source>Ann. Intern. Med</source>. <volume>162</volume>, <fpage>55</fpage>&#x02013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.7326/M14-0697</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cortes</surname> <given-names>C.</given-names></name> <name><surname>Vapnik</surname> <given-names>V.</given-names></name></person-group> (<year>1995</year>). <article-title>Support-vector networks</article-title>. <source>Mach. Learn.</source> <volume>20</volume>, <fpage>273</fpage>&#x02013;<lpage>297</lpage>. <pub-id pub-id-type="doi">10.1007/BF00994018</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Falco</surname> <given-names>M.</given-names></name> <name><surname>Wyant</surname> <given-names>T.</given-names></name> <name><surname>Simmons</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <source>What is Colorectal Cancer?</source> American Cancer Society.</citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fawcett</surname> <given-names>T.</given-names></name></person-group> (<year>2005</year>). <article-title>An introduction to ROC analysis</article-title>. <source>Pattern Recogn. Lett.</source> <volume>27</volume>, <fpage>861</fpage>&#x02013;<lpage>874</lpage>.</citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fisher</surname> <given-names>R. A.</given-names></name></person-group> (<year>1936</year>). <article-title>The use of multiple measurements in taxonomic problems</article-title>. <source>Ann. Eugen</source>. <volume>7</volume>, <fpage>179</fpage>&#x02013;<lpage>188</lpage>.</citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hajian-Tilaki</surname> <given-names>K.</given-names></name></person-group> (<year>2013</year>). <article-title>Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation</article-title>. <source>Caspian J. Intern. Med</source>. <volume>4</volume>, <fpage>627</fpage>&#x02013;<lpage>635</lpage>.<pub-id pub-id-type="pmid">24009950</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hanley</surname> <given-names>J. A.</given-names></name> <name><surname>McNeil</surname> <given-names>B. J.</given-names></name></person-group> (<year>1982</year>). <article-title>The meaning and use of the area under a receiver operating characteristic (roc) curve</article-title>. <source>Radiology</source> <volume>143</volume>, <fpage>29</fpage>&#x02013;<lpage>36</lpage>. <pub-id pub-id-type="doi">10.1148/radiology.143.1.7063747</pub-id><pub-id pub-id-type="pmid">7063747</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hart</surname> <given-names>G. R.</given-names></name> <name><surname>Roffman</surname> <given-names>D. A.</given-names></name> <name><surname>Decker</surname> <given-names>R.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>A multi-parameterized artificial neural network for lung cancer risk prediction</article-title>. <source>PLoS ONE</source>. <volume>13</volume>:<fpage>e0205264</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0205264</pub-id><pub-id pub-id-type="pmid">30356283</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hosmer</surname> <given-names>D. W.</given-names></name> <name><surname>Lemeshow</surname> <given-names>S.</given-names></name></person-group> (<year>2000</year>). <source>Applied Logistic Regression</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Wiley</publisher-name>. <pub-id pub-id-type="doi">10.1002/0471722146</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Ba</surname> <given-names>J. L.</given-names></name></person-group> (<year>2015</year>). <article-title>Adam: a method for stochastic optimization</article-title>, in <source>ICLR 2015</source>. arXiv:1412.6980v9.</citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Little</surname> <given-names>R. J. A.</given-names></name> <name><surname>Rubin</surname> <given-names>D. B.</given-names></name></person-group> (<year>2014</year>). <source>Statistical Analysis with Missing Data</source>. <publisher-name>Wiley</publisher-name>.</citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Morgan</surname> <given-names>J. N.</given-names></name> <name><surname>Sonquist</surname> <given-names>J. A.</given-names></name></person-group> (<year>1963</year>). <article-title>Problems in the analysis of survey data, and a proposal</article-title>. <source>J. Am. Stat. Assoc</source>. <volume>58</volume>, <fpage>415</fpage>&#x02013;<lpage>434</lpage>. <pub-id pub-id-type="doi">10.1080/01621459.1963.10500855</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><collab>National Cancer Institute</collab></person-group> (<year>2018</year>). <source>Cancer Stat Facts: Colorectal Cancer</source>. National Cancer Institute.</citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><collab>National Cancer Institute</collab></person-group> (<year>2019</year>). <source>Tests to Detect Colorectal Cancer and Polyps</source>. National Cancer Institute.</citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Picard</surname> <given-names>R. R.</given-names></name> <name><surname>Cook</surname> <given-names>R. D.</given-names></name></person-group> (<year>1984</year>). <article-title>Cross-validation of regression models</article-title>. <source>J. Am. Stat. Assoc.</source> <volume>79</volume>, <fpage>575</fpage>&#x02013;<lpage>583</lpage>. <pub-id pub-id-type="doi">10.1080/01621459.1984.10478083</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Platt</surname> <given-names>J. C.</given-names></name></person-group> (<year>1999</year>). <article-title>Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods</article-title>, in <source>Advances in Large Margin Classifiers</source> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>61</fpage>&#x02013;<lpage>74</lpage>.</citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rish</surname> <given-names>I.</given-names></name></person-group> (<year>2001</year>). <article-title>An empirical study of the naive bayes classifier</article-title>. <source>IJCAI 2001 Work Empir. Methods Artif. Intell</source>. <volume>3</volume>, <fpage>41</fpage>&#x02013;<lpage>46</lpage>.</citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rodriguez</surname> <given-names>L. G.</given-names></name> <name><surname>Huerta-Alvarez</surname> <given-names>C.</given-names></name></person-group> (<year>2001</year>). <article-title>Reduced risk of colorectal cancer among long-term users of aspirin and nonaspirin nonsteroidal antiinflammatory drugs</article-title>. <source>Epidemiology</source> <volume>12</volume>, <fpage>88</fpage>&#x02013;<lpage>93</lpage>. <pub-id pub-id-type="doi">10.1097/00001648-200101000-00015</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roffman</surname> <given-names>D.</given-names></name> <name><surname>Hart</surname> <given-names>G.</given-names></name> <name><surname>Girardi</surname> <given-names>M.</given-names></name> <name><surname>Ko</surname> <given-names>C. J.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Predicting non-melanoma skin cancer via a multi-parameterized artificial neural network</article-title>. <source>Sci. Rep.</source> <volume>8</volume>, <fpage>1</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-018-19907-9</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rumelhart</surname> <given-names>D. E.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name> <name><surname>Williams</surname> <given-names>R. J.</given-names></name></person-group> (<year>1986</year>). <article-title>Learning representations by back-propagating errors</article-title>. <source>Nature</source> <volume>323</volume>, <fpage>533</fpage>&#x02013;<lpage>536</lpage>. <pub-id pub-id-type="doi">10.1038/323533a0</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shannon</surname> <given-names>C. E.</given-names></name></person-group> (<year>1948</year>). <article-title>A mathematical theory of communication</article-title>. <source>Bell Syst. Techn. J</source>. <volume>27</volume>, <fpage>623</fpage>&#x02013;<lpage>656</lpage>. <pub-id pub-id-type="doi">10.1002/j.1538-7305.1948.tb00917.x</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simundic</surname> <given-names>A. M.</given-names></name></person-group> (<year>2017</year>). <article-title>Extent of diagnostic agreement among medical referrals</article-title>. <source>EJIFCC</source> <volume>19</volume>, <fpage>203</fpage>&#x02013;<lpage>211</lpage>. <pub-id pub-id-type="doi">10.1111/jep.12747</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Usher-Smith</surname> <given-names>J. A.</given-names></name> <name><surname>Walter</surname> <given-names>F. M.</given-names></name> <name><surname>Emery</surname> <given-names>J. D.</given-names></name> <name><surname>Win</surname> <given-names>A. K.</given-names></name> <name><surname>Griffin</surname> <given-names>S. J.</given-names></name></person-group> (<year>2016</year>). <article-title>Risk prediction models for colorectal cancer: a systematic review</article-title>. <source>Cancer Prev. Res</source>. <volume>9</volume>, <fpage>13</fpage>&#x02013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1158/1940-6207.CAPR-15-0274</pub-id><pub-id pub-id-type="pmid">26464100</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Mauldin</surname> <given-names>P. D.</given-names></name> <name><surname>Ebeling</surname> <given-names>M.</given-names></name> <name><surname>Hulsey</surname> <given-names>T. C.</given-names></name> <name><surname>Liu</surname> <given-names>B.</given-names></name> <name><surname>Thomas</surname> <given-names>M. B.</given-names></name> <etal/></person-group>. (<year>2012</year>). <article-title>Effect of metabolic syndrome and its components on recurrence and survival in colon cancer patients</article-title>. <source>Cancer</source> <volume>119</volume>, <fpage>1512</fpage>&#x02013;<lpage>1520</lpage>. <pub-id pub-id-type="doi">10.1002/cncr.27923</pub-id><pub-id pub-id-type="pmid">23280333</pub-id></citation></ref>
</ref-list>
<app-group>
<app id="A1">
<title>Appendix</title>
<sec>
<title>The Levels of TRIPOD and the Cross-Testing Uncertainty</title>
<p>Different TRIPOD levels carry cross-testing uncertainty arising from different sources of distributional disparity. One component of uncertainty is always present because the dataset is finite, and its effect on the concordance is well-known (Hanley and McNeil, <xref ref-type="bibr" rid="B14">1982</xref>; Fawcett, <xref ref-type="bibr" rid="B11">2005</xref>). For a population of N respondents, C of whom have cancer, and an MLA achieving a concordance AUC, what we call the &#x0201C;population-uncertainty&#x0201D; &#x003A0;<sup>2</sup> is a function of AUC, C, and N alone:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mo>&#x003A0;</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>C</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>N</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x02212;</mml:mo><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:msup><mml:mi>C</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;</mml:mtext><mml:mo>+</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>C</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:msup><mml:mi>C</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x02212;</mml:mo><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:msup><mml:mi>C</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
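<p>As a concrete illustration, Equation (1) can be evaluated directly. The following Python sketch (the function name <code>hanley_mcneil_variance</code> is ours, not from the original work) computes the population-uncertainty &#x003A0;<sup>2</sup> from AUC, C, and N, using the two conditional probabilities Q<sub>1</sub> = AUC/(2 &#x02212; AUC) and Q<sub>2</sub> = 2AUC<sup>2</sup>/(1 + AUC) that appear in the formula.</p>

```python
def hanley_mcneil_variance(auc, n_cancer, n_total):
    """Population-uncertainty Pi^2 of an AUC estimate (Hanley & McNeil, 1982).

    auc      -- observed concordance (area under the ROC curve)
    n_cancer -- number of respondents with cancer (C)
    n_total  -- total number of respondents (N)
    """
    c, n = n_cancer, n_total
    q1 = auc / (2.0 - auc)              # prob. two cancer cases both outrank one control
    q2 = 2.0 * auc ** 2 / (1.0 + auc)   # prob. one cancer case outranks two controls
    return (auc * (1.0 - auc)
            + (c - 1) * (q1 - auc ** 2)
            + (n - c - 1) * (q2 - auc ** 2)) / (c * (n - c))
```

<p>The standard error of the AUC is the square root of this variance; as the formula shows, it shrinks as the cohort (and in particular the cancer count C) grows.</p>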
<p>For TRIPOD level 1b or higher, an additional uncertainty component, which we call the &#x0201C;cross-uncertainty&#x0201D; (&#x003C4;<sup>2</sup>), arises from cross-validation or cross-testing. Unlike the population-uncertainty, it depends explicitly upon the disparity between the distributions of the two datasets. If the data are split randomly (TRIPOD levels 1b and 2a), the cross-uncertainty is normal or Gaussian (Bishop, <xref ref-type="bibr" rid="B6">2006</xref>; Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B3">2008</xref>). If the split is non-random (TRIPOD levels 2b and 3), the cross-uncertainty reflects the difference between the distributions of the two groups; in the case of cross-testing between NHIS and PLCO, it reflects the difference between the underlying distributions of the two datasets. When the data are split into <italic>n</italic><sub>f</sub> folds, the cross-uncertainty &#x003C4;<sup>2</sup> is estimated as the sample variance of the concordances AUC<sub>i</sub> obtained by testing or validating on the <italic>i</italic>th fold, taken over all <italic>n</italic><sub>f</sub> folds:</p>
<disp-formula id="E3"><label>(2)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover></mml:mstyle><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
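<p>Equation (2) is simply the unbiased sample variance of the per-fold concordances. A minimal sketch (the function name is ours):</p>

```python
def cross_uncertainty(fold_aucs):
    """Cross-uncertainty tau^2: unbiased sample variance of per-fold AUCs (Eq. 2)."""
    n_f = len(fold_aucs)
    mean_auc = sum(fold_aucs) / n_f                      # AUC-bar in Eq. (2)
    return sum((a - mean_auc) ** 2 for a in fold_aucs) / (n_f - 1)
```

<p>For example, identical per-fold AUCs give &#x003C4;<sup>2</sup> = 0, while per-fold AUCs of 0.7 and 0.9 give &#x003C4;<sup>2</sup> = 0.02, signaling that the folds are drawn from noticeably different distributions.</p>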
<p>The true positive rate or sensitivity (TPR) and the true negative rate or specificity (SPC) determine the concordance, i.e., the area under the receiver operating characteristic curve (AUC). Taking TPR and SPC to be random variables that take on different values over the folds of cross-testing, we used the law of total variance (Hosmer and Lemeshow, <xref ref-type="bibr" rid="B16">2000</xref>) to form the total uncertainty &#x003C3;<sup>2</sup>. The population-uncertainty &#x003A0;<sup>2</sup> and the cross-uncertainty &#x003C4;<sup>2</sup> between folds of data were combined as the following sum of variances conditioned upon a specific TPR and SPC:</p>
<disp-formula id="E4"><label>(3)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>E</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext>var&#x000A0;</mml:mtext><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi><mml:mo>|</mml:mo><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mi>P</mml:mi><mml:mi>C</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mtext>var&#x000A0;</mml:mtext><mml:mi>E</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>U</mml:mi><mml:mi>C</mml:mi><mml:mo>|</mml:mo><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mi>P</mml:mi><mml:mi>C</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>&#x003A0;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mover accent="false" 
class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003A0;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Throughout this paper, we reported the square root of the total uncertainty <inline-formula><mml:math id="M9"><mml:msqrt><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msqrt><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula> from Equation (3), formed by summing the mean population-uncertainty <inline-formula><mml:math id="M10"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003A0;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula> from Equation (1) and the cross-uncertainty &#x003C4;<sup>2</sup> from Equation (2). Through the cross-uncertainty &#x003C4;<sup>2</sup>, the disparity between the distributions of the folds used for cross-testing or cross-validation appears explicitly in the total uncertainty &#x003C3;<sup>2</sup>.</p>
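<p>Equation (3) can be sketched end-to-end in Python: the per-fold population-uncertainties of Equation (1) are averaged and added to the cross-uncertainty of Equation (2) before taking the square root. This sketch is self-contained and all names are ours, not the authors' code.</p>

```python
import math
from statistics import mean, variance

def total_auc_uncertainty(fold_aucs, fold_cancer_counts, fold_sizes):
    """sigma = sqrt(mean(Pi_i^2) + tau^2), i.e., Equation (3).

    fold_aucs          -- concordance AUC_i measured on each fold
    fold_cancer_counts -- number of cancer cases C_i in each fold
    fold_sizes         -- total respondents N_i in each fold
    """
    def pi2(auc, c, n):
        # Hanley-McNeil population-uncertainty for one fold, Eq. (1)
        q1 = auc / (2.0 - auc)
        q2 = 2.0 * auc ** 2 / (1.0 + auc)
        return (auc * (1.0 - auc)
                + (c - 1) * (q1 - auc ** 2)
                + (n - c - 1) * (q2 - auc ** 2)) / (c * (n - c))

    tau2 = variance(fold_aucs)  # cross-uncertainty, Eq. (2): sample variance over folds
    mean_pi2 = mean(pi2(a, c, n)
                    for a, c, n in zip(fold_aucs, fold_cancer_counts, fold_sizes))
    return math.sqrt(mean_pi2 + tau2)
```

<p>When the per-fold AUCs agree closely, &#x003C4;<sup>2</sup> &#x02248; 0 and &#x003C3; is dominated by the finite-population term; when they disagree, the distributional disparity between folds inflates &#x003C3;.</p>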
</sec>
</app>
</app-group>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>National Health Interview Survey (NHIS) (1997&#x02013;2017).</p></fn>
<fn id="fn0002"><p><sup>2</sup>Prostate/Lung/Colorectal/Ovarian (PLCO) Cancer Screening Trial (1993&#x02013;2001).</p></fn>
</fn-group>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> Research reported in this publication was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number R01EB022589.</p>
</fn>
</fn-group>
</back>
</article> 