ORIGINAL RESEARCH article

Front. Artif. Intell., 20 January 2023

Sec. Medicine and Public Health

Volume 5 - 2022 | https://doi.org/10.3389/frai.2022.1059093

Statistical biopsy: An emerging screening approach for early detection of cancers

  • 1. Institute for Disease Modeling, Global Health Division, Bill and Melinda Gates Foundation, Seattle, WA, United States

  • 2. Department of Therapeutic Radiology, Yale University, New Haven, CT, United States

  • 3. SMFE, Wright-Patterson Air Force Base, Dayton, OH, United States

  • 4. Research Partners, Sun Nuclear Corporation (Mirion Technologies Inc.), Melbourne, FL, United States

  • 5. Department of Physics, Florida Atlantic University, Boca Raton, FL, United States

Abstract

Despite large investment, cancer continues to be a major source of mortality and morbidity throughout the world. Traditional methods of detection and diagnosis, such as biopsy and imaging, tend to be expensive and carry risks of complications. As data become more abundant and machine learning continues to advance, it is natural to ask how they can help solve some of these problems. In this paper we show that, using a person's personal health data, it is possible to predict their risk for a wide variety of cancers. We dub this process a “statistical biopsy.” Specifically, we train two neural networks, one predicting risk for 16 different cancer types in females and the other predicting risk for 15 different cancer types in males. The networks were trained as binary classifiers identifying individuals who were diagnosed with the different cancer types within 5 years of joining the PLCO trial. However, rather than use the binary output of the classifiers, we show that the continuous output can instead be used as a cancer risk, allowing a holistic look at an individual's cancer risks. We tested our multi-cancer model on the UK Biobank dataset, showing that for most cancers the predictions generalized well and that looking at multiple cancer risks at once from personal health data is a possibility. While the statistical biopsy will not be able to replace traditional biopsies for diagnosing cancers, we hope there can be a shift of paradigm in how statistical models are used in cancer detection, moving to something more powerful and more personalized than general population screening guidelines.

Introduction

Cancer is a global public health burden with an estimated 21.7 million new cases and 13 million cancer deaths annually by 2030 (Ferlay et al., 2019). Despite a huge amount of money and resources spent on cancer screening, diagnosis, and treatment, it is estimated that 609,360 people in the United States will die from cancer in 2022 alone (Siegel et al., 2022). One important factor contributing to the high mortality is the lack of an efficient tool for cancer screening, missing the most effective window of opportunity for detecting cancers at their earliest stages. Another factor is the lack of individualized risk management for tailored cancer prevention. Hence, it is critical to develop safe and cost-effective approaches for cancer screening prior to disease onset with high sensitivity, specificity, and accessibility.

Tissue biopsy has long been used to diagnose cancer and often considered the gold standard, but it is limited by constraints on sampling frequency and incomplete representation of the organ being biopsied (Bravo et al., 2001). In addition, the surgical procedure is invasive, time-intensive, and costly with pain and risk of complications. Liquid biopsy offers a non-invasive alternative to cancer screening, but detection and analysis of circulating tumor DNA in a body fluid specimen present a considerable challenge (Alix-Panabières and Pantel, 2013; Crowley et al., 2013). Another challenge for liquid biopsy is how to identify the tumor site in the body, even after an individual has tested positive (Su, 2019).

Numerous schemas have been developed to improve clinical decision-making in cancer screening, detection, and prevention (Kramer, 2004; Holle, 2017). While cancer screening usually involves a procedure or body fluid test to detect cancer at an early stage, cancer prevention aims to reduce cancer risk and mortality by avoiding carcinogens, modifying lifestyles, and using chemoprevention (Kramer, 2004; Holle, 2017). As of now, routine cancer screening is only recommended for breast, cervical, colorectal, lung, and prostate cancers (see Footnotes 1–3). Cancer prevention strategies are only available for breast cancer, colorectal cancer, human papillomavirus-related cancers (anal, cervical, penile, vaginal, and vulvar cancers), ovarian cancer, and prostate cancer, as recommended by the American Cancer Society (ACS), National Comprehensive Cancer Network (NCCN), and U.S. Preventive Services Task Force (USPSTF) (see Footnotes 1–3). While the benefits of those schemas may include reduced cancer incidence and cancer mortality, their common limitations include the requirement of clinical testing, suboptimal positive/negative predictive values, frequent involvement of invasive procedures, and overdiagnosis and overtreatment (Kramer, 2004; Holle, 2017). Ideally, it would be in the best interest of people to improve estimates of cancer risk prior to any clinical testing so that the cost and potential harms associated with invasive procedures would be limited (Cruz and Wishart, 2006; Ayer et al., 2010; Kourou et al., 2014; Boursi et al., 2017; Rajkomar et al., 2019).

Recently, we have demonstrated that deep neural networks, trained and validated with the National Health Interview Survey (NHIS) and/or the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial datasets, can be used to predict and stratify cancer risks with high discriminatory power based solely on personal health data (Hart et al., 2018, 2019, 2020; Roffman et al., 2018a,b; Muhammad et al., 2019; Nartowt et al., 2019a,b; Stark et al., 2019). Compared to the clinician's judgment, the strong performance of our models presents a novel opportunity to perform a “statistical biopsy” on individuals prior to disease onset (Hart et al., 2020). As shown in Figure 1, statistical biopsy mines personal health data from individuals for early cancer detection, analogous to tissue biopsy evaluating cells from a tissue specimen and liquid biopsy evaluating circulating tumor DNA from a fluid sample. What is different is that statistical biopsy seeks to decipher the invisible correlations and inter-connectivity between multiple medical conditions and health parameters via sophisticated statistical modeling. With statistical biopsy, it is possible to generate a holistic analysis of an individual's risk for a variety of cancers simultaneously. Furthermore, if integrated into a modern electronic medical record (EMR) system, it offers a cost-effective and safe approach to cancer screening in real time, informing preventive interventions and screening decisions.

Figure 1

In order to personalize early cancer detection and prevention, an accurate risk assessment of a variety of cancers for each individual is needed. Hence, we begin the development of a novel cancer risk profiler based on deep learning of personal health data for better risk stratification and more precise screening. We hypothesize that the trove of personal health data, including clinical and demographic data, family history, socio-behavioral, dietary and lifestyle data, can be used to train and validate a deep learning model capable of screening cancer prior to disease onset, with high sensitivity and specificity and with minimal toxicity and maximal accessibility.

Materials and methods

Data sets

In this work we use two large medical datasets, one for training a neural network to predict the appearance of cancer within 5 years and the other for testing the neural network. The first is the Prostate, Lung, Colorectal, and Ovarian (PLCO) trial (Tammemagi et al., 2011) which is used for training. The testing set came from the UK Biobank database (UK Biobank, 2022).

PLCO was a randomized controlled trial investigating the effectiveness of screening methods for prostate, lung, colorectal, and ovarian cancers. PLCO enrolled 154,897 participants 55–75 years of age between November 1993 and July 2001 in the United States. Participants were followed for 13 years, until they developed cancer, or passed away. We removed those who did not complete the baseline health survey, leaving 149,623 participants. PLCO recorded the appearance of 13 general cancers (biliary, bladder, colorectal, glioma, head and neck, hematopoietic, liver, lung, melanoma, pancreas, renal, thyroid, and upper GI cancers), 3 female-specific cancers (breast, endometrial, and ovarian), and 2 male-specific cancers (male breast and prostate). In addition to these cancers, we use 116 general features, 20 female-specific features, and 12 male-specific features. We split the data into a set for females to predict 16 cancer types and a set for males to predict 15 cancer types. See Table 1 for a list of features and their statistics and Table 2 for the number of cancer cases.

Table 1

Feature | Female Train | Female Test | Male Train | Male Test
Binary: % Yes (% missing)
Ever had arthritis | 45.82 (0.00) | 2.62 (0.00) | 29.97 (0.67) | 3.80 (0.00)
Ever had chronic bronchitis | 5.94 (0.00) | 1.10 (76.49) | 3.64 (0.68) | 1.17 (75.35)
Ever had colon co-morbidity | 1.70 (0.00) | 1.32 (0.00) | 1.17 (1.02) | 1.14 (0.00)
Ever had diabetes | 6.42 (0.00) | 1.03 (0.59) | 9.07 (0.63) | 3.83 (0.43)
Ever had diverticulitis or diverticulosis | 8.32 (0.00) | 7.51 (0.00) | 5.38 (0.78) | 7.01 (0.00)
Ever had emphysema | 2.05 (0.00) | 0.40 (76.49) | 3.05 (0.65) | 0.22 (75.35)
Ever had gall bladder stones or inflammation | 15.90 (0.00) | 2.94 (0.00) | 6.99 (0.73) | 5.54 (0.00)
Ever had coronary heart disease or a heart attack | 4.84 (0.00) | 11.62 (0.00) | 13.46 (0.64) | 4.10 (0.00)
Ever had high blood pressure | 33.97 (0.00) | 26.16 (0.00) | 34.38 (0.59) | 19.39 (0.00)
Ever had liver co-morbidity | 3.37 (0.00) | 0.65 (0.00) | 4.09 (0.77) | 0.37 (0.00)
Ever had osteoporosis | 9.64 (0.00) | 0.62 (0.00) | 0.82 (0.75) | 1.93 (0.00)
Ever had colorectal polyps | 5.54 (0.00) | 6.08 (0.00) | 8.12 (0.75) | 4.04 (0.00)
Ever had a stroke | 2.14 (0.00) | 0.72 (0.00) | 2.75 (0.63) | 0.37 (0.00)
Ever smoked regularly | 44.34 (0.00) | 65.30 (0.58) | 63.52 (0.03) | 55.20 (0.52)
Current smoker | 9.71 (0.00) | 12.56 (0.60) | 11.71 (0.03) | 8.96 (0.53)
Family history of biliary cancer | 0.34 (0.00) | – (100.00) | 0.20 (4.50) | – (100.00)
Family history of bladder cancer | 2.18 (0.00) | – (100.00) | 1.51 (4.48) | – (100.00)
Family history of breast cancer | 14.56 (0.00) | 12.55 (23.42) | – (100.00) | 12.97 (16.78)
Family history of colorectal cancer | 11.33 (0.00) | 14.14 (23.10) | 9.29 (4.31) | 12.57 (16.87)
Family history of endometrial cancer | 2.89 (0.00) | – (100.00) | – (100.00) | – (100.00)
Family history of glioma cancer | 2.01 (0.00) | – (100.00) | 1.74 (4.46) | – (100.00)
Family history of head and neck cancer | 1.42 (0.00) | – (100.00) | 1.09 (4.48) | – (100.00)
Family history of hematopoietic cancer | 6.67 (0.00) | – (100.00) | 5.35 (4.40) | – (100.00)
Family history of liver cancer | 2.04 (0.00) | – (100.00) | 2.19 (4.44) | – (100.00)
Family history of lung cancer | 11.71 (0.00) | 15.14 (22.51) | 9.85 (4.28) | 14.69 (16.34)
Family history of male breast cancer | – (100.00) | – (100.00) | 21.01 (2.47) | – (100.00)
Family history of melanoma cancer | 1.40 (0.00) | – (100.00) | 0.80 (4.49) | – (100.00)
Family history of ovarian cancer | 3.93 (0.00) | – (100.00) | – (100.00) | – (100.00)
Family history of pancreas cancer | 3.06 (0.00) | – (100.00) | 2.18 (4.47) | – (100.00)
Family history of prostate cancer | – (100.00) | – (100.00) | 7.40 (2.53) | 9.65 (17.15)
Family history of renal cancer | 1.79 (0.00) | – (100.00) | 1.25 (4.48) | – (100.00)
Family history of thyroid cancer | 0.70 (0.00) | – (100.00) | 0.35 (4.50) | – (100.00)
Family history of upper GI cancer | 4.51 (0.00) | – (100.00) | 4.63 (4.41) | – (100.00)
Ever had enlarged prostate |  |  | 21.80 (0.18) | 0.00 (0.00)
Ever had inflamed prostate |  |  | 8.45 (16.54) | 0.00 (0.00)
Ever had a prostate biopsy |  |  | 4.98 (2.90) | 0.00 (0.00)
Ever had a prostatectomy |  |  | 0.31 (3.21) | – (100.0)
Ever had a prostate resection |  |  | 2.98 (3.16) | 0.00 (0.00)
Ever had a vasectomy |  |  | 27.28 (0.35) | 0.00 (0.00)
Had ovaries removed | 16.57 (0.00) |  |  |
Had tubes tied | 21.49 (0.00) | 0.00 (0.00) |  |
Ever take birth control pills | 54.22 (0.00) | – (100.00) |  |
Currently using female hormones | 49.33 (0.00) | 0.00 (0.00) |  |
Ever take female hormones | 66.37 (0.00) | – (100.00) |  |
Ever been pregnant | 92.49 (0.00) | – (100.00) |  |
Ever dealt with infertility | 14.51 (0.00) | 0.00 (0.00) |  |
Ever had benign or fibrocystic breast disease | 28.45 (0.00) | 0.01 (0.00) |  |
Ever had benign ovarian tumor/cyst | 12.80 (0.00) | 0.00 (0.00) |  |
Ever had endometriosis | 8.39 (0.00) | 0.00 (0.00) |  |
Ever had Uterine fibroid tumors | 22.48 (0.00) | 0.00 (0.00) |  |
Categorical: % in category
Race
White | 88.55 | 93.97 | 88.37 | 94.18
Black | 5.68 | 1.65 | 4.56 | 1.96
Hispanic | 1.60 | 0.00 | 2.17 | 0.00
Asian | 3.37 | 2.72 | 4.07 | 2.22
Pacific Islander | 0.49 | 0.00 | 0.62 | 0.00
American Indian | 0.27 | 0.00 | 0.25 | 0.00
Missing | 0.04 | 1.66 | 0.06 | 1.64
Education level
< 8 years | 0.72 | 0.00 | 1.25 | 0.00
8–11 years | 5.82 | 0.00 | 7.00 | 0.00
12 years | 27.47 | 23.85 | 18.25 | 28.55
Non-college training | 12.85 | 4.48 | 12.25 | 5.76
Some college | 23.15 | 19.28 | 20.41 | 16.24
College graduate | 15.02 | 33.56 | 18.83 | 31.02
Postgraduate | 14.71 | 0.0 | 21.73 | 0.00
Missing | 0.26 | 18.8 | 0.29 | 18.44
Marriage status
Married or cohabitating | 68.71 | 76.73 | 82.51 | 69.47
Widowed | 13.85 | 0.00 | 3.60 | 0.00
Divorced | 12.91 | 0.00 | 9.05 | 0.00
Separated | 0.92 | 0.00 | 1.11 | 0.00
Never married | 3.39 | 0.00 | 3.43 | 0.00
Missing | 0.23 | 23.27 | 0.29 | 30.53
Occupation
Homemaker | 22.23 | 0.54 | 0.08 | 4.62
Working | 35.18 | 60.23 | 44.00 | 54.59
Unemployed | 0.96 | 2.35 | 1.16 | 1.06
Retired | 36.58 | 31.22 | 49.59 | 35.00
Extended sick leave | 0.20 | 0.00 | 0.17 | 0.00
Disabled | 2.08 | 4.09 | 2.42 | 2.73
Other | 2.24 | 0.89 | 2.09 | 0.94
Missing | 0.52 | 1.08 | 0.49 | 1.06
Continuous: Mean (SD); % missing
Age at enrollment | 62.5 (5.4); 0.0 | 56.7 (8.2); 0.0 | 62.7 (5.3); 0.0 | 56.3 (8.0); 0.0
BMI at enrollment | 27.1 (5.5); 0.0 | 27.8 (4.2); 0.0 | 27.5 (4.2); 1.6 | 27.1 (5.2); 0.5
Weight at age 20 | 124 (18.1); 0.0 | – (–); 100.0 | 160 (24.3); 1.3 | – (–); 100.0
Years since quitting smoking | 25.0 (13.3); 0.0 | 24.6 (14.2); 1.0 | 16.9 (13.5); 1.0 | 25.3 (17.8); 46.0
Pack years smoked | 13.3 (22.4); 0.0 | 26.1 (20.9); 1.0 | 25.2 (31.5); 2.3 | 20.2 (15.5); 51.4
Monthly aspirin use | 9.8 (16.5); 0.0 | 0.0 (0.0); 0.0 | 12.2 (16.7); 0.3 | 0.0 (0.0); 0.0
Monthly ibuprofen use | 7.5 (17.4); 0.0 | 0.0 (0.0); 0.0 | 4.9 (14.3); 0.5 | 0.0 (0.0); 0.0
Youngest relative with biliary cancer | 68.1 (5.4); 0.0 | – (–); 100.0 | 68.4 (12.2); 0.0 | – (–); 100.0
Youngest relative with bladder cancer | 67.7 (6.3); 0.0 | – (–); 100.0 | 67.9 (11.9); 1.9 | – (–); 100.0
Youngest relative with breast cancer | 58.4 (8.0); 0.0 | – (–); 100.0 | – (–); 0.0 | – (–); 100.0
Youngest relative with colorectal cancer | 66.2 (6.8); 0.0 | – (–); 100.0 | 65.7 (12.7); 2.1 | – (–); 100.0
Youngest relative with endometrial cancer | 56.0 (7.2); 0.0 | – (–); 100.0 | – (–); 0.0 | – (–); 100.0
Youngest relative with glioma cancer | 54.9 (8.3); 0.0 | – (–); 100.0 | 55.0 (17.8); 1.3 | – (–); 100.0
Youngest relative with head and neck cancer | 60.7 (5.6); 0.0 | – (–); 100.0 | 61.4 (13.0); 2.6 | – (–); 100.0
Youngest relative with hematopoietic cancer | 57.0 (10.1); 0.0 | – (–); 100.0 | 56.1 (20.0); 1.9 | – (–); 100.0
Youngest relative with liver cancer | 64.2 (6.6); 0.0 | – (–); 100.0 | 65.3 (12.5); 1.6 | – (–); 100.0
Youngest relative with lung cancer | 65.0 (6.1); 0.0 | – (–); 100.0 | 63.9 (11.5); 1.7 | – (–); 100.0
Youngest relative with male breast cancer | – (–); 0.0 | – (–); 100.0 | 58.8 (15.6); 2.4 | – (–); 100.0
Youngest relative with melanoma cancer | 55.9 (8.9); 0.0 | – (–); 100.0 | 56.8 (17.3); 1.2 | – (–); 100.0
Youngest relative with ovarian cancer | 57.9 (8.1); 0.0 | – (–); 100.0 | – (–); 0.0 | – (–); 100.0
Youngest relative with pancreas cancer | 68.9 (5.7); 0.0 | – (–); 100.0 | 67.9 (11.9); 1.0 | – (–); 100.0
Youngest relative with prostate cancer | – (–); 0.0 | – (–); 100.0 | 70.3 (9.8); 2.5 | – (–); 100.0
Youngest relative with renal cancer | 63.1 (6.7); 0.0 | – (–); 100.0 | 62.5 (14.9); 2.4 | – (–); 100.0
Youngest relative with thyroid cancer | 43.4 (8.3); 0.0 | – (–); 100.0 | 49.2 (18.6); 2.9 | – (–); 100.0
Youngest relative with upper GI cancer | 64.3 (6.7); 0.0 | – (–); 100.0 | 63.6 (13.8); 2.0 | – (–); 100.0
Number of relatives with biliary cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.1); 0.0 | – (–); 100.0
Number of relatives with bladder cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.2); 0.0 | – (–); 100.0
Number of relatives with breast cancer | 1.1 (0.3); 0.0 | 1.0 (0.0); 0.0 | – (–); 0.0 | 1.0 (0.0); 0.0
Number of relatives with colorectal cancer | 1.1 (0.3); 0.0 | 1.0 (0.0); 0.0 | 1.1 (0.3); 0.0 | 1.0 (0.0); 0.0
Number of relatives with endometrial cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | – (–); 0.0 | – (–); 100.0
Number of relatives with glioma cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.2); 0.0 | – (–); 100.0
Number of relatives with head and neck cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.2); 0.0 | – (–); 100.0
Number of relatives with hematopoietic cancer | 1.1 (0.3); 0.0 | – (–); 100.0 | 1.1 (0.2); 0.0 | – (–); 100.0
Number of relatives with liver cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.2); 0.0 | – (–); 100.0
Number of relatives with lung cancer | 1.1 (0.4); 0.0 | 1.0 (0.0); 0.0 | 1.1 (0.3); 0.0 | 1.0 (0.0); 0.0
Number of relatives with male breast cancer | – (–); 0.0 | – (–); 100.0 | 1.0 (0.1); 0.0 | – (–); 100.0
Number of relatives with melanoma cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.2); 0.0 | – (–); 100.0
Number of relatives with ovarian cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | – (–); 0.0 | – (–); 100.0
Number of relatives with pancreas cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.2); 0.0 | – (–); 100.0
Number of relatives with prostate cancer | – (–); 0.0 | – (–); 100.0 | 1.1 (0.3); 0.0 | 1.0 (0.0); 0.0
Number of relatives with renal cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.1); 0.0 | – (–); 100.0
Number of relatives with thyroid cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.0 (0.2); 0.0 | – (–); 100.0
Number of relatives with upper GI cancer | 1.0 (0.2); 0.0 | – (–); 100.0 | 1.1 (0.3); 0.0 | – (–); 100.0
Age when prostate became enlarged |  |  | 52.6 (9.3); 0.5 | 56.8 (10.2); 0.0
Age when prostate became inflamed |  |  | 45.0 (13.2); 0.6 | – (–); 0.0
How many times you get up at night to urinate |  |  | 1.3 (0.9); 0.2 | – (–); 100.0
Age at which you started urinating at night |  |  | 50.5 (10.5); 58.5 | – (–); 0.0
Age at first prostate surgery |  |  | 54.9 (7.9); 7.4 | 54.5 (6.9); 0.0
Age at vasectomy |  |  | 29.0 (3.5); 0.5 | – (–); 0.0
Age at hysterectomy | 41.5 (4.6); 0.0 | – (–); 100.0 |  |
Age started birth control | 24.8 (6.4); 0.0 | – (–); 100.0 |  |
Number of years taking female hormones | 6.8 (3.0); 0.0 | – (–); 100.0 |  |
Age at birth of first child | 21.0 (4.5); 0.0 | – (–); 100.0 |  |
Number of live births | 3.1 (1.3); 0.0 | – (–); 100.0 |  |
Number of miscarriages | 0.5 (0.7); 0.0 | – (–); 100.0 |  |
Number of still births | 0.1 (0.3); 0.0 | – (–); 100.0 |  |
Number of tubal/ectopic pregnancies | 0.0 (0.2); 0.0 | – (–); 100.0 |  |
Age at first menstrual period | 12.2 (1.6); 0.0 | – (–); 100.0 |  |

Feature distributions and missingness.

Table 2

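Cancer | Female Train | Female Test | Male Train | Male Test
Biliary | 20 | 77 | 10 | 53
Bladder | 89 | 276 | 387 | 781
Breast | 1,912 | 4,525 | 13 | 31
Colorectal | 429 | 1,034 | 681 | 1,352
Endometrial | 352 | 614 |  |
Glioma | 42 | 452 | 60 | 459
Head and Neck | 63 | 264 | 171 | 681
Hematopoietic | 351 | 651 | 482 | 849
Liver | 8 | 1,082 | 60 | 1,050
Lung | 526 | 949 | 806 | 838
Melanoma | 195 | 599 | 289 | 575
Ovarian | 225 | 514 |  |
Pancreas | 89 | 208 | 134 | 202
Prostate |  |  | 3,749 | 3,365
Renal | 98 | 229 | 155 | 407
Thyroid | 47 | 118 | 31 | 46
Upper GI | 30 | 160 | 164 | 338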

Count of cancer cases in the data sets.

UK Biobank is a large-scale biomedical database that aims to accelerate medical and public health research by gathering and maintaining a very large amount of information. It enrolled half a million participants from 2006 to 2010. Follow-ups and additions are made frequently, ranging from repeating the baseline health evaluation to imaging and sequencing. Information is also pulled from death and cancer registries, hospital admissions, and primary care data. From this database we have 229,263 male participants and 273,375 female participants. The UK Biobank data are more detailed than the PLCO data, so we map them onto the PLCO features we used in training.

For both datasets we normalized all the inputs, situating them within the range 0–1. Categorical inputs were handled using one-hot encoding. For the cancer diagnoses we considered diagnoses <5 years after baseline evaluation to be positive and all others to be negative. We handled missing data through k-nearest neighbor imputation with k = 5. Imputation was done separately on PLCO and UK Biobank so that there was no information passed between them, except in the case of a feature completely missing from UK Biobank, in which case we set it to the mean value from the PLCO dataset (Figure 2).

Figure 2

The data was read in and processed in Python with the Pandas library, version 1.5.1. The Pandas data frames were converted to 2D NumPy arrays (version 1.23.4) before being passed to the training software.
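The preprocessing steps described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the function names are hypothetical, the order of operations (scaling before imputation) is an assumption, and the distance used for the nearest-neighbor search is a simplified one that only looks at features observed in both rows.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column to the range 0-1, ignoring NaNs (missing values)."""
    lo = np.nanmin(X, axis=0)
    hi = np.nanmax(X, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (X - lo) / span

def one_hot(codes, n_categories):
    """One-hot encode integer category codes in 0..n_categories-1."""
    return np.eye(n_categories)[codes]

def label_within_5_years(years_to_diagnosis):
    """1 if diagnosed <5 years after baseline, else 0 (NaN = no diagnosis)."""
    y = np.asarray(years_to_diagnosis, dtype=float)
    return (y < 5).astype(int)

def fill_absent_features(X_test, X_train_imputed):
    """Features entirely missing from the test set get the training mean."""
    X_test = np.array(X_test, dtype=float)
    absent = np.isnan(X_test).all(axis=0)
    X_test[:, absent] = np.asarray(X_train_imputed)[:, absent].mean(axis=0)
    return X_test

def knn_impute(X, k=5):
    """Fill each NaN with the mean of that feature over the k nearest rows.

    Simplified sketch: distance is the root mean squared difference over
    the features observed in both rows.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        both = ~missing[i] & ~missing            # shared observed features
        diff = np.where(both, X - X[i], 0.0)
        n_shared = both.sum(axis=1)
        dist = np.full(len(X), np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / n_shared[ok])
        dist[i] = np.inf                         # a row is not its own donor
        order = np.argsort(dist)
        for j in np.where(missing[i])[0]:
            donors = [r for r in order
                      if not missing[r, j] and np.isfinite(dist[r])][:k]
            if donors:
                filled[i, j] = X[donors, j].mean()
    return filled
```

In this sketch, imputation would be run separately on each dataset, with `fill_absent_features` applied to the test set first so that wholly missing columns never reach the neighbor search.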

Neural network

Using the PLCO dataset we train two different neural networks, one to take in the female data and predict the risk for 16 different cancers and another to take in the male data and predict the risk for 15 different cancers. The networks were trained as binary classifiers, with the positive class being those who developed cancer within 5 years of enrolling in the study. Each network has 2 hidden layers with 120 nodes in the first layer and 80 in the second. This network architecture was chosen because it was previously used with good results in a master's thesis that used the PLCO dataset to predict cancer risk (Yan, 2020). For both the female and male models the biases are initialized to 0 and the weights are initialized with a Glorot normal initializer. We used the ReLU activation function and the Adam optimizer with a learning rate of 0.01. To avoid the exploding gradient problem, we use gradient clipping. For the loss function we use binary cross-entropy. We train with batch sizes of 1,024 for 10 epochs. The prediction for each cancer coming from the output layer was put through a logistic function to scale it to the interval 0–1. We think of these values as the probability of developing cancer and later will multiply them by 100 and use them as the percent risk of developing cancer. The training and predictions were done with TensorFlow 2 via Keras, version 2.11.0.
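To make the architecture concrete, here is a small NumPy sketch of the forward pass and loss for a network of this shape. It is an illustration only and not the Keras code used in the study: training (Adam, gradient clipping, batching) is omitted, the function names are hypothetical, the input width of 136 in the usage note is an assumed placeholder (the real width depends on the one-hot encoding), and the initializer draws from an untruncated normal distribution.

```python
import numpy as np

def glorot_normal(rng, fan_in, fan_out):
    """Glorot normal weights: standard deviation sqrt(2 / (fan_in + fan_out)).

    Assumption: plain (untruncated) normal draws, for simplicity.
    """
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def init_network(n_features, n_cancers, hidden=(120, 80), seed=0):
    """Return [(W, b), ...] for two hidden layers plus one output layer."""
    rng = np.random.default_rng(seed)
    sizes = [n_features, *hidden, n_cancers]
    return [(glorot_normal(rng, a, b), np.zeros(b))   # biases start at 0
            for a, b in zip(sizes[:-1], sizes[1:])]

def predict_risk(params, X):
    """Forward pass: ReLU hidden layers, logistic output per cancer."""
    h = np.asarray(X, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)                # ReLU
    W, b = params[-1]
    logits = h @ W + b
    return 1.0 / (1.0 + np.exp(-logits))              # values in (0, 1)

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy over all individuals and cancers."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p)
                          + (1.0 - y_true) * np.log(1.0 - p)))
```

For example, `init_network(136, 16)` would give a female-model-shaped network with one output per cancer type, and `predict_risk` would return one value between 0 and 1 per individual and cancer.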

For each cancer the neural network returns a number in the range of 0–1. Traditionally a threshold value of 0.5 is selected so that values ≥0.5 are considered positive and values below 0.5 are considered negative. However, in the data we are using there are more people without cancer than with cancer. This data imbalance can lead to bias in the predictions, but this can be addressed by avoiding the default threshold value. We empirically set the threshold (for each cancer) to maximize the Youden index. The Youden index is the difference between the true positive rate and the false positive rate. Maximizing this index picks the threshold value where the ROC curve begins to bend. We maximize the Youden index using the training data and then apply the resulting thresholds to the testing data (Duda et al., 2001; Bishop, 2006; Mitchell, 2006).
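The threshold selection described above can be illustrated with a short sketch. It assumes both classes are present in the training labels and scans every distinct score as a candidate threshold; the function name is hypothetical, and for large cohorts a vectorized ROC routine would be more efficient than this loop.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR on the given data.

    A score >= threshold counts as a positive prediction.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    best_j, best_t = -np.inf, None
    for t in np.unique(scores):              # candidate thresholds, ascending
        pred = scores >= t
        tpr = (pred & y_true).sum() / n_pos
        fpr = (pred & ~y_true).sum() / n_neg
        if tpr - fpr > best_j:               # keep the first maximum found
            best_j, best_t = tpr - fpr, t
    return best_t, best_j
```

The threshold would be computed once per cancer on the training scores and then reused unchanged on the test scores.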

Results

Fitting the neural network to predict cancer incidence within 5 years for all 17 cancer types is quite successful. Looking at the ROC curves for the PLCO data (dotted lines in Figure 3), the classifier is near perfect for every cancer. This is further confirmed by looking at various metrics of effectiveness. On this training data no cancer has an AUC below 0.98, informedness below 0.85, or diagnostic odds ratio below 270 (see Table 3).

Figure 3

Table 3

Cancer / Sex / Set | Cutoff | Positive predictive value | Negative predictive value | AUC of ROC | Matthews correlation coefficient | Informedness | Diagnostic odds ratio
Biliary
Female
Train | 0.263 | 0.6129 | 1.0000 | 0.9933 | 0.7630 | 0.9498 | 120,276
Test |  | 0.0004 | 0.9998 | 0.6341 | 0.0061 | 0.1804 | 7
Male
Train | 0.028 | 0.3448 | 1.0000 | 0.9999 | 0.5871 | 0.9997 | Inf
Test |  | Nan | 1.0000 | 0.1339 | Nan | 0.0000 | 0.0000
Bladder
Female
Train | 0.002 | 0.2145 | 1.0000 | 0.9995 | 0.4621 | 0.9957 | Inf
Test |  | 0.1047 | 0.9999 | 0.9658 | 0.3113 | 0.9264 | 1,727
Male
Train | 0.002 | 0.1414 | 0.9997 | 0.9911 | 0.3609 | 0.9229 | 691
Test |  | 0.0139 | 0.9984 | 0.7727 | 0.0750 | 0.4569 | 12
Breast
Female
Train | 0.025 | 0.4344 | 0.9997 | 0.9883 | 0.6443 | 0.9563 | 2,942
Test |  | 0.0391 | 0.9996 | 0.9815 | 0.1498 | 0.5788 | 319
Male
Train | 0.001 | 0.0044 | 1.000 | 0.9950 | 0.0653 | 0.9605 | Inf
Test |  | 0.0000 | 0.9999 | 0.3992 | −0.0018 | −0.0242 | 0
Colorectal
Female
Train | 0.003 | 0.2396 | 0.9997 | 0.9860 | 0.4734 | 0.9362 | 1,211
Test |  | 0.0062 | 1.0000 | 0.9979 | 0.0496 | 0.3948 | Inf
Male
Train | 0.006 | 0.2273 | 0.9989 | 0.9640 | 0.4415 | 0.8616 | 295
Test |  | 0.441 | 0.9999 | 0.9887 | 0.1938 | 0.8546 | 463
Endometrial
Female
Train | 0.003 | 0.2892 | 0.9999 | 0.9954 | 0.5292 | 0.9689 | 4,445
Test |  | 0.0724 | 0.9999 | 0.9911 | 0.2589 | 0.9266 | 775
Glioma
Female
Train | 0.284 | 0.4082 | 1.0000 | 0.9959 | 0.6232 | 0.9516 | 26,203
Test |  | 0.0686 | 0.9984 | 0.9732 | 0.0416 | 0.0259 | 40
Male
Train | 0.009 | 0.3333 | 1.0000 | 0.9972 | 0.5672 | 0.9651 | 18,427
Test |  | 0.5401 | 0.9987 | 0.8773 | 0.4275 | 0.3393 | 893
Head and Neck
Female
Train | 0.001 | 0.0432 | 1.0000 | 0.9858 | 0.2026 | 0.9505 | 1,745
Test |  | 0.0287 | 0.9999 | 0.9660 | 0.1615 | 0.9112 | 536
Male
Train | 0.003 | 0.0839 | 0.9999 | 0.9948 | 0.2816 | 0.9461 | 1,380
Test |  | 0.3362 | 0.9996 | 0.8963 | 0.5395 | 0.8668 | 1,291
Hematopoietic
Female
Train | 0.005 | 0.1424 | 0.9999 | 0.9945 | 0.3683 | 0.9527 | 1,845
Test |  | 0.0024 | 0.9981 | 0.9339 | 0.0029 | 0.0162 | 187
Male
Train | 0.011 | 0.2616 | 0.9996 | 0.9864 | 0.4897 | 0.9183 | 851
Test |  | 0.4614 | 0.9997 | 0.9558 | 0.6461 | 0.9054 | 2,555
Liver
Female
Train | 0.043 | 0.3200 | 1.0000 | 0.9999 | 0.5656 | 0.9998 | Inf
Test |  | 0.5537 | 0.9981 | 0.9208 | 0.5321 | 0.5131 | 642
Male
Train | 0.291 | 0.4836 | 1.0000 | 0.9989 | 0.6893 | 0.9825 | 68,978
Test |  | 0.0000 | 0.9954 | 0.4788 | −0.0001 | −0.0001 | 218
Lung
Female
Train | 0.004 | 0.2603 | 0.9998 | 0.9902 | 0.4972 | 0.9504 | 1,692
Test |  | 0.0625 | 1.0000 | 0.9981 | 0.2434 | 0.9471 | Inf
Male
Train | 0.007 | 0.2978 | 0.9995 | 0.9878 | 0.5255 | 0.9292 | 856
Test |  | 0.0644 | 0.9991 | 0.8314 | 0.2125 | 0.7114 | 78
Melanoma
Female
Train | 0.002 | 0.2345 | 0.9999 | 0.9887 | 0.4695 | 0.9407 | 2,340
Test |  | 0.0023 | 0.9991 | 0.9648 | 0.0061 | 0.0271 | 1,231
Male
Train | 0.013 | 0.3305 | 0.9997 | 0.9818 | 0.5522 | 0.9234 | 1,824
Test |  | 0.7500 | 0.9975 | 0.9543 | 0.1249 | 0.0209 | 4,874
Ovarian
Female
Train | 0.001 | 0.0733 | 0.9997 | 0.9681 | 0.2511 | 0.8641 | 270
Test |  | 0.0022 | 1.0000 | 0.9989 | 0.0174 | 0.1383 | Inf
Pancreas
Female
Train | 0.003 | 0.2733 | 1.0000 | 0.9992 | 0.5190 | 0.9857 | 28,626
Test |  | 0.0658 | 0.9998 | 0.9372 | 0.2262 | 0.7797 | 429
Male
Train | 0.002 | 0.0627 | 0.9999 | 0.9953 | 0.2430 | 0.2430 | 1,262
Test |  | 0.3000 | 0.9991 | 0.5980 | 0.0665 | 0.0148 | 314
Prostate
Male
Train | 0.040 | 0.4559 | 0.9992 | 0.9812 | 0.6478 | 0.9223 | 1,137
Test |  | 0.3226 | 1.0000 | 0.9923 | 0.5589 | 0.9685 | Inf
Renal
Female
Train | 0.011 | 0.3862 | 1.0000 | 0.9957 | 0.6112 | 0.9674 | 15,944
Test |  | 0.2194 | 0.9998 | 0.9170 | 0.4027 | 0.7401 | 1,302
Male
Train | 0.005 | 0.1014 | 0.9999 | 0.9921 | 0.3059 | 0.9243 | 938
Test |  | 0.4878 | 0.9995 | 0.8460 | 0.5978 | 0.7333 | 2,001
Thyroid
Female
Train | 0.314 | 0.4423 | 1.0000 | 0.9977 | 0.6577 | 0.9780 | 60,262
Test |  | 0.5674 | 0.9999 | 0.9877 | 0.6967 | 0.8556 | 20,820
Male
Train | 0.001 | 0.0465 | 1.0000 | 0.9982 | 0.2112 | 0.9594 | 3,621
Test |  | 0.0002 | 0.9999 | 0.8714 | 0.0048 | 0.1508 | 33
Upper GI
Female
Train | 0.120 | 0.3011 | 1.0000 | 0.9845 | 0.5298 | 0.9325 | 16,371
Test |  | 0.6000 | 0.9996 | 0.9078 | 0.3871 | 0.2499 | 3,503
Male
Train | 0.001 | 0.1629 | 0.9998 | 0.9329 | 0.3875 | 0.9222 | 1,314
Test |  | 0.6238 | 0.9988 | 0.6655 | 0.3405 | 0.1862 | 1,345

Metrics of performance.
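For readers unfamiliar with the columns of Table 3, the sketch below computes the threshold-dependent metrics from confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical; how the authors handled degenerate cases is not stated, so the conventions here (NaN when undefined, infinity when the odds ratio denominator is zero) are assumptions.

```python
import math

def screening_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive/negative counts."""
    nan = float("nan")
    tpr = tp / (tp + fn)                      # sensitivity
    fpr = fp / (fp + tn)                      # 1 - specificity
    ppv = tp / (tp + fp) if (tp + fp) else nan
    npv = tn / (tn + fn) if (tn + fn) else nan
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else nan
    if fp * fn:
        dor = (tp * tn) / (fp * fn)
    else:                                     # zero false positives or negatives
        dor = float("inf") if tp * tn else nan
    return {
        "ppv": ppv,
        "npv": npv,
        "mcc": mcc,
        "informedness": tpr - fpr,            # equals the Youden index
        "dor": dor,
    }
```

Under these conventions, an odds ratio of "Inf" arises whenever a classifier has no false negatives or no false positives at the chosen cutoff, and a PPV of "Nan" arises when no individual is predicted positive.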

We tested the model's generalizability on the UK Biobank data. Figure 3 (solid lines) shows that for most cancers the generalization is very good. The cancers that did not generalize well (biliary, male breast, liver, and pancreas) are those with the fewest cases in the training set and tend to have few cases in the test set as well (see Table 2). Also, the difference between the training and testing ROC curves tends to be larger for the model predicting cancer in males than for the one predicting cancer in females, indicating that the model for females generalized better than the model for males. However, the model still performs very well in terms of AUC and diagnostic odds ratio, with all but 3 cancers having a diagnostic odds ratio above 10 and most of them still in the hundreds or thousands.
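The AUC values used to judge generalization can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. The sketch below computes it that way, as a pairwise rank statistic with ties counted as one half. The function name is hypothetical and the all-pairs comparison is only practical for small samples; it is not the routine used in the study.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score of a case > score of a non-case), ties count 1/2."""
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y], s[~y]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

One consequence of this definition is that reversing the scores (using `1 - scores`) turns an AUC of a into 1 - a, so a test AUC well below 0.5 means the ranking is systematically inverted rather than merely uninformative.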

In addition to training the neural network to predict future cancer incidence, we take the raw output of the model (always in the range of 0–1) as a risk indicator. Multiplying this risk by 100, we can treat it as a risk score and look at an individual's risks across all cancers. In Figure 4A we see an example of such an analysis for a male from the UK Biobank dataset. It shows that he has high risk for colorectal and prostate cancer, but essentially no risk for the other cancers. In Figure 4B we ran the same analysis for a female from the UK Biobank dataset and found that she has moderate risk for most cancers.
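Turning the raw outputs into a per-person risk profile is a simple rescaling. The sketch below shows one way to do it; the function name and the example numbers are hypothetical and are not taken from Figure 4.

```python
def risk_profile(cancer_names, model_outputs):
    """Map raw model outputs (0-1) to percent risk scores, highest first."""
    if len(cancer_names) != len(model_outputs):
        raise ValueError("need exactly one model output per cancer type")
    scores = {name: 100.0 * float(p)
              for name, p in zip(cancer_names, model_outputs)}
    # Sort so the cancers with the highest risk score come first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

# Illustrative values only.
profile = risk_profile(["colorectal", "prostate", "lung"], [0.62, 0.71, 0.03])
```

Sorting by score puts the cancers most in need of follow-up at the top of the profile, which matches how such a summary would be read in practice.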

Figure 4

Discussions

In this work we introduce the idea of a statistical biopsy, which mines personal health data from individuals for early cancer detection, analogous to tissue biopsy evaluating cells from a tissue specimen and liquid biopsy evaluating circulating tumor DNA from a fluid sample. Taking advantage of two rich datasets, PLCO and UK Biobank, we were able to train two neural networks (one for men and one for women) to predict cancer risk for 17 different cancers. The models were trained on the cancer-focused PLCO dataset and then tested on the much larger UK Biobank dataset.

Testing with the UK Biobank dataset helps to show the model's generalizability and give us confidence that we are not overfitting the PLCO data, especially given the large number of features that we are using. Also given that the UK Biobank data comes from a different population, does not record all data in the same way, and is missing some of the features we used in our model, high performance on this dataset shows that the model has a high degree of robustness. Furthermore, the UK Biobank dataset is representative of the noisy and messy data that a physician would have access to via electronic medical records as opposed to much cleaner data gathered in a clinical trial, giving confidence that this idea can work in practice. While testing on this second dataset that comes from a different population adds a lot of confidence in the generalization of the model, it is important to note that both the training set and test set come from primarily Caucasian populations living in wealthy countries. Validating on additional datasets coming from other countries is important, especially depending on where this model is used.

Despite all this, there were places where the model did not perform well. On cancers such as biliary, liver, and male breast cancer the model did not generalize at all, and for two of these it would do better if its predictions were reversed. Furthermore, on almost every cancer the male model generalized worse than the female model. This is particularly surprising since there are more missing female-only features in the test set than there are missing male-only features. We need to further test the importance of these female/male-only features and whether there are other features that should be included. In addition to exploring feature importance, we are also working on quantifying the uncertainty in our prediction from these missing features and on a way for the model to not only give a prediction but also indicate which feature to learn in order to most improve the prediction. Also, while the diagnostic odds ratio is high for almost all the cancers, it needs to be compared against tested screening guidelines (whether recommended or not) to see if our statistical biopsy is actually an improvement over traditional methods.

Lastly, while the stochastic nature of the development of cancer means a statistical biopsy could never completely replace a liquid or tissue biopsy, like the screening guidelines (see Footnotes 1–3) it could point those traditional biopsies to individuals who would get the most benefit from them. It is also possible to generate a holistic analysis of an individual's risk for a variety of cancers simultaneously, having the benefit of a liquid biopsy's general screening but retaining the specificity of a tissue biopsy (i.e., identifying which cancers one is at high risk for). Furthermore, if integrated into a modern electronic medical record (EMR) system, it offers a cost-effective and safe approach to cancer screening in real time, informing preventive interventions and screening decisions.

This model will form the backbone of a user-facing mobile health platform that will not only let individuals evaluate their cancer risk in real time, but also see the effect of certain preventative measures or lifestyle changes on those risks.

In the short term we hope that this mobile health platform will not only help individuals in early cancer detection, but also continue improving itself as it builds up a large and diverse longitudinal data set shared by the consented individuals.

Ultimately, we envision a model like this will be integrated into EMR systems, where every time an individual visits their doctor, has a test done, etc. it can update its predictions. It would assist physicians and patients, prompting conversations about cancer prevention and screenings as needed. In addition, as the model matures with more data, it could also provide information on what tests or diagnostics would provide the most information on cancer risk as well as the timing and spacing of such diagnostics.

While there are still many hurdles to overcome, at the scientific, social, and legal levels, there is already a good start toward this vision of statistical biopsies. Keeping active discussions on all three levels in the community is necessary for stakeholders to make steady progress toward the vision of statistical biopsy.

Conclusion

We trained two neural networks to predict the risk of 16 types of cancer in females and 15 types in males and validated them against a second dataset drawn from a different population. We showed that this model could be used to look holistically at an individual's cancer risks. We introduced the term “statistical biopsy” to help change the paradigm around these types of models. With the large amounts of data available and powerful computers and algorithms, it is time we move beyond guidelines for general population screening to more powerful and personalized methods akin to the liquid and tissue biopsies currently used in the medical field.

Statements

Data availability statement

The existing datasets analyzed in this study can be accessed by application via the following links: https://cdas.cancer.gov/datasets/plco/ and https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.

Author contributions

GH and VY developed models and code. GH, VY, and JD developed the core ideas and did most of the writing. BN, GH, VY, DR, GS, and WM did preliminary work predicting individual cancers with different models and datasets. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number R01EB022589, the National Science Foundation under Award Number DMS 1918925, the National Cancer Institute under Award Number 21X130F, and the Department of Energy under Award Number DE-SC0021655 to JD.

Conflict of interest

DR was employed by Sun Nuclear Corporation (Mirion Technologies Inc.), Melbourne, FL, United States. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. American Cancer Society Prevention and Early Detection Guidelines. https://www.cancer.org/health-care-professionals/american-cancer-society-prevention-early-detection-guidelines.html.

2. National Comprehensive Cancer Network Guidelines. https://www.nccn.org/professionals/physician_gls/default.aspx.

3. United States Preventive Services Task Force Published Recommendations. https://www.uspreventiveservicestaskforce.org/BrowseRec/Index.

Keywords

cancer screening, machine learning and AI, neural network, biopsy, data mining, cancer detection, individualized medicine

Citation

Hart GR, Yan V, Nartowt BJ, Roffman DA, Stark G, Muhammad W and Deng J (2023) Statistical biopsy: An emerging screening approach for early detection of cancers. Front. Artif. Intell. 5:1059093. doi: 10.3389/frai.2022.1059093

Received

30 September 2022

Accepted

14 December 2022

Published

20 January 2023

Volume

5 - 2022

Edited by

Shi-Cong Tao, Shanghai Jiao Tong University, China

Reviewed by

Nurulisa Zulkifle, Universiti Sains Malaysia (USM), Malaysia; Mehul Jani, University of North Texas, United States

*Correspondence: Jun Deng

This article was submitted to Medicine and Public Health, a section of the journal Frontiers in Artificial Intelligence
