Construction and Validation of a Lung Cancer Risk Prediction Model for Non-Smokers in China

Background About 15% of lung cancers in men and 53% in women worldwide are not attributable to smoking. The aim of this study was to develop and validate a simple, non-invasive model to assess and stratify lung cancer risk in non-smokers in China. Methods A large population-based study was conducted under the framework of the Cancer Screening Program in Urban China (CanSPUC). Lung cancer screening data from Henan province, China, from October 2013 to October 2019 were used and randomly divided into training and validation sets. Risk factors were identified through multivariable Cox regression analysis, and a risk prediction nomogram was then established. Discrimination [area under the curve (AUC)] and calibration of the nomogram were assessed in the training set and then checked in the validation set. Results A total of 214,764 eligible subjects were included, with a mean age of 55.19 years. Subjects were randomly divided into the training (107,382) and validation (107,382) sets. Older age, being male, a low education level, a family history of lung cancer, a history of tuberculosis, and the absence of a history of hyperlipidemia were independent risk factors for lung cancer. Using these six variables, we plotted a nomogram predicting 1-year, 3-year, and 5-year lung cancer risk. The AUC was 0.753, 0.752, and 0.755 for the 1-, 3-, and 5-year lung cancer risk in the training set, respectively. In the validation set, the model showed moderate predictive discrimination, with AUCs of 0.668, 0.678, and 0.685 for the 1-, 3-, and 5-year lung cancer risk. Conclusions We developed and validated a simple and non-invasive lung cancer risk model for non-smokers. This model can be applied to identify and triage non-smokers at high risk of developing lung cancer.


INTRODUCTION
Lung cancer is the leading cause of cancer-related deaths both worldwide and in China. The latest data from the International Agency for Research on Cancer (IARC) show that in 2020 there were about 1.80 million lung cancer deaths worldwide, of which China accounted for 39.8% (1). The majority of lung cancer cases in China were clinically advanced at diagnosis, with stage III-IV disease accounting for 64.6% of lung cancers in 2012-2014 (2). The age-standardized 5-year survival rate of lung cancer in China increased slightly between 2003 and 2015, but still did not exceed 20.0% (3). The prognosis of lung cancer is closely related to the stage at diagnosis: the 5-year survival rate after surgery is almost 0 for stage IV patients, but >80% for stage I patients (4).
The results of the National Lung Screening Trial (NLST), initiated in 2002, suggested that low-dose computed tomography (LDCT) screening could reduce lung cancer mortality by 20% (5). However, this trial only screened people at high risk for lung cancer based on age and smoking history (55-74 years, a smoking history of at least 30 pack-years, and no more than 15 years since quitting). It is well known that smoking significantly increases the risk of lung cancer. A meta-analysis showed that the risk of lung cancer was 13.1 times higher among smokers than among non-smokers in Europe and the United States [Hazard Ratio (HR)=13.1, 95% CI=9.9-17.3] (6), much higher than the 2.77-fold risk in the Chinese population [Odds Ratio (OR)=2.77, 95% CI=2.26-3.40] (7). This suggests that the current international standards for lung cancer screening, which use smoking as the main indicator of high risk, may not be suitable for the Chinese population, especially for Chinese non-smokers. Therefore, effectively predicting the risk of lung cancer in non-smokers, and using that prediction to guide more cost-effective LDCT screening, is an important step toward efficient early diagnosis and treatment of lung cancer.
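The smoking-based eligibility rule quoted above can be written as a short check, which makes it plain why a never-smoker can never qualify under it. This is a minimal sketch in Python; the function name and the convention that `years_since_quit` is `None` for current smokers are assumptions made for illustration, not part of the NLST protocol text.

```python
def nlst_eligible(age, pack_years, years_since_quit=None):
    """Apply the age and smoking criteria quoted in the text.

    years_since_quit is None for a current smoker (a convention
    assumed for this sketch only).
    """
    if not 55 <= age <= 74:
        return False
    if pack_years < 30:
        return False
    if years_since_quit is not None and years_since_quit > 15:
        return False
    return True


# A never-smoker has 0 pack-years and is excluded regardless of age.
print(nlst_eligible(60, 35))   # heavy current smoker
print(nlst_eligible(60, 0))    # never-smoker
```

Under this rule every non-smoker is screened out, whatever other risk factors are present, which is the gap a dedicated model for non-smokers is meant to address.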
Previous studies have constructed several lung cancer risk prediction models based on different population characteristics, but there are few lung cancer risk prediction models for non-smokers in mainland China. Developing lung cancer risk prediction tools for Chinese non-smokers, based on risk factors consistently identified in previous studies, has therefore become a priority (39). However, this is challenging. Unlike tobacco-driven lung cancer, there are no established risk factors that dominate the development of lung cancer among non-smokers. Numerous risk factors have been suggested, and their effects vary greatly by geographical region (40-43). For example, the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) models do not seem to be useful for Asian non-smokers, because PLCO included only about 2000 never-smokers of Asian ethnicity, among whom 7 cases of lung cancer occurred (44). Indeed, none of the non-smokers in the PLCO (n=65,711) had a six-year risk >0.0151 according to the PLCO M2014 model, which is analogous to PLCO M2012 but includes non-smokers.
The present model was developed based on the Cancer Screening Program in Urban China (CanSPUC) (45). Focusing on established risk factors for lung cancer that are routinely available in general cancer screening settings, we aimed to develop and internally validate a risk prediction model for lung cancer in Chinese non-smokers.

Data Source and Subjects
This study was conducted within the framework of CanSPUC, an ongoing, nationwide, population-based cancer screening program in urban China. The purpose of CanSPUC is to screen for the five most prevalent cancers: lung cancer, female breast cancer, liver cancer, upper gastrointestinal cancer, and colorectal cancer. The methodology of CanSPUC has been previously described (45,46). In brief, after signing written informed consent, all eligible participants (40-74 years old) were interviewed by trained staff to collect data on their exposure to risk factors and to evaluate their cancer risk using a defined clinical cancer risk score system. CanSPUC was launched in Henan province of China in October 2013, covering eight cities (Zhengzhou, Zhumadian, Anyang, Luoyang, Nanyang, Jiaozuo, Puyang, and Xinxiang). In this study, we used data from the first six years (October 2013 to October 2019) in Henan province. Only non-smokers were included in this study; former smokers were excluded. Subjects who had already been diagnosed with lung cancer were also excluded.

Outcome, Variables and Measurements
All new cases of lung cancer in the study were ascertained through local cancer registry databases, with a histologically confirmed diagnosis between October 1, 2013 and March 10, 2020 in mainland China. Newly diagnosed lung cancers were classified by site according to the International Classification of Diseases, 10th revision (ICD-10), and were identified by ICD-10 codes C33-C34. To identify potential risk factors for lung cancer, the following data were collected by self-report: (1) Demographic characteristics: age, gender, race, height, weight, and level of education. Low education was defined as primary school or below, medium education as junior or senior high school, and high education as undergraduate or above. Body mass index (BMI) was calculated from height and weight and classified as "<18.5 kg/m²", "18.5-23.9 kg/m²", "24.0-27.9 kg/m²", and "≥28.0 kg/m²". (2) Dietary habits: a) Dietary intake of the following foods in the past two years: vegetables (<2.5 kg/week, ≥2.5 kg/week), fruit (<1.25 kg/week, ≥1.25 kg/week), and roughage (<0.5 kg/week, ≥0.5 kg/week). Vegetables referred to green leafy plants and fungi, excluding potato, sweet potato, and other starchy foods. Roughage referred to grains other than white flour and rice. Food weight was determined before cooking. b) Taste preferences: heavy-salt diet (yes, no) and heavy-grease diet (yes, no). (3) Living environment, behavior, and habits: a) Cooking oil fume (COF) exposure: exposure was considered "none or a little" if chimneys, fume extractors, or smokeless pots were used during cooking; otherwise, it was considered "a lot". b) Physical activity: activities were categorized as Taijiquan/Qigong/walking, long-distance running/aerobics, ball games (basketball, table tennis, badminton, etc.), fast walking/yangko dance, swimming, and other physical exercises (such as mountain climbing, rope skipping, and shuttlecock kicking). Subjects who exercised on at least three days per week, with a total time ≥90 min per week, were categorized as "heavy physical activity"; all others were categorized as "moderate or no physical activity". (4) Comorbidities: history of chronic respiratory disease, tuberculosis, chronic bronchitis, emphysema, asthma, bronchiectasis, and hyperlipidemia. All self-reported comorbidities required a diagnosis from a professional medical institution. (5) Family history of lung cancer: whether first-degree, second-degree, or third-degree relatives had lung cancer.
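Two of the variable definitions above, the BMI categories and the physical activity classification, are simple threshold rules. The sketch below shows one way they could be coded in Python; the function names are hypothetical, and treating the category boundaries as continuous cut points (so that a BMI of 23.95 falls in the "18.5-23.9" group) is an assumption, since the text reports the categories to one decimal place only.

```python
def bmi_category(height_m, weight_kg):
    """Classify BMI (kg/m^2) into the four groups listed in the text."""
    bmi = weight_kg / height_m ** 2
    if bmi < 18.5:
        return "<18.5"
    if bmi < 24.0:
        return "18.5-23.9"
    if bmi < 28.0:
        return "24.0-27.9"
    return ">=28.0"


def activity_level(days_per_week, minutes_per_week):
    """Heavy activity needs both >= 3 days and >= 90 minutes per week."""
    if days_per_week >= 3 and minutes_per_week >= 90:
        return "heavy physical activity"
    return "moderate or no physical activity"


print(bmi_category(1.70, 65))
print(activity_level(2, 120))
```

Note that both conditions must hold for "heavy physical activity": two long sessions totalling 120 minutes still fall into the "moderate or no physical activity" group.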

Statistical Analysis
All statistical analyses were performed with the statistical software SAS version 9.4 (SAS Institute, Cary, NC) and R version 4.0.3 (The Free Software Foundation, Boston, MA, USA). The "rms" package was used to draw the nomogram, the "survivalROC" package to draw the ROC curves, and the "ggplot2" package to draw the calibration curves. All tests were two-sided, and p-values of 0.05 or less were considered statistically significant.
With the help of randomization codes produced by means of the PROC PLAN of the SAS system, the dataset was randomly divided into training set and validation set with a 1:1 assignment ratio. The training set was used to create the risk prediction model, while the validation set was used to validate the performance of the model.
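The 1:1 random assignment can be illustrated with a seeded shuffle. The study itself used SAS PROC PLAN; the Python sketch below is only an analogue of that step, and the function name and seed value are arbitrary choices made for illustration.

```python
import random


def split_half(ids, seed=2013):
    """Shuffle subject IDs reproducibly and split them 1:1."""
    rng = random.Random(seed)      # fixed seed so the split can be repeated
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# 214,764 subjects, as in the study; the IDs here are just 0..214763.
training, validation = split_half(range(214764))
print(len(training), len(validation))
```

Fixing the seed matters in practice: the training and validation sets must be reproducible if the model is to be refitted or audited later.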
Descriptive statistics, expressed as proportions for categorical variables, were used to compare the characteristics of those with and without the outcome of developing lung cancer. Chi-squared tests for categorical variables were used to determine the univariate association between the baseline factors and lung cancer development. Continuous variables were described by means (standard deviation) or median (interquartile range, IQR).
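For a single binary factor, the chi-squared test described above reduces to a 2x2 table. The following Python sketch computes the Pearson statistic without continuity correction and its p-value for one degree of freedom; the counts in the example are invented for illustration and are not study data.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for the table [[a, b], [c, d]].

    Rows could be exposed / unexposed and columns cases / non-cases.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of a chi-squared variable with 1 df, via the normal tail.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, for illustration only.
stat, p = chi_squared_2x2(30, 70, 20, 80)
print(round(stat, 3), round(p, 3))
```

With only 344 cases among more than 200,000 subjects, some cells can be small for rare exposures, in which case an exact test may be preferable; the text does not say whether this was needed.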
In this study, a combined model based on all independent prognostic factors selected by stepwise multivariable Cox regression (P for entry = 0.15, P for stay = 0.10) was used to construct a nomogram estimating the 1-, 3-, and 5-year lung cancer risk in the training set. Calibration curves were used to evaluate the validity of the nomogram. Kaplan-Meier curves were plotted for low-, medium-, and high-risk groups, defined by the 33% and 66% quantiles of the risk predicted by the model, and differences among the three curves were tested with the log-rank test. The predictive performance of the 1-, 3-, and 5-year lung cancer risk estimates was quantified by receiver operating characteristic (ROC) curves and the area under the curve (AUC) in the training and validation sets. A bootstrap sampling approach was used to evaluate the calibration of the model by comparing the observed and predicted probabilities. Estimates corrected for the deviation of predictions from observations (overfitting correction) were based on predictions for a subset of the interval. The median absolute error was also used to evaluate calibration performance.
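A nomogram built from a Cox model is a graphical form of the standard relation between the linear predictor and the predicted risk, risk(t) = 1 - S0(t)^exp(LP), where S0(t) is the baseline survival at time t and LP is the sum of coefficients times covariate values. The Python sketch below illustrates that relation only. Every number in it is hypothetical: the coefficients and baseline survival values are placeholders, not the fitted values of this study, and the coding of age is an assumption.

```python
import math

# Hypothetical log hazard ratios, NOT the paper's fitted coefficients.
# The negative sign for hyperlipidemia only mirrors the direction
# reported in the text (history of hyperlipidemia, lower risk).
COEFS = {
    "age_per_5y": 0.30,       # assumed coding: 5-year steps above age 40
    "male": 0.40,
    "low_education": 0.25,
    "family_history": 0.50,
    "tuberculosis": 0.60,
    "hyperlipidemia": -0.20,
}

# Hypothetical baseline survival S0(t) at 1, 3 and 5 years.
BASELINE_SURVIVAL = {1: 0.9998, 3: 0.9993, 5: 0.9988}


def linear_predictor(subject):
    """Sum of coefficient * covariate value; missing covariates count as 0."""
    return sum(COEFS[name] * subject.get(name, 0) for name in COEFS)


def predicted_risk(subject, years):
    """Cox-model risk: 1 - S0(t) ** exp(linear predictor)."""
    return 1.0 - BASELINE_SURVIVAL[years] ** math.exp(linear_predictor(subject))


subject = {"age_per_5y": 4, "male": 1, "tuberculosis": 1}
for t in (1, 3, 5):
    print(t, predicted_risk(subject, t))
```

The nomogram assigns points to each covariate in proportion to its coefficient, so reading the total points off the chart is equivalent to computing the linear predictor here.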

Characteristics of the Study Population
A total of 214,764 eligible subjects with a mean age of 55.19 years were included in this study, and 70.70% were female. Subjects were randomly divided into the training set (107,382 subjects) and the validation set (107,382 subjects) (Figure 1). By March 2020, 344 lung cancer cases had occurred among the 214,764 eligible participants during follow-up, yielding an incidence density of 50.53/100,000 person-years. Compared with participants without lung cancer, lung cancer cases were more likely to have a low education level, to have no passive smoking exposure, to engage in heavy physical activity, and to have a family history of lung cancer (all P values <0.05). Additional characteristics are presented in Supplementary Table 1 and Table 1. (Figure 2A).
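The incidence density quoted above is the number of new cases divided by the total person-time at risk. The text does not report the person-years directly, so the Python sketch below back-calculates the value implied by the published figures; the result is only approximate because the published rate is rounded.

```python
def incidence_density(cases, person_years, per=100_000):
    """New cases per `per` person-years of follow-up."""
    return cases / person_years * per


def implied_person_years(cases, rate, per=100_000):
    """Person-years implied by a case count and a published rate."""
    return cases / rate * per


person_years = implied_person_years(344, 50.53)
mean_follow_up = person_years / 214_764   # years per participant
print(round(person_years), round(mean_follow_up, 2))
```

The implied total is roughly 680,000 person-years, or a little over three years of follow-up per participant on average, which is worth keeping in mind when reading the 5-year risk estimates.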

Predictive Performance of the Model
The risk predictions were stratified into low-, medium-, and high-risk groups and visualized by Kaplan-Meier curves, which showed statistically significant differences between the groups by the log-rank test (Figure 2B, P<0.001).
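The three risk groups were defined by the 33% and 66% quantiles of the predicted risk. A minimal Python sketch of that grouping is given below; the linear-interpolation quantile rule and the choice to place values equal to a cut point in the lower group are assumptions, since the text does not specify either.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def stratify(scores):
    """Label each score low / medium / high using the 33% and 66% quantiles."""
    ordered = sorted(scores)
    low_cut = quantile(ordered, 0.33)
    high_cut = quantile(ordered, 0.66)
    groups = []
    for score in scores:
        if score <= low_cut:
            groups.append("low")
        elif score <= high_cut:
            groups.append("medium")
        else:
            groups.append("high")
    return groups


print(stratify([0.9, 0.1, 0.5]))
```

Because the model has only six categorical predictors, many subjects share the same predicted risk, so in real data the three groups may be of unequal size depending on how ties at the cut points are handled.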
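In the training set, the AUC was 0.753, 0.752, and 0.755 for the 1-, 3-, and 5-year lung cancer risk, respectively (Figure 3). Calibration was satisfactory, with observed risks very close to the predicted risks (Figure 4).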

Validation of the Lung Cancer Risk Model
The model showed moderate predictive discrimination in the validation set, with AUCs of 0.668, 0.678, and 0.685 for the 1-year, 3-year, and 5-year lung cancer risk (Supplementary Figure 1), and satisfactory calibration of relative risk (Supplementary Figure 2).
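The AUC can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. The Python sketch below computes it by direct pairwise comparison for a plain binary outcome. This is a simplification: the study used time-dependent ROC curves (the "survivalROC" package), which also account for censoring, and the scores in the example are invented.

```python
def auc(case_scores, noncase_scores):
    """Share of case/non-case pairs ranked correctly; ties count as half."""
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))


# Invented predicted risks, for illustration only.
print(auc([0.8, 0.6], [0.7, 0.2, 0.1]))
```

On this reading, a validation AUC of about 0.68 means the model ranks a case above a non-case in roughly two of every three such pairs, which is why the discrimination is described as moderate.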

DISCUSSION
In this study, using data from a large prospective lung cancer screening cohort, we developed and internally validated a simple risk prediction model for lung cancer in non-smokers, based on six widely available variables: demographics (age, gender, education), comorbidities (tuberculosis, hyperlipidemia), and family history of lung cancer. Our results showed that the model has good discriminatory accuracy and goodness-of-fit for both men and women, and for both non-passive smokers and passive smokers.
For non-smokers, several risk factors for lung cancer have been identified, including passive smoking (47,48), previous lung diseases [tuberculosis, chronic bronchitis, emphysema, chronic obstructive pulmonary disease (COPD)] (49), indoor radon (50), cooking oil fumes (51), and family history of lung cancer (52). The risk factors identified in our study, such as age, gender, family history of lung cancer, and history of tuberculosis, are consistent with these findings. The dominant risk factor for lung cancer in non-smokers is age: in our study, older age was the main risk factor, and the risk was more than 9 times higher in the 70-74 years age group than in the 40-44 years age group. In addition, being male remained a risk factor for lung cancer in non-smokers in our study, even though more than 50% of lung cancers in women in Southeast Asia occur in non-smokers, compared with approximately 2-6% in men in Western series (41,42,53). As in other prediction models, such as the Bach model (8), the LLP (Liverpool Lung Project) model (10), and the PLCO M2012 model (54), education level was included in our model as a protective factor.
Another important finding was that a history of hyperlipidemia [increased total cholesterol (TC), triglycerides (TG), or low-density lipoprotein cholesterol (LDL-C), or decreased high-density lipoprotein cholesterol (HDL-C)] might decrease the risk of lung cancer, although the effect was small. Since the 1980s, several epidemiological studies have investigated the associations of TC, TG, and HDL-C with lung cancer risk in non-smokers, but they have shown markedly contrasting results owing to differences in the classification of smoking status, a lack of prospective cohort designs, relatively modest sample sizes, and other potential biases (55-58). Lyu et al. (58) conducted a prospective cohort study among more than 100,000 Chinese men and found that both low and high TC levels, both low and high TG levels, and low LDL-C levels increased lung cancer risk in non-smokers. In addition, many studies have reported an inverse relationship of TC (56, 59) and LDL-C (60) with lung cancer incidence, which is to some extent consistent with our findings. More epidemiologic, molecular, and biochemical studies are needed to test this hypothesis.
In addition to credible predictors, a risk prediction model should also meet performance standards for discrimination, defined as the ability to distinguish lung cancer cases from controls, and calibration, defined as the consistency between observed and predicted risk of lung cancer. The rapid increase in the number of lung cancer risk prediction model studies since 2010 reflects the current need for predictive models to guide population stratification. Initially, models focused on traditional epidemiological risk factors such as age, smoking history, personal history of disease, and family history of cancer, such as the Bach model (8), Spitz model (9), LLP model (10) and PLCO M2012 model  easily collected and updated without any imaging, sophisticated testing or calculation.
Moreover, the model may not only serve as a practical tool to triage high-risk individuals among non-smokers, but may also have implications for public health measures, such as guidelines for the prevention of lung cancer in non-smokers.
This study has limitations. First, the self-reported data might be subject to social desirability and recall bias; however, given the good data acquisition and quality control, most information is believed to be reliable. Second, the performance of our risk prediction model was not validated on an external dataset. Nevertheless, the results of the internal validation suggest that the model may perform well when applied to other populations.

CONCLUSIONS
In summary, we developed and internally validated a simple risk prediction model for lung cancer in non-smokers based on a large-scale lung cancer screening program in China. The model has good discrimination and could be used as a tool for triaging high-risk patients to prevent lung cancer in non-smokers. Further prospective studies are required to validate the model in external populations.

DATA AVAILABILITY STATEMENT
The datasets for this manuscript are not publicly available because all our data are under regulation of both the National Cancer Center of China and Henan Cancer Hospital. Requests to access the datasets should be directed to Shaokai Zhang, shaokaizhang@126.com.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of Henan Cancer Hospital (no. 2021-KY-0028-001). The patients/participants provided their written informed consent to participate in this study.