Integrating Histologic and Genomic Characteristics to Predict Tumor Mutation Burden of Early-Stage Non-Small-Cell Lung Cancer

Tumor mutation burden (TMB) serves as an effective biomarker predicting efficacy of mono-immunotherapy for non-small cell lung cancer (NSCLC). Establishing a precise TMB predicting model is essential to select which populations are likely to respond to immunotherapy or prognosis and to maximize the benefits of treatment. In this study, available Formalin-fixed paraffin embedded tumor tissues were collected from 499 patients with NSCLC. Targeted sequencing of 636 cancer related genes was performed, and TMB was calculated. Distribution of TMB was significantly (p < 0.001) correlated with sex, clinical features (pathological/histological subtype, pathological stage, lymph node metastasis, and lympho-vascular invasion). It was also significantly (p < 0.001) associated with mutations in genes like TP53, EGFR, PIK3CA, KRAS, EPHA3, TSHZ3, FAT3, NAV3, KEAP1, NFE2L2, PTPRD, LRRK2, STK11, NF1, KMT2D, and GRIN2A. No significant correlations were found between TMB and age, neuro-invasion (p = 0.125), and tumor location (p = 0.696). Patients with KRAS p.G12 mutations and FAT3 missense mutations were associated (p < 0.001) with TMB. TP53 mutations also influence TMB distribution (P < 0.001). TMB was reversely related to EGFR mutations (P < 0.001) but did not differ by mutation types. According to multivariate logistic regression model, genomic parameters could effectively construct model predicting TMB, which may be improved by introducing clinical information. Our study demonstrates that genomic together with clinical features yielded a better reliable model predicting TMB-high status. A simplified model consisting of less than 20 genes and couples of clinical parameters were sought to be useful to provide TMB status with less cost and waiting time.


INTRODUCTION
Immune checkpoint inhibitors (ICIs) targeting programmed death 1 (PD-1) and programmed death ligand 1 (PD-L1) achieved great success improving clinical outcomes of patients with advanced NSCLC. The efficacy of ICIs varies widely among individuals (1). Therefore, biomarkers stratifying patients who may benefit from ICI treatment are of great importance. Immunostaining of PD-L1 is considered as the first considered option. NSCLC patients with tumor proportion score (TPS) ≥1% showed survival advantage from ICIs, especially, mono-immunotherapy. Tumor mutation burden (TMB) has been confirmed as a biomarker associating with efficacy of immunotherapy (2,3). Meanwhile, in patients with resected NSCLC, TMB can help to evaluate long-term prognosis (4). Recent studies have shown that there are many factors affecting TMB distribution and ICIs (5,6). TMB was negatively associated with clinical outcomes in metastatic EGFR mutant lung cancer patients treated with EGFR-TKI (7). PIK3CA amplification was significantly associated with TMB-H (8). Thus MSI-H/MMRdeficient tumors have much more somatic-mutations than MSS/ MMR-proficient tumors (9), which have been demonstrated to have direct effects on TMB. Moreover, the molecular profile was associated with clinicopathological features and genetic ancestry markers of CRC patients (10). NSCLC tumors with elevated TMB and PD-L1 expression are associated with lympho-vascular invasion (11). It was also reported in patients with advanced gastric cancer that clinicopathological (lymph node metastasis) and molecular characteristics (PIK3CA mutations) are associated with responders to nivolumab (12).
TMB was precisely evaluated by whole-exon sequencing and could be predicted by a comprehensive genomic profiling (CGP) panel with a minimal size of 1 M. However, more turn-around time (TAT) would be taken when CGP is performed. Therefore, establishing a precise TMB predicting model is essential to monitor which populations are likely to respond to immunotherapy or prognosis and to maximize the benefits of treatment. In this article, we firstly aimed to select potential parameters by associating genetic and pathological characters with TMB distribution. An optimal TMB prediction model was constructed based on selected various clinical and genetic factors. Receiver operating curve analysis was applied to assess the performance of this prediction model.

Patients
A total of 499 Formalin-Fixed, Paraffin Embedded tumor specimens of resected lung cancer were collected between March 2019 and September 2019. All patients signed the informed consent. Five hundred and eight cancer-related genes were sequenced.

Targeted Exome Capture Sequencing and Tumor Mutation Burden Assessment
Targeted exome capture sequencing data from 499 NSCLC samples were generated by MGI-500 platform. In detail, genomic DNA (gDNA) was extracted from FFPE and peripheral blood samples using the Qiagen DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany) per protocol. DNA concentration and quality were assessed by Qubit (Life Technologies, Gaithersburg, MD, USA) and agarose gel electrophoresis. gDNA (250 ng) was used for sequencing library construction as previously described. The hybridization product was subsequently purified, amplified, and qualified. Finally, sequencing of 508 key cancer related genes was performed with a paired-end 100 bp and 8 bp barcode on a MGISEQ-2000 sequencer following the manufacturer's protocols.
Raw data was first filtered by SOAPnuke to exclude reads with low quality. The clean reads were then aligned to the reference human genome (UCSC hg19) using the BWA MEM algorithm. Single-nucleotide variants (SNVs) were detected by Genome Analysis Toolkit (GATK) Unified Genotyper. Small insertions and deletions (indels) were called using GATK Haplotype. Copy number variants (CNVs) were called using read-depth analysis. All the above variants were further filtered by quality depth, strand bias, mapping quality, and read position. Each variant was finally annotated with respect to gene location.
Targeted exome capture sequencing data of 499 NSCLC patients was analyzed in depth, and TMB was evaluated, which was defined as the total number of non-synonymous and indel somatic mutations present in a baseline tumor sample excluding known driver genes. TMB-high group was defined based on top 20% of TMB value.

Statistical Analysis
Chi-square test and Kruskal-Wallis test were used for comparing categorical and continuous variables, respectively. A P-value threshold of p ≤0.001 (Chi-square test) and p <0.05 (Kruskal-Wallis test) were used to define statistical significance. To determine the driver genes' differential between TMB-H and TMB-L, the Wilcoxon test was performed to figure out the significant driver genes. To determine the multivariable association of clinical and mutation characteristics with TMB-H, LASSO regression was used. In order to evaluate whether the histologic and genomic data could provide effective prediction of TMB, the receiver operating characteristic (ROC) curve analysis and area under the curve (AUC) were applied to evaluate the accuracy of TMB prediction model.

Constructing Tumor Mutation Burden Prediction Model
Based on the above clinical and genetic results, we hypothesized whether combination of clinical and genetic features could predict TMB status. Therefore, we trained a multivariable logistic regression model that included clinical parameters, age, histology, clinical stage (TNM) as well as genetic factors (TP53, FAT3, APC, EPHA3, TERT, LRRK2, RB1, PTPRD, STK11, and NF1).
Three factors (histology, Stage and TP53) were extremely powerful predictors for TMB through multivariate analysis (p < 0.001) (Supplementary Table 1). Other factors like FAT3, APC, PTPRD (P = 0.01), lymph-node metastasis, EPHA3, TERT, and STK11 (P = 0.05) that have been found to be related to TMB distribution (Supplementary Table 2). Using TMB =6.15 muts/ Mb, the prediction model achieved a sensitivity of 73.8% and a specificity of 90.3%; the AUC (area under the ROC curve) was 0.899 (95% confidence interval, 0.861-0.938) indicating its potential for reliably identifying patients with greater TMB. After removing histological parameters, the AUC (area under the ROC curve) of these factors was 0.863 with a sensitivity of 76.3% and a specificity of 87.1% (Figure 4).

DISCUSSION
TMB, PD-1/L-1 expression are used to select patients who may benefit from immunotherapy (13). TMB is an emerging predictive marker of immune checkpoint blockade response (4) as well as prognosis for patients with NSCLC (14). In particular, it is important to accurately predict the benefit of immunotherapy based on TMB status. Also, it is reported that high TMB was associated with a better prognosis in patients with resected NSCLC (15). Our integrated histologic and genomic model is an important step toward addressing this unmet need.
In this analysis, we evaluated the association between TMB and clinical characteristics in patients with early-stage NSCLC. In the univariate analysis, despite being significantly associated with sex, our results further found that TMB was correlated with several clinicopathological features, like histological subtype, LVI, pathological subtype, para-bronchial lymph nodes, lymph node metastasis as well as tumor size. A recent study also explored the correlation of TMB and clinical characteristics in early-stage squamous cell lung carcinoma; however, no significant association was observed between TMB and age, gender, smoking history, stage (16). Another study assessed associations between clinical and TMB in resected NSCLC and identified that histological type, gender, and smoking status were associated with higher TMB (17). These inconsistent findings may be due to differences in ethnicity and pathologic types of the cohorts. In addition, the differences in panels used for TMB evaluation also have a significant impact on the results. TMB was initially detected using whole exon sequencing, but a growing number of clinical trials are now using commercial panel sequencing to detect TMB. There is no uniform standard for TMB calculation method and threshold determination. Moreover, TMB varies greatly among different cancers and even different pathological subtypes. These are challenges that need to be overcome before further application of TMB (18)(19)(20).
In LUAD, carcinoma in situ, invasive and microinvasive cancers have different cell growth patterns and stages, which in turn affect the patient's treatment and prognosis (21). In stage I LUADs, the micropapillary component was significantly associated with nodal micro-metastasis of tumor cells and may be a manifestation of aggressive behavior (22). TMB discrepancy  was observed among LUAD with various components. Remarkably, it is critical to determine the heterogeneity of LUAD components (histological subtype) by genetic profile. Solid predominant LUADs were more likely to harbor KRAS mutations than are other predominant subtypes (23). The solid predominant subtype of tumor has been found to correlate remarkably with an inflamed phenotype characterized by a high proportion of PD-L1/CD8+TILs and active cytotoxic immune profiling and that increased tumor immunogenicity from a high TMB (24). The alterations of EGFR, KRAS, and BRAF genes proved to be more frequent in micropapillary LUAD (24,25). The studies have suggested that the molecular pathogenesis of micropapillary component may differ from other types of LUAD (26). Pre-invasive LUAD displayed distinct mutation profiles. In situ and micro-invasive LUAD showed higher prevalence of driver mutations, for example, EGFR mutations and ALK fusion. Thus, compared with invasive LUAD, in situ and micro-invasive LUAD had lower TMB, which are concordant with variant distribution of driver gene mutations in these two histologic subtypes. LUSC (27) and LVI (28) were previously found to have higher TMB, which was similar to the results of our study. Beside, LVI has been linked to an increase in immune cell infiltration (28). Our result, together with other reported data may provide a TMB related immune activation hypothesis. Patients with different tumor stages exhibited distinct clinical behaviors (29). Para-bronchial lymph nodes is associated with a poorer prognosis (21,30). Thus, we inferred that these factors may affect immunogenicity through immune microenvironment and molecular profile. Neuro-invasive and stage-M did not affect TMB distribution, contrary to our expectation. The specific mechanism needs to be elucidated.
On the genomic level, the results showed that fifteen genes (TP53, PIK3CA, KRAS, EPHA3, TSHZ3, FAT3, NAV3, KEAP1, NFE2L2, PTPRD, LRRK2, STK11, NF1, KMT2D, GRIN2A) were significantly associated with TMB-H; EGFR was associated with TMB-L. Among these high-TMB-related genes, recent studies have shown that TP53-mutated tumors showed prominently increased somatic mutation burden compared with other mutant groups (KRAS, EGFR, STK11); and patients with TP53 or KRAS mutations showed remarkable clinical benefit to PD-1 inhibitors (31,32). PIK3CA and KRAS are mainly involved in the PI3K signaling pathway, which is one of the most important signal transduction pathways in the development of LUADs (33). In particular, four activated mutations of PIK3CA (p.E542X, p.E545X and p.Q546K) were found to have significant effect on TMB-H. PIK3CA gene mutations in the helical domain were correlated with TMB-H and poor prognosis in metastatic breast carcinomas with late-line therapies (34). Studies in lung cancer suggested that PIK3CA amplification was associated with higher TMB (8). Moreover, KRAS G12 mutations also correlated with high TMB group. As reported, NSCLC patients with KRAS G12 mutations showed an increased proportion of PD-L1+/CD8+ TILs (35). These genes stimulated PTEN/PIK3CA/AKT pathway, which in turn would lead to increased proliferation of tumor cells (32). Accelerated cell cycle may accumulate somatic mutations putatively resulting in elevated TMB. KEAP1-NFE2L2 plays a significant role in the dysregulation of oxidative stress pathway in lung cancer (36). Oxidative stress can lead to mutagenic DNA damage in the form of oxidative base modifications and the induction of DSB (DNA Double-Strand Break) which promotes mutations (8,37). KEAP1 mutation was significantly associated with lower CD8+TIL density which may be associated with shorter survival in LUAD patients receiving immnotherapy (38). It means that oxidative stress is a parallel mechanism of high-TMB. EGFR, KRAS, TP53, and STK11, also reported in a recent study, showed a correlation with tumor antigenicity and PD-L1 expression (8,36,39,40). Of note, GRIN2A regulates excitatory neurotransmission in the brain (36) and has scarcely been reported in NSCLC. It is necessary to further obtain a deeper understanding of its mechanism and further applications. Our data also confirmed the association between EGFR mutations and TMB-L, which have been reported previously (41). Considering these findings, we speculate that tumor with high TMB-related-gene mutations may lead to the destruction of immune cells including CD8+TILs and DSB/DDR level, resulting in the increase of somatic mutations of tumor cells.  >Thus, we trained a multivariable logistic regression model predicting TMB category with 6.15 mutations/Mb as the cut-off value utilizing five clinical features (age, histology, T, N, M) and 10 genes (TP53, FAT3, APC, EPHA3, TERT, LRRK2, RB1, PTPRD, STK11 and NF1). By comparing sensitivity and specificity, the results of two predictive model for TMB (histologic + genomics, genomics) confirmed that histologic features made a strong contribution to the integrated model for TMB prediction. There is also a small sample study which found that integrating multiple factors helps accurate prediction of TMB (37) although the ROC curves of the two studied models are close (0.89). Our study selected fewer histologic parameters without involving radiologic parameters. Another study showed that 56-gene panel could be used as a screening method for patients with low TMB. Compared with the panel, our prediction model has fewer genetic parameters, but achieved comparative efficiency (41). Therefore, this model may be better used to screen TMB-L or TMB-H status in early-stage NSCLC patients.

CONCLUSION
Overall, comprehensive clinical and genomic information can effectively evaluate TMB-high or low status. Our results showed that an integrated prediction model combining histology and genomic parameters significantly improved the accuracy of TMB prediction. However, whether this integrated model plays a key role in predicting the clinical outcome to immunotherapy and prognosis, still needs further investigation.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number can be found below: China National GeneBank DataBase (CNGBdb) https://db.cngb.org/. Accession number CNP0001479.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Institutional Review Board of the First Affiliated Hospital of Guangzhou Medical University. The patients/ participants provided their written informed consent to participate in this study. The animal study was reviewed and approved by Institutional Review Board of the First Affiliated Hospital of Guangzhou Medical University.

AUTHOR CONTRIBUTIONS
JH, YQ, CZ, and LL conceptualized and designed the study. KW provided administrative support. YY and HC provided the study materials or patients. QD, YL, and WJ collected and assembled the data. WL and DS analyzed and interpreted the data. All authors contributed to the article and approved the submitted version.