Germline Predisposition and Copy Number Alteration in Pre-stage Lung Adenocarcinomas Presenting as Ground-Glass Nodules

Objective: Synchronous multiple ground-glass nodules (SM-GGNs) are a distinct entity of lung cancer which has been emerging increasingly in recent years in China. The oncogenesis molecular mechanisms of SM-GGNs remain elusive. Methods: We investigated single nucleotide variations (SNV), insertions and deletions (INDEL), somatic copy number variations (CNV), and germline mutations of 69 SM-GGN samples collected from 31 patients, using target sequencing (TRS) and whole exome sequencing (WES). Results: In the entire cohort, many known driver mutations were found, including EGFR (21.7%), BRAF (14.5%), and KRAS (6%). However, only one out of the 31 patients had the same somatic missense or truncated events within SM-GGNs, indicating the independent origins for almost all of these SM-GGNs. Many germline mutations with a low frequency in the Chinese population, and genes harboring both germline and somatic variations, were discovered in these pre-stage GGNs. These GGNs also bore large segments of copy number gains and/or losses. The CNV segment number tended to be positively correlated with the germline mutations (r = 0.57). The CNV sizes were correlated with the somatic mutations (r = 0.55). A moderate correlation (r = 0.54) was also shown between the somatic and germline mutations. Conclusion: Our data suggests that the precancerous unstable CNVs with potentially predisposing genetic backgrounds may foster the onset of driver mutations and the development of independent SM-GGNs during the local stimulation of mutagens.

Objective: Synchronous multiple ground-glass nodules (SM-GGNs) are a distinct entity of lung cancer which has been emerging increasingly in recent years in China. The oncogenesis molecular mechanisms of SM-GGNs remain elusive. Results: In the entire cohort, many known driver mutations were found, including EGFR (21.7%), BRAF (14.5%), and KRAS (6%). However, only one out of the 31 patients had the same somatic missense or truncated events within SM-GGNs, indicating the independent origins for almost all of these SM-GGNs. Many germline mutations with a low frequency in the Chinese population, and genes harboring both germline and somatic variations, were discovered in these pre-stage GGNs. These GGNs also bore large segments of copy number gains and/or losses. The CNV segment number tended to be positively correlated with the germline mutations (r = 0.57). The CNV sizes were

INTRODUCTION
The widespread use of advanced chest computed tomography (CT) for lung cancer screening has facilitated the detection of ground-glass nodules (GGNs) (1)(2)(3). Recent data indicates that up to 20% of GGN patients (3% of the screening population) are diagnosed with synchronous multiple ground-glass nodules (SM-GGNs) (4). GGNs are like atypical adenomatous hyperplasia (AAH), adenocarcinoma in situ (AIS), minimally invasive adenocarcinoma (MIA), or invasive adenocarcinoma (AD) (5). To date, neither auxiliary tests that can assist in the differential diagnosis (6) nor recommended strategies for the identification and treatment of GGNs exist in clinical practice guidelines for lung cancer. In fact, a major clinical challenge is to distinguish between independent synchronous multiple primary lung cancer (SMPLC) and intrapulmonary metastasis, which makes a treatment decision difficult. Thus, molecular characterization of GGNs may provide insight into the genetic drivers of synchronous multiple tumors and identify inter-tumor heterogeneity (7)(8)(9).
Although SM-GGNs appear within the same environmental and genetic background as GGNs, SM-GGNs may comprise of a complex combination of different gene alterations and distinct morphologic characteristics (10). Currently available genetic evidence for lung cancer metastasis suggests that the time between the development of two tumors is important in distinguishing SMPLC from metastasis (6). Synchronous metastases have largely preserved genetic patterns identical to those of the primary lung cancer (11,12). Next-generation sequencing (NGS) has revealed that solid tumors, including lung cancer, harbor thousands of single-nucleotide variations (SNVs) and ten to hundreds of somatic chromosomal rearrangements (SVs) (13,14). Both alterations have been used to analyze the lineage relationships between tumors from the same individual. However, the results of matched analyses of the concordance of cancer molecular characteristics and genetic patterns in SMPLC, have been discrepant and inconclusive (10,15,16). A cluster analysis has been used to identify copy number variation (CNV) patterns (17,18). According to the results of the TRACERx group (19), whole-genome duplication and CNV are early events in non-small-cell lung cancer (NSCLC) evolution; higher copy-number variation heterogeneity has been found to be a risk factor for recurrence or death (hazard ratio, 4.9; P = 4.4 × 10 −4 ).
In this study, we conducted deep genomic sequencing to explore the genomics of SM-GGN. Our data showed that the similar CNV instability and a predisposition genetic background may foster the onset of driver mutations for the pre-stage lung adenocarcinomas, presenting as SM-GGN.

Patient Clinical Information and Sequencing Statistics
Detailed clinical features of the 51 SM-GGN samples collected from 25 patients and 18 triple SM-GGNs from six patients are summarized in Table 1. Most patients were females (52/69). Only three patients were smokers. All were disease-free for more than 2 years after surgery resection. 50-78% of tumor cells were ensured in all samples (Tables S1, S2). Two hundred twenty gene panel (Table S3) was used for TRS. The sequencing depth and coverage for both TRS and WES are summarized in Table S4.
The common silent somatic mutations were very low, with 1-16 common somatic SNV/INDELs among 22 patients and none common in 3 patients in TRS data. There are 7, 9, 10, 7, 13, and 10 common silent SNV/INDELs detected among the triple GGNs of the same M1 to M6 patients, respectively, in WES data.

Pre-stage Nodules Have Fewer Driver Mutations Than Advanced Lesions
In TRS, the average somatic SNV/INDELs were 69.4, 53.3, 64, and 84.5 for AAH, AIS, MIA and AD, respectively. However, there were no significant differences between all groups. The average non-synonymous SNV/INDELs were 1.6, 1.92, 1.2, 4.25 for AAH, AIS, MIA, and AD, respectively. However, there was a significant difference between AAH vs. AD (P = 0.023) and combined, AAH, AIS, and MIA vs. AD (P = 0.025) in missense somatic variations.
In WES, the average somatic SNV/INDELs were 417.3, 412.5, 500 for AIS, MIA, and AD, respectively, and there were no significant differences among these types. The average of the nonsynonymous somatic SNV/INDELs of 41 AD was 1.45-fold of AIS (28; P = 0.9) and MIA (28.5; P = 0.5), although it was not statistically significant.
We compiled the profiles of the 5 ′ and 3 ′ conjugated bases of all somatic SNVs of each GGN sample and expected that some mutation signatures may represent the mutagenesis (20) specific to all GGNs or some types of GGNs. TRS data showed predominant somatic signature 1 and 16 in all four types of GGNs. In addition, AIS bore signature 29, and AD had signature 26 (Figure 3). The WES data showed a different signature profile. Signature 3 was predominant, following signature 1. All three types of GGNs from 18 GGN samples had a similar signature profile: 1, 3, 6.

There Were High Burden Germline Mutations in GGN Patients
In TRS, 994,089 germline SNP/INDELs were discovered, with an average of 39,763 from each patient. Of the 1,438 variants, 280 had ≤ 0.01 MAF in a Chinese population derived from 2,220 whole-exome sequencing and 568 wholegenome sequencing data. Interestingly, some important cancer gene variants were found in patients (3 TP53, 3 BRCA2, 3 FGFR3) ( Table S8).
In WES, a total of 1,636,397 SNP/INDELs were discovered in five patients, consisting of approximately 345,000 each. The 24,085 genes bore germline variants. Further filtering with 0.01 MAF of Chinese populations revealed 2,323 mutations in 1,990 genes. In these 6 patients, 100% (6/6) had a TRIP10 missense mutation, and 83% (5/6) had a non-frameshift insertion in MICALCL and PPP2R2B (Table S9). We also found the cancer gene BRCA1 variant in one patient.

Predisposition Gene Mutations in GGN Patients
From 25 germline TRS data, we discovered a total of 1,744 variants (448 unique variants) in 39 known predisposition genes of genetic diseases from Human Gene Mutation Database (HGMD) and Online Mendelian Inheritance in Man (OMIM) databases (Table S10, Figure S2).
We further examined only the 280 missense/truncated and infrequently observed Chinese MAF mutations using TRS, and we found several predisposition gene mutations, including the previously reported lung cancer susceptible gene CHRNA3 and several cancer gene germline mutations including BRAC1, BRAC2, EGFR, and TP53 (Table S8). From the 2,323 missense/truncated with low-frequency Chinese MAF from WES, TRIP10, MICALCL, and PPP2R2B had the highest frequency among the 6 WES patients.

Genes Burden With Both Germline and Somatic Mutations
We next examined the top genes that had both nonsynonymous somatic mutations and germline mutations in our cohorts (Figure 4). Among the five somatic missense/truncated variants and 6 germline missense/truncated low MAF variants in MUC4, one variant was predicted to be deleterious by PROVEAN (21), and three were predicted to be damaging by SIFT (22) (Table S11). Among the 12 variants of gene FLG, two missense SNPs were predicted to be damaging by SIFT. Using muPIT Interactive (23) and RCSB PCB (24), we confirmed the harmful effects on protein structure predicted by SIFT and PROVEAN. Each SNP resulted in a dramatic alteration of protein structure, and we analyzed the impact of this functional perturbation on other genes and pathways.

The Pre-stage GGN Showed Chromosome Instability by CNV
We performed a clustering analysis of CNV using all paired SM-GGNs ( Figure 5A). The paired GGNs were not clustered except M18, M5, and M1 patients. Interestingly, the similar types of GGNs tend to be clustered. For example, the cluster C2 had 5 AD out of seven members and cluster C5 had 8 AAH out of 11 members. However, this observation appears differently in WES data. As shown in Figure 5B, all triple SM-GGNs were clustered within the same patients by CNV. This result suggests that CNVs derived from a region beyond the cancer gene panel may represent the overall of the same patient. Using WES, 30 regions of gain and 5.5 regions of loss, 15 Mb and 6 Mb, respectively, were found in each GGN. Interestingly, the CNV gain size or total CNV size showed a moderate positive correlation (r > 0.5) with the somatic mutations, while the CNV gain or loss segment number or total number correlated with the germline mutations (r > 0.5) (Figures 5C,D and Table S12). Our data uncovered a correlation between germline mutation and somatic mutations in GGN patients, though such CNV correlations were not found in the CNV data from the TR samples ( Table S13).

Evaluation of Truncal and Branched Driver Mutations
Since there were very few non-silent variants for each GGN, we included all somatic variants (SNV/INDELs) to construct phylogenetic trees using the parsimony method (25) with branch lengths reflecting the number of mutations. We labeled the key driver mutation during the acquisition process. All the potential predisposition gene mutations shown in Figure 6 have been previously reported to associate with lung or other cancer risks, and all had ≤ 0.01 MAF in Chinese populations. As shown in Figures 6A-F, the SM-GGNs of the six patients did not originate from the same ancestor clones.

DISCUSSION
In this study, we found that the six triple SM-GGNs and 25 double SM-GGNs were from independent origins, because the SM-GGNs of the same patient shared none or very few common driver mutation events or somatic mutations. This confirmed a previous study that used data from two patients (26). If the metastasis model can help perceive how some types of SM-GGNs occurred by metastasis spread (27), understanding how multiple GGNs occurred independently is elusive, which motivated us to deeply explore our data. We proposed several possible models for SM-GGN development (Figure 7). We found our data represented an inherent sporadic SM-GGNs (ISG) model. ISG typically does not exhibit identical driver mutations; however, germline predisposing mutations and similar chromosome alterations bring about many GGNs simultaneously (Figure 6). Izumchenko et al. (15) applied targeted sequencing on 25 distinct AAHs that were incidentally discovered in lung resection specimens from six patients with invasive adenocarcinoma. They found that one of six primary tumors harbored BRAF mutation and three of four patients with mutated BRAF in AAHs carried wild-type BRAF in paired invasive adenocarcinoma. EGFR, TP53, and KRAS mutations were found in several AAHs and matched primary tumors of two patients. This is probably the earliest report that some SM-GGNs originated from metastasis (27) (Figure 7). SM-GGNs may also be derived from multiple local primary tumors. In agreement with a recent study demonstrating independent clonal origins of SM-GGNs (26), we did not find significant overlapping variant sets between any tumor pair in each of these 31 patients. This observation suggested that multiple pulmonary nodules within one subject probably arose autonomously from different progenitor cells. These SM-GGNs shared an identical germline genetic background and environmental exposure in individual patients. We hypothesize that there are three phenotypes with sporadic ancestor origins (Figure 7). Kobayashi et al. (28) found EGFR mutations in 64% of 104 GGNs, KRAS in 4%, ALK in 3%, and HER2 in 4%. Our data also supports the ISG model and our somatic mutation findings are similar to those in previous reports, where BRAF, EGFR, and KRAS were the top driver mutation GGNs. In our study, we did not find somatic TP53 mutations in both TRS and WES cohorts. This result is consistent with previous reports where the TP53 mutation was not found (27) or the TP53 mutation was not among the top driver mutations in GGNs (15,26). This indicates that GGN is distinct from typical invasive lung cancer in which TP53 mutations were the most frequent among all other driver genes (29).
The inherent sporadic SM-GGNs typically do not exhibit identical driver mutations; however, germline predisposing mutations and similar chromosome alterations bring about many GGNs simultaneously. Therefore, although occurring independently, they are linked by the genetic background and similar CNVs. The two cohorts in our study were this type of SM-GGNs. No identical drive mutations were found, but these patients had a large number of germline mutations, potential predisposition mutations, and large chromosome gain and loss segments. Different known driver mutations were acquired in the evolution branches.
In addition, our study showed that independent clonal SM-GGNs (ICG) do not have identical driver mutations nor typical predisposition mutations. These SM-GGNs do not have any genetic link; therefore, they occur randomly and occasionally. We hypothesize that ICG may prone to the  deficient local immune microenvironment or presumably to the biochemical substances released from the primary tumor lesion (30).
It is believed that many types of mutagens can increases the frequency of mutations above the natural background level, resulting in driver gene mutations at the pre-stage. Additional molecular alterations later progress the cancer development (31,32). As the germline mutations and copy number alterations occurred in the pre-stage of GGNs, we suggest these genomic factors may foster the mutation acquisition during the local stimulation of mutagens such as air pollutants. Interestingly, our data showed a moderate correlation between global germline and somatic mutations (Figure 5E). High burden germline mutations may predispose individuals to driver mutation acquisition, and the genes bearing the highest mutation frequency could play potential roles as cancer drivers.
Somatic mutations are an accumulation process. Signature 1 has been found in all cancer types and in most cancer samples (20). This finding suggests that a similar mutational process operated to promote tumorigenesis in SM-GGNs. Signature 3 is strongly associated with germline and somatic BRCA1 and BRCA2 mutations in breast, pancreatic, and ovarian cancers, which was previously reported in GGNs (20).
CNV reflects the instability of chromosomes. Our study showed large segments of amplification gain/loss in GGNs. The gene gains or losses caused by CNV can result in tumorigenesis. The abnormal amplification of chromosome segments may increase the error rate during DNA replication process. Interestingly, our data showed the CNV segment number positively was correlated with the germline mutations and the CNV size was correlated with the somatic mutations. Though it is not clear why the CNV showed the different correlation favors, our data demonstrated a relation between mutations and CNV in SM-GGNs. We also found that many patients had similar overall profiles of somatic CNV among SM-GGNs. The similar overall CNVs provide an atmosphere for developing synchronous multiple GGNs in the same patient. The TRS analysis showed more CNV segments and sizes, most likely due to overestimation by the techniques and the control samples. This is challenging as the control-FREEC software inferring  whole genome-wide CNVs from the limited targeted regions can affect CNV calls by off-target reads.
We acknowledge that, despite using a large size of 69 SM-GGN lesions, the 31 patients in the study was not enough to detect lesscommon genomic alterations, though sample sizes of two (27), 4 (26), six (15), and 9 (33) patients have been previously reported in similar GGN studies. Expanding our knowledge in this area will require a major international collaboration. Likewise, we acknowledge that although our predictive findings demonstrate the power of molecular data in guiding management decisions, a prospective clinical trial using predictors derived from our data will be required before clinical use. Also, various of analysis including RNA-seq, epigenetic seq will be required for deeper research on SM-GGNs.
In summary, we have comprehensively explored the genomic variations of 69 SM-GGN samples of 31 patients. Our data demonstrated that SM-GGNs shared similar overall patterns of large segments of copy number gains and/or losses and potential predisposition gene mutations in the independent origin of SM-GGNs. The CNVs were correlated with germline and somatic mutations, and the global germline mutations were correlated with somatic mutations. These results might suggest that the unstable CNV with potential predisposition genetic background probably can foster the onset of driver mutations for the development of independent SM-GGNs (Figure 6) during local stimulation of mutagens such as air pollutants. The candidate genes, BRIP1 with high germline mutations, MUC4, and FLG harboring both germline and somatic mutations, may provide an avenue to further study how driver mutations were predisposed in the development of SM-GGNs. Since the molecular abnormities found in previous studies were still not conclusive in relation to the tumorigenesis among the SM-GGN or GGN tumors, our findings here provide insight into the biology of SM-GGNs.

Patients and Tissue Samples
Tumor samples were obtained from patients who underwent surgical resection at the Shanghai Pulmonary Hospital between 2014 and 2015. Two cohort samples were collected, one for TRS (51 GGNs) and the other for WES (18 triple GGNs). All case included in the study met the proper indication, including (a) all treatment decisions were made by a multidisciplinary team. (b) all cases with a dominant nodule, which was suspected as an invasive lung cancer. (c) ground-glass nodules were suspected as invasive lung cancer in the following circumstances: (1) nonsolid nodules measuring > 10 mm; (2) Nonsolid nodules that grow or develop a solid component in the follow-up. Two pathologists reviewed all samples to confirm the histology and assess the tumor content according to the 2011 International Association for the Study of Lung Cancer (IASLC). Tumor cell proportion of all samples, GGN site, and sizes were summarized (Tables S1, S2). The cells were counted under 100X microscopic vision using the average of the sum counts in five randomly selected fields of each lesion. Per previously reviewed studies, we used either the matched normal mediastinum lymph node tissue (15) or normal lung tissues (26,(29)(30)(31)(32)(33)(34) for germline variant calling and as the control for somatic variant calling. The institutional Ethics Committee of the Shanghai Pulmonary Hospital approved the study. Written consent was obtained from the patients in this study.

Targeted Sequencing (TRS)
DNA was extracted using a TIANamp FFPE DNA Kit (TIANGEN) and quantified with the Nanodrop system (Thermo Scientific). Matched normal lymph node tissue was used in each case as a control. Targeted region sequencing was performed at Novogene, Inc. A paired-end DNA library was prepared according to the manufacturer's instructions (Agilent); 1 µg of tumor DNA was sheared into 180-280-bp fragments using a Covaris S220 sonicator. The ends of the gDNA fragments were repaired, and the 3 ′ ends were adenylated. Both ends of the gDNA fragments were ligated at the 3 ′ ends with pairedend adaptors (Illumina USA) with single "T" base overhang and purified using AMPure SPRI beads from Agencourt. The adaptor-modified gDNA fragments were enriched via 6 cycles of PCR using SureSelect Primer and SureSelect ILM Indexing Pre-Capture PCR Reverse Primer. The concentration and size distribution of the libraries were determined on an Agilent Bioanalyzer DNA 1000 chip. Whole-target regain capture was performed using Agilent's Sure Select XT Custom. The captured DNA library was sequenced on an Illumina Hiseq 4000 according to the manufacturer's instructions for paired-end 150bp reads. The libraries were loaded onto paired-end flow cells at concentrations of 14-15 pM to generate cluster densities of 800,000-900,000/mm 2 using an Illumina cBot and HiSeq pairedend cluster kit.

Whole-Exome Sequencing (WES)
DNA was extracted using a QIAamp DNA Mini Kit (QIAGEN, Shanghai, China) and quantified with the Nanodrop system (Thermo Scientific, Shanghai, China). Matched normal lung tissue was used in each case as a control. Wholeexome sequencing (WES) of the samples was performed at Novogene, Inc. (Beijing, China). Sequencing libraries were generated using the Agilent SureSelect Human All Exon kit (Agilent Technologies, CA, USA). Fragmentation was carried out using the hydrodynamic shearing system (Covaris, Massachusetts, USA) to generate 180-280-bp fragments. The remaining overhangs were converted into blunt ends using exonuclease/polymerase activities. After adenylation of the 3 ′ ends of DNA fragments, adapter oligonucleotides were ligated. The captured libraries were enriched in a PCR reaction to add index tags to prepare for sequencing. Products were purified using the AMPure XP system (Beckman Coulter, Beverly, USA) and quantified using the Agilent high sensitivity DNA assay on the Agilent Bioanalyzer 2100 system. Clustering of the indexcoded samples was performed on a cBot Cluster Generation System using the Hiseq PE Cluster Kit (Illumina) according to the manufacturer's instructions. After cluster generation, the DNA libraries were sequenced on the Illumina Hiseq platform, and 150-bp paired-end reads were generated.

Sequence Quality Check and Mapping
We first checked the quality of the base call of the Hiseq sequence reads, and we ensured an average score of Q30 with more than 80% and an error rate < 0.1%. In addition to removing the adaptor sequence bases, we also removed those read pairs that had more than 10% uncertain bases (N) among a whole read, and those that had a length of more 50% of a read with a low-quality score (Q5). The Burrows-Wheeler Aligner (BWA) software (35) (Version 1) was used to map the paired-end clean reads to the reference genome (UCSC hg19). BWA chooses the best math mapping location if a read can be mapped on multiple locations. We then checked the read coverage and depth of each sample. We examined the properly paired mapped rate, coverage of the targeted regions, and fraction of targeted regions with different depths.

SNV/INDEL Call
Samtools mpileup (36) was used to perform variant calling and to identify SNVs and indels in tumor and normal samples. We used a minimum number of two gapped reads for indel candidates and 0.002 as the minimum fraction of gapped reads. We skipped the alignments with mapQ smaller than 1. The final calls were filtered with a base call Q score >20, read depth >4, and mapping quality > 30. These SNV/INDELs were used for further germline variation analysis. The variant allele frequencies (VAF) of the germline SNPs were examined ( Table S14). The somatic SNVs were detected by muTect (37), and the somatic INDEL were detected by Strelka (38). We applied the Refseq and Gencode of HG19 to annotate these variations, including chromosome loci, gene structure (UTR, intronic, exonic, intergenic), function affects (missense, splicing, synonymous), and types (mRNA, non-coding RNA, small RNA).

CNV Analysis
The targeted sequencing data can be used to infer CNV (39), although it only covers a small portion of the whole genome. The off-target sequences have also been used to infer CNV. We used the Control-FREEC software for CNV detection (40). First, Control-FREEC obtains input aligned reads and counts reads (RC) in non-overlapping windows. The second step is profile normalization by fitting to the control RC. The observed RC in ploidy P-copy regions (i.e., regions with a copy number equal to P) can be modeled as a polynomial of the control RC, and the observed RC in a region with an altered copy number is linearly proportional to the RC in P-copy regions. The third step is segmentation of the normalized CNV profile using a LASSO-based algorithm by Levy-Leduc and Harchaoui (41). The last step involves analysis of the segmented profiles, which includes the identification of regions of genomic gains and losses and the prediction of copy number changes in these regions. The heat-maps of somatic CNVs were analyzed using Complex Heatmaps [https://github. com/jokergoo/ComplexHeatmap], according to the study by Ni et al. (18).

Mutation Signature Analysis
We deciphered the somatic mutation signatures in GGN types using the R package "deconstructSigs" (42). Briefly, for each tumor sample, we extracted the 5 ′ and 3 ′ sequence context of each mutation from the hg19 reference genome, and the SNVs were categorized into C>A, C>G, C>T, T>A, T>C, and T>G bins according to the type of substitution and then subcategorized into 96 sub-bins according to the nucleotides preceding (5 ′ ) and succeeding (3 ′ ) the mutated base. We then groped the samples into four classes, AAH, AIS, MIA, and AD, and calculated the mean frequency of each of the 96 mutations. We downloaded the 30 known mutation signatures from the COSMIC website (43). To enable comparisons with the known signatures based on the Wellcome Trust Sanger Institute Mutational Signature Framework, the "deconstructSigs" (42) allowed us to compare our mutation profiles against the COSMIC signatures and statistically quantify the contribution of each signature to each group of GGN tumors.

Mutation Integrative Analysis
SNP data were divided based on whether an SNP was from a germline or somatic samples. To filter and maintain significant SNPs for the germline, VAFs for each of the SNPs were determined from a private Chinese collaborator and the 1000 Genomes Project (44). Only SNPs with a VAF with no entry or ≤ 0.01 were retained. Somatic SNPs did not need to be filtered by VAF. The SNPs were pooled into AAH and the other GGN types separately. Reference SNP IDs were determined using the UCSC Genome Browser (version 138) (45). In addition, before filtering by VAF, SNP data were plotted for comparison. The SNPs of each gene were also separated into AAH and others.
All remaining SNPs were analyzed using SIFT and PROVEAN to predict whether they were damaging or harmless. SIFT first obtains sequences related to the query sequence and chooses closely related sequences that have a similar structure and function. It calculates probabilities of amino acids using Dirichlet mixtures. The cutoff for being deleterious is a score of 0.05 (determined from comparison to experimental data) (22). PROVEAN clusters BLAST hits and the clusters most related to the query are used to generate a PROVEAN score prediction. If the score is less than −2.5 (default score threshold), then the mutations are classified as deleterious (21). Their effects on other genes and pathways were analyzed using Ingenuity Pathway Analysis (IPA) (46). We ranked the germline mutations, and selected the genes that had the highest number of germline mutations with at least one somatic mutation.

Construction of Trunks and Branches of Driver Mutations
All somatic mutations, including silent mutations, were considered to evaluate the phylogenetic trees. Trees were constructed using binary presence/absence matrices built from the regional distribution of variants within the tumor. The parsimony distance method was applied to generate rooted trees (25). Branch lengths were determined using the number of mutations.

ETHICS STATEMENT
This study was carried out in accordance with the principles of the Helsinki Declaration of the World Medical Association. The protocol was approved by The institutional Ethics Committee of the Shanghai Pulmonary Hospital.

AUTHOR CONTRIBUTIONS
CC and YR designed the study. WX conducted the analytic design. YR and CD performed the targeted and whole-exome sequencing samples and prepared the clinical data. DX, HZ, YS, FZ, KF, and GJ performed the surgery and sample collection.