Biased Influences of Low Tumor Purity on Mutation Detection in Cancer

The non-cancerous components of tumor tissues, e.g., infiltrating stromal cells and immune cells, dilute tumor purity and might confound genomic mutation profile analyses and the identification of pathological biomarkers. It is therefore necessary to systematically evaluate the influence of tumor purity. Here, using public gastric cancer samples from The Cancer Genome Atlas (TCGA), we first showed that the numbers of mutations, called separately by four algorithms, were significantly positively correlated with tumor purities (all p < 0.05, Spearman rank correlation). Similar results were observed in nine other cancer types from TCGA. Notably, the result was further confirmed in six in-house samples from two gastric cancer patients and five in-house samples from two colorectal cancer patients with different tumor purities. Furthermore, the metastasis mechanism of gastric cancer may be incorrectly characterized, because both the numbers of mutations and the tumor purities of 248 lymph node metastatic (N + M0) samples were significantly lower than those of 121 non-metastatic (N0M0) samples (p < 0.05, Wilcoxon rank-sum test). Similarly, tumor purity could confound the analysis of histological subtypes of cancer and the identification of microsatellite instability (MSI) status in both gastric and colon cancer. Finally, we suggest that a higher tumor purity threshold, above 70% rather than 60%, would better meet the requirements of mutation calling. In conclusion, the influence of tumor purity on genomic mutation profiles and pathological analyses should be fully considered in further studies.

However, it has been reported that the identification of somatic mutations may be influenced by tumor purity (Koboldt et al., 2012; Cibulskis et al., 2013). Tumor tissues collected from patients contain not only tumor cells but also non-tumor cells, e.g., infiltrating stromal cells, immune cells, fibroblasts, and normal cells (Joyce and Pollard, 2009), which dilute the purity of tumor cells. Specifically, DNA from tumor samples is inevitably contaminated with non-tumor DNA. Varying tumor purities might affect mutation detection by altering the numbers of mutated reads, and consequently affect the biological interpretation of genomic analyses (Aran et al., 2015).
Several approaches have been proposed to reduce the influence of tumor purity on mutation detection. For example, most studies require samples with at least 60% tumor nuclei. However, this threshold of tumor purity remains to be further evaluated (Aran et al., 2015). In practice, it is often difficult to obtain samples with sufficient tumor purity for some cancers, such as diffuse gastric cancer and pancreatic adenocarcinoma. Laser capture microdissection (LCM) is commonly used to isolate pure tumor cells from tumor tissues (Espina et al., 2006), but it is costly and time-consuming, which makes it difficult to use widely in clinical settings. Other collection technologies have also been reported to isolate pure or putative tumor cells from tumor tissues. For example, DEPArray technology can isolate putative tumor cells from cancer samples, but it is difficult to handle large numbers of cells from large tumor volumes because of the sorting time and expense. Furthermore, several algorithms have been proposed to estimate tumor purity based on copy number and ploidy variations (Carter et al., 2012), methylation (Zheng et al., 2014), or expression levels of signature genes (Yoshihara et al., 2013). However, these estimates commonly reflect the average proportion of various cell types or are biased toward a certain cell type, and gene-level measurements are sensitive to experimental batch effects (Leek et al., 2010; Oesper et al., 2014). The evaluation and correction of tumor purity is therefore very difficult, and the gold standard still depends on pathologists. Hence, it is necessary to fully evaluate the influence of tumor purity on the analysis of genome mutation profiles.
Gastric cancer is one of the most common malignant tumors (Siegel et al., 2017). Tumor progression of gastric cancer, e.g., metastasis or post-surgery relapse, is the main cause of death, and tumor-node-metastasis (TNM) staging is an important indicator of tumor progression, in which T represents the primary tumor, N represents metastasis to regional lymph nodes, and M represents distant metastasis. Based on the TNM system, the absence or presence of lymph node metastasis is denoted as N0M0 or N + M0, respectively. Meanwhile, according to Lauren's pathological classification, gastric cancer can be divided into intestinal, diffuse, and mixed subtypes (Shah et al., 2011). Compared with the intestinal subtype, the diffuse subtype has a different pattern of spread and behavior and a worse prognosis (Shah et al., 2011). The TNM staging system and the pathological classification are routinely used to determine treatment strategies for gastric cancer patients. In addition, microsatellite instability (MSI) status is another indicator used to determine the treatment regimen in gastric and colon cancer, as patients with a high level of MSI (MSI-H) are less likely to benefit from 5-FU-based chemotherapy (Ilson, 2018). MSI status is commonly identified by immunohistochemistry and polymerase chain reaction (PCR), which measure the expression of putative genes or the mutations of putative sites. However, molecular analyses between N0M0 and N + M0 samples, or between diffuse and intestinal subtypes, as well as the identification of MSI status, may be affected by varying tumor purities.
In this study, mainly using public gastric cancer samples from The Cancer Genome Atlas (TCGA) as an example, we evaluated the influence of tumor purity on mutation detection, the analysis of pathological subtypes, and the identification of MSI status. Moreover, the biased influences were further evaluated in nine other cancer types from TCGA and in in-house samples with different tumor purities from the same cancer patients. To obtain robust biological interpretations of genomic and pathological analyses, we suggest that the biased influences of varying tumor purities should be fully considered.

Data and Pre-processing
Public Data and Pre-processing
The mutation profiles called by four algorithms (MuSE, MuTect2, SomaticSniper, and VarScan2) and the clinical information of stomach adenocarcinoma (STAD) samples were downloaded from TCGA (http://cancergenome.nih.gov/; Table 1). Generally, multiple slides sampled from the top to the bottom of the same tumor tissue were collected. Each slide consisted of tumor cells and non-tumor cells, and the percent of tumor nuclei in each slide was evaluated by pathologists. Following Yoshihara et al. (2013), the tumor purity of a sample was defined as the arithmetic mean of the percent of tumor nuclei across all slides. If the percent of tumor nuclei was unavailable for any of the slides, or was zero for all slides, the sample was excluded. Moreover, the mutation profiles and corresponding clinical information of nine other cancer types, including breast invasive carcinoma (BRCA), colorectal carcinoma (CRC), glioblastoma multiforme (GBM), brain lower grade glioma (LGG), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), pancreatic adenocarcinoma (PAAD), and prostate adenocarcinoma (PRAD), were also downloaded. In addition, 723 cancer genes were downloaded from the COSMIC database (Tate et al., 2019) and used to analyze the influence of tumor purity on the mutation calling of cancer genes.
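The purity definition and exclusion rules above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name, the use of None to mark an unavailable slide, the exclusion of samples with no slides, and the slide values are all assumptions made for the example.

```python
from statistics import mean
from typing import Optional, Sequence


def sample_tumor_purity(slide_percents: Sequence[Optional[float]]) -> Optional[float]:
    """Return the mean percent of tumor nuclei, or None if the sample is excluded.

    A value of None in the input marks a slide whose percent of tumor
    nuclei is unavailable (an encoding assumed for this sketch).
    """
    # Assumption: a sample without any slide cannot be scored and is excluded.
    if not slide_percents:
        return None
    # Exclude the sample if any slide lacks a tumor-nuclei estimate.
    if any(p is None for p in slide_percents):
        return None
    # Exclude the sample if every slide reports zero tumor nuclei.
    if all(p == 0 for p in slide_percents):
        return None
    return mean(slide_percents)


# Hypothetical slide-level values, for illustration only.
samples = {
    "sample_A": [80.0, 70.0],  # kept: arithmetic mean of two slides
    "sample_B": [60.0, None],  # excluded: one slide is unavailable
    "sample_C": [0.0, 0.0],    # excluded: all slides are zero
}
purities = {name: sample_tumor_purity(vals) for name, vals in samples.items()}
```

Returning None for excluded samples keeps the exclusion explicit, so that downstream steps cannot silently treat a missing purity as zero.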

In-house Data and Measurement
Six surgical resection specimens from two gastric cancer patients were analyzed by whole-exome sequencing with a mean depth of 80-100×. For each patient, three specimens were sampled from three different locations of a tumor tissue whose diameter was at least 50 mm. The tumor purities of the six samples, measured by pathologists, ranged from 26.5 to 92.5%, as shown in Table 2. In addition, five surgical resection specimens collected from two colorectal cancer patients in our previous study were used to validate the influence of tumor purity on mutation detection (Yan et al., 2019). The tumor purities of the five colorectal cancer samples ranged from 40 to 100% (Table 2). This study was approved by the institutional review boards of all participating institutions, and written consent forms were obtained from all participants. Afterward, according to the manufacturer's protocol, total DNA was isolated from the fresh frozen gastric tumor tissues. The generated raw whole-exome sequencing files (.fastq) were preprocessed using Trimmomatic (Bolger et al., 2014), and reads were aligned to the reference genome (GRCh37) using the Burrows-Wheeler Aligner (BWA; Li and Durbin, 2009). Finally, the mutations were called using default parameters. In this study, mutations included single nucleotide variations (SNVs) and indels (insertions and deletions shorter than 50 bp). They were filtered to exclude sites of germline risk based on the gnomAD variant dataset file. Only those SNVs that were identified as mutations were further analyzed.

Statistical Analysis
Spearman rank correlation analysis was used to assess the correlation between the numbers of mutations and the corresponding tumor purities in tumor samples. The Wilcoxon rank-sum test was used to assess the difference in tumor purities (or numbers of mutations) between two groups of samples. Fisher's exact test was used to evaluate the significance of differences in the mutation frequencies of genes between high-purity and low-purity samples or between N0M0 and N + M0 samples, where N0M0 and N + M0 represent non-metastatic samples and lymph node metastatic samples of gastric cancer, respectively. The hypergeometric test and the cumulative binomial test were used to assess the impact of sample size on the comparisons between groups and on the correlation between numbers of mutations and tumor purities, respectively.
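As a minimal sketch of the Spearman rank correlation named above, the coefficient can be computed as the Pearson correlation of average ranks. The function names are introduced here for illustration, the mutation counts are hypothetical, and the p-value is not computed; in practice a library routine such as scipy.stats.spearmanr would normally be used.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


# Tumor purities (%) of three samples and hypothetical mutation counts.
purity = [26.5, 72.5, 92.5]
n_mutations = [10, 25, 40]  # illustrative values, not data from this study
rho = spearman_rho(purity, n_mutations)
```

Because the coefficient depends only on ranks, it is insensitive to the very different scales of purity (percent) and mutation counts.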

Tumor Purity Confounds Mutation Detection
Taking gastric cancer as an example, we first analyzed the associations between the numbers of mutations called by each of four mutation calling algorithms (MuSE, MuTect2, SomaticSniper, and VarScan2) and the corresponding tumor purities. Tumor purities of gastric cancer samples were widely distributed, ranging from 5 to 100%, and about 72% of gastric cancer samples had a tumor purity higher than 70%. The results showed that the numbers of mutations called by the MuSE and SomaticSniper algorithms were significantly positively correlated with tumor purities (p = 2.22e-05 for MuSE and p = 1.84e-05 for SomaticSniper). Similar results were observed for the numbers of mutations called by the MuTect2 (p = 1.00e-04) and VarScan2 (p = 7.73e-06) algorithms, which implement correction parameters for tumor purity. Notably, significantly positive correlations between numbers of mutations and tumor purities were also observed in the nine other cancer types (Table 3). These results suggested that mutation detection might be significantly influenced by varying tumor purities. We then verified the influence of tumor purity on mutation detection using the MuTect2 algorithm in six in-house gastric tumor samples, which were sampled from three different locations with different tumor purities in each gastric cancer patient. The results showed that, for samples from the same patient, the numbers of mutations decreased as the tumor purities decreased, as shown in Figures 1A,B. Similar results were observed in five in-house colorectal tumor samples collected from two patients, as shown in Figures 1C,D. These results further confirmed that varying tumor purities might affect the numbers of mutations. Moreover, the numbers of mutations detected by the VarScan2, SomaticSniper, and MuSE algorithms similarly decreased with tumor purity, as shown in Supplementary Table 1.
Additionally, we analyzed the numbers of mutated reads aligned to each mutation site in the sequenced gastric cancer samples. For patient GC-1, among 19 SNVs that were identified in the samples with tumor purities of 92.50 and 72.50%, 15 SNVs were not detected in the sample with the lowest tumor purity of 26.50%. Nevertheless, several mutated reads were aligned to these sites (1-4 reads for 14 SNVs and 6 reads for 1 SNV). Similarly, for patient GC-2, 14 SNVs were not identified as mutations in position C, which had a tumor purity of 33%, although 1-5 mutated reads were aligned to each of these sites. The mutation sites missed in position C of the two patients included sites in the genes FBXO11 and XPO1, which are listed as cancer genes in the COSMIC database (Table 4). These results indicated that an artificially low mutation burden might result from low tumor purity.
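The dilution of mutated reads described above can be illustrated with a simple idealized calculation. The sketch below assumes a clonal mutation on one of two copies in a diploid tumor region mixed with diploid normal cells; this model, the function names, and the depth value are assumptions for illustration and are not taken from the analysis in this study. Real sites may carry fewer mutated reads than this expectation, for example when a mutation is subclonal or local coverage is low.

```python
def expected_vaf(purity: float, tumor_copies: int = 2,
                 mutated_copies: int = 1, normal_copies: int = 2) -> float:
    """Expected fraction of reads carrying a clonal mutation.

    purity is the fraction of tumor cells in the sample (0 to 1).
    The default copy numbers encode the assumed diploid, heterozygous case.
    """
    mutant_alleles = purity * mutated_copies
    total_alleles = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant_alleles / total_alleles


def expected_mutated_reads(purity: float, depth: float) -> float:
    """Expected number of mutated reads at a site covered by `depth` reads."""
    return expected_vaf(purity) * depth


depth = 90  # hypothetical value within the 80-100x mean depth reported above
for purity in (0.925, 0.725, 0.265):
    reads = expected_mutated_reads(purity, depth)
    print(f"purity={purity:.3f} expected mutated reads={reads:.1f}")
```

Under these assumptions the expected number of supporting reads falls in proportion to purity, which is why a site that is confidently called in a high-purity sample may fall below a caller's evidence threshold in a low-purity sample from the same tumor.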

Tumor Purity Confounds the Mutation Differences Between Metastasis and Non-metastasis of Gastric Cancer
Based on the non-synonymous mutation data of primary gastric cancer samples from the TCGA database, called by the MuTect2 algorithm, we found that the numbers of mutations in 248 N + M0 samples tended to be lower than those in 121 N0M0 samples (p = 5.14e-02, Wilcoxon rank-sum test, Figure 2A). We then compared multiple clinical factors between the two subgroups, including age, gender, tumor purity, and grade, and found that only tumor purity differed significantly. The tumor purities of N + M0 samples were significantly lower than those of N0M0 samples (p = 1.77e-02, Wilcoxon rank-sum test, Figure 2B). To remove the biased influence of sample size, we randomly selected 121 samples from the 248 N + M0 samples and compared tumor purities and numbers of mutations between the 121 N0M0 and 121 N + M0 samples. The random experiment was repeated 1,000 times. Tumor purities differed significantly between N0M0 and N + M0 samples in 546 of the repeats, numbers of mutations differed significantly in 388, and both differed significantly in 246 (all p < 0.05, Wilcoxon rank-sum test). These results were unlikely to have occurred by chance (p < 1.00e-16, hypergeometric test), which indicated that unbalanced sample sizes were not the main cause of the mutation differences between N0M0 and N + M0 samples. After removing diffuse gastric tumor samples, which have high heterogeneity, similar phenomena were observed in intestinal gastric cancer: the numbers of mutations in 115 N + M0 samples were significantly lower than those in 46 N0M0 samples (p < 8.40e-03, Wilcoxon rank-sum test, Figure 2C), and the tumor purities of the 115 N + M0 samples were also significantly lower than those of the 46 N0M0 samples (p < 4.24e-02, Wilcoxon rank-sum test, Figure 2D).
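The random experiment described above can be sketched as follows. This is an illustrative outline, not the code used in the study: the function names and the seed are hypothetical, and the rank-sum p-value uses a tie-corrected normal approximation without continuity correction, which may differ slightly from the exact implementation used by the authors.

```python
import math
import random
from typing import Sequence


def rank_sum_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    # Pool both groups, remembering the group label (0 for a, 1 for b).
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1      # average 1-based rank of the tied run
        t = j - i + 1                   # size of the tied run
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j + 1) if pooled[k][1] == 0)
        i = j + 1
    u = rank_sum_a - n1 * (n1 + 1) / 2  # Mann-Whitney U statistic for group a
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0                      # all values identical: no evidence of a shift
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


def count_significant_subsamples(small: Sequence[float], large: Sequence[float],
                                 n_repeats: int = 1000, alpha: float = 0.05,
                                 seed: int = 0) -> int:
    """Count repeats in which a random size-matched subset of `large` differs from `small`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        subset = rng.sample(list(large), len(small))
        if rank_sum_pvalue(small, subset) < alpha:
            hits += 1
    return hits
```

With the purities (or mutation counts) of the N0M0 samples passed as `small` and those of the N + M0 samples as `large`, the returned count corresponds to the number of significant repeats reported above.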
The results indicated that the difference in the numbers of mutations between N0M0 and N + M0 samples may be mainly caused by variations in tumor purity. The lower tumor purities of N + M0 samples could lead to an artificially lower mutation burden than that of N0M0 samples. We also found that the mutation frequencies of 1,184 genes were significantly different between N0M0 and N + M0 samples (p < 0.05, one-sided Fisher's exact test). Subsequently, we divided the primary gastric tumor tissues into two groups according to tumor purity. In total, 129 samples with tumor purities of at least 80% were assigned to the high-purity group, while 127 samples with tumor purities of less than 70% were assigned to the low-purity group. The information on low- and high-purity samples in different categories is shown in Table 5. The numbers of mutations in low-purity samples were significantly lower than those in high-purity samples (p < 0.05, Wilcoxon rank-sum test, Figures 2E,F). Similarly, the mutation frequencies of 1,247 genes were significantly different between the high-purity and low-purity groups (p < 0.05, one-sided Fisher's exact test). Of these, 184 genes overlapped with the 1,184 genes whose mutation frequencies differed between N0M0 and N + M0 samples, and 182 of the overlapping genes had significantly higher mutation frequencies in both N0M0 samples and high-purity samples. These included the genes SLC3A2 and APC, which are associated with metastasis and neoplasia (Ghatak et al., 2017; Wang et al., 2017). These results indicated that varying tumor purities had an impact on the mutation differences between N0M0 and N + M0 samples, which might confound the interpretation of the metastasis mechanism of gastric cancer.
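The one-sided Fisher's exact test used above for per-gene mutation frequencies can be sketched for a single gene as follows. The function name and the mutated-sample counts are hypothetical; only the group sizes (129 high-purity and 127 low-purity samples) come from the text.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Tests whether the first row is enriched for the first column, i.e.
    P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)
    upper = min(row1, col1)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible tables contribute nothing to the sum.
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / denom


# Hypothetical gene: mutated in 30 of 129 high-purity and 10 of 127 low-purity samples.
high_mut, high_n = 30, 129
low_mut, low_n = 10, 127
p_value = fisher_exact_greater(high_mut, high_n - high_mut, low_mut, low_n - low_mut)
```

Using exact integer binomial coefficients avoids the rounding problems of factorial-based floating-point formulas at these sample sizes.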

Tumor Purity Confounds the Molecular Analysis of Gastric Cancer Subtypes
We then evaluated the influence of tumor purity on the mutation analysis between the diffuse and intestinal histological subtypes of gastric cancer. No significant difference in tumor purity was observed between 70 diffuse samples and 190 intestinal samples (p = 1.45e-01, Wilcoxon rank-sum test). However, after excluding five intestinal and four diffuse unrepresentative samples that had only one slide with more than 90% tumor purity, the tumor purities of 66 diffuse samples tended to be lower than those of 185 intestinal samples (p = 5.04e-02, Wilcoxon rank-sum test), while the numbers of mutations in the diffuse subtype were significantly lower than those in the intestinal subtype (p = 9.49e-05, Wilcoxon rank-sum test), as shown in Figure 3A. Furthermore, significant differences in both tumor purities and numbers of mutations were also observed between the histological subtypes of lung cancer (LUAD and LUSC) and of glioma (GBM and LGG), as shown in Figure 3B. These results suggested that varying tumor purities might confound the mutation differences between histological subtypes of cancer.

Tumor Purity Confounds the Identification of MSI Status
We further evaluated the influence of varying tumor purities on the identification of a known pathological biomarker, MSI status, which is commonly used to determine the follow-up treatment regimen for gastric and colon cancer patients. In gastric cancer, the tumor purities of 241 microsatellite-stable samples were significantly lower than those of both 72 MSI-H samples and 56 samples with a low level of MSI (MSI-L) (all p < 0.05, Wilcoxon rank-sum test, Figure 3C). Compared with gastric cancer samples, the tumor purities of colon cancer samples were narrowly distributed, and 86% of the colon cancer samples had a tumor purity of ≥70%. No significant correlation was observed between the number of mutations and tumor purity in colon cancer. However, the tumor purities of 83 MSI-H samples were significantly higher than those of 82 MSI-L samples (p = 4.46e-02, Wilcoxon rank-sum test) and marginally higher than those of 291 microsatellite-stable samples (p = 7.20e-02, Wilcoxon rank-sum test), as shown in Figure 3C. These results suggested that varying tumor purities might confound the identification of MSI status.

An Appropriate Threshold of Tumor Purity for Mutation Calling
Finally, we took gastric cancer as an example to identify an appropriate tumor purity for mutation calling. Following the requirement of at least 60% tumor purity used in most studies, we first removed the gastric cancer samples with tumor purity below 60% and observed that the numbers of mutations called by the four algorithms were still significantly positively correlated with tumor purities (p < 0.05, Table 6). These results indicated that a higher tumor purity may be needed for mutation calling. We then analyzed samples with a tumor purity higher than 70%. No significant correlation was observed between tumor purity and the number of mutations, except for the SomaticSniper algorithm. Moreover, no significant correlation between tumor purities and numbers of mutations was observed in the nine other cancer types, except for LGG (Table 6).
To remove the influence of sample size, we randomly selected, from the gastric cancer samples with ≥60% tumor purity, the same number of samples as had a tumor purity above 70%, and calculated the correlation between tumor purities and numbers of mutations. The random experiment was repeated 1,000 times. Finally, a cumulative binomial test was used to assess the significance of the positive correlation across the 1,000 random experiments. The results showed that 65.50% of the 1,000 random experiments yielded significant correlations for the MuTect2 algorithm, and more than 80% yielded significant correlations for each of the other three algorithms (all p < 0.05, binomial test, Supplementary Table 2). Similar results of random experiments were also observed in multiple other cancer types (Supplementary Table 1). These results indicated that sample size was not the major factor behind the correlation between the number of mutations and tumor purity. In summary, a tumor purity above 70%, rather than 60%, might better meet the requirements of mutation calling.
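The cumulative binomial test mentioned above can be sketched as an upper-tail binomial probability. The null success probability is not stated in the text; the value of 0.05 used below (the chance of a significant correlation arising at the 0.05 level under no association) is an assumption made for this example, as is the function name.

```python
import math


def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0), with each term computed in log space."""
    log_p, log_q = math.log(p0), math.log1p(-p0)
    tail = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        tail += math.exp(log_pmf)
    return min(tail, 1.0)  # guard against rounding slightly above 1


n_experiments = 1000
n_significant = 655  # 65.50% of 1,000 random experiments, as reported for MuTect2
p0 = 0.05            # assumed null probability of a significant result by chance
p_value = binomial_upper_tail(n_significant, n_experiments, p0)
```

Working with log-probabilities avoids forming very large binomial coefficients and very small powers separately; individual terms that are too small to represent simply underflow to zero.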

DISCUSSION
As shown in this study, the numbers of mutations were significantly positively correlated with tumor purities in gastric cancer and nine other cancer types, regardless of the calling algorithm. Lower tumor purities may lead to an artificially lower mutation burden, which may consequently cause misleading biological interpretations of metastasis mechanisms, pathological subtypes, and pathological biomarker analyses. Finally, we suggest that a tumor purity above 70% would better meet the requirements of mutation calling.
Moreover, the genes FBXO11, XPO1, SLC3A2, and APC, whose mutation detection may be affected by varying tumor purities in gastric cancer, are closely related to cancer occurrence and development. For example, the FBXO11 protein has both E3 ubiquitin ligase and methyltransferase activity and can facilitate epithelial-mesenchymal transition (EMT), promote PI3K/AKT pathway activation, and regulate metastasis and apoptosis in human cancer (Kim et al., 2018, 2020; Sun et al., 2018). The XPO1 protein is positively correlated with cell proliferation and growth transformation, and negatively correlated with poor survival outcomes, which could be a promising molecular target in gastric cancer (Subhash et al., 2018; Gruffaz et al., 2019; Sexton et al., 2019). The SLC3A2 protein is associated with the migration and invasion of tumor cells (Wang et al., 2017) and is a potential biomarker for molecular imaging-based detection of gastric cancer (Yang et al., 2012). The APC gene, which is involved in the Wnt/β-catenin signaling pathway, has been reported to be associated with tumorigenesis, tumor metastasis, and resistance (Yang et al., 2018b).
Currently, many studies have proposed that tumor mutation burden (TMB) can predict the response to immunotherapy (Goodman et al., 2017; Qin et al., 2018), with patients with high TMB commonly responding better to immunotherapy than patients with low TMB. However, owing to differences in surgical sampling or biopsy sites of tumor tissue, the TMB and pathological biomarkers such as PD-L1 (Anagnostou et al., 2017; Qin et al., 2018) could be affected by varying tumor purities. To address this problem, some studies have proposed increasing the sequencing depth to reduce the false negatives caused by low tumor purity, but this might also sharply increase the false positives of mutation detection, as well as the workload and cost. Regarding the threshold of tumor purity, TCGA originally required at least 80% tumor nuclei (Aran et al., 2015), but it is generally difficult to collect a sufficient number of such samples. This threshold was later reduced to 60% as RNA-seq technology developed, and most current studies set the threshold at 60%. However, Aran et al. (2015) indicated that the impact of a 60% tumor purity threshold on the interpretation of genomic analyses remained to be evaluated. Our results in ten cancer types showed that a tumor purity above 70%, rather than 60%, might better meet the requirements of mutation calling and yield relatively sufficient and reliable mutation profiles. A novel mutation detection algorithm for tumor samples with low purity should also be developed.
A major limitation is that tumor heterogeneity, pathological subtypes, and the clonal selection of mutations do affect mutation calling during tumor occurrence and development (Gerlinger et al., 2012), and these effects could not be excluded in this study. However, our study revealed that there were consistently significant correlations between numbers of mutations and tumor purities across ten cancer types. Although the sample size of the in-house data is small, the finding that low tumor purity resulted in fewer detected mutations was further demonstrated in six gastric cancer samples from two patients and five colorectal cancer samples from two patients with different tumor purities. This suggested that the numbers of mutations were influenced by tumor purity regardless of tumor type, and that the influence of tumor purity on the number of mutations deserves attention.
In conclusion, the influence of varying tumor purities on mutation detection and pathological analyses should be fully considered in further analyses. We suggest that a tumor purity of more than 70% would better meet the requirements of mutation calling.

DATA AVAILABILITY STATEMENT
The in-house data used and analyzed during the current study are available from the corresponding authors upon reasonable request.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by The Affiliated Union Hospital of Fujian Medical University. The patients/participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.