Quantitative Estimation of Oxidative Stress in Cancer Tissue Cells Through Gene Expression Data Analyses

Quantitative assessment of the intracellular oxidative stress level is a very important problem since it is the basis for elucidation of the fundamental causes of metabolic changes in diseased human cells, particularly cancer. However, the problem proves to be very challenging to solve in vivo because of the complex nature of the problem. Here a computational method is presented for predicting the quantitative level of the intracellular oxidative stress in cancer tissue cells. The basic premise of the predictor is that the genomic mutation level is strongly associated with the intracellular oxidative stress level. Based on this, a statistical analysis is conducted to identify a set of enzyme-encoding genes, whose combined expression levels can well explain the mutation rates in individual cancer tissues in the TCGA database. We have assessed the validity of the predictor by assessing it against genes that are known to have anti-oxidative functions for specific types of oxidative stressors. Then the applications of the predictor are conducted to illustrate its utility.

Unlike hypoxia or acid-base imbalance whose related stress can be measured/estimated in terms of the concentration of one or a few molecular species such as O 2 or H+, there are many types of oxidizing molecules such as reactive oxygen species (Barash et al., 2010;Reczek and Chandel, 2017), reactive nitrogen species (Fionda et al., 2016;Kruk and Aboul-Enein, 2017), reactive lipid species (Higdon et al., 2012;Graham, 2016) among others. Furthermore, to estimate the stress level, it also requires to know the total cellular reducing capacity. This is also challenging since human cells not only have a basic set of anti-oxidation (or reducing) capacities consisting of (1) glutathione and associated enzymes (e.g., glutathione transferase and peroxidase), (2) vitamin A, C, E and derivatives like beta-carotene, and (3) anti-oxidation enzymes like SOD (superoxide dismutase), PRX (periaxin), and TRX (thioredoxin) (Chang et al., 2008;Mantovani et al., 2010), but also rely on antioxidation capacities of various other molecules including some enzymes and fatty acids. Among them, the main functions of the enzymes may not be for anti-oxidation (Osmundsen et al., 1982;Wieczorek et al., 2017), while the anti-oxidative properties of fatty acids have long been well-established (Richard et al., 2008;Freitas et al., 2017). All these make it very difficult to pin down on what molecular species should be used when assessing the level of oxidative stress.
A few proposals have been made regarding possible biomarkers for intracellular level of oxidative stress such as the carbonylation level (aldehydes and ketones) of proteins (Dalle-Donne et al., 2003;Hacişevki et al., 2012;Fernando et al., 2016), the level of oxidized low-density lipoprotein (Itabe, 2012;Osman et al., 2016), oxidized products of lipids such as 4-HNE (4-hydroxynonenal) and MDA (malondialdehyde) (Niki, 2008;Halder and Bhattacharyya, 2014;Teppner et al., 2015), and protein thiols (Giustarini et al., 2012(Giustarini et al., , 2017. There are two general issues with these biomarkers: (1) they tend to reflect the level of oxidation by specific oxidizing molecules; and (2) more importantly, they are not high-throughput, hence it is impractical for large-scale analyses, such as analyses of TCGA tissue samples (https://portal.gdc.cancer.gov/) to elucidate possible causes of specific metabolic changes in such tissues.
The goal here is to identify a set of genes whose mRNA expression levels can collectively reflect the overall level of intracellular oxidative stress present in a cancer tissue. The strategy is: (1) the somatic point-mutation rate in each cancer tissue sample is used as an indicator for the level of intracellular oxidative stress, an idea that has been well-established and applied (Doudican et al., 2005;Slane et al., 2006;Xu et al., 2014;Fitzgerald et al., 2017;Markkanen, 2017;Rao et al., 2017;Tubbs and Nussenzweig, 2017); (2) a few enzyme classes are selected, which are known to have anti-oxidation activities such as EC 3.1.-, EC 3.6.-, EC 2.4.-, and EC 2.7.- (Kato et al., 1995;Kong et al., 2007); and (3) a subset of genes are selected from these enzyme classes, whose combined expressions correlate strongly with the mutation rates in the matching genomes, determined through regression analyses. One assumption used in the analysis is that the oxidative stress levels in cytosol and nucleus are the same, which is reasonable knowing that the nuclear pores are large enough to allow most of the oxidizing molecules to go through freely between the two compartments; and increased mutation rates in cancer are known to be related to the nucleus oxidative stress level (Chung et al., 2014;Markkanen, 2017).
By applying this strategy to gene-expression data and matching genomic mutation data of cancer tissues of 14 cancer types in TCGA, representing all those with sufficiently large sample sizes, we have trained a predictor for the intracellular level of oxidative stress for each of the 14 cancer types. They are BLCA (bladder urothelial carcinoma), BRCA (breast invasive carcinoma), COAD (colon adenocarcinoma), ESCA (esophageal carcinoma), HNSC (head and neck squamous cell carcinoma), KICH (kidney renal papillary cell carcinoma), KIRC (kidney chromophobe), KIRP (kidney renal clear cell carcinoma), LICH (liver hepatocellular carcinoma), LUAD (lung squamous cell carcinoma), LUSC (lung adenocarcinoma), PRAD (prostate adenocarcinoma), STAD (stomach adenocarcinoma), and THCA (thyroid carcinoma). We have then validated the predictor against data with known oxidative stress related information. A key advantage in having a gene-expression data-based predictor is that RNA-seq data tend to be collected in general for cancer research; and such a predictor does not require a user to know the mutation rate distribution of the cancer type under study since such information is already encoded in the predictor for each cancer type.

Genomic Mutation Profiles
We have calculated and plotted the distribution of the number of point-mutations in coding regions per genome across all cancer genomes for each of the 14 cancer types, as shown in Figure 1. From the figure, LUAD has the highest average mutation rate at 347.20 per genome, and KICH has the lowest one at 19.75 per genome. The following lists the average numbers of the remaining 12 cancer types in the descending order: 288.95 mutations in LUSC,272.02 in COAD,257.50 in STAD,234.71 in BLCA,213.26 in HNSC,124.59 in ESCA,147.65 in LIHC,303.75 in BRCA,111.14 in KIPR,113.94 in KIRC,138.75 in PRAD,and 134.18 in THCA.

Selection of Antioxidant Enzyme-Encoding Genes
Some classes of enzymes have long been known to have anti-oxidative functions as their main function such as the antioxidant enzymes mentioned earlier while some others have anti-oxidation as their secondary function such as cytochromes P450 and mitogen activated protein kinases (Limón-Pacheco and Gonsebatt, 2009). And yet, increasingly more proteins have been found to have antioxidant roles in addition to their main functions such as translocases (Tang et al., 2017), hydrolase (Liu et al., 2018), and hexokinase (Heneberg, 2019).
Based on such information, we have conducted a preliminary regression analysis of the mutation rates against the expression data of all the enzyme-encoding genes in the same samples for each of the 14 cancer types (with a Lasso penalty). Interestingly, genes that give rise to good regression results across all 14 cancer types generally fall into a small set of enzyme subclasses, particularly EC 3.-(hydrolases) and EC 2.-(transferases). Hence, we have conducted a second round of regression against expression data of genes only in these two EC classes.
From genes selected for each of the 14 regression models, we note the following: (1) in all 14 cancer types, genes in each regression model fall into exactly four sub-subclasses of EC 2.  Table S1 gives the gene names selected in the regression model for each cancer type.

Linear Regression Analyses
For each cancer type, a linear regression model is trained to predict the mutation rate using expression data in the same cancer sample, of selected genes from some EC subclasses as discussed above. The detailed objective function is described in the Methods section. To ensure the quality of each trained model, we have randomly selected 2/3 of the samples as the training data, and used the remaining 1/3 as the test data. Figure 2 shows the predicted values by the best trained model vs. the actual mutation rates across all samples for each cancer type. Similar plots for the test samples are given in Figure S1. Table 1 summarizes the level of contribution by genes of each EC subclass to the regression result for each cancer type, with the detailed information of the 14 models being given in Table S2.

Prediction Validation
We have conducted the following analyses to provide supporting data to our prediction model.

Validation Against Fatty Acid Synthesis Genes
It is well-known that fatty acids serve as antioxidants in cancer and other disease cells (Richard et al., 2008;Freitas et al., 2017). Hence, we anticipate that our predictor should have some level of correlation with the fatty acid synthesis process. Table 2 shows the best correlation coefficient between our predictor and one of the four fatty acid synthesis genes (see Methods) across the 14 cancer types, with Table 3 giving the statistical significance of the observed correlations.

Validation Against Mucin Genes
Mucins have been found to have elevated levels across numerous cancer types and are known to have anti-oxidation roles (Takeyama et al., 2000;Wu et al., 2009;Dewald et al., 2016). We have examined their expression levels and our predictor. Table 4 shows the best correlation coefficient between our predictor and one of the mucin genes across the 14 cancer types, with Table 5 giving the statistical significance of the observed correlations. And the statistical significance of the observed correlations for each Mucin genes is given in Table S3 in detail.

Validation Against Glutathione Synthesis Genes
Glutathione is a molecule human cells use as a key antioxidant. We have examined their expression levels and our predictor. Table 6 shows the R2 values between our predictor and Frontiers in Genetics | www.frontiersin.org  the glutathione synthesis genes across the 14 cancer types, with Table 7 giving the statistical significance of the observed correlations. It is noteworthy that our predictor is designed to predict the level of oxidative stress; and each of the above three groups of genes is known to be associated with the level of oxidative stress and not involved in the training dataset. The observed strong correlations between our predicted oxidative stress levels and the expression levels of each such group, along with significant pvalues, provide strong support for that our predictor captures the anti-oxidative level from different independent aspects, hence indicating the validity of our trained predictor as an indicator for the level of oxidative stress.

Application
To demonstrate the utility of our predictor, we have calculated the average oxidative stress levels for each of the four stages vs. the matching controls in each of the 14 cancer types, as detailed in Figure 3. From the figure, we can see: (1) cancer tissue cells have elevated oxidative-stress levels than matching controls in all of 14 cancer types; and (2) the oxidative stress level tends to progressively increase as a cancer advances from stage 1 through stage 4, in 10 out of 14 cancer types with at most one predicted level of oxidative stress being out of order. This is clearly consistent with our general understanding about cancer progression. We have also conducted co-expression analyses to find genes that strongly correlate with our predictor in each cancer type and analyzed the pathways enriched by such genes. Table S4 gives the up-to top 100 enriched pathways with p < 0.05 in each cancer. We note that the enriched pathways are highly consistent across different cancer types. Furthermore, we note based on extensive literature search that a majority of the enriched pathways, marked in bold letters, is: (1) involved in anti-oxidation activities, (2) induced by oxidative stress, (3) induces oxidative stress, (4) involved in reactive oxidative stress signaling, and/or (5) is altered by oxidative stress. While these data provide supporting data to our predictor, pathways not known to be oxidative stress related may provide useful candidates for elucidation of the overall oxidative-stress responding mechanisms in cancer cells.

DISCUSSION
Quantitative assessment of the intracellular oxidative stress will prove to be an invaluable tool for elucidation of the possible causes of various changes in cancer cells, including extensive metabolic reprogramming. However, the problem proves to be very challenging because there are large numbers of contributors to both the total oxidizing power and the anti-oxidizing capacity in human cells. Previous studies tend to focus on oxidative stress induced by specific molecular species such as H2O 2 or lipid radicals rather than the overall oxidative stress level. To the best our knowledge, our work is the first computational tool for estimating the oxidative stress level. The basic premise of our tool is that the genomic mutation level is strongly associated with the overall intracellular oxidative stress level. The second premise is that many enzymes have anti-oxidation capacity as reported by numerous authors (Sies, 1997;Rajput et al., 2013). One surprising result in our regression analyses is that genes in two EC classes, EC 2.-(transferases) and EC 3.-(hydrolases), specifically four subclasses of these two, can be used to well interpret the mutation rate in each cancer sample by a linear combination of their expressions with high statistical significance. This strongly suggests that all the enzymes encoded by these genes play roles in cellular anti-oxidation, which is clearly a novel discovery and warrants further investigation. Furthermore, different cancer types tend to use a distinct combination of genes from four subclasses of enzymes strongly suggest that these cancer types may encounter different types of oxidative stress, hence using different combinations of antioxidation enzymes. This also warrants further study regarding why genes in different enzyme classes show strong correlations with mutation rates in different cancer types, therefore to understand the detailed mechanisms of their anti-oxidative functions possibly for different types of oxidants.
As mentioned in the Introduction, oxidative stress may arise from different molecular species such reactive oxygen species, reactive nitrogen species, reactive lipid species and various free radicals. Different molecular species might be utilized to consume specific types of oxidizing molecules. By showing that three independent classes of molecules all have statistically significant correlation coefficients with our general-purpose predictor for oxidative stress, we clearly have strong support for the validity of our predictor.
The availability of this new tool make open new doors for studying impact of oxidative stress on various chronic inflamed diseases, including cancer, including (i) elucidation of all metabolic processes, particularly reprogrammed metabolisms observed in cancer and other diseases, that are statistically associated with the level of oxidative stress, hence providing a new capability for detection of the possible causes of various altered metabolisms, and (ii) systematic analyses of different classes of enzymes in terms of their anti-oxidative roles, which could provide potentially powerful and new targets for treating various chronic diseases, including cancer.
We anticipate that our predictor will prove a powerful tool useful for elucidation of causes of variety of systematic changes, including metabolic reprogramming to gain indepth understanding of why specific metabolic pathways are reprogrammed and certain cellular functions tend to be repressed or hyper-activated in cancer.
Genomic mutation data were derived from the wholeexome sequencing data. Specifically, somatic changes are identified through comparing allele frequencies in the aligned DNA sequences of each cancer and the matching normal samples, using the GDC DNA-Seq analysis pipeline (GDC DNA-Seq analysis pipeline: https://docs.gdc.cancer.gov/ 1 TCGA data portal: https://portal.gdc.cancer.gov/

Synthesis of Fatty Acids
We have used the four fatty-acid synthesis genes: FASN (Fatty Acid Synthase), ACAT1 (Acetyl-CoA Acetyltransferase 1), ACAT2 (Acetyl-CoA Acetyltransferase 2), and MCAT (Malonyl-CoA-Acyl Carrier Protein Transacylase) to reflect the level of fatty acid synthesis, and have calculated the Pearson correlation coefficient between the expressions of one of them and our predictor.

Glutathione Synthesis
We have used the three biosynthesis genes: GCLC (glutamatecysteine ligase catalytic subunit), GCLM (glutamate-cysteine ligase modifier subunit) and GSS (glutathione synthase) to reflect the level of glutathione synthesis. As above, we have calculated the Pearson correlation coefficient between the expressions of the glutathione synthesis genes and our oxidative stress predictor.

Differential Expression Analyses
We have used edgeR in the R package (edgeR package: https:// www.r-project.org/) to determine if a gene is differentially expressed in cancer vs. control samples of the same cancer type. T-test is applied to estimate the statistical significance of each gene considered as differentially expressed, using 0.05 as the cut-off.

Linear Regression Analysis
We have conducted a linear regression analysis of the observed mutation rates, Y n , in n samples of a specified cancer type against the expressions of m selected genes across n samples, X n,m , so that residual ||ε|| is as small as possible as defined below: where B m is a coefficient vector with its m values to be determined through solving this optimization problem. To avoid using too many genes in the regression analysis, we have included a penalty for penalizing using more variables than necessary.
where λ is an (adjustable) constant. This problem can be solved using a least squared regression analyses in the following form: For each Y, we have retrieved the number of point mutations per cancer genome. Then for the n cancer genomes of each of the 14 cancer types, we have n numbers, namely the mutation number divided by the number of genes (20,000), which gives rise to the Y. For the ith row of X, we have retrieved the expression data of m selected genes in the ith sample of the same cancer type. For each regression analysis, we have used the R package to solve the minimization problem. The regression result has a pvalue associated with each of the m values of B. We then remove those genes with insignificant p-values, i.e., > 0.05, for the second round of regression analyses mentioned in the Results section.

Correlation Analyses
We have used Pearson correlation coefficient to calculate the linear correlation between two lists of numbers, with one being the combined gene-expression data and the other being mutation rates of the same samples.

Pathway Enrichment Analysis
We have conducted a pathway enrichment analysis over a given set of genes found to be strongly correlated with our predictor for oxidative stress using DAVID tool (David links: https://david.ncifcrf.gov/) against the combined database of GO/Biological Process, KEGG and Reactome pathways. A pathway is considered enriched if tits adjusted p < 0.05.

Assessment of the Level of Contribution by Each Enzyme Subclass in Regression Model
For each derived regression model against selected enzymes, the following is used to estimate the level of contribution by each subclass of enzymes. The package used for linear repression construction provides a p-value for each selected gene, indicating the level of the gene's contribution to the regression result with smaller p-value representing higher level of contribution. We then use the product of the p-values of the selected genes from the same enzyme subclass as the p-value of the subclass.

SOFTWARE AVAILABILITY
Our predictor is written in R language and is available upon request.

AUTHOR CONTRIBUTIONS
LL designed the method, carried out computational analyses, analyzed computational results, and wrote the article. HC supported in data analysis. YX conceived the project, reviewed computational analyses, and revised the article.

ACKNOWLEDGMENTS
LL wants to extend her thanks to members of the Computational Systems Biology Lab (CSBL) in the University of Georgia, for their helpful discussions related to the study here.