Prediction of Gene Expression Patterns With Generalized Linear Regression Model

Cell reprogramming has played important roles in medical science, such as tissue repair, organ reconstruction, disease treatment, new drug development, and new species breeding. Oct4, a core pluripotency factor, has especially played a key role in somatic cell reprogramming through transcriptional control and affects the expression level of genes by its combination intensity. However, the quantitative relationship between Oct4 combination intensity and target gene expression is still not clear. Therefore, firstly, a generalized linear regression method was constructed to predict gene expression values in promoter regions affected by Oct4 combination intensity. Training data, including Oct4 combination intensity and target gene expression, were from promoter regions of genes with different cell development stages. Additionally, the quantitative relationship between gene expression and Oct4 combination intensity was analyzed with the proposed model. Then, the quantitative relationship between gene expression and Oct4 combination intensity at each stage of cell development was classified into high and low levels. Experimental analysis showed that the combination height of Oct4-inhibited gene expression decremented by a temporal exponential value, whereas the combination width of Oct4-promoted gene expression incremented by a temporal logarithmic value. Experimental results showed that the proposed method can achieve goodness of fit with high confidence.


INTRODUCTION
Somatic cells can be reverted to a pluripotent stem cell by cell reprogramming. Cell reprogramming has been significant in many domains of biological and medical science, including tissue repair, organ reconstruction, disease pathogenesis, and new drug development (Wernig et al., 2007;Park et al., 2008). Earlier, the nuclear transfer method was the main method to cultivate new individuals. However, this method was very controversial in terms of ethics (Gurdon, 1958;Campbell et al., 1996;McCreath et al., 2000;Polejaeva et al., 2000). Recently, study of cells induced to reprogram through specific transcription factors became a hotspot. This method solved the problem of immune rejection of allogeneic cells. In this way, the patient-specific stem cells were obtained without ethical controversy (Lv et al., 2018;Poli et al., 2018;Stadhouders et al., 2018).
As an important regulatory element, transcription factor (TF) was involved in the regulation of transcription initiation, and binding sites of TFs in promoter regions affected gene expression (Duren et al., 2017). Oct4, a core transcription factor, played an important regulatory role in stem cell self-renewal and pluripotency maintenance. It controlled the development and differentiation of early embryos and was highly expressed in a variety of stem cells, including germ cells, embryonic stem cells (ESCs), embryonic germ cells (EGCs), and embryonic tumor cells. In an experiment of mice, Oct4 was observed to play a central role in the cellular pluripotency regulatory network, which reprogramed somatic cells into induced pluripotent stem cells (iPSCs) by expressing transcription factors Oct4, Sox2, Klf4, and c-Myc ectopically . Another study showed that pluripotent stem cells can be obtained by adding Oct3/4, Sox2, c-Myc, and Klf4 to the fiber cells of mice (Boyer et al., 2005). Regulation of these transcription factors on target genes was achieved mainly through the interaction of feedforward systems, self-regulatory networks and other signaling pathways (Boyer et al., 2005).
Oct4-binding sites in promoter regions were closely related to gene expression . However, the relationship between Oct4 combination intensity in promoter regions and gene expression remained unclear. Therefore, in this paper, a generalized linear regression model was proposed to analyze the relationship between gene expression and Oct4 combination intensity in promoter regions.
The rest of paper was organized as follows. section Related Work introduces related work on cell reprogramming and gene expression; section Materials and Methods provides materials and methods, including source of data, the proposed generalized linear regression model and evaluation criteria of model performance; section Results and Analysis contains detailed experimental results and analysis, including the solution result and performance analysis of our proposed model, analysis of factors affecting gene expression on every stage of cell development, and applications of our proposed model in gene classification; and section Conclusion summarizes the contents of this paper.

RELATED WORK
Previous studies reported mechanisms and methods of cell reprogramming. Earlier, Gurdon et al. applied the nuclear transfer method to cell reprogramming of Xenopus laevis (Gurdon, 1958). Campbell, McCreath, and Polejaeva cultivated cloning animals using nuclear transfer technology (Campbell et al., 1996;McCreath et al., 2000;Polejaeva et al., 2000). Håkelien and Hochedlinger analyzed a cell recombination mechanism based on nuclear fusion and nuclear transfer technology (Håkelien et al., 2002;Hochedlinger and Jaenisch, 2002). Later, Stadtfeld and Zardo analyzed the effects of specific transcription factors and epigenetic plasticity of chromatin on cell reprogramming (Stadtfeld et al., 2008;Zardo et al., 2008). Studies by Hanna and Li showed that overexpression of transcription factor Oct4 had an effect on cell reprogramming (Hanna et al., 2009;Li et al., 2009). Doege et al. elaborated the effects of the interaction of Oct4, Sox2, Klf4, and c-Myc on cell reprogramming in the early stages of cell reprogramming (Doege et al., 2012). Apostolou and Chen found that the dynamic mechanisms of chromatin change and DNA methylation had important effects on cell reprogramming (Apostolou and Hochedlinger, 2013;Chen et al., 2013). Koqa et al. analyzed the role of transcription factor Foxd1 in cell reprogramming (Koga et al., 2014). Recently, Poli and Stadhouders elaborated the roles of specific transcription factors used as inducing factors in cell reprogramming (Poli et al., 2018;Stadhouders et al., 2018).
The process of cell reprogramming was closely related to the regulation of gene expression. Moreover, regulation of gene expression is the molecular basis of many life activities, including cell differentiation, morphogenesis, and ontogeny . Earlier, Chen and Rimsky analyzed regulation effects of cis-and trans-regulatory elements on gene expression (Rimsky et al., 1989;Chen et al., 1990). Later, Ueda et al. analyzed effects of diurnal variation of transcription factors on gene expression (Ueda et al., 2002). Patricia et al. analyzed effects of the interaction of cis-and trans-regulatory elements on gene expression (Wittkopp et al., 2004). Sullivan CS et al. studied the regulation effect of microRNAs encoded by SV40 on gene expression (Sullivan et al., 2005). Jeffery et al. found factors related to gene expression using gene expression data and binding sites of transcription factor (Jeffery et al., 2007). Han et al. found that certain types of genomic organization by SATB1 had an effect on gene expression (Han et al., 2008). Afterward, Costa et al. predicted gene expression in T cell differentiation by using histone modification and binding affinity of transcription factor via a linear mixed model . Maienscheincline et al. searched for target genes regulated by transcription factors based on some information, including binding sites of transcription factors and target genes (Maienschein-Cline et al., 2012). MT and Holoch analyzed the effects of specific transcription factors and the regulation effect of RNA on gene expression, respectively (Lee et al., 2013;Holoch and Moazed, 2015). Recently, Engreitz and Singh clarified effects of lncRNA promoter, transcription factor, variable splicing, and histone modification on gene expression, respectively (Engreitz et al., 2016;Singh et al., 2016). Thomou and Wu analyzed effects of miRNAs and histone modifications on gene expression (Thomou et al., 2017;Wu et al., 2017). Additionally, Duren et al. predicted gene expression based on chromatin accessibility data, cis-acting and trans-acting element data by logistic regression models (Duren et al., 2017). Neumann and Stadhouders analyzed effects of LncRNA and the dynamic interaction of transcription factors with expression of target genes (Neumann et al., 2018;Stadhouders et al., 2018).
Many methods were proposed for deciphering regulation mechanisms of cis-regulatory and trans-regulatory elements based on gene expression. Studies showed that gene expression was closely related to Oct4 combination intensity in promoter regions (Machado et al., 2011;Machado, 2017;Yan et al., 2017;Antão et al., 2018). However, the quantitative relationship between gene expression and Oct4 combination intensity was not considered. Therefore, firstly, a generalized linear regression model was proposed for quantifying the relationship of gene expression and Oct4 combination intensity based on eight gene datapoints. Then, testing data were applied to test the generalization ability of the model. On the one hand, experiments of 27 genes, as well as all genes, from GEO were applied to analyze the quantitative relationship between Oct4 combination intensity and target gene expression at each stage of cell development by our proposed model. On the other hand, 27 genes were divided into positive and negative samples by our proposed method.

Datasets
Experimental data came from mouse transcriptome data and ChIP-seq data, which were downloaded from GEO database with accession numbers GSE67462 and GSE67520, respectively. In this paper, gene promoter regions were defined as −1.5 kb to +0.5 kb of gene transcription start sites (TSSs). For quantifying the relationship between gene expression and Oct4 combination intensity, while testing the generalization ability of the proposed model, experimental data were divided into training data and test data.
Step 1. All dynamic Oct4 combination intensity and gene expression data related to genes Btbd8, Cnbp, Cyb5r3, Dars2, Eef1a1, Hist1h2bf, Ptrh2, Zfp143 were extracted from transcriptome and ChIP-seq data . Oct4 combination intensities were expressed as a series of peaks that contained three characteristics, including height, distance and width, which were defined as the value of the highest point corresponding to the midpoint of the peak (height); distance between the midpoint of the peak and transcription start site (distance); and difference between the right and left boundaries of the peak (width).
Step 2. Transcriptome and ChIP-seq data of the above genes from Day 0, Day 1, Day 3, Day 5, Day 7, Day 11, Day 15, and Day 18 were selected for studying the relation between time and gene expression .
Step 3. Promoter regions with the strongest signal were extracted to avoid the influence of redundant data.
Testing data were composed of two parts, including data of 27 genes and all genes. Firstly, 27 genes and all genes were applied to analyze quantitative relationship between Oct4 combination intensity and target gene expression at each stage of cell development by our proposed model. Then, 27 genes were divided into high and low expression groups to classify.

Generalized Linear Regression Model
In Figure 1, relations between height, distance, width, gene expression of Oct4 combination intensity, and time were provided, respectively. Figure 1 shows different change trends with time of Oct4 combination intensity in promoter regions and gene expression in the eight proposed genes. Figure 1A illustrates in detail that change trends of height with time were nearly identical in these genes. Similarly, Figure 1C demonstrates that change trends of width with time in these genes were also nearly identical. For quantifying the relationship between gene expressions and Oct4 combination intensity, correlations between height, distance, width, time, and gene expressions were analyzed by using their correlation coefficients, which is defined as Equation (1) with two random variables, X and Y.
In Equation (1), r (X, Y) represents the correlation coefficient between X and Y, cov (X, Y) represents covariance between X and Y, var (X), and var (Y) represent variance of X and Y, respectively. The correlation coefficients between gene expression and Oct4 combination intensity are shown in Table 2. In addition, correlation coefficients for Oct4 combination intensity and the gene expression, height, distance, width, and time of each gene are provided in Figure 2. Table 2 and Figure 2 indicate that the correlation coefficients for gene expression and time were the largest. Correlation coefficients for time and other variables were also strong. However, goodness of fit was low when the predicted model was constructed using height, distance, and width as explanatory variables, and gene expression as explained variable. Due to the strong relationship between time and Oct4 combination   intensity, several time-dependent derived combination variables were used as explanatory variables of the proposed model. Firstly, new derived combination variables were obtained by multiplication operations between height, distance, width and a function of time t, including e t , log 10 (t + 1) and t k (k = 1,2,3). In this way, a set V = {H × t, H × t 2 , H × t 3 , H × e t , H × 0.5 t , H × log 10 (t + 1), D × t, D × t 2 , D × t 3 , D × e t , D × 0.5 t , D × log 10 (t + 1), W × t, W × t 2 , W × t 3 , W × e t , W × 0.5 t , W × log 10 (t + 1)} was constructed as the set of explanation variables, where H denotes height, D denotes distance and W denotes width. Then, stepwise regression method was used to determine explanatory parameters of the proposed regression model. Finally, six explanatory variables were selected from V, including H × e t , D × t, D × t 2 , D × t 3 , D × 0.5 t and W × log 10 (t + 1).
Therefore, a generalized linear regression model was constructed by using selected explanatory variables, in which gene expression was the explained variable. In this paper, four generalized linear regression models, Models 1-4, were constructed by Equations (2-5).
Model 4 : Exp = β 1 × H × e t + β 2 × D × 0.5 t + β 3 × W × log 10 (t + 1) + ε Frontiers in Genetics | www.frontiersin.org In Equations (2-5), Exp represents the value of gene expression; β 1 , β 2 , β 3 are regression coefficients, which are calculated by the Least Squares Method (LSM), and LSM is defined as the sum of squares of differences between predicted value and true value; a random disturbance ε is a normal distribution that was applied to represent other factors affecting gene expression except height, distance and width.
H × e t and W × log 10 (t + 1) were selected in the final model because they were common items in Equations (2-5). Therefore, a general model of gene expression patterns was obtained by Equation (6), and the correctness of the model will be verified in section Analysis of Factors Affecting Gene Expression at Every Stage of Cell Development.

Evaluation Criteria of Model Performance
F-test, t-test, and goodness of fitR 2 were used to evaluate the performance of linear regression model (Huang and Pan, 2003;Zhou et al., 2003;Xu et al., 2008;Wang and Lee, 2010;Wang et al., 2012). More precisely, F-test was used to test significance of the entire regression model and t-test was used to test significance of regression coefficients in the model. Goodness of fitR 2 was used to measure the approximation degree between fitted curve and original data. Meanwhile,R 2 , a generation from original coefficient of determination R 2 , was an adjusted coefficient of determination. It was eliminated the influence of coefficient of determination generated by number of explanatory variables. In this paper, F-test statistic, t-test statistic, adjusted coefficient of determinationR 2 , original coefficient of determination R 2 , total sum of squares (TSS), explained sum of squares (ESS), and residuals sum of squares (RSS) are defined as Equations (7-13) (Huang and Pan, 2003;Zhou et al., 2003;Xu et al., 2008;Wang and Lee, 2010;Wang et al., 2012).
In Equations (7-13), k is the number of variables; n is the number of samples;β i and se β i are estimated value and standard deviation of estimated value of regression coefficient; and Y i ,Ŷ i ,Ȳ represent true, estimated and mean values of explained variable. Accuracy (Acc), Sensitivity (S n ), specificity (S p ), and Mathew correlation coefficient (Mcc) were used to measure the performance of the classification model (Xu et al., 2013;Guo et al., 2014;Awazu, 2016). Which were defined as Equations (14-17).
In Equations (14-17), TP represents the number of positive samples that are correctly predicted as positive samples; TN represents the number of negative samples that are correctly predicted as negative samples; FP represents the number of negative samples that are incorrectly predicted as positive samples; and FN represents the number of positive samples that are incorrectly predicted as negative samples (Zhang et al., 2014(Zhang et al., , 2018Wang et al., 2017Wang et al., , 2018a.

RESULTS AND ANALYSIS Solution Result of Our Proposed Model
Gene expression patterns of the eight selected genes were analyzed by using Models 1-4. More specifically, Model 1 was applied to describe the expression pattern of gene Zfp143, Model 2 was applied to describe the expression pattern of gene Hist1h2bf; Model 3 was applied to describe the expression patterns of genes Dars2 and Eef1a1, and Model4 was applied to describe the expression patterns of genes Btbd8, Cnbp, Cyb5r3, and Ptrh2. Both Model 2 and Model 3 were used to express the expression pattern of gene Eef1a1. Parameter values of the models are shown in Table 3. Parameter values of Model 2 and Model 3 for gene Eef1a1 were shown in Table 4. Table 3 showed that regression coefficients β 2 and β 3 were large, which indicated that both distance and width had important influences on gene expression. Furthermore, distance had an effect on gene expression in the form of exponential function of time, and width had an effect on gene expression in the form of logarithmic function of time without other factors. Additionally, Table 4 shows that the difference of the regression coefficients between Model 2 and Model 3 were small. In both Model 2 and Model 3, β 3 has the largest absolute values in regression coefficients for gene Eef1a1, which indicated that width was a key factor affecting gene expression.

Performance Analysis of Our Proposed Model
Goodness of fit for proposed model was calculated to evaluate the performance of these models. In addition, performance of the models was tested by F-test and t-test. Results of goodness of fit, F-test and t-test are shown in Table 5. Results of goodness of fit, F-test and t-test of gene Eef1a1 are shown in Table 6. Table 5 demonstrates that goodness of fit reached at least 80% for all genes except Dars2 by using our proposed method. In addition, the p-value of F-test and t-test were <0.1, which meant that our proposed model was effective with 90% confidence.
As shown in Table 6,R 2 from Model 3 was larger than Model 2, which means that distance had a greater influence on gene expression than time for gene Eef1a1.
As shown in Tables 3-6, absolute values of regression coefficients β 2 and β 3 were large in all regression coefficients. Additionally, the absolute value of regression coefficients for β 3 was the largest in all regression coefficients with Model 2 and Model 3 for gene Eef1a1. Therefore, width was considered to be the most important factor affecting gene expression, and width had an effect on gene expression in the form of a logarithmic function.

Analysis of Factors Affecting Gene Expression in Whole-Cell Developmental Stage
In this paper, the relationship between gene expression and Oct4 combination intensity in promoter regions at the wholecell developmental stage was analyzed based on the generalized linear regression model. Experimental results showed that the  proposed model was effective for gene expression pattern of all eight selected genes except for Eef1a1. For exploring the effects of each model on the different genes, expression data of selected eight genes and Oct4 combination intensity in promoter regions were substituted into the models. Experimental results are shown in Figure 3. Figure 3 demonstrated that differences in goodness of fit between different models for the same gene were large, which indicated that distance had strong effects on the gene expression of different genes with different levels. Strong correlation between gene expression and D × 0.5 t , W × log 10 (t + 1) was found in Table 3 1, 3, 5, 7, 11, 15, and 18 day(s). Y-axis represents parameter value. Curves in red, green, blue, and black represent values of parameters β 1 , β 2 , β 3 and ε from the proposed generalized linear regression model, respectively. log 10 (t + 1) was lower than for the selected six derived combination variables, which indicated that gene expression was promoted by the interaction of height, distance, width, and time.

Analysis of Factors Affecting Gene Expression at Every Stage of Cell Development
Oct4 combination intensity and time had different effects on gene expression in different cell development stages. The goodness of fit obtained by Model 4 was higher than that obtained by Models 1-3 in the prediction of gene expression. Therefore, differences were analyzed based on Model 4 with testing data including 27 genes and all genes. Experimental results are shown in Figures 4, 5.  Figures 4, 5 show that the absolute value of β 3 was larger than that of β 1 , β 2, and e. Absolute values of β 1 and β 2 were close to zero except for a few points, which indicated that width influenced gene expression in the form of a logarithmic function of time. However, change trends of β 1 and β 2 were different for Figures 4, 5. More specifically, the absolute value of β 1 obtained by 27 genes decreased with time, and the value of was negative when time was equal to 0; the absolute value of obtained by all genes decreased with time and the value of was positive when time was equal to 0; the value of obtained by 27 genes was positive while value of obtained by all genes was negative due to partially missing data, which was contradictory and indicated that time had an important impact on gene expression. Incorrect conclusions were obtained when data of some certain time were missing. Therefore, Figures 4, 5 showed that width and time had important effects on gene expression. Furthermore, width influenced gene expression in the form of a logarithmic function of time.

Application of Our Proposed Model in Gene Classification
Gene classification experiments were provided to test the generalization ability of the proposed model. Firstly, in order to avoid the influence of random disturbance on experimental results, the data of 27 genes, including height, distance, width and gene expression, were normalized. Then, Models 1-4 were applied to predict gene expression for 27 genes. Finally, the 27 genes were divided into two categories by comparing gene expression with a threshold; meanwhile, 10-fold cross-validation was used to test the model's performance. Comparison results of Models 1-4 showed Model 4 had a high goodness of fit. Therefore, 27 genes were classified by Model 4.
Gene groups of high and low expression were defined in an artificial way; meanwhile, threshold setting was random in the classification process. A BP neural network was used to classify positive and negative samples in order to prove that the randomness had little effect on experimental results. In this paper, the hidden layer of the BP neural network was set to one layer, and the number of hidden layer neurons was set to 2. In 10-fold cross-validation, regression coefficients and random disturbance of Model 4 were shown in Table 7. The prediction performance obtained by Model 4 and the BP neural network are shown in Table 8. Table 8 showed that the Acc, Sn, Sp, and Mcc obtained by Model 4 were the largest of the two different methods. Therefore, randomness of the threshold setting had little effect on experimental results, and our proposed method was effective in predicting gene expression.

CONCLUSION
Cell reprogramming has been a hot issue in the field of life sciences and has played a significant role in medicine, such as in tissue repair, organ reconstruction, disease pathogenesis, and  new drug development. Oct4 has especially played an important regulatory role in the process of cell reprogramming. However, there was no scientific method to quantify the relationship between Oct4 combination intensity and gene expression. Therefore, data from the eight selected typical genes were extracted from mouse transcriptome data and ChIP-seq data for quantifying the relationship between gene expression values and Oct4 combination intensity in promoter regions. Firstly, a generalized linear regression model was constructed based on gene expression with eight different time periods during cell development and Oct4 combination intensity in promoter regions. Then, the relationship between Oct4 combination intensity and gene expression at whole and each stage of cell development was analyzed. Finally, the 27 genes were divided into positive and negative samples based on Model 4 and the BP neural network. Experimental results showed that width of combination influenced gene expression by a logarithmic function of time (day). Additionally, accuracy obtained by the models was 4.05% higher than that obtained by the BP neural network, which indicated that our proposed model was effective in predicting gene expression.
Several additional factors, including extent of histone modification, degree of chromatin opening, strength of promoter and binding sites of transcription factors and promoter regions, also affected gene expression. Non-linear relations between gene expression and Oct4 combination intensity were also ignored due to large non-linear relations. Therefore, in the future, multiple factors and non-linear relations should be considered to analyze key factors affecting gene expression.

AUTHOR CONTRIBUTIONS
SL: design experiment and analyze experiment result; ML: data processing and accomplish experiment; HL: extract and clean data from biological experiment and public database; YZ: provide idea from biological significance.