Leveraging methylation alterations to discover potential causal genes associated with the survival risk of cervical cancer in TCGA through a two-stage inference approach

Background: Multiple genes were previously identified to be associated with cervical cancer; however, the genetic architecture of cervical cancer remains unknown and many causal genes have yet to be discovered.

Methods: To explore causal genes related to cervical cancer, a two-stage causal inference approach was proposed within the framework of Mendelian randomization, in which gene expression was treated as the exposure and the methylation sites located within that gene served as instrumental variables. Five prediction models were first used to characterize the relationship between expression and methylation for each gene; the methylation-regulated gene expression (MReX) was then obtained, and the association was evaluated via a Cox mixed-effects model based on MReX. We further implemented the harmonic mean p-value (HMP) combination to take advantage of the respective strengths of these prediction models while accounting for dependency among the p-values.

Results: A total of 14 causal genes were discovered to be associated with the survival risk of cervical cancer in TCGA when the five prediction models were employed separately. The total number of causal genes rose to 23 when HMP was applied. Some of the newly discovered genes may be novel (e.g. YJEFN3, SPATA5L1, IMMP1L, C5orf55, PPIP5K2, ZNF330, CRYZL1, PPM1A, ESCO2, ZNF605, ZNF225, ZNF266, FICD and OSTC). Functional analyses showed that these genes were enriched in tumor-associated pathways. Additionally, four genes (i.e. COL6A1, SYDE1, ESCO2 and GIPC1) were differentially expressed.

Conclusion: Overall, our study discovered promising candidate genes that are causally associated with the survival risk of cervical cancer and thus provides new insights into the genetic etiology of cervical cancer.

selected genes with the phenotypic variance explained by methylations larger than 1% (corresponding to a correlation coefficient of 10%). The remaining 12,623 genes are referred to as methylation-regulated genes and included in our subsequent analyses (Figure 2). The number of methylation CpG sites across genes ranges from 10 to 1,062, with the majority of analyzed genes (92.0% = 11,607/12,623) having fewer than 50 methylation sites.
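As a rough illustration of the filter described above, the following Python sketch keeps genes whose variance explained by methylation exceeds 1% and shows why that threshold corresponds to a correlation coefficient of 0.1 (r is the square root of R²). The gene names, R² values and function names are hypothetical and are not taken from the study; how the variance explained was estimated is not specified in this passage.

```python
import math


def correlation_from_r2(r2: float) -> float:
    """Convert a proportion of variance explained (R^2) to a correlation |r|."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("R^2 must lie in [0, 1]")
    return math.sqrt(r2)


def select_methylation_regulated(r2_by_gene: dict, threshold: float = 0.01) -> dict:
    """Keep genes whose variance explained by methylation is strictly above the threshold."""
    return {gene: r2 for gene, r2 in r2_by_gene.items() if r2 > threshold}


# Hypothetical per-gene R^2 values, for illustration only.
example_r2 = {"GENE_A": 0.004, "GENE_B": 0.010, "GENE_C": 0.035}
kept = select_methylation_regulated(example_r2)

print(sorted(kept))               # genes passing the 1% filter
print(correlation_from_r2(0.01))  # correlation implied by the 1% threshold
```

Under these made-up inputs only GENE_C passes, because the comparison is strict and GENE_B sits exactly on the threshold.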

Identification of DEGs, GO and KEGG pathway annotation
In the differential expression analysis, four DEGs are detected among the 23 new genes identified above (Figure 5A). In particular, COL6A1 and SYDE1 are up-regulated, while ESCO2 and GIPC1 are down-regulated (Figure 5B). To explore the potential functions of these genes that may be associated with the tumorigenesis and development of cervical cancer, we performed functional enrichment analysis with GO and KEGG using the R package clusterProfiler (version [...] (Table S2). The functional enrichment results suggest that these newly discovered causal genes may participate in oncogenicity and tumor progression in cervical cancer by regulating relevant biological processes and critical pathways.
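Over-representation analyses of this kind are commonly based on a hypergeometric test. The sketch below, written in Python rather than R, is a minimal illustration assuming that test. It is not the clusterProfiler implementation, and the term size, the overlap count and the use of the 12,623 analyzed genes as the background are hypothetical choices made only for the example.

```python
from math import comb


def hypergeom_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: background genes, K: background genes annotated to the term,
    n: genes in the query list, k: query genes annotated to the term.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k):
        raise ValueError("inconsistent counts")
    upper = min(n, K)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 23 query genes, a term with 200 annotated genes,
# 3 of the query genes fall in that term, background of 12,623 genes.
p_enrich = hypergeom_enrichment_p(k=3, n=23, K=200, N=12623)
print(p_enrich)
```

In practice the resulting p-values would also be adjusted for the number of terms tested, which this sketch leaves out.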

Discussion
Given the severe health threat to women and the limited knowledge of the genetic basis of cervical cancer, persistent work is needed to discover genes that are causally related to cervical cancer [5]. The present study is one such effort, with the aim of detecting new causal genes for cervical cancer through integrative genomic methods. The two-stage inference analysis pipeline applied in this work can be considered as a [...] ENET-coxlmm) would be superior. Due to unknown true association patterns, there is no uniformly most powerful test. As a result, the two-stage association test may perform well for one gene, but not necessarily for another.

To leverage the advantages of distinct prediction models to improve power, instead of selecting an optimal prediction model, in the present study we considered a wide range of prediction models in our two-stage inference procedure. It can be imagined that the resulting p-values would be highly correlated because they are generated with the same data set following similar logic (Figure 2). The correlation structure of these p-values also depends on the true architecture of gene expression, which however is rarely known in advance and is likely to vary from one gene to another across the genome. Therefore, it is desirable to construct an omnibus test that integrates the advantages of multiple prediction approaches and is robust against distinct transcriptomic architectures. [...]

In summary, using the proposed two-stage causal inference approach within the framework of MR analysis, we discovered a total of 14 causal genes that were associated with the survival risk of cervical cancer patients when separately applying five commonly used prediction models. The number of causal genes rose to 23 when employing the HMP combination method. Some may be newly novel

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request or at https://xenabrowser.net/hub/.

Competing interests
The authors declare that they have no competing interests.

[...] linear mixed-effects model was applied to identify methylation-driven genes based on predicted expression levels; we aggregated the p-values of genes from different prediction models through a p-value combination method to find significant genes that were related to the survival of cervical cancer. Finally, we further implemented functional and differential expression analyses for the newly identified associated genes.
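The p-value combination step mentioned above can be illustrated with the harmonic mean p-value. The following Python sketch computes the weighted harmonic mean of one gene's p-values across several prediction models. It assumes equal weights by default, the five p-values are invented for the example, and the function name is hypothetical. The raw harmonic mean is only an approximate p-value; any further calibration used in the study is not reproduced here.

```python
def harmonic_mean_p(p_values, weights=None):
    """Weighted harmonic mean of p-values: sum(w) / sum(w / p)."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    if weights is None:
        # Equal weights are an assumption of this sketch.
        weights = [1.0 / len(p_values)] * len(p_values)
    if len(weights) != len(p_values) or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and match p_values")
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))


# Hypothetical p-values for one gene from five prediction models.
model_p = [0.01, 0.02, 0.5, 0.04, 0.2]
hmp = harmonic_mean_p(model_p)
print(hmp)  # lies between the smallest and largest input p-value
```

Because the harmonic mean is dominated by the smallest p-values, a gene that is strongly supported by only one or two prediction models can still yield a small combined value, which is the motivation for pooling models rather than choosing a single one.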