# SYSTEM BIOLOGY METHODS AND TOOLS FOR INTEGRATING OMICS DATA

EDITED BY : Liang Cheng, Lei Deng and Mingxiang Teng PUBLISHED IN : Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-333-0 DOI 10.3389/978-2-88966-333-0

Topic Editors: Liang Cheng, Harbin Medical University, China Lei Deng, Central South University, China Mingxiang Teng, Moffitt Cancer Center, United States

Citation: Cheng, L., Deng, L., Teng, M., eds. (2021). System Biology Methods and Tools for Integrating Omics Data. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-333-0

# Table of Contents

*Editorial: System Biology Methods and Tools for Integrating Omics Data*

Liang Cheng, Lei Deng and Mingxiang Teng


Yiqun Li, Ying Wu, Xiaohan Zhang, Yunfan Bai, Luqman Muhammad Akthar, Xin Lu, Ming Shi, Jianxiang Zhao, Qinghua Jiang and Yu Li

*iDNA6mA-Rice: A Computational Tool for Detecting N6-Methyladenine Sites in Rice*

Hao Lv, Fu-Ying Dao, Zheng-Xing Guan, Dan Zhang, Jiu-Xin Tan, Yong Zhang, Wei Chen and Hao Lin

*Predicting circRNA-Disease Associations Based on circRNA Expression Similarity and Functional Similarity*

Yongtian Wang, Chenxi Nie, Tianyi Zang and Yadong Wang

*iRO-PsekGCC: Identify DNA Replication Origins Based on Pseudo k-Tuple GC Composition*

Bin Liu, Shengyu Chen, Ke Yan and Fan Weng

*Variance-Preserving Estimation of Intensity Values Obtained From Omics Experiments*

Adèle H. Ribeiro, Julia Maria Pavan Soler and Roberto Hirata Jr.

Tianyi Zhao, Yang Hu, Tianyi Zang and Yadong Wang

*A New Algorithm for Identifying Genome Rearrangements in the Mammalian Evolution*

Juan Wang, Bo Cui, Yulan Zhao and Maozu Guo

*Integrating the Ribonucleic Acid Sequencing Data From Various Studies for Exploring the Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids and Their Functions*

Zhijie Han, Jiao Hua, Weiwei Xue and Feng Zhu

*CircSLNN: Identifying RBP-Binding Sites on circRNAs* via *Sequence Labeling Neural Networks*

Yuqi Ju, Liangliang Yuan, Yang Yang and Hai Zhao

Zhiqiang Chang, Xiuxiu Miao and Wenyuan Zhao

*eQTLMAPT: Fast and Accurate eQTL Mediation Analysis With Efficient Permutation Testing Approaches*

Tao Wang, Qidi Peng, Bo Liu, Xiaoli Liu, Yongzhuang Liu, Jiajie Peng and Yadong Wang


Yanglan Gan, Ning Li, Yongchang Xin and Guobing Zou


Sheng Zhao, Huijie Jiang, Zong-Hui Liang and Hong Ju


# Editorial: System Biology Methods and Tools for Integrating Omics Data

#### Liang Cheng<sup>1,2</sup> \*, Lei Deng<sup>3</sup> \* and Mingxiang Teng<sup>4</sup> \*

*<sup>1</sup> NHC Key Laboratory of Molecular Probe and Targeted Theranostics, Harbin Medical University, Harbin, China, <sup>2</sup> College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China, <sup>3</sup> School of Computer Science and Technology, Central South University, Changsha, China, <sup>4</sup> Moffitt Cancer Center, Tampa, FL, United States*

Keywords: OMICS data, data mining, machine learning, complex disease, system biology

#### Edited and reviewed by:

*Simon Charles Heath, Center for Genomic Regulation (CRG), Spain*

#### \*Correspondence:

*Liang Cheng chl198478@126.com Lei Deng leideng@csu.edu.cn Mingxiang Teng tengmx@gmail.com*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> Received: *17 May 2020* Accepted: *06 October 2020* Published: *12 November 2020*

#### Citation:

*Cheng L, Deng L and Teng M (2020) Editorial: System Biology Methods and Tools for Integrating Omics Data. Front. Genet. 11:563108. doi: 10.3389/fgene.2020.563108*

**Editorial on the Research Topic**

**System Biology Methods and Tools for Integrating Omics Data**

With the rapid evolution of sequencing technologies, it has become increasingly easy for researchers to measure molecular expression levels and variations in the genome, transcriptome, and proteome in wet labs. These technological innovations have helped the life science community reveal disease risk factors such as gene variations, expression changes, and clinical phenotypes. Alongside these advances, large amounts of sequencing data have been generated, which must then be interpreted using novel data integration methods.

To this end, there is an urgent need for methods and tools that make better use of omics datasets in disease studies. One approach is to evaluate the associations between different diseases or sub-types by analyzing omics datasets across individual laboratories. For example, lncRNA biomarkers associated with clinical sub-types and the prognosis of diffuse large B-cell lymphoma were discovered and validated by re-annotating the probes and analyzing data from multiple microarray platforms. Another approach is to reveal potential characteristics of diseases by integrating multi-level omics data. Gene targets of complex diseases could, for example, be predicted by integrating summary data from GWAS and eQTL studies. Integrating omics data with computational tools is likely to be challenging for most biologists, as most tools require a certain level of computing knowledge on the part of the user to be operated optimally. It is consequently of great importance to establish automated pipelines that combine these tools. In summary, the current challenge in understanding complex disease is to mine novel and precise characterizations by fusing multi-level omics data using system biology approaches. Here, we organized a Research Topic on "System Biology Methods and Tools for Integrating Omics Data." In total, 22 outstanding works were presented in this thematic issue, six of which are highlighted as follows.


analysis. TriPCE can identify coherent patterns of various epigenetic modifications across different cancer types. To validate its capability, they applied TriPCE to analyze six important epigenetic marks among seven cancer types and identified significant cross-cancer epigenetic similarities. The results highlighted specific epigenetic patterns among the investigated cancers. The functional gene analysis further demonstrated strong relevance of studied gene sets with cancer development and revealed a consistent risk tendency among these investigated cancer types.

• Zeng et al. developed a hybrid deep neural network framework, 4mcDeep-CBI, to identify 4mC sites. Preliminarily extracted features were fed to a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory network (BLSTM) to generate advanced features. Taking the advanced features as input, they designed an integrated algorithm to improve feature representation. Experimental results on a large new dataset showed that 4mcDeep-CBI generally outperformed other state-of-the-art predictors in identifying 4mC sites.

Each study in the special issue was peer reviewed by two or three external reviewers. We would like to thank all the authors for contributing their work to this thematic issue and all the reviewers for their time and effort. Finally, we would like to thank the Chief Editor and Editorial Office of Frontiers in Genetics for their support throughout the process.

### AUTHOR CONTRIBUTIONS

LC, LD, and MT conducted this topic issue and wrote the manuscript. All authors contributed to the article and approved the submitted version.

### FUNDING

The Tou-Yan Innovation Team Program of the Heilongjiang Province (2019-15); National Natural Science Foundation of China (61871160); Heilongjiang Province Postdoctoral Fund (LBH-TZ20); and Young Innovative Talents in Colleges and Universities of Heilongjiang Province (2018-69).

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Cheng, Deng and Teng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Systems Chemical Genetics-Based Drug Discovery: Prioritizing Agents Targeting Multiple/Reliable Disease-Associated Genes as Drug Candidates

Yuan Quan<sup>1†</sup>, Zhi-Hui Luo<sup>2†</sup>, Qing-Yong Yang<sup>1†</sup>, Jiang Li<sup>1</sup>, Qiang Zhu<sup>1</sup>, Ye-Mao Liu<sup>1</sup>, Bo-Min Lv<sup>1</sup>, Ze-Jia Cui<sup>1</sup>, Xuan Qin<sup>1</sup>, Yan-Hua Xu<sup>3</sup>, Li-Da Zhu<sup>1</sup> \* and Hong-Yu Zhang<sup>1</sup> \*

*<sup>1</sup> Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China, <sup>2</sup> College of Life Sciences and Technology, Huazhong Agricultural University, Wuhan, China, <sup>3</sup> Sci-meds Biopharmaceutical Co., Ltd., Wuhan, China*

#### Edited by:

*Lei Deng, Central South University, China*

#### Reviewed by:

*Leyi Wei, Tianjin University, China Shikui Tu, Shanghai Jiao Tong University, China Zhenjia Wang, University of Virginia, United States*

#### \*Correspondence:

*Li-Da Zhu zhulinda@hotmail.com Hong-Yu Zhang zhy630@mail.hzau.edu.cn*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> Received: *07 March 2019* Accepted: *01 May 2019* Published: *29 May 2019*

#### Citation:

*Quan Y, Luo Z-H, Yang Q-Y, Li J, Zhu Q, Liu Y-M, Lv B-M, Cui Z-J, Qin X, Xu Y-H, Zhu L-D and Zhang H-Y (2019) Systems Chemical Genetics-Based Drug Discovery: Prioritizing Agents Targeting Multiple/Reliable Disease-Associated Genes as Drug Candidates. Front. Genet. 10:474. doi: 10.3389/fgene.2019.00474*

Genetic disease genes are considered a promising source of drug targets. Most diseases are caused by more than one pathogenic factor; thus, it is reasonable to consider that chemical agents targeting multiple disease genes are more likely to have desired activities. This is supported by a comprehensive analysis of the relationships between agent activity/druggability and target genetic characteristics. The therapeutic potential of agents increases steadily with increasing number of targeted disease genes, and can be further enhanced by strengthened genetic links between targets and diseases. By using multi-label classification models for genetics-based drug activity prediction, we provide universal tools for prioritizing drug candidates. All of the documented data and the machine-learning prediction service are available at SCG-Drug (http://zhanglab.hzau.edu.cn/scgdrug).

Keywords: drug discovery, disease associated genes, drug targets, systems chemical genetics, machine learning

### INTRODUCTION

Finding novel drugs or new uses for old drugs is one of the most important motivations of life sciences. Drug development is a costly process. The rich knowledge accumulated by modern life sciences is, thus, highly expected to reduce the attrition rate during drug development. From a chemical viewpoint, drugs exert therapeutic effects by inhibiting or activating one or more of the target genes/proteins associated with certain diseases. Therefore, gene-disease association information is crucial for drug discovery (Brinkman et al., 2006; Sanseau et al., 2012; Wang Z. Y. et al., 2012; Plenge et al., 2013; Okada et al., 2014; Nelson et al., 2015).

In life sciences, genetics is the discipline best suited to revealing gene-disease links, and it therefore makes great contributions to the pharmaceutical industry. For example, disease-associated genes identified by medical genetics constitute a promising source of drug targets (Brinkman et al., 2006; Sanseau et al., 2012; Wang Z. Y. et al., 2012; Plenge et al., 2013; Okada et al., 2014; Nelson et al., 2015). Moreover, the pathogenesis revealed by genetics is also of high value for drug discovery. If a disease arises from a gain of function (GOF) mutation of a target gene, the corresponding drugs must be antagonists or inhibitors, whereas for a disease induced by a loss of function (LOF) mutation of a gene, the targeted drugs must be agonists (Wang and Zhang, 2013).

Thousands of disease-associated genes have been identified by traditional Mendelian genetics and by the more recently developed genome- and phenome-wide association studies (GWAS and PheWAS, respectively). However, nearly all studies attributed diseases to variations at a single genetic locus. Most diseases are caused by multiple pathogenic factors (Yildirim et al., 2007; Hopkins, 2008; Guney et al., 2016); thus, a majority of the identified links between diseases and single genetic variations are not strong enough to have therapeutic value. For example, only ∼5% of the drug-disease associations derived from PheWAS are supported by clinical evidence (Rastegar-Mojarad et al., 2015). Thus, to use medical genetic information more efficiently in drug development, we should aim at multiple genes associated with a disease, rather than a single pathogenic factor, when identifying potential drugs. To test this hypothesis, we retrieved the genes responsible for various disorders and collected the chemical agents targeting these genes. A comprehensive analysis of the relationships between agent activity/druggability and target genetic characteristics revealed that agents targeting multiple pathogenic factors were more likely to show desired medicinal activities and to be clinically approved. The therapeutic potential of agents can be further enhanced by stronger genetic links between targets and diseases. These observations allowed us to predict agent activities using machine learning methods, which help to prioritize drug candidates.

### RESULTS

#### Data Preparation and Validation

Agent-target interaction information was retrieved from the Drug-Gene Interaction database (DGIdb) (Wagner et al., 2015), the Therapeutic Target Database (TTD) (Qin et al., 2014), and DrugBank (Law et al., 2014). Only the clinically supported or approved activities of the agents were used in the present study; these were derived from DrugBank, TTD, and ClinicalTrials (Zarin et al., 2011; Law et al., 2014; Qin et al., 2014). The disease-associated gene information was derived from the following eight databases: Genetic Association Database (GAD) (Becker et al., 2004), Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), ClinVar (Landrum et al., 2014), Orphanet (http://www.orpha.net/consor/cgi-bin/index.php), DisGeNET (Piñero et al., 2015), INtegrated TaRget gEne PredItion (INTREPID) (Chen and Tian, 2016), GWASdb (Nelson et al., 2015), and The Human Gene Mutation Database (HGMD) (Wang X. et al., 2012) (**Figure 1**).

To facilitate the present analysis, the natural language processing tool MetaMap was used to convert the disease terms of genes and the indication annotations of agents to Unified Medical Language System (UMLS) concepts (Aronson, 2001), with the Medical Subject Headings (MeSH) thesaurus selected as the vocabulary source of UMLS (Liu et al., 2014). Using the disease classes provided by Pharmaprojects (similarity threshold: 0.75) (Mcinnes et al., 2009), the chemical agents were found to be indicated for treating 667 disease classes, and the disorder-related genes were associated with 703 disease classes (**Figure 1**). All of the data are freely available at SCG-Drug (http://zhanglab.hzau.edu.cn/scgdrug).

Data validation was performed by the following analyses. First, we assessed the reliability of the gene-disease pairs by examining whether similar diseases cover similar gene sets. The disease similarity was measured using UMLS::similarity (Mcinnes et al., 2009); the disease gene set distance was calculated using the Tanimoto coefficient (see Methods). As shown in **Figure 2A**, a definite correlation exists between disease similarity and gene set distance. That is, if two diseases exhibit similar symptoms, then these diseases tend to involve similar genes, validating the identified gene-disease pairs. We then used a similar method to evaluate the quality of agent-disease pairs. A good correlation was observed between disease similarity and agent set distance (**Figure 2A**), supporting the reliability of agent-disease pairs. Therefore, one can infer the activities of agents through their target-associated genetic diseases, provided the agents and the targets are truly linked. As illustrated in **Figure 2B**, for the agents in TTD, DGIdb, and DrugBank, 4.1, 4.7, and 5.3% of their genetics-implicated activities are supported by clinical trials, respectively, comparable with the PheWAS-based activity prediction efficiency (Rastegar-Mojarad et al., 2015). However, when the agents were randomly assigned targets (repeated 10,000 times), the clinically supported activities derived from genetic predictions were significantly rarer than those from real agent-target pairs (**Figure 2B**, P < 10<sup>−4</sup>). This 10,000-permutation test validates the agent-target associations.
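The two validation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the set distance is taken here as one minus the Tanimoto coefficient, and "randomly assigned targets" is interpreted as shuffling whole target sets across agents. The function names and the toy dictionaries in the usage note are hypothetical.

```python
import random


def tanimoto_distance(set_a, set_b):
    """Return 1 - |A ∩ B| / |A ∪ B| (0 = identical sets, 1 = disjoint sets)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


def supported_count(agent_targets, gene_diseases, supported_pairs):
    """Count genetics-implied (agent, disease) pairs that are clinically supported."""
    count = 0
    for agent, targets in agent_targets.items():
        implied = set()
        for gene in targets:
            implied |= gene_diseases.get(gene, set())
        count += sum(1 for disease in implied if (agent, disease) in supported_pairs)
    return count


def permutation_test(agent_targets, gene_diseases, supported_pairs,
                     n_perm=10000, seed=0):
    """Empirical one-sided P value for the observed number of supported pairs.

    Assumption: the null model shuffles target sets across agents.
    """
    rng = random.Random(seed)
    observed = supported_count(agent_targets, gene_diseases, supported_pairs)
    agents = list(agent_targets)
    target_sets = [agent_targets[a] for a in agents]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(target_sets)
        shuffled = dict(zip(agents, target_sets))
        if supported_count(shuffled, gene_diseases, supported_pairs) >= observed:
            hits += 1
    # The +1 terms keep the estimate from being exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

With 10,000 permutations the smallest P value this estimator can return is 1/10,001, which is in line with a bound reported as P < 10<sup>−4</sup>.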

### Dependence of Agent Activity/Druggability on Target Quantity

Based on the validated data, we can investigate how agent activity/druggability depends on target characteristics. As illustrated in **Figure 3**, for agents targeting a single disease gene, 3.0% of genetics-derived activities are supported by clinical tests and only 0.6% are clinically approved (**Table S1**). For agents targeting two disease-associated genes, 4.1% of genetics-implicated activities are clinically supported, and 1.5% have been introduced to the market (**Table S1**). The clinically active ratio of agents peaks at 26.7%, and the approval ratio reaches 11.4%, when the agents target tens of disorder genes. Together, the therapeutic potential of agents increases steadily with increasing number of targeted disease genes (**Figure 3**).
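As a minimal sketch of how such ratios can be tabulated, the snippet below groups genetics-derived activities by the number of disease genes their agent targets and reports the supported fraction per group. The record layout (a target count paired with a boolean flag) is an assumption made for illustration and is not taken from the paper.

```python
from collections import defaultdict


def ratio_by_target_count(records):
    """records: iterable of (n_disease_genes_targeted, is_clinically_supported).

    Returns {n_disease_genes_targeted: fraction of activities that are supported}.
    """
    totals = defaultdict(int)
    supported = defaultdict(int)
    for n_targets, is_supported in records:
        totals[n_targets] += 1
        if is_supported:
            supported[n_targets] += 1
    return {n: supported[n] / totals[n] for n in sorted(totals)}
```

The same function applies to the approval ratio if the flag marks approved indications instead of clinically supported ones.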

[**Figure 1** caption, partially recovered: searching DGIdb, TTD, and DrugBank, 3,346 genes were targeted by 14,558 agents. The 3,346 targets were associated with 703 diseases, resulting in 359,101 gene-disease pairs; 5,759 agents were indicated for treating 667 diseases, resulting in 74,902 agent-disease pairs.]

Drug action is usually considered a specific process. It is thus of apparent interest to investigate the molecular mechanisms underlying the promiscuity of multi-target agents. Because human genes have generated a large number of paralogs during evolution, a primary explanation is that the multiple targets covered by an agent have similar sequences and functions. Indeed, the sequences of target pairs hit by the agents are more similar than those of pairs randomly selected from the target set (P = 2.20 × 10<sup>−16</sup>, Wilcoxon rank-sum test) (**Figure 4A**), where the needle program of the EMBOSS package (Rice et al., 2000) was used for pairwise alignment. Furthermore, the target pairs covered by the agents are significantly enriched with paralogs (4.72% (2,602 of 55,110), derived from the Ensembl database) compared with randomly combined target pairs (0.10% (4,029 of 3,955,078), P ∼ 0, hypergeometric test). In addition, the GO-based Czekanowski–Dice distances (Ovaska et al., 2008) of the gene pairs targeted by the agents are evidently smaller than those of randomly selected target pairs (P = 2.20 × 10<sup>−16</sup>, Wilcoxon rank-sum test) (**Figure 4B**). These observations not only support the evolutionary explanation of the molecular basis of multi-target drug action, but also provide useful clues for addressing concerns about the side effects of promiscuous agents.
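To see why the paralog enrichment is reported as P ∼ 0, the upper tail of a hypergeometric distribution can be evaluated in log space, where it does not underflow. The sketch below uses only the standard library. It assumes that the 55,110 agent-covered pairs are drawn from the 3,955,078 possible target pairs, of which 4,029 are paralogs; that set-up is an interpretation of the counts quoted above, not a description of the authors' exact test.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_log10_sf(k, population, successes, draws):
    """log10 of P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    lo = max(k, draws - (population - successes), 0)
    hi = min(successes, draws)
    if lo > hi:
        return float("-inf")  # the event is impossible
    log_denom = log_comb(population, draws)
    terms = [
        log_comb(successes, i)
        + log_comb(population - successes, draws - i)
        - log_denom
        for i in range(lo, hi + 1)
    ]
    top = max(terms)
    # Log-sum-exp keeps the sum stable when every term is tiny.
    log_p = top + math.log(sum(math.exp(t - top) for t in terms))
    return min(log_p, 0.0) / math.log(10)


# Counts quoted in the text (assumed roles: observed, population, successes, draws).
log10_p = hypergeom_log10_sf(2602, 3955078, 4029, 55110)
```

For these counts `log10_p` lies far below the smallest exponent a double-precision float can represent, so a direct probability calculation would print zero.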

Despite the achievements of the multi-target strategy in drug discovery, questions concerning safety remain, as the tendency to act on multiple genes may increase the probability of inducing adverse effects. The present analyses indicate that these agents prefer to target genes with similar sequences and functions, namely paralogs. Target selection is therefore not random, and agent activities are constrained to a relatively narrow range. This helps to alleviate the side effects of multi-target agents and thus to enhance their druggability.

Furthermore, we analyzed the chemical genetic data recorded in connectivity map (cMap) (Lamb et al., 2006). The cMap comprises 7,056 gene expression profiles for five human cell lines treated with 1,309 agents. Using the biclustering approach FABIA (factor analysis for bicluster acquisition), we have generated 49 gene modules for cMap data, establishing links between gene modules and chemical agents (Xiong et al., 2014). Therefore, each agent has a gene module profile, and the promiscuity of the agent increases with the increasing number of modules the agent covers. As shown in **Figure 5A**, with the increase of targets, the agents indeed cover more gene modules, supporting the opinion that multi-targeted agents have a higher risk of yielding unwanted effects. However, the druggability analysis indicated that with the increasing number of targets, the drug approval ratio does not decrease but rather increases slightly (**Figure 5B**). Moreover, if only disease-associated genes are considered, the drug approval ratio increases evidently with the increase of targeted gene number (**Figure 5C**). This observation strongly suggests that despite the enhanced risk in side effects, multi-targeted agents are still very promising in drug development.

### Dependence of Agent Activity/Druggability on Target Quality

Besides the quantity of agent targets, their quality also influences the medicinal potential of agents in principle. Our prior study revealed that agents targeting "top genes" have higher therapeutic potential (Quan et al., 2018), where "top genes" were defined as those tightly associated with certain diseases. Four disease-gene databases, i.e., AlzGene (Bertram et al., 2007), SzGene (Allen et al., 2008), PDGene (Lill et al., 2012), and MSGene (Lill et al., 1994), provide "top genes" annotations for Alzheimer's disease, schizophrenia, Parkinson's disease, and multiple sclerosis, respectively. From DGIdb, TTD and DrugBank, we retrieved 3,692 agents targeting the genes, including "top genes," contained in these four databases (**Table S2**). As illustrated in **Figure 6**, multi-target agents exhibit higher medicinal potential than single-target counterparts, consistent with the above observations. Moreover, for the agents covering "top genes," the genetics-derived activities are more likely to be supported by clinical evidence and to be clinically approved (**Figure 6** and **Table S2**), indicating the importance of target quality in genetics-based drug discovery.

However, only a few genetic databases contain quality information for disease genes. Considering the above finding that multi-target agents usually hit paralogs, we speculated that ohnolog genes, i.e., paralogs generated by whole genome duplication, may be used as "top genes" instead. Ohnolog genes have been recognized to be significantly enriched in disease genes, compared with other paralog genes, because of their strong dosage balance (Makino and McLysaght, 2010; McLysaght et al., 2014; Xie et al., 2016; Sekine and Makino, 2017).

As illustrated in **Figure 7**, the agents covering disease-associated ohnolog genes indeed exhibit higher approval potential (P < 1.09 × 10<sup>−61</sup>, hypergeometric test), suggesting that disease-associated ohnolog genes can be regarded as "top genes" to some extent. This finding is very useful in establishing the machine-learning models for drug activity prediction (see below for details).

### Target Quality Evaluation and Druggability Score of Disease Genes

Eight disease gene databases (ClinVar, OMIM, HGMD, Orphanet, GWASdb, INTREPID, GAD, and DisGeNET) are used in the present study. Target quality is likely to differ between databases, which prompted us to evaluate it by comparing the clinically supported ratio of the genetics-implicated agent activities derived from each of the eight databases. The results showed that the target genes of ClinVar have the highest quality, with 16.52% of genetics-based activity predictions supported by clinical tests. The target quality (measured by clinically active ratio) of the other databases declines in the order: OMIM (15.01%), HGMD (14.09%), Orphanet (13.62%), GWASdb (10.53%), INTREPID (7.08%), GAD (5.75%), and DisGeNET (4.14%) (**Figure 8** and **Table S3**). This observation inspired us to propose a parameter for quantitatively measuring the druggability of disease genes. First, the genes derived from different databases were given different quality scores, with the highest-quality database (i.e., ClinVar) assigned the highest score (eight points) and the lowest-quality database (i.e., DisGeNET) the lowest score (one point). Then, the scores were summed for each disease gene to define its druggability (**see Methods**). The higher the score, the more druggable the disease gene. Note that a gene may have different scores for different diseases.
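The scoring rule can be written down compactly. The sketch below assumes that points decrease by one per rank between the two stated endpoints (eight for the top database, one for the bottom), and that each database contributes at most once to a given gene-disease pair; both are assumptions, since the exact definition is deferred to the Methods. The gene and disease identifiers in the usage note are placeholders.

```python
# Points per database, following the quality ranking reported in the text.
# Only the endpoints (8 and 1) are stated explicitly; the steps in between
# are assumed to decrease by one per rank.
DATABASE_POINTS = {
    "ClinVar": 8, "OMIM": 7, "HGMD": 6, "Orphanet": 5,
    "GWASdb": 4, "INTREPID": 3, "GAD": 2, "DisGeNET": 1,
}


def druggability_scores(records):
    """records: iterable of (gene, disease, database) annotations.

    Returns {(gene, disease): summed points}. A database is counted only once
    per gene-disease pair, even if it lists the pair several times.
    """
    seen = set()
    scores = {}
    for gene, disease, database in records:
        key = (gene, disease, database)
        if key in seen:
            continue
        seen.add(key)
        pair = (gene, disease)
        scores[pair] = scores.get(pair, 0) + DATABASE_POINTS[database]
    return scores
```

For example, a pair reported by both the top-ranked and second-ranked databases would score 15 under this scheme, while the same gene could score 1 for another disease that only the lowest-ranked database links to it.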

This scoring system is validated by the following observations. First, for disease genes with higher druggability scores, the genetics-implicated activities of agents are more likely to be clinically supported and approved (**Figure 9** and **Table S4**). Considering the correlation between gene druggability and pathogenicity (Plenge et al., 2013; Quan and Zhang, 2016), it is inferred that the druggability score is also appropriate for characterizing gene-disease links. Indeed, the "top genes" derived from AlzGene, SzGene, PDGene, and MSGene, which are tightly connected with diseases, exhibit much higher druggability scores than other genes with the same pathogenic annotations (P = 2.51 × 10<sup>−52</sup>, Wilcoxon rank-sum test) (**Figure 10**). Therefore, each disease can be characterized by the corresponding scored genes, which constitute a gene profile pertinent to the disease. Different diseases can be compared by calculating Spearman's rank correlation between their gene profiles. Interestingly, diseases exhibiting similar gene profiles display similar symptoms as measured by UMLS::similarity (**Figure 11**), validating the scoring system in characterizing gene-disease links. Together, we conclude that the druggability score can be used to measure target quality and the genetic links between genes and diseases, which is of great value in drug activity prediction by machine-learning models.
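Comparing two diseases by the rank correlation of their gene profiles might look like the following standard-library sketch. How the profiles are aligned is not specified in this section; here they are compared over the union of their genes, with a score of zero for genes missing from one profile, which is an assumption made for illustration.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def profile_correlation(profile_a, profile_b):
    """Spearman correlation between two {gene: druggability score} profiles."""
    genes = sorted(set(profile_a) | set(profile_b))
    a = [profile_a.get(g, 0) for g in genes]  # assumption: missing gene scores 0
    b = [profile_b.get(g, 0) for g in genes]
    return pearson(average_ranks(a), average_ranks(b))
```

Spearman's coefficient is computed here as the Pearson correlation of the ranks, which handles ties correctly when average ranks are used.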

### Agent Activity Prediction With Multi-Label Classification Model

The above analysis implies that it is possible to establish drug-activity prediction models based on the genetic information of drug targets. Since a drug is usually associated with activities for multiple diseases and a disease can be treated by multiple drugs, drug-activity prediction can be considered a multi-label classification task. In this paper, we adopted the multi-label k-nearest neighbor (MLKNN) method, which can construct high-accuracy multi-label prediction models for drug-activity prediction (Zhang and Zhou, 2007; Wen et al., 2015).

First, we investigated a variety of features to represent the characteristics of druggability. Because different features may bring diverse information as well as noise, we adopted an ensemble learning method to select suitable features for building the models (Lee and Soo, 2013; Yang et al., 2014; Zhang et al., 2015). Considering that agents targeting multiple disease genes, in particular "top disease genes" and genes with high druggability scores, tend to show high therapeutic potential (**Figures 3**, **6**, **7**, **9**), we selected four parameters to build the models. The first parameter characterizes the overall score of genes responsible for certain diseases within drug targets, and the second parameter is the normalized average value of the overall score. The third and fourth parameters describe the absolute number and relative ratio of ohnologous disease genes (serving as "top genes") within drug targets, respectively (**see Methods**).

Representation of drug labels is a crucial step in multi-label learning. An agent-disease pair was regarded as a positive, if the drug hits one or more disease genes and is indicated for treating this disease. An agent-disease pair was regarded as a negative, if the drug targets one or more disease genes but is not annotated for controlling this disease. As a result, a total of 74,902 positives covering 5,759 agents and 667 diseases, and 3,778,517 negatives were selected.

Given a dataset of $n$ drugs denoted as $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i$ and $y_i$ are the $p$-dimensional feature vector and $q$-dimensional disease vector for the $i$th drug, respectively. Our goal is to build the functional relationship $Y = F(X): 2^p \to 2^q$ between explanatory variables (feature vector) and target values (agent-activity vector) for multi-label learning.

First, four MLKNN models were constructed based on four features. Then, each model was evaluated by internal 5-fold cross-validation on the training data. As a result, five MLKNN models were built based on the five internal folds and the selected features. The final prediction result is the average and standard deviation of the outputs of the five MLKNN models. Finally, we used the ensemble learning method to combine the four features and generate high-accuracy prediction models (**see Methods**).

The performance of the assembled classifier for agent-activity prediction is shown in **Figure 12**. For 5-fold stratified cross-validation with 1,000 repeats, MLKNN displays the best performance (**Table S5**). By inputting the 5,759 original agents and associated targets into the models (with the threshold of the predictive value set to 0.5), 11,649 activities were predicted. Of the predicted activities, 67.01% are supported by clinical trials and 14.52% have been approved, which are much higher than the overall ratios of genetics-implicated clinical activity and approved indication (3.96 and 1.16%, respectively).

To examine for which kind of diseases the predictions are most relevant, we compared the clinically active/approval ratio of the predicted results for various diseases. It was found that leukemia and lymphoma have the most predictions (**Table S6**). To demonstrate the usefulness of the present method, we tested the predicted anti-leukemia agents by cytotoxicity experiments. Using our models, 809 agents were predicted to have anti-leukemia potential, of which 550 (67.99%) have been validated by prior clinical tests. Thus, it is intriguing to examine the anti-leukemia potential of the remaining 259 agents. Of these 259 agents, 14 are commercially available and were evaluated by K562 (chronic myeloid leukemia-derived cancer cell line) cytotoxicity assays. The results show that 10 agents (71.43%) can inhibit the growth of K562 efficiently (**Figure 13**), with IC50 values ranging from 0.106 (saracatinib) to 111.2 µM (veliparib) (**Table S7**).

FIGURE 6 | Effects of top genes on the clinically active/approval ratio of agents. The top genes were derived from AlzGene, SZGene, PDGene, and MSGene. From DGIdb, TTD and DrugBank, we retrieved 3,692 agents targeting the genes contained in the four databases, of which 726 targeted at least one top gene. The results show that for the agents covering top genes, their genetics-implicated activities are more likely to be supported by clinical trials and to be clinically approved (*P*-values were calculated using the hypergeometric test).

active/approval ratio of agents. A total of 7,294 ohnolog genes were obtained from the work of Makino and Mclysaght (2010), of which 5,265 genes were disease-associated. Searching DGIdb, TTD and DrugBank revealed that 4,058 agents targeted 1,164 of the 5,265 ohnolog genes. The results show that for the agents covering disease-associated ohnolog genes, their genetics-derived activities are more likely to be supported by clinical evidence and to be clinically approved (*P*-values were calculated using the hypergeometric test).

To facilitate the use of the machine-learning prediction models, we developed a web server, SCG-Drug (Systems Chemical Genetics-Drug, http://zhanglab.hzau.edu.cn/scgdrug), that allows quick and intuitive access to the background information and predicted results. Currently, SCG-Drug contains 5,759 agents, 703 diseases and 19,233 genes derived from various databases. By inputting the target information of any agent into SCG-Drug, one can use the established machine-learning models to predict the potential activities of the agent. The SCG-Drug web interfaces allow users to explore medicinal information related to a given drug, disease or gene through four interfaces on the "Analysis" page: "Drug," "Batch prediction," "Disease," and "Gene." The "Drug" interface allows users to submit a single drug to retrieve the target genes and potential activities of the query drug. For example, when a user submits a single drug shown in the dropdowns, the drug is searched in the database directly. If no match is found for the search term, the user is asked to input the corresponding target genes of the drug, and the system then calls the prediction module. Alternatively, the user can upload a file on the "Batch prediction" interface, in which each agent and its corresponding targets are in a single row and the terms in each row are separated by tabs, along with an email address to which the predicted activities of the agents will be sent. Offline prediction starts automatically, and the predicted results are sent to the user via e-mail. The "Disease" interface allows users to obtain relevant disease genes with their druggability scores and database sources by querying standardized MeSH disease descriptions. The "Gene" interface allows users to explore gene-related diseases (with druggability scores) and drugs by submitting a gene name or an Entrez ID documented in the server. In addition, users can obtain the information for documented drugs (with normalized indications) and targets/genes (with normalized disease descriptions) from the "Download" page. The data and the machine-learning models will be updated regularly.

FIGURE 10 | Comparison of druggability scores for top genes derived from AlzGene, SzGene, PDGene, MSGene, and ordinary genes with the same pathogenic annotations. The top genes exhibit evidently higher scores than other genes (*P* = 2.51 × 10<sup>−52</sup>, Wilcoxon rank-sum test).

### DISCUSSION

Selecting agents with desired activities and high druggability from an infinite chemical space is a fundamental task for drug development. Previous studies have revealed that genetic disease genes can provide valuable clues for drug activity prediction and druggability assessment (Brinkman et al., 2006; Sanseau et al., 2012; Wang Z. Y. et al., 2012; Plenge et al., 2013; Wang and Zhang, 2013; Okada et al., 2014; Nelson et al., 2015). However, these studies are limited to the single-drug-single-target paradigm. Because most complex diseases are caused by multiple pathogenic factors, it is reasonable to speculate that targeting multiple disorder factors will better navigate the drug space. In this study, through a comprehensive analysis, we show that aiming at multiple disease genes helps to prioritize drug candidates with promising activities and high druggability. Additionally, strengthened genetic links between target genes and diseases help to improve the medicinal potential of drug candidates. Drug-gene interaction information is expected to accumulate rapidly through emerging techniques in chemical biology. However, the identification of reliable genetic links between genes and diseases depends on progress in medical genetics.

A number of systems genetics methods have been developed for enriching and screening the driver genes underlying complex traits in the post-GWAS era. For example, Zhu et al. identified 126 genes related to human complex traits through the integration of summary-level GWAS results and eQTL data (Zhu et al., 2016). Based on the exome sequencing, array copy number and RNA sequencing (RNA-seq) data from 3,281 samples across 12 cancer types, Leiserson et al. performed a pan-cancer analysis of mutated networks utilizing the HotNet2 (HotNet diffusion-oriented sub-networks) algorithm, by which they identified 16 significantly mutated subnetworks containing 147 genes. Many of these genes have been validated to play a critical role in cancer pathogenesis (Leiserson et al., 2015). Gamazon et al. proposed a gene-based association method called PrediXcan that directly tests the molecular mechanisms through which genetic variation affects phenotype (Gamazon et al., 2015). Greene et al. introduced a network-guided GWAS analysis method called NetWAS, which integrates tissue-specific networks and nominally significant P-values in GWAS to identify biologically important disease-gene associations (Greene et al., 2015). Although these methods are helpful for identifying reliable genes associated with a complex disease trait, their complex application procedures hinder convenient use. In this study, we endorsed the possibility of using ohnolog genes as a source of "top disease genes." The high accessibility of ohnologs will facilitate the identification of disease driver genes and genetics-based drug discovery.

The above discoveries inspired us to establish systems chemical genetic models for predicting drug activities. Because drug repurposing is a hot spot in the pharmaceutical industry, a number of theoretical methods, including cheminformatics-based, bioinformatics-based and systems biology-based methods, have been proposed to predict drug activities (Jin and Wong, 2014). However, most of these methods were derived from parameters trained using large datasets, suggesting that they may be sensitive to datasets and poor in generalization capabilities. The identification of the genetic determinants of drug activities facilitates the rational selection of parameters to establish machine-learning models for drug activity prediction. Because this model was built on the fundamental principle of drug activity determination, it is expected to be robust when generalized to different datasets and explainable to a certain extent. Moreover, to maximize convenience for researchers, a user-friendly online service (SCG-Drug) was provided for drug-activity prediction and data retrieval. These systems chemical genetics methods are of high value in prioritizing drug candidates, also highlighting the importance of modern genetics in facilitating the paradigm shift of the pharmaceutical industry.

### MATERIALS AND METHODS

#### Data Sources and Pre-processing

#### Agent Information

We collected agents and agent-target associations from three databases: DrugBank, TTD, and DGIdb (Law et al., 2014; Qin et al., 2014; Wagner et al., 2015). By integrating the 6,841 agents covering 3,692 targets from DrugBank, the 5,208 agents covering 569 targets from TTD, and the 10,941 agents covering 3,090 targets from DGIdb, we obtained 35,860 agent-target associations, comprising 16,021 agents and 4,613 target genes. The indication information for the agents was collected from DrugBank, TTD, and ClinicalTrials (Zarin et al., 2011; Law et al., 2014; Qin et al., 2014). In total, we obtained 80, 90 agents with corresponding target genes and pharmacological activity records. Using the disease classes provided by Pharmaprojects (similarity threshold: 0.75; for more details see the **Disease standardization** section), we finally acquired 5,759 agents covering 667 types of diseases and 2,813 target genes.

#### Disease-Associated Genes

Eight databases were used to collect disease-related genes, including the Genetic Association Database (GAD, https://geneticassociationdb.nih.gov/) (Becker et al., 2004), Online Mendelian Inheritance in Man (OMIM, http://omim.org/) (Hamosh et al., 2005), Clinvar (http://www.ncbi.nlm.nih.gov/clinvar/) (Landrum et al., 2014), Orphanet (http://www.orpha.net/consor/cgi-bin/index.php), DisGeNET (http://www.disgenet.org/web/DisGeNET/menu/rdf) (Piñero et al., 2015), INtegrated TaRget gEne PredIction (INTREPID) (Chen and Tian, 2016), GWASdb (http://jjwanglab.org/gwasdb) (Nelson et al., 2015) and The Human Gene Mutation Database (HGMD, http://www.hgmd.cf.ac.uk/ac/index.php) (Wang X. et al., 2012). A total of 19,233 disease-associated genes were collected for use in the present analysis. Genes that could not be mapped to an Entrez ID were excluded. The available URLs, version information, access dates, and number of records from the above eight databases are provided in **Table S8**.

#### Disease Standardization

We used the Unified Medical Language System (UMLS), which provides a comprehensive set of medical concepts, to standardize the disease annotations of genes and agents. UMLS is a medical terminology system that has been developed by the National Library of Medicine for more than 20 years and contains a large number of standardized medical concepts. The natural language processing program MetaMap was used to convert disease annotations to the corresponding disease concepts (Aronson, 2001). We selected Medical Subject Headings (MeSH) as the vocabulary, and limited the semantic types to "Pathologic Function," "Injury or Poisoning," and "Anatomical Abnormality" to obtain the disease-related concepts (Liu et al., 2014). We processed all gene-related phenotypes and agents' indications using UMLS concepts. Because MeSH defines disease concepts using a hierarchical system, a disease may be classified under a narrow disease type; for example, "Alzheimer disease 15" is a subtype of "Alzheimer disease," and the latter is simply a broader term for the former. In our work, all subtype disease concepts were converted to the appropriate broader term using the Perl module UMLS::Interface. Disease annotations that could not be mapped to any disease concept were excluded from subsequent analyses. Using the disease classes provided by Pharmaprojects (similarity threshold: 0.75) (Mcinnes et al., 2009), we obtained 914,190 gene-disease pairs (involving 703 types of diseases) and 74,902 agent-disease pairs (involving 667 types of diseases).

#### Sequence Similarity Analysis

The needle program of EMBOSS package (Version: 6.6.0.0) (Rice et al., 2000) was employed to perform sequence similarity analysis of agent-targeted proteins, because of its accurate production of Needleman-Wunsch global pairwise alignments.

#### Gene Ontology (GO) Terms Similarity Measurement

We used the GO-based Czekanowski–Dice distance to evaluate the GO terms similarity of the target pairs. The Czekanowski– Dice functional distance was calculated using a previously described method (Ovaska et al., 2008). The GO term information of the gene pairs was obtained from the Ensembl database (version 72).

#### "Top Genes" and Ohnolog Genes

The AlzGene database contains 650 genes for Alzheimer's disease (Bertram et al., 2007); the SzGene database contains 937 genes for schizophrenia (Allen et al., 2008); the PDGene database contains 571 genes for Parkinson's disease (Lill et al., 2012); and the MSGene database contains 675 genes for multiple sclerosis (Lill et al., 1994). From these databases, 44, 43, 31, and 43 genes strongly associated with Alzheimer's disease, schizophrenia, Parkinson's disease and multiple sclerosis, respectively, were identified. These genes were termed "top genes," meaning that relatively reliable associations have been established between these genes and certain diseases. In addition, the ohnologs served as an alternative source of "top disease genes," because ohnologs are significantly enriched with disease genes due to their strong dosage balance (Makino and Mclysaght, 2010; McLysaght et al., 2014). From Makino et al.'s work (Makino and Mclysaght, 2010), we extracted 9,057 ohnolog pairs covering 7,295 genes from the human genome.

#### Druggability Score of Disease Genes

Based on the clinically active ratio of genes from eight disease databases (Clinvar, OMIM, HGMD, Orphanet, GWASdb, INTREPID, GAD, and DisGeNET), we proposed a parameter named druggability score for quantitatively measuring the druggability of disease genes. First, the genes derived from different databases were given different scores according to the rank of each database by clinically active ratio: Clinvar (the highest ratio) was assigned eight points, OMIM seven, HGMD six, Orphanet five, GWASdb four, INTREPID three, GAD two, and DisGeNET (the lowest ratio) one point (**Table S3**). Then, if a disease gene is recorded in multiple databases, the scores of the corresponding databases are summed for this disease gene to define its druggability:

$$Druggability\ score = \sum_{j=1}^{k} score_{ij} \tag{1}$$

where $score_{ij}$ denotes the assigned score of pathogenic gene $i$ in the $j$th database (**Table S3**); $i = 1, 2, \dots, m$; $j = 1, 2, \dots, k$, where $m$ is the number of disease genes and $k$ is the number of databases ($k = 8$ in this study).
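As a minimal sketch of Equation (1), assuming the database scores listed above, the druggability score of a disease gene could be computed as follows. The function name and the example record are hypothetical and are not taken from the authors' scripts.

```python
# Scores per database as described in the text (Clinvar = 8 ... DisGeNET = 1).
DATABASE_SCORES = {
    "Clinvar": 8, "OMIM": 7, "HGMD": 6, "Orphanet": 5,
    "GWASdb": 4, "INTREPID": 3, "GAD": 2, "DisGeNET": 1,
}


def druggability_score(databases):
    """Sum the scores of the distinct databases that record a disease gene."""
    return sum(DATABASE_SCORES[name] for name in set(databases))


# Hypothetical disease gene recorded in three of the eight databases.
example_score = druggability_score(["Clinvar", "OMIM", "DisGeNET"])
print(example_score)
```

A gene recorded in all eight databases reaches the maximum of 36, which is the normalization constant used later for feature generation.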

#### Statistical Analysis

#### Disease Similarity Measurement

First, the disease terms of genes and the indication annotations of agents were converted to standardized UMLS medical concepts by the natural language processing tool MetaMap. Then, using the disease classes provided by Pharmaprojects (similarity threshold: 0.75), the disease similarity was measured using UMLS::similarity. The Lin measure, which is calculated using the information content and path of concepts, shows good performance for disease similarity measurement (Nelson et al., 2015). In this study, we used the Lin measure to evaluate the disease term similarity of all disease concepts. It is calculated using the following equation:

$$Lin = \frac{2 \times IC(lcs)}{IC(concept1) + IC(concept2)}\tag{2}$$

where IC is the negative log of the probability of a concept and lcs is the least common subsuming concept of concept1 and concept2. The probability is pre-calculated by the Perl module by summing the probability of the concept occurring in some text and the probability of its descendants occurring in some text.
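The calculation can be illustrated with a toy example. The sketch below uses the standard form of the Lin measure, 2 × IC(lcs) / (IC(concept1) + IC(concept2)), so that identical concepts score 1. The concept hierarchy and occurrence counts are invented for illustration; in the study these quantities come from UMLS::Similarity.

```python
import math

# Hypothetical toy hierarchy: concept -> parent (None marks the root).
PARENT = {
    "disease": None,
    "neurodegenerative disease": "disease",
    "Alzheimer disease": "neurodegenerative disease",
    "Parkinson disease": "neurodegenerative disease",
    "leukemia": "disease",
}
# Hypothetical occurrence counts of each concept in some text corpus.
COUNTS = {
    "disease": 10, "neurodegenerative disease": 5,
    "Alzheimer disease": 20, "Parkinson disease": 15, "leukemia": 50,
}


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def information_content(concept):
    """IC = -log p, where p covers the concept and all of its descendants."""
    total = sum(COUNTS.values())
    covered = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return -math.log(covered / total)


def lin(concept1, concept2):
    """Lin similarity based on the least common subsuming concept (lcs)."""
    path2 = set(ancestors(concept2))
    lcs = next(c for c in ancestors(concept1) if c in path2)
    denominator = information_content(concept1) + information_content(concept2)
    if denominator == 0:
        return 0.0  # both concepts are the root and carry no information
    return 2 * information_content(lcs) / denominator


sibling_similarity = lin("Alzheimer disease", "Parkinson disease")
```

Concepts whose only shared ancestor is the root receive a similarity of 0, because the root has zero information content.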

#### Tanimoto Coefficient Calculation

To assess the correlations between disease concepts and their corresponding causal genes or drugs, we characterized the distance between disease gene sets or drug sets using the Tanimoto coefficient. The Tanimoto coefficient (TC) is calculated using the following equation:

$$TC = \frac{N_{AB}}{N_A + N_B - N_{AB}} \tag{3}$$

where $N_A$ is the number of disease A-related genes or drugs, $N_B$ is the number of disease B-related genes or drugs, and $N_{AB}$ is the number of genes or drugs common to disease A and disease B.
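Equation (3) translates directly into a set operation. The following is a small sketch; the gene sets are hypothetical examples and the function name is not from the original scripts.

```python
def tanimoto(items_a, items_b):
    """Tanimoto coefficient between two sets of genes (or drugs)."""
    a, b = set(items_a), set(items_b)
    common = len(a & b)                    # N_AB
    denominator = len(a) + len(b) - common  # N_A + N_B - N_AB
    return common / denominator if denominator else 0.0


# Hypothetical gene sets for two diseases.
genes_disease_a = {"APP", "PSEN1", "APOE"}
genes_disease_b = {"APOE", "SNCA"}
tc = tanimoto(genes_disease_a, genes_disease_b)
```

Returning 0 when both sets are empty is a design choice made here to avoid a division by zero; the paper does not discuss this case.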

#### Permutation Test

To evaluate the quality of agent-target pairs, we performed a permutation test with 10,000 permutations on each of the three sets of agent-target pairs derived from DGIdb, TTD and DrugBank (Law et al., 2014; Qin et al., 2014; Wagner et al., 2015). In each permutation, the agents were randomly assigned targets and the clinically active ratio of agents was calculated. This random shuffling procedure was repeated 10,000 times.
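The shuffling procedure can be sketched as follows. This is only an illustration under stated assumptions: the `active_ratio` definition (the fraction of agents with at least one target linked to one of their indications) and the empirical P-value formula are assumptions made here, and all data are hypothetical.

```python
import random


def active_ratio(agent_targets, gene_diseases, agent_indications):
    """Fraction of agents with a target linked to one of their indications.

    This definition is an assumption made for illustration only.
    """
    hits = 0
    for agent, targets in agent_targets.items():
        implicated = set()
        for gene in targets:
            implicated |= gene_diseases.get(gene, set())
        if implicated & agent_indications.get(agent, set()):
            hits += 1
    return hits / len(agent_targets)


def permutation_test(agent_targets, gene_diseases, agent_indications,
                     n_permutations=10000, seed=0):
    """Shuffle target sets among agents and compare with the observed ratio."""
    rng = random.Random(seed)
    observed = active_ratio(agent_targets, gene_diseases, agent_indications)
    agents = list(agent_targets)
    target_sets = [agent_targets[a] for a in agents]
    at_least_as_high = 0
    for _ in range(n_permutations):
        rng.shuffle(target_sets)  # random reassignment of targets to agents
        shuffled = dict(zip(agents, target_sets))
        ratio = active_ratio(shuffled, gene_diseases, agent_indications)
        if ratio >= observed:
            at_least_as_high += 1
    # Empirical one-sided P-value with the usual +1 correction.
    return observed, (at_least_as_high + 1) / (n_permutations + 1)


# Hypothetical toy data: every agent hits a gene of its own indication.
toy_targets = {"a1": {"g1"}, "a2": {"g2"}, "a3": {"g3"}, "a4": {"g4"}}
toy_gene_diseases = {"g1": {"d1"}, "g2": {"d2"}, "g3": {"d3"}, "g4": {"d4"}}
toy_indications = {"a1": {"d1"}, "a2": {"d2"}, "a3": {"d3"}, "a4": {"d4"}}
observed_ratio, p_value = permutation_test(
    toy_targets, toy_gene_diseases, toy_indications, n_permutations=200)
```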

#### Machine-Learning Modeling

#### Feature Generation

We rationally selected four parameters to build the model. The first parameter characterizes the overall druggability score of the pathogenic genes within the drug targets. The second parameter is the average value of the first parameter, normalized by 36 (namely, the sum of the eight database scores, 8 + 7 + ... + 1). For example, if an agent targets two related disease genes derived from Clinvar and DisGeNET, respectively, the first parameter will be 9 (8 + 1), and the second parameter will be 0.125 (9/(2 × 36)). The third and fourth parameters are the absolute number and relative ratio of ohnologous disease genes within the drug targets, respectively.
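The four parameters for one agent-disease pair might be computed as in the sketch below. It reproduces the worked example in the text (9 and 0.125). The denominator of the fourth parameter is not specified in the text; taking it as the total number of targets of the agent is an assumption made here, and all names and data are hypothetical.

```python
MAX_SCORE = 36  # 8 + 7 + ... + 1, the highest possible druggability score


def agent_features(targets, disease_gene_scores, ohnologs):
    """Return the four features of an agent for one disease.

    targets: set of target genes of the agent.
    disease_gene_scores: {gene: druggability score} for the disease.
    ohnologs: set of ohnolog genes.
    """
    disease_targets = [g for g in targets if g in disease_gene_scores]
    total = sum(disease_gene_scores[g] for g in disease_targets)
    if disease_targets:
        normalized_mean = total / (len(disease_targets) * MAX_SCORE)
    else:
        normalized_mean = 0.0
    n_ohnolog = sum(1 for g in disease_targets if g in ohnologs)
    # Assumption: the ratio is taken relative to all targets of the agent.
    ohnolog_ratio = n_ohnolog / len(targets) if targets else 0.0
    return total, normalized_mean, n_ohnolog, ohnolog_ratio


# Hypothetical agent with three targets, two of which are disease genes
# (one recorded only in Clinvar, one only in DisGeNET).
features = agent_features(
    targets={"G1", "G2", "G3"},
    disease_gene_scores={"G1": 8, "G2": 1},
    ohnologs={"G1"},
)
```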

#### Positive Sample Generation

An agent-disease pair was regarded as a positive if the drug hits one or more disease genes and is indicated for treating this disease. The positive samples were generated as 74,902 agent-disease pairs.

#### Negative Sample Generation

An agent-disease pair was regarded as a negative if the drug targets one or more disease genes but is not annotated for controlling this disease. The negative samples were generated as 3,778,517 pairs. In the web server SCG-Drug (http://zhanglab.hzau.edu.cn/scgdrug), the model with all samples is provided.
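The labeling rule for positive and negative samples can be sketched as follows. The function name and the toy data are hypothetical; only the rule itself comes from the text.

```python
def label_pairs(agent_targets, disease_genes, agent_indications):
    """Split agent-disease pairs into positives and negatives.

    Pairs in which the agent hits no gene of the disease are not used.
    """
    positives, negatives = [], []
    for agent, targets in agent_targets.items():
        for disease, genes in disease_genes.items():
            if not (targets & genes):
                continue  # the agent targets none of the disease genes
            if disease in agent_indications.get(agent, set()):
                positives.append((agent, disease))
            else:
                negatives.append((agent, disease))
    return positives, negatives


# Hypothetical toy data.
toy_agent_targets = {"drugA": {"G1", "G2"}, "drugB": {"G3"}}
toy_disease_genes = {"D1": {"G1"}, "D2": {"G2", "G3"}, "D3": {"G9"}}
toy_agent_indications = {"drugA": {"D1"}, "drugB": {"D3"}}
toy_positives, toy_negatives = label_pairs(
    toy_agent_targets, toy_disease_genes, toy_agent_indications)
```

In this toy example, the pair (drugB, D3) is neither positive nor negative: drugB is indicated for D3 but targets none of its genes.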

#### MLKNN

Given the training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i$ is the $i$th instance (drug) and $y_i$ is the corresponding disease vector: $y_i(l) = 1$ if the $i$th instance can treat the $l$th disease, and $y_i(l) = 0$ otherwise, $l = 1, 2, \dots, q$. The $k$ nearest neighbors (in the training set) of instance $x_i$ are denoted by $N(x_i)$, $i = 1, 2, \dots, n$. Based on the $l$th disease labels of these neighbors, a membership counting vector can be defined as:

$$C_{x_i}(l) = \sum_{a \in N(x_i)} y_a(l), \quad l = 1, 2, \dots, q \tag{4}$$

where $C_{x_i}(l)$ counts the number of neighbors of $x_i$ treating the $l$th disease, and $0 \le C_{x_i}(l) \le k$.

For a test drug $t$, MLKNN identifies its $k$ nearest neighbors in the training set and calculates $C_t(l)$. Let $H_1^l$ be the event that a drug treats the $l$th disease and $H_0^l$ be the event that a drug does not treat the $l$th disease. Let $E_j^l$ be the event that a drug has exactly $j$ neighbors with the $l$th disease among its $k$ nearest neighbors. For the instance $t$, its label for the $l$th disease, $y_t(l)$, is determined by the following principle:

$$y_t(l) = \arg\max_{b \in \{0, 1\}} P\left(H_b^l \mid E_{C_t(l)}^l\right), \quad l = 1, 2, \dots, q \tag{5}$$

Using Bayes' rule, Equation (5) can be rewritten as:

$$\begin{aligned} y_t(l) &= \arg\max_{b \in \{0, 1\}} \frac{P\left(H_b^l\right) P\left(E_{C_t(l)}^l \mid H_b^l\right)}{P\left(E_{C_t(l)}^l\right)} \\ &= \arg\max_{b \in \{0, 1\}} P\left(H_b^l\right) P\left(E_{C_t(l)}^l \mid H_b^l\right) \end{aligned} \tag{6}$$

In the prediction model, $P(H_b^l)$ and $P(E_{C_t(l)}^l \mid H_b^l)$ are calculated based on the training set. The prior probabilities are calculated as:

$$P\left(H_1^l\right) = \frac{s + \sum_{i=1}^{n} y_i(l)}{s \times 2 + n} \quad \text{and} \quad P\left(H_0^l\right) = 1 - P\left(H_1^l\right) \tag{7}$$

Then, the posterior probabilities $P(E_{C_{x_i}(l)}^l \mid H_0^l)$ and $P(E_{C_{x_i}(l)}^l \mid H_1^l)$ are calculated by the following equations:

$$P\left(E_j^l \mid H_1^l\right) = \frac{s + c_l[j]}{s \times (k+1) + \sum_{i=0}^{k} c_l[i]} \tag{8}$$

$$P\left(E_j^l \mid H_0^l\right) = \frac{s + c_l'[j]}{s \times (k+1) + \sum_{i=0}^{k} c_l'[i]}, \quad l = 1, 2, \dots, q, \; j = 0, 1, \dots, k \tag{9}$$

where $s$ is the smoothing factor; $c_l[i]$ is the number of training instances with the $l$th disease that have exactly $i$ neighbors with the $l$th disease among their $k$ nearest neighbors; and $c_l'[i]$ is the number of training instances without the $l$th disease that have exactly $i$ neighbors with the $l$th disease among their $k$ nearest neighbors (Zhang and Zhou, 2007).

#### Cross-Validation

We used 5-fold stratified cross-validation with 1,000 repeats to avoid arbitrariness.
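The MLKNN procedure described in this section (membership counts, smoothed priors, and the Bayesian decision rule of Zhang and Zhou, 2007) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the Euclidean distance, the default values of `k` and `s`, and the toy data are assumptions.

```python
import math


def nearest(train_x, x, k, skip=None):
    """Indices of the k training instances closest to x (Euclidean distance)."""
    candidates = [i for i in range(len(train_x)) if i != skip]
    candidates.sort(key=lambda i: math.dist(train_x[i], x))
    return candidates[:k]


def fit_mlknn(train_x, train_y, k=3, s=1.0):
    """Estimate smoothed priors and neighbor-count tables from training data."""
    n, q = len(train_x), len(train_y[0])
    prior = [(s + sum(y[l] for y in train_y)) / (2 * s + n) for l in range(q)]
    with_label = [[0] * (k + 1) for _ in range(q)]     # c_l[j]
    without_label = [[0] * (k + 1) for _ in range(q)]  # c_l'[j]
    for i in range(n):
        neighbors = nearest(train_x, train_x[i], k, skip=i)
        for l in range(q):
            j = sum(train_y[a][l] for a in neighbors)  # membership count
            table = with_label if train_y[i][l] else without_label
            table[l][j] += 1
    return {"x": train_x, "y": train_y, "k": k, "s": s, "prior": prior,
            "with": with_label, "without": without_label}


def predict_mlknn(model, t):
    """Return P(H_1 | E) for each disease; a value above 0.5 means label 1."""
    k, s = model["k"], model["s"]
    neighbors = nearest(model["x"], t, k)
    scores = []
    for l, p1 in enumerate(model["prior"]):
        j = sum(model["y"][a][l] for a in neighbors)
        c, c_not = model["with"][l], model["without"][l]
        like1 = (s + c[j]) / (s * (k + 1) + sum(c))
        like0 = (s + c_not[j]) / (s * (k + 1) + sum(c_not))
        scores.append(p1 * like1 / (p1 * like1 + (1 - p1) * like0))
    return scores


# Hypothetical toy data: two clusters of drugs, each treating one disease.
toy_x = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
toy_y = [[1, 0] for _ in range(4)] + [[0, 1] for _ in range(4)]
toy_model = fit_mlknn(toy_x, toy_y, k=3, s=1.0)
toy_scores = predict_mlknn(toy_model, (0.5, 0.5))
```

Thresholding the returned probabilities at 0.5 is equivalent to the arg max over the two hypotheses, since the two posteriors sum to 1.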

#### Ensemble Learning Method

In this paper, an ensemble learning method was designed to combine various features and develop high-accuracy prediction models (Lee and Soo, 2013; Yang et al., 2014; Wen et al., 2015). Previous studies have shown that combining predictions from different methods could achieve better and more robust results than using one algorithm alone. In this study, an ensemble classifier was generated using the linear weighted sum of outputs from classifiers based on four features.

Given $m$ features, we build $m$ individual feature-based MLKNN models and use them as base predictors. Since features may make different contributions, it is natural to adopt a weighted scoring ensemble strategy, which assigns the $m$ base predictors $m$ weights $\{w_1, w_2, \dots, w_m\}$. For a testing instance, the $i$th predictor gives scores for $q$ diseases, denoted as $S_i = (s_i^1, s_i^2, \dots, s_i^q)$, $i = 1, 2, \dots, m$. The final prediction produced by the ensemble model is the linear weighted sum of the outputs of the base predictors:

$$\begin{aligned} \text{Ensemble Score} &= \begin{bmatrix} w_1, w_2, \ldots, w_m \end{bmatrix} \times \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_m \end{bmatrix} \\ &= \begin{bmatrix} w_1, w_2, \ldots, w_m \end{bmatrix} \times \begin{bmatrix} s_1^1 & s_1^2 & \cdots & s_1^q \\ \vdots & \vdots & \ddots & \vdots \\ s_m^1 & s_m^2 & \cdots & s_m^q \end{bmatrix} \end{aligned} \tag{11}$$

Tuning the weights of the base predictors is critical for the ensemble model. The weights are non-negative real values between 0 and 1, and the sum of the weights equals 1. The internal 5-fold cross-validation AUPR on the training data is used as the fitness score (Lee and Soo, 2013; Yang et al., 2014; Wen et al., 2015).
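The weighted-sum step of the ensemble can be written in a few lines. The sketch below only combines scores that are already available; it does not tune the weights, and the weights and base scores shown are hypothetical values.

```python
def ensemble_score(weights, base_scores):
    """Linear weighted sum of the outputs of m base predictors.

    weights: m non-negative weights that sum to 1.
    base_scores: m lists, each holding the scores for the q diseases.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if len(weights) != len(base_scores):
        raise ValueError("one weight is needed per base predictor")
    q = len(base_scores[0])
    return [sum(w * scores[l] for w, scores in zip(weights, base_scores))
            for l in range(q)]


# Hypothetical weights and outputs of four feature-based predictors
# for a single test agent and two diseases.
toy_weights = [0.4, 0.3, 0.2, 0.1]
toy_base_scores = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.5, 0.5]]
toy_ensemble = ensemble_score(toy_weights, toy_base_scores)
```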

#### Performance Evaluation

In the agent-activities prediction, the predicted scores for activities are usually merged for evaluation, and the metrics for ordinary binary classification are often adopted. The area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) can be used to evaluate models independently of any threshold. However, there are many more negative labels than positive labels in the agent-activities prediction, and machine-learning methods are likely to produce overestimated AUC scores. Since AUPR takes into account recall as well as precision, it is used as the most important metric.

We used the following metrics to evaluate the performance of the machine-learning models: Precision, Accuracy (ACC), Recall, Specificity, and Matthews correlation coefficient (MCC) (Equations 12–16). These metrics are calculated from the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

$$Precision = \frac{TP}{TP + FP} \tag{12}$$

$$ACC = \frac{TP + TN}{TP + FN + TN + FP} \tag{13}$$

$$Recall = \frac{TP}{TP + FN} \tag{14}$$

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{15}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \tag{16}$$

Several metrics were designed for multi-label classification, i.e., Hamming loss, one-error, coverage, ranking loss and average precision. Hamming loss is the fraction of wrong labels relative to the total number of labels. The one-error evaluates the fraction of examples whose top-ranked label is not in the relevant label set. The coverage evaluates how many steps are needed, on average, to move down the ranked label list so as to cover all the relevant labels of the example. The average precision evaluates the average fraction of relevant labels ranked higher than a particular relevant label. Therefore, we adopt AUPR, average precision, one-error, coverage, ranking loss and Hamming loss for the agent-activities prediction.
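Equations (12) to (16) and the Hamming loss can be illustrated with the short sketch below. Returning 0 when a denominator is zero is a convention chosen here, not something stated in the text, and the confusion-matrix counts and label matrices are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely; return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Precision, ACC, Recall, Specificity and MCC from confusion counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": _ratio(tp, tp + fp),
        "acc": _ratio(tp + tn, tp + fn + tn + fp),
        "recall": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


def hamming_loss(y_true, y_pred):
    """Fraction of wrongly predicted labels over all labels."""
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    total = sum(len(row) for row in y_true)
    return _ratio(wrong, total)


# Hypothetical confusion-matrix counts and label matrices.
toy_metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
toy_hamming = hamming_loss([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 1]])
```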

#### Cytotoxicity Assays

#### Cell Culture and Reagents

K562 cells were purchased from Shanghai Cell Bank, Chinese Academy of Sciences. Cells were cultured in RPMI-1640 (Procell, China) with 10% FBS (Biowest, France) and 1% penicillin/streptomycin (Procell, China) at 37°C in a humidified atmosphere containing 5% CO<sub>2</sub>. All agents were purchased from TargetMol and dissolved in dimethyl sulfoxide (DMSO).

#### Cytotoxicity Assays

The effects of agents on K562 cells were determined using the CellTiter-Glo® Luminescent Cell Viability Assay (Promega). Cells were seeded in 96-well plates at a density of 2 × 10<sup>3</sup> cells/well and treated with different agents for 72 h. An equal volume of CellTiter-Glo reagent was added to the cells in the 96-well plates, mixed for 2 min on an orbital shaker, and incubated for a further 10 min at room temperature. The luminescence of each well was measured with a FlexStation 3 (Molecular Devices). The IC50 values were calculated using GraphPad Prism software. All experiments were performed in triplicate.

#### Web Server Implementation

Systems Chemical Genetics-Drug (SCG-Drug, http://zhanglab.hzau.edu.cn/scgdrug) was built in Java, JavaScript, and Bootstrap with MySQL as the primary data store. The site is served with nginx on a server running CentOS 7.2. Two modules are used: the search module and the prediction module. The search module was implemented with an entry-name matching algorithm. With this module, the server returns a list of partially matched terms and shows them in the dropdowns when users type only the starting characters of a gene, disease or drug in the search field. The prediction module consists of two steps: data preprocessing and drug indication prediction. In the data preprocessing step, a Python script is used to produce the parameter matrix. In the drug indication prediction step, an R script is used to generate the result by calling the prediction model.

### Code and Data Availability

The R and Python scripts used to process the data and conduct the analyses described herein are available upon request. All of the intermediate data are available from the authors by request.

### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: http://zhanglab.hzau.edu.cn/scgdrug.

### AUTHOR'S NOTE

Finding novel drugs or new uses for old drugs is a costly process. Previous studies have shown that genetics, which is best dedicated to revealing gene-disease links, makes great contributions to the pharmaceutical industry. On the other hand, most diseases are caused by multiple pathogenic factors. In this paper, we proposed that aiming at multiple genes associated with certain diseases rather than a single pathogenic factor is more efficient in identifying potential drugs. In addition, our results demonstrated that the therapeutic potential of agents can be enhanced with the consolidation of genetic links between targets and diseases. In other words, simultaneously increasing the quantity and quality of target-disease associations can significantly increase the activity/druggability of agents. According to the above theories, we have established a drug-activity predictor with a multi-label classification model based on the genetic information of drug targets (online service is freely available at SCG-Drug, http://zhanglab.hzau.edu.cn/scgdrug), which is of high value in prioritizing drug candidates.

### AUTHOR CONTRIBUTIONS

H-YZ: conceptualization. YQ, Z-HL, and L-DZ: data curation. YQ, Z-HL, and Q-YY: formal analysis. H-YZ: funding acquisition. QZ, Z-JC, and XQ: investigation. H-YZ and L-DZ: methodology. JL and Y-ML: software. B-ML: conceived and designed the experiments. Y-HX: performed the experiments. H-YZ and L-DZ: supervision. H-YZ and L-DZ: validation. YQ, Z-HL, Q-YY, L-DZ, and JL: visualization. YQ, Z-HL, Q-YY, L-DZ, and H-YZ: writing–original draft. YQ, Z-HL, Q-YY, L-DZ, and H-YZ: writing–review and editing.

### ACKNOWLEDGMENTS

This research was partially supported by the Fundamental Research Funds for the Central Universities (Grant 2662017PY115). We are grateful to Lin Li for his helpful assistance in web server implementation.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00474/full#supplementary-material




**Conflict of Interest Statement:** H-YZ has received research/grant support from Wuhan Bio-Links Technology Co., Ltd. Huazhong Agricultural University and the developers of the methods for drug discovery and drug repositioning may benefit financially pursuant to the University's Policy on Inventions, Patents and Technology Transfer, even if these methods are not used in the commercialized therapy.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Quan, Luo, Yang, Li, Zhu, Liu, Lv, Cui, Qin, Xu, Zhu and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# SCIA: A Novel Gene Set Analysis Applicable to Data With Different Characteristics

Yiqun Li<sup>1†</sup>, Ying Wu<sup>2†</sup>, Xiaohan Zhang<sup>1</sup>, Yunfan Bai<sup>1</sup>, Luqman Muhammad Akthar<sup>1</sup>, Xin Lu<sup>1</sup>, Ming Shi<sup>1</sup>, Jianxiang Zhao<sup>1</sup>, Qinghua Jiang<sup>1</sup>\* and Yu Li<sup>1</sup>\*

*<sup>1</sup> Department of Laboratory of Cancer Biology, School of Life Science and Technology, Harbin Institute of Technology, Harbin, China, <sup>2</sup> Department of Biostatistics, School of Public Health, Southern Medical University, Guangzhou, China*

#### Edited by:

*Lei Deng, Central South University, China*

#### Reviewed by:

*Qinghua Cui, Peking University, China Hao Lin, University of Electronic Science and Technology of China, China*

#### \*Correspondence:

*Qinghua Jiang qhjiang@hit.edu.cn Yu Li liyugene@hit.edu.cn*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

Received: *05 May 2019* Accepted: *05 June 2019* Published: *25 June 2019*

#### Citation:

*Li Y, Wu Y, Zhang X, Bai Y, Akthar LM, Lu X, Shi M, Zhao J, Jiang Q and Li Y (2019) SCIA: A Novel Gene Set Analysis Applicable to Data With Different Characteristics. Front. Genet. 10:598. doi: 10.3389/fgene.2019.00598*

Gene set analysis is commonly used in functional enrichment and molecular pathway analyses. Most of the present methods are based on competitive testing, which assumes that each gene is independent of the others. However, the false discovery rates of competitive methods are amplified when they are applied to datasets with high inter-gene correlations. The self-contained testing methods could solve this problem, but they impose other restrictions on data characteristics. Therefore, a statistically rigorous testing method applicable to different datasets with various complex characteristics is needed to obtain unbiased and comparable results. We propose a self-contained and competitive incorporated analysis (SCIA) to alleviate the bias caused by the limited application scope of existing gene set analysis methods. This is accomplished through a novel permutation strategy using *a priori* biological networks to selectively permute gene labels with different probabilities. In simulation studies, SCIA was compared with four representative analysis methods (GSEA, CAMERA, ROAST, and NES), and produced the best performance in both false discovery rate and sensitivity under most conditions with different parameter settings. Further, KEGG pathway analysis of two real lung cancer datasets showed that SCIA found many more pathways shared by both datasets than GSEA did, and most of them are supported by the literature. Overall, SCIA promisingly offers researchers more reliable and comparable results with different datasets.

Keywords: GSA, competitive method, self-contained method, topology-based method, functional enrichment analysis

## INTRODUCTION

In recent years, gene set analysis (GSA) has become the most common method in functional genomics studies, because evaluating a single p-value for a gene set is statistically more powerful than genewise tests. Typically, by choosing gene sets that represent biological pathways, GSA can help to bring insights into biological mechanisms, cellular functions, and disease states (Kanehisa et al., 2012). Various statistical procedures for gene set testing have been proposed and can be divided into three generations, roughly in chronological order (Khatri et al., 2012; Zyla et al., 2017). The first generation of GSA used over-representation analysis (ORA), where the first step is to define differentially expressed genes (DEGs) and non-DEGs in the input gene list by a certain threshold (Beissbarth and Speed, 2004). Then, the proportions of DEGs in a given functional gene set and in the background gene set are compared using a hypergeometric, binomial, or chi-square test. This comparison of DEG proportions is the original idea behind competitive testing. ORA has been reported with minor variations by many different authors (Khatri and Draghici, 2005). Even though the ORA method seems simple and effective, there are two serious drawbacks. First, the information about the strength of gene expression is lost by gene binarization. Second, the assumption of inter-gene independence required by these tests is not satisfied in most cases.
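As an illustration of the ORA step described above, the following minimal sketch computes an upper-tail hypergeometric p-value for one gene set. The function name `ora_p_value` and all counts are hypothetical and chosen only for illustration; they do not come from this study.

```python
from math import comb


def ora_p_value(n_background, n_deg, set_size, deg_in_set):
    """Upper-tail hypergeometric p-value, P(X >= deg_in_set).

    n_background: total number of genes tested
    n_deg:        number of genes called differentially expressed (DEGs)
    set_size:     number of genes in the gene set
    deg_in_set:   number of DEGs that fall inside the gene set
    """
    total = comb(n_background, set_size)
    upper = min(n_deg, set_size)
    # Sum the probabilities of seeing deg_in_set or more DEGs in the set.
    tail = sum(
        comb(n_deg, k) * comb(n_background - n_deg, set_size - k)
        for k in range(deg_in_set, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1,000 genes, 100 DEGs, a 20-gene set containing 6 DEGs.
p = ora_p_value(1000, 100, 20, 6)
print(f"ORA p-value: {p:.4g}")
```

The sketch makes both drawbacks visible: only the binary DEG calls enter the calculation, and the hypergeometric model treats genes as independent draws.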

The second generation of GSA, known as functional class sorting (FCS), was proposed to avoid these deficiencies. Instead of defining genes as DEGs and non-DEGs, different univariate gene-level statistics such as the t-statistic (Al-Shahrour et al., 2005; Tian et al., 2005), Q-statistic (Goeman et al., 2004), signal-to-noise ratio (Subramanian et al., 2005), fold change score and Z-score (Kim and Volsky, 2005), or their transformations (Tian et al., 2005; Ackermann and Strimmer, 2009) are used to measure differential expression and overcome the first problem of ORA. Then, a gene-set-level statistic is aggregated from these gene-level statistics. Aggregation approaches can be the sum, mean, or median of the gene-level statistics (Jiang and Gentleman, 2007), or statistics such as the Kolmogorov-Smirnov statistic (Mootha et al., 2003; Subramanian et al., 2005), Wilcoxon rank sum (Barry et al., 2005), or the max-mean statistic (Efron and Tibshirani, 2006). Because the distributions of gene-set-level statistics are usually unknown, permutation procedures are used to complete FCS tests. According to different null hypotheses and corresponding permutation objects, FCS methods can be classified as competitive or self-contained methods.

Assuming that all the input genes are independent of each other, competitive methods usually permute gene labels but lose the inter-gene information, which causes the false discovery rate (FDR) to be uncontrolled when the inter-gene correlations are high. Self-contained methods test each gene set independently by permuting sample labels but lose all the information outside the given gene set, which causes the FDR to be uncontrolled when the percentage of DEGs in the background genes is high. Irrespective of the prerequisites for the permutation procedure, the ORA methods can be considered as generalized competitive methods, whereas the classical methods based on multiple linear regression (Mansmann and Meister, 2005; Kong et al., 2006), by definition, are special cases of self-contained methods.
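The difference between the two permutation objects can be sketched on toy data. The example below is not the procedure of any cited method: the gene-level statistic (absolute difference in group means), the mean aggregation, the add-one p-value correction, and all data are assumptions made for illustration.

```python
import random
from statistics import mean


def gene_scores(expr, labels):
    """Absolute difference in group means for every gene (a simple gene-level statistic)."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = abs(mean(a) - mean(b))
    return scores


def set_score(scores, gene_set):
    """Gene-set-level statistic: mean of the gene-level statistics."""
    return mean(scores[g] for g in gene_set)


def competitive_p(expr, labels, gene_set, n_perm=1000, rng=random):
    """Permute gene labels: compare the set with random gene sets of equal size."""
    scores = gene_scores(expr, labels)
    observed = set_score(scores, gene_set)
    genes = list(expr)
    hits = sum(
        set_score(scores, rng.sample(genes, len(gene_set))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


def self_contained_p(expr, labels, gene_set, n_perm=1000, rng=random):
    """Permute sample labels: only genes inside the set are used."""
    sub = {g: expr[g] for g in gene_set}
    observed = set_score(gene_scores(sub, labels), gene_set)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        hits += set_score(gene_scores(sub, shuffled), gene_set) >= observed
    return (hits + 1) / (n_perm + 1)


# Toy data: 50 genes, 5 vs 5 samples, the first five genes shifted in condition 1.
rng = random.Random(0)
labels = [0] * 5 + [1] * 5
expr = {f"g{i}": [rng.gauss(0, 1) for _ in labels] for i in range(50)}
gene_set = ["g0", "g1", "g2", "g3", "g4"]
for g in gene_set:
    expr[g] = [v + 5 * lab for v, lab in zip(expr[g], labels)]

p_comp = competitive_p(expr, labels, gene_set, n_perm=200, rng=rng)
p_self = self_contained_p(expr, labels, gene_set, n_perm=200, rng=rng)
print(f"competitive p = {p_comp:.3f}, self-contained p = {p_self:.3f}")
```

In `competitive_p` the sample labels stay fixed, so the correlation structure among genes in a random set differs from that of the tested set; in `self_contained_p` nothing outside the gene set is ever consulted. These are the two kinds of information loss described above.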

To address the second problem of ORA, some competitive FCS methods that take account of the correlations among genes have been proposed. The method of Nam (2010) removed the bias caused by the inter-gene correlations, while the method of Wu and Smyth (2012) alleviated the problem by estimating the variance inflation factor. However, the information of inter-gene correlations is partially neglected in these procedures, which causes reduced sensitivity or uncontrolled FDR. Self-contained FCS methods seem to be more powerful than competitive ones and do not assume that all the genes are independent, but their null hypothesis is usually overly restrictive (Goeman et al., 2004; Tian et al., 2005; Khatri et al., 2012). They assume that the gene set does not contain any genes with expression levels that are associated with different experimental conditions. Under this hypothesis, a few DEGs may cause a given pathway to be defined as a significant differential pathway (Khatri et al., 2012). Although the method of Wu et al. (2010) relaxed this hypothesis using a Monte Carlo based testing method, the parameter describing the least proportion of DEGs in a pathway is given arbitrarily instead of being calculated from the expression of genes outside the gene set. Even though competitive methods are overwhelmingly more commonly used than self-contained methods in the genomic literature (Gatti et al., 2010), information is still lost during the permutation procedures. Thus, the conflict between the applicable scopes of self-contained and competitive methods remains unsolved.

The third generation of GSA, known as the pathway topology (PT)-based approach, is based on the large amount of publicly available pathway knowledge. Mitrea et al. (2013) introduced dozens of PT-based methods with different principles and applicable conditions. Most of these methods consider topological information as a weight that measures the centrality of nodes but ignore the spatiotemporal specificity of topological information and changes in the topological structure between different experimental conditions (Fang et al., 2012; Gu et al., 2012; Dona et al., 2017). On this basis, Yuan et al. (2016) proposed a novel statistic that combines node (gene expression) changes with edge (inter-gene correlation) changes. The utilization of biological information greatly improved the performance of PT-based methods; however, their testing procedures are essentially the same as those of FCS methods in that they follow the same pipeline (Mitrea et al., 2013). Therefore, the above defects of FCS methods are not solved by PT-based methods.

Here, we propose a new GSA method with less information loss that can alleviate the bias of self-contained and competitive methods caused by their limited applicability. First, to capture all the information within a given gene set like other self-contained methods, a powerful multivariate statistic C is developed to test node changes and edge changes simultaneously. We chose Hotelling's T<sup>2</sup>, a self-contained statistic with the ability to penalize gene collinearity (Ackermann and Strimmer, 2009), for node testing because of its suitability for overcoming the limitation of competitive methods, and linear regression to test the edge changes among genes. Because of the additivity of chi-square distributed variables, these two statistics are transformed to the chi-square scale and summed to obtain the C statistic. Second, we developed a novel permutation procedure based on a condition-specific shortest-path network (CSSPN, proposed by Dezso et al., 2009). The genes in the CSSPN are selectively permuted instead of permuting all gene labels as usual. This procedure does not disrupt inter-gene correlations but uses inter-pathway information from a priori biological networks, which creates a platform for the incorporation of self-contained, competitive, and PT-based methods. The whole pipeline is called self-contained and competitive incorporated analysis (SCIA), which has been implemented in an R package "SCIA" available on GitHub (https://github.com/YiqunLiHIT/SCIA). Results from this study showed that SCIA outperforms four other commonly used GSA methods in sensitivity and FDR under most conditions in simulated datasets, and that its results are more stable across different real datasets of lung cancer.

**Abbreviations:** GSA, gene set analysis; ORA, over-representation analysis; DEGs, differentially expressed genes; FCS, functional class sorting; FDR, false discovery rate; CSSPN, condition-specific shortest-path network; SCIA, self-contained and competitive incorporated analysis.

## STATISTICAL MODELS AND METHODS

#### Notations and Background Network

The main objective of SCIA is to detect gene sets that are differentially expressed under different experimental conditions. Here, we consider the gene set as pathway P for one experimental condition and P′ for another. N<sub>1</sub> and N<sub>2</sub> are the sample sizes for P and P′, respectively. For convenience, we assumed that P and P′ follow linear models:

$$\begin{aligned} X\_1 \stackrel{\beta\_1}{\rightarrow} X\_2 \stackrel{\beta\_2}{\rightarrow} \dots \dots \dots \ X\_{n-1} \stackrel{\beta\_{n-1}}{\longrightarrow} X\_n\\ X\_1' \stackrel{\beta\_1'}{\rightarrow} X\_2' \stackrel{\beta\_2'}{\rightarrow} \dots \dots \dots X\_{n-1}' \stackrel{\beta'\_{n-1}}{\rightarrow} X\_n' \end{aligned}$$

with n nodes and n − 1 edges, where β<sub>i</sub> (1 ≤ i < n) represents the regression coefficient between X<sub>i</sub> and X<sub>i+1</sub>. Let U = (X̄<sub>1</sub> − X̄<sub>1</sub>′, X̄<sub>2</sub> − X̄<sub>2</sub>′, …, X̄<sub>n</sub> − X̄<sub>n</sub>′)<sup>T</sup> denote the vector of differences in the means of the two groups. S and S′ are the covariance matrices of P and P′, respectively. These notations are also used in the simulation studies.

We chose the Human Protein Reference Database (HPRD) network (Library et al., 2009) as the background network of CSSPN. HPRD is a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks, and disease associations for each protein in the human proteome. Other comprehensive networks, such as the integrated network of seven commonly used networks in Edge Set Enrichment Analysis (Han et al., 2015), can also be used as the background network of SCIA.

#### C Statistic

The C statistic is proposed to measure the difference of a given gene set in different experimental conditions. It consists of two parts, the node difference model and the edge difference model. The node difference model is based on Hotelling's T<sup>2</sup> method:

$$T^2 = \frac{N\_1 N\_2}{N\_1 + N\_2} U^T S\_c^{-1} U$$

where

$$S\_c = \frac{(N\_1 - 1)\,S + (N\_2 - 1)\,S'}{N\_1 + N\_2 - 2}$$
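The two formulas above can be evaluated directly. The following is a minimal sketch in plain Python, not the implementation used in the SCIA package; the helper names (`covariance_matrix`, `solve`, `hotelling_t2`) and the toy data are hypothetical. It assumes the pooled covariance matrix is non-singular, which requires more samples than genes and no perfectly collinear genes.

```python
from statistics import mean


def covariance_matrix(samples):
    """Unbiased sample covariance matrix; `samples` holds one row per sample."""
    n_obs, n_var = len(samples), len(samples[0])
    means = [mean(row[j] for row in samples) for j in range(n_var)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n_obs - 1)
            for j in range(n_var)
        ]
        for i in range(n_var)
    ]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def hotelling_t2(group1, group2):
    """Two-sample Hotelling's T^2 with the pooled covariance matrix S_c."""
    n1, n2 = len(group1), len(group2)
    n_var = len(group1[0])
    u = [mean(r[j] for r in group1) - mean(r[j] for r in group2) for j in range(n_var)]
    s1, s2 = covariance_matrix(group1), covariance_matrix(group2)
    pooled = [
        [((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2) for j in range(n_var)]
        for i in range(n_var)
    ]
    sc_inv_u = solve(pooled, u)  # S_c^{-1} U without forming the inverse
    return n1 * n2 / (n1 + n2) * sum(ui * vi for ui, vi in zip(u, sc_inv_u))


# Toy data: two genes, four samples per condition; condition 2 is shifted by (1, 1).
p_samples = [[1, 0], [-1, 0], [0, 1], [0, -1]]
p_prime_samples = [[2, 1], [0, 1], [1, 2], [1, 0]]
print(f"T^2 = {hotelling_t2(p_samples, p_prime_samples):.3f}")
```

With a single gene, the function reduces to the squared pooled two-sample t-statistic, which is a convenient sanity check.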

Under the self-contained null hypothesis H<sub>0</sub>: U = 0 and with a sufficient sample size, T<sup>2</sup> follows a chi-square distribution with n degrees of freedom, where n is the number of genes in the given pathway. This allows Hotelling's T<sup>2</sup> statistic to be combined with other statistics that also follow a chi-square distribution, because the sum of independent chi-square variables is again chi-square distributed, with the degrees of freedom added. There are many transformations of Hotelling's T<sup>2</sup> statistic that reveal different characteristics. It can be transformed as:

$$F = \frac{N\_1 + N\_2 - n - 1}{(N\_1 + N\_2 - 2)n} T^2$$

following an F distribution with n and N<sub>1</sub> + N<sub>2</sub> − n − 1 degrees of freedom under a relatively small sample size. This allows Hotelling's T<sup>2</sup> statistic to be used alone when the sample size is insufficient. Notably, Hotelling's T<sup>2</sup> test is not only a node testing method but is also related to the Pearson correlation coefficient. For convenience, assume that n = 2 and that N<sub>2</sub> is large enough for the estimated means X̄<sub>i</sub>′ (1 ≤ i ≤ 2) to be treated as constants µ<sub>i</sub> (1 ≤ i ≤ 2); Hotelling's T<sup>2</sup> statistic can then be transformed as:

$$T^2 = \frac{t\_1^2 + t\_2^2 - 2\rho t\_1 t\_2}{1 - \rho^2}$$

where t<sub>1</sub> and t<sub>2</sub> denote the t-statistics for the two component genes, and ρ represents the Pearson correlation coefficient between X<sub>1</sub> and X<sub>2</sub>. If t<sub>1</sub> = t<sub>2</sub>, Hotelling's T<sup>2</sup> statistic can be simplified to:

$$T^2 = \frac{2t\_1^2}{1+\rho}$$

This form of T<sup>2</sup> indicates that when X<sub>1</sub> and X<sub>2</sub> are positively correlated and change similarly between experimental conditions, the statistic is penalized by the Pearson correlation coefficient, which avoids the disadvantage of the competitive methods. When X<sub>1</sub> and X<sub>2</sub> are negatively correlated but both change in the positive direction, which indicates that the correlation between X<sub>1</sub> and X<sub>2</sub> has changed between experimental conditions, the T<sup>2</sup> statistic becomes more sensitive.
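A short numerical check makes the penalty concrete. The sketch below evaluates the bivariate form for equal t-statistics at three correlation values; the function name `t2_bivariate` and the chosen values (t<sub>1</sub> = t<sub>2</sub> = 2, ρ = ±0.8) are hypothetical and serve only as an illustration.

```python
def t2_bivariate(t1, t2, rho):
    """Bivariate form of Hotelling's T^2 from two t-statistics and their correlation."""
    return (t1 ** 2 + t2 ** 2 - 2 * rho * t1 * t2) / (1 - rho ** 2)


# Equal t-statistics: the general form reduces to 2 * t1^2 / (1 + rho).
for rho in (0.8, 0.0, -0.8):
    general = t2_bivariate(2.0, 2.0, rho)
    simplified = 2 * 2.0 ** 2 / (1 + rho)
    print(f"rho = {rho:+.1f}  T^2 = {general:.2f}  2*t1^2/(1+rho) = {simplified:.2f}")
```

For these values the statistic is about 4.44 at ρ = 0.8, 8 at ρ = 0, and 40 at ρ = −0.8: two positively correlated genes that move together contribute less evidence than two independent genes, whereas the same changes in negatively correlated genes are amplified.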

Because Hotelling's T<sup>2</sup> statistic considers the correlations between genes only slightly, a statistically rigorous edge testing statistic is still needed. Based on the linear regression method, a Z-score-like statistic is combined with Hotelling's T<sup>2</sup> statistic in the C statistic. β̂<sub>i</sub> and β̂<sub>i</sub>′ can be estimated by the least squares method. Then the Z-score-like B statistic can be written as:

$$B\_i = \frac{\hat{\beta}\_i - \hat{\beta}\_i^{'}}{\sqrt{\mathrm{var}\left(\hat{\beta}\_i\right) + \mathrm{var}\left(\hat{\beta}\_i^{'}\right)}}$$

Under the null hypothesis H<sub>0</sub>: β<sub>i</sub> = β<sub>i</sub>′, B<sub>i</sub> follows a standard normal distribution, like the Z-score, so B<sub>i</sub><sup>2</sup> follows a chi-square distribution and can be combined with Hotelling's T<sup>2</sup> statistic. Thus, we obtained the C statistic as:

$$\mathbf{C} = T^2 + \sum\_{i=1}^{n-1} \mathbf{B}\_i^2$$

which follows a chi-square distribution with n + (n − 1) degrees of freedom and can be used to test node changes and edge changes simultaneously. Notably, when the sample size is very small, T<sup>2</sup> and B<sub>i</sub><sup>2</sup> do not follow the chi-square distribution, so the SCIA parameter controlling the correlation test should be set to "FALSE."
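The edge statistic and the assembly of C can be sketched as follows. This is an illustration under stated assumptions rather than the SCIA implementation: the slope is fitted through the origin to mirror the chain model above (an intercept could equally be included), the variance estimate uses N − 1 residual degrees of freedom, and all function names and numbers are hypothetical. The helper `t2_to_f` applies the F transformation given earlier for the small-sample case in which T<sup>2</sup> is used alone.

```python
from math import sqrt


def slope_through_origin(x, y):
    """Least-squares slope of y = beta * x + error and the estimated variance of the slope."""
    sxx = sum(v * v for v in x)
    beta = sum(a * b for a, b in zip(x, y)) / sxx
    rss = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    var_beta = rss / (len(x) - 1) / sxx
    return beta, var_beta


def b_statistic(x, y, x_prime, y_prime):
    """Z-score-like statistic comparing one edge between the two conditions."""
    beta, var_beta = slope_through_origin(x, y)
    beta_p, var_beta_p = slope_through_origin(x_prime, y_prime)
    return (beta - beta_p) / sqrt(var_beta + var_beta_p)


def c_statistic(t_squared, b_values):
    """C = T^2 + sum of B_i^2; returns (C, degrees of freedom n + (n - 1))."""
    n_genes = len(b_values) + 1
    return t_squared + sum(b * b for b in b_values), 2 * n_genes - 1


def t2_to_f(t_squared, n1, n2, n_genes):
    """Small-sample F transformation of T^2; returns (F, df1, df2)."""
    df2 = n1 + n2 - n_genes - 1
    return df2 / ((n1 + n2 - 2) * n_genes) * t_squared, n_genes, df2


# Hypothetical edge X1 -> X2 observed in two conditions with different slopes.
x1, x2 = [1.0, 2.0, 3.0, 4.0], [0.6, 0.9, 1.6, 2.1]
x1p, x2p = [1.0, 2.0, 3.0, 4.0], [1.4, 3.1, 4.4, 6.2]
b1 = b_statistic(x1, x2, x1p, x2p)
c_value, dof = c_statistic(6.0, [b1])  # T^2 = 6.0 is a placeholder value
print(f"B_1 = {b1:.2f}, C = {c_value:.2f} on {dof} degrees of freedom")
```

A p-value would then be obtained from the upper tail of the chi-square distribution with the returned degrees of freedom, which needs a statistics library and is omitted here.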

#### CSSPN-Based Permutation Procedure

To avoid the shortcoming of self-contained methods and utilize additional inter-pathway information from a priori biological networks, a CSSPN is built by SCIA. First, a set of DEGs is selected as the terminal genes of the CSSPN, and a set of initial genes can usually be selected in the same way. For each pair of genes (X<sub>i</sub>, X<sub>t</sub>), where X<sub>i</sub> is in the initial gene set and X<sub>t</sub> is in the terminal gene set, all the shortest paths are searched in a background network, such as HPRD (see section Notations and Background Network). When the shortest path is not unique, the path with the highest C score is chosen for a sub-pathway permutation procedure. In this procedure, 1,000 nodes are selected randomly as the initial gene set for each X<sub>t</sub>, which is the only terminal gene in this procedure. If x of the shortest paths built from the randomly selected genes and X<sub>t</sub> have higher C scores than the given gene pair (X<sub>i</sub>, X<sub>t</sub>), the permutation p-value of the sub-pathway (X<sub>i</sub>, X<sub>t</sub>) is x/1,000. The permutation p-value and the C statistic p-value are both adjusted using the method of Benjamini and Hochberg (1995), and the sub-pathway is defined as statistically significant only if both p-values are <0.05. Then, all the significant sub-pathways between the initial gene set and the terminal gene set are used to build the CSSPN. All the genes in the CSSPN can be considered as DEGs with edges and can be used in classical functional enrichment analysis.
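The building blocks of this procedure can be sketched in a few functions. The example below is a simplified illustration, not the SCIA code: the background network is a five-node toy graph instead of HPRD, `path_score` is a placeholder for the C score of a sub-pathway, and the null scores are made-up numbers standing in for the scores of paths built from randomly drawn initial genes.

```python
from collections import deque


def all_shortest_paths(graph, source, target):
    """Every shortest path from source to target in an unweighted graph (BFS)."""
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                parents[nb] = [node]
                queue.append(nb)
            elif dist[nb] == dist[node] + 1:
                parents[nb].append(node)  # another equally short route
    if target not in dist:
        return []
    paths = []

    def walk(node, tail):
        if node == source:
            paths.append([source] + tail)
            return
        for parent in parents[node]:
            walk(parent, [node] + tail)

    walk(target, [])
    return paths


def permutation_p_value(observed_score, null_scores):
    """Fraction of null paths scoring higher than the observed sub-pathway (x / 1,000 in the text)."""
    return sum(s > observed_score for s in null_scores) / len(null_scores)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy background network and placeholder scores (all values hypothetical).
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
         "D": ["B", "C", "E"], "E": ["D"]}
toy_score = {"A": 0.1, "B": 2.0, "C": 0.5, "D": 1.5, "E": 0.2}


def path_score(path):
    """Placeholder for the C score of a sub-pathway."""
    return sum(toy_score[g] for g in path)


paths = all_shortest_paths(graph, "A", "D")
best = max(paths, key=path_score)  # the tie between shortest paths is broken by score
null_scores = [1.2, 4.0, 2.2, 0.9, 3.1, 1.7, 2.8, 0.4]
p_perm = permutation_p_value(path_score(best), null_scores)
print(best, p_perm, benjamini_hochberg([p_perm, 0.01, 0.20]))
```

In a full run, `null_scores` would hold the scores of the best shortest paths from each of the 1,000 randomly selected initial genes to the same terminal gene, and the adjustment would be applied across all tested gene pairs.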

In SCIA, background genes are used selectively in the CSSPN-based permutation procedure. Essentially, the selection of background genes means the information from the a priori biological network is utilized, because all the genes neighboring DEGs in the background network are used at a higher probability to establish the CSSPN. Additionally, because the permutation procedure does not destroy any inter-gene or inter-pathway structures, almost no information is lost in SCIA.

## RESULTS

### Simulated Data and Scenarios

#### Simulated Data

The simulated data were generated under a linear model (Formula 1). First, we generated the initial node X<sub>1</sub> of a given pathway P from the normal distribution N(µ<sub>1</sub>, σ<sub>1</sub><sup>2</sup>). Then, the neighboring nodes X<sub>2</sub> = β<sub>1</sub>X<sub>1</sub> + ε<sub>2</sub>, X<sub>3</sub> = β<sub>2</sub>X<sub>2</sub> + ε<sub>3</sub>, …, X<sub>n</sub> = β<sub>n−1</sub>X<sub>n−1</sub> + ε<sub>n</sub> were generated in the same way, where ε<sub>i</sub> ∼ N(0, τ<sub>i</sub><sup>2</sup>) (1 < i ≤ n) was the residual error term. Similarly, we generated X<sub>1</sub>′ ∼ N(µ<sub>1</sub>′, σ<sub>1</sub>′<sup>2</sup>) and X<sub>i</sub>′ = β<sub>i−1</sub>′X<sub>i−1</sub>′ + ε<sub>i</sub>′ with ε<sub>i</sub>′ ∼ N(0, τ<sub>i</sub>′<sup>2</sup>) (1 < i ≤ n), representing the pathway P′ under another experimental condition. Under the H<sub>0</sub> hypothesis that there is no change in nodes and edges between different experimental conditions, we set the default simulation parameters as µ<sub>1</sub> = µ<sub>1</sub>′ = 1, σ<sub>1</sub><sup>2</sup> = σ<sub>1</sub>′<sup>2</sup> = 1, τ<sub>i</sub><sup>2</sup> = τ<sub>i</sub>′<sup>2</sup> = 1, and β<sub>i</sub> = β<sub>i</sub>′ = 0.5. In the following simulations, unless stated otherwise, the gene number n in a pathway was set to 5, the sample sizes N<sub>1</sub> and N<sub>2</sub> of the two experimental conditions were both set to 100, and the simulations were repeated 1,000 times.
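The generating model can be written as a short simulation routine. This is a minimal sketch using the default parameters stated above; the function name `simulate_pathway`, the shared β and τ for all edges, and the H<sub>1</sub> value β′ = 0.9 in the usage lines are assumptions for illustration (the actual scenario settings are given in the Supplementary Data).

```python
import random


def simulate_pathway(n_genes=5, n_samples=100, mu1=1.0, sigma1=1.0,
                     beta=0.5, tau=1.0, rng=None):
    """Return n_samples rows [X_1, ..., X_n] drawn from the linear chain model.

    X_1 ~ N(mu1, sigma1^2) and X_i = beta * X_{i-1} + eps_i with eps_i ~ N(0, tau^2).
    sigma1 and tau are standard deviations, as expected by random.gauss.
    """
    rng = rng or random.Random()
    rows = []
    for _ in range(n_samples):
        x = [rng.gauss(mu1, sigma1)]
        for _gene in range(1, n_genes):
            x.append(beta * x[-1] + rng.gauss(0.0, tau))
        rows.append(x)
    return rows


rng = random.Random(2019)
# H0: both conditions share every parameter.
p_h0 = simulate_pathway(rng=rng)
p_prime_h0 = simulate_pathway(rng=rng)
# H1 with an edge change: a different regression coefficient (hypothetical value).
p_prime_h1 = simulate_pathway(beta=0.9, rng=rng)
print(len(p_h0), len(p_h0[0]))
```

Repeating the two calls 1,000 times and applying a test to each pair of datasets gives the empirical rejection rate under H<sub>0</sub> or H<sub>1</sub>.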

#### Scenarios

Four scenarios and 16 conditions were used to simulate different data structures and prove the extensive applicability of SCIA. The H<sub>0</sub> hypothesis condition was designed to evaluate the FDR and the H<sub>1</sub> hypothesis condition was designed to evaluate the sensitivity. The basic setting for the H<sub>1</sub> hypothesis is node or edge changes, with three additional conditions: sample size, inter-gene correlation, and percentages of DEGs in background genes that are outside the given pathway. In each scenario, only one additional condition is set as different values to highlight the robustness of SCIA. Thus, the four scenarios are:


Scenarios 1 and 2 were designed to simulate datasets with different inter-gene correlations, scenario 3 was designed to simulate datasets with different percentages of DEGs in background genes, and scenario 4 was designed to simulate datasets with edge changes under different sample sizes. Details of the parameter settings under these scenarios are listed in **Supplementary Data Section 1**.

#### Evaluation of SCIA Performance With Simulated Data

To evaluate its performance, SCIA was compared with two powerful self-contained approaches, ROAST and NES, and two commonly used competitive approaches, CAMERA and GSEA (More details about these methods are stated in **Supplementary Data Section 2**). The application scope of these methods is quite different, so we compared SCIA with them under corresponding application conditions. As shown in **Table 1**, only competitive methods are suitable for scenario 3, and only self-contained methods are suitable for scenario 4.

#### SCIA Successfully Controls the FDR Under Different Inter-gene Correlations in Simulated Datasets

First, we compared SCIA with self-contained methods in scenario 1 under different inter-gene correlations in simulated datasets. The FDRs were well-controlled by all three methods (**Table 2**), and **Figure 1** shows that the sensitivities of the three methods were quite similar, indicating that the C statistic allowed SCIA to match the advantages of the self-contained methods. Notably, ROAST had high sensitivity under high inter-gene correlation. However, high sensitivity with inter-gene correlations close to 1 is not useful for combination with competitive approaches because a small percentage of highly correlated DEGs may produce unreasonably significant results.

*"√" indicates the method was designed for the condition; "×" indicates the method was not designed for the condition and may have problems in sensitivity or FDR.*

TABLE 2 | FDR is well-controlled by SCIA similar to other self-contained methods under different inter-gene correlations in simulated datasets.

Second, we compared SCIA with competitive methods under scenario 2. **Table 3** shows that GSEA lost control of the FDR, which is common for competitive methods because of the correlation between genes, whereas CAMERA corrected the high FDR only under a moderate inter-gene correlation of all genes and failed to control the FDR under high inter-gene correlations. SCIA was the most robust method, with well-controlled FDRs and sensitivities similar to those of CAMERA at comparable FDRs. Because there were no randomly selected DEGs in the given pathway, the SCIA results in scenarios 1 and 2 are comparable, which indicated that the information of background genes outside the given gene set was well-utilized by SCIA. Notably, the intersection ratio of the results obtained from SCIA and GSEA decreases as the inter-gene correlation increases, because GSEA is more sensitive in finding significant pathways with less but consistent expression changes. This result indicated that SCIA and GSEA could find different types of differentially expressed gene sets.

#### SCIA Has Higher Sensitivity and Lower FDR Than Two Competitive Methods Under High Percentages of DEGs in Background Genes

When the percentages of DEGs in background genes are high, there are likely to be relatively high overlaps between a given gene set and background DEGs. Therefore, self-contained methods are invalid in scenario 3, and SCIA was compared with competitive methods only. **Table 4** shows that SCIA had higher sensitivity than the other two methods and, interestingly, the FDR was negatively correlated with the percentage of DEGs in background genes. These results are reasonable and reflect the incorporation of different GSA methods in SCIA. Like other competitive methods, when the percentage of DEGs in background genes was high, SCIA assigned a competitive penalty to the significance of the given pathway, and when the percentage of DEGs in background genes was low, SCIA assumed that even a small percentage of DEGs would produce a significant result for the given pathway because there was no other explanation for these DEGs. Notably, in complex diseases such as cancer, DEGs usually account for more than 40% of the genes in a dataset, a condition under which SCIA was the best method in both sensitivity and FDR.

#### SCIA Has Higher Sensitivity Than the Two Self-Contained Methods in Testing Changes of Inter-gene Correlations

Most competitive methods cannot simultaneously test node and edge changes; hence, we compared SCIA with self-contained methods under scenario 4 with the same H<sub>0</sub> hypothesis and FDRs (**Table 2**) as scenario 1. The influence of different sample sizes was measured at the same time. **Figure 2** shows that SCIA had the highest sensitivity and the slowest drop in sensitivity with decreasing sample sizes. However, when the sample size was 10 pairs, the sensitivity of SCIA dropped sharply because of the approximation of the chi-square distribution (see method), which needs sample sizes of 15–30 pairs. Unsurprisingly, ROAST had the lowest sensitivity because it was not designed for this purpose. Besides, although the edge testing modules of SCIA and NES are quite similar, SCIA was more sensitive because edge changes are also considered by Hotelling's T<sup>2</sup> (see method), indicating that SCIA does not simply superpose node testing and edge testing methods like NES.

TABLE 3 | SCIA has lower FDRs than the competitive methods under different inter-gene correlations in simulated datasets.


TABLE 4 | SCIA has higher sensitivity than the competitive methods under different percentages of DEGs in background genes.


### Evaluation of SCIA Performance With Real Datasets

We applied SCIA to recover differentially expressed genes and pathways involved in lung squamous cell carcinoma (LUSC), a common type of non-small-cell lung cancer, using two datasets, one from the NCBI's GEO (Gene Expression Omnibus) and one from TCGA (The Cancer Genome Atlas) database. The GEO dataset (Series Accession: GSE103512, Brouwer-Visser et al., 2017) contains 23 LUSC sub-type cancer samples and 9 normal samples. The LUSC dataset from TCGA contains 502 LUSC samples and 51 normal samples.

The two LUSC datasets were used as input to compare the sensitivity and robustness of SCIA and GSEA. In the CSSPN-based permutation procedure of SCIA, all the genes were mapped to the HPRD network, then the top 2% of DEGs (about 200 in each dataset) were defined as the initial and terminal genes of CSSPN (see method). All the nodes in the CSSPN were used for classical functional enrichment analysis based on a hypergeometric test. Unlike the simulation studies, the adjustment of permutation p-values (see method) should be moderate here. This is because, under the H<sub>0</sub> hypothesis of simulation studies, there is no relation between the background network and the given gene set, whereas, in real organisms, hundreds of genes in the background network will be differentially expressed in response to the DEGs in the given gene set. Because the C statistic p-values of all the single pathways were already adjusted by the method of Benjamini and Hochberg (1995), we did not adjust the permutation p-value in the following analysis, indicating there are approximately 500 genes in the HPRD background network that, on average, are affected by the terminal DEGs. This p-value threshold is a parameter of SCIA and can be set to different values according to different data and requirements.

The results of the KEGG functional enrichment analysis are shown in **Supplementary Tables S1–S4**. SCIA found 131 and 64 pathways and GSEA found 46 and 40 pathways in the GSE103512 and TCGA LUSC datasets, respectively. Among them, 55 (42%) SCIA pathways were common between the two datasets, whereas only 5 (11%) of the GSEA pathways were common between the two datasets. These results illustrated that there was little comparability between the two GSEA results, whereas SCIA could reveal both the results common to different lung cancer datasets and the individual differences between the two studies, implying that the SCIA results for different datasets were comparable. More than 33 of the 55 SCIA pathways found in both datasets have been reported previously to be related to lung cancer (**Table 5**), including non-small cell lung cancer. Most of these pathways were not detected by GSEA. This result showed that SCIA could find many positive pathways that GSEA could not, and the high proportion of results with literature support indicated that intersecting the SCIA results from different datasets could increase the reliability. Further, SCIA produces a CSSPN, which can be considered simply as a set of DEGs. SCIA detected 41 DEGs in the two datasets, and more than 27 (**Supplementary Table S5**) of these genes have been reported previously to be related to lung cancer.
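The reported overlap percentages can be reproduced from the pathway counts. The denominators are not stated explicitly in the text; the arithmetic below assumes they are the numbers of pathways found in the GSE103512 dataset (131 for SCIA and 46 for GSEA), which matches the reported values:

$$\frac{55}{131} \approx 0.42, \qquad \frac{5}{46} \approx 0.11$$

Relative to the smaller TCGA result lists (64 and 40 pathways), the same overlaps would correspond to 55/64 ≈ 0.86 for SCIA and 5/40 ≈ 0.13 for GSEA, so the contrast between the two methods holds under either choice of denominator.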

TABLE 5 | SCIA found more literature supported KEGG pathways than GSEA in two non-small-cell lung cancer datasets.


*"Yes" means the pathway is found by both SCIA and GSEA with adjusted p-value* < *0.05. "No" means the pathway is found by SCIA but not by GSEA.*

### DISCUSSION

SCIA is the first GSA method that combines the advantages of self-contained, competitive, and PT-based methods. SCIA has three main advantages over the other methods as was shown by the simulation studies. First, SCIA is powerful and statistically rigorous under high inter-gene correlations, which are conditions under which most competitive methods lose control of FDR. Second, SCIA has higher sensitivity and minimum FDR compared to two competitive methods (GSEA, CAMERA) under a high proportion of DEGs in background genes, which are conditions that make most self-contained methods invalid. Moreover, SCIA uses an a priori biological network and performs better than ROAST and NES in testing edge (inter-gene correlation) changes. Overall, the FDR of SCIA was well-controlled and its sensitivity was higher than that of the other four methods tested (GSEA, CAMERA, ROAST, and NES) under most simulated conditions, highlighting the extensive applicability and unbiased results of SCIA.

The robustness of SCIA can be attributed to two aspects. First, its extensive applicability with reliable and unbiased results, as mentioned above, is the most important reason. Second, through the CSSPN-based permutation strategy in SCIA, a reasonable hypothesis is innovatively combined with a priori biological information. Briefly, if DEGs can be mapped to only one gene set, a positive weight is added to them because there is no other explanation for the differential expression of these genes. Therefore, for SCIA, the comprehensiveness of the background networks is more important than their accuracy. However, when the a priori biological networks are more comprehensive, the hypothesis of SCIA becomes more reasonable and the results are more precise. This robustness gives SCIA the ability to work with different datasets and to integrate the results obtained from them.

There are many potential applications for SCIA, including differential expression analysis (Dona et al., 2017), sub-pathway analysis (Martini et al., 2013), and microRNA target gene prediction (Wang, 2008). First, all of the genes in the CSSPN can be considered as DEGs and used independently. In addition, the CSSPN itself can be considered as a cascading effect pathway when the input data are from a knockout/overexpression experiment of a single gene. Second, if the function of differential pathways can be biologically confirmed, the sub-pathway of the given functional pathway can be built without the permutation procedure. Third, the choice of initial gene set is very flexible and can be tailored for different purposes. For instance, if the input data are derived from a microRNA knockout/overexpression experiment, the initial gene set can be selected as the predicted target genes of the microRNAs, and the significant predicted targets will be more likely to be the targets of these microRNAs in a specific experimental condition.

### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103512; https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga.

### AUTHOR CONTRIBUTIONS

YuL and QJ designed the experiments. YiL, YW, YB, XZ, and JZ performed the experiments and data analysis. LA, XL, and MS contributed to the writing of this article.

### FUNDING

This work was supported by the National Natural Science Foundation of China [31571323, 61571152, 81703322] and the National Science and Technology Major Project [2016YFC1202302].

### ACKNOWLEDGMENTS

We thank Margaret Biswas, Ph.D., from Liwen Bianji, Edanz Group China (www.liwenbianji.cn/ac), for editing the English text of a draft of this manuscript.

### REFERENCES


## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00598/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Li, Wu, Zhang, Bai, Akthar, Lu, Shi, Zhao, Jiang and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# iDNA6mA-Rice: A Computational Tool for Detecting N6-Methyladenine Sites in Rice

*Hao Lv1 , Fu-Ying Dao1 , Zheng-Xing Guan1, Dan Zhang1, Jiu-Xin Tan1, Yong Zhang1\*, Wei Chen2\* and Hao Lin1\**

*1 Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China, 2 Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China*

DNA N6-methyladenine (6mA) is a dominant form of DNA modification and is involved in many biological functions. The accurate genome-wide identification of 6mA sites may increase understanding of its biological functions. Experimental methods for 6mA detection in eukaryotic genomes are laborious and expensive. Therefore, it is necessary to develop computational methods to identify 6mA sites on a genomic scale, especially for plant genomes. Based on this consideration, the study aims to develop a machine learning-based method for predicting 6mA sites in the rice genome. We initially used mononucleotide binary encoding to formulate positive and negative samples. Subsequently, the Random Forest machine learning algorithm was used to perform the classification for identifying 6mA sites. Our proposed method produced an area under the receiver operating characteristic curve of 0.964 with an overall accuracy of 0.917, as indicated by the fivefold cross-validation test. Furthermore, an independent dataset was established to assess the generalization ability of our method. On this dataset, an area under the receiver operating characteristic curve of 0.981 was obtained, suggesting that the proposed method performs well in predicting 6mA sites in the rice genome. For the convenience of retrieving 6mA sites, on the basis of the computational method, we built a freely accessible web server named iDNA6mA-Rice at http://lin-group.cn/server/iDNA6mA-Rice.

#### *Edited by:*

*Liang Cheng, Harbin Medical University, China*

#### *Reviewed by:*

*Jianzhao Gao, Nankai University, China Xiangxiang Zeng, Xiamen University, China*

#### *\*Correspondence:*

*Yong Zhang zhangyong916@uestc.edu.cn Wei Chen chenweiimu@gmail.com Hao Lin hlin@uestc.edu.cn*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> *Received: 13 June 2019 Accepted: 26 July 2019 Published: 10 September 2019*

#### *Citation:*

*Lv H, Dao F-Y, Guan Z-X, Zhang D, Tan J-X, Zhang Y, Chen W and Lin H (2019) iDNA6mA-Rice: A Computational Tool for Detecting N6-Methyladenine Sites in Rice. Front. Genet. 10:793. doi: 10.3389/fgene.2019.00793*

Keywords: N6-methyladenine, mono-nucleotide binary encoding, random forest, cross-validation, web-server

### INTRODUCTION

Methylated bases, such as N4-methylcytosine (4mC), N6-methyladenine (6mA), and 5-methylcytosine (5mC), exist in the genomic DNA of diverse species (Cheng, 1995; Ratel et al., 2006). All these DNA methylation modifications play important roles in controlling many biological functions (Tang et al., 2018b). As an epigenetic mechanism, DNA methylation refers to a process in which methyl groups are transferred to DNA molecules, and it is essential for the normal development of organisms (Bergman and Cedar, 2013; Smith and Meissner, 2013; von Meyenn et al., 2016). Through DNA methylation, the activity of a DNA segment can be changed without changing its sequence. For example, gene transcription can be repressed when DNA methylation occurs at its promoter (Bird, 1992).

As shown in **Figure 1**, 6mA is formed when a methyl group is transferred to the sixth position of the adenine ring under the catalysis of methyltransferases. 6mA is a noncanonical DNA
modification form in different eukaryotes at low levels (Fu et al., 2015; Greer et al., 2015; Zhang et al., 2015; Koziol et al., 2016; Liu et al., 2016; Mondo et al., 2017; Wang et al., 2017). 6mA in prokaryotes and eukaryotes shows similar characteristics (Heyn and Esteller, 2015). It has diverse functions, including guiding the discrimination of an original DNA strand from a newly synthesized DNA strand (Wion and Casadesus, 2006), regulating gene transcription (Cheng et al., 2016), repressing transposable elements, and reducing the stability of base pairings (Fang et al., 2012). Surprisingly, the methylation protection is an inheritable state, although it may be changed by environmental factors (Wion and Casadesus, 2006). Therefore, it is worth underscoring the importance of 6mA throughout generations.

Recent studies revealed the genome-wide distributions of 6mA in *Tetrahymena* (Wang et al., 2017), *Chlamydomonas reinhardtii* (Fu et al., 2015), *Drosophila melanogaster* (Zhang et al., 2015), *Caenorhabditis elegans* (Greer et al., 2015), vertebrates (e.g., frog and fish) (Koziol et al., 2016; Liu et al., 2016), mammals (e.g., human and *Mus musculus*) (Wu et al., 2016; Yao et al., 2017; Xiao et al., 2018; Zou et al., 2018a), fungi (Mondo et al., 2017), and vascular plants (e.g., rice) (Zhou et al., 2018). Although these studies experimentally confirmed the presence of 6mA in eukaryotic genomes and achieved encouraging results, the implication of 6mA in epigenetics is still obscure (Ratel et al., 2006). In addition, in eukaryotes, the level of 6mA is so low that it can only be detected by advanced techniques. In rice, using two antibodies and based on SMRT and IP-seq, Zhou et al. (2018) found that AGG-rich sequences were the most significantly enriched for 6mA. Thus, the computational prediction of 6mA sites may be a good choice to reduce experimental costs and guide the experimental study of plant 6mA.

In fact, several computational methods have been applied to the identification of DNA methylation sites. Based on experimentally confirmed 4mC sites, Chen et al. (2017) first developed a predictor called iDNA4mC to identify 4mC sites, in which DNA samples were formulated with nucleotide frequency and nucleotide chemical property. Then, based on the same dataset (Chen et al., 2017), He et al. (2018a) established another tool named 4mCPred, and Wei et al. (2018b) built a new predictor (4mcPred-SVM) to predict 4mC sites. Recently, a free tool called iDNA6mA-PseKNC was constructed for the computational prediction of 6mA sites (Feng et al., 2019). The tool can be used to identify 6mA sites in the *Mus musculus* genome. However, because of the differences between mammalian and plant genomes, it cannot provide useful predictions for plant genomes. Thus, it is necessary to develop a 6mA site predictor for plant genomes. Recently, a tool named i6mA-Pred was constructed to identify 6mA sites in rice (Chen et al., 2019). The tool achieved an area under the receiver operating characteristic curve (auROC) of 0.886 in jackknife cross-validation. However, the dataset used was not large enough, and the accuracy should be further improved.

In view of the aforementioned descriptions, this study aims to develop a new method and establish an efficient tool to identify 6mA sites in the rice genome. A flowchart is shown in **Figure 2**. We firstly collected the existing data in the rice genome, including experimentally confirmed non-6mA sequences and 6mA sequences and built a benchmark dataset based on the report by Zhou et al. (2018). Subsequently, three kinds of sequence encoding features were proposed to formulate samples as the input of the Random Forest algorithm (RF) to discriminate 6mA sequences from non-6mA sequences. Then, several experiments were performed to investigate the prediction capability of the proposed method. Finally, on the basis of the method, we established a predictor called iDNA6mA-Rice.

### MATERIALS AND METHODS

#### Benchmark Dataset

A benchmark dataset is important in building a reliable prediction model. By combining immunoprecipitation with single-molecule real-time sequencing, 6mA sites in the rice genome were detected (Zhou et al., 2018) and deposited in the Gene Expression Omnibus (GEO) database, which was created and is maintained by the National Center for Biotechnology Information (NCBI) (Long et al., 2019). A total of 265,290 6mA site-containing sequences were obtained from GEO. All of these sequences are 41 nt long with the 6mA site at the center. To reduce homologous bias and avoid redundancy (Dao et al., 2018; Su et al., 2018; Tang et al., 2018a; Zou et al., 2018b; Feng et al., 2019), sequences with similarity above 80% were excluded by using the CD-HIT program (Li and Godzik, 2006). Finally, we obtained 154,000 6mA site-containing sequences as positive samples.

TABLE 1 | Details of the three motifs in positive samples.

Negative samples were collected from NCBI (https://www.ncbi.nlm.nih.gov/genome/10) according to the following three rules. First, 41-nt-long sequences with adenine at the center were selected. Second, experimental results proved that the centered adenine was not methylated. Third, because Zhou et al. (2018) reported that 6mA most frequently occurs at GAGG, AGG, and AG motifs, we statistically analyzed the ratios of GAGG, AGG, and AG motifs in the positive samples (**Table 1**) and selected negative samples with the same ratios of motifs so that the negative data were more objective. In this way, a large number of negative samples were obtained. In machine learning, imbalanced datasets can lead to unreliable results. To balance positive and negative samples, 154,000 non-modified segments were randomly picked as negative samples for model training. Finally, the benchmark dataset contained 154,000 positive samples and 154,000 negative samples. The benchmark dataset **S** is formulated as:

$$\mathbf{S} = \mathbf{S}^+ \cup \mathbf{S}^- \tag{1}$$

where $\mathbf{S}^+$ contains 154,000 positive samples, $\mathbf{S}^-$ contains 154,000 negative samples, and $\cup$ is the symbol for "union" in set theory. The benchmark dataset is available at http://lin-group.cn/server/iDNA6mA-Rice.
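The balancing step can be illustrated with a minimal Python sketch. It assumes plain random undersampling of the negative pool; the function names and the fixed seed are hypothetical and are not taken from the original pipeline.

```python
import random

def balance_negatives(positives, negatives, seed=0):
    """Randomly pick as many negatives as there are positives (undersampling).

    The seed is an illustrative assumption; the paper does not report one
    for this step.
    """
    if len(negatives) < len(positives):
        raise ValueError("not enough negative samples to balance the dataset")
    rng = random.Random(seed)
    return rng.sample(negatives, len(positives))

def build_benchmark(positives, negatives, seed=0):
    """Return S = S+ U S- as a list of (sequence, label) pairs (1 = 6mA)."""
    chosen = balance_negatives(positives, negatives, seed)
    return [(s, 1) for s in positives] + [(s, 0) for s in chosen]
```

With 154,000 positive sequences and a larger negative pool, `build_benchmark` would return 308,000 labeled samples, half from each class.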

#### Feature Descriptions

Feature extraction is a key step in establishing an excellent predictor (Song et al., 2012; Zuo et al., 2017; Stephenson et al., 2018; Manavalan et al., 2018a; Wei et al., 2018a; Manavalan et al., 2018b; Song et al., 2018b; Song et al., 2018c). The following three feature extraction techniques were adopted to formulate 6mA samples.

#### K-tuple Nucleotide Frequency Component

As a special form of PseKNC (Guo et al., 2014; Lin et al., 2014), the K-tuple nucleotide frequency component has been widely used in a variety of bioinformatics problems (Lin and Li, 2011; Yang et al., 2018b).

A DNA sequence **D** can be expressed as:

$$\mathbf{D} = R_1 R_2 R_3 R_4 \cdots R_i \cdots R_{L-1} R_L, \tag{2}$$

where $R_i$ represents the nucleotide [adenine (A), thymine (T), cytosine (C), or guanine (G)] at the *i*th position, and *L* is the length of sequence **D**, which equals 41 in this study. The strategy of *k*-tuple composition is to convert each sample into a $4^k$-dimensional vector expressed as:

$$\mathbf{D} = \left[ f_1^{k\text{-tuple}} \; f_2^{k\text{-tuple}} \; \cdots \; f_i^{k\text{-tuple}} \; \cdots \; f_{4^k}^{k\text{-tuple}} \right]^T \tag{3}$$

where *T* represents the transposition of the vector and $f_i^{k\text{-tuple}}$ represents the frequency of the *i*th *k*-tuple composition in the DNA sequence sample. The feature has been applied in DNA element identification (Wei et al., 2018b). Here, we set *k* = 2, 3, 4.
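As an illustration of this encoding, the following Python sketch computes the *k*-tuple frequencies for *k* = 2, 3, 4 and concatenates them. It assumes that each frequency is the count of a *k*-tuple divided by the number of overlapping windows (*L* − *k* + 1) and that *k*-tuples are ordered lexicographically over A, C, G, T; neither detail is stated in the text, and the function names are hypothetical.

```python
from itertools import product

def ktuple_frequencies(seq, k):
    """Frequencies of all 4**k k-tuples, in lexicographic A/C/G/T order.

    Assumption: frequency = count / (L - k + 1), using overlapping windows.
    """
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with non-ACGT characters are skipped
            counts[kmer] += 1
    return [counts[m] / total for m in kmers]

def knfc(seq, ks=(2, 3, 4)):
    """Concatenate the k-tuple frequency vectors for every k in ks."""
    vec = []
    for k in ks:
        vec.extend(ktuple_frequencies(seq, k))
    return vec
```

For a 41-nt sample, `knfc` returns 16 + 64 + 256 = 336 values, of which at most 40 + 39 + 38 = 117 can be nonzero.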

#### Mono-Nucleotide Binary Encoding

The second feature technique converts each nucleotide *n* into a binary code *m* formulated as:

$$m = \begin{cases} (1,0,0,0), & \text{when } n = A\\ (0,1,0,0), & \text{when } n = C\\ (0,0,1,0), & \text{when } n = G\\ (0,0,0,1), & \text{when } n = T \end{cases} \tag{4}$$

Thus, an arbitrary DNA sequence with *L* nucleotides can be described as a vector of 4 × *L* features (Song et al., 2018a; Wei et al., 2018b).
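A minimal Python sketch of Eq. (4) follows. The function name `mnbe` is hypothetical, and the sketch assumes that input sequences contain only A, C, G, and T.

```python
# Binary codes from Eq. (4); the dictionary name is illustrative.
BINARY_CODES = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def mnbe(seq):
    """Encode a DNA sequence of L nucleotides as a flat vector of 4 * L bits.

    Raises KeyError for characters other than A, C, G, T (an assumption;
    the text does not say how ambiguous bases are handled).
    """
    vec = []
    for nucleotide in seq.upper():
        vec.extend(BINARY_CODES[nucleotide])
    return vec
```

A 41-nt sample therefore yields 4 × 41 = 164 binary features, and the position of each nucleotide relative to the central adenine is preserved.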

#### Natural Vector

In the natural vector method proposed by Deng et al. (2011), sequences are represented as points in high-dimensional space based on statistical characteristics (Liu et al., 2018). With the sequence data, such as occurrence frequencies, the central moments, and average positions of nucleotides, the natural vector method is used to describe the distributions and numbers of nucleotides, cluster sequences, and predict their various attributes.

Based on Eq. (3), each nucleotide *R* can be defined as follows:

$$W_R(\cdot)\colon \{A, C, G, T\} \to \{0, 1\}, \tag{5}$$

where $W_R(D_i) = 1$ if $D_i = R$ and $W_R(D_i) = 0$ otherwise, with $D_i$ denoting the nucleotide at the *i*th position of sequence *D*.

$$n_R = \sum_{i=1}^{L} W_R(D_i), \tag{6}$$

where $n_R$ represents the number of nucleotide *R* in the DNA sequence *D*, and *L* is the sequence length.

$$S_{[R][i]} = i \cdot W_R(D_i), \tag{7}$$

where $S_{[R][i]}$ represents the distance from the first nucleotide to the *i*th nucleotide *R*.

$$T_R = \sum_{i=1}^{n_R} S_{[R][i]}, \tag{8}$$

where $T_R$ represents the total distance of each set of the four nucleotides.

$$\mu_R = T_R / n_R, \tag{9}$$

where $\mu_R$ represents the mean position of the nucleotide *R*.

Finally, the second-order normalized central moments can be defined as:

$$D_2^R = \sum_{i=1}^{n_R} \frac{\left(S_{[R][i]} - \mu_R\right)^2}{n_R L} \tag{10}$$

Then, the natural vector of sequence *D* is expressed as (Tian et al., 2018):

$$\left(n_A, \mu_A, D_2^A, n_C, \mu_C, D_2^C, n_G, \mu_G, D_2^G, n_T, \mu_T, D_2^T\right). \tag{11}$$
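The 12-dimensional natural vector can be sketched in Python as follows. The sketch assumes 1-based positions, reads the second-order moment as normalized by the nucleotide count times the sequence length (following Deng et al., 2011), and sets the mean and moment to zero for a nucleotide that does not occur. That last convention and the function name are assumptions, not details given in the text.

```python
def natural_vector(seq):
    """Return (n_A, mu_A, D2_A, n_C, mu_C, D2_C, n_G, ..., D2_T) for a sequence."""
    length = len(seq)
    vec = []
    for r in "ACGT":
        # 1-based positions of every occurrence of nucleotide r
        positions = [i for i, n in enumerate(seq.upper(), start=1) if n == r]
        n_r = len(positions)
        if n_r == 0:
            # Assumed convention for absent nucleotides
            vec.extend([0, 0.0, 0.0])
            continue
        mu_r = sum(positions) / n_r
        d2_r = sum((p - mu_r) ** 2 for p in positions) / (n_r * length)
        vec.extend([n_r, mu_r, d2_r])
    return vec
```

Whatever the sequence length, the descriptor has only 12 elements, which is consistent with the later observation that NV carries too little information for 6mA prediction.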

#### Random Forest Algorithm

The RF algorithm has been extensively applied in computational biology (Zhao et al., 2014; Zhang et al., 2016; Lv et al., 2019), since it is a flexible and practical machine learning method that can deal with many input variables without variable deletion and provides an internal unbiased estimate of the generalization error. According to the principle of RF, many trees are randomly generated with the recursive partitioning approach, and the results are then aggregated according to voting rules. In this study, the number of trees is set to 100 with a seed of 1. The details of RF are described by Breiman (2001).

#### Performance Evaluation

The cross-validation test is a statistical analysis method for assessing a classifier. To save computation time, the fivefold cross-validation test was performed to assess the method proposed in this study. We used four metrics [Matthews correlation coefficient (*MCC*), sensitivity (*Sn*), overall accuracy (*Acc*), and specificity (*Sp*)] to measure the predictive capability of our model (Zuo et al., 2014; Zou et al., 2016; Manavalan and Lee, 2017; Manavalan et al., 2017; Cao et al., 2017a; Cao et al., 2017b; Cheng et al., 2018a; Yang et al., 2018a; Zhu et al., 2019).
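For readers unfamiliar with the procedure, the following Python sketch shows one way to generate the fivefold partitions. The shuffling, the seed, and the function name are illustrative assumptions; the study itself ran cross-validation inside WEKA.

```python
import random

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Every sample appears in exactly one test fold; the remaining k - 1
    folds form the corresponding training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in (folds[:i] + folds[i + 1:]) for j in fold]
        yield train, test
```

Each model is trained on four fifths of the benchmark dataset and evaluated on the remaining fifth, and the metrics are then pooled or averaged over the five folds.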

 

$$\begin{cases} Sn = 1 - \dfrac{N^+_-}{N^+}, & 0 \le Sn \le 1\\[6pt] Sp = 1 - \dfrac{N^-_+}{N^-}, & 0 \le Sp \le 1\\[6pt] Acc = 1 - \dfrac{N^+_- + N^-_+}{N^+ + N^-}, & 0 \le Acc \le 1\\[6pt] MCC = \dfrac{1 - \left(\dfrac{N^+_-}{N^+} + \dfrac{N^-_+}{N^-}\right)}{\sqrt{\left(1 + \dfrac{N^-_+ - N^+_-}{N^+}\right)\left(1 + \dfrac{N^+_- - N^-_+}{N^-}\right)}}, & -1 \le MCC \le 1 \end{cases} \tag{12}$$

where $N^+$ and $N^-$ are, respectively, the numbers of 6mA sites and non-6mA sites in the benchmark dataset; $N^+_-$ indicates the number of 6mA sites recognized as non-6mA sites; and $N^-_+$ indicates the number of non-6mA sites wrongly predicted as 6mA sites. *Sn* and *Sp* represent the ability of a model to correctly identify 6mA sites and non-6mA sites, respectively. The value of *Acc* indicates the overall accuracy of our model in distinguishing 6mA sites from non-6mA sites. *MCC* indicates the performance of our model based on real and predicted values. When $N^+_- = N^-_+ = 0$, meaning that none of the 6mA sites in the dataset $\mathbf{S}^+$ and none of the non-6mA sites in the dataset $\mathbf{S}^-$ was mispredicted, we have *MCC* = 1; when $N^+_- = N^+/2$ and $N^-_+ = N^-/2$, we have *MCC* = 0, meaning no better than random prediction; when $N^+_- = N^+$ and $N^-_+ = N^-$, we have *MCC* = −1, meaning total disagreement between prediction and observation.
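The four metrics of Eq. (12) can be checked numerically with a short Python sketch. The function and argument names are hypothetical, and returning *MCC* = 0 when the denominator vanishes (for example, when every sample is predicted negative) is an assumed convention that the text does not discuss.

```python
from math import sqrt

def evaluate(n_pos, n_neg, fn, fp):
    """Compute Sn, Sp, Acc, and MCC as in Eq. (12).

    n_pos: N+, number of 6mA sites
    n_neg: N-, number of non-6mA sites
    fn:    number of 6mA sites predicted as non-6mA sites
    fp:    number of non-6mA sites predicted as 6mA sites
    """
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    denom = sqrt((1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))
    # Assumed convention: MCC is reported as 0 when it is undefined
    mcc = (1 - (fn / n_pos + fp / n_neg)) / denom if denom > 0 else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

Calling `evaluate(100, 100, 0, 0)`, `evaluate(100, 100, 50, 50)`, and `evaluate(100, 100, 100, 100)` reproduces the three limiting cases described above, with *MCC* equal to 1, 0, and −1, respectively. The counts in these calls are illustrative and are not results from the study.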

In addition to the analysis based on the previously discussed indicators, the ROC curves (Metz, 1989; Chen et al., 2016; Dao et al., 2018; Feng et al., 2018; Lai et al., 2019; Tan et al., 2019) were plotted, and then, the area under the receiver operating characteristic curve (AUC) was calculated to objectively evaluate our proposed model.
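The AUC has a simple probabilistic reading: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The following Python sketch computes it directly from that definition. It is a quadratic-time illustration suited to small examples, not the routine used in the study, and the function name is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted scores, where higher means more likely 6mA
    labels: 1 for 6mA sites, 0 for non-6mA sites
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.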

### RESULTS AND DISCUSSION

#### Sequence Analysis

To investigate the nucleotide distribution around the 21st site (6mA or non-6mA) in positive and negative samples, a pLogo (O'Shea et al., 2013) was plotted to analyze the statistical difference in nucleotide occurrence between the two kinds of samples. The 6mA samples were dramatically different from the non-6mA samples in terms of nucleotide composition (**Figure 3**). Regions of nucleotide composition bias existed in the ranges from -8 to +10 sites and from +15 to +18 downstream of the 6mA site. Unlike the distribution in the non-6mA samples, a consensus motif of AAAA was observed upstream of the 6mA site. These results suggested that it was feasible to construct a machine learning model for identifying 6mA sites with extracted sequence features.

#### Performance Evaluation on Different Features

The prediction performances of three features [K-tuple nucleotide frequency component (KNFC), mono-nucleotide binary encoding (MNBE), and natural vector (NV)] and their combinations were firstly explored with RF. Accordingly, we built four computational models and evaluated them through the fivefold cross-validation test. The prediction results are provided in **Figure 4** and **Table 2**. It was found that MNBE could produce the best prediction performance among all features, indicating that it was the best descriptor for 6mA samples.

KNFC is a commonly used feature extraction technique and has been successfully applied in DNA regulatory element prediction. However, the results in **Table 2** showed that the accuracy of KNFC was only 68.3%, which was far from satisfactory. For the 41-nt-long 6mA samples, KNFC is a high-dimensional vector (16 + 64 + 256 = 336 elements), which is so large that many elements of the feature vector are zero. Although high-dimensional features contain more information, they also include more noise and redundant information, thus decreasing the discrimination capability. Therefore, KNFC is not suitable for 6mA identification. In fact, NV is the worst descriptor among all features in this study, since it achieves an overall accuracy of only 54.3%, which is almost equal to the accuracy of random guessing. The reason for the poor performance of NV in 6mA prediction is that NV contains too few features to capture enough sequence information of 6mA and non-6mA samples.

containing sequence, whereas the lower half of the x-axis indicates the nucleotide distribution in non-6mA site containing sequences.

TABLE 2 | Predictive performances of KNFC, MNBE, and NV.

For the combinations of different features, the prediction performances were always good if MNBE was included. However, they were still not higher than those obtained with MNBE alone. Thus, subsequent studies were based on MNBE.

#### Performance Evaluation of Different Algorithms

It is natural to ask whether other classifiers perform better than RF in 6mA identification. Thus, we investigated the discriminant capabilities of three other algorithms, namely, Naïve Bayes, Bayes Net, and Logistic Regression, with the benchmark dataset through fivefold cross-validation. All algorithms were implemented in WEKA (Frank et al., 2004). The ROC curves were plotted (**Figure 5**). RF clearly performs best for 6mA prediction among the four algorithms. Thus, the final model was built with RF.

#### Performance Evaluation Based on Different Data Ratios

In order to further assess the proposed method, the benchmark dataset was randomly divided into two parts, a training dataset and a testing dataset, according to five ratios (5:5, 6:4, 7:3, 8:2, and 9:1). The former part was used to train the model, whereas the other part was used to test the corresponding model. In this way, the training dataset and testing dataset are independent of each other. The predictive results are listed in **Table 3**. For each ratio between training and testing datasets, the model could always produce an AUC of >0.90, suggesting that our method was robust and reliable.

TABLE 3 | Predictive performances of five ratios on the testing and training datasets.
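The ratio-based split can be sketched in Python as follows. The shuffling seed and the function name are hypothetical, and the sketch does not stratify by class, which is an assumption since the text does not say whether the class balance was preserved within each part.

```python
import random

def split_by_ratio(samples, train_parts, test_parts, seed=0):
    """Randomly split samples into disjoint training and testing lists.

    A 7:3 split is requested as train_parts=7, test_parts=3.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]
```

For example, `split_by_ratio(benchmark, 9, 1)` would keep 90% of the samples for training and hold out the remaining 10% for testing.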

#### Performance Evaluation With an Independent Dataset

We designed the third experiment to investigate the performance of our proposed predictor. In the experiment, an independent test set was collected from NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) with the accession number GSE103145 (Zhou et al., 2018). All the sequences were 41 nt long with the 6mA site at the center. After removing redundant information with CD-HIT program according to the cutoff of 60%, a total of 880 positive samples were obtained (Chen et al., 2019). The negative samples were also obtained from the rice genome. In the report by Zhou et al., 6mA most frequently occurs at GAGG motifs and seldom occurs in coding sequences (CDSs). Thus, negative samples were extracted from CDSs with GAGG motifs in the rice genome. In total, 880 negative samples with the sequence identity less than 60% were obtained. All negative samples were also 41 nt long with non-methylated adenosine at the center. The data were utilized as the benchmark dataset in i6mA-Pred (Chen et al., 2019). The details for the benchmark dataset are available at http://lin-group.cn/server/iDNA6mA-Rice.

We used these data to examine our proposed model (**Table 4**). In total, 95.8% of the 6mA sites and 93.3% of the non-6mA sites were correctly identified, suggesting that the method is a powerful tool for identifying 6mA sites in the rice genome.

#### Comparison With Published Methods

To date, i6mA-Pred (Chen et al., 2019) is the only computational predictor for 6mA sites in the rice genome. To provide an objective and strict comparison, we investigated the performance of our method with the same data through jackknife cross-validation. The method produced an auROC of 0.910 (**Table 5**), which was higher than that of i6mA-Pred. This comparison demonstrated that our method was powerful.

TABLE 5 | Comparison of different methods for predicting 6mA sites in the rice genome with jackknife test.
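In the jackknife (leave-one-out) test, each sample is predicted by a model trained on all remaining samples. The Python sketch below shows the loop with a generic `fit`/`predict` pair. The threshold classifier is a toy stand-in chosen only to keep the example self-contained. It is not the Random Forest model used in this study, and all names are hypothetical.

```python
def jackknife_predictions(samples, labels, fit, predict):
    """Predict every sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions

def mean_threshold_fit(xs, ys):
    """Toy model: the threshold is the mean of the training values."""
    return sum(xs) / len(xs)

def threshold_predict(threshold, x):
    """Toy prediction: label 1 when the value exceeds the threshold."""
    return 1 if x > threshold else 0
```

The test retrains the model once per sample, which is feasible for a dataset of 1,760 samples but would be much more expensive for the full benchmark dataset.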

In addition, iDNA6mA-PseKNC (Feng et al., 2019) is a tool for identifying 6mA sites in the *Mus musculus* genome, and it can identify 6mA sites in many other species with high success rates. Thus, it is necessary to compare our proposed method with it. We investigated the performance of our predictor and iDNA6mA-PseKNC based on the independent dataset used in this work. All comparison results are recorded in **Table 4**. The model proposed in this paper is clearly superior to iDNA6mA-PseKNC for identifying 6mA sites in rice.

#### Web Server

Databases and web servers (Wang et al., 2014; Liang et al., 2017; Yi et al., 2017; Zhang et al., 2017; Cui et al., 2018; Dao et al., 2018; Cheng et al., 2018b; He et al., 2018b; Hu et al., 2019; Cheng et al., 2019a; Cheng et al., 2019b) can provide scholars with more convenient services. Thus, on the basis of the novel method, we built a web server named iDNA6mA-Rice to identify 6mA sites in the rice genome. The web server is freely accessible at http://lin-group.cn/server/iDNA6mA-Rice.

Users can open the homepage shown in **Figure 6** to see a short introduction to iDNA6mA-Rice. One may first click the "Webserver" button and then type or copy/paste DNA sequences into the input box or upload a FASTA format file. Note that the length of each sequence should be greater than 41 nt. After the "submit" button is clicked, the predicted results will appear on a new page. The tool is simple and provides a convenient way for users to identify putative 6mA sites in DNA of interest. Moreover, to facilitate the processing of large-scale data, the stand-alone package can be downloaded at http://lin-group.cn/server/iDNA6mA-Rice/download.html.
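Because the model operates on 41-nt samples with adenine at the center, a longer input sequence has to be cut into candidate windows before prediction. The Python sketch below shows one plausible way to do this. It assumes that every adenine with at least 20 nt on each side is a candidate. How the web server actually scans input sequences is not described in the text, so the function name and the scanning rule are hypothetical.

```python
def adenine_windows(seq, flank=20):
    """Return (position, window) pairs for every adenine with full flanks.

    position is 1-based; window is 2 * flank + 1 nt long (41 nt by default)
    with the candidate adenine at its center.
    """
    seq = seq.upper()
    windows = []
    for i, nucleotide in enumerate(seq):
        if nucleotide == "A" and i >= flank and i + flank < len(seq):
            windows.append((i + 1, seq[i - flank:i + flank + 1]))
    return windows
```

Each returned window could then be encoded with mono-nucleotide binary encoding and passed to the trained classifier.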

### CONCLUSIONS

This paper developed a computational method for the identification of 6mA sites in the rice genome. We designed several kinds of experiments to examine the performance of the proposed method, for example, the performance evaluation on different features, performance evaluation on different algorithms, performance evaluation based on different data ratios, performance evaluation with an independent dataset, and comparison with published methods. All results demonstrated that our proposed method could accurately recognize 6mA sites in the rice genome. For the convenience of most wet-experimental scholars, we established a free web server to predict 6mA sites. We anticipate that the web server can promote the efficient discovery of novel potential 6mA sites in the rice genome and facilitate the exploration of their functional mechanisms in gene regulation.

### DATA AVAILABILITY

All datasets generated for this study are included in the manuscript/supplementary files.

#### REFERENCES


### AUTHOR CONTRIBUTIONS

WC, YZ, and HLin conceived the study. HLv and F-YD implemented the study and drafted the manuscript. HLv, Z-XG, and DZ wrote the custom scripts and performed analysis. HLv, WC, and YZ interpreted the data. All authors read and approved the manuscript.

### FUNDING

This work has been supported by the National Natural Science Foundation of China (grant nos. 61772119 and 31771471) and the Science Strength Promotion Programme of UESTC.


octamer composition into general PseKNC. *Bioinformatics* 34, 4196–4204. doi: 10.1093/bioinformatics/bty508


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Lv, Dao, Guan, Zhang, Tan, Zhang, Chen and Lin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Predicting circRNA-Disease Associations Based on circRNA Expression Similarity and Functional Similarity

#### *Yongtian Wang, Chenxi Nie, Tianyi Zang\* and Yadong Wang\**

*School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China*

Circular RNAs (circRNAs) are a novel class of endogenous noncoding RNAs that have well-conserved sequences. Emerging evidence has shown that circRNAs can be novel biomarkers or therapeutic targets for many diseases and play an important role in the development of various pathological conditions. Therefore, identifying potential disease-related circRNAs is helpful in improving the efficiency of finding therapeutic targets for diseases. Here, we propose a computational model (PreCDA) to predict potential circRNA–disease associations. First, we calculated the circRNA expression similarity based on circRNA expression profiles. The circRNA functional similarity is calculated based on cosine similarity, and the disease similarity is used as the dimension of each circRNA vector. The associations between circRNAs and diseases are defined based on the circRNA functional similarity and expression similarity. We constructed a disease-related circRNA association network and used a graph-based recommendation algorithm (PersonalRank) to sort candidate disease-related circRNAs. As a result, PreCDA has an average area under the receiver operating characteristic curve value of 78.15% in predicting candidate disease-related circRNAs. In addition, we discuss the factors that affect the performance of this method and find some unknown circRNAs related to diseases, with several common diseases used as case studies. These results show that PreCDA has good performance in predicting potential circRNA–disease associations and is helpful for the diagnosis and treatment of human diseases.

#### *Edited by:*

*Lei Deng, Central South University, China*

#### *Reviewed by:*

*Leyi Wei, The University of Tokyo, Japan Hui Ding, University of Electronic Science and Technology of China, China*

*\*Correspondence:*

*Tianyi Zang tianyi.zang@hit.edu.cn Yadong Wang ydwang@hit.edu.cn*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> *Received: 24 June 2019 Accepted: 13 August 2019 Published: 12 September 2019*

#### *Citation:*

*Wang Y, Nie C, Zang T and Wang Y (2019) Predicting circRNA-Disease Associations Based on circRNA Expression Similarity and Functional Similarity. Front. Genet. 10:832. doi: 10.3389/fgene.2019.00832*

Keywords: circRNA, disease, circRNA expression similarity, circRNA functional similarity, PersonalRank

## INTRODUCTION

Circular RNAs (circRNAs) are a type of RNA molecule that forms a covalently closed continuous loop from exon circularization (Motieghader et al., 2017; Xu, 2017). In recent years, advances in high-throughput sequencing technology have greatly facilitated the study of circRNAs (Jeck and Sharpless, 2014). When compared to other ncRNAs (Danan et al., 2012), circRNAs are highly stable. Circular RNAs have evolutionarily conserved sequence features across species, tissues, and developmental stages (Jens, 2013; Conn et al., 2015; Rybak-Wolf et al., 2015). Therefore, circRNAs have become hotspots in transcriptomics research.

Recent studies have shown that alterations in the expression of circRNAs play important roles in human disease and other biological processes (Xu, 2017; Zhao and Shen, 2017; Xia et al., 2018). For example, the best-known circRNA, CDR1as, as the inhibitor of miR-7, is a critical ncRNA known to be involved in cancer, neurodegenerative diseases, diabetes, and atherosclerosis (Li et al., 2015; Xu et al., 2018).


Researchers found that the circRNA ciRS-7 may be a promising target for neurodegenerative disorder (Lukiw, 2013) and myocardial infarction (Lin et al., 2018). The circRNA CircCCDC66 has been demonstrated to regulate colon cancer growth and metastasis as a miRNA sponge (Hsiao et al., 2017). The circRNA hsa\_circ\_0001895 is involved in the expression of cancer-related proteins in gastric cancer (Shao et al., 2017). The circRNA CircHIPK3 plays an important role in cell growth by sponging multiple miRNAs (Zheng et al., 2016). Moreover, circRNAs can be found in exosomes, cell-free saliva, and plasma (Li Y et al., 2015). Circular RNAs are emerging as novel biomarkers or therapeutic targets for many diseases due to their conservation, cell type–specific expression, and tissue-specific expression, and they play roles in the development of various pathological conditions (Meng et al., 2017; Vo et al., 2018).

Although a large number of circRNAs have been discovered, the mechanisms of circRNAs in many diseases remain unclear (Xu et al., 2018). To enable research on circRNAs and diseases, several databases have been constructed, such as circRNADisease (Zhao et al., 2018), CircR2Disease (Fan et al., 2018), and Circ2Disease (Yao et al., 2018). They provide important data support for circRNA–disease association analyses. Some methods have been proposed to provide the most promising disease-related biomarkers, including those involving lncRNAs (Chen et al., 2015; Gu et al., 2017; Cheng et al., 2018a; Cheng et al., 2019), miRNAs (Peng et al., 2019b; Shao et al., 2018), genes (Cheng et al., 2016; Hu et al., 2019; Peng et al., 2019a), and drugs (Jiang et al., 2017; Zhang et al., 2017), for further experimental validation. These methods can decrease the time and cost of biological experiments. However, very few methods have been developed to predict potential circRNA–disease associations (Lei et al., 2018), and neither disease functional similarity nor semantic similarity was considered in these methods. Growing evidence suggests that exploring both the semantic and functional associations of diseases, which are two types of significant associations, is beneficial in measuring disease similarity (Cheng et al., 2014; Peng et al., 2018).

In this study, we propose a computational model (PreCDA) for identifying potential disease-related circRNAs. In view of the limited number of known circRNA–disease associations, we introduced disease similarity to address the resulting data sparsity and built a disease-related circRNA similarity network. However, relying entirely on circRNA-related diseases greatly limits the utility of the method because many circRNAs still have very few or no associated diseases. To overcome this limitation, we calculated the circRNA expression similarity based on the existing data resources. Subsequently, we built a new disease-associated circRNA network by fusing circRNA functional associations and expression similarities. To assess the practicability and accuracy of this method, we designed a validation process with different datasets of circRNA–disease associations, as good computational models must perform well on different data sources. Finally, PreCDA proved successful in predicting potential disease-related circRNAs.

#### MATERIALS AND METHODS

#### Workflow

A flowchart of the PreCDA workflow is shown in **Figure 1**. We preprocessed circRNA and disease data because of the lack of uniform identification of circRNAs and diseases. We extracted the synonym vocabulary from two circRNA databases, circRNADisease (Zhao et al., 2018) and circBase (Glažar et al., 2014). Then, we unified different representations of the same circRNA in different databases. Additionally, the identifiers of the Human Disease Ontology (DO) (Kibbe et al., 2015) were used as the unified markers of diseases in the computational model. We measured the similarity between circRNAs in two ways: the circRNA expression similarity and the functional similarity. We extracted circRNA expression profiles from circBase (Glažar et al., 2014) and CIRCpedia (Dong et al., 2018). The circRNA expression similarity was calculated based on the Spearman correlation coefficient. The disease similarity was used as the dimensions of each circRNA vector, and the circRNA functional similarity was calculated based on cosine similarity. A disease-related circRNA association network was built based on the circRNA expression similarity and functional similarity. Finally, we identified potential candidate disease-related circRNAs based on the PersonalRank algorithm (PR) (Haveliwala, 2002).

## Data Preprocessing

#### circRNA Data

In this study, we used three circRNA databases for experiments and validations. The circRNADisease database is a manually curated database of experimentally supported circRNA and disease associations, which collected 330 circRNAs and 48 diseases in 354 associations. Each entry in the circRNADisease database includes detailed information on a circRNA–disease association, including the circRNA and disease name, the circRNA expression pattern, literature references, and other annotation information. CircR2Disease is a database for experimentally supported circRNA–disease associations and provides a platform for investigating the mechanism of disease-related circRNAs. The present version of CircR2Disease collected 661 circRNAs and 100 diseases. Circ2Disease is a database that curates experimentally supported human circRNAs and provides comprehensive associations between circRNAs and human diseases. It contains 273 manually curated associations between 237 circRNAs and 54 human diseases from 120 studies. However, the naming of circRNAs has not yet been unified (Xu et al., 2018), which leads to the underutilization of information from different public circRNA databases. Therefore, we collected mappings among the different circRNA names provided by different circRNA databases, including circRNADisease and circBase. circRNADisease contains circRNA synonyms, and circBase is a database that merged and unified datasets of circRNAs. We mapped circRNAs from the three circRNA databases to circBase by referring to circRNA synonyms. Then, we used circRNA IDs from circBase as the unified IDs of circRNAs in this work.

#### Disease Data

Human Disease Ontology represents common and rare human disease concepts captured across biomedical resources. Each node in DO represents one disease term and is organized in a directed acyclic graph with the relationship of "is\_a". MEDIC (Davis et al., 2012) integrates OMIM (Online Mendelian Inheritance in Man) terms (Amberger et al., 2015), synonyms and identifiers with MeSH terms (Lipscomb, 2000), synonyms, definitions, identifiers, and hierarchical relationships.

We extracted disease terms and synonyms from MEDIC to annotate DO by the same external references in DO and MEDIC, as shown in **Figure 2**. If a disease term was recorded in both DO and MEDIC, the term and its synonyms in MEDIC were used to annotate DO. With this approach, a given disease name can be matched to DO to a great extent by string matching, considering that the naming rules for diseases in different disease-related circRNA databases are different. The diseases described by different names are considered to be the same disease that has a unique id in DO if these disease names can match the disease term or its extended synonyms in DO.
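The synonym-based string matching described above can be sketched as a normalized lookup table. This is only an illustration under stated assumptions: the synonym entries are toy examples rather than actual MEDIC records, and the helper names (`normalize`, `build_index`, `map_disease`) are hypothetical, not part of PreCDA.

```python
from typing import Dict, List, Optional


def normalize(name: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def build_index(do_terms: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every normalized term or synonym to its DO identifier."""
    index: Dict[str, str] = {}
    for doid, names in do_terms.items():
        for name in names:
            index[normalize(name)] = doid
    return index


def map_disease(name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the DO identifier for a disease name, or None if unmatched."""
    return index.get(normalize(name))


# Toy vocabulary: DO term first, then illustrative (assumed) synonyms.
DO_TERMS = {
    "DOID:3571": ["liver cancer", "hepatic cancer"],
    "DOID:10534": ["stomach cancer", "gastric cancer"],
}
INDEX = build_index(DO_TERMS)
print(map_disease("Gastric  Cancer", INDEX))  # DOID:10534
```

Exact matching after normalization is a deliberately conservative choice; fuzzy matching would map more names but could also merge distinct diseases.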

#### circRNA Expression Similarity

Considering that comprehensive circRNA expression data are still unavailable, we extracted circRNA expression profiles from circBase and CIRCpedia, including the expression profiles of 92,488 circRNAs in 78 human cell types or tissues. We used the Spearman correlation coefficient between the expression profiles of each pair of circRNAs as the circRNA expression similarity, as shown in Formula 1.

$$\rho = 1 - \frac{6 \sum d_i^2}{n\left(n^2 - 1\right)}\tag{1}$$

where *d<sub>i</sub>* is the difference between the two ranks of the expression scores in the *i*th human cell type or tissue, and *n* is the number of human cell types or tissues from circBase or CIRCpedia. Matrix *CB* and Matrix *CP* denote the circRNA expression similarity matrices of circBase and CIRCpedia, respectively, where *CB(i,j)* and *CP(i,j)* are the expression similarities between circRNA *c(i)* and *c(j)*. Then, to obtain reliable performance for circRNA expression data, we defined the expression similarity between circRNA *c(i)* and *c(j)* as shown in Formula 2 if circRNA *c(i)* and *c(j)* are included in both circBase and CIRCpedia.

$$\text{ExSim}(i, j) = \begin{cases} \text{Max}\left(\text{CB}(i, j), \text{CP}(i, j)\right) & \text{if } \text{Max}\left(\text{CB}(i, j), \text{CP}(i, j)\right) \geq \tau \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

To reduce the impact of data noise, we set a threshold τ to filter out those weak similarities between circRNAs. The threshold τ is set to 0.7 based on our experiments.
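As a rough illustration of Formulas 1 and 2, the sketch below computes the Spearman coefficient from rank differences (using the standard constant 6 in the numerator) and then applies the threshold τ. It assumes expression profiles without tied values, for which the rank-difference formula is exact; the function names and the toy profiles are hypothetical and not taken from PreCDA.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Formula 1: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def ex_sim(cb: float, cp: float, tau: float = 0.7) -> float:
    """Formula 2: keep the larger similarity only if it reaches tau."""
    best = max(cb, cp)
    return best if best >= tau else 0.0


# Toy expression profiles of two circRNAs across four tissues.
profile_a = [5.0, 1.0, 3.0, 2.0]
profile_b = [50.0, 10.0, 20.0, 30.0]
rho = spearman(profile_a, profile_b)
print(round(rho, 2), ex_sim(rho, 0.4))
```

With real data containing ties, a rank-averaging implementation such as the one in a statistics library would be the safer choice.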

#### circRNA Functional Similarity

We extracted circRNA–disease associations from the circRNA databases described above and defined a relational matrix of circRNAs and diseases. For each circRNA, all diseases in the matrix can be used to make a vector in a multidimensional space. Because of the limited number of available disease–circRNA pairs, there is a data sparsity problem in the matrix. Therefore, we calculated the circRNA-related disease similarity and filled this matrix with predicted association scores based on disease–circRNA associations and the disease similarity. Here, we use FNSemSim (Wang et al., 2017) to calculate disease similarity. This method, which combines disease functional similarity and semantic similarity, has good performance for calculating similarities between diseases. The workflow of calculating circRNA functional similarity is shown in **Figure 3**.

To calculate the association between one circRNA and any disease, the similarities between this disease and all diseases that are directly related to this circRNA are calculated by FNSemSim. *C* is defined as the set of disease-related circRNAs, and *D* represents the set of circRNA-related diseases. DisSet(c) is defined as the set of diseases directly related to circRNA *c*. The association score between disease *dis* and circRNA *c* is defined as follows:

$$\text{Score}(\text{dis}, c) = \begin{cases} \text{Max}\left(\text{FNSemSim}(\text{dis}, \text{dis}_{i})\right) & \text{dis}_{i} \in \text{DisSet}(c), \ \text{dis} \notin \text{DisSet}(c) \\ 1 & \text{dis} \in \text{DisSet}(c) \end{cases} \tag{3}$$

where DisSet(c) ⊆ *D* and 1 ≤ *i* ≤ |DisSet(c)|; |DisSet(c)| denotes the number of diseases in DisSet(c). If this disease belongs to DisSet(c), the score is 1; otherwise, the score is defined as the maximum of the similarities between this disease and all the diseases related to circRNA *c*. Therefore, circRNA *c* can be depicted by a vector that is composed of circRNA-related diseases in a multidimensional space. We can calculate the functional similarity between any two circRNAs based on cosine similarity. The functional similarity between circRNA *c(m)* and *c(n)* is defined as follows:

$$\text{FnSim}(m, n) = \frac{\sum_{i=1}^{|D|} \text{Score}\left(\text{dis}_i, c(m)\right) \times \text{Score}\left(\text{dis}_i, c(n)\right)}{\sqrt{\sum_{i=1}^{|D|} \text{Score}\left(\text{dis}_i, c(m)\right)^2} \sqrt{\sum_{i=1}^{|D|} \text{Score}\left(\text{dis}_i, c(n)\right)^2}}\tag{4}$$

where |*D*| represents the size of the circRNA-related disease set *D*, and dis<sub>*i*</sub> is the *i*th disease in the circRNA-related disease set *D*.
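The two steps above (filling the association scores as in Formula 3, then taking the cosine similarity of Formula 4) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `DISEASE_SIM` table is a hypothetical stand-in for FNSemSim output, and the disease labels `d1` to `d3` are invented for the example.

```python
import math
from typing import Iterable, List, Set

# Hypothetical pairwise disease similarities standing in for FNSemSim.
DISEASE_SIM = {
    frozenset(("d1", "d2")): 0.6,
    frozenset(("d1", "d3")): 0.2,
    frozenset(("d2", "d3")): 0.5,
}


def disease_sim(a: str, b: str) -> float:
    """Look up a toy disease similarity; identical diseases score 1."""
    if a == b:
        return 1.0
    return DISEASE_SIM.get(frozenset((a, b)), 0.0)


def score(dis: str, dis_set: Set[str]) -> float:
    """Formula 3: 1 for known associations, else the best similarity."""
    if dis in dis_set:
        return 1.0
    return max((disease_sim(dis, d) for d in dis_set), default=0.0)


def fn_sim(set_m: Set[str], set_n: Set[str], diseases: Iterable[str]) -> float:
    """Formula 4: cosine similarity of the two circRNA score vectors."""
    diseases = list(diseases)
    vm: List[float] = [score(d, set_m) for d in diseases]
    vn: List[float] = [score(d, set_n) for d in diseases]
    dot = sum(a * b for a, b in zip(vm, vn))
    norm = math.sqrt(sum(a * a for a in vm)) * math.sqrt(sum(b * b for b in vn))
    return dot / norm if norm else 0.0


ALL_DISEASES = ["d1", "d2", "d3"]
# circRNA m is linked to d1 only, circRNA n to d3 only.
print(round(fn_sim({"d1"}, {"d3"}, ALL_DISEASES), 3))
```

Because the score vectors are dense after filling, two circRNAs with no shared disease can still obtain a nonzero functional similarity, which is the point of introducing disease similarity.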

#### Prediction of Candidate Disease-Related circRNAs

We take circRNA functional similarity and expression similarity as weights to construct a circRNA association network. In this network, the weight between circRNA *c(i)* and *c(j)* is defined as shown in Formula 5. If *ExSim(i,j)* is greater than 0, the weight between circRNA *c(i)* and *c(j)* is the average value of their functional similarity and expression similarity; otherwise, the weight is defined as the functional similarity between them.

$$\text{CircWeight}(i, j) = \begin{cases} \left(\text{FnSim}(i, j) + \text{ExSim}(i, j)\right) / 2 & \text{if } \text{ExSim}(i, j) > 0 \\ \text{FnSim}(i, j) & \text{otherwise} \end{cases} \tag{5}$$

To predict candidate disease-related circRNAs, the associations between diseases and circRNAs are also considered in this network. The weight between circRNA *c* and disease *dis* is defined as shown in Formula 6. If the disease is directly related to circRNA *c*, the weight between them is 1; otherwise, the weight is 0.

$$\text{CircDisWeight}(c, \text{dis}) = \begin{cases} 1 & \text{if } \text{dis} \in \text{DisSet}(c) \\ 0 & \text{otherwise} \end{cases} \tag{6}$$
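A small sketch of how the weights in Formulas 5 and 6 might be assembled into one network is given below. The adjacency-dictionary representation, the function names, and the toy similarity values are assumptions made for illustration; the source does not specify the data structure used by PreCDA.

```python
from typing import Dict, Set, Tuple

Graph = Dict[str, Dict[str, float]]


def circ_weight(fn_sim: float, ex_sim: float) -> float:
    """Formula 5: average both similarities when ExSim is positive."""
    return (fn_sim + ex_sim) / 2 if ex_sim > 0 else fn_sim


def build_network(
    fn: Dict[Tuple[str, str], float],
    ex: Dict[Tuple[str, str], float],
    dis_sets: Dict[str, Set[str]],
) -> Graph:
    """Build an undirected weighted graph of circRNAs and diseases."""
    graph: Graph = {}

    def add_edge(u: str, v: str, w: float) -> None:
        if w > 0:  # zero-weight pairs are simply not connected
            graph.setdefault(u, {})[v] = w
            graph.setdefault(v, {})[u] = w

    for (ci, cj), f in fn.items():
        add_edge(ci, cj, circ_weight(f, ex.get((ci, cj), 0.0)))
    for circ, diseases in dis_sets.items():
        for dis in diseases:
            add_edge(circ, dis, 1.0)  # Formula 6: known association
    return graph


# Toy inputs: functional and (thresholded) expression similarities.
fn_toy = {("c1", "c2"): 0.6, ("c2", "c3"): 0.4}
ex_toy = {("c1", "c2"): 0.8}
network = build_network(fn_toy, ex_toy, {"c1": {"dis"}})
print(network["c2"])
```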

In this network composed of circRNAs and diseases, we identify novel candidate disease-related circRNAs based on PR. As a recommendation algorithm based on random walks, PersonalRank can reveal more information about the relationship between a target node and all the other nodes in a specific network. The PersonalRank algorithm is defined as follows:

$$PR(i) = (1 - d)\, r_i + d \sum_{j \in \text{in}(i)} \frac{PR(j)}{\left|\text{out}(j)\right|}\tag{7}$$

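where PR(*i*) represents the probability that node *i* is visited; *d* is the transfer probability, that is, the probability that the walk continues along an edge rather than restarting at the target node; in(*i*) is the set of nodes with an edge pointing to node *i*; |out(*j*)| is the out-degree of node *j*; and *r<sub>i</sub>* is defined as follows: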

$$r_i = \begin{cases} 1 & \text{if } i = t \\ 0 & \text{if } i \neq t \end{cases} \tag{8}$$

where *t* represents the target node. According to previous studies (Kang et al., 2014; Cheng et al., 2018b), *d* is set to 0.85. Starting from the target node *t*, a random walker moves to adjacent nodes with probabilities determined by the edges between these nodes. After enough iterations, the probabilities of reaching all the other nodes from the target node become stable. Eventually, the algorithm outputs the relevance degrees between all the nodes and this target node.
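The iteration in Formulas 7 and 8 can be sketched as a simple power iteration. One assumption is made explicit here: Formula 7 divides by the out-degree, whereas this sketch splits each node's probability in proportion to its edge weights, which reduces to Formula 7 when all weights are equal. The function name, the fixed iteration count, and the toy graph are hypothetical.

```python
from typing import Dict

Graph = Dict[str, Dict[str, float]]


def personal_rank(graph: Graph, target: str, d: float = 0.85,
                  iterations: int = 100) -> Dict[str, float]:
    """Random walk with restart at `target` on a symmetric weighted graph."""
    pr = {node: 0.0 for node in graph}
    pr[target] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for j, neighbours in graph.items():
            total = sum(neighbours.values())
            if total == 0:
                continue
            for i, w in neighbours.items():
                # Weighted analogue of PR(j) / |out(j)| (assumption).
                nxt[i] += d * pr[j] * w / total
        nxt[target] += 1 - d  # restart term (1 - d) * r_i
        pr = nxt
    return pr


# Toy network: one disease node linked to c1; c1-c2-c3 form a chain.
toy = {
    "dis": {"c1": 1.0},
    "c1": {"dis": 1.0, "c2": 0.7},
    "c2": {"c1": 0.7, "c3": 0.4},
    "c3": {"c2": 0.4},
}
scores = personal_rank(toy, "dis")
# Rank circRNAs that are not yet linked to the disease.
candidates = sorted(("c2", "c3"), key=scores.get, reverse=True)
print(candidates)
```

In this toy chain, the circRNA closer to the disease node receives the higher relevance score, which matches the intuition behind ranking candidates by their stationary visiting probability.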

#### RESULTS

#### circRNAs and Diseases

We calculated similarities between 323 circRNAs from circBase and CIRCpedia based on circRNA expression profiles. Then, we obtained 11,281 circRNA pairs based on the preset threshold. Additionally, we found 507 relationships between 58 diseases and 445 circRNAs by mapping DO terms to the diseases in CircR2Disease. We matched 26 diseases based on DO terms and extracted 293 relationships between 277 circRNAs and these diseases from circRNADisease. In Circ2Disease, 218 relationships between 37 diseases and 199 circRNAs were found. Based on DO terms and the unification of circRNA naming, we analyzed the three circRNA databases and found the same circRNAs and diseases among these databases, as shown in **Figure 4**. This provided the test data for the performance evaluation of PreCDA.

We separately calculated the similarities between 445 circRNAs from CircR2Disease, 277 circRNAs from circRNADisease and 199 circRNAs from Circ2Disease. Three circRNA association networks were built, which contained, respectively, 96,580 associations between 440 circRNAs associated with 56 diseases; 38,226 associations between 277 circRNAs associated with 26 diseases; and 18,915 associations between 195 circRNAs associated with 36 diseases. The detailed statistics of the circRNAs and diseases are shown in **Table 1**.

TABLE 1 | Information on the three circRNA association networks.

#### Performance

We designed a test scheme to assess the performance of PreCDA. First, we selected two circRNA–disease databases, one to build the circRNA association network and the other to provide test data. Then, we extracted the same diseases from the circRNA association network and the reference database. For a given disease, if any circRNA related to this disease in the reference database exists in the network, but the association between the circRNA and the disease does not, the circRNA can be used as a test case for the disease to assess the performance of this circRNA association network. The test scheme is shown in **Figure 5**.
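The selection rule for test cases can be expressed with plain set operations. The following sketch assumes each database is reduced to a mapping from disease identifiers to sets of circRNA identifiers; the function name and the toy identifiers are illustrative only.

```python
from typing import Dict, Set

Associations = Dict[str, Set[str]]  # disease ID -> set of circRNA IDs


def select_test_cases(network: Associations,
                      reference: Associations) -> Associations:
    """Pick reference associations that the network database lacks.

    A circRNA is a test case for a disease if (1) the disease occurs in
    both databases, (2) the circRNA exists somewhere in the network
    database, and (3) the network does not already link the two.
    """
    network_circs: Set[str] = set().union(*network.values())
    cases: Associations = {}
    for disease in network.keys() & reference.keys():
        held_out = {
            circ for circ in reference[disease]
            if circ in network_circs and circ not in network[disease]
        }
        if held_out:
            cases[disease] = held_out
    return cases


network_db = {"dis1": {"circA"}, "dis2": {"circB", "circC"}}
reference_db = {"dis1": {"circA", "circB", "circX"}, "dis3": {"circC"}}
test_cases = select_test_cases(network_db, reference_db)
print(test_cases)  # {'dis1': {'circB'}}
```

Here `circX` is skipped because it is absent from the network database, and `circA` is skipped because its association with `dis1` is already known to the network.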

In this article, we used three circRNA–disease databases, including CircR2Disease, circRNADisease, and Circ2Disease. For example, both circRNA hsa\_circ\_0000284 and liver cancer (DOID: 3571) were recorded in Circ2Disease and CircR2Disease. The circRNA hsa\_circ\_0000284 was related to liver cancer (DOID: 3571) in Circ2Disease but not in CircR2Disease. Therefore, we built a circRNA association network based on CircR2Disease and calculated the relevance degrees between liver cancer and all circRNAs unrelated to the disease. We calculated the area under the receiver operating characteristic curve (AUC) according to the ranking of the circRNA hsa\_circ\_0000284 among these circRNAs to measure the prediction results. To validate the reliability of the computational model, we conducted nine validation experiments based on this scheme involving 18 diseases. We built three circRNA association networks based on the three different circRNA–disease databases. The three data sources were also used as the reference data. Additionally, we merged the known circRNA–disease associations in the three databases as an additional control data source.
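To make the ranking-based AUC concrete, the sketch below uses the rank-statistic (Mann-Whitney) form of the AUC: the fraction of (test circRNA, other candidate) pairs in which the test circRNA receives the higher relevance score. That the paper computes its AUC in exactly this way is an assumption; the function name and the scores are invented for illustration.

```python
from typing import Dict, Set


def auc_from_scores(scores: Dict[str, float], positives: Set[str]) -> float:
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for c, s in scores.items() if c in positives]
    neg = [s for c, s in scores.items() if c not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy relevance scores of four candidate circRNAs for one disease.
relevance = {"circA": 0.9, "circB": 0.5, "circC": 0.3, "circD": 0.1}
print(round(auc_from_scores(relevance, {"circB"}), 4))  # 0.6667
```

With a single held-out circRNA, this value depends only on its position in the ranking: ranked second of four candidates, it outranks two of the three others.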

PreCDA had an average AUC value of 78.15% in predicting candidate disease-related circRNAs. Furthermore, it performed particularly well on some diseases. For example, diabetes mellitus (DOID: 9351) in the network from Circ2Disease had an AUC of 98.48% based on the control data from circRNADisease and an AUC of 93.04% based on the control data from CircR2Disease. Based on the control data from Circ2Disease, the AUC of osteoarthritis (DOID: 8398) was 97.44% in the network from CircR2Disease and 98% in the network from circRNADisease. Performance was weaker for some diseases: in the network from Circ2Disease, the AUC of stomach cancer (DOID: 10534) was 56.41% based on the control data from circRNADisease, whereas it had an AUC of 73.88% in CircR2Disease. This shows that networks built from different data sources can give different results for the same disease based on the same control database. The AUCs in the other validation experiments all exceeded 65%, so overall PreCDA performs well in predicting candidate disease-related circRNAs. The performance of PreCDA based on the different databases and the different control data sources is shown in **Figure 6**.

#### Case Study

To further evaluate the performance of PreCDA in predicting potential disease-related circRNAs, we conducted some case studies, including prostate cancer (DOID: 10283), liver cancer (DOID: 3571), breast carcinoma (DOID: 3459), Alzheimer disease, and pancreatic cancer (DOID: 1793). We integrated the known associations between circRNAs and diseases in the three databases and prioritized candidate disease-related circRNAs based on PreCDA.

In the ranking of candidate circRNAs related to liver cancer (DOID: 3571), hsa\_circ\_0001727 (Qiu et al., 2018) ranked 4th, hsa\_circ\_0001946 (Yu et al., 2016) ranked 7th, and hsa\_circ\_0001141 (Guo et al., 2017) ranked 19th. They ranked in the top 3% and were associated with liver cancer. For prostate cancer (DOID: 10283), hsa\_circ\_0001946 (Zhang et al., 2018) and hsa\_circ\_0001649 (Yi et al., 2016) ranked 3rd and 5th, respectively. They were documented to be related to prostate cancer. For pancreatic cancer (DOID: 1793), CircRNA\_100782 (Chen et al., 2017), which ranked 1st, was validated to regulate pancreatic carcinoma proliferation through the IL6-STAT3 pathway. We found that some candidate circRNAs related to these diseases were included in Circ2Traits (Ghosal et al., 2013), which is a comprehensive database of circRNAs potentially associated with diseases and traits. For example, hsa\_circ\_0000118, which ranked 1st among the candidate circRNAs associated with prostate cancer, was documented to be potentially related to this disease in Circ2Traits. The prediction results of the case studies are presented in **Table 2**.

#### DISCUSSION

Although functional associations between circRNAs are measured based on circRNA expression profiles, there are many weak connections among them. To reduce the impact of data noise, we set a threshold to filter out those weak connections between circRNAs. Based on the above validation strategy and different thresholds, we conducted nine groups of experiments in which these three databases were used as references for each other to test the performance of PreCDA. As shown in **Figure 7**, the average AUC of PreCDA varied with the threshold, and the computational model worked best when the threshold was set to 0.7.

For comparison, we also calculated circRNA similarities using cosine similarity alone and built a circRNA association network. Additionally, we merged the known circRNA–disease associations in these three databases as an additional control data source. Based on the validation strategy mentioned above, we used these three databases to test the performance of the network. As shown in **Figure 8**, the average AUC was 77.22%, the minimum AUC was 69.26%, and the maximum AUC was 88.85%. In comparison, PreCDA has a more stable performance, with an average AUC of 78.15%. Its minimum and maximum AUCs are 71.83% and 95.72%, respectively.

TABLE 3 | Performance differences of predicting circRNA–disease pairs based on different data sources.

We found that the performance of predicting potential disease–circRNA pairs in the disease-related circRNA association network was affected by the data source. The results of predicting the associations between the same diseases and circRNAs differed depending on the data source used to build the network. For example, with CircR2Disease as the reference, some of the test data in the networks built from circRNADisease and Circ2Disease were the same. However, the AUC values of predicting these associations were different. As shown in **Table 3**, we predicted the associations between colorectal cancer (DOID: 9256) and four circRNAs, including hsa\_circ\_0001649, hsa\_circ\_0000284, hsa\_circ\_0014717, and hsa\_circ\_0001141. The AUC value for the network of circRNADisease was 71.86%. The performance of identifying the associations between colorectal cancer and these four circRNAs based on Circ2Disease was better, with an AUC of 82.17%.

### CONCLUSIONS

Circular RNA plays an important role in the development of various pathological conditions. Research on circRNA is invaluable in explaining the underlying pathogenesis. Therefore, we proposed a computational model to identify candidate disease-related circRNAs. First, we calculated the circRNA expression similarity with the circRNA expression profiles. Then, the disease similarity was used as dimensions of circRNA vectors, and the circRNA functional similarity was calculated based on cosine similarity. We defined the associations between circRNAs and diseases based on the circRNA expression similarity and functional similarity. A disease-related circRNA association network was built, and potential candidate disease-related circRNAs were ranked by the PR.

We evaluated the performance of PreCDA with the help of data differences among these three databases, including CircR2Disease, circRNADisease, and Circ2Disease. The results showed that the average AUC of PreCDA was 78.15%, and it had good performance in predicting potential disease-related circRNA signatures. We discussed the selection of the threshold and the impact of different data sources on the performance of PreCDA. Then, we used several common diseases as case studies and found some unknown circRNAs that could be related to these diseases based on PreCDA. The findings of this study could be further applied in analyzing diseases from a systems biology perspective (Cheng and Hu, 2018) and could help researchers improve disease diagnostics and treatments.

### DATA AVAILABILITY

PreCDA is implemented using a combination of Java and Scala, and it is freely available at https://github.com/wythit/PreCDA.

### AUTHOR CONTRIBUTIONS

YoW and CN performed data collection and preprocessing. With the guidance of TZ and YaW, YoW completed the algorithm design and validation. YoW was the major contributor in writing the manuscript. All authors have read and approved the final version of the manuscript.

### FUNDING

Publication costs were funded by the National Key Research and Development Program of China (grant no. 2016YFC0901605, 2016YFC1201702-01) and the National High-tech R&D Program of China (grant no. 2012AA02A604, 2015AA020108).

### ACKNOWLEDGMENTS

TZ and YaW are the corresponding authors. We thank them for their guidance. We also thank Ling Wang and Zhenxing Wang for their valuable suggestions on our work. We are grateful to Rongjie Wang and Yanshuo Chu for their helpful assistance in writing.

### REFERENCES

Cheng, L., Li, J., Ju, P., Peng, J. J., and Wang, Y. D. (2014). SemFunSim: A New Method for Measuring Disease Similarity by Integrating Semantic and Gene Functional Association. *PLoS One* 9(6), 11. doi: 10.1371/journal.pone.0099415

human and mouse. *Nucleic Acids Res.* 47(D1), D140-D144. doi: 10.1093/nar/gky1051


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Wang, Nie, Zang and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# iRO-PsekGCC: Identify DNA Replication Origins Based on Pseudo k-Tuple GC Composition

*Bin Liu<sup>1,2</sup>\*<sup>†</sup>, Shengyu Chen<sup>3†</sup>, Ke Yan<sup>4</sup> and Fan Weng<sup>4</sup>*

*<sup>1</sup> School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China, <sup>2</sup> Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China, <sup>3</sup> School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, IN, United States, <sup>4</sup> School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China*

Summary: Identification of replication origins plays a key role in understanding the mechanism of DNA replication. This task is of great significance in DNA sequence analysis. Because of its importance, some computational approaches have been introduced. Among these predictors, the iRO-3wPseKNC predictor is the first discriminative method that is able to correctly identify the entire replication origins. To further improve its predictive performance, we proposed the Pseudo k-tuple GC Composition (PsekGCC) approach to capture the "GC asymmetry bias" of yeast species by considering both the GC skew and the sequence order effects of *k*-tuple GC Composition (*k*-GCC) in this study. Based on PsekGCC, we proposed a new predictor called iRO-PsekGCC to identify the DNA replication origins. A rigorous jackknife test on benchmark datasets of two yeast species (*Saccharomyces cerevisiae*, *Pichia pastoris*) indicated that iRO-PsekGCC outperformed iRO-3wPseKNC. It can be anticipated that iRO-PsekGCC will be a useful tool for DNA replication origin identification.

#### *Edited by:*

*Liang Cheng, Harbin Medical University, China*

#### *Reviewed by:*

*Xun Lan, Tsinghua University, China Qiwen Dong, East China Normal University, China*

#### *\*Correspondence:*

*Bin Liu bliu@bliulab.net*

*†These authors have contributed equally to this work and share first authorship*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> *Received: 05 July 2019 Accepted: 13 August 2019 Published: 18 September 2019*

#### *Citation:*

*Liu B, Chen S, Yan K and Weng F (2019) iRO-PsekGCC: Identify DNA Replication Origins Based on Pseudo k-Tuple GC Composition. Front. Genet. 10:842. doi: 10.3389/fgene.2019.00842*

Availability and implementation: The web-server for the iRO-PsekGCC predictor was established, and it can be accessed at http://bliulab.net/iRO-PsekGCC/.

Keywords: replication origin identification, pseudo *k*-tuple GC composition, random forest, web-server, DNA sequence analysis

### INTRODUCTION

In the process of the cell cycle, DNA replication is one of the most important steps (Shirahige et al., 1998). Since DNA replication is initiated from a specific region, called the replication origin, identifying DNA replication origins is especially important for studying drug development, cell life activities, genetic engineering, etc. (Méchali, 2010). Experimental methods detect replication origins by using chromatin immunoprecipitation (ChIP), which is costly (Lubelsky et al., 2012). Therefore, researchers are seeking computational methods to efficiently predict replication origins based only on sequence information. Compared with non-replication origins, replication origins show an uneven distribution of G (guanine) and C (cytosine) (Lobry, 1996), and the concept of "GC Skew" (Grigoriev, 1998) was proposed. Later, some computational methods incorporated these characteristics into predictors of replication origins (Zhang and Zhang, 1991; Zhang and Zhang, 1994; Grigoriev, 1998; Roten et al., 2002; Thomas et al., 2007; Gao and Zhang, 2008; Luo et al., 2014; Bu et al., 2018). In order to further improve the predictive performance, discriminative methods were proposed that use the information of both the positive and negative samples (Chen et al., 2012; Li et al., 2015; Zhang et al., 2016), and all of the methods mentioned above achieved state-of-the-art performance. A recent method, iRO-3wPseKNC, incorporated the "GC asymmetry bias" (Lobry, 1996; Grigoriev, 1998; Lubelsky et al., 2012; Li et al., 2014) into the prediction by representing the entire replication origins based on three-window-based PseKNC (3wPseKNC) (Liu et al., 2018b). Feature extraction methods are the key to performance improvement. In this regard, many features have been proposed, which can be easily generated by some software tools.

These existing computational methods have significantly advanced this active area, but they all suffer from certain disadvantages or limitations. For example, as discussed above, the GC Skew is an important feature of replication origins, but none of the existing discriminative methods uses the GC Skew directly to construct the predictor. Furthermore, the existing feature extraction methods cannot reflect the uneven distribution of G and C. To solve these problems, we followed the framework of iRO-3wPseKNC (Liu et al., 2018b) and proposed an improved predictor called iRO-PsekGCC for replication origin identification. iRO-PsekGCC can not only capture the GC asymmetry bias by using *k*-tuple GC composition (or *k*-GCC), but can also incorporate the GC Skew into the concept of PseKNC (Chen et al., 2014a; Chen et al., 2014b).

### MATERIALS AND METHODS

#### Benchmark Datasets

To evaluate the performance of the proposed method, two recently established benchmark datasets for *Saccharomyces cerevisiae* and *Pichia pastoris* (Liu et al., 2018b) were employed in this study because they show clear GC asymmetry distributions. The datasets can be represented as:

$$\mathbb{S}_{\tau} = \mathbb{S}^{+}_{\tau} \cup \mathbb{S}^{-}_{\tau}, \quad \tau = \begin{cases} 1 & \text{for } \textit{Saccharomyces cerevisiae} \\ 2 & \text{for } \textit{Pichia pastoris} \end{cases} \tag{1}$$

where the symbol ∪ represents the union; $\mathbb{S}^{+}_{1}$ represents the positive dataset containing 340 replication origins, and $\mathbb{S}^{-}_{1}$ represents the negative dataset containing 342 non-replication origins; 305 replication origins are in the positive dataset $\mathbb{S}^{+}_{2}$, and 302 non-replication origins are in the negative dataset $\mathbb{S}^{-}_{2}$. For both benchmark datasets, redundant samples were removed with the CD-HIT software tool (Li and Godzik, 2006) using the most stringent cut-off threshold (80%).

#### Pseudo *k*-Tuple GC Composition (PsekGCC)

One of the key steps in constructing machine-learning predictors for analyzing biological sequences is feature extraction. Following the framework of three-window-based PseKNC (3wPseKNC) (Liu et al., 2018b), we proposed a feature extraction method called "Pseudo *k*-tuple GC composition (PseKGCC)" to directly incorporate the GC asymmetry bias (Lobry, 1996; Grigoriev, 1998; Lubelsky et al., 2012; Li et al., 2014) and GC skew (Grigoriev, 1998) into the predictor. The following paragraphs describe how DNA samples are represented with PseKGCC.

A DNA sequence **D** can be formulated as follows:

$$\mathbf{D} = \mathrm{N}_1 \mathrm{N}_2 \mathrm{N}_3 \cdots \mathrm{N}_i \cdots \mathrm{N}_L \quad (i = 1, 2, 3, \ldots, L) \tag{2}$$

where *L* denotes the length of **D**, and

$$\mathrm{N}_i \in \{\text{A (adenine)}, \text{C (cytosine)}, \text{G (guanine)}, \text{T (thymine)}\}, \quad (i = 1, 2, 3, \ldots, L) \tag{3}$$

which represents the *i*-th nucleobase in the sequence, and ∈ denotes "member of" in set theory. Following Liu et al. (2018b), **D** is divided into three windows (front, middle, and rear) by two parameters ε and δ, where ε and 1 − δ denote the fractions of the nucleobases of **D** in the front window and the rear window, respectively. The front window, middle window, and rear window can be represented as **D**[1, η], **D**[η + 1, ξ], and **D**[ξ + 1, *L*], respectively, where η and ξ are defined as (Liu et al., 2018b),

$$\begin{cases} \eta = \mathrm{Int}^{\mathrm{C}}[L \times \varepsilon] \\ \xi = \mathrm{Int}^{\mathrm{C}}[L \times \delta] \end{cases}, \qquad (0 < \varepsilon < \delta < 1.0) \tag{4}$$

where the symbol $\mathrm{Int}^{\mathrm{C}}$ represents the ceiling operator, which returns the smallest integer greater than or equal to its argument.
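The three-window split defined by η and ξ can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function names and the example sequence are hypothetical, and the parameter values in the usage line are chosen only for demonstration.

```python
import math


def window_bounds(length, eps, delta):
    """Return (eta, xi) as ceil(L * eps) and ceil(L * delta)."""
    if not 0 < eps < delta < 1:
        raise ValueError("require 0 < eps < delta < 1")
    # Note: floating-point rounding can shift the ceiling by one for some
    # inputs (e.g. 30 * 0.1); a production version might use fractions.
    return math.ceil(length * eps), math.ceil(length * delta)


def split_three_windows(seq, eps, delta):
    """Split seq into D[1, eta], D[eta + 1, xi], D[xi + 1, L] (1-based, inclusive)."""
    eta, xi = window_bounds(len(seq), eps, delta)
    # Python slices are 0-based and end-exclusive, so D[1, eta] is seq[:eta].
    return seq[:eta], seq[eta:xi], seq[xi:]


# Hypothetical 10-nt sequence with eps = 0.25 and delta = 0.85.
front, middle, rear = split_three_windows("ACGTACGTAC", eps=0.25, delta=0.85)
```

For this toy input, η = 3 and ξ = 9, so the front, middle, and rear windows hold 3, 6, and 1 nucleobases.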

According to Liu et al. (2018b), if **D** is formulated by the *k*-tuple nucleotide (or *k*-mer) composition (Liu et al., 2019b; Liu, 2017) based on the three-window strategy, it can be represented as follows:

$$\mathbf{D} = \left[ f_1^{(1)} \cdots f_\nu^{(1)} \cdots f_{4^k}^{(1)} \; f_{4^k+1}^{(2)} \cdots f_{4^k+\nu}^{(2)} \cdots f_{2 \times 4^k}^{(2)} \; f_{2 \times 4^k+1}^{(3)} \cdots f_{2 \times 4^k+\nu}^{(3)} \cdots f_{3 \times 4^k}^{(3)} \right]^{\mathbf{T}} \tag{5}$$

where the symbol **T** denotes the transpose operator, and $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ represent the normalized frequencies of the corresponding *k*-tuple nucleotides appearing in the front window, middle window, and rear window of the sample **D**, respectively. The dimension of the feature vector is 3 × 4*<sup>k</sup>*.

This strategy was proposed to capture the patterns of "GC asymmetry bias" in yeast species genomes, and it is able to improve the predictive performance for identifying replication origins among multiple yeast species genomes. However, this approach has the following disadvantages: 1) the three-window strategy can only capture the local GC asymmetry bias of replication origins and cannot incorporate the GC asymmetry bias in a global fashion; 2) for large values of *k*, the dimension of the resulting feature vectors is high, which leads to the curse of dimensionality.

In order to overcome these disadvantages, we proposed a new composition of DNA sequence called "*k*-tuple GC composition (or *k*-GCC)" to capture the GC preference in the replication origins and their global interactions. *k*-GCC treats A (adenine) and T (thymine) as one nucleotide type represented as \*. Therefore, the alphabet of *k*-GCC is

$$\mathrm{N}_{i} \in \{ \text{G (guanine)}, \text{C (cytosine)}, \text{\*} \}, \quad (i = 1, 2, 3, \ldots, L) \tag{6}$$

Therefore, by replacing the *k*-tuple nucleotides with *k*-GCC, a DNA sequence **D** can be represented as:

$$\mathbf{D} = \left[ f_1^{(1)} \cdots f_{\nu}^{(1)} \cdots f_{3^k}^{(1)} \; f_{3^k+1}^{(2)} \cdots f_{3^k+\nu}^{(2)} \cdots f_{2\times 3^k}^{(2)} \; f_{2\times 3^k+1}^{(3)} \cdots f_{2\times 3^k+\nu}^{(3)} \cdots f_{3\times 3^k}^{(3)} \right]^{\mathbf{T}} \tag{7}$$

Compared with Equation 5, the *k*-GCC can efficiently reduce the dimension of the feature vector from 3 × 4*<sup>k</sup>* to 3 × 3*<sup>k</sup>* by focusing on the GC composition.
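As a rough illustration of the reduced alphabet, the sketch below computes the normalized *k*-GCC frequencies of a single window. It is not the authors' code: the function names are hypothetical, and two details are assumptions because the text does not fix them, namely that *k*-tuples are counted with an overlapping stride of one and that the tuples are ordered over the alphabet (G, C, \*).

```python
from itertools import product


def to_gcc_alphabet(seq):
    """Map A and T to '*', keep G and C; raises KeyError on other symbols."""
    mapping = {"A": "*", "T": "*", "G": "G", "C": "C"}
    return "".join(mapping[ch] for ch in seq.upper())


def kgcc_frequencies(window, k):
    """Normalized frequencies of all 3**k k-GCC tuples in one window.

    Assumptions: overlapping tuples (stride 1) and tuple order G, C, *.
    """
    reduced = to_gcc_alphabet(window)
    keys = ["".join(p) for p in product("GC*", repeat=k)]
    counts = dict.fromkeys(keys, 0)
    total = len(reduced) - k + 1  # number of overlapping k-tuples
    for i in range(max(total, 0)):
        counts[reduced[i:i + k]] += 1
    return [counts[key] / total if total > 0 else 0.0 for key in keys]


# A 4-nt toy window with k = 2 yields a 3**2 = 9-dimensional vector.
freqs = kgcc_frequencies("GATC", 2)
```

Concatenating the vectors of the front, middle, and rear windows gives the 3 × 3*<sup>k</sup>* features, compared with 3 × 4*<sup>k</sup>* for ordinary *k*-mers.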

The proposed PseKGCC incorporates both the *k*-GCC and GC skew into the framework of PseKNC (Chen et al., 2014a), and can be represented as:

$$\mathbf{D} = \left[ \phi_1 \cdots \phi_{3^k} \cdots \phi_{3^k + \lambda} \; \phi_{3^k + \lambda + 1} \cdots \phi_{(3^k + \lambda) + 3^k} \cdots \phi_{2 \times (3^k + \lambda)} \; \phi_{2 \times (3^k + \lambda) + 1} \cdots \phi_{2 \times (3^k + \lambda) + 3^k} \cdots \phi_{3 \times (3^k + \lambda)} \right]^{\mathbf{T}} \tag{8}$$

where

$$\phi_{u} = \begin{cases} \dfrac{f_{u}^{(1)}}{\sum_{i=1}^{3^k} f_{i}^{(1)} + w \sum_{j=1}^{\lambda} \theta_{j}^{(1)}} & 1 \le u \le 3^{k} \\[2ex] \dfrac{w\,\theta_{u-3^k}^{(1)}}{\sum_{i=1}^{3^k} f_{i}^{(1)} + w \sum_{j=1}^{\lambda} \theta_{j}^{(1)}} & 3^{k} + 1 \le u \le 3^{k} + \lambda \\[2ex] \dfrac{f_{u-\lambda}^{(2)}}{\sum_{i=3^k+1}^{2\times 3^k} f_{i}^{(2)} + w \sum_{j=1}^{\lambda} \theta_{j}^{(2)}} & 3^{k} + \lambda + 1 \le u \le 2 \times 3^{k} + \lambda \\[2ex] \dfrac{w\,\theta_{u-(2\times 3^k+\lambda)}^{(2)}}{\sum_{i=3^k+1}^{2\times 3^k} f_{i}^{(2)} + w \sum_{j=1}^{\lambda} \theta_{j}^{(2)}} & 2 \times 3^{k} + \lambda + 1 \le u \le 2 \times 3^{k} + 2\lambda \\[2ex] \dfrac{f_{u-2\lambda}^{(3)}}{\sum_{i=2\times 3^k+1}^{3\times 3^k} f_{i}^{(3)} + w \sum_{j=1}^{\lambda} \theta_{j}^{(3)}} & 2 \times 3^{k} + 2\lambda + 1 \le u \le 3 \times 3^{k} + 2\lambda \\[2ex] \dfrac{w\,\theta_{u-(3\times 3^k+2\lambda)}^{(3)}}{\sum_{i=2\times 3^k+1}^{3\times 3^k} f_{i}^{(3)} + w \sum_{j=1}^{\lambda} \theta_{j}^{(3)}} & 3 \times 3^{k} + 2\lambda + 1 \le u \le 3 \times 3^{k} + 3\lambda \end{cases} \tag{9}$$

where λ is an integer denoting the highest tier of correlation among the *k*-GCC nucleotides in each local window of **D**, and *w* is a weight factor with a value between 0 and 1. In the front window, the middle window, and the rear window, the correlation factor of the *j*-th tier is represented as $\theta_j^{(1)}$, $\theta_j^{(2)}$, and $\theta_j^{(3)}$, respectively. The GC skew values of the *k*-GCC nucleotides separated by *j* nucleotides are used to represent the correlation factor of the *j*-th tier in each local window (**Figure 1**). $\theta_j^{(1)}$, $\theta_j^{(2)}$, and $\theta_j^{(3)}$ can be calculated by

$$\begin{cases} \theta^{(1)}_{j} = \dfrac{1}{\mathrm{Int}^{\mathrm{C}}\left[\frac{\eta-k}{j}\right]+1} \displaystyle\sum_{i=0}^{\mathrm{Int}^{\mathrm{C}}\left[\frac{\eta-k}{j}\right]} \Theta\left(\mathrm{N}_{i \times j + 1}\mathrm{N}_{i \times j + 2} \cdots \mathrm{N}_{i \times j + k}\right) \\[3ex] \theta^{(2)}_{j} = \dfrac{1}{\mathrm{Int}^{\mathrm{C}}\left[\frac{\xi-\eta-k}{j}\right]+1} \displaystyle\sum_{i=0}^{\mathrm{Int}^{\mathrm{C}}\left[\frac{\xi-\eta-k}{j}\right]} \Theta\left(\mathrm{N}_{\eta + i \times j + 1}\mathrm{N}_{\eta + i \times j + 2} \cdots \mathrm{N}_{\eta + i \times j + k}\right) \\[3ex] \theta^{(3)}_{j} = \dfrac{1}{\mathrm{Int}^{\mathrm{C}}\left[\frac{L-\xi-k}{j}\right]+1} \displaystyle\sum_{i=0}^{\mathrm{Int}^{\mathrm{C}}\left[\frac{L-\xi-k}{j}\right]} \Theta\left(\mathrm{N}_{\xi + i \times j + 1}\mathrm{N}_{\xi + i \times j + 2} \cdots \mathrm{N}_{\xi + i \times j + k}\right) \end{cases} \quad (j = 1, 2, \ldots, \lambda) \tag{10}$$

where $\mathrm{Int}^{\mathrm{C}}\left[\frac{\eta-k}{j}\right]+1$ denotes the number of the *k*-GCC in the corresponding local window, and $\Theta\left(\mathrm{N}_{i \times j+1}\mathrm{N}_{i \times j+2}\cdots\mathrm{N}_{i \times j+k}\right)$ is the GC Skew (Lobry, 1996; Grigoriev, 1998; Li et al., 2014) of the *i*-th *k*-GCC in the local window, which can be calculated by

$$\Theta\left(\mathrm{N}_{i\times j+1}\mathrm{N}_{i\times j+2}\cdots\mathrm{N}_{i\times j+k}\right) = \frac{f_{\mathrm{G}}\left(\mathrm{N}_{i\times j+1}\mathrm{N}_{i\times j+2}\cdots\mathrm{N}_{i\times j+k}\right) - f_{\mathrm{C}}\left(\mathrm{N}_{i\times j+1}\mathrm{N}_{i\times j+2}\cdots\mathrm{N}_{i\times j+k}\right)}{f_{\mathrm{G}}\left(\mathrm{N}_{i\times j+1}\mathrm{N}_{i\times j+2}\cdots\mathrm{N}_{i\times j+k}\right) + f_{\mathrm{C}}\left(\mathrm{N}_{i\times j+1}\mathrm{N}_{i\times j+2}\cdots\mathrm{N}_{i\times j+k}\right)}\tag{11}$$

where $f_{\mathrm{G}}(\cdot)$ denotes the frequency of G in the subsequence $\mathrm{N}_{i\times j+1}\mathrm{N}_{i\times j+2}\cdots\mathrm{N}_{i\times j+k}$, and $f_{\mathrm{C}}(\cdot)$ denotes the frequency of C in the same subsequence, reflecting the GC asymmetry bias directly. Please note that if the terminal subsequence is shorter than *k*, its GC skew is calculated from all the available nucleotide residues.
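The *j*-th tier correlation factor of one window can be sketched as follows. This is an illustrative reading of the index pattern above (subsequences of length *k* starting every *j* positions), not the authors' implementation. Two points are assumptions that the text does not settle: a subsequence containing neither G nor C is given a skew of 0, and so is an empty terminal slice, which can occur when *j* > *k*.

```python
import math


def gc_skew(subseq):
    """GC skew (G - C) / (G + C) of a subsequence.

    Assumption: returns 0.0 when the subsequence has no G or C.
    """
    g, c = subseq.count("G"), subseq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def theta(window, k, j):
    """Mean GC skew over length-k subsequences starting every j positions."""
    n = len(window)
    if n < k or j < 1:
        raise ValueError("window must have at least k residues and j >= 1")
    last = math.ceil((n - k) / j)  # index of the terminal subsequence
    # Slicing past the end truncates, so a short terminal subsequence is
    # scored on the residues that are available.
    skews = [gc_skew(window[i * j:i * j + k]) for i in range(last + 1)]
    return sum(skews) / (last + 1)


# Toy window: subsequences "GG", "GC", "CC" give skews 1, 0, -1.
example = theta("GGCC", k=2, j=1)
```

Computing `theta` for *j* = 1, …, λ in each of the three windows gives the 3λ correlation features that are appended to the *k*-GCC frequencies.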

#### Random Forest

Random Forest (RF) (Ho, 1995; Barandiaran, 1998) is a machine learning classifier widely used in bioinformatics (Zhao et al., 2014; Su et al., 2019). Its training process can prevent overfitting (Hastie et al., 2008). The Random Forest model was implemented by calling RandomForestClassifier(max\_features='sqrt', min\_samples\_leaf=1, min\_samples\_split=2, criterion='gini', n\_estimators=optimized value) from the Scikit-learn package (Pedregosa et al., 2011), where n\_estimators is the number of trees in the forest; it was set to 600 for both benchmark datasets (cf. Equation 1).

#### Ensemble Learning

Previous studies (Zou et al., 2015; Liu et al., 2016a; Chen et al., 2016b; Chen et al., 2017a; Chen et al., 2017b; Liu et al., 2018a) have demonstrated that fusing a series of individual predictors by a voting strategy can improve the predictive performance. In this study, an ensemble predictor was therefore constructed by fusing the 10 top-performing individual predictors, which were built with different parameter combinations of PseKGCC (see **Supplementary Information S1**). The ensemble can be represented as (Liu et al., 2016a):

$$\mathbb{RF}^{\mathrm{E}} = \mathrm{RF}(1) \,\forall \, \mathrm{RF}(2) \,\forall \, \cdots \,\forall \, \mathrm{RF}(i) \,\forall \, \cdots \,\forall \, \mathrm{RF}(10) = \forall_{i=1}^{10} \mathrm{RF}(i) \tag{12}$$

where $\mathbb{RF}^{\mathrm{E}}$ represents the ensemble classifier, ∀ represents the fusing operator, and RF(*i*) represents the *i*-th basic Random Forest predictor.

FIGURE 1 | [...] *k*-GCC (*k* = 4); (D) The coupling between the second most contiguous *k*-GCC (*k* = 4).

The ensemble predictor is constructed based on the fusion score ß of the probabilities predicted by the 10 basic predictors, which can be calculated by

$$\text{ß} = \sum_{i=1}^{10} q_i P_i \tag{13}$$

where *P<sub>i</sub>* is the probability predicted by the *i*-th basic RF predictor and *q<sub>i</sub>* is its weight, which was optimized by the genetic algorithm (Mitchell, 1998); the weight values are listed in **Supplementary Information S1**. If the value of ß is higher than 0.5, the sample is predicted as a replication origin; otherwise, it is predicted as a non-replication origin. The flowchart of iRO-PseKGCC is illustrated in **Figure 2**.
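The weighted fusion and the 0.5 decision threshold can be written compactly. The sketch below is only an illustration: the function names are hypothetical, and the equal weights and probabilities in the usage lines are made-up placeholders, not the genetic-algorithm weights reported in the Supplementary Information.

```python
def fusion_score(probabilities, weights):
    """Weighted sum of the per-predictor probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight per basic predictor is required")
    return sum(q * p for q, p in zip(weights, probabilities))


def is_replication_origin(probabilities, weights, threshold=0.5):
    """Predict 'origin' when the fusion score exceeds the threshold."""
    return fusion_score(probabilities, weights) > threshold


# Hypothetical outputs of 10 basic predictors with equal placeholder weights.
probs = [0.9] * 6 + [0.2] * 4
weights = [0.1] * 10
score = fusion_score(probs, weights)
label = is_replication_origin(probs, weights)
```

With these placeholder values the score is 0.62, so the sample would be labeled as a replication origin.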

FIGURE 2 | A flowchart illustration to show how the iRO-PseKGCC predictor works.

#### Cross Validation

Three validation strategies are widely used: i) the independent test, ii) K-fold cross-validation, and iii) the jackknife test. Among these, only the jackknife test always yields a unique result for a given benchmark dataset. Therefore, in this study, the jackknife test was employed to give the final predictive results. However, because of its high computational cost, 5-fold cross-validation was used during the parameter optimization process (see *Optimize Parameters* section).
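The difference between the two resampling schemes can be seen from how the index splits are generated. The sketch below is a generic illustration with hypothetical function names: the jackknife (leave-one-out) split involves no random partition, whereas the K-fold split depends on how the samples are shuffled, here controlled by an assumed `seed` argument.

```python
import random


def jackknife_splits(n):
    """Leave-one-out: every sample is the test set exactly once."""
    for i in range(n):
        yield [t for t in range(n) if t != i], [i]


def kfold_splits(n, k, seed=0):
    """K-fold: the partition depends on the shuffle, hence on the seed."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]


loo = list(jackknife_splits(4))
five_fold = list(kfold_splits(10, 5, seed=1))
```

The jackknife requires training as many models as there are samples, which is why a cheaper K-fold scheme is attractive for parameter search.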

#### Evaluation Method of Performance

To evaluate the quality of the classifier for predicting replication origins, five metrics are used (Feng et al., 2013; Chen et al., 2016c; Chen et al., 2019): i) the sensitivity, Sn; ii) the specificity, Sp; iii) the overall accuracy of the predictive results, Acc; iv) the Matthews correlation coefficient, MCC; and v) the area under the ROC curve, AUC (Chen et al., 2016a). The first four are defined as:

$$\begin{cases} \mathrm{Sn} = 1 - \dfrac{N_{-}^{+}}{N^{+}} \\[2ex] \mathrm{Sp} = 1 - \dfrac{N_{+}^{-}}{N^{-}} \\[2ex] \mathrm{Acc} = 1 - \dfrac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}} \\[2ex] \mathrm{MCC} = \dfrac{1 - \left(\dfrac{N_{-}^{+}}{N^{+}} + \dfrac{N_{+}^{-}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N_{+}^{-} - N_{-}^{+}}{N^{+}}\right)\left(1 + \dfrac{N_{-}^{+} - N_{+}^{-}}{N^{-}}\right)}} \end{cases} \tag{14}$$

where $N^{+}$ denotes the number of all the positive samples (replication origins), $N^{-}$ denotes the number of all the negative samples (non-replication origins), $N_{-}^{+}$ denotes the number of the positive samples (replication origins) incorrectly predicted as negative samples (non-replication origins), and $N_{+}^{-}$ denotes the number of the negative samples (non-replication origins) incorrectly predicted as positive samples (replication origins). For more information on these performance measures, see Liu et al. (2016b).
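Given the four counts defined above, the threshold-based metrics follow from a few arithmetic steps. The sketch below uses the standard confusion-matrix definitions rewritten in terms of those counts; the function and argument names are hypothetical, the example counts are invented for illustration, and returning an MCC of 0 when the denominator vanishes is an assumption. AUC is omitted because it requires the ranked prediction scores rather than counts.

```python
import math


def threshold_metrics(n_pos, n_neg, pos_as_neg, neg_as_pos):
    """Return (Sn, Sp, Acc, MCC) from the dataset sizes and error counts.

    n_pos, n_neg : numbers of positive and negative samples
    pos_as_neg   : positives incorrectly predicted as negatives
    neg_as_pos   : negatives incorrectly predicted as positives
    """
    sn = 1 - pos_as_neg / n_pos
    sp = 1 - neg_as_pos / n_neg
    acc = 1 - (pos_as_neg + neg_as_pos) / (n_pos + n_neg)
    denom = math.sqrt(
        (1 + (neg_as_pos - pos_as_neg) / n_pos)
        * (1 + (pos_as_neg - neg_as_pos) / n_neg)
    )
    # Assumption: report 0.0 when the denominator is zero.
    mcc = (1 - (pos_as_neg / n_pos + neg_as_pos / n_neg)) / denom if denom else 0.0
    return sn, sp, acc, mcc


# Invented example: 100 positives, 100 negatives, 10 and 20 errors.
sn, sp, acc, mcc = threshold_metrics(100, 100, pos_as_neg=10, neg_as_pos=20)
```

For the invented counts, Sn = 0.90, Sp = 0.80, Acc = 0.85, and MCC is about 0.70.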

### RESULTS AND DISCUSSION

#### Optimize Parameters

There are five parameters in PseKGCC according to Equations 4–9. These parameters were optimized by the following equations:


To reduce the time consumption, fivefold cross-validation with a grid search was employed to find the optimal parameters. The predictive results of the 10 top-performing predictors and their optimized parameters are listed in **Supplementary Information S1**.

#### Comparison With Other Methods

To the best of our knowledge, iRO-3wPseKNC (Liu et al., 2018b) is the only existing predictor that is able to predict entire replication origins. All the other predictors can only predict fragments of replication origins. Therefore, the performance of the proposed iRO-PseKGCC was compared with that of iRO-3wPseKNC on the two benchmark datasets, and the results are listed in **Table 1**. iRO-PseKGCC outperformed iRO-3wPseKNC in terms of the five performance measures (cf. Equation 14), indicating that the proposed PseKGCC feature is able to capture the GC asymmetry bias and incorporate the GC skew into the predictor. Therefore, iRO-PseKGCC is an efficient approach for improving the predictive performance.

#### Feature Analysis

Random forest is an ensemble classifier composed of decision tree classifiers. Each tree is constructed from a "Bootstrap" sample (Efron, 1992), and the samples not drawn for training that tree can be used to make an "Out Of Bag" (OOB) error estimate (Breiman, 1996) to evaluate the generalization performance of the predictor. Based on the OOB error, the Mean Decrease Accuracy (MDA) (Jiang et al., 2007) can be used to estimate the importance of the features. The process is as follows (Jiang et al., 2007): 1) when training a Random Forest model, use the OOB samples to test the accuracy of each tree in the model; 2) randomly permute the values of the feature variable *v* in the OOB samples, and retest the accuracy of each tree; 3) calculate the mean decrease in accuracy between the two tests over all decision trees in the Random Forest model. The MDA value reflects the importance of the corresponding feature.

TABLE 1 | The results of the iRO-PseKGCC predictor and comparison with iRO-3wPseKNC on the two benchmark datasets (cf. Equation 1) obtained by using the jackknife test.

*aThe parameters are listed in Supplementary Information S1.*

*bThe predictor reported in (Liu et al., 2018b) with parameters* ε *= 0.25,* δ *= 0.85, k = 5,* λ *= 6, w = 0.3, and number of trees = 700.*

*cThe predictor reported in (Liu et al., 2018b) with parameters* ε *= 0.15,* δ *= 0.55, k = 4,* λ *= 9, w = 0.3, and number of trees = 800.*
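The permutation idea behind MDA can be illustrated with a deliberately simplified sketch. Unlike the procedure described above, which averages over the OOB samples of every tree, the code below shuffles one feature column for a single fitted model on one held-out set; the function names, the toy classifier, and the toy data are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def accuracy_drop_after_shuffle(predict, rows, labels, feature, seed=0):
    """Accuracy before minus accuracy after permuting one feature column."""
    baseline = accuracy(predict, rows, labels)
    column = [r[feature] for r in rows]
    random.Random(seed).shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return baseline - accuracy(predict, shuffled, labels)


# Toy classifier that looks only at feature 0, so feature 1 is irrelevant.
def toy_predict(row):
    return int(row[0] > 0.5)


X = [[0.1, 5.0], [0.9, 3.0], [0.2, 7.0], [0.8, 1.0]]
y = [0, 1, 0, 1]
drop_informative = accuracy_drop_after_shuffle(toy_predict, X, y, feature=0)
drop_irrelevant = accuracy_drop_after_shuffle(toy_predict, X, y, feature=1)
```

Shuffling the irrelevant column leaves the accuracy unchanged, whereas shuffling the informative column can only lower it; averaging such drops over all trees and their OOB samples gives the MDA.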

As shown in previous studies (Liu and Zhu, 2019; Liu et al. 2019a), feature analysis is critical for exploring the characteristics of predictors. To explore why the proposed predictor iRO-PseKGCC performs well, we analyzed the features of the two top-performing iRO-PseKGCC predictors (see **Supplementary Information S1**) on the two benchmark datasets (cf. Equation 1) by the MDA approach. The results are listed in **Table 2**, from which we can see that: 1) for both RF-based predictors, the most important features are "\*\*\*" and "\*\*\*\*\*," indicating the importance of the *k*-GCC; 2) the global sequence order effects measured by different λ values and GC skew values contribute to the performance improvement; 3) features in certain local windows show more discriminative power than those in other windows. For example, for *Pichia pastoris*, all of the top 10 most important features are in the middle window, which is consistent with previous observations that the nucleobase composition is unevenly distributed along replication origins (Lobry, 1996; Grigoriev, 1998; Frank and Lobry, 1999; Tillier and Collins, 2000; Liu et al., 2018b).

#### Web Server and User Guide

Web-servers are important for researchers who wish to apply the corresponding computational predictors. For users' convenience, we established a web-server named "iRO-PseKGCC," and a detailed user guide explaining how to use the web-server is given.

TABLE 2 | The top 10 most important features of the top two performing RF-based predictors on the two benchmark datasets (cf. Equation 1).


#### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here: https://academic.oup.com/bioinformatics/article-abstract/34/18/3086/4978052?redirectedFrom=fulltext

#### REFERENCES


#### AUTHOR CONTRIBUTIONS

BL provided the main idea of the manuscript and wrote the manuscript. SC did the experiments and revised the manuscript. KY revised the manuscript and did the typesetting. FW did the experiments.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (No. 61672184, 61732012, 61822306), Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063), Shenzhen Overseas High Level Talents Innovation Foundation (Grant No. KQJSCX20170327161949608), Guangdong Natural Science Funds for Distinguished Young Scholars (2016A030306008), and Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00842/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Liu, Chen, Yan and Weng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Variance-Preserving Estimation of Intensity Values Obtained From Omics Experiments

*Adèle H. Ribeiro1\*, Julia Maria Pavan Soler2 and Roberto Hirata Jr.1*

*1 Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil, 2 Department of Statistics, Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil* 

Faced with the lack of reliability and reproducibility in omics studies, more careful and robust methods are needed to overcome the existing challenges in multi-omics analysis. In conventional omics data analysis, signal intensity values (denoted by *M* and *A* values) are estimated neglecting pixel-level uncertainties, which may reflect noise and systematic artifacts. For example, intensity values from two-color microarray data are estimated by taking the mean or median of the pixel intensities within the spot and are then subjected to a within-slide normalization by LOWESS. Thus, focusing on estimation and normalization of gene expression profiles, we propose a spot quantification method that takes into account pixel-level variability. Also, to preserve relevant variation that may be removed in LOWESS normalization with poorly chosen parameters, we propose a parameter selection method that is parsimonious and considers intrinsic characteristics of microarray data, such as heteroskedasticity. The usefulness of the proposed methods is illustrated by an application to real intestinal metaplasia data. Compared with the conventional approaches, the analysis is more robust and conservative, identifying fewer but more reliable differentially expressed genes. Also, the variability preservation allowed the identification of new differentially expressed genes. Using the proposed approach, we have identified differentially expressed genes involved in pathways in cancer and confirmed some molecular markers already reported in the literature.

Keywords: delta method, pixel-level uncertainty, spot quantification, optimal LOWESS normalization, two-color microarray, variability preservation, parameter selection

## INTRODUCTION

The growing number of omics datasets (e.g., genomics, transcriptomics, proteomics, metabolomics) and the recent advances in multi-omics integration approaches have contributed to a better understanding of biological mechanisms and to the emergence of personalized medicine. However, the lack of reliability and reproducibility in omics studies stands as one of the biggest obstacles in bridging the gap between research and practice of personalized medicine (Alyass et al., 2015; Karczewski and Snyder, 2018). Considering that inflated variability and non-robust estimation may lead to inaccurate and misleading results, this paper proposes improvements to the conventional estimation and normalization of the intensity values obtained from omics experiments. Specifically, the proposal is to estimate the intensity values by a method that accounts for the variability due to pixel-level uncertainties and to normalize these values by using LOWESS with suitably selected parameter values, preserving variation that may be relevant to subsequent analyses.

#### *Edited by:*

*Liang Cheng, Harbin Medical University, China*

#### *Reviewed by:*

*Tian Qing Zheng, Chinese Academy of Agricultural Sciences, China Tianyi Zhao, Harbin Institute of Technology, China*

> *\*Correspondence: Adèle H. Ribeiro adele.ribeiro@usp.br*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> *Received: 05 April 2019 Accepted: 16 August 2019 Published: 20 September 2019*

#### *Citation:*

*Ribeiro AH, Soler JMP and Hirata R Jr (2019) Variance-Preserving Estimation of Intensity Values Obtained From Omics Experiments. Front. Genet. 10:855. doi: 10.3389/fgene.2019.00855*

Image processing and fluorescence analysis are the preferred approaches for data quantification in microarray technologies. Although microarrays have been predominantly used since the end of the nineties to measure gene expression levels, they remain widely used to detect other omics data types, including microRNA expression, DNA methylation, single-nucleotide polymorphisms (SNPs), and copy number variants (CNVs) (Goodwin et al., 2016). After hybridization and cleaning of the target molecules, the array is scanned by activation with lasers at different wavelengths (one for each of the fluorophores used), and each laser channel generates an image. The pixel intensities within each spot in these microarray images are summarized to represent the hybridization signal. Depending on the platform (e.g., gene expression array, DNA methylation array, SNP array, and comparative genomic hybridization [CGH] array), the interpretation of this signal is different (e.g., gene expression levels, methylation levels, allele frequencies, and copy number alterations).

The continuance of the microarray technology can be mainly explained by the availability of many datasets in public repositories, such as the Gene Expression Omnibus (GEO) (Edgar et al., 2002; Barrett et al., 2012) and ArrayExpress (Kolesnikov et al., 2015), by the existence of well-established strategies for data analysis and experimental design, and by the low cost compared with the next-generation sequencing technologies. However, given that microarray analysis is still facing reliability and reproducibility problems, more robust and rigorous methods are needed to account for the high variability and biases introduced in all steps of a microarray experiment.

Several preprocessing and normalization procedures have been proposed to remove biases due to the inhomogeneity of the background and the different fluorescence properties of the dyes. However, biases introduced in the image analysis step, which includes spot segmentation and signal extraction, have not received the same attention, and these may partially explain the existing reliability and reproducibility problems in omics studies. In particular, several factors, including image resolution, scanner settings, effectiveness of the segmentation algorithm, and unexpected behaviors during hybridization, may lead to errors in spot localization and in the classification of pixels (as foreground or background, depending on whether they are situated within or around the spot). Thus, spot intensities are usually noisy, and high pixel-level variability leads to uncertainty in microarray quantification and correlates with variability between replicate spots on duplicate slides (Brown et al., 2001).

Given that even state-of-the-art image processing tools are susceptible to errors that significantly influence the variability of the data derived from microarray images (Ahmed et al., 2004), new segmentation and intensity extraction algorithms are still being developed to improve precision in spot quantification (Li et al., 2017; Karthik and Manjunath, 2018; Shao et al., 2019). Usually, these tools combine sophisticated algorithms and pixel-level analyses to obtain an accurate estimate of the signal intensity in each spot. In addition, to allow subsequent analyses to take into account possible errors and uncertainties arising from the image processing, the output usually includes not only statistical measures of location (e.g., mean and median) of the foreground and background intensities of each channel of each spot but also measures of dispersion, including the standard deviation and the covariance between both channels.

Despite the common use of pixel-level variability measures as data quality criteria for filtering purpose, the conventional microarray analysis is solely based on statistical measures of location of the spot intensities (Yang et al., 2002; Sun et al., 2011; Brady and Vermeesch, 2012). To improve robustness and reliability in microarray analysis, pixel-level uncertainties should be accounted for in the intensity log-ratio estimation and propagated to the next steps of the analysis.

Pixel-level uncertainties have been taken into account by many spot quantification algorithms in the literature, but these algorithms require all pixel values to be available. Some of them aim to improve the log-ratio estimator. In particular, the method proposed by Dodd et al. (2004) is a log-ratio estimator that corrects for signal saturation by regressing all pixel intensities at both test and control channels using a censored regression model. The META algorithm (Chan and Chang, 2009) estimates the intensity log-ratio by grouping the pixels according to their distance to the center of the spot and then weighting the log-ratio of each group in inverse proportion to its sample variance. A method that only uses pixel-level mean and variance summary statistics is the hierarchical maximum-likelihood estimator (Bakewell and Wit, 2005). However, it is not exactly based on the standard log-ratio representation of the spot intensity. It models the gene expression signal at control and treatment channels separately, incorporating the sample within-spot deviation, and then performs the estimation using maximum likelihood. To the best of our knowledge, there is no intensity log-ratio estimator to be used after the image analysis phase (i.e., based solely on the pixel-level summary statistics) that takes into account pixel-level uncertainties.

The first contribution of this paper is a more robust estimator for the intensity log-ratio (*M*) and average log intensity (*A*) of a microarray spot that accounts for pixel-level variance and covariance between channels. For a spot *t*, these values are denoted by *Mt* and *At*, respectively (Dudoit et al., 2002). We derive these estimators by using the multivariate delta method (Casella and Berger, 1990). Specifically, we approximate the expected values of *Mt* and *At* by using their second-order Taylor expansions, and the variances of *Mt* and *At* by using their first-order Taylor expansions. These expansions depend on the pixel-level variance and covariance between channels of the spot, whose sample estimates are readily accessible through standard output files of microarray image analysis tools.

After spot intensity estimation, it is necessary to perform a within-slide normalization to remove array-specific effects, intensity-dependent dye biases, and other systematic trends of the microarray data. The within-slide normalization based on the robust locally weighted regression (LOWESS) (Cleveland, 1979) is one of the most used techniques. The choice of the LOWESS parameters, particularly the smoothing parameter (also known as neighborhood size or bandwidth), dramatically affects the intensity and quality of the microarray data calibration. Although the smoothing parameter is still commonly set arbitrarily (between 0.2 and 0.4) (Dudoit et al., 2002; Smyth and Speed, 2003; Drăghici, 2012), some data-driven methods have been proposed to select its optimal value (Berger et al., 2004; Futschik and Crompton, 2004a; Lee et al., 2008). All these methods are similar in that they choose the smoothing parameter by minimizing a measure of error of the LOWESS fit. Berger et al. (2004) use the mean-squared difference between the LOWESS estimates and the corresponding normalization reference levels as cost function. These normalization levels are the true spot-specific calibration errors, which are usually unknown. Thus, Berger et al. suggest estimating them from control transcripts and replicate slides. However, these are not always available for all genes in a typical microarray experiment, making it hard to use the method reliably. Futschik and Crompton's selection method, named OLIN (Futschik and Crompton, 2004a; Futschik and Crompton, 2004b), has the advantage of not relying on a reference level. Its optimization procedure uses the generalized cross-validation (GCV) criterion, an estimator of the prediction mean square error (PMSE), as cost function. Lee et al. (2008) propose to select the smoothing parameter by minimizing the bootstrap estimate of the mean integrated square error (MISE) and show that their results are comparable to OLIN.

Although all these methods have shown superiority over LOWESS normalization with a fixed, arbitrarily chosen smoothing parameter, they do not take into account any heteroskedasticity in the data. In addition, they usually suffer from a poor bias–variance trade-off, tending to choose small smoothing values, which yield unnecessarily complicated (high-variance) LOWESS fits.

The second contribution of this paper is a data-driven method for selecting the smoothing parameter of the LOWESS normalization process. Inspired by the previously proposed methods, we choose the optimal smoothing value by minimizing a mean squared error criterion. However, our selection method also takes into account heteroskedasticity of the microarray data and offers a better bias–variance trade-off by selecting, from among the low-MSE fits, the one that is the most parsimonious. The parameter selection is obtained by solving a discrete optimization problem and is based on conventionally accepted ideas for analysis of M-plots, a graphical tool showing the curve of the MSE against the effective degrees of freedom of the estimate (Cleveland et al., 1988).

Given that the primary application of DNA microarrays has been to measure gene expression levels, we focus in this paper on variation-preserving estimation and normalization methods for gene expression levels from two-channel (or two-color) microarrays. However, it is straightforward to adapt the same ideas to improve analysis of other types of microarray data, even from single-channel technologies.

The proposed methods were evaluated by a differential gene expression analysis of real intestinal metaplasia and normal microarray samples. The proposed estimators for the *Mt* and *At* values were compared with the conventional estimators that neglect the pixel-level variability. In addition, we compared the proposed method for selecting the LOWESS smoothing parameter with OLIN, as it is conceptually similar to the other existing methods and can be applied even to microarray experiments with few or no replicates. Results show that a more robust and conservative analysis is performed when the LOWESS smoothing parameter is selected by our method, potentially reducing the number of false-positive differential expressions. Besides, both the pixel-level variabilities incorporated by the proposed estimators for the *Mt* and *At* values and the variability preserved by our more parsimonious normalization method contributed to the identification of new differentially expressed genes. Thus, the proposed methods may also reduce the false-negative rate.

#### MATERIALS AND METHODS

Two procedures that critically affect the adequacy of microarray data analysis are the spot quantification, which extracts summarized quantitative measures of the pixel intensities within each spot of the microarray slide, and the within-slide normalization, which removes dye-specific biases and other systematic noises simultaneously from all logged spot intensities (*Mt* and *At* values).

In the section Intestinal Metaplasia Database, we describe a gene expression dataset used to illustrate the application of our proposed methods. In the sections Improved Estimators for the *Mt* and *At* Values and Estimators for the Variances of the *Mt* and *At* Values, we present our improved estimation method for the *Mt* and *At* values, which incorporates pixel-level variability, together with estimators for their variances. In the section Optimal Selection of the LOWESS Parameters, we discuss some criteria that can be used for proper setting of the parameters of the LOWESS within-slide normalization and we propose an algorithm for selecting the optimal value for the smoothing parameter (denoted by *f*).

### Intestinal Metaplasia Database

Due to a chronic inflammatory process, the normal squamous mucosa of the stomach may be replaced by columnar intestinal-type epithelium, characterizing a disease called intestinal metaplasia of the stomach. Since adenocarcinoma of the stomach and inflamed intestinal mucosa are strongly associated (Coussens and Werb, 2002), intestinal metaplasia may be a significant risk factor for gastric cancer.

We analyzed data from a two-color microarray experiment with tissue samples from 90 different subjects, 35 from tissues representing type II intestinal metaplasia and 55 from tissues representing the normal condition, obtained from the Tumor Bank at A.C. Camargo Cancer Center/Antonio Prudente Foundation.

The standard reference design (Churchill, 2002) was used, in which each sample is hybridized against a pool of normal tissues using the same orientation of dye labeling. Gene expression levels were measured on Agilent Whole Human Genome Microarrays 4x44K G4112F (design ID 014850), each slide containing 41,093 unique probes. The scanned images of the microarray slides were processed by the *Agilent Feature Extraction* software, version 9.5, which computed statistics (mean, standard deviation, and covariance) of the foreground and local background pixels for each spot, in both test and reference channels. Each microarray spot contains about 60 foreground pixels.

This study was carried out in accordance with the recommendations of the international guidelines for investigations involving human beings with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Ethics Institutional Committee of the A.C. Camargo Cancer Center (process number 1023/07).

#### Improved Estimators for the *Mt* and *At* Values

Usually, in microarray analysis, the test channel is denoted by *R* (red) and the reference channel by *G* (green). Following this notation, we denote by *Rtj* and *Gtj* the intensity values of the *j*th pixel within the *t*th spot in the test and reference channels, respectively. The relative expression of pixel *j* within spot *t* is denoted by *Mtj* and defined as follows:

$$M\_{tj} \doteq \log\_2 \left( \frac{R\_{tj}}{G\_{tj}} \right) = \log\_2(R\_{tj}) - \log\_2(G\_{tj}).\tag{1}$$

The average expression of pixel *j* within spot *t* is denoted by *Atj* and defined as follows:

$$A\_{tj} \doteq \frac{1}{2} \log\_2 \left( R\_{tj} G\_{tj} \right) = \frac{\log\_2(R\_{tj}) + \log\_2(G\_{tj})}{2}. \tag{2}$$

Usually, image analysis software does not provide all pixel intensity values within each spot. Nonetheless, it provides several descriptive statistics of the foreground and background pixel intensities, including sample estimates for the mean, median, variance, and covariance between the two channels.

To incorporate the pixel-level variability in the analysis, we derived an approximation of the expected values of *Mtj* and *Atj* by using the *multivariate delta method* (Casella and Berger, 1990). Assuming that the functions (1) and (2) are twice differentiable on an open neighborhood of the point (𝔼(*Rtj*), 𝔼(*Gtj*)), we computed their second-order Taylor expansions around this point and then derived their expected values. The derivation is presented in Appendix 4.

It is reasonable to assume that the variables *Rtj*, *Gtj*, *Mtj*, and *Atj* have distributions with well-defined mean and variance. In particular, Hoyle et al. (2002) empirically showed that the distribution of the pixels within a spot is heavy-tailed (non-Gaussian) and well approximated by a log-normal distribution. Consequently, *Mtj* and *Atj* follow a distribution that is well approximated by a Gaussian distribution, and all the variables have finite first and second moments.

Let *R̄tc* and *Ḡtc* be non-zero estimates of 𝔼(*Rtj*) and 𝔼(*Gtj*), respectively, which represent average foreground signals after correction for removing the background influence. The subscript *c* indicates dependence on the background correction. Also, let σ̂<sup>2</sup>(*Rt*) and σ̂<sup>2</sup>(*Gt*) be estimates of Var(*Rtj*) and Var(*Gtj*), respectively, which are assumed to be independent of the background correction. Note that the mean and variance estimates are calculated across the observed foreground pixel intensities within the spot at the respective channel.

We can derive improved estimators for 𝔼(*Mtj*) and 𝔼(*Atj*) as follows:

$$\begin{split} \tilde{M}\_{t} \doteq \mathbb{E}(M\_{tj}) &= \log\_{2} (\overline{R}\_{tc}) - \log\_{2} (\overline{G}\_{tc}) \\ &+ \frac{1}{2\ln \left( 2 \right)} \bigg( -\frac{\hat{\sigma}^{2} \left( R\_{t} \right)}{\overline{R}\_{tc}^{2}} + \frac{\hat{\sigma}^{2} \left( G\_{t} \right)}{\overline{G}\_{tc}^{2}} \bigg), \end{split} \tag{3}$$

$$\begin{split} \tilde{A}\_{t} \doteq \mathbb{E}(A\_{tj}) &= \frac{1}{2} \Big( \log\_{2}(\overline{R}\_{tc}) + \log\_{2}(\overline{G}\_{tc}) \Big) \\ &- \frac{1}{4\ln\left(2\right)} \Big( \frac{\hat{\sigma}^{2}(R\_{t})}{\overline{R}\_{tc}^{2}} + \frac{\hat{\sigma}^{2}(G\_{t})}{\overline{G}\_{tc}^{2}} \Big). \end{split} \tag{4}$$

Note that the conventional estimators for the *Mtj* and *Atj* values, given by

$$\hat{M}\_t \doteq \log\_2(\overline{R}\_{tc}) - \log\_2(\overline{G}\_{tc}),\tag{5}$$

$$\hat{A}\_t \doteq \frac{\log\_2(\overline{R}\_{tc}) + \log\_2(\overline{G}\_{tc})}{2},\tag{6}$$

are approximations of 𝔼(*Mtj*) and 𝔼(*Atj*), respectively, derived from only the zeroth-order Taylor expansion of the functions that define *Mtj* and *Atj*. Thus, the conventional estimators ignore the known measures of pixel-level variability, which represent uncertainties in the gene expression measurements.
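The two pairs of estimators can be compared numerically with a minimal sketch such as the one below. The function names and the spot values are hypothetical and chosen only for illustration; the inputs are the background-corrected channel means and the pixel-level variances that image analysis software reports.

```python
import math

LN2 = math.log(2.0)


def conventional_ma(r_mean, g_mean):
    """Zeroth-order estimates of E(M) and E(A), as in Eqs. (5) and (6)."""
    log_r, log_g = math.log2(r_mean), math.log2(g_mean)
    return log_r - log_g, 0.5 * (log_r + log_g)


def improved_ma(r_mean, g_mean, r_var, g_var):
    """Second-order (delta-method) estimates, as in Eqs. (3) and (4)."""
    m_hat, a_hat = conventional_ma(r_mean, g_mean)
    r_cv2 = r_var / r_mean ** 2  # pixel variance over squared mean, test channel
    g_cv2 = g_var / g_mean ** 2  # same ratio for the reference channel
    m_tilde = m_hat + (-r_cv2 + g_cv2) / (2.0 * LN2)
    a_tilde = a_hat - (r_cv2 + g_cv2) / (4.0 * LN2)
    return m_tilde, a_tilde


# Hypothetical spot: corrected means 1000 and 500, pixel standard deviations 200 and 50.
m_hat, a_hat = conventional_ma(1000.0, 500.0)
m_tilde, a_tilde = improved_ma(1000.0, 500.0, 200.0 ** 2, 50.0 ** 2)
print(f"M: conventional {m_hat:.4f}, improved {m_tilde:.4f}")
print(f"A: conventional {a_hat:.4f}, improved {a_tilde:.4f}")
```

When both pixel-level variances are zero, the correction terms vanish and the two pairs of estimates coincide.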

**Figure 1** illustrates the differences between the estimators for 𝔼(*Mtj*) and 𝔼(*Atj*) for a randomly chosen microarray slide of the database described in the section *Intestinal Metaplasia Database*. Since these estimators may suffer from numerical instability if the corrected foreground signals, *R̄tc* and *Ḡtc*, are very close to zero, we removed the background influence by applying the *normexp* method (Ritchie et al., 2007) with an offset of 50. The top 20 spots with the highest pixel-level variability are highlighted with red plus symbols. Several of these spots have low average intensity (small estimates for 𝔼(*Atj*)) and a small difference between the intensities of the two channels (estimates for 𝔼(*Mtj*) close to zero), but they are not the majority. The differences between the proposed estimators, defined in Eqs. (3) and (4), and the conventional estimators, defined in Eqs. (5) and (6), are shown in **Figures 1C**, **D**. These differences are due to the terms that distinguish their respective formulas. When computing the *M̃t* estimates, the ratio of the pixel-level variance to the squared expected value in the test channel appears in Eq. (3) with the opposite sign to the same term in the reference channel. Thus, positive and negative differences between the estimates for 𝔼(*Mtj*) may occur if such terms do not cancel each other out. **Figure 1C** shows that the *M̃t* estimates were smaller than the *M̂t* estimates for the genes with the highest pixel-level variance, indicating a larger variance in their test channels. **Figure 1D** shows that some *Ãt* estimates were smaller than the *Ât* estimates. The reduction is explained by the fact that the additional terms in Eq. (4) are negative for any positive pixel-level variability in any channel.

**Figure 1** (caption fragment) | Background influence was removed from the foreground signals by the *normexp* method with offset.
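As an informal check of the second-order approximation, one can simulate log-normal pixel intensities and compare each estimator with the average of the per-pixel log values it is meant to approximate. The sketch below is only illustrative: the log-normal parameters are arbitrary assumptions, the two channels are simulated independently, and far more pixels are drawn than the roughly 60 per spot of the real data so that sampling noise does not mask the bias.

```python
import math
import random

random.seed(0)  # fixed seed so that the sketch is reproducible
LN2 = math.log(2.0)


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    center = mean(values)
    return sum((v - center) ** 2 for v in values) / (len(values) - 1)


# Hypothetical log-normal pixels; the test channel has the larger relative spread.
n_pixels = 20000
red = [random.lognormvariate(7.0, 0.4) for _ in range(n_pixels)]
green = [random.lognormvariate(6.5, 0.2) for _ in range(n_pixels)]

# Targets: averages of the per-pixel M and A values of Eqs. (1) and (2).
m_target = mean([math.log2(r) - math.log2(g) for r, g in zip(red, green)])
a_target = mean([0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)])

r_mean, g_mean = mean(red), mean(green)
r_cv2 = sample_variance(red) / r_mean ** 2
g_cv2 = sample_variance(green) / g_mean ** 2

# Conventional estimates, then the same estimates with the second-order terms.
m_hat = math.log2(r_mean) - math.log2(g_mean)
a_hat = 0.5 * (math.log2(r_mean) + math.log2(g_mean))
m_tilde = m_hat + (-r_cv2 + g_cv2) / (2.0 * LN2)
a_tilde = a_hat - (r_cv2 + g_cv2) / (4.0 * LN2)

print(f"M error: conventional {m_hat - m_target:+.4f}, improved {m_tilde - m_target:+.4f}")
print(f"A error: conventional {a_hat - a_target:+.4f}, improved {a_tilde - a_target:+.4f}")
```

In this simulated setting the corrected estimates lie closer to the per-pixel averages, and the correction lowers both *M* and *A* because the test channel carries the larger relative variance.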

#### Estimators for the Variances of the *Mt* and *At* Values

Since the sample covariance between *Rtj* and *Gtj*, denoted by σ̂(*Rt*, *Gt*), is also available, we applied the multivariate delta method to derive estimators for the variances of *Mtj* and *Atj*. We calculated the variance of the first-order Taylor expansion of the functions (1) and (2) that define, respectively, *Mtj* and *Atj*, as shown in Appendix 5. The variance estimators for *Mtj* and *Atj*, for pixels *j* within spot *t*, are:

$$\hat{\sigma}^{2}(M\_{t}) \doteq \frac{1}{\ln^{2}(2)} \left( \frac{\hat{\sigma}^{2}(R\_{t})}{\overline{R}\_{tc}^{2}} + \frac{\hat{\sigma}^{2}(G\_{t})}{\overline{G}\_{tc}^{2}} - 2 \frac{\hat{\sigma}(R\_{t}, G\_{t})}{\overline{R}\_{tc} \overline{G}\_{tc}} \right), \tag{7}$$

$$\hat{\sigma}^{2}(A\_{t}) \doteq \frac{1}{4\ln^{2}(2)} \left( \frac{\hat{\sigma}^{2}(R\_{t})}{\overline{R}\_{tc}^{2}} + \frac{\hat{\sigma}^{2}(G\_{t})}{\overline{G}\_{tc}^{2}} + 2\frac{\hat{\sigma}(R\_{t}, G\_{t})}{\overline{R}\_{tc}\overline{G}\_{tc}} \right). \tag{8}$$

The variances of *Mtj* and *Atj* represent pixel-level uncertainties of the *t*th spot. They can be used, for instance, for assessing the quality of the *t*th spot or for constructing confidence intervals for the parameters 𝔼(*Mtj*) and 𝔼(*Atj*).
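A minimal sketch of the first-order variance estimators follows, derived here directly from definitions (1) and (2). Because *Mtj* is a difference of logs and *Atj* is half their sum, the covariance term is subtracted for *M* and added for *A*. The function name and the spot values are hypothetical.

```python
import math

LN2_SQ = math.log(2.0) ** 2


def log_ratio_variances(r_mean, g_mean, r_var, g_var, rg_cov):
    """First-order (delta-method) variances of the per-pixel M and A values."""
    rel_r = r_var / r_mean ** 2            # relative variance, test channel
    rel_g = g_var / g_mean ** 2            # relative variance, reference channel
    rel_rg = rg_cov / (r_mean * g_mean)    # relative covariance between channels
    var_m = (rel_r + rel_g - 2.0 * rel_rg) / LN2_SQ
    var_a = (rel_r + rel_g + 2.0 * rel_rg) / (4.0 * LN2_SQ)
    return var_m, var_a


# Hypothetical spot whose channels are perfectly correlated (G is R scaled by 0.5):
# the ratio R/G is then constant, so the variance of M should vanish.
var_m, var_a = log_ratio_variances(1000.0, 500.0, 100.0 ** 2, 50.0 ** 2, 100.0 * 50.0)
print(f"correlated channels:   var(M) = {var_m:.6f}, var(A) = {var_a:.6f}")

# Same spot with uncorrelated channels: var(M) is four times var(A).
var_m0, var_a0 = log_ratio_variances(1000.0, 500.0, 100.0 ** 2, 50.0 ** 2, 0.0)
print(f"uncorrelated channels: var(M) = {var_m0:.6f}, var(A) = {var_a0:.6f}")
```

The perfectly correlated case illustrates why the between-channel covariance matters: ignoring it would overstate the uncertainty of the log-ratio whenever the two channels vary together across pixels.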

#### Optimal Selection of the LOWESS Parameters

To simplify the notation, we will denote the estimates for 𝔼(*Mtj*) and 𝔼(*Atj*), independently of the estimation method used, by *Mt* and *At* values, respectively.

It is necessary to remove from these *Mt* values the intensity-dependent dye-specific biases and other systematic errors by using some within-slide normalization method.

In the LOWESS within-slide normalization method, one estimates for each microarray slide a smoothing function µ̂ that maps each observed *At* value to a smoothed *Mt* value, µ̂(*At*). Since µ̂(*At*) is considered an estimate of a dye-dependent bias, it must be subtracted from the corresponding observed *Mt* value to obtain a residual value representing, presumably, the biologically relevant gene expression level.
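The normalization step can be sketched as follows. This is a simplified illustration, not the full LOWESS procedure: it uses a local linear fit with tricube weights over the nearest fraction *f* of the points and omits the robustness iterations. All function names and the simulated slide are hypothetical, and the fit is evaluated only at observed *A* values.

```python
import math


def tricube_linear_weights(x0, xs, f):
    """Weights l_t(x0) of a local linear fit at x0 (x0 is one of the observed xs)."""
    n = len(xs)
    k = max(2, min(n, math.ceil(f * n)))          # neighbourhood size
    dists = [abs(x - x0) for x in xs]
    h = max(sorted(dists)[k - 1], 1e-12)          # distance to the k-th neighbour
    w = [(1.0 - min(d / h, 1.0) ** 3) ** 3 for d in dists]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    det = s0 * s2 - s1 * s1
    if det <= 1e-12 * s0 * s2:                    # degenerate case: local mean
        return [wi / s0 for wi in w]
    return [wi * (s2 - (x - x0) * s1) / det for wi, x in zip(w, xs)]


def lowess_normalize(a_values, m_values, f):
    """Return (normalized M values, fitted dye-bias curve evaluated at each A)."""
    fitted = [
        sum(l * m for l, m in zip(tricube_linear_weights(a0, a_values, f), m_values))
        for a0 in a_values
    ]
    return [m - mu for m, mu in zip(m_values, fitted)], fitted


# Hypothetical slide: a linear intensity-dependent dye bias plus one spot with real signal.
a_values = [6.0 + 0.25 * i for i in range(40)]
bias = [0.8 - 0.1 * a for a in a_values]
m_values = list(bias)
m_values[20] += 2.0
residuals, fitted = lowess_normalize(a_values, m_values, f=0.5)
print(f"normalized M of the spiked spot: {residuals[20]:.3f}")
```

In this example the smooth bias is removed almost entirely, while the isolated spot keeps most of its log-ratio, which is the behavior the subtraction of µ̂(*At*) is meant to achieve.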

An appropriate LOWESS estimation depends on the choice of its parameters. According to Loader (1999), the weight function and the number of iterations of the robustness algorithm are not critical parameters. Cleveland (1979) comments that good choices for these parameters are, respectively, the tricube function and three iterations. However, the degree of the local polynomials and the smoothing parameter *f*, which, in the nearest neighbor method, is a number between 0 and 1 indicating the proportion of data used in each local fit, affect the bias and the variance of the fit.

Specifically, the higher the degree of the local polynomial (related to the complexity of the model), the lower the bias of the fit (probably fitting the data very well). However, the additional parameters of this more complex model increase the variance of the fitted values, yielding a poor generalization ability (i.e., the model will have a large prediction error). Thus, to avoid unstable LOWESS estimates, several references (Loader, 1999; Yang et al., 2001; Dudoit et al., 2002; Smyth and Speed, 2003) recommend using local polynomials of degree one, mainly in the presence of sparsity, as is the case of microarray data.

The effects of the smoothing parameter *f* on the bias and variance of the fit are opposite to those of the degree of the local polynomials. Since the *f* parameter determines the number of observations used in the local polynomial estimation, when the *f* value is large, a simple polynomial may not fit all observations in the neighborhood well, distorting or ignoring essential features. In other words, the estimation of the smoothing function can be significantly biased. On the other hand, when a low *f* value is chosen, the number of observations may be insufficient to capture the general behavior of the data, resulting in a very noisy (large-variance) fitted function.

In the next section, we propose a method for selecting a value for the *f* parameter, focusing on microarray data normalization. Our method takes into account the intrinsic characteristics of the bias and variance of the fit as well as of gene expression data.

#### LOWESS Smoothing Parameter Selection

For microarray data normalization, the ideal LOWESS fitted curve captures only trends and effects from systematic errors, retaining all biological variation. However, it critically depends on the choice of the *f* parameter value.

**Figure 2** illustrates the MA plot of the microarray slide shown in **Figure 1B**, with different LOWESS fits yielded by *f* values varying from 0.05 to 0.9. The improved estimation method was used to obtain the *Mt* and *At* values, that is, the *M̃t* and *Ãt* estimates.

The quality of a LOWESS estimator can be assessed by the MSE, which measures how close the estimator µ̂ is to the true mean function μ:

$$MSE(\hat{\mu}) = \mathbb{E}[(\mu - \hat{\mu})^2].$$

Since the real curve μ is unknown, we need a criterion to evaluate the MSE. Under the assumption of homoskedasticity, Cleveland and Devlin (1988) propose the Mallows' Cp criterion for local fitting, which can be used as an MSE estimator. In the presence of heteroskedasticity, as is usual for microarray data, the heteroskedasticity-robust Cp (HRCp) criterion, proposed by Liu and Okui (2013), may be a more appropriate MSE estimator. We detail these MSE estimators next.

Considering the within-slide data points {(*At*, *Mt*)}, *t* = 1, ..., *T*, the evaluation of the LOWESS smoothing function at any point *A* is given by a linear combination of the observed *Mt* values, whose weights {*lt*(*A*)}, *t* = 1, ..., *T*, are assigned according to the distance of *A* to the observed *At* points:

$$\hat{\mu}(A) = \sum\_{t=1}^{T} l\_t(A)M\_t.$$

Consider the *T* × *T* matrix *L* which maps the observed to the fitted values:

$$\left( \begin{array}{c} \hat{\mu}(A\_1) \\ \vdots \\ \hat{\mu}(A\_T) \end{array} \right) = \mathbf{L}\mathbf{M} = \left( \begin{array}{ccc} l\_1(A\_1) & \cdots & l\_T(A\_1) \\ \vdots & \ddots & \vdots \\ l\_1(A\_T) & \cdots & l\_T(A\_T) \end{array} \right) \left( \begin{array}{c} M\_1 \\ \vdots \\ M\_T \end{array} \right).$$

Two common definitions of the effective degrees of freedom of µ̂ are: (1) *v*<sub>1</sub> = tr(*L*) and (2) *v*<sub>2</sub> = tr(*LL*′), where tr stands for the trace operator.
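To make the two definitions concrete, the sketch below builds the smoother matrix of a simplified local linear fit (tricube weights, nearest-neighbor bandwidth, no robustness iterations; an assumption made for brevity) and computes both traces for several values of *f*. The function names and the grid of *A* values are hypothetical.

```python
import math


def tricube_linear_weights(x0, xs, f):
    """Row of the smoother matrix: weights l_t(x0) of a local linear fit at x0."""
    n = len(xs)
    k = max(2, min(n, math.ceil(f * n)))
    dists = [abs(x - x0) for x in xs]
    h = max(sorted(dists)[k - 1], 1e-12)
    w = [(1.0 - min(d / h, 1.0) ** 3) ** 3 for d in dists]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    det = s0 * s2 - s1 * s1
    if det <= 1e-12 * s0 * s2:  # degenerate neighbourhood: fall back to a local mean
        return [wi / s0 for wi in w]
    return [wi * (s2 - (x - x0) * s1) / det for wi, x in zip(w, xs)]


def smoother_matrix(xs, f):
    """T x T matrix L mapping observed M values to fitted values."""
    return [tricube_linear_weights(x0, xs, f) for x0 in xs]


def effective_df(smoother):
    """Return (tr(L), tr(L L')); the second trace is the sum of squared entries."""
    nu1 = sum(smoother[t][t] for t in range(len(smoother)))
    nu2 = sum(l * l for row in smoother for l in row)
    return nu1, nu2


a_values = [6.0 + 0.3 * i for i in range(30)]  # hypothetical, evenly spaced A values
df_by_f = {f: effective_df(smoother_matrix(a_values, f)) for f in (0.2, 0.5, 0.8)}
for f, (nu1, nu2) in df_by_f.items():
    print(f"f = {f}: tr(L) = {nu1:.2f}, tr(LL') = {nu2:.2f}")
```

Larger values of *f* spread each row of *L* over more points, so both traces shrink, which is the sense in which a large *f* yields a simpler fit.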

Supposing that the variance of *Mt*, across the *T* spots of a microarray slide, is constant and equal to σ<sup>2</sup>, the Mallows' Cp for local fitting is defined as:

$$Cp(\hat{\mu}) = \frac{1}{\sigma^2} \sum\_{t=1}^{T} (M\_t - \hat{\mu}(A\_t))^2 - T + 2\nu\_1.$$

Cleveland et al. (1988) show that σ<sup>2</sup> can be estimated as follows:

$$\hat{\sigma}^2 \doteq \frac{\sum\_{t=1}^T [M\_t - \hat{\mu}(A\_t)]^2}{T + \nu\_2 - 2\nu\_1}.$$
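The two formulas can be combined as in the sketch below. One point is worth making explicit: if the variance is estimated from the very fit being evaluated, Cp reduces algebraically to the second degrees-of-freedom trace and carries no information about bias. The sketch therefore estimates the variance from a separate low-bias fit (small *f*), which is an assumption of this illustration rather than something stated above. All numbers are hypothetical.

```python
def sigma2_estimate(rss, n_spots, nu1, nu2):
    """Variance estimate: residual sum of squares over T + nu2 - 2 * nu1."""
    return rss / (n_spots + nu2 - 2.0 * nu1)


def mallows_cp(rss, sigma2, n_spots, nu1):
    """Mallows' Cp for a local fit with residual sum of squares rss."""
    return rss / sigma2 - n_spots + 2.0 * nu1


# Hypothetical summaries of two fits on the same slide of 1000 spots.
n_spots = 1000
low_bias = {"rss": 95.0, "nu1": 12.0, "nu2": 10.5}    # small f, used to estimate sigma^2
candidate = {"rss": 110.0, "nu1": 4.0, "nu2": 3.6}    # larger f, the fit being assessed

sigma2 = sigma2_estimate(low_bias["rss"], n_spots, low_bias["nu1"], low_bias["nu2"])
cp_low_bias = mallows_cp(low_bias["rss"], sigma2, n_spots, low_bias["nu1"])
cp_candidate = mallows_cp(candidate["rss"], sigma2, n_spots, candidate["nu1"])

print(f"sigma^2 estimate: {sigma2:.4f}")
print(f"Cp of the low-bias fit:  {cp_low_bias:.2f}")   # equals its nu2 by construction
print(f"Cp of the candidate fit: {cp_candidate:.2f}")
```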

When heteroskedasticity is present, Mallows' Cp criterion is not an appropriate MSE estimator. Considering the *T* × *T* diagonal matrix Σ, whose *t*th diagonal element is given by a nonhomogeneous variance σ<sub>*t*</sub><sup>2</sup> of *Mt*, a robust MSE estimation can be achieved by using the HRCp criterion, defined as:

$$HRCp(\hat{\mu}) = \sum\_{t=1}^{T} (M\_t - \hat{\mu}(A\_t))^2 + 2\operatorname{tr}(\Sigma L).$$

According to Loader (1999), σ<sub>*t*</sub><sup>2</sup> can be estimated locally by calculating the error variance (the residual sum of squares divided by the corresponding degrees of freedom) of a nearly unbiased LOWESS fit, which can be obtained using a very small value for the smoothing parameter (e.g., *f* = 0.1). Since the local variance estimates can be very noisy, it may be appropriate to smooth them using a gamma kernel.
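Because Σ is diagonal, the trace in the HRCp penalty reduces to a sum of local variances weighted by the diagonal of the smoother matrix. The sketch below computes the criterion from these ingredients; the local variance estimates are taken as given (their estimation and kernel smoothing are not reproduced here), and all values are hypothetical.

```python
def hrcp(residuals, local_variances, smoother_diagonal):
    """HRCp = residual sum of squares + 2 * tr(Sigma L), with Sigma diagonal."""
    rss = sum(r * r for r in residuals)
    penalty = 2.0 * sum(s2 * l for s2, l in zip(local_variances, smoother_diagonal))
    return rss + penalty


# Hypothetical five-spot example.
residuals = [0.3, -0.1, 0.05, -0.4, 0.2]          # M_t minus the fitted value
smoother_diagonal = [0.6, 0.3, 0.2, 0.3, 0.6]     # diagonal entries of L
local_variances = [0.09, 0.02, 0.01, 0.02, 0.16]  # sigma_t^2, larger at the extremes

value = hrcp(residuals, local_variances, smoother_diagonal)
print(f"HRCp with heteroskedastic variances: {value:.4f}")

# With a constant variance the penalty becomes 2 * sigma^2 * tr(L).
constant = hrcp(residuals, [0.05] * 5, smoother_diagonal)
print(f"HRCp with constant variance 0.05:    {constant:.4f}")
```

With a constant variance, HRCp equals σ<sup>2</sup> times (Cp + *T*), so for a fixed variance both criteria rank candidate fits in the same order; the two differ only when the local variances vary across spots.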

Several authors suggest choosing the *f* value that minimizes a measure of error of the LOWESS fit, such as the MSE criterion (Berger et al., 2004; Futschik and Crompton, 2004a; Lee et al., 2008). However, other authors (Mallows, 1973; Cleveland and Devlin, 1988; Loader, 1999) argue that a selection based only on minimizing the MSE criterion is a poor procedure since it ignores the intrinsic information of the bias and variance of the fit. Therefore, following their suggestion, we propose a method based on a graphical tool called the M-plot, which is a graph of the MSE estimate as a function of the effective degrees of freedom of the fit.

M-plots illustrating the *f* parameter selection method for a typical microarray slide (ID 251485069395\_1.4) are shown in **Figure 3**. Dots show MSE estimates (by the HRCp criterion) and the respective degrees of freedom (by the *v*<sub>2</sub> definition) of LOWESS fits (on the *M̂t* and *Ât* estimates, in the first M-plot, and on the *M̃t* and *Ãt* estimates, in the second M-plot) obtained with the *f* parameter varying from 1 to 0.2. We fixed the other LOWESS parameters (local polynomials of degree one, tricube weight function, and three iterations) so that the M-plot curve shows only the effect of the *f* parameter on the bias–variance compromise. Large *f* values tend to yield simple fits (with fewer degrees of freedom), which have a small variance but a large bias. On the other hand, small *f* values tend to yield complex fits (with many degrees of freedom), which have a small bias but a large variance.

For the microarray slide in **Figure 3**, a selection method based only on the minimization of the MSE curve would choose the smallest evaluated *f* value (0.2). However, any *f* value within the flattening region near the minimum (the region with light-colored dots) is a good choice, in the sense that it yields a low-MSE fit (Cleveland and Devlin, 1988; Loader, 1999). Depending on the type of application, we can choose between a value that yields a low-bias fit (with more degrees of freedom) and one that yields a low-variance fit (with fewer degrees of freedom). Since we want to estimate the behavior of a natural phenomenon, we propose to select from the flattening region the *f* value that yields the simplest LOWESS fit (the one with the fewest effective degrees of freedom). The biggest dot in each M-plot indicates the selected *f* value. The flattening region is detected by searching for points at which the derivative of the MSE curve is small. We check, for each sequence of three points near the minimum, whether the difference between the MSE values is small. If so, these points are considered as belonging to the flattening region.

**Figure 3** (caption fragment) | M-plots for a typical microarray slide (ID 251485069395\_1.4). The flattening region is represented by the light-colored dots and the selected *f* value by the biggest dot. The LOWESS fits were obtained using *f* values ranging from 1 to 0.2 (from lowest to highest degrees of freedom).

The *f* parameter selection method can be summarized in the following discrete and constrained optimization problem. Consider a sequence of *l* different values for *f*, {*f*1, *f*2, ..., *fl*}, and denote by µ̂<sub>*fk*</sub> the LOWESS fit obtained by using the value *fk* for the *f* parameter. Also, let:

$$\begin{aligned} \mathcal{F} &= \{\hat{\mu}\_{f\_k}; f\_k \in \{f\_1, f\_2, \dots, f\_l\}, f\_{k+1} < f\_k, \text{ for } k = 1, \dots, l-1\}; \\ f\_{\text{min}} &= \operatorname\*{arg\,min}\_{f\_k} \text{HRCp } (\hat{\mu}\_{f\_k}), \text{ such that } \hat{\mu}\_{f\_k} \in \mathcal{F}; \\ f\_{\text{max}} &= \operatorname\*{arg\,max}\_{f\_k} \text{HRCp } (\hat{\mu}\_{f\_k}), \text{ such that } \hat{\mu}\_{f\_k} \in \mathcal{F}; \text{ and} \\ \Delta\_{\text{MSE}} &= 0.05(\text{HRCp } (\hat{\mu}\_{f\_{\text{max}}}) - \text{HRCp } (\hat{\mu}\_{f\_{\text{min}}})). \end{aligned}$$

Since the *v*<sub>2</sub> function provides the effective degrees of freedom of a given fit, the selected *f* value is the solution *f*\*, if it exists, of the following problem:

$$\begin{aligned} f^{\star} &= \underset{f\_k}{\text{arg min }} \,\nu\_2(\hat{\mu}\_{f\_k}) \\ \text{subject to:} \\ &\hat{\mu}\_{f\_k} \in \mathcal{F}; \\ &\text{HRCp}\left(\hat{\mu}\_{f\_k}\right) \leq \text{HRCp}\left(\hat{\mu}\_{f\_{\text{min}}}\right) + \Delta\_{\text{MSE}}, \text{ for } k = 1, 2; \\ &\text{HRCp}\left(\hat{\mu}\_{f\_{k-1}}\right) \leq \text{HRCp}\left(\hat{\mu}\_{f\_{\text{min}}}\right) + \Delta\_{\text{MSE}}, \text{ for } k = 3, \dots, l; \\ &\left| \text{HRCp}\left(\hat{\mu}\_{f\_k}\right) - \text{HRCp}\left(\hat{\mu}\_{f\_{k-1}}\right) \right| < \Delta\_{\text{MSE}}, \text{ for } k = 2, \dots, l; \text{ and} \\ &\left| \text{HRCp}\left(\hat{\mu}\_{f\_k}\right) - \text{HRCp}\left(\hat{\mu}\_{f\_{k-2}}\right) \right| < \Delta\_{\text{MSE}}, \text{ for } k = 3, \dots, l. \end{aligned}$$

If the minimum of the M-plot curve is far from the point corresponding to the second lowest MSE estimate, the previous problem has no solution. In that case, the *f* value that yields the fit with the lowest MSE estimate is selected. Specifically, the *f* parameter value is selected by solving the following problem:

$$f^\star \doteq \operatorname\*{arg\,min}\_{f\_k} \text{HRCp } (\hat{\mu}\_{f\_k}) \text{, such that } \hat{\mu}\_{f\_k} \in \mathcal{F}.$$
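The selection rule can be sketched in a few lines. The sketch assumes one particular reading of the constraints: each candidate *fk* is feasible when every constraint whose index range includes *k* holds, with the bound taken as the HRCp at *f*min plus ΔMSE. The function name, the 0-based indexing, and the M-plot values are hypothetical.

```python
def select_f(f_values, hrcp_values, nu2_values, fraction=0.05):
    """Pick the most parsimonious fit within the flattening region of the M-plot.

    f_values must be sorted in decreasing order (f_1 > f_2 > ... > f_l).
    """
    count = len(f_values)
    k_min = min(range(count), key=lambda k: hrcp_values[k])
    h_min, h_max = hrcp_values[k_min], max(hrcp_values)
    delta = fraction * (h_max - h_min)

    feasible = []
    for k in range(count):  # 0-based index; the text's index is k + 1
        bound_index = k if k < 2 else k - 1
        near_minimum = hrcp_values[bound_index] <= h_min + delta
        flat_one = k < 1 or abs(hrcp_values[k] - hrcp_values[k - 1]) < delta
        flat_two = k < 2 or abs(hrcp_values[k] - hrcp_values[k - 2]) < delta
        if near_minimum and flat_one and flat_two:
            feasible.append(k)

    if not feasible:  # no flattening region: fall back to the minimum-HRCp fit
        return f_values[k_min]
    return f_values[min(feasible, key=lambda k: nu2_values[k])]


# Hypothetical M-plot: HRCp drops quickly and then flattens as f decreases.
f_values = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
hrcp_values = [10.0, 8.0, 6.0, 4.0, 2.05, 2.02, 2.01, 2.005, 2.0]
nu2_values = [2.0, 2.5, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 13.0]

selected = select_f(f_values, hrcp_values, nu2_values)
print(f"selected f: {selected}")

# A curve that never flattens triggers the fallback to the minimum.
steep = select_f([1.0, 0.8, 0.6, 0.4, 0.2], [10.0, 8.0, 6.0, 4.0, 2.0],
                 [2.0, 3.0, 4.0, 6.0, 9.0])
print(f"selected f for a steep curve: {steep}")
```

In the first example the minimum of the curve is at *f* = 0.2, but the rule returns a larger value from the flat part of the curve, trading a negligible increase in the MSE estimate for fewer effective degrees of freedom.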

#### APPLICATION ON INTESTINAL METAPLASIA DATA

To investigate the effects of the proposed methods, we preprocessed the data described in the section *Intestinal Metaplasia Database* by using all discussed methods and compared the identified differentially expressed genes.

First, we applied the *normexp* method with an offset value of 50 for removing the background influence. Then, we computed the *Mt* and *At* values both by the conventional estimation methods, defined in Eqs. (5) and (6), and by the proposed estimation methods, defined in Eqs. (3) and (4). The LOWESS within-slide normalization was carried out as discussed in the section *Optimal Selection of the LOWESS Parameters*. For comparison purposes, the *f* smoothing parameter was selected both by the OLIN method (considered by us as a conventional approach) and by the proposed method, discussed in the section *LOWESS Smoothing Parameter Selection*. Since data from all microarray slides present overdispersion, we used the HRCp criterion as the cost function of our selection method.

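Therefore, four preprocessing procedures were applied separately to the original data, one for each combination of estimation method for the *Mt* and *At* values (conventional or proposed) and selection method for the LOWESS *f* parameter (OLIN or proposed).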


**Figure 4** shows the distribution of the optimal values for the LOWESS *f* parameter, according to the proposed selection method with the HRCp criterion, for the entire database, separated by normal and intestinal metaplasia conditions (both hybridized against a pool of normal tissues). In the first plot, the LOWESS curve was fitted on the *M̂t* and *Ât* estimates and, in the second plot, on the *M̃t* and *Ãt* estimates. The average of the selected *f* values was close to 0.5.

As expected from a method that neither takes into account heteroskedasticity of the data nor attempts to strike a good balance between bias and variance, the OLIN method selected the smallest evaluated value (0.2) for most of the slides. The same results were obtained when the *Mt* and *At* values were estimated by the conventional and by the proposed estimators. Such behavior has been reported in the literature, implying that the optimal *f* values according to OLIN are usually close to the default one (Chiogna et al., 2009).

After preprocessing the data, a two-sample t-test assuming unequal variance was performed for each spotted gene to determine whether its expression is statistically different between gastric tissues in normal and intestinal metaplasia groups. However, since we are interested in directly assessing the impact of each proposed method on the t-statistics and p-values rather than making inference about differential expression, the comparative study was performed before applying a multiple testing correction.
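For reference, the quantities compared in the next section (the numerator and the denominator of the unequal-variance t-statistic) can be computed as in the sketch below. It returns the Welch–Satterthwaite degrees of freedom but not the p-value, which requires the Student t distribution and is left to a statistics library. The function name and the expression values are hypothetical.

```python
import math


def welch_t(group_a, group_b):
    """Return (mean difference, standard error, t-statistic, degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    se2_a, se2_b = var_a / n_a, var_b / n_b
    numerator = mean_a - mean_b                 # difference between group means
    denominator = math.sqrt(se2_a + se2_b)      # within-group variability term
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return numerator, denominator, numerator / denominator, dof


# Hypothetical normalized log-ratios of one gene in the two groups.
normal = [1.0, 2.0, 3.0, 4.0, 5.0]
metaplasia = [2.0, 4.0, 6.0, 8.0, 10.0]
numerator, denominator, t_stat, dof = welch_t(normal, metaplasia)
print(f"difference {numerator:.3f}, standard error {denominator:.3f}")
print(f"t = {t_stat:.3f} with {dof:.2f} degrees of freedom")
```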

#### Comparison of the Results

Results of a pairwise comparison among the p-values and t-statistics obtained by the four preprocessing methods are shown in **Figure 5**. In the left-column plots, we compare the p-values and, in the right-column plots, we show the changes in the difference between the group means (the absolute value of the t-statistic numerator) and in the within-group variability (the denominator of the t-statistic). Only genes with p-value less than 5% were considered.

The left-column plots show that most of the points are distributed around the 45-degree line. Thus, the p-values and, consequently, the total number of differentially expressed genes, even at a lower significance level, were similar among the four methods.

The first- and second-row plots show how p-values and t-statistics were affected by estimating the *Mt* and *At* values with the proposed method, which takes into account the pixel-level uncertainties. The genes represented by blue plus signs were identified as differentially expressed only when using the proposed estimator for the *Mt* and *At* values.

The genes represented by green crosses were identified as differentially expressed only when using the conventional estimator for the *Mt* and *At* values.

When the LOWESS *f* parameter is selected by OLIN (first-row plots), it is clear that the within-group variability decreases when using the proposed estimators for the *Mt* and *At* values. When the LOWESS parameter is selected by our method (second-row plots), there is still a reduction in the within-group variability. However, this impact is less clear because of the variability introduced when the LOWESS *f* parameter is selected by our method.

The third- and fourth-row plots compare p-values and t-statistics obtained by OLIN and the proposed approach for selecting the LOWESS *f* parameter. The genes represented by blue plus signs were identified as differentially expressed only when *f* was selected by the proposed method. The genes represented by green crosses were identified as differentially expressed only when selecting *f* by OLIN. It is clear that, for most genes, both within-group variabilities increased, implying that the normalization procedure was more conservative and, thus, that more potentially relevant information was retained. In addition, for many genes, the increase in the within-group variability was counterbalanced by an increase in the distance between the groups. This effect is even more pronounced when the proposed estimators for the *Mt* and *At* values are used. Thus, the respective p-values of these genes decreased enough for them to be considered differentially expressed.

The diagrams in **Figure 6** show a comparison of the methods with respect to the total number of genes with p-value less than 5%. On the left, the p-values were not corrected for multiple tests, while on the right, the p-values were adjusted by the false discovery rate (FDR) correction (Benjamini and Hochberg, 1995).
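The FDR correction mentioned above can be illustrated with a short sketch of the Benjamini and Hochberg (1995) step-up adjustment. This is only an illustration under assumptions: the function name `bh_adjust` and the example p-values are hypothetical, and the code is not the implementation used in the study.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    For the p-value of rank k (1 = smallest) among m tests, the adjusted
    value is min over all ranks j >= k of p_(j) * m / j, capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four spotted genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```

Comparing the set of genes with raw p-values below 5% against the set with adjusted values below 5% corresponds to the left and right diagrams, respectively.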

Note that the four methodologies are quite different in terms of which genes were identified as differentially expressed. As a consequence of the more conservative (milder) noise reduction performed in the LOWESS within-slide normalization procedure with *f* parameter selected by our method, fewer genes are identified as differentially expressed. However, regardless of the normalization method, more genes could be identified as differentially expressed when the *Mt* and *At* values were estimated by the proposed estimation method that incorporates pixel-level variability. Given that both proposed methods make the analysis more robust by incorporating and preserving information neglected by the conventional methods, we can argue that they are contributing to the reduction of both false-positive and false-negative rates.

#### Validation Analysis

To check the consistency of our analysis, we compared our results with those reported in the literature. Out of the genes which are associated with intestinal metaplasia according to the Gene Expression Omnibus platform (Edgar et al., 2002) of the NCBI (National Center for Biotechnology Information) website, 80 spotted genes (corresponding to 63 unique genes) have p-value (before FDR correction) less than 5%, and 35 spotted genes (corresponding to 29 unique genes) have p-value (after FDR correction) less than 5%. These findings are summarized in **Tables 1** and **2**. In addition, **Figure 7** compares the total number of validated genes identified by each method with p-value less than 5% (before FDR correction).

Greater differences in inference were observed among the genes whose p-value is close to the significance level. These genes have a more subtle differential expression, which can be easily damaged by measurement errors and poor estimation and normalization methods. Thus, the more accurate and careful analysis provided by the proposed methods is especially important for making decisions on the differential expression of these more sensitive genes.

*[Figure caption fragment] … column plots compare the difference between the absolute values of the numerators with the difference between the denominators of the t-test statistic.*

Two replicates of the HSPB1 gene could not be identified as differentially expressed when using both the conventional estimators for the *Mt* and *At* values and our selection method for the LOWESS *f* parameter. Thus, the estimation of the *Mt* and *At* values by the proposed estimators was crucial in determining the differential expression of the HSPB1 gene.

The genes PTEN, CTNNB1, MLH1, CXCR4, and CXCR1 could only be identified as differentially expressed when the LOWESS parameter was selected by our proposed method. In particular, the gene CXCR4 was determined to be differentially expressed only when the improved estimators for the *Mt* and *At* values were also used. In contrast, the gene KRT14 was no longer identified as differentially expressed when the LOWESS *f* parameter was selected by our proposed method.

In the following, we briefly describe the association of those genes with intestinal metaplasia of the stomach according to the literature data:


in esophageal adenocarcinoma when compared to normal esophagus (Lv et al., 2019).

### Genes Involved in Cancer

By performing a gene enrichment analysis, we identified, at a significance level of 5% (after FDR correction), 31 differentially expressed genes that are involved in cancer. Their respective p-values and fold changes are shown in **Table 3**. We remark that their association with intestinal metaplasia has not been clearly demonstrated yet. Thus, further investigation has to be done to confirm such conclusions.

Particularly, two replicates of the CCND1 gene and the LAMB2 gene were identified as differentially expressed only by the conventional approaches, suggesting that they may be false positives. Next, we briefly describe their association with cancer:


### DISCUSSIONS

Faced with the growing trend of multi-omics data integration in the midst of a replication crisis, improved microarray data analyses are crucial to identifying more reliable results (Ritchie et al., 2015a).

TABLE 1 | Genes reported in the literature as associated with intestinal metaplasia of the stomach that were identified as differentially expressed in our analysis at a significance level of 5% (after FDR correction).


*Each column shows the p-value (p), the FDR-corrected p-value (adj. p), and the fold change (FC) obtained in a variant of the database. P-values greater than 5% are shown in bold type.*

Ribeiro et al.

TABLE 2 | Other genes reported in the literature as associated with intestinal metaplasia of the stomach that were identified as differentially expressed in our analysis at a significance level of 5% (without FDR correction).


*Each column shows the p-value (p), the FDR-corrected p-value (adj. p), and the fold change (FC) obtained in a variant of the database. P-values greater than 5% are shown in bold type.*


Given that several pixel-level summary statistics are readily available in microarray databases, but are usually discarded in conventional approaches, we propose an improved estimation method for the *Mt* and *At* values, which takes into account the pixel-level variability. Specifically, we applied the multivariate delta method to derive estimators for the expected values of *Mt* and *At*, considering their Taylor expansion up to the second-order terms. The conventional estimators, in contrast, approximate the expected values considering only the zeroth-order term. Since the functions that define *Mt* and *At* are analytic (they are combinations of logarithmic functions through addition or subtraction), the higher the number of terms of the Taylor expansion, the better the approximation of the function. Thus, we expect that the proposed estimators provide a better quantification of the hybridization signal. Also, by using these improved estimators, pixel-level dispersion can play an essential role in the analysis, increasing reliability.

To minimize the propagation of errors, the *Mt* and *At* values have to be properly normalized. Thus, we also propose a method for selecting the LOWESS smoothing parameter *f* that provides an optimal bias–variance compromise, considering some specific characteristics of microarray experiments, such as heteroskedasticity. This optimal normalization method leads to a more parsimonious correction of the systematic biases and, consequently, to greater preservation of the biological variation of interest.

By using the proposed methods, more variability information is considered and retained, improving inferences and preventing false conclusions. Thus, we expect to perform a more conservative analysis, where possibly fewer but more reliable differentially expressed genes are identified. In other words, we expect a reduction in both the false-positive and false-negative error rates.

Besides the theoretical support, relevant empirical observations could be drawn from a comparative study between the methods using real intestinal metaplasia microarray data. The results show that inferences on differential gene expression were moderately affected by the incorporation of the pixel-level variability in the estimation of the *Mt* and *At* values and significantly affected by the LOWESS within-slide normalization using a smoothing parameter selected by the proposed method. Both proposed methods tend to increase the within-group variability (the denominator of the t-statistic). However, for many genes, this increase occurred along with an increase in the difference between the group means (the absolute value of the t-statistic numerator), significantly reducing their respective p-values. Thus, many genes were identified as differentially expressed only when the proposed methods were used, and some of them have been validated by other studies.

It is important to remark that most of the genes reported in the literature as differentially expressed in intestinal metaplasia were validated with a very strong association with the disease. Thus, these genes are probably more robust to different approaches for estimating and normalizing the gene expression levels. On the other hand, genes sensitive to methods that address essential uncertainties in measurements are precisely those plagued with major reproducibility issues. Measurement error is one of the most damaging sources of error and has been neglected in many published analyses, thereby increasing uncertainty in parameter estimates and even inflating the estimates of effect sizes (Loken and Gelman, 2017). Thus, particularly for those sensitive genes, a more robust analysis is needed so that false conclusions are not made.

In this paper, we focused on gene expression from two-color microarray data, but it is possible to use the same ideas to improve estimation and normalization of any fluorescent signal quantified by microarray image analysis. Also, the proposed methods could be adapted for oligonucleotide (one-color) microarray data. Particularly, the cyclic LOWESS normalization method (Bolstad et al., 2003) could be extended by just considering that the *Mt* and *At* values are defined by comparing pairs of arrays instead of pairs of channels and that the LOWESS normalization is applied to all distinct combinations of two arrays. Although not so straightforward, it is also possible to adapt our methods to handle next-generation sequencing (NGS) data. Recently, Law et al. (2014) showed that RNA-Seq counts after log transformation and normalization by sequencing depth (log-counts per million, or log-cpm) can be properly analyzed by methods based on the normal distribution if a precision weight for each observation is taken into account. This result was used to adapt all methods in the limma package (initially developed for microarrays) to also handle RNA-Seq and other sequence count data (Ritchie et al., 2015b). Therefore, considering the current need for accounting for and propagating measurement uncertainties through analyses of NGS data (O'Rawe et al., 2015), a possible future work is to adapt our ideas to improve transcriptome profiling from RNA-Seq data. Specifically, one could investigate whether it is possible to use the delta method for incorporating a measure of the uncertainty of each base call, usually provided by base-calling algorithms, into the log-cpm estimator, leading to a more accurate gene expression quantification from RNA-Seq data.

| Gene | p | adj. p | FC | p | adj. p | FC | p | adj. p | FC | p | adj. p | FC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PIAS3 | 4.53 × 10⁻⁵ | 2.29 × 10⁻³ | −0.55 | 2.93 × 10⁻⁵ | 1.68 × 10⁻³ | −0.57 | 4.81 × 10⁻⁵ | 2.38 × 10⁻³ | −0.55 | 2.85 × 10⁻⁵ | 1.65 × 10⁻³ | −0.57 |
| ITGA2 | 5.24 × 10⁻⁵ | 2.53 × 10⁻³ | 0.48 | 7.52 × 10⁻⁵ | 3.33 × 10⁻³ | 0.47 | 5.88 × 10⁻⁵ | 2.76 × 10⁻³ | 0.48 | 7.43 × 10⁻⁵ | 3.30 × 10⁻³ | 0.47 |
| FZD8 | 6.00 × 10⁻⁵ | 2.83 × 10⁻³ | −0.60 | 5.09 × 10⁻⁵ | 2.51 × 10⁻³ | −0.60 | 6.05 × 10⁻⁵ | 2.81 × 10⁻³ | −0.60 | 4.83 × 10⁻⁵ | 2.42 × 10⁻³ | −0.61 |
| FOXO1 | 1.54 × 10⁻⁴ | 5.65 × 10⁻³ | −0.53 | 1.03 × 10⁻⁴ | 4.25 × 10⁻³ | −0.53 | 1.39 × 10⁻⁴ | 5.24 × 10⁻³ | −0.53 | 1.00 × 10⁻⁴ | 4.16 × 10⁻³ | −0.54 |
| FOXO1 | 2.70 × 10⁻³ | 4.46 × 10⁻² | −0.20 | 2.66 × 10⁻³ | 4.33 × 10⁻² | −0.20 | 2.80 × 10⁻³ | 4.51 × 10⁻² | −0.20 | 2.42 × 10⁻³ | 4.06 × 10⁻² | −0.21 |
| EGLN1 | 1.85 × 10⁻⁴ | 6.42 × 10⁻³ | 0.50 | 4.00 × 10⁻⁴ | 1.16 × 10⁻² | 0.46 | 1.73 × 10⁻⁴ | 6.10 × 10⁻³ | 0.50 | 3.96 × 10⁻⁴ | 1.16 × 10⁻² | 0.46 |
| TGFBR2 | 2.88 × 10⁻⁴ | 9.06 × 10⁻³ | −0.36 | 8.86 × 10⁻⁵ | 3.78 × 10⁻³ | −0.37 | 2.68 × 10⁻⁴ | 8.46 × 10⁻³ | −0.36 | 8.71 × 10⁻⁵ | 3.73 × 10⁻³ | −0.37 |
| WNT3 | 4.16 × 10⁻⁴ | 1.19 × 10⁻² | 0.51 | 4.13 × 10⁻⁴ | 1.19 × 10⁻² | 0.51 | 4.00 × 10⁻⁴ | 1.15 × 10⁻² | 0.51 | 4.22 × 10⁻⁴ | 1.21 × 10⁻² | 0.50 |
| CKS1B | 7.02 × 10⁻⁴ | 1.76 × 10⁻² | −0.29 | 1.91 × 10⁻³ | 3.46 × 10⁻² | −0.27 | 1.04 × 10⁻³ | 2.29 × 10⁻² | −0.27 | 2.01 × 10⁻³ | 3.56 × 10⁻² | −0.27 |
| AXIN2 | 7.63 × 10⁻⁴ | 1.88 × 10⁻² | −0.53 | 8.64 × 10⁻⁴ | 2.02 × 10⁻² | −0.53 | 7.62 × 10⁻⁴ | 1.86 × 10⁻² | −0.53 | 8.52 × 10⁻⁴ | 2.01 × 10⁻² | −0.53 |
| CCND1 | 9.74 × 10⁻⁴ | 2.22 × 10⁻² | −0.55 | 7.00 × 10⁻⁴ | 1.75 × 10⁻² | −0.55 | 9.79 × 10⁻⁴ | 2.21 × 10⁻² | −0.55 | 6.73 × 10⁻⁴ | 1.70 × 10⁻² | −0.56 |
| CCND1 | 3.34 × 10⁻³ | **5.12 × 10⁻²** | −0.76 | 2.81 × 10⁻³ | 4.51 × 10⁻² | −0.77 | 3.45 × 10⁻³ | **5.19 × 10⁻²** | −0.76 | 2.88 × 10⁻³ | 4.58 × 10⁻² | −0.77 |
| CCND1 | 3.49 × 10⁻³ | **5.23 × 10⁻²** | −0.26 | 4.11 × 10⁻³ | **5.80 × 10⁻²** | −0.26 | 3.19 × 10⁻³ | 4.95 × 10⁻² | −0.27 | 3.75 × 10⁻³ | **5.45 × 10⁻²** | −0.26 |
| ITGAV | 1.03 × 10⁻³ | 2.30 × 10⁻² | −0.36 | 1.06 × 10⁻³ | 2.34 × 10⁻² | −0.35 | 9.39 × 10⁻⁴ | 2.15 × 10⁻² | −0.36 | 1.04 × 10⁻³ | 2.29 × 10⁻² | −0.35 |
| CEBPA | 1.50 × 10⁻³ | 2.96 × 10⁻² | 0.63 | 1.79 × 10⁻³ | 3.32 × 10⁻² | 0.60 | 1.36 × 10⁻³ | 2.77 × 10⁻² | 0.63 | 1.76 × 10⁻³ | 3.27 × 10⁻² | 0.60 |
| JUN | 1.60 × 10⁻³ | 3.09 × 10⁻² | −0.58 | 1.57 × 10⁻³ | 3.04 × 10⁻² | −0.54 | 1.94 × 10⁻³ | 3.48 × 10⁻² | −0.56 | 1.56 × 10⁻³ | 3.03 × 10⁻² | −0.54 |
| WNT11 | 2.98 × 10⁻³ | 4.76 × 10⁻² | 0.28 | 2.96 × 10⁻³ | 4.65 × 10⁻² | 0.28 | 3.06 × 10⁻³ | 4.81 × 10⁻² | 0.28 | 2.97 × 10⁻³ | 4.67 × 10⁻² | 0.28 |
| LAMB2 | 5.18 × 10⁻³ | **6.76 × 10⁻²** | −0.52 | 2.58 × 10⁻³ | 4.25 × 10⁻² | −0.49 | 4.42 × 10⁻³ | **6.10 × 10⁻²** | −0.49 | 2.61 × 10⁻³ | 4.28 × 10⁻² | −0.49 |

*Each column shows the p-value (p), the FDR-corrected p-value (adj. p), and the fold change (FC) obtained in a variant of the database. P-values greater than 5% are shown in bold type.*

### DATA AVAILABILITY

The omicsMA R package contains the source code of the proposed methods and part of the metaplasia dataset analyzed in this study. It was implemented using R, version 3.5.1, and depends on the locfit (Loader, 2013), maigesPack (Esteves et al., 2016), and limma (Ritchie et al., 2015b) R packages. The omicsMA R package is available at https://github.com/adele/omicsMA, and the latest release is available at https://github.com/adele/omicsMA/releases/latest.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the international guidelines for investigations involving human beings with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Ethics Institutional Committee of the A.C. Camargo Cancer Center (process number 1023/07).

### REFERENCES


### AUTHOR CONTRIBUTIONS

AR and RH conceived of the presented ideas. AR derived the models, implemented the methods, and analyzed the data. AR wrote the manuscript with support from RH and JS. All authors discussed the results and contributed to the final manuscript. RH and JS supervised the project.

### FUNDING

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001; National Council of Technological and Scientific Development (CNPq); and NAP eScience - PRP - USP. It was also supported by the Foundation for Research Support of the State of São Paulo (FAPESP) [grants 06/03227-2, 2011/50761-2 and 2015/01587-0].

### ACKNOWLEDGMENTS

We are very grateful to Luiz F. L. Reis, Director of Research and Education, Hospital Sírio-Libanês, São Paulo, for making the intestinal metaplasia database available to us, and to Professors Eduardo Jordão Neves and Luís Gustavo Esteves for helping us with data preprocessing and analysis.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Ribeiro, Soler and Hirata. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

### APPENDIX

#### A.1. Estimation of *E*(*Mtj*) and *E*(*Atj*) by the Delta Method

Let $f(R_{tj}, G_{tj})$ be a twice-differentiable function of two random variables, $R_{tj}$ and $G_{tj}$. The second-order Taylor expansion of $f$ at $(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))$ is:

$$\begin{split} f(R_{tj},G_{tj}) \approx{} & f(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj})) + \frac{\partial f}{\partial R_{tj}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\,(R_{tj}-\mathbb{E}(R_{tj})) \\ & + \frac{\partial f}{\partial G_{tj}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\,(G_{tj}-\mathbb{E}(G_{tj})) \\ & + \frac{1}{2}\Big( \frac{\partial^{2}f}{\partial R_{tj}^{2}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\,(R_{tj}-\mathbb{E}(R_{tj}))^{2} \\ & \quad + 2\frac{\partial^{2}f}{\partial R_{tj}\partial G_{tj}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\,(R_{tj}-\mathbb{E}(R_{tj}))(G_{tj}-\mathbb{E}(G_{tj})) \\ & \quad + \frac{\partial^{2}f}{\partial G_{tj}^{2}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\,(G_{tj}-\mathbb{E}(G_{tj}))^{2} \Big). \end{split}$$

An approximation of $\mathbb{E}(f(R_{tj}, G_{tj}))$ can be determined by the expected value of the second-order Taylor expansion of $f$:

$$\begin{split} \mathbb{E}(f(R_{tj}, G_{tj})) \approx{} & \mathbb{E}[f(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))] + \frac{\partial f}{\partial R_{tj}}(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))\,\mathbb{E}(R_{tj} - \mathbb{E}(R_{tj})) \\ & + \frac{\partial f}{\partial G_{tj}}(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))\,\mathbb{E}(G_{tj} - \mathbb{E}(G_{tj})) \\ & + \frac{1}{2} \Big( \frac{\partial^{2} f}{\partial R_{tj}^{2}}(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))\,\mathbb{E}[(R_{tj} - \mathbb{E}(R_{tj}))^{2}] \\ & \quad + 2 \frac{\partial^{2} f}{\partial R_{tj} \partial G_{tj}}(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))\,\mathbb{E}[(R_{tj} - \mathbb{E}(R_{tj}))(G_{tj} - \mathbb{E}(G_{tj}))] \\ & \quad + \frac{\partial^{2} f}{\partial G_{tj}^{2}}(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))\,\mathbb{E}[(G_{tj} - \mathbb{E}(G_{tj}))^{2}] \Big). \end{split}$$

Considering that the first-order terms vanish, because $\mathbb{E}(R_{tj} - \mathbb{E}(R_{tj})) = \mathbb{E}(G_{tj} - \mathbb{E}(G_{tj})) = 0$, and that

$$\begin{aligned} \text{Var}(R_{tj}) &= \mathbb{E}[(R_{tj} - \mathbb{E}(R_{tj}))^{2}], \\ \text{Var}(G_{tj}) &= \mathbb{E}[(G_{tj} - \mathbb{E}(G_{tj}))^{2}], \text{ and} \\ \text{Cov}(R_{tj}, G_{tj}) &= \mathbb{E}[(R_{tj} - \mathbb{E}(R_{tj}))(G_{tj} - \mathbb{E}(G_{tj}))], \end{aligned}$$

the following simplified expression for the expected value of $f(R_{tj}, G_{tj})$ is obtained:

$$\begin{split} \mathbb{E}(f(R_{tj}, G_{tj})) \approx{} & f(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj})) + \frac{1}{2} \Big( \frac{\partial^2 f}{\partial R_{tj}^2}(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))\,\text{Var}(R_{tj}) \\ & + 2 \frac{\partial^2 f}{\partial R_{tj} \partial G_{tj}}(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))\,\text{Cov}(R_{tj}, G_{tj}) \\ & + \frac{\partial^2 f}{\partial G_{tj}^2}(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))\,\text{Var}(G_{tj}) \Big). \end{split}$$

Since

$$M_{tj} = f(R_{tj}, G_{tj}) \doteq \log_2(R_{tj}) - \log_2(G_{tj}),$$

the first and second derivatives of the function that defines $M_{tj}$ are:

$$\begin{aligned} \frac{\partial f}{\partial R_{tj}} &= \frac{1}{R_{tj} \ln(2)}; \qquad \frac{\partial f}{\partial G_{tj}} = -\frac{1}{G_{tj} \ln(2)};\\ \frac{\partial^2 f}{\partial R_{tj}^2} &= -\frac{1}{R_{tj}^2 \ln(2)}; \qquad \frac{\partial^2 f}{\partial G_{tj}^2} = \frac{1}{G_{tj}^2 \ln(2)}; \qquad \frac{\partial^2 f}{\partial R_{tj} \partial G_{tj}} = 0. \end{aligned}$$

Assuming that $\mathbb{E}(R_{tj})$ and $\mathbb{E}(G_{tj})$ are non-zero, an approximation of $\mathbb{E}(M_{tj}) = \mathbb{E}(\log_2(R_{tj}) - \log_2(G_{tj}))$ can be obtained by using its second-order Taylor expansion:

$$\begin{split} \mathbb{E}(M_{tj}) &= \mathbb{E}(\log_{2}(R_{tj}) - \log_{2}(G_{tj})) \\ &\approx \log_{2}(\mathbb{E}(R_{tj})) - \log_{2}(\mathbb{E}(G_{tj})) + \frac{1}{2} \left( -\frac{\text{Var}(R_{tj})}{\ln(2)\,\mathbb{E}^{2}(R_{tj})} + \frac{\text{Var}(G_{tj})}{\ln(2)\,\mathbb{E}^{2}(G_{tj})} \right) \\ &= \log_{2}(\mathbb{E}(R_{tj})) - \log_{2}(\mathbb{E}(G_{tj})) + \frac{1}{2\ln(2)} \left( -\frac{\text{Var}(R_{tj})}{\mathbb{E}^{2}(R_{tj})} + \frac{\text{Var}(G_{tj})}{\mathbb{E}^{2}(G_{tj})} \right). \end{split}$$
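The gain from keeping the second-order term can be checked numerically for a single channel, since $\mathbb{E}(M_{tj})$ is the difference of two such terms. The sketch below is only an illustration under assumptions that are not part of the study: pixel intensities are simulated as normally distributed with a hypothetical mean of 1000 and standard deviation of 100, and the function names are made up for this example.

```python
import math
import random


def log2_mean_zeroth(mu):
    """Zeroth-order approximation of E[log2(X)]: log2(E[X])."""
    return math.log2(mu)


def log2_mean_second(mu, var):
    """Second-order delta-method approximation of E[log2(X)]:
    log2(E[X]) - Var(X) / (2 ln(2) E[X]^2)."""
    return math.log2(mu) - var / (2.0 * math.log(2.0) * mu ** 2)


# Hypothetical pixel intensities of one spot channel (assumed normal).
rng = random.Random(0)
mu, sd = 1000.0, 100.0
draws = [rng.gauss(mu, sd) for _ in range(100_000)]

# Monte Carlo estimate of E[log2(X)] used as the reference value.
mc = sum(math.log2(x) for x in draws) / len(draws)

zeroth = log2_mean_zeroth(mu)
second = log2_mean_second(mu, sd ** 2)
print(f"Monte Carlo: {mc:.5f}  zeroth-order: {zeroth:.5f}  second-order: {second:.5f}")
```

Under these assumptions the second-order value is expected to lie closer to the simulated mean than the zeroth-order value, and the gap grows with the coefficient of variation of the pixel intensities.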

Let the non-zero background-corrected signals be estimators for the expected values of the foreground signals, i.e.,

$$
\hat{\mathbb{E}}(R_{tj}) = \overline{R}_{tc}, \text{ with } \overline{R}_{tc} \neq 0,
$$

$$
\hat{\mathbb{E}}(G_{tj}) = \overline{G}_{tc}, \text{ with } \overline{G}_{tc} \neq 0.
$$

Denote the sample variance estimators, obtained across the pixel intensities within each spot, as $\hat{\sigma}^2(R_t)$ (for the test channel) and $\hat{\sigma}^2(G_t)$ (for the control channel). Also, assume that these estimators do not depend on the background correction. We can derive an estimator for $\mathbb{E}(M_{tj})$ as follows:

$$
\tilde{M}_t \doteq \log_2(\overline{R}_{tc}) - \log_2(\overline{G}_{tc}) + \frac{1}{2\ln(2)} \left( -\frac{\hat{\sigma}^2(R_t)}{\overline{R}_{tc}^2} + \frac{\hat{\sigma}^2(G_t)}{\overline{G}_{tc}^2} \right).
$$

Since

$$A_{tj} = f(R_{tj}, G_{tj}) \doteq \frac{\log_2(R_{tj}) + \log_2(G_{tj})}{2},$$

we can estimate $\mathbb{E}(A_{tj})$ in a similar way to $\mathbb{E}(M_{tj})$. The first and second derivatives of $A_{tj}$ are:

$$\begin{aligned} \frac{\partial f}{\partial R_{tj}} &= \frac{1}{2\ln(2)R_{tj}}; \qquad \frac{\partial f}{\partial G_{tj}} = \frac{1}{2\ln(2)G_{tj}};\\ \frac{\partial^2 f}{\partial R_{tj}^2} &= -\frac{1}{2\ln(2)R_{tj}^2}; \qquad \frac{\partial^2 f}{\partial G_{tj}^2} = -\frac{1}{2\ln(2)G_{tj}^2}; \qquad \frac{\partial^2 f}{\partial R_{tj}\partial G_{tj}} = 0. \end{aligned}$$

An approximation of $\mathbb{E}(A_{tj})$ is obtained by using its second-order Taylor expansion:

$$\begin{split} \mathbb{E}(A_{tj}) &\approx \frac{1}{2} \Big( \log_{2}(\mathbb{E}(R_{tj})) + \log_{2}(\mathbb{E}(G_{tj})) \Big) + \frac{1}{2} \left( -\frac{\text{Var}(R_{tj})}{2 \ln(2)\,\mathbb{E}^{2}(R_{tj})} - \frac{\text{Var}(G_{tj})}{2 \ln(2)\,\mathbb{E}^{2}(G_{tj})} \right) \\ &= \frac{1}{2} \Big( \log_{2}(\mathbb{E}(R_{tj})) + \log_{2}(\mathbb{E}(G_{tj})) \Big) - \frac{1}{4 \ln(2)} \left( \frac{\text{Var}(R_{tj})}{\mathbb{E}^{2}(R_{tj})} + \frac{\text{Var}(G_{tj})}{\mathbb{E}^{2}(G_{tj})} \right). \end{split}$$

Considering the sample estimators of the expected values and variances of $R_{tj}$ and $G_{tj}$, we can derive the following estimator for $\mathbb{E}(A_{tj})$:

$$\tilde{A}_t \doteq \frac{1}{2} \Big( \log_2(\overline{R}_{tc}) + \log_2(\overline{G}_{tc}) \Big) - \frac{1}{4\ln(2)} \left( \frac{\hat{\sigma}^2(R_t)}{\overline{R}_{tc}^2} + \frac{\hat{\sigma}^2(G_t)}{\overline{G}_{tc}^2} \right).$$
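The two estimators derived in this section translate directly into code. The following is a minimal sketch that assumes the background-corrected channel means and the pixel-level sample variances of a spot are already available; the function and argument names are hypothetical and do not reproduce the interface of the omicsMA package.

```python
import math

LN2 = math.log(2.0)


def m_conventional(r_bar, g_bar):
    """Zeroth-order estimator of E(M): log2(R) - log2(G)."""
    return math.log2(r_bar) - math.log2(g_bar)


def m_corrected(r_bar, g_bar, var_r, var_g):
    """Second-order estimator of E(M) including pixel-level variances."""
    correction = (-var_r / r_bar ** 2 + var_g / g_bar ** 2) / (2.0 * LN2)
    return m_conventional(r_bar, g_bar) + correction


def a_conventional(r_bar, g_bar):
    """Zeroth-order estimator of E(A): (log2(R) + log2(G)) / 2."""
    return 0.5 * (math.log2(r_bar) + math.log2(g_bar))


def a_corrected(r_bar, g_bar, var_r, var_g):
    """Second-order estimator of E(A) including pixel-level variances."""
    correction = (var_r / r_bar ** 2 + var_g / g_bar ** 2) / (4.0 * LN2)
    return a_conventional(r_bar, g_bar) - correction


# Hypothetical spot: test channel mean 2000 (pixel SD 400),
# control channel mean 1000 (pixel SD 100).
r_bar, g_bar = 2000.0, 1000.0
var_r, var_g = 400.0 ** 2, 100.0 ** 2

print("M conventional:", m_conventional(r_bar, g_bar))
print("M corrected:   ", m_corrected(r_bar, g_bar, var_r, var_g))
print("A conventional:", a_conventional(r_bar, g_bar))
print("A corrected:   ", a_corrected(r_bar, g_bar, var_r, var_g))
```

When both pixel-level variances are zero, the corrected estimators reduce to the conventional ones, so the correction only matters for spots with noticeable within-spot dispersion.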

#### A.2. Estimation of *Var* (*Mtj*) and *Var* (*Atj*) by the Delta Method

We can derive an estimator for $\text{Var}(f(R_{tj}, G_{tj}))$ by computing the variance of the first-order Taylor expansion of $f(R_{tj}, G_{tj})$ at $(\mathbb{E}(R_{tj}), \mathbb{E}(G_{tj}))$:

$$\begin{split} \text{Var}(f(R_{tj},G_{tj})) \approx{} & \left(\frac{\partial f}{\partial R_{tj}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\right)^{2} \text{Var}(R_{tj}) + \left(\frac{\partial f}{\partial G_{tj}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\right)^{2} \text{Var}(G_{tj}) \\ & + 2\left(\frac{\partial f}{\partial R_{tj}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\right) \left(\frac{\partial f}{\partial G_{tj}}(\mathbb{E}(R_{tj}),\mathbb{E}(G_{tj}))\right) \text{Cov}(R_{tj},G_{tj}). \end{split}$$

The second-order term was not considered because $\text{Var}(R_{tj}^2)$ and $\text{Var}(G_{tj}^2)$ usually cannot be estimated.

Since $M_{tj} = f(R_{tj}, G_{tj}) = \log_2(R_{tj}) - \log_2(G_{tj})$, with the first and second derivatives shown above in this Appendix, we can obtain an approximation of $\text{Var}(M_{tj})$ as follows:

$$\begin{split} \text{Var}(M_{tj}) &\approx \left(\frac{1}{\ln(2)\,\mathbb{E}(R_{tj})}\right)^{2} \text{Var}(R_{tj}) + \left(-\frac{1}{\ln(2)\,\mathbb{E}(G_{tj})}\right)^{2} \text{Var}(G_{tj}) \\ &\quad + 2\left(\frac{1}{\ln(2)\,\mathbb{E}(R_{tj})}\right) \left(-\frac{1}{\ln(2)\,\mathbb{E}(G_{tj})}\right) \text{Cov}(R_{tj}, G_{tj}) \\ &= \frac{1}{\ln^{2}(2)} \left(\frac{\text{Var}(R_{tj})}{\mathbb{E}^{2}(R_{tj})} + \frac{\text{Var}(G_{tj})}{\mathbb{E}^{2}(G_{tj})} - 2\frac{\text{Cov}(R_{tj}, G_{tj})}{\mathbb{E}(R_{tj})\,\mathbb{E}(G_{tj})}\right). \end{split}$$

Consider the sample estimators of the expected values of $R_{tj}$ and $G_{tj}$, denoted by, respectively, $\overline{R}_{tc}$ and $\overline{G}_{tc}$, and assume that they are non-zero. Also, consider their variance and covariance sample estimators, denoted by, respectively, $\hat{\sigma}^2(R_t)$, $\hat{\sigma}^2(G_t)$, and $\hat{\sigma}(R_t, G_t)$, and assume that they are independent of the background correction. We can derive the following estimator for $\text{Var}(M_{tj})$:

$$
\hat{\sigma}^{2}(M_{t}) \doteq \frac{1}{\ln^{2}(2)} \left( \frac{\hat{\sigma}^{2}(R_{t})}{\overline{R}_{tc}^{2}} + \frac{\hat{\sigma}^{2}(G_{t})}{\overline{G}_{tc}^{2}} - 2 \frac{\hat{\sigma}(R_{t}, G_{t})}{\overline{R}_{tc}\,\overline{G}_{tc}} \right).
$$

Considering that $A_{tj}$ is defined by the function

$$f(R_{tj}, G_{tj}) \doteq \frac{\log_2(R_{tj}) + \log_2(G_{tj})}{2},$$

we can estimate $\text{Var}(A_{tj})$ in a similar way to $\text{Var}(M_{tj})$.

By using the first derivatives of $A_{tj}$, which are shown above in this Appendix, we obtain the following approximation of $\text{Var}(A_{tj})$:

$$\begin{split} \text{Var}(A_{tj}) &\approx \left(\frac{1}{2\ln(2)\,\mathbb{E}(R_{tj})}\right)^{2} \text{Var}(R_{tj}) + \left(\frac{1}{2\ln(2)\,\mathbb{E}(G_{tj})}\right)^{2} \text{Var}(G_{tj}) \\ &\quad + 2\left(\frac{1}{2\ln(2)\,\mathbb{E}(R_{tj})}\right) \left(\frac{1}{2\ln(2)\,\mathbb{E}(G_{tj})}\right) \text{Cov}(R_{tj}, G_{tj}) \\ &= \frac{1}{4\ln^{2}(2)} \left(\frac{\text{Var}(R_{tj})}{\mathbb{E}^{2}(R_{tj})} + \frac{\text{Var}(G_{tj})}{\mathbb{E}^{2}(G_{tj})} + 2\frac{\text{Cov}(R_{tj}, G_{tj})}{\mathbb{E}(R_{tj})\,\mathbb{E}(G_{tj})}\right). \end{split}$$

Rewriting the above expression using the sample estimators for the expected value, variance, and covariance of $R_{tj}$ and $G_{tj}$, we derive the following estimator for $\text{Var}(A_{tj})$:

$$
\hat{\sigma}^{2}(A_{t}) \doteq \frac{1}{4\ln^{2}(2)} \left( \frac{\hat{\sigma}^{2}(R_{t})}{\overline{R}_{tc}^{2}} + \frac{\hat{\sigma}^{2}(G_{t})}{\overline{G}_{tc}^{2}} + 2\frac{\hat{\sigma}(R_{t},G_{t})}{\overline{R}_{tc}\,\overline{G}_{tc}} \right).
$$
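The first-order variance estimators for *Mt* and *At* differ only in the sign of the covariance term and in a factor of 4. A minimal sketch, assuming that the channel means, the pixel-level variances, and the pixel-level covariance of a spot are given, is shown below; the names `var_m` and `var_a` and the example numbers are hypothetical.

```python
import math

LN2 = math.log(2.0)


def var_m(r_bar, g_bar, var_r, var_g, cov_rg):
    """First-order delta-method estimator of Var(M)."""
    return (var_r / r_bar ** 2
            + var_g / g_bar ** 2
            - 2.0 * cov_rg / (r_bar * g_bar)) / LN2 ** 2


def var_a(r_bar, g_bar, var_r, var_g, cov_rg):
    """First-order delta-method estimator of Var(A)."""
    return (var_r / r_bar ** 2
            + var_g / g_bar ** 2
            + 2.0 * cov_rg / (r_bar * g_bar)) / (4.0 * LN2 ** 2)


# Hypothetical spot in which the test channel is exactly twice the control
# channel in every pixel (R = 2G): Var(R) = 4 Var(G) and Cov(R, G) = 2 Var(G).
r_bar, g_bar = 2000.0, 1000.0
var_g = 100.0 ** 2
var_r, cov_rg = 4.0 * var_g, 2.0 * var_g

vm = var_m(r_bar, g_bar, var_r, var_g, cov_rg)
va = var_a(r_bar, g_bar, var_r, var_g, cov_rg)
print("Var(M):", vm)
print("Var(A):", va)
```

In this constructed case the log-ratio is constant across pixels, so the estimated variance of *Mt* is zero while the variance of *At* is not, which illustrates how a positive covariance between channels lowers the uncertainty of the log-ratio.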

# The Construction and Comprehensive Analysis of ceRNA Networks and Tumor-Infiltrating Immune Cells in Bone Metastatic Melanoma

#### *Edited by:*

*Liang Cheng, Harbin Medical University, China*

#### *Reviewed by:*

*Chuan-xing Li, Karolinska Institute (KI), Sweden Dapeng Hao, Baylor College of Medicine, United States*

#### *\*Correspondence:*

*Jie Zhang jiezhang@tongji.edu.cn Tong Meng mengtong@medmail.com.cn Zongqiang Huang gzhuangzq@163.com*

*†These authors have contributed equally to this work*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

*Received: 26 June 2019 Accepted: 12 August 2019 Published: 25 September 2019*

#### *Citation:*

*Huang R, Zeng Z, Li G, Song D, Yan P, Yin H, Hu P, Zhu X, Chang R, Zhang X, Zhang J, Meng T and Huang Z (2019) The Construction and Comprehensive Analysis of ceRNA Networks and Tumor-Infiltrating Immune Cells in Bone Metastatic Melanoma. Front. Genet. 10:828. doi: 10.3389/fgene.2019.00828*

*Runzhi Huang1,2,3†, Zhiwei Zeng1†, Guangyu Li1†, Dianwen Song4, Penghui Yan1, Huabin Yin4, Peng Hu1, Xiaolong Zhu1, Ruizhi Chang1, Xu Zhang1, Jie Zhang5\*, Tong Meng2,3,4\* and Zongqiang Huang1\**

*1 Department of Orthopaedics, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China, 2 Division of Spine, Department of Orthopedics, Tongji Hospital affiliated to Tongji University School of Medicine, Shanghai, China, 3 Tongji University School of Medicine, Tongji University, Shanghai, China, 4 Department of Orthopedics, Shanghai General Hospital, School of Medicine, Shanghai Jiaotong University, Shanghai, China, 5 Shanghai East Hospital, Key Laboratory of Arrhythmias, Ministry of Education, Tongji University School of Medicine, Shanghai, China*

Background/Aims: Cutaneous melanoma is a malignant melanocytic tumor and a devastating skin cancer with high rates of recurrence and metastasis. Bone is a common metastatic location, and bone metastasis may result in pathologic fracture, neurologic damage, and severe bone pain. Although metastatic melanoma has been reported to benefit from immunotherapy, the molecular mechanisms and immune microenvironment underlying melanoma bone metastasis, as well as its prognostic factors, are still unknown.

Methods: Gene expression profiling of 112 samples, including 104 primary melanomas and 8 bone metastatic melanomas from The Cancer Genome Atlas database, was assayed to construct a ceRNA network associated with bone metastases. Besides, we detected the fraction of 22 immune cell types in melanoma *via* the algorithm of "cell type identification by estimating relative subsets of RNA transcripts (CIBERSORT)." Based on the significant ceRNAs or immune cells, we constructed nomograms to predict the prognosis of patients with melanoma. Ultimately, correlation analysis was implemented to discover the relationship between the significant ceRNA and immune cells to reveal the potential signaling pathways.

Results: We constructed a ceRNA network based on the interaction among 8 pairs of long noncoding RNA–microRNA and 15 pairs of microRNA–mRNA. CIBERSORT and ceRNA integration analysis discovered that AL118506.1 has both significant prognostic value (*P* = 0.002) and high correlation with T follicular helper cells (*P* = 0.033). Meanwhile, T cells CD8 and macrophages M2 were negatively correlated (*P* < 0.001). Moreover, we constructed two satisfactory nomograms (area under curve of 3-year survival: 0.899; 5-year survival: 0.885; and concordance index: 0.780) with significant ceRNAs or immune cells, to predict the prognosis of patients.


Conclusions: In this study, we suggest that bone metastasis in melanoma might be related to AL118506.1 and its role in regulating thrombospondin 2 and T follicular helper cells. Two nomograms were constructed to predict the prognosis of patients with melanoma and demonstrated their value in improving the personalized management.

Keywords: melanoma, bone metastasis, competing endogenous RNA network, immune cell, nomogram

### INTRODUCTION

Cutaneous melanoma is a malignant melanocytic tumor and is considered the most harmful skin cancer (Cymerman et al., 2016; Lombard et al., 2019). Worldwide, it accounts for about 232,100 (1.7%) of all newly diagnosed primary malignant cancers (excluding nonmelanoma skin cancer) and approximately 55,500 (0.7%) cancer deaths each year (Schadendorf et al., 2018). Its incidence rate is still rising (Schadendorf et al., 2019).

Extensive local resection with clean margins, depending on the Breslow thickness of the tumor tissue, is recommended as the primary treatment for localized disease [The Cancer Genome Atlas (TCGA), 2015]. However, because of the aggressive nature of the tumor, distant metastases often occur even after complete resection. Bone is a common metastatic location, and bone metastasis often results in pathologic fracture, neurologic damage, and severe bone pain, which decrease the quality of life (Braeuer et al., 2014; Bier et al., 2016). In some patients with metastasis, systemic therapies such as targeted therapy and immunotherapy have achieved promising survival outcomes; however, prognosis remains poor in most patients with metastasis (Bostel et al., 2016). Hence, there is an urgent need to explore the molecular mechanism of bone metastasis and to identify prognostic factors for patients with cutaneous melanoma and bone metastasis. The relationship among microRNA (miRNA), long noncoding RNA (lncRNA), and mRNA, known as the ceRNA network, has been explored in many diseases. However, the ceRNA network mechanism underlying melanoma and bone metastasis remains unknown.

In this study, we constructed a ceRNA network based on the gene expression profiling retrieved from the TCGA database to identify the ceRNAs associated with melanoma and bone metastasis. In addition, we applied the "Cell Type Identification by Estimating Relative Subsets of RNA Transcripts" (CIBERSORT) algorithm to detect the immune cells and their proportions in melanoma tumor tissues. Nomograms were then developed to predict the prognosis of melanoma with bone metastasis based on the significant immune cells and ceRNAs. Finally, the relationship between bone metastasis–related immune cells and the ceRNA network was evaluated to identify the underlying signaling pathways.

### MATERIALS AND METHODS

#### Data Collection and Differential Gene Expression Analysis

The Ethics Committee of the First Affiliated Hospital of Zhengzhou University approved this study (no. 2019-KY-107). We downloaded the RNA profiles of the primary melanoma and bone metastasis samples from the TCGA database (https://tcga-data.nci.nih.gov/tcga/). HTSeq-count and fragments per kilobase of exon per million reads mapped (FPKM) profiles of 112 samples, including 104 primary melanomas and 8 melanomas with bone metastasis, were assembled. Demographic and survival information for each patient was also collected. After removing genes that were not expressed in either the experimental group or the control group, the edgeR method was used to identify differentially expressed mRNAs, lncRNAs, and miRNAs. Genes with a false discovery rate (FDR)-adjusted *P* < 0.05 and a log (fold change) > 1.0 or < −1.0 were regarded as up-regulated or down-regulated, respectively.
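The thresholding rule described above can be sketched in a few lines. This is a minimal illustration in Python, not the edgeR workflow used in the study: the function names `benjamini_hochberg` and `classify_genes` and the toy values are hypothetical, and the Benjamini–Hochberg procedure is assumed as the FDR adjustment.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify_genes(log_fc, fdr, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Label each gene 'up', 'down', or 'ns' (not significant)."""
    labels = []
    for lfc, q in zip(log_fc, fdr):
        if q < fdr_cutoff and lfc > lfc_cutoff:
            labels.append("up")
        elif q < fdr_cutoff and lfc < -lfc_cutoff:
            labels.append("down")
        else:
            labels.append("ns")
    return labels


# Toy input (assumed values): raw p-values and log fold changes for four genes.
raw_p = [0.001, 0.01, 0.03, 0.5]
log_fc = [2.3, -1.8, 0.4, -2.0]
fdr = benjamini_hochberg(raw_p)
labels = classify_genes(log_fc, fdr)
print(list(zip(labels, fdr)))
```

In this toy input, the third gene passes the FDR cutoff but not the fold-change cutoff, and the fourth passes the fold-change cutoff but not the FDR cutoff, so both are labeled `ns`.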

#### The Construction of the ceRNA Network

Prior to statistical analysis, the miRNA–mRNA and lncRNA–miRNA interaction data were retrieved from miRTarBase (http://mirtarbase.mbc.nctu.edu.tw/) (Chou et al., 2018) and the LncBase v.2 Experimental Module (http://carolina.imis.athena-innovation.gr/diana\_tools/web/index.php?r=lncbasev2%2Findex-experimental) (Paraskevopoulou et al., 2016), respectively. Afterward, miRNAs that showed significant results for regulating both lncRNAs and mRNAs in hypergeometric testing and correlation analysis were used to establish the ceRNA network in Cytoscape v.3.5.1 (Shannon et al., 2003).
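The text does not spell out how the hypergeometric test was set up. A common formulation, assumed here, asks whether a lncRNA and an mRNA share more targeting miRNAs than expected by chance. The sketch below computes that upper-tail probability with the standard library only; the function name `hypergeom_sf` and the example counts are hypothetical.

```python
from math import comb


def hypergeom_sf(shared, lnc_targets, mrna_targets, total_mirnas):
    """P(X >= shared) for a hypergeometric draw.

    total_mirnas -- number of miRNAs in the background set
    lnc_targets  -- miRNAs that interact with the lncRNA
    mrna_targets -- miRNAs that interact with the mRNA (the "draws")
    shared       -- miRNAs that interact with both
    """
    upper = min(lnc_targets, mrna_targets)
    denom = comb(total_mirnas, mrna_targets)
    tail = sum(
        comb(lnc_targets, k) * comb(total_mirnas - lnc_targets, mrna_targets - k)
        for k in range(shared, upper + 1)
    )
    return tail / denom


# Assumed toy counts: 10 miRNAs in total, 3 bind the lncRNA, 4 bind the mRNA,
# and 2 bind both.
p_shared = hypergeom_sf(shared=2, lnc_targets=3, mrna_targets=4, total_mirnas=10)
print(round(p_shared, 4))
```

A small probability indicates that the overlap is unlikely under random sharing, which is the sense in which a lncRNA–mRNA pair would be called significant under this assumed setup.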

#### Survival Analysis and Nomograms of Key Members in the ceRNA Network

Kaplan–Meier (K-M) survival analysis was performed to show the relationship between survival outcomes in patients with melanoma and the expression level of the biomarkers in the ceRNA network. Afterward, the significant biomarkers were incorporated into a reduced Cox proportional hazards model, obtained by screening the significant variables in the initial Cox models, to identify the variables with prognostic value. In addition, Lasso regression (least absolute shrinkage and selection operator regression), a penalized regression that shrinks coefficients toward zero and sets some of them exactly to zero, was implemented to confirm the fitness of the established multifactor models. Ultimately, a nomogram based on the multivariable models was developed to predict the prognosis of patients with melanoma. According to the expression level of each biomarker with prognostic value, the points of each biomarker are read off and summed to obtain the total points, which correspond to the 3- and 5-year overall survival probabilities. Calibration curves and receiver operating characteristic (ROC) curves were used to evaluate the discrimination and precision of the nomogram.

**Abbreviations:** AUC, area under curve; ceRNA, competitive endogenous RNA; lncRNA, long noncoding RNA; miRNA, microRNA; CIBERSORT, cell type identification by estimating relative subsets of RNA transcripts; TCGA, The Cancer Genome Atlas; FDR, false discovery rate; SD, standard deviation; ROC, receiver operating characteristic; THBS, thrombospondin; Tfh, T follicular helper cells; IL-21, interleukin 21.
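To make the Kaplan–Meier step in the survival analysis concrete, the following is a minimal sketch of the product-limit estimator in plain Python. It is an illustration only, not the R code used in the study; the function name `kaplan_meier` and the follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if death was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        n_at_risk -= leaving
    return curve


# Assumed toy cohort: five patients, two of them censored.
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)
```

Censored patients do not lower the curve themselves, but they shrink the risk set, so later deaths produce larger steps.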

#### CIBERSORT Estimation

CIBERSORT is an analytical tool constructed by Newman et al. (2015) to estimate the abundance and proportion of the different cell types in a mixed cell population using gene expression data. The cell types and their relative quantities in each sample can be conveniently acquired *via* CIBERSORT estimation. In this study, we used the CIBERSORT algorithm to further probe the cellular basis of the molecular mechanisms of the pivotal biomarkers in the ceRNA network. The proportions of 22 immune cell types in the primary melanoma and the melanoma with bone metastasis were estimated by CIBERSORT. Only samples with a CIBERSORT output of *P* < 0.05 were included in further analysis. The Wilcoxon rank-sum test was performed to identify the immune cells whose fractions differed significantly between the primary melanoma and the melanoma with bone metastasis. Then, K-M survival analysis was used to demonstrate the relationship between the overall survival of melanoma patients and the proportion of specific immune cells. After being filtered by Lasso regression, specific immune cells were incorporated into the Cox proportional hazards model. Then, a nomogram was constructed to predict the prognosis of melanoma. The concordance index of the Cox model was applied to assess the discrimination and accuracy of the nomogram. Ultimately, Pearson correlation analysis was carried out to show the relationship between immune cells and biomarkers.
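The concordance index mentioned above can be illustrated with a short, simplified sketch of Harrell's C. This is not the implementation in the R packages listed by the authors: the function name `concordance_index` and the data are hypothetical, and pairs with tied survival times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk score orders correctly.

    A pair (i, j) is comparable when patient i has an observed death and a
    shorter follow-up time than patient j. Tied risk scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Assumed toy data: the third patient is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading a concordance index such as the one reported for the immune cell model.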

#### Online Database Validation

To minimize bias caused by the imbalanced sample size and to obtain a more complete annotation of key biomarkers, multiple online databases, including CellMarker (Zhang et al., 2019), LncRNA2Target (Cheng et al., 2019), Ontogene (Cheng et al., 2016), STRING (Szklarczyk et al., 2019), DincRNA (Cheng et al., 2018), SurvExpress (Aguirre-Gamboa et al., 2013), Cancer Cell Line Encyclopedia (CCLE) (Ghandi et al., 2019), Genotype–Tissue Expression (GTEx) (Consortium, 2015), Oncomine (Elfilali et al., 2006), and Gene Expression Omnibus (GEO) [ID: GSE19234 (Bogunovic et al., 2009), GSE22153 (Jonsson et al., 2010)], were used to detect gene expression levels of key biomarkers at the tissue and cellular levels.

#### Statistical Analysis

A two-sided *P* < 0.05 was considered statistically significant. All statistical analyses were performed with R version 3.5.1 software (Institute for Statistics and Mathematics, Vienna, Austria; www.r-project.org) [packages: GDCRNATools (Li et al., 2018), edgeR, ggplot2, rms, glmnet, preprocessCore, survminer, timeROC].

### RESULTS

#### Identification of Significantly Differentially Expressed Genes

**Figure 1** illustrates the analysis process of this study. The baseline features of all the patients retrieved from the TCGA database are described in **Table S1**. Using a log (fold change) > 1.0 or < −1.0 and FDR < 0.05 as the cutoff, we found 701 differentially expressed protein-coding genes (550 down- and 151 up-regulated), along with 14 differentially expressed lncRNAs (5 down- and 9 up-regulated) and 72 differentially expressed miRNAs (45 down- and 27 up-regulated) between the bone metastatic melanoma and the primary melanoma from the TCGA database (**Figures 2A**–**F**).

#### ceRNA Network Establishment and Survival Analysis

A ceRNA network was established based on the interaction among 8 pairs of lncRNA–miRNA and 15 pairs of miRNA–mRNA (**Figure 3A**) (**Table 1**). Kaplan–Meier survival analysis was implemented to explore the relationship between prognosis and the biomarkers in the ceRNA network related to bone metastasis in melanoma. The results revealed that thrombospondin 2 (THBS2) (*P* = 0.040) and AL118506.1 (*P* = 0.002) were significant (**Figures 3B**, **C**). According to enrichment analysis, the significant genes associated with bone metastasis in melanoma mainly functioned in extracellular matrix organization (**Figure S1**).

#### Construction of the Prediction Model Based on the ceRNA Network

The Lasso regression indicated that four RNAs, hsa-miR-137, hsa-miR-425-5p, VCAN, and AL118506.1, were critical to modeling. They were then incorporated into the Cox regression, and a nomogram for predicting prognosis was constructed accordingly. The areas under curve (AUC) of the 3- and 5-year survival were 0.899 and 0.855, respectively, which reflects satisfactory discrimination. Additionally, the accuracy of the nomogram was supported by the calibration curves (**Figures 4A**–**F**).

FIGURE 2 | ...and down-regulated RNAs, respectively. Of the 701 differentially expressed protein-coding genes, 550 are down-regulated and 151 are up-regulated. Among the 14 differentially expressed lncRNAs, 5 are down-regulated and 9 are up-regulated. Volcano plots of differentially expressed mRNAs (D) and lncRNAs (F). We defined a log (fold change) > 1.0 or < −1.0 and FDR < 0.05 as the cutoff. Thus, the red and blue dots in the plots represent RNAs with significantly high and low expression, respectively, and black dots represent mRNAs and lncRNAs without a statistically significant difference between the primary and the bone-metastatic melanoma.

#### Immune Cells Related to Melanoma

The composition of the immune cells in melanoma, as estimated by the CIBERSORT algorithm, is illustrated in the histogram and the heat map (**Figures 5A**, **B**). The Wilcoxon rank-sum test revealed that the proportion of T follicular helper (Tfh) cells was lower in the melanoma with bone metastasis than in the primary melanoma (*P* = 0.021), whereas the proportion of macrophages M2 was higher in the melanoma with bone metastasis (*P* = 0.036) (**Figure 5C**).
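As an illustration of the comparison above, the sketch below implements a two-sided Wilcoxon rank-sum test with the normal approximation, using the standard library only. It is not the routine used in the study: the function name `rank_sum_test` and the cell fractions are hypothetical, and no tie or continuity correction is applied, so an exact test may be preferable for groups as small as the eight bone metastasis samples.

```python
from math import erfc, sqrt


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (U statistic of group_a, two-sided p-value).
    Tied values receive the average of the ranks they span.
    """
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_a, n_b = len(group_a), len(group_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u_a - mean_u) / sd_u
    p_value = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u_a, p_value


# Assumed toy fractions of one immune cell type in two groups of samples.
metastatic = [0.01, 0.02, 0.03]
primary = [0.05, 0.06, 0.07, 0.08]
u_stat, p_value = rank_sum_test(metastatic, primary)
print(u_stat, round(p_value, 4))
```

Because the test uses ranks rather than raw values, it does not assume that the estimated cell fractions are normally distributed.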

#### Construction of the Prediction Model Based on the Immune Cells

Similarly, 16 of the 22 immune cell types, which showed significant prognostic value in the initial Cox regression model, were integrated into the final multivariable model with satisfactory predictive power (concordance index 0.780) and were used to construct the nomogram (**Figures 6A**, **B**). The calibration curve and the concordance index indicated good agreement of the model (**Figure 6C**). Based on the result of the Kolmogorov–Smirnov test, the fraction of regulatory T cells (Tregs) in stages T1, T2, T3, and T4 differed significantly between patients with and without bone metastasis (**Figure S2**).

#### Comprehensive Analysis of Genes and Immune Cells

Pearson correlation analysis was applied to demonstrate the coexpression patterns among the different immune cell types (**Figure 7A**). Likewise, the correlation (Pearson analysis) between immune cells and biomarkers was analyzed and illustrated (**Figure 7B**). As shown, T cells CD8 and macrophages M2 (*P* < 0.001, *R* = −0.480) (**Figure 7C**), AL118506.1 and Tfh cells (*P* = 0.033, *R* = −0.240) (**Figure 7D**), and Tfh cells and hsa-miR-425-5p (*P* = 0.019, *R* = 0.260) (**Figure S3**) were significantly correlated. Eventually, bone metastasis–specific immune cells and ceRNAs significantly associated with prognosis were integrated into one multivariable model and one nomogram (**Figure S4**), which could predict the prognosis of skin cutaneous melanoma (SKCM) (AUC of 3-year survival: 1.000; AUC of 5-year survival: 1.000). However, the model diagnostic information suggested that the prediction model was biased because of the small sample size.
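For readers who want to see how a correlation coefficient *R* and its test statistic are obtained, here is a minimal sketch in plain Python. The function names `pearson_r` and `t_statistic` and the paired values are hypothetical; the *P* value follows from comparing the statistic with a *t* distribution with *n* − 2 degrees of freedom, which is omitted here because it needs a statistics library.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation (n - 2 df)."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Assumed toy data: an immune cell fraction and an RNA expression level,
# both on arbitrary scales, measured in the same five samples.
cell_fraction = [1, 2, 3, 4, 5]
expression = [2, 1, 4, 3, 5]
r = pearson_r(cell_fraction, expression)
t = t_statistic(r, len(cell_fraction))
print(round(r, 3), round(t, 3))
```

The same formula shows why a modest coefficient can still reach significance: for a fixed *R*, the *t* statistic grows with the square root of the sample size.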

#### Metastasis-Specific ceRNAs and Immune Cells' Surface Markers Coding Genes Showing Significant Results in Multidimensional Validation

To explore the expression of metastasis-specific ceRNAs and of the genes coding for immune cell surface markers in different datasets, a multidimensional validation using multiple online databases was performed.

At the cellular level, BCL6 transcription repressor (BCL6), membrane metalloendopeptidase (MME), C-X-C motif chemokine ligand 13 (CXCL13), inducible T-cell costimulator (ICOS), and programmed cell death 1 (PDCD1) have been reported as surface markers of Tfh cells in CellMarker (**Figure S5**). AL118506.1 is a lncRNA (Ensembl ID: ENSG00000268858). According to the DincRNA, Ontogene, and LncRNA2Target databases, AL118506.1 is antisense to abhydrolase domain containing 16B (ABHD16B, also known as C20orf135), and it can down-regulate the expression level of hsa-miR-27b-3p. However, the function of AL118506.1 remains largely unknown. Thus, AL118506.1, ABHD16B, THBS2, BCL6, MME, CXCL13, ICOS, and PDCD1 were incorporated into further multidimensional validation.

First, **Figure S6** illustrates the protein–protein interaction network of these genes, indicating that there are many interactions between the THBS2 protein and the surface markers of T follicular helper cells. In the CCLE and GTEx, we found that THBS2 was expressed in various SKCM cell lines, whereas the expression of the genes coding for Tfh cell surface markers was low; in normal skin tissue, THBS2 and AL118506.1 were expressed, and the expression of the surface marker coding genes was also low (**Figures S7A**, **S7C**). Meanwhile, significant coexpression relationships between THBS2 and the Tfh cell surface marker coding genes were observed at the tissue level, but not in cancer cell lines (**Figures S7B**, **S7D**). In the meta-analysis of Oncomine, THBS2 (median rank 1,088, *P* < 0.001) (**Figures S8A, B**), ICOS (median rank 1,008, COPA = 1.854) (**Figures S8C, D**), CXCL13 (median rank 536.5, COPA = 30.145) (**Figures S8E, F**), BCL6 (median rank 434.5, COPA = 2.016) (**Figures S8G, H**), MME (median rank 221.0, COPA = 8.940) (**Figures S8I, J**), and PDCD1 (median rank 7,680, *P* = 0.350) (**Figures S8K, L**) were examined; all except PDCD1 showed significant results in multiple melanoma-related studies. Additionally, the reanalysis of GSE19234 (**Figure S9**) and GSE22153 (**Figure S10**) in SurvExpress suggested that these genes have significant predictive value for metastasis (censoring event: metastasis, hazard ratio = 5.19 [95% confidence interval (CI), 1.92–14.05], *P* = 0.001, **Figure S9C**) (censoring event: subcutaneous metastasis, hazard ratio = 4.01 [95% CI, 1.93–8.34], *P* < 0.001, **Figures S10C, D**) and prognosis (censoring event: overall death, hazard ratio = 3.15 [95% CI, 1.71–5.80], *P* < 0.001, **Figure S10B**).

TABLE 1 | Hypergeometric testing and correlation analysis results of the ceRNA network.

*ceRNAs, competing endogenous RNAs; LncRNA, long noncoding RNA; MiRNA, microRNA.*

### DISCUSSION

Malignant melanoma is regarded as one of the most devastating and metastatic diseases, with a drastically increasing incidence rate around the world (Bostel et al., 2016). Metastasis marks the advanced stage of the disease, and its complications, especially those of bone metastasis, often decrease the quality of life. Although the mechanisms of tumorigenesis and metastasis are still unclear for melanoma, molecular and cellular features often change during the process and are often viewed as important predictors (Braeuer et al., 2014; Rodina et al., 2016). We therefore focused on the differentially expressed genes and tumor-infiltrating immune cells in primary melanoma and bone metastasis, which have seldom been the focus of previous studies.

In the current study, we first identified the ceRNAs and tumor-infiltrating immune cells that differed significantly between the primary and metastatic melanoma. Afterward, two nomograms were constructed based on them to predict the outcomes of patients with melanoma. The high AUC values and concordance index of the two nomograms suggest that they may help evaluate bone metastasis and survival outcomes. Finally, according to the results of K-M survival analysis and correlation analysis, we inferred that the ceRNA regulatory mechanism of AL118506.1 (lncRNA), THBS2 (mRNA), hsa-miR-27b-3p (miRNA), and Tfh cells might play a crucial role in bone metastasis of melanoma.

Recently, a myriad of studies have shown that no more than 2% of the whole genome encodes proteins, which suggests that most of the human transcriptome consists of noncoding RNAs (Volders et al., 2013). mRNAs, miRNAs, and lncRNAs are connected through competitive endogenous RNA networks in an intricate crosstalk (Tay et al., 2014). The interaction among miRNA, lncRNA, and mRNA, operating as ceRNA networks, has been extensively explored in many diseases, including lung cancer, gastric cancer, and gallbladder cancer, among others (Kumar et al., 2014; Chen et al., 2018; Chen et al., 2019). However, the ceRNA network mechanism underlying melanoma and bone metastasis remains largely unknown. In our study, we identified that AL118506.1 (lncRNA) could down-regulate hsa-miR-27b-3p and up-regulate THBS2 to promote bone metastasis in patients with melanoma *via* the ceRNA network. hsa-miR-27b-3p has been shown to be essential in malignant transformation, which is in agreement with our present study (Liu et al., 2015).

Thrombospondins (THBSs) have been shown to play important roles in various processes, including angiogenesis, cellular adhesion, extracellular matrix interaction, tumor formation, and metastasis (Roberts, 2008; Liu et al., 2018). Thrombospondin 2, a member of the THBS family, has been reported to regulate antiangiogenic activity and prevent the development of focal adhesions in endothelial cells (Agostini et al., 2012). Moreover, overexpression of THBS2 has been shown to be positively correlated with node metastasis and overall survival in many types of cancer, including colorectal adenocarcinoma, myxoid liposarcoma, prostate cancer, and gastric cancer (Kim et al., 2010; Slavin et al., 2014; Chang et al., 2016; Lin et al., 2016; Nezu et al., 2016; Zhuo et al., 2016; Qian et al., 2017; Wei et al., 2017). The role of THBS2 in melanoma was also investigated in a previous study, in which metastatic uveal melanoma had a higher expression level of THBS2, consistent with our analysis (Liu and Ma, 2018).

FIGURE 4 | (A) The Cox proportional hazards model based on RNAs selected by (B, C) Lasso regression. hsa-miR-137, hsa-miR-425-5p, VCAN, and AL118506.1 are incorporated into the Cox proportional hazards model. (E) Nomogram for predicting patients' outcome based on the RNAs (hsa-miR-137, hsa-miR-425-5p, VCAN, and AL118506.1) in Panel (A). (D) ROC curves and (F) calibration curves for assessing the discrimination and accuracy of the nomogram. AUCs of the 3- and 5-year survival were 0.899 and 0.855, respectively. AUC, area under curve; ROC, receiver operating characteristic.

We also found that the proportions of several immune cell types differed between the primary melanoma and the bone metastatic melanoma tissues. T follicular helper cells and macrophages M2 were shown to be related to bone metastasis. The nomogram, composed of 16 immune cell types, was constructed to predict overall survival and showed clinical utility, with a concordance index of 0.78.

Generally, the CD8+ cytotoxic T cell is considered the main element of active antitumor immunity, and its full function relies greatly on adequate help from CD4+ T cells (Gillgrass et al., 2014). Naive CD4+ T cells can differentiate into different T helper (TH) cells, including TH1, TH2, TH17, Tregs, and Tfh cells (Zhu et al., 2010). The Tfh cell is a subtype of CD4+ T cells defined by its surface phenotype, with the highest expression level of CXCR5 (Vinuesa et al., 2016). Tfh cells have been shown to play an important part in humoral immunity by regulating the formation of the germinal center and the cellular reactions that happen within it (Qi, 2016). Dysregulated Tfh cells have been associated with several autoimmune and/or immunodeficiency diseases, including systemic lupus erythematosus, HIV, and lymphoma (Tangye et al., 2013). A few previous studies revealed ordered lymph node–like structures, mainly formed by Tfh cells, in extensively infiltrated tumors, including breast cancer, lung cancer, and colorectal cancer; Tfh cells were readily detectable in these tumors and function in antitumor immunity with positive clinical outcome (Dieu-Nosjean et al., 2008; deLeeuw et al., 2012). Other human studies also found that Tfh cells have a great capacity to directly assist B cells by releasing interleukin 21 (IL-21), and that IL-21 can further help human antigen-specific cytotoxic T cells to generate and proliferate, which also suggests that Tfh cells have a direct antitumorigenic function (Chen et al., 2016). Thus, patients with fewer Tfh cells may have a decreased immune response against the tumor, and immunosuppression is positively correlated with tumor metastasis (Bidwell et al., 2012). In our study, the proportion of Tfh cells was lower in patients with bone metastatic disease.

FIGURE 6 | (A) Cox proportional hazards model integrated by 16 different types of immune cells. (B) Nomogram for predicting patients' outcome based on the 16 cell types in Panel (A). (C) Calibration curves for evaluating the accuracy of the nomogram. \**P* < 0.05; \*\**P* < 0.001.

FIGURE 7 | (A) Correlation analysis (Pearson analysis) of different tumor-infiltrating cells and (B) the relationships between different tumor-infiltrating cells and differentially expressed genes in tumor tissues of melanoma. Scatterplots further illustrate the relationship between T cells CD8 and macrophages M2 (*P* < 0.001, *R* = −0.480) (C) and between AL118506.1 and T follicular helper cells (*P* = 0.033, *R* = −0.240) (D). Gray-shaded areas in the two graphs represent the standard errors of the blue regression lines. R, correlation coefficient.

Similarly, the importance of a high concentration of CD4+ cells in hindering melanoma metastasis and recurrence has also been reported (He et al., 2017). Antibodies against programmed death 1, which is situated on the surface of CD4+ cells, have been shown to improve the clinical outcomes of patients with melanoma (Yamaguchi et al., 2018). Additionally, the levels of tumor-infiltrating CD8 T cells and macrophages M2 are, to some extent, related to clinical outcomes. Extensive studies of the immune infiltrate in different cancers have established that macrophages M2 can suppress antitumor immunity and promote tumor progression (Gillgrass et al., 2014; Guerriero et al., 2017). The data presented in this study also showed that the proportion of macrophages M2 is higher in samples from patients with bone metastasis. Furthermore, the correlation analysis showed that the level of macrophages M2 was inversely correlated with that of CD8 T cells, and patients with more CD8 cells in tumor tissues had worse outcome, which was highly consistent with a previous study (Gillgrass et al., 2014).

The correlation analysis revealed that Tfh cells were associated with AL118506.1 (*R* = −0.240, *P* = 0.033). Based on the results of correlation analysis and hypergeometric testing of the ceRNA network, AL118506.1 (lncRNA), THBS2 (protein-coding RNA), and hsa-miR-27b-3p (miRNA) were significantly correlated (*P* = 0.007). Therefore, we inferred that the interaction among hsa-miR-27b-3p, AL118506.1, THBS2, and Tfh cells was highly relevant to bone metastasis in patients with melanoma.

Nevertheless, our study has several limitations that should be taken into consideration. First, the quantity of related data available from public datasets is still limited. Acquiring balanced numbers of cases across genders, age groups, and races, among others, to decrease potential error and bias is not achievable under the current circumstances, which limits the comprehensiveness of this study. Second, we did not take into account the heterogeneity of the immune microenvironment associated with the location of immune infiltration. Third, all data series retrieved for the construction of the nomograms were from Western populations. Therefore, the nomograms may not generalize to patients from other countries or to samples tested on platforms other than GPL96 or GPL570. Last but not least, the small sample size of bone metastatic melanoma may reduce the confidence in the predictive models and their transferability to other cohorts. To minimize bias, additional validation based on multiple databases was applied to detect the gene expression levels of key biomarkers at the tissue and cellular levels, showing that the key biomarkers were significantly associated with metastasis and prognosis of SKCM (**Figures S5**–**S10**).

### CONCLUSIONS

Based on ceRNA networks and tumor-infiltrating immune cells, respectively, two nomograms were built in our study to predict survival and metastasis of melanoma patients, and their utility was supported by high concordance index and AUC values. With the clinical information provided by the prediction nomograms, the individual management of melanoma patients could be improved. Furthermore, based on the evidence shown in this study, we speculate that melanoma bone metastasis may depend on the interaction among hsa-miR-27b-3p, AL118506.1, THBS2, and Tfh cells.

### DATA AVAILABILITY

All datasets for this study are included in the TCGA-SKCM program.

### ETHICS STATEMENT

The Ethics Committee of the First Affiliated Hospital of Zhengzhou University approved this study (no. 2019-KY-107).

### AUTHOR CONTRIBUTIONS

Conception/design: RH, ZZ, GL, DS, PY, HY, PH, XiZ, RC, XuZ, TM, JZ, and ZH. Provision of study material: RH, ZZ, GL, TM, JZ, and ZH. Collection and/or assembly of data: RH, ZZ, GL, DS, PY, HY, PH, XiZ, RC, and XuZ. Data analysis and interpretation: RH, ZZ, GL, DS, PY, HY, PH, XiZ, RC, and XuZ. Manuscript writing: RH, ZZ, GL, TM, JZ, and ZH. Final approval of manuscript: RH, ZZ, GL, DS, PY, HY, PH, XiZ, RC, XuZ, TM, JZ, and ZH.

### FUNDING

This study was supported in part by the National Natural Science Foundation of China (grant nos. 81702659, 81772856, and 81501203), the Youth Fund of Shanghai Municipal Health Planning Commission (no. 2017YQ054), and the Henan Medical Science and Technology Research Project (grant no. 201602031).

### ACKNOWLEDGMENTS

We thank the TCGA team of the National Cancer Institute for the use of their data.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00828/full#supplementary-material

TABLE S1 | Baseline information of 112 patients diagnosed with primary melanoma.

FIGURE S1 | The result of enrichment analysis showed that genes in melanoma tissues were significantly associated with extracellular matrix organization.

FIGURE S2 | The results of the Wilcoxon rank-sum test in T regulatory cells (Tregs) of different T stages.

FIGURE S3 | The correlation analysis revealed that T cells follicular helper was positively correlated with hsa-miR-425-5p (*P* = 0.019; *R* = 0.260).

FIGURE S4 | The results of Cox proportional hazards model and the nomogram integrating both biomarkers and immune cell portions significantly associated with prognosis. Bone metastasis–specific immune cells and ceRNAs significantly associated with prognosis were integrated into one multi-variable model and one nomogram (A, E), which could decently predict the prognosis of SKCM (AUC of 3-year survival: 1.000; AUC of 5-year survival: 1.000) (D). However, the model diagnostic information suggested that the prediction model had bias due to the small sample size (A, B, C, F).

FIGURE S5 | Surface markers of T follicular helper cells explored with CellMarker. At the cellular level, BCL6 transcription repressor (BCL6), membrane metalloendopeptidase (MME), C-X-C motif chemokine ligand 13 (CXCL13), inducible T-cell costimulator (ICOS), and programmed cell death 1 (PDCD1) have been reported as surface markers of T follicular helper cells in CellMarker.

FIGURE S6 | Protein–protein interaction network of ABHD16B, THBS2, BCL6, MME, CXCL13, ICOS, PDCD1, indicating that there are many interactions between THBS2 protein and T infertile helper cell's surface markers.

FIGURE S7 | The expression levels and co-expression analysis of AL118506.1, ABHD16B, THBS2, BCL6, MME, CXCL13, ICOS, PDCD1 in various SKCM cell

lines and normal skin tissue in Cancer Cell Line Encyclopedia (CCLE) (A, B) and The Genotype–Tissue Expression (GTEx) database (C, D).

FIGURE S8 | Validation of THBS2 (A, B), ICOS (C, D), CXCL13 (E, F), BCL6 (G, H), MME (I, J), and PDCD1 (K, L) on a transcriptional level in multiple cancer types and multiple studies using the Oncomine database.

FIGURE S9 | The results of reanalysis of GSE19234 in SurvExpress. The reanalysis results of GSE19234 in SurvExpress suggested that these genes have significant predictive value for metastasis (Censoring event: metastasis, Hazard Ratio = 5.19 (95% CI, 1.92–14.05), *P* = 0.001).

FIGURE S10 | The results of reanalysis of GSE22153 in SurvExpress. The reanalysis results of GSE22153 in SurvExpress suggested that these genes have significant predictive value for metastasis (Censoring event: subcutaneous metastasis, Hazard Ratio = 4.01 (95% CI, 1.93–8.34), *P* < 0.001) and prognosis (Censoring event: overall death, Hazard Ratio = 3.15 (95% CI, 1.71–5.80), *P* < 0.001).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Huang, Zeng, Li, Song, Yan, Yin, Hu, Zhu, Chang, Zhang, Zhang, Meng and Huang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Integrate GWAS, eQTL, and mQTL Data to Identify Alzheimer's Disease-Related Genes

#### *Tianyi Zhao1, Yang Hu2, Tianyi Zang1\*, and Yadong Wang1\**

*1 Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China,2 School of Life Science and Technology, Harbin Institute of Technology, Harbin, China*

It is estimated that related genes account for nearly 70% of the risk of Alzheimer's disease (AD). Identifying candidate causal genes can help treatment and diagnosis. As sequencing technology has matured and its cost has fallen, the genome-wide association study (GWAS) has become an important means of finding disease-related mutation sites. However, because of linkage disequilibrium (LD), neither the specific causal SNP nor the gene it regulates can be determined from GWAS alone. Because GWAS is also affected by sample size and by interactions between SNPs, we introduced empirical Bayes (EB) into a meta-analysis of GWAS to reduce the bias caused by sampling and SNP interaction. In addition, most SNPs lie in noncoding regions, so it is not clear how they relate to phenotype. In this paper, expression quantitative trait locus (eQTL) and methylation quantitative trait locus (mQTL) studies are combined with GWAS to find genes whose expression levels are associated with AD through pleiotropy. Summary data-based Mendelian randomization (SMR) is used to integrate the GWAS and eQTL/mQTL data. We prioritized 274 significant SNPs belonging to 20 genes by eQTL analysis and 379 significant SNPs belonging to seven known genes by mQTL analysis. Among them, 93 SNPs and 2 genes overlap. Finally, we carried out 10 case studies to support the effectiveness of our method.

#### Keywords: Alzheimer's disease, Mendelian randomization, GWAS, eQTL, mQTL

## INTRODUCTION

It is estimated that related genes account for nearly 70% of the risk of AD. Importantly, neuronal cell death precedes the appearance of cognitive symptoms by 10 years or more, suggesting that targeted treatment needs to begin before symptoms appear. Therefore, the identification of AD biomarkers such as genes, RNAs (Jiang et al., 2015; Cheng et al., 2018; Cheng et al., 2019), proteins, and metabolites (Cheng et al., 2019) is critical for early detection and early intervention in AD. In addition, identifying candidate genes and loci can also help us understand the pathogenesis of AD and develop drugs.

Recently, Jansen et al. (2019) published their AD GWAS study in *Nature Genetics*. The sample size is more than eight times that of Lambert et al. (2013). Owing to the larger sample, they found nine more AD risk loci than previous studies. Jansen et al. found that most of the AD-related DNA mutations were located in noncoding parts of the genome, in regions that affect gene transcription. This means that combining GWAS data with transcriptional expression data will greatly advance AD research (Cheng et al., 2016).

#### *Edited by:*

*Lei Deng, Central South University, China*

#### *Reviewed by:*

*Rui Guo, Harvard Medical School, United States Eunhee Choi, Harvard Medical School, United States*

#### *\*Correspondence:*

*Tianyi Zang tianyi.zang@hit.edu.cn Yadong Wang ydwang@hit.edu.cn*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> *Received: 22 April 2019 Accepted: 24 September 2019 Published: 25 October 2019*

#### *Citation:*

*Zhao T, Hu Y, Zang T and Wang Y (2019) Integrate GWAS, eQTL, and mQTL Data to Identify Alzheimer's Disease-Related Genes. Front. Genet. 10:1021. doi: 10.3389/fgene.2019.01021*


However, GWAS still has certain limitations. A significant SNP is not necessarily the true pathogenic locus; it may only be in LD with the SNP that actually causes the disease. GWAS usually analyzes the marginal effects of individual loci while ignoring the interaction of multiple genes in complex diseases (Battle et al., 2014). Therefore, GWAS still cannot fully reveal the genetic susceptibility factors of complex diseases (Cheng et al., 2018). It is only one important part of exploring the genetic etiology of complex diseases (Cheng and Hu, 2018). When using GWAS data, we must therefore first combine the SNPs with data on gene expression, which can weaken the impact of LD on significance. Then, the interaction of multiple genes is considered, that is, the statistical value of each SNP is revised within the whole genome.

It was found that about 80% of the genetic susceptibility loci detected by GWAS were located in noncoding regions of the genome, suggesting that the pathogenic loci may regulate gene expression. An important role of large-scale eQTL research is to prioritize SNP loci (Barral et al., 2012) in GWAS susceptibility regions and to infer possible biological mechanisms through the influence of DNA polymorphisms on biological characteristics. At present, many studies have used eQTL analysis as an effective tool to explain the results of GWAS. Hormozdiari et al. (2016) presented a probabilistic method named eCAVIAR, which can detect target genes by colocalization of GWAS and eQTL signals. Xu et al. proposed a more powerful method based on PrediXcan and TWAS, which can integrate a single set or multiple sets of eQTL data with GWAS.

mQTL analysis is mainly based on cis-mQTL. It uses the Beta value of the methylation level of a CpG locus near a gene as the dependent variable, takes all SNP variations in the chromosomal region upstream and downstream of the gene as independent variables, and regresses the methylation level M on each SNP locus S in this region one by one, so as to obtain the SNP loci significantly related to the methylation level of the gene. Methylation is known to affect gene expression. In this respect mQTL is very similar to eQTL: in both, a mutation at a single locus can change expression. Therefore, in recent years, more and more studies have screened trait-related genes by combining mQTL with GWAS. Hägg et al. (2015) integrated GWAS, eQTL, and mQTL to find genes related to obesity. Pharoah et al. (2013) identified three new susceptibility loci for ovarian cancer by GWAS meta-analysis and verified the result by mQTL.
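The per-SNP regression described above can be sketched as an ordinary least-squares fit of methylation on genotype dosage. This is a minimal illustration only: the function name `cis_qtl_regression`, the 0/1/2 genotype coding, and the toy values are assumptions, and real mQTL pipelines typically also adjust for covariates.

```python
import math

def cis_qtl_regression(genotypes, methylation):
    """Regress methylation level M on genotype dosage S (hypothetical helper).

    Returns the slope (effect size) and its standard error from simple
    least squares with an intercept. Covariates are ignored in this sketch.
    """
    n = len(genotypes)
    if n != len(methylation) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_s = sum(genotypes) / n
    mean_m = sum(methylation) / n
    sxx = sum((s - mean_s) ** 2 for s in genotypes)
    if sxx == 0:
        raise ValueError("genotype has no variation")
    sxy = sum((s - mean_s) * (m - mean_m) for s, m in zip(genotypes, methylation))
    slope = sxy / sxx
    intercept = mean_m - slope * mean_s
    # Residual sum of squares, then the usual standard error of the slope.
    rss = sum((m - intercept - slope * s) ** 2 for s, m in zip(genotypes, methylation))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se

# Toy data (assumed): dosage of the alternative allele and CpG Beta values.
slope, se = cis_qtl_regression([0, 0, 1, 1, 2, 2], [0.10, 0.12, 0.20, 0.22, 0.30, 0.32])
```

Repeating this fit for every SNP within the cis window of a CpG site gives the Beta and SE values that summary-level methods consume.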

In our previous paper (Hu et al., 2018), we identified some AD-related genes from GWAS and eQTL data using SMR. Three points remained to be improved. First, mQTL should be included to verify and improve our result. Second, we used several eQTL datasets in that paper, whereas a meta-analysis should be used to integrate the datasets, which can improve the accuracy of the eQTL statistics. Finally, the GWAS datasets should also be integrated into one dataset to overcome the differences in statistical power caused by sample size.

#### METHODS

### SMR

Since Zhu et al. proposed SMR in 2016, it has become a common way to identify genes whose expression levels are associated with a complex trait because of pleiotropy. Using GWAS and eQTL data, SMR can screen trait-related genes. Two years later, they applied SMR to mQTL data and found 7,858 DNAm sites related to 14 complex traits.

The basic idea of this method is as follows. Let y be the phenotype, which is the outcome variable; x the gene expression, which is the exposure; and z the gene mutation, which is the instrumental variable. Then, $b_{xy}$ is the effect of x on y, $b_{zx}$ is the effect of z on x, and $b_{zy}$ is the effect of z on y. $b_{xy}$ is defined as $b_{xy} = b_{zy}/b_{zx}$, which is the effect of gene expression on phenotype free of confounding factors. This idea is based on Mendelian randomization (Cheng et al., 2018; Cheng et al., 2019).
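As a minimal sketch of the ratio above, the snippet below computes $b_{xy}$ from summary statistics. The accompanying test statistic, $T_{SMR} = z_{zy}^2 z_{zx}^2 / (z_{zy}^2 + z_{zx}^2)$ with an approximate chi-square distribution on 1 degree of freedom, is taken from the original SMR publication rather than from this paper, and the function name and input values are assumptions.

```python
import math

def smr_estimate(b_zy, se_zy, b_zx, se_zx):
    """Return (b_xy, t_smr, p_smr) from GWAS (zy) and eQTL/mQTL (zx) statistics.

    Hypothetical helper for illustration; not the SMR software itself.
    """
    b_xy = b_zy / b_zx                      # effect of expression on phenotype
    z_zy = b_zy / se_zy                     # GWAS z-score of the SNP
    z_zx = b_zx / se_zx                     # QTL z-score of the SNP
    t_smr = (z_zy ** 2 * z_zx ** 2) / (z_zy ** 2 + z_zx ** 2)
    # Upper tail of a chi-square with 1 degree of freedom.
    p_smr = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, t_smr, p_smr

# Toy summary statistics (assumed values).
b_xy, t_smr, p_smr = smr_estimate(b_zy=0.05, se_zy=0.01, b_zx=0.5, se_zx=0.05)
```

A probe would then be declared significant when `p_smr` falls below the multiple-testing threshold chosen for the analysis.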

**Figure 1** is a hypothetical model of a mediation mechanism tested in SMR. The blue lines represent causal relationships. A SNP can change methylation. Both the SNP and methylation can affect transcription. The change in transcription causes the difference in the trait. The red lines denote the relationships that each type of data represents. mQTL denotes the relationship between methylation and SNP. eQTL denotes the relationship between transcription and SNP. GWAS denotes the relationship between SNP and trait.

Based on this hypothesis, many researchers have found genes related to particular traits. Potentially related genes have been found by SMR for traits and diseases such as bone mineral density (BMD) (Meng et al., 2018), amyotrophic lateral sclerosis (ALS) (Du et al., 2017), and neuroticism (Fan et al., 2017). Other traits such as height, BMI (Yengo et al., 2018), and obesity (Liu et al., 2018) have also been studied with SMR.

#### Eb-GWAS

Due to the complex linkage effects and statistical errors of the samples, the contribution of GWAS to biological research is reduced. GWAS may associate common diseases with thousands of DNA mutations, that is, every DNA region that happens to be active in diseased tissues may be associated with disease (Jiang et al., 2013). Many GWAS hits are not specifically biologically related to disease and, therefore, cannot be used as effective drug targets. In fact, these "peripheral" mutations are likely to affect the activity of "core" genes, which are more directly related to disease, through complex biochemical regulatory networks (Jiang et al., 2010).

As discussed in the Introduction, the interaction of multiple genes is considered, that is, the statistical value of each SNP is revised within the whole genome. In this section, we process GWAS data in two steps: (1) meta-analysis and (2) using EB to revise the statistical value of each SNP within the whole genome.

#### Meta-Analysis

Since SE denotes the standard error of each SNP's effect estimate, it represents the reliability of the Beta value. The weight of each Beta should therefore be:

$$w_i = 1 / SE_i^2 \tag{1}$$

$SE_i$ denotes the standard error for study i, and $w_i$ denotes the weight of its Beta.

Then, the Beta after meta-analysis would be:

$$\beta = \sum_{i} \beta_i w_i / \sum_{i} w_i \tag{2}$$

$\beta_i$ denotes the effect size estimate for study i.

Then, we could use the weights to calculate the standard error of the meta-analysis result.

$$SE = \sqrt{1 / \sum_{i} w_i} \tag{3}$$

Finally, the overall Z-score could be obtained by the original equation.

$$Z = \beta / SE \tag{4}$$
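Equations (1) to (4) amount to a fixed-effect, inverse-variance-weighted meta-analysis for one SNP. The sketch below follows them directly; the function name and the two-study toy input are assumptions made for illustration.

```python
import math

def inverse_variance_meta(betas, ses):
    """Combine per-study effect sizes for one SNP using Eqs. (1)-(4)."""
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / se ** 2 for se in ses]                            # Eq. (1)
    beta = sum(b * w for b, w in zip(betas, weights)) / sum(weights)   # Eq. (2)
    se = math.sqrt(1.0 / sum(weights))                                 # Eq. (3)
    z = beta / se                                                      # Eq. (4)
    return beta, se, z

# Two hypothetical studies reporting the same SNP.
beta, se, z = inverse_variance_meta([0.10, 0.20], [0.05, 0.10])
```

The more precise study (smaller SE) dominates the pooled estimate, which is why the combined Beta of 0.12 sits closer to 0.10 than to 0.20.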

#### Eb-GWAS

After meta-analysis, we could summarize several GWAS datasets into one dataset. Then, we used EB to integrate all the Z scores at the whole-genome level. Because SNPs can interact with each other, we assume that the Z scores of all SNPs are related and follow a common normal distribution.

The overall Z score obtained above follows a normal distribution with standard deviation 1, centered on the real Z score. Then,

$$
\hat{Z}_i \mid Z_i \sim N(Z_i, 1) \tag{5}
$$

$\hat{Z}_i$ denotes the Z score we obtained. It is a value with bias. $Z_i$ denotes the real Z score.

Real Z score obeys normal distribution:

$$Z \stackrel{ind}{\sim} N(\theta, \sigma^2) \tag{6}$$

Then, the marginal distribution of $\hat{Z}_i$ is

$$
\hat{Z} \stackrel{ind}{\sim} N(\theta, \sigma^2 + 1) \tag{7}
$$

Moreover, the posterior distribution should be:

$$Z_i \mid \hat{Z}_i \stackrel{ind}{\sim} N(\theta + B(\hat{Z}_i - \theta), B) \tag{8}$$

$$B = \frac{\sigma^2}{1 + \sigma^2} \tag{9}$$

Then, we know that $E(\hat{Z}_i) = \theta$, so the mean of the $\hat{Z}_i$ can be used to estimate θ.

$$
\hat{\theta} = mean(\hat{Z}_i) = \overline{\hat{Z}} \tag{10}
$$

$$\frac{\sum_{i=1}^{N} (\hat{Z}_i - \overline{\hat{Z}})^2}{\sigma^2 + 1} = \frac{S}{\sigma^2 + 1} \sim \chi^2(N - 1) \tag{11}$$

Then,

$$\frac{\sigma^2 + 1}{S} \sim \text{inverse} - \chi^2 (N - 1) \tag{12}$$

From the properties of inverse chi-square distribution,

$$E\left(\frac{\sigma^2 + 1}{S}\right) = \frac{1}{N - 3} \tag{13}$$

Then,

$$E(\frac{N-3}{S}) = \frac{1}{\sigma^2 + 1} = 1 - B \tag{14}$$

Therefore, the EB estimate of B is

$$\hat{B} = 1 - \frac{N-3}{S} \tag{15}$$

Finally, we can substitute (10) and (15) into the posterior mean in (8):

$$Z_i = \overline{\hat{Z}} + \left(1 - \frac{N-3}{S}\right)(\hat{Z}_i - \overline{\hat{Z}}) \tag{16}$$

This completes the meta-analysis and the genome-wide revision of the statistical value of each SNP.
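The shrinkage step can be sketched as follows: estimate θ by the mean of the observed Z scores, compute S as the sum of squared deviations, and pull every observed score toward the mean by the factor 1 − (N − 3)/S. The function name and toy scores are assumptions. The sketch applies the formula as written; truncating the factor at zero when S is small is a common refinement that is not part of the derivation here.

```python
def eb_shrink(z_obs):
    """Empirical Bayes revision of observed Z scores (hypothetical helper)."""
    n = len(z_obs)
    if n < 4:
        raise ValueError("need at least 4 Z scores so that N - 3 is positive")
    z_bar = sum(z_obs) / n                       # estimate of theta
    s = sum((z - z_bar) ** 2 for z in z_obs)     # S, sum of squared deviations
    if s == 0:
        raise ValueError("all Z scores are identical; S is zero")
    b_hat = 1.0 - (n - 3) / s                    # estimated shrinkage factor B
    # Posterior-mean estimate of each real Z score.
    return [z_bar + b_hat * (z - z_bar) for z in z_obs]

# Toy observed Z scores (assumed values).
revised = eb_shrink([-2.0, -1.0, 0.0, 1.0, 2.0, 6.0])
```

In this toy input the mean is 1 and S is 40, so every score keeps 92.5% of its distance from the mean; the largest score moves from 6 to 5.625.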

#### Dataset

As shown in **Table 1**, we obtained five GWAS datasets, three eQTL datasets, and three mQTL datasets. All the eQTL and mQTL data are from brain tissue. Yang Jian et al. have already meta-analyzed the eQTL and mQTL datasets. Therefore, we used the data they processed.

For the GWAS datasets, Scelsi M A et al. obtained data from 1,517 Caucasian ADNI subjects. Lambert JC et al.'s dataset consisted of 17,008 Alzheimer's disease cases and 37,154 controls. Marioni R E et al. obtained data from 314,278 participants.

For the eQTL datasets, SNPs within 1 Mb of each probe are available in all three datasets. After meta-analysis, the estimated effective sample size is n = 1194.

For the mQTL datasets, SNPs within 5 kb, 500 kb, and 20 kb of each probe are available in the three datasets, respectively. After meta-analysis, the estimated effective sample size is n = 1160.

#### RESULTS

#### Results of GWAS Meta-Analysis

We performed a meta-analysis of the five GWAS datasets and integrated them into one GWAS file.

The blue block in **Figure 2** is the P value density of GWAS after meta-analysis. The red block in **Figure 2** is the P value density of GWAS after EB. As **Figure 2** shows, the P value distribution after meta-analysis is approximately uniform. After EB is applied to all SNPs in the whole dataset, the P value distribution of the final GWAS data is approximately normal.

#### Results of SMR

The GWAS data included 1,474,846 SNPs, the mQTL data included 6,966,746 SNPs, and the eQTL data included 1,067,443 SNPs. In total, 149,326 SNPs occur in both GWAS and eQTL, and 408,896 SNPs occur in both GWAS and mQTL. Therefore, we used SMR to test these shared SNPs.

Note that some SNPs are tagged by multiple probes, so one SNP may be significant for more than one gene; that is, one SNP may affect the expression of multiple genes.

In **Figures 3** and **4**, we can see that the GWAS P values of SNPs are not related to their eQTL and mQTL P values. This means that only a few SNPs that are significant in GWAS are also significant in eQTL and mQTL. Nevertheless, the points near the upper right corner of the plots indicate that the difference in expression level caused by these SNPs is related to AD, and SMR can help us detect these SNPs.

We set the threshold to 0.05/(number of probes). For eQTL data, the threshold is 0.05/8362 = 5.98e-06. For mQTL data, the threshold is 0.05/97263 = 5.14e-07. The numbers of SNPs and genes identified by the two experiments are shown in **Table 2**.
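The two thresholds are Bonferroni corrections of a 0.05 significance level by the number of probes. The short sketch below reproduces the arithmetic; the function and variable names are illustrative assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

eqtl_threshold = bonferroni_threshold(0.05, 8362)    # eQTL probes
mqtl_threshold = bonferroni_threshold(0.05, 97263)   # mQTL probes

print(f"{eqtl_threshold:.2e}")  # 5.98e-06
print(f"{mqtl_threshold:.2e}")  # 5.14e-07
```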

FIGURE 3 | Duplicated SNPs' P value in genome-wide association study (GWAS) and eQTL.

**Figure 5** shows the P values of all SNPs. The red points are the P values of GWAS SNPs, the blue points are the P values of eQTL SNPs, and the green points are the P values of mQTL SNPs. The black line in the first panel is the significance threshold of the P value, $-\log_{10}(5 \times 10^{-8})$. The eQTL and mQTL SNPs have already been screened, so each of their P values is less than $5 \times 10^{-8}$.

FIGURE 2 | P value density of genome-wide association study (GWAS).

TABLE 2 | The results of summary data-based Mendelian randomization (SMR).


**Figure 6** shows the results of SMR on the two datasets. The first graph is the result for GWAS and eQTL, and the second is the result for GWAS and mQTL. The black line in each graph is the corresponding significance threshold. As we can see, only a few SNPs pass the SMR test. Some of them are not very significant in GWAS alone but become significant when combined with eQTL or mQTL.

As we can see in **Table 3**, HLA-DQA1 and HLA-DRB5 are selected in both the eQTL and mQTL datasets. The HLA complex is located in region 21.31 on the short arm of chromosome 6 (6p21.31) and is composed of 3.6 million base pairs. It is the region with the highest gene density and the most polymorphic region of the human chromosomes, and it is known as the "chemical fingerprint of humans". Because of the complexity of HLA, its methylation and expression levels differ greatly.

#### Case Study

In this section, we check whether the 25 AD-related genes we found have been reported by others. To be rigorous, we only use literature that identified AD-related genes by biological experiments rather than by bioinformatics or GWAS methods.

Zhu et al. (2017) found that four CR1 SNPs showed significant associations with Aβ deposition at baseline.

James et al. (2018) measured the volumes of total gray matter, cerebrocortical gray matter, and subcortical gray matter in 71 cognitively healthy women by structural magnetic resonance imaging (sMRI) and found that the protective effect of DRB1\*13:02 is related to successful elimination of specific pathogens that would ultimately cause gradual brain atrophy.

Yu et al. (2015) found, in 447 participants, that BIN1 was associated with Aβ load and that brain DNA methylation in HLA-DRB5 was associated with pathological AD.

Lee et al. (2018) studied non-Hispanic Caucasians with neuroimaging and found that HLA-DQB1 is significantly associated with entorhinal cortical thickness after controlling for multiple testing.

Yoshino et al. (2016) found that SNCA mRNA expression in 50 AD subjects was significantly higher than that in control subjects. They therefore inferred that mRNA expression and methylation of SNCA intron 1 are altered in AD, whereas ZSCAN21, upstream of these CpG sites, was reported to bind at intron 1.

Rathore et al. (2018) noted that both TREM2 and PILRB function as activating receptors and signal through DAP12. A reduction of PILRA inhibitory signals in R78 carriers could allow more microglial activation via PILRB/DAP12 signaling and reinforce the cellular mechanisms by which TREM2 is believed to protect from AD incidence.

Ruggiero et al. (2017) did biological experiments on mice and found that MTCH2 is a critical player in neuronal cell biology, controlling mitochondria metabolism, motility, and calcium buffering to regulate hippocampal-dependent cognitive functions.

De Jager et al. (2014) used a collection of 708 prospectively collected autopsied brains to assess the methylation state of the brain's DNA in relation to AD and found two SNPs associated with POLR2E are related to AD in methylation levels.

Roses et al. (2010) identified the polymorphic poly-T variant rs10524523 in the translocase of outer mitochondrial membrane 40 (TOMM40) gene, which can be used to estimate the age of onset of late-onset AD (LOAD) in APOE ɛ3 carriers.

Prendecki et al. (2018) recruited 230 individuals and found that APOC1 and TOMM40 rs2075650 polymorphisms may be independent risk factors of developing AD, whose major variants are accompanied by disruption of biothiols metabolism and inefficient removal of DNA oxidation.

We found that 10 of the 25 genes have been reported to be related to AD by biological experiments. Other studies may have found that the remaining 15 genes are related to AD via other methods, but we do not discuss them in this paper. This case study supports the effectiveness of our method, and we hope the other 15 genes can be verified by biological experiments in the future.

#### CONCLUSION

AD places a great burden on patients and society, and identifying AD-related genes can help us understand the mechanism of AD and thereby support diagnosis and treatment. In this paper, we used SMR to find AD-related genes from GWAS, eQTL, and mQTL data. There are some overlaps between GWAS and the other two datasets, which means that some SNPs are related to AD through changes in expression level. SMR is a method that can identify genes whose expression levels are associated with a complex trait because of pleiotropy.

Because of LD and the interaction between genes, GWAS data are biased. To overcome this, we performed a meta-analysis of five GWAS datasets and then used EB to revise the Z score of each SNP at the whole-genome level.

Finally, we found that 653 SNPs reached the significance threshold, and they are associated with 25 genes. Ninety-three of these SNPs are significant in both the GWAS and eQTL test and the GWAS and mQTL test. We carried out 10 case studies, which show that 10 of the 25 genes we identified have been verified by biological experiments in the existing literature to be correlated with AD.

### DATA DEPOSITION

#### eQTL and mQTL Data

The direct link for accessing the eQTL and mQTL data is as follows (originally from PMID: 29891976).


#### GWAS Dataset 1,2,3

GWAS datasets 1, 2, and 3 are from the paper with PMID: 29860282. The direct link for accessing them is as follows.


#### GWAS Data 4

GWAS data 4 is from PMID: 24162737. The direct link for accessing it is as follows:

http://web.pasteur-lille.fr/en/recherche/u744/igap/igap\_download.php

#### GWAS Data 5

GWAS data 5 is from PMID: 29777097. The direct link for accessing it is as follows:

http://datashare.is.ed.ac.uk/download/DS\_10283\_3364.zip

### REFERENCES


#### All code can be downloaded from

https://github.com/zty2009/Integrate-GWAS-eQTL-andmQTL-data-to-identify-Alzheimer-s-Disease-related-genes

#### AUTHOR CONTRIBUTIONS

TZang and YW are the corresponding authors. They helped to revise the paper and provided data support. TZhao and YH are the co-first authors. They wrote the code and the paper.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (No: 61571152 and 61502125), the National High-tech R&D Program of China (863 Program) [Nos: 2014AA021505, 2015AA020101, 2015AA020108], the National Science and Technology Major Project [Nos: 2013ZX03005012 and 2016YFC1202302], the Heilongjiang Postdoctoral Fund (Grant No. LBH-Z15179), and the China Postdoctoral Science Foundation (Grant No. 2016M590291).


pathways influencing Alzheimer's disease risk. *Nat. Genet.* 51, 404–413. doi: 10.1038/s41588-018-0311-9


ligand binding and confers protection to Alzheimer's disease. *PLoS Genet.* 14 (11), e1007427. doi: 10.1371/journal.pgen.1007427


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Zhao, Hu, Zang and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# A New Algorithm for Identifying Genome Rearrangements in the Mammalian Evolution

*Juan Wang1, Bo Cui1, Yulan Zhao1 and Maozu Guo2,3\**

*1 School of Computer Science, Inner Mongolia University, Hohhot, China, 2 School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China, 3 Beijing University of Civil Engineering and Architecture, Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China*

Genome rearrangements are evolutionary events at the level of genomes. Analyzing genome rearrangements gives a global view of the evolution of species. We introduce a new method called RGRPT (recovering the genome rearrangements based on phylogenetic tree) to identify genome rearrangements. We test RGRPT using simulated data. The experimental results show that RGRPT has high sensitivity and specificity compared with other tools when predicting rearrangement events. We use RGRPT to predict the rearrangement events of six mammalian genomes (human, chimpanzee, rhesus macaque, mouse, rat, and dog). RGRPT recognized a total of 1,157 rearrangement events for them at 10 kb resolution, including 858 reversals, 16 translocations, 249 transpositions, and 34 fusions/fissions. RGRPT recognized 475 rearrangement events for them at 50 kb resolution, including 332 reversals, 13 translocations, 94 transpositions, and 36 fusions/fissions. The source code of RGRPT is available from https://github.com/wangjuanimu/data-of-genome-rearrangement.

#### *Edited by:*

*Lei Deng, Central South University, China*

#### *Reviewed by:*

*Yungang Xu, University of Texas Health Science Center at Houston, United States Zhen Tian, Zhengzhou University, China Wei Lan, Guangxi University, China*

> *\*Correspondence: Maozu Guo guomaozu@bucea.edu.cn*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> *Received: 02 July 2019 Accepted: 24 September 2019 Published: 29 October 2019*

#### *Citation:*

*Wang J, Cui B, Zhao Y and Guo M (2019) A New Algorithm for Identifying Genome Rearrangements in the Mammalian Evolution. Front. Genet. 10:1020. doi: 10.3389/fgene.2019.01020*

Keywords: genome rearrangements, mammal, phylogenetic tree, evolution, algorithm

## INTRODUCTION

The rapid development of sequencing technologies makes phylogenetic analysis at the level of the whole genome possible. A studied genome is represented as a line of conserved segments (called syntenic blocks). The genome rearrangements of species are changes in the ordering of syntenic blocks and losses of sequence blocks. These events include reversal, translocation, transposition, fusion, fission, and so on (Xu et al., 2017; Cheng et al., 2019; Dong et al., 2018). Research on genome rearrangements mainly covers three aspects.

The first is the computation of the evolutionary distance between two species by considering genome rearrangements. Researchers have proposed many metrics for measuring evolutionary dissimilarity between species and a large number of algorithms for computing them. The breakpoint distance is the minimum number of rearrangement operations transforming one genome into the other, which is computed by means of the breakpoint graph (Blanchette et al., 1997; Sankoff and Blanchette, 1998). There are many algorithms for computing the breakpoint distance. In 1995, Hannenhalli and Pevzner put forward an algorithm with $O(n^5)$ time complexity to compute the breakpoint distance considering only reversal events (Hannenhalli and Pevzner, 1999). Later, Kaplan et al. improved the time complexity to $O(n^2)$ (Kaplan et al., 2000). In 1996, Hannenhalli designed an algorithm with $O(n^3)$ time complexity to compute it by considering translocation events (Hannenhalli, 1995). In 2001, Zhu et al. improved the time complexity to $O(n^2 \log n)$ (Zhu and Ma, 2002). Zhu et al. then devised an algorithm with $O(n^2)$ time complexity (Liu et al., 2004). The DCJ distance was introduced by Yancopoulos et al. (Sophia et al., 2005); it uses the double cut and join (DCJ for short) operation to model rearrangement events such as reversal, translocation, transposition, fusion, and fission in a unified way. Yancopoulos et al. first proposed a method to compute the DCJ distance by considering only translocations and reversals on linear chromosomes (Sophia et al., 2005). Lu et al. (2006) proposed an $O(n^2)$ time algorithm to compute the distance by considering fusions and fissions between circular unsigned chromosomes. Unimog (Hilker et al., 2012) is software for computing the DCJ distance that implements many algorithms (Erdös et al., 2011; Jakub et al., 2011). SoRT is a tool to compute the breakpoint distance and the DCJ distance for linear/circular multi-chromosomal gene orders (Yen-Lin et al., 2010). The SCJ distance (Feijão and Meidanis, 2011) is defined using single cut and join (SCJ for short) operations, in analogy to the DCJ measure. This distance can be computed quickly.

The second is the reconstruction of ancestral gene orders from the genomes of extant species. Ma et al. (2006) used the maximum parsimony principle to reliably recover ancestral genomes starting from a phylogenetic tree and the adjacent genes in each genome, and carried out a probabilistic analysis of reconstruction accuracy for six genomes (human, mouse, rat, dog, opossum, and chicken) based on an improved Jukes–Cantor model. PMAG uses Bayes' theorem in a probabilistic framework to infer ancestral genomes (Yang et al., 2014). Multiple Genome Rearrangements (MGR) recovers the ancestral genome by minimizing the rearrangement distance (Bourque and Pevzner, 2002). Multiple Genome Rearrangements and Ancestors (MGRA) was developed to reconstruct ancestral genomes based on multiple breakpoint graphs and was used to analyze the rearrangement events of seven mammalian genomes (human, chimpanzee, macaque, mouse, rat, dog, and opossum) (Alekseyev and Pevzner, 2009). Decostar (Duchemin et al., 2017) is software that reconstructs the neighborhood relations of ancestral genes, aiming to reconstruct the organization of ancestral genomes.

The third is the recognition of the rearrangement events of existing species. Efficient Method to Recover Ancestral Events (EMRAE) is an algorithm that recognizes the rearrangement events in an evolution described by a phylogenetic tree by means of adjacent genes in genomes (Zhao and Bourque, 2009).

#### MATERIALS AND METHODS

#### Preliminaries

A genome is composed of several chromosomes, and each chromosome is an ordering of syntenic blocks. For convenience, each syntenic block is recorded by an integer, so a chromosome is represented by a signed permutation $X = c_1 c_2 \cdots c_n$, where each $c_i$ ($1 \le i \le n$) is an integer representing a syntenic block and its sign gives the orientation, either positive (recorded by $c_i$) or negative (recorded by $-c_i$). The chromosome $X = c_1 c_2 \cdots c_n$ is the same as $-X = -c_n\, {-c_{n-1}} \cdots {-c_1}$.
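As a minimal sketch of this representation, a chromosome can be stored as a list of signed integers, and two lists describe the same chromosome when one is the reversed, sign-flipped copy of the other. The function names below are illustrative assumptions.

```python
def flip(chrom):
    """Return -X: the blocks in reverse order with every sign negated."""
    return [-c for c in reversed(chrom)]

def same_chromosome(x, y):
    """True when x and y describe the same chromosome (x == y or x == -y)."""
    return x == y or x == flip(y)

x = [1, -2, 3]
print(flip(x))                           # [-3, 2, -1]
print(same_chromosome(x, [-3, 2, -1]))   # True
print(same_chromosome(x, [3, -2, 1]))    # False
```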

A reversal $r(i, j)$ $(i \le j)$ converts chromosome $X = c_1 c_2 \cdots c_n$ into a new chromosome $X' = c_1 c_2 \cdots c_{i-1}\ {-c_j}\ {-c_{j-1}} \cdots {-c_{i+1}}\ {-c_i}\ c_{j+1} \cdots c_n$, where the reversed segment runs from $c_i$ to $c_j$.
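As an illustration, the reversal can be sketched on a chromosome stored as a Python list of signed integers. The function names `reversal` and `mirror` are hypothetical and are not taken from EMRAE or RGRPT; positions are 1-based and inclusive, as in the definition above.

```python
def reversal(chromosome, i, j):
    """Apply the reversal r(i, j) with 1-based, inclusive positions."""
    if not 1 <= i <= j <= len(chromosome):
        raise ValueError("expected 1 <= i <= j <= n")
    # Reverse the order of c_i..c_j and flip the sign of every block in it.
    flipped = [-c for c in reversed(chromosome[i - 1:j])]
    return chromosome[:i - 1] + flipped + chromosome[j:]


def mirror(chromosome):
    """Return -X, which denotes the same chromosome as X."""
    return [-c for c in reversed(chromosome)]


x = [1, 2, 3, 4]
print(reversal(x, 2, 3))  # [1, -3, -2, 4]
print(mirror(x))          # [-4, -3, -2, -1]
```

A reversal of a single block (`i == j`) only flips its sign.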

A translocation event breaks two chromosomes into four segments and then reconnects them into two new chromosomes. Given two chromosomes $X = X_1 X_2$ and $Y = Y_1 Y_2$, where $X_1 = x_1 x_2 \cdots x_{i-1}$, $X_2 = x_i x_{i+1} \cdots x_m$, $Y_1 = y_1 y_2 \cdots y_{j-1}$, and $Y_2 = y_j y_{j+1} \cdots y_n$, a translocation is represented by $tl(i, j)$. Either $X_1$ and $Y_1$ are exchanged to form two new chromosomes $X' = Y_1 X_2$ and $Y' = X_1 Y_2$, or $X_1$ and $Y_2$ are exchanged to form two new chromosomes $X'' = {-Y_2}\, X_2$ and $Y'' = X_1\, {-Y_1}$.
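The two outcomes of a translocation can be sketched in the same list representation. The function name and the `flipped` switch are assumptions made for this illustration only; `flipped=False` exchanges the two prefixes, and `flipped=True` produces the second form, in which the exchanged segments are reversed and negated.

```python
def translocation(x, y, i, j, flipped=False):
    """Apply tl(i, j) to chromosomes x and y (1-based cut positions).

    x is cut into X1 = x[:i-1], X2 = x[i-1:];
    y is cut into Y1 = y[:j-1], Y2 = y[j-1:].
    """
    x1, x2 = x[:i - 1], x[i - 1:]
    y1, y2 = y[:j - 1], y[j - 1:]
    if not flipped:
        # X' = Y1 X2 and Y' = X1 Y2
        return y1 + x2, x1 + y2

    def neg(segment):
        return [-c for c in reversed(segment)]

    # X'' = -Y2 X2 and Y'' = X1 -Y1
    return neg(y2) + x2, x1 + neg(y1)


x, y = [1, 2, 3, 4], [5, 6, 7]
print(translocation(x, y, 3, 2))                # ([5, 3, 4], [1, 2, 6, 7])
print(translocation(x, y, 3, 2, flipped=True))  # ([-7, -6, 3, 4], [1, 2, -5])
```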

A transposition event exchanges two adjacent fragments on one chromosome, producing a new chromosome. A transposition is represented by $tp(i, j, k)$, i.e., the fragment $c_i \cdots c_j$ of one chromosome is inserted after $c_k$. If $c_k$ is on the same chromosome ($k > j$ or $k < i$), then the transposition $tp(i, j, k)$ is called intra-chromosomal; otherwise, it is inter-chromosomal. Given a chromosome $X = c_1 c_2 \cdots c_i c_{i+1} \cdots c_{j-1} c_j \cdots c_k \cdots c_n$ and an intra-chromosomal transposition with $k > j$, $X$ is converted into $X' = c_1 c_2 \cdots c_{i-1} c_{j+1} \cdots c_k c_i c_{i+1} \cdots c_j c_{k+1} \cdots c_n$.
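A minimal sketch of the intra-chromosomal case follows, again on a list of signed integers. The function name is hypothetical, and the handling of $k < i$ is an assumption derived from the description "inserted after $c_k$".

```python
def transposition(chromosome, i, j, k):
    """Apply an intra-chromosomal tp(i, j, k): move c_i..c_j to just after c_k."""
    n = len(chromosome)
    if not (1 <= i <= j <= n and 1 <= k <= n):
        raise ValueError("positions must lie within the chromosome")
    fragment = chromosome[i - 1:j]
    if k > j:
        # c_1..c_{i-1}  c_{j+1}..c_k  c_i..c_j  c_{k+1}..c_n
        return chromosome[:i - 1] + chromosome[j:k] + fragment + chromosome[k:]
    if k < i:
        # c_1..c_k  c_i..c_j  c_{k+1}..c_{i-1}  c_{j+1}..c_n
        return chromosome[:k] + fragment + chromosome[k:i - 1] + chromosome[j:]
    raise ValueError("c_k must lie outside the moved fragment")


x = [1, 2, 3, 4, 5, 6]
print(transposition(x, 2, 3, 5))  # [1, 4, 5, 2, 3, 6]
print(transposition(x, 4, 5, 1))  # [1, 4, 5, 2, 3, 6]
```

Both calls give the same chromosome, which reflects the view of a transposition as an exchange of two adjacent fragments.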

A fusion event connects two chromosomes into a new chromosome. The fusion acting on chromosomes $X_1$ and $X_2$ is represented by $fu(X_1, X_2)$ and forms a new chromosome $X_1 X_2$ or $X_1\, {-X_2}$. A fission splits a chromosome into two new chromosomes. A fission acting on the chromosome $X = X_1 X_2$ is represented by $fi(X)$ and forms two new chromosomes $X_1$ and $X_2$ (where $X_1$ and $X_2$ are non-empty segments).

An adjacency $a(c_i, c_{i+1})$ of genome $X$ is a pair of adjacent integers in one chromosome of $X$. $a(c_i, c_{i+1})$ is the same as $a(-c_{i+1}, -c_i)$. For example, all adjacencies on chromosome $X = 1\ 2\ 3\ 4$ are $a(1, 2)$, $a(2, 3)$, and $a(3, 4)$. For a set of genomes $S$, an adjacency $a$ is effective w.r.t. $S$ if it belongs to at least one genome but not to all genomes. For example, given two uni-chromosomal genomes $G_1$ and $G_2$ with chromosome $X = 1\ 2\ 3\ 4$ of $G_1$ and chromosome $Y = 1\ {-3}\ {-2}\ 4$ of $G_2$, the effective adjacencies w.r.t. $G_1$ and $G_2$ are $a(1, 2)$, $a(3, 4)$, $a(1, -3)$, and $a(-2, 4)$. The adjacency $a(2, 3)$ is the same as $a(-3, -2)$, so it belongs to both genomes and is not effective.
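The definitions above can be sketched as follows. This is an illustrative reading, not the EMRAE implementation: the helper names are hypothetical, and the choice of canonical representative (the reading whose first block has the smaller absolute value) is an arbitrary convention adopted here. Because $a(2, 3)$ and $a(-3, -2)$ denote the same adjacency, the sketch treats that adjacency as shared by both example genomes and does not report it as effective.

```python
def canonical(a, b):
    """a(a, b) equals a(-b, -a); keep the reading whose first block is smaller in absolute value."""
    return (a, b) if abs(a) < abs(b) else (-b, -a)


def adjacencies(genome):
    """genome: list of chromosomes, each a list of signed integers."""
    return {canonical(c[t], c[t + 1]) for c in genome for t in range(len(c) - 1)}


def effective_adjacencies(genomes):
    """Adjacencies found in at least one genome but not in all genomes."""
    sets = [adjacencies(g) for g in genomes]
    return set().union(*sets) - set.intersection(*sets)


g1 = [[1, 2, 3, 4]]
g2 = [[1, -3, -2, 4]]
print(sorted(adjacencies(g1)))                  # [(1, 2), (2, 3), (3, 4)]
print(sorted(effective_adjacencies([g1, g2])))  # [(-2, 4), (1, -3), (1, 2), (3, 4)]
```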

#### EMRAE

Given a phylogenetic tree $T$ describing the evolution of the genomes $G$, EMRAE first computes all effective adjacencies w.r.t. $G$. Then, it predicts the rearrangement events for each edge of $T$ by means of the inference rules introduced below.

**Figure 1** shows a reversal $r(2, 3)$ during the evolution from $A$ to $B$, where $A$ and $B$ are two uni-chromosomal genomes with chromosomes $X = 1\ 2\ 3\ 4$ and $Y = 1\ {-3}\ {-2}\ 4$, respectively. Removing the edge $e$ from $T$ divides the set of genomes into two subsets, $S_A$ and $S_B$. Suppose that no rearrangement events occur inside $S_A$ and $S_B$. Then, adjacencies $a(1, 2)$ and $a(3, 4)$ can be found in every genome of $S_A$ and in no genome of $S_B$; $a(1, -3)$ and $a(-2, 4)$ can be found in every genome of $S_B$ and in no genome of $S_A$. In turn, we can use the four adjacencies $a(1, 2)$, $a(3, 4)$, $a(1, -3)$, and $a(-2, 4)$ to identify a reversal $r(2, 3)$ occurring on the edge $e$. The EMRAE method infers the rearrangement events by means of similar rules.

Let $e = (A, B)$ be an edge of $T$, $G = \{G_1, G_2, \cdots, G_m\}$ the genomes of the leaves, $a_1, a_2, \cdots, a_i$ the children of $A$, and $b_1, b_2, \cdots, b_j$ the children of $B$. EMRAE first selects a number of adjacencies as candidate adjacencies $Ca(e, A)$ for edge $e$ and node $A$ according to the following steps.

	- a. Find the adjacencies that are in one genome of each $S_{u_i}$ $(1 \le i \le k)$ and not in any genome of $S_B$, then put them into $Ca(e, A)$;
	- b. Compute $Ca(e_i, u_i)$ and $Ca(e_i, u)$ $(1 \le i \le k)$. For each $Ca(e_i, u_i)$, find an adjacency $a_1$ in $Ca(e_i, u_i)$ such that $a_1$ shares no gene with any adjacency in $Ca(e_i, u)$, $a_1$ shares a gene with one adjacency $a_2$ in each $Ca(e_j, u_j)$ $(1 \le j \ne i \le k)$, and $a_2$ shares a gene with at least one adjacency in $Ca(e_j, u)$; then put $a_1$ into $Ca(e, u)$.

EMRAE then infers rearrangements from $Ca(e, A)$ and $Ca(e, B)$ for edge $e = (A, B)$ with the help of the inference rules in the following section. From the definitions of genome rearrangements, each genome rearrangement changes several adjacencies. For example, each reversal $r(i, j)$ $(i \le j)$ changes two adjacencies $a_1 = a(c_{i-1}, c_i)$ and $a_2 = a(c_j, c_{j+1})$ into $b_1 = a(c_{i-1}, -c_j)$ and $b_2 = a(-c_i, c_{j+1})$. Based on these facts, we obtain the inference rules introduced in the following section.

#### Inference Rule

Let $e = (A, B)$ be an edge of the phylogenetic tree $T$. Given adjacencies $a_1 = a(c_{i-1}, c_i)$, $a_2 = a(c_j, c_{j+1})$ in $Ca(e, A)$ and $b_1 = a(c_{i-1}, -c_j)$, $b_2 = a(-c_i, c_{j+1})$ in $Ca(e, B)$, EMRAE infers a reversal $r(i, j)$ from $A$ to $B$ if all genomes are uni-chromosomal, or if $a_1$ and $a_2$ are on the same chromosome in $S_A$ and $b_1$ and $b_2$ are on the same chromosome in $S_B$. Otherwise, it infers a translocation $tl(i, j)$. Similarly, given adjacencies $a_1 = a(c_{i-1}, c_i)$, $a_2 = a(c_j, c_{j+1})$ in $Ca(e, A)$ and $b_1 = a(c_{i+1}, c_{j+1})$, $b_2 = a(c_j, c_i)$ in $Ca(e, B)$, EMRAE infers a translocation $tl(i, j)$, or a reversal for $a_1$, $a_2$ in $Ca(e, A)$ and adjacencies $b_1$, $b_2$ in $Ca(e, B)$.
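The reversal rule can be sketched as a pattern match over two candidate sets. This is a simplified illustration under stated assumptions, not EMRAE itself: it covers only the uni-chromosomal case (no same-chromosome check), the function names are hypothetical, and adjacencies are stored as tuples in a canonical reading so that $a(x, y)$ and $a(-y, -x)$ compare equal.

```python
from itertools import permutations


def canonical(a, b):
    """a(a, b) equals a(-b, -a); keep the reading whose first block is smaller in absolute value."""
    return (a, b) if abs(a) < abs(b) else (-b, -a)


def readings(adjacency):
    """Both equivalent readings of one adjacency."""
    a, b = adjacency
    return [(a, b), (-b, -a)]


def infer_reversals(ca_a, ca_b):
    """Find a1 = a(p, ci), a2 = a(cj, q) in Ca(e, A) whose images
    b1 = a(p, -cj), b2 = a(-ci, q) are both in Ca(e, B)."""
    ca_a = {canonical(*adj) for adj in ca_a}
    ca_b = {canonical(*adj) for adj in ca_b}
    found = set()
    for a1, a2 in permutations(ca_a, 2):
        for p, ci in readings(a1):
            for cj, q in readings(a2):
                b1, b2 = canonical(p, -cj), canonical(-ci, q)
                if b1 in ca_b and b2 in ca_b:
                    # Unordered pairs, so mirrored readings collapse to one hit.
                    found.add((frozenset((a1, a2)), frozenset((b1, b2))))
    return found


# Candidate adjacencies from the Figure 1 example (1 2 3 4 -> 1 -3 -2 4).
hits = infer_reversals({(1, 2), (3, 4)}, {(1, -3), (-2, 4)})
print(len(hits))  # 1
```

A match only says that the four adjacencies are consistent with a reversal; in the multi-chromosomal case the rule above additionally requires the same-chromosome conditions before choosing between a reversal and a translocation.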

Assume that there are adjacencies $a_1 = a(c_{i-1}, c_i)$, $a_2 = a(c_j, c_{j+1})$, and $a_3 = a(c_k, c_{k+1})$ in $Ca(e, A)$ and $b_1 = a(c_{i-1}, c_{j+1})$, $b_2 = a(c_k, c_i)$, and $b_3 = a(c_j, c_{k+1})$ in $Ca(e, B)$. EMRAE can predict a transposition $tp(i, j, k)$ during the evolution from $A$ to $B$ if all genomes are uni-chromosomal. Otherwise, suppose $m$ genomes in $S_A$ have $a_1$ and $a_2$; then EMRAE can predict a transposition $tp(i, j, k)$ if there are at least $m/2$ genomes in which the four integers of $a_1$ and $a_2$ are on the same chromosome, or at least $m/2$ genomes in which the four integers of $a_2$ and $a_3$ are on the same chromosome.

Assume that there is $a = a(c_i, c_j)$ in $Ca(e, A)$. EMRAE can predict a fission that splits the adjacency $a = a(c_i, c_j)$ if $a$ is sign-compatible for each genome $G_k$ in $S_B$. A fusion from $A$ to $B$ can be seen as a fission from $B$ to $A$.

#### Recovering the Genome Rearrangements Based on Phylogenetic Tree

EMRAE cannot identify rearrangements occurring at the frontier (the ends) of chromosomes. Take **Figure 2** as an example, where species $A$, $B$, and $C$ are uni-chromosomal genomes $A = 1\ 2\ 3\ 4$, $B = {-2}\ {-1}\ 3\ 4$, and $C = 1\ 2\ 3\ 4$. A reversal $r(1, 2)$ has occurred in the evolution from $A$ to $B$. EMRAE computes the candidate adjacency $a(-1, 3)$ for $Ca(e_1, B)$ and $a(2, 3)$ for $Ca(e_1, A)$. With only one candidate adjacency on each side, EMRAE cannot infer the reversal $r(1, 2)$ on the edge $e_1$.

We improve EMRAE so that the improved method (called RGRPT) is able to infer the rearrangement events occurring in the frontier region. The inference rules of RGRPT are the same as those of EMRAE. The difference between RGRPT and EMRAE lies in their candidate adjacencies. RGRPT adds 0 to the head and tail of each chromosome, which adds adjacencies to each genome. For example, for the uni-chromosomal genomes $X = 1\ 2\ 3\ 4$ and $Y = {-2}\ {-1}\ 3\ 4$, the two additional candidate adjacencies $a(0, 1)$ and $a(0, -2)$ are added.
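The effect of padding each chromosome with 0 can be checked with a small sketch. It compares only two genomes directly, which is a simplification of the tree-based candidate selection, and the function names and the `pad` switch are hypothetical.

```python
def canonical(a, b):
    """a(a, b) equals a(-b, -a); keep the reading whose first block is smaller in absolute value."""
    return (a, b) if abs(a) < abs(b) else (-b, -a)


def adjacencies(chromosome, pad=False):
    """Adjacencies of one chromosome; pad=True adds 0 at the head and tail."""
    blocks = [0] + list(chromosome) + [0] if pad else list(chromosome)
    return {canonical(a, b) for a, b in zip(blocks, blocks[1:])}


def private_adjacencies(chrom_a, chrom_b, pad=False):
    """Adjacencies found only in A and only in B, as sorted lists."""
    adj_a, adj_b = adjacencies(chrom_a, pad), adjacencies(chrom_b, pad)
    return sorted(adj_a - adj_b), sorted(adj_b - adj_a)


x = [1, 2, 3, 4]
y = [-2, -1, 3, 4]
print(private_adjacencies(x, y))            # ([(2, 3)], [(-1, 3)])
print(private_adjacencies(x, y, pad=True))  # ([(0, 1), (2, 3)], [(-1, 3), (0, -2)])
```

Without padding each side has a single distinguishing adjacency, so the two-adjacency reversal pattern cannot match. With padding, $a(0, 1)$ and $a(2, 3)$ on one side and $a(0, -2)$ and $a(-1, 3)$ on the other fit the pattern for $r(1, 2)$.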

RGRPT adds candidate adjacencies in step b of EMRAE. For each $Ca(e_i, u_i)$ and each adjacency $a_1$ in $Ca(e_i, u_i)$, if there is an adjacency $a_2$ in each $Ca(e_j, u_j)$ $(1 \le j \ne i \le k)$ such that $a_1$ and $a_2$ share a gene, then $a_1$ is put into $Ca(e, u)$.

### RESULTS

All experiments were performed on a computer with an Intel Vostro 14 2.0 GHz CPU, 4 GB RAM, and a 500 GB hard disk drive (HDD). The operating system was 64-bit Windows 10 with Java 1.6 installed. RGRPT was written in Java.

We tested RGRPT on both simulated data and practical data (i.e., real biological data), as described in the following sections.

#### Simulated Data

Here, we start with a uni-chromosomal genome as the ancestor, which evolves along a phylogenetic tree with *n* taxa whose topology is shown in **Figure 3**.

We generate two simulated data sets in order to test the effectiveness of RGRPT. One is created from a phylogeny with reversal events only. The other is generated from a phylogeny with several kinds of events, including reversals, translocations, transpositions, fusions, and fissions, with the numbers of those events in a fixed ratio. The two data sets test the ability of the methods to recover simple and complex evolutionary histories, respectively. The first data set is created using reversal events only. Since a reversal of a single gene is rare (Korbel et al., 2007), we set the ratio of reversals of one gene to reversals of more than one gene to 1:3. The number of leaves ranges from 3 to 10 in steps of 1. For each number of leaves, the ancestor genome has *m* genes, where *m* ranges from 50 to 150 in steps of 10. Each edge undergoes *k* reversals, where *k* is a random integer from 3 to 10. So, there are 11 groups of data for each leaf number. Sensitivity is the percentage of correctly predicted events among all actual events. Specificity is the percentage of correctly predicted events among all predicted events. We compute the sensitivity and specificity of RGRPT and EMRAE for each group of data. **Table 1** shows the average sensitivity and specificity for each leaf number. The second column of the table records the number of all events, and its last row records the average values.
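The two measures defined above can be computed as in the following sketch. The event encoding (tuples such as `("rev", 2, 3)`) and the function name are assumptions made for illustration. Note that specificity as defined here, correct predictions over all predictions, is the quantity often called precision elsewhere.

```python
from collections import Counter


def sensitivity_specificity(actual_events, predicted_events):
    """Return (sensitivity, specificity) as fractions in [0, 1].

    sensitivity = correctly predicted events / all actual events
    specificity = correctly predicted events / all predicted events
    """
    actual, predicted = Counter(actual_events), Counter(predicted_events)
    # Multiset intersection: an event counts as correct at most as often as it occurred.
    correct = sum((actual & predicted).values())
    sensitivity = correct / len(actual_events) if actual_events else 0.0
    specificity = correct / len(predicted_events) if predicted_events else 0.0
    return sensitivity, specificity


actual = [("rev", 2, 3), ("rev", 5, 9), ("rev", 1, 1), ("rev", 4, 6)]
predicted = [("rev", 2, 3), ("rev", 5, 9), ("rev", 7, 8)]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")  # sensitivity=0.50 specificity=0.67
```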

**Table 1** shows that RGRPT achieves higher sensitivity than EMRAE and comparable specificity. That is, RGRPT recovers more of the events that actually occurred than EMRAE does. The experimental results therefore show that RGRPT is more effective than EMRAE at predicting reversal events.

The second data set is generated using all events, i.e., reversal, translocation, transposition, fusion, and fission. Reversals are generally more frequent than the other rearrangement events. Fusions and fissions are very rare, so we record the numbers of these two events together. Here, we set the ratio of those events to 10:2:2:0.1. The ancestor genome has 5 chromosomes, each with 100 genes. The ancestor genome evolves along the topology with four leaves (see **Figure 3**). Each edge undergoes *k* events, where *k* is a random number from 1 to μ and μ is 6, 12, 18, or 24. For each μ, the simulation is run 10 times, so we obtain 10 groups of data for each μ. **Table 2** shows the average over the 10 groups of data for each μ. This table indicates that RGRPT is more effective than EMRAE at predicting all events.
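Drawing the event types for one edge can be sketched as below. This is only an assumed reading of the setup: the weights 10:2:2:0.1 are taken to follow the order reversal, translocation, transposition, and fusion or fission combined, and the names `EVENT_TYPES`, `WEIGHTS`, and `sample_edge_events` are hypothetical. The sketch samples event types only and does not apply them to a genome.

```python
import random

# Assumed order of the 10:2:2:0.1 ratio (fusion and fission are counted together).
EVENT_TYPES = ["reversal", "translocation", "transposition", "fusion_or_fission"]
WEIGHTS = [10, 2, 2, 0.1]


def sample_edge_events(mu, rng):
    """Draw k event types for one edge, with k uniform on 1..mu."""
    k = rng.randint(1, mu)
    return rng.choices(EVENT_TYPES, weights=WEIGHTS, k=k)


rng = random.Random(0)  # fixed seed so that a run can be repeated
for mu in (6, 12, 18, 24):
    events = sample_edge_events(mu, rng)
    print(mu, len(events), events.count("reversal"))
```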

#### Practical Data

The practical data are from Zhao and Bourque (2009). They contain six mammalian genomes, i.e., human, chimpanzee, rhesus monkey, mouse, voles, and dog. The data are created at two different levels of resolution, 10 kb and 50 kb. **Figure 4** is the tree describing the phylogeny of the species. The results are shown in **Tables 3** and **4**. EM and RG represent EMRAE and RGRPT, respectively, and Rev, Tloc, Tran, Fus, and Fis represent reversal, translocation, transposition, fusion, and fission, respectively. Each row in the table records the ancestor rearrangement events of the edge. For example, the values in the human row are the rearrangement events from D to human; the values in the MR row are the rearrangement events from A and B.

TABLE 1 | Results of the EMRAE and RGRPT (recovering the genome rearrangements based on phylogenetic tree) algorithms in predicting reversal events.

TABLE 2 | Results of the EMRAE and RGRPT (recovering the genome rearrangements based on phylogenetic tree) algorithms in predicting all events.

At 10 kb resolution, the RGRPT algorithm predicts 1,157 ancestor rearrangement events, including 858 reversals, 16 translocations, 249 transpositions, and 34 fusions and fissions. It identifies 48 more rearrangement events than EMRAE. Reversals make up the majority of all predicted events. At 50 kb resolution, the RGRPT algorithm predicts 475 ancestor rearrangement events, including 332 reversals, 13 translocations, 94 transpositions, and 36 fusions and fissions. RGRPT identifies 21 more rearrangement events than the EMRAE algorithm. At both 10 kb and 50 kb resolution, the rat edge has the most identified rearrangement events among all edges. The genomes have more syntenic blocks at 10 kb resolution than at 50 kb resolution. As a result, more rearrangement events are recognized at 10 kb resolution than at 50 kb resolution. The experiments show that RGRPT can recover more ancestor events than EMRAE.

TABLE 3 | Genome rearrangement predictions of EMRAE and RGRPT (recovering the genome rearrangements based on phylogenetic tree) at 10 kb resolution.

TABLE 4 | Genome rearrangement predictions of EMRAE and RGRPT (recovering the genome rearrangements based on phylogenetic tree) at 50 kb resolution.

#### DISCUSSION

This paper proposes a new method, RGRPT, to infer ancestor rearrangement events. RGRPT takes as input a phylogenetic tree describing the evolution of the species together with the genomes of the species. Experiments on simulated data and practical data show that RGRPT is more effective than EMRAE and can recover more ancestor rearrangement events. RGRPT thus provides a method for studying the genome rearrangements of species. In the future, RGRPT can be used to recognize the ancestral genome rearrangements in the evolution of other species (Tian et al., 2018).


#### DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/wangjuanimu/data-of-genome-rearrangement.

#### AUTHOR CONTRIBUTIONS

JW proposed and implemented the RGRPT method. JW and BC designed all experiments. All authors participated in designing the algorithm and writing the paper.

#### FUNDING

The work was supported by the National Natural Science Foundation of China (61661040, 61661039, 61571163, 61532014, 61671189, 91735306, 61751104); the National Key Research and Development Plan Task of China (Grant No. 2016YFC0901902).


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Wang, Cui, Zhao and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Integrating the Ribonucleic Acid Sequencing Data From Various Studies for Exploring the Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids and Their Functions

#### Edited by:

Liang Cheng, Harbin Medical University, China

#### Reviewed by:

Guiyou Liu, Chinese Academy of Sciences, China Zhi-Liang Ji, Xiamen University, China

#### \*Correspondence:

Feng Zhu zhufeng@zju.edu.cn; prof.zhufeng@gmail.com

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 22 July 2019 Accepted: 18 October 2019 Published: 12 November 2019

#### Citation:

Han Z, Hua J, Xue W and Zhu F (2019) Integrating the Ribonucleic Acid Sequencing Data From Various Studies for Exploring the Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids and Their Functions. Front. Genet. 10:1136. doi: 10.3389/fgene.2019.01136

*Zhijie Han1,2, Jiao Hua3, Weiwei Xue2 and Feng Zhu1,2\**

1 College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China, 2 School of Pharmaceutical Sciences, Chongqing University, Chongqing, China, 3 School of Mathematics, Harbin Institute of Technology, Harbin, China

Multiple sclerosis (MS) is a chronic fatal central nervous system (CNS) disease involving complex immune dysfunction. Recently, long noncoding RNAs (lncRNAs) were discovered to be important regulatory factors in the pathogenesis of MS. However, these findings often cannot be reproduced or confirmed by subsequent studies. We considered that small sample sizes or heterogeneity among tissues may explain the divergent results. Currently, RNA-seq has become a powerful approach to quantify the abundance of lncRNA transcripts. Therefore, we comprehensively collected MS-related RNA-seq data from a variety of previous studies and integrated these data using an expression-based meta-analysis to identify the lncRNAs differentially expressed between MS patients and controls in the whole sample set and in subgroups. Then, we performed Jensen-Shannon (JS) divergence and cluster analysis to explore the heterogeneity and expression specificity among tissues. Finally, we investigated the potential function of the identified lncRNAs in MS using weighted gene co-expression network analysis (WGCNA) and gene set enrichment analysis (GSEA). We identified 5,420 MS-related lncRNAs specifically expressed in brain tissue. The subgroup analysis found small heterogeneity of the lncRNA expression profiles between brain and blood tissues. The results of WGCNA and GSEA showed that a potentially important function of lncRNAs in MS may involve the regulation of ribonucleoproteins and tumor necrosis factor cytokine receptors. In summary, this study provides a strategy to explore disease-related lncRNAs on a genome-wide scale, and our findings will help improve the understanding of MS pathogenesis.

Keywords: ribonucleic acid sequencing, multiple sclerosis, long non-coding ribonucleic acids, meta-analysis, function analysis

## INTRODUCTION

Multiple sclerosis (MS) is a chronic fatal neurodegenerative disease of the central nervous system (CNS) involving complex immune dysfunction (Sospedra and Martin, 2005; Frohman et al., 2006; Li et al., 2018). According to the 2014 statistics of the Atlas of MS, approximately 2.3 million people worldwide are afflicted with MS (Browne et al., 2014). Although much remains unknown about the molecular etiology of MS, a growing number of studies have shown that the dysregulation of transcriptional processes could contribute to the pathogenesis of MS (Li et al., 2017; Selmaj et al., 2017; Angerer et al., 2018; Cheng et al., 2018; Han et al., 2018b; Zhang et al., 2019).

Recently, long noncoding RNAs (lncRNAs), non-protein-coding genes whose transcripts are longer than 200 nucleotides, have been identified as important regulatory factors of the immune system and of the pathogenesis of CNS disorders including MS (Gomez et al., 2013; Ng et al., 2013; Dong et al., 2015; Cheng et al., 2016; Santoro et al., 2016; Zhang et al., 2016; Chen et al., 2017; Eftekharian et al., 2017; He et al., 2017; Cheng et al., 2018; Yin et al., 2019). However, for MS, these results often cannot be reproduced or confirmed by subsequent studies. For example, multiple variants of the lncRNA antisense non-coding RNA in the INK4 locus (*ANRIL*) were found to be significantly associated with the risk of MS through haplotype analysis of blood samples (Rezazadeh et al., 2018). But a following study revealed that *ANRIL* does not contribute to the pathogenesis of MS in blood, cortex, and cerebellum tissues (Pahlevan Kakhki et al., 2018). One study showed a significant upregulation of the lncRNA *MALAT1* in MS blood tissues (Cardamone et al., 2019), while a subsequent study found the expression of *MALAT1* to be markedly decreased in MS brain (Masoumi et al., 2019). Moreover, another study found that *MALAT1* is not significantly differentially expressed between MS patients and controls (Gharesouran et al., 2019). We considered that small sample sizes or heterogeneity among tissues may explain the divergent results.

Currently, for lncRNAs in particular, using RNA-seq data to quantify transcript abundance has become a very powerful approach compared with traditional ones (e.g., gene microarrays) (Wang et al., 2009). In particular, the expression of almost all known lncRNA transcripts can be measured using RNA-seq data, whereas this proportion is only approximately 0.1 to 10.6% for probe re-annotation across various types of microarrays (Du et al., 2013; Fang et al., 2018; Yang et al., 2019). Moreover, lncRNA abundance quantification using RNA-seq data also shows higher accuracy owing to its deep read coverage, while the re-annotation approach requires a sequence match of only 1 to 4 probes when quantifying lncRNA abundance (Du et al., 2013; Gellert et al., 2013; Li et al., 2019). A previous study reported that, by paying attention to some aspects of the library preparation and sequencing process [i.e., poly-A tail selection, paired-end sequencing, and sequencing of double-stranded complementary DNA (cDNA)], lncRNAs are more easily and more accurately identified through RNA-seq (Ilott and Ponting, 2013).

In this study, we thus selected all MS-related RNA-seq data from a variety of studies by searching three authoritative public databases, GEO DataSets (Barrett et al., 2013), EBI-EMBL ArrayExpress (Athar et al., 2019), and DDBJ Sequence Read Archive (Ogasawara et al., 2013), using the keyword "multiple sclerosis." Then, we used these RNA-seq data to quantify lncRNA expression in each of the selected studies. Next, we integrated the lncRNA expression results of all selected studies by an expression-based meta-analysis to identify the lncRNAs significantly differentially expressed between MS patients and controls. Further, we explored their heterogeneity and expression specificity among various tissues. After that, weighted gene co-expression network analysis (WGCNA) was performed using the expression data of lncRNAs and protein-coding genes to identify the significant modules for MS. The expression of the protein-coding genes was calculated using the same approach as for lncRNAs. Finally, we conducted gene set enrichment analysis (GSEA) on the co-expressed protein-coding genes in each significant module to infer the functions of the differentially expressed lncRNAs that potentially contribute to the pathogenesis of MS.

## MATERIALS AND METHODS

#### Selection of the Multiple Sclerosis-Related Ribonucleic Acid Sequencing Datasets and Studies

We used the keyword "multiple sclerosis" to search for all possible MS-related RNA-seq datasets in three authoritative databases: GEO DataSets (Barrett et al., 2013), EBI-EMBL ArrayExpress (Athar et al., 2019), and DDBJ Sequence Read Archive (Ogasawara et al., 2013). The search was performed before the last update of the databases on May 16, 2019. Then, we selected the suitable datasets using four criteria: 1) the organism in the dataset is human; 2) the study in the dataset uses a case-control design; 3) the dataset provides FASTQ data; 4) the FASTQ data in the dataset were not generated by metagenome, whole genome, or whole exome sequencing. Finally, the studies from these datasets based on various tissues were selected. **Figure 1** shows the workflow.

#### Quantification of Long Noncoding Ribonucleic Acid Abundance Using Ribonucleic Acid Sequencing Data

We first downloaded the sequence data of these studies with *Prefetch* and converted them into FASTQ files using the *fastq-dump* tool of the SRA Toolkit (Leinonen et al., 2011). Next, we downloaded the reference sequences of lncRNA and protein-coding transcripts in FASTA format from NONCODE (version 5) (Fang et al., 2018) and Ensembl (release 91) (Aken et al., 2017), respectively, and merged the two FASTA files. NONCODE is one of the most complete and well-annotated databases of noncoding RNAs, and we obtained a total of 172,216 transcript sequences of 96,308 human lncRNA genes from it. Ensembl aggregates the cDNA data from the National Center for Biotechnology Information (Sayers et al., 2019), UniProt (UniProt, 2015), Genome Reference Consortium (Church et al., 2011), and UCSC Genome Browser (Kent et al., 2002) databases. After removing the pseudogenes, we obtained a total of 160,040 transcript sequences of 22,810 human protein-coding genes from it. Then, we quantified the lncRNA and protein-coding transcripts simultaneously by mapping the RNA-seq reads of each study to the merged reference sequences (pseudoalignment) and calculating the count values using *Kallisto* (Bray et al., 2016). *Kallisto* is a fast and highly accurate tool for quantifying transcript abundance through a k-mer lookup technique. Here, the merged reference sequences were processed into a transcriptome index for pseudoalignment, which has the same effect as aligning reads to a reference genome in traditional transcript-level RNA-seq processing but substantially reduces computation time. For the paired-end sequencing samples, the arguments were set to defaults, i.e., the number of bootstrap samples (-b) equals 0 and the number of threads (-t) equals 1. For the single-end sequencing samples, in addition to these default settings, we set the estimated average fragment length (-l) and the standard deviation of fragment length (-s) to 200 and 20, respectively, according to Kallisto's recommended parameters. Finally, based on the annotation file "Transcript2Gene," we integrated the transcript-level count values of lncRNAs into their corresponding gene-level count values using the R package "tximport" (Soneson et al., 2015).
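The quantification settings described above can be sketched as a small helper that assembles the `kallisto quant` command line, followed by a gene-level summation of transcript counts. This is an illustration under stated assumptions rather than the authors' pipeline: the helper names, index name, output directories, and FASTQ file names are hypothetical; the command is only built, not executed; and the summation is a simplified stand-in for what "tximport" does in R.

```python
from collections import defaultdict


def kallisto_quant_cmd(index, out_dir, fastqs, single_end=False):
    """Build a `kallisto quant` argument list with the settings given in the text."""
    fastqs = list(fastqs)
    if not fastqs or (not single_end and len(fastqs) % 2):
        raise ValueError("paired-end runs need FASTQ files in pairs")
    # -b 0 and -t 1 are the defaults named in the text, written out explicitly.
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir, "-b", "0", "-t", "1"]
    if single_end:
        # Single-end reads need a fragment length mean (-l) and standard deviation (-s).
        cmd += ["--single", "-l", "200", "-s", "20"]
    return cmd + fastqs


def gene_level_counts(transcript_counts, transcript_to_gene):
    """Sum transcript-level counts into gene-level counts."""
    totals = defaultdict(float)
    for transcript, count in transcript_counts.items():
        totals[transcript_to_gene[transcript]] += count
    return dict(totals)


print(" ".join(kallisto_quant_cmd("merged.idx", "out/s1", ["s1_1.fastq", "s1_2.fastq"])))
print(" ".join(kallisto_quant_cmd("merged.idx", "out/s2", ["s2.fastq"], single_end=True)))
print(gene_level_counts({"tx1": 10.0, "tx2": 5.0, "tx3": 2.0},
                        {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}))
```

The returned list could be passed to `subprocess.run` on a machine where kallisto is installed and an index has already been built.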

### Heterogeneity Test and Meta-Analysis

To identify the lncRNAs significantly differentially expressed between MS patients and controls, we calculated and integrated the results of each study by a meta-analysis. These analyses were conducted using the R package "MetaOmics," which is a comprehensive analytical pipeline for the meta-analysis of multiple transcriptomic studies (Ma et al., 2019). This meta-analysis includes a normalization process that follows edgeR's strategy and an "AW-Fisher" method to integrate data (Bullard et al., 2010; Robinson et al., 2010; Ma et al., 2019). First, we calculated two parameters, $I^2$ and the P value, to measure the heterogeneity of lncRNA expression by Cochran's Q statistic, which is based on a chi-square test with *k* − 1 degrees of freedom (*k* equals the number of studies used for the meta-analysis). Following previous studies, the heterogeneity was considered statistically significant when $I^2$ > 50% and P < 0.01 (Han et al., 2015; Li et al., 2016; Liu et al., 2017; Han et al., 2018a; Xue et al., 2018). Then, the meta-analysis was performed for each of these lncRNAs based on their count values. The random effect model (REM) was used for the lncRNAs with significant heterogeneity, and the fixed effect model (FEM) was used for the others. Using the REM in meta-analysis can reduce the bias of the results (Kim et al., 2015; Szajewska and Kolodziej, 2015). We calculated the standardized mean difference (SMD) with its 95% confidence interval (CI) to identify the lncRNAs differentially expressed between the MS patients and controls (95% CI of SMD does not include zero, FDR adjusted P < 0.05). The SMD is the mean difference between cases and controls divided by the standard deviation, and it applies to meta-analysis when the outcome is a continuous variable (e.g., expression level). Moreover, since all these samples can be split into brain and blood, we performed the meta-analysis for the two subgroups and explored the differential expression pattern of the MS-related lncRNAs between brain and blood.
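The quantities named above can be sketched with textbook formulas. This is not the "MetaOmics" implementation: the SMD below is Cohen's d with a pooled standard deviation and its usual large-sample variance, Cochran's Q uses inverse-variance weights, and $I^2 = \max(0, (Q - (k-1))/Q) \times 100\%$. The function names are hypothetical, and the chi-square P value of Q is not computed here.

```python
import math
from statistics import mean, stdev


def smd(case, control):
    """Standardized mean difference (Cohen's d) and its approximate variance."""
    n1, n2 = len(case), len(control)
    pooled_sd = math.sqrt(((n1 - 1) * stdev(case) ** 2 + (n2 - 1) * stdev(control) ** 2)
                          / (n1 + n2 - 2))
    d = (mean(case) - mean(control)) / pooled_sd
    variance = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return d, variance


def cochran_q_i2(effects, variances):
    """Cochran's Q, I^2 (in percent), and the fixed-effect pooled estimate."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2, pooled


# Toy per-study expression values (cases, controls); not data from the paper.
studies = [([2, 4, 6], [1, 3, 5]), ([5, 6, 9, 8], [2, 3, 4, 3]), ([3, 3, 4], [3, 4, 3])]
effects, variances = zip(*(smd(case, control) for case, control in studies))
q, i2, pooled = cochran_q_i2(effects, variances)
print(f"Q={q:.3f} I2={i2:.1f}% pooled SMD={pooled:.3f}")
```

An approximate 95% CI for a single study is `d ± 1.96 * sqrt(variance)`. Under the criteria in the text, the choice between REM and FEM would also require the P value of Q from a chi-square distribution with *k* − 1 degrees of freedom.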

In addition, we explored the specific target genes of the lncRNAs using the LncRNA2Target v2.0 database, an authoritative source that includes 152,137 lncRNA-target relationships confirmed by knockdown or overexpression analysis and binding experiments, and that provides a web interface for searching the targets of a particular lncRNA (Cheng et al., 2019).

#### Tissue Specificity Analysis of the Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids

We explored the tissue expression specificity of the lncRNAs significantly differentially expressed in MS, which is an important aspect of neurological disease research (such lncRNAs are usually specifically expressed in the CNS) (Fatica and Bozzoni, 2014; He et al., 2017; Tang et al., 2019b). For this purpose, lncRNA expression data for primary human tissues and cell lines (e.g., brain, heart, breast, lung, liver, foreskin, lymph node, colon, skeletal muscle, leukocyte, HeLa cells, and fibroblasts) were first downloaded from NONCODE. Then, we extracted the expression data across tissues for the lncRNAs differentially expressed in brain, blood, and the whole sample set, respectively, and stored them in three independent sets. Further, based on these data, we used the Jensen-Shannon (JS) divergence, an entropy-based approach, to calculate a tissue specificity score for the differentially expressed lncRNAs according to a previous study (Cabili et al., 2011). Briefly, the lncRNA expression vectors were converted to abundance densities, and the distance between two tissue expression patterns was defined as the square root of the JS divergence. The tissue specificity of a lncRNA expression pattern was measured through the distance between its expression pattern across tissues and a predefined extreme pattern in which the lncRNA is expressed in only one tissue (1 minus the distance). Thus, the metric of tissue specificity ranges from 0 to 1. The closer the score is to 1, the stronger the tissue specificity. Finally, using the same data, we performed cluster analysis with the Manhattan distance for the lncRNAs differentially expressed in brain, blood, and the whole sample set using the R package "gplots."
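The specificity score described above can be sketched as follows, assuming base-2 logarithms (so that the JS divergence lies in [0, 1]) and taking the best score over all tissues, in the spirit of Cabili et al. (2011). The function names and the toy expression vectors are illustrative assumptions, not values from the study.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def tissue_specificity(expression):
    """Return (index of the most specific tissue, specificity score in [0, 1])."""
    total = sum(expression)
    if total <= 0:
        raise ValueError("expression vector must contain a positive value")
    density = [x / total for x in expression]
    scores = []
    for t in range(len(density)):
        # Extreme pattern: the lncRNA is expressed in tissue t only.
        extreme = [1.0 if s == t else 0.0 for s in range(len(density))]
        distance = math.sqrt(max(0.0, js_divergence(density, extreme)))
        scores.append(1 - distance)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


print(tissue_specificity([0.0, 12.5, 0.0, 0.0]))  # (1, 1.0)
print(tissue_specificity([3.0, 3.0, 3.0, 3.0]))   # evenly expressed, lower score
```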

#### Inferring the Functions of Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids by Weighted Gene Co-Expression Network Analysis

To infer the potential biological functions of the significantly differentially expressed lncRNAs in MS, we used the WGCNA approach to determine the co-expression profile of these MS-related lncRNAs and protein-coding genes, and then performed GSEA on the co-expressed protein-coding genes. First, in the same way as for the MS-related lncRNAs, we quantified the abundance of the protein-coding genes and identified the significantly differentially expressed genes by meta-analysis. Second, we constructed the co-expression network by integrating the count values of the differentially expressed lncRNAs and protein-coding genes using the R package "WGCNA" (Langfelder and Horvath, 2008). Specifically: 1) we clustered the samples to check for outliers using the "hclust" function of the R package "WGCNA"; 2) after quality control, we used the "pickSoftThreshold" function of the R package "WGCNA" to choose a soft threshold power β that ensures the scale-free topology of the co-expression network; 3) based on the β value, we used Pearson's method to calculate an adjacency matrix containing the weighted correlations of all gene pairs; 4) from the adjacency matrix, we used the dynamic tree-cut algorithm to construct a hierarchical clustering dendrogram and identified the co-expression modules whose genes have high topological overlap with each other. Finally, we assessed the significance of the modules for MS using two indices. The first is the correlation between module membership (i.e., intramodular connectivity) and gene significance for MS. A high correlation means that the hub genes (i.e., the genes with high connectivity in a co-expression module) of the corresponding module also tend to be highly correlated with disease state (MS or healthy) (Langfelder and Horvath, 2008). The second is the average correlation of the genes in each module with disease state. This second index was also used to assess the association of each module with the platforms and the tissue types, respectively.
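As an illustration of step 3, the following minimal Python sketch computes an unsigned soft-thresholded adjacency matrix from Pearson correlations, together with the per-gene connectivity. It is not the WGCNA implementation used in the study; the function names, the toy expression values, and the choice of an unsigned network are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(expr, beta):
    """Unsigned weighted adjacency a_ij = |cor(x_i, x_j)| ** beta.

    expr: list of per-gene expression vectors (one value per sample).
    The diagonal is left at 0 so that connectivity excludes self-links
    (a simplification chosen for this sketch).
    """
    n_genes = len(expr)
    adj = [[0.0] * n_genes for _ in range(n_genes)]
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            a = abs(pearson(expr[i], expr[j])) ** beta
            adj[i][j] = adj[j][i] = a
    return adj


def connectivity(adj):
    """Sum of adjacency weights for each gene."""
    return [sum(row) for row in adj]


# Hypothetical toy data: three genes measured in five samples.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.9],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
adj = soft_threshold_adjacency(expr, beta=9)
k = connectivity(adj)
```

Raising the absolute correlation to the power β suppresses weak correlations much more strongly than strong ones, which is why a larger β pushes the network toward a scale-free topology.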

#### Pathway Analysis of the Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids by Gene Set Enrichment Analysis

Based on the two indices of module significance, we selected the modules most significantly associated with disease state to investigate the lncRNA functions in MS by GSEA. We first extracted the IDs of the protein-coding genes co-expressed with lncRNAs in these modules. Then, we downloaded the signaling pathway data from two common databases, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). GO is a public resource of data on gene functions in terms of biological process, molecular function, and cellular component (The Gene Ontology, 2017), and KEGG is a comprehensive database that integrates information on genes involved in signaling pathways, cellular processes, human diseases, etc. (Kanehisa et al., 2017). Finally, we used the co-expressed protein-coding genes and the signaling pathway data to conduct the GSEA of the most significant modules using the R package "clusterProfiler" (Yu et al., 2012). An adjusted P value of less than 0.05, corrected for multiple testing by the Benjamini-Hochberg method, was used as the threshold of significance.
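The Benjamini-Hochberg adjustment used as the significance threshold can be sketched in a few lines of Python. This is a generic illustration rather than the clusterProfiler code; the pathway names and raw P values below are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw enrichment P values for four pathways.
pathways = ["pathway_A", "pathway_B", "pathway_C", "pathway_D"]
raw_p = [0.001, 0.02, 0.03, 0.5]
adj_p = benjamini_hochberg(raw_p)
significant = [name for name, p in zip(pathways, adj_p) if p < 0.05]
```

With these toy values, the first three pathways pass the 0.05 threshold after adjustment and the fourth does not.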

### RESULTS AND DISCUSSION

#### Results of Study Selection and Long Noncoding Ribonucleic Acid Abundance Quantification

Using keyword search and quality filtering, we identified ten MS-related RNA-seq datasets from three authoritative databases: GSE60424, GSE66573, GSE66763, GSE89843, GSE100297, GSE120411, GSE111972, GSE123496, GSE77598, and SRP132699. The library preparation and sequencing methods in most of these datasets meet one or more of the requirements for improving lncRNA identification (i.e., poly-A tail selection, paired-end sequencing, and sequencing of double-stranded cDNA). After investigating the sources of the samples, we found that these datasets cover eight brain tissues (optic chiasm, corpus callosum, occipital cortex, astrocytes, frontal cortex, hippocampus, internal capsule, parietal cortex) and seven blood tissues (B cell, T cell, monocyte, platelets, neutrophils, natural killer cell, and whole blood). According to the tissues, we selected a total of 20 studies (207 MS cases and 348 controls) for the following analysis. Detailed information on each study is shown in **Table 1**. Finally, we downloaded the RNA-seq data of the samples in each study and measured lncRNA expression (count values) using *Kallisto* (Bray et al., 2016) and the R package "tximport" (Soneson et al., 2015). In total, lncRNA abundance was quantified in 555 samples.

#### Heterogeneity Test and Meta-Analysis

Based on the count values of the 96,308 lncRNAs in the 20 studies, we performed the meta-analysis to calculate the SMD with its 95% CI for each lncRNA using the REM or FEM. The heterogeneity test showed that only about 2.90% of lncRNAs had significant heterogeneity (I² > 50% and P < 0.01). Therefore, homogeneous and unbiased results could be obtained for >97% of lncRNAs by FEM, while for the remaining lncRNAs with significant heterogeneity, REM could reduce the resulting bias. In total, 5,420 lncRNAs were identified as significantly differentially expressed between MS cases and controls, including 368 downregulated and 5,052 upregulated lncRNAs (**Figure 2A** and **Supplementary Table S1**). For example, **Figure 2B** shows the meta-analysis results for the lncRNA NONHSAG108980.1, which has the most significant association with an increased risk of MS (SMD = 0.59, 95% CI = 0.40−0.78, P = 1.89×10⁻⁹).

FIGURE 2 | The results of heterogeneity test and meta-analysis for all samples and subgroups. (A) The expression level of the significantly differentially expressed long noncoding RNAs (lncRNAs) in each study after meta-analysis. The random effect model was used for 157 lncRNAs with a significant heterogeneity, while the fixed effect model was used for 5,263 non-heterogeneous lncRNAs. The details can be clearly viewed by enlarging the electronic version. (B) The forest plot for the meta-analysis of the lncRNA NONHSAG108980.1, which is the most significant result associated with an increased risk of MS (SMD = 0.59, 95% CI = 0.40−0.78, P = 1.89×10⁻⁹). (C) The bar plot showing the results of heterogeneity test in each group. For all samples, the proportion of lncRNAs with a significant heterogeneity is not high (about 2.90%), and this percentage is further decreased to about 1.99 and 1.20% in blood and brain, respectively. (D) The Venn diagram exhibiting the overlap among the significantly differentially expressed lncRNAs that are identified using brain tissues, blood tissues, and all samples.

Then, to investigate the heterogeneity of the lncRNA expression profile across tissues, we split the samples into brain and blood tissues and performed the heterogeneity test and meta-analysis for each subgroup. The proportion of lncRNAs with significant heterogeneity, already low for the whole sample set, was further reduced to about 1.99% and 1.20% in blood and brain, respectively (**Figure 2C**). Finally, we explored how the differentially expressed lncRNAs differ between tissues. The lncRNAs identified in brain showed higher specificity than those identified in blood: about 60.06% of the 5,420 differentially expressed lncRNAs could also be identified in blood, whereas this percentage was only 26.82% in brain (**Figure 2D**). Moreover, the total number of upregulated lncRNAs is far greater than that of the downregulated ones in both the blood (**Supplementary Table S2**) and the brain (**Supplementary Table S3**), which suggests that MS risk is related to lncRNA overexpression.
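The pooling logic behind the heterogeneity test and the choice between FEM and REM can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' pipeline: it uses inverse-variance weights, Cochran's Q, the I² statistic, and the DerSimonian-Laird estimate of between-study variance, and it switches to the random effect model on the I² cutoff alone (the P value of the Q test is omitted). The function name and the toy effect sizes are hypothetical.

```python
import math


def pool_effects(effects, variances, i2_cutoff=50.0):
    """Pool per-study effect sizes (e.g. SMDs) by inverse-variance weighting."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q and I^2 (percentage of variation due to heterogeneity).
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    sum_re = sum(re_weights)
    random_est = sum(w * y for w, y in zip(re_weights, effects)) / sum_re

    use_random = i2 > i2_cutoff
    estimate = random_est if use_random else fixed
    se = math.sqrt(1.0 / (sum_re if use_random else sum_w))
    return {
        "model": "REM" if use_random else "FEM",
        "estimate": estimate,
        "ci95": (estimate - 1.96 * se, estimate + 1.96 * se),
        "I2": i2,
        "tau2": tau2,
    }


# Hypothetical per-study SMDs and their variances for two lncRNAs.
homogeneous = pool_effects([0.50, 0.60, 0.55], [0.040, 0.050, 0.045])
heterogeneous = pool_effects([0.10, 0.90, -0.50], [0.01, 0.01, 0.01])
```

In the first toy case the studies agree, I² is 0, and the fixed effect estimate is kept; in the second the studies disagree strongly, I² exceeds 50%, and the random effect estimate with its wider interval is reported.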

In addition, previous studies found that lncRNAs are only modestly conserved in sequence across evolution (Guttman et al., 2009; Iyer et al., 2015). Therefore, we explored the sequence conservation of these differentially expressed lncRNAs using the conservation constraint search in NONCODE, which contains the conservation information of lncRNAs in 13 common model organisms (i.e., human, chimp, gorilla, orangutan, rhesus, mouse, rat, cow, pig, opossum, platypus, chicken, and zebrafish). The results showed that only 0.11% of the differentially expressed lncRNAs were conserved in sequence among all 13 organisms, while this percentage increased to 28.5% in primates (human, chimp, gorilla, orangutan, and rhesus).

#### Tissue Specificity Analysis of the Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids

Using expression data from the NONCODE database, we applied the JS divergence metric and cluster analysis to explore the tissue specificity of the MS-related lncRNAs. The JS divergence results showed that the MS-related lncRNAs identified using the brain, blood, and whole samples all had high tissue specificity (**Figure 3A**). For the cluster analysis, based on the same data, we further compared the expression patterns of these differentially expressed lncRNAs across human tissues and cell lines. We found that the differentially expressed lncRNAs identified from the whole sample set were highly specifically expressed in brain tissue (**Figure 3B**). Similarly, we observed significant brain-specific expression for the differentially expressed lncRNAs identified from brain samples (**Figure 3C**). Interestingly, even the differentially expressed lncRNAs identified from blood samples were highly specifically expressed in brain tissue (**Figure 3D**). These results are consistent with the findings of the previous step and with our recently published study (Han et al., 2019), and suggest that lncRNA dysregulation in MS has the characteristics of a CNS disorder.
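A tissue specificity score based on the JS divergence can be sketched as below. The exact scoring used in the study is not spelled out in this section, so the sketch assumes a commonly used formulation: the expression profile is normalized to a probability distribution, compared with an idealized profile expressed in a single tissue, and scored as 1 minus the square root of the JS divergence, with the maximum taken over tissues. All names and toy profiles are hypothetical.

```python
import math


def _entropy(p):
    """Shannon entropy in bits; zero entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so it lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return _entropy(m) - (_entropy(p) + _entropy(q)) / 2


def max_tissue_specificity(expression):
    """Maximal specificity score of one lncRNA over all tissues."""
    total = sum(expression)
    if total == 0:
        return 0.0
    p = [x / total for x in expression]
    best = 0.0
    for t in range(len(p)):
        # Idealized profile: expressed only in tissue t.
        ideal = [1.0 if i == t else 0.0 for i in range(len(p))]
        score = 1.0 - math.sqrt(max(0.0, js_divergence(p, ideal)))
        best = max(best, score)
    return best


# Hypothetical expression profiles over four tissues.
brain_only = max_tissue_specificity([0.0, 0.0, 10.0, 0.0])
ubiquitous = max_tissue_specificity([5.0, 5.0, 5.0, 5.0])
```

A transcript expressed in a single tissue reaches the maximal score of 1, while a uniformly expressed transcript scores much lower, which is the behavior the distributions of maximal specificity scores rely on.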

#### Inferring the Functions of Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids by Weighted Gene Co-Expression Network Analysis

After abundance quantification and meta-analysis, we identified 2,051 protein-coding genes significantly differentially expressed between MS patients and controls (**Supplementary Table S4**). Then, we combined the count values of the 2,051 differentially expressed protein-coding genes and the 5,420 MS-related lncRNAs to perform the WGCNA. During quality control, we removed three outlier samples (minimum cluster size less than 5 and cutting height less than 4.0×10⁶) (**Supplementary Figure S1**). The soft threshold power β was set to 9, at which the model fitting index R² reaches 0.8 and the mean connectivity is close to 0 (**Supplementary Figure S2**). Finally, we constructed a co-expression network that includes 1,938 protein-coding genes and 5,022 lncRNAs. According to the interconnectedness of gene pairs, they were clustered into 15 modules in the network (MEyellow, MEturquoise, MEblue, MEsalmon, MEred, MEpurple, MEpink, MEmagenta, MEgreen, MEmidnightblue, MEcyan, MEtan, MEgreenyellow, MEbrown, and MEblack) (**Figure 4A**). Moreover, to assess the significance of these modules for MS, we calculated the two types of correlations as indices. The average correlation of the genes in each module with disease state showed that MEyellow is the module most strongly associated with MS (r = 0.33, P = 5×10⁻¹⁵), followed by MEred (r = 0.32, P = 2×10⁻¹⁴), MEpink (r = −0.28, P = 2×10⁻¹¹), and MEbrown (r = 0.24, P = 9×10⁻⁹). The same index was also used to assess the association of each module with the platforms and the tissue types, respectively. Consistently, we found that MEred (r = 0.71, P = 2×10⁻⁸⁵), MEbrown (r = 0.52, P = 1×10⁻³⁹), and MEyellow (r = 0.38, P = 2×10⁻²⁰) were most significantly associated with the tissue types, whereas no module was strongly associated with the platforms (**Figure 4B**). These findings are generally consistent with the correlation between module membership and gene significance for MS. For example, MEyellow and MEred are the top two modules by average correlation of genes with disease state, and they also show a high correlation between module membership and gene significance (cor = 0.43, P = 4.6×10⁻¹⁵ and cor = 0.50, P = 2.6×10⁻¹⁹, respectively) (**Figures 4C**, **D**). In contrast, MEcyan shows very low values for both types of correlations (r = −0.058, P = 0.2 and cor = 0.038, P = 0.8) (**Figure 4E**).

In addition, we also performed a WGCNA with the soft threshold power β = 9 using all the quantified genes. These genes were clustered into 119 modules in the network, and about 82.2% of the differentially expressed genes were clustered into 16 of the 119 modules (including a gray one). These modules show only low to modest association with MS (the correlation coefficients are < 0.19). These results indicate that the differentially expressed genes are distributed across modules similarly whether all genes or only the filtered genes are used in the WGCNA, and imply that the extra genes may mask the association of the differentially expressed genes with MS.

#### Pathway Analysis of the Multiple Sclerosis-Related Long Noncoding Ribonucleic Acids by Gene Set Enrichment Analysis

To explore the function of lncRNAs in MS, we performed GSEA in the four modules most significant for MS based on the two types of correlations, i.e., MEyellow (r = 0.33, P = 5×10⁻¹⁵ and cor = 0.43, P = 4.6×10⁻¹⁵), MEred (r = 0.32, P = 2×10⁻¹⁴ and cor = 0.50, P = 2.6×10⁻¹⁹), MEpink (r = −0.28, P = 2×10⁻¹¹ and cor = 0.63, P = 3.5×10⁻¹⁴), and MEbrown (r = 0.24, P = 9×10⁻⁹ and cor = 0.32, P = 4.7×10⁻⁹). We found no significantly enriched pathway for MEred. Based on the results of LncRNA2Target, however, two differentially expressed lncRNAs in MEred could target MS-related genes. Specifically, two target genes (CDH1 and CDH2) of the lncRNA NONHSAG081583.2 encode cadherins, the most abundant adhesion molecules participating in nerve conduction at synaptic junctions, whose expression can be downregulated by proinflammatory cytokines in MS (Minagar et al., 2003; Tian et al., 2009). The lncRNA NONHSAG000840.2 targets the MS-related gene NOTCH2, and reducing NOTCH2 in proinflammatory monocytes can increase the frequency of nonclassical monocytes and neutralizing antidrug antibody induction in IFN-β-treated MS patients (Adriani et al., 2018).

FIGURE 3 | The tissue specificity of the multiple sclerosis-related long noncoding RNAs (lncRNAs) based on expression data from the NONCODE database. (A) Tissue-specific expression measured by Jensen-Shannon divergence. The distributions of the maximal tissue specificity scores show the high tissue specificity of the differentially expressed lncRNAs identified using whole (blue), brain (green), and blood samples (red), respectively. Panels (B) to (D) show the hierarchical clustering heatmaps for the expression of these lncRNAs in primary human tissues and cell lines. The differentially expressed lncRNAs identified using whole (B), brain (C), and blood samples (D) are all highly specifically expressed in brain tissue. The Manhattan distance was used in all three cluster analyses.

For MEbrown, the co-expressed protein-coding genes were mainly involved in leukocyte- and interleukin-related immune responses (**Figure 5A** and **Supplementary Table S5**), which is similar to the finding of our recent study (Han et al., 2019). Many genomic variants in the human leukocyte antigen complexes and interleukin receptors have been identified as significantly associated with susceptibility to MS (Rubio et al., 2002; Teutsch et al., 2003; Lundmark et al., 2007; Hollenbach and Oksenberg, 2015; Tang et al., 2019a). The protein-coding genes in MEpink are mainly associated with intercellular junctions and signaling transmission (**Figure 5B**). Previous studies found that defects in axon-glial signaling transmission caused by the loss and disconnection of oligodendrocyte gap junctions contribute to MS pathogenesis (Brand-Schieber et al., 2005; Markoullis et al., 2012; Markoullis et al., 2014).

FIGURE 4 | The co-expression network analysis of the differentially expressed long noncoding RNAs (lncRNAs) and protein-coding genes. (A) The clustering dendrogram of these co-expressed lncRNAs and protein-coding genes. There are 15 clustered modules in the hierarchical clustering dendrogram, which is constructed by a dynamic tree-cut algorithm. These clustered modules are marked with 15 different colors, i.e., yellow, turquoise, tan, salmon, red, purple, pink, midnight blue, magenta, green yellow, green, cyan, brown, blue, and black. (B) The heatmap for the association of each module with the disease states, platforms, and tissue types. Each cell represents a module, and contains the correlation r and corresponding P value (in brackets). Panels (C) to (E) show the results of correlation between the module membership and the gene significance in MEyellow, MEred, and MEcyan, respectively. The results of other modules are described in Supplementary Figure S3.

The results of LncRNA2Target showed that the lncRNA NONHSAG049754.2 in MEyellow targets the MS-related gene TNFRSF10A. This gene encodes a receptor of tumor necrosis factor (TNF) cytokines, which plays an important role in the regulation of inflammation and is related to susceptibility to MS (De-la-Torre et al., 2019). The protein-coding genes in MEyellow are related to ribonucleoprotein (**Figure 5C**). Ribonucleoprotein is a kind of ribonucleic acid-binding protein that participates in mRNA splicing (Guthrie, 1991). Previous studies showed that, as an important autoantigen in neuroimmune disease, ribonucleoprotein interacts with autoantibodies significantly more often in the cerebrospinal fluid of MS patients than in that of controls (Sueoka et al., 2004; Yukitake et al., 2008). Subsequent studies further identified a ribonucleoprotein-related lncRNA, TNF-α and heterogeneous nuclear ribonucleoprotein L related immunoregulatory lincRNA (THRIL), which was significantly upregulated and formed transcriptional activating complexes with ribonucleoprotein to promote TNF-α expression in the circulating blood cells of MS patients (Li et al., 2014; Eftekharian et al., 2017). Given that MEyellow is the most significant module for MS, we inferred that one of the key mechanisms of lncRNAs in MS is associated with the regulation of ribonucleoprotein and the TNF cytokine receptor.

### CONCLUSIONS

In this study, we comprehensively collected MS-related RNA-seq data from a variety of studies and integrated these data by an expression-based meta-analysis to assess the effect of lncRNAs on MS pathogenesis at the genome scale. We identified a total of 5,420 lncRNAs significantly differentially expressed between MS patients and controls. The subgroup analysis then found little heterogeneity of the lncRNA expression profile between the brain and blood tissues. Further, the specificity analysis across multiple tissues showed that the differentially expressed lncRNAs (whether identified using brain, blood, or whole samples) are highly specifically expressed in brain tissue. Finally, the results of GSEA and WGCNA suggested that an important function of lncRNAs in MS may involve the regulation of ribonucleoprotein and the TNF cytokine receptor. Overall, we applied a strategy to resolve the inconsistent MS-related lncRNA findings of previous studies and to explore the functions of these lncRNAs in MS. The findings of this study will help improve the understanding of the pathogenesis of MS.

### DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. These data can be found here: https://www.ncbi.nlm.nih.gov/gds, https://www.ebi.ac.uk/arrayexpress, https://ddbj.nig.ac.jp/DRASearch.

### AUTHOR CONTRIBUTIONS

ZH and FZ designed the research. ZH, FZ, JH, and WX collected the data. ZH performed the research, analyzed data, and wrote the paper. FZ reviewed and modified the manuscript. All authors discussed the results, and contributed to the final manuscript. All authors read and approved the final manuscript.

### FUNDING

This work was funded by National Key R&D Program of China (2018YFC0910500), National Natural Science Foundation of China (81872798), Fundamental Research Fund for Central Universities (10611CDJXZ238826, 2018QNA7023, 2018CDQYSG0007 & CDJZR14468801), and Innovation Project on Industrial Generic Key Technologies of Chongqing (cstc2015zdcy-ztzx120003).

### REFERENCES


### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01136/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Han, Hua, Xue and Zhu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# CircSLNN: Identifying RBP-Binding Sites on circRNAs via Sequence Labeling Neural Networks

*Yuqi Ju1†, Liangliang Yuan1†, Yang Yang1,2,3\* and Hai Zhao1,2,3*

1 Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, 2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China, 3 Brain Science and Technology Research Center, Shanghai Jiao Tong University, Shanghai, China

The interactions between RNAs and RNA binding proteins (RBPs) are crucial for understanding post-transcriptional regulation mechanisms. Many computational tools have been developed to automatically predict the binding relationship between RNAs and RBPs. However, most of these methods can only predict the presence or absence of binding sites in a sequence fragment, without providing specific information on the position or length of the binding sites. Besides, the existing tools focus on the interaction between RBPs and linear RNAs, while the binding sites on circular RNAs (circRNAs) have rarely been studied. In this study, we model the prediction of binding sites on RNAs as a sequence labeling problem, and propose a new model called circSLNN to identify the specific location of RBP-binding sites on circRNAs. CircSLNN is driven by pretrained RNA embedding vectors and a composite labeling model. On our constructed circRNA datasets, our model has an average F1 score of 0.790. When assessed on full-length RNA sequences, the proposed model outperforms previous classification-based models by a large margin.

Keywords: RNA–protein binding sites, sequence labeling, convolutional neural network, bidirectional LSTM neural network, deep learning

## INTRODUCTION

Benefitting from the rapid development of high-throughput experimental technologies, transcriptome, proteome, epigenome, and other omics data have accumulated at an unprecedented speed. The multi-omics data have enabled large-scale studies on gene regulation at different levels. In particular, the interactions between RNAs and RNA binding proteins (RBPs) are crucial for understanding post-transcriptional regulation mechanisms (Filipowicz et al., 2008). RNA–RBP interactions play important roles in protein synthesis, gene fusion, alternative mRNA processing, etc. (Bolognani and Perrone-Bizzozero, 2008). The aberrant expression of RBPs and the disruption of RNA–RBP interactions are closely related to various human diseases (Khalil and Rinn, 2011). In the early stage of RNA–RBP interaction studies, the recognition of binding sites mainly relied on the analysis of RNA–protein complexes *via* biophysical methods. As the experimental process is costly and laborious, it is increasingly important to develop automatic tools to predict binding sites.

For protein–protein interactions, both structures and amino acid sequences are commonly used to identify binding sites, with tools including POCKET (Liu and Hu, 2011), Fpocket (Le Guilloux et al., 2009), LIGSITE (Hendlich et al., 1997), etc. The structural feature-based prediction methods exploit protein 3D structures and appropriate geometries to locate potential binding regions. Most structure-based methods assume that proteins bound to the same ligand have similar overall structure and biochemical characteristics, while some researchers found that proteins having the same binding site may have diverse sequences or structures (Muppirala et al., 2011). Sequence-based methods usually utilize amino acid composition, functional domain, secondary structure, and solvent accessibility information (Shen et al., 2007).

#### Edited by:

Lei Deng, Central South University, China

#### Reviewed by:

Xiaofeng Song, Nanjing University of Aeronautics and Astronautics, China Hao Lin, University of Electronic Science and Technology of China, China

#### \*Correspondence:

Yang Yang yangyang@cs.sjtu.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 24 August 2019 Accepted: 25 October 2019 Published: 22 November 2019

#### Citation:

Ju Y, Yuan L, Yang Y and Zhao H (2019) CircSLNN: Identifying RBP-Binding Sites on circRNAs via Sequence Labeling Neural Networks. Front. Genet. 10:1184. doi: 10.3389/fgene.2019.01184

Due to the lack of solved structures for RNA-protein complexes, most of the existing studies have turned to sequence information and machine learning methods for predicting RBP-binding sites on RNAs, like support vector machines (SVMs) (Kumar et al., 2008) and random forest (RF) (Liu et al., 2010). Moreover, deep learning models have emerged in this field (Alipanahi et al., 2015; Pan and Shen, 2017). Deep learning is a data-driven approach that allows automatic learning of the advanced features from data without the need for domain knowledge, by stacking multiple layers of neural networks (LeCun et al., 2015). Compared to traditional machine learning models, it does not require feature engineering and can achieve better performance. A few deep learning methods, including convolutional neural network (CNN) and recurrent neural network (RNN), have been developed to predict RBP-binding sites (Pan and Shen, 2017; Pan et al., 2018).

Although researchers have made some progress in predicting RNA–protein binding sites, current mainstream prediction methods have some limitations.

First, most prediction methods simplify the prediction task to a binary classification problem, i.e., they assign a positive/negative label to a segment of RNA, where the positive label denotes the presence of a binding site. In reality, binding sites on RNAs are sequence fragments that range from tens to hundreds of nucleotides in length. Thus, prediction based on fixed-length fragments may be inaccurate, as it only yields approximate locations of binding sites and cannot specify the length that the sites span.

Second, most of the existing methods predict the interaction between linear RNAs and RBPs, while circular RNAs (circRNAs) have rarely been studied. CircRNAs play an important role in gene regulation and in the development of many complex diseases (Fan et al., 2018). Thanks to advances in sequencing technology, circRNAs have been identified at the whole-genome scale (Song et al., 2016). Moreover, the interplay between circRNAs and proteins or microRNAs has attracted increasing research interest in the biomedical field, resulting in large-scale data on circRNA–RBP interactions from high-throughput experiments such as CLIP-Seq (Dudekula et al., 2016). Thus, models for predicting binding sites on circRNAs are in great demand.

In this study, we propose a sequence labeling neural network model to predict circRNA–protein binding sites, called circSLNN, which is composed of a long short-term memory (LSTM) network, a convolutional neural network (CNN), and a conditional random field (CRF) model. Instead of performing a binary classification on the whole fragment, it assigns a label (bound or unbound) to each position on the fragment. Compared with traditional classifiers, it can predict not only whether the input segment is bound to a given RBP, but also the specific location of binding sites on the segment. Besides, in order to fully utilize the sequence information of circRNAs, we propose to use RNA embeddings learned *via* a word embedding algorithm similar to those used for processing natural languages, where the corpus is extracted from the whole human genome. To the best of our knowledge, this is the first predictor for RNA–protein binding sites using a sequence labeling scheme. The contributions of this study are listed in the following.


## RELATED WORK

### Prediction Based on Traditional Machine Learning Methods

The prediction of molecular interactions has been a hot topic in bioinformatics over the past decades. In particular, protein–protein interactions (PPIs) have been well studied owing to the abundant information that can be utilized in the prediction, e.g., amino acid sequences, functional domains, and gene ontology annotation (Ashburner et al., 2000). Machine learning-based predictors usually consist of two parts, i.e., feature extraction and classification. Similar to PPI, the prediction of RNA–RBP interactions is a typical machine learning problem. However, due to the lack of functional annotation of RNAs, feature extraction mainly relies on RNA sequences or secondary structures. For some types of RNAs, such as circRNAs, which have constrained structures (i.e., covalently closed continuous loops), effective feature extraction from sequences is even more important.

Traditional feature representations of RNA sequences include *k*-tuple composition, pseudo *k*-tuple composition (PseKNC) (Chen et al., 2013), etc. These features are discrete vectors that work with shallow learning models. For instance, Muppirala et al. (2011) used SVMs and random forest methods to predict RNA–RBP interactions. With the rise of deep learning, sequence encoding schemes and deep neural networks have emerged and achieved better prediction performance.

### Prediction Based on Deep Neural Networks

DeepBind (Alipanahi et al., 2015) is a pioneering work in developing deep learning models for RNA–RBP interactions. The model is based on a convolutional neural network, which not only improves prediction accuracy but also reveals new sequence patterns at the binding area. Later, Pan et al. released a series of computational tools, including iDeep (Pan and Shen, 2017), iDeepS (Pan et al., 2018), and iDeepE (Pan and Shen, 2018), which have different feature representations and model architectures. iDeep utilizes five different information sources, i.e., secondary structure information, motif information describing the conserved regions of sequences, CLIP co-binding, region type, and sequence information, to extract high-level abstract features *via* deep learning models. In particular, the sequence information is processed by a CNN (Krizhevsky et al., 2012), while the other four data sources are processed by deep belief networks (Zou and Conzen, 2004). Compared with iDeep, iDeepS reduces the types of data sources and only retains sequence information and secondary structure information. The authors added a bi-directional long short-term memory (BiLSTM) network (Schuster and Paliwal, 1997) to integrate the data, which better preserves contextual information based on the relative positions of nucleotides.

Generally, the performance of deep learning-based methods depends on informative feature representation and powerful model architecture. In this study, we explore both aspects to improve prediction accuracy.

## MATERIALS AND METHODS

### Data Source

To construct a predictor for circRNA–RBP interactions, we collect a standard dataset of RBP-binding sites on circular RNAs from the circRNA Interactome database (Dudekula et al., 2016), which contains sequence information for more than 100,000 human circRNAs, as well as the specific locations of binding sites for different RBPs. Each binding site is represented as an interval from a start index to an end index on the circRNA. Taking the midpoint of each interval as the center, we extend 50 nt upstream and 50 nt downstream. In this way, 101-nt fragments are obtained as positive samples. Then we randomly extract 101-nt segments from the remaining fragments as negative samples. To avoid issues caused by repeated sequences, we remove redundant sequences using CD-HIT (Li and Godzik, 2006). The positive-to-negative ratio is 1:1, and the training-to-test ratio is 5:1.
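The construction of positive samples can be illustrated with a minimal Python sketch. The coordinate convention (0-based, inclusive start and end), the function name, and the decision to skip sites whose window would run past either end of the stored sequence are assumptions of this example; the source does not state how windows near the back-splice junction are handled.

```python
def centered_window(sequence, site_start, site_end, flank=50):
    """Return (fragment, window_start) for a window centered on a binding site.

    The window spans `flank` nt on each side of the interval midpoint,
    i.e. 2 * flank + 1 nucleotides. Returns None when the window would
    extend beyond the stored linear sequence (an assumption of this sketch).
    """
    midpoint = (site_start + site_end) // 2
    window_start = midpoint - flank
    window_end = midpoint + flank + 1  # exclusive end
    if window_start < 0 or window_end > len(sequence):
        return None
    return sequence[window_start:window_end], window_start


# Hypothetical 300-nt circRNA with a binding site at positions 120..140.
toy_circrna = "ACGU" * 75
fragment, offset = centered_window(toy_circrna, 120, 140)
too_close_to_end = centered_window(toy_circrna, 10, 20)
```

Here the site midpoint is 130, so the fragment covers positions 80 to 180 and has the expected length of 101 nt.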

Then we generate standard labels for all samples. For positive samples, we label all the symbols within the binding sites as "I" and all the other locations as "O", while for negative samples we mark all symbols as "O". Here we use the IO tag scheme, where "I" is short for inside (a binding site) and "O" is short for outside, i.e., not a binding site. The BIO format (short for beginning, inside, outside) is a common tagging format in natural language processing. Because text contains many adjacent labeling objects, it is hard to distinguish between them using only the IO scheme. By contrast, in the sequence labeling problem of binding sites, the distribution of binding sites is extremely sparse, and binding segments are usually far from each other. Thus, we use the IO labeling scheme to reduce the number of label types and make the training model easier to converge.
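The IO tagging of a fragment can be sketched as follows. The function name and the coordinate convention (0-based, inclusive site intervals given on the full circRNA) are assumptions made for illustration.

```python
def io_labels(window_start, window_length, sites):
    """Label each position of a fragment as 'I' (inside a site) or 'O'.

    window_start: position of the fragment's first nucleotide on the circRNA.
    sites: list of (start, end) binding-site intervals, inclusive.
    """
    labels = ["O"] * window_length
    window_last = window_start + window_length - 1
    for start, end in sites:
        # Clip the site to the part that overlaps the fragment.
        for pos in range(max(start, window_start), min(end, window_last) + 1):
            labels[pos - window_start] = "I"
    return "".join(labels)


# Positive sample: 101-nt fragment starting at position 80, site at 120..140.
positive = io_labels(80, 101, [(120, 140)])
# Negative sample: no binding site overlaps the fragment.
negative = io_labels(0, 101, [])
```

In the positive example the 21 positions of the site map to fragment indices 40 to 60, and every other position is tagged "O".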

As mentioned in the *Related Work* section, feature representation can have a substantial impact on the performance of both shallow learning and deep learning models. To work with deep models, RNA sequences need to be encoded into numerical vectors, such as one-hot vectors. In recent years, more and more studies on biological sequence analysis have adopted word embedding-based encoding schemes to replace one-hot encoding (Harris and Harris, 2010), as embedding vectors are continuous and high-dimensional, which may capture more context and semantic information in sequences. In our previous study, we proposed the RNA2Vec method to obtain RNA embeddings (Xiao et al., 2018). We regard 10-mer segments as words and train the word embeddings using GloVe (Pennington et al., 2014).
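
To make the "10-mers as words" idea concrete, the sketch below splits a sequence into overlapping 10-mers and assigns each distinct 10-mer an integer index that could later address a row of an embedding matrix. The stride of 1, the reserved index 0, and the helper names `kmers` and `build_vocab` are assumptions; the GloVe training itself is not shown.

```python
def kmers(seq, k=10):
    """Split a sequence into overlapping k-mers with stride 1 (assumed stride)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(sequences, k=10):
    """Map every distinct k-mer to an integer index; 0 is left free for padding."""
    vocab = {}
    for seq in sequences:
        for word in kmers(seq, k):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


seq = "ACGUACGUACGUA"                  # toy 13-nt sequence
words = kmers(seq)
vocab = build_vocab([seq])
print(words)                           # four overlapping 10-mers
print([vocab[w] for w in words])       # [1, 2, 3, 4]
```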

### Model Architecture

In this study, we design a sequence labeling model based on deep neural networks to predict RBP-binding sites on RNAs. We first feed the embedding vectors to a convolutional neural network (Krizhevsky et al., 2012) to extract local features, and then learn the long-distance dependency information among bases through a BiLSTM layer. Finally, the label identification of the entire RNA sequence is completed by the CRF layer (Lafferty et al., 2001). The network structure is shown in **Figure 1**.

### CNN Layer

The convolutional neural network (CNN) (Krizhevsky et al., 2012) is a widely used deep learning architecture. A CNN generates feature maps at different levels of abstraction by stacking convolutional layers. In circSLNN, the CNN serves as a feature extractor on the initial input vectors. Because sequence labeling models predict a label for each symbol in the sequence, whereas the embedding vectors are trained for 10-mers, we adopt a CNN to extract high-level features for each nucleotide in RNA sequences based on the embedding vectors of its surrounding 10-mers, i.e. a window centered on the nucleotide.

Specifically, for each individual nucleotide (except for the first 9 nucleotides), there are 10 fragments of length 10 containing it. Based on the vectors of these 10 fragments, we perform feature extraction *via* a one-dimensional CNN. If the dimensionality of the embedding vectors is *m*, each nucleotide can be represented as a matrix of size 10×*m*, which is fed to the CNN. Before applying the CNN, we need to expand the 101-nt fragments to 110 nt (101 + 10 − 1), so that a sliding window of size 10 yields one window per nucleotide. Here we pad the matrix with zero vectors.
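
One possible reading of this padding step is sketched below with NumPy. It assumes that the 92 10-mer vectors of a 101-nt fragment are padded with 9 zero vectors on each side, giving 110 rows, so that the window for nucleotide *p* holds the 10-mers starting at positions *p* − 9 to *p*. The function name and this exact padding layout are assumptions, not details confirmed by the text.

```python
import numpy as np


def per_nucleotide_windows(kmer_vecs, seq_len, k=10):
    """Stack, for each nucleotide, the embeddings of the k-mers that contain it.

    kmer_vecs: array of shape (seq_len - k + 1, m), one row per k-mer start.
    Returns an array of shape (seq_len, k, m); k-mers that would start before
    the sequence or run past its end are represented by zero vectors.
    """
    m = kmer_vecs.shape[1]
    padded = np.zeros((seq_len + k - 1, m))
    # The k-mer starting at position s is stored at row s + k - 1.
    padded[k - 1:k - 1 + kmer_vecs.shape[0]] = kmer_vecs
    # Rows p .. p + k - 1 correspond to k-mers starting at p - k + 1 .. p.
    return np.stack([padded[p:p + k] for p in range(seq_len)])


rng = np.random.default_rng(0)
E = rng.normal(size=(92, 4))             # toy embeddings: 92 10-mers, m = 4
windows = per_nucleotide_windows(E, seq_len=101)
print(windows.shape)                     # (101, 10, 4)
```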

Let *hj* be the size of the *j*th convolutional kernel, *Xi* be the matrix of the sliding window at the *i*th time step, which consists of the *i*th to the (*i* + *hj* − 1)th columns of the original input. Thus, the features learned by the convolutional layer can be expressed in Eq. 1,

$$c_{ij} = f\left(w_j * X_{i:i+h_j-1} + b_j\right), \quad \forall i \in \{1, 2, \dots, N - h_j + 1\},\ j \in \{1, 2, \dots, n\} \tag{1}$$

where *n* is the number of filters, *N* is the number of columns of the original input, *f*(.) is the activation function, and *wj* and *bj* are the weight matrix and the offset, respectively.
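
Eq. 1 can be written out directly in NumPy, as in the sketch below. For simplicity it assumes that all *n* filters share one width *h* (the text allows a width *hj* per kernel) and uses ReLU for *f*, in line with the activation reported in the experimental settings. The function name is hypothetical.

```python
import numpy as np


def conv1d_features(X, W, b, f=lambda z: np.maximum(z, 0.0)):
    """Evaluate Eq. 1 for filters of a common width h.

    X: (m, N) input whose columns are time steps.
    W: (n, m, h) filter weights, b: (n,) offsets, f: activation (ReLU here).
    Returns C of shape (N - h + 1, n), where C[i, j] corresponds to c_ij.
    """
    n, m, h = W.shape
    N = X.shape[1]
    C = np.empty((N - h + 1, n))
    for i in range(N - h + 1):
        window = X[:, i:i + h]   # columns i .. i + h - 1 of the input
        # Sum of elementwise products between each filter and the window.
        C[i] = f(np.tensordot(W, window, axes=([1, 2], [0, 1])) + b)
    return C


X = np.ones((3, 5))
W = np.ones((2, 3, 2))
b = np.array([0.0, -10.0])
C = conv1d_features(X, W, b)
print(C.shape)   # (4, 2)
print(C[0])      # [6. 0.]
```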

#### BiLSTM Layer

The mechanism of RNA–RBP-interaction is not yet fully understood, and various factors affect the binding between RNAs and RBPs, including not only local structural motifs and binding domains but also long-range dependencies among nucleotides. In our model, the CNN component serves as a feature extractor on the raw input and learns the context information in local regions. To further exploit sequence information, we adopt a bi-directional long short-term memory (BiLSTM) network (Schuster and Paliwal, 1997). A BiLSTM combines a forward LSTM and a backward LSTM, where LSTM is a special type of recurrent neural network (RNN). It is often used to model context information in natural language processing tasks. Here, the BiLSTM is used to learn the relationship between the bases before and after the current position, and to capture longer-distance dependencies.

Let *x*<sub>*t*</sub> be the input vector of the *t*th time step, and let *s*<sub>*t*</sub> and *s*′<sub>*t*</sub> be the hidden states of the forward and backward calculations of the *t*th time step. Then the calculations of *s*<sub>*t*</sub> and *s*′<sub>*t*</sub> depend on *s*<sub>*t*−1</sub> and *s*′<sub>*t*+1</sub>, respectively, as shown in Eqs. 2 and 3.

$$\mathbf{s}\_t = \mathbf{g} \left( \mathbf{U} \mathbf{x}\_t + \mathbf{W} \mathbf{s}\_{t-1} \right) \tag{2}$$

$$\mathbf{s}'\_{t} = \mathbf{g}(\mathbf{U}'\mathbf{x}\_{t} + \mathbf{W}'\mathbf{s}'\_{t+1}) \tag{3}$$

where *U* and *W* are the weight matrices of the input and hidden states in the forward pass. *U*′ and *W*′ are the weight matrices of the input and hidden states in the backward pass.

The final output *ot* of step *t* is a combination of a forward hidden layer and a backward hidden layer, defined as follows.

$$o\_t = h(Vs\_t + V's\_t')\tag{4}$$

where *V* and *V*′ are the weight matrices of the hidden layers to the output layer in forward pass and backward pass, respectively.
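
Eqs. 2 to 4 describe a plain bidirectional recurrence; the gating inside an LSTM cell is not shown in these equations. The NumPy sketch below implements exactly this simplified recurrence, with tanh assumed for *g*, the identity assumed for *h*, and zero initial states. All names and dimensions are illustrative.

```python
import numpy as np


def birnn(xs, U, W, V, U_b, W_b, V_b, g=np.tanh, h=lambda z: z):
    """Plain bidirectional RNN following Eqs. 2-4 (no LSTM gates).

    xs: (T, d_in); U, U_b: (d_h, d_in); W, W_b: (d_h, d_h); V, V_b: (d_out, d_h).
    The *_b arguments play the role of the primed matrices of the backward pass.
    """
    T, d_h = xs.shape[0], W.shape[0]
    s = np.zeros((T, d_h))        # forward states s_t
    s_b = np.zeros((T, d_h))      # backward states s'_t
    prev = np.zeros(d_h)
    for t in range(T):                       # Eq. 2
        prev = g(U @ xs[t] + W @ prev)
        s[t] = prev
    nxt = np.zeros(d_h)
    for t in reversed(range(T)):             # Eq. 3
        nxt = g(U_b @ xs[t] + W_b @ nxt)
        s_b[t] = nxt
    return h(s @ V.T + s_b @ V_b.T)          # Eq. 4 for all time steps


rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 4))
U, W, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
out = birnn(xs, U, W, V, U, W, V)   # same weights in both directions, for brevity
print(out.shape)                    # (6, 2)
```

With identical forward and backward weights, reversing the input simply reverses the output, which is a convenient sanity check for the two passes.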

### CRF Layer

As mentioned in the *CNN Layer* and *BiLSTM Layer* sections, CNN and RNN have their respective advantages. The hybrid CNN-RNN architecture has been proposed in previous studies and has achieved much better performance than using CNN or RNN alone. For instance, both CRIP (Zhang et al., 2018) and iDeepS (Pan et al., 2018) are hybrid CNN-RNN models, and both use LSTM for classification. CRIP feeds the LSTM outputs of all time steps to a fully-connected layer to obtain the decision result, while iDeepS uses the output of the last time step for classification. Based on the LSTM output at each time step, it is straightforward to obtain sequence labeling results. However, the raw outputs without any constraint are often meaningless, e.g. OIOI … OOI, because binding sites are known to be continuous regions on RNA sequences. To avoid such cases, we add a conditional random field (CRF) layer to process the output of the BiLSTM. The CRF layer predicts the probability of the entire label sequence rather than the probability of each individual tag. It can add constraints to the predicted labels to ensure that the output labels are legal. These constraints are learned automatically by the CRF layer during training, so the probability of illegal sequences in the prediction phase is greatly reduced. Specifically, the CRF layer calculates the conditional probability shown in Eq. 5,

$$P(y_1, \dots, y_n \mid x_1, \dots, x_n) = P(y_1, \dots, y_n \mid \mathbf{x}), \quad \mathbf{x} = (x_1, \dots, x_n) \tag{5}$$

where *P*(*y*|*x*) is the probability that the predicted label sequence is *y* given the input *x*, and *xi* is the output of the BiLSTM layer at the *i*th time step.

In order to estimate the probability, CRF makes two assumptions. First, the distribution is an exponential family distribution. Second, the association between the outputs occurs only at adjacent locations, and the association is exponentially additive. This allows the probability to be calculated by the probability density function as shown in Eq. 6.

$$f(y_1, \dots, y_n; \mathbf{x}) = h(y_1; \mathbf{x}) + g(y_1, y_2; \mathbf{x}) + h(y_2; \mathbf{x}) + g(y_2, y_3; \mathbf{x}) + h(y_3; \mathbf{x}) + \dots + g(y_{n-1}, y_n; \mathbf{x}) + h(y_n; \mathbf{x}) \tag{6}$$

where *f*, *g*, and *h* are probability density functions and can be considered as scoring functions. The overall score *f* of all tags can be broken down into the sum of the score *h* of each individual tag and the score *g* of each pair of adjacent tags. Since LSTM is capable of learning the mapping from the input *x* to the output *y*, we assume that the function *g* is independent of *x*, and the final probability distribution can be formulated as in Eq. 7,

$$P(y_1, \dots, y_n \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left(h(y_1; \mathbf{x}) + \sum_{k=1}^{n-1} \left[ g(y_k, y_{k+1}) + h(y_{k+1}; \mathbf{x}) \right]\right) \tag{7}$$

where the single-label scoring function *h*(*yi* ; *x*) is fitted by the BiLSTM layer, thus completing the construction of the CRF layer.
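
The sketch below evaluates Eq. 7 for a toy two-tag problem (0 for "O", 1 for "I"). Here `h` is a matrix of per-position tag scores, standing in for the BiLSTM output, and `g` is a tag-to-tag transition matrix. The normalizer *Z*(*x*) is computed with the standard forward recursion in log space. All names and numbers are illustrative assumptions; this is not the authors' implementation.

```python
import itertools
import numpy as np


def sequence_score(tags, h, g):
    """Exponent of Eq. 7: h(y_1) + sum_k [g(y_k, y_{k+1}) + h(y_{k+1})]."""
    score = h[0, tags[0]]
    for k in range(len(tags) - 1):
        score += g[tags[k], tags[k + 1]] + h[k + 1, tags[k + 1]]
    return score


def log_partition(h, g):
    """log Z(x), computed with the forward recursion in log space."""
    alpha = h[0].copy()
    for k in range(1, h.shape[0]):
        scores = alpha[:, None] + g          # scores[a, b]: previous tag a, next tag b
        mx = scores.max(axis=0)
        alpha = mx + np.log(np.exp(scores - mx).sum(axis=0)) + h[k]
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())


def sequence_prob(tags, h, g):
    """P(y_1, ..., y_n | x) as in Eq. 7."""
    return float(np.exp(sequence_score(tags, h, g) - log_partition(h, g)))


h0 = np.zeros((4, 2))                        # uninformative per-position scores
g0 = np.array([[1.0, -1.0], [-1.0, 1.0]])    # transitions that favor keeping the tag
print(sequence_prob((0, 1, 1, 0), h0, g0) > sequence_prob((0, 1, 0, 1), h0, g0))
```

With transition scores that reward staying in the same tag, a fragmented labeling such as OIOI receives a lower probability than a contiguous one such as OIIO, which is the kind of constraint the CRF layer is meant to learn.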

### EXPERIMENTAL RESULTS

### Experimental Settings

In circSLNN, the number of convolution kernels in the CNN layer is 128, the convolution window size is 10, the hidden layer size of the BiLSTM layer is 256, and the activation function used by the middle layer is ReLU. The optimization algorithm is RMSProp, with batch size 512 and epoch number 20, using the early stopping mode. The performance metrics include precision, recall and *F*1, which are computed based on the labels of individual nucleotides.
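
Nucleotide-level precision, recall and *F*1 can be computed as in the sketch below. It assumes that "I" is treated as the positive class; the text does not state whether the scores are computed for one class or averaged over classes, so this is only one plausible reading.

```python
def prf1(true_tags, pred_tags, positive="I"):
    """Precision, recall and F1 over individual positions for one positive tag."""
    pairs = list(zip(true_tags, pred_tags))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Predicted site is shifted by one position relative to the true site.
scores = prf1("OOIIIIOO", "OIIIIOOO")
print(scores)   # (0.75, 0.75, 0.75)
```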

### Prediction Performance of circSLNN

We perform experiments on all 37 datasets described in the **Data Source** section. For each dataset, we perform 6-fold cross-validation. The original datasets are divided into 6 folds of approximately equal size (5 folds for training and validation, and one fold for testing). The accuracies shown in **Table 1** are averaged over the 6 independent tests.

As can be seen, circSLNN achieves high prediction accuracy for most RBPs. The *F*1 scores are higher than 0.8 on 24 out of the 37 datasets, showing the effectiveness of the sequence labeling model.

### Data Encoding Analysis

In circSLNN, the inputs are pretrained embedding vectors for *k*-mers, while most of the existing methods for predicting RBP-binding sites use one-hot encoding, e.g. iDeep and DeepBind. In order to investigate the impact of the encoding scheme on model performance, we compare one-hot vectors and our embedding vectors on the same datasets. We randomly choose 5 RBPs. **Figure 2** depicts the comparison results.

The pretrained embedding vectors perform markedly better than the one-hot vectors: the average *F*1 score increases by 0.087. This result suggests that the word embedding encoding method can effectively extract the feature information of RNA sequences from the human genome database, and can effectively improve the performance of the binding site predictor.

### The Role of CNN Layer

Compared with ordinary text sequence labeling tasks, we introduce a CNN layer to extract local features from RNA sequences. The purpose of the CNN layer is to characterize the local sequence pattern surrounding the base to be labeled, and to encode each individual base with richer information. Here we assess the contribution of the CNN by removing it from the model. The inputs of the resulting LSTM-CRF model are the pretrained *k*-mer embedding vectors. Specifically, for each base, we choose the embedding vector of the fragment centered on the base as its feature vector. The subsequent training of the LSTM and CRF is the same as in circSLNN. We compare the performance of the two methods on five randomly selected datasets, as shown in **Figure 3**.

As can be seen, introducing the CNN layer increases the average *F*1 by 0.021. Although the overall improvement appears modest, we find that the CNN contributes more on the difficult datasets, e.g. HUR and LIN28B, than on the easy datasets, indicating the importance of further feature learning from raw inputs.

### Comparison of Different Sequence Labeling Schemes

The sequence labeling scheme used in this study is the IO tag scheme, not the BIO or BME scheme (BME is short for begin, middle and end) that is commonly used in text labeling tasks (Carpenter, 2009). The reason is that binding sites generally span tens of bases in length, whereas common text labeling objects consist of only several words, such as a typical place name in the named entity recognition (NER) task, 'Shanghai Jiao Tong University'. In order to assess the performance of these three tag systems, we conduct experiments on five randomly selected protein datasets, as shown in **Figure 4**.

As can be seen, the IO tag system outperforms BIO and BME by a large margin, while BIO and BME perform similarly. We find that the B-coded labeling systems can hardly find tag B in the test set, i.e. their results contain only tag I and tag O. The reason is that the B tag is extremely sparse due to the long binding sites, which leads to an imbalanced distribution of tags and makes tag B very hard to recognize.

### Investigation on Positive-to-Negative Data Ratio

In our experiments, the positive-to-negative ratio for all datasets is 1:1, which is the same as in previous studies (Pan and Shen, 2017; Zhang et al., 2018). However, human circRNAs can be tens of thousands of bases long and include 1–5 exons (Memczak et al., 2013), while the binding sites are small regions that are very sparse on the sequences. That is to say, the true ratio between positive and negative data is very small, leading to an extremely imbalanced problem, so most studies adopt a sampling strategy to control the ratio. Here, to get closer to the actual situation, we compare the performance of circSLNN under different positive-to-negative ratios, i.e. 1:1, 1:2, and 1:4. The results are shown in **Figure 5**.

Note that although adding negative samples results in data imbalance, the increase in data volume is beneficial for training the model. As shown in **Figure 5**, the accuracies on some datasets, e.g. LIN28B and TDP43, even increase when the expanded negative set is used. Generally, the performance of circSLNN varies little when the negative set is expanded several times, showing the robustness of the model.

#### Comparison With the Existing Methods on Sequence Labeling for Full-Length circRNAs

In order to assess the performance of circSLNN in real cases, we conduct experiments on full-length circRNAs instead of sampled segments in the datasets, and compare it with the state-of-the-art predictors for RNA–RBP binding sites.

To the best of our knowledge, circSLNN is the first sequence labeling model for identifying RBP-binding sites on circRNAs. Therefore, for the convenience of comparison, we need to process the output of the existing classification models, i.e. converting the labels for segments into labels for individual nucleotides. Specifically, for a full-length RNA, we divide it from beginning to end into 101-nt fragments. For each fragment, the circSLNN model is used to predict whether each base belongs to the binding site. If it belongs, it is marked as 1; otherwise, it is marked as 0. For the classification model, whether the fragment belongs to the binding site is predicted. If the fragment is predicted as positive, then all the bases in the sequence are labeled by 1, otherwise all bases are labeled by 0. In this way, we obtain the label sequences of full-length RNAs predicted by two different models. By comparing the predicted sequence labels with the actual labels, we can calculate the *F*1 score.

We collect a dataset of 100 full-length circRNAs that are bound to different RBPs. They are first segmented into 101-nt segments, and then fed to the classification models and sequence labeling model, respectively, to predict the binding sites. *F*1 scores are computed based on individual bases. The results are shown in **Figure 6**.
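
The conversion of segment-level predictions into per-base labels can be sketched as follows. The helper names are hypothetical, and keeping a shorter final segment when the RNA length is not a multiple of 101 is an assumption; the text does not describe how the tail is handled.

```python
def split_segments(seq, size=101):
    """Cut a full-length sequence into consecutive segments of `size` nt."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def expand_segment_predictions(seq_len, segment_is_positive, size=101):
    """Turn one boolean prediction per segment into one 0/1 label per base."""
    labels = []
    for idx, positive in enumerate(segment_is_positive):
        seg_len = min(size, seq_len - idx * size)   # last segment may be shorter
        labels.extend([1 if positive else 0] * seg_len)
    return labels


rna = "ACGU" * 62 + "AC"                  # toy 250-nt sequence
segments = split_segments(rna)
print([len(s) for s in segments])         # [101, 101, 48]

# A classifier that calls only the second segment positive labels all of its bases 1.
base_labels = expand_segment_predictions(len(rna), [False, True, False])
print(sum(base_labels))                   # 101
```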

As can be seen from the results, circSLNN achieves the highest *F*1 on almost all circRNAs in the dataset. The average *F*1 score of circSLNN reaches 0.568, while the average *F*1 scores of iDeepE (Pan and Shen, 2018) and CRIP (Zhang et al., 2018) are 0.504 and 0.494, respectively. This suggests that the sequence labeling model can more accurately identify the position of the binding site, which is important for further verification of the interaction regions using biological experiments.

Despite the advantages over other methods, the overall accuracy is much lower than that computed on the short segments (the average *F*1 of 37 test sets is 0.790 as shown in **Table 1**). This is mainly due to the extremely imbalanced class distribution in this new test set. In the training sets, the positive-to-negative ratio is 1:1, whereas when the full-length circRNAs are segmented, most of the segments contain no binding site at all. Although the model can handle imbalanced distribution to some extent, as described in the *Investigation on Positive-to-Negative Data Ratio* section, the performance decreases greatly when the dataset is severely imbalanced.

#### DISCUSSION

This study aims to develop a machine learning model for identifying RBP-binding sites on RNAs. The existing prediction methods treat this problem as a classification problem: they divide RNA sequences into fragments and predict whether or not binding sites exist in the fragments. To further predict the location and length of binding sites, we propose a sequence labeling model, circSLNN, which assigns a label to each base in a fragment instead of to the whole fragment, so as to provide more information about the binding regions. Besides, considering the lack of tools designed for circRNAs, circSLNN is specially trained on circRNA datasets. Although trained on circRNAs, circSLNN provides a general sequence labeling framework that can be applied to all types of RNAs.

Despite the enhancement of performance, this study is still a preliminary exploration of characterizing binding sites on circRNAs. The first limitation lies in the input features. The interaction between RNAs and other molecules is known to involve complex mechanisms, especially for circRNAs, which have not been well studied. However, the prediction of circSLNN is based only on circRNA sequences, which is a very limited information source. One future research direction is to incorporate more biological properties or domain knowledge related to circRNAs.

Second, although we have used a hybrid neural network, the proposed model structure is relatively simple. In recent years, not only new embedding training methods but also new deep architectures have emerged in the field of natural language processing (Devlin et al., 2018; Peters et al., 2018), which have achieved substantial improvement on a variety of tasks. Many of them could be adapted to biological sequence analysis, so our network structure still has a lot of room for improvement.

Third, the lengths of circular RNA sequences vary greatly, ranging from a few hundred to several millions, which seriously affects the training of the model. Most of the predictors, including circSLNN, are trained on short segments of RNAs, which may lose some information about whole RNAs and lead to a high false-positive rate. Better predictions based on full-length RNAs or longer segments are the focus of our future work.

### CONCLUSION

This study proposes a sequence labeling neural network for predicting RBP-binding sites on circRNAs, called circSLNN. To fully exploit sequence information, we train continuous embedding vectors for 10-mers of RNAs using the whole human genome sequences, and we construct a hybrid CNN–LSTM–CRF network to perform the sequence labeling task. The purpose of using a hybrid model is to combine the advantages of two deep architectures and to obtain better high-level abstract feature representations for classification. We train circSLNN on 37 datasets of circRNA fragments, and the average *F*1 score is 0.790. The experimental results show that it is feasible to use the sequence labeling method for identifying binding sites on circRNAs. Both the RNA fragment embedding vectors and the hybrid architecture contribute to improved performance. Compared with the classification model, it can more accurately label the position of the binding site on the full-length RNAs. The proposed model will help researchers study the circRNA–RBP-interactions and reveal regulatory functions of circRNAs.

### DATA AVAILABILITY STATEMENT

All datasets generated/analyzed for this study are available at https://github.com/JuYuqi/circSLNN.

### AUTHOR CONTRIBUTIONS

YJ, LY, YY and HZ designed the model. YJ and LY implemented the model and performed the experiments. YJ, LY, YY and HZ analyzed the results and drafted the article. YY and HZ supervised this work.

### FUNDING

This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100), Key Projects of National Natural Science Foundation of China (U1836222 and 61733011) and the National Natural Science Foundation of China (No. 61972251).




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Ju, Yuan, Yang and Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Frin: An Efficient Method for Representing Genome Evolutionary History

*Yan Hong and Juan Wang\**

School of Computer Science, Inner Mongolia University, Hohhot, China

Phylogenetic analysis is important in understanding the process of biological evolution, and phylogenetic trees are used to represent the evolutionary history. Each taxon in a phylogenetic tree has at most one parent, so phylogenetic trees cannot express the complex evolutionary information implicit in phylogeny. Phylogenetic networks can be used to express genome evolutionary histories. Therefore, it is of great significance to study the construction of phylogenetic networks. The Cass algorithm is an efficient method for constructing phylogenetic networks because it can construct a much simpler network. However, Cass relies heavily on the order of input data, i.e. different networks can be constructed for the same dataset with different input orders. Based on the frequency and incompatibility degree of taxa, we propose an efficient improved version of Cass, called Frin. The experimental results show that the networks constructed by Frin are not only simpler than those constructed by other methods, but also more consistent when the data are input in different orders. Furthermore, the phylogenetic network constructed by Frin is closer to the original information described by phylogenetic trees. Frin has been built as a Java software package and is freely available at https://github.com/wangjuanimu/Frin.

### Edited by:

Lei Deng, Central South University, China

#### Reviewed by:

Junwei Luo, Henan Polytechnic University, China Junwei Han, Harbin Medical University, China Dong Chen, Heilongjiang Institute of Technology, China

\*Correspondence: Juan Wang wangjuan@imu.edu.cn

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 18 October 2019 Accepted: 14 November 2019 Published: 06 December 2019

#### Citation:

Hong Y and Wang J (2019) Frin: An Efficient Method for Representing Genome Evolutionary History. Front. Genet. 10:1261. doi: 10.3389/fgene.2019.01261

Keywords: evolution, phylogenetic network, incompatibility degree, frequency, genome

## INTRODUCTION

Studying the evolution of species helps humans to reveal biological secrets and to prevent and treat diseases. The purpose of phylogenetic analysis is to reveal the evolutionary relationships between different species or taxa and to study the evolution of life on Earth (Huson and Scornavacca, 2011). Evolutionary history resembles the growth of a tree, and all species can be traced back to a common ancestor. It therefore makes sense to use trees to represent evolutionary history, in which each node except the root has only one parent. However, a number of reticulate evolutionary events, such as reversal, translocation, and fusion, have resulted in some taxa having more than one parent (Gusfield et al., 2007a; Gusfield et al., 2007b; Kelk and Scornavacca, 2014; Wu, 2010; Van Iersel et al., 2017). Such a complex evolutionary history can be represented by phylogenetic networks (Doolittle, 1999; Nakhleh, 2010; Yu and Nakhleh, 2015; Huber et al., 2018). A network is a generalization of a tree in that it contains nodes with in-degree greater than one (Iersel et al., 2009). Phylogenetic networks are functionally classified into implicit networks and explicit networks (Huson et al., 2007; Huson and Rupp, 2008; Van Iersel et al., 2010). Implicit networks can be used to represent conflicting patterns due to model misspecification, whereas explicit networks can capture reticulate evolutionary events.

In recent years, many methods have been developed for constructing phylogenetic networks (Albrecht 2015; Albrecht et al., 2012; Bordewich et al., 2007; Francis et al., 2018; Gambette et al., 2017; Linz and Semple, 2009; Makarenkov et al., 2006; Mirzaei and Wu, 2016; Jansson and Sung, 2006). The cluster network method uses the network-popping algorithm to construct an implicit network, which can be drawn as a cladogram (Huson and Rupp, 2008). The galled network method uses the seed-growing algorithm to solve the RMCS (Restricted Maximum Compatible Subset) problem for the input dataset, and then constructs a phylogenetic network (Huson et al., 2007). The relationships between phylogenetic trees and networks are the basis for the reconstruction and verification of phylogenetic networks. The TCP algorithm solves the problem of whether certain existing phylogenetic trees are displayed in a phylogenetic network (Gunawan et al., 2016; Gunawan et al., 2018). Cass is an efficient method for constructing a phylogenetic network from any input trees, and it is able to construct much simpler networks than other available methods (Van Iersel et al., 2010). However, Cass often constructs different networks for the same dataset when the data are input in different orders, and the networks it constructs represent a lot of redundant information in addition to the original information. Both factors limit the practical applicability of Cass. Lnetwork improves Cass by fixing the order in which taxa are removed during the construction of phylogenetic networks, which shortens the running time and reduces the dependence on the input data order (Wang et al., 2013a). BIMLR is another improved version of Cass, which considers the incompatibility of taxa during the construction of phylogenetic networks (Wang et al., 2013b). Cass, Lnetwork, and BIMLR share significant flexibility: they are not restricted to binary input trees, nor to trees on the same taxa set. In addition, they can construct simpler networks for the same input than other methods, although they are relatively slow. Therefore, these three methods are efficient and widely used in the construction of phylogenetic networks.

In this paper, we introduce Frin, another improved version of the Cass algorithm. Like Cass, it constructs phylogenetic networks from phylogenetic trees given as input. Experiments show that Frin is less dependent on the input data order and runs faster than Cass. Moreover, Frin constructs a simpler network than other available methods.

#### PRELIMINARIES

#### Related Knowledge

Given a set of taxa *X*, a subset of *X*, excluding the empty set and the complete set, is called a cluster. A cluster *C* is non-trivial if it contains more than one element. Two clusters *C*′<sub>1</sub> and *C*′<sub>2</sub> are compatible if either *C*′<sub>1</sub> ∩ *C*′<sub>2</sub> = ∅ or *C*′<sub>1</sub> ⊂ *C*′<sub>2</sub> or *C*′<sub>2</sub> ⊂ *C*′<sub>1</sub>. Otherwise, they are incompatible. A set of clusters *Y* on *X* is said to be compatible if every pair of its clusters is compatible. An incompatible cluster set is represented by an incompatibility graph *IG*(*Y*) = (*V*, *E*), which consists of a node set and an edge set. The node set consists of all the non-trivial clusters in *Y*, and the edge set consists of edges connecting incompatible clusters. The set of clusters represented by a rooted phylogenetic tree is compatible; conversely, a rooted phylogenetic tree can be constructed from any compatible cluster set.
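
The compatibility test and the incompatibility graph can be sketched in a few lines. Clusters are modeled as Python sets of taxon names; the function names and the toy clusters are assumptions made for illustration.

```python
from itertools import combinations


def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return not (c1 & c2) or c1 <= c2 or c2 <= c1


def incompatibility_graph(clusters):
    """Nodes are the non-trivial clusters; edges join pairs of incompatible clusters."""
    nodes = [frozenset(c) for c in clusters if len(c) > 1]
    edges = [(a, b) for a, b in combinations(nodes, 2) if not compatible(a, b)]
    return nodes, edges


# Toy clusters on the taxa set X = {a, b, c, d}.
Y = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]
nodes, edges = incompatibility_graph(Y)
print(len(nodes))   # 3 (the singleton {d} is trivial)
print(edges)        # the single incompatible pair {a, b} and {b, c}
```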

Suppose that *N* = (*V*, *E*) is a network on taxa set *X*, and let *δ*<sup>−</sup>(*v*) denote the in-degree of node *v*. We introduce a concept used to describe the complexity of a network, called the reticulation number. The reticulation number of a network is not necessarily equal to the number of reticulate nodes. It is defined as:

$$\sum_{v \in V,\ \delta^-(v) > 0} \left(\delta^-(v) - 1\right) = |E| - |V| + 1$$

If each connected component of a network has a reticulation number of at most *k*, we call it a *level-k* network. A level-*k* network that does not contain cut nodes is called a simple level-≤*k* network. A node is a cut node if its removal disconnects the graph.
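
The reticulation number can be checked on a small example. The sketch below computes both sides of the defining identity for a network given as a list of (parent, child) edges with a single root; the edge-list representation and the toy network are assumptions for illustration.

```python
def reticulation_number(edges):
    """Return both sides of the identity for a single-rooted network.

    The first value sums (in-degree - 1) over all nodes with in-degree > 0;
    the second value is |E| - |V| + 1.
    """
    indegree, nodes = {}, set()
    for parent, child in edges:
        nodes.update((parent, child))
        indegree[child] = indegree.get(child, 0) + 1
    by_sum = sum(d - 1 for d in indegree.values())
    by_count = len(edges) - len(nodes) + 1
    return by_sum, by_count


# Toy network: root r, tree nodes a and b, reticulate node h with parents a and b.
network = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
           ("a", "x"), ("b", "y"), ("h", "z")]
print(reticulation_number(network))   # (1, 1)
```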

Each phylogenetic tree *T* is uniquely defined by its set of clusters. In a phylogenetic tree, an edge *e* = (*u*, *v*) represents the cluster containing those taxa that are descendants of *v*. Similarly, a phylogenetic network represents clusters in the soft-wired sense or in the hard-wired sense. An edge (*u*, *v*) of a network *N* represents the cluster *C* in the hard-wired sense if *C* equals the set of taxa that are descendants of *v*. The network *N* represents the cluster *C* in the soft-wired sense if, after switching on exactly one incoming edge of each reticulate node and switching off the others, *C* equals the set of all taxa that can be reached from *v* through the edges that remain switched on. In this article, we study representation in the soft-wired sense, whose pseudocode is shown in Algorithm 1.

ALGORITHM 1 | The clusters represented by a network in the soft-wired sense.


```
22. continue: soft (N, i-1, j)
23. end if
24. end for
25. return the cluster set Y
End
```
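
Because only the tail of Algorithm 1 is available here, the sketch below is not a transcription of it. It is a brute-force illustration of the soft-wired definition: it enumerates every way of keeping one incoming edge per reticulate node and collects the taxa reachable below each node. Its running time is exponential in the number of reticulate nodes, and the edge-list representation and function name are assumptions.

```python
from itertools import product


def softwired_clusters(edges, leaves):
    """Brute-force soft-wired clusters of a rooted network given as (parent, child) edges."""
    leaves = frozenset(leaves)
    parents, nodes = {}, set()
    for u, v in edges:
        parents.setdefault(v, []).append(u)
        nodes.update((u, v))
    reticulate = [v for v, ps in parents.items() if len(ps) > 1]

    clusters = set()
    # A switching keeps exactly one incoming edge for every reticulate node.
    for choice in product(*(parents[v] for v in reticulate)):
        kept = dict(zip(reticulate, choice))
        children = {}
        for u, v in edges:
            if v not in kept or kept[v] == u:
                children.setdefault(u, []).append(v)

        def reachable_taxa(node):
            if node in leaves:
                return frozenset([node])
            found = frozenset()
            for child in children.get(node, []):
                found |= reachable_taxa(child)
            return found

        for node in nodes:
            cluster = reachable_taxa(node)
            # By definition a cluster is neither empty nor the full taxa set.
            if cluster and cluster != leaves:
                clusters.add(cluster)
    return clusters


network = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
           ("a", "x"), ("b", "y"), ("h", "z")]
found = softwired_clusters(network, {"x", "y", "z"})
print(sorted("".join(sorted(c)) for c in found if len(c) > 1))   # ['xz', 'yz']
```

In this toy network the reticulate node can hang below either parent, so the network represents both {x, z} and {y, z} in the soft-wired sense, which no single tree can do.
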
Cass, Lnetwork, BIMLR, and Frin all take a set of trees as input to construct a phylogenetic network. They first compute all clusters represented by the input trees, and then construct a phylogenetic network representing those clusters. Assume that *Y* is the cluster set represented by the input, *N* is the constructed network, and *Y′* is the cluster set represented by *N*. *Y′* contains every cluster in *Y* and may contain additional ones. The clusters in *Y′* − *Y* are called the redundant clusters. Both the reticulation number and the number of redundant clusters describe the complexity of a network. The best phylogenetic network should have a smaller reticulation number and fewer redundant clusters.

Suppose that *N* is a network on taxa set *X*, and *e* = (*u*, *v*) is an edge of *N* with parent node *u* and child node *v*. If every path from the root node to *v* passes through *u*, we call *u* a stable ancestor of *v*; otherwise, it is an unstable ancestor. For an edge *e* = (*u*, *v*), let P(*e*) = {*x*∈*X* | *x* is the stable ancestor on *v*}, Q(*e*) = {*x*∈*X* | *x* is the unstable ancestor on *v*}, S(*e*) = {*x*∈*X* | *x* is not a descendant of *v*}. We call {P(*e*), Q(*e*), S(*e*)} the tripartition of *e*. Θ(*N*) denotes the set of all tripartitions of network *N*. Given two networks *N*1 and *N*2, the tripartition distance between them is computed as |Θ(*N*1) Δ Θ(*N*2)|/2, where Δ denotes the symmetric difference. The tripartition distance measures the topological difference between two phylogenetic networks. In this paper, we use the tripartition distance to measure the dissimilarity of phylogenetic networks.
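
Once the tripartition sets of two networks are available, the distance itself is a symmetric difference of two sets. In the sketch below each tripartition is stored as a tuple of three frozensets; the tripartitions are made-up toy values, not derived from real networks, and the helper names are hypothetical.

```python
def tripartition_distance(theta1, theta2):
    """Half the size of the symmetric difference of two tripartition sets."""
    return len(set(theta1) ^ set(theta2)) / 2


def tri(p, q, s):
    """Store one tripartition {P(e), Q(e), S(e)} as a hashable tuple of frozensets."""
    return (frozenset(p), frozenset(q), frozenset(s))


# Toy tripartition sets on taxa {a, b, c, d}; they differ in one tripartition each.
theta_n1 = {tri("ab", "", "cd"), tri("c", "", "abd"), tri("a", "b", "cd")}
theta_n2 = {tri("ab", "", "cd"), tri("c", "", "abd"), tri("b", "a", "cd")}
print(tripartition_distance(theta_n1, theta_n2))   # 1.0
```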

### Cass Algorithm

We briefly describe the Cass algorithm below. Given a set of clusters *Y* on taxa *X*, the Cass algorithm proceeds in four steps:

Step 1: Cass finds the non-trivial connected components *Y*1,…,*Yp* of the incompatibility graph *IG*(*Y*). Then, Cass collapses the maximal ST-sets of each non-trivial connected component *Yi* and obtains *Yi*′. These terms are defined as follows. Given a taxa set *X* and a subset *S*⊂*X*, removing from each cluster *C*∈*Y* the elements that are not in *S* yields a cluster set called the restriction of *Y* to *S*, denoted by *Y*|*S*. Given |*S*| > 1, if *S* is compatible with each cluster of *Y* and the clusters of *Y*|*S* are pairwise compatible, *S* is called a strict tree set (ST-set) of *Y*. An ST-set is maximal if it is not properly contained in another ST-set.

Step 2: Cass(*k*) constructs a simple level-≤*k* network for each *Yi*′, which is the crucial step of the Cass algorithm. For each non-trivial connected component, Cass(*k*) loops over all taxa, removes a taxon from each cluster, and collapses all maximal ST-sets of the remaining cluster set. Cass(*k*) repeats these operations up to *k* times, until the remaining cluster set is compatible and a phylogenetic tree can be constructed. The removed taxa are then added to the phylogenetic tree as children of reticulate nodes, which yields a simple level-≤*k* network.

Step 3: For each *i*∈{1,…,*p*}, Cass removes all clusters that are in *Yi* and adds a cluster *Xi* together with each maximal subset *X*⊂*Xi* that is not separated by *Yi*. The resulting sets form the cluster set *C*′′. Cass then constructs a rooted phylogenetic tree *T* for *C*′′, which is the backbone of the resulting network.

Step 4: Cass adds all the simple level-≤*k* networks constructed in Step 2 to the rooted phylogenetic tree *T* using the ancestor-node displacement method.

When Cass starts constructing a simple level-≤*k* network, it does not know the level of the network. Thus, it first sets *k* = 0 and runs Cass(0), which tries to construct a simple level-≤0 network. If such a network exists, it outputs the result and halts. Otherwise, Cass sets *k* = *k* + 1 and runs Cass(1), Cass(2),…, Cass(*k*), until the constructed network represents the given cluster set in the soft-wired sense. This process is very time-consuming, because Cass(*k*) loops over all taxa and repeatedly attempts to remove each taxon. The selection of removed taxa is highly uncertain, which makes the algorithm depend heavily on the order of the input data and also slows down the construction.
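The ST-set test used in Step 1 can be sketched as follows. The sketch assumes the usual notion of compatibility (two clusters are compatible when they are disjoint or one contains the other), which the text does not spell out, and takes the restriction *Y*|*S* to be each cluster intersected with *S*. The function names and the sample cluster set are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """Assumed definition: clusters are compatible if disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def restrict(clusters, subset):
    """Y|S: keep only the taxa of S in each cluster, dropping empty results."""
    restricted = {frozenset(c) & subset for c in clusters}
    restricted.discard(frozenset())
    return restricted


def is_st_set(subset, clusters):
    """True if `subset` is a strict tree set (ST-set) of `clusters`."""
    subset = frozenset(subset)
    if len(subset) <= 1:
        return False
    # S must be compatible with every cluster of Y ...
    if any(not compatible(subset, frozenset(c)) for c in clusters):
        return False
    # ... and the clusters of Y|S must be pairwise compatible.
    return all(compatible(a, b)
               for a, b in combinations(restrict(clusters, subset), 2))


# Hypothetical cluster set used only to exercise the functions.
sample = [{1, 2}, {2, 3}, {3, 4, 5}, {4, 5}]
```

With this sample, `is_st_set({4, 5}, sample)` holds, whereas `{1, 2}` fails because it overlaps `{2, 3}` without either set containing the other.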

### METHOD

Given a set of clusters *Y* on taxa set *X*, the frequency of a taxon *x*∈*X*, denoted by *f*(*x*), is the number of clusters containing *x*. The number of edges of the graph *IG*(*Y*) is called the incompatibility degree of *Y*, denoted by *d*(*Y*). The incompatibility degree of a taxon *x*∈*X*, denoted by *d*(*x*), is obtained by subtracting the incompatibility degree of *Y*|*X*∖{*x*} from that of *Y*, i.e. *d*(*x*) = *d*(*Y*) − *d*(*Y*|*X*∖{*x*}). For example, given the incompatible cluster set *Y* = {{1, 2}, {2, 3}}, we get the taxa frequencies *f*(1) = 1, *f*(2) = 2, *f*(3) = 1 and the taxa incompatibility degrees *d*(1) = 0, *d*(2) = 1, *d*(3) = 0. Moreover, only by removing taxon 2 do the remaining clusters become compatible. The frequency and incompatibility degree of taxa contribute substantially to the compatibility of a cluster set, which in turn affects the construction of phylogenetic networks. The premise of constructing a network is to construct a phylogenetic tree for a compatible cluster set, which is obtained by removing some taxa from the originally incompatible set of clusters. The key to the Frin method is the addition of taxa removal rules, which let the algorithm select the taxa to remove more efficiently. Frin chooses the taxa to remove based on their frequency and incompatibility degree. Such choices make the remaining cluster set compatible as quickly as possible.
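The two quantities can be computed directly from their definitions, as in the following Python sketch. It assumes the common notion of compatibility (clusters are compatible when they are disjoint or nested) and drops clusters that become empty or duplicated after a taxon is removed; neither detail is stated in the text, so the values produced for a given input may differ from those of the authors' implementation. The function names and the sample cluster set are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """Assumed definition: clusters are compatible if disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def incompatibility_degree(clusters):
    """d(Y): number of edges of IG(Y), i.e. incompatible cluster pairs."""
    distinct = {frozenset(c) for c in clusters}
    return sum(1 for a, b in combinations(distinct, 2) if not compatible(a, b))


def remove_taxon(clusters, x):
    """Y restricted to X minus {x}; empty clusters are dropped."""
    reduced = {frozenset(c) - {x} for c in clusters}
    reduced.discard(frozenset())
    return reduced


def frequency(clusters, x):
    """f(x): number of clusters containing taxon x."""
    return sum(1 for c in clusters if x in c)


def taxon_degree(clusters, x):
    """d(x) = d(Y) - d(Y restricted to X minus {x})."""
    return (incompatibility_degree(clusters)
            - incompatibility_degree(remove_taxon(clusters, x)))


# Hypothetical chain of overlapping clusters on taxa 1..4.
sample = [{1, 2}, {2, 3}, {3, 4}]
freqs = {x: frequency(sample, x) for x in (1, 2, 3, 4)}
degrees = {x: taxon_degree(sample, x) for x in (1, 2, 3, 4)}
```

In this sample, the two inner taxa each appear in two clusters and removing either one resolves both incompatible pairs, while each outer taxon appears once and resolves only one pair.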

Frin constructs phylogenetic networks in four steps; steps 1, 3, and 4 are the same as in the Cass algorithm, and Frin improves step 2, the construction of simple level-≤*k* networks. Frin first finds the non-trivial connected components of the incompatibility graph *IG*(*Y*); next, it constructs a simple level-≤*k* network for each component based on taxa frequency and incompatibility degree; then it constructs a unique phylogenetic tree for the compatible clusters; finally, it integrates the simple level-≤*k* networks into the resulting phylogenetic network. Frin(*k*) constructs a simple level-≤*k* network as follows.

For each taxon *x*∈*X*′, Frin(*k*) obtains the frequency and incompatibility degree, and then calculates a weighted value *s*(*x*) from them, i.e. *s*(*x*) = *p* × *f*(*x*) + *q* × *d*(*x*), where *p* and *q* are the weights of the frequency and the incompatibility degree, respectively. All taxa *x*∈*X*′ are ordered according to the value of *s*. Frin(*k*) selects the taxon with the maximum *s* as the removed taxon each time, until the remaining cluster set is compatible and a phylogenetic tree can be constructed. Then Frin(*k*) adds all the removed taxa to the tree as children of reticulate nodes, and obtains a resulting network representing all clusters. Here, we choose *p* and *q* such that 0 < *p* ≤ 1, 0 ≤ *q* < 1, and *p* + *q* = 1, with a step size of 0.1. This gives ten pairs of *p* and *q* values, and for each pair Frin(*k*) constructs only one network. To avoid constructing the same network repeatedly, we compare the taxa removal process with earlier runs and skip any run that would reproduce a previous network. Finally, Frin constructs one or more different networks, and records the network with the smallest reticulation number and fewest redundant clusters as the final phylogenetic network.
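The selection rule can be illustrated with a simplified Python sketch. It covers only the greedy choice of removed taxa by *s*(*x*) = *p* × *f*(*x*) + *q* × *d*(*x*) and the ten weight pairs; it omits the collapsing of ST-sets, the level bound *k*, and the construction of the network itself. Compatibility is assumed to mean "disjoint or nested", ties are broken by the smallest taxon label, and all names are hypothetical. Exact fractions are used so that equal scores compare as equal.

```python
from fractions import Fraction
from itertools import combinations


def _degree(clusters):
    """Number of incompatible pairs (assumed: neither disjoint nor nested)."""
    return sum(1 for a, b in combinations(clusters, 2)
               if not (a.isdisjoint(b) or a <= b or b <= a))


def _without(clusters, x):
    """Remove taxon x from every cluster and drop empty clusters."""
    reduced = {c - {x} for c in clusters}
    reduced.discard(frozenset())
    return reduced


def removal_order(clusters, p, q):
    """Greedily remove the taxon with the largest s(x) until compatible."""
    current = {frozenset(c) for c in clusters}
    removed = []
    while _degree(current) > 0:
        taxa = sorted(set().union(*current))

        def score(x):
            f = sum(1 for c in current if x in c)
            d = _degree(current) - _degree(_without(current, x))
            return p * f + q * d

        best = max(taxa, key=score)  # first maximum, i.e. smallest label on ties
        removed.append(best)
        current = _without(current, best)
    return removed


def weight_pairs(steps=10):
    """(p, q) with p = 1/10, 2/10, ..., 1 and q = 1 - p."""
    return [(Fraction(i, steps), 1 - Fraction(i, steps))
            for i in range(1, steps + 1)]


def distinct_removal_orders(clusters):
    """One removal order per weight pair, skipping repeats of earlier orders."""
    orders = []
    for p, q in weight_pairs():
        order = removal_order(clusters, p, q)
        if order not in orders:
            orders.append(order)
    return orders


# Hypothetical chain of overlapping clusters.
sample = [{1, 2}, {2, 3}, {3, 4}]
sample_orders = distinct_removal_orders(sample)
```

For the sample chain every weight pair selects the same taxon, so only one removal order survives the duplicate check; inputs whose frequency and incompatibility rankings disagree can produce several orders, and hence several candidate networks.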

In addition, Frin sometimes adds dummy taxa to construct a network. The dummy taxa are removed before outputting the resulting network.

Example 3.1. Given the taxa set *X* = {1, 2, 3, 4, 5} and the cluster set *Y* = {{1, 2}, {1, 4}, {3, 4}, {1, 3, 4}, {4, 5}, {1, 2, 3, 4}, {2, 3}, {2, 3, 4}, {2, 3, 4, 5}}, Frin constructs two different networks *N*1 and *N*2 for *Y*, as shown in **Figure 1**. *N*1 is a level-3 network with *r* = 3, *c* = 3 and *N*2 is a level-3 network with *r* = 3, *c* = 6, where *r* is the reticulation number and *c* is the number of redundant clusters. The two networks have the same reticulation number, and *N*1 has fewer redundant clusters than *N*2. Therefore, Frin outputs *N*1 as the final network. The example shows that Frin can construct several different networks for the same input because the weights of the taxa frequency and incompatibility degree vary. By comparing the number of reticulation nodes and redundant clusters, we select the optimal network among them as the output.

Example 3.2. Consider the taxa set *X* = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and the cluster set *Y* = {{7, 8, 9}, {2, 3, 4, 7, 8, 10}, {5, 6, 7, 8, 9}, {2, 3, 4, 5, 6, 7, 8, 9}, {2, 3, 4, 5, 6}, {2, 3, 4, 10}, {2, 3, 4, 5, 6, 7, 8, 10}}. We use the cluster set *Y* to illustrate that the order of the input data influences Frin, Cass, BIMLR and Lnetwork to different degrees. To this end, we generate all permutations of the input data and construct a network for each permutation. We represent the difference between the resulting networks by the tripartition distance. For all permutations of the input data, Frin constructs the same network *N*3, as shown in **Figure 2**. Cass constructs three different networks *N*4, *N*5, and *N*6, and the minimum, maximum, and mean tripartition distances between them are 1.5, 2, and 1.67, respectively, as shown in **Figure 3**. BIMLR constructs three different networks *N*7, *N*8, and *N*9, and the minimum, maximum, and mean tripartition distances between them are 1, 3, and 2, respectively, as shown in **Figure 4**. Lnetwork also constructs three different networks *N*10, *N*11, and *N*12, and the minimum, maximum, and mean tripartition distances between them are 1, 1.5, and 1.33, respectively, as shown in **Figure 5**. The example shows that Frin can construct more consistent networks than the other methods for the same data with different input orders, i.e. Frin reduces the influence of input data order. This conclusion is supported by the experiments in the following section.

FIGURE 2 | N3 is the network constructed by Frin for all permutations of input data in Example 3.2.

FIGURE 3 | N4, N5 and N6 are the networks constructed by Cass for all permutations of input data in Example 3.2.

#### RESULTS

The experiments were performed on a personal computer with an Intel Core i5-4200U CPU (1.6 GHz) and 4 GB RAM. All programs are written in Java.

We test the efficiency of Frin, Cass, Lnetwork, and BIMLR on artificial and practical datasets, which can be accessed from the website (https://sites.google.com/site/cassalgorithm/data-sets). The results are shown in **Tables 1**–**3**. On the one hand, we use practical data to test the influence of input data order on network construction (see **Table 1**). On the other hand, we compare the network complexity, i.e. the level, the reticulation number, and the redundant cluster number, of the four methods on artificial and practical data (see **Tables 2** and **3**).

TABLE 1 | The results of Frin, Cass, Lnetwork and BIMLR on practical datasets with clusters |C| and taxa |X| when input order is different.

TABLE 2 | The results of Frin, Cass, Lnetwork and BIMLR on artificial datasets with clusters |C| and taxa |X|.

TABLE 3 | The results of Frin, Cass, Lnetwork and BIMLR on practical datasets with clusters |C| and taxa |X|.

We generate all permutations of the input order for each dataset, and then construct a network for each permutation. Since the number of permutations grows factorially with the input size, we choose small-scale data as the input. To measure the influence of input data order, we record the number of different resulting networks and compute the tripartition distance between them, which measures the dissimilarity between the networks. The experimental results are shown in **Table 1**. Each dataset is described by its number of clusters |*C*| and its number of taxa |*X*|. The table records the number of different networks (n) and the mean (mean), minimum (min), and maximum (max) values of the tripartition distance; the last row is the average of the corresponding columns. **Table 1** shows that, for most datasets, Frin constructs fewer different networks than the other three methods, and the tripartition distance between them is also smaller, especially compared with the Cass algorithm. Hence, Frin constructs more consistent networks when the input data orders are different.

We test the complexity of the networks constructed by Frin, Cass, Lnetwork, and BIMLR, including the network level (k), the reticulation number (r), and the redundant cluster number (c), as well as the running time (t) of those methods in h/m/s. The following tables show the results of the experiments on artificial and practical data with the cluster number |*C*| and the taxa number |*X*|. The last row of each table is the average of the corresponding columns. **Table 2** compares Frin with the other three methods on several artificial datasets. It shows that Frin consumes less time than Cass for the same input data, and that Frin has significantly fewer redundant clusters than Cass and BIMLR. **Table 3** compares the four methods on several practical datasets. It shows that the average reticulation number of Frin is slightly larger than that of the other methods, but that Frin has fewer redundant clusters than Cass and Lnetwork in most cases. Thus, the network constructed by Frin is simpler than those constructed by the other methods in terms of redundant clusters, and the execution time of Frin is greatly reduced compared with Cass, although Frin takes longer than the other two methods.

We describe the application of Frin to the *Poaceae* dataset and compare it with the other programs. The dataset consists of three phylogenetic trees of the *Poaceae* family, which are based on sequence data for three different gene loci, petD, ndhB, and rpl2. The gene sequences were downloaded from the NCBI database. We aligned the obtained sequences using ClustalX and constructed the phylogenetic trees using PHYLIP. Frin constructs a level-5 network with 10 taxa, 5 reticulations and 31 redundant clusters for the three gene trees of the *Poaceae* dataset. The resulting network is shown in **Figure 6** using Dendroscope 3 (Huson et al., 2007; Vaughan, 2017). For the same input, BIMLR constructs a level-5 network with r = 5, c = 33 and Lnetwork constructs a level-5 network with r = 5, c = 37, while the Cass algorithm cannot construct a network within a day. The result shows that the network constructed by Frin is the simplest. It suggests that the network constructed by Frin can describe the real evolutionary history better than those of the other methods.

#### CONCLUSION

In this paper, we propose an efficient method called Frin to construct phylogenetic networks. During construction, Frin considers two factors that affect the compatibility of a cluster set, namely the frequency and the incompatibility degree of taxa. Frin can construct several different networks and selects the simplest of them as the resulting network. The experimental results show that Frin is an improved method. First, Frin constructs fewer different networks than the other methods when the input data order changes. Second, the networks constructed by Frin have fewer redundant clusters than those of the other methods when the level and the reticulation number of the networks do not increase. Both facts indicate that Frin can better describe biological evolutionary history.

### DATA AVAILABILITY STATEMENT

The datasets generated for this study can be found on GitHub (https://github.com/wangjuanimu/Frin). The artificial and the practical datasets can be accessed from the Cass website (https://sites.google.com/site/cassalgorithm/data-sets).

### AUTHOR CONTRIBUTIONS

YH proposed the method and designed the experiments. YH and JW wrote the paper.

#### FUNDING

The work was supported by National Natural Science Foundation of China under Grant No. 61661040.



**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Hong and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Identification of Prognostic Dosage-Sensitive Genes in Colorectal Cancer Based on Multi-Omics

Zhiqiang Chang, Xiuxiu Miao and Wenyuan Zhao\*

College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China

#### Edited by:

Lei Deng, Central South University, China

#### Reviewed by:

Li Hongdong, Gannan Medical University, China Xiaowei Chen, Institute of Biophysics (CAS), China

\*Correspondence: Wenyuan Zhao zhaowenyuan@ems.hrbmu.edu.cn

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 22 September 2019 Accepted: 27 November 2019 Published: 09 January 2020

#### Citation:

Chang Z, Miao X and Zhao W (2020) Identification of Prognostic Dosage-Sensitive Genes in Colorectal Cancer Based on Multi-Omics. Front. Genet. 10:1310. doi: 10.3389/fgene.2019.01310

Several studies have already identified prognostic markers in colorectal cancer (CRC) based on somatic copy number alteration (SCNA). However, very little information is available regarding their value as prognostic markers. The gene dosage effect is one important mechanism of copy number alteration, and dosage-sensitive genes are more likely to behave like driver genes. In this work, we propose a new pipeline to identify dosage-sensitive prognostic genes in CRC. RNA-seq data and somatic copy number data of CRC from TCGA were analyzed to screen out the SCNAs. The Wilcoxon rank-sum test was used to identify the differentially expressed genes in alteration samples with |SCNA| > 0.3. Cox regression was used to find the candidate prognostic genes. An iterative algorithm was built to identify the stable prognostic genes. Finally, the Pearson correlation coefficient between gene expression and SCNA was calculated as the dosage effect score. Cell line data from CCLE were used to test the consistency of the dosage effect. A differential co-expression network was built to discover the function of these genes in CRC. A total of six amplified genes (NDUFB4, WDR5B, IQCB1, KPNA1, GTF2E1, and SEC22A) were found to be associated with poor prognosis. They show a stable prognostic classification at more than 50% of the SCNA thresholds. The average dosage effect score was 0.5918 ± 0.066 and 0.5978 ± 0.082 in TCGA and CCLE, respectively. They also show great stability in different data sets. In the differential co-expression network, these six genes have the top degree and are connected to driver and tumor suppressor genes. Function enrichment analysis revealed that the genes NDUFB4 and GTF2E1 affect cancer-related functions such as transmembrane transport and transformation factors. In conclusion, the pipeline for identifying prognostic dosage-sensitive genes in CRC proved to be stable and reliable.

Keywords: colorectal cancer, somatic copy number alteration, survival analysis, gene dosage effect, differential co-expression

## INTRODUCTION

Colorectal cancer (CRC) is the third leading cause of cancer-associated deaths in the world (Siegel et al., 2019). Studies have shown that somatic copy number alteration (SCNA) is one of the most common and important structural mutations in CRC (Li et al., 2017; Oliveira et al., 2018). SCNA genes are usually considered driver genes for cancer development and an important factor in the progression of CRC (Wang et al., 2009; Rosenberg et al., 2018; Lee et al., 2019).

In addition, a few SCNA genes are also considered prognostic markers for CRC patients (Roy et al., 2016; Sefrioui et al., 2017). Previous research has shown that a high copy number of mitochondrial DNA can help identify poor prognosis in advanced-stage CRC patients (Wang et al., 2016). However, the reason for this association is still unknown. SCNAs are generated by chromosomal rearrangement. One important mechanism by which SCNA influences cancer progression is the gene dosage effect (Harel and Lupski, 2018; Salpietro et al., 2018). If the expression of a gene in an SCNA region increases when its copy number is amplified and decreases when it is deleted, the gene is defined as a dosage-sensitive gene (DSG). Compared with the unstable and complex nature of expression regulation, DNA copy number is relatively stable. Therefore, dosage-sensitive genes are more likely to act as driver genes in cancer. Some DSG alterations, such as CD274/PD-L1 gene amplification (Lee et al., 2018b), fibroblast growth factor 1 amplification (Bae et al., 2019), and RING-Finger Protein 6 amplification (Steinman et al., 1979), have been shown to be associated with poor prognosis, suggesting that DSGs can also be considered prognostic markers.

The amount of SCNA can be considered an important indicator of cancer progression. Cancerous tissue may contain both tumor and non-tumor cells, and the DNA copy number of all cells is measured during detection. The copy number value obtained from the whole tissue sample relative to the control sequence therefore reflects the frequency of copy number alteration in the whole sample, and this value is often fractional. However, choosing the SCNA threshold above which an alteration is considered pathogenic or mutant requires thorough investigation. Shi et al. identified significant CNVs using the FASST2 algorithm and selected fragments with more than 5 probes and a log2 ratio greater than 0.3 as amplified genes (Shi et al., 2016). Villela et al. also used 0.3 as the SCNA threshold (Kostolansky et al., 1986; Villela et al., 2018). In addition, a copy number amplification or deletion of 0.5 (i.e. half amplification or deletion) has been reported to be pathogenic (Birchler et al., 2001; Birchler and Veitia, 2012). These results suggest that different threshold values should be considered when measuring SCNA.

Given the importance of DSGs and the fact that SCNA could be a prognostic marker of CRC, we hypothesize that dosage-sensitive prognostic genes also affect CRC progression. TCGA is a milestone cancer genome project covering CNV data, RNA-seq data, and patient-specific data of CRC, which makes relatively large-scale mining of prognostic genes of CRC possible. In this paper, we established a pipeline for screening prognosis-sensitive genes in CRC, identified stable prognostic markers with copy number dosage sensitivity in CRC, and verified their dosage sensitivity using cell line data. This analysis can further enhance our understanding of the prognostic value of SCNA genes and lay a foundation for further analysis.

### MATERIALS AND METHODS

#### Datasets and Processing

The CNA data, RNA-seq data, and clinical data of CRC were downloaded from the TCGA database. By mapping the copy number probes to the hg38 reference genome, the gene-level SCNA was calculated using the GISTIC2 software (Mermel et al., 2011). The value of SCNA represents the portability of copy number alteration and the q-value for the genes in aberrant regions. Genes with q-value > 0.1 and q-value < −0.1 were considered copy number amplified and deleted, respectively. For each gene, the samples with SCNA ≥ x (where x is the SCNA threshold, with x > 0) were identified as copy number amplification samples (CNAS), the samples with SCNA ≤ −x were identified as copy number deleted samples (CNDS), and the samples with |SCNA| < x were identified as copy number non-altered samples (CNNS). The chromosomal location information was obtained from the HGNC database (Braschi et al., 2019). RNA-seq FPKM data were downloaded from University of California Santa Cruz (UCSC, http://genome.ucsc.edu/), and genes with a value of 0 in more than 80% of samples were filtered out. The test data set was collected from the Cancer Cell Line Encyclopedia (CCLE; http://www.broadinstitute.org/ccle/home).
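The grouping rule for a single gene can be written as a short Python sketch. The function name, the dictionary layout, and the sample identifiers and values below are hypothetical and serve only to illustrate the three cases.

```python
def classify_samples(scna_by_sample, x):
    """Split samples into CNAS, CNDS and CNNS for one gene.

    scna_by_sample -- dict mapping a sample ID to the gene's SCNA value
    x              -- SCNA threshold, assumed to be > 0
    """
    if x <= 0:
        raise ValueError("the SCNA threshold x must be positive")
    groups = {"CNAS": [], "CNDS": [], "CNNS": []}
    for sample, value in scna_by_sample.items():
        if value >= x:
            groups["CNAS"].append(sample)   # amplified: SCNA >= x
        elif value <= -x:
            groups["CNDS"].append(sample)   # deleted: SCNA <= -x
        else:
            groups["CNNS"].append(sample)   # non-altered: |SCNA| < x
    return groups


# Made-up SCNA values for four hypothetical samples.
toy_values = {"s1": 0.45, "s2": -0.31, "s3": 0.05, "s4": 0.30}
toy_groups = classify_samples(toy_values, 0.3)
```

Note that a value exactly equal to the threshold falls into the altered group, matching the "≥" and "≤" in the definition.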

#### Filtering of Prognosis-Sensitive SCNA Genes

Prognosis-sensitive genes (PSGs) of SCNA were screened in five steps as described below:


cancer samples were divided into CNAS, CNDS, and CNNS. For each threshold of SCNA, the log-rank test was used to assess the significance of the difference in overall survival time between the CNAS and CNNS groups and between the CNDS and CNNS groups. Abnormal driver genes that were significant at more than 50% of the thresholds were selected as stable PSGs.

Step 5: To further screen dosage-sensitive genes from the stable PSGs across different SCNA thresholds, the prognosis-sensitive abnormal genes that are DSGs were selected. Linear regression was applied to assess dosage sensitivity, and the R value represents the dosage-effect score. Genes with p-value < 0.05 and R ≥ 0.3 were considered prognostic dosage-sensitive genes (PDSGs).
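The threshold sweep and the dosage-effect score can be sketched in Python as follows. The sketch assumes that the log-rank p-values for each threshold have already been computed elsewhere, uses the threshold range from 0.1 to 0.5 in steps of 0.02, and computes only the Pearson R (the accompanying p-value is omitted). All function names and the numbers in the usage lines are hypothetical.

```python
from math import sqrt


def threshold_grid(start=0.1, stop=0.5, step=0.02):
    """SCNA thresholds from `start` to `stop` inclusive."""
    n = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(n + 1)]


def is_stable(p_values, alpha=0.05):
    """True if the gene is significant at more than 50% of the thresholds."""
    significant = sum(1 for p in p_values if p < alpha)
    return significant > len(p_values) / 2


def pearson_r(xs, ys):
    """Pearson correlation between copy number and expression values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def is_dosage_sensitive(copy_number, expression, r_min=0.3):
    """Dosage-effect criterion on R only; the p-value check is left out."""
    return pearson_r(copy_number, expression) >= r_min


grid = threshold_grid()
# Made-up p-values: significant at 11 of the 21 thresholds.
toy_p_values = [0.01] * 11 + [0.20] * 10
toy_stable = is_stable(toy_p_values)
```

With the default arguments the grid contains 21 thresholds, so a gene must be significant at 11 or more of them to count as stable under the "more than 50%" rule.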

### Verification of DSGs in Cell Lines

To verify the stability of the dosage sensitivity of the PDSGs, the correlation coefficients between gene expression and copy number alteration were calculated using the CRC RNA-seq data and gene-level CNA data downloaded from the CCLE database. These values were compared with the findings obtained from TCGA.

#### Building the Differential Co-Expression Network

To further identify the genes affected by the PDSGs, Pearson correlation coefficients between these six PDSGs and other genes were calculated as co-expression values, separately in the altered samples (CNAS or CNDS) and in CNNS. Gene pairs with correlation coefficients higher than 0.5 in one group and less than 0.1 in the other group were screened as differentially co-expressed gene pairs. The network was visualized using Cytoscape (Shannon et al., 2003).
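The screening rule for differentially co-expressed pairs can be sketched in Python as follows. The sketch assumes that the correlation coefficients for both sample groups are already available and compares raw (signed) R values with the two cut-offs; whether absolute values were intended is not stated in the text. The function name, the input layout, and the gene labels and numbers are hypothetical.

```python
def differential_pairs(r_altered, r_nonaltered, high=0.5, low=0.1):
    """Screen gene pairs co-expressed in one sample group but not the other.

    r_altered, r_nonaltered -- dicts mapping (pdsg, gene) to the Pearson R
    in the altered group (CNAS or CNDS) and in CNNS, respectively.
    Returns (gained, lost): pairs co-expressed only in the altered group,
    and pairs co-expressed only in the non-altered group.
    """
    gained, lost = [], []
    for pair in r_altered.keys() & r_nonaltered.keys():
        r_alt, r_non = r_altered[pair], r_nonaltered[pair]
        if r_alt > high and r_non < low:
            gained.append(pair)
        elif r_non > high and r_alt < low:
            lost.append(pair)
    return sorted(gained), sorted(lost)


# Made-up correlation values for three hypothetical gene pairs.
toy_altered = {("P1", "g1"): 0.59, ("P1", "g2"): 0.05, ("P2", "g3"): 0.70}
toy_nonaltered = {("P1", "g1"): 0.02, ("P1", "g2"): 0.62, ("P2", "g3"): 0.40}
toy_gained, toy_lost = differential_pairs(toy_altered, toy_nonaltered)
```

The third toy pair is correlated in both groups and is therefore reported in neither list, since only a change in co-expression between groups qualifies.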

### Analysis

All analyses were performed in the R computing environment. Survival curves were estimated using the Kaplan-Meier method. Gene function enrichment was performed using the clusterProfiler package (Yu et al., 2012).

### RESULTS

### PDSGs in CRC

A total of 448 CRC samples with SCNA and RNA-seq data were downloaded from The Cancer Genome Atlas (TCGA). The samples were screened for survival information. There were 22,752 genes, of which 17,442 were protein-coding and 14,688 were differentially expressed.

After applying FDR < 0.1 and FC > 1.2, 6,814 genes had upregulated expression in CNAS and 25 genes had downregulated expression in CNDS. Cox regression analysis was applied to assess the association between SCNA and survival time. A total of 215 prognosis-sensitive genes (PSGs) significantly related to SCNA were obtained, of which 214 were amplified and one was deleted. Next, the SCNA threshold was raised from 0.1 to 0.5 in steps of 0.02, giving 21 threshold values. For each threshold, the samples were classified into the CNNS, CNAS, and CNDS groups, and the log-rank test was performed between CNAS and CNNS and between CNDS and CNNS. As shown in Figure 1, 73.02% of the genes did not show a significant classification at any threshold. A total of 15 genes showed a stable prognostic classification of patients at more than 10 threshold values, suggesting that these 15 genes can be considered stable markers for prognosis classification in CRC.

To further screen the stable PSGs that are strongly affected by the copy number dosage effect, the Pearson correlation coefficient between copy number and the corresponding expression value (FPKM) of these 15 genes was calculated. Finally, six genes (NDUFB4, WDR5B, IQCB1, KPNA1, GTF2E1, and SEC22A) that are stable PSGs were identified (Figure 2). The average dosage effect score was 0.5918 and the variance was 0.066.

Kaplan-Meier survival curve analysis showed that the six PDSGs gave similar results at different SCNA thresholds. At the SCNA threshold of 0.1, the genes GTF2E1, NDUFB4, IQCB1, KPNA1, and WDR5B had a significant classification effect (Figures 3A–C). At the SCNA threshold of 0.3, all six genes had a similar and significant classification effect (Figure 3D). At the SCNA threshold of 0.5, five genes (GTF2E1, NDUFB4, IQCB1, KPNA1, WDR5B) had a similar classification effect (Figures 3E, F). Although the two classifications at the 0.5 threshold were not statistically significant (p-value = 0.087199 and p-value = 0.12643), their survival curves were distinctly separated. The lack of significance can be primarily attributed to the very small number of samples with SCNA > 0.5.

### Testing Dosage Effect of PDSGs in CCLE

To verify whether the copy number of the six PDSGs is dosage-sensitive in cell line data, the dosage effect score of these six PDSGs was calculated in 53 CRC cell line samples from CCLE. The average score was 0.5978 with a variance of 0.082, consistent with the result from TCGA (Figure 4A). The Pearson correlation coefficient was 1, suggesting that the gene dosage effect is stable across different CRC data sets.

FIGURE 1 | Classification stability of gene prognosis. For each threshold of somatic copy number alteration (SCNA) (from 0.1 to 0.5, in steps of 0.02), the p-value was calculated by the log-rank test between the corresponding alteration samples and CNNS samples. The number of thresholds is incremented whenever the p-value is < 0.05.

FIGURE 3 | The Kaplan-Meier curves of six PDSGs for samples in CNAS and CNNS. (A–C) With the somatic copy number alteration (SCNA) threshold 0.1, genes GTF2E1 and NDUFB4 had similar prognostic classification efficacy. (D) With the SCNA threshold 0.3, all six PDSGs have similar efficacy. (E, F) With the SCNA threshold 0.5, although the p-value was > 0.05, the two survival curves were still separated from each other.

#### Six PDSGs Are Co-Altered in CRC

To further investigate the similarity between the survival curves of these six PDSGs, we mapped them to chromosomes and found that they are all located on 3q13.33–3q21.1. The correlation coefficients between the copy numbers of each pair of genes had an average value of 0.9967 (Figure 4B). This indicates that these six PDSGs are highly consistent with each other during alteration.

Research has shown that heterogeneity of copy number alterations exists in chromosomes with ongoing instability in COAD (Bolhaqueiro et al., 2019). The genome contains chromosomal fragile sites, and genes in fragile sites may break when they are under external stress. To determine whether breakpoints are present in the region near the six PDSGs, they were mapped to the database of human chromosomal fragile sites (HumCFS, http://webs.iiitd.edu.in/raghava/humcfs/). As a result, FRA3D (3q25.32) and FRA3C were found to be near the six PDSGs. Correlation analysis of SCNA between the six PDSGs and the genes in FRA3D and FRA3C was performed. The genes RSRC1 (R = 0.82) and MLF1 (R = 0.82) in FRA3D and LPP (R = 0.80) in FRA3C had the lowest correlation with the PDSGs. Thus, we infer that breakpoints in fragile sites may explain why genes in the nearby region have similar SCNA values.

#### Building and Analysis of Differential Co-Expression Network With PDSGs

To further explore whether these six PDSGs also affect the expression of other genes in CRC, we compared gene co-expression between CNAS and CNNS and screened gene pairs with R > 0.5 in one class of samples and R < 0.1 in the other. A total of 234 differentially co-expressed gene pairs involving 215 genes were identified (Figure 5A). The whole network forms a single connected component, suggesting that CRC is a disease involving multiple genes. Among these pairs, 194 were co-expressed in alteration samples (R > 0.5) but not in non-alteration samples (R < 0.1), while the other 40 pairs behaved in the reverse manner. In the network, NDUFB4 and SEC22A had the highest degrees (109 and 45, respectively) and shared 15 co-linked genes. The genes CAPN14 and CMPK2 were affected by three PDSGs (NDUFB4, SEC22A, and IQCB1). This suggests that the PDSGs are closely linked and interact with each other.

Each PDSG in the network was related to at least 13 genes, and 22 genes were associated with more than one PDSG. We also found that several PDSG-associated genes were COAD-related. The co-expression of GTF2E1-WNT8B was activated in CNAS (R = 0.59). WNT8B, a member of the WNT signaling pathway, is differentially expressed in COAD (Neumann et al., 2014). In addition, after mapping the PDSG-related genes to the driver gene list from DriverDB (Liu et al., 2019), three genes (C8orf33, LAPTM4B, PTP4A3) were found (Figure 5B), all of which were co-expressed with NDUFB4 in CNAS but not in CNNS. Mapping the PDSG-related genes to the tumor suppressor gene database (TSGene, http://bioinfo.mc.vanderbilt.edu/TSGene/) revealed 16 TSGs (Figure 5A, triangles). Among these, the genes DCDC2, ISG15, and RARRES3 were associated with more than one PDSG. RARRES3 has been shown to be mutated and differentially expressed and to inhibit metastasis in COAD (Lee et al., 2018a). ISG15 has been shown to be significantly differentially expressed in COAD (Yu et al., 2019; Zamanian-Azodi and Rezaei-Tavirani, 2019).

To further explore the possible functions of these six PDSGs, their linked genes were extracted and gene ontology function enrichment analysis was performed. Genes linked to NDUFB4 (Figure 5C) were mainly enriched in functions such as "transmembrane receptor," "transmembrane transport," "peptide receptor," "G protein-coupled receptor," and "transforming growth factor." Genes linked to GTF2E1 were enriched (Figure 5D) in functions such as "cyclin-dependent protease" and "ATP synthase transport proton-related functions." Previous studies have shown that transforming growth factor can also promote tumorigenesis (De Miranda et al., 2015; Yu et al., 2018; Kim et al., 2019). G-protein-coupled receptors (GPCRs) form the largest family of cell surface molecules involved in signal transduction and are considered key molecules in the growth and metastasis of tumors (Wielenga et al., 2015; Insel et al., 2018). Malignant cells often hijack the normal physiological functions of GPCRs to survive, proliferate independently, evade the immune system, increase their blood supply, invade the surrounding tissues, and spread to other organs.

represent tumor suppressor genes, lower triangles represent driver genes. Six PDSGs (NDUFB4, WDR5B, IQCB1, KPNA1, GTF2E1, and SEC22A) have the top degree. An edge represents co-expression of the adjacent genes above 0.5 in one group and below 0.1 in the other group. (B) The co-expression curve of gene (C) Normal and abstained function of gene NDUFB4 using the clusterProfiler R package (D) Normal and abstained function of gene GTF2E1 using the clusterProfiler R package.

### DISCUSSION

In this study, a series of screening methods was established to identify PDSGs in CRC. The six PDSGs identified are not only robust to different SCNA thresholds in prognostic classification but also show the same dosage effect in CRC cell lines. This indicates that our screening pipeline is reasonable and effective. Amplification of the copy number of these six PDSGs can lead to poor prognosis, indicating that the SCNA of genes could serve as an important prognostic marker in CRC.

In addition to yielding stable results, these PDSGs have been shown to be associated with CRC. NDUFB4 encodes a non-catalytic subunit of NADH dehydrogenase (complex I). The NADH dehydrogenase complex I is overexpressed in incipient metastatic murine CRC cells (Marquez et al., 2019). Mutations in the mitochondrial NADH dehydrogenase subunit 1 (mtND1) gene have been found in CRC (Yusnita et al., 2010). WDR5B encodes a protein containing several WD40 repeats and has been reported as an important target of miR-31. Knockout of microRNA-31 promotes the development of colitis-associated cancer (Liu et al., 2017). The protein encoded by SEC22A is a member of the SEC22 family of vesicle trafficking proteins. It is similar to rat SEC22 and may act in the early stages of the secretory pathway, which is related to CRC (Jilling and Kirk, 1996; Baron et al., 2010).

Compared with gene expression, DNA copy number alteration often occurs at the arm level, i.e., the same segment tends to have the same copy number alteration (Roy et al., 2016; Xu et al., 2018). The results of this study not only support this view but also suggest that, even within the same fragment, the correlation between different samples is not always 1. These differences indicate that somatic alterations are heterogeneous and demonstrate the diversity of alterations in CRC. In addition, although chromosomal copy number plays a role through the dosage effect to some extent, it may also be affected by the regulation of gene expression. Only six of the 15 genes obtained in this paper have a strong dosage effect, suggesting that not all gene copy number amplification leads to up-regulation of expression. The contribution is a combination of copy number and dosage effect. In the future, if targeted drugs or therapies can be developed to reduce the copy number of these six PDSGs, patients with amplified copies of these six genes may receive precise treatment. This is also an important starting point and foothold of this topic.

The ratio of amplified to non-amplified samples of CPCDGs gene is 1:11, which indicates that these prognostic markers are valuable only for patients with high SCNA. Therefore, SCNA can be an important part of precise medical treatment. Due to computational limitations, the minimum number of alteration samples required in this paper is 10, which may limit the discovery of altered genes to a certain extent. However, we believe that, as the sample size increases in the future, the growing number of DNA copy number alteration types observed in CRC will lead to the identification of more clinically relevant SCNA genes.

In summary, the findings of the present study suggest that PDSGs obtained from the analysis of CRC have good application value and can provide an important reference for the precise treatment of CRC.

#### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/supplementary material.

#### AUTHOR CONTRIBUTIONS

WZ designed and supervised the study and was a major contributor in editing the manuscript. ZC analyzed and interpreted the data and was a major contributor in writing the manuscript. XM performed analysis and contributed to the manuscript. All authors read and approved the final manuscript.

#### FUNDING

This research was funded by the Fundamental Research Funds for the Provincial Universities (31041180039).


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Chang, Miao and Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# eQTLMAPT: Fast and Accurate eQTL Mediation Analysis With Efficient Permutation Testing Approaches

Tao Wang<sup>1</sup>, Qidi Peng<sup>1</sup>, Bo Liu<sup>1</sup>, Xiaoli Liu<sup>2</sup>, Yongzhuang Liu<sup>1</sup>, Jiajie Peng<sup>3</sup>\* and Yadong Wang<sup>1</sup>\*

<sup>1</sup> School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, <sup>2</sup> Department of Neurology, Zhejiang Hospital, Hangzhou, China, <sup>3</sup> School of Computer Science, Northwestern Polytechnical University, Xi'an, China

#### Edited by:

Lei Deng, Central South University, China

#### Reviewed by:

Juan Wang, Inner Mongolia University, China; Fan Yang, Harvard Medical School, United States

#### \*Correspondence:

Jiajie Peng, jiajiepeng@nwpu.edu.cn; Yadong Wang, ydwang@hit.edu.cn

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 08 September 2019 Accepted: 27 November 2019 Published: 09 January 2020

#### Citation:

Wang T, Peng Q, Liu B, Liu X, Liu Y, Peng J and Wang Y (2020) eQTLMAPT: Fast and Accurate eQTL Mediation Analysis With Efficient Permutation Testing Approaches. Front. Genet. 10:1309. doi: 10.3389/fgene.2019.01309

Expression quantitative trait locus (eQTL) analyses are critical in understanding the complex functional regulatory nature of genetic variation and have been widely used in the interpretation of disease-associated variants identified by genome-wide association studies (GWAS). Emerging evidence has shown that trans-eQTL effects on remote gene expression can be mediated by local transcripts, which is known as the mediation effect. To discover genome-wide eQTL mediation effects by combining genomic and transcriptomic profiles, it is necessary to develop novel computational methods that rapidly scan a large number of candidate associations while controlling for multiple testing appropriately. Here, we present eQTLMAPT, an R package that performs eQTL mediation analysis and implements efficient permutation procedures for multiple testing correction. eQTLMAPT offers three advantages. First, it accelerates mediation analysis by effectively pruning the permutation process through an adaptive permutation scheme. Second, it can efficiently and accurately estimate the significance level of mediation effects by modeling the null distribution with a generalized Pareto distribution (GPD) trained on a few permutation statistics. Third, eQTLMAPT provides flexible interfaces for users to combine various permutation schemes with different confounding adjustment methods. Experiments on a real eQTL dataset demonstrate that eQTLMAPT provides higher resolution of the estimated significance of mediation effects and is an order of magnitude faster than the compared methods with similar accuracy.

Keywords: trans-eQTL, cis-eQTL, mediation analysis, multiple testing control, permutation test, gene regulation

### INTRODUCTION

Understanding the complex functional nature of genomic variants has been the focus of many studies in recent years, providing advanced insights into phenotype variability and disease susceptibility (Cheng et al., 2017; Watanabe et al., 2017; Gallagher and Chen-Plotkin, 2018). A vast number of genomic variants relevant to disease risks and other traits have been unequivocally identified by genome-wide association studies (GWAS) (Visscher et al., 2017). However, most of those trait-associated variants are located in non-coding, intergenic, or intronic regions, indicating that genomic variants are likely to be involved in gene regulation instead of exerting their effects by altering the protein sequence directly (Gallagher and Chen-Plotkin, 2018). To understand the complex regulatory nature of genomic variants, one of the fundamental tasks is to discover the target genes that are regulated by variants in the cell. Expression quantitative trait loci (eQTL) analysis has proven to be a powerful tool for achieving this goal.

An eQTL is essentially a variant at a specific genome location whose genetic variance is associated with gene expression variation in a population. Most eQTL mapping studies assess eQTL effects through association tests between the genotypes of a variant and the expression profiles of a gene using regression models (Shabalin, 2012; Ongen et al., 2015). eQTL summary statistics have been widely used in the interpretation of GWAS results and in Mendelian randomization studies (Cheng et al., 2018b; Peng et al., 2019a). eQTLs can exert their regulatory effects on local gene transcription (cis-acting) and distant gene transcription (trans-acting), defined by the physical distance between an eQTL and a gene; a threshold of 1 Mb is usually used, and associations across different chromosomes are treated as trans-acting (Ongen et al., 2015; GTEx Consortium, 2017). cis-acting and trans-acting effects may reflect different underlying regulatory mechanisms. For example, cis-eQTLs usually reside close to transcription start sites (TSS) and might affect gene expression directly by affecting the transcription factor (TF) binding process (Nica and Dermitzakis, 2013). However, very little is known about trans-eQTLs, for several reasons. First, trans-acting effects are usually weaker than cis-acting effects, so a large sample size is required to detect the weak signals (Yao et al., 2017). Second, the number of trans-eQTL associations is an order of magnitude larger than the number of cis-eQTL associations, which brings a heavy computational burden. Third, the multiple testing problem in identifying trans-eQTLs results in stringent significance thresholds. In addition, trans-eQTLs have proven less replicable across studies (Innocenti et al., 2011). Therefore, most eQTL studies focus only on cis-eQTLs, and the mechanisms underlying the regulatory effects of genetic variation on the expression of distant genes and genes on other chromosomes are largely unknown (Bryois et al., 2014).

Recent studies have shown that trans-eQTLs are likely involved in indirect regulation, where the trans-eGene can be mediated by the cis-eGene, which is known as the mediation effect (Pierce et al., 2014; Brynedal et al., 2017; Yang et al., 2017; Yao et al., 2017). These studies provide evidence of a cis-mediated mechanism that explains distal regulation of trans-eGenes by trans-eQTLs (Yao et al., 2017). Characterizing these regulatory relationships will allow us to better understand regulatory networks and the biological mechanisms underlying trans-eQTLs (Westra et al., 2013). To discover the mediation effect among a cis-/trans-eQTL (L), a cis-eGene (C), and a trans-eGene (T), represented by a trio (L→C→T), a recently proposed method tests the significance of the effect of the cis-eGene on the trans-eGene while controlling for the genotype of L and confounders (Yang et al., 2017). Mathematically, using a linear regression model of the form T = a + β<sub>1</sub>C + β<sub>2</sub>G + ΓCov + ϵ, where G represents the genotype of L (see details in Material and Methods), the objective is to test the significance of β<sub>1</sub>. In practice, this requires performing a large number of association tests in order to scan all possible candidate trios, because related variants are in linkage disequilibrium (LD). This results in a large number of nominal statistics, i.e., P values, and multiple testing has to be considered in order to control the false discovery rate. A traditional solution is the Bonferroni correction, which multiplies the nominal P value by the total number of tests to obtain an adjusted P value. However, the Bonferroni method has proven overly stringent in genomics because a large number of tests are not independent owing to variants in LD, and it results in many false negatives (Ongen et al., 2015).

To solve this problem, a commonly adopted strategy is to use the non-parametric permutation testing approach. The permutation test can be performed in the following steps. First, perform thousands of permutations on gene expression profiles by randomly exchanging sample IDs. Notably, to break the potential mediation effects from C to T while keeping the cis-eQTL and trans-eQTL associations, the sample ID rearrangement needs to be performed within each genotype group (i.e., AA, AB, or BB) (Yang et al., 2017). Second, calculate a list of permutation statistics, under the null hypothesis of no association, by performing association tests using the genotypes and the permuted expression profiles. Third, compare the nominal statistic with the distribution of permutation statistics to assess how likely it is that the observed nominal association statistic originates from the null distribution. Permutation tests have been applied in multiple bioinformatics applications to control for multiple testing, for example, eQTL mapping (Ongen et al., 2015), allelic association analysis (Zhao et al., 2000), and biological network analyses (Wang et al., 2019). In the context of detecting the mediation effect of cis-eGenes on trans-eGenes, a recently proposed algorithm named GMAC adopts a permutation strategy to control for multiple testing (Yang et al., 2017). However, it suffers from a main drawback: it relies on performing a fixed number of permutations per trio, usually thousands, to balance the running time against the resolution of the empirically estimated P value. For example, 10,000 permutations can yield a P value with a resolution of 10<sup>−4</sup> at best. GMAC has no efficient built-in permutation scheme, which makes its practical application very time-consuming and inaccurate in estimating the significance of mediation effects.
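The within-genotype rearrangement described in the first step can be sketched in a few lines of Python. This is an illustration only, not code from GMAC or eQTLMAPT (both are R packages); the function name `permute_within_genotype` and the toy data are hypothetical.

```python
import random

def permute_within_genotype(expression, genotypes, rng):
    """Shuffle expression values only among samples sharing a genotype."""
    permuted = list(expression)
    groups = {}
    for idx, g in enumerate(genotypes):
        groups.setdefault(g, []).append(idx)
    for indices in groups.values():
        values = [expression[i] for i in indices]
        rng.shuffle(values)  # rearrangement stays inside this genotype group
        for i, v in zip(indices, values):
            permuted[i] = v
    return permuted

rng = random.Random(0)  # fixed seed so the example is reproducible
genotypes = ["AA", "AA", "AB", "AB", "AB", "BB"]
cis_expr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
shuffled = permute_within_genotype(cis_expr, genotypes, rng)
```

Because values never cross genotype groups, the association between genotype and the permuted expression profile is preserved, while any sample-level link to another gene is broken.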

In this work, we present eQTLMAPT, an R package which improves upon GMAC (Yang et al., 2017) by implementing faster and more efficient permutation-based multiple testing correction approaches. Besides the traditional fixed permutation scheme, eQTLMAPT also provides 1) the adaptive permutation scheme which prunes the permutation process opportunely; 2) the approximation of the tail of null distribution using generalized Pareto distribution (GPD) model, which allows the user to accurately estimate adjusted P values at any significance level in a short running time; and 3) flexible choices of different confounding factors adjustment methods. In addition, eQTLMAPT provides flexible interfaces for users to combine different features and perform the proper permutation scheme based on their practical needs. Experiments on a real eQTL dataset demonstrate that eQTLMAPT is an order of magnitude faster than GMAC, and its estimated significance has a much higher resolution than the compared method.

#### MATERIAL AND METHODS

#### Overview

To efficiently identify cis-eGene mediators of trans-eQTLs across the whole genome, we developed eQTLMAPT, an R package that performs mediation analysis with multiple permutation schemes and flexible covariate adjustment strategies. The core regression models we used in mediation analysis are similar to those used in the recently proposed method GMAC (Yang et al., 2017). The models can be formalized as Equations 1, 2, and 3, where G represents the genotype of the single nucleotide polymorphism (SNP) L; C and T represent the gene expression levels of the cis-eGene and trans-eGene, respectively; Cov represents covariates; and ϵ represents the error term following a normal distribution. For the trio (L, C, T), we assume L is significantly associated with C and T by testing β<sub>1</sub> ≠ 0 and β<sub>2</sub> ≠ 0 in the linear models, with β estimated by least-squares fitting. The mediation analysis tests the mediation effect of cis-eGene C on trans-eGene T while controlling for the effects of eQTL L and covariates Cov. The null hypothesis is H<sub>0</sub>: β<sub>3</sub> = 0.

$$C = a_1 + \beta_1 G + \Gamma_1 \mathrm{Cov} + \epsilon_1 \tag{1}$$

$$T = a_2 + \beta_2 G + \Gamma_2 \mathrm{Cov} + \epsilon_2 \tag{2}$$

$$T = a_3 + \beta_3 C + \beta_4 G + \Gamma_3 \mathrm{Cov} + \epsilon_3 \tag{3}$$

Our method can be separated into two main steps. First, we calculate the nominal association statistic, z = β<sub>3</sub>/se, in Equation 3, where se represents the standard error of β<sub>3</sub>. Second, to account for multiple testing in assessing the significance of the mediation effect, we perform within-genotype group permutations of the cis-eGene transcript C to empirically characterize the null distribution of mediation effects (i.e., the distribution of z scores expected under the null hypothesis of no mediation effect, denoted by the vector Z<sub>0</sub>). The purpose of within-genotype group permutation is to break the potential mediation effects from C to T within each genotype group (i.e., AA, AB, or BB) while keeping the cis-eQTL and trans-eQTL associations. The adjusted empirical P value of the mediation test is finally calculated by comparing the observed mediation statistic z with the permutation statistics Z<sub>0</sub> under the null.
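As an illustration of the first step, the following Python sketch fits the model of Equation 3 by ordinary least squares and returns the estimate of β<sub>3</sub>, its standard error, and z. It is a didactic, standard-library implementation under the usual homoscedastic error assumption, not the package's R code; `invert` and `mediation_z` are hypothetical helper names.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mediation_z(trans, cis, genotype, covariates=()):
    """Fit T = a3 + b3*C + b4*G + Gamma3*Cov + e; return (b3, se, z)."""
    n = len(trans)
    # Design matrix columns: intercept, C, G, then one column per covariate.
    rows = [[1.0, cis[i], genotype[i]] + [cov[i] for cov in covariates]
            for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y for r, y in zip(rows, trans)) for j in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[j][k] * xty[k] for k in range(p)) for j in range(p)]
    rss = sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, y in zip(rows, trans))
    sigma2 = rss / (n - p)              # residual variance estimate
    se = math.sqrt(sigma2 * inv[1][1])  # standard error of b3 (column of C)
    return beta[1], se, beta[1] / se
```

Calling `mediation_z` on a permuted copy of the cis-eGene profile yields one null statistic; repeating this over many within-genotype permutations gives Z<sub>0</sub>.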

To obtain the null distribution of mediation effects, i.e., Z<sub>0</sub>, and to provide users with flexible choices, we implemented three permutation schemes in our package: 1) a fixed permutation scheme, which generates N permutation datasets (Estimation of P Values Under Fixed Permutation Scheme); 2) an adaptive permutation scheme, which prunes the permutation process when too many null statistics are stronger than the observed z statistic (Calculate Empirical P Value Using Adaptive Permutation Scheme); and 3) GPD approximation, which models the tail of the null distribution using a drastically reduced number of null statistics and estimates P values with higher resolution (Model the Tail of the Null Distribution Using GPD). To deal with complex hidden confounding effects, we also adopt an adaptive confounder adjustment method (Yang et al., 2017) and a fixed confounder adjustment method, either of which can be combined with the three permutation schemes (Confounding Factors Adjustment).

#### Estimation of P Values Under Fixed Permutation Scheme

The associations of the trios (L, C, T) we aim to test are not independent because multiple SNPs are correlated through LD. Traditional multiple testing correction methods such as the Bonferroni and Benjamini–Hochberg corrections, which give a global significance threshold based on all nominal P values, prove to be overly stringent and may result in false negatives in such correlated genomic analyses. Thus, we adopt permutation-based testing approaches to assess the significance of the association test for each trio (L, C, T) (Equation 3). The permutation test is a non-parametric method widely used in bioinformatics applications. It generates a null distribution of the statistic by random permutations and then assesses how likely it is that the observed statistic obtained in the nominal association originates from the null distribution.

Assume the nominal mediation statistic z = β<sub>3</sub>/se is assessed for a trio (L, C, T) by Equation 3, where se is the standard error of β<sub>3</sub>. Given a fixed number N, we perform N within-genotype group permutations of the cis-eGene C by randomly permuting sample labels in each genotype group, i.e., AA, AB, and BB. This generates N null mediation statistics, denoted by Z<sub>0</sub> = {z<sub>0</sub><sup>1</sup>, z<sub>0</sub><sup>2</sup>, …, z<sub>0</sub><sup>N</sup>}, where each z<sub>0</sub><sup>i</sup> is in absolute value, i ∈ [1, N]. If M null statistics in Z<sub>0</sub> are stronger than the observed statistic |z|, the empirical P value is assessed by Equation 4, where a pseudo-count of 1 is added to the numerator and denominator so that the estimated P value is never exactly zero.

$$P_{\text{fixed}} = \frac{M+1}{N+1} \tag{4}$$

The fixed permutation scheme is direct, easy to implement, and adopted by most permutation testing approaches. However, the adjusted P value has a lower bound, P<sub>fixed</sub> ≥ 1/(N + 1). This means that N has to be increased to obtain precise P value estimates for strong mediation effects with small P values, which tremendously increases the computational cost. For example, if the true P value of a trio is 10<sup>−6</sup>, at least 1 million permutations must be performed to estimate it precisely. But for most trios, whose true P values are larger than 10<sup>−3</sup>, 1 million permutations would be a waste of resources because a few thousand permutations already give precise P values. To solve this problem, we implemented an adaptive permutation strategy in eQTLMAPT that prunes permutations once too many null statistics stronger than the nominal mediation statistic z have been observed.
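Equation 4 and its lower bound can be checked with a small Python sketch. The function name is hypothetical and the null statistics are supplied directly rather than produced by real permutations.

```python
def fixed_permutation_pvalue(z, null_stats):
    """Empirical P value of Equation 4: (M + 1) / (N + 1)."""
    n = len(null_stats)
    # M counts null statistics stronger (in absolute value) than the observed z.
    m = sum(1 for z0 in null_stats if abs(z0) > abs(z))
    return (m + 1) / (n + 1)

# Two of four null statistics exceed |z| = 1.0, so P = (2 + 1) / (4 + 1).
p_example = fixed_permutation_pvalue(1.0, [0.5, -1.2, 3.0, 0.1])

# With no exceedance, the estimate cannot go below 1 / (N + 1).
p_floor = fixed_permutation_pvalue(5.0, [0.0] * 9999)
```

With 9,999 null statistics and no exceedance, the estimate is 1/10,000, which illustrates why stronger effects need proportionally more permutations under this scheme.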

#### Calculate Empirical P Value Using Adaptive Permutation Scheme

The basic idea of the adaptive permutation strategy is to perform more permutations for significant trios while decreasing the number of permutations for insignificant trios, because insignificant trios can be assessed with fewer permutations than significant ones. By setting a significance level (α = 0.05, for example) and a maximum number of permutations N, which prevents the process from running indefinitely, we define the pruning threshold K = α × N, where usually K << N. For each trio (L, C, T), the permutation process stops once we observe at least K null statistics with |z<sub>0</sub><sup>i</sup>| > |z| or reach the upper bound of N permutations. Suppose Γ permutations are executed in total and M null statistics are found to be stronger than the observed statistic |z|; the adjusted P value is then given by Equation 5.

$$P_{\text{adaptive}} = \frac{\min(K+1, M+1)}{\min(\varGamma + 1, N+1)} \tag{5}$$

For example, given N = 10,000 and α = 0.05, K = 500. Assume that after 800 permutations for a trio we have found K null statistics stronger than the nominal statistic z. We then stop performing further permutations, and the final adjusted P value is 501/801. In this case, only 800 permutations are needed instead of the 10,000 required by the fixed permutation scheme. This strategy tremendously reduces the number of permutations required for insignificant trios; however, the lower bound of the adjusted P value, 1/(N + 1), still exists. To remove this lower bound, we approximate the tail of the null distribution with a generalized Pareto distribution and estimate small P values at any significance level.
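The stopping rule and Equation 5 can be sketched in Python as follows. The null statistics are consumed lazily from an iterable, standing in for permutations that would otherwise be computed one at a time; the function name, the `null_stream` argument, and the rounding of K to an integer are assumptions made for this illustration.

```python
def adaptive_permutation_pvalue(z, null_stream, max_perm=10000, alpha=0.05):
    """Return (P value of Equation 5, number of permutations executed)."""
    k = max(1, round(alpha * max_perm))  # pruning threshold K = alpha * N
    m = 0          # null statistics stronger than the observed |z|
    executed = 0   # permutations actually performed
    for z0 in null_stream:
        executed += 1
        if abs(z0) > abs(z):
            m += 1
        # Stop as soon as K exceedances are seen or the budget N is used up.
        if m >= k or executed >= max_perm:
            break
    p = min(k + 1, m + 1) / min(executed + 1, max_perm + 1)
    return p, executed

# Reproduces the worked example: the 500th exceedance arrives at permutation 800.
stream = [0.0] * 300 + [10.0] * 500 + [0.0] * 9200
p_value, n_used = adaptive_permutation_pvalue(1.0, stream)
```

In this toy stream the loop stops after 800 of the 10,000 available statistics and returns 501/801, matching the example in the text.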

#### Model the Tail of the Null Distribution Using GPD

It is critical to accurately estimate small P values, especially in large-scale genomic analyses where huge numbers of associations are tested simultaneously. To determine precise small P values at any significance level without performing all possible permutations, we implemented a P value approximation method based on the GPD, which has been widely used in modeling extreme values (Knijnenburg et al., 2009). The basic methodology is to estimate small permutation P values using extreme value theory, by fitting the extreme permutation values in the tail of the null distribution with a generalized Pareto distribution (Gumbel, 2012). It has been shown that the GPD approximation can estimate small P values precisely using far fewer permutations than the fixed permutation approach (Knijnenburg et al., 2009).

In our case, given the set of permutation statistics Z<sub>0</sub> = {z<sub>0</sub><sup>1</sup>, z<sub>0</sub><sup>2</sup>, …, z<sub>0</sub><sup>N</sup>} and the nominal mediation statistic z of a trio (L, C, T), we suppose that both z and z<sub>0</sub><sup>i</sup> ∈ Z<sub>0</sub> are in absolute value and that the elements of Z<sub>0</sub> are sorted in decreasing order, i.e., z<sub>0</sub><sup>i</sup> ≥ z<sub>0</sub><sup>j</sup> for i < j. Define N<sub>exc</sub> as the number of exceedances (extreme values), Y<sub>0</sub> = {z<sub>0</sub><sup>1</sup>, z<sub>0</sub><sup>2</sup>, …, z<sub>0</sub><sup>N<sub>exc</sub></sup>} with Y<sub>0</sub> ⊂ Z<sub>0</sub>, and the exceedance threshold t = (z<sub>0</sub><sup>N<sub>exc</sub></sup> + z<sub>0</sub><sup>N<sub>exc</sub>+1</sup>)/2, such that z<sub>0</sub> > t if z<sub>0</sub> ∈ Y<sub>0</sub>. Then, we calculate z<sub>0</sub> − t for each element z<sub>0</sub> ∈ Y<sub>0</sub> to get a vector of exceedances X<sub>0</sub> = {x<sub>0</sub><sup>1</sup>, x<sub>0</sub><sup>2</sup>, …, x<sub>0</sub><sup>N<sub>exc</sub></sup>}, where x<sub>0</sub><sup>i</sup> = z<sub>0</sub><sup>i</sup> − t, x<sub>0</sub><sup>i</sup> ∈ X<sub>0</sub>, z<sub>0</sub><sup>i</sup> ∈ Y<sub>0</sub>. Next, the exceedances in X<sub>0</sub> are used to fit the tail of the null distribution, modeled by the GPD. The GPD has the cumulative distribution function (CDF) shown in Equation 6.

$$F(x) = \begin{cases} 1 - \left(1 - \frac{kx}{a}\right)^{\frac{1}{k}}, & k \neq 0 \\ 1 - e^{-\frac{x}{a}}, & k = 0 \end{cases} \tag{6}$$

The parameters a and k are the scale and shape parameters, respectively, and x must satisfy 0 ≤ x ≤ a/k for k > 0 and x ≥ 0 for k ≤ 0. If x falls outside these ranges (i.e., k > 0 and x > a/k), the GPD-estimated P value is zero. Maximum likelihood (ML) is used to estimate the two parameters a and k in F(x) given X<sub>0</sub>. The Anderson–Darling goodness-of-fit test is used to evaluate whether the exceedances follow the GPD (Choulakian and Stephens, 2001). Finally, the permutation test P value of the GPD approximation is computed as shown in Equation 7, where z represents the absolute value of the nominal mediation statistic.

$$P_{\text{gpd}} = \frac{N_{\text{exc}}}{N} \left( 1 - F(z - t) \right) \tag{7}$$

By default, N<sub>exc</sub> is initialized as the minimum of 250 and the number of permutations. If the GPD fit fails (goodness-of-fit test P ≤ 0.05), N<sub>exc</sub> is iteratively reduced by 10 until a good fit is achieved. In addition, the GPD approximation can only be used when the nominal mediation statistic z lies in the range of the extreme permutation null statistics (the tail of the null distribution). For example, if z is in the middle of the null distribution, this method cannot be applied. Specifically, let M be the number of permutation values that exceed the test statistic z; if M < N × α (α = 0.01 by default), the GPD approximation is performed; otherwise, the fixed permutation scheme is used. The detailed methods have been described in Knijnenburg et al. (2009), and we implemented this method in R in our package to accurately estimate mediation significance with far fewer permutations.
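The tail approximation of Equations 6 and 7 can be sketched in Python as below. Two simplifications are assumed and should be read as such: the parameters a and k are estimated by the method of moments rather than by maximum likelihood as in the text, and the Anderson–Darling check, the iterative reduction of N<sub>exc</sub>, and the fallback to the fixed scheme are omitted. All function names are hypothetical.

```python
import math
import statistics

def gpd_cdf(x, a, k):
    """CDF of Equation 6 with scale a and shape k, for an exceedance x."""
    if x < 0:
        return 0.0
    if k == 0:
        return 1.0 - math.exp(-x / a)
    if k > 0 and x >= a / k:
        return 1.0  # beyond the upper end of the support, so P becomes 0
    return 1.0 - (1.0 - k * x / a) ** (1.0 / k)

def fit_gpd_moments(exceedances):
    """Method-of-moments estimates (a, k); a stand-in for the ML fit."""
    mean = statistics.mean(exceedances)
    var = statistics.variance(exceedances)
    ratio = mean * mean / var
    return 0.5 * mean * (ratio + 1.0), 0.5 * (ratio - 1.0)

def gpd_pvalue(z, null_stats, n_exc=250):
    """Tail-approximated permutation P value of Equation 7."""
    z0 = sorted((abs(v) for v in null_stats), reverse=True)
    n = len(z0)
    n_exc = min(n_exc, n - 1)  # the threshold needs the (n_exc + 1)-th value
    t = 0.5 * (z0[n_exc - 1] + z0[n_exc])      # exceedance threshold
    exceedances = [v - t for v in z0[:n_exc]]  # X0 in the text
    a, k = fit_gpd_moments(exceedances)
    return n_exc / n * (1.0 - gpd_cdf(abs(z) - t, a, k))
```

Unlike the empirical estimates of Equations 4 and 5, the value returned here is not bounded below by 1/(N + 1), which is the property the GPD scheme relies on.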

#### Confounding Factors Adjustment

The presence of heterogeneous known or latent unmeasured covariates that affect genotype and phenotype (gene expression in our context) is a major source of bias in the mediation analysis, which needs to be adjusted. The common sources of covariates, such as batch effects, age, sex, postmortem interval (PMI), RNA integrity number (RIN), and population stratification, are associated with either samples or individuals. The latent unwanted covariates can be identified by methods like principal component analysis (PCA) (Abdi and Williams, 2010), surrogate variables analysis (SVA) (Leek et al., 2012), and probabilistic estimation of expression residuals (PEER) (Stegle et al., 2012).

In our package, we adopt two covariate adjustment strategies: a fixed confounder adjustment strategy and an adaptive confounder adjustment strategy. The first directly passes the user-given PCs/SVs or PEER factors, together with known covariates, into the Cov variable in Equation 3 when performing mediation analysis. The second was proposed in GMAC (Yang et al., 2017) and adaptively selects hidden covariates for each trio. In brief, this method first identifies a pool of hidden covariates, represented by H, which can be supplied by users or identified automatically by PCA on the expression profiles (the first 30 principal components (PCs) by default). Then, for each trio (L, C, T), only a small number of PCs are selected from H for adjustment, based on the correlations between the PCs and C and T. Experiments have demonstrated that this adaptive covariate selection method improves power and precision in mediation analysis (Yang et al., 2017). Notably, either covariate adjustment strategy can be combined with any of the three permutation schemes introduced above.

#### ROSMAP Dataset and Preprocessing

#### ROSMAP Study and Dataset

The Religious Orders Study (ROS) (A Bennett et al., 2012a) and Memory and Aging Project (MAP) (A Bennett et al., 2012b) are two longitudinal cohort studies of aging and Alzheimer's disease (AD). We downloaded the gene expression, genotype, and clinical dataset of ROSMAP Study from Synapse platform (ID: syn3219045) with approval. RNA samples were obtained from the homogenate of the dorsolateral prefrontal cortex of 724 subjects and RNA sequencing (RNA-seq) data have been processed into read count table using standard pipeline (syn9702085) (Mostafavi et al., 2018). DNA samples were from whole blood and genotype profiles of 1,179 subjects were calculated from whole-genome sequencing (De Jager et al., 2018). Only neuropathologically healthy individuals (cogdx score ≤3, no Alzheimer's disease and no dementia) with both genotype data and RNA-seq data passing quality controls were used in eQTL analysis, which downsized the sample size to N = 334.

#### Genotype Processing

We applied PLINK2 (v1.9beta) (Chang et al., 2015) and in-house scripts to perform rigorous subject and SNP quality control (QC) on the genotype dataset derived from WGS. For SNP-level QC, we removed SNPs with genotype call rate <95%, Hardy–Weinberg equilibrium test P < 10<sup>−6</sup>, informative missingness test P < 10<sup>−9</sup>, or minor allele frequency (MAF) < 0.05, applying each filter separately. For subject-level QC, we removed subjects with call rate <95%, with an outlying heterozygosity rate based on the heterozygosity F score (beyond 4 × sd from the mean F score), or with gender mismatch. We also performed IBS/IBD filtering: pairwise identity-by-state probabilities were computed to remove both individuals in each pair with IBD > 0.98 and one subject of each pair with IBD > 0.1875. To test for population substructure, we performed PCA using smartPCA in EIGENSOFT (Patterson et al., 2006).
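The thresholds above can be restated as two small Python predicates. This is only a sketch of the filtering logic, assuming the per-SNP and per-subject summary statistics have already been computed (for example by PLINK); it is not the in-house script used in the study, and the function names are hypothetical.

```python
import statistics

def snp_passes_qc(call_rate, hwe_p, missingness_p, maf):
    """True if a SNP survives all four SNP-level filters listed in the text."""
    return (call_rate >= 0.95         # genotype call rate at least 95%
            and hwe_p >= 1e-6         # Hardy-Weinberg equilibrium test
            and missingness_p >= 1e-9 # informative missingness test
            and maf >= 0.05)          # minor allele frequency

def heterozygosity_outliers(f_scores, n_sd=4.0):
    """Indices of subjects whose F score is beyond n_sd sd from the mean."""
    mean = statistics.mean(f_scores)
    sd = statistics.stdev(f_scores)
    return [i for i, f in enumerate(f_scores) if abs(f - mean) > n_sd * sd]
```

For instance, in a cohort of 31 subjects where 30 have F = 0 and one has F = 1, only the last subject is flagged by `heterozygosity_outliers`.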

#### Gene Expression Profiles Processing

Stringent quality control and normalization steps were also applied to the gene expression profiles. Gene read counts derived from RNA-seq were normalized to TPM (transcripts per million) by scaling for gene length (union of exon lengths) and sequencing depth. We removed samples with gender mismatch by checking the sex-specific genes XIST and RPS4Y1. Sample outliers with problematic gene expression profiles were detected and removed based on hierarchical clustering (AC't Hoen et al., 2013). Genes with low expression were also removed by keeping only genes with >0.1 TPM in at least 20% of samples and ≥6 reads in at least 20% of samples. For normalization, gene expression values were log10-transformed and then quantile normalized. The SVA package was applied to remove batch effects and adjust for age, sex, RIN, PMI, and latent covariates. The residuals were output for downstream eQTL analysis.
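The TPM scaling described above can be written in a few lines. This is a minimal sketch for a single sample, assuming gene lengths are given in base pairs (union of exon lengths); the function name is illustrative and it is not the pipeline used by the authors.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert one sample's gene read counts to TPM.

    counts: read count per gene.
    lengths_bp: gene length in base pairs, in the same gene order.
    """
    # Step 1: scale for gene length (reads per kilobase).
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Step 2: scale for sequencing depth so the sample sums to one million.
    return [r / total * 1e6 for r in rates]

# A gene twice as long with twice the reads gets the same TPM.
print(counts_to_tpm([10, 20], [1000, 2000]))  # [500000.0, 500000.0]
```

Because length scaling is applied before depth scaling, TPM values always sum to one million within a sample, which makes them comparable across samples.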

#### eQTL Mapping and Mediation Analysis

MatrixEQTL (Shabalin, 2012) was used for cis/trans-eQTL mapping with an additive linear model. In the cis-eQTL analysis, variants (SNPs and indels) within 1 Mb upstream or downstream of the TSS were tested for association with gene expression traits. Variants beyond the ±1 Mb window were tested for association with gene expression traits in a trans-acting manner. For cis-eQTL results, a significance level of false discovery rate (FDR) ≤ 0.05 was used. For trans-eQTL results, we adopted a global significance level of P < 1 × 10<sup>−8</sup> because of the tremendous number of trans-associations and the weak trans-eQTL effects.
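The cis/trans split and the two significance rules can be summarized as follows. This is an illustrative Python sketch, not MatrixEQTL code; the function names are hypothetical, and treating variants on a different chromosome from the gene as trans is an assumption of the sketch.

```python
CIS_WINDOW_BP = 1_000_000  # ±1 Mb around the TSS

def classify_pair(variant_chrom, variant_pos, gene_chrom, gene_tss,
                  window=CIS_WINDOW_BP):
    """Label a variant-gene pair as 'cis' or 'trans' by distance to the TSS."""
    same_chrom = variant_chrom == gene_chrom
    if same_chrom and abs(variant_pos - gene_tss) <= window:
        return "cis"
    return "trans"

def is_significant(kind, p_value=None, fdr=None):
    """cis pairs use FDR <= 0.05; trans pairs use a global P < 1e-8."""
    if kind == "cis":
        return fdr is not None and fdr <= 0.05
    return p_value is not None and p_value < 1e-8
```

A variant 500 kb from the TSS on the same chromosome is therefore tested as cis, while one 4 Mb away, or on another chromosome, is tested as trans under the stricter P value threshold.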

For biological discovery, mediation analyses with the adaptive permutation scheme and GPD approximation (N = 10,000, a = 0.05) were applied to all candidate trios (L,C,T), where eQTL L was significantly associated with cis-eGene C (FDR ≤ 0.05; Equation 1) and trans-eGene T (P < 1 × 10<sup>−8</sup>; Equation 2). For performance comparison, mediation analyses were performed in the multiple scenarios described in the Results section.

### RESULTS

#### Candidate (L, C, T) Trios Detected in ROSMAP Dataset

After stringent quality control of both the RNA-seq and genotyping data (ROSMAP Dataset and Preprocessing), 26,662 gene transcripts and 6,736,714 variants (including SNPs and indels) from 334 subjects remained for eQTL analysis. We detected 3,195,073 significant cis-eQTL associations, representing 5,711 unique cis-eGenes and 60,758 unique cis-eQTLs, and 145,153 trans-eQTL associations, representing 1,382 trans-eGenes and 66,847 unique trans-eQTLs, under significance thresholds of FDR ≤ 0.05 (corresponding to P < 1 × 10<sup>−3</sup>) and P < 1 × 10<sup>−8</sup> for cis- and trans-eQTL associations, respectively. Seventy-five percent of trans-eQTLs were also identified as cis-eQTLs, which is similar to previous findings (Pierce et al., 2014; Yao et al., 2017). To detect mediation effects, 999,725 candidate trios (L,C,T), representing 6,217 unique gene pairs (C,T), were derived from the significant cis- and trans-eQTL associations. For the multiple correlated variants linked to each gene pair, we used the permutation schemes introduced in Materials and Methods to control for multiple testing, and for the genome-wide unique gene pairs, we used an FDR procedure to control for multiple testing.

#### Performance With Adaptive Permutation Scheme

We first compared the adaptive permutation scheme implemented in our package with the fixed permutation strategy commonly adopted by traditional methods, including GMAC (Yang et al., 2017). For each unique gene pair (C,T) from the candidate trios, we selected the most significant cis-eQTL for cis-eGene C, resulting in 6,217 trios. Mediation analyses with the fixed permutation scheme (N = 10,000) and the adaptive permutation scheme (N = 10,000, a = 0.05) were both performed on those 6,217 trios. The empirical P values P<sub>fixed</sub> and P<sub>adaptive</sub> are shown in Figure 1A; their Pearson correlation is r = 0.999, indicating that the two schemes have similar precision. Whereas the fixed scheme always ran 10,000 permutations for each tested trio, the adaptive scheme substantially reduced the number of permutations, as shown in the histogram in Figure 1B. For example, 68% of trios required fewer than 2,000 permutations. The total time used by the adaptive scheme is less than one-third of that used by the fixed permutation strategy (floating bar plot in Figure 1B).
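The time saving comes from stopping early on trios that are clearly not significant. One common way to do this is to stop as soon as enough permuted statistics have matched or exceeded the observed one. The Python sketch below illustrates that general idea only: the stopping rule, the `min_exceed` value, and the toy statistic are assumptions made for illustration, and they are not the rule that eQTLMAPT derives from its parameter a.

```python
import random

def abs_cov(x, y):
    """Toy test statistic: absolute covariance between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)) / n)

def adaptive_permutation_p(x, y, statistic=abs_cov, max_perm=10_000,
                           min_exceed=20, seed=0):
    """Permutation P value with early stopping.

    Stops once `min_exceed` permuted statistics are >= the observed one
    (hypothetical rule), or after `max_perm` permutations.
    Returns (P value, number of permutations actually run).
    """
    rng = random.Random(seed)
    observed = statistic(x, y)
    y_perm = list(y)
    exceed = 0
    done = 0
    for done in range(1, max_perm + 1):
        rng.shuffle(y_perm)
        if statistic(x, y_perm) >= observed:
            exceed += 1
            if exceed >= min_exceed:
                break  # clearly not significant: stop permuting
    return (exceed + 1) / (done + 1), done
```

Under a rule of this kind, a strongly associated pair uses the full permutation budget, while a null pair typically stops after a few dozen permutations, which is the pattern summarized in Figure 1B.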

#### More Accurate P Values and Fewer Permutations with GPD Approximation

Using a generalized Pareto distribution (GPD) to model the tail of the null distribution of permutation statistics can yield more precise empirical P values with fewer permutations than the traditional fixed permutation strategy (Knijnenburg et al., 2009). To test the performance of the GPD approximation method implemented in eQTLMAPT, we first randomly selected 1,000 (L,C,T) trios whose fixed permutation P values were less than or equal to 0.01 (N = 10,000). We then reran the mediation analyses for those trios with GPD approximation under fixed permutation schemes with N = 1,000, 5,000, and 10,000. We selected only trios with P ≤ 0.01 because only permutation P values at the tail of the null distribution can be estimated by the GPD approximation method (see Model the Tail of the Null Distribution Using GPD). Figures 2A–C show the GPD-estimated P values versus the P values derived from the fixed permutation scheme (N = 10,000, 5,000, and 1,000, respectively); the GPD-estimated P values have higher resolution than those from the fixed permutation scheme. For instance, when N is set to 1,000, GPD-estimated P values range from 10<sup>−2</sup> to 10<sup>−8</sup>, while fixed permutation-derived mediation P values range from 10<sup>−2</sup> to 10<sup>−3</sup>. The GPD-estimated P values are also much smaller than the fixed permutation-derived P values, which demonstrates that the GPD approximation method can detect mediation effects more accurately, with higher significance resolution.
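The tail approximation can be sketched as follows: when few permuted statistics exceed the observed one, a GPD is fitted to the largest permutation statistics above a threshold, and the P value is the fraction of statistics above the threshold multiplied by the fitted tail probability. The Python sketch below is illustrative and is not the eQTLMAPT implementation. For brevity it fits the GPD by the method of moments, whereas maximum likelihood is the more usual choice, and the defaults `n_exceed` and `min_hits` are assumed values.

```python
import math

def gpd_tail_p(observed, perm_stats, n_exceed=250, min_hits=10):
    """Permutation P value with a GPD approximation of the null tail."""
    n = len(perm_stats)
    if n <= n_exceed:
        raise ValueError("need more permutation statistics than n_exceed")
    hits = sum(1 for s in perm_stats if s >= observed)
    if hits >= min_hits:
        return hits / n  # enough exceedances: plain empirical estimate
    ordered = sorted(perm_stats, reverse=True)
    # Threshold halfway between the n_exceed-th and next largest statistic.
    threshold = (ordered[n_exceed - 1] + ordered[n_exceed]) / 2.0
    excess = [s - threshold for s in ordered[:n_exceed]]
    mean = sum(excess) / n_exceed
    var = sum((e - mean) ** 2 for e in excess) / (n_exceed - 1)
    # Method-of-moments estimates of the GPD shape and scale.
    ratio = mean * mean / var
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (1.0 + ratio)
    z = observed - threshold
    if abs(shape) < 1e-12:
        tail = math.exp(-z / scale)
    else:
        base = 1.0 + shape * z / scale
        tail = base ** (-1.0 / shape) if base > 0 else 0.0
    # P(stat > threshold) times P(excess > z | stat > threshold).
    return min(1.0, (n_exceed / n) * tail)
```

With N permutation statistics, the empirical estimate cannot go below 1/N, whereas the fitted tail can return much smaller values, which is the gain in resolution reported above.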

To assess the accuracy of the GPD approximation strategy, we first sampled 1,000 trios with a P value equal to 0.01 under the fixed permutation scheme with N = 100. It is reasonable to suppose that their significance is likely to be underestimated because of the small N (P<sub>fixed</sub> ≤ 0.01). We then reran the mediation analyses for those 1,000 trios with N set to 10,000, where P<sub>fixed</sub> ≤ 10<sup>−4</sup>. The density plot of the P values of those 1,000 trios derived under the fixed permutation scheme (N = 10,000) is shown in Figure 3A, which has two peaks, around 10<sup>−2</sup> and 10<sup>−3</sup>. The peak around 10<sup>−2</sup> indicates that some trios have a true significance level around 10<sup>−2</sup>. However, the larger peak centered around 10<sup>−3</sup> indicates that the significance of a large number of tests is underestimated when N = 100. We then asked whether the GPD approximation strategy can derive P values that serve as proxies for the true P values even when N is still set to 100. We extracted the trios with significance levels within the (a,b) interval (shown in Figure 3A) and reran the mediation analyses with GPD approximation and N still set to 100. The distribution of the GPD approximation-derived P values is shown as the boxplot in Figure 3A; as expected, it is centered around 10<sup>−3</sup>.

The other advantage of using GPD approximation in mediation effect analysis is that, with fewer permutations, a large amount of computing time can be saved. To achieve a resolution of P ≤ 10<sup>−8</sup>, at least 10<sup>8</sup> permutations would have to be performed under the fixed permutation scheme, while the same resolution can be achieved with only 10<sup>3</sup> permutations with GPD estimation (see Figure 2). Figure 3B shows the time cost of analyzing the mediation effect of a trio under different permutation schemes. One hundred, 1,000, 5,000, and 10,000 permutations were performed in the mediation analysis of the same collection of trios. The run time is strongly correlated with the number of permutations. We also tested the time cost added by GPD estimation under 10,000 permutations (the two right-most boxplots in Figure 3B). The GPD estimation process adds only a small time cost compared with running without it, which shows that the permutations themselves are the most time-consuming step. However, P value estimates have larger variance for small N and converge to the real P<sub>perm</sub> as N grows (Knijnenburg et al., 2009). Experimentally, we recommend that users set N ≤ 1,000; a larger N will result in more accurate estimated P values. In conclusion, by applying the GPD approximation strategy, eQTLMAPT can accurately estimate the significance level with fewer permutation operations, which makes the mediation analysis much more efficient.

FIGURE 1 | Performance of mediation analysis with the adaptive permutation scheme versus the fixed permutation scheme. (A) Empirical P values of 6,217 (L,C,T) trios derived from the adaptive scheme (y-axis) and the fixed scheme (x-axis); the portion with P<sub>fixed</sub> < 0.05 is enlarged on a −log<sub>10</sub> scale. (B) Trios are grouped by number of permutations (in the adaptive scheme) and shown as a histogram (left-side y-axis). The running time of each group (right-side y-axis) under the two permutation schemes is overlaid on the histogram as two colored dashed lines, and the total running time is shown in the floating colored bar plot. Note that all trios were run with 10,000 permutations in the fixed permutation scheme.

FIGURE 3 | Performance of eQTL mediation analysis with GPD approximation. (A) Density plot of the empirical P values under the fixed permutation scheme (N = 10,000) for the 1,000 selected trios with P<sub>fixed</sub> = 0.01 when N = 100. The cyan area was selected based on density >0.6, where fixed permutation P values were around 10<sup>−3</sup> when N = 10,000. For trios covered by the cyan area, GPD-estimated P values (N = 100) are shown in the floating boxplot. (B) Time cost of analyzing the same set of trios under various permutation schemes. The color legend indicates whether the GPD estimation process is used. P values were −log<sub>10</sub>-transformed.

#### Discovering cis-Mediators of trans-eQTLs Using the ROSMAP Dataset

To test speed and discovery performance, we compared eQTLMAPT, combining the adaptive permutation scheme and the GPD approximation strategy, with GMAC in the discovery of eQTL mediation effects using the ROSMAP dataset. For each unique gene pair, we first selected the best trio, that is, the one showing the strongest mediation effect based on the nominal P value, resulting in 6,217 candidate trios. We then performed mediation analyses using eQTLMAPT and GMAC separately on those 6,217 candidate trios. Both methods use permutation tests to adjust the P value of each trio and the FDR procedure described by Storey and Tibshirani (ST) (Storey and Tibshirani, 2003) to control for multiple testing across gene pairs. To make the comparison fair, both methods applied the adaptive confounder selection strategy, taking all of the PCs derived from the expression profiles as the selection pool of hidden confounders, and both adjusted for the same fixed covariates (age, sex, RIN, PMI, and batch). We performed N = 10,000 permutations for GMAC and N = 10,000, 5,000, 1,000, and 500 permutations for eQTLMAPT. In our program, we set a = 0.05 in the adaptive permutation scheme.
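For readers unfamiliar with the Storey and Tibshirani procedure, the gene-pair level correction can be sketched as a q-value computation. The Python version below is a simplified illustration: it estimates the proportion of true nulls at a single tuning value `lam` instead of the smoothed estimate over a grid used in the original method, and the function name and default are assumptions.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Simplified Storey-Tibshirani q-values (single fixed lambda)."""
    m = len(pvalues)
    # Estimate pi0, the proportion of true null hypotheses.
    pi0 = min(1.0, sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return qvalues

# Gene pairs with q <= 0.25 would be called significant at FDR <= 0.25.
q = storey_qvalues([0.01, 0.02, 0.03, 0.04, 0.6, 0.8])
significant = [i for i, value in enumerate(q) if value <= 0.25]
```

The input here would be one permutation-adjusted P value per gene pair, so the permutation step handles the correlated variants within a pair and the q-values handle the multiplicity across pairs.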

Table 1 summarizes the performance of eQTLMAPT and GMAC. Both methods detected a similar number of trios with suggestive mediation effects (permutation P ≤ 0.05) and a similar number of significant trios with FDR ≤ 0.25 (Storey and Tibshirani multiple-test controlling method). The Venn diagram in Figure 4 shows that most significant trios (with suggestive permutation P ≤ 0.05 or FDR ≤ 0.25) detected by GMAC can be discovered by eQTLMAPT with N = 10,000, 5,000, 1,000, and even 500. For example, among the 113 significant trios with FDR ≤ 0.25 detected by GMAC, 110 (97%) can be discovered by eQTLMAPT with N = 10,000, and 104 (92%) can be discovered by eQTLMAPT with N = 500. With a similar ability to discover significant trios, eQTLMAPT is about 90, 40, 8, and 4 times faster than GMAC when N = 500, 1,000, 5,000, and 10,000, respectively (Table 1). We also noticed that some significant trios detected by eQTLMAPT were missed by GMAC, which might be due to the improved P value resolution. However, since there is no "true" set of trios with mediation effects, we are not able to compare the true positive rate and false positive rate. In summary, with similar discovery ability, eQTLMAPT is substantially faster than GMAC, by a factor of about 4 to 90 depending on N. The 519 trios in the intersection of the five compared strategies with suggestive permutation P ≤ 0.05 are available in Supplementary Table 1.

#### Enrichment Analysis for eQTLs Among GWAS SNPs

We first performed GWAS enrichment analyses for genome-wide significant cis-eQTLs (FDR ≤ 0.05) and trans-eQTLs (P ≤ 1 × 10<sup>−8</sup>). From the NHGRI GWAS catalog (July 2019), we downloaded 70,971 unique SNPs that were reportedly associated with traits and genotyped in the ROSMAP dataset (Welter et al., 2013). After pruning correlated SNPs in LD (r<sup>2</sup> > 0.3) using PLINK and the ROSMAP genotype data, 30,894 independent trait-associated SNPs remained, of which 16,398 SNPs had GWAS P ≤ 5 × 10<sup>−8</sup> and 14,496 SNPs had GWAS P > 5 × 10<sup>−8</sup>. Among SNPs with GWAS P ≤ 5 × 10<sup>−8</sup>, 28% were cis-eQTLs, compared with 18% of SNPs with GWAS P > 5 × 10<sup>−8</sup> (Fisher's exact test OR = 1.75, 95% CI = 1.66–1.85, P = 1.83 × 10<sup>−93</sup>; Figure 5A). Note that the GWAS enrichment method was the same as described in previous work (Westra et al., 2013). In addition, we also observed GWAS enrichment for trans-eQTLs (Fisher's exact test OR = 2.58, 95% CI = 1.8–3.76, P < 2.51 × 10<sup>−8</sup>; Figure 5B). This demonstrates that SNPs known to be associated with traits are more likely to be cis/trans-eQTLs, which is consistent with previous findings (Fehrmann et al., 2011; Pierce et al., 2014).
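The enrichment test above is a Fisher's exact test on a 2 × 2 table (GWAS-significant or not, eQTL or not). The Python sketch below shows how such a test can be computed from first principles. It is an illustration rather than the code used in the study: it returns the sample odds ratio (a·d)/(b·c) rather than a conditional estimate, and the counts in the usage line are hypothetical, not taken from the paper.

```python
import math

def _log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, P value). The P value sums the
    hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = _log_choose(row1 + row2, col1)

    def log_prob(k):
        return _log_choose(row1, k) + _log_choose(row2, col1 - k) - log_denom

    log_obs = log_prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_value = sum(math.exp(lp) for lp in map(log_prob, range(lo, hi + 1))
                  if lp <= log_obs + 1e-7)
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return odds_ratio, min(1.0, p_value)

# Hypothetical table: rows = GWAS-significant / not, columns = eQTL / not.
odds_ratio, p_value = fisher_exact_2x2(30, 70, 15, 85)
```

An odds ratio above 1 with a small P value indicates that the first group contains proportionally more eQTLs than the second, which is the sense in which the GWAS SNPs are described as enriched.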

Next, we performed GWAS enrichment analysis for eQTLs with significant mediation effects. Among the 999,725 candidate trios, 67,906 trios, representing 27,100 unique SNPs, showed suggestive mediation effects with permutation P ≤ 0.05 under the fixed permutation scheme (N = 10,000). Using the same GWAS enrichment method, we found that GWAS SNPs were more likely to have mediation effects (Fisher's exact test OR = 4.19, 95% CI = 2.16–8.9, P = 1.47 × 10<sup>−6</sup>; Figure 5C), indicating that mediation analysis can help to explain GWAS findings.

#### Transcription Factors May Act as cis-Mediators

The 519 trios with suggestive permutation P ≤ 0.05 (Supplementary Table 1) represent 351 unique cis-mediators (cis-eGenes). Among those cis-mediators, we found that 14 are transcription factors (TFs), including ZNF488, ZSCAN26, ZNF254, TBX1, FOXS1, ZFP57, ZNF568, ZNF260, ZNF14, GTF2I, ZFX, CSDC2, GTF2IRD2B, and GTF2IRD2. For example, we observed the trio (rs77969091, TBX1, MSC), where TBX1 is the cis-eGene and MSC is the trans-eGene; MSC has been predicted to be a target of the transcription factor TBX1 in brain tissue and the central nervous system (Marbach et al., 2016). This indicates that trans-eQTLs can exert their effects on distant target genes by affecting TFs that act as mediators. However, we did not observe overrepresentation of TFs among cis-mediators (Fisher's exact test P = 0.15, compared with 1,665 TFs downloaded from HumanTFDB) (Hu et al., 2018).

FIGURE 4 | Venn diagram of significant trios at suggestive permutation P ≤ 0.05 (A) and FDR ≤ 0.25 (B) derived by GMAC and eQTLMAPT with different numbers of permutations.

### DISCUSSION

There have been intense efforts to identify causal genes and other biomarkers, such as RNA, protein, and microbiota, underlying complex diseases (Cheng and Hu, 2018; Cheng et al., 2019). One of these efforts is to discover genes regulated by GWAS variants through eQTL analysis. However, less is known about how trans-eQTLs act on distant genes. eQTL mediation analysis is a promising tool for uncovering the mechanisms underlying trans-eQTLs. To discover eQTL mediation effects across the whole genome, millions of candidate (eQTL, cis-eGene, trans-eGene) trios need to be tested, which requires computational methods that control for multiple testing appropriately. In practice, there are on average hundreds of variants associated with eGenes in both cis and trans, which results in huge numbers of candidate trios. For example, in the ROSMAP dataset, nearly 1 million candidate trios need to be tested, which represent only 6,217 unique (cis-eGene, trans-eGene) pairs. To determine the genome-wide significance of a nominal test statistic, we need to account for two levels of multiple testing: multiple genetic variants are tested per (cis-eGene, trans-eGene) pair, and multiple (cis-eGene, trans-eGene) pairs are tested genome-wide. We used permutation tests to correct for the former and FDR estimation to control for the latter.

The traditional permutation scheme, which runs a fixed number of permutations, has to balance time cost against P value resolution, which is limited by a lower bound. Moreover, no current tool for analyzing eQTL mediation effects has an efficient built-in permutation scheme. To fill this gap, we present eQTLMAPT, which implements a fast and accurate eQTL mediation analysis method with efficient permutation procedures to control for multiple testing. eQTLMAPT can correct for the multiple correlated variants tested via three different permutation schemes: the fixed permutation scheme, the adaptive permutation scheme, and the generalized Pareto distribution (GPD) approximation, which models the null distribution of no mediation effect using a GPD fitted to a small number of permutation statistics and can accurately estimate the adjusted P values without the lower-bound limitation. These strategies greatly accelerate multiple-testing control in mediation analyses and give users a higher resolution of estimated significance, which helps them distinguish the best signals.

In the analyses of the ROSMAP dataset, we detected 519 trios with suggestive mediation effects (permutation P ≤ 0.05), representing 351 unique cis-eGenes. Among those cis-mediators, we found that 14 are TFs, including ZNF488, ZSCAN26, ZNF254, TBX1, FOXS1, ZFP57, ZNF568, ZNF260, ZNF14, GTF2I, ZFX, CSDC2, GTF2IRD2B, and GTF2IRD2. This suggests that TFs might play a role in the mediation effects. We also tried to replicate these significant trios with mediation effects in the GTEx dataset analyzed by Yang et al. (2017), and 70 trios, identified by gene pairs, could be replicated with mediation P ≤ 0.05 in multiple tissues. For example, the gene pair (MZT2A, AC018804.6) was observed with mediation effects in multiple tissues, including brain putamen, fibroblast, colon, esophagus, lung, muscle, pancreas, pituitary, skin, thyroid, and vagina. The significance of the mediation effect reaches 2 × 10<sup>−7</sup> in GTEx muscle tissue. This might suggest a common trans-eQTL regulatory mechanism across tissues.

There are some limitations to our method and to the discoveries in the ROSMAP dataset. The discovery of trans-eQTLs requires a large sample size because of the smaller effect size of trans-eQTL associations. A small sample size might make trans-eQTL signals less replicable across studies. The effective sample size of the ROSMAP dataset used in the discovery study is relatively small, which might be the reason that some trios could not be replicated in the GTEx dataset, whose sample size is also limited. Besides the transcription factors found among the cis-mediators, non-coding genes, such as long non-coding RNA (lncRNA), microRNA, snRNA, antisense RNA, and pseudogenes, were also detected. The top three gene classes are protein-coding, pseudogene, and lncRNA genes. Although many studies have shown that non-coding RNAs play key roles in the complex regulatory networks of the cell, most of their functions remain unknown (Cheng et al., 2018a; Cheng et al., 2018d; Peng et al., 2019b). Further computational methods and biological experiments, for example using phenotypes, ontologies, and deep learning methods, are still needed to understand these unknown markers (Cheng et al., 2016; Cheng et al., 2018c; Peng et al., 2019c; Peng et al., 2019d). In addition, since gene expression is tissue-specific and cell type-specific, the mediation effects found in brain tissue might not show up in other tissues and cell types. Thus, with the development of single-cell RNA sequencing technologies, further studies should pay more attention to cell type-specific mediation effects.

In conclusion, we present eQTLMAPT, an R package which aims to perform eQTL mediation analysis with efficient permutation procedures in multiple testing correction (Supplementary Figure 1). Experiments demonstrate that our method provides higher resolution in estimated significance and is an order of magnitude faster than the compared methods. Our method will be helpful in identifying mediation effects, which could allow us to better understand the biological mechanisms underlying trans-eQTLs and the regulatory network in the cell.

## DATA AVAILABILITY STATEMENT

Genotype and RNA-seq data of the ROSMAP study (controlled access): Synapse platform (https://www.synapse.org/#!Synapse:syn3219045). Source code and comprehensive documentation of eQTLMAPT are freely available for download at https://github.com/QidiPeng/eQTLMAPT.

## AUTHOR CONTRIBUTIONS

TW designed the study, co-implemented the R package, analyzed data, and wrote the paper. QP co-implemented the R package, performed dry experiments, and revised the paper. BL, XL, and YL revised the paper and provided suggestions. JP and YW supervised the research, provided funding support, and revised the paper.

## FUNDING

This work has been supported by the National Key Research and Development Program of China (Nos. 2017YFC1201201 and 2017YFC0907503).

## ACKNOWLEDGMENTS

We thank all the contributors in ROSMAP study. The results published here are in whole or in part based on data obtained from the AMP-AD Knowledge Portal (doi:10.7303/syn2580853). Study data were provided by the Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago. Data collection was supported through funding by NIA grants P30AG10161, R01AG15819, R01AG17917, R01AG30146, R01AG36836, U01AG32984, U01AG46152, the Illinois Department of Public Health, and the Translational Genomics Research Institute.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01309/full#supplementary-material

SUPPLEMENTARY FIGURE 1 | Overview of functions implemented in eQTLMAPT.

## REFERENCES


animal transcription factors. Nucleic Acids Res. 47, D33–D38. doi: 10.1093/nar/gky822


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Wang, Peng, Liu, Liu, Liu, Peng and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# GANsDTA: Predicting Drug-Target Binding Affinity Using GANs

Lingling Zhao<sup>1</sup>\*, Junjie Wang<sup>1</sup>, Long Pang<sup>2</sup>, Yang Liu<sup>1</sup> and Jun Zhang<sup>3</sup>

<sup>1</sup> School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, <sup>2</sup> Institute of Space Environment and Material Science, Harbin Institute of Technology, Harbin, China, <sup>3</sup> Department of Rehabilitation, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China

The computational prediction of interactions between drugs and targets is a long-standing challenge in drug discovery. State-of-the-art methods for drug-target interaction prediction are primarily based on supervised machine learning with known label information. However, in biomedicine, obtaining labeled training data is an expensive and laborious process. This paper proposes a semi-supervised generative adversarial networks (GANs)-based method to predict binding affinity. Our method comprises two parts: two GANs for feature extraction and a regression network for prediction. The semi-supervised mechanism allows our model to learn protein and drug features from both labeled and unlabeled data. We evaluate the performance of our method using multiple public datasets. Experimental results demonstrate that our method achieves competitive performance while utilizing freely available unlabeled data. Our results suggest that utilizing such unlabeled data can considerably help improve performance in various biomedical relation extraction processes, for example, drug-target interaction and protein-protein interaction, particularly when only limited labeled data are available in such tasks. To the best of our knowledge, this is the first semi-supervised GANs-based method to predict binding affinity.

Keywords: drug-target affinity prediction, deep learning, semi-supervised, generative adversarial networks, convolutional neural networks

Edited by: Lei Deng, Central South University, China

Reviewed by: Hao Lin, University of Electronic Science and Technology of China, China; Rubo Zhang, Dalian Nationalities University, China

\*Correspondence: Lingling Zhao zhaoll@hit.edu.cn

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 23 August 2019 Accepted: 11 November 2019 Published: 09 January 2020

#### Citation:

Zhao L, Wang J, Pang L, Liu Y and Zhang J (2020) GANsDTA: Predicting Drug-Target Binding Affinity Using GANs. Front. Genet. 10:1243. doi: 10.3389/fgene.2019.01243

#### INTRODUCTION

A basic task in new drug design and development is to model the interaction between known drugs and target proteins and to identify drugs with a high affinity for specific disease proteins (Cheng et al., 2018a; Cheng et al., 2019b). However, this is a challenging and expensive process, even when only the approximately 97M compounds reported in the PubChem database (Bolton et al., 2008) and the 12K drug entries reported in DrugBank (Wishart et al., 2006) are considered. Computational methods, especially machine learning models, can considerably accelerate the drug development process and save costs by guiding biological experiments.

Drug-target interaction (DTI) prediction (Yamanishi et al., 2010; Liu et al., 2016; Nascimento et al., 2016; Keum and Nam, 2017) has been modeled as a binary classification problem and solved with traditional machine learning methods in recent decades. These methods have achieved remarkable performance; however, they still exhibit limitations because of their strong dependence on handcrafted features.

Apart from DTI prediction, drug-target binding affinity (DTA) prediction (Pahikkala et al., 2014; He et al., 2017) is attracting more interest, as binding affinity indicates the strength of the interaction between a drug-target (DT) pair. Predicting DTA can therefore considerably benefit drug discovery, because the search space is narrowed by pruning DT pairs with low binding affinity scores. Kronecker regularized least squares (KronRLS) (Pahikkala et al., 2014) and boosting machines (SimBoost) (He et al., 2017) are two state-of-the-art methods for both DTI and DTA prediction. KronRLS is a similarity-based method that predicts the interaction by evaluating the structural similarity among compounds and targets. In contrast, SimBoost uses a gradient boosting machine and belongs to the feature-based methods; its features involve the similarity matrices of the drugs and of the targets (He et al., 2017). Similarity-based methods (Cheng et al., 2018b) generally rely on similarities to predict DT interactions, which inevitably leads to bias. Feature-based methods involve more information about the DT pair, but expert knowledge and feature engineering are also required to construct appropriate features.

Deep learning can represent and recognize hidden patterns in data well; therefore, deep learning-based methods have been proposed to predict DTI or DTA using deep neural networks (DNNs) (Peng-Wei et al., 2016; Tian et al., 2016; Hamanaka et al., 2017), convolutional neural networks (CNNs) (Jastrzebski et al., 2016; Gomez-Bombarelli et al., 2018), recurrent neural networks (RNNs), and stacked-autoencoder-based architectures. These methods facilitate learning from the provided 3D structures and the bimolecular interaction mechanism. On the one hand, this improves prediction because more important structural information is exploited; on the other hand, when the 3D structure is the input, these methods depend considerably on the availability of a known 3D structure of the protein-ligand complex.

Another deep learning-based method, DeepDTA, was implemented to predict binding affinities with CNNs using only 1D representations, that is, the sequences of the proteins and the simplified molecular input line entry system (SMILES) strings of the compounds. In DeepDTA, two CNN blocks are employed as feature extractors, and a fully connected layer receives the output of the CNN blocks and outputs the final prediction. DeepDTA exploits the strong representation ability of CNNs while avoiding dependence on 3D structure information, which results in remarkable performance compared with traditional machine learning methods. However, like all state-of-the-art methods for DTA prediction, DeepDTA is primarily based on supervised machine learning with known label information. Creating large sets of training data is known to be prohibitively expensive and laborious, particularly in biomedicine, as domain knowledge is required.

Generative adversarial networks (GANs), an unsupervised learning method devised by Goodfellow et al. (2014), may address this challenge. The GAN architecture is characterized by two differentiable functions that play different roles in refining the system: a generator and a discriminator. The generator learns to produce data from a learned probability distribution. The discriminator judges whether its input comes from the generator or from the actual data set. GANs and their variants have achieved great success in many applications, such as computer vision and natural language processing. GANs are also attractive because they can learn representations: parts of the generator and discriminator networks can be reused as feature extractors, which can be widely applied in supervised classification or prediction tasks. On the other hand, GANs have some problems. For example, the better the discriminator is, the more severely the generator's gradient vanishes, and adversarial training may cause the model to collapse, which is inconvenient in practical applications. To solve these problems, researchers continue to propose improved variants, including least squares GAN (LSGAN) (Mao et al., 2017), Wasserstein GAN (WGAN) (Arjovsky et al., 2017), conditional GAN (CGAN) (Mirza and Osindero, 2014), information maximizing GAN (InfoGAN) (Chen et al., 2016), energy-based GAN (EBGAN) (Zhao and Mathieu, 2016), and boundary-seeking GAN (BGAN) (Hjelm R D, 2017).
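For reference, the two-player game between the generator G and the discriminator D is usually written as the minimax objective of Goodfellow et al. (2014), where p<sub>data</sub> denotes the data distribution and p<sub>z</sub> the noise prior from which the generator draws its input:

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator is trained to assign high probability to real samples and low probability to generated ones, while the generator is trained to make D(G(z)) large. When the discriminator separates the two almost perfectly, the second term saturates and passes little gradient to the generator, which is the vanishing-gradient problem mentioned above.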

Owing to the unsupervised nature of GANs, in this paper we propose a GAN-based method to predict binding affinity, called GANsDTA for short. Our method comprises two types of networks: two partial GANs that extract features from the raw protein sequences and SMILES strings separately, and a regression network that uses convolutional neural networks for prediction. The main contributions of this paper are as follows: we propose a semi-supervised framework for DTA prediction, and we adopt GANs to extract features of protein sequences and compound SMILES strings in an unsupervised way. Because a GAN-based feature extractor does not require labeled data, the proposed model can use unlabeled data for training. This semi-supervised mechanism allows the model to learn protein and drug features from additional datasets, even those without labels, leading to better feature representation and, accordingly, better prediction performance. To the best of our knowledge, this is the first semi-supervised GAN-based method to predict binding affinity. Our results suggest that utilizing such unlabeled data can considerably help improve performance in various biomedical relation extraction processes, particularly when only limited labeled data (e.g., 2000 samples or fewer) are available in such tasks.

#### MATERIALS AND METHODS

#### Data Sets

We evaluated our proposed method using two benchmark data sets, the Davis (Davis et al., 2011) and KIBA (Tang et al., 2014) data sets. Table 1 and Figure 1 provide the statistics of these two data sets.



### Proposed Method

#### Overview of our Approach

Figure 2 provides an overview of the entire pipeline of our method for drug-target binding affinity prediction. Our approach comprises three elements: two feature extractors, for the protein sequence and the compound, respectively, and a regressor for affinity value prediction. Each feature extractor is a feature representation module taken from a GAN, while the regressor is a CNN. A two-round training pattern is employed. In the first round, the feature extractors are trained in the context of GANs. First, the generator produces fake samples from a given noise distribution, and then the fake samples from the generator and the real samples from the available data sets are fed to the discriminator network. To learn to distinguish real from fake protein sequences and compound SMILES strings, the discriminator maps its input into a feature space with a local feature extractor, which facilitates sample classification. Thus, after the whole GAN has been trained, the discriminator yields a local feature extractor that can represent the characteristics of an input protein sequence or SMILES string. This trained local feature extractor serves as the feature representation of the proposed framework and is followed by a regressor or classifier for a prediction or classification task, respectively. Finally, in the second round of training, with the labeled data (SMILES strings and protein sequences) and the GAN-based feature extractors held fixed, the regressor is trained to minimize the loss function, leading to the optimal model parameters.

In the proposed method, the input proteins and drugs are treated as sequence representations. In particular, drugs are represented as SMILES strings – describing the chemical structure in short ASCII strings, and similarly, protein sequences are represented as a string of ASCII letters, which are the amino acids. Having the inputs as strings of text, the discriminator can learn the latent features of those sequences.
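As an illustration of how such text strings can be handed to a network, the following minimal sketch turns SMILES strings and protein sequences into fixed-length integer vectors. The character-level vocabulary, the zero padding, and the names `build_vocab`, `encode`, and `max_len` are assumptions made for this example; the text above does not specify the encoding that was actually used.

```python
def build_vocab(sequences):
    """Map every distinct character to a positive integer (0 is reserved for padding)."""
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(seq, vocab, max_len):
    """Integer-encode a string, truncating or zero-padding it to max_len.

    Characters missing from the vocabulary are also mapped to 0.
    """
    ids = [vocab.get(ch, 0) for ch in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


smiles = ["CCO", "c1ccccc1"]        # ethanol and benzene
proteins = ["MKTAYIAK", "MSEQ"]     # toy amino-acid strings, not real proteins

smiles_vocab = build_vocab(smiles)
protein_vocab = build_vocab(proteins)

encoded_smiles = [encode(s, smiles_vocab, max_len=10) for s in smiles]
encoded_proteins = [encode(p, protein_vocab, max_len=10) for p in proteins]
```

In practice the vocabulary would be built once from the whole corpus, and `max_len` would be chosen from the length distribution of the sequences.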

#### Feature Extracting Model

Goodfellow et al. (2014) proposed a framework that uses a minimax game to train deep generative models, the so-called GANs. A GAN comprises two parts, a generator G and a discriminator D. The generator network G generates fake samples from the generator distribution P<sub>G</sub> by transforming a noise variable z ∼ P<sub>noise</sub>(z) into a sample G(z). The discriminator D aims to distinguish these generated samples, which follow the distribution P<sub>G</sub>, from samples of the true data distribution P<sub>data</sub>. G and D are trained by playing against each other, which can be formulated as the following minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim P_{noise}}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$

Meanwhile, for a given generator G, the optimal discriminator is D\*(x) = P<sub>data</sub>(x) / (P<sub>data</sub>(x) + P<sub>G</sub>(x)).
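The expression for the optimal discriminator follows from maximizing the value function pointwise in x. The short derivation below is the standard argument of Goodfellow et al. (2014), written in the notation of Equation (1). The symbol D\* for the optimum is introduced here for convenience.

$$
\begin{aligned}
V(D, G) &= \int_x \left[ P_{data}(x) \log D(x) + P_G(x) \log\left(1 - D(x)\right) \right] dx, \\
0 &= \frac{\partial}{\partial D(x)} \left[ P_{data}(x) \log D(x) + P_G(x) \log\left(1 - D(x)\right) \right] = \frac{P_{data}(x)}{D(x)} - \frac{P_G(x)}{1 - D(x)}, \\
D^{*}(x) &= \frac{P_{data}(x)}{P_{data}(x) + P_G(x)}.
\end{aligned}
$$

The first line rewrites the second expectation of Equation (1) as an integral over samples x = G(z) drawn from P<sub>G</sub>. Because the integrand is concave in D(x), the stationary point is a maximum.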

The GAN employed in our framework is depicted in Figure 3. The generator network is a four-layer fully connected network that takes a noise vector as input and produces a protein sequence or a SMILES string. The discriminator network is a three-layer fully connected network whose output is a probability value between 0 and 1, where 1 means that the input is real and 0 means that the input is fake.

Typically, the discriminator network can be decomposed into a feature extractor F(·; φ<sub>f</sub>) and a sigmoid classification layer with weight vector φ<sub>l</sub>. Mathematically, given an input sequence s, we have

$$D(\mathbf{s}) = \text{sigmoid}\left(\phi_l^T F(\mathbf{s}; \phi_f)\right) = \text{sigmoid}\left(\phi_l^T f\right) \tag{2}$$

where φ = (φ<sub>f</sub>, φ<sub>l</sub>) and sigmoid(z) = 1/(1 + e<sup>−z</sup>). f = F(s; φ<sub>f</sub>) is the feature vector of s in the last layer of D, which is passed on to the regression model.
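The decomposition in Equation (2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the layer sizes (10, 8, 4), the ReLU activations, the random initialisation, and the function names are hypothetical and are not taken from the paper.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
phi_f = [init_layer(10, 8, rng), init_layer(8, 4, rng)]  # parameters of F
phi_l = [rng.uniform(-0.1, 0.1) for _ in range(4)]       # weights of the sigmoid layer


def feature_extractor(s, phi_f):
    """F(s; phi_f): the hidden layers of the discriminator."""
    h = s
    for weights, biases in phi_f:
        h = relu(dense(h, weights, biases))
    return h


def discriminator(s, phi_f, phi_l):
    """D(s) = sigmoid(phi_l^T F(s; phi_f)), as in Equation (2)."""
    f = feature_extractor(s, phi_f)
    return sigmoid(sum(w * fi for w, fi in zip(phi_l, f)))


s = [2, 2, 3, 0, 0, 0, 0, 0, 0, 0]      # an integer-encoded toy sequence
features = feature_extractor(s, phi_f)  # the vector f handed to the regressor
prob_real = discriminator(s, phi_f, phi_l)
```

Once the GAN has been trained, only `feature_extractor` is needed downstream; the sigmoid layer is discarded.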

#### Regression Model

To predict the binding affinity, we combine the intermediate features learned by the two GANs and then apply a few 1D convolution layers to learn the final regression output. The convolutional regression model applies convolutions with a kernel size of 4 to obtain feature maps of the input. The first, second, and third convolution layers have dimensions 16×4, 32×4, and 48×4, respectively, and each is followed by a ReLU activation function. The activation function of the output layer is linear (the identity function, i.e., y = x), so the network outputs a continuous value. This network is trained by minimizing the mean squared error (MSE) between the outputs p of the network and the measured affinity values y included in the dataset:

$$MSE = \frac{1}{n} \sum_{k=1}^{n} (p_k - y_k)^2 \tag{3}$$

#### EXPERIMENTS AND RESULTS

We compared our proposed method with the state-of-the-art DTA prediction models using the Davis and KIBA datasets. For these two datasets, we used the same setting as DeepDTA, that is, 80% of the data were used as training samples and 20% as testing samples. In addition, our model is trained on both the labeled and unlabeled instances. We apply the Adam optimizer with an initial learning rate of 0.0001 to optimize the parameters of the model. We manually tuned the hyperparameters based on the results on the validation set. The performance of the proposed model was measured with the concordance index (CI) and mean squared error (MSE) metrics. CI evaluates the ranking performance of models that output continuous values:

$$CI = \frac{1}{Z} \sum_{\delta_x > \delta_y} h(b_x - b_y) \tag{4}$$

where b<sub>x</sub> is the prediction value for the larger affinity δ<sub>x</sub>, b<sub>y</sub> is the prediction value for the smaller affinity δ<sub>y</sub>, Z is a normalization constant, and h(m) is the step function:

$$h(m) = \begin{cases} 1, & \text{if } m > 0 \\ 0.5, & \text{if } m = 0 \\ 0, & \text{if } m < 0 \end{cases} \tag{5}$$
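Equations (4) and (5) translate directly into a double loop over all pairs of samples. The sketch below assumes that Z is the number of pairs whose measured affinities differ, which is the usual convention; the function name and the toy values are illustrative only.

```python
def concordance_index(affinities, predictions):
    """CI over all pairs whose measured affinities differ (Equations 4 and 5)."""
    total = 0.0
    z = 0  # normalisation constant Z: number of comparable pairs
    n = len(affinities)
    for x in range(n):
        for y in range(n):
            if affinities[x] > affinities[y]:
                z += 1
                m = predictions[x] - predictions[y]
                if m > 0:
                    total += 1.0   # pair ranked correctly
                elif m == 0:
                    total += 0.5   # tie in the predictions
    if z == 0:
        raise ValueError("no pair with different affinities")
    return total / z


measured = [5.0, 7.2, 6.1, 8.0]     # toy affinity values
predicted = [5.5, 6.9, 7.0, 7.5]    # toy model outputs
ci = concordance_index(measured, predicted)
```

The quadratic loop is adequate for small examples; on the full benchmark sets a sorted or vectorised implementation would be preferable.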

MSE is a common measure of the difference between the predicted values p and the actual values y, and is defined in Equation (3).

We compared the predictive performance of our method with that of DeepDTA and of two machine-learning-based methods, KronRLS and SimBoost. Both our method and DeepDTA use only the protein sequences and the SMILES strings of the compounds. The difference is that our method can extract features of proteins and compounds in an unsupervised manner. Tables 2 and 3 present the MSE and CI values of the different methods for the Davis and KIBA datasets, respectively.

For the Davis dataset (Table 2), the DeepDTA variant that uses Smith–Waterman similarity as the protein representation and 1D strings for the drugs achieves the best CI score (0.886), slightly higher than that of our method, but its MSE is much higher than that of our method. The other DeepDTA variant, which uses CNNs for both protein and compound representation, achieves the best MSE (0.261) but a lower CI than our method.

A similar performance is observed for the KIBA dataset (Table 3). In particular, DeepDTA is the best baseline in both measures (CI of 0.863 and MSE of 0.194) when both drugs and proteins are represented as 'words'. Regarding CI, the proposed GANsDTA exhibits a slight improvement, with a best CI of 0.866.

TABLE 2 | CI and MSE scores for the Davis dataset on the independent test for our method and other methods.

Bold text indicates the best results.

TABLE 3 | CI and MSE scores for the KIBA dataset on the independent test.

Bold text indicates the best results.

TABLE 4 | The r<sub>m</sub><sup>2</sup> index and AUPR score for the Davis dataset.

TABLE 5 | The r<sub>m</sub><sup>2</sup> index and AUPR score for the KIBA dataset.

To provide a better assessment of our model, we also evaluated GANsDTA, DeepDTA with two CNN modules, and two baseline methods with two additional metrics: the r<sub>m</sub><sup>2</sup> index and the area under the precision-recall curve (AUPR) score. The r<sub>m</sub><sup>2</sup> index is a metric that indicates whether a model is acceptable. Generally, if the value of the r<sub>m</sub><sup>2</sup> index is greater than 0.5 on a test set, we consider the model to be acceptable. The metric is described in Equation (6), where r<sup>2</sup> and r<sub>0</sub><sup>2</sup> are the squared correlation coefficients with and without intercept, respectively. The details of the formulation are explained in Pratim Roy et al. (2009) and Roy et al. (2013).

$$r_m^2 = r^2 \times \left(1 - \sqrt{r^2 - r_0^2}\right) \tag{6}$$
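A minimal sketch of Equation (6) is given below. It assumes one common formulation: r<sup>2</sup> is the squared Pearson correlation, and r<sub>0</sub><sup>2</sup> is the coefficient of determination of a regression through the origin. The `abs()` inside the square root is a numerical guard added for this example and is not part of Equation (6); the cited references should be consulted for the exact definition.

```python
import math


def squared_pearson(y, p):
    """r^2: squared Pearson correlation between measured y and predicted p."""
    n = len(y)
    my, mp = sum(y) / n, sum(p) / n
    cov = sum((a - my) * (b - mp) for a, b in zip(y, p))
    var_y = sum((a - my) ** 2 for a in y)
    var_p = sum((b - mp) ** 2 for b in p)
    return cov * cov / (var_y * var_p)


def squared_r0(y, p):
    """r_0^2: coefficient of determination of the through-origin fit y ~ k * p."""
    k = sum(a * b for a, b in zip(y, p)) / sum(b * b for b in p)
    my = sum(y) / len(y)
    ss_res = sum((a - k * b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rm2(y, p):
    """Equation (6); abs() guards against a slightly negative radicand."""
    r2 = squared_pearson(y, p)
    r02 = squared_r0(y, p)
    return r2 * (1.0 - math.sqrt(abs(r2 - r02)))


measured = [5.0, 6.0, 7.0, 8.0]     # toy affinity values
predicted = [5.2, 5.9, 7.3, 7.8]    # toy model outputs
score = rm2(measured, predicted)
```

A perfect predictor gives r<sup>2</sup> = r<sub>0</sub><sup>2</sup> = 1 and therefore an index of 1.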

The AUPR score is generally adopted for binary prediction. To measure AUPR-based performance, the Davis and KIBA datasets must be converted into binary form by thresholding. For the Davis dataset we selected a pKd value of 7 as the threshold, while for the KIBA dataset the threshold is 12.1, the same values as used by Öztürk et al. (2018).
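The thresholding step and a simple AUPR estimate can be sketched as follows. Two points are assumptions of this example rather than statements from the text: values equal to the threshold are counted as positive, and the area is computed as the step-wise average precision, which is one of several common ways to estimate AUPR.

```python
def binarize(affinities, threshold):
    """Label a pair as positive (1) when its affinity reaches the threshold."""
    return [1 if a >= threshold else 0 for a in affinities]


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (score ties are not merged)."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("no positive example")
    hits, area = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            area += hits / rank  # precision at each new recall level
    return area / n_pos


measured = [5.0, 7.4, 6.2, 8.1]     # toy pKd values
predicted = [5.3, 7.0, 7.2, 7.9]    # toy model outputs
labels = binarize(measured, threshold=7.0)
aupr = average_precision(labels, predicted)
```

For the KIBA dataset the same functions would be called with `threshold=12.1`.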

Tables 4 and 5 list the r<sub>m</sub><sup>2</sup> index and AUPR score of GANsDTA and three baseline methods on the Davis and KIBA datasets, respectively. The results suggest that SimBoost, DeepDTA, and GANsDTA are acceptable models for affinity prediction with respect to the r<sub>m</sub><sup>2</sup> value.

Figure 4 illustrates the predicted binding affinity values against the actual values for our GANsDTA on the Davis and KIBA datasets. Evidently, an ideal model is expected to enable predictions (p) equal to the measured (y) values. For GANsDTA, it can be observed that the density is high around the p=y line, particularly for the KIBA dataset.

It can be observed from Tables 2-4 that the proposed GANsDTA performs similarly to DeepDTA. For the Davis dataset, GANsDTA provides a slightly lower CI score (0.881) than the state-of-the-art DeepDTA with CNN feature extraction (0.886), and an MSE that is slightly higher, by 0.015. A likely reason is that the GANs are insufficiently trained because the Davis dataset is small: it includes only 442 proteins, 68 compounds, and 30056 interactions. Nevertheless, GANsDTA is still the second-best predictor. The other benchmark, the KIBA dataset, includes 229 proteins, 2111 compounds, and 118254 interactions, which enables the GANs to be trained better and leads to better prediction accuracy. This indicates that GANsDTA is more suitable for prediction tasks with large datasets. In the future, more datasets (Cheng et al., 2016; Cheng et al., 2018c; Cheng et al., 2019a) can be utilized to improve the training of GANsDTA.

#### CONCLUSION

Predicting drug-target binding affinity is a challenging task in drug discovery. Supervised methods depend heavily on labeled data, which are expensive and difficult to obtain on a large scale. In this paper, we propose a semi-supervised GAN-based method to estimate drug-target binding affinity that effectively learns useful features from both labeled and unlabeled data. We use GANs to learn representations from the raw sequence data of proteins and drugs, and convolutional regression to predict the affinity. We compare the performance of the proposed model with state-of-the-art deep-learning-based methods as baselines. By utilizing freely available unlabeled data, our model achieves competitive performance. However, because GANs are difficult to train, this approach is not competitive on small datasets, and improved techniques for training GANs should be employed to enhance its adaptability.

#### DATA AVAILABILITY STATEMENT

The KIBA and Davis datasets used in this study can be found at http://www.ebi.ac.uk/biostudies/studies/S-EPMC6129291?xr=true.

#### AUTHOR CONTRIBUTIONS

LZ, JW, and LP substantially contributed to the conception and design of the study, and acquisition of data. YL analyzed and interpreted the data. LZ, JW, and JZ drafted the article.

### FUNDING

This work is supported by the National Natural Science Foundation of China (NSFC, Grant nos. 61305013 and 61872114).

#### ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their valuable comments.

#### REFERENCES

Tian, K., Shao, M., Wang, Y., Guan, J., and Zhou, S. (2016). Boosting compound-protein interaction prediction by deep learning. Methods 110, 64–72. doi: 10.1016/j.ymeth.2016.06.024

Wishart, D. S., Knox, C., Guo, A. C., Shrivastava, S., Hassanali, M., Stothard, P., et al. (2006). DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, D668–D672. doi: 10.1093/nar/gkj067

Yamanishi, Y., Kotera, M., Kanehisa, M., and Goto, S. (2010). Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics 26, i246–i254. doi: 10.1093/bioinformatics/btq176

Zhao, J., Mathieu, M., and LeCun, Y. (2016). Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.

Conflict of Interest: The authors declare that the research was conducted without any commercial or financial relationships that could be construed as potential conflict of interest.

Copyright © 2020 Zhao, Wang, Pang, Liu and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# TriPCE: A Novel Tri-Clustering Algorithm for Identifying Pan-Cancer Epigenetic Patterns

Yanglan Gan<sup>1</sup>, Ning Li<sup>1</sup>, Yongchang Xin<sup>1</sup> and Guobing Zou<sup>2</sup>\*

<sup>1</sup> School of Computer Science and Technology, Donghua University, Shanghai, China, <sup>2</sup> School of Computer Engineering and Science, Shanghai University, Shanghai, China

Epigenetic alteration is a fundamental characteristic of nearly all human cancers. Tumor cells not only harbor genetic alterations, but also are regulated by diverse epigenetic modifications. Identification of epigenetic similarities across different cancer types is beneficial for the discovery of treatments that can be extended to different cancers. Nowadays, abundant epigenetic modification profiles have provided a great opportunity to achieve this goal. Here, we proposed a new approach TriPCE, introducing tri-clustering strategy to integrative pan-cancer epigenomic analysis. The method is able to identify coherent patterns of various epigenetic modifications across different cancer types. To validate its capability, we applied the proposed TriPCE to analyze six important epigenetic marks among seven cancer types, and identified significant cross-cancer epigenetic similarities. These results suggest that specific epigenetic patterns indeed exist among these investigated cancers. Furthermore, the gene functional analysis performed on the associated gene sets demonstrates strong relevance with cancer development and reveals consistent risk tendency among these investigated cancer types.

Keywords: epigenetic analysis, pattern discovery, tri-clustering, FP-growth algorithm, pan-cancer

#### Edited by:

Liang Cheng, Harbin Medical University, China

#### Reviewed by:

Hui Liu, Changzhou University, China Tianfan Fu, Georgia Institute of Technology, United States Qingting Wei, Nanchang University, China

#### \*Correspondence:

Guobing Zou gbzou@shu.edu.cn

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

> Received: 24 August 2019 Accepted: 25 November 2019 Published: 15 January 2020

#### Citation:

Gan Y, Li N, Xin Y and Zou G (2020) TriPCE: A Novel Tri-Clustering Algorithm for Identifying Pan-Cancer Epigenetic Patterns. Front. Genet. 10:1298. doi: 10.3389/fgene.2019.01298

### INTRODUCTION

Cancer genetics and epigenetics are closely linked in driving the cancer phenotype (Bailey et al., 2018). The vast majority of human cancers emerge from a gradual accumulation of somatic alterations and epigenetic abnormalities, which together lead to malignant growth (Jones et al., 2016). Epigenetic changes can further enable tumor cells to escape from host immune surveillance and various treatments (You and Jones, 2012). Epigenetic abnormalities are usually observed as disrupted DNA methylation patterns (Chiappinelli et al., 2015), abnormal histone post-translational modifications (Sawan and Herceg, 2010), and aberrant changes in chromatin organization (Allis and Jenuwein, 2016). How to identify the epigenetic modification patterns that lead to the corresponding dysregulation in diverse cancers has become a critical research issue in cancer studies (Dawson, 2017; Kelly and Issa, 2017).

Great advancements have been made in delineating the underlying mechanisms of human cancers (Lawrence et al., 2014; Martincorena and Campbell, 2015). Extensive research has centered on the genetic aspect of cancers, such as how mutational activation and inactivation of cancer genes influence the cellular pathways (Vogelstein et al., 2013; Waddell et al., 2015). Recently, drug discovery efforts have increasingly targeted the cancer epigenome (Flavahan et al., 2017). Many epigenome mapping projects have been launched. The Cancer Genome Atlas Network (TCGA), BLUEPRINT, and the International Cancer Genome Consortium (ICGC) define the genome-wide distribution of epigenetic marks in many normal and cancerous tissues (Beck et al., 2012; Kundaje et al., 2015; Weinstein et al., 2015). Given the genome-wide distribution of epigenetic modifications of different cancers, it is urgent to decipher common epigenetic patterns across cancers and to understand the underlying mechanisms of tumorigenesis. Key epigenomic similarities shared by different cancer types would present an important opportunity to design effective treatment strategies across cancers regardless of tissue or organ, and would enable the extension of effective treatments from one cancer type to another (Karlic et al., 2010; Gan et al., 2018).

To detect significant epigenetic patterns, existing computational methods mainly focus on identifying combinatorial states of different epigenetic marks. Specifically, CoSBI captures diverse histone modification patterns based on the correlations of different histone signals (Ucar et al., 2011). ChromHMM and HiHMM both apply a hidden Markov model (HMM) to annotate genomic sequences by the co-occurrence of multiple epigenetic marks (Ernst et al., 2011; Sohn et al., 2015). RFECS is based mainly on random forests (Rajagopal et al., 2013). IDEAS is able to jointly characterize epigenetic landscapes in many cell types and detect differential regulatory regions (Zhang et al., 2016). These methods have successfully identified combinatorial epigenetic patterns in specific cell types. However, the relations among different cancer types still need to be investigated. Because DNA methylation in cancers has been addressed elsewhere (Kretzmer et al., 2015; Yang et al., 2016), here we focus only on the critical covalent histone modifications that are altered in various cancers, particularly the well-studied acetylation and methylation modifications.

In this paper, we propose a tri-clustering approach, named TriPCE, for integrative pan-cancer epigenomic analysis. TriPCE adopts a tri-clustering strategy to identify the coherent patterns of various epigenetic modifications across different cancer types. We applied TriPCE to investigate six critical epigenetic marks among seven cancer types, and identified significant pan-cancer epigenetic modification patterns. The results reveal a consistent epigenetic modification tendency among these cancer types. Meanwhile, the gene function analysis demonstrates that the associated genes are strongly relevant to cancer-related cellular pathways.

#### MATERIALS AND METHODS

#### Datasets

To detect epigenetic similarities among different cancers, we analyzed the epigenome maps of seven cancer types, including A549, K562, HepG2, HCT116, HeLa-S3, multiple myeloma-Cell Line, and sporadic Burkitt lymphoma-Cell Line. For the epigenetic marks, we first filtered out those marks that are not available for all seven cancer types, and then focused on six widely studied ones: H3K4me1, H3K4me3, H3K9me3, H3K27ac, H3K27me3, and H3K36me3. The RNA expression profiles of these cancers were also collected. In total, we obtained 42 epigenome maps and 7 RNA expression profiles for these cancers. The datasets were downloaded from the website of NIH Roadmap Epigenome Project.

#### General Scheme of the TriPCE Approach

We developed a tri-clustering approach, TriPCE, to dissect pan-cancer epigenetic patterns. The method not only explicitly detects combinatorial states of various epigenetic marks in different genomic segments, but also mines similar epigenetic patterns across different cancer types. The proposed TriPCE model has three key steps, as shown in Figure 1. First, it preprocesses the modification data of various epigenetic marks in different cancer types. Second, it identifies bi-Clusters for each epigenetic mark based on the FP-growth algorithm. Third, it mines tri-Clusters with coherent epigenetic modification patterns across different cancer types.

Step 1. Preprocess the epigenetic modification data of different cancer types. First, the genome was divided into consecutive genomic segments, with a typical segment size of 200 bp (Gan et al., 2017). For each epigenetic modification map, we computed the summary tag count of every segment. Each segment is then associated with the intensities of a set of epigenetic modifications in each cancer type. To reduce the impact of noise resulting from spurious tag counts in the ChIP-seq experiments, the raw sequence read counts of each epigenetic modification were further normalized by the total number of reads, followed by an arcsine transformation (Pinello et al., 2014). Finally, according to the genome annotation data, the epigenetic distribution in the promoter regions was extracted.
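The binning and normalization in Step 1 can be sketched as follows. The text only states that an "arcsine transformation" is applied, so the arcsin-square-root form used here is an assumption, as are the function names, the half-open bin convention, and the toy coordinates.

```python
import math


def bin_reads(read_starts, region_start, region_end, bin_size=200):
    """Count reads whose start falls into each consecutive bin of the region."""
    n_bins = (region_end - region_start + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // bin_size] += 1
    return counts


def normalize(counts, total_reads):
    """Divide by the library size, then apply arcsin(sqrt(.)) (assumed form)."""
    return [math.asin(math.sqrt(c / total_reads)) for c in counts]


read_starts = [1005, 1010, 1250, 1399, 1600, 2100]  # toy read positions
counts = bin_reads(read_starts, region_start=1000, region_end=1600)
profile = normalize(counts, total_reads=len(read_starts))
```

Here `profile` plays the role of the promoter-level vector of one epigenetic mark in one cancer type; in a real analysis `total_reads` would be the genome-wide number of mapped reads.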

After the preprocessing step, we obtained six epigenetic profiles of seven cancer types along the promoter regions. Let G = {g<sub>1</sub>, g<sub>2</sub>, …, g<sub>n</sub>} be a set of n genes, let T = {t<sub>1</sub>, t<sub>2</sub>, …, t<sub>7</sub>} be the seven investigated cancer types, and let E = {e<sub>1</sub>, e<sub>2</sub>, …, e<sub>6</sub>} be the six epigenetic marks. For each epigenetic mark e<sub>k</sub>, the epigenetic profiles of the different cancer types in the promoter regions of these genes are organized as a matrix D<sub>k</sub> = T × G = (t<sup>k</sup><sub>i,j</sub>), with i ∈ {1, 2, …, 7}, j ∈ {1, 2, …, n}, and k ∈ {1, 2, …, 6}, where rows correspond to the cancer types and columns correspond to the genes. Each entry t<sup>k</sup><sub>i,j</sub> is a vector representing the epigenetic profile of e<sub>k</sub> in the ith cancer type along the promoter region of gene j.

Step 2. Identify bi-clusters based on FP-growth algorithm for each epigenetic mark. Given the preprocessed and reorganized epigenetic modification data matrix of each epigenetic mark, we first computed the Pearson correlation coefficients between the epigenetic profiles of any two cancer types at every promoter region, and then obtained a correlation coefficient matrix.

FIGURE 1 | The flowchart of the proposed TriPCE approach. (A) Preprocessing the epigenetic modification data of different cancer types. (B) For each epigenetic mark, identifying bi-Clusters based on the FP-growth algorithm. (C) Mining tri-Clusters with coherent epigenetic modification patterns across different cancer types.

Specifically, for the promoter region g<sub>i</sub>, we computed the Pearson correlation coefficients between the epigenetic modification distribution vectors of every pair of cancer types. If the calculated correlation coefficient is higher than a given threshold, the epigenetic modification trend in these two cancer types is regarded as coherent in this promoter region. We then added these cancer types to the corresponding itemset, which contains all the cancer types exhibiting similar epigenetic patterns in this region. Based on extensive experimental comparison, the identified epigenetic patterns are clearly coherent when the correlation coefficient threshold is set to 0.7. For each epigenetic mark, we constructed the corresponding itemsets of similar cancer types for all promoter regions.
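The construction of the itemset for a single promoter region can be sketched as below. The rule used here, namely that a cancer type enters the itemset as soon as it is correlated above the threshold with at least one other type, is one possible reading of the description and should be treated as an assumption. The profile values are toy numbers.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0  # a flat profile carries no trend to compare
    return cov / (su * sv)


def coherent_itemset(profiles, threshold=0.7):
    """Cancer types involved in at least one pair correlated above the threshold."""
    items = set()
    for a, b in combinations(sorted(profiles), 2):
        if pearson(profiles[a], profiles[b]) > threshold:
            items.update((a, b))
    return frozenset(items)


# One promoter region: a profile (vector over consecutive bins) per cancer type.
promoter_profiles = {
    "A549": [0.1, 0.4, 0.9, 0.3],
    "K562": [0.2, 0.5, 1.0, 0.2],
    "HepG2": [0.9, 0.3, 0.1, 0.8],
}
itemset = coherent_itemset(promoter_profiles)
```

Repeating this for every promoter region of one epigenetic mark yields the collection of itemsets that is mined in the next step.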

Based on the resulting itemsets, we further identified the significant coherent epigenetic patterns using the FP-growth algorithm (Han et al., 2004). FP-growth is a data mining method that was originally developed for frequent itemset mining in market basket analysis. Here, we adopted the FP-tree model to compactly represent all the cancer types with similar epigenetic patterns in different promoter regions. The tree can then be used to mine potential frequent itemsets and filter out most of the unrelated data. In this context, a typical frequent itemset represents a group of cancer types that share similar epigenetic patterns in a large number of promoter regions. To obtain the significant epigenetic states, we set the minimum support to 10% of the investigated genes. For each frequent itemset, we then identified the corresponding gene set in reverse and obtained the bi-Cluster. The resulting bi-Cluster has the form ("genomic regions," "cancer types"), indicating that these cancer types exhibit similar epigenetic patterns in these genes. Similarly, we obtained the corresponding bi-Cluster sets for all investigated epigenetic marks.
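The mapping from per-promoter itemsets to bi-Clusters can be illustrated with a small sketch. Note that it uses plain subset enumeration instead of an FP-tree: with seven cancer types there are at most 127 non-empty subsets per promoter region, so enumeration is feasible for illustration, but it is not the FP-growth algorithm itself. The function name, the `min_size` parameter, and the toy data are hypothetical.

```python
from itertools import combinations


def mine_biclusters(itemsets_by_gene, min_support_ratio=0.10, min_size=2):
    """Return {cancer-type set: supporting genes} for every frequent set.

    Plain enumeration stands in for FP-growth in this sketch.
    """
    support = {}
    for gene, items in itemsets_by_gene.items():
        ordered = sorted(items)
        for size in range(min_size, len(ordered) + 1):
            for subset in combinations(ordered, size):
                support.setdefault(frozenset(subset), set()).add(gene)
    min_count = min_support_ratio * len(itemsets_by_gene)
    return {types: genes for types, genes in support.items() if len(genes) >= min_count}


# Itemsets of coherent cancer types for four toy promoter regions.
itemsets_by_gene = {
    "g1": {"A549", "K562", "HepG2"},
    "g2": {"A549", "K562"},
    "g3": {"HepG2", "HCT116"},
    "g4": set(),
}
biclusters = mine_biclusters(itemsets_by_gene, min_support_ratio=0.5)
```

Each key of `biclusters` is the "cancer types" part of a bi-Cluster and each value is the corresponding "genomic regions" part. The toy call uses a 50% support only because the example has four genes.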

Step 3. Mine tri-Clusters with coherent epigenetic modification patterns across different cancer types. After obtaining the bi-Cluster sets for each epigenetic mark, we further mined the tri-Clusters by enumerating the maximal subsets of different epigenetic marks. In detail, we computed the intersection of the bi-Cluster sets from two epigenetic marks e<sub>k</sub> and e<sub>l</sub>, and kept the intersections together with the two marks as candidate tri-Clusters. By filtering out the candidates whose support is lower than the predefined minimum support, we obtained the significant tri-Clusters. We then repeated the process with another epigenetic mark until all the epigenetic marks had been analyzed. We tried all such paths and kept only the maximal tri-Clusters. Each tri-Cluster is represented as ("genomic regions," "cancer types," "epigenetic marks"), listing a gene set with a similar trend of epigenetic modifications in different cancer types. The resulting tri-Clusters indicate that the conserved epigenetic signatures in these genomic regions are shared by multiple cancer types.

#### Functional Analysis of the Genes

From the identified tri-Clusters, we can obtain the gene sets associated with specific coherent epigenetic patterns. To investigate the potential functions of these genes, we performed the gene ontology (GO) enrichment analysis and pathway enrichment analysis via DAVID bioinformatics resources (Huang et al., 2007). The significant enrichment lists were obtained with P-value < 0.005.

#### RESULTS

#### Identifying Similar Epigenetic Patterns Across Different Cancer Types

We developed a tri-clustering approach, TriPCE, to capture similar epigenetic patterns among different cancer types. TriPCE was applied to the genome-wide epigenetic modification maps of seven cancer types, including A549, K562, HepG2, HCT116, HeLa-S3, multiple myeloma-Cell Line, and sporadic Burkitt lymphoma-Cell Line. For each epigenetic mark, TriPCE first groups the promoter regions based on the epigenetic modification profiles among different cancer types. Figure 2 shows a typical bi-Cluster of epigenetic mark H3K4me1, which contains a large number of genes with similar modification patterns in four cancer types: HeLa-S3, HepG2, K562, and A549. From this figure, we observe that the epigenetic profiles of these genes are similar in these cancer types.

The epigenetic profile shared by a cluster of promoter regions in multiple cancer types is then considered to be an epigenetic pattern. Different cancer types share similar epigenetic patterns. This result is consistent with the previous finding that H3K9me3/me2 and H3K36me3/me2 are frequently observed in breast cancer (Liu et al., 2009), esophageal cancer (Yang et al., 2000), MALT lymphoma (Vinatzer et al., 2008), and lung sarcomatoid carcinoma (Italiano et al., 2006). Based on the identified bi-Clusters of the investigated epigenetic marks, we noted that two cancers, HepG2 and HCT116, are clustered together and share a larger number of epigenetic marks, implying that they have more similar epigenetic regulation mechanisms.

To identify the significant modification patterns, we set the minimal support to 10% of the investigated genes. With different correlation coefficient thresholds, we obtained different numbers of bi-Clusters for the epigenetic marks H3K4me1, H3K4me3, H3K9me3, H3K27me3, H3K36me3, and H3K27ac among these cancer types, as shown in Figure 3. The comparison indicates that the degree of similarity differs considerably between these epigenetic marks. Under different threshold settings, the epigenetic mark H3K4me3 has a relatively small number of bi-Clusters, indicating that its profiles are less conserved and exhibit more variable patterns among these cancer types than those of the other epigenetic marks. In contrast, H3K4me1 and H3K27me3 show more similar epigenetic patterns among different cancer types (Baylin and Jones, 2016). The plasticity of the epigenome depends on diverse environmental factors. Thus, it is not surprising that epigenotypes contribute to human developmental disorders and adult diseases (Brien et al., 2016). As the threshold only slightly affects the trend among different epigenetic marks, we chose the bi-Clusters obtained with a correlation coefficient threshold of 0.7 for further analysis.

#### Identifying Coherent Patterns Among Different Epigenetic Marks

From the above results, we notice that there are obvious differences among the investigated epigenetic modifications. To identify the conserved epigenetic states and explore the similar patterns of these epigenetic modifications, we further clustered these epigenetic marks based on the detected bi-Clusters. By systematically computing the intersection of the bi-Cluster sets from different epigenetic marks, we kept the tri-Clusters with support higher than the predefined minimum support. The identified tri-Clusters are represented as triples ("genomic regions," "cancer types," "epigenetic marks"). Each tri-Cluster indicates that the promoter regions of these genes exhibit similar epigenetic modification patterns in the related cancer types.

Applying TriPCE to the data set, we initially obtained 175 significant tri-Clusters. Figure 4 shows the information of 15 typical clusters, including the epigenetic marks, the cancer types, and the supports of these tri-Clusters. The results indicate that specific genomic regions indeed share combinatorial epigenetic patterns across different cancer types. For example, the changing pattern of epigenetic modifications (H3K4me3, H3K9me3, H3K27me3, and H3K36me3) is shared by a large number of genes in cancer types A549, HepG2, and K562. In contrast, some epigenetic modification patterns are only coherent in certain cancer types. Among the resulting clusters, we observe that the similar patterns of H3K36me3, H3K27ac, and H3K27me3 exist in fewer cancer types, such as HepG2 and sporadic Burkitt lymphoma-Cell Line. Notably, these identified tri-Clusters reveal more information about the epigenetic patterns among these cancer types.

#### Analyzing the Potential Roles of Associated Genes

Based on the detected tri-Clusters, we further obtained the gene sets that exhibit coherent epigenetic patterns in different cancer types. Previous studies have shown that the modification intensities differ significantly between high-expression and low-expression gene promoters, which suggests that these chromatin components have a significant effect on gene regulation (Su et al., 2012). To investigate the potential functions of those genes in the cellular control pathways, we performed a systematic GO enrichment analysis using DAVID tools (https://david.ncifcrf.gov/). Then, for the associated gene sets in the identified tri-Clusters, we summarized the key biological processes and pathways in which they are involved.

Overall, we found that the genes in tri-Clusters are enriched for cancer-related functions. Table 1 lists the significant GO terms of a typical tri-Cluster (P-value < 0.005). In this tri-Cluster, the genes exhibit coherent modification patterns on epigenetic marks (H3K4me1, H3K4me3, H3K9me3, H3K27ac, and H3K27me3) in cancer types (HeLa-S3, HepG2, multiple myeloma-Cell Line, and sporadic Burkitt lymphoma-Cell Line). In the table, the terms "positive regulation of cell proliferation" and "negative regulation of apoptotic process" are enriched in these gene sets. This result implies that the identified genes in this tri-Cluster are essential for cell proliferation and the apoptotic process, which have been reported to be related to cancer development by previous studies (Deng et al., 2016). Meanwhile, the term "positive regulation of gene expression" is also enriched in the gene set, further indicating that these genes might perform important regulatory roles in these cancers.

FIGURE 4 | Typical epigenetic tri-Clusters. (A) The epigenetic marks (column) in each cluster (row). (B) The cancer types (column) in each cluster (row). Fold enrichment was calculated as the ratio of the number of genes in the tri-Cluster to that of all genes.

### DISCUSSION

Identifying epigenetic patterns is important for understanding epigenetic mechanisms in various cancers. The detected patterns among different cancers could demonstrate critical cross-cancer similarities, which reveal some consistent clinical risks among different cancer types and further suggest strong clinical relevance. Our knowledge about the patterns of epigenetic modifications and their causes and consequences is still limited. A computational approach that exploits the complex epigenomic landscapes and discovers significant signatures in them is required. Previous computational methods for analyzing epigenomes primarily focus on the combinatorial states of different epigenetic marks in a specific cell type. In contrast, we developed a tri-clustering approach, TriPCE, for integrative pan-cancer epigenomic analysis. Based on the FP-tree structure, TriPCE can compactly represent all similar cancer types in the promoter regions for a specific epigenetic mark. Using the constructed FP-tree, the frequent patterns are then detected to yield the set of bi-Clusters of this epigenetic mark, indicating the similar epigenetic pattern in these cancer types along these genomic regions. TriPCE further mines the final tri-Clusters based on the bi-Clusters of all investigated epigenetic marks, explicitly detecting combinatorial epigenetic states in different genomic segments and similar epigenetic changes across different cancer types. In TriPCE, the tri-Cluster enumeration is an expensive operation. In the future, we plan to develop heuristic techniques to efficiently prune the search space and thereby improve the efficiency of mining the tri-Clusters. We applied TriPCE to uncover the similar patterns of six epigenetic marks among seven cancer types and successfully identified significant cross-cancer epigenetic modification similarities, which suggests that there is a consistent epigenetic modification tendency among the investigated cancer types. Furthermore, the gene functional analysis demonstrates that the associated genes are strongly relevant to cancer-related cellular pathways.

#### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/supplementary material.

#### AUTHOR CONTRIBUTIONS

YG is responsible for the main idea, as well as the completion of the manuscript. NL and YX have developed the algorithm and performed data analysis. GZ has coordinated data preprocessing and supervised the effort. All authors have read and approved the final manuscript.

#### FUNDING

This work and the publication costs were supported in part by the National Natural Science Foundation of China (61772128, 61772367), National Key Research and Development Program of China (2016YFC0901704), Shanghai Natural Science Foundation (17ZR1400200, 18ZR1414400), and the Fundamental Research Funds for the Central Universities (2232016A3-05).

#### ACKNOWLEDGMENTS

The authors are grateful to the NIH Roadmap Epigenome Project and the iHMS website for providing the epigenomic data used in this work. An earlier version of this paper was presented at the 2018 International Conference on Intelligent Computing (ICIC 2018).


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Gan, Li, Xin and Zou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Comprehensive Analysis of Copy Number Variations in Kidney Cancer by Single-Cell Exome Sequencing

Wenyang Zhou<sup>1†</sup>, Fan Yang<sup>2†</sup>, Zhaochun Xu<sup>1†</sup>, Meng Luo<sup>1</sup>, Pingping Wang<sup>1</sup>, Yu Guo<sup>1</sup>, Huan Nie<sup>1</sup>\*, Lifen Yao<sup>2</sup>\* and Qinghua Jiang<sup>1</sup>\*

<sup>1</sup> School of Life Science and Technology, Harbin Institute of Technology, Harbin, China, <sup>2</sup> Department of Neurology, The First Affiliated Hospital of Harbin Medical University, Harbin, China

#### Edited by:

Lei Deng, Central South University, China

#### Reviewed by:

Hao Lin, University of Electronic Science and Technology of China, China; Juan Wang, Inner Mongolia University, China; Jingpu Zhang, Henan University of Urban Construction, China

#### \*Correspondence:

Huan Nie nh1212@hit.edu.cn; Lifen Yao yaolifen\_2015@sina.com; Qinghua Jiang qhjiang@hit.edu.cn

† These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 23 September 2019 Accepted: 17 December 2019 Published: 23 January 2020

#### Citation:

Zhou W, Yang F, Xu Z, Luo M, Wang P, Guo Y, Nie H, Yao L and Jiang Q (2020) Comprehensive Analysis of Copy Number Variations in Kidney Cancer by Single-Cell Exome Sequencing. Front. Genet. 10:1379. doi: 10.3389/fgene.2019.01379

Clear-cell renal cell carcinoma (ccRCC) is the most common and lethal subtype of kidney cancer. VHL and PBRM1 are the top two significantly mutated genes in ccRCC specimens, while the genetic mechanism of VHL/PBRM1-negative ccRCC remains to be elucidated. Here we carried out a comprehensive analysis of single-cell genomic copy number variations (CNVs) in VHL/PBRM1-negative ccRCC. Genomic CNVs were identified at the single-cell level, and the tumor cells showed widespread amplification and deletion across the whole genome. Functional enrichment analysis indicated that the amplified genes are significantly enriched in cancer-related signaling transduction pathways. In addition, receptor protein tyrosine kinase (RTK) genes also showed widespread copy number variations in cancer cells. Our study indicates that the genomic CNVs in RTK genes and downstream signaling transduction pathways may be involved in VHL/PBRM1-negative ccRCC pathogenesis and progression, and highlights the role of comprehensive investigation of genomic CNVs at the single-cell level in both clarifying pathogenic mechanisms and identifying potential therapeutic targets in cancers.

Keywords: copy number variations, single-cell exome sequencing, clear-cell renal cell carcinoma, receptor protein tyrosine kinase, signaling transduction pathway

### INTRODUCTION

Renal cell carcinoma (RCC) is one kind of kidney cancer, accounting for nearly 300,000 new cancer cases per year worldwide (Hakimi et al., 2013). RCC includes several histological subtypes, among which clear cell renal cell carcinoma (ccRCC) is the most common and lethal one (Hakimi et al., 2016). A growing number of studies have shown that the development of ccRCC seems to be shaped by chromosomal lesions and a number of somatic mutations (Sato et al., 2013). VHL and PBRM1, located within the chromosome 3p25 and 3p21 segments, are the top two significantly mutated genes in ccRCC (Sato et al., 2013). Nearly 90% of ccRCCs harbor a deletion on chromosome 3p, leading to a very high frequency of VHL inactivation (Gnarra et al., 1994). Moreover, VHL and PBRM1 are mutated in about 50% and 41% of sporadic ccRCC, respectively (Kaelin, 2004; Varela et al., 2011). However, little is known about the genetic mechanisms in VHL/PBRM1-negative ccRCC.

Based on next-generation sequencing technology, previous studies identified many driver mutations in ccRCC (Gnarra et al., 1994; Kaelin, 2004; Sato et al., 2013; Cheng et al., 2019). However, limited insights are available on the genomic diversity within tumor tissues (Wang et al., 2014). Generally, tumor tissues may contain cancer cells from multiple clones as well as noncancerous cells, which makes it difficult to identify the mutations in each clone and detect the driver genes during cancer progression (Xu et al., 2012; Casasent et al., 2018). Fortunately, single-cell DNA sequencing has been developed to meet this challenge, because it can provide unique insights into intratumor heterogeneity, development, and diversity of cancers at the single-cell level (Casasent et al., 2018). For example, Xu et al. (2012) carried out single-cell exome sequencing on a ccRCC tumor and its adjacent normal tissue. They identified four genes (i.e., AHNAK, SRGAP3, LRRK2, and USP6) as potential driving factors for VHL/PBRM1-negative ccRCC development, which provided new insights into the pathogenesis of ccRCC.

Genomic copy number variations (CNVs) play an important role in cancer progression, and emerging studies indicate that genomic CNVs are associated with ccRCC (Gerlinger et al., 2014; Nouhaud et al., 2018) and other cancers (Waddell et al., 2015; Secrier et al., 2016; Hong et al., 2019). Xu et al. (2012) performed single-cell exome sequencing to elucidate the genetic mechanisms of ccRCC by identifying single nucleotide variants (SNVs). However, the authors did not examine whether genomic copy number variations play a crucial role in ccRCC.

To further investigate the potential roles of CNVs in VHL/PBRM1-negative ccRCC, we performed a comprehensive single-cell CNV analysis based on a dataset provided by Xu et al. (2012). We delineated the genomic copy number variation landscape at the single-cell level and reclassified all single cells based on the single-cell genomic CNVs. We also identified several significantly amplified/deleted loci and genes in cancer cells. Finally, we investigated the biological pathways that may be involved in ccRCC pathogenesis.

#### METHODS

#### Datasets

The sample data and information used in our article came from a previous study, and the original sequencing data were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/sra) under the accession number SRA050201.

### Quality Control

Quality control of the sequencing data was performed using FastQC. The adapter and low-quality ends were trimmed from reads using Trim-Galore version 0.5.0 (http://www.bioinformatics.babraham.ac.uk/projects/trim\_galore/). Trimmed reads shorter than 20 bp were discarded.

### Reads Mapping

The human reference genome sequence (Hg19) was used for mapping (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/). Short read pairs were mapped to the reference genome using Burrows-Wheeler Aligner (BWA) version 0.7.12-r1039 (Li and Durbin, 2009). In this process, we adopted the BWA-MEM algorithm and adjusted the main parameters, setting the minimum seed length to 19 and the mismatch penalty to 4, and marking shorter split hits as secondary. Then, Samtools was used to convert SAM files to compressed BAM files, sort the BAM files by chromosomal coordinates, and remove the PCR duplicates from BAM files.
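As a rough illustration, the mapping settings described above correspond to the BWA-MEM options `-k` (minimum seed length), `-B` (mismatch penalty), and `-M` (mark shorter split hits as secondary). The sketch below only builds the command lines and does not run them. The file names, the thread count, and the choice of `samtools rmdup` for duplicate removal are assumptions, since the text does not name the exact Samtools subcommands.

```python
def bwa_mem_command(reference, fastq1, fastq2, threads=4):
    """Build the argument list for a paired-end BWA-MEM run.

    BWA writes SAM to standard output, so the caller is expected to
    redirect it to a file (for example via subprocess with stdout=...).
    """
    return [
        "bwa", "mem",
        "-k", "19",           # minimum seed length
        "-B", "4",            # penalty for a mismatch
        "-M",                 # mark shorter split hits as secondary
        "-t", str(threads),   # number of threads (assumed value)
        reference, fastq1, fastq2,
    ]


def samtools_commands(sam_path, prefix):
    """Build the SAM-to-BAM, coordinate sort, and duplicate removal steps.

    The specific subcommands are an assumption; the text only states that
    Samtools performed these three operations.
    """
    return [
        ["samtools", "view", "-b", "-o", f"{prefix}.bam", sam_path],
        ["samtools", "sort", "-o", f"{prefix}.sorted.bam", f"{prefix}.bam"],
        ["samtools", "rmdup", f"{prefix}.sorted.bam", f"{prefix}.dedup.bam"],
    ]
```

Keeping the commands as argument lists, rather than shell strings, avoids quoting problems when sample names contain unusual characters.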

### Copy Number Variations Calling

In each cell, germline and somatic copy number variations were called by Control-FREEC version 11.5 (Boeva et al., 2012). Considering the exome enrichment during library construction, read counts were calculated by exome region. The target region file of exome capture was downloaded from the Agilent website (https://earray.chem.agilent.com/suredesign/index.htm). The germline CNVs were detected in each cell and bulk normal tissue, respectively. Somatic CNVs were detected only in single cells. Gene annotations were performed with Annovar software (Wang et al., 2010) and OAHG database (Cheng et al., 2016).

### Dimensionality Reduction of Cells

T-distributed stochastic neighbor embedding (t-SNE) was performed based on the germline CNVs of target regions. The 25 single cells and the bulk normal tissue were projected to 2D space using the R package "Rtsne."

### Significantly Somatic Copy Number Variation Loci Analysis

Significantly amplified/deleted loci in tumor cells were identified using GISTIC2.0 (Mermel et al., 2011). GISTIC2.0 was run on an input defined as log2(copy number) − 1 of the somatic copy number values, with a confidence (-conf) threshold of 0.9. For downstream analysis, the thresholds suggested by GISTIC2.0 for copy number variation were as follows: GISTIC score ≥ 0.9, amplification; 0.1 < GISTIC score < 0.9, gain; −1.3 < GISTIC score < −0.1, loss; GISTIC score ≤ −1.3, deletion.
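The thresholds above can be written as a small classifier. This is a sketch, not GISTIC2.0 itself: it assumes the input transform is log2(copy number) − 1, treats a copy number of zero as negative infinity, and labels scores between −0.1 and 0.1 as "neutral", a label the text does not define.

```python
import math


def log2_ratio(copy_number):
    """Transform an absolute copy number into log2(copy number) - 1.

    A diploid region (copy number 2) maps to 0. A copy number of zero has
    no finite logarithm, so it is mapped to negative infinity here.
    """
    if copy_number <= 0:
        return float("-inf")
    return math.log2(copy_number) - 1.0


def classify_score(score):
    """Assign a CNV category using the thresholds quoted in the text."""
    if score >= 0.9:
        return "amplification"
    if score > 0.1:
        return "gain"
    if score <= -1.3:
        return "deletion"
    if score < -0.1:
        return "loss"
    return "neutral"  # assumed label for scores in [-0.1, 0.1]
```

Under this transform, a copy number of 4 gives a score of 1 (amplification), 3 gives about 0.58 (gain), 1 gives −1 (loss), and only a copy number of 0 falls below −1.3 (deletion).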

### Receptor Protein Tyrosine Kinase Gene Copy Number Profiling

To examine the landscape of copy number variations in RTK genes, we derived GISTIC-equivalent scores by dividing the germline copy numbers and classified genes as amplified if score ≥ 0.9, gained if 0.1 < score < 0.9, lost if −1.3 < score < −0.1, and deleted if score ≤ −1.3.

### Function Analysis

The significantly amplified and deleted genes were identified according to the significant somatic CNV loci (q-value < 0.0001) in GISTIC2.0. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was performed using ConsensusPathDB (CPDB) (Kamburov et al., 2013). In this study, the p-value threshold for KEGG enrichment analysis was 0.05.
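Pathway enrichment of this kind is commonly assessed with a hypergeometric over-representation test. The text does not state which test the database applies, so the sketch below is an assumption; the function and parameter names are illustrative.

```python
from math import comb


def hypergeom_pvalue(overlap, n_selected, n_pathway, n_background):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    overlap:      selected genes that belong to the pathway
    n_selected:   size of the selected (e.g. amplified) gene set
    n_pathway:    number of background genes annotated to the pathway
    n_background: total number of background genes
    """
    upper = min(n_selected, n_pathway)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def rich_factor(overlap, n_pathway):
    """Ratio of selected genes in a pathway to all genes in that pathway."""
    return overlap / n_pathway
```

A pathway would then be reported as enriched when its p-value falls below the chosen threshold (0.05 in this study).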

### RESULTS

### Identification of Single-Cell Genomic Copy Number Variations in Kidney Cancer and Normal Cells

To identify genomic CNVs in ccRCC, we analyzed the sequencing data from a ccRCC patient, which includes 20 single-cell exome sequencing data from the tumor tissue, 5 single-cell exome sequencing data from the adjacent normal tissue, and a bulk exome sequencing data from the adjacent normal tissue. Trim-Galore was used to remove the low-quality and adapter segments and analyze the quality of sequencing reads. The cleaned reads were mapped to the reference genome with BWA software (Li and Durbin, 2009). The sequencing depth was more than 20X (29.68 ± 5.68) in all single cells. The genomic CNVs were called by using Control-FREEC (Boeva et al., 2012).

Germline CNVs were identified in all the samples. The comparison between cancer and normal cells revealed widespread amplification and deletion across the whole genome in tumor cells (Figure 1A). At the same time, some deleted loci were found in both normal and cancer cells, which may be caused by multiple displacement amplification (MDA) (Yilmaz and Singh, 2012) or exome capture during DNA library preparation.

To remove the background variation caused by germline differences or technical artifacts, somatic CNVs were identified in all cells using bulk normal tissue as the control. Remarkably, the somatic CNVs showed much more amplification than the germline CNVs in the cancer cells (Figure 1B). Large-scale somatic CNVs were found in the ccRCC single cells, which was consistent with previous studies based on bulk sequencing (Cancer Genome Atlas Research Network, 2013; Gerlinger et al., 2014; Nouhaud et al., 2018). Moreover, the single-cell sequencing data revealed that copy number amplification was highly consistent across cells, which suggests that amplification may play an important role in the progression of ccRCC. In contrast, deletion showed higher intratumor heterogeneity in the cancer cells.

#### Re-Classification of Kidney Cancer and Normal Cells Based on Single-Cell Copy Number Variations

Generally, surgically removed tumors may contain both cancer and normal cells (Xu et al., 2012). To reclassify all the single cells accurately, t-distributed stochastic neighbor embedding (t-SNE) was performed based on the cell copy number in exome target regions. The results of dimensionality reduction (Figure 2, Supplementary Table S1) showed that three cancer cells (CC-15, CC-17, and CC-20) clustered tightly with the normal cells and tissue, suggesting that they were probably normal cells in the tumor tissue. These results were consistent with the previous findings, which were based on single-cell SNVs (Xu et al., 2012). These three cells (CC-15, CC-17, and CC-20) were excluded from the cancer cell group in the downstream analysis. Focusing on the remaining cancer cells, we found no subpopulation of cancer cells within the cancer tissue.

FIGURE 2 | Population analysis based on the germline copy number variations (CNVs). T-distributed stochastic neighbor embedding (t-SNE) analysis of cancer cell (red), normal cell (blue), and normal cell in cancer tissue (green) based on the germline CNVs.

According to the single-cell genomic CNVs, all the single cells can be reclassified into three groups, namely cancer cell (CC), normal cell (NC), and normal cell in cancer tissue (NCinCT). To address whether the genomic CNVs were significantly different between the three groups, we calculated the proportion of the whole genome covered by amplification (copy number ≥ 4) and loss (copy number = 0), respectively. The results (Figure 3) showed that there were more amplified loci in the CC group than in the NC group (P = 3×10<sup>−4</sup>) and the NCinCT group (P = 1.8×10<sup>−3</sup>). In addition, there was no significant difference between the NC and NCinCT groups (P = 0.79). The lost loci showed a similar result. The single-cell genomic CNVs indicated that the genome of cancer cells was in an extremely unstable state.
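The coverage proportions can be computed directly from a cell's copy number segments. The following sketch assumes that each segment is a `(start, end, copy_number)` tuple with half-open coordinates, that segments do not overlap, and that `genome_length` is the total length being considered; these conventions are illustrative and are not taken from the original analysis.

```python
def covered_fraction(segments, genome_length, predicate):
    """Fraction of the genome covered by segments whose copy number
    satisfies the given predicate."""
    covered = sum(end - start for start, end, cn in segments if predicate(cn))
    return covered / genome_length


def amplification_fraction(segments, genome_length):
    """Fraction of the genome with copy number >= 4."""
    return covered_fraction(segments, genome_length, lambda cn: cn >= 4)


def loss_fraction(segments, genome_length):
    """Fraction of the genome with copy number equal to 0."""
    return covered_fraction(segments, genome_length, lambda cn: cn == 0)
```

Computing one pair of fractions per cell yields the per-group distributions that are then compared between the CC, NC, and NCinCT groups.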

#### Loci Distribution of Significant Genomic Copy Number Variations in Kidney Cancer

To investigate the distribution of significant genomic CNV loci across all tumor single cells, GISTIC2.0 (Mermel et al., 2011) was used to identify significant CNV loci based on the somatic CNVs in the 17 cancer cells; germline CNVs, which are generally not involved in cancer development, were not included. The results indicated that copy numbers in the significant CNV loci were highly consistent across all the cancer cells. Although many lost loci (a milder change than deletion, −1.3 < GISTIC score < −0.1) were identified, no significantly deleted locus (GISTIC score ≤ −1.3) was found in cancer cells, which was consistent with the high heterogeneity of deleted regions in our cancer cells.

Significantly amplified loci (Figure 4, Supplementary Table S2A) according to GISTIC2.0 (12q13.3, 12p13.31, 5q35.3, etc.; q-value < 0.05) comprised genes such as IGFBP4, ERBB2, ERBB3, FGFR4, CDK2, and FLT4. The IGFBP4 gene has been reported to be associated with several types of cancer (Hallberg et al., 2000; Romero et al., 2011; Yang et al., 2017); it can promote RCC cell metastasis and activate the Wnt/beta-catenin signaling pathway in humans (Ueno et al., 2011). The ERBB2 and ERBB3 genes belong to the epidermal growth factor receptor (EGFR) family, and they have been identified as common driver genes of multiple cancer types that promote solid tumor growth (Yarden, 2001; Henson et al., 2017; Oldrini et al., 2017). Amplification of EGFR has also been found in other cancers, where it contributes to excessive EGFR activation (Sigismund et al., 2018). The FGFR4 gene belongs to the fibroblast growth factor receptor family, and the activation of FGFR4 can promote cell growth and angiogenesis in cancer (Bai et al., 2015). The CDK2 gene is commonly overactivated in human cancers, and dysfunction of CDK2 can lead to uncontrolled cell growth (Mihara et al., 2001). The FLT4 gene, which belongs to the vascular endothelial growth factor receptor family, has been reported to regulate cancer cell survival and proliferation (Varney and Singh, 2015).

FIGURE 3 | The coverage of genomic copy number variations (CNV) regions in three cell types. (A) The percentage of amplification region (copy number ≥ 4) across the whole genome in different cell types. (B) The percentage of loss region (copy number = 0) across the whole genome in different cell types. In the two sub-graphs (A) and (B), p-values between two groups (Wilcoxon signed-rank test) and all groups (Kruskal-Wallis test) were calculated.

The top significantly deleted loci (Figure 4, Supplementary Table S2B) (1q21.3, 1p35.2, 16q24.3, 3p14.1, etc.; q-value < 0.05) showed loss of the Chmp1A, CADM2, PRAP1, and ULK1 genes. Chmp1A and CADM2, belonging to the cell adhesion molecule family, have been found to be tumor suppressor genes in RCC. The overexpression of Chmp1A and CADM2 significantly suppressed cancer growth and invasion (You et al., 2012; He et al., 2013). The PARP1 gene plays an important role in DNA repair and cell apoptosis (Tulin, 2011); cells with PARP1 deficiency show resistance to DNA damage-induced programmed cell death and increased cancer risk (Schiewer and Knudsen, 2014). ULK1 is an autophagy-initiating gene, and down-regulation of ULK1 has been found in cancer (Zhang et al., 2017). ULK1 may play a pivotal role in cancer by promoting cell death (Chen et al., 2014).

The genes in significantly amplified loci include a number of known driver genes, which may promote cancer progression by up-regulating cell growth and the cell cycle. Significantly deleted loci include some tumor suppressor genes and autophagy genes. The inactivation of these genes leads to uncontrolled tumor growth, which may contribute to VHL/PBRM1-negative ccRCC pathogenesis and progression.

#### Functional Analysis of Significant Genomic Copy Number Variations in Kidney Cancer

To better understand the potential biological and functional characteristics of the significantly amplified and deleted genes in cancer cells, we further investigated the biological pathways in ccRCC. The KEGG functional enrichment analysis was performed using the CPDB database based on the amplified and deleted genes, respectively. The amplified genes showed significant enrichment (p-value < 0.05) for signal transduction, metabolism, cell cycle, immunity, and other cancer-related pathways (Figure 5, Supplementary Table S3). In contrast to the amplified genes, the deleted genes only showed significant enrichment for the fatty acid elongation pathway (p-value = 7.6×10<sup>−3</sup>).

The most notable result is that a large portion of the enriched pathways are signaling transduction pathways. The HIF-1 (Posadas et al., 2013), ErbB (Liu et al., 2015), PI3K-Akt (Linehan et al., 2010; Sato et al., 2013; Guo et al., 2015), Ras (de Araujo Junior et al., 2015; Chen et al., 2018), Rap1 (Chen et al., 2018), and MAPK (Liu et al., 2015) signaling pathways have all been found to be involved in the pathogenesis of RCC. Moreover, the results also showed that Th17 cells (Li et al., 2015) and microRNAs (Gowrishankar et al., 2014) seem to be connected with ccRCC pathogenesis. Interestingly, the fatty acid elongation pathway was significantly enriched among the deleted genes in ccRCC, which may account for the fact that ccRCC tumors are lipid-laden (Hakimi et al., 2016).

frequency histogram, and q-value for each significant genomic CNV loci was shown on the right. Only the loci with q-value < 0.0001 were shown.

FIGURE 5 | Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analysis for significantly amplified genes. The size of each point indicates the number of genes shared by our amplified gene set and the KEGG pathway term. The color of each point indicates enrichment significance (−log10P). The pathways were sorted by rich factor (the ratio of the number of significantly amplified genes in a pathway term to the number of genes in that pathway term).

#### Receptor Protein Tyrosine Kinase Genes Show Widespread Copy Number Variations in Cancer Cells

Since many cancer-related signaling transduction pathways were significantly amplified in cancer cells, we then examined the copy number variations in their upstream RTK genes (Robinson et al., 2000; Secrier et al., 2016) to investigate why the tumor did not carry known driver mutations in VHL and PBRM1.

The single cancer cells showed widespread amplification and deletion of multiple RTKs compared with the normal cells, while the NC and NCinCT groups showed similar RTK gene profiles. Some RTK genes (EPHB6, EPHA1, EPHB3, FGFR4, PDGFRB, and FLT4) showed amplification in cancer cells. In contrast, the EPHB2, ERBB4, FGFR1, PDGFRA, KDR, and FLT1 genes showed deletion in cancer cells (Figure 6). Genomic copy number varied across these RTKs and downstream pathways, indicating that the genomic CNVs in RTKs and downstream signaling transduction pathways may have important roles in the pathogenesis and progression of VHL/PBRM1-negative ccRCC.

#### DISCUSSION

Previous studies have shown that VHL and PBRM1 are the top two significantly mutated genes in ccRCC. However, the pathogenesis in VHL/PBRM1-negative ccRCC is still unclear. Our comprehensive analysis of CNVs in 25 single cells from a ccRCC patient provided new insights into the pathogenesis of the ccRCC. We reclassified all the single cells and identified pathological mutations in VHL/PBRM1-negative ccRCC cells. Similar to the genomic CNVs in other cancers, the pathogenesis in VHL/PBRM1-negative ccRCC seems to be shaped by the accumulation of amplification in driver genes (IGFBP4, ERBB2, ERBB3, FGFR4, CDK2, and FLT4), the loss of function in tumor suppressor genes (Chmp1A, CADM2) and autophagy genes (PRAP1, ULK1).

Pathway analysis of these significantly amplified and deleted genes identified several signaling transduction pathways, including the HIF-1, ErbB, PI3K-Akt, Ras, Rap1, and MAPK signaling pathways, that were affected by genomic amplification. At the same time, RTK genes showed widespread copy number variations specifically in cancer cells. Mutations in RTKs may contribute to the overactivity of downstream signaling transduction pathways, leading to the uncontrolled growth of ccRCC cells.

Overall, our single-cell analysis of copy number in VHL/PBRM1-negative ccRCC revealed that the genomic CNVs in RTKs may cooperate with downstream signaling transduction pathways in VHL/PBRM1-negative ccRCC pathogenesis and progression. Clinically, our findings may help to develop more effective targeted therapeutic approaches for patients with VHL/PBRM1-negative ccRCC. However, because of the small number of cells and the high intratumor heterogeneity, our findings need to be verified in larger cohorts.

#### DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. The original sequencing data can be downloaded from NCBI (http://www.ncbi.nlm.nih.gov/sra) under the accession number SRA050201.

### AUTHOR CONTRIBUTIONS

HN, LY, and QJ designed the experiments. PW obtained data from NCBI. WZ and ML analyzed the data. FY, ZX, and YG wrote the manuscript. All authors read and approved the manuscript.

### FUNDING

This work is supported by the National Natural Science Foundation of China [61571152, 61822108], the National Science and Technology Major Project of China [2016YFC1202302] and the Natural Science Foundation of Heilongjiang Province [F2015006].

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01379/full#supplementary-material

TABLE S1 | The results of dimensionality reduction based on the germline CNVs. The name and coordinate in 2D space of all single cells were shown in this table.

TABLE S2 | The results of significantly amplified (Table S2A) and deleted (Table S2B) loci according to GISTIC2.0. The cytoband name, q-value and gene names of each amplification/deletion loci were shown in this table.

TABLE S3 | The results of KEGG enrichment analysis based on significantly amplified (Table S3A) and deleted (Table S3B) genes according to the CPDB database. The pathway name, p-value and gene sets of each pathway were shown in this table.


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Zhou, Yang, Xu, Luo, Wang, Guo, Nie, Yao and Jiang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Integrating Multi-Omics Data to Identify Novel Disease Genes and Single-Neucleotide Polymorphisms

Sheng Zhao<sup>1</sup>, Huijie Jiang<sup>1</sup>\*, Zong-Hui Liang<sup>2</sup>\* and Hong Ju<sup>3</sup>\*

<sup>1</sup> Department of Radiology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China, <sup>2</sup> Department of Radiology, Jian'an District Centre Hospital of Fudan University, Shanghai, China, <sup>3</sup> Department of Information Engineering, Heilongjiang Biological Science and Technology Career Academy, Harbin, China

#### Edited by:

Lei Deng, Central South University, China

#### Reviewed by:

Tianyi Zhao, Harvard University, United States

Hao Lin, University of Electronic Science and Technology of China, China

#### \*Correspondence:

Huijie Jiang jianghuijie@hrbmu.edu.cn

Zong-Hui Liang liangzh@vip.163.com

Hong Ju hongju.hit@hotmail.com

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 15 October 2019 Accepted: 06 December 2019 Published: 24 January 2020

#### Citation:

Zhao S, Jiang H, Liang Z-H and Ju H (2020) Integrating Multi-Omics Data to Identify Novel Disease Genes and Single-Neucleotide Polymorphisms. Front. Genet. 10:1336. doi: 10.3389/fgene.2019.01336

Stroke ranks as the second leading cause of death among people over the age of 60 worldwide. It is widely regarded as a complex disease that is affected by genetic and environmental factors. Evidence from twin and family studies suggests that genetic factors may play an important role in its pathogenesis. Therefore, research on the genetic association of susceptibility genes can help elucidate the mechanism of stroke. Genome-wide association studies (GWAS) have found a large number of stroke-related loci, but their mechanisms are unknown. To explore the function of single-nucleotide polymorphisms (SNPs) at the molecular level, we integrated 8 GWAS datasets with a brain expression quantitative trait loci (eQTL) dataset to identify SNPs and genes related to four types of stroke (ischemic stroke, large artery stroke, cardioembolic stroke, small vessel stroke). Thirty-eight SNPs that can affect the expression of 14 genes were found to be associated with stroke. Among these 14 genes, the expression of 10 genes is associated with ischemic stroke, one with large artery stroke, six with cardioembolic stroke and eight with small vessel stroke. To explore the effects of environmental factors on stroke, we identified methylation susceptibility loci associated with stroke using methylation quantitative trait loci (mQTL). Thirty-one of these 38 SNPs are at greater risk of methylation and can significantly change gene expression levels. Overall, the genetic pathogenesis of stroke is explored from locus to gene, gene to gene expression and gene expression to phenotype.

Keywords: stroke, genome-wide association study, expression quantitative trait loci, mQTL, SMR, single-nucleotide polymorphisms

### INTRODUCTION

Stroke is a major cerebrovascular disease caused by a transient or permanent decrease of local cerebral blood flow. It is characterized by arterial obstruction (Krishnamurthi et al., 2018), so it is also called cerebral infarction (Dargazanli et al., 2018). According to the World Health Organization, stroke affects more than 15 million people worldwide and directly kills about 5.7 million people. It also leaves approximately 5 million people with a lifelong disability, while about 4.3 million people die due to disability. At present, thrombolytic therapy with recombinant tissue plasminogen activator (Castellanos et al., 2018) is the only acute treatment for ischemic stroke, and it has a narrow time window (3–4.5 hours). As a result, only 3.4%–5.2% of patients are treated within this window. Researchers have been focusing on how to improve the clinical diagnosis and treatment of cerebral infarction beyond the time window of thrombolysis (Feil et al., 2019).

The occurrence and development of ischemic stroke are affected by a variety of risk factors, such as a family history of stroke (Zheng et al., 2019), a history of heart disease (Beck et al., 2018), a history of diabetes (Zou et al., 2018), and a history of hypertension. According to the investigation and analysis of Li et al. (2019), the prevalence of stroke in families with a family history of stroke is 10.52%. In recent years, a number of genetic association studies have suggested that there are multiple genetic risk factors for ischemic stroke, and multiple risk loci have been found to affect susceptibility to ischemic stroke.

Cacabelos et al. (2018) and Yee et al. (2019) showed that the C7673T polymorphism of the APOB gene was significantly associated with the risk of ischemic stroke. Chen et al. (2019) and Nordestgaard et al. (2018) confirmed that the ε2, ε3 and ε4 polymorphisms of the APOE gene were associated with ischemic stroke. APOB and APOE are both known ischemic stroke susceptibility genes through their link to blood lipid levels. In addition, many studies have shown that the SG13S114 (rs10507391) and SG13S32 (rs9551963) polymorphisms of the ALOX5AP gene are associated with susceptibility to ischemic stroke. Zheng et al. (2018) found that carriers of the TT/TA genotype of the SG13S114 polymorphism of ALOX5AP had a higher risk of acute cerebral infarction. Naderi et al. (2019) showed that the SG13S114 polymorphism of ALOX5AP was associated with acute cerebral infarction. Previous genetic studies have found that some genes on chromosome 14, such as GCH1 (Wei et al., 2018), MEG3 (Han et al., 2018), MMP-14 (Elgebaly et al., 2019) and PRKCH (Krupinski et al., 2018), are associated with the risk of ischemic stroke.

Genome-wide association studies (GWAS) reveal candidate loci and susceptibility genes related to the occurrence, development and treatment of diseases by using genome-wide high-density genetic markers (Pei Li and Wang, 2015; Cheng et al., 2019a; Cheng et al., 2019b). Since 2009, GWAS has been widely used to explore candidate gene loci related to new types of stroke. GWAS is generally believed to be able to identify previously undetected biological markers related to stroke (Ye et al., 2018; Cheng et al., 2019c), and because of its large sample size, it can minimize false positive results. The National Institute of Neurological Diseases (NIND) has conducted the largest and most comprehensive GWAS to explore the genetic loci of stroke and its subtypes. The results supported the previously established genetic associations of ischemic stroke. New loci on chromosome 1p13 (such as rs12122341 of the TSPAN2 gene) have been found to be associated with ischemic stroke. Although GWAS has many advantages and is widely used, it is still very hard to understand the role of single-nucleotide polymorphism (SNP) loci in diseases from the huge volume of GWAS results.

Therefore, many researchers have recently tried to integrate GWAS with expression quantitative trait loci (eQTL) to mine disease-related genes (Cheng et al., 2018a; Cheng et al., 2018b). Since eQTL data convey gene expression information and GWAS data convey disease-related SNP information, combining the two datasets reveals loci that are associated with disease because they affect the expression of genes. Zhao et al. (2019) found many Alzheimer's disease-related genes and SNPs by combining GWAS and eQTL. Li et al. (2015) identified asthma-related genes by integrating GWAS and eQTL. Luo et al. (2015) systematically integrated brain eQTL and GWAS and identified ZNF323 as a novel schizophrenia risk gene.

Zhu et al. (2016) generalized Mendelian randomization to summary-data-based Mendelian randomization (SMR). SMR tests the association between a trait and the expression level of each gene across the whole genome using summary data from GWAS and eQTL studies. It is a common tool for identifying genes whose expression levels are associated with a complex trait because of pleiotropy. Pavlides et al. (2016) used 28 GWAS datasets to find genes whose expression levels were associated with complex phenotypes. Meng et al. (2018) studied bone mineral density (BMD)-related genes using SMR. Du et al. (2017) used SMR to identify genes and pathways for amyotrophic lateral sclerosis. Fan et al. (2017) found by SMR that 6 genes are associated with neuroticism. Liu et al. (2018) applied SMR to obesity and found 20 BMI-associated genes. Veturi and Ritchie (2018) compared two popular methods, MP and SMR, on different datasets. These studies indicate that SMR is an effective tool. In this paper, SMR is used to integrate GWAS and eQTL datasets. In this way, the most functionally relevant genes at the loci identified in GWAS for stroke are found.

## METHODS

### Work Frame

As shown in Figure 1, GWAS has identified SNPs that are related to stroke, and eQTL studies have identified SNPs that can affect gene expression. SMR is used to identify SNPs that change gene expression, which may explain why they are associated with stroke. First, we obtained the GWAS and eQTL data. Then, we checked the overlap between these two datasets. Finally, SMR was used to screen SNPs.
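The overlap check in the second step can be illustrated with a minimal Python sketch. The dictionary layouts, SNP identifiers and gene names below are assumptions made for illustration; they are not the authors' data format or code.

```python
def overlap_snps(gwas, eqtl):
    """Return the SNP ids present in both summary datasets, sorted by name.

    gwas: dict mapping SNP id -> GWAS z statistic (hypothetical layout)
    eqtl: dict mapping SNP id -> list of (gene, eQTL z statistic) pairs
    """
    return sorted(set(gwas) & set(eqtl))


# Toy records; identifiers and values are invented for illustration.
gwas = {"rs_a": 4.1, "rs_b": -0.3, "rs_c": 5.2}
eqtl = {
    "rs_a": [("GENE1", 6.0)],
    "rs_c": [("GENE2", -7.5)],
    "rs_d": [("GENE3", 8.8)],
}

shared = overlap_snps(gwas, eqtl)
print(shared)  # ['rs_a', 'rs_c']
```

Only the shared SNPs carry both a GWAS effect and an eQTL effect, so only they can be passed on to the SMR screening step.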

#### SMR

In summary-data-based Mendelian randomization (SMR), let z be a genetic variant (SNP), x the expression level of a gene, and y the trait. The two-step least-squares estimate of the effect of x on y from an MR analysis is then:

$$
\hat{b}_{xy} = \hat{b}_{zy} / \hat{b}_{zx} \tag{1}
$$

Here $\hat{b}_{zy}$ and $\hat{b}_{zx}$ are the least-squares estimates from regressing y and x on z, respectively, and $\hat{b}_{xy}$ denotes the effect size of x on y without confounding from non-genetic factors. The test statistic for $\hat{b}_{xy}$ is:

$$T_{MR} = \hat{b}_{xy}^2 / \text{var}(\hat{b}_{xy}) \tag{2}$$

Here, $T_{MR}$ follows a chi-square distribution with 1 degree of freedom. As the equations above show, MR requires genotype, gene expression and phenotype to be measured on the same sample. However, Zhu et al. (2016) showed that the power to detect $\hat{b}_{xy}$ can be greatly increased by a two-sample MR analysis. Therefore, $T_{MR}$ can be replaced by $T_{SMR}$:

$$T_{\text{SMR}} = \hat{b}_{xy}^2 / \text{var}(\hat{b}_{xy}) = \frac{z_{zy}^2 z_{zx}^2}{z_{zy}^2 + z_{zx}^2} \tag{3}$$

$z_{zy}$ is the z statistic from the GWAS and $z_{zx}$ is the z statistic from the eQTL study.
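As a worked illustration of equation (3), the following Python sketch computes the SMR statistic from two z statistics and converts it to a P value using the chi-square distribution with 1 degree of freedom. The function names and the example z values are assumptions for this sketch; the SMR software itself is not used here.

```python
import math


def smr_statistic(z_gwas, z_eqtl):
    """SMR chi-square statistic from a GWAS and an eQTL z statistic (equation 3)."""
    num = (z_gwas ** 2) * (z_eqtl ** 2)
    den = z_gwas ** 2 + z_eqtl ** 2
    if den == 0.0:
        return 0.0  # both z statistics are zero: no evidence at all
    return num / den


def chi2_1df_pvalue(t):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > t) = erfc(sqrt(t / 2)).
    return math.erfc(math.sqrt(t / 2.0))


# Invented z statistics for one SNP-gene pair.
t = smr_statistic(z_gwas=4.0, z_eqtl=8.0)
print(round(t, 2))  # 12.8
print(chi2_1df_pvalue(t))
```

Because the statistic is a harmonic-type combination of the two squared z statistics, it can never exceed the smaller of them, so a SNP needs both a strong GWAS signal and a strong eQTL signal to pass.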

### RESULTS

#### Data Description

#### GWAS

We used the data from Malik et al.'s research. Eight GWAS datasets are used. Table 1 shows the detailed information about these data.

We collected GWAS data for four different types of stroke (ischemic stroke, large artery stroke, cardioembolic stroke, small vessel stroke).

Figure 2 shows the P values of SNPs in GWAS1 and GWAS2. The SNPs are almost the same in these GWAS datasets, but the P values differ between races. This suggests that different races have different stroke susceptibility genes.

#### eQTL

eQTL data is from a meta-analysis of GTEx brain (Consortium G, 2017), CMC (Fromer et al., 2016), and ROSMAP (Ng et al., 2017). All the data are from brain. Only SNPs within 1Mb distance from each probe are available. The estimated effective n is 1,194.



#### mQTL

The mQTL data used in this paper are a set of brain data from a meta-analysis of ROSMAP (Ng et al., 2017), Hannon et al. (2016) and Jaffe et al. (2016). In the ROSMAP data, only SNPs within 5Kb of each DNA methylation probe are available. In the Hannon et al. data, only SNPs within 500Kb distance from each probe and with P<sub>mQTL</sub> < 1.0e-10 are available. In the Jaffe et al. data, only SNPs within 20Kb distance from each probe and with FDR < 0.1 are available. The estimated effective n is 1,160.
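The distance and significance restrictions described for the eQTL and mQTL sources amount to a window filter around each probe. A minimal Python sketch follows; the record layout, the inclusive window boundary and the assumption that SNP and probe lie on the same chromosome are choices made for this illustration, not properties of the published datasets.

```python
def within_window(snp_pos, probe_pos, window_bp):
    """True when the SNP lies within window_bp base pairs of the probe."""
    return abs(snp_pos - probe_pos) <= window_bp


def filter_cis(records, window_bp, max_p=None):
    """Keep (snp_pos, probe_pos, p_value) records that pass the window
    and, when max_p is given, have p_value below max_p."""
    kept = []
    for snp_pos, probe_pos, p_value in records:
        if not within_window(snp_pos, probe_pos, window_bp):
            continue
        if max_p is not None and p_value >= max_p:
            continue
        kept.append((snp_pos, probe_pos, p_value))
    return kept


# Invented positions (base pairs) on a single chromosome.
records = [
    (1_000_000, 1_003_000, 1e-12),  # 3 kb from the probe
    (1_000_000, 1_400_000, 1e-12),  # 400 kb from the probe
    (1_000_000, 1_400_000, 1e-4),   # 400 kb from the probe, weaker association
]

print(len(filter_cis(records, window_bp=5_000)))                 # 1
print(len(filter_cis(records, window_bp=500_000, max_p=1e-10)))  # 2
```

The first call mimics a 5Kb window without a significance cutoff, and the second a 500Kb window with a P value cutoff of 1.0e-10; an FDR cutoff could be applied in the same way if FDR values were stored in place of P values.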

### Four Kinds of Stroke

Ischemic stroke is a kind of stroke that is caused by arterial obstruction. It accounts for approximately 85% of all strokes. Large artery stroke and cardioembolic stroke are subgroups of this kind of stroke.

Large artery stroke is caused by blood clots (thrombi) that form in the neck or cerebral arteries. There may be an accumulation of fatty deposits (often referred to as plaques) in these arteries.

Cardioembolic stroke is caused by blood clots that reach the brain and block the blood vessels. A common cause is the formation of blood clots in the two upper chambers of the heart (the atria) during rhythm abnormalities (atrial fibrillation).

Small vessel stroke is actually a transient stroke symptom that usually lasts only a few minutes. It is caused by a transient disruption of blood supply to specific parts of the brain and does not cause significant persistent effects on patients. However, it is generally believed that the risk of stroke after small vessel stroke is higher.

#### SNPs and Genes for Ischemic Stroke

Ten SNPs that change the expression of six genes were screened in the European dataset, and 11 SNPs that change the expression of five genes were screened in the trans-ethnic dataset.

As we can see in Table 2, HSD17B12 is found in both tests. Moreno et al. (2018) found that upregulation of HSD17B12 is associated with ischemic stroke using 82 cases and 67 controls. ALDH2 is generally considered a gene that can protect against ischemic stroke (Guo et al., 2013), because overexpression of ALDH2 rescued neuronal survival against 4-HNE treatment in PC12 cells (Lee et al., 2012). These two genes support the accuracy of our results.

TABLE 2 | SMR results of ischemic stroke.

#### SNPs and Genes for Large Artery Stroke

No SNP was screened in the European dataset for large artery stroke. Three SNPs that correspond to one gene, C3orf18, were screened in the trans-ethnic dataset.

Phenotypes for the C3orf18 gene include decreased homologous recombination repair frequency, decreased ionizing radiation sensitivity, upregulation of the Wnt pathway, increased vaccinia virus (VACV) infection, and mildly decreased CFP-tsO45G cell surface transport. It is considered to be associated with cognitive function measurement.

#### SNPs and Genes for Cardioembolic Stroke

Eleven SNPs are significant in the European and trans-ethnic datasets. rs3807989 is screened more than once in the European dataset because it can affect the expression of more than one gene: the expression of both CAV1 and CAV2 can be changed by this SNP.

As we can see in Table 3, 6 genes and 3 genes are screened by SMR in the European dataset and the trans-ethnic dataset, respectively. Three of them overlap.

#### SNPs and Genes for Small Vessel Stroke

Thirteen SNPs and 4 SNPs are significant in the European dataset and the trans-ethnic dataset, respectively. None of these SNPs or their corresponding genes overlap between the two tests. As we can see in Table 4, although no overlap is found between these two tests, some genes overlap between cardioembolic stroke and small vessel stroke.

TABLE 3 | SMR results of cardioembolic stroke.

TABLE 4 | SMR results of small vessel stroke.

### SNPs Change Gene Expression Level by Methylation

Both genetic and environmental factors are key causes of stroke, and methylation plays an important role in the interaction between environmental factors and gene expression. We therefore assumed that some of the SNPs identified above are at greater risk of methylation and can change gene expression levels.

We integrated the SNPs found above with the mQTL data. Thirty-eight unique SNPs were found across the four types of stroke. Thirty-one of these 38 SNPs are significant in the mQTL dataset. We plot the P values of these 31 SNPs in Figure 3. As shown in Figure 3, most of these SNPs are associated with the expression of several genes. In addition, most of the SNPs have quite low P values, which means that they can significantly change the expression of genes.

#### Case Study

#### ULK4

Guo et al. (2016) found that genetic variants in LRP1 and ULK4 are associated with acute aortic dissections. In their paper, they also mentioned that ULK4 may contribute to stroke.

#### CAV1

Shyu et al. (2017) discussed the association of eNOS and CAV1 gene polymorphisms with susceptibility to large artery atherosclerotic (LAA) stroke. A tendency toward an increased LAA stroke risk was significant in carriers of the eNOS Glu298Asp variant in conjunction with the G14713A and T29107A polymorphisms of CAV1 (aOR = 2.03, P-trend = 0.002).

#### CAV2

Jolobe (2012) found that recurrent stroke is caused by a novel voltage sensor mutation in CAV2. They compared stroke mice and normal mice to reach this conclusion.

### CONCLUSIONS

Stroke is the primary cause of disability in adults, which constitutes a serious public health burden. Stroke is generally believed to be caused by genetic and environmental factors. Therefore, in this paper, we identified stroke-related genes and loci from both genetic and environmental aspects.

GWAS identified a large number of stroke-related SNPs, which were difficult to explain. We tried to identify the pathogenesis of significant SNPs by combining SMR with eQTL data. Since eQTL data show the SNPs that can significantly change gene expression and GWAS data show the SNPs that are significantly related to stroke, we combined these two data types to identify the genes whose expression levels are associated with stroke because of pleiotropy.

Thirty-eight SNPs that cause changes in the expression of 14 genes were found using 8 GWAS datasets and brain eQTL data. Those 8 GWAS datasets come from two samples of different races and include four types of stroke (ischemic stroke, large artery stroke, cardioembolic stroke, small vessel stroke). CAV1, SURF1, PLEKHH2, ECD, BNIP1 and CAV2 are found to be associated with cardioembolic stroke and small vessel stroke in Europeans. ULK4 is a susceptibility gene for ischemic stroke and small vessel stroke.

Since methylation (Lv et al., 2019) plays an important role in the interaction between environmental factors and gene expression, we tried to find out whether the 38 SNPs are affected by methylation and lead to changes in the expression levels of other genes. Thirty-one of these 38 SNPs are significant in the mQTL data, and most of them can affect the expression of more than one gene.

Overall, by integrating GWAS with eQTL through SMR, we found that 38 SNPs and 14 genes are related to stroke. Thirty-one of the 38 SNPs are at high risk of methylation, which can also cause changes in gene expression. These findings serve as a guide to understanding the pathogenesis of stroke at the molecular level.

## DATA AVAILABILITY STATEMENT

All the datasets used in this paper can be downloaded from the following locations.

GWAS:

ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary\_statistics/MalikR\_29531354\_GCST006908/MEGASTROKE.2.AIS.EUR.out

ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary\_statistics/MalikR\_29531354\_GCST005843/MEGASTROKE.2.AIS.TRANS.out

ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary\_statistics/MalikR\_29531354\_GCST006907/MEGASTROKE.3.LAS.EUR.out

ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary\_statistics/MalikR\_29531354\_GCST005840/MEGASTROKE.3.LAS.TRANS.out

ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary\_statistics/MalikR\_29531354\_GCST005842/MEGASTROKE.4.CES.TRANS.out

ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary\_statistics/MalikR\_29531354\_GCST006910/MEGASTROKE.4.CES.EUR.out

ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary\_statistics/MalikR\_29531354\_GCST005841/MEGASTROKE.5.SVS.TRANS.out

ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary\_statistics/MalikR\_29531354\_GCST006909/MEGASTROKE.5.SVS.EUR.out

eQTL: https://cnsgenomics.com/software/smr/#eQTLsummarydata

mQTL: https://cnsgenomics.com/software/smr/#mQTLsummarydata

## AUTHOR CONTRIBUTIONS

HuJ, Z-HL, and HoJ conceived and designed the experiments. SZ analyzed data. SZ, HuJ, Z-HL, and HoJ wrote this manuscript. All authors read and approved the final manuscript.

## FUNDING

This study was supported by grants from the National Natural Science Foundation of China (81671760 and 81873910), Scientific Research Transformation Special Fund of Heilongjiang Academy of Medical Sciences (2018415), Scientific Research Project of Health and Family Planning Commission of Heilongjiang Province (201812 and 201622), National Natural Science Foundation of China (81871423), and Shanghai Municipal Commission of Health and Family Planning (20160064).

### REFERENCES


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Zhao, Jiang, Liang and Ju. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# CHG: A Systematically Integrated Database of Cancer Hallmark Genes

Denan Zhang<sup>1†</sup>, Diwei Huo<sup>2†</sup>, Hongbo Xie<sup>1†</sup>, Lingxiang Wu<sup>1†</sup>, Juan Zhang<sup>1</sup>, Lei Liu<sup>1</sup>, Qing Jin<sup>1</sup> and Xiujie Chen<sup>1</sup>\*

<sup>1</sup> College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China, <sup>2</sup> The 2nd Affiliated Hospital of Harbin Medical University, Harbin, China

Background: The analysis of cancer diversity based on a logical framework of hallmarks has greatly improved our understanding of the occurrence, development and metastasis of various cancers.

#### Edited by:

Lei Deng, Central South University, China

#### Reviewed by:

Jiajie Peng, Northwestern Polytechnical University, China

Xuekun Ren, Harbin Institute of Technology, China

Qiang Lei, Harbin Institute of Technology, China

#### \*Correspondence:

Xiujie Chen chenxiujie@ems.hrbmu.edu.cn

†These authors have contributed equally to this work and share first authorship

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 01 July 2019 Accepted: 09 January 2020 Published: 05 February 2020

#### Citation:

Zhang D, Huo D, Xie H, Wu L, Zhang J, Liu L, Jin Q and Chen X (2020) CHG: A Systematically Integrated Database of Cancer Hallmark Genes. Front. Genet. 11:29. doi: 10.3389/fgene.2020.00029

Methods: We designed the Cancer Hallmark Genes (CHG) database, which focuses on integrating hallmark genes in a systematic, standard way and annotates the potential roles of the hallmark genes in cancer processes. Following the conceptual criteria describing hallmark function, the keywords for each hallmark were manually selected from the literature. Candidate hallmark genes were derived from 301 pathways of the KEGG database by Lucene and manually corrected.

Results: Based on the variation data, we identified the hallmark genes of various types of cancer and constructed CHG. We also analyzed the relationships among hallmarks, and the potential characteristics and relationships of hallmark genes, based on the topological structures of their networks. We manually confirmed the hallmark genes identified by CHG based on the literature and databases. We also predicted the prognosis of breast cancer, glioblastoma multiforme and kidney papillary cell carcinoma patients based on CHG data.

Conclusions: In summary, CHG, which was constructed based on a hallmark feature set, provides a new perspective for analyzing the diversity and development of cancers.

Keywords: Hallmark genes, mutation, methylation, copy number variation, annotating Hallmark features, database

### INTRODUCTION

In 2000, Weinberg et al. (2000) first proposed six hallmarks of cancer, including Sustaining Proliferative Signaling (SPS), Evading Growth Suppressors (EGS), Resisting Cell Death (RCD), Enabling Replicative Immortality (ERI), Inducing Angiogenesis (IA), and Activating Invasion and Metastasis (AIM), which provided a logical framework for conceptualizing a variety of neoplastic diseases. In 2011, they added another four hallmarks to more fully capture the features of cancers, including Genome Instability and Mutation (GIM), Tumor-Promoting Inflammation (TPI), Reprogramming Energy Metabolism (REM), and Evading Immune Destruction (EID) (Hanahan and Weinberg, 2011). The hallmarks of cancer capture the most essential phenotypic characteristics of malignant transformation and progression, but numerous factors involved in this multistep process are still unknown. Undoubtedly, the framework constructed by hallmarks has greatly improved the analysis of cancer diversity. Balázs Győrffy et al. reviewed the available techniques that are capable of and appropriate for determining the characteristic features of each hallmark (Menyhart et al., 2016). Hallmark capabilities are regulated by partially redundant signaling pathways, and the significance of these pathways depends on the tumor's underlying molecular features. Recently, many studies have focused on the integration of various cancer-related pathways or genes for analysis, and they have reported significant results. In 2011, Jie Li et al. identified high-quality breast cancer prognostic markers and metastasis network modules by integrating hallmark-related genes from GO terms (Li et al., 2010). In 2013, Naif Zaman et al. predicted breast cancer subtype-specific drug targets by exploring the modules (including apoptosis, cell proliferation and cell cycle) in a signaling network assessment of mutations and copy number variations (CNVs) (Zaman et al., 2013).
These studies strongly emphasized the importance of constructing gene sets for hallmarks. Moreover, the advantages of analysis based on a hallmark framework are notable: 1) It reduces the feature dimension of cancer (more attention will be focused on the significant genes in each hallmark rather than on all genes, which will reduce the large number of passenger genes analyzed). 2) It is explicable (the results of analysis are described more easily). 3) It provides a potential avenue for exploring the mechanism of carcinogenesis. However, the overlap rate of the hallmark genes in current studies is low because the studies use different extraction methods. Furthermore, no gene sets have been systematically collected for the different hallmarks thus far, which makes it difficult to clarify the gene alteration features (including mutations, DNA methylations and CNVs) in each hallmark (Wang et al., 2015).

To address this problem, we established a database called Cancer Hallmark Genes (CHG), which provides gene sets for the ten hallmarks and the corresponding statistical analysis results, including the frequency of different mutation types (e.g., missense, deletion, insertion), methylation and CNV (e.g., loss or gain) for each gene. To maximize the usefulness of our database, we collected a total of 22,697 samples from TCGA and analyzed the variations in mutation, CNV, and methylation of hallmark genes across 34 cancer types.

Furthermore, we analyzed the relationship among ten hallmarks by Fisher's exact test and unsupervised hierarchical clustering (method 2). Eventually, the hallmarks were clustered into four classes: 1) Reprogramming Energy Metabolism (REM). 2) Activating Invasion and Metastasis (AIM), Evading Growth Suppressors (EGS), Enabling Replicative Immortality (ERI), and Sustaining Proliferative Signaling (SPS). 3) Genome Instability and Mutation (GIM). 4) Tumor-Promoting Inflammation (TPI), Evading Immune Destruction (EID), Resisting Cell Death (RCD), and Inducing Angiogenesis (IA).

Even though the hallmark genes identified in the database came from the confirmed literature and databases, we manually confirmed the top 10 altered (mutation, methylation, CNV) genes of each hallmark to further ensure the accuracy of the data. In addition, we also used several cancers as examples for further analysis with the CHG data to demonstrate the value of this database at a practical level.

The CHG database is freely available at our website: http://www.bio-bigdata.com/CHG/index.html.

#### MATERIALS AND METHODS

#### Data for Hallmarks

In this work, 301 pathways were downloaded from KEGG (version 78.0) (Kanehisa et al., 2017). These data were used for the Lucene search and the extraction of pathway genes. Gene variant data (7,075 mutation samples in 34 cancers, 6,177 methylation samples in 20 cancers, and 9,445 CNV samples in 33 cancers) were downloaded from TCGA (Stratton et al., 2013); JHU\_USC (HumanMethylation 450) was selected for the methylation data and BI (Genome\_Wide\_SNP\_6) for the CNV data. These data were used to calculate the frequency of gene variation and the proportion of different types of variation. The DNA methylation, mutation and CNV data in this article were from the same samples of the TCGA database. The TCGA database applies strict rules to the sequencing, processing and analysis of sample data and provides standardized data downloads. Human protein-protein interaction data were downloaded from HPRD (Keshava Prasad et al., 2009), STRING (Szklarczyk et al., 2011), BioGRID (Chatraryamontri et al., 2013) and HTRIdb (Bovolenta et al., 2012). Human gene regulation data were downloaded from HTRIdb. These data were used to construct an integrated gene interaction network. The cDNA data (GRCh38 and GRCh37 versions) were downloaded from Ensembl (Flicek et al., 2014) and used for the processing of CNV data (Supplementary Table 3).

#### The Construction Process of the CHG Database

Following the conceptual criteria describing hallmark function in the article "Hallmarks of Cancer: The Next Generation," published in Cell in 2011, we searched the relevant literature in PubMed and screened the high-frequency descriptive vocabulary appearing in the abstracts as the keywords of the corresponding Hallmark. The core idea of our CHG database is to transform the conceptual description of Hallmark features into real biological processes and their corresponding entities. We therefore built a process that consists of three main steps (Figure 1).

First, we identify the Hallmark description keywords. This step materializes the conceptual description of the Hallmark feature. The relevant literature is determined by searching for the Hallmark feature description, and the specific descriptors associated with each Hallmark feature are determined by identifying the high-frequency vocabulary in the abstracts of the relevant documents. In this step, we manually confirmed the results from the literature scan. In addition to confirming that the identified keywords are related to the Hallmark feature, we excluded uninformative words such as "cancer" and "tumor" from the vocabulary. At the same time, we further enriched the identified Hallmark description keywords through synonym expansion, for example, "apoptosis" and "cell death" (Supplementary Table 1).

Second, we use the text mining software package Lucene to identify the Hallmark-specific pathways in the literature and the KEGG database based on the Hallmark description keywords identified in the previous step. The identification results are again confirmed manually. The manual confirmation step does not add any subjective results; only clearly unrelated results caused by software recognition errors are removed (Supplementary Tables 1, 2).

Finally, genes with potential specificity in the potential Hallmark-specific pathways were screened at the gene mutation, epigenetic, and CNV levels to construct CHG.
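The keyword expansion and pathway matching described in the first two steps can be illustrated with a small Python sketch. Plain substring matching is used here as a stand-in for the Lucene search, and the function names, the synonym table and the third pathway name are assumptions made for this illustration only.

```python
def expand_keywords(keywords, synonyms):
    """Add the synonyms of each keyword; all terms are lower-cased."""
    expanded = set()
    for word in keywords:
        expanded.add(word.lower())
        for alt in synonyms.get(word.lower(), []):
            expanded.add(alt.lower())
    return expanded


def match_pathways(pathway_names, keywords):
    """Return the pathway names whose text contains at least one keyword."""
    hits = []
    for name in pathway_names:
        text = name.lower()
        if any(word in text for word in keywords):
            hits.append(name)
    return hits


synonyms = {"apoptosis": ["cell death"]}  # synonym pair mentioned in the text
keywords = expand_keywords(["Apoptosis"], synonyms)

# The last entry is a made-up pathway name used to exercise the synonym.
pathways = ["Apoptosis", "Cell cycle", "Programmed cell death pathway (toy name)"]
print(match_pathways(pathways, keywords))
# ['Apoptosis', 'Programmed cell death pathway (toy name)']
```

A real full-text index would also handle tokenization, stemming and ranking, which is why the manual confirmation step described above remains necessary.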

## Cancer Type-Specific Variant Gene

Based on the variation data in TCGA (Montenegro et al., 2015), we calculated the variations of mutation, methylation and CNV for these hallmark genes in different types of cancers. Mutation, CNV, and methylation signatures were used as part of the filtration function in the Hallmark-specific gene screening process in our construction of the CHG database. This is because the relationship between these features and cancer has been confirmed in extensive and in-depth discussions in many previous studies (Kan et al., 2010; Kandoth et al., 2013; Laddha et al., 2014; Wu et al., 2017; Bouras et al., 2019; Sina et al., 2019; Tate et al., 2019). The variations in the characteristics of these different types of cancer not only provide more detailed information for analysis based on the hallmarks but also can be used as a "fingerprint" of cancer type or progression, and this cancer classification can be used as further guidance in prognosis and clinical treatment (Supplementary Table 3).

#### Gene Mutation

Based on the somatic mutation (level 2) data for the 34 types of cancers in TCGA, the frequency of each mutated gene was calculated in specific cancers (Chung et al., 2016). To account for the specific action of different somatic mutations in different types or stages of cancers, we mainly studied the following six types of somatic mutations: insertion (INS), deletion (DEL), missense mutations (SNP\_mis), nonsense mutations (SNP\_non), splice site mutations (SNP\_spl), and silent mutations (SNP\_sil) (Hu et al., 2018). The proportion of mutation types in each type of cancer was also statistically analyzed (Kan et al., 2010; Kandoth et al., 2013).

#### DNA Methylation

We carried out the following calculations for the level 3 data from 20 human tumors derived from TCGA that simultaneously contained both cancer and control samples (Bouras et al., 2019; Sina et al., 2019):

a. Calculate the methylation beta value of each sample (including cancer and normal samples). For genes with multiple methylation sites, the average beta value represents the gene methylation values. The average beta value of the gene in all normal samples was calculated as the methylation level of the control group (Tate et al., 2019);
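Step a can be sketched in Python as follows. This is a minimal illustration under assumed inputs: the probe-to-gene mapping, the probe identifiers and the beta values are hypothetical toy data rather than TCGA records, and `gene_beta` and `control_level` are illustrative helper names, not functions from the CHG pipeline.

```python
from statistics import mean


def gene_beta(probe_betas, probe_to_gene):
    """Average the beta values of all methylation sites mapped to each gene (one sample)."""
    per_gene = {}
    for probe, beta in probe_betas.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            per_gene.setdefault(gene, []).append(beta)
    return {gene: mean(values) for gene, values in per_gene.items()}


def control_level(normal_samples, probe_to_gene):
    """Average each gene-level beta value over all normal samples."""
    per_gene = {}
    for probe_betas in normal_samples:
        for gene, beta in gene_beta(probe_betas, probe_to_gene).items():
            per_gene.setdefault(gene, []).append(beta)
    return {gene: mean(values) for gene, values in per_gene.items()}


# Hypothetical toy data: three methylation sites mapped to two genes.
probe_to_gene = {"cg01": "GENE_A", "cg02": "GENE_A", "cg03": "GENE_B"}
normal_samples = [
    {"cg01": 0.2, "cg02": 0.4, "cg03": 0.8},
    {"cg01": 0.4, "cg02": 0.6, "cg03": 0.6},
]
control = control_level(normal_samples, probe_to_gene)
```

The same `gene_beta` helper can be applied to each cancer sample, so that cancer and control values are computed on the same gene-level scale.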


#### Copy Number Variation

We analyzed gene segments for the CNV based on level 3 data derived from TCGA and cDNA data from Ensembl in 33 human tumors that simultaneously contained both cancer and control samples. For each pair of samples, if the CNV occurred in only one sample, the default value of the segment in any other sample was 0. Based on experience, we chose 0.2 and -0.2 as the thresholds for altered CNV genes; we marked the gene as a "gain" when the segment value was greater than 0.2 in the cancer samples and as a "loss" when the segment value was less than -0.2 (Laddha et al., 2014). We counted the frequency of CNV in the genes and the proportion of genes belonging to the "gain" and "loss" categories.
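The thresholding rule described above can be sketched as follows. The gene names and segment values are hypothetical toy data, and the "neutral" label for values between the two thresholds is an assumption introduced here for illustration; the text only defines "gain" and "loss".

```python
from collections import Counter

GAIN_THRESHOLD = 0.2   # segment value above this is called a gain
LOSS_THRESHOLD = -0.2  # segment value below this is called a loss


def classify_segment(segment_value):
    """Label one segment value as gain, loss, or (assumed label) neutral."""
    if segment_value > GAIN_THRESHOLD:
        return "gain"
    if segment_value < LOSS_THRESHOLD:
        return "loss"
    return "neutral"


def cnv_frequencies(segment_values_by_gene):
    """Per gene, the fraction of cancer samples called gain and called loss."""
    summary = {}
    for gene, values in segment_values_by_gene.items():
        calls = Counter(classify_segment(value) for value in values)
        n_samples = len(values)
        summary[gene] = {
            "gain": calls["gain"] / n_samples,
            "loss": calls["loss"] / n_samples,
        }
    return summary


# Hypothetical segment values in four cancer samples; a sample without a
# reported CNV for the gene takes the default value 0.
segment_values = {
    "GENE_A": [0.5, 0.3, 0.0, -0.1],
    "GENE_B": [-0.4, -0.25, 0.1, 0.0],
}
frequencies = cnv_frequencies(segment_values)
```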

#### Analysis of Relationships of Hallmarks

We analyzed the relationships among the ten hallmarks by Fisher's exact test and unsupervised hierarchical clustering (Tan et al., 2011; Hashemi et al., 2013). We first assessed the relationship between the specific gene sets of each pair of hallmarks and then used these pairwise results to characterize the overall relationships among the 10 hallmarks. We separately counted the genes belonging to two hallmarks, to only one hallmark, and to all hallmarks. Under the null hypothesis that any two hallmarks are independent, we calculated the P value with Fisher's exact test. Finally, we carried out hierarchical clustering with 1 − P as the similarity score.
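The pairwise similarity score can be sketched as below. This is an illustration rather than the authors' implementation: the two-sided Fisher's exact test is written out with the standard library, the way the 2×2 table is built from two gene sets and a gene universe is an assumption, the gene identifiers are hypothetical, and the hierarchical clustering step itself is omitted.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of a table whose top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / total

    low, high = max(0, col1 - row2), min(row1, col1)
    observed = probability(a)
    # Sum the probabilities of all tables at most as likely as the observed one.
    p_value = sum(
        probability(x)
        for x in range(low, high + 1)
        if probability(x) <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


def hallmark_similarity(genes_a, genes_b, universe):
    """Similarity score 1 - P between the gene sets of two hallmarks."""
    both = len(genes_a & genes_b)
    only_a = len(genes_a - genes_b)
    only_b = len(genes_b - genes_a)
    neither = len(universe - genes_a - genes_b)
    return 1.0 - fisher_exact_two_sided(both, only_a, only_b, neither)


# Hypothetical gene universe and hallmark gene sets.
universe = {f"g{i}" for i in range(20)}
hallmark_1 = {f"g{i}" for i in range(0, 8)}
hallmark_2 = {f"g{i}" for i in range(4, 12)}
hallmark_3 = {f"g{i}" for i in range(0, 7)}
similarity_12 = hallmark_similarity(hallmark_1, hallmark_2, universe)
similarity_13 = hallmark_similarity(hallmark_1, hallmark_3, universe)
```

The resulting pairwise scores can be arranged into a symmetric matrix and passed to any hierarchical clustering routine.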

#### RESULTS

#### The Features of Hallmark Genes Across Cancers

Genome variation is a common phenomenon in cancer, and whether the hallmark-related genes have a generally or specifically altered pattern is essential to understanding the internal mechanism and prognosis of the tumor. To this end, we processed the somatic mutation data, methylation data and copy number variant data for 34 cancers in TCGA and analyzed the frequency of somatic mutations, methylation and CNVs in different cancer types (Table 1).

To promote the analysis of carcinogenesis, we mapped the driven mutation, methylation and CNV gene data from TCGA into hallmarks to analyze the altered percentages of all hallmark genes. We found that, among all hallmark genes, 97.39% of the genes were altered by mutation, 33.44% were regulated by methylation, and 84.88% were influenced by CNV (Figure 2). In each hallmark, the ratio of genes altered by mutation, methylation and CNV was more than 95% (Table 2). These results indicate that the genomic changes in cancer are widespread.

TABLE 1 | Numbers of pathways and genes of 10 hallmarks.

We counted the number of hallmark genes that are mutated, differentially methylated, or affected by CNV in 34 different cancer types (Figure 3). The results showed that the number of mutated genes differs widely among cancer types: there is a 9-fold difference between the maximum and the minimum, with 2644 in LIHC (liver hepatocellular carcinoma) and 281 in LAML (acute myeloid leukemia). The largest number of differentially methylated genes is 490 in BRCA (breast invasive carcinoma), and the smallest number is 34 in LUAD (lung adenocarcinoma). The largest number of genes with altered CNV is 1972 in OV (ovarian serous cystadenocarcinoma), and the smallest number is 267 in THYM (thymoma).

We also found that different types of cancer have different alteration characteristics. As shown in Figure 3, some cancers, such as SKCM (skin cutaneous melanoma), ESCA (esophageal carcinoma), and LIHC (liver hepatocellular carcinoma), mainly reflect the mutation pattern of the genome, and this is a common pattern in most cancers. Some cancers, such as PCPG (pheochromocytoma and paraganglioma), LAML (acute myeloid leukemia), and OV (ovarian serous cystadenocarcinoma), mainly reflect a pattern of CNV variation. This suggests that we should analyze the specific alteration patterns in specific cancers when uncovering the functional importance of the genomic alterations and the underlying mechanisms that drive cancer development, progression and metastasis in different cancer types.

#### TABLE 2 | Ratio of altered genes in hallmarks.

For each hallmark, the ratio of genes altered by mutation, methylation, and CNV was more than 95%.

#### Network of Hallmark Genes

The potential characteristics and relationships of hallmark genes can be effectively revealed based on the topological structures of their networks. Since the hallmark genes were identified from qualitative analysis without any relevant interaction information, we mapped these hallmark genes onto the integrated protein regulatory network to collect data on the interaction and regulation relationships between the hallmark genes, and we extracted the interactions between the hallmark genes, which resulted in the construction of 10 hallmark subnetworks. The average degree is 36 in the integrated protein regulation network and 54 in the entire hallmark network (constructed from all the interacting hallmark genes). This indicates that the connectivity among hallmark genes is higher than the average level of the integrated protein interactions and shows that hallmark networks are more closely linked. On average, for the 10 hallmark subnetworks, 94% of the hallmark genes were involved in the network (Supplementary Figure 1). We performed an analysis of the 10 subnetworks and calculated the degree, betweenness and clustering coefficient of all nodes. We found that, with the exception of the GIM network (Figure 4), the gene interactions inside each hallmark subnetwork were more closely related than the interactions between the 10 hallmark subnetworks. This result may be due to GIM being the basis of other hallmarks; the genetic diversity generated by GIM fosters the acquisition of other hallmark features (Hanahan and Weinberg, 2011). At the same time, we also analyzed the correlation between the degree and betweenness of genes in each subnetwork. The results showed that genes with large degrees often also have larger betweenness, as there was a positive correlation between these variables (Supplementary Figure 1).
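Two of the node-level measures mentioned above, degree and clustering coefficient, can be sketched with the standard library as follows. The edge list and node names are hypothetical, the helper names are illustrative, and betweenness is left out to keep the sketch short; a graph library would normally be used for all three measures.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list (self-loops are ignored)."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def node_degrees(adjacency):
    """Number of interaction partners of each node."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


def clustering_coefficients(adjacency):
    """Fraction of each node's neighbor pairs that are themselves connected."""
    result = {}
    for node, neighbors in adjacency.items():
        k = len(neighbors)
        if k < 2:
            result[node] = 0.0
            continue
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
        result[node] = 2 * links / (k * (k - 1))
    return result


# Hypothetical subnetwork: a triangle G1-G2-G3 plus a pendant node G4.
edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4")]
adjacency = build_adjacency(edges)
degrees = node_degrees(adjacency)
clustering = clustering_coefficients(adjacency)
average_degree = sum(degrees.values()) / len(degrees)
```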

#### Relationship of Hallmarks

Ten types of hallmarks described different aspects of the tumor characteristics, but there were few relationships mentioned between these characteristics on a pan-cancer scale. To this end, we analyzed the relationship among the hallmarks and divided the ten hallmarks into four classes (Figure 5). Interestingly, we found two classes with only one hallmark, namely, Reprogramming Energy Metabolism (REM) and Genome Instability and Mutation (GIM). This result is reasonable, as both of these hallmarks are clearly different from the other hallmarks in terms of their mechanisms. As we know, almost all types of cancers are caused by DNA mutation or genome structure alterations and are followed by the appearance of other hallmarks.

In addition, the similarity among the hallmarks Activating Invasion and Metastasis (AIM), Evading Growth Suppressors (EGS), Enabling Replicative Immortality (ERI) and Sustaining Proliferative Signaling (SPS) is prominent. Many of the hallmarks in this set are related to the preliminary stage of cancers (Hanahan and Weinberg, 2000; Hanahan and Weinberg, 2011). One confusing inclusion in the set is AIM, which is a hallmark that is considered to be related to the end stage of cancers. However, recent research has also found that AIM occurs in early cancers as well (Hanahan and Weinberg, 2011).

FIGURE 3 | Number of variant genes of Hallmarks in different cancer types. The number of hallmark genes that are mutated, differentially methylated, or affected by CNV in 34 different cancer types. Different types of cancer show different alteration characteristics.

The last class includes Tumor-Promoting Inflammation (TPI), Evading Immune Destruction (EID), Resisting Cell Death (RCD), and Inducing Angiogenesis (IA). Noticeably, tumor-promoting inflammation may activate the response of immune system, and many recent studies have focused on the relationship between inflammation and the immune system in cancers (Grivennikov et al., 2010; Tan et al., 2011; Elinav et al., 2013; Hashemi et al., 2013).

In addition, we further analyzed the patterns of characteristic variation of the hallmark genes (Figure 6) in 34 different cancers (Supplementary Table 3). We looked at the top 10 altered features (e.g., mutation, CNV or methylation) of each hallmark gene as the Typical Characteristics of the Hallmark Gene (TCHG, Supplementary Table 4). In heat map analysis, we can clearly find major differences between the TCHGs as altered patterns in different types of cancer. In fact, these features can be used as simple markers for distinguishing cancer types.

#### Validation of CHG Data

Although the hallmark-related genes identified in the database came from the confirmed literature and databases, we manually confirmed the TCHG to further ensure the accuracy of the data. Considering the very large dataset that we had to confirm, we have currently verified only the top 10 altered (mutation, methylation, CNV) genes of each hallmark. Over 92% of the typical characteristic genes have explanations of their specific hallmark functions in the literature, which supports the accuracy and precision of the CHG data on a theoretical level (Supplementary Table 4).

In addition, we compared the results of this study with the existing Sanger Cancer Gene Census database (Futreal et al., 2004). The Sanger Cancer Gene Census database not only describes the genomic features of cancer-related genes themselves, but also includes information on tissue distribution, mutations and protein structure. We compared the 699 cancer-related genes identified in the Sanger Cancer Gene Census database with the Typical Characteristics of the Hallmark Gene (TCHG) we identified. Of the 139 hallmark-related TCHG genes we identified, 69 were also included in the Sanger database, accounting for 49.7%. These results also support the accuracy of our results. For the genes that are not included in the Sanger database, we confirmed their important role in cancer-related biological processes through literature verification, such as ETS1 (Watabe et al., 1998; Fujimoto et al., 2004; Zhang et al., 2014; Li et al., 2015) and RHOA (Lee et al., 2015; Zeng et al., 2015; Sun et al., 2016) in the hallmark "Activating Invasion and Metastasis".

#### CHG Case Study

In addition, we analyzed breast cancer samples labeled as recurrent or non-recurrent based on the CHG data. This analysis serves as an example application of the CHG database and illustrates its practical value. We performed an enrichment analysis of the differentially expressed genes based on data from 159 breast cancer patients from GEO, with a significance level of p < 0.01. The sample group and the control group consisted of patients with and without recurrence, respectively. The differentially expressed genes were filtered by the hallmark genes from the CHG database before the enrichment analysis was performed. We found that these genes were enriched in 2 of the 10 hallmarks, namely Genome Instability and Mutation (GIM) and Tumor-Promoting Inflammation (TPI) (Table 3). It is well known that tumor development is jointly promoted by cell-intrinsic and cell-extrinsic factors. The hallmarks in Table 3 include risk factors for tumor recurrence that are both extracellular (Tumor-Promoting Inflammation) and intracellular (Genome Instability and Mutation). These results provide a theoretical interpretation of the enrichment analysis and also reflect the significance of the hallmark genes in the CHG database.
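An enrichment test of this kind can be sketched as follows. The text does not state which statistical test was used, so the one-sided hypergeometric test below is an assumption, and the gene counts are hypothetical numbers chosen only to illustrate the p < 0.01 cut-off.

```python
from math import comb


def enrichment_p_value(universe_size, hallmark_size, deg_count, overlap):
    """One-sided hypergeometric P value: chance of seeing at least `overlap`
    hallmark genes among `deg_count` differentially expressed genes drawn
    from a universe that contains `hallmark_size` hallmark genes."""
    upper = min(hallmark_size, deg_count)
    favorable = sum(
        comb(hallmark_size, x) * comb(universe_size - hallmark_size, deg_count - x)
        for x in range(overlap, upper + 1)
    )
    return favorable / comb(universe_size, deg_count)


# Hypothetical counts: 100 genes in total, 10 in the hallmark,
# 20 differentially expressed genes, 6 of which belong to the hallmark.
p_value = enrichment_p_value(universe_size=100, hallmark_size=10, deg_count=20, overlap=6)
is_enriched = p_value < 0.01
```

With ten hallmarks tested, a multiple-testing correction would usually be applied on top of the per-hallmark P values; whether the original analysis did so is not stated.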

TABLE 3 | Hallmark function of differentially expressed genes based on 137 breast cancer data.


FIGURE 7 | In a survival analysis of 1,183 breast cancer patients (top) and 156 glioblastoma multiforme patients (bottom), only the expression level of hallmark genes could clearly distinguish the length of the survival time in the prognosis.

The accuracy and specificity of the hallmark genes identified in CHG can also be confirmed by our analysis of survival data for cancer patients. The survival analysis based on TCGA data was carried out with only the hallmark genes as a single block, and it showed that grouping patients by differential expression of hallmark markers (relative to the average expression level) clearly distinguished patient prognosis with high statistical significance. Similar results were found in many types of cancer. For instance, in a survival analysis of 1,183 breast cancer patients and 156 glioblastoma multiforme patients, the expression level of hallmark genes alone could clearly distinguish the length of the survival time (Figure 7). In addition, the hallmark genes identified by CHG can also be used, to some extent, as markers of cancer recurrence. An analysis of the survival data of 284 KIRP (kidney papillary cell carcinoma) patients with 27 recurrence cases (Figure 8) shows that the hallmark genes identified in CHG have good sensitivity for distinguishing cancer recurrence. These results show that the variation characteristics of the hallmark-related genes in CHG are representative and can be directly applied to rapid qualitative analysis.

#### DISCUSSION

Since Weinberg et al. first established the hallmarks of cancer in 2000, many studies have focused on the analysis of cancer based on a framework constructed from these hallmarks. In 2011, the number of hallmarks increased to ten, which indicates that the features of cancer may be exceedingly complex. Perhaps unsurprisingly, in 2013, another hallmark, Aberrant Alternative Splicing, was proposed by Michael Ladomery (Ladomery, 2013). It has been reported that the vast majority of human genes, possibly over 94%, are alternatively spliced (Pan et al., 2008). In 2015, MF Montenegro et al. targeted the epigenetic machinery of cancer cells and noted that there was increasing evidence linking the aberrant regulation of methylation to carcinogenesis (Montenegro et al., 2015), which implied that it may be a potential hallmark of cancer. In 2015, Mamatha Bhat et al. published a review of the translation machinery in cancer. They noted that translation plays a major role in the regulation of gene expression and that the dysregulation of this process is considered a hallmark of cancer.

The CHG database that we constructed is based on the ten hallmarks that Weinberg proposed in 2011. As a specifically designed framework constructed from a hallmark database, CHG can provide a new perspective for an analysis of the diversity and development of cancers as well as a convenient method for in-depth data mining. The CHG database focused on integrating hallmark genes, annotating the potential roles of hallmark features in human cancer processes, and evaluating the relationships of the ten hallmarks by constructing hallmark networks and calculating the degree and distance between genes belonging to each network. Even though the hallmark-related genes identified in the database have been confirmed by consensus from the literature and databases, we manually confirmed the top 10 altered (mutation, methylation, CNV) genes in each hallmark to further ensure the accuracy of our data.

According to our plan, the CHG database will be updated annually to incorporate new findings in the hallmark field or to revise existing results. We will also follow studies of cancer hallmarks and updates of important data sources (such as revisions of TCGA or KEGG), and improve the practicality of the CHG database for mechanism interpretation and clinical use. All old versions of the database will be maintained and remain available for download, and the differences between versions will be listed.

Furthermore, over the past decade, analysis based on the integration of multiple datasets has become quite prevalent. Du et al. (2013) analyzed clinically relevant long noncoding RNAs in human cancer by integrating SCNA (somatic copy number alteration), lncRNA and clinical data. Wu et al. (2014) predicted disease-causing nonsynonymous single nucleotide variants by integrating multiple genomic datasets. Sanchez et al. (2014) integrated the analysis of ChIP-Seq and RNA-Seq data to unveil an lncRNA tumor suppressor signature. Many studies, such as the work of Peng et al., have determined that miRNAs are a widespread regulatory mechanism in cancer (Peng et al., 2019b). Hence, it is worthwhile to integrate non-coding RNAs (including miRNA, lncRNA, etc.) (Cheng et al., 2016; Cheng et al., 2019), fusion genes and drug information into a database. We have set out to construct a network composed of these non-coding RNAs, genes and drugs. We hope that the next step will be to provide an online analysis tool (such as those of Peng et al., 2019a; Peng et al., 2019c) to support further personalized analysis. We will gather these resources into the database in the next version, and we anticipate that the database will help promote the analysis of cancer and the identification of valuable drug targets.

#### DATA AVAILABILITY STATEMENT

The CHG database is freely available at our website: http://www.bio-bigdata.com/CHG/index.html.

#### AUTHOR CONTRIBUTIONS

DZ, DH, HX, and LW contributed equally to this work and should be considered co-first authors. JZ, LL, and HX collected data and conducted calculation and analysis. DH, QJ, and XC analyzed the results. DZ and LW wrote the paper. All authors reviewed the manuscript.

#### FUNDING

This work was supported by the National Natural Science Foundation of China [Grant Nos. 61671191, 61971166, 61701142, 61802090].

#### ACKNOWLEDGMENTS

We are very grateful to Professor Xiujie Chen and Professor Hongbo Zhou for their suggestions and comments on this research.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00029/full#supplementary-material

SUPPLEMENTARY FIGURE 1 | Topological characteristics of hallmark gene networks.

SUPPLEMENTARY TABLE 1 | Characteristic pathways of different hallmarks with mapping keys.

SUPPLEMENTARY TABLE 2 | Gene sets of 10 hallmarks.

SUPPLEMENTARY TABLE 3 | Specific genes in 34 cancer types of mutation.

SUPPLEMENTARY TABLE 4 | Literature validation of TOP10 altered (mutation, methylation, CNV) genes (TCHGs) of each hallmark.

#### REFERENCES

Grivennikov, S. I., Greten, F. R., and Karin, M. (2010). Immunity, inflammation, and cancer. Cell 140, 883–899. doi: 10.1016/j.cell.2010.01.025


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Zhang, Huo, Xie, Wu, Zhang, Liu, Jin and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Pipeline for Reconstructing Somatic Copy Number Alternation's Subclonal Population-Based Next-Generation Sequencing Data

Yanshuo Chu, Chenxi Nie and Yadong Wang\*

Center of Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China

State-of-the-art next-generation sequencing (NGS)-based subclonal reconstruction methods perform poorly on somatic copy number alternations (SCNAs), not only because they need to simultaneously estimate the subclonal population frequency and the absolute copy number of each SCNA, but also because complex bias and noise exist in the tumor and its paired normal sequencing data. Both existing NGS-based SCNA detection methods and tools for inferring the subclonal population frequency of SCNAs use the read count ratio (RCR) of the tumor to its paired normal as the key feature of tumor sequencing data; however, sequencing error and bias have a great impact on RCR, which leads to a large number of redundant SCNA segments that make the subsequent inference of SCNA subclonal population frequency and the subclonal reconstruction time-consuming and inaccurate. We perform a mathematical analysis of the number of solutions for the subclonal frequency of an SCNA, and based on it we propose a computational algorithm to reduce the impact of false breakpoints. We construct a new probability model that incorporates the RCR bias correction algorithm, and by chaining it with the false breakpoint filtering algorithm, we construct a complete pipeline for reconstructing the subclonal populations of SCNAs. The experimental results show that our pipeline outperforms existing subclonal reconstruction programs on both simulated data and TCGA data. Source code is publicly available as a Python package at https://github.com/dustincys/msphy-SCNAClonal.

Keywords: somatic copy number alternation, subclonal reconstruction, subclonal frequency, absolute copy number, bias correction

## INTRODUCTION

Tumor heterogeneity introduces challenges in cancer tissue diagnosis and subsequent treatment (Nowell, 1976). Tumor heterogeneity cannot be inferred from the properties of biomolecules through ontology or pathway analysis (Cheng et al., 2017; Cheng et al., 2018c), but it can be inferred by measuring the quantity of biomolecules (Cheng et al., 2018b; Cheng et al., 2018d; Cheng et al., 2019). To decipher the cell composition of bulk samples, somatic copy number alternations (SCNAs), which are commonly found in tumor cells (Beroukhim et al., 2010), are used as markers to determine tumor subclonal populations in a tumor–normal paired manner (Oesper et al., 2013; Li and Xie, 2015).

#### Edited by:

Lei Deng, Central South University, China

#### Reviewed by:

Junwei Han, Harbin Medical University, China

Juan Wang, Inner Mongolia University, China

\*Correspondence: Yadong Wang ydwang@hit.edu.cn

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 16 August 2019 Accepted: 16 December 2019 Published: 27 February 2020

#### Citation:

Chu Y, Nie C and Wang Y (2020) A Pipeline for Reconstructing Somatic Copy Number Alternation's Subclonal Population-Based Next-Generation Sequencing Data. Front. Genet. 10:1374. doi: 10.3389/fgene.2019.01374

The benefit of using SCNAs for subclonal reconstruction is that the whole genome sequencing (WGS) data do not have to be deeply sequenced (Li and Xie, 2015), because SCNAs affect large, multi-kilobase-sized or megabase-sized regions of the genome, which allows the average copy number of these regions to be accurately estimated with WGS (Deshwar et al., 2015).

SCNA-based subclonal reconstruction algorithms attempt to infer the population structure of heterogeneous tumors based on the subclonal population frequency of SCNAs (Deshwar et al., 2015). However, the cellular prevalence and the absolute copy number are intertwined, and next-generation sequencing (NGS)-based subclonal reconstruction needs to estimate the population frequency and the absolute copy number of each SCNA simultaneously. The solution space of the subclonal frequency of an SCNA remains poorly understood, and multiple solutions for the subclonal frequency might exist for some SCNAs (Oesper et al., 2013), which invalidates the infinite sites assumption (ISA) (Kimura, 1969; Hudson, 1983; Jiao et al., 2014). The ISA is a commonly accepted and powerful assumption, which posits that each mutation occurs only once in the evolutionary history of the tumor.

To infer the SCNA's subclonal population frequency based on NGS data, the location of SCNAs in the genome needs to be obtained first. The SCNA breakpoints are detected through multiple bin-merging processes, during which the RCR of the tumor to its paired normal is used as a key feature (Xi et al., 2010). However, sequencing error and bias have a great impact on RCR, which leads to false positive breakpoints and incorrect subclonal reconstruction (please refer to Figures S2 and S3 and Tables S2 and S3 in the Supplementary Material). The more sensitive an SCNA detection tool is, the more prone it is to sequencing error. For example, BIC-seq (Xi et al., 2010) first splits the whole genome into small bins, then uses the Bayesian Information Criterion as the bin-merging and stopping criterion to detect SCNA breakpoints. When the sensitivity parameter λ of BIC-seq is very high, the true positive rate and the false discovery rate decrease simultaneously (Xi et al., 2010); conversely, when λ is small, the SCNA regions are separated into small fragments by false positive breakpoints (Xi et al., 2010). The choice of λ is equivalent to setting the type I error, that is, the rate at which two neighboring windows that should be combined are left separate during the window-merging loop. Since the subclonal reconstruction algorithm depends on the subclonal population proportion of somatic mutations to define mutation sets and their subpopulations (Deshwar et al., 2015) (please refer to Figure S4 for the definition of subpopulation and subclonal population), we need to choose a smaller λ to ensure a high true positive rate of breakpoints, so as to estimate the subclonal population frequency of every SCNA fragment more accurately. However, the false positive breakpoints split the SCNA regions into many small SCNA fragments, which violates the ISA, produces much redundant input data, and makes the subclonal reconstruction process extremely slow.

Existing NGS-based subclonal reconstruction methods, such as THetA (Oesper et al., 2013) and MixClone (Li and Xie, 2015), use expectation maximization (EM) or the maximum likelihood method (MLM) to infer the subclonal frequency and the absolute copy number of each input segment. To reduce the search space, MixClone assumes that the number of subclonal populations is less than 3, and this number (1 or 2) needs to be predefined. During the maximization step of the EM process, MixClone further reduces the search space by restricting the subclonal frequencies of all subclonal populations to a few combinations of discrete values. Thus, MixClone trades accuracy for computational speed. THetA (Oesper et al., 2013), on the other hand, does not compromise on the search space. It is therefore extremely time consuming when searching for the optimal subclonal frequency in (0, 1) for each input segment, which makes it unable to perform subclonal reconstruction for more than three subclonal populations.

The ever-increasing volume of biotechnology data brings the chance to develop computational toolkits (Cheng et al., 2016; Cheng et al., 2018a; Cheng et al., 2019) for finding the pathogenesis of diseases; in this article, we provide a pipeline for reconstructing SCNA's subclonal population-based NGS data. We first perform a mathematical analysis of the number of solutions for an SCNA's subclonal frequency, propose and prove a theorem on this solution number, and present a method to filter out false SCNA breakpoints based on it. Then we propose a probability model that incorporates the RCR bias correction algorithm we previously developed, and we construct an SCNA subclonal population reconstruction pipeline by chaining it with the false breakpoint filtering algorithm. We model the read depth of the tumor sample as a Poisson distribution with the expected tumor read count proportional to the absolute copy number and subclonal frequency. We use the tree-structured stick breaking Dirichlet process (Prescott Adams et al., 2010) to generate the tree structure of the tumor's evolutionary history, and use Markov Chain Monte Carlo (MCMC) to obtain the result of subclonal reconstruction. The experimental results show that our pipeline outperforms existing subclonal reconstruction programs on both simulated data and TCGA data.
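The Poisson read depth model mentioned above can be sketched as follows. This is a simplified illustration, not the pipeline's implementation: it assumes equal sequencing depth in the tumor and normal samples, ignores the bias correction step, and uses hypothetical read counts and function names.

```python
from math import lgamma, log


def average_copy_number(phi, tumor_copy_number, normal_copy_number=2):
    """Average copy number of a segment in a mixture of tumor and normal cells."""
    return phi * tumor_copy_number + (1 - phi) * normal_copy_number


def poisson_log_likelihood(read_count, expected_count):
    """Log of the Poisson probability of `read_count` given its expectation."""
    return read_count * log(expected_count) - expected_count - lgamma(read_count + 1)


def segment_log_likelihood(tumor_reads, normal_reads, phi, tumor_copy_number):
    """Log-likelihood of the tumor read count of one segment.

    Assumption: the expected tumor read count equals the normal read count
    scaled by (average copy number / 2), i.e. both samples have equal depth.
    """
    expected = normal_reads * average_copy_number(phi, tumor_copy_number) / 2
    return poisson_log_likelihood(tumor_reads, expected)


# Hypothetical segment: 1000 reads in the normal sample, 1250 in the tumor.
ll_three_copies = segment_log_likelihood(1250, 1000, phi=0.5, tumor_copy_number=3)
ll_four_copies = segment_log_likelihood(1250, 1000, phi=0.25, tumor_copy_number=4)
ll_no_scna = segment_log_likelihood(1250, 1000, phi=1.0, tumor_copy_number=2)
```

In this toy case the first two parameter settings give the same average copy number (2.5) and therefore the same likelihood, which illustrates why the subclonal frequency and the absolute copy number cannot be separated from read depth alone.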

### MATERIALS AND METHODS

#### Solution Space of SCNA's Subclonal Population Frequency

The RCR and the B-allele frequency (BAF) of the heterozygous single nucleotide polymorphism (SNP) loci in the SCNA segment are commonly used as input for sequencing data-based tools that infer SCNA copy number and subclonal frequency (Wang et al., 2007; Oesper et al., 2013; Li and Xie, 2015). Since the number of reads mapped to a genome region is proportional to the copy number of this region, the RCR is set to be proportional to $\bar{C}_j/2$ by existing tools (Oesper et al., 2013; Li and Xie, 2015), where $\bar{C}_j$ denotes the average copy number of the $j$th SCNA segment. Let $\phi_j$ denote the subclonal population cellular prevalence of the $j$th SCNA segment; $C_j^T$ denote its absolute copy number; $\mu_{jk}^T$ represent the BAF of the $k$th heterozygous SNP locus in the $j$th SCNA segment; and $\bar{\mu}_{jk}$ represent the average BAF of the $k$th heterozygous SNP locus in the $j$th SCNA segment. Then we have the following equation set

$$\begin{cases} \bar{C}_{j} = \phi_{j} C_{j}^{T} + (1 - \phi_{j}) \cdot 2, \\ \bar{C}_{j} = \dfrac{1}{\bar{\mu}_{jk}} \left[ \phi_{j} C_{j}^{T} \mu_{jk}^{T} + (1 - \phi_{j}) \cdot 2 \cdot \dfrac{1}{2} \right], \quad k = 1, \ldots, K_{j}. \end{cases} \tag{1}$$

where $K_j$ is the total number of heterozygous SNP loci in the $j$th SCNA segment. Since the B allele is located on either the paternal or the maternal haploid genome, both $\mu_{jk}^T$ and $(1 - \mu_{jk}^T)$ could be the BAF value in the same SCNA fragment, and both $\bar{\mu}_{jk}$ and $(1 - \bar{\mu}_{jk})$ could be the average BAF value in the same SCNA fragment. To reduce the complexity, we use $\hat{\mu}_{jk}^T$ to denote the smaller of $\mu_{jk}^T$ and $(1 - \mu_{jk}^T)$, and $\hat{\bar{\mu}}_{jk}$ to denote the smaller of $\bar{\mu}_{jk}$ and $(1 - \bar{\mu}_{jk})$. Here we give a theorem that characterizes the solution space of Equation 1, and we prove it in the Supporting Information.

THEOREM 1. Given $\bar{C}_j$ and $\{\hat{\bar{\mu}}_{jk}\}_{k=1}^{K_j}$, and letting $x = \dfrac{C_j^T \hat{\mu}_{jk}^T - 1}{C_j^T - 2}$, we have the following conclusions:


As shown in Figure 1, given the observed values $\bar{C}_j$ and $\hat{\mu}_{jk}$ and the maximum copy number $C_{max} = 15$, only 7/43 of the curves of the family of functions $\hat{\mu}_{jk} = x\left(1 - \frac{2}{\bar{C}_j}\right) + \frac{1}{\bar{C}_j}$ present multiple $\phi_j$ solutions (please refer to Table S1 for details of the multi-solution ranges).
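The solution space of equation set 1 can be explored numerically. The following is a minimal sketch, not the authors' implementation: it assumes that the tumor BAF takes the form $g / C_j^T$ for an integer number $g$ of B-allele copies, and the function name `phi_solutions` and the tolerance `tol` are hypothetical. For each candidate absolute copy number it solves the first line of equation set 1 for the prevalence and then checks the second line against the observed folded BAF.

```python
def phi_solutions(c_bar, mu_hat, c_max=15, tol=1e-9):
    """Enumerate (C_T, g, phi) triples consistent with equation set 1.

    c_bar  : observed average copy number of the segment
    mu_hat : observed folded average BAF, i.e. min(mu_bar, 1 - mu_bar)
    g      : assumed integer number of B-allele copies, so mu_T = g / C_T
    """
    if c_bar <= 0:
        raise ValueError("c_bar must be positive")
    solutions = []
    for c_t in range(c_max + 1):
        if c_t == 2:
            continue  # C_T = 2 gives c_bar = 2 for every phi, so it is not an SCNA
        # First line of equation set 1 solved for phi.
        phi = (c_bar - 2.0) / (c_t - 2.0)
        if not (0.0 < phi <= 1.0):
            continue
        for g in range(c_t + 1):
            # Second line of equation set 1 with C_T * mu_T = g.
            mu_bar = (phi * g + (1.0 - phi)) / c_bar
            if abs(min(mu_bar, 1.0 - mu_bar) - mu_hat) < tol:
                solutions.append((c_t, g, phi))
    return solutions
```

Under these assumptions, `phi_solutions(2.5, 0.2)` yields a single prevalence (reached by two mirrored genotypes), whereas `phi_solutions(2.5, 0.4)` yields several prevalences, which illustrates the kind of multi-solution case that Theorem 1 describes.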

#### The Algorithm of Filtering Out False Positive SCNA Breakpoints

We assume that no two adjacent SCNAs present the same $\bar{C}_j$ and $\hat{\mu}_{jk}$ but different $\phi_j$ and $C_j^T$ according to Theorem 1. We use the same method described in Li and Xie (2015) to model the read count ratio of the tumor and its paired normal. Based on the Lander–Waterman model (Lander and Waterman, 1988), the probability of sampling a read from a given segment depends on three main factors: 1) its copy number, 2) its total genomic length, and 3) its mappability, which depends on factors such as repetitive sequence and GC content (Li and Xie, 2015). For each segment $j$, we associate a coefficient $\theta_j$ to account for the effect of its mappability and genomic length. Thus, the expected tumor read counts mapped to segment $j$, denoted as $\lambda_j$, are proportional to $\bar{C}_j \theta_j$. For example, for segment $x$ and segment $y$, we have

$$\frac{\lambda_x}{\lambda_y} = \frac{\bar{C}_x \theta_x}{\bar{C}_y \theta_y}. \tag{2}$$

Because the mappability coefficients matter only in a relative sense, we take $\theta_x/\theta_y = D_x^N/D_y^N$, as these segments should have the same sequence properties between the normal and tumor samples. Thus, Equation 2 is transformed into

$$\log\left(\lambda_x/D_x^N\right) - \log\left(\lambda_y/D_y^N\right) = \log\frac{\bar{C}_x}{\bar{C}_y}. \tag{3}$$

However, our previous study (Chu et al., 2017a) showed that the RCR of a tumor to its paired normal presents a log-linear GC content bias, and described a bias correction software, "Pre-SCNAClonal" (Chu et al., 2017a), to correct this bias specifically. Let $\widehat{D_i^S/D_i^N}$ denote the corrected read count ratio of the tumor sample and its paired normal, and let $\Phi()$ denote the bias correction process. Then we have $\widehat{D_i^S/D_i^N} = \Phi(D_i^S/D_i^N)$ and

$$\log\left(\widehat{D_i^S/D_i^N}\right) - \log\left(\widehat{D_j^S/D_j^N}\right) = \log\frac{\bar{C}_i}{\bar{C}_j}. \tag{4}$$
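As an illustration of Equation 4, the sketch below recovers the ratio of the average copy numbers of two segments from their read counts. It is only a sketch under stated assumptions: the function name `copy_number_ratio` is hypothetical, and the bias correction is replaced by an identity placeholder, since the actual correction belongs to Pre-SCNAClonal and is not reproduced here.

```python
import math


def copy_number_ratio(tumor_i, normal_i, tumor_j, normal_j, correct=lambda r: r):
    """Return C_bar_i / C_bar_j implied by Equation 4.

    tumor_*, normal_* : read counts of segments i and j
    correct           : stand-in for the bias correction of the read count
                        ratio (identity here, which is an assumption)
    """
    log_diff = (math.log(correct(tumor_i / normal_i))
                - math.log(correct(tumor_j / normal_j)))
    return math.exp(log_diff)
```

For example, if segment $j$ is a baseline segment with average copy number 2 and the function returns 1.5, the implied average copy number of segment $i$ is 3.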

Then we use the following steps to filter out false positive SCNA breakpoints.


The space complexity of the algorithm for filtering out false positive SCNA breakpoints is $O(J^2)$. The computational complexities of "MeanShift" and "hierarchical clustering" are $O\left(\sum_{n=1}^{N}\left(I_n \ast \sum_{s_j \in S_n} K_j\right)^2\right)$ and $O(J^3)$, respectively, where $I_n$ is the number of iterations for $S_n$. Thus, the time complexity of the algorithm for filtering out false positive SCNA breakpoints is $O\left(J^3 + \sum_{n=1}^{N}\left(I_n \ast \sum_{s_j \in S_n} K_j\right)^2\right)$. The detailed validation of this algorithm is described in Section 4 of the Supplementary Material (please refer to Figures S5–S8 for the results).

#### Normal Segments Detection Method

The task of normal segment detection is to find all the segments for which $\bar{C}_j = 2$, since the copy number $C_j^N$ of $s_j$ in the normal sample normally equals 2. A cancer genome differs from the reference genome by gains and losses of segments, or intervals, of the reference genome (Oesper et al., 2013).

However, because the tumor and its paired normal are sequenced in two separate processes and their coverage may not be exactly the same, $\widehat{D_j^S/D_j^N}$ does not always equal 1 for the normal segments (Li and Xie, 2015). In this paper, we use the same normal segment detection method described in our previous work (Chu et al., 2017a), which utilizes BAF information to detect normal segments.

Equation set 1 implies the following conclusion:

$$\begin{aligned} &\phi_j = 0 \quad \text{or} \quad C_j^T = 2 \Leftrightarrow \bar{C}_j = 2, \\ &\phi_j = 0 \quad \text{or} \quad C_j^T = 0 \quad \text{or} \quad \mu_{jk}^T = \frac{1}{2} \Leftrightarrow \bar{\mu}_{jk} = \frac{1}{2}. \end{aligned} \tag{5}$$

We detect the normal segments $N_{tm}$ from $S_{tm}$ according to Equation 5 in the following two steps. First, we filter out all the segments $s_j \in S_{tm}$ with $\mu_{jk}^T \neq \frac{1}{2}$ for $k = 1, \ldots, K_{s_j}$. In the remaining segments, the possible $C_j^T$ could be any value in {0, 2, 4, …}, since for $\mu_{jk}^T = \frac{1}{2}$ the possible genotype $G_{jk}^T$ of the allele at the $k$th site could be any one of {∅, PM, PPMM, …}. Next, we obtain all the normal segments $N_{tm}$ from these segments by selecting the segments whose read depth $d_{jk}^S$ at the $k$th heterozygous SNP site equals the coverage of the aligned WGS data of the tumor sample.
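The two steps above can be sketched as follows. This is an illustrative sketch rather than the method of Chu et al. (2017a): the function name `detect_normal_segments`, the tolerances `baf_tol` and `depth_tol`, and the use of the mean SNP read depth are assumptions introduced here, because real read counts never match 1/2 or the coverage exactly.

```python
def detect_normal_segments(segments, coverage, baf_tol=0.05, depth_tol=0.1):
    """Return the names of segments that look copy-neutral.

    segments : dict mapping a segment name to a list of (b, d) pairs, where
               b is the B-allele read count and d the total read count at
               one heterozygous SNP site
    coverage : coverage of the aligned tumor WGS data
    baf_tol, depth_tol : hypothetical tolerances (assumptions of this sketch)
    """
    normal = []
    for name, snps in segments.items():
        if not snps or any(d <= 0 for _, d in snps):
            continue  # no usable heterozygous SNP evidence
        # Step 1: every SNP must have a balanced BAF (close to 1/2).
        if any(abs(b / d - 0.5) > baf_tol for b, d in snps):
            continue
        # Step 2: the mean SNP read depth must match the sample coverage,
        # which excludes balanced states such as C_T = 0 or C_T = 4.
        mean_depth = sum(d for _, d in snps) / len(snps)
        if abs(mean_depth - coverage) <= depth_tol * coverage:
            normal.append(name)
    return normal
```

In this sketch a balanced amplification (for example genotype PPMM) passes the first step but is rejected by the second, because its read depth is roughly twice the coverage.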

#### The Probability Model of Subclonal Population Frequency

Figure 2 shows the probabilistic graphical model of the SCNA's subclonal population frequency. In this figure, $S$ denotes the set of all the SCNA segments; $\mathcal{N}$ denotes the set of segments that contain no SCNA. We use the same method described in Li and Xie (2015) to set the probability of BAF to obey a binomial distribution

$$b_{jk}^S \mid d_{jk}^S, \mu_{jk}^T, \phi_j \sim \text{Binomial}\left(d_{jk}^S, \hat{\mu}_{jk}\right), \tag{6}$$

where $b_{jk}^S$ denotes the number of tumor reads that contain the B allele at the $k$th heterozygous SNP locus and $d_{jk}^S$ denotes the total number of tumor reads mapped at this locus. In this figure, $G_{jk}^T$ denotes the allele's genotype at the $k$th heterozygous SNP locus in segment $s_j$.
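The binomial term of Equation 6 can be evaluated per segment as a sum of log probabilities over its heterozygous SNP loci. The sketch below is a minimal illustration with hypothetical function and argument names; it assumes that the folded average BAF of the segment has already been computed from equation set 1.

```python
import math


def baf_log_likelihood(b_counts, d_counts, mu_hat):
    """Sum over SNP loci of log Binomial(b | d, mu_hat).

    b_counts : B-allele read counts at the heterozygous SNP loci
    d_counts : total read counts at the same loci
    mu_hat   : folded average BAF of the segment, strictly between 0 and 1
    """
    if not 0.0 < mu_hat < 1.0:
        raise ValueError("mu_hat must lie strictly between 0 and 1")
    total = 0.0
    for b, d in zip(b_counts, d_counts):
        total += (math.log(math.comb(d, b))
                  + b * math.log(mu_hat)
                  + (d - b) * math.log(1.0 - mu_hat))
    return total
```

Working in log space avoids numerical underflow when a segment contains many SNP loci.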

FIGURE 2 | Bayesian network model for subclonal population frequency. In this figure, $G$ denotes the tree-structured Dirichlet process; $H$ denotes the base distribution; $\alpha$ and $\gamma$ are the scaling parameters of $G$; $\phi_j$ denotes the subclonal frequency of the SCNA in segment $s_j$; $D_j^S$ denotes the number of tumor reads mapped in segment $s_j$, while $D_j^N$ denotes the number of normal reads mapped in segment $s_j$; $C_j^T$ denotes the absolute copy number of the SCNA in segment $s_j$; $\vartheta$ denotes the geometric mean of the read count ratio of all the baseline segments $\mathcal{N}$; $C_{max}$ is the predefined maximum absolute copy number; $G_{jk}^T$ denotes the tumor genotype of the $k$th heterozygous SNP locus in the $j$th segment $s_j$; $\mu_{jk}^T$ denotes the tumor BAF of the $k$th heterozygous SNP locus in the $j$th segment $s_j$; $b_{jk}^S$ and $d_{jk}^S$ denote the number of B-allele reads and the total number of reads at the $k$th heterozygous SNP locus in the $j$th segment $s_j$.

According to Equation 4, we have the expected tumor read counts mapped to segment $j$

$$\lambda_j = \Phi^{-1}\left(\frac{\bar{C}_j}{\bar{C}_i} \times \widehat{D_i^S/D_i^N}\right) \times D_j^N, \tag{7}$$

where $\Phi^{-1}()$ denotes the reverse process of bias correction. Let $|\mathcal{N}|$ denote the number of baseline segments (Li and Xie, 2015) (in which the absolute copy number $C_j^T = 2$). We use the geometric mean of the corrected read count ratios of all the baseline segments, $\vartheta = \sqrt[|\mathcal{N}|]{\prod_{s_i \in \mathcal{N}} \widehat{D_i^S/D_i^N}}$ (equivalently, the exponential of the average log ratio), to calculate the expectation of the tumor read count, and model the tumor read count as a Poisson distribution

$$D_j^S \mid D_j^N, C_j^T, \phi_j \sim \text{Poisson}\left(\Phi^{-1}\left(\frac{\bar{C}_j}{2} \times \vartheta\right) \times D_j^N\right). \tag{8}$$
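A minimal sketch of Equation 8 is given below. The function names are hypothetical, and the inverse bias correction is represented by an identity placeholder because its actual form is defined by the bias correction software and is not reproduced here.

```python
import math


def baseline_theta(corrected_ratios):
    """Geometric mean of the corrected read count ratios of baseline segments."""
    logs = [math.log(r) for r in corrected_ratios]
    return math.exp(sum(logs) / len(logs))


def expected_tumor_reads(c_bar, theta, normal_reads, inverse_correction=lambda r: r):
    """Poisson mean of Equation 8.

    inverse_correction stands in for the reverse bias correction
    (identity here, which is an assumption of this sketch).
    """
    return inverse_correction(c_bar / 2.0 * theta) * normal_reads


def poisson_log_pmf(count, mean):
    """log of the Poisson probability mass function at `count`."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)
```

With these pieces, the read count term of the likelihood for one segment is `poisson_log_pmf(tumor_reads, expected_tumor_reads(c_bar, theta, normal_reads))`.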

It can be deduced from the first equation in equation set 1 that, for $\phi_j > 0$, $\bar{C}_j > 2 \Leftrightarrow C_j^T > 2$. Therefore, we may conclude that $\widehat{D_j^S/D_j^N} > \vartheta \Leftrightarrow C_j^T > 2$, since $\bar{C}_i$ must equal 2 if $s_i$ contains no SCNA. We set $C_j^T$ to obey the categorical distribution

$$C_j^T \sim \text{Categorical}(\varsigma(\vartheta)), \tag{9}$$

where the function $\varsigma(\vartheta)$ denotes the range of $C_j^T$: $\varsigma(\vartheta) = \{0, 1, 2\}$ if $\widehat{D_j^S/D_j^N} < \vartheta$, and $\varsigma(\vartheta) = \{2, 3, \ldots, C_{max}\}$ if $\widehat{D_j^S/D_j^N} > \vartheta$.
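The range function of Equation 9 translates directly into code. In the sketch below the function name is hypothetical, and the handling of a ratio exactly equal to the baseline value is an assumption, since the text defines only the two strict cases.

```python
def copy_number_range(corrected_ratio, theta, c_max=15):
    """Candidate absolute copy numbers for a segment.

    corrected_ratio : bias-corrected tumor/normal read count ratio
    theta           : geometric mean ratio of the baseline segments
    """
    if corrected_ratio < theta:
        return [0, 1, 2]
    if corrected_ratio > theta:
        return list(range(2, c_max + 1))
    # Assumption of this sketch: a tie is treated as copy number 2.
    return [2]
```

Restricting the candidates in this way roughly halves the number of copy number states that have to be summed over for each segment.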

The subclonal population frequency of a given mutation equals the sum of the frequencies of all the subpopulations that carry it (for details, refer to Figure S1 in the Supplementary Material), and all the subpopulation frequencies in the tumor sample sum to 1. Therefore, all the subpopulation frequencies in the tumor sample obey the Dirichlet distribution, and this Dirichlet distribution obeys the tree-structured Dirichlet process (DP) (Prescott Adams et al., 2010). Suppose there are $P$ subpopulations in a tumor sample; let $x_1, \ldots, x_P$ denote all the subpopulation frequencies

$$\langle \boldsymbol{x}\_1, \dots, \boldsymbol{x}\_P \rangle \quad \sim \text{Dirichlet}(\alpha\_1, \dots, \alpha\_P), \tag{10}$$

where $\alpha_1, \ldots, \alpha_P$ are the concentration parameters. In this paper, we set $\alpha_1 = \ldots = \alpha_P = 1$, so that Equation 10 becomes a uniform distribution over the $(P-1)$-dimensional simplex. Therefore, the prior probability of the subclonal frequency $\phi_j$ equals the probability of the tree structure. In Figure 2, $G$ denotes the tree-structured DP; $H$ denotes the base distribution; $\alpha$ and $\gamma$ are the scaling parameters of $G$.
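The relationship between subpopulation frequencies and subclonal frequencies can be sketched as follows. This is only an illustration: the function names and the dictionary-based tree representation are assumptions, and the tree itself is taken as given rather than sampled from the tree-structured DP. A flat Dirichlet draw is obtained by normalizing independent Exponential(1) variables.

```python
import random


def sample_flat_dirichlet(p, rng):
    """Draw one point from Dirichlet(1, ..., 1), i.e. uniformly on the
    (p - 1)-dimensional simplex, by normalizing Exponential(1) draws."""
    draws = [rng.expovariate(1.0) for _ in range(p)]
    total = sum(draws)
    return [d / total for d in draws]


def subclonal_frequency(node, children, freq):
    """Frequency of a mutation acquired at `node`: the frequency of that
    subpopulation plus the frequencies of all its descendants.

    children : dict mapping a node to the list of its child nodes
    freq     : dict mapping a node to its subpopulation frequency
    """
    return freq[node] + sum(
        subclonal_frequency(child, children, freq)
        for child in children.get(node, [])
    )
```

For a chain of three populations with frequencies 0.25, 0.35 and 0.40, the mutations acquired by the middle population have a subclonal frequency of 0.75, and those acquired by the last population have 0.40.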

We use MCMC to obtain the prior distribution of $\phi_j$, since the probability of the tree-structured DP cannot be expressed explicitly. We use the slice sampling method described by Prescott Adams et al. (2010) to generate the tree structure. The complete posterior probability of the subclonal population frequencies of all the SCNA segments is

$$\begin{aligned}
&\Pr\left(\left\{\phi_j\right\}_{s_j \in S \setminus \mathcal{N}} \,\middle|\, \left\{D_j^S\right\}_{s_j \in S \setminus \mathcal{N}}, \left\{\left\{b_{jk}^S\right\}_{k=1}^{K_j}\right\}_{s_j \in S \setminus \mathcal{N}}, \mathfrak{T}\right) \\
&\quad \propto \Pr\left(\left\{D_j^S\right\}_{s_j \in S \setminus \mathcal{N}}, \left\{\left\{b_{jk}^S\right\}_{k=1}^{K_j}\right\}_{s_j \in S \setminus \mathcal{N}} \,\middle|\, \left\{\phi_j\right\}_{s_j \in S \setminus \mathcal{N}}\right) \times \Pr\left(\left\{\phi_j\right\}_{s_j \in S \setminus \mathcal{N}}\right) \\
&\quad = \prod_{N \in \mathfrak{T}} \sum_{C_j^T \in \{0, 1, \ldots, C_{max}\}} \sum_{G_{jk}^T \in \zeta(C_j^T)} \sum_{\mu_{jk}^T \in \eta(G_{jk}^T)} \prod_{s_j \in N} \frac{1}{D_j^S!} \times \left(\Phi^{-1}\left(\frac{\bar{C}_j}{2} \times \sqrt[|\mathcal{N}|]{\prod_{s_i \in \mathcal{N}} \widehat{D_i^S/D_i^N}}\right) \times D_j^N\right)^{D_j^S} \\
&\qquad \times \left[ e^{-\Phi^{-1}\left(\frac{\bar{C}_j}{2} \times \sqrt[|\mathcal{N}|]{\prod_{s_i \in \mathcal{N}} \widehat{D_i^S/D_i^N}}\right) \times D_j^N} \times \prod_{k=1}^{K_j} \binom{d_{jk}^S}{b_{jk}^S} \hat{\mu}_{jk}^{\,b_{jk}^S} \left(1 - \hat{\mu}_{jk}\right)^{d_{jk}^S - b_{jk}^S} \right],
\end{aligned} \tag{11}$$

where $\mathfrak{T}$ denotes the tree structure, and $N$ denotes a node in $\mathfrak{T}$. We select the tree structure with the maximum posterior probability

$$\mathfrak{T}_{max} = \underset{\mathfrak{T}^{(i)}}{\arg\max} \Pr\left( \left\{ D_j^S \right\}_{s_j \in S \setminus \mathcal{N}}, \left\{ \left\{ b_{jk}^S \right\}_{k=1}^{K_j} \right\}_{s_j \in S \setminus \mathcal{N}} \,\middle|\, \left\{ \phi_j \right\}^{(i)}_{s_j \in S \setminus \mathcal{N}}, \mathfrak{T}^{(i)} \right), \tag{12}$$

where $\mathfrak{T}^{(i)}$ and $\{\phi_j\}^{(i)}_{s_j \in S \setminus \mathcal{N}}$ denote the tree structure and the subclonal population frequencies of the $i$th sampling process. The absolute copy number of the $i$th sampling process is

$$\begin{aligned}
\left\{C_j^T\right\}^{(i)}_{s_j \in S \setminus \mathcal{N}} = \bigcup_{N \in \mathfrak{T}^{(i)}} \underset{\left\{C_j^T\right\}_{s_j \in N}}{\arg\max} &\prod_{s_j \in N} \frac{1}{D_j^S!} \times \left(\Phi^{-1}\left(\frac{\bar{C}_j}{2} \times \sqrt[|\mathcal{N}|]{\prod_{s_i \in \mathcal{N}} \widehat{D_i^S/D_i^N}}\right) \times D_j^N\right)^{D_j^S} \\
&\times \left[ e^{-\Phi^{-1}\left(\frac{\bar{C}_j}{2} \times \sqrt[|\mathcal{N}|]{\prod_{s_i \in \mathcal{N}} \widehat{D_i^S/D_i^N}}\right) \times D_j^N} \times \prod_{k=1}^{K_j} \binom{d_{jk}^S}{b_{jk}^S} \hat{\mu}_{jk}^{\,b_{jk}^S} \left(1 - \hat{\mu}_{jk}\right)^{d_{jk}^S - b_{jk}^S} \right],
\end{aligned} \tag{13}$$

where $\{C_j^T\}^{(i)}_{s_j \in S \setminus \mathcal{N}}$ are the absolute copy numbers with the maximum posterior probability if the $i$th sampling process is the solution of Equation 12.

#### The Pipeline for Reconstructing SCNA's Subclonal Population Based on NGS Data

As shown in Figure 3, the pipeline consists of five models. The aligned sequencing data of the tumor and its paired normal, in BAM format, are used as the input of the pipeline. The SCNA segments are detected by BIC-seq (Xi et al., 2010); then the bias of the read count ratio is corrected by the correction model we previously proposed (Chu et al., 2017a). We filter out the false positive breakpoints with the algorithm proposed in this paper, and then use the probability model of subclonal population frequency proposed in this paper to infer the subclonal frequency of each SCNA segment. Finally, we use the tree structure learning algorithm (Prescott Adams et al., 2010) to reconstruct the SCNA's subclonal population.

#### RESULTS

In this section, we evaluate the performance of the probabilistic model on both simulated and real datasets and compare its performance with existing tools. Existing tools such as MixClone (Li and Xie, 2015) and THetA (Oesper et al., 2013) could not calculate the subclonal frequencies of more than three subclonal populations. Therefore, we use simulated data, which contain more than three subclonal populations, together with TCGA benchmark data to evaluate our model.

#### Results From Simulated Data

We use Pysubsim-tree (Chu et al., 2017b) to simulate a tumor's NGS read alignment data from Chromosome 21 with the evolution history configuration shown in Figure 4 and the acquired SCNA's configuration listed in Table 1. In Figure 4, each circle represents a subpopulation; the squares with character a, b, c, d, e, and f represent five SCNAs; the number on the right side of the circle is the frequency of the subpopulation.

We set the first 50 cycles of the MCMC sampling process as burn-in and use the results of the following 300 cycles to calculate the probability of the evolutionary relationship between subpopulations. We set $\alpha = 1.0$ and $\gamma = 1.0$, and set $H$ to be the uniform distribution. Figures 5A, B are dot plots of the distribution of the output of the subclonal population frequency model. Figure 5C shows the partial order plot (Jiao et al., 2014) of the evolutionary relationship obtained by the model proposed in this paper. The arrows in this figure denote the direct evolutionary relationship between two subpopulations. The width of an arrow denotes the probability that this evolutionary relationship is present in the 300 cycles of the MCMC process. Suppose $\{\mathfrak{T}_i\}_{i=1}^{I}$ denotes all the trees obtained in all the cycles of the MCMC process, and $\overrightarrow{\text{ab}}$ denotes the evolutionary relationship from subpopulation a to b. Then the probability of this evolutionary relationship is

$$\Pr\left(\overrightarrow{\text{ab}}\right) = \frac{1}{I} \left| \left\{ \mathfrak{T}\_i | \overrightarrow{\text{ab}} \subseteq \mathfrak{T}\_i, i = 1, \ldots, I \right\} \right| \;. \tag{14}$$
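Equation 14 is a simple frequency count over the sampled trees. The sketch below shows one way to compute it; the function name and the representation of a tree as a set of (parent, child) pairs are assumptions made for illustration, and the optional `burn_in` argument mirrors the practice of discarding the first cycles.

```python
def edge_probability(trees, edge, burn_in=0):
    """Fraction of post-burn-in sampled trees that contain `edge`.

    trees   : list of trees, each a set of (parent, child) tuples
    edge    : the (parent, child) relationship of interest
    burn_in : number of leading samples to discard
    """
    kept = trees[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    hits = sum(1 for tree in kept if edge in tree)
    return hits / len(kept)
```

With the settings described above, `trees` would hold 350 sampled trees and `burn_in` would be 50, so the probability is computed over the remaining 300 trees.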

According to Theorem 1, a and e have only one solution of $\phi_j$, while the others do not. The distribution of absolute copy numbers shown in Figure 5A is consistent with Theorem 1. The distribution of e's subclonal frequency is quite scattered in Figure 5B because the small subclonal frequency and the absolute copy number of e (close to normal) cause the coverage to decrease by 5%, which is almost the same as the noise. The subclonal frequencies of the other SCNAs are concentrated around the subclonal frequencies listed in Table 1. Each SCNA's absolute copy number and subclonal frequency with the maximum posterior probability are listed in Table 2. The subclonal frequencies of b and c are not correct because they have multiple solutions of subclonal frequencies according to Theorem 1, while the others are correct. The distribution of absolute copy number and subclonal frequency in Figure 5 and the results listed in Table 2 show that our SCNA probability model could correctly calculate the subclonal frequency of SCNAs.

#### Results From Breast Cancer Sequencing Data

data. In this figure, each circle denotes a subpopulation; the number on the left is its frequency; each square inside the circle denotes an SCNA; each arrow points to an offspring subpopulation.

We use the NGS data "HCC1954-spiked1-n25t35s40" and "HCC1954-spiked1-n25t55s20" (denoted as "n25t35s40" and "n25t55s20" for convenience) from The Cancer Genome Atlas (TCGA) Benchmark 4 dataset, which is publicly available at the National Cancer Institute GDC Data Portal (https://gdc.cancer.gov/resources-tcga-users/tcga-mutation-calling-benchmark-4-files), to further validate the subclonal frequency model proposed in this paper. HCC1954 is an immortal cell line derived from an invasive ductal carcinoma of the breast diagnosed in a 61-year-old woman (Bignell et al., 2007). "G15512.HCC1954.1" is the NGS data of this cell line, which contains one subclonal population with purity 0.99; however, this data has no ground truth for the absolute copy number of the SCNA regions. "HCC1954-spiked1-n25t35s40" is generated by merging 35% of "G15512.HCC1954.1" with 25% of its paired normal NGS data and 40% of "G15512.HCC1954.1" with some SCNAs randomly spiked in. Therefore, there are two subclonal populations in the tumor sample "HCC1954-spiked1-n25t35s40," and their subclonal frequencies are 75% and 40%, respectively. The ISA is invalid since each subclonal population contains multiple SCNAs; thus, we set the prior probability of the tree structure to obey a uniform distribution, and Equation 11 can be rewritten as follows:

$$\begin{aligned}
&\Pr\left(\phi_j \,\middle|\, \left\{D_j^S\right\}_{s_j \in S \setminus \mathcal{N}}, \left\{b_{jk}^S\right\}_{k=1}^{K_j}, \mathfrak{T}\right) \propto \Pr\left(\left\{D_j^S\right\}_{s_j \in S \setminus \mathcal{N}}, \left\{b_{jk}^S\right\}_{k=1}^{K_j}, \mathfrak{T} \,\middle|\, \phi_j\right) \\
&\quad = \prod_{s_j \in S \setminus \mathcal{N}} \sum_{C_j^T \in \{0, 1, \ldots, C_{max}\}} \Bigg\{ \frac{1}{D_j^S!} \times \left(\Phi^{-1}\left(\frac{\bar{C}_j}{2} \times \sqrt[|\mathcal{N}|]{\prod_{s_i \in \mathcal{N}} \widehat{D_i^S/D_i^N}}\right) \times D_j^N\right)^{D_j^S} \\
&\qquad \times e^{-\Phi^{-1}\left(\frac{\bar{C}_j}{2} \times \sqrt[|\mathcal{N}|]{\prod_{s_i \in \mathcal{N}} \widehat{D_i^S/D_i^N}}\right) \times D_j^N} \times \prod_{k=1}^{K_j} \sum_{G_{jk}^T \in \zeta(C_j^T)} \sum_{\mu_{jk}^T \in \eta(G_{jk}^T)} \binom{d_{jk}^S}{b_{jk}^S} \hat{\mu}_{jk}^{\,b_{jk}^S} \left(1 - \hat{\mu}_{jk}\right)^{d_{jk}^S - b_{jk}^S} \Bigg\}.
\end{aligned} \tag{15}$$

Figure 6 shows the subclonal frequencies obtained by the model proposed in this paper. In this figure, "P" denotes the parent subclonal population (subclonal frequency 75%) and "C" denotes the child subclonal population (subclonal frequency 40%). As shown in Figure 6, the subclonal frequencies of these two populations obtained by the model proposed in this paper are 72% and 42% for sample "n25t35s40" and 77% and 25% for sample "n25t55s20," which are the closest to the ground truth in comparison with MixClone and THetA.
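The expected subclonal frequencies follow from the mixing proportions encoded in the sample labels. The sketch below is a hypothetical helper, not part of the benchmark tooling: it assumes the label has the form n(normal)t(tumor)s(spiked) in percent, as described for "n25t35s40", and that "n25t55s20" follows the same convention, which is an assumption.

```python
import re


def expected_frequencies(sample_label):
    """Parse a label such as 'n25t35s40' into (parent, child) subclonal
    frequencies, assuming the spiked-in population descends from the
    unspiked tumor population."""
    match = re.fullmatch(r"n(\d+)t(\d+)s(\d+)", sample_label)
    if match is None:
        raise ValueError(f"unrecognized sample label: {sample_label!r}")
    normal, tumor, spiked = (int(group) for group in match.groups())
    if normal + tumor + spiked != 100:
        raise ValueError("mixing percentages must sum to 100")
    parent = (tumor + spiked) / 100.0  # every tumor cell carries parent SCNAs
    child = spiked / 100.0             # only the spiked-in cells carry child SCNAs
    return parent, child
```

Under this reading, the estimates reported above deviate from the expected values by a few percentage points at most for "n25t35s40".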

#### DISCUSSION

TABLE 1 | The SCNA's configuration for each subpopulation of the simulation data.

TABLE 2 | The results of subclonal population frequency inferring based on simulation data.

inferred by the 300 cycles of MCMC process. (C) The partial plot of the subclonal frequency.

Generally, SCNAs with a larger subclonal population frequency can be located relatively more precisely. However, because the tumor and its paired normal are sequenced in two separate procedures, the read information of genomic regions with the same copy number in the tumor sample is not exactly the same as that in its paired normal. Moreover, the lower read coverage of NGS makes noise/error more likely to be mistaken for an SCNA. As shown in Figure 7, the number of SCNA breakpoints obtained by the SCNA detection tool is proportional to the subclonal population frequency. If there is a large proportion of false negative breakpoints, the read counts in the segments cannot reveal the copy number property, which affects all read count-based SCNA analysis tools. On the other hand, if there is a large proportion of false positive breakpoints, the segment clustering step of filtering out the false positive breakpoints can reduce the data size and make the read count information more robust to noise by merging the SCNA segments with the same absolute copy number and subclonal population frequency. As shown in Theorem 1, SCNA segments with the same RCR and average B-allele frequency are indistinguishable to NGS-based SCNA analysis tools. Merging two non-adjacent SCNA segments with the same NGS properties therefore does not affect the result of the NGS-based SCNA analysis tools.

The Tree-Structured Stick Breaking (TSSB) process (Prescott Adams et al., 2010) can learn the tree structure of hierarchical data. A tree structure space can be generated by intertwining two DPs; then, as described by Prescott Adams et al. (2010), one can imagine throwing a dart (data) at the tree space and considering which node the dart hits. If we know the number of subclones $L$ in advance, then we can generate the tree structure in two steps. Step 1: generate a tree using all the data. Step 2: sort the nodes by the total size of the genome regions that hit them, select the top $L$ nodes, and throw the rest of the darts (data not in the $L$ nodes) into these $L$ nodes randomly. Figure 7 shows that the subclonal frequency affects the number of breakpoints; thus, there might be false positive or false negative breakpoints in the result of the SCNA detection tool. The false positive breakpoints can be filtered out by the algorithm in this paper. Even if false breakpoints exist, the redundant data that contain the same SCNA might hit the same node in the tree space generated by the TSSB process. Thus, the redundant data affect the time and space consumption, but theoretically do not affect the result of subclonal reconstruction.

#### CONCLUSION

In this paper, we first perform a mathematical analysis of the solution space of the SCNA's subclonal frequency. Then, based on the mathematical analysis, we propose an algorithm to filter out the false breakpoints and construct a new probability model to reconstruct the SCNA's subclonal population, which incorporates the RCR bias correction algorithm we previously proposed. We use the tree-structured stick-breaking DP (Prescott Adams et al., 2010) to generate the tree structure space of the tumor's evolutionary history. In the probability model, the BAF of the heterozygous SNP locus in the SCNA segment is modeled as a binomial distribution, and the read depth of the tumor sample is modeled as a Poisson distribution that accounts for the potential bias in the RCR. We generate the distribution of the subclonal frequency from the distribution of the subpopulation frequency, which is drawn from the tree structure space. By combining the model with the false breakpoint filtering algorithm, we construct a complete pipeline for reconstructing the SCNA's subclonal population, which is capable of inferring the SCNA's absolute copy number, its subclonal population frequency, and its evolutionary process even when there are many false positive SCNA breakpoints and the RCR is biased. The results show that the model proposed in this paper could more accurately estimate the absolute copy number of SCNA segments and their subclonal population frequencies in comparison with existing methods on both simulated data and TCGA data.

#### REFERENCES


#### DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. This data can be found here: https://gdc.cancer.gov/resources-tcga-users/tcga-mutation-calling-benchmark-4-files.

#### AUTHOR CONTRIBUTIONS

YC developed the theory and all the mathematical equations in this paper, implemented the initial version of P-SCNAClonal, and wrote the initial version of this paper. CN debugged the initial version of P-SCNAClonal, performed the experiments and collected the results, and completed the results section of this paper. YW provided the basic idea and funding support.

#### FUNDING

This work was supported by funding from the National Key R&D Program of China (No: 2016YFC1202302 and 2017YFSF090117) and the National Natural Science Foundation of China (Grant No. 61822108 and 61571152).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01374/full#supplementary-material


throughput sequencing data. Genome Biol. 11, 1. doi: 10.1186/1465-6906-11-S1-O10

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Chu, Nie and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Deep Neural Network for Identifying DNA N4-Methylcytosine Sites

#### Feng Zeng<sup>1</sup> \*, Guanyun Fang<sup>1</sup> and Lan Yao<sup>2</sup> \*

*<sup>1</sup> School of Computer Science and Engineering, Central South University, Changsha, China, <sup>2</sup> College of Mathematics and Econometrics, Hunan University, Changsha, China*

Motivation: N4-methylcytosine (4mC) plays an important role in host defense and transcriptional regulation. Accurate identification of 4mC sites provides a more comprehensive understanding of its biological effects. At present, traditional machine learning algorithms are used for 4mC site prediction, but their complexity is relatively high, which makes them unsuitable for processing large data sets, and their prediction accuracy needs to be improved. Therefore, it is necessary to develop a new and effective method to accurately identify 4mC sites.

#### Edited by:

*Liang Cheng, Harbin Medical University, China*

#### Reviewed by:

*Himel Mallick, Merck, United States Lin Meng, Ritsumeikan University, Japan*

#### \*Correspondence:

*Feng Zeng fengzeng@csu.edu.cn Lan Yao yao@hnu.edu.cn*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> Received: *22 November 2019* Accepted: *21 February 2020* Published: *06 March 2020*

#### Citation:

*Zeng F, Fang G and Yao L (2020) A Deep Neural Network for Identifying DNA N4-Methylcytosine Sites. Front. Genet. 11:209. doi: 10.3389/fgene.2020.00209*

Results: In this work, we collected a large number of 4mC sites and non-4mC sites of *Caenorhabditis elegans* (*C. elegans*) from the latest MethSMRT website, which greatly expanded the dataset of *C. elegans*, and developed a hybrid deep neural network framework named 4mcDeep-CBI to identify 4mC sites. To obtain high-dimensional feature information, we input the preliminarily extracted features into a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory network (BLSTM) to generate advanced features. Taking the advanced features as input, we propose an integrated algorithm to improve feature representation. Experimental results on the large new dataset show that the proposed predictor achieves generally better performance in identifying 4mC sites than the state-of-the-art predictor. Notably, this is the first study to identify 4mC sites using a deep neural network. Moreover, our model runs much faster than the state-of-the-art predictor.

Keywords: N4-methylcytosine, machine learning, deep neural network, CNN, BLSTM, integrated algorithm

## 1. INTRODUCTION

DNA methylation is a form of chemical modification of DNA, which alters genetic performance without altering the DNA sequence. Numerous studies have shown that DNA methylation can cause changes in chromatin structure, DNA conformation, DNA stability, and DNA-protein interactions, thereby controlling gene expression (Wang and Qiu, 2012). In many species, N-methylation would inhibit Watson-Crick hydrogen bond formation with guanosine (Fazakerley et al., 1987). The differential susceptibility of foreign DNA and self-DNA suggests that some process, such as cytosine methylation, may be affording protection to nuclear DNA (Carpenter et al., 2012). DNA methylation guided by specific methyltransferase enzymes occurs in both prokaryotes and eukaryotes. These modifications can label genomic regions to control various processes including base pairing, duplex stability, replication, repair, transcription, nucleosome localization, X chromosome inactivation, imprinting and epigenetic memory (Iyer et al., 2011; Allis and Jenuwein, 2016; O'Brown and Greer, 2016). The most widespread DNA methylation modifications are N6-methyladenine (6mA), 5-methylcytosine (5mC) and N4-methylcytosine (4mC), which have been detected in both prokaryotic and eukaryotic genomes (Fu et al., 2015; Blow et al., 2016; Chen et al., 2017). These modifications are catalyzed by specific DNA methyltransferases (DNMTs) that transfer a methyl group to specific exocyclic amino groups (He et al., 2018). In eukaryotes, 5mC is the most common DNA modification, which is essential for gene regulation, transposon suppression and gene imprinting (Suzuki and Bird, 2008). Because the levels of 6mA and 4mC are very low in eukaryotes, they can only be detected there by high-sensitivity techniques. In prokaryotes, 6mA and 4mC are the predominant modifications and are mainly used to distinguish host DNA from exogenous pathogenic DNA (Heyn and Esteller, 2015), and 4mC controls DNA replication and corrects DNA replication errors (Cheng et al., 1995; Wei et al., 2018). Moreover, 4mC, as part of a restriction-modification (R-M) system, prevents restriction enzymes from degrading host DNA (Schweizer et al., 2008; Wei et al., 2018).

Although extensive studies have been conducted on 5mC and 6mA modifications, studies on 4mC are relatively limited due to the lack of effective experimental methods and of large amounts of data. Single-molecule real-time (SMRT) sequencing technology can detect 4mC, 5mC, and 6mA base modifications (Ecker, 2010; Flusberg et al., 2010; Clark et al., 2013; Davis et al., 2013). However, SMRT sequencing is costly, which makes it poorly suited to analyzing many species. Recently, Yu et al. (2015) proposed a method for the determination of methylcytosine in genomic DNA by 4mC-Tet-assisted bisulfite sequencing, which can accurately generate a genome-wide, single-base resolution map of 4mC and identify the 4mC motif associated with the bacterial R-M system. Biological experiments are nevertheless laborious and expensive when performed genome-wide. Therefore, it is necessary to develop a computational method for identifying 4mC sites.

So far, only four methods exist for identifying 4mC sites: iDNA4mC, 4mCPred, 4mcPred-SVM and 4mcPred-IFL, all of which adopt an SVM model. The four predictors are designed to predict 4mC sites directly from sequences. The first 4mC site predictor, iDNA4mC (Chen et al., 2017), encodes DNA sequences using nucleotide chemical properties and frequency and was tested across different species. The experimental results show that iDNA4mC achieved initial success in identifying 4mC sites; however, its main drawback is low predictive power. The second 4mC site predictor, 4mCPred (He et al., 2018), proposes a new feature coding algorithm that combines position-specific trinucleotide propensity and electron-ion interaction pseudopotentials, which improves prediction accuracy. The third 4mC site predictor, 4mcPred-SVM (Wei et al., 2018), introduces more useful sequence features and improves the feature representation capability through a two-step feature selection method. However, its performance did not improve much. Recently, Wei et al. (2019) proposed the fourth 4mC site predictor, 4mcPred-IFL, which uses an iterative feature representation algorithm to learn probabilistic features from different sequential models and enhance feature representation in a supervised iterative manner. However, the complexity of 4mcPred-IFL is very high: when the dataset is large, it takes a long time to obtain the results. Moreover, the prediction accuracy of 4mcPred-IFL can be improved further.

In this work, we developed a deep learning framework called 4mcDeep-CBI to identify 4mC sites. Deep learning methods are widely used in hot spot prediction for protein-protein interfaces (Pan et al., 2018; Wang et al., 2018; Deng et al., 2019; Liu et al., 2019), but we have not found any work applying deep learning to 4mC site prediction; all previous studies have used SVM machine learning methods. To our knowledge, this work is the first study of 4mC sites using deep learning. In particular, we have greatly expanded the dataset used to evaluate prediction models for 4mC sites. Experimental results demonstrate that 4mcDeep-CBI performs better than other models. The contributions of our work can be summarized as follows.


### 2. MATERIALS AND METHODS

### 2.1. Datasets

We obtained genome samples of Caenorhabditis elegans (C. elegans) from the latest MethSMRT website and found a large number of 4mC sites and non-4mC sites, all with a sequence length of 41 bp. Each 4mC sequence sample has several indicators: position, coverage, IPDRatio (inter-pulse duration ratio), frac, fracLow, fracUp, and identificationQv. To construct a reliable, high-quality dataset, we took the following two steps. First, as stated in the Methylome Analysis Technical Note, the Modification QV (modQV) score indicates that the IPD ratio is significantly different from the expected background. Since a modQV score of 30 is the default threshold for calling a position as modified, we removed the samples with a modQV score below 30. Second, as elaborated in a previous study (Chou et al., 2015), if training and testing are conducted on a biased dataset that contains highly similar sequences, the reported accuracy may be overestimated. To eliminate redundancy and minimize this bias, the CD-HIT software (Fu et al., 2012) with the cutoff threshold set at 80% was used to remove sequences with high sequence similarity. After these two steps, we obtained 15,639 samples in C. elegans.

We combined the new samples with the C. elegans benchmark dataset (Ye et al., 2017) used in previous works to form a new dataset with 18,747 samples. Some of the new samples we extracted may be similar to samples in the previous benchmark dataset. Therefore, we used the CD-HIT software again to remove samples with high sequence similarity. The final C. elegans dataset has 17,808 samples, of which 11,173 are positive and 6,635 are negative. The positive samples are the sequences centered on functional 4mC sites detected by the SMRT sequencing technology, while the negative samples are the sequences with a cytosine in the center that was not detected as 4mC (Wei et al., 2019). The new dataset can be downloaded from our GitHub repository; the download link is given in section 3.
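The way positive and negative samples are defined above can be illustrated with a minimal sketch. It assumes a single forward-strand genome string and a set of 0-based positions already called as 4mC; the function name `centered_windows` and the toy genome are hypothetical, and quality filtering and CD-HIT redundancy removal are not shown.

```python
def centered_windows(genome, methylated_positions, length=41):
    """Return (window, label) pairs for every cytosine with a full window.

    label is 1 when the central cytosine is listed in methylated_positions
    (a positive sample) and 0 otherwise (a negative sample).
    """
    half = length // 2
    samples = []
    for i in range(half, len(genome) - half):
        if genome[i] != "C":
            continue
        window = genome[i - half : i + half + 1]
        label = 1 if i in methylated_positions else 0
        samples.append((window, label))
    return samples


# Toy usage with a 7-bp window so the example stays readable;
# the dataset described above uses 41-bp windows.
toy_genome = "AACGTCATCGCA"
toy_samples = centered_windows(toy_genome, methylated_positions={5}, length=7)
```

Cytosines too close to either end of the sequence are skipped because a full window cannot be cut around them.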

### 2.2. Model of 4mcDeep-CBI

#### 2.2.1. Preliminary Feature Extraction

We use the eight features mentioned in Chen et al. (2017), He et al. (2018), Wei et al. (2018), and Wei et al. (2019) as preliminary features. Each feature is obtained by encoding a different kind of sequence information with a sequence feature representation algorithm. The features are BKF (Binary and k-mer frequency), DBPF (Dinucleotide binary profile and frequency), KNN (K-Nearest Neighbor), PCP (Physical-Chemical Properties), MMI (Multivariate Mutual Information), PseDNC (Pseudo dinucleotide composition), PseEIIP (Electron-ion interaction pseudopotentials of trinucleotide) and RFHCP (Ring-function-hydrogen-chemical properties). The related feature extraction methods can be found in Wei et al. (2019).
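To give a feel for what such an encoding looks like, the sketch below combines a binary (one-hot) profile with k-mer frequencies, in the spirit of BKF. It is an illustration only: the exact BKF definition is given in Wei et al. (2019) and may differ, the function names are hypothetical, the resulting dimension does not match the one used in the paper, and the input is assumed to contain only A, C, G, and T.

```python
from itertools import product

BASES = "ACGT"


def one_hot(seq):
    """Binary profile: 4 bits per nucleotide, in A, C, G, T order."""
    bits = []
    for base in seq:
        bits.extend(1 if base == b else 0 for b in BASES)
    return bits


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i : i + k]] += 1
    return [counts[km] / total for km in kmers]


def binary_and_kmer_features(seq, k=2):
    """Concatenate the binary profile and the k-mer frequency vector."""
    return one_hot(seq) + kmer_frequencies(seq, k)


features = binary_and_kmer_features("ACGTC", k=2)
```

For a sequence of length n this yields 4n binary entries followed by 4^k frequency entries.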

#### 2.2.2. 4mcDeep-CBI Network

As shown in **Figure 1**, 4mcDeep-CBI consists of a 3-CNN layer, a BLSTM layer, a fully connected layer, and a sigmoid classifier. The input of 4mcDeep-CBI is one of the eight preliminary features. First, the preliminary feature is fed to the 3-CNN layer, which contains a convolution layer, a ReLU activation function and a max pooling operation. Next, the output of the 3-CNN layer is passed to the BLSTM layer to obtain an advanced feature. With the eight preliminary features as inputs, we obtain eight advanced features, respectively. Each advanced feature (a matrix) is then converted to a one-dimensional feature (a vector) using the flatten function and connected to the fully connected layer. The last layer is the sigmoid layer, which produces the advanced probability feature and the prediction result of the first step. In the end, we obtain an eight-dimensional feature, which is the input of the integrated algorithm.

#### **2.2.2.1. Convolutional neural network (CNN)**

CNN has a powerful ability to extract abstract features, which is not only suitable for image processing, but also for natural language processing tasks. It consists of convolution, activation, and max-pool layers.

In the model design, since our experiments showed that the model with 3 CNN layers has the best performance, we employ 3-CNN as an advanced feature extractor whose input is the preliminary feature extracted from DNA sequences. We first feed each preliminary feature into the 3-CNN layer and set the weighting parameters of the convolution filters. The convolution layer then outputs the matrix inner product between the input preliminary feature and the filters. After convolution, a rectified linear unit (ReLU) (Nair et al., 2010) is applied to sparsify the output of the convolution layer: it clamps all negative values to zero, which introduces non-linearity at low computational cost and helps to mitigate vanishing gradients and over-fitting. Finally, a max pooling operation reduces the dimensionality and over-fitting by taking the maximum value in a fixed-size sliding window. The output of the convolution module is represented by the following expression:

$$O_c = \mathrm{Pool}\left(\mathrm{ReLU}\left(\mathrm{Conv}(S)\right)\right),$$

where O<sub>c</sub> is the output tensor and S is the input preliminary feature of the sequence. Taking BKF as an example, the dimension of S is 1 × 500 × 1 (input\_shape). The nb\_filter values of the three CNN layers are 16, 32, and 64, respectively, and the filter\_length is 8 in all three layers. The max pooling size is 2. Therefore, the dimension of O<sub>c</sub> is 1 × 223 × 64.
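The composition of convolution, ReLU, and max pooling in the expression above can be traced on a toy signal. The sketch below is a single-channel, single-filter illustration in plain Python; it assumes no padding, stride 1, and non-overlapping pooling, none of which is stated in the text, so it is not meant to reproduce the tensor shapes quoted above.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1-D convolution in the CNN sense (cross-correlation)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Clamp negative values to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i : i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


def conv_module(signal, kernel, pool_size=2):
    """Pool(ReLU(Conv(signal))) for one filter."""
    return max_pool1d(relu(conv1d(signal, kernel)), pool_size)


signal = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 4.0]
kernel = [1.0, 0.0, -1.0]
output = conv_module(signal, kernel)
```

Stacking three such modules, each with several filters, gives the 3-CNN feature extractor described in this section.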

#### **2.2.2.2. Long short term memory networks (LSTM)**

LSTM is a recurrent neural network (RNN) architecture published in 1997 (Hochreiter and Schmidhuber, 1997). Compared with traditional RNNs, an LSTM network is well suited to learning from experience to classify, process and predict time series, and it has advantages in dealing with long-term dependencies. In particular, a bidirectional LSTM captures the dependence of features in both directions and concatenates the outputs of the individual directions, which helps to mine deeper information in the features:

$$O_r = \mathrm{BiLSTM}(O_c),$$

where O<sub>r</sub> is the output of the BLSTM layer and also the advanced feature of the sequence, and O<sub>c</sub> is the feature matrix of a sequence obtained by the 3-CNN layer. An LSTM contains a forget gate layer, an input gate layer and an output gate layer. When the LSTM traverses each element of the input, the forget gate layer first determines, based on the previous input, what information to discard. The input gate layer then determines what information should be stored and updates the current state value. Finally, the output gate layer determines which part of the state is emitted as output (Pan and Shen, 2018).
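The gate mechanics described above can be sketched with the standard LSTM update equations. The example below uses a scalar input and a scalar hidden state to stay readable; the parameter layout, function names, and weight values are illustrative assumptions rather than the configuration used in 4mcDeep-CBI.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""

    def gate(name, activation):
        w_x, w_h, b = p[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to store
    g = gate("candidate", math.tanh)   # the candidate new information
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_lstm(xs, p):
    """Run the cell over a sequence and collect the hidden state per step."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs


def run_bilstm(xs, p_forward, p_backward):
    """Concatenate forward and backward hidden states at every position."""
    forward = run_lstm(xs, p_forward)
    backward = run_lstm(xs[::-1], p_backward)[::-1]
    return list(zip(forward, backward))


params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, -0.2, 0.1),
    "candidate": (1.0, 0.3, 0.0),
    "output": (0.4, 0.2, 0.0),
}
xs = [0.2, -0.5, 0.9, -0.5, 0.2]
out = run_bilstm(xs, params, params)
```

In a trained bidirectional layer the two directions have separate weights; they are shared here only to keep the toy example short.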

### 2.3. Integrated Algorithm Model

In the integrated algorithm model, six machine learning algorithms are involved: K-nearest neighbor, logistic regression, support vector machine, naive Bayes, decision tree, and random forest. With the 8-D advanced feature of the sequence as the input, we run these six algorithms to predict the labels and keep the best result. The obtained probability value is then combined with the previous 8-D advanced feature vector to form a new 9-D feature vector. Next, the 9-D feature vector is fed into the integrated algorithm model for a new iteration. This process is repeated until performance converges. In each iteration, the multi-dimensional input features are used for training, the best-performing algorithm is selected to obtain a one-dimensional probability feature, and the input and output features are merged into a new feature vector that has one more dimension than the input and serves as the input for the next iteration. For example, suppose that the vectors f1, f2, . . . , f8 are the advanced features obtained by the previous processing; with (f1, f2, . . . , f8) as the algorithm input, we get the result vector f9. Then, (f1, f2, . . . , f8, f9) is the algorithm input of the next iteration. After 5 iterations, we get (f1, f2, . . . , f8, f9, f10, f11, f12, f13), which is the feature matrix for the following processing. In the experiment, the algorithm reaches convergence in fewer than 10 iterations, as shown in section 3.
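The iterative procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the source: two simple stand-in learners replace the six algorithms named above, the "best" learner is chosen by training accuracy, and a fixed iteration count replaces the convergence check. All function names and the toy data are hypothetical.

```python
def fit_mean_score(X, y):
    # Stand-in learner: score a sample by the mean of its features.
    return lambda row: sum(row) / len(row)


def fit_nearest_centroid(X, y):
    # Stand-in learner: soft score from distances to the two class centroids.
    def centroid(label):
        rows = [r for r, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def predict(row):
        d0 = sum((a - b) ** 2 for a, b in zip(row, c0)) ** 0.5
        d1 = sum((a - b) ** 2 for a, b in zip(row, c1)) ** 0.5
        return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5

    return predict


def accuracy(predict, X, y):
    hits = sum((predict(r) >= 0.5) == (t == 1) for r, t in zip(X, y))
    return hits / len(y)


def iterate_features(X, y, learners, n_iterations):
    """Append one probability column per iteration, from the best learner."""
    X = [list(r) for r in X]
    history = []
    for _ in range(n_iterations):
        fitted = [fit(X, y) for fit in learners]
        best = max(fitted, key=lambda p: accuracy(p, X, y))
        history.append(accuracy(best, X, y))
        X = [r + [best(r)] for r in X]
    return X, history


# Toy 8-D probability features for three positive and three negative samples.
X0 = [
    [0.9] * 8,
    [0.8] * 8,
    [0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.6, 0.8],
    [0.1] * 8,
    [0.2] * 8,
    [0.3, 0.1, 0.4, 0.2, 0.3, 0.1, 0.4, 0.2],
]
y0 = [1, 1, 1, 0, 0, 0]
X5, acc_history = iterate_features(
    X0, y0, [fit_mean_score, fit_nearest_centroid], n_iterations=5
)
```

Starting from 8 columns, five iterations produce 13 columns, matching the (f1, . . . , f13) example in the text. In practice the selection would more plausibly rely on held-out or cross-validated accuracy than on training accuracy.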

### 2.4. Deep Learning Model

For the last part of 4mcDeep-CBI, a general neural network model is used to get the optimal solution. The neural network can have 2–4 intermediate layers, each with its own activation function. In our experiment, we used two intermediate layers, each with the ReLU activation function, followed by a sigmoid output layer. We found that feeding the advanced feature matrix obtained by the integrated algorithm into the neural network model further improves the accuracy.
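A forward pass through such a network, with two ReLU intermediate layers and a sigmoid output, can be sketched in a few lines. The layer sizes and weight values below are arbitrary placeholders chosen for illustration, and training is not shown; in the model described above the input would be the feature vector produced by the integrated algorithm.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, params):
    """Two ReLU intermediate layers followed by a single sigmoid output unit."""
    h1 = relu(dense(features, params["w1"], params["b1"]))
    h2 = relu(dense(h1, params["w2"], params["b2"]))
    (logit,) = dense(h2, params["w3"], params["b3"])
    return sigmoid(logit)


# Placeholder parameters for a 3 -> 2 -> 2 -> 1 network.
toy_params = {
    "w1": [[0.5, -0.5, 1.0], [1.0, 1.0, -1.0]],
    "b1": [0.0, 0.1],
    "w2": [[1.0, -1.0], [0.5, 0.5]],
    "b2": [0.0, 0.0],
    "w3": [[1.0, 1.0]],
    "b3": [-0.5],
}
probability = forward([0.9, 0.2, 0.8], toy_params)
```

The output is a probability in (0, 1); a threshold such as 0.5 would turn it into a 4mC / non-4mC call.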

### 2.5. Performance Evaluation

For performance evaluation, we used the following five commonly used metrics: Sensitivity (SN), Specificity (SP), Accuracy (ACC), Matthews Correlation Coefficient (MCC) (Wei et al., 2019) and Area Under the ROC Curve (AUC). The definition of each evaluation metric is as follows:

$$\begin{aligned} \text{SN} &= \frac{TP}{TP + FN}, \\ \text{SP} &= \frac{TN}{TN + FP}, \\ \text{ACC} &= \frac{TP + TN}{TP + TN + FN + FP}, \\ \text{MCC} &= \frac{TP \ast TN - FP \ast FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}. \end{aligned}$$

where TP indicates that the actual result is a positive sample, and the predicted result is also a positive sample; TN indicates that the actual result is a negative sample, and the predicted result is also a negative sample; FP indicates that the actual result is a negative sample, and the predicted result is a positive sample (indicating that the negative sample is predicted incorrectly); FN indicates that the actual result is a positive sample, and the prediction result is a negative sample (indicating that the positive sample is predicted incorrectly).
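The four formulas above translate directly into code. The sketch below computes them from confusion-matrix counts; the counts in the usage line are made up for illustration, and returning 0 for MCC when the denominator vanishes is a common convention rather than something stated in the text.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute SN, SP, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

Unlike ACC, MCC uses all four counts in both numerator and denominator, which makes it more informative on imbalanced datasets such as the one used here (11,173 positives versus 6,635 negatives).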

The area under the ROC curve (AUC) is a commonly used comprehensive metric. The abscissa of the ROC curve is the false positive rate and the ordinate is the true positive rate. The AUC value is the area enclosed by the ROC curve and the coordinate axis, and it lies between 0 and 1. The maximum AUC value of 1 means that the performance of the model is perfect and all positive samples are ranked above all negative samples. An AUC value of 0 means that the model performance is very poor and all prediction results are wrong.
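One way to compute the AUC without tracing the curve explicitly is its rank interpretation: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise form; it is quadratic in the number of samples and intended for illustration only, and the function name is hypothetical.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


perfect = auc_score([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
inverted = auc_score([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
partial = auc_score([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

Under this view a classifier that scores at random has an expected AUC of 0.5, which is the usual baseline for comparison.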

### 3. RESULT AND DISCUSSION

We carried out extensive experiments on the new dataset using the proposed predictor (4mcDeep-CBI) and the state-of-the-art predictor (4mcPred-IFL), and then compared the performance of the two models. The dataset and code used in the experiment have been uploaded to our GitHub repository (https://github.com/mat310/4mcDeep) and are shared with other researchers. Due to limited space, part of the experimental results is listed in the **Supplementary Material**.

### 3.1. Performance of Different Features Used in Prediction

We put the 8 preliminary features into the 3-CNN and BLSTM models to obtain advanced features. The advanced features are then passed through the sigmoid classifier to obtain the prediction result of the first step. We analyzed the predictive performance of the different feature types and compared the experimental results of 4mcPred-IFL with those of 4mcDeep-CBI. From **Figure 2**, we find that the four features BKF, DBPF, KNN, and RFHCP rank in the top four in the experimental results of both models. In addition, the performance metrics for all eight features improved in our model (the experimental results can be found in **Tables S1**, **S2**). **Figure 2** shows that our proposed model performs better than 4mcPred-IFL in the preliminary experimental results.

The experiment used three-fold cross-validation. **Figure 3** shows the acc-loss curve of AD\_BKF during the preliminary experiment (acc-loss curves of the other advanced features can be found in **Figure S1**). An epoch is one pass in which all data are sent through the network for one forward calculation and one back propagation. As can be seen from the figure, as the epoch value increases, the accuracy on the training set and the validation set increases continuously and finally converges at epoch = 5. The loss function values on the training set and the validation set decrease continuously and also converge at epoch = 5. Therefore, we set epoch = 5 to get the best experimental results. **Figure 3** illustrates that the prediction performance improves continuously and that there is no over-fitting during the experiment.

### 3.2. Performance of the Integrated Algorithm

In the previous section, we compared the experimental results of different advanced features. Here, we combine the advanced probability features obtained from the sigmoid classifier to form a matrix of 8-D probabilistic features. This matrix is input into the integrated algorithm model to obtain the experimental results. To analyze the results visually, we plot the change in ACC as the feature size grows, as shown in **Figure 4**. In the figure, the X-axis represents the number of iterations and the Y-axis represents the performance in terms of accuracy. Before the iterative operation, we have a matrix of 8-D probabilistic features. As the number of iterations increases, performance rises rapidly at first and reaches a maximum after 5 iterations, when the feature size of the matrix is 13 and ACC is 0.9274; it then gradually converges to a steady state. This suggests that the integrated algorithm model improves the feature representation and, in turn, the performance. 4mcPred-IFL adopted an iterative feature representation algorithm that reached its maximum ACC of 0.9001 after 30 iterations and then gradually converged to a stable state. The details can be found in **Figure S2**.

### 3.3. 4mcDeep-CBI vs. State-of-the-Art Predictor on Performance

Our 4mcDeep-CBI model shows the best predictive performance, achieving ACC = 0.9294, MCC = 0.8498, SN = 0.9486, SP = 0.8938, and AUC = 0.9242. To further evaluate the performance of 4mcDeep-CBI, we compared it with the state-of-the-art predictor, 4mcPred-IFL. The performances of 4mcDeep-CBI and 4mcPred-IFL are depicted in **Figures 5**, **6**. **Figure 5** illustrates the performances in terms of ACC, MCC, SN, SP, and AUC, while **Figure 6** shows the ROC curves of 4mcDeep-CBI and 4mcPred-IFL. The details of their performances can be found in **Table S3**. It can be clearly seen that 4mcDeep-CBI achieved better performance than 4mcPred-IFL in all five metrics. Our predictor improves ACC by 3.26%. It is worth noting that our predictor increased the MCC by 7.88%. MCC is essentially a correlation coefficient between the actual classification and the predicted classification, and is a relatively comprehensive metric. This shows that 4mcDeep-CBI is better than 4mcPred-IFL in terms of comprehensiveness and integrity.

The ROC curves of the two methods are shown in **Figure 6**. As can be seen from the figure, the ROC curve of 4mcDeep-CBI is closer to the upper left corner and has the larger area under the curve, which is 4.35% larger than that of 4mcPred-IFL. In summary, the above results illustrate that 4mcDeep-CBI performs better than 4mcPred-IFL and can effectively improve the accuracy of identifying 4mC sites.

### 3.4. 4mcDeep-CBI vs. State-of-the-Art Predictor on Running Time

The main modules of 4mcPred-IFL and 4mcDeep-CBI account for a large proportion of the running time of their respective models. The main module of 4mcPred-IFL is the step that obtains the preliminary experimental results by putting the extracted preliminary features into the SVM model. The main module of 4mcDeep-CBI is the step that obtains the preliminary experimental results by putting the extracted preliminary features into the deep learning model. To explore the operational efficiency of the models, we ran the main modules of 4mcPred-IFL and 4mcDeep-CBI separately on the same server, using BKF as the example preliminary feature. Experiments were carried out with different sample sizes, and the results are shown in **Table 1**. 4mcPred-IFL employed Sequential Forward Search (SFS) to determine the optimal feature subset. In **Table 1**, "SVM\_10" means that the SFS distance is 10, and "SVM\_50" means that the SFS distance is 50. The smaller the distance setting, the greater the chance of better experimental results, and the longer the experiment runs. In addition, when the distance ranges from 10 to 50, the optimal subset of features can be obtained. As **Table 1** shows, our model runs much faster than the state-of-the-art predictor. For 16,000 samples, 4mcDeep-CBI needs only 48.2 min, whereas 4mcPred-IFL takes 3261.3 min even with the distance set to 50, a running time more than 50 times longer than ours. Moreover, as the number of samples increased, the running time of 4mcDeep-CBI grew more slowly than that of 4mcPred-IFL. There are at least two reasons: (1) Obtaining the optimal feature set with the SFS method, as 4mcPred-IFL does, is very inefficient. (2) The SVM model used by 4mcPred-IFL has two important parameters (the penalty parameter C and the kernel parameter γ), and 4mcPred-IFL spends a lot of time calling the SVM algorithm repeatedly to optimize them with the grid search method. Consequently, the complexity of the 4mcPred-IFL model is much higher than that of our proposed model.

TABLE 2 | ACC of 4mcDeep-CBI with 4 CNN layers under different parameters.

### 3.5. Impact of Different CNN Layers on 4mcDeep-CBI

The proposed model 4mcDeep-CBI has three CNN layers, which efficiently extract features from the input data. In this experiment, we measure the accuracy of 4mcDeep-CBI for each number of CNN layers and compare the performance across depths. For the RFHCP feature, **Table 2** shows the experimental results of 4mcDeep-CBI with 4 CNN layers. Parameters are set as batch\_size = 32, 64, 128, 256; maxpool1D = 1, 2, 3; learning rate = 0.001, 0.005, 0.0001; dropout ratio = 0.1, 0.2, 0.5. **Table 2** shows that the maximum ACC value is 90.17% when 4mcDeep-CBI has 4 CNN layers. Similarly, we ran experiments with other numbers of CNN layers (2, 3, 5, and 7). The experimental results are shown in **Figure 7**. As can be seen from **Figure 7**, the maximum ACC value is 90.57% when 4mcDeep-CBI has 3 CNN layers. The other features give the same result. The experiment therefore verifies that the 3-CNN layer model has the best performance, which is why we chose 3 CNN layers in the model design of 4mcDeep-CBI.
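The parameter values listed above define a search grid. The sketch below enumerates it with a generic grid-search loop; whether every combination was actually evaluated is not stated in the text, and `placeholder_evaluate` is a hypothetical stand-in for training the network and returning its cross-validated ACC.

```python
from itertools import product

grid = {
    "batch_size": [32, 64, 128, 256],
    "maxpool1D": [1, 2, 3],
    "learning_rate": [0.001, 0.005, 0.0001],
    "dropout": [0.1, 0.2, 0.5],
}


def grid_search(grid, evaluate):
    """Return (best_score, best_config) over the Cartesian product of the grid."""
    names = list(grid)
    best = None
    for values in product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if best is None or score > best[0]:
            best = (score, config)
    return best


def placeholder_evaluate(config):
    # Hypothetical stand-in for "train the model and return its ACC".
    return -abs(config["dropout"] - 0.2) - abs(config["batch_size"] - 64) / 1000


n_configs = len(list(product(*grid.values())))
best_score, best_config = grid_search(grid, placeholder_evaluate)
```

With 4 × 3 × 3 × 3 values, a full grid contains 108 configurations per network depth, which gives a sense of the training cost behind each point in **Figure 7**.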

### 4. CONCLUSION

In this paper, we propose a deep neural network named 4mcDeep-CBI, which further boosts the performance of identifying 4mC sites. Moreover, we collected a large number of 4mC sites and non-4mC sites of C. elegans from the latest MethSMRT website, which greatly expanded the C. elegans dataset. The proposed model 4mcDeep-CBI uses 3-CNN and BLSTM modules to mine deep information in the features and obtain advanced features. By experimental comparison with the state-of-the-art predictor, we found that our proposed framework performed better and did not show over-fitting. In addition, we have proposed an integrated algorithm to generate informative features. By analyzing the accuracy of the model during the iterative process, we find that the integrated algorithm steadily improves the performance of the model. Finally, we evaluated our proposed 4mcDeep-CBI against the state-of-the-art predictor, and the results demonstrate that our model achieves better performance in identifying 4mC sites and runs more efficiently. We hope that 4mcDeep-CBI can be a useful bioinformatics tool for identifying 4mC sites and promoting DNA methylation analysis.

Deep learning is an important approach to sequence analysis. For feature selection, future work could use the popular word embedding training method Word2Vec, combined with the secondary structure of DNA, to predict 4mC sites. Moreover, the sequence length provided by the MethSMRT website is 41 bp, and further research will need longer DNA sequence fragments, such as 80, 100, and 150 bp.

### DATA AVAILABILITY STATEMENT

The dataset and code used in the experiment have been uploaded to our GitHub (https://github.com/mat310/4mcDeep).

### AUTHOR CONTRIBUTIONS

FZ and GF designed the model and experiments and wrote the paper. GF performed the experiments. LY analyzed the data, provided suggestions to improve the performance, and contributed the materials and analysis tools.

#### ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their constructive comments. This work is supported in part by the National Science Foundation of China (Grant No. 61502159).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00209/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Zeng, Fang and Yao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# GBDTL2E: Predicting lncRNA-EF Associations Using Diffusion and HeteSim Features Based on a Heterogeneous Network

#### Jiaqi Wang, Zhufang Kuang\*, Zhihao Ma and Genwei Han

*School of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha, China*

Interactions between genetic factors and environmental factors (EFs) play an important role in many diseases. The long non-coding RNA (lncRNA) is an important non-coding RNA that regulates life processes. The ability to predict the associations between lncRNAs and EFs is of practical significance. However, recent methods for predicting lncRNA-EF associations rarely use the topological information of heterogeneous biological networks, or they simply treat all objects as the same type without considering the different and subtle semantic meanings of various paths in the heterogeneous network. To address this issue, this paper proposes GBDTL2E, a method based on the Gradient Boosting Decision Tree (GBDT) to predict the associations between lncRNAs and EFs. GBDTL2E integrates structural information and heterogeneous networks, combines HeteSim features and diffusion features through multi-feature fusion, and uses the GBDT machine learning algorithm to predict lncRNA-EF associations based on heterogeneous networks. The experimental results demonstrate that the proposed algorithm achieves high performance.

#### Edited by:

*Liang Cheng, Harbin Medical University, China*

#### Reviewed by:

*Jingpu Zhang, Henan University of Urban Construction, China Hui Liu, Changzhou University, China*

#### \*Correspondence:

*Zhufang Kuang zfkuangcn@163.com*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> Received: *23 November 2019* Accepted: *06 March 2020* Published: *15 April 2020*

#### Citation:

*Wang J, Kuang Z, Ma Z and Han G (2020) GBDTL2E: Predicting lncRNA-EF Associations Using Diffusion and HeteSim Features Based on a Heterogeneous Network. Front. Genet. 11:272. doi: 10.3389/fgene.2020.00272*

Keywords: long non-coding RNA, environmental factor, heterogenous network, HeteSim score, gradient boosting decision tree, random walk with restart

## 1. INTRODUCTION

An environmental factor (EF) is a biological or non-biological factor that affects a living organism. Non-biological factors include physical factors, chemical factors, and social factors. Biological factors include parasites and viruses. Many studies have demonstrated that Gene-Environment (G–E) interactions play an important role in the etiology and progression of many complex diseases (Xu et al., 2019). Alzheimer's disease (AD), for example, arises from many intertwined factors, including environmental factors (Eid et al., 2019). Moreover, fetal death and coronary heart disease (CHD) could also be caused by G–E interactions (Moreau et al., 2019).

According to the central dogma of molecular biology, genetic information is mainly stored in DNA sequences. Genetic information is transcribed from DNA into RNA, which is then translated into proteins. Genome sequence analysis shows that protein-coding sequences account for about 2% of the human genome, while the remaining 98% are non-protein-coding sequences (Bertone et al., 2004). RNAs that do not code for proteins are called non-coding RNAs (ncRNAs). Among them, ncRNAs with a length between 200 and 100,000 nt are called long non-coding RNAs (lncRNAs), and these play an important role in the understanding of life sciences (Deng et al., 2018). LncRNAs are significant in many respects, such as cellular biological processes and the regulation of gene expression at the transcriptional and post-transcriptional levels (Zhang Z. et al., 2019).

There are many studies on the biological mechanism and interaction between genes, microRNAs (miRNAs), lncRNAs, EFs, and diseases, such as the relationship between genes and diseases, miRNAs and diseases, lncRNAs and diseases, miRNAs and EFs, etc. Among them, microRNA (miRNA) is a kind of non-coding RNA that has only about 21–25 nucleotides (Deng et al., 2019b).

For the association between genes and diseases, a data synthesis platform based on gene variation and gene expression was established by Luo et al. This method applies network analysis to predict the interaction between genes and diseases (Luo Z. et al., 2018). The recent advances in predicting gene–disease associations have been reviewed by Opap and Mulder (2017). Understanding the association between genetics and disease is an important step in understanding the etiology of diseases. There are many other studies on the association between genes and diseases; due to space limitations, only a few are introduced here.

For the association between miRNAs and diseases, KBMF-MDI was proposed by Lan et al. KBMF-MDI predicts the association between miRNAs and diseases based on their similarities (Lan et al., 2018). A method based on dynamic neighborhood regularized logistic matrix factorization (DNRLMF-MDA) was proposed by Yan et al. (2017). The IMCMDA (Chen et al., 2018), an inductive matrix completion model, was subsequently proposed by Chen et al. A new computational model, called heterogeneous graph convolutional network (HGCNMDA) (Li et al., 2019), was presented by Li et al., and another method, the double Laplace regularization matrix completion model (DLRMC), was proposed by Tang et al. (2019). These studies have shown that computational models can effectively predict potential miRNA-disease associations and facilitate verification experiments by biological researchers.

For the association between lncRNAs and diseases, a method to predict the association between human lncRNAs and diseases based on a random walk over the global network was proposed by Gu et al. (2017). The BRWLDA proposed by Yu et al. predicts lncRNA-disease associations based on a double random walk over heterogeneous networks (Yu et al., 2017). A global network-based framework named LncRDNetFlow (Zhang J. et al., 2019) was proposed by Zhang et al. LncRDNetFlow utilizes a flow propagation algorithm to predict lncRNA-disease associations. The calculation method LDASR was proposed by Guo et al. (2019). LDASR analyzes the known relationships between lncRNAs and diseases to identify new ones. A bipartite graph network based on the known lncRNA-disease associations was constructed by Ping et al. (2018), and a bilateral sparse self-representation (TSSR) algorithm was proposed by Ou-Yang et al. (2019) to predict lncRNA-disease associations. A new method of lncRNA-disease-gene tripartite mapping (TPGLDA) was proposed by Ding et al. to predict lncRNA-disease associations by combining gene-disease and lncRNA-disease associations (Ding et al., 2018). A new estimation method for latent factor mixed models (LFMMs) was constructed by Caye et al. (2019), and the model is implemented in the updated version of the corresponding computer program. The ILDMSF is a novel framework that was proposed by Chen et al. (2020). Furthermore, a method named LDAH2V (Deng et al., 2019a) was proposed by Deng et al., in which HIN2Vec is used to calculate the meta-path and feature for each lncRNA-disease pair in the heterogeneous networks.

For the association between miRNAs and EFs, MiREFRWR was proposed by Chen et al.; it uses the Random Walk with Restart algorithm in a complex network to predict interactions (Chen, 2016). Luo et al. proposed the MEI-BRWMLL method (Luo H. et al., 2018) to reveal the relationships between miRNAs and EFs. In this approach, multi-label learning and double random walk are used to predict the associations between miRNAs and EFs. These studies provide guidance for the analysis of complex diseases and of the association between miRNAs and EFs in clinical trials (Chen et al., 2012; Qiu et al., 2012).

With the application of computing technology in the field of biology, more and more public biological databases have also been established, such as HMDD (Huang et al., 2018), miR2Disease (Jiang et al., 2008), DrugCombDB (Liu et al., 2020), and gutMDisorder (Cheng et al., 2020).

The development of genomics and bioinformatics has facilitated the identification of lncRNAs. LncRNAs have also been found to interact with various EFs, such as chemicals, smoking, and air pollution (Flynn and Chang, 2014), and these lncRNAs and EFs may be the cause of some diseases (Chen and Yan, 2013). However, compared with protein-coding genes and miRNAs, fewer bioinformatics and computational methods exist for studying the association between lncRNAs and EFs, and the existing ones are less effective. Based on the random walk with restart model, Zhou et al. (2014) designed the RWREFD method and a database of lncRNA-EF associations, LncEnvironmentDB. Zhou and Shi (2018) designed a method based on a bipartite network and a resource transfer algorithm to predict lncRNA-EF associations. Vural and Kaya (2018) used the KATZ measure and Gaussian interaction profile kernel similarity to predict new potential associations between lncRNAs and EFs. Xu (2018) proposed three computational models that predict the relationship between lncRNAs and EFs from the Gaussian interaction profile similarity of lncRNAs and EFs; they are based on the Laplacian regularized least squares method, the KATZ method, and the double random walk algorithm, respectively. These studies show that computational approaches can speed up the discovery of associations and reduce its cost.

However, the aforementioned studies for predicting the association between disease-related lncRNAs and EFs usually use traditional similarity search methods, which focus on measuring the similarity between objects of the same type. These methods simply treat all objects as the same type and do not consider the different semantic meanings of different paths in the heterogeneous network, which reduces the accuracy and persuasiveness of the results. In this paper, we propose a high-performance method to predict the association between lncRNAs and EFs based on heterogeneous networks. The proposed method integrates the structural information and heterogeneous networks, combines the HeteSim features and the diffusion features as data features, and uses the GBDT algorithm as the prediction model. HeteSim is a path-based measure in heterogeneous networks that can quantify the relationship between objects of the same or different types. To our knowledge, HeteSim has not previously been used to predict the association between lncRNAs and EFs; this is the first time that it is integrated as a fusion feature in the feature extraction step for this task. GBDT, the ensemble learning method used in the proposed algorithm, achieved higher accuracy than the other algorithms in our experiments. It is also the first time that GBDT is used to investigate the association between lncRNAs and EFs. On the one hand, the proposed method provides an efficient computational approach for mining the association between lncRNAs and EFs, which saves manpower and material resources. On the other hand, it helps biologists to explore the influence of environmental factors on diseases.

The rest of the paper is organized as follows: the materials and methods are presented in section 2, the experimental results and evaluation are discussed in section 3, and the paper is concluded in section 4.

#### 2. MATERIALS AND METHODS

The data used in this experiment were downloaded from the DLREFD database (Sun et al., 2017). The data include 475 lncRNAs and 152 environmental factors. After the duplicate data were removed, the number of correlations between lncRNAs and EFs was 735. The set of lncRNAs and the set of EFs are shown in **Supplementary Material**.

In this section, we propose GBDTL2E, a method based on the Gradient Boosting Decision Tree (GBDT) to predict the association between lncRNAs and EFs. GBDTL2E integrates the structural information and heterogeneous networks, combines the HeteSim features and the diffusion features based on multi-feature fusion, and uses the machine learning algorithm GBDT to predict the association. It consists of the following steps: (1) From the lncRNA-EF correlations dataset downloaded from the public database DLREFD, after the duplicate data are removed, the set of lncRNAs, the set of EFs, and the association matrix A of the lncRNA-EF correlations are obtained. Then, the Gaussian interaction profile kernel similarity of lncRNAs (KL) and the Gaussian interaction profile kernel similarity of EFs (KE) are calculated. (2) The chemical structure similarity matrix E between EFs is calculated by using the published tool SimComp. (3) The lncRNA similarity information KL is transformed by the logistic function to obtain the lncRNA similarity information SL, and the chemical structure similarity matrix E and the Gaussian interaction profile kernel similarity matrix KE are then used to construct a similarity matrix SE of EFs. (4) A global heterogeneous network is constructed by integrating the three subnets given by the association matrix A, the similarity matrix SL of lncRNAs, and the similarity matrix SE of EFs, which yields the adjacency matrix G of the global heterogeneous network. On the heterogeneous network, the Random Walk with Restart (RWR) algorithm is used to calculate the diffusion score and obtain the diffusion features, and singular value decomposition (SVD) is used to reduce the dimension of the diffusion features. (5) The HeteSim feature (score) for each lncRNA-EF pair is calculated. (6) The feature data set is obtained by combining the diffusion feature and the HeteSim score. The combined features are used to train the Gradient Boosting Decision Tree (GBDT) to predict the relationship between lncRNAs and EFs. **Figure 1** shows an overview of the proposed method. Each step of GBDTL2E is described in the following sections.

#### 2.1. Calculate Gaussian Interaction Profile Kernel Similarity

In this section, the calculation of the Gaussian interaction profile kernel similarity is presented first. The association matrix A of lncRNAs and EFs is obtained from the known lncRNA-EF correlations, and from it the Gaussian interaction profile kernel similarity matrices of lncRNAs and of EFs are calculated. Let A(l<sub>i</sub>, e<sub>j</sub>) indicate whether the lncRNA l<sub>i</sub> is associated with the EF e<sub>j</sub>. Specifically, A(l<sub>i</sub>, e<sub>j</sub>) = 1 if there is an association between l<sub>i</sub> and e<sub>j</sub>; otherwise, A(l<sub>i</sub>, e<sub>j</sub>) = 0, which is given by

$$A\left(l\_i, e\_j\right) = \begin{cases} 1 & \text{if } l\_i \text{ is associated with } e\_j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

The Gaussian interaction profile kernel similarity matrix KL of lncRNAs is then constructed. For a given lncRNA l<sub>i</sub>, IP(l<sub>i</sub>) is defined as the ith row of the association matrix A. The Gaussian interaction profile kernel similarity between lncRNA l<sub>i</sub> and lncRNA l<sub>j</sub> is calculated for each lncRNA pair, which can be written as

$$\text{KL}\left(l\_i, l\_j\right) = \exp\left(-\gamma\_l \left\|\text{IP}\left(l\_i\right) - \text{IP}\left(l\_j\right)\right\|^2\right) \tag{2}$$

$$\gamma\_l = \gamma\_l^{'} / \left( \frac{1}{nl} \sum\_{i=1}^{nl} \left\| \text{IP}\left(l\_i\right) \right\|^2 \right) \tag{3}$$

where γ<sub>l</sub> controls the bandwidth of the Gaussian interaction profile kernel. It is the normalized bandwidth obtained from the bandwidth parameter γ′<sub>l</sub>. Denote nl as the number of lncRNAs. Denote KL as the Gaussian interaction profile kernel similarity matrix of lncRNAs, and denote KL(l<sub>i</sub>, l<sub>j</sub>) as the Gaussian interaction profile kernel similarity score of lncRNA l<sub>i</sub> and lncRNA l<sub>j</sub>.

FIGURE 1 | Flowchart of our method: (A) Obtained the association matrix A; calculated the Gaussian interaction profile kernel similarity of lncRNAs and EFs, respectively. (B) Calculated the chemical structure similarity matrix E. (C) Obtained the lncRNA similarity information SL and constructed the similarity matrix SE of EFs. (D) Integrated the three subnets A, SL, and SE to construct a global heterogeneous network. (E) Constructed the adjacency matrix G and obtained the diffusion feature. (F) Calculated the HeteSim score. (G) Combined the diffusion feature and the HeteSim score. (H) Trained the Gradient Boosting Decision Tree classifier (GBDT).

Similarly, the known lncRNA-EF correlations are used to construct the Gaussian interaction profile kernel similarity matrix of EFs. For a given EF e<sub>i</sub>, IP′(e<sub>i</sub>) is defined as the ith column of the association matrix A. KE represents the Gaussian interaction profile kernel similarity matrix of environmental factors. Denote KE(e<sub>i</sub>, e<sub>j</sub>) as the Gaussian interaction profile kernel similarity score of EFs e<sub>i</sub> and e<sub>j</sub>, which is given by

$$\text{KE}\left(e\_i, e\_j\right) = \exp\left(-\gamma\_e \left\| \text{IP}'\left(e\_i\right) - \text{IP}'\left(e\_j\right) \right\|^2 \right) \tag{4}$$

$$\gamma\_e = \gamma\_e^{'} / \left( \frac{1}{ne} \sum\_{i=1}^{ne} \left\| \text{IP}^{'} \left( e\_i \right) \right\|^2 \right) \tag{5}$$

where γ<sub>e</sub> is the normalized bandwidth of the Gaussian interaction profile kernel, obtained from the bandwidth parameter γ′<sub>e</sub>. Denote ne as the number of EFs.
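Equations 2 to 5 can be illustrated with a short NumPy sketch. This is a minimal illustration rather than the authors' implementation: the function name `gip_kernel`, the default value `gamma_prime = 1.0`, and the toy association matrix are assumptions made for the example.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    gamma_prime is the un-normalised bandwidth parameter; 1.0 is an
    assumed value, not one taken from the paper.
    """
    profiles = np.asarray(profiles, dtype=float)
    # Normalised bandwidth (Eqs. 3 and 5): gamma' over the mean squared norm.
    mean_sq_norm = np.mean(np.sum(profiles ** 2, axis=1))
    gamma = gamma_prime / mean_sq_norm
    # Pairwise squared Euclidean distances between interaction profiles.
    diff = profiles[:, None, :] - profiles[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-gamma * sq_dist)

# Toy association matrix: 3 lncRNAs (rows) x 4 EFs (columns).
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])
KL = gip_kernel(A)     # lncRNA similarity uses the rows of A (Eq. 2)
KE = gip_kernel(A.T)   # EF similarity uses the columns of A (Eq. 4)
```

Both matrices are symmetric with ones on the diagonal, because every profile is at distance zero from itself.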

#### 2.2. Calculate Chemical Structure Similarity

In this section, the computation of the chemical structure similarity is described. The chemical structure similarity matrix between EFs is calculated using the SimComp tool (Hattori et al., 2010). SimComp (Similar Compound) is a graph-based method for comparing chemical structures; it has been implemented in the KEGG system to search for similar chemical structures in a chemical structure database. With the Kyoto Encyclopedia of Genes and Genomes (KEGG) database entry numbers corresponding to the EFs in the DLREFD database as parameters, SimComp's API is called to calculate the chemical structure similarity score E(e<sub>i</sub>, e<sub>j</sub>) of each pair of environmental factors e<sub>i</sub> and e<sub>j</sub>.

#### 2.3. Obtain the Similarity Matrix

The structural information and heterogeneous networks are integrated in the proposed GBDTL2E. The transformed similarity matrix SL and the integrated similarity matrix SE are described in this section. The lncRNA similarity matrix KL is transformed by the logistic function to obtain the lncRNA similarity matrix SL. The similarity matrix SE of EFs is constructed from the chemical structure similarity matrix E of EFs and the Gaussian interaction profile kernel similarity matrix KE of EFs, given by

$$\text{SL}\left(l\_i, l\_j\right) = \frac{1}{1 + e^{c \cdot \text{KL}\left(l\_i, l\_j\right) + v}} \tag{6}$$

where c = −15, v = log(9999);

$$\text{SE}\left(e\_i, e\_j\right) = \begin{cases} ew \cdot E\left(e\_i, e\_j\right) + \left(1 - ew\right) \cdot \text{KE}\left(e\_i, e\_j\right) & \text{if } E\left(e\_i, e\_j\right) \neq 0 \\ \text{KE}\left(e\_i, e\_j\right) & \text{otherwise} \end{cases} \tag{7}$$

where ew is the weight of correlation information of two EFs in SE.
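The two transformations of this section can be sketched as follows. The sketch assumes that v enters the exponent of the logistic function, so that a kernel value of 0 maps to 1/10000 and a value of 1 maps close to 1; the function names, the weight `ew = 0.5`, and the toy matrices are illustrative assumptions.

```python
import numpy as np

def logistic_transform(KL, c=-15.0, v=np.log(9999.0)):
    """Logistic transform of the lncRNA kernel: 1 / (1 + exp(c * KL + v))."""
    return 1.0 / (1.0 + np.exp(c * np.asarray(KL, dtype=float) + v))

def integrate_ef_similarity(E, KE, ew=0.5):
    """Blend E and KE where a structure score exists, otherwise keep KE.

    ew = 0.5 is an assumed weight, not a value reported in the paper.
    """
    E = np.asarray(E, dtype=float)
    KE = np.asarray(KE, dtype=float)
    return np.where(E != 0, ew * E + (1.0 - ew) * KE, KE)

# Toy similarity matrices for three EFs (hypothetical values).
E = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
KE = np.array([[1.0, 0.4, 0.3],
               [0.4, 1.0, 0.2],
               [0.3, 0.2, 1.0]])
SE = integrate_ef_similarity(E, KE)
low, high = logistic_transform(0.0), logistic_transform(1.0)
```

With c = −15 and v = log(9999), the transform pushes small kernel values toward 0 and large ones toward 1, which sharpens the contrast between weakly and strongly similar lncRNA pairs.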

#### 2.4. Obtain Low-Dimensional Network Diffusion Features

In this section, the association matrix A of lncRNA-EF, the similarity matrix SL of lncRNAs, and the similarity matrix SE of EFs are integrated to construct a global heterogeneous network. On this network, the Random Walk with Restart (RWR) algorithm is used to calculate the diffusion score and obtain the diffusion features. Because higher-dimensional features are more susceptible to noise during model training, singular value decomposition (SVD) is used to reduce the dimension of the diffusion features. The details of each sub-step are as follows.

#### 2.4.1. Construct the Roaming Network

In this section, the roaming network is constructed first, and the adjacency matrix G of the global heterogeneous network is obtained. G is an (nl + ne) × (nl + ne) matrix, where nl is the number of lncRNAs and ne is the number of EFs. G is given by

$$G = \begin{bmatrix} \text{SL} & A \\ A^T & \text{SE} \end{bmatrix} \tag{8}$$

where A<sup>T</sup> represents the transpose of A, and SL and SE are given by (6) and (7), respectively. T is the transition probability matrix of G, which is given by

$$T(i,j) = \frac{G(i,j)}{\sum\_{k=1}^{nl+ne} G(k,j)} \tag{9}$$

where T(i, j) represents the probability of node i transferring to node j in the global network. For any two given nodes i and j in the roaming network, if T(i, j) is not 0, there is an edge between them; if T(i, j) is 0, node i has no relationship with node j.
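Equations 8 and 9 amount to stacking the three matrices into one block matrix and dividing each column by its sum. A minimal NumPy sketch follows; the function name, the handling of all-zero columns, and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def build_transition_matrix(SL, SE, A):
    """Assemble G (Eq. 8) and column-normalise it into T (Eq. 9)."""
    SL = np.asarray(SL, dtype=float)
    SE = np.asarray(SE, dtype=float)
    A = np.asarray(A, dtype=float)
    G = np.block([[SL, A],
                  [A.T, SE]])
    col_sums = G.sum(axis=0, keepdims=True)
    # Assumption: a column with no edges is left as all zeros.
    col_sums[col_sums == 0] = 1.0
    return G, G / col_sums

# Toy network with 2 lncRNAs and 3 EFs (hypothetical values).
SL = np.array([[1.0, 0.5],
               [0.5, 1.0]])
SE = np.array([[1.0, 0.2, 0.0],
               [0.2, 1.0, 0.3],
               [0.0, 0.3, 1.0]])
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
G, T = build_transition_matrix(SL, SE, A)
```

Because the denominator of Eq. 9 sums over the first index, every non-empty column of T sums to one.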

#### 2.4.2. Obtain the Diffusion Features Using RWR

In this section, the RWR algorithm (Liu et al., 2016) is used to obtain the diffusion features of each node on the global network. Based on the transition probability matrix T, the diffusion features of all nodes, P = {P<sup>i</sup>}, are obtained by RWR, where i ∈ {1, 2, . . . , n}. P<sup>i</sup> represents the diffusion features of node i, and n = nl + ne is the total number of nodes in the global heterogeneous network. Starting from a node i in the global heterogeneous network, each step offers two choices: move to a randomly selected neighboring node or return to the starting node. The random walk with restart is given by

$$P\_{t+1}^{i} = (1 - r) \* T \* P\_t^i + r \* P\_0^i \tag{10}$$

where r is the restart probability; P<sub>t</sub><sup>i</sup> is an n-dimensional probability distribution vector of node i whose jth element represents the probability of visiting node j at step t, with j ∈ {1, 2, . . . , n}. P<sub>0</sub><sup>i</sup> represents the initial transition probability, which is given by

$$P\_0^i = \left(\frac{1}{n}, \frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n}\right) \tag{11}$$

The initial assumption is that the transition probability value of each node is 1/n, where n is the total number of nodes. The iteration is repeated until the change (P<sub>t+1</sub> − P<sub>t</sub>) is less than 10<sup>−10</sup>, at which point the final diffusion features are obtained.
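The iteration of Eq. 10 can be sketched in Python with NumPy as below. This is a minimal illustration, not the authors' implementation: the function name `rwr`, the restart probability `r = 0.5`, the use of the L1 norm in the stopping test, and the toy transition matrix are all assumptions. The restart vector is passed in explicitly, so either the uniform vector of Eq. 11 or a vector concentrated on the starting node can be supplied.

```python
import numpy as np

def rwr(T, p0, r=0.5, tol=1e-10, max_iter=10000):
    """Iterate P_{t+1} = (1 - r) * T @ P_t + r * P_0 until the change is below tol."""
    p0 = np.asarray(p0, dtype=float)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - r) * (T @ p) + r * p0
        # Stopping test: the L1 norm of the change is an assumed choice.
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy column-stochastic transition matrix on three nodes.
T = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
n = T.shape[0]

p_uniform = rwr(T, np.full(n, 1.0 / n))   # restart vector of Eq. 11
p_node0 = rwr(T, np.eye(n)[0])            # restart concentrated on node 0
```

On this symmetric toy network the uniform restart vector is already a fixed point, while restarting at node 0 concentrates probability on that node.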

#### 2.4.3. Calculate Low-Dimensional Diffusion Features

In this section, low-dimensional diffusion features are calculated from the diffusion features obtained by RWR. As the number of nodes increases, the dimension of the diffusion state increases as well. Singular value decomposition (SVD) (Golub and Reinsch, 1971; Cho et al., 2015) is used to reduce the dimension of the diffusion features. The high-dimensional diffusion feature matrix is decomposed as

$$\mathbf{P} = U\boldsymbol{\Sigma}\boldsymbol{V}^T\tag{12}$$

$$\mathbf{P} = U\Sigma^{1/2}\Sigma^{1/2}V^T \tag{13}$$

where U and V represent the left singular matrix and the right singular matrix, respectively. U and V are orthogonal matrices, and Σ has non-zero values only on its diagonal. We refer to these non-zero values as singular values and order them in Σ from largest to smallest. The larger a singular value, the more information about the matrix it carries. Therefore, in order to reduce the computation, we keep only the 50 largest singular values, which is enough to approximately restore the data. We take the first 50 singular values and the corresponding singular vectors, which are given by

$$X = U\_{n \ast d} \left( \Sigma\_{d \ast d} \right)^{1/2} \tag{14}$$

TABLE 1 | The paths from a lncRNA to an environmental factor in our heterogeneous network with a length of less than 5.


$$\mathcal{W} = (\Sigma\_{d \ast d})^{1/2} \left( V\_{d \ast n} \right)^{T} \tag{15}$$

where X is the low-dimensional node feature matrix derived from the high-dimensional diffusion feature. Each row of matrix X is the low-dimensional feature vector of each node in the network. W is the low-dimensional context eigenmatrix derived from the high-dimensional diffusion feature. Thus, we obtain the diffusion feature X after dimensionality reduction.
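The truncation in Eqs. 12 to 15 can be sketched with `numpy.linalg.svd`. The function name and the random toy matrix are assumptions made for illustration; d = 50 follows the text.

```python
import numpy as np

def reduce_diffusion_features(P, d=50):
    """Return X = U_d * Sigma_d^(1/2) (Eq. 14) and W = Sigma_d^(1/2) * V_d^T (Eq. 15)."""
    U, s, Vt = np.linalg.svd(np.asarray(P, dtype=float), full_matrices=False)
    d = min(d, len(s))                   # cannot keep more values than exist
    sqrt_sigma = np.sqrt(s[:d])          # singular values arrive in descending order
    X = U[:, :d] * sqrt_sigma            # scales column k of U by sqrt(sigma_k)
    W = sqrt_sigma[:, None] * Vt[:d]     # scales row k of V^T by sqrt(sigma_k)
    return X, W

# Toy 6 x 6 diffusion matrix (random values, for illustration only).
rng = np.random.default_rng(0)
P = rng.random((6, 6))
X, W = reduce_diffusion_features(P, d=2)            # truncated features
X_full, W_full = reduce_diffusion_features(P, d=6)  # no truncation
```

When no singular values are dropped, the product of X and W reproduces P exactly, which is the factorization of Eq. 13; truncating to d columns keeps the directions that carry the most information.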

#### 2.5. Calculate the Hetesim Score

To obtain high performance, the proposed method combines the diffusion features obtained in the above section with HeteSim features based on multi-feature fusion. In this section, HeteSim (Shi et al., 2014) is used to calculate the relevance between objects in the heterogeneous network. HeteSim is a path-based measure: for each pair of objects (of the same or different types) in the heterogeneous network, it yields a single score that expresses their relatedness along an arbitrary path. **Figure 2** illustrates a HeteSim score.

As we can see from **Figure 2**, there are three paths from A to C and two paths from B to C. Since there are more paths from A to C than from B to C, one might conclude that A is closer to C than B is. According to HeteSim, however, B is closer to C than A is, because the two edges from B to C account for two-thirds of the edges starting from B, whereas only a small part of the edges starting from A connect to C. In our proposed method, HeteSim is used to measure the similarities between lncRNAs and EFs. Under the constraint of length less than five, there are 14 different paths from a lncRNA to an EF, as shown in **Table 1**.

The HeteSim score between lncRNA and EF is calculated:

**Step (1):** The transition probability matrices M<sub>LP</sub> from lncRNA to EF, lncRNA to lncRNA, EF to lncRNA, and EF to EF in the global heterogeneous network are calculated. The transition probability matrix M<sub>LP</sub>(i, j) is given by

$$M\_{LP}(i,j) = \frac{I\_{LP}(i,j)}{\sum\_{k=1} I\_{LP}(i,k)}\tag{16}$$

where L and P represent two types of objects in the global heterogeneous network, and i and j represent two nodes in the global heterogeneous network. Matrix I is the incidence matrix of L and P. If both L and P are environmental factors, matrix I is matrix SE. If both L and P are lncRNAs, matrix I is matrix SL. If L and P are lncRNA and EFs respectively, then matrix I is matrix A. The four transfer probability matrices can be obtained as MLE, MLL, MEL, and MEE respectively.


The transition probability matrices along the left part path<sub>L</sub> and the right part path<sub>R</sub> of the path are multiplied to give

$$R\_{path\_L} = M\_{h\_1, h\_2} M\_{h\_2, h\_3} \cdots M\_{h\_{mid-1}, h\_{mid}} \tag{17}$$

$$R\_{path\_R} = M\_{h\_{mid}, h\_{mid+1}} M\_{h\_{mid+1}, h\_{mid+2}} \cdots M\_{h\_{m-1}, h\_m} \tag{18}$$

**Step (4):** The HeteSim score of the path is calculated, which is given by:

$$\text{HeteSim} = \frac{R\_{path\_L} \left(R\_{path\_R^{-1}}\right)^T}{\left\|R\_{path\_L}\right\|\_2 \* \left\|R\_{path\_R^{-1}}\right\|\_2} \tag{19}$$

where path<sub>R</sub><sup>−1</sup> is the reverse path of path<sub>R</sub>. There are in total 14 different paths from a lncRNA to an EF under the constraint of length <5. So, we obtain 14-dimensional HeteSim features for each node in the heterogeneous networks.
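As an illustration of Eqs. 16 to 19, the sketch below computes HeteSim scores along the path lncRNA to lncRNA to EF, which splits at its middle node into a left part (lncRNA to lncRNA) and a reversed right part (EF to lncRNA). It reads Eq. 19 pair by pair, as the cosine between the row of the left reachability matrix for a lncRNA and the row of the reversed right reachability matrix for an EF; that reading, the function names, and the toy matrices are assumptions. Paths with an odd number of edges need an extra edge-splitting step that is not covered here.

```python
import numpy as np
from functools import reduce

def row_normalize(I):
    """Row-normalised transition probabilities (Eq. 16)."""
    I = np.asarray(I, dtype=float)
    sums = I.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0   # assumption: rows without edges stay zero
    return I / sums

def hetesim_even_path(left_mats, right_reverse_mats):
    """Pairwise cosine between left and reversed-right reachability rows."""
    R_left = reduce(np.matmul, left_mats)            # product as in Eq. 17
    R_right = reduce(np.matmul, right_reverse_mats)  # product along the reversed right part
    numerator = R_left @ R_right.T
    norms = np.outer(np.linalg.norm(R_left, axis=1),
                     np.linalg.norm(R_right, axis=1))
    norms[norms == 0] = 1.0
    return numerator / norms

# Toy data: 2 lncRNAs and 3 EFs (hypothetical values).
SL = np.array([[1.0, 0.5],
               [0.5, 1.0]])
A = np.array([[1, 0, 0],
              [0, 1, 1]])
M_LL = row_normalize(SL)     # lncRNA -> lncRNA
M_EL = row_normalize(A.T)    # EF -> lncRNA (reverse of lncRNA -> EF)
H = hetesim_even_path([M_LL], [M_EL])   # one score per (lncRNA, EF) pair
```

Each of the 14 paths would contribute one such score matrix, and stacking the scores for a given pair gives its HeteSim feature vector.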

#### 2.6. Train the Gradient-Boosting Decision Tree Classifier

Once the HeteSim features and the diffusion features have been obtained, they are combined. This section presents the training of the GBDT classifier that predicts the association between lncRNAs and EFs based on heterogeneous networks. The 50-dimensional diffusion features and the 14-dimensional HeteSim scores are combined into a 64-dimensional feature data set. These features are used to train the Gradient Boosting Decision Tree (GBDT) (Friedman, 2001) classifier, which is then used to predict the correlation between lncRNAs and EFs.

GBDT is an effective machine learning method for classification and regression problems. It is composed of multiple decision trees, and the final answer is obtained by summing the outputs of all trees. GBDT runs for multiple rounds and generates one weak classifier in each round. Each classifier is trained on the gradient (residual) of the classifiers from the previous rounds. The final classifier is the weighted sum of the weak classifiers obtained in each round of training, that is, an additive model. The model training steps are as follows:

**Step (1):** The initialization model is given by:

$$\Theta\_0(\mathbf{x}) = \frac{1}{2} \ast \log \left( \frac{\sum\_{i=1}^{N} y\_i}{\sum\_{i=1}^{N} \left(1 - y\_i\right)} \right) \tag{20}$$

where N is the number of training samples, and y<sup>i</sup> is the real label. The loss function is given by:

$$\mathcal{L}\left(\mathbf{y}, \Theta\_{m-1}\left(\mathbf{x}\_{i}\right)\right) = \log\left(1 + \exp\left(-\mathbf{y}\Theta\_{m-1}\left(\mathbf{x}\_{i}\right)\right)\right) \tag{21}$$

where y is the real class label, and Θ<sub>m</sub>(x) is the weak model in the mth round.

**Step (2):** Cycle m in turn, where m = 1,2,...M

**A:** The calculation for the negative gradient of the loss function of the ith sample in the mth round is given by:

$$r\_{m,i} = -\frac{\partial \mathcal{L}\left(\mathbf{y}\_i, \Theta\_{m-1}\left(\mathbf{x}\_i\right)\right)}{\partial \Theta\_{m-1}\left(\mathbf{x}\_i\right)} = \frac{\mathbf{y}\_i}{1 + \exp\left(\mathbf{y}\_i \Theta\_{m-1}\left(\mathbf{x}\_i\right)\right)}\tag{22}$$

where i = 1, 2, . . . N.

**B:** Construct the mth decision tree and obtain the corresponding leaf node regions R<sub>m,j</sub>, where j = 1, 2, ..., J, and J is the number of leaf nodes in the tree.

**C:** For the samples in each leaf node, calculate c<sub>m,j</sub>, the value that minimizes the loss function, namely the best output value for that leaf node, given by:

$$c\_{m,j} = \arg\min\_{c} \sum\_{\mathbf{x}\_i \in R\_{m,j}} \log \left( 1 + \exp \left( -\mathbf{y}\_i \left( \Theta\_{m-1} \left( \mathbf{x}\_i \right) + c \right) \right) \right) \tag{23}$$

**D:** Update mth weak model:

$$\Theta\_m(\mathbf{x}) = \Theta\_{m-1}(\mathbf{x}) + lr \ast \sum\_{j=1}^{J} c\_{m,j} I\left(\mathbf{x} \in R\_{m,j}\right) \tag{24}$$

where I(x ∈ R<sub>m,j</sub>) equals 1 if x falls in the leaf node corresponding to R<sub>m,j</sub> and 0 otherwise, and lr is the learning rate.

TABLE 2 | The experimental parameters of GBDTL2E.


**E:** Check whether m has reached M. If m is less than M, set m = m + 1 and return to A for the next iteration. Otherwise, M weak learners have been constructed, and we go to Step (3) to end the training.

**Step (3):** Obtain the final Strong Model:

$$\Theta(\mathbf{x}) = \Theta\_0(\mathbf{x}) + lr \ast \sum\_{m=1}^{M} \sum\_{j=1}^{J} c\_{m,j} I\left(\mathbf{x} \in R\_{m,j}\right) \tag{25}$$
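The training loop of Steps (1) to (3) can be sketched from scratch with one-split trees (decision stumps). This is a simplified illustration under stated assumptions, not the authors' implementation: labels are taken as −1 and +1 to match the loss in Eq. 21, the initial value is computed as half the log ratio of positive to negative counts, the leaf values use a single Newton step as an approximation of the minimization in Eq. 23, and the number of rounds, the learning rate, and the toy data are hypothetical.

```python
import numpy as np

def fit_stump(X, r):
    """Pick the (feature, threshold) split that best fits residuals r in squared error.

    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        for thr in (values[:-1] + values[1:]) / 2.0:
            left = X[:, f] <= thr
            sse = (np.sum((r[left] - r[left].mean()) ** 2)
                   + np.sum((r[~left] - r[~left].mean()) ** 2))
            if best is None or sse < best[0]:
                best = (sse, f, thr)
    return best[1], best[2]

def gbdt_fit(X, y, n_rounds=20, lr=0.1):
    """Boost stumps on the loss log(1 + exp(-y * F)) with y in {-1, +1}."""
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    F0 = 0.5 * np.log(n_pos / n_neg)            # initial model, cf. Eq. 20
    F = np.full(len(y), F0)
    stumps = []
    for _ in range(n_rounds):
        r = y / (1.0 + np.exp(y * F))           # negative gradient, Eq. 22
        f, thr = fit_stump(X, r)
        left = X[:, f] <= thr
        leaf_values = []
        for mask in (left, ~left):
            h = np.abs(r[mask]) * (1.0 - np.abs(r[mask]))   # second derivative
            c = r[mask].sum() / max(h.sum(), 1e-12)         # Newton step for Eq. 23
            F[mask] += lr * c                               # update, Eq. 24
            leaf_values.append(c)
        stumps.append((f, thr, leaf_values[0], leaf_values[1]))
    return F0, stumps

def gbdt_predict(X, F0, stumps, lr=0.1):
    """Sum the initial value and all scaled leaf outputs (Eq. 25)."""
    F = np.full(X.shape[0], F0)
    for f, thr, c_left, c_right in stumps:
        F += lr * np.where(X[:, f] <= thr, c_left, c_right)
    return F

# Toy one-feature data set (hypothetical values).
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([-1, -1, -1, 1, 1, 1])
F0, stumps = gbdt_fit(X, y)
scores = gbdt_predict(X, F0, stumps)
```

A positive score indicates the positive class. In practice a library implementation with deeper trees would normally replace the stump search used here.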

#### 2.7. GBDTL2E Algorithm

In this section, the proposed GBDTL2E algorithm to predict the association between lncRNAs and EFs based on heterogeneous networks has been described in Algorithm 1. From lines four to nine of Algorithm 1, the low-dimensional diffusion feature matrix X was obtained by using the random walk with restart algorithm and singular value decomposition. In lines 10–41 of Algorithm 1, the Hetesim score was obtained. In lines 42–58 of Algorithm 1, the training data is obtained and used to train the GBDT classifier. Furthermore, the final classification model is obtained.

#### 3. RESULT AND DISCUSSION

#### 3.1. Data Sets

We randomly selected 300 positive samples and 300 negative samples for training the model. Positive samples are lncRNA-EF pairs with a known correlation, while negative samples are pairs without a known correlation. For objective performance evaluation, an independent test set was built by randomly selecting a further 300 positive samples and 300 negative samples. Note that all the positive and negative samples in the test set were chosen independently and excluded from the training set.
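The sampling described above can be sketched as follows. The function name, the random seed, and the toy association matrix are assumptions; with the real data the call would request 300 training and 300 test samples per class.

```python
import numpy as np

def sample_balanced_pairs(A, n_train, n_test, seed=0):
    """Draw disjoint, class-balanced training and test sets of (lncRNA, EF) index pairs."""
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.argwhere(A == 1))   # pairs with a known association
    neg = rng.permutation(np.argwhere(A == 0))   # pairs without a known association
    if min(len(pos), len(neg)) < n_train + n_test:
        raise ValueError("not enough pairs for the requested split")

    def stack(p, q):
        pairs = np.vstack([p, q])
        labels = np.concatenate([np.ones(len(p), dtype=int),
                                 np.zeros(len(q), dtype=int)])
        return pairs, labels

    train = stack(pos[:n_train], neg[:n_train])
    test = stack(pos[n_train:n_train + n_test], neg[n_train:n_train + n_test])
    return train, test

# Toy association matrix with 6 positive and 30 negative pairs.
A = np.eye(6, dtype=int)
(train_pairs, train_labels), (test_pairs, test_labels) = sample_balanced_pairs(A, 3, 3)
```

Because the training and test pairs are taken from non-overlapping slices of one shuffled list per class, no pair can appear in both sets.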

**Algorithm 1** GBDTL2E algorithm

**Input:** lncRNAs set, EFs set, the association matrix A of the lncRNA-EFs;

**Output:** The gaussian interaction profile kernel similarity matrices KL and KE. The chemical structural similarity matrix E. The similarity matrices SL and SE.

1: Construct the adjacency matrix G;
2: Initialize the global transition probability matrix T;
3: Initialize the transition probability vector for each node P<sub>0</sub><sup>i</sup> = (1/n, 1/n, 1/n, . . . , 1/n)
4: **while** P<sub>t+1</sub><sup>i</sup> − P<sub>t</sub><sup>i</sup> > 10<sup>−10</sup> **do**:
5:   Obtain the updated probability vector:
6:   P<sub>t+1</sub><sup>i</sup> = (1 − r) ∗ T ∗ P<sub>t</sub><sup>i</sup> + r ∗ P<sub>0</sub><sup>i</sup>;
7: **end while**
8: P = U<sub>n∗d</sub> Σ<sub>d∗d</sub> V<sup>T</sup><sub>d∗n</sub>
9: X = U<sub>n∗d</sub> Σ<sup>1/2</sup><sub>d∗d</sub>
10: Input L, P to calculate M<sub>LP</sub>(i, j)
11: **if** L ∈ EFs and P ∈ EFs **then**
12:   M<sub>LP</sub>(i, j) = M<sub>EE</sub>(i, j) = SE<sub>EE</sub>(i, j) / Σ<sub>k=1</sub> SE<sub>EE</sub>(i, k)
13: **end if**
14: **if** L ∈ lncRNAs and P ∈ EFs **then**
15:   M<sub>LP</sub>(i, j) = M<sub>LE</sub>(i, j) = A<sub>LE</sub>(i, j) / Σ<sub>k=1</sub> A<sub>LE</sub>(i, k)
16: **end if**
17: **if** L ∈ EFs and P ∈ lncRNAs **then**
18:   M<sub>LP</sub>(i, j) = M<sub>EL</sub>(i, j) = A<sup>T</sup><sub>EL</sub>(i, j) / Σ<sub>k=1</sub> A<sup>T</sup><sub>EL</sub>(i, k)
19: **end if**
20: **if** L ∈ lncRNAs and P ∈ lncRNAs **then**
21:   M<sub>LP</sub>(i, j) = M<sub>LL</sub>(i, j) = SL<sub>LL</sub>(i, j) / Σ<sub>k=1</sub> SL<sub>LL</sub>(i, k)
22: **end if**
23: **for** n = 1 → 5 **do**
24:   Divide the path into two parts.
25:   **if** n%2 == 0 **then**
26:     mid = (m/2) + 1
27:     path<sub>L</sub> = h<sub>1</sub>, h<sub>2</sub>, · · · , h<sub>mid</sub>
28:     path<sub>R</sub> = h<sub>mid</sub>, · · · , h<sub>m+1</sub>
29:   **end if**
30:   **if** n%2 != 0 **then**
31:     mid1 = ((m + 1)/2)
32:     mid2 = ((m + 3)/2)
33:     path<sub>L1</sub> = h<sub>1</sub>, h<sub>2</sub>, · · · , h<sub>mid1</sub>
34:     path<sub>R1</sub> = h<sub>mid1+1</sub>, · · · , h<sub>m+1</sub>
35:     path<sub>L2</sub> = h<sub>1</sub>, · · · , h<sub>mid2</sub>
36:     path<sub>R2</sub> = h<sub>mid2+1</sub>, · · · , h<sub>m+1</sub>
37:   **end if**
38:   R<sub>pathL</sub> = M<sub>h1,h2</sub> M<sub>h2,h3</sub> · · · M<sub>hmid−1,hmid</sub>
39:   R<sub>pathR</sub> = M<sub>hmid,hmid+1</sub> M<sub>hmid+1,hmid+2</sub> · · · M<sub>hm−1,hm</sub>
40:   HeteSim = R<sub>pathL</sub> (R<sub>pathR<sup>−1</sup></sub>)<sup>T</sup> / (‖R<sub>pathL</sub>‖<sub>2</sub> ∗ ‖R<sub>pathR<sup>−1</sup></sub>‖<sub>2</sub>)
41: **end for**
42: Combine the diffusion feature and the HeteSim score to get the data set
43: D<sub>train</sub> = {(x<sub>1</sub>, y<sub>1</sub>), (x<sub>2</sub>, y<sub>2</sub>), . . . , (x<sub>N</sub>, y<sub>N</sub>)}, D<sub>test</sub> = {(x<sub>1</sub>, y<sub>1</sub>), (x<sub>2</sub>, y<sub>2</sub>), . . . , (x<sub>N</sub>, y<sub>N</sub>)}


#### 3.2. Performance Measures

Ten-fold cross-validation was used to measure the performance of GBDTL2E. The GBDTL2E parameters used are listed in **Table 2**. In 10-fold cross-validation, the training set is randomly divided into 10 subsets of roughly the same size. Each subset is used in turn as validation data, and the remaining nine subsets are used as training data. This process is repeated 10 times, and the reported performance is the average over the 10 runs. Several measures were used to evaluate performance, including recall (REC), F1-score, accuracy (ACC), Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUC). They are defined as:

$$Recall = \frac{TP}{TP + FN},\tag{26}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN},\tag{27}$$

$$F1 - Score = \frac{2 \times TP}{2TP + FP + FN},\tag{28}$$

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}} \tag{29}$$

where TP and TN represent the numbers of correctly predicted positive and negative samples, and FP and FN represent the numbers of wrongly predicted positive and negative samples, respectively. The AUC score is computed by varying the cutoff of the predicted scores from the smallest to the greatest value.
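Equations 26 to 29 can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical and are not results from this study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Recall, accuracy, F1-score, and MCC from confusion-matrix counts (Eqs. 26-29)."""
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "accuracy": accuracy, "f1": f1, "mcc": mcc}

# Hypothetical counts for a balanced test set of 300 positives and 300 negatives.
metrics = classification_metrics(tp=270, tn=260, fp=40, fn=30)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when one class dominates.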

#### 3.3. Performance Comparison With Existing Machine Learning Methods

In this section, the proposed GBDTL2E method was compared with the k-nearest neighbor algorithm (KNN) (Cover and Hart, 1967), random forest (RF) (Liaw et al., 2002), and support vector machine (SVM) (Burges, 1998). Ten-fold cross-validation was used for all four algorithms. For the KNN classifier, five nearest neighbors were used. The RF algorithm constructed multiple decision tree classifiers, each trained on a set of randomly selected benchmark samples, to improve performance. For the SVM, we used the radial basis function (RBF) as the kernel function and set the penalty parameter c and the kernel parameter γ to 64 and 0.0001, respectively. **Table 3** and **Figure 3** compare the predictive performance of the proposed approach with that of the other machine learning approaches. It can be seen that the proposed method had the best performance. To further assess the performance of this model, we also compared these machine learning methods on the independent test set. The ROC curves on the independent test set are shown in **Figure 4**. The AUC values of GBDTL2E, KNN, RF, and SVM were 0.91, 0.82, 0.88, and 0.88, respectively. The results show that GBDT performed better than the other machine learning methods.

TABLE 3 | The performance comparison with other machine learning methods.

#### 3.4. Performance Comparison With Different Topological Features

To verify the benefit of combining the diffusion and HeteSim features in GBDTL2E, we compared the performance obtained with each feature separately and with the combined features. **Figures 5**, **6** show the performance comparison with different topological features. In **Figure 5**, "Hete+Diff," "Hete," and "Diff" denote the combined HeteSim and diffusion feature, the HeteSim feature, and the diffusion feature, respectively. As we can see from **Figure 5**, the combined HeteSim and diffusion features achieved higher performance than either feature alone, which shows that combining the two features improves the prediction performance. **Figure 6** shows the ROC curves for the different feature groups: GBDTL2E with the diffusion feature only, GBDTL2E with the HeteSim feature only, and GBDTL2E with the combined feature. We also used 10-fold cross-validation to verify the influence of the different feature groups on the experimental results. We can see from **Figure 6** that GBDTL2E with the combined features obtains higher performance than the other two variants, and that GBDTL2E with the HeteSim feature only performs better than GBDTL2E with the diffusion feature only.

#### 3.5. Performance Comparison With Existing Methods

FIGURE 3 | The ROC curve comparison with other machine learning methods. (A) The ROC curve using KNN. (B) The ROC curve using RF. (C) The ROC curve using SVM. (D) The ROC curve using GBDT.

In this section, the GBDTL2E algorithm is compared with existing methods for predicting associations between lncRNAs and EFs. However, only a few studies have predicted new potential associations between lncRNAs and EFs. Three methods were chosen for comparison with the proposed GBDTL2E method: KATZ (Vural and Kaya, 2018), MPALERLS (Xu, 2018), and BIRWAPALE (Xu, 2018).

• KATZ: The KATZ method, based on the KATZ measure, was used to find potential new associations between lncRNAs and EFs; it also uses the DLREFD database, which contains proven associations between lncRNAs and EFs. The KATZ measure and Gaussian interaction profile kernel similarity were used to predict new potential associations between lncRNAs and EFs. In this method, the parameters β and k were set to 0.01 and 3, respectively.
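In a truncated KATZ measure, the score between two nodes sums the walks of length 1 to k between them, with walks of length l weighted by β^l. The sketch below is a minimal illustration with β = 0.01 and k = 3 on a hypothetical four-node path graph. The function names and the toy adjacency matrix are assumptions, and the network construction of the KATZ method (including the Gaussian interaction profile kernel similarity) is not reproduced.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def truncated_katz(adj, beta=0.01, k=3):
    """Sum of beta**l * A**l for l = 1..k, where A is the adjacency matrix."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]              # A^1
    for length in range(1, k + 1):
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = matmul(power, adj)               # advance to A^(length + 1)
    return scores


# Hypothetical path graph 0 - 1 - 2 - 3 (toy example only).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

katz = truncated_katz(adjacency, beta=0.01, k=3)
print(katz[0][3])  # nodes 0 and 3 are linked only by a single walk of length 3
```

Because β is small, short walks dominate the score, so directly connected nodes receive much higher scores than nodes connected only through longer walks.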


**Figure 7** shows the comparison results. The experimental results show that the GBDTL2E algorithm performed better than the other three algorithms, for the following reasons: (1) Computing the HeteSim scores of different paths from lncRNAs to EFs in the heterogeneous network to obtain the HeteSim features, and combining the HeteSim features with the diffusion features as the data features, makes better use of the topological characteristics of heterogeneous networks and thus yields better performance. (2) The GBDT algorithm is an effective prediction model. As far as we know, we are the first to apply both diffusion and HeteSim features to predict lncRNA-EF interactions. The results show that combining the diffusion and HeteSim features can further improve the performance.

#### 3.6. Case Study

To further evaluate the performance of our proposed algorithm, we investigated the environmental factor "Cisplatin," an effective chemotherapy drug for many cancers (Florea and Büsselberg, 2011). Many proven associations between "Cisplatin" and lncRNAs have been reported. In this study, we used our model to predict associations between "Cisplatin" and lncRNAs. First, all associations between "Cisplatin" and lncRNAs were deleted from the training set.

FIGURE 7 | The ROC curve comparison with existing methods. (A) The ROC curve of KATZ. (B) The ROC curve of MPALERLS. (C) The ROC curve of BIRWAPALE. (D) The ROC curve of GBDTL2E.

TABLE 4 | The top 10 predicted lncRNAs related to cisplatin.

After running our algorithm, we sorted the lncRNAs by their correlation values with "Cisplatin," from largest to smallest. We found that all of the top 10 lncRNAs are confirmed to be related to "Cisplatin" in the DLREFD database. The 10 lncRNAs and their corresponding PubMed reference IDs are shown in **Table 4**.
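The case-study procedure (hold out every association of one EF, score the candidate lncRNAs, and check the top-ranked ones against the held-out associations) can be sketched as follows. All identifiers, the placeholder lncRNA names, and the scores are hypothetical; in practice the scores would come from the trained model.

```python
def hold_out_factor(associations, factor):
    """Split known (lncRNA, EF) pairs into training pairs and held-out lncRNAs."""
    train = {(lnc, ef) for lnc, ef in associations if ef != factor}
    held_out = {lnc for lnc, ef in associations if ef == factor}
    return train, held_out


def top_n(scores, n=10):
    """Return the n lncRNAs with the largest predicted scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical known associations (placeholder names, not DLREFD entries).
known = {
    ("lncRNA_A", "Cisplatin"),
    ("lncRNA_C", "Cisplatin"),
    ("lncRNA_B", "EF_X"),
}
train_pairs, cisplatin_lncrnas = hold_out_factor(known, "Cisplatin")

# Hypothetical scores that a model trained on train_pairs might assign.
predicted = {"lncRNA_A": 0.91, "lncRNA_B": 0.15, "lncRNA_C": 0.78, "lncRNA_D": 0.40}

ranked = top_n(predicted, n=2)
confirmed = [lnc for lnc in ranked if lnc in cisplatin_lncrnas]
print(ranked, confirmed)
```

Removing the associations before training is what makes the check meaningful: a held-out lncRNA that still ranks near the top was recovered from network structure alone.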

#### 4. CONCLUSIONS

Recent studies have shown that interactions between lncRNAs and EFs are closely related to the development of diseases. As computational methods are increasingly used to address biological problems and can greatly reduce manual effort, it is feasible to use them to predict interactions between lncRNAs and EFs. In this paper, we proposed a method to predict associations between lncRNAs and EFs. The proposed method combines the HeteSim features and the diffusion features through multi-feature fusion, and uses the machine learning algorithm GBDT to predict associations between lncRNAs and EFs based on heterogeneous networks. We evaluated our method with 10-fold cross-validation and compared it with other methods. An environmental factor was also examined in a case study to assess the performance of the method. The results show that GBDTL2E achieves high performance. In the future, we will investigate adding the expression profiles of lncRNAs to further improve the performance.

#### DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/zhufangkuang/DLREFD.

#### AUTHOR CONTRIBUTIONS

JW, ZK, ZM, and GH conceived this work and designed the experiments. JW and ZK carried out the experiments. ZM and GH collected the data and analyzed the results. JW and ZK wrote, revised, and approved the manuscript.

#### FUNDING

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61309027, 61702562, and 61702561, the Hunan Provincial Natural Science Foundation of China under Grant No. 2018JJ3888, the Scientific Research Fund of Hunan Provincial Education Department under Grant No. 18B197, the National Key R&D Program of China under Grant No. 2018YFB1700200, the Open Research Project of Key Laboratory of Intelligent Information Perception and Processing Technology (Hunan Province) under Grant No. 2017KF01, the Foundation Project of Hunan Internet of Things Society No. 2018-2, and the Hunan Key Laboratory of Intelligent Logistics Technology (2019TP1015).

#### ACKNOWLEDGMENTS

We would like to thank the Experimental Center of School of Computer and Information Engineering, Central South University of Forestry and Technology, for providing computing resources.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00272/full#supplementary-material




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Wang, Kuang, Ma and Han. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
