# MACHINE LEARNING TECHNIQUES ON GENE FUNCTION PREDICTION

EDITED BY : Quan Zou, Arun Kumar Sangaiah and Dariusz Mrozek PUBLISHED IN : Frontiers in Genetics and Frontiers in Plant Science

#### Frontiers Copyright Statement

© Copyright 2007-2019 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.

The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.

Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.

Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.

As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.

All copyright, and all rights therein, are protected by national and international copyright laws.

The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use. ISSN 1664-8714 ISBN 978-2-88963-214-5 DOI 10.3389/978-2-88963-214-5


# MACHINE LEARNING TECHNIQUES ON GENE FUNCTION PREDICTION

Topic Editors: Quan Zou, University of Electronic Science and Technology of China, China Arun Kumar Sangaiah, VIT University, India Dariusz Mrozek, Silesian University of Technology, Poland

Citation: Zou, Q., Sangaiah, A. K., Mrozek, D., eds. (2019). Machine Learning Techniques on Gene Function Prediction. Lausanne: Frontiers Media. doi: 10.3389/978-2-88963-214-5

# Table of Contents


Wei Chen, Pengmian Feng, Hui Ding and Hao Lin


Shiheng Lu, Ke Zhao, Xuefei Wang, Hui Liu, Xiamuxiya Ainiwaer, Yan Xu and Min Ye

*M6AMRFS: Robust Prediction of N6-Methyladenosine Sites With Sequence-Based Features in Multiple Species*

Xiaoli Qiang, Huangrong Chen, Xiucai Ye, Ran Su and Leyi Wei

*Identification and Analysis of Rice Yield-Related Candidate Genes by Walking on the Functional Network*

Jing Jiang, Fei Xing, Chunyu Wang and Xiangxiang Zeng

*LLCMDA: A Novel Method for Predicting miRNA Gene and Disease Relationship Based on Locality-Constrained Linear Coding*

Yu Qu, Huaxiang Zhang, Chen Lyu and Cheng Liang

*MADS-Box Gene Classification in Angiosperms by Clustering and Machine Learning Approaches*

Yu-Ting Chen, Chi-Chang Chang, Chi-Wei Chen, Kuan-Chun Chen and Yen-Wei Chu

*Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods*

Kaiyang Qu, Leyi Wei, Jiantao Yu and Chunyu Wang

*Conserved Disease Modules Extracted From Multilayer Heterogeneous Disease and Gene Networks for Understanding Disease Mechanisms and Predicting Disease Treatments*

Liang Yu, Shunyu Yao, Lin Gao and Yunhong Zha

Lei Deng, Jiacheng Wang and Jingpu Zhang

*Construction of Complex Features for Computational Predicting ncRNA-Protein Interaction*

Qiguo Dai, Maozu Guo, Xiaodong Duan, Zhixia Teng and Yueyue Fu

*Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data*

Limin Jiang, Yongkang Xiao, Yijie Ding, Jijun Tang and Fei Guo

Shuai Liu, Mengye Lu, Hanshuang Li and Yongchun Zuo

*DMfold: A Novel Method to Predict RNA Secondary Structure With Pseudoknots Based on Deep Learning and Improved Base Pair Maximization Principle*

Linyu Wang, Yuanning Liu, Xiaodan Zhong, Haiming Liu, Chao Lu, Cong Li and Hao Zhang

*Identification of Triple-Negative Breast Cancer Genes and a Novel High-Risk Breast Cancer Prediction Model Development Based on PPI Data and Support Vector Machines*

Ming Li, Yu Guo, Yuan-Ming Feng and Ning Zhang

*Inferring Bacterial Infiltration in Primary Colorectal Tumors From Host Whole Genome Sequencing Data*

Man Guo, Er Xu and Dongmei Ai

*Multi-Level Comparative Framework Based on Gene Pair-Wise Expression Across Three Insulin Target Tissues for Type 2 Diabetes*

Shaoyan Sun, Fengnan Sun and Yong Wang

*A Machine Learning Approach for Identifying Gene Biomarkers Guiding the Treatment of Breast Cancer*

Ashraf Abou Tabl, Abedalrhman Alkhateeb, Waguih ElMaraghy, Luis Rueda and Alioune Ngom

*Mining* Magnaporthe oryzae *sRNAs With Potential Transboundary Regulation of Rice Genes Associated With Growth and Defense Through Expression Profile Analysis of the Pathogen-Infected Rice*

Hao Zhang, Sifei Liu, Haowu Chang, Mengping Zhan, Qing-Ming Qin, Borui Zhang, Zhi Li and Yuanning Liu

*A Positive Causal Influence of IL-18 Levels on the Risk of T2DM: A Mendelian Randomization Study*

He Zhuang, Junwei Han, Liang Cheng and Shu-Lin Liu

*DeePromoter: Robust Promoter Predictor Using Deep Learning*

Mhaned Oubounyt, Zakaria Louadi, Hilal Tayara and Kil To Chong

*LPI-IBNRA: Long Non-coding RNA-Protein Interaction Prediction Based on Improved Bipartite Network Recommender Algorithm*

Guobo Xie, Cuiming Wu, Yuping Sun, Zhiliang Fan and Jianghui Liu

*An Ensemble Strategy to Predict Prognosis in Ovarian Cancer Based on Gene Modules*

Yi-Cheng Gao, Xiong-Hui Zhou and Wen Zhang

*Predicting Ion Channels Genes and Their Types With Machine Learning Techniques*

Ke Han, Miao Wang, Lei Zhang, Ying Wang, Mian Guo, Ming Zhao, Qian Zhao, Yu Zhang, Nianyin Zeng and Chunyu Wang

*Dual Convolutional Neural Networks With Attention Mechanisms Based Method for Predicting Disease-Related lncRNA Genes*

Ping Xuan, Yangkun Cao, Tiangang Zhang, Rui Kong and Zhaogong Zhang

*rSeqTU—A Machine-Learning Based R Package for Prediction of Bacterial Transcription Units*

Sheng-Yong Niu, Binqiang Liu, Qin Ma and Wen-Chi Chou

*An Effective Method to Measure Disease Similarity Using Gene and Phenotype Associations*

Shuhui Su, Lei Zhang and Jian Liu

Yingying Wang, Xingxian Huang, Jianfeng Liu, Xuefei Zhao, Haibo Yu and Yunpeng Cai

*NCNet: Deep Learning Network Models for Predicting Function of Non-coding DNA*

Hanyu Zhang, Che-Lun Hung, Meiyuan Liu, Xiaoye Hu and Yi-Yang Lin

*Corrigendum: NCNet: Deep Learning Network Models for Predicting Function of Non-Coding DNA*

Hanyu Zhang, Che-Lun Hung, Meiyuan Liu, Xiaoye Hu and Yi-Yang Lin

*Gradient Boosting Decision Tree-Based Method for Predicting Interactions Between Target Genes and Drugs*

Ping Xuan, Chang Sun, Tiangang Zhang, Yilin Ye, Tonghui Shen and Yihua Dong

# Editorial: Machine Learning Techniques on Gene Function Prediction

#### *Quan Zou<sup>1</sup>\*, Arun Kumar Sangaiah<sup>2</sup> and Dariusz Mrozek<sup>3</sup>*

*<sup>1</sup> Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China, <sup>2</sup> School of Computing Science and Engineering, VIT University, Vellore, India, <sup>3</sup> Institute of Informatics, Silesian University of Technology, Gliwice, Poland*

Keywords: machine learning, gene function prediction, deep learning, ensemble learning, bioinformatics

#### **Editorial on the Research Topic**

#### **Machine Learning Techniques on Gene Function Prediction**

Gene function, including that of coding and noncoding genes, can be difficult to identify in molecular wet laboratories. Therefore, computational methods, often including machine learning, can be a useful tool to guide and predict function. Although machine learning has been considered as a "black box" in the past, it can be more accurate than simple statistical testing methods. In recent years, deep learning and big data machine learning techniques have developed rapidly and achieved an amazing level of performance in many areas, including image classification and speech recognition. This Research Topic explores the potential for machine learning applied to gene function prediction.

#### *Edited and reviewed by:*

*Joao Carlos Setubal, University of São Paulo, Brazil*

> *\*Correspondence: Quan Zou zouquan@nclab.net*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 14 August 2019 Accepted: 05 September 2019 Published: 04 October 2019*

#### *Citation:*

*Zou Q, Sangaiah AK and Mrozek D (2019) Editorial: Machine Learning Techniques on Gene Function Prediction. Front. Genet. 10:938. doi: 10.3389/fgene.2019.00938*

Frontiers in Genetics | www.frontiersin.org October 2019 | Volume 10 | Article 938

We are pleased to see that authors brought the latest machine learning techniques to gene function prediction. Submissions came from an open call for papers and were accepted for publication with the assistance of professional referees. After rigorous review, 46 papers were selected from a total of 72 submissions. They came from different countries and regions, including China, USA, Poland, Taiwan, Korea, Saudi Arabia, and India. We group the papers of this special issue into three subtopics.

The first part of this special issue discusses the gene and disease relationship. Six papers in this part focus on general diseases and propose novel methods to predict associations between diseases and genes, miRNAs, or long noncoding RNAs (lncRNAs). Su et al. proposed a novel method called GPSim to effectively deduce the semantic similarity of diseases. Yu et al. constructed a weighted four-layer disease–disease similarity network to characterize the associations at different levels between diseases. Three papers addressed the miRNA–disease relationship. Qu et al. proposed a novel method to predict miRNA–disease associations based on Locality-constrained Linear Coding. Zhao et al. proposed a novel computational model, SNMFMDA (Symmetric Nonnegative Matrix Factorization for MiRNA-Disease Association prediction), to reveal the relation of miRNA–disease pairs. He et al. proposed NRLMFMDA (neighborhood regularized logistic matrix factorization method for miRNA–disease association prediction), which integrates miRNA functional similarity, disease semantic similarity, Gaussian interaction profile kernel similarity, and experimentally validated disease–miRNA associations. Besides miRNA, one paper addresses lncRNA–disease relationship prediction: Xuan et al. present a method based on dual convolutional neural networks with attention mechanisms for predicting candidate disease lncRNAs.

There are seven papers on cancer and oncogenes. Two papers addressed cancer subtypes. Liu et al. classified muscle-invasive bladder cancer into two conservative subtypes using miRNA, mRNA, and lncRNA expression data; investigated subtype-related biological pathways; and evaluated the subtype classification performance using machine learning methods. Jiang et al. employed spectral clustering and a novel kernel to predict cancer subtypes. Two papers focus on breast cancer. Abou Tabl et al. present a hierarchical machine learning system that predicts the 5-year survivability of patients who underwent specific therapy. Li et al. employed machine learning methods to select 54 novel breast cancer oncogenes and supported their findings with GO and KEGG analyses. Three papers studied other kinds of cancer. Liu et al. found lncRNA LINC00941 to be a potential biomarker of gastric cancer. Gao et al. proposed an ensemble strategy to predict prognosis in ovarian cancer. Guo et al. developed rigorous bioinformatics and statistical procedures to identify tumor-infiltrating bacteria associated with colorectal cancer.

Two papers focused on type 2 diabetes and four papers addressed other diseases. Zhuang et al. employed a two-sample Mendelian randomization method to analyze the causal relationship between interleukin 18 (IL-18) plasma levels and type 2 diabetes using IL-18–related single nucleotide polymorphisms (SNPs) as genetic instrumental variables. Sun et al. established a multilevel comparative framework across three insulin target tissues (white adipose, skeletal muscle, and liver) to provide a better understanding of type 2 diabetes. Zhong et al. identified potential prognostic genes for neuroblastoma. Wang et al. predicted the chronic kidney disease susceptibility gene PRKAG2 by comprehensive bioinformatics analysis. Lu et al. employed the Laplacian heat diffusion algorithm to infer novel genes with functions related to uveitis. Li et al. analyzed the blood gene expression signature for osteoarthritis with advanced feature selection methods.

The second part focuses on gene structure and function prediction. Four papers dealt with gene elements, and two papers studied RNA structure. Oubounyt et al. employed deep learning techniques to predict gene promoter regions. Dao et al. reviewed machine learning methods for detecting DNA replication origins in eukaryotic genomes. Exon skipping is an important issue in gene structure research. Chen, Feng et al. and Chen, Song et al. analyzed the relationship between histone modifications and exon skipping. Two papers investigated RNA secondary structure prediction, which is a classical problem in computational biology. Wang et al. and Zhang et al. employed deep learning to predict RNA secondary structure, especially structures with pseudoknots.

Besides gene structure prediction, four papers focused on gene function prediction, and five papers paid attention to gene identification. Because GO and KEGG already offer rich knowledge on gene function, researchers are turning their attention to noncoding RNA function prediction. Zhang et al. predicted noncoding RNA function with a deep learning network. Zhao and Ma employed Multiple Partial Regularized Nonnegative Matrix Factorization for Predicting Ontological Functions of lncRNAs. Deng et al. proposed an integrated model to infer the gene ontology functions of miRNAs by integrating multiple data sources. Zou et al. predicted enzyme function with hierarchical multilabel deep learning.

There are also five papers on gene identification, expression pattern prediction, and site modification. All of them involve machine learning techniques. Han et al. predicted ion channel genes and their types. Chen et al. paid attention to MADS-box gene classification and clustering. Liu et al. predicted gene expression patterns with a generalized linear regression model. Fu et al. identified microRNA genes with sequence and structure information. Qiang et al. predicted RNA N6-methyladenosine sites with machine learning and sequence features.

The remaining studies were categorized as the third part of our special issue, 12 papers in total. Two papers focus on drugs. Zhu et al. predicted drug–gene interactions with Metapath2vec. Xuan et al. addressed the same problem with gradient boosting decision trees. Four papers studied lncRNA–protein interaction prediction. Xie et al. approached this problem with an improved bipartite network recommender algorithm. Zhan et al. combined sequence and evolutionary information. Zhao et al. employed random walk and a neighborhood regularized logistic matrix factorization approach. Dai et al. paid attention to complex features for ncRNA–protein interaction prediction. Three papers focus on plant research. Qu et al. found effective sequence features for classifying plant pentatricopeptide repeat proteins. Jiang et al. identified rice yield-related candidate genes by walking on the functional network. Zhang et al. mined *Magnaporthe oryzae* sRNAs with potential transboundary regulation of rice genes associated with growth and defense through expression profile analysis of the pathogen-infected rice. Three papers paid attention to RNA-seq data analysis. McDermaid et al. proposed a new machine learning–based framework for mapping uncertainty analysis in RNA-seq read alignment and gene expression estimation. Wang et al. gave a systems analysis of the relationships between anemia and ischemic stroke rehabilitation based on RNA-seq data. Niu et al. developed rSeqTU, which is a machine learning–based R package for predicting bacterial transcription units from RNA-seq data.

To conclude, the papers in this special issue cover several emerging topics of advanced learning techniques and applications for bioinformatics. We sincerely hope this special issue attracts attention in the related fields. We thank the reviewers for their efforts to guarantee the high quality of this special issue. Finally, we thank all the authors who have contributed to this special issue.

# AUTHOR CONTRIBUTIONS

ZQ wrote the manuscript draft. DM helped to revise the text. AKS gave some helpful suggestions.

# FUNDING

The work was supported by the National Key R&D Program of China (2018YFC0910405), the Natural Science Foundation of China (No. 61771331, No. 61922020), Statutory Research funds of Institute of Informatics, Silesian University of Technology, Gliwice, Poland (BK/204/RAU2/2019), and the professorship grant of the Rector of the Silesian University of Technology (02/020/RGPL9/0184).

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Zou, Sangaiah and Mrozek. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# IRWNRLPI: Integrating Random Walk and Neighborhood Regularized Logistic Matrix Factorization for lncRNA-Protein Interaction Prediction

Qi Zhao<sup>1,2</sup>, Yue Zhang<sup>1</sup>, Huan Hu<sup>3</sup>, Guofei Ren<sup>4</sup>, Wen Zhang<sup>5</sup> and Hongsheng Liu<sup>2,3,6</sup>\*

*<sup>1</sup> School of Mathematics, Liaoning University, Shenyang, China, <sup>2</sup> Research Center for Computer Simulating and Information Processing of Bio-Macromolecules of Liaoning Province, Shenyang, China, <sup>3</sup> School of Life Science, Liaoning University, Shenyang, China, <sup>4</sup> School of Information, Liaoning University, Shenyang, China, <sup>5</sup> School of Computer, Wuhan University, Wuhan, China, <sup>6</sup> Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, China*

#### Edited by:

*Quan Zou, Tianjin University, China*

#### Reviewed by:

*Yi Xiong, Shanghai Jiao Tong University, China Yongqiang Xing, Inner Mongolia University of Science and Technology, China*

> \*Correspondence: *Hongsheng Liu liuhongsheng@lnu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *14 May 2018* Accepted: *15 June 2018* Published: *04 July 2018*

#### Citation:

*Zhao Q, Zhang Y, Hu H, Ren G, Zhang W and Liu H (2018) IRWNRLPI: Integrating Random Walk and Neighborhood Regularized Logistic Matrix Factorization for lncRNA-Protein Interaction Prediction. Front. Genet. 9:239. doi: 10.3389/fgene.2018.00239*

Long non-coding RNA (lncRNA) plays an important role in many biological processes and has attracted widespread attention. Although the precise functions and mechanisms of most lncRNAs are still unknown, lncRNAs usually perform their functions by interacting with the corresponding RNA-binding proteins. For example, lncRNA-protein interactions play an important role in post-transcriptional gene regulation, such as splicing, translation, and signaling, and in the progression of complex diseases. However, experimental verification of lncRNA-protein interactions is time-consuming and laborious. In this work, we propose a computational method, named IRWNRLPI, to find potential associations between lncRNAs and proteins. IRWNRLPI integrates two algorithms, random walk and neighborhood regularized logistic matrix factorization, which can achieve better results than using either algorithm alone. Moreover, the method is semi-supervised and does not require negative samples. Based on leave-one-out cross validation, we obtain an AUC of 0.9150 and an AUPR of 0.7138, demonstrating its reliable performance. In addition, in a case study on *Mus musculus*, many lncRNA-protein interactions predicted by our method are confirmed by experiments. This suggests that IRWNRLPI will be a useful bioinformatics resource in biomedical research.

Keywords: lncRNA, protein, interaction prediction, random walk, neighborhood regularized logistic matrix factorization, integration method

# INTRODUCTION

Many studies have indicated that more than 90% of human DNA is transcribed into RNA, the vast majority of which is non-coding RNA. Non-coding RNA (ncRNA) is an RNA that does not encode a protein, and it plays a very broad regulatory role in the life activities of many organisms. Abundant and functionally important types of non-coding RNAs include transfer RNA (tRNA) and ribosomal RNA (rRNA), small RNAs such as microRNAs, siRNAs, piRNAs, snoRNAs, snRNAs, exRNAs, and scaRNAs, and the long non-coding RNAs. Long non-coding RNA (lncRNA) refers to ncRNA longer than 200 nucleotides. LncRNA was originally considered "noise" of genomic transcription, a byproduct of RNA polymerase II transcription without biological function. However, recent studies indicate that lncRNA is involved in a variety of important regulatory processes, such as chromatin modification (Guttman et al., 2009), cell differentiation and proliferation (Wapinski and Chang, 2011), RNA processing (Wilusz et al., 2009), and cellular apoptosis (Yu et al., 2015). These regulatory effects of lncRNAs, and their links to abnormal gene expression in cells, have begun to attract widespread attention. In addition, more and more experiments demonstrate that lncRNAs are involved in the regulation of a variety of physiological and pathological processes, as well as in the development of a variety of diseases including tumors (Wilusz et al., 2009; Harries, 2012; Chen and Yan, 2013; Morlando et al., 2014; Chen et al., 2015, 2016c, 2017d, 2018a,b; Yu et al., 2015; Chen and Huang, 2017b; Li et al., 2017; You et al., 2017). For instance, Gupta et al. reported an increase in the expression of lncRNA HOTAIR in primary breast tumors (Gupta et al., 2010). Along with the growth of bioinformatics, many lncRNAs have been discovered, some of which have been studied or are being studied. However, the functionality of most lncRNAs remains unknown.
Most lncRNAs exert their function through interaction with the corresponding RNA-binding proteins. Although some RNA-binding proteins have been identified in the human genome and their number is growing steadily (Cook et al., 2011; Ray et al., 2013), the associations between lncRNAs and proteins and their function in the post-transcriptional regulatory network are not fully understood (Mittal et al., 2009; Kishore et al., 2010). Moreover, the experimental identification of lncRNA-protein associations is time-consuming, laborious and costly, so it is necessary to develop effective computational prediction methods.

At present, computational models are broadly used in bioinformatics, for example in lncRNA-disease interaction prediction (Zeng et al., 2015; Chen et al., 2016b,d,e, 2017c,d; Huang et al., 2016; Li et al., 2016; Liu et al., 2016; Zhao et al., 2016a; Zou et al., 2016; Zhang et al., 2017a,b; Hu et al., 2018; Tang et al., 2018). However, only a few models can be used to predict lncRNA-protein associations. For example, Bellucci et al. (2011) proposed catRAPID, which encodes each lncRNA-protein pair as a feature vector combining secondary structure with the hydrogen bonding and van der Waals forces between the lncRNA and the protein. Later, Muppirala et al. (2011) developed RPISeq, which uses only lncRNA and protein sequences and applies a support vector machine (SVM) classifier (Hearst, 1998) and a random forest (RF) (Liaw and Wiener, 2002) to predict the interactions between lncRNAs and proteins. Wang et al. presented a model that uses the same dataset as Muppirala et al. and similar data features; it is based on Naive Bayes (NB) and Extended NB (ENB) classifiers. In 2015, Suresh et al. proposed RPI-Pred (Suresh et al., 2015), an SVM-based method that uses the sequences and structures of lncRNAs and proteins together with high-order 3D structural features of proteins. In the same year, a method based on heterogeneous networks, called LPIHN, was proposed by Li et al. (2015). They predicted new lncRNA-protein associations by implementing a random walk with restart (RWR) on a constructed heterogeneous network. In a more recent study, Ge et al. (2016) introduced a bipartite network approach, named LPBNI. They carried out a resource allocation procedure in the lncRNA-protein bipartite network to score candidate proteins for each lncRNA and thereby predict interactions that have not yet been observed. Lately, Hu et al. (2017) advanced a semi-supervised method called LPI-ETSLP to reveal lncRNA-protein associations. In particular, LPI-ETSLP does not require negative samples.
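The random walk with restart mentioned above can be illustrated on a toy graph. The following is a minimal sketch only, not the LPIHN implementation: the four-node path graph, the restart probability of 0.5, and the function name `random_walk_with_restart` are all assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    `adjacency` is a square list-of-lists; `seed` is the index of the start node.
    """
    n = len(adjacency)
    # Column-normalize the adjacency matrix so each column sums to 1.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        # Stop when the L1 change between successive iterates is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy graph: an undirected path 0 - 1 - 2 - 3.
toy = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(toy, seed=0)
```

In this toy setting the scores decay with distance from the seed node, which is the property that network-based predictors rely on when ranking candidate partners.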

There are several problems with these methods: (1) Most of the models mentioned above do not use lncRNA-protein interaction data, but are trained on RNA-protein interaction data. This limits their ability to predict lncRNA-protein associations. (2) Some of the models use the NPInter (Yuan et al., 2014; Hao et al., 2016) database to predict the interactions between lncRNAs and proteins. Although NPInter is by far the best lncRNA-protein database, it provides only entries for interactions between lncRNA genes and proteins, and does not directly provide entries for lncRNA-protein interactions. If these models are built directly on the lncRNA gene-protein interactions, the prediction results will certainly be affected. (3) Finally, although research on and understanding of lncRNA-protein interactions are increasing, there are not yet enough negative samples, and it is hard to choose lncRNA and protein features. In order to solve these problems, we integrate random walk and neighborhood regularized logistic matrix factorization to develop a new model called IRWNRLPI. The model uses known lncRNA-protein associations, a protein similarity network and an lncRNA similarity network to predict possible lncRNA-protein associations. Unlike traditional machine learning methods, IRWNRLPI uses semi-supervised learning to derive unknown information primarily from known associations and their similarities, so it does not need negative samples. In addition, our model assigns high importance to the nearest neighbors, thus avoiding noisy information. We implement leave-one-out cross validation (LOOCV) on IRWNRLPI to evaluate its performance, obtaining an AUC of 0.9150 and an AUPR of 0.7138, which indicates that the model has reliable performance.
Moreover, in the case study, we predict the lncRNA-protein associations of *Mus musculus* and rank them by predicted score, demonstrating that our method is generally effective.
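To make the two reported metrics concrete, the sketch below computes an AUC and an AUPR-style score from labels and prediction scores. It is an illustration under stated assumptions: the labels and scores are hypothetical, the AUC is computed as the fraction of correctly ordered positive/negative pairs, and average precision is used as one common estimator of the area under the precision-recall curve, which may differ from the estimator used by the authors.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical held-out pairs: 1 = known interaction, 0 = unknown pair.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc_value = auc(labels, scores)
ap_value = average_precision(labels, scores)
```

In a leave-one-out setting, each known interaction would in turn be hidden, scored against the unknown pairs, and the resulting rankings pooled before computing such metrics.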

### MATERIALS AND METHODS

#### Dataset

Along with the development of bioinformatics, a number of public databases have become available for studying lncRNA-protein interactions. The NPInter database includes experimentally verified interactions between non-coding RNAs and other biomolecules (proteins, RNA and genomic DNA). NONCODE (Xie et al., 2014; Zhao et al., 2016b), a comprehensive annotation database, covers all types of non-coding RNA (not including tRNA and rRNA). The UniProt database (Consortium, 2015; Pundir et al., 2016) provides protein sequences. With these databases, we can acquire the lncRNA and protein datasets needed for our research.

From NPInter v2.0, we extract the entries for human lncRNAs. We obtain 4870 experimentally identified lncRNA-protein associations, covering 1114 lncRNAs and 96 proteins. We obtain lncRNA sequence information from NONCODE 4.0 and protein sequence information from UniProt. Further, we remove proteins and lncRNAs for which no sequence information is available. In addition, we delete lncRNAs associated with only one protein and proteins associated with only one lncRNA. These data are low-similarity pairs and potential noise, and removing them helps improve the performance of the model. Finally, we construct a dataset containing 4158 lncRNA-protein associations, involving 990 lncRNAs and 27 proteins.
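The degree-based filtering step described above can be sketched as follows. This is an illustration only: the pair identifiers are hypothetical, and a single filtering pass over the original degrees is an assumption, since the text does not state whether the filtering is repeated until no low-degree nodes remain.

```python
from collections import Counter


def filter_low_degree(pairs):
    """Drop pairs whose lncRNA or protein occurs in only one association."""
    unique_pairs = sorted(set(pairs))  # remove duplicated associations first
    lnc_degree = Counter(lnc for lnc, _ in unique_pairs)
    prot_degree = Counter(prot for _, prot in unique_pairs)
    return [(lnc, prot) for lnc, prot in unique_pairs
            if lnc_degree[lnc] > 1 and prot_degree[prot] > 1]


# Hypothetical associations: l3 and l4 each bind one protein, p3 binds one lncRNA.
pairs = [
    ("l1", "p1"), ("l1", "p2"),
    ("l2", "p1"), ("l2", "p2"),
    ("l3", "p1"),
    ("l4", "p3"),
]
kept = filter_low_degree(pairs)
```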

#### LncRNA-Protein Interaction Matrix

To facilitate the description of lncRNA-protein interactions and of the algorithmic model, we denote by Y the adjacency matrix of lncRNA-protein interactions: if lncRNA l(i) interacts with protein p(j), Y(l(i), p(j)) is 1; otherwise it is 0. The interactions between lncRNAs and proteins are then assessed with the help of the sequence similarity matrices. We screen out lncRNA and protein sequences that are of inferior quality or whose corresponding proteins or lncRNAs cannot be found. Inferior quality refers to incomplete sequence information and duplicated lncRNA or protein sequences. Finally, 4158 high-quality lncRNA-protein associations are obtained.
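As an illustration, the adjacency matrix Y can be built from a list of known pairs as in the sketch below. The helper name `build_interaction_matrix` and the toy identifiers are assumptions made for this example; rows index lncRNAs and columns index proteins.

```python
import numpy as np

def build_interaction_matrix(pairs):
    """Return the 0/1 matrix Y plus the row (lncRNA) and column (protein) labels."""
    lncs = sorted({l for l, _ in pairs})
    prots = sorted({p for _, p in pairs})
    l_idx = {l: i for i, l in enumerate(lncs)}
    p_idx = {p: j for j, p in enumerate(prots)}
    Y = np.zeros((len(lncs), len(prots)), dtype=int)
    for l, p in pairs:
        Y[l_idx[l], p_idx[p]] = 1   # known interaction
    return Y, lncs, prots

# Toy identifiers only.
Y, lncs, prots = build_interaction_matrix(
    [("lnc1", "P1"), ("lnc1", "P2"), ("lnc2", "P2")])
```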

#### LncRNA Sequence Similarity Matrix

In our work, we calculate lncRNA similarity from lncRNA sequence information, which is acquired from the NONCODE 4.0 database. After filtering, we obtain 990 credible lncRNA sequences. The regularized Smith-Waterman algorithm (Pearson, 1991) is used to compute lncRNA sequence similarity. Thus, the lncRNA sequence similarity matrix LS is built, where the entry LS(l(i), l(j)) indicates the sequence similarity between lncRNAs l(i) and l(j). LS is normalized as below:

$$LS(l\,\left(i\right), l\,\left(j\right)) = \frac{sw(l\,\left(i\right), l\,\left(j\right))}{max(sw(l\,\left(i\right), l\,\left(i\right)), sw(l\,\left(j\right), l\,\left(j\right)))}$$

Where sw(l(i), l(j)) is the sequence similarity between lncRNAs l(i) and l(j) calculated according to the Smith-Waterman algorithm.

#### Protein Sequence Similarity Matrix

On the basis of the lncRNA-protein network, we select 27 dependable protein sequences, all of which come from UniProt (Consortium, 2015; Pundir et al., 2016). Similarly, protein sequence similarity is calculated using the regularized Smith-Waterman algorithm. We then construct a protein sequence similarity matrix PS, in which the entry PS(p(i), p(j)) expresses the sequence similarity between proteins p(i) and p(j). PS is normalized as below:

$$PS(p\,\left(i\right),p\,\left(j\right)) = \frac{sw(p\,\left(i\right),p\,\left(j\right))}{\max\left(sw\left(p\,\left(i\right),p\left(i\right)\right),sw\left(p\,\left(j\right),p\left(j\right)\right)\right)}$$

Where sw(p(i), p(j)) is the sequence similarity between proteins p(i) and p(j) calculated according to the Smith-Waterman algorithm.
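The normalization used for both LS and PS divides each pairwise score by the larger of the two self-alignment scores. A minimal sketch, assuming the raw Smith-Waterman scores are already available as a square matrix (the toy numbers and the name `normalize_similarity` are illustrative only):

```python
import numpy as np

def normalize_similarity(sw):
    """Divide sw[i, j] by max(sw[i, i], sw[j, j]), as in the LS and PS formulas."""
    sw = np.asarray(sw, dtype=float)
    self_scores = np.diag(sw)                            # sw(x, x) for every sequence
    denom = np.maximum.outer(self_scores, self_scores)   # pairwise max of self-scores
    return sw / denom

raw = np.array([[10.0, 4.0],
                [4.0, 8.0]])     # toy raw alignment scores
S = normalize_similarity(raw)
```

Here the off-diagonal entry becomes 4 / max(10, 8) = 0.4 and each diagonal entry becomes 1.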

#### Work Flow

The workflow of our IRWNRLPI model is given in **Figure 1**. The procedure for predicting lncRNA-protein interactions consists of four steps. (1) First, we extract the gene-protein pair information from NPInter v2.0 to obtain the interaction matrix between lncRNAs and proteins. (2) Second, based on the gene-protein pairs, we extract lncRNA sequences from NONCODE and protein sequences from UniProt. (3) Next, we screen out the lncRNAs for which NONCODE has no relevant information and the proteins for which UniProt has no corresponding information. We then employ the regularized Smith-Waterman algorithm to compute the similarity of lncRNA sequences and of protein sequences, generating the corresponding lncRNA and protein similarity matrices. (4) Last, we apply the three matrices obtained above to the random walk algorithm and to the neighborhood regularized logistic matrix factorization algorithm, each of which yields a score matrix of potential lncRNA-protein interactions, and then feed these two score matrices into the IRWNRLPI integration model to obtain the final lncRNA-protein association score matrix. This completes the prediction process for obtaining new lncRNA-protein associations.

### IRWNRLPI

The flowchart of this section is given in **Figure 2**. The upper two parts of **Figure 2** show the main flow of the random walk method and of the neighborhood regularized logistic matrix factorization method, respectively. The left box shows the four steps of the random walk, which finally yield the lncRNA-protein score matrix S<sub>R</sub>. The right box shows the neighborhood regularization process, which finally yields the lncRNA-protein score matrix S<sub>N</sub>. The bottom of **Figure 2** shows how the final lncRNA-protein score matrix S is obtained by integrating the two methods.

#### Random Walk

In the random walk model, given a protein p, the process of predicting the lncRNAs associated with p is modeled as a random walk on a weighted graph G. The process can be roughly divided into four steps.

In the first step, the lncRNA network is established based on the sequence similarity between lncRNAs. For a given protein p, the known lncRNAs associated with p, the candidate lncRNAs, and the relations among them form a network, expressed as a weighted graph G(V, E, W). Each vertex v ∈ V represents a known or candidate lncRNA associated with p. Each edge e ∈ E denotes the relationship between the two vertices it connects. We denote the sequence similarity between v<sub>x</sub> and v<sub>y</sub> as Sim(v<sub>x</sub>, v<sub>y</sub>), and the weight w of the edge e between them is Sim(v<sub>x</sub>, v<sub>y</sub>). The greater w is, the more likely it is that the two vertices are associated with a set of similar proteins. In this network, a known lncRNA associated with p is called a labeled node. The remaining lncRNAs, for which there is so far no evidence of a relation to p, are unlabeled nodes.

In the second step, we construct the correlation matrix R and use it to establish two one-step transition matrices, L<sub>Q</sub> and L<sub>U</sub>. First of all, we construct R. For a vertex v<sub>i</sub>, we evaluate the extent of relevance between its neighbor v<sub>j</sub> and p, which is denoted by r<sub>ij</sub>. Firstly, let Q denote the set of all labeled nodes and suppose v<sub>i</sub> ∈ Q. Since v<sub>i</sub> is relevant to protein p, its neighbors may also be relevant to p. Moreover, the association probability passed on by a labeled node is greater than that passed on by an unlabeled node. Thus, the former is multiplied by w<sub>Q</sub> ∈ (0,1) and the latter by w<sub>U</sub> ∈ (0,1), with w<sub>Q</sub> higher than w<sub>U</sub>. Secondly, let U denote the set of all unlabeled nodes, which may be associated with p, and suppose v<sub>i</sub> ∈ U. If v<sub>i</sub> is related to p, its neighbors may also be associated with p; the weight of the association information from an unlabeled node is w<sub>U</sub>. Thirdly, if two lncRNAs v<sub>i</sub> and v<sub>j</sub> are not connected, r<sub>ij</sub> is set to 0. Finally, the relevance of an lncRNA to itself is set to 0.

R = (r<sub>ij</sub>)<sub>M×M</sub> is constructed on the basis of the above rules, and r<sub>ij</sub> is formally defined as follows:

$$r_{ij} = \begin{cases} \text{Sim}\left(v_i, v_j\right) \cdot w_Q, & v_i \in Q,\ \left(v_i, v_j\right) \in E \\ \text{Sim}\left(v_i, v_j\right) \cdot w_U, & v_i \in U,\ \left(v_i, v_j\right) \in E \\ 0, & \left(v_i, v_j\right) \notin E \text{ or } v_i = v_j \end{cases} \tag{1}$$

In which v<sub>i</sub> is a vertex and v<sub>j</sub> is one of its neighbors.
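Equation (1) can be sketched in a few lines of NumPy. The weight values `w_q = 0.9` and `w_u = 0.1` are placeholders, since the text only requires w<sub>Q</sub> > w<sub>U</sub> in (0,1), and the sketch assumes that an edge exists wherever the similarity is nonzero.

```python
import numpy as np

def correlation_matrix(sim, labeled, w_q=0.9, w_u=0.1):
    """Build R: row i is scaled by w_q if v_i is labeled, by w_u otherwise."""
    sim = np.asarray(sim, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    weights = np.where(labeled, w_q, w_u)   # one weight per source vertex v_i
    R = sim * weights[:, None]              # zero similarity (no edge) stays zero
    np.fill_diagonal(R, 0.0)                # relevance of a vertex to itself is 0
    return R

sim = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.2],
                [0.0, 0.2, 1.0]])           # toy lncRNA similarities
R = correlation_matrix(sim, labeled=[True, False, False])
```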

Then, we construct the transition matrix L = (l<sub>ij</sub>)<sub>M×M</sub>. The transition probability l<sub>ij</sub> is proportional to r<sub>ij</sub>. Normalizing each row of R as follows gives the one-step transition probability matrix L:

$$l_{ij} = r_{ij} \Big/ \sum_{k=1}^{M} r_{ik} \tag{2}$$

Here l<sub>ij</sub> indicates the transition probability from v<sub>i</sub> to v<sub>j</sub>. However, once the rows of R are normalized, the weights w<sub>Q</sub> and w<sub>U</sub>, which distinguish the association information of labeled nodes from that of unlabeled nodes, cancel out, so the prior information about whether a vertex is relevant to p is lost. To resolve this difficulty, we split the matrix L into two matrices, L<sub>Q</sub> and L<sub>U</sub>. L<sub>Q</sub> is the transition matrix of the labeled nodes, and L<sub>U</sub> is the transition matrix of the unlabeled nodes. The rows of L<sub>Q</sub> (L<sub>U</sub>) that correspond to labeled (unlabeled) nodes equal the corresponding rows of L, and the remaining rows of L<sub>Q</sub> (L<sub>U</sub>) are set to 0.
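The row normalization and the split into the two transition matrices can be sketched as below. The function name and the toy matrix are assumptions for illustration; rows whose sum is zero are left as zero rows, a choice the text does not discuss.

```python
import numpy as np

def transition_matrices(R, labeled):
    """Row-normalize R into L, then split L by whether the source vertex is labeled."""
    R = np.asarray(R, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    row_sums = R.sum(axis=1, keepdims=True)
    L = np.divide(R, row_sums, out=np.zeros_like(R), where=row_sums > 0)
    L_Q = np.where(labeled[:, None], L, 0.0)   # keep rows of labeled vertices
    L_U = np.where(labeled[:, None], 0.0, L)   # keep rows of unlabeled vertices
    return L, L_Q, L_U

R = np.array([[0.00, 0.45, 0.00],
              [0.05, 0.00, 0.02],
              [0.00, 0.02, 0.00]])             # toy correlation matrix
L, L_Q, L_U = transition_matrices(R, labeled=[True, False, False])
```

Because each row of R carries a single common factor (w<sub>Q</sub> or w<sub>U</sub>), that factor cancels in the division, which is the loss of information described above.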

In the third step, a new prediction method based on random walk is established to evaluate the correlation score between each unlabeled node and p, that is, to estimate the correlation scores of the candidate lncRNAs. Based on the transition matrices L<sub>Q</sub> and L<sub>U</sub>, the prediction method is defined as below:

$$S(t+1) = r\_Q L\_Q^T S(t) + p\_Q \left(1 - r\_Q\right) X + r\_U L\_U^T S(t) + p\_U (1 - r\_U) X \tag{3}$$

First, S(t + 1) is a probability vector whose ith component S<sub>i</sub>(t + 1) is the probability that the walker reaches the ith vertex at time t + 1. The walker starts from the labeled nodes: the components of S(0) are the initial probabilities, meaning that at time 0 the walker starts from each labeled node with the same probability. S<sub>i</sub>(0) is calculated according to the following formula:

$$S\_i(0) = \begin{cases} \frac{1}{|Q|} & \text{if } \nu\_i \in Q \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Second, to use prior information, we assign weights r<sub>Q</sub> and r<sub>U</sub> (0 < r<sub>U</sub> < r<sub>Q</sub> < 1) to the labeled nodes and the unlabeled nodes, respectively. In effect, r<sub>Q</sub> and r<sub>U</sub> restore the role of w<sub>Q</sub> and w<sub>U</sub> that was lost in the normalization. Finally, when the walker is at a labeled node, at time t + 1 it returns to the initial vertices (labeled nodes) with probability p<sub>Q</sub>(1 − r<sub>Q</sub>) and starts walking again. Here p<sub>Q</sub> is the total probability that the walker is at a labeled node at time t. The formula is as follows:

$$p\_Q = \sum\_{\nu\_i \in Q} S\_i(t) \tag{5}$$

Likewise, when the walker is at an unlabeled node, at the next time step it returns to the starting vertices with probability p<sub>U</sub>(1 − r<sub>U</sub>). Here p<sub>U</sub> is the total probability that the walker is at an unlabeled node at time t, which equals 1 − p<sub>Q</sub>. X defines the nodes to which the walker returns and from which it restarts. Since the walker starts from the labeled nodes, X is equal to S(0).

The fourth step is to rank all unlabeled nodes and choose potential candidates. The walker starts from the labeled nodes, and the iteration proceeds until the convergence condition is satisfied, namely that the L1-norm of the difference between S(t) and S(t + 1) is less than 10<sup>−10</sup>. The correlation score of an unlabeled node is defined as the steady-state probability that the walker stays at that vertex. In this way, every unlabeled node receives a correlation score, and we rank the nodes by score. The greater the score, the more likely it is that the unlabeled node is associated with the given protein p. The score matrix obtained in this part is denoted by S<sub>R</sub>, in which S<sub>R</sub>(l(i), p(j)) is the possibility of association between lncRNA l(i) and protein p(j).
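The iteration of Equations (3) to (5) for a single protein can be sketched as follows. This is an illustrative reading of the update rule, not the authors' code: the values `r_q = 0.8` and `r_u = 0.4` are placeholders (the text only requires 0 < r<sub>U</sub> < r<sub>Q</sub> < 1), and the toy transition matrices are made up.

```python
import numpy as np

def random_walk_scores(L_Q, L_U, labeled, r_q=0.8, r_u=0.4, tol=1e-10, max_iter=10000):
    """Iterate Equation (3) until the L1 change between steps falls below tol."""
    labeled = np.asarray(labeled, dtype=bool)
    X = labeled.astype(float) / labeled.sum()   # S(0): uniform over labeled nodes, Eq. (4)
    S = X.copy()
    for _ in range(max_iter):
        p_q = S[labeled].sum()                  # Eq. (5): mass on labeled nodes
        p_u = 1.0 - p_q                         # mass on unlabeled nodes
        S_next = (r_q * L_Q.T @ S + p_q * (1.0 - r_q) * X
                  + r_u * L_U.T @ S + p_u * (1.0 - r_u) * X)
        if np.abs(S_next - S).sum() < tol:
            return S_next
        S = S_next
    return S

# Toy example: vertex 0 is labeled, vertices 1 and 2 are candidates.
labeled = np.array([True, False, False])
L = np.array([[0.0, 1.0, 0.0],
              [5 / 7, 0.0, 2 / 7],
              [0.0, 1.0, 0.0]])
L_Q = np.where(labeled[:, None], L, 0.0)
L_U = np.where(labeled[:, None], 0.0, L)
scores = random_walk_scores(L_Q, L_U, labeled)
```

In this toy graph the candidate directly connected to the labeled vertex (vertex 1) ends up with a higher steady-state score than the more distant vertex 2.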

#### Neighborhood Regularized Logistic Matrix Factorization

Here we explain the neighborhood regularized logistic matrix factorization method. First, lncRNAs and proteins are mapped to a shared latent space of dimension r, with r << min(m, n), where m and n are the numbers of lncRNAs and proteins. The vectors u<sub>i</sub> ∈ R<sup>1×r</sup> and v<sub>j</sub> ∈ R<sup>1×r</sup> represent the latent features of lncRNA l<sub>i</sub> and protein p<sub>j</sub>, respectively. The following formula is used to calculate the association probability p<sub>ij</sub> of the lncRNA-protein pair (l<sub>i</sub>, p<sub>j</sub>):

$$p\_{ij} = \frac{\exp(\boldsymbol{u}\_i \boldsymbol{\nu}\_j^T)}{1 + \exp(\boldsymbol{u}\_i \boldsymbol{\nu}\_j^T)}\tag{6}$$

For simplicity, we use U ∈ R<sup>m×r</sup> and V ∈ R<sup>n×r</sup> to denote the latent vectors of all lncRNAs and all proteins, respectively.
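In matrix form, Equation (6) is the logistic function applied elementwise to UV<sup>T</sup>. A minimal sketch with made-up latent vectors (the function name is an assumption):

```python
import numpy as np

def interaction_probabilities(U, V):
    """P[i, j] = exp(u_i v_j^T) / (1 + exp(u_i v_j^T)), computed as a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(U @ V.T)))

U = np.zeros((3, 2))   # 3 toy lncRNAs, latent dimension r = 2
V = np.ones((4, 2))    # 4 toy proteins
P = interaction_probabilities(U, V)
```

With all-zero lncRNA vectors every inner product is 0, so every probability equals 0.5.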

To make our model more efficient and more accurate for lncRNA-protein interaction prediction, we give positive samples a higher level of importance than negative samples (Johnson, 2014; Liu et al., 2014): each positive sample is given a weight c, and each negative sample a weight of 1.

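Assuming that all training samples are independent, the likelihood is as follows:

$$p\left(Y|U,V\right) = \left(\prod_{1 \le i \le m,\, 1 \le j \le n,\, y_{ij}=1} \left[p_{ij}^{y_{ij}} \left(1 - p_{ij}\right)^{\left(1 - y_{ij}\right)}\right]^{c}\right) \times \left(\prod_{1 \le i \le m,\, 1 \le j \le n,\, y_{ij}=0} p_{ij}^{y_{ij}} \left(1 - p_{ij}\right)^{\left(1 - y_{ij}\right)}\right) \tag{7}$$

Note that when y<sub>ij</sub> = 1, c(1 − y<sub>ij</sub>) = 1 − y<sub>ij</sub>, and when y<sub>ij</sub> = 0, cy<sub>ij</sub> = y<sub>ij</sub>. So, we rewrite formula (7) as follows:

$$\begin{aligned} p\left(Y|U,V\right) &= \left(\prod_{1 \le i \le m,\, 1 \le j \le n,\, y_{ij}=1} p_{ij}^{c y_{ij}} \left(1 - p_{ij}\right)^{\left(1 - y_{ij}\right)}\right) \times \left(\prod_{1 \le i \le m,\, 1 \le j \le n,\, y_{ij}=0} p_{ij}^{c y_{ij}} \left(1 - p_{ij}\right)^{\left(1 - y_{ij}\right)}\right) \\ &= \prod_{i=1}^{m} \prod_{j=1}^{n} p_{ij}^{c y_{ij}} \left(1 - p_{ij}\right)^{\left(1 - y_{ij}\right)} \end{aligned} \tag{8}$$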

In addition, we place zero-mean spherical Gaussian priors on the latent vectors of lncRNAs and proteins:

$$p\left(U|\sigma_l^2\right) = \prod_{i=1}^m N(u_i|0, \sigma_l^2 I), \quad p\left(V\middle|\sigma_p^2\right) = \prod_{j=1}^n N(v_j|0, \sigma_p^2 I) \tag{9}$$

Here, σ<sub>l</sub><sup>2</sup> and σ<sub>p</sub><sup>2</sup> control the variances of the Gaussian distributions, and I is the identity matrix. So, through Bayesian inference, we have:

$$\operatorname{p}\left(U, V\middle|Y, \sigma\_l^2, \sigma\_p^2\right) \propto \operatorname{p}\left(Y|U, V\right)\operatorname{p}\left(U|\sigma\_l^2\right)\operatorname{p}\left(V\middle|\sigma\_p^2\right) \tag{10}$$

Thus, the logarithm of the posterior distribution is as below:

$$\begin{split} \log p\left(U, V\middle|Y, \sigma_{l}^{2}, \sigma_{p}^{2}\right) &= \sum_{i=1}^{m} \sum_{j=1}^{n} \left\{ c y_{ij} u_{i} v_{j}^{T} - \left(1 + c y_{ij} - y_{ij}\right) \log\left[1 + \exp\left(u_{i} v_{j}^{T}\right)\right] \right\} \\ &- \frac{1}{2\sigma_{l}^{2}} \sum_{i=1}^{m} ||u_{i}||_{2}^{2} - \frac{1}{2\sigma_{p}^{2}} \sum_{j=1}^{n} ||v_{j}||_{2}^{2} + C \end{split} \tag{11}$$

Where C is a constant term that does not depend on the model parameters. Maximizing the posterior distribution is equivalent to minimizing the following objective function:

$$\min_{U,V} \sum_{i=1}^{m} \sum_{j=1}^{n} \left\{ \left( 1 + c y_{ij} - y_{ij} \right) \log \left[ 1 + \exp \left( u_{i} v_{j}^{T} \right) \right] - c y_{ij} u_{i} v_{j}^{T} \right\} + \frac{\lambda_{l}}{2} ||U||_{F}^{2} + \frac{\lambda_{p}}{2} ||V||_{F}^{2} \tag{12}$$

Where λ<sub>l</sub> = 1/σ<sub>l</sub><sup>2</sup>, λ<sub>p</sub> = 1/σ<sub>p</sub><sup>2</sup>, and || • ||<sub>F</sub> denotes the Frobenius norm of a matrix. The optimization problem in Equation (12) can be solved by an alternating gradient descent method (Johnson, 2014).
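As a sanity check on Equation (12), the objective can be evaluated directly. The sketch below is illustrative only: the default values of `c`, `lam_l`, and `lam_p` are placeholders, not the settings used in the experiments.

```python
import numpy as np

def lmf_objective(U, V, Y, c=5.0, lam_l=0.5, lam_p=0.5):
    """Weighted logistic loss plus Frobenius-norm regularization, as in Equation (12)."""
    Z = U @ V.T                              # inner products u_i v_j^T
    log_term = np.logaddexp(0.0, Z)          # log(1 + exp(z)), numerically stable
    data_fit = ((1.0 + c * Y - Y) * log_term - c * Y * Z).sum()
    reg = 0.5 * lam_l * (U ** 2).sum() + 0.5 * lam_p * (V ** 2).sum()
    return data_fit + reg

Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])                   # toy interaction matrix
value = lmf_objective(np.zeros((2, 3)), np.zeros((2, 3)), Y)
```

With all latent vectors at zero, each positive pair contributes c·log 2 and each negative pair contributes log 2, so the toy value is (5 + 1 + 1 + 5)·log 2.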

By mapping lncRNAs and proteins to a shared latent space, the logistic matrix factorization method can effectively capture the global structure of the lncRNA-protein interaction information. In addition, we use the neighbors of lncRNAs and proteins to further improve prediction accuracy. For lncRNA l<sub>i</sub>, we denote the nearest neighbor set by N(l<sub>i</sub>) ⊆ L \ {l<sub>i</sub>}, where N(l<sub>i</sub>) consists of the K<sub>1</sub> lncRNAs most similar to l<sub>i</sub>. Likewise, we construct the set N(p<sub>j</sub>) ⊆ P \ {p<sub>j</sub>}, which consists of the K<sub>1</sub> proteins most similar to p<sub>j</sub>. In the experiment, we empirically set K<sub>1</sub> to 5.

Here, the lncRNA neighborhood information is represented by the adjacency matrix A, whose entry a<sub>iµ</sub> is defined as below:

$$a\_{i\mu} = \begin{cases} s\_{i\mu}^l \text{ if } l\_{\mu} \in N(l\_i) \\ 0 \text{ otherwise} \end{cases} \tag{13}$$

The protein neighborhood information is described by the adjacency matrix B, whose entry b<sub>jν</sub> is defined as below:

$$b_{j\nu} = \begin{cases} s_{j\nu}^{p} & \text{if } p_{\nu} \in N(p_j) \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

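It should be noted that the matrices A and B are not symmetric in general, because l<sub>µ</sub> being among the nearest neighbors of l<sub>i</sub> does not imply that l<sub>i</sub> is among the nearest neighbors of l<sub>µ</sub>.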
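The construction of the neighborhood matrices in Equations (13) and (14) can be sketched as follows, assuming the similarity entries are taken from a sequence similarity matrix such as LS or PS. The helper name `knn_adjacency` and the toy similarities are illustrative.

```python
import numpy as np

def knn_adjacency(sim, k):
    """Keep, in each row, only the similarities to the k most similar other items."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    A = np.zeros_like(sim)
    for i in range(n):
        candidates = [j for j in range(n) if j != i]          # exclude the item itself
        nearest = sorted(candidates, key=lambda j: sim[i, j], reverse=True)[:k]
        A[i, nearest] = sim[i, nearest]
    return A

sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.8],
                [0.1, 0.8, 1.0]])
A = knn_adjacency(sim, k=1)
```

With k = 1, item 2 selects item 1 as its nearest neighbor, but item 1 selects item 0, so A[2, 1] is nonzero while A[1, 2] is zero.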

The main idea of exploiting lncRNA neighborhood information for predicting lncRNA-protein interactions is to minimize the distances between l<sub>i</sub> and its nearest neighbors N(l<sub>i</sub>) in the latent space, which can be achieved by minimizing the following objective function:

$$\begin{aligned} \frac{\alpha}{2} \sum_{i=1}^{m} \sum_{\mu=1}^{m} a_{i\mu} ||u_i - u_{\mu}||_F^2 &= \frac{\alpha}{2} \left[ \sum_{i=1}^{m} \left( \sum_{\mu=1}^{m} a_{i\mu} \right) u_i u_i^T + \sum_{\mu=1}^{m} \left( \sum_{i=1}^{m} a_{i\mu} \right) u_{\mu} u_{\mu}^T \right] \\ &\quad - \frac{\alpha}{2} tr \left( U^T A U \right) - \frac{\alpha}{2} tr \left( U^T A^T U \right) \\ &= \frac{\alpha}{2} tr \left( U^T L^l U \right) \end{aligned} \tag{15}$$

Here, tr(•) is the matrix trace and L<sup>l</sup> = D<sup>l</sup> + D̃<sup>l</sup> − (A + A<sup>T</sup>). D<sup>l</sup> and D̃<sup>l</sup> are two diagonal matrices whose diagonal elements are D<sup>l</sup><sub>ii</sub> = Σ<sub>µ=1</sub><sup>m</sup> a<sub>iµ</sub> and D̃<sup>l</sup><sub>µµ</sub> = Σ<sub>i=1</sub><sup>m</sup> a<sub>iµ</sub>, respectively. Similarly, to exploit the neighborhood information of proteins for lncRNA-protein interaction prediction, we minimize the following objective function:

$$\frac{\beta}{2} \sum_{j=1}^{n} \sum_{\nu=1}^{n} b_{j\nu} ||v_j - v_\nu||_F^2 = \frac{\beta}{2} tr\left(V^T L^p V\right) \tag{16}$$

Here, L<sup>p</sup> = D<sup>p</sup> + D̃<sup>p</sup> − (B + B<sup>T</sup>), and D<sup>p</sup> and D̃<sup>p</sup> are two diagonal matrices whose diagonal elements are D<sup>p</sup><sub>jj</sub> = Σ<sub>ν=1</sub><sup>n</sup> b<sub>jν</sub> and D̃<sup>p</sup><sub>νν</sub> = Σ<sub>j=1</sub><sup>n</sup> b<sub>jν</sub>, respectively.
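The trace identity behind Equations (15) and (16) can be checked numerically. The sketch below builds the matrix D + D̃ − (A + A<sup>T</sup>) from an arbitrary nonnegative, asymmetric A and compares the pairwise-distance sum with the trace form (the factor α/2 is omitted on both sides). All inputs are random toy values.

```python
import numpy as np

def neighborhood_laplacian(A):
    """Return D + D_tilde - (A + A^T), with row sums in D and column sums in D_tilde."""
    A = np.asarray(A, dtype=float)
    D_row = np.diag(A.sum(axis=1))
    D_col = np.diag(A.sum(axis=0))
    return D_row + D_col - (A + A.T)

rng = np.random.default_rng(0)
A = rng.random((4, 4))
np.fill_diagonal(A, 0.0)              # no self-neighbors
U = rng.normal(size=(4, 3))           # toy latent vectors, one row per lncRNA

lhs = sum(A[i, mu] * np.sum((U[i] - U[mu]) ** 2)
          for i in range(4) for mu in range(4))
Lap = neighborhood_laplacian(A)
rhs = np.trace(U.T @ Lap @ U)
```

The two quantities agree up to floating-point error, and the resulting matrix is symmetric even though A is not.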

By taking into account both the lncRNA-protein associations and the K<sub>1</sub> nearest neighbors of lncRNAs and proteins, the final prediction model can be derived. Combining Equations (15, 16) with Equation (12), the resulting model is as follows:

$$\begin{split} \min_{U,V} &\sum_{i=1}^{m} \sum_{j=1}^{n} \left\{ \left( 1 + c y_{ij} - y_{ij} \right) \log \left[ 1 + \exp \left( u_{i} v_{j}^{T} \right) \right] - c y_{ij} u_{i} v_{j}^{T} \right\} \\ &+ \frac{1}{2} tr \left[ U^{T} \left( \lambda_{l} I + \alpha L^{l} \right) U \right] + \frac{1}{2} tr \left[ V^{T} \left( \lambda_{p} I + \beta L^{p} \right) V \right]. \end{split} \tag{17}$$

The optimization problem in Equation (17) can be solved by an alternating gradient descent procedure. Denoting the objective function in Equation (17) by ℒ, its gradients with respect to U and V are as below:

$$\frac{\partial \mathcal{L}}{\partial U} = PV + (c - 1) \left( Y \odot P \right) V - cYV + \left( \lambda_l I + \alpha L^l \right) U \tag{18}$$

$$\frac{\partial \mathcal{L}}{\partial V} = P^T U + (c - 1) \left( Y^T \odot P^T \right) U - cY^T U + \left( \lambda_p I + \beta L^p \right) V \tag{19}$$

Here, P ∈ R<sup>m×n</sup> is the matrix with entries p<sub>ij</sub> (see Equation 6), and ⊙ denotes the Hadamard product of two matrices. To accelerate the convergence of the gradient descent optimization, we use the AdaGrad algorithm to adaptively choose the gradient step size.
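The gradient in Equation (18) can be verified against a finite-difference estimate of the objective in Equation (17), and one AdaGrad update can be written in a few lines. This is a sketch under stated assumptions: all hyperparameter values and random inputs are placeholders, and `adagrad_step` is the textbook AdaGrad rule rather than the authors' exact implementation.

```python
import numpy as np

def objective(U, V, Y, L_l, L_p, c=5.0, lam_l=0.5, lam_p=0.5, alpha=0.1, beta=0.1):
    """Value of the objective in Equation (17)."""
    Z = U @ V.T
    fit = ((1.0 + c * Y - Y) * np.logaddexp(0.0, Z) - c * Y * Z).sum()
    reg_u = 0.5 * np.trace(U.T @ (lam_l * np.eye(len(U)) + alpha * L_l) @ U)
    reg_v = 0.5 * np.trace(V.T @ (lam_p * np.eye(len(V)) + beta * L_p) @ V)
    return fit + reg_u + reg_v

def grad_U(U, V, Y, L_l, c=5.0, lam_l=0.5, alpha=0.1):
    """Gradient with respect to U, following Equation (18)."""
    P = 1.0 / (1.0 + np.exp(-(U @ V.T)))
    return (P @ V + (c - 1.0) * (Y * P) @ V - c * Y @ V
            + (lam_l * np.eye(len(U)) + alpha * L_l) @ U)

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-entry step scaled by accumulated squared gradients."""
    accum = accum + grad ** 2
    return param - lr * grad / (np.sqrt(accum) + eps), accum

rng = np.random.default_rng(0)
m, n, r = 4, 3, 2
U, V = rng.normal(size=(m, r)), rng.normal(size=(n, r))
Y = (rng.random((m, n)) > 0.5).astype(float)
A, B = rng.random((m, m)), rng.random((n, n))
L_l = np.diag(A.sum(1)) + np.diag(A.sum(0)) - (A + A.T)   # symmetric by construction
L_p = np.diag(B.sum(1)) + np.diag(B.sum(0)) - (B + B.T)

# Central finite difference on one entry of U.
h = 1e-6
E = np.zeros_like(U)
E[1, 0] = h
numeric = (objective(U + E, V, Y, L_l, L_p) - objective(U - E, V, Y, L_l, L_p)) / (2 * h)
analytic = grad_U(U, V, Y, L_l)[1, 0]

# One AdaGrad update of U with V held fixed (the alternating scheme then updates V).
U_new, accum = adagrad_step(U, grad_U(U, V, Y, L_l), np.zeros_like(U))
```

The gradient formula relies on L<sup>l</sup> being symmetric, which holds for the construction D + D̃ − (A + A<sup>T</sup>).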

Once the latent vectors U and V are learned, the association probability of any unknown lncRNA-protein pair (l<sub>i</sub>, p<sub>j</sub>) can be predicted by formula (6). However, the lncRNAs and proteins in the negative sets L<sup>−</sup> and P<sup>−</sup> might influence the prediction of lncRNA-protein interactions. For an lncRNA l<sub>i</sub> ∈ L<sup>−</sup> and a protein p<sub>j</sub> ∈ P<sup>−</sup>, the sets of their K<sub>2</sub> nearest neighbors in L<sup>+</sup> and P<sup>+</sup> are denoted as N<sup>+</sup>(l<sub>i</sub>) and N<sup>+</sup>(p<sub>j</sub>), respectively. N<sup>+</sup>(l<sub>i</sub>) and N<sup>+</sup>(p<sub>j</sub>) are constructed using the same criterion as that used to construct the neighborhoods during training. Then, the interaction probability between lncRNA l<sub>i</sub> and protein p<sub>j</sub> is modified to:

$$\hat{p}\_{ij} = \frac{\exp(\tilde{u}\_i \tilde{\nu}\_j^T)}{1 + \exp\left(\tilde{u}\_i \tilde{\nu}\_j^T\right)},\tag{20}$$

where

$$\tilde{u}_{i} = \begin{cases} u_{i} & \text{if } l_{i} \in L^{+}\\ \frac{1}{\sum_{\mu \in N^{+}(l_{i})} s_{i\mu}^{l}} \sum_{\mu \in N^{+}(l_{i})} s_{i\mu}^{l} u_{\mu} & \text{if } l_{i} \in L^{-} \end{cases} \tag{21}$$

Note that Equation (21) describes the general case of smoothing the learned lncRNA-specific latent vectors; the protein-specific latent vectors ṽ<sub>j</sub> are smoothed analogously. In our experiment, K<sub>2</sub> is empirically set to 5. The score matrix obtained in this part is denoted by S<sub>N</sub>, and S<sub>N</sub>(l(i), p(j)) is the possibility of association between lncRNA l(i) and protein p(j).
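The smoothing step can be sketched as below, under the assumption that Equation (21) replaces the latent vector of each lncRNA in L<sup>−</sup> by the similarity-weighted average of the latent vectors of its K<sub>2</sub> nearest neighbors in L<sup>+</sup>, and leaves the vectors of lncRNAs in L<sup>+</sup> unchanged. The function name, the boolean mask `positive`, and the toy values are hypothetical, and the sketch assumes the selected similarities do not sum to zero.

```python
import numpy as np

def smooth_latent(U, sim, positive, k=5):
    """Replace latent vectors of items outside the positive set by a weighted neighbor average."""
    U = np.asarray(U, dtype=float)
    sim = np.asarray(sim, dtype=float)
    positive = np.asarray(positive, dtype=bool)
    pos_idx = np.flatnonzero(positive)
    U_tilde = U.copy()
    for i in np.flatnonzero(~positive):
        order = np.argsort(-sim[i, pos_idx])[:k]   # k most similar positive items
        nbrs = pos_idx[order]
        w = sim[i, nbrs]
        U_tilde[i] = w @ U[nbrs] / w.sum()         # similarity-weighted average
    return U_tilde

U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [9.0, 9.0]])                         # last row: item in the negative set
sim = np.array([[1.00, 0.20, 0.75],
                [0.20, 1.00, 0.25],
                [0.75, 0.25, 1.00]])
U_tilde = smooth_latent(U, sim, positive=[True, True, False], k=2)
```

In the toy example the third vector becomes 0.75·u<sub>1</sub> + 0.25·u<sub>2</sub>, while the first two are untouched.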

#### Integrating Model

Finally, to avoid the unsatisfactory results of using either method alone, we adopt an integration strategy and propose the integration model IRWNRLPI, which combines the random walk and neighborhood regularized logistic matrix factorization algorithms. Specifically, we use the two algorithms to obtain two score matrices, S<sub>R</sub> and S<sub>N</sub>, and then take their average. The final score matrix is denoted as S, and S(l(i), p(j)) is the possibility of association between lncRNA l(i) and protein p(j). The formula is as follows:

$$\mathcal{S} = \frac{\mathcal{S}\_R + \mathcal{S}\_N}{2} \tag{22}$$

### RESULTS

#### Performance Evaluation

In this work, to measure the capability of our IRWNRLPI model, we perform LOOCV on lncRNA-protein interactions that have been experimentally verified. In the LOOCV experiment, given a total of N samples, each sample is selected in turn as the test sample while the remaining samples are used as training samples. This yields N classifiers and N test results, and we use the average of the N results to evaluate the capability of our method. We use LOOCV to obtain the receiver operating characteristic (ROC) curve and calculate the area under the ROC curve (AUC). AUC is a popular metric for evaluating classification models. If AUC = 1, the model has perfect performance; if AUC = 0.5, it has random performance. Another popular indicator is the area under the precision-recall curve (AUPR), which is better suited to class-imbalanced datasets because it penalizes false positives more heavily. Because the dataset contains a massive number of unlabeled pairs, AUPR is used to lessen the impact of false positives on the assessment of the prediction model. The larger the value of AUPR, the better the capability of the method.
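For readers who want to reproduce the two summary metrics, the sketch below computes AUC through its rank interpretation (the probability that a positive pair is scored above a negative pair) and approximates AUPR by average precision. Average precision is one common estimator of the area under the precision-recall curve; the text does not state which estimator was used, so this choice is an assumption, as are the toy labels and scores.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (needs >= 1 positive)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

labels = [1, 0, 1, 0]            # toy ground truth
scores = [0.9, 0.8, 0.7, 0.1]    # toy predicted scores
auc = auc_score(labels, scores)
aupr = average_precision(labels, scores)
```

For these toy values three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75, and the average precision is (1/1 + 2/3)/2.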

To examine the capability of the method adequately, we also introduce the following indicators: ACC (overall accuracy), SEN (sensitivity), PRE (precision), and F1 (F1-score). These indices are extensively used in bioinformatics (Chen et al., 2016a, 2017a) and are defined as:

$$\begin{aligned} \text{ACC} &= \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}} \\ \text{SEN} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \text{PRE} &= \frac{\text{TP}}{\text{TP} + \text{FP}} \\ \text{F1} &= \frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}} \\ &= 2 \cdot \frac{\text{PRE} \cdot \text{SEN}}{\text{PRE} + \text{SEN}} \end{aligned}$$

Where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. ACC is an index of systematic error: an ACC of 100% indicates that the prediction is perfect, and random prediction attains an ACC of only 50%. The other metrics of binary classification can also measure the capability of the method. PRE indicates the proportion of true positives among the positive predictions, and SEN, also called recall, indicates the proportion of positive samples that are correctly predicted. The F1-score (also called F-score or F-measure) takes both precision and sensitivity into account and can reflect whether the classification model is robust. F1 is 1 for a perfect method and 0 for the worst model.
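The four indicators follow directly from the confusion-matrix counts. The counts below are made-up toy values chosen to mimic a class-imbalanced setting; they are not taken from the experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute ACC, SEN, PRE, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    sen = tp / (tp + fn)                 # sensitivity, also called recall
    pre = tp / (tp + fp)                 # precision
    f1 = 2 * tp / (2 * tp + fp + fn)     # equals 2 * pre * sen / (pre + sen)
    return {"ACC": acc, "SEN": sen, "PRE": pre, "F1": f1}

m = classification_metrics(tp=6, fp=2, fn=4, tn=88)
```

In this toy case ACC is 0.94 even though only 6 of the 10 positives are recovered (SEN = 0.6), which illustrates why ACC alone can be misleading on unbalanced data.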

#### Comparison With Other Methods on NPInter V2.0

In this part, we compare IRWNRLPI with four other models on NPInter v2.0: LPI-ETSLP, RWR, LPBNI, and RPISeq. Among them, RPISeq, with its RF and SVM classifiers, is compared with IRWNRLPI as an example of a machine learning model. The other three methods, LPI-ETSLP, RWR, and LPBNI, forecast potential associations using the same type of lncRNA and protein sequence information as IRWNRLPI. The results of IRWNRLPI and the other four models are displayed in **Figure 3** and **Table 1**, and indicate that IRWNRLPI outperforms the others.

We run all of these models on the same dataset and perform LOOCV experiments to compare their performance. As shown in **Figure 3**, our IRWNRLPI method has an AUC value of 0.9150, well above 0.5 (random), indicating that this model is feasible for predicting lncRNA-protein associations. The AUC of IRWNRLPI is higher than those of LPI-ETSLP (0.8876), RWR (0.8332), LPBNI (0.8586), RPISeq-RF (0.3949), and RPISeq-SVM (0.3987). RPISeq is clearly much worse than the other models, with an AUC even below 0.5 (random). There are two reasons for this result. First, RPISeq is a machine learning method that depends on data, and our dataset does not include a negative sample set. Second, RPISeq is trained on RNA-protein associations rather than lncRNA-protein associations, and the biological function of lncRNA differs from that of common RNA, which affects the final outcome. In contrast, IRWNRLPI avoids the problem of feature selection and does not rely on negative sample datasets.

TABLE 1 | Comparison of IRWNRLPI with LPI-ETSLP, RWR, LPBNI, and RPISeq models.

From the indicators in **Table 1**, we can see that the prediction ability of IRWNRLPI is clearly superior to that of the other four methods. First, we compare the AUPR values, which are 0.6438 (LPI-ETSLP), 0.2893 (RWR), 0.3306 (LPBNI), 0.0631 (RPISeq-RF), and 0.0698 (RPISeq-SVM). All of these values are lower than 0.7138 (IRWNRLPI), indicating that the prediction results of IRWNRLPI are more dependable. Next, we further analyze the ACC, PRE, SEN, and F1-score of these models. The ACC of IRWNRLPI is lower than those of RWR and LPBNI, because IRWNRLPI predicts potential lncRNA-protein associations from known lncRNA-protein associations, and experimentally verified lncRNA-protein interactions are still scarce. Consequently, we expect that, as lncRNA-protein association data continue to grow, the prediction accuracy of IRWNRLPI will greatly improve. In addition, for this unbalanced dataset the F1-score is a more reasonable evaluation measure than ACC. **Table 1** shows that the F1-score of IRWNRLPI is higher than those of the other methods, especially RWR and LPBNI. IRWNRLPI achieves a precision (PRE) of 0.7187, which is approximately 21, 95, and 94% higher than those of LPI-ETSLP, RWR, and LPBNI, respectively, and much higher than those of RPISeq-RF and RPISeq-SVM. Its sensitivity (SEN) is 0.5960, which is 68, 44, 98, and 104% higher than those of RWR, LPBNI, RPISeq-RF, and RPISeq-SVM, respectively. These results further demonstrate that IRWNRLPI performs better in forecasting lncRNA-protein associations.

#### Case Study

To evaluate the capability of the prediction method more comprehensively, we use IRWNRLPI to forecast potential lncRNA-protein interactions based on the known associations of "Mus musculus" in the NPInter v3.0 dataset. The top 10 predicted lncRNA-protein interactions are displayed in **Table 2**, and they are checked and verified against the "Mus musculus" data. Moreover, we report their ranks under the other methods. **Table 2** shows that some of them do not receive a high rank in the predictions of the other models, which means that some new discoveries may be neglected by those models. In contrast, our model can find and confirm the interactions of these lncRNAs with proteins; the corresponding genes are displayed in **Table 2**. The loss of function of many lncRNAs expressed in mouse embryonic cells has been studied to show its influence on gene expression. Studies have also indicated that lncRNA regulates the impact of tumor cells on blood vessels, which can affect the mechanism of tumor growth. In our prediction results, NONMMUG002214-Q13185, NONMMUG013483-A2AC19, and NONMMUG015351-Q88974, which were studied by Guttman et al. (2009), are among the top 10 results of these methods. In terms of these outcomes, IRWNRLPI is clearly superior to the other methods in forecasting potential lncRNA-protein associations.

### DISCUSSION

LncRNA is involved in a variety of important cellular regulatory processes and in the progression of many diseases, particularly in the development of various cancers. In general, most lncRNAs perform their functions by interacting with the corresponding RNA-binding proteins. Therefore, predicting new lncRNA-protein associations is conducive to research on lncRNA. However, experiments on lncRNA-protein interactions consume considerable material, human, and financial resources. The use of computational methods to forecast lncRNA-protein associations has therefore attracted widespread attention. In our work, to obtain better prediction results, we introduce the idea of algorithm integration and present the IRWNRLPI method, which integrates two prediction methods, random walk and neighborhood regularized logistic matrix factorization, to forecast lncRNA-protein interactions. IRWNRLPI is based only on experimentally validated lncRNA-protein associations, which avoids dependence on negative sample datasets. We conduct a comprehensive evaluation of IRWNRLPI, test our model on the NPInter v2.0 dataset, and compare it with four other methods. In the LOOCV experiment, the AUC value of IRWNRLPI is 0.9150, indicating that IRWNRLPI performs well in forecasting lncRNA-protein associations, and its AUPR value of 0.7138 clearly demonstrates the reliability of this method. In addition, we use the "Mus musculus" dataset as a case study to test IRWNRLPI and investigate its practical capability in forecasting unknown lncRNA-protein associations. The case study shows that IRWNRLPI is able to forecast new lncRNA-protein interactions. With the continuous progress of science and technology, more and more lncRNA-protein interactions will be found, and the accuracy of IRWNRLPI prediction will increase accordingly. In conclusion, IRWNRLPI is an efficient model for predicting potential lncRNA-protein associations, and we hope that it can be used in a wider range of studies.

TABLE 2 | Top 10 novel interactions predicted by IRWNRLPI and their ranks in the prediction of other methods.

The excellent and reliable predictive performance of IRWNRLPI is mainly attributable to the following factors. Firstly, unlike traditional machine learning methods, IRWNRLPI uses semi-supervised learning to derive unknown information primarily from known associations and their similarities, so it does not need negative samples. Secondly, our model assigns a high level of importance to the nearest neighbors, thus avoiding noisy information. Thirdly, IRWNRLPI is based on the idea of integration, and the integrated model obtains better results than a single model.

Of course, IRWNRLPI still needs to be improved, for the following reasons. First of all, the proposed model relies heavily on known association data, but the number of currently known lncRNA-protein associations is still very limited. As the number of experimentally validated associations increases in the future, the prediction accuracy of our method will improve. Furthermore, when the training samples change, the prediction performance can become unstable. In addition, further consideration should be given to how to choose the values of the model parameters more appropriately.

#### AUTHOR CONTRIBUTIONS

QZ and HL conceived the project, developed the prediction method, designed and implemented the experiments, analyzed the result, and wrote the paper. YZ implemented the experiments, analyzed the result, and wrote the paper. HH, GR, and WZ analyzed the result. All authors read and approved the final manuscript.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (No: 31570160, 61772381 and 61772531), Innovation Team Project of Education Department of Liaoning Province under Grant No. LT2015011, the Doctor Startup Foundation from Liaoning Province under Grant No. 20170520217, Important Scientific and Technical Achievements Transformation Project under Grant No. Z17-5-078, Large-scale Equipment Shared Services Project under Grant No. F15165400 and Applied Basic Research Project under Grant No. F16 205151.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhao, Zhang, Hu, Ren, Zhang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Prediction of Drug–Gene Interaction by Using Metapath2vec

Siyi Zhu<sup>1†</sup>, Jiaxin Bing<sup>1†</sup>, Xiaoping Min<sup>1</sup>\*, Chen Lin<sup>1</sup> and Xiangxiang Zeng<sup>1,2</sup>

<sup>1</sup> Department of Computer Science, Xiamen University, Xiamen, China, <sup>2</sup> Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain

Heterogeneous information networks (HINs) currently play an important role in daily life. HINs are applied in many fields, such as science research, e-commerce, recommendation systems, and bioinformatics. Particularly, HINs have been used in biomedical research. Algorithms have been proposed to calculate the correlations between drugs and targets and between diseases and genes. Recently, the interaction between drugs and human genes has become an important subject in the research on drug efficacy and human genomics. In previous studies, numerous prediction methods using machine learning and statistical prediction models were proposed to explore this interaction on the biological network. In the current work, we introduce a representation learning method into the biological heterogeneous network and use the representation learning models metapath2vec and metapath2vec++ on our dataset. We combine the adverse drug reaction (ADR) data in the drug–gene network with the causal relationship between drugs and ADRs. This article first presents an analysis of the importance of predicting drug–gene relationships and discusses the existing prediction methods. Second, the skip-gram model commonly used in representation learning for natural language processing tasks is explained. Third, the metapath2vec and metapath2vec++ models for the example of the drug–gene–ADR network are described. Next, the kernelized Bayesian matrix factorization algorithm is used to complete the prediction. Finally, the experimental results of both models are compared with those of Katz, CATAPULT, and matrix factorization, the predictions are visualized using receiver operating characteristic curves, and the area under the receiver operating characteristic curve is calculated for three varying algorithm parameters.

#### Edited by:

Quan Zou, Tianjin University, China

#### Reviewed by:

Pan Zheng, Swinburne University of Technology Sarawak Campus, Malaysia Bosheng Song, Huazhong University of Science and Technology, China

> \*Correspondence: Xiaoping Min mxp@xmu.edu.cn

†These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 21 May 2018 Accepted: 22 June 2018 Published: 31 July 2018

#### Citation:

Zhu S, Bing J, Min X, Lin C and Zeng X (2018) Prediction of Drug–Gene Interaction by Using Metapath2vec. Front. Genet. 9:248. doi: 10.3389/fgene.2018.00248

Keywords: drug–gene, ADR, heterogeneous network, link prediction, representation learning, network embedding

# INTRODUCTION

Over the past few years, predicting the relationship between drugs and genes has gradually become a subject of concern among researchers in the fields of new drug discovery and personalized medicine. Conventionally, the route for improving drug efficacy is to analyze the interaction between the drugs and their targets. Most targets are proteins encoded by genes; thus, raising this research work to the gene level is a critical development. Studies on drug–gene interactions have shown that determining this relationship can not only improve the positive effects of drugs but also help prevent adverse drug reactions (ADRs) by enabling genotype-guided prescription. As early as 1909, Garrod proposed that people would have different responses after using a given drug (Garrod, 1909). Currently, an increasing number of people believe that genes are a vital factor in the variability of drug response (Swen et al., 2007). According to a number of studies, gene expression may affect the efficacy of drugs; conversely, some drugs can also upregulate or downregulate the expression of corresponding human genes (Liu and Pan, 2015). The assumption underlying individualized medication according to human genotype is that the genotype will determine the reaction to a given medication (Weiss et al., 2008). On the one hand, patients may respond positively to the medication or the risk of complications may be low. On the other hand, drugs may provoke a series of side effects. For example, genetic factors may have an effect on the response to antihypertensive medication. Schelleman et al. (2004) found that, compared with other antihypertensive treatments, diuretic therapy can reduce the risk of myocardial infarction and stroke among patients with the 460 W allele of the α-adducin gene because of the interactions between the genetic polymorphisms for endothelial nitric oxide synthase and diuretics and between the α-adducin gene and diuretics.

Traditional prediction methods belong to two categories: machine learning and statistics (Pan et al., 2017). Conventional machine learning methods directly treat the known drug–gene pairs as the positive training set and the unknown genes as the negative training set. In other words, these methods ignore the possibility of unknown positive samples in the data.

Typical statistical prediction methods are classified into two types: structure-based approaches and text mining methods. Structure-based approaches focus on the physicochemical properties of drug binding sites to predict drug availability (Cheng et al., 2007). These methods require binding site information because the prediction depends on the structural features of the target proteins, or on the expressed sequences of the genes, to which a drug compound molecule binds. Thus, these methods cannot be used for genes with unknown sequences. Text mining methods are based on the assumption that two biological entities are very likely related if they appear in one body of literature (Zhu et al., 2005). However, this type of method is not feasible for entities without any known interactions.

Recently, Dong et al. (2017) proposed two models called metapath2vec and metapath2vec++, which can effectively represent the semantic information and structure of a heterogeneous information network (HIN) simultaneously. In the current work, we extend these algorithms into the drug–gene field and use both models on a biological heterogeneous network consisting of three types of nodes to predict the interactions between drugs and genes.

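This paper is organized as follows. Section II describes the experimental data, the drug similarity measures, and the motivation of this work. Section III reviews the related work (skip-gram, hierarchical softmax, noise-contrastive estimation, and negative sampling), presents the metapath2vec and metapath2vec++ models, and describes the prediction step. In section IV, the performances of both models are compared with those of three conventional prediction algorithms, namely, Katz, CATAPULT, and matrix factorization (MF). The comparison results demonstrate that the two representation learning methods achieve the highest prediction accuracy.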

### PRELIMINARIES

#### Experimental Data

#### Drug–Gene–ADR Network

In our experiment, we applied the metapath2vec and metapath2vec++ models to a biological HIN with three types of nodes, namely, drug, gene, and ADR, and four types of relationship, namely similarity between drugs, similarity between genes, drug–gene interaction, and drug–ADR interaction.

The heterogeneous network is partially represented by **Figure 1**. In the network, the blue nodes indicate drugs, the red nodes indicate genes, and the green nodes indicate ADRs. The solid lines represent the interactions of the node pairs, with the blue lines showing a similarity between drugs and the red lines showing a similarity between genes. Moreover, the gray lines mean that an interrelationship exists between a drug and a gene, and the green line symbolizes the causality between a drug and an ADR.

The drug–gene interaction dataset was extracted from the online database the Library of Integrated Network-Based Cellular Signatures (LINCS). It is a rich database that aims to explain biology by cataloging changes in gene expression and other cellular processes occurring under a variety of drug therapies or other perturbing factors. We selected 10,830 genes from the database and 38,456 interacting drug–gene pairs to constitute the drug–gene part of the network.

The data on the interactions between drugs and ADRs were collected from the online database Adverse Drug Reaction Classification System (ADReCS). The ADReCS is a database specifically created for ADR research. It provides comprehensive ADR ontology data, including data on ADR standardization and hierarchical classification of ADR terms (Cai et al., 2014). ADReCS has collected a large number of drug–ADR correlations from more than three sources like DailyMed, MedDRA, and SIDER2. DailyMed, which is a continuously updated website providing massive information on medicines sold in the market and containing 102,405 drug listings (as of May 21st, 2018) submitted to the Food and Drug Administration, is the main source of the data in ADReCS. Thus, ADReCS provides not only data on ADRs but also information on 1,355 single active ingredient drugs and 134,022 drug–ADR interactions. We extracted from the database 1,370 ADRs caused by the drugs in the drug–gene data. Consequently, we were able to incorporate the concept of side effect in the drug–gene network.

#### Drug Similarity

In particular, we used two kinds of similarity data between drugs to distinguish each drug effectively. First, the structural similarities of drugs are based on the drug compounds' chemical structures, which were commonly used in previous drug–target prediction studies.

The other kind of similarity data we used in the experiment are the pharmacological correlation data based on the information in the Anatomical Therapeutic Chemical (ATC) classification system. This classification system is formulated and updated on a regular basis by the WHO Collaborating Centre for Drug Statistics Methodology. The ATC classification system can classify drugs into diverse categories based on the treatment effects and compound molecular features (Cai et al., 2014). In the experiment, we presented a transformation strategy to change the ATC data of drugs into pharmacological similarity scores of drugs by comparing ATC categories belonging to two different drugs.
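The exact transformation strategy is not spelled out here, so the following is only a minimal sketch of one plausible way to turn ATC codes into a similarity score. It assumes the standard five-level ATC code layout and scores two drugs by the fraction of leading levels they share; the function names and the scoring rule are illustrative assumptions, not the authors' method.

```python
# ATC codes have five hierarchical levels, identified by code prefixes of
# length 1, 3, 4, 5 and 7 (e.g. "C03CA01" -> C, C03, C03C, C03CA, C03CA01).
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)


def shared_levels(code_a, code_b):
    """Count how many leading ATC levels two codes have in common."""
    count = 0
    for length in ATC_PREFIX_LENGTHS:
        long_enough = len(code_a) >= length and len(code_b) >= length
        if long_enough and code_a[:length] == code_b[:length]:
            count += 1
        else:
            break
    return count


def atc_similarity(codes_a, codes_b):
    """Hypothetical score in [0, 1]: best shared-level fraction over all code pairs.

    A drug may carry several ATC codes, so each argument is a list of codes.
    """
    best = max(shared_levels(a, b) for a in codes_a for b in codes_b)
    return best / len(ATC_PREFIX_LENGTHS)


# Two codes that differ only at the fifth level share four of five levels.
print(atc_similarity(["C03CA01"], ["C03CA02"]))
```

Taking the maximum over code pairs is one design choice among several; averaging over pairs would penalize drugs that have multiple unrelated indications.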

#### Major Motivation

We aimed to explore the interaction between drugs and genes by constructing a heterogeneous network and contribute to the literature on the prediction of the negative effects of new drugs on human gene expression. Furthermore, in view of the growing importance of identifying ADRs in developing new drugs, we introduced ADR data to obtain a sizable amount of information about drugs, regarded ADRs as a set of labels, and considered the causal relationships between drugs and ADRs to be a group feature of drugs.

# PROPOSED METHODS

#### Related Work

#### Skip-Gram

Skip-gram is a language model widely used for training word representation vectors to determine the relationships between words in a network. A skip-gram model finds word representations that help predict the context words of a target word in a sentence or in an entire document (Cai et al., 2014). Simply put, a skip-gram model captures the information surrounding a word. A skip-gram model generally has three or more layers; a center word is fed into the input layer, and the words related to it are then generated with high probability. Taking a drug set as an example, if a series of drugs (i.e., $d_1, d_2, \ldots, d_N$) constitutes the training set, some of these drugs are known to be related, whereas the relationships between the others are unknown. The average log probability that the skip-gram model should maximize can be defined as follows:

$$\frac{1}{N} \sum_{n=1}^{N} \sum_{-c \le j \le c,\, j \ne 0} \log p(d_{n+j} \mid d_n), \tag{1}$$

where $N$ is the number of drugs, $d_n$ and $d_{n+j}$ indicate two related drug nodes in the training set, and $c$ is the size of the training context window around $d_n$. A larger $c$ yields more training samples and can therefore lead to a higher prediction accuracy. In the original skip-gram model, $v_d$ is the input representation vector, $v'_d$ is the output representation vector of drug $d$, and $D$ is the total number of drugs. Accordingly, the probability of $d_{n+j}$ being related to $d_n$ can be computed by the following softmax function:

$$p\left(d_{n+j} \mid d_n\right) = \frac{\exp\left({v'_{d_{n+j}}}^{\top} v_{d_n}\right)}{\sum_{d=1}^{D} \exp\left({v'_d}^{\top} v_{d_n}\right)}.\tag{2}$$
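The softmax in Equation (2) can be illustrated with a few lines of Python. This is a minimal sketch on made-up two-dimensional vectors; the dictionaries `input_vecs` and `output_vecs` and their values are assumptions for illustration only.

```python
import math


def softmax_probability(target, center, input_vecs, output_vecs):
    """p(target | center) following Equation (2).

    input_vecs[d] plays the role of v_d and output_vecs[d] that of v'_d.
    """
    v_center = input_vecs[center]
    scores = {
        drug: sum(a * b for a, b in zip(v_out, v_center))
        for drug, v_out in output_vecs.items()
    }
    # Subtracting the maximum score leaves the ratio unchanged but avoids overflow.
    max_score = max(scores.values())
    exp_scores = {drug: math.exp(s - max_score) for drug, s in scores.items()}
    return exp_scores[target] / sum(exp_scores.values())


# Toy vectors; the numbers carry no biological meaning.
input_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.5, 0.5]}
output_vecs = {"d1": [0.2, 0.1], "d2": [1.5, -0.3], "d3": [-0.4, 0.9]}

for drug in output_vecs:
    print(drug, softmax_probability(drug, "d1", input_vecs, output_vecs))
```

Note that the denominator runs over all $D$ drugs for every evaluation, which is the cost that the approximations in the following subsections try to avoid.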

#### Hierarchical Softmax

In a typical skip-gram model, the output layer commonly uses a softmax function to yield the probability distribution. In general, the softmax function squashes a vector of real values into another vector whose values lie within the range (0, 1). To reduce the computational cost and time, a replacement function called hierarchical softmax was proposed by Morin and Bengio (2005). Hierarchical softmax requires less computational space and time because it evaluates a vector with a length of no more than $\log_2 |D|$, whereas the standard softmax must compute a $D$-dimensional vector (Mikolov et al., 2013). Hierarchical softmax constructs a binary tree with all the nodes as leaves (**Figure 2**) to achieve an exponential speed-up of computation. The output of learning a drug relationship dataset is formalized as a Huffman tree with a sequence of binary decisions. The more closely a node is related to the root, the shorter its path from the root. The algorithm then assigns 1 to the left branch and 0 to the right branch of each node on the tree to encode these nodes as vectors, which denote the paths from the root node to the respective nodes. In **Figure 2**, the red line indicates the path between drug "D013999," the root, and drug "C014374" and corresponds to the information learned from the input dataset.
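To make the tree encoding concrete, the sketch below builds a Huffman tree with Python's `heapq` and returns the root-to-leaf bit string of every node, using 1 for the left branch and 0 for the right branch as described above. The frequency counts and the identifier "D000000" are hypothetical; how frequencies are obtained from the drug dataset is an assumption.

```python
import heapq
import itertools


def huffman_codes(frequencies):
    """Return {node: bit string} giving the root-to-leaf path of each node."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(freq, next(counter), {node: ""}) for node, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq_left, _, left = heapq.heappop(heap)
        freq_right, _, right = heapq.heappop(heap)
        # Prepend the branch bit: 1 for the left subtree, 0 for the right one.
        merged = {node: "1" + code for node, code in left.items()}
        merged.update({node: "0" + code for node, code in right.items()})
        heapq.heappush(heap, (freq_left + freq_right, next(counter), merged))
    return heap[0][2]


# Hypothetical occurrence counts for four drug identifiers.
frequencies = {"D013999": 8, "C014374": 4, "D020849": 2, "D000000": 1}
codes = huffman_codes(frequencies)
for node, code in sorted(codes.items(), key=lambda item: len(item[1])):
    print(node, code)
```

Frequent nodes receive short codes, so the expected number of binary decisions per prediction stays close to $\log_2 |D|$ rather than $|D|$.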

#### Noise-Contrastive Estimation

To present an alternative to hierarchical softmax and further improve computational performance, Gutmann and Hyvarinen (2012) proposed noise-contrastive estimation (NCE), which is a method based on sampling. The core idea of NCE is that, for each instance, n labels are sampled from the entire dataset in addition to the instance's own label, and only the probability of the instance belonging to these n+1 labels is computed instead of the probabilities of the object being related to every label. In **Figure 3**, the genes are temporarily regarded as labels of drugs. When the NCE strategy is used to identify labels for drug "D020849," noise labels such as gene "3108" and gene "9143" can be randomly sampled. Furthermore, gene "148022" can be sampled on the basis of the similarity between drug "D020849" and drug "D013999" (an interaction occurs between "D013999" and gene "9053"), or gene "1027" can be sampled because of the similarity between gene "1027" and gene "3108," which is related to drug "D020849." The NCE method divides the labels of the central node into two categories: true labels and noise labels. Subsequently, the multilabel classification problem can be translated into a binary classification task, thereby significantly reducing the time cost.

The probability of the true label can be defined as

$$p\left(g_i = 1 \mid \mathcal{G}, d\right) = \frac{p_\theta\left(d \mid \mathcal{G}\right)}{p_\theta\left(d \mid \mathcal{G}\right) + k\, q(d)},\tag{3}$$

where $g_i$ is a gene label of the central drug $d$ and $\mathcal{G}$ is the label set of $d$. Meanwhile, $k$ noise labels are selected from a noise distribution $q(d)$. In (3), $\theta$ is a parameter used to maximize the conditional likelihood of the label set (Gutmann and Hyvarinen, 2012).

Next, the noise sample probability can be computed as follows:

$$p\left(g_i = 0 \mid \mathcal{G}, d\right) = \frac{k\, q(d)}{p_\theta\left(d \mid \mathcal{G}\right) + k\, q(d)}.\tag{4}$$

Accordingly, the cost function for N total number of central drugs is computed as follows:

$$\frac{1}{N} \sum_{i=1}^{N} \left\{ \log p\left(g_i = 1 \mid \mathcal{G}, d\right) + \sum \log p\left(g_i = 0 \mid \mathcal{G}, d\right) \right\}. \tag{5}$$
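Equations (3) and (4) can be checked numerically with a short sketch. The function name and the example values for $p_\theta(d \mid \mathcal{G})$, $q(d)$, and $k$ are assumptions chosen only to show that the two probabilities are complementary.

```python
def nce_probabilities(p_model, q_noise, k):
    """Return (p(g_i = 1 | G, d), p(g_i = 0 | G, d)) following Equations (3) and (4).

    p_model : model score p_theta(d | G) for the candidate label
    q_noise : noise distribution value q(d)
    k       : number of noise labels drawn per true label
    """
    denominator = p_model + k * q_noise
    p_true = p_model / denominator
    p_noise = (k * q_noise) / denominator
    return p_true, p_noise


# Illustrative values only.
p_true, p_noise = nce_probabilities(p_model=0.2, q_noise=0.01, k=5)
print(p_true, p_noise)
```

Because the two outputs always sum to one, each label decision reduces to a binary classification between "true" and "noise," which is what makes the cost in Equation (5) cheap to evaluate.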

#### Negative Sampling

Mikolov et al. (2013) proposed negative sampling to replace hierarchical softmax. Negative sampling simplifies NCE while maintaining the quality of the representation vectors. It is similar to NCE in that it also uses a noise label set to change the task into a binary classification problem. Thus, negative sampling can be regarded as a specific version of NCE with a constant $q$ and $k = |V|$. Accordingly, the probability computation in (3) can be changed into

$$p\left(g_i = 1 \mid \mathcal{G}, d\right) = \frac{p_\theta\left(d \mid \mathcal{G}\right)}{p_\theta\left(d \mid \mathcal{G}\right) + 1},\tag{6}$$

and Equation (4) is simplified to

$$p\left(g_i = 0 \mid \mathcal{G}, d\right) = \frac{1}{p_\theta\left(d \mid \mathcal{G}\right) + 1}.\tag{7}$$
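The step from Equations (3) and (4) to Equations (6) and (7) can be written out explicitly. The derivation below assumes that "constant $q$" means a uniform noise distribution over the $|V|$ candidate labels, which is the reading under which the noise term collapses to 1:

$$q(d) = \frac{1}{|V|},\quad k = |V| \;\;\Longrightarrow\;\; k\, q(d) = 1 \;\;\Longrightarrow\;\; \frac{p_\theta\left(d \mid \mathcal{G}\right)}{p_\theta\left(d \mid \mathcal{G}\right) + k\, q(d)} = \frac{p_\theta\left(d \mid \mathcal{G}\right)}{p_\theta\left(d \mid \mathcal{G}\right) + 1}.$$

The same substitution in the numerator of Equation (4) gives Equation (7).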

#### Metapath2vec

Because the skip-gram model was developed for language networks, it is designed for homogeneous networks with only one type of node. Thus, it cannot be directly used on a network consisting of multiple types of nodes and links. In developing the metapath2vec model, Dong et al. (2017) designed a skip-gram model that can be applied to a heterogeneous network by incorporating heterogeneous network features and implemented two improvements on the standard framework.

Nodes in a network are generally related to each other in two respects: semantic relevance and structural similarity. Nodes with similar semantics are obviously associated and should be close to each other in the embedding space. For example, for the two nodes drug "D020849" and drug "D013999" in **Figure 3**, an edge indicates drug similarity between them. In other words, they have similar semantics. If a clustering algorithm were performed, the two drugs would have a high probability of belonging to a common cluster or community. With regard to structural correlation, two nodes also exhibit an affinity if they have extremely similar structures in the entire network, such as the two nodes gene "9053" and gene "56924" in **Figure 3**. Both of them are related to only one drug node and have no other edges. Thus, when we calculate the representations of these two genes, they should be embedded close to each other.

In the node2vec model, which is used on homogeneous networks to analyze the relationships between nodes, Grover and Leskovec (2016) combined the advantages of breadth-first sampling (BFS) and depth-first sampling (DFS) to establish a biased random walk algorithm. Typically, BFS can effectively sample a group of nodes on the basis of structural similarity. By contrast, DFS prefers to search a chain of nodes forming a path according to their content similarity. The random walk algorithm used in node2vec presents two benefits: it combines both sampling strategies, and it performs well in terms of time and space complexity.

The first improvement of the metapath2vec model over the skip-gram model is the incorporation of the random walk method, which allows for the compression of the structural features of a heterogeneous network based on the homogeneous version used in node2vec. The drug–gene–ADR network in **Figure 4** illustrates how a given metapath restricts the random walker and consequently samples diverse nodes in a heterogeneous network.

Controlled by the given metapath "Drug–ADR–Drug–Gene–Drug," the metapath-based random walker standing on the node drug "D013999" selects ADR "06.04.05" as its next step instead of jumping to other neighboring nodes, such as gene "9053" or gene "55038." Thus, the specific heterogeneous semantics can be identified from the entire network.
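A metapath-guided walk of this kind can be sketched in a few lines. The toy graph below reuses node identifiers mentioned in the text, but its edge list is made up for illustration, and the rule of choosing uniformly among neighbors of the required type is an assumption about the walker rather than a description of the authors' code.

```python
import random

# Toy drug-gene-ADR graph; the edges are hypothetical.
NODE_TYPE = {
    "D013999": "drug", "D020849": "drug",
    "06.04.05": "adr",
    "9053": "gene", "55038": "gene",
}
EDGES = [
    ("D013999", "06.04.05"), ("D020849", "06.04.05"),
    ("D013999", "9053"), ("D013999", "55038"), ("D020849", "55038"),
    ("D013999", "D020849"),
]


def build_adjacency(edges):
    """Undirected adjacency lists from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def metapath_walk(start, metapath, length, adjacency, node_type, rng):
    """Walk that only steps to neighbors whose type matches the metapath.

    The metapath must start and end with the same type so it can be repeated.
    """
    if node_type[start] != metapath[0] or metapath[0] != metapath[-1]:
        raise ValueError("metapath must start at the start node's type and be cyclic")
    pattern = metapath[:-1]  # drop the repeated last type and cycle through the rest
    walk = [start]
    while len(walk) < length:
        wanted = pattern[len(walk) % len(pattern)]
        candidates = [n for n in adjacency[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk


adjacency = build_adjacency(EDGES)
metapath = ["drug", "adr", "drug", "gene", "drug"]
print(metapath_walk("D013999", metapath, 5, adjacency, NODE_TYPE, random.Random(0)))
```

Although "D013999" has gene and drug neighbors in this toy graph, the first step can only reach ADR "06.04.05," mirroring the behavior described above.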


In addition to the metapath-based random walk, the second improvement exhibited by the metapath2vec model is the heterogeneous skip-gram model. This model is established by calculating the probability that a node has a heterogeneous neighbor set and then maximizing the computation result. Dong et al. (2017) defined this model as follows:

$$\arg\max_{\theta} \sum_{d \in V} \sum_{t \in T_V} \sum_{g_t \in N_t(d)} \log p(g_t \mid d, \theta), \tag{8}$$

where $d$ is the central drug node in the heterogeneous network. $N_t(d)$ is the neighbor set in which $g_t$ is one of the neighbor nodes of $d$; $V$ and $T_V$ represent the node set and the type set of the nodes, respectively.

#### Metapath2vec++

Metapath2vec++ is the upgraded version of the metapath2vec model with the improvement of endowing the negative sampling strategy mentioned in section III-A4 with a heterogeneous character. The softmax function used in the metapath2vec model is more applicable to a homogeneous network because this function ignores the heterogeneity of the network. In models with the traditional softmax function, such as node2vec or metapath2vec, the output layer exports a matrix consisting of the representation vector of each node. By contrast, the metapath2vec++ model can more clearly analyze the heterogeneous semantic relationship between these nodes, which are assigned to the neighbor set of the central node. **Figure 5** demonstrates the main difference between metapath2vec and metapath2vec++.
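The difference between the two output layers can be illustrated on toy embeddings. In the sketch below, the metapath2vec-style function normalizes over every node, whereas the metapath2vec++-style function normalizes separately within each node type, following the formulation of Dong et al. (2017). The vectors are made up, a single embedding per node is assumed, and the full softmax is used instead of negative sampling for clarity.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax_scores(center, candidates, vectors):
    """Softmax of dot-product scores between `center` and each candidate node."""
    weights = {n: math.exp(_dot(vectors[n], vectors[center])) for n in candidates}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def metapath2vec_probs(center, vectors):
    """One distribution over all nodes, regardless of their type."""
    return softmax_scores(center, list(vectors), vectors)


def metapath2vec_pp_probs(center, vectors, node_type):
    """One distribution per node type, each normalized within its own type."""
    by_type = {}
    for node, kind in node_type.items():
        by_type.setdefault(kind, []).append(node)
    return {kind: softmax_scores(center, nodes, vectors) for kind, nodes in by_type.items()}


# Toy two-dimensional embeddings; the values are illustrative only.
VECTORS = {
    "D013999": [0.9, 0.1], "D020849": [0.8, 0.3],
    "9053": [0.2, 0.7], "55038": [0.1, 0.9],
    "06.04.05": [0.5, 0.5],
}
NODE_TYPE = {
    "D013999": "drug", "D020849": "drug",
    "9053": "gene", "55038": "gene",
    "06.04.05": "adr",
}

print(metapath2vec_probs("D013999", VECTORS))
print(metapath2vec_pp_probs("D013999", VECTORS, NODE_TYPE))
```

With type-specific normalization, a gene neighbor competes only against other genes, so its probability is no longer diluted by drug and ADR nodes.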

#### Prediction

After the representation vector of each node in the entire network is constructed by the two representation learning models, the prediction algorithm can be executed to obtain the related score between a drug node and a gene node.

The kernelized Bayesian matrix factorization (KBMF) method combines the ideas of multiple kernel learning and MF to employ more kinds of features, which can contribute to the prediction results. Every specific metapath can offer a group of latent features embedded in the representation vectors, and different features can make different contributions to the prediction. The conventional MF method cannot simultaneously take advantage of features from multiple domains; therefore, we used KBMF (Gönen et al., 2013), a variation of MF with kernels, to calculate the probability of a drug–gene pair. In the KBMF model, a set of kernels corresponds to a set of features from multiple domains. In our experiment, we used the representation vectors of each metapath instead of the set of kernels. Afterward, the model assigns a group of weights to these kernels to integrate every component linearly, with the assumption that the kernel weights are normally distributed without enforcing any constraints on them. The main process of KBMF is summarized in **Figure 6**, which demonstrates how to use more than one group of features and how to combine all kernels. In the diagram, there are $m$ groups of drug features and $n$ groups of gene features, indicated by the kernel matrices $K_d^m \in \mathbb{R}^{N_d \times N_d}$ and $K_g^n \in \mathbb{R}^{N_g \times N_g}$, respectively, where $N_d$ is the number of drugs in the training set and $N_g$ is the number of genes in the training set. In the same diagram, $A_d \in \mathbb{R}^{N_d \times X}$ and $A_g \in \mathbb{R}^{N_g \times X}$ represent the projection matrices of drugs and genes to the subspace with dimension $X$, respectively; $G_d^m = A_d^{\top} K_d^m$ refers to the component of a specific kernel for drugs, and $G_g^n$ has a similar meaning for genes. After the components for all the kernels are obtained, they can be combined with the kernel weights $e_d$ and $e_g$ to derive the composite components $H_d$ and $H_g$. Finally, the relative score for each drug–gene pair is calculated from $H_d$ and $H_g$.
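The data flow just described (kernel, component, composite component, score) can be traced with plain matrix arithmetic. The sketch below is a deterministic forward pass only: it omits the Bayesian inference that KBMF uses to learn the projections and weights, and it assumes that the final score matrix is the product $H_d^{\top} H_g$, which is not stated explicitly above. All function names and toy values are illustrative.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def combine(components, weights):
    """Weighted sum of equally shaped matrices: H = sum_m e^m * G^m."""
    rows, cols = len(components[0]), len(components[0][0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, components)) for j in range(cols)]
        for i in range(rows)
    ]


def kbmf_scores(a_d, kernels_d, e_d, a_g, kernels_g, e_g):
    """Score matrix (drugs x genes) from projections, kernels and kernel weights."""
    g_d = [matmul(transpose(a_d), k) for k in kernels_d]  # each is X x N_d
    g_g = [matmul(transpose(a_g), k) for k in kernels_g]  # each is X x N_g
    h_d = combine(g_d, e_d)
    h_g = combine(g_g, e_g)
    return matmul(transpose(h_d), h_g)  # assumed scoring rule: H_d^T H_g


# Toy setting: 2 drugs, 3 genes, subspace dimension X = 2, one kernel per side.
a_d = [[1, 0], [0, 2]]
a_g = [[1, 1], [2, 0], [0, 3]]
identity_2 = [[1, 0], [0, 1]]
identity_3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(kbmf_scores(a_d, [identity_2], [1.0], a_g, [identity_3], [1.0]))
```

With identity kernels and unit weights, the score matrix reduces to $A_d A_g^{\top}$, which recovers ordinary matrix factorization as a special case.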

#### EXPERIMENTAL RESULTS

#### Performance Metric

FIGURE 5 | An intuitive example of the heterogeneous network with drug, gene, and ADR nodes to demonstrate the main difference between metapath2vec and metapath2vec++. When the node drug "D013999" is selected as the central node and its neighbor set is produced, metapath2vec usually presents a probability matrix of all the neighbors together, whereas metapath2vec++ presents the probability of each neighbor type separately.

Receiver operating characteristic (ROC) curves (Hanley and McNeil, 1982) are widely used to assess the discrimination capability of data mining algorithms, especially for measuring link prediction results (Grover and Leskovec, 2016). Both the ROC curve and the area under the ROC curve (AUROC) can estimate the performance of a binary classifier. Compared with other performance metrics widely used for classification algorithms, such as precision and recall, the ROC curve remains stable when the distribution of positive and negative samples in the dataset changes. The ratio between positive and negative instances cannot always be balanced in real-world datasets. In fact, the number of negative samples is commonly much larger than the number of positive samples, thereby resulting in a data imbalance problem, which clearly affects the estimation. Thus, the precision–recall curve may significantly change along with the size of the dataset. Therefore, we used the ROC curve as the performance metric in our experiment.
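The stability of the ROC-based measure under class imbalance can be seen in a small example. The sketch below computes AUROC through its pairwise interpretation, that is, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are made up.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auroc(labels, scores))

# Doubling the negative samples changes the class ratio but not the AUROC.
print(auroc(labels + [0, 0], scores + [0.5, 0.1]))
```

The pairwise form is quadratic in the number of samples; for large datasets a rank-based computation gives the same value more efficiently.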

#### Parameter Setting

In this section, we introduce the parameters used in the experiment and compare the prediction results for each parameter adjustment. The metapath2vec and metapath2vec++ algorithms each include five parameters. To test the sensitivity of the models, we varied three of these parameters: the number of walkers w in the random walk algorithm, the length l of the metapath in the random walk process, and the dimension d of the representation vector.

FIGURE 7 | Prediction results obtained by (A–C) varying three parameters and (D) comparing metapath2vec and metapath2vec++. (A) ROC curves for changing w that indicates the numbers of walkers in the random walk algorithm. (B) ROC curves for changing l that indicates the lengths of the metapath in the random walk process. (C) ROC curves for changing d that indicates the dimensions of the representation vector. (D) ROC curves for comparisons of metapath2vec and metapath2vec++.

TABLE 1 | AUROC values for three varying parameters. The bold values indicate the best results in the experiments with different values of parameters.


#### Comparison Methods

To further verify the performance of both metapath2vec and metapath2vec++, we set the three parameters mentioned above to w = 1,000, l = 100, and d = 100, respectively, selected three existing algorithms commonly used for the link prediction problem, namely, Katz, CATAPULT, and KBMF, and compared them with the two models proposed in this study. In this section, we briefly discuss these conventional methods and present the comparison of the experimental results represented by ROC curves.

#### Katz

Katz (1953) is a well-known algorithm first proposed to improve the balloting results of a dataset for sociometric problems (Forsyth and Katz, 1946). It has since been successfully applied to heterogeneous networks for link prediction. This graph-based algorithm estimates the influence of a given node by counting its direct and indirect neighbors. The Katz method can be used to find the nodes related to a central node by measuring their similarities in both directed and undirected graphs. Thus, it can accurately predict the relationships between nodes in a social network, and its performance has been demonstrated (Singh-Blom et al., 2013). The main idea of Katz is that if numerous paths exist between node j and the given node i in a network, then the two nodes may be very similar because these indirect links connect them. A similarity score for each node pair can be obtained by counting the paths of different lengths that connect the two nodes, similar to a random walk (Wang and Landau, 2001; Semage, 2017) process with fixed end nodes. In our experiment, we used this method on our drug–gene–ADR network to predict the correlation between a drug and a gene. However, its prediction results are not as good as those of the other compared models because Katz takes an unweighted network as input and is therefore better suited to a homogeneous network. As a result, the interaction scores of the drug–gene pairs are poor. Furthermore, some ADRs are inconsequential because the drug–ADR matrix is not very symmetric. The ROC curve for the prediction result of Katz is shown in **Figure 8**.
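The path-counting idea can be made concrete with the usual truncated Katz score, $S = \sum_{l=1}^{L} \beta^{l} A^{l}$, where $A$ is the adjacency matrix and entry $(i, j)$ of $A^{l}$ counts the walks of length $l$ between nodes $i$ and $j$. The damping factor $\beta$, the cut-off $L$, and the three-node toy graph below are assumptions for illustration; the settings used in the experiment are not given here.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def katz_scores(adjacency, beta, max_length):
    """Truncated Katz similarity: sum over l = 1..max_length of beta**l * A**l."""
    n = len(adjacency)
    scores = [[0.0] * n for _ in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A**0
    for length in range(1, max_length + 1):
        power = matmul(power, adjacency)  # now holds A**length
        weight = beta ** length           # longer walks contribute less
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
    return scores


# Path graph 0 - 1 - 2: nodes 0 and 2 are linked only through node 1.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
scores = katz_scores(adjacency, beta=0.5, max_length=2)
print(scores[0][1], scores[0][2])
```

Nodes 0 and 2 share no edge, yet they receive a nonzero score through the length-2 walk via node 1, which is how Katz proposes unobserved links.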

#### CATAPULT

Drawing from the idea of Katz, Singh-Blom et al. (2013) proposed a new guilt-by-association (GBA) method called CATAPULT. GBA (Oliver, 2000) is a powerful heuristic method that infers whether a novel biological entity is associated with another known entity through similarities in function or structure. GBA methods can be used not only to illustrate the associated expression of a group of genes but also to predict the product of an unknown gene by searching for other genes that are correlated with the given one (Wolfe et al., 2005). As a GBA method, CATAPULT also presents good performance in predicting related genes and drugs in our experiment. By combining the Katz measure and machine learning, CATAPULT can assign appropriate weights to a set of links with different lengths by learning suitable features; thus, it improves the accuracy of the original Katz method. Furthermore, CATAPULT is also based on positive-unlabeled (PU) learning methods (Yang et al., 2012), which are suitable for datasets with only positive samples and unknown samples (Hsieh et al., 2015). Many real-world datasets seldom have definite negative samples; hence, traditional methods select unlabeled nodes to constitute a negative dataset. The noise in such a negative dataset can degrade the performance of the classifier because some of these nodes are potentially related nodes whose links are not absent but merely unknown. To address this problem, CATAPULT adopts a PU learning strategy that randomly picks unlabeled nodes as negative samples. The ROC curve for the prediction result of CATAPULT is also presented in **Figure 8**.

#### Matrix Factorization

MF is another typical algorithm employed for network data mining, and it is mainly used for structural link prediction (Menon and Elkan, 2011). MF decomposes a matrix into two or more low-rank matrices and can effectively reduce high dimensionality to obtain the latent structures of the original data. MF possesses several advantages (Lu and Yang, 2015). First, it can solve the data sparseness problem (Koren, 2008) and can easily be adopted in many fields with specific data. Second, it can easily find the optimal solution. Third, by combining the features of the row data and column data of the given matrix, MF can yield satisfactory results in link prediction. In the comparative experiments, we used KBMF.
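As a point of reference for the plain MF idea, the sketch below factorizes a small binary interaction matrix into two low-rank factors with stochastic gradient descent. It is a generic illustration, not the KBMF variant used in the experiments; the learning rate, regularization strength, rank, and toy matrix are all assumptions.

```python
import random


def predict(p_row, q_row):
    return sum(a * b for a, b in zip(p_row, q_row))


def squared_error(matrix, p, q):
    return sum(
        (matrix[i][j] - predict(p[i], q[j])) ** 2
        for i in range(len(matrix))
        for j in range(len(matrix[0]))
    )


def factorize(matrix, rank, epochs=300, lr=0.05, reg=0.01, seed=0):
    """Approximate `matrix` by P Q^T using regularized stochastic gradient descent."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    p = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    q = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for i in range(n_rows):
            for j in range(n_cols):
                err = matrix[i][j] - predict(p[i], q[j])
                for r in range(rank):
                    p_ir, q_jr = p[i][r], q[j][r]
                    p[i][r] += lr * (err * q_jr - reg * p_ir)
                    q[j][r] += lr * (err * p_ir - reg * q_jr)
    return p, q


# Toy drug x gene interaction matrix (1 = known interaction).
interactions = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
p, q = factorize(interactions, rank=2)
print(squared_error(interactions, p, q))
```

After training, the product of a drug row of `p` and a gene row of `q` serves as the predicted interaction score for that pair.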

The main idea of KBMF is to use a group of kernels following a normal distribution to use more than one set of node features. We used this algorithm in our method to complete the prediction part, after the representation vectors were obtained by the representation learning models. Thus, we compared simple KBMF and KBMF plus metapath2vec (denoted by metapath2vec in **Figure 8**). The ROC curves for the prediction results of both models are also shown in **Figure 8**.

#### CONCLUSION

The experimental results show that the extended representation learning methods present excellent performance on a heterogeneous network and possess good prospects in link prediction.

On the basis of the prediction results, we determined the best parameters for our biological heterogeneous network dataset, namely, 200 for the number of walkers, 150 for the length of a path in the random walk process, and 200 for the dimension of the representation vector of each node. Furthermore, we were able to understand the sensitivity of the two models to parameter variation. In the future, we expect to improve the prediction results by incorporating more comprehensive data and extend the prediction task to drug–ADR relationships.

#### AUTHOR CONTRIBUTIONS

SZ, XZ, and JB wrote the paper. JB performed the experiments. CL, SZ, and XM revised the paper.

#### ACKNOWLEDGMENTS

The work was supported by the National Natural Science Foundation of China (Grant Nos. 61472333, 61772441, 61472335, 61272152, and 41476118), Project of marine economic innovation and development in Xiamen (No. 16PFW034SF02), Natural Science Foundation of the Higher Education Institutions of Fujian Province (No. JZ160400), Natural Science Foundation of Fujian Province (No. 2017J01099), President Fund of Xiamen University (No. 20720170054). XZ is supported by Juan de la Cierva position (code: IJCI-2015-26991).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhu, Bing, Min, Lin and Zeng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identifying and Exploiting Potential miRNA-Disease Associations With Neighborhood Regularized Logistic Matrix Factorization

Bin-Sheng He<sup>1</sup>, Jia Qu<sup>2</sup>\* and Qi Zhao<sup>3,4</sup>\*

*<sup>1</sup> The First Affiliated Hospital, Changsha Medical University, Changsha, China, <sup>2</sup> School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China, <sup>3</sup> School of Mathematics, Liaoning University, Shenyang, China, <sup>4</sup> Research Center for Computer Simulating and Information Processing of Bio-Macromolecules of Liaoning Province, Shenyang, China*

#### Edited by:

*Quan Zou, Tianjin University, China*

#### Reviewed by:

*Pengwei Hu, Hong Kong Polytechnic University, Hong Kong; Yuangen Yao, Huazhong Agricultural University, China*

> \*Correspondence: *Jia Qu tb17060015b4@cumt.edu.cn Qi Zhao zhaoqi@lnu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *25 June 2018* Accepted: *18 July 2018* Published: *07 August 2018*

#### Citation:

*He B-S, Qu J and Zhao Q (2018) Identifying and Exploiting Potential miRNA-Disease Associations With Neighborhood Regularized Logistic Matrix Factorization. Front. Genet. 9:303. doi: 10.3389/fgene.2018.00303*

With the rapid development of biological research, microRNAs (miRNAs) have become an attractive topic because many experimental studies have revealed significant associations between miRNAs and diseases. However, considering that experiments are expensive and time-consuming, computational methods for predicting associations between miRNAs and diseases have become increasingly crucial. In this study, we proposed a neighborhood regularized logistic matrix factorization method for miRNA-disease association prediction (NRLMFMDA) by integrating miRNA functional similarity, disease semantic similarity, Gaussian interaction profile kernel similarity, and experimentally validated disease-miRNA associations. We used Gaussian interaction profile kernel similarity to compensate for the gaps in the traditional similarity measures and make them more reasonable and complete. Furthermore, NRLMFMDA also considered the important influence of neighborhood information and took full advantage of it to improve the accuracy of miRNA-disease association prediction. We also improved the accuracy by giving higher weights to the known association data when calculating the potential association probabilities. In global and local leave-one-out cross validation, NRLMFMDA obtained AUCs of 0.9068 and 0.8239, respectively. Moreover, the average AUC of NRLMFMDA in 5-fold cross validation was 0.8976 ± 0.0034. All three kinds of cross validation showed significant advantages over a number of previous models. In the case studies of breast neoplasms, esophageal neoplasms and lymphoma based on known miRNA-disease associations in the recent version of the HMDD database, 78, 80, and 74% of the top 50 predicted miRNAs were verified to be associated with these three diseases, respectively. In further case studies for a new disease without any known related miRNAs and for the previous version of the HMDD database, high proportions of the predicted miRNAs were also verified by experimental reports. All the validation results demonstrate the effectiveness and practicability of NRLMFMDA for predicting potential miRNA-disease associations.

Keywords: microRNA, disease, association prediction, neighborhood regularized, matrix factorization

# INTRODUCTION

MicroRNAs (miRNAs) are a category of short, endogenous, non-coding single-stranded RNAs (21∼24 nucleotides) that regulate gene expression by targeting mRNAs for cleavage or translational repression at the post-transcriptional level (Ambros, 2001, 2004; Bartel, 2004; Meister and Tuschl, 2004). The first miRNA was found 20 years ago, and since then thousands of miRNAs have been discovered in a wide variety of species (Jopling et al., 2005; Kozomara and Griffiths-Jones, 2011). Furthermore, more and more studies have found that miRNAs play crucial roles at multiple stages of biological processes (Lee et al., 1993; Chen et al., 2017b; Li et al., 2017), such as early cell growth, proliferation (Cheng et al., 2005), differentiation (Miska, 2005), development (Karp and Ambros, 2005), aging (Bartel, 2009), and apoptosis (Xu et al., 2004). Additionally, the key regulatory roles of miRNAs in the abnormal gene expression of cells have received increasing attention. For example, many studies have confirmed the dysregulation of miRNAs as a main cause of aberrant cell behavior (Griffiths-Jones et al., 2006). In recent years, more and more experiments have shown that miRNAs are closely connected with the development of many complex human diseases (Lynam-Lennon et al., 2009; Meola et al., 2009; Huang et al., 2016b). For example, studies have implicated miRNA-7a in the clinical significance of high mobility group A2 in human gastric cancer. Schulte et al. (2015) reported the capacity of miRNA-197 and miRNA-223 to predict cardiovascular death and the burden of future cardiovascular events in a large cohort of coronary artery disease patients. In addition, Thum et al. (2008) showed that miR-21 affects global cardiac structure and function by regulating the ERK–MAP kinase signaling pathway in cardiac fibroblasts. Therefore, identifying disease-related miRNAs is important and beneficial to the treatment, diagnosis, and prevention of a variety of clinically important diseases. Nevertheless, identifying the associations between miRNAs and diseases with experimental methods is expensive and time-consuming. With the development of biological technology, many experiments have produced vast numbers of miRNA-associated datasets. There is thus an urgent need to develop novel computational models for predicting potential miRNA-disease associations. In fact, many computational methods perform well in predicting miRNA-disease associations (Chen and Yan, 2013; Chen, 2015b; Chen et al., 2016a,g; Chen et al., 2018c). Further experimental studies can therefore be carried out more efficiently by selecting the most promising miRNAs predicted by computational models.

Based on the assumption that functionally similar miRNAs are more likely to be associated with phenotypically similar diseases, many computational approaches have been introduced for the identification of miRNA-disease associations (Bandyopadhyay et al., 2010; Jiang et al., 2010; Liu et al., 2016b; Pasquier and Gardès, 2016; Zeng et al., 2016b; Zou et al., 2016; Chen and Huang, 2017; Chen et al., 2017a,c,d; You et al., 2017; Chen et al., 2018a,b,d,e,f; Tang et al., 2018). A hypergeometric distribution-based model was proposed by Jiang et al. (2010). Using the known human disease-miRNA association network, the disease phenotype similarity network and the miRNA functional similarity network, this model predicted miRNA-disease associations. However, the method depended heavily on a set of miRNA-target associations that contained a high proportion of false positive and false negative samples. Shi et al. (2013) proposed a random walk algorithm-based model on the protein-protein interaction (PPI) network under the assumption that miRNAs have closer associations with the diseases that are more correlated with the miRNA targets. They obtained potential miRNA-disease associations by jointly considering miRNA–target interactions, disease–gene associations and PPIs. Mørk et al. (2014) presented the miRPD method, which integrates miRNA-protein association scores, protein-disease association scores and the proteins shared between miRNAs and diseases to obtain the best-scoring protein connections between miRNA-disease pairs. Xu et al. (2014) introduced a miRNA prioritization model that integrates known disease–gene associations and miRNA-target interactions. It is worth mentioning that this model is independent of experimentally verified miRNA-disease associations; instead, it calculates the similarity between miRNA targets and disease genes. Nonetheless, the aforementioned methods could not provide sufficiently accurate prediction results because of the incomplete disease-gene association network and/or the high proportion of false positive and false negative samples in the miRNA-target interactions.

Xuan et al. (2013a) constructed a computational method called HDMP for the identification of miRNA-disease associations based on experimentally verified miRNA-disease associations, miRNA functional similarity, disease semantic similarity and disease phenotype similarity. Based on the observation that miRNAs with similar functions are normally related to similar diseases and vice versa, they used the k nearest neighbors of each miRNA to estimate more reliable relevance scores for the unlabeled miRNAs. To overcome the shortcomings of previous methods, HDMP assigned higher weights to members of the same miRNA cluster when calculating the miRNA functional similarity. However, HDMP cannot prioritize miRNAs for diseases that have no known related miRNAs, or diseases for miRNAs that have no known related diseases. Additionally, HDMP could not perform better than most previous models that were based on global network similarity measures. A global network similarity-based computational model called RWRMDA was proposed by Chen et al. (2012b); it applied the random walk method to the dataset of known human miRNA–disease associations and miRNA functional similarity. RWRMDA showed excellent prediction performance in cross-validation and in case studies of several important complex human cancers. However, it has the non-negligible limitation that it cannot work for diseases without any known associated miRNAs. Chen et al. (2016f) developed another computational approach, WBSMDA, which integrates the Gaussian interaction profile kernel similarity, miRNA functional similarity, disease semantic similarity, and miRNA-disease associations to predict potential miRNA-disease associations. WBSMDA could effectively make predictions for diseases without known related miRNAs and for miRNAs without known related diseases. Recently, Chen et al. (2016d) developed a novel computational model named HGIMDA, which had superior performance compared with four classical methods (WBSMDA, RLSMDA, RWRMDA, and HDMP).

Nowadays, machine learning has been applied in a wide range of scientific fields, and it is highly effective for many research problems (Chen et al., 2012a, 2015c, 2016c; Wong et al., 2015; Huang et al., 2016b). Therefore, more and more studies have focused on it. For instance, Xu et al. (2011) proposed a computational model named miRNA-target dysregulated network (MTDN), which combined miRNA-target interactions with the expression patterns of miRNAs and mRNAs. In this model, a support vector machine (SVM) classifier was constructed to distinguish positive miRNA-disease associations from negative ones using features extracted from the network topology. However, negative miRNA-disease associations are difficult to obtain, and the ambiguity caused by negative samples usually affects the accuracy of supervised classifiers. Chen and Yan (2014) provided RLSMDA, a computational model that uses semi-supervised learning to predict potential disease-related miRNAs from disease semantic similarity, miRNA functional similarity, and known miRNA-disease associations. Furthermore, RLSMDA could also make predictions for diseases without any known related miRNAs (and for miRNAs without any known related diseases) and avoided the problem of using negative miRNA-disease associations. However, the way of combining the classifiers in different spaces and the selection of parameters for RLSMDA greatly influence the prediction result. Based on known miRNA-disease associations, Chen et al. (2015b) further developed a computational model, RBMMMDA, based on the restricted Boltzmann machine (RBM). RBMMMDA is a two-layer (visible and hidden) undirected graphical model, which can obtain not only new miRNA-disease associations but also the corresponding association types. Nevertheless, it is difficult to choose its parameter values.

In this work, we introduced a novel matrix factorization approach, namely neighborhood regularized logistic matrix factorization for miRNA-disease association prediction (NRLMFMDA). In consideration of the effectiveness of the classical methods with integrated similarities, we combined the Gaussian interaction profile kernel similarity with a modified matrix factorization to obtain more accurate prediction results. Based on the known miRNA-disease associations, disease semantic similarity, miRNA functional similarity, and Gaussian interaction profile kernel similarity, the proposed method predicts the probability that a miRNA is associated with a disease by mapping the miRNA and the disease to a shared low-dimensional latent space as two latent vectors. Additionally, we studied the local structure of the association data to further improve the prediction accuracy by exploiting the influence of the neighbors, namely the most similar miRNAs and the most similar diseases. Moreover, the proposed approach restricts the neighborhood to the nearest neighbors in order to avoid noisy information. Furthermore, we used global LOOCV, local LOOCV, and 5-fold cross validation to evaluate the effectiveness of NRLMFMDA. As a result, the AUCs of global and local LOOCV are 0.9068 and 0.8239, respectively. With 5-fold cross validation, NRLMFMDA obtained an average AUC of 0.8976 ± 0.0034. In three types of case studies, we tested the prediction performance of NRLMFMDA for known diseases in the recent version of the HMDD database, for new diseases without any known related miRNAs, and for known diseases based on the previous version of the HMDD database, respectively. As a result, most of the predicted miRNAs have been confirmed by recent experimental reports. Thus, we can conclude that NRLMFMDA is a useful tool for predicting potential miRNA-disease associations.

#### MATERIALS AND METHODS

#### Human miRNA-Disease Association

For convenience, we built an adjacency matrix Y ∈ R<sup>m×n</sup> to formalize the known miRNA-disease associations acquired from the HMDD v2.0 database (Li et al., 2014). The dataset used in this paper includes 5430 distinct experimentally confirmed miRNA-disease associations between 383 diseases and 495 miRNAs; m and n denote the numbers of miRNAs and diseases in the dataset, respectively. We stored the known association information in the matrix Y: if a miRNA r<sub>i</sub> has been experimentally verified to be associated with a disease d<sub>j</sub>, then y<sub>ij</sub> equals 1, otherwise 0.
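The construction of Y can be sketched in a few lines. This is a minimal illustration only: the function name `build_association_matrix` and the toy identifiers are hypothetical, and the pairs are invented for the example rather than taken from HMDD.

```python
def build_association_matrix(mirnas, diseases, known_pairs):
    """Return Y as an m x n list of lists; Y[i][j] = 1 for each known pair."""
    row = {name: i for i, name in enumerate(mirnas)}      # miRNA -> row index
    col = {name: j for j, name in enumerate(diseases)}    # disease -> column index
    Y = [[0] * len(diseases) for _ in mirnas]
    for mirna, disease in known_pairs:
        Y[row[mirna]][col[disease]] = 1
    return Y

# Toy input (hypothetical names, not real HMDD records)
mirnas = ["miR-a", "miR-b", "miR-c"]
diseases = ["dis-x", "dis-y"]
known_pairs = [("miR-a", "dis-x"), ("miR-b", "dis-y"), ("miR-a", "dis-y")]
Y = build_association_matrix(mirnas, diseases, known_pairs)
```

Rows index miRNAs and columns index diseases, matching the m × n shape given for Y.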

#### miRNA Functional Similarity

The miRNA functional similarity was calculated according to the method proposed by Wang et al. (2010), based on the consideration that functionally similar miRNAs tend to be associated with semantically similar diseases, and vice versa (Goh et al., 2007; Lu et al., 2008). Thanks to their work, the miRNA functional similarity data can be downloaded from http://www.cuilab.cn/files/images/cuilab/misim.zip. The matrix MS was constructed to represent the miRNA functional similarity, where the element MS(r<sub>i</sub>, r<sub>j</sub>) is the similarity between miRNA r<sub>i</sub> and miRNA r<sub>j</sub>.

#### Disease Semantic Similarity Model 1

We constructed a Directed Acyclic Graph (DAG) to describe the diseases according to the MeSH descriptors downloaded from the National Library of Medicine (http://www.nlm.nih.gov/) (Chen, 2015a; Chen et al., 2015a, 2016a,e; Huang et al., 2016a). Then we defined the contribution of disease d in DAG(D) to the semantic value of disease D as follows:

$$\begin{cases} D1\_D\left(d\right) = 1 & \text{if } d = D\\ D1\_D\left(d\right) = \max\left\{\Delta \cdot D1\_D\left(d'\right) \mid d' \in \text{children of } d\right\} & \text{if } d \neq D \end{cases} \tag{1}$$

where Δ is the semantic contribution decay factor, which we set to 0.5 (Xuan et al., 2013b). The self-semantic value of disease D is defined as follows:

$$DV1\,(D) = \sum\_{d \in T(D)} D1\_D(d) \tag{2}$$

where T(D) represents D itself and all its ancestor nodes. Based on the observation that two diseases whose DAGs share a larger part have a larger similarity score, the semantic similarity score between diseases d<sub>i</sub> and d<sub>j</sub> is defined as follows:

$$\text{SS1}(d\_i, d\_j) = \frac{\sum\_{t \in T(d\_i) \cap T(d\_j)} (D1\_{d\_i}(t) + D1\_{d\_j}(t))}{DV1(d\_i) + DV1(d\_j)} \tag{3}$$
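Equations (1) to (3) can be illustrated with a small sketch. It assumes the MeSH hierarchy is available as a `parents` dictionary mapping each term to its direct parents; that representation, the function names, and the four-term toy DAG are assumptions made for the example, not part of the original method description.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """D1_D(d) for every d in DAG(D): 1 for D itself, and for an ancestor the
    maximum of delta * D1_D(child) over its children inside the DAG (Eq. 1)."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        child = stack.pop()
        for parent in parents.get(child, ()):
            value = delta * contrib[child]
            if value > contrib.get(parent, 0.0):   # keep the largest contribution
                contrib[parent] = value
                stack.append(parent)               # re-propagate the improved value
    return contrib

def semantic_similarity_1(a, b, parents, delta=0.5):
    """SS1 (Eq. 3): shared contributions divided by DV1(a) + DV1(b) (Eq. 2)."""
    ca = semantic_contributions(a, parents, delta)
    cb = semantic_contributions(b, parents, delta)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))

# Toy DAG (hypothetical terms): A is the root, B and C are its children, D has parents B and C
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
ss1_bc = semantic_similarity_1("B", "C", parents)
```

In the toy DAG, B and C share only the root A, which contributes 0.5 to each, so the score is (0.5 + 0.5) / (1.5 + 1.5).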

#### Disease Semantic Similarity Model 2

In contrast to disease semantic similarity model 1, we considered that assigning the same contribution value to all diseases in the same layer of DAG(D) was not reasonable. In fact, a more specific disease, which appears in fewer DAGs, should contribute more to the semantic value of disease D. We therefore defined the contribution of disease d in DAG(D) to the semantic value of disease D as follows:

$$D2\_D(d) = -\log\left[\frac{\text{the number of DAGs including } d}{\text{the number of diseases}}\right] \tag{4}$$

As in disease semantic similarity model 1, we defined the semantic similarity between diseases d<sub>i</sub> and d<sub>j</sub> as the ratio of the summed contributions of their shared ancestor nodes (and themselves) to the summed contributions of all their ancestor nodes (and themselves), with DV2 defined analogously to DV1:

$$\text{SS2}(d\_i, d\_j) = \frac{\sum\_{t \in T(d\_i) \cap T(d\_j)} (D2\_{d\_i}(t) + D2\_{d\_j}(t))}{DV2(d\_i) + DV2(d\_j)} \tag{5}$$
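The second model can be sketched in the same style. The code below assumes, as a reading of the text, that DV2(D) is the sum of D2 over T(D) in the same way DV1 is defined for model 1; the `parents` dictionary, the function names, and the toy DAG are again hypothetical.

```python
import math

def ancestors_and_self(disease, parents):
    """T(D): the disease itself plus all of its ancestors."""
    seen = {disease}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def dag_contributions(diseases, parents):
    """D2(t) = -log(number of DAGs including t / number of diseases), Eq. (4)."""
    dags = {d: ancestors_and_self(d, parents) for d in diseases}
    counts = {}
    for terms in dags.values():
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
    d2 = {t: -math.log(c / len(diseases)) for t, c in counts.items()}
    return dags, d2

def semantic_similarity_2(a, b, dags, d2):
    """SS2 (Eq. 5). D2 does not depend on which DAG is considered, so each
    shared term contributes 2 * D2(t) to the numerator."""
    denominator = sum(d2[t] for t in dags[a]) + sum(d2[t] for t in dags[b])
    if denominator == 0:          # assumption: define the score as 0 in this corner case
        return 0.0
    return 2.0 * sum(d2[t] for t in dags[a] & dags[b]) / denominator

# Toy DAG (hypothetical terms): A is the root, B and C are its children, D has parents B and C
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
dags, d2 = dag_contributions(["A", "B", "C", "D"], parents)
ss2_bd = semantic_similarity_2("B", "D", dags, d2)
```

The root A appears in every DAG and therefore contributes nothing, which is the intended effect of weighting specific terms more heavily than general ones.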

#### Gaussian Interaction Profile Kernel Similarity

Considering that the Gaussian kernel function is one of the radial basis functions, whose value depends only on the distance from the origin, we constructed the Gaussian interaction profile kernel similarity as another similarity measure that differs from disease semantic similarity and miRNA functional similarity (Van et al., 2011; Chen et al., 2016b). The interaction profile vectors IV(d<sub>i</sub>) and IV(r<sub>j</sub>) are the i-th column and the j-th row of the adjacency matrix Y, respectively; they record whether disease d<sub>i</sub> is associated with each of the miRNAs and whether miRNA r<sub>j</sub> is associated with each of the diseases. Accordingly, the Gaussian interaction profile kernel similarity of diseases and miRNAs can be computed as follows:

$$GD(d\_i, d\_j) = \exp\left(-\beta\_d \left\| IV(d\_i) - IV(d\_j) \right\|^2\right) \tag{6}$$

$$GR(r\_i, r\_j) = \exp\left(-\beta\_r \left\| IV(r\_i) - IV(r\_j) \right\|^2\right) \tag{7}$$

where the adjustment coefficients β<sub>d</sub> and β<sub>r</sub> for the kernel bandwidth are defined as follows:

$$\beta\_d = \beta'\_d / \left(\frac{1}{n} \sum\_{i=1}^n \left\| IV(d\_i) \right\|^2 \right) \tag{8}$$

$$\beta\_r = \beta'\_{r} / \left(\frac{1}{m} \sum\_{i=1}^{m} \left\| IV(r\_i) \right\|^2 \right) \tag{9}$$

where β′<sub>d</sub> and β′<sub>r</sub> are the original bandwidths, both of which were set to 1 according to the previous literature (Chen and Yan, 2013).
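Equations (6) to (9) translate directly into array operations. The sketch below uses NumPy and assumes that Y has miRNAs as rows and diseases as columns, as in the definition of Y; the function name `gip_kernel` and the toy matrix are hypothetical.

```python
import numpy as np

def gip_kernel(profiles, bandwidth=1.0):
    """Gaussian interaction profile kernel for the rows of `profiles`.

    beta is the original bandwidth divided by the mean squared norm of the
    profiles (Eqs. 8 and 9); at least one profile must be non-zero."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    beta = bandwidth / np.mean(sq_norms)
    # Squared Euclidean distances between all pairs of profiles
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-beta * np.maximum(sq_dist, 0.0))

# Toy association matrix (hypothetical): 3 miRNAs x 2 diseases
Y = np.array([[1, 1],
              [0, 1],
              [1, 0]], dtype=float)
GR = gip_kernel(Y)      # miRNA profiles are the rows of Y
GD = gip_kernel(Y.T)    # disease profiles are the columns of Y
```

Passing `Y.T` for diseases keeps a single function for both kernels; the normalization then averages over diseases instead of miRNAs.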

#### Integrated Similarity for MiRNAs and Diseases

As mentioned above, a Directed Acyclic Graph (DAG) was introduced to describe each disease based on the MeSH descriptors, and disease semantic similarity was calculated under the assumption that two diseases whose DAGs share a larger part have a greater similarity score. However, for a disease without a DAG, we cannot calculate the semantic similarity between it and other diseases. Thus, for disease pairs that have no semantic similarity, we used the Gaussian interaction profile kernel similarity score to define their similarity. We defined the integrated disease similarity by combining disease semantic similarity and Gaussian interaction profile kernel similarity for diseases. Specifically, if diseases d<sub>i</sub> and d<sub>j</sub> have semantic similarity, the integrated disease similarity is the average of SS1 and SS2; otherwise, it takes the value of the Gaussian interaction profile kernel similarity. The formulation is as follows:

$$SD(d\_i, d\_j) = \begin{cases} \frac{\text{SS1}(d\_i, d\_j) + \text{SS2}(d\_i, d\_j)}{2} & d\_i \text{ and } d\_j \text{ have semantic similarity} \\ GD(d\_i, d\_j) & \text{otherwise} \end{cases} \tag{10}$$

In the same way, we defined the integrated miRNA similarity by combining miRNA functional similarity and Gaussian interaction profile kernel similarity for miRNAs, as follows:

$$\text{SR}(r\_i, r\_j) = \begin{cases} \text{MS}(r\_i, r\_j) & r\_i \text{ and } r\_j \text{ have functional similarity} \\ \text{GR}(r\_i, r\_j) & \text{otherwise} \end{cases} \tag{11}$$
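The two piecewise definitions amount to an element-wise selection. In the sketch below, the boolean masks `has_semantic` and `has_functional` are an assumed way of recording which pairs have a semantic or functional similarity; the text does not specify how this is stored, and the toy values are invented.

```python
import numpy as np

def integrate_disease_similarity(SS1, SS2, GD, has_semantic):
    """Eq. (10): average of SS1 and SS2 where available, GD elsewhere."""
    return np.where(has_semantic, (SS1 + SS2) / 2.0, GD)

def integrate_mirna_similarity(MS, GR, has_functional):
    """Eq. (11): MS where available, GR elsewhere."""
    return np.where(has_functional, MS, GR)

# Toy 2 x 2 example (hypothetical values)
SS1 = np.array([[1.0, 0.4], [0.4, 1.0]])
SS2 = np.array([[1.0, 0.2], [0.2, 1.0]])
GD = np.array([[1.0, 0.9], [0.9, 1.0]])
all_known = np.ones((2, 2), dtype=bool)
only_diagonal = np.eye(2, dtype=bool)
SD_known = integrate_disease_similarity(SS1, SS2, GD, all_known)
SD_fallback = integrate_disease_similarity(SS1, SS2, GD, only_diagonal)
```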

#### NRLMFMDA

In this study, we proposed a neighborhood regularized logistic matrix factorization method for miRNA-disease association prediction (NRLMFMDA) by integrating known miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity (see **Figure 1**). To our knowledge, matrix factorization has been applied to recommender systems with successful association prediction results. For example, logistic matrix factorization (LMF) (Johnson, 2014) has been demonstrated to be effective for personalized recommendations. Therefore, the probability of an association between a miRNA and a disease can be computed based on it. In detail, we mapped the diseases and the miRNAs into a shared latent space with dimensionality r, which is far lower than the minimum of m and n. The latent vectors u<sub>i</sub> ∈ R<sup>1×r</sup> and v<sub>j</sub> ∈ R<sup>1×r</sup> represent the properties of miRNA r<sub>i</sub> and disease d<sub>j</sub>, respectively. For simplicity, we further denote the latent vectors of all miRNAs and all diseases by U ∈ R<sup>m×r</sup> and V ∈ R<sup>n×r</sup>, respectively, where u<sub>i</sub> is the i-th row of U and v<sub>j</sub> is the j-th row of V. The prior distributions of U and V are assumed to be zero-mean Gaussian distributions with variances σ<sub>r</sub><sup>2</sup> and σ<sub>d</sub><sup>2</sup>, respectively:

$$p(U|\sigma\_r^2) = \prod\_{i=1}^m N(u\_i|0, \sigma\_r^2 I), \quad p(V|\sigma\_d^2) = \prod\_{j=1}^n N(v\_j|0, \sigma\_d^2 I) \tag{12}$$

where I denotes the identity matrix. Then, by Bayes' theorem, we have

$$p(U, V | Y, \sigma\_r^2, \sigma\_d^2) \propto p(Y | U, V) p(U | \sigma\_r^2) p(V | \sigma\_d^2). \tag{13}$$

Based on the assumption that all the training examples are independent, we denoted the probability of associations under the condition of U and V as follows:

$$p(Y|U,V) = \prod\_{i=1}^{m} \prod\_{j=1}^{n} p\_{ij}^{c y\_{ij}} (1 - p\_{ij})^{(1 - y\_{ij})} \tag{14}$$

where the probability p<sub>ij</sub> of the association between miRNA r<sub>i</sub> and disease d<sub>j</sub> is defined as follows:

$$p\_{ij} = \frac{\exp(u\_i v\_j^T)}{1 + \exp(u\_i v\_j^T)}\tag{15}$$
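As a small numerical illustration, the probability matrix and the weighted log-likelihood of the observed associations can be computed as below. The function names are hypothetical, `c` stands for the importance weight placed on known associations (set to 5 in this study), and the prior terms on U and V are deliberately left out.

```python
import numpy as np

def association_probabilities(U, V):
    """P[i, j] = exp(u_i v_j^T) / (1 + exp(u_i v_j^T)), Eq. (15)."""
    return 1.0 / (1.0 + np.exp(-(U @ V.T)))

def weighted_log_likelihood(Y, U, V, c=5.0):
    """log p(Y | U, V) with known associations weighted by c.

    Uses log(p) = s - log(1 + e^s) and log(1 - p) = -log(1 + e^s), s = u_i v_j^T."""
    scores = U @ V.T
    softplus = np.logaddexp(0.0, scores)          # log(1 + exp(s)), numerically stable
    return float(np.sum(c * Y * scores - (1.0 + c * Y - Y) * softplus))

# Toy check (hypothetical sizes): with zero latent vectors every probability is 0.5
Y = np.array([[1.0, 0.0]])
U = np.zeros((1, 2))
V = np.zeros((2, 2))
P = association_probabilities(U, V)
log_lik = weighted_log_likelihood(Y, U, V)
```

With all scores equal to zero, the known pair contributes 5 · log 0.5 and the unknown pair contributes log 0.5, which gives −6 · log 2.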

The known associations between diseases and miRNAs are assigned a higher importance level c (c > 1), which was empirically set to 5 in our experiments, so that the trustworthy data contribute more to the predictions. Taking the logarithm of both sides of Equation (13) gives:

$$\begin{aligned} \log p(U, V | Y, \sigma\_r^2, \sigma\_d^2) &= \sum\_{i=1}^m \sum\_{j=1}^n \left\{ c y\_{ij} u\_i v\_j^T - (1 + c y\_{ij} - y\_{ij}) \log\left[1 + \exp(u\_i v\_j^T)\right] \right\} \\ &\quad - \frac{1}{2 \sigma\_r^2} \sum\_{i=1}^m \left\|u\_i\right\|\_2^2 - \frac{1}{2 \sigma\_d^2} \sum\_{j=1}^n \left\|v\_j\right\|\_2^2 + C \end{aligned} \tag{16}$$

where C is a constant. We maximized the posterior distribution to obtain the most probable U and V, which is equivalent to the following problem:

$$\min\_{U,V} \sum\_{i=1}^{m} \sum\_{j=1}^{n} \left\{ (1 + c y\_{ij} - y\_{ij}) \log\left[1 + \exp(u\_i v\_j^T)\right] - c y\_{ij} u\_i v\_j^T \right\} + \frac{\lambda\_r}{2} \left\| U \right\|\_F^2 + \frac{\lambda\_d}{2} \left\| V \right\|\_F^2 \tag{17}$$

where λ<sub>r</sub> = 1/σ<sub>r</sub><sup>2</sup>, λ<sub>d</sub> = 1/σ<sub>d</sub><sup>2</sup>, and ‖·‖<sub>F</sub> is the Frobenius norm of a matrix. We solved this minimization problem with an alternating gradient descent method (Johnson, 2014). Because a miRNA or a disease is strongly associated with its neighbors, the nearest miRNAs and diseases provide the most useful information about how to factorize the association matrix reasonably. Therefore, our objective is to minimize the distances between the latent vector of each miRNA r<sub>i</sub> and those of its nearest neighbors in the set N(r<sub>i</sub>), which is formed by the K<sub>1</sub> nearest neighbors of r<sub>i</sub>. Likewise, for each disease d<sub>j</sub>, N(d<sub>j</sub>) is the set formed by the K<sub>1</sub> nearest neighbors of d<sub>j</sub>. K<sub>1</sub> is empirically set to 5 in the experiments. We used the adjacency matrices A and B to represent the neighborhood information, and their elements a<sub>iμ</sub> and b<sub>jν</sub> are defined as follows:

$$a\_{i\mu} = \begin{cases} SR(r\_i, r\_\mu) & \text{if } r\_\mu \in N(r\_i) \\ 0 & \text{otherwise} \end{cases} \tag{18}$$

$$b\_{j\nu} = \begin{cases} SD(d\_j, d\_\nu) & \text{if } d\_\nu \in N(d\_j)\\ 0 & \text{otherwise} \end{cases} \tag{19}$$

Based on them, we aimed to minimize the following functions:

$$\begin{aligned} &\frac{\alpha}{2} \sum\_{i=1}^{m} \sum\_{\mu=1}^{m} a\_{i\mu} \left\| u\_i - u\_{\mu} \right\|\_{F}^{2} \\ &= \frac{\alpha}{2} \left[ \sum\_{i=1}^{m} (\sum\_{\mu=1}^{m} a\_{i\mu}) u\_i u\_i^T + \sum\_{\mu=1}^{m} (\sum\_{i=1}^{m} a\_{i\mu}) u\_{\mu} u\_{\mu}^T \right] \\ &- \frac{\alpha}{2} tr(U^T A U) - \frac{\alpha}{2} tr(U^T A^T U) \\ &= \frac{\alpha}{2} tr(U^T L^r U) \end{aligned} \tag{20}$$

$$\frac{\beta}{2} \sum\_{j=1}^{n} \sum\_{\nu=1}^{n} b\_{j\nu} \left\| v\_j - v\_\nu \right\|\_F^2 = \frac{\beta}{2} tr(V^T L^d V) \tag{21}$$

where the matrices L<sup>r</sup> and L<sup>d</sup> are defined as

$$L^r = (D^r + \tilde{D}^r) - (A + A^T), \qquad L^d = (D^d + \tilde{D}^d) - (B + B^T)$$

Here D<sup>r</sup>, D̃<sup>r</sup>, D<sup>d</sup>, and D̃<sup>d</sup> are diagonal matrices whose diagonal elements are

$$D^r\_{ii} = \sum\_{\mu=1}^{m} a\_{i\mu}, \quad \tilde{D}^r\_{\mu\mu} = \sum\_{i=1}^{m} a\_{i\mu}, \quad D^d\_{jj} = \sum\_{\nu=1}^{n} b\_{j\nu}, \quad \tilde{D}^d\_{\nu\nu} = \sum\_{j=1}^{n} b\_{j\nu}$$

respectively. According to the analysis above, the integrated formulation to minimize the objective function F is as follows:

$$\begin{aligned} \min\_{U,V} F = \min\_{U,V} &\sum\_{i=1}^{m} \sum\_{j=1}^{n} \left\{ (1 + c y\_{ij} - y\_{ij}) \log\left[ 1 + \exp(u\_i v\_j^T) \right] - c y\_{ij} u\_i v\_j^T \right\} \\ &+ \frac{1}{2} tr\left[ U^T (\lambda\_r I + \alpha L^r) U \right] + \frac{1}{2} tr\left[ V^T (\lambda\_d I + \beta L^d) V \right] \end{aligned} \tag{22}$$
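The neighborhood matrices and the identity in Equation (20) can be checked numerically. The sketch below makes two assumptions that the text does not state: an item is not counted as its own neighbor, and ties in similarity are broken by index order. The function names and the random toy similarity matrix are hypothetical.

```python
import numpy as np

def neighborhood_matrix(S, k):
    """Row i keeps the similarities to the k most similar other items (Eqs. 18, 19)."""
    S = np.asarray(S, dtype=float)
    A = np.zeros_like(S)
    for i in range(S.shape[0]):
        order = np.argsort(-S[i], kind="stable")          # most similar first
        neighbors = [j for j in order if j != i][:k]      # assumption: exclude self
        A[i, neighbors] = S[i, neighbors]
    return A

def neighborhood_laplacian(A):
    """L = (D + D~) - (A + A^T), with D and D~ the row and column sums of A."""
    return np.diag(A.sum(axis=1) + A.sum(axis=0)) - (A + A.T)

# Toy check with a random symmetric similarity matrix (hypothetical data)
rng = np.random.default_rng(0)
S = rng.random((4, 4))
S = (S + S.T) / 2.0
np.fill_diagonal(S, 1.0)
A = neighborhood_matrix(S, k=2)
L = neighborhood_laplacian(A)
U = rng.normal(size=(4, 3))
pairwise = sum(A[i, j] * np.sum((U[i] - U[j]) ** 2)
               for i in range(4) for j in range(4))
trace_form = np.trace(U.T @ L @ U)
```

`pairwise` and `trace_form` agree, which is the identity that lets the neighborhood penalty be folded into the trace terms of the objective.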

The alternating gradient descent method requires the partial derivatives of F with respect to U and V, which are computed and simplified as follows:

$$\begin{aligned} \frac{\partial F}{\partial U} &= PV + (c-1)(Y \ast P)V - cYV + (\lambda\_r I + \alpha L^r)U\\ \frac{\partial F}{\partial V} &= P^T U + (c-1)(Y^T \ast P^T)U - cY^T U + (\lambda\_d I + \beta L^d)V \end{aligned} \tag{23}$$

where P ∈ R<sup>m×n</sup> is the matrix with elements p<sub>ij</sub> defined in Equation (15) and ∗ represents the Hadamard product. The gradient step size is chosen based on the AdaGrad algorithm (Duchi et al., 2011). In the experiments, we selected the dimensionality of the latent space r from {50, 100}. We set λ<sub>r</sub> = λ<sub>d</sub> and chose their value from {2<sup>−5</sup>, 2<sup>−4</sup>, …, 2<sup>1</sup>}. The neighborhood regularization parameters α and β were selected from {2<sup>−5</sup>, 2<sup>−4</sup>, …, 2<sup>2</sup>} and {2<sup>−5</sup>, 2<sup>−4</sup>, …, 2<sup>0</sup>}, respectively. The optimal learning rate γ was selected from {2<sup>−3</sup>, 2<sup>−2</sup>, …, 2<sup>0</sup>}.
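The optimization can be sketched as follows. This is an illustrative reading of Equations (22) and (23) with a plain AdaGrad update, not the authors' implementation: the function names, the initialization scale, the default hyperparameters, the iteration count, and the toy inputs are all assumptions.

```python
import numpy as np

def laplacian(A):
    """L = (D + D~) - (A + A^T), with D and D~ the row and column sums of A."""
    return np.diag(A.sum(axis=1) + A.sum(axis=0)) - (A + A.T)

def objective(Y, U, V, Mu, Mv, c):
    """Eq. (22), with Mu = lam_r*I + alpha*L^r and Mv = lam_d*I + beta*L^d."""
    S = U @ V.T
    fit = np.sum((1 + c * Y - Y) * np.logaddexp(0.0, S) - c * Y * S)
    return fit + 0.5 * np.trace(U.T @ Mu @ U) + 0.5 * np.trace(V.T @ Mv @ V)

def gradients(Y, U, V, Mu, Mv, c):
    """Eq. (23); Y * P is the element-wise (Hadamard) product."""
    P = 1.0 / (1.0 + np.exp(-(U @ V.T)))
    grad_U = P @ V + (c - 1) * (Y * P) @ V - c * Y @ V + Mu @ U
    grad_V = P.T @ U + (c - 1) * (Y * P).T @ U - c * Y.T @ U + Mv @ V
    return grad_U, grad_V

def train(Y, A, B, rank=2, c=5.0, lam=0.5, alpha=0.5, beta=0.5,
          lr=0.1, iters=200, seed=0):
    """Alternating AdaGrad updates of U (V fixed) and V (U fixed)."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U = 0.1 * rng.normal(size=(m, rank))      # assumed small random initialization
    V = 0.1 * rng.normal(size=(n, rank))
    Mu = lam * np.eye(m) + alpha * laplacian(A)
    Mv = lam * np.eye(n) + beta * laplacian(B)
    hist_U, hist_V = np.zeros_like(U), np.zeros_like(V)
    for _ in range(iters):
        grad_U, _ = gradients(Y, U, V, Mu, Mv, c)
        hist_U += grad_U ** 2                  # AdaGrad: accumulate squared gradients
        U -= lr * grad_U / (np.sqrt(hist_U) + 1e-8)
        _, grad_V = gradients(Y, U, V, Mu, Mv, c)
        hist_V += grad_V ** 2
        V -= lr * grad_V / (np.sqrt(hist_V) + 1e-8)
    return U, V

# Toy run (hypothetical data): 4 miRNAs, 3 diseases, sparse neighborhood matrices
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]], dtype=float)
A = np.array([[0, .8, 0, 0], [.8, 0, 0, 0], [0, 0, 0, .6], [0, 0, .6, 0]])
B = np.array([[0, .5, 0], [.5, 0, .4], [0, .4, 0]])
U, V = train(Y, A, B)
P = 1.0 / (1.0 + np.exp(-(U @ V.T)))          # predicted association probabilities
```

Keeping `objective` next to `gradients` makes it easy to compare the analytic gradient with a finite-difference estimate before running a longer optimization.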

In the training procedure, the latent vectors of new diseases and new miRNAs are learned from mixed negative samples (which include potential positive miRNA-disease associations), and this leads to a bias in the prediction results. Therefore, before obtaining the final probabilities with the learned U and V above, we further improved the prediction accuracy for new diseases or new miRNAs by replacing the latent vector of each such item with a linear combination of the latent vectors of its nearest positive neighbors. For a miRNA r<sub>i</sub> in the negative set M<sup>−</sup>, the set of new miRNAs without any known related diseases, we denoted its K<sub>2</sub> nearest neighbors in the positive set M<sup>+</sup> by N<sup>+</sup>(r<sub>i</sub>). Likewise, for a disease d<sub>j</sub> in the negative set D<sup>−</sup>, the set of new diseases without any known related miRNAs, we denoted its K<sub>2</sub> nearest neighbors in the positive set D<sup>+</sup> by N<sup>+</sup>(d<sub>j</sub>), where K<sub>2</sub> is empirically set to 5 in the experiments. Hence, the modified association probability is represented as follows:

$$\hat{p}\_{ij} = \frac{\exp(\tilde{u}\_i \tilde{\nu}\_j^T)}{1 + \exp(\tilde{u}\_i \tilde{\nu}\_j^T)}\tag{24}$$

where

$$
\begin{aligned}
\tilde{u}\_{i} &= \begin{cases}
u\_{i} & \text{if } r\_{i} \in M^{+}, \\
\dfrac{1}{\sum\_{w \in N^{+}(r\_{i})} \text{SR}(r\_{i}, r\_{w})} \sum\_{w \in N^{+}(r\_{i})} \text{SR}(r\_{i}, r\_{w})\, u\_{w} & \text{if } r\_{i} \in M^{-},
\end{cases} \\
\tilde{\nu}\_{j} &= \begin{cases}
\nu\_{j} & \text{if } d\_{j} \in D^{+}, \\
\dfrac{1}{\sum\_{z \in N^{+}(d\_{j})} \text{SD}(d\_{j}, d\_{z})} \sum\_{z \in N^{+}(d\_{j})} \text{SD}(d\_{j}, d\_{z})\, \nu\_{z} & \text{if } d\_{j} \in D^{-}.
\end{cases}
\end{aligned} \tag{25}
$$

The modified latent vectors help overcome the bias caused by using uncertain negative samples to train the latent vectors of miRNAs and diseases in the negative sets.
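The neighbor-based replacement of latent vectors can be illustrated with a short sketch. It is a minimal example rather than the authors' code: the function name, the boolean `has_known` mask marking entities with at least one known association, and the dense similarity matrix are assumptions made for illustration. The same routine would apply to miRNAs (with the miRNA similarity) and to diseases (with the disease similarity).

```python
import numpy as np

def smooth_latent_vectors(latent, similarity, has_known, k=5):
    """Replace the latent vector of each entity without known associations by the
    similarity-weighted average of its k most similar entities that have some."""
    latent = np.asarray(latent, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    has_known = np.asarray(has_known, dtype=bool)
    smoothed = latent.copy()
    positives = np.flatnonzero(has_known)
    for i in np.flatnonzero(~has_known):
        # Indices of the k nearest positive neighbors, most similar first.
        order = np.argsort(similarity[i, positives])[::-1][:k]
        neighbors = positives[order]
        weights = similarity[i, neighbors]
        total = weights.sum()
        if total > 0:  # leave the vector unchanged if no informative neighbor exists
            smoothed[i] = weights @ latent[neighbors] / total
    return smoothed
```

Vectors of entities in the positive set are returned unchanged, matching the first case of each definition in equation (25).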

# RESULTS

#### Performance Evaluation

Leave-one-out cross validation (LOOCV) and 5-fold cross validation were applied to evaluate the performance of NRLMFMDA. LOOCV was implemented in two ways. (1) Global LOOCV was based on the experimentally confirmed miRNA-disease associations in the HMDD v2.0 database. "Global" means that each known miRNA-disease association was left out in turn as the test sample, while the unconfirmed miRNA-disease pairs served as candidate samples. After NRLMFMDA calculated the prediction scores of all miRNA-disease pairs, we compared the score of each test sample with the scores of all candidate samples to observe whether its rank was above a threshold given in advance. (2) Unlike global LOOCV, local LOOCV compared the score of each test sample only with the candidate samples composed of the miRNA-disease pairs whose miRNAs did not have any known association with the investigated disease. If the rank of the test association exceeded the given threshold, the model was considered to have successfully predicted this miRNA-disease association. Further, we drew the receiver operating characteristic (ROC) curve by plotting the true positive rate (TPR, sensitivity) against the false positive rate (FPR, 1 − specificity) at different thresholds. Sensitivity is the percentage of positive samples correctly identified among all positives, and specificity is the percentage of negative samples correctly identified among all negatives. The prediction ability of NRLMFMDA was then evaluated by the area under the ROC curve (AUC): AUC = 1 indicates perfect prediction performance, and AUC = 0.5 indicates random prediction performance. The results showed that NRLMFMDA obtained AUCs of 0.9068 and 0.8239 in global and local LOOCV, respectively (see **Figure 2**).
These AUC results indicate that NRLMFMDA delivers reliable and effective performance in predicting potential miRNA-disease associations. By comparison, HGIMD, RLSMDA, HDMP, and WBSMDA obtained AUCs of 0.8781, 0.8426, 0.8366, and 0.8030 in global LOOCV, respectively. In local LOOCV, their AUCs were 0.8077, 0.6953, 0.7702, and 0.8031, respectively. RWRMDA has only a local LOOCV AUC (0.7891), which is one of its shortcomings, because it cannot uncover the missing associations for all diseases simultaneously. Therefore, in comparison with the previous methods, NRLMFMDA improves the prediction of miRNA-disease associations.
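To make the ROC and AUC definitions concrete, the following sketch computes TPR, FPR, and the trapezoidal AUC from a list of scores and binary labels. It is a generic illustration, not the evaluation code used in the study. It assumes that test samples are labeled 1 and candidate samples 0, and that scores are distinct (ties are not handled specially).

```python
import numpy as np

def roc_auc(scores, labels):
    """Return (fpr, tpr, auc) obtained by sweeping the threshold over sorted scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")   # highest score first
    ranked = labels[order]
    tp = np.cumsum(ranked)                       # true positives at each cutoff
    fp = np.cumsum(1 - ranked)                   # false positives at each cutoff
    tpr = np.concatenate(([0.0], tp / tp[-1]))   # sensitivity
    fpr = np.concatenate(([0.0], fp / fp[-1]))   # 1 - specificity
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```

A perfect ranking, in which every test sample scores above every candidate sample, gives an AUC of 1.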

TABLE 1 | Performance evaluation comparison between NRLMFMDA and several other typical models in global LOOCV, local LOOCV and 5-fold cross validation based on known miRNA-disease associations.

TABLE 2 | Prediction of the top 50 predicted miRNAs associated with breast neoplasms based on known associations in HMDD database.

*The first column records top 1–25 related miRNAs. The second column records the top 26–50 related miRNAs.*

Additionally, we implemented 5-fold cross validation to evaluate the prediction effectiveness of NRLMFMDA. We first divided the known miRNA-disease associations randomly into five parts. Each of the five parts was treated in turn as the test samples, while the remaining four parts were regarded as training samples. As in LOOCV, the miRNA-disease pairs without known evidence of association were regarded as candidate samples. The scores of the test samples were then compared with the scores of the candidate samples to obtain their rankings. This procedure was repeated 100 times with random divisions to make the validation more accurate. The average AUC of NRLMFMDA in 5-fold cross validation was 0.8976 ± 0.0034, compared with 0.8569 ± 0.0020, 0.8342 ± 0.0010, and 0.8185 ± 0.0009 for RLSMDA, HDMP, and WBSMDA, respectively, which further confirmed its effectiveness and superiority in predicting potential miRNA-disease associations. Finally, to give a clear overview of the predictive performance of NRLMFMDA, we list the evaluation results of NRLMFMDA and several other typical models in global LOOCV, local LOOCV, and 5-fold cross validation in tabular form (see **Table 1**).
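The fold construction for 5-fold cross validation can be sketched as below. This is a hedged illustration under assumptions: the association matrix `Y` is a binary miRNA × disease array, and the generator name and `seed` argument are hypothetical. Each iteration hides one fifth of the known associations in the training matrix and returns the hidden pairs as test samples.

```python
import numpy as np

def five_fold_splits(Y, n_folds=5, seed=0):
    """Yield (Y_train, test_pairs) for each fold over the known associations in Y."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(Y == 1)                 # (row, col) of each known association
    known = known[rng.permutation(len(known))]  # random division into folds
    for test_pairs in np.array_split(known, n_folds):
        Y_train = Y.copy()
        # Hidden test associations are treated as unknown during training.
        Y_train[test_pairs[:, 0], test_pairs[:, 1]] = 0
        yield Y_train, test_pairs
```

Repeating the procedure with different `seed` values corresponds to the repeated random divisions described above.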

#### Case Studies

Using two other miRNA-disease association databases, namely dbDEMC (Yang et al., 2010) and miR2Disease (Jiang et al., 2009), we studied three common major human diseases to verify the prediction results of NRLMFMDA. The dataset of 5430 known miRNA-disease associations from HMDD v2.0 was used as the training set. For each disease, all candidate miRNAs were ranked by their predicted scores, and the top 50 predicted miRNAs were checked against the two other miRNA-disease association databases (i.e., dbDEMC and miR2Disease). It is worth noting that only candidate miRNAs without known associations with the investigated disease were ranked and checked. Therefore, there is no overlap between the training samples and the prediction lists, and none of the top 50 predictions is a known association in HMDD v2.0. We then counted how many of the top 10, top 20, and top 50 predicted miRNAs for each of the three diseases were verified by the two databases.
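The counting step used in these case studies can be sketched as follows. The function and argument names are hypothetical, and the inputs are assumed to be a score per miRNA for one disease, the set of miRNAs already associated with that disease in the training data, and the set of miRNAs confirmed by an independent database.

```python
def confirmed_in_top_k(scores, known, confirmed, ks=(10, 20, 50)):
    """Count how many of the top-k candidate miRNAs appear in the confirmed set.

    scores: dict mapping miRNA name -> predicted score for one disease
    known: miRNAs with a known association in the training data (excluded from ranking)
    confirmed: miRNAs supported by an independent database
    """
    candidates = [m for m in scores if m not in known]
    ranked = sorted(candidates, key=lambda m: scores[m], reverse=True)
    return {k: sum(1 for m in ranked[:k] if m in confirmed) for k in ks}
```

Excluding the known associations before ranking is what guarantees that the prediction list does not overlap with the training samples.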

Breast cancer is a worldwide threat to women's health and has caused a large number of deaths among women all over the world. More than 80% of breast cancers in the western world are hormone-receptor positive (Van et al., 2014). About 232,340 new cases of invasive breast cancer and 39,620 breast cancer deaths occurred among women in America in 2013. More and more researchers have paid attention to the role of miRNAs in the etiology of breast cancer, and increasing evidence shows that several miRNAs are closely related to breast cancer and play important roles in its tumorigenesis. For example, among the differentially expressed miRNAs, miR-10b, miR-125b, miR-145, miR-21, and miR-155 were the most consistently deregulated in breast cancer. It is worth noting that miR-10b, miR-125b, and miR-145 were down-regulated and the other two, miR-21 and miR-155, were up-regulated, which suggests that they may act as tumor suppressor genes and oncogenes, respectively (Iorio et al., 2005). After implementing NRLMFMDA, we obtained the rankings of all potential miRNA-disease associations from HMDD v2.0. The final results showed that 8, 16, and 39 of the top 10, 20, and 50 potential miRNAs associated with breast cancer were confirmed, respectively (see **Table 2**).

Esophageal neoplasms are cancers arising from the esophagus, which runs between the throat and the stomach, and they remain common. The estimated numbers of new esophageal cancer cases and deaths were 291,238 and 218,957, respectively. The crude incidence and mortality rates for esophageal cancer were 21.62/100,000 and 16.25/100,000, respectively (Zeng et al., 2016a). Research has shown that low expression of let-7b and let-7c in before-treatment biopsies was associated with poor response to chemotherapy, both clinically and histopathologically, as observed in a training set of 74 patients (Sugimura et al., 2012). NRLMFMDA was implemented to identify esophageal neoplasms-associated miRNAs. As a result, 9 of the top 10 and 40 of the top 50 predicted esophageal neoplasms-related miRNAs were experimentally confirmed by reports (see **Table 3**).

Lymphoma is a group of blood cell tumors that develop from lymphocytes, a type of white blood cell. Hodgkin lymphoma and non-Hodgkin lymphoma (NHL) are the two main types, and NHL accounts for about 90% of patients (Alizadeh et al., 2000). Experimental studies showed that miR-155 is significantly up-regulated in some Burkitt's lymphomas and several other types of lymphoma (Metzler, 2004). In canine B-cell lymphomas, the expression of the miRNA hsa-mir-19a was increased compared with normal canine peripheral blood mononuclear cells (PBMC) and normal lymph nodes (LN). After implementing NRLMFMDA, we took lymphoma as a case study for the identification of potential miRNA-disease associations. The results showed that 8 of the top 10 and 37 of the top 50 potential lymphoma-associated miRNAs in the prediction result list have been verified by recent experimental reports (see **Table 4**).

TABLE 3 | Prediction of the top 50 predicted miRNAs associated with esophageal neoplasms based on known associations in HMDD database.

*The first column records top 1–25 related miRNAs. The second column records the top 26–50 related miRNAs.*

TABLE 4 | Prediction of the top 50 predicted miRNAs associated with lymphoma based on known associations in HMDD database.

*The first column records top 1–25 related miRNAs. The second column records the top 26–50 related miRNAs.*

To present the complete ranking, we provide the prediction list of all potential miRNA-disease associations in the HMDD v2.0 database together with their association scores predicted by NRLMFMDA (see **Supplementary Table 1**).

In addition, we tested the prediction ability of NRLMFMDA for new diseases, namely those that have no known association with any miRNA. To do so, we hid the association information between the miRNAs and the test disease by setting all of the known associations between them to unknown. After implementing NRLMFMDA, we obtained the ranking of the miRNA-disease association prediction scores. The ranking for hepatocellular carcinoma is shown in **Table 5**, in which 9, 18, and 42 of the top 10, 20, and 50 related miRNAs were confirmed by at least one of the three databases HMDD, dbDEMC, and miR2Disease. Moreover, hsa-mir-146a was ranked first in the top 50, and research has confirmed that a functional polymorphism (rs2910164) in the miR-146a gene is associated with the risk of hepatocellular carcinoma (Xu et al., 2008).

Finally, we implemented NRLMFMDA on the old version of the HMDD database to observe whether the model still performs well on it. The proposed method remained effective in predicting potential miRNA-disease associations on this earlier dataset. For instance, 5, 11, and 31 of the top 10, 20, and 50 predicted miRNAs related to lung neoplasms have been confirmed (see **Table 6**). hsa-mir-96 was ranked first in the top 50, and research has confirmed that the expression of miR-96 in tumors was positively related to its expression in sera. In addition, high expression of tumor and serum miRNAs of the miR-183 family was associated with poor overall survival in patients with lung cancer, as demonstrated by log-rank and Cox regression analyses (Zhu et al., 2011).

The case studies on these five major human diseases demonstrate the excellent prediction performance of NRLMFMDA. With the development of experimental tools and techniques, we expect more and more experimentally verified miRNA-disease association data to emerge, so that an increasing portion of the predictions made by NRLMFMDA can be verified by future research.

# DISCUSSION

Researchers have made progress not only in discovering miRNAs but also in discovering the important roles that miRNAs play in physiological and pathophysiological processes (Liu and Olson, 2010). For example, aberrant expression of miRNAs has been related to various neurological disorders (NDs) of the central nervous system, such as Huntington disease, amyotrophic lateral sclerosis, schizophrenia, autism, Alzheimer disease, and Parkinson's disease. Dysregulated miRNAs discovered in patients with NDs may be used as biomarkers for earlier diagnosis and for monitoring disease progression (Kamal et al., 2015). miRNAs can also be transcriptional regulators that participate in pulmonary sarcoidosis and can be packaged in extracellular vesicles (EV) during cellular communication (Kishore et al., 2018). In biomedical research, the identification of disease-associated miRNAs has become an important field, which will accelerate the understanding of disease pathogenesis at the molecular level as well as disease diagnosis, treatment, and prevention (Chen et al., 2017d).

TABLE 5 | Prediction of the top 50 predicted miRNAs associated with carcinoma, hepatocellular based on known associations in HMDD database.

*The first column records top 1–25 related miRNAs. The second column records the top 26–50 related miRNAs.*

This paper introduced the computational method NRLMFMDA, which combines logistic matrix factorization with Gaussian interaction profile kernel similarity and assigns a higher importance level to the known associations when calculating the potential miRNA-disease association probabilities, so that the known data have a larger positive influence. We also took full advantage of the information from nearest-neighbor diseases and miRNAs to improve the accuracy of miRNA-disease association prediction (Liu et al., 2016a). The logistic matrix factorization technique has been applied in much earlier work on predicting associations and has shown remarkable effectiveness. Taking the neighborhood principle into consideration, we modified it to improve the accuracy of prediction. With the introduction of the Gaussian interaction profile kernel similarity, the disease similarity and miRNA similarity information was fully exploited. To verify the accuracy of NRLMFMDA, three types of cross validation were implemented: global LOOCV, local LOOCV, and 5-fold cross validation. The excellent performance of NRLMFMDA was shown both by the cross validation and by the case studies on several crucial diseases.

Several important factors contribute to the excellent performance of NRLMFMDA. First, more and more associations between miRNAs and diseases have been discovered and confirmed. Because NRLMFMDA is data-dependent, the growing number of known associations improved its prediction accuracy. Second, NRLMFMDA takes full advantage of the similarity information by introducing the Gaussian interaction profile kernel similarity. Third, NRLMFMDA exploits neighborhood information, which provides more reliable associations, by using neighborhood regularization in the training procedure and neighborhood smoothing in the final prediction. Fourth, some machine learning-based models randomly select negative samples as training data, and this inaccurate selection affects their prediction accuracy. The modified latent vectors used in NRLMFMDA can overcome the bias caused by using uncertain negative samples to train the latent vectors of miRNAs and diseases in the negative sets, which helps improve the prediction accuracy of NRLMFMDA. Last but not least, searching for the optimal solution with an alternating gradient descent procedure ensured the reliability of the disease and miRNA latent vectors. For these reasons, NRLMFMDA greatly improved the accuracy of predicting associations between miRNAs and diseases.

TABLE 6 | Prediction of the top 50 predicted miRNAs associated with lung neoplasms based on known associations in old version HMDD database.

*The first column records top 1–25 related miRNAs. The second column records the top 26–50 related miRNAs.*

Some limitations of this study should be noted. First, although current studies benefit from the increase in known data, expanding the data is never finished, and numerous methods have been proposed specifically to compensate for the shortage of data (Liu et al., 2014; You et al., 2014). Second, the iterative process involves five parameters whose optimal combination is difficult to choose. Although we restricted each of the five parameters to a range, even a grid search over these ranges costs considerable time and resources. Therefore, we expect to use an optimized search strategy to improve the accuracy of the prediction method in the future.

#### AUTHOR CONTRIBUTIONS

JQ implemented the experiments, analyzed the result, and wrote the paper. B-SH conceived the project, designed the experiments, analyzed the result, and revised the paper. QZ conceived the project, implemented the experiments, analyzed the result, and revised the paper. All authors read and approved the final manuscript.

### FUNDING

B-SH was supported by Key Program of Hunan Provincial Education Department (Grant No. 15A026), General Program of Hunan Provincial Philosophy and Social Science Planning Fund office (Grant No. 15YBA035). QZ was supported by Innovation Team Project from the Education Department of Liaoning Province under Grant No. LT2015011 and the Doctor Startup Foundation from Liaoning Province under Grant No. 20170520217.

#### REFERENCES


### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00303/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 He, Qu and Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation

Adam McDermaid1,2, Xin Chen<sup>3</sup> , Yiran Zhang1,4, Cankun Wang<sup>1</sup> , Shaopeng Gu<sup>4</sup> , Juan Xie1,2 and Qin Ma1,2 \*

*<sup>1</sup> Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Brookings, SD, United States, <sup>2</sup> Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, United States, <sup>3</sup> Center for Applied Mathematics, Tianjin University, Tianjin, China, <sup>4</sup> Department of Electrical Engineering and Computer Science, South Dakota State University, Brookings, SD, United States*

#### Edited by:

*Dariusz Mrozek, Silesian University of Technology, Poland*

#### Reviewed by:

*Xiangxiang Zeng, Xiamen University, China Shihao Shen, University of California, Los Angeles, United States*

> \*Correspondence: *Qin Ma Qin.Ma@sdstate.edu*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *25 May 2018* Accepted: *23 July 2018* Published: *14 August 2018*

#### Citation:

*McDermaid A, Chen X, Zhang Y, Wang C, Gu S, Xie J and Ma Q (2018) A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation. Front. Genet. 9:313. doi: 10.3389/fgene.2018.00313*

One of the main benefits of using modern RNA-Sequencing (RNA-Seq) technology is more accurate gene expression estimation compared with previous generations of expression data, such as microarrays. However, numerous issues can cause an RNA-Seq read to be mapped to multiple locations on the reference genome with the same alignment score, which occurs in plant, animal, and metagenome samples. Such a read is called a multiple-mapping read (MMR). The impact of these MMRs is reflected in gene expression estimation and all downstream analyses, including differential gene expression and functional enrichment. Current analysis pipelines lack the tools to effectively test the reliability of gene expression estimations and are thus incapable of ensuring the validity of downstream analyses. Our investigation into 95 RNA-Seq datasets from seven plant and animal species (totaling 1,951 GB) indicates that, on average, roughly 22% of all reads are MMRs. Here we present a machine learning-based tool called GeneQC (Gene expression Quality Control), which can accurately estimate the reliability of each gene's expression level derived from an RNA-Seq dataset. The underlying algorithm is based on extracted genomic and transcriptomic features, which are combined using elastic-net regularization and mixture model fitting to provide a clearer picture of mapping uncertainty for each gene. GeneQC allows researchers to determine reliable expression estimations and conduct further analysis on the gene expression that is of sufficient quality. This tool also enables researchers to investigate continued re-alignment methods to determine more accurate gene expression estimates for those with low reliability. Application of GeneQC reveals a high level of mapping uncertainty in plant samples and limited, severe mapping uncertainty in animal samples. GeneQC is freely available at http://bmbl.sdstate.edu/GeneQC/home.html.

Keywords: gene expression, RNA-Seq read alignment, mapping uncertainty, machine learning, elastic-net, mixture model fitting, k-means clustering, EM-algorithm

# INTRODUCTION

RNA-Seq is a revolutionary high-throughput process that allows researchers to observe the genetic makeup of a particular sample (Wang et al., 2009; Garber et al., 2011; Ozsolak and Milos, 2011) and can assist in determination of regulatory mechanisms and transcription unit prediction (Chou et al., 2015; Chen et al., 2017). Research involving RNA-Seq data produces gene expression profiles, in which a discrete expression value for each annotated gene for that species is identified. These gene expression profiles are extracted through computational analysis pipelines (Trapnell et al., 2009; Andrews, 2010; Wang et al., 2010; Grabherr et al., 2011; Kong, 2011; Li and Dewey, 2011; Dobin et al., 2013; Philippe et al., 2013; Wu et al., 2013, 2016; Anders et al., 2015; Bonfert et al., 2015; Chang et al., 2015; Kim et al., 2015; Pertea et al., 2015, 2016; Yuan et al., 2017), which can be analyzed further to identify differentially expressed genes between treatment groups (Robinson et al., 2010; Anders and Huber, 2012; Trapnell et al., 2012; Ritchie et al., 2015; Pimentel et al., 2017; Monier et al., 2018), enriched functional gene modules (Subramanian et al., 2005; Zhou and Su, 2007; Chen et al., 2009; Pathan et al., 2015), co-expression networks (Zhang et al., 2016; Cao et al., 2017), and to generate visualizations to assist in broad interpretations between treatment groups (Goff et al., 2013; Powell, 2015; Younesy et al., 2015; Ge, 2017; Harshbarger et al., 2017; Nelson et al., 2017; Nueda et al., 2017; McDermaid et al., 2018a; Perkel, 2018), among other applications.

One application of RNA-Seq analysis pipelines is to use the sequenced RNA-Seq reads (or reads for short) with a reference genome, if available, to estimate the expression level of each gene (Nagalakshmi et al., 2008; Miller et al., 2014). The basic process is to map these reads to the location with the best alignment score on the reference genome (Wu et al., 2014). Even though numerous methods have been developed to facilitate this analysis, some critical issues persist. The nature of DNA—long strands of millions of base-pairs created by a reordering of the four nucleotides—makes it inevitable that some similarities and duplications will occur throughout the genome. This can lead to ambiguity during read mapping, with specific reads being aligned to multiple locations across the reference genome with the same alignment scores (Li et al., 2009; Oshlack et al., 2010; Swan, 2013; Trapnell et al., 2013; Baruzzo et al., 2017).

This MMR problem can be observed in any genomic region, including exons and transcripts. For conciseness, we refer to these genomic regions simply as "genes." This issue has been observed in many diploid species, including human and other mammals and Arabidopsis (Albrecht et al., 2009; Cho et al., 2009; Yoder-Himes et al., 2009; Zhu et al., 2011; Network CGAR., 2018), as well as many multiploid species (Consortium IWGS., 2014). In some species, such as Glycine max, up to 75% of the genes have duplicated partners in the genome (Schmutz et al., 2010). For species with high levels of uncertainty, especially angiosperms, the MMR problem can have serious implications for gene expression levels and can be extremely hard to remediate because of the duplicative nature of the genes and chromosomes. To investigate the prevalence of MMRs in current RNA-Seq analyses more fully, we analyzed almost two terabytes of RNA-Seq data from seven plant and animal species. This analysis made clear that a large number of MMRs were present in a variety of data. Thus, mapping uncertainty inevitably affects gene expression estimates and eventually biases downstream analyses.

During our initial investigation into the MMR problem, 95 datasets totaling 1,951 GB were analyzed. Both paired- and single-end reads were collected from NCBI (Coordinators, 2016), URGI (https://urgi.versailles.inra.fr/), and JGI (Nordberg et al., 2013) for seven plant and animal species: Arabidopsis thaliana, Vitis vinifera, Solanum lycopersicum, Panicum virgatum, Triticum aestivum, Homo sapiens, and Mus musculus. The 95 datasets average 20.6 GB, with an average overall alignment rate of 81.87%. Each dataset was aligned using HISAT2 (Kim et al., 2015) against the appropriate reference genome. Alignment statistics were collected or calculated from the HISAT2 output file, as shown in **Table 1**. An average of 22% of all reads were ambiguously aligned across the seven plant and animal species. In four datasets, over 35% of the reads were ambiguously aligned, and over two-thirds of the analyzed datasets had at least 18% of the reads multi-mapped. Panicum virgatum exhibited the highest overall proportions of MMRs over all analyzed datasets, ranging from 17 to 33%, while Arabidopsis thaliana displayed the lowest proportion, ranging from 8 to 17%. The other analyzed species had similar percentages of MMRs. More details of the MMR analyses over these 95 datasets can be found in **Supplementary File 1**.
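As a rough illustration of how such alignment statistics could be derived, the sketch below tallies uniquely mapped, multi-mapped, and unmapped reads from SAM records. It is not the procedure used in the study. It assumes the aligner writes the standard optional `NH:i` tag (number of reported alignments for the read) and that the SAM flag bits 0x4 (unmapped) and 0x80 (second mate) are set as defined in the SAM specification; the function name is hypothetical.

```python
def mapping_summary(sam_lines):
    """Return fractions of unique, multi-mapped, and unmapped reads from SAM lines."""
    hits = {}  # (read name, is second mate) -> number of reported alignments (0 = unmapped)
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        key = (fields[0], bool(flag & 0x80))
        if flag & 0x4:                    # read is unmapped
            hits.setdefault(key, 0)
            continue
        n_hits = 1                        # assume unique if the NH tag is absent
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                n_hits = int(tag[5:])
        hits[key] = max(hits.get(key, 0), n_hits)
    total = len(hits)
    unique = sum(1 for n in hits.values() if n == 1)
    multi = sum(1 for n in hits.values() if n > 1)
    mapped = unique + multi
    return {
        "unique": unique / total if total else 0.0,
        "multi": multi / total if total else 0.0,
        "unmapped": (total - mapped) / total if total else 0.0,
        "multi_of_mapped": multi / mapped if mapped else 0.0,
    }
```

In practice the function could be fed an open file handle, for example `mapping_summary(open(path))`, since it only iterates over lines.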

The general solution to the MMR problem in previous studies is either to discard such reads or to distribute them evenly across all potential locations, leading to severe, biased underestimation or overestimation of the gene expression levels, respectively (Kim et al., 2013). More commonly, ambiguous reads are assigned proportionally: each read is divided into smaller portions based on the number of possible mapping locations and the number of reads uniquely mapped to each of them (Li et al., 2009). Recently, additional methods have been employed to attempt remediation of mapping uncertainty after initial alignment (Li and Dewey, 2011; Kahles et al., 2015; Bray et al., 2016). However, even these realignment strategies do not provide a thorough method for evaluating the alignment quality. While RNA-Seq pipelines traditionally begin with read-level quality control using FastQC (Andrews, 2010), no such method currently exists for controlling the quality of gene expression estimation after read alignment.
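The three simple strategies can be contrasted with a toy sketch. This is only an illustration under assumptions and does not reproduce any particular tool: here "uniform" splits each ambiguous read into equal fractions over its candidate genes, and "proportional" splits it according to the uniquely mapped read counts of those genes. The function name and the input layout are hypothetical.

```python
def allocate_multireads(unique_counts, multi_reads, strategy):
    """Estimate per-gene read counts under a simple multi-read strategy.

    unique_counts: dict mapping gene -> number of uniquely mapped reads
    multi_reads: list of tuples, each holding the candidate genes of one ambiguous read
    strategy: "discard", "uniform", or "proportional"
    """
    counts = {gene: float(n) for gene, n in unique_counts.items()}
    if strategy == "discard":
        return counts                      # ambiguous reads are ignored entirely
    for genes in multi_reads:
        if strategy == "uniform":
            weights = {g: 1.0 / len(genes) for g in genes}
        elif strategy == "proportional":
            total = sum(unique_counts.get(g, 0) for g in genes)
            if total == 0:                 # no unique evidence: fall back to equal shares
                weights = {g: 1.0 / len(genes) for g in genes}
            else:
                weights = {g: unique_counts.get(g, 0) / total for g in genes}
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        for g, w in weights.items():
            counts[g] = counts.get(g, 0.0) + w
    return counts
```

For two genes with 30 and 10 unique reads that share four ambiguous reads, the three strategies give (30, 10), (32, 12), and (33, 11), respectively, which shows how strongly the estimate depends on the chosen rule.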

If researchers continue processing RNA-Seq data with such high levels of mapping uncertainty, all downstream analyses will have skewed and biased results. Just as raw reads require quality control (Andrews, 2010), so do gene expression estimates based on mapping results. Even with tools that are specifically designed to address mapping uncertainty, such as MMR (Kahles et al., 2015), the quality of the derived gene expression estimates still requires investigation, especially in real rather than simulated datasets. Without some quality control for gene expression estimation, researchers could unknowingly be using unreliable data.


*The alignment statistics for the 95 analyzed datasets across seven species, indicating the ranges of percentages for the uniquely aligned, multi-mapped, and un-mapped reads, as well as the proportion of multi-mapped out of the total mapped reads.*

One promising method for addressing the issue of gene expression-level quality control is the implementation of machine learning. Machine learning draws on concepts from statistics, artificial intelligence, philosophy, information theory, biology, cognitive science, computational complexity, and control theory to give computers and algorithms the ability to learn and improve performance on a specific task without being explicitly programmed (Mitchell, 1997). Machine learning has two main categories: supervised and unsupervised learning. The majority of practical studies use supervised learning methods to learn the relationship between input and output, using provided category labels or resultant values to develop a mapping function for the prediction of unlabeled data. Specifically, elastic-net regularization, a supervised method, was used in this research. Machine learning can also train a model from unlabeled data through unsupervised learning, which aims to model the underlying structure or distribution in the training data for clustering and association problems. Two unsupervised learning algorithms were used in this study: K-means clustering and the Expectation-Maximization algorithm (EM algorithm).

To address the issue of mapping uncertainty, we present the machine learning-based tool GeneQC (**Figure 1**), which uses extracted multi-level features combined with novel applications of regularized regression and mixture model fitting approaches to quantify the mapping uncertainty issue (McDermaid et al., 2018b). This tool can determine which genes have reliable expression estimates and which require further analysis, along with a statistical significance evaluation of the mapping uncertainty level. GeneQC develops a novel score, referred to as the D-score, to represent the level of mapping uncertainty for each annotated gene and groups genes into several categories with different reliability levels, through integration and modeling of three genomic and transcriptomic features. Specifically, (i) sequence similarity between a particular gene and other genes is collected to give insight into the genomic characteristics contributing to the MMR problem; (ii) the proportion of shared MMRs between gene pairs provides information regarding the transcriptomic influences of mapping uncertainty within each dataset; and (iii) the degree of each gene represents the number of significant gene pair interactions resulting from calculating (i) and/or (ii). More details of the procedure can be found in the Methods section.

# METHODS

# GeneQC Implementation

GeneQC is designed to fit into computational pipelines for RNA-Seq data immediately following read alignment, acting as a supplement to most current pipelines. GeneQC is composed of two distinct processes: feature extraction and statistical modeling. The feature extraction process is implemented using a Perl program and the statistical modeling is performed on the feature extraction output using an R package, which provides the final output for GeneQC (http://bmbl.sdstate.edu/GeneQC/download.html). More details on the implementation of GeneQC can be found at http://bmbl.sdstate.edu/GeneQC/tutorial.html.

### Required Inputs

GeneQC takes as inputs three pieces of information that are easily found in most RNA-Seq analysis pipelines: (1) the read mapping result SAM file; (2) the fasta reference genome corresponding to the to-be-analyzed species; and (3) the species-specific annotation general feature format (gff/gff3) file (**Figure 1B**). Example datasets can be found on the GeneQC webserver at http://bmbl.sdstate.edu/GeneQC/result.html.

### Feature Extraction

From the input information, GeneQC first performs feature extraction, in which the three characteristics are calculated for each annotated gene (**Figure 1C**). The first extracted feature (D1) is derived from genomic level information and involves the similarity between two genes (**Figure 2A**). For each gene, this is calculated as the maximum of the sequence similarity multiplied by the match length, where the match length is the longest continuous string of matching base pairs. More specifically, $D_1 = \max_y \{ss_{i,y} \cdot l_{i,y}\}$, where $ss_{i,y}$ is the base pair sequence similarity of gene $i$ and gene $y$ and $l_{i,y}$ is the match length of these two genes.

The second feature (D2) comes from transcriptomic level information and represents the proportion of shared MMRs (**Figure 2B**). This value is calculated as the maximum proportion of shared MMRs between the gene of interest and another gene. In other words, $D_2 = \frac{|G_i \cap X|}{|G_i|}$, where $G_i = \{\text{all reads aligned to gene } i\}$ and $X = |G_i \cap Y|$.

FIGURE 1 | Mapping Uncertainty and GeneQC. (A) The MMR percentages for the 95 datasets across seven species. More detailed information is showcased in Table 1; (B) GeneQC takes a read alignment, reference genome, and annotation file as inputs; (C) The first step of GeneQC is to extract features related to mapping uncertainty for each annotated gene; (D) Using the extracted features, elastic-net regularization is used to calculate the D-score, which represents the mapping uncertainty for each gene; (E) A series of Mixture Normal and Mixture Gamma distributions are fit to the D-scores; and (F) The mixture models are used to categorize the D-scores into different levels of mapping uncertainty along with a statistical alternative likelihood value for each gene.

The third feature (D3) is a network factor that represents the number of alternate gene locations with significant interactions with the gene of interest based on the previous two parameters (**Figure 2C**) and is calculated as $D_3 = \log_{10}(|S \cup M| + 1)$, where $S = \{\text{genomic locations with } D_1 > 0\}$ and $M = \{\text{genomic locations with } D_2 > 0\}$.
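The three features can be illustrated with a minimal Python sketch. It is not the GeneQC Perl implementation: the function names, the dictionary-based inputs, and the reading of D2 as the largest fraction of a gene's reads that are also aligned to one other gene are all assumptions made for illustration. Pairwise similarity and match length are taken as precomputed inputs.

```python
import math

def d1(pair_stats):
    """Hypothetical helper. pair_stats maps another gene's ID to
    (sequence_similarity, match_length) relative to the gene of interest."""
    return max((ss * length for ss, length in pair_stats.values()), default=0.0)

def d2(reads_i, reads_by_other):
    """Assumed reading of D2: the largest fraction of gene i's reads
    that are also aligned to a single other gene."""
    if not reads_i:
        return 0.0
    return max((len(reads_i & reads) / len(reads_i)
                for reads in reads_by_other.values()), default=0.0)

def d3(pair_stats, reads_i, reads_by_other):
    """log10(|S union M| + 1), with S and M taken as the other genes
    that have a nonzero pairwise D1 or D2 contribution."""
    s = {gene for gene, (ss, length) in pair_stats.items() if ss * length > 0}
    m = {gene for gene, reads in reads_by_other.items() if reads_i & reads}
    return math.log10(len(s | m) + 1)
```

For example, a gene that shares two of its four reads with one gene and one read with another would receive D2 = 0.5 under this reading.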

In addition to understanding the severity of the MMR problem in each sample, GeneQC provides species- or sample-specific insight into each feature's impact on mapping uncertainty. This is done by developing a linear model to determine the significance and degree of impact for each feature.

#### GeneQC Modeling

#### Dependent Variable Construction

To perform the modeling, a dependent variable is constructed. The dependent variable $D_4$ is an approximation of the proportion of ambiguous reads based on the two most extreme approaches to dealing with multi-mapped reads, the unique alignment approach and the all-matches approach. If we consider $G_i = \{\text{reads mapped to gene } i\}$ and $U_i = \{\text{reads uniquely mapped to gene } i\}$, the true alignment $R_i$ must fall somewhere between these two values, with $|U_i| \le |R_i| \le |G_i|$. Thus, we approximate the true alignment as $|\hat{R}_i| = \frac{|G_i| + |U_i|}{2}$. Using this approximation, we calculate

$$D\_4 = 1 - \frac{\left|\hat{R}\_i\right|}{|G\_i|} = 1 - \frac{|G\_i| + |U\_i|}{2|G\_i|}$$
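As a small worked illustration of this formula, the sketch below computes $D_4$ from two read counts. The function name and argument names are hypothetical and are not taken from the GeneQC code.

```python
def d4(n_mapped, n_unique):
    """Approximate proportion of ambiguous reads for one gene.

    n_mapped: |G_i|, number of reads mapped to the gene (all matches).
    n_unique: |U_i|, number of reads uniquely mapped to the gene.
    """
    if n_mapped <= 0:
        raise ValueError("n_mapped must be positive")
    if not 0 <= n_unique <= n_mapped:
        raise ValueError("n_unique must lie between 0 and n_mapped")
    r_hat = (n_mapped + n_unique) / 2  # midpoint between the two extremes
    return 1 - r_hat / n_mapped
```

Because $|U_i|$ lies between 0 and $|G_i|$, the value ranges from 0 (all reads uniquely mapped) to 0.5 (no uniquely mapped reads).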

#### Elastic-Net Regularization

To develop a model evaluating the severity of mapping uncertainty and thus expression estimation quality, a regression approach is utilized. Ordinary least squares has been demonstrated to have particular issues when dealing with real world data, especially data that does not fit linearity, homoscedasticity, lack of serious multi-collinearity, or other requirements (Dempster et al., 1977). Because of this, alternative approaches were explored. Ridge regression, which develops a model based on an L2-norm penalization, has better predictive results than ordinary least squares regression (Hoerl and Kennard, 1970; Dempster et al., 1977). However, this approach tends to retain all included variables to achieve such high predictive power, in turn reducing the interpretability of the model (Zou and Hastie, 2005). Another approach with potential application in GeneQC is the least absolute shrinkage and selection operator, also known as lasso. This method uses an L1-norm penalization, while simultaneously performing continuous shrinkage and variable selection (Tibshirani, 1996). While this is an appealing feature in generating a model, lasso has shortcomings when it comes to dealing with variables exhibiting high pairwise correlation (Zou and Hastie, 2005). Elastic-net regularization—sometimes referred to simply as elastic net—has the potential to overcome the shortcomings of both ridge and lasso regression methods by implementing a combination of the two approaches.

Take the set of $n$ response variables $\mathbf{y} = (y_1, y_2, \dots, y_n)^T$, a set of $p$ predictor variables $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \dots, x_{i,p})$, $i \in \{1, \dots, n\}$, a set of $p$ coefficients $\beta = (\beta_1, \beta_2, \dots, \beta_p)$, and matrix of predictor variables

$$\mathbf{X} = (\mathbf{x\_1}, \mathbf{x\_2}, \dots, \mathbf{x\_n})^T = \begin{pmatrix} \mathbf{x\_{1,1}} & \cdots & \mathbf{x\_{1,p}} \\ \vdots & \ddots & \vdots \\ \mathbf{x\_{n,1}} & \cdots & \mathbf{x\_{n,p}} \end{pmatrix}.$$

For given $\lambda_1, \lambda_2 \ge 0$, elastic-net regularization uses a criterion based on

$$\begin{aligned} L(\lambda_1, \lambda_2, \beta) &= \left\| \mathbf{y} - \mathbf{X}\beta \right\|_2^2 + \lambda_2 \left\| \beta \right\|_2^2 + \lambda_1 \left\| \beta \right\|_1 \\ \|\beta\|_2 &= \sqrt{\sum_{j=1}^p \beta_j^2} \\ \|\beta\|_1 &= \sum_{j=1}^p |\beta_j| \end{aligned}$$

Thus, the set of coefficient estimates $\hat{\beta}$ is calculated as

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left\{ L(\lambda_1, \lambda_2, \beta) \right\} = \underset{\beta}{\operatorname{argmin}} \left\{ \left\| \mathbf{y} - \mathbf{X}\beta \right\|_2^2 + \lambda_2 \left\| \beta \right\|_2^2 + \lambda_1 \left\| \beta \right\|_1 \right\}$$

Given $\alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2}$, solving for $\hat{\beta}$ is equivalent to optimizing $\hat{\beta} = \operatorname{argmin}_{\beta} \left\| \mathbf{y} - \mathbf{X}\beta \right\|_2^2$, subject to $\alpha \|\beta\|_2^2 + (1 - \alpha) \|\beta\|_1 \le k$ for some $k$. In the construction of this elastic net, $\alpha \|\beta\|_2^2 + (1 - \alpha) \|\beta\|_1$ is considered the elastic net penalty, representing a combination of the penalties used in ridge and lasso regression methods. In the situation where $\alpha = 1$, the elastic net is equivalent to basic ridge regression. For $\alpha = 0$, the approach becomes lasso regression (Zou and Hastie, 2005).

GeneQC utilizes the elastic-net regularization method (Zou and Hastie, 2005) with default α = 0.5 to develop a regression model for the calculation of D-scores. Here, elastic-net regularization is used to properly perform the variable selection, while simultaneously fitting a sufficient model to the provided data (**Figure 1D**). This approach also accounts for potential serious multicollinearity issues, which were detected in some of the test data, and prevents overfitting of the regression model (Zou and Hastie, 2005). The set of calculated D-scores represents the mapping uncertainty for each annotated gene and is provided to give researchers an idea of how reliable their initial read mappings are. A higher D-score represents more mapping uncertainty, and thus a less reliable expression estimate.
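To make the criterion concrete, the following is a minimal sketch of elastic-net fitting by cyclic coordinate descent, written against the penalized form with $\lambda_1$ and $\lambda_2$ given above. It is an illustration under stated assumptions, not the GeneQC R implementation: there is no intercept, predictors are used unstandardized, and the function names are hypothetical. Coordinate descent is one common solver for this objective; the source does not state which solver GeneQC uses.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, lam1, lam2, n_iter=500, tol=1e-10):
    """Minimize ||y - X b||^2 + lam2 * ||b||_2^2 + lam1 * ||b||_1.

    X is a list of n rows with p values each; y is a list of n responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    resid = list(y)  # residual y - X beta for the starting point beta = 0
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            denom = col_sq[j] + lam2
            if denom == 0.0:
                continue  # all-zero column with no ridge term: leave at 0
            # Inner product of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam1 / 2.0) / denom
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta
```

With `lam1 = lam2 = 0` the update reduces to ordinary least squares, with `lam1 = 0` it gives the ridge solution, and a sufficiently large `lam1` sets every coefficient exactly to zero, which is the variable-selection behavior described above.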

#### Mixture Model Fitting

Based on the sets of D-scores calculated during GeneQC development, there are apparent underlying distributions for these scores, intuitively representing levels of mapping uncertainty. For this reason, extensive mixture model fitting is included within GeneQC to fit a mixture model distribution with three sub-distributions to each set of D-scores (**Figure 1E**).

Our mixture model fitting process involves k-means initialization with randomized initial grouping. Cluster means, $\mu_i$, are then calculated for each of the $k$ clusters, followed by two iterative steps: (1) reassignment of data points to the cluster with the lowest distance between a data point and cluster mean, and (2) recalculation of cluster centers. This process is continued until achieving the minimum within-cluster sum of squares:

$$\underset{k}{\operatorname{argmin}} \sum\_{i=1}^{k} \sum\_{\boldsymbol{x} \in K\_{i}} \left\| \boldsymbol{x} - \boldsymbol{\mu}\_{i} \right\|^{2}$$
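The k-means initialization can be sketched for one-dimensional D-scores as follows. This is a minimal illustration rather than GeneQC's code: the function name, the optional `init_labels` argument (added so the example is reproducible), and the choice to keep the previous center when a cluster becomes empty are assumptions.

```python
import random

def kmeans_1d(values, k, init_labels=None, seed=0, max_iter=100):
    """Lloyd's algorithm on scalars, starting from a grouping of the points."""
    rng = random.Random(seed)
    if init_labels is not None:
        labels = list(init_labels)
    else:
        labels = [rng.randrange(k) for _ in values]  # randomized initial grouping
    centers = [rng.choice(values) for _ in range(k)]  # fallback for empty clusters
    for _ in range(max_iter):
        # Recalculate each cluster center as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
        # Reassign each point to the cluster with the nearest center.
        new_labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2)
                      for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the centers no longer change
        labels = new_labels
    wcss = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return centers, labels, wcss
```

Lloyd's iterations only guarantee a local minimum of the within-cluster sum of squares, so the result can depend on the initial grouping.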

After initialization using the k-means process defined above, the EM-algorithm is implemented to find the best fitting distributions. Based on our preliminary investigations into the D-score development, we have selected two underlying distributions for this purpose: Gamma and Gaussian. Specifically, it is assumed that each set of D-scores can be expressed as a mixture model distribution given by

$$P\left(X|\theta\right) = \sum\_{k} \beta\_k Y\_k(X|\theta\_k)$$

with $\beta_k$ representing the weighting parameter of the $k$th component, $Y_k$ representing the probability density function of the $k$th component of the mixture model, and $\theta_k$ representing the parameters of the $k$th component. Considering the Gaussian distribution scenario, $Y_k(X|\theta_k)$ is $N(X|\mu_k, \sigma_k^2)$. In this case,

$$\begin{aligned} MLE(\mu_k) &= \hat{\mu}_k = \frac{\sum_{j=1}^{N_k} x_{j,k}}{N_k} \\ MLE\left(\sigma_k^2\right) &= \hat{\sigma}_k^2 = \frac{\sum_{j=1}^{N_k} \left(x_{j,k} - \hat{\mu}_k\right)^2}{N_k} \\ \beta_k &= \frac{N_k}{N} \end{aligned}$$

where $x_{j,k}$ is the $j$th data point in component $k$, $N_k$ is the number of data points in cluster $k$, and $N$ is the total number of data points (i.e., $\sum_k N_k = N$). After this initialization step, the algorithm proceeds to the Expectation (E) step. In this step, for each data point (i.e., each D-score from this dataset) the posterior probability of containment within each cluster $k_i$ is generated by

$$\begin{split}P\left(x_j \in k_i \mid x_j\right) &= \frac{P\left(x_j \mid x_j \in k_i\right)P\left(k_i\right)}{P\left(x_j\right)} = \frac{N\left(x_j \mid \hat{\mu}_k, \hat{\sigma}_k^2\right)\left(\frac{N_k}{N}\right)}{\sum_k \beta_k N\left(x_j \mid \hat{\mu}_k, \hat{\sigma}_k^2\right)}\\&= \frac{\beta_k N\left(x_j \mid \hat{\mu}_k, \hat{\sigma}_k^2\right)}{\sum_k \beta_k N\left(x_j \mid \hat{\mu}_k, \hat{\sigma}_k^2\right)}\end{split}$$

After this Expectation step, the Maximization step again calculates parameters $\hat{\mu}_k$, $\hat{\sigma}_k^2$ for each component $k$. Based on the previous step,

$$\begin{aligned} \hat{\mu}_k &= \frac{\sum_{j=1}^N P\left(x_j \in k_i \mid x_j\right) x_j}{\sum_{j=1}^N P\left(x_j \in k_i \mid x_j\right)}\\ \hat{\sigma}_k^2 &= \frac{\sum_{j=1}^N P\left(x_j \in k_i \mid x_j\right) \left(x_j - \hat{\mu}_k\right)^2}{\sum_{j=1}^N P\left(x_j \in k_i \mid x_j\right)}\\ \beta_k &= \frac{\sum_{j=1}^N P\left(x_j \in k_i \mid x_j\right)}{N} \end{aligned}$$

These parameter estimates are then used as the parameters for the next Expectation step, and the process continues iteratively until convergence, i.e., until no significant improvement in the log-likelihood is achieved over the previous iteration. This procedure is applied repeatedly to quickly generate a series of mixture model distributions for both Gamma and Gaussian distributions.

The optimally fitted mixture model is determined using the Bayesian Information Criterion (BIC), with a penalization based on the number of sub-distributions. The BIC for a mixture distribution $K$ is based on the number of sub-distributions $k$, the number of data points $n$, and the log-likelihood $\hat{L}$:

$$BIC\left(K\right) = 2k\log\left(n\right) - 2\hat{L}$$

#### Mapping Uncertainty Categorization

The best fitting mixture model is then used to separate each D-score into a category representing the severity of mapping uncertainty, thus indicating the mapping uncertainty categorization for each gene (**Figure 1F**). The categorizations are based on the intersections of the density functions representing the mixture model fitting. If the Gaussian distributions provide the minimal BIC, the categorization cutoffs are calculated as

$$x = -\left(\frac{\mu_{i+1}\sigma_i^2 - \mu_i\sigma_{i+1}^2}{\sigma_{i+1}^2 - \sigma_i^2}\right) \pm \sqrt{\frac{2\sigma_i^2\sigma_{i+1}^2 \ln\left(\frac{\sigma_{i+1}}{\sigma_i}\right) - \mu_i^2\sigma_{i+1}^2 + \mu_{i+1}^2\sigma_i^2}{\sigma_{i+1}^2 - \sigma_i^2} + \left(\frac{\mu_{i+1}\sigma_i^2 - \mu_i\sigma_{i+1}^2}{\sigma_{i+1}^2 - \sigma_i^2}\right)^2}$$

For Gamma distributions providing the minimal BIC, a closed form solution of the density function intersections does not exist. To accommodate this, an estimation approach is utilized. The cutoffs are calculated as the mean value of the maximum sequence element for which sub-distribution i has a higher probability density value than it does for sub-distribution i + 1 and the minimum sequence element for which sub-distribution i + 1 has a higher probability density value than it does for sub-distribution i, i.e.,

$$\begin{aligned} &\operatorname{mean} \left( \operatorname*{argmax}_{x} \left\{ f_i(x) > f_{i+1}(x) \right\},\ \operatorname*{argmin}_{x} \left\{ f_i(x) < f_{i+1}(x) \right\} \right), \\ &x \in \{a_n\}, \quad \operatorname*{argmax}_{x} f_i(x) \le a_n \le a_{n+1} \le \operatorname*{argmax}_{x} f_{i+1}(x) \end{aligned}$$

resulting in two cutoff values.
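For the Gaussian case, the fitting and cutoff steps described in this and the preceding section can be sketched together in plain Python. This is a hedged illustration and not the GeneQC R package: the function names are hypothetical, the EM routine assumes that it starts from non-empty hard clusters (for example, from k-means), and the cutoff routine equates the two component densities without their mixture weights, choosing the root that lies between the two means.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gaussian_mixture(data, labels, k, max_iter=500, tol=1e-9):
    """Fit a k-component 1-D Gaussian mixture from hard starting labels.

    Returns a list of (weight, mean, variance) and the final log-likelihood.
    """
    n = len(data)
    # Hard 0/1 responsibilities make the first M-step equal to the
    # initialization: per-cluster means, variances, and weights N_k / N.
    resp = [[1.0 if labels[j] == c else 0.0 for c in range(k)] for j in range(n)]
    prev_ll = -math.inf
    params, ll = [], prev_ll
    for _ in range(max_iter):
        # M-step: responsibility-weighted mean, variance, and weight.
        params = []
        for c in range(k):
            w = sum(r[c] for r in resp)
            mu = sum(r[c] * x for r, x in zip(resp, data)) / w
            var = sum(r[c] * (x - mu) ** 2 for r, x in zip(resp, data)) / w
            params.append((w / n, mu, max(var, 1e-12)))  # floor avoids zero variance
        # E-step: posterior probability of each component for each point.
        ll = 0.0
        for j, x in enumerate(data):
            dens = [b * normal_pdf(x, mu, var) for b, mu, var in params]
            total = max(sum(dens), 1e-300)  # guard against log(0)
            ll += math.log(total)
            resp[j] = [d / total for d in dens]
        if ll - prev_ll < tol:  # no significant log-likelihood improvement
            break
        prev_ll = ll
    return params, ll

def gaussian_cutoff(mu1, var1, mu2, var2):
    """Point where N(mu1, var1) and N(mu2, var2) have equal density."""
    a = var2 - var1
    if abs(a) <= 1e-12 * max(var1, var2):
        return (mu1 + mu2) / 2.0  # equal variances: the densities cross at the midpoint
    b = (mu2 * var1 - mu1 * var2) / a
    c = (mu1 ** 2 * var2 - mu2 ** 2 * var1
         - var1 * var2 * math.log(var2 / var1)) / a
    root = math.sqrt(max(b * b - c, 0.0))
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    roots = [-b - root, -b + root]
    between = [x for x in roots if lo <= x <= hi]
    if between:
        return between[0]
    return min(roots, key=lambda x: abs(x - (lo + hi) / 2.0))
```

With three fitted components sorted by mean, calling `gaussian_cutoff` on components 1 and 2 and on components 2 and 3 yields the two cutoffs that separate the three categories. A Gamma mixture would need a numerical search instead, as described above.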

#### TABLE 2 | GeneQC example output.


Due to the nature of mapping uncertainty and the lack of current approaches to evaluate this concept, we have included, for the first time, an alternative likelihood value as a proposed method of evaluating the mapping uncertainty categorizations computationally. This value, based on the posterior probabilities of the other distributions, is provided to represent the certainty of the gene ID belonging to that category. This value ($s_d$) is computed as the maximum posterior probability of the D-score belonging to any other categorization distribution.

$$s\_d = \max\{1 - F\_{i-1}\left(d\right), \ F\_{i+1}\left(d\right)\}$$

where $i$ is the distribution for which $d$ is categorized, and $F_j$ represents the cumulative distribution function of distribution $j$.
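The alternative likelihood can be sketched for Gaussian components as below. The function names and the list-of-tuples representation are assumptions for illustration. For the lowest or highest category only one neighboring distribution exists, so only the corresponding term is used; that boundary handling is also an assumption, since the source does not spell it out.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def alternative_likelihood(d, i, components):
    """s_d = max{1 - F_{i-1}(d), F_{i+1}(d)} for a D-score d in category i.

    components: list of (mu, sigma) pairs sorted by increasing mean.
    """
    terms = []
    if i > 0:
        mu, sigma = components[i - 1]
        terms.append(1.0 - normal_cdf(d, mu, sigma))  # upper tail of the lower neighbor
    if i < len(components) - 1:
        mu, sigma = components[i + 1]
        terms.append(normal_cdf(d, mu, sigma))  # lower tail of the upper neighbor
    return max(terms, default=0.0)
```

A small value indicates that the D-score sits far from the neighboring distributions, while a value near 0.5 indicates that it lies close to the center of a neighboring distribution.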

# RESULTS

#### GeneQC Output

The final output of GeneQC includes the three extracted features (named D1, D2, and D3), the D-score, the mapping uncertainty categorization, and the alternative likelihood for each annotated gene. This information is combined into a concise table to provide users with all relevant information related to the mapping uncertainty of their read alignment data, allowing them to make informed decisions about further and continued analysis. An example of the output file from *Vitis vinifera* can be found in **Table 2**. For each annotated gene, the D-score indicates the severity of mapping uncertainty for that particular gene in this particular RNA-Seq data. A higher D-score indicates a higher level of mapping uncertainty, with maximum levels of mapping uncertainty occurring around 0.5 for most samples. Genes with relatively high D-scores have mapping uncertainty issues resulting in potentially unreliable expression estimates (i.e., the High category), whereas genes with D-scores close to 0 have little to no mapping uncertainty and therefore have reliable expression estimates (i.e., the Low and Medium categories).

Source code and implementation instructions can be found on the GeneQC web server at http://bmbl.sdstate.edu/GeneQC/home.html. Additionally, example data for seven analyzed species can be downloaded on this server, including all reference genomes, annotations, original raw data, and outputs from both the feature extraction and modeling portions of GeneQC. An in-depth tutorial for application instructions can also be found on this site.

#### TABLE 3 | GeneQC analysis of seven species.

*This table shows the sample ID and relevant metrics for each of the seven datasets analyzed. Mean values for D1, D2, D3, and D-score are calculated based on the genes that exhibit some level of mapping uncertainty, and D1, D2, and D3 were normalized for comparison.*

## Implementation and Application of GeneQC Results

GeneQC has four main applications in RNA-Seq analyses. (1) Users can take the D-score and categorization results from an entire dataset to evaluate the alignment quality of their data or to determine how severe the overall mapping uncertainty is within their RNA-Seq datasets. This process would involve displaying the set of D-scores using a visualization technique such as a boxplot, violin plot, or histogram. Displaying the D-scores in this format would allow users to determine whether the overall alignment quality is sufficient to continue analysis or whether it requires further evaluation using a re-alignment method. It is expected that there will be high D-scores for some genes; however, a large portion of data having high D-scores would indicate severe problems with alignment requiring further analysis. (2) Users can use D-scores and mapping uncertainty categorizations to evaluate the reliability of their downstream analyses, such as differential gene expression results. If users have identified a particular set of genes that are differentially expressed, it would be of interest to evaluate the reliability of the expression estimates from which those comparisons were made. Genes identified as differentially expressed that have high mapping uncertainty levels (either through D-scores or categorization) would be less reliable than the differentially expressed genes that have low mapping uncertainty. (3) GeneQC can be used to directly compare the severity of mapping uncertainty between samples or even between species. This application method is used in section GeneQC Application: Analysis of Seven Plant and Animal Species to demonstrate which species have relatively high levels of mapping uncertainty and to determine which characteristics or features could be affecting this issue. In particular, identification of characteristics impacting mapping uncertainty for a single species could provide information that would assist in realignment processes. (4) GeneQC can be used to perform large-scale comparisons of alignment tools using real data. Currently, comparisons of alignment tools require either simulated data, which cannot accurately replicate the complexities within real RNA-Seq data, or small-scale real data, which has implicit biases that may favor one tool. GeneQC allows for large-scale comparisons of alignment methods with complex data of any species.

# GeneQC Application: Analysis of Seven Plant and Animal Species

To demonstrate the use of GeneQC, one dataset from each of the seven species was investigated for multi-mapping issues (**Table 3**). Based on this analysis, it is evident that plant samples tend to have higher proportions of genes with mapping uncertainty than animal samples (**Figure 3A**). These results are consistent with the fact that plant genomes tend to have higher levels of duplication, which is a strong contributing factor to mapping uncertainty. While H. sapiens and M. musculus have lower proportions of genes with mapping uncertainty than the plant samples, the share of those genes that have high mapping uncertainty is much larger. Plant species exhibited mapping uncertainty in an average of 12.6% of genes across the five species, whereas animal species exhibited this issue in an average of 5% of genes (**Supplementary Files S2, S3**). However, over half of the genes with mapping uncertainty in the animal samples fall into the "High" categorization, while only around one-fifth of genes with mapping uncertainty from plant samples fall into this category. The contributing factors to the higher proportion of "High" categorized genes for animal samples can be seen when looking at the three extracted features for each species.

The analysis results for the three features and calculated D-scores for genes with some level of mapping uncertainty are displayed in **Figures 3B,C**, respectively. Both H. sapiens and M. musculus display higher levels of sequence similarity (D1), shared MMR proportion (D2), and degree (D3) than what is generally exhibited in the analyzed plant species. These relatively high values for each feature led to higher D-scores, translating to a higher measure of mapping uncertainty in the animal samples compared with the plant samples. Mean D-scores for H. sapiens and M. musculus are 0.43 and 0.42, respectively. These average values are much higher than those for the analyzed plant samples, which are 0.29, 0.24, 0.33, 0.16, and 0.31 for A. thaliana, V. vinifera, S. lycopersicum, P. virgatum, and T. aestivum, respectively.

FIGURE 3 | GeneQC application. The results related to the analysis of seven datasets representing five plant and two animal species. (A) Categorizations for the level of mapping uncertainty per gene are shown relative to all categorizations. (B) Boxplots for the three extracted features of each gene are shown for each analyzed sample. D1, D2, and D3 represent the sequence similarity, proportion of shared MMR, and degree weight, respectively. Each value is shown normalized between 0 and 1. Only genes with mapping uncertainty are displayed. (C) Derived D-scores for each gene are shown by species, as calculated from the three features in (B). Higher D-scores represent higher levels of mapping uncertainty.

# CONCLUSION

GeneQC is a tool used to investigate the prominent issue of mapping uncertainty in modern RNA-Seq analysis through the combination of feature extraction and machine learning methods. Oversight in the quality of derived gene expression estimates based on mapping results can have drastic consequences for all downstream analyses and read mapping uncertainty is a significant cause of problems in further analysis. While read mapping has been accepted as sufficient, entirely ignoring the possibility of poorly mapped reads used for further analysis can have detrimental effects on all manner of RNA-Seq studies. As demonstrated in our analysis of 95 RNA-Seq datasets, the problem of mapping uncertainty is prominent and is displayed directly in the gene expression estimates. GeneQC can provide insight into the severity of this issue for each annotated gene along with a statistical evaluation framework. It utilizes feature extraction, elastic-net regularization, and mixture model fitting to provide researchers with a sense of the quality of gene expression estimates resulting from the read alignment step. GeneQC provides sufficient information for researchers to make more well-informed decisions based on the results of their RNA-Seq data analysis and to plan further analyses to address mapping uncertainty.

The application of GeneQC to the seven analyzed datasets displays some interesting differences between plant and animal samples. Fewer genes displayed mapping uncertainty in the animal samples, while a higher proportion of these genes were categorized as "High." In contrast, a much higher proportion of plant genes displayed mapping uncertainty, but more of these genes had moderate to low mapping uncertainty, relative to genes from animal samples. Both of these scenarios display the severity of mapping uncertainty in modern RNA-Seq analyses. High mapping uncertainty displayed in animal samples can lead to very biased expression estimates over fewer genes, while moderate levels of mapping uncertainty on a wider scale, as displayed in plant species, can cause widespread expression estimate biases on a lesser scale.

# DISCUSSION

In addition to the direct provisions of GeneQC, interpretations of the coefficients allow for a further examination of the specific features contributing to the mapping uncertainty. This will allow further analysis and re-alignment strategies to be tailored to the specific characteristics of the dataset. We are currently using this information to develop a computational tool capable of re-aligning reads currently aligned to genes with high D-scores, with the purpose of assisting researchers in the correction of mapping uncertainty. In the future, GeneQC will be integrated into a web server that applies this tool and associated re-alignment tools to perform large-scale RNA-Seq analyses on human, plant, and metagenome datasets. This application will allow for ease-of-use and collection of more data to support research with significant MMR issues.

Additionally, further machine learning approaches, both supervised and unsupervised, will be explored with respect to their applicability in detecting mapping uncertainty. Large-scale use of simulated data for multiple species will provide a direct indication of the actual expression level, which can be compared with the expression estimate from various high-performing and widely-used alignment tools. The various machine learning methods can then be used to detect mapping uncertainty for each tool, with performance comparisons being derived from the correlation between the mapping uncertainty level predicted by the machine learning algorithm and the difference between actual and estimated expression for each gene. The best-performing method will be determined by the highest correlation and may be alignment tool-specific.

# AUTHOR CONTRIBUTIONS

QM conceived the basic idea and designed the analysis. AM, XC, YZ, CW, and QM contributed to development of feature extraction conceptualization, methods, and implementation. AM, SG, JX, and QM contributed to machine learning modeling conceptualization, methods, and implementation. AM, YZ, CW, and QM contributed to the development and maintenance of GeneQC website. AM, SG, and QM contributed to the manuscript development and writing. All authors contributed to the manuscript revisions, read and approved the final version of the manuscript.

# FUNDING

This work was supported by National Science Foundation/EPSCoR Award No. IIA-1355423, the State of South Dakota Research Innovation Center and the Agriculture Experiment Station of South Dakota State University (SDSU). This work is also supported by Hatch Project: SD00H558-15/project accession No. 1008151 from the USDA National Institute of Food and Agriculture. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation (grant number ACI-1548562).

# ACKNOWLEDGMENTS

This work has been released as a pre-print (McDermaid et al., 2018b).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00313/full#supplementary-material

# REFERENCES


Mitchell, T. M. (1997). Machine Learning. Boston, MA: McGraw-Hill.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 McDermaid, Chen, Zhang, Wang, Gu, Xie and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Novel Computational Method for the Identification of Potential miRNA-Disease Association Based on Symmetric Non-negative Matrix Factorization and Kronecker Regularized Least Square

*School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China*

#### Yan Zhao, Xing Chen\* and Jun Yin

Edited by: *Quan Zou, Tianjin University, China*

#### Reviewed by:

*Jiawei Luo, Hunan University, China Jialiang Yang, Icahn School of Medicine at Mount Sinai, United States*

> \*Correspondence: *Xing Chen xingchen@amss.ac.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *03 July 2018* Accepted: *30 July 2018* Published: *21 August 2018*

#### Citation:

*Zhao Y, Chen X and Yin J (2018) A Novel Computational Method for the Identification of Potential miRNA-Disease Association Based on Symmetric Non-negative Matrix Factorization and Kronecker Regularized Least Square. Front. Genet. 9:324. doi: 10.3389/fgene.2018.00324*

Increasing evidence has indicated that microRNAs (miRNAs) are associated with numerous human diseases. Studying the associations between miRNAs and diseases contributes to the exploration of effective diagnostic and treatment approaches for diseases. Unfortunately, the use of biological experiments to reveal the potential associations between miRNAs and diseases is time consuming and costly. Therefore, simple and efficient computational models are needed to predict potential disease-related miRNAs. Considering the limitations of previous methods, we proposed a novel computational model of Symmetric Nonnegative Matrix Factorization for MiRNA-Disease Association prediction (SNMFMDA) to reveal the relation of miRNA-disease pairs. SNMFMDA could be applied to predict miRNAs associated with new diseases. In contrast to the direct use of the integrated similarity in previous computational models, the integrated similarity needs to be interpolated by symmetric non-negative matrix factorization (SymNMF) before application in SNMFMDA, and the relevant probability of disease-miRNA was obtained mainly through the Kronecker regularized least square (KronRLS) method in our model. Moreover, the AUC of global leave-one-out cross validation (LOOCV) reached 0.9007, and the AUC based on local LOOCV was 0.8426. In addition, the mean and the standard deviation of AUCs reached 0.8830 and 0.0017, respectively, in 5-fold cross validation. All of the above results demonstrated the superior prediction performance of SNMFMDA. We also conducted three different case studies on Esophageal Neoplasms, Breast Neoplasms and Lung Neoplasms, and 49, 49, and 48 of the top 50 of their predicted miRNAs, respectively, were confirmed by databases or related literature. It could be expected that SNMFMDA would be a model with the ability to predict disease-related miRNAs efficiently and accurately.

Keywords: microRNA, disease, association prediction, matrix factorization, Kronecker regularized least square

# INTRODUCTION

MicroRNAs (miRNAs) are a class of endogenous non-coding RNAs with regulatory functions found in eukaryotes, approximately 20–25 nucleotides in length (Ambros, 2001). Evidence shows that miRNAs are among the most abundant gene regulatory molecules in multicellular organisms; they might affect the expression of many protein-coding genes and play an important regulatory role in animals and plants (Bartel, 2004). As interest in miRNAs has grown, research on them has deepened, and the number of discovered miRNAs has gradually increased in recent years. The latest database records 24,521 microRNA loci in 206 species and 30,424 mature miRNAs after processing (Kozomara and Griffiths-Jones, 2014). It has been verified that miRNAs are crucial constituents of cells and may have an important impact on many biological processes, including proliferation (Cheng et al., 2005), development (Karp and Ambros, 2005), differentiation (Miska, 2005), and viral infection (Miska, 2005). It is therefore natural to expect associations between miRNAs and the onset and development of a number of human diseases (Alvarez-Garcia and Miska, 2005). For instance, by targeting BCL6 corepressor like 1 (BCORL1), mir-876-5p restrains the migration and invasion of hepatocellular carcinoma (HCC) cells, which provides a new idea for the treatment of HCC (Xu et al., 2018). It has also been confirmed that miR-485-5p inhibits the development and improves the chemosensitivity of breast cancer by regulating survivin, which provides a potential method for addressing the chemoresistance of breast cancer (Wang et al., 2018). Therefore, it is of great significance to study the relations between diseases and miRNAs, which helps to elucidate disease pathogenesis at the molecular level and makes a big difference in the early diagnosis of human diseases (Jiang et al., 2010). Using biological experiments to identify potential disease-related miRNAs is time-consuming and costly (Jiang et al., 2013), so it is imperative to use low-cost, high-efficiency methods to predict miRNAs associated with diseases. In recent years, since a large amount of biological data has been collected and organized into different databases, it has become feasible and necessary to develop computational models that reveal potential disease-miRNA associations based on these databases (Chen, 2015).

In the last couple of years, more and more computational models for miRNA-disease association prediction have been developed (Chen et al., 2017b). Based on the observation that miRNAs with similar functions tend to be involved in phenotypically similar diseases and vice versa, numerous computational methods have been proposed (Bandyopadhyay et al., 2010). Chen et al. (2016b) presented the model of Within and Between Score for MiRNA-Disease Association prediction (WBSMDA) to predict potential miRNAs associated with diseases. In this method, the Within-Score and Between-Score of miRNAs and diseases were calculated by integrating miRNA functional similarity, disease semantic similarity, known miRNA-disease associations and Gaussian interaction profile kernel similarity, and the two scores were then combined to obtain the association probability of each miRNA-disease pair. Unfortunately, how to integrate the Within-Score and Between-Score more reasonably remained unresolved. Chen and Yan (2014) also proposed Regularized Least Squares for MiRNA-Disease Association (RLSMDA) to reveal unknown relations between miRNAs and diseases. RLSMDA was a semi-supervised method, so it did not need negative samples. However, the optimal values of the parameters in RLSMDA had not yet been obtained, which might affect its prediction performance. By implementing random walk on the miRNA–miRNA functional similarity network, Chen et al. (2012) developed another model named Random Walk with Restart for MiRNA–Disease Association (RWRMDA) to identify miRNAs related to diseases. The prediction performance of RWRMDA had been confirmed by a number of experiments, but the model had one main limitation: it could not be applied to new diseases without any known associated miRNAs. Later, Chen et al. (2017a) introduced a model called Ranking-based KNN for miRNA-Disease Association prediction (RKNNMDA), which applied the K-Nearest Neighbors (KNN) algorithm to obtain the k nearest neighbors of both miRNAs and diseases. After re-sorting the k nearest neighbors with a Support Vector Machine (SVM) ranking model, they obtained the ranking of association probabilities of disease-miRNA pairs by weighted voting. RKNNMDA could be applied to new diseases, which overcame the biggest limitation of RWRMDA. However, RKNNMDA could not score all miRNAs by the same criteria, especially miRNAs with more known related diseases.

Mørk et al. (2014) proposed a calculation model of miRNA-Protein-Disease (miRPD), which predicted the miRNAs related to diseases not directly but through proteins. The relations between miRNAs and diseases were uncovered by integrating predicted and known miRNA–protein associations with protein–disease associations text-mined from the literature. The prediction performance of miRPD would be further improved if more datasets were taken into account. Xuan et al. (2015) developed a method of MIRNAs associated with Diseases Prediction (MIDP) to reveal the associations between miRNAs and diseases by implementing random walk on a miRNA functional similarity network in which the similarity scores of miRNA pairs were obtained through their related diseases. Xuan et al. also presented an extension, MIDPE, to predict the miRNAs related to new diseases. In addition, Pasquier and Gardes (2016) proposed MiRAI, which constructed a high-dimensional vector space to store the distribution information on miRNAs and diseases and identified the miRNAs associated with diseases based on the similarity of these high-dimensional vectors. Xuan et al. (2013) presented a model named Human Disease-related MiRNA Prediction (HDMP), in which the calculation of miRNA functional similarity was improved by taking more related information into account. miRNAs in the same family or cluster were assigned higher weights since they were more likely to be associated with phenotypically similar diseases. In this method, the sub-score of each of a miRNA's k neighbors was equal to the product of the neighbor's weight and the miRNA functional similarity, and the relevance score of a miRNA-disease pair was obtained by adding the sub-scores of the k neighbors. In addition, Chen et al. (2018a) proposed Network Distance Analysis for MiRNA-Disease Association prediction (NDAMDA) to detect the miRNAs associated with diseases. Compared with other methods, the improvement of NDAMDA lay in considering, in addition to the direct network distance between two studied diseases (miRNAs), the mean distance between each of them and all the other diseases (miRNAs).

Li et al. (2017) developed Matrix Completion for MiRNA-Disease Association prediction (MCMDA), which used a matrix completion algorithm to update the adjacency matrix recording the known disease-miRNA associations in HMDD and thereby uncovered unknown relations without the need for negative samples. Compared with other computational models, the biggest advantage of MCMDA was that it only required known miRNA-disease associations; for the same reason, MCMDA could not predict potential associated miRNAs for new diseases or potential associated diseases for new miRNAs. Furthermore, the optimal parameters of this method were still unknown. Chen et al. (2016c) presented a model of Heterogeneous Graph Inference for MiRNA-Disease Association prediction (HGIMDA), which integrated miRNA functional similarity, disease semantic similarity, Gaussian interaction profile kernel similarity, and experimentally confirmed miRNA-disease associations into a heterogeneous graph and summarized all paths of length three to calculate the association probability of each disease-miRNA pair. Unfortunately, HGIMDA generally assigned higher scores to miRNAs with more known related diseases than to those with fewer. Later, Chen et al. (2018b) proposed another method called Graph Regression for MiRNA-Disease Association prediction (GRMDA). In this method, two matrix decomposition methods were used to extract important correlation properties and filter noise, and graph regression was performed synchronously in three latent spaces, namely the association space, the miRNA similarity space, and the disease similarity space, to reveal potential disease-miRNA associations. However, some problems of GRMDA remained to be settled; for example, how to choose the optimal parameters in SVD and PLS according to the size of the matrix remained unsolved. By optimizing an existing method for maximizing information flow, which was mainly used to prioritize disease-associated protein-coding genes, Yu et al. (2017) developed a combinatorial prioritization algorithm to predict miRNA-disease associations. This method did not require negative samples, which solved the problem that negative microRNA-disease associations were difficult to obtain.

In this paper, we proposed Symmetric Nonnegative Matrix Factorization for MiRNA-Disease Association prediction (SNMFMDA) to predict potential miRNA-disease associations. The process was divided into two main steps. First, we used symmetric non-negative matrix factorization (SymNMF) to interpolate the integrated similarity matrix. Second, based on the interpolated integrated similarity matrix, we used the Kronecker regularized least square (KronRLS) method to obtain the disease-miRNA association score matrix. We implemented global and local Leave-One-Out Cross Validation (LOOCV) and 5-fold cross validation to assess the prediction performance of SNMFMDA. The AUC values of global LOOCV, local LOOCV, and 5-fold cross validation of SNMFMDA reached 0.9007, 0.8426, and 0.8830 ± 0.0017, respectively, which verified the excellent prediction performance of SNMFMDA.

# MATERIALS AND METHODS

#### Human miRNA-Disease Association

In this paper, we obtained known human disease-miRNA associations from HMDD v2.0, which recorded 5,430 experimentally verified associations between 383 diseases and 495 miRNAs. To represent whether there were known associations between diseases and miRNAs, we defined an nd × nm adjacency matrix A, where nd and nm corresponded to the number of diseases and miRNAs, respectively. If the relation between disease d(i) and miRNA m(j) had been verified, the value of the element A(d(i), m(j)) of the matrix was 1, otherwise 0.
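As an illustration of this definition, the following minimal sketch builds the adjacency matrix from a list of verified pairs. The function name `build_adjacency` and the toy disease and miRNA names are hypothetical and are not taken from HMDD.

```python
import numpy as np

def build_adjacency(pairs, diseases, mirnas):
    """Return the nd x nm binary matrix A with A[i, j] = 1 for verified pairs."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    A = np.zeros((len(diseases), len(mirnas)), dtype=int)
    for d, m in pairs:
        A[d_index[d], m_index[m]] = 1
    return A

# Toy data (hypothetical names, for illustration only)
diseases = ["disease_a", "disease_b"]
mirnas = ["mir_1", "mir_2", "mir_3"]
pairs = [("disease_a", "mir_1"), ("disease_b", "mir_3")]
A = build_adjacency(pairs, diseases, mirnas)
```

Rows index diseases and columns index miRNAs, matching the nd × nm layout used in the rest of the section.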

#### miRNA Functional Similarity

On the basis of the assumption that functionally similar miRNAs tend to be associated with similar diseases and vice versa, miRNA functional similarities were calculated (Wang et al., 2010), and they can be downloaded from http://www.cuilab.cn/fles/images/cuilab/misim.zip. To describe the functional similarity between miRNAs, we defined the miRNA functional similarity matrix FS, where the element FS(m(i), m(j)) represented the functional similarity score between miRNAs m(i) and m(j) (for the specific calculation process of miRNA functional similarity, please see **Supplementary Material**).

#### Disease Semantic Similarity Model 1

To describe the association between diseases, the Directed Acyclic Graphs (DAGs) were built. DAG(D) = (D, T (D), E(D)) was applied to indicate disease D, where T(D) was a set of nodes composed of node D itself and its ancestor nodes and E(D) was a set consisting of edges directly from parent nodes to the child nodes (Wang et al., 2010). The contribution of disease d to the semantic value of disease D in DAG(D) and the semantic value of disease D were defined as follows:

$$\begin{cases} D1\_D\left(d\right) = 1 & \text{if } d = D\\ D1\_D\left(d\right) = \max\left\{\triangle \ast D1\_D\left(d'\right) \mid d' \in \text{children of } d\right\} & \text{if } d \neq D \end{cases} \tag{1}$$

$$DV1\,(D) = \sum\_{d \in T(D)} D1\_D\,(d)\tag{2}$$

where △ is the semantic contribution factor. The contribution of disease D to its own semantic value was 1, and the contribution of other diseases to the semantic value of disease D was negatively related to the distance between the disease and disease D, so the diseases in the same layer might have the same contribution to the semantic value of disease D.

Here, based on the model in (Xuan et al., 2013), we constructed the disease semantic similarity matrix SS1, whose element SS1(d(i), d(j)) indicated the semantic similarity score between diseases d(i) and d(j). We assumed that the more the DAGs of two diseases overlapped, the greater their semantic similarity would be. The semantic similarity between diseases d(i) and d(j) was calculated as follows:

$$\text{SS1}\left(d\left(i\right), d\left(j\right)\right) = \frac{\sum\_{t \in T(d(i)) \cap T(d(j))} \left(D1\_{d(i)}\left(t\right) + D1\_{d(j)}\left(t\right)\right)}{DV1\left(d\left(i\right)\right) + DV1\left(d(j)\right)} \tag{3}$$
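Formulas (1)–(3) can be sketched on a toy DAG as follows. This is only an illustrative sketch: the `parents` mapping, the disease names, and the value 0.5 for the semantic contribution factor △ are assumptions, since the text above does not fix a value for △.

```python
DELTA = 0.5  # assumed semantic contribution factor; not specified in the text above

def contributions(disease, parents, delta=DELTA):
    """D1_D(d) for every d in T(D), i.e. D itself and all of its ancestors."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, []):
            value = delta * contrib[child]
            # keep the maximum over all children of `parent` inside DAG(D)
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def ss1(d_i, d_j, parents, delta=DELTA):
    """Semantic similarity of formula (3) between two diseases."""
    c_i = contributions(d_i, parents, delta)
    c_j = contributions(d_j, parents, delta)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))

# Toy hierarchy (hypothetical): root -> x -> leaf, root -> y
parents = {"x": ["root"], "y": ["root"], "leaf": ["x"]}
similarity = ss1("leaf", "y", parents)
```

In this toy hierarchy the only shared term of `leaf` and `y` is `root`, so the similarity is small, whereas a disease compared with itself scores 1.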

#### Disease Semantic Similarity Model 2

For diseases that appeared in the same layer of DAG(A), the definition in disease semantic similarity model 1 gave them the same contribution to the semantic value of disease A. However, they might appear in different numbers of disease DAGs. For example, of two diseases d(i) and d(j) in the same layer of DAG(A), d(i) might occur in more disease DAGs and d(j) in fewer. The contributions of the two diseases to the semantic value of disease A should then differ, and d(i) should contribute less than d(j). Therefore, it was unreasonable to calculate the contribution to the semantic value of a disease solely according to model 1. Here, according to the model in (Xuan et al., 2013), we defined disease semantic similarity model 2 to supplement model 1. In the second model, diseases in the same layer of DAG(A) did not necessarily have the same contribution to the semantic value of disease A. The contribution of disease D to the semantic value of disease A was calculated as follows:

$$D2\_A\left(D\right) = -\log \frac{\text{the number of DAGs including } D}{\text{the number of diseases}} \tag{4}$$

The semantic value of disease A was calculated as follows:

$$DV2\,(A) = \sum\_{t \in T(A)} D2\_A(t) \tag{5}$$

Similar to the disease semantic similarity matrix SS1, the element SS2(d (i), d(j)) of the disease semantic similarity matrix SS2 was calculated as follows:

$$\text{SS2}\left(d\left(i\right), d\left(j\right)\right) = \frac{\sum\_{t \in T(d(i)) \cap T(d(j))} \left(D2\_{d(i)}\left(t\right) + D2\_{d(j)}\left(t\right)\right)}{DV2\left(d\left(i\right)\right) + DV2\left(d(j)\right)}\tag{6}$$

Here, SS2(d(i), d(j)) was the disease semantic similarity between diseases d(i) and d(j). Combining the two models of disease semantic similarity, we calculated the final disease semantic similarity matrix SS as follows:

$$\text{SS} = \frac{\text{SS1} + \text{SS2}}{2} \tag{7}$$
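A minimal sketch of model 2 (formulas 4–6) on a toy hierarchy is given below. The function names and the toy data are hypothetical, and returning 0 when both semantic values are 0 is an assumption of this sketch that the text does not address.

```python
import math

def ancestors_and_self(disease, parents):
    """T(D): the disease together with all of its ancestors."""
    seen = {disease}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def semantic_similarity_2(diseases, parents):
    """Return {(d_i, d_j): SS2} for all ordered pairs of diseases."""
    node_sets = {d: ancestors_and_self(d, parents) for d in diseases}
    counts = {}
    for nodes in node_sets.values():          # number of DAGs including each term
        for t in nodes:
            counts[t] = counts.get(t, 0) + 1
    n = len(diseases)
    d2 = {t: -math.log(c / n) for t, c in counts.items()}              # formula (4)
    dv2 = {d: sum(d2[t] for t in node_sets[d]) for d in diseases}      # formula (5)
    sim = {}
    for a in diseases:
        for b in diseases:
            shared = node_sets[a] & node_sets[b]
            denom = dv2[a] + dv2[b]
            # assumption: similarity is 0 when both semantic values are 0
            sim[(a, b)] = 2.0 * sum(d2[t] for t in shared) / denom if denom > 0 else 0.0
    return sim

# Toy hierarchy (hypothetical): root -> x -> leaf, root -> y
parents = {"x": ["root"], "y": ["root"], "leaf": ["x"]}
sim = semantic_similarity_2(["root", "x", "y", "leaf"], parents)
```

Because `root` appears in every DAG, its contribution is −log(1) = 0, so two diseases that share only `root` receive a similarity of 0 under this model. The final matrix SS of formula (7) is then the elementwise mean of the two models.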

#### Gaussian Interaction Profile Kernel Similarity

On the basis of the assumption that functionally similar miRNAs were more likely to be associated with similar diseases and vice versa, and taking the topological information of the known miRNA-disease association network into account, we defined the Gaussian interaction profile kernel similarity for diseases based on the model in (van Laarhoven et al., 2011). Here, the binary vector IP(d(i)) represented the ith row of the adjacency matrix A, which recorded the association information of disease d(i) with all miRNAs. The Gaussian interaction profile kernel similarity matrix for diseases was defined as KD. The element KD(d(i), d(j)) indicated the Gaussian interaction profile kernel similarity between diseases d(i) and d(j) and could be calculated as follows:

$$KD\left(d\left(i\right), d\left(j\right)\right) = \exp\left(-\gamma\_d \left\|IP(d(i)) - IP(d(j))\right\|^2\right) \tag{8}$$

Here, the parameter γ<sub>d</sub> controls the kernel bandwidth, and it could be obtained by normalizing another bandwidth parameter γ′<sub>d</sub> by the average number of associated miRNAs over all the diseases.

$$\gamma\_d = \frac{\gamma\_d^{\prime}}{\frac{1}{nd} \sum\_{u=1}^{nd} \left\| IP\left(d\left(u\right)\right) \right\|^2} \tag{9}$$

Here, we set the value of γ′<sub>d</sub> to 1. Similarly, the Gaussian interaction profile kernel similarity matrix KM for miRNAs was defined as follows:

$$KM\left(m\left(i\right),m\left(j\right)\right) = \exp\left(-\gamma\_m \left\|IP(m(i)) - IP(m(j))\right\|^2\right) \tag{10}$$

The binary vector IP(m(i)) represents the ith column of the adjacency matrix A. Similar to γ<sub>d</sub>, the parameter γ<sub>m</sub> was calculated as follows:

$$\gamma\_m = \frac{\gamma\_m^{\prime}}{\frac{1}{nm} \sum\_{u=1}^{nm} \left\| IP\left(m\left(u\right)\right) \right\|^2} \tag{11}$$

Here, the value of γ′<sub>m</sub> was set to 1.
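Formulas (8)–(11) translate directly into a few lines of NumPy. The sketch below is illustrative only: the function name `gip_kernel` and the toy adjacency matrix are hypothetical, and the code assumes that at least one association is known so that the normalizing mean is non-zero.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = gamma_prime / sq_norms.mean()            # formulas (9) and (11)
    # squared Euclidean distance between every pair of profiles
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)             # guard against rounding below 0
    return np.exp(-gamma * sq_dists)                 # formulas (8) and (10)

# Toy adjacency matrix (hypothetical): 3 diseases x 3 miRNAs
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])
KD = gip_kernel(A)      # diseases use the rows of A
KM = gip_kernel(A.T)    # miRNAs use the columns of A
```

The same function serves both kernels because the miRNA profiles are simply the columns of A.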

#### Integrated Similarity for miRNAs and Diseases

Many diseases can be described by a DAG. Based on the assumption that two diseases whose DAGs overlap substantially have a large semantic similarity, we could calculate the semantic similarity between diseases. However, a DAG was not available for every disease, and for diseases without a DAG we could not calculate the semantic similarity between them and other diseases. Therefore, we constructed the integrated similarity matrix SD for diseases by combining the disease semantic similarity matrix and the Gaussian interaction profile kernel similarity matrix in the following way, according to the model in (Chen et al., 2016b):

$$SD\left(d\left(i\right), d\left(j\right)\right) = \frac{\text{SS}\left(d\left(i\right), d\left(j\right)\right) + KD\left(d\left(i\right), d\left(j\right)\right)}{2} \tag{12}$$

Similarly, the integrated similarity matrix SM for miRNAs was calculated as follows:

$$\text{SM}\left(m\left(i\right), m\left(j\right)\right) = \frac{\text{FS}\left(m\left(i\right), m\left(j\right)\right) + \text{KM}\left(m\left(i\right), m\left(j\right)\right)}{2} \tag{13}$$

#### SNMFMDA

Motivated by (Chen and Li, 2017), we proposed SNMFMDA to predict potential miRNA-disease associations; the flow chart of the algorithm is shown in **Figure 1**. In the first step, we used SymNMF to interpolate the integrated similarity matrices SM and SD. In the second step, based on the interpolated integrated similarity matrices SM and SD, we used the KronRLS method to obtain a score matrix S with the same dimension as the adjacency matrix A, each element of which was the association probability of the corresponding disease-miRNA pair. The two-step process was as follows:

#### SymNMF

As an unsupervised learning method, non-negative matrix factorization (NMF) was extremely versatile and it had gradually become one of the most popular multidimensional data processing tools in signal processing, semantic analysis of documents and image engineering (He et al., 2011). In our model, we improved the integrated similarity by introducing SymNMF, which was a special kind of nonnegative matrix factorization. For the matrix SD, our purpose was to find a matrix P with the same size as the integrated similarity matrix SD, which also satisfied the following requirement:

$$\text{SD} \approx \text{PP}^T \tag{14}$$

The specific process was as follows. The first step was initialization: we constructed a random matrix P<sup>0</sup> whose elements were all positive as the initialization of the matrix P. P<sup>i</sup> indicated the matrix corresponding to P at the beginning of the ith update, and the norm E<sup>i</sup> in the ith update could be computed as follows:

$$E^i = \left\| SD - P^i \left(P^i\right)^T \right\|\_F^2 \tag{15}$$

The second step was update. Here, we temporarily marked P as P new after each update. The specific process was as follows:

$$R^i = \left(SD\,P^i\right) \mathbin{./} \left(P^i \left(P^i\right)^T P^i\right) \tag{16}$$

$$P^{new} = P^i \mathbin{.\ast} \left(1 - \alpha + \alpha R^i\right) \tag{17}$$

$$E^{new} = \left\| SD - P^{new} \left(P^{new}\right)^T \right\|\_F^2 \tag{18}$$

where A.∗B and A./B were the entrywise product (i.e., Hadamard product) and entrywise division, respectively. The value of α should be less than 1 but close to 1; here, we set α = 0.999. We then compared E<sup>new</sup> with E<sup>i</sup>: if E<sup>new</sup> was smaller, the update step ended; otherwise, P was updated as follows:

$$P^{new} = P^i \mathbin{.\ast} \left(R^i\right)^{\frac{1}{3}} \tag{19}$$

$$E^{new} = \left\| SD - P^{new} \left(P^{new}\right)^T \right\|\_F^2 \tag{20}$$

$$P^{new} \to P^{i+1} \tag{21}$$

$$E^{new} \to E^{i+1} \tag{22}$$

$$i = \ i + 1\tag{23}$$

After that, the procedure went back to formula (15) and started the next update. Using SymNMF to perform interpolation on the similarity matrix SM was similar to SD.
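The update scheme of formulas (15)–(23) can be sketched as follows. Several choices here are assumptions rather than part of the description above: the fixed iteration budget `n_iter` (the text gives no stopping rule), the small constant `eps` that guards the entrywise division, the random seed, and the reading that the interpolated similarity is the product P P<sup>T</sup>.

```python
import numpy as np

def symnmf(S, alpha=0.999, n_iter=200, seed=0, eps=1e-12):
    """Approximate a symmetric non-negative matrix S by P @ P.T."""
    S = np.asarray(S, dtype=float)
    rng = np.random.default_rng(seed)
    P = rng.random(S.shape) + eps                    # random positive start, same size as S
    err = np.linalg.norm(S - P @ P.T, "fro") ** 2    # formula (15)
    for _ in range(n_iter):                          # fixed budget (assumption)
        R = (S @ P) / (P @ P.T @ P + eps)            # formula (16)
        P_new = P * (1.0 - alpha + alpha * R)        # formula (17)
        err_new = np.linalg.norm(S - P_new @ P_new.T, "fro") ** 2   # formula (18)
        if err_new >= err:                           # no improvement: cube-root update
            P_new = P * np.cbrt(R)                   # formula (19)
            err_new = np.linalg.norm(S - P_new @ P_new.T, "fro") ** 2
        P, err = P_new, err_new                      # formulas (21)-(23)
    return P, err

# Toy symmetric non-negative similarity matrix (hypothetical values)
B = np.array([[1.0, 0.2], [0.2, 1.0], [0.5, 0.5]])
S = B @ B.T
P, err = symnmf(S)
S_interp = P @ P.T   # assumed form of the interpolated similarity
```

Both updates are multiplicative, so a positive starting matrix and a non-negative S keep every entry of P non-negative throughout.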

#### KronRLS

First, we built a list D = [d(1), d(2), . . . d(nd)] for diseases. Similarly, a list M = [m(1), m(2), . . . m(nm)] was constructed to denote the nm miRNAs. All the columns of the adjacency matrix A were stitched together to form an n-dimensional column vector Y, where n = nd × nm indicated the total number of disease-miRNA pairs. We then built another n-dimensional column vector X whose element X<sub>i</sub> represented the disease-miRNA pair corresponding to the ith element in Y. The purpose of the RLS algorithm was to find the mapping function f from vector X to score matrix S by minimizing the following function.

$$J\left(f\right) = \frac{1}{2n} \sum\_{i=1}^{n} \left(Y\_i - f(\mathbf{X}\_i)\right)^2 + \frac{\lambda}{2} \left\|f\right\|\_{K}^2\tag{24}$$

where ‖f‖<sub>K</sub> is the norm of function f in the Hilbert space associated with the kernel K, and λ is a regularization parameter that determines the trade-off between prediction error and model complexity; here, we set it to 1. The representer theorem ensured that Equation (24) had a closed-form solution:

$$f\left(X\right) = \sum\_{i=1}^{n} a\_i K(X, X\_i) = Ka \tag{25}$$

Here, a is also an n-dimensional vector, and it could be obtained by solving the following equation:

$$\left(K+\lambda I\right)a=Y \tag{26}$$

where I is the identity matrix and K is the pairwise instance kernel that represents the similarity of two data points in the Hilbert space. Specifically, for two disease-miRNA pairs (d<sub>i</sub>, m<sub>j</sub>) and (d<sub>w</sub>, m<sub>z</sub>), K((d<sub>i</sub>, m<sub>j</sub>), (d<sub>w</sub>, m<sub>z</sub>)) indicates the similarity between the two pairs. The kernel could be calculated as follows:

$$K\left(\left(d\_i, m\_j\right), \left(d\_w, m\_z\right)\right) = SD(i, w)SM(j, z) \tag{27}$$

$$K = SD \otimes SM \tag{28}$$

where SD ⊗ SM is the Kronecker product of SD and SM, and the relation probabilities of all disease-miRNA pairs could be obtained according to the kernel as follow:

$$\text{vec}(S) = f(X) = K \left( K + \lambda I \right)^{-1} \text{ Y} \tag{29}$$

Here, vec(·) is a vectorization operator that combines all the columns of a matrix into a column vector.
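For small matrices, formulas (26)–(29) can be evaluated directly, which is useful as a reference implementation. In the sketch below the function name `kron_rls_naive` is hypothetical, and the vector Y is formed by flattening A row by row so that its ordering matches the index layout of `np.kron(SD, SM)`; this ordering is a choice of the sketch.

```python
import numpy as np

def kron_rls_naive(SD, SM, A, lam=1.0):
    """Direct KronRLS scores; memory grows as (nd * nm)^2, so toy sizes only."""
    nd, nm = A.shape
    K = np.kron(SD, SM)                      # K[(i, j), (w, z)] = SD[i, w] * SM[j, z]
    y = np.asarray(A, dtype=float).reshape(-1)   # entry i * nm + j is the pair (d_i, m_j)
    a = np.linalg.solve(K + lam * np.eye(nd * nm), y)   # formula (26)
    return (K @ a).reshape(nd, nm)                       # formula (29), reshaped to nd x nm

# Toy example (hypothetical values): identity similarities give S = A / (1 + lam)
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
S = kron_rls_naive(np.eye(2), np.eye(3), A)
```

With identity similarity matrices no information is shared between pairs, so each known association is simply shrunk by the factor 1 / (1 + λ).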

To solve the problem more efficiently, we introduced the spectral decomposition of the matrices to speed up the calculation. The decompositions of the integrated similarities SD and SM and of K were defined as follows:

$$SD = V\_d \Lambda\_d V\_d^T \tag{30}$$

$$SM = V\_m \Lambda\_m V\_m^T \tag{31}$$

$$K = V \Lambda V^T \tag{32}$$

Here, the dimension of the matrix V<sub>d</sub> (V<sub>m</sub>, V) is the same as that of SD (SM, K), and each of its columns is an eigenvector of SD (SM, K). The matrix Λ<sub>d</sub> (Λ<sub>m</sub>, Λ) is a diagonal matrix whose ith diagonal element is the eigenvalue of SD (SM, K) corresponding to the ith column [i.e., the ith eigenvector of SD (SM, K)]. Then, the Kronecker product of SD and SM could be calculated as follows:

$$K = SD \otimes SM = V \Lambda V^T \tag{33}$$

$$K\left(K+\lambda I\right)^{-1} = V \Lambda \left(\Lambda + \lambda I\right)^{-1} V^T \tag{34}$$

where:

$$V = V\_d \otimes V\_m \tag{35}$$

$$\Lambda = \Lambda\_d \otimes \Lambda\_m \tag{36}$$

Here, we introduced a property of the Kronecker product:

$$\left(A \otimes B\right) \text{vec}\left(Y\right) = \text{vec}\left(BYA^T\right) \tag{37}$$

By integrating the above formulas, the score matrix S could be calculated as follows:

$$\mathbf{S} = V\_d Z^T V\_m^T \tag{38}$$

where:

$$\text{vec}\left(Z\right) = \left(\Lambda\_d \otimes \Lambda\_m\right) \left(\Lambda\_d \otimes \Lambda\_m + \lambda I\right)^{-1} \text{vec}\left(V\_m^T Y^T V\_d\right) \tag{39}$$
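Formulas (38) and (39) avoid forming the n × n kernel explicitly. The following sketch, with the hypothetical function name `kron_rls_eig`, assumes that SD and SM are symmetric (so that `np.linalg.eigh` applies) and that no product of eigenvalues equals −λ; it takes the adjacency matrix A in the role of Y.

```python
import numpy as np

def kron_rls_eig(SD, SM, A, lam=1.0):
    """KronRLS scores via the eigendecompositions of SD and SM."""
    lam_d, V_d = np.linalg.eigh(SD)          # SD = V_d diag(lam_d) V_d^T
    lam_m, V_m = np.linalg.eigh(SM)          # SM = V_m diag(lam_m) V_m^T
    C = V_m.T @ A.T @ V_d                    # nm x nd matrix inside vec(.) in (39)
    L = np.outer(lam_m, lam_d)               # L[j, i] = lam_m[j] * lam_d[i]
    Z = L / (L + lam) * C                    # diagonal scaling of formula (39)
    return V_d @ Z.T @ V_m.T                 # formula (38), an nd x nm score matrix

# Toy symmetric similarities and associations (hypothetical values)
rng = np.random.default_rng(0)
B_d, B_m = rng.random((3, 3)), rng.random((4, 4))
SD, SM = B_d @ B_d.T, B_m @ B_m.T
A = (rng.random((3, 4)) > 0.5).astype(float)
S = kron_rls_eig(SD, SM, A)
```

The cost is dominated by two small eigendecompositions, of sizes nd and nm, instead of a linear solve of size nd × nm.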

# RESULTS

#### Performance Evaluation

In this study, based on the 5,430 confirmed associations between 383 diseases and 495 miRNAs recorded in HMDD v2.0, local and global LOOCV were applied to test the prediction performance of SNMFMDA. In LOOCV, each of the 5,430 known associations (positive samples) was left out in turn as the test sample and the remaining 5,429 known associations were used as training samples, while the miRNA-disease pairs without verified association were viewed as candidate samples (unknown samples); SNMFMDA was then applied to calculate the association probabilities of the candidate samples and the test sample. In global LOOCV, the score of the test sample was ranked against all candidate samples, whereas in local LOOCV it was ranked only against the unknown samples involving the studied disease. When the ranking of the test sample was higher than a given threshold, SNMFMDA was considered to have predicted the sample correctly. We set the true positive rate (TPR, sensitivity) as the vertical axis and the false positive rate (FPR, 1-specificity) as the horizontal axis. Different thresholds corresponded to different points in this coordinate system, and the curve through all these points was the receiver operating characteristic (ROC) curve. Here, sensitivity was the ratio of the number of correctly predicted test samples to the total number of positive samples, and specificity was the percentage of candidate samples ranked below the given threshold among all unknown samples. The area under the ROC curve (AUC) was calculated to evaluate the reliability of SNMFMDA. An AUC of 0.5 meant that the computational model was equivalent to random prediction, and an AUC of 1 indicated perfect prediction performance. In other words, for AUC values between 0.5 and 1, the larger the value, the better the prediction performance.
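The difference between global and local ranking can be sketched as below. The function names, the toy matrices, and the rank-based summary in `rank_auc` (the mean fraction of candidates scored below the test sample) are assumptions of this sketch; it is one simple way to summarize ranks and is not necessarily the exact ROC construction used for the reported AUCs. In a real LOOCV run the score matrix would be recomputed for every left-out association.

```python
import numpy as np

def loocv_rank(S, A, test, local=False):
    """Rank (1 = best) of the left-out pair among the candidate pairs.

    S is the score matrix computed with the test pair hidden; A is the
    original adjacency matrix, so A == 0 marks the candidate samples.
    """
    i, j = test
    candidates = (A == 0)
    if local:                                   # only candidates of the studied disease
        row_only = np.zeros_like(candidates)
        row_only[i, :] = True
        candidates &= row_only
    return 1 + int((S[candidates] > S[i, j]).sum())

def rank_auc(ranks, n_candidates):
    """Mean fraction of candidates ranked below the test sample."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean((n_candidates - (ranks - 1)) / n_candidates))

# Toy scores and associations (hypothetical values)
A = np.array([[1, 0, 0],
              [0, 1, 0]])
S = np.array([[0.25, 0.2, 0.1],
              [0.30, 0.8, 0.4]])
global_rank = loocv_rank(S, A, (0, 0))
local_rank = loocv_rank(S, A, (0, 0), local=True)
```

In this toy case the left-out pair is outranked by two candidates that belong to the other disease, so it ranks third globally but first locally.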

The comparison of prediction performance between several computational methods, based on the AUC values of global and local LOOCV, is shown in **Figure 2**. In global LOOCV, the AUC of SNMFMDA was 0.9007, while the AUC values of HGIMDA (Chen et al., 2016c), MCMDA (Li et al., 2017), MaxFlow (Yu et al., 2017), RLSMDA (Chen and Yan, 2014), HDMP (Xuan et al., 2013), and WBSMDA (Chen et al., 2016b) were 0.8781, 0.8749, 0.8624, 0.8426, 0.8366, and 0.8030, respectively. In local LOOCV, SNMFMDA obtained an AUC of 0.8426, which was clearly better than HGIMDA (0.8077), MCMDA (0.7718), MaxFlow (0.7774), RLSMDA (0.6953), HDMP (0.7702), WBSMDA (0.8031), MiRAI (0.6299), MIDP (0.8196), and RWRMDA (0.7891). Neither RWRMDA nor MIDP could predict potential related miRNAs for all diseases at the same time, so their prediction performance could only be evaluated with local LOOCV. In addition, the association probabilities of candidate samples calculated by MiRAI were highly positively correlated with the number of known associations of the corresponding diseases: the more known miRNAs were associated with a disease, the greater the association probabilities of that disease's candidate samples would be. Thus, it was not reasonable to compare the association probabilities of candidate samples corresponding to different diseases. Therefore, global LOOCV could not be applied to evaluate the prediction performance of RWRMDA, MIDP, and MiRAI. Moreover, as could be seen from **Figure 2**, the local LOOCV AUC of MiRAI was relatively small. This was because the core of MiRAI was collaborative filtering, which made its prediction accuracy heavily dependent on the number of known miRNA-disease associations. The database used in our method had 383 diseases, but there were few known miRNAs associated with each disease. Therefore, the predictive performance of MiRAI on this database was far worse than that in the original literature, where the training database contained more verified associations for each disease.

In addition, we applied 5-fold cross-validation to evaluate the predictive performance of SNMFMDA. All known miRNA-disease associations were randomly divided into five equal parts, and each part was in turn treated as the test set while the other four parts were treated as the training set. To reduce the influence of the division of known associations on prediction accuracy, we performed 100 random divisions. The mean and standard deviation of the AUC in 5-fold cross validation reached 0.8830 and 0.0017, respectively, which was better than MCMDA (0.8767 ± 0.0011), MaxFlow (0.8579 ± 0.001), RLSMDA (0.8569 ± 0.0020), HDMP (0.8342 ± 0.0010), and WBSMDA (0.8185 ± 0.0009). In summary, all cross-validation results further supported the superior prediction performance of SNMFMDA.
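The random division of known associations into five parts can be sketched as follows; the generator name `five_fold_splits`, the seed handling, and the toy adjacency matrix are hypothetical. Repeating the call with 100 different seeds would correspond to the 100 random divisions described above.

```python
import numpy as np

def five_fold_splits(A, n_folds=5, seed=0):
    """Yield (A_train, test_pairs) with one fold of known associations hidden."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(A == 1)                  # (disease, miRNA) index pairs
    order = rng.permutation(len(known))
    for fold in np.array_split(order, n_folds):
        test_pairs = known[fold]
        A_train = A.copy()
        A_train[test_pairs[:, 0], test_pairs[:, 1]] = 0   # hide the test associations
        yield A_train, test_pairs

# Toy adjacency matrix with 10 known associations (hypothetical)
A = np.eye(10, dtype=int)
folds = list(five_fold_splits(A))
```

Each training matrix keeps four fifths of the known associations, and every known association is held out exactly once.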

#### Case Studies

In addition to cross-validation, we used case studies to evaluate the prediction accuracy of our computational model. We carried out three kinds of case studies, on Esophageal Neoplasms (EN), Breast Neoplasms (BN), and Lung Neoplasms (LN), to test the prediction performance from different aspects. In the first kind, we obtained 5,430 known miRNA-disease associations (positive samples) between 383 diseases and 495 miRNAs from HMDD v2.0 (Li et al., 2014); the remaining 184,155 miRNA-disease pairs were unknown samples. To predict potential associations among the unknown samples, we scored them with SNMFMDA and ranked them. Finally, we verified the top 50 potentially associated miRNAs for the investigated disease against two other databases, dbDEMC (Yang et al., 2010) and miR2Disease (Jiang et al., 2009), which record a number of verified miRNA-disease associations. The second kind of case study assessed SNMFMDA's prediction power for new diseases without any known associated miRNAs. Here, when we studied a disease, we first removed all its known associations in HMDD v2.0; that is, all the 1s in the row of the adjacency matrix corresponding to this disease were set to 0. We then used SNMFMDA to score all miRNAs for the investigated disease and sorted them. Finally, we checked whether the associations between the disease and the top 50 miRNAs were confirmed by the three databases dbDEMC, miR2Disease, and HMDD v2.0. Because known miRNA-disease associations are recorded in a number of databases, good prediction accuracy on one particular database does not by itself show that a method performs well in general. To demonstrate the applicability of our model to different databases, we carried out the third kind of case study. It was similar to the first, except that SNMFMDA was based on a different dataset: the database HMDD v1.0, which records 1,895 verified associations between 137 diseases and 271 miRNAs. Finally, we used three databases (HMDD v2.0, miR2Disease, and dbDEMC) to verify the top 50 predicted miRNAs for the investigated disease.
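The difference between the first and second kinds of case study comes down to how the adjacency matrix is prepared before ranking. The sketch below illustrates this with a toy matrix whose rows are diseases and whose columns are miRNAs; the function names and the score matrix are hypothetical, and in a real run the scores would be recomputed by the model after masking rather than reused.

```python
def mask_disease(adjacency, disease_idx):
    """Return a copy in which all known associations of one disease are set to 0."""
    masked = [row[:] for row in adjacency]
    masked[disease_idx] = [0] * len(masked[disease_idx])
    return masked

def top_k_candidates(scores, adjacency, disease_idx, k):
    """Rank miRNAs without a known association (entry 0) by descending score."""
    candidates = [j for j, a in enumerate(adjacency[disease_idx]) if a == 0]
    candidates.sort(key=lambda j: scores[disease_idx][j], reverse=True)
    return candidates[:k]

# Toy data: 2 diseases x 4 miRNAs; the scores are made up for illustration.
adjacency = [[1, 0, 0, 1],
             [0, 1, 0, 0]]
scores = [[0.9, 0.2, 0.7, 0.8],
          [0.1, 0.6, 0.3, 0.5]]

# First kind: only unknown pairs of disease 0 are candidates.
first_kind = top_k_candidates(scores, adjacency, 0, k=2)

# Second kind: disease 0 is treated as new, so every miRNA is a candidate.
masked = mask_disease(adjacency, 0)
second_kind = top_k_candidates(scores, masked, 0, k=3)
```

In the first kind, miRNAs already known to be associated with the disease are excluded from the ranking; in the second, they re-enter the candidate list and serve as hidden ground truth.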

As one of the most common tumors in the world, Esophageal Neoplasms (EN) ranks among the top 10 cancers in morbidity and mortality (He et al., 2012). According to the latest estimates in the US, 12,850 patients would die from EN in 2018, accounting for 4% of all patients dying of cancer (Siegel et al., 2018). Although treatment methods have improved, the harm caused by EN has not been significantly reduced (He et al., 2012). The survival rate of patients with EN was less than 25% in the last five years (Kim et al., 2011). Research showed that if EN could be diagnosed early, its mortality rate was expected to drop to 10% (Daly et al., 2000), so finding better and more efficient methods of diagnosis and treatment is imperative (Xie et al., 2013). A growing body of research indicates a close relationship between miRNAs and the development of human diseases (Alvarez-Garcia and Miska, 2005). A number of associations between miRNAs and EN have been verified. For instance, lncRNA MALAT1, acting like a competitive endogenous RNA that sponges miR-200a, regulated the expression of ZEB1 and ZEB2 and thereby facilitated the invasion and migration of EN cells by inducing epithelial-mesenchymal transition (Zhang et al., 2017). We used SNMFMDA to perform the first case study on EN, and 47 of the top 50 predicted EN-related miRNAs were verified by the two other databases, dbDEMC and miR2Disease. Of the remaining three miRNAs, two were supported by the literature. Studies showed that serum expression levels of mir-218 (15th in the prediction list) in EN patients were significantly lower than those in healthy people, and that the levels were related to tumor differentiation, staging, and lymph node metastasis. mir-218 is therefore a promising target for the early diagnosis of EN, which suggests a new approach to detecting this cancer (Jiang et al., 2015). In EN cells, mir-122 (43rd in the prediction list) targeted pyruvate kinase M2 (PKM2), and Tanshinone IIA could limit the expression of PKM2 by promoting the expression of mir-122, which in turn restricted the growth of EN cells (Zhang et al., 2016). Overall, only one of the top 50 predicted miRNAs was validated by neither the databases nor the literature (see **Table 1**).

In order to facilitate further validation and research, we have provided the complete prediction list of potential miRNAs associated with all the 383 human diseases in HMDD v2.0 (see **Supplementary Table 1**).

TABLE 1 | Prediction of the top 50 predicted miRNAs associated with

*The first 25 miRNAs and the last 25 miRNAs were recorded in the first and third columns, respectively. The second and fourth columns recorded the database or literature in PubMed that verified the corresponding miRNAs associated with Esophageal Neoplasms.*

Breast Neoplasms (BN) is a type of cancer with high morbidity and mortality among women in the United States (Kelsey and Horn-Ross, 1993). According to estimates, 40,920 women would die from BN in the United States in 2018, accounting for 14% of the total cancer deaths (Siegel et al., 2018). At the current level of medical care, the only way to improve the cure rate and reduce the mortality rate of BN lies in early detection and timely treatment (Tao et al., 2015). To improve diagnostic efficiency, researchers have put forward many methods, including the prediction of miRNAs potentially related to BN. It has been shown that the expression levels of mir-21 and mir-146a in plasma samples of BN patients were markedly higher than those of healthy volunteers, so the plasma expression levels of mir-21 and mir-146a could help identify whether a patient had BN (Kumar et al., 2013). We implemented the second case study on BN, and 45 of the top 50 miRNAs potentially associated with BN were verified by databases (HMDD, dbDEMC, and miR2Disease). Among the remaining 5 miRNAs that were not validated by the databases, the expression of mir-142 (12th in the prediction list) was dysregulated in BN cells, and mir-142 could restrict the invasion of BN cells by simultaneously targeting WASL, Integrin Alpha V, and additional cytoskeletal elements (Schwickert et al., 2015). The expression of mir-378a-3p (20th in the prediction list) was lower in BN tissues, and it acted on the endocrine resistance mechanism of BN by regulating the expression of GOLT1A (Ikeda et al., 2015). mir-302f (48th in the prediction list) was found to be overexpressed in HER2-positive BN (Kang et al., 2014). Moreover, studies confirmed that mir-744 (49th in the prediction list) was overexpressed in BN cells and played a part in the drug resistance of BN (Chen et al., 2016a). The above results (see **Table 2**) showed that 49 of the top 50 potential BN-associated miRNAs predicted by SNMFMDA were validated, which indicated that the prediction performance of SNMFMDA remained reliable when it was applied to new diseases.

*Here, all known associations with Breast Neoplasms had been removed. The first 25 miRNAs and the last 25 miRNAs were recorded in the first and third columns, respectively. The second and fourth columns recorded the database or literature in PubMed that verified the corresponding miRNAs associated with Breast Neoplasms.*

Lung Neoplasms (LN) is one of the most deadly cancers (Liu and Wei, 2018), and its poor prognosis has caused great harm to human health (Zhao et al., 2018). According to the estimation of the American Cancer Society, in 2018, new cases of bronchus and LN would reach 121,680, accounting for 14% of all newly diagnosed cancer patients, and an estimated 83,550 patients would die from lung cancer, more than a quarter of the total deaths (Siegel et al., 2018). The early diagnosis and treatment of LN is very difficult. Fortunately, a growing number of studies have shown that miRNAs are closely related to the development and progression of LN (Zhao et al., 2018). For example, studies showed that the expression of mir-221 was significantly higher in LN patients than in healthy people, and biological analysis indicated that the target of mir-221 was most likely related to the formation and development of LN (Zhu et al., 2017). In addition, mir-221 was very likely to become a non-aggressive biomarker for the diagnosis of LN (Zhu et al., 2017). As an embryo-expressing lung miRNA, mir-127 has been shown to be closely linked to the poor prognosis of LN (Shi et al., 2017). Therefore, predicting miRNAs associated with LN could help us understand the pathogenesis of the cancer and might provide novel diagnostic methods and treatment approaches. We applied SNMFMDA to perform the third case study on LN to test the prediction power of the model when it is applied to another database, HMDD v1.0. The prediction result showed that 44 of the top 50 potential LN-associated miRNAs were verified by other databases (dbDEMC, miR2Disease, HMDD v2.0). For the remaining 6 miRNAs that were not verified by the three databases, studies showed that the mir-92 (6th in the prediction list) family was less expressed in cisplatin-resistant cells, which indicated that the mir-92 family played a part in the regulation of cisplatin resistance in non-small cell lung cancer (Zhao et al., 2015). Other studies confirmed that the overexpression of mir-194 (20th in the prediction list) affected the expression of Mpl/ERK pathway proteins and restrained the mitosis and proliferation of non-small cell lung cancer cells by targeting Human nuclear distribution C (hNUDC), which provided a novel strategy for the treatment of LN (Zhou et al., 2016). Studies confirmed that mir-372-3p (38th in the prediction list) was clearly overexpressed in lung squamous cell carcinoma (LSCC) cells and limited the expression of FGF9 by binding to it, which contributed to the proliferation of LSCC cells. In contrast, low expression of mir-372-3p or high expression of FGF9 helped to inhibit the growth and invasion of LSCC cells (Wang et al., 2017). The expression level of mir-320 (46th in the prediction list) in non-small cell lung cancer (NSCLC) cells was lower than the level in normal cells, and mir-320 limited cell growth in NSCLC cells by targeting fatty acid synthase (Lei et al., 2016). Based on the above results (see **Table 3**), 48 of the top 50 potential LN-associated miRNAs predicted by SNMFMDA were validated, which indicated that the prediction performance of the model based on other datasets was also very reliable.

# DISCUSSION

As accumulating studies have demonstrated that miRNAs play an extremely important role in human physiological processes, research on the association between miRNAs and diseases has attracted more and more attention. Since it is time-consuming and costly to use biological experiments to reveal potential miRNA-disease associations, many computational models have been proposed to predict disease-related miRNAs in recent years. In this paper, we developed a novel model, SNMFMDA, to reveal the relation of miRNA-disease pairs by integrating the known miRNA-disease associations recorded in HMDD v2.0, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for diseases and miRNAs. SNMFMDA overcame a limitation of many previous models, namely that they were incapable of predicting miRNAs associated with new diseases. As shown in the prediction results, the AUC values of global LOOCV, local LOOCV, and 5-fold cross-validation reached 0.9007, 0.8426, and 0.8830 ± 0.0017, respectively. Models with a global AUC above 0.9 are rare, which supports the credibility of SNMFMDA. In the future, our model could be an effective tool to reveal potential disease-related miRNAs, which would benefit the diagnosis and treatment of diseases.

TABLE 3 | Prediction of the top 50 predicted miRNAs associated with Lung Neoplasms based on known associations in HMDD v1.0.


*The first 25 miRNAs and the last 25 miRNAs were recorded in the first and third columns, respectively. The second and fourth columns recorded the database or literature in PubMed that verified the corresponding miRNAs associated with Lung Neoplasms.*

The superior prediction performance of the model was mainly due to the following aspects. Firstly, the database on which SNMFMDA was based was reliable, and by introducing disease similarity information the model could be used to predict miRNAs potentially associated with new diseases. Secondly, we used SymNMF to interpolate the integrated similarity, while many of the previous methods used the integrated similarity directly. Finally, we introduced spectral decomposition to speed up the calculation of the Kronecker product in our model. Certainly, there are still some limitations to be resolved in the future. For example, SNMFMDA might not score miRNAs using the same criteria, especially for those with more known related diseases. Although the prediction accuracy of SNMFMDA is higher than that of many previous calculation methods, its prediction performance would improve further if the biological database on which our model is based were improved. The calculation of disease similarity and miRNA similarity used in our model may not be optimal, and we expect to add more biological datasets in future calculations to improve the accuracy of the similarity calculations. Besides, SNMFMDA involved calculating the Kronecker product of two matrices. The Kronecker product of two matrices is obtained by multiplying each element of the first matrix by the whole second matrix, so the product is much larger than either of the two matrices. Therefore, calculating the Kronecker product often led to memory problems. In addition, our model SNMFMDA did not consider miRNA-protein associations and miRNA-cellular pathway associations, which may significantly affect the prediction performance of the model.
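The size problem of the Kronecker product can be made concrete with a small sketch. The code below is a generic illustration and not the spectral decomposition used in SNMFMDA: it shows the explicit product, and the standard identity (A ⊗ B) vec(X) = vec(A X Bᵀ) (with row-major flattening), which avoids forming the large matrix when only a matrix-vector product is needed. The assumption that the two factors have sizes 383 × 383 and 495 × 495 (disease and miRNA similarity matrices) is ours, made only to give a sense of scale.

```python
def kron(a, b):
    """Explicit Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kron_matvec(a, b, x):
    """Compute (a ⊗ b) · vec(x) without building a ⊗ b (vec = row-major flattening)."""
    result = matmul(matmul(a, x), transpose(b))
    return [v for row in result for v in row]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
X = [[1, 2], [3, 4]]

flat_x = [v for row in X for v in row]
explicit = [sum(k * v for k, v in zip(row, flat_x)) for row in kron(A, B)]
implicit = kron_matvec(A, B, X)

# Assumed factor sizes (383 diseases, 495 miRNAs): number of entries in the product.
n_diseases, n_mirnas = 383, 495
kron_entries = (n_diseases * n_mirnas) ** 2
```

Under the assumed sizes the explicit product would have about 3.6 × 10^10 entries, which is why approaches that never materialize it are attractive.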


# AUTHOR CONTRIBUTIONS

YZ implemented the experiments, analyzed the result, and wrote the paper. XC conceived the project, developed the prediction method, designed the experiments, and revised the paper. JY analyzed the result and revised the paper. All authors read and approved the final manuscript.

# FUNDING

XC was supported by National Natural Science Foundation of China under Grant No. 61772531.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00324/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhao, Chen and Yin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification and Analysis of Blood Gene Expression Signature for Osteoarthritis With Advanced Feature Selection Methods

Jing Li <sup>1</sup> \*, Chun-Na Lan<sup>1</sup> , Ying Kong<sup>1</sup> , Song-Shan Feng<sup>2</sup> and Tao Huang<sup>3</sup> \*

*<sup>1</sup> Department of Rehabilitation, The Second Xiangya Hospital, Central South University, Changsha, China, <sup>2</sup> Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China, <sup>3</sup> Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China*

#### Edited by:

*Quan Zou, Tianjin University, China*

#### Reviewed by:

*Jiangning Song, Monash University, Australia Jianbo Pan, Johns Hopkins Medicine, United States*

#### \*Correspondence:

*Jing Li lijing2017@csu.edu.cn Tao Huang tohuangtao@126.com*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *03 May 2018* Accepted: *22 June 2018* Published: *30 August 2018*

#### Citation:

*Li J, Lan C-N, Kong Y, Feng S-S and Huang T (2018) Identification and Analysis of Blood Gene Expression Signature for Osteoarthritis With Advanced Feature Selection Methods. Front. Genet. 9:246. doi: 10.3389/fgene.2018.00246*

Osteoarthritis (OA) is a complex disease that affects articular joints and may cause disability. The incidence of OA is extremely high. Most elderly people have the symptoms of osteoarthritis. The physiotherapy of OA is time consuming, and the chances of full recovery from OA are minimal. The most effective way of fighting OA is early diagnosis and early intervention. Liquid biopsy has become a popular noninvasive test. To find the blood gene expression signature for OA, we reanalyzed the publicly available blood gene expression profiles of 106 patients with OA and 33 control samples using an automatic computational pipeline based on advanced feature selection methods. Finally, a compact 23-gene set was identified. On the basis of these 23 genes, we constructed a Support Vector Machine (SVM) classifier and evaluated it with leave-one-out cross-validation. Its sensitivity (Sn), specificity (Sp), accuracy (ACC), and Matthews correlation coefficient (MCC) were 0.991, 0.909, 0.971, and 0.920, respectively. The performance still needs to be validated in an independent large dataset, but the in-depth biological analysis of the 23 biomarkers showed great promise and suggested that the mRNA surveillance pathway and multicellular organism growth played important roles in OA. Our results shed light on OA diagnosis through liquid biopsy.

Keywords: osteoarthritis, blood, gene expression, signature, support vector machine, minimal redundancy maximal relevance, incremental feature selection

# INTRODUCTION

Osteoarthritis (OA) is a complex disease that affects articular joints and may cause disability (Appleton, 2017). In the USA, 14 million people have symptomatic knee osteoarthritis (KOA) (Vina and Kwoh, 2017). Approximately 10–20% of adults have OA (Bay-Jensen et al., 2018). Although OA is considered a disease primarily of the elderly, more than half of patients with OA are now under 65 years old, and more and more young people show the symptoms of OA. The physiotherapy of OA is time consuming, and the chances of full recovery from OA are minimal (Nelson, 2017). The most effective way of fighting OA is early diagnosis and early intervention. However, at the early stage, when OA is treatable, patients often ignore the symptoms and are reluctant to go to the doctor for consultation (Nelson, 2017). When OA becomes serious, it is too difficult to treat.

Blood is a vehicle for mRNAs from different tissues (Budd et al., 2017). It has been widely used for the early detection of various cancers (Zhang et al., 2017) and predictions of drug responses (Huang et al., 2008; Zhang et al., 2012). As a complex disease, the occurrence and development of OA involves changes to the mRNA (Steinberg et al., 2017). The blood flow under the subchondral bone (Aaron et al., 2018) may carry the signal of OA (Fotouhi et al., 2018). It can be detected when the mRNA level changes in blood (Budd et al., 2017). If so, then the detection of OA will be much easier and more accurate. In fact, there have been several studies of blood biomarkers for OA (Ramos et al., 2014; Feng et al., 2015; Ahmed et al., 2016; Bay-Jensen et al., 2018; Costa-Cavalcanti et al., 2018). For example, Ramos et al. demonstrated that the mRNA expression of apoptotic pathways was significantly different in the blood of patients with OA (Ramos et al., 2014). Bay-Jensen et al. reported the use of biochemical markers for OA, which measured the turnover of joint tissue or the inflammatory status (Bay-Jensen et al., 2018).

To quantify the cartilage turnover, several discovered biomarkers were used, such as PIIANP, CTX-II, ARGS, COMP, and C2C. In serum, PIIANP and CTX-II were found to be associated with OA progression by Osteoarthritis Initiative (OAI) Study of FNIH (Foundation for the National Institutes of Health; Kraus et al., 2017). ARGS was found to be associated with pain in anterior cruciate ligament injury patients (Wasilko et al., 2016). COMP was highly expressed in synovial fluid of patients with OA (Lorenzo et al., 2017). C2C was significantly different among patients with OA with no sign of cartilage damage, early signs of OA, and radiographic OA, and it was highly expressed in the patients with radiographic OA (Schaefer et al., 2017). In addition, there were biomarkers for synovial inflammation and fibrosis, such as C1M, C3M, and CRPM. They were positively correlated with elderly symptomatic OA (Martel-Pelletier et al., 2016).

Unfortunately, many of these biomarkers were for synovial fluid and most of them were only differentially expressed. Such qualitative biomarkers cannot be used in clinical settings directly, and for this reason, a blood biomarker-based quantitative classifier was the ideal model.

To build such a model, we reanalyzed a publicly available dataset from Ramos et al. (2014), which included the blood gene expression profiles of 106 patients with OA and 33 control samples, using advanced feature selection methods such as minimal redundancy maximal relevance (mRMR) and incremental feature selection (IFS) instead of a conventional statistical test. We identified 23 blood gene expression biomarkers. On the basis of these 23 genes, we constructed a Support Vector Machine (SVM) classifier and evaluated its performance with Leave-One-Out Cross Validation (LOOCV). The sensitivity (Sn), specificity (Sp), accuracy (ACC), and Matthews correlation coefficient (MCC) were 0.991, 0.909, 0.971, and 0.920, respectively. In addition, we performed in-depth biological analysis of the 23 biomarkers. They were involved in the mRNA surveillance pathway and multicellular organism growth. Not only was a quantitative classifier constructed, but the underlying mechanisms of OA occurrence and progression were also explored.

# MATERIALS AND METHODS

## The Blood Gene Expression Profiles of Osteoarthritis and Control Samples

We downloaded the blood gene expression profiles of 106 OA and 33 control samples from the Gene Expression Omnibus (GEO) database under accession number GSE48556 (Ramos et al., 2014). The gene expression levels were measured using the Illumina HumanHT-12 V3.0 expression beadchip. There were 48,802 probes corresponding to 25,159 genes. The probes representing the same gene were averaged, and the gene expression profiles of OA and control samples were quantile-normalized.
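The two preprocessing steps, averaging probes that map to the same gene and quantile normalization across samples, can be sketched as below. This is an illustration under stated assumptions rather than the pipeline actually used: the order of the two steps, the probe and gene names, and the simple tie handling (ties are ranked in input order instead of being averaged) are all ours.

```python
from collections import defaultdict

def average_probes(probe_values, probe_to_gene):
    """Average probes of the same gene.

    probe_values: {probe_id: [expression per sample]}
    probe_to_gene: {probe_id: gene symbol}
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        grouped[probe_to_gene[probe]].append(values)
    return {gene: [sum(col) / len(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}

def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix so all samples share one distribution."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [list(col) for col in zip(*matrix)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the r-th smallest value across samples.
    rank_means = [sum(sc[r] for sc in sorted_cols) / n_samples for r in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = rank_means[rank]
    return result

# Hypothetical probes and genes, for illustration only.
probes = {"p1": [1.0, 3.0], "p2": [3.0, 5.0], "p3": [2.0, 2.0]}
mapping = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
gene_level = average_probes(probes, mapping)

normalized = quantile_normalize([[2.0, 4.0], [1.0, 6.0], [3.0, 5.0]])
```

After normalization every sample (column) contains the same set of values, only assigned to genes according to each sample's own ranking.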

Unlike Ramos's study (Ramos et al., 2014), which identified 694 genes with adjusted p-value smaller than 0.05 using linear regression analysis and then narrowed down the genes to a short list using functional annotation, we aimed to develop an automatic analysis pipeline that minimized human intervention and avoided the hand-picking during biomarker selection. Despite the great performance achieved by Ramos et al. (2014), we believe that there are other actionable biomarkers which may function in a different way and we are trying to find them with advanced feature selection methods.

#### Mutual Information-Based Feature Ranking

Identifying the phenotype-associated features is one of the basic problems in bioinformatics, and for different problems, there are different solutions (Huang et al., 2008; Cai et al., 2010; Zhang et al., 2012, 2015, 2016, 2017; Li et al., 2014; Chen et al., 2018a; Wang et al., 2018). For identifying differentially expressed genes (DEGs), the most widely used methods are the t-test, significance analysis of microarrays (SAM; Tusher et al., 2001), and linear regression as performed by Ramos et al. (2014). However, such statistics-based methods usually identify more DEGs than required, and the redundancy among DEGs is extremely high because many genes have very similar expression patterns.

Unlike DEG, we needed a smaller number of signature genes that can be applied in clinical settings. Therefore, we adopted a mutual information-based method, i.e., mRMR (Peng et al., 2005), which has been widely used in feature ranking (Niu et al., 2013; Zhao et al., 2013; Zhou et al., 2015; Zhang et al., 2016; Li and Huang, 2017; Liu et al., 2017). It considers both the relevance between features and sample labels and the redundancy among features and has been proven to be an effective feature selection method, especially for gene expression analysis (Qin et al., 2012; Zhang et al., 2014b, 2017, 2018; Zhang Y. et al., 2014; Li et al., 2015; Zhou et al., 2015; Wang et al., 2016; Song et al., 2017; Chen et al., 2018b). The method works as follows: let $\Omega$ denote all the 25,159 genes, $\Omega\_s$ denote the selected gene set that includes m genes, and $\Omega\_g$ denote the n genes that will be evaluated, one of which will be selected.

First, the relevance of gene g from $\Omega\_g$ with sample labels l was measured using mutual information (I) (Sun et al., 2012; Huang and Cai, 2013):

$$I(\mathbf{g}, l) \tag{1}$$

As the mutual information can only be calculated between categorical variables, the expression levels of each gene were discretized with the thresholds of mean minus standard deviation and mean plus standard deviation.
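The discretization and the mutual information in Equation (1) can be sketched as follows. This is a minimal illustration: the three-level coding (-1, 0, 1), the use of the population standard deviation, and the base-2 logarithm are assumptions, since the text specifies only the two thresholds.

```python
import math
from collections import Counter

def discretize(values):
    """Map expression values to -1, 0, or 1 using mean - SD and mean + SD as thresholds."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population SD (assumption)
    return [-1 if v < mean - std else 1 if v > mean + std else 0 for v in values]

def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length sequences of discrete labels."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum of p(a, b) * log2(p(a, b) / (p(a) * p(b))) over observed pairs.
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

levels = discretize([1.0, 2.0, 3.0, 4.0, 10.0])
mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A gene whose discretized profile exactly tracks a balanced two-class label carries one bit of information about it, while an unrelated profile carries none.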

Then, the redundancy of gene g with the selected gene set $\Omega\_s$ was quantified:

$$\frac{1}{m} \left( \sum\_{\mathbf{g}\_i \in \Omega\_s} I(\mathbf{g}, \mathbf{g}\_i) \right) \tag{2}$$

As we wanted to maximize the relevance and minimize the redundancy, the optimization goal can be characterized as follows, and the best gene from $\Omega\_g$ will be selected:

$$\max\_{\mathbf{g}\_j \in \Omega\_{\mathbf{g}}} \left[ I\left(\mathbf{g}\_j, l\right) - \frac{1}{m} \left( \sum\_{\mathbf{g}\_i \in \Omega\_{\mathbf{s}}} I(\mathbf{g}\_j, \mathbf{g}\_i) \right) \right] \left( \mathbf{j} = 1, 2, \dots, n \right) \tag{3}$$

After N rounds of optimization, a ranked gene list $S = g'\_1, g'\_2, \dots, g'\_r, \dots, g'\_N$ was obtained. The top ranked genes had strong relevance to OA but little redundancy among each other. In the next step, we further optimized the top 300 mRMR genes and obtained the final OA biomarker.
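The greedy selection defined by Equations (1)–(3) can be sketched as below. This is an illustrative implementation that assumes already discretized features; the gene names and toy data are hypothetical, and ties are broken alphabetically, which is our choice rather than part of the method.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two sequences of discrete labels."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_rank(features, labels, n_select):
    """Greedy mRMR: each round picks the gene maximizing relevance minus mean redundancy.

    features: {gene: list of discretized values}; labels: list of class labels.
    """
    relevance = {g: mutual_information(v, labels) for g, v in features.items()}
    redundancy_sum = {g: 0.0 for g in features}
    selected, remaining = [], set(features)

    def score(g):
        mean_redundancy = redundancy_sum[g] / len(selected) if selected else 0.0
        return relevance[g] - mean_redundancy

    while remaining and len(selected) < n_select:
        best = max(sorted(remaining), key=score)  # alphabetical tie-break
        selected.append(best)
        remaining.remove(best)
        # Update the running redundancy of every remaining gene with the new pick.
        for g in remaining:
            redundancy_sum[g] += mutual_information(features[g], features[best])
    return selected

# Toy data: the label is determined by two independent bits a and b.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [2 * ai + bi for ai, bi in zip(a, b)]
toy_features = {"g_a": a, "g_a_copy": list(a), "g_b": b}
ranking = mrmr_rank(toy_features, labels, n_select=3)
```

In the toy data the duplicate gene is as relevant as the original, yet it is ranked last because it is fully redundant with a gene that has already been selected.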

# Osteoarthritis Biomarker Optimization

Although the mRMR method can rank genes effectively, it is still unknown how many genes should be finally selected as the OA biomarker. Therefore, we applied a greedy method called incremental feature selection (IFS) (Jiang et al., 2013; Li et al., 2014; Shu et al., 2014; Zhang N. et al., 2014a; Huang et al., 2015; Zhang et al., 2015; Chen et al., 2018a) to optimize the number of signature genes. In this method, too few genes may miss the important information and too many genes may introduce noise.

TABLE 2 | The confusion matrix of the predicted and actual sample classes.

During the IFS procedure, different numbers of genes were tried and their performances were evaluated. As there were too many possible combinations and mRMR had already ranked the genes meaningfully, the mRMR genes were tested sequentially, i.e., in round r, the genes $g'\_1, g'\_2, \dots, g'\_r$ were tested. For each round, an SVM classifier was constructed based on the selected genes and its performance was evaluated through LOOCV. We used the R function svm from package e1071 with default parameters and a radial kernel to build the SVM classifier.
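The IFS loop with LOOCV can be sketched as follows. To keep the example self-contained, a nearest-centroid rule stands in for the SVM; it is not the classifier used in the study, and the toy data and function names are hypothetical. The expression matrix is assumed to have samples as rows and genes as columns already ordered by mRMR rank.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (NOT the SVM of the study): label of the closest class centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_mcc(x, y, predict):
    """Leave one sample out, train on the rest, and accumulate the confusion matrix."""
    tp = tn = fp = fn = 0
    for i in range(len(x)):
        pred = predict(x[:i] + x[i + 1:], y[:i] + y[i + 1:], x[i])
        if y[i] == 1:
            tp, fn = (tp + 1, fn) if pred == 1 else (tp, fn + 1)
        else:
            tn, fp = (tn + 1, fp) if pred == 0 else (tn, fp + 1)
    return mcc(tp, tn, fp, fn)

def incremental_feature_selection(x_ranked, y, predict, max_genes):
    """Evaluate the top r ranked genes for r = 1..max_genes and keep the best MCC."""
    curve = [(r, loocv_mcc([row[:r] for row in x_ranked], y, predict))
             for r in range(1, max_genes + 1)]
    best_r, best_mcc = max(curve, key=lambda t: t[1])  # smallest r wins ties
    return best_r, best_mcc, curve

# Toy data: 3 cases (label 1) and 3 controls (label 0), 3 ranked genes.
toy_x = [[4, 3, 0], [5, 3, 10], [2, 3, 0], [1, 0, 10], [2, 0, 0], [3, 0, 10]]
toy_y = [1, 1, 1, 0, 0, 0]
best_r, best_mcc, curve = incremental_feature_selection(
    toy_x, toy_y, nearest_centroid_predict, max_genes=3)
```

In the toy data the second gene improves the classifier and the third, a noisy one, degrades it again, which is the pattern the IFS curve is meant to expose.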

To have a complete measurement of the prediction performance, four statistics, which were the sensitivity (Sn), specificity (Sp), accuracy (ACC), and Matthew's correlation coefficient (MCC), were calculated:

$$S\_n = \frac{TP}{TP + FN} \tag{4}$$

$$S\_p = \frac{TN}{TN + FP} \tag{5}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{7}$$

In Equations (4–7), TP, TN, FP, and FN were the number of true OA, true control, false OA, and false control samples, respectively.
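As a worked example of Equations (4)–(7), the sketch below computes the four statistics from a confusion matrix. The counts used here (TP = 105, FN = 1, TN = 30, FP = 3) are not taken from the paper's table; they are inferred by us as the counts for 106 OA and 33 control samples that are consistent with the reported rates.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return (Sn, Sp, ACC, MCC) as defined in Equations (4)-(7)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

# Inferred counts (assumption): 106 OA samples, 33 control samples.
sn, sp, acc, mcc = classification_metrics(tp=105, tn=30, fp=3, fn=1)
print(round(sn, 3), round(sp, 3), round(acc, 3), round(mcc, 3))
# prints 0.991 0.909 0.971 0.92
```

With these counts the accuracy is dominated by the larger OA class, whereas the MCC also reflects the three misclassified controls, which is why it is the more informative statistic for unbalanced data.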

On the basis of IFS results, we can determine how many genes should be chosen finally as the OA biomarker to achieve the best performance. As the numbers of OA samples and control samples were not balanced, the MCC was used as the main measurement for classification performance.

#### RESULTS

### The Osteoarthritis-Associated Genes Selected and Ranked Based on the mRMR Method

To identify the OA-associated genes, we used the mRMR method that can select and rank genes based on their relevance to OA and their redundancy with other genes. The top 300 most discriminative genes for OA were selected and ranked using the mRMR method. These 300 mRMR genes will be further optimized using the IFS method.

FIGURE 3 | The Venn diagram of our 23 genes and the 27 genes from Ramos et al. (2014). There were four overlapped genes, ADRB2, H3F3B, PELO, and ZNF20, between the 23 osteoarthritis biomarker genes we identified and the 27 genes from Ramos et al. (2014). To evaluate the significance of overlap, we calculated the hypergeometric test *p-*value and odds ratio, which were 9.18e-09 and 229.87, respectively. The overlap was very significant.

# The Osteoarthritis Biomarker Optimization Based on the IFS Method

As a ranked gene list, the top 300 mRMR genes included the candidate OA biomarker genes. However, we still did not know how many genes should be finally selected. To optimize OA biomarker selection, we tried different numbers of top genes and calculated their prediction performance. On the basis of these performances, we plotted an IFS curve, as shown in **Figure 1**, in which the x-axis was the number of genes and the y-axis was the LOOCV MCC of the SVM classifier. The MCC was highest, at 0.920, when the top 23 mRMR genes were used. The sensitivity, specificity, and accuracy of the 23-gene classifier were 0.991, 0.909, and 0.971, respectively. The 23 genes are listed in **Table 1**. The confusion matrix of the predicted and actual sample classes is given in **Table 2**.

To investigate the associations of the 23 genes with OA, we plotted the heatmap of the 23 genes in OA and control samples, as shown in **Figure 2**. It can be seen that the OA and control samples had very different expression patterns. Generally speaking, APP, SERINC3, GNL3L, MLLT6, C17orf91, NUFIP2, TAOK1, H3F3B, and SNORD38A were highly expressed in control samples, whereas COG5, UBXD8, ZNF20, PELO, MTSS1, CEP250, CDC2L5, MFAP1, RNF34, UPF1, LRRC33, TNFSF14, ADRB2, and PVRIG were highly expressed in OA samples.

We compared our 23 genes with the 27 genes from Ramos et al. (2014) and plotted the Venn diagram, as shown in **Figure 3**. There were four overlapped genes: ADRB2, H3F3B, PELO, and ZNF20. We evaluated the significance of overlapping using the hypergeometric test. The p-value was 9.18e-09 and the odds ratio was 229.87. The overlap between our 23 genes and the 27 genes from Ramos et al. (2014) was very significant.
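The overlap test can be sketched as follows. The background size is not stated in this passage; the sketch assumes it is the 25,159 genes on the array, and under that assumption the results come out close to the reported p-value and odds ratio. The function names are ours.

```python
from math import comb

def hypergeom_upper_tail(overlap, size_a, size_b, background):
    """P(X >= overlap) when size_b genes are drawn from a background containing size_a hits."""
    total = comb(background, size_b)
    favorable = sum(comb(size_a, k) * comb(background - size_a, size_b - k)
                    for k in range(overlap, min(size_a, size_b) + 1))
    return favorable / total

def overlap_odds_ratio(overlap, size_a, size_b, background):
    """Odds ratio of the 2x2 table (in A / not in A) x (in B / not in B)."""
    only_a = size_a - overlap
    only_b = size_b - overlap
    neither = background - size_a - size_b + overlap
    return (overlap * neither) / (only_a * only_b)

BACKGROUND = 25159  # assumed: number of genes measured on the array
p_value = hypergeom_upper_tail(4, 23, 27, BACKGROUND)
odds = overlap_odds_ratio(4, 23, 27, BACKGROUND)
```

Because only about 0.02 shared genes would be expected by chance with lists of 23 and 27 genes, even an overlap of four genes is highly unlikely under the null hypothesis.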

## The Functional Analysis of the Optimal Osteoarthritis Biomarker

We did functional enrichment analysis of 23 OA biomarker genes using Metascape (Tripathi et al., 2015). The Gene Ontology (GO) results are shown in **Figure 4**. The enriched GO terms were GO:0032200: telomere organization, GO:1903829: positive regulation of cellular protein localization, GO:0010389: regulation of G2/M transition of mitotic cell cycle, and GO:0010951: negative regulation of endopeptidase activity.

There have been many studies about the relationship between telomere length and OA (Kuszel et al., 2015; Wiwanitkit, 2017). OA is a typical geriatric disease and the telomere length becomes shorter and shorter during aging. In patients with OA, the shortening of telomeres was accelerated (Kuszel et al., 2015). H3F3B, UPF1, and GNL3L were involved in GO:0032200: telomere organization.

The dysfunctional regulation of cellular protein localization in OA was reasonable. Osteoarthritis is a joint disease and the gap junctional communication is regulated by the extracellular signal pathway (Niger et al., 2009). APP, TNFSF14, CEP250, and GNL3L were involved in GO:1903829: positive regulation of cellular protein localization.

FIGURE 4 | The enriched GO terms of the 23 osteoarthritis biomarker genes. The 23 osteoarthritis biomarker genes were enriched onto GO terms, such as GO:0032200: telomere organization, GO:1903829: positive regulation of cellular protein localization, GO:0010389: regulation of G2/M transition of mitotic cell cycle, and GO:0010951: negative regulation of endopeptidase activity.

included APP, RNF34, TNFSF14, CEP250, and MLLT6, and the GNL3L cluster that included GNL3L, UPF1, TAOK1, ADRB2, and H3F3B.

There have been many theories about cell cycle and OA. Franke et al. found that during the pathogenesis of OA, advanced glycation end products (AGEs) influence osteoarthritic fibroblast-like synovial cells through inducing cell cycle arrest (Niger et al., 2009). de Andrés et al. discovered that the demethylation of an NF-κB enhancer can induce OA by regulating the cell cycle (de Andrés et al., 2016). APP, CEP250, and TAOK1 were involved in GO:0010389: regulation of the G2/M transition of the mitotic cell cycle.

It is known that several endogenous peptides have strong inflammatory effects in the joint and they are regulated by endopeptidase (Solan et al., 1998). Therefore, the genes from GO:0010951: negative regulation of endopeptidase activity, such as APP, TNFSF14, and RNF34, may play regulatory roles in OA.

# The Protein Interactions Between the Optimal Osteoarthritis Biomarkers

The protein–protein interactions (PPIs) among the optimal OA biomarkers were derived from the STRING database (https://string-db.org/) and are shown in **Figure 5**. STRING is a comprehensive database that integrates protein functional associations from multiple sources, such as experiments and literature (Szklarczyk et al., 2015). From **Figure 5**, we can see that APP, RNF34, TNFSF14, CEP250, and MLLT6 formed one cluster and GNL3L, UPF1, TAOK1, ADRB2, and H3F3B formed another.

The functions of the APP cluster (APP, RNF34, TNFSF14, CEP250, and MLLT6) were regulation of endopeptidase activity, cell cycle, and cellular protein localization, whereas the functions of the GNL3L cluster (GNL3L, UPF1, TAOK1, ADRB2, and H3F3B) were telomere organization and cellular protein localization. The common function linking the two clusters was cellular protein localization, which suggests that the secretion of proteins into the extracellular synovia is a key process in OA.

# DISCUSSION

As a common geriatric disease, OA has an extremely high incidence, especially in the elderly. As the chances of full recovery from late-stage OA are minimal, the most effective way of fighting OA is early diagnosis and early intervention. As a popular noninvasive test, liquid biopsy has shown great potential in cancer detection. To identify the blood gene expression signature for OA, we studied the blood gene expression profiles of 106 patients with OA and 33 control samples. With the mRMR and IFS methods, we identified 23 genes whose sensitivity, specificity, accuracy, and Matthews correlation coefficient were 0.991, 0.909, 0.971, and 0.920, respectively. The prediction performance was excellent. The biological function analysis of these 23 genes suggested that there were two pathways or PPI modules associated with OA through aging, cellular protein localization, and inflammation. These findings may be helpful for understanding OA.
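As a worked check of the reported performance measures, the sketch below computes sensitivity, specificity, accuracy, and the Matthews correlation coefficient from a confusion matrix. The counts used (105 true positives, 1 false negative, 30 true negatives, 3 false positives) are an assumption: they are inferred from the reported rates and the 106 OA and 33 control samples, and are not stated in the text.

```python
from math import sqrt

def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return sensitivity, specificity, accuracy, mcc

# ASSUMPTION: counts inferred from the reported rates, not taken from the paper.
metrics = classification_metrics(tp=105, fn=1, tn=30, fp=3)
print([round(m, 3) for m in metrics])
```

With these inferred counts, the four values round to the figures reported above, which suggests the reported metrics are mutually consistent.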

This work still has some limitations. Here, we investigated only gene expression levels. However, recent studies have suggested that genome-wide association study (GWAS) and epigenetic approaches are also effective for investigating OA mechanisms (Kerkhof et al., 2010; Panoutsopoulou et al., 2011; Rushton et al., 2014; Ramos and Meulenbelt, 2017; Simon and Jeffries, 2017). Integrating genetic and epigenetic data with gene expression may provide a more comprehensive view of OA. We surveyed the genes identified here from expression data and found that the variant rs3815148 of COG5 has been associated with OA in GWAS reports (Kerkhof et al., 2010; Panoutsopoulou et al., 2011). Rushton et al. reported that the methylation status of MLLT6, TNFSF14, TAOK1, and MTSS1 differed between OA hip subtypes and that LRRC33 was hypermethylated in OA hip relative to OA knee (Rushton et al., 2014). These results encourage us and others to carry out integrative studies of multiomics data in OA in the future.

# DATA AVAILABILITY STATEMENT

The datasets for this study can be found in the Gene Expression Omnibus [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48556].

# AUTHOR CONTRIBUTIONS

JL and TH conceived and designed the experiments; JL performed the experiments; JL, C-NL, YK, and S-SF analyzed the data; JL and TH wrote the paper.

## FUNDING

National Natural Science Foundation of China (31701151), Shanghai Sailing Program, and The Youth Innovation Promotion Association of Chinese Academy of Sciences (2016245).

# ACKNOWLEDGMENTS

We would like to thank Ramos et al. for sharing their data.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Li, Lan, Kong, Feng and Huang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification of miR-200c and miR141-Mediated lncRNA-mRNA Crosstalks in Muscle-Invasive Bladder Cancer Subtypes

Guojun Liu<sup>1</sup>† , Zihao Chen<sup>2</sup>† , Irina G. Danilova<sup>1</sup> , Mikhail A. Bolkov<sup>3</sup> , Irina A. Tuzankina<sup>3</sup> and Guoqing Liu<sup>4</sup> \*

<sup>1</sup> Institute of Natural Sciences and Mathematics, Ural Federal University, Yekaterinburg, Russia, <sup>2</sup> Department of Urology, Nanfang Hospital, Southern Medical University, Guangzhou, China, <sup>3</sup> Institute of Immunology and Physiology, Ural Branch of the Russian Academy of Sciences, Yekaterinburg, Russia, <sup>4</sup> School of Life Sciences and Technology, Inner Mongolia University of Science and Technology, Baotou, China

#### Edited by:

Quan Zou, Tianjin University, China

#### Reviewed by:

Chi Zhang, Indiana University Bloomington, United States Qing Li, University of Utah, United States Leyi Wei, Tianjin University, China

> \*Correspondence: Guoqing Liu gqliu1010@163.com

†These authors have contributed equally to this work and share the senior authorship

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 15 June 2018 Accepted: 10 September 2018 Published: 28 September 2018

#### Citation:

Liu G, Chen Z, Danilova IG, Bolkov MA, Tuzankina IA and Liu G (2018) Identification of miR-200c and miR141-Mediated lncRNA-mRNA Crosstalks in Muscle-Invasive Bladder Cancer Subtypes. Front. Genet. 9:422. doi: 10.3389/fgene.2018.00422

Basal and luminal subtypes of muscle-invasive bladder cancer (MIBC) have distinct molecular profiles and heterogeneous clinical behaviors. The interactions between mRNAs and lncRNAs, which might be regulated by miRNAs, have crucial roles in many cancers. However, the miRNA-dependent crosstalk between lncRNA and mRNA in specific MIBC subtypes remains unclear. In this study, we first classified MIBC into two conservative subtypes using miRNA, mRNA and lncRNA expression data derived from The Cancer Genome Atlas. Then we investigated subtype-related biological pathways and evaluated the subtype classification performance using Decision Trees, Random Forest and eXtreme Gradient Boosting (XGBoost). Finally, we explored potential miRNA-mediated lncRNA-mRNA crosstalks based on co-expression analysis. Our results show that: (1) the luminal subtype is primarily characterized by upregulation of metabolism-related pathways while the basal subtype is predominantly characterized by upregulation of epithelial-mesenchymal transition, metastasis, and immune system process-related pathways; (2) the XGBoost prediction model is consistently robust for classification of the molecular subtypes of MIBC across four datasets (area under the ROC curve > 0.9); (3) the expression levels of the molecules in the miR-200c and miR141-mediated lncRNA-mRNA crosstalks differ considerably between the two subtypes and have close relationships with the prognosis of MIBC. The miR-200c and miR-141-dependent mRNA-lncRNA crosstalks might be of great significance in tumorigenesis and tumor progression and may serve as novel prognostic predictors and classification markers of MIBC subtypes.

Keywords: muscle-invasive bladder cancer, subtypes, miR200c, miR-141, random forest, XGBoost

**Abbreviations:** ASW, Average of Silhouette Width; CC, cellular component; CC, consensus clustering; CDF, cumulative density function; CoC, cluster of cluster; CPCC, cophenetic correlation coefficient; DEFGs, differentially expressed feature genes; DEGs, differentially expressed genes; DTs, decision trees; EMT, epithelial-mesenchymal transition; FDR, False Discovery Rate; GO-BP, Gene Ontology Biological Process; GSEA, gene set enrichment analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes; K–M curve, Kaplan–Meier curve; MAD, median absolute deviation; MF, molecular function; MIBC, muscle-invasive bladder cancer; NES, normalized enrichment scores; PAC, proportion of ambiguous clustering; RF, random forest; ROC, receiver operating characteristics curve; XGBoost, eXtreme Gradient Boosting.

# INTRODUCTION

Urothelial bladder cancer (UBC) is one of the most common malignant tumors of the urinary system. UBC can generally be classified into non-muscle-invasive bladder cancer (NMIBC) and muscle-invasive bladder cancer (MIBC), according to whether the cancer cells are restricted locally in the lamina propria or invade the muscularis propria (Kamat et al., 2016). Many studies have reported that, according to shared RNA expression patterns or specific genomic alterations, MIBC can be further classified into two major subtypes, namely basal and luminal (Sjodahl et al., 2012; Iyer et al., 2013; Choi et al., 2014a,b; Damrauer et al., 2014; Network, 2014; Robertson et al., 2017), which are strikingly similar to the molecular subtypes first described in breast cancer (Perou et al., 2000; Prat et al., 2010). The basal subtype has drawn much attention because it is associated with a more aggressive phenotype and has a higher risk of distant metastasis than the luminal subtype (Choi et al., 2014a; Robertson et al., 2017). One reason for the difference is that the two subtypes develop through etiologically different pathways. Pathways that are involved in EMT and immune-associated pathways are upregulated in the basal subtype (Choi et al., 2014a). The molecular biomarkers and pathways involved in MIBC subtypes are the key to understanding subtype heterogeneity and identifying subtype-specific biomarkers that can be used to better manage MIBC patients.

MicroRNAs (miRNAs) represent one of the most exciting areas of modern medical and biological sciences as they can modulate an immense and complex regulatory network of gene expression in a broad spectrum of developmental and cellular processes, such as cell proliferation, metabolism, apoptosis, and viral infection (Johnston and Hobert, 2003; Hatfield et al., 2005; Zhao et al., 2005; Chen et al., 2006; Oliveira-Carvalho et al., 2012; Huang M. et al., 2016). miRNAs not only have a well-established inhibitory effect on gene expression but also promote gene expression in some cases (Sayed and Abdellatif, 2011; Song et al., 2014), and long non-coding RNAs (lncRNAs) exhibit facilitative or suppressive effects on the gene regulatory network during tumor development (Gontan et al., 2012; Sun et al., 2013). Furthermore, aberration or perturbation in miRNA-mediated mRNA and lncRNA expression levels has a significant correlation with serious clinical consequences, including diseases of diverse origins and malignancy (Salmena et al., 2011; Valinezhad Orang et al., 2014; Tay et al., 2014; Yuan et al., 2014; Zeng et al., 2016; Hu et al., 2017).

Regarding molecular drivers of cancer development, oncogenic mutations and downstream signaling pathways in the pre-cancerous or cancerous cell have been thought to play a crucial role in cancer formation and progression. In addition, recent studies have shown that metabolic reprogramming plays a much more important role in cancer development than previously thought (Cairns et al., 2011). It is possible that many of the genomic mutations detected in cancer provide a selective advantage for the cancer cell in the stressful tumor microenvironment by reprogramming cell metabolic processes (Zhang et al., 2015). Whatever the primary cause of cancer development, it is clear that both oncogenic signaling and reprogrammed metabolism involve numerous genes working in a concerted manner in a complex network. A gene regulatory network-based view can, therefore, provide deeper insight into cancer development.

The aim of this study is to identify subtype-specific dysregulated miRNA-mediated mRNA-lncRNA interactions and discover new critical subtype-related genes in MIBC.

# MATERIALS AND METHODS

#### Data Acquisition and Pre-processing

The MIBC RNA-Seq (FPKM) and clinical data were obtained from The Cancer Genome Atlas (TCGA) public data portal<sup>1</sup> , and miRNA-Seq (RPM) data was downloaded from the Broad GDAC Firehose<sup>2</sup> . The gene expression datasets of 403 tumor samples and 19 adjacent normal tissue samples contain 19181 mRNAs, 14376 lncRNAs, and 2588 mature miRNAs. The microarray datasets (GSE32894, GSE13507, and GSE31684) derived from Gene Expression Omnibus (GEO) were used to evaluate the performance of classifiers and verify the prognostic use of marker genes<sup>3</sup> .

# Clustering Analysis and Gene Set Enrichment Analysis

Consensus clustering (CC) (Monti et al., 2003) is a method that provides quantitative evidence for determining the number and membership of possible clusters within a dataset, such as RNA-Seq and microarray data. For CC analysis, the RPKM gene expression data was pre-processed to detect the most highly expressed and variable genes across samples. We removed the 25% of genes with the lowest arithmetic mean expression across samples. Then the MAD was used to select the most highly expressed and variable 3,000 mRNAs, 300 miRNAs, and 3,000 lncRNAs. CC available in the R package "ConsensusClusterPlus" was performed on the 3,000 mRNAs, 300 miRNAs, and 3,000 lncRNAs with 403 tumor samples, using the following key parameters: reps = 50, innerLinkage = complete, clusterAlg = hc, k = 6, and distance = pearson (Wilkerson and Hayes, 2010).
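The pre-processing step described above (dropping the lowest-expressed quarter of genes, then ranking the rest by MAD) can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' R code; the function names and the toy expression values are hypothetical.

```python
from statistics import mean, median

def mad(values):
    """Median absolute deviation of a list of expression values."""
    med = median(values)
    return median([abs(v - med) for v in values])

def select_variable_genes(expression, drop_fraction=0.25, top_n=3000):
    """Drop the lowest-mean genes, then keep the top_n genes by MAD.

    expression: dict mapping gene name -> list of values across samples.
    """
    by_mean = sorted(expression, key=lambda g: mean(expression[g]))
    n_drop = int(len(by_mean) * drop_fraction)
    kept = by_mean[n_drop:]
    ranked = sorted(kept, key=lambda g: mad(expression[g]), reverse=True)
    return ranked[:top_n]

# Toy data (hypothetical): four genes measured in four samples.
toy = {
    "g1": [0.1, 0.1, 0.2, 0.1],
    "g2": [5, 9, 1, 7],
    "g3": [4, 4, 5, 4],
    "g4": [8, 2, 9, 1],
}
selected = select_variable_genes(toy, drop_fraction=0.25, top_n=2)
print(selected)
```

In the study the same procedure would be applied separately to the mRNA, miRNA, and lncRNA matrices with `top_n` set to 3,000, 300, and 3,000.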

Cluster of cluster analysis is a method of integrating the primary clustering results into final cluster assignments. Each sample is represented as a binary vector, whose length is $\sum_{i=1}^{t} K_i$ (where $t$ is the number of datasets and $K_i$ is the number of clusters for dataset $i$), to implement subsequent clustering analysis. We first conducted the CoC analysis on the clustering results of mRNA, miRNA, and lncRNA dataset to obtain a binary dataset. The CC was once more performed on the binary dataset for generating final clusters. Number of final clusters (K) was estimated by commonly used methods including ASW, CPCC, Relative Change in Area under Cumulative density function [$\Delta(K)$], and PAC (Şenbabaoğlu et al., 2014).
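The binary representation used for the CoC analysis can be sketched as a concatenation of one-hot cluster indicators, one block per dataset. The sketch below is a minimal illustration; the function name and the toy labels are hypothetical, and cluster labels are assumed to be integers from 1 to $K_i$.

```python
def coc_binary_matrix(cluster_labels, n_clusters):
    """Encode per-dataset cluster labels as one binary vector per sample.

    cluster_labels: list of label lists, one per dataset (labels in 1..K_i).
    n_clusters:     list of K_i, the number of clusters in each dataset.
    Each returned row has length sum(n_clusters).
    """
    n_samples = len(cluster_labels[0])
    rows = []
    for s in range(n_samples):
        row = []
        for labels, k in zip(cluster_labels, n_clusters):
            indicator = [0] * k
            indicator[labels[s] - 1] = 1  # mark the cluster this sample fell into
            row.extend(indicator)
        rows.append(row)
    return rows

# Toy labels (hypothetical) for three samples in mRNA, miRNA, and lncRNA clusterings.
mrna, mirna, lncrna = [1, 2, 2], [3, 1, 2], [1, 1, 2]
matrix = coc_binary_matrix([mrna, mirna, lncrna], n_clusters=[2, 3, 2])
print(matrix[0])
```

The resulting binary matrix is what the second round of consensus clustering would take as input.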

<sup>1</sup>https://cancergenome.nih.gov

<sup>2</sup>https://gdac.broadinstitute.org

<sup>3</sup>https://www.ncbi.nlm.nih.gov

In order to explore subtype-associated biological processes, GSEA (Subramanian et al., 2005) was conducted using three gene set datasets (GO-BP, KEGG, and Hallmark gene sets). The following parameters were taken for GSEA: Number of permutations = 1000, Permutation type = gene\_set, Enrichment statistic = weighted, Metric for ranking genes = Signal2Noise.
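For illustration, the Signal2Noise ranking metric is the difference of the class means divided by the sum of the class standard deviations. The sketch below shows this basic form only; it uses the sample standard deviation as an assumption and omits the adjustment that the GSEA software applies to very small standard deviations. Gene names and values are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(group_a, group_b):
    """Basic signal-to-noise score: (mean_A - mean_B) / (sd_A + sd_B)."""
    return (mean(group_a) - mean(group_b)) / (stdev(group_a) + stdev(group_b))

# Toy expression values (hypothetical): class A vs class B, three samples each.
genes = {
    "A": ([10, 12, 11], [5, 6, 4]),
    "B": ([5, 5, 6], [5, 6, 5]),
    "C": ([1, 2, 3], [7, 8, 9]),
}
scores = {name: signal_to_noise(a, b) for name, (a, b) in genes.items()}

# Genes are ranked from most upregulated in class A to most upregulated in class B.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The sign of the score, and hence of the NES, depends on which subtype is treated as class A.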

# Differentially Expressed Genes and Machine Learning

The R package "Ballgown" was used to identify DEGs between tumor and normal samples (Frazee et al., 2015). The F-test implemented in "Ballgown" was used, and DEGs were defined as those with an FDR-adjusted p-value < 0.05 (Benjamini–Hochberg method) and |log2 fold change| > 0.57.

Three tree-based machine learning methods, namely DTs, RF, and eXtreme Gradient Boosting (XGBoost or XG), were applied to the 3,000 mRNAs, 300 miRNAs, and 3,000 lncRNAs for MIBC subtype classification. The area under the ROC curve (AUC) was used to estimate the performance of the classification methods. For each classification method, MIBC samples were randomly divided into training (60%) and testing (40%) sets. We ran RF with different values of the parameters ntree and mtry, and used 10-fold cross-validation to obtain the mean accuracy. XGBoost was implemented with the following parameters: gamma = 1, min\_child\_weight = 1, max\_depth = 14, nrounds = 2000. To optimize the parameter iter (number of iterations) of XGBoost, we obtained the 10-fold cross-validation performance for each iter and selected the iter value that gave the best performance. For DTs, the following parameters were taken: minCases = 20 and CF = 0.25. Moreover, the best-performing classifiers in this study were trained on the TCGA-derived RNA expression data and tested on GSE32894 to further evaluate their performance. All machine learning methods were implemented using R packages including "C5.0", "randomForest", and "XGBoost" (Liaw and Wiener, 2002; Chen and Guestrin, 2016; Kuhn et al., 2018).
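The evaluation protocol above (a random 60/40 split and AUC as the performance measure) can be sketched independently of any particular classifier. The code below is a minimal, library-free illustration: it computes AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function names, the seed, and the toy scores are assumptions for illustration and do not reproduce the authors' R workflow.

```python
import random

def split_train_test(n_samples, train_fraction=0.6, seed=0):
    """Randomly split sample indices into training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]

def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

train_idx, test_idx = split_train_test(403)  # 403 tumor samples, as in the text
# Toy predictions (hypothetical): 1 = basal, 0 = luminal.
auc = roc_auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
print(len(train_idx), len(test_idx), auc)
```

In practice the scores would be the predicted class probabilities from the DT, RF, or XGBoost model on the held-out 40% of samples.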

The overlap between the feature genes obtained by the best-performing classifiers and the DEGs was referred to as DEFGs. GO enrichment analysis available in the R package "clusterProfiler" was performed on DEFGs to identify their enriched GO terms (Yu et al., 2012). Multiple-testing correction was done using the method proposed by Benjamini and Hochberg, and an adjusted p-value < 0.05 was considered statistically significant.
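The Benjamini–Hochberg adjustment used in this section, combined with the fold-change threshold from the DEG definition, can be sketched as follows. This is an illustrative Python version with hypothetical function names and toy values, not the "Ballgown" or "clusterProfiler" implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def call_degs(genes, pvalues, log2fc, alpha=0.05, min_abs_log2fc=0.57):
    """Keep genes passing both the FDR and the |log2 fold change| thresholds."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        g for g, q, fc in zip(genes, adjusted, log2fc)
        if q < alpha and abs(fc) > min_abs_log2fc
    ]

# Toy values (hypothetical).
genes = ["g1", "g2", "g3", "g4"]
pvals = [0.01, 0.04, 0.03, 0.005]
log2fc = [1.2, 0.3, -0.8, -0.2]
adj = benjamini_hochberg(pvals)
degs = call_degs(genes, pvals, log2fc)
print(degs)
```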

# Construction of a Subtype-Related mRNA-miRNA-lncRNA Network

Pairwise Pearson's correlation analysis was carried out on the DEFGs. The lncRNA-miRNA pairs, miRNA-mRNA pairs, and lncRNA-mRNA pairs with |r| ≥ 0.4 and p-value < 0.05 were considered to be co-expressed gene pairs. If both elements of a co-expressed lncRNA-mRNA pair are simultaneously co-expressed with a miRNA, the pair is defined as a miRNA-dependent lncRNA-mRNA co-expressed interaction. A miRNA-dependent lncRNA-mRNA network was established using Cytoscape software (version 3.5.1). miRWalk2.0 (Dweep et al., 2011) integrates six widely used databases (miRWalk, miRanda, miRDB, miRNAMap, RNA22, and Targetscan) and supplies the largest available collection of predicted and experimentally verified miRNA-target interactions. Our inferred co-expressed interactions, including mRNA-miRNA and lncRNA-miRNA interactions, were compared to those derived from miRWalk2.0. An mRNA is considered to be a true target of a miRNA if their interaction occurs in at least four databases, and an lncRNA is considered to be a true target of a miRNA if their interaction is supported by at least one database among miRWalk, miRanda, and Targetscan.
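The definition of a miRNA-dependent lncRNA-mRNA interaction can be sketched as a search for triplets in which all three pairwise correlations pass the threshold. The sketch below applies only the |r| ≥ 0.4 criterion; the p-value filter is omitted for brevity, which is a simplification relative to the procedure described above. All gene names and expression values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mirna_dependent_pairs(lncrnas, mrnas, mirnas, min_abs_r=0.4):
    """Return (miRNA, lncRNA, mRNA) triplets that are pairwise co-expressed.

    Each argument is a dict mapping a gene name to its expression list.
    NOTE: only the |r| threshold is applied; no significance test is done here.
    """
    def coexpressed(x, y):
        return abs(pearson_r(x, y)) >= min_abs_r

    triplets = []
    for l_name, l_expr in lncrnas.items():
        for m_name, m_expr in mrnas.items():
            if not coexpressed(l_expr, m_expr):
                continue
            for mi_name, mi_expr in mirnas.items():
                if coexpressed(mi_expr, l_expr) and coexpressed(mi_expr, m_expr):
                    triplets.append((mi_name, l_name, m_name))
    return triplets

# Toy expression profiles (hypothetical) across five samples.
lncrnas = {"lncA": [1, 2, 3, 4, 5]}
mrnas = {"mrnaX": [2, 4, 5, 8, 10], "mrnaY": [3, 1, 4, 1, 5]}
mirnas = {"mirZ": [5, 4, 3, 2, 1], "mirW": [3, 1, 4, 1, 5]}
triplets = mirna_dependent_pairs(lncrnas, mrnas, mirnas)
print(triplets)
```

The sign of each correlation can be kept alongside the triplet to distinguish positive from negative regulation.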

# Survival Analysis

We further assessed whether the genes in the inferred interactions are correlated with the overall survival of MIBC patients. Based on the mean expression level of each gene, patient samples were divided into high and low expression groups. We performed survival analysis with the R package "survival" (Therneau, 2015) using the Kaplan–Meier curve (K–M curve) method. A log-rank test was used to compare survival times between the two groups, and p < 0.05 was considered statistically significant.
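The grouping and the K–M estimate described above can be sketched in a few lines. This is a minimal illustration with hypothetical function names and toy data; assigning samples exactly at the mean to the low group is an assumption, and the log-rank test is not implemented here.

```python
def split_by_mean_expression(expression):
    """Return (high, low) sample indices relative to the mean expression."""
    cutoff = sum(expression) / len(expression)
    high = [i for i, v in enumerate(expression) if v > cutoff]
    low = [i for i, v in enumerate(expression) if v <= cutoff]  # ASSUMPTION: ties go to low
    return high, low

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time per patient.
    events: 1 if death was observed, 0 if the patient was censored.
    Returns a list of (time, survival probability) at each observed event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for tt, _ in data if tt == t)
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t  # deaths and censored patients both leave the risk set
        i += n_at_t
    return curve

# Toy data (hypothetical).
high, low = split_by_mean_expression([1, 5, 3, 7])
curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[1, 1, 0, 0, 1])
print(high, low, curve)
```

One curve would be computed per expression group and the two curves then compared with a log-rank test.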

# RESULTS

# Clustering Analysis and GSEA

We first performed the CC on the mRNA, miRNA, and lncRNA expression datasets to obtain the clustering results. By applying the CoC analysis to the clustering outcomes of CC, a binary dataset was obtained, which is referred to as the CoC dataset. The CC was once again performed on the CoC dataset for different values of K, and the ASW, CPCC, $\Delta(K)$, and PAC were used to evaluate the optimal K (**Supplementary Figure S1**). For the CoC dataset, ASW suggests an optimal K of 6, whereas CPCC, $\Delta(K)$, and PAC indicate an optimal K of 2. Given that K = 2 is supported by three of the four criteria, we chose K = 2 as the solution, dividing MIBC samples into two subtypes, namely subtype-1 and subtype-2. The hierarchically clustered heatmap of K = 2 for the CoC dataset is shown in **Figure 1A**. Survival curves for the two subtypes were plotted using the K–M method. The 5-year overall survival rate is 55% for subtype-1 and 30% for subtype-2, indicating that they differ considerably in clinical prognosis (**Figure 1B**, p < 0.01). The heatmap depicting basal biomarkers, luminal biomarkers, and clinical indicators for the two subtypes is shown in **Figure 1C**. Subtype-1 is characterized by high expression of luminal markers such as CYP2J2, ERBB2, and KRT18, while subtype-2 is characterized by high expression of basal markers such as CD44, CDH3, and KRT1. Pearson's chi-squared test was used to compare clinical indicators between the two subtypes. The histology, stage, grade, and status are significantly different between the two subtypes, and gender shows a borderline difference (**Supplementary Table S1**). Subtype-1 and subtype-2 resemble the luminal and basal subtypes, respectively, in terms of K–M curves, biomarkers, and clinical indicators, and were therefore redefined as the luminal and basal subtypes (Choi et al., 2014a).

Gene set enrichment analysis was done for the basal and luminal subtypes, and the results are shown in **Tables 1** and **2**.

Upregulated pathways in the luminal subtype are mainly involved in metabolism (e.g., oxidative phosphorylation, cytochrome P450, and fatty acid metabolism) (**Table 1**). In contrast, upregulated pathways in the basal subtype are principally related to immune system process (e.g., extracellular structure organization, allograft rejection, mTORC1 signaling, and TNF-α signaling via NF-κB), metastasis, and EMT (**Table 2**).

# Differentially Expressed Genes and Machine Learning

The DEGs that could distinguish tumor from normal samples were analyzed and visualized as volcano plots (**Supplementary Figures S2A–C**). In total, 208 miRNAs (148 upregulated and 60 downregulated), 2488 lncRNAs (1402 upregulated and 1086 down-regulated), and 4167 mRNAs (2314 upregulated and 1853 downregulated) are differentially expressed.

We applied DTs, RF, and XGBoost for the basal and luminal subtype classification based on the mRNA, miRNA, and lncRNA expression datasets, and AUC was used to evaluate their performance. As shown in **Figure 2A**, XGBoost outperforms RF and DTs, having AUC values of 98.6, 94.5, and 98.7%, respectively, in mRNA, miRNA and lncRNA-based classification. Details regarding the 10-fold cross-validation procedure can be found in **Supplementary Figure S3**. DTs were excluded from the subsequent comparison, as they were on average significantly inferior to RF and XG. By using the CC method, the GSE32894 dataset containing 28 biomarkers and 190 samples was grouped into two subtypes in preparation for the classification task. The heatmap plots and the K–M curves for the two subtypes are shown in **Supplementary Figure S4**. We trained the best-performing classifiers (RF and XG) on the mRNA dataset derived from TCGA and tested them on the GSE32894 dataset. The results demonstrated that XGBoost performs better than RF (**Figure 2A4**).

The intersection between DEGs and feature genes obtained by RF and XG was defined as DEFGs, which include 57 lncRNAs, 120 miRNAs, and 278 mRNAs. The UpSet plot and heatmap plots for DEFGs are shown in **Figure 2B** and **Supplementary Figures S2D–F**. The genetic and clinical information of DEFGs is visualized in **Figure 2C**. GO enrichment analysis indicated that differentially expressed feature mRNAs are enriched in adherens junction, cell-substrate junction, cell-cell junction, cell-substrate adherens junction, and focal adhesion (**Figure 2D**). These GO terms have been found to play roles in tumorigenesis and tumor progression by regulating T-cell signaling, innate immunity, TGF-β signaling, and Wnt signaling through post-translational modification (Kikuchi et al., 2006; Lönn, 2010; Liu et al., 2016; Cho et al., 2018; Kuwabara et al., 2018).

TABLE 1 | Top-ranked terms of GO-BP, KEGG and Hallmark gene sets for the luminal subtype.

NES, normalized enrichment score; GO-BP, Gene Ontology Biological Process; KEGG, Kyoto Encyclopedia of Genes and Genomes. Size is the number of genes in the gene set. A positive NES means that genes over-represented in the gene set are upregulated in luminal subtypes.

TABLE 2 | Top-ranked categories of GO-BP, KEGG and Hallmark gene sets for the basal subtype.

All abbreviations are the same as in Table 1. A negative NES value indicates that genes over-represented in the gene set are upregulated in the basal subtype.

# Construction of Subtype-Related mRNA-miRNA-lncRNA Network

A miRNA-dependent mRNA-lncRNA co-expression network was constructed, which consists of 90 mRNAs, 22 miRNAs, and 14 lncRNAs (**Figure 3A**). The miRNA-dependent mRNA-lncRNA crosstalks verified in the miRWalk database contain four miRNA-mediated mRNA-lncRNA interactions (**Figure 3B**). Specifically, two co-expressed lncRNA-mRNA pairs, AC010326.3–GATA3 and AC073335.2–GATA3, are positively regulated by miR-141-3p; the lncRNA-mRNA pairs MIR100HG–CLIC4 and MIR100HG–PALLD are negatively regulated by miR-200c-3p and miR-141-5p, respectively. All nine genes in the network differ in their expression between the two subtypes (**Figure 3C**). Compared with the luminal subtype, the basal subtype is characterized by a lower expression level of six genes (miR-200c-3p, miR-141-3p, miR-141-5p, GATA3, AC010326.3, and AC073335.2) and a higher expression level of the other three genes (MIR100HG, PALLD, and CLIC4), suggesting that all nine genes can be used as potential markers for the two MIBC subtypes. In addition, GO analysis showed that the mRNAs in the network (CLIC4, PALLD, and GATA3) are related to the cytoskeleton.

# Survival Analysis of Crosstalk-Involved Genes

The association between the expression levels of crosstalk-involved genes and MIBC prognosis was analyzed by the K–M method. Strikingly, the results revealed that all of them are closely related to the prognosis of MIBC. Specifically, a higher expression level of miR-141-5p, miR-141-3p, AC010326.3, AC073335.2, miR-200c-3p, and GATA3 predicts better prognosis, indicating that they may function as tumor suppressors (**Figures 4B–F,H**); in contrast, a higher expression level of MIR100HG, PALLD, and CLIC4 is associated with worse prognosis, suggesting that they may play an oncogenic role (**Figures 4A,G,I**). In addition, the association between MIBC prognosis and the expression levels of crosstalk-related mRNAs (CLIC4, PALLD, and GATA3) was validated in two independent microarray datasets (GSE13507 and GSE31684), again supporting their prognostic value in MIBC (**Supplementary Figure S5**).

# DISCUSSION

In this study, we have investigated miRNA-dependent mRNA-lncRNA interactions in MIBC basal and luminal subtypes using bioinformatics approaches. On the basis of MIBC mRNA, miRNA, and lncRNA expression datasets obtained from TCGA, 403 MIBC samples were reliably classified into two intrinsic molecular types, which resemble the basal and luminal subtypes identified previously (Choi et al., 2014a). A number of subtype-related pathways were identified through GSEA. Moreover, we compared subtype classification performance among tree-based machine learning algorithms and found that XGBoost outperforms the other classifiers. Additionally, we implemented a gene co-expression analysis on DEFGs and successfully identified subtype-specific mRNA-lncRNA crosstalks, which differ considerably between basal and luminal subtypes and have close relationships with the prognosis of MIBC.

Subtype-related pathways presented in this study (**Tables 1**, **2**) are largely consistent with those identified previously (Choi et al., 2014a; Hurst and Knowles, 2014; McConkey et al., 2016; Ochoa et al., 2016; Hau et al., 2017; Seiler et al., 2017; Baker et al., 2018). In general, pathways that are involved in EMT, metastasis, and immune system process are upregulated in the basal subtype, whereas metabolism-related pathways are upregulated in the luminal subtype. The pathways enriched in the basal and luminal subtypes provide a biological explanation for their distinctly different clinical and pathological behaviors. However, the mechanisms by which some other pathways shown in our results, such as valine, leucine, and isoleucine degradation, autoimmune thyroid disease, hematopoietic cell lineage, and viral myocarditis, play a role in MIBC subtypes deserve further investigation.

Machine learning methods have been broadly applied in many areas of biology, such as gene family classification, hepatotoxicity prediction, RNA methylation prediction, and cancer prediction and classification (Zou et al., 2014; Kourou et al., 2015; Liao et al., 2017, 2018; Su et al., 2018; Wei et al., 2018a,b). As suggested in previous studies, RF is a powerful classifier for gene expression data (Wu et al., 2003; Lee et al., 2005; Ishwaran et al., 2010). XGBoost has performed strongly in many Kaggle competitions and has become a popular tool among data scientists (Ren et al., 2017; Torlay et al., 2017; Zhang and Zhan, 2017). Recently, XGBoost has been successfully applied to many classification problems, such as pan-cancer classification (Li et al., 2017) and prediction of RNA-protein interactions (Jain et al., 2018). However, to our knowledge, no comparison between RF and XGBoost in terms of cancer classification has been made previously. In this study, we compared the performance of DTs, RF, and XGBoost in classifying basal and luminal subtypes. Our results clearly demonstrated the advantage of XGBoost in gene expression data-based cancer classification (**Figure 2A**).

Previous studies investigated MIBC-associated miRNAs and their target genes without considering the genetic heterogeneity of MIBC subtypes (Martens-Uzunova et al., 2014; Huang T. et al., 2016; Xue et al., 2016; Zhong et al., 2016). It is therefore important to elucidate the subtype-related molecular pathways and identify novel biomarkers for MIBC subtypes. In this study, we systematically explored MIBC subtype-related gene co-expression networks. A total of three mRNAs (GATA3, CLIC4, and PALLD), three miRNAs (miR-200c-3p, miR-141-3p, and miR-141-5p), and three lncRNAs (AC010326.3, AC073335.2, and MIR100HG) were found in miRNA-mediated mRNA-lncRNA crosstalks. Their expression differs considerably between basal and luminal subtypes (**Figure 3**) and is significantly associated with the prognosis of MIBC (**Figure 4**). It was previously observed that miR-141-5p, miR-141-3p, miR-200c-3p, and GATA3 are the most important markers of the luminal subtype, which is consistent with our results (Robertson et al., 2017). In addition, previous studies found that the down-regulation of miR-200c and miR-141 is associated with elevated ZEB1 (Wiklund et al., 2011; Shan et al., 2013; Mahdavinezhad et al., 2015), and the down-regulation of miR-200c is also coupled with the down-regulation of BMI-1 and E2F3 (Liu et al., 2014), which play an important role in the invasion, migration, and EMT of bladder cancer.

Some other genes in the crosstalk have also been shown to be closely related to cancer. For example, AC073335.2, a highly expressed lncRNA in human glioblastoma, is involved in tumorigenesis by acting as a competing endogenous RNA of miR-940 (Shi et al., 2017). MIR100HG was previously reported to act as a regulator of hematopoiesis and oncogenes in many cancers (Emmrich et al., 2014; Nair, 2016; Shang et al., 2016; Wieczorek and Reszka, 2018; Zhang et al., 2018). In agreement with our findings, MIR100HG was reported to be down-regulated in MIBC and may serve as a significant biomarker for MIBC (Wang et al., 2016). As reported previously, GATA3 is a prognostic marker and inhibits cell migration and invasion in MIBC (Miyamoto et al., 2012; Choi et al., 2014a,b). In addition, GATA3 is differentially expressed between basal and luminal subtypes and can be used as a luminal-infiltrated marker (Robertson et al., 2017). CLIC4 has a complicated role in cancer. For instance, it functions as a tumor suppressor in lung adenocarcinomas (Okudela et al., 2014), whereas it promotes the metastasis and development of colorectal cancer (Deng et al., 2014; Peretti et al., 2015). Previous studies have established that the expression of CLIC4 in MIBC has a subtype-dependent pattern (Robertson et al., 2017). Moreover, the overexpression of CLIC4 in stroma increases cell migration and invasion and promotes epithelial-to-mesenchymal transition in multiple human cancers (Shukla et al., 2014). PALLD SNPs were reported to be a significant predictor of prostate cancer-specific mortality (Bao et al., 2011). Our findings are largely consistent with previously reported results, suggesting that crosstalk-implicated genes might be of great significance in MIBC pathogenesis and post-transcriptional gene regulation.

The combination of bioinformatics and several machine learning approaches in this study has achieved reliable results regarding MIBC subtype classification, subtype-associated pathways, and network-associated markers for MIBC subtypes. The subtype-related genes can be used not only for subtype classification but also as predictors of cancer prognosis. Our study could be improved in the following respects: (1) the crosstalks discovered through computational analyses need to be verified by biological experiments; (2) DEFGs were defined as the overlap between DEGs and the feature genes determined by XGBoost based on an approximate Information Gain ranking. This procedure may miss some highly correlated genes that are also biologically important.

# CONCLUSION

By conducting bioinformatics analyses, we identified two subtypes of MIBC and lncRNA-mRNA crosstalks mediated by miR-200c and miR-141, which are found to be significantly associated with prognosis, formation, and metastasis of bladder cancer. Our results should be informative for molecular subtype classification, prognosis and molecule-targeted therapy of bladder cancer.

# AUTHOR CONTRIBUTIONS

GjL and ZC performed the computations. MB and IT contributed to data preparation and analysis. GjL and GqL wrote the manuscript. GqL and ID conceived and designed the study.

# FUNDING

This work was supported by grants from the National Natural Science Foundation of China (31660322), Inner Mongolia Natural Science Foundation of China (2018LH03023), IIP UB RAS project (No. AAAA-A18-118020590108-7), Science Foundation for Excellent Youth Scholars of Inner Mongolia University of Science and Technology (2016YQL06), and Act 211 Government of the Russian Federation (No. 02.A03.21.0006).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00422/full#supplementary-material

FIGURE S1 | Evaluation output of ASW, CPCC, ΔK, and PAC. CoC datasets, represented by the green line, were used as the criterion to infer the optimal K. (A) ASW: the optimal K is the one with the highest ASW. (B) CPCC: the optimal K is the one whose CPCC is closest to one. (C) ΔK: the optimal K is the K value before the 'elbow' or the K where D(K) reaches its maximum. (D) PAC: the optimal K is the one with the lowest PAC.

FIGURE S2 | Volcano plots for DEGs and heatmap plots for DEFGs. (A–C) Volcano plots for differentially expressed 4167 mRNAs, 208 miRNAs, and 2488 lncRNAs between tumor and normal samples (adjusted p-value < 0.05 and |log2fold change| > 0.57). (D–F) Heatmap plots for 278 DEFmRNAs, 120 DEFmiRNAs, and 57 DEFlncRNAs. Basal, luminal, and normal samples are represented by the red, blue, and yellow bar, respectively.

FIGURE S3 | Parameter selection and performance of RF and XG on the mRNA, miRNA, and lncRNA datasets. (A) The x-axis represents the mtry value set for the RF classifier (1, 5, 10, 15, 20, 25). The y-axis represents the corresponding AUC. (B) The x-axis represents the ntree value set for RF (20, 400, 600, 800). The y-axis represents the corresponding OOB error rates. The colors correspond to mtry values. (C) The x-axis represents the number of folds set for RF. The y-axis represents the corresponding accuracy. The red color shows mean accuracy. (D) The x-axis represents the number of iterations (iter) set for XG (1, 400, 800, 1200, 1600, 2000, 2400) and the y-axis represents the corresponding accuracy.

FIGURE S4 | Heatmap and K–M plots for basal and luminal subtypes of GSE32894. (A) Heatmap depicting the expression profiles of basal (up) and luminal (down) biomarkers in GSE32894. The yellow and turquoise colors correspond to high and low relative expression, respectively. (B) K–M plot for the overall 5-year survival of basal and luminal subtypes (basal = 52, luminal = 62, p < 0.01).

FIGURE S5 | Kaplan-Meier plots for CLIC4, PALLD, and GATA3 in GSE13507 and GSE31684. (A–C) K–M survival curves showing overall survival according to high expression and low expression of CLIC4, PALLD, and GATA3 in GSE13507. (D–F) K–M survival curves showing overall survival according to high expression and low expression of CLIC4, PALLD, GATA3, and MIR100HG in GSE31684.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Liu, Chen, Danilova, Bolkov, Tuzankina and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Classifying Included and Excluded Exons in Exon Skipping Event Using Histone Modifications

Wei Chen<sup>1,2</sup> \*, Pengmian Feng<sup>3</sup> , Hui Ding<sup>4</sup> and Hao Lin<sup>4</sup> \*

*<sup>1</sup> Center for Genomics and Computational Biology, School of Life Science, North China University of Science and Technology, Tangshan, China, <sup>2</sup> Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China, <sup>3</sup> School of Public Health, North China University of Science and Technology, Tangshan, China, <sup>4</sup> Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics and Center for Information in Biomedicine, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China*

#### Edited by:

*Arun Kumar Sangaiah, VIT University, India*

#### Reviewed by:

*Hongbo Liu, University of Pennsylvania, United States Juexin Wang, University of Missouri, United States*

#### \*Correspondence:

*Wei Chen chenweiimu@gmail.com Hao Lin hlin@uestc.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *30 July 2018* Accepted: *12 September 2018* Published: *01 October 2018*

#### Citation:

*Chen W, Feng P, Ding H and Lin H (2018) Classifying Included and Excluded Exons in Exon Skipping Event Using Histone Modifications. Front. Genet. 9:433. doi: 10.3389/fgene.2018.00433*

Alternative splicing (AS) not only ensures the diversity of gene expression products but is also closely correlated with genetic diseases. Therefore, knowledge about the regulatory mechanisms of AS will provide useful clues for understanding its biological functions. In the current study, a random forest based method was developed to classify included and excluded exons in exon skipping events. In this method, the samples in the dataset were encoded using histone modification features that were optimized with the Maximum Relevance Maximum Distance (MRMD) feature selection technique. The proposed method obtained an accuracy of 72.91% in the 10-fold cross-validation test and outperformed existing methods. We also systematically analyzed the distribution of histone modifications between included and excluded exons and identified their preferences for each kind of exon, which might provide insights into the regulatory mechanisms of alternative splicing.

Keywords: alternative splicing, exon skipping, histone methylation, histone acetylation, random forest

# INTRODUCTION

RNA splicing is a process that eliminates introns from the precursor messenger RNA (pre-mRNA) so that exons can be linked together, which is an essential step of gene expression (Tilgner et al., 2012). In some cases, RNA splicing can create a range of unique proteins by orchestrating exons of the same pre-mRNA in different modes (Black, 2003). This phenomenon is known as alternative splicing. Among the numerous modes of alternative splicing, exon skipping is the most common one, in which a particular exon may be included in mRNAs under some conditions and omitted from the mRNA in others (Black, 2003).

It has been demonstrated that ∼95% of human genes undergo alternative splicing (Wang et al., 2008a). The multiple transcript variants of alternative splicing from a single gene often have different biological functions. However, our knowledge about the regulatory mechanism of alternative splicing is far from satisfactory.

In the past decades, a series of studies have been carried out to reveal the mechanisms of alternative splicing and have demonstrated that alternative splicing is regulated not only on the genome level but also on the epigenome level (Fox-Walsh and Fu, 2010). On the genome level, there are exonic and intronic splicing enhancers (ESEs and ISEs) and silencers (ESSs and ISSs), which are sequence motifs that can be recognized and bound by proteins (Wang and Burge, 2008; Barash et al., 2010). Although genome-level information can explain some splicing events, it is not sufficient to explain cell type-specific and stage-specific RNA splicing (Wang et al., 2008a).

Recent studies have demonstrated that histone modifications at the epigenome level also participate in mediating RNA splicing. For example, Luco et al. demonstrated that the alternative splicing of the FGFR2 (fibroblast growth factor receptor 2) gene is regulated by H3K36me3 (Luco et al., 2010). Zhou et al. found that the exon inclusion event of the human fibronectin (FN1) gene is mediated by H3K9me2 and H3K27me3 (Zhou et al., 2014). Shindo et al. found that the combinatorial effects of histone modifications also contribute to alternative splicing patterns among different cell lines (Shindo et al., 2013). These results suggest that finding the splicing code in histone modifications will provide new insights into RNA splicing regulatory mechanisms.

Accordingly, several computational methods have been proposed to classify included and excluded exons in exon skipping event based on histone modifications. In 2012, Enroth et al. developed a rule-based model and obtained an accuracy of 72% (Enroth et al., 2012). Later on, Chen et al. proposed a quadratic discriminant (QD) function based method and obtained an accuracy of 68.5% (Chen et al., 2014). More recently, by integrating features of genomic sequences and histone modifications, Xu et al. proposed a deep learning approach to predict splicing patterns (Xu et al., 2017). These works promote the research progress on revealing RNA splicing regulatory mechanisms. However, the performance of these methods remains unsatisfactory.

In the current study, we propose a new method to classify included and excluded exons in exon skipping events. The Maximum Relevance Maximum Distance (MRMD) feature selection technique was used to select the optimal histone modification features. Based on this histone modification information, a Random Forest (RF) classifier was trained to establish the prediction model. Results of the 10-fold cross-validation test demonstrate that the proposed method is reliable.

# MATERIALS AND METHODS

#### Dataset

The dataset used to train and test the predictive model was constructed by Enroth et al. (2012). Based on the gene expression data of CD4<sup>+</sup> T cells, Enroth et al. obtained 13,374 "included" and 11,587 "excluded" exons from exon skipping events in the human genome (Enroth et al., 2012). These exons are all 50 bp long with flanking introns longer than 360 bp, and none of them overlap one another. Enroth et al. further mapped the 20 kinds of histone acetylation (Barski et al., 2007) and 18 kinds of histone methylation (Wang et al., 2008b) to those exons and their closest 180 bp of flanking intronic regions. In this way, they obtained the histone modification signals and represented them by binary attributes, namely present (denoted by "1") and absent (denoted by "0"), over three regions (preceding, on, and succeeding the exon). After removing exons with no histone acetylation or methylation present, a benchmark dataset containing 12,692 "included" exons and 11,165 "excluded" exons with histone acetylation and methylation information was obtained.

#### Sample Formulation

By using the binary attributes of 20 kinds of histone acetylation and 18 kinds of histone methylation (**Supplementary Table S1**), the samples in the dataset can be represented by a 114 dimensional vector given by

$$\mathbf{R} = [\Phi_1, \Phi_2, \Phi_3, \dots, \Phi_i, \dots, \Phi_{114}]^T \tag{1}$$

where **T** is the transpose operator. The value of each vector component Φ<sub>i</sub> is either "1" (indicating the presence of the histone modification) or "0" (indicating its absence). Φ<sub>1</sub>, Φ<sub>2</sub>, and Φ<sub>3</sub> indicate the presence or absence of H3K27me3 on, preceding, and succeeding the exon, respectively; Φ<sub>4</sub>, Φ<sub>5</sub>, and Φ<sub>6</sub> give the same information for H3K4me2, and so forth. More details can be found in **Supplementary Table S1**. The samples encoded with histone modification information are available at https://github.com/chenweiimu/splicing.
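The encoding in Equation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode_exon`, the input format (a set of modification and region pairs), and the two-modification toy list are hypothetical, and the real ordering of the 38 modifications is the one given in Supplementary Table S1.

```python
# Hypothetical encoder for the binary vector of Equation 1 (not the authors' code).
# Region order follows the text: on, preceding, succeeding.
REGIONS = ("on", "preceding", "succeeding")

def encode_exon(modifications, present):
    """Return a flat 0/1 list with three entries per modification.

    modifications: ordered list of modification names (38 in the paper).
    present: set of (modification, region) pairs observed for one exon.
    """
    vector = []
    for mod in modifications:
        for region in REGIONS:
            vector.append(1 if (mod, region) in present else 0)
    return vector

# Toy usage with only two modifications (assumed data, for illustration).
mods = ["H3K27me3", "H3K4me2"]
signals = {("H3K27me3", "preceding"), ("H3K4me2", "on")}
vec = encode_exon(mods, signals)
print(vec)  # [0, 1, 0, 1, 0, 0]
```

With all 38 modifications in the list, the same function yields the 114 components (38 × 3) described above.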

#### Feature Selection

Representing the exons by a 114-dimensional vector may cause three problems (Feng et al., 2013): (1) inclusion of redundant or irrelevant information; (2) over-fitting, which reduces the generalization capacity of the model; and (3) increased computational time. To remove irrelevant features, a series of effective feature selection techniques have been proposed, such as analysis of variance (Lin and Ding, 2011; Lin et al., 2015), Minimal Redundancy Maximal Relevance (Peng et al., 2005; Chen et al., 2014), and Diffusion Maps (Coifman et al., 2005).

TABLE 1 | Performance metrics of different classifiers for classifying included and excluded exons.

TABLE 2 | A comparison of the current method with the existing method for classifying included and excluded exons.

*<sup>a</sup>(Chen et al., 2014).*

In this study, the Maximum Relevance Maximum Distance (MRMD) approach, which has been widely used in bioinformatics since it was proposed in 2016 (Zou et al., 2016), was employed to select the optimal features. As indicated by Zou et al. (2016), MRMD searches for a feature ranking metric that combines two aspects: the relevance between a feature subset and the target class, and the redundancy within the feature subset. More details about MRMD can be found in Zou et al. (2016).
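To make the two ingredients of such a ranking concrete, the following is a simplified sketch and not the MRMD implementation of Zou et al. (2016). It assumes that relevance is measured by the absolute Pearson correlation between a feature and the class label, and that redundancy is penalized through the mean Euclidean distance to the other features (scaled by the square root of the sample count so that both terms lie in [0, 1] for binary data). The equal weighting of the two terms and all function names are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return sxy / math.sqrt(sxx * syy)

def scaled_distance(x, y):
    """Euclidean distance divided by sqrt(n); in [0, 1] for 0/1 data."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def rank_features(columns, labels):
    """columns: list of feature columns. Returns feature indices, best first."""
    scores = []
    for i, col in enumerate(columns):
        relevance = abs(pearson(col, labels))
        others = [scaled_distance(col, c) for j, c in enumerate(columns) if j != i]
        distance = sum(others) / len(others) if others else 0.0
        scores.append((relevance + distance, i))
    # Sort by descending score; ties are broken by feature index.
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

# Toy data: features 0 and 1 track the label, feature 2 is unrelated to it.
labels = [1, 1, 0, 0]
columns = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]]
ranking = rank_features(columns, labels)
```

In this toy case the unrelated feature is ranked last, because its zero relevance outweighs its larger distance from the other columns.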

#### Random Forest

Random forest (RF) is an ensemble of a large number of decision trees (Breiman, 2001). Each tree in the ensemble is trained on a subset of training instances that are randomly selected from the given training set. Instead of using all the features, a random subset of features is selected, further randomizing the tree. The prediction results of RF are based on the ensemble of those decision trees and each tree gives a classification result. Finally, the RF classifier selects the prediction result that has the largest number of votes from the classification results. Owing to its advantages in dealing with high-dimensional data, RF has been used in various areas of bioinformatics (Ferrat et al., 2018; Manavalan et al., 2018; Wang et al., 2018).
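The three mechanisms described above (bootstrap sampling of instances, random feature subsets, and majority voting) can be illustrated with a deliberately small sketch. As a simplifying assumption, each "tree" here is a one-feature decision stump on binary inputs rather than a full decision tree, so this is a toy stand-in and not the RF implementation used in the study. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y, feature_ids):
    """Pick the binary feature and polarity with the fewest training errors."""
    best = None
    for f in feature_ids:
        for polarity in (0, 1):
            # polarity 1 predicts the feature value itself, polarity 0 its complement
            errors = sum((row[f] if polarity else 1 - row[f]) != label
                         for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, polarity)
    return best[1], best[2]

def predict_stump(stump, row):
    f, polarity = stump
    return row[f] if polarity else 1 - row[f]

def train_forest(X, y, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bootstrap sample of instances
        feats = rng.sample(range(len(X[0])), n_features)  # random subset of features
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]                     # class with the most votes

# Toy data (assumed): feature 0 equals the label, features 1 and 2 are noise.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]
forest = train_forest(X, y)
print(predict_forest(forest, [1, 0, 0]))
```

Because every stump sees a different bootstrap sample and feature subset, individual stumps may disagree; the vote in `predict_forest` is what combines them into a single prediction.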

#### Cross Validation

In statistical prediction, three cross-validation methods, namely the independent dataset test, the sub-sampling (or n-fold cross-validation) test, and the jackknife test, are often used to evaluate the anticipated success rate of a predictor. Among them, the jackknife test is deemed the least arbitrary and most objective (Chen et al., 2015, 2018; Feng et al., 2018). However, to reduce the computational time, the 10-fold cross-validation test was used to evaluate the performance of the proposed method. In 10-fold cross-validation, the dataset is randomly partitioned into ten subsets; nine subsets are used for training and the remaining one for testing. This process is repeated ten times so that each subset is used exactly once for testing a model trained on the other nine.
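The partitioning scheme can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `k_fold_indices` and the fixed seed are assumptions, and the sketch does not stratify by class, which the text does not specify.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random partition
    folds = [indices[i::k] for i in range(k)]   # k nearly equal subsets
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# Usage on a small toy set of 23 samples.
splits = list(k_fold_indices(23, k=10))
for train, test in splits:
    assert set(train).isdisjoint(test)
```

For the benchmark dataset described earlier, `n_samples` would be 23,857 (12,692 + 11,165), giving test folds of roughly 2,386 exons each.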

#### Performance Evaluation

The performance of the proposed method was evaluated using four metrics, namely sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), which are expressed as (Chen et al., 2017; Lin et al., 2017; Jia et al., 2018; Zeng et al., 2018)

$$\begin{cases} \text{Sn} = \frac{TP}{TP + FN} \times 100\% \\\\ \text{Sp} = \frac{TN}{TN + FP} \times 100\% \\\\ \text{Acc} = \frac{TP + TN}{TP + FN + TN + FP} \times 100\% \\\\ \text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FN) \times (TN + FP)}} \end{cases} \tag{2}$$

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.

# RESULTS AND DISCUSSION

#### Performance Evaluation

By encoding the included and excluded exons in the dataset using histone modifications, each sample was represented by a 114-dimensional vector (Equation 1), which was used as the input of RF to build a computational model. In the 10-fold cross-validation test, this model obtained an accuracy of 63.49%, which is far from satisfactory. To improve the performance, it is necessary to choose the optimal number of features and thereby build a robust and efficient predictive model.

We therefore used MRMD together with the Incremental Feature Selection (IFS) strategy to build the optimal feature subset. We first ranked the 114 features using the MRMD algorithm. The ranked features were then added one by one in rank order, starting from the top-ranked feature, giving 114 nested feature subsets, and an RF model was built for each subset. The performance of each model was evaluated using the 5-fold cross-validation test, and the optimal feature subset is the one at which the accuracy reaches its maximum. The corresponding IFS curve is plotted in **Figure 1**. Accuracy reaches its maximum of 79.79% when the top-ranked 96 features are used to encode the samples. Therefore, a computational model was built based on these 96 optimal features. With these features, the proposed model obtained an accuracy of 72.91%, with a sensitivity of 67.03% and a specificity of 79.65%, in the 10-fold cross-validation test.
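The IFS procedure amounts to a simple loop over nested feature subsets. The sketch below is a generic illustration: `evaluate` stands for any routine that trains a classifier on the given features and returns its cross-validated accuracy, and both it and the toy scoring function used in the example are assumptions, not the study's code.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate the top-1, top-2, ..., top-N ranked features.

    ranked_features: feature indices, best first.
    evaluate: callable mapping a list of feature indices to an accuracy.
    Returns (best_accuracy, best_subset, curve), where curve holds
    (number_of_features, accuracy) pairs for plotting the IFS curve.
    """
    curve = []
    best_acc, best_subset = float("-inf"), []
    for n in range(1, len(ranked_features) + 1):
        subset = ranked_features[:n]
        acc = evaluate(subset)
        curve.append((n, acc))
        if acc > best_acc:  # keep the smallest subset that attains the maximum
            best_acc, best_subset = acc, list(subset)
    return best_acc, best_subset, curve

# Toy stand-in for cross-validated accuracy: peaks when three features are used.
def toy_accuracy(subset):
    return 1.0 - abs(len(subset) - 3) * 0.1

best_acc, best_subset, curve = incremental_feature_selection([4, 2, 0, 1, 3], toy_accuracy)
print(best_subset)  # [4, 2, 0]
```

In the setting described above, `ranked_features` would hold the 114 MRMD-ranked features and `evaluate` would run an RF under 5-fold cross-validation.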

#### Comparative Analysis Among Different Classifiers

To further demonstrate the power of the proposed method for classifying the "included" and "excluded" exons, we compared its performance with that of other classifiers, namely BayesNet, Naïve Bayes, J48 Tree, and Support Vector Machine (SVM). All these classifiers were implemented in WEKA (Frank et al., 2004) with the default settings and tested on the benchmark dataset. Their 10-fold cross-validation test results based on the 96 optimal features are reported in **Table 1**. As indicated in **Table 1**, the four metrics defined in Equation 2 for the current method are all higher than those of BayesNet and SVM. Although Naïve Bayes and SVM yielded higher sensitivity, their specificity, accuracy, and MCC are significantly lower than those of the current method.

*<sup>a</sup>The bias of the 96 optimal features toward exon inclusion or exclusion was analyzed using a hypothesis test of sample frequency. "I" indicates features significantly (p* < *0.01) biased toward exon inclusion, while "E" indicates features significantly (p* < *0.01) biased toward exon exclusion.*

In addition, a comparison was made between the current method and the method in our previous work (Chen et al., 2014), where a QD function based method was proposed to classify the "included" and "excluded" exons. Since both methods were trained and tested on the same dataset, we directly compared the 10-fold cross-validation test results of the current method with those listed in the previous work (Chen et al., 2014). As indicated in **Table 2**, the accuracy achieved by the current method is over 4% higher than that of the existing method, indicating that the current method is superior to our previous method for classifying the "included" and "excluded" exons.

#### Features Analysis

To provide an overall view of the optimal features for classifying the "included" and "excluded" exons, we compared their frequency distributions in both kinds of exons using the z-test (**Table 3**). As can be seen from **Table 3**, among the 96 optimal features, 29 are significantly enriched in the included exons, while 52 are significantly enriched in the excluded exons. More interestingly, 61 of the 81 features that are differently distributed between "included" and "excluded" exons come from the preceding or succeeding regions of the exons. This result indicates that the major epigenetic regulatory factors of exon skipping events are located in the regions surrounding the exons.
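A frequency comparison of this kind can be sketched with a pooled two-proportion z-test. Whether this exact variant of the z-test was used is an assumption, as the text does not specify it. The group sizes below are the benchmark dataset sizes given earlier, while the presence counts (3000 and 2000) are made-up numbers for illustration only.

```python
import math

def two_proportion_z(count_a, n_a, count_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = count_a / n_a, count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # feature always or never present: no evidence of bias
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Hypothetical counts of one feature in included vs. excluded exons.
z, p = two_proportion_z(3000, 12692, 2000, 11165)
label = "I" if z > 0 else "E"
print(f"z={z:.2f}, significant={p < 0.01}, bias={label}")
```

A positive z with p < 0.01 would correspond to an "I" entry in Table 3 and a negative z with p < 0.01 to an "E" entry.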

Recent studies have demonstrated that RNA splicing is mediated not by a single type of histone modification but by a combination of different types (Shindo et al., 2013). To detect whether cooperation or competition among histone modifications exists in exon skipping events, we calculated the Pearson correlation coefficients among the 81 optimal features. The correlation matrices for "included" and "excluded" exons are plotted in **Figures 2**, **3**, respectively. As indicated in these figures, significant positive and negative correlations can be observed among different kinds of histone modifications. For example, in the "included" exon case, H3K18ac is positively correlated with H3K23ac, H4K8ac and H4K12ac, while H4K91ac is negatively correlated with H3K91me2. In the "excluded" exon case, H2AK5ac is positively correlated with H2BK5me1, H2BK12ac, H2BK20ac, H4K5ac, and H3K4ac; negative correlations are observed between H3K79me1 and H3K27me2, H3K27me3, and H3K6me1. These results suggest that cooperation and competition among histone modifications do exist in the process of RNA splicing.
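For binary presence/absence features, the Pearson correlation coefficient reduces to the phi coefficient, which can be computed from a 2 × 2 table of co-occurrence counts. The following sketch shows how such a correlation matrix could be built; the function names and the toy columns are assumptions and do not reproduce the study's data.

```python
import math

def phi_coefficient(x, y):
    """Pearson correlation of two 0/1 sequences via their 2x2 contingency counts."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def correlation_matrix(columns):
    """Symmetric matrix of pairwise phi coefficients between feature columns."""
    n = len(columns)
    return [[phi_coefficient(columns[i], columns[j]) for j in range(n)]
            for i in range(n)]

# Toy columns: a and b always co-occur, c is present exactly when a is absent.
a = [1, 1, 0, 0]
b = [1, 1, 0, 0]
c = [0, 0, 1, 1]
matrix = correlation_matrix([a, b, c])
print(matrix[0][1], matrix[0][2])  # 1.0 -1.0
```

A value near 1 indicates modifications that tend to occur together, and a value near -1 indicates modifications that tend to exclude each other.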

# CONCLUSION

RNA splicing, one of the key processes of gene expression, is regulated not only by ESEs, ISEs, ESSs, ISSs, and other trans-elements but also by epigenetic factors. In this paper, we presented a new computational method to classify the "included" and "excluded" exons in exon skipping events based on histone modifications. The samples in the dataset were encoded using optimal histone modification information obtained by a feature selection technique and then used as the input of RF. The results of the 10-fold cross-validation test demonstrated that the proposed approach achieves better performance than existing approaches.

To provide an intuitive view of the histone modifications that contribute to the predictions, we systematically analyzed their distributions in "included" and "excluded" exons. The nonrandom distribution of histone modifications (**Table 3**) and their positive or negative correlation profiles (**Figures 2**, **3**) suggest that exon skipping is regulated by the combination of different types of histone modifications. Further experimental investigations are required to reveal how these histone modifications are associated with splicing.

In future work, we will aim to develop a more accurate method to classify "included" and "excluded" exons by integrating information from both the genome and epigenome levels.

# AUTHOR CONTRIBUTIONS

WC and HL conceived and designed the experiments. PF and HD performed the experiments. HL and WC wrote the paper. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the National Nature Scientific Foundation of China (31771471, 61772119), the Natural Science Foundation for Distinguished Young Scholar of Hebei Province (No. C2017209244), and the Program for the Top Young Innovative Talents of Higher Learning Institutions of Hebei Province (No. BJ2014028).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00433/full#supplementary-material

# REFERENCES

Black, D. L. (2003). Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem. 72, 291–336. doi: 10.1146/annurev.biochem.72.121801.161720


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Chen, Feng, Ding and Lin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Accurate Prediction of ncRNA-Protein Interactions From the Integration of Sequence and Evolutionary Information

#### Zhao-Hui Zhan<sup>1</sup> , Zhu-Hong You<sup>2</sup> \*, Li-Ping Li <sup>2</sup> , Yong Zhou<sup>1</sup> and Hai-Cheng Yi <sup>2</sup>

<sup>1</sup> School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China, <sup>2</sup> Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China

Non-coding RNA (ncRNA) plays a crucial role in numerous biological processes, including gene expression and post-transcriptional gene regulation. The biological function of ncRNA is mostly realized by binding with related proteins. Therefore, an accurate understanding of interactions between ncRNA and protein has a significant impact on current biological research. A major challenge at this stage is that traditional interaction prediction methods consume a great deal of time and resources on classification. An efficient classifier named LightGBM can alleviate this long running time. In this study, we employed LightGBM as the integrated classifier and proposed a novel computational model for predicting ncRNA and protein interactions. More specifically, pseudo-Zernike moments and the singular value decomposition algorithm are employed to extract discriminative features from protein and ncRNA sequences. On four widely used datasets, RPI369, RPI488, RPI1807, and RPI2241, we evaluated the performance of LGBM and obtained AUCs of 0.799, 0.914, 0.989, and 0.762, respectively. The experimental results of 10-fold cross-validation show that the proposed method performs much better than existing methods in predicting ncRNA-protein interaction patterns and could serve as a useful tool in proteomics research.

#### Edited by:

Quan Zou, Tianjin University, China

#### Reviewed by:

Kang Wei, The Chinese University of Hong Kong, Hong Kong Dongya Jia, Rice University, United States

#### \*Correspondence:

Zhu-Hong You zhuhongyou@ms.xjb.ac.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 07 August 2018 Accepted: 19 September 2018 Published: 08 October 2018

#### Citation:

Zhan Z-H, You Z-H, Li L-P, Zhou Y and Yi H-C (2018) Accurate Prediction of ncRNA-Protein Interactions From the Integration of Sequence and Evolutionary Information. Front. Genet. 9:458. doi: 10.3389/fgene.2018.00458

Keywords: ncRNA-protein interactions, PSSM, LightGBM, Pseudo-Zernike moments, k-mers

# INTRODUCTION

Non-coding RNAs (ncRNAs) are regarded as the "dark matter" of the genome because they do not code for proteins. In recent years, a variety of ncRNAs have been discovered that play indispensable roles in many biological processes, including amino acid transport and RNA modification (Pan et al., 2016). Recent research has shown that ncRNAs are closely associated with human diseases, including cancer. For instance, Tian et al. reported that the role of ncRNA in diabetes is becoming increasingly evident, since ncRNA is involved in the modulation of β-cell mass, insulin synthesis, secretion, and signaling (Tian et al., 2018). However, the functions of a large proportion of ncRNAs remain unclear. To gain insight into the function of an ncRNA, it is essential to determine whether it interacts with proteins, which aids the understanding of the mechanisms behind biological processes involving RNA-binding proteins (RBPs) (Li and Nagy, 2011). Although reliable approaches for identifying ncRNA-protein interactions have been built on a large number of experimental analyses, such as RBPs (Pan et al., 2017), RPI-Bind (Luo et al., 2017), and RNA Compete-S (Cook et al., 2017), only a limited number of RNA-protein complex structures are available in the Protein Data Bank (PDB), because such experiments are time-consuming and resource-intensive (Berman et al., 2000). Therefore, researchers have turned to predicting ncRNA-protein interactions from sequences alone, which is regarded as a reliable computational approach since the sequences carry enough information for prediction (Suresh et al., 2015). Such sequence-based methods can identify potential ncRNA and protein partners in the absence of structural information (Muppirala et al., 2011).

Machine learning provides researchers with one of the most cost-effective ways to construct predictive models when validated training data are available (Muppirala et al., 2011). Akbaripour-Elahabad et al. combined motif information and repetitive patterns extracted from validated RNA-protein interactions with sequence composition as descriptors to build an RPI prediction model called rpiCOOL using a random forest classifier (Akbaripour-Elahabad et al., 2016). A random forest classifier is an ensemble of decision trees in which each tree is trained on a subset of features sampled randomly from the input feature set (Akbaripour-Elahabad et al., 2016). Wang et al. proposed an ncRNA-protein interaction model based on an extended Bayesian classifier, which selected valid features by reducing likelihood scores and allowed transparent feature integration during prediction (Wang et al., 2013). After feature extraction, the extracted features were passed to the Bayesian classifier for training. The Bayesian classifier is one of the most basic statistical classification methods; its principle is to calculate the posterior probability of an object using Bayes' formula and to assign the object to the class with the maximum posterior probability (Cheng et al., 2017). Yi et al. proposed a computational model, RPI-SAN, which uses a deep-learning stacked auto-encoder network to mine hidden high-level features from RNA and protein sequences and feeds them into a random forest classifier to predict ncRNA-binding proteins (Yi et al., 2018). They further employed stacked ensembling to improve the accuracy of the proposed method (Long et al., 2017; Patel et al., 2017). 
Random forests and Bayesian classifiers are classical machine learning classifiers whose effectiveness has been verified by a large number of experiments (Liu et al., 2016; Wang et al., 2016; Luo and Liu, 2017). However, these traditional classifiers still have much room for improvement in classification performance and time consumption.

In recent years, an improved gradient boosting decision tree classifier named LightGBM has been proposed. LightGBM is a histogram-based decision tree algorithm, which buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training (Shi et al., 2018). Zhang et al. used this algorithm to speed up decision tree building on GPUs (Graphics Processing Units) and to improve its scalability (Zhang et al., 2017). Their experiments showed that LightGBM constructs decision trees much faster than general decision tree algorithms at the same classification accuracy (Mitchell and Frank, 2017).

In biology, the discovery of ncRNAs has far outpaced research on their functions in ncRNA-protein interactions. An efficient prediction tool for ncRNA-protein interactions that saves time and resources is therefore urgently needed. Hence, we applied the efficient LightGBM classifier to large-scale ncRNA-protein interaction prediction and proposed a new sequence-based machine learning model named LGBM. More specifically, each ncRNA sequence is converted into a k-mers sparse matrix, and the ncRNA feature vectors are extracted from the resulting matrices using singular value decomposition (SVD). For proteins, based on the evolutionary point mutation model of protein sequences, we converted each protein sequence into a position-specific scoring matrix (PSSM), which contains position and frequency information. Each protein sequence was then characterized by the feature vector obtained by applying the pseudo-Zernike moment (PZM) algorithm to its PSSM. After extracting the features of ncRNA and protein, we fed these representative features into the LightGBM classifier to learn and predict interactions between ncRNA and protein. To evaluate the predictive performance of the model, we used 10-fold cross-validation to reduce overfitting. We employed four benchmark datasets, RPI369, RPI488, RPI1807, and RPI2241, to evaluate the performance of our model and compared its prediction results with those of other state-of-the-art models. The experimental results indicated that LGBM performed well on all four datasets.

# METHODS

# Protein Feature Extraction

In this section, we selected the PZM feature extraction algorithm to extract sequence-based protein feature vectors from the PSSM (Maali et al., 2016; Kheirkhah et al., 2017). The PSSM integrates biological evolutionary information and was first used to detect distantly related proteins; it has achieved good performance in predicting protein binding sites and disordered regions (Yi et al., 2018). Let P be the PSSM of an arbitrary protein. P consists of r rows and 20 columns, where r is the length of the protein's primary sequence and 20 is the number of amino acid types (Sharma et al., 2013). The PSSM is represented as follows:

$$P = \begin{bmatrix} p_{1,1} & \cdots & p_{1,20} \\ \vdots & \ddots & \vdots \\ p_{r,1} & \cdots & p_{r,20} \end{bmatrix} \tag{1}$$

where p<sub>i,j</sub> in the ith row and jth column denotes the relative probability of the jth amino acid at the ith position of the protein sequence from which the PSSM is derived (Hayat and Khan, 2011). In our experiments, the position-specific iterated BLAST (PSI-BLAST) tool was used to transform the original protein sequences into PSSMs, with the e-value parameter set to 0.001.

Then we extracted PZM feature vectors from the resulting PSSM matrices above. The PZM is a statistical feature extraction algorithm that is computationally efficient for using global information to extract features (Haddadnia, 2001). Pseudo-Zernike polynomials are orthogonal sets of complex-valued polynomials defined as follows (Haddadnia et al., 2003):

$$V_{\alpha\beta}(x, y) = R_{\alpha\beta}(x, y) \exp\left( j\beta \tan^{-1}\left( \frac{y}{x} \right) \right) \tag{2}$$

where x<sup>2</sup> + y<sup>2</sup> ≤ 1, α ≥ 0, |β| ≤ α, and ρ = √(x<sup>2</sup> + y<sup>2</sup>) is the length of the vector from the origin to the pixel (x, y). The radial polynomials R<sub>αβ</sub> are defined as:

$$R_{\alpha\beta}(x, y) = \sum_{t=0}^{\alpha - |\beta|} Z_{\alpha,|\beta|,t} \left( x^2 + y^2 \right)^{\frac{\alpha - t}{2}} \tag{3}$$

Where

$$Z_{\alpha,|\beta|,t} = (-1)^t \frac{(2\alpha + 1 - t)!}{t! \left(\alpha - |\beta| - t\right)! \left(\alpha + |\beta| - t + 1\right)!} \tag{4}$$

Moreover, R<sub>α,−β</sub>(ρ) = R<sub>α,β</sub>(ρ). Therefore, the pseudo-Zernike moments of order α with repetition β for a continuous image function f(x, y) that vanishes outside the unit circle are as follows (Kim and Lee, 2003):

$$M_{\alpha\beta} = \frac{\alpha+1}{\pi} \iint_{x^2 + y^2 \le 1} f(x, y)\, V_{\alpha\beta}^{*}(x, y)\, dx\, dy \tag{5}$$

Pseudo-Zernike polynomials are orthogonal and satisfy the following equation:

$$\iint_{x^{2}+y^{2} \le 1} \left[ V_{\alpha\beta}^{*}(x, y) \right] \times V_{mn}(x, y)\, dx\, dy = \frac{\pi}{\alpha+1} \delta_{\alpha m} \delta_{\beta n} \tag{6}$$

With

$$\delta_{ab} = \begin{cases} 1, & a = b \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

Hence, based on the derivation of the above formulas, the feature vectors of protein sequences can be represented as follows (Wang Y. et al., 2017):

$$\overrightarrow{F} = \left[ |M_{11}|, |M_{22}|, \cdots, |M_{\alpha\beta}| \right]^{T} \tag{8}$$
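As a rough illustration of how pseudo-Zernike moment magnitudes could be computed from a matrix such as a PSSM, the sketch below uses the standard pseudo-Zernike radial polynomial and approximates the moment integral by a Riemann sum. It is not the authors' implementation. Mapping the matrix onto the square [-1, 1] × [-1, 1], the choice of orders, and the function names `radial_polynomial`, `pseudo_zernike_moment`, and `pzm_features` are all assumptions made for this example.

```python
import cmath
import math


def radial_coefficient(alpha, beta, t):
    """Coefficient of rho**(alpha - t) in the pseudo-Zernike radial polynomial."""
    b = abs(beta)
    return ((-1) ** t * math.factorial(2 * alpha + 1 - t)
            / (math.factorial(t)
               * math.factorial(alpha - b - t)
               * math.factorial(alpha + b + 1 - t)))


def radial_polynomial(alpha, beta, rho):
    """Radial polynomial R_{alpha,beta}(rho) as a finite sum over t."""
    return sum(radial_coefficient(alpha, beta, t) * rho ** (alpha - t)
               for t in range(alpha - abs(beta) + 1))


def pseudo_zernike_moment(matrix, alpha, beta):
    """Approximate M_{alpha,beta} of a 2-D list by a Riemann sum on the unit disk.

    Assumption: rows and columns are mapped linearly onto [-1, 1], so the
    matrix needs at least two rows and two columns.
    """
    rows, cols = len(matrix), len(matrix[0])
    dy, dx = 2.0 / (rows - 1), 2.0 / (cols - 1)
    total = 0j
    for i in range(rows):
        y = -1.0 + i * dy
        for j in range(cols):
            x = -1.0 + j * dx
            rho = math.hypot(x, y)
            if rho > 1.0:
                continue  # the image function vanishes outside the unit circle
            theta = math.atan2(y, x)
            # Complex conjugate of V_{alpha,beta} at this grid point.
            v_conj = radial_polynomial(alpha, beta, rho) * cmath.exp(-1j * beta * theta)
            total += matrix[i][j] * v_conj
    return (alpha + 1) / math.pi * total * dx * dy


def pzm_features(matrix, max_order):
    """Magnitudes |M_{alpha,beta}| for 1 <= alpha <= max_order, 0 <= beta <= alpha."""
    return [abs(pseudo_zernike_moment(matrix, a, b))
            for a in range(1, max_order + 1) for b in range(a + 1)]
```

For a constant matrix of ones the zeroth-order moment is close to 1, because the integral reduces to the area of the unit disk divided by π; this gives a quick sanity check of the discretization.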

#### ncRNA Feature Extraction

For ncRNA, we used the SVD algorithm to extract feature vectors from the k-mers sparse matrix representing each ncRNA sequence. To construct the k-mers sparse matrix, we traversed each complete ncRNA sequence (over the letters A, C, G, U) with a sliding window that advances one nucleotide at a time, so that every nucleotide position is characterized (Yi et al., 2018). The frequency of each combined k-mer feature over the four nucleotide letters was then extracted for each RNA sequence, giving 4<sup>k</sup>-dimensional features (You et al., 2016). Each feature value is the normalized frequency of a 4-mer in the ncRNA sequence, i.e., AAAA, AAAC, ..., UUUU (Pan et al., 2016). In this way, we obtained matrices representing the ncRNA sequences that contain frequency information, location information, and further hidden information (Yi et al., 2018).
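The sliding-window counting described above can be sketched in a few lines. This is only an illustration of normalized k-mer frequencies with a stride of one nucleotide; the function name `kmer_frequencies`, the conversion of T to U, and the skipping of windows containing ambiguous letters are assumptions of this example, not details given in the text.

```python
from itertools import product


def kmer_frequencies(sequence, k=4, alphabet="ACGU"):
    """Normalized k-mer frequencies from a stride-1 sliding window.

    Returns a list of length len(alphabet)**k, ordered lexicographically
    (AAAA, AAAC, ..., UUUU for k=4).
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input (assumption)
    windows = 0
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # skip windows with ambiguous letters such as N
            counts[index[kmer]] += 1
            windows += 1
    if windows == 0:
        return [0.0] * len(kmers)
    return [c / windows for c in counts]
```

With k = 4 this yields a 256-dimensional vector per sequence, which matches the 4<sup>k</sup> dimensionality stated in the text.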

Furthermore, we used the SVD algorithm to decompose the k-mers sparse matrix. Let Q denote the k-mers sparse matrix obtained above; its singular value decomposition is as follows:

$$Q = U\Sigma V \tag{9}$$

where the diagonal elements of Σ are the singular values of Q, which capture most of the information in the original matrix Q. Consequently, we reconstructed a 1 × 4<sup>k</sup>-dimensional vector from Q as follows:

$$
\overrightarrow{F} = U\tag{10}
$$
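The text does not spell out the exact layout of the k-mers sparse matrix or which part of U is kept, so the following is a sketch under two explicit assumptions: Q is taken to be a 4<sup>k</sup> × (L − k + 1) indicator matrix whose entry (i, j) is 1 when k-mer i starts at position j, and the feature is taken to be the leading left singular vector, which has the stated 4<sup>k</sup> dimensions. The helper names are hypothetical. Note that `numpy.linalg.svd` returns the right factor already transposed.

```python
from itertools import product

import numpy as np


def kmer_indicator_matrix(sequence, k=4, alphabet="ACGU"):
    """Assumed layout: rows are k-mers, columns are window start positions."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    n_windows = len(sequence) - k + 1
    Q = np.zeros((len(index), n_windows))
    for pos in range(n_windows):
        row = index.get(sequence[pos:pos + k])
        if row is not None:  # ignore windows with letters outside the alphabet
            Q[row, pos] = 1.0
    return Q


def svd_feature(Q):
    """Return the leading left singular vector of Q and all singular values."""
    # full_matrices=False gives the economy-size factors U, S, Vh with Q = U @ diag(S) @ Vh.
    U, singular_values, Vh = np.linalg.svd(Q, full_matrices=False)
    return U[:, 0], singular_values
```

Under this layout each column of Q contains a single 1, so the largest singular value equals the square root of the count of the most frequent k-mer. The sign of a singular vector is not unique, which is worth keeping in mind if such vectors are compared across sequences.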

#### LightGBM Algorithm

After obtaining the features of ncRNA and protein with the above feature representation approaches, we fed these high-level features into the LightGBM classifier to train the model for predicting RPIs.

The traditional gradient boosting decision tree (GBDT) algorithm is a widely used machine learning algorithm that combines decision trees into an ensemble learning model (Ke et al., 2017). GBDT learns the decision trees by fitting the negative gradients (Friedman, 2001). In learning a decision tree, the most time-consuming and labor-consuming step is finding the best split points (Appel et al., 2013). Histogram-based GBDT implementations bucket continuous feature values into discrete bins and use these bins to construct feature histograms during training, rather than searching for split points over all sorted feature values (Li et al., 2007). However, as the data volume increases, scanning all data instances to estimate the information gain of all possible split points becomes very time-consuming (Chen and Guestrin, 2016). To address this limitation, an improved GBDT-based algorithm named LightGBM was proposed, which introduces two novel techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to speed up training without substantially hurting classification accuracy (Ke et al., 2017).

GOSS addresses the fact that GBDT has no native sample weights: it down-samples the data instances with small gradients while compensating for the resulting change in data distribution, so that the accuracy of the learned model is not hurt much. First, the training instances are sorted by the absolute values of their gradients in descending order. Second, the top p × 100% instances with the largest gradients are selected as subset A, and a random sample of q × 100% of the data is drawn from the remaining instances as subset B. The estimated variance gain Ṽ<sub>s</sub>(b) of splitting feature s at point b over the subset A ∪ B can then be defined as follows (Ke et al., 2017):

$$\tilde{V}_{s}(b) = \frac{1}{n} \left( \frac{\left( \sum_{x_i \in A_l} g_i + \frac{1-p}{q} \sum_{x_i \in B_l} g_i \right)^{2}}{n_l^{s}(b)} + \frac{\left( \sum_{x_i \in A_r} g_i + \frac{1-p}{q} \sum_{x_i \in B_r} g_i \right)^{2}}{n_r^{s}(b)} \right) \tag{11}$$

where A<sub>l</sub> = {x<sub>i</sub> ∈ A : x<sub>is</sub> ≤ b}, A<sub>r</sub> = {x<sub>i</sub> ∈ A : x<sub>is</sub> > b}, B<sub>l</sub> = {x<sub>i</sub> ∈ B : x<sub>is</sub> ≤ b}, and B<sub>r</sub> = {x<sub>i</sub> ∈ B : x<sub>is</sub> > b}.
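The sampling step and the estimated gain of Equation (11) can be sketched in plain Python as follows. This is an illustrative reading of GOSS, not LightGBM's internal code. The function names are hypothetical, the random subset is assumed to contain q × n instances, and the counts n<sub>l</sub> and n<sub>r</sub> are assumed to be taken over all instances of the node, since the text does not define them.

```python
import random


def goss_sample(gradients, p, q, rng):
    """Return index lists (A, B): the top-p fraction by |gradient| and a random q fraction."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(p * n)
    A = order[:top_n]
    B = rng.sample(order[top_n:], int(q * n))  # drawn from the small-gradient remainder
    return A, B


def estimated_variance_gain(feature, gradients, A, B, b, p, q):
    """Estimated variance gain of splitting `feature` at threshold b (Eq. 11)."""
    n = len(gradients)
    weight = (1.0 - p) / q  # re-weights the sampled small-gradient instances
    gain = 0.0
    for on_side in (lambda v: v <= b, lambda v: v > b):
        # Assumption: n_l and n_r count all instances of the node on each side.
        count = sum(1 for v in feature if on_side(v))
        if count == 0:
            continue
        total = (sum(gradients[i] for i in A if on_side(feature[i]))
                 + weight * sum(gradients[i] for i in B if on_side(feature[i])))
        gain += total ** 2 / count
    return gain / n
```

When p + q = 1, every instance is kept, the weight (1 − p)/q equals 1 for p = q = 0.5, and the estimate coincides with the gain computed on the full data; smaller q trades accuracy of the estimate for speed.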

In the second step, the EFB algorithm is used to effectively reduce the number of features by bundling mutually exclusive features into a single feature without hurting accuracy. With EFB, the feature histograms built from the resulting feature bundles are the same as those built from the individual features (Meng et al., 2016). The complexity of histogram building is therefore reduced from O(#data × #feature) to O(#data × #bundle), where #bundle ≪ #feature. First, because partitioning the features into the smallest number of exclusive bundles is NP-hard, the problem is treated as a graph coloring problem and solved approximately (Zuev, 2015). Second, offsets are added to the original feature values in order to merge the features in the same bundle while ensuring that the original values can still be identified from the resulting feature bundles.
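A minimal sketch of the two EFB steps is given below: a greedy grouping of features that are never non-zero in the same row, followed by an offset-based merge. It is a simplification for illustration only. It assumes non-negative integer bin values with 0 meaning "empty", allows no conflicts at all within a bundle, and uses hypothetical function names; LightGBM's own implementation differs in detail.

```python
def greedy_exclusive_bundles(columns):
    """Greedily group columns that are never non-zero in the same row."""
    bundles, occupied = [], []  # occupied[k]: rows already used by bundle k
    for j, col in enumerate(columns):
        rows = {i for i, v in enumerate(col) if v != 0}
        for k in range(len(bundles)):
            if not (occupied[k] & rows):  # no conflict with this bundle
                bundles[k].append(j)
                occupied[k] |= rows
                break
        else:
            bundles.append([j])
            occupied.append(set(rows))
    return bundles


def merge_bundle(columns, bundle):
    """Merge one bundle into a single column by shifting each feature by an offset."""
    merged = [0] * len(columns[0])
    offsets, offset = [], 0
    for j in bundle:
        offsets.append(offset)
        for i, v in enumerate(columns[j]):
            if v != 0:
                merged[i] = v + offset
        offset += max(columns[j])  # the next feature gets a disjoint value range
    return merged, offsets
```

Because each feature in a bundle occupies its own value range, the original feature and value can be recovered from a merged entry by locating the range it falls in and subtracting that feature's offset.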

#### Evaluation Criteria

In this study, we used 10-fold cross-validation to avoid overfitting and to obtain a reliable estimate of the accuracy of our model. The datasets were randomly divided into 10 equal parts. In each round, one part was taken as the testing dataset, while the remaining nine parts were used as the training dataset<sup>1</sup>. Therefore, a total of 10 experiments were conducted. To evaluate the performance of our model LGBM, we followed several widely used evaluation criteria, including accuracy, sensitivity, specificity, precision, and the Matthews correlation coefficient (MCC), as follows (Liu and Chen, 2012):

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{13}$$

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{14}$$

$$Precision = \frac{TP}{TP + FP} \tag{15}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{16}$$


where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. The Receiver Operating Characteristic (ROC) curve represents the trade-off between specificity and sensitivity across decision thresholds; its x-axis depicts the false positive rate (FPR) and its y-axis depicts the true positive rate (TPR) (Huang et al., 2015). The AUC is the area under the ROC curve.
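The criteria in Equations (12) to (16) translate directly into code. The sketch below computes them from confusion-matrix counts and also computes the AUC through its rank interpretation, that is, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, which equals the area under the ROC curve. The function names and the convention of returning 0 for an undefined MCC are choices made for this example.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and MCC from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        # MCC is undefined when a marginal count is zero; 0.0 is used here by convention.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for datasets of the size used here; a sort-based variant would be preferable for much larger sets.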

#### Datasets

To verify the robustness and effectiveness of our model LGBM, we selected four ncRNA-protein interaction datasets: RPI369, RPI488, RPI1807, and RPI2241. Among them, RPI369 and RPI2241 were derived from PRIDB, a database of ncRNA-protein interfaces calculated from complexes in the Protein Data Bank (Berman et al., 2000; Wang et al., 2013). RPI2241 is a positive sample set consisting of 2,241 experimentally verified ncRNA-protein pairs, including 2,043 protein chains and 842 ncRNA chains. RPI369 is a subset of RPI2241 with 369 pairs, including 338 protein chains and 332 ncRNA chains, which excludes all ncRNA-protein pairs that involve ribosomal proteins or ribosomal ncRNA in various organisms (Muppirala et al., 2011). For RPI369 and RPI2241, an approximate negative sample set with twice the number of pairs was constructed by randomly pairing ncRNA and protein sequences after removing the pairs in the positive sample set (Wang et al., 2013). RPI488 is a non-redundant lncRPI dataset based on structural complexes, which consists of 488 lncRNA-protein pairs, including 245 non-interacting pairs and 243 interacting pairs, from Shen et al. (Pan et al., 2016). RPI488 is smaller than the other datasets since there are fewer lncRNA-protein complexes in PDB, where ncRNA-protein complexes are destroyed from downstream (Ying et al., 2010). The dataset RPI1807 consists of 1,807 positive ncRNA-protein pairs, including 1,078 ncRNA chains and 1,807 protein chains, and 1,436 negative pairs, with 493 ncRNA chains and 1,436 protein chains. It was established by parsing the Nucleic Acid Database (NDB), which provides RNA-protein complex data and protein-RNA interface data (Yi et al., 2018). The composition of these four datasets is described in **Table 1**.

#### Experimental Results

In this study, we proposed a machine learning classification model named LGBM, based on an improved gradient boosting decision tree, to predict interactions between ncRNA and protein. The model used the PSSM and PZM algorithms to extract protein feature vectors and combined k-mers matrices with the SVD algorithm to extract RNA feature vectors. The specific steps of the machine learning model are shown in **Figure 1**. To verify the performance of the proposed model, we first evaluated the prediction ability of LGBM on datasets RPI369 and RPI488 and compared it with the prediction performance of other classifiers under the same feature extraction conditions. In addition, we evaluated the predictive performance on datasets RPI1807 and RPI2241 and compared the results with those of other models proposed in earlier papers.

<sup>1</sup>K-Fold Cross Validation. Classification.

TABLE 2 | Ten-fold cross-validation results on dataset RPI369.

TABLE 3 | Ten-fold cross-validation results on dataset RPI488.

# Prediction Ability of LGBM

In this section, we validated our machine learning model LGBM with 10-fold cross-validation on datasets RPI369 and RPI488 for predicting ncRNA-protein interactions. The 10-fold cross-validation helped LGBM avoid over-fitting and achieve better performance. The experimental prediction results under 10-fold cross-validation are summarized in **Tables 2**, **3**.
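The 10-fold protocol described under Evaluation Criteria can be sketched as an index generator: the samples are shuffled once, dealt into ten folds, and each fold serves as the test set exactly once. The function name `k_fold_indices` and the fixed seed are assumptions of this example, and no stratification by class is performed.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible via seed
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

A model would be fitted on each `train` split and scored on the matching `test` split, and the ten scores averaged, which is how mean values such as those reported for RPI369 and RPI488 are usually obtained.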

As shown in **Tables 2**, **3**, when the LGBM model was used to predict interactions between ncRNA and protein on dataset RPI369, the mean accuracy, precision, sensitivity, specificity, and MCC were 73.81, 72.18, 68.75, 78.81, and 48.03%, respectively. For dataset RPI488, the mean accuracy, precision, sensitivity, specificity, and MCC reached 89.52, 93.28, 94.30, 84.17, and 79.02%, respectively. Moreover, in the 10-fold cross-validation, the accuracy of one fold was as high as 95.92%, while five other folds achieved an accuracy of 90%.


TABLE 4 | Performance evaluation on different classifiers.

The bold value indicates this measure performance is the best among the compared methods.

The prediction accuracy of LGBM on datasets RPI369 and RPI488 illustrated the feasibility of predicting ncRNA and protein interactions only based on their sequence information. In fact, the protein and ncRNA feature extraction methods can extract more in-depth information hidden in sequences including location, frequency and interaction information into PSSM matrices and k-mers matrices (You et al., 2016). In addition, selecting PZM algorithm to extract feature vectors makes better use of the properties of PZM (Khotanzad and Hong, 1990).

### Comparison Between Different Classifiers

In this comparison, we evaluated the prediction performance of the LightGBM classifier, an SVM classifier, and a traditional gradient boosting decision tree classifier on datasets RPI369 and RPI488 under the same feature extraction conditions. The experimental prediction results under 10-fold cross-validation are summarized in **Table 4**, and the corresponding trade-off between false positive rate and true positive rate is shown in the receiver operating characteristic (ROC) curves in **Figures 2**, **3**.

As can be seen from **Table 4**, the LightGBM classifier achieved an accuracy of 73.81% in predicting ncRNA-protein interactions on dataset RPI369, higher than the 71.60% of the SVM classifier and the 71.74% of the traditional GBDT classifier. For precision, sensitivity, and MCC (but not specificity), the LightGBM classifier also performed better, with 72.18, 78.81, and 48.03%, respectively, compared with 71.70, 72.51, and 43.62% for the SVM classifier and 71.79, 72.79, and 43.90% for the traditional GBDT classifier. For dataset RPI488, the LightGBM classifier performed better than the other two classifiers on accuracy, precision, sensitivity, specificity, and MCC, with results of 89.52, 93.28, 84.17, 94.30, and 79.02%, respectively. That is, the LightGBM classifier had better classification performance than the SVM and traditional GBDT classifiers on nearly every evaluation criterion, which supports the feasibility and effectiveness of choosing LightGBM to process sequence information in our model LGBM.

The comparison results show the feasibility and effectiveness of selecting LightGBM as the classifier in our model (Zhu et al., 2017). In fact, LightGBM, as an improved gradient boosting decision tree, possesses the advantages of reducing the number of features through EFB and retaining sufficient information gain from smaller data samples through GOSS, and is superior to other classifiers in terms of computational speed and memory consumption (Wang et al., 2017).

#### Comparison With Other Existing Methods

In this section, we compared the 10-fold cross-validation prediction performance of the LGBM model on datasets RPI488, RPI1807, and RPI2241 with that of RPI-Pred, RPISeq-RF, and IncPro. RPI-Pred is an SVM-based ncRNA-protein interaction prediction model proposed by Suresh et al., which is based on sequence and structure information (Suresh et al., 2015). The accuracy of the RPI-Pred model on dataset RPI1807 is 93.00%. RPISeq-RF is a random forest classifier-based model proposed by Muppirala et al., which extracts feature vectors from ncRNA and protein sequence information only (Muppirala et al., 2011). The accuracies of the RPISeq-RF model on datasets RPI488, RPI1807, and RPI2241 are 88.00, 97.30, and 63.96%, respectively. IncPro is a model proposed by Lu et al., which encodes lncRNA and protein sequences as numeric vectors and scores each lncRNA-protein pair using matrix multiplication (Lu et al., 2013). The accuracies of the IncPro model on datasets RPI488, RPI1807, and RPI2241 are 87.00, 96.90, and 65.40%, respectively. The comparative results of the experiments are summarized in **Table 5**, and the 10-fold cross-validation ROC curves for our model LGBM on RPI488, RPI1807, and RPI2241 are shown in **Figures 4**–**6**.

TABLE 5 | Comparison between LGBM and other methods in RPI488, RPI1807, and RPI2241.

The bold value indicates this measure performance is the best among the compared methods.

As shown in **Table 5**, our machine learning model LGBM achieved a prediction accuracy of 89.52% on dataset RPI488, higher than the 88.00% of RPISeq-RF and the 87.00% of IncPro. LGBM also performed better on the other evaluation criteria, with precision, sensitivity, specificity, and MCC of 93.28, 94.30, 84.17, and 79.02%, respectively, whereas the results of RPISeq-RF were 88.00, 93.20, 92.60, 82.20, and 76.20% and those of IncPro were 87.00, 91.00, 90.00, 82.70, and 74.00%. On dataset RPI2241, our model performed better on every criterion except specificity, with 68.86, 72.76, and 76.38% for accuracy, precision, and sensitivity. On dataset RPI1807, although our prediction performance was not as good as that of IncPro, it still reached 96.42, 96.21, 95.20, 97.40, and 97.26%, not much lower than the 96.90, 95.50, 96.50, 98.10, and 93.80% of IncPro for accuracy, precision, sensitivity, specificity, and MCC, respectively. RPI-Pred performed slightly worse, with 93.00, 94.00, and 95.00% for accuracy, precision, and sensitivity.

By comparing the prediction results, we can see that LGBM performs better on datasets RPI488 and RPI2241. On dataset RPI1807, its prediction accuracy is lower than that of IncPro but still exceeds 96%. In general, our model LGBM is effective and robust in predicting interactions between ncRNA and protein.

#### CONCLUSION

In this study, we proposed an efficient prediction model, LGBM, which uses sequence and evolutionary information to predict interactions between ncRNA and protein. To obtain evolutionary information from protein sequences, the pseudo-Zernike moment algorithm was used to extract protein feature vectors from the PSSM. Meanwhile, SVD was used to extract features from the k-mers sparse matrix of each ncRNA, in which both location and frequency information are preserved. On this basis, we fed the high-level feature vectors into the LightGBM classifier to predict the interaction between ncRNA and protein. To verify the accuracy and robustness of our model, 10-fold cross-validation was used. Experimental results on datasets RPI369, RPI488, RPI1807, and RPI2241 demonstrated the robustness and effectiveness of our model. Therefore, the proposed LGBM model is feasible and reliable and generalizes well in predicting ncRNA-protein interactions. Our research can serve as a useful tool for further biological research.

#### AUTHOR CONTRIBUTIONS

Z-HZ, Z-HY, and YZ conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript. L-PL and H-CY designed, performed and analyzed experiments and wrote the manuscript. All authors read and approved the final manuscript.

#### ACKNOWLEDGMENTS

This work is supported in part by the National Science Foundation of China, under Grants 61373086, 61572506. The authors would like to thank all the editors and reviewers for their constructive advice.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhan, You, Li, Zhou and Yi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Use of Laplacian Heat Diffusion Algorithm to Infer Novel Genes With Functions Related to Uveitis

Shiheng Lu<sup>1</sup>† , Ke Zhao<sup>1</sup>† , Xuefei Wang<sup>1</sup> , Hui Liu<sup>1</sup> , Xiamuxiya Ainiwaer<sup>1</sup> , Yan Xu<sup>2</sup> and Min Ye<sup>1</sup> \*

<sup>1</sup> Department of Ophthalmology, Shanghai Pudong Hospital, Fudan University Pudong Medical Center, Pudong, China, <sup>2</sup> School of Life Sciences, Shanghai University, Shanghai, China

Uveitis is the inflammation of the uvea and is a serious eye disease that can cause blindness for middle-aged and young people. However, the pathogenesis of this disease has not been fully uncovered and thus renders difficulties in designing effective treatments. Completely identifying the genes related to this disease can help improve and accelerate the comprehension of uveitis. In this study, a new computational method was developed to infer potential related genes based on validated ones. We employed a large protein–protein interaction network reported in STRING, in which Laplacian heat diffusion algorithm was applied using validated genes as seed nodes. Except for the validated ones, all genes in the network were filtered by three tests, namely, permutation, association, and function tests, which evaluated the genes based on their specialties and associations to uveitis. Results indicated that 59 inferred genes were accessed, several of which were confirmed to be highly related to uveitis by literature review. In addition, the inferred genes were compared with those reported in a previous study, indicating that our reported genes are necessary supplements.

#### Edited by:

Quan Zou, Tianjin University, China

#### Reviewed by:

Xiaoyong Pan, Erasmus University Rotterdam, Netherlands Lin Lu, Columbia University Irving Medical Center, United States

#### \*Correspondence:

Min Ye gleye@163.com †These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 03 August 2018 Accepted: 10 September 2018 Published: 08 October 2018

#### Citation:

Lu S, Zhao K, Wang X, Liu H, Ainiwaer X, Xu Y and Ye M (2018) Use of Laplacian Heat Diffusion Algorithm to Infer Novel Genes With Functions Related to Uveitis. Front. Genet. 9:425. doi: 10.3389/fgene.2018.00425 Keywords: uveitis, Laplacian heat diffusion, protein–protein interaction, disease gene, network construction

# INTRODUCTION

The uvea is a specific structure in the eye that consists of the pigmented layer and the outer fibrous layer (Junqueira et al., 2013; Rekas et al., 2015). As one of the most common types of inflammation of the uvea, uveitis is the third leading cause of blindness in developed countries, generally affecting people aged 20–50 years (Junqueira et al., 2013). According to statistics from the National Eye Institute in the United States, uveitis can be subtyped as anterior uveitis, intermediate uveitis, posterior uveitis, and panuveitis based on different pathogenic progressions and sites (Lim et al., 2016). Regardless of subtype, the major pathogenesis of this disease is the over-activation of inflammatory cells in situ (Pan et al., 2014; Rosenbaum, 2015). The uvea contains most of the eye's blood vessels, through which immune cells can enter the eye. Therefore, the uvea becomes inflamed more easily than other eye tissues, which explains the histological basis of the high morbidity of uveitis.

The following five major clinical symptoms occur during the initiation and progression of uveitis: painful eye(s), bloodshot eye(s), sensitivity to light, cloudy vision, and floaters (Urban et al., 2014). In the early stage of the disease, the patient's eye(s) may manifest only redness and conjunctivitis with no visual defects. As the disease progresses quickly, blindness and five common complications, namely glaucoma, cataracts, optic nerve damage, retinal detachment, and permanent vision loss, are usually identified in the patient population (Sen et al., 2015). The specific cause of uveitis cannot be clearly identified in most clinical cases because of its complicated candidate pathogenesis. In general, the top causes of uveitis at the histological and pathogenic level can be clustered into five subgroups: (1) eye injury or surgery; (2) autoimmune disorder; (3) inflammatory disorder; (4) eye tissue-specific infection; and (5) cancer (Kalinina Ayuso et al., 2014). All such causes can be summarized as exogenous environmental effects and endogenous genetic contributions. According to recent publications, genetic and infectious contributions have received increasing attention and are widely regarded as the top two pathogenic factors. The infectious progression and related immune responses of various infections, including brucellosis (Akyol et al., 2015), herpesviruses (Bojanova et al., 2013), and leptospirosis (Loureiro et al., 2013), contribute to the progression of uveitis, reflecting the pathogenic contribution of exogenous factors to uveitis.

Apart from exogenous factors such as infections, genetic contributions also play a major role in the pathogenesis of uveitis. In 2014, a functional gene named FOXO1 was confirmed to participate in the pathogenesis of acute anterior uveitis, thus reflecting the specific endogenous role of this gene in uveitis (Yu et al., 2014). In 2015, another study on acute anterior uveitis confirmed that a specific immune associated gene, which is named C5 and encodes complement C5, contributes to its immune associated pathogenesis, thereby reflecting the complicated pathogenesis of the disease (Xu et al., 2015). Other complement associated genes and interleukin related genes have also been identified in different subtypes of uveitis, confirming the genetic contribution to the disease (Yang et al., 2012). In 2015, a study (Ildefonso et al., 2015) on gene therapy for uveitis revealed that the modification and recruitment of specific protein domains encoded by functional genes can reduce the ocular inflammatory response and relieve the symptoms. This finding indicated that genetic contributions may be at least one of the major pathogenic factors of uveitis.

Identifying the core pathogenic genetic factors and revealing the detailed pathogenic mechanisms by experimental routines are difficult because of the organizational specificity (eye), relatively low incidence, and complicated pathogenesis of uveitis. In recent years, a growing number of computational methods (Tang et al., 2017; Zeng et al., 2017, 2018; Chen et al., 2018b; Pan et al., 2018; Wang et al., 2018) have been designed to investigate different diseases, thereby helping to uncover their pathogenic mechanisms. For uveitis, in 2017, uveitis-related genes were identified by a computational method (Lu et al., 2017), which adopted the classic network algorithm named random walk with restart (RWR) (Kohler et al., 2008; Li and Li, 2012) to search for novel genes in a protein–protein interaction (PPI) network. In the present study, we employed another network algorithm named Laplacian heat diffusion (LHD) to build a new computational method for inferring novel uveitis-related genes. The LHD algorithm follows different principles from the RWR algorithm and thus may help us extract novel genes that cannot be identified by the method in Lu et al. (2017). In addition, the proposed method adopted several tests to screen out the most related genes. Finally, 59 genes were obtained by our method, and only two of them were also reported in the previous study (Lu et al., 2017). We conducted an extensive analysis on several of these genes to show the reliability of our method. The new findings reported in the present study may aid in revealing the detailed pathogenic mechanisms of uveitis.

# MATERIALS AND METHODS

### Uveitis-Related Genes

We extracted uveitis-related genes from literature indexed by PubMed<sup>1</sup>. In the search bar, we set "uveitis" and "genes" as the keywords, thus obtaining 744 published articles. Among these articles, 98 were review articles, in which several well-supported uveitis-related genes were reported. A total of 121 genes were selected by manual review. These genes are important for uveitis or specific uveitis symptoms and thus were called uveitis-related genes in this study. Because we adopted a PPI network to infer novel uveitis-related genes from these genes, the proteins encoded by the 121 uveitis-related genes were obtained and mapped onto their Ensembl IDs. Finally, 113 Ensembl IDs were obtained. The 121 uveitis genes and the Ensembl IDs of their proteins are provided in **Supplementary Table S1**.

# Construction of PPI Network

PPI information is useful for studying different protein- or gene-related problems (Hu et al., 2011a,b; Gao P. et al., 2012; Gao Y.F. et al., 2012; Chen et al., 2016a, 2018c; Huang et al., 2016; Zhang et al., 2016; Cai et al., 2017; Lu et al., 2017; Li et al., 2018a). Most studies using this information reported that two proteins that interact with each other often share similar functions. The proteins encoded by uveitis-related genes may have some common functions, which may also be shared by the proteins that interact with them. This reasoning can be applied repeatedly. If we start from the proteins encoded by uveitis-related genes and diffuse their status to their neighbors and neighbors' neighbors, then some novel proteins that are strongly associated with proteins encoded by uveitis-related genes can be extracted. Their genes may be novel uveitis-related genes. We need a PPI network to complete these procedures. Here, we used the PPI network reported in STRING<sup>2</sup> (version 10.0). Compared with PPI networks reported in other databases, such as DIP (Database of Interacting Proteins) (Xenarios et al., 2000) and BioGRID (Stark et al., 2006), which are constructed using experimentally determined PPIs, the PPI network used in this study further contains functional associations between proteins. The PPIs in STRING were collected from the following sources: (1) genomic context predictions; (2) high-throughput lab experiments; (3) (conserved) co-expression; (4) automated text mining; and (5) previous knowledge in databases. Thus, they measure the associations between proteins from several perspectives, providing more opportunities to infer novel uveitis-related genes. The file "9606.protein.links.v10.txt.gz" was retrieved from STRING to construct this PPI network. This file contains a large number of human PPIs, each of which is assigned a score indicating its strength. The constructed PPI network took proteins, represented by Ensembl IDs, as nodes. Two nodes were adjacent if and only if their corresponding proteins can interact with each other. Furthermore, each edge in the PPI network was assigned a weight, which was defined as the score of its corresponding PPI. For ease of description, the constructed PPI network is denoted as N in the following text.

<sup>1</sup>http://www.ncbi.nlm.nih.gov/pubmed/

<sup>2</sup>https://string-db.org

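The construction of the weighted network N can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. It assumes that each data line of the STRING links file holds two protein IDs and an integer score separated by whitespace, and that IDs carry a taxonomy prefix such as `9606.`. The function name `build_ppi_network`, the `min_score` parameter, and the toy IDs are hypothetical.

```python
from collections import defaultdict


def build_ppi_network(lines, min_score=0):
    """Build an undirected weighted network from STRING-style link lines.

    Assumption: every data line reads 'protein1 protein2 combined_score'.
    Returns a dict mapping node -> {neighbour: score}.
    """
    network = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[0] == "protein1":
            continue  # skip the header line and malformed lines
        a, b, score = parts[0], parts[1], int(parts[2])
        if a == b or score < min_score:
            continue
        # Drop the taxonomy prefix ('9606.') to keep the bare Ensembl ID.
        a, b = a.split(".", 1)[-1], b.split(".", 1)[-1]
        network[a][b] = score  # edge weight = PPI score
        network[b][a] = score  # the network is undirected
    return dict(network)


# Toy input with made-up IDs; the real file would be read line by line.
sample = [
    "protein1 protein2 combined_score",
    "9606.ENSP_A 9606.ENSP_B 490",
    "9606.ENSP_A 9606.ENSP_C 960",
]
N = build_ppi_network(sample)
```

For the actual compressed file, the lines could be supplied by `gzip.open(path, "rt")` from the standard library.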
## Method for Inferring Novel Uveitis-Related Genes

fgene-09-00425 October 4, 2018 Time: 15:24 # 3

Inferring novel disease-related genes at the network level has become quite popular (Barabasi et al., 2011). Several classic network algorithms, such as the shortest path algorithm (Gormen et al., 1990; Gui et al., 2015; Chen et al., 2016a; Zhang et al., 2016; Cai et al., 2017; Chen et al., 2018a) and the RWR algorithm (Chen et al., 2017a, 2018c; Li et al., 2017, 2018a; Yuan and Lu, 2017; Zhang et al., 2017), have been applied to develop different computational methods in this regard. A recent publication (Lu et al., 2017) proposed an RWR-based computational method to identify novel uveitis-related genes and reported several of them. In this study, another classic network algorithm, the LHD algorithm (Leiserson et al., 2015), was employed to construct a novel computational method to infer novel uveitis-related genes that were not reported in Lu et al. (2017).

#### LHD Algorithm

As a classic network diffusion algorithm, the heat diffusion algorithm starts from some nodes, called seed nodes, and transmits predefined heat on these nodes to other nodes in the network. The heat assigned to a node represents the strength of the associations between that node and the seed nodes. Here, we adopted one kind of heat diffusion algorithm named LHD (Leiserson et al., 2015) to infer novel uveitis-related genes in the PPI network N. A brief description of the LHD algorithm is given below.

Given a PPI network N, let A be its adjacency matrix and D be a diagonal matrix storing the degree of each node. The graph Laplacian L is defined as D − A. The 113 Ensembl IDs assigned to uveitis-related genes were used as seed nodes in the LHD algorithm. An initial heat distribution vector H<sub>0</sub> was constructed such that the components corresponding to seed nodes were set to 1/113 and all others were set to zero. The heat distribution vector at time t can be computed by

$$H_t = H_0 \cdot \exp\left(-Lt\right) \tag{1}$$

where exp( ) is the matrix exponential. By setting a series of increasing values of t, we can obtain a series of heat distribution vectors. When two consecutive distribution vectors are quite similar, one vector is assigned as the output of LHD algorithm. In the output vector, each node, including seed nodes, received a heat value. We only extracted the heat values of non-seed nodes, which would be further used for selecting important genes.
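Equation 1 and the stopping rule described above can be sketched for a small graph in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: it approximates the matrix exponential by a truncated Taylor series over short time steps, treats D as the weighted degree, and returns the later of the two similar vectors. The names `heat_diffusion` and `diffuse_until_stable`, the tolerance `tol`, and the toy graph are hypothetical. For a network of the size of N, a sparse routine such as `scipy.sparse.linalg.expm_multiply` would be a more realistic choice.

```python
import math


def heat_diffusion(adj, seeds, t, terms=20):
    """Approximate H_t = H_0 · exp(-L t) with L = D - A on a small graph.

    adj   : dict mapping node -> {neighbour: weight}, assumed symmetric
    seeds : nodes that share the initial heat equally
    """
    nodes = list(adj)
    degree = {v: sum(adj[v].values()) for v in nodes}
    heat = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    if t == 0:
        return heat
    # Split t into short steps so that the Taylor series of exp(-L dt)
    # converges quickly (the norm of L is at most twice the largest degree).
    steps = max(1, math.ceil(2 * max(degree.values()) * t))
    dt = t / steps
    for _ in range(steps):
        term, total = dict(heat), dict(heat)
        for k in range(1, terms):
            # term <- term · (-L dt) / k; L is symmetric, so H·L equals L·H.
            term = {
                v: -dt / k * (degree[v] * term[v]
                              - sum(w * term[u] for u, w in adj[v].items()))
                for v in nodes
            }
            for v in nodes:
                total[v] += term[v]
        heat = total
    return heat


def diffuse_until_stable(adj, seeds, times, tol=1e-6):
    """Return the first heat vector that differs from its predecessor by < tol."""
    previous = None
    for t in times:
        current = heat_diffusion(adj, seeds, t)
        if previous is not None and max(
            abs(current[v] - previous[v]) for v in adj
        ) < tol:
            return current
        previous = current
    return previous


# Toy path graph a - b - c with unit weights; 'a' is the only seed node.
toy = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1}}
heat = diffuse_until_stable(toy, {"a"}, times=[1, 2, 4, 8, 16, 32, 64])
```

On this toy graph the total heat stays constant and the stable distribution is uniform, which illustrates why heat values alone depend strongly on network structure.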

#### Permutation Test

Each node received a heat value based on the LHD algorithm. However, this value may be affected by the structure of the PPI network N, i.e., some nodes may have high probabilities of receiving high heat values regardless of which nodes are selected as seed nodes. Such nodes should not be selected as candidate genes of uveitis. To control for this type of node, we performed a permutation test. We constructed 500 Ensembl ID sets, each comprising 113 Ensembl IDs that were randomly selected from the nodes in the PPI network N. For each set, the nodes were taken as seed nodes and fed into the LHD algorithm. After the 500 sets were tested, each node had received 500 heat values. By comparing these values with the heat value obtained by using the 113 Ensembl IDs of uveitis-related genes, we can compute a measurement, called zscore, to evaluate the reliability of the actual heat value. The zscore is defined as:

$$zscore\left(g\right) = \frac{h\left(g\right) - \mu\left(g\right)}{\delta\left(g\right)} \tag{2}$$

where h(g) is the heat value of gene g obtained by using the 113 Ensembl IDs of uveitis-related genes as seed nodes, and µ(g) and δ(g) are the mean and standard deviation, respectively, of the heat values that g received over the 500 randomly produced sets. A zscore of 1.96 corresponds to the two-sided 5% critical value of the standard normal distribution and was therefore used as the threshold for selecting genes that significantly correlate with uveitis-related genes. Thus, we extracted Ensembl IDs with zscores no less than 1.96. These IDs were further analyzed by the following tests.
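The permutation test can be sketched as follows. This is a hedged illustration, not the authors' code: the `diffuse` argument stands for any routine that maps a seed set to a heat value per node (for example, an LHD implementation), the population standard deviation is an assumption because the text does not say which estimator was used, and returning 0.0 for a constant background is likewise an assumption. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def zscore(actual, background):
    """(actual - mean) / standard deviation of the background values."""
    sd = pstdev(background)  # assumption: population standard deviation
    if sd == 0:
        return 0.0  # assumption: a constant background is uninformative
    return (actual - mean(background)) / sd


def permutation_zscores(nodes, seeds, diffuse, n_perm=500, rng=None):
    """Z-score every non-seed node against n_perm random seed sets.

    diffuse(seed_set) must return a dict mapping node -> heat value.
    """
    rng = rng or random.Random(0)
    nodes = list(nodes)
    actual = diffuse(set(seeds))
    background = {v: [] for v in nodes}
    for _ in range(n_perm):
        # Random seed set of the same size as the real one.
        random_seeds = set(rng.sample(nodes, len(seeds)))
        heat = diffuse(random_seeds)
        for v in nodes:
            background[v].append(heat[v])
    return {
        v: zscore(actual[v], background[v]) for v in nodes if v not in seeds
    }


def select_candidates(zscores, threshold=1.96):
    """Keep nodes whose z-score is no less than the threshold."""
    return {v for v, z in zscores.items() if z >= threshold}
```

A node that is hot under almost any seed set has a large µ(g), so its zscore stays low even when h(g) is high, which is the structural bias the test is meant to remove.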

#### Interaction Test

Large numbers of genes were discarded through the permutation test. Of the remaining genes, some were strongly associated with uveitis-related genes, indicating a close relationship to uveitis. By contrast, others had few or no associations with uveitis-related genes, implying that they were not related to uveitis and should be discarded. To quantify the association between the candidate genes passing the permutation test and the uveitis-related genes, we employed the PPI information mentioned in Section "Construction of PPI Network", again using the PPI score as the strength of each interaction. The score of the PPI between proteins p<sub>1</sub> and p<sub>2</sub> was denoted by S(p<sub>1</sub>, p<sub>2</sub>). For one candidate gene g, we assigned a measurement called maximum interaction score (MIS), which was defined as:

$$\mathrm{MIS}\left(g\right) = \max\left\{ S\left(g, g'\right) : g' \text{ is a uveitis-related gene} \right\} \tag{3}$$

A high MIS indicates that the gene was highly related to at least one uveitis-related gene. Accordingly, this specific gene may also share the functions shared by this uveitis-related gene and thus has a high probability to become a novel uveitis-related gene. According to the setting in STRING, 900 was the threshold of highest confidence and was also set as the threshold of MIS, i.e., candidate genes passing the permutation test were retained if their MISs were no less than 900.
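Equation 3 and the 900 threshold translate directly into code. The sketch below assumes the network is stored as a dictionary of neighbour scores and that a missing edge counts as a score of 0. The function names and the toy scores are hypothetical.

```python
def max_interaction_score(candidate, seeds, network):
    """Largest PPI score between the candidate and any uveitis-related gene.

    network: dict mapping node -> {neighbour: score}; missing edges count as 0.
    """
    neighbours = network.get(candidate, {})
    return max((neighbours.get(s, 0) for s in seeds), default=0)


def interaction_test(candidates, seeds, network, threshold=900):
    """Keep candidates whose MIS is no less than the threshold."""
    return {
        g for g in candidates
        if max_interaction_score(g, seeds, network) >= threshold
    }


# Toy example: g1 has a highest-confidence link to seed s1, g2 does not.
toy_network = {
    "g1": {"s1": 950, "s2": 400},
    "g2": {"s1": 700},
    "s1": {"g1": 950, "g2": 700},
    "s2": {"g1": 400},
}
kept = interaction_test({"g1", "g2"}, {"s1", "s2"}, toy_network)
```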

#### Function Test

The last test measured the linkage between candidate genes and uveitis-related genes based on gene ontology (GO) terms and biological pathways. Each gene has special relationships with some GO terms or pathways. If one candidate gene exhibits GO terms or pathways that are similar to those of a uveitis-related gene, then it may be highly related to this uveitis-related gene and thus shows a high probability of being a novel uveitis gene. To complete this test, we employed a scheme to quantify the relationship between a gene and a GO term (pathway). Here, we adopted the GO and KEGG enrichment theory (Chen et al., 2016b, 2017a,b, 2018c; Liu et al., 2016; Lu et al., 2017; Li et al., 2018a,b), which can transform the relationship between one gene and one GO term (KEGG pathway) into a number. A vector, denoted as ES(g), can be obtained by collecting the numbers of one gene g over all GO terms and KEGG pathways. We further used the direction cosine of two vectors ES(g) and ES(g′) to measure the linkage between g and g′ in terms of GO terms and KEGG pathways. The direction cosine was defined as:

$$C\left(g, g'\right) = \frac{ES\left(g\right) \cdot ES\left(g'\right)}{\left\|ES\left(g\right)\right\| \cdot \left\|ES\left(g'\right)\right\|} \tag{4}$$

where ES(g) · ES(g′) is the dot product of the two vectors and ||ES(g)|| is the modulus of the vector ES(g). A high outcome of equation 4 suggests strong associations between two genes. We assigned the last measurement, named maximum function score (MFS), which is similar to MIS, to the candidate gene g. MFS can be computed by:

$$\mathrm{MFS}\left(g\right) = \max\left\{ C\left(g, g'\right) : g' \text{ is a uveitis-related gene} \right\} \tag{5}$$

Then, 0.97 was set as the threshold of MFS to extract final candidate genes. For convenience, the final obtained genes were called inferred genes.
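Equations 4 and 5 can be sketched as below. The enrichment vectors here are short made-up lists; in the study each component would be the enrichment value of the gene on one GO term or KEGG pathway. The function names and the convention of returning 0.0 for a zero vector are assumptions.

```python
import math


def cosine(u, v):
    """Direction cosine of two equal-length vectors (equation 4)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0


def max_function_score(candidate_vec, seed_vecs):
    """Largest cosine between the candidate and any seed vector (equation 5)."""
    return max((cosine(candidate_vec, s) for s in seed_vecs), default=0.0)


# Toy enrichment vectors over three hypothetical GO terms / pathways.
es_seed = {"s1": [2.0, 0.0, 1.0], "s2": [0.0, 3.0, 0.0]}
es_candidate = [4.0, 0.0, 2.0]  # same direction as s1, orthogonal to s2
mfs = max_function_score(es_candidate, es_seed.values())
passes = mfs >= 0.97
```

Because the cosine ignores vector length, a candidate is rewarded for having the same enrichment profile as a validated gene, not for having large enrichment values overall.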

### RESULTS

In this study, we set up a computational method to infer novel uveitis-related genes based on validated ones retrieved from published literature. All procedures are shown in **Figure 1**. This section gives the detailed results of this method.

The Ensembl IDs of uveitis-related genes were used as seed nodes for the LHD algorithm. Except for the uveitis-related ones, all genes were assigned heat values, which are available in **Supplementary Table S2**. However, these values may be affected by the structure of the PPI network N, so genes receiving high heat values were not always highly related to uveitis. Accordingly, a permutation test was performed. A zscore was computed for each candidate gene; these values are also available in **Supplementary Table S2**. According to Section "Permutation Test", we selected genes with zscores no less than 1.96, yielding 1,287 candidate genes.

The 1,287 candidate genes were further filtered by the interaction test. The measurement MIS was calculated and assigned to each of these genes and is listed in **Supplementary Table S2**. We set 900 as its threshold, resulting in 391 candidate genes. In the function test, each of the 391 candidate genes was evaluated by MFS (see **Supplementary Table S2**). We set 0.97 as the threshold and finally obtained 59 genes (see first 59 genes in **Supplementary Table S2**). These genes were regarded to be highly related to uveitis and thus called inferred genes.

FIGURE 1 | Flow chart showing the detailed procedures of the computational method for inferring novel uveitis-related genes. Ensembl IDs of uveitis-related genes were used as the seed nodes of Laplacian heat diffusion (LHD) algorithm, resulting in a heat value for each gene in the PPI network. In the permutation test, the LHD algorithm was executed 500 times with different seed nodes, yielding 500 heat values for each gene. Then, a zscore (cf. equation 2) of each gene was calculated and those with zscores less than 1.96 were discarded. The interaction test assessed each candidate gene by checking its associations to validated genes and excluded candidate genes with MISs (cf. equation 3) less than 900. Finally, the remaining genes were evaluated in the function test, which measured candidate genes by investigating their linkages based on gene ontology (GO) terms and biological pathways. Candidate genes with MFSs (cf. equation 5) less than 0.97 were discarded. Fifty-nine inferred genes were obtained.
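The three screening steps form a simple cascade of threshold filters. The sketch below shows that cascade on a toy table; the dictionary layout, the function name `screen_candidates`, and the gene labels are hypothetical, while the three thresholds are those stated in the text.

```python
def screen_candidates(genes, z_min=1.96, mis_min=900, mfs_min=0.97):
    """Apply the permutation, interaction and function tests in order.

    genes: dict mapping name -> {'z': ..., 'mis': ..., 'mfs': ...}
           (a hypothetical layout for the per-gene measurements).
    Returns the surviving gene sets after each of the three tests.
    """
    after_permutation = {g for g, m in genes.items() if m["z"] >= z_min}
    after_interaction = {
        g for g in after_permutation if genes[g]["mis"] >= mis_min
    }
    inferred = {g for g in after_interaction if genes[g]["mfs"] >= mfs_min}
    return after_permutation, after_interaction, inferred


# Toy measurements: only g1 passes all three tests.
toy_genes = {
    "g1": {"z": 2.5, "mis": 950, "mfs": 0.98},
    "g2": {"z": 3.0, "mis": 950, "mfs": 0.50},
    "g3": {"z": 2.0, "mis": 300, "mfs": 0.99},
    "g4": {"z": 1.0, "mis": 999, "mfs": 0.99},
}
step1, step2, step3 = screen_candidates(toy_genes)
```

In the study, the corresponding set sizes were 1,287 after the permutation test, 391 after the interaction test, and 59 after the function test.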

To show that the inferred genes are likely to be novel uveitis-related genes, we extracted the linkages between the 59 inferred genes and validated uveitis-related genes from the PPI network, as shown in **Figure 2**. Each of these genes has strong associations with validated genes, suggesting that they may be novel uveitis-related genes.

# DISCUSSION

In this study, we set up a new computational method for inferring novel uveitis-related genes. The method finally produced 59 inferred genes. This section first presents a comparison of the resulting genes with those reported in a previous study (Lu et al., 2017) and then gives an extensive analysis on several inferred genes.

# Comparison With Genes Reported in a Previous Study

A previous study (Lu et al., 2017) reported 56 novel genes that were related to uveitis and were obtained by using the RWR algorithm and some screening tests. In the present study, 59 inferred uveitis-related genes were finally obtained. The Venn diagram of the two gene sets, consisting of the novel genes in Lu et al. (2017) and the inferred genes in our study, is shown in **Figure 3**.

Only two genes, JAK1 (ENSP00000343204) and MAPK8 (ENSP00000353483), were identified by both methods. The Jaccard coefficient of these two sets was 1.77%, implying that the novel genes yielded by the two methods were quite different. In addition, our inferred genes can be important supplements to the previous study if we can show that they are highly related to uveitis. This point is elaborated in the following subsection.
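The reported coefficient follows from the set sizes: the two sets share 2 genes and their union contains 56 + 59 − 2 = 113 genes, so the Jaccard coefficient is 2/113 ≈ 1.77%. The short check below reproduces this arithmetic; apart from JAK1 and MAPK8, the gene labels are placeholders rather than real gene names.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


shared = {"JAK1", "MAPK8"}
# Placeholder labels stand in for the genes unique to each method.
rwr_genes = shared | {f"rwr_only_{i}" for i in range(54)}  # 56 genes
lhd_genes = shared | {f"lhd_only_{i}" for i in range(57)}  # 59 genes
coefficient = jaccard(rwr_genes, lhd_genes)
```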

### Analysis of Inferred Genes

Our computational method identified 59 inferred genes and regarded them to be highly related to uveitis. To confirm this, we performed GO and KEGG enrichment analysis on them using the R package clusterProfiler for detailed functional annotation. The results are provided in **Supplementary Tables S3**, **S4**. The inferred genes are functionally enriched in some GO terms, such as frizzled binding (GO:0005109), G-protein coupled receptor binding (GO:0001664), non-membrane spanning protein tyrosine kinase activity (GO:0004715), and protein tyrosine kinase activity (GO:0004713). The enriched KEGG pathways included signaling pathways regulating pluripotency of stem cells (hsa04550), Wnt signaling pathway (hsa04310), and melanogenesis (hsa04916).

According to recent publications, all inferred genes can be linked to uveitis or related pathogenic processes (**Supplementary Table S5**). Here, we selected important ones for detailed analyses; their detailed information is listed in **Table 1**. According to the enrichment analysis, we clustered these genes into three functional groups, as shown in **Figure 4**.

#### Genes With Protein Tyrosine Kinase Activity (GO:0004713)

The first inferred gene is JAK3 (ENSP00000391676), a functional gene that is specifically expressed in immune cells (Lee et al., 2012) and is pathologically related to autosomal severe combined immunodeficiency disease (Notarangelo et al., 2001; Bogaert et al., 2016). Regarding its contribution to the pathogenesis of uveitis, a study (Liao et al., 2018) on ankylosing spondylitis confirmed that the expression levels of JAK1 and JAK3 are positively correlated with various autoimmune diseases, including enthesitis, ankylosing spondylitis and uveitis, thereby validating their potential pathogenic contributions.

Apart from JAK3, another gene, JAK1 (ENSP00000343204), was also inferred to be a potential pathogenic uveitis-related gene by our method. Based on the publications mentioned above, this gene was also confirmed to participate in uveitis-related pathogenesis.

The next inferred gene was BTK (ENSP00000308176). BTK contributes to the regulation of B cell maturation and proliferation under physiological or pathological conditions (Wu et al., 2014; Nagel et al., 2015). According to recent publications, BTK is pathologically connected to uveitis by interfering with immune cell differentiation (Vargas et al., 2013) and autoimmune responses (Corneth et al., 2016).

The next gene was SYK (ENSP00000364898), which contributes to the regulation of cellular responses, including proliferation, differentiation, and phagocytosis in B, T and myeloid cells (Luger et al., 2013; Hauck et al., 2015; Wang et al., 2015). Regarding its contribution to the pathogenesis of uveitis, the SYK/CARD9 signaling axis participates in the pathogenesis of autoimmune eye diseases, including ocular inflammatory disorders, uveitis, and dry eye disease (Lee et al., 2016; Hagan et al., 2018).

Another study (Wang et al., 2008) reported that FGR (ENSP00000363115) may also participate in eye diseases, including age-related macular degeneration and uveitis.

#### Genes With G-Protein Coupled Receptor Binding (GO:0001664) Capacity

In addition to JAK1 and JAK3, which regulate proliferation in immune cells, the next inferred gene, WNT16 (ENSP00000222462), is a member of the WNT gene family. It encodes a secreted signaling protein that acts as a ligand for members of the frizzled family of seven-transmembrane receptors (Wergedal et al., 2015; Ozeki et al., 2016). No direct evidence has confirmed that WNT16 participates in uveitis-specific pathogenesis. However, two recent studies (Reischl et al., 2007; Nalesso et al., 2017) reported that WNT16, together with MNT5, participates in the pathogenesis of uveitis by interfering with immune responses.

A homolog of WNT16, WNT3 (ENSP00000225512), was also predicted to participate in uveitis-specific pathogenesis. According to a recent publication on the genetic components of the Wnt/β-catenin signaling pathway (Nakatsu et al., 2011), this gene is similar to our predicted biomarker WNT3 and regulates the proliferation of eye cells, including uveal cells. The abnormal proliferation of uveal cells, especially immune cells, may trigger the initiation and progression of uveitis. Therefore, it is reasonable to speculate that WNT16 and WNT3 may be functionally related to uveitis.

The next inferred gene, WNT7B (ENSP00000341032), is also a member of the Wnt/β-catenin signaling pathway. According to the same literature mentioned above, this gene may also contribute to the pathogenic progression of uveitis due to its abnormal regulatory role in the in situ proliferation of uveal immune cells. As a homolog of WNT7B, WNT7A (ENSP00000285018) acts as one of the potential pathogenic factors of uveitis by interfering with immune cell proliferation and maturation. Similarly, WNT2B (ENSP00000358698), WNT9A (ENSP00000272164), WNT4 (ENSP00000290167), and WNT2 (ENSP00000265441) all participate in the pathogenesis of uveitis through the regulation of uveal cells, thus supporting their strong relationships with uveitis.

We identified eight WNT signaling pathway components that may contribute to the pathogenesis of uveitis. According to recent publications, our inferred genes (WNT2, WNT16, WNT3, WNT7A, and WNT7B) are functionally related to the initiation and progression of uveitis due to their interference with the proliferation of uveal and mixed immune cells.

#### Other Functional Inferred Uveitis-Related Genes

GLI2 (ENSP00000354586) was also regarded as a potential pathogenic gene of uveitis. This gene has been widely reported to act as a transcriptional regulator by encoding a C2H2-type zinc finger protein (Eichberger et al., 2006; Jackson et al., 2015). Regarding its contribution to uveitis, GLI2 regulates the Notch-Gli2 axis and the hedgehog signaling pathway; however, no direct reports confirmed its pathogenic role (Roessler et al., 2003; Ringuette et al., 2016). Under pathogenic conditions in uveal tissues, the inflammatory environment is also regulated by these two pathways (Takezaki et al., 2011; Swiderska-Syn et al., 2013). Therefore, this gene can be regarded as a potential uveitis-related gene.

ZAP70 (ENSP00000264972) is also a functional inferred uveitis-related gene. According to a recent publication, ZAP70 is a potential biomarker instructing the onset of uveitis in mouse models (Kleinwort et al., 2016). Therefore, this gene can be inferred as a uveitis-related gene.

MAPK8 (ENSP00000353483) was inferred to be a potential uveitis-related gene that participates in the specific pathogenesis of this disease. Notably, this gene was also reported in a previous study (Lu et al., 2017). In 2014, a systematic study (Lisanti et al., 2014) on tumor stroma confirmed that MAPK8 contributes to the pathogenesis of uveitis, a typical tumor complication, which is consistent with our prediction. In the same year, another independent study (Honke et al., 2014) validated that p38-MAPK8 participates in specific IL-6-mediated inflammatory responses. Therefore, the identified gene MAPK8 may also be a uveitis-associated gene, consistent with previous studies and publications.

Literature-based analysis confirmed that some of the inferred genes participate in uveitis, supporting the reliability of our results. The remaining inferred genes are left for readers to investigate; most of them are suggested to be related to uveitis.

### CONCLUSION

This study aimed to infer novel uveitis-related genes. An efficient network algorithm, the LHD algorithm, was adopted as the basic searching algorithm and was executed on a PPI network using validated uveitis-related genes as seed nodes. With the help of three screening tests, 59 functional genes were finally obtained. These novel inferred genes can be useful materials to uncover the pathogenesis of uveitis.

#### AUTHOR CONTRIBUTIONS

All authors contributed to the research and reviewed the manuscript. MY designed the study. SL, KZ, XW, and HL performed the experiments. SL, KZ, XA, and YX analyzed the results. SL and KZ wrote the manuscript.

#### SUPPLEMENTARY MATERIALS

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00425/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Lu, Zhao, Wang, Liu, Ainiwaer, Xu and Ye. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# M6AMRFS: Robust Prediction of N6-Methyladenosine Sites With Sequence-Based Features in Multiple Species

Xiaoli Qiang<sup>1</sup> , Huangrong Chen<sup>2</sup> , Xiucai Ye<sup>3</sup> , Ran Su<sup>4</sup> \* and Leyi Wei<sup>2</sup> \*

<sup>1</sup> Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China, <sup>2</sup> School of Computer Science and Technology, Tianjin University, Tianjin, China, <sup>3</sup> Department of Computer Science, University of Tsukuba, Tsukuba, Japan, <sup>4</sup> School of Software, Tianjin University, Tianjin, China

#### Edited by:

Arun Kumar Sangaiah, VIT University, India

#### Reviewed by:

Chao Pang, Columbia University Medical Center, United States Jianghan Qu, University of Southern California, United States

#### \*Correspondence:

Ran Su ran.su@tju.edu.cn Leyi Wei weileyi@tju.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 18 July 2018 Accepted: 04 October 2018 Published: 25 October 2018

#### Citation:

Qiang X, Chen H, Ye X, Su R and Wei L (2018) M6AMRFS: Robust Prediction of N6-Methyladenosine Sites With Sequence-Based Features in Multiple Species. Front. Genet. 9:495. doi: 10.3389/fgene.2018.00495

As one of the well-studied RNA methylation modifications, N6-methyladenosine (m6A) plays important roles in various biological processes, such as RNA splicing and degradation. Identification of m6A sites is fundamentally important for better understanding of their functional mechanisms. Recently, machine learning based prediction methods have emerged as an effective approach for fast and accurate identification of m6A sites. In this paper, we proposed "M6AMRFS", a new machine learning based predictor for the identification of m6A sites. In this predictor, we exploited a new feature representation algorithm to encode RNA sequences with two feature descriptors (dinucleotide binary encoding and local position-specific dinucleotide frequency), and used the F-score algorithm combined with SFS (Sequential Forward Search) to enhance the feature representation ability. To predict m6A sites, we employed the eXtreme Gradient Boosting (XGBoost) algorithm to build a predictive model. Benchmarking results showed that the proposed predictor is competitive with the state-of-the-art predictors. Importantly, robust predictions for multiple species by our predictor demonstrate that our predictive models have strong generalization ability. To the best of our knowledge, M6AMRFS is the first tool that can be used for the identification of m6A sites in multiple species. To facilitate the use of our predictor, we have established a user-friendly webserver with the implementation of M6AMRFS, which is currently available at http://server.malab.cn/M6AMRFS/. We anticipate that it will be a useful tool for the relevant research of m6A sites.

Keywords: N6-methyladenosine site, eXtreme Gradient Boosting, machine learning, feature representation, RNA methylation, feature selection

### INTRODUCTION

To date, more than 150 types of RNA modifications have been discovered (Maden, 1990; Wang X. et al., 2014). Of these modifications, N6-methyladenosine (m6A) is the most common and abundant one and exists in various species. It is found to be closely associated with diverse biological processes, such as RNA localization and degradation (Wang X. et al., 2014), RNA structural dynamics (Roost et al., 2015), alternative splicing (Liu N. et al., 2015), primary microRNA processing (Alarcón et al., 2015), cell differentiation and reprogramming (Chen et al., 2015), and regulation of circadian clock (Geula et al., 2015). Thus, identification of m6A sites is of great importance for better understanding of their functional mechanisms. In the past few years, high-throughput experimental methods, such as MERIP (Meyer et al., 2012) and m6A-seq (Dominissini et al., 2012), have been utilized to identify m6A modifications, and more and more m6A peaks have been characterized. However, they have the following limitations: (1) they cannot accurately locate the positions of m6A sites; (2) they are costly; and (3) they are not applicable to the large-scale identification of m6A sites. Hence, it is highly desirable to develop fast and accurate computational methods for the identification of m6A sites (Chen et al., 2015b, 2016).

In recent years, machine learning based prediction methods have emerged as an effective approach for predicting m6A sites. For example, Chen et al. (2015a) developed the first machine learning based predictor, called "iRNA-Methyl", for m6A site identification. They exploited physicochemical properties and sequence-order information embedded in PseDNC (pseudo dinucleotide composition) (Liu B. et al., 2015), and used a support vector machine for model construction. Later, Liu Z. et al. (2016) proposed to incorporate additional physicochemical properties coupled with a scalable transformation algorithm into their feature extraction model. To improve the predictive performance, Jia et al. proposed to fuse three types of feature descriptors, namely bi-profile Bayes, dinucleotide composition and KNN scores. Their results showed that this fusion strategy is able to achieve better performance than any single feature descriptor (Jia et al., 2016). Similarly, Xiang et al. (2016b) found that combining a binary encoding scheme with k-mer frequency could contribute to improved performance. Recently, Zhou et al. (2016) developed "SRAMP", a powerful prediction tool using multiple types of feature descriptors, including positional binary encoding of nucleotide sequence, k-nearest neighbor encoding, nucleotide pair spectrum encoding, and secondary structure pattern, to train an ensemble predictive model with random forest for the identification of m6A sites. SRAMP is reported to achieve relatively good performance as compared to other predictors. More recently, Xiang et al. (2016a) proposed a new predictor called "RNAMethyPre", using compositional information and position-specific information to build predictive models for the prediction of m6A sites in both human and mouse. Additionally, in our previous study, we proposed to use a deep learning algorithm to generate high-latent features to improve the predictive performance (Wei et al., 2018d). However, we found that most existing predictors are species-specific. Currently, no predictor is capable of predicting m6A sites for multiple species.

For this purpose, we proposed a novel sequence-based predictor, namely "M6AMRFS", for detecting m6A sites in RNA sequences. For feature extraction (Mrozek et al., 2007, 2013), we proposed a feature representation algorithm that encodes sequences with dinucleotide binary encoding and local position-specific dinucleotide frequency. To optimize the feature space, we combined the F-score algorithm with SFS (Sequential Forward Search) (Wei et al., 2018a,c,e) to improve the representation ability of our features. For model training, we trained on the optimal feature representations with the XGBoost algorithm. Our experimental results showed that the proposed M6AMRFS achieves competitive and robust performance compared with state-of-the-art predictors for four different species. To the best of our knowledge, this is the first predictor that is applicable to multiple species. Furthermore, we have established a user-friendly webserver that implements the proposed M6AMRFS, which is currently available at http://server.malab.cn/M6AMRFS/. We anticipate that it will be a useful complement to existing tools and will help to further reveal the functional mechanisms of m6A sites.

# MATERIALS AND METHODS

## Benchmark Datasets

To predict m6A sites in multiple species, we employed four benchmark datasets from four species: Saccharomyces cerevisiae, Arabidopsis thaliana, Mus musculus, and Homo sapiens. The details of the four benchmark datasets are listed in **Table 1**. In all four benchmark datasets, the positives are sequences centered on true m6A sites, while the negatives are usually sequences centered on adenines for which no m6A peak was detected. The datasets can be found at the following website: http://server.malab.cn/M6AMRFS/.

# Prediction Framework of the Proposed Predictor

**Figure 1** illustrates the overall procedure of the proposed predictor. As shown in **Figure 1**, the predictor consists of two steps. The first step is data pre-processing, including data cleaning and feature extraction. Irrelevant sequences are filtered out of the input sequences, and the remaining sequences are submitted to the feature representation algorithm, which encodes them as feature vectors. The second step is feature optimization and model training. For feature space optimization, we used the F-score algorithm combined with SFS (Sequential Forward Search) to search for the optimal features. Afterward, the resulting optimal feature representations are fed into a trained XGBoost model to predict whether the sequences are true m6A sites or not. In our predictor, the predicted outcome for each sequence is 0 or 1, where 0 denotes a non-m6A site and 1 denotes a true m6A site.

### Feature Representation

In this work, we present a new feature representation algorithm that combines two feature descriptors. One is named "Dinucleotide binary encoding" and the other is "Local position-specific dinucleotide frequency". They are described as follows.

#### Dinucleotide Binary Encoding

TABLE 1 | Summary of the benchmark datasets from four species.

This feature descriptor encapsulates the positional information of the dinucleotide at each position in the sequence. There are 16 possible dinucleotides in total. In this descriptor, each dinucleotide is encoded as a 4-dimensional 0/1 vector. For example, AA is encoded as (0,0,0,0); AT is encoded as (0,0,0,1); AC is encoded as (0,0,1,0); and so forth, until GG, which is encoded as (1,1,1,1). Therefore, using the dinucleotide binary encoding, we obtained a 160 (= 40 × 4)-dimensional 0/1 vector for the given sequence.
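The encoding can be illustrated with a short Python sketch. It is not the authors' implementation: the nucleotide order (A, T, C, G) is an assumption inferred from the four examples given in the text, and mapping the RNA base U to T is likewise an assumption made for convenience.

```python
from itertools import product

# Assumed nucleotide order (A, T, C, G), inferred from the examples in the text:
# AA -> 0000, AT -> 0001, AC -> 0010, ..., GG -> 1111.
NUCLEOTIDES = "ATCG"
DINUCLEOTIDE_CODE = {
    a + b: tuple(int(bit) for bit in format(index, "04b"))
    for index, (a, b) in enumerate(product(NUCLEOTIDES, repeat=2))
}


def dinucleotide_binary_encoding(sequence):
    """Concatenate the 4-bit codes of all overlapping dinucleotides."""
    seq = sequence.upper().replace("U", "T")  # treat RNA 'U' as 'T' (assumption)
    features = []
    for i in range(len(seq) - 1):
        features.extend(DINUCLEOTIDE_CODE[seq[i:i + 2]])
    return features
```

A sequence of 41 nucleotides contains 40 overlapping dinucleotides, so this sketch returns 160 values for it, matching the dimensionality stated above.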

#### Local Position-Specific Dinucleotide Frequency

For a given sequence, the feature vector of this descriptor can be denoted as (f<sub>2</sub>, f<sub>3</sub>, ..., f<sub>l</sub>), where f<sub>i</sub> is calculated as follows:

$$f_i = \frac{1}{|N_i|} C\left(X_{i-1}X_i\right), \quad 2 \le i \le l$$

where l is the length of the given sequence, |N<sub>i</sub>| is the length of the i-th prefix string {X<sub>1</sub>X<sub>2</sub>...X<sub>i</sub>} of the sequence, and C(X<sub>i−1</sub>X<sub>i</sub>) is the number of occurrences, within the i-th prefix string, of the dinucleotide X<sub>i−1</sub>X<sub>i</sub> found at position i.
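A minimal Python sketch of this descriptor is given below, under the assumption that occurrences are counted over overlapping dinucleotides and that |N<sub>i</sub>| equals i. The function name is illustrative only.

```python
def local_position_specific_dinucleotide_frequency(sequence):
    """Return [f_2, ..., f_l], where f_i = C(X_{i-1}X_i) / i on the prefix X_1..X_i."""
    seq = sequence.upper()
    counts = {}    # occurrences of each dinucleotide within the current prefix
    features = []
    for i in range(2, len(seq) + 1):    # i is the 1-based prefix length
        dinucleotide = seq[i - 2:i]     # X_{i-1} X_i
        counts[dinucleotide] = counts.get(dinucleotide, 0) + 1
        features.append(counts[dinucleotide] / i)
    return features
```

For the toy sequence ACAC, the dinucleotide AC has occurred once in the prefix AC and twice in the prefix ACAC, which gives f<sub>2</sub> = 1/2 and f<sub>4</sub> = 2/4.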

#### Feature Selection

Feature selection is an important process for improving classification performance (Mrozek et al., 2009; Mrozek et al., 2014; Zeng et al., 2016; Zou et al., 2016a,b; Liu, 2017). Here, we used the F-score algorithm together with the SFS strategy to search for the most discriminative features (Peng et al., 2005). **Figure 2** illustrates the procedure of the feature selection strategy, which is described as follows. Firstly, the F-score algorithm is used to rank all the features from the highest score to the lowest, generating a ranked feature list. Secondly, we added the features one by one from the ranked list and trained a predictive model for each resulting feature subset. Lastly, the feature subset that gives the highest accuracy of the predictive model is used as the optimal feature set. The results of feature selection are discussed in the "Results and Discussion" section.
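The three steps can be sketched in Python as follows. The text does not give the F-score formula, so the sketch assumes the commonly used two-class formulation (squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances). The `evaluate` callback is hypothetical and stands in for training and validating a classifier on a feature subset.

```python
from statistics import mean, variance


def f_score(values, labels):
    """F-score of a single feature (assumed two-class formulation)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    overall = mean(values)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Return feature indices sorted from highest to lowest F-score."""
    scores = [f_score([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def sequential_forward_search(X, y, evaluate):
    """Add ranked features one by one and keep the best-scoring prefix.

    `evaluate(X_subset, y)` is a hypothetical callback returning an accuracy
    estimate, for example the cross-validated accuracy of a classifier.
    """
    ranked = rank_features(X, y)
    best_score, best_subset = float("-inf"), []
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate([[row[j] for j in subset] for row in X], y)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score
```

Because features are only ever added in ranked order, this search trains one model per feature rather than one per candidate subset, which keeps the cost linear in the number of features.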

#### XGBoost (eXtreme Gradient Boosting)

eXtreme Gradient Boosting (XGBoost), proposed by Chen and Guestrin (2016), has been shown to be a powerful classification algorithm. To find tree splits efficiently, XGBoost enumerates candidate split points according to percentiles of the feature distribution and then selects the best split point from among these candidates. A main advantage of XGBoost is that it combines multithreading, data compression, and data sharding to improve the efficiency of the algorithm as much as possible. Moreover, the regularization terms that XGBoost adds to the loss function control the complexity of the model and help to avoid overfitting. Parameters such as the subsample ratio, the maximum tree depth, and the number of estimators are tuned with a parallelized grid search. For the implementation of XGBoost in our predictor, the maximum depth ranges from 2 to 10, the learning rate from 0.1 to 0.8, and the number of estimators from 1 to 10.
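As an illustration, the search space described above can be enumerated in plain Python. The parameter names follow the scikit-learn style interface of XGBoost (`max_depth`, `learning_rate`, `n_estimators`). The step sizes are assumptions, since the text gives only the end points of each range, and `evaluate` is a hypothetical callback rather than part of any library.

```python
from itertools import product

# Ranges taken from the text. The step sizes (1 for the integer parameters,
# 0.1 for the learning rate) are assumptions.
PARAM_GRID = {
    "max_depth": list(range(2, 11)),
    "learning_rate": [round(0.1 * k, 1) for k in range(1, 9)],
    "n_estimators": list(range(1, 11)),
}


def iter_param_combinations(grid):
    """Yield one dict per point of the Cartesian product of the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, evaluate):
    """Return the best (params, score) pair under a user-supplied `evaluate`.

    `evaluate(params)` is a hypothetical callback, for example the mean
    10-fold cross-validated accuracy of a classifier trained with `params`.
    """
    return max(
        ((params, evaluate(params)) for params in iter_param_combinations(grid)),
        key=lambda item: item[1],
    )
```

Under these assumed step sizes the grid contains 9 × 8 × 10 = 720 combinations. Each combination is evaluated independently of the others, which is what makes the search easy to parallelize.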

### Performance Evaluation

In this work, four commonly used metrics are used for performance evaluation: Acc (accuracy), Sn (sensitivity), Sp (specificity), and MCC (Matthews correlation coefficient) (Zeng et al., 2015; Lai et al., 2017; Zhang et al., 2017; Cheng et al., 2018; Su et al., 2018; Tang et al., 2018; Wei et al., 2018b; Yang et al., 2018). They are formulated as follows:

$$\begin{cases} \text{Sn} = \frac{TP}{TP + FN} \times 100\% \\ \text{Sp} = \frac{TN}{TN + FP} \times 100\% \\ \text{Acc} = \frac{TP + TN}{TP + FN + TN + FP} \times 100\% \\ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}} \end{cases}$$

where TP denotes true positive; TN denotes true negative; FP denotes false positive; and FN denotes false negative. Sn measures the predictive ability of a predictor for positive samples while Sp measures the predictive ability of a predictor for negative samples. Acc and MCC are two metrics measuring the overall performance of a predictor.
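The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. It returns Sn, Sp, and Acc as fractions rather than percentages, and returning 0.0 for an undefined MCC is a convention assumed here, not one stated in the text.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, Acc (as fractions in [0, 1]) and MCC from confusion counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denominator = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # MCC is undefined when any marginal count is zero; 0.0 is used by convention.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

For example, with TP = 40, TN = 30, FP = 20, and FN = 10, the sketch gives Sn = 0.8, Sp = 0.6, and Acc = 0.7.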

In addition, we used the Receiver Operating Characteristic (ROC) curve to evaluate the overall performance (Liu et al., 2013, 2016b). The curve plots the true positive rate (TPR) against the false positive rate (FPR) under different classification thresholds. The TPR is the same as the sensitivity described above, while the FPR is calculated as 1 − specificity. The area under the ROC curve (AUC) is commonly used as an evaluation metric (Liu et al., 2016a, 2017). The AUC takes values between 0 and 1, where 0.5 corresponds to random guessing. An AUC close to 1 indicates that the predictor has excellent performance, whereas an AUC approaching 0.5 indicates that the predictor performs little better than chance.
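The AUC can also be computed without drawing the curve, using its equivalence to the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following sketch uses that pairwise formulation. It is quadratic in the number of samples and is intended only as an illustration.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A predictor that ranks every positive above every negative reaches 1.0, and one that assigns the same score to all samples reaches 0.5.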

Additionally, we used 10-fold cross validation and the jackknife test to evaluate the predictive performance (Wei et al., 2017a; Zeng et al., 2017a,b; Liao et al., 2018; Zou et al., 2018). These two evaluation methods were chosen because existing methods in the literature used them for performance evaluation.
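The two protocols differ only in the number of folds: the jackknife test is the limiting case in which every sample forms its own fold. A minimal sketch of the index splitting, with an assumed fixed random seed for reproducibility, is:

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def jackknife_indices(n_samples):
    """Jackknife (leave-one-out): every sample is the test set exactly once."""
    return k_fold_indices(n_samples, n_samples)
```

With 10 folds each model is trained on roughly 90% of the data, whereas the jackknife test requires as many training runs as there are samples.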

# RESULTS AND DISCUSSION

#### Comparison of XGBoost and Other Classifiers

To evaluate the effectiveness of the XGBoost classifier, we compared it with six commonly used machine learning algorithms: Random Forest (RF) (Liu B. et al., 2015; Li et al., 2016; Wei et al., 2017b), Naïve Bayes (NB), Logistic Regression (LR), K-Nearest Neighbors (KNN) (Huang and Li, 2018), Support Vector Machine (SVM) (Song et al., 2010, 2012, 2018; Wang M. et al., 2014; Wei et al., 2017), and Gradient Boosting Decision Tree (GBDT) (Liao et al., 2018). For a fair comparison, all the machine learning algorithms were trained and evaluated with 10-fold cross validation on each of the benchmark datasets. The performance of the different classifiers is illustrated in **Figure 3**, and the detailed results are presented in **Table 2**.

As shown in **Table 2** and **Figure 3**, XGBoost outperforms the other classifiers on three of the four datasets. The exception is Dataset-A101, on which the SVM classifier is slightly better than XGBoost, which ranks second among the compared classifiers. On the datasets where XGBoost outperforms the other classifiers, it achieves higher Acc and MCC. Specifically, our Acc and MCC are 0.7314 and 0.4629 on Dataset-S51, which are 0.6 and 1.1% higher than those of the runner-up SVM. Similar results are observed on Dataset-H41, where XGBoost leads by 0.71 and 1.2% in terms of Acc and MCC, respectively. Moreover, on Dataset-M41, the performance of XGBoost is the same as that of RF and GBDT in terms of Acc, Sn, Sp, and MCC. In summary, our results demonstrate that, compared with other commonly used classifiers, XGBoost shows generally better and more robust performance in distinguishing true m6A sites from non-m6A sites across different species.

#### Impact of Feature Selection

In this study, we employed the F-score with SFS for feature selection. The results of feature selection are summarized in **Table 3** and illustrated in **Figure 4**. As seen from **Table 3**, before feature selection, the performance of the predictive model on Dataset-S51 is 0.7314, 0.7345, 0.7284, and 0.4629 in terms of Acc, Sn, Sp, and MCC, respectively. After applying feature selection, all the metrics improved. Specifically, the Acc and MCC improved to 0.7425 and 0.4852, respectively. This indicates that the feature selection strategy yields more informative features for distinguishing true m6A sites from non-m6A sites. For the other datasets from different species, similar results were observed. We can see from **Table 3** that almost all the performance measures were improved by feature selection, demonstrating that feature selection is an effective way to enhance the predictive performance of the predictor. Moreover, **Figure 4** shows how the Acc varies with the number of features during feature selection, and marks the optimal feature number and the corresponding highest Acc for each dataset. The optimal feature numbers for the four datasets are 85, 57, 13, and 355, giving the highest Acc of 0.7425, 0.9102, 0.8924, and 0.8105, respectively.

TABLE 3 | Performance of features before and after feature selection.

# Comparison With Other Feature Representation Algorithms

To examine the performance of the proposed feature representation algorithm, we compared it with existing feature representation algorithms, including RFH, PseDNC, PCP (physical and chemical properties), KNN (K-Nearest Neighbors), and AthMethPre. These algorithms were chosen for comparison because they were reported to have relatively strong power for the identification of m6A sites. The results are presented in **Table 4**. As we can see from **Table 4**, the proposed features are competitive with the best-performing AthMethPre features and remarkably outperform the other existing features on all four datasets. Note that on Dataset-S51 and Dataset-A101, our method performs slightly worse than the best-performing AthMethPre, while on the other two datasets, our method is slightly better. For genome-wide identification, the running time of a predictor is important as well. Therefore, we further compared the feature numbers of AthMethPre and our feature representation method. The feature numbers of the AthMethPre method for the four datasets are 540, 500, 500, and 740, while ours are 85, 57, 13, and 355, respectively. Our feature numbers are thus much smaller than those of AthMethPre on all four datasets, which suggests that our predictive models require less computation time. In general, it can be concluded that our features are effective for representing m6A sites in multiple species with different sequence lengths.

# Comparison With State-of-the-Art Predictors

To assess the effectiveness of our predictor, we compared it with existing predictors, including pRNAm-PC (Liu Z. et al., 2016), MethyRNA (Chen et al., 2017), and RFAthM6A (Wang and Yan, 2018). They were chosen because they were reported to have the best performance on the four benchmark datasets used in this work. The results are presented in **Table 5**.

TABLE 5 | Results of the proposed predictor and the state-of-the-art predictors on benchmark datasets from different species.

N.A. denotes not available.

As shown in **Table 5**, M6AMRFS outperforms pRNAm-PC on Dataset-S51. The Acc, Sn, Sp, and MCC of our predictor are 0.7425, 0.7521, 0.7339, and 0.4852, respectively, which are higher than those of the second-best predictor, pRNAm-PC, on this dataset. Specifically, our overall performance is 0.0451 and 0.0852 higher in terms of Acc and MCC, respectively. On Dataset-H41 and Dataset-M41, we observed similar results: our predictor outperforms the existing predictors. Only on Dataset-A101 does our predictor perform slightly worse than RFAthM6A. In conclusion, our results demonstrate that the proposed predictor is better than, or at least competitive with, existing predictors on multiple benchmark datasets from different species. Importantly, our predictor exhibits robust performance for multiple species, demonstrating that it is capable of capturing the characteristics of m6A sites in different species. This also implies that m6A sites from different species might share common patterns.

### CONCLUSION

In this study, we have developed a machine learning based predictor, namely M6AMRFS, for the identification of m6A sites in multiple species. We have conducted a series of comparative studies, and our experimental results indicate that our predictor is at least competitive with previously published predictors. Importantly, we found that our predictor achieves robust performance in several species. To the best of our knowledge, it is the first predictor that can provide predictions in multiple species. On further analysis, we attribute the robust performance to two possible reasons. One is the XGBoost classifier used for model training: in our comparison with other machine learning algorithms, XGBoost performed better than the other classification algorithms. The other is that our feature selection strategy adaptively selects the optimal features for each species. We anticipate that the tool and webserver we have established will help to reveal the functional mechanisms of m6A sites.

# AUTHOR CONTRIBUTIONS

XQ and HC wrote the manuscript. HC developed the webserver and analyzed the results. XY analyzed the results. RS and LW designed the experiments. All authors read and approved the manuscript.

### FUNDING

The work was supported by the National Natural Science Foundation of China (Nos. 61701340 and 61702361).

## REFERENCES




by incorporating hexamer composition into general PseKNC. Int. J. Biol. Sci. 14, 883–891. doi: 10.7150/ijbs.24616


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Qiang, Chen, Ye, Su and Wei. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification and Analysis of Rice Yield-Related Candidate Genes by Walking on the Functional Network

Jing Jiang<sup>1</sup> , Fei Xing<sup>1</sup> , Chunyu Wang<sup>2</sup> \* and Xiangxiang Zeng<sup>3</sup> \*

<sup>1</sup> School of Aerospace Engineering, Xiamen University, Xiamen, China, <sup>2</sup> School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, <sup>3</sup> School of Information Science and Engineering, Xiamen University, Xiamen, China

Rice (Oryza sativa L.) is one of the most important staple foods in the world. It is possible to identify candidate genes associated with rice yield using the model of random walk with restart on a functional similarity network. We demonstrated the high performance of this approach by a five-fold cross-validation experiment, as well as the robustness of the parameter r. We also assessed the strength of associations between known seeds and candidate genes in the light of the results scores. The candidates ranking at the top of the results list were considered to be the most relevant rice yield-related genes. This study provides a valuable alternative for rice breeding and biology research. The relevant dataset and script can be downloaded at the website: http://lab.malab.cn/~jj/rice.htm.

#### Edited by:

Arun Kumar Sangaiah, VIT University, India

#### Reviewed by:

Jing Lu, Walmart Labs, United States Feng Zhu, Zhejiang University, China

#### \*Correspondence:

Chunyu Wang chunyu@hit.edu.cn Xiangxiang Zeng xzeng@xmu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

Received: 28 June 2018 Accepted: 30 October 2018 Published: 20 November 2018

#### Citation:

Jiang J, Xing F, Wang C and Zeng X (2018) Identification and Analysis of Rice Yield-Related Candidate Genes by Walking on the Functional Network. Front. Plant Sci. 9:1685. doi: 10.3389/fpls.2018.01685

Keywords: rice, yield, random walking, function, network

# INTRODUCTION

Rice (Oryza sativa L.) is one of the most important food crops worldwide, being used as the main food source by more than half of the global population (Mahender et al., 2016; Li et al., 2017). In the developing world, rice provides 27% of dietary energy and 20% of dietary protein (Huang et al., 2013). However, despite genetic improvements in grain yield delivered by the exploitation of semi-dwarfism and heterosis over the past 50 years, a substantial increase in grain productivity of the major crops is still required to feed a growing world population (Abe et al., 2018). The prime breeding target is to increase both grain size and grain number, because they affect both yield potential and end-use quality (Okada et al., 2018). However, the simultaneous improvement of grain quality and grain yield is a major challenge because of the well-established negative correlation between these two traits, which is controlled by quantitative trait loci and influenced by environmental changes. Additionally, which genes in quantitative trait loci regulate grain size and number has not been clarified (Borzee et al., 2018; Li et al., 2018). Therefore, the identification of genetic variants associated with improvements in grain yield would facilitate the breeding of new high-yielding rice varieties and may also be applicable to other crops (You et al., 2017).

Vast numbers of genetic variants have been detected by traditional genome-wide association studies and recent sequencing studies, and connecting the functional implications of these results to known genes has become a standard task (Li et al., 2015; Dehury et al., 2017; Torres and Henry, 2018; Wu et al., 2018). We previously developed a database, RicyerDB, to collect all known rice yield-related genes by integrating multiple omics data, information from the literature, and associated databases (Jiang et al., 2018). This work also established a search tool to query a particular gene, and to provide insights into gene functions and locations. Any rice yield-related gene can therefore be easily queried and the findings downloaded through the webpage, while candidate genes can be screened and prioritized to identify those most likely to be associated with known genes.

**Abbreviations:** AUC, area under the ROC curve; GO, gene ontology; ROC, receiver operating characteristic; RWR, random walk with restart.

To achieve this goal, several approaches have been proposed from the perspective of computational systems biology (Behroozi-Khazaei and Nasirahmadi, 2017; He et al., 2017; Liu E. et al., 2017; Liu Y. et al., 2017; Xiong et al., 2017; Maione and Barbosa, 2018; Zhang M. et al., 2018; Zhou et al., 2018). For example, the Endeavor tool uses the guilt-by-association principle to rank candidate genes according to their functional similarities to a set of predefined seed genes (Aerts et al., 2006; Tranchevent et al., 2008, 2016). In recent years, a protein–protein interaction (PPI) network has been developed to achieve a global inference of entire genes (Liu et al., 2010; Lee, 2011; Rezadoost et al., 2016; Wang et al., 2016; Zeng et al., 2016; Luo and Liu, 2017; Holland and Johnson, 2018; Vlaic et al., 2018). PPI networks have also been used to provide a simplified yet systematic measure of functional similarities between genes (Chen et al., 2017a, 2018a).

Some methods for identifying yield-related genes have linked profile and sequence technology to facilitate the prediction of related genes. For example, Odilbekov et al. (2018) used machine learning and integrated this analysis with data obtained from spectroradiometer, infrared thermometer, and chlorophyll fluorescence measurements to identify the most predictive proxy measurements for studying Septoria tritici blotch disease of wheat.

Hybrid breeding is an effective tool to improve yield in rice, although parental selection remains a difficult issue. Xu et al. (2018) compared six genomic selection methods, such as the least absolute shrinkage and selection operator and the support vector machine, to evaluate the predictabilities of the different methods, and demonstrated their use in predicting the hybrid performance of rice. Although good results have been achieved by these studies, microarray and sequencing techniques are nevertheless expensive.

The main aim of this research was to use current knowledge to identify rice yield-related genes with network prediction methods. We proposed a computational systems biology approach for the identification of candidate genes via a random walk model on a PPI network with functional similarities (Kohler et al., 2008). Starting from known nodes, our method simulates a process in which a random walker either travels to a neighboring node or restarts from the known nodes, scores each gene by the probability that the walker is at that gene in the steady state, and then ranks candidate genes according to their scores. Using a series of cross-validation experiments, we systematically demonstrated the robustness of our method, and applied our approach to predict a landscape of associations between known genes and candidates.

# MATERIALS AND METHODS

# Flowchart Overview

We modeled the problem of identifying candidate genes associated with a set of known genes as a prioritization problem, and proposed to solve this problem using a three-step approach. As shown in **Figure 1**, taking the set of known genes as input, we first standardized the genes between STRING (Szklarczyk et al., 2015) and RicyerDB (Jiang et al., 2018). Then, we constructed a protein–protein network that scores the edges through functional similarities. This procedure applied a RWR algorithm to the network to calculate a score for each candidate gene, and then ranked the candidates to obtain a ranking list as the output (Chen et al., 2012a,b; Chen, 2016; Chen X. et al., 2016; Li et al., 2016; Peng et al., 2016; Zhu et al., 2018). Finally, the top candidate gene was verified according to its function and by the published literature.

# Construction of the Functional Similarity Network

The functional similarity network is described as a graph G = (V, E), where V represents the nodes of the network and E stands for the edges of the network. The background network comes from the STRING database because of existing potential associated interactions among the proteins. The known rice yield-related genes were identified from our previous work with RicyerDB (Jiang et al., 2018). To standardize gene names between STRING and RicyerDB, genes were retrieved by reference to National Center for Biotechnology Information gene names. Functional similarities among genes in the background network were considered by scoring E for GO annotations. Using the latest release of the GO database (Ashburner et al., 2000; Chen L. et al., 2016; Raza, 2016; The Gene Ontology, 2017), edges were scored for a shared functional significance score of genes in the network that were annotated with GO terms.

The shared functional significance score F(i,j) between genes i and j was measured by the Weighted Shared Functions approach, which treats a gene's functions as a set of functional categories in GO. Functions shared by a small number of genes are taken to be far more significant than those shared by a large number of genes. Each function has its own significance, defined as the inverse of the number of genes sharing the function. When two genes, i and j, have m functions in common, i.e., F(i)∩F(j) = {f<sub>1</sub>, f<sub>2</sub>, ..., f<sub>m</sub>}, F(i,j) is given by the sum of the significance of the functions shared between them:

$$F(i,j) = \sum_{n=1}^{m} \text{sig}(f_n)$$

$$\text{sig}(f_n) = \frac{1}{|Gene(f_n)|}$$

Here sig(f<sub>n</sub>) denotes the significance of a function f<sub>n</sub> (n = 1, 2, ..., m) shared between genes i and j, and |Gene(f<sub>n</sub>)| is the number of genes sharing the function f<sub>n</sub>. We calculated the ranking score, p, for each gene in the network and ranked the genes in descending order of p.
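The edge score can be illustrated with a small Python sketch. It assumes that the annotations are available as a dictionary mapping each gene to the set of its GO terms. The function names are illustrative and the sketch is not taken from the authors' script.

```python
from collections import Counter


def function_significance(annotations):
    """sig(f) = 1 / (number of genes annotated with function f)."""
    gene_counts = Counter(f for functions in annotations.values() for f in functions)
    return {f: 1.0 / count for f, count in gene_counts.items()}


def shared_function_score(gene_i, gene_j, annotations, significance):
    """F(i, j): sum of sig(f) over the functions shared by the two genes."""
    shared = annotations[gene_i] & annotations[gene_j]
    return sum(significance[f] for f in shared)
```

As a toy example, if one function is annotated to three genes and another to two genes, a pair of genes sharing both functions receives a score of 1/3 + 1/2 = 5/6, so the rarer function contributes more to the edge weight.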

# Random Walking on the Functional Similarity Network


We achieved the goal of identifying candidates related to known seeds by calculating a score for each candidate and then ranking the candidates to obtain a ranking list. The higher the rank, the more likely the gene was to be related to the given source nodes. For this purpose, we adapted the RWR method in the functional similarity network.

At the beginning, the walker chooses the seeds as the starting point. In each step of the walking process, the walker may start on a new journey with probability r or move on with probability 1−r. When moving on, the walker may move at random to one of its direct neighbors.

In our application, the initial probability vector P<sub>0</sub> was constructed such that equal probabilities were assigned to the nodes representing the seed genes, with the sum of the probabilities equal to 1. This is equivalent to letting the random walker begin from each of the known seed genes with equal probability. The transition matrix W is the column-normalized adjacency matrix of the graph, and P<sub>t</sub> is a vector in which the i-th element holds the probability of being at node i at time step t. Formally, the RWR is defined as:

$$P_{t+1} = (1 - r)WP_t + rP_0 \tag{1}$$

Candidate genes were ranked according to the values in the steady-state probability vector P. Although P<sub>t</sub> changes with the time step t, the steady state can be obtained by iterating Equation (1) until convergence. The iteration stops when the change between P<sub>t</sub> and P<sub>t+1</sub> falls below 10<sup>−10</sup>. In this paper, we set the default value of the parameter r to 0.3 (see Results section for details).
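The iteration can be sketched in pure Python as shown below. This is a minimal illustration rather than the script distributed with the paper. It assumes a symmetric weighted adjacency structure in which every neighbor also appears as a key, and it measures the change between successive vectors with the L1 norm, which the text does not specify.

```python
def random_walk_with_restart(adjacency, seeds, r=0.3, tol=1e-10, max_iter=10000):
    """Iterate P_{t+1} = (1 - r) W P_t + r P_0 until the L1 change is below tol.

    `adjacency` maps each node to a dict {neighbor: weight} and is assumed to
    be symmetric; W is its column-normalized form.
    """
    nodes = list(adjacency)
    seeds = set(seeds)
    column_sum = {u: sum(adjacency[u].values()) for u in nodes}
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: r * p0[u] for u in nodes}
        for u in nodes:
            if column_sum[u] == 0:
                continue  # isolated node: no neighbors to pass probability to
            for v, w in adjacency[u].items():
                # column u of W spreads p[u] in proportion to the edge weights
                nxt[v] += (1 - r) * p[u] * w / column_sum[u]
        change = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if change < tol:
            break
    return p
```

On a connected graph the returned probabilities sum to 1, and nodes close to the seeds receive the highest scores, which is the property the ranking relies on.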

### Validation Method


We adopted a five-fold cross-validation experiment to assess the capability of RWR to recover left-out seeds. All seed genes were divided equally into five parts; one part was then removed from the seed set as a test set and added to the candidate genes. All candidate genes were ranked by RWR to determine the ranking of the test genes. This procedure was repeated until every part had been used as the test set.

In the context of the functional similarity network, the above validation procedure is equivalent to moving one part of the seed genes into the candidate genes and determining whether these held-out seeds receive a high rank. The r parameter of RWR was varied within [0, 1], and the ranking of each of the five parts was determined for each value. ROC curves were plotted, and the area under the ROC curve (AUC) was used to evaluate the performance for each value of r.
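The validation loop can be sketched as follows. The `score_genes` callback is hypothetical: it stands in for any scoring method, such as RWR run with the training seeds, and must return a score for every gene. The fixed shuffling seed is likewise an assumption.

```python
import random


def held_out_seed_ranks(seeds, all_genes, score_genes, k=5, seed=0):
    """Return {held_out_gene: rank} over k folds (rank 1 = top candidate).

    `score_genes(training_seeds)` is a hypothetical callback returning a dict
    {gene: score}, for example steady-state RWR probabilities.
    """
    shuffled = list(seeds)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    ranks = {}
    for test_fold in folds:
        training = set(shuffled) - set(test_fold)
        scores = score_genes(training)
        # Held-out seeds are ranked together with the ordinary candidates.
        candidates = [g for g in all_genes if g not in training]
        ordered = sorted(candidates, key=lambda g: scores[g], reverse=True)
        position = {g: i + 1 for i, g in enumerate(ordered)}
        for gene in test_fold:
            ranks[gene] = position[gene]
    return ranks
```

The resulting ranks can then be turned into ROC curves by treating the held-out seeds as positives and the remaining candidates as negatives.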

# RESULTS

### Data Sources

We obtained the rice background protein–protein network from the STRING database. In the network, protein associations were derived either directly from physical interactions or from functional links based on experimental evidence and computational methods (Jensen et al., 2009). The network comprises 6561 nodes and 567034 edges, which represent proteins and the interactions between them, respectively. In our study, 136 known genes were selected as seed genes and the other genes as candidate genes. We downloaded the O. sativa Japonica protein network data from STRING version 10.5 (Szklarczyk et al., 2015).

Proteins with accurate functional annotations are vital to biological research. We obtained functional annotation information from the GO Consortium (Ashburner et al., 2000), and downloaded GO annotations of O. sativa from the most recent GO version. GO enrichment analysis is used to interpret high-throughput molecular data. GO annotation is the list of all annotated genes linked to ontological terms describing those genes.

The RicyerDB database integrates publicly available resources to construct a public platform for browsing and the interactive visualization of yield-related genes. The first release of RicyerDB contained more than 400 manually curated gene information entries which were all associated with rice yield.

### Performance of the Proposed Method

The score vector P (the probability of being at each node) for all genes in the network was calculated for each value of the restart probability r. Candidate genes were then ranked in descending order of their P scores.

To select the optimal parameter, genes were also ranked according to the calculated P scores for nine different r-values (r = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9). The number of matched seed genes in each of the five parts was used to assess the effectiveness of RWR. Across the five cases shown in **Figure 2**, the number of matched seeds among the top 500 ranked genes (measured at cutoffs of every 100) was higher for r = 0.3 than for the other r-values in most cases.

The sum of the numbers of matched seed nodes in all ranking results was determined, and r = 0.3 was shown to have the maximum match in general. Finally, the parameter r = 0.3 was selected to calculate vector P to obtain the ranking results.

To further assess the robustness of parameter r, we repeated the five-fold cross-validation 100 times. We then applied statistical analysis to compare the ranking of all seeds at different r-values in our model; the results are shown in **Figure 3**.

# Prioritization of Candidate Genes and Validation by Literature Review

In the functional similarity network, all candidate genes were prioritized by RWR according to vector P at the final state. We manually searched the top 100 candidate genes (**Table 1**) in PubMed<sup>1</sup> for their association with yield. This verified eight candidate genes associated with rice production. LOC\_Os11g40150 (rank 39), also known as OsRad51A1, is a key component of homologous recombination in DNA repair. Its direct interaction with OsNAC14 recruits factors involved in DNA damage repair and defense response, resulting in improved tolerance to drought (Shim et al., 2018). LOC\_Os04g37619 (rank 11), named ZEP, is one of the key genes involved in biosynthesis of the hormone abscisic acid (ABA) in rice. Ion beam irradiation can enhance the expression of genes involved in ABA biosynthesis, resulting in an increased content of endogenous ABA in rice (Chen et al., 2014).

Taken together, of the top 100 candidate genes in the ranking list, 46 candidate genes predicted by our method had been confirmed to be correlated with rice yield in PubMed literature (**Table 1**). Top-ranked candidates were found to have a high confirmation rate in terms of their association with rice yield, especially the top 20 candidates (**Table 2**).

We conducted GO analysis to assess the functional enrichment of the top 100 candidate genes (**Figure 4**). The GO term with the most annotated candidates was GO:0005524 ∼ ATP binding, which is a binding motif within the primary structure of an ATP binding protein. A recently identified rice ATP binding cassette plays multiple roles in plant growth, development and environmental stress responses (Zhang X.D. et al., 2018). ATP binding has also been shown to play an important role in rice development (Coneva et al., 2014; Zhao et al., 2015; Chang et al., 2016; Lei et al., 2018).

TABLE 1 | The top 100 candidate genes in the ranking list.



<sup>1</sup>http://www.ncbi.nlm.nih.gov/pubmed




#### DISCUSSION

In the present study, we identified genes associated with rice yield using the RWR method on a functional similarity network. We demonstrated the high performance of the RWR approach via a five-fold cross-validation experiment and showed the robustness of the parameter r. As an application of the RWR approach, we predicted a landscape of associations between known seeds and candidate genes.

Our work has the following advantages. First, the RWR method can predict associations among known seed genes and candidate genes by spreading the information carried by known seeds through their neighbors. Second, the interaction network provides a systematic view of functional similarities between genes, calculated from GO terms. Finally, the robustness of the parameter r leads to a high level of accuracy in making predictions, and the procedure used to select the parameter can be adapted to other datasets.

Rice is the most important food crop worldwide. Use of the RWR method in the functional similarity network can identify candidate genes associated with known rice yield-related genes, while gene ranking saves experimental time in the exploitation of rice as a major crop. Future development of our research will include the collection of more rice yield-related genes via online databases and the analysis of literature. Subsequent accurate analysis involving an effective prediction algorithm will enable the prediction of novel genes that can boost rice yield. In the future, we will further develop computational models for the identification and analysis of rice yield-related microRNAs and long non-coding RNAs based on the work of Chen et al. (Chen and Yan, 2013; Chen and Huang, 2017; Chen et al., 2017b, 2018b).

TABLE 2 | The confirmation rate of top 100 candidate genes in the ranking list.

The confirmation rate was calculated by dividing the number of confirmed genes by the corresponding number of top n candidates. It represents the effectiveness of the confirmation.

# AUTHOR CONTRIBUTIONS

CW designed the research. XZ performed the research. FX analyzed the data. JJ wrote the manuscript. All authors read and approved the manuscript.

# FUNDING


The work was supported by the Natural Science Foundation of China (Nos. 91735306, 61872114, and 61872309).

### REFERENCES


## ACKNOWLEDGMENTS

We thank Sarah Williams, Ph.D., from Liwen Bianji, Edanz Group China (www.liwenbianji.cn), for editing the English text of a draft of this manuscript.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Jiang, Xing, Wang and Zeng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# LLCMDA: A Novel Method for Predicting miRNA Gene and Disease Relationship Based on Locality-Constrained Linear Coding

Yu Qu, Huaxiang Zhang\*, Chen Lyu and Cheng Liang\*

*School of Information Science and Engineering, Shandong Normal University, Jinan, China*

MiRNAs are small non-coding regulatory RNAs which are associated with multiple diseases. Increasing evidence has shown that miRNAs play important roles in various biological and physiological processes. Therefore, the identification of potential miRNA-disease associations could provide new clues to understanding the mechanism of pathogenesis. Although many traditional methods have been successfully applied to discover part of the associations, they are in general time-consuming and expensive. Consequently, computational-based methods are urgently needed to predict the potential miRNA-disease associations in a more efficient and resources-saving way. In this paper, we propose a novel method to predict miRNA-disease associations based on Locality-constrained Linear Coding (LLC). Specifically, we first reconstruct similarity networks for both miRNAs and diseases using LLC and then apply label propagation on the similarity networks to get relevant scores. To comprehensively verify the performance of the proposed method, we compare our method with several state-of-the-art methods under different evaluation metrics. Moreover, two types of case studies conducted on two common diseases further demonstrate the validity and utility of our method. Extensive experimental results indicate that our method can effectively predict potential associations between miRNAs and diseases.

Keywords: miRNA gene–disease relationship, similarity measure, association prediction, locality-constrained linear coding, label propagation

#### Edited by:

*Quan Zou, Tianjin University, China*

#### Reviewed by:

*Zhenjia Wang, University of Virginia, United States*

*Xiangxiang Zeng, Xiamen University, China*

#### \*Correspondence:

*Huaxiang Zhang huaxzhang@hotmail.com*

*Cheng Liang alcs417@sdnu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *14 September 2018* Accepted: *08 November 2018* Published: *28 November 2018*

#### Citation:

*Qu Y, Zhang H, Lyu C and Liang C (2018) LLCMDA: A Novel Method for Predicting miRNA Gene and Disease Relationship Based on Locality-Constrained Linear Coding. Front. Genet. 9:576. doi: 10.3389/fgene.2018.00576*

# INTRODUCTION

MiRNAs are small non-coding regulatory RNAs. Since the first miRNA, lin-4 (Lee et al., 1993), was found, numerous miRNAs have been discovered. Accumulating evidence has shown that miRNAs play a critical role in many biological processes, such as cell proliferation, differentiation, aging, and apoptosis (Ambros, 2004; Xu et al., 2004; Cheng et al., 2005; Miska, 2005; Huang et al., 2016). As research has deepened, researchers have found that the dysfunctions of miRNAs are closely related to various diseases (Mei et al., 2016; Zou et al., 2016; Liao et al., 2018; Qu et al., 2018b; Tang et al., 2018), which indicates that exploring the associations between miRNAs and diseases is of great significance. Some experimental methods, such as PCR and microarray (Thomson et al., 2007; Mohammadi-Yeganeh et al., 2013), have been able to successfully identify certain miRNAs related to diseases. However, it is unrealistic to use these traditional experimental methods to identify miRNA-disease associations at a large scale because they are time-consuming and expensive. To resolve this situation, multiple computational methods have been proposed to efficiently uncover the potential associations between miRNAs and diseases.

Based on the assumption that miRNAs with similar functions are usually related to similar diseases (Zeng et al., 2016; Chen et al., 2017c), Jiang et al. (2010) proposed a network-based method to predict miRNA-disease associations using a hypergeometric distribution scoring system by constructing a miRNA functional similarity network and a human phenome-microRNAome network. Xuan et al. (2013) developed a method named HDMP based on the weighted k most similar neighbors. They calculated miRNA functional similarity according to disease terms and disease phenotype similarity. In addition, miRNAs within the same families or clusters were assigned higher weights. Shi et al. (2013) performed random walk to predict miRNA-disease associations on protein–protein interaction (PPI) networks and achieved a satisfactory performance. Mørk et al. (2014) proposed a novel protein-driven method named miRPD to predict potential associations between miRNAs and diseases, where they presented a scoring scheme to efficiently predict and rank miRNA-disease associations. Considering that global network-based methods could achieve better performance than local network-based methods, Chen et al. (2012) proposed a global similarity measure named RWRMDA. They applied random walk with restart to uncover miRNAs related to diseases on a miRNA–miRNA functional similarity network. However, RWRMDA could not make predictions for diseases without any known related miRNAs. Li et al. (2017) proposed another method named MCMDA. In this method, they applied the matrix completion algorithm to update the known miRNA-disease association matrix and predict the potential associations. Liu et al. (2017) also applied random walk to predict miRNA-disease associations on a heterogeneous network which was constructed by integrating multiple data sources. 
Similarly, Luo and Xiao (2017) used an imbalanced bi-random walk to predict miRNA-disease associations on a heterogeneous network consisting of a miRNA functional similarity network, a disease semantic network and a known miRNA-disease association network. Chen et al. (2016a) presented another method, WBSMDA, to identify the associations between miRNAs and diseases by calculating Gaussian interaction profile kernel similarity for both miRNAs and diseases. Specifically, a within-score and a between-score were calculated and combined to gain a prediction score for each miRNA-disease pair. Using the same data, Chen et al. (2016b) presented HGIMDA, which iteratively updates an optimization function to uncover potential relations between miRNAs and diseases. Zeng et al. (2018) used structural consistency as an indicator to estimate the link predictability of the bilayer network and further predicted the potential associations between miRNAs and diseases based on the Structural Perturbation Method (SPM). According to the lengths of different walks, Zou et al. (2015) introduced a path-based method using the KATZ model and obtained reliable results. Similarly, You et al. (2017) proposed another effective path-based method named PBMDA. PBMDA also constructed a heterogeneous network and applied a depth-first search algorithm to predict miRNA-disease associations. Although effective, the length of the paths in the searching process is limited to three. Qu et al. (2018a) presented a novel method, SNMDA, to identify potential disease-related miRNAs based on sparse neighborhood and achieved comparable results. In recent years, several models based on machine learning have also been developed to predict the relationships between miRNAs and diseases (Chen et al., 2017b, 2018a,d). Based on a semi-supervised learning framework, a model of Regularized Least Squares for MiRNA-Disease Association (RLSMDA) prediction was proposed by Chen and Yan (2014). Xiao et al. 
(2018) utilized graph-regularized non-negative matrix factorization to effectively make predictions for diseases without any related miRNAs based on heterogeneous omics data. Chen et al. (Zou et al., 2017) proposed an effective method, ELLPMDA, based on ensemble learning and link prediction. They integrated the results given by three classical similarity-based algorithms using ensemble learning. Li et al. (2018) presented a Kronecker kernel matrix dimension reduction (KMDR) model to predict miRNA-disease associations, which integrates the miRNA space and the disease space into a larger miRNA-disease association space. Chen et al. (2017a) proposed another model called MKRMDA that automatically optimizes the combination of multiple kernels. Recently, Chen et al. (2018b) presented EGBMMDA based on the model of extreme gradient boosting machine. Notably, EGBMMDA was the first decision tree learning-based model to uncover disease-related miRNAs and achieved favorable performance.

Although great efforts have been made to reliably predict miRNA-disease associations, there is still room for improvement. In this paper, we propose a novel method called LLCMDA for predicting miRNA-disease associations based on Locality-constrained Linear Coding (LLC). We apply four different cross-validation frameworks to comprehensively evaluate the performance of our method. The comparison results between LLCMDA and five state-of-the-art computational models demonstrate the utility of the proposed method. In addition, case studies on two common neoplasms further support the effectiveness of our method. In summary, LLCMDA is an effective model for predicting potential miRNA–disease associations.

### MATERIALS AND METHODS

#### Known miRNA-Disease Associations

HMDD (Li et al., 2014) is a database of known, experimentally verified miRNA-disease associations. It contains 5,430 associations between 383 diseases and 495 miRNAs. For simplicity, an adjacency matrix A of dimension 495 × 383 is defined to describe the known miRNA-disease associations used in this paper. If miRNA m(i) has been confirmed to be related to disease d(j), A(i, j) = 1; otherwise A(i, j) = 0.
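As a minimal sketch of how such an adjacency matrix might be assembled, the following Python snippet maps a list of (miRNA, disease) pairs onto a binary matrix. The function name `build_association_matrix` and the placeholder identifiers are hypothetical; the pairs are made up for illustration and are not taken from HMDD.

```python
import numpy as np

def build_association_matrix(pairs, mirnas, diseases):
    """Binary matrix A with A[i, j] = 1 if miRNA i is associated with disease j."""
    mirna_index = {name: i for i, name in enumerate(mirnas)}
    disease_index = {name: j for j, name in enumerate(diseases)}
    a = np.zeros((len(mirnas), len(diseases)), dtype=int)
    for mirna, disease in pairs:
        a[mirna_index[mirna], disease_index[disease]] = 1
    return a

# Placeholder names and associations, for illustration only.
mirnas = ["mirna_1", "mirna_2", "mirna_3"]
diseases = ["disease_1", "disease_2"]
pairs = [("mirna_1", "disease_1"), ("mirna_3", "disease_2")]
A = build_association_matrix(pairs, mirnas, diseases)
```

With the full HMDD data described above, the same construction would yield a 495 × 383 matrix containing 5,430 ones.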

### MiRNA Functional Similarity

Wang et al. (2010b) proposed an informative measure to calculate miRNA functional similarities. Building on this previous work, we downloaded miRNA similarity scores directly from http://www.cuilab.cn/files/images/cuilab/misim.zip. We then constructed a miRNA functional similarity matrix FMS to represent the similarity scores, where FMS(i, j) represents the similarity score between miRNA i and miRNA j. A larger value indicates that the two miRNAs are more functionally similar.

#### Disease Semantic Similarity

According to the MeSH descriptors, each disease can be described by a corresponding Directed Acyclic Graph (DAG) (Wang et al., 2010a), i.e., DAG(A) = (A, T(A), E(A)), where T(A) is the node set including A itself as well as its ancestor nodes, and E(A) represents the link set of A. Suppose disease t belongs to T(A); then the contribution of disease t to A can be calculated by:

$$D\_A\left(t\right) = \begin{cases} 1 & \text{if } t = A\\ \max\left\{0.5 \* D\_A\left(t'\right) \,\middle|\, t' \in \text{children of } t\right\} & \text{if } t \neq A \end{cases} \tag{1}$$

The semantic value of disease A can then be calculated by:

$$DV\,(A) = \sum\_{t \in T\_{(A)}} D\_A\,(t) \tag{2}$$

For diseases A and B, the semantic similarity is calculated through the following formula:

$$S\left(A,B\right) = \frac{\sum\_{t \in T(A) \cap T(B)} \left(D\_A\left(t\right) + D\_B\left(t\right)\right)}{DV\left(A\right) + DV\left(B\right)}\tag{3}$$

where t is a disease common to both T(A) and T(B). D<sub>A</sub>(t) and D<sub>B</sub>(t) represent the contribution of disease t to diseases A and B, respectively. Therefore, for each disease pair, we can calculate their semantic similarity according to Equation (3). For convenience, we use a matrix DSS to denote the obtained semantic similarities for all disease pairs.
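Equations (1) to (3) can be sketched in a few lines of Python. The snippet assumes the disease hierarchy is supplied as a dictionary mapping each term to its list of parent terms; the function names and the three-level toy hierarchy are hypothetical and do not correspond to real MeSH entries.

```python
def contributions(disease, parents, decay=0.5):
    """D_A(t) for every t in T(A): A itself plus all of its ancestors (Equation 1)."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = decay * contrib[node]
            # Keep the maximum over all paths from A up to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(a, b, parents):
    """S(A, B) from Equation (3), with DV(A) and DV(B) from Equation (2)."""
    da, db = contributions(a, parents), contributions(b, parents)
    shared = set(da) & set(db)
    numerator = sum(da[t] + db[t] for t in shared)
    return numerator / (sum(da.values()) + sum(db.values()))

# Hypothetical hierarchy: A and B share the parent P, whose parent is R.
parents = {"A": ["P"], "B": ["P"], "P": ["R"]}
sim_ab = semantic_similarity("A", "B", parents)
```

Here each disease has contributions 1, 0.5, and 0.25 from itself, P, and R, so DV = 1.75 and S(A, B) = 1.5 / 3.5.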

#### Methods

In this paper, we predict potential associations between miRNAs and diseases based on LLC and label propagation. Specifically, the LLC algorithm is first used to reconstruct similarity networks for both miRNAs and diseases and then label propagation is applied on the similarity networks to obtain reliable predicted labels. An overall workflow of LLCMDA is illustrated in **Figure 1**.

#### Locality-Constrained Linear Coding

Locality-constrained linear coding was first proposed by Wang et al. (2010b) and has been successfully applied to image classification. Compared with sparse representation, LLC is more computationally efficient and can preserve local information during the coding process (Saffari and Ebrahimi-Moghadam, 2015; Zhu et al., 2018). The objective function of LLC algorithm is defined as:

$$\underset{\boldsymbol{w}\_{i}}{\text{arg min }} \|\boldsymbol{x}\_{i} - \boldsymbol{D}\boldsymbol{w}\_{i}\|\_{2}^{2} + \lambda\_{1} \; \|\boldsymbol{P}\_{i} \odot \boldsymbol{w}\_{i}\|\_{2}^{2} \quad \text{s.t. } \mathbf{1}^{T}\boldsymbol{w}\_{i} = 1 \tag{4}$$

where x<sub>i</sub> is the i-th sample, D represents a dictionary matrix, and P<sub>i</sub> is a locality adapter vector representing the distances between the i-th sample and the other samples. λ<sub>1</sub> is a regularization parameter, and ⊙ denotes element-wise multiplication. The constraint requires the entries of w<sub>i</sub> to sum to one. Our goal is to find the optimized reconstruction weights w<sub>i</sub> for each sample x<sub>i</sub>. The Lagrangian function of Equation (4) can be obtained as follows:

$$\underset{\boldsymbol{w}\_{i}}{\arg\min} \ \|\boldsymbol{x}\_{i} - \boldsymbol{D}\boldsymbol{w}\_{i}\|\_{2}^{2} + \lambda\_{1} \left\|\boldsymbol{P}\_{i} \odot \boldsymbol{w}\_{i}\right\|\_{2}^{2} + \lambda\_{2} \left(\mathbf{1}^{T}\boldsymbol{w}\_{i} - 1\right) \tag{5}$$

where λ<sub>2</sub> is the Lagrange multiplier. With simple algebra, the above equation can be further transformed into:

$$L(\boldsymbol{w}\_i; \lambda\_2) = \boldsymbol{w}\_i^T \boldsymbol{C} \boldsymbol{w}\_i + \lambda\_1 \boldsymbol{w}\_i^T \left\{ \text{diag} \left( \boldsymbol{P}\_i \right) \right\}^2 \boldsymbol{w}\_i + \lambda\_2 \left( \mathbf{1}^T \boldsymbol{w}\_i - 1 \right) \tag{6}$$

where C = (x<sub>i</sub>**1**<sup>T</sup> − D)<sup>T</sup>(x<sub>i</sub>**1**<sup>T</sup> − D), **1** is the all-ones vector, and diag(P<sub>i</sub>) is a diagonal matrix whose (j, j)-th element equals the j-th element of vector P<sub>i</sub>. Specifically, we use the following formula to calculate the local distances between samples for P<sub>i</sub>:

$$P\_i = \left\{ P\_{ij} \right\}\_{j=1,\ldots,n} = \left\{ \exp\left(\frac{\left\|\mathbf{x}\_i - \mathbf{x}\_j\right\|\_2}{\gamma}\right) \right\}\_{j=1,\ldots,n} \tag{7}$$

where γ is a positive parameter controlling the bandwidth.

By taking the derivative of Equation (6) with respect to w<sub>i</sub> and setting it to zero, we have:

$$\frac{\partial}{\partial \boldsymbol{w}\_{i}} L \left( \boldsymbol{w}\_{i}; \lambda\_{2} \right) = \boldsymbol{0} \Rightarrow \mathbf{S} \boldsymbol{w}\_{i} + \lambda\_{2} \mathbf{1} = \mathbf{0} \tag{8}$$

where S = 2(C + λ<sub>1</sub>{diag(P<sub>i</sub>)}<sup>2</sup>). By multiplying both sides of Equation (8) by **1**<sup>T</sup>S<sup>−1</sup> and considering the LLC constraint **1**<sup>T</sup>w<sub>i</sub> = 1, we obtain λ<sub>2</sub> = −1/(**1**<sup>T</sup>S<sup>−1</sup>**1**), so that w<sub>i</sub> is proportional to S<sup>−1</sup>**1**. The optimal solution for w<sub>i</sub> can therefore be computed by solving a linear system and normalizing the result:

$$\begin{cases} \tilde{\boldsymbol{w}}\_{i} = \left( \boldsymbol{C} + \lambda\_{1} \left\{ \text{diag}(\boldsymbol{P}\_{i}) \right\}^{2} \right)^{-1} \mathbf{1} \\ \boldsymbol{w}\_{i} = \tilde{\boldsymbol{w}}\_{i} / \left( \mathbf{1}^{T} \tilde{\boldsymbol{w}}\_{i} \right) \end{cases} \tag{9}$$

To obtain feature vectors as the input for the LLC algorithm, we used interaction profiles to construct the feature vectors for miRNAs and diseases according to the known miRNA-disease associations (Zang and Zhang, 2012; Zhang et al., 2017). Specifically, the i-th row of adjacency matrix A represents the feature vector of miRNA i and the j-th column represents the feature vector of disease j. As a result, we can obtain two reconstructed similarity networks, RMS and RDS, for miRNAs and diseases, respectively, according to Equation (9).
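The closed-form LLC solution can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' MATLAB code: it assumes that the dictionary for sample i consists of all other samples (so each sample is reconstructed only from its neighbors), that distances are Euclidean, and that the values of the regularization weight `lam` and bandwidth `gamma` are arbitrary. The function name `llc_codes` and the toy interaction profiles are hypothetical.

```python
import numpy as np

def llc_codes(x, lam=0.01, gamma=1.0):
    """Return W where row i holds the LLC weights reconstructing sample i from the others."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    w = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)      # assumption: sample i is not its own atom
        z = x[others] - x[i]                     # shifted atoms; z @ z.T plays the role of C
        dist = np.linalg.norm(z, axis=1)
        p = np.exp(dist / gamma)                 # locality adapter (exponential of distance)
        s = z @ z.T + lam * np.diag(p ** 2)      # C + lambda_1 * diag(P_i)^2
        w_tilde = np.linalg.solve(s, np.ones(n - 1))
        w[i, others] = w_tilde / w_tilde.sum()   # enforce the sum-to-one constraint
    return w

# Made-up binary interaction profiles (rows are samples), for illustration only.
profiles = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 1],
                     [0, 0, 1, 1]], dtype=float)
W = llc_codes(profiles)
```

Each row of `W` sums to one, and samples with closer profiles receive larger weights because distant samples are penalized more heavily by the locality adapter.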

#### Label Propagation

In this section, we adopt label propagation to obtain relevant scores of miRNA-disease pairs. In the process of label propagation, the known miRNA-disease associations are regarded as initial labels and label propagation is used to iteratively update labels (Zhang et al., 2018). Each point receives information not only from its neighbors but also its initial information. Here, we set a parameter α to control the rate. Therefore, the iteration equation on miRNA functional similarity network can be written as follows:

$$F\_M(t+1) = \alpha \, \* \, FMS \, \* \, F\_M(t) + (1 - \alpha) \, \* \, Y \tag{10}$$

Here, FMS represents the miRNA similarity network, Y represents the initial labels, and F<sub>M</sub>(0) = Y. We used Equation (10) to update the label information. When the iteration converges, F<sub>M</sub>(t+1) is regarded as the relevant score matrix. Therefore, we can sort the miRNAs by relevant scores for each disease. According to previous studies (Zhou et al., 2003), the iteration is guaranteed to converge if FMS is properly normalized as follows:

$$FMS = D^{-1/2} \, \* \, FMS \, \* \, D^{-1/2} \tag{11}$$

where D is a diagonal matrix whose diagonal entries are the sums of the elements in the corresponding rows of FMS. Similarly, we apply label propagation on the other three similarity networks RMS, DSS, and RDS to obtain three relevant score matrices F<sub>RM</sub>, F<sub>D</sub>, and F<sub>RD</sub>. Finally, we integrate the four prediction results and take the average as the final output F.

$$F = \left(F\_M + F\_{RM} + F\_D' + F\_{RD}'\right)/4\tag{12}$$
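The propagation and averaging steps can be sketched in Python as follows. The sketch assumes a non-negative similarity matrix with positive row sums and uses the symmetric normalization of Zhou et al. (2003), in which both sides are scaled by the inverse square root of the row sums. For brevity it combines only one miRNA network and one disease network rather than all four; the function names and the toy matrices are hypothetical.

```python
import numpy as np

def normalize(sim):
    """Symmetric normalization: scale rows and columns by 1 / sqrt(row sum)."""
    sim = np.asarray(sim, dtype=float)
    d = sim.sum(axis=1)
    d[d == 0] = 1.0                       # guard against empty rows
    inv_sqrt = 1.0 / np.sqrt(d)
    return sim * inv_sqrt[:, None] * inv_sqrt[None, :]

def propagate(sim, y, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate F(t+1) = alpha * S * F(t) + (1 - alpha) * Y until convergence."""
    s = normalize(sim)
    y = np.asarray(y, dtype=float)
    f = y.copy()
    for _ in range(max_iter):
        f_next = alpha * (s @ f) + (1.0 - alpha) * y
        if np.abs(f_next - f).max() < tol:
            return f_next
        f = f_next
    return f

# Made-up association and similarity matrices, for illustration only.
A = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)            # 3 miRNAs x 2 diseases
mirna_sim = np.array([[1.0, 0.2, 0.8], [0.2, 1.0, 0.1], [0.8, 0.1, 1.0]])
disease_sim = np.array([[1.0, 0.3], [0.3, 1.0]])
F_M = propagate(mirna_sim, A)            # propagate over the miRNA network
F_D = propagate(disease_sim, A.T)        # propagate over the disease network
F = (F_M + F_D.T) / 2.0                  # transpose disease-side scores before averaging
```

The transposition of the disease-side scores mirrors the primed terms in Equation (12), so that every averaged matrix is indexed by miRNA in rows and disease in columns.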

#### Implementation Details

LLCMDA is implemented in MATLAB under the MATLAB R2016b programming environment. All the experiments are performed on a desktop with an i7-6700 3.40 GHz CPU and 16 GB RAM. The source code of LLCMDA is freely available at: https://github.com/misitequ/LLCMDA.

### RESULTS

#### Evaluation

In this section, three cross-validation frameworks are applied to test the performance of our algorithm: global LOOCV, local LOOCV, and five-fold cross-validation. In the framework of global LOOCV, each known miRNA-disease association is left out in turn as a test sample, and the other associations are regarded as training samples. After prediction, each miRNA-disease pair obtains a score. If its ranking is higher than a given threshold, the prediction is regarded as successful. In the framework of local LOOCV, a disease is given in advance and then each miRNA associated with this disease is left out in turn as a test sample while the rest of the miRNAs associated with the disease are set as seed samples. The only difference between global LOOCV and local LOOCV is whether the candidates from all diseases are considered simultaneously (Chen et al., 2018a,c). Five-fold cross-validation is also implemented to verify the utility of our method. Concretely, the 5,430 known associations are randomly divided into five subsets; each subset is taken as the test samples in turn and the others are considered as training samples. To avoid the bias caused by random division of samples, we repeat five-fold cross-validation 20 times and take the average as the final result. Receiver Operating Characteristic (ROC) curves are plotted by calculating the True Positive Rate (TPR) and False Positive Rate (FPR) at varying thresholds. We then calculate the Area Under the ROC Curve (AUC) to quantitatively evaluate the performance of prediction models. AUC = 1 means the model is perfect, while AUC = 0.5 denotes a random prediction.

FIGURE 2 | The comparison results between LLCMDA and other five methods (SPM, HGIMDA, EGBMMDA, PBMDA, MKRMDA) in terms of global LOOCV.
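The AUC evaluation and the random five-fold partition described above can be sketched as follows. This is a minimal illustration: `roc_auc` assumes that no two candidates share the same score, the function names are hypothetical, and the scores and labels are made up.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve by the trapezoidal rule (assumes no tied scores)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores)                       # highest score first
    hits = labels[order]
    tpr = np.concatenate(([0.0], np.cumsum(hits) / hits.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(~hits) / (~hits).sum()))
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def five_fold_indices(n_items, seed=0):
    """Randomly partition range(n_items) into five nearly equal test folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_items), 5)

# Made-up scores and labels, for illustration only.
auc = roc_auc([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 0, 0])
folds = five_fold_indices(5430)                       # 5,430 known associations
```

Repeating the partition with different seeds and averaging the resulting AUCs corresponds to the repeated five-fold procedure described in the text.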

As a result, LLCMDA obtained AUCs of 0.924, 0.870, and 0.919 in global LOOCV, local LOOCV, and five-fold cross-validation, respectively. To further illustrate the effectiveness of our algorithm, we compared LLCMDA with five state-of-the-art methods, i.e., SPM, HGIMDA, PBMDA, MKRMDA, and EGBMMDA. In the framework of global LOOCV, SPM, HGIMDA, PBMDA, MKRMDA, and EGBMMDA achieved AUCs of 0.942, 0.875, 0.922, 0.904, and 0.912, respectively (**Figure 2**). In local LOOCV, the AUCs obtained by SPM, HGIMDA, PBMDA, MKRMDA, and EGBMMDA were 0.814, 0.823, 0.853, 0.827, and 0.807, respectively (**Figure 3**). In addition, they obtained AUC-values of 0.865, 0.867, 0.916, 0.884, and 0.904 in five-fold cross-validation (**Figure 4**), respectively. As can be seen from the results, the AUCs of LLCMDA were higher than those of the other methods in all three cross-validation frameworks except global LOOCV, where SPM achieved a higher AUC. In conclusion, our method can reliably predict potential miRNA-disease associations.

FIGURE 4 | The comparison results between LLCMDA and other five methods (SPM, HGIMDA, EGBMMDA, PBMDA, MKRMDA) in terms of five-fold cross-validation.

To further test the performance of our method in predicting new associations for diseases without any known related miRNAs, we adopted another evaluation framework called Leave One Disease Out Cross Validation (LODOCV) (Fu and Peng, 2017). In particular, we removed all the associated miRNAs for a given disease and then prioritized all the candidate miRNAs based on the known associations of other diseases. LODOCV is considerably more stringent than the aforementioned cross-validation frameworks since there is no prior association information available for the given disease. We also compared LLCMDA with the five state-of-the-art methods in terms of the AUC-values. As shown in **Figure 5**, LLCMDA achieved the highest AUC-value of 0.822 in the LODOCV framework. Here, we only show the performances of LLCMDA, SPM, and HGIMDA in the figure, as the AUC-values obtained by the other three methods were lower than 0.6. The experimental results indicate that LLCMDA has better generalization ability in predicting new miRNA-disease associations.
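The masking step of LODOCV can be sketched as a small generator. It only prepares the training matrix and the held-out miRNAs for each disease; the prediction model itself is left out, and the function name `lodocv_splits` and the toy matrix are hypothetical.

```python
import numpy as np

def lodocv_splits(assoc):
    """Yield (disease index, training matrix, held-out miRNA indices) for each disease."""
    assoc = np.asarray(assoc)
    for j in range(assoc.shape[1]):
        held_out = np.flatnonzero(assoc[:, j])
        if held_out.size == 0:
            continue                      # nothing to evaluate for this disease
        train = assoc.copy()
        train[:, j] = 0                   # remove every known association of disease j
        yield j, train, held_out

# Made-up 4 x 3 association matrix, for illustration only.
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 0, 0]])
splits = list(lodocv_splits(A))
```

For each split, a model would be trained on `train`, all miRNAs would be scored for disease `j`, and the ranks of the `held_out` miRNAs would feed into the ROC analysis.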

# Parameter Analysis

Parameter α controls the balance between propagated information and the initial labels for miRNAs in Equation (10). Similarly, we used another parameter β to control the effect of the initial labels for diseases. To explore the impact of the two parameters, we set different values (0.1–0.9) for both parameters and obtained the prediction results in the five-fold cross-validation and LODOCV frameworks (**Figure 6**). It can be seen that parameters α and β have only minor effects on the final prediction accuracies. Similar trends were also observed in global LOOCV and local LOOCV. Consequently, both parameters were set to 0.5.

# Case Study

In recent years, substantial evidence has suggested that miRNAs are associated with various neoplasms, such as breast neoplasms and lung neoplasms. Here, we conducted two types of case studies to validate the utility of LLCMDA on two common neoplasms, lung neoplasms and lymphomas. The case studies on other diseases can be found at https://github.com/misitequ/LLCMDA. We selected the top 50 miRNAs predicted by our model for each disease. The prediction results were then verified by another three databases, i.e., miR2Disease (Jiang et al., 2009), dbDEMC (Yang et al., 2017), and miRwayDB (Das et al., 2018), which all record experimentally validated miRNA-disease associations.

TABLE 1 | Top 50 predicted miRNAs associated with Lung Neoplasms based on known associations in HMDD.

*I, II, and III represent dbDEMC, miR2Disease, and miRwayDB, respectively. The first and third columns record the 1–25 and 26–50 related miRNAs, respectively.*

Lung neoplasms are among the malignant tumors with the fastest-increasing morbidity and mortality and pose one of the greatest threats to human health and life (Yanaihara et al., 2006). Therefore, there is an urgent need to identify prognostic and predictive markers for early detection. We used our method to uncover the potential miRNAs and listed the top 50 predicted candidate miRNAs. As a result (**Table 1**), 46 out of the top 50 miRNAs were verified to be associated with lung neoplasms by at least one of the databases miR2Disease, dbDEMC, and miRwayDB. For instance, studies have shown that hsa-mir-16 (1st in **Table 1**) and hsa-mir-429 (3rd in **Table 1**) are closely related to the diagnosis and treatment of lung cancer (Reid et al., 2013; Ren et al., 2016).

To verify the effectiveness of our method on real datasets, we conducted a second type of case study in which we used an older version of HMDD (v1.0) as input to predict potential associations and tested whether LLCMDA could uncover the newly added ones in the latest version of HMDD (v2.0). Specifically, HMDD v1.0 contains 1,395 associations between 271 miRNAs and 137 diseases (Zhao et al., 2018). Here, we chose lymphomas for validation. As shown in **Table 2**, 48 out of the top 50 candidate miRNAs have been confirmed by dbDEMC, miR2Disease, and/or miRwayDB. In particular, 31 miRNAs were found in HMDD v2.0. Taken together, this evidence further shows that our method can effectively predict potential associations between miRNAs and diseases.

# DISCUSSION

Identifying potential disease-associated miRNAs could provide new insights into the role of miRNAs as valuable biomarkers for clinical measurement, diagnosis and treatment. However, it is impractical to identify miRNA-disease associations at a large scale by relying on traditional experimental methods. Consequently, a great number of computational methods have been proposed to solve this challenging problem in recent years. In this paper, we presented a novel method to predict potential miRNA-disease associations based on locality-constrained linear coding. We first applied the LLC algorithm to reconstruct similarity networks for miRNAs and diseases. Label propagation was then applied on the similarity networks to retrieve relevant scores for each miRNA-disease association. The final results were calculated as the average of the predicted results from both the miRNA space and the disease space. To comprehensively verify the performance of our method, we compared LLCMDA with five state-of-the-art computational models under four different cross-validation frameworks. The experimental results provided strong evidence that our method could effectively predict miRNA-disease associations. In addition, case studies on two common diseases also confirmed the prediction ability of our method.

The success of our method is mainly due to two reasons. First, the reconstructed similarity networks for both miRNAs and diseases are more robust because the LLC algorithm takes local information into account in the coding process. Second, we applied label propagation on the reconstructed similarity networks as well as the original similarity networks to calculate reliable relevance scores for the final output. Nonetheless, more informative data sources should be integrated into our model to further improve the prediction performance. In addition, the final outcome was simply taken as the average of the prediction scores from the different similarity networks, which may lead to suboptimal results. Therefore, a more appropriate way to combine the prediction results needs to be put forward.
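The propagate-then-average step described above can be illustrated with a minimal sketch. It assumes a common label propagation update, F ← αSF + (1 − α)Y with a row-normalized similarity matrix S; the exact update rule, the value of α, and the toy matrices below are assumptions for illustration and are not taken from the LLCMDA implementation. The LLC reconstruction of the similarity networks is not shown.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1 so that repeated propagation stays bounded."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def label_propagation(similarity, labels, alpha=0.5, n_iter=50):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y (assumed update rule)."""
    s = row_normalize(similarity)
    f = [row[:] for row in labels]
    for _ in range(n_iter):
        sf = matmul(s, f)
        f = [[alpha * sf[i][j] + (1 - alpha) * labels[i][j]
              for j in range(len(labels[0]))] for i in range(len(labels))]
    return f


def predict_associations(mirna_sim, disease_sim, known, alpha=0.5):
    """Average the propagated scores from the miRNA space and the disease space."""
    from_mirna = label_propagation(mirna_sim, known, alpha)          # miRNAs x diseases
    from_disease = transpose(
        label_propagation(disease_sim, transpose(known), alpha))    # back to miRNAs x diseases
    return [[(a + b) / 2 for a, b in zip(row_m, row_d)]
            for row_m, row_d in zip(from_mirna, from_disease)]


# Hypothetical toy data: 3 miRNAs, 2 diseases.
mirna_sim = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
disease_sim = [[1.0, 0.3], [0.3, 1.0]]
known = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # known miRNA-disease associations

scores = predict_associations(mirna_sim, disease_sim, known)
```

In this toy case the second miRNA has no known association, yet it receives a higher score for the first disease because it is most similar to a miRNA already linked to that disease. A weighted combination of the two spaces, rather than the plain average, would be one way to address the limitation noted above.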

# AUTHOR CONTRIBUTIONS

YQ and CLi conceived the study and planned the experiments. YQ and HZ designed and implemented the algorithm. CLy and HZ performed data analysis. YQ and CLi drafted the manuscript. All authors read and approved the final manuscript.

# ACKNOWLEDGMENTS

CLi was supported by the National Natural Science Foundation of China (No. 61602283) and the Natural Science Foundation of Shandong (No. ZR2016FB10). HZ was supported by the National Natural Science Foundation of China under Grant Nos. 61572298, 61772322, 61601268, the Key Research and Development Foundation of Shandong Province (No. 2016GGX101009), and the Natural Science Foundation of Shandong (No. 2017GGX10117, 2017CXGC0703). CLy was supported by the Natural Science Foundation of Shandong (No. ZR2016FB13).

#### REFERENCES


based on microRNA-associated diseases. Bioinformatics 26, 1644–1650. doi: 10.1093/bioinformatics/btq241


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Qu, Zhang, Lyu and Liang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification of Potential Prognostic Genes for Neuroblastoma

Xiaodan Zhong<sup>1,2,3</sup>, Yuanning Liu<sup>1,2</sup>, Haiming Liu<sup>1,2</sup>, Yutong Zhang<sup>3</sup>, Linyu Wang<sup>1,2</sup> and Hao Zhang<sup>1,2</sup> \*

<sup>1</sup> College of Computer Science and Technology, Jilin University, Changchun, China, <sup>2</sup> Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun, China, <sup>3</sup> Department of Pediatric Oncology, The First Hospital of Jilin University, Changchun, China

Background and Objective: Neuroblastoma (NB), the most common pediatric solid tumor apart from brain tumor, is associated with dismal long-term survival. The aim of this study was to identify a gene signature to predict the prognosis of NB patients.

Materials and Methods: The GSE49710 dataset was downloaded from the Gene Expression Omnibus (GEO) database, and differentially expressed genes (DEGs) were analyzed using the R package "limma" and SPSS software. Gene ontology (GO) and pathway enrichment analyses were performed using the DAVID database. Random forest (RF) and a risk score model were used to pick out the gene signature for predicting the prognosis of NB patients. In addition, receiver operating characteristic (ROC) and Kaplan-Meier curves were plotted. The GSE45480 and GSE16476 datasets were employed to validate the robustness of the gene signature.

#### Edited by:

Quan Zou, University of Electronic Science and Technology of China, China

#### Reviewed by:

Heng Pan, Cornell University, United States Hao Lin, University of Electronic Science and Technology of China, China

> \*Correspondence: Hao Zhang zhangh@jlu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 07 October 2018 Accepted: 15 November 2018 Published: 29 November 2018

#### Citation:

Zhong X, Liu Y, Liu H, Zhang Y, Wang L and Zhang H (2018) Identification of Potential Prognostic Genes for Neuroblastoma. Front. Genet. 9:589. doi: 10.3389/fgene.2018.00589

Results: A total of 131 DEGs were identified, which were mainly enriched in cancer-related pathways. Four genes (ERCC6L, AHCY, STK33, and NCAN), all among the top six most important features in the RF model, were selected as a gene signature to predict the prognosis of NB patients. The area under the curve (AUC) of the signature reached 0.86, and Cox regression analysis revealed that the 4-gene signature was an independent prognostic factor of overall survival and event-free survival, a finding that also held in GSE16476. Additionally, the 4-gene signature discriminated well between different patient groups in GSE45480 and GSE49710.

Conclusion: The present study identified a gene signature for predicting the prognosis of NB, which may provide novel prognostic markers. Some of the genes may serve as treatment targets, pending future biological experiments.

Keywords: neuroblastoma, differentially expressed genes, gene signatures, prognosis, GEO, ERCC6L

# INTRODUCTION

Neuroblastoma (NB) is a pediatric solid tumor that is highly heterogeneous in both clinical and biological characteristics. It is the third most common malignant disease, accounting for about 7% of malignant tumors in children less than 14 years old, with an incidence of close to 60/100,000 during the first year of life (Maris et al., 2007; Sausen et al., 2013; Ward et al., 2014). However, NB accounts for up to 15% of cancer-related deaths in childhood (Maris et al., 2007; Domingo-Fernandez et al., 2013; Sausen et al., 2013; Salazar et al., 2016; Stafman and Beierle, 2016). Unlike adult cancers, NB presents a low frequency of mutation and rearrangement, less than 40% (Pugh et al., 2013). Multiple "omics" studies have found that some molecules are involved in NB development, including MYCN, ALK, LMO1, PHOX2B, ARID1A, and ARID1B (Huang and Weiss, 2013; Pugh et al., 2013; Sausen et al., 2013; Beckers et al., 2015; Bosse and Maris, 2016; Liu and Thiele, 2017). Nonetheless, the survival time of the high-risk group has not been markedly prolonged, with long-term survival of less than 40∼50% (Maris et al., 2007; Pinto et al., 2015; Salazar et al., 2016).

Patients at stage 4 and stage 4s all have metastatic tumors, but they have remarkably different outcomes. Most patients with stage 4 disease are at high risk and have inferior outcomes despite multi-modal therapy (Maris et al., 2007; Whittle et al., 2017). In contrast, stage 4s disease frequently occurs in infants, who generally have favorable outcomes even with no treatment or only moderate chemotherapy (Rubie et al., 2011; Brodeur and Bagatell, 2014). Studies indicate that more than 20% of NB cases experience spontaneous remission or regression, especially stage 4s cases (Diede, 2014; Attiyeh and Maris, 2015). In addition, some studies have found differences between stage 4 and stage 4s patients. For instance, Taggart et al. (2011) assessed 0–18-month-old patients with stage 4 and stage 4s disease, and found that tumor biological features [such as MYCN, 11q, mitosis-karyorrhexis index (MKI), histology, and 1p] were more important than age and metastatic pattern for predicting clinical outcomes, and should be considered for risk stratification in patients aged less than 18 months. Moreover, Fischer et al. (2006) identified a distinct gene expression pattern between stage 4 and stage 4s patients. Bénard et al. (2008) identified a stage 4s gene signature by comparing the gene expression profiles of stage 4 and stage 4s patients. However, this is far from enough, and further studies are needed, which may provide some inspiration for understanding NB and developing therapeutic strategies.

Random forest (RF) is among the most important machine learning methods thanks to its relatively good accuracy, robustness, and ease of use. It can be used to rank the importance of features in a regression or classification problem in a natural way. Mean decrease impurity and mean decrease accuracy are employed as criteria for feature selection (Strobl et al., 2007).

In this paper, we identified a four-gene signature for predicting the prognosis of NB patients by applying RF and a risk score model to the GEO dataset GSE49710. The gene signature also performed well in discriminating other groups in GSE49710, as well as in two other independent datasets. Importantly, the gene signature could separate NB patients with different outcomes among those at stage 4, those aged less than 18 months, and those without MYCN amplification. Thus, the gene signature was considered a prognostic marker of NB, and some genes in the signature (such as ERCC6L) might serve as therapeutic targets, pending future biological experiments.

# MATERIALS AND METHODS

## Data Collection and Processing

Data were downloaded from GEO datasets GSE49710 (**Supplementary Data Sheet S1**), GSE45480, and GSE16476. Samples with a survival time of less than 30 days or more than 10 years were excluded. Finally, 419 samples were obtained, and 45 stage 4 patients aged less than 18 months as well as 50 stage 4s patients were also employed in this study. The R package "limma" (Ritchie et al., 2015) was used for data processing, and expression data were log2-transformed. The prediction ability was validated using the GSE45480 (GPL16876) and GSE16476 datasets. When a single gene matched multiple probes, the highest expression value was used to represent the gene expression level. The top 1% of mRNAs with high expression in stage 4 patients were selected, and those with an adjusted p-value of < 0.05 were considered differentially expressed genes (DEGs). Moreover, the log-rank test and the Cox proportional hazard regression model (Prentice, 1992) were utilized to identify the risk factors related to clinical prognosis. Genes with a hazard ratio of > 1 and a p-value of less than 0.001 upon univariate Cox regression were picked out.
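The probe-to-gene step and the log2 transformation described above can be sketched as follows. This is a minimal illustration, not the authors' R pipeline: it assumes that "highest expression value" means the per-sample maximum across probes, and the pseudocount, probe identifiers, and values are hypothetical.

```python
import math
from collections import defaultdict


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """For each gene and sample, keep the highest value among the gene's probes."""
    per_gene = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:          # skip probes without a gene annotation
            per_gene[gene].append(values)
    return {gene: [max(column) for column in zip(*rows)]
            for gene, rows in per_gene.items()}


def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount is an assumption to avoid log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


# Hypothetical probe-level data for two samples.
probe_values = {
    "probe_a": [10.0, 9.0],
    "probe_b": [12.0, 7.0],
    "probe_c": [3.0, 15.0],
}
probe_to_gene = {"probe_a": "ERCC6L", "probe_b": "ERCC6L", "probe_c": "AHCY"}

gene_values = collapse_probes_to_genes(probe_values, probe_to_gene)
ahcy_log2 = log2_transform(gene_values["AHCY"])
```

An alternative convention is to keep the single probe with the highest mean across samples; the text does not say which was used.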

## Functional Enrichment Analysis of DEGs

Functional enrichment analysis of the candidate DEGs was carried out using the online database DAVID 6.8<sup>1</sup> . In the meantime, gene ontology (GO) (The Gene Ontology Consortium, 2017) term BP (Biological Process) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2017) analysis were performed to provide gene biological function, with p < 0.05 as the cut-off criterion.

## Selection of Gene Signature and Confirmation of Performance

Firstly, we used the DEGs as features and built the RF model using Scikit-Learn. We defined a grid of hyperparameters and sampled from the grid, performing 10-fold cross-validation with each combination of values. The grid search then output the settings that achieved the highest AUC in the validation procedure, together with the feature importance based on mean decrease impurity (Strobl et al., 2007). Secondly, based on the above ranking results, we combined the top 10 most important genes to predict the prognosis of patients and selected one of the optimal combinations; the coefficients of the candidate genes were calculated with multivariate Cox regression analysis. Thirdly, the risk score model (Zhou et al., 2015; Liao et al., 2018) was adopted to evaluate the performance in predicting vital status and to assess the discrimination ability in other groups. In addition, the receiver operating characteristic (ROC) curve was plotted.

$$\text{RiskScore} = \sum_{i=1}^{n} \beta_i \, \chi_i$$

<sup>1</sup>https://david.ncifcrf.gov/

where β<sub>i</sub> denotes the coefficient of each gene in the multivariate Cox regression analysis with overall survival as the dependent parameter, and χ<sub>i</sub> represents the log2-transformed expression value of each gene.

Simultaneously, all samples were divided into the high or low-risk group according to the median risk score, and the Kaplan–Meier curves of overall survival and event-free survival were plotted.
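As an illustration of the risk score and the median split, the sketch below uses the four coefficients reported in the Results (0.408, 0.478, 0.345, and 0.136). The patient identifiers and expression values are hypothetical, and assigning scores equal to the median to the low-risk group is an assumption, since the text does not specify how ties are handled.

```python
from statistics import median

# Multivariate Cox coefficients of the four-gene signature, as reported in the Results.
COEFFICIENTS = {"ERCC6L": 0.408, "AHCY": 0.478, "STK33": 0.345, "NCAN": 0.136}


def risk_score(expression, coefficients=COEFFICIENTS):
    """Weighted sum of log2 expression values: sum_i beta_i * x_i."""
    return sum(beta * expression[gene] for gene, beta in coefficients.items())


def split_by_median(scores):
    """Label each sample 'high' if its score exceeds the median, else 'low'."""
    cutoff = median(scores.values())
    return {sample: ("high" if score > cutoff else "low")
            for sample, score in scores.items()}


# Hypothetical log2 expression values for four patients.
patients = {
    "p1": {"ERCC6L": 10.0, "AHCY": 10.0, "STK33": 10.0, "NCAN": 10.0},
    "p2": {"ERCC6L": 8.0, "AHCY": 8.0, "STK33": 8.0, "NCAN": 8.0},
    "p3": {"ERCC6L": 12.0, "AHCY": 12.0, "STK33": 12.0, "NCAN": 12.0},
    "p4": {"ERCC6L": 6.0, "AHCY": 6.0, "STK33": 6.0, "NCAN": 6.0},
}

scores = {pid: risk_score(expr) for pid, expr in patients.items()}
groups = split_by_median(scores)
```

The resulting groups are what would be passed to the Kaplan–Meier and log-rank analyses.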

## Statistical Analysis


All statistical analyses were carried out using IBM SPSS version 23 and R. Bivariate correlations were calculated using Spearman's rank correlation. The Kaplan-Meier method and the log-rank test were employed to plot and compare the survival curves. Survival data were assessed through univariate and multivariate Cox regression analyses. A two-tailed p-value of < 0.05 was considered statistically significant.
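For readers unfamiliar with the Kaplan-Meier method, the following minimal sketch shows the product-limit estimate that underlies the survival curves. It is not the SPSS or R routine used in the study; the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate.

    times  : follow-up time of each patient
    events : 1 if the event (e.g., death) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t.
        at_risk -= removed
    return curve


# Hypothetical follow-up data (months); 0 marks a censored observation.
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Patients censored at a given time are still counted as at risk at that time, which is the usual convention. Comparing two such curves, for example high versus low risk score, is the role of the log-rank test.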

# RESULTS

## Clinical Characteristics Analysis of GEO Data

A total of 498 NB samples with complete clinical data were downloaded from the GSE49710 dataset, and 419 of them were finally included in our study. The clinical and pathological features are listed in **Table 1**. Survival analysis showed that INSS stage 4 patients had inferior outcomes compared with stage 4s patients (**Supplementary Figure S1**). Among the 419 patients, stage 4 patients accounted for 63.6, 70.0, 77.7, and 83.9% of those with progression, MYCN amplification, death from disease, and high risk, respectively. In contrast, stage 4s patients accounted for less than 4.5% of those with MYCN amplification, high risk, and death from disease, and for 7.5% of those with progression. In comparison, in the TARGET-NBL dataset, stage 4 patients accounted for more than 80% of the death, event, progression, and relapse groups, and the proportions were especially high in the death and relapse groups (96.9 and 91.3%, respectively), whereas stage 4s patients accounted for no more than 2% (**Supplementary Table S1**). These results demonstrated large differences between stage 4 and stage 4s patients. Cox regression analysis of stage 4 patients (GSE49710, n = 168) revealed that age ≥ 18 months, MYCN amplification, and high risk were associated with inferior outcomes [hazard ratio: 2.237 (95% CI: 1.209–4.141), 2.437 (95% CI: 1.557–3.814), 11.208 (95% CI: 2.746–45.743) and p-value: 0.01, < 0.001, 0.001, respectively]. The results of the risk factor analysis in this dataset were consistent with those from related articles (Pinto et al., 2015).

## Identification of DEGs and Functional Enrichment Analysis in NB

TABLE 1 | Clinical characteristics of NB patients in GSE49710 (samples with survival time ≤30 days and ≥10 years have been removed).

N/A, not applicable; Class label, maximally divergent disease courses; unfavorable, patients died despite intensive chemotherapy; favorable, patients survived without chemotherapy for at least 1000 days post-diagnosis.

In this study, 45 INSS stage 4 patients aged less than 18 months and 50 stage 4s patients were enrolled from the GSE49710 dataset, so as to exclude the influence of age. Among the 26082 mRNAs, the top 1% with high expression in stage 4 patients were selected according to the cut-off criterion of an adjusted p-value of < 0.05, and a total of 215 differentially expressed mRNAs were identified after excluding duplicate genes. Afterward, genes with a hazard ratio of > 1 and a p-value of < 0.001 upon univariate Cox regression were chosen, and 131 DEGs were finally selected (**Supplementary Table S2**).
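The adjusted p-value cut-off can be made concrete with a short sketch of the Benjamini-Hochberg procedure. The text does not name the adjustment method; Benjamini-Hochberg is assumed here because it is the default in limma, and the raw p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
selected = [i for i, p in enumerate(adj_p) if p < 0.05]
```

With these toy values every gene passes the adjusted threshold of 0.05; in practice the filter is applied to all mRNAs before the univariate Cox step.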

The functions and pathway enrichment of the candidate DEGs were analyzed using the online database DAVID, and 58 GO terms in BP (Biological Process) and 5 KEGG pathways were revealed (**Figures 1A,B** and **Supplementary Tables S3**, **S4**). As shown in **Figure 1A**, the DEGs were mainly enriched in cell division, DNA replication, mitotic nuclear division, DNA repair and cell proliferation (Stafman and Beierle, 2016). In terms of KEGG pathways, the DEGs were mainly enriched in cell cycle (Williams and Stoeber, 2012; Otto and Sicinski, 2017), Fanconi anemia pathway (Ceccaldi et al., 2016), pyrimidine metabolism (Kelemen et al., 2014), and p53 signaling pathways, all of which are cancer-related pathways.

## Performance of Gene Signature in Predicting Prognosis

To select the genes with good performance in predicting the prognosis of NB, RF was adopted to rank the 131 DEGs (**Supplementary Table S5**). Based on the ranking results, four genes (ERCC6L, AHCY, STK33, and NCAN) were selected as a gene signature for predicting the vital status of NB patients (**Table 2**). To integrate the four genes, multivariate Cox regression analysis was employed to obtain the coefficients. The risk score was calculated as follows: risk score = expression of ERCC6L × 0.408 + expression of AHCY × 0.478 + expression of STK33 × 0.345 + expression of NCAN × 0.136. Notably, the AUC of the 4-gene signature in predicting vital status reached 0.86. The 4-gene signature was also employed to distinguish different groups, including MYCN amplification vs. non-amplification, high risk vs. non-high risk, and progression vs. non-progression. According to the ROC curves, the AUC for MYCN amplification, high risk, and progression reached 0.965, 0.928, and 0.78, respectively, as shown in **Figure 2**. Surprisingly, AHCY alone performed excellently in predicting MYCN amplification, with an AUC of 0.946 (**Supplementary Table S6**).
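The AUC values above can be read as the probability that a randomly chosen positive sample (for example, a deceased patient) has a higher risk score than a randomly chosen negative one. The following minimal sketch computes the AUC directly from that definition; the labels and scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical vital status (1 = deceased) and risk scores.
labels = [1, 1, 0, 0, 1, 0]
risk_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
auc = roc_auc(labels, risk_scores)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few hundred patients; rank-based formulas give the same value more efficiently on larger datasets.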

## Association of 4-gene Signature and Clinical Characteristics

The four genes (ERCC6L, AHCY, STK33, and NCAN) were closely correlated with overall survival and event-free survival. As shown in **Supplementary Figures S2A–H**, increased expression of each of the four genes was linked with markedly shorter survival time. In the risk score model, NB patients in the group with a high four-gene risk score had significantly inferior outcomes (**Figures 3A,B**). The risk score was also calculated in different groups, and the results revealed that the older age, MYCN amplification, advanced stage, high risk, disease progression, and unfavorable class label groups had high risk scores (p < 0.0001) (**Figure 3C**). Moreover, overall survival and event-free survival were further compared between high and low risk score groups within stage 4 patients, patients without MYCN amplification, and patients aged less than 18 months. The results showed that the high risk score group had significantly shorter overall and event-free survival time (**Figure 4**).

Bold values indicate the genes we selected as the signature.

Spearman analysis of the correlation between the 4-gene signature and clinicopathological features revealed that the high risk score group showed a markedly positive correlation with age, risk, and INSS stage, and a negative correlation with overall survival and event-free survival time. In addition, the high risk score group was also associated with unfavorable class label and MYCN amplification, as summarized in **Table 3**. Taken together, the 4-gene signature-based risk score selected in this study was consistent with these known risk factors for the prognosis of NB (Pinto et al., 2015).
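Spearman's rank correlation, used for the analysis above, is the Pearson correlation computed on ranks. The sketch below is a minimal pure-Python version with average ranks for ties; it does not compute p-values, and the risk scores and survival times are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical risk scores and overall survival times (days).
risk = [8.2, 10.9, 13.7, 16.4]
survival_days = [2000, 1500, 900, 300]
rho = spearman(risk, survival_days)
```

In this toy case survival decreases strictly as the risk score increases, so the coefficient is at its negative extreme, mirroring the direction of the correlation reported for survival time.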

TABLE 3 | Spearman analysis of correlation between four-gene signature and clinicopathological features in GSE49710.

Univariate analysis suggested that the 4-gene signature, age group (≥ 18 months vs. < 18 months), MYCN amplification, high risk, and INSS stage were significant prognostic factors of overall survival and event-free survival in NB (all p < 0.0001) (**Table 4**). Multivariate survival analysis was then performed using the variables that were significant in the univariate analysis. The results confirmed that the 4-gene signature and INSS stage remained independent prognostic indicators of unfavorable overall survival and event-free survival (**Table 4**).

## Confirmation of Potential Function of ERCC6L in NB

Within the four-gene signature, ERCC6L had the highest AUC (0.799) in predicting the survival status of NB patients. To explore the potential function of ERCC6L in NB development, a protein–protein interaction (PPI) network of the gene-encoded proteins was constructed using the STRING database<sup>2</sup>, so as to find the co-expression relationships among genes. CytoHubba (Chin et al., 2014) and MCODE (Bader and Hogue, 2003) in Cytoscape (Shannon et al., 2003) were employed to screen the hub genes that interacted with ERCC6L, and 8 genes (MAD2L1, NDC80, CCNB1, KIF18A, BIRC5, CENPM, CDCA5, and CCNB2) with a clustering coefficient of > 0.9 were selected, as shown in **Figure 5A**. The Pearson correlation coefficient (PCC) among the nine genes was also calculated. As shown in **Figure 5B**, the PCC of every pair of genes was higher than 0.8 (p-value < 2.2E-16) (PCC of all DEGs in **Supplementary Table S8**). The above results suggested that ERCC6L might interact with one or more of these genes in the development of NB. Nevertheless, the mechanism of ERCC6L in NB should be further confirmed by biological experiments.

<sup>2</sup>https://string-db.org/

TABLE 4 | Univariate and multivariate Cox regression analysis of the correlation between the 4-gene signature and clinical features in GSE49710.

Overall survival. Bold values emphasize p < 0.05.

## Validation of Performance in Other NB Datasets

To confirm the robustness of the 4-gene signature in predicting different groups, two independent datasets, GSE16476 and GSE45480 (GPL16876), were used for evaluation. As shown in **Figure 6**, the 4-gene signature performed well in separating MYCN amplification and non-amplification groups, with AUCs of 0.964 and 0.975, respectively. Meanwhile, the AUC of the 4-gene signature was 0.896 for vital status prediction in GSE16476 and 0.888 for discrimination of stage 4 and stage 4s in GSE45480. Analysis of the association between the 4-gene signature and clinical features in GSE16476 gave results similar to those in GSE49710 (**Supplementary Figures S3**, **S4** and **Supplementary Table S7**). When patients were divided into high and low risk score groups based on the median of the 4-gene signature, the high risk score group correlated with age of more than 18 months, advanced stage, MYCN amplification, and disease recurrence or progression, and correlated negatively with follow-up time.

# DISCUSSION

NB, an embryonal malignant tumor of early childhood, is associated with poor long-term survival, especially for stage 4 patients. Stratified treatment has achieved great progress in the past two decades (Pinto et al., 2015; Whittle et al., 2017); meanwhile, MYCN amplification and ALK mutation have been comprehensively studied as prognostic factors and important treatment targets (Barone et al., 2013; Huang and Weiss, 2013; Beckers et al., 2015; Liu and Thiele, 2017). However, MYCN amplification accounts for only about 25% of patients (Huang and Weiss, 2013). Consequently, identifying key genes, exploring new targets for therapy, and ultimately improving the survival time of patients with inferior outcomes remain challenging.

This study focused on discriminating NB cases with markedly different clinicopathological features, identifying the differentially expressed genes (DEGs) between untreated stage 4 and stage 4s patients, and using bioinformatic methods to analyze the data in depth. We identified 131 DEGs, and GO BP and KEGG functional enrichment analysis showed that the DEGs were mainly enriched in cancer-related pathways, such as DNA replication, cell division and cell cycle (Williams and Stoeber, 2012; Otto and Sicinski, 2017). We picked out a four-gene (ERCC6L, AHCY, STK33, and NCAN) signature for predicting the prognosis of NB, with an AUC as high as 0.86. The grouping ability of the gene signature was also assessed, and the results revealed that the four-gene signature performed well in predicting other groups. When predicting the MYCN amplification group, the AUC of the four-gene signature was 0.965, and in the two other independent datasets, the AUC in predicting MYCN amplification was higher than 0.95. We propose two possible reasons for this high AUC: (1) the AUC of AHCY alone in predicting MYCN amplification was 0.946, which could be attributed to the fact that AHCY is directly regulated by MYC proteins (Chayka et al., 2015); (2) it might be due to the insufficient sample size and the imbalanced distribution of positive and negative samples.

(C) ROC of MYCN status in GSE45480; (D) ROC of stage 4 vs. stage 4s in GSE45480.

An ideal prediction model should help clinicians determine optimal treatment strategies or predict patient outcomes. However, it is difficult to develop a satisfactory prognostic model because of the extreme complexity of cancer. MYCN amplification, ALK mutation (Barone et al., 2013; Lambertz et al., 2015), and other molecular biomarkers (Egler et al., 2008; Pugh et al., 2013; Zage et al., 2013) have provided researchers with some hints, yet more studies are still needed. In particular, current research mainly focuses on how to reduce treatment-associated side effects in patients with good prognosis, and how to provide more effective treatment for patients with poor prognosis (Whittle et al., 2017). This requires researchers to find more markers that distinguish patients with good prognosis from those with poor prognosis. Our study identified a four-gene signature for predicting prognosis and discriminating other groups, which attained satisfactory performance.

Multivariate Cox regression analysis showed that the four-gene signature and INSS stage are both independent prognostic indicators. The two features are highly correlated and both have prognostic significance, which indicates that they share part of their information. However, the signature also carries its own independent prognostic information, and, according to the p-value, it contributes more to the prognosis analysis of NB patients. In addition, to support our results, we removed the INSS stage and repeated the analysis with the other features. The results showed that the signature was still an independent prognostic indicator, further validating our signature. Survival analysis of the four-gene signature was also performed in stage 4 patients, patients without MYCN amplification, and patients aged < 18 months, and the results indicated that the high risk score group had significantly inferior outcomes. In conclusion, our four-gene signature might help clinicians choose treatment regimens and predict patient prognosis in similar situations. We obtained similar results in the independent dataset GSE16476, which supports the robustness of our signature in predicting the prognosis of NB patients.

Of the four genes, AHCY has been reported to be a direct target of MYCN, which predicts poor prognosis in NB (Chayka et al., 2015). NCAN has been confirmed to promote malignant phenotypes in NB cells (Su et al., 2017). ERCC Excision Repair 6 Like (ERCC6L, Spindle Assembly Checkpoint Helicase), also known as PICH, is a protein-coding gene whose related pathways include cell cycle and mitosis. Pu et al. found that higher ERCC6L expression was notably associated with inferior outcomes in breast and kidney cancers, and that silencing of ERCC6L inhibited the growth of breast and kidney cancer cells. Their further studies suggested that ERCC6L might affect the cell cycle via RAB31-MAPF-CDK, so as to promote cancer cell proliferation (Pu et al., 2017). Albers et al. showed that loss of PICH results in DNA damage, P53 activation, and impaired embryonic development (Albers et al., 2018), which indicates that PICH may play an important role in embryonic lethality and tumorigenesis. In our study, univariate and multivariate Cox regression analyses revealed that ERCC6L could be an independent prognostic factor of overall survival and event-free survival (p < 0.0001). The AUC of ERCC6L alone in predicting different groups was higher than 0.75 (p < 0.0001).

To explore the potential function of ERCC6L, we constructed a PPI network and calculated the PCC of ERCC6L with other key genes that may interact with it; eight genes were identified as closely related to ERCC6L. Most of them are involved in the cell cycle pathway. Importantly, of the eight genes, MAD2L1 (Gogolin et al., 2013), CCNB1 (Liu et al., 2013; Schwermer et al., 2015), and BIRC5 (Lamers et al., 2011; Hagenbuchner et al., 2016) have been proven to play important roles in NB development and metastasis. Taken together, we consider that ERCC6L may play an important role in NB, and biological experiments are needed to further study its mechanism.

# AUTHOR CONTRIBUTIONS

YL conceived of and directed the project. XZ designed the study, analyzed the data, and wrote the manuscript. HL revised the manuscript critically for important intellectual content. YZ collected data and samples. LW reviewed the data. All authors have read and approved the final manuscript for publication.

# FUNDING

This research was supported by the National Natural Science Foundation of China (Grant Nos. 61471181 and 81702966) and the Natural Science Foundation of Jilin Province (Grant Nos. 20140101194JC and 20150101056JC).

# ACKNOWLEDGMENTS

We wish to acknowledge Chao Lu for his help in processing the data with the random forest algorithm.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00589/full#supplementary-material

## REFERENCES

and spontaneous regression: a molecular portrait of stage 4S. Mol. Oncol. 2, 261–271. doi: 10.1016/j.molonc.2008.07.002



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhong, Liu, Liu, Zhang, Wang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Functional Prediction of Chronic Kidney Disease Susceptibility Gene PRKAG2 by Comprehensively Bioinformatics Analysis

Ermin Wang<sup>1</sup> \*, Hainan Zhao<sup>1</sup> , Deyan Zhao<sup>1</sup> , Lijing Li<sup>1</sup> and Limin Du<sup>2</sup>

<sup>1</sup> Department of Nephrology, The First Affiliated Hospital, Jinzhou Medical University, Jinzhou, China, <sup>2</sup> Jinzhou Medical University, Jinzhou, China

The genetic predisposition to chronic kidney disease (CKD) has been widely evaluated, especially using genome-wide association studies, which highlighted some novel genetic susceptibility variants in many genes, and estimated glomerular filtration rate to diagnose and stage CKD. Of these variants, rs7805747 in PRKAG2 was identified to be significantly associated with both serum creatinine and CKD at the genome-wide significance level. Until now, the potential mechanism by which rs7805747 affects CKD risk has remained unclear. Here, we performed a functional analysis of the rs7805747 variant using multiple bioinformatics software tools and databases. Using RegulomeDB and HaploReg (version 4.1), rs7805747 was predicted to be located in enhancer histone marks (Liver, Duodenum Mucosa, Fetal Intestine Large, Fetal Intestine Small, and Right Ventricle tissues). Using GWAS analysis in PhenoScanner, we showed that rs7805747 is not only associated with CKD, but is also significantly associated with other diseases or phenotypes. Using metabolite analysis in PhenoScanner, rs7805747 was found to be significantly associated not only with serum creatinine, but also with 16 other metabolites. Using eQTL analysis in PhenoScanner, rs7805747 was found to be significantly associated with the expression of multiple genes, including PRKAG2, in multiple human tissues. Gene expression analysis of PRKAG2 across 53 tissues, using GTEx RNA-Seq data from 8555 samples (570 donors), showed that PRKAG2 had the highest median expression in Heart-Atrial Appendage. Using gene expression profiles in human CKD, we further identified differential expression of the PRKAG2 gene in CKD cases compared with control samples. In summary, our findings provide new insight into the underlying susceptibility of the PRKAG2 gene to CKD.

Keywords: chronic kidney disease, PRKAG2, genome-wide association study, eQTL, gene expression

# INTRODUCTION

Chronic kidney disease (CKD) is a major global problem caused by the permanent loss of kidney function, and is also associated with an increased risk for cardiovascular disease (Cusumano and Gonzalez Bedat, 2008; Prodjosudjadi et al., 2009; Chambers et al., 2010; Shinohara, 2010; Sherwood and McCullough, 2016; James et al., 2017; Malhotra et al., 2017; Sinha and Bagga, 2018).

#### Edited by:

Quan Zou, Tianjin University, China

#### Reviewed by:

Chong Wang, Harvard Medical School, United States Sicheng Hao, Northeastern University, United States

> \*Correspondence: Ermin Wang jzmusnk@163.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 15 September 2018 Accepted: 08 November 2018 Published: 03 December 2018

#### Citation:

Wang E, Zhao H, Zhao D, Li L and Du L (2018) Functional Prediction of Chronic Kidney Disease Susceptibility Gene PRKAG2 by Comprehensively Bioinformatics Analysis. Front. Genet. 9:573. doi: 10.3389/fgene.2018.00573


The overall prevalence of CKD exceeds 10% and is approximately 14% in the general population, and its incidence is increasing (Almirall, 2016; Hursitoglu, 2016; Mills and He, 2016; Wuttke and Kottgen, 2016; Hedayati et al., 2017; James et al., 2017; Clark et al., 2018). It is reported that up to 20% of CKD cases are caused by genetic forms of renal disease (Almirall, 2016; Hursitoglu, 2016; Mills and He, 2016; Wuttke and Kottgen, 2016; Hedayati et al., 2017; James et al., 2017; Clark et al., 2018). Understanding the genetic predisposition to CKD and uncovering the underlying pathophysiological mechanisms may contribute to the development of targeted therapies. In recent years, the genetic predisposition to CKD has been widely evaluated, especially through genome-wide association studies (GWAS), which have highlighted novel genetic susceptibility variants in many genes and have used the estimated glomerular filtration rate to diagnose and stage CKD (Pattaro et al., 2016; Wuttke and Kottgen, 2016).

Among these CKD risk genes, the genetic variant rs7805747 in PRKAG2 was found to be associated with both serum creatinine and CKD at the genome-wide significance level (Chambers et al., 2010). The rs7805747 variant (chr7:151407801 in hg19) is located in an intron of PRKAG2, which is a protein-coding gene. Until now, the mechanism by which rs7805747 affects CKD risk has remained unclear. It is difficult to identify the function of coding and non-coding genes in molecular wet laboratories. However, computational methods, including various bioinformatics software tools and databases, may be useful for guiding and predicting function (Zou et al., 2016; Wan and Zou, 2017; Wan et al., 2017; Wei et al., 2017a,b; He et al., 2018a,b; Jia et al., 2018a,b; Jiang et al., 2018; Zeng et al., 2018). Here, we performed a functional analysis of the rs7805747 variant using multiple bioinformatics databases including RegulomeDB (Boyle et al., 2012), HaploReg (version 4.1) (Ward and Kellis, 2016), PhenoScanner (version 1.1) (Staley et al., 2016), and the UCSC Genome Browser (Rosenbloom et al., 2015; Tyner et al., 2017; Casper et al., 2018), as done in previous studies (Lu et al., 2011; Rhie et al., 2013; Hazelett et al., 2014; Liu et al., 2016, 2017a,b,c,d,e, 2018b; Guo et al., 2017; Hu et al., 2017a,b; Jiang et al., 2017; Zhang et al., 2018). In addition, we analyzed whole-genome case-control expression profiles in human CKD to investigate whether the susceptibility gene PRKAG2 is differentially expressed in CKD cases compared with control samples.

# MATERIALS AND METHODS

# Regulatory Analysis of rs7805747 Using RegulomeDB

The RegulomeDB database annotates genetic variants with known and predicted regulatory elements in the intergenic regions of the human genome (Boyle et al., 2012). In brief, the known and predicted regulatory DNA elements include regions of DNase hypersensitivity, binding sites of transcription factors, and promoter regions that have been biochemically characterized to regulate transcription (Boyle et al., 2012). These regulatory element datasets are from Gene Expression Omnibus (GEO), the Encyclopedia of DNA Elements (ENCODE) project, and published literature (Boyle et al., 2012).

# Functional Analysis of rs7805747 Using HaploReg

HaploReg is a tool for exploring annotations of the non-coding variants (Ward and Kellis, 2012, 2016). HaploReg v4 included LD information from the 1000 Genomes Project, chromatin state and protein binding annotation from the Roadmap Epigenomics and ENCODE projects, sequence conservation across mammals, the effect of SNPs on regulatory motifs, and the effect of SNPs on expression from eQTL studies (Ward and Kellis, 2012, 2016). More detailed information is provided in the original studies (Ward and Kellis, 2012, 2016).

# Functional Analysis of rs7805747 Using PhenoScanner

PhenoScanner includes publicly available large-scale GWAS summary results, covering about 3 billion associations, over 10 million unique single nucleotide polymorphisms (SNPs), and a broad range of phenotypes (Staley et al., 2016). The results are aligned across traits to the same effect and non-effect alleles for each SNP (Staley et al., 2016). Here, we performed three kinds of functional analyses using the GWAS, Metabolites, and eQTL analysis options (Staley et al., 2016). For the GWAS analysis, PhenoScanner included 88 GWAS datasets covering 76 kinds of diseases or phenotypes (Staley et al., 2016). For the Metabolites analysis, PhenoScanner included two metabolomics datasets (Shin et al., 2014; Kettunen et al., 2016). For the eQTL analysis, PhenoScanner included several eQTL datasets from eQTL Browser, Geuvadis, GTEx (version 6), MuTHER, and bloodeqtlbrowser. More detailed information is provided in the original study (Staley et al., 2016).

# Gene Expression Analysis of PRKAG2 in GTEx

We evaluated the expression of PRKAG2 using the RNA-Seq datasets from the NIH Genotype-Tissue Expression (GTEx) project, which was created to establish a sample and data resource for studies on the relationship between genetic variation and gene expression in multiple human tissues (GTEx Consortium, 2013; Mele et al., 2015). The GTEx project included median gene expression levels in 51 tissues and 2 cell lines (V6, October 2015) (GTEx Consortium, 2013; Mele et al., 2015). This release is based on data from 8555 tissue samples obtained from 570 adult postmortem individuals (GTEx Consortium, 2013; Carithers et al., 2015; Mele et al., 2015). Here, we used the UCSC Genome Browser to evaluate the expression of PRKAG2 in the 53 GTEx human tissues (V6, October 2015). The UCSC Genome Browser provides a new method to visualize interactions between regions of the genome (Meyer et al., 2013; Karolchik et al., 2014; Rosenbloom et al., 2015; Speir et al., 2016; Tyner et al., 2017; Casper et al., 2018).

# Case-Control Gene Expression Analysis of PRKAG2

We analyzed whole-genome case-control expression profiles in human CKD (Nakagawa et al., 2015). In the original study, a microarray analysis of renal biopsy specimens from CKD patients was conducted to identify the genes responsible for tubulointerstitial fibrosis and tubular cell injury in CKD (Nakagawa et al., 2015). This study reported microarray profiles for a total of 61 samples, including 53 biopsy specimens from CKD patients and 8 controls (Nakagawa et al., 2015). Here, we used the web tool GEO2R to evaluate whether the PRKAG2 gene is significantly dysregulated in CKD cases compared with control samples, as done in a recent study (Liu et al., 2018a). The significance level is defined as P < 0.01.

# RESULTS

### Regulatory Analysis of rs7805747 Using RegulomeDB

Using RegulomeDB, the predicted score is 5, which suggests that rs7805747 is likely to affect transcription factor (TF) binding or a DNase peak. The predicted binding protein is HNF4A (chr7:151407767-151408030 by ChIP-seq in the Caco2 cell type) (Verzi et al., 2010). The histone modification analysis showed that rs7805747 was predicted to be located in enhancer histone marks (Liver, Fetal Intestine Large, Right Ventricle, Duodenum Mucosa, and Fetal Intestine Small). Key information about the regulatory analysis is provided in **Table 1**. More detailed results are described in **Supplementary Table 1**.

### Functional Analysis of rs7805747 Using HaploReg

Using HaploReg v4, rs7805747 is predicted to be located in enhancer histone marks (Liver, Duodenum Mucosa, Fetal Intestine Large, Fetal Intestine Small, and Right Ventricle tissues) and DNase hypersensitivity regions (Foreskin Melanocyte Primary Cells skin01, Fetal Heart, Fetal Intestine Large, Fetal Intestine Small, Small Intestine), and to change regulatory motifs (GATA, HMGN3, Pax-6, and Tgif1) (Ward and Kellis, 2012, 2016).

#### Functional Analysis of rs7805747 Using PhenoScanner in GWAS

Using PhenoScanner in GWAS option, we identified 43 significant association results with P < 0.01. In addition to the CKD, we found that rs7805747 is also significantly associated with other diseases or phenotypes including Hemoglobin Hb, Hematocrit Hct, Red blood cell count RBC, systolic blood pressure (SBP), Breast cancer, Gout, Hypertension, and Extraversion, as described in **Table 2**.

#### Functional Analysis of rs7805747 Using PhenoScanner in eQTL

Using PhenoScanner in eQTL option, we identified 23 significant associations between rs7805747 and gene expression with P < 0.01. These findings show that rs7805747 is significantly associated with gene expression in multiple human tissues including brain, cells transformed fibroblasts, colon sigmoid, colon transverse, heart left ventricle, liver, lung, skin, small intestine terminal ileum, stomach, and whole blood, as described in **Table 3**. These regulated genes include GIMAP1, GIMAP5, AGAP3, KCNH2, TMEM176B, TMEM176A, FASTK, WDR86, NOS3, WDR86-AS1, XRCC2, YBX1P4, SLC4A2, TMUB1, and PRKAG2. Importantly, rs7805747 could significantly regulate PRKAG2 expression in blood with P = 3.81E-03 and 8.15E-04.

#### Functional Analysis of rs7805747 Using PhenoScanner in Metabolites

Using PhenoScanner in metabolites option, we identified significant association between rs7805747 and 17 metabolites with P < 0.01. These metabolites included Creatinine, Indolelactate, Phenol sulfate, Pseudouridine, Propionylcarnitine, C-glycosyltryptophan, Kynurenine, Myo-inositol, 3-carboxy-4-methyl-5-propyl-2-furanpropanoate (CMPF), 1-palmitoylglycerophosphoethanolamine, Phenyllactate (PLA), Erythronate, N-acetylthreonine, Citrulline, 3-methoxytyrosine, Urate, 2-methylbutyroylcarnitine, as described in **Table 4**.

#### Gene Expression Analysis of PRKAG2 in GTEx

Using the UCSC Genome Browser, the results showed that PRKAG2 had the highest median expression, 34.90 RPKM, in Heart – Atrial Appendage (Ensembl gene ID: ENSG00000106617.9; genomic position: hg38 chr7:151556111-151877125). The total median expression of PRKAG2 across all 53 tissues was 412.37 RPKM. **Figure 1** provides more detailed information about PRKAG2 gene expression in 53 tissues from GTEx RNA-Seq of 8555 samples (570 donors).

TABLE 1 | Key histone modification analysis of rs7805747 using RegulomeDB.

TABLE 2 | Significant association between rs7805747 and kinds of diseases or phenotypes with P ≤ 0.01.

eGFR, estimated glomerular filtration rate; SDS, standard deviation scores.

#### Case-Control Gene Expression Analysis of PRKAG2

Three probes evaluate the expression of the PRKAG2 gene in the human CKD gene expression profiles: A\_24\_P384779, A\_23\_P44366, and A\_23\_P314760. Each probe represents a different region of the PRKAG2 gene, and the three probes may target the same or different transcripts, isoforms, or exons. The results showed that PRKAG2 is dysregulated in CKD cases compared with control samples for these probes: A\_24\_P384779 [P = 1.23E-07 and log2(fold change) = −2.07], A\_23\_P44366 [P = 4.39E-03 and log2(fold change) = 0.74], and A\_23\_P314760 [P = 1.54E-02 and log2(fold change) = 0.62], although the last probe does not reach the predefined significance level of P < 0.01. In **Table 3**, there are 15 unique genes regulated by rs7805747. In addition to PRKAG2, we also evaluated the expression of the other 14 genes listed in **Table 3**. The results showed that 4 of the other 14 genes, TMUB1, AGAP3, XRCC2, and WDR86-AS1, were also differentially expressed in CKD cases with P < 0.01. **Table 5** provides detailed information about the 23 probes of the 15 genes including PRKAG2. Importantly, the differential expression of PRKAG2, TMUB1, AGAP3, and XRCC2 passed the multiple testing correction threshold of 0.01/23 = 0.000435.

TABLE 3 | Significant association between rs7805747 and the gene expression with P ≤ 0.01.

Beta is the regression coefficient based on the A allele: Beta > 0 means that the A allele is associated with increased expression of nearby genes, and Beta < 0 with reduced expression.

TABLE 4 | Significant association between rs7805747 and metabolites with P ≤ 0.01.

Beta is the regression coefficient based on the A allele: Beta > 0 means that the A allele is associated with increased metabolite levels, and Beta < 0 with reduced levels.
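The thresholds quoted in the case-control analysis can be checked with a few lines of arithmetic. The sketch below is only an illustration: it recomputes the Bonferroni-style threshold 0.01/23 and converts a log2(fold change) back to a linear fold change, using the PRKAG2 probe values reported in the text. The function and variable names are hypothetical, and the code is not part of the GEO2R workflow itself.

```python
# Worked arithmetic for the thresholds quoted in the text (values copied from it).
ALPHA = 0.01      # nominal significance level used in the study
N_PROBES = 23     # number of probes tested across the 15 genes


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold obtained by dividing alpha by the number of tests."""
    return alpha / n_tests


def fold_change(log2_fc: float) -> float:
    """Convert a log2(fold change) into a linear fold change."""
    return 2.0 ** log2_fc


# PRKAG2 probes: probe ID -> (P value, log2 fold change), as reported in the text.
prkag2_probes = {
    "A_24_P384779": (1.23e-07, -2.07),
    "A_23_P44366": (4.39e-03, 0.74),
    "A_23_P314760": (1.54e-02, 0.62),
}

threshold = bonferroni_threshold(ALPHA, N_PROBES)

# Probes whose P value is below the corrected threshold.
passing = [probe for probe, (p, _) in prkag2_probes.items() if p < threshold]

print(f"corrected threshold = {threshold:.6f}")
print("probes passing correction:", passing)
print(f"linear fold change for log2 FC -2.07: {fold_change(-2.07):.3f}")
```

Under these numbers, only A\_24\_P384779 falls below the corrected threshold, and its log2(fold change) of −2.07 corresponds to roughly a quarter of the control expression level.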



# DISCUSSION

It is reported that PRKAG2 encodes the gamma2-subunit isoform of 5′-AMP-activated protein kinase (AMPK) (Ahmad et al., 2005; Banerjee et al., 2007; Folmes et al., 2009; Kim et al., 2012; Thorn et al., 2013; Zhang et al., 2013; Hinson et al., 2016, 2017). AMPK is a metabolic enzyme that plays important roles in regulating energy metabolism in response to cellular stress. AMPK has been identified as a regulator of metabolism, survival, and fibrosis by a recent integrative analysis of PRKAG2 cardiomyopathy iPS and microtissue models (Hinson et al., 2016). In addition, mutations in PRKAG2 have been found to be associated with hypertrophic cardiomyopathy (Xu et al., 2017).

Over the past decade, GWAS have considerably improved our understanding of the genetic basis of kidney function and disease (Wuttke and Kottgen, 2016). The SNP rs7805747 identified by CKD GWAS lies within an intron of PRKAG2. Here, we performed a comprehensive functional analysis of this variant using multiple bioinformatics databases including RegulomeDB (Boyle et al., 2012), HaploReg (version 4.1) (Ward and Kellis, 2016), and PhenoScanner (version 1.1) (Staley et al., 2016). Using RegulomeDB, the predicted score is 5, which suggests that rs7805747 is likely to affect transcription factor (TF) binding or a DNase peak. The predicted binding protein is HNF4A (chr7:151407767-151408030 by ChIP-seq in the Caco2 cell type). In addition, rs7805747 was predicted to be located in enhancer histone marks (Liver, Fetal Intestine Large, Right Ventricle, Duodenum Mucosa, and Fetal Intestine Small). Using HaploReg (version 4.1), we found rs7805747 to be associated with enhancer histone marks, DNase hypersensitivity, and changed motifs, with the enhancer histone marks predicted in the same tissues (Liver, Duodenum Mucosa, Fetal Intestine Large, Fetal Intestine Small, and Right Ventricle). Hence, the findings in HaploReg (version 4.1) were consistent with those in RegulomeDB.

Using PhenoScanner in GWAS option, we showed that rs7805747 is not only associated with CKD, but also is significantly associated with other diseases or phenotypes including Hemoglobin Hb, Hematocrit Hct, Red blood cell count RBC, SBP, Breast cancer, Gout, Hypertension, and Extraversion.

Using PhenoScanner in eQTL option, rs7805747 was found to be significantly associated with the expression of multiple genes, including PRKAG2, in multiple human tissues. A previous study reported rs7805747 to be associated with serum creatinine and CKD (Chambers et al., 2010). Using PhenoScanner in metabolites option, rs7805747 was found to be significantly associated not only with serum creatinine but also with 16 other metabolites, as described in **Table 4**.

The gene expression analysis of PRKAG2 in 53 tissues from GTEx RNA-Seq of 8555 samples (570 donors) showed that PRKAG2 had the highest median expression in Heart – Atrial Appendage. Using gene expression profiles in human CKD, we further identified differential expression of the PRKAG2 gene in CKD cases compared with control samples. Taken together, these findings indicate that rs7805747 is associated with CKD risk, PRKAG2 gene expression, and 17 metabolites, and that PRKAG2 is differentially expressed in CKD cases. In summary, our findings provide new insight into the underlying susceptibility of the PRKAG2 gene to CKD.

### AUTHOR CONTRIBUTIONS

EW conceived and initiated the project and performed the functional analysis. EW, HZ, DZ, LL, and LD wrote the manuscript. All authors reviewed the manuscript and contributed to the final manuscript.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00573/full#supplementary-material

#### REFERENCES

in adults with chronic kidney disease: the CKD WIT randomized clinical trial. JAMA 319, 1870–1879. doi: 10.1001/jama.2018.4930



chromatin interactions and partners for the intestinal transcription factor CDX2. Dev. Cell 19, 713–726. doi: 10.1016/j.devcel.2010.10.006


and the signaling pathway involved. Am. J. Physiol. Heart. Circ. Physiol. 313, H283–H292. doi: 10.1152/ajpheart.00813.2016


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Wang, Zhao, Zhao, Li and Du. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Recent Advances on the Machine Learning Methods in Identifying DNA Replication Origins in Eukaryotic Genomics

#### Fu-Ying Dao, Hao Lv, Fang Wang and Hui Ding\*

*Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China*

The initiation sites of DNA replication are called origins of replication (ORIs). They are regulated by a set of regulatory proteins and play important roles in the basic biochemical processes of cell growth and division in all living organisms. Therefore, the study of ORIs is essential for understanding the cell-division cycle and gene expression regulation, so that scholars can develop new strategies against genetic diseases by using knowledge of DNA replication. Thus, the accurate identification of ORIs will provide key clues for DNA replication research and clinical medicine. Although conventional experiments can provide accurate results, they are time-consuming and costly. In contrast, bioinformatics-based methods can overcome these shortcomings. In particular, with the rapid accumulation of DNA sequences in the post-genomic era, high-throughput tools that identify ORIs from sequence information are highly desirable. In this review, we summarize the current progress in computational prediction of eukaryotic ORIs, including the collection of benchmark datasets, the application of machine learning-based techniques, the results obtained by these methods, and the construction of web servers. Finally, we give future perspectives on ORI prediction. The review provides readers with the full background of ORI prediction based on machine learning methods, which will help researchers study DNA replication in depth and develop drug therapies for genetic defects.

Keywords: eukaryotic DNA replication, origins of replication, machine learning method, DNA structure properties, webserver

# INTRODUCTION

DNA replication is the most essential process in all living organisms and is the basis of biological inheritance. In this process, two identical replicas of DNA are generated from one original DNA molecule. The onset of genomic DNA synthesis requires precise interactions of specialized initiator proteins with DNA at sites where the replication machinery can be loaded. These sites, defined as origins of replication (ORIs) (Macalpine and Bell, 2005; Necsulea et al., 2009; Sequeira-Mendes et al., 2009), regulate the beginning of DNA replication. Thus, they play key roles in the DNA replication process.

It is well known that the replication mechanisms of prokaryotic and eukaryotic genomes differ. Generally, most prokaryotes possess a single circular DNA molecule with only one ORI (Skarstad and Katayama, 2013). Eukaryotes have a more complex DNA replication process than prokaryotes, as shown in **Figure 1**. One linear chromosome of a eukaryotic cell has multiple replication forks. It has been shown that a single human cell has as many as 100,000 ORIs (Nasheuer et al., 2002). This ensures that DNA replication can be completed in a timely manner during the S phase of the cell cycle and speeds the duplication of the much larger store of genetic material. Autonomously replicating sequences (ARSs), which contain a specific 11-bp consensus element called the autonomous consensus sequence (ACS), are widely distributed in Saccharomyces cerevisiae (S. cerevisiae) (Stinchcomb et al., 1979; Theis and Newlon, 1997; Dhar et al., 2012). The ACS is the binding site for the origin recognition complex (ORC), the main factor that subsequently serves as a landing platform for the assembly of the other pre-RC proteins. Other elements close to the ACS motif contribute to its activity and provide a modular structure to origins (**Figure 1**) (Marahrens and Stillman, 1992).

#### Edited by:

*Dariusz Mrozek, Silesian University of Technology, Poland*

#### Reviewed by:

*Jiangning Song, Monash University, Australia Xiangxiang Zeng, Xiamen University, China*

> \*Correspondence: *Hui Ding hding@uestc.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *21 October 2018* Accepted: *21 November 2018* Published: *10 December 2018*

#### Citation:

*Dao F-Y, Lv H, Wang F and Ding H (2018) Recent Advances on the Machine Learning Methods in Identifying DNA Replication Origins in Eukaryotic Genomics. Front. Genet. 9:613. doi: 10.3389/fgene.2018.00613*

Revealing the DNA replication mechanism could provide important clues for understanding the regulatory mechanisms of cell division and the cell cycle. It can also help the discovery of new drugs for the treatment of various diseases (Mcfadden and Roos, 1999; Soldati, 1999; Raghu Ram et al., 2007). Thus, accurate identification of ORIs is an essential prerequisite for further study and understanding of DNA replication mechanisms. Chromatin immunoprecipitation (ChIP) and next-generation sequencing are popular techniques that can precisely identify ORIs (Metzker, 2009; Lubelsky et al., 2012). However, these experimental approaches are expensive and time-consuming when used for genome-wide identification of ORIs.

In recent years, with the accumulation of biological experimental data (Levitsky et al., 2005; Yamashita et al., 2011; Gao et al., 2012), it has become possible to predict ORIs by computational approaches. Breier et al. (2004) first developed the Oriscan algorithm to identify ORIs of S. cerevisiae. Shah and Krishnamachari (2012) found that the nucleotide correlation measure was better than GC skew at accurately delineating the replication origin. Chen et al. (2012) found that the distributions of DNA bendability and cleavage intensity differ between ORI and non-ORI regions and proposed a support vector machine (SVM) based model to identify ORIs in the S. cerevisiae genome. Li et al. (2014) performed a detailed analysis of the compositional bias of the S. cerevisiae genome. Subsequently, they developed a predictor called iORI-PseKNC (Li et al., 2015) to identify ORIs in the S. cerevisiae genome. Another web server, called iROS-gPseKNC, was established to discriminate ORIs from non-ORIs by using random forest (RF) (Xiao et al., 2016). By combining PseKNC with an RF classifier, Zhang et al. (2016) developed a predictor called iOri-Human to identify human ORIs. Recently, Singh et al. (2018) used a multi-view ensemble learning (MEL) approach to predict ORIs in the S. cerevisiae genome, and Liu et al. (2018) developed a new predictor called "iRO-3wPseKNC" to identify ORIs in four yeast species, evaluated by rigorous cross-validations.

This review begins with an introduction to benchmark dataset construction for eukaryotic genomes. We then outline the machine learning-based techniques that have been applied successfully to ORI identification and briefly discuss the advantages and limitations of these methods. Next, we analyze the published prediction results and the published web servers. Finally, we discuss future studies on ORI prediction.

# BENCHMARK DATASET

# Published ORI Databases

With the accumulation of biochemical data and the development of computers and networks, more and more databases have been constructed to store biological data (Huang et al., 2012; He et al., 2016; Feng et al., 2017; Hou et al., 2017; Liang et al., 2017; The Uniprot, 2018). Some have been built specifically to store genome replication origin data (Gao and Zhang, 2007; Nieduszynski et al., 2007; Weddington et al., 2008; Cotterill and Kearsey, 2009; Gao et al., 2012; Cherry, 2015). Here, we briefly introduce these resources.

OriDB is the most extensively used database of eukaryotic DNA replication origins, in which each potential replication origin site is assigned one of three confidence levels: confirmed, likely, or dubious (Nieduszynski et al., 2007). The database stores replication origin information for two organisms, budding yeast (S. cerevisiae) and fission yeast (S. pombe). Users can access, search, and download ORI data from the database. The database also provides a graphics viewer that allows users to select chromosomal regions and display the selected data, giving researchers a direct view of the data and considerable assistance in studying DNA replication.

Another database, named DeOri, was constructed in 2012 to store eukaryotic ORIs (Gao et al., 2012). A total of 16,145 ORIs were collected from 6 eukaryotic organisms. This database facilitates the comparative genomic analysis of ORIs and provides some insight into the nature of ORIs on a genome scale.

In addition to the databases described above, there are many other ORI-related databases, such as DNAReplication (Cotterill and Kearsey, 2009), Replication Domain (Weddington et al., 2008), and SGD (Cherry, 2015). These databases can be accessed through the URLs in **Table 1**. Details of these databases can be found in the review by Peng et al. (2015).

We found that most of the training datasets used in studies of eukaryotic ORI recognition were constructed from OriDB, and only one was obtained from DeOri, as shown in **Table 2**. This suggests that these two databases are reliable and can be used for other studies of ORIs.

# The Published Benchmark Datasets

For ORI prediction, it is necessary to construct an objective and strict benchmark dataset that can be handled by machine learning methods. Following strict steps (Dao et al., 2017), several previous studies have constructed their own benchmark datasets to train and test their proposed prediction models. The details of these datasets are listed in **Table 2**.

Based on OriDB, the first ORI benchmark dataset, called O1, was constructed by Chen et al. (2012). The dataset includes 322 experimentally verified ORIs and 966 non-ORIs in the yeast genome. Li et al. (2015) established the second yeast benchmark dataset, named O2, which contains 405 experimentally verified ORIs and 406 non-ORIs. In addition, Zhang et al. (2016) built a new dataset, called O3, containing 283 experimentally confirmed human ORIs and 282 human non-ORI samples on the basis of DeOri. Singh et al. (2018) obtained 251 ARS samples of S. cerevisiae from OriDB and generated three negative datasets. Recently, a dataset (named O5) of four yeast species, including S. cerevisiae, S. pombe, K. lactis, and P. pastoris, was constructed by Liu et al. (2018).

TABLE 1 | A list of published ORI databases.

# ORI SAMPLES FORMULATION

It is well known that machine learning algorithms can handle only vectors, not sequence samples (Liu et al., 2016; Yang et al., 2018b). Thus, we should consider how to formulate an ORI sequence as a vector.
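To make the idea of turning a sequence into a vector concrete, the sketch below uses a generic k-mer frequency encoding. This encoding is chosen here only for illustration and is an assumption, not one of the specific formulations discussed in this review; the function name `kmer_frequency_vector` is hypothetical.

```python
from itertools import product
from typing import List


def kmer_frequency_vector(seq: str, k: int = 2) -> List[float]:
    """Map a DNA sequence to a fixed-length vector of k-mer frequencies.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    # Return zeros if the sequence is too short to contain any valid k-mer.
    return [counts[m] / total if total else 0.0 for m in kmers]


vector = kmer_frequency_vector("ATGTCA", k=2)
print(len(vector), vector)
```

Whatever the sequence length, the output always has 4<sup>k</sup> components, which is what allows sequences of different lengths to be fed to the same classifier.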

#### Compositional Analysis Methods

The first method is called GC skew. Since Lobry (1996) published this computational method to identify ORIs in bacterial genomes, many scholars have used it to analyze and identify ORIs (Mclean et al., 1998; Shah and Krishnamachari, 2012; Li et al., 2014; Parikh et al., 2015). For a given ORI sequence, the GC skew can be defined by the following equation.

$$\text{GC skew}[i] = \frac{f_i(G) - f_i(C)}{f_i(G) + f_i(C)} \tag{1}$$

where f<sub>i</sub>(G) and f<sub>i</sub>(C) represent the frequencies of occurrence of Guanine (G) and Cytosine (C), respectively, in the i-th sliding window along a sequence. The GC skew score ranges between −1 and +1. When f<sub>i</sub>(G) < f<sub>i</sub>(C), the score is negative; conversely, when f<sub>i</sub>(G) > f<sub>i</sub>(C), it is positive. In particular, the origin of replication is at the position where the GC skew score undergoes an abrupt transition from a positive value to a negative value.
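Equation (1) can be sketched in a few lines of Python. The function below is a minimal illustration, not a published tool: the name `gc_skew`, the non-overlapping default step, and the choice to return 0.0 for a window without any G or C (where the ratio is undefined) are all assumptions made for this example. Counts are used instead of frequencies because the window length cancels in the ratio.

```python
from typing import List, Optional


def gc_skew(seq: str, window: int, step: Optional[int] = None) -> List[float]:
    """GC skew of Equation (1) for each sliding window along a sequence.

    A window with no G and no C has an undefined skew; this sketch
    returns 0.0 for such windows (an assumption, not part of the formula).
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    step = step or window  # default: non-overlapping windows
    seq = seq.upper()
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        scores.append((g - c) / (g + c) if (g + c) else 0.0)
    return scores


# Two windows of length 5: "GGGCA" (G-rich) and "ACCCG" (C-rich).
print(gc_skew("GGGCAACCCG", window=5))
```

Scanning the returned list for a sign change is one simple way to locate the abrupt transition described above.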

The GC skew method is the most prominent computational measure for predicting ORIs in most bacterial genomes (Shah and Krishnamachari, 2012). It not only helps to deepen the understanding of biological replication mechanisms but also contributes to drug discovery. However, this method is not applicable to some bacterial genomes, many archaeal genomes, and almost all eukaryotic genomes (Shah and Krishnamachari, 2012). Moreover, the GC skew is based only on the composition of G and C. Thus, a random sequence with a similar composition displays similar characteristics.

The second GC content based method is called GC profile (Li et al., 2014). Knowing the general compositional features of ORI sequences is of great importance for understanding the evolution, structure, and function of genomes. For a given ORI sequence, the GC profile can be obtained from Equation (2).

$$\text{GC profile}[i] = \frac{f_i(G) + f_i(C)}{f_i(A) + f_i(C) + f_i(G) + f_i(T)} \tag{2}$$

where f<sub>i</sub>(A), f<sub>i</sub>(C), f<sub>i</sub>(G), and f<sub>i</sub>(T) represent the frequencies of occurrence of Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), respectively, in the i-th sliding window along a sequence. The GC profile ranges between 0 and 1. When the value is below 0.5, the GC content of the window is lower than its AT content; conversely, when the value is above 0.5, the GC content is higher than the AT content.

TABLE 2 | The constructed benchmark data sets for predicting ORIs.

The GC profile gives an intuitive picture of the relationship between GC content and AT content, so a quantitative and qualitative view of genome organization can be gained easily. A published tool for studying GC profiles, established by Gao and Zhang (2006), is freely available from http://origin.tubic.org/GC-Profile/. It provides great convenience for visualizing and analyzing the variation of GC content in genomic sequences.
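Equation (2) can be sketched in the same way as Equation (1). The function below is an illustrative assumption and is unrelated to the implementation of the GC-Profile web tool; the name `gc_profile` and the handling of windows that contain no A, C, G, or T (returned as 0.0) are choices made for this example.

```python
from typing import List, Optional


def gc_profile(seq: str, window: int, step: Optional[int] = None) -> List[float]:
    """GC profile of Equation (2) for each sliding window along a sequence.

    Only A, C, G, and T are counted; a window without any of them
    yields 0.0 in this sketch (an assumption, not part of the formula).
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    step = step or window  # default: non-overlapping windows
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        values.append(gc / (gc + at) if (gc + at) else 0.0)
    return values


# First window is AT-only, second window is GC-only.
print(gc_profile("ATATAGCGCG", window=5))
```

Values below 0.5 mark AT-rich windows and values above 0.5 mark GC-rich windows, matching the interpretation given for Equation (2).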

#### Correlation Measure

Two kinds of correlation measures have been proposed for ORI prediction. One is the auto-correlation measure, which is defined as:

$$C\_{G} = \frac{1}{N - 1} \sum\_{k=1}^{N-1} \left| C(k) \right| \tag{3}$$

where

$$C\left(k\right) = \frac{1}{N-k} \sum\_{j=1}^{N-k} a\_j a\_{j+k} \tag{4}$$

where C(k) is the auto-correlation function of a discrete ORI sequence, as defined in Beauchamp and Yuen (1979) and Cavicchi (2000). Here a<sub>j</sub> ∈ {+1, −1} and j ranges from 1 to N. The auto-correlation measure C<sub>G</sub> is the average of the absolute correlation values over all lags k; the subscript "G" refers to "genome." C<sub>G</sub> ranges from 0 to 1, and a lower value indicates lower correlation strength within the sequence. A nucleic acid sequence such as ATGTCA is converted into discrete sequences as follows. For base A, every position holding A is set to +1 and every position holding one of the other three bases (G, C, T) is set to −1; the same is done for each of the other bases. The sequence therefore gives rise to four discrete sequences, {1,−1,−1,−1,−1,1}, {−1,−1,1,−1,−1,−1}, {−1,−1,−1,−1,1,−1}, and {−1,1,−1,1,−1,−1}, corresponding to the four bases A, G, C, and T, respectively. Thus, there are four bit strings and four values of correlation strength, one for each base. Details of the method are given in Shah and Krishnamachari (2012) and Parikh et al. (2015).
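The encoding and Equations (3) and (4) can be sketched as follows. This is an illustrative Python version with hypothetical function names, not the code used in the cited studies.

```python
def indicator(seq, base):
    """Encode a sequence as +1 where `base` occurs and -1 elsewhere."""
    return [1 if ch == base else -1 for ch in seq.upper()]


def autocorrelation(a, k):
    """C(k) of Equation (4) for a +1/-1 sequence a and lag k (1 <= k <= N - 1)."""
    n = len(a)
    return sum(a[j] * a[j + k] for j in range(n - k)) / (n - k)


def correlation_strength(a):
    """C_G of Equation (3): the mean of |C(k)| over all lags k = 1 .. N - 1."""
    n = len(a)
    return sum(abs(autocorrelation(a, k)) for k in range(1, n)) / (n - 1)


# The four discrete sequences of the example sequence ATGTCA.
encodings = {base: indicator("ATGTCA", base) for base in "AGCT"}
strengths = {base: correlation_strength(a) for base, a in encodings.items()}
```

In practice the measure would be evaluated in windows along a genome so that changes near an ORI can be located; the window size is left to the reader.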

An abrupt change of C(k) near an ORI helps to identify it. This method takes the order of the bases into account. However, it does not define the characteristic signature very well. Thus, the cross-correlation measure was developed to identify ORIs. It is defined as:

$$\mathcal{C}\_{\mathcal{G}} = \frac{1}{N - 1} \sum\_{k=1}^{N-1} \left| C\_{\mathcal{C}}(k) \right| \tag{5}$$

where

$$C\_{\mathcal{C}}\left(k\right) = \frac{1}{(N-k)\sigma\_a\sigma\_b} \sum\_{j=1}^{N-k} (a\_j - \mu\_a)(b\_{j+k} - \mu\_b) \tag{6}$$

where b<sub>j</sub> in Equation (6) is defined in the same way as a<sub>j</sub>, with σ<sub>a</sub> = 1 = σ<sub>b</sub> and µ<sub>a</sub> = 1 = µ<sub>b</sub>.

Shah and Krishnamachari (2012) calculated the cross-correlations among A, T and G, C, but found that these values were not informative. They therefore concluded that a calculation of (A − T)/(A + T) cannot correctly identify the origin of replication.

#### DNA Structural Properties

Chen et al. (2012) analyzed DNA bendability and cleavage intensity around ORIs in the S. cerevisiae genome. They found that both DNA bendability and cleavage intensity in core replication regions were significantly lower than those in surrounding regions. Therefore, these two structural properties are of crucial importance in identifying ORIs.

The bendability parameters for every trinucleotide were obtained by Brukner et al. (1995) and have also been used in promoter prediction (Abeel et al., 2008; Akan and Deloukas, 2008). For example, the bendability of the sequence CTATG is 0.406 (0.090 [CTA] + 0.182 [TAT] + 0.134 [ATG]). In a similar way, a 300 bp sample sequence is split into six fragments (300/50) by using a window size of 50 bp with a step of 50 bp, and the bendability of each fragment is calculated. As a result, there are six features for each sample.
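The worked example and the windowing step can be sketched in Python as below. Only the three trinucleotide values quoted in the text are included; the full parameter table of Brukner et al. (1995) would have to be supplied for real sequences. The function names are hypothetical, and trinucleotides that straddle two windows are ignored here, which is an assumption the text does not settle.

```python
# The three values quoted in the text; a real run needs all 64 trinucleotides.
BENDABILITY = {"CTA": 0.090, "TAT": 0.182, "ATG": 0.134}


def bendability(seq, table=BENDABILITY):
    """Sum the bendability parameters of all overlapping trinucleotides in seq."""
    seq = seq.upper()
    return sum(table[seq[i:i + 3]] for i in range(len(seq) - 2))


def window_features(seq, table, window=50, step=50):
    """One bendability feature per window, e.g. six features for a 300 bp sample."""
    return [bendability(seq[start:start + window], table)
            for start in range(0, len(seq) - window + 1, step)]
```

Calling `bendability("CTATG")` reproduces the value 0.406 given above, up to floating-point rounding.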

Cleavage intensity measures how readily DNA is cleaved by hydroxyl radicals. It can be calculated from parameters for a set of tetranucleotide patterns in a given DNA sequence; these parameters were obtained experimentally (Greenbaum et al., 2007). Subsequently, Bishop et al. (2011) predicted cleavage intensity with the ORChID2 algorithm (http://dna.bu.edu/orchid/). Thus, the cleavage intensity of a sequence sample can be calculated with the web tool. By using a window size of 50 bp with a step of 50 bp, six features for each sample can be obtained as well.

#### Pseudo K-Tuple Nucleotide Composition

Inspired by the concept of pseudo amino acid composition (PseAAC) (Shen and Chou, 2008), the pseudo k-tuple nucleotide composition (PseKNC) was developed to deal with DNA/RNA sequences (Chen et al., 2014, 2018b).

The PseKNC is used to formulate samples for predicting ORIs. For an arbitrary DNA sequence D with L nucleic acid residues formulated as:

$$\mathbf{D} = \mathbf{R}\_1 \mathbf{R}\_2 \cdots \mathbf{R}\_{L-1} \mathbf{R}\_L \tag{7}$$

where R<sub>i</sub> denotes the nucleic acid residue at the i-th position in the sample sequence, the sequence can be represented by a (4<sup>k</sup> + λ)-dimensional vector as follows.

$$\mathbf{D} = [d\_1 d\_2 \cdots d\_{4^k} d\_{4^k+1} \cdots d\_{4^k + \lambda - 1} d\_{4^k + \lambda}] \tag{8}$$

where

$$d\_{u} = \begin{cases} \dfrac{f\_{u}}{\sum\_{i=1}^{4^{k}} f\_{i} + \omega \sum\_{j=1}^{\lambda} \theta\_{j}}, & \left(1 \le u \le 4^{k}\right) \\ \dfrac{\omega \theta\_{u-4^{k}}}{\sum\_{i=1}^{4^{k}} f\_{i} + \omega \sum\_{j=1}^{\lambda} \theta\_{j}}, & \left(4^{k} + 1 \le u \le 4^{k} + \lambda\right) \end{cases} \tag{9}$$

where f<sub>u</sub> is the normalized frequency of the u-th k-tuple nucleotide in a sequence sample, λ is a non-negative integer that reflects the rank of correlation, and ω is the weight factor used to adjust the effect of the sequence correlation. θ<sub>j</sub> is the j-tier sequence correlation factor, which is calculated according to Equations (10)–(12).

$$\theta\_{j} = \frac{1}{L - j - 1} \sum\_{i=1}^{L-j-1} \Theta \left( R\_{i} R\_{i+1}, R\_{i+j} R\_{i+j+1} \right), \quad \left( j = 1, 2, \cdots, \lambda; \; \lambda < L \right) \tag{10}$$

$$\Theta\left(R\_i R\_{i+1}, R\_{i+j} R\_{i+j+1}\right) = \frac{1}{\mu} \sum\_{\nu=1}^{\mu} \left[P\_{\nu}\left(R\_i R\_{i+1}\right) - P\_{\nu}\left(R\_{i+j} R\_{i+j+1}\right)\right]^2 \tag{11}$$

$$P\_{\nu} \left( R\_{i} R\_{i+1} \right) = \frac{P\_{\nu} \left( R\_{i} R\_{i+1} \right) - < P\_{\nu} >}{SD(P\_{\nu})} \tag{12}$$

where µ is the number of local DNA structural properties considered in Equation (11). Six local structural parameters are commonly used, of which three are local translational parameters (shift, slide, and rise) and the other three are local angular parameters (twist, tilt, and roll) (Guo et al., 2014). P<sub>v</sub>(R<sub>i</sub>R<sub>i+1</sub>) is the numerical value of the v-th physicochemical property for the dinucleotide at the i-th position in an ORI or a non-ORI sample. For consistency, these values should be standardized before they are used in Equation (11). Generally, the Z-score defined in Equation (12) is used for this normalization (Chou and Shen, 2006), where the symbol < > denotes the average over the dinucleotides and SD denotes the corresponding standard deviation. The website http://lin-group.cn/pseknc/default.aspx was used to calculate PseKNC (Chen et al., 2014).
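To make Equations (8)–(12) concrete, the sketch below assembles a Type-I PseKNC vector in plain Python. It is a simplified illustration under several assumptions: the sequence contains only A, C, G, and T; the standard deviation in Equation (12) is taken as the population standard deviation over the 16 dinucleotides; the weight ω multiplies the correlation factors in both the numerator and the denominator; and the single property used in the usage lines is a toy stand-in (the number of G or C bases in a dinucleotide), not one of the six structural parameters. All function names are hypothetical and unrelated to the PseKNC web server.

```python
from itertools import product
from math import sqrt

BASES = "ACGT"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]


def zscore(prop):
    """Standardize one property over the 16 dinucleotides (Equation 12)."""
    mean = sum(prop[d] for d in DINUCS) / len(DINUCS)
    sd = sqrt(sum((prop[d] - mean) ** 2 for d in DINUCS) / len(DINUCS))
    return {d: (prop[d] - mean) / sd for d in DINUCS}


def kmer_frequencies(seq, k):
    """Normalized k-tuple frequencies, in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def theta(seq, j, props):
    """j-tier correlation factor (Equations 10 and 11)."""
    n_terms = len(seq) - j - 1
    total = 0.0
    for i in range(n_terms):
        d1, d2 = seq[i:i + 2], seq[i + j:i + j + 2]
        total += sum((p[d1] - p[d2]) ** 2 for p in props) / len(props)
    return total / n_terms


def pseknc(seq, k, lam, omega, raw_props):
    """Type-I PseKNC vector of length 4**k + lam (Equations 8 and 9)."""
    seq = seq.upper()
    props = [zscore(p) for p in raw_props]
    freqs = kmer_frequencies(seq, k)
    thetas = [theta(seq, j, props) for j in range(1, lam + 1)]
    denom = sum(freqs) + omega * sum(thetas)
    return [f / denom for f in freqs] + [omega * t / denom for t in thetas]


# Toy stand-in property: number of G/C bases in each dinucleotide (0, 1, or 2).
toy_property = {d: sum(ch in "GC" for ch in d) for d in DINUCS}
vector = pseknc("ATGTCAGGCTAAC", k=2, lam=3, omega=0.5, raw_props=[toy_property])
```

With k = 2 and λ = 3 the vector has 4² + 3 = 19 components, and by construction the components sum to 1.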

#### Three-Window-Based PseKNC

A newer method combines PseKNC with GC asymmetry information to represent sequences; it is named the "three-window-based pseudo k-tuple nucleotide composition" or "three-window-based PseKNC". The concrete procedure is as follows. Let D denote a DNA sample and L the length of its sequence.

The DNA sequence D is divided into three non-overlapping segments, called the front window D[1, η], the middle window D[η+1, ξ], and the rear window D[ξ+1, L], according to two parameters ε and δ. Here, ε represents the fraction of the nucleobases of D that fall in the front window, while 1 − δ represents the fraction that fall in the rear window. η and ξ are defined as

$$\begin{cases} \eta = \mathrm{Int}^{c}[L \times \varepsilon] \\ \xi = \mathrm{Int}^{c}[L \times \delta] \end{cases}, \quad (0 < \varepsilon < \delta < 1.0) \tag{13}$$

where Int<sup>c</sup> means taking the ceiling integer for the number in the brackets right after it.

If each subfragment is represented by its k-tuple nucleotide (k-mer) composition, the DNA sequence is described by 3 × 4<sup>k</sup> components as follows:

$$\mathbf{D} = \left[ f\_1^{1} \cdots f\_{4^k}^{1} \; f\_{4^k+1}^{2} \cdots f\_{2 \times 4^k}^{2} \; f\_{2 \times 4^k+1}^{3} \cdots f\_{3 \times 4^k}^{3} \right]^{T} \tag{14}$$

where f<sup>1</sup>, f<sup>2</sup>, and f<sup>3</sup> denote the normalized frequencies of the corresponding k-tuple nucleotides in the front, middle, and rear windows of sample D, respectively. When each window is instead represented by PseKNC, a sample sequence is translated into the feature vector

$$\mathbf{D} = \left[ \phi\_1 \cdots \phi\_{4^{k} + \lambda} \; \phi\_{4^{k} + \lambda + 1} \cdots \phi\_{2 \times (4^{k} + \lambda)} \; \phi\_{2 \times (4^{k} + \lambda) + 1} \cdots \phi\_{3 \times (4^{k} + \lambda)} \right]^{T} \tag{15}$$

The components φ<sub>u</sub> are calculated as in Type-I PseKNC (Chen et al., 2014); the specific calculation is not elaborated here. More details about the three-window-based PseKNC feature extraction method can be found in Liu et al. (2018).
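The window split of Equation (13) and the concatenated k-mer composition of Equation (14) can be sketched as follows. This Python fragment is only an illustration: the function names are hypothetical, the PseKNC extension of Equation (15) is omitted, and a window shorter than k is given all-zero frequencies, which is an assumed convention.

```python
from itertools import product
from math import ceil


def split_three_windows(seq, eps, delta):
    """Split seq into front, middle, and rear windows following Equation (13)."""
    if not 0 < eps < delta < 1:
        raise ValueError("expected 0 < eps < delta < 1")
    eta = ceil(len(seq) * eps)
    xi = ceil(len(seq) * delta)
    return seq[:eta], seq[eta:xi], seq[xi:]


def kmer_frequencies(seq, k):
    """Normalized k-mer frequencies in lexicographic order (zeros if seq is too short)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = max(len(seq) - k + 1, 0)
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def three_window_kmer_vector(seq, k, eps, delta):
    """Concatenate the k-mer frequencies of the three windows (3 * 4**k values)."""
    vector = []
    for window in split_three_windows(seq.upper(), eps, delta):
        vector.extend(kmer_frequencies(window, k))
    return vector
```

For a 10-base sequence with ε = 0.25 and δ = 0.75, the windows contain 3, 5, and 2 bases, because η = ⌈2.5⌉ = 3 and ξ = ⌈7.5⌉ = 8.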

# PREDICTION ALGORITHMS

#### Support Vector Machine

The support vector machine (SVM) (Cao et al., 2014) is a supervised machine learning method based on statistical learning theory, which was developed by Cortes and Vapnik (1995). By seeking the minimum structural risk, SVM improves its generalization ability while keeping the empirical risk low, so that good statistical rules can be learned even from small training sets. Thus, it is one of the most common and effective classifiers. Although the dimension of biological sequence features is generally high, SVM is not prone to over-fitting. Thus, SVM has been widely used in bioinformatics (Jensen and Bateman, 2011; Li et al., 2016; Manavalan and Lee, 2017; Manavalan et al., 2017, 2018a,b,c; Song et al., 2018c; Yang et al., 2018a). Detailed descriptions of SVM can be found in Vapnik and Vladimir (1997). To reduce the programming burden on researchers, the software package LIBSVM (Chang and Lin, 2011) has been developed and can be freely downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Singh et al. (2018) used three classification algorithms (KNN, NB, and SVM) to classify ARS sequences based on the same feature extraction method and found that SVM was the most reliable classifier. Therefore, SVM is a suitable machine learning algorithm for identifying ORIs.

#### Random Forest Algorithm

The Random Forest (RF) algorithm (Ho, 1995, 1998) is an ensemble learning method for classification and regression. It is also widely used in bioinformatics research (Zhao et al., 2014). The basic unit of RF is the decision tree, and each decision tree is itself a classifier. RF combines multiple trees through ensemble learning: N trees produce N classification results, and RF takes the category with the most votes as the final output.

The RF algorithm is flexible and practical. It can handle thousands of input variables without variable deletion and generates an internal unbiased estimate of the generalization error. It also offers an effective way of estimating missing data and maintains accuracy when a large proportion of the data are missing.
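The voting step of RF can be illustrated with a few lines of Python. The sketch covers only the aggregation of per-tree predictions, not tree construction or bootstrap sampling, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    label, _ = Counter(tree_predictions).most_common(1)[0]
    return label
```

For instance, `majority_vote(["ORI", "non-ORI", "ORI"])` returns `"ORI"`. When two labels receive the same number of votes, this sketch returns the one that was encountered first.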

# COMMONLY-USED EVALUATION METRICS

Selecting suitable assessment criteria helps to estimate a model's performance correctly and objectively (Chou, 2011; Feng et al., 2013a,b; Chen et al., 2018a; Li et al., 2018a,b; Song et al., 2018a,b). The jackknife test yields a unique result for a given benchmark dataset and has therefore been widely used to validate predictors' performance (Yang et al., 2016; Chen et al., 2017). The following four metrics, sensitivity (Sn), specificity (Sp), overall accuracy (Acc), and Matthews correlation coefficient (MCC), are commonly applied and are defined as

$$\text{Sn} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{16}$$

$$Sp = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{17}$$

$$Acc = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{18}$$

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FN}) \times (\text{TN} + \text{FN}) \times (\text{TP} + \text{FP}) \times (\text{TN} + \text{FP})}} \tag{19}$$

where TP, FP, TN, and FN, respectively denote the number of true positives, false positives, true negatives, and false negatives.
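The four metrics follow directly from the confusion-matrix counts. The Python sketch below is one straightforward way to compute them; the function name is hypothetical, and returning an MCC of 0.0 when the denominator vanishes is an assumed convention.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, Acc, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = sqrt((tp + fn) * (tn + fn) * (tp + fp) * (tn + fp))
    # MCC is undefined when a row or column of the confusion matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

With 40 true positives, 5 false positives, 45 true negatives, and 10 false negatives, this gives Sn = 0.80, Sp = 0.90, Acc = 0.85, and an MCC of about 0.70.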

The receiver operating characteristic (ROC) curve (Metz, 1989) measures the predictive capability of a model across the entire range of the algorithm's decision values. It plots Sn (the ordinate) against 1 − Sp (the abscissa). The area under the ROC curve (auROC) objectively assesses the performance of a method: auROC = 1 indicates a perfect classifier, and auROC = 0.5 indicates a random classifier.
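The auROC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher decision value than a randomly chosen negative one, with ties counted as one half. The following Python sketch uses this pairwise formulation; the function name is hypothetical, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def au_roc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For positives scored 0.9, 0.8, and 0.4 and negatives scored 0.7, 0.3, and 0.2, eight of the nine pairs are ranked correctly, so the auROC is 8/9.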

# PUBLISHED RESULTS

#### ORIs Characteristics

Many statistical analyses of ORIs (Chen et al., 2012; Li et al., 2014) have been carried out to gain a deeper understanding of the replication initiation mechanism.

The physicochemical properties of oligonucleotides play an important role in replication regulation. By analyzing DNA bendability and cleavage intensity around ORIs in the S. cerevisiae genome, Chen et al. (2012) found that both DNA bendability and cleavage intensity in core replication regions were significantly lower than those in both the upstream and downstream regions of ORIs. Based on this result, they proposed a computational model based on DNA physicochemical properties to predict yeast ORIs.

Li et al. (2014) carried out an extensive analysis of yeast ORIs. Firstly, they analyzed the compositional bias in the S. cerevisiae genome by calculating the GC content surrounding ORIs and found that it was lower than the genome-wide GC content. Secondly, based on an analysis of the GC profile and GC skew, they found that both scores were significantly lower in ORI regions than in the flanking regions. Thus, they deduced that the replication mechanism of the S. cerevisiae genome is similar to that of bacterial genomes. Thirdly, by calculating information redundancies, they found that ORI sequences have a very strong short-range dominance of base correlations. Fourthly, they investigated the distribution of ORIs in the genome and drew several conclusions: ORIs always appear in nucleosome-free regions; promoters might share elements with ORIs; and most ORIs are not biased toward transcription start regions. Finally, they compared the performance of the above-mentioned characteristics for ORI prediction by using SVM and found that the nucleosome occupancy feature predicts ORIs much more accurately than GC skew and D2.

#### ORIs Prediction

Based on the constructed benchmark datasets listed in **Table 2**, researchers have developed various models for ORI prediction by using machine learning methods.

On the basis of the benchmark dataset O1, Chen et al. (2012) used SVM to construct two models, based respectively on structural characteristics (DNA bendability and cleavage intensity) and on the local word contents of k-mers (k = 3, 4). They concluded that DNA bendability and cleavage intensity could be of great help to ORI prediction. Moreover, they found that DNA structural characteristics could provide novel insights into the regulatory mechanisms of DNA replication. Their structural feature-based model achieved an overall accuracy of 85.86% with an auROC of 0.848.

TABLE 3 | A list of the published prediction tools for ORI prediction.

Based on the benchmark dataset O2, Li et al. (2015) encoded the ORI sequences of S. cerevisiae with PseKNC, which reflects both the short-range and long-range sequence-order effects of a DNA sequence. They incorporated six common local structural properties of the 16 dinucleotides into PseKNC, of which three are local translational parameters (shift, slide, and rise) and the other three are local angular parameters (twist, tilt, and roll). As a result, an overall success rate of 83.72% was achieved in the jackknife cross-validation test with the SVM algorithm. Subsequently, a user-friendly web server called iORI-PseKNC was established and made freely accessible at http://lin-group.cn/server/iOri-PseKNC. They applied the model to the yeast genome and found over 8,000 potential ORIs. Later on, Xiao et al. (2016) incorporated dinucleotide position-specific propensity information into the general pseudo nucleotide composition for predicting ORIs with the RF classifier. As a result, the overall success rate reached 98.03%. Based on this model, they provided the web server iROS-gPseKNC, available at http://www.jcibioinfo.cn/iROS-gPseKNC.

Based on the benchmark dataset O3, Zhang et al. (2016) developed a predictor called iOri-Human. They used the same feature extraction method as Li et al. (2015) and the RF algorithm for classification. The overall accuracy in identifying human ORIs was over 75% in jackknife cross-validation. Moreover, a user-friendly web server for iOri-Human has been established at http://lin-group.cn/server/iOri-Human.html, by which users can easily get their desired results without the need to go through the complicated mathematics involved.

Based on the benchmark dataset O4, Singh et al. (2018) compared three classification algorithms, namely the distance-based k-nearest neighbor (KNN), the probabilistic Naive Bayes (NB) classifier, and SVM. Using a multi-view ensemble learning model, they found that SVM was the better choice for predicting ARSs with the given properties in all genomic contexts.

Based on the benchmark dataset O5, Liu et al. (2018) established a classification model for ORIs in four yeast species, named iRO-3wPseKNC. They employed a different mode of PseKNC to extract features by incorporating GC asymmetry information into the sample formulation and used the RF algorithm for classification. In jackknife cross-validation, the success rates for the four yeast species (S. cerevisiae, S. pombe, K. lactis, and P. pastoris) were 0.730, 0.965, 0.851, and 0.710, respectively. These results indicate that the predictor is quite powerful and may become a very useful bioinformatics tool for genome analysis.

Web servers are a newly emerging tool of the internet age. They bring great convenience to biochemical researchers, who can apply difficult mathematical and computational methods without needing to understand the mathematical details or to program. **Table 3** gives an overview of the web servers for ORI prediction described above. As **Table 3** shows, for a given unknown sequence, the predictors iORI-PseKNC and iOri-Human can locate ORIs more precisely by scanning with a 300 bp window, but each is restricted to a single species. iRO-3wPseKNC can handle sequences from four different yeast species, but it returns only one result for the whole given sequence. The iROS-gPseKNC server does not work.

# CONCLUSIONS AND PERSPECTIVES

DNA molecules transfer genetic information from parent to offspring by replication. Thus, DNA replication is one of the most important life processes at the cellular level, and knowledge of ORIs is fundamental to understanding it. Accurate identification of ORIs will provide crucial clues for revealing the mechanism of DNA replication and for discovering new drugs for the treatment of various diseases. Computational tools based on machine learning are especially necessary for obtaining such predictions.

Generally, developing a sequence-based predictor needs to consider the following guidelines (Chou, 2011): (i) benchmark dataset construction; (ii) feature extraction and feature optimization; (iii) classification algorithm comparison and selection; (iv) result evaluation and analysis; (v) web server establishment.

We found that none of the abovementioned publications used feature selection methods to improve prediction accuracy. Feature selection is important in pattern recognition for obtaining key features, excluding redundant information or noise, improving the robustness, efficiency, and accuracy of models, and alleviating the curse of dimensionality. At present, many feature selection techniques have been proposed to optimize a feature set for maximum accuracy and to establish robust bioinformatics models, for instance, minimal-redundancy-maximal-relevance (mRMR) (Peng et al., 2005), maximum-relevance-maximum-distance (MRMD) (Zou et al., 2016b), binomial distribution (BD) (Su et al., 2018), F-score (Lin et al., 2014), and the analysis of variance (ANOVA) (Tang et al., 2018).

mRMR is a filter-type feature selection method proposed by Peng et al. (2005). Its core idea is to maximize the correlation between features and class labels while minimizing the correlation among the features themselves. It runs fast and can produce robust models. MRMD is similar to mRMR but can additionally scan the ranked features for the best dimension. It has been widely used in bioinformatics recently (Zou et al., 2016a; Wei et al., 2018c). The BD-based feature selection technique has a strict and objective statistical foundation for extracting over-represented motifs in sample sequences (Feng et al., 2018; Su et al., 2018; Zhu et al., 2018). Thus, it is also widely applied in sequence analysis (Feng and Luo, 2008; Lai et al., 2017). The F-score is a simple feature selection method with a strict mathematical definition that is usually used to measure the degree of difference between two sets of real numbers (Lin et al., 2014, 2017). The basic idea of ANOVA is to compare the variance among groups with the variance within groups and thereby to determine which features differ between groups (Chen et al., 2016).
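As an illustration of filter-type feature ranking, the sketch below computes an F-score per feature and sorts the features by it. It assumes the commonly used two-class definition, in which the squared deviations of the two class means from the overall mean are divided by the sum of the two within-class sample variances; the cited studies may use a different variant. The function names are hypothetical, and each class needs at least two samples with non-zero variance.

```python
def f_score(pos_values, neg_values):
    """F-score of one feature from its values in the positive and negative sets."""
    n_pos, n_neg = len(pos_values), len(neg_values)
    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = (sum((x - mean_pos) ** 2 for x in pos_values) / (n_pos - 1)
              + sum((x - mean_neg) ** 2 for x in neg_values) / (n_neg - 1))
    return between / within


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted from the highest to the lowest F-score."""
    n_features = len(pos_rows[0])
    scores = [f_score([row[i] for row in pos_rows], [row[i] for row in neg_rows])
              for i in range(n_features)]
    return sorted(range(n_features), key=lambda i: scores[i], reverse=True)
```

A feature whose values are 1, 2, 3 in the positive set and 4, 5, 6 in the negative set obtains an F-score of 2.25, whereas a feature with identical class means obtains 0 and is ranked last.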

In bioinformatics prediction, a key to obtaining a highly accurate model is to use valid mathematical descriptors to formulate samples. Type-II PseKNC is a different kind of PseKNC that can reflect the correlation effect of different kinds of physicochemical properties (Chen et al., 2014). Thus, it is better suited than Type-I PseKNC for describing ORI samples. However, it has not been used in any of the published studies for predicting ORIs. In the future, we will try to combine the Type-II PseKNC method with feature selection techniques to build a powerful and robust model for predicting ORIs.

In summary, although great progress in ORI prediction has been made, further improvements should be made in the following respects. Firstly, most work has focused on ORI prediction in bacterial, yeast, and human genomes. Thus, more models should be constructed for predicting ORIs in the genomes of other species. Secondly, as more and more biochemical data accumulate, older benchmark datasets should be updated regularly to obtain more reliable samples. Thirdly, appropriate feature selection methods should be employed to reduce the dimension of the feature vectors and improve prediction accuracy. Fourthly, more machine learning methods, such as deep learning, should be tried for building classification models (Cao et al., 2016, 2017; Long et al., 2017; Shao et al., 2018; Wei et al., 2018a,b; Yu et al., 2018; Zhang et al., 2018).

# AUTHOR CONTRIBUTIONS

HD conceived and designed the experiments. F-YD, HL, and FW analyzed the data and reviewed the references. F-YD, HL, FW, and HD performed the analysis and wrote the paper. All authors read and approved the final manuscript.

# ACKNOWLEDGMENTS

This work was supported by the National Nature Scientific Foundation of China (61772119, 31771471), the Fundamental Research Funds for the Central Universities of China (Nos. ZYGX2015Z006, ZYGX2016J125, ZYGX2016J118), the Natural Science Foundation for Distinguished Young Scholar of Hebei Province (No. C2017209244), and the Program for the Top Young Innovative Talents of Higher Learning Institutions of Hebei Province (No. BJ2014028).

# REFERENCES

Cavicchi, T. J. (2000). Digital Signal Processing. New York, NY: John Wiley & Sons.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Dao, Lv, Wang and Ding. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# MADS-Box Gene Classification in Angiosperms by Clustering and Machine Learning Approaches

Yu-Ting Chen<sup>1,2†</sup>, Chi-Chang Chang<sup>3,4†</sup>, Chi-Wei Chen<sup>1,5</sup>, Kuan-Chun Chen<sup>1</sup> and Yen-Wei Chu<sup>1,2,6</sup>\*

<sup>1</sup> Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan, <sup>2</sup> Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung, Taiwan, <sup>3</sup> School of Medical Informatics, Chung-Shan Medical University, Taichung, Taiwan, <sup>4</sup> IT Office, Chung Shan Medical University Hospital, Taichung, Taiwan, <sup>5</sup> Department of Computer Science and Engineering, National Chung-Hsing University, Taichung, Taiwan, <sup>6</sup> Biotechnology Center, Agricultural Biotechnology Center, Institute of Molecular Biology, National Chung Hsing University, Taichung, Taiwan

#### Edited by:

Quan Zou, University of Electronic Science and Technology of China, China

#### Reviewed by:

Vishal Acharya, Institute of Himalayan Bioresource Technology (CSIR), India; Leyi Wei, Tianjin University, China; Hao Lin, University of Electronic Science and Technology of China, China

\*Correspondence:

Yen-Wei Chu, ywchu@nchu.edu.tw

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 30 October 2018 Accepted: 14 December 2018 Published: 08 January 2019

#### Citation:

Chen Y-T, Chang C-C, Chen C-W, Chen K-C and Chu Y-W (2019) MADS-Box Gene Classification in Angiosperms by Clustering and Machine Learning Approaches. Front. Genet. 9:707. doi: 10.3389/fgene.2018.00707

The MADS-box gene family is an important transcription factor family involved in floral organogenesis. The previously proposed ABCDE model suggests that different floral organ identities are controlled by various combinations of classes of MADS-box genes. The five-class ABCDE model cannot cover all the species of angiosperms, especially the orchid. Thus, we developed a two-stage approach for MADS-box gene classification to advance the study of floral organogenesis of angiosperms. First, eight classes of reference datasets (A, AGL6, B12, B34, BPI, C, D, and E) were curated and clustered by phylogenetic analysis and unsupervised learning, and they were confirmed by the literature. Second, feature selection and multiple prediction models were curated according to sequence similarity and the characteristics of the MADS-box gene domain using support vector machines. Compared with the BindN and COILS features, the local BLAST model yielded the best accuracy. For performance evaluation, the accuracy of Phalaenopsis aphrodite MADS-box gene classification was 93.3%, which is higher than the 86.7% of our previous classification prediction tool, iMADS. Phylogenetic tree construction, the most common method for gene classification, yields classification errors and is time-consuming for the analysis of massive, multi-species, or incomplete sequences. In this regard, our new system can also confirm the classification errors for all randomly selected sequences that were incorrectly classified by phylogenetic tree analysis. Our model constitutes a reliable and efficient MADS-box gene classification system for angiosperms.

Keywords: ABCDE model, MADS-box gene, phylogenetic tree, support vector machine, machine learning

# INTRODUCTION

Angiosperms, i.e., flowering plants, have evolved a most remarkable flower to ensure fertilization and reproduction. Indeed, angiosperms have evolved many specialized flowering processes to adapt to a wide range of environments as well as to attract animals, which help facilitate their reproduction. Thus, angiosperms comprise a diverse group of plants, accounting for ∼80% of all plant species (Christenhusz and Byng, 2016), and they constitute the source materials for the production of many foods, drugs, wood, paper, and fiber.

Studies of the model plant Arabidopsis indicate that dicotyledonous flowers contain four organs, namely the sepal, petal, stamen, and carpels, which are located on four concentric whorls. Flowering processes and floral organ determination are controlled by MADS-box genes (Theissen et al., 2000). The encoded proteins share a MADS (M) domain at the N-terminus, which is a conserved 56-amino-acid residue region that is named for the initials of four members of this family: MCM1, AG, DEF, and SRF (Yang et al., 2012). The MADS-box genes have been classified as type I and type II on the basis of phylogenetic analysis (Alvarez-Buylla et al., 2000). Type I, named the M-type, contains the conserved M domain and the large variability region at the C-terminus (Masiero et al., 2011). Type II is known as MIKC-type, which contains and is named for the M domain, intervening (I) domain, keratin-like (K) domain, and C-terminal (C) domain (Kaufmann et al., 2005). Functionally, the M domain has the DNA binding activity, the I domain influences the DNA-binding dimerization, and the K domain can form amphipathic helices that mediate dimerization of MADS-box proteins and also are involved in the formation of other complexes (Egea-Cortines et al., 1999; Yang et al., 2003). The C domain, which is the most variable in sequence and function, is involved in transcriptional activation and formation of higher-order transcription factor complexes and also contributes to MADS-box protein interaction specificity (van Dijk et al., 2010; Callens et al., 2018). In plants, M-type MADS-box genes are involved in reproduction, especially female gametophyte, embryo, and endosperm development (Masiero et al., 2011), and MIKC-type MADS-box genes participate in meristem differentiation, flowering, fruit development, and the determination of floral organ identity according to the ABCDE model (Callens et al., 2018).

The ABCDE model, which originated from the ABC model (Coen and Meyerowitz, 1991), explains the genetic mechanisms of flower development and floral organ identification through the complex interaction of MIKC-type MADS-box genes (**Figure 1**). On the basis of their homeotic functions, the floral organ identity MADS-box genes have been divided into A, B, C, D, and E classes (Theissen, 2001). A- and E-class proteins are together responsible for sepal development in the first floral whorl, the combination of A-, B-, and E-class proteins specifically controls petal formation in the second whorl, the combination of B-, C-, and E-class proteins regulates stamen differentiation in the third whorl, the combination of C- and E-class proteins specifies carpel development in the fourth whorl, and the combination of D- and E-class proteins is required for ovule identity (Murai, 2013; **Figure 1**). Mutant phenotype analysis has facilitated cloning and identification of many MADS-box genes. The A-class genes include APETALA 1 (AP1) and FRUITFULL (FUL) (Gu et al., 1998); the B-class genes include APETALA 3 (AP3), PISTILLATA (PI), and GLOBOSA (GLO) (Zahn et al., 2005); the C-class gene is AGAMOUS (AG) (Mizukami and Ma, 1997); the D-class genes include SEEDSTICK (STK) and SHATTERPROOF1 (SHP1) (Favaro et al., 2003); and the E-class genes include SEPALLATA1 (SEP1), SEP2, SEP3, and SEP4 (Ditta et al., 2004). Phylogenetic analysis is currently the most common method for MADS-box gene classification. Because of massive, multi-species, and incomplete sequences, phylogenetic tree construction can be time-consuming and result in classification errors (Su et al., 2013a). Incomplete assembly sequences are becoming more abundant with the widespread use of next-generation sequencing. Thus, the use of phylogenetic analysis for MADS-box gene classification is increasingly challenging.

In our previous study, we used a machine-learning approach to construct a MADS-box gene classification prediction tool, iMADS (Yang et al., 2012). This tool reduces the time required to generate a prediction and presents reliable and systematic results to users. However, some deficiencies remain. First, the training dataset needs to be updated and filtered more precisely. Second, the training model of iMADS was constructed on the basis of cross-alignment of whole sequences, which results in lower performance on partial sequence prediction. Third, some plants that contain unique floral organs do not fit the ABCDE model. For example, the flowers of Orchidaceae, one of the most diverse and widespread horticultural plants, contain six floral organs: the sepal in whorl 1, the petal and lip in whorl 2, and the pollinia, column, and pedicel in whorl 3 (Su et al., 2013b). The classical ABCDE model cannot explain the determination of the orchid-specific flower organs, the lip and pollinia. On the basis of transcriptomic sequences, microarray analysis, and quantitative PCR validation, Su et al. established a modified ABCDE model extended to eight classes to explain how the lip and pollinia are determined (Su et al., 2013b; **Figure 1**). In this modified model, in addition to the C, D, and E classes, the A-class genes are divided into AP1 (A) and AGL6 classes, and the B-class genes are grouped into AP3-1,2 (B12), AP3-2,4 (B34), and PI (BPI). In Phalaenopsis aphrodite, the differentiation of the sepal requires AGL6, B12, BPI, and E-class genes, the petal requires AGL6, B12, B34, BPI, and E-class genes, the lip requires AGL6, B34, BPI, and E-class genes, the pollinia require A and BPI class genes, the column requires AGL6, B34, BPI, C, and E-class genes, and the pedicel requires A, AGL6, BPI, C, D, and E-class genes. On the basis of their study, we constructed a new automatic MADS-box gene classification platform to encompass more species of angiosperms.

In this study, we incorporated MADS-box genes of orchids into the training dataset to improve the performance of classification. The training model of the system is built in two stages. To account for unique flower organs, the MADS-box genes are divided into eight classes rather than the five classes of the original ABCDE model. In addition, to test various features, we constructed multiple prediction models according to domain characteristics using support vector machines (SVMs). From the independent test and from the classification errors of the phylogenetic results, we found that using the domain database as the training model and using BLAST as the feature yielded the best accuracy. In brief, this system can analyze query sequences of different lengths and automatically optimizes the prediction result corresponding to the extended eight-class model.

## MATERIALS AND METHODS

In this study, we constructed the classification models for MADS-box genes using SVMs. In brief, the MADS-box genes were collected from NCBI, TAIR, and TIGR by keyword search. The collection included M- and MIKC-type genes. Among them, only the MIKC-type genes, which are involved in the determination of floral organ identity, were selected and further classified into eight classes as a training dataset by unsupervised algorithms. According to the characteristics of the structure, the whole sequence and the four domains were used to create features via BLAST (Altschul et al., 1990), COILS (Lupas et al., 1991), and BindN (Wang and Brown, 2006) as input to LIBSVM (Fan et al., 2005) to construct the classification models. To improve the accuracy, we used PROSITE to determine whether the query sequence was a MIKC-type gene and to identify its domain content, and then the sequence was sent to create features for classification. The flowchart of the MADS-box gene classification system is shown in **Figure 2**, and the details of model construction are described as follows.

#### Data Collection and Filtration

All the MADS-box protein sequences, including M- and MIKC-type genes, were collected from different flowering plants. For the training dataset curation, 89 Arabidopsis sequences were obtained from "The Arabidopsis Information Resource" (TAIR, http://www.arabidopsis.org) (Huala et al., 2001), 76 Oryza sequences were obtained from The Rice Genome Annotation (TIGR<sup>1</sup>) (Yuan et al., 2003), 47 orchid sequences (except Phalaenopsis equestris and Oncidium) were obtained from The National Center for Biotechnology Information (NCBI<sup>2</sup>), and 4 monocot plant sequences (Lilium longiflorum, LMADS\_2, LMADS\_10; Hyacinthus orientalis, HOMADS\_1; Agapanthus praecox, APMADS2) were also obtained from NCBI, giving 216 sequences in total (**Table 1**). By combining MUSCLE alignment and phylogenetic tree construction (using the neighbor-joining method with 1000 bootstrap replicates) in Molecular Evolutionary Genetics Analysis (MEGA 5.2) (Tamura et al., 2011) with information from the literature, the 216 sequences were divided into two groups: 133 MIKC-type sequences and 83 M-type sequences. Some MIKC-type genes, which are involved in root development (ANR1; Gan et al., 2005) or regulate flowering time (SOC1; Moon et al., 2003), were discarded. Only the genes that directly regulate floral organs were retained for the training dataset. Ultimately, 85 MIKC-type genes were selected and classified into eight classes: 13 sequences of A, 9 sequences of B12, 10 sequences of B34, 10 sequences of BPI, 14 sequences of C, 8 sequences of D, 14 sequences of E, and 7 sequences of AGL6. To establish the independent testing datasets, 15 MADS-box genes of P. aphrodite and 11 of Oncidium Gower Ramsey were obtained from Orchidstra 2.0<sup>3</sup> (**Supplementary Datasets**).

#### Feature Encoding

According to the characteristics of the structure, the models were constructed on the basis of the four domains and the whole sequences of MIKC-type genes. The prediction tools Basic Local Alignment Search Tool (BLAST) (Altschul et al., 1990) and BindN (Wang and Brown, 2006), which derive features from protein sequences, and COILS (Lupas et al., 1991), which derives features from protein secondary structure, were used to identify the contribution to classification (**Figure 3**). In general, the whole sequences and the four MIKC domains were each encoded with BLAST features. Additionally, the M domain was encoded with BindN because it is the main DNA-binding region, and the K domain was encoded with COILS because it is the key region for coiled-coil structure formation. Finally, all of the features were converted into the SVM input format for machine learning. The features are described as follows (**Figure 3**).

<sup>1</sup>http://rice.plantbiology.msu.edu

<sup>2</sup>http://www.ncbi.nlm.nih.gov

<sup>3</sup>http://orchidstra2.abrc.sinica.edu.tw

FIGURE 2 | Flowchart of the MADS-box gene classification system. The MADS-box genes were collected from NCBI, TAIR, and TIGR (right). After phylogenetic analysis, only MIKC-type genes were selected and clustered into eight class training models. To avoid errors, all the classification was verified in the literature. For domain judgement, the query sequence was assessed by PROSITE to determine whether it was a MIKC-type gene and to identify its domain content (left), and then the sequence was sent to create features for classification.

#### TABLE 1 | Whole data collection.

<sup>a</sup>TAIR, The Arabidopsis Information Resource, http://www.arabidopsis.org.

<sup>b</sup>TIGR, The Rice Genome Annotation, http://rice.plantbiology.msu.edu.

<sup>c</sup>NCBI, The National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov.

<sup>d</sup>For independent testing dataset construction, the MADS-box genes of P. equestris and Oncidium Gower Ramsey were obtained from Orchidstra 2.0 (http://orchidstra2.abrc.sinica.edu.tw).

#### BLAST

The Basic Local Alignment Search Tool compares protein sequences to databases, calculates the statistical significance of matches, and finds regions of similarity between biological sequences (Altschul et al., 1990). In this study, we used the standalone version of BLAST to obtain the average p-value from the pairwise comparison of the query protein sequence with each of the eight class model sets. The eight average p-values are encoded as features for SVM training. Eight training databases (A, B12, B34, BPI, C, D, E, and AGL6) were constructed in our system. According to the characteristics of the structure, the whole sequence and the four domains were independently used to create five features via BLAST (**Figure 3**). The query sequence was compared with all the sequences in the database. When any e-values were less than 10<sup>−5</sup>, the BLAST feature was encoded as the average of the corresponding bit scores. If no e-value was less than 10<sup>−5</sup>, the BLAST feature was encoded as 0. By comparing the query with the eight databases individually, we obtained eight features for each input (**Figure 4**).
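The encoding rule above can be illustrated with a minimal Python sketch. It assumes that the BLAST hits have already been parsed into (e-value, bit score) pairs per class database, and that a class with no hit below the cutoff is encoded as 0; the function name, the input layout, and the toy hits are hypothetical and are not taken from the authors' implementation.

```python
from statistics import mean

# The eight class databases named in the text.
CLASSES = ["A", "B12", "B34", "BPI", "C", "D", "E", "AGL6"]
E_VALUE_CUTOFF = 1e-5

def blast_features(hits_by_class):
    """Encode one query as eight features, one per class database.

    hits_by_class maps a class name to a list of (e_value, bit_score)
    pairs (hypothetical layout). Each feature is the mean bit score of
    the hits whose e-value is below the cutoff, or 0.0 if there is none.
    """
    features = []
    for cls in CLASSES:
        scores = [bit for e_value, bit in hits_by_class.get(cls, [])
                  if e_value < E_VALUE_CUTOFF]
        features.append(mean(scores) if scores else 0.0)
    return features

# Toy hits for a single query (made-up numbers).
example_hits = {
    "A": [(1e-30, 120.0), (1e-20, 100.0)],
    "C": [(1e-3, 30.0)],  # e-value above the cutoff, so it is ignored
}
example_features = blast_features(example_hits)
```

Running the same function once for the whole sequence and once per domain would give the five BLAST feature sets mentioned in the text.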

#### BindN

BindN is a bioinformatics tool for predicting DNA- or RNA-binding residues in amino acid sequences (Wang and Brown, 2006). The M domain of MADS-box genes, which contains a helix-loop-helix super-secondary structure related to DNA binding and is involved in dimerization, was used to create a feature by BindN.

BLAST scores were counted and presented as eight features for each input.

#### COILS

COILS is a program that compares a sequence to a database of coiled-coils, derives a similarity score, and then calculates the probability of coiled-coil formation (Lupas et al., 1991). The K domain of MIKC-type MADS-box proteins can form the leucine-zipper super-secondary structure (a short coiled-coil of two parallel helices), which is critical for dimerization. The two parallel helices are formed by several (abcdefg)<sup>n</sup> heptad repeats. Among them, hydrophobic amino acids usually occupy the "a" and "d" positions, which cause the K domain to form a series of amphipathic α-helices and dimerize with other proteins via hydrophobic interactions. In this study, we used an MTIDK matrix with weighting (weights: a,d = 2.5 and b,c,e,f,g = 1.0), combined with a 3-window width, for feature encoding.

#### LIBSVM

Among machine learning algorithms, the SVM, a supervised learning model with associated learning algorithms that analyze data and recognize patterns, is used for classification and regression analysis. LIBSVM is a library for SVMs. It incorporates software for support vector classification, regression, and distribution estimation, and supports multi-class classification (Fan et al., 2005). In this study, we used BLAST, COILS, and BindN features to build up nine training models to help classify the unknown-class MADS-box genes into one of the eight classes.

#### Motif Discovery and Annotation in the C Domain

Compared with the other domains, the C domain has the lowest sequence conservation and the most diverse function, which may improve the degree of computational discrimination. This region might adopt specific structures to achieve different functions. We selected groups of classes that could easily be confused in classification: "Group B", including the B12, B34, and BPI classes; "Group AE", including A, E, and AGL6; and "Group CD", including C and D. Focusing on these three groups, we used MEME (Bailey et al., 2015) and TOMTOM (Gupta et al., 2007) to find novel motifs and to annotate their functions using the JASPAR plant motif database (Khan et al., 2018).

# RESULTS AND DISCUSSION

# Model Construction and Validation

Before constructing the classification model, it was necessary to compare BLAST, COILS, and BindN, used alone or in combination, on whole-sequence or domain-sequence analysis in different training models. Six classification algorithms (LIBSVM, RandomForest, J48, RandomTree, KStar, and XGBoost) were used for model selection (Hall et al., 2009; Qiang et al., 2018; Wei et al., 2018). The trained predictor was evaluated with 3-fold cross-validation, and its corresponding accuracy is shown in **Figure 5A**. In brief, BLAST achieved the highest accuracy with whole sequences and with the I, K, and C domains, but BindN had the best accuracy with the M domain. That is, the significance of DNA/RNA-binding residues could improve the classification. The highly conserved coiled-coil formation in the K domain caused COILS to have the lowest discriminating rate.

FIGURE 5 | Model construction and validation. (A) The accuracy of the different training models. Comparison of different tools; BLAST was selected for future evaluation of the effects of whole sequences and domain database training sets. (B) Filtration system to screen the domain content of the input and to subject it to the suitable model for classification prediction.

Given its higher accuracy in most models, BLAST was selected for the training model. Because of the numerous bursts of sequence generation in the post-genome era, the number of incomplete sequences is increasing. To identify the effects of whole and partial gene sequences on this classification system, we used MADS-box genes of P. equestris and Oncidium Gower Ramsey to establish a whole sequence database and a domain database as independent testing sets. We found that this classification system achieved much higher accuracy on incomplete sequences with the domain database than with the whole sequence database (**Figure 5A**). For example, if the input was a MADS-box gene of P. equestris containing only the M domain, the corresponding accuracy of the whole sequence database and the domain database was 27.27 and 90.91%, respectively. Thus, we constructed a filtration system to screen the domain content of the input and then subject it to the suitable model for classification prediction (**Figure 5B**).
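The filtration idea can be pictured as a small dispatch step. The sketch below is only an illustration under stated assumptions: the text does not specify how a partial query containing several domains is handled, so returning one per-domain model for each detected domain is a guess, and the function and model names are hypothetical.

```python
# Order of the four MIKC domains along the protein.
DOMAIN_ORDER = ("M", "I", "K", "C")

def select_models(domains_found):
    """Choose which trained model(s) a query should be sent to.

    domains_found is the set of domain letters detected in the query
    (for example by a PROSITE scan). A query with all four domains goes
    to the whole-sequence model; a partial query goes to the model of
    each domain it contains (an assumption, see the lead-in).
    """
    present = [d for d in DOMAIN_ORDER if d in domains_found]
    if len(present) == len(DOMAIN_ORDER):
        return ["whole"]
    return [d + "-domain" for d in present]

full_query_models = select_models({"M", "I", "K", "C"})
partial_query_models = select_models({"M"})
```

An empty result signals that no MIKC domain was detected, in which case the query would not be classified at all.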

#### Performance of Independent Dataset

#### TABLE 2 | Performance of Oncidium Gower Ramsey independent dataset.

<sup>∗</sup>OMADS3 was incorrectly classified as B34.

#### TABLE 3 | Performance of P. aphrodite independent dataset.

<sup>∗</sup>PATC052371 was incorrectly predicted as class D in this study, but it was deleted in the updated Orchidstra 2.0 database.

TABLE 4 | Performance comparison of classification methods using P. aphrodite MADS-box genes.

TABLE 5 | Performance comparison of classification methods using classification error of the phylogenetic tree.

<sup>a</sup>NCBI Protein\_ID.

<sup>∗</sup>Incorrect classification by iMADS.

When LIBSVM is used for prediction, its built-in probability estimation can be used to calculate the probability that each sample belongs to each category. Of the prediction results for the testing dataset, 24 of 26 were classified correctly (92.31%) (**Tables 2**, **3**). The Oncidium MADS-box gene OMADS3 had similar probabilities of being B12 or B34 (0.35 and 0.36, respectively) and was incorrectly predicted as class B34 (**Table 2**). The P. aphrodite MADS-box gene PATC052371 has an expression pattern similar to class C, but was predicted as class D in this study (**Table 3**). PATC052371 was deleted in the updated Orchidstra 2.0 database (Chao et al., 2017), which could be attributed to missed annotation. Thus, we can ignore this classification error.
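As a small worked example of the numbers above, the following Python sketch picks the class with the highest estimated probability and recomputes the reported accuracy. Only the two OMADS3 probabilities given in the text are used; the probabilities of the other six classes are not reported and are therefore omitted, and the helper names are hypothetical.

```python
def predict_class(probabilities):
    """Return the class label with the highest estimated probability."""
    return max(probabilities, key=probabilities.get)

def accuracy_percent(n_correct, n_total):
    """Accuracy as a percentage of correctly classified samples."""
    return 100.0 * n_correct / n_total

# OMADS3: the two probabilities reported in the text.
omads3 = {"B12": 0.35, "B34": 0.36}
omads3_prediction = predict_class(omads3)

# 24 of the 26 independent test genes were classified correctly.
independent_accuracy = round(accuracy_percent(24, 26), 2)
```

The margin between the two OMADS3 probabilities is only 0.01, which shows how a near tie between B12 and B34 is enough to produce the misclassification.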

### Performance Comparison With the iMADS Classification Method

We previously constructed iMADS, which is a bioinformatics tool for classification of angiosperm MADS-box genes (Yang et al., 2012). We used the same sequence similarity features in the present prediction tool. However, in this study we divided the MADS-box genes into eight classes rather than the five classes in iMADS, and analyzed the domain content of the input before subjecting it to the suitable model for classification prediction. P. aphrodite MADS-box genes were selected for performance comparison. As shown in **Table 4**, the C-class PATC052371 was incorrectly predicted as D-class by both systems. The D-class PATC202120 was correctly classified in this study, but was incorrectly predicted as C-class by iMADS. The accuracy of this classification system (93.33%) is higher than that of iMADS (86.67%).

TABLE 6 | Comparison of the classification performance and conservation of the four domains.

FIGURE 6 | Novel coding motif investigation and annotation of the C domains of Group B genes. (A) Comparison of the C domains of Group B genes. The universal PI-derived motif was found in all Group B genes, whereas the paleoAP3/euAP3 motif was found only in B12 and B34. (B) Novel motif identification of the paleoAP3/euAP3 region. The coding DNA sequences of paleoAP3/euAP3 regions were subjected to MEME analysis, and ARR10, FHY3, and RAV1 motifs were then identified and annotated.

universal motifs, AG motifs I and II, were found in Group C and D genes, whereas the MD motif was found only in Group D. (B) Novel motif identification of the MD region. The coding DNA sequences of MD regions were subjected to MEME analysis, and MNB1A, PBF, and Dof2 motifs were then identified and annotated.

Phylogenetic analysis has been the most common method for MADS-box gene classification. Among 226 MADS-box genes, there were 16 classification errors on the tree. Of these, 5 Picea abies B-class genes were incorrectly grouped into E-class by phylogenetic analysis. Because these genes are derived from gymnosperms rather than angiosperms, we selected the other 11 genes for performance evaluation of the prediction tools. As shown in **Table 5**, iMADS corrected most of the classification errors (72.73%) but not those for ZMM1, SHP1, and SHP2. The resolution between the C and D classes was poor for iMADS. The new classification system corrected all of these phylogenetic classification errors and thus provides a reliable classification approach.

# The Correlation Between Conservation and Prediction Accuracy of the Four Domains

To explain the varying accuracy of the four domain prediction models, we used the pairwise distance method (Tamura et al., 2011) to identify the correlation between conservation and prediction accuracy (**Table 6**). The pairwise distance method builds a matrix that compares every sequence with every other sequence and averages the resulting distances. A domain with a higher average pairwise distance has higher diversity. Combining the resulting pairwise distances with the prediction accuracy, we found that the diversity of the four domains, from high to low, was C, I, K, and M, and that diversity was positively correlated with accuracy. Thus, when the diversity of a domain was lower, the prediction was more likely to be incorrect.
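The averaging step can be sketched in a few lines of Python. The sketch assumes the simple p-distance (the proportion of differing sites between two aligned sequences, ignoring gapped columns) as the distance measure; the text does not state which distance model was selected in MEGA, so this choice and the toy alignment are assumptions.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return sum(a != b for a, b in compared) / len(compared)

def mean_pairwise_distance(aligned):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

# Toy alignment (made-up residues): two identical sequences and one variant.
toy_domain = ["MGRG", "MGRG", "MGKG"]
toy_diversity = mean_pairwise_distance(toy_domain)
```

Computing this value separately for the aligned M, I, K, and C regions gives one diversity score per domain, which can then be set against the per-domain accuracy.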

# Novel Coding Motif Analysis in the C Domain

According to the pairwise distance and prediction accuracy results, the C domain showed the highest diversity and the best discrimination rate among the four domains. This indicated that there could be some specific structures in the C domains that correspond to different classes of MADS-box genes. A comparison of the MADS-box genes belonging to Group B indicated that all of them contain a PI-derived motif, whereas B12 and B34 have an additional paleoAP3/euAP3 motif (**Figure 6A**). MEME and TOMTOM analyses revealed three coding motifs (ARR10, FHY3, and RAV1) in the C domains of B12 and B34 (**Figure 6B**). ARR10, a helix-turn-helix super-secondary family motif, serves as a two-component response regulator involving a His-to-Asp phosphate signal transduction system (Hosoda et al., 2002). RAV1, a DNA-binding motif, exists in almost all eight classes and may be a universal key tool of MADS-box genes (Kagaya et al., 1999). The FHY3 motif is involved in transcriptional regulation of phytochrome A signaling and the circadian clock (Li et al., 2011). The FHY3 motif exists in B12 and B34 (APETALA3), but not in BPI (PISTILLATA), which may reflect their diverse functions (Lamb and Irish, 2003).

Upon comparing the MADS-box genes belonging to Group CD, we found that they all share AG motif I and AG motif II close to the middle region of the C-terminus, but a specific MD motif is found in the C-terminus of Group D genes (**Figure 7A**). We further investigated three functional coding motifs (MNB1A, PBF, and Dof2) in this region by MEME and TOMTOM analyses (**Figure 7B**). These motifs all belong to the Dof gene family, a family of transcription factors, and may form a single zinc finger for DNA recognition (Yanagisawa and Schmidt, 1999). A protein-protein interaction structure, the coiled-coil, also exists in the zinc finger of Dof, and thus we predict that the MNB1A, PBF, and Dof2 motifs may be involved in an interaction with Group E genes. No significant motif was identified in the Group AE genes.

# Tissue-Specific Coding Motif Analysis of MADS-Box Genes Among Multiple Species

DNA motifs are short, conserved functional regions. They are presumed to be involved in RNA localization, translation efficacy, mRNA splicing, mRNA stability, and accessibility to the translation machinery (Ding et al., 2012). Their structure specificity could also indicate their key functionality. We collected MADS-box gene sequences from different species and used JASPAR to individually annotate their coding motifs (**Supplementary Tables S1–S4**). Class-unique motifs are motifs found only in a particular class, such as SEP3 in Arabidopsis class A genes, PIF5 in class B, SOC1 in class D, and TGA1 in class E (**Supplementary Table S1**). These motifs could relate to the unique functions of the particular class. Tissue-related motifs are motifs that exist in the genes expressed in a particular tissue. We found several tissue-specific motifs, such as SOC1 in carpels of Arabidopsis, bZIP910 in lodicules of rice, SPE3 in lips of P. aphrodite, and HAT5 in carpels of Oncidium (**Supplementary Tables S1–S4**). We also found some organism-specific motifs; the myb.Ph3 motif is unique to the MADS-box genes of Arabidopsis, whereas abi4, ERF1, and Gamyb are found only in rice. The specific motifs could be used as a reference for prediction or classification.

# CONCLUSION

Phylogenetic analysis often produces classification errors and is time-consuming for massive, multi-species, or incomplete sequences. To solve this problem, we used machine learning approaches to establish a reliable and efficient MADS-box gene classification system for angiosperms. This classification system analyzes the domain content and then automatically subjects the query sequence to a suitable BLAST model. Corresponding to the extended eight classes, this classification system can also correct almost all the incorrect classifications generated from phylogenetic tree analysis. We also identified several class-specific, tissue-specific, and organism-specific coding motifs to use for classification or as references for future functional investigation.

# AUTHOR CONTRIBUTIONS

Y-TC, C-CC, and Y-WC conceived the study and drafted the manuscript. C-WC and K-CC collected the datasets and created the work-flow.

# FUNDING

This research was supported by (a) Ministry of Science and Technology, Taiwan, China under grant number 106-2221-E-005-077-MY2, 106-2313-B-005-035-MY2, 107-2634-F-005-002, and 107-2321-B-005-013. (b) National Chung Hsing University and Chung-Shan Medical University under grant number NCHU-CSMU-10705.

### ACKNOWLEDGMENTS

The authors would like to thank Professor Chang-Hsien Yang and Hsing-Fun Hsu who provided some useful comments for this work.

#### REFERENCES


#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00707/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Chen, Chang, Chen, Chen and Chu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods

#### Kaiyang Qu<sup>1</sup>, Leyi Wei<sup>1</sup>, Jiantao Yu<sup>2</sup> and Chunyu Wang<sup>3,4</sup>\*

*<sup>1</sup> College of Intelligence and Computing, Tianjin University, Tianjin, China, <sup>2</sup> College of Information Engineering, North-West A&F University, Yangling, China, <sup>3</sup> School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, <sup>4</sup> Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States*

Motivation: Pentatricopeptide repeat (PPR) is a triangular pentapeptide repeat domain that plays a vital role in plant growth. In this study, we seek to identify PPR coding genes and proteins using a mixture of feature extraction methods. We use four single feature extraction methods focusing on the sequence, physical, and chemical properties as well as the amino acid composition, and mix the features. The Max-Relevant-Max-Distance (MRMD) technique is applied to reduce the feature dimension. Classification uses the random forest, J48, and naïve Bayes with 10-fold cross-validation.

Results: Combining two of the feature extraction methods with the random forest classifier produces the highest area under the curve of 0.9848. Using MRMD to reduce the dimension improves this metric for J48 and naïve Bayes, but has little effect on the random forest results.

Edited by: *Arun Kumar Sangaiah, VIT University, India*

#### Reviewed by:

*Qin Ma, South Dakota State University, United States Shijia Zhu, University of Texas Southwestern Medical Center, United States*

\*Correspondence:

*Chunyu Wang chunyu@hit.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science*

Received: *20 November 2018* Accepted: *17 December 2018* Published: *10 January 2019*

#### Citation:

*Qu K, Wei L, Yu J and Wang C (2019) Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods. Front. Plant Sci. 9:1961. doi: 10.3389/fpls.2018.01961*

Availability and Implementation: The webserver is available at: http://server.malab.cn/MixedPPR/index.jsp.

Keywords: pentatricopeptide repeat, mixed feature extraction methods, maximum relevant maximum distance, random forest, J48, naïve bayes

# INTRODUCTION

Pentatricopeptide repeat (PPR) proteins include tandem repeats of degenerate 35-amino-acid motifs (PPR motifs) (Chen et al., 2018; Rojas et al., 2018). They form a class of nuclear-encoded proteins arranged in series by multiple repeating units (Li and Jiang, 2018). PPR proteins play a vital role in plant growth and development, and are widely found in eukaryotes and terrestrial plants (Ruida et al., 2013; Wang et al., 2018a). The majority of PPR proteins have mitochondrial or chloroplast localization sequences at the N-terminus, making them an ideal model for studying plant cytoplasmic and nuclear interactions (Wang et al., 2008b). Because of the importance of PPR, this study uses machine learning methods to predict sequences in this class of protein.

As PPRs are proteins, protein prediction methods are applicable to PPR. To predict proteins, some algorithm must be employed to extract features from the sequences. With the development of bioinformatics, many feature extraction methods have been developed. The extraction methods are divided into two categories. The first, based on amino acid composition, considers only the sequence information and the properties of the amino acids. The second, based on protein structure, considers both sequence information and spatial structure information. The N-gram model is a probabilistic language model based on the Markov assumption (Zhu et al., 2015; Lai et al., 2017; Wei et al., 2017a). Chou (2010) proposed a method based on the pseudo amino acid composition (Pse-AAC) that has since been used to predict various protein attributes, such as structural class (Sahu and Panda, 2010; Zhu et al., 2018), subcellular location (Wang et al., 2008b; Yang et al., 2016), essential protein (Sarangi et al., 2013), protein secondary structural content (Chen et al., 2009), T-cell epitope (Zhang et al., 2015), and protein remote homology (Liu et al., 2013, 2015a, 2016a). Liu et al. (2014) enhanced this method by reducing the amino acid alphabet profile, and proposed the physicochemical distance transformation (PDT) (Liu et al., 2012), which is similar to Pse-AAC. The position-specific scoring matrix (PSSM) (Jones, 1999; Kong et al., 2017) contains abundant evolutionary information and is generated by the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) (Altschul and Koonin, 1998; Altschul et al., 1998). Kumar et al. (2007) were able to extract features according to amino acid or dipeptide composition, PSSM, and four-part amino acid compositions. Classifiers such as support vector machines, random forests, and artificial neural networks can be applied to the extracted features.

In this study, four feature extraction methods and three classifiers are used to predict PPR proteins. The four feature extraction methods not only consider sequence information, but also include the properties of amino acids. We combine these feature extraction methods, and then use the Max-Relevance-Max-Distance (MRMD) method to reduce the dimension. The overall process is shown in **Figure 1**.

## METHODS

#### Dataset

For this study, a dataset was extracted from UniProt using the keyword "pentatricopeptide repeat" to search the sequences. This search produced 534 reviewed samples, which we used as the positive set. Based on this positive set, we then constructed a negative set as follows. First, we found the UniProt IDs of the proteins, which are delimited by the symbol |. Second, we used the UniProt IDs to query the proteins' PFAM families. Each sequence belongs to a PFAM family, and similar sequences belong to the same family. After finding all the PFAM families of the PPR positive samples, duplicate PFAM families were deleted to obtain a non-repeating positive family set. We then removed these positive families from the set of all families, leaving a set of negative families. Finally, we used the longest protein sequence in each negative family as a negative sample. From the above steps, we obtained 21,960 negative sequences. As some sequences may be redundant, we used CD-HIT (Fu et al., 2012) to reduce the data with a threshold of 0.7 and deleted sequences that included illegal characters. The final dataset contained 487 positive samples and 9,590 negative samples.
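The negative-set construction can be sketched as follows. The sketch assumes the usual UniProt FASTA header layout, in which the accession sits between the first two "|" symbols, and it assumes that the family membership has already been collected into a dictionary; the function names, the header, and the family labels are hypothetical placeholders rather than real database entries.

```python
def uniprot_accession(header):
    """Extract the accession from a header such as '>sp|ACCESSION|NAME ...'."""
    parts = header.lstrip(">").split("|")
    return parts[1] if len(parts) >= 3 else None

def longest_per_negative_family(family_to_seqs, positive_families):
    """Keep the longest sequence of every family that holds no positive sample.

    family_to_seqs maps a family identifier to its member sequences;
    positive_families is the de-duplicated set of families that contain
    at least one PPR protein.
    """
    negatives = {}
    for family, seqs in family_to_seqs.items():
        if family in positive_families:
            continue  # the whole family is excluded from the negative set
        negatives[family] = max(seqs, key=len)
    return negatives

# Placeholder data for illustration only.
toy_header = ">sp|X00001|TOY_PLANT hypothetical protein"
toy_families = {
    "FAM_PPR": ["MKT", "MKTAA"],
    "FAM_OTHER1": ["MA", "MAGRK", "MAG"],
    "FAM_OTHER2": ["MSS"],
}
toy_negatives = longest_per_negative_family(toy_families, {"FAM_PPR"})
```

Redundancy reduction with CD-HIT and the removal of sequences with illegal characters would follow as separate steps and are not covered here.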

To overcome this imbalance in the dataset, we randomly extracted 10 sets of negative samples, and averaged the results of 10 experiments using these 10 sets. Among the negative sequences, the longest had 35,214 amino acids and the shortest had 11 amino acids. The positive sequences ranged from 196 to 1,863 amino acids in length. Thus, we divided the negative samples into four parts according to their length, and extracted 487 sequences from these four parts in proportion.
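A possible reading of this sampling scheme is sketched below. How the four length groups were delimited is not stated, so the sketch assumes four equal-sized groups of the length-sorted negatives and draws from each group in proportion to its size; the function name, the seed handling, and the toy sequences are hypothetical. Because the quotas are rounded, the total may differ slightly from the requested size when the groups do not divide evenly.

```python
import random

def stratified_negative_sample(negatives, n_total, n_bins=4, seed=0):
    """Draw about n_total negatives, spread proportionally over length bins."""
    if not negatives:
        return []
    rng = random.Random(seed)
    ordered = sorted(negatives, key=len)
    bin_size = -(-len(ordered) // n_bins)  # ceiling division
    bins = [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]
    sample = []
    for members in bins:
        # Quota proportional to the share of sequences in this length bin.
        quota = round(n_total * len(members) / len(ordered))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Toy negatives with lengths 1..40; ten independent draws of 8 sequences each.
toy_pool = ["A" * length for length in range(1, 41)]
toy_subsets = [stratified_negative_sample(toy_pool, 8, seed=s) for s in range(10)]
```

In the setting described above, each of the ten subsets would hold 487 negatives and be paired with the 487 positives, and the reported metrics would be the average over the ten experiments.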

#### Feature Extraction Methods Based on Sequence, Physical, and Chemical Properties

This method can extract 188 features (hereinafter referred to as 188D) covering sequence information and amino acid properties (Zhang et al., 2012; Song et al., 2014; Xu et al., 2014). The first 20 features are the frequencies of the 20 amino acids in the protein sequence. Furthermore, the content, distribution, and dipeptide composition are essential in protein predictions (Song et al., 2014). We divided the 20 amino acids into three groups according to their properties, as shown in **Figure 2**.

With the amino acids divided into three groups for each property, we calculated the proportion of the three groups in the sequence for eight properties, giving 3 × 8 = 24 features (Cai et al., 2003; Lin et al., 2013). Next, we identified the distribution of the three groups of amino acids at five positions (beginning, 25%, 50%, 75%, and end), giving a further 3 × 5 × 8 = 120 features (Cai et al., 2003). Finally, we calculated the number of the three types of dipeptides containing two amino acids from different groups, giving another 3 × 8 = 24 features. Therefore, the algorithm produces 20 + 24 + 120 + 24 = 188 features (Lin et al., 2013).
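To make the feature count concrete, the sketch below computes the 3 + 15 + 3 = 21 values contributed by a single property; eight properties plus the 20 amino acid frequencies then give 20 + 8 × 21 = 188. It is an illustration only. The grouping used here (a common three-class hydrophobicity grouping) is an assumption, because the groupings of **Figure 2** are not reproduced in the text, and the function name `ctd_one_property` is hypothetical.

```python
import math

# Assumed three-class grouping for one property (polar, neutral, hydrophobic).
GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")

def ctd_one_property(seq, groups=GROUPS):
    """Return 21 values for one property: 3 composition, 15 distribution, 3 transition."""
    n = len(seq)
    label = {aa: g for g, members in enumerate(groups) for aa in members}
    classes = [label[aa] for aa in seq]  # raises KeyError on non-standard residues
    # Composition: fraction of residues falling in each group.
    comp = [classes.count(g) / n for g in range(3)]
    # Distribution: relative position of the first, 25%, 50%, 75% and last residue of each group.
    dist = []
    for g in range(3):
        pos = [i + 1 for i, c in enumerate(classes) if c == g]
        for q in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                dist.append(0.0)
            else:
                idx = max(math.ceil(q * len(pos)), 1) - 1
                dist.append(pos[idx] / n)
    # Transition: frequency of adjacent residue pairs that span two different groups.
    pairs = list(zip(classes, classes[1:]))
    trans = []
    for a, b in ((0, 1), (0, 2), (1, 2)):
        count = sum(1 for p in pairs if set(p) == {a, b})
        trans.append(count / max(len(pairs), 1))
    return comp + dist + trans
```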

#### Pse-in-One

The other three methods are implemented in Pse-in-One (Liu et al., 2015b) and BioSeq-Analysis (Liu, 2018). We briefly introduce these methods in this section.

#### **Kmer**

Similar to the N-gram model, the kmer method represents a protein by the frequencies of its fragments of k adjacent amino acids, which reflect the sequence composition of the protein. Since there are 20 possibilities for each position, 20<sup>k</sup> features can be extracted. For example, when k = 2, each feature is the frequency of one of the 400 possible two-residue fragments in the sequence. It can be expressed as follows (Liu et al., 2008):

$$F\_{kmer} = \langle f\_1^{kmer}, f\_2^{kmer}, \dots, f\_{20^k}^{kmer} \rangle.$$
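A minimal sketch of this feature vector is given below. The function name `kmer_features`, the alphabetical ordering of the amino acids, and the normalization by the number of sliding windows are assumptions made for illustration; Pse-in-One may order or normalize the features differently.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_features(seq, k=2):
    """Return the 20**k k-mer frequencies of seq, in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Count every window of k adjacent residues.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[m] / total for m in kmers]
```

With k = 2 the vector has 400 entries, which matches the dimension implied by the combined feature sets reported later (for example, 188D + kmer = 588D).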

#### **Auto-cross covariance**

The auto-cross covariance (ACC) transforms the protein sequence to a certain length by measuring the relationship between any two properties of the amino acids (Dong et al., 2009). ACC includes two parts: the auto covariance (AC) calculates the relevance of the same property between two residues along sequence intervals of length lg (Dong et al., 2009), and the cross-covariance (CC) measures the differences between two properties (Guo et al., 2008). For a protein sequence P, the transformation can be written as (Liu et al., 2016b):

$$P' = [\varphi\_1, \varphi\_2, \dots, \varphi\_{N \cdot \lg}]^T$$

where N represents the number of amino acid properties and φ<sub>n</sub> is calculated as (Liu et al., 2016c):

$$\varphi\_n = AC\left(i, \lg\right) = \frac{1}{L - \lg} \sum\_{j=1}^{L-\lg} \left(S\_{i,j} - \overline{S\_i}\right) \left(S\_{i,j+\lg} - \overline{S\_i}\right)$$

CC transforms the sequence to the vector set:

$$P' = [\varphi\_1, \varphi\_2, \dots, \varphi\_{N (N-1) \cdot \lg}]^T$$

and then calculates (Guo et al., 2008):

$$CC\left(i\_1, i\_2, \lg\right) = \frac{1}{L - \lg} \sum\_{j=1}^{L-\lg} \left(S\_{i\_1,j} - \overline{S\_{i\_1}}\right) \left(S\_{i\_2,j+\lg} - \overline{S\_{i\_2}}\right)$$

where i, i<sub>1</sub>, and i<sub>2</sub> index the properties, j indexes the residues, L represents the length of the sequence, S<sub>i,j</sub> is the score of the j-th amino acid with respect to the i-th property, and S̄<sub>i</sub> is the average score for property i along the sequence.

In this study, we selected three properties and set lg = 2.
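The AC and CC terms can be sketched as below, with each sum normalized by L − lg, the number of terms it contains. The sketch assumes that every lag from 1 up to lg is used, which gives 3 × 2 = 6 AC values and 3 × 2 × 2 = 12 CC values for three properties and lg = 2, i.e., the 18 dimensions implied by 188D + ACC = 206D. The input format (one list of per-residue scores per property) and all function names are hypothetical.

```python
def auto_covariance(scores, lg):
    """AC of one property track at lag lg."""
    L = len(scores)
    mean = sum(scores) / L
    return sum((scores[j] - mean) * (scores[j + lg] - mean) for j in range(L - lg)) / (L - lg)

def cross_covariance(s1, s2, lg):
    """CC between two different property tracks at lag lg."""
    L = len(s1)
    m1, m2 = sum(s1) / L, sum(s2) / L
    return sum((s1[j] - m1) * (s2[j + lg] - m2) for j in range(L - lg)) / (L - lg)

def acc_features(tracks, max_lag=2):
    """tracks: one list of per-residue scores for each property, all of equal length."""
    n = len(tracks)
    feats = []
    for lg in range(1, max_lag + 1):          # AC part: N * lg values
        for i in range(n):
            feats.append(auto_covariance(tracks[i], lg))
    for lg in range(1, max_lag + 1):          # CC part: N * (N - 1) * lg values
        for i1 in range(n):
            for i2 in range(n):
                if i1 != i2:
                    feats.append(cross_covariance(tracks[i1], tracks[i2], lg))
    return feats
```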

#### **Parallel correlation pseudo amino acid composition**

Parallel correlation pseudo amino acid composition (PC-Pse-AAC) considers composition, properties, and sequence orders (Chou, 2010; Xiao and Chou, 2011).

We consider a protein sequence P containing L amino acids. The sequence can be represented by 20 + λ features as:

$$FV\_{\text{PC-PseAAC}} = [x\_1, x\_2, \dots, x\_{20+\lambda}]^T$$

where λ is a distance parameter that reflects the effect of the amino acid sequence-order (Pan G. et al., 2018).

The first 20 features are the frequencies at which 20 amino acids appear in the sequence. The other features are given by (Mei and Zhao, 2018):

$$\theta\_k = \frac{\sum\_{i=1}^{L-k} \Theta(A\_i, A\_{i+k})}{L-k} \quad (k \le \lambda)$$

$$\Theta\left(A\_i, A\_{i+k}\right) = \frac{1}{T} \sum\_{j=1}^{T} \left(I\_j(A\_i) - I\_j\left(A\_{i+k}\right)\right)^2$$

$$I\_j\left(A\_i\right) = \frac{I\_j'\left(A\_i\right) - \sum\_{m=1}^{20} \frac{I\_j'(R\_m)}{20}}{\sqrt{\frac{\sum\_{k=1}^{20} \left(I\_j'(R\_k) - \sum\_{m=1}^{20} \frac{I\_j'(R\_m)}{20}\right)^2}{20}}}$$

where A<sub>i</sub> represents the i-th amino acid in the protein sequence, and k denotes the distance between two amino acids along the protein sequence. T is the number of physicochemical properties, and I<sub>j</sub>(A<sub>i</sub>) is the j-th property of A<sub>i</sub>. I<sub>j</sub>′(A<sub>i</sub>) indicates the original physicochemical property score of amino acid A<sub>i</sub> with respect to property j, and R<sub>m</sub> represents the 20 amino acids.

In this study, we selected three properties and set λ = 2.
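The 20 + λ features can be sketched as follows, following the three formulas above: each property is standardized over the 20 amino acids, and θ<sub>k</sub> averages the squared property differences between residues k positions apart. For illustration the sketch uses a single property, the Kyte-Doolittle hydropathy scale, which is an assumption and not necessarily one of the three properties used in the study; the weight factor of Chou's original formulation is also omitted because the text does not mention it. The names `standardize` and `pc_pseaac` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative property table (Kyte-Doolittle hydropathy); an assumption, not the study's choice.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def standardize(raw):
    """Zero-mean, unit-variance scaling of a property over the 20 amino acids."""
    mean = sum(raw[a] for a in AMINO_ACIDS) / 20
    sd = math.sqrt(sum((raw[a] - mean) ** 2 for a in AMINO_ACIDS) / 20)
    return {a: (raw[a] - mean) / sd for a in AMINO_ACIDS}

def pc_pseaac(seq, raw_props, lam=2):
    """Return 20 amino acid frequencies followed by theta_1 .. theta_lam."""
    props = [standardize(p) for p in raw_props]
    L = len(seq)
    freqs = [seq.count(a) / L for a in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(L - k):
            # Theta(A_i, A_{i+k}): mean squared difference over the T properties.
            total += sum((p[seq[i]] - p[seq[i + k]]) ** 2 for p in props) / len(props)
        thetas.append(total / (L - k))
    return freqs + thetas
```

With λ = 2 the vector has 22 entries, consistent with 188D + Pse-AAC = 210D reported later.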

#### Mixed Feature Extraction Methods

The Max-Relevance-Max-Distance (MRMD) (Zou et al., 2016; Qu et al., 2017; Wei et al., 2017b) technique was used to reduce the dimension. We used the Pearson correlation coefficient (PCC) to measure the relevance and the Euclidean distance function to identify instances of redundancy.

The PCC is suitable for continuous variables and is easy to implement. Therefore, the PCC (Ahlgren et al., 2014) was used to measure the relationship between the features and the target class in the MRMD feature dimension reduction method. The formula for the PCC is (Zou et al., 2016):

$$\text{PCC}\left(\overrightarrow{X}, \overrightarrow{Y}\right) = \frac{\frac{1}{N-1} \sum\_{k=1}^{N} \left(x\_k - \overline{x}\right) \left(y\_k - \overline{y}\right)}{\sqrt{\frac{1}{N-1} \sum\_{k=1}^{N} \left(x\_k - \overline{x}\right)^{2}} \sqrt{\frac{1}{N-1} \sum\_{k=1}^{N} \left(y\_k - \overline{y}\right)^{2}}}$$

where x<sub>k</sub> and y<sub>k</sub> represent the k-th elements of the vectors X and Y, which are composed of each instance's feature values. Thus, the maximum relevance of the i-th feature is:

$$\max M R\_i = |\text{PCC}\left(\overrightarrow{F\_i}, \overrightarrow{C\_i}\right)|$$

The Euclidean distance is given by:

$$\begin{aligned} \operatorname{ED} \left( \overrightarrow{X}, \overrightarrow{Y} \right) &= \sqrt{\sum\_{k=1}^{N} (x\_k - y\_k)^2} \\ \max MD\_i &= \operatorname{ED}\_i = \frac{1}{M - 1} \sum \operatorname{ED} \left( \overrightarrow{F\_i}, \overrightarrow{F\_k} \right) \end{aligned}$$

We selected features according to:

$$\max(MR\_i + MD\_i)$$

As the PCC increases, the relationship between the features and the target classes becomes stronger. The greater the distance between features, the less redundancy exists in the vectors. The final feature set created by this method has less redundancy and greater correlation with the target set (Xu et al., 2016, 2018; Jiang et al., 2017; Wei et al., 2017c).
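The selection criterion can be sketched as a ranking of features by relevance plus distance. This is a simplified illustration, not the MRMD software: it assumes that the distance term averages over all other features (consistent with the 1/(M − 1) factor), that the two terms are summed without weights or rescaling, and that at least two features are supplied. All function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; the 1/(N-1) factors cancel between numerator and denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mrmd_rank(features, labels):
    """features: list of feature columns. Returns feature indices sorted by MR + MD, best first."""
    m = len(features)
    scores = []
    for i, f in enumerate(features):
        mr = abs(pearson(f, labels))  # relevance to the target class
        # Mean distance to every other feature (assumed to exclude the feature itself).
        md = sum(euclidean(f, g) for k, g in enumerate(features) if k != i) / (m - 1)
        scores.append(mr + md)
    return sorted(range(m), key=lambda i: scores[i], reverse=True)
```

Keeping the top-ranked indices then yields the reduced feature set; how many to keep is a separate choice, which the study made by comparing AUC values.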

#### FEATURE SELECTION METHOD

#### Classifiers

We used three classifiers in this study: random forest (RF), naïve Bayes (NB), and J48. The classifiers can be implemented in WEKA, which is based on the Java environment.

#### J48

The J48 method is a decision tree algorithm based on C4.5 (Mohasseb et al., 2018). Decision trees (Quinlan, 1986) are a graphical approach using probability analysis. J48 is a kind of supervised learning, whereby each sample has a set of attributes and a predetermined label. By learning about the samples, a classifier can be taught to generate classification results for new instances (Rondovic et al., 2019).

In each step, decision trees select an attribute to split. Ideally, the optimal attribute should be selected so that the samples included in the branch nodes of the decision tree belong to the same class (Kothandan and Biswas, 2016; Zhong et al., 2018). The selection of attributes is an important problem, and many methods have been derived for this purpose, such as information gain, and information gain ratio. The C4.5 method uses the information gain ratio to select which attributes to split.
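The information gain ratio used to choose a split can be sketched as follows for a categorical attribute. This is a simplified illustration: C4.5 and J48 also handle numeric attributes through threshold splits and apply further heuristics that are not shown here, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0
```

Dividing by the split information penalizes attributes with many distinct values, which plain information gain tends to favor.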

#### Random Forest

Ensemble learning is an effective technique that has been applied to many fields of bioinformatics (Li et al., 2016; Liu et al., 2016d, 2018; Zhang et al., 2016a; Tang et al., 2017; Pan Y. et al., 2018; Wang H. et al., 2018; Wei et al., 2018a,b). The RF approach (Wang S. P. et al., 2018) is an ensemble learning method that employs many decision trees, with the output result dependent on "votes" cast by each tree. The construction process is as follows.

First, we determine the quantity of decision trees (m), the depth of each tree (d), and the number of features (f) used by each node. Then, n samples are selected at random from the samples set. In addition, f features are randomly selected, and the selected samples use these features to build decision trees. This step is repeated m times to give m decision trees, forming the random forest. Each decision tree classifies each sample, so each decision tree outputs a value. For classification problems, the final result is the class that has the most votes. For regression problems, the final result is the average of the output of all decision trees (Song et al., 2017).
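The construction steps above (bootstrap sampling, random feature subsets, and majority voting) can be sketched in a few lines. To stay short, the sketch makes strong simplifying assumptions: each tree is a depth-1 decision stump chosen by training error rather than a full tree grown with an impurity criterion, and all names (`fit_stump`, `fit_forest`, `predict`) are hypothetical. It is not the WEKA implementation used in the study.

```python
import random
from collections import Counter

def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) pair among feature_ids with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            err = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
            if best is None or err < best[0]:
                best = (err, f, t, l_lab, r_lab)
    return best[1:]

def fit_forest(X, y, m=25, f=1, seed=0):
    """Build m stumps, each from a bootstrap sample and f randomly chosen features."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        feats = rng.sample(range(d), f)             # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Classify by majority vote over all trees."""
    votes = [l if row[f] <= t else r for f, t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]
```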

#### Naïve Bayes

NB (Rajaraman and Chokkalingam, 2014; Deng and Chen, 2015) is a classical classifier based on conditional probability. The most important component of NB is the Bayesian rule, which is given by (Yu et al., 2015):

$$\operatorname{p}\left(B\_{i}|A\right) = \frac{\operatorname{p}\left(A|B\_{i}\right)\operatorname{p}\left(B\_{i}\right)}{\sum\_{j=1}^{n}\operatorname{p}\left(A\middle|B\_{j}\right)\operatorname{p}\left(B\_{j}\right)}$$

where p(B<sub>i</sub>|A) represents the conditional probability of event B<sub>i</sub> occurring given event A, and p(B<sub>i</sub>) is the marginal probability of event B<sub>i</sub>.

The classifier uses the Bayesian rule to calculate the posterior probability of each class for an object from the prior probabilities, and then assigns the object to the class with the largest posterior probability. The method assumes that all features are statistically independent. According to the above formula, we therefore obtain:

$$p\left(y|x\_1,\cdots,x\_n\right) = \frac{p(y)\prod\_{i=1}^{n}p(x\_i|y)}{p(x\_1)p(x\_2)\cdots p(x\_n)}$$

Because the denominator does not depend on the class, the prediction rule becomes:

$$\hat{y} = \arg\max\_{y} p(y) \prod\_{i=1}^{n} p(x\_i | y)$$

where y represents the class variable, x<sub>i</sub> represents the i-th feature, and ŷ represents the predicted class.
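The decision rule can be sketched for categorical features as follows. Two choices here are assumptions rather than part of the text: Laplace smoothing (the `alpha` parameter) is added to avoid zero probabilities, and log-probabilities are summed instead of multiplying probabilities. Numeric features such as those used in this study would need a density model or discretization, which is not shown. The names `fit_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_nb(X, y, alpha=1.0):
    """Collect the counts needed for p(y) and p(x_i | y)."""
    classes = Counter(y)
    counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)      # feature index -> observed values
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            counts[(c, i)][v] += 1
            values[i].add(v)
    return classes, counts, values, alpha

def predict_nb(model, row):
    """Return the class maximizing log p(y) + sum_i log p(x_i | y)."""
    classes, counts, values, alpha = model
    n = sum(classes.values())
    best, best_score = None, -math.inf
    for c, nc in classes.items():
        score = math.log(nc / n)
        for i, v in enumerate(row):
            # Smoothed estimate of p(x_i = v | y = c).
            score += math.log((counts[(c, i)][v] + alpha) / (nc + alpha * len(values[i])))
        if score > best_score:
            best, best_score = c, score
    return best
```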

#### Measurement

As we have an imbalanced dataset, we use the area under the receiver operating characteristic (ROC) curve (AUC) and the F-Measure to evaluate the performance of the classifiers.

The abscissa of the ROC curve is the false positive rate (FPR), and the ordinate is the true positive rate (TPR). AUC is the area under the ROC curve, which cannot exceed one (Lobo et al., 2010; Pan et al., 2017; Wei et al., 2018d). As the ROC curve is generally above the straight line y = x, the value of AUC tends to be greater than 0.5 (Fawcett, 2005). The larger the value of AUC, the better the classification performance.
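The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise form; it is quadratic in the number of samples and is meant only as an illustration. The function name `auc` and the 0/1 label encoding are assumptions.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every positive above every negative scores 1.0, and one that assigns the same score to all samples scores 0.5.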

The F-measure (Nan et al., 2012) is a weighted harmonic average of precision and recall. This metric, which is often used to evaluate the quality of classification models, is computed as follows:

$$\begin{aligned} \text{precision} &= \frac{TP}{TP + FP} \\ \text{recall} &= \frac{TP}{TP + FN} \\ \text{F-measure} &= \frac{\left(\alpha^2 + 1\right) \cdot \text{precision} \cdot \text{recall}}{\alpha^2 \cdot \text{precision} + \text{recall}} \end{aligned}$$

Typically, α = 1, so that:

$$\text{F1} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
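As a small worked example, the function below computes the F-measure from confusion-matrix counts. It uses the standard weighting, in which α² multiplies only the precision term in the denominator; with α = 1 this reduces to F1. The function name and argument order are hypothetical, and the sketch does not guard against empty denominators.

```python
def f_measure(tp, fp, fn, alpha=1.0):
    """Weighted harmonic mean of precision and recall; alpha = 1 gives F1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + alpha ** 2) * precision * recall / (alpha ** 2 * precision + recall)
```

For instance, 8 true positives with 2 false positives and 2 false negatives give precision = recall = 0.8, so F1 is also 0.8.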

# RESULTS AND DISCUSSION

Experiments were conducted using 10-fold cross-validation (Wei et al., 2018c; Zhao et al., 2018), whereby the dataset is divided into 10 sections, with nine parts used to train the model and the remaining one used for testing. This process is repeated 10 times, and the average of all the tests gives the final result.
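The splitting scheme can be sketched as an index generator. This is a generic illustration, not the WEKA procedure used in the study: it shuffles once and does not stratify the folds by class, and the name `k_fold_indices` is hypothetical.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging the ten test scores uses every sample once for evaluation.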

# Results Using Individual Feature Extraction Methods

In this section, we discuss the performance of each individual feature extraction method. The four feature extraction methods focus on different aspects. 188D considers information about the sequence composition and amino acid properties, whereas kmer considers the frequency of amino acid fragments in the sequence. ACC considers three properties, hydrophobicity, hydrophilicity, and mass, and PC-PseAAC considers the amino acids' distance and properties. **Table 1** presents the results using these methods with each classifier.

From **Table 1**, it is clear that the performance is generally good. RF produced the best performance, especially with the kmer feature extraction method, achieving an AUC score of 0.9826. J48 has the worst performance, although this method attained an AUC score of 0.8710 when used with PC-PseAAC. NB performed best with the PC-PseAAC feature extraction method. Obviously, RF is better than J48. This may be because the random forest uses results from multiple decision trees, thus avoiding some exceptional cases.

# Performance of Joint Feature Extraction Methods

Next, we connected the feature extraction methods to give six new feature sets: 188D + ACC (206D), 188D + kmer (588D), 188D + Pse-AAC (210D), ACC + kmer (418D), ACC + Pse-AAC (40D), Pse-AAC + kmer (422D).

**Table 2** presents the results obtained by mixing the features. We also include the best single-method performance in **Table 2** to allow a more intuitive comparison. From the table, we can see that the performance using the RF classifier is slightly better than for the single 188D method. The highest AUC is 0.9820 and the lowest AUC is 0.8554.


*To represent the experimental results more intuitively, they are displayed as a histogram in* Figure 3*. Bold values indicate the best result in each experiment, i.e., the best combination of method and classifier.*




Next, we combined kmer with another method. The results are presented in **Table 2**. In this case, the best AUC is 0.9848 and the lowest AUC is 0.8386, which are both higher than the scores achieved using the kmer method alone. RF gives the best performance, and J48 is again the worst classifier.

The results from combining Pse-AAC with another method are presented in **Table 2**. We can see that the overall performance is worse than in the above cases. With the exception of the RF results, the performance is worse than when using the Pse-AAC method on its own. In this case, the best AUC score is 0.9826 and the worst is 0.8386.

The results from combining ACC with another method are shown in **Table 2**. Compared with the results using ACC alone, the performance has improved, except when using the NB classifier. RF again gives the best results and J48 gives the worst. The highest AUC score is 0.9848 and the lowest is 0.8518.

From the above results, we can conclude that RF is the best classifier for this task, whereas J48 is unsuitable in this case. The best PPR prediction method is to combine ACC and kmer and use the RF classifier, which achieves the highest AUC of 0.9848.

## Performance Using MRMD to Reduce the Dimension

Next, we used MRMD to reduce the dimension of the features considered in section Performance of Joint Feature Extraction Methods, resulting in six new feature sets. As the features were randomly extracted from the dataset 10 times, the number of features after dimension reduction was inconsistent. We conducted experiments using 10 separate sets of data. We then selected the feature set with the best AUC performance and applied this feature set to the remaining nine datasets. The final results are the average of 10 experiments.

TABLE 3 | Results after feature reduction.



The results are shown in **Table 3** and **Figures 4** and **5**. The highest AUC value is 0.9840, and the lowest is 0.8400. Again, RF gives the best performance and J48 is the worst classifier. The figures show that, although J48 still has the worst performance, its AUCs have improved. In particular, using MRMD for dimension reduction results in better performance by the NB classifier.

# CONCLUSION

PPR proteins play an important role in plants. In this study, we used machine-learning methods to predict this type of protein. To find the best performance, we used four feature extraction methods that consider sequence, physical, and chemical properties as well as the amino acid composition, and three classifiers. In terms of the individual feature extraction methods, using kmer with the RF classifier gave the highest AUC. Next, we combined the feature extraction methods, and found that RF still achieved the best performance while J48 gave the worst results. Finally, we used MRMD to reduce the feature dimension. This improved the AUCs for the J48 and NB classifiers, but had little effect on the RF results. The highest AUC score of 0.9848 was achieved by combining ACC and kmer and using RF as the classifier. The webserver is freely available at: http://server.malab.cn/MixedPPR/index.jsp. In future work, we expect to further improve the performance by integrating other informative features such as motif-based features (Li et al., 2010; Ma et al., 2013; Yang et al., 2017), and to validate the reliability of our method using next-generation sequencing analysis (Zhang et al., 2016b; Liu et al., 2017).

#### AUTHOR CONTRIBUTIONS

KQ implemented the experiments and drafted the manuscript. LW and CW initiated the idea, conceived the whole process, and finalized the paper. KQ and JY helped with data analysis and revised the manuscript. All authors have read and approved the final manuscript.

#### ACKNOWLEDGMENTS

The work was supported by the National Key R&D Program of China (SQ2018YFC090002), the Natural Science Foundation of China (No.61872114, 91735306, 61701340), the Natural Science Fundamental Research Plan of Shaanxi Province (2016JM6038), and the Fundamental Research Funds for the Central Universities, NWSUAF, China (2452015060). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Stuart Jenkinson, Ph.D., from Liwen Bianji, Edanz Group China (www.liwenbianji.cn/ac), for editing the English text of a draft of this manuscript.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Qu, Wei, Yu and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Conserved Disease Modules Extracted From Multilayer Heterogeneous Disease and Gene Networks for Understanding Disease Mechanisms and Predicting Disease Treatments

#### Liang Yu<sup>1</sup> , Shunyu Yao<sup>1</sup> , Lin Gao<sup>1</sup> and Yunhong Zha<sup>2</sup> \*

*<sup>1</sup> School of Computer Science and Technology, Xidian University, Xi'an, China, <sup>2</sup> Department of Neurology, Institute of Neural Regeneration and Repair, Three Gorges University College of Medicine, The First Hospital of Yichang, Yichang, China*

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Qinghua Jiang, Harbin Institute of Technology, China Yangyang Hao, Veracyte, Inc., United States*

> \*Correspondence: *Yunhong Zha yzha7808@ctgu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *07 November 2018* Accepted: *27 December 2018* Published: *18 January 2019*

#### Citation:

*Yu L, Yao S, Gao L and Zha Y (2019) Conserved Disease Modules Extracted From Multilayer Heterogeneous Disease and Gene Networks for Understanding Disease Mechanisms and Predicting Disease Treatments. Front. Genet. 9:745. doi: 10.3389/fgene.2018.00745*

Disease relationship studies are important for understanding the pathogenesis of complex diseases and for diagnosis, prognosis, and drug development. Traditional approaches consider one type of disease data or aggregate multiple types of disease data into a single network, which results in the loss of important temporal- or context-related information and may distort the actual organization. Therefore, it is necessary to apply a multilayer network model to consider multiple types of relationships between diseases and the important interplays between different relationships. Further, modules extracted from multilayer networks are smaller and have more overlap, and thus better capture the actual organization. Here, we constructed a weighted four-layer disease-disease similarity network to characterize the associations at different levels between diseases. Then, a tensor-based computational framework was used to extract Conserved Disease Modules (CDMs) from the four-layer disease network. After filtering, nine significant CDMs were retained. A statistical significance test confirmed the significance of the nine CDMs. Compared with modules obtained from the four single-layer networks, CDMs are smaller, better represent the actual relationships, and contain potential disease-disease relationships. KEGG pathway enrichment analysis and literature mining further confirmed that these CDMs are highly reliable. Furthermore, the CDMs can be applied to predict potential drugs for diseases. Molecular docking techniques were used to provide direct evidence for drugs to treat related diseases. Taking Rheumatoid Arthritis (RA) as a case, we found three potential drugs: Carvedilol, Metoprolol, and Ramipril. Many studies have pointed out that Carvedilol and Ramipril have an effect on RA. Overall, the CDMs extracted from multilayer networks provide an understanding of disease mechanisms from the perspective of multilayer networks and also provide an effective way to predict potential drugs for a disease based on its neighbors in the same CDM.

Keywords: conserved disease modules, multilayer networks, gene networks, disease mechanisms, drug repositioning

# INTRODUCTION

Complex diseases, such as cancers, diabetes mellitus, and cardiovascular disease, are caused by the combined effects of multiple genes, lifestyles, and environmental factors (Craig, 2008), which makes them difficult to study and treat. Studying the pathogenesis of diseases is critical for treating them because, if the pathogenesis is controlled, the disease can be prevented (Last, 2000). Disease-disease relationship studies can help to understand the interrelationship between diseases and uncover the pathogenesis of diseases (Menche et al., 2015). Network theory is a useful tool for describing and analyzing the relationships between complex diseases (Barabási and Oltvai, 2004). To date, many network-based methods have been proposed to analyze disease similarity. Menche et al. (2015) presented a new definition of module distance in the incomplete interactome to predict disease-disease relationships. Zhou et al. (2014) constructed a human symptoms-based disease network using large-scale medical bibliographic records and the related Medical Subject Headings (MeSH) (Lowe and Barnett, 1994) metadata from PubMed (Wheeler et al., 2007). Goh et al. (2007) built the first disease network by connecting diseases that have common disease genes. Based on protein interactions and functional pathways, Yu and Gao (2017) constructed a pathway-based human disease network (HPDN) to explore the potential relationships between diseases.

However, biological data are incomplete (Menche et al., 2015), and the different levels of data used to construct disease relationships are usually interrelated (Gligorijević and Pržulj, 2015). That is to say, single-layer networks may not reveal the molecular mechanisms underlying the real systems because they simplify the varied nature of relationships (Kivelä et al., 2014). Moreover, simply aggregating multiple types of interactions between diseases into a single network results in the loss of important temporal- or context-related information and may distort the actual organization (Rosvall et al., 2014; De Domenico et al., 2015). Therefore, in order to consider multiple types of interactions between diseases and the important interplays between layers, we use a multilayer network model (Mucha et al., 2010; Cardillo et al., 2013; Nicosia et al., 2013; Radicchi and Arenas, 2013) to study the relevance between diseases from multiple perspectives. The detection of community structures is an essential method of network analysis and is key to understanding the structure of complex networks (Fortunato, 2010). Communities are topological groups of nodes that have more connections with each other than with the rest of the nodes (Newman and Girvan, 2004; Porter et al., 2009; Fortunato, 2010). In recent years, researchers have proposed many methods to detect community structures in multilayer networks (Mucha et al., 2010; Li et al., 2011; Bazzi et al., 2014; Boccaletti et al., 2014; Liu et al., 2018). Li et al. (2011) presented a tensor-based computational framework for detecting recurrent dense subgraphs in multilayer weighted networks. They applied their method to 130 co-expression networks and found 11,394 recurrent heavy subgraphs, i.e., densely connected node sets that consistently appear in the different layers. By validating against a large set of compiled biological knowledge bases, they showed that their results are meaningful biological modules.

Here, we constructed a weighted four-layer disease-disease similarity network to characterize the associations between diseases and detected community structures in the multilayer network to extract useful information, such as potential disease-disease associations. Further, based on the potential disease-disease associations, we tried to understand the underlying molecular mechanisms of diseases and predicted new treatments for diseases. The tensor-based method (Li et al., 2011) was used here to identify significant and reliable disease-disease modules from our multilayer disease network. Because the modules appear consistently in all the layers, we named them Conserved Disease Modules (CDMs). **Figure 1** shows the whole framework of our method. We finally identified nine CDMs. After investigating these modules with the classification model in the MeSH database, we found that most of the diseases in the same module belonged to the same classification. More importantly, as we expected, new disease-disease connections based on CDMs were found, which will help us to explore the unobserved molecular mechanisms of diseases and provide new treatments for them. We chose CDM 7 (classified as Cardiovascular Diseases) to predict potential drugs for Rheumatoid Arthritis (RA). With the help of molecular docking techniques, we predicted three potential drugs (Carvedilol, Metoprolol, and Ramipril) for RA. These results were also validated by the literature.

# RESULTS

#### Constructing the Four-Layer Weighted Disease-Disease Similarity Network

#### Human Disease Network Based on Protein Interaction Network (PIDN)

The protein-protein interaction (PPI) network was obtained from Menche et al. (2015) and consists of 13,460 genes and 141,296 interactions. To compute the similarity between diseases based on the PPI network, we combined two datasets obtained from the Online Mendelian Inheritance in Man (OMIM) database (Hamosh et al., 2005) and Genome-Wide Association Studies (GWAS) (Ramos et al., 2014) to obtain the disease-gene data, which include 718 diseases and 22,410 genes (see **Table S1**). Then, we mapped the genes of each disease to the PPI network. Finally, based on the module distance definition for incomplete networks (Menche et al., 2015), we calculated the similarity between disease pairs and constructed the disease network PIDN. Here, nodes are diseases represented by their MeSH IDs (Mottaz et al., 2008). Weighted edges are correlations between disease genes based on module distance (Menche et al., 2015).

#### Human Disease Similarity Network Based on Symptoms (DSDN)

The symptom dataset of human diseases is based on the work of Zhou et al. (2014). Based on 322 symptom terms, they obtained a weighted disease-disease network in which the nodes are diseases and the weighted edges are similarities between diseases. We further discarded the lower-weighted edges to obtain a high-confidence network, which includes 1,596 nodes (diseases) and 133,106 edges (associations) (see **Table S2**).

FIGURE 1 | The framework of our work. (A) Four types of biological information related to diseases. (B) Construct a four-layer disease network based on the four types of data. (C) Extract conserved disease modules (CDMs) from the four-layer network and verify them from different aspects. (D) Apply the conserved disease modules (CDMs) to drug repositioning.

#### Gene Ontology- and Disease Ontology-Based Disease Similarity Networks (GODN and DODN)

Gene Ontology (GO) (Ashburner et al., 2000) gives the definitions of concepts/classes for describing gene function, and associations between these concepts. It includes three categories: molecular function, cellular component, and biological process. Disease Ontology (DO) (Schriml et al., 2011) is a standardized ontology of human disease, which provides a comprehensive hierarchical controlled vocabulary for human disease including anatomy, cell of origin, infectious agent, and phenotype axioms. We evaluated the relationships between diseases based on the terms in GO and DO separately to obtain two disease similarity networks, GODN (see **Table S3**) and DODN (see **Table S4**). The details of constructing the networks are given in the Methods section.

#### Four-Layer Weighted Disease-Disease Similarity Network

We selected the nodes (diseases) common to PIDN, DSDN, GODN, and DODN, which share 399 diseases. Then, based on these 399 diseases, we extracted four subgraphs, one from each network, which together constitute the final four-layer disease-disease network.
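The node-intersection step can be sketched as follows. The data layout (each layer as a dictionary from node pairs to weights) and the function name `restrict_to_common_nodes` are assumptions made for illustration, and the node set of a layer is inferred from its edges, so isolated diseases would be missed.

```python
def restrict_to_common_nodes(layers):
    """layers: dict mapping layer name -> {(u, v): weight}.

    Returns the nodes shared by all layers and each layer restricted to those nodes.
    """
    node_sets = [{n for edge in edges for n in edge} for edges in layers.values()]
    common = set.intersection(*node_sets)
    restricted = {
        name: {(u, v): w for (u, v), w in edges.items() if u in common and v in common}
        for name, edges in layers.items()
    }
    return common, restricted
```

Applied to the four similarity networks, the restricted layers share one node set, which is what allows the multilayer network to be treated as a stack of aligned adjacency matrices.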

# Extracting Conserved Disease Modules (CDMs) From the Four-Layer Weighted Disease Similarity Network

In real-world networks, weights on edges characterize the strength, intensity, or capacity of the connections between nodes (Wasserman and Faust, 1994; Barrat et al., 2004). Weighted networks therefore describe information more accurately than their unweighted counterparts. Further, studies have shown that in real-world networks, nodes tend to cluster into densely connected subnetworks (Watts and Strogatz, 1998; Louch, 2000; Snijders, 2001). To analyze the four-layer weighted disease network further, we used the tensor-based computational framework proposed by Li et al. (2011) to extract conserved disease modules (CDMs) from the multilayer network. The method of Li et al. (2011) mines recurrent heavy subgraphs (RHSs) from multiple weighted networks. Here, we refer to RHSs as conserved disease modules (CDMs). The definition of a CDM is based on that of a heavy subgraph (HS), a subset of heavily interconnected nodes in a single network. The nodes of a CDM are the same in each layer, but the edge weights may vary between layers. The calculation details are given in the Methods section. Finally, we obtained the nine CDMs shown in **Table 1**.

# Classification of the Nine Conserved Disease Modules

The nine CDMs have an average size of 8.2 diseases. We classified them according to the disease classification in Medical Subject Headings (MeSH) (Mottaz et al., 2008): if more than 60% of the diseases in a CDM belong to the same MeSH class F, the CDM is marked as class F. The classification results are shown in the third column of **Table 1**, and diseases whose classification differs from that of their module are marked in bold italic in the second column. For example, CDM 3 includes five diseases, and the classification of Lupus Erythematosus (Systemic) differs from that of the other four; it is therefore marked in bold italic in **Table 1**.
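The 60% rule can be written as a short function. The sketch below assumes, for simplicity, that each disease carries a single MeSH class label (in MeSH a disease may belong to several classes); the function name `classify_module` is hypothetical, and the labels for CDM 3 follow the description given in the text.

```python
from collections import Counter


def classify_module(disease_classes, threshold=0.6):
    """Return (label, outliers) for one module.

    disease_classes maps disease name -> MeSH class label.
    label is the class covering more than `threshold` of the diseases,
    or None if no class reaches that share.
    """
    counts = Counter(disease_classes.values())
    label, n = counts.most_common(1)[0]
    if n / len(disease_classes) <= threshold:
        return None, []
    outliers = sorted(d for d, c in disease_classes.items() if c != label)
    return label, outliers


# CDM 3 as described in the text: four urogenital diseases and one outlier.
cdm3 = {
    "Glomerulonephritis": "Male Urogenital Diseases",
    "Proteinuria": "Male Urogenital Diseases",
    "Nephropathy": "Male Urogenital Diseases",
    "Glomerulonephritis (IGA)": "Male Urogenital Diseases",
    "Lupus Erythematosus": "Immune System Diseases",
}
label, outliers = classify_module(cdm3)
```

Here four of the five diseases (80%) share a class, so the module takes that label and Lupus Erythematosus is reported as the disease with a different classification.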

**Table 1** shows that five CDMs include diseases with different class labels. **Figure 2** analyzes these results further: for each CDM, it compares the number of diseases that share the module's classification with the number that do not. **Figure 2** indicates that our method not only finds strong connections between diseases with the same classification but can also suggest potential relationships between diseases with different classifications.

# Statistical Significance of the Nine Conserved Disease Modules

#### TABLE 1 | The classifications of the nine conserved disease modules in MeSH.

*Diseases with different classification are marked as bold italic in the second column.*

To assess the statistical significance of the nine conserved disease modules, we generated four types of random networks based on PIDN, DSDN, GODN, and DODN, respectively. For each type, we generated 1,000 random networks that maintained the degree distribution of the original network. Applying the same method (Li et al., 2011) to these random networks, we did not find any conserved disease module. In addition, we performed an analysis based on the disease similarity network of van Driel et al. (2006), which used text mining to classify the human diseases contained in OMIM (Hamosh et al., 2005). For each of the nine CDMs, we randomly selected a module of the same size from this network and compared the sum of its edge weights with that of the real CDM. We repeated this process 10,000 times to obtain a p-value for each of the nine CDMs. The results are shown in **Table 2**: the p-values of all nine CDMs are significant (p < 0.1), and four of them are lower than 0.001.
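The random-module comparison can be sketched as an empirical p-value. This is a minimal illustration on a toy network, not the authors' implementation: the function names are hypothetical, absent edges are assumed to have weight 0, and the add-one correction in the final line is an assumption made here to avoid reporting p = 0.

```python
import random
from itertools import combinations


def module_weight(weights, nodes):
    """Sum the weights of all edges among `nodes`; absent edges count as 0."""
    ordered = sorted(nodes)  # fixed order keeps float sums reproducible
    return sum(weights.get(frozenset(pair), 0.0) for pair in combinations(ordered, 2))


def empirical_p_value(weights, all_nodes, module, n_trials=10000, seed=0):
    """Share of random same-size modules at least as heavy as `module`."""
    rng = random.Random(seed)
    observed = module_weight(weights, module)
    pool = sorted(all_nodes)
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(pool, len(module))
        if module_weight(weights, sample) >= observed:
            hits += 1
    # Add-one correction (an assumption of this sketch) avoids p = 0.
    return (hits + 1) / (n_trials + 1)


# Toy similarity network with one heavy triangle A-B-C (made-up weights).
toy_weights = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.85,
    frozenset(("C", "D")): 0.1,
    frozenset(("D", "E")): 0.2,
    frozenset(("E", "F")): 0.1,
}
toy_nodes = {"A", "B", "C", "D", "E", "F"}
p = empirical_p_value(toy_weights, toy_nodes, ["A", "B", "C"], n_trials=2000)
```

With six nodes there are only 20 possible three-node modules, so the toy p-value cannot fall much below 0.05; on a network with hundreds of diseases, far smaller values are attainable.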

# Comparison With Single Layer Networks

We also compared our multi-layer network with single-layer networks. We used the ClusterONE algorithm (Nepusz et al., 2012) to cluster each of the four single-layer disease networks: PIDN, DSDN, DODN, and GODN. The size distribution of the modules identified from each single-layer network is shown in **Figure 3**, together with the size distribution of the CDMs obtained from our multi-layer network, labeled MLDN (Multi-layer Disease Network).

**Figure 3** shows that the disease modules obtained from single-layer networks are almost all larger than those obtained from our multi-layer network. This result is consistent with the findings of De Domenico et al. (2015). Using a multi-layer network to characterize the relationships between diseases, we obtain smaller disease modules with more overlap that better capture the actual disease-disease relationships. A likely reason is that biological data, such as the interactome and the disease gene list, are incomplete (Hart et al., 2006; Wass et al., 2011), and single-layer networks consider only one dimension of biological information, which may introduce false positives. Multi-layer networks integrate related information from multiple dimensions; these sources are complementary and can reduce the uncertainty caused by single-dimensional data. Therefore, the modules extracted from multi-layer networks are smaller and more accurate. Additionally, the multi-layer network makes it possible to identify potential conserved disease modules such as CDM 3, CDM 5, CDM 6, CDM 7, and CDM 9 (shown in **Table 1**), each of which contains at least one disease with a different classification. Taking CDM 6 as an example, it includes six diseases: Pulmonary Fibrosis, Bronchiolitis Obliterans, Pulmonary Disease (Chronic Obstructive), Pulmonary Alveolar Proteinosis, **Celiac Disease**, and Bronchiectasis. Celiac Disease is a serious genetic autoimmune disease, whereas the other five diseases belong to Respiratory Tract Diseases in the MeSH database. If we had extracted modules only from PIDN, DSDN, or the common subgraph of the four networks, CDM 6 would not have been found. The main reason is that we constructed a weighted four-layer disease network rather than taking the common subgraph of the four single-layer networks, and we chose the tensor-based method (Li et al., 2011), which is suitable for clustering analysis of weighted multi-layer networks, to identify the conserved disease modules.

TABLE 2 | The *p*-values of the nine CDMs compared with random modules.

## KEGG Pathway Functional Enrichment Analysis and Investigation of Pathogenesis

In this section, we performed Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000) pathway enrichment analysis on the diseases and their related genes. KEGG (http://www.genome.jp/kegg/) is an encyclopedia of genes and genomes whose primary objective is to assign functional meanings to genes and genomes at both the molecular and higher levels. We used DAVID (Dennis et al., 2003), a functional annotation tool, to perform the KEGG pathway enrichment analysis. Based on the disease gene lists obtained from OMIM, we used DAVID to obtain the enriched KEGG pathways (p ≤ 0.01) for each disease in a given CDM.

Taking CDM 1 as an example, **Figure 4** shows the relationships between the 17 diseases in CDM 1 based on their 47 corresponding pathways. These 17 diseases overlap greatly in their pathways. The shared pathways include several important ones associated with cancer, such as "hsa05202: Transcriptional misregulation in cancer," "hsa05200: Pathways in cancer," "hsa04060: Cytokine-cytokine receptor interaction," and "hsa04630: Jak-STAT signaling pathway," which is consistent with the fact that all the diseases in CDM 1 belong to "Neoplasms" in MeSH (see **Table 1**).

In **Figure 4**, Adenocarcinoma and Esophageal Neoplasms (marked by a red solid rectangle) are enriched in only a few pathways, possibly because few genes are currently known to be related to them. Nevertheless, based on the multidimensional information used in this paper, we can detect their strong relationships with other diseases and group them together, which indicates that our multi-layer network method can help compensate for the incompleteness of one-dimensional biological data.

# Analyzing Disease Genes With the Maximum Frequency

We tried to analyze the pathogenesis of diseases through their similar neighbor diseases. Each disease has a related gene list. For a conserved disease module, we counted how often each gene appears across the gene lists of its diseases. For example, CDM 1 contains 17 diseases and therefore has 17 gene lists; a gene that appears in all 17 lists has frequency 17. For each conserved disease module, we selected the genes with the maximum frequency. The results are shown in **Table 3**. The genes with the maximum frequency in a module may be potential targets of the diseases in the module, or may be related to those targets.
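The frequency count described above amounts to counting, for each gene, the number of disease gene lists that contain it. A minimal sketch follows; the function name `max_frequency_genes` and the placeholder disease and gene identifiers are hypothetical and carry no biological meaning.

```python
from collections import Counter


def max_frequency_genes(gene_lists):
    """Return (max_frequency, genes) for one module.

    gene_lists maps disease name -> iterable of related genes.
    A gene is counted at most once per disease.
    """
    counts = Counter()
    for genes in gene_lists.values():
        counts.update(set(genes))
    top = max(counts.values())
    return top, sorted(g for g, c in counts.items() if c == top)


# Placeholder module with three diseases (made-up identifiers).
toy_module = {
    "disease_1": ["G1", "G2"],
    "disease_2": ["G1", "G3", "G1"],  # duplicate entry is counted once
    "disease_3": ["G2", "G1"],
}
top_frequency, top_genes = max_frequency_genes(toy_module)
```

In the toy module, G1 appears in all three gene lists, so it is returned with frequency 3.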

We again took CDM 1 as a case for further analysis. CDM 1 contains 17 diseases, and TNF (tumor necrosis factor) has the maximum frequency, 10, among them. That is, TNF is a causal gene of 10 diseases in CDM 1. For the other 7 diseases (Lymphoma, Colorectal Neoplasms, Esophageal Neoplasms, Hodgkin Disease, Leukemia Lymphoid, Leukemia Myeloid, and Adenocarcinoma) in CDM 1, TNF may be a potential causal gene or may be closely connected with their causal genes in the protein-protein interaction (PPI) network, which will be helpful for studying the pathogenesis of these diseases. Tumor necrosis factor (TNF or TNF-α) is a cell signaling protein (cytokine) involved in early inflammatory events. It affects lipid metabolism, coagulation, insulin resistance, and the function of the endothelial cells lining blood vessels (Vassalli, 1992). Drugs that block the action of TNF have been shown to reduce inflammation in inflammatory diseases such as Crohn's disease and Rheumatoid Arthritis (Raza, 2000).

In fact, four of the seven diseases (Lymphoma, Colorectal Neoplasms, Esophageal Neoplasms, and Hodgkin Disease) are significantly enriched in "hsa04668: TNF signaling pathway" according to the analysis in **Figure 4**. TNF can induce a wide range of intracellular signaling pathways, including apoptosis and cell survival as well as inflammation and immunity. For the remaining three diseases, Leukemia (Lymphoid), Leukemia (Myeloid, Acute), and Adenocarcinoma, we found that at least one of their causal genes has a strong connection with TNF in the PPI network (Greene et al., 2015).

FIGURE 4 | … disease module 1 (CDM 1). The horizontal axis indicates the 17 diseases and the vertical axis represents their 47 enriched pathways. The colors of the small bricks, from white to steel blue, represent the negative-log-transformed *p*-values after min-max normalization. The greater the value, the more significant the enrichment.

# Verify Disease Relationships in a Same CDM With Different Classifications

Our method found five significant conserved disease modules that include diseases with different classifications in the MeSH database (shown in **Table 1**). In this section, we took CDM 3 as an example. It is composed of five diseases: Glomerulonephritis, Proteinuria, **Lupus Erythematosus**, Nephropathy, and Glomerulonephritis (IGA) (a chronic form of glomerulonephritis). Except for Lupus Erythematosus, all of these are male urogenital diseases. We looked for potential connections between Lupus Erythematosus and the other four diseases.

All disease-related treatment drugs were downloaded from the Comparative Toxicogenomics Database (CTD) (Davis et al., 2012), and the drugs marked as "T" (therapeutic), meaning that they are used to treat the corresponding disease, were selected (Davis et al., 2012). For any disease pair $d_1$ and $d_2$ in CDM 3, their related drug sets are denoted $\text{Drug\_Therapeutic}_{d_1}$ and $\text{Drug\_Therapeutic}_{d_2}$, respectively. We used the Jaccard index (Jaccard, 1912) to calculate the similarity between $d_1$ and $d_2$ as follows:

$$J(d_1, d_2) = \frac{\left|\text{Drug\_Therapeutic}_{d_1} \cap \text{Drug\_Therapeutic}_{d_2}\right|}{\left|\text{Drug\_Therapeutic}_{d_1} \cup \text{Drug\_Therapeutic}_{d_2}\right|} \tag{1}$$

We found that Lupus Erythematosus has high similarity with other diseases in CDM 3. The Jaccard indices between Lupus Erythematosus and two other diseases, Glomerulonephritis and Nephrosis, are both 0.4. These results indicate that Lupus Erythematosus shares many therapeutic drugs with other diseases in CDM 3.
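For reference, Equation (1) is read here as the number of shared therapeutic drugs divided by the number of drugs in the union of the two sets. The sketch below uses made-up drug identifiers rather than CTD records; returning 0 for two empty sets is a convention assumed in this sketch.

```python
def jaccard(drugs_1, drugs_2):
    """Jaccard index of two drug sets: |intersection| / |union|."""
    a, b = set(drugs_1), set(drugs_2)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(a & b) / len(union)


# Made-up drug sets: 2 shared drugs out of 5 distinct drugs.
drugs_d1 = {"drug_a", "drug_b", "drug_c"}
drugs_d2 = {"drug_b", "drug_c", "drug_d", "drug_e"}
similarity = jaccard(drugs_d1, drugs_d2)
```

A value of 0.4, as reported in the text, corresponds to a case like this one, where two fifths of all drugs used for either disease are used for both.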

In fact, many reports have pointed out that Lupus Erythematosus has a strong correlation with other diseases in CDM 3. For example, Weening et al. (2004) pointed out that Glomerulonephritis and Lupus Erythematosus should be classified in the same class. Machado et al. (2005) reported the case of a 10-year-old girl with Systemic Lupus Erythematosus (SLE) presenting with Nephrotic Syndrome and Membranous Glomerulopathy.

TABLE 3 | The genes with the maximum frequency in each conserved disease module.

#### Application of Conserved Disease Modules in Drug Repositioning

#### Scoring Drugs Based on Diseases in Conserved Disease Modules

Drug repositioning is a strategy for identifying new therapeutic applications for existing drugs (Ashburn and Thor, 2004). For a conserved disease module, drugs used to treat some of its diseases can be regarded as potential drugs for the other diseases in the same module (Dudley et al., 2011). Based on this assumption, we tried to predict reusable drugs for the diseases in the same conserved disease module. First, we collected the related drugs for each conserved disease module by combining all the drugs related to its diseases. Drugs marked as "T" (therapeutic) were chosen from the CTD database, and each was scored by the following formula:

$$\text{Drug\_score} = \frac{n_T}{N} \tag{2}$$

where $N$ is the total number of diseases in a conserved disease module and $n_T$ is the number of diseases in that module that are related to the drug.

Here, we took CDM 7 as an example. **Table 4** shows the scored drugs of CDM 7. CDM 7 contains five diseases: Cardiomyopathies (CM), Dilated Cardiomyopathy (DCM), Hypertrophic Cardiomyopathy (HCM), Heart Failure (HF), and Rheumatoid Arthritis (RA). The drugs with Drug\_score ≥ 0.6 were selected. In other words, we assumed that drugs associated with at least 60% of the diseases in a module are also likely to be effective for treating the other diseases in the same module. For each drug, if it has a "T" (therapeutic) connection (Davis et al., 2012) with a disease in the CTD database, the corresponding position in **Table 4** is marked "1"; otherwise it is marked "0."
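The scoring in formula (2) and the 0.6 cutoff can be sketched as follows. The Carvedilol entry mirrors what the text reports (therapeutic records for the four cardiovascular diseases but not RA); `drug_x` is a made-up entry, and the function name `drug_scores` is hypothetical.

```python
def drug_scores(therapeutic, module):
    """Score each drug as n_T / N for one module.

    therapeutic maps drug -> set of diseases with a "T" record.
    module is the list of diseases in the conserved disease module.
    """
    members = set(module)
    n = len(members)
    return {drug: len(diseases & members) / n for drug, diseases in therapeutic.items()}


cdm7 = ["CM", "DCM", "HCM", "HF", "RA"]
therapeutic = {
    "Carvedilol": {"CM", "DCM", "HCM", "HF"},  # as described in the text
    "drug_x": {"HF"},                          # made-up entry
}
scores = drug_scores(therapeutic, cdm7)
selected = sorted(d for d, s in scores.items() if s >= 0.6)
```

Carvedilol scores 4/5 = 0.8 and passes the cutoff, while the made-up `drug_x` scores 0.2 and is filtered out.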

#### Verifying Potential Drugs Based on Molecular Docking Experiments

We chose three drugs, Carvedilol, Metoprolol, and Ramipril, from **Table 4**. The Drug\_score of each of these drugs is 0.8, which means that, according to the records in CTD, they can treat the four cardiovascular diseases in CDM 7. The one remaining disease, with no relevant records in CTD, is Rheumatoid Arthritis (RA). We carried out molecular docking experiments with AutoDock Vina (Trott and Olson, 2010) to verify the three drugs. AutoDock Vina is a suite of docking tools designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure. We downloaded the drug molecule information from the DrugBank database (Wishart et al., 2006) (https://www.drugbank.ca/) as ligands and obtained the PDB files of disease-related proteins from the RCSB PDB database (Deshpande et al., 2005) (http://www.rcsb.org/pdb/home/home.do) as receptors. We then docked the three drug molecules with RA-related proteins. The results are shown in **Figure 5**. Binding affinity represents the strength of the binding interactions between the causal proteins of RA and the three drugs, Carvedilol, Metoprolol, and Ramipril (Gohlke and Klebe, 2002). Binding affinity is expressed in physico-chemical terms by the equilibrium dissociation constant (KD) (Azimzadeh and Van Regenmortel, 1990), which is used to evaluate and rank the strengths of bimolecular interactions. The smaller the KD value, the greater the binding affinity of the ligand for its target. The results in **Figure 5** show that the three drug molecules, Carvedilol, Metoprolol, and Ramipril, dock well with the causal proteins of RA.

#### Possible Treatment Mechanism for RA

We noted that the binding affinity values between each drug molecule and T-bet and TNF are all smaller (marked with red rectangles in **Figure 5**). We inferred that the three drugs are more likely to treat RA by affecting T-bet and TNF. **Figure 6** shows a possible mechanism by which the drugs affect Rheumatoid Arthritis. Synovial T cells may be activated by the combined action of TGF-β, interleukin 6 (IL6), and interleukin 12 (IL12) (McInnes and Schett, 2007). The activated synovial T cells may promote the differentiation of T-helper 17 (TH17) cells on the one hand and participate in the activation of T-helper 1 (TH1) cells on the other (McInnes and Schett, 2007). Both TH17 and TH1 cells are helper T cells, which are important regulatory and effector cells in the immune response. In fact, TNF has been reported to be associated with the pathogenesis of Rheumatoid Arthritis (McInnes and Schett, 2011).

Moreover, several studies have reported that Carvedilol and Ramipril have an effect on RA. Arab and El-Sawalhi (2013) pointed out that Carvedilol, as a potential anti-arthritic drug, may reduce leukocyte migration. Fahmy Wahba et al. (2015) suggested that Ramipril may represent a promising new strategy against RA because of its anti-inflammatory effect in rats. In short, it is feasible to apply the conserved disease modules found by our method to drug repositioning research.

# METHODS

# Constructing Disease Networks GODN and DODN

Gene Ontology (GO) provides consistent representations of gene products across databases (Ashburner et al., 2000). The categories in GO can be described as directed acyclic graphs (DAGs) (Thulasiraman and Swamy, 1992). Nodes represent terms, and edges represent two kinds of semantic relations ("is\_a" and "part\_of"). The "is\_a" relation forms the basic structure of GO: A "is\_a" B means that node A is a subtype of node B. The "part\_of" relation represents part-whole relationships in GO. **Figure 7** gives an example of the DAG for the GO term "cellular component assembly: 0022607." There are six GO terms and seven relations between them in **Figure 7**. The solid blue arrow represents the "is\_a" relation and the dotted brown arrow represents the "part\_of" relation.

*The fourth to eighth columns represent the therapeutic relationships between drugs and diseases. If a drug and a disease have a therapeutic relationship in the CTD database, the value of the corresponding intersection is "1"; otherwise the value is "0." The last column gives the Drug\_score of each drug in CDM 7 based on formula (2). CM, cardiomyopathies; DCM, dilated cardiomyopathy; HCM, hypertrophic cardiomyopathy; HF, heart failure; RA, rheumatoid arthritis.*

FIGURE 6 | … the intermediate gene that is involved in the regulation process. The circle represents a specific cell. The blue oval represents Rheumatoid Arthritis.

Based on DAGs, Wang et al. (2007) proposed a method to calculate the functional similarities of genes based on gene annotation information in GO database. For term i and term j in GO, the semantic similarity between them is defined as below (Wang et al., 2007):

$$S_{GO}(i, j) = \frac{\sum_{t \in T_i \cap T_j} \left(S_i(t) + S_j(t)\right)}{SV(i) + SV(j)} \tag{3}$$

where $S_*(t)$ [defined by formula (4) (Wang et al., 2007)] indicates the contribution of term $t$ to term "$*$"; $T_*$ is a GO term set that includes term "$*$" and all of its ancestor terms in the DAG; and $SV(*)$ [defined by formula (5) (Wang et al., 2007)] is the semantic value of GO term "$*$." For any $t \in T_i$, its contribution to term $i$, $S_i(t)$, is defined as (Wang et al., 2007):

$$S_i(t) = \begin{cases} 1 & \text{if } t = i \\ \max\left\{ w_e \cdot S_i(t') \mid t' \in \text{children of}(t) \right\} & \text{if } t \neq i \end{cases} \tag{4}$$

where $w_e$ is the semantic contribution factor ($0 < w_e < 1$) of edge $e \in E_i$, which links term $t$ with its child term $t'$, and $E_i$ is the set of edges connecting the terms in the DAG for $i$. Formula (4) shows that the contribution of term $i$ to itself is 1, and that the contributions of other terms to term $i$ decrease as their distance from $i$ increases. The semantic value of GO term $i$, $SV(i)$, is then defined as follows (Wang et al., 2007):

$$SV(i) = \sum_{t \in T_i} S_i(t) \tag{5}$$
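Formulas (3) to (5) can be sketched on a small hand-made DAG. This is an illustration under stated assumptions, not the authors' code: the DAG is stored as a hypothetical `parents` dictionary, the terms R, A, B, and C are invented, and the contribution factors 0.8 for "is\_a" and 0.6 for "part\_of" are assumed here (the text above only requires $0 < w_e < 1$).

```python
# Assumed semantic contribution factors; the text only requires 0 < w_e < 1.
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(term, parents, weights=WEIGHTS):
    """Formula (4): return {t: S_term(t)} for `term` and all its ancestors."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents.get(child, []):
            candidate = weights[relation] * s[child]
            if candidate > s.get(parent, 0.0):  # keep the best path only
                s[parent] = candidate
                stack.append(parent)
    return s


def semantic_value(term, parents, weights=WEIGHTS):
    """Formula (5): SV(term) is the sum of all S-values."""
    return sum(s_values(term, parents, weights).values())


def term_similarity(i, j, parents, weights=WEIGHTS):
    """Formula (3): similarity of two terms from their shared ancestors."""
    si, sj = s_values(i, parents, weights), s_values(j, parents, weights)
    shared = si.keys() & sj.keys()
    numerator = sum(si[t] + sj[t] for t in shared)
    return numerator / (sum(si.values()) + sum(sj.values()))


# Invented DAG: C is_a A, C part_of B, A is_a R, B is_a R.
parents = {
    "A": [("R", "is_a")],
    "B": [("R", "is_a")],
    "C": [("A", "is_a"), ("B", "part_of")],
}
s_c = s_values("C", parents)
sim_ac = term_similarity("A", "C", parents)
```

For term C, the root R is reachable through A (0.8 × 0.8 = 0.64) and through B (0.6 × 0.8 = 0.48), and the larger contribution is kept, as the maximum in formula (4) requires.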

Using formulas (3)–(5), we can calculate the similarity between two GO terms $i$ and $j$. On this basis, we can further calculate the similarity between two sets of terms $G_1$ and $G_2$ as (Wang et al., 2007):

$$\operatorname{Sim}(G_1, G_2) = \frac{1}{|G_1| + |G_2|} \times \left(\sum_{s \in G_1} \operatorname{sim}(s, G_2) + \sum_{t \in G_2} \operatorname{sim}(t, G_1)\right) \tag{6}$$

where $|G_1|$ and $|G_2|$ represent the numbers of terms in $G_1$ and $G_2$, respectively; $\operatorname{sim}(s, G_2)$ is the maximum similarity between term $s$ and any term in set $G_2$, i.e., $\operatorname{sim}(s, G_2) = \max_{t' \in G_2} S_{GO}(s, t')$; and $\operatorname{sim}(t, G_1)$ is the maximum similarity between term $t$ and any term in set $G_1$, i.e., $\operatorname{sim}(t, G_1) = \max_{t' \in G_1} S_{GO}(t, t')$.
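Formula (6) is a best-match average over both term sets. The sketch below takes the term-level similarity as a function argument, so any implementation of $S_{GO}$ can be plugged in; the lookup table `pair_sim` and the terms a, b, c are made up for illustration, and returning 0 for an empty set is a convention assumed here.

```python
def set_similarity(g1, g2, term_sim):
    """Formula (6): best-match average similarity between two term sets."""
    if not g1 or not g2:
        return 0.0  # convention assumed here for empty term sets
    best_1 = sum(max(term_sim(s, t) for t in g2) for s in g1)
    best_2 = sum(max(term_sim(t, s) for s in g1) for t in g2)
    return (best_1 + best_2) / (len(g1) + len(g2))


# Made-up symmetric term similarities, standing in for S_GO.
pair_sim = {
    frozenset(("a", "b")): 0.5,
    frozenset(("a", "c")): 0.2,
    frozenset(("b", "c")): 0.4,
}


def toy_sim(s, t):
    """Look up the toy similarity; identical terms have similarity 1."""
    return 1.0 if s == t else pair_sim[frozenset((s, t))]


disease_sim = set_similarity(["a", "b"], ["b", "c"], toy_sim)
```

In this example the best matches are 0.5 and 1.0 for the first set and 1.0 and 0.4 for the second, giving (1.5 + 1.4) / 4 = 0.725.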

Because each disease relates to a gene set and each gene set can be mapped to a GO term set, we can evaluate the correlation between two diseases based on the similarity between their related GO term sets. **Figure 8** gives the computational framework of disease similarities based on GO terms. In this way, we can construct the GODN.

For the DO-based disease similarity network (DODN), the constructing process is similar to that of GODN. Disease Ontology (DO) is a standardized ontology with consistent, reusable and sustainable descriptions of human disease terms (Schriml et al., 2011). Similar to GO, the associations between disease terms in DO can also be presented as DAGs (Thulasiraman and Swamy, 1992). **Figure 9** gives an example of the DAG for DO term "cerebrovascular disease: 6713."

Nodes represent the DO terms and edges represent the "is\_a" relationships between terms. For instance, DO term "cerebrovascular disease: 6713" is a subclass of DO term "artery disease: 0050828." As a result, each disease corresponds to a DO term set. In **Figure 9**, DO term "cerebrovascular disease: 6713" corresponds to a set {"cerebrovascular disease: 6713," "artery disease:0050828," "cerebrovascular disease: 6713," "vascular disease: 178," "cardiovascular system disease: 1287"}. The similarity between two DO terms represents the relationship between two diseases. Therefore, we use the same method (Wang et al., 2007) as GODN to construct DODN.

### Extracting Conserved Modules From the Four-Layer Disease Network

The method (Li et al., 2011) for extracting conserved modules from the four-layer disease network is based on tensor analysis for multi-networks, which describes the multi-layer complex network as a third-order tensor:

$$A = \left(a_{ijk}\right)_{n \times n \times m} \tag{7}$$

where $a_{ijk}$ represents the weight of the edge between disease $i$ and disease $j$ in layer $k$, and $n$ and $m$ represent the number of diseases in each layer and the number of layers, respectively. Modules in single-layer networks are considered to have tight internal connections and loose external connections, a notion that can be extended to multi-layer networks such as multi-layer disease networks. Here, we call the modules that appear in the four-layer disease network conserved disease modules (CDMs). The nodes of a CDM are the same in each occurrence, but the edge weights may vary between networks. The sum of edge weights in the CDM is defined as (Li et al., 2011):

$$H_A(\mathbf{x}, \mathbf{y}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{m} a_{ijk} x_i x_j y_k \tag{8}$$

where $\mathbf{x} = (x_1, \ldots, x_n)^T$ is the disease membership vector and $n$ is the number of diseases in each layer: $x_i = 1$ if disease $i$ appears in the CDM, and $x_i = 0$ otherwise. Similarly, $\mathbf{y} = (y_1, \ldots, y_m)^T$ is the network membership vector and $m$ is the number of disease networks (here, $m = 4$): $y_k = 1$ if the CDM appears in network $k$, and $y_k = 0$ otherwise. Because CDMs are disease modules that appear in all four networks, $y_k = 1$ for every $k$ in our work. Discovering conserved disease modules can be formulated as a discrete combinatorial optimization problem (Li et al., 2011): among all CDMs of fixed size, we look for the heaviest, i.e., the one that maximizes $H_A$. This problem can be converted to a continuous optimization problem (Li et al., 2011):

$$\begin{array}{l} \max_{\mathbf{x} \in \mathbb{R}_+^n,\ \mathbf{y} \in \mathbb{R}_+^m} H_A(\mathbf{x}, \mathbf{y}) \\ \text{subject to } \begin{cases} f(\mathbf{x}) = 1 \\ g(\mathbf{y}) = 1 \end{cases} \end{array} \tag{9}$$

where $\mathbb{R}_+$ is the non-negative real space and $f(\mathbf{x})$ and $g(\mathbf{y})$ are vector norms. These equations give a tensor-based computational framework, which we use to identify CDMs. The size of a CDM is set to be no less than 5, and the sum of edge weights in a CDM is set to be no less than 0.3.
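To make the heaviness function concrete, the sketch below evaluates Equation (8) for given binary membership vectors. It only evaluates $H_A$; it does not implement the optimization of Li et al. (2011). The tensor is assumed to be stored as a list of symmetric $n \times n$ matrices, one per layer, and the two toy layers contain made-up weights.

```python
def heaviness(layers, x, y):
    """Equation (8): H_A(x, y) for a tensor stored as layers[k][i][j] = a_ijk."""
    total = 0.0
    for k, matrix in enumerate(layers):
        if y[k] == 0:
            continue
        for i, row in enumerate(matrix):
            if x[i] == 0:
                continue
            for j, a in enumerate(row):
                total += a * x[i] * x[j] * y[k]
    # Each undirected edge is visited twice (i,j and j,i), hence the 1/2.
    return 0.5 * total


# Two toy layers over three diseases (symmetric, zero diagonal, made-up weights).
toy_layers = [
    [[0.0, 0.8, 0.1], [0.8, 0.0, 0.2], [0.1, 0.2, 0.0]],
    [[0.0, 0.6, 0.0], [0.6, 0.0, 0.3], [0.0, 0.3, 0.0]],
]
x = [1, 1, 0]  # diseases 0 and 1 form the candidate module
y = [1, 1]     # the module is required to appear in both layers
h = heaviness(toy_layers, x, y)
```

For the candidate module {0, 1}, the function adds the weight of the single internal edge in each layer, 0.8 + 0.6 = 1.4.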

# DISCUSSION

The multi-layer network framework in this work is motivated by the underlying disease relationships at different levels. Considering the multidimensional information about diseases, we first constructed four disease similarity networks, namely PIDN, DSDN, GODN, and DODN. We then integrated these four networks into a four-layer disease network, from which we obtained nine conserved disease modules using the tensor-based computational framework. The sizes of these nine disease modules range from 5 to 17. We classified the disease modules based on the MeSH database, using 0.6 as the threshold for assigning a class to a module. Diseases in a conserved module mostly belonged to the same category. Diseases whose classification differs from that of the rest of their module are more likely to indicate potential disease-disease relationships.

We verified the reliability of our results from a statistical point of view. We randomly rewired the edges of the four disease networks while keeping the degree of every node unchanged. After repeating this procedure 1,000 times, we did not find any conserved disease module. We also constructed a statistical experiment using the disease similarity network created by van Driel et al. (2006), referred to as Van's network, as a standard dataset. We first located the nine conserved disease modules in Van's network and summed the edge weights of each module. We then compared these sums with those of random modules extracted from Van's network. We repeated this procedure 10,000 times and found the p-values were lower than 0.001. We also compared our results with those of single-layer network clustering and found that the modules extracted from the multi-layer network were more reliable and accurate.

We used the pathogenic genes of each disease in conserved disease module 1 for KEGG enrichment analysis and found that many pathways, such as hsa05320, hsa05332, hsa04612, hsa05202, hsa04380, and hsa04060, were significantly enriched in most of the diseases. Through frequency analysis of the pathogenic genes in conserved disease module 1, we found that the TNF (tumor necrosis factor) gene had the highest frequency. As reported,<sup>1</sup> TNF plays an important role in fighting pathogens and tumors. It acts via the tumor necrosis factor receptor (TNFR) to trigger apoptosis. For the diseases in module 1, TNF may be a potential causal gene or may be closely connected with their causal genes, which will be useful for studying the mechanisms of these diseases.

More importantly, our method can find potential disease-disease relationships. Taking conserved disease module 3 as a case, we found that lupus erythematosus is an immune system disease and does not have the same classification as the others, i.e., male urogenital disease. However, lupus erythematosus shares many therapeutic drugs with the other diseases in module 3, which suggests a potential relationship between lupus erythematosus and these diseases. As an application of our findings, drugs can be repositioned among the diseases in the same module. Taking conserved disease module 7 as a case, we found three potential drugs for Rheumatoid Arthritis (RA) based on molecular docking experiments. Furthermore, we verified these findings against the literature.

In summary, our model for constructing a multi-layer disease network yields more accurate conserved disease modules. As mentioned above, we verified our results from several perspectives. However, there are still some shortcomings. Because our results are data-dependent, the incompleteness of the data affects the extracted module information. For example, DSDN is relatively sparse compared with the other three networks because of preprocessing. To improve the quality of the data, we need to filter out false positive information in advance, which reduces the scale of the data. Based on such data, we may find only some of the meaningful results. As the data continue to improve, we expect to find more meaningful conserved disease modules. In addition, more categories of disease data can be integrated in the framework of a multi-layer network, which will support more in-depth research on disease mechanisms.

# AUTHOR CONTRIBUTIONS

LY and YZ contributed conception and design of the study. SY organized the database and performed the experiments. LY and YZ performed the results analysis. LY wrote the first draft of the manuscript. LG, YZ, and SY wrote sections of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

### ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (Nos. 61672406, 61532014, 61432010, 61772395, and 61672407) and the Fundamental Research Funds for the Central Universities under Grant No. JB180307.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00745/full#supplementary-material

Table S1 | Disease-gene data combining GWAS and OMIM databases.

Table S2 | Human disease similarity data based on disease symptoms.

Table S3 | Human diseases similarity data based on GO terms.

Table S4 | Human diseases similarity data based on DO terms.

<sup>1</sup>University of Erlangen-Nuremberg. "How tumor necrosis factor protects against infection." ScienceDaily. ScienceDaily, 11 July 2016.

## REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Yu, Yao, Gao and Zha. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# mlDEEPre: Multi-Functional Enzyme Function Prediction With Hierarchical Multi-Label Deep Learning

Zhenzhen Zou<sup>1†</sup>, Shuye Tian<sup>2†</sup>, Xin Gao<sup>1</sup> and Yu Li<sup>1</sup>\*

*<sup>1</sup> Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, <sup>2</sup> Department of Biology, Southern University of Science and Technology (SUSTC), Shenzhen, China*

As a great challenge in bioinformatics, enzyme function prediction is a significant step toward designing novel enzymes and diagnosing enzyme-related diseases. Existing studies mainly focus on mono-functional enzyme function prediction. However, the number of multi-functional enzymes is growing rapidly, which requires novel computational methods to be developed. In this paper, following our previous work, DEEPre, which uses deep learning to annotate the function of mono-functional enzymes, we propose a novel method, mlDEEPre, which is designed specifically for predicting the functionalities of multi-functional enzymes. By adopting a novel loss function, associated with the relationship between different labels, and a self-adapted label assigning threshold, mlDEEPre can accurately and efficiently perform multi-functional enzyme prediction. Extensive experiments also show that mlDEEPre can outperform the other methods in predicting whether an enzyme is mono-functional or multi-functional, as well as in main class prediction, across different criteria. Furthermore, due to the flexibility of mlDEEPre and DEEPre, mlDEEPre can be incorporated into DEEPre seamlessly, which enables the updated DEEPre to handle both mono-functional and multi-functional predictions without human intervention.

Keywords: multi-functional enzyme, function prediction, EC number, deep learning, hierarchical classification, multi-label learning

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Huiluo Cao, The University of Hong Kong, Hong Kong; Ka-Chun Wong, City University of Hong Kong, Hong Kong*

\*Correspondence:

*Yu Li yu.li@kaust.edu.sa*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *13 November 2018* Accepted: *20 December 2018* Published: *22 January 2019*

#### Citation:

*Zou Z, Tian S, Gao X and Li Y (2019) mlDEEPre: Multi-Functional Enzyme Function Prediction With Hierarchical Multi-Label Deep Learning. Front. Genet. 9:714. doi: 10.3389/fgene.2018.00714*

# 1. INTRODUCTION

Enzymes, which catalyze reactions in vivo, play a vital role in metabolism in every species. Predicting enzyme function is an important bioinformatics task, helping researchers design more efficient novel enzymes and assisting in the diagnosis of enzyme-related diseases (Hoffmann et al., 2007). To predict enzyme function, a clear and standard enzyme function ontology should be defined. Currently, the most popular way of standardizing enzyme function is to use the EC number system (Cornish-Bowden, 2014). An enzyme commission (EC) number is composed of four digits, e.g., EC 3.1.21.4, with the first digit denoting the main class of the enzyme, the second digit indicating the subclass, and so on. Each further digit defines the function of an enzyme more specifically, in combination with the previous digits. As shown in **Figure 1** in Shen and Chou (2007), the label space of the EC system has a tree structure. Because enzyme function prediction is an important bioinformatics task, a number of methods have been proposed to deal with the problem, based on structure similarity (Dobson and Doig, 2005; Roy et al., 2012; Yang et al., 2015), sequence similarity (Tian et al., 2004; Quester and Schomburg, 2011), or machine learning techniques (des Jardins et al., 1997; Cai et al., 2003; Shen and Chou, 2007; Zhou et al., 2007; Li et al., 2018d). Despite the success of those methods in predicting mono-functional enzyme function with very high accuracy, few studies have addressed the prediction of multi-functional enzyme function, even though multi-functional enzymes constitute a relatively large part of all enzymes. Until now, to our knowledge, only five methods (De Ferrari et al., 2012; Zou et al., 2013; Che et al., 2016; Zou and Xiao, 2016; Amidi et al., 2017) are able to address that specific type of enzyme. Among them, De Ferrari et al. (2012) used InterPro signatures as the features and multi-label k-nearest neighbor (KNN) as the algorithm. Zou et al. (2013) took advantage of a number of manually designed features, such as a 20-D feature vector extracted from the position-specific scoring matrix (PSSM) and a 188-D feature vector based on the composition and physicochemical properties of the protein, and a conventional multi-label machine learning algorithm. Che et al. (2016) also utilized features extracted from PSSM, combined with the multi-label KNN algorithm. Zou and Xiao (2016) deployed three variants of the well-known feature, Pseudo Amino Acid Composition (PseAAC), and the multi-label KNN algorithm. Amidi et al. (2017) used the predicted structure information, combined with the sequence information, as the feature, and multi-label KNN and multi-label support vector machine (SVM) as the classifier. Despite their satisfactory performance on specific datasets, all of those algorithms can be improved in two ways. Firstly, all the methods utilize very specific expert-designed features. Designing such features consumes both time and expert knowledge, and the extracted features can be local minima for this particular problem (Dai et al., 2017; Li et al., 2018d). Features automatically designed and extracted from the raw representation of a protein by an algorithm are more desirable than human-designed features. Secondly, almost all the aforementioned methods rely on KNN. KNN is a similarity-based algorithm. Unlike probability-based algorithms, KNN is unable to annotate novel enzymes that do not have homologs with high sequence similarity in the current databases. Both the feature design process and the classification process therefore have great potential for improvement.

On the other hand, deep learning (LeCun et al., 2015), an end-to-end learning algorithm which wraps the representation learning and classifier learning into one model, has shown great potential in the bioinformatics field (Dai et al., 2017; Li et al., 2018a,d,e; Umarov et al., 2018; Xia et al., 2018), especially in the enzyme function prediction direction (Li et al., 2018d), in which the newly proposed deep learning based method, DEEPre (Li et al., 2018d), has improved the state-of-the-art performance significantly. In general, Li et al. (2018d) built one deep learning model for each of the internal nodes in the tree structure of the EC number system. In particular, for level 0, that is, predicting whether the input protein sequence is an enzyme or not, there is one model; for level 1, whose task is to predict the main class of an enzyme from the six classes, there is one model; for level 2, which predicts the subclass of an enzyme, there are six models since there are six main classes and we need to build a model for each of those different main classes. In terms of the deep learning model architecture, Li et al. (2018d) proposed a novel architecture which can extract convolutional information and sequential information from three raw representations (PSSM, sequence encoding and functional domain encoding) and combine them together automatically for the downstream classification. They only fed the very raw encoding of the protein sequence to the deep learning model and the model is responsible for both feature extraction and classification. In this way, the algorithm is likely to find a better hidden feature representation implicitly, which benefits the classification results.

However, despite the success of DEEPre, the original version of DEEPre was designed specifically for mono-functional enzyme function prediction and not capable of handling multi-functional enzymes. Following the success of DEEPre and its research direction, we propose a novel hierarchical multi-label deep learning method, mlDEEPre, for predicting the multi-functional enzyme functions. In particular, mlDEEPre first predicts whether an enzyme is a mono-functional enzyme or a multi-functional enzyme as a binary classification problem. If the enzyme is a multi-functional enzyme, it will take the input enzyme sequence and predict its main classes as a multi-label prediction problem. To equip the deep learning model with multi-label prediction ability, we adopt the idea of backpropagation for multilabel learning (BP-MLL) (Zhang and Zhou, 2006) into the original DEEPre architecture. Meanwhile, since the entire DEEPre package can also take the main class of the sequence as input and start the prediction from the second level, after obtaining the main classes of the multi-functional enzymes, we can feed the mlDEEPre result to DEEPre, predicting all the four digits for each function of an enzyme. In this work, we make the following contributions:


# 2. MATERIALS AND METHODS

In this section, we first introduce the dataset used to evaluate the proposed method (section 2.1) and the needed raw encodings we feed to the deep learning model (section 2.2). Then, we provide a big picture of the mlDEEPre method in section 2.3. After that, we introduce the deep learning architecture used in our method (section 2.4). Following the model introduction, we describe how we equip the model with the ability to perform multi-label prediction (section 2.5 and section 2.6). Finally, we wrap up the mlDEEPre method and combine it with the original version of DEEPre (section 2.7).

#### 2.1. Dataset

For the mono-functional enzyme data, we use the dataset from Li et al. (2018d). As for the multi-functional enzyme data, we use the dataset from Che et al. (2016). Li et al. (2018d) constructed a dataset containing 22,168 mono-functional sequences from UniProt, filtered by CD-HIT at 40% sequence similarity. Che et al. (2016) provided us with 4,076 multi-functional enzymes. More detailed descriptions of how the datasets were constructed can be found in Li et al. (2018d) and Che et al. (2016). Here, we provide the statistics of the mono-functional and the multi-functional enzyme datasets in **Tables 1**–**3**.

TABLE 1 | Dataset I: 22,168 single-labeled enzymes.

TABLE 2 | Dataset II: 4,076 multi-labeled enzymes.

*This table shows the number of multi-functional enzymes in the dataset with different EC main class combinations.*

TABLE 3 | Dataset II: 1,085 multi-labeled enzymes with 65% sequence similarity cut-off.

#### 2.2. Protein Raw Encoding

Similar to DEEPre (Li et al., 2018d), we use the following three raw protein encodings to represent a protein sequence, which will be fed to the deep learning model as inputs.

#### 2.2.1. PSSM

For each protein sequence, we run PSI-BLAST (Altschul et al., 1997) from BLAST+ (Camacho et al., 2009) against SWISS-PROT (Bairoch and Apweiler, 2000), with three iterations and an E-value of 0.002, to find homologous sequences. Then we align those sequences and, for each position in the query protein, calculate a vector which indicates the appearance frequency of each amino acid in the alignment. The evolutionary information of the protein sequence is thus encoded by an L by 20 matrix, where L is the length of the query sequence.
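The per-position frequency calculation described above can be illustrated with a minimal sketch. It assumes the homologs are already aligned to the query, with gaps written as `-`; the function name `position_frequencies`, the column order, and the toy alignment are hypothetical. The profile actually produced by PSI-BLAST involves further steps (such as sequence weighting and pseudocounts) that are not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the L x 20 matrix


def position_frequencies(aligned_sequences):
    """Return an L x 20 matrix of amino acid frequencies, one row per alignment column."""
    length = len(aligned_sequences[0])
    matrix = []
    for pos in range(length):
        # Keep only standard residues; gaps and unknown symbols are ignored (assumption).
        column = [seq[pos] for seq in aligned_sequences if seq[pos] in AMINO_ACIDS]
        total = len(column)
        # A column without any standard residue yields an all-zero row (assumption).
        row = [column.count(aa) / total if total else 0.0 for aa in AMINO_ACIDS]
        matrix.append(row)
    return matrix


toy_alignment = ["ACD", "ACE", "GC-"]  # hypothetical aligned sequences
profile = position_frequencies(toy_alignment)
```

Each row of `profile` sums to one whenever the column contains at least one standard residue, so rows can be compared across positions regardless of how many homologs cover them.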

#### 2.2.2. Sequence One-Hot Encoding

To represent the original protein sequence information, we use one-hot encoding. For each type of amino acid, we use a vector composed of nineteen 0s and one 1 to represent it. For example, 'A' is represented as (1, 0<sub>1</sub>, ..., 0<sub>19</sub>) and 'C' is represented as (0<sub>1</sub>, 1, ..., 0<sub>19</sub>). In this way, each position of the protein sequence is encoded into a vector. Putting those vectors together, we have an L by 20 matrix to represent the original raw sequence.
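A minimal sketch of this encoding is given below. The alphabetical residue ordering and the handling of non-standard residues are assumptions made for illustration; the source only fixes that 'A' maps to the first position and 'C' to the second.

```python
# Illustrative sketch only; the residue ordering below is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an L x 20 matrix (list of rows) for a protein sequence of length L."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        # Non-standard residues (e.g. 'X') are left as all-zero rows: an assumption.
        if residue in INDEX:
            row[INDEX[residue]] = 1
        matrix.append(row)
    return matrix


encoded = one_hot_encode("ACD")
```

Here `encoded[0]` starts with a 1 (for 'A') and `encoded[1]` has its 1 in the second position (for 'C'), matching the two examples in the text.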

#### 2.2.3. Functional Domain Encoding

This representation encodes the functional domain within a protein sequence. We use HMMER (Eddy, 2011) to search a query protein against Pfam (Finn et al., 2016), which is a functional domain database. If one functional domain is hit, we use 1 to encode it; otherwise, we use 0 to encode it. Consequently, we have a vector composed of 0s and 1s to show the functional domain information of a protein.
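The binary domain vector can be sketched as follows, assuming the domain hits have already been obtained from a search tool and that a fixed, ordered list of known domains is available. The function name `encode_domains` and the identifiers in `vocabulary` are hypothetical placeholders, not real Pfam accessions.

```python
def encode_domains(hit_domains, domain_vocabulary):
    """Return a 0/1 vector with one entry per domain in the vocabulary."""
    hits = set(hit_domains)
    return [1 if domain in hits else 0 for domain in domain_vocabulary]


# Hypothetical domain identifiers, used only to show the shape of the encoding.
vocabulary = ["domain_A", "domain_B", "domain_C", "domain_D"]
domain_vector = encode_domains(["domain_C", "domain_A"], vocabulary)
```

The vector length equals the size of the domain vocabulary and is independent of the sequence length, which is why this input is handled differently from the two L by 20 matrices.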

#### 2.3. mlDEEPre

The primary task of mlDEEPre is to predict the main classes of multi-functional enzymes. However, we start by predicting whether a query enzyme is a multi-functional enzyme or not. As shown in **Figure 1**, mlDEEPre has two levels. Given an enzyme sequence, the first level predicts whether the enzyme is a mono-functional enzyme or a multi-functional one. If the sequence is a multi-functional enzyme, the second level of mlDEEPre will predict the main classes of the enzyme's multiple functions. The model architecture of the two levels is discussed in section 2.4 and the specific design for multi-label prediction is discussed in sections 2.5 and 2.6. mlDEEPre has a very close relationship with the original version of DEEPre, which is discussed in detail in section 2.7.

#### 2.4. Model Architecture

Regarding the deep learning model, we use a model architecture similar to that in Li et al. (2018d). As shown in **Figure 2**, we adopt convolutional layers for the sequence-length dependent features, i.e., PSSM and sequence one-hot encoding, to extract useful information from those encodings. Since functional domain encoding is already a high-level feature with a fixed length, we use fully connected layers to reduce the dimensionality and further extract information. After those separate layers for each feature, we concatenate their outputs and feed the concatenated vector to a fully connected layer, which can be considered as the classifier. Training the model in an end-to-end manner, we are able to optimize the feature extractors (the layers before concatenation) and the classifier (the final fully connected layers) at the same time, resulting in a better hidden feature representation and thus a classification model with better performance.

predict whether the input sequence is an enzyme or not. If it is an enzyme, we use mlDEEPre level 1 to predict whether the enzyme is a mono-functional enzyme or a multi-functional enzyme. If it is a mono-functional enzyme, DEEPre will take over. If not, we use mlDEEPre to predict that multi-functional enzyme's main classes. Inputting the main classes and the sequence to DEEPre, we can obtain the full annotation for each function of the enzyme.

PSSM and sequence one-hot encoding, and fully connected neural network component to handle functional domain encoding. After those components, we concatenate their outputs into one vector, which is fed to a fully connected classifier. We apply a threshold function to the output of the model to obtain the labels of the input sequence.

#### 2.5. Multi-Label: Loss Function

Deep learning methods are often suitable for multi-label classification. As shown in **Figure 2**, the model's last layer has multiple nodes, whose outputs correspond to the predicted probability of each label. If we use the model to perform single-label classification, we find the label with the highest probability score and assign that label to the query. When we use the model to perform multi-label prediction, we can still use the predicted probabilities, but we need to change the way labels are assigned. Instead of assigning only the label with the highest probability, we assign all labels whose probability score is higher than a certain threshold, so that multiple labels can be predicted. On the other hand, when we train the model, we also need to consider the multi-label information in the training data. One of the most straightforward ways of incorporating such information is to change the loss function so that the model knows we are performing multi-label prediction on a multi-label dataset. For both the threshold and the loss function, we adopt the idea from Zhang and Zhou (2006), i.e., BP-MLL. We introduce the loss function in detail in this section and discuss the threshold in the next section.

Formally, denote the i-th enzyme instance as x<sub>i</sub> and its corresponding label vector as D<sub>i</sub>. Each element of D<sub>i</sub> is a binary value, which indicates whether that enzyme instance belongs to a certain class. We use d<sub>i</sub><sup>j</sup> to denote that element, where j ∈ [1, 6] for our problem. If d<sub>i</sub><sup>j</sup> is 1, the enzyme x<sub>i</sub> belongs to class j; otherwise it does not. As for a classification problem, the most intuitive way to define the global error of the network is to measure the distance between the predicted labels and the real labels of the training set:

$$E = \sum\_{i=1}^{m} E\_i,\tag{1}$$

where E<sub>i</sub> represents the network error on the instance x<sub>i</sub> and m is the size of the training data. For a multi-label classification problem, we can define E<sub>i</sub> as below:

$$E\_i = \sum\_{j=1}^{Q} \left( l\_i^j - d\_i^j \right)^2,\tag{2}$$

where l<sub>i</sub><sup>j</sup> and d<sub>i</sub><sup>j</sup> are the actual output of the network and the true label on x<sub>i</sub> for class j, respectively; Q is the total number of classes, which is 6 in our problem. Using Equation 2, we are able to incorporate the multi-label information into the model to a certain degree since all the label information is considered in that loss function. However, the loss function in Equation 2 assumes that each class label is independent, which ignores any relationship between different class labels. In reality, one of the most straightforward relationships between labels is that labels in L<sub>i</sub><sup>true</sup> should have higher ranks than those not in L<sub>i</sub><sup>true</sup>, where L<sub>i</sub><sup>true</sup> is the set of labels that the instance x<sub>i</sub> has. Accordingly, we can use the following function as the loss, which considers the rank relationship between labels:

$$E = \sum\_{i=1}^{m} E\_i = \sum\_{i=1}^{m} \frac{1}{|L\_i^{true}| |\overline{L\_i^{true}}|} \sum\_{(k,q) \in L\_i^{true} \times \overline{L\_i^{true}}} e^{\left(-(l\_i^k - l\_i^q)\right)}, \tag{3}$$

where $\overline{L\_i^{true}}$ is the complementary set of L<sub>i</sub><sup>true</sup>, that is, the set of labels which the instance x<sub>i</sub> does not have, and | • | is the cardinality of a set. From the equation, we can see that (l<sub>i</sub><sup>k</sup> − l<sub>i</sub><sup>q</sup>) measures the difference between the outputs of the network on the labels belonging to the training instance and the ones not belonging to it, which is further fed to the exponential function. When l<sub>i</sub><sup>q</sup> happens to be much larger than l<sub>i</sub><sup>k</sup>, which causes a large discrepancy, the exponential function penalizes the error severely. By minimizing Equation 3, we can make the model output much higher values for the true labels and very small values for the labels that the training instance does not have. Thus, labels in L<sub>i</sub><sup>true</sup> have higher ranks than those not in L<sub>i</sub><sup>true</sup>, which is in agreement with our goal.
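The pairwise ranking loss of Equation 3 can be written down directly. The following is a minimal sketch in plain Python rather than the training code used in this work; the function names are hypothetical, and returning zero for an instance whose label set or its complement is empty is an assumption, since Equation 3 is undefined in that case.

```python
import math


def bpmll_instance_loss(outputs, labels):
    """Equation 3 for one instance: mean of exp(-(l_k - l_q)) over (true, false) label pairs."""
    positives = [k for k, d in enumerate(labels) if d == 1]
    negatives = [q for q, d in enumerate(labels) if d == 0]
    if not positives or not negatives:
        # Equation 3 is undefined here; contributing zero loss is an assumption.
        return 0.0
    total = sum(math.exp(-(outputs[k] - outputs[q]))
                for k in positives for q in negatives)
    return total / (len(positives) * len(negatives))


def bpmll_loss(batch_outputs, batch_labels):
    """Global error E: sum of the per-instance errors E_i."""
    return sum(bpmll_instance_loss(o, d) for o, d in zip(batch_outputs, batch_labels))


labels = [1, 1, 0, 0, 0, 0]
well_ranked = bpmll_instance_loss([0.9, 0.8, 0.3, 0.1, 0.1, 0.1], labels)
badly_ranked = bpmll_instance_loss([0.1, 0.1, 0.3, 0.1, 0.9, 0.8], labels)
```

When every true label is scored above every false label, each pairwise term is below one, so `well_ranked` is below one; when false labels are scored higher, the exponential terms grow and `badly_ranked` exceeds one.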

#### 2.6. Multi-Label: Threshold

As discussed in the previous section, when we use the model to determine and assign the labels, there should be a threshold t(x), which is applied to the output of the deep learning model, so that we predict the labels as L<sub>i</sub><sup>pred</sup> = {j | l<sub>i</sub><sup>j</sup> > t(x<sub>i</sub>), j ∈ [1, 6]}. A straightforward and natural choice of the threshold function is to set t(x) as a constant. However, a constant threshold does not consider the difference between different data points. To solve the problem, Elisseeff and Weston (2002) proposed to incorporate the information of each single data point into the threshold, replacing the constant with a linear function t(x<sub>i</sub>) = **w**<sup>⊺</sup> · l(x<sub>i</sub>) + b, where l(x<sub>i</sub>) is the output of the network on the instance x<sub>i</sub>. In this way, each data point can have its own threshold, which is more flexible than a constant. To obtain the threshold function, we need to solve the following problem:

$$t(\mathbf{x}\_i) = \operatorname\*{argmin}\_t(\left| \{ k | k \in L\_i^{true}, l\_i^k \leq t \} \right| + \left| \{ q | q \in \overline{L\_i^{true}}, l\_i^q \geq t \} \right|). \tag{4}$$

If the solution of Equation 4 is not unique and the solutions compose a segment, the middle value of the segment is chosen as the threshold. For example, assume the true label vector and the predicted output vector of x<sub>i</sub> are {1, 1, 0, 0, 0, 0} and {0.9, 0.8, 0.3, 0.1, 0.1, 0.1}. When 0.3 < t < 0.8, |{k | k ∈ L<sub>i</sub><sup>true</sup>, l<sub>i</sub><sup>k</sup> ≤ t}| + |{q | q ∈ $\overline{L\_i^{true}}$, l<sub>i</sub><sup>q</sup> ≥ t}| always takes the minimum value, 0. Consequently, we choose the middle value of (0.3, 0.8), which is 0.55, as the threshold. In BP-MLL, the parameters **w** and b of the linear threshold function are then obtained by fitting these target thresholds with the linear least squares method.

To sum up, after we have a well-trained model and the threshold function parameters, and when we need to use the model to perform prediction, firstly, we feed the test instance to the trained network and get the outputs l(**x**). Secondly, we calculate the threshold using t(**x**) = **w** <sup>⊺</sup> · l(**x**) + b and apply the threshold to the output of the model, obtaining the predicted labels for the enzyme instance x.
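The two steps of this procedure can be sketched as follows: computing the target threshold of Equation 4 for a training instance (with the midpoint rule), and applying a linear threshold function at prediction time. This is a minimal sketch under stated assumptions: outputs are assumed to lie in the interval [`lo`, `hi`], ties between non-adjacent minimizing intervals are resolved by taking the first one, and the least squares fit of **w** and b is left out. All function names are hypothetical.

```python
def threshold_target(outputs, labels, lo=0.0, hi=1.0):
    """Midpoint of the interval that minimizes the objective of Equation 4."""
    def cost(t):
        missed = sum(1 for l, d in zip(outputs, labels) if d == 1 and l <= t)
        spurious = sum(1 for l, d in zip(outputs, labels) if d == 0 and l >= t)
        return missed + spurious

    # The objective is constant between consecutive distinct outputs,
    # so it is enough to test one point (the midpoint) per interval.
    edges = [lo] + sorted(set(outputs)) + [hi]
    best_mid, best_cost = None, None
    for left, right in zip(edges, edges[1:]):
        if left >= right:
            continue  # skip empty intervals, e.g. an output equal to lo or hi
        mid = (left + right) / 2
        current = cost(mid)
        if best_cost is None or current < best_cost:
            best_mid, best_cost = mid, current
    return best_mid


def predict_labels(outputs, w, b):
    """Apply t(x) = w . l(x) + b and return the 1-based classes scoring above it."""
    t = sum(wi * li for wi, li in zip(w, outputs)) + b
    return [j + 1 for j, l in enumerate(outputs) if l > t]


outputs = [0.9, 0.8, 0.3, 0.1, 0.1, 0.1]
target = threshold_target(outputs, [1, 1, 0, 0, 0, 0])
# A constant threshold is the special case w = 0; used here only for illustration.
predicted = predict_labels(outputs, w=[0.0] * 6, b=target)
```

For the worked example in the text, `target` is the midpoint of (0.3, 0.8), and thresholding the outputs at that value recovers main classes 1 and 2.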

#### 2.7. DEEPre and mlDEEPre

Although DEEPre is designed for mono-functional enzyme function prediction, it is very flexible, being able to predict the detailed function of an enzyme starting from the first level or the second level. For example, if we already know that an enzyme has the following incomplete EC number: 1.-.-.-, we can run DEEPre from the second level to fill in the missing digits. Taking into consideration the enzyme's feature representation and the fact that the query sequence is an Oxidoreductase, we run the model trained specifically for enzymes whose first EC digit is 1. With such flexibility, we can easily combine mlDEEPre and DEEPre to predict the detailed functionality of multi-functional enzymes. Using mlDEEPre, we can predict the main classes of those multi-functional enzymes, such as 2.-.-.- and 3.-.-.-. Feeding the sequence and the main class annotations to DEEPre, we are able to fill in the missing digits for each incomplete annotation of a multi-functional enzyme. The idea of combining DEEPre and mlDEEPre is illustrated in **Figure 1**. Starting from a protein sequence, we first use level 0 of DEEPre to predict whether the protein is an enzyme or not. If yes, we use the first level of mlDEEPre to predict whether the enzyme is a mono-functional enzyme or a multi-functional enzyme. If it is a mono-functional enzyme, we further run DEEPre to get the full annotation of that enzyme. If not, we run the second level of mlDEEPre to predict the main classes of the enzyme. For each function, we run DEEPre to obtain the full annotation. Considering that most multi-functional enzymes have multiple EC number annotations for their different functions that diverge in the first digit, our method is efficient and reliable under most circumstances.
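The control flow of the combined pipeline can be summarized in a short sketch. The four callables stand in for the trained models and are hypothetical placeholders; their names, signatures, and the toy return values are assumptions for illustration and do not correspond to an actual DEEPre or mlDEEPre interface.

```python
def annotate(sequence, is_enzyme, is_multi_functional, predict_main_classes, run_deepre):
    """Return a list of EC annotations for a sequence (empty for non-enzymes)."""
    if not is_enzyme(sequence):                    # DEEPre level 0
        return []
    if not is_multi_functional(sequence):          # mlDEEPre level 1
        return [run_deepre(sequence, None)]        # DEEPre handles the mono-functional case
    main_classes = predict_main_classes(sequence)  # mlDEEPre level 2
    # DEEPre starts from the second level once for each predicted main class.
    return [run_deepre(sequence, main_class) for main_class in main_classes]


# Toy stand-ins for the trained models (purely illustrative, not real predictors).
toy_models = dict(
    is_enzyme=lambda seq: seq != "NON_ENZYME",
    is_multi_functional=lambda seq: seq == "MULTI",
    predict_main_classes=lambda seq: [2, 3],
    run_deepre=lambda seq, main_class: f"{main_class if main_class else 1}.x.x.x",
)

multi_annotation = annotate("MULTI", **toy_models)
```

Passing the models in as arguments keeps the routing logic separate from the predictors, which mirrors how mlDEEPre is inserted between level 0 and the remaining levels of DEEPre.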

# 3. RESULTS

In this section, we first briefly introduce the methods with which we compare mlDEEPre (section 3.1). We then define the evaluation criteria for the comparison in detail in section 3.2. After that, we show the performance of our method in predicting whether an enzyme is a mono-functional enzyme or a multi-functional enzyme in section 3.3. Furthermore, section 3.4 gives the main class prediction results for multi-functional enzymes. Finally, we show our method's performance on fatty acid synthase (FAS) function prediction in section 3.5.

#### 3.1. Compared Methods

For mono-function prediction, we compared our method with Pse-ACC (Chou and Ho, 2006), ACC (Che et al., 2016), EnzML (De Ferrari et al., 2012), and SVM. Pse-ACC (Chou and Ho, 2006) is a widely used tool, which predicts the enzymatic attribute of proteins by considering the functional domain composition of a given enzyme sequence. In ACC (Che et al., 2016), the authors utilized autocross-covariance (ACC) feature representation which consists of two feature models, autocovariance (AC) and cross-covariance (CC) (Dong et al., 2009). Another compared method here is EnzML, which can efficiently utilize the InterPro signatures. All the above three methods used K-nearest neighbors (KNN) based classification algorithm as the base classifier. We also compared our method with a baseline method, which used SVM as the algorithm and ACC as the features.

For multi-function prediction, as discussed before, until now, there are only a few works focused on this problem, and all of them are based on KNN. The key idea of KNN is that similar instances should share the same labels and we can assign the labels to a query sequence with the most frequent ones from its K-most similar instances, the idea of which is shown in **Figure S1**. We compared our method with ML-KNN (Zhang and Zhou, 2007), BR-KNN (Spyromitros et al., 2008), IBLR-ML (Cheng and Hüllermeier, 2009), GM (Zou and Xiao, 2016), and SVM-NN (Amidi et al., 2017). In ML-KNN (Zhang and Zhou, 2007), for each unseen instance, its K-nearest neighbors in the training set are firstly identified. And then, maximum a posteriori estimation (MAP) principle is applied to determine the label set of the unseen instance based on the statistical information of its neighbor samples. As for Binary Relevance KNN (BR-KNN) (Spyromitros et al., 2008), it learns M binary classifiers, one for each class. In terms of instance-based learning and logistic regression (IBLR) (Cheng and Hüllermeier, 2009), it combines instance-based learning and logistic regression with ML-KNN. For the above three methods, the ACC (Dong et al., 2009) is used as the feature representation. Furthermore, Zou and Xiao (2016) utilized ML-KNN and a different feature extraction model, Grey Model (GM) (Lin et al., 2011), to perform the task. The last method is SVM-NN (Amidi et al., 2017). In this method, the authors combined structural and amino acid sequence information together, investigating two fusion approaches both in the feature level and the algorithm level (SVM and KNN), resulting in a method for general enzymatic function prediction.

#### 3.2. Evaluation Criteria

#### 3.2.1. Single-Label Measurement

Given multi-label and mono-label test datasets S = {(x<sub>i</sub>, L<sub>i</sub><sup>true</sup>)}, the binary classifier performance is evaluated by four criteria: accuracy, precision, recall, and F1-score, which are defined below:

$$Accuracy = B(TP\_j, FP\_j, TN\_j, FN\_j) \ = \frac{TP\_j + TN\_j}{TP\_j + FP\_j + TN\_j + FN\_j},\tag{5}$$

$$Precision = \text{B(TP}\_{\text{j}}, \text{FP}\_{\text{j}}, \text{TN}\_{\text{j}}, \text{FN}\_{\text{j}}) \ = \frac{\text{TP}\_{\text{j}}}{\text{TP}\_{\text{j}} + \text{FP}\_{\text{j}}},\tag{6}$$

$$\text{Recall} = \text{B(TP}\_{\text{j}}, \text{FP}\_{\text{j}}, \text{TN}\_{\text{j}}, \text{FN}\_{\text{j}}) \ = \frac{\text{TP}\_{\text{j}}}{\text{TP}\_{\text{j}} + \text{FN}\_{\text{j}}},\tag{7}$$

$$F1 - score = \frac{2 \ast Precision \ast Recall}{Precision + Recall},\tag{8}$$

in which B(TP<sub>j</sub>, FP<sub>j</sub>, TN<sub>j</sub>, FN<sub>j</sub>) represents the binary classification indicator; TP<sub>j</sub> is the number of true positive instances; TN<sub>j</sub> is the number of true negative instances; FP<sub>j</sub> is the number of false positive instances; and FN<sub>j</sub> is the number of false negative instances.
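Equations 5 to 8 translate directly into code. The sketch below is illustrative; the function name `binary_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the equations themselves do not specify.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from the four confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Zero denominators map to 0.0 (assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=8, fp=2, tn=85, fn=5)
```

With these toy counts the accuracy is 93/100 and the precision is 8/10, while the recall is only 8/13, which illustrates why accuracy alone can be misleading when the positive class is rare.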

#### 3.2.2. Multi-Label Measurement

Regarding the multi-label classification evaluation, the measurement criteria cannot be exactly the same as those in single-label classification. The assessment method is much more complicated in multi-label learning. The previous works (Chou and Ho, 2006; Zhang and Zhou, 2007) have defined various metrics, including example-based and label-based metrics. For example-based methods, the classification results for each instance are calculated first. After that, the average value for the entire dataset is obtained. For label-based metrics, the binary classification results for each class are calculated first, and then the average value for all classes is given. Here, we adopt both example-based and label-based methods. The metrics that have been utilized to assess the performance of mlDEEPre are described below:

#### **3.2.2.1. Hamming-loss**

Hamming-loss evaluates the frequency of incorrect prediction of an instance-label pair. This index is averaged over all classes and the entire dataset. The smaller hamming loss is, the better the performance of the classifier is. It is defined as follows:

$$\text{Hamming-loss} = \frac{1}{N} \sum\_{i=1}^{N} \frac{1}{6} |L\_i^{pred} \Delta L\_i^{true}|\_1,\tag{9}$$

where Δ is the symmetric difference between two sets, |•|<sub>1</sub> represents the l<sub>1</sub>-norm, and N is the number of example enzymes.

#### **3.2.2.2. Subset accuracy**

Subset accuracy is the strictest evaluation in multi-label classification. For each sample, the entire set of labels must be correctly predicted; otherwise the subset accuracy for that instance is equal to 0. In the literature, its complement (one minus the subset accuracy) is known as the zero-one loss:

$$\text{Subset accuracy} = \frac{1}{N} \sum\_{i=1}^{N} \delta(L\_i^{pred}, L\_i^{true}), \tag{10}$$

where δ is the Kronecker delta:

$$\begin{cases} \delta(L\_i^{pred}, L\_i^{true}) = 1, & \text{if } and \text{ only if all the labels in } L\_i^{pred} \\ & \qquad \text{are equal to those in } L\_i^{true}, \\ \delta(L\_i^{pred}, L\_i^{true}) = 0, & \text{otherwise.} \end{cases} \tag{11}$$

Opposite to hamming-loss, and just as the following Macro and Micro methods, the higher the subset accuracy is, the better the performance is.
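Both example-based metrics of Equations 9 to 11 can be computed with set operations when each label set is stored as a set of class indices. The following is a minimal sketch with hypothetical function names and toy label sets.

```python
def hamming_loss(pred_sets, true_sets, n_classes=6):
    """Equation 9: mean size of the symmetric difference, divided by the number of classes."""
    mismatches = sum(len(p ^ t) for p, t in zip(pred_sets, true_sets))
    return mismatches / (n_classes * len(true_sets))


def subset_accuracy(pred_sets, true_sets):
    """Equations 10 and 11: fraction of instances whose label set is predicted exactly."""
    exact = sum(1 for p, t in zip(pred_sets, true_sets) if p == t)
    return exact / len(true_sets)


# Toy example with three enzymes and main classes numbered 1 to 6.
true_sets = [{1, 2}, {3}, {2, 6}]
pred_sets = [{1, 2}, {3, 4}, {2}]
h_loss = hamming_loss(pred_sets, true_sets)
s_acc = subset_accuracy(pred_sets, true_sets)
```

In this toy example only the first enzyme is predicted exactly, so the subset accuracy is 1/3, whereas just 2 of the 18 instance-label pairs are wrong, giving a Hamming-loss of 1/9. The contrast shows how much stricter subset accuracy is.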

#### **3.2.2.3. Macro-precision, Macro-recall, Macro-F1-score**

Macro-precision, Macro-recall, and Macro-F1-score, which have been used in multi-label classification, calculate precision, recall, and F1-score separately for each class. The Macro-average method is straightforward: just take the average of the precision and recall of the system on different classes. When we want to evaluate system performance on different datasets, macro-averaged metrics are the best choice:

$$\text{Macro-precision} = \frac{1}{6} \sum\_{j=1}^{6} \frac{TP\_j}{TP\_j + FP\_j},\tag{12}$$

$$\text{Macro-recall} = \frac{1}{6} \sum\_{j=1}^{6} \frac{TP\_j}{TP\_j + FN\_j},\tag{13}$$

$$\text{Macro-F1-score} = \frac{1}{6} \sum\_{j=1}^{6} \frac{2 \times \text{Macro-precision}\_j \times \text{Macro-recall}\_j}{\text{Macro-precision}\_j + \text{Macro-recall}\_j},\tag{14}$$

where Macro-precision<sub>j</sub> = TP<sub>j</sub>/(TP<sub>j</sub> + FP<sub>j</sub>) and Macro-recall<sub>j</sub> = TP<sub>j</sub>/(TP<sub>j</sub> + FN<sub>j</sub>) are the per-class terms averaged in Equations 12 and 13.

#### **3.2.2.4. Micro-precision, Micro-recall, Micro-F1-score**

Regarding Micro-precision, Micro-recall, and Micro-F1-score, in Micro-average methods, we sum up the individual TP, FP, TN, and FN of the system over the different classes and then apply them to get the statistics. The Micro-metrics pay more attention to whether the enzymes are correctly classified, regardless of their original distribution. Thus, when the dataset size varies, Micro-averaged indexes are the better choice:

$$Micro-precision = \frac{\sum\_{j=1}^{6} TP\_j}{\sum\_{j=1}^{6} TP\_j + \sum\_{j=1}^{6} FP\_j},\tag{15}$$

$$Micro-recall = \frac{\sum\_{j=1}^{6} TP\_j}{\sum\_{j=1}^{6} TP\_j + \sum\_{j=1}^{6} FN\_j},\tag{16}$$

$$Micro-F1-score = 2 \cdot \frac{Micro-precision \times Micro-recall}{Micro-precision + Micro-recall}.\tag{17}$$
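The difference between the two averaging schemes is easiest to see side by side. The sketch below takes per-class (TP, FP, FN) counts and returns the macro-averaged and micro-averaged precision, recall, and F1-score. The function names are hypothetical, the macro F1-score is taken as the mean of the per-class F1-scores, and mapping zero denominators to 0.0 is an assumed convention.

```python
def _ratio(numerator, denominator):
    # Zero denominators map to 0.0 (assumed convention).
    return numerator / denominator if denominator else 0.0


def macro_micro(counts):
    """counts: list of per-class (TP, FP, FN). Returns ((P, R, F1) macro, (P, R, F1) micro)."""
    n = len(counts)
    precisions = [_ratio(tp, tp + fp) for tp, fp, fn in counts]
    recalls = [_ratio(tp, tp + fn) for tp, fp, fn in counts]
    f1s = [_ratio(2 * p * r, p + r) for p, r in zip(precisions, recalls)]
    macro = (sum(precisions) / n, sum(recalls) / n, sum(f1s) / n)

    # Micro-averaging pools the counts over all classes before forming the ratios.
    tp_sum = sum(tp for tp, fp, fn in counts)
    fp_sum = sum(fp for tp, fp, fn in counts)
    fn_sum = sum(fn for tp, fp, fn in counts)
    micro_p = _ratio(tp_sum, tp_sum + fp_sum)
    micro_r = _ratio(tp_sum, tp_sum + fn_sum)
    micro = (micro_p, micro_r, _ratio(2 * micro_p * micro_r, micro_p + micro_r))
    return macro, micro


# Toy counts for two classes: one frequent and well predicted, one rare and poorly recalled.
macro, micro = macro_micro([(8, 2, 0), (1, 0, 3)])
```

In the toy counts, the rare class pulls the macro recall down to 0.625, while the micro recall stays at 0.75 because the frequent class dominates the pooled counts.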

#### 3.3. Mono-Functional vs. Multi-Functional Prediction

classification testing performance of different models. Performance lower than 0.6 are not shown in the figure.

In this section, we describe the performance of the proposed method in predicting whether an enzyme is a mono-functional or a multi-functional enzyme. The training and testing datasets used in this work are shown in **Tables 1**, **3**. It is worth pointing out that the data are imbalanced, with 22,168 mono-functional enzymes and 1,085 multi-functional enzymes. In this work, we employed penalized models to overcome the imbalance, forcing the model to pay more attention to the multi-functional class. We ran the model 30 times on a GPU node with 32 CPU cores and one GTX 1080 Ti card, each time with 70% of all the data as training data and 30% as testing data. The average training time is 11 h for 40 epochs and the average testing time for one batch is < 1 min. We show the comparison results, both the average and the standard deviation, in **Figure 3**. As suggested by **Figure 3**, our method can outperform all the other methods consistently across different criteria. Besides, our method is very stable, with the standard deviation of accuracy being as low as 0.09.
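The evaluation protocol above can be sketched in a few lines. The text does not specify how the penalized model weights the two classes, so the inverse-frequency weighting shown here is an assumption, as are the function names; only the class counts and the 70/30 split ratio come from the text.

```python
import random


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more (assumed scheme)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}


def random_split(items, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed and split into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


weights = inverse_frequency_weights({"mono": 22168, "multi": 1085})
# One of the repeated runs; a different seed per run gives a different random split.
train_part, held_out = random_split(range(10), seed=1)
```

Under this assumed scheme, an error on a multi-functional enzyme is penalized about twenty times more heavily than an error on a mono-functional one, which is the ratio of the two class sizes.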

#### 3.4. Multi-Functional Enzyme Main Classes Prediction

Similar to the experiments in section 3.3, we also ran the model 30 times on a GPU node with 32 CPU cores and one GTX 1080 Ti card, each time with 70% of all the data as training data and 30% as testing data. The average training time is about 14 h for 40 epochs and the average testing time for one batch is around one minute. Using the criteria for multi-label learning that have been discussed in section 3.2, we evaluated mlDEEPre and compared it with other models introduced in section 3.1, obtaining the performance results shown in **Table 4** and **Figure 4**. According to Hamming-loss, the proposed multi-label model, mlDEEPre, predicts 97.6% of all the actual main classes in the test dataset correctly, with the corresponding standard deviation being 2.7%, which outperforms all the other methods.

TABLE 4 | The multi-functional classification performance of mlDEEPre on dataset II shown in Table 3.


Furthermore, we compared our method with the other methods using other criteria. In subset accuracy, SVM-NN (84.7%) performs slightly better than mlDEEPre (82.6%) and GA (80.8%), because SVM-NN is good at predicting the rare class labels that result from imbalanced training samples (only 52 sequences belong to class 6). Nevertheless, mlDEEPre performs better than all the other methods in terms of all the other criteria, including Macro-precision, Macro-recall, Macro-F1, Micro-precision, Micro-recall, and Micro-F1.
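To make the criteria above concrete, the following sketch computes Hamming loss, subset accuracy, and micro- and macro-averaged F1 for binary label matrices with one column per EC main class. The toy labels are invented for illustration, and scoring a label with no true or predicted positives as F1 = 0 is one common convention, assumed here.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label entries that are predicted wrongly."""
    n_labels = len(y_true[0])
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / (len(y_true) * n_labels)


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label set is predicted exactly."""
    exact = sum(list(yt) == list(yp) for yt, yp in zip(y_true, y_pred))
    return exact / len(y_true)


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty labels


def micro_macro_f1(y_true, y_pred):
    """Micro-F1 pools counts over labels; Macro-F1 averages per-label F1."""
    n_labels = len(y_true[0])
    stats = []
    for j in range(n_labels):
        tp = sum(yt[j] == 1 and yp[j] == 1 for yt, yp in zip(y_true, y_pred))
        fp = sum(yt[j] == 0 and yp[j] == 1 for yt, yp in zip(y_true, y_pred))
        fn = sum(yt[j] == 1 and yp[j] == 0 for yt, yp in zip(y_true, y_pred))
        stats.append((tp, fp, fn))
    micro = _f1(sum(s[0] for s in stats), sum(s[1] for s in stats), sum(s[2] for s in stats))
    macro = sum(_f1(*s) for s in stats) / n_labels
    return micro, macro


# Toy example: 3 enzymes, 6 main classes (columns = EC 1..6).
y_true = [[1, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 1]]
y_pred = [[1, 1, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]]
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred)
```

The example shows why subset accuracy is the strictest criterion: one missed label out of 18 costs a whole sample, while Hamming loss stays low.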

# 3.5. Case Study: FAS

Fatty acid synthase (FAS) is a homodimeric multi-functional enzyme that performs the anabolic conversion of dietary carbohydrates or proteins to fatty acids (Chakravarty et al., 2004). Many human cancers can cause high-level expression of FAS. Meanwhile, the regulation of human FAS in a variety of cancers makes FAS a candidate target for anticancer therapy (Camassei et al., 2003). FAS subunit alpha includes two parts, reductase and synthase, whose EC main classes are 1 and 2, respectively. To assess our model's ability to predict the whole EC number sets of certain multi-functional enzyme sequences, we excluded the FAS sequences from the training data and fed those sequences to our model during testing. The outputs of our network show that mlDEEPre predicts FAS's main classes exactly, namely Oxidoreductase and Transferase, which is consistent with the experimental results. Furthermore, the integration of mlDEEPre and DEEPre annotates the two sets of FAS's EC numbers correctly.

# 4. DISCUSSION

In this paper, based on multi-label deep learning, we propose a novel method, mlDEEPre, to annotate the functionality of multi-functional enzymes. It works seamlessly with DEEPre, which enables DEEPre to perform mono-functional and multi-functional enzyme function predictions at the same time automatically. Despite the state-of-the-art performance of mlDEEPre, this tool can still be improved in the following ways. Firstly, when designing the tool, we assume that the multiple functions of an enzyme diverge in the main class. Although that assumption holds under most circumstances, it is inevitable that there are some enzymes with different sub-class or even sub-subclass functions. Those kinds of enzymes need to be investigated in the future. Secondly, we also assume that the EC system remains static. Although the EC system is stable most of the time, it is a dynamic system if we examine it over a long time period, and the number of classes can increase as we discover more enzymes, which may invalidate the previous classifier. In machine learning, this problem is called class incremental learning (Li et al., 2018b). In the future, efforts will be made to enable the system to perform enzyme function prediction in a dynamic labeling space. Finally, since much of the performance gain of mlDEEPre is attributed to the superior performance of deep learning in handling classification problems, some recent works investigating the nature of deep learning (Soudry et al., 2017; Li et al., 2018c) can be helpful for further improving the performance of deep learning and thus the performance of mlDEEPre. We believe that the idea of mlDEEPre, combining multi-label learning with deep learning, can be helpful for solving other similar bioinformatics problems. For example, it has the potential to be applied to predict the properties of antibiotic-resistant genes (ARG) in multidrug-resistant pathogens (Zhu et al., 2013; Cao et al., 2018), perform classification of multicomponent transporter systems (Saier et al., 2015), and predict CRISPR-Cas9 gene editing off-target regions (Fu et al., 2013; Pattanayak et al., 2013; Lin and Wong, 2018; Zhang et al., 2018).

#### DATA AVAILABILITY STATEMENT

The datasets for this study can be found at http://www.cbrc.kaust.edu.sa/DEEPre/dataset.html and http://server.malab.cn/MEC/download.jsp.

#### AUTHOR CONTRIBUTIONS

YL and XG initialized and designed the project. ZZ and ST implemented the idea and ran the experiments. YL and ZZ wrote the manuscript. XG and ST helped revise the manuscript. All authors provided critical feedback and helped shape the research, analysis and manuscript.

#### FUNDING

This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Awards No. FCC/1/1976-17-01, FCC/1/1976-18-01, FCC/1/1976-23-01, FCC/1/1976-25-01, FCC/1/1976-26-01, URF/1/3007-01, and URF/1/3450-01.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00714/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zou, Tian, Gao and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Long Non-coding RNA LINC00941 as a Potential Biomarker Promotes the Proliferation and Metastasis of Gastric Cancer

Haiming Liu<sup>1,2†</sup>, Nan Wu<sup>2,3†</sup>, Zhe Zhang<sup>2†</sup>, XiaoDan Zhong<sup>1,4</sup>, Hao Zhang<sup>1</sup>, Hao Guo<sup>2</sup>, Yongzhan Nie<sup>2</sup>\* and Yuanning Liu<sup>1</sup>\*

#### Edited by:

Quan Zou, University of Electronic Science and Technology of China, China

#### Reviewed by:

Peng Zhang, University of Maryland, Baltimore, United States Ying Wang, Xiamen University, China Sushant Patil, The University of Chicago, United States

#### \*Correspondence:

Yongzhan Nie, yongznie@fmmu.edu.cn; Yuanning Liu, liuyn@jlu.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 02 November 2018 Accepted: 07 January 2019 Published: 22 January 2019

#### Citation:

Liu H, Wu N, Zhang Z, Zhong X, Zhang H, Guo H, Nie Y and Liu Y (2019) Long Non-coding RNA LINC00941 as a Potential Biomarker Promotes the Proliferation and Metastasis of Gastric Cancer. Front. Genet. 10:5. doi: 10.3389/fgene.2019.00005

<sup>1</sup> College of Computer Science and Technology, Jilin University, Changchun, China, <sup>2</sup> State Key Laboratory of Cancer Biology, National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi'an, China, <sup>3</sup> College of Life Sciences, Northwest University, Xi'an, China, <sup>4</sup> Department of Pediatric Oncology, The First Hospital of Jilin University, Changchun, China

Gastric cancer (GC) is a considerable global health burden. Accumulating evidence suggests that long non-coding RNAs (lncRNAs) are aberrantly expressed in many cancers and play important roles in GC. However, only a few lncRNAs have been functionally characterized. In this study, using data from The Cancer Genome Atlas (TCGA), we identified long intergenic non-protein coding RNA 941 (LINC00941) as a potential biomarker for diagnosis and prognosis, and we found that the expression of LINC00941 is associated with tumor depth and distant metastasis in GC. Furthermore, functional enrichment analysis of the LINC00941 co-expression network demonstrated that LINC00941 might be an essential regulator of tumor metastasis and cancer cell proliferation. To validate our findings, we used loss-of-function analysis to reveal the biological function of LINC00941 in GC cells. This analysis revealed that silencing of LINC00941 inhibits GC cell proliferation, migration, and invasion in vitro and modulates tumor growth in vivo. Our findings confirmed that LINC00941 plays an important oncogenic function in GC and may serve as a potential biomarker for diagnosis and prognosis of GC.

#### Keywords: gastric cancer, non-coding RNA, LINC00941, biomarker, TCGA

# INTRODUCTION

Gastric cancer (GC) is one of the most frequently diagnosed cancers and a leading cause of cancer death in the world (Bray et al., 2018). In China, GC is a considerable health burden (Chen W. et al., 2018). The outlook for patients with metastatic GC is unfavorable, with median survival usually not exceeding 1 year (Van Cutsem et al., 2016). It is urgent to explore novel biomarkers and therapeutic targets for GC.

Long non-coding RNAs (lncRNAs) are an important subclass of non-coding RNAs (ncRNAs); they are usually longer than 200 nucleotides and lack protein-coding capability (Engreitz et al., 2016). Mounting evidence confirms that lncRNAs are aberrantly expressed in many cancers and play key roles in promoting tumor initiation and progression (Lin and Yang, 2018). Several lncRNAs such as UCA1, MALAT1, HOXA11-AS, and ZEB1-AS1 have been proposed as individual diagnostic or prognostic biomarkers in GC (Sun et al., 2016; Li et al., 2017a,b; Wang et al., 2017). However, only a few lncRNAs have been functionally characterized and the mechanism by which lncRNAs regulate GC remains to be fully elucidated.

Long intergenic non-protein coding RNA 941 (LINC00941), also known as MSC-upregulated factor (lncRNA-MUF), is an lncRNA located in the 12p11.21 region of the human genome. In hepatocellular carcinoma, highly expressed LINC00941 significantly promoted epithelial to mesenchymal transition (EMT) and malignant capacity (Yan et al., 2017). In lung adenocarcinoma, LINC00941 displayed prognostic value and regulated the PI3K-AKT signaling pathway (Wang et al., 2018). LINC00941 is also highly expressed in GC tissues and may participate in the development of GC (Luo et al., 2018). However, the precise biological function of LINC00941 in GC has not been characterized.

In this study, we confirmed that LINC00941 plays an important oncogenic function in GC by systematically integrating bioinformatics methods with in vitro and in vivo studies. Based on the analysis of RNA-seq data from TCGA, we identified LINC00941 as a potential diagnostic and prognostic biomarker in GC and found that its expression is associated with tumor depth and distant metastasis. To characterize the function of LINC00941 in GC, we applied weighted gene co-expression network analysis (WGCNA) (Langfelder and Horvath, 2008) to construct a LINC00941 co-expression network. Furthermore, through functional enrichment analysis, we found that LINC00941 might be an essential regulator of tumor metastasis and cancer cell proliferation. To validate our findings, we used loss-of-function analysis to reveal the biological function of LINC00941 in GC cells. This analysis confirmed that silencing of LINC00941 inhibits GC cell proliferation, migration, and invasion in vitro and modulates tumor growth in vivo. Our results demonstrated that LINC00941 plays an important oncogenic function in GC and acts as a potential biomarker for diagnosis and prognosis of GC.

## MATERIALS AND METHODS

#### GC Data Collection and Processing

Gastric cancer data, including clinical information and gene expression data, were obtained from TCGA<sup>1</sup> (Cancer Genome Atlas Research Network, 2014). In this study, all samples with follow-up time exceeding 2000 days were excluded. GC data were processed as described in our previous work (Liu H. et al., 2018).

# Identifying Cancer-Related, Metastasis-Related, and Survival-Related lncRNAs

To identify cancer-related lncRNAs, we used the Mann-Whitney U-test to compare the expression values between cancer samples and normal samples. To identify metastasis-related lncRNAs, we used the same statistical method to compare the expression values between metastatic (M1) samples and non-metastatic (M0) samples. The cancer-related and metastasis-related lncRNAs with p-values < 0.05 and fold change ≥ 1 were considered significant. To identify survival-related lncRNAs, we divided samples into two groups (high and low) based on the median expression level of each lncRNA. We then used a univariate Cox proportional-hazards regression model and the log-rank test to identify survival-related lncRNAs. LncRNAs with p < 0.05 were considered survival related. Kaplan-Meier plots were used to represent the results.

<sup>1</sup>https://portal.gdc.cancer.gov/
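The screening steps above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the U-test uses the normal approximation with tie correction and no continuity correction, the fold change is computed on a log2 scale with a pseudocount (the text does not say on which scale the threshold of 1 applies), and all expression values are invented.

```python
import math
from statistics import mean, median


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U-test (normal approximation, tie-corrected)."""
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_sizes = {}
    for v in pooled:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:  # all values identical: no evidence of a shift
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))


def log2_fold_change(case, control, pseudocount=1.0):
    """Assumed definition: log2 ratio of group means with a pseudocount."""
    return math.log2((mean(case) + pseudocount) / (mean(control) + pseudocount))


def median_split(expression):
    """Label each sample 'high' or 'low' relative to the median expression."""
    cutoff = median(expression)
    return ["high" if value > cutoff else "low" for value in expression]


tumor = [5.0, 6.0, 7.0, 8.0, 9.0]   # hypothetical lncRNA expression values
normal = [1.0, 2.0, 3.0, 4.0]
u_stat, p_value = mann_whitney_u(tumor, normal)
is_candidate = p_value < 0.05 and abs(log2_fold_change(tumor, normal)) >= 1
```

The `high`/`low` labels from `median_split` would then feed the Cox regression and log-rank test, which are not reproduced here; for small samples an exact U-test from a statistics library is preferable to the normal approximation.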

## LncRNA Co-expression Network Construction and Functional Enrichment Analysis

The lncRNA co-expression network was constructed by combining lncRNA and gene expression data using the R package WGCNA (Langfelder and Horvath, 2008). First, a correlation matrix was constructed based on the Spearman's rank correlation coefficient between all gene pairs. Then, the correlation matrix was converted into an adjacency matrix (AM), and the AM was further converted into a topological overlap matrix (TOM). Based on the TOM, the average-linkage hierarchical clustering method was used to cluster the genes, and the dynamic tree cut algorithm was used to identify lncRNA co-expressed modules, with the minimum module size set to 30. To determine the functions of an lncRNA co-expressed module, the genes in the module were subjected to functional enrichment analysis by gene ontology (GO) (Antonazzo et al., 2017) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2017) analyses with DAVID 6.8<sup>2</sup> (Huang da et al., 2009). In this study, functional enrichment results with FDR < 0.05 were considered statistically significant.
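The chain from correlation to adjacency to topological overlap can be illustrated on a toy expression table. The sketch below assumes an unsigned network, with adjacency a<sub>ij</sub> = |ρ<sub>ij</sub>|<sup>β</sup> and the standard unsigned topological overlap measure; the network type used in the study is not stated, and the expression values are invented. The analysis itself was run with the WGCNA R package, which this pure-Python code does not call.

```python
import math


def average_ranks(values):
    """1-based ranks with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def adjacency_matrix(expr, beta):
    """Unsigned adjacency |rho|^beta; rows of expr are genes (assumption)."""
    n = len(expr)
    return [[1.0 if i == j else abs(spearman(expr[i], expr[j])) ** beta
             for j in range(n)] for i in range(n)]


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    degree = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
                tom[i][j] = (shared + adj[i][j]) / (min(degree[i], degree[j]) + 1.0 - adj[i][j])
    return tom


# Toy data: 3 genes x 5 samples; genes 0 and 1 are perfectly rank-correlated.
expr = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [5, 3, 4, 1, 2]]
adj = adjacency_matrix(expr, beta=4)
tom = topological_overlap(adj)
dissimilarity = [[1.0 - value for value in row] for row in tom]
```

The matrix `dissimilarity` (1 − TOM) is the kind of input that average-linkage hierarchical clustering and the dynamic tree cut would consume to define modules.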

# Cell Culture, RNA Extraction, and Real-Time PCR

The human GC cells (MKN45 and AGS) were purchased from American Type Culture Collection (ATCC). Cell culture, RNA extraction, and real-time PCR were performed as described in our previous work (Liu H. et al., 2018). Primer sets specific for LINC00941 were designed by RiboBio (RiboBio, China). The primer sequences of LINC00941 were as follows: forward, 5′-ACCACTACACTCAGCCAAATAC-3′; reverse, 5′-GGCTATCAACTGTCTCCTTTAGAC-3′.

# Cell Transfection

The short interfering RNAs (siRNAs) targeting LINC00941 were synthesized by RiboBio (RiboBio, China). The siRNA sequences for LINC00941 were as follows: LINC00941-si1: 5′-ACGCGTTGCATAACCTGA-3′; LINC00941-si2: 5′-GAGACAGTTGATAGCCAAA-3′. Oligonucleotide transfection was performed using Lipofectamine 2000 reagent (Invitrogen, United States). Short hairpin RNAs (shRNAs) directed against LINC00941 were synthesized by GeneChem (Shanghai, China) and were inserted into the vector. The sequence of the effective shRNA was as follows: sh-LINC00941: GAGACAGTTGATAGCCAAA. We used an empty vector as a negative control (VECTOR).

<sup>2</sup>https://david.ncifcrf.gov/home.jsp

# CCK-8, Colony Formation, Cell Migration, and Invasion Assays

CCK-8 and colony formation assays were performed as described in our previous work (Liu H. et al., 2018). Cell migration and invasion were measured using Transwell chambers (Corning, United States) with 8-µm pore membranes. Cells were starved for 12 h in serum-free medium before the experiment. For the migration assays, 5 × 10<sup>5</sup> cells were seeded into the top chamber of the Transwell. For the invasion assays, 1 × 10<sup>5</sup> cells were seeded into the top chamber of a Transwell coated with Matrigel (Corning, United States). After 24 h, non-invasive cells were removed, and the migrated or invaded cells were fixed and stained with 5% crystal violet (Beyotime Biotechnology, China). Cells were photographed and counted in five random fields at 100× magnification.

# Protein Extraction and Western Blot Analysis

Cells were washed three times with PBS and collected in RIPA lysis buffer (Beyotime Biotechnology, China) supplemented with a protease inhibitor cocktail (Calbiochem, United States). Protein concentration was determined by staining with Coomassie Blue (Beyotime Biotechnology, China). After electrophoresis, the protein was transferred to a polyvinylidene fluoride membrane (Merck Millipore, Germany). After blocking for 1 h at room temperature with Tris-buffered saline containing 0.1% Tween 20 (TBS-T) and 5% skim milk, the membrane was incubated with the primary monoclonal antibody overnight at 4°C. The next day, the membrane was incubated with the corresponding secondary antibody for 1 h at room temperature and the signal was detected in a Bio-Rad ChemiDoc XRS imaging system. The ratio of the gray value of the target protein to the gray value of β-actin indicates the relative amount of protein. The primary antibodies used were as follows: anti-E-cadherin (1:1000; Cell Signaling Technology, United States), anti-Fibronectin (Cell Signaling Technology, United States), anti-Snail1 (Cell Signaling Technology, United States), and anti-GAPDH (Santa Cruz Biotech, United States).

### Tumorigenicity Assays in Nude Mice

Empty vector (negative control) and sh-LINC00941 stained cells SGC-7901 were injected subcutaneously into either side of the axillary region of male BALB/c nude mice (4–5 weeks old). Ten female BALB/c nude mice (5–6 weeks old) were randomly divided into two groups (sh-LINC00941 and VECTOR) and the mice were maintained under specific pathogen-free conditions. To establish the subcutaneous xenograft tumor model, 1 × 10<sup>7</sup> cells were injected into the right side of the back of nude mice. We measured and recorded nude mice body weight and tumor size weekly. After 4 weeks, all the nude mice were sacrificed under deep anesthesia. This study was carried out in strict accordance with the ethical standards of the Fourth Military Medical University. The tumor volumes were calculated using the following formula: tumor volume (mm<sup>3</sup>) = [length (mm) × width<sup>2</sup> (mm<sup>2</sup>) × π]/6.
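As a small worked example of the volume formula above, the helper below (a hypothetical name, not part of the study's code) evaluates it for caliper measurements in millimeters.

```python
import math


def tumor_volume_mm3(length_mm, width_mm):
    """Tumor volume in mm^3: length x width^2 x pi / 6."""
    return length_mm * width_mm ** 2 * math.pi / 6.0


# Hypothetical measurement: a tumor 10 mm long and 6 mm wide.
example_volume = tumor_volume_mm3(10.0, 6.0)  # equals 60 * pi mm^3
```

Because the width enters squared, a measurement error in the width affects the estimated volume about twice as strongly, in relative terms, as the same relative error in the length.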

# RESULTS

# LINC00941 Is a Potential Biomarker for Diagnosis and Prognosis of GC

We obtained 385 samples with gene expression profiles and clinical information from TCGA. By comparing 358 cancer samples with 27 normal samples, we identified 926 significantly cancer-related lncRNAs (fold change ≥ 1, p < 0.05, Mann-Whitney U-test, **Supplementary Table S1**). By comparing 24 metastatic (M1) samples with 316 non-metastatic (M0) samples, we identified 12 significantly metastasis-related lncRNAs (fold change ≥ 1, p < 0.05, Mann-Whitney U-test, **Supplementary Table S2**). Then, we found 209 survival-related lncRNAs using the univariate Cox proportional-hazards regression model (HR > 1, p < 0.05, log-rank test, **Supplementary Table S3**). As shown in **Figure 1A**, LINC00941 was the only lncRNA that was simultaneously survival-related (**Figure 1B**), cancer-related (**Figure 1C**), and metastasis-related (**Figure 1D**). In addition, receiver operating characteristic (ROC) curve analysis was used to explore whether the expression level of LINC00941 could discriminate between sample groups, and the area under the ROC curve (AUC) was calculated to assess diagnostic performance. As shown in **Figures 1E,F**, the expression of LINC00941 discriminated cancer samples from normal samples (AUC = 0.7911, 95% CI: 0.7264–0.8559, p < 0.0001) and discriminated M1 samples from M0 samples (AUC = 0.6809, 95% CI: 0.5852–0.7766, p = 0.0031). Our findings indicated that LINC00941 is a potential biomarker for diagnosis and prognosis of GC.
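The AUC values reported above have a direct probabilistic reading: the AUC is the probability that a randomly chosen positive sample (for example, a tumor) has a higher expression value than a randomly chosen negative sample, with ties counted as one half. The sketch below computes the AUC from that definition; the expression values are invented and the function name is an assumption for illustration.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC = P(positive > negative) + 0.5 * P(positive == negative)."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical expression values for tumor and normal samples.
tumor_expr = [3.0, 4.0, 5.0]
normal_expr = [1.0, 2.0, 3.0]
auc = roc_auc(tumor_expr, normal_expr)
```

This pairwise count is the Mann-Whitney U statistic divided by the number of positive-negative pairs, which is why the U-test and the ROC analysis in this section tend to agree on whether the two groups are separated.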

# The Expression of LINC00941 Is Associated With Tumor Depth and Distant Metastasis of Patients in GC

To explore the clinical value of LINC00941, the Mann-Whitney U-test and the chi-squared test were used to compare clinicopathological features between groups defined by the expression of LINC00941 (**Table 1** and **Supplementary Table S4**). Aberrant expression of LINC00941 was observed for several clinicopathological features, such as tumor depth (p = 0.018, Mann-Whitney U-test), lymph node metastasis (p = 0.0184, Mann-Whitney U-test), distant metastasis (p = 0.003, Mann-Whitney U-test), and tumor stage (p = 0.042, Mann-Whitney U-test). The chi-squared test showed that the expression of LINC00941 had no significant correlation with clinicopathological features including gender, histological type, neoplasm histologic grade, lymph node metastasis, and tumor stage. However, the expression of LINC00941 was associated with tumor depth (p = 0.0197, chi-squared test) and distant metastasis (p = 0.0111, chi-squared test). Combining the results of the two tests, we concluded that the expression of LINC00941 was associated with tumor depth and distant metastasis of patients in GC.

<sup>∗</sup>The values had statistically significant differences.

M, the average expression level; SD, the standard deviation of expression level; Pathological T, tumor depth; Pathological N, lymph nodes affected; Pathological M, distant metastasis.
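For a feature with two categories, such as distant metastasis (M0 vs. M1) against high vs. low LINC00941 expression, the chi-squared test reduces to a 2 × 2 contingency table with one degree of freedom. The sketch below uses Pearson's statistic without Yates' continuity correction (an assumption; the text does not say whether a correction was applied), and the counts are invented rather than taken from Table 1.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test for a 2x2 table (df = 1, no Yates correction).

    Returns (statistic, p_value). For df = 1 the upper-tail probability of
    the chi-squared distribution equals erfc(sqrt(statistic / 2)).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (table[i][j] - expected) ** 2 / expected
    return statistic, math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts: rows = high / low expression, columns = M1 / M0.
counts = [[20, 5], [5, 20]]
chi2_stat, chi2_p = chi_squared_2x2(counts)
```

Features with more than two categories, such as tumor depth, need more degrees of freedom and a general chi-squared survival function, which this 2 × 2 sketch deliberately leaves out.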

# Construction and Functional Enrichment Analysis of LINC00941 Co-expression Network

To characterize the function of LINC00941 in GC, we applied WGCNA to construct a LINC00941 co-expression network by combining LINC00941 with dysregulated coding genes in GC samples (absolute fold change ≥ 1, p < 0.05, Mann-Whitney U-test, **Supplementary Table S5**). We selected the soft threshold power (β = 4) to ensure that the co-expression network had a scale-free topology (**Figure 2A**). After hierarchical clustering and dynamic tree cutting, a total of 28 co-expression modules were identified (**Figure 2B**). We found that LINC00941 clustered in the "yellow" module, which contained 123 dysregulated coding genes (**Supplementary Table S6**). The genes in the LINC00941 co-expression module were used to explore the function of LINC00941 via functional enrichment analysis. In this study, we identified 6 GO terms in BP and 4 KEGG pathways (**Figures 2C,D** and **Supplementary Table S7**). Some GO terms, such as extracellular matrix organization (GO: 0030198) and cell adhesion (GO: 0007155), are cell migration-related processes, which are associated with tumor metastasis. ECM-receptor interaction (KEGG: hsa04512) and Focal adhesion (KEGG: hsa04510) are metastasis-related pathways, and the PI3K-Akt signaling pathway (KEGG: hsa04151) is a cell proliferation-related pathway. Our findings demonstrated that LINC00941 could be a potential regulator of tumor metastasis and cancer cell proliferation.
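The logic of the enrichment step can be sketched as an over-representation test followed by false discovery rate control. The code below uses a plain hypergeometric upper tail and the Benjamini-Hochberg adjustment as stand-ins; DAVID applies its own modified Fisher's exact test, so this is not a reimplementation of DAVID. The background size and term counts in the example are invented; only the module size of 123 genes comes from the text.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` genes are sampled without replacement from a
    population containing `annotated` genes that carry the term."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    hits = sum(comb(annotated, i) * comb(population - annotated, drawn - i)
               for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical term: 300 of 20,000 background genes are annotated,
# and 12 of the 123 module genes carry the annotation.
p_term = hypergeom_upper_tail(12, 20000, 300, 123)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Terms whose adjusted value falls below 0.05 would be reported as enriched, matching the FDR < 0.05 cutoff stated in the methods.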

# Silence of LINC00941 Inhibits GC Cells Proliferation in vitro

To validate our findings, we designed two siRNAs that specifically target LINC00941. As shown in **Figure 3A**, LINC00941 was effectively knocked down by the siRNAs in MKN45 and AGS cells. Then, we assessed whether silencing of LINC00941 could affect the proliferation of MKN45 and AGS cells. The CCK-8 assay indicated that silencing of LINC00941 significantly inhibited the proliferation of MKN45 and AGS cells (**Figure 3B**). As shown in **Figure 3C**, the colony formation ability of MKN45 and AGS cells was significantly decreased after silencing LINC00941. Our results confirmed that LINC00941 promotes GC cell proliferation in vitro.

# Silence of LINC00941 Reduces GC Cells Migration and Invasion in vitro

To further analyze the effect of LINC00941 on GC metastasis, we used cell migration and invasion assays to explore the effect of silencing LINC00941 on the metastatic ability of MKN45 and AGS cells. The Transwell migration assay indicated that silencing of LINC00941 inhibited the migration of MKN45 and AGS cells (**Figure 4A**). The Matrigel invasion assay indicated that silencing of LINC00941 also inhibited the invasion of GC cells (**Figure 4B**). Epithelial-mesenchymal transition (EMT) is a process associated with tumor metastasis, and lncRNAs can modulate cancer metastasis by affecting EMT (Xu et al., 2016; Grelet et al., 2017; Min et al., 2017). To explore the association between LINC00941 and EMT in GC metastasis, we measured the expression level of EMT markers, such as E-cadherin (CDH1), fibronectin (FN), and Snail1 (SNAI1), using western blot and qRT-PCR. The results confirmed that silencing of LINC00941 decreased the protein and gene expression of CDH1 while increasing the expression of FN and SNAI1 in MKN45 and AGS cells (**Figure 4C**). Therefore, our results demonstrated that LINC00941 might modulate the metastatic properties of GC cells by affecting EMT biomarkers and could regulate GC cell migration and invasion in vitro.

# Silence of LINC00941 Suppresses Tumor Growth in vivo

To further validate the effect of LINC00941 on tumor growth, we established MKN45 cells stably expressing sh-LINC00941 or the negative control by lentiviral infection. As shown in **Figure 5A**, LINC00941 was effectively knocked down by shRNA in MKN45 cells. We then evaluated the tumorigenic effects in BALB/c nude mice. The tumors in nude mice injected with sh-LINC00941-infected cells were significantly smaller than those in the negative control group (**Figure 5B**). We also measured the tumor weight and tumor volume of the two groups. We found that the tumor weight (**Figure 5C**) and the tumor volume (**Figure 5D**) of the sh-LINC00941 group were significantly smaller than those of the negative control group. Thus, our findings confirmed that LINC00941 could modulate tumor growth in vivo.

# DISCUSSION

Non-coding RNAs represent more than 98% of the total human transcriptome (Yvonne et al., 2014) and aberrantly expressed lncRNAs drive important cancer phenotypes (Schmitt and Chang, 2016). A few lncRNAs, such as UCA1, MALAT1, HOXA11-AS, and ZEB1-AS1, have been reported to play important roles in GC. However, only a few lncRNAs have been functionally characterized in GC.

In this study, we systematically integrated GC clinical information and gene expression data from TCGA. By comparing cancer samples with normal samples, we identified 926 cancer-related lncRNAs, some of which, such as FEZF1-AS1 (Liu Y.W. et al., 2017), HOTAIR (Okugawa et al., 2014), HOXA11-AS (Sun et al., 2016), HOTTIP (Zhao et al., 2018), and LINC01234 (Chen X. et al., 2018), have been reported to play important roles in GC. By comparing metastatic samples with non-metastatic samples, we identified 12 metastasis-related lncRNAs, some of which, such as LINC00704 (Tracy et al., 2018), LINC00460 (Li et al., 2018), and LINC00520 (Henry et al., 2016), have been reported to be associated with tumor metastasis. By using the univariate Cox proportional-hazards regression model and ROC curve analysis, we found that LINC00941 may serve as a potential biomarker for diagnosis and prognosis of GC. Furthermore, we explored the associations between the expression of LINC00941 and clinicopathological features, and we found that the expression of LINC00941 was associated with tumor depth and distant metastasis.

To further explore the function of LINC00941 in GC, a LINC00941 co-expression network was constructed by combining LINC00941 with dysregulated coding genes in GC using the R package WGCNA. After hierarchical clustering and dynamic tree cutting, the LINC00941 co-expression module was identified. The dysregulated coding genes in the module were used to explore the function of LINC00941 via functional enrichment analysis. The enriched GO terms and KEGG pathways were mainly related to cell proliferation, cell migration, and tumor metastasis. Our findings indicated that LINC00941 might play an important oncogenic function in GC. LINC00941 is located in the 12p11.21 region of the human genome and is also known as lncRNA-MUF. In lung adenocarcinoma, survival of patients was negatively associated with high expression of LINC00941, which modulates focal adhesion and the PI3K-AKT signaling pathway (Wang et al., 2018). In hepatocellular carcinoma, high expression of LINC00941 significantly promoted EMT and malignant capacity and correlated with poor prognosis (Yan et al., 2017). LINC00941 was differentially expressed in TGFβ1-activated A549 cells compared with normal controls (Liu H. et al., 2017) and was up-regulated upon treatment of colon cancer cells with chemotherapeutic drugs (Zinovieva et al., 2018). In GC, LINC00941 is up-regulated in tumor tissues (Luo et al., 2018). However, the precise biological function of LINC00941 in GC has not been characterized.

To validate our findings, we used loss-of-function analysis to reveal the biological function of LINC00941 in GC cells. CCK-8 and colony formation assays confirmed that LINC00941 could promote GC cell proliferation. Cell migration and invasion assays confirmed that LINC00941 could promote GC cell metastasis. Previous studies revealed that EMT is associated with tumor metastasis and that lncRNAs can modulate cancer metastasis by affecting EMT (Xu et al., 2016; Grelet et al., 2017; Min et al., 2017). We further measured the expression level of EMT markers such as E-cadherin, Snail (Dong et al., 2012), and Fibronectin (Park and Schwarzbauer, 2014). The results demonstrated that LINC00941 could regulate GC cell metastasis by affecting EMT. Therefore, LINC00941 might modulate the metastatic properties of GC cells by affecting EMT biomarkers. This study still has some limitations, and the specific biological mechanism by which LINC00941 acts needs further investigation.

In summary, by systematically integrating bioinformatics and experimental methods, our findings revealed for the first time that LINC00941 plays an important oncogenic function in GC. Our findings not only provide novel insights into the functional characterization of LINC00941 in GC, but may also provide a novel biomarker for diagnosis and prognosis of GC in future studies.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the National Institutes of Health Laboratory Animal Care and Use Guidelines. The protocol was approved by the Xijing Hospital Institutional Review Board.

### AUTHOR CONTRIBUTIONS

YL and YN conceived of and directed the project. HL, NW, and ZZ designed and performed the experiments. HL and HG conducted the data analysis and interpreted the results. XZ and HZ revised the manuscript critically for important intellectual content. HL and NW wrote and edited the manuscript. All authors have reviewed the manuscript and approved it for publication.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61471181, 81702966, and 81730016) and the Natural Science Foundation of Jilin Province (Grant Nos. 20140101194JC and 20150101056JC).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00005/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Liu, Wu, Zhang, Zhong, Zhang, Guo, Nie and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multiple Partial Regularized Nonnegative Matrix Factorization for Predicting Ontological Functions of lncRNAs

#### Jianbang Zhao<sup>1</sup> \* and Xiaoke Ma<sup>2</sup> \*

*<sup>1</sup> College of Information Engineering, Northwest Agriculture & Forestry University, Xianyang, China, <sup>2</sup> School of Computer Science and Technology, Xidian University, Xi'an, China*

Long non-coding RNAs (lncRNAs) are critical regulators of biological processes and are highly related to complex diseases. Even though next-generation sequencing technology has facilitated the discovery of a great number of lncRNAs, knowledge about their functions is limited. Thus, it is promising to predict the functions of lncRNAs, which sheds light on the mechanisms of complex diseases. Current algorithms predict the functions of lncRNAs by using the features of protein-coding genes. Generally speaking, these algorithms fuse heterogeneous genomic data to construct lncRNA-gene associations via a linear combination, which cannot fully characterize the function-lncRNA relations. To overcome this issue, we present a nonnegative matrix factorization algorithm with multiple partial regularization (MPrNMF) to predict the functions of lncRNAs without fusing the heterogeneous genomic data. In detail, for each type of genomic data we construct the lncRNA-gene associations, resulting in multiple sets of associations. The proposed method integrates them separately via a regularization strategy, rather than fusing them into a single type of association. The results demonstrate that the proposed algorithm outperforms state-of-the-art network-analysis-based methods. The model and algorithm provide an effective way to explore the functions of lncRNAs.

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Qinghua Jiang, Harbin Institute of Technology, China Jianbo Pan, Johns Hopkins Medicine, United States*

\*Correspondence:

*Jianbang Zhao zhaojianbang@nwsuaf.edu.cn Xiaoke Ma xkma@xidian.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *29 October 2018* Accepted: *10 December 2018* Published: *23 January 2019*

#### Citation:

*Zhao J and Ma X (2019) Multiple Partial Regularized Nonnegative Matrix Factorization for Predicting Ontological Functions of lncRNAs. Front. Genet. 9:685. doi: 10.3389/fgene.2018.00685*

Keywords: lncRNA, nonnegative matrix factorization, gene ontology, networks, regularization

# 1. INTRODUCTION

Long non-coding RNAs (lncRNAs) are non-coding RNAs of more than 200 nucleotides in length that have very little or no potential to encode proteins (Mercer et al., 2009). In the past, lncRNAs were categorized as "dark matter" and "junk." However, more and more evidence demonstrates that lncRNAs are critical regulators of biological processes, such as immune response, cell development and differentiation, as well as gene imprinting (Morris and Mattick, 2014; Turner et al., 2014; Ma et al., 2017). Furthermore, lncRNAs are highly related to diseases and cancers (Zou et al., 2015, 2016; Zhu et al., 2018). Largely owing to high-throughput biological techniques, particularly next-generation sequencing (NGS), large numbers of lncRNAs have been identified (Iyer et al., 2015; Fang et al., 2018).

Compared with protein-coding genes (genes for short), the functions of the vast majority of lncRNAs are unknown. Thus, it is promising to predict the functions of lncRNAs, which is critical for revealing the underlying mechanisms of gene regulation. The approaches for annotating the functions of lncRNAs fall into two classes: biological experiment-based methods and computational methods. Currently, the functions of some lncRNAs have been validated by experiment-based methods. For example, based on RNA-sequencing data, mechanistic analysis revealed that UCA1 physically interacts with PTBP1 and ALAS2, which stabilizes ALAS2 (Liu et al., 2018). Li et al. (2016) used RT-PCR to detect the expression profiles of the lncRNA TUG1 in glioma, and found that TUG1 is involved in apoptosis and cell proliferation. Based on cap analysis of gene expression (CAGE) data, FANTOM generated a comprehensive atlas of 27919 human lncRNA genes across 1829 samples from the major human primary cell types and tissues (Hon et al., 2017). Wang et al. (2018) identified the function of NEAT1 using an enhanced green fluorescent protein reporter in human cells.

Beyond expression profiles, some lncRNAs execute their functions by interacting with other bio-molecules, such as DNAs, RNAs and proteins. Mercer and Mattick (2013) focused on lncRNAs as epigenetic modulators that bind to chromatin-modifying proteins and recruit their catalytic activity to specific sites in the genome. Efforts have been devoted to investigating lncRNA-DNA interactions, including chromatin isolation by RNA purification (Chu et al., 2012; Nowak et al., 2014). Furthermore, Ferre et al. (2016) identified protein-lncRNA interactions, offering essential clues for a better understanding of lncRNA cellular mechanisms and their disease-associated perturbations.

Even though the experiment-based approaches for determining the functions of lncRNAs are reliable, they are criticized for their high cost and complicated operations. Thus, computational algorithms for predicting lncRNA functions provide an alternative that is becoming more and more important. Based on the assumption that molecules with the same or similar functions have the same or similar patterns, some efforts explore co-expression patterns (Lee et al., 2004; Necsulea et al., 2014). Furthermore, statistics-based gene set enrichment analysis (GSEA) has also been adopted to identify the functions of lncRNAs (Guttman et al., 2009). To exploit the knowledge available for genes, Liao et al. (2011) combined the expression profiles of lncRNAs and genes in the GEO database to construct a coding and non-coding gene co-expression network, and then predicted the functions of more than 300 mouse lncRNAs based on the co-expression modules. In order to make use of global information, Guo et al. (2013) constructed a bi-colored network by integrating the expression profiles of lncRNAs and genes, and then proposed the lnc-GFP algorithm to predict the functions of lncRNAs. Jiang et al. (2015) employed a statistical test to annotate the functions of lncRNAs. Recently, Zhang et al. (2018) proposed the NeuraNetL2GO algorithm, which uses neural networks to annotate lncRNAs.

In fact, many different types of genomic data link lncRNAs and genes, for example gene co-expression, connections to diseases, and protein binding sites. Current algorithms integrate multiple heterogeneous genomic data into a single network via weighted or unweighted linear functions, which are criticized for not fully characterizing the links between lncRNAs and genes. Evidence shows that the linear combination destroys the patterns in the integrated network (Ma and Dong, 2017; Ma et al., 2019). Each type of genomic data provides its own perspective on the links between lncRNAs and genes. The ultimate goal of this study is to provide a computational method that predicts the functions of lncRNAs by integrating heterogeneous data. As shown in **Figure 2**, we construct multiple bi-colored networks for lncRNAs and genes. Then, the multiple partial regularized nonnegative matrix factorization (MPrNMF) algorithm is proposed to simultaneously factorize the multiple networks. In order to improve the accuracy, a regularization strategy is adopted, in which the factorized feature matrix preserves the links between lncRNAs and genes. The results demonstrate that the proposed method outperforms the algorithms based on a single bi-colored network, implying that the proposed method is promising.

The rest of this paper is organized as follows: section 2 briefly reviews the related works on the prediction of lncRNA functions. Section 3 describes the procedure of the proposed method. Section 4 shows the experimental results. Finally, the conclusion is presented in section 5.

# 2. RELATED WORKS

In this section, we first introduce the mathematical notations that are widely used in the forthcoming sections. Then, we review state-of-the-art methods for the prediction of lncRNA functions.

### 2.1. Notations

The notations are summarized in **Table 1**. Let $n$ be the number of entities in the networks. Specifically, let $n_o$ be the number of ontological functions in Gene Ontology (GO), $n_g$ the number of proteins (genes) in the PPI network, and $n_l$ the number of lncRNAs in the co-expression network. Let $G_g$ and $G_l$ be the PPI and lncRNA co-expression networks, respectively. The adjacency matrix of $G_g$, denoted by $W_g$, is an $n_g \times n_g$ matrix whose element $w^{[g]}_{ij}$ is the weight on edge $(v_i, v_j)$ in $G_g$. The degree of vertex $v_i$ in $G_g$ is the sum of the weights on the edges incident to $v_i$, i.e., $d^{[g]}_i = \sum_j w^{[g]}_{ij}$. The degree matrix $D_g$ is the diagonal matrix formed from the degree sequence of $G_g$, i.e., $D_g = \mathrm{diag}(d^{[g]}_1, d^{[g]}_2, \ldots, d^{[g]}_{n_g})$. The Laplacian matrix of $G_g$ is defined as $L_g = I - D_g^{-1/2} W_g D_g^{-1/2}$. Analogously, the adjacency matrix of $G_l$ is denoted by $W_l$, and $L_l$ is the Laplacian matrix of $G_l$. The associations between heterogeneous entities are represented by matrices. Specifically, let $X$ be the known lncRNA-ontology associations, $Y$ the known gene-lncRNA associations, and $Y_1$ ($Y_2$) the known lncRNA-disease (gene-disease) associations.
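The definition of the Laplacian matrix can be illustrated with a minimal NumPy sketch. The function name `normalized_laplacian` and the toy weights are hypothetical, and the handling of isolated vertices (zero degree) is an assumption that the text does not specify.

```python
import numpy as np

def normalized_laplacian(W):
    """Return L = I - D^{-1/2} W D^{-1/2} for a symmetric weighted adjacency matrix W."""
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)                      # d_i = sum_j w_ij
    inv_sqrt = np.zeros_like(degrees)
    nonzero = degrees > 0                        # assumption: isolated vertices get a zero scaling factor
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degrees[nonzero])
    return np.eye(W.shape[0]) - inv_sqrt[:, None] * W * inv_sqrt[None, :]

# Toy 3-vertex path graph (hypothetical weights, for illustration only)
W_toy = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
L_toy = normalized_laplacian(W_toy)
```

For a graph without self-loops and without isolated vertices, the diagonal of the result is 1 and the matrix is symmetric positive semidefinite.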

### 2.2. Related Algorithms

The label propagation algorithm has been successfully applied to predict phenotype-gene associations in various settings (Li and Patra, 2010; Vanunu et al., 2010); its principle is illustrated in **Figure 1A**. In detail, label propagation assumes that well-connected lncRNAs in $G_l$ are very likely to share the same label, which leads to the following objective function

$$J_{LP} = \theta \, \mathrm{tr}(\widehat{X} L_l \widehat{X}') + (1 - \theta) \|\widehat{X} - X\|^2,\tag{1}$$

where $\widehat{X}$ is the predicted lncRNA-ontology associations, $\theta \in (0, 1)$ is the parameter controlling the contributions of the two terms in Equation (1), $\mathrm{tr}(A)$ is the trace of matrix $A$, i.e., $\mathrm{tr}(A) = \sum_i a_{ii}$, and $\|A\|$ is the $l_2$ norm of matrix $A$. In Equation (1), the first term characterizes how consistent the predicted lncRNA-ontology associations $\widehat{X}$ are with the lncRNA co-expression network, while the second measures how well the predicted associations fit the initial labeling.
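Because Equation (1) is quadratic in $\widehat{X}$, setting its gradient to zero gives the stationarity condition $\widehat{X}(\theta L_l + (1-\theta) I) = (1-\theta) X$. The sketch below solves this linear system with NumPy. It assumes that the rows of $X$ index ontology terms and the columns index lncRNAs (so that $\widehat{X} L_l \widehat{X}'$ is defined) and that $L_l$ is symmetric; the function name and the toy data are hypothetical.

```python
import numpy as np

def label_propagation(X, L, theta=0.5):
    """Minimise theta*tr(Xh L Xh') + (1-theta)*||Xh - X||^2 in closed form.

    X : (n_terms, n_nodes) known associations
    L : (n_nodes, n_nodes) symmetric positive semidefinite Laplacian
    """
    n = L.shape[0]
    A = theta * L + (1.0 - theta) * np.eye(n)    # positive definite for theta in (0, 1)
    # Stationarity: Xh A = (1 - theta) X. A is symmetric, so solve A Xh' = (1 - theta) X'.
    return np.linalg.solve(A, (1.0 - theta) * X.T).T

# Toy example: one GO term known for the first of three lncRNAs on a path graph
a = 1.0 / np.sqrt(2.0)
L_toy = np.array([[1.0, -a, 0.0],
                  [-a, 1.0, -a],
                  [0.0, -a, 1.0]])               # normalized Laplacian of a 3-vertex path
X_toy = np.array([[1.0, 0.0, 0.0]])
X_hat = label_propagation(X_toy, L_toy, theta=0.5)
```

In this toy case the predicted score decays with the distance from the labeled lncRNA, which is the smoothing behavior that the first term of Equation (1) encourages.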

However, the number of predicted associations is largely determined by the sparsity of the known associations in $X$. When $X$ is very sparse, the number of predicted associations is limited. In practice, $X$ is very sparse because the GO functions of the vast majority of lncRNAs are unknown. Fortunately, the GO functions of most proteins are known. Thus, the available algorithms overcome this limitation of label propagation by integrating proteins and lncRNAs, as shown in **Figure 1B**. Specifically, given the known protein-GO associations $X$, the PPI network $G_g$, the lncRNA co-expression network $G_l$ and the lncRNA-gene associations $Y$, the ultimate goal is to predict the lncRNA-ontology associations via integrative analysis of heterogeneous data. The lnc-GFP algorithm (Guo et al., 2013) follows the label propagation method by using the bi-colored network, which is defined as

$$C = \begin{bmatrix} W_l & Y \\ Y' & W_g \end{bmatrix}. \tag{2}$$

Thus, the objective function in Equation (1) is transformed into

$$J_{LP} = \theta \, \mathrm{tr}(\widehat{X} L_C \widehat{X}') + (1 - \theta) \|\widehat{X} - X\|^2,\tag{3}$$

where $L_C$ is the Laplacian matrix of the bi-colored network $C$. The KATZLGO method (Zhang et al., 2017) predicts the GO functions of lncRNAs by using the KATZ score of the bi-colored network, which counts the paths of various lengths in the bi-colored network.
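As a rough illustration of a Katz-style score, the sketch below sums damped powers of the adjacency matrix up to a maximum length, so that entry $(i, j)$ accumulates the weighted walks between vertices $i$ and $j$. This is the generic truncated Katz index; the damping factor `beta`, the cut-off `max_length` and the function name are assumptions, not the settings or code of KATZLGO.

```python
import numpy as np

def katz_scores(C, beta=0.1, max_length=4):
    """Truncated Katz index: sum over l = 1..max_length of beta**l * C**l."""
    C = np.asarray(C, dtype=float)
    S = np.zeros_like(C)
    power = np.eye(C.shape[0])
    for length in range(1, max_length + 1):
        power = power @ C                        # C**length: weighted walks of this length
        S += beta ** length * power
    return S

# Toy network: a 3-vertex path (hypothetical, for illustration only)
C_toy = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
S = katz_scores(C_toy, beta=0.1, max_length=3)
```

Vertices 0 and 2 are not adjacent, yet they receive a nonzero score through the walk of length 2 via vertex 1. When `beta` is smaller than the reciprocal of the spectral radius of `C`, the untruncated sum converges to $(I - \beta C)^{-1} - I$.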

The bi-colored network based methods make use of lncRNA-gene associations to predict the functions of lncRNAs. To exploit the knowledge in $G_l$ and $G_g$, Petegrosso et al. (2017) proposed dual label propagation (DLP) to predict phenome-genome associations. Specifically, the objective function in Equation (1) based on the DLP model can be re-written as

$$J_{DLP} = \|\widehat{X} - X\|^2 + \beta \, \mathrm{tr}(\widehat{X} L_g \widehat{X}') + \gamma \, \mathrm{tr}(\widehat{X} L_l \widehat{X}'),\tag{4}$$

where $\beta \ge 0$ and $\gamma \ge 0$ are tuning parameters. The first term measures the consistency between the predicted and the known associations, and the last two measure the smoothness over the PPI and lncRNA networks.

Most of the available algorithms for the prediction of lncRNA functions are based on the bi-colored network model. In this study, we investigate the possibility of predicting the functions of lncRNAs by integrating multiple networks, where each type of genomic data is used to construct its own lncRNA-gene associations.

FIGURE 1 | The flowchart of the current algorithms based on network analysis: (A) label propagation method based on the lncRNA co-expression network, (B) label propagation method based on the bi-colored network.

# 3. METHODS

The procedure of MPrNMF is illustrated in **Figure 2**. In this section, we derive the objective function and the optimization rules of the proposed algorithm in turn.

### 3.1. Objective Function

All of the bi-colored network based algorithms predict the lncRNA-ontology associations from a single bi-colored network that integrates various genomic data. In this study, we construct two bi-colored networks, each corresponding to one view of the lncRNA-gene associations. In the first view, the lncRNA-gene associations are determined by the Pearson correlation coefficient between the expression profiles of lncRNAs and genes. In the second view, the lncRNA-gene associations are determined by diseases: the association between a lncRNA and a gene is the Jaccard index of the sets of diseases related to them. The i-th view of the bi-colored network is denoted by

$$C_i = \begin{bmatrix} W_l & Y_i \\ Y_i' & W_g \end{bmatrix},\tag{5}$$

where $Y_i$ ($i = 1, 2$) denotes the lncRNA-gene associations in the i-th view.
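The construction of the two views can be sketched as follows. All function names and toy inputs are hypothetical; the sketch assumes that expression profiles are given as rows of a matrix over shared samples, that no profile is constant (otherwise the correlation is undefined), and that disease annotations are given as Python sets. How negative correlations are treated is not stated in the text, so they are left unchanged here.

```python
import numpy as np

def pearson_associations(expr_lnc, expr_gene):
    """Pearson correlation between each lncRNA (row of expr_lnc) and each gene (row of expr_gene)."""
    zl = np.asarray(expr_lnc, dtype=float)
    zg = np.asarray(expr_gene, dtype=float)
    zl = zl - zl.mean(axis=1, keepdims=True)     # center each profile
    zg = zg - zg.mean(axis=1, keepdims=True)
    zl = zl / np.linalg.norm(zl, axis=1, keepdims=True)
    zg = zg / np.linalg.norm(zg, axis=1, keepdims=True)
    return zl @ zg.T                             # shape (n_l, n_g)

def jaccard_associations(lnc_diseases, gene_diseases):
    """Jaccard index between the disease sets of each lncRNA and each gene."""
    Y = np.zeros((len(lnc_diseases), len(gene_diseases)))
    for i, dl in enumerate(lnc_diseases):
        for j, dg in enumerate(gene_diseases):
            union = dl | dg
            Y[i, j] = len(dl & dg) / len(union) if union else 0.0
    return Y

def bicolored_network(W_l, Y, W_g):
    """Assemble C = [[W_l, Y], [Y', W_g]] as in Equation (5)."""
    return np.block([[W_l, Y], [Y.T, W_g]])

# Toy data: 2 lncRNAs, 3 genes, 4 samples (hypothetical values)
expr_lnc = [[1, 2, 3, 4], [4, 3, 2, 1]]
expr_gene = [[2, 4, 6, 8], [1, 0, 1, 0], [8, 6, 4, 2]]
lnc_diseases = [{"d1", "d2"}, {"d3"}]
gene_diseases = [{"d2", "d3"}, set(), {"d1", "d2"}]
W_l = np.array([[0.0, 0.8], [0.8, 0.0]])
W_g = np.array([[0.0, 0.5, 0.0], [0.5, 0.0, 0.3], [0.0, 0.3, 0.0]])

Y1 = pearson_associations(expr_lnc, expr_gene)       # co-expression view
Y2 = jaccard_associations(lnc_diseases, gene_diseases)  # disease view
C1 = bicolored_network(W_l, Y1, W_g)
C2 = bicolored_network(W_l, Y2, W_g)
```

The two networks share the diagonal blocks and differ only in the off-diagonal lncRNA-gene block.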

Given the lncRNA(gene)-ontology associations $X$, NMF aims at approximating $X$ by the product of two nonnegative matrices $B$ and $F$ (Lee and Seung, 1999), i.e.,

$$J = \|X - BF\|^2, \quad \text{s.t.} \quad B \ge 0, F \ge 0,\tag{6}$$

where $B$ is the basis matrix and $F$ is the feature matrix. Furthermore, we expect the feature matrix $F$ to reflect the topological structure of the multiple views of the bi-colored network, which is implemented via regularization. To this end, Equation (6) is reformulated as

$$J = \|X - BF\|^2 + \alpha \sum_{i=1}^{2} \mathrm{tr}(F C_i F'), \quad \text{s.t.} \quad B \ge 0, F \ge 0,\tag{7}$$

where the parameter $\alpha$ controls the importance of the regularization terms and $\mathrm{tr}(A)$ is the trace of matrix $A$, i.e., $\mathrm{tr}(A) = \sum_i a_{ii}$.

In the bi-colored network, the vertices consist of lncRNAs and genes. Thus, the feature matrix $F$ can be written as $F = [F_l, F_g]$, where $F_l$ denotes the part for the lncRNAs and $F_g$ the part for the genes. Thus, $\mathrm{tr}(F C_i F')$ is reformulated as

$$\begin{split} \mathrm{tr}(F C_i F') &= \mathrm{tr}\left( [F_l, F_g] \begin{bmatrix} W_l & Y_i \\ Y_i' & W_g \end{bmatrix} \begin{bmatrix} F_l' \\ F_g' \end{bmatrix} \right) \\ &= \mathrm{tr}(F_g W_g F_g' + F_g Y_i' F_l' + F_l Y_i F_g' + F_l W_l F_l') \\ &= \mathrm{tr}(F_g W_g F_g') + 2\,\mathrm{tr}(F_l Y_i F_g') + \mathrm{tr}(F_l W_l F_l'). \end{split}$$

The above equation indicates that the regularization term for the bi-colored network can be divided into three components, involving $W_g$, $W_l$ and $Y_i$, respectively. The only difference between the two views is the lncRNA-gene relations. Thus, we expect the regularization term to fully reflect the lncRNA-gene relations $Y_i$. In this case, the objective function in Equation (7) is transformed into

$$\min f = \|X - BF\|^2 + \alpha \sum_{i=1}^{2} \mathrm{tr}(F_l Y_i F_g') \tag{8}$$

$$\text{s.t.} \qquad B \ge 0, \; F \ge 0, \; F' 1_{n_l + n_g} = 1_{n_l + n_g},$$

where $1_n$ is the column vector with all elements equal to 1. The $l_1$-norm constraint on matrix $F$ is adopted to obtain sparse solutions.
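The expansion of $\mathrm{tr}(F C_i F')$ into three components, which leads from Equation (7) to Equation (8), can be checked numerically. The sketch below uses random matrices of hypothetical sizes; it only verifies the algebraic identity and says nothing about real data.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_l, n_g = 3, 4, 5                            # hypothetical sizes

def random_symmetric(n):
    A = rng.random((n, n))
    return (A + A.T) / 2

W_l, W_g = random_symmetric(n_l), random_symmetric(n_g)
Y = rng.random((n_l, n_g))                       # lncRNA-gene associations of one view
F_l, F_g = rng.random((k, n_l)), rng.random((k, n_g))

F = np.hstack([F_l, F_g])                        # F = [F_l, F_g]
C = np.block([[W_l, Y], [Y.T, W_g]])             # bi-colored network, Equation (5)

lhs = np.trace(F @ C @ F.T)
rhs = (np.trace(F_g @ W_g @ F_g.T)
       + 2 * np.trace(F_l @ Y @ F_g.T)
       + np.trace(F_l @ W_l @ F_l.T))

# Keeping only the off-diagonal blocks isolates the lncRNA-gene term
C_star = np.block([[np.zeros((n_l, n_l)), Y], [Y.T, np.zeros((n_g, n_g))]])
partial = np.trace(F @ C_star @ F.T)
```

Here `lhs` and `rhs` agree up to floating-point error, and `partial` equals `2 * np.trace(F_l @ Y @ F_g.T)`, the only component that changes from one view to the other.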

FIGURE 2 | The flowchart of the MPrNMF algorithm, which consists of three components: network construction, matrix factorization and function prediction. In the network construction, each type of heterogeneous lncRNA-gene associations is used to construct a bi-colored network. The matrix factorization procedure obtains an approximation of the lncRNA(gene)-ontology associations *X*, where the feature matrix *F* reflects multiple lncRNA-gene associations. The function prediction procedure is based on the decomposed matrices.

### 3.2. Optimization Rules

To optimize the objective function in Equation (8), we derive the updating rules for matrix B and F. Since the objective function is non-convex, we update one matrix by fixing the other, which continues until the termination criterion is reached.

By integrating the sparsity constraint of matrix F, the Lagrange function for objective function is formulated as

$$\begin{split} L &= \|X - BF\|^2 + 2\alpha \sum_{i=1}^{2} \mathrm{tr}(F_l Y_i F_g') + \mathrm{tr}\left(\Lambda (F' 1_{n_l + n_g} - 1_{n_l + n_g})(F' 1_{n_l + n_g} - 1_{n_l + n_g})'\right) \\ &= \|X - BF\|^2 + \alpha \sum_{i=1}^{2} \mathrm{tr}(F C_i^* F') + \mathrm{tr}\left(\Lambda (F' 1_{n_l + n_g} - 1_{n_l + n_g})(F' 1_{n_l + n_g} - 1_{n_l + n_g})'\right), \end{split}$$

where matrix $C_i^*$ is defined as

$$C_i^* = \begin{bmatrix} 0 & Y_i \\ Y_i' & 0 \end{bmatrix}.$$

The derivative of L on B is calculated as

$$\frac{1}{2}\nabla_B L = XF' - BFF',$$

and the derivative of L on F is written as

$$\frac{1}{2}\nabla_F L = B'X - B'BF + \alpha \sum_{i=1}^2 F C_i^* - 1_{n_l + n_g} 1_{n_l + n_g}' \Lambda.$$

According to the Karush-Kuhn-Tucker condition, by setting $\frac{1}{2}\nabla_B L = 0$, we obtain the updating rule for matrix $B$ as

$$B = B \odot \sqrt{\frac{[BFF']}{[XF']}},\tag{9}$$

where $\odot$ denotes the element-wise product, $[\cdot]/[\cdot]$ denotes element-wise division and $\sqrt{\cdot}$ is the element-wise square root. Analogously, the updating rule for matrix $F$ is derived as

$$F = F \odot \sqrt{\frac{[B'BF]}{[B'X + \alpha(FC_1^* + FC_2^*)]}}.\tag{10}$$

After obtaining matrices $B$ and $F$, we partition the matrix $B$ into $B_l$ and $B_g$. The predicted lncRNA-ontology associations are obtained as $B_l F_l$. The procedure of the proposed algorithm is illustrated in Algorithm 1. Usually, the number of iterations is 100.
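The alternating scheme, in which one factor is updated while the other is held fixed, can be sketched for the plain reconstruction term $\|X - BF\|^2$ of Equation (6). The sketch uses the standard multiplicative rules of Lee and Seung, with the data term in the numerator and no square root, and it omits the regularization term and the normalization constraint for brevity, so it illustrates the iteration pattern rather than reproducing Equations (9) and (10). The function name, the small constant `eps`, the toy sizes and the column layout (ontology terms as rows, lncRNAs followed by genes as columns) are assumptions.

```python
import numpy as np

def nmf_alternating(X, k, n_iter=100, eps=1e-10, seed=0):
    """Factorise X ~ B @ F with B, F >= 0 by alternating multiplicative updates."""
    rng = np.random.default_rng(seed)
    B = rng.random((X.shape[0], k))              # random nonnegative initial solutions
    F = rng.random((k, X.shape[1]))
    errors = []
    for _ in range(n_iter):
        B *= (X @ F.T) / (B @ F @ F.T + eps)     # update B with F fixed
        F *= (B.T @ X) / (B.T @ B @ F + eps)     # update F with B fixed
        errors.append(np.linalg.norm(X - B @ F) ** 2)
    return B, F, errors

# Hypothetical toy problem: 6 ontology terms, 3 lncRNAs followed by 5 genes
rng = np.random.default_rng(1)
n_l, n_g = 3, 5
X_toy = rng.random((6, 2)) @ rng.random((2, n_l + n_g))   # nonnegative, rank 2
B, F, errors = nmf_alternating(X_toy, k=2)
X_hat_lnc = (B @ F)[:, :n_l]                     # reconstructed scores for the lncRNA columns
```

With these rules the reconstruction error does not increase from one iteration to the next, which is why a fixed iteration budget such as 100 is a workable termination criterion.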

#### **Algorithm 1** The MPrNMF algorithm

**Input:**

$Y_i$ ($1 \le i \le n$): the multiple views of lncRNA-gene associations;

$X$: the known lncRNA(gene)-ontology associations;

$k$: number of features;

$\alpha$: weight for multiple views;

**Output:**

$\widehat{X}_l$: the predicted lncRNA-ontology associations.

**Part I: Matrix Decomposition**

**Part II: Predicting lncRNA-ontology associations**

5: Predict the lncRNA-ontology associations as $\widehat{X}_l = B_l F_l$;

6: **return** $\widehat{X}_l$.

# 4. RESULTS

### 4.1. Data

The PPI network is downloaded from the BioGrid database (https://thebiogrid.org/). We select the maximal connected subgraph in the PPI network for analysis. The lncRNAs are downloaded from the GENCODE database (https://www.gencodegenes.org/). The gene-disease associations are downloaded from the OMIM database (https://omim.org/), while the lncRNA-disease associations are downloaded from the LncRNADisease database (http://www.cuilab.cn/lncrnadisease). The expression profiles are downloaded from the COXPRESdb database (Okamura et al., 2018) (http://coxpresdb.jp/), where three preprocessed datasets, Hsa.c4-1, Hsa2.c2-0, and Hsa3.c1-0, are used.

Since there is no publicly available database for the ontology of lncRNAs, Zhang and Ma (2018) manually curated a set of 55 lncRNAs with 129 GO terms by literature searching. We adopt this dataset as the benchmark to test the performance of the proposed method.

### 4.2. Criterion

For each candidate lncRNA-ontology association, the output of the proposed algorithm is a real value in the interval [0,1]. Hence a threshold is needed to determine the final prediction. Following the NeuraNetL2GO algorithm (Zhang and Ma, 2018), we use Recall, Precision and Fmax to quantify the accuracy of the algorithms. Specifically, let $t$ be the threshold, $P_i(t)$ the set of ontology terms predicted for the i-th lncRNA at threshold $t$, and $T_i$ the set of its ontology terms in the benchmark dataset. For the i-th lncRNA, the true positives (TP), false positives (FP) and false negatives (FN) are defined as

$$TP_i = \sum_{o \in \mathcal{O}} I(o \in P_i(t) \land o \in T_i), \tag{11}$$

$$FP_i = \sum_{o \in \mathcal{O}} I(o \in P_i(t) \land o \notin T_i), \tag{12}$$

$$FN_i = \sum_{o \in \mathcal{O}} I(o \notin P_i(t) \land o \in T_i), \tag{13}$$

where $o$ is an ontology term, $\mathcal{O}$ denotes the set of all functions, and $I(x)$ is the indicator function, with value 1 if $x$ is true and 0 otherwise. The recall, precision, and Fmax are defined as

$$Recall = \frac{\sum_{i} TP_i}{\sum_{i} TP_i + \sum_{i} FN_i},\tag{14}$$

$$Precision = \frac{\sum_{i} TP_i}{\sum_{i} TP_i + \sum_{i} FP_i},\tag{15}$$

$$F_{\max} = \max_{t} \frac{2\,Recall(t)\,Precision(t)}{Recall(t) + Precision(t)}.\tag{16}$$
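Equations (11) to (16) can be turned into a short pure-Python routine. The sketch below assumes that a term is predicted when its score is greater than or equal to the threshold and that the thresholds are supplied by the caller; the function name, GO identifiers and scores are hypothetical.

```python
def f_max(scores, truth, thresholds):
    """Compute Fmax over a set of thresholds.

    scores : list with one dict per lncRNA, mapping GO term -> predicted score in [0, 1]
    truth  : list of sets of benchmark GO terms, aligned with scores
    """
    best = 0.0
    for t in thresholds:
        tp = fp = fn = 0
        for s, T in zip(scores, truth):
            P = {term for term, value in s.items() if value >= t}  # predicted terms at t
            tp += len(P & T)
            fp += len(P - T)
            fn += len(T - P)
        if tp == 0:
            continue                              # F-measure is zero or undefined at this threshold
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        best = max(best, 2 * recall * precision / (recall + precision))
    return best

# Toy example with two lncRNAs (hypothetical scores and annotations)
scores = [{"GO:a": 0.9, "GO:b": 0.4, "GO:c": 0.2},
          {"GO:a": 0.5, "GO:b": 0.4}]
truth = [{"GO:a", "GO:b"}, {"GO:b"}]
thresholds = [i / 10 for i in range(1, 10)]
result = f_max(scores, truth, thresholds)
```

In this toy case the maximum is reached at thresholds 0.3 and 0.4, where all three true terms are recovered together with one false positive, giving recall 1, precision 3/4 and an F-measure of 6/7.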

### 4.3. Parameter Selection

There are two parameters involved in MPrNMF: parameter $k$ is the number of features, and parameter $\alpha$ controls the relative importance of the partial regularization terms. For the parameter $k$, Wu et al. (2016) proposed an instability-based NMF model for parameter selection. For each $k$, MPrNMF runs $\tau$ times with random initial solutions and obtains $\tau$ basis matrices, denoted by $B_1, \ldots, B_\tau$. Given two matrices $B_1$ and $B_2$, a $k \times k$ matrix $H$ is defined whose element $h_{ij}$ is the cross correlation between the i-th column of matrix $B_1$ and the j-th column of matrix $B_2$. The dissimilarity between $B_1$ and $B_2$ is defined as

$$\text{diss}(B_1, B_2) = \frac{1}{2k} \left(2k - \sum_j \max H_{\cdot j} - \sum_i \max H_{i \cdot}\right),$$

where $H_{\cdot j}$ denotes the j-th column of matrix $H$ and $H_{i \cdot}$ its i-th row. The instability is the discrepancy among all the basis matrices for $k$, which is defined as

$$\Upsilon(k) = \frac{2}{\tau(\tau - 1)} \sum_{1 \le i < j \le \tau} \text{diss}(B_i, B_j).$$
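The instability measure can be sketched with NumPy as follows. The sketch interprets the cross correlation between two columns as their Pearson correlation, which is an assumption; the function names and the random toy matrices are hypothetical.

```python
import numpy as np

def dissimilarity(B1, B2):
    """diss(B1, B2) computed from the k x k cross-correlation matrix of the columns."""
    k = B1.shape[1]
    H = np.corrcoef(B1.T, B2.T)[:k, k:]          # H[i, j] = corr(column i of B1, column j of B2)
    return (2 * k - H.max(axis=0).sum() - H.max(axis=1).sum()) / (2 * k)

def instability(bases):
    """Average pairwise dissimilarity over the tau basis matrices obtained for one k."""
    tau = len(bases)
    total = sum(dissimilarity(bases[i], bases[j])
                for i in range(tau) for j in range(i + 1, tau))
    return 2 * total / (tau * (tau - 1))

# Toy check (hypothetical matrices): column permutations of one basis versus unrelated bases
rng = np.random.default_rng(0)
B_a = rng.random((20, 4))
bases_same = [B_a, B_a[:, ::-1], B_a[:, [1, 0, 3, 2]]]
bases_rand = [rng.random((20, 4)) for _ in range(3)]
stable = instability(bases_same)
unstable = instability(bases_rand)
```

Basis matrices that differ only by a permutation of their columns give an instability of approximately zero, so the measure is insensitive to the arbitrary ordering of the features across runs.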

As shown in **Figure 3A**, the instability of MPrNMF changes as the number of features $k$ ranges from 40 to 64 in steps of 4. When $k < 52$ the instability decreases as $k$ grows, while it increases when $k > 52$. The reason is that when $k$ is small, the features cannot fully characterize the topological structure of the associations, while a large $k$ results in redundant features. The instability reaches its minimum at $k = 52$. Thus, we set $k = 52$.

How the parameter $\alpha$ affects the performance of MPrNMF is illustrated in **Figure 3B**, where the Fmax changes as $\alpha$ increases from 0.1 to 2 in steps of 0.2. When $\alpha$ increases from 0.1 to 1, the performance improves. The accuracy of the proposed algorithm is stable when $\alpha > 1$. The reason is that when $\alpha$ is small, the objective function is dominated by the lncRNA(gene)-ontology associations. As $\alpha$ increases, the contribution of the regularization terms for the multiple views of lncRNA-gene associations increases, improving the accuracy. Therefore, we set $\alpha = 1$, since it reaches a good balance between the lncRNA(gene)-ontology associations and the lncRNA-gene associations.

### 4.4. Performance

To fully validate the performance of MPrNMF, three algorithms are selected for comparison because of their excellent performance: lnc-GFP (Guo et al., 2013), Lnc2Function (Jiang et al., 2015) and NeuraNetL2GO (Zhang and Ma, 2018). In this study, we focus only on the biological process GO terms.


The accuracy of the various algorithms is shown in **Figure 4**, where recall, precision and Fmax are adopted to measure performance. These results demonstrate that: (i) MPrNMF achieves the best performance on recall; (ii) MPrNMF outperforms lnc-GFP and Lnc2Function; (iii) MPrNMF is inferior to NeuraNetL2GO. There are two possible reasons why the proposed method is superior to lnc-GFP and Lnc2Function. First, MPrNMF integrates multiple heterogeneous genomic data via matrix factorization, which characterizes the lncRNA-ontology associations more accurately. Second, the multiple heterogeneous genomic data are regularized separately, rather than being fused via a linear function. However, the proposed algorithm is inferior to NeuraNetL2GO. In detail, the Fmax for MPrNMF is 0.309, while that of NeuraNetL2GO is 0.336. There are also two possible reasons. First, the MPrNMF algorithm is a network-based method that requires the networks to be connected, which excludes many lncRNAs or genes from the analysis. Second, MPrNMF does not fully exploit the topological information of the networks, while NeuraNetL2GO makes use of graph embedding features from the networks.

Furthermore, we also compare these algorithms in terms of the number of lncRNAs that are annotated with at least one biological process GO term. As shown in **Figure 5**, 47 lncRNAs are correctly annotated by the proposed method, which is significantly higher than the numbers for lnc-GFP and Lnc2Function. Even though this is not as high as that of NeuraNetL2GO, the difference is not significant (p-value = 0.387, Fisher Exact Test).

In MPrNMF, multiple views of lncRNA-gene associations are used. We therefore also investigate the performance of each view on its own. The Fmax of the proposed algorithm based on the co-expression lncRNA-gene associations is 0.242, while that based on the disease lncRNA-gene associations is 0.278, both lower than the 0.309 obtained with the two views combined. These results indicate that the effective integration of heterogeneous genomic data is promising for the prediction of lncRNA-ontology associations.

### 4.5. Case Study

In this subsection, we apply MPrNMF to a specific lncRNA to show the application of the proposed algorithm. HOTAIRM1 is an intergenic lncRNA located between HOXA1 and HOXA2. Evidence shows that HOTAIRM1 is a critical regulator of the expression levels of HOXA1 and HOXA4 (Zhang et al., 2009, 2014) and is involved in cell growth in leukemia cells. We apply the MPrNMF algorithm to predict the functions of HOTAIRM1, and it discovers 5 ontology functions, including biological regulation, cellular process and signal transduction. These functions have been validated by previous studies, indicating that the proposed method is applicable to predicting the ontological functions of lncRNAs.

# 5. CONCLUSION

More and more lncRNAs have been identified in the past few years. However, the functions of the vast majority of lncRNAs are poorly characterized. In this study, we propose a novel algorithm to predict the functions of lncRNAs by integrating multiple types of genomic data. The results demonstrate that the proposed algorithm is superior to the network-analysis-based methods. However, the proposed method has some limitations. First, only the expression and disease data are used to construct the lncRNA-gene associations, which cannot fully characterize the relations. Constructing more reliable lncRNA-gene associations is therefore promising for predicting the functions of lncRNAs. Second, the proposed method cannot make full use of the topological information in the multiple networks, such as graph embedding features. In future work, we will investigate how to address these two issues.

# AUTHOR CONTRIBUTIONS

JZ and XM designed the method and JZ coded the algorithm. JZ and XM wrote the paper.

# FUNDING

This work was supported by the NSFC (Grant No. 61772394), Scientific Research Foundation for the Returned Overseas Chinese Scholars of Shaanxi Province (Grant No. 2018003) and Fundamental Research Funding of Central Universities (Grant No. Z109021508, JB180304).

# ACKNOWLEDGMENTS

The authors appreciate the reviewers for their suggestions.

# REFERENCES


hybrid shows structural and functional asymmetry. Nature Struc. Mol. Biol. 21, 389–396. doi: 10.1038/nsmb.2785


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhao and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Predicting Gene Ontology Function of Human MicroRNAs by Integrating Multiple Networks

#### Lei Deng<sup>1</sup> , Jiacheng Wang<sup>1</sup> and Jingpu Zhang<sup>2</sup> \*

*<sup>1</sup> School of Software, Central South University, Changsha, China, <sup>2</sup> School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, China*

MicroRNAs (miRNAs) have been demonstrated to play significant biological roles in many human biological processes. Inferring the functions of miRNAs is an important strategy for understanding disease pathogenesis at the molecular level. In this paper, we propose an integrated model, PmiRGO, to infer the gene ontology (GO) functions of miRNAs by integrating multiple data sources, including the expression profiles of miRNAs, miRNA-target interactions, and protein-protein interactions (PPI). PmiRGO starts by building a global network consisting of three networks. Then, it employs DeepWalk to learn latent representations as network features of the global heterogeneous network. Finally, the SVM-based models are applied to label the GO terms of miRNAs. The experimental results show that PmiRGO has a significantly better performance than existing state-of-the-art methods in terms of *Fmax*. A case study further demonstrates the feasibility of PmiRGO to annotate the potential functions of miRNAs.

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Wuritu Yang, Inner Mongolia University, China; Wenji Ma, Columbia University, United States; Zizhang Sheng, Columbia University Irving Medical Center, United States*

> \*Correspondence: *Jingpu Zhang zhangjp@csu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *02 November 2018* Accepted: *07 January 2019* Published: *29 January 2019*

#### Citation:

*Deng L, Wang J and Zhang J (2019) Predicting Gene Ontology Function of Human MicroRNAs by Integrating Multiple Networks. Front. Genet. 10:3. doi: 10.3389/fgene.2019.00003*

Keywords: miRNA function annotation, miRNA co-expression, global heterogeneous network, latent representations, multi-classification

### INTRODUCTION

MicroRNAs (miRNAs) are endogenous small non-coding RNAs of about 21–25 nucleotides that play important roles in gene regulation by base-pairing with mRNA molecules that have complementary sequences, leading to cleavage or translational repression (Bartel, 2004; Huang et al., 2011; Yao et al., 2018). The biological processes in which miRNAs are involved include development, differentiation, apoptosis, and viral infection (Miska, 2005). In addition to their importance in biological processes, miRNAs are also valuable biomarker candidates for specific diseases, including Alzheimer's disease (AD) (Esteller, 2011). Currently, the identification of unknown miRNA functions is an essential goal of miRNA research. Most research on miRNA function relies on experimental determination: miRNA function is primarily identified by the up-regulation or down-regulation of miRNA expression and its target genes (Zhu and Helliwell, 2010). However, experimental methods for identifying miRNA functions are expensive and time-consuming.

Recently, computational methods have been proposed to address these difficulties. These methods elucidate miRNA functions by analyzing the functions of target genes or promoters, which are determined by miRNA-related expression (Pandey and Krishnamachari, 2006; Wei et al., 2012). They include TargetScan (Agarwal et al., 2015), miRanda (Enright et al., 2003), PITA (Kertesz et al., 2007), and DIANA-microT (Maragkakis et al., 2009). Many of these tools are based on the sequence alignment of the miRNA seed region, which allows for the determination of the putative binding sites (Maragkakis et al., 2009). However, the prediction results of these tools are unsatisfactory for two reasons: first, the majority of the predicted miRNA-target data are negative, and the predicted data are insufficient; second, these tools concentrate only on sequence information (Ulitsky et al., 2010) and ignore other useful information, such as miRNA expression data. Therefore, the results are easily degraded by negative samples. With the growth of high-throughput sequencing, a massive amount of miRNA-seq data is accumulating; however, the analysis of these data remains a significant challenge. miRNA expression determines function, which is also crucial for discovering molecular mechanisms of human gene regulation (Panwar et al., 2017). Backes et al. (2016) developed miEAA, a miRNA annotation tool that provides rich functionality in terms of miRNA categories based on miRNA enrichment analysis. However, miEAA does not take the importance of miRNA co-expression into account. Generally, multiple miRNAs might jointly regulate a target gene, and a miRNA may regulate hundreds of different target genes (Krek et al., 2005; Friedman et al., 2009). The potential associations between miRNAs are also vital for understanding the miRNA functional mechanism and for annotating the functions of miRNAs. Moreover, miEAA ignores the interactions between miRNAs and target gene products (e.g., proteins), which provide useful information for predicting the functions of miRNAs.

In this paper, we take full advantage of miRNA expression profiles, experimentally validated miRNA-target gene interactions, and protein-protein interaction data. First, a global miRNA-protein network is constructed by integrating these three data sources. Second, we employ DeepWalk (Perozzi et al., 2014), an approach for learning latent representations of nodes in a network, to extract the network features of the global heterogeneous network. Based on these features of the global network, we build an SVM-based classifier for each miRNA to annotate their GO functions. The proteins with Gene Ontology annotations in the GOA database (Huntley et al., 2009) are utilized to train the SVM classifiers. Finally, we evaluate our method by applying it to an independent dataset. The results show that our method, PmiRGO, achieves a maximum F-measure of 0.310 and outperforms the state-of-the-art method miEAA (Backes et al., 2016).

# MATERIALS AND METHODS

The flowchart of PmiRGO is illustrated in **Figure 1**. As shown in step A, we first downloaded the miRNA co-expression profiles, miRNA-target interactions, and protein-protein interactions (PPIs) to construct the miRNA co-expression network, miRNA-target interaction network, and PPI network, respectively. Then, in step B, the three networks were integrated to build a global heterogeneous network by mapping the target genes into the PPI network. In step C, we employed DeepWalk to learn the latent representations of the network as the features of the global heterogeneous network. In step D, we mapped the IDs of miRNAs and proteins to the corresponding nodes in the features. After that, in step E, we trained SVM models for each miRNA and used the miRNA2GO-337 dataset to evaluate the performance of the multi-classification models. In the final step F, the GO annotations of miRNAs in the miRNA2GO-337 dataset were predicted.

# Materials

In this study, we downloaded the miRNA expression data, PPI data, and miRNA-target interactions from different databases, from which a total of 2,588 miRNAs and 18,143 proteins were retrieved. The details are as follows.

#### miRNA Expression

The miRNA expression data were downloaded from the miRmine database, which contains expression profiles collected from several publicly available miRNA-seq datasets, as well as detailed information regarding different miRNAs (Panwar et al., 2017). This database consists of expression profiles of 2,822 precursor miRNAs, each containing a total of 135 columns of expression values from different human tissues. Note that a mature miRNA may have two or more precursor miRNAs; in our work, the expression profiles of one mature miRNA derived from different precursor miRNAs were averaged to give the expression values of this mature miRNA. As a result, 2,588 miRNA expression profiles were obtained. We then calculated the Pearson's Correlation Coefficient (PCC) between the expression profiles of each pair of miRNAs as their co-expression similarity (Zhang J. et al., 2017). We constructed a miRNA co-expression network according to the co-expression similarity values. As the PCC scores were used as the weights of the edges in the network, the negative PCC values were removed.
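The construction described above (averaging precursor profiles, computing pairwise PCC, and discarding negative values) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code; the function names, miRNA names, and expression values are hypothetical, and treating a zero or undefined correlation as "no edge" is an assumption.

```python
import math

def mean_profiles(precursor_profiles, precursor_to_mature):
    """Average the precursor profiles that map to the same mature miRNA."""
    grouped = {}
    for precursor, profile in precursor_profiles.items():
        grouped.setdefault(precursor_to_mature[precursor], []).append(profile)
    return {
        mature: [sum(col) / len(col) for col in zip(*profiles)]
        for mature, profiles in grouped.items()
    }

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: PCC undefined, treated as no edge (assumption)
    return cov / (sx * sy)

def coexpression_edges(profiles):
    """Return {(a, b): PCC} for every miRNA pair with a positive PCC."""
    names = sorted(profiles)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r > 0:  # negative correlations are removed
                edges[(a, b)] = r
    return edges

# Toy data: hypothetical precursor names and expression values.
precursors = {
    "pre-1a": [1.0, 2.0, 3.0],
    "pre-1b": [3.0, 4.0, 5.0],
    "pre-2": [2.0, 4.0, 6.0],
    "pre-3": [6.0, 4.0, 2.0],
}
mapping = {"pre-1a": "miR-1", "pre-1b": "miR-1", "pre-2": "miR-2", "pre-3": "miR-3"}
profiles = mean_profiles(precursors, mapping)
edges = coexpression_edges(profiles)
```

In this toy input only the pair (miR-1, miR-2) is positively correlated, so it is the only edge that survives.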

#### Protein-Protein Interactions

The PPIs were obtained from the STRING database V10.0 (Szklarczyk et al., 2014). These interactions were collected not only from biological experiments but also from text mining and computational prediction approaches. The overall score of each interaction combines one or more lines of evidence into a single confidence value. A total of 7,866,428 PPI entries among 18,143 proteins were retrieved and used to construct a PPI network. Each entry of the PPI network consists of protein A, protein B, and the corresponding predicted score. The higher the predicted score of an entry, the more likely the two proteins in the entry are to interact. In our work, we treat the predicted score as the weight of the edge between the two protein nodes in the entry.

#### miRNA-Target Interactions

We retrieved miRNA-target interactions from release 7.0 of the miRTarBase database (Hsu et al., 2010). The database provides a gold standard resource of experimentally validated microRNA-target interactions, which were manually collected. After removing the duplicate and out-of-range entries, we extracted 355,684 different high-quality experimentally validated miRNA-target interactions among 2,588 miRNAs and 18,143 target genes to build the miRNA-target interaction network.

FIGURE 1 | PmiRGO flowchart. It consists of six steps: (A) three networks (miRNA co-expression network, miRNA-target interaction network, and PPI network) were constructed according to the co-expression profiles, miRNA-target gene interactions, and protein-protein interactions, respectively. (B) By mapping the target genes into the PPI network, the three networks were integrated to build a global heterogeneous network. (C) DeepWalk was employed to learn the latent representations of the network as features of the global heterogeneous network. (D) For each miRNA or protein, a feature vector was obtained. (E) SVM models were trained and the miRNA2GO-337 dataset was used to evaluate the performance. (F) The GO annotations of each miRNA in the miRNA2GO-337 dataset were predicted.

# Methods

#### Constructing the Global Network

Three heterogeneous networks, including the miRNA co-expression network, the miRNA-target interaction network, and the PPI network, were built as described above. The construction of the miRNA co-expression network is based on the hypothesis that miRNAs with similar expression patterns also share similar functions or biological pathways (He and Hannon, 2004; Zhang Z. et al., 2017). The PCC scores were computed to represent the similarity between two miRNAs, and the values serve as the weights of the edges in the miRNA co-expression network. Moreover, growing evidence has revealed that miRNAs have, with a significant probability, identical or related functions to their interacting target genes (Bartel, 2009). Hence, the three component networks were integrated to infer the functions of miRNAs. Assuming that M, P, and MP denote the adjacency matrices of the miRNA co-expression network, PPI network, and miRNA-target interaction network, respectively, the global network can be formulated as:

$$G = \begin{bmatrix} M & MP \\ MP^T & P \end{bmatrix} \tag{1}$$

Here, T in MP<sup>T</sup> represents the transpose.
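As a concrete illustration of Equation 1, the block matrix can be assembled from nested lists as sketched below. The function names and the toy weights are hypothetical, and the sketch makes no assumption about how the three kinds of edge weight are scaled relative to one another, since the text does not specify this.

```python
def transpose(matrix):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*matrix)]

def global_adjacency(M, P, MP):
    """Assemble G = [[M, MP], [MP^T, P]] from nested lists.

    M  : n_mirna x n_mirna     miRNA co-expression weights
    P  : n_protein x n_protein PPI weights
    MP : n_mirna x n_protein   miRNA-target interactions
    """
    if len(MP) != len(M) or any(len(row) != len(P) for row in MP):
        raise ValueError("MP must have shape (len(M), len(P))")
    top = [m_row + mp_row for m_row, mp_row in zip(M, MP)]
    bottom = [mpt_row + p_row for mpt_row, p_row in zip(transpose(MP), P)]
    return top + bottom

# Toy example: 2 miRNAs and 3 proteins with made-up weights.
M = [[0, 0.8], [0.8, 0]]
P = [[0, 0.9, 0], [0.9, 0, 0.4], [0, 0.4, 0]]
MP = [[1, 0, 0], [0, 1, 1]]
G = global_adjacency(M, P, MP)
```

Because M and P are symmetric and the off-diagonal blocks are transposes of each other, the resulting G is symmetric, which matches the treatment of the global network as an undirected graph.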

#### Learning Latent Representations of Nodes

In order to obtain the low-dimensional topological information of the vertices of the global heterogeneous network constructed above, DeepWalk was used to learn the latent representations of miRNAs and proteins in the network (Perozzi et al., 2014). This unsupervised graph-based method learns features that capture the graph structure independently of the distribution of the labels (Bengio et al., 2013). DeepWalk uses information extracted locally from truncated random walks to learn latent representations by regarding walks as sentences.

We treated the global heterogeneous network as an undirected graph G = (V, E), where V denotes the set of biological entities (e.g., miRNAs and proteins) and E denotes the set of undirected edges. DeepWalk employs a stream of short random walks to extract potential associations between miRNAs and proteins from the global network. A random walk rooted at node v<sub>i</sub> is denoted W<sub>v<sub>i</sub></sub>. It is a stochastic process with random variables W<sub>v<sub>i</sub></sub><sup>1</sup>, W<sub>v<sub>i</sub></sub><sup>2</sup>, ..., W<sub>v<sub>i</sub></sub><sup>k</sup>, where W<sub>v<sub>i</sub></sub><sup>k+1</sup> is a node chosen randomly from the neighbors of node v<sub>k</sub>. Once the random walk sequences have been generated for each node, the probability of a specific sequence must be estimated. More formally, given a sequence of nodes W<sub>1</sub><sup>n</sup> = (w<sub>0</sub>, w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), where w<sub>i</sub> ∈ V, DeepWalk maximizes Pr(w<sub>n</sub> | w<sub>0</sub>, w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n−1</sub>) over all nodes. The idea is to calculate the probability of observing node v<sub>i</sub> given all the previous nodes visited so far in the random walk:

$$\Pr\left(v_i \mid (v_1, v_2, \dots, v_{i-1})\right) \tag{2}$$
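The stream of truncated random walks described above can be sketched as follows. This is a simplified illustration rather than the DeepWalk reference implementation: the adjacency dictionary and node names are hypothetical, and the next node is drawn uniformly from the neighbors, which is an assumption because the text does not state whether edge weights bias the walk.

```python
import random

def random_walk(adj, start, walk_length, rng):
    """One truncated random walk: repeatedly hop to a uniformly chosen neighbor."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break  # isolated node: stop early
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(adj, walks_per_node, walk_length, seed=0):
    """Start `walks_per_node` walks from every node, reshuffling the node order each pass."""
    rng = random.Random(seed)
    nodes = list(adj)
    corpus = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for node in nodes:
            corpus.append(random_walk(adj, node, walk_length, rng))
    return corpus

# Toy heterogeneous graph: two miRNAs and two proteins (hypothetical names).
adj = {
    "miR-a": ["miR-b", "P1"],
    "miR-b": ["miR-a", "P2"],
    "P1": ["miR-a", "P2"],
    "P2": ["miR-b", "P1"],
}
corpus = build_corpus(adj, walks_per_node=3, walk_length=5)
```

Each walk in `corpus` plays the role of a sentence, and each node the role of a word, when the representations are learned.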

We introduced a mapping function Φ : v ∈ V ↦ ℝ<sup>|V|×d</sup> to stand for the latent representation associated with each miRNA and protein in the graph. The next step involves estimating the likelihood:

$$\Pr\left(v_i \mid (\Phi(v_1), \Phi(v_2), \dots, \Phi(v_{i-1}))\right) \tag{3}$$

However, as the walk length increases, it becomes too expensive to calculate this conditional probability. According to a recent publication (Mikolov et al., 2013), DeepWalk uses one node to predict the context, both the left and right neighbor nodes of the given node, instead of using the context to predict next node. In terms of node feature modeling, it yields the following optimization problem:

$$\text{minimize} \quad -\log \Pr\left(\{v_{i-w}, \dots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\right) \tag{4}$$

To solve the optimization problem, we then employed SkipGram, a neural-network-based language model that maximizes the co-occurrence likelihood of the nodes that appear within the context of node v<sub>i</sub> in the random walk sequence, to approximate the conditional probability in Equation 4 under an independence assumption, as follows:

$$\Pr\left(\{v_{i-w}, \dots, v_{i+w}\} \setminus v_i \mid \Phi(v_i)\right) = \prod_{\substack{j=i-w \\ j \neq i}}^{i+w} \Pr\left(v_j \mid \Phi(v_i)\right) \tag{5}$$
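The independence assumption means that each walk is decomposed into (center, context) pairs, one for every node that falls within the window around the center node. A minimal sketch of this decomposition is given below; the function name and the toy walk are hypothetical, and positions near the ends of the walk simply have a shorter window.

```python
def context_pairs(walk, window):
    """Return (center, context) pairs for all nodes within `window` steps of each center."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center node itself is excluded from its own context
                pairs.append((center, walk[j]))
    return pairs

# Toy walk over hypothetical node names, with window size w = 1.
walk = ["miR-a", "P1", "P2", "miR-b"]
pairs = context_pairs(walk, window=1)
```

Each pair contributes one factor Pr(v<sub>j</sub> | Φ(v<sub>i</sub>)) to the product in Equation 5.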

For each possible association between biological entities within the context of node v<sub>i</sub> in the random walk, we mapped each node v<sub>j</sub> to its current representation vector Φ(v<sub>j</sub>) ∈ ℝ<sup>d</sup> and maximized the posterior probability of its neighbors in the walk. To reduce the computing time, we used Hierarchical Softmax to approximate the probability distribution (Morin and Bengio, 2005; Mnih and Hinton, 2009):

$$\Pr\left(v_j \mid \Phi(v_i)\right) = \prod_{l=1}^{\lceil \log |V| \rceil} \Pr\left(b_l \mid \Phi(v_i)\right) \tag{6}$$

By assigning the nodes to the leaves of a binary tree, we turned the prediction of potential associations between miRNAs and proteins into maximizing the probability of a given path in the hierarchy. The path to node v<sub>j</sub> is represented as a sequence of tree nodes (b<sub>0</sub>, b<sub>1</sub>, ..., b<sub>⌈log |V|⌉</sub>). Moreover, Pr(b<sub>l</sub> | Φ(v<sub>i</sub>)) can be modeled by a binary classifier as follows:

$$\Pr\left(b_l \mid \Phi(v_i)\right) = \frac{1}{1 + e^{-\Phi(v_i) \cdot \Psi(b_l)}} \tag{7}$$

where Ψ(b<sub>l</sub>) ∈ ℝ<sup>d</sup> denotes the representation assigned to the parent of tree node b<sub>l</sub>.

After each node completes the random walk process γ times, a matrix Φ ∈ ℝ<sup>|V|×d</sup>, which denotes the latent representations of the global network, is obtained. Each row of the matrix is a low-dimensional representation vector of a miRNA or a protein in the network. The source code and data of PmiRGO are freely available at http://denglab.org/PmiRGO/.

#### Training the SVM-Based Classifier

Due to the lack of manually curated GO annotations for miRNAs, it is unsatisfactory to build miRNA function predictors based on the miRNAs directly. Therefore, we built the training data sets with GO annotations of proteins downloaded from the GOA database (version 201010) (Huntley et al., 2009). Proteins with lengths 50–100 aa were selected and clustered with a sequence similarity of 90 percent (Deng et al., 2018). Moreover, only one protein was chosen as a representative from each cluster. Representatives without at least one non-IEA (not inferred from electronic annotation) GO term were filtered out. As a result, 243,561 proteins with Gene Ontology annotations were collected.

For each GO term, we trained a classifier with protein samples. More specifically, for each GO term we constructed a true annotation set consisting of proteins that had the GO annotation and a false annotation set consisting of proteins that did not have this GO function. As the Gene Ontology is a directed acyclic graph in which each term is related to one or more other terms in the same domain or another domain (Deng and Chen, 2015; Zeng et al., 2018), a protein related to a GO term is also related to the ancestors of that term. Therefore, the false annotation set was composed of proteins associated with other GO terms (excluding annotated terms and their child nodes). Because the false annotation set contained more protein-GO pairs than the true annotation set, we randomly selected an equal number of negative and positive samples.
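The construction of balanced true and false annotation sets can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: the term identifiers, protein names, and function names are hypothetical, and annotations are first propagated to all ancestors so that a protein annotated with a descendant of the target term counts as positive and is therefore excluded from the negatives.

```python
import random

def ancestors(term, parents):
    """All ancestors of `term` in the GO DAG; `parents` maps a term to its parent terms."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def propagate(annotations, parents):
    """Extend each protein's term set with every ancestor of its annotated terms."""
    full = {}
    for protein, terms in annotations.items():
        closed = set(terms)
        for t in terms:
            closed |= ancestors(t, parents)
        full[protein] = closed
    return full

def balanced_sets(term, annotations, parents, seed=0):
    """Return (positives, negatives) for one GO term with equal-sized sampling."""
    full = propagate(annotations, parents)
    positives = sorted(p for p, terms in full.items() if term in terms)
    candidates = sorted(p for p, terms in full.items() if term not in terms)
    rng = random.Random(seed)
    k = min(len(positives), len(candidates))
    negatives = rng.sample(candidates, k)  # down-sample the larger false set
    return positives, negatives

# Toy ontology and annotations (all identifiers are placeholders).
parents = {"T:child": ["T:mid"], "T:mid": ["T:root"], "T:other": ["T:root"]}
annotations = {
    "P1": {"T:child"},
    "P2": {"T:mid"},
    "P3": {"T:other"},
    "P4": {"T:other"},
    "P5": {"T:other"},
}
positives, negatives = balanced_sets("T:mid", annotations, parents)
```

Here P1 is positive for `T:mid` only through propagation from `T:child`, and two of the three remaining proteins are drawn as negatives.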

Here we employed support vector machines (SVMs) to build the binary classifiers (Yong-Xin et al., 2011). SVMs are widely used in bioinformatics research, including miRNA target prediction, miRNA identification (Wei et al., 2014), RNA methylation prediction (Chen et al., 2017), protein folding (Li et al., 2016), and other tasks (Xiao et al., 2017; Dao et al., 2018; Feng et al., 2018; Pan et al., 2018; Yang et al., 2018; Zhu et al., 2019). We used the radial basis function (RBF) as the kernel function, which achieved a better performance. C is the penalty coefficient of the SVM, which can be regarded as a weight that balances two objectives in the optimization (margin size and classification accuracy). The higher the value of C, the more prone the classifier is to overfitting; conversely, the lower the value of C, the more prone it is to underfitting. To obtain an optimal C for the SVM and γ for the kernel, the performance for each combination of C and γ was evaluated by carrying out a 10-fold cross-validation.
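The parameter search can be outlined as follows. The sketch shows the RBF kernel, a simple 10-fold split, and a grid search over (C, γ); it deliberately leaves the SVM training itself to a caller-supplied `score_fn`, because the text does not say which SVM library or candidate values were used. All function names and the interleaved fold assignment are assumptions made for illustration.

```python
import itertools
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, validation) index lists."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits

def grid_search(score_fn, c_values, gamma_values, n_samples, k=10):
    """Return the (C, gamma) pair with the best mean validation score.

    `score_fn(C, gamma, train_idx, val_idx)` is supplied by the caller: it should
    train an SVM on the training fold and return a score on the validation fold.
    """
    best_params, best_score = None, -math.inf
    for c, gamma in itertools.product(c_values, gamma_values):
        scores = [score_fn(c, gamma, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = (c, gamma), mean
    return best_params, best_score
```

A larger γ makes the kernel decay faster with distance, so γ and C jointly control how closely the decision boundary follows the training samples, which is why the two are tuned together.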

#### RESULTS

#### Benchmarks

To accurately evaluate the performance of PmiRGO, we created an independent test based on the GOA database (Ashburner et al., 2000; The Gene Ontology Consortium, 2017). It consisted of a total of 337 mature miRNAs (named as miRNA2GO-337), each of which had at least one curated GO annotation (not inferred from electronic annotation, non-IEA). The independent test dataset appears in the **Supplementary Table 1**.

#### Evaluation Measures

In PmiRGO, the classifier predicted several probable GO terms with corresponding scores ranging from 0 to 1 for a specific miRNA. The scores denoted the degree of confidence for those GO terms. The final predictions depended on the selected threshold t. For each threshold t, all GO terms predicted for a miRNA with scores equal to or greater than t, together with their ancestors in GO linked by "is a" and "has a" relationships, were collected to build the set of predicted GO terms, denoted as P(t). We used T to denote the set of experimentally validated GO terms. We evaluated the performance of the prediction according to three widely used statistical measures: precision, recall, and F-measure. The definitions of precision and recall are as follows:

$$\text{Pre}_i(t) = \frac{\sum_{g \in G} I\left(g \in P_i(t) \land g \in T_i\right)}{\sum_{g \in G} I\left(g \in P_i(t)\right)} \tag{8}$$

$$\text{Rec}_i(t) = \frac{\sum_{g \in G} I\left(g \in P_i(t) \land g \in T_i\right)}{\sum_{g \in G} I\left(g \in T_i\right)} \tag{9}$$

where g denotes a specific GO term, and G denotes the set of all GO terms used in our work. The indicator function I(x) is stated as follows:

$$I(x) = \begin{cases} 1 & x = \text{true} \\ 0 & x = \text{false} \end{cases} \tag{10}$$
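With sets, the indicator sums in Equations 8 and 9 reduce to set sizes. The sketch below computes the predicted set for one miRNA at a threshold t, including ancestor terms, and then its precision and recall. The function names, term labels, and scores are hypothetical, and returning `None` when a denominator is empty is a choice made here, not something stated in the text.

```python
def predicted_set(scores, threshold, ancestors):
    """P_i(t): terms scoring >= threshold plus all of their ancestors."""
    predicted = set()
    for term, score in scores.items():
        if score >= threshold:
            predicted.add(term)
            predicted |= ancestors.get(term, set())
    return predicted

def precision_recall(predicted, truth):
    """Per-miRNA precision and recall; None marks an undefined ratio."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else None
    recall = hits / len(truth) if truth else None
    return precision, recall

# Toy example with placeholder term labels.
scores = {"t_leaf": 0.9, "t_other": 0.4}
ancestors = {"t_leaf": {"t_mid", "t_root"}, "t_other": {"t_root"}}
truth = {"t_mid", "t_root", "t_x"}

predicted = predicted_set(scores, 0.5, ancestors)
precision, recall = precision_recall(predicted, truth)
```

At t = 0.5 the predicted set contains three terms, two of which are validated, so both precision and recall equal 2/3 in this toy case.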

After all the miRNAs had been predicted, the average precision for each threshold t could be calculated on m(t) miRNAs, each of which had at least one predicted GO term with a score greater than the threshold t. In the same way, the average recall could be calculated from the whole benchmark set of N miRNAs. The average precision and recall are defined as follows:

$$\text{Pre}(t) = \frac{1}{m(t)} \times \sum_{i=1}^{m(t)} \text{Pre}_i(t) \tag{11}$$

$$\text{Rec}(t) = \frac{1}{N} \times \sum_{i=1}^{N} \text{Rec}_i(t) \tag{12}$$

Generally speaking, precision and recall are inversely related. It is not feasible to evaluate the performance of models according to precision or recall alone. To deal with this problem, the maximum F-measure over all thresholds was introduced for the overall evaluation of different models (Zhang J. et al., 2018). It combines the two metrics (precision and recall) to provide a single score. The maximum F-measure is defined as follows:

$$F_{\max} = \max_{t}\left(\frac{2 \times \text{Pre}(t) \times \text{Rec}(t)}{\text{Pre}(t) + \text{Rec}(t)}\right) \tag{13}$$
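Equations 11 to 13 can be combined into a single routine, sketched below under stated assumptions: precision is averaged only over the m(t) miRNAs that have at least one prediction at threshold t, recall is averaged over all N benchmark miRNAs, and every benchmark miRNA has a non-empty set of validated terms. The function name, the data layout, and the toy values are hypothetical.

```python
def f_max(scores, truths, ancestors, thresholds):
    """Maximum F-measure over thresholds.

    scores:    {mirna: {term: confidence}}
    truths:    {mirna: set of validated terms} for all N benchmark miRNAs
    ancestors: {term: set of ancestor terms}
    Returns (best F-measure, threshold at which it is reached).
    """
    n = len(truths)
    best_f, best_t = 0.0, None
    for t in thresholds:
        precisions, recall_sum = [], 0.0
        for mirna, truth in truths.items():
            predicted = set()
            for term, s in scores.get(mirna, {}).items():
                if s >= t:
                    predicted |= {term} | ancestors.get(term, set())
            hits = len(predicted & truth)
            if predicted:  # this miRNA counts toward m(t)
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(truth)  # recall is averaged over all N
        if not precisions:
            continue
        pre = sum(precisions) / len(precisions)
        rec = recall_sum / n
        if pre + rec > 0:
            f = 2 * pre * rec / (pre + rec)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t

# Toy benchmark with two miRNAs and placeholder term labels.
scores = {"m1": {"a": 0.9, "b": 0.3}, "m2": {"c": 0.6}}
truths = {"m1": {"a", "r"}, "m2": {"b", "r"}}
ancestors = {"a": {"r"}, "b": {"r"}, "c": {"r"}}
best_f, best_t = f_max(scores, truths, ancestors, [0.2, 0.5, 0.8])
```

In the toy data the F-measure is about 0.656 at t = 0.2, 0.75 at t = 0.5, and about 0.667 at t = 0.8, so the maximum is reached at the middle threshold.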

#### The Effects of Feature Dimensions

As described above, the latent representations of each node in the network act as its low-dimensional topological features. The number of dimensions might have a significant effect on the functional annotations of miRNAs. To assess the influence of the hyper-parameter on the prediction performance, we performed an independent test on the miRNA2GO-337 dataset across a wide range of values for the dimensions. For simplicity, we preset the other parameters, including the number of walks started from one node (n), the walk length (t), and the window size (w), in DeepWalk. The three parameters were selected by conducting experiments of different parameter values and choosing the combination with the best performance (n = 100, t = 80, w = 16).

**Figure 2** shows the Fmax values when the number of dimensions ranges from 128 to 1024. The results demonstrated that the Fmax reached the max value when the dimension increased to 512. However, as the dimension increased beyond this value, the performance decreased accordingly. Hence, 512 was chosen as the dimensions of the feature vector. It is important to note that the SkipGram model based on Hierarchical Softmax of DeepWalk algorithm is a neural network model and its output layer corresponds to a binary tree. Therefore, the dimensions of the latent representations of the model should be a power of two.

#### The Effects of PPI Data

In our method, protein interaction data were incorporated to help improve the effectiveness of the functional annotations of the miRNAs. To confirm this, PmiRGO was run on two different network configurations: the global network (consisting of a miRNA co-expression network, miRNA-target interaction network, and PPI network), and the network without PPIs. The comparison was performed in terms of Fmax when the parameters (n, t, w, d) were set to 100, 80, 16, and 512, respectively. The results are shown in **Table 1**. The Fmax value was 0.310 for the global network and 0.252 for the network without PPIs. The performance increased by ∼23% with the addition of PPI data. This experiment demonstrated that integrating multiple types of information about other relevant biological entities (e.g., proteins) substantially improved the performance of predicting miRNA function.
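The relative gain quoted above follows directly from the two Fmax values, taking the network without PPIs as the baseline:

$$\frac{0.310 - 0.252}{0.252} = \frac{0.058}{0.252} \approx 0.230$$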

# Comparison of Different Network Representation Algorithms

Recent studies have demonstrated that network representation learning is effective in machine learning tasks such as tag recommendation (Tu et al., 2014), vertex classification (Sen et al., 2008), and link prediction (Lü and Zhou, 2011; Yang et al., 2015). Many methods have been proposed to address these tasks, most of which exploit network structure for learning, such as DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016), hin2vec (Fu et al., 2017), and metapath2vec (Dong et al., 2017). DeepWalk uses information extracted locally from truncated random walks in order to learn latent representations. Building on DeepWalk, node2vec defines a strategy for generating biased random walks that uses both BFS and DFS to retain different kinds of network structure information. Unlike DeepWalk and node2vec, hin2vec and metapath2vec were proposed for heterogeneous information networks. They were designed to capture rich semantics by exploiting different types of relationships among nodes in the form of meta-paths.

In this paper, we compared DeepWalk, hin2vec, and metapath2vec in terms of predicting GO annotations of miRNAs. For the sake of fairness, we used the same global network constructed above, multi-classification models, and benchmarks. **Figure 3** demonstrates that DeepWalk significantly outperforms hin2vec and metapath2vec in terms of precision and Fmax. Hence, DeepWalk was employed to extract the topological features in our work.

#### Performances

To evaluate the performance of PmiRGO further, we compared it with the state-of-the-art method miEAA (Backes et al., 2016). miEAA is a tool that uses enrichment analysis to perform the functional analysis of sets of miRNAs based on GeneTrail (Backes et al., 2007). Compared to GeneTrail, miEAA was designed for human miRNA precursors and mature miRNAs. The miRNA2GO-337 dataset was utilized to assess the performance of the different methods. Since 53.5% of the functional annotations of miRNAs are biological process (BP) terms, according to the statistics of the Gene Ontology Consortium database (Ashburner et al., 2000), and since miRNAs are involved in biological processes when they interact with other entities, we only evaluated the performance in terms of BP terms.

The prediction performance of the two methods is presented in **Figure 4**. It is quite apparent that PmiRGO outperforms miEAA. PmiRGO achieved an Fmax of 0.310 on BP terms, while miEAA reached 0.282, an increase of about 0.03. Also, the recall of PmiRGO reached 0.277 when the Fmax achieved the highest value, whereas the recall of miEAA was 0.235. **Figure 5** shows that the precision-recall curve of PmiRGO is entirely above the curve of miEAA, which means that our method significantly outperforms miEAA. We calculated the P-value with a two-tailed, paired t-test to compare the performances of PmiRGO and miEAA. In each round, we randomly selected 50 miRNAs from the miRNA2GO-337 dataset and calculated the Fmax scores for both PmiRGO and miEAA. We repeated the procedure 30 times and obtained 30 paired Fmax scores. We calculated the P-value using MATLAB. A P-value threshold of 0.05 was used to denote statistical significance. The Fmax of PmiRGO was higher than that of miEAA, a difference that was statistically significant (P = 1.86e-05).
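The resampling and the paired test statistic can be sketched as below. This is not the MATLAB procedure used by the authors: the function names are hypothetical, sampling without replacement is an assumption, and the sketch stops at the t statistic and its degrees of freedom, since converting them to a two-tailed P-value requires the Student t distribution, which the Python standard library does not provide.

```python
import math
import random

def resample_ids(ids, sample_size, repeats, seed=0):
    """Draw `repeats` random subsets of `sample_size` ids (without replacement)."""
    rng = random.Random(seed)
    return [rng.sample(ids, sample_size) for _ in range(repeats)]

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for equal-length samples a and b.

    Assumes the paired differences are not all identical (non-zero variance).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# 30 subsets of 50 miRNAs drawn from a benchmark of 337 (indices stand in for miRNA IDs).
subsets = resample_ids(list(range(337)), sample_size=50, repeats=30)

# Made-up paired scores, only to show the call; these are not results from the paper.
t_stat, dof = paired_t_statistic([0.31, 0.33, 0.30, 0.32], [0.28, 0.29, 0.27, 0.30])
```

In the described experiment, each subset would yield one Fmax score per method, and the 30 resulting pairs would be passed to the paired test with 29 degrees of freedom.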

Moreover, the coverage of the two prediction methods on the miRNA2GO-337 dataset was compared. The coverage is defined as the number of miRNAs predicted correctly, a measure that reflects robustness. As presented in **Figure 6**, PmiRGO correctly annotated 205 miRNAs out of 337 miRNA samples, while miEAA successfully predicted 174 miRNAs, demonstrating that our method is more robust than miEAA.

#### Case Study

To illustrate the performance of this prediction method in a real case study, we applied PmiRGO to predict the functions of the miRNA hsa-miR-124-3p. hsa-miR-124-3p is highly conserved and plays an essential role in mediating tumor growth and the occurrence and development of cancer. Recent studies have used high-throughput sequencing to demonstrate that hsa-miR-124-3p is differentially expressed between normal brain tissue and glioblastoma multiforme (GBM). Moreover, hsa-miR-124-3p overexpression markedly inhibits GBM cell proliferation, migration, and tumor angiogenesis, which results in cell cycle arrest and GBM apoptosis putatively via the activation of the NRP-1-mediated PI3K/Akt/NFκB pathway in GBM cells, as well as suppressing tumor growth and reducing tumor angiogenesis (Zhang G. et al., 2018). In addition, hsa-miR-124-3p regulates the expression of the CD151 protein by binding to the 5′UTR to take part in the development of gastric cancer (Sheng et al., 2009).

As a result, hsa-miR-124-3p was annotated with 250 GO terms in total, the top 31 of which had a probability score >0.9, as shown in **Table 2**. Of the four most probable GO terms, GO:0006915 (apoptotic process), responsible for the process of programmed cell death when a cell receives an internal or external signal, and GO:0006725 (cellular aromatic compound metabolic process), the chemical reactions and pathways involving aromatic compounds, were indirectly related to the occurrence and development of diseases, particularly cancer and tumors. In addition, the predicted GO terms GO:0008219 (cell death) (ranked 5th), GO:0048468 (cell development) (ranked 7th), and GO:0009987 (cellular process) (ranked 30th) were associated with adenocarcinoma of the lung, breast neoplasms, and colonic neoplasms. Moreover, those GO terms related to metabolic processes, such as GO:0006259 (DNA metabolic process) (ranked 9th), GO:0019216 (regulation of lipid metabolic process) (ranked 12th), and GO:0031323 (regulation of cellular metabolic process) (ranked 15th), were associated with the production of the gene products TCEAL7 and TNFRSF1A, which may promote the occurrence of prostatic neoplasms, lung diseases, and gastric cancer.

# DISCUSSION

Computational function prediction of miRNAs by integrating varieties of miRNA-related biological information is emerging as a tool to elucidate the role of miRNAs in development and for inferring the biological functions of miRNAs. In our work, we proposed a novel approach, PmiRGO, to predict their function. Specifically, we constructed a global heterogeneous network by integrating expression profiles, miRNA-target interactions, and PPI data. Then, DeepWalk, an approach used for learning online social representations, was employed to learn the latent network features of the global network. Finally, we employed SVM to build multi-classification models for predicting the GO annotations.

PmiRGO was evaluated on the independent dataset miRNA2GO-337. In terms of Fmax and coverage, PmiRGO outperformed miEAA. Moreover, the results demonstrate that the protein interaction data contribute to the improvement of prediction performance for miRNAs. The performance of our method can be attributed to several factors. First, the experimentally validated miRNA-target gene interactions, manually curated from reporter assay, blot, and microarray experiments, were utilized. More reliable positive information significantly improves the performance of PmiRGO. Second, we used the miRNA expression profiles to construct a miRNA co-expression network, which is useful for predicting the miRNAs involved in co-regulating one target gene. Finally, the PPI network was introduced into the global network, allowing the performance of function prediction to benefit from the variety of biological entities.

## REFERENCES


However, there are still further improvements to be made to our method. Firstly, the experimentally validated miRNAtarget gene interactions were sparse. A greater number of validated interactions could enhance the effect of PmiRGO further. Secondly, the expression profiles we used covered only a part of human miRNAs, and the coverage of the expression information was not enough. As such, more reliable miRNA expression profiles need to be collected. Thirdly, more types of biological entities could also be introduced to the global network. Others works, including miRNA family information (Zou et al., 2014) and miRNA-disease networks (Zou et al., 2016; Liao et al., 2018; Zeng X. et al., 2018), would also be useful in this study. This should be the focus of future works.

#### AUTHOR CONTRIBUTIONS

LD, JW, and JZ conceived this work and designed the experiments. LD and JW carried out the experiments. LD, JW, and JZ collected the data and analyzed the results. LD, JW, and JZ wrote, revised, and approved the manuscript.

#### FUNDING

This work was supported by the National Natural Science Foundation of China [grant number 61672541] and the Natural Science Foundation of Hunan Province [grant number 2017JJ3412].

#### ACKNOWLEDGMENTS

We would like to thank the Experimental Center of School of Software of Central South University, for providing computing resources.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00003/full#supplementary-material

Supplementary Table 1 | The miRNA2GO-337 dataset.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Deng, Wang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Construction of Complex Features for Computational Predicting ncRNA-Protein Interaction

Qiguo Dai<sup>1,2</sup>, Maozu Guo<sup>3</sup>, Xiaodong Duan<sup>2</sup>, Zhixia Teng<sup>4</sup>\* and Yueyue Fu<sup>5</sup>

*<sup>1</sup> School of Computer Science and Engineering, Dalian Minzu University, Dalian, China, <sup>2</sup> Dalian Key Laboratory of Digital Technology for National Culture, Dalian Minzu University, Dalian, China, <sup>3</sup> School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China, <sup>4</sup> School of Information and Computer Engineering, Northeast Forestry University, Harbin, China, <sup>5</sup> Department of Hematology, The First Affiliated Hospital of Harbin Medical University, Harbin, China*

Non-coding RNA (ncRNA) plays important roles in many critical regulation processes. Many ncRNAs perform their regulatory functions in the form of RNA-protein complexes. Therefore, identifying the interactions between ncRNAs and proteins is fundamental to understanding the functions of ncRNAs. Because experimental techniques are expensive, developing an accurate computational predictive model has become an indispensable way to identify ncRNA-protein interactions. A powerful prediction model of ncRNA-protein interaction needs a good feature set for characterizing the interaction. In this paper, a novel method is put forward to generate complex features for characterizing ncRNA-protein interaction (named CFRP). To obtain a comprehensive description of ncRNA-protein interaction, complex features are generated by non-linear transformations of the traditional k-mer features of ncRNA and protein sequences. To reduce the dimensions of the complex features, a group of discriminative features is selected by random forest. To validate the performance of the proposed method, a series of experiments is carried out on several widely used public datasets. Compared with the traditional k-mer features, the CFRP complex features boost the performance of the ncRNA-protein interaction prediction model. The CFRP-based prediction model is also compared with several state-of-the-art methods, and the results show that the proposed method achieves better performance than the others in terms of the evaluation metrics. In conclusion, the complex features generated by CFRP are beneficial for building a powerful prediction model of ncRNA-protein interaction.

Keywords: ncRNA-protein interaction, complex feature, feature construction, feature selection, random forest

#### Edited by:

*Arun Kumar Sangaiah, VIT University, India*

#### Reviewed by:

*Leyi Wei, Tianjin University, China Vincenzo Bonnici, University of Verona, Italy*

\*Correspondence: *Zhixia Teng tengzhixia@nefu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *02 November 2018* Accepted: *14 January 2019* Published: *01 February 2019*

#### Citation:

*Dai Q, Guo M, Duan X, Teng Z and Fu Y (2019) Construction of Complex Features for Computational Predicting ncRNA-Protein Interaction. Front. Genet. 10:18. doi: 10.3389/fgene.2019.00018*

# 1. INTRODUCTION

The Encyclopedia of DNA Elements (ENCODE) project has revealed that most RNAs in the human transcriptome are non-coding RNAs (ncRNAs), which do not code for proteins (ENCODE Project Consortium, 2012). As critical regulatory molecules, ncRNAs can regulate gene expression at different stages, such as epigenetic inheritance, transcription and post-transcription (Quan et al., 2015; Zeng et al., 2017). They participate in various cellular processes such as chromatin modification, transcriptional regulation, translation and post-translational modification (Yarmishyn and Kurochkin, 2015; Yotsukura et al., 2016). Increasing evidence shows that ncRNAs are closely related to many major diseases that seriously endanger human health and life (Chen et al., 2013; Tang et al., 2017). However, the functional mechanisms of most ncRNAs remain to be further studied and determined. It is worth noting that many ncRNAs perform their functions by forming RNA-protein complexes (Zhu et al., 2013). For example, as a scaffolding molecule, HOTAIR RNA binds the PRC2 and LSD1 protein complexes at its 5′ and 3′ ends, respectively, and is thereby involved in histone methylation (Tsai et al., 2010). It has been found that over-expression of HOTAIR can induce genome-wide relocation of the PRC2 protein complex, which can silence tumor suppressor genes and thus promote the development and metastasis of malignant tumors such as breast cancer and liver cancer (Gupta et al., 2010). Xist RNA, which regulates X chromosome inactivation, can interact with more than 80 proteins in many biological processes (Chu et al., 2015). These examples show that there are pervasive synergistic relationships between ncRNAs and proteins, which play important roles in cellular activities and disease regulation. Therefore, determining the interactions between the large number of ncRNAs and proteins is of great significance for revealing the molecular mechanisms of ncRNAs in human diseases and biological processes.

Recently, experimental techniques such as CLIP-seq, RIP-seq and fRIP-seq have been developed for uncovering ncRNA-protein interactions (Ferrè et al., 2016). Many significant findings have been obtained by using these methods. However, they remain expensive, time-consuming and labor-intensive (Luo et al., 2017). Therefore, it is important to develop an accurate computational method for predicting ncRNA-protein interactions to support and supplement experimental studies of ncRNA function. Computational prediction of ncRNA-protein relationships has attracted much attention in the fields of ncRNA research and computational biology. It can be roughly divided into the prediction of interaction pairs and the prediction of binding sites. The former predicts whether a given ncRNA and protein interact. The latter focuses on the interaction between amino acid residues in the protein and nucleotide bases in the RNA. In this paper, we focus on predicting interactions between ncRNA and protein molecules. Previous methods for predicting ncRNA-protein interaction can be roughly divided into machine learning based methods and network based methods. Machine learning based methods predict novel ncRNA-protein interactions by training a machine learning model on known interaction data (Ferrè et al., 2016). Network based methods usually construct a heterogeneous network from known ncRNA-protein interactions and predict novel edges between ncRNA and protein nodes within the network by using a link prediction algorithm, as in Zhang et al. (2018). In this study, we focus on the machine learning based approach, because it is able to predict interactions between molecules that are not present in the training data.

Many machine learning based methods have been developed for predicting ncRNA-protein interaction in the past decade (Lu et al., 2013; Yang et al., 2013). For example, catRAPID, proposed in Bellucci et al. (2011), is one of the earliest machine learning based methods for predicting ncRNA-protein interactions. It extracts sequential features from the primary sequences of an ncRNA-protein pair and trains a prediction model based on support vector machine and random forest. Cheng et al. (2015) proposed a method named PRIPU, which builds a biased support vector machine model to tackle the imbalance between positive and negative samples in the dataset. Pan et al. (2016) employed a deep learning model with a stacked ensemble technique. Most previous methods focused on using more powerful machine learning algorithms. In fact, in addition to the machine learning model, how to extract a set of good features that appropriately characterize the properties of the samples is another critical problem for improving prediction performance (Zou et al., 2016b). It is worth noting that developing a powerful feature extraction method for a specific set of samples usually requires considering the field the samples come from, because the properties of the actual objects corresponding to the samples may differ greatly between fields. In other words, the properties of the samples themselves should be considered when generating features for training a prediction model. With respect to the interaction between ncRNA and protein, each sample in the dataset is composed of two primary sequences that correspond to a pair of ncRNA and protein molecules, respectively. When characterizing an ncRNA-protein pair, most existing methods extract the sequential feature vectors from the two molecules separately, and then directly concatenate the two feature vectors into one vector that is taken as the feature of the given pair.
For example, as in **Figure 1A**, for a pair of ncRNA and protein, their k-mer feature vectors are first extracted, which we denote as R and P, respectively. Then, R and P are directly concatenated into the feature vector of the ncRNA-protein pair. This is an easy way to characterize a pair composed of two distinct molecules. However, this kind of simple feature concatenation does not consider the correlation between the two molecular features, which may be critical for understanding the interactive properties of these molecules. In fact, the interaction between an RNA and a protein is usually formed by physical contacts between the amino acid residues and nucleotide bases at the interface (Hudson and Ortlund, 2014). Consequently, to characterize the interaction between the two molecules, it is necessary not only to extract separate features from the individual molecules, but also to capture the complex relations between these features.

In this paper, we propose a framework for constructing Complex Features to predict ncRNA-Protein interactions (CFRP). The complex features are generated by applying fusion methods to the traditional individual molecule features (base features) of ncRNA and protein. The motivation behind constructing the complex features is to emphasize the complex relations between the base features of different molecules, and thus to characterize the interactive activities between a pair of RNA and protein. In particular, complex features are constructed by using one or more non-linear operations on the two base features. As in **Figure 1B**, let r<sub>i</sub> be an element of the RNA base feature and p<sub>j</sub> an element of the protein base feature. A complex feature f<sub>i,j</sub> is the result of applying a specific non-linear operation to r<sub>i</sub> and p<sub>j</sub>. A variety of non-linearities were investigated for constructing the complex features in this work, in an attempt to comprehensively characterize the interactive properties between the two molecules. Furthermore, feature selection based on random forest (RF) is employed to reduce the dimensions of the constructed complex features, which makes them concise and efficient for training prediction models. To investigate the effectiveness of the proposed CFRP method, the complex features constructed by the method are employed to train a machine learning model (CFRP model) for predicting ncRNA-protein interactions. We conducted extensive tests of the CFRP model on several widely used public datasets. The experimental results demonstrate that the complex features constructed by CFRP yield a prediction model with better performance than the traditional k-mer features. Compared with other state-of-the-art methods, the CFRP model achieves better prediction performance on many metrics. In particular, on the Sum metric, the CFRP method is superior to the other methods on all datasets. In conclusion, CFRP can produce a set of discriminative features for the task of predicting ncRNA-protein interactions.

# 2. METHODS

In this section, we put forward a novel framework for constructing Complex Features that characterize ncRNA-Protein interaction (CFRP). It generates complex features by applying a set of non-linear transformations to the traditional k-mer sequential features (base features). As shown in **Figure 2**, the framework consists of several steps: base feature extraction, complex feature generation, and feature ranking and selection. Specifically, CFRP first extracts traditional k-mer features from a pair of RNA and protein as their base features, respectively; then, a set of complex features is constructed by applying different kinds of non-linear transformations to the extracted base features; finally, the generated complex features are ranked by the feature importances derived from a trained random forest model, and the top-k most important features are chosen as the final features of the input ncRNA-protein pair. The generated complex features can be used to train a powerful prediction model for ncRNA-protein interaction.

#### 2.1. Base Feature Extraction

Given the sequences of an ncRNA-protein pair, we first extract k-mer features from each of them as their base features. For the RNA, an m-mer feature is extracted from its primary sequence, where each element is the frequency of one combination of m successive bases in the sequence. As there are four kinds of nucleotides (A, C, G, U) in RNA, the base feature vector R<sup>base</sup> (Equation 1) has 4<sup>m</sup> dimensions.

$$R^{base} = \{r_1, r_2, \dots, r_i, \dots, r_{4^m}\} \tag{1}$$

For the protein, there are 20 kinds of amino acid residues in its primary sequence. Let n be the length of the k-mer for the protein. If we computed n-mer frequencies directly, we would get a feature vector with 20<sup>n</sup> dimensions. So many dimensions would be too expensive for the subsequent construction of complex features. To reduce the dimensions of the protein base feature vector, we group the 20 kinds of residues into several subsets following Shen et al. (2007). They proposed a strategy that classifies the 20 amino acid residues into 7 groups based on their physicochemical properties: {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E}, and {C}. This grouping reduces the computational cost without significantly reducing the ability to characterize the protein sequence. With this grouping, the original protein sequence is translated into a new string over an alphabet of 7 characters. Then, a 7<sup>n</sup>-dimensional n-mer vector P<sup>base</sup> (Equation 2) is extracted from the new string as the base feature of the input protein. Each element p<sub>j</sub> in the vector is the frequency of a certain n-mer in the translated string.

$$P^{base} = \{p_1, p_2, \dots, p_j, \dots, p_{7^n}\} \tag{2}$$

The base feature vectors extracted from RNA and protein (R<sup>base</sup> and P<sup>base</sup>) are normalized by the sequence lengths of RNA and protein, respectively.
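As a minimal sketch of the base feature extraction described above, the following Python code computes length-normalized k-mer frequencies for an RNA sequence and for a protein sequence after the 7-group reduction. The function names, the digit labels used for the seven groups, the reading of T as U, and the skipping of k-mers that contain unknown symbols are assumptions made for illustration; they are not details taken from the CFRP implementation.

```python
from itertools import product

# 7-group reduction of the 20 amino acids (Shen et al., 2007), as listed in the text.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
# Each residue is mapped to a digit "0".."6" (the digit labels are an assumption).
AA_TO_GROUP = {aa: str(g) for g, members in enumerate(GROUPS) for aa in members}


def kmer_frequencies(seq, alphabet, k):
    """Return the k-mer count vector of seq divided by the sequence length."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0.0] * len(index)
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # assumption: k-mers with unknown symbols are skipped
            vec[pos] += 1.0
    length = max(len(seq), 1)
    return [count / length for count in vec]


def rna_base_feature(rna, m):
    """4^m-dimensional base feature of an RNA sequence (Equation 1)."""
    # Assumption: a DNA-style T is read as U.
    return kmer_frequencies(rna.upper().replace("T", "U"), "ACGU", m)


def protein_base_feature(protein, n):
    """7^n-dimensional base feature of a protein sequence (Equation 2)."""
    reduced = "".join(AA_TO_GROUP.get(aa, "X") for aa in protein.upper())
    return kmer_frequencies(reduced, "0123456", n)
```

For instance, `rna_base_feature("ACGU", 2)` returns a vector of length 16 and `protein_base_feature("AGVC", 2)` a vector of length 49.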

#### 2.2. Complex Feature Construction

To represent the complex relation between ncRNA and protein, a set of complex features is generated by applying non-linear transformations to the base features of the individual RNA and protein sequences. In particular, non-linear operations such as the geometric mean, the harmonic mean and power operations are introduced. As shown in **Figure 2**, given the base features R<sup>base</sup> and P<sup>base</sup> of a pair of RNA and protein, a set of complex features GM, HM, PowRP, and PowPR is generated. The details are as follows:

the GM feature,

$$GM = \{ gm_{ij} \mid 1 \le i \le 4^m, 1 \le j \le 7^n \} \tag{3}$$

where gm<sub>ij</sub> is the geometric mean of r<sub>i</sub> and p<sub>j</sub>,

$$gm_{ij} = \sqrt{r_i \times p_j},$$

the HM feature,

$$HM = \{ hm_{ij} \mid 1 \le i \le 4^m, 1 \le j \le 7^n \} \tag{4}$$

where hm<sub>ij</sub> is the harmonic mean of r<sub>i</sub> and p<sub>j</sub>,

$$hm_{ij} = 2(r_i \times p_j)/(r_i + p_j),$$

and the power operation features,

$$PowRP = \{ pr_{ij} = \log r_i^{p_j} \mid 1 \le i \le 4^m, 1 \le j \le 7^n \} \tag{5}$$

$$PowPR = \{ pp_{ij} = \log p_j^{r_i} \mid 1 \le i \le 4^m, 1 \le j \le 7^n \} \tag{6}$$

By means of the above non-linear operations, we get a 4 × 4<sup>m</sup> × 7<sup>n</sup>-dimensional feature vector that consists of four kinds of complex features. These raw CFRP features can be used to describe the interactive activities between RNA and protein more comprehensively.
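The four transformations in Equations 3 to 6 can be sketched in a few lines of Python. The text does not say how zero frequencies are handled, so the small offset `eps` inside the logarithm and the value 0 returned for the harmonic mean when both inputs are zero are assumptions. The function name and the flat output layout (GM, HM, PowRP and PowPR blocks, in that order) are also illustrative choices.

```python
import math

EPS = 1e-6  # assumption: small offset so that log() stays defined at zero frequency


def complex_features(r_base, p_base, eps=EPS):
    """Return the GM, HM, PowRP and PowPR blocks concatenated into one flat list."""
    gm, hm, pow_rp, pow_pr = [], [], [], []
    for r in r_base:          # i runs over the RNA base feature
        for p in p_base:      # j runs over the protein base feature
            gm.append(math.sqrt(r * p))
            # assumption: the harmonic mean is set to 0 when r + p == 0
            hm.append(2.0 * r * p / (r + p) if r + p > 0 else 0.0)
            pow_rp.append(p * math.log(r + eps))  # log(r^p) = p * log(r)
            pow_pr.append(r * math.log(p + eps))  # log(p^r) = r * log(p)
    return gm + hm + pow_rp + pow_pr
```

With base features of length 4<sup>m</sup> and 7<sup>n</sup>, the returned list has 4 × 4<sup>m</sup> × 7<sup>n</sup> entries, matching the dimension stated above.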

#### 2.3. Feature Selection

Complex features with thousands of dimensions may give rise to the curse of dimensionality. A high-dimensional feature space causes several problems, such as data sparseness, overfitting of the prediction model and high computational cost (Li et al., 2016). To reduce these adverse effects, feature selection should be conducted on the high-dimensional space to keep the more informative features and remove the less important ones. Feature selection has proven effective and efficient for solving a variety of machine learning problems on high-dimensional data (Zou et al., 2016a). In this work, feature selection on the complex features constructed above is a two-step process: first, all features are ranked in descending order of importance according to their contribution to the classification; then, the top-k most important features are selected as the final features. To obtain the feature importances, we employed a random forest (RF) based model. In a decision tree, features used near the top of the tree contribute to the final prediction decision for a larger fraction of the input samples. Consequently, we can estimate the importance of a feature from its contribution to the prediction in the tree. A random forest is composed of a set of decision trees. By averaging the feature importance over the multiple trees in a forest, we obtain importance estimates with lower variance for a given prediction task (Zhou et al., 2016). After estimating the feature importances, we take the top-k features by importance as the final features for characterizing RNA-protein interactions.
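The two-step selection (rank by random-forest importance, then keep the top k) might look as follows with scikit-learn, which the article names as its implementation library. This is a sketch under assumptions: the helper name `select_top_k`, the number of trees, the random seed and the synthetic toy data are all illustrative and do not come from the CFRP source code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def select_top_k(X, y, k, n_estimators=100, seed=0):
    """Rank features by random-forest importance and return the indices of the top k."""
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    forest.fit(X, y)
    # feature_importances_ is averaged over all trees in the forest.
    order = np.argsort(forest.feature_importances_)[::-1]  # descending importance
    return order[:k]


# Synthetic toy data, for illustration only: feature 0 alone determines the label.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 0] > 0.5).astype(int)

top = select_top_k(X, y, k=3)
X_selected = X[:, top]  # reduced feature matrix used to train the final model
```

On this toy data the informative feature (index 0) is ranked first; on real complex features, k would be a tuning parameter.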

#### 2.4. Training a CFRP Model

The CFRP complex features can be used to train a prediction model for ncRNA-protein interaction on a given dataset. In general, training a CFRP-based prediction model follows the steps of the framework described above: base feature extraction, complex feature construction, feature ranking and selection, and training of a classifier on the selected features.

In accordance with this framework, we trained a CFRP-based model using random forest for the following experimental tests and performance evaluation.

# 3. EXPERIMENTAL RESULTS

To validate the proposed method of constructing complex features for ncRNA-protein interaction, the CFRP model was tested on a set of high-quality public datasets. Several metrics widely used in the field were employed to measure the performance of the prediction model. A series of experimental tests was carried out: (1) the properties of the CFRP feature under different k-mer lengths; (2) the effectiveness of the proposed CFRP feature; (3) comparison of the CFRP model with other state-of-the-art methods. The CFRP source code and other related resources can be downloaded from the website (http://www.dailab.cn/CFRP/index.html).

#### 3.1. Datasets

Three datasets were adopted to test the proposed CFRP method: RPI369 (Muppirala et al., 2011), RPI488 (Pan et al., 2016) and RPI2241 (Muppirala et al., 2011), all of which are widely used in the field of predicting ncRNA-protein interaction (Muppirala et al., 2011; Lu et al., 2013; Pan et al., 2016). All of these datasets consist of non-redundant, experimentally validated ncRNA-protein interaction pairs extracted from the three-dimensional structures of RNA-protein complexes in the Protein Data Bank (PDB) (Westbrook et al., 2002). Summary information on these datasets is listed in **Table 1**. In detail, RPI369 consists of 369 experimentally validated RNA-protein interactions as positive samples and the same number of negative RNA-protein pairs, where the negative samples are generated by randomly pairing proteins and RNAs such that the pair is not present in the positive sample set. RPI488 was obtained by Pan et al. (2016) and is extracted from 18 ncRNA-protein complexes, also in PDB. It consists of 488 samples, including 243 RNA-protein interactions and 245 non-interactions. RPI2241 includes 2,241 experimentally validated RNA-protein interactions and the same number of negative pairs, where the negative samples are generated in the same way as for RPI369.

TABLE 1 | ncRNA-protein datasets used in this study.


#### 3.2. Performance Metrics

Several metrics are employed for measuring the performance of predicting ncRNA-protein interaction, including Accuracy (Acc), Sensitivity (Sen), Specificity (Spe), Precision (Pre), Matthews correlation coefficient (Mcc) and the area under the curve (AUC) of the receiver operating characteristic (ROC). The details of these metrics are as follows,

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{8}$$

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{9}$$

$$Precision = \frac{TP}{TP + FP} \tag{10}$$

$$\text{Mcc} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}} \tag{11}$$

where TP and TN represent the numbers of correctly predicted positive and negative samples, respectively, and FP and FN represent the numbers of samples wrongly predicted as positive and negative, respectively. These metrics are widely used for measuring the performance of prediction models in this field (Bellucci et al., 2011; Cheng et al., 2015; Pan et al., 2016) and in other related fields of bioinformatics (Su et al., 2018; Wei et al., 2018). In addition, since none of the above metrics is a gold standard, we also use the sum of the above 6 metrics (Sum) to measure the overall performance of the prediction model. To reduce the variance of the performance estimates, all experimental results for the CFRP model are obtained by k-fold cross-validation (Arlot and Celisse, 2010). In detail, the sample set is divided into k subsets of equal size. For each round of training, one subset is taken as the testing set and the remaining ones as the training set. The average performance metrics of these k models are taken as the final performance of the model.
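To make the metric definitions concrete, the sketch below computes Equations 7 to 11 from confusion-matrix counts, an AUC from predicted scores, and the Sum of the six metrics. The function names are hypothetical, the AUC is computed with the standard pairwise-ranking formulation (ties counted as one half), and the code assumes that every denominator is non-zero.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Acc, Sen, Spe, Pre and Mcc from confusion-matrix counts (Equations 7 to 11)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    pre = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"Acc": acc, "Sen": sen, "Spe": spe, "Pre": pre, "Mcc": mcc}


def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sum_metric(tp, tn, fp, fn, auc):
    """Sum of the six metrics: Acc + Sen + Spe + Pre + Mcc + AUC."""
    return sum(confusion_metrics(tp, tn, fp, fn).values()) + auc
```

For example, with TP = 40, TN = 30, FP = 10 and FN = 20, the accuracy is 0.7 and the precision is 0.8. In k-fold cross-validation, these values would be computed once per fold and then averaged.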

#### 3.3. CFRP Feature With Different Length of K-mer

As described, CFRP generates complex features on the basis of the base features of RNA and protein. Using different k-mer lengths in the base features therefore affects both the performance and the computing efficiency of the CFRP method. We thus study some properties of the CFRP feature, such as its computing cost and discriminative performance, when using different k-mer lengths in the base features.

TABLE 3 | The effects of different k-mer lengths on the prediction performance (10-fold cross-validation).

*The values in bold represent the best values obtained by the three methods with different k-mer lengths on a certain dataset.*

#### 3.3.1. Time and Memory Consumptions for Constructing CFRP

Let m and n be the k-mer lengths for RNA and protein, respectively. The values m = 2, 3, 4 and n = 2, 3, 4 were tested, and complex features were constructed by CFRP for each setting of m and n. The computer used for the experiments was equipped with an E7-4809 v4 CPU and 64G of memory, and ran Ubuntu 16.04. Python 3.6.7 and scikit-learn 0.19.1 (Pedregosa et al., 2013) were used for the implementation. The running time and memory occupied by CFRP for building complex features on the different datasets under different values of m and n are shown in **Figures 4**, **5**, respectively. As can be seen from the figures, running time and memory usage both increase markedly with the values of m and n. For example, on the RPI369 dataset, when m = 2 and n = 2, only about 6 s and 0.1G of memory were consumed; when m, n = 4, it consumed about 3,562 s of running time and 20.2G of memory. Due to the limited memory of our computer, we were not able to obtain the CFRP feature and train a model on the RPI2241 dataset at this setting.
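A back-of-the-envelope calculation helps explain why memory grows so quickly. The sketch below counts the raw complex features, 4 × 4<sup>m</sup> × 7<sup>n</sup>, and estimates the size of a dense sample-by-feature matrix. The dense 8-byte (float64) storage is an assumption about the implementation, and the helper names are hypothetical, so the figures are rough lower bounds rather than a reproduction of the reported measurements.

```python
def cfrp_dimensions(m, n, kinds=4):
    """Number of raw complex features: kinds x 4^m x 7^n."""
    return kinds * 4 ** m * 7 ** n


def dense_matrix_gib(n_samples, n_features, bytes_per_value=8):
    """Size in GiB of a dense sample-by-feature matrix (float64 assumed)."""
    return n_samples * n_features * bytes_per_value / 2 ** 30


for m, n in [(2, 2), (3, 3), (4, 4)]:
    d = cfrp_dimensions(m, n)
    print(m, n, d,
          round(dense_matrix_gib(2 * 369, d), 2),    # RPI369: 369 positive + 369 negative pairs
          round(dense_matrix_gib(2 * 2241, d), 2))   # RPI2241: 2,241 positive + 2,241 negative pairs
```

Under these assumptions, m = n = 4 gives 2,458,624 raw features. The feature matrix alone would take roughly 13.5 GiB for RPI369 and about 82 GiB for RPI2241, which is consistent with the 20.2G reported for RPI369 and with RPI2241 exceeding the 64G of memory available.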

#### 3.3.2. Comparison of CFRP-Model Using Different Classifiers

Several classic machine learning techniques, namely random forest (RF), support vector machine (SVM) and logistic regression (LR), were employed for training CFRP models. The models are denoted as CFRP-RF, CFRP-SVM and CFRP-LR according to the learning technique used. These models were trained on the RPI369, RPI488, and RPI2241 datasets under different values of m and n. Both 3-fold and 10-fold cross-validation were used for evaluating the performance of the models, and the results are shown in **Tables 2**, **3**, respectively. Each value in the tables represents the Sum performance obtained by a model under specific m and n values on a certain dataset, and the value in bold denotes the best one on each dataset. As shown in the tables, CFRP-RF achieves better prediction performance than the other two models, suggesting that random forest is an appropriate machine learning technique for the task of predicting ncRNA-protein interaction. The best performance on each dataset is mostly obtained when m = 4, except for 3-fold cross-validation on the RPI369 dataset. We can therefore infer that 4-mer RNA sequence patterns are more conducive than 2-mer patterns to characterizing the interactive properties of ncRNA with respect to protein. Considering that most related works adopt 10-fold cross-validation, the CFRP-RF runs with the best Sum performance on each dataset in the above 10-fold test were used for the subsequent experiments and are referred to as the CFRP model.

#### 3.4. Study on the Effectiveness of CFRP Feature

In CFRP, four kinds of non-linear transformations are used to produce the GM, PowRP, PowPR and HM complex features. To study the effectiveness of the different complex features, we analyze the top 100 features generated by CFRP on RPI2241, which comprise 19 GM, 38 HM, 12 PowRP, and 31 PowPR features (**Figure 3**). This indicates that all four non-linear operations introduced for generating complex features are effective, and that HM is more discriminative than the other kinds of features for the accurate prediction of ncRNA-protein interactions.
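A tally like the one above can be obtained by mapping each selected feature index back to its transformation kind. The sketch below assumes a hypothetical flat layout in which the raw vector stores the GM, HM, PowRP and PowPR blocks in that order, each block enumerating the RNA index i in the outer loop and the protein index j in the inner loop. The layout actually used by the CFRP code is not stated in the text.

```python
from collections import Counter

KINDS = ("GM", "HM", "PowRP", "PowPR")  # assumed block order in the flat vector


def decode_index(idx, m, n):
    """Map a flat feature index to (kind, i, j), with i and j 1-based as in Equations 3 to 6."""
    block = 4 ** m * 7 ** n            # number of (i, j) pairs per kind
    kind, offset = divmod(idx, block)
    i, j = divmod(offset, 7 ** n)
    return KINDS[kind], i + 1, j + 1


def tally_kinds(top_indices, m, n):
    """Count how many of the selected features belong to each transformation kind."""
    return Counter(decode_index(idx, m, n)[0] for idx in top_indices)
```

Applied to the indices of the top 100 features, `tally_kinds` would return a counter over the four kinds, in the same form as the counts reported above.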

Furthermore, to validate the discriminative ability of the CFRP feature, we trained a model that uses only the traditional k-mer feature (base feature) and a model that uses the complex features generated by CFRP without any feature selection. These models are also trained with random forest and are denoted as BaseFeat and CFRP-raw, respectively. We tested the CFRP, CFRP-raw and BaseFeat models on the RPI369, RPI488, and RPI2241 datasets; the experimental results are shown in **Table 4**. In the table, the value in bold denotes the best one among the tested models on a certain dataset. As shown, the Sum performance of CFRP-raw is weaker than that of the CFRP model on all three datasets and better than that of the traditional k-mer feature on RPI2241. This suggests that the ability of the raw complex features to describe ncRNA-protein interaction can be boosted by appropriate feature selection. Moreover, the CFRP model is superior to the other two tested models on the three datasets for all performance metrics. For example, the AUC values of the CFRP model on RPI369, RPI488, and RPI2241 are 0.788, 0.938, and 0.744, exceeding those of BaseFeat by 0.028, 0.024, and 0.043, respectively. In terms of Accuracy, CFRP also obtained improvements of 0.024, 0.016, and 0.027 over BaseFeat on the three datasets. These improvements demonstrate that the proposed CFRP complex feature describes ncRNA-protein interaction better than the base feature.

#### 3.5. Comparison of CFRP Model With Other Methods

In this section, the CFRP model is compared with other state-of-the-art methods, including RPISeq (Muppirala et al., 2011) and lncPro (Lu et al., 2013). Like the CFRP model, these two methods take the primary sequences of RNA and protein as inputs. RPISeq combines traditional k-mer features with random forest, while lncPro predicts ncRNA-protein interaction by scoring the pair after encoding the sequences into numeric vectors. The performance of the CFRP model is compared with that of RPISeq and lncPro on the RPI369, RPI488, and RPI2241 datasets. As illustrated in **Table 5**, the CFRP model performs better than RPISeq and lncPro on the three datasets with respect to most of the tested metrics, especially Sum and AUC. The CFRP model improves Sum by more than 0.179, 0.055, and 0.188 over the other two methods on the three datasets, respectively. In addition, it performs better than the other two methods on RPI369 for all of the tested metrics. CFRP also reaches Accuracy values of 0.761, 0.942, and 0.684 on the three tested datasets, which exceed those of the other methods by at least 0.057, 0.062, and 0.030, respectively. On the whole, although it is inferior to the other methods on a few indicators, the CFRP method performs better than the other two methods in general. This suggests that the method for generating complex features presented in this work is an effective and efficient way to predict ncRNA-protein interaction.

To sum up, the above series of experiments shows that the CFRP method proposed in this work is effective and produces complex features with better descriptive ability for ncRNA-protein interaction. There are two main reasons. On the one hand, CFRP introduces a variety of complex relations over the k-mer base features, which characterize the properties of ncRNA-protein interaction more comprehensively and at a higher level. On the other hand, by introducing feature selection based on random forest, the dimension of the generated complex features is significantly reduced, so that the curse of dimensionality is avoided. As a result, the CFRP feature is more concise and efficient for training a powerful prediction model.

FIGURE 4 | Time consumption (Seconds) of CFRP-models with different k-mer length on three datasets.

FIGURE 5 | Memory consumption (G) of CFRP-models with different k-mer length on three datasets.

TABLE 4 | Performance evaluation of CFRP-model on RPI369, RPI488, and RPI2241.

*The values in bold represent the best values obtained by the three methods on a certain dataset.*

TABLE 5 | Comparison between CFRP and other methods on RPI369, RPI488, and RPI2241.

*The values in bold represent the best values obtained by the three methods on a certain dataset.*

# 4. CONCLUSION

The interaction between ncRNA and protein is significant for many critical biological processes and diseases. Developing a powerful computational method for predicting the interaction could provide important assistance for understanding the molecular mechanisms within a variety of biological activities. When building a prediction model, it is very important to employ a set of features that effectively characterize the interaction between ncRNA and protein. In this work, we presented a novel framework named CFRP for constructing a set of complex features that tries to comprehensively characterize the interactive activities of an ncRNA-protein interaction. First, k-mer features (base features) are extracted from the primary sequences of ncRNA and protein, respectively; second, a set of complex features is generated by applying several nonlinear transformations to the base features of RNA and protein; finally, feature selection based on random forest is employed to reduce the dimension of the generated features. A series of experimental results on several widely used public datasets shows that the prediction model using CFRP features is superior to the one using traditional k-mer features. This suggests that complex features generated by the CFRP framework are more descriptive than traditional k-mer features. The CFRP model was also compared with other state-of-the-art methods, and the results show that it achieves better performance on most of the tested metrics. In conclusion, the proposed CFRP method can generate a set of complex features that is more informative than k-mer features, which is conducive to building a more powerful prediction model of ncRNA-protein interaction. The idea of constructing complex features might be extended to predicting other kinds of molecular interactions in bioinformatics, such as protein-protein interaction.

### AUTHOR CONTRIBUTIONS

QD and MG designed the project. MG and XD developed the feature construction and selection methods. ZT and YF analyzed the results. QD and ZT conceived the experiment and wrote the manuscript. All authors read and approved the manuscript.

#### ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (61701073, 61671189, 81400115, 61532014, 61672132, and 61571163) and Fundamental Research Funds for the Central Universities (2572018BH05).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Dai, Guo, Duan, Teng and Fu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data

#### Limin Jiang<sup>1†</sup>, Yongkang Xiao<sup>2†</sup>, Yijie Ding<sup>3</sup>, Jijun Tang<sup>1,4</sup>\* and Fei Guo<sup>1</sup>\*

*<sup>1</sup> School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China, <sup>2</sup> School of Chemical Engineering and Technology, Tianjin University, Tianjin, China, <sup>3</sup> School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China, <sup>4</sup> Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States*

#### Edited by:

*Arun Kumar Sangaiah, VIT University, India*

#### Reviewed by:

*Qi Zhao, Liaoning University, China; Xianwen Ren, Peking University, China*

#### \*Correspondence:

*Jijun Tang, tangjijun@tju.edu.cn; Fei Guo, fguo@tju.edu.cn*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *30 October 2018* Accepted: *15 January 2019* Published: *05 February 2019*

#### Citation:

*Jiang L, Xiao Y, Ding Y, Tang J and Guo F (2019) Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data. Front. Genet. 10:20. doi: 10.3389/fgene.2019.00020*

Discovering cancer subtypes is useful for guiding clinical treatment of multiple cancers. Progressive profile technologies for tissue have accumulated diverse types of data. Based on these types of expression data, various computational methods have been proposed to predict cancer subtypes. It is crucial to study how to better integrate these multiple profiles of data. In this paper, we collect multiple profiles of data for five cancers from The Cancer Genome Atlas (TCGA). Then, we construct three similarity kernels for all patients of the same cancer from gene expression, miRNA expression and isoform expression data. We also propose a novel unsupervised multiple kernel fusion method, Similarity Kernel Fusion (SKF), to integrate the three similarity kernels into one combined kernel. Finally, we apply spectral clustering to the integrated kernel to predict cancer subtypes. In the experiments, the *P*-values from the Cox regression model and survival curve analysis are used to evaluate the predicted subtypes on three datasets. Our kernel fusion method, SKF, has outstanding performance compared with single kernels and other multiple kernel fusion strategies. This demonstrates that our method can identify more accurate subtypes for various kinds of cancers. Our cancer subtype prediction method can identify essential genes and biomarkers for disease diagnosis and prognosis, and we also discuss the possible side effects of therapies and treatment.

Keywords: cancer subtypes prediction, similarity kernel fusion, spectral clustering, sparse matrix, The Cancer Genome Atlas

# 1. INTRODUCTION

Cancer is a heterogeneous disease caused by chemical, physical, or genetic factors (Mager, 2006; Liu and Chu, 2014). High-throughput genome analysis techniques applied to cancer subtype research play an important role in the analysis and clinical treatment of various kinds of cancers (Kruijf et al., 2013; Prat et al., 2015; Thanki et al., 2017). In recent years, a large amount of expression data, including genome, transcriptome and epigenome data, has accumulated and been stored in various databases. The Cancer Genome Atlas (TCGA) (Katarzyna et al., 2015) is a large-scale project including over 34 cancers and 15 expression data sets. We can conveniently obtain genome-scale molecular data from it, which contributes to the development of computational methods for discovering cancer subtypes.

To date, a large number of computational methods have been proposed to discover cancer subtypes. Some methods are based on a single type of expression data, including gene expression data (Nguyen and Rocke, 2002; Brunet et al., 2004; Finnegan and Carey, 2007; Teschendorff et al., 2007), copy number (Wong et al., 2012), and DNA methylation (Zhang et al., 2017). Gao and Church (2005) employed sparse non-negative matrix factorization (SNMF) and gene expression data to identify subtypes of three cancers. Also, various kinds of expression data (Wei et al., 2017, 2018a,b) and several types of similarity strategies (Zeng et al., 2016; Ding et al., 2017a,b; Pan et al., 2017, 2018; Guo F. et al., 2018; Song et al., 2018) can be applied in many other biological prediction problems.

Generally, we desire a comprehensive view of a disease across a cohort of patients. We cannot analyze just one kind of data, but must extract information separately from different types of data (Xu et al., 2017). Therefore, many methods improve the robustness of clustering by focusing on data processing (Ren et al., 2015). Wang et al. (2014) proposed the Similarity Network Fusion (SNF) approach for accurately clustering cancer subtypes. This method first collects three types of genome-wide data, including gene, methylation and miRNA expression. Then, it constructs networks of samples (e.g., patients) from the three types of expression data and uses SNF to fuse these networks into one network that represents the full spectrum of the underlying data. Finally, it employs spectral clustering on the integrated network to predict cancer subtypes. Ma and Zhang (2017) developed an improved SNF, Affinity Network Fusion (ANF), to integrate multiple similarity networks. Xu et al. (2016) proposed Weighted Similarity Network Fusion (WSNF) to identify cancer subtypes. This method constructs the similarity of patients by integrating associations between miRNA, mRNA, and transcription factors, and it was applied to two cancer types to demonstrate its performance.

Furthermore, commonly used clustering models, such as k-means and hierarchical clustering, are highly sensitive to the data. To date, many clustering methods have been developed to identify cancer subtypes. Le et al. (2016) developed the SRF algorithm, which identifies subtypes by combining mutational and expression information. It diffuses mutation information over an interaction network on the basis of each sample and eliminates scale differences by applying a rank-based transformation to the mutation and expression data. Then, rank matrix factorization is used to jointly factorize the transformed data into a number of ranked factors, and the subtypes are defined as combinations of ranked factors. This method obtains excellent performance, but some of the patients cannot be assigned to a subtype. Shen et al. (2009) proposed the iCluster method, which is based on the Gaussian latent variable model, to discover cancer subtypes. This method was tested on breast cancer and lung cancer by using copy number and gene expression data. Speicher and Pfeifer (2015) pointed out that iCluster has high computational complexity and proposed a dimensionality reduction method to integrate multiple similarity kernels. This method was evaluated on five cancer types. Ge et al. (2017) developed the Scluster method, which integrates different types of data and maps them into an effective low-dimensional subspace. First, Scluster uses adaptive sparse reduced-rank regression (S-rrr) to map the original data into the principal subspaces. Next, a fused patient-by-patient network is derived for these subgroups by a scaled exponential similarity kernel method. The cancer subtypes are then obtained by spectral clustering.

In this paper, we first collect multiple profile data from The Cancer Genome Atlas (TCGA), covering five cancers (lung cancer, kidney cancer, stomach cancer, breast cancer, and colon cancer) and three types of expression data (gene expression, isoform expression, and miRNA expression). Then, we construct three similarity kernels for all patients of the same cancer by using the three types of expression data. We then propose a novel unsupervised multiple kernel fusion method, Similarity Kernel Fusion (SKF), to integrate the three similarity kernels into one combined kernel. Compared with SNF, SKF not only keeps the original information of each type of similarity kernel, but also removes noise from the integrated kernel. Finally, we apply spectral clustering to the integrated kernel to predict cancer subtypes. To test the effectiveness and robustness of this approach, P-values from a Cox regression model and survival curve analysis are used to evaluate the performance of our method on cancer subtype prediction. We compare the integrated kernel with the single kernels and other fusion methods, and also analyze the survival curves of the clinical data.

# 2. MATERIALS AND METHODS

In this paper, we first extract five cancer datasets from The Cancer Genome Atlas (TCGA). For a particular cancer, we construct three patient similarity kernels by using the expression data. Then, we combine these similarity kernels into one similarity kernel by using Similarity Kernel Fusion (SKF). Finally, we employ spectral clustering on the integrated kernel to divide all patients into multiple clusters. The flowchart of our method is shown in **Figure 1**.

# 2.1. Dataset

We collect five cancer datasets from the TCGA website, covering stomach cancer, lung cancer, kidney cancer, breast cancer, and colon cancer. For each cancer, we extract three kinds of expression data: gene expression, miRNA expression, and isoform expression. This dataset is denoted as Dataset No.1 in this paper. In addition, we employ another dataset to evaluate the performance of our method. The second dataset, provided by Wang et al. (2014), includes lung cancer, kidney cancer, breast cancer, colon cancer, and glioblastoma multiforme (GBM). For each tumor, gene expression, methylation, and miRNA expression data from TCGA are used to analyze cancer subtypes. We denote this dataset as Dataset No.2. Since genes can be categorized into multiple groups, we selected the 18222 coding genes from Dataset No.1 to form Dataset No.3. A summary of the three datasets is shown in **Table 1**. It is clear that Dataset No.1 and Dataset No.3 have more patients and expression factors than Dataset No.2.

## 2.2. Similarity Kernel Construction

A given expression dataset is denoted as $E \in \mathbb{R}^{n \times m}$, where $n$ is the number of expression factors and $m$ is the number of patients, so that each column of $E$ holds the profile of one patient. We first normalize $E$ by using Equation (1).

$$\mathbf{x}' = \frac{\mathbf{x} - \overline{\mathbf{X}}}{\mathbf{S}} \tag{1}$$

where $x$ is an element of $E$, $x'$ is the corresponding element after standardization, $\overline{X}$ is the mean of $E$, and $S$ is the standard deviation of $E$. We denote the normalized expression data as $E'$.

Based on the processed expression data $E'$, we construct a similarity kernel $K \in \mathbb{R}^{m \times m}$ for patients. Here, the similarity between two patients is defined as Equation (2) (Chen et al., 2018b; Zhao et al., 2018a,b).

$$K_{i,j} = \sqrt{(e_i - e_j)^T (e_i - e_j)} \tag{2}$$

where $K_{i,j}$ is the similarity between the $i$-th patient and the $j$-th patient, and $e_i \in \mathbb{R}^{n \times 1}$ and $e_j \in \mathbb{R}^{n \times 1}$ are the $i$-th and $j$-th columns of $E'$, respectively.

Finally, we get three similarity kernels for a given disease: $K_1 \in \mathbb{R}^{m \times m}$ from gene expression, $K_2 \in \mathbb{R}^{m \times m}$ from miRNA expression, and $K_3 \in \mathbb{R}^{m \times m}$ from isoform expression.
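Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: patients are stored as columns (following the statement that $e_i$ is a column of $E'$), the global mean and the population standard deviation of $E$ are used, and the function names and the toy matrix are hypothetical. Note that Equation (2), as written, is the Euclidean distance between two patient profiles.

```python
import numpy as np


def normalize_expression(E):
    """Equation (1): standardize with the global mean and standard deviation of E."""
    E = np.asarray(E, dtype=float)
    return (E - E.mean()) / E.std()


def pairwise_kernel(E_norm):
    """Equation (2): K[i, j] = sqrt((e_i - e_j)^T (e_i - e_j)) for columns e_i, e_j."""
    diff = E_norm[:, :, None] - E_norm[:, None, :]   # shape (factors, patients, patients)
    return np.sqrt((diff ** 2).sum(axis=0))


# Toy matrix with made-up values: 3 expression factors (rows) x 4 patients (columns).
E = np.array([[1.0, 2.0, 1.5, 8.0],
              [0.5, 0.7, 0.6, 4.0],
              [3.0, 3.1, 2.9, 9.0]])
K = pairwise_kernel(normalize_expression(E))
```

The broadcasting step builds a full (factors, patients, patients) array, which is convenient for small examples but memory-hungry for tens of thousands of expression factors; a looped or chunked computation would be the safer choice at that scale.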

#### TABLE 1 | Description of three datasets from TCGA.


#### 2.3. Similarity Kernel Fusion

We constructed three similarity kernels for patients in the above section. We propose Similarity Kernel Fusion (SKF) to combine these kernels into one kernel $K^* \in \mathbb{R}^{m \times m}$. First, we construct two kernels $P \in \mathbb{R}^{m \times m}$ and $S \in \mathbb{R}^{m \times m}$ for each similarity kernel by using Equations (3) and (4), where $P$ is a normalized kernel and $S$ is a sparse kernel that eliminates weak similarity.

$$P(i,j) = \frac{K_{i,j}}{\sum_{k=1}^{m} K_{k,j}} \tag{3}$$

where $P$ satisfies $\sum_{k=1}^{m} P(k,j) = 1$.

$$S(i,j) = \begin{cases} 0 & \text{if } j \notin N_i\\ \frac{K_{i,j}}{\sum_{k \in N_i} K_{i,k}} & \text{if } j \in N_i \end{cases} \tag{4}$$

where $S$ satisfies $\sum_{j=1}^{m} S(i,j) = 1$; $N_i$ is the set of all neighbors of the $i$-th patient, including itself.
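The two auxiliary kernels of Equations (3) and (4) can be sketched as follows. The text does not say how the neighbor set $N_i$ is chosen, so this sketch assumes that $N_i$ contains patient $i$ plus the patients with the largest $K_{i,j}$, for `n_neighbors` members in total; that rule, the parameter name, the function names, and the toy kernel are all hypothetical.

```python
import numpy as np


def column_normalize(K):
    """Equation (3): P[i, j] = K[i, j] / sum_k K[k, j], so each column sums to 1."""
    K = np.asarray(K, dtype=float)
    return K / K.sum(axis=0, keepdims=True)


def neighbor_mask(K, n_neighbors):
    """Boolean mask with mask[i, j] = True when j is in N_i (assumed neighbor rule)."""
    K = np.asarray(K, dtype=float)
    m = K.shape[0]
    scores = K.copy()
    np.fill_diagonal(scores, np.inf)                 # patient i always belongs to N_i
    order = np.argsort(-scores, axis=1)[:, :n_neighbors]
    mask = np.zeros((m, m), dtype=bool)
    mask[np.arange(m)[:, None], order] = True
    return mask


def sparse_kernel(K, n_neighbors):
    """Equation (4): keep only neighbor entries and normalize each row to sum to 1."""
    K = np.asarray(K, dtype=float)
    kept = np.where(neighbor_mask(K, n_neighbors), K, 0.0)
    return kept / kept.sum(axis=1, keepdims=True)


# Toy symmetric kernel with made-up values: patients 0-1 and 2-3 form two groups.
K = np.array([[1.0, 0.8, 0.1, 0.2],
              [0.8, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.9],
              [0.2, 0.1, 0.9, 1.0]])
P = column_normalize(K)
S = sparse_kernel(K, n_neighbors=2)
```

With `n_neighbors=2`, row 0 of `S` keeps only the entries for patients 0 and 1 and rescales them to sum to one, which is the "eliminates weak similarity" behavior described above.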

Second, we discover more information by using multiple iterations as in Equation (5).

$$P_l^{t+1} = \alpha \left( S_l \times \frac{\sum_{r \neq l} P_r^{t}}{2} \times S_l^{T} \right) + (1 - \alpha) \left( \frac{\sum_{r \neq l} P_r^{0}}{2} \right) \tag{5}$$

where $P_l^t$ ($l = 1, 2, 3$) is the status of the $l$-th kernel after $t$ iterations, $S_l^T$ is the transpose of $S_l$, $\alpha$ is a coefficient satisfying $\alpha \in [0, 1]$, and $P_r^0$ ($r = 1, 2, 3$) represents the initial status of $P_r$.

After $t + 1$ iterations, the overall kernel can be computed as Equation (6).

$$K_{com} = \frac{1}{3} \sum_{l=1}^{3} P_l^{t+1} \tag{6}$$

Finally, based on the integrated kernel, we construct a weight matrix to eliminate noise in the integrated kernel as in Equation (7).

$$w(i,j) = \begin{cases} 1 & \text{if } j \in N_i \text{ and } i \in N_j\\ 0 & \text{if } j \notin N_i \text{ and } i \notin N_j\\ 0.5 & \text{otherwise} \end{cases} \tag{7}$$

where $N_i$ is the set of all neighbors of the $i$-th patient, including itself, and $N_j$ is the set of all neighbors of the $j$-th patient, including itself.

The final similarity kernel can be obtained as Equation (8).

$$K^* = w \circ K_{com} \tag{8}$$

where $K^*$ is the final integrated similarity kernel obtained by SKF and $\circ$ denotes the element-wise product.
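The fusion steps of Equations (5) to (8) can be sketched as below. This is a minimal illustration under several assumptions, not the released implementation: it handles exactly three kernels, reads the last factor of Equation (5) as the transpose of $S_l$, computes the neighbor sets of Equation (7) on $K_{com}$ using the largest entries per row, and uses hypothetical names and defaults (`skf_fuse`, `n_iter`, `n_neighbors`). The toy kernels are random and carry no biological meaning.

```python
import numpy as np


def neighbor_mask(K, n_neighbors):
    """mask[i, j] is True when j is in N_i (patient i plus its most similar patients)."""
    m = K.shape[0]
    scores = K.copy()
    np.fill_diagonal(scores, np.inf)                 # patient i always belongs to N_i
    order = np.argsort(-scores, axis=1)[:, :n_neighbors]
    mask = np.zeros((m, m), dtype=bool)
    mask[np.arange(m)[:, None], order] = True
    return mask


def skf_fuse(P_list, S_list, alpha=0.5, n_iter=5, n_neighbors=3):
    """Fuse exactly three kernels following Equations (5)-(8)."""
    P0 = [np.asarray(P, dtype=float) for P in P_list]
    S_list = [np.asarray(S, dtype=float) for S in S_list]
    P = [p.copy() for p in P0]
    for _ in range(n_iter):
        updated = []
        for l in range(3):
            others_t = sum(P[r] for r in range(3) if r != l) / 2.0
            others_0 = sum(P0[r] for r in range(3) if r != l) / 2.0
            updated.append(alpha * (S_list[l] @ others_t @ S_list[l].T)
                           + (1.0 - alpha) * others_0)
        P = updated                                  # Equation (5)
    K_com = sum(P) / 3.0                             # Equation (6)
    mask = neighbor_mask(K_com, n_neighbors)
    w = np.full(K_com.shape, 0.5)                    # Equation (7), "otherwise" case
    w[mask & mask.T] = 1.0                           # j in N_i and i in N_j
    w[~mask & ~mask.T] = 0.0                         # j not in N_i and i not in N_j
    return w * K_com                                 # Equation (8), element-wise product


# Toy input: three random symmetric kernels for 5 patients (made-up values).
rng = np.random.default_rng(0)
kernels = []
for _ in range(3):
    A = rng.random((5, 5))
    kernels.append((A + A.T) / 2.0)
P_list = [K / K.sum(axis=0, keepdims=True) for K in kernels]
S_list = [K / K.sum(axis=1, keepdims=True) for K in kernels]   # dense S: all patients are neighbors
K_star = skf_fuse(P_list, S_list, alpha=0.5, n_iter=5, n_neighbors=3)
```

Setting `alpha=0` makes every update equal to the average of the other two initial kernels, so $K_{com}$ reduces to the mean of the initial kernels; this is a convenient sanity check for any implementation of Equation (5).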

#### 2.4. Mining Subtypes Using Spectral Clustering

In this section, we employ spectral clustering (Ng et al., 2001) on the integrated similarity kernel to divide all patients into multiple clusters. Many previous studies, including CSPRV (Guo Y. et al., 2018), Scluster (Ge et al., 2017), and SNF (Wang et al., 2014), have constructed similarity kernels for patients and used spectral clustering to discover cancer subtypes, and these methods have achieved excellent performance. Additionally, Luxburg (2007) has pointed out that spectral clustering is effective in capturing the global structure of the graph. Therefore, we use spectral clustering to identify cancer subtypes. Next, we describe the spectral clustering procedure in detail. We define an indicator matrix $Y \in \{0, 1\}^{m \times k}$ to represent the clustering result, where $k$ is the number of clusters, $Y(j, i) = 1$ if patient $p_j$ belongs to the $i$-th cluster, and $Y(j, i) = 0$ otherwise. We use Equation (9) as the optimization problem to solve for $Y$.

$$\begin{aligned} \min_{Q \in \mathbb{R}^{m \times k}} \; & \text{Trace}(Q^T L^+ Q) \\ & \text{s.t.} \; Q^T Q = I \end{aligned} \tag{9}$$

where $Q = Y (Y^T Y)^{-\frac{1}{2}}$, $L^+ = I - D^{-\frac{1}{2}} K^* D^{-\frac{1}{2}}$, and $D$ is a diagonal matrix whose diagonal elements are the row sums of $K^*$.
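A common way to solve the relaxed form of Equation (9) is to take the eigenvectors of $L^+$ with the smallest eigenvalues as a continuous $Q$, normalize its rows, and run k-means on them, in the style of Ng et al. (2001). The sketch below follows that recipe under stated assumptions: the kernel is symmetrized before decomposition, k-means uses a simple deterministic farthest-point initialization, and the function names and the toy kernel are hypothetical.

```python
import numpy as np


def spectral_embedding(K_star, n_clusters):
    """Embed patients with the eigenvectors of the n_clusters smallest eigenvalues of L+."""
    K = np.asarray(K_star, dtype=float)
    K = (K + K.T) / 2.0                              # assumption: symmetrize the kernel
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    L = np.eye(len(K)) - d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(L)              # eigenvalues in ascending order
    Q = eigenvectors[:, :n_clusters]
    return Q / np.linalg.norm(Q, axis=1, keepdims=True)


def kmeans(X, n_clusters, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [X[0]]
    while len(centers) < n_clusters:
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels


# Toy kernel with made-up values: patients 0-2 and 3-5 form two tight groups.
K_star = np.array([[1.00, 0.90, 0.80, 0.05, 0.02, 0.01],
                   [0.90, 1.00, 0.85, 0.03, 0.04, 0.02],
                   [0.80, 0.85, 1.00, 0.01, 0.02, 0.05],
                   [0.05, 0.03, 0.01, 1.00, 0.90, 0.80],
                   [0.02, 0.04, 0.02, 0.90, 1.00, 0.85],
                   [0.01, 0.02, 0.05, 0.80, 0.85, 1.00]])
labels = kmeans(spectral_embedding(K_star, n_clusters=2), n_clusters=2)
```

The cluster numbering returned by k-means is arbitrary, so downstream survival analysis should compare groups by membership rather than by label value.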

#### 3. RESULTS

In this section, we discuss the performance of our method in a variety of ways. First, we introduce an evaluation criterion and a verification method that are used to assess the significance of the cancer subtype predictions. Second, we analyze the performance of SKF with different values of the parameter α on Dataset No.1. Third, we discuss the performance of SKF on the three datasets. Fourth, we compare SKF with two other fusion methods on the three datasets. Finally, we analyze the survival probability curves of the predicted subtypes for four cancers.

### 3.1. Evaluation Criteria and Verification Method

In this paper, we employ the P-value from the Cox regression model to evaluate the performance of our method. The P-value measures the significance of the difference in survival profiles between cancer subtypes, so a lower P-value indicates better-separated subtypes. A P-value below 0.05 is regarded as significant, and a P-value below 0.01 as highly significant. Here, we use 0.05 as the threshold for significance. Moreover, we also use survival analysis to evaluate the clustering results. The survival curve represents the change in survival rate over time, and it decreases monotonically without any fluctuation. From the survival curves, we can see that different subtypes have different survival rates, and we can identify subtypes that have a higher risk of death.
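The text does not name the estimator behind the survival curves. As an illustration only, the sketch below assumes the widely used Kaplan-Meier estimator, which yields exactly the kind of step-wise, monotonically decreasing curve described above; the function name and the toy follow-up data are hypothetical, and the Cox regression P-value itself is not computed here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with an observed death.

    times: follow-up time per patient; events: 1 if death was observed, 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving                           # deaths and censored patients leave
    return curve


# Toy subtype with five patients (made-up follow-up times in months).
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Computing one such curve per predicted subtype and plotting them together gives the survival comparison that the P-value summarizes in a single number.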

#### 3.2. Parameter Selection for SKF

The parameter α is important in the process of SKF. A lower α value keeps more initial information in the integrated kernel, whereas a higher α value keeps more information from the multiple iterations. In the three datasets, we vary α from 0 to 1 with a step of 0.1 to find the optimal α for the five cancers. Results are shown in **Figure 2**, with the X axis representing the α value and the Y axis representing $-\log_{10}(P\text{-value})$. A lower P-value corresponds to a higher value of $-\log_{10}(P\text{-value})$. In **Figure 2**, the P-value fluctuates clearly as α varies between 0 and 1, which demonstrates that SKF is sensitive to changes in α. We get the optimal P-value when α is equal to 1 for the four cancers except lung cancer on Dataset No.3. From the results of Dataset No.2, we can see that keeping more initial information is necessary for many of the datasets.
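The grid search over α can be sketched as a small helper that scores each α by $-\log_{10}(P\text{-value})$ and keeps the best one. The callback `p_value_for_alpha` is a hypothetical stand-in for running the whole pipeline (fusion, clustering, Cox regression) at a given α, and the toy P-value function below is made up purely to exercise the code.

```python
import math


def best_alpha(p_value_for_alpha, step=0.1):
    """Scan alpha over [0, 1] and return (alpha, -log10(P)) with the largest score."""
    n_steps = round(1.0 / step)
    alphas = [round(i * step, 10) for i in range(n_steps + 1)]
    scored = [(a, -math.log10(p_value_for_alpha(a))) for a in alphas]
    return max(scored, key=lambda pair: pair[1])


def toy_p_value(alpha):
    """Made-up P-value that shrinks as alpha grows; not derived from any dataset."""
    return 10 ** -(1 + 2 * alpha)


alpha, score = best_alpha(toy_p_value)
significant = score > -math.log10(0.05)              # threshold P < 0.05 on the log scale
```

Because the text reports that the P-value fluctuates strongly with α, a scan like this would normally be repeated per cancer and per dataset rather than reusing one α everywhere.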


TABLE 2 | Comparison results between SKF and single kernel on three datasets.


# 3.3. Performance of SKF in Different Datasets

In this paper, we obtain Dataset No.1 from TCGA. For a specific disease, we extract all 60483 gene expression data points in Dataset No.1. We employ the three datasets to evaluate the performance of SKF. For each dataset, we compare the performance of SKF with that of each single kernel by using the optimal number of clusters. In **Table 2**, we can see that SKF achieves outstanding performance compared with the single kernels in 12 cases. We also find that the same kernel yields different P-values with different numbers of clusters. Therefore, we need to adjust the number of clusters to obtain optimal clustering results. Although the P-values do not reach significance for GBM in Dataset No.2 or for kidney cancer in Datasets No.2 and No.3 after SKF, they improve markedly compared with the single kernels. Moreover, the P-values on Dataset No.3 are better than those on Dataset No.1, which shows that coding genes play an important role in the clustering of cancer subtypes.

#### 3.4. Comparing With Other Fusion Methods

Several multiple kernel fusion strategies have been developed, including similarity network fusion (SNF) (Wang et al., 2014) and unsupervised multiple kernel learning (UMKL) (Mariette and Villavialaneix, 2018). We compared the performance of SKF with these two strategies to find better subtypes for a particular cancer, testing all three strategies on the three datasets. All results are given in **Supplementary Table 1**. The graphical results are shown in **Figure 3**, with the X axis representing the number of clusters and the Y axis representing $-\log_{10}(P\text{-value})$. The blue lines represent SKF, the red lines represent SNF, the green lines represent UMKL, and the black dashed lines mark a P-value of 0.05. In **Figure 3**, SKF achieves remarkable performance for the clustering of breast and colon cancer subtypes in the three datasets. Additionally, SKF achieves better performance than the other kernel fusion strategies for the clustering of lung cancer subtypes in Datasets No.1 and No.3. SNF performs well for the clustering of kidney cancer subtypes in the three datasets, and UMKL reaches the best performance for the clustering of lung cancer subtypes in Dataset No.2 and stomach cancer subtypes in Dataset No.1. These results demonstrate that SKF obtains significant performance in discovering subtypes of a particular cancer, and that the clustering results can be used for guiding clinical treatment.

### 3.5. Survival Analysis

In this paper, we analyzed the performance of SKF on six cancers: breast, lung, kidney, colon, and stomach cancers and GBM. However, since the P-values for the clustering of kidney and GBM cancer subtypes were larger than 0.05, we show survival probability curves only for the four other cancers. We analyzed these cancer subtypes by using Dataset No.3. In **Figure 4**, we find that subtype 3 of stomach cancer has a higher death rate, so patients with this subtype need more attention. The average survival time of subtype 2 of colon cancer is longer than that of the other subtypes. Similarly, subtype 3 of the other cancers tends to be more aggressive than the other subtypes. We also found that the average survival times for breast cancer and lung cancer are longer than those for stomach and colon cancer. This demonstrates that the clustering results of SKF can be used to guide clinical treatment.

### 4. CONCLUSIONS

In this paper, we proposed an accurate model for predicting cancer subtypes. First, we extracted a novel dataset with three expression data types (gene expression, miRNA expression, and isoform expression) and five cancers (breast, lung, kidney, colon, and stomach cancers) from the TCGA website. Second, we constructed three similarity kernels by using the three types of expression data for each cancer. Then, we proposed Similarity Kernel Fusion (SKF) to integrate the three kernels into one combined kernel. Finally, we used spectral clustering on the integrated kernel to discover cancer subtypes.

We used an evaluation criterion (P-value) and a verification method (survival analysis) to evaluate the performance of SKF for the discovery of cancer subtypes. We compared SKF with single kernels and two kernel fusion strategies (SNF and UMKL) on three datasets. Results showed that SKF obtains significant P-values, and the survival curves of the subtypes were consistent with the clinical data. This demonstrates that SKF is an accurate computational tool for guiding clinical treatment.

Our method also has some limitations that require attention. Since spectral clustering is a widely used and accepted clustering method, we are attempting to find an improved method to discover cancer subtypes more accurately. We will consider various machine learning methods and kernel construction methods to predict cancer subtypes (Zeng et al., 2017; Ding et al., 2018; Zhang et al., 2018a,b,c; Zou et al., 2018). We will also consider developing computational models for cancer subtype identification based on microRNA information (Chen and Huang, 2017; Chen et al., 2017, 2018a,b; Hu et al., 2018).

# DATA AVAILABILITY STATEMENT

The results and codes for this study can be found at the following address: https://github.com/guofei-tju/Cancer-subtypes.

# AUTHOR CONTRIBUTIONS

FG and LJ conceived and designed the experiments. LJ and YX performed the experiments and analyzed the data. FG and YX wrote the paper. FG, YD, and JT supervised the experiments and reviewed the manuscript.

#### FUNDING

This work is supported by a grant from the National Natural Science Foundation of China (NSFC 61772362), the Tianjin Research Program of Application Foundation and Advanced Technology (16JCQNJC00200) and the National Key R&D Program of China (2018YFC0910405, 2017YFC0908400).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00020/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Jiang, Xiao, Ding, Tang and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Combinatorial Pattern of Histone Modifications in Exon Skipping Event

Wei Chen<sup>1,2,3</sup>\*, Xiaoming Song<sup>2</sup> and Hao Lin<sup>3</sup>\*

<sup>1</sup> Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China, <sup>2</sup> Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan, China, <sup>3</sup> Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China

Histone modifications are associated with alternative splicing. It has been suggested that histone modifications act in combinatorial patterns in gene expression regulation. However, how they interact with each other and what their causal relationships are in the process of RNA splicing remain unclear. In this study, the combinatorial patterns of 38 kinds of histone modifications in the exon skipping event of the CD4<sup>+</sup> T cell were analyzed by constructing Bayesian networks. Distinct combinatorial patterns of histone modifications that illustrate their causal relationships were observed in excluded/included exons and the surrounding intronic regions. The Bayesian networks also indicate that some histone modifications directly correlate with RNA splicing. We anticipate that this work could provide novel insights into the effects of histone modifications on RNA splicing regulation.

#### Edited by:

Dariusz Mrozek, Silesian University of Technology, Poland

#### Reviewed by:

Leyi Wei, Tianjin University, China Balachandran Manavalan, Ajou University, South Korea

#### \*Correspondence:

Wei Chen chenweiimu@gmail.com Hao Lin hlin@uestc.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 19 December 2018 Accepted: 04 February 2019 Published: 18 February 2019

#### Citation:

Chen W, Song X and Lin H (2019) Combinatorial Pattern of Histone Modifications in Exon Skipping Event. Front. Genet. 10:122. doi: 10.3389/fgene.2019.00122

Keywords: histone modification, methylation, acetylation, RNA splicing, Bayesian network, causal relationship

### INTRODUCTION

Alternative splicing is a process that can generate multiple mRNA isoforms from a single gene by splicing pre-mRNA molecules in different ways (Black, 2003). As an important process of gene expression, alternative splicing ensures the diversity of gene expression products. It has been estimated that alternative splicing occurs in approximately 90% of human genes (Pan et al., 2008; Wang E. T. et al., 2008). Alternative splicing is reported to closely correlate with apoptosis, embryonic development and even a series of diseases (Garcia-Blanco et al., 2004; David et al., 2010; Lai and Greenberg, 2013; Scotti and Swanson, 2016). Although great efforts have been made to study alternative splicing, the mechanisms of cell type-specific and stage-specific alternative splicing are still unclear (Nellore et al., 2016).

Recent studies have revealed that alternative splicing is regulated not only by trans-acting factors that can interact with cis-acting elements (Badr and Heath, 2015; Badr et al., 2016), but also by epigenetic factors, such as DNA methylation and nucleosome occupancy (Mele et al., 2017). Since RNA splicing is coupled to transcription, histone modifications were also found to be involved in alternative splicing regulation (Kornblihtt et al., 2013). Luco et al. (2010) found that the alternative splicing of the FGFR2 gene was correlated with the level of H3K36me3. Saint-Andre et al. (2011) demonstrated that the inclusion/exclusion of the alternative exons of the CD44 mRNA is affected by H3K9me3. The combinatorial effect of histone modifications on alternative splicing was also reported. Recently, Shindo et al. (2013) found that the alternative splicing of the BIN1 gene in the IMR90 cell was regulated by the cooperation of H3K36me3, H3K4me3, H2BK12ac, and H4K5ac. These results strongly indicate that histone modifications play important roles in RNA splicing regulation and are key clues for revealing the regulatory mechanism of alternative splicing.

Based on these experimental results, several computational methods have been proposed to predict the alternative exons in exon skipping events from histone modifications. The pioneering work was that of Enroth et al. (2012), in which a rule-based model was developed to classify included and excluded exons based on histone modification combinations. Later, using the dataset of Enroth et al. (2012), Chen et al. (2014) proposed a quadratic discriminant (QD) function method and obtained an accuracy of 68.5% for classifying the included and excluded exons in the exon skipping event. More recently, a random forest based method was developed for the same aim and obtained an accuracy of 72.91% in the 10-fold cross-validation test (Chen et al., 2018b). These results indicate that a novel splicing code should be sought in epigenomic information.

Inspired by recent works (Cui et al., 2011; Zhu et al., 2013), in this study Bayesian networks of histone modifications were constructed for the excluded/included exons and their preceding and succeeding intronic regions in exon skipping events, to investigate how histone modifications interact with each other and to find their causal relationships in the process of RNA splicing. By analyzing the Bayesian networks, distinct combinatorial patterns and causal relationships of histone modifications were observed in different regions relative to exons.

### MATERIALS AND METHODS

#### Dataset

Based on the exon expression data of the CD4<sup>+</sup> T cell (Oberdoerffer et al., 2008), by calculating the ratio between exon expression and gene expression, Enroth et al. (2012) obtained 13,374 "included" and 11,587 "excluded" exons. All of these exons are longer than 50 bp with flanking introns longer than 360 bp, and none of them is the first or last exon in any transcript (Enroth et al., 2012).

The ChIP-seq data for the 20 kinds of histone methylations (H3K27me2, H3K4me1, H3K79me2, H3K9me3, H4K20me3, H3K27me3, H3K4me2, H3K79me3, H3R2me1, H4R3me2, H2BK5me1, H3K36me1, H3K4me3, H3K9me1, H3R2me2, H3K27me1, H3K36me3, H3K79me1, H3K9me2, and H4K20me1) and 18 kinds of histone acetylation modifications (H2AK5ac, H2BK20ac, H3K23ac, H3K9ac, H4K8ac, H2AK9ac, H2BK5ac, H3K27ac, H4K12ac, H4K91ac, H2BK120ac, H3K14ac, H3K36ac, H4K16ac, H2BK12ac, H3K18ac, H3K4ac, and H4K5ac) of the CD4<sup>+</sup> T cell were obtained from previous works (Barski et al., 2007; Wang Z. et al., 2008). Using the SICTIN tool, Enroth et al. (2010) discretized the histone modification signals into binary (present/absent) attributes over three regions: the excluded/included exon itself and the closest 180 bp of flanking intronic sequence preceding and succeeding the exon.

After winnowing out exons with no modifications present, they finally obtained 12,692 "included" and 11,165 "excluded" exons. The presence or absence of each of the 38 kinds of histone modifications in the excluded/included exon and in the preceding and succeeding intronic regions was encoded as "1" (presence) or "0" (absence), and these binary values were used to construct the histone modification Bayesian networks. All the data can be found in Enroth et al.'s (2010) work.

#### Bayesian Network

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG) (Yu et al., 2008). The nodes in a Bayesian network represent variables, and the edges represent conditional dependencies. A directed edge (→) from node a<sub>i</sub> to node a<sub>j</sub> represents a statistical dependence, or causal relationship, between the corresponding variables: the arrow indicates that the variable a<sub>j</sub> depends on the variable a<sub>i</sub>. If there is no edge between two nodes a<sub>i</sub> and a<sub>j</sub>, the two variables are taken to be independent of each other.
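To make the DAG description concrete, the following minimal sketch shows how a Bayesian network factorizes a joint distribution into one conditional probability per node given its parents. The three-node graph, the node names `A`, `B`, `C`, and all probabilities are hypothetical values chosen only for illustration; they are not taken from the study.

```python
from itertools import product

# Hypothetical three-node DAG: A -> B and A -> C. Names and numbers are illustrative only.
PARENTS = {"A": (), "B": ("A",), "C": ("A",)}

# P(node = 1 | parent values); P(node = 0 | ...) is the complement.
CPT = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.7},
    "C": {(0,): 0.5, (1,): 0.1},
}


def joint_probability(assignment):
    """Multiply P(node | parents) over all nodes of the DAG."""
    probability = 1.0
    for node, parent_names in PARENTS.items():
        parent_values = tuple(assignment[p] for p in parent_names)
        p_one = CPT[node][parent_values]
        probability *= p_one if assignment[node] == 1 else 1.0 - p_one
    return probability


# Summing the joint probability over all 2**3 binary assignments should give 1.
total = sum(
    joint_probability(dict(zip(PARENTS, values)))
    for values in product((0, 1), repeat=len(PARENTS))
)
```

In this toy graph, `B` and `C` have no edge between them, so they influence each other only through their shared parent `A`.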

In this study, the WinMine package, which is available at https://www.microsoft.com/en-us/research/project/winminetoolkit/#!downloads, was used to construct the Bayesian networks of histone modifications in the excluded/included exons and the preceding and succeeding intronic regions. The nodes of the networks are the histone modifications.

### RESULTS AND DISCUSSION

#### Correlations Between Histone Modifications

Previous studies have reported that gene expression is in part regulated by histone modifications that act in a combinatorial fashion, i.e., the so-called "histone code" (Yu et al., 2008; Cui et al., 2011; Zhu et al., 2013). To determine whether combinatorial patterns of histone modifications also exist in the process of RNA splicing, we first calculated the Pearson correlation coefficients between the 38 kinds of histone modifications in the excluded/included exons and the preceding and succeeding intronic regions, respectively.
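As an illustration of this step, the sketch below computes the Pearson correlation coefficient between two binary presence/absence vectors, one entry per exon. The vectors `mark_a`, `mark_b`, and `mark_c` are made-up toy data, not values from the dataset of Enroth et al.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / sqrt(variance_x * variance_y)


# Toy presence (1) / absence (0) calls for three hypothetical marks over six exons.
mark_a = [1, 1, 0, 0, 1, 0]
mark_b = [1, 1, 0, 0, 0, 0]
mark_c = [0, 0, 1, 1, 0, 1]  # exact complement of mark_a

r_ab = pearson(mark_a, mark_b)  # positive: the marks tend to co-occur
r_ac = pearson(mark_a, mark_c)  # -1: the marks are mutually exclusive
```

For binary data this quantity coincides with the phi coefficient, so a value near +1 means two modifications are almost always present together and a value near -1 means they rarely co-occur.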

Distinct combinatorial patterns of histone modifications were observed in the excluded/included exons and the surrounding regions. For example, H2BK5me1 was found to be positively correlated with H3K4me1, H3K4me2, H3K79me1, H3K9me1, H4K20me1, and H4K91ac in both included and excluded exons (**Figures 1** and **2**). Negative correlations were found between H3K9me3 and most of the remaining 37 kinds of histone modifications in both excluded and included exons (**Figures 1** and **2**). These results also hold for the preceding and succeeding intronic regions of the included and excluded exons (**Supplementary Figures S1–S4**).

Besides the common patterns, combinatorial patterns specific to excluded or included exons were also observed (**Figures 1** and **2**). For example, in the included exon, H3K4ac and H2BK5me1, H3K79me1 and H3K23ac, H4K20ac and H3K4me1, and H4K20ac and H3K4me2 exhibit negative correlations that are absent in the excluded exon, while the significantly negative correlations between H3K14ac and H2BK5me1 and between H4K120ac and H3K27me2 were observed only in the excluded exon. Exon-type-specific combinatorial patterns of histone modifications can also be found in the preceding and succeeding intronic regions of the included and excluded exons (**Supplementary Figures S1–S4**).

#### Interaction Network of Histone Modifications

In order to investigate how histone modifications interact with each other and how their combinations regulate RNA splicing, Bayesian networks were constructed to deduce the causal relationships among histone modifications in the excluded/included exons and the surrounding intronic regions, respectively. In each Bayesian network, the nodes are the histone modifications, and each edge between two nodes is annotated with their Pearson correlation coefficient.

The 10-fold cross-validation test method was used to find robust Bayesian networks (Yu et al., 2008). The detailed procedure is as follows. In the 10-fold cross-validation test, the dataset (Materials and Methods) was randomly partitioned into ten subsets, and nine of them were used to generate a Bayesian network. Based on the Pearson correlation coefficients of histone modifications in the nine subsets, a fundamental Bayesian network demonstrating the causal relationships between histone modifications was built using the WinMine package. The 10-fold cross-validation was repeated 10 times. Accordingly, 10 fundamental Bayesian networks were obtained for the excluded/included exons and the surrounding intronic regions, respectively. Following previous work (Yu et al., 2008), the final Bayesian network was then constructed from the 10 fundamental Bayesian networks by keeping each edge that appeared in seven of the 10 fundamental networks. The edges in the networks were colored according to the Pearson correlation between the two nodes linked by the edge.
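The edge-voting step of this procedure can be sketched as follows. The function name `consensus_edges`, the reading of the threshold as "at least seven", and the toy edge lists are assumptions made for illustration; the networks in the study were learned with WinMine, which is not reproduced here.

```python
from collections import Counter


def consensus_edges(networks, min_support=7):
    """Keep the directed edges that occur in at least `min_support` networks.

    Each network is an iterable of (parent, child) tuples. Treating the
    threshold as "at least" is an assumption of this sketch.
    """
    counts = Counter(edge for network in networks for edge in set(network))
    return {edge for edge, count in counts.items() if count >= min_support}


# Ten toy "fundamental" networks over hypothetical nodes X, Y, Z, W.
toy_networks = []
for fold in range(10):
    edges = {("X", "Y")}          # present in all 10 networks
    if fold < 7:
        edges.add(("Y", "Z"))     # present in 7 networks
    if fold < 6:
        edges.add(("Z", "W"))     # present in only 6 networks
    toy_networks.append(edges)

final_edges = consensus_edges(toy_networks)
```

Only edges that recur across most folds survive, which is what makes the final network robust to the particular random partition of the data.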

It was found that the Bayesian network for the excluded exon event contains 19 edges and 10 combinatorial patterns of histone modifications, including interactions between different levels of the same modification (e.g., H3K79me1, H3K79me2, and H3K79me3), between modifications on different amino acids (e.g., H2BK5me1 and H3K9me1), and between different kinds of modifications (e.g., H2BK5me1 and H4K16ac) (**Figure 3A**). It can also be observed that the 10 histone modifications that have direct correlations with RNA splicing in the excluded exon are H3K79me3, H3K79me2, H3K4me2, H4K16ac, H3K4me1, H3R2me1, H4K5ac, H2BK120ac, H3K18ac, and H3K4ac.

Distinct from the excluded exon, the Bayesian network for the included exon event contains 21 edges and 13 combinatorial patterns of histone modifications (**Figure 3B**). Interactions between modifications on different amino acids (e.g., H4K20me1 and H3K79me1) and between different kinds of modifications (e.g., H2BK5me1 and H4K91ac) were observed in this case. There are 13 histone modifications (H3K79me1, H3K36me3, H3K36me1, H3K4me1, H3K4me2, H2BK12ac, H3K27ac, H2AK5ac, H4K16ac, H3K4ac, H4K12ac, H2BK120ac, and H3K18ac) that have direct correlations with RNA splicing.

The above results demonstrate that the topologies of the Bayesian networks of histone modifications for the included and excluded exons in the skipping event are different. Moreover, differences also exist in the preceding and succeeding intronic regions of the included and excluded exons (**Supplementary Figures S5–S6**). Therefore, it can be concluded that the causal relationships of histone modifications differ clearly between included and excluded exons.

### CONCLUSION

Based on the Pearson correlation coefficients, the causal relationships of histone modifications in the process of RNA splicing were deduced by constructing their Bayesian networks. The results indicate that the inclusion or exclusion of exons is influenced by combinatorial patterns of histone modifications (**Figure 3** and **Supplementary Figures S5–S6**). Some of the histone modifications contribute directly to RNA splicing (e.g., H3K36me3 and H3K79me1), while other histone modifications contribute to RNA splicing indirectly.

The result that H3K36me3 and H3K79me1 can affect RNA splicing is consistent with previous studies which have demonstrated that H3K36me3 and H3K79me1 are enriched in included exons (Shindo et al., 2013). H3K36me3 can regulate alternative splicing by interacting with polypyrimidine tract-binding protein (PTB) (Luco et al., 2010). H3K79me1 was also reported to interact with the Tudor domain of TP53BP1, which in turn interacts with snRNP (Huyen et al., 2004; Shindo et al., 2013).

By relaxing the chromatin structure, H3 and H4 acetylation were also reported to regulate inclusion or exclusion of the skipping exon (Zhou et al., 2011). Besides the histone modifications located in exon regions, histone modifications located in intragenic regions can also influence RNA splicing by regulating RNAPII elongation rates, or by directly binding to splicing factors and hence mediating their binding to pre-mRNA (Gomez Acuna et al., 2013).

Since for some of the histone modifications there is no evidence of how they regulate RNA splicing, further experiments are needed to elucidate their roles in RNA splicing regulation. Taken together, we hope that this work will provide novel insights into research on RNA splicing. Besides histone modifications, the method proposed here could also be used to analyze the relationship between RNA splicing and other modifications, such as m6A (Chen et al., 2015; Chen et al., 2018a; Wei et al., 2019), m4C (Chen et al., 2017; He et al., 2018), phosphorylation (Wei et al., 2017), GlcNAcylation (Jia et al., 2018), etc.

#### AUTHOR CONTRIBUTIONS

WC and HL conceived and designed the experiments. HL, XS, and WC wrote the manuscript. All authors performed the experiments, read, and approved the final manuscript.

### FUNDING

This work was supported by the National Natural Science Foundation of China (31771471 and 61772119), Natural Science Foundation for Distinguished Young Scholar of Hebei Province (No. C2017209244), and the Program for the Top Young Innovative Talents of Higher Learning Institutions of Hebei Province (No. BJ2014028).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00122/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Chen, Song and Lin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Improved Pre-miRNAs Identification Through Mutual Information of Pre-miRNA Sequences and Structures

Xiangzheng Fu<sup>1</sup>, Wen Zhu<sup>2</sup>, Lijun Cai<sup>1</sup>\*, Bo Liao<sup>1,2</sup>\*, Lihong Peng<sup>3</sup>, Yifan Chen<sup>1</sup> and Jialiang Yang<sup>2,4</sup>\*

*<sup>1</sup> College of Information Science and Engineering, Hunan University, Changsha, China, <sup>2</sup> School of Mathematics and Statistics, Hainan Normal University, Haikou, China, <sup>3</sup> School of Computer Science, Hunan University of Technology, Zhuzhou, China, <sup>4</sup> Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, United States*

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Leyi Wei, Tianjin University, China; Akshay Kakumanu, Fulcrum Therapeutics, United States; Mahashweta Basu, United States Food and Drug Administration, United States*

#### \*Correspondence:

*Lijun Cai, ljcai@hnu.edu.cn; Bo Liao, dragonbw@163.com; Jialiang Yang, jialiang.yang@mssm.edu*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *15 October 2018* Accepted: *04 February 2019* Published: *25 February 2019*

#### Citation:

*Fu X, Zhu W, Cai L, Liao B, Peng L, Chen Y and Yang J (2019) Improved Pre-miRNAs Identification Through Mutual Information of Pre-miRNA Sequences and Structures. Front. Genet. 10:119. doi: 10.3389/fgene.2019.00119*

Playing critical roles as post-transcriptional regulators, microRNAs (miRNAs) are a family of short non-coding RNAs that are derived from longer transcripts called precursor miRNAs (pre-miRNAs). Experimental methods to identify pre-miRNAs are expensive and time-consuming, which presents the need for computational alternatives. In recent years, the accuracy of computational methods to predict pre-miRNAs has been increasing significantly. However, there are still several drawbacks. First, these methods usually only consider base frequencies or sequence information while ignoring the information between bases. Second, feature extraction methods based on secondary structures usually only consider the global characteristics while ignoring the mutual influence of the local structures. Third, methods integrating high-dimensional feature information are computationally inefficient. In this study, we have proposed a novel mutual information-based feature representation algorithm for pre-miRNA sequences and secondary structures, which is capable of capturing the interactions between sequence bases and local features of the RNA secondary structure. In addition, the feature space is smaller than that of most popular methods, which makes our method computationally more efficient than the competitors. Finally, we applied these features to train a support vector machine model to predict pre-miRNAs and compared the results with other popular predictors. As a result, our method outperforms others based on both 5-fold cross-validation and the Jackknife test.

Keywords: pre-miRNAs identification, feature representation algorithm, mutual information, structure analysis, support vector machine

# INTRODUCTION

Derived from hairpin precursors (pre-miRNAs), mature microRNAs (miRNAs) belong to a family of non-coding RNAs (ncRNAs) that play significant roles as post-transcriptional regulators (Lei and Sun, 2014). For example, hypothalamic stem cells partially control aging rate through extracellular miRNAs (Zhang et al., 2017). MiRNAs are formed by enzymatic cleavage of pre-miRNAs. Discovery of miRNAs relies on predictive models for characteristic features from pre-miRNAs. However, the short length of miRNA genes and the lack of pronounced sequence features complicate this task (Lopes et al., 2016). In addition, miRNAs are involved in many important biological processes, including plant development, signal transduction, and protein degradation (Zhang et al., 2006; Pritchard et al., 2012). Owing to its close relevance to miRNA biogenesis and small interfering RNA design, pre-miRNA prediction has recently become a hot topic in miRNA research. However, traditional experimental methods like ChIP-sequencing are expensive and time-consuming (Bentwich, 2005; Li et al., 2013; Liao et al., 2014; Peng et al., 2017). In the post-genome era, a large number of genome sequences have become available, which provides an opportunity for large-scale pre-miRNA identification by computational techniques (Li et al., 2010).

In recent years, many computational methods have been proposed to identify pre-miRNAs, most of which are based on machine learning (ML) algorithms or statistical models. The ML-based methods usually model pre-miRNA identification as a binary classification problem to discriminate real and pseudo-pre-miRNAs. Widely used ML-based algorithms include support vector machines (SVMs) (Xue et al., 2005; Helvik et al., 2007; Huang et al., 2007; Wang Y. et al., 2011; Lei and Sun, 2014; Lopes et al., 2014; Wei et al., 2014; Liu et al., 2015b; Khan et al., 2017), back-propagation and self-organizing map (SOM) neural networks (Stegmayer et al., 2016; Zhao et al., 2017), linear genetic programming (Markus and Carsten, 2007), hidden Markov models (Agarwal et al., 2010), random forest (RF) (Jiang et al., 2007; Kandaswamy et al., 2011; Lin et al., 2011), covariant discrimination (Chou and Shen, 2007; Lopes et al., 2014), Naive Bayes (Lopes et al., 2014), and deep learning (Mathelier and Carbone, 2010). For example, Yousef et al. (2006) and Peng et al. (2018) used a Bayesian classifier for pre-miRNA recognition, which has demonstrated effectiveness in recognizing pre-miRNAs in the genomes of different species. Xue et al. (2005) proposed a triplet-SVM predictor to identify pre-miRNA hairpin structural features, whose prediction performance was improved by 10% in a later method using an RF-based MiPred classifier (Jiang et al., 2007). In addition, Stegmayer et al. (2016) proposed a deepSOM predictor to address the imbalance between positive and negative pre-miRNA samples.

It is known that the performance of ML-based methods is highly associated with the extraction of features (Liao et al., 2015b; Zhang and Wang, 2017; Ren et al., 2018). Typical feature representation methods include secondary structure and sequence information-based methods (Wei et al., 2016; Saçar Demirci and Allmer, 2017; Yousef et al., 2017). For example, Xue et al. (2005) proposed a 32-dimensional feature of triplet sequences containing secondary structure information to better express pre-miRNA sequences. Jiang et al. (2007) performed random sequence rearrangement, which is useful in obtaining the energy characteristics of pre-miRNA sequences. However, this method is quite slow. In addition, Wei et al. (2014) and Chen et al. (2016) extended the features proposed by Xue et al. (2005) into 98-dimensional pre-miRNA features, which resulted in a better pre-miRNA prediction accuracy. Most pre-miRNAs have the characteristic stem–loop hairpin structure (Xue et al., 2005); thus, the secondary structure is an important feature used in computational methods. Recently, Liu et al. proposed several methods for predicting pre-miRNAs on the basis of the secondary structure, namely, iMiRNA-PseDPC (Liu et al., 2016), iMcRNA-PseSSC (Liu et al., 2015b), miRNA-dis (Liu et al., 2015a), and deKmer (Liu et al., 2015c). Some researchers (Khan et al., 2017; Yousef et al., 2017) have increased the dimensionality of features by combining multi-source features to improve the accuracy of pre-miRNA prediction. As the feature dimension increases, considerable redundant information and noise are also incorporated, which may reduce the prediction accuracy and slow down the algorithm. Thus, it is usually necessary to perform feature selection to remove irrelevant or redundant features. An effective feature selection method can reduce the running time for training the model and improve the performance of the prediction (Wang X. et al., 2011; Wang Y. et al., 2011). To further facilitate computational processes, several bioinformatics toolkits have been developed to generate numerical sequence feature information (Liu et al., 2015d).

Developing an effective feature representation algorithm for pre-miRNA sequences is a challenging task. Existing methods have several drawbacks, and their features may not be sufficiently informative to distinguish between pre-miRNAs and non-pre-miRNAs. First, even excellent feature extraction methods usually consider only the frequency or sequence information of the bases of pre-miRNA sequences while ignoring the interaction between two bases. Second, feature extraction methods based on secondary structures usually consider only the global characteristics while ignoring the mutual influence of the local characteristics of structures. Third, methods that combine multi-source feature information and integrate feature selection algorithms to reduce dimensionality (Khan et al., 2017; Yousef et al., 2017) are computationally inefficient.

As a useful measure to compare profile information based on their entropy, mutual information (MI) has been extensively applied in computational and bioinformatics studies. For instance, MI profiles were used as genomic signatures to reveal phylogenetic relationships between genomic sequences (Bauer et al., 2008), as a metric of phylogenetic profile similarity (Date and Marcotte, 2003), and for predicting drug-target interactions (Ding et al., 2017) and gene essentiality (Nigatu et al., 2017). Inspired by previous studies (Date and Marcotte, 2003; Bauer et al., 2008; Ding et al., 2017; Nigatu et al., 2017; Zhang and Wang, 2018), we proposed a novel MI-based feature representation algorithm for sequences and secondary structures of pre-miRNAs. Specifically, we used entropy and MI to calculate the interdependence between bases, and calculated the 3-gram MI and 2-gram MI of the sequences and secondary structures as feature vectors, respectively. Due to the nature of MI in representing profile dependency, our method is capable of capturing the interactions between sequence bases and local features of the secondary structure, which is critical to pre-miRNA prediction. In addition, we combined the MI features with the minimum free energy (MFE) feature of the pre-miRNA, one of the most widely used features in RNA studies, and constructed a total of 55-dimensional features.

Since the feature space is smaller than that of most popular methods, our method is computationally more efficient than its competitors while keeping the most important information for pre-miRNA prediction. Our method was evaluated on a stringent benchmark dataset by a jackknife test and compared with a few canonical methods.

# MATERIALS AND METHODS

# Framework of the Proposed Method

We illustrated in **Figure 1** the overall framework of our method, which consists of two main steps, namely, feature extraction and pre-miRNA prediction. In the feature extraction step, the initial pre-miRNA sequences were first extracted from the raw data. Secondly, homology bias was avoided by using the CD-HIT software (Li and Godzik, 2006) (with threshold value 0.8), and the samples with similarity greater than the threshold in the initial dataset were filtered out. The remaining data was used as the benchmark dataset for this study. After that, the secondary structures of the sequences in the benchmark dataset were predicted by the software RNAfold (Hofacker, 2003). Finally, the primary sequence features based on mutual information (PSFMI), secondary structure features based on mutual information (SSFMI), and MFE features were retrieved, respectively for samples in the benchmark dataset. In the pre-miRNA prediction step, the generated features were fed into an SVM classifier to generate a training model, which was employed to predict pre-miRNAs.

# Datasets

#### Balanced Dataset

Our balanced benchmark datasets for pre-miRNA identification combine real Homo sapiens pre-miRNAs as the positive set with one of two subsets of pseudo pre-miRNAs as the negative set. The two resulting datasets, named S<sub>1</sub> and S<sub>2</sub>, can be formulated as:

$$\begin{aligned} \mathcal{S}_1 &= \mathcal{S}^+ \cup \mathcal{S}^-_{\text{xue}(1612)} \\ \mathcal{S}_2 &= \mathcal{S}^+ \cup \mathcal{S}^-_{\text{wei}(1612)} \end{aligned}$$

The positive set S<sup>+</sup> contains a total of 1,612 samples, which were selected from the 1,872 reported Homo sapiens pre-miRNA entries downloaded from miRBase (20th Edition) (Kozomara and Griffiths-Jones, 2011); pre-miRNAs sharing more than 80% sequence similarity were removed using the CD-HIT software (Li and Godzik, 2006) to eliminate redundancy and avoid bias. The negative set S<sup>−</sup><sub>xue</sub> contains 1,612 pseudo miRNAs selected from the 8,494 pre-miRNA-like hairpins of Xue et al. (2005), and S<sup>−</sup><sub>wei</sub> contains 1,612 pseudo miRNAs selected from the 14,250 pre-miRNA-like hairpins of Wei et al. (2014).

In addition, we selected 88 new pre-miRNA sequences from a later version (e.g., miRBase22) as positive samples, and selected 88 samples from S<sup>−</sup><sub>wei</sub> as negative samples to construct a benchmark dataset for independent testing, named S<sub>3</sub>, which can be formulated as:

$$\mathcal{S}_3 = \mathcal{S}^+_{\text{miR22}} \cup \mathcal{S}^-_{\text{wei}(88)}$$

#### Imbalanced Dataset

To evaluate the performance of our approach on unbalanced data, we constructed two unbalanced benchmark datasets, named S<sub>4</sub> and S<sub>5</sub>, which can be formulated as:

$$\begin{aligned} \mathcal{S}_4 &= \mathcal{S}^+ \cup \mathcal{S}^-_{\text{wei}} \\ \mathcal{S}_5 &= \mathcal{S}^+_{\text{microPred}} \cup \mathcal{S}^-_{\text{microPred}} \end{aligned}$$

Specifically, S<sub>4</sub> consists of S<sup>+</sup> (positive samples) and S<sup>−</sup><sub>wei</sub> (negative samples) with a ratio of ∼1:8.8 (1,612:14,250). S<sub>5</sub> was adopted from microPred (Batuwita and Palade, 2009) and contains 691 non-redundant human pre-miRNAs from miRBase release 12 and 8,494 pseudo hairpins.

To evaluate performance on other species, we retrieved the virus pre-miRNA sequence dataset from the study of Gudyś et al. (2013). As with the other datasets, we removed pre-miRNAs sharing more than 80% sequence similarity using the CD-HIT software. As a result, we constructed a virus dataset, S<sub>6</sub>, which contains 232 positive samples and 232 negative samples and can be formulated as:

$$\mathcal{S}_6 = \mathcal{S}^+_{\text{virus}} \cup \mathcal{S}^-_{\text{virus}}$$

where S<sup>+</sup><sub>virus</sub> (positive samples) and S<sup>−</sup><sub>virus</sub> (negative samples) were obtained from the study of Gudyś et al. (2013).

#### Classification Algorithm and Optimization

We selected SVM to classify the samples. Specifically, the publicly available support vector machine library (LIBSVM) was applied to the benchmark data with our feature representation. The LIBSVM toolkit can be downloaded freely at http://www.csie.ntu.edu.tw/~cjlin/libsvm. We integrated this toolbox into the Matrix Laboratory (MATLAB) workspace to build the prediction system. We selected the radial basis function as the kernel function, and a grid search based on 10-fold cross-validation was used to optimize the kernel parameter γ and the penalty parameter C. The optimal parameters were found to be C = 65,536 and γ = 10<sup>−4</sup>.
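For readers unfamiliar with the two tuned parameters, the sketch below writes out the radial basis function kernel, K(u, v) = exp(−γ‖u − v‖²), and enumerates a candidate grid of (C, γ) pairs. The grid ranges are assumptions made for illustration, since the search ranges are not stated here; the cross-validated training that would score each pair (done with LIBSVM in the study) is deliberately left out.

```python
from math import exp


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return exp(-gamma * squared_distance)


# Assumed search grid: powers of two for C and powers of ten for gamma.
c_grid = [2.0 ** k for k in range(-5, 17)]       # 2^-5 ... 2^16 (= 65,536)
gamma_grid = [10.0 ** k for k in range(-6, 1)]   # 1e-6 ... 1

# Every (C, gamma) pair would be scored by 10-fold cross-validation,
# and the pair with the best validation accuracy would be kept.
candidates = [(c, gamma) for c in c_grid for gamma in gamma_grid]

# With gamma = 1e-4, two feature vectors at squared distance 25 remain very similar.
similarity = rbf_kernel([0.0, 0.0], [3.0, 4.0], 1e-4)
```

A small γ such as 10<sup>−4</sup> gives a smooth, wide kernel, while a large C penalizes training errors heavily; the reported optimum combines the two.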

### Feature Extraction

#### Primary Sequence Features Based on Mutual Information (PSFMI)

Recently, it has been shown that local continuous primary sequence characteristics are crucial for pre-miRNA prediction (Bonnet et al., 2004). As one of the important characteristics, n-grams are often used in feature mapping (Liu and Wong, 2003). Let S be a given pre-miRNA sequence (consisting of four characters: A, U, C, and G) with length L. Then the n-grams are the contiguous subsequences of length n in S.

**Figure 2** shows the calculation process for the 2-gram and 3-gram PSFMI feature representations. Any two or three consecutive bases in the pre-miRNA sequence, regardless of the order of the bases, are represented as a 2-gram or a 3-gram, respectively. For example, as shown in **Figure 2**, the 2-gram "GA" occurs 3 times and the 2-gram "UG" occurs 4 times. Similarly, a 3-gram represents three consecutive bases; for example, the 3-gram "GGU" occurs 2 times.
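The counting step can be sketched as below. Reading "regardless of the order of the bases" as treating each window as an unordered multiset (so that "GA" and "AG" share one count) is an interpretation made for this sketch, and the sequence `toy_sequence` is a made-up example rather than the one in Figure 2.

```python
from collections import Counter


def unordered_ngram_counts(sequence, n):
    """Count windows of n consecutive bases, ignoring the order inside each window."""
    counts = Counter()
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        key = "".join(sorted(window))  # e.g. "GA" and "AG" both map to "AG"
        counts[key] += 1
    return counts


toy_sequence = "GAUGGA"  # hypothetical 6-nt sequence

two_grams = unordered_ngram_counts(toy_sequence, 2)    # L - 1 = 5 windows
three_grams = unordered_ngram_counts(toy_sequence, 3)  # L - 2 = 4 windows
```

Under this reading there are exactly 10 possible 2-grams and 20 possible 3-grams over the alphabet {A, C, G, U}, which matches the number of mutual information values used as features.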

In this study, we used entropy and mutual information (MI) to calculate the interdependence between two bases on a given pre-miRNA sequence. Specifically, we calculated the 3-gram MI and the 2-gram MI as the feature vector for a given pre-miRNA sequence. The 3-tuple MI for 3-gram is calculated as:

$$MI(\mathbf{x}, \mathbf{y}, \mathbf{z}) = MI(\mathbf{x}, \mathbf{y}) - MI(\mathbf{x}, \mathbf{y}|\mathbf{z}) \tag{1}$$

where x, y, and z are three consecutive bases. Subsequently, MI(x, y) and the conditional MI(x, y|z) can be calculated as follows:

$$MI(\mathbf{x}, \mathbf{y}|\mathbf{z}) = H(\mathbf{x}|\mathbf{z}) - H(\mathbf{x}|\mathbf{y}, \mathbf{z}) \tag{2}$$

$$MI(\mathbf{x}, \mathbf{y}) = p(\mathbf{x}, \mathbf{y}) \log\left(\frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x})\, p(\mathbf{y})}\right) \tag{3}$$

$$MI(\mathbf{x}, \mathbf{y}) = MI(\mathbf{y}, \mathbf{x}) \tag{4}$$

Here, H(x), H(x|z), and H(x|y, z) are calculated as follows:

$$H(\mathbf{x}) = -p(\mathbf{x}) \log(p(\mathbf{x})) \tag{5}$$

$$H(\mathbf{x}|\mathbf{z}) = -\frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z})} \log\left(\frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z})}\right) \tag{6}$$

$$H(\mathbf{x}|\mathbf{y}, \mathbf{z}) = -\frac{p(\mathbf{x}, \mathbf{y}, \mathbf{z})}{p(\mathbf{y}, \mathbf{z})} \log\left(\frac{p(\mathbf{x}, \mathbf{y}, \mathbf{z})}{p(\mathbf{y}, \mathbf{z})}\right) \tag{7}$$

where p(x) denotes the frequency of x appearing in a pre-miRNA sequence, p(x, y)denotes the frequency of x and y appearing in 2 grams and p(x, y, z) denotes the frequency of x, y, and z appearing in 3-tuples in a pre-miRNA sequence. p(x), p(x, y)andp(x, y, z) can be calculated by Equations (8)–(10):

$$p(\mathbf{x}) = \frac{N_{\mathbf{x}} + \varepsilon}{L} \tag{8}$$

$$p(\mathbf{x}, \mathbf{y}) = \frac{N_{\mathbf{x}\mathbf{y}} + \varepsilon}{L - 1} \tag{9}$$

$$p(\mathbf{x}, \mathbf{y}, \mathbf{z}) = \frac{N_{\mathbf{x}\mathbf{y}\mathbf{z}} + \varepsilon}{L - 2} \tag{10}$$

Here, N<sub>x</sub> is the number of occurrences of base x in the pre-miRNA sequence, N<sub>xy</sub> and N<sub>xyz</sub> are the corresponding numbers of occurrences of the 2-gram and the 3-gram, and L is the length of the pre-miRNA sequence. In Equations (8)–(10), ε represents a very small positive real number that does not affect the final score; it keeps every probability strictly positive and thus avoids having 0 as a denominator.
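Equations (1)–(10) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the value of `EPS`, the natural logarithm, the toy sequence, and all function names are assumptions, and n-grams are keyed by their sorted symbols to reflect the statement that base order is ignored.

```python
import math
from collections import Counter

EPS = 1e-10  # assumed value for epsilon; the text only says "very small"


def unordered_ngram_counts(seq, n):
    """Count n-grams of consecutive symbols, ignoring the order inside each n-gram."""
    counts = Counter()
    for i in range(len(seq) - n + 1):
        key = "".join(sorted(seq[i:i + n]))  # "UG" and "GU" share the key "GU"
        counts[key] += 1
    return counts


def ngram_probability(seq, gram):
    """p(x), p(x, y) or p(x, y, z) as in Equations (8)-(10)."""
    n = len(gram)
    key = "".join(sorted(gram))
    count = unordered_ngram_counts(seq, n)[key]
    return (count + EPS) / (len(seq) - n + 1)  # denominators L, L - 1, L - 2


def mi_pair(seq, x, y):
    """2-gram mutual information, Equation (3)."""
    pxy = ngram_probability(seq, x + y)
    return pxy * math.log(pxy / (ngram_probability(seq, x) * ngram_probability(seq, y)))


def mi_triple(seq, x, y, z):
    """3-gram mutual information, Equations (1), (2), (6) and (7)."""
    r_xz = ngram_probability(seq, x + z) / ngram_probability(seq, z)
    h_x_given_z = -r_xz * math.log(r_xz)                      # Equation (6)
    r_xyz = ngram_probability(seq, x + y + z) / ngram_probability(seq, y + z)
    h_x_given_yz = -r_xyz * math.log(r_xyz)                   # Equation (7)
    conditional_mi = h_x_given_z - h_x_given_yz               # Equation (2)
    return mi_pair(seq, x, y) - conditional_mi                # Equation (1)


toy_sequence = "GGUGAUG"  # illustrative only, not taken from the benchmark data
print(unordered_ngram_counts(toy_sequence, 2))
print(mi_pair(toy_sequence, "G", "U"), mi_triple(toy_sequence, "G", "G", "U"))
```

Because the key is order-insensitive, `mi_pair(seq, "G", "U")` and `mi_pair(seq, "U", "G")` coincide, which matches Equation (4). Sequences shorter than the n-gram length are not handled in this sketch.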

According to Equations (1)–(10), a given pre-miRNA sequence can be expressed as 30 mutual information values [20 3-tuples MI(x, y, z) and 10 2-tuples MI(x, y)]. In addition, we calculated the frequency of the four base classes appearing in this pre-miRNA sequence. Therefore, the pre-miRNA sequence can be expressed as 20 + 10 + 4 = 34 features, as determined using our proposed mutual information method.
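The counts of 10 and 20 follow from treating n-grams as unordered: they are the numbers of size-2 and size-3 multisets over the four bases. The short sketch below enumerates them; the feature names and their ordering are hypothetical and serve only to show how the 34 dimensions add up.

```python
from itertools import combinations_with_replacement


def unordered_tuples(alphabet, n):
    """All size-n multisets over the alphabet, written as sorted strings."""
    return ["".join(t) for t in combinations_with_replacement(sorted(alphabet), n)]


BASES = "ACGU"
pair_names = unordered_tuples(BASES, 2)    # unordered 2-tuples
triple_names = unordered_tuples(BASES, 3)  # unordered 3-tuples

# Hypothetical layout: 3-tuple MI values, 2-tuple MI values, then base frequencies.
feature_names = (
    [f"MI({t})" for t in triple_names]
    + [f"MI({t})" for t in pair_names]
    + [f"freq({b})" for b in BASES]
)

print(len(triple_names), len(pair_names), len(feature_names))  # 20 10 34
```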

#### Secondary Structure Features Based on Mutual Information (SSFMI)

It has been shown that the structure of pre-miRNA can provide insights into biological functions. Pre-miRNA structural information can be predicted by RNAfold (Hofacker, 2003) software from sequences and is frequently used as features by machine-learning algorithms. **Figure 3** shows the pre-miRNA secondary structure of miRNA hsa-mir-302f, which was obtained using the algorithm in Mathews et al. (1999).

The pre-miRNA secondary structure is represented as a sequence of three symbols: a left parenthesis, a right parenthesis, and a dot. In other words, each nucleotide has one of two states, paired or unpaired, which are represented by parentheses "(" or ")" and dots ".", respectively. The open parenthesis "(" indicates a paired nucleotide located toward the 5′ end that pairs with a nucleotide toward the 3′ end, which is represented by the corresponding close parenthesis ")". The secondary structure of the pre-miRNA sequence is thus composed of unpaired bases and the base pairs A–U and C–G. To a certain extent, after such treatment, the secondary structure of the pre-miRNA sequence can be converted into a linear sequence.

A given pre-miRNA sequence S is converted to a pre-miRNA secondary structure sequence by using the RNAfold software. The length of the sequence is denoted by L, and the mutual information of the n-grams of the secondary structure sequence is calculated by Equations (1) and (3). The calculation process is similar to that for the PSFMI. **Figure 2** shows the calculation process for the 2-gram and 3-gram SSFMI feature representations.

According to Equations (1)–(10), the pre-miRNA secondary structure sequence can be expressed as 16 mutual information values [10 3-tuples MI(x, y, z) and 6 2-tuples MI(x, y)]. Similarly, the frequencies of the three symbols that appear in the sequence of secondary structure elements were calculated. Another significant feature is the number of base pairs in the pre-miRNA sequence. Given the presence of the G–U wobble pair in the hairpin loop structure (secondary structure) of the pre-miRNA, the G–U pair is also counted as a base pair.

Therefore, the secondary structure features can be expressed as 10 + 6 + 3 + 1 = 20 features, as determined using our proposed mutual information method.
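As an illustration of the structure-derived quantities, the sketch below parses a dot-bracket string with a stack, counts base pairs, computes the three symbol frequencies, and confirms the 10 + 6 tuple counts. The toy structure and function names are assumptions; in practice the string would come from RNAfold, and whatever pairs it reports (including G–U) are counted.

```python
from itertools import combinations_with_replacement


def base_pairs(structure):
    """Return (i, j) index pairs for matching parentheses in a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def symbol_frequencies(structure):
    """Frequency of each of the three structure symbols."""
    return {s: structure.count(s) / len(structure) for s in "()."}


toy_structure = "((..))."  # illustrative only
print(len(base_pairs(toy_structure)), symbol_frequencies(toy_structure))

# Unordered 2- and 3-tuples over three symbols: 6 and 10, so 10 + 6 + 3 + 1 = 20.
n_pairs = len(list(combinations_with_replacement("().", 2)))
n_triples = len(list(combinations_with_replacement("().", 3)))
print(n_triples + n_pairs + 3 + 1)  # 20
```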

In addition, studies have shown that real pre-miRNA sequences are generally more stable than randomly generated pseudo-pre-miRNAs and therefore have a lower minimum free energy (MFE). For this reason, structural energy features are often used to characterize pre-miRNA sequences during feature extraction. Since RNAfold reports the MFE value of the secondary structure together with the predicted structure, we used this value directly as a feature.

In summary, we extracted a total of 55 [34 (PSFMI) + 20 (SSFMI) + 1 (MFE)] features, in which the 34-dimensional feature was obtained by applying the PSFMI method to the pre-miRNA sequence, the 20-dimensional feature was obtained by applying the SSFMI method to the pre-miRNA secondary structure, and the remaining feature is the MFE value calculated by the RNAfold software. Since the distribution of the values in each feature is non-uniform, we normalized each feature to the range [−1, 1] using the MATLAB function mapminmax (MATLAB 2014b), and obtained the final 55-dimensional feature data set for model training.
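For readers without MATLAB, the per-feature rescaling to [−1, 1] can be approximated as below. This is a sketch of plain min-max scaling, not a port of mapminmax; in particular, the handling of a constant feature (mapped to the midpoint of the range) is an assumption of this example.

```python
def minmax_scale(column, lo=-1.0, hi=1.0):
    """Linearly rescale one feature column so its minimum maps to lo and maximum to hi."""
    cmin, cmax = min(column), max(column)
    if cmax == cmin:
        # Assumed behaviour for a constant feature: place every value at the midpoint.
        return [(lo + hi) / 2.0 for _ in column]
    return [lo + (hi - lo) * (v - cmin) / (cmax - cmin) for v in column]


def scale_features(rows):
    """Scale each column of a list-of-rows feature matrix independently."""
    columns = [minmax_scale(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


toy_rows = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]  # illustrative values only
print(scale_features(toy_rows))
```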

#### Measurements

In statistical prediction experiments, three validation methods are often used to test the effectiveness of a prediction algorithm: the independent dataset test, the K-fold cross-validation test, and the Jackknife validation test. Among them, the Jackknife test is considered the most rigorous and objective. In the field of pre-miRNA prediction, the Jackknife test is often used to verify the predictive performance of different algorithms. In the Jackknife test, each pre-miRNA sequence is selected in turn as the test sample, the remaining pre-miRNA sequences are used as training samples, and the category of the test sample is predicted by the model trained on the training samples. Therefore, we adopted the Jackknife test in this study.
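The Jackknife (leave-one-out) procedure can be sketched as follows. To keep the example self-contained, a nearest-centroid rule stands in for the SVM used in the paper; that substitution, the toy data, and all names are assumptions made for illustration.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier: assign x to the class whose mean vector is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(centroids, key=lambda l: squared_distance(centroids[l], x))


def jackknife_predictions(X, y, predict):
    """Hold out each sample once, train on the rest, and predict the held-out sample."""
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        predictions.append(predict(train_X, train_y, X[i]))
    return predictions


toy_X = [[0.0], [0.2], [0.1], [1.0], [0.9], [1.1]]  # illustrative 1-D features
toy_y = [0, 0, 0, 1, 1, 1]
print(jackknife_predictions(toy_X, toy_y, nearest_centroid_predict))
```

With N samples the model is trained N times, which is why the Jackknife test is more expensive than K-fold cross-validation.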

To comprehensively evaluate the performance of the pre-miRNA prediction method, several indicators were used in this paper. The receiver operating characteristic (ROC) curve was plotted based on specificity (SP) and sensitivity (SE). The area under the ROC curve (AUC) and the average area under the precision-recall curve (AUPR) were both used as evaluation metrics. The AUC provides a measure of classifier performance; the larger the AUC, the better the classifier. However, for class-imbalanced problems, AUPR is more suitable than AUC because it penalizes false positives more heavily. In addition, the Matthews correlation coefficient (MCC) was used to evaluate the prediction performance. The MCC accounts for true and false positives and negatives and is usually regarded as a balanced measure that can be used even if the classes are of different sizes. The sensitivity (SE), specificity (SP), precision (PR), F1-score, accuracy (ACC), and MCC are defined as follows:

$$SE = \frac{TP}{TP + FN} \tag{11}$$

$$SP = \frac{TN}{TN + FP} \tag{12}$$

$$PR = \frac{TP}{TP + FP} \tag{13}$$

$$F_1\text{-score} = 2 \times \frac{SE \times PR}{SE + PR} \tag{14}$$

$$ACC = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TP} + \text{FP})(\text{TN} + \text{FN})}} \tag{16}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
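Equations (11)–(16) translate directly into code. The sketch below uses made-up confusion-matrix counts purely for illustration and does not guard against zero denominators, which would need handling for degenerate classifiers.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute SE, SP, PR, F1, ACC and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)                      # Equation (11)
    sp = tn / (tn + fp)                      # Equation (12)
    pr = tp / (tp + fp)                      # Equation (13)
    f1 = 2 * se * pr / (se + pr)             # Equation (14)
    acc = (tp + tn) / (tp + fp + tn + fn)    # Equation (15)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )                                        # Equation (16)
    return {"SE": se, "SP": sp, "PR": pr, "F1": f1, "ACC": acc, "MCC": mcc}


# Hypothetical counts, not taken from any table in the paper.
toy_metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(toy_metrics)
```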

# RESULTS AND DISCUSSION

#### Performance of Different Features

According to the feature extraction algorithm proposed in this paper, the corresponding 55 features (including PSFMI, SSFMI, and MFE) were extracted for each true and false pre-miRNA (positive and negative sample data) in the benchmark dataset. For the improved evaluation of these features, they were subdivided into four subsets according to the different feature types, namely, PSFMI, SSFMI, PSFMI + MFE, and SSFMI + MFE feature sets. To assess the importance of each feature subset, predictive models were constructed on the basis of the different feature subsets of the benchmark dataset. Jackknife verification was used to evaluate the performance of the predictive models.

**Table 1** presents a comparison of the performances of the predictive models based on the different feature subsets and combinations thereof. As demonstrated in **Table 1**, the predictive model based on the feature subset SSFMI is better than that based on the feature subset PSFMI. The predictive model based on SSFMI achieves 80.21% sensitivity, 88.34% specificity, a Matthews correlation coefficient of 0.688, and a prediction accuracy of 84.27%. In other words, the model based on the mutual information of the pre-miRNA secondary structure is better than that based on the sequence-based mutual information. The performances of the predictive models based on the PSFMI + MFE and SSFMI + MFE feature sets are improved compared with those based on the independent feature subsets (i.e., PSFMI and SSFMI feature sets). In terms of accuracy, the performance of the PSFMI + MFE model is 13.24% better than that of the PSFMI model, whereas the performance of the SSFMI + MFE model is 1.31% better than that of the SSFMI model. The experimental results show that MFE features should be combined with the mutual information features to increase prediction accuracy.

TABLE 1 | The performance of different features on benchmark dataset (Jackknife test evaluation).


*The best values are shown in boldface.*

We also compare the AUROC of four feature combinations obtained by Jackknife cross-validation on benchmark dataset S1, shown in **Figure 4**. We can draw the same conclusion that the prediction model based on feature subset SSFMI is better than the prediction model based on feature subset PSFMI, and the combination of MFE features can improve the accuracy of prediction.

#### Feature Importance Analysis

To explore the extent to which the features in the feature set affect the classification, we analyzed the importance of each feature. To measure importance quantitatively, we used the information gain (IG) metric (Deng et al., 2011; Uğuz, 2011). IG scores are widely used in the analysis of feature importance of biological sequences (Wei et al., 2014, 2017; Chen et al., 2016). The higher the IG value, the more important the feature is for the classifier. **Table 2** presents the IG scores of 55 features. As shown in **Table 2**, although the 4 highest IG values all belong to PSFMIs, the 10 lowest IG values also belong to PSFMIs, indicating that the IG values of the PSFMIs are unevenly distributed and vary widely. The average IG value of PSFMI features is 0.5761, whereas the average IG value of SSFMI is 0.7489, further confirming that the secondary structure characteristics of pre-miRNA have a greater influence on the classification results than the primary sequence characteristics.

TABLE 2 | Importance of the relatively specific features in the proposed features set.

TABLE 3 | Comparison of performance of different kernel functions on the benchmark dataset *S*1 (Jackknife test evaluation).



*The best values are shown in boldface.*

TABLE 4 | A brief introduction to the state-of-the-art predictors.


*<sup>a</sup>ACC's best parameter settings.*

The experimental findings are also consistent with the feature importance analysis.
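The information gain of a feature can be illustrated with the standard definition, IG = H(Y) − H(Y | feature). The sketch below assumes the feature has already been discretized (here, into two bins); the paper does not state how continuous features were discretized, so that step, the base-2 logarithm, and the toy data are all assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


toy_labels = [0, 0, 1, 1]           # illustrative class labels
informative_feature = [0, 0, 1, 1]  # separates the classes perfectly
uninformative_feature = [0, 1, 0, 1]
print(information_gain(informative_feature, toy_labels))
print(information_gain(uninformative_feature, toy_labels))
```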

#### Effect of Different Kernel Functions

To select the kernel function of the SVM for our algorithm, we ran another set of experiments on the benchmark dataset using Jackknife test evaluation. Four kernel functions were tested: the linear kernel, the polynomial kernel, the Radial Basis Function (RBF) kernel, and the sigmoid kernel. The results achieved in these experiments are shown in **Table 3**. The ACC, MCC, and AUC of the SVM classifier with the RBF kernel were higher than those of all other classifiers. Therefore, in this study, we chose the SVM classifier with the RBF kernel.
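For reference, the four kernels can be written out as below, using the parameterization common to SVM libraries such as LIBSVM. The default values of `gamma`, `coef0`, and `degree` are placeholders, not the settings used in the experiments.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    # exp(-gamma * ||u - v||^2); equals 1 when u == v and decays with distance.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]  # illustrative vectors
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```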

#### Performance on Balanced Dataset

We compared the ACC, SE, SP, MCC, and AUC achieved on the benchmark datasets S<sup>1</sup> and S<sup>2</sup> by our predictor with those of the following methods: iMiRNA-SSF (Chen et al., 2016), miRNAPre (Wei et al., 2014), Triplet-SVM (Xue et al., 2005), iMcRNA-PseSSC (Liu et al., 2015b), and iMiRNA-PseDPC (Liu et al., 2016). A brief introduction to these methods is given in **Table 4**. As can be seen from **Table 4**, both the iMcRNA-PseSSC (Liu et al., 2015b) and iMiRNA-PseDPC (Liu et al., 2016) methods require parameters, and the iMiRNA-PseDPC (Liu et al., 2016) method has the largest feature dimension.

The performance of the different methods on the benchmark datasets S<sup>1</sup> and S<sup>2</sup> via the Jackknife test is shown in **Tables 5**, **6**, respectively. For a fair comparison, the performances of these methods were taken from other studies with the best-tuned parameters (Liu et al., 2015b, 2016). **Table 5** shows that our method outperforms previous methods on most of the evaluation metrics used. Among the evaluated methods, our method achieves the best predictive performance on four metrics: AUC (96.54%), ACC (90.60%), MCC (0.813), and SP (92.62%). The respective ACC and MCC of our method are 1.51% and 0.051 higher than those of the previously best-performing predictor iMiRNA-SSF (Chen et al., 2016) (ACC = 88.09% and MCC = 0.762). The AUC of our method is 1.57% higher than that of the previously best-performing predictor iMiRNA-PseDPC (Liu et al., 2016) (AUC = 94.97%). In addition, we incorporated the new negative samples from Wei's study (Wei et al., 2014) to construct a new benchmark dataset S2, and compared the prediction performance of our method with that of 5 other popular methods using the Jackknife test (see **Table 6**). As can be seen, our method achieves the best predictive performance on 4 (out of 5) metrics, namely AUC (95.04%), ACC (88.00%), MCC (0.760), and specificity (88.71%), and is slightly worse than iMiRNA-PseDPC in sensitivity.

TABLE 5 | Results of the proposed method and state-of-the-art predictors on benchmark dataset *S*1 (Jackknife test evaluation).

*The best values are shown in boldface.*

TABLE 6 | Results of the proposed method and state-of-the-art predictors on

*The best values are shown in boldface.*

TABLE 7 | Comparing the proposed method with other state-of-the-art predictors on an independent dataset *S*3.

*The best values are shown in boldface.*

To further compare the performance of our method with other methods on independent testing, we chose the S<sup>1</sup> dataset as the training set and the S<sup>3</sup> dataset as the test set. **Table 7** shows that our method outperforms all other methods in the independent test in terms of ACC (70.45%) and MCC (0.412). The iMiRNA-PseDPC (Liu et al., 2016) method has an AUC value of 81.69%, which is the best AUC value among all methods. The AUC of our method (75.54%) is comparable to that of the iMcRNA-PseSSC (Liu et al., 2015b) method (75.81%). However, iMiRNA-PseDPC uses as many as 725 feature dimensions, far exceeding the 55 dimensions of our method, and the time overhead of our method is lower than that of iMiRNA-PseDPC.

TABLE 8 | Five-fold cross-validation prediction performance of the proposed method and 4 state-of-the-art predictors on imbalanced benchmark dataset *S*4 and *S*5 .


*The best values are shown in boldface.*

#### Performance on Imbalanced Dataset

We then tested our method on S<sup>4</sup> and S<sup>5</sup> together with 4 other state-of-the-art methods: miRNAPre (Wei et al., 2014), Triplet-SVM (Xue et al., 2005), iMcRNA-PseSSC (Liu et al., 2015b), and iMiRNA-PseDPC (Liu et al., 2016). The performance was evaluated using 5-fold cross-validation, and the results are summarized in **Table 8**. As can be seen, our method performed the best on all 3 evaluation metrics, AUC (0.9589), F1-score (0.7813), and AUPR (0.8525), on the dataset S5. On the dataset S4, our method ranks first on F1-score (with a value of 0.7084) and second on AUC and AUPR. For a better view, we also plotted the ROC curves and precision-recall curves of our method on S<sup>4</sup> and S<sup>5</sup> for all 5 folds in **Figures 5**, **6**, respectively.

#### Performance on Other Species

We then compared our method with 4 state-of-the-art methods on the benchmark dataset S<sup>6</sup> through the Jackknife test. **Table 9** shows that our method outperforms all other methods in this test with an ACC of 92.59%, an MCC of 0.852, and an AUC of 98.07%. The experimental results show that our method also performs well on other species.

#### Case Study

Sometimes, an earlier version of the miRBase database (e.g., miRBase 20) may contain some false-positive pre-miRNAs, which are excluded in a later version (e.g., miRBase 22). Usually, they are saved in the file "miRNA.dead." If miRBase 20 is used as benchmark data, a good method should predict these false-positive pre-miRNAs to be negative (i.e., not pre-miRNAs). This is the case for our method: **Table 10** lists the 8 false-positive pre-miRNAs that it predicted to be negative, in which the column names "ID" and "Accession" indicate the ID and the accession number of the pre-miRNA sequences in miRBase 22, respectively.

TABLE 9 | Comparing the proposed method and state-of-the-art predictors on the benchmark dataset *S*6 (using the Jackknife test).

*The best values are shown in boldface.*

TABLE 11 | The running time (in seconds) of different methods on benchmark dataset *S*6 using the Jackknife test, where C and γ represent the penalty coefficient of the SVM model and the parameter of the RBF function, respectively.

#### Running Time

In this study, we used the SVM model to predict pre-miRNAs. The time complexity of training our SVM model is $O(N_S^3 + N_S^2 l + N_S d l)$ (Burges, 1998), where $l$ is the number of training points, $N_S$ is the number of support vectors (SVs), and $d$ is the dimension of the input data.

To further evaluate the performance of our method and its competitors, we tested the running time on the S<sup>6</sup> dataset on the same platform. The experiments were carried out on a computer with an Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00 GHz, 16 GB of memory, and Windows OS. Detailed running times are shown in **Table 11**. Our method achieves better running time while also obtaining good accuracy.

TABLE 10 | False-positive pre-miRNAs predicted to be negative by our method.

| ID | Accession |
| --- | --- |
| hsa-mir-566 | MI0003572 |
| hsa-mir-3607 | MI0015997 |
| hsa-mir-3656 | MI0016056 |
| hsa-mir-4417 | MI0016753 |
| hsa-mir-4459 | MI0016805 |
| hsa-mir-4792 | MI0017439 |
| hsa-mir-6723 | MI0022558 |
| hsa-mir-7641-1 | MI0024975 |

#### CONCLUSIONS

Pre-miRNA prediction is one of the hot topics in the field of miRNA research (Yue et al., 2014; Cheng et al., 2015; Liao et al., 2015a; Luo et al., 2017, 2018; Peng et al., 2017, 2018; Xiao et al., 2017; Fu et al., 2018). In recent years, machine learning-based miRNA precursor prediction methods have made great progress. Most of the existing prediction methods are based on global features of the sequence, ignoring the influence of the individual sequence bases, and their use of pre-miRNA structure information does not consider local characteristics. For this reason, this paper performs mutual information calculation on the pre-miRNA sequence and the secondary structure, respectively, to extract local features of both. Then, the extracted features are input to a support vector machine classifier for prediction.


Finally, the experimental results show that, compared with the existing methods, the proposed method improves the sensitivity and specificity of pre-miRNA prediction. In addition, since the feature space of our method has only 55 dimensions, fewer than most state-of-the-art methods, our feature construction is also efficient when plugged into canonical classification methods such as SVM. In summary, our method can extract effective features of pre-miRNAs and predict reliable candidate pre-miRNAs for further experimental validation.

#### AUTHOR CONTRIBUTIONS

XF, JY, YC, BL, and LC conceived the concept of the work. XF, WZ, and LP performed the experiments. XF and JY wrote the paper.

#### FUNDING

This study is supported by the Program for National Natural Science Foundation of China (Grant Nos. 61863010, 61873076, 61370171, 61300128, 61472127, 61572178, 61672214, and 61772192), and the Natural Science Foundation of Hunan, China (Grant Nos. 2018JJ2461, 2018JJ3570).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Fu, Zhu, Cai, Liao, Peng, Chen and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Prediction of Gene Expression Patterns With Generalized Linear Regression Model

#### Shuai Liu<sup>1,2</sup>\*, Mengye Lu<sup>2</sup>, Hanshuang Li<sup>3,4</sup> and Yongchun Zuo<sup>3,4</sup>\*

*<sup>1</sup> College of Information Science and Engineering, Hunan Normal University, Changsha, China, <sup>2</sup> College of Computer Science, Inner Mongolia University, Hohhot, China, <sup>3</sup> College of Life Sciences, Inner Mongolia University, Hohhot, China, <sup>4</sup> The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Inner Mongolia University, Hohhot, China*

#### Edited by:

*Arun Kumar Sangaiah, VIT University, India*

#### Reviewed by:

*Yu-Dong Zhang, University of Leicester, United Kingdom; Jose Tenreiro Machado, Instituto Superior de Engenharia do Porto (ISEP), Portugal; Jianzhong Su, Wenzhou Medical University, China*

#### \*Correspondence:

*Shuai Liu, cs.liu.shuai@gmail.com; Yongchun Zuo, yczuo@imu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *06 November 2018* Accepted: *04 February 2019* Published: *04 March 2019*

#### Citation:

*Liu S, Lu M, Li H and Zuo Y (2019) Prediction of Gene Expression Patterns With Generalized Linear Regression Model. Front. Genet. 10:120. doi: 10.3389/fgene.2019.00120*

Cell reprogramming has played important roles in medical science, such as tissue repair, organ reconstruction, disease treatment, new drug development, and new species breeding. Oct4, a core pluripotency factor, has especially played a key role in somatic cell reprogramming through transcriptional control and affects the expression level of genes by its combination intensity. However, the quantitative relationship between Oct4 combination intensity and target gene expression is still not clear. Therefore, firstly, a generalized linear regression method was constructed to predict gene expression values in promoter regions affected by Oct4 combination intensity. Training data, including Oct4 combination intensity and target gene expression, were from promoter regions of genes with different cell development stages. Additionally, the quantitative relationship between gene expression and Oct4 combination intensity was analyzed with the proposed model. Then, the quantitative relationship between gene expression and Oct4 combination intensity at each stage of cell development was classified into high and low levels. Experimental analysis showed that the combination height of Oct4-inhibited gene expression decremented by a temporal exponential value, whereas the combination width of Oct4-promoted gene expression incremented by a temporal logarithmic value. Experimental results showed that the proposed method can achieve goodness of fit with high confidence.

Keywords: cell reprogramming, Oct4, transcription factor binding site (TFBS), combination intensity, generalized linear regression model, gene expression pattern, prediction

# INTRODUCTION

Somatic cells can be reverted to pluripotent stem cells by cell reprogramming. Cell reprogramming has been significant in many domains of biological and medical science, including tissue repair, organ reconstruction, disease pathogenesis, and new drug development (Wernig et al., 2007; Park et al., 2008). Earlier, nuclear transfer was the main method used to cultivate new individuals. However, this method was very controversial in terms of ethics (Gurdon, 1958; Campbell et al., 1996; McCreath et al., 2000; Polejaeva et al., 2000). Recently, inducing cells to reprogram through specific transcription factors has become a research hotspot. This method solves the problem of immune rejection of allogeneic cells. In this way, patient-specific stem cells can be obtained without ethical controversy (Lv et al., 2018; Poli et al., 2018; Stadhouders et al., 2018).

As important regulatory elements, transcription factors (TFs) are involved in the regulation of transcription initiation, and the binding sites of TFs in promoter regions affect gene expression (Duren et al., 2017). Oct4, a core transcription factor, plays an important regulatory role in stem cell self-renewal and pluripotency maintenance. It controls the development and differentiation of early embryos and is highly expressed in a variety of stem cells, including germ cells, embryonic stem cells (ESCs), embryonic germ cells (EGCs), and embryonic tumor cells. In mouse experiments, Oct4 was observed to play a central role in the cellular pluripotency regulatory network, in which somatic cells were reprogrammed into induced pluripotent stem cells (iPSCs) by ectopically expressing the transcription factors Oct4, Sox2, Klf4, and c-Myc (Chen et al., 2016). Another study showed that pluripotent stem cells can be obtained by adding Oct3/4, Sox2, c-Myc, and Klf4 to mouse fibroblasts (Boyer et al., 2005). Regulation of target genes by these transcription factors is achieved mainly through the interaction of feedforward systems, self-regulatory networks, and other signaling pathways (Boyer et al., 2005).

Oct4-binding sites in promoter regions were closely related to gene expression (Chen et al., 2016). However, the relationship between Oct4 combination intensity in promoter regions and gene expression remained unclear. Therefore, in this paper, a generalized linear regression model was proposed to analyze the relationship between gene expression and Oct4 combination intensity in promoter regions.

The rest of the paper is organized as follows. Section Related Work introduces related work on cell reprogramming and gene expression; section Materials and Methods describes the materials and methods, including the source of data, the proposed generalized linear regression model, and the evaluation criteria of model performance; section Results and Analysis contains detailed experimental results and analysis, including the solution and performance analysis of our proposed model, analysis of factors affecting gene expression at every stage of cell development, and applications of our proposed model in gene classification; and section Conclusion summarizes the contents of this paper.

# RELATED WORK

Previous studies reported mechanisms and methods of cell reprogramming. Earlier, Gurdon applied the nuclear transfer method to cell reprogramming of Xenopus laevis (Gurdon, 1958). Campbell, McCreath, and Polejaeva produced cloned animals using nuclear transfer technology (Campbell et al., 1996; McCreath et al., 2000; Polejaeva et al., 2000). Håkelien and Hochedlinger analyzed a cell recombination mechanism based on nuclear fusion and nuclear transfer technology (Håkelien et al., 2002; Hochedlinger and Jaenisch, 2002). Later, Stadtfeld and Zardo analyzed the effects of specific transcription factors and epigenetic plasticity of chromatin on cell reprogramming (Stadtfeld et al., 2008; Zardo et al., 2008). Studies by Hanna and Li showed that overexpression of transcription factor Oct4 had an effect on cell reprogramming (Hanna et al., 2009; Li et al., 2009). Doege et al. elaborated the effects of the interaction of Oct4, Sox2, Klf4, and c-Myc on cell reprogramming in the early stages of cell reprogramming (Doege et al., 2012). Apostolou and Chen found that the dynamic mechanisms of chromatin change and DNA methylation had important effects on cell reprogramming (Apostolou and Hochedlinger, 2013; Chen et al., 2013). Koga et al. analyzed the role of transcription factor Foxd1 in cell reprogramming (Koga et al., 2014). Recently, Poli and Stadhouders elaborated the roles of specific transcription factors used as inducing factors in cell reprogramming (Poli et al., 2018; Stadhouders et al., 2018).

The process of cell reprogramming is closely related to the regulation of gene expression. Moreover, regulation of gene expression is the molecular basis of many life activities, including cell differentiation, morphogenesis, and ontogeny (Chen et al., 2016). Earlier, Chen and Rimsky analyzed the regulatory effects of cis- and trans-regulatory elements on gene expression (Rimsky et al., 1989; Chen et al., 1990). Later, Ueda et al. analyzed the effects of diurnal variation of transcription factors on gene expression (Ueda et al., 2002). Wittkopp et al. analyzed the effects of the interaction of cis- and trans-regulatory elements on gene expression (Wittkopp et al., 2004). Sullivan et al. studied the regulatory effect of microRNAs encoded by SV40 on gene expression (Sullivan et al., 2005). Jeffery et al. found factors related to gene expression using gene expression data and binding sites of transcription factors (Jeffery et al., 2007). Han et al. found that certain types of genomic organization by SATB1 had an effect on gene expression (Han et al., 2008). Afterward, Costa et al. predicted gene expression in T cell differentiation from histone modification and binding affinity of transcription factors via a linear mixed model (Costa et al., 2011). Maienschein-Cline et al. searched for target genes regulated by transcription factors based on information including binding sites of transcription factors and target genes (Maienschein-Cline et al., 2012). Lee and Holoch analyzed the effects of specific transcription factors and the regulatory effect of RNA on gene expression, respectively (Lee et al., 2013; Holoch and Moazed, 2015). Recently, Engreitz and Singh clarified the effects of lncRNA promoter, transcription factor, variable splicing, and histone modification on gene expression, respectively (Engreitz et al., 2016; Singh et al., 2016). Thomou and Wu analyzed the effects of miRNAs and histone modifications on gene expression (Thomou et al., 2017; Wu et al., 2017). Additionally, Duren et al. predicted gene expression from chromatin accessibility data and cis-acting and trans-acting element data using logistic regression models (Duren et al., 2017). Neumann and Stadhouders analyzed the effects of lncRNA and the dynamic interaction of transcription factors on the expression of target genes (Neumann et al., 2018; Stadhouders et al., 2018).

Many methods were proposed for deciphering regulation mechanisms of cis-regulatory and trans-regulatory elements based on gene expression. Studies showed that gene expression was closely related to Oct4 combination intensity in promoter regions (Machado et al., 2011; Machado, 2017; Yan et al., 2017; Antão et al., 2018). However, the quantitative relationship between gene expression and Oct4 combination intensity was not considered. Therefore, firstly, a generalized linear regression model was proposed for quantifying the relationship of gene expression and Oct4 combination intensity based on eight gene datapoints. Then, testing data were applied to test the generalization ability of the model. On the one hand, experiments of 27 genes, as well as all genes, from GEO were applied to analyze the quantitative relationship between Oct4 combination intensity and target gene expression at each stage of cell development by our proposed model. On the other hand, 27 genes were divided into positive and negative samples by our proposed method.

#### MATERIALS AND METHODS

#### Datasets

Experimental data came from mouse transcriptome data and ChIP-seq data, which were downloaded from GEO database with accession numbers GSE67462 and GSE67520, respectively. In this paper, gene promoter regions were defined as −1.5 kb to +0.5 kb of gene transcription start sites (TSSs). For quantifying the relationship between gene expression and Oct4 combination intensity, while testing the generalization ability of the proposed model, experimental data were divided into training data and test data.

Training data were related to genes Btbd8, Cnbp, Cyb5r3, Dars2, Eef1a1, Hist1h2bf, Ptrh2, Zfp143, which were extracted based on the following steps.

Step 1. All dynamic Oct4 combination intensity and gene expression data related to genes Btbd8, Cnbp, Cyb5r3, Dars2, Eef1a1, Hist1h2bf, Ptrh2, Zfp143 were extracted from transcriptome and ChIP-seq data (Chen et al., 2016). Oct4 combination intensities were expressed as a series of peaks that contained three characteristics, including height, distance and width, which were defined as the value of the highest point corresponding to the midpoint of the peak (height); distance between the midpoint of the peak and transcription start site (distance); and difference between the right and left boundaries of the peak (width).

Step 2. Transcriptome and ChIP-seq data of the above genes from Day 0, Day 1, Day 3, Day 5, Day 7, Day 11, Day 15, and Day 18 were selected for studying the relation between time and gene expression (Chen et al., 2016).

Step 3. Promoter regions with the strongest signal were extracted to avoid the influence of redundant data.

Testing data were composed of two parts, including data of 27 genes and all genes. Firstly, 27 genes and all genes were applied to analyze quantitative relationship between Oct4 combination intensity and target gene expression at each stage of cell development by our proposed model. Then, 27 genes were divided into high and low expression groups to classify.

In detail, 27 genes were obtained by searching for those data that appeared in all eight different cell development stages from GEO. These genes were Alyref2, Atn1, Btbd8, Btg2, Caprin1, Cnbp, Ctgf, Cyb5r3, Dars2, Ddx5, Eef1a1, Fosb, Hes1, Hist1h2bb, Hist1h2bf, Hist1h2bp, Hnrnpa2b1, Kmt2e, Lonp1, Nfe2l2, Pecr, Phldb2, Ptrh2, Setd5, Trappc6b, Tti2, and Zfp143. In the biclassification experiment, expression values of 27 genes were sorted by descending order. The top 30% of the sorted data were defined as the high expression group, and the lowest 30% were defined as the low expression group. The value of the minimum high expression was the threshold for classification.

TABLE 1 | Number of genes at each cell development stage.

The numbers of all genes at each stage of cell development are shown in **Table 1**.
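The grouping rule for the biclassification experiment (top 30% as the high expression group, lowest 30% as the low expression group, minimum of the high group as threshold) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_high_low`, the dictionary layout, the rounding of the group size, and the example values are all assumptions.

```python
def split_high_low(expression, fraction=0.3):
    """Split genes into high and low expression groups.

    expression: dict mapping gene name -> expression value (assumed layout).
    fraction:   share of genes placed in each group (0.3 follows the text).
    """
    # Sort genes by expression value in descending order.
    ranked = sorted(expression.items(), key=lambda item: item[1], reverse=True)
    # Rounding the group size to the nearest integer is an assumption.
    group_size = max(1, round(len(ranked) * fraction))
    high = ranked[:group_size]
    low = ranked[-group_size:]
    # The smallest value in the high group serves as the classification threshold.
    threshold = high[-1][1]
    return [gene for gene, _ in high], [gene for gene, _ in low], threshold


# Made-up expression values for ten placeholder genes (not GEO data).
example = {f"gene{i}": float(i) for i in range(1, 11)}
high_genes, low_genes, cutoff = split_high_low(example)
print(high_genes, low_genes, cutoff)
```

With 27 genes and a fraction of 0.3, this sketch would place eight genes in each group; the text does not state how the group size was rounded.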

#### Generalized Linear Regression Model

In **Figure 1**, the height, distance, and width of Oct4 combination intensity and the gene expression are each plotted against time.

**Figure 1** shows different change trends with time of Oct4 combination intensity in promoter regions and gene expression in the eight proposed genes. **Figure 1A** illustrates in detail that change trends of height with time were nearly identical in these genes. Similarly, **Figure 1C** demonstrates that change trends of width with time in these genes were also nearly identical. **Figures 1B,D** show that change trends of distance and gene expression with time were disorganized.

For quantifying the relationship between gene expression and Oct4 combination intensity, correlations between height, distance, width, time, and gene expression were analyzed using the correlation coefficient, which is defined in Equation (1) for two random variables X and Y.

$$r\,(X,Y) = \frac{\text{cov}\,(X,Y)}{\sqrt{\text{var}\,(X)\,\text{var}\,(Y)}}\tag{1}$$

In Equation (1), r (X, Y) represents the correlation coefficient between X and Y, cov (X, Y) represents the covariance between X and Y, and var (X) and var (Y) represent the variances of X and Y, respectively.
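As a small worked illustration of Equation (1), the correlation coefficient can be computed directly from its definition. The sketch below uses only the Python standard library; the function name `correlation` and the example expression values are assumptions made for illustration, while the time points are the eight days used in this study.

```python
import math


def correlation(x, y):
    """Correlation coefficient r(X, Y) = cov(X, Y) / sqrt(var(X) * var(Y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)


days = [0, 1, 3, 5, 7, 11, 15, 18]
# Made-up expression values that grow linearly with time.
expression = [2.0 * d + 1.0 for d in days]
r = correlation(days, expression)
print(round(r, 6))
```

A perfectly linear increasing relation gives a coefficient of 1 (up to floating-point error), and a perfectly linear decreasing relation gives −1.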

The correlation coefficients between gene expression and Oct4 combination intensity are shown in **Table 2**. In addition, correlation coefficients for Oct4 combination intensity and the gene expression, height, distance, width, and time of each gene are provided in **Figure 2**.

**Table 2** and **Figure 2** indicate that the correlation coefficients for gene expression and time were the largest. Correlation coefficients for time and other variables were also strong. However, goodness of fit was low when the prediction model was constructed using height, distance, and width as explanatory variables, and gene expression as the explained variable. Due to the strong relationship between time and Oct4 combination intensity, several time-dependent derived combination variables were used as explanatory variables of the proposed model.

TABLE 2 | Correlation coefficients between gene expression and Oct4 combination intensity for selected genes.

*A11, A12, A13, and A14 refer to correlation coefficients between gene expression and height, distance, width, and time, respectively. Bold text represents absolute values of correlation coefficients that are > 0.5.*

Firstly, new derived combination variables were obtained by multiplying height, distance, and width by a function of time t, including t<sup>k</sup> (k = 1, 2, 3), e<sup>t</sup>, 0.5<sup>t</sup>, and log<sub>10</sub>(t + 1). In this way, a set V = {H × t, H × t<sup>2</sup>, H × t<sup>3</sup>, H × e<sup>t</sup>, H × 0.5<sup>t</sup>, H × log<sub>10</sub>(t + 1), D × t, D × t<sup>2</sup>, D × t<sup>3</sup>, D × e<sup>t</sup>, D × 0.5<sup>t</sup>, D × log<sub>10</sub>(t + 1), W × t, W × t<sup>2</sup>, W × t<sup>3</sup>, W × e<sup>t</sup>, W × 0.5<sup>t</sup>, W × log<sub>10</sub>(t + 1)} was constructed as the set of candidate explanatory variables, where H denotes height, D denotes distance, and W denotes width. Then, stepwise regression was used to select the explanatory variables of the proposed regression model. Finally, six explanatory variables were selected from V: H × e<sup>t</sup>, D × t, D × t<sup>2</sup>, D × t<sup>3</sup>, D × 0.5<sup>t</sup>, and W × log<sub>10</sub>(t + 1).
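The construction of the candidate set V can be illustrated with a short sketch that multiplies each peak characteristic by each function of time. The names `TIME_FUNCTIONS`, `derived_variables`, and `SELECTED`, the string labels, and the example peak values are assumptions for illustration; the stepwise selection itself is not reproduced here.

```python
import math

# The six functions of time t that appear in the set V.
TIME_FUNCTIONS = {
    "t": lambda t: t,
    "t^2": lambda t: t ** 2,
    "t^3": lambda t: t ** 3,
    "e^t": math.exp,
    "0.5^t": lambda t: 0.5 ** t,
    "log10(t+1)": lambda t: math.log10(t + 1),
}

# The six variables reported as selected by stepwise regression.
SELECTED = ["H x e^t", "D x t", "D x t^2", "D x t^3", "D x 0.5^t", "W x log10(t+1)"]


def derived_variables(height, distance, width, t):
    """Return all 18 derived variables for one observation at time t."""
    peak = {"H": height, "D": distance, "W": width}
    return {
        f"{name} x {label}": value * func(t)
        for name, value in peak.items()
        for label, func in TIME_FUNCTIONS.items()
    }


# Made-up peak characteristics at day 9 (not data from the paper).
variables = derived_variables(height=2.0, distance=4.0, width=5.0, t=9)
print(len(variables), {name: variables[name] for name in SELECTED})
```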

Therefore, a generalized linear regression model was constructed by using selected explanatory variables, in which gene expression was the explained variable. In this paper, four generalized linear regression models, Models 1–4, were constructed by Equations (2–5).

$$\begin{aligned} \text{Model 1}: \text{Exp} &= \beta_1 \times H \times e^t + \beta_2 \times D \times t \\ &+ \beta_3 \times W \times \log_{10}(t+1) + \varepsilon \end{aligned} \tag{2}$$

$$\begin{aligned} \text{Model 2}: \text{Exp} &= \beta_1 \times H \times e^t + \beta_2 \times D \times t^2 \\ &+ \beta_3 \times W \times \log_{10}(t+1) + \varepsilon \end{aligned} \tag{3}$$

$$\begin{aligned} \text{Model 3}: \text{Exp} &= \beta_1 \times H \times e^t + \beta_2 \times D \times t^3 \\ &+ \beta_3 \times W \times \log_{10}(t+1) + \varepsilon \end{aligned} \tag{4}$$

$$\begin{aligned} \text{Model 4}: \text{Exp} &= \beta_1 \times H \times e^t + \beta_2 \times D \times 0.5^t \\ &+ \beta_3 \times W \times \log_{10}(t+1) + \varepsilon \end{aligned} \tag{5}$$

FIGURE 2 | Correlation coefficients for Oct4 combination intensity and the gene expression, height, distance, width, and time of each gene. (A–H) represent the correlation coefficients in genes Btbd8, Cnbp, Cyb5r3, Dars2, Eef1a1, Hist1h2bf, Ptrh2, and Zfp143, respectively.

In Equations (2–5), Exp represents the value of gene expression; β1, β2, and β3 are regression coefficients, which are calculated by the Least Squares Method (LSM), that is, by minimizing the sum of squared differences between predicted and true values; and ε is a normally distributed random disturbance that represents factors affecting gene expression other than height, distance, and width.

H × e<sup>t</sup> and W × log<sub>10</sub>(t + 1) were selected in the final model because they were common terms in Equations (2–5). Therefore, a general model of gene expression patterns was obtained as Equation (6), and the correctness of the model will be verified in section Analysis of Factors Affecting Gene Expression at Every Stage of Cell Development.

$$\begin{aligned} \text{Exp} &= \beta_1 \times H \times e^t + \beta_2 \times D \times f(t) \\ &+ \beta_3 \times W \times \log_{10}(t+1) + \varepsilon \end{aligned} \tag{6}$$

In Equation (6), f(t) represents a function of time t, which was selected from {t, t<sup>2</sup>, t<sup>3</sup>, 0.5<sup>t</sup>}; β1, β2, and β3 are regression coefficients calculated by LSM.
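A least-squares fit of Equation (6) can be sketched with the normal equations. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names `solve_linear` and `fit_expression_model` are hypothetical, the constant column is an assumption that stands in for the disturbance term ε (the figures and tables report a single ε value per fit), and the data are synthetic.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_expression_model(height, distance, width, time, expression,
                         f=lambda t: 0.5 ** t):
    """Least-squares estimate of (beta1, beta2, beta3, constant) in Equation (6).

    f is the function of time applied to distance; the default 0.5**t
    corresponds to Model 4.
    """
    rows = [[h * math.exp(t), d * f(t), w * math.log10(t + 1), 1.0]
            for h, d, w, t in zip(height, distance, width, time)]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, expression)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic check with made-up peak features (not data from the paper).
time = [0, 1, 2, 3, 4, 5]
height = [1.0, 0.8, 0.5, 0.4, 0.2, 0.1]
distance = [100.0, -200.0, 150.0, 50.0, -120.0, 300.0]
width = [200.0, 250.0, 300.0, 280.0, 260.0, 240.0]
true_beta = [2.0, -0.5, 0.1, 3.0]
expression = [true_beta[0] * h * math.exp(t) + true_beta[1] * d * 0.5 ** t
              + true_beta[2] * w * math.log10(t + 1) + true_beta[3]
              for h, d, w, t in zip(height, distance, width, time)]
beta = fit_expression_model(height, distance, width, time, expression)
print([round(b, 6) for b in beta])
```

The synthetic time points are kept small on purpose. With time points up to day 18, e<sup>t</sup> spans many orders of magnitude, so rescaling the variables or using a QR-based solver would likely be numerically safer than the plain normal equations shown here.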

#### Evaluation Criteria of Model Performance

The F-test, the t-test, and the goodness of fit R̄<sup>2</sup> were used to evaluate the performance of the linear regression model (Huang and Pan, 2003; Zhou et al., 2003; Xu et al., 2008; Wang and Lee, 2010; Wang et al., 2012). More precisely, the F-test was used to test the significance of the entire regression model, and the t-test was used to test the significance of the regression coefficients in the model. The goodness of fit R̄<sup>2</sup> was used to measure how closely the fitted curve approximates the original data. R̄<sup>2</sup> is the adjusted coefficient of determination, a generalization of the original coefficient of determination R<sup>2</sup> that eliminates the influence of the number of explanatory variables on the coefficient of determination. In this paper, the F-test statistic, t-test statistic, adjusted coefficient of determination R̄<sup>2</sup>, original coefficient of determination R<sup>2</sup>, total sum of squares (TSS), explained sum of squares (ESS), and residual sum of squares (RSS) are defined in Equations (7–13) (Huang and Pan, 2003; Zhou et al., 2003; Xu et al., 2008; Wang and Lee, 2010; Wang et al., 2012).

$$F = \frac{\text{ESS}/k}{\text{RSS}/\left(n - k - 1\right)} \sim F\left(k,\, n - k - 1\right) \tag{7}$$

$$t = \frac{\hat{\beta}_i}{se\left(\hat{\beta}_i\right)} \sim t\left(n - k - 1\right) \tag{8}$$

$$
\bar{R}^2 = 1 - \left(1 - R^2\right) \frac{n-1}{n-k-1} \tag{9}
$$

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum\left(Y_i - \hat{Y}_i\right)^2}{\sum\left(Y_i - \bar{Y}\right)^2} \tag{10}$$

$$\text{TSS} = \sum y_i^2 = \sum \left( Y_i - \bar{Y} \right)^2 \tag{11}$$

$$\text{ESS} = \sum \hat{y}_i^2 = \sum \left(\hat{Y}_i - \bar{Y}\right)^2 \tag{12}$$

$$\text{RSS} = \sum e_i^2 = \sum \left( Y_i - \hat{Y}_i \right)^2 \tag{13}$$

In Equations (7–13), k is the number of explanatory variables; n is the number of samples; β̂<sub>i</sub> and se(β̂<sub>i</sub>) are the estimated value of the i-th regression coefficient and the standard error of that estimate; and Y<sub>i</sub>, Ŷ<sub>i</sub>, and Ȳ represent the true, estimated, and mean values of the explained variable, respectively.
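The goodness-of-fit quantities can be computed from the true and fitted values in a few lines. The sketch below is illustrative only: the function name `regression_summary` and the example values are assumptions, and the F statistic is written with the residual sum of squares in the denominator, which is the usual form of the overall F-test.

```python
def regression_summary(y, y_hat, k):
    """Return R^2, adjusted R^2, and the F statistic for a fitted regression.

    y:     true values of the explained variable
    y_hat: fitted values from the regression model
    k:     number of explanatory variables
    """
    n = len(y)
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
    ess = sum((fi - y_bar) ** 2 for fi in y_hat)             # explained sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))    # residual sum of squares
    r2 = 1 - rss / tss
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_stat = (ess / k) / (rss / (n - k - 1))
    return {"R2": r2, "adjusted_R2": adjusted_r2, "F": f_stat}


# Made-up true and fitted values for eight samples and three explanatory variables.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_fit = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9, 7.1, 7.9]
summary = regression_summary(y_true, y_fit, k=3)
print(summary)
```

Because the adjusted value multiplies the unexplained share by (n − 1)/(n − k − 1), it is never larger than R<sup>2</sup> and falls as more explanatory variables are added without a matching gain in fit.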

Accuracy (Acc), sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (Mcc) were used to measure the performance of the classification model (Xu et al., 2013; Guo et al., 2014; Awazu, 2016); they are defined in Equations (14–17).

$$S_n = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{14}$$

$$S_p = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{15}$$

$$\text{Acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{TN} + \text{FP}} \tag{16}$$

$$\text{Mcc} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FN}) \times (\text{TP} + \text{FP}) \times (\text{TN} + \text{FN}) \times (\text{TN} + \text{FP})}} \tag{17}$$

In Equations (14–17), TP represents the number of positive samples that are correctly predicted as positive samples; TN represents the number of negative samples that are correctly predicted as negative samples; FP represents the number of negative samples that are incorrectly predicted as positive samples; and FN represents the number of positive samples that are incorrectly predicted as negative samples (Zhang et al., 2014, 2018; Wang et al., 2017, 2018a,b).
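Equations (14–17) translate directly into code. The following sketch is for illustration; the function name `classification_metrics`, the convention of returning 0 for Mcc when the denominator is zero, and the example counts are assumptions.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc, and Mcc from confusion-matrix counts."""
    sn = tp / (tp + fn)                       # sensitivity, Equation (14)
    sp = tn / (tn + fp)                       # specificity, Equation (15)
    acc = (tp + tn) / (tp + fn + tn + fp)     # accuracy, Equation (16)
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    # Returning 0 for a zero denominator is a common convention, assumed here.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Mcc": mcc}


# Made-up counts for illustration (not results from this study).
metrics = classification_metrics(tp=7, tn=6, fp=2, fn=1)
print(metrics)
```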

#### RESULTS AND ANALYSIS

#### Solution Result of Our Proposed Model

Gene expression patterns of the eight selected genes were analyzed by using Models 1–4. More specifically, Model 1 was applied to describe the expression pattern of gene Zfp143; Model 2 was applied to describe the expression pattern of gene Hist1h2bf; Model 3 was applied to describe the expression patterns of genes Dars2 and Eef1a1; and Model 4 was applied to describe the expression patterns of genes Btbd8, Cnbp, Cyb5r3, and Ptrh2. In addition, both Model 2 and Model 3 were used to describe the expression pattern of gene Eef1a1. Parameter values of the models are shown in **Table 3**. Parameter values of Model 2 and Model 3 for gene Eef1a1 are shown in **Table 4**.

**Table 3** shows that regression coefficients β2 and β3 were large, which indicates that both distance and width had important influences on gene expression. Furthermore, distance affected gene expression in the form of an exponential function of time, and width affected gene expression in the form of a logarithmic function of time without other factors. Additionally, **Table 4** shows that the differences between the regression coefficients of Model 2 and Model 3 were small. In both Model 2 and Model 3, β3 had the largest absolute value among the regression coefficients for gene Eef1a1, which indicates that width was a key factor affecting gene expression.

TABLE 3 | Parameter values of model for eight genes.

*Model of expression pattern for gene Eef1a1 is Model 3 in* Table 3*. The bold text represents the largest absolute values of weight in the regression coefficients of each gene.*

TABLE 4 | Parameter values of model for gene Eef1a1.

*Bold text represents the largest absolute values of weight in regression coefficients of Model 2 and Model 3, respectively.*

TABLE 5 | Goodness of fit, *F*-test and *t*-test for eight genes.

#### Performance Analysis of Our Proposed Model

Goodness of fit for proposed model was calculated to evaluate the performance of these models. In addition, performance of the models was tested by F-test and t-test. Results of goodness of fit, F-test and t-test are shown in **Table 5**. Results of goodness of fit, F-test and t-test of gene Eef1a1 are shown in **Table 6**.

**Table 5** demonstrates that goodness of fit reached at least 80% for all genes except Dars2 by using our proposed method. In addition, the p-values of the F-test and t-test were <0.1, which means that our proposed model was significant at the 90% confidence level.

As shown in **Table 6**, R̄<sup>2</sup> from Model 3 was larger than that from Model 2, which means that distance had a greater influence on gene expression than time for gene Eef1a1.

As shown in **Tables 3**–**6**, the absolute values of regression coefficients β2 and β3 were large among all regression coefficients. Additionally, the absolute value of β3 was the largest among all regression coefficients of Model 2 and Model 3 for gene Eef1a1. Therefore, width was considered to be the most important factor affecting gene expression, and width had an effect on gene expression in the form of a logarithmic function.

#### Analysis of Factors Affecting Gene Expression in Whole-Cell Developmental Stage

In this paper, the relationship between gene expression and Oct4 combination intensity in promoter regions at the whole-cell developmental stage was analyzed based on the generalized linear regression model. Experimental results showed that the proposed model was effective for gene expression pattern of all eight selected genes except for Eef1a1. For exploring the effects of each model on the different genes, expression data of selected eight genes and Oct4 combination intensity in promoter regions were substituted into the models. Experimental results are shown in **Figure 3**.

*p1, p2, and p3 are p-values of t-test. Results of gene Eef1a1 in* Table 5 *are calculated by Model 3.*

TABLE 6 | Goodness of fit, *F*-test and *t*-test for gene Eef1a1.

*p1, p2, and p3 are p-values of t-test.*

**Figure 3** demonstrated that differences in goodness of fit between different models for the same gene were large, which indicated that distance had strong effects on the gene expression of different genes with different levels. Strong correlation between gene expression and D × 0.5<sup>t</sup> and W × log<sub>10</sub>(t + 1) was found in **Table 3**, which indicated that distance had an effect on gene expression in the form of an exponential function of time, and width had an effect on gene expression in the form of a logarithmic function of time without other factors. However, goodness of fit from D × 0.5<sup>t</sup> and W × log<sub>10</sub>(t + 1) was lower than for the selected six derived combination variables, which indicated that gene expression was promoted by the interaction of height, distance, width, and time.

1–4 are represented from 1 to 4 on the x-axis, respectively. The y-axis represents corresponding goodness of fit.

#### Analysis of Factors Affecting Gene Expression at Every Stage of Cell Development

Oct4 combination intensity and time had different effects on gene expression in different cell development stages. The goodness of fit obtained by Model 4 was higher than that obtained by Models 1–3 in the prediction of gene expression. Therefore, differences were analyzed based on Model 4 with testing data including 27 genes and all genes. Experimental results are shown in **Figures 4**, **5**.

**Figures 4**, **5** show that the absolute value of β3 was larger than those of β1, β2, and ε. Absolute values of β1 and β2 were close to zero except for a few points, which indicated that width influenced gene expression in the form of a logarithmic function of time. However, change trends of β1 and β2 were different for **Figures 4**, **5**. More specifically, the absolute value of β1 obtained by 27 genes decreased with time, and the value of was negative when time was equal to 0; the absolute value of obtained by all genes decreased with time and the value of was positive when time was equal to 0; the value of obtained by 27 genes was positive while value of obtained by all genes was negative due to partially missing data, which was contradictory and indicated that time had an important impact on gene expression. Incorrect conclusions were obtained when data of some certain time were missing. Therefore, **Figures 4**, **5** showed that width and time had important effects on gene expression. Furthermore, width influenced gene expression in the form of a logarithmic function of time.

FIGURE 5 | Effects of Oct4 combination intensity and time on gene expression at different stages of cell development for all genes from GEO. X-axis represents different stages of cell development after 0, 1, 3, 5, 7, 11, 15, and 18 days. Y axis represents parameter value. Curve in red, green, blue, and black represents values of parameter β1, β2, β3 and ε of the proposed generalized linear regression model, respectively.

#### Application of Our Proposed Model in Gene Classification

Gene classification experiments were provided to test the generalization ability of the proposed model. Firstly, in order to avoid the influence of random disturbance on experimental results, the data of 27 genes, including height, distance, width and gene expression, were normalized. Then, Models 1–4 were applied to predict gene expression for 27 genes. Finally, the 27 genes were divided into two categories by comparing gene expression with a threshold; meanwhile, 10-fold cross-validation was used to test the model's performance. Comparison results of Models 1–4 showed Model 4 had a high goodness of fit. Therefore, 27 genes were classified by Model 4.
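The 10-fold cross-validation over the 27 genes can be sketched as an index splitter. The text does not say how genes were assigned to folds, so the interleaved assignment without shuffling used below is an assumption, as is the function name `k_fold_splits`.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    Samples are assigned to folds in an interleaved fashion (assumed scheme):
    sample i goes to fold i % k.
    """
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 27 genes split into 10 folds: each gene is held out exactly once.
splits = list(k_fold_splits(27, k=10))
for train, test in splits:
    print(len(train), test)
```

With 27 samples and 10 folds, seven folds hold out three genes and three folds hold out two, so each fitted model is tested on only a handful of genes.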

Gene groups of high and low expression were defined in an artificial way; meanwhile, threshold setting was random in the classification process. A BP neural network was used to classify positive and negative samples in order to prove that the randomness had little effect on experimental results. In this paper, the hidden layer of the BP neural network was set to one layer, and the number of hidden layer neurons was set to 2. In 10-fold cross-validation, regression coefficients and random disturbance of Model 4 were shown in **Table 7**. The prediction performance obtained by Model 4 and the BP neural network are shown in **Table 8**.

**Table 8** shows that the Acc, Sn, Sp, and Mcc obtained by Model 4 were higher than those obtained by the BP neural network. Therefore, randomness of the threshold setting had little effect on experimental results, and our proposed method was effective in predicting gene expression.

TABLE 7 | Parameter values of Model 4 in 10-fold cross-validation.

*1–10 represents the serial number of 10-fold cross-validation.*

TABLE 8 | Prediction performance of different methods using 10-fold cross-validation.

*Bold text represents the maximum value of every performance evaluation criterion.*

#### CONCLUSION

Cell reprogramming has been a hot issue in the field of life sciences and has played a significant role in medicine, such as in tissue repair, organ reconstruction, disease pathogenesis, and new drug development. Oct4 has especially played an important regulatory role in the process of cell reprogramming. However, there was no scientific method to quantify the relationship between Oct4 combination intensity and gene expression. Therefore, data from the eight selected typical genes were extracted from mouse transcriptome data and ChIP-seq data for quantifying the relationship between gene expression values and Oct4 combination intensity in promoter regions.

Firstly, a generalized linear regression model was constructed based on gene expression at eight time points during cell development and Oct4 combination intensity in promoter regions. Then, the relationship between Oct4 combination intensity and gene expression was analyzed over the whole course of cell development and at each stage. Finally, the 27 genes were divided into positive and negative samples based on Model 4 and the BP neural network. Experimental results showed that the width of combination influenced gene expression through a logarithmic function of time (day). Additionally, the accuracy obtained by the proposed model was 4.05% higher than that obtained by the BP neural network, which indicated that our proposed model was effective in predicting gene expression.

Several additional factors, including extent of histone modification, degree of chromatin opening, strength of promoter and binding sites of transcription factors and promoter regions, also affected gene expression. Non-linear relations between gene expression and Oct4 combination intensity were also ignored due to large non-linear relations. Therefore, in the future, multiple factors and non-linear relations should be considered to analyze key factors affecting gene expression.


#### AUTHOR CONTRIBUTIONS

SL designed the experiments and analyzed the experimental results; ML processed the data and carried out the experiments; HL extracted and cleaned data from biological experiments and public databases; YZ contributed ideas on biological significance.

#### FUNDING

This research was funded by the National Natural Science Foundation of China (Grant Nos. 61502254, 61561036, and 61702290), the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (Grant Nos. NJYT-18-B10 and NJYT-18-B01), and the Open Funds of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (Grant No. 93K172018K07).

#### ACKNOWLEDGMENTS

We want to thank Dr. X. Cheng from Middlesex University (UK) and Prof. Y. Zhang from the University of Leicester (UK) for their efforts on language improvement.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Liu, Lu, Li and Zuo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# DMfold: A Novel Method to Predict RNA Secondary Structure With Pseudoknots Based on Deep Learning and Improved Base Pair Maximization Principle

Linyu Wang1,2, Yuanning Liu1,2, Xiaodan Zhong1,2,3, Haiming Liu1,2, Chao Lu1,2, Cong Li 1,2 and Hao Zhang1,2 \*

*<sup>1</sup> College of Computer Science and Technology, Jilin University, Changchun, China, <sup>2</sup> Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun, China, <sup>3</sup> Department of Pediatric Oncology, The First Hospital of Jilin University, Changchun, China*

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Leyi Wei, Tianjin University, China Ning Zhang, University of Missouri, United States*

> \*Correspondence: *Hao Zhang zhangh@jlu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *02 December 2018* Accepted: *12 February 2019* Published: *04 March 2019*

#### Citation:

*Wang L, Liu Y, Zhong X, Liu H, Lu C, Li C and Zhang H (2019) DMfold: A Novel Method to Predict RNA Secondary Structure With Pseudoknots Based on Deep Learning and Improved Base Pair Maximization Principle. Front. Genet. 10:143. doi: 10.3389/fgene.2019.00143*

Predicting the secondary structure of RNA is vital for studying its function, but determining RNA secondary structure is challenging, especially for structures with pseudoknots. Several excellent computational methods can be used to predict the secondary structure (with or without pseudoknots), each with its own merits and demerits. These methods can be classified into two categories: the multi-sequence method and the single-sequence method. The main advantage of the multi-sequence method lies in its use of auxiliary sequences to assist in predicting the secondary structure, but it succeeds only when multiple highly homologous sequences are available. The major merit of the single-sequence method is its ease of operation (only the target sequence is needed to predict the secondary structure), but its folding parameters capture features common to diverse RNAs and cannot describe the unique characteristics of an individual RNA, which can result in low prediction accuracy for some RNAs. In this paper, "DMfold," a method based on deep learning and the Improved Base Pair Maximization Principle, is proposed to predict the secondary structure with pseudoknots; it absorbs the advantages and avoids some disadvantages of both methods. Notably, DMfold predicts the secondary structure of an RNA by learning from similar RNAs with known structures, using similar RNA sequences instead of the highly homologous sequences required by the multi-sequence method, thereby reducing the requirement for auxiliary sequences. DMfold needs only the target sequence as input to predict the secondary structure. Its folding parameters are extracted automatically by deep learning, which avoids the lack of folding parameters in the single-sequence method. Experiments show that our method is not only simple to operate but also improves prediction accuracy compared with multiple excellent prediction methods. A repository containing our code can be found at https://github.com/linyuwangPHD/RNA-Secondary-Structure-Database.

Keywords: RNA, secondary structure prediction, pseudoknot, deep learning, multi-sequence method, single-sequence method, improved base pair maximization principle

# INTRODUCTION

RNA, an essential substance for all life, plays various roles in a variety of biological processes, such as translation (Kapranov et al., 2007), catalysis (Cech et al., 1981), and gene regulation (Storz and Gottesman, 2006). RNA consists of a large number of subunits called ribonucleotides, each of which contains one of four possible bases: adenine (A), guanine (G), cytosine (C), or uracil (U). Under normal physiological conditions, these bases can bind with one another through hydrogen bonds to form the secondary structure. Typically, the RNA secondary structure is a set of stems made of stacked base pairs, and a base pair may be formed by three possible combinations of nucleotides: A-U, G-C, and G-U. Among these, A-U and G-C are called Watson-Crick pairs (Watson and Crick, 1953), while G-U is referred to as a wobble pair (Varani and Mcclain, 2000). The secondary structure information of RNA is of vital importance, since RNA functions largely depend on its secondary structure (Correll et al., 1997). Hence, predicting the RNA secondary structure is a bridge to understanding RNA functions. The RNA secondary structure can be determined directly through X-ray crystal diffraction or nuclear magnetic resonance; both are highly accurate and reliable, but they are restricted by their high cost and their slow and difficult operation. Therefore, it is necessary to develop mathematical and computational methods to predict the RNA secondary structure.

Computational methods have been used for over 40 years, and dozens of methods have been proposed to predict the RNA secondary structure (with or without pseudoknots). These methods can mainly be classified into two categories based on their prediction principles (Zhu et al., 2018): the multi-sequence method (MPM) (Hofacker and Stadler, 1999; Knudsen and Hein, 2003; Bernhart et al., 2008; Wilm et al., 2008) and the single-sequence method (SPM) (Eddy and Durbin, 1994; Zuker, 2003; Mathews, 2006; Zhu et al., 2018). The MPM derives the secondary structure from multiple homologous sequences using a comparative analysis model, and it is the most accurate computational method for predicting the RNA secondary structure. However, because of its high requirement for homologous sequences, it cannot predict the secondary structure when only sequences of low homology are available. The SPM uses a large number of parameters to predict the secondary structure, as in the thermodynamic model (Zuker, 2003; Mathews, 2006) and the statistical learning model (Eddy and Durbin, 1994; Zhu et al., 2018), and it can achieve high prediction accuracy when those parameters are comprehensive and accurate. Unfortunately, comprehensive and accurate parameters can hardly be obtained for different types of RNA through biological experiments or mathematical statistics, and insufficient parameters may result in low prediction accuracy for some RNAs.

Numerous studies have shown that pseudoknots possess biological functions, so it is important to predict the secondary structure with pseudoknots. This paper aims to predict the RNA secondary structure with pseudoknots, which have been discovered in various RNA types, such as transfer-messenger RNA, ribosomal RNA, and viral RNA. Moreover, pseudoknots are known to be involved in regulating translation, splicing, and ribosomal frameshifting (Brierley et al., 2007). Hence, a predicted RNA secondary structure that includes pseudoknots is closer to the natural structure, which contributes to a better understanding of RNA functions. To the best of our knowledge, very few tools have combined the merits of both MPM and SPM in predicting the RNA secondary structure with pseudoknots. Therefore, a new method, called "DMfold," is proposed in this paper to predict the RNA secondary structure with pseudoknots based on deep learning and the Improved Base Pair Maximization Principle (IBPMP). DMfold combines the advantages of both MPM and SPM while avoiding their disadvantages. For instance, like MPM, DMfold uses the known structures of RNA to help predict the secondary structure; unlike MPM, however, DMfold uses similar sequences instead of highly homologous sequences, which reduces the requirement for auxiliary sequences and improves the availability of the algorithm. Like SPM, DMfold needs only the target sequence as input to predict the secondary structure; unlike SPM, however, it uses a deep learning model to automatically extract RNA features, which avoids insufficient features and improves prediction accuracy.

DMfold is a single model that simultaneously uses known structural data from multiple families as learning and training data to predict the secondary structure with pseudoknots of several different RNA types. The secondary structure of RNA can be regarded as composed of three types of pseudoknot-free substructures (Danaee et al., 2018), each of which is represented by a different type of symbol. Hence, the structural data of RNA can be transformed into dot-bracket sequences, and this transformation is carried out before the prediction process of DMfold. Subsequently, the RNA sequences and dot-bracket sequences are used as the input and label in DMfold, respectively. After the RNA data have been processed, DMfold uses a deep learning model composed of an encoder and a decoder to complete the prediction from RNA sequences to dot-bracket sequences. Thereafter, DMfold employs the IBPMP to obtain three pseudoknot-free substructures by selecting and combining the stems in the predicted dot-bracket sequences. Finally, the secondary structure with pseudoknots is predicted by combining those substructures.

### MATERIALS AND METHODS

### Data Collection and Processing

The original data used in this paper are the same as in the recent literature (Ward et al., 2017) and come from the public database of the Mathews lab. The dataset comprises 3,975 known RNA primary sequence and structure pairs. The sequence and structure pairs of 5sRNA, tRNA, tmRNA, and RNaseP are selected as the experimental data. Details of data usage can be found in **Table S1**. The following steps are employed to transform the raw data into mature data.

**Abbreviations:** MPM, Multi-sequence method; SPM, Single-sequence method; IBPMP, Improved Base Pair Maximization Principle; CSCP, Candidate stems combination principle; PCR, Prediction complementary region.

#### Data Cleaning

The CD-HIT tool (word\_length of 10 and threshold of 1.0) (Fu et al., 2012) is adopted to remove duplicate data, and custom code is then used to remove the data that contain unknown bases.
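The second half of this cleaning step can be sketched in a few lines. This is a minimal illustration and not the authors' script: the function name `clean_records`, the (sequence, structure) tuple layout, and the assumption that the valid bases are exactly A, C, G, and U are ours, and CD-HIT itself is not invoked here (only exact duplicates are dropped).

```python
VALID_BASES = set("ACGU")  # assumption: anything else counts as an unknown base

def clean_records(records):
    """Drop exact duplicate sequences and sequences containing unknown bases.

    `records` is an iterable of (sequence, structure) pairs (hypothetical layout).
    """
    seen = set()
    cleaned = []
    for seq, struct in records:
        seq = seq.upper()
        if seq in seen:
            continue  # exact duplicate of an earlier sequence
        if not set(seq) <= VALID_BASES:
            continue  # contains a base outside A/C/G/U
        seen.add(seq)
        cleaned.append((seq, struct))
    return cleaned
```
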

#### Structure Format Conversion

As the structural data are not sequence data, the original structure data format is transformed into the dot-bracket format (Danaee et al., 2018), which contains seven symbols. **Figure 1** presents the transformation rules between the original RNA secondary structure and dot-bracket sequences.

After these two steps, the clean RNA sequences and dot-bracket sequences are obtained, which are used as the input and label in DMfold.
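One way to picture the format conversion is a greedy assignment of base pairs to bracket levels, so that pairs within a level never cross. The sketch below is an illustration under our own assumptions, not the conversion used in the paper: the three bracket types `()`, `[]`, `{}`, the 0-based (i, j) pair representation with i < j, and the first-fit strategy are all hypothetical choices.

```python
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]  # assumed symbols for the three substructures

def crosses(p, q):
    """True if base pairs p=(i, j) and q=(k, l) cross (form a pseudoknot)."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l

def to_dot_bracket(length, pairs):
    """Greedily place each pair in the first bracket level where it crosses nothing."""
    levels = [[] for _ in BRACKETS]
    symbols = ["."] * length
    for pair in sorted(pairs):
        for level, (open_sym, close_sym) in zip(levels, BRACKETS):
            if not any(crosses(pair, other) for other in level):
                level.append(pair)
                symbols[pair[0]], symbols[pair[1]] = open_sym, close_sym
                break
        else:
            raise ValueError("more than three bracket levels would be needed")
    return "".join(symbols)
```

With three bracket types plus the dot, the alphabet has the seven symbols mentioned in the text.
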

#### Method

The Prediction Unit (PU) and Correction Unit (CU) are the two parts of DMfold. PU is a sequence-to-sequence Deep Learning model, which uses a three-layer bidirectional LSTM (TBI-LSTM, in which the dimension of the initial vector in each direction is 1 × 300) (Sutskever et al., 2014) as the encoder and four fully connected layers (FFCL) as the decoder to complete the prediction from RNA sequences to dot-bracket sequences. As there are some errors in the prediction results of PU, CU is used to modify the prediction results and output the corrected dot-bracket sequence for each RNA sequence. Specifically, CU accomplishes the modification and outputs the final predicted secondary structure based on IBPMP. **Figure 2** displays the architecture of DMfold. As can be seen, DMfold first adopts one-hot encoding to transform each base into a vector (1 × 8) (**Table 1** presents the rules of transformation between bases and one-hot vectors). Afterwards, DMfold uses those vectors (1 × 8) as the input of the encoder, in which each vector (1 × 8) is encoded into a vector (1 × 600) containing the context information. Subsequently, the decoder maps the vector (1 × 600) to a secondary structure symbol, with the one-hot vector of the real symbol used as the label (**Table 2** shows the rules of transformation between dot-bracket symbols and one-hot vectors). After all bases in an RNA sequence have been predicted by PU, the prediction results are processed by CU and the predicted secondary structure with pseudoknots is output.

FIGURE 1 | RNA structure can be decomposed into three pseudoknot-free substructures. Each color represents a substructure. There are three types of parentheses and a dot in the figure. The brackets represent the paired bases, and the dots represent unpaired bases. Each pair of brackets corresponds to a separate substructure, and the edges, which represent the base pairs, are nested within a substructure.

#### Prediction Unit

PU comprises two parts, an encoder and a decoder. The encoder is responsible for encoding the context-dependent bases into vectors (1 × 600) carrying context information, while the decoder is responsible for decoding those vectors (1 × 600) into the secondary structure symbols corresponding to those bases.

#### **Encoder**

The encoder model is built on the LSTM architecture, which uses memory cells to update and replace information and is superior in finding and exploiting long-range dependencies in context. Specifically, LSTM has been successfully applied in speech recognition (Graves et al., 2013), machine translation (Cho et al., 2014), and sequence-to-sequence learning (Sutskever et al., 2014). **Figure S1** illustrates a single LSTM memory cell. As can be observed from the figure, several self-parameterized control gates are used to access, write, and clear the cell. One advantage of using gates to control the information flow in the memory cell is that the gradient is trapped in the cell, which prevents it from vanishing too quickly, a critical problem in the RNN model. The LSTM memory cell can be implemented as follows:

$$\begin{aligned} i_t &= \sigma(\omega_{xi} x_t + \omega_{hi} h_{t-1} + \omega_{ci} c_{t-1} + b_i) \\ f_t &= \sigma(\omega_{xf} x_t + \omega_{hf} h_{t-1} + \omega_{cf} c_{t-1} + b_f) \\ c_t &= f_t c_{t-1} + i_t \tanh(\omega_{xc} x_t + \omega_{hc} h_{t-1} + b_c) \\ o_t &= \sigma(\omega_{xo} x_t + \omega_{ho} h_{t-1} + \omega_{co} c_t + b_o) \\ h_t &= o_t \tanh(c_t) \end{aligned}$$

where σ is the logistic sigmoid function, and i, f, o, and c are the input gate, forget gate, output gate, and cell vector, respectively, all of which have the same dimension as the hidden vector h (1 × 300). Meanwhile, ω denotes the weight matrices and b the bias vectors.
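To make the gate equations concrete, the following toy sketch evaluates one time step of a memory cell with a single scalar hidden unit. It is only an illustration: the real encoder uses 300-dimensional vectors and weight matrices, and the dictionary layout of `w` and `b` (keys such as `"xi"`, `"hi"`, `"ci"`) is a hypothetical convenience, not part of DMfold.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w, b):
    """One step of a scalar LSTM memory cell with peephole terms.

    `w` maps names such as "xi", "hi", "ci" to scalar weights and `b` maps
    "i", "f", "c", "o" to scalar biases (hypothetical container layout).
    """
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + w["ci"] * c_prev + b["i"])  # input gate
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + w["cf"] * c_prev + b["f"])  # forget gate
    c = f * c_prev + i * math.tanh(w["xc"] * x + w["hc"] * h_prev + b["c"])  # cell update
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["co"] * c + b["o"])       # output gate
    h = o * math.tanh(c)                                                     # hidden state
    return h, c
```

With all weights and biases set to zero, every gate equals 0.5, so the new cell value is half the previous one; this makes the role of the forget gate easy to check by hand.
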

In this paper, the RNA sequences are considered as long-distance context-dependent sequences. Hence, it is necessary to access both past and future features for each base when predicting dot-bracket sequences from RNA sequences. In DMfold, a three-layer BI-LSTM model is used as the encoder, which consists of both forward and backward networks. The forward LSTM processes the RNA sequences from left to right, whereas the backward LSTM processes them in the reverse order.

FIGURE 2 | The schematic diagram of DMfold Architecture, which contains two parts: PU and CU. PU is a deep learning model, mainly responsible for predicting the input RNA sequences as dot-bracket sequences. CU is mainly to correct the prediction dot-bracket sequences and output the prediction secondary structure.

Therefore, two hidden state sequences are obtained, one from the forward network ($\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_n$) and the other from the backward network ($\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_n$). Moreover, the encoder concatenates the forward and backward hidden states of each input vector, resulting in $h_m = [\overrightarrow{h}_m; \overleftarrow{h}_m]$. In this way, the encoding vector (1 × 600) of each input vector (1 × 8) is obtained. **Figure S2** is a schematic diagram of the encoder. After a vector (1 × 8) is input, the multi-layer BI-LSTM encodes the input vector (1 × 8), together with its context information, into a vector (1 × 600).

#### **Decoder**

TABLE 1 | The rules of transformation between bases and One-Hot vectors (Details can be found in Supplementary Materials).

It is necessary to map the vector (1 × 600) to an RNA secondary structure symbol after a base is encoded into a vector (1 × 600). In this paper, a four-layer fully connected neural network is proposed to accomplish the mapping. **Figure S3** shows the architecture of the decoder, consisting of one input layer, two hidden layers, and one output layer. The numbers of nodes in the layers are 600, 1024, 512, and 7, respectively. In the network, ReLU is used as the activation function, and the vector (1 × 600) is used as the input. The fully connected neural network can be implemented as follows:

$$\mathbf{y} = \text{ReLU}(\mathbf{w}\mathbf{x} + \mathbf{b})$$

where ReLU is the activation function, w denotes the weight matrices, b denotes the bias vectors, and x and y are the input and output between any two layers.
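The layer equation can be written out directly in plain Python. The sketch below is illustrative only: `dense_relu` and `parameter_count` are hypothetical helper names, and the parameter count simply follows from the layer sizes stated above (weights plus biases for each pair of adjacent layers).

```python
def dense_relu(x, weights, bias):
    """Compute y = ReLU(w x + b) for one layer; `weights` is a list of rows."""
    return [
        max(0.0, sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(weights, bias)
    ]

LAYER_SIZES = [600, 1024, 512, 7]  # node counts stated in the text

def parameter_count(sizes):
    """Number of weights plus biases in a fully connected stack."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

Calling `parameter_count(LAYER_SIZES)` gives the number of trainable values in a decoder of this shape, which is dominated by the 600 to 1024 and 1024 to 512 connections.
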

There are seven nodes in the output layer, each of which contains an output value. The largest value is set to 1, whereas the other nodes are set to 0, so that the output corresponds to the one-hot vector of a dot-bracket symbol. Hence, the result for each base is 1000000, 0000001, 0001000, 0100000, 0000010, 0010000, or 0000100.
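The conversion from seven output values to a one-hot code is an argmax, as the short sketch below shows. The pairing of codes (position 0 with 6, 1 with 5, 2 with 4, and position 3 for unpaired bases) follows the grouping given in the Correction Unit section; which bracket character belongs to which code is our assumption, since Table 2 is not reproduced here.

```python
# Index layout follows the codes quoted in the text; the bracket characters
# themselves are an assumption made for illustration.
SYMBOLS = ["(", "[", "{", ".", "}", "]", ")"]

def to_one_hot(outputs):
    """Set the largest of the seven output values to 1 and the rest to 0."""
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return "".join("1" if k == best else "0" for k in range(len(outputs)))

def to_symbol(one_hot):
    """Map a one-hot code such as "0001000" to its (assumed) dot-bracket symbol."""
    return SYMBOLS[one_hot.index("1")]
```
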

#### **PU training and testing**

The clean data are divided into three subsets: (1) a pure testing set containing 10% of all the clean data, which is untouched during the learning phase; and (2) the training set and validation set, which are created by 10-fold cross-validation on the remaining clean data. As the RNA sequences and dot-bracket sequences vary in length, they must be truncated or padded to the same length. Sequences shorter than 300 are padded with "N" until their length equals 300. The remaining sequences are split from the beginning into multiple sub-sequences of 300 bases each, with an overlap of 200 between two consecutive sub-sequences. Finally, any sub-sequence shorter than 300 is padded with "N," and the resulting equal-length sequences are used to train and test PU.
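The windowing described above can be sketched as follows. The window length, overlap, and padding symbol are taken from the text; the function name `split_and_pad` and the stopping rule (stop once a window reaches the end of the sequence) are our assumptions, since the paper does not spell out how the last sub-sequence is chosen.

```python
WINDOW = 300   # fixed input length stated in the text
OVERLAP = 200  # overlap between consecutive sub-sequences
PAD = "N"

def split_and_pad(seq, window=WINDOW, overlap=OVERLAP, pad=PAD):
    """Cut `seq` into overlapping windows and pad each one to `window` symbols."""
    stride = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(seq[start:start + window].ljust(window, pad))
        if start + window >= len(seq):
            break  # this window already covers the end of the sequence
        start += stride
    return chunks
```

With these settings the stride is 100, so a sequence of 450 bases yields windows starting at positions 0, 100, and 200, the last of which is padded.
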

In PU, the cross-entropy loss (CEL) is employed to quantify the training errors and the goal is to minimize CEL. The complete training details are given below:

(1) All weights and biases are initialized from a normal distribution with a standard deviation of 0.1.
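For reference, the cross-entropy loss for a single base can be computed as in the sketch below. The use of a softmax to turn the seven output values into probabilities is our assumption (the text only names the loss), and both function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores (assumed output layer)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, label_index):
    """Cross-entropy between softmax(scores) and a one-hot label."""
    return -math.log(softmax(scores)[label_index])
```

A uniform output over seven symbols gives a loss of ln 7, and the loss shrinks as the score of the correct symbol grows relative to the others.
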


As PU is a deep learning model, 10-fold cross-validation is employed to verify the stability of its performance. In each fold, 50 epochs are trained, and the loss and accuracy of each epoch are recorded for both the training set and the testing set. **Figure 3** shows the average loss and accuracy at each epoch over the 10-fold cross-validation experiments. As can be observed, the test loss and accuracy stabilize after the 40th epoch, with the highest testing accuracy being 87.8%, indicating that PU can successfully complete the prediction from RNA sequences to dot-bracket sequences.

#### Correction Unit

After an RNA sequence has been predicted by PU, the prediction results for the padding bases are removed and the prediction results of the multiple sub-sequences are spliced. The processed prediction results are the input of CU. Typically, the RNA secondary structure can be considered as a combination of stems and loops (Sakakibara et al., 2007): each stem contains two complementary regions, the 5′ complementary region (5′-CR) and the 3′ complementary region (3′-CR), and each loop comprises multiple unpaired bases. Hence, depending on the substructure, a continuous run of "1000000," "0100000," or "0010000" represents a predicted 5′-CR (5′-PCR) in the prediction results, a continuous run of "0000001," "0000010," or "0000100" stands for a predicted 3′-CR (3′-PCR), and a continuous run of "0001000" indicates a predicted loop.

According to the types of pseudoknot-free substructures, the PCRs are divided into three sets, each of which contains all the 5′-PCRs and 3′-PCRs of one pseudoknot-free substructure: all continuous runs of "1000000" and "0000001" form one set, all continuous runs of "0100000" and "0000010" form another, and all continuous runs of "0010000" and "0000100" form the third. In CU, the IBPMP is employed to find the optimal compatible stem combinations for each set, which represent the predicted pseudoknot-free substructures. Then, those three pseudoknot-free substructures are combined to obtain the predicted secondary structure with pseudoknots.
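The grouping of the PU output into PCR sets can be sketched with run-length grouping. The codes are those quoted in the text; the function name `extract_pcrs`, the 0-based inclusive positions, and the tuple layout are hypothetical choices made for this illustration.

```python
from itertools import groupby

# One-hot codes quoted in the text, grouped by pseudoknot-free substructure.
FIVE_PRIME = {"1000000": 0, "0100000": 1, "0010000": 2}
THREE_PRIME = {"0000001": 0, "0000010": 1, "0000100": 2}

def extract_pcrs(codes):
    """Group runs of identical codes into three PCR sets.

    Returns a list of three lists; each entry is (kind, start, end) with
    0-based inclusive positions and kind "5'" or "3'". Loop runs are skipped.
    """
    pcr_sets = [[], [], []]
    position = 0
    for code, run in groupby(codes):
        length = len(list(run))
        if code in FIVE_PRIME:
            pcr_sets[FIVE_PRIME[code]].append(("5'", position, position + length - 1))
        elif code in THREE_PRIME:
            pcr_sets[THREE_PRIME[code]].append(("3'", position, position + length - 1))
        position += length
    return pcr_sets
```
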

#### **Some definitions and operation**

Matrix: An n×n upper triangular matrix is created to store all the potential maximal stems for each RNA sequence (n represents the sequence length), in which each row and column stand for the corresponding nucleotides in the sequence. For example, row i and column j represent the ith and jth nucleotides, respectively. Accordingly, a position in the matrix that can form a base pair is marked as 1; otherwise, it is denoted as 0. Each maximal stem corresponds to a diagonal line in the Matrix.

Stem: For an RNA sequence, if two disjoint areas can be paired in reverse order to form m continuous base pairs, then those m continuous base pairs are deemed a stem. A stem can be expressed as a triplet stm = (S, E, L), in which S and E are the subscripts closest to the 5′ end and 3′ end, respectively, and L represents the length.
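A small sketch may help to visualize the Matrix and the maximal stems it contains. Several details here are assumptions rather than statements from the paper: the set of allowed base pairs (Watson-Crick plus G-U wobble), the 0-based indices, the hypothetical `min_len` parameter, and the absence of any minimum loop length.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}  # assumed pairing rule

def pairing_matrix(seq):
    """Upper triangular 0/1 matrix: entry [i][j] (i < j) is 1 if bases i and j can pair."""
    n = len(seq)
    return [[1 if i < j and (seq[i], seq[j]) in PAIRS else 0 for j in range(n)]
            for i in range(n)]

def maximal_stems(matrix, min_len=2):
    """Collect maximal diagonal runs of 1s as (S, E, L) triplets."""
    n = len(matrix)
    stems = []
    for i in range(n):
        for j in range(i + 1, n):
            if not matrix[i][j]:
                continue
            # skip cells that continue a run started at (i - 1, j + 1)
            if i > 0 and j + 1 < n and matrix[i - 1][j + 1]:
                continue
            length = 0
            while i + length < j - length and matrix[i + length][j - length]:
                length += 1
            if length >= min_len:
                stems.append((i, j, length))
    return stems
```

Each returned triplet pairs base S + k with base E - k for k from 0 to L - 1, which matches the (S, E, L) convention defined above.
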

Compatible: For any two stems in the RNA sequence, $\mathit{stm}_1 = (S_1, E_1, L_1)$ and $\mathit{stm}_2 = (S_2, E_2, L_2)$, if ($E_1 < S_2$) or ($E_2 < S_1$) or ($S_1 + L_1 - 1 < S_2$ and $E_2 - L_2 + 1 < E_1$) or ($S_2 + L_2 - 1 < S_1$ and $E_1 - L_1 + 1 < E_2$), then $\mathit{stm}_1$ is compatible with $\mathit{stm}_2$; otherwise, the two stems are incompatible.
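The compatibility test translates almost word for word into code. The function below is a literal transcription of the four inequalities as printed, with stems given as (S, E, L) triplets; the function name is hypothetical, and no additional checks (for example, for shared bases) are added beyond what the definition states.

```python
def compatible(stm1, stm2):
    """Literal transcription of the four compatibility conditions for (S, E, L) triplets."""
    s1, e1, l1 = stm1
    s2, e2, l2 = stm2
    return (
        e1 < s2                                        # stm1 lies entirely before stm2
        or e2 < s1                                     # stm2 lies entirely before stm1
        or (s1 + l1 - 1 < s2 and e2 - l2 + 1 < e1)     # stm2 enclosed by stm1
        or (s2 + l2 - 1 < s1 and e1 - l1 + 1 < e2)     # stm1 enclosed by stm2
    )
```

For example, two side-by-side stems and two nested stems are compatible, whereas two stems whose pairs cross (a pseudoknot) are not, which is why crossing stems end up in different substructures.
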

Rate: After multiple (>1) PCRs have been used to search for the optimal combination of stems, the usage rate of those PCRs is calculated from the stems. First, the PCRs are divided into two categories, 5′-PCR and 3′-PCR, containing H and G bases, respectively. For the combined stems, the numbers of bases in the 5′-CR and 3′-CR that are contained in the 5′-PCR and 3′-PCR are counted as h and g, respectively. The usage rates of 5′-PCR and 3′-PCR are then calculated as 5′-Rate = h/H and 3′-Rate = g/G, and that of all PCRs is recorded as Rate = (5′-Rate + 3′-Rate)/2.
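The three rates can be computed by counting positions, as in this minimal sketch. The representation of PCRs as inclusive (start, end) ranges and of stems as (S, E, L) triplets, as well as the function name, are assumptions made for illustration.

```python
def usage_rates(stems, five_pcrs, three_pcrs):
    """Return (5'-Rate, 3'-Rate, Rate) for stems given as (S, E, L) triplets.

    `five_pcrs` and `three_pcrs` are lists of inclusive (start, end) ranges.
    """
    five_bases = {p for a, b in five_pcrs for p in range(a, b + 1)}
    three_bases = {p for a, b in three_pcrs for p in range(a, b + 1)}
    stem_five = {s + k for s, e, l in stems for k in range(l)}   # bases in the 5'-CRs
    stem_three = {e - k for s, e, l in stems for k in range(l)}  # bases in the 3'-CRs
    five_rate = len(stem_five & five_bases) / len(five_bases)      # h / H
    three_rate = len(stem_three & three_bases) / len(three_bases)  # g / G
    return five_rate, three_rate, (five_rate + three_rate) / 2
```

For a single stem of length 3 that uses three of the four bases in its 5′-PCR and all three bases of its 3′-PCR, the rates are 0.75, 1.0, and 0.875.
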

Extend: When the stems in a combination are extended, they are first located at the corresponding positions of Matrix and then extended using the maximal stems. The extended part of a stem must not overlap with other stems, and the extended parts of any two stems must not overlap with each other.

#### **IBPMP**

For each PCR set, the original Base Pair Maximization Principle (Eddy, 2004) could be used to search for the longest combination of stems if all PCRs were completely correct. Unfortunately, there are always some errors in the set, so the IBPMP is used instead of the original principle to find the optimal compatible stem combinations, which might not be the longest stem combination. The difference between IBPMP and the original principle is mainly reflected in two aspects. On the one hand, whereas the original principle allows stems to be formed in all complementary regions (Eddy, 2004), the new principle stipulates that stems can only be produced between a 5′-PCR and a 3′-PCR (the 5′-PCR subscripts i and j and the 3′-PCR subscripts p and q follow the order i<j<p<q). On the other hand, whereas the original principle selects all the stems in the RNA sequence simultaneously as the candidate stems and gives each stem the same priority when finding the longest compatible stem combination (Eddy, 2004), the new principle selects the candidate stems in multiple steps and uses different priorities in each step to combine them. Notably, the time and space complexities of IBPMP are greatly reduced compared with the original principle. **Figure 4A** shows the procedure of IBPMP.

The candidate stems combination principle (CSCP) is the key of IBPMP. It stipulates that a longer stem has a higher priority, and that the candidate stems are combined in order of priority from high to low (stems with the same priority are combined simultaneously). **Figures 4B,C** show the principle of CSCP: **Figure 4B** is the procedure of CSCP and **Figure 4C** is an example. As can be observed in **Figure 4C**, each node contains a "C" set, which represents the optimal combination of the node. The "C" in the root node stands for the initial set (which might be empty), while the other "C" sets are generated layer by layer by adding several new stems to the "C" of their parent nodes; the new stems are compatible not only with each other but also with all stems in their parent nodes. The "C" sets contained in the leaf nodes of the lowest layer stand for the optimal compatible combination results. When all the candidate stems of the same priority are combined, all combinations of those stems are generated first, with the number of stems in a combination ranging from 1 to O (O represents the number of those stems). Then, the longest compatible stem combinations are selected, in which each stem is compatible with all the stems in the parent nodes. Finally, each selected combination is added to the "C" of its compatible parent node, and different child nodes are generated with the new "C." For example (as shown in **Figure 4C**), when S4, S5, and S6 are combined, seven stem combinations are produced first: (S4), (S5), (S6), (S4, S5), (S4, S6), (S5, S6), and (S4, S5, S6). Then, the longest compatible stem combinations are selected, in which each stem is compatible with all stems in B or C. Finally, (S4, S6) is selected as the new stem combination, and S4 and S6 are added to the "C" of node B to generate node D.
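The combination of same-priority stems for one parent node can be sketched as a subset enumeration. This is an illustration under our own assumptions, not the authors' implementation: "longest" is interpreted as the largest total number of base pairs, the pairwise test is passed in as a caller-supplied function (a hypothetical hook), and stems are (S, E, L) triplets.

```python
from itertools import combinations

def combine_same_priority(parent, stems, compatible):
    """Extend the parent set "C" with the longest compatible subsets of `stems`.

    `compatible(a, b)` is any pairwise test supplied by the caller.
    Returns one new set per winning subset, i.e. one child node each.
    """
    best, best_len = [], 0
    for size in range(1, len(stems) + 1):
        for subset in combinations(stems, size):
            ok = all(compatible(a, b) for a, b in combinations(subset, 2)) and all(
                compatible(s, p) for s in subset for p in parent
            )
            if not ok:
                continue
            total = sum(length for _, _, length in subset)
            if total > best_len:
                best, best_len = [subset], total
            elif total == best_len:
                best.append(subset)
    return [list(parent) + list(subset) for subset in best]
```

For three candidate stems the two loops visit exactly the seven non-empty subsets listed in the example above.
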

There are two kinds of candidate stem selection and combination in each step. The first kind of selection takes all the stems among multiple PCRs as the candidate stems and produces the first kind of stem combinations based on CSCP. If the first kind of stem combinations satisfies the conditions of the current step, all the appropriate stems in those combinations are collected (the second kind of selection) as the candidate stems. Afterwards, all the stems from the second selection are used to produce the second kind of stem combinations based on CSCP, which stand for the optimal results of each step. The details of each step are presented below:

FIGURE 4 | The principle diagram of IBPMP. (A) The procedure of IBPMP, which contains two parts: initialization and algorithm section. In the initialization, the procedure processes the prediction results of PU as the input of CU. In the algorithm section, it obtains the predicted secondary structure with pseudoknots. See below for details of FirstStep, SecondaryStep, and ThirdStep. (B) The procedure of CSCP, which contains two parts: initialization and algorithm section. In the initialization, it collects all stems and sets their priorities. In the algorithm section, it obtains the optimal stem combinations. (C) An example of CSCP.

The first step: First, all PCRs are arbitrarily combined, with each combination involving n (n ≥ 2) PCRs. Afterwards, combinations containing PCRs that cannot form stems are removed, for instance those in which the PCR with the minimum subscript is a 3′-PCR or the PCR with the maximum subscript is a 5′-PCR. For each remaining combination, the bases of the 5′-PCRs and 3′-PCRs are located in the rows and columns of Matrix, respectively. All the stems in the located areas are collected (the first kind of collection) as the candidate stems, where the subscripts i and j of a 5′-PCR and the subscripts p and q of a 3′-PCR follow the order i<j<p<q. For example, if some bases contained in a 5′-PCR have subscripts from i to j and some contained in a 3′-PCR have subscripts from p to q, then rows i to j and columns p to q of Matrix are located, and all the stems in that area are collected provided that i<j<p<q. Based on the collected candidate stems, CSCP (with an empty initial "C") is used to obtain the first kind of optimal stem combinations and to compute the rate. If the rate is 1, all the appropriate stems are collected (the second kind of collection). After all the remaining combinations have been processed, all the collected stems are the candidate stems of the step. Second, CSCP (whose initial "C" is "OptimalSet," an empty set before the first step) is employed again to obtain the second kind of optimal stem combinations, "OptimalSet." Finally, the PCRs that have formed stems are removed. The above operation is repeated: the initial value of n is 2, and n is increased by 1 in each round until it reaches 4.
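The enumeration and filtering of PCR combinations at the start of the first step can be sketched as follows. We read "combined" as "every combination of n PCRs is considered", which is our interpretation; the tuple layout (kind, start, end) and the function name are likewise hypothetical.

```python
from itertools import combinations

def candidate_pcr_groups(pcrs, n):
    """Enumerate groups of n PCRs that can still form at least one stem.

    Each PCR is (kind, start, end) with kind "5'" or "3'". A group is dropped
    when its lowest-index PCR is a 3'-PCR or its highest-index PCR is a 5'-PCR.
    """
    groups = []
    for group in combinations(sorted(pcrs, key=lambda p: p[1]), n):
        if group[0][0] == "3'" or group[-1][0] == "5'":
            continue  # no 5'-PCR precedes a 3'-PCR at the boundaries of this group
        groups.append(group)
    return groups
```

In the first step this function would be called with n = 2, 3, and 4 in turn, each time on the PCRs that remain after the previous round.
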

The second step: First, all the remaining PCRs are arbitrarily combined, with each combination containing two PCRs. Then, the combinations containing PCRs that cannot form stems are removed. For each remaining combination, the bases of the 5′-PCRs and 3′-PCRs are located in the rows and columns of Matrix, respectively. All the stems in the located areas are then collected (the first kind of collection) as the candidate stems, where the subscripts i and j of a 5′-PCR and the subscripts p and q of a 3′-PCR follow the order i<j<p<q. Based on the collected candidate stems, CSCP (with an empty initial "C") is used to obtain the first kind of optimal stem combinations and to compute the rate. If a single usage rate (5′-Rate or 3′-Rate) is 1, all the appropriate stems are collected (the second kind of collection). After all the remaining combinations have been processed, all the collected stems are the candidate stems of the step. Second, CSCP (whose initial "C" is "OptimalSet") is utilized again to obtain the second kind of optimal stem combinations, "OptimalSet." Finally, the PCRs with a single usage rate of 1 are removed.

The third step: First, all the remaining PCRs are arbitrarily combined, with each combination containing two PCRs. Then, the combinations containing PCRs that cannot form stems are removed. For each remaining combination, the bases of the 5′-PCRs and 3′-PCRs are located in the rows and columns of Matrix, respectively. All the stems in the located areas are then collected (the first kind of collection) as the candidate stems, where the subscripts i and j of a 5′-PCR and the subscripts p and q of a 3′-PCR follow the order i<j<p<q. Subsequently, CSCP (with an empty initial "C") is employed to obtain the first kind of optimal stem combinations and to compute the rate. If the usage rate is >0.6, all the appropriate stems are collected (the second kind of collection) as the candidate stems. After all the remaining combinations have been processed, all the collected stems are the candidate stems of the step. Second, CSCP (whose initial "C" is "OptimalSet") is used again to obtain the second kind of optimal stem combinations, "OptimalSet."

Repeating the above operation yields the predicted pseudoknot-free substructures of each PCR set. Those substructures are then arbitrarily combined to produce the final stem combinations, each of which contains only one substructure from each PCR set. Additionally, the final stem combinations are extended to obtain the predicted secondary structure with pseudoknots. Eventually, the predicted structure is transformed into a dot-bracket sequence and output.
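The final transformation from stems back to a dot-bracket sequence can be sketched as below. The bracket characters assigned to the three substructures and the function name are assumptions made for illustration; stems are (S, E, L) triplets with 0-based indices.

```python
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]  # assumed symbols, one pair per substructure

def stems_to_dot_bracket(length, substructures):
    """Render up to three lists of (S, E, L) stems as one dot-bracket string."""
    symbols = ["."] * length
    for stems, (open_sym, close_sym) in zip(substructures, BRACKETS):
        for s, e, l in stems:
            for k in range(l):
                symbols[s + k] = open_sym   # 5' side of the stem
                symbols[e - k] = close_sym  # 3' side of the stem
    return "".join(symbols)
```

Because each substructure uses its own bracket type, crossing stems from different substructures remain unambiguous in the output string.
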

#### Performance Measurement

For the same RNA, the structures predicted by some methods contain pseudoknots, whereas those predicted by others do not. Since pseudoknots are formed by crossing stems and pseudoknot-free structures are formed by nested stems, the accuracy of the base pairs can be calculated to represent the accuracy of the predicted structures, so that the predicted structures can be compared across methods.

To estimate the accuracy of the prediction results of DMfold and other methods, the indexes of sensitivity (SEN) and positive predictive value (PPV) are commonly used (Seetin and Mathews, 2012). SEN measures the ability to find the positive base pairs, while PPV measures the ability to avoid folding false positive base pairs. Specifically, SEN and PPV are defined by equations (1) and (2), respectively.

$$\text{SEN} = \text{TP}/(\text{TP} + \text{FN}) \quad \text{(1)}$$

$$\text{PPV} = \text{TP}/(\text{TP} + \text{FP}) \quad \text{(2)}$$

where TP (true positive) is the number of matched bases that are correctly predicted, FN (false negative) is the number of existing matched bases that are not predicted, and FP (false positive) is the number of matched bases that are incorrectly predicted.

Generally, the requirements of SEN and PPV cannot be satisfied simultaneously when comparing the accuracy of prediction results. Therefore, the F-score (Yonemoto et al., 2015), which is the harmonic mean of SEN and PPV, is used to comprehensively evaluate the prediction results. Specifically, the value of the F-score [defined by equation (3)] ranges from 0 to 1: 0 indicates that the predicted structure has no base pair in common with the real structure, whereas 1 indicates that the predicted structure is the same as the real structure.

$$\text{F-score} = 2 \times \text{SEN} \times \text{PPV}/(\text{SEN} + \text{PPV}) \quad \text{(3)}$$
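The three measures can be computed from two sets of base pairs, as in the following sketch. It assumes that TP, FN, and FP are counted per base pair (i, j) and that empty denominators yield 0, neither of which is specified in the text; the F-score is computed as the harmonic mean of SEN and PPV.

```python
def score(predicted_pairs, true_pairs):
    """Return (SEN, PPV, F-score) for two collections of base pairs (i, j)."""
    predicted, true = set(predicted_pairs), set(true_pairs)
    tp = len(predicted & true)   # correctly predicted pairs
    fn = len(true - predicted)   # real pairs that were missed
    fp = len(predicted - true)   # predicted pairs that do not exist
    sen = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f_score = 2 * sen * ppv / (sen + ppv) if sen + ppv else 0.0
    return sen, ppv, f_score
```

For instance, if one of two predicted pairs is correct and one of two real pairs is found, all three measures equal 0.5.
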

#### RESULTS

In this section, the prediction results of our method are presented and compared with those of several excellent methods, including mfold (Zuker, 2003), RNAfold (Zuker and Stiegler, 1981), cofold (Proctor and Meyer, 2013), IPknot (Kengo et al., 2011), and ProbKnot (Bellaousov and Mathews, 2010). Among those methods, mfold, RNAfold, and cofold predict the pseudoknot-free secondary structure, while IPknot and ProbKnot predict the secondary structure with pseudoknots. The comparison therefore covers both pseudoknot and pseudoknot-free prediction methods, which allows the performance of our method to be assessed more comprehensively. In this paper, the prediction results of the methods are compared in two aspects: performance and structure visualization. The performance comparison mainly tests the accuracy of the predicted base pairs, while the structure visualization comparison mainly tests which result is closer to the natural structure. To facilitate comparison among methods, when a method predicts several structures for an RNA sequence, the structure with the highest F-score is selected as its predicted structure.

#### Performance Comparison

In this paper, the data in the testing set are classified according to RNA family, and the prediction results are compared among different families. When calculating the parameters (SEN, PPV, and F-score) of DMfold, the values of those parameters are obtained in each fold experiment as the means over all the RNA sequences in each family. Accordingly, the means of those parameters over the 10-fold cross-validation experiments stand for the prediction parameters of DMfold. When calculating the parameters (SEN, PPV, and F-score) of other methods, the parameters of each RNA sequence are obtained, and the mean parameters in each family represent the prediction parameters of those methods.

**Table 3** compares the prediction results of our method and other methods on tRNA and 5sRNA, which represent short RNA sequences with lengths of 70–200. The SEN, PPV, and F-score of DMfold are higher than those of the other methods for both tRNA and 5sRNA. Therefore, DMfold is superior to the other methods for short RNA sequences. **Table 4** compares the prediction results of our method and other methods on tmRNA and RNaseP, which represent long RNA sequences with lengths of 300–500. The SEN, PPV, and F-score of DMfold are higher than those of the other methods for tmRNA. For RNaseP, the SEN and F-score of DMfold are at the common level, but its PPV is the best, suggesting that the prediction results of DMfold for RNaseP contain the lowest proportion of false positive bases. These two tables verify that DMfold can effectively predict the secondary structure of both short and long RNA sequences.

#### Structure Visualization Comparison

Since the function of RNA is highly correlated with the shape of its secondary structure, we compare different methods by observing the visualization maps. First, a tRNA molecule (tRNA\_tdbR00000143-Asterias\_amurensis-7602-His-QUG) is randomly selected from the testing set, and the prediction results of the six methods are obtained. Then, the forna tool (Gruber et al., 2015) is used to obtain the visualization maps of those prediction results. **Figure 5** shows the visual representation of the real and predicted structures. As shown, the DMfold structure and the real structure both have four branches on the bifurcation loop, forming the typical cloverleaf shape of tRNA, which is the key to transporting amino acids. The structures predicted by mfold, RNAfold, cofold, and ProbKnot lack a branch in the bifurcation loop, which can seriously affect the function of tRNA. The IPknot method successfully predicts only two branches, which is also seriously inconsistent with the real structure. Although the structure predicted by our method is not completely correct, it is the closest to the natural structure among all the methods. Therefore, our method is more conducive to the study of RNA function. See the **Supplementary Materials** for the visualization comparison of the other three families (**Figures S4**–**S6**).

### DISCUSSION

In this paper, a new method is proposed to solve the problem of predicting the RNA secondary structure with pseudoknots. Computational methods have been used for over 40 years to predict the secondary structure of RNA. Although many prediction methods have been proposed, only a few of them can predict the RNA secondary structure with pseudoknots, since it is an NP-hard problem (Rivas and Eddy, 1999). In the traditional computational methods, predicting pseudoknots greatly adds to the algorithmic complexity. Hence, many methods do not predict pseudoknots, or predict only some common pseudoknots, for the sake of reducing the algorithmic complexity (Rivas and Eddy, 1999). Different from the traditional computational methods, our method transforms the pseudoknot problem into pseudoknot-free problems, so that it can predict the RNA secondary structure with all kinds of pseudoknots at a reasonable complexity. More importantly, as shown in the Results section, our prediction results are closer to the natural structure. Hence, our method is more beneficial to the study of RNA function.

TABLE 3 | The comparison between DMfold and other methods on 5sRNA and tRNA.

*The bold value is the maximum of each column.*

TABLE 4 | The comparison between DMfold and other methods on tmRNA and RNaseP.

*The bold value is the maximum of each column.*

The novelty of our method is that it is the first to combine deep learning and IBPMP to solve the problem of predicting secondary structure with pseudoknots. Unlike traditional computational methods, which use either MPM or SPM to predict the secondary structure, our method takes full advantage of both of those main approaches. It uses the auxiliary sequence to help predict RNA secondary structure and uses the deep learning model to automatically extract RNA features, without the energy or statistical parameters required by traditional computational methods. Compared with MSM, our method needs only the target sequence as input, which greatly simplifies its operation. In addition, compared with SPM, our method can effectively overcome the restriction of insufficient parameters in traditional computational methods.

Moreover, to improve the credibility of our method, 10-fold cross-validation was used to train and test it, and both short and long RNA sequences were included in the experimental data. As the results section shows, the prediction accuracy of our method on short RNA sequences is greatly improved relative to that of the other methods. On long RNA sequences, the accuracy of our method is lower than on short RNA sequences, but the prediction results are still improved. Two reasons may be responsible for this: on the one hand, the topology of short RNA sequences is simple, and the existing data are sufficient to support learning and prediction for short sequences; on the other hand, the topology of long RNA sequences is complex, and the existing long sequences are insufficient to support learning and prediction. These results indicate that the accuracy of our method on long RNA sequences remains to be improved as known structural data accumulate.

The improved prediction accuracy may be because different RNAs fold in different microenvironments. These differences in microenvironment may cause RNAs to fold according to different rules, indicating that traditional computational methods, which apply common folding rules, are not well suited to predicting the secondary structures of multiple RNA types, especially those with pseudoknots. On this basis, our method learns from different types of RNA and predicts the structure of similar RNA, which can effectively avoid the low prediction accuracy caused by a single set of rules. Compared with the traditional computational methods, our method is more suitable for predicting the secondary structures of multiple different types of RNA.

Our method has many advantages; nonetheless, it also has certain limitations. Because it contains a deep learning model, it needs a large number of similar RNAs with known structures to learn features for different types of RNA. The prediction accuracy might therefore be reduced when similar sequences are insufficient, so the applicability of our method is partly limited. Nevertheless, its applicability is expected to grow gradually as the number of known secondary structures increases.

#### AUTHOR CONTRIBUTIONS

YL conceived and directed the project. LW designed the study and wrote the manuscript. HL and XZ revised the manuscript critically for important intellectual content. CoL and ChL collected the data and coded the procedure. HZ reviewed the data. All authors have read and approved the final manuscript for publication.

#### FUNDING

This research was supported by the National Natural Science Foundation of China (Grant No. 61471181, 81702966) and the Natural Science Foundation of Jilin Province (Grant No. 20140101194JC, 20150101056JC).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00143/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Liu, Zhong, Liu, Lu, Li and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification of Triple-Negative Breast Cancer Genes and a Novel High-Risk Breast Cancer Prediction Model Development Based on PPI Data and Support Vector Machines

Ming Li<sup>1</sup>, Yu Guo<sup>1</sup>, Yuan-Ming Feng<sup>1,2</sup> and Ning Zhang<sup>1</sup>\*

<sup>1</sup> Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China, <sup>2</sup> Department of Radiation Oncology, Tianjin Medical University Cancer Institute and Hospital, Tianjin, China

#### Edited by:

Arun Kumar Sangaiah, VIT University, India

#### Reviewed by:

Jiangning Song, Monash University, Australia

Zhi-Ping Liu, Shandong University, China

> \*Correspondence: Ning Zhang zhni@tju.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 31 October 2018 Accepted: 19 February 2019 Published: 15 March 2019

#### Citation:

Li M, Guo Y, Feng Y-M and Zhang N (2019) Identification of Triple-Negative Breast Cancer Genes and a Novel High-Risk Breast Cancer Prediction Model Development Based on PPI Data and Support Vector Machines. Front. Genet. 10:180. doi: 10.3389/fgene.2019.00180

Triple-negative breast cancer (TNBC) is a special subtype of breast cancer that is difficult to treat. It is crucial to identify breast cancer-related genes that could provide new biomarkers for breast cancer diagnosis and potential treatment targets. In the development of our new high-risk breast cancer prediction model, seven raw gene expression datasets from the NCBI gene expression omnibus (GEO) database (GSE31519, GSE9574, GSE20194, GSE20271, GSE32646, GSE45255, and GSE15852) were used. Using the maximum relevance minimum redundancy (mRMR) method, we selected significant genes. Then, we mapped the transcripts of these genes onto the protein-protein interaction (PPI) network from the Search Tool for the Retrieval of Interacting Genes (STRING) database and traced the shortest path between each pair of proteins. Genes with higher betweenness values were selected from the shortest path proteins. To ensure validity and precision, a permutation test was performed: we randomly selected 248 proteins from the PPI network for shortest path tracing and repeated the procedure 100 times. We then removed genes that appeared more frequently in the randomized results. As a result, 54 genes were selected as potential TNBC-related genes. Using 14 of the 54 potential TNBC-associated genes as input features for a support vector machine (SVM), a novel model was trained to predict high-risk breast cancer. The prediction accuracy for normal tissues versus TNBC tissues reached 95.394%, and that for Stage II versus Stage III TNBC reached 86.598%, indicating that such genes play important roles in distinguishing breast cancers and that the method could be promising in practical use. Some of the 54 genes we identified from the PPI network have been reported in the literature to be associated with breast cancer. Several other genes have not yet been reported but have functional resemblance to known cancer genes. These may be novel breast cancer-related genes and need further experimental validation. Gene ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed to appraise the 54 genes. The analyses indicated that the cellular response to organic cyclic compounds has an influence in breast cancer and that most of the genes may be related to viral carcinogenesis.

Keywords: triple-negative breast cancer, gene, proteins, protein-protein interaction network, SVM

### INTRODUCTION

fgene-10-00180 March 13, 2019 Time: 18:14 # 2

Breast cancer is a malignant tumor that is highly prevalent among women worldwide, and its incidence has increased significantly in recent years. Based on estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER-2) status, breast cancer can be classified into four categories. Triple-negative breast cancer (TNBC), one of the more specialized types, is defined as breast cancer that lacks expression of ER and PR and also lacks HER-2 overexpression or gene amplification. TNBC is more common in young women and is characterized by large tumors, a high rate of lymphatic metastasis, and high clinical stage. The 5-year recurrence rate is high, and visceral metastases such as liver and lung metastases are more common. Compared with other types of breast cancer, TNBC is characterized by rapid tumor growth, early recurrence, and a tendency to metastasize (Prat et al., 2013). To date, the genes related to this disease remain poorly understood.

Triple-negative breast cancer accounts for about 15–25% of all breast cancers. Identifying disease-related genes and predicting high-risk breast cancer patients have become important problems. Genes that are highly associated with TNBC can be found using gene expression profiles. However, current methods of predicting protein function from high-throughput protein interaction data still have problems: they usually have a high false positive rate, which reduces the reliability of the functional predictions (Li et al., 2012b; Oliver et al., 2015).

In recent years, the continuous accumulation of protein interaction data has made it possible to analyze and predict protein functions at the system level through the protein-protein interaction (PPI) network. Nabieva et al. (2005) proposed the "guilt-by-association rule" (GBA), which states that interacting proteins have the same or similar functions and therefore suggests that protein function can be predicted from protein interactions.

In this study, we identified TNBC-related genes by a computational method. A weighted functional PPI network was integrated, which can overcome the disadvantages of using gene expression profiles alone. We previously applied this integrative approach successfully to gene function prediction and to the identification of novel genes for various diseases, such as influenza A/H7N9 virus infection (Ning et al., 2014), colorectal cancer (Li et al., 2012b, 2013a), lung cancer (Li et al., 2013b), hepatitis B virus (HBV) infection-related hepatocellular carcinoma (Jiang et al., 2013), retinoblastoma (Li et al., 2012c), and Ebola virus (Cao et al., 2017).

### MATERIALS AND METHODS

The whole process of our study is illustrated in **Figure 1**. Details are presented in the following sub-sections.

#### Dataset

Expression profiles from datasets GSE31519, GSE9574, GSE20194, GSE20271, GSE45255, and GSE15852 were obtained from the GEO database<sup>1</sup>. The dataset comprises 319 sample chips: 101 normal breast tissue samples and 218 TNBC tissue samples (including 21 Stage II samples and 101 Stage III samples).

In this study, the robust multi-array average (RMA) method in "limma" in R was used to normalize the microarray data and to perform a log<sub>2</sub> transformation of the chip data. In total, 12,437 genes were obtained. RMA uses a multi-chip model that requires all chips to be normalized together. The expression value is estimated from a stochastic model of the perfect match (PM) signal distribution. RMA is among the most common chip data preprocessing methods in the literature and has been used in many other biomedical research problems, such as the analysis of diabetic nephropathy (Cohen et al., 2008), the crosstalk between B16 melanoma cells and B-1 lymphocytes (Xander et al., 2013), and colon cancer (Melo et al., 2013).

### The mRMR Method

We employed the mRMR method (Peng et al., 2005; Li et al., 2012a,b; Zhang et al., 2012; Zou et al., 2016b; Su et al., 2018) to rank the importance of all 12,437 genes examined. In this procedure, each gene was regarded as a feature. The Maximum Relevance criterion selects the features most important for discriminating TNBC samples from controls. The Minimum Redundancy criterion excludes features that are redundant with those already selected. In an mRMR procedure, a value A − B is calculated for each feature, where A represents the relevance and B the redundancy of the feature. The features are then ranked by their A − B values in descending order to reflect their importance to the target, with the most important feature ranked at the top (Peng et al., 2005; Li et al., 2012a,b; Zhang et al., 2012; Zou et al., 2016b).

Two ordered lists were generated by the mRMR method: the MaxRel table and the mRMR table. In the MaxRel table, all the features were ranked by the Maximum Relevance criterion only. In the mRMR table, they were ranked by the mRMR criterion, i.e., a feature with a smaller index in this table is considered more important because it has a better trade-off between maximum relevance and minimum redundancy. In this study, we selected the top 248 features from the mRMR table; the corresponding 248 genes were regarded as significantly differentially expressed genes and were analyzed in the downstream procedures.
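The greedy ranking described above can be illustrated with a minimal Python sketch. It assumes that expression values have already been discretized, that relevance (A) is the mutual information between a feature and the class label, and that redundancy (B) is the mean mutual information with the features already selected. The function names and the toy data are hypothetical and are not taken from the mRMR program used in this study.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def mrmr_rank(features, labels, top_k):
    """Greedy ranking by A - B (relevance minus mean redundancy).

    features: dict mapping feature name -> list of discretized values
    labels:   list of class labels, same length as each feature
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < top_k:
        def score(f):
            if not selected:
                return relevance[f]  # first pick: maximum relevance only
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example (assumed data): g2 duplicates g1, g3 is weaker but not redundant.
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [0, 0, 1, 0, 1, 1, 0, 1],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
ranking = mrmr_rank(features, labels, top_k=3)
print(ranking)  # ['g1', 'g3', 'g2']
```

In this toy case a MaxRel-style ranking would place g2 second because it is as relevant as g1, whereas the A − B criterion demotes it because it adds no new information.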

### PPI Network From STRING

The STRING database (version 10.0)<sup>2</sup> (Franceschini et al., 2013) is a database of known and predicted interactions between proteins. The interactions include both direct and indirect relationships between proteins. The interacting proteins can be mapped to a weighted network in STRING. In such a network, proteins are denoted as nodes, and the interaction between two proteins is given as an edge marked with a confidence score. The higher the confidence score, the more analogous the functions of the two proteins may be (Kourmpetis et al., 2010; Ng et al., 2010; Szklarczyk et al., 2011). In this study, we used a distance value d instead of the confidence score s as the weight of each interaction edge, calculated as d = 1,000 − s. The d value can therefore be considered to represent the distance between two proteins; a smaller distance indicates that the protein pair has a higher interaction confidence score.

<sup>1</sup>http://www.ncbi.nlm.nih.gov/geo/ <sup>2</sup>http://string.embl.de/

Frontiers in Genetics | www.frontiersin.org

In this study, the human PPI data in the STRING database were used as the data source, comprising 8,548,002 interaction pairs. The species identifier for human is 9606.

### Shortest Path Tracing

Interactions between every protein pair were analyzed in a graph. The R package "STRINGdb" was used to map the top 248 genes selected by mRMR to their corresponding protein IDs. The Dijkstra algorithm, as implemented in the R package "igraph" (Csardi and Nepusz, 2006), was used to find the shortest path in the graph G between two given proteins. The betweenness of a shortest path protein is the number of shortest paths that pass through the protein. The shortest path proteins were ranked by betweenness in descending order, and the proteins whose betweenness was greater than 3,000 were selected; their corresponding genes were treated as breast cancer-related genes. To ensure the validity and precision of our results, a permutation test was performed: we randomly chose 248 proteins in the PPI network for shortest path tracing and repeated the procedure 100 times. We then removed 5 genes that appeared more frequently in the randomized results.
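The tracing procedure can be sketched in Python as follows. This is an illustration only, not the R/igraph code used in the study: the graph is a toy example, one shortest path is traced per seed pair, only interior nodes of a path are counted, and the randomization summary (the fraction of random seed sets in which a protein exceeds a threshold) is an assumed reading of the permutation test. All function names are hypothetical.

```python
import heapq
import random
from collections import Counter
from itertools import combinations

def build_graph(scored_edges):
    """Turn (protein_a, protein_b, confidence s) into distances d = 1000 - s."""
    graph = {}
    for a, b, s in scored_edges:
        d = 1000 - s
        graph.setdefault(a, {})[b] = d
        graph.setdefault(b, {})[a] = d
    return graph

def dijkstra_path(graph, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    dist, prev, done = {source: 0}, {}, set()
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, w in graph.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def path_betweenness(graph, seeds):
    """Count how many seed-to-seed shortest paths pass through each protein."""
    counts = Counter()
    for a, b in combinations(sorted(seeds), 2):
        path = dijkstra_path(graph, a, b)
        if path:
            counts.update(path[1:-1])  # interior nodes only (an assumption)
    return counts

def random_hit_frequency(graph, n_seeds, threshold, repeats, rng):
    """Fraction of random seed sets in which a protein's count exceeds threshold."""
    hits, nodes = Counter(), sorted(graph)
    for _ in range(repeats):
        counts = path_betweenness(graph, rng.sample(nodes, n_seeds))
        hits.update(p for p, c in counts.items() if c > threshold)
    return {p: h / repeats for p, h in hits.items()}

# Toy network (assumed scores): H is a high-confidence hub between A, B and C.
edges = [("A", "H", 900), ("B", "H", 900), ("C", "H", 900),
         ("A", "B", 200), ("C", "X", 400), ("X", "B", 400)]
graph = build_graph(edges)
counts = path_betweenness(graph, {"A", "B", "C"})
freq = random_hit_frequency(graph, n_seeds=3, threshold=0, repeats=100,
                            rng=random.Random(0))
print(counts["H"])  # 3
```

Proteins that exceed the threshold in many random seed sets are hubs of the network in general rather than specific to the selected genes, which is the motivation for filtering them out.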

### The C-SVC Algorithm

The support vector machine (SVM) method largely overcomes the curse of dimensionality and the local minima problems of traditional machine learning, and it performs well on small samples. Its advantages in non-linear and high-dimensional pattern recognition have received increasing attention in biomedicine and bioinformatics. In health care, for example, Zhang et al. (2013) applied an improved SVM algorithm to the diagnosis of breast cancer, and Azar et al. (2014) proposed a new feature dimension reduction method for lymphatic diseases. Auxiliary diagnosis has achieved a certain improvement in diagnostic efficiency (Yuan et al., 2010; Mokeddem et al., 2013).

The Cost Support Vector Classification (C-SVC) is an SVM classification method that introduces a penalty parameter C. Given training vectors x<sub>i</sub> with labels y<sub>i</sub>, i = 1, ..., n, it solves the primal problem

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^n \xi_i \tag{1}$$

$$\text{subject to } y_i \left( \mathbf{w}^T \varphi(\mathbf{x}_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ i = 1, \ldots, n.$$

Its dual is:

$$\min_{\boldsymbol{\alpha}} \ \frac{1}{2} \boldsymbol{\alpha}^T Q \boldsymbol{\alpha} - \mathbf{e}^T \boldsymbol{\alpha} \tag{2}$$

$$\text{subject to } \mathbf{y}^T \boldsymbol{\alpha} = 0, \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, n,$$

where e is the vector of all ones, C > 0 is the upper bound, Q is an n by n positive semidefinite matrix with Q<sub>ij</sub> ≡ y<sub>i</sub>y<sub>j</sub>K(x<sub>i</sub>, x<sub>j</sub>), and K(x<sub>i</sub>, x<sub>j</sub>) = φ(x<sub>i</sub>)<sup>T</sup>φ(x<sub>j</sub>) is the kernel. Here, training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function φ. The decision function for a sample x is:

$$\text{sgn}\left(\sum_{i=1}^{n} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + \rho\right) \tag{3}$$

where ρ is the intercept term (corresponding to b in Equation 1).

The C-SVC is capable of categorizing two types of breast tissue (Jiang and Yao, 2016).
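As a worked illustration of the decision function in Equation (3), the following Python sketch evaluates it with the RBF kernel K(x<sub>i</sub>, x) = exp(−γ‖x<sub>i</sub> − x‖²). The support vectors, multipliers α, intercept ρ, and γ below are hand-picked toy values, not the output of a trained model, and the function names are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_value(x, support_vectors, labels, alphas, rho, gamma):
    """Sum_i y_i * alpha_i * K(x_i, x) + rho, the argument of sgn in Equation (3)."""
    return sum(
        y * a * rbf_kernel(sv, x, gamma)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + rho

def predict(x, support_vectors, labels, alphas, rho, gamma):
    """Class label in {-1, +1} given by the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, rho, gamma)
    return 1 if value >= 0 else -1

# Toy model (assumed values): one support vector per class, so y^T alpha = 0.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
sv_labels = [-1, 1]
alphas = [1.0, 1.0]
rho, gamma = 0.0, 0.5

near_positive = predict([1.8, 1.9], support_vectors, sv_labels, alphas, rho, gamma)
near_negative = predict([0.1, 0.2], support_vectors, sv_labels, alphas, rho, gamma)
print(near_positive, near_negative)  # 1 -1
```

Only samples with non-zero α<sub>i</sub> (the support vectors) contribute to the sum, which is why the prediction can be computed from a subset of the training data.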

### Data Preprocessing for the Prediction Model

To test the accuracy of the C-SVC-based high-risk breast cancer prediction model, we divided the samples into two groups: one comprising normal tissue and breast cancer tissue, and the other comprising Stage II and Stage III breast cancers.

The data were scaled according to Equation (4):

$$y' = \text{lower} + (\text{upper} - \text{lower}) \times \frac{y - \text{min}}{\text{max} - \text{min}} \tag{4}$$

where y is the data before scaling and y′ is the scaled data; lower and upper are the lower and upper bounds of the scaled data specified in the parameters; and min and max are the minimum and maximum values of all training data.
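A minimal Python sketch of Equation (4) is given below. It assumes that the minimum and maximum are taken from the training data of a single feature and then reused for test data; the bounds passed in the example are placeholders, and the function name is hypothetical.

```python
def fit_scaler(train_values, lower, upper):
    """Return a function applying Equation (4) with min/max from training data."""
    lo, hi = min(train_values), max(train_values)

    def scale(y):
        if hi == lo:  # constant feature: map everything to the lower bound
            return lower
        return lower + (upper - lower) * (y - lo) / (hi - lo)

    return scale

# Toy example (assumed values): scale one feature to the range [0, 1].
scale = fit_scaler([2.0, 4.0, 6.0], lower=0.0, upper=1.0)
scaled_train = [scale(v) for v in [2.0, 4.0, 6.0]]
scaled_test = scale(8.0)  # test values outside the training range exceed the bounds
print(scaled_train, scaled_test)  # [0.0, 0.5, 1.0] 1.5
```

Because min and max come from the training data only, a test value outside the training range is mapped outside [lower, upper], as the last line shows.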

The preprocessing of the data has a great influence on the final classification accuracy. This paper will compare the different preprocessing methods and finally choose the method with high classification accuracy to establish the model.

### Parameter Optimization

The choice of a kernel function is important. For a specific problem, several kernel functions should be tried in order to choose the one that gives the highest accuracy (Deng et al., 2016). Both the type of kernel function and other parameters, such as the penalty parameter C and the kernel parameter γ, affect the performance. Thus, we used the grid search method to select appropriate parameters.

#### RESULTS

#### The Top 54 Genes on PPI Shortest Paths

After removing the five randomized genes from the intersection of the shortest path results for normal breast and TNBC tissues, a total of 54 genes associated with TNBC were obtained, as shown in **Table 1**. Similarly, we mapped the PPI networks of these 54 genes using the STRINGdb package in R, as shown in **Figure 2**.

#### Function Gene Enrichment Analysis

In this study, we converted the disease-related genes into their corresponding Entrez IDs using "org.Hs.eg.db" in R. Then, we analyzed the functional enrichment of the 54 candidate genes in KEGG pathways and GO terms using the R package "clusterProfiler." The GO enrichment analysis includes three categories: cellular component (CC), molecular function (MF), and biological process (BP). In our study, we focus only on BP enrichment due to its importance. The terms were ranked by enrichment p-value. The Benjamini multiple testing correction method was used to correct the enrichment p-values and to control the false discovery rate at a given level (e.g., ≤0.01) (Benjamini and Yekutieli, 2001). The results of the GO enrichment analysis, ranked by p-value, are provided in **Table 2**, and the results of the KEGG enrichment analysis, ranked by p-value, are provided in **Table 3**. The top 10 terms of the enrichment results are depicted in **Figures 3**, **4**.
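The two steps described above, an enrichment p-value per term followed by a multiple testing correction, can be sketched in Python. The sketch assumes a one-sided hypergeometric test, as is common in over-representation analysis, and implements the Benjamini-Hochberg step-up adjustment with an optional Benjamini-Yekutieli factor; which variant was actually applied in this study is not stated, so both are shown. The function names and numbers are illustrative only.

```python
import math

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    total = math.comb(N, n)
    return sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    ) / total

def adjust_pvalues(pvals, yekutieli=False):
    """Benjamini-Hochberg adjusted p-values; yekutieli=True adds the
    harmonic-sum factor of the Benjamini-Yekutieli procedure."""
    m = len(pvals)
    factor = sum(1.0 / i for i in range(1, m + 1)) if yekutieli else 1.0
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of pvals[i] in ascending order
        running_min = min(running_min, pvals[i] * m * factor / rank)
        adjusted[i] = running_min
    return adjusted

# Toy example (assumed numbers): 10 background genes, 4 annotated,
# 3 candidate genes of which 2 carry the annotation.
p_term = hypergeom_enrichment_p(N=10, K=4, n=3, k=2)
bh = adjust_pvalues([0.01, 0.04, 0.03, 0.005])
by = adjust_pvalues([0.01, 0.04, 0.03, 0.005], yekutieli=True)
print(round(p_term, 4), [round(v, 4) for v in bh])  # 0.3333 [0.02, 0.04, 0.04, 0.02]
```

The Benjamini-Yekutieli variant is never less conservative than Benjamini-Hochberg, so its adjusted values are at least as large.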

#### High-Risk Breast Cancer Prediction

In this study, we implemented the C-SVC algorithm in the Matlab 2015a environment. The radial basis function (RBF) kernel was employed since it has been widely used in various bioinformatics prediction problems and usually yields the best results compared with other types of kernel functions (Li et al., 2011; Song et al., 2011; Khan et al., 2016). We also tried other kernel functions on the same prediction task and found that the RBF kernel performed best (data not shown). The grid search method was used, and the results were verified by ten-fold cross-validation: the data were divided into 10 sets of similar size, 9 of which were used in turn as the training set, while the remaining set was used as the test set to count the correct and incorrect predictions. As a result, the prediction accuracy for normal tissue versus TNBC tissue reached 95.394%, and that for Stage II versus Stage III TNBC reached 86.598%, as shown in **Table 4**. This indicates that, based on the 54 genes as features, the C-SVC algorithm can accurately distinguish normal tissue from TNBC, as well as the stages of TNBC.
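The combination of grid search and ten-fold cross-validation described above can be sketched generically in Python. To keep the example self-contained, a simple RBF-weighted voting rule stands in for the classifier; it is not C-SVC and has no penalty parameter C, so the grid here covers γ only, whereas a real run would also include C. All function names, the grid values, and the synthetic data are assumptions made for illustration.

```python
import itertools
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and split them into k folds of similar size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit_predict, X, y, k, **params):
    """Each fold is held out once; the other k - 1 folds form the training set."""
    correct = 0
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in fold], **params)
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)

def grid_search(fit_predict, X, y, param_grid, k=10):
    """Return (best accuracy, best parameters) over all grid combinations."""
    names = sorted(param_grid)
    best = None
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        acc = cross_val_accuracy(fit_predict, X, y, k, **params)
        if best is None or acc > best[0]:
            best = (acc, params)
    return best

def rbf_vote(train_X, train_y, test_X, gamma):
    """Stand-in classifier (NOT C-SVC): RBF-weighted vote of training labels."""
    preds = []
    for x in test_X:
        score = sum(
            lab * math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, t)))
            for t, lab in zip(train_X, train_y)
        )
        preds.append(1 if score >= 0 else -1)
    return preds

# Synthetic two-class data (assumed): two well-separated Gaussian clusters.
rng = random.Random(1)
X = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)] + \
    [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(20)]
y = [-1] * 20 + [1] * 20
best_acc, best_params = grid_search(rbf_vote, X, y, {"gamma": [0.01, 0.1, 1.0]}, k=10)
print(best_acc, best_params)
```

The `fit_predict` argument is the only place that depends on the classifier, so a C-SVC implementation with parameters C and γ could be substituted without changing the search or the cross-validation loop.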



TABLE 2 | Results of the GO enrichment analysis.


TABLE 3 | Results of the KEGG enrichment analysis.


#### DISCUSSION

#### Genes Identified From PPI Shortest Paths

As can be seen from **Table 1**, some genes are associated with TNBC, such as FGFR1, EGFR, NOTCH1, ERBB2, AR, and so on.

Among these genes, CBL, FGFR1, RHOA, EP300, RAC1, CDH1, EGFR, NOTCH1, ERBB2, HIF1A, HDAC1, CCNB1, SRC, ITGB1, NFKB1, CREBBP, PCNA, STAT, and AR are reported to be related to TNBC.

#### The Migration and Invasion

We found that specific genes such as CBL, RHOA, EP300, RAC1, CDK1, and CDH1 are involved in the migration and invasion of breast cancer.

CBL is a proto-oncogene that has been associated with the development of leukemia. It has been found to be mutated or translocated in many cancers (Choi et al., 2003), such as acute myeloid leukemia. CBL encodes a protein that is one of the enzymes required for targeting substrates for degradation by the proteasome. So far, some studies have suggested that CBL is associated with breast cancer or TNBC. Kales et al. (2014) reported that low expression of Cbl-c is associated with breast tumors. This gene has also been shown to be involved in cancer invasion: Cristóbal et al. (2014) showed that a diminished regulatory capacity of Cbl-c is a recurrent event that may play a role in the invasive nature of colorectal cancer cells. From these studies, it can be speculated that CBL is associated with the invasiveness of TNBC.

RHOA encodes a small GTPase of the Rho family. Overexpression of this gene is related to tumor cell proliferation and metastasis. The RhoA pathway has been shown to mediate MMP-2- and MMP-9-independent invasion in TNBC cell lines (Fagan-Solis et al., 2013). RHOA is a target of miR-146a, which prevents cell invasion and metastasis in breast cancer (Liu et al., 2016). Lee et al. (2015) showed that ODAM expression maintains breast cancer cell adhesion and thus prevents breast cancer cell metastasis by modulating RhoA signaling. Kwon et al. (2013) showed that SMURF1 acts in EGF-induced migration and invasion of breast cancer cells. In conclusion, RHOA is involved in the invasion of TNBC cells.

EP300 (histone acetyltransferase p300) encodes the adenovirus E1A-associated cellular p300 transcriptional coactivator. Cho et al. (2015) showed that p300 and MRTF-A synergistically enhance the expression of migration-associated genes in breast cancer cells. In addition, it has been reported that the EP300-G211S mutation correlates with a low mutation load in TNBC patients (Bemanian et al., 2017). Therefore, EP300 is directly related to TNBC.

The RAC1 gene encodes a GTPase belonging to the RAS superfamily of small GTP-binding proteins. RASAL2 was found to activate RAC1 to promote TNBC (Feng et al., 2014). De et al. (2017) showed that the caspase-β-catenin-RAC1 cascade suggests a link between RAC1 and integrin-related metastasis in TNBC, and they also observed that two different mTORC2-dependent signaling pathways converge on RAC1 to drive breast cancer metastasis. Therefore, RAC1 may play an important clinical role in the treatment of TNBC.

CDH1 encodes E-cadherin, a calcium-dependent cell adhesion protein belonging to the cadherin family. It is involved in tumor proliferation, invasion, and metastasis. Therefore, defects in the function of this gene are expected to promote the occurrence and development of cancer. It has been shown that 1α,25-dihydroxyvitamin D3 induces E-cadherin expression in TNBC cells through demethylation of the CDH1 promoter (Lopes et al., 2012).

#### Posttranscriptional Regulation of Gene Expression

We found that FGFR1, MAGOH, RPS3, and CDK1 are all involved in posttranscriptional regulation of gene expression.

FGFR1 encodes fibroblast growth factor receptor 1, a receptor for fibroblast growth factors (FGFs). Cheng et al. (2015) suggested that the upregulated FGFR1 expression in TNBC cells may serve as a potential therapeutic target. Vinayak et al. (2013) reported that FGF pathways have been implicated in breast tumorigenesis and are a potential target for TNBC. In addition, FGFR1 has been linked to breast cancer more broadly: it was found to be associated with luminal A breast cancer (Zou et al., 2016a), and FGFR is also helpful in the targeted therapy of breast cancer (Ye et al., 2014). Amplification of FGFR1 also occurs in almost 10% of ER-positive breast cancers, particularly the luminal B subtype. In summary, FGFR1 and TNBC are closely related.

MAGOH ranked first, indicating that it plays an important role in TNBC. The protein encoded by this gene is a core component of the exon junction complex. There is some evidence that it is associated with TNBC, and this gene could be a potential TNBC-specific gene.

The RPS3 gene encodes ribosomal protein S3, a component of the 40S ribosomal subunit. Kim et al. (2013) showed that the rpS3 protein is a marker of malignancy. It has mainly been reported in association with lung cancer; Slizhikova et al. (2005) showed that this gene is a marker of human squamous cell lung cancer.

CDK1 encodes a Ser/Thr kinase that regulates cell cycle progression. Xia et al. (2014) showed that the CDK1 inhibitor RO3306 potentiates the response of BRCA-negative breast cancer cells to PARP inhibitors. CDK1 inhibition may therefore have a role in the adjuvant treatment of TNBC.

Additionally, some other genes have also been reported to have a direct relationship with TNBC. In summary, most of the specific genes found in this study have been reported to be associated with TNBC, while the others have rarely been reported to have a direct relationship with TNBC, suggesting that they could be new specific genes and potential new biomarkers for breast cancer prevention and treatment.

#### Candidate Gene Enrichment Analysis

We used the 'clusterProfiler' package in R for the enrichment analysis of the 54 candidate genes. A p-value was calculated for each KEGG pathway and GO term, and the terms and pathways were ranked by p-value in ascending order.


TABLE 4 | The performance of the high-risk breast cancer classification model.


The experimental results show high recall and precision: the recall reaches 100%, and the precision and F-measure are both above 80%.

In this study, we focused only on BP. The top 10 terms ranked by p-value are shown in **Figure 3**.

As shown in **Figure 3**, "cellular response to organic cyclic compound" (GO:0071407) was ranked first. This term covers any process that results in a change in the state or activity of a cell (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of an organic cyclic compound stimulus. Its top rank suggests the importance of this BP in TNBC. Both "response to oxidative stress" (GO:0006979) and "response to reactive oxygen species" (GO:0000302) are related to reactions to oxygen. "Rhythmic process" (GO:0048511), "cellular response to lipid" (GO:0071396), "heart development" (GO:0007507), and "gland development" (GO:0048732) are also associated with TNBC. In addition, "response to mechanical stimulus" (GO:0009612) and "response to radiation" (GO:0009314) are associated with TNBC, as is the "Fc-epsilon receptor signaling pathway" (GO:0038095). These enriched terms may provide some new ideas for TNBC research.

The top 10 terms of KEGG enrichment ranked by p-value are depicted in **Figure 4**. "Pathways in cancer" (hsa05200) is ranked at the top, demonstrating its importance in TNBC.

In addition, "Kaposi's sarcoma-associated herpesvirus infection" (hsa05167), "hepatitis B" (hsa05161), "herpes simplex infection" (hsa05168), and "viral carcinogenesis" (hsa05203) are associated with viral infection. Moreover, "adherens junction" (hsa04520), "prostate cancer" (hsa05215), "cAMP signaling pathway" (hsa04024), "proteoglycans in cancer" (hsa05205), and "endocrine resistance" (hsa01522) are also associated with the occurrence and development of TNBC. Huo et al. (2012) suggested that the association between breast cancer and viral infection was statistically significant. From the enrichment analysis above, it can be concluded that TNBC may be related to viral carcinogenesis.

#### Advantages of the Method and Extension

It is anticipated that our model may become a useful tool for studying cancers from the perspective of genes and networks. Analysis of the results suggests that the specific genes, the biological functions of the significant genes, and the enriched pathways could contribute to cancer diagnosis and prediction. Furthermore, the current model can also be applied to many other disease prediction problems; we have reported similar applications in our previous studies, such as for Ebola (Cao et al., 2017) and for A/H7N9 (Zhang et al., 2014). These studies show promising results and support the efficiency of the proposed methods. However, the method is limited for diseases that lack significant genes, which may bias the prediction results. Insufficient samples will also affect the results. Moreover, genes identified by computational methods should be verified by further experimental studies.

Overall, our results may shed some light on the mechanism of breast cancer tumorigenesis, providing new references for research into the disease and for the development of new clinical therapeutic strategies, as well as candidates for future experimental validation.

#### CONCLUSION


In this study, we developed a novel method to identify TNBC-related genes. This method integrated breast cancer gene expression data and PPI data. Many of the identified genes were reported to be related to TNBC in the literature. Most of these genes are related to invasion and metastasis. GO enrichment analysis indicated that the cellular response to organic cyclic compounds has an influence on breast cancer. KEGG pathway analysis indicated that most of these 54 genes may be related to viral carcinogenesis. We believe that these findings will provide some insights for breast cancer therapy and drug development.

We also developed a new SVM method based on the C-SVC for predicting high-risk breast cancer. The prediction accuracy of normal tissues and TNBC tissues reached 95.394%, and the predictions of Stage II and Stage III TNBC reached 86.598%.

Our method could be helpful for identifying novel cancer-related genes and assisting doctors in medical diagnosis.



Identification of TNBC genes and a novel high-risk breast cancer prediction model development based on PPI data and SVM method may have certain theoretical significance and practical value in the application of cancer diagnosis. Recently, link prediction paradigms have been applied in the prediction of disease genes (Zeng et al., 2017a,b), circular RNAs (Zeng et al., 2017c), and miRNAs (Liu et al., 2016). Additionally, computational intelligence such as neural networks (Cabarle et al., 2017) can be applied in this field.

#### DATA AVAILABILITY

The datasets generated for this study can be found in NCBI GEO, GSE31519, GSE9574, GSE20194, GSE20271, GSE32646, GSE45255, and GSE15852.

#### AUTHOR CONTRIBUTIONS

NZ conceived and supervised the project. YG and ML were responsible for the design, data preprocessing, computational analyses, and drafted the manuscript with revisions provided. Y-MF, NZ, and ML participated in the design of the study and performed the computational analysis. All authors read and approved the final manuscript.









**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Li, Guo, Feng and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Inferring Bacterial Infiltration in Primary Colorectal Tumors From Host Whole Genome Sequencing Data

Man Guo<sup>1</sup>, Er Xu<sup>1</sup> and Dongmei Ai<sup>2,1</sup>\*

<sup>1</sup> School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China, <sup>2</sup> Basic Experimental of Natural Science, University of Science and Technology Beijing, Beijing, China

Colorectal cancer is the third most common cancer worldwide with abysmal survival, thus requiring novel therapy strategies. Numerous studies have frequently observed infiltrating bacteria within the primary tumor tissues derived from patients. These studies have implicated the relative abundance of these bacteria as a contributing factor in tumor progression. Infiltrating bacteria are believed to be among the major drivers of tumorigenesis, progression, and metastasis and, hence, promising targets for new treatments. However, measuring their abundance directly remains challenging. One potential approach is to use the unmapped reads of host whole genome sequencing (hWGS) data, which previous studies have considered as contaminants and discarded. Here, we developed rigorous bioinformatics and statistical procedures to identify tumor-infiltrating bacteria associated with colorectal cancer from such whole genome sequencing data. Our approach used the reads of whole genome sequencing data of colon adenocarcinoma tissues not mapped to the human reference genome, including unmapped paired-end read pairs and single-end reads, the mates of which were mapped. We assembled the unmapped read pairs, remapped all those reads to a collection of human microbiome reference genomes, and then computed the relative abundance of microbes by maximum likelihood (ML) estimation. We analyzed and compared the relative abundance and diversity of infiltrating bacteria between primary tumor tissues and associated normal blood samples. Our results showed that primary tumor tissues contained far more diverse total infiltrating bacteria than normal blood samples. The relative abundance of Bacteroides fragilis, Bacteroides dorei, and Fusobacterium nucleatum was significantly higher in primary colorectal tumors. These three bacteria were among the top ten microbes in the primary tumor tissues, yet were rarely found in normal blood samples. As a validation step, we confirmed that most of these bacteria were also closely associated with colorectal cancer in previous studies that used alternative approaches. In summary, our approach provides a new analytic technique for investigating the infiltrating bacterial community within tumor tissues. Our novel cloud-based bioinformatics and statistical pipeline for analyzing the infiltrating bacteria in colorectal tumors using the unmapped reads of whole genome sequences can be freely accessed from GitHub at https://github.com/gutmicrobes/UMIB.git.

#### Edited by:

Arun Kumar Sangaiah, VIT University, India

#### Reviewed by:

Leyi Wei, The University of Tokyo, Japan Yungang Xu, The University of Texas Health Science Center at Houston (UTHealth), United States

> \*Correspondence: Dongmei Ai aidongmei@ustb.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 19 January 2019 Accepted: 27 February 2019 Published: 15 March 2019

#### Citation:

Guo M, Xu E and Ai D (2019) Inferring Bacterial Infiltration in Primary Colorectal Tumors From Host Whole Genome Sequencing Data. Front. Genet. 10:213. doi: 10.3389/fgene.2019.00213

Keywords: unmapped reads, tumor tissue, colorectal cancer, infiltrating bacteria, maximum likelihood estimation

# INTRODUCTION

fgene-10-00213 March 13, 2019 Time: 18:14 # 2

Many microbes inhabit human tissues and bodily fluids, forming a close symbiotic relationship with the host. The types, quantities, distribution features, genomes, and pathogenic mechanisms of human microbes vary greatly (Campo-Moreno et al., 2018). Generally, the total number of microbes found in the human body (approximately 100 trillion) is about 10 times the number of human cells, and they encode about 100 times as many genes as the human genome. Those microbes play an important role in human health by regulating our digestive, immune, respiratory, and nervous systems, and their dysbiosis has been associated with various diseases (O'Hara and Shanahan, 2006), such as inflammatory bowel disease (Norman et al., 2015), Crohn's disease (Li et al., 2012), viral hepatitis (Kostic et al., 2012), and colorectal cancer (Littlejohn et al., 2016).

Using metagenomics approaches, researchers have found that colorectal tumorigenesis is mediated by toxins produced and secreted by the infiltrating bacteria that colonize the intestinal surface and trigger tissue inflammation, inducing otherwise normal cells to emit atypical signaling molecules. The whole process leads to local inflammatory reaction and the infiltration of innate immune cells, events which, in turn, accelerate tumor development (Chung et al., 2018; Dejea et al., 2018). For example, DNA damage may be induced in host cells owing to prolonged exposure to these toxins, initiating tumorigenesis (Zhu, 2013). Bacteria and their products can also facilitate viral infection in host cells, thereby inducing cancer (Lax and Thomas, 2002; Almand et al., 2017).

While direct experimental measurement of infiltrating bacteria remains challenging, the unmapped reads in host whole genome sequencing (hWGS) data derived from primary tumor tissue could allow us to study the pathogenic process involving microbes in colorectal cancer in situ and at no additional sequencing cost. In the past, unmapped reads were often overlooked; however, recent studies have shown that they contain crucial microbial information relevant to tumorigenesis (Mangul et al., 2018). Nonetheless, because microbial DNA is extremely scarce in comparison to host DNA, such research requires the development of rigorous and robust bioinformatics and statistical procedures.

Our approach was built on a growing number of studies measuring microbes in the biopsies of cancer patients via the reanalysis of reads that were not mapped to the human reference genome. Zhang et al. (2015) used MegaBlast to remap the unmapped reads of whole genome sequences of 27 gastric mucosal biopsies to microbial reference genomes, and they verified a close association between Helicobacter pylori and gastric tumors. Tang and Larsson (2017) conducted high-throughput sequencing to analyze the RNA or DNA from tumor tissues of patients with cervical adenocarcinoma and lymphoma and remapped the unmapped reads to the complete viral reference database to successfully detect known oncogenic viruses, as well as identify new viral strains in those tumors. Loohuis et al. (2018) studied 192 blood transcriptome samples of schizophrenic patients, applied MetaPhlAn to analyze the bacteria using unmapped reads, and identified Planctomycetes and Thermotogae phyla closely associated with schizophrenia.

Evidence gathered from those studies has established the rationale for reanalyzing microbes using unmapped reads as a cost-effective approach to investigate the interaction between microbes and disease progression. Accordingly, we herein report a novel cloud-based bioinformatics and statistical pipeline to analyze the infiltrating bacteria in colorectal tumors using the unmapped reads of whole genome sequences. We used SAMtools to extract the unmapped reads, PANDAseq to perform quality control and assemble the paired-end reads, the Burrows-Wheeler Aligner (BWA) to remap the reads to bacterial reference genomes, and Genome Relative Abundance using Mixture Model theory (GRAMMy) to estimate their relative abundance. By analyzing the obtained relative abundance and diversity, we identified differential infiltrating bacteria between primary tumor tissues and associated normal blood samples.

# MATERIALS AND METHODS

Our data were downloaded from The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) database, including the BAM-formatted whole genome sequencing data of 51 paired primary colon adenocarcinoma tumor and normal blood samples. Our bioinformatics pipeline was implemented on the Seven Bridges Cancer Genomics cloud platform and comprises four linked analytical components (SAMtools, PANDAseq, BWA, and GRAMMy), whose Docker images were pushed to the cloud platform. **Figure 1** shows the flowchart of our approach for the analysis of differentially abundant bacteria using whole genome sequencing data. From the BAM files of the whole genome sequence data, we extracted reads that were not mapped to the human reference genome. Those reads were then mapped to a collection of human microbiome reference genomes to estimate the relative abundance of microbes.

### Extracting Unmapped Reads

We aimed to extract all unmapped reads, including both full read pairs (both ends of a read pair were unmapped) and single-end unmapped reads (one read end was mapped, while the other end was unmapped). Our bioinformatics procedures to extract such unmapped reads were as follows:


The assembled full unmapped read pairs and the single-end unmapped read pairs were combined to obtain the complete set (FASTA files) of unmapped reads.
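The two categories of unmapped reads described above can be told apart from the FLAG field of each BAM record. The following is a minimal sketch, not the authors' exact procedure: it only decodes the standard SAM FLAG bits, and the function name and category labels are hypothetical. With SAMtools, the same selections are commonly expressed as `samtools view -f 12` (read and mate both unmapped) and `samtools view -f 4 -F 8` (read unmapped, mate mapped).

```python
# SAM FLAG bits as defined in the SAM format specification.
READ_PAIRED = 0x1     # template has multiple segments (paired-end)
READ_UNMAPPED = 0x4   # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate of this read is unmapped


def classify_read(flag: int) -> str:
    """Classify one alignment record by its FLAG (labels are illustrative)."""
    if not flag & READ_UNMAPPED:
        return "mapped"
    if flag & READ_PAIRED and flag & MATE_UNMAPPED:
        return "pair_unmapped"          # both ends of the pair are unmapped
    if flag & READ_PAIRED:
        return "single_end_unmapped"    # this end unmapped, its mate mapped
    return "unpaired_unmapped"          # unmapped read from single-end data


if __name__ == "__main__":
    for flag in (77, 141, 133, 73, 99):
        print(flag, classify_read(flag))
```

Reads labeled `pair_unmapped` would go on to paired-end assembly, whereas `single_end_unmapped` reads are kept as they are.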

### Mapping and Calculating the Relative Abundance of Microbes

We used the Burrows-Wheeler Alignment tool (BWA) to remap the complete set of unmapped reads obtained in the previous step to a collection of human microbial reference genomes. Our reference collection was downloaded from the NCBI human microbiome database: ftp://ftp.ncbi.nlm.nih.gov/genomes/HUMAN\_MICROBIOM/Bacteria. Those reference genomes were sequenced, quality controlled, and assembled by the Human Microbiome Project (HMP) consortium (Methé et al., 2012). This reference collection covers 161 bacterial genera and 519 of the most important bacterial species in the human body, comprising more than 900 strains. The reference collection was pushed up to the Seven Bridges Cancer Genomics cloud platform using the Cancer Genomics Cloud Uploader.

Next, we used GRAMMy (Xia et al., 2011), a maximum likelihood (ML) estimation tool based on mixture modeling and the expectation-maximization algorithm, to determine the relative abundance of microbes. Short reads often map ambiguously to several closely related reference sequences; GRAMMy resolves this ambiguity probabilistically to estimate the relative abundance accurately.
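To illustrate the idea behind mixture-model abundance estimation, the sketch below runs a plain expectation-maximization loop over a read-by-genome likelihood matrix. It is a simplified illustration and not the GRAMMy implementation: the function names, the toy likelihoods, and the final genome-length normalization step are assumptions made for the example.

```python
def em_mixture(likelihoods, n_iter=200, tol=1e-10):
    """Estimate mixing proportions from likelihoods[i][j] = P(read i | genome j)."""
    n_genomes = len(likelihoods[0])
    pi = [1.0 / n_genomes] * n_genomes          # start from a uniform mixture
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        for row in likelihoods:
            # E-step: posterior responsibility of each genome for this read.
            weighted = [p * l for p, l in zip(pi, row)]
            total = sum(weighted)
            if total == 0.0:
                continue                        # read matches no reference
            for j, w in enumerate(weighted):
                counts[j] += w / total
        # M-step: proportions are the normalized expected read counts.
        assigned = sum(counts)
        new_pi = [c / assigned for c in counts]
        delta = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if delta < tol:
            break
    return pi


def length_normalize(pi, genome_lengths):
    """Convert read proportions to genome relative abundance (assumed step)."""
    scaled = [p / g for p, g in zip(pi, genome_lengths)]
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    # Three reads unique to genome 0, one unique to genome 1, one ambiguous.
    reads = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 1]]
    proportions = em_mixture(reads)
    print(proportions)
    print(length_normalize(proportions, [3.0e6, 1.0e6]))
```

The ambiguous read is split between the two genomes in proportion to the current estimates, which is how the mixture model avoids assigning it arbitrarily to one reference.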

### Quality Control Post-abundance Estimation

We eliminated samples in which all bacteria except Propionibacterium sp. had extremely low relative abundance. We suspected Propionibacterium sp. to be a major contaminating species in both normal blood samples and primary colorectal tumor tissues, where its mean relative abundance was 0.9313 and 0.7142, respectively. Its relative abundance was thus, on average, higher in normal blood samples than in primary colorectal tumor tissues.

FIGURE 2 | Alpha diversity of bacteria in the normal blood samples and primary tumor tissue samples. The violin plots show the alpha diversity of infiltrating bacteria in the normal blood and primary tumor tissue samples. The green color in the plot represents the normal blood samples, and the red color in the plot represents the primary tumor tissue samples. The "∗∗∗" symbol represents P-value < 0.001. Differential analysis was performed by Student's t-test (P = 1.27E–06).

TABLE 1 | The most differentially abundant genera between tumor and normal samples (Q-value < 0.05).

Because the amount of Propionibacterium sp. in both tumor and normal samples was disproportionately large, we excluded it from all analyzed samples and renormalized the relative abundance of the other species. We also excluded 5 primary tumor tissue samples and 15 normal blood samples whose total unmapped read counts were less than five and which therefore presented extremely low relative abundance of infiltrating bacteria. Finally, our analysis was based on the remaining 46 primary tumor tissue samples and 36 normal blood samples. It is noteworthy that a recent study has shown that metabolites of Propionibacterium freudenreichii can kill colorectal cancer cells, implicating its use as a probiotic for the prevention and treatment of early colorectal cancer (Casanova et al., 2018).
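Excluding a suspected contaminant and renormalizing the remaining taxa amounts to a simple rescaling, which can be sketched as follows. The function name and the toy abundance values are hypothetical and serve only to illustrate the arithmetic.

```python
def drop_and_renormalize(abundance, excluded):
    """Remove the excluded taxa and rescale the remaining abundances to sum to 1."""
    kept = {taxon: a for taxon, a in abundance.items() if taxon not in excluded}
    total = sum(kept.values())
    if total == 0:
        # Nothing but the excluded taxa was detected in this sample.
        return {taxon: 0.0 for taxon in kept}
    return {taxon: a / total for taxon, a in kept.items()}


if __name__ == "__main__":
    sample = {
        "Propionibacterium sp.": 0.90,   # toy values, not data from the study
        "Bacteroides fragilis": 0.06,
        "Fusobacterium nucleatum": 0.04,
    }
    print(drop_and_renormalize(sample, {"Propionibacterium sp."}))
```

In the toy sample, the two remaining species end up at about 0.6 and 0.4, so their ratio to each other is preserved.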

# RESULTS AND DISCUSSION

First, we calculated the Shannon's indices of infiltrating bacteria for both normal blood samples and primary tumor tissue samples. As shown in **Figure 2**, the alpha diversity of bacterial communities indicated that the infiltrating bacteria in primary tumor tissues were significantly more diverse than those in normal blood samples. This finding was supported by previous studies, which showed that the alpha diversity of microbes in colorectal cancer biopsies was significantly higher than that in other samples, such as feces and saliva (Russo et al., 2018).
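For reference, the Shannon index of a sample with relative abundances $p_i$ is $H = -\sum_i p_i \ln p_i$. A minimal sketch of this computation is given below; the use of the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero abundance."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


if __name__ == "__main__":
    print(shannon_index([0.25, 0.25, 0.25, 0.25]))  # even community
    print(shannon_index([0.97, 0.01, 0.01, 0.01]))  # dominated community
```

An even community of four taxa reaches the maximum value of ln 4, whereas a sample dominated by one taxon scores close to zero.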

FIGURE 3 | (A) Differential analysis of bacterial abundance at the genus level in normal blood and primary tumor tissue samples. The Benjamini–Hochberg false discovery rate (FDR)-corrected non-parametric Mann–Whitney–Wilcoxon test was used to calculate the P-value and analyze the differences in bacteria. The box plots show bacteria significantly different at the genus level. The "∗" symbol represents Q-value < 0.05; the "∗∗" symbol represents Q-value < 0.01; and the "∗∗∗" symbol represents Q-value < 0.001. (B) Differential analysis of bacterial abundance at the species level in the normal blood and primary tumor tissue samples. To differentially analyze the diversity of bacterial species in the normal blood and primary tumor tissue samples, the Benjamini–Hochberg FDR-corrected non-parametric Mann–Whitney–Wilcoxon test was used. Letters B, F, and P in the x-axis represent Bacteroides, Fusobacterium, and Parabacteroides, respectively. (C) The stacked bar charts of the top 10 bacterial species enriched in the normal blood samples and their relative abundance in the primary tumor tissue samples. (D) The stacked bar charts of the top ten bacterial species enriched in the primary tumor tissue samples and their relative abundance in the normal blood samples.

TABLE 2 | The most differentially abundant species between tumor and normal samples (Q-value < 0.05).

Next, we identified the differential abundance of infiltrating bacteria between normal blood samples and primary tumor tissue samples. We used the wilcox.test() function in R to perform a non-parametric Mann–Whitney–Wilcoxon test, followed by the Benjamini–Hochberg procedure to control the false discovery rate (FDR) and correct the obtained P-values. We identified the most significantly different genera (Q-value < 0.05), as shown in **Table 1**, and plotted them in **Figure 3A**. Bacteroides, Clostridium, Fusobacterium, and Streptococcus were abundant in the infiltrated primary tumor tissues, but nearly absent in the normal blood samples.
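The Benjamini–Hochberg correction applied to the per-taxon P-values can be sketched as follows. This is a generic illustration of the procedure, corresponding to what `p.adjust(p, method = "BH")` computes in R; the function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (Q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending P
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone Q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    p = [0.01, 0.04, 0.03, 0.005]   # toy P-values, one per taxon
    q = benjamini_hochberg(p)
    significant = [i for i, value in enumerate(q) if value < 0.05]
    print(q, significant)
```

Taxa whose Q-value falls below 0.05 would then be reported as differentially abundant, as in Tables 1 and 2.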

These findings were widely supported by previous literature. For instance, Flemer et al. (2017) showed that Bacteroides spp. in the mucosal microbiota of patients with colorectal cancer were more abundant compared to the normal control group. Fusobacterium recruits tumor-infiltrating immune cells to generate a pro-inflammatory microenvironment and promote tumorigenesis by triggering inflammation (McCoy et al., 2013). Colitis bacteria can alter host physiology to promote cancer. They disrupt the balance of intestinal microflora and introduce virulent genes that have been shown to promote tumor formation in mice (Walsh et al., 2014). In addition, many other species of Clostridium and Streptococcus, such as Clostridium difficile (Zheng et al., 2017), Streptococcus gallolyticus (Andres-Franch et al., 2017), and Streptococcus infantarius (Kaindi et al., 2018), were reported to be associated with colorectal cancer.

Whole genome sequence data allowed us to precisely identify the most abundant species. We identified such species and plotted the relative abundance of the top 10 most abundant species in stacked bar charts as shown in **Figures 3C,D**. As we can see, Escherichia coli, Ralstonia spp., and Bacteroides spp. were abundant among all the primary tumor tissue samples and normal blood samples. Among these, Ralstonia was a common contaminant when DNA samples were screened (Salter et al., 2014), and its relative abundance may be a result of contamination. Both E. coli and Bacteroides spp. have important functional roles and are commonly found in the human body (Wexler, 2007).

In addition, we identified the most differentially abundant species between tumor and normal samples (Q-value < 0.05), as shown in **Table 2**, which included B. fragilis, F. nucleatum, Parabacteroides merdae, B. dorei, B. vulgatus, B. stercoris, B. finegoldii, B. uniformis, and B. ovatus (**Figure 3B**). Of these, B. fragilis, B. dorei, and F. nucleatum were also among the top 10 most abundant species in the primary tumor tissue samples in this study, but they were much less abundant in the normal blood samples.

A subsequent literature search has validated these species as microbial markers of colorectal cancer. For instance, enterotoxigenic B. fragilis (ETBF) secretes B. fragilis toxins (BFT) that induce immune cells to produce interleukin-17 (Wu et al., 2009). This cytokine acts on intestinal mucosal cells to initiate the participation of more immune cells in the inflammatory response, thereby leading to the development of inflammation-related colorectal cancer (Kwong et al., 2018; Tilg et al., 2018). F. nucleatum adheres to and invades colonic epithelial cells, inducing tumor growth in patients with colorectal cancer (Bullman et al., 2017; Shang and Liu, 2018). In addition, F. nucleatum is often present in the human oral cavity, where it causes periodontitis, and it is reported to be a risk factor for colorectal cancer (Barton, 2017). Other identified bacterial species, such as P. merdae, B. dorei, and B. vulgatus (Cipe et al., 2015), are positively correlated with red meat intake and negatively correlated with the intake of fruits and vegetables (Feng et al., 2015). Red meat is widely recognized as a dietary factor linked to the development of colorectal cancer (Brenner et al., 2014). B. finegoldii and B. dorei can cause bacteremia (Lee et al., 2015), along with B. stercoris (Lucas et al., 2017; Alomair et al., 2018), B. uniformis, and B. ovatus (Liang et al., 2014), and they were all reported to be correlated with colorectal cancer.

In **Figure 4**, we plotted the overall heat map of 43 bacterial species with significant differences. We used the R heatmap.2 function to draw the figure. The left side of the heat map shows the clustering of samples based on Spearman's correlation coefficients between their bacterial relative abundance profiles. The figure clearly shows that the infiltrating bacteria of the primary tumor tissue samples differed from those of the normal blood samples. The diversity of bacteria visible in the primary tumor tissue samples was markedly higher than that in the normal blood samples. This result is consistent with the findings from the differential analysis of alpha diversity. Interestingly, these 43 bacterial species were only rarely present in most of the normal blood samples. In addition, the heatmap-based clustering analysis showed that the primary tumor tissues of colorectal cancer patients and the normal blood samples clustered perfectly by sample type, revealing their distinct community structures.
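Spearman's correlation between two samples is the Pearson correlation of their rank-transformed abundance profiles. The sketch below shows this computation in plain Python, with tied values receiving average ranks. It is an illustration of the coefficient only; how heatmap.2 was configured to use it is not specified in the text, and the example profiles are made up.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation between two abundance profiles."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    tumor_a = [0.40, 0.30, 0.20, 0.10]    # toy profiles over the same taxa
    tumor_b = [0.55, 0.25, 0.15, 0.05]
    blood = [0.05, 0.10, 0.25, 0.60]
    print(spearman(tumor_a, tumor_b), spearman(tumor_a, blood))
```

A distance such as one minus this coefficient can then be passed to a hierarchical clustering routine to group samples with similar community structure.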

# CONCLUSION

Cloud computing has recently been adopted in bioinformatics research (Zou et al., 2013; Guo et al., 2018). In this study, we developed a cloud-based bioinformatics pipeline to analyze unmapped reads from whole genome sequencing of human tumor tissues. The reads in the whole genome sequencing data not mapped to the human reference genome were extracted by SAMtools, followed by PANDAseq to assemble overlapping reads, BWA to remap them to the bacterial genome reference database, and GRAMMy to estimate relative abundance.

This pipeline was successfully applied to analyze the infiltrating bacteria of 51 pairs of primary colorectal tumor tissue and normal blood samples. Group-based differential diversity and relative abundance analyses were used to identify microbial markers of colorectal tumors. Our results showed that the infiltrating bacteria in primary tumor tissues were significantly more diverse than those observed in the normal blood samples. The relative abundance of bacteria such as B. fragilis, B. dorei, and F. nucleatum was significantly higher in primary tumor tissues than in normal blood samples. These bacteria are likely pathogenic microbial markers for colorectal cancer. A literature search supported our findings and revealed that these bacteria may induce tumor growth by adhering to and infecting the intestinal epithelial cells and secreting toxins.

# DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here: ftp://ftp.ncbi.nlm.nih.gov/genomes/HUMAN\_MICROBIOM/Bacteria.

# AUTHOR CONTRIBUTIONS

MG and DA conceived and designed the study and wrote the manuscript. MG and EX collected the datasets and created the workflow. MG and DA revised the manuscript. All authors read and approved the final manuscript.

# FUNDING

This work was supported by grants from the National Natural Science Foundation of China (61873027 and 61370131).

# REFERENCES







**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Guo, Xu and Ai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multi-Level Comparative Framework Based on Gene Pair-Wise Expression Across Three Insulin Target Tissues for Type 2 Diabetes

Shaoyan Sun<sup>1</sup>\*, Fengnan Sun<sup>2</sup> and Yong Wang<sup>3,4</sup>\*

<sup>1</sup> School of Mathematics and Statistics, Ludong University, Yantai, China, <sup>2</sup> Clinical Laboratory, Yantaishan Hospital, Yantai, China, <sup>3</sup> CEMS, NCMIS, MDIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China, <sup>4</sup> Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China

#### Edited by:

Quan Zou, University of Electronic Science and Technology of China, China

#### Reviewed by:

Tun-Wen Pai, National Taipei University of Technology, Taiwan Zhen Tian, Zhengzhou University, China

#### \*Correspondence:

Shaoyan Sun sunsy\_2014@ldu.edu.cn Yong Wang ywang@amss.ac.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 24 January 2019 Accepted: 06 March 2019 Published: 26 March 2019

#### Citation:

Sun S, Sun F and Wang Y (2019) Multi-Level Comparative Framework Based on Gene Pair-Wise Expression Across Three Insulin Target Tissues for Type 2 Diabetes. Front. Genet. 10:252. doi: 10.3389/fgene.2019.00252

Type 2 diabetes (T2D) is a disease caused by gene alterations and characterized by insulin resistance; the insulin-responsive tissues are therefore of great interest for T2D study. It is of great relevance to systematically investigate the commonalities and specificities of T2D among those tissues. Here we establish a multi-level comparative framework across three insulin target tissues (white adipose, skeletal muscle, and liver) to provide a better understanding of T2D. Starting from the ranks of gene expression, we constructed the 'disease network' by detecting diverse interactions to provide a good characterization of disease-affected tissues. Then, we applied the random walk with restart algorithm to the disease network to prioritize its nodes and edges according to their association with T2D. Finally, we identified a merged core module by combining the clustering coefficient and the Jaccard index, which can provide a detailed and visual illustration of the common and specific features of different tissues at the network level. Taken together, our network-, gene-, and module-level characterization across different tissues of T2D holds the promise of providing a broader and deeper understanding of the T2D mechanism.

Keywords: type 2 diabetes, gene pairwise expression, dysfunctional interactions, multi-level analysis, random walk with restart

# INTRODUCTION

Type 2 diabetes (T2D) is one of the leading complex diseases. It is most commonly seen in older adults, but it is increasingly seen in children, adolescents, and younger adults due to rising levels of obesity, physical inactivity, and poor diet (International Diabetes Federation [IDF], 2017). T2D is mainly a glucose metabolism disorder, and is currently believed to be a heterogeneous disease. One important fact is that the development of T2D involves multiple tissues (Zhong et al., 2010; Camastra et al., 2011; Petersen et al., 2012; Tang et al., 2018). Those tissues include white adipose tissue, skeletal muscle, and liver (hereafter referred to as adipose, muscle, and liver). Each tissue has its own characteristics induced by T2D. One important and challenging problem is to explore the cross talk among multiple tissues, since T2D is a systemic disease involving complicated synergy and regulation among different tissues. Exploring its underlying mechanisms across multiple tissues will be helpful for personalized therapy and precision medicine in T2D treatment. However, most existing studies focused on a single tissue only (Li et al., 2015; Lee and Kim, 2016; Alghamdi et al., 2017).

On the other hand, it has been accepted that although molecules are basic components of cellular machinery, a complex disease is generally caused not by the malfunction of individual molecules but by the interplay of a group of correlated molecules or a network. Thus, with the development of bioinformatics and high-throughput data, network-based characterization of complex diseases, including T2D, has been invaluable for integrating and interpreting functional genomics datasets and for identifying new biomarkers or modules to better classify patients into subtypes. Such approaches are much more powerful than approaches that examine a single gene at a time (Barabási et al., 2011; Hofree et al., 2013; Zhang et al., 2014; Liu et al., 2017; Zhang and Zhang, 2017; Hwang et al., 2018; Zou et al., 2018). However, most methods tended to characterize the complex programs using sets of genes, while the interaction information (described as edges in a biological network) was not fully utilized in the final prediction or analysis. Therefore, such methods cannot inform us about the functional discrepancies between gene pairs that are perturbed under disease.

Here, we propose a multi-level comparative framework that focuses mostly on interactions across different representative tissues to gain a broader and deeper understanding of T2D. The whole framework can be divided into two phases, each comprising several major steps (**Figure 1**). The first phase is disease network construction, which aims to effectively characterize the abnormal information arising in response to disease. In recent years, many efforts have been made to extract disease information by selecting co-expressed gene pairs whose expression was highly correlated across samples. These methods rest on the hypothesis that genes associated with the same disorder tend to share common functional features, i.e., their protein products tend to interact with each other. However, some genes show highly correlated patterns of expression in one biological state but not in another; they may not be highly correlated across the entire dataset and are therefore missed by co-expression-based methods (Zhang and Horvath, 2005; Fuente, 2010). For this reason, we proposed a method based on finding 'diverse interactions' according to the discrepancy between correlations in different phenotypes by extending our previous work (Sun et al., 2013). Beyond our previous work, we further considered the weight of the interactions, so that all the diverse interactions, together with their discrepancy coefficients, compose the weighted diverse interaction network (WDIN). This WDIN can help unravel the complexity of gene-pair regulation in the different tissues. In addition, it should be noted that when computing the adjacency matrix of the WDIN, we proposed to calculate the discrepancy coefficient based on genes' ranks instead of expression values. Such an operation weakens the bias caused by different expression levels in different tissues and experiments and consequently provides a uniform scale for all samples, independent of the dynamic range of a data profile (Le et al., 2010; Altschuler et al., 2013).

The second phase is multi-level analysis based on the inferred disease network. This phase is composed of network-, gene-, and module-level analysis. Network-level analysis gives an overall cross-tissue study of T2D based on the constructed tissue-dependent WDINs. Gene-level analysis examines the prioritized nodes of each WDIN, which provides a local and finer-grained explanation of the disease. The major step of this phase is to discover core modules and then derive a merged core module. The subsequent module-level analysis illustrates the common and specific features of different tissues in a detailed and visual way. In this part, the random walk with restart (RWR) algorithm is employed to rank genes and is further extended to prioritize gene interactions. The clustering coefficient is introduced to identify biological core modules composed of prioritized interactions.

In summary, we explore the tissue cross talk of T2D at gene-, module-, and network-level comparisons across different tissues to uncover hidden patterns and their biological implications from multi-tissues 'omic' data. Our multi-level comparative framework is shown in **Figure 1**. We expect that our proposed multi-level analysis framework can be extended beyond gene expression level and discover new commonalities and specificities among tissues.

### MATERIALS AND METHODS

#### Data Collection

#### Tissue Dependent Gene Expression Data Retrieval

We obtained gene expression profiles from three rat tissues (white adipose, skeletal muscle, and liver) [diabetic rats: Goto–Kakizaki (GK) rats; control rats: Wistar–Kyoto (WK) rats] from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus database (accession IDs: GSE13268, GSE13269, and GSE13270) (Almon et al., 2009; Nie et al., 2011; Xue et al., 2011). The profile was composed of 31,099 probes. Our first filter eliminated the probe sets without a corresponding official symbol, leaving 25,345 genes for further consideration.

#### T2D Associated Genes Collection

To provide support and verification for our work, we collected the canonically reported genes associated with T2D. These genes were gathered from the Type II diabetes mellitus pathway (KEGG, Kyoto Encyclopedia of Genes and Genomes, H00409) and will be referred to as 'T2D-pathway genes.' In total, we obtained 50 T2D-pathway genes, among which 42 genes were present in the aforementioned expression data and were used in this study. Other T2D-related genes were downloaded from the Rat Genome Database (RGD)<sup>1</sup> in October 2018. In total, 515 genes were downloaded from RGD (referred to as RGD-reported genes), and only those that were measured in GSE13268-GSE13270 were kept. Thus 303 RGD-reported genes were left.

#### Protein–Protein Interaction Network Integration

The rat protein–protein interaction (PPI) network was integrated from KEGG pathways and BioGRID (Biological General Repository for Interaction Datasets). Not all the proteins in the PPI network have corresponding gene expression values in expression profiles GSE13268-GSE13270. To carry out our cross-tissue analysis, we retained only interactions whose two nodes have expression values in all three tissues. The retained interactions composed a network with 4,081 nodes (proteins) and 24,503 edges (interactions). This network, annotated with pairwise gene expression, is referred to as the 'background PPI network' in our study.

<sup>1</sup>http://rgd.mcw.edu/wg/home

#### Data Processing

In most network biology studies, such as those aimed at identifying network biomarkers of complex diseases, researchers usually collect gene expression data from multiple experiments. The expression profiles from different experiments may represent different tissues or different experimental conditions. Directly pooling the expression profiles to construct networks would ignore the underlying structure of the data, and the pooled estimates may be severely biased due to the heterogeneity of the experiments. Instead of pooling the expression data, we first ranked the genes by their expression values within each expression array and then normalized the rank index of every gene. The normalized rank (z-score) was computed using the mean (µ) and standard deviation (σ) of the ranks of all genes in one sample, i.e., z-score = (x–µ)/σ, where x is the rank of a gene. In subsequent computation and analysis, this normalized rank was used instead of the expression value, since it reduces the noise arising from different experimental conditions and different tissues.
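The rank normalization described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `normalized_ranks`, the average-rank treatment of ties, and the use of the population standard deviation are our choices, since the text does not specify them, and the sample values are made up.

```python
from statistics import mean, pstdev

def normalized_ranks(expression):
    """Map one sample's expression values to z-scored ranks."""
    # Rank genes within the sample (1 = lowest expression).
    # Assumption: tied values share the average of their positions.
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    ranks = [0.0] * len(expression)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and expression[order[j + 1]] == expression[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    # z-score = (x - mu) / sigma over the ranks of all genes in this sample.
    # Assumption: population standard deviation; sigma is 0 if all values tie.
    mu, sigma = mean(ranks), pstdev(ranks)
    return [(r - mu) / sigma for r in ranks]

sample = [5.2, 130.0, 48.7, 0.9]  # hypothetical expression values for four genes
z = normalized_ranks(sample)
```

Because only the ordering of values within a sample matters, rescaling the whole sample (for example, a different dynamic range on another array) leaves the normalized ranks unchanged.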

**Figure 2** shows that the expression levels estimated by normalized ranks subject to normal distribution are more consistent, indicating that such ranking processing is a more valid procedure.

#### Tissue-Dependent WDINs Construction

A WDIN was constructed for each tissue that was related to the query disease—T2D. The pipeline of constructing WDIN was as follows:


After the creation of the tissue-dependent WDIN for each tissue, each interaction was weighted by the diverse correlation of the corresponding gene pair. For each interaction, the absolute value of its weight reflects the degree of deviation of this interaction between phenotypes (normal and disease), and the sign of the weight indicates whether the corresponding function is active or inactive in the disease condition compared with the normal condition, which provides more information for medical biology. Subsequently, we refer to an edge with positive weight as an 'active edge' and an edge with negative weight as an 'inactive edge.'

phenotypes. (B) Histograms of normalized gene expression in three tissues for two phenotypes. (C) Histograms of gene index based on ranking expression in three tissues for two phenotypes.

#### Random Walk With Restart (RWR) on WDIN

#### Rank Candidate Genes

fgene-10-00252 March 23, 2019 Time: 17:47 # 5

We applied RWR algorithm to our constructed tissue-dependent WDINs. The goal is to rank genes in candidate sets based on their association level with T2D. RWR is a ranking algorithm which simulates a random walk on the network to compute the proximity between two nodes by exploiting the global structure of the network. It starts on a set of seed nodes, which is the set of genes known to be associated with a phenotype p (T2D in this work). The candidate genes are then ranked by the probability of the random walker reaching this node (Lovasz, 1996; Tong et al., 2008; Ganegoda et al., 2014).

Each tissue-specific WDIN can be described mathematically as G = (V, ε, w), where V is the set of genes that form the WDIN's nodes, ε is a set of undirected interactions between these genes (or their products), uv ∈ ε represents an interaction between u ∈ V and v ∈ V, and w(u, v) indicates the weight coefficient of interaction uv ∈ ε. The set of interacting partners of a gene v ∈ V is defined as N(v) = {u ∈ V : uv ∈ ε}, and the total reliability of known interactions of v is defined as W(v) = Σ<sub>u ∈ N(v)</sub> w(u, v).

Let p<sub>0</sub> be the initial probability vector and p<sub>s</sub> be a vector in which the i-th element holds the probability of finding the random walker at node i at step s. Algorithmically, random-walk based association scores can be computed iteratively as follows:

$$p_{s+1} = (1 - \gamma) M^T p_s + \gamma (1 - \eta) p_0 \tag{1}$$

Here, η denotes the weight of the network. γ ∈ (0, 1) is a user-defined restart probability that adjusts the preference between the importance of a protein or gene with respect to the seed set and the network topology. Numerical results show that γ = 0.3 is optimal for RWR's performance (Erten, 2009; Erten et al., 2011). Thus, γ is set to 0.3 for RWR in this paper. M is the transition matrix of G; the transition probability from gene u to gene v can be described as follows,

$$M(u, v) = \begin{cases} w(u, v) / W(v), & \text{if } uv \in \varepsilon \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

The seed set is composed of T2D-pathway genes covered by the WDIN, and the candidate set contains other nodes of WDIN excluding these seed genes. The RGD-reported genes would be used as test genes to verify the performance of applying RWR to WDINs. The details will be shown in the Results section.
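As an illustration of the iteration in Eq. 1, the following sketch runs a random walk with restart on a small weighted graph. It is not the authors' implementation, and several choices are assumptions made only for this example: absolute edge weights are used so that transition probabilities are non-negative, each node distributes its probability mass to neighbours in proportion to w(u, v)/W(v) so that total probability is conserved, the network weight η is taken as 0 for a single network, and the initial vector is uniform over the seeds. The function name `rwr` and the toy graph are hypothetical.

```python
def rwr(edges, seeds, gamma=0.3, tol=1e-10, max_iter=1000):
    """Random walk with restart on an undirected weighted graph.

    edges: dict mapping (u, v) -> weight (assumed non-zero);
    seeds: iterable of seed nodes.
    """
    # Symmetric adjacency map; assumption: absolute weights are used.
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = abs(w)
        adj.setdefault(v, {})[u] = abs(w)
    strength = {v: sum(nbrs.values()) for v, nbrs in adj.items()}  # W(v)

    seeds = [s for s in seeds if s in adj]
    # Assumption: p0 is uniform over the seed genes.
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for u in adj:
            # Mass arriving at u: neighbour v sends the share w(u, v) / W(v).
            inflow = sum(p[v] * w / strength[v] for v, w in adj[u].items())
            nxt[u] = (1 - gamma) * inflow + gamma * p0[u]
        delta = sum(abs(nxt[v] - p[v]) for v in adj)
        p = nxt
        if delta < tol:
            break
    return p

# Hypothetical toy network with signed weights and a single seed gene "A".
toy = {("A", "B"): 0.9, ("B", "C"): -0.7, ("C", "D"): 0.8, ("A", "D"): 0.6}
scores = rwr(toy, seeds=["A"])
candidates = sorted((g for g in scores if g != "A"), key=scores.get, reverse=True)
```

Candidates (non-seed nodes) are then sorted by their steady-state probability, with the most probable node ranked first.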

#### Rank Candidate Edges

Based on the ranked candidate nodes, we further prioritized the edges of WDIN for each tissue. In detail, given an edge, we assigned an index to the edge according to the ranks of its two linked nodes to indicate its association degree to the studied disease. Theoretically and computationally, using the average ranks of its two nodes as the edge's index would meet the requirement. Then all the edges would be ranked in ascending order according to their assigned indices. An edge having higher rank (with smaller rank value) is more closely associated with the studied disease. These sorted edges enable us to identify a minimal set composed of prioritized edges which can reflect plentiful information about the disease in the considered tissue. Such a set would be referred to as 'core module', which could provide some important information from another viewpoint.
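The edge index described above (the average rank of an edge's two nodes) can be sketched as follows. The function name `rank_edges` and the node ranks are hypothetical; how seed genes are ranked relative to candidates is not stated in the text, so the example simply assumes every node already has a rank.

```python
def rank_edges(edges, node_rank):
    """Order edges by the mean rank of their endpoints (smaller = higher priority)."""
    index = {e: (node_rank[e[0]] + node_rank[e[1]]) / 2 for e in edges}
    return sorted(edges, key=lambda e: index[e])

node_rank = {"G1": 1, "G2": 2, "G3": 3, "G4": 4}    # hypothetical node ranks
edges = [("G3", "G4"), ("G1", "G2"), ("G1", "G4")]  # hypothetical WDIN edges
ranked = rank_edges(edges, node_rank)
# Edge indices: (G1, G2) = 1.5, (G1, G4) = 2.5, (G3, G4) = 3.5
```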

#### Network/Module Comparison

We quantified the similarity of two networks based on edges. Given two different networks, let N<sub>1</sub> and N<sub>2</sub> denote their respective numbers of edges. We then counted the edges present in both networks (common edges), denoted n. We defined x<sub>1</sub> as the ratio of the number of common edges (n) to the number of edges in the first network (N<sub>1</sub>) and x<sub>2</sub> as the ratio of the number of common edges (n) to the number of edges in the second network (N<sub>2</sub>). Using x<sub>1</sub> and x<sub>2</sub>, we introduced an S-score as the harmonic mean of x<sub>1</sub> and x<sub>2</sub> (Roy et al., 2013; Knaack et al., 2014). The formulae are as follows,

$$\begin{aligned} x_1 &= \frac{n}{N_1} \\ x_2 &= \frac{n}{N_2} \\ S\text{-score} &= \frac{1}{\frac{1}{2}\sum_{i=1}^{2}\frac{1}{x_i}} \end{aligned} \tag{3}$$
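A small worked example may make Eq. 3 concrete. The sketch below treats edges as undirected (an assumption consistent with the WDIN definition) and returns 0 when no edges are shared, a convention we add because the harmonic mean is undefined in that case. The function name `s_score` and the two toy networks are hypothetical.

```python
def s_score(edges_a, edges_b):
    """Harmonic mean of the shared-edge fractions of two networks (Eq. 3)."""
    # Undirected edges: (u, v) and (v, u) are the same interaction.
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    n = len(a & b)
    if n == 0:
        return 0.0  # convention for networks with no common edge
    x1, x2 = n / len(a), n / len(b)
    return 2 * x1 * x2 / (x1 + x2)

net1 = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]  # N1 = 4
net2 = [("B", "A"), ("C", "D"), ("E", "F")]              # N2 = 3
s = s_score(net1, net2)
# n = 2, x1 = 2/4, x2 = 2/3, S-score = 4/7
```

Algebraically the S-score reduces to 2n/(N<sub>1</sub> + N<sub>2</sub>), so it is symmetric in the two networks.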

#### Pathway and Motif Enrichment

Functional analyses about pathway and process enrichment have been carried out with the following ontology sources: GO Biological Processes, KEGG Pathway, Reactome Gene Sets. The analyses are performed through web tool Metascape<sup>2</sup> (Tripathi et al., 2015).

### RESULTS

#### Network-Level Analysis Based on Tissue-Dependent WDINs

In order to better capture disease-related information, we designed a new way to infer a tissue-dependent WDIN for each tissue by extracting the interactions with significantly diverse correlations in different phenotypes. Instead of choosing disease-related genes individually, we picked out such genes in pairs. An edge was picked out if its corresponding genes were strongly correlated (the Spearman correlation coefficient is larger than or equal to some threshold, such as 0.8) under one strain while not (e.g., less than 0.2) in the other condition. Such an edge implies that the interaction between the genes is clearly perturbed under the disease and is hence termed a 'diverse interaction' in this work. Subtracting the correlation coefficient in the normal state from the coefficient in the disease state gives the weight of each diverse interaction.

<sup>2</sup>http://metascape.org

TABLE 1 | Characteristics of three tissue-dependent WDINs.
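The selection rule for diverse interactions in this subsection can be sketched as below. This is an illustrative reading, not the authors' code: it applies the thresholds to the signed Spearman coefficient exactly as worded (the text does not say whether absolute values are used), uses the example thresholds 0.8 and 0.2, and takes made-up expression vectors. The names `spearman` and `diverse_weight` are hypothetical, and constant input vectors are not handled.

```python
from statistics import mean

def _ranks(values):
    # Average ranks (1-based), so that ties are handled consistently.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def diverse_weight(normal, disease, pair, high=0.8, low=0.2):
    """Return disease-minus-normal correlation for a diverse interaction, else None."""
    g1, g2 = pair
    r_n = spearman(normal[g1], normal[g2])
    r_d = spearman(disease[g1], disease[g2])
    # Strongly correlated in one phenotype, not in the other.
    if (r_n >= high and r_d < low) or (r_d >= high and r_n < low):
        return r_d - r_n
    return None

# Hypothetical expression values for two genes across five samples per group.
normal = {"g1": [1, 2, 3, 4, 5], "g2": [2, 4, 5, 8, 9]}
disease = {"g1": [1, 2, 3, 4, 5], "g2": [3, 5, 1, 2, 4]}
w = diverse_weight(normal, disease, ("g1", "g2"))
# r_normal = 1.0, r_disease = -0.1, so w is about -1.1 (an 'inactive edge')
```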

#### Characteristics of WDINs

By screening all edges in the background PPI network, we identified 2,026/2,118/2,184 diverse edges among 1,710/1,783/1,793 nodes for adipose/muscle/liver, respectively. We computationally validated these networks by examining the number of T2D-pathway genes and RGD-reported genes covered by each tissue-dependent WDIN. We found that the created WDINs could hit most (more than 83%) T2D-pathway genes and more than half of the RGD-reported genes (**Table 1**). This means that our way of constructing WDINs is effective: by screening the background PPI network and cutting away the edges having loose or no association with the disease, the inferred WDINs capture abundant disease-related information with fewer edges.

We further investigated the Venn diagram for the nodes of the three tissue-dependent WDINs and the disease-related genes they covered (**Figure 3**). From the Venn diagram we can see that the percentage of specific genes in each WDIN ranges from 11.1 to 14.1%, and the proportion of housekeeping genes (genes appearing in all three WDINs simultaneously) is 33.2% (Eisenberg and Levanon, 2013; Lee et al., 2015). The Venn diagram for the covered T2D-related genes, composed of T2D-pathway genes and RGD-reported genes, presented a similar result.

#### Network Similarity Analysis

In addition to comparing networks based on nodes through the Venn diagram, we also carried out a network edge comparison to further quantify the extent of shared and tissue-specific network components. We introduced the S-score measure to assess the similarity between network edges for each pair of tissues, since it is a more sensitive measure for comparisons. Based on the S-score (**Figure 4A**), we found that the similarity between each pair of WDINs is low, which means that during the progression of T2D the three tissues possess evident tissue specificity. In this network comparison, adipose and muscle tissue showed the most similar T2D-induced dysfunctions among the three insulin-responsive tissues.

#### Cross-Tissue Functional Analysis Based on WDINs

To confirm the significant relation between tissue-dependent WDINs and T2D, pathway and motif enrichment was conducted for each WDIN to categorize the genes participating in different biological functions or pathways.

#### **Enrichment analysis on tissue non-specific genes**

**Table 2** lists the top 20 enriched terms of pathway and biological process of 904 tissue non-specific genes. The well-known T2D-related pathway, the insulin signaling pathway, ranks in the 16th position. Most other listed pathways have been documented to be associated with T2D. For example, two pathways about cancer (pathways in cancer and proteoglycans in cancer) were enriched, and this is not surprising since cancer is quickly emerging as another pathological consequence of T2D (Poloz and Stambolic, 2015). Due to the lack of adequate glucose uptake induced by dysfunction of the insulin response, most pathways related to the cellular regulatory signal transduction of basic energy metabolism became abnormal, such as the Chemokine, cAMP, and Wnt signaling pathways (Li et al., 2014). As T2D is a well-known metabolic disease, metabolism-related pathways also appeared to be abnormal, such as purine metabolism and the drug metabolic process.

TABLE 2 | Enriched terms of pathway and biological process of house-keeping genes.

#### **Enrichment analysis on tissue specific genes**

Some tissue-specific properties can be displayed through the functional analysis of specific genes in the three WDINs (**Figure 4B**). For example, the drug metabolic process appeared to be significant among the enriched terms of the liver-dependent WDIN. Negative regulation of intracellular signal transduction and response to oxidative stress were found to be dysfunctional in the liver-specific WDIN only and did not appear in the remaining two tissues. In addition, we found that overall the abnormal functions caused by T2D appear to be more similar in adipose and muscle, which is consistent with the network similarity result based on the S-score (**Figure 4A**).

#### Gene-Level Analysis Based on Prioritized WDINs

#### Prioritizing WDINs Using RWR

To rank genes belonging to candidate sets based on their association level with T2D, we applied RWR to our three constructed tissue-dependent WDINs. The RWR method starts with the genes belonging to T2D pathway covered by each WDIN, and these genes are referred to as seed genes. The candidate set of genes includes other nodes of WDIN excluding the seed genes. After the random walking, the candidates are ranked according to the proximity of each gene to the genes in the seed set.

To verify the efficiency of the random walk on the WDIN, we collected T2D-related genes from the RGD database. In all, 303 RGD-reported genes appeared in our background network, and among 35 T2D-pathway genes hit in adipose-WDIN (seed genes), there are 0 overlaps. These 130 genes are suitable test genes because we can verify our method by inspecting their rank indexes. Similarly, in muscle-WDIN, there are 36 genes from the T2D pathway (used as seed genes) and 169 RGD-reported genes. Of these 169 RGD-reported genes, excluding the seed genes, 146 remain and were used as test genes for RWR. In liver-WDIN, there are 39 genes from the T2D pathway (used as seed genes) and 162 RGD-reported genes. Of these 162 RGD-reported genes, excluding those overlapping with seed genes, 136 remain and were used as test genes for RWR.

After random walking on the adipose-WDIN, the top 100 ranked candidates covered 16 test genes and the top 200 covered 26; the corresponding p-values from Fisher's exact test are 0.0019 and 0.0147. These results are listed in **Table 3**, along with the similar results for the other two tissues. This indicates that random walking on the inferred WDINs can effectively prioritize disease-related genes.
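The enrichment of test genes among top-ranked candidates can be checked with a one-sided Fisher's exact test, which is equivalent to a hypergeometric tail probability. The sketch below is an assumption-laden illustration: the text does not give the contingency table or the sidedness of the test, so the value computed here need not reproduce the reported p-values. The candidate count of 1,675 (1,710 adipose-WDIN nodes minus 35 seeds) and the 130 test genes are taken from the text; the function name `enrichment_p` is hypothetical.

```python
from math import comb

def enrichment_p(total, positives, drawn, hits):
    """One-sided p-value: P(X >= hits) for X ~ Hypergeometric(total, positives, drawn)."""
    upper = min(positives, drawn)
    tail = sum(comb(positives, k) * comb(total - positives, drawn - k)
               for k in range(hits, upper + 1))
    return tail / comb(total, drawn)

# Adipose-WDIN: 1,675 candidates, 130 test genes, 16 of them in the top 100.
p_top100 = enrichment_p(total=1675, positives=130, drawn=100, hits=16)
```

By chance alone one would expect about 100 × 130 / 1,675 ≈ 7.8 test genes in the top 100, so observing 16 corresponds to a small tail probability.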

To collect more generally known T2D genes to verify our framework, we queried the approved type 2 diabetes genes in papers published from 2010 to 2018. After mapping their homologous genes in rats based on the homology categories from MGI (Bult et al., 2008), 12 genes were retained. Finally, 8 of the 12 genes were detected in the three constructed WDINs: BCAR1, CAMK2B, CPS1, FADS2, GCK, PPARG, PDX1, and POLD2. Among them, GCK (Misra and Owen, 2018) and PDX1 (Kodama et al., 2016) were included in our seed set. FADS2 (Li et al., 2016) was ranked 112 in our prioritized adipose-WDIN (1,675 genes in total), and POLD2 (Gaudet et al., 2011) was ranked 98 in muscle-WDIN (1,747 genes in total). BCAR1 (Kazakova et al., 2018) and CAMK2B (Sacco et al., 2016) also have high ranks in muscle-WDIN, 289 and 257, respectively. PPARG (Voight et al., 2010) was ranked 533 in adipose-WDIN, while CPS1 (Matone et al., 2016) had an unsatisfactory rank of 1,272. In general, the ranks of these 8 genes were broadly consistent with their published associations with T2D. Furthermore, our results could offer tissue-specific insights into T2D.

TABLE 3 | We verify the efficiency of prioritized WDIN through Fisher's exact test.

#### Cross-Tissue Functional Analysis Based on Prioritized Genes

In order to reduce the effect of ascertainment bias from genes loosely associated with the disease, we set a threshold and focused only on the prioritized genes above this threshold in the three tissue-dependent WDINs. Thus we can perform a local and finer-grained functional analysis on tissue non-specific and tissue-specific genes separately, with less noise. Our experiments show that the effect of the choice of threshold is minor. If the threshold is too small, such as less than 50, the genes chosen for functional enrichment analysis would be insufficient to achieve statistical significance; whereas if the threshold is too large, such as higher than 200 (according to the results of the previous subsection), the follow-up functional analysis would be diluted by genes loosely associated with T2D. Therefore, we set the cutoff for prioritized genes at 100 to conduct the cross-tissue functional analysis, and the results are shown in **Figure 5**.

We found that the enriched terms were more specific and more strikingly related to T2D when considering only the top 100 prioritized housekeeping genes, such as type II diabetes mellitus, insulin signaling pathway, and glucose metabolic process. When restricting tissue-specific genes to the top 100 prioritized genes, the tissue specificity became more visible. In detail, the immune effector process and positive regulation of phosphatidylinositol 3-kinase activity were not enriched in adipose, while specific genes in liver were not involved in negative regulation of transferase activity, positive regulation of phosphatidylinositol 3-kinase activity, or the epithelial cell differentiation process, which also means that only adipose takes part in epithelial cell differentiation. Compared with the other two tissues, muscle displays significant enrichment in the regulation of kinase activity process and the MAPK signaling pathway. Overall, the enriched functional terms for adipose and muscle were again closer to each other.

#### Characterizing the Molecular Functions of Each Predicted T2D-Associated Gene

After ranking candidates through random walking on tissue-dependent WDINs, we found that some genes were ranked highly although they did not appear in the RGD database or the T2D pathway. For clarity, we focused on the top 50 nodes and considered them as potential T2D-associated genes. We list their gene information in **Table 4** and the corresponding pathway and process enrichment in **Table 5**.

According to the annotation of genes by Metascape, we found that some detected genes indeed have a close connection with T2D. Among the top 50 potential genes in adipose-WDIN, PGM1 mainly takes part in the pentose phosphate pathway, the innate immune system, and the galactose catabolic process. FBP1 is involved in the response to insulin stimulus and the pentose phosphate pathway. STAT4 was ranked in the top 50 in both the prioritized adipose- and muscle-WDINs. This gene provides instructions for a protein that acts as a transcription factor, which means that it attaches (binds) to specific regions of DNA and helps control the activity of certain genes. The STAT4 protein is turned on (activated) by immune system proteins called cytokines, which are part of the inflammatory response to fight infection. RASGRF1 appears in the potential T2D-associated gene set in both the prioritized muscle- and liver-WDINs, and it has been shown to be upstream of IGF1 (insulin-like growth factor 1), a prominent T2D gene, allowing it to control growth in mice (Drake et al., 2009). PSMD9 was observed in the top 50 genes detected in liver-WDIN, and it plays an important role in negative regulation of insulin secretion (GO:0046676) and positive regulation of insulin secretion (GO:0032024); GALM is another potential T2D-related gene detected in liver-WDIN that acts as a part of the galactose catabolic process (GO:0019388) and the galactose metabolic process (GO:0006012).

These potential T2D-associated genes would be helpful for physicians or biologists, as they can be used to determine an experimental target as the subject of future research.

#### Module-Level Analysis for the Predicted T2D Associated Genes

#### Identifying Core Module From Prioritized WDIN Based on Edges

In addition to prioritizing candidate nodes, we further ranked interactions in three WDINs separately to capture those crucial interactions in several disorders caused by T2D. Specifically, each edge of the WDIN would be ranked according to the average rank of its corresponding two nodes. Theoretically, the edge having higher ranks (with smaller rank values) tends to have a closer association with the studied disease.

Our final goal was to identify a minimal set of prioritized WDIN edges that reflects copious information about the disease in a given tissue, hereafter called the 'core module.' To achieve this goal, we carried out a series of tests to evaluate how effectively subnetworks of the prioritized tissue-specific WDIN predict and detect disease-related genes. Firstly, for each tissue, a series of subnetworks was successively extracted from the prioritized WDIN based on edges, from the top 2.5% to the top 25% at an interval of 2.5%, and then the maximum connected component (MCC) of each subnetwork was retained for further analysis. In total, we had 10 MCCs for each tissue. Secondly, we computed the numbers and the coverage rates of the disease-related genes collected from the T2D pathway and the RGD database that were hit by each MCC (**Supplementary Figure S1**).

We then identified our core module by investigating the clustering coefficient of each MCC, because the clustering coefficient reflects the modular nature of a biological functional module (Caroline and Ralf, 2006). In most cases, a complex with a larger clustering coefficient tends to form a biological functional module. Generally, the clustering coefficient ranges from 0 to 1, and for a stochastic network with N nodes its value is approximately equal to N<sup>−1</sup>.

We calculated the clustering coefficient for each MCC, and the corresponding results are shown in **Figure 6**. In addition, we computed the clustering coefficients of the three tissue-dependent WDINs, which are 0.011, 0.011, and 0.016 for adipose, muscle, and liver, respectively. For each tissue, the MCC with the largest clustering coefficient was selected as our core module; that is, the top 15% of adipose-WDIN, the top 17.5% of muscle-WDIN, and the top 12.5% of liver-WDIN are taken as our tissue-dependent core modules. It should be noted that the clustering coefficients of the three identified modules (0.018, 0.013, and 0.021) are all larger than those of their corresponding WDINs.
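The core-module search (take a top fraction of ranked edges, keep the maximum connected component, then score it by clustering coefficient) can be sketched as follows. The text does not state which clustering coefficient definition was used; this example assumes the average local clustering coefficient, with nodes of degree below two contributing zero, and assumes there are no self-loops. The function names and the toy edge list are hypothetical.

```python
from itertools import combinations

def largest_component(edges):
    """Adjacency map of the maximum connected component of an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:  # depth-first traversal of one component
            for nb in adj[stack.pop()]:
                if nb not in comp:
                    comp.add(nb)
                    stack.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {v: adj[v] for v in best}

def average_clustering(adj):
    """Mean over nodes of (links among neighbours) / (possible neighbour pairs)."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # contributes 0 to the average
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

# Hypothetical edge list, already sorted from highest to lowest priority.
ranked = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"),
          ("E", "F"), ("F", "G"), ("G", "H"), ("H", "I")]
top = ranked[: int(len(ranked) * 0.5)]  # top 50% of edges
mcc = largest_component(top)
cc = average_clustering(mcc)
# Local coefficients: A = 1, B = 1, C = 1/3, D = 0, so cc = 7/12
```

Repeating this for each fraction from 2.5% to 25% and keeping the fraction with the largest coefficient mirrors the selection described above.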

Analysis of core modules and a focus on the difference between the edges/interactions instead of nodes/genes individually would provide some important information from another viewpoint. We will illustrate this point in the next subsection.

#### Creating Merged Core Module (MCM) Through Jaccard Index

Here we created the merged core module (MCM) by integrating the three tissue-dependent core modules using the Jaccard index (Knaack et al., 2014). For each pair of core modules, we calculated the Jaccard index, which for a pair of sets is defined as the ratio between the size of the intersection and the size of the union of the two sets. If the Jaccard index of an interaction is higher than a threshold in at least one pair of modules, we added this interaction, along with its weight, to the merged core module. To find a suitable threshold, we calculated the clustering coefficients of the complexes generated when the threshold was set to 0.3, 0.2, 0.1, and 0.05; the resulting clustering coefficients were 0, 0, 0.027, and 0.026, respectively. Hence, α = 0.1 was chosen as the threshold to create our MCM. Finally, we had an MCM composed of 198 edges among 154 nodes.

TABLE 5 | Key issues of motif enrichment analysis on predicted potential T2D-associated genes.

FIGURE 6 | Clustering coefficient of MCCs for each tissue.
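For reference, the Jaccard index of two sets is straightforward to compute. The sketch below applies it to the edge sets of two hypothetical modules; how the index is assigned to an individual interaction before thresholding is not spelled out in the text, so that step is deliberately not reproduced here. The function name `jaccard` and the module contents are made up.

```python
def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Hypothetical core modules, stored as sets of undirected edges.
module_a = {frozenset(e) for e in [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]}
module_b = {frozenset(e) for e in [("G2", "G1"), ("G3", "G4"), ("G4", "G5"), ("G5", "G6")]}
j = jaccard(module_a, module_b)
# 2 shared edges out of 5 distinct edges
```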

We then investigated the inferred MCM to identify the specific and common components of dysfunctions between different tissues.

#### Cross-Tissue Analysis Based on Merged Core Module

To systematically assess the extent to which the dysfunctions were shared among different tissues, we split the edges of MCM into three categories:


between the tissue pair under T2D can be exhibited through the dysfunctional weight of the edge in different tissue-dependent WDIN. (A,B) Show the differential edges between adipose and muscle, respectively. (C,D) Show muscle and liver, respectively. (E,F) Show adipose and liver, respectively. The color and width of the edge here have the same meaning as in Figure 7.

(3) Common edge: the edge was significantly dysfunctional in at least two tissues. When the status was consistent, for example, the edge was active in tissue A whereas inactive in tissue B (shown in **Figure 9**).

We found that in MCM, most adipose-specific edges appeared in links to hub gene PIK3R1, and a large proportion of them were active under T2D, such as the function occurred between PIK3R1 and INS1, the function between PIK3R1 and IRS2, the function between PIK3R1 and IGF1 etc. Taking muscle-specific edges in MCM into consideration, it can be seen that PIK3R1, PIK3CB, and EGFR were key nodes. The corresponding functions linked to PIK3R1 (such as the function occurred between PIK3R1 and IGF1R) were inactive which made the condition different than in adipose, and the corresponding functions around EGFR were active (such as functions linked between this node and IGF1). Though the distribution of liver-specific edges in MCM was decentralized when compared with other two tissues, PIK3R1 and PIK3CB were still two key nodes, interactions between PIK3R1 and other nodes (such as INS2) were active, while PIK3CB was a balanced node in terms of the hallmark of interactions.

There were 7 differential edges between adipose and muscle in the MCM, which means that the corresponding functions presented opposite status in these two tissues. As documented (Bereziat et al., 2002), GRB14 inhibits the catalytic activity of the INSR, which agrees with our result displayed in **Figure 8B**. According to our result, this inhibition may emerge only in muscle tissue, while it is inverted in adipose tissue (**Figure 8A**). This provides a meaningful target for biological experiments.

tissue pair under T2D can be exhibited through the dysfunctional weight of the edge in different tissue-dependent WDIN. (A,B) Are the common edges between adipose and muscle, respectively. (C,D) Are for muscle and liver, respectively. (E,F) Show adipose and liver, respectively. The color and width of the edge here have the same meaning as in Figure 7.

In all, 5 differential edges were identified between muscle and liver in the MCM (**Figures 8C,D**). The visible changes in the corresponding functions also reveal more subtle and interesting differences between tissues. For example, as described in GeneCards<sup>3</sup>, SHC1 interacts with the NPXY motif of tyrosine-phosphorylated IGF1R. According to our results, we can further infer that this interaction was active in muscle under T2D while inhibited in liver. Some similar results can be observed between adipose and liver.

<sup>3</sup>https://www.genecards.org

Inspecting the common edges between tissue pairs, we noted that all four common interactions between adipose and liver were inactive. Remarkably, one of these four edges was between ABCC8 and KCNJ11, two well-known T2D-related genes that encode the proteins Sur1 and Kir6.2, respectively, in pancreatic beta cells. KCNJ11 interacts with ABCC8 to produce the KATP channel, which transfers potassium ions across the beta cells (Haghvirdizadeh et al., 2015). This interaction was indirectly linked through CACNA1E and was inhibited in both adipose and liver.

However, here we only described several edges for each category. Similar results can be examined from **Figures 6–9** which would be useful in studies about disease mechanism through analyzing the shared and specific components among tissues under the disease.

### DISCUSSION

Type 2 diabetes (T2D) is a complex disease and its dysfunction involves many tissues. This work systematically investigates commonalities and specificities of T2D among multiple tissues. We established a multi-level comparative framework across three insulin target tissues (white adipose, skeletal muscle, and liver) to provide a better understanding of T2D.

The first challenge is to represent the tissues from the data. Starting from the ranks of gene expression, we constructed the 'disease network' through detecting diverse interactions to provide a well-characterization for disease affected tissues. Based on the constructed tissue-dependent WDINs, an elementary and integral comparative analysis at network-level was conducted. The results of network similarity according to the edges of network indicated that the similarity among three tissues is lower, thus justifying the necessity to conduct tissue-specific analysis for T2D. The differences among tissues were also visible in enriched motif based on tissue-specific genes, and these differences showed that some T2D-related pathways or biological processes own tissue specificity. Besides, we found that among three tissues, adipose and muscle have more similar components in terms of both enriched functions and network similarity.

To reduce the negative effects of genes loosely associated with T2D, the RWRs algorithm was applied to the disease network to prioritize its nodes and edges according to their association with T2D. In theory, higher-ranked genes are more strongly associated with T2D. Gene-level analysis was carried out on the higher-ranked genes, such as the top 100. On the one hand, we discussed these genes individually and found that some of them have been reported to be related to disease genes, while several are not yet documented and could be potential T2D-related genes to be verified experimentally. On the other hand, we pooled these genes to survey their combined functions by inspecting the enriched pathways and biological processes. Compared with the same analysis on the whole disease network, the analysis based on the genes closely associated with T2D revealed more specific and striking enrichment for T2D-related terms (such as type II diabetes mellitus, insulin signaling pathway, and glucose metabolic process).
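The prioritization step can be illustrated with a minimal sketch, assuming that RWRs denotes a random walk with restart in which a walker repeatedly steps to a weighted neighbour or jumps back to a set of seed (known disease) genes. The restart probability, the toy network, and the function name are hypothetical and are not taken from this study; edge prioritization is not covered.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-12, max_iter=10_000):
    """Return steady-state visiting probabilities for every node.

    adjacency: dict mapping node -> {neighbour: weight}; assumed symmetric,
               with every node having at least one neighbour.
    seeds:     set of nodes (known disease genes) where the walker restarts.
    """
    nodes = sorted(adjacency)
    strength = {u: sum(adjacency[u].values()) for u in nodes}
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart mass goes back to the seeds ...
        nxt = {u: restart * p0[u] for u in nodes}
        # ... and the rest is spread over neighbours in proportion to edge weight.
        for u in nodes:
            for v, w in adjacency[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / strength[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical four-gene chain A - B - C - D with unit weights; A is the seed.
toy_network = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
scores = random_walk_with_restart(toy_network, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Genes closer to the seed receive higher steady-state probability, which is the sense in which higher-ranked genes are treated as more strongly associated with the disease.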


The network- and gene-level analyses provided novel information across tissues; they also largely verified the effectiveness of our random walk on the constructed WDINs. Based on the prioritized edges of the WDINs, we therefore identified a merged core module (MCM) by combining the clustering coefficient and the Jaccard index, which provides an elaborate and visible explanation of the common and specific features of different tissues at the module level. Edges in the MCM were grouped into three categories: specific edges, differential edges, and common edges. Focusing on these three categories revealed more detailed commonalities and specific differences in dysfunction between tissues, which should help us further understand the disease. We await tissue-specific and isoform-specific gene knockout studies (specifically of genes from IRs) to corroborate these conclusions.

Overall, we presented a mathematical and systems biology framework that consists of constructing tissue-dependent disease-related networks and performing multi-level analysis on them. Our network-, gene-, and module-level characterization of T2D across different tissues holds the promise of a broader and deeper understanding of the T2D mechanism.

# DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: www.ncbi.nlm.nih.gov/geo.

# AUTHOR CONTRIBUTIONS

YW and SS designed the study. FS organized the database and analyzed the results. SS and YW wrote the manuscript. All authors read and approved the submitted version.

### FUNDING

This work was supported in part by the National Key Research and Development Program of China under grant 2017YFC0908400. YW is also supported by NSFC under nos. 11871463, 61621003, and 61671444.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00252/full#supplementary-material






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Sun, Sun and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Machine Learning Approach for Identifying Gene Biomarkers Guiding the Treatment of Breast Cancer

Ashraf Abou Tabl<sup>1</sup> \*, Abedalrhman Alkhateeb<sup>2</sup> \*, Waguih ElMaraghy<sup>1</sup> , Luis Rueda<sup>2</sup> and Alioune Ngom<sup>2</sup>

<sup>1</sup> Department of Mechanical, Automotive and Materials Engineering, University of Windsor, Windsor, ON, Canada, <sup>2</sup> School of Computer Science, University of Windsor, Windsor, ON, Canada

Genomic profiles among different breast cancer survivors who received similar treatment may provide clues about the key biological processes involved in the cells and help find the right treatment. More specifically, such profiling may help personalize the treatment based on the patients' gene expression. In this paper, we present a hierarchical machine learning system that predicts the 5-year survivability of patients who underwent a specific therapy. The classes are built by combining two parts: the survivability information and the given therapy. The survivability part defines whether the patient survived the 5-year interval or is deceased, while the therapy part denotes the therapy received during that interval, which is hormone therapy, radiotherapy, or surgery; together these form six classes. The model classifies one class vs. the rest at each node, so the tree-based model has five nodes. The model is trained using a set of standard classifiers on a comprehensive study dataset that includes genomic profiles and clinical information of 347 patients. A combination of feature selection methods and a prediction method is applied at each node to identify the genes that can predict the class at that node; the identified genes for each class may serve as potential biomarkers for that class's treatment and for better survivability. The results show that the model identifies the classes with high performance measures. An exhaustive analysis of the relevant literature shows that some of the potential biomarkers are strongly related to breast cancer survivability and to cancer in general.

Keywords: breast cancer, classification, feature selection, gene biomarkers, machine learning, cancer survivability, treatment therapy

# INTRODUCTION

Despite the fast increase in the breast cancer incidence rate, the survival rates have also increased due to improvements in the treatments because of new technologies (Siegel et al., 2016). Breast cancer, however, is still one of the leading causes of cancer-related death among women worldwide. The survival rates vary among the various treatment therapies that are currently used, which include surgery, chemotherapy, hormone therapy, and radiotherapy. Nevertheless, each patient's response to a specific treatment varies based on some factors that are being investigated (Miller et al., 2016).

#### Edited by:

Quan Zou, University of Electronic Science and Technology of China, China

#### Reviewed by:

Yang Dai, University of Illinois at Chicago, United States Leyi Wei, Tianjin University, China

#### \*Correspondence:

Ashraf Abou Tabl aboutaba@uwindsor.ca Abedalrhman Alkhateeb alkhate@uwindsor.ca

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 25 October 2018 Accepted: 08 March 2019 Published: 27 March 2019

#### Citation:

Tabl AA, Alkhateeb A, ElMaraghy W, Rueda L and Ngom A (2019) A Machine Learning Approach for Identifying Gene Biomarkers Guiding the Treatment of Breast Cancer. Front. Genet. 10:256. doi: 10.3389/fgene.2019.00256




Traditional laboratory techniques like CAT scans and magnetic resonance imaging (MRI) have proven useful. However, they provide very little information about the mechanism of cancer progression. In contrast, advances in DNA microarray technology have provided high-throughput samples of gene expression. Machine learning approaches have been used to predict breast cancer treatment or survival (Mangasarian and Wolberg, 2000; Cardoso et al., 2016; Abou Tabl et al., 2017; Tang et al., 2017; Zeng et al., 2018), and many researchers have used DNA microarray technology to study breast cancer survivability (Mangasarian and Wolberg, 2000; Cardoso et al., 2016; Abou Tabl et al., 2017). Analyzing gene expression among breast cancer patients who undergo different treatment types deepens the current understanding of the disease's progression and prognosis. The large number of features complicates the computational model: the number of features is usually much larger than the number of samples, which is known as the curse of dimensionality, and standard classifiers then overfit the data and perform poorly. Feature selection techniques alleviate the curse of dimensionality by removing irrelevant and/or redundant features.

Zou et al. (2016) proposed the maximum relevance maximum distance feature selection approach, mRMD 2.0. The method uses Pearson's correlation coefficient to measure the relevance between a subgroup of features and the class. The selection criteria balance accuracy with stability when selecting the features. The authors compared the dimensionality reduction method with both filter and wrapper feature selection methods, and the results show that mRMD 2.0 outperformed the feature selection methods of each type (Zou et al., 2016). We compared mRMD 2.0 with mRMR in the wrapper phase of feature selection. While the accuracy of random forest on the features selected by each method was very close, mRMR selected fewer potential biomarkers overall, 47 genes compared to 60 genes for mRMD 2.0. Hence, we used mRMR in this model to obtain a smaller set of potential biomarkers for further analysis.

Tang et al. (2017) predicted tumor location in breast tissue with a feature selection method applied to RNA-Seq and miRNA data; they improved the prediction of the standard classifiers to around 93% on average. Zeng et al. (2018) investigated a potential miRNA biomarker for breast neoplasm with around 80% accuracy. Mangasarian and Wolberg (2000) used a linear support vector machine (SVM) to extract 6 out of 31 clinical features. Their dataset contains samples from 253 breast cancer patients. The model classified the samples into two groups: (1) the node-positive group, in which the patients have some metastasized lymph nodes, and (2) the node-negative group, for patients with no metastasized lymph nodes. Those six features were then used in a Gaussian SVM classifier to classify patients into three prognostic groups: negative, middle, or positive. The researchers found that patients in the negative group had the highest survivability. Most of these patients had received chemotherapy treatment (Mangasarian and Wolberg, 2000).

Using samples from patients with high-risk clinical features in the early stages of breast cancer, Cardoso et al. (2016) proposed a statistical model to determine the necessity of chemotherapy based on clinical data. In one of our earlier works, we built a prediction model based on various treatments without defining the period of survivability (Abou Tabl et al., 2017); that is, given a training dataset consisting of gene expression data of BC patients who survived or died after receiving a treatment therapy, we built a classification model to predict whether a new patient will survive or die. In another work, we implemented an unsupervised learning approach to find the separation between the treatment-survival groups of classes (Tabl et al., 2018a); that model groups different classes together when building the tree model while defining the border between the different groups of classes. Paredes-Aracil et al. (2017) built a scoring prediction system for 5- and 10-year survivability periods for different BC subtypes. The cohort of their study includes 287 patients from a Spanish region. The patients received different therapies, sometimes in combination (Paredes-Aracil et al., 2017), which makes it difficult to relate the genomic activities to a specific therapy during survival prediction.

TABLE 2 | Results of using mRMD 2.0 vs. mRMR at each node and then applying the random forest classifier at each node.

TABLE 3 | Gene biomarkers for each class vs. the rest at each node.

In the present paper, we extend an earlier supervised learning model that showed preliminary results in predicting which BC patients will survive beyond 5 years after undergoing a given treatment therapy (Tabl et al., 2018b). The extended model has been refined and validated by comparison with the feature selection approach mRMD 2.0, visual analysis, and biological validation of a set of 12 potential biomarkers (FGF16, ASAP1, FBXO41, FOSB, VAMP4, ARFGAP2, BLP, CT47A1, PRPS1, ICOSLG, ARPC3, ZFP91) from the resulting 47 genes across all classification nodes.

# MATERIALS AND METHODS

We used a publicly accessible dataset that contains samples from 2,433 breast cancer patients (Curtis et al., 2012; Pereira et al., 2016). The gene expression profiles had been fully processed and normalized (Curtis et al., 2012). After studying the data and selecting only patients who had received one type of treatment, we identified a set of six classes as the basis of this work.



These classes combine each treatment, surgery (S), hormone therapy (H), or radiotherapy (R), with the patient status, living (L) or deceased (D). The numbers of samples (patients) for each class in the proposed model are shown in **Table 1**. Data from a total of 347 patients were included in this work.

To avoid overfitting, we performed the filter feature selection first for each class, before running the wrapper feature selection or the classification model on all the samples from all classes. The numbers of genes remaining after the filter feature selection for each class are reported in **Table 1**.

Based on the available data, only three treatment therapies are covered in this study: surgery, hormone therapy, and radiotherapy. Our model uses hierarchical classifiers to classify one-versus-the-rest classes. The classes are imbalanced; hence, standard classification methods yield poor performance. The pipeline starts with filter feature selection methods, Chi-square (Mantel, 1963) and information gain (IG), which are applied to limit the number of significant features (genes). A wrapper method is then used to obtain the subset of genes that best represents the model, using minimum redundancy maximum relevance (mRMR) (Peng et al., 2005) as the feature selection method. This step is followed by several class balancing techniques, namely the synthetic minority oversampling technique (SMOTE), resampling, and cost-sensitive classification, which balance the class sizes before different types of classifiers are applied, such as naive Bayes (Domingos and Pazzani, 1997) and decision trees (random forest) (Breiman, 2001). Finally, a small number of biomarker genes is identified for predicting the proper treatment therapy for the patient. To the best of our knowledge, this is the first prediction model built on the combination of the treatment and the survivability of the patient as a class.

The patient class distribution for the studied model is shown in **Figure 1**, which shows the percentages of samples within each class. It is clear that there are differences between classes that require class imbalance handling techniques to achieve fair classification.

#### Class Imbalance

This model uses a one-versus-rest scheme to tackle the multiclass problem, which leads to an imbalanced class dataset at each node of the classification model. Therefore, we applied the following techniques to handle this issue:

#### Over-Sampling With Synthetic Data

This technique oversamples the minority class using synthetic data generators. Several algorithms are available for this purpose; we used one of the most popular, SMOTE (Chawla et al., 2002).
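The core idea of SMOTE can be sketched as follows: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. This is a simplified illustration and not the Weka filter used in the study; the function name, the value of `k`, and the toy points are assumptions.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Requires at least two minority samples. Simplified: the base sample is
    drawn at random instead of cycling through all minority samples.
    """
    rng = random.Random(seed)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sqdist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([x + gap * (y - x) for x, y in zip(base, mate)])
    return synthetic


# Hypothetical two-gene expression vectors for a small minority class.
minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
new_points = smote(minority_points, n_new=10, k=2, seed=1)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class.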

#### Using a Cost-Sensitive Classifier

This technique uses penalizing models that apply additional weight to the minority class to achieve class balancing. This, in turn, biases the model to pay more attention to the minority class than to the others. The algorithm used in this work is the Cost-Sensitive Classifier in the Weka machine learning tool, which uses a penalty matrix to overcome the imbalance (Núñez, 1988).
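One common way to apply a penalty matrix is to predict the class with the lowest expected misclassification cost rather than the most probable class. The sketch below illustrates that rule only; the matrix orientation, the probabilities, and the penalty values are hypothetical and are not the settings used in this study.

```python
def min_expected_cost_class(probabilities, cost):
    """Pick the prediction that minimises expected cost.

    probabilities[i]: estimated probability that the true class is i.
    cost[i][j]:       penalty for predicting class j when the true class is i
                      (orientation is an assumption of this sketch).
    """
    n = len(probabilities)
    expected = [
        sum(probabilities[i] * cost[i][j] for i in range(n)) for j in range(n)
    ]
    best = min(range(n), key=expected.__getitem__)
    return best, expected


# Class 0 = "rest" (majority), class 1 = minority class.
# Missing a minority sample is penalised five times more than a false alarm.
penalty = [[0, 1],
           [5, 0]]
choice, expected_costs = min_expected_cost_class([0.7, 0.3], penalty)
```

With these hypothetical numbers the minority class is predicted even though its estimated probability is only 0.3, which shows how the penalty matrix shifts attention toward the minority class.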

#### Resampling

The dataset can be resampled using one of two methods: (1) adding copies of data instances to the minority class, which is called over-sampling, or (2) deleting some instances of the majority class, which is called under-sampling. We used the over-sampling technique (Gross, 1980).
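Over-sampling by replication can be sketched in a few lines: samples of every smaller class are duplicated at random until each class is as large as the largest one. The function name and the toy data are hypothetical, and the Weka resampling filter may differ in its details.

```python
import random
from collections import Counter


def oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(members) for members in by_class.values())

    out_samples, out_labels = [], []
    for y, members in by_class.items():
        # Draw copies with replacement to fill the gap to the largest class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for x in members + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical one-vs-rest node: 2 samples of class "DH" against 5 of "Rest".
toy_samples = [[0.1], [0.2], [1.0], [1.1], [1.2], [1.3], [1.4]]
toy_labels = ["DH", "DH", "Rest", "Rest", "Rest", "Rest", "Rest"]
balanced_samples, balanced_labels = oversample(toy_samples, toy_labels)
balanced_counts = Counter(balanced_labels)
```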

# Feature Selection

The gene expression dataset contains 24,368 genes for each of the 347 samples. The curse of dimensionality makes it difficult to classify the dataset in its current form, so feature selection is essential to narrow the number of genes down to a handful at each node. Chi-square and information gain are applied first to retain the most informative genes; this step (usually called filter feature selection) reduces the number of genes to a couple of hundred based on the correlation between each class and the gene expression values, using the default threshold in WEKA. After that, mRMR is applied to identify the best subset of significant genes. mRMR (treated here as the wrapper feature selection step) is a greedy search algorithm that selects features that are highly relevant to the class while being minimally redundant with one another.
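The filter step can be illustrated with a small information gain calculation. The sketch below discretizes each gene at its median, which is a simplifying assumption; WEKA uses its own discretization, and the gene names and values here are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting one gene at its median value."""
    cut = sorted(values)[len(values) // 2]
    low = [y for v, y in zip(values, labels) if v < cut]
    high = [y for v, y in zip(values, labels) if v >= cut]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - remainder


def rank_genes(expression, labels):
    """Order gene names from most to least informative."""
    return sorted(
        expression,
        key=lambda g: information_gain(expression[g], labels),
        reverse=True,
    )


node_labels = ["Rest"] * 4 + ["DH"] * 4
toy_expression = {
    "geneA": [1, 10, 2, 11, 3, 12, 4, 13],   # unrelated to the class
    "geneB": [1, 2, 3, 4, 10, 11, 12, 13],   # separates the two classes
}
gene_order = rank_genes(toy_expression, node_labels)
```

A filter of this kind scores every gene independently of the others, which is why a second step is still needed to remove redundancy among the retained genes.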

To find the best wrapper feature selection method for selecting a handful of gene biomarkers for each class, we applied both mRMD 2.0 and mRMR to the filtered genes of each class. mRMD 2.0 outperformed mRMR in the fourth and fifth nodes, as seen in **Table 2**, while mRMR performed better in the second and third. Both methods reached 100% accuracy in the first node, but mRMR selected fewer genes in that node, which made it more efficient.
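The greedy relevance-minus-redundancy idea behind mRMR can be sketched as below. Note the substitution: the original mRMR criterion (Peng et al., 2005) is defined with mutual information, whereas this sketch uses absolute Pearson correlation for both relevance and redundancy to stay short. Gene names and values are hypothetical.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0


def greedy_mrmr(features, target, n_select):
    """Greedily add the feature with the best relevance minus mean redundancy."""
    relevance = {g: abs(pearson(v, target)) for g, v in features.items()}
    selected = []
    while len(selected) < min(n_select, len(features)):
        def score(g):
            if not selected:
                return relevance[g]
            redundancy = mean(
                [abs(pearson(features[g], features[s])) for s in selected]
            )
            return relevance[g] - redundancy

        best = max((g for g in features if g not in selected), key=score)
        selected.append(best)
    return selected


class_indicator = [0, 0, 0, 0, 1, 1, 1, 1]   # 1 = class of interest, 0 = rest
toy_features = {
    "g1": [1, 2, 3, 4, 5, 6, 7, 8],   # most relevant
    "g2": [1, 2, 3, 5, 4, 6, 7, 8],   # relevant but nearly a copy of g1
    "g3": [0, 1, 0, 1, 1, 2, 1, 2],   # less relevant, less redundant with g1
}
chosen = greedy_mrmr(toy_features, class_indicator, n_select=2)
```

In this toy case the second pick is `g3` rather than `g2`: although `g2` is individually more relevant, it adds little beyond `g1`, which is the behaviour that keeps the selected gene set small.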

# Multi-Class Classification Model

We applied a multi-class approach, the one-versus-rest technique. This approach classifies one class against the remaining classes and then removes that class from the dataset. Another class is then selected and classified against the rest, and so on. A greedy method is used to find the starting node: every candidate class is classified against the rest (DH against the rest, then DR against the rest, and so on for all six classes), and the class with the best performance is selected as the root node of the classification tree.
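The greedy construction of the one-versus-rest tree can be sketched as a loop that, at each node, scores every remaining class against the rest, keeps the best one, and removes its samples. The scoring function is supplied by the caller (in practice, a cross-validated classifier); the toy scores below are hypothetical placeholders and are not results from this study.

```python
def build_hierarchy(labels, score_one_vs_rest):
    """Return the greedy node order as a list of (class, score) pairs.

    labels:            class label of every sample.
    score_one_vs_rest: callable(candidate_class, active_indices) -> float,
                       e.g. the accuracy of a classifier for that split.
    """
    remaining = sorted(set(labels))
    active = list(range(len(labels)))   # samples still in the dataset
    order = []
    while len(remaining) > 1:
        scored = [(score_one_vs_rest(c, active), c) for c in remaining]
        best_score, best_class = max(scored)
        order.append((best_class, best_score))
        # Remove the separated class before building the next node.
        active = [i for i in active if labels[i] != best_class]
        remaining.remove(best_class)
    return order


# Hypothetical per-class scores standing in for a real classifier evaluation.
toy_accuracy = {"DH": 0.99, "DR": 0.98, "LH": 0.97, "DS": 0.95, "LR": 0.80, "LS": 0.80}


def toy_score(candidate, active_indices):
    return toy_accuracy[candidate]


toy_class_labels = ["DH", "DR", "LH", "DS", "LR", "LS"] * 3
node_order = build_hierarchy(toy_class_labels, toy_score)
```

With six classes the loop produces five nodes, because the last node separates the final two classes from each other.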

Several classifiers were tested, including random forest, SVM, and naive Bayes. Random forest outperformed the others and showed better classification power for the hierarchical model; therefore, we used it in all nodes. The classification model was built using 10-fold cross-validation. The data are divided into 10 folds of approximately equal size, and the learning method then loops 10 times; each time, it learns from nine folds and tests on the remaining (left-out) fold. In each iteration, a different fold that has not been left out in any previous iteration is held out. In this way, every model is trained on about 90% of the samples, while across the 10 iterations 100% of the samples are used for testing, each exactly once. The accuracy and other performance measures are calculated on the testing folds; therefore, the accuracy reported here is a testing accuracy.
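The fold bookkeeping of standard k-fold cross-validation can be sketched as follows: the sample indices are shuffled, dealt into k folds, and each fold serves once as the test set while the other folds form the training set. The function name and seed are hypothetical; class-stratified splitting is omitted for brevity.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for held_out, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test


# 347 samples, as in the dataset described in this work.
splits = list(k_fold_indices(347, k=10))
tested = sorted(j for _, test in splits for j in test)
```

Each sample index appears in exactly one test fold, so every reported prediction comes from a model that did not see that sample during training.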

# RESULTS AND DISCUSSION

The developed multi-class model also shows the final results for each node and the performance measures that were considered, such as accuracy, sensitivity, F1-measure, and specificity. Moreover, it also shows the number of the correctly and incorrectly classified instances in each node.
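For reference, the performance measures named above follow directly from the counts of correctly and incorrectly classified instances at a node. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not taken from the results of this study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard measures for one class (positive) against the rest (negative)."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall of the positive class
    specificity = tn / (tn + fp) if tn + fp else 0.0   # recall of the rest
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical node: 8 true positives, 2 false positives,
# 88 true negatives, 2 false negatives.
example_metrics = binary_metrics(tp=8, fp=2, tn=88, fn=2)
```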

In **Figure 2**, the root node is DH against the rest, which gives 100% accuracy. The second node is obtained after removing the DH instances from the dataset and then classifying each remaining class against the rest. The best outcome was DR, which had an accuracy of 100%. We repeated the same technique for the third node, which yielded LH with an accuracy of 100%. DS follows in the fourth node with an accuracy of 97.9%, a sensitivity of 96.9%, and a specificity of 100% because all the DS samples were correctly classified. The fifth and final node contains LR and LS. There the accuracy drops to 80.9% because it is difficult to distinguish between the living samples in the two classes.

Our method identified the 47 gene biomarkers listed in **Table 3**. Functional validation was conducted, and a biological analysis of some genes was provided by studying the information available in the literature. The genes marked in blue are those that were considered for further biological relevance (see the discussion in the next section).

At each node, we tried different standard classifiers and selected the one with the best accuracy at that node. As seen in **Table 4**, random forest outperformed the other classifiers in all nodes. The accuracy at the difficult node 5 is still lower than at the other nodes. However, random forest brings a clear improvement at this node, with 80.9% compared to 77.1% for the second-best classifier, SVM with a linear kernel. In node 4, where the accuracy is 97.9% for random forest, the other classifiers performed much worse, with 79.06% accuracy for the second best, SVM with a radial basis function kernel. The Bayesian classifier had the second-best performance in the first, second, and third nodes, with accuracies of 99.47, 96.3, and 92.4%, respectively. SVM with a degree-3 polynomial kernel had average performance in all nodes compared to the other classifiers.

# BIOLOGICAL INSIGHT

A combination of gene regulation analysis and biological analysis was carried out to validate some of the biomarker genes. Biological validation used relevant literature (Bamberger et al., 1999; Sabe et al., 2009; Tommasi et al., 2009; Dombkowski et al., 2011; Allegra et al., 2012; Caballero et al., 2014; Katoh and Nakagama, 2014; Kechavarzi and Janga, 2014; Nam et al., 2015; Qiu et al., 2015). **Figures 4**–**7** are the circos plots of the relationships between the genes for node 2 and node 3. These plots show the significant correlation coefficients among the gene expression values.

**Figure 3** is a multi-dimensional representation of the plot matrix for the six biomarker genes found in node 4 for the DS class vs. the remaining ones, as an example. The figure also shows the relations among the six genes. It is clear from the class column that the samples are separable. In each panel, the x-axis represents the expression values of the gene named in the column, and the y-axis represents the expression values of the gene named in the row.

In the first node, the FGF16 gene is a member of the fibroblast growth factor (FGF) family, which is involved in a variety of cellular processes, such as stemness, proliferation, anti-apoptosis, drug resistance, and angiogenesis (Katoh and Nakagama, 2014). **Figure 8** shows that the expression of FGF16 is up-regulated and the expression of UPF3 is down-regulated in the DH samples compared to the rest of the samples. UPF3 (UPF3B) is the UPF3 regulator of nonsense transcripts homolog B (yeast). Kechavarzi and Janga (2014) found that UPF3 is one of the actively upregulated RNA-binding proteins identified across nine human cancers, one of which is breast cancer.

In the second node, ASAP1 has been shown to be a breast cancer biomarker; it correlates with invasive phenotypes that have not been accurately identified (Sabe et al., 2009). Sabe et al. (2009) reported that ASAP1 is abnormally overexpressed in some breast cancers and is used for their invasion and metastasis. As shown in **Figure 4**, ASAP1 has a strong correlation coefficient with FBXO41 in the DR samples, but the correlation is weaker in the remaining samples, as shown in **Figure 5**. **Figure 9** shows that the expression of ASAP1 is down-regulated in the DR samples compared to the remaining samples. FOSB is a member of the AP-1 family of transcription factors. Bamberger et al. (1999) concluded that sharp differences in the expression pattern of AP-1 family members are present in breast tumors and that fosB might be involved in the pathogenesis of these tumors. As shown in **Figure 6**, FOSB has a strong correlation coefficient with AL71228 in the DR samples, but it was not found to be correlated in the remaining samples, as shown in **Figure 7**.

In the third node, the VAMP4 gene is a target of some cellular and circulating miRNAs in neoplastic diseases, such as miRNA-31, and it has been confirmed that cellular miRNAs are involved in the development of breast cancer (Allegra et al., 2012). As shown in **Figure 6**, VAMP4 has a strong correlation coefficient with ARFGAP2 in the LH samples, but the correlation is weaker in the rest of the samples, as shown in **Figure 7**. **Figure 10** shows that the expression of VAMP4 is down-regulated in the LH samples compared to the remaining samples, while the gene BLP is up-regulated in the LH samples. CT47A1 is one of seven cancer/testis genes in the CT class. CT genes are significantly overexpressed in ductal carcinoma in situ (DCIS) (Caballero et al., 2014).

In the fourth node, phosphoribosyl pyrophosphate synthetase 1 (PRPS1) was found to be a direct target of miR-124 in breast cancer (Qiu et al., 2015). Nam et al. (2015) stated that ICOSLG is a potential biomarker of trastuzumab resistance in breast cancer, which affects the progression of the disease.

Regarding the fifth node, Dombkowski et al. (2011) studied several pathways in breast cancer. They found that ARPC3 reveals extensive combinatorial interactions that have significant implications for its potential role in breast cancer metastasis and therapeutic development. Zinc finger protein 91 homolog (ZFP91) is a methylated target gene in mice; it was identified through methylated-CpG island recovery assay-assisted microarray analysis (Tommasi et al., 2009).

**Figures 8**–**10** show boxplots of the gene biomarkers for each class against the rest at three of the five nodes. The plots also show which genes are up-regulated and which are down-regulated. Most of the biomarkers discriminate clearly between the expression values of the samples of a specific class and those of the remaining samples in the classification node. Many of the biomarkers have outliers. Some outliers lie in the direction opposite to the other class, such as the outliers for the UPF3B gene in the "Rest" class vs. the "DH" class in the first node, as shown in **Figure 8**. Others lie in the same direction as the other class, such as the outliers for the ZNF121 gene in the "Rest" class vs. the "DR" class in the second node, as shown in **Figure 9**. Some genes have outliers in both directions, such as the ARFGAP2 gene in the "Rest" class vs. the "LH" class in the third node, as shown in **Figure 10**. The outliers that are in the same direction do not interfere in distinguishing the two classes, even though they may misguide the classifier in other scenarios.

# CONCLUSION

The use of a machine learning model for identifying gene biomarkers for breast cancer survival is a significant step in determining the proper treatment for each patient and will potentially increase survival rates. This study analyzes the gene activities of surviving vs. deceased patients for each therapy, and the potential biomarkers will help to identify the best therapy for patients based on their gene expression test. The model reaches 97.9–100% accuracy at the first four nodes and 80.9% at the fifth, and it is organized as a hierarchical tree of one-versus-rest classifications.

The computational model pulls sets of biomarkers for patients who received different treatments. These biomarkers can be used to distinguish whether the patient survived or died in a 5-year time window for a specific treatment therapy. Related literature was used to verify the relationships between these biomarkers and breast cancer survivability.

Future work includes testing these gene biomarkers in biomedical labs. The model could also be extended to identify the proper biomarker genes (signature) for other cancer types, or for cases in which patients need or have received more than one type of therapy. Considering additional patient data will enable researchers to cover the missing treatments. With the resulting larger data size, big data tools such as Hadoop and Spark could be used to devise an enhanced model.

# DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: http://www.cbioportal.org/study?id=brca\_metabric.

# AUTHOR CONTRIBUTIONS

AT and AA applied the method. AT retained the results. All authors contributed equally to brainstorming and writing the manuscript.

# FUNDING

This work has been partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) with the following grants (RGPIN-2016-05017 and RGPIN/05084-2014), and the Windsor Essex County Cancer Centre Foundation (WECCCF) Seeds4Hope program.







**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tabl, Alkhateeb, ElMaraghy, Rueda and Ngom. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Mining Magnaporthe oryzae sRNAs With Potential Transboundary Regulation of Rice Genes Associated With Growth and Defense Through Expression Profile Analysis of the Pathogen-Infected Rice

Hao Zhang<sup>1</sup>, Sifei Liu<sup>1</sup>, Haowu Chang<sup>1</sup>, Mengping Zhan<sup>1</sup>, Qing-Ming Qin<sup>2</sup>, Borui Zhang<sup>3</sup>, Zhi Li<sup>4</sup> and Yuanning Liu<sup>1</sup> \*

<sup>1</sup> Key Laboratory of Symbolic Computation and Knowledge Engineering, College of Computer Science and Technology, Ministry of Education, Jilin University, Changchun, China, <sup>2</sup> College of Plant Sciences, Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Changchun, China, <sup>3</sup> Columbia Independent School, Columbia, MO, United States, <sup>4</sup> School of Computer Science and Technology, Changchun University of Science and Technology, Changchun, China

#### Edited by:

Quan Zou, University of Electronic Science and Technology of China, China

#### Reviewed by:

Xiaofeng Song, Nanjing University of Aeronautics and Astronautics, China Yuexu Jiang, University of Missouri, United States

#### \*Correspondence:

Yuanning Liu lyn@jlu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 29 January 2019 Accepted: 19 March 2019 Published: 29 March 2019

#### Citation:

Zhang H, Liu S, Chang H, Zhan M, Qin Q-M, Zhang B, Li Z and Liu Y (2019) Mining Magnaporthe oryzae sRNAs With Potential Transboundary Regulation of Rice Genes Associated With Growth and Defense Through Expression Profile Analysis of the Pathogen-Infected Rice. Front. Genet. 10:296. doi: 10.3389/fgene.2019.00296

In recent years, studies have shown that phytopathogenic fungi can regulate host plants across kingdoms through small RNAs (sRNAs). Magnaporthe oryzae, the causative agent of rice blast, causes disease by penetrating rice tissues through appressoria. However, little is known about the transboundary regulation exerted by M. oryzae sRNAs during the interaction of the pathogen with its host rice. Therefore, investigating the regulation exerted by M. oryzae sRNAs in infected rice plants has important theoretical and practical significance for disease control and production improvement. Based on the high-throughput data of M. oryzae sRNAs and the mixed sRNAs during infection, the expression of sRNAs in M. oryzae before and during infection was compared, and the expression levels of 366 M. oryzae sRNAs were found to be significantly upregulated during infection. We trained an SVM model to predict differentially expressed sRNAs; it can serve as a reference for predicting differentially expressed sRNAs in species homologous to M. oryzae and can facilitate future research on M. oryzae. Furthermore, fifty core targets were selected from the predicted target genes in rice for functional enrichment analysis. The analysis revealed nine biological processes and one KEGG pathway associated with rice growth and disease defense. These functions correspond to thirteen rice genes. A total of fourteen M. oryzae sRNAs targeting these rice genes were identified by data analysis, and their authenticity was verified in the database of M. oryzae sRNAs. These 14 M. oryzae sRNAs may participate in the transboundary regulation process and act as sRNA effectors to manipulate the rice blast process.

Keywords: Magnaporthe oryzae, sRNA, transboundary regulation, rice, SVM

# INTRODUCTION

fgene-10-00296 March 28, 2019 Time: 18:54 # 2

Rice is one of the most important crops in Asia; its production not only directly affects food security but also has a huge impact on the local economy. Rice blast is a disease caused by Magnaporthe oryzae and results in reduced yield. Because of the importance of this crop, the control of rice blast is an active area of study.

M. oryzae is a heterotrophic fungal pathogen. It reproduces by spores and spreads between rice plants through conidia. After germination, germ tubes form specialized infection structures called appressoria, which penetrate the host's tissues. Rice has two layers of innate immunity against M. oryzae. The first layer is activated when pathogen-associated molecular patterns (PAMPs) are recognized on the cell surface; this is PAMP-triggered immunity (PTI) (Hanae et al., 2006; Shimizu et al., 2010; Su et al., 2012). M. oryzae effectors that inhibit PTI can in turn be recognized by rice R proteins; this second layer of defense is called effector-triggered immunity (ETI) (Liu et al., 2013). However, the mechanism by which M. oryzae infects rice may not be limited to the molecular aspect but may also extend to genetic aspects, such as RNA silencing.

RNA silencing, or RNA interference (RNAi), is a regulatory mechanism that specifically inhibits the expression of target genes. In this process, double-stranded RNA (dsRNA) is processed into small RNA (sRNA) by an RNase III enzyme. One of the sRNA strands is incorporated into an effector complex, the RNA-induced silencing complex (RISC), which is capable of degrading the target RNA, thereby reducing the mRNA level of the target gene and the subsequent protein biosynthesis (Brodersen and Voinnet, 2006). sRNA is a short, non-coding RNA that is specifically expressed in certain physiological stages of an organism and plays an important regulatory role through cleavage of its target mRNA. For example, miR393b is specifically expressed in the reproductive stage, where it cleaves target genes to inhibit flower development; miR172c is specifically expressed in the vegetative stage to inhibit the expression of LOC\_Os07g13170.1 (AP2 domain-containing protein) (Yijun et al., 2012). RNAi plays a key role in gene regulation in a variety of eukaryotes, and studies have shown that hairpin RNAs (hpRNAs) can effectively silence the expression of target genes (Chen et al., 2015).

In recent years, studies have found that RNA silencing occurs not only within organisms but also in the interaction between organisms. Some sRNAs can be transferred between interacting organisms and induce gene silencing in the counterpart; this mechanism is known as cross-kingdom RNAi (Cai et al., 2018a). Arne et al. (2013) showed that, in addition to proteins, sRNA molecules can also act as effectors to inhibit host immunity. sRNAs bind to AGO proteins and direct RISCs to complementary genes to induce gene silencing. By this mechanism, sRNAs of Botrytis cinerea can inhibit the host plant's immune response to the pathogen at an early stage of infection, demonstrating that sRNAs can act as effectors by silencing host defense-associated genes, thereby disarming plant immunity and achieving infection (Arne et al., 2013). Later, the sRNA Bc-siR37 of the pathogen was found to be delivered into plant cells to silence host immune genes (Wang et al., 2017). This cross-kingdom RNAi mechanism has thus been demonstrated in the fungal infection of plants. Based on these results, it is reasonable to infer that transboundary sRNA regulation of rice by M. oryzae may exist.

Current research on rice blast prevention is mostly focused on the internal immune regulation of rice or M. oryzae. For example, rice endogenous miRNAs, such as miR169, play regulatory roles in rice immunity against M. oryzae (Li et al., 2017). Endogenous sRNAs of M. oryzae may also play a role in the transcriptional regulation of some genes, since these sRNAs are involved in the regulation of M. oryzae stress responses when plant conditions change (Raman et al., 2013). However, little is known about the transboundary regulation of rice by M. oryzae sRNAs during the interaction of rice with the rice blast fungus. Here, this study analyzes the transboundary regulation of rice by M. oryzae sRNAs based on high-throughput data. By screening the M. oryzae sRNAs upregulated during infection and their target rice genes, and by analyzing the functional enrichment of the target rice genes, the study identifies M. oryzae sRNAs that may directly participate in the regulation of rice infection and inhibit the growth and survival of rice. The study provides a theoretical basis for disease control and yield increase in rice, as well as new ideas for studying the infection of plants by other phytopathogenic fungi.

# DATA AND METHODS

The rice blast sRNAs differentially expressed during infection were first identified through a big data-based method, and target-prediction software was then used to predict their target genes in rice. Functional enrichment analysis of the targets was performed to predict gene functions closely related to rice growth and defense, which in turn allowed us to identify the M. oryzae sRNAs that exert transboundary regulation on rice during infection. An overview of the workflow is illustrated in **Figure 1**.

## Data Source

The following data were used in the analyses: the sRNA raw data of M. oryzae cultured on a complete medium for 16 h; the mixed sRNA raw data of rice samples infected by M. oryzae for 72 h (Raman et al., 2013); the data of wild-type rice leaves 48 h after water treatment and 48 h after M. oryzae infection (Chujo et al., 2013); and the M. oryzae and rice genomic data and rice mRNA data. All of these were obtained from NCBI. Mixed sRNA raw data were available for rice samples infected by M. oryzae for 0, 72, and 96 h. M. oryzae invades the host through infection pegs formed from appressoria, and molecules that act as effectors during infection take a certain time to be expressed. In the 0 h sample, the infection time is too short for many molecules to be expressed, so this sample is not suitable for use. In the 96 h sample (LMg96), the infection has lasted so long that some molecules have been degraded, so it is not suitable either. In contrast, in the 72 h sample, the expression of these molecules is the most active, so this sample is the most suitable for subsequent analysis. The data downloaded from NCBI are in SRA format and had to be converted to FASTQ format before they could be processed.

## Data Preprocessing

Data preprocessing is a key step in data analysis and has a significant impact on the effectiveness of subsequent analysis. At present, the preprocessing of sRNA high-throughput data is mainly divided into the following steps: filtering, alignment, and normalization. First, high-quality data are obtained by removing adapters and low-quality reads. Second, the data are mapped to the genome to obtain the types and corresponding counts of sRNAs that can be mapped. Finally, the counts of sRNAs are normalized, and the standardized counts are used to find differentially expressed sRNAs, or to analyze the distribution, variance, and bias of the data (Tam et al., 2015).

#### Adapter and Quality Information

In the acquired high-throughput sequencing data, each sRNA sequence is of the same length because Illumina performs adapter ligation during library preparation (Tam et al., 2015). Therefore, almost every sequence obtained contains an adapter sequence of varying length. To obtain the correct sRNA sequences, these adapters should be removed.

The existing adapter removal tools are mainly FASTX-toolkit, Cutadapt, and Trimmomatic. In this work, Cutadapt<sup>1</sup> was used to remove the adapters, which requires the knowledge of the adapter sequence used for the high-throughput data. The M. oryzae sRNA and the mixed sRNA data used in this work were high-throughput sequencing data based on the Illumina platform and incorporated the international standard adapter "TCGTATGCCGTCTTCTGCTTGT". In the resultant FASTQ file after removal of the adapter by Cutadapt, the sRNA sequences no longer contain the adapter. The process of the removal of the adapters in sRNAs is illustrated in **Figure 2**.
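To make the adapter-removal step concrete, the following is a minimal Python sketch of 3' adapter trimming. It is not Cutadapt's alignment algorithm; the function name `trim_adapter` and the `min_overlap` and `max_mismatches` parameters are illustrative assumptions, and only the adapter sequence comes from the text.

```python
# Adapter sequence quoted in the text.
ADAPTER = "TCGTATGCCGTCTTCTGCTTGT"


def trim_adapter(read, adapter=ADAPTER, min_overlap=5, max_mismatches=0):
    """Return the read with a full or partial 3' adapter removed.

    Scans from left to right for the first position at which the rest of
    the read matches the start of the adapter over at least `min_overlap`
    bases with at most `max_mismatches` mismatches. This is a simplified
    stand-in for a real trimming tool (hypothetical helper).
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= max_mismatches:
            return read[:start]
    return read  # no adapter found; leave the read unchanged


# Usage: a 22-nt insert followed by the first 14 bases of the adapter.
insert = "TGAGGTAGTAGGTTGTATAGTT"
trimmed = trim_adapter(insert + ADAPTER[:14])
```

Allowing mismatches makes trimming more sensitive but also increases the chance of cutting genuine sRNA sequence, which is the trade-off behind a strict mismatch setting.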

In the data source article, although erroneous sequencing data were filtered out by a script, no quality control was performed on the data (Raman et al., 2013). The length distribution of the preprocessed data is shown in **Figure 3**. After adapter removal, the length distribution of the sRNAs in the mixed (infection) data shows two peaks between 21 and 27 nt, indicating high quality (**Figure 3A**). However, an error appears in the processed M. oryzae sRNA data (**Figure 3B**), even though the M. oryzae data and the infection data originated from the same source and used the same adapter. This error may be due to the strict parameter settings used during adapter removal: adapters with more than three mismatches were not removed. For the sake of experimental rigor, the mismatch parameter was set strictly so as to minimize false positives in adapter identification. When length control was performed later, the sequences corresponding to this error were discarded.

#### Data Mapping to the M. oryzae Genome

The main objective of this work is to identify M. oryzae sRNAs with potential cross-kingdom regulation; it is thus necessary to find the M. oryzae sRNA sequences that are differentially expressed before and during host infection. However, some contamination is mixed into both the M. oryzae sRNA data and the mixed data during infection. In addition to the M. oryzae sRNA sequences and contamination, many rice sRNA sequences also appear in the data. The adapter-free M. oryzae sRNA data and mixed sRNA data of infection were therefore mapped to the M. oryzae genome to obtain the portion of the sRNAs belonging to M. oryzae. Two tools, Bowtie and SAMtools, were used in this step. First, the index library of the M. oryzae genome obtained from NCBI was constructed and the index package was obtained, both using Bowtie (Jun et al., 2012). Then the FASTQ file of the M. oryzae sRNAs and the FASTQ file of the mixed sRNAs of infection were mapped to the M. oryzae genome. Matching was strict: the mismatch parameter was set to 0, and all matching information was output as a SAM file. SAMtools was then used to process the generated SAM file, filter out the redundancies, and yield a FASTQ file that retains only the matching sequences (Li et al., 2009). The process is shown in **Figure 4**.
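The effect of zero-mismatch mapping can be illustrated with a toy Python sketch that keeps only reads occurring exactly in a genome string on either strand. This is a conceptual stand-in, not how Bowtie works internally (Bowtie uses an index rather than substring search), and the function names are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def keep_exact_matches(reads, genome):
    """Keep reads that occur in the genome with zero mismatches.

    A read is kept if it, or its reverse complement, is an exact
    substring of `genome`. Substring search is only practical for toy
    inputs; a real aligner should be used for real genomes.
    """
    return [
        read for read in reads
        if read in genome or reverse_complement(read) in genome
    ]


# Usage on a toy genome: the second read matches only on the reverse strand.
toy_genome = "ATGCCGTAAGCTT"
kept = keep_exact_matches(["CCGTAA", "GCTTAC", "GGGGGG"], toy_genome)
```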

#### Length Control and Sequence Expression Statistics

In order to find the differentially expressed M. oryzae sRNAs, we need to know the expression level of each sRNA sequence before and during infection, that is, the number of each sequence in the data file. However, since each sequence may match multiple locations of the genome during the mapping process, and all matches will eventually be output to the result file, resulting in an increase in the number of sequences in the resultant file, it is inaccurate to count the expression level in the mapped result file. To solve this problem, the following measures were taken:

First, a script was used to extract the sequences from the FASTQ file of the M. oryzae sRNAs from which the adapter had been removed but which had not yet been mapped to the genome, yielding a text file containing only the sequences. Then, a script was used to control the length and count the number of occurrences (expression amount) of each sequence. Since the length of miRNA (an sRNA that inhibits gene expression) is between 18 and 25 nt, it is believed that the length of the M. oryzae sRNAs that target rice genes and produce transboundary regulation in rice should also be in this range. Therefore, only the M. oryzae sRNA sequences ranging from 18 nt to 25 nt in length were retained. Finally, a table file (herein referred to as file A) containing the sRNA sequences, the lengths of the sequences, and the expression levels of the sequences was obtained.

<sup>1</sup>https://cutadapt.readthedocs.io/en/stable/

The FASTQ file of M. oryzae sRNA data mapped to the M. oryzae genome was handled by a script to obtain a text file containing only the sequences; then, a script was used to control the length and to remove duplicates to obtain a file, which only contained sequences with a length between 18 and 25 nt (herein referred to as file B, which does not contain length and expression information, i.e., only sequences, and each sequence appears only once).

Finally, file A and file B were processed by a script as follows: if the sequence of a line in file A appears in file B, the line is retained; if the sequence of a line in file A does not appear in file B, the line is discarded.

The above process is illustrated in **Figure 5**. In the final file, each sequence can be mapped to the genome of M. oryzae, and the expression amount is accurate. The mixed sRNA data file of infection was also processed by the above method. Finally, the obtained M. oryzae sRNAs during infection could be mapped to the genome of M. oryzae, and the expression amount was accurate.
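The file A / file B procedure described above can be sketched in Python as follows. The function names and the in-memory data structures are assumptions made for illustration; the original scripts are not shown in the article.

```python
from collections import Counter


def build_file_a(trimmed_reads, min_len=18, max_len=25):
    """Count occurrences of each adapter-free read of allowed length.

    The counts come from the unmapped reads, so multi-mapping does not
    inflate them.
    """
    return Counter(r for r in trimmed_reads if min_len <= len(r) <= max_len)


def build_file_b(mapped_reads, min_len=18, max_len=25):
    """Unique sequences of allowed length that mapped to the genome."""
    return {r for r in mapped_reads if min_len <= len(r) <= max_len}


def filter_a_by_b(file_a, file_b):
    """Rows (sequence, length, count) of file A whose sequence is in file B."""
    return [(seq, len(seq), count)
            for seq, count in sorted(file_a.items()) if seq in file_b]


# Usage: the 18-nt read maps three times but keeps its true count of 2.
reads = ["A" * 18, "A" * 18, "C" * 20, "G" * 10, "T" * 30]
mapped = ["A" * 18, "A" * 18, "A" * 18, "G" * 10]
table = filter_a_by_b(build_file_a(reads), build_file_b(mapped))
```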

#### Elimination of Rice sRNA

Some M. oryzae sRNAs have the same sequences as sRNAs in rice; thus, mapping the data to the M. oryzae genome alone cannot guarantee that all the sRNAs obtained belong to M. oryzae. Some rice sRNAs may be mistaken as belonging to M. oryzae because their sequences are identical to those of some M. oryzae sRNAs. Such misidentification would introduce errors into subsequent experiments. To address this issue, the sRNAs that could be mapped to the M. oryzae genome were then mapped to the rice genome, and those that could be mapped to the rice genome were removed; thus, the final sRNAs came solely from the M. oryzae genome.

## A Normalization Method Based on 3/4 Quantile Data

To find the M. oryzae sRNAs differentially expressed during infection, the M. oryzae expression data before and during infection must be normalized to make them comparable. The number of M. oryzae sRNA species during infection is much smaller than that before infection. If the counts-per-million normalization method were used, the magnitude of the change in expression during infection would, after normalization, be much larger than that before infection, making it impossible to accurately find the sRNAs with a substantial increase in expression level during infection. To solve this problem, we adopted a normalization method based on the 3/4 quantile. First, the sample data were sorted by expression level from high to low; then, the sRNA ranked at the 3/4 position was obtained. The expression amount of this sRNA represents the lower level of expression in the sample, and the expression amounts of the other sRNAs were converted into multiples of it. Because all expression levels in a sample are converted to multiples of that sample's lower expression level, this method avoids the influence of both different cardinalities and different numbers of species between samples, thereby making different samples comparable. This normalization method was used to process the data of M. oryzae sRNAs before and during infection. Then, the 6,100 sRNAs that appeared both before and during infection were extracted to compare their changes in expression level.
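A minimal Python sketch of this normalization is given below. How the "3/4 position" is indexed is not stated in the text; the sketch assumes the reference is the value at rank ceil(0.75 × n) in the descending order, and the function name is hypothetical.

```python
import math


def quantile_normalize(counts, fraction=0.75):
    """Express every count as a multiple of the 3/4-ranked count.

    `counts` maps each sRNA sequence to its raw count in one sample.
    Assumption: the reference is the count at 1-based rank
    ceil(fraction * n) after sorting from high to low.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    ranked = sorted(counts.values(), reverse=True)
    rank = max(1, math.ceil(fraction * len(ranked)))
    reference = ranked[rank - 1]
    return {seq: count / reference for seq, count in counts.items()}


# Usage: with four sRNAs the third-ranked count (20) is the reference.
normalized = quantile_normalize({"a": 100, "b": 50, "c": 20, "d": 10})
```

Because each sample is scaled by its own reference value, two samples with very different numbers of sRNA species end up on a comparable scale.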

## The Selection of Differentially Expressed sRNAs

Statistics showed that the species of M. oryzae sRNAs before and during infection did not completely coincide (**Table 1**). There were 87,314 species of M. oryzae sRNAs before infection and 11,033 species during infection. A total of 6,100 species were present in both stages, and 4,933 species were newly produced during infection (**Figure 6**). These statistics show that most of the M. oryzae sRNAs disappeared during rice infection. To find M. oryzae sRNAs with a significant increase in expression during infection, the 11,033 M. oryzae sRNA species were divided into two parts for analysis.

The first part is the 6,100 species of M. oryzae sRNAs present both before and during infection. To clearly observe the changes in sRNA expression levels, the normalization method based on the 3/4 quantile was applied to the data of M. oryzae sRNAs before and during infection; then, the 6,100 sRNAs present in both stages were extracted to compare their changes in expression level. Since sRNAs regulate target genes by inhibiting their expression, we screened only for sRNAs whose expression was significantly higher than before infection and higher than that of the other sRNAs after infection. The increase in expression level was measured by the growth rate, defined in Equation (1):

$$\mathrm{Growth\_Rate} = \frac{\mathrm{count}_{\mathrm{after}} - \mathrm{count}_{\mathrm{before}}}{\mathrm{count}_{\mathrm{before}}} \tag{1}$$

All the values in Equation (1) were standardized. For the sRNAs whose expression grew during infection, the results of screening by expression level after infection and by growth rate of expression are shown in **Figures 7A,B**, respectively. Both panels illustrate the screening conditions, and the distribution of the screening results is shown in **Figure 7C**.

Among the sRNAs with positive growth, fewer than 50% had an expression level greater than or equal to 9 during infection (**Figure 7A**), and fewer than 50% had a growth rate greater than or equal to 2 (**Figure 7B**). Only a small number (220) of the sRNAs met both criteria. These 220 sRNAs can be considered the most clearly differentially expressed. Because four of the 220 sRNAs were much more highly expressed than the others after standardization, the expression of these four sRNAs and that of the other 216 sRNAs are shown separately in **Figures 8**, **9**, respectively. Comparing the expression level during infection with that before infection shows that the growth ratio of these 220 sRNAs during infection is extremely high (**Figures 8**, **9**). From the top of the blue portion of each column and the top of the whole column, it is obvious that the expression levels are dramatically higher than those before infection.

The second part is the 4,933 M. oryzae sRNAs newly produced during infection. These sRNAs were extracted from the standardized data and sorted by expression level from high to low. A total of 146 sRNAs were selected according to their expression levels.
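The two screening criteria can be written as a short Python sketch that applies Equation (1) to normalized expression values. The thresholds of 9 and 2 are those quoted above; the function names and dictionary-based inputs are illustrative assumptions.

```python
def growth_rate(count_before, count_after):
    """Equation (1): relative increase in normalized expression."""
    return (count_after - count_before) / count_before


def screen_upregulated(before, after, min_expression=9.0, min_growth=2.0):
    """Select sRNAs present in both stages that pass both thresholds.

    `before` and `after` map sequences to normalized expression values.
    An sRNA is kept if its expression during infection is at least
    `min_expression` and its growth rate is at least `min_growth`.
    """
    selected = []
    for seq in before.keys() & after.keys():
        rate = growth_rate(before[seq], after[seq])
        if after[seq] >= min_expression and rate >= min_growth:
            selected.append(seq)
    return sorted(selected)


# Usage: only s1 is both highly expressed and strongly increased.
before = {"s1": 3.0, "s2": 5.0, "s3": 1.0, "s4": 2.0}
after = {"s1": 12.0, "s2": 10.0, "s3": 2.5, "s5": 40.0}
hits = screen_upregulated(before, after)
```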

A total of 366 M. oryzae sRNAs screened in the above two parts were used as differentially expressed sRNA for the subsequent analysis.

## SVM Model for Predicting Differential Expression of M. oryzae sRNAs

#### Selection of Positive and Negative Samples

The SVM is a classic supervised machine learning model. An SVM model was used to predict differentially expressed and non-differentially expressed sRNAs. A total of 366 differentially expressed M. oryzae sRNAs were used as positive samples and were removed from the set of all M. oryzae sRNAs. The negative samples were then randomly selected from the remaining M. oryzae sRNAs, and the number of sRNAs in the negative sample was twice that of the positive sample.
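As a rough illustration of this sampling scheme, the following Python sketch draws twice as many negatives as positives from the sRNAs that are not positives. The function name, the fixed random seed, and the use of Python's `random` module are assumptions; the article does not describe how the random selection was implemented.

```python
import random


def sample_negatives(all_srnas, positives, ratio=2, seed=0):
    """Randomly pick `ratio` times as many negatives as positives.

    Positives are removed from the candidate pool first. A fixed seed
    (an assumption made here) keeps the selection reproducible.
    """
    positive_set = set(positives)
    candidates = sorted(set(all_srnas) - positive_set)
    rng = random.Random(seed)
    return rng.sample(candidates, ratio * len(positive_set))


# Usage on a toy pool of ten sRNA identifiers with two positives.
pool = [f"s{i}" for i in range(10)]
positive_ids = ["s0", "s1"]
negative_ids = sample_negatives(pool, positive_ids)
```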

#### Feature Extraction and Normalization

TABLE 1 | Statistical information on sRNA species before and during infection.

The negative and positive sample labels were set to 0 and 1, respectively. The positive and negative samples were collected into one file. The RNAfold tool<sup>2</sup> was used to predict the secondary structure of each sequence to obtain its free energy. The sequences and their free energies were then used for feature extraction. The extracted features were the bases at positions 1–25 (sequences shorter than 25 nt were padded with N), along with the length, GC percentage, free energy, 5′ mo\_base, 5′ di\_base, 3′ mo\_base, 3′ di\_base, and motif of each sequence; features represented by letters were encoded in binary. In addition, the features were normalized by the min-max normalization method shown in Equation (2):

$$y = \frac{x - \min}{\max - \min} \tag{2}$$

where x is the original value, y is the normalized result, and min and max are the minimum and maximum values of the feature. Each feature value is thus scaled to between 0 and 1.
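The following Python sketch shows one way the sequence features could be encoded and scaled. The exact binary encoding is not specified in the text, so one-hot encoding over the alphabet A, C, G, T, N is an assumption, as are the function names; `min_max_scale` implements Equation (2).

```python
def one_hot_sequence(seq, length=25, alphabet="ACGTN"):
    """Encode a sequence as concatenated one-hot vectors.

    Sequences shorter than `length` are padded with N, as in the text.
    One-hot encoding is an assumed form of the binary encoding.
    """
    padded = seq.upper().ljust(length, "N")[:length]
    bits = []
    for base in padded:
        bits.extend(1 if base == letter else 0 for letter in alphabet)
    return bits


def gc_percentage(seq):
    """Percentage of G and C bases in the sequence."""
    return 100.0 * sum(base in "GC" for base in seq.upper()) / len(seq)


def min_max_scale(values):
    """Equation (2): scale the values of one feature to [0, 1].

    A constant feature is mapped to 0.0 (an assumption, to avoid
    division by zero).
    """
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Usage: scale the length feature of three sRNAs of 18, 25 and 20 nt.
scaled_lengths = min_max_scale([18, 25, 20])
```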

#### Model Training

In this step, 3/4 of the positive and negative samples were extracted as the training set. Grid search and fivefold cross-validation were used to tune the parameters (Zhang et al., 2009). The Radial Basis Function (RBF) kernel, given in Equation (3), was used to train the model. A kernel function maps a problem that is linearly inseparable in a low-dimensional space to a high-dimensional space, thus making the problem linearly separable. Let ⟨w′, x′⟩ be the inner product in the high-dimensional space, where x′ is the high-dimensional vector transformed from x and w′ is the constant obtained by transforming the constant w of the low-dimensional space. If there is a function K(w, x) such that g(x) = K(w, x) + b equals f(x′) = ⟨w′, x′⟩ + b, then K(w, x) is a kernel function. The RBF kernel satisfies this condition.

$$K(\mathbf{x}, \mathbf{x}') = \exp\left(\frac{-||\mathbf{x} - \mathbf{x}'||^2}{2\sigma^2}\right) \tag{3}$$
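Equation (3) can be evaluated directly, as in the minimal Python sketch below. The function name and the default value of `sigma` are assumptions; note that many SVM libraries parameterize the same kernel with gamma = 1/(2σ²) instead of σ.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Equation (3): exp(-||x - y||^2 / (2 * sigma^2))."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Usage: identical vectors give 1, and the value decays with distance.
same = rbf_kernel([0.2, 0.5], [0.2, 0.5])
apart = rbf_kernel([1.0, 0.0], [0.0, 0.0])
```

A smaller σ makes the kernel value fall off more quickly with distance, which is why σ (or gamma) is one of the parameters usually tuned by grid search.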

<sup>2</sup>http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi

## Target Gene Prediction

To identify M. oryzae sRNAs that may play a role in the regulation of rice growth and defense, rice mRNAs were used as targets to predict the target genes of the 366 differentially expressed M. oryzae sRNAs. In this step, TAPIR, a target gene prediction tool (Xie et al., 2012), was used. For target gene prediction, the sRNA input must be a FASTA file, and the bases in the sRNA sequences should be A, U, G, and C; thus, the sequences of these 366 sRNAs were extracted, and the base T was converted to U by a script. The sequence files were then converted to FASTA files before target gene prediction with TAPIR.
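The conversion step can be sketched in a few lines of Python. The record naming scheme (`sRNA_1`, `sRNA_2`, ...) and the function name are assumptions made for illustration.

```python
def to_rna_fasta(sequences, prefix="sRNA"):
    """Convert DNA-alphabet reads to RNA and format them as FASTA text.

    Each sequence is upper-cased, T is replaced by U, and records are
    named <prefix>_1, <prefix>_2, ... (hypothetical naming scheme).
    """
    lines = []
    for index, seq in enumerate(sequences, start=1):
        lines.append(f">{prefix}_{index}")
        lines.append(seq.upper().replace("T", "U"))
    return "".join(line + "\n" for line in lines)


# Usage: two short reads become two FASTA records with U instead of T.
fasta_text = to_rna_fasta(["ATGC", "ttag"])
```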

## Selection of the Core Node

For each predicted target gene, the corresponding GeneID was found in the rice mRNA file. After de-duplication, a total of 1,121 GeneIDs were obtained. To determine whether the target rice genes are differentially expressed, we performed an additional analysis. We obtained the data of wild-type rice leaves 48 h after water treatment and the data of wild-type rice leaves 48 h after M. oryzae infection from NCBI (Chujo et al., 2013), compared the two sets of data, and screened the target genes. Considering that the infection time of the comparison data (48 h) is shorter than that of our sRNA data (72 h), and that the plant may produce some stress response for self-protection, we retained target genes with reduced expression levels after infection and target genes with a slight increase in expression levels (less than 0.2) after infection. A total of 685 target genes were retained, of which 69.3% had decreased expression. All 685 GeneIDs were imported into the STRING database<sup>3</sup>, where 586 GeneIDs could be identified and the corresponding interaction network was given. The tabular data of the interaction network obtained from the STRING database (without node annotations) are shown in **Supplementary Table S1**, which provides the two nodes corresponding to each edge of the network, the proteins corresponding to the nodes, and the credibility score of the relationship between the nodes.

Because of the huge number of nodes, it is difficult to locate the obvious enrichment. Therefore, it is necessary to select the core nodes of the network and to find the obvious enrichment and pathways through the interaction network of the core nodes. To reach this goal, the core nodes of the network were selected through the following two key steps:

Step 1: Obtain a subgraph based on the credibility score of the relationship between nodes in the network. In the interaction network, the smaller the credibility score of the relationship between two nodes, the less likely the two nodes are to interact. Therefore, the score threshold was set to 0.6, which means that if the score is no less than 0.6, the interaction between the corresponding nodes is considered authentic. Following this criterion, only the edges with a score of no less than 0.6 were selected. The graph composed of these edges is a subgraph with higher credibility within the entire interaction network.

Step 2: Select the core nodes based on node degree. In an interaction network, the higher the degree of a node, the more nodes it interacts with. Based on the subgraph obtained in the first step, the degree of each node in the subgraph was counted, and the nodes were sorted by degree from large to small. Finally, a total of 50 nodes were selected as core nodes.
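The two steps above can be sketched in Python as follows. The edge representation as `(node_a, node_b, score)` tuples, the alphabetical tie-breaking, and the function name are assumptions; the threshold of 0.6 and the choice of 50 nodes come from the text.

```python
from collections import Counter


def select_core_nodes(edges, min_score=0.6, top_n=50):
    """Pick the highest-degree nodes of the high-confidence subgraph.

    `edges` is an iterable of (node_a, node_b, score) tuples, assumed to
    list each undirected edge once. Edges scoring below `min_score` are
    dropped (Step 1); nodes are then ranked by degree (Step 2), with ties
    broken alphabetically (an assumption, for reproducibility).
    """
    degree = Counter()
    for node_a, node_b, score in edges:
        if score >= min_score:
            degree[node_a] += 1
            degree[node_b] += 1
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:top_n]]


# Usage on a toy network: the g1-g4 edge is dropped for its low score.
toy_edges = [
    ("g1", "g2", 0.9), ("g1", "g3", 0.7), ("g1", "g4", 0.5),
    ("g2", "g3", 0.6), ("g4", "g5", 0.95),
]
core = select_core_nodes(toy_edges, top_n=3)
```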

# RESULTS

## The Core Node's Regulation Network

We re-imported the GeneIDs of the 50 core nodes into the STRING database; the resultant interaction network of these 50 core nodes is listed in **Supplementary Table S2**. The p-value of the network is 1.59e-8, indicating that the network is highly reliable. The functional enrichment results of the network comprised 15 Biological Processes (GO), 5 Molecular Functions (GO), 5 Cellular Components (GO), 2 KEGG Pathways, 5 PFAM Protein Domains, and 5 INTERPRO Protein Domains and Features. We mainly analyzed the 15 Biological Processes (GO) and 2 KEGG Pathways. For all the results given, the false discovery rate was less than 0.05. The BP and KEGG enrichment results are shown in **Supplementary Table S3** and **Table 2**. Ranked by false discovery rate from low to high, the 15 Biological Processes (GO) are the cellular protein modification process, chromatin organization, organelle organization, chromosome organization, cellular process, chromatin remodeling, chromatin modification, phosphate-containing compound metabolic process, primary metabolic process, cellular metabolic process, defense response, organic substance metabolic process, response to stimulus, the mitogen-activated protein kinase (MAPK) cascade, and protein phosphorylation. The two KEGG pathways are inositol phosphate metabolism and the phosphatidylinositol signaling system.

<sup>3</sup>https://version-10-5.string-db.Org/

## Regulatory Pathways Associated With Rice Growth and Defense

Among the 15 biological processes, the defense response and the response to stimulus directly affect the ability of plants (here, rice) to cope with unfavorable external factors, thereby affecting rice survival. When plants are stimulated by pathogenic bacterial infection, injury, temperature, drought, salinity, permeability, ultraviolet radiation, ozone, or reactive oxygen species, MAPK is activated. After translation, it is regulated by phosphorylation (Zhang and Klessig, 2001). Therefore, the MAPK cascade and protein phosphorylation are also closely related to rice's ability to cope with life-threatening factors in its growth environment.

#### TABLE 2 | The enriched KEGG Pathways.

For the top five GO biological processes, the gene sets enriched in the pathways were imported into the DAVID database to find the lower-level functional pathways corresponding to these five pathways. The results show that the lower-level regulatory pathways of these five pathways include biological processes for defense response and biological processes that positively regulate growth rate. In the lower-level regulation of chromatin organization, organelle organization, chromosome organization, and cellular processes, there are three biological processes: response to temperature stimulation, cell proliferation, and multicellular biological development. In other words, the top five functional enrichments are closely related to rice defense and growth. Of the two KEGG pathways, inositol phosphate metabolism is closely related to the biological sensing of extracellular stimuli.

Nine biological processes related to rice defense response and growth and one KEGG pathway were found in our work. Interestingly, these nine biological processes already contain all the genes in the core nodes that can be enriched into the 15 biological processes. Further analysis demonstrates that the enriched genes display close interactions (**Figure 10**). These pathways may be used to identify sRNA effectors that facilitate rice infection by the pathogen.

#### Discovery of M. oryzae sRNAs as Potential Effectors in the Infection Process

Through the nine biological processes and one KEGG pathway related to rice growth and defense described above, the IDs of the proteins enriched in these pathways can be found (**Supplementary Table S3** and **Table 2**). These proteins correspond to 13 genes in the core nodes (**Supplementary Table S2**), and these genes are also distributed in the interaction networks of the 50 core nodes (**Figure 11**). From the results of target gene prediction, 14 M. oryzae sRNAs targeting these 13 rice genes were found. These 14 sRNA sequences can be obtained from the Magnaporthe Next-Gen Sequence sRNA database<sup>4</sup> (**Table 3** and **Supplementary Table S4**). A comparison of the LMg0 and LMg72 columns shows that these 14 sRNAs are actively expressed at 72 h post infection, which further supports our view that these sRNAs may serve as effectors that facilitate rice infection by the pathogen.

<sup>4</sup>https://mpss.danforthcenter.org/dbs/index.php?SITE=mg\_sRNA

TABLE 3 | The resultant 14 sRNA sequences in the Magnaporthe Next-Gen Sequence sRNA database (Clip version).

Len, length (nt); Sum, the sum of abundance; Average, the average of abundance; Max, maximum abundance; Min (>0), minimum abundance; LMg0, the name of the sample; LMg72, the name of the sample.

# The SVM Model Prediction Results

We used the remaining 1/4 of the positive and negative samples as the test set. The accuracy of the final model prediction reached 83%. The Receiver Operating Characteristic (ROC) curve is shown in **Figure 12**. The further the curve is from the diagonal line, the better the model performs. The Area Under the Curve (AUC) summarizes the curve in a single value: the larger the AUC, the better the model discriminates between positives and negatives. It can be seen from the figure that the curve is far from the diagonal line and the AUC value is 0.85. Thus, the model can be used to select the differentially expressed sRNAs after obtaining the sRNAs mapped to the pathogen genome. The selection of differentially expressed sRNAs in other species of fungal plant pathogens can also refer to this model.
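As an illustration of how an AUC value relates to the ranking of test samples, the following minimal sketch computes the AUC as the fraction of positive/negative pairs in which the positive sample receives the higher score. The labels and scores are hypothetical and are not the data of this study; the function name `roc_auc` is likewise only illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical decision scores for a held-out test set (not the study's data).
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
auc = roc_auc(y_true, y_score)
print(f"AUC = {auc:.4f}")
```

An AUC of 0.5 corresponds to the diagonal line (random ranking), and 1.0 to a perfect separation of positives and negatives.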

# DISCUSSION

# The sRNAs Involved in the Transboundary Regulation

Accumulated evidence indicates that transboundary regulation of host plants by pathogens exists (Shimizu et al., 2010; Su et al., 2012). Based on these findings, the study assumes that this pathogenic mechanism also exists during rice infection by M. oryzae. However, little is known about whether sRNA affects rice growth or defense. This study found that during rice infection, although most of the M. oryzae sRNAs disappeared, some sRNAs remained in the infected rice, and some were upregulated. All the M. oryzae sRNAs existing in the rice tissue have the potential to interact with the host rice. Based on the characteristics of sRNA inhibition of gene expression, only the M. oryzae sRNAs upregulated in infection were considered in this process. The sRNA transport and regulation between plants and pathogens is bidirectional. After the pathogen invades the plant, the expression levels of some sRNAs in the host plant increase, in turn inhibiting pathogen gene expression to resist the invasion (Cai et al., 2018b). In this work, the study omitted the analysis of the roles of rice sRNAs in order to focus on the pathogenic side. The roles of M. oryzae sRNAs with decreased expression levels during infection also need to be further investigated.

# The Limitations of Target Gene Prediction Software

For the M. oryzae sRNAs with increased expression levels in infection, the study predicted their target genes in rice and directly analyzed the functional enrichments of these target genes. The results in this study are based on the target gene prediction software and may therefore be affected by the algorithms in the software. The results obtained may be incomplete because the data collected are based on only one target gene prediction software. If multiple software packages for target gene prediction were used for a comprehensive analysis, more target genes might be predicted. As the actual targeting relationships in the organism are complicated, the prediction results given by the software may not be accurate. The screening for the actual target genes should give more credible results if the predicted genes can be further validated with experimental data.

#### The Method of Selecting Core Nodes

After obtaining the interaction network of all the target genes, we screened out the subgraphs based on the credibility score of the relationships between the nodes and selected the core nodes according to the degree of the nodes in the subgraph. This method of screening the core nodes is simple, which points to the need for more efficient or universal algorithms to select the core nodes.
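The degree-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the edge list, the score threshold `min_score`, the cut-off `top_k`, and the function name are all hypothetical.

```python
from collections import defaultdict


def select_core_nodes(edges, min_score, top_k):
    """Keep edges with score >= min_score, then rank nodes by degree."""
    neighbours = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Sort by descending degree; break ties by name for reproducibility.
    ranked = sorted(neighbours, key=lambda n: (-len(neighbours[n]), n))
    return [(n, len(neighbours[n])) for n in ranked[:top_k]]


# Hypothetical interaction edges: (gene_a, gene_b, credibility score).
edges = [
    ("g1", "g2", 0.95), ("g1", "g3", 0.91), ("g1", "g4", 0.92),
    ("g2", "g3", 0.97), ("g4", "g5", 0.40), ("g3", "g5", 0.93),
]
core = select_core_nodes(edges, min_score=0.9, top_k=2)
print(core)
```

Degree is only one possible centrality measure; replacing the sort key would be the natural place to try alternatives.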

# The Application of Machine Learning Models

The authors trained the SVM model to predict differentially expressed sRNAs. However, the predicted results of this model are only sRNAs with significantly increased expression levels in the infection. Although sRNAs with decreased expression levels may also play an important role in the infection process, this model does not apply to the down-regulated sRNAs. In addition, there are multiple machine learning models; many of them are suitable for classifying samples. It is not known which model can achieve the best results. Different models can be used to predict differentially expressed sRNAs, and their results can be compared to determine which machine learning model can achieve the best prediction.

## The Significance of Finding Effector sRNA

Using the high-throughput data method presented in this study, 14 M. oryzae sRNAs were identified, which may act as effectors to silence rice genes and cause disease. The data used in this work were experimentally validated, and the authenticity of these 14 sRNAs was confirmed in the Magnaporthe Next-Gen Sequence sRNA database. However, this study is based on the hypothesis that a cross-kingdom RNAi mechanism exists between M. oryzae and rice, and this hypothesis requires further biological experiments to verify. Because this mechanism may exist during the infection of plants by other fungi, this study lays a foundation for the discovery of sRNA effectors in other fungi. In addition, when various phytopathogenic fungi infect plants, little is known about whether their sRNA effectors share a certain similarity. Clarifying the similarities and functions of the sRNA effectors from diverse fungal pathogens may constitute an intriguing research direction. If the relationships among these fungal sRNAs are determined, the discovery will provide an important theoretical basis for new ideas on the prevention and control of plant diseases.

### REFERENCES

Arne, W., Ming, W., Feng-Mao, L., Hongwei, Z., Zhihong, Z., Isgouhi, K., et al. (2013). Fungal small RNAs suppress plant immunity by hijacking host RNA interference pathways. Science 342, 118–123. doi: 10.1126/science.1239705

# DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here:

https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRX214117; https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRX214123; https://www.ncbi.nlm.nih.gov/Traces/wgs/AACU03?val=AACU03.1;

https://www.ncbi.nlm.nih.gov/Traces/wgs/AACU03?val=LVCG01.1;

https://www.ncbi.nlm.nih.gov/nuccore/?term=Oryza+sativa; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM973470;

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM973471.

# AUTHOR CONTRIBUTIONS

YL and HZ conceived and directed the project. SL, HC, and BZ obtained the raw data and interpreted the data. HZ, SL, MZ, HC, and ZL conducted the data analysis and interpreted the results. HZ, SL, HC, and ZL helped to design the study and reviewed the data. HZ, SL, BZ, and Q-MQ wrote and/or edited the manuscript. All authors drafted and reviewed the manuscript and approved it for publication.

# FUNDING

This research was supported by the National Natural Science Foundation of China (Grant No. 61471181), the Natural Science Foundation of Jilin Province (Grant Nos. 20140101194JC and 20150101056JC), and Jilin University Student Innovation Project (2017S039).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00296/full#supplementary-material

Brodersen, P., and Voinnet, O. (2006). The diversity of RNA silencing pathways in plants. Trends Genet. 22, 268–280. doi: 10.1016/j.tig.2006.03.003

Cai, Q., He, B., Kogel, K. H., and Jin, H. (2018a). Cross-kingdom RNA trafficking and environmental RNAi — nature's blueprint for modern crop protection strategies. Curr. Opin. Microbiol. 46, 58–64. doi: 10.1016/j.mib.2018.02.003


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Liu, Chang, Zhan, Qin, Zhang, Li and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Positive Causal Influence of IL-18 Levels on the Risk of T2DM: A Mendelian Randomization Study

He Zhuang<sup>1†</sup>, Junwei Han<sup>2†</sup>, Liang Cheng<sup>2</sup>\* and Shu-Lin Liu<sup>1,3</sup>\*

<sup>1</sup> Systemomics Center, College of Pharmacy, and Genomics Research Center (State-Province Key Laboratories of Biomedicine-Pharmaceutics of China), Harbin Medical University, Harbin, China, <sup>2</sup> College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China, <sup>3</sup> Department of Microbiology, Immunology and Infectious Diseases, University of Calgary, Calgary, AB, Canada

A large number of clinical studies have shown that interleukin-18 (IL-18) plasma levels are positively correlated with the pathogenesis and development of type 2 diabetes mellitus (T2DM), but it remains unclear whether IL-18 causes T2DM, primarily due to the influence of reverse causality and residual confounding factors. Genome-wide association studies have led to the discovery of numerous common variants associated with IL-18 and T2DM and opened unprecedented opportunities for investigating possible associations between genetic traits and diseases. In this study, we employed a two-sample Mendelian randomization (MR) method to analyze the causal relationships between IL-18 plasma levels and T2DM using IL18-related SNPs as genetic instrumental variables (IVs). We first selected eight SNPs that were significantly associated with IL-18 but independent of T2DM. We then used these SNPs as IVs to evaluate their effects on T2DM using the inverse-variance weighted (IVW) method. Finally, we conducted sensitivity analysis and MR-Egger regression analysis to evaluate the heterogeneity and pleiotropic effects of each variant. The results based on the IVW method demonstrate that high IL-18 plasma levels significantly increase the risk of T2DM, and no heterogeneity or pleiotropic effects appeared after the sensitivity and MR-Egger analyses.

Keywords: interleukin-18 levels, type 2 diabetes mellitus, causal effect, Mendelian randomization, genome-wide association studies

#### Edited by:

Arun Kumar Sangaiah, VIT University, India

#### Reviewed by:

Fei Guo, Tianjin University, China; Leyi Wei, The University of Tokyo, Japan

#### \*Correspondence:

Liang Cheng liangcheng@hrbmu.edu.cn; Shu-Lin Liu slliu@hrbmu.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 10 January 2019; Accepted: 19 March 2019; Published: 05 April 2019

#### Citation:

Zhuang H, Han J, Cheng L and Liu S-L (2019) A Positive Causal Influence of IL-18 Levels on the Risk of T2DM: A Mendelian Randomization Study. Front. Genet. 10:295. doi: 10.3389/fgene.2019.00295

# INTRODUCTION

Type 2 diabetes mellitus (T2DM) is a complex metabolic disease and accounts for more than 90% of diabetic cases. Its pathogenesis involves both genetic predisposition and unhealthy living habits (Zheng et al., 2018). The disease occurs mostly after the age of 35–40 years, providing potential time windows for proactive strategies toward effective prevention (Palermo et al., 2014; Zheng et al., 2018).

Among the known risk factors, inflammation has been identified as a potential cause of T2DM as well as other obesity-associated diseases, such as atherosclerosis and fatty liver (Kohlgruber and Lynch, 2015; Zou et al., 2018). Inflammation interferes with glucose metabolism in adipocytes, hepatocytes, and muscle cells and also affects insulin production or signaling (Kohlgruber and Lynch, 2015). The IL-1 cytokine family, a major class of immunoregulatory agents, plays important roles in endocrine processes and the regulation of responses to inflammatory stress, especially in T2DM (Banerjee and Saxena, 2012). For example, human pancreatic cells produce more IL-1β under higher glucose concentrations, which in turn may lead to impaired insulin secretion, decreased cell proliferation, and, eventually, β-cell death (Poitout and Robertson, 2002; Rhodes, 2005). In contrast, IL-1Ra, another member of the IL-1 family, can protect cultured human islets from high glucose-induced IL-1β-mediated β-cell apoptosis (Maedler et al., 2001). Thus, members of the IL-1 family, e.g., IL-1Ra and IL-1β, maintain a dynamic balance to influence β-cell function and glycemic regulation in T2DM development (Larsen et al., 2007, 2009). Recently, interleukin-18 (IL-18), an IL-1 family member, has been reported to be involved in T2DM and to play a role in regulating innate and adaptive immune responses (Matsui et al., 1997; Wawrocki et al., 2016). A nested case-control study based on the Nurses' Health Study showed that high IL-18 levels are associated with a higher risk of T2DM (Hivert et al., 2009). In another study, IL-18 levels were measured in serum samples from 130 coronary artery disease (CAD) patients. The study included 43 T2DM patients and 31 healthy controls and also revealed that T2DM patients tend to have higher IL-18 serum levels (Suchanek et al., 2005). These results are consistent with previous clinical findings that increased IL-18 serum levels serve as a marker of insulin resistance in both T2DM patients and nondiabetic people (Fischer et al., 2005).

However, due to the interference of multiple confounding factors and the "reverse causal effect" in observational studies, it remains unclear whether high levels of IL-18 trigger the onset of T2DM and cause or promote the development of the disease as a main confounding factor, an issue that calls for systematic investigations for the development of effective preventive or therapeutic strategies, e.g., by Mendelian randomization (MR) studies (Noyce et al., 2017; Schuetz and Wahl, 2017).

Mendelian randomization, greatly facilitated by the development of genome-wide association studies (GWASs), is a method for establishing causal effects between genetic traits and diseases by building instrumental variables (IVs) based on the information about single nucleotide polymorphisms (SNPs), i.e., phenotype-associated genetic variants (Visscher et al., 2012; Hayes, 2013; De et al., 2014; Huang, 2015; Li et al., 2015; Sekula et al., 2016; Zheng et al., 2017; Cheng et al., 2018d, 2019; Guo et al., 2018). For MR analysis, all IVs have to be independent of one another and robustly associated with the phenotype (e.g., high IL-18 levels) but not with the disease (e.g., T2DM) (**Figure 1**), ensuring that the only way for the IVs to influence the disease is through the phenotype, with maximum avoidance of any possible residual confounding factors. Based on Mendel's second law, i.e., the principle of random distribution of gametes in offspring (Castle, 1903), IV analysis can avoid reverse causality.

In this study, we tested the hypothesis that T2DM is caused by high IL-18 levels and then estimated the causal effect of IL-18 levels on T2DM by the MR method.

# MATERIALS AND METHODS

# Strategic Design of Data Processing and Analysis

We extracted summary-level data from GWAS datasets and processed the data by removing the SNPs not suitable for establishing IVs. We then calculated the Wald ratio of each IV, and we used the inverse-variance weighted (IVW) method to predict the causal effects of high IL-18 serum levels on T2DM. Upon completing the MR analysis, we evaluated the heterogeneity and pleiotropic effects of each variant, using the sensitivity analysis and the MR-Egger method, respectively (**Figure 2**).

# Summary-Level Data Extraction for Associations Between Genetic Variants and IL-18

The SNP information required to construct the IVs was extracted from a meta-analysis study done in 2013 by Walston et al. This team identified 18 top significant SNPs associated with plasma IL-18 levels (P < 5 × 10<sup>−8</sup>), using the GWAS data from the Cardiovascular Health Study (CHS) and a prospective population-based cohort study called InCHIANTI (Matteini et al., 2014). The "haplo.glm" function, implemented by the original author in the "haplo.stats" R package, was used to calculate the beta coefficient (β), the standard error (SE), and the threshold of the P-value for each haplotype relative to the most common reference haplotype (Matteini et al., 2014). Participants in this study included 3233 individuals over the age of 65 from the CHS cohort, and another group of 1210 participants aged 65–102 years from the InCHIANTI cohort, all being Caucasian (Fried et al., 1991; Ferrucci et al., 2000). The related SNP serial numbers, allele frequencies, effect alleles (EAs), beta coefficients, and SEs were obtained from the meta-analysis results by combining the two cohorts.

# Summary-Level Data Extraction for Associations Between Genetic Variants and T2DM

The GWAS data used for this study were obtained from the transethnic T2D GWAS meta-analysis for calculating the subsequent Wald ratio. In total, 26,488 T2DM cases and 83,964 controls were used in the study, and 2,915,012 genetic variants were identified, which have been published by the Diabetes Genetics Replication and Meta-analysis (DIAGRAM) consortium<sup>1</sup>. The odds ratio (OR), SE, and P-value of T2DM per allele were extracted. The P-value threshold used for screening genotypes (P < 5 × 10<sup>−8</sup>) independent of type 2 diabetes follows a number of similar studies, the most authoritative of which is "Estimating the causal influence of body mass index on risk of Parkinson disease: A Mendelian randomization study," published in PLOS Medicine in 2017.

<sup>1</sup>http://diagram-consortium.org/about.html

# Data Processing


Because of potential linkage disequilibrium (LD), the IVs were chosen to be independent of each other to avoid over-precise estimates in the subsequent analysis caused by genetic pleiotropy. According to the application principles of MR analysis, the study is based on Mendel's second law of inheritance: the separation and combination of gene pairs controlling different traits do not interfere with each other; in the formation of gametes, the paired genes are separated from each other, and genes that determine different traits are randomly distributed between two gametes. When two genes are not completely independent, they show a certain degree of linkage; this situation is called LD, and it greatly affects the exclusiveness of the instrumental variable with respect to phenotypic inheritance, leading to subsequent calculation bias, generally called "over-precise estimates" (Noyce et al., 2017). Although a rudimentary selection had been applied by Walston et al. (Matteini et al., 2014), the processed SNPs were verified again using an LD web tool<sup>2</sup> to remove the interfering SNPs (r<sup>2</sup> threshold = 0.1 or within 500 kb physical distance) (Baird, 2015; Noyce et al., 2017). Next, the T2DM-related SNPs (P < 0.05) were removed to meet the conditions for the MR analysis, making the IL-18-associated variants independent of the disease.
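The three filtering conditions (strong association with IL-18, independence between instruments, and no association with T2DM) can be sketched as a simple greedy selection. The sketch below is an illustration only: the SNP identifiers, P-values, and pairwise r<sup>2</sup> values are hypothetical, the 500 kb distance criterion is omitted for brevity, and the function name and data layout are assumptions rather than the workflow of the LD web tool used in the study.

```python
def select_instruments(snps, r2, p_exposure=5e-8, p_outcome=0.05, r2_max=0.1):
    """Greedy selection of independent instruments from summary statistics.

    snps: list of dicts with keys 'id', 'p_il18', 'p_t2dm'.
    r2:   dict mapping frozenset({id_a, id_b}) to pairwise LD r^2
          (pairs that are absent are treated as r^2 = 0).
    """
    # 1) strong association with the exposure, 2) none with the outcome
    candidates = [s for s in snps
                  if s["p_il18"] < p_exposure and s["p_t2dm"] > p_outcome]
    # 3) LD pruning: visit SNPs from most to least significant and keep
    #    a SNP only if it is not in LD with any SNP already kept.
    candidates.sort(key=lambda s: s["p_il18"])
    kept = []
    for s in candidates:
        if all(r2.get(frozenset((s["id"], k["id"])), 0.0) <= r2_max
               for k in kept):
            kept.append(s)
    return [s["id"] for s in kept]


# Hypothetical summary statistics (not the study's SNPs).
snps = [
    {"id": "snpA", "p_il18": 1e-12, "p_t2dm": 0.40},
    {"id": "snpB", "p_il18": 3e-10, "p_t2dm": 0.62},  # in LD with snpA
    {"id": "snpC", "p_il18": 2e-9, "p_t2dm": 0.01},   # associated with T2DM
    {"id": "snpD", "p_il18": 4e-9, "p_t2dm": 0.75},
    {"id": "snpE", "p_il18": 1e-5, "p_t2dm": 0.90},   # weak instrument
]
r2 = {frozenset(("snpA", "snpB")): 0.85, frozenset(("snpA", "snpD")): 0.02}
instruments = select_instruments(snps, r2)
print(instruments)
```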

# MR Method

Mendelian randomization is a method applied by pooling Wald ratios of the IVs to verify the causal relationship between exposures and diseases (Emdin et al., 2017). The Wald ratio of each IV was calculated first. As shown in **Figure 2**, we assumed X, Y, and Z to be IL-18, T2DM, and IVs, respectively, and the Wald ratio (β<sub>XY</sub>) of IL-18 to T2DM through a specified variant can be calculated as follows:

$$
\beta_{\rm XY} = \beta_{\rm ZY} / \beta_{\rm ZX},
$$

where β<sub>ZY</sub> represents the per-allele log(OR) of T2DM from summary-level data of Morris et al. (Morris et al., 2012), and β<sub>ZX</sub> is the per-allele effect (beta coefficient) on IL-18 from summary-level data of Walston et al. (Matteini et al., 2014). The SE of the IL-18–T2DM association of each Wald ratio can be defined as follows:

$$
SE_{\rm XY} = SE_{\rm ZY} / SE_{\rm ZX},
$$

where SE<sub>ZY</sub> and SE<sub>ZX</sub> represent the SE of the variant–T2DM and variant–IL-18 associations from corresponding summary-level data, respectively. Subsequently, 95% confidence intervals (CIs) were calculated from the SE of each Wald ratio. These data were then pooled to estimate a weighted average of the causal effect by the IVW method, one of the most commonly used methods for fixed-effect meta-analysis. It summarizes effect sizes from multiple independent studies by calculating their weighted mean, taking the inverse variance of each individual study as its weight. The meta-analysis model for the point estimate is chosen on the basis of the heterogeneity of the pooled data: the fixed-effect model is applied when there is no significant heterogeneity, and the random-effect model is used otherwise (Boucher, 2012; Lee et al., 2016).
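To make the pooling step concrete, the following minimal sketch computes per-variant Wald ratios and a fixed-effect IVW estimate. It is not the analysis code of this study (which was run in R). All input values are hypothetical, the function names are illustrative, and the per-variant standard error uses the common first-order (delta-method) approximation SE<sub>ZY</sub>/|β<sub>ZX</sub>|, which is an assumption of this sketch and differs from the expression given above.

```python
import math


def wald_ratio(beta_zx, beta_zy, se_zy):
    """Per-variant causal estimate and first-order standard error."""
    beta_xy = beta_zy / beta_zx
    se_xy = se_zy / abs(beta_zx)  # delta-method approximation (assumption)
    return beta_xy, se_xy


def ivw_fixed(estimates):
    """Fixed-effect inverse-variance weighted mean of (beta, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se


# Hypothetical summary statistics: (beta_ZX, beta_ZY, SE_ZY) per variant.
variants = [(0.10, 0.012, 0.010), (0.08, 0.010, 0.012), (0.12, 0.018, 0.011)]
ratios = [wald_ratio(*v) for v in variants]
beta, se = ivw_fixed(ratios)
lo, hi = beta - 1.96 * se, beta + 1.96 * se
# Exponentiate the pooled log(OR) to report an odds ratio with its 95% CI.
print(f"OR = {math.exp(beta):.3f} "
      f"(95% CI {math.exp(lo):.3f} to {math.exp(hi):.3f})")
```

Variants with smaller standard errors receive larger weights, so precise instruments contribute most to the pooled estimate.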

In order to assess the genetic heterogeneity of summarized data, Cochran's Q-test and the I<sup>2</sup> statistic were applied. Cochran's Q-test applies a χ<sup>2</sup> distribution with (k − 1) degrees of freedom, where k is the number of variants for analysis; I<sup>2</sup> = [Q − (k − 1)]/Q × 100% ranges from 0 to 100%. P < 0.01 and I<sup>2</sup> > 50% are defined as significant heterogeneity (Zhang et al., 2015).
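The two heterogeneity statistics can be computed directly from the per-variant estimates, as in the minimal sketch below. The input values are hypothetical, and truncating I<sup>2</sup> at zero when Q is smaller than its degrees of freedom is a common convention assumed here. The P-value of Q, which requires the χ<sup>2</sup> survival function, is left out to keep the sketch free of external libraries.

```python
def cochran_q_i2(estimates):
    """Return (Q, degrees of freedom, I^2 in percent) for (beta, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, (b, _) in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2 is truncated at zero when Q is below its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Hypothetical per-variant estimates (beta, SE) with one clear outlier.
q, df, i2 = cochran_q_i2([(0.10, 0.05), (0.12, 0.05), (0.50, 0.05)])
print(f"Q = {q:.2f} on {df} df, I2 = {i2:.1f}%")
```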

# Leave-One-Out Method for Sensitivity Analysis

The sensitivity analysis was conducted to detect the heterogeneity of each variant: the IVW method was rerun on each set of variants with one SNP (the "missing SNP") left out to obtain the point estimates of the effect of IL-18 on T2DM (Noyce et al., 2017). Then, we checked the fluctuation of the results before and after removing the "missing SNP," which reflects the sensitivity of each IV (Zheng, 2017).
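A leave-one-out analysis of this kind can be sketched as a loop that re-pools the estimates with each variant removed and reports the relative change in the OR. The per-variant values and SNP names below are hypothetical, and the helper names are illustrative only.

```python
import math


def ivw_beta(estimates):
    """Fixed-effect IVW pooled log(OR) from (beta, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    return sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)


def leave_one_out(per_snp):
    """Map each SNP to (OR without it, relative change versus the full OR)."""
    full_or = math.exp(ivw_beta(list(per_snp.values())))
    result = {}
    for left_out in per_snp:
        rest = [est for snp, est in per_snp.items() if snp != left_out]
        or_without = math.exp(ivw_beta(rest))
        result[left_out] = (or_without, (or_without - full_or) / full_or)
    return result


# Hypothetical per-variant Wald ratios (log OR, SE); not the study's values.
per_snp = {"snp1": (0.10, 0.08), "snp2": (0.15, 0.09), "snp3": (0.13, 0.07)}
loo = leave_one_out(per_snp)
for snp, (or_i, change) in loo.items():
    print(f"without {snp}: OR = {or_i:.4f}, relative change = {change:+.4f}")
```

Small relative changes for every left-out variant suggest that no single instrument drives the pooled estimate.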

# MR-Egger Method

MR-Egger regression analysis was applied here to ensure that violations of the IV assumptions would not bias the estimates of the directional causal association (Bowden et al., 2015). The MR-Egger regression analysis was originally derived from the Egger regression method, which is mainly used to detect research bias in meta-analysis and systematic bias caused by pleiotropy. The estimated value of the intercept from MR-Egger regression can be interpreted as an estimate of the average pleiotropic effect across the genetic variants. A nonzero intercept is indicative of overall directional pleiotropy, and the slope coefficient provides an estimate of the causal effect adjusted for this bias (Bowden et al., 2015). All above statistical analyses were conducted in R 3.4.3 using the R packages for meta-analysis and MR<sup>3</sup>.
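The core of MR-Egger regression is a weighted linear regression of the variant–outcome effects on the variant–exposure effects in which the intercept is not constrained to zero. The sketch below shows only the point estimates under stated assumptions: the inputs are hypothetical, the weights are taken as 1/SE<sub>ZY</sub><sup>2</sup>, and standard errors and P-values of the coefficients are omitted. It is an illustration, not the R implementation used in the study.

```python
def mr_egger(beta_zx, beta_zy, se_zy):
    """Weighted least squares of beta_ZY on beta_ZX with an intercept."""
    # Orient every variant so that its exposure effect is positive.
    signs = [1.0 if bx >= 0 else -1.0 for bx in beta_zx]
    x = [s * bx for s, bx in zip(signs, beta_zx)]
    y = [s * by for s, by in zip(signs, beta_zy)]
    w = [1.0 / se ** 2 for se in se_zy]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx                  # causal effect estimate
    intercept = y_bar - slope * x_bar  # average pleiotropic effect
    return intercept, slope


# Hypothetical data generated as beta_ZY = 0.01 + 0.12 * beta_ZX.
bx = [0.08, 0.10, 0.12, 0.15]
by = [0.0196, 0.0220, 0.0244, 0.0280]
se = [0.010, 0.012, 0.011, 0.009]
intercept, slope = mr_egger(bx, by, se)
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

In this constructed example the nonzero intercept is recovered separately from the slope, whereas a regression forced through the origin would absorb it into the causal estimate.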

# RESULTS

# IV SNPs

A selection of eight SNPs to construct IVs (rs2250417, rs2300702, rs2268797, rs6748621, rs7577696, rs6760105, rs212745, and rs212713) satisfied all conditions, including strong associations with IL-18 phenotypes (P < 5 × 10<sup>−8</sup>, β ≠ 0) and no association with T2DM (P > 0.05) or LD effect (**Table 1**).

# The Causal Influence of IL-18 on the Risk of T2DM

The pooled results from the IVW method with eight SNPs suggest that high IL-18 plasma concentrations significantly increase the risk of T2DM. No heterogeneity was found between variants of the summary data (P = 1.0 and I<sup>2</sup> = 0%; **Figure 3**); the fixed-effect model was applied for the meta-analysis, and the OR of T2DM per SD higher IL-18 plasma level was 1.14 (95% CI 1.03–1.26, P = 0.0117; **Figure 3**).

<sup>2</sup>http://www.cog-genomics.org/plink/1.9/ld#r

<sup>3</sup>http://cran.r-project.org/web/packages/meta/index.html


TABLE 1 | Associations of genetic variants with IL-18 and T2DM.

#### Sensitivity Evaluation

The ORs obtained after removing each "missing SNP" all exceeded 1, ranging from 1.1345 to 1.1505, with small relative fluctuations from −0.005 [(1.1345 − 1.14)/1.14] to 0.009 [(1.1505 − 1.14)/1.14]. This means that the causal effect obtained from MR was supported by most of the individual SNPs, demonstrating that no single SNP dominated the IVW point estimate and that there was no heterogeneity in the variants (**Figure 4**; Rosmalen et al., 2012).

#### Pleiotropic Effect Assessment

The pooled causal effects from the MR-Egger regression analysis are consistent with the IVW results: the estimated beta of T2DM per SD higher IL-18 plasma level was 0.122 (95% CI 0.003–0.221, P = 0.044), and the intercept was 0.011 (95% CI −0.004 to 0.026, P = 0.158), suggesting that all variants were valid. This indicates no alternative pathway leading to the disease, so the IVW method was applied in the absence of pleiotropic effects.

#### DISCUSSION

In this study, we conducted an MR analysis to explore the causal effect of IL-18 plasma levels on the risk of T2DM. The causal effect estimated by the IVW method was an OR of 1.14 (95% CI 1.03–1.26, P = 0.0117). Additionally, the sensitivity analysis and the MR-Egger regression analysis provided evidence that the results were not due to heterogeneity or pleiotropic effects of any single variant.

A major innovative aspect of this study design is the introduction of the concept of IVs in the association analysis. In the causal inference of observational studies, no matter how good an epidemiological research design is and how accurate the measurements are, we cannot eliminate potential unmeasurable confounding factors. The MR study design follows the Mendelian inheritance law of "random allocation of alleles to offspring." If the genotype is associated with the disease only through the phenotype, genotypes can be used as instrumental variables to infer the association between phenotype and disease, as shown in **Figure 1**.

Given the results of this study, we can almost certainly conclude that the IL-18-associated T2DM risk is mainly due to the role of pro-inflammatory cytokines in β-cell dysfunction. Islet inflammation causes serious tissue lesions in both T1DM and T2DM. Upon infiltrating into the islet, the immune cells secrete a variety of pro-inflammatory cytokines, such as IL-1β, tumor necrosis factor alpha (TNF-α), and γ-interferon, and cause islet cell function defects and diabetes (Morgan et al., 2014; Marchetti, 2016; Eguchi and Nagai, 2017). The question of causality between T2DM and elevated IL-18 levels is answered in this study, as we have demonstrated that pro-inflammatory cytokines have a causal effect on T2DM. Our study not only aids in the development of prognostic techniques for diabetes and its complications but also provides a more comprehensive strategy for all types of clinical drug regimens to circumvent the risk of T2DM. Especially for non-T2DM treatments that can increase IL-18 expression, more stringent control and careful handling are needed. For example, bacillus Calmette-Guérin (BCG) vaccines, which have been used for nearly 100 years, were confirmed in 2002 to cause a large increase in the expression of IL-1 family members, including IL-18, after vaccination (Lyons et al., 2002). Attention should be paid to the avoidance of virulence factors caused by the treatment process, and the rationality and safety of various medical treatments should be comprehensively evaluated.

Finding a solution for the high IL-18 levels may be a lengthy task. For instance, Schrezenmeir et al. used probiotic oligosaccharides to reduce the production of proinflammatory cytokines in intestinal cells and effectively reduced the burden of self-immunity (Zenhom et al., 2011). The use of statins can also effectively inhibit the expression of pro-inflammatory cytokines in CrFK cells infected with influenza A virus (Mehrbod et al., 2012). In cohort trials of patients with Alzheimer's disease, researchers also found that ascorbic acid, α-tocopherol, and β-carotene can reduce oxidative stress and pro-inflammatory cytokine production in monocytes (de Oliveira et al., 2012). At the same time, daily exercise aids in reducing plasma levels of pro-inflammatory cytokines; early treadmill exercise reduced the production of pro-inflammatory factors in mice and even alleviated anxiety symptoms after cerebral ischemia (Zhang Q. et al., 2017).

Inevitably, this study has some minor limitations. When studying a single phenotypic variable, other phenotypes become confounding factors, so we introduced the MR concept, based on Mendel's second law, the "law of independent assortment," to solve this problem and to control genetic factors of different traits. We can insulate other pathway effects by linking genetic loci that control a single phenotype to the disease. But, as stated, the possibility of "non-isolation" still exists for some phenotypes that have not yet been completely described and which may be regulated by the same set of genetic loci. However, with the rapid updates and development of the databases, we expect this issue will be solved soon, e.g., by methods such as link prediction (Cheng et al., 2016, 2018a,b; Zeng et al., 2017; Jiang et al., 2018; Zhang et al., 2018; Ding et al., 2019) or artificial intelligence (Cabarle et al., 2017; Liu et al., 2017; Zhang X. et al., 2017; Cheng and Hu, 2018; Dao et al., 2018; Feng et al., 2018; Pan et al., 2018; Song et al., 2018; Tang et al., 2018; Wei et al., 2018; Xu et al., 2018a,b; Yang et al., 2018; Zhu et al., 2019). With the discovery of new IL-18 variants and the large collection of results of randomized controlled trials, we anticipate the discovery of more non-coding biomarkers for novel diagnostic or therapeutic strategies for T2DM (Zou et al., 2015; Lu et al., 2016; Liu et al., 2017; Wei et al., 2017a,b; Cheng et al., 2018c; Zeng et al., 2018).

#### AUTHOR CONTRIBUTIONS

HZ wrote the manuscript. JH and LC designed the experimental method. S-LL supervised the work and revised the article.

#### REFERENCES


### FUNDING

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61871160 and 61502125), the Heilongjiang Postdoctoral Fund (Grant Nos. LBH-TZ20 and LBH-Z15179), and the China Postdoctoral Science Foundation (Grant Nos. 2018T110315 and 2016M590291).

#### ACKNOWLEDGMENTS

We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhuang, Han, Cheng and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# DeePromoter: Robust Promoter Predictor Using Deep Learning

#### Mhaned Oubounyt<sup>1</sup>, Zakaria Louadi<sup>1</sup>, Hilal Tayara<sup>1</sup>\* and Kil To Chong<sup>2</sup>\*

*<sup>1</sup> Department of Information and Electronics Engineering, Chonbuk National University, Jeonju, South Korea, <sup>2</sup> Advanced Research Center of Information and Electronics Engineering, Chonbuk National University, Jeonju, South Korea*

The promoter region is located near the transcription start site and regulates transcription initiation of the gene by controlling the binding of RNA polymerase. Thus, promoter region recognition is an important area of interest in the field of bioinformatics. Numerous tools for promoter prediction have been proposed. However, the reliability of these tools still needs to be improved. In this work, we propose a robust deep learning model, called DeePromoter, to analyze the characteristics of short eukaryotic promoter sequences and accurately recognize human and mouse promoter sequences. DeePromoter combines a convolutional neural network (CNN) and a long short-term memory (LSTM) network. Additionally, instead of using non-promoter regions of the genome as a negative set, we derive a more challenging negative set from the promoter sequences. The proposed negative set reconstruction method improves the discrimination ability and significantly reduces the number of false positive predictions. Consequently, DeePromoter outperforms the previously proposed promoter prediction tools. In addition, a web server for promoter prediction was developed based on the proposed methods and made available at https://home.jbnu.ac.kr/NSCL/deepromoter.htm.

Keywords: promoter, DeePromoter, bioinformatics, deep learning, convolutional neural network

# 1. INTRODUCTION

Promoters are key elements that belong to the non-coding regions of the genome. They largely control the activation or repression of genes. They are located near and upstream of the gene's transcription start site (TSS). A gene's promoter flanking region may contain many crucial short DNA elements and motifs (between 5 and 15 bases long) that serve as recognition sites for the proteins that provide proper initiation and regulation of transcription of the downstream gene (Juven-Gershon et al., 2008). The initiation of gene transcription is the most fundamental step in the regulation of gene expression. The core promoter is a minimal stretch of DNA sequence that contains the TSS and is sufficient to directly initiate transcription. The length of the core promoter typically ranges between 60 and 120 base pairs (bp).

The TATA-box is a promoter subsequence that indicates to other molecules where transcription begins. It was named "TATA-box" because its sequence is characterized by repeating T and A base pairs (TATAAA) (Baker et al., 2003). The vast majority of studies on the TATA-box have been conducted on human, yeast, and Drosophila genomes; however, similar elements have been found in other species such as archaea and ancient eukaryotes (Smale and Kadonaga, 2003). In humans, 24% of genes have promoter regions containing a TATA-box (Yang et al., 2007). In eukaryotes, the TATA-box is located ∼25 bp upstream of the TSS (Xu et al., 2016). It is able to define the direction of transcription and also indicates the DNA strand to be read. Proteins called transcription factors bind to several non-coding regions, including the TATA-box, and recruit an enzyme called RNA polymerase, which synthesizes RNA from DNA.

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Leyi Wei, The University of Tokyo, Japan Zhouhao Zeng, Facebook, United States Shravan Sukumar, Corteva Agriscience, United States Mufeng Hu, AbbVie, United States*

#### \*Correspondence:

*Hilal Tayara hilaltayara@jbnu.ac.kr Kil To Chong kitchong@jbnu.ac.kr*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *07 February 2019* Accepted: *15 March 2019* Published: *05 April 2019*

#### Citation:

*Oubounyt M, Louadi Z, Tayara H and Chong KT (2019) DeePromoter: Robust Promoter Predictor Using Deep Learning. Front. Genet. 10:286. doi: 10.3389/fgene.2019.00286*

Due to the important role of promoters in gene transcription, accurate prediction of promoter sites has become a required step in interpreting gene expression patterns and in building and understanding the functionality of genetic regulatory networks. Different biological experiments have been used to identify promoters, such as mutational analysis (Matsumine et al., 1998) and immunoprecipitation assays (Kim et al., 2004; Dahl and Collas, 2008). However, these methods are both expensive and time-consuming. Recently, with the development of next-generation sequencing (NGS) (Behjati and Tarpey, 2013), more genes of different organisms have been sequenced and their gene elements have been computationally explored (Zhang et al., 2011). In addition, NGS technology has brought a dramatic fall in the cost of whole-genome sequencing, so more sequencing data is available. This data availability attracts researchers to develop computational models for the promoter prediction task. However, the task remains incomplete, and there is still no efficient software that can accurately predict promoters.

Promoter predictors can be categorized, based on the approach used, into three groups: signal-based, content-based, and CpG-based approaches. Signal-based predictors focus on promoter elements related to the RNA polymerase binding site and ignore the non-element portions of the sequence. As a result, their prediction accuracy was weak and unsatisfactory. Examples of signal-based predictors include: PromoterScan (Prestridge, 1995), which used the extracted features of the TATA-box and a weighted matrix of transcription factor binding sites with a linear discriminator to discriminate promoter sequences from non-promoter ones; Promoter2.0 (Knudsen, 1999), which extracted features from different boxes such as the TATA-Box, CAAT-Box, and GC-Box and passed them to artificial neural networks (ANN) for classification; NNPP2.1 (Reese, 2001), which utilized the initiator element (Inr) and the TATA-Box for feature extraction and a time-delay neural network for classification; and Down and Hubbard (2002), which used the TATA-Box and a relevance vector machine (RVM) as a classifier. Content-based predictors rely on counting the frequency of k-mers by running a k-length window across the sequence. However, these methods ignore the spatial information of the base pairs in the sequences. Examples of content-based predictors include: PromFind (Hutchinson, 1996), which used the k-mer frequency to perform hexamer promoter prediction; PromoterInspector (Scherf et al., 2000), which identified the regions containing promoters based on a common genomic context of polymerase II promoters by scanning for specific features defined as variable-length motifs; and MCPromoter1.1 (Ohler et al., 1999), which used a single interpolated Markov chain (IMC) of 5th order to predict promoter sequences. Finally, CpG-based predictors utilize the location of CpG islands, as the promoter region or the first exon region in human genes usually contains CpG islands (Ioshikhes and Zhang, 2000; Davuluri et al., 2001; Lander et al., 2001; Ponger and Mouchiroud, 2002). However, only 60% of the promoters contain CpG islands; therefore, the prediction accuracy of this kind of predictor never exceeded 60%.

Recently, sequence-based approaches have been utilized for promoter prediction. Yang et al. (2017) utilized different feature extraction strategies to capture the most relevant sequence information in order to predict enhancer-promoter interactions. Lin et al. (2017) proposed a sequence-based predictor, named "iPro70-PseZNC", for sigma70 promoter identification in prokaryotes. Likewise, Bharanikumar et al. (2018) proposed PromoterPredict to predict the strength of Escherichia coli promoters based on a dynamic multiple regression approach in which the sequences were represented as position weight matrices (PWM). Kanhere and Bansal (2005) utilized the differences in DNA sequence stability between promoter and non-promoter sequences to distinguish them. Xiao et al. (2018) introduced a two-layer predictor called iPSW(2L)-PseKNC that identifies promoter sequences as well as the strength of the promoters by extracting hybrid features from the sequences.

All of the aforementioned predictors require domain knowledge in order to hand-craft the features. On the other hand, deep learning based approaches enable building more efficient models using raw data (DNA/RNA sequences) directly. Deep convolutional neural networks have achieved state-of-the-art results in challenging tasks such as processing image, video, audio, and speech (Krizhevsky et al., 2012; LeCun et al., 2015; Schmidhuber, 2015; Szegedy et al., 2015). In addition, they have been successfully applied to biological problems such as DeepBind (Alipanahi et al., 2015), DeepCpG (Angermueller et al., 2017), branch point selection (Nazari et al., 2018), alternative splicing site prediction (Oubounyt et al., 2018), 2'-O-methylation site prediction (Tahir et al., 2018), DNA sequence quantification (Quang and Xie, 2016), human protein subcellular localization (Wei et al., 2018), etc. Furthermore, CNNs recently gained significant attention in the promoter recognition task. Very recently, Umarov and Solovyev (2017) introduced CNNProm for short promoter sequence discrimination; this CNN-based architecture achieved high results in classifying promoter and non-promoter sequences. Afterward, this model was improved by Qian et al. (2018), who used a support vector machine (SVM) classifier to inspect the most important promoter sequence elements. Next, the most influential elements were kept uncompressed while the less important ones were compressed. This process resulted in better performance. Recently, a long promoter identification model was proposed by Umarov et al. (2019), in which the authors focused on identifying the TSS position.

In all the above-mentioned works, the negative set was extracted from non-promoter regions of the genome. Promoter sequences, however, are exclusively rich in specific functional elements such as the TATA-box, which is located at –30∼–25 bp, the GC-Box, which is located at –110∼–80 bp, the CAAT-Box, which is located at –80∼–70 bp, etc. This results in high classification accuracy due to the huge disparity between the positive and negative samples in terms of sequence structure. Additionally, the classification task becomes effortless: for instance, the CNN models will just rely on the presence or absence of some motifs at their specific positions to decide the sequence type. Thus, these models have very low precision (a high false positive rate) when they are tested on genomic sequences that have promoter motifs but are not promoter sequences. It is well known that there are more TATAAA motifs in the genome than the ones belonging to the promoter regions. For instance, the DNA sequence of human chromosome 1 alone, ftp://ftp.ensembl.org/pub/release-57/fasta/homo\_sapiens/dna/, contains 151 656 TATAAA motifs. This is more than the approximate maximal number of genes in the whole human genome. As an illustration of this issue, we notice that when these models are tested on non-promoter sequences that have a TATA-box, they misclassify most of these sequences. Therefore, in order to generate a robust classifier, the negative set should be selected carefully, as it determines the features that will be used by the classifier to discriminate the classes. The importance of this idea has been demonstrated in previous works such as Wei et al. (2014). In this work, we mainly address this issue and propose an approach that integrates some of the positive class functional motifs in the negative class to reduce the model's dependency on these motifs. We utilize a CNN combined with an LSTM model to analyze sequence characteristics of human and mouse TATA and non-TATA eukaryotic promoters and build computational models that can accurately discriminate short promoter sequences from non-promoter ones.
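The point about TATAAA motifs occurring far outside promoter regions can be checked on any sequence with a simple scan. The following is a minimal sketch, not the procedure used to obtain the chromosome 1 figure: the function name `count_motif` is hypothetical, overlapping matches are counted, and only the given strand is scanned (whether the reported count covers one or both strands is not stated in the text).

```python
def count_motif(sequence: str, motif: str = "TATAAA") -> int:
    """Count occurrences of `motif` in `sequence`, including overlapping ones."""
    sequence = sequence.upper()
    motif = motif.upper()
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Advance by one base so that overlapping matches are also found.
        start = sequence.find(motif, start + 1)
    return count


# Toy sequence (not real genomic data) containing two TATAAA motifs.
toy = "GGCTATAAAGGCCGTATAAACC"
print(count_motif(toy))  # 2
```

Applied to a full chromosome read from a FASTA file, the same scan would give a count comparable in spirit to the one quoted above, subject to the strand and overlap assumptions.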

# 2. MATERIALS AND METHODS

#### 2.1. Dataset

The datasets used for training and testing the proposed promoter predictor are collected from human and mouse. They contain two distinct classes of promoters, namely TATA promoters (i.e., sequences that contain a TATA-box) and non-TATA promoters. These datasets were built from the Eukaryotic Promoter Database (EPDnew) (Dreos et al., 2012). EPDnew is a new section under the well-known EPD dataset (Périer et al., 2000); it is an annotated non-redundant collection of eukaryotic POL II promoters for which the transcription start site has been determined experimentally. It provides high-quality promoters compared to the ENSEMBL promoter collection (Dreos et al., 2012) and is publicly accessible at https://epd.epfl.ch//index.php. We downloaded TATA and non-TATA promoter genomic sequences for each organism from EPDnew. This operation resulted in four promoter datasets, namely Human-TATA, Human-non-TATA, Mouse-TATA, and Mouse-non-TATA. For each of these datasets, a negative set (non-promoter sequences) of the same size as the positive one is constructed based on the proposed approach, as described in the following section. The numbers of promoter sequences for each organism are given in **Table 1**. All sequences have a length of 300 bp and were extracted from -249∼+50 bp (+1 refers to the TSS position). As a quality control, we used 5-fold cross-validation to assess the proposed model. In this case, three folds are used for training, one fold is used for validation, and the remaining fold is used for testing. Thus, the proposed model is trained 5 times and the overall performance over the 5 folds is calculated.

TABLE 1 | Statistics of the four datasets used in this study.
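The 3/1/1 rotation of folds described above can be sketched as follows. This is an illustration only: the helper name `five_fold_splits` is hypothetical, and choosing the fold after the test fold as the validation fold is an assumption, since the text does not say how the validation fold is selected in each rotation.

```python
import random


def five_fold_splits(n_samples: int, n_folds: int = 5, seed: int = 0):
    """Yield (train, validation, test) index lists, one triple per rotation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th index, so fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_k = (k + 1) % n_folds  # assumption: validation fold follows the test fold
        test = folds[k]
        validation = folds[val_k]
        train = [
            idx
            for j, fold in enumerate(folds)
            if j not in (k, val_k)
            for idx in fold
        ]
        yield train, validation, test


for train, validation, test in five_fold_splits(100):
    print(len(train), len(validation), len(test))  # 60 20 20 on each rotation
```

Each sample appears in the test fold exactly once across the five rotations, so averaging the five test results covers the whole dataset.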

#### 2.2. Negative Dataset Construction

In order to train a model that can accurately classify promoter and non-promoter sequences, we need to choose the negative set (non-promoter sequences) carefully. This point is crucial in making a model capable of generalizing well, and therefore able to maintain its precision when evaluated on more challenging datasets. Previous works, such as Qian et al. (2018), constructed the negative set by randomly selecting fragments from non-promoter regions of the genome. Obviously, this approach is not completely reasonable, because if there is no intersection between the positive and negative sets, the model will easily find basic features to separate the two classes. For instance, the TATA motif can be found in all positive sequences at a specific position (normally 28 bp upstream of the TSS, between –30 and –25 bp in our dataset). Therefore, a randomly created negative set that does not contain this motif will produce high performance on this dataset. However, the model then misclassifies negative sequences that contain the TATA motif as promoters. In brief, the major flaw in this approach is that a deep learning model only learns to discriminate the positive and negative classes based on the presence or absence of some simple features at specific positions, which makes these models impractical. In this work, we aim to solve this issue by establishing an alternative method to derive the negative set from the positive one.

Our method is based on the fact that whenever features are common to the negative and the positive class, the model tends, when making the decision, to ignore or reduce its dependency on these features (i.e., to assign low weights to them). Instead, the model is forced to search for deeper and less obvious features. Deep learning models generally suffer from slow convergence while training on this type of data. However, this method improves the robustness of the model and ensures generalization. We reconstruct the negative set as follows. Each positive sequence generates one negative sequence. The positive sequence is divided into 20 subsequences. Then, 12 subsequences are picked randomly and substituted randomly. The remaining 8 subsequences are conserved. This process is illustrated in **Figure 1**. Applying this process to the positive set results in new non-promoter sequences with conserved parts from promoter sequences (the unchanged subsequences, 8 subsequences out of 20). These parameters enable generating a negative set in which between 32 and 40% of the sequences contain conserved portions of promoter sequences. This ratio is found to be optimal for a robust promoter predictor, as explained in section 3.2. Because the conserved parts occupy the same positions in the negative sequences, the obvious motifs such as the TATA-box and the TSS are now common to the two sets with a ratio of 32∼40%. The sequence logos of the positive and negative sets for both human and mouse TATA promoter data are shown in **Figures 2**, **3**, respectively. It can be seen that the positive and the negative sets share the same basic motifs at the same positions, such as the TATA motif between –30 and –25 bp and the TSS at position +1 bp. Therefore, the training is more challenging, but the resulting model generalizes well.

FIGURE 1 | Illustration of the negative set construction method. Green represents the randomly conserved subsequences while red represents the randomly chosen and substituted ones.
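The construction steps above (20 subsequences, 12 substituted, 8 conserved) can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `make_negative` is hypothetical, the 20 subsequences are taken to be of equal length (15 bp for a 300-bp input), and "substituted randomly" is read as replacement with uniformly random nucleotides.

```python
import random
from typing import List, Optional, Tuple

NUCLEOTIDES = "ACGT"


def make_negative(positive: str, n_parts: int = 20, n_substituted: int = 12,
                  rng: Optional[random.Random] = None) -> Tuple[str, List[int]]:
    """Derive one negative sequence from one positive (promoter) sequence.

    Returns the negative sequence and the sorted indices of the substituted parts.
    """
    rng = rng or random.Random()
    if len(positive) % n_parts != 0:
        raise ValueError("sequence length must be divisible by n_parts")
    part_len = len(positive) // n_parts  # 300 bp / 20 = 15 bp per subsequence
    parts = [positive[i:i + part_len] for i in range(0, len(positive), part_len)]
    substituted = sorted(rng.sample(range(n_parts), n_substituted))
    for idx in substituted:
        # Assumption: a substituted part is replaced by uniformly random
        # nucleotides of the same length.
        parts[idx] = "".join(rng.choice(NUCLEOTIDES) for _ in range(part_len))
    return "".join(parts), substituted


rng = random.Random(42)
promoter = "".join(rng.choice(NUCLEOTIDES) for _ in range(300))  # toy input
negative, changed = make_negative(promoter, rng=rng)
print(len(negative), len(changed))  # 300 12
```

Because the conserved parts stay at their original offsets, a motif that falls inside a conserved part appears at the same position in the negative sequence, which is the property the method relies on.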

#### 2.3. The Proposed Models

We propose a deep learning model that combines convolution layers with recurrent layers, as shown in **Figure 4**. It accepts a single raw genomic sequence, S = {N<sub>1</sub>, N<sub>2</sub>, ..., N<sub>l</sub>}, where N<sub>i</sub> ∈ {A, C, G, T} and l is the length of the input sequence, as input and outputs a real-valued score. The input is one-hot encoded and represented as a one-dimensional vector with four channels. The length of the vector is l = 300, and the four channels A, C, G, and T are represented as (1 0 0 0), (0 1 0 0), (0 0 1 0), and (0 0 0 1), respectively. To select the best-performing model, we used a grid search to choose the best hyperparameters. We tried different architectures, such as CNN alone, LSTM alone, BiLSTM alone, and CNN combined with LSTM. The tuned hyperparameters are the number of convolution layers, the kernel size, the number of filters in each layer, the size of the max pooling layer, the dropout probability, and the number of units in the BiLSTM layer.
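The one-hot encoding described above can be illustrated with a short sketch. The function name `one_hot_encode` is hypothetical, and mapping any symbol other than A, C, G, or T to an all-zero row is an assumption, as the text does not say how such symbols are handled.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence: str) -> list:
    """Encode a DNA string as a list of 4-channel rows, one row per base."""
    # Assumption: unknown symbols (e.g. N) become an all-zero row.
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in sequence.upper()]


print(one_hot_encode("ACGT"))
# [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
```

A 300-bp promoter sequence therefore becomes a 300 × 4 array, which is the input shape expected by the first convolution layers.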

The proposed model starts with multiple convolution layers that are aligned in parallel and help in learning the important motifs of the input sequences with different window sizes. We use three convolution layers for non-TATA promoters, with window sizes of 27, 14, and 7, and two convolution layers for TATA promoters, with window sizes of 27 and 14. All convolution layers are followed by a ReLU activation function (Glorot et al., 2011), a max pooling layer with a window size of 6, and a dropout layer with a probability of 0.5. Then, the outputs of these layers are concatenated and fed into a bidirectional long short-term memory (BiLSTM) (Schuster and Paliwal, 1997) layer with 32 nodes in order to capture the dependencies between the motifs learnt by the convolution layers. The features learnt by the BiLSTM are flattened and followed by dropout with a probability of 0.5. Then we add two fully connected layers for classification. The first has 128 nodes and is followed by ReLU and dropout with a probability of 0.5, while the second is used for prediction, with one node and a sigmoid activation function. BiLSTM allows the information to persist and learns long-term dependencies of sequential samples such as DNA and RNA. This is achieved through the LSTM structure, which is composed of a memory cell and three gates called the input, output, and forget gates. These gates are responsible for regulating the information in the memory cell. In addition, utilizing the LSTM module increases the network depth while the number of required parameters remains low. Having a deeper network enables extracting more complex features, and this is the main objective of our models, as the negative set contains hard samples.

FIGURE 3 | The sequence logo in mouse TATA promoter for both positive set (A) and negative set (B). The plots show the conservation of the functional motifs between the two sets.
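To make the layer stack above easier to follow, the sketch below traces the tensor shape of one input sequence through the described layers. It is bookkeeping only, not a trainable model, and it rests on assumptions the text does not state: the number of filters per convolution layer (`n_filters=64` is a placeholder), "same" padding in the convolutions so that all branches keep the same length, concatenation along the channel axis, and a BiLSTM that returns one output per step with 32 units in each direction.

```python
def trace_shapes(kernel_sizes=(27, 14, 7), n_filters=64, seq_len=300,
                 pool_size=6, lstm_units=32, dense_units=128):
    """Return (layer name, output shape) pairs for one input sequence."""
    shapes = [("input (one-hot)", (seq_len, 4))]
    pooled_len = seq_len // pool_size  # non-overlapping max pooling
    for k in kernel_sizes:
        # Assumption: "same" padding, so the convolution keeps length seq_len.
        shapes.append((f"conv k={k} + ReLU", (seq_len, n_filters)))
        shapes.append((f"max pool + dropout (k={k})", (pooled_len, n_filters)))
    channels = n_filters * len(kernel_sizes)
    shapes.append(("concatenate branches", (pooled_len, channels)))
    # A bidirectional LSTM emits forward and backward states at every step.
    shapes.append(("BiLSTM", (pooled_len, 2 * lstm_units)))
    shapes.append(("flatten + dropout", (pooled_len * 2 * lstm_units,)))
    shapes.append(("dense + ReLU + dropout", (dense_units,)))
    shapes.append(("dense + sigmoid", (1,)))
    return shapes


# Non-TATA configuration: three parallel branches (27, 14, 7).
for name, shape in trace_shapes():
    print(f"{name:32s} {shape}")
```

Calling `trace_shapes(kernel_sizes=(27, 14))` gives the two-branch TATA configuration; under the same assumptions only the concatenated channel count changes.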

The Keras framework is used for constructing and training the proposed models (Chollet F. et al., 2015). Adam optimizer (Kingma and Ba, 2014) is used for updating the parameters with a learning rate of 0.001. The batch size is set to 32 and the number of epochs is set to 50. Early stopping is applied based on validation loss.

# 3. RESULTS AND DISCUSSION

#### 3.1. Performance Measures

In this work, we use widely adopted evaluation metrics to evaluate the performance of the proposed models. These metrics are precision, recall, and the Matthews correlation coefficient (MCC), and they are defined as follows:

$$Precision = \frac{TP}{TP + FP} \tag{1}$$

$$Recall = \frac{TP}{TP + FN} \tag{2}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{3}$$

where TP is true positive and represents correctly identified promoter sequences, TN is true negative and represents correctly rejected non-promoter sequences, FP is false positive and represents non-promoter sequences incorrectly identified as promoters, and FN is false negative and represents incorrectly rejected promoter sequences.
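The three metrics can be computed directly from the confusion-matrix counts. The sketch below follows Equations (1) to (3); the function names are hypothetical, and returning 0 when a denominator is zero is a convention assumed here, not something specified in the text.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Equation (1): fraction of predicted promoters that are true promoters."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation (2): fraction of true promoters that are recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (3): Matthews correlation coefficient."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 90, 80, 20, 10
print(precision(tp, fp), recall(tp, fn), mcc(tp, tn, fp, fn))
```

Unlike precision and recall, MCC uses all four counts, so it stays informative when a classifier accepts almost every sequence as a promoter.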

#### 3.2. Effect of the Negative Set

When analyzing previously published works on promoter sequence identification, we noticed that their performance greatly depends on the way the negative dataset is prepared. They performed very well on the datasets that they had prepared; however, they have a high false positive ratio when evaluated on a more challenging dataset that includes non-promoter sequences having common motifs with promoter sequences. For instance, in the case of the TATA promoter dataset, the randomly generated sequences will not have a TATA motif between –30 and –25 bp, which in turn makes the task of classification easier. In other words, their classifier depended on the presence of the TATA motif to identify the promoter sequence and, as a result, it was easy to achieve high performance on the datasets they had prepared. However, their models failed dramatically when dealing with negative sequences that contained the TATA motif (hard examples). The precision dropped as the false positive rate increased. Simply, they classified these sequences as positive promoter sequences. A similar analysis is valid for the other promoter motifs. Therefore, the main purpose of our work is not only achieving high performance on a specific dataset but also enhancing the model's ability to generalize well by training on a challenging dataset.

To further illustrate this point, we train and test our model on the human and mouse TATA promoter datasets with different methods of negative set preparation. The first experiment is performed using negative sequences randomly sampled from non-coding regions of the genome (i.e., similar to the approach used in the previous works). Remarkably, our proposed model achieves nearly perfect prediction accuracy, (precision=99%, recall=99%, MCC=98%) and (precision=99%, recall=98%, MCC=97%) for human and mouse, respectively. These high results are expected, but the question is whether this model can maintain the same performance when evaluated on a dataset that has hard examples. The answer, based on analyzing the prior models, is no. The second experiment is performed using our proposed method for preparing the dataset, as explained in section 2.2. We prepare negative sets that contain a conserved TATA-box at different percentages, such as 12, 20, 32, and 40%, and the goal is to reduce the gap between the precision and the recall. This ensures that our model learns more complex features rather than learning only the presence or absence of the TATA-box. As shown in **Figures 5A,B**, the model stabilizes at a ratio of 32∼40% for both the human and mouse TATA promoter datasets.

TABLE 2 | Comparison of the DeePromoter with the state-of-the-art method.

FIGURE 5 | The effect of different conservation ratios of TATA motif in the negative set on the performance in case of TATA promoter dataset for both human (A) and mouse (B).

#### 3.3. Results and Comparison

Over the past years, plenty of promoter region prediction tools have been proposed (Hutchinson, 1996; Scherf et al., 2000; Reese, 2001; Umarov and Solovyev, 2017). However, some of these tools are not publicly available for testing, and some of them require more information besides the raw genomic sequences. In this study, we compare the performance of our proposed models with the current state-of-the-art work, CNNProm, which was proposed by Umarov and Solovyev (2017), as shown in **Table 2**. Generally, the proposed models, DeePromoter, clearly outperform CNNProm on all datasets and all evaluation metrics. More specifically, DeePromoter improves the precision, recall, and MCC on the human TATA dataset by 0.18, 0.04, and 0.26, respectively. On the human non-TATA dataset, DeePromoter improves the precision by 0.39, the recall by 0.12, and the MCC by 0.66. Similarly, DeePromoter improves the precision and MCC on the mouse TATA dataset by 0.24 and 0.31, respectively. On the mouse non-TATA dataset, DeePromoter improves the precision by 0.37, the recall by 0.04, and the MCC by 0.65. These results confirm that CNNProm fails to reject negative sequences that contain the TATA motif and therefore has a high false positive rate. On the other hand, our models are able to deal with these cases more successfully, and their false positive rate is lower than that of CNNProm.

For further analysis, we study the effect of altering the nucleotide at each position on the output score. We focus on the region between –40 and +10 bp, as it hosts the most important part of the promoter sequence. For each promoter sequence in the test set, we perform computational mutation scanning to evaluate the effect of mutating every base of the input subsequence (150 substitutions on the –40∼+10 bp subsequence). This is illustrated in **Figures 6**, **7** for the human and mouse TATA datasets, respectively. Blue represents a drop in the output score due to mutation, while red represents an increase in the score due to mutation. We notice that altering the nucleotides to C or G in the region between –30 and –25 bp reduces the output score significantly. This region is the TATA-box, which is a very important functional motif in the promoter sequence. Thus, our model successfully finds the importance of this region. In the rest of the positions, C and G nucleotides are preferred over A and T, especially in the case of mouse. This can be explained by the fact that the promoter region has more C and G nucleotides than A and T (Shi and Zhou, 2006).
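The mutation scanning described above can be sketched as a loop over positions and alternative bases. This is an illustration under stated assumptions: `mutation_scan` and `toy_score` are hypothetical names, `toy_score` is a stand-in for the trained model's output, and the index mapping assumes a 300-bp input in which index 0 is position –249 and index 249 is the TSS (+1), with no position 0.

```python
NUCLEOTIDES = "ACGT"


def mutation_scan(sequence, score_fn, start, end):
    """Score change for every single-base substitution in sequence[start:end].

    Returns a dict mapping (index, new_base) -> score(mutant) - score(original).
    """
    base_score = score_fn(sequence)
    effects = {}
    for pos in range(start, end):
        for base in NUCLEOTIDES:
            if base == sequence[pos]:
                continue  # only the three alternative bases are substitutions
            mutant = sequence[:pos] + base + sequence[pos + 1:]
            effects[(pos, base)] = score_fn(mutant) - base_score
    return effects


def toy_score(sequence):
    """Hypothetical stand-in for the trained model: 1.0 if TATAAA is present."""
    return 1.0 if "TATAAA" in sequence else 0.0


# Toy 300-bp input: G everywhere except a TATAAA motif starting at index 219,
# which corresponds to position -30 under the index mapping assumed above.
seq = "G" * 219 + "TATAAA" + "G" * 75
# Indices 209..258 cover positions -40..+10: 50 positions x 3 bases = 150.
effects = mutation_scan(seq, toy_score, 209, 259)
print(len(effects))  # 150
```

With a real model in place of `toy_score`, negative values mark substitutions that lower the output score and positive values mark those that raise it, which is the information a substitution heat map displays.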

# 4. CONCLUSION

Accurate prediction of promoter sequences is essential for understanding the underlying mechanism of the gene regulation process. In this work, we developed DeePromoter, which is based on a combination of a convolutional neural network and a bidirectional LSTM, to predict short eukaryotic promoter sequences in human and mouse for both TATA and non-TATA promoters. The essential component of this work was to overcome the issue of low precision (high false positive rate) noticed in the previously developed tools, which rely on some obvious features or motifs in the sequence when classifying promoter and non-promoter sequences. In this work, we were particularly interested in constructing a hard negative set that drives the models toward exploring the sequence for deep and relevant features instead of only distinguishing promoter and non-promoter sequences based on the existence of some functional motifs. The main benefit of using DeePromoter is that it significantly reduces the number of false positive predictions while achieving high accuracy on challenging datasets. DeePromoter outperformed the previous method not only in performance but also in overcoming the issue of high false positive predictions. We expect that this framework may be helpful in drug-related applications and academia.

#### AUTHOR CONTRIBUTIONS

MO and ZL prepared the dataset, conceived the algorithm, and carried out the experiment and analysis. MO and HT prepared the webserver and wrote the manuscript with support from ZL and KC. All authors discussed the results and contributed to the final manuscript.

#### FUNDING

This research was supported by the Brain Research Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. NRF-2017M3C7A1044815).

#### REFERENCES

Processing Systems, eds F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (New York, NY: ACM), 1097–1105.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Oubounyt, Louadi, Tayara and Chong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# LPI-IBNRA: Long Non-coding RNA-Protein Interaction Prediction Based on Improved Bipartite Network Recommender Algorithm

#### Guobo Xie<sup>1†</sup>, Cuiming Wu<sup>1†</sup>, Yuping Sun<sup>1</sup>\*, Zhiliang Fan<sup>1</sup> and Jianghui Liu<sup>2</sup>

*<sup>1</sup> School of Computers, Guangdong University of Technology, Guangzhou, China, <sup>2</sup> Department of Emergency, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China*

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Fei Guo, Tianjin University, China Jia Qu, China University of Mining and Technology, China Qi Zhao, Liaoning University, China*

> \*Correspondence: *Yuping Sun syp@gdut.edu.cn*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *26 January 2019* Accepted: *29 March 2019* Published: *18 April 2019*

#### Citation:

*Xie G, Wu C, Sun Y, Fan Z and Liu J (2019) LPI-IBNRA: Long Non-coding RNA-Protein Interaction Prediction Based on Improved Bipartite Network Recommender Algorithm. Front. Genet. 10:343. doi: 10.3389/fgene.2019.00343*

According to the latest research, lncRNAs (long non-coding RNAs) play a broad and important role in various biological processes by interacting with proteins. However, identifying whether proteins interact with a specific lncRNA through biological experimental methods is difficult, costly, and time-consuming. Thus, many bioinformatics computational methods have been proposed to predict lncRNA-protein interactions. In this paper, we proposed a novel approach called Long non-coding RNA-Protein Interaction Prediction based on Improved Bipartite Network Recommender Algorithm (LPI-IBNRA). In the proposed method, we implemented a two-round resource allocation and eliminated the second-order correlations appropriately on the bipartite network. Experimental results illustrate that LPI-IBNRA outperforms five previous methods, with the AUC values of 0.8932 in leave-one-out cross validation (LOOCV) and 0.8819 ± 0.0052 in 10-fold cross validation, respectively. In addition, case studies on four lncRNAs were carried out to show the predictive power of LPI-IBNRA.

Keywords: lncRNA, protein, interaction prediction, bipartite network, second-order correlation elimination

# 1. INTRODUCTION

LncRNA, a class of ncRNAs (non-coding RNAs) of more than 200 nucleotides, that do not encode proteins, has gained increasing scientific interest in recent years (Jorge et al., 2012; Hajjari et al., 2016). Only 2% of RNAs in the human transcriptome can encode proteins while others, called ncRNAs, cannot. Note that most ncRNAs are lncRNAs. Compared to other ncRNAs, lncRNAs are longer and have complex secondary or higher-order structures (Bonasio and Shiekhattar, 2014), and their genes often have independent regulatory elements such as promoters and enhancers (Ulitsky and Bartel, 2013). There is increasing evidence that lncRNAs are related to the regulation of gene expression levels such as epigenetic regulation, transcriptional regulation, and multiple levels of post-transcriptional regulation (Sarah and Jeff, 2013), but only a few functions and mechanisms of lncRNA have been studied (Maarabouni et al., 2008; Lee et al., 2016). Moreover, interactions of lncRNAs with other molecules also have become hot spots in oncology research over the past years. The studies found that an important way for lncRNAs to function is by interacting with proteins (Khalil and Rinn, 2011). LncRNAs play a broad and important regulatory role in various processes such as tumorigenesis, cancer progression and metastasis by interacting with proteins. Thus, the prediction and identification of lncRNA-protein interactions can further reveal lncRNA-related functions and is beneficial for the study on the pathogenesis of complex diseases at the molecular level (Faghihi et al., 2008; Chen and Yan, 2013; Cui et al., 2013; Li et al., 2013; Chen et al., 2016a,c, 2017b, 2018c).

Numerous biological experimental methods were exploited to confirm protein-related RNAs (Ule et al., 2005; Galgano and Gerber, 2011; Zambelli and Pavesi, 2015). However, such experimental methods are laborious, time-consuming, and costly (Huang et al., 2012). Recently, various computational methods have been proposed to address the challenges in bioinformatics (He et al., 2018a,b; Zou et al., 2018), such as lncRNA-protein (Hu et al., 2017; Shen et al., 2019a,b), miRNA-disease (Chen and Huang, 2017; Chen et al., 2018a,b,d,e,f; Jiang et al., 2018a,b; Xie et al., 2018), drug-target (Chen et al., 2016b; Wang et al., 2017; Wu et al., 2018) and microbe-disease association prediction (Chen et al., 2017a; Peng et al., 2018). The methods for inferring lncRNA-protein associations can roughly be classified into two types: machine learning methods and network-based methods. Machine learning methods usually extract biological features of lncRNAs and proteins, and then employ a supervised classifier to identify whether proteins have potential interactions with a specific lncRNA (Zhan et al., 2018). For example, Bellucci et al. (2011) proposed to utilize secondary structure, hydrogen bonding and van der Waals contributions for feature integration, which has a beneficial effect for inferring the binding propensity of protein and ncRNA. Muppirala et al. (2011) utilized protein and lncRNA sequence information with a support vector machine (SVM) and random forest (RF). Suresh et al. (2015) proposed an SVM-based method named RPI-Pred, which uses high-order 3D structural features and sequences of the lncRNA and protein. Hu et al. (2018) developed a method called HLPI-Ensembl, adopting the ensemble strategy based on extreme gradient boosting (XGB), SVM and RF. However, the main drawback of these methods is the insufficiency of negative samples of lncRNA-protein interactions. The lack of negative samples may make the performance of the supervised classifier unstable. Moreover, selecting appropriate features to predict lncRNA-protein interactions is not an easy task.

Apart from the aforementioned methods, there are other approaches for potential lncRNA-protein interaction prediction that employ network analysis algorithms. For instance, Li et al. (2015) presented a method called LPIHN, which constructs a heterogeneous network and implements a random walk with restart on it. In order to improve prediction performance, some recent network-based methods use recommender algorithms to infer lncRNA-protein interactions. For example, Ge et al. (2016) proposed a method called LPBNI, which only uses known lncRNA-protein interactions and implements a two-step propagation on a bipartite network. Zhao et al. (2018b) introduced an approach based on the bipartite network called LPI-BNPRA, which infers lncRNA-protein interactions by constructing bias ratings for lncRNAs and proteins, using agglomerative hierarchical clustering. By implementing two-round resource allocation on bipartite networks, these approaches achieved impressive results. However, their predictive validity remains insufficient due to the existence of high-order correlations, which might have a negative effect on lncRNA-protein interaction prediction. For example, proteins directly correlated through the same lncRNA could also be indirectly correlated through other intermediate proteins, resulting in correlation redundancy. Properly eliminating the redundancy induced by the second-order correlation might further enhance the accuracy of the prediction. This inspired us to develop an effective network-based recommender algorithm for lncRNA-protein interaction prediction.

Motivated by the effectiveness of high-order correlation elimination in the study of Qiu et al. (2014), we propose a novel method named LPI-IBNRA for inferring new lncRNA-protein interactions. LPI-IBNRA uses known lncRNA-protein and protein-protein interactions, and lncRNA expression similarity, and then eliminates second-order correlations on the bipartite network appropriately to enhance the prediction accuracy. Compared with previous machine learning methods, our method does not require negative samples. Compared with many existing network-based methods (Ge et al., 2016; Zhao et al., 2018b), our method yields comparable or even better results due to second-order correlation elimination. Both 10-fold cross validation and LOOCV were carried out to assess the prediction ability of the proposed method. Experimental results illustrated that LPI-IBNRA outperformed five other methods by achieving higher AUC values. In addition, case studies on four lncRNAs further demonstrated the predictive power of LPI-IBNRA. Therefore, we conclude that LPI-IBNRA is feasible and effective for inferring potential lncRNA-protein interactions.

# 2. MATERIALS AND METHODS

#### 2.1. Human LncRNA-Protein Interactions

The known ncRNA-protein interaction dataset was downloaded from the NPInter v2.0 database (Yuan et al., 2014). We limited the organism to "Homo sapiens" and the type of ncRNAs to "NONCODE", in order to filter ncRNAs and their interacting proteins. The lncRNAs were further filtered from these ncRNAs, through a human lncRNA dataset from the NONCODE 4.0 database (Xie et al., 2014). We deleted duplicate interactions. Considering the sample requirement of LOOCV, we removed the lncRNAs and proteins that have only one interaction. We then obtained 4,796 distinct experimentally confirmed lncRNA-protein interactions, containing 26 proteins and 1,105 lncRNAs. We denoted np as the number of known proteins, nl as the number of known lncRNAs, and matrix I ∈ R<sup>np×nl</sup> as the adjacency matrix of protein-lncRNA interactions. The interaction between protein p<sub>i</sub> and lncRNA l<sub>j</sub> could be denoted as follows:

$$I(p\_i, l\_j) = \begin{cases} 1 & \text{if } p\_i \text{ interacts with } l\_j \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
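As an illustration of Equation (1), the following minimal sketch builds the adjacency matrix from a list of interaction pairs. The function name, the toy protein and lncRNA identifiers, and the use of plain Python lists are assumptions made for the example; they are not taken from the original implementation.

```python
def build_interaction_matrix(pairs, proteins, lncrnas):
    """Return the np x nl matrix of Equation (1) as a list of rows."""
    protein_index = {name: i for i, name in enumerate(proteins)}
    lncrna_index = {name: j for j, name in enumerate(lncrnas)}
    matrix = [[0] * len(lncrnas) for _ in proteins]
    # set() drops duplicate interaction records before filling the matrix.
    for protein, lncrna in set(pairs):
        matrix[protein_index[protein]][lncrna_index[lncrna]] = 1
    return matrix


# Hypothetical toy data, not taken from NPInter.
proteins = ["P1", "P2", "P3"]
lncrnas = ["L1", "L2", "L3", "L4"]
pairs = [("P1", "L1"), ("P1", "L2"), ("P2", "L2"), ("P2", "L3"),
         ("P3", "L3"), ("P3", "L4"), ("P1", "L1")]  # last pair is a duplicate
interaction = build_interaction_matrix(pairs, proteins, lncrnas)
for name, row in zip(proteins, interaction):
    print(name, row)
```

Rows correspond to proteins and columns to lncRNAs, matching the np × nl layout used in the rest of this section.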

#### 2.2. Protein-Protein Interaction Score Matrix and Similarity Matrix

Protein-protein interactions (PPI) were obtained from the STRING 9.1 database (Franceschini et al., 2013), which included weighted protein-protein interactions through co-expression data, genomic context predictions, automated text mining, and high-throughput lab experiments. We then deleted the redundant PPI data, and obtained 214 PPI records and the corresponding interaction scores based on the known lncRNA-protein dataset. The symmetric matrix AP was denoted as an interaction score matrix based on PPI data, where AP(p<sub>i</sub>, p<sub>j</sub>) is the interaction score between proteins p<sub>i</sub> and p<sub>j</sub>. AP could then be standardized as follows:

$$AP'(p\_i, p\_j) = \frac{AP(p\_i, p\_j)}{\sqrt{R(p\_i)R(p\_j)}},\tag{2}$$

where R(p<sub>i</sub>) is the sum of the elements in the ith row of AP.
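The standardization in Equation (2) can be sketched as follows. The function name and the toy score matrix are hypothetical, and returning 0 for a protein whose row sum is zero is an assumption of this sketch (the text does not say how such rows are treated).

```python
import math


def normalize_scores(AP):
    """Apply Equation (2): AP'(i, j) = AP(i, j) / sqrt(R(i) * R(j))."""
    row_sums = [sum(row) for row in AP]
    n = len(AP)
    normalized = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(row_sums[i] * row_sums[j])
            # Assumption: pairs involving an all-zero row keep a score of 0.
            if denom > 0:
                normalized[i][j] = AP[i][j] / denom
    return normalized


# Hypothetical symmetric PPI score matrix for three proteins.
AP = [[0.0, 0.9, 0.0],
      [0.9, 0.0, 0.4],
      [0.0, 0.4, 0.0]]
AP_norm = normalize_scores(AP)
print(AP_norm)
```

Because the denominator is symmetric in i and j, a symmetric AP yields a symmetric AP′.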

Considering the hypothesis that similar proteins tend to exhibit a similar interaction and non-interaction pattern with lncRNAs (Zheng et al., 2017), we calculated the protein similarity using Gaussian kernel interaction profiles. We denoted X(p<sub>i</sub>) as the ith row vector of matrix I, in which the nonzero values mark the lncRNAs that interact with protein p<sub>i</sub>. Then the similarity between proteins p<sub>i</sub> and p<sub>j</sub> based on Gaussian kernel interaction profiles could be calculated as follows:

$$KP(p\_i, p\_j) = \exp(-\beta\_p \left\| X(p\_i) - X(p\_j) \right\|^2),\tag{3}$$

where the adjustment coefficient β<sub>p</sub> for the kernel bandwidth is defined as follows:

$$\beta\_p = \beta\_p' / (\frac{1}{np} \sum\_{i=1}^{np} \left\| X(p\_i) \right\|^2). \tag{4}$$
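Equations (3) and (4) can be sketched in a few lines. The function name is hypothetical, and the default value of 1.0 for the unnormalized bandwidth β′ is an assumption: the text does not state the value that was used. The sketch also assumes that at least one interaction is present, so that the mean squared norm is nonzero.

```python
import math


def gaussian_profile_kernel(profiles, beta_prime=1.0):
    """Gaussian interaction profile kernel over a list of profile vectors."""
    n = len(profiles)
    # Equation (4): normalize the bandwidth by the mean squared profile norm.
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    beta = beta_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-beta * sq_dist)  # Equation (3)
    return kernel


# Rows are protein interaction profiles X(p_i) from a toy adjacency matrix.
interaction_profiles = [[1, 1, 0, 0],
                        [0, 1, 1, 0],
                        [0, 0, 1, 1]]
KP = gaussian_profile_kernel(interaction_profiles)
print(KP)
```

Passing the columns of the adjacency matrix instead of its rows gives the corresponding kernel for lncRNAs.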

#### 2.3. LncRNA-LncRNA Similarity Matrix

LncRNA expression profiles were downloaded from the NONCODE 4.0 database (Xie et al., 2014). After removing the superfluous data, we obtained the expression profiles of 1,105 lncRNAs in 24 human tissues or cell types. Then the Pearson correlation coefficient (PCC) was applied for the calculation of lncRNA expression similarity between each pair of lncRNA expression profiles (Wang et al., 2010; Ganegoda et al., 2013; Tang et al., 2014). We denoted E(i) = {e<sub>i1</sub>, e<sub>i2</sub>, . . . , e<sub>i24</sub>} and E(j) = {e<sub>j1</sub>, e<sub>j2</sub>, . . . , e<sub>j24</sub>} as the expression profiles of l<sub>i</sub> and l<sub>j</sub>. The expression similarity AL(l<sub>i</sub>, l<sub>j</sub>) between lncRNAs l<sub>i</sub> and l<sub>j</sub> was calculated as follows:

$$AL(l\_i, l\_j) = \left| \frac{cov(E(i), E(j))}{\sigma\_{E(i)} \times \sigma\_{E(j)}} \right|, \tag{5}$$

where AL(l<sub>i</sub>, l<sub>j</sub>) denotes the absolute value of PCC between l<sub>i</sub> and l<sub>j</sub>, cov(E(i), E(j)) is the covariance between E(i) and E(j), and σ<sub>E(i)</sub> and σ<sub>E(j)</sub> are the standard deviations of E(i) and E(j), respectively.
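A minimal sketch of Equation (5) is given below. The function name and the toy expression profiles are hypothetical, and returning 0 for a constant profile (zero standard deviation) is an assumption of the sketch, since the PCC is undefined in that case.

```python
import math


def abs_pearson(x, y):
    """Absolute Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The 1/n factors of covariance and standard deviations cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: constant profiles get zero similarity
    return abs(cov / (sd_x * sd_y))


# Hypothetical profiles over four tissues (the paper uses 24).
print(abs_pearson([1, 2, 3, 4], [2, 4, 6, 8]))    # perfectly correlated
print(abs_pearson([1, 2, 3, 4], [4, 3, 2, 1]))    # anti-correlated
print(abs_pearson([1, 2, 3, 4], [1, -1, 1, -1]))  # weakly correlated
```

Taking the absolute value means that strongly anti-correlated profiles are treated as being as similar as strongly correlated ones.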

We denoted X(l<sub>i</sub>) as the ith column vector of matrix I, in which the nonzero values mark the proteins that interact with lncRNA l<sub>i</sub>. Similar to the aforementioned protein case, the Gaussian interaction profile kernel similarity for lncRNAs could be computed as follows:

$$KL(l\_i, l\_j) = \exp(-\beta\_l \left\| X(l\_i) - X(l\_j) \right\|^2),\tag{6}$$

where

$$\beta\_l = \beta\_l' / (\frac{1}{nl} \sum\_{i=1}^{nl} \left\| X(l\_i) \right\|^2). \tag{7}$$

#### 2.4. Integrated Similarity Matrix for Proteins and LncRNAs

Note that the Gaussian interaction profile kernel similarity is an association information-based measurement, which can be utilized to complement protein-protein interactions and lncRNA expression similarity. Motivated by the study of Chen (2015), we constructed the integrated protein similarity matrix Sim<sup>P</sup> and the integrated lncRNA similarity matrix Sim<sup>L</sup> as follows:

$$Sim^P(p\_i, p\_j) = \begin{cases} \frac{AP'(p\_i, p\_j) + KP(p\_i, p\_j)}{2} & \text{if } AP'(p\_i, p\_j) \neq 0\\ KP(p\_i, p\_j) & \text{otherwise}, \end{cases} \tag{8}$$

$$Sim^L(l\_i, l\_j) = \frac{AL(l\_i, l\_j) + KL(l\_i, l\_j)}{2}.\tag{9}$$
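Equations (8) and (9) can be sketched as below. The function names and the toy matrices are hypothetical; the inputs are assumed to be square lists of rows of equal size.

```python
def integrate_protein_similarity(AP_norm, KP):
    """Equation (8): average with the kernel only where a PPI score exists."""
    n = len(KP)
    return [[(AP_norm[i][j] + KP[i][j]) / 2 if AP_norm[i][j] != 0
             else KP[i][j]
             for j in range(n)] for i in range(n)]


def integrate_lncrna_similarity(AL, KL):
    """Equation (9): plain average of expression and kernel similarity."""
    n = len(KL)
    return [[(AL[i][j] + KL[i][j]) / 2 for j in range(n)] for i in range(n)]


# Hypothetical 2 x 2 inputs.
sim_p = integrate_protein_similarity([[0.0, 0.8], [0.8, 0.0]],
                                     [[1.0, 0.4], [0.4, 1.0]])
sim_l = integrate_lncrna_similarity([[1.0, 0.2], [0.2, 1.0]],
                                    [[1.0, 0.6], [0.6, 1.0]])
print(sim_p)
print(sim_l)
```

The protein case falls back to the kernel similarity alone when no PPI score is available, so missing STRING entries do not halve the similarity.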

#### 2.5. LPI-IBNRA

The flow chart of LPI-IBNRA is shown in **Figure 1**. At first, we denoted S<sup>P</sup> ∈ R<sup>np×nl</sup> as the resource score matrix based on protein similarity, and S<sup>L</sup> ∈ R<sup>np×nl</sup> as the one based on lncRNA similarity. These two matrices were computed as follows:

$$S^P(p\_i, l\_j) = \begin{cases} \frac{\sum\_{k=1}^{np} Sim^P(p\_i, p\_k) I(p\_k, l\_j)}{\sum\_{k=1}^{np} Sim^P(p\_i, p\_k)} & \text{if } I(p\_i, l\_j) = 1\\ 0 & \text{otherwise}, \end{cases} \tag{10}$$

$$S^L(p\_i, l\_j) = \begin{cases} \frac{\sum\_{k=1}^{nl} I(p\_i, l\_k) Sim^L(l\_k, l\_j)}{\sum\_{k=1}^{nl} Sim^L(l\_k, l\_j)} & \text{if } I(p\_i, l\_j) = 1\\ 0 & \text{otherwise}, \end{cases} \tag{11}$$

where S<sup>P</sup>(p<sub>i</sub>, l<sub>j</sub>) represents the score between protein p<sub>i</sub> and lncRNA l<sub>j</sub> based on protein similarity, and S<sup>L</sup>(p<sub>i</sub>, l<sub>j</sub>) represents the score between protein p<sub>i</sub> and lncRNA l<sub>j</sub> based on lncRNA similarity.

Then the integrated resource score matrix was initialized as the weighted sum of S<sup>P</sup> and S<sup>L</sup> as follows:

$$S\_{ini} = \gamma S^P + (1 - \gamma) S^L,\tag{12}$$

where parameter γ ∈ [0, 1] is a scalar controlling the relative contributions of protein similarity and lncRNA similarity in S<sub>ini</sub>. Following the general setting, we set the parameter γ = 0.5 in this paper, making S<sup>P</sup> and S<sup>L</sup> equally weighted.
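The initialization in Equations (10) to (12) can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name and toy inputs are hypothetical, and the denominator of Equation (11) is taken as the sum of lncRNA similarities to the target lncRNA l<sub>j</sub>, which is an assumption about the intended index.

```python
def initial_resource_scores(I, sim_p, sim_l, gamma=0.5):
    """Return S_ini = gamma * S^P + (1 - gamma) * S^L as a list of rows."""
    n_p, n_l = len(I), len(I[0])
    S_ini = [[0.0] * n_l for _ in range(n_p)]
    for i in range(n_p):
        for j in range(n_l):
            if I[i][j] != 1:
                continue  # scores are defined only on known interactions
            # Equation (10): similarity-weighted share of proteins bound to l_j.
            s_p = (sum(sim_p[i][k] * I[k][j] for k in range(n_p))
                   / sum(sim_p[i]))
            # Equation (11): similarity-weighted share of lncRNAs bound by p_i.
            s_l = (sum(I[i][k] * sim_l[k][j] for k in range(n_l))
                   / sum(sim_l[k][j] for k in range(n_l)))
            S_ini[i][j] = gamma * s_p + (1 - gamma) * s_l  # Equation (12)
    return S_ini


# Hypothetical toy inputs: two proteins, two lncRNAs.
I_toy = [[1, 1], [0, 1]]
sim_p_toy = [[1.0, 0.5], [0.5, 1.0]]
sim_l_toy = [[1.0, 0.5], [0.5, 1.0]]
S_ini_toy = initial_resource_scores(I_toy, sim_p_toy, sim_l_toy)
print(S_ini_toy)
```

With γ = 0.5, each edge weight is the mean of its protein-based and lncRNA-based scores, so edges of the bipartite network are weighted rather than binary.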

The final score matrix can be obtained by updating S<sub>ini</sub> column by column. In other words, the calculation process can be partitioned into nl runs, each of which corresponds to a specific lncRNA. Thus, at the beginning of the kth run, the score for protein p<sub>i</sub> interacting with the given lncRNA l<sub>k</sub> can be initialized as follows:

$$s\_0(p\_i) = S\_{ini}(p\_i, l\_k). \tag{13}$$

Then the first round of our allocation model allocates the score from the protein p<sub>i</sub> to the lncRNA l<sub>k</sub>, which can be calculated as follows:

$$s\_1(p\_i, l\_k) = \frac{S\_{ini}(p\_i, l\_k)s\_0(p\_i)}{d(p\_i)},\tag{14}$$

where d(p<sub>i</sub>) = Σ<sub>x=1</sub><sup>nl</sup> S<sub>ini</sub>(p<sub>i</sub>, l<sub>x</sub>) is obtained by a summing operation over all initial scores from lncRNAs interacting with protein p<sub>i</sub>.

The score of lncRNA l<sub>k</sub> can be obtained by summing scores over all proteins connected with l<sub>k</sub>:

$$s\_1(l\_k) = \sum\_{j=1}^{np} s\_1(p\_j, l\_k) = \sum\_{j=1}^{np} \frac{S\_{ini}(p\_j, l\_k) s\_0(p\_j)}{d(p\_j)}.\tag{15}$$

In the second round, resource scores were allocated in the same way as in the first round. The score allocated from the lncRNA l<sub>k</sub> to the protein p<sub>i</sub> was calculated as follows:

$$s\_2(p\_i, l\_k) = \frac{S\_{ini}(p\_i, l\_k)s\_1(l\_k)}{d(l\_k)},\tag{16}$$

where d(l<sub>k</sub>) = Σ<sub>y=1</sub><sup>np</sup> S<sub>ini</sub>(p<sub>y</sub>, l<sub>k</sub>) is the sum of initial scores from all proteins interacting with lncRNA l<sub>k</sub>.

The score of protein p<sub>i</sub> was allocated from all lncRNAs that interacted with p<sub>i</sub> as follows:

$$s\_2(p\_i) = \sum\_{k=1}^{nl} \frac{S\_{ini}(p\_i, l\_k)s\_1(l\_k)}{d(l\_k)} = \sum\_{k=1}^{nl} \frac{S\_{ini}(p\_i, l\_k)}{d(l\_k)} \sum\_{j=1}^{np} \frac{S\_{ini}(p\_j, l\_k)s\_0(p\_j)}{d(p\_j)}.\tag{17}$$

As described in Equations (13) to (17), we first initialized the score of protein p<sub>i</sub> from the given lncRNA l<sub>k</sub> and then updated it by a two-round resource allocation. An example is given in **Figure 2**. We defined S<sub>fin</sub> ∈ R<sup>np×nl</sup> as the final resource score matrix, which can be represented as follows:

$$S\_{fin}(p\_i, l\_k) = s\_2(p\_i). \tag{18}$$

S<sub>fin</sub> can also be computed in a vectorized form as:

$$
\vec{S}\_{fin} = W \vec{S}\_{ini} \tag{19}
$$

where S⃗<sub>fin</sub> is a column vector of S<sub>fin</sub>, S⃗<sub>ini</sub> is a column vector of S<sub>ini</sub>, and W ∈ R<sup>np×np</sup> is the weight matrix. Then Equation (17) can also be represented as:

$$s\_2(p\_i) = \sum\_{j=1}^{np} W(p\_i, p\_j) s\_0(p\_j),\tag{20}$$

where

$$W(p\_i, p\_j) = \frac{1}{d(p\_j)} \sum\_{k=1}^{nl} \frac{S\_{ini}(p\_i, l\_k) S\_{ini}(p\_j, l\_k)}{d(l\_k)}.\tag{21}$$
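The equivalence between the explicit two-round allocation of Equations (15) and (17) and the weight-matrix form of Equations (20) and (21) can be checked numerically. The sketch below is illustrative only: the function names and the toy S<sub>ini</sub> are hypothetical, and skipping nodes with zero degree is an assumption made to avoid division by zero.

```python
def degrees(S_ini):
    """Return d(p_i) (row sums) and d(l_k) (column sums) of S_ini."""
    d_p = [sum(row) for row in S_ini]
    d_l = [sum(row[k] for row in S_ini) for k in range(len(S_ini[0]))]
    return d_p, d_l


def weight_matrix(S_ini):
    """Equation (21): protein-to-protein transfer weights."""
    d_p, d_l = degrees(S_ini)
    n_p, n_l = len(S_ini), len(S_ini[0])
    W = [[0.0] * n_p for _ in range(n_p)]
    for i in range(n_p):
        for j in range(n_p):
            if d_p[j] == 0:
                continue  # assumption: isolated proteins transfer nothing
            shared = sum(S_ini[i][k] * S_ini[j][k] / d_l[k]
                         for k in range(n_l) if d_l[k] > 0)
            W[i][j] = shared / d_p[j]
    return W


def two_round_allocation(S_ini, s0):
    """Equations (15) and (17): proteins -> lncRNAs -> proteins."""
    d_p, d_l = degrees(S_ini)
    n_p, n_l = len(S_ini), len(S_ini[0])
    s1 = [sum(S_ini[j][k] * s0[j] / d_p[j]
              for j in range(n_p) if d_p[j] > 0) for k in range(n_l)]
    return [sum(S_ini[i][k] * s1[k] / d_l[k]
                for k in range(n_l) if d_l[k] > 0) for i in range(n_p)]


# Hypothetical initial scores: two proteins, three lncRNAs.
S_ini_demo = [[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]]
W_demo = weight_matrix(S_ini_demo)
s0_demo = [row[0] for row in S_ini_demo]  # Equation (13) for the first lncRNA
explicit = two_round_allocation(S_ini_demo, s0_demo)
via_w = [sum(W_demo[i][j] * s0_demo[j] for j in range(len(s0_demo)))
         for i in range(len(W_demo))]
print(explicit, via_w)
```

Each column of W sums to 1 in this example, reflecting that the two-round allocation redistributes the initial resource without creating or destroying it.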

In the lncRNA-protein interaction network, the proteins interacting with the same lncRNA are considered to be directly correlated, i.e., having the low-order correlation, while higher-order correlations between these proteins might also arise from indirect associations. Such high-order correlations might have a negative effect on the lncRNA-protein interaction prediction. Based on the studies of Zhou et al. (2009) and Liu et al. (2010), we eliminated second-order correlations in an appropriate way to further enhance the accuracy of the prediction:

$$W' = W + \alpha \, W^2,\tag{22}$$

where the parameter α ∈ (−1, 0). The final score matrix for inferring potential lncRNA-protein interactions can then be calculated as follows:

$$S'\_{fin} = W' S\_{ini}.\tag{23}$$

After the calculations, we can recommend proteins to the given lncRNA l<sub>k</sub> in descending order by the kth column of S′<sub>fin</sub>.
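Equations (22) and (23) and the final ranking step can be sketched as follows. The helper names, the toy weight matrix, and the toy initial scores are hypothetical; α = −0.7 mirrors the value selected later in the paper. The sketch ranks all proteins, whereas a practical recommender would typically rank only candidate proteins without a known interaction, which is an assumption not spelled out in the text.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def eliminate_second_order(W, alpha=-0.7):
    """Equation (22): W' = W + alpha * W^2 with alpha in (-1, 0)."""
    W2 = matmul(W, W)
    n = len(W)
    return [[W[i][j] + alpha * W2[i][j] for j in range(n)] for i in range(n)]


def rank_proteins(S_fin, k, proteins):
    """Order proteins by descending score in the kth column of S'_fin."""
    order = sorted(range(len(proteins)),
                   key=lambda i: S_fin[i][k], reverse=True)
    return [proteins[i] for i in order]


# Hypothetical inputs: two proteins, three lncRNAs.
W_toy = [[5 / 6, 1 / 6],
         [1 / 6, 5 / 6]]
S_ini_rank = [[1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0]]
W_prime = eliminate_second_order(W_toy)
S_fin_prime = matmul(W_prime, S_ini_rank)  # Equation (23)
print(rank_proteins(S_fin_prime, 0, ["P1", "P2"]))
print(rank_proteins(S_fin_prime, 2, ["P1", "P2"]))
```

Because α is negative, the W² term subtracts part of the two-step (second-order) transfer between proteins, which is how the redundant correlation is damped.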

#### 2.6. Performance Evaluation

We evaluated the classification performance of the proposed LPI-IBNRA method by applying two types of classification schemes, i.e., LOOCV and 10-fold cross validation. The performance of LPI-IBNRA was evaluated in terms of several widely-used indicators, including precision (PRE), sensitivity (SEN), accuracy (ACC), F1 score, and Matthews correlation coefficient (MCC), expressed as follows:

$$PRE = \frac{TP}{TP + FP},\tag{24}$$

$$\text{SEN} = \frac{TP}{TP + FN},\tag{25}$$

$$ACC = \frac{TP + TN}{TP + TN + FP + FN},\tag{26}$$

$$F1\text{ Score} = \frac{2 \times TP}{2 \times TP + FP + FN} = 2 \times \frac{PRE \times \text{SEN}}{PRE + \text{SEN}},\tag{27}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN) \times (TN + FP) \times (TP + FN) \times (TP + FP)}},\tag{28}$$

where TP, TN, FP, and FN count the number of true positives, true negatives, false positives, and false negatives, respectively.
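The indicators in Equations (24) to (28) can be computed directly from the four confusion counts. In the sketch below, the function name and the example counts are hypothetical, MCC follows the standard definition with numerator TP × TN − FP × FN, and returning 0 when a denominator is zero is an assumption of the sketch.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return PRE, SEN, ACC, F1 and MCC from confusion-matrix counts."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    sen = tp / (tp + fn) if tp + fn else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"PRE": pre, "SEN": sen, "ACC": acc, "F1": f1, "MCC": mcc}


# Hypothetical counts, not results from the paper.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC stays informative when positives and negatives are heavily imbalanced, which is the usual situation for interaction prediction.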

As a popular method for performance evaluation, the receiver operating characteristic (ROC) curve was also utilized in our experiments. An area under the ROC curve (AUC) of 1 indicates perfect performance, while an AUC of 0.5 indicates random performance. The precision-recall curve (PR curve) and the area under the PR curve (AUPR) are also used to reduce the negative influence of false positive data on the method performance. The larger the AUC and AUPR are, the better the evaluated method performs.
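For intuition about the AUC values quoted in this paper, the AUC can be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The sketch below uses this pairwise (Mann-Whitney) formulation; the function name and the toy labels and scores are hypothetical, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def auc_score(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions.
print(auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # perfect ranking
print(auc_score([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # one misordered pair
print(auc_score([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
```

The function assumes that both classes are present; with no positives or no negatives the AUC is undefined.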

# 3. RESULTS

#### 3.1. Comparison With Other Methods

We used the aforementioned 4,796 known human lncRNA-protein interactions to carry out the above-mentioned two cross validation schemes. In each LOOCV trial, one known lncRNA-protein interaction was used as the test sample while the rest were used as training samples. To analyze the influence of parameter α on the performance of LPI-IBNRA, we applied LOOCV for the selection of parameter α. As shown in **Figure 3A**, the performance of LPI-IBNRA drops sharply when α is smaller than –0.70. When α is larger than –0.70, the performance of LPI-IBNRA decreases slightly. Thus, the parameter α is set to –0.70, which gives the optimal performance.

FIGURE 3 | (A) The AUC values of LPI-IBNRA method with different values of α. (B) ROC curves of lncRNA-protein interaction predictions by all compared methods in LOOCV. (C) Precision-recall curves of all compared methods. (D) ROC curves of lncRNA-protein interaction predictions by all compared methods in 10-fold cross validation.

Five previous approaches were used for comparison in the experiments, including collaborative filtering (CF), random walk with restart (RWR), LPBNI, LPIHN, and LPI-BNPRA. LPBNI, LPIHN, and LPI-BNPRA are network-based methods that infer potential lncRNA-protein interactions, while CF and RWR have been used as benchmark methods in Ge et al. (2016) and Wen et al. (2017). RWR is often utilized as a powerful tool for network-based methods to forecast association (Zhao et al., 2018a,c; Zhu et al., 2018), while CF is a well-known recommender algorithm which can infer the information from similar neighborhoods (Fu et al., 2014; Zeng et al., 2017). In our experiments, RWR was implemented to make predictions based on the protein-protein similarity network, while a simple version of the CF algorithm was adopted to calculate the prediction scores between lncRNAs and proteins.

Here, we reproduced these methods on the same dataset ourselves. See **Figures 3B,C** and **Table 1** for the results of LOOCV. We can see from **Figure 3B** that our proposed method achieved an AUC of 0.8932, which exhibited a considerable improvement over the five previous methods (i.e., 12.81% for CF, 10.71% for RWR, 1.56% for LPBNI, 2.00% for LPIHN and 3.39% for LPI-BNPRA). In addition, the comparison of these methods, in terms of precision vs. recall, is presented in **Figure 3C**. It can be seen that LPI-IBNRA achieved a higher precision than the other methods at almost every recall value. Moreover, LPI-IBNRA outperformed the other methods in terms of AUPR, PRE, SEN, ACC, F1 score and MCC, as presented in **Table 1**. As shown in **Figure 3D**, in 10-fold cross validation, LPI-IBNRA achieved an AUC of 0.8819 ± 0.0052 and was superior to the comparison methods, including CF (0.7655 ± 0.0069), RWR (0.7800 ± 0.0076), LPBNI (0.8695 ± 0.0047), LPIHN (0.8591 ± 0.0044), and LPI-BNPRA (0.8413 ± 0.0351).

The aforementioned results indicate that in both LOOCV and 10-fold cross validation, LPI-IBNRA outperforms other methods in terms of the AUC values. The outstanding performance of LPI-IBNRA demonstrates its stable and satisfying abilities in inferring potential lncRNA-protein interactions. The superior performance of the proposed method could be attributed to second-order correlation elimination, which is more suitable for our task and can lead to better prediction performance.

TABLE 2 | The top five ranked proteins for lncRNA DLEU2, CRHR1-1T1, LRRC75A-AS1, and SNHG5.

#### 3.2. Case Studies

In addition, four case studies were carried out to further evaluate the effectiveness of LPI-IBNRA. The interactions in our benchmark dataset were obtained from NPInter v2.0, which was established in 2013. NPInter was then upgraded to NPInter v3.0 in 2016 (Hao et al., 2016), which includes newly discovered lncRNA-protein interactions. Thus, we predicted novel lncRNA-protein interactions based on known interactions in the benchmark dataset, then checked our predictions against NPInter v3.0. For each lncRNA, the proteins ranked within the top 5 were considered as potential proteins that interact with the given lncRNA. Case studies were carried out on four lncRNAs, including lncRNA DLEU2, CRHR1-1T1, LRRC75A-AS1 and SNHG5.

**Table 2** shows the prediction results and whether they were confirmed for these lncRNAs. It indicates that five (DLEU2), five (CRHR1-1T1), five (LRRC75A-AS1), and four (SNHG5) out of the top five predicted lncRNA-interacting proteins were confirmed by NPInter v3.0. The rankings of these lncRNA-protein interactions in the predictions of other benchmark methods are also listed in **Table 2**. It can be observed that several novel interactions did not have high rankings in the predictions of other methods, and these interactions are likely to be ignored by these methods. Therefore, LPI-IBNRA has great potential to predict new lncRNA-protein interactions.

# 4. DISCUSSION AND CONCLUSION

In this article, we proposed a novel method LPI-IBNRA for predicting lncRNA-protein interactions, based on the known lncRNA-protein interactions, lncRNA expression similarity and protein-protein interactions. We integrated the known interactions and similarity as the initial resource scores for a two-round resource allocation of a bipartite network recommendation. Furthermore, we optimized the weight matrix by eliminating second-order correlations appropriately, to obtain the final result of lncRNA-protein interaction prediction. We finally acquired gratifying and reliable prediction performance in LOOCV, 10-fold cross validation and case studies. Thus, we believe that LPI-IBNRA can make reliable predictions and might guide future experimental studies on lncRNA-protein interactions.

LPI-IBNRA has the following improvements over several previous methods in predicting lncRNA-protein interactions. First, with the employment of the bipartite network recommender algorithm, we utilized the known lncRNA-protein interactions to construct a bipartite network between lncRNAs and proteins, and then allocated the resource scores via interaction edges between lncRNA nodes and protein nodes. Therefore, a negative sample set is not required in our method. Second, we assigned weights to each edge on the bipartite network, which is distinct from most former bipartite network methods. Thus, the resource scores would not be evenly distributed during the resource allocation process. Finally, we eliminated second-order correlations on the bipartite network appropriately, to enhance prediction accuracy.

Although impressive results have been achieved, there is still much room for improvement in our method. First, although more lncRNA-protein interactions are known than before, it is still very difficult for the proposed method to obtain adequate prediction results from them. Moreover, as the resource allocation of the bipartite network recommendation algorithm is based on known lncRNA-protein interactions, LPI-IBNRA is not suitable for predicting interactions of lncRNAs without any known interacting protein.

### AUTHOR CONTRIBUTIONS

GX designed the experiments. CW and ZF performed the experiments. GX, CW, YS, ZF, and JL conceived the project and analyzed the data. CW and YS wrote the manuscript and all authors contributed to the writing.

### FUNDING

This work was supported by the National Natural Science Foundation of China (618002072), the Natural Science Foundation of Guangdong Province (2018A030313389), and the Science and Technology Plan Project of Guangdong Province (2017A040405050, 2016B030306004, 2015B010129014).

### REFERENCES

non-coding rna-protein interactions. Mol. Therapy Nucleic Acids 13, 464–471. doi: 10.1016/j.omtn.2018.09.020


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xie, Wu, Sun, Fan and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# An Ensemble Strategy to Predict Prognosis in Ovarian Cancer Based on Gene Modules

#### Yi-Cheng Gao, Xiong-Hui Zhou\* and Wen Zhang\*

*Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China*

Due to the high heterogeneity and complexity of cancer, it is still a challenge to predict the prognosis of cancer patients. In this work, we used a clustering algorithm to divide patients into different subtypes in order to reduce the heterogeneity of the cancer patients in each subtype. We hypothesized that, because the gene co-expression network may reveal relationships among genes, some communities in the network could influence the prognosis of cancer patients, and that all the prognosis-related communities together could fully reveal the prognosis of cancer patients. To predict the prognosis for cancer patients in each subtype, we adopted an ensemble classifier based on the gene co-expression network of the corresponding subtype. Using the gene expression data of ovarian cancer patients in TCGA (The Cancer Genome Atlas), three subtypes were identified. Survival analysis showed that patients in different subtypes had different survival risks. Three ensemble classifiers were constructed, one for each subtype. Leave-one-out and independent validation showed that our method outperformed control and literature methods. Furthermore, the function annotation of the communities in each subtype showed that some communities were cancer-related. Finally, we found that the current drug targets can partially support our method.

Keywords: prognosis gene, ovarian cancer, subtype, gene co-expression network, ensemble classifier

# INTRODUCTION

Cancer is a disease that seriously endangers human health (Siegel et al., 2017). Cancer prognosis research is very important to avoid patients receiving excessive or improper treatment (Domany, 2014; Kourou et al., 2015). Ovarian cancer is one of the most common malignant tumors and there is an urgent need to develop new treatment methods to improve the prognosis (Wang et al., 2017). Identifying prognostic genes in cancer is important not only for the treatment of cancer patients but also for drug discovery (Wang et al., 2017). Therefore, the selection of prognostic genes and prognosis prediction for ovarian cancer is of great importance (Konecny et al., 2016).

These days, many methods have been used to solve biological problems using high-throughput biological data (Zhang et al., 2017, 2018a,b,c,d,e,f) and machine learning algorithms (Zhang et al., 2008). However, the existing models for predicting the outcomes of ovarian cancer generalize poorly (Konecny et al., 2016), possibly due to the high heterogeneity of cancer (Burrell et al., 2013). Even a single cancer can be divided into different subtypes (Jiang et al., 2019), but most of the existing methods do not take this into account (Yu et al., 2016; Pawlovsky and Matsuhashi, 2017). Recent literature has confirmed that considering the subtype of cancer when constructing the cancer prognosis model improves the performance of the model (Yu et al., 2018).

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Fei Guo, Tianjin University, China Chi Zhang, Indiana University Bloomington, United States*

#### \*Correspondence:

*Xiong-Hui Zhou zhouxionghui@mail.hzau.edu.cn Wen Zhang zhangwen@mail.hzau.edu.cn; zhangwen@whu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *11 January 2019* Accepted: *05 April 2019* Published: *24 April 2019*

#### Citation:

*Gao Y-C, Zhou X-H and Zhang W (2019) An Ensemble Strategy to Predict Prognosis in Ovarian Cancer Based on Gene Modules. Front. Genet. 10:366. doi: 10.3389/fgene.2019.00366*


In addition, cancer is a complex disease and the occurrence of cancer is usually not caused by a single gene, but by the combined action of multiple genes (Yang et al., 2014a). Many current prognostic methods do not take this into account (Petitjean et al., 2007; Hu et al., 2010). Gene co-expression networks are able to reflect the interrelationships between genes in biological processes (Guo et al., 2015; Deng et al., 2016; Serin et al., 2016). A community (dense cluster) in a biological network can work together as a basic functional module to participate in the occurrence of diseases (Zhou et al., 2014). Therefore, the community in the gene co-expression network in cancer patients may be related to the prognosis of cancer, and multiple communities related to the prognosis of cancer may more comprehensively reflect the prognosis process of cancer.

In this work, we first applied clustering analysis to the data set of ovarian cancer from TCGA (The Cancer Genome Atlas) (Network, 2008) in order to divide the patients into different subtypes. Our clustering results were validated using survival analysis to determine whether patients in different subtypes had different survival risks. We then constructed a co-expression network for each subtype. In this network, the correlation between genes was determined by measuring the Pearson's correlation coefficient (Sedgwick, 2012). Then, we mined the dense clusters as gene communities in each network (Ruan et al., 2010; Zhou et al., 2014). Based on the communities in each subtype, we constructed an ensemble classifier to predict the cancer prognosis in the corresponding subtype. To validate the performance of our model, we compared it with two control models: the classifier constructed without clustering information and the classifier with clustering information but without the gene co-expression network. Furthermore, we also compared our method with two models based on published papers. Finally, we performed functional annotation of the community modules in each subtype to reveal some biological mechanisms of cancer. In addition, based on these communities, we used hypergeometric distribution tests to validate whether these communities could be used to screen drugs for ovarian cancer.

### MATERIALS AND METHODS

#### Data Set and Preprocessing

To evaluate our method, we collected two ovarian cancer data sets, each containing gene expression profiles and clinical information (including the time to death and the status of death). One data set from TCGA (Network, 2008), containing 574 patients, was used as the training data set. A merged data set containing 1287 patients, collected from previous work (Gyorffy et al., 2012), was used as an independent data set. The platform of the TCGA data set is Agilent G4502A. Since the merged data set contains the TCGA samples, we removed them, and 782 samples remained. Quantile normalization (Bolstad et al., 2003; Belorkar and Wong, 2016) was then applied to all the data sets as a preprocessing step. Since all the data sets come from gene chips, this normalization method can reduce the errors caused by experimental technologies and keep the data of all samples at the same level (Bolstad et al., 2003).
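As an illustration of the preprocessing step, the following is a minimal sketch of quantile normalization in plain Python. The function name `quantile_normalize` and the list-of-lists input layout are assumptions made for this example, and ties are broken by position rather than averaged, which may differ from the implementation actually used.

```python
def quantile_normalize(samples):
    """Quantile-normalize expression values.

    samples: list of equal-length lists, one list of gene values per sample
    (hypothetical layout chosen for this sketch).
    """
    n_genes = len(samples[0])
    # Sort each sample, then average across samples at each rank.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered by ascending expression in this sample.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = rank_means[rank]  # replace the value by the mean at its rank
        normalized.append(out)
    return normalized
```

After normalization every sample shares the same set of values, so differences between samples reflect only the ranking of genes within each sample.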

The prognosis information of the cancer patients was discretized when constructing the classifier. If a patient died within 3 years, we set the phenotype as high-risk. If a patient's total survival time was more than 3 years, we set the phenotype as low-risk. Patients who were still alive but had been followed for less than 3 years were excluded.
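The labelling rule can be sketched as a small function. The name `risk_label`, the use of days as the time unit, and the treatment of a death at exactly 3 years as high-risk are assumptions made for this example.

```python
def risk_label(survival_days, died, cutoff_days=3 * 365):
    """Return 'high-risk', 'low-risk', or None (patient excluded).

    survival_days: time to death or to last follow-up (unit assumed: days).
    died: True if the patient's death was observed.
    """
    if died and survival_days <= cutoff_days:
        return "high-risk"   # death within 3 years
    if survival_days > cutoff_days:
        return "low-risk"    # survived more than 3 years
    return None              # alive with less than 3 years of follow-up
```

Patients for whom the function returns `None` would be dropped before training.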

In order to validate whether the genes involved in the communities could be used to screen drugs for ovarian cancer patients, we obtained the drug target information from the Therapeutic Target Database (TTD) (Yang et al., 2015), DrugBank (Wishart et al., 2008; Law et al., 2014), and the Drug-Gene Interaction Database (DGIdb) (Wagner et al., 2016). The drug targets were taken as the union of the targets in the three databases. The indications for each drug were also obtained from the three databases.

#### Clustering Analysis

We applied the K-means (Jain, 2010) algorithm to cluster the cancer patients into different subtypes. First, the top 15% of genes with the greatest variance were selected as the clustering features, as they are considered to contribute to interesting variance (Belorkar and Wong, 2016). Second, using the selected genes as features, we used K-means clustering to divide the patients in TCGA into different subtypes, with the Euclidean distance as the distance between samples. Third, the Dunn Index (Dunn, 1973) was used as the indicator to evaluate the quality of the clustering and to find the best number of clusters. It is calculated from the following equation:

$$DI_m = \frac{\min_{1 \le i < j \le m} \delta(C_i, C_j)}{\max_{1 \le k \le m} \Delta_k} \tag{1}$$

where m is the number of clusters, Δ<sub>k</sub> is the mean distance between all sample pairs in the same cluster C<sub>k</sub>, and δ(C<sub>i</sub>, C<sub>j</sub>) is the distance between the centroids of clusters C<sub>i</sub> and C<sub>j</sub>. The higher the value of DI<sub>m</sub>, the better the quality of the clustering. Finally, we selected the m with the highest Dunn Index as the number of subtypes.
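The index as defined in Equation 1 (smallest centroid-to-centroid distance divided by the largest mean within-cluster pairwise distance) can be sketched as follows. The function names and the representation of a clustering as a list of lists of points are assumptions for this example, and returning infinity when every cluster is a single point is a convention chosen here, not something stated in the text.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all sample pairs in one cluster."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def dunn_index(clusters):
    """clusters: list of clusters, each a list of sample vectors."""
    centroids = [centroid(c) for c in clusters]
    # Numerator: smallest distance between any two cluster centroids.
    min_between = min(math.dist(a, b) for a, b in combinations(centroids, 2))
    # Denominator: largest mean within-cluster pairwise distance.
    max_within = max(mean_pairwise_distance(c) for c in clusters)
    return math.inf if max_within == 0 else min_between / max_within
```

In use, one would run K-means for each candidate number of clusters, compute `dunn_index` on the resulting partition, and keep the number of clusters with the highest value.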

#### Constructing the Co-expression Network

In this work, we constructed a gene co-expression network for each subtype based on the gene expression data of cancer patients. First, the Pearson correlation coefficient was used to calculate the correlation between every two genes (Sedgwick, 2012). The Pearson correlation coefficient (r) was calculated as follows:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right) \tag{2}$$

where n is the number of cancer samples, X<sub>i</sub> represents the expression value of gene X in sample i, X̄ is the mean expression value of gene X across all samples, and σ<sub>X</sub> is the standard deviation of gene X across all samples. Similarly, Y<sub>i</sub>, Ȳ, and σ<sub>Y</sub> are the corresponding values for gene Y, the other gene in the pair.
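Equation 2 translates directly into code. The sketch below assumes that σ denotes the sample standard deviation (denominator n − 1), which is what makes the 1/(n − 1) prefactor yield a coefficient in [−1, 1]. The function name `pearson` is chosen for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Average product of the standardized values, as in Equation 2.
    return sum(
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)
    ) / (n - 1)
```

A gene with constant expression has zero standard deviation and would need to be filtered out before calling this function.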

Next, a rank-based method (Ruan et al., 2010) was applied to construct the gene co-expression network (Serin et al., 2016). For each gene, the top n genes most related to it were selected as its neighbors. In our project, we set n as 4 following a previous literature report (Ruan et al., 2010). Finally, all the selected gene pairs could create a co-expression network for each subtype.
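The neighbor selection can be sketched as below. The nested-dictionary input, the function name `top_n_pairs`, and ranking by the raw correlation value (rather than its absolute value) are assumptions made for this example. The function emits one (gene, neighbor) pair per selection, so G genes give G × n pairs.

```python
def top_n_pairs(corr, n=4):
    """Select, for every gene, its n most correlated genes.

    corr: dict mapping gene -> dict mapping other gene -> correlation
    (hypothetical layout for this sketch).
    Returns a list of (gene, neighbor) pairs.
    """
    pairs = []
    for gene, row in corr.items():
        others = [g for g in row if g != gene]
        # Rank candidate neighbors from most to least correlated.
        ranked = sorted(others, key=lambda g: row[g], reverse=True)
        pairs.extend((gene, neighbor) for neighbor in ranked[:n])
    return pairs
```

Because a fixed number of neighbors is chosen per gene, the network stays sparse regardless of how the correlation values are distributed, which a global correlation threshold would not guarantee.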

#### Network Visualization and Community Mining

Cytoscape 3.6.1 was used to visualize the network of every subtype and topology analysis was applied to these networks. The MCODE plug-in (Bader and Hogue, 2003) was then used in Cytoscape to mine communities in these networks.

#### Constructing the Ensemble Classifier

Ensemble strategies have achieved great success in bioinformatics (Lin et al., 2013, 2014; Zhou et al., 2013; Zou et al., 2013, 2015; Wan et al., 2017). In this work, we also used an ensemble classifier to predict the prognosis of ovarian cancer patients. The main framework for constructing our ensemble classifier is shown in **Figure S1**. To begin with, our training data set was divided into different subtypes (see Clustering Analysis) and we constructed the gene co-expression network for each subtype (see Constructing the Co-expression Network). Then we mined the dense clusters as modules for each network (see Network Visualization and Community Mining) and constructed a centroid classifier for each module as a sub-classifier. In each subtype, the sub-classifiers were filtered by ACC (accuracy) and the ensemble classifier was constructed. The subtype of each sample in the independent data set was determined and its prognosis was predicted by the corresponding ensemble classifier. The detailed process used to construct the prognostic model is as follows:


prognosis of each patient was predicted by the ensemble classifier of the corresponding subtype.

#### Comparison With the Control Classifiers

In order to evaluate our main hypothesis that the clustering information and an ensemble classifier based on communities in the gene co-expression network could contribute to the prognosis of cancer patients, we compared our method with two controls: a classifier that used neither the subtype information nor the ensemble strategy, and a classifier that used the subtype information but not the ensemble strategy.

In the first control method, a centroid classifier is constructed without the subtype information and the gene co-expression network. The t-test is used to select the differentially expressed genes between low- and high-risk groups in all the patients. The t-test is calculated using the following equation (3),

$$t = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_x^2}{n} + \frac{S_y^2}{m}}} \tag{3}$$

where X̄ and Ȳ are the average gene expression levels of low- and high-risk patients, respectively, S<sub>x</sub> and S<sub>y</sub> are their corresponding standard deviations, and n and m are the numbers of patients with good and bad prognosis, respectively.
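The statistic in Equation 3 can be computed per gene as in the sketch below. The function name `t_statistic` is chosen for this example, and the conversion of the statistic to a p-value (which requires the t distribution) is not shown.

```python
import math
from statistics import mean, variance


def t_statistic(low_risk, high_risk):
    """Two-sample t statistic with unpooled variances (Equation 3).

    low_risk, high_risk: expression values of one gene in the two groups.
    """
    n, m = len(low_risk), len(high_risk)
    # statistics.variance returns the sample variance, i.e. S squared.
    standard_error = math.sqrt(variance(low_risk) / n + variance(high_risk) / m)
    return (mean(low_risk) - mean(high_risk)) / standard_error
```

Genes would then be ranked by the p-value derived from this statistic, with the smallest p-values selected as features.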

Based on the t-test, the top n genes with the smallest p-values were selected as features. We varied n from 1 to 100 and, using the leave-one-out method, each feature set was used to construct a centroid classifier to predict the prognosis in the training data set. Next, we chose the control classifier with the best validation result as the final classifier. That is, only one classifier was constructed for prognosis. The prognosis of the samples in the independent data set was predicted directly using the chosen classifier.

The second control method used the clustering information in the construction of the model. That is, it constructed a centroid classifier for each subtype, and each centroid classifier was constructed using the same strategy as in the first control method. For each patient in the independent data set, the subtype was determined based on the Euclidean distance between the vector of the patient's expression levels and the centroid of each subtype. That is, the patient was assigned to the subtype to which it was most similar, and the prognosis was predicted using the centroid classifier of the corresponding subtype.
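Both the subtype assignment and a centroid classifier reduce to a nearest-centroid rule, sketched below. The function names are chosen for this example. Using the Euclidean distance for the classifier itself is an assumption, since the text specifies that distance only for subtype assignment.

```python
import math


def centroid(vectors):
    """Coordinate-wise mean of a list of expression vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest_centroid(sample, centroids):
    """Return the label whose centroid is closest to the sample.

    centroids: dict mapping label (a subtype or a risk group) -> centroid.
    """
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))
```

For subtype assignment the labels would be the three clusters from the training data. For prognosis they would be the high- and low-risk groups within one subtype, with vectors restricted to the selected feature genes.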

#### Construction of the Representative Classifiers

In previous works, some gene-signatures were selected to predict the prognosis of ovarian cancer patients (Gyorffy et al., 2012; Martinez-Ledesma et al., 2015). Herein, we also compare our work with two literature methods (Gyorffy et al., 2012; Martinez-Ledesma et al., 2015).

The first method (Martinez-Ledesma et al., 2015) used 41 genes for cancer prognosis, and the authors demonstrated that it could perform well on the prognosis of 11 types of cancer, including ovarian cancer. Using these genes, we constructed a centroid classifier based on the training data set, denoted as the 38-gene classifier in this work (only 38 of the genes were present in our training data set).

The second method (Gyorffy et al., 2012) identified 34 genes which were considered to be related to the prognosis of ovarian cancer. Among the 34 genes, 33 were present in our training data set. Based on these genes, a centroid classifier was constructed (denoted as a 33-gene classifier in this work).

Both previous studies used the Cox model to evaluate their methods. Based on their gene signatures, we also used the Cox model to evaluate the prognostic capability of their genes. First, Cox proportional hazards regression was applied to assess the association between the expression level of each gene and the prognostic risk of all the patients in TCGA. Next, we adopted the same strategy as the Gene expression Grade Index (Sotiriou et al., 2006) to calculate the prognostic risk of each patient in the independent data set, based on all the genes in the corresponding gene signature.

The risk score is calculated by the following equation (4),

$$\text{Risk\_Score} = \sum x_i - \sum y_j \tag{4}$$

where x<sub>i</sub> is the expression level of a gene whose Cox coefficient is positive and y<sub>j</sub> is the expression level of a gene whose Cox coefficient is negative. According to their risk scores, the patients were equally divided into high- and low-risk groups.
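A minimal sketch of the risk score in Equation 4 is given below. The dictionary inputs and function names are assumptions for this example, and splitting the patients at the median score is one plausible reading of "equally divided", not a detail confirmed by the text.

```python
from statistics import median


def risk_score(expression, cox_coefficients):
    """Sum of positive-coefficient genes minus sum of negative-coefficient genes.

    expression: dict gene -> expression level for one patient.
    cox_coefficients: dict gene -> Cox coefficient fitted on the training data.
    """
    positive = sum(expression[g] for g, c in cox_coefficients.items() if c > 0)
    negative = sum(expression[g] for g, c in cox_coefficients.items() if c < 0)
    return positive - negative


def split_by_median(scores):
    """scores: dict patient -> risk score. Assumed rule: above median = high-risk."""
    cutoff = median(scores.values())
    return {
        patient: "high-risk" if score > cutoff else "low-risk"
        for patient, score in scores.items()
    }
```

Only the sign of each Cox coefficient enters the score, so the magnitude of the fitted coefficients does not weight the genes.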

#### Performance Measures

The area under the curve (AUC), Matthews Correlation Coefficient (MCC), and Accuracy (ACC) were used as indexes to evaluate the classifiers in our work. The receiver operating characteristic (ROC) curve is a graphical plot of the sensitivity against one minus the specificity at different threshold settings. The AUC is the area under the ROC curve and is a widely used indicator of classifier performance. The MCC is also an important indicator of the quality of classifiers and was used as an accuracy index in the US FDA-led initiative MAQC-II (Jurman et al., 2012). The MCC ranges between −1 and +1 (a coefficient of +1 for completely correct predictions, 0 for meaningless predictions, and −1 for completely incorrect predictions) (Zhou et al., 2012). The ACC is the most natural performance measure (Jurman et al., 2012). It is defined as the probability that a random event will be correctly classified, which is estimated by dividing the number of correctly classified samples by the total number of samples (Klinkenberg and Renz, 1998).
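ACC and MCC can both be computed from the four confusion-matrix counts, as in the sketch below. The function name `acc_and_mcc` and the convention of returning an MCC of 0 when the denominator is zero are choices made for this example.

```python
import math


def acc_and_mcc(y_true, y_pred, positive=1):
    """Return (ACC, MCC) for binary labels; `positive` marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    acc = (tp + tn) / len(pairs)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when a row or column of the table is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return acc, mcc
```

Unlike ACC, the MCC stays near 0 for a classifier that always predicts the majority class, which is why the two are reported together.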

#### Enrichment Analysis

We used Gene Set Enrichment Analysis (GSEA) (Subramanian et al., 2005) to perform functional annotation of the genes in the selected communities of each subtype. The hypergeometric distribution test (Equation 5) was used to calculate whether the intersection between the genes in a community and the targets of a drug was significant:

$$p\text{-value} = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}} \tag{5}$$

where N is the number of all the genes in our training data set, M is the number of genes in the community, n is the number of the targets of the drug, and m is the size of the intersection set. The hypergeometric distribution test was also used to test whether the ratio of screened cancer drugs by the community is significantly high, compared with the number of cancer drugs in the entire database.
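The test asks how likely it is to observe at least m shared genes by chance. A minimal sketch using exact integer binomial coefficients follows. The function name is chosen for this example, and the upper tail is summed directly, which equals one minus the cumulative probability up to m − 1 while avoiding floating-point cancellation.

```python
from math import comb


def hypergeom_pvalue(N, M, n, m):
    """P(overlap >= m) when n drug targets are drawn from N genes,
    M of which belong to the community."""
    upper = min(M, n)  # the overlap cannot exceed either set size
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / comb(N, n)
```

For instance, with N = 10 genes, a community of M = 5, and a drug with n = 5 targets, a full overlap of m = 5 has probability 1/252 under the null.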

### RESULTS

#### Survival Analysis of the Cancer Patients in the Three Subtypes

Some studies have succeeded in identifying the subtypes of cancer patients based on high-throughput data (Sørlie, 2004; Justin et al., 2015; Jiang et al., 2019). In order to reduce the heterogeneity among the patients in the same group, we divided the ovarian cancer patients into different subtypes based on their transcriptome data. In this work, we used K-means to cluster the patients into different subtypes. The Dunn Index (Dunn, 1973) was used to evaluate the clustering quality on the TCGA training data set, where the number of clusters (K) was varied from 2 to 6. The Dunn Index of the clustering result shows that the optimal number of clusters is three (**Figure 1**). In addition, we applied survival analysis to patients in the three subtypes of the TCGA data set, which indicated that the patients in our three subtypes have different survival risks (**Figure 2A**). In a previous work, the NMF clustering method was applied to cluster ovarian cancer patients into four subtypes (Network, 2011). Here, we also applied this method to the training data set, and survival analysis shows that the difference in survival risk among the patients divided by our method is slightly more significant than that obtained with theirs (**Figure S2**). Considering that our approach is also simpler, we used K-means to cluster the ovarian cancer patients.


In addition, we also divided the patients in the merged data set into three subtypes according to the Euclidean distance between the expression level vector of each patient in the independent data set and the centroid of the clusters in training data set. We also applied survival analysis to the patients in the three subtypes on the merged data set. As a result, the patients could be significantly distinguished by the survival probability (**Figure 2B**). These results may indicate that the three subtypes identified by our method have different prognostic risks and the patients in each subtype may have more similarities than the patients in different subtypes.

#### The Co-expression Networks of the Three Subtypes

In order to describe the relationships among the genes in each subtype of ovarian cancer patients, we constructed a gene co-expression network for each subtype. Following the rank-based method (Ruan et al., 2010), for each gene we selected the four genes most related to it as its neighbors to construct the co-expression network. Each of the three networks has 11,049 nodes and 44,196 edges (**Figure S3**). The average numbers of neighbors in the networks of the first, second, and third subtypes are 6.633, 6.617, and 6.525, respectively. All three networks are provided in **Tables S1–S3**. Furthermore, we applied power-law fitting to the three networks, and the correlation and R-square of the fittings indicated that all the networks fitted the power-law distribution well (**Table 1**). Our topology analysis showed that the three networks were scale-free and could be used to mine communities, which could in turn be used to construct prognostic models in ovarian cancer.

#### Forecasting Ability of Our Classifier

TABLE 1 | Correlation and R-square of power-law fitting in three networks.

As the genes in a community work together to play an important role in many biological processes, we used MCODE (Bader and Hogue, 2003) to mine the communities in each subtype. Next, we used the genes in each community as features to construct a centroid classifier to predict the prognosis of the cancer patients in the corresponding subtype, using leave-one-out validation to evaluate its performance. The classifiers capable of distinguishing prognosis were selected as weak classifiers, and the ensemble classifier was constructed from them using the majority voting strategy. We used the ACC of each classifier as an index to evaluate its prognostic capability and varied the threshold from 0.55 to 0.60. As a result, the ensemble classifier based on the weak classifiers with an ACC of more than 0.56 achieved the best performance (**Figure 3**). Finally, we obtained 50 communities in the first subtype (**Table S4**), 73 communities in the second subtype (**Table S5**), and 92 communities in the third subtype (**Table S6**). These communities can be used to construct three ensemble classifiers for the three subtypes, which could be used as prognostic models for ovarian cancer patients.
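The filtering and voting steps can be sketched as follows. Representing each sub-classifier as a pair of its leave-one-out accuracy and a prediction function, the function name `ensemble_predict`, and resolving a tied vote as low-risk are all assumptions made for this example.

```python
def ensemble_predict(sample, sub_classifiers, acc_threshold=0.56):
    """Majority vote over the sub-classifiers that pass the ACC filter.

    sub_classifiers: list of (leave_one_out_acc, predict_fn) pairs, where
    predict_fn(sample) returns 'high-risk' or 'low-risk' (hypothetical layout).
    """
    kept = [predict for acc, predict in sub_classifiers if acc > acc_threshold]
    if not kept:
        raise ValueError("no sub-classifier passed the accuracy threshold")
    votes = [predict(sample) for predict in kept]
    high_votes = votes.count("high-risk")
    # Assumed tie rule: a tie is resolved as low-risk.
    return "high-risk" if high_votes > len(votes) / 2 else "low-risk"
```

One such ensemble would be built per subtype, and a new patient would first be assigned to a subtype and then passed to that subtype's ensemble.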

In the training data set, our method achieved an AUC of more than 0.86, an MCC of 0.57, and an ACC of 0.79. An independent data set containing 782 samples was used to verify our method. The AUC, MCC, and ACC values using our method were 0.64, 0.22, and 0.61, respectively (**Figure 4**). These results showed that our classifier has good forecasting ability in both the training and the independent data sets.

#### Comparison With Two Control Classifiers

In order to validate the hypothesis that the clustering information and the ensemble strategy based on the gene co-expression network could improve the performance of the prognostic model, we compared the performance of our method with two control classifiers. The first one used a t-test to select features in all the patients in the training data set and used the selected features to construct a centroid classifier to predict the prognosis of all the ovarian cancer patients. The second one also used a t-test to select features and adopted the centroid classifier as a prognostic model, but it constructed a centroid classifier for each subtype, i.e., the second method used clustering information in the process of constructing the model.

The performances of the two control methods in the training data set are shown in **Table 2**. From these results, it is evident that our method was better than both control methods, and that the method without the clustering information was better than the method using clustering information. In addition, our classifier and the other two classifiers were independently verified using the independent data set (**Figure 4**). Our classifier achieved an AUC of 0.64 (MCC of 0.22 and ACC of 0.61), the control classifier with clustering information had an AUC of 0.58 (MCC of 0.16 and ACC of 0.54), and the control classifier without clustering information had an AUC of 0.55 (MCC of 0.07 and ACC of 0.54).

TABLE 2 | The leave-one-out result of classifier based on different methods.

Our method outperformed the two control methods in both the training data set and the independent data set. The control method with clustering information performed better than the one without it in the independent data set, although not in the training data set. The control method without clustering used all the samples in the training data set to construct the classifier, whereas the control method with clustering information used only the samples in each subtype to fit each model. The classifier fitted with more samples may therefore perform better in the training data set. However, it does not perform well on the independent data set because of overfitting, which may be caused by the high heterogeneity of the cancer patients in different subtypes.

Cox regression is also a frequently used method to select features in cancer prognosis. Here, we also used Cox regression to select features for the two control classifiers. The result was similar: our ensemble classifier was the best, and the control classifier using subtype information was better than the one without the subtype information (**Figure S4**). All these results showed that the clustering information and the ensemble strategy based on the gene co-expression network could improve the performance of the prognostic model.

#### Comparison With Two Representative Works

Two representative methods were compared with our method. We constructed centroid classifiers based on their gene sets, as we did in our work. As shown in **Figure S5**, our classifier was better than the 33-gene and 38-gene classifiers. Our classifier achieved an AUC of 0.64 and an MCC of 0.22 (ACC of 0.61). In comparison, the AUCs of the other two classifiers were lower and their MCCs were 0 (ACC <0.5). Thus, the classifier based on our method outperformed the other two methods.

Both previous studies used Cox models to evaluate their methods. Based on their gene signatures, we also used the Cox model to evaluate the prognostic capability of their genes (see Materials and Methods). In order to compare the performance of our method more directly with the two other prognostic gene sets, the prognostic outcomes of a total of 789 patients in the merged data set were predicted by our ensemble classifier. Meanwhile, their risk scores based on the two gene sets from the two representative methods were calculated. As a result, the log-rank p-value between the patients in the two groups predicted by our method is 8 × 10<sup>−9</sup>. The p-values of the log-rank test between the low-risk and high-risk groups calculated by the two representative methods are 0.016 and 0.026, respectively (**Figure 5**). In summary, our ensemble classifier outperformed the two representative methods in both classification and survival analysis.

#### Functional Annotation of the Filtered Communities

As the communities with distinguishing capability in cancer prognosis may play important roles in cancer prognosis, we applied enrichment analysis to the genes in the top ten communities according to the ACC performance in each subtype with gene ontology (GO) terms by GSEA (Subramanian et al., 2005). In these top ten communities, we selected the most significant related biological processes for each annotated community, which are listed in **Tables S7–S9**.

In the first subtype, three out of ten communities were significantly annotated. The first community was enriched in the "ethanol metabolism process." This biological process can produce reactive aldehydes, a class of carcinogens (Kottemann and Smogorzewska, 2013). In addition, "positive regulation of proteolysis" was reported to be related to the occurrence of ovarian cancer (Lengyel, 2010), and it was significantly enriched in the fourth community. Finally, "the glutathione derivative metabolic process" is the most significant term in the ninth community of the first subtype, with a p-value of 9.52 × 10<sup>−11</sup>. It was reported that glutathione plays an important role in cancer progression and chemoresistance (Traverso et al., 2013).

In the second subtype, five communities were significantly enriched. Among them, disturbance of the "DNA metabolic process" was reported to contribute to oncogenesis (Hoeijmakers, 2001). In addition, the other four GO terms were also significantly enriched: "response to steroid hormone," "response to endogenous stimulus," "response to topologically incorrect protein," and "response to fatty acid." The steroid hormone receptor has been previously demonstrated to be a potential prognostic marker for ovarian cancer patient survival (Lenhard et al., 2012). The endogenous stimulus comes from the microenvironment difference between normal and tumor tissues, and it could be used to treat cancer (Yang et al., 2014b). Incorrectly folded proteins can affect the survival of tumor cells (Goloudina et al., 2012). Fatty acids have been shown to be related to the rapid growth of tumors (Nieman et al., 2011), and abnormal expression of fatty acid synthase has often been found in ovarian cancer with poor prognosis (Kuhajda, 2000). From the results of the survival analysis of the patients in the three subtypes, the prognosis of the patients in the second subtype was the poorest.

In the third subtype, "protein localization to centrosome" and "cell cycle process" were significantly related to the sixth and the tenth community, respectively. For "protein localization to centrosome," it has been shown that some proteins can affect the tumor cell cycle through the centrosome (Zhou et al., 1998; Kimura et al., 1999), and cell cycle proteins are promising targets in cancer therapy (Otto and Sicinski, 2017). In other words, the communities in the third subtype were annotated by two cell-cycle-related GO terms. To summarize, the communities in the three subtypes were all cancer-related, but each subtype corresponded to different biological processes.

#### Drug Screening Using Filtered Communities

As described above, some communities in the three subtypes are cancer-related. Therefore, genes involved in these communities may be candidates for therapy. In this work, we used these communities to screen drugs using the hypergeometric distribution test. We tested whether the targets of each drug were significantly enriched among the genes in the corresponding community (see Materials and Methods).

In the first subtype, three drugs were screened by the community annotated as "positive regulation of proteolysis." Two of these three drugs, Carfilzomib (Tagawa et al., 2012) and Bortezomib, could be used as therapy for ovarian cancer, and both target the gene PSMB1 in this community. In particular, Bortezomib has been used to treat ovarian cancer in clinical trials (Bruning et al., 2009). The proportion of ovarian cancer drugs among the drugs screened by the community is significantly higher than the proportion of ovarian cancer drugs among all the drugs in the database (p-value of 0.021). In the second subtype, nine drugs were obtained by the community enriched in the "DNA metabolic process," six of which could be used to treat ovarian cancer (p-value of 2.73 × 10<sup>−5</sup>). In particular, the drug Niraparib targets PARP2 in this community and is one of the most widely known drugs for recurrent ovarian cancer (Kanjanapan et al., 2017; Scott, 2017). Using the community enriched in the "response to endogenous stimulus," 183 drugs were screened, 115 of which could be used for cancer patients (p-value of 4.08 × 10<sup>−7</sup>). In the third subtype, the proportion of ovarian cancer drugs screened by the community related to the "cell cycle process" was significantly higher than the proportion among all the drugs (p-value of 2.32 × 10<sup>−4</sup>). Among the 50 drugs screened by this community, 13 were used as therapy for ovarian cancer, and all of these drugs target YES1 or TYMS. Among them, Dasatinib inhibits YES1 directly (Pathak et al., 2015), and Gemcitabine is reported to act on the gene TYMS to regulate the cell cycle (Duran et al., 2017). All these results indicate that the genes involved in the filtered communities may be candidates for drug targets in ovarian cancer.

### CONCLUSION

Considering the heterogeneity and complexity of ovarian cancer, we presented a new method to predict the prognosis of ovarian cancer based on the clustering information and the gene co-expression network in each subtype of cancer patients. We divided the ovarian cancer data into three subtypes by clustering analysis and found that the survival risks in these three subtypes were significantly different. We mined the important communities based on the co-expression networks in each subtype. There are 50, 73, and 92 communities in the first, second, and third subtype, respectively. Next, we constructed a new ensemble classifier based on these communities to predict the prognosis of cancer. Compared with other published methods, our classifier had improved performance. Furthermore, the functional annotation of the communities in each subtype showed that some representative communities were cancer-related, and the enrichment analysis of the genes in the communities with the drug-ontology data partially supports our biomarker identification method.

### AUTHOR CONTRIBUTIONS

X-HZ and WZ designed the research. X-HZ and Y-CG performed the research and wrote the paper. Y-CG analyzed the data. All authors revised the manuscript.

### FUNDING

This research was funded by the National Natural Science Foundation of China (61602201), the Fundamental Research Funds for the Central Universities (2662018PY023), and the National Training Program of Innovation and Entrepreneurship for Undergraduates of Huazhong Agricultural University (201710504091).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00366/full#supplementary-material

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Gao, Zhou and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Predicting Ion Channels Genes and Their Types With Machine Learning Techniques

Ke Han<sup>1,2</sup>\*, Miao Wang<sup>3</sup>, Lei Zhang<sup>3</sup>, Ying Wang<sup>1</sup>, Mian Guo<sup>4</sup>, Ming Zhao<sup>1,2</sup>, Qian Zhao<sup>1,2</sup>, Yu Zhang<sup>1,2</sup>, Nianyin Zeng<sup>5</sup> and Chunyu Wang<sup>6</sup>

*<sup>1</sup> School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China, <sup>2</sup> Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, China, <sup>3</sup> Life Sciences and Environmental Sciences Development Center, Harbin University of Commerce, Harbin, China, <sup>4</sup> Department of Neurosurgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, China, <sup>5</sup> Department of Instrumental and Electrical Engineering, Xiamen University, Xiamen, China, <sup>6</sup> School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China*

#### Edited by:

*Dariusz Mrozek, Silesian University of Technology, Poland*

#### Reviewed by:

*Balachandran Manavalan, Ajou University, South Korea Hui Ding, University of Electronic Science and Technology of China, China Wei Chen, North China University of Science and Technology, China*

> \*Correspondence: *Ke Han thruster@163.com*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *14 February 2019* Accepted: *12 April 2019* Published: *03 May 2019*

#### Citation:

*Han K, Wang M, Zhang L, Wang Y, Guo M, Zhao M, Zhao Q, Zhang Y, Zeng N and Wang C (2019) Predicting Ion Channels Genes and Their Types With Machine Learning Techniques. Front. Genet. 10:399. doi: 10.3389/fgene.2019.00399*

Motivation: The number of ion channels is increasing rapidly. As many of them are associated with diseases, they are the targets of more than 700 drugs. The discovery of new ion channels is facilitated by computational methods that predict ion channels and their types from protein sequences.

Methods: We used the SVMProt and the k-skip-n-gram methods to extract the feature vectors of ion channels, and obtained 188- and 400-dimensional features, respectively. The 188- and 400-dimensional features were combined to obtain 588-dimensional features. We then employed the maximum-relevance-maximum-distance method to reduce the dimensions of the 588-dimensional features. Finally, the support vector machine and random forest methods were used to build the prediction models to evaluate the classification effect.

Results: Different methods were employed to extract various feature vectors, and after effective dimensionality reduction, different classifiers were used to classify the ion channels. We extracted the ion channel data from the Universal Protein Resource (UniProt, http://www.uniprot.org/) and Ligand-Gated Ion Channel databases (http://www.ebi.ac.uk/compneur-srv/LGICdb/LGICdb.php), and then verified the performance of the classifiers after screening. The findings of this study could inform the research and development of drugs.

Keywords: ion channel, machine learning, random forest, SVM, feature selection

# INTRODUCTION

Ion channels are the pathways for the passive transport of various inorganic ions across a membrane. The structure and function of cellular ion channels are the basis of life-sustaining processes, and their genetic variation and dysfunction are related to the occurrence and development of many diseases (Gabashvili et al., 2007; Bagal et al., 2013; Cheng et al., 2018a,c). Usually, ion channels are in a closed state. Under particular stimuli, the channel protein conformation changes, and the probability of the ion channels opening increases. Based on their type of gate, ion channels are typically categorized into voltage-gated ion channels and ligand-gated ion channels (Wang et al., 2017a). On the binding of a ligand, a ligand-gated channel undergoes a conformational change that causes opening of the channel gate and ion flux. Voltage-gated ion channels predominantly comprise potassium (K+), sodium (Na+), calcium (Ca2+), and anion channels (Shu-An et al., 2011). They are usually surrounded by four transmembrane segments of the same subunit. In these channels, there are some charged groups (potential sensors) that control the gate. When the membrane potential changes, these sensors undergo a displacement under the effect of the electric field force, and the gate is opened or closed in response to the change in the membrane potential. Ion channels are expressed in practically all tissues, and their dysfunction can cause deafness, renal cysts, cardiac arrhythmias, migraines, and epilepsy (Cai et al., 2002a). Therefore, many drugs target ion channels. One example is the antiarrhythmic drug Lidocaine, which acts as a voltage-gated sodium channel inhibitor (Peters et al., 1993; Tiwari and Srivastava, 2015). Lidocaine acts on the conduction system and muscle cells of the heart, raising the depolarization threshold and making the heart less likely to initiate or conduct action potentials (Lin et al., 2015). Another example is Ziconotide, which targets calcium channels and is used for pain relief. This compound blocks the calcium influx in the nerve terminals, which results in a reduced release of glutamate and neuropeptides, effectively interrupting the spinal transmission of pain signals (Schmidtko et al., 2010).

Owing to the significance of ion channels in biological processes, researchers have begun to study them in greater depth to establish the relationships between ion channels and different diseases. Ion channels have now become important targets for disease diagnosis and drug development. It is known that many chemicals and genetic disorders can disrupt the normal function of ion channels and have catastrophic consequences for living organisms (Santos et al., 2017). Many animal toxins act on the nervous system by modulating ion channels, and some are used to treat conditions such as chronic pain.

In recent years, ion channels have played an increasingly important role in the treatment of diseases and in drug research and development. Therefore, several researchers have started to pay attention to the structure and function of ion channels. With the rapid growth of proteomics data, early prediction and identification of the type of a particular ion channel has become important. As researchers are interested in developing drugs that target ion channels and in extending ion channel protein annotation, a series of high-throughput computational tools have been developed to predict ion channels and their types directly from protein sequences. In the last decade, many computational methods have been developed based on machine learning algorithms (Yu et al., 2015; Zou et al., 2017a,b; Stephenson et al., 2019), which are used in different fields, such as drug repositioning (Yu et al., 2016, 2017). Increasingly, researchers have applied machine learning algorithms to predict and classify ion channels. Sudipto et al. (2006) used amino acid composition and dipeptide composition as the feature vectors and classified them using a support vector machine (SVM) to predict voltage-gated ion channels and their subtypes. Liu et al. (2010) proposed a voltage-gated potassium channel identification method based on local sequence information. The prediction result of this method was better than that of voltage-gated potassium channel identification based on global sequence information (Lin and Ding, 2011). Zhao et al. (2017) constructed an SVM-based model to quickly predict ion channels and their types. To account for both the residue sequence information and the physicochemical properties of the residues, they employed a novel feature extraction method that combined dipeptide composition with the physicochemical correlation between two residues. Gao et al. (2016) proposed an SVM-based model that predicts ion channels and their subfamilies using the sequence similarity search feature of the basic local alignment search tool. Although many classifiers have been developed for the identification of ion channels, there are still some unresolved problems. For example, ion channel sequence similarity is very high, which may result in overestimation of the predictive classification performance of the model (Olivier and Du, 2012).

In this study, SVM and random forest classifiers were used to identify ion channels and further classify them. The maximum-relevance-maximum-distance (MRMD) method was introduced for feature selection to improve the prediction accuracy. We followed three steps to predict and classify ion channels. First, a protein sequence was tested to determine whether it was an ion channel. If the sequence was predicted to be an ion channel, it was then classified as either a voltage-gated ion channel or a ligand-gated ion channel. Finally, if the protein sequence was found to be a voltage-gated ion channel, we classified it as a potassium (K+), sodium (Na+), calcium (Ca2+), or anion voltage-gated ion channel.
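The three-step procedure can be pictured as a simple cascade of classifiers. The following is only a minimal sketch under stated assumptions: the function name `predict_ion_channel_type` and the three classifier callables are hypothetical placeholders for trained models and are not part of the original work.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def predict_ion_channel_type(
    features: Features,
    is_ion_channel: Callable[[Features], bool],
    is_voltage_gated: Callable[[Features], bool],
    voltage_gated_subtype: Callable[[Features], str],
) -> str:
    """Run the three-step cascade on the feature vector of one protein.

    The three callables are hypothetical stand-ins for trained models
    (for example, an SVM or a random forest at each step).
    """
    # Step 1: ion channel versus non-ion channel.
    if not is_ion_channel(features):
        return "non-ion channel"
    # Step 2: voltage-gated versus ligand-gated.
    if not is_voltage_gated(features):
        return "ligand-gated ion channel"
    # Step 3: one of the four voltage-gated types (K+, Na+, Ca2+, anion).
    return f"voltage-gated {voltage_gated_subtype(features)} channel"
```

One consequence of this design is that an error made at an early step cannot be corrected at a later one, so the accuracy of the first classifier bounds the accuracy of the whole cascade.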

### MATERIALS AND METHODS

**Figure 1** shows the basic flow of the processes proposed in this paper. In this section, we introduce in detail the data set, feature extraction method, dimension reduction method, and classifier used in this study.

## Benchmark Dataset

The data that we used to establish the prediction model in this study were collected from Lin and Ding (2011). The sequences of ion channels were collected from the Universal Protein Resource (UniProt) and Ligand-Gated Ion Channel databases (Marco et al., 2006). The following measures were taken to obtain reliable high-quality datasets. Initially, the protein sequences containing ambiguous residues, such as "X," "B," and "Z," were discarded. Then, sequences that were fragments of other proteins were removed. Proteins that were inferred by homology or prediction were discarded because of their unreliability. Finally, to avoid any homology bias, the CD-HIT (Li and Godzik, 2006) program was used to remove highly homologous sequences, with a 40% sequence identity as the cutoff (Wei et al., 2012; Chen et al., 2016; Zou et al., 2018a).
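The first screening step can be sketched in a few lines. This is an illustrative sketch only: the function names and the dictionary-based input format are assumptions, and the later steps (fragment removal, evidence filtering, and CD-HIT clustering at 40% identity) rely on database annotations and external software that are not reproduced here.

```python
from typing import Dict

# Residue codes treated as ambiguous in the screening step described above.
AMBIGUOUS_RESIDUES = set("XBZ")


def has_ambiguous_residue(sequence: str) -> bool:
    """Return True if the sequence contains X, B, or Z."""
    return any(residue in AMBIGUOUS_RESIDUES for residue in sequence.upper())


def drop_ambiguous_sequences(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only the records whose sequences contain no ambiguous residues.

    `records` maps a (hypothetical) sequence identifier to its sequence.
    """
    return {
        name: sequence
        for name, sequence in records.items()
        if not has_ambiguous_residue(sequence)
    }
```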

In strict accordance with the above steps, 148 voltage-gated ion channels (81 potassium channels, 29 calcium channels, 12 sodium channels, and 26 anion channels) and 150 ligand-gated ion channels were finally extracted. To ensure the reliability and practicability of the ion channel prediction and classification, and to maintain the balance between the positive and negative data, 300 protein sequences were randomly selected from UniProt as non-ion channels. The sequence identity among these non-ion channel sequences was <40%.

#### Feature Extraction of Samples

The Benchmark Dataset section described the preprocessing steps performed on the dataset, which provide a reliable basis for this study. This section focuses on the methods used to extract features from the protein sequences.

The first and most important task of a predictor is to extract features from protein sequences (Liu et al., 2015; Ding et al., 2017a,b; Zou et al., 2018b). We used two feature extraction methods: the SVMProt 188-D feature extraction method, which is based on protein composition and physicochemical properties, and the k-skip-n-gram 400-D feature extraction method.

#### SVMProt 188-D Feature Extraction

Different types of amino acids possess their own unique physicochemical properties. These characteristics of amino acid sequences can be used to predict the type of a protein, an approach that has yielded good predictive results (Cao and Cheng, 2016; Li et al., 2016b). Dubchak et al. (1995) proposed a composition-transition-distribution model based on the composition, transition, and distribution of the residues in protein sequences, and achieved good results for the prediction of protein folding patterns. The physicochemical properties of protein sequences were fully embodied in this model, where the composition and the physical and chemical properties were independent of each other. Cai et al. (2003) extracted 188-dimensional features that combine amino acid composition with physicochemical characteristics for the characterization of proteins. Besides the amino acid frequencies, SVMProt uses eight physicochemical properties. The number of features for each of these properties is listed in **Table 1** (Zou et al., 2013a,b).

In the model, the frequencies of the 20 amino acids in the query protein sequence constitute the first 20 dimensions of the feature vector. They are calculated as follows:

$$E_i = \frac{A_i}{L} \times 100\% \quad (1 \le i \le 20) \tag{1}$$

where A<sub>i</sub> and L denote the number of occurrences of the ith amino acid in the sequence and the length of the sequence, respectively (Zhu et al., 2018b), and {A<sub>1</sub>, A<sub>2</sub>, . . . , A<sub>20</sub>} represents the 20 amino acids that form the proteins. For each physicochemical property, the amino acids are divided into three categories, denoted c, d, and f, and the content (C), bivalent frequency (T), and distribution (D) of these categories are calculated (Bagal et al., 2013). The features of each of the eight physicochemical properties are obtained using the following formulas:

$$C_i = \frac{count_{D_i}}{L} \times 100 \quad (i = c, d, f) \tag{2}$$

$$T_{i,j} = \frac{count_{D_i D_j} + count_{D_j D_i}}{L - 1} \times 100, \quad (i, j) \in \{(c, d), (c, f), (d, f)\} \tag{3}$$

$$D_{i,j} = \frac{\text{position of the } P_j\text{th residue of } D_i}{L} \times 100, \quad j = 0, 1, 2, 3, 4; \ i = c, d, f \tag{4}$$

and

$$P_j = \begin{cases} 1 & (j = 0) \\ \dfrac{count_{D_i}}{4} \times j & (j = 1, 2, 3, 4) \end{cases} \tag{5}$$

where D<sub>i</sub> (i = c, d, f) denotes one of the three categories of a physicochemical property, count<sub>D<sub>i</sub></sub> denotes the number of residues of that category in the sequence, and count<sub>D<sub>i</sub>D<sub>j</sub></sub> denotes the number of adjacent residue pairs in which a residue of category D<sub>i</sub> is followed by a residue of category D<sub>j</sub>. Each property thus contributes 3 + 3 + 15 = 21 features. After calculating the features of all the physicochemical properties, we obtained a feature vector with 188 dimensions (20 + (21 × 8) = 188).
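The 21 features contributed by one physicochemical property can be sketched as follows. This is a minimal illustration rather than the SVMProt implementation: the three-category hydrophobicity grouping below is an assumption based on the grouping commonly used with composition-transition-distribution descriptors, the function names are hypothetical, rounding the index P<sub>j</sub> down to an integer (with a minimum of 1) is also an assumption, and the sequence is assumed to contain only the 20 standard residues and at least two residues.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed grouping for one property (hydrophobicity): polar, neutral, hydrophobic.
HYDROPHOBICITY_GROUPS = {"c": "RKEDQN", "d": "GASTPHY", "f": "CLVIMFW"}


def amino_acid_composition(seq: str) -> List[float]:
    """Eq. 1: percentage of each of the 20 amino acids."""
    return [100.0 * seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def ctd_features(seq: str, groups: Dict[str, str]) -> List[float]:
    """Eqs. 2-5 for one property: 3 content + 3 transition + 15 distribution."""
    length = len(seq)
    labels = sorted(groups)  # ['c', 'd', 'f']
    to_label = {aa: lab for lab, members in groups.items() for aa in members}
    encoded = [to_label[aa] for aa in seq]

    # Eq. 2: content of each category.
    content = [100.0 * encoded.count(lab) / length for lab in labels]

    # Eq. 3: adjacent pairs that switch between two categories, in either order.
    transition = []
    for index, first in enumerate(labels):
        for second in labels[index + 1:]:
            pairs = sum(
                1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {first, second}
            )
            transition.append(100.0 * pairs / (length - 1))

    # Eqs. 4-5: positions of the first, 25%, 50%, 75%, and last residue of a category.
    distribution = []
    for lab in labels:
        positions = [i + 1 for i, x in enumerate(encoded) if x == lab]
        for j in range(5):
            if not positions:
                distribution.append(0.0)  # category absent from the sequence
                continue
            ordinal = 1 if j == 0 else max(1, (len(positions) * j) // 4)
            distribution.append(100.0 * positions[ordinal - 1] / length)

    return content + transition + distribution
```

Concatenating `amino_acid_composition(seq)` with the output of `ctd_features` for each of the eight properties would give 20 + 8 × 21 = 188 values.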

#### TABLE 1 | Number of features in SVMProt.


#### k-skip-n-gram 400-D Feature Extraction

Guthrie et al. (2006) first proposed the k-skip-n-gram model. In protein sequences, the distance between two amino acids A<sub>i</sub> and A<sub>j</sub> is denoted by DT(A<sub>i</sub>, A<sub>j</sub>), which is defined as the position interval between the two amino acids (Liu et al., 2014). It is calculated as follows:

$$DT\left(A_i, A_j\right) = j - i - 1 \tag{6}$$

where i and j are the positions of the amino acids in a sequence.

The k-skip-n-gram model provides the composition of groups of n residues that are separated by distances of up to k in a sequence. Its features are calculated as follows:

$$FV_{SkipGram} = \left\{ \frac{N\left(a_{m_1}a_{m_2}\dots a_{m_n}\right)}{N\left(T_{SkipGram}\right)} \right\}, \quad 1 \le a_{m_1} \le 20,\ 1 \le a_{m_2} \le 20,\ \dots,\ 1 \le a_{m_n} \le 20 \tag{7}$$

where N(T<sub>SkipGram</sub>) denotes the total number of elements in the set T<sub>SkipGram</sub> and N(a<sub>m<sub>1</sub></sub>a<sub>m<sub>2</sub></sub> ... a<sub>m<sub>n</sub></sub>) denotes the number of times the term a<sub>m<sub>1</sub></sub>a<sub>m<sub>2</sub></sub> ... a<sub>m<sub>n</sub></sub> appears in T<sub>SkipGram</sub>. The set T<sub>SkipGram</sub> is formulated as

$$T_{SkipGram} = \left\{ \bigcup_{a=1}^{k} Skip\left(DT = a\right) \right\} \tag{8}$$

where

$$Skip\left(DT = a\right) = \left\{ A_i A_{i+a+1} \dots A_{i+a+n-1} \mid 1 \le i \le L - a,\ 1 \le a \le k \right\} \tag{9}$$

Because only 20 amino acids can form a protein, there are 20<sup>n</sup> possible combinations of n residues. Therefore, a protein sequence can be transformed into a feature vector FV<sub>SkipGram</sub> with 20<sup>n</sup> dimensions.

As the number of features grows exponentially with n, the value of n is quite important. When n = 1, there are only 20 features. If the number of features is too small, the feature representation of a sequence is negatively affected. In contrast, when the value of n is very high, the calculation efficiency suffers. In this study, the value of n was set to 2, which yields a 400-dimensional feature vector.
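For n = 2, the feature extraction reduces to counting residue pairs that are separated by a limited number of skipped residues. The sketch below follows Eqs. 7 to 9 under stated assumptions: the function name is hypothetical, the value of k is left as a parameter, pairs that would run past the end of the sequence are ignored, and pairs containing non-standard residues are skipped.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_skip_bigram_features(seq: str, k: int) -> List[float]:
    """Return the 400 relative frequencies of residue pairs with 1..k skipped residues."""
    pairs = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for skipped in range(1, k + 1):  # DT = a in Eq. 8
        # The pair is (A_i, A_{i+a+1}); stop while the second residue is in range.
        for i in range(len(seq) - skipped - 1):
            gram = seq[i] + seq[i + skipped + 1]
            if gram in counts:
                counts[gram] += 1
                total += 1
    if total == 0:
        return [0.0] * len(pairs)
    # Eq. 7: occurrences of each pair divided by the size of the skip-gram set.
    return [counts[pair] / total for pair in pairs]
```

With this ordering, the feature for the pair formed by the residues at positions p and q of `AMINO_ACIDS` is stored at index 20 × p + q.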

#### Feature Selection (MRMD)

Because each of the two feature representation methods mentioned above has its limitations, they were combined to form a new feature vector containing both types of features. SVM and random forest classifiers were used to classify the new feature vector set. When multiple feature extraction methods are combined, many dimensions may be generated and the classification result may be affected (Tang et al., 2017; Liu et al., 2018b; Zhu et al., 2018b). Feature selection can alleviate the problem of dimensionality by selecting a subset of features (Zhu et al., 2018c). Therefore, we employed the dimensionality reduction method based on MRMD (http://lab.malab.cn/soft/MRMD/index_en.html) to reduce the dimensionality of the generated feature vectors (Xu et al., 2016; Zou et al., 2016a,b; Zhu et al., 2017, 2018b; Chen et al., 2018; Tang et al., 2018b). MRMD selects the features with the highest relevance to the target class and the least redundancy by calculating the maximum relevance and maximum distance. In this study, Pearson's correlation coefficients were used to measure the relevance, and three distance functions were used to calculate the redundancy of the features. The larger the Pearson correlation coefficient, the stronger the relationship between a feature and the target classes. The larger the distance between features, the lower the redundancy of the feature vectors. The feature subset generated by the MRMD dimension reduction is therefore expected to have low redundancy and a strong relationship with the target classes, which could aid in achieving more accurate classification results.
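The idea of ranking features by relevance plus distance can be illustrated with a deliberately simplified score. The sketch below is not the MRMD software: it uses only the Euclidean distance in place of the three distance functions, it adds the two terms with equal weight, and it assumes that the feature columns are already scaled to comparable ranges. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 0.0 if sd_x == 0 or sd_y == 0 else cov / (sd_x * sd_y)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rank_features(columns: List[Sequence[float]], labels: Sequence[float]) -> List[int]:
    """Rank feature indices by |relevance| + mean distance to the other features."""
    scores = []
    for index, column in enumerate(columns):
        relevance = abs(pearson(column, labels))
        distances = [
            euclidean(column, other)
            for other_index, other in enumerate(columns)
            if other_index != index
        ]
        mean_distance = sum(distances) / len(distances) if distances else 0.0
        scores.append((relevance + mean_distance, index))
    return [index for _, index in sorted(scores, reverse=True)]
```

A reduced feature set is then obtained by keeping the top-ranked indices; how many to keep is a separate choice, typically made by comparing the cross-validated accuracy of candidate subsets.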

#### Classifier Models

#### Random Forest

A random forest is a classifier that uses multiple decision trees to train on and predict samples; it has been widely used in many bioinformatics tasks (Xu et al., 2013, 2018b; Liu et al., 2018a; Pan et al., 2018; Su et al., 2018; Wei et al., 2018a). It was proposed by Leo Breiman in 2001 and combines bagging ensemble learning with the random subspace method (Verikas et al., 2011). A random forest is thus an ensemble learning model based on decision trees, each of which is trained by bagging. When a sample is input into a random forest for classification, the final classification result is determined by a vote over the outputs of the individual decision trees. Owing to its good performance, the algorithm, which builds on earlier work on decision tree induction (Buntine and Niblett, 1992), has been widely used in many practical fields, such as the classification and regression of gene sequences, action recognition, face recognition, anomaly detection in data mining, and metric learning. In this study, we used a random forest classifier to build a model.

#### Support Vector Machine

An SVM is a supervised learning model that has achieved good performance in several bioinformatics (Momot et al., 2010; Cao et al., 2014; Ding et al., 2016; Li et al., 2016a; Wang et al., 2017b, 2018; Wei et al., 2017a,b, 2018c; Chen and Chuang, 2018; Liu et al., 2018c; Tang et al., 2018a; Shen et al., 2019; Zhu et al., 2019) and biomedicine (Zeng et al., 2018a; Zhang et al., 2018) studies. The binary classification problem of an SVM can be broadly divided into three cases: linearly separable, approximately linearly separable, and non-linearly separable. The solution for the linearly separable problem is an optimal hyperplane that separates the two groups of samples correctly with the largest classification margin. This is shown in **Figure 2**, where the H plane is the optimal hyperplane. The approximately linearly separable problem can be solved by adding a slack variable, ξ<sub>i</sub>, to the optimization function of the linear classification. To solve the non-linearly separable problem, we need to select an appropriate kernel function, map the low-dimensional space into a high-dimensional space, and find the appropriate classification plane in the high-dimensional space so that the two groups of samples can be separated (Cai et al., 2002b; Yu-Dong et al., 2010; Liu, 2017). As a result, an SVM can achieve good classification results even when there are few experimental data. In this study, we used LIBSVM 3.23, which was downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html. To obtain the optimal model, we performed a grid search to optimize the parameters c and g. The optimal values of c and g were then used in the model to obtain the best classification result. A combination of different types of features and classifiers can improve the overall performance of the model (Zhu et al., 2016, 2018a).
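The grid search over c and g can be sketched independently of any particular SVM library. In the sketch below, `evaluate` is a hypothetical callable that trains a model with the given parameters and returns its cross-validated accuracy. The exponent ranges are an assumption that mirrors ranges commonly used with LIBSVM-style grid searches; the ranges actually used in this study are not stated.

```python
from typing import Callable, Iterable, Tuple


def grid_search(
    evaluate: Callable[[float, float], float],
    log2c_values: Iterable[int] = range(-5, 16, 2),
    log2g_values: Iterable[int] = range(3, -16, -2),
) -> Tuple[float, float, float]:
    """Return (best_score, best_c, best_g) over an exponential grid.

    `evaluate(c, g)` is a hypothetical scoring function, for example the
    10-fold cross-validation accuracy of an SVM trained with cost c and
    kernel parameter g.
    """
    c_exponents = list(log2c_values)
    g_exponents = list(log2g_values)
    if not c_exponents or not g_exponents:
        raise ValueError("both parameter grids must be non-empty")
    best_score, best_c, best_g = float("-inf"), 0.0, 0.0
    for log2c in c_exponents:
        for log2g in g_exponents:
            c, g = 2.0 ** log2c, 2.0 ** log2g
            score = evaluate(c, g)
            if score > best_score:
                best_score, best_c, best_g = score, c, g
    return best_score, best_c, best_g
```

Searching on a logarithmic grid is a common choice because the useful values of both parameters can span several orders of magnitude.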

#### Prediction Assessment

In machine learning, experimental data must be divided into a training set, which is used to build the prediction model, and a test set, which is used to validate the trained model (Cao et al., 2017; Xu et al., 2017; Cheng et al., 2018b; Hu et al., 2018). A fixed division of this kind requires a large amount of experimental data; in practice, however, the amount of experimental data is often limited. Therefore, researchers often use cross-validation for testing. Three types of validation methods are commonly used in bioinformatics: the independent dataset test, jackknife cross-validation, and n-fold cross-validation. Among these, the jackknife test has been widely used in bioinformatics owing to its excellent results. However, this test is time and resource intensive (Lin et al., 2012; Zeng et al., 2016; Lai et al., 2017; Liu et al., 2017b; Manavalan et al., 2018). The n-fold cross-validation is commonly used to test the accuracy of an algorithm. In this study, the dataset was divided into 10 parts; in each round, nine parts were used as the training data and the remaining part as the testing data. Extensive experiments on varied datasets have indicated that dividing a dataset into 10 parts gives good error estimates, and there is also theoretical support for this choice (Chen et al., 2017; Zeng et al., 2018b).
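The division into 10 parts can be sketched as follows. This is a generic illustration, not the authors' code: the function name and the fixed random seed are assumptions, and the split is not stratified by class.

```python
import random
from typing import List, Tuple


def k_fold_indices(
    n_samples: int, n_folds: int = 10, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return one (train_indices, test_indices) pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds parts of nearly equal size.
    folds = [indices[start::n_folds] for start in range(n_folds)]
    splits = []
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        splits.append((train, test))
    return splits
```

Each sample appears in exactly one test part, so every sequence is predicted once by a model that did not see it during training.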

### Performance Evaluation

To obtain clearer classification prediction results and estimate the accuracy of the prediction model, we used other evaluation criteria as well (Feng et al., 2013, 2018; Chen et al., 2017; Zhang and Liu, 2017; Dao et al., 2018; Yang et al., 2018). The prediction accuracy was estimated using the sensitivity (Sn), overall accuracy (OA), and average accuracy (AA), which are defined as follows:

$$Sn(i) = \frac{TP_i}{TP_i + FN_i} \tag{10}$$

$$OA = \sum_{i=1}^{n} \frac{TP_i}{N} \tag{11}$$

and

$$AA = \sum_{i=1}^{n} \frac{Sn(i)}{n} \tag{12}$$

where TP<sub>i</sub> and FN<sub>i</sub> denote the true positives and false negatives of the ith class, respectively (Liu et al., 2017a; Zeng et al., 2017a). N and n are the total number of sequences and the number of classes, respectively.
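The three measures can be computed directly from a confusion matrix. The sketch below is illustrative: the function name is hypothetical, each class is assumed to contain at least one sequence, and the matrix in the usage note contains made-up counts rather than results from this study.

```python
from typing import List, Sequence, Tuple


def classification_metrics(
    confusion: Sequence[Sequence[int]],
) -> Tuple[List[float], float, float]:
    """Return (Sn per class, OA, AA) following Eqs. 10-12.

    confusion[i][j] is the number of class-i sequences predicted as class j,
    so the diagonal holds TP_i and the rest of row i sums to FN_i.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    sensitivities = [
        confusion[i][i] / sum(confusion[i]) for i in range(n_classes)
    ]
    overall_accuracy = sum(confusion[i][i] for i in range(n_classes)) / total
    average_accuracy = sum(sensitivities) / n_classes
    return sensitivities, overall_accuracy, average_accuracy
```

For instance, `classification_metrics([[8, 2], [1, 9]])` describes a hypothetical two-class problem with 10 sequences per class. When the classes are of equal size, OA and AA coincide; when they are imbalanced, AA gives each class the same weight whereas OA is dominated by the larger classes.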

#### RESULTS

#### Prediction Results of Ion and Non-ion Channels

We compared the predictive effects of the SVM-based and random forest-based methods on both ion and non-ion channels in different dimensions. The 10-fold cross-validation results of the 188-dimensional features, 400-dimensional features, and mixed features (188-dimensional features combined with 400-dimensional features) are listed in **Table 2**. We then applied the MRMD method to reduce the dimensions of the 588-dimensional features, which yielded 587-dimensional features. However, the average classification accuracy of the 587-dimensional features was found to be lower than that of the 400-dimensional features. The results also revealed that the SVM classifier was the best method for classifying the 400-dimensional features, with an average overall accuracy (OA) rate of 85.1%. As can be seen in **Table 2**, 86.6% of the ion channels and 83.7% of the non-ion channels can be correctly identified using the SVM classifier, with a total accuracy rate of 85.1%. Both the 188- and 400-dimensional features yield good prediction results. This result reveals that the SVM can moderately improve the predictive performance of the model. We also tried other classifiers to classify ion channels, but their classification performance was clearly worse than that of the random forest and SVM classifiers, so we chose these two classifiers for comparison.

TABLE 2 | Prediction results of ion channels and non-ion channels.

TABLE 3 | Prediction results of voltage-gated and ligand-gated ion channels.

#### Classification Results of Voltage-Gated and Ligand-Gated Ion Channels

We evaluated the accuracy of the 188-dimensional features, the 400-dimensional features, the mixed features (188-dimensional features combined with 400-dimensional features), and the 88-dimensional features obtained after the dimensional reduction using the MRMD method for discriminating between voltage-gated and ligand-gated ion channels. The results are tabulated in **Table 3**. They reveal that the random forest classifier is the best for classifying the 188-dimensional features, with an average overall accuracy rate of 89.9%. As seen in **Table 3**, 93.9% of the voltage-gated ion channels and 86.0% of the ligand-gated ion channels could be correctly identified using the random forest method. The results reveal that the random forest classifier is better than the SVM classifier in some cases and can improve the prediction performance of the model.

The results listed in **Tables 2**, **3** reveal that the difference between the voltage-gated and ligand-gated ion channels appears to be more distinct than that between the ion and non-ion channels. This may be due to the obvious differences between voltage-gated ion channels and ligand-gated ion channels with respect to some specific components.

TABLE 4 | Prediction results for four types of voltage-gated ion channels.

#### Classification Results of Four Types of Voltage-Gated Ion Channels

Finally, we classified the four types of voltage-gated ion channels, i.e., K, Ca, Anion, and Na, using the SVM and random forest methods. The prediction accuracies of the 188-dimensional features, 400-dimensional features, 424-dimensional features, and mixed features were calculated individually. The results are listed in **Table 4**. This table shows that the best classification performance, a maximum overall accuracy rate of 72.973%, is achieved when the SVM classifier is used with the 188-dimensional features. We applied the MRMD method to reduce the dimensions of the 588-dimensional features, which yielded 424-dimensional features. However, the average classification accuracy of the 424-dimensional features was lower than that of the 188-dimensional features. After dimension reduction, the dimension of the ion channel feature vectors did not decrease significantly, and the accuracy even decreased, which indicates that MRMD was not effective for classifying the ion channel feature vectors.

In general, the robustness of the results can be improved by using the minimum dimensions of the feature vector data. Therefore, we recommend using 188-dimensional feature vectors to predict the four types of voltage-gated ion channels.

### DISCUSSION AND CONCLUSIONS

In this study, combined feature representations were used to describe ion channels, and good prediction results were obtained. To accurately predict and classify ion channels and their types, we constructed SVM-based and random forest-based models that used the SVMProt 188-dimensional feature extraction and k-skip-n-gram methods to extract features. Then, we combined the 188-dimensional features with the 400-dimensional features to obtain 588-dimensional features. To achieve a higher accuracy with fewer features, the MRMD method was used to reduce the dimensions of the 588-dimensional features. Finally, the SVM and random forest models were built on the 188-dimensional features, 400-dimensional features, 588-dimensional features, and the MRMD-reduced features. The experimental results revealed that the features extracted by the SVMProt 188-dimensional feature extraction and k-skip-n-gram methods could effectively predict and classify the ion channels. Such a fast and accurate method can accelerate the prediction of ion channels and promote the discovery of drug targets.

Although this method can guide the study of ion channel discovery, it has some limitations. With the rapid increase in ion channel types and data, more accurate prediction and classification models need to be developed. We believe that more in-depth research using computational intelligence (Mrozek et al., 2009; Zeng et al., 2014; Cabarle et al., 2017; Xu et al., 2018a) and machine learning (Zeng et al., 2017b; Song et al., 2018; Zhu et al., 2018c) can result in the development of additional feature extraction methods (Wei et al., 2018b) and more accurate prediction and classification models (Wang et al., 2016), and contribute to drug research and development.

# REFERENCES


# DATA AVAILABILITY

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

# AUTHOR CONTRIBUTIONS

KH, MW, LZ, and YW made substantial contributions to the design of the work and drafted and revised the article. MG, MZ, QZ, and YZ focused on the machine learning programs and plotted the figures. NZ and CW mainly made the analysis and interpretation of data for the work.

# ACKNOWLEDGMENTS

The work was supported by the Support Program for Young Academic Key Teachers of Harbin University of Commerce (No. 7XN004).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Han, Wang, Zhang, Wang, Guo, Zhao, Zhao, Zhang, Zeng and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Dual Convolutional Neural Networks With Attention Mechanisms Based Method for Predicting Disease-Related lncRNA Genes

Ping Xuan<sup>1</sup> , Yangkun Cao<sup>1</sup> , Tiangang Zhang<sup>2</sup> \*, Rui Kong<sup>3</sup> and Zhaogong Zhang<sup>1</sup>

<sup>1</sup> School of Computer Science and Technology, Heilongjiang University, Harbin, China, <sup>2</sup> School of Mathematical Science, Heilongjiang University, Harbin, China, <sup>3</sup> Department of Pancreatic and Biliary Surgery, The First Affiliated Hospital of Harbin Medical University, Harbin, China

#### Edited by:

Quan Zou, University of Electronic Science and Technology of China, China

#### Reviewed by:

Lei Deng, Central South University, China Pora Kim, University of Texas Health Science Center at Houston, United States

> \*Correspondence: Tiangang Zhang zhang@hlju.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 15 February 2019 Accepted: 16 April 2019 Published: 03 May 2019

#### Citation:

Xuan P, Cao Y, Zhang T, Kong R and Zhang Z (2019) Dual Convolutional Neural Networks With Attention Mechanisms Based Method for Predicting Disease-Related lncRNA Genes. Front. Genet. 10:416. doi: 10.3389/fgene.2019.00416

Many studies have indicated that aberrant expression of long non-coding RNA genes (lncRNAs) is closely related to human diseases. Identifying disease-related lncRNAs (disease lncRNAs) is critical for understanding the pathogenesis and etiology of diseases. Most of the previous methods focus on prioritizing the potential disease lncRNAs based on shallow learning methods. These methods fail to extract the deep and complex feature representations of lncRNA-disease associations. Furthermore, nearly all of them ignore the discriminative contributions of the similarity, association, and interaction relationships among lncRNAs, diseases, and miRNAs to the association prediction. We present a method based on dual convolutional neural networks with attention mechanisms for predicting the candidate disease lncRNAs, which is referred to as CNNLDA. CNNLDA deeply integrates multiple sources of data, including the lncRNA similarities, the disease similarities, the lncRNA-disease associations, the lncRNA-miRNA interactions, and the miRNA-disease associations. The diverse biological premises about lncRNAs, miRNAs, and diseases are combined to construct the feature matrix from the biological perspectives. A novel framework based on the dual convolutional neural networks is developed to learn the global and attention representations of the lncRNA-disease associations. The left part of the framework exploits the various information contained in the feature matrix to learn the global representation of lncRNA-disease associations. The different connection relationships among the lncRNA, miRNA, and disease nodes and the different features of these nodes make discriminative contributions to the association prediction. Hence, we present attention mechanisms at the relationship level and the feature level, and the right part of the framework learns the attention representation of associations. The experimental results based on cross-validation indicate that CNNLDA outperforms several state-of-the-art methods. Case studies on stomach cancer, lung cancer, and colon cancer further demonstrate CNNLDA's ability to discover the potential disease lncRNAs.

Keywords: lncRNA-disease prediction, dual convolutional neural networks, attention at feature level, attention at relationship level, lncRNA-miRNA interactions

# INTRODUCTION

Long non-coding RNA genes (lncRNAs) are transcripts longer than 200 nucleotides that are not translated into proteins (Reik, 2009). Accumulating evidence indicates that lncRNAs play crucial roles in the metastasis and progression of various diseases (Prensner and Chinnaiyan, 2011; Schmitt and Chang, 2016; Hu et al., 2018). Therefore, identifying the associations between lncRNAs and diseases is important for understanding the functions of lncRNAs in disease processes.

Predicting disease-related lncRNAs (disease lncRNAs) can screen the potential candidates for the biologists to discover the real lncRNA-disease associations with the wet-lab experiments (Chen et al., 2016a). Existing methods have been presented for prioritizing the candidate disease lncRNAs, which fall into three main categories. Methods in the first category utilize the biological information related to lncRNAs, such as the genome locations, tissue specificity and expression profile. Chen et al. and Li et al. predicted disease lncRNAs by exploiting the locations of lncRNAs and genes in the genome (Chen et al., 2013; Li et al., 2014a). However, the methods are not effective on the lncRNAs which have no adjacent genes. Liu et al. and Chen predicted the potential associations by using the lncRNA tissue specificity or lncRNA expression profile (Liu et al., 2014; Chen, 2015). The methods suffered from the limited information of tissue-specific expressions and low expression levels of lncRNAs.

Methods in the second category construct prediction models based on machine learning for inferring the lncRNA-disease associations. A semi-supervised learning based method was proposed to predict the potential associations (Chen and Yan, 2013). On the basis of this study, Chen et al. and Huang et al. optimized the calculation of the similarities of lncRNAs and diseases (Chen et al., 2015; Huang et al., 2016). However, these methods considered the information of the lncRNA space and the disease space without fusing them completely. Several methods infer the candidate lncRNAs related to a disease by random walk on the lncRNA functional similarity network or on a heterogeneous network composed of lncRNAs, genes and diseases (Sun et al., 2014; Chen et al., 2016b; Gu et al., 2017; Yao et al., 2017). The common and similar neighbors of two diseases (or two lncRNAs) in the lncRNA-disease bipartite network are utilized to infer the association scores between lncRNAs and diseases (Ping et al., 2018). Nevertheless, most of these methods cannot be applied to new diseases without any known related lncRNAs.

The methods in the third category integrate multiple data sources about the proteins and miRNAs that interact with lncRNAs, and the drugs associated with the proteins. Zhang et al. constructed the lncRNA-protein-disease network and obtained the candidate disease lncRNAs by propagating information flow in the heterogeneous network (Zhang et al., 2017). After calculating the various lncRNA and disease similarities, LDAP used the bagging SVM classifier to uncover the potential disease lncRNAs (Lan et al., 2017). A couple of methods established matrix factorization based prediction models to fuse the multiple kinds of information related to the lncRNAs, diseases and proteins (Fu et al., 2017; Lu et al., 2018). However, most of the previous methods are shallow learning methods, which cannot learn the deep and complex representations of lncRNA-disease associations.

Deep learning approaches hold the promise of much better performance (Xu et al., 2017). In our study, we propose a novel method based on dual convolutional neural networks to predict lncRNA-disease associations, which we refer to as CNNLDA. CNNLDA exploits the similarities and associations of lncRNAs and diseases, the interactions between lncRNAs and miRNAs, and the miRNA-disease associations. The feature matrix is first constructed based on the biological premises about lncRNAs, miRNAs, and diseases. Combining the biological premises about when two lncRNAs (diseases) should be more similar captures the relationships between the lncRNA-disease associations and the lncRNA (disease) similarities. Integrating the interactions between lncRNAs and miRNAs and the miRNA-disease associations captures the relationships between the interacting lncRNAs and miRNAs and the lncRNA-disease associations. A new framework based on the dual convolutional neural networks is established for extracting both the global and the attention feature representations of lncRNA-disease associations. The left part of the framework concentrates on extracting features from the associations and similarities of lncRNAs and diseases. In the right part of the framework, each feature and each kind of feature is assigned a different weight by our proposed attention mechanisms, which discriminate their different contributions to predicting the potential disease lncRNAs. The comprehensive cross-validation experiments confirm that CNNLDA outperforms several state-of-the-art methods for predicting candidate disease lncRNAs. Moreover, case studies on 3 diseases indicate that CNNLDA is able to discover potential association candidates that are supported by the corresponding databases and literature.

# MATERIALS AND METHODS

# Datasets for Disease lncRNA Prediction

The lncRNA-disease associations, the lncRNA-miRNA interactions, and the miRNA-disease associations are obtained from previous work on the prediction of lncRNA-disease associations (Fu et al., 2017). The 2687 lncRNA-disease associations were originally extracted from the databases LncRNADisease (Chen et al., 2013) and Lnc2Cancer (Ning et al., 2016), which contain the experimentally confirmed lncRNA-disease associations, and the database GeneRIF (Lu et al., 2006), which records the lncRNA functional descriptions. The 1002 lncRNA-miRNA interactions are extracted from the database starBase (Li et al., 2014b), which includes the interaction information between multiple kinds of RNAs. The disease semantic similarities are obtained from DincRNA (Cheng et al., 2018), and we use them to calculate the lncRNA similarities based on their associated diseases. The 5218 experimentally verified miRNA-disease associations are obtained from the human miRNA-disease database HMDD (Li et al., 2013). All of these associations and interactions cover 240 lncRNAs, 402 diseases, and 495 miRNAs.

# Calculation and Representation of Multiple Kinds of Data

#### Representation of the lncRNA-Disease Associations and miRNA-Disease Associations

The bipartite graph composed of lncRNAs and diseases is constructed from the known lncRNA-disease associations (**Figure 1A**). We use the matrix $\mathbf{A} \in \mathbb{R}^{n_l \times n_d}$ to represent the associations between $n_l$ lncRNAs and $n_d$ diseases, where $A_{ij}$ is 1 if lncRNA $l_i$ has been observed to be related to disease $d_j$ and 0 otherwise. As shown in **Figure 1C**, the known miRNA-disease associations form the miRNA-disease bipartite graph. The matrix $\mathbf{B} \in \mathbb{R}^{n_m \times n_d}$ represents the associations between $n_m$ miRNAs and $n_d$ diseases: $B_{ij}$ is 1 if an association between miRNA $m_i$ and disease $d_j$ has been observed, and 0 otherwise.
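As a minimal sketch of this representation, the binary matrices can be filled from a list of known pairs. The function name `build_association_matrix` and the toy identifiers below are illustrative assumptions, not part of the original work.

```python
def build_association_matrix(pairs, row_ids, col_ids):
    """Return a len(row_ids) x len(col_ids) 0/1 matrix from known (row, col) pairs."""
    row_index = {name: i for i, name in enumerate(row_ids)}
    col_index = {name: j for j, name in enumerate(col_ids)}
    matrix = [[0] * len(col_ids) for _ in row_ids]
    for row_name, col_name in pairs:
        # An observed association sets the corresponding entry to 1.
        matrix[row_index[row_name]][col_index[col_name]] = 1
    return matrix


# Toy identifiers, purely illustrative (hypothetical data).
lncrnas = ["l1", "l2", "l3"]
diseases = ["d1", "d2"]
known_pairs = [("l1", "d2"), ("l3", "d1")]

A = build_association_matrix(known_pairs, lncrnas, diseases)
```

The same helper could be reused for the miRNA-disease matrix by passing miRNA identifiers as the rows.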

#### Representation of the Disease Similarities

The more similar two diseases are, the more likely they are to be associated with similar lncRNAs. Hence the disease similarities are integrated by our model for predicting disease-related lncRNAs. A disease can be represented by a directed acyclic graph (DAG) that includes all the disease terms related to the disease. If two diseases have more common disease terms, they are more similar, which is the basic idea for semantic similarity between Gene Ontology terms (Xu et al., 2013b). Wang et al. have successfully measured the similarity of two diseases based on their DAGs (Wang et al., 2010). The disease similarities are calculated by Wang's method, and they are represented by the matrix $\mathbf{D} \in \mathbb{R}^{n_d \times n_d}$, where $D_{ij}$ is the similarity of two diseases $d_i$ and $d_j$ (**Figure 1B**).

#### Representation of the lncRNA Similarities

As lncRNAs associated with similar diseases are likely to have more similar functions, Chen et al. measured the similarity of two lncRNAs based on their associated diseases (Chen et al., 2015); similar approaches have been used for miRNA-miRNA network inference (Xu et al., 2013a). The lncRNA similarities that we use are calculated by Chen's method. For instance, suppose the lncRNA $l_a$ is associated with a group of diseases $DT_a = \{d_{a1}, d_{a2}, \ldots, d_{am}\}$ and the lncRNA $l_b$ is associated with a group of diseases $DT_b = \{d_{b1}, d_{b2}, \ldots, d_{bn}\}$. The similarity between $DT_a$ and $DT_b$ is then taken as the similarity of $l_a$ and $l_b$, denoted $LS(l_a, l_b)$ and defined as,

$$LS\left(l_{a}, l_{b}\right) = \frac{\sum_{i=1}^{m} \max_{1 \leq j \leq n} \left(DS\left(d_{ai}, d_{bj}\right)\right) + \sum_{j=1}^{n} \max_{1 \leq i \leq m} \left(DS\left(d_{bj}, d_{ai}\right)\right)}{m+n}, \tag{1}$$

where $DS(d_{ai}, d_{bj})$ is the semantic similarity of the diseases $d_{ai}$ and $d_{bj}$, which belong to $DT_a$ and $DT_b$ respectively, and $m$ and $n$ are the numbers of diseases in $DT_a$ and $DT_b$. The lncRNA similarities are denoted by the matrix $\mathbf{L} \in \mathbb{R}^{n_l \times n_l}$, where $L_{ij}$ is the similarity of two lncRNAs $l_i$ and $l_j$ (**Figure 1A**).
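The best-match average in Eq. (1) can be sketched as follows. This is an illustrative sketch only: the toy similarity table, the helper `disease_similarity`, and the choice to return 0 when a lncRNA has no associated disease are assumptions made here, not details given in the text.

```python
def lncrna_similarity(diseases_a, diseases_b, ds):
    """Eq. (1): best-match average between two disease sets.

    ds(p, q) returns the semantic similarity of diseases p and q.
    """
    m, n = len(diseases_a), len(diseases_b)
    if m == 0 or n == 0:
        return 0.0  # assumption: no associated disease gives similarity 0
    a_to_b = sum(max(ds(p, q) for q in diseases_b) for p in diseases_a)
    b_to_a = sum(max(ds(q, p) for p in diseases_a) for q in diseases_b)
    return (a_to_b + b_to_a) / (m + n)


# Hypothetical symmetric disease similarities, for illustration only.
toy_table = {
    frozenset(("d1", "d2")): 0.5,
    frozenset(("d1", "d3")): 0.2,
    frozenset(("d2", "d3")): 0.4,
}


def disease_similarity(p, q):
    """A disease is fully similar to itself; other pairs come from the toy table."""
    return 1.0 if p == q else toy_table.get(frozenset((p, q)), 0.0)


similarity = lncrna_similarity(["d1", "d2"], ["d2", "d3"], disease_similarity)
```

In this toy case the two sums are 1.5 and 1.4, so the similarity is 2.9 / 4.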

#### Representation of the lncRNA-miRNA Interactions

It is well known that lncRNAs often interact with the corresponding miRNAs and that they are involved in biological processes synchronously (Yang et al., 2014; Paraskevopoulou and Hatzigeorgiou, 2016). Hence our prediction model also takes the interaction relationships between lncRNAs and miRNAs into account (**Figure 1C**). The interactions between $n_l$ lncRNAs and $n_m$ miRNAs are represented by the matrix $\mathbf{Y} \in \mathbb{R}^{n_l \times n_m}$, where each row of $\mathbf{Y}$ corresponds to a lncRNA and each column corresponds to a miRNA. $Y_{ij}$ is 1 when lncRNA $l_i$ interacts with miRNA $m_j$, and 0 otherwise.

FIGURE 1 | … part by exploiting the lncRNA-miRNA interactions and the miRNA-disease associations. (D) Concatenate these three parts to form the feature matrix **P**.

FIGURE 2 | … convolutional and pooling layers. (B) Establish the attention mechanism at the feature and relationship levels. (C) Construct the final module to estimate the association score.

# Disease lncRNA Prediction Model Based on Dual CNN

In this section, we describe our prediction model for learning the latent representations of lncRNA-disease associations and predicting the disease-related lncRNAs. The feature matrix is first constructed by incorporating the similarities, interactions, and associations of lncRNAs, miRNAs, and diseases (**Figure 1**). A novel framework is then established based on dual convolutional neural networks with attention mechanisms (**Figure 2**). The left part of the framework learns the global representation of a lncRNA-disease association, while the right part learns the more informative connection relationships among lncRNAs, miRNAs, and diseases. These two representations are integrated by an additional convolutional layer and a fully connected layer, and the probability that a lncRNA is associated with a disease is obtained as their association score. We take the lncRNA $l_1$ and the disease $d_2$ as an example to describe our model CNNLDA for lncRNA-disease association prediction.

#### Construction of Feature Matrix

The feature matrix of the lncRNA $l_1$ and the disease $d_2$ is constructed by combining three biological premises. First, if $l_1$ and $d_2$ have similarity and association relationships with more common lncRNAs, they are more likely to be associated with each other. For instance, if $l_1$ and $l_2$ have similar functions, and $d_2$ has been observed to be associated with $l_2$, then $l_1$ is possibly associated with $d_2$. Let $\mathbf{x}_1$ represent the 1st row of $\mathbf{L}$, which contains the similarities between $l_1$ and the various lncRNAs. The 2nd column of $\mathbf{A}$, $\mathbf{x}_2$, records the associations between $d_2$ and all the lncRNAs. $\mathbf{x}_1$ and $\mathbf{x}_2$ are put together to form a matrix whose dimension is $2 \times n_l$ (**Figure 1A**). Second, when $l_1$ and $d_2$ have association and similarity connections with more common diseases, $l_1$ is more likely to be associated with $d_2$. $\mathbf{x}_3$ is the 1st row of $\mathbf{A}$ and records the associations between $l_1$ and all the diseases. $\mathbf{x}_4$ is the 2nd row of $\mathbf{D}$ and contains the similarities between $d_2$ and these diseases. $\mathbf{x}_3$ and $\mathbf{x}_4$ are also combined to form a matrix with dimension $2 \times n_d$ (**Figure 1B**). Third, there is a possible association between $l_1$ and $d_2$ when they have interaction and association connections with common miRNAs. The 1st row of $\mathbf{Y}$, $\mathbf{x}_5$, records the interactions between $l_1$ and the various miRNAs, while the 2nd column of $\mathbf{B}$, $\mathbf{x}_6$, records the associations between $d_2$ and these miRNAs. $\mathbf{x}_5$ and $\mathbf{x}_6$ are integrated to form a matrix with dimension $2 \times n_m$ (**Figure 1C**). These three matrices are concatenated to form the feature matrix of lncRNA $l_1$ and disease $d_2$, whose dimension is $2 \times (n_l + n_d + n_m)$ (**Figure 1D**).
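The concatenation of the three parts can be sketched as below, using 0-based indices so that $l_1$ and $d_2$ correspond to `i = 0` and `j = 1`. The toy matrices are hypothetical, and in this sketch $\mathbf{x}_2$ is read from the association matrix, since that is the matrix holding lncRNA-disease associations.

```python
def column(matrix, j):
    """Return column j of a list-of-lists matrix."""
    return [row[j] for row in matrix]


def feature_matrix(L, A, D, Y, B, i, j):
    """Build the 2 x (n_l + n_d + n_m) feature matrix for lncRNA i and disease j."""
    top = L[i] + A[i] + Y[i]                      # x1, x3, x5 (lncRNA side)
    bottom = column(A, j) + D[j] + column(B, j)   # x2, x4, x6 (disease side)
    return [top, bottom]


# Toy data with n_l = 2 lncRNAs, n_d = 2 diseases, n_m = 3 miRNAs (hypothetical).
L = [[1.0, 0.3], [0.3, 1.0]]          # lncRNA similarities, n_l x n_l
A = [[0, 1], [1, 0]]                  # lncRNA-disease associations, n_l x n_d
D = [[1.0, 0.6], [0.6, 1.0]]          # disease similarities, n_d x n_d
Y = [[1, 0, 1], [0, 1, 0]]            # lncRNA-miRNA interactions, n_l x n_m
B = [[1, 0], [0, 1], [1, 1]]          # miRNA-disease associations, n_m x n_d

P = feature_matrix(L, A, D, Y, B, i=0, j=1)
```

Each row of `P` has length $n_l + n_d + n_m$, which is 7 for the toy data.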

#### Convolutional Module on the Left

The feature matrix of $l_1$ and $d_2$, $\mathbf{P}$, is input to the convolutional module on the left to learn a global deep representation for $l_1$ and $d_2$. The convolutional module includes two convolutional layers and two pooling layers (**Figure 2A**). We take the first convolutional layer and the first pooling layer as examples to describe the convolution and pooling processes. To learn the marginal information of $\mathbf{P}$, we pad zeros around $\mathbf{P}$ and obtain a new matrix named $\mathbf{P}'$.

#### **Convolutional layer**

For the first convolutional layer, the length of a filter is set to $n_f$ and its width to $n_w$. If the number of filters is $n_{conv1}$, the filters $\mathbf{W}_{conv1} \in \mathbb{R}^{n_{conv1} \times n_w \times n_f}$ are applied to the matrix $\mathbf{P}'$ to obtain the feature maps $\mathbf{Z}_{conv1} \in \mathbb{R}^{n_{conv1} \times (4 - n_w + 1) \times (n_t + 2 - n_f + 1)}$. $\mathbf{P}'(i, j)$ is the element at the $i$th row and the $j$th column of $\mathbf{P}'$, and $\mathbf{P}'_{k,i,j}$ represents the region within the filter when the $k$th filter slides to the position $\mathbf{P}'(i, j)$. The formal definitions of $\mathbf{P}'_{k,i,j}$ and $\mathbf{Z}_{conv1,k}$ are as follows,

$$\mathbf{P}'_{k,i,j} = \mathbf{P}'\left(i : i + n_w, j : j + n_f\right), \qquad \mathbf{P}'_{k,i,j} \in \mathbb{R}^{n_w \times n_f}, \tag{2}$$

$$\mathbf{Z}_{conv1,k}\left(i,j\right) = f\left(\mathbf{W}_{conv1}\left(k,:,:\right) * \mathbf{P}'_{k,i,j} + \mathbf{b}_{conv1}\left(k\right)\right), \tag{3}$$

$$i \in \left[1, 4 - n_w + 1\right], \quad j \in \left[1, n_t + 2 - n_f + 1\right], \quad k \in \left[1, n_{conv1}\right],$$

where $\mathbf{b}_{conv1}$ is the bias vector, $f$ is the ReLU function (Nair and Hinton, 2010), and $n_t = n_l + n_d + n_m$. $\mathbf{Z}_{conv1,k}(i, j)$ is the element at the $i$th row and $j$th column of the $k$th feature map $\mathbf{Z}_{conv1,k}$.
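To make the output shape concrete, here is a small pure-Python sketch of zero padding followed by one filter with ReLU, as in Eqs. (2) and (3). It assumes stride 1 and treats the `*` operation as the sliding dot product used by deep learning frameworks; the function names and the toy input are illustrative.

```python
def zero_pad(matrix):
    """Surround a matrix with one ring of zeros (P -> P')."""
    width = len(matrix[0]) + 2
    padded = [[0.0] * width]
    padded += [[0.0] + list(row) + [0.0] for row in matrix]
    padded += [[0.0] * width]
    return padded


def conv2d_relu(padded, kernel, bias):
    """Slide one n_w x n_f filter over the padded matrix (stride 1) and apply ReLU."""
    n_w, n_f = len(kernel), len(kernel[0])
    rows = len(padded) - n_w + 1
    cols = len(padded[0]) - n_f + 1
    feature_map = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            total = bias
            for u in range(n_w):
                for v in range(n_f):
                    total += kernel[u][v] * padded[i + u][j + v]
            out_row.append(max(0.0, total))  # ReLU
        feature_map.append(out_row)
    return feature_map


# Toy 2 x n_t feature matrix with n_t = 3 and a 2 x 2 filter of ones (hypothetical values).
P = [[1, 0, 2], [0, 1, 1]]
P_padded = zero_pad(P)
Z = conv2d_relu(P_padded, kernel=[[1.0, 1.0], [1.0, 1.0]], bias=0.0)
```

With $n_w = n_f = 2$ and $n_t = 3$, the feature map has $(4 - 2 + 1) = 3$ rows and $(3 + 2 - 2 + 1) = 4$ columns, matching the index ranges above.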

#### **Pooling layer**

We apply max pooling to extract robust features from the feature maps $\mathbf{Z}_{conv1}$. $n_g$ and $n_p$ are the length and width of a filter of the pooling layer, respectively. The pooling outputs of all the feature maps are $\mathbf{Z}_{convpool1}$,

$$\mathbf{Z}_{convpool1,k}\left(i,j\right) = \text{Max}\left(\mathbf{Z}_{conv1,k}\left(i : i + n_g, j : j + n_p\right)\right), \tag{4}$$

$$i \in \left[1, 5 - n_w - n_g + 1\right], \quad j \in \left[1, n_t + 3 - n_f - n_p + 1\right], \quad k \in \left[1, n_{conv1}\right],$$

where $\mathbf{Z}_{convpool1,k}$ is the $k$th feature map, and $\mathbf{Z}_{convpool1,k}(i, j)$ is the element at its $i$th row and $j$th column.
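A minimal sketch of the pooling step in Eq. (4) follows. It assumes a stride of 1, which is what the index ranges above imply; the function name and the toy feature map are illustrative.

```python
def max_pool(feature_map, n_g, n_p):
    """Max over every n_g x n_p window of the feature map, moving one step at a time."""
    rows = len(feature_map) - n_g + 1
    cols = len(feature_map[0]) - n_p + 1
    return [
        [
            max(feature_map[i + u][j + v] for u in range(n_g) for v in range(n_p))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Toy 3 x 4 feature map (hypothetical values) pooled with a 2 x 2 window.
Z_conv = [[1, 1, 2, 2], [1, 2, 4, 3], [0, 1, 2, 1]]
Z_pool = max_pool(Z_conv, n_g=2, n_p=2)
```

A 3 x 4 input with a 2 x 2 window gives a 2 x 3 output under this stride-1 reading.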

#### Attention Module on the Right

In our model, the attention module is used to learn which features or connection relationships are more informative for the representation of lncRNA $l_1$ and disease $d_2$. Thus, the module consists of the attention mechanism at the feature level and the one at the relationship level (**Figure 2B**).

#### **Attention at the feature level**

The features within $\mathbf{P}$ usually contribute differently to the representations of lncRNA-disease associations. For instance, for a specific disease, the lncRNAs that have been observed to be associated with the disease are often more important than the unobserved ones. In the feature matrix $\mathbf{P} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_i, \ldots, \mathbf{x}_6\}$, each feature $x_{ij}$ of vector $\mathbf{x}_i$ is assigned an attention weight $\alpha_{ij}^F$, which is defined as follows,

$$\mathbf{s}_i^F = \mathbf{H}^F \tanh\left(\mathbf{W}_x^F \mathbf{x}_i + \mathbf{b}^F\right), \tag{5}$$

$$\alpha_{ij}^F = \frac{\exp\left(s_{ij}^F\right)}{\sum_k \exp\left(s_{ik}^F\right)}, \tag{6}$$

where $\mathbf{H}^F$ and $\mathbf{W}_x^F$ are the weight matrices, and $\mathbf{b}^F$ is a bias vector. $\mathbf{s}_i^F = [s_{i1}^F, s_{i2}^F, \ldots, s_{ik}^F, \ldots, s_{in_i}^F]$ is the vector of attention scores representing the importance of the different features in $\mathbf{x}_i$, where $n_i$ is the length of $\mathbf{x}_i$ and $s_{in_i}^F$ is the score of $x_{in_i}$. $\alpha_{ij}^F$ is the normalized attention weight for feature $x_{ij}$. Thus the latent representation of the different features may be denoted as $\mathbf{y}_i$,

$$\mathbf{y}_i = \boldsymbol{\alpha}_i^F \otimes \mathbf{x}_i, \tag{7}$$

where $\otimes$ is the element-wise product operator, and the symbol $F$ represents the feature level.
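Eqs. (5) to (7) can be sketched in a few lines of plain Python. The hidden size (the number of rows of `W`) and the demo weights are assumptions made only for this illustration; the text does not specify them.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature_attention(x, W, b, H):
    """Return (alpha, y) for one feature vector x.

    W: hidden x n weight matrix, b: hidden bias, H: n x hidden weight matrix.
    """
    hidden = [math.tanh(h + bias) for h, bias in zip(matvec(W, x), b)]
    scores = matvec(H, hidden)                  # Eq. (5)
    alpha = softmax(scores)                     # Eq. (6)
    y = [a * xi for a, xi in zip(alpha, x)]     # Eq. (7), element-wise product
    return alpha, y


# Demo with arbitrary (hypothetical) weights: n = 3 features, hidden size 2.
x_demo = [1.0, 0.0, 1.0]
W_demo = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
b_demo = [0.0, 0.1]
H_demo = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
alpha_demo, y_demo = feature_attention(x_demo, W_demo, b_demo, H_demo)
```

The weights in `alpha_demo` sum to 1, and features with larger scores keep a larger share of their original value in `y_demo`.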

#### **Attention at the relationship level**

There are several connection relationships among lncRNAs, diseases, and miRNAs, including the similarities between lncRNAs, the associations between lncRNAs and diseases, the similarities between diseases, the interactions between lncRNAs and miRNAs, and the associations between diseases and miRNAs. Different relationships also contribute differently to the representation of lncRNA-disease associations. Therefore, at the relationship level, we use an attention mechanism on each feature vector $\mathbf{y}_i$ to generate the final attention representation. The attention scores at the relationship level are given by,

$$s_i^R = \mathbf{h}^R \tanh\left(\mathbf{W}_y^R \mathbf{y}_i + \mathbf{b}^R\right), \tag{8}$$

$$\beta_i^R = \frac{\exp(s_i^R)}{\sum_{j \in \mathcal{G}} \exp(s_j^R)}, \tag{9}$$

where $\mathbf{W}_y^R$ is the weight matrix, $\mathbf{b}^R$ is a bias vector, $\mathbf{h}^R$ is a weight vector, and $s_i^R$ represents the score of the $i$th relationship $\mathbf{y}_i$. $\beta_i^R$ is the normalized attention weight for relationship $\mathbf{y}_i$. The latent representation of the association through the attentions at the feature and relationship levels is obtained and represented by

$$\mathbf{g} = \sum_{i} \beta_i^R \mathbf{y}_i, \tag{10}$$

where the symbol $R$ represents the relationship level. Let $\mathbf{G}$ be the matrix obtained by padding $\mathbf{g}$ with zeros. The attention representations $\mathbf{Z}_{att}$ are obtained by feeding $\mathbf{G}$ into a convolutional layer and a max-pooling layer.
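The relationship-level attention of Eqs. (8) to (10) can be sketched as follows. Two assumptions are made here that the text does not state: the vectors $\mathbf{y}_i$ are taken to have a common length (for example after zero padding), and the normalization in Eq. (9) is taken over all relationships.

```python
import math


def relationship_attention(ys, W, b, h):
    """Return (beta, g) for a list of equal-length relationship vectors ys.

    W: hidden x n weight matrix, b: hidden bias, h: hidden weight vector.
    """
    scores = []
    for y in ys:
        hidden = [
            math.tanh(sum(w * v for w, v in zip(row, y)) + bias)
            for row, bias in zip(W, b)
        ]
        scores.append(sum(hk * zk for hk, zk in zip(h, hidden)))      # Eq. (8)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    beta = [e / total for e in exps]                                   # Eq. (9)
    length = len(ys[0])
    g = [sum(bi * y[k] for bi, y in zip(beta, ys)) for k in range(length)]  # Eq. (10)
    return beta, g


# Demo with two hypothetical relationship vectors and arbitrary weights.
ys_demo = [[1.0, 0.0], [0.0, 1.0]]
beta_demo, g_demo = relationship_attention(
    ys_demo, W=[[0.8, -0.4]], b=[0.0], h=[1.0]
)
```

In the demo, the first vector gets the larger score, so `g_demo` leans toward it.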

#### Final Module

Let $\mathbf{Z}_{glo}$ be the global representation that is learned by the left convolutional module and $\mathbf{Z}_{att}$ be the attention representation that is learned by the right convolutional module. $\mathbf{Z}_{glo}$ and $\mathbf{Z}_{att}$ are combined by stacking the former on top of the latter, and the result is denoted as $\mathbf{Z}_{con}$ (**Figure 2C**). $\mathbf{Z}_{con}$ runs through an additional convolutional layer to obtain the final representation $\mathbf{Z}_{fin}$. $\mathbf{z}_o$ is the vector obtained by flattening $\mathbf{Z}_{fin}$, and it is input into a fully connected layer $\mathbf{W}_{out}$ and a softmax layer (Bahdanau et al., 2014) to get $\mathbf{p}$,

$$\mathbf{p} = \operatorname{softmax}\left(\mathbf{W}_{out} \mathbf{z}_o + \mathbf{b}_o\right). \tag{11}$$

$\mathbf{p}$ is an association probability distribution over $C$ classes ($C = 2$). It contains the probability that a lncRNA and a disease have an association relationship and the probability that they have no association.

#### Loss of Association Prediction

In our model, the cross-entropy loss between the ground truth distribution of the lncRNA-disease association and the prediction probability $\mathbf{p}$ is defined as $\mathcal{L}$,

$$\mathcal{L} = -\sum_{i \in T} \sum_{j=1}^{C} z_{j} \log p_{j}, \tag{12}$$

where $\mathbf{z} \in \mathbb{R}^2$ is the classification label vector and $T$ is the set of training samples. If $l_1$ is associated with $d_2$, the second dimension of the vector $\mathbf{z}$ is 1 and the first one is 0. On the contrary, if $l_1$ is not associated with $d_2$, the first dimension of $\mathbf{z}$ is 1 and the second one is 0.

We denote all neural network parameters by θ. The objective function in our learning process is defined as follows,

$$\min_{\theta} \mathcal{L}\left(\theta\right) = \mathcal{L} + \lambda \left\|\theta\right\|^2, \tag{13}$$

where λ is a trade-off parameter between the training loss and regularization term. We use Adam optimization algorithm to optimize the objective function (Kingma and Ba, 2015).
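The softmax output, the cross-entropy loss, and the regularized objective can be illustrated with a short sketch. The flat parameter list and the sample values are hypothetical, and the Adam update itself is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, p):
    """Cross-entropy between a one-hot label z and a predicted distribution p."""
    return -sum(zj * math.log(pj) for zj, pj in zip(z, p) if zj > 0)


def objective(samples, params, lam):
    """Training loss summed over samples plus lam times the squared L2 norm of params."""
    data_loss = sum(cross_entropy(z, p) for z, p in samples)
    l2_penalty = sum(w * w for w in params)
    return data_loss + lam * l2_penalty


# One associated pair (label [0, 1]) with uninformative logits, hypothetical parameters.
p_demo = softmax([0.0, 0.0])
loss_demo = objective([([0, 1], p_demo)], params=[1.0, 2.0], lam=0.1)
```

Here the prediction is uniform, so the data term equals log 2, and the penalty adds 0.1 × (1 + 4).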

# RESULTS AND DISCUSSION

#### Parameter Setting

In CNNLDA, a 2×2 window size is used for all of the convolutional and pooling layers. In the left convolutional module (**Figure 2A**), the number of filters is 8 in the first convolutional layer and 16 in the second. In the right attention convolutional module (**Figure 2B**), the number of filters is 16. In the final module (**Figure 2C**), we set the number of filters to 32. We implement our method using PyTorch to train and optimize the neural networks, and a GPU card (Nvidia GeForce GTX 1080Ti) is utilized to speed up the training process. The training process is terminated when the maximum number of iterations, 80, is reached.

#### Performance Evaluation Metrics

Five-fold cross-validation is performed to evaluate the performance of CNNLDA and other state-of-the-art methods for predicting lncRNA-disease associations. If a lncRNA $l_s$ is associated with a disease $d_t$, we treat the $l_s$-$d_t$ node pair as a positive sample. If $l_s$ has not been observed to be associated with $d_t$, the pair is treated as a negative sample. For each cross-validation round, we randomly select 80% of the positive samples and the same number of negative samples as the training data, and use the remaining 20% of the positive samples and all of the negative samples for testing. Note that the association dataset is separated into 5 folds for cross-validation, and we recompute the lncRNA similarities in each round using only the known associations that are used for training.

The samples are ranked by their association scores after the association probabilities of the testing samples are estimated. The higher the node pairs of the positive samples are ranked, the better CNNLDA performs. If a lncRNA-disease node pair has an observed association and its association score is greater than a threshold θ, it is a correctly identified positive sample. If the prediction score of a negative sample is smaller than θ, it is a correctly identified negative sample. We calculate the true positive rates (TPRs) and the false positive rates (FPRs) at varying thresholds θ to obtain a receiver operating characteristic (ROC) curve. TPR and FPR are defined as follows,

$$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN}, \tag{14}$$

where TP is the number of correctly identified positive samples, and FN is the number of positive samples misidentified as negative. TN is the number of correctly identified negative samples, and FP is the number of negative samples misidentified as positive. The global prediction performance of a method is commonly measured by the area under the ROC curve (AUC) (Karimollah, 2013).
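As an illustration of how the ROC curve and its area follow from Eq. (14), the sketch below sweeps the threshold over the ranked scores and integrates with the trapezoidal rule. The scores and labels are toy values, and the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold over the ranked scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for k, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples sharing this score are counted.
        if k + 1 == len(ranked) or ranked[k + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy association scores and labels (1 = known association), hypothetical values.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
auc = area_under_curve(roc_points(scores, labels))
```

In this toy case three of the four positive-negative pairs are ranked correctly, which corresponds to an AUC of 0.75.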

The known lncRNA-disease associations (the positive samples) and the unobserved ones (the negative samples) are seriously imbalanced. In such a case, we also use the precision-recall (PR) curve and its area (AUPR) to assess the performance of a prediction method (Takaya and Marc, 2015). Precision and recall are defined as follows,

$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}. \tag{15}$$

Precision is the fraction of correctly identified positive samples among the samples that are retrieved, and recall is the fraction of correctly identified positive samples among all the positive samples. For the 5-fold cross-validation, we use averaging CV to obtain the final performance. Averaging CV means that we obtain a separate performance (AUC or AUPR) for each of the 5 folds when it is used as the test set, and the 5 performances are averaged to give the final performance.

FIGURE 3 | ROC curves and PR curves of CNNLDA and other methods for all the diseases. (A) ROC curves of all the methods. (B) PR curves of all the methods.

TABLE 1 | AUCs of ROC curves of CNNLDA and other methods for all of the diseases and 10 well-characterized diseases.

The bold values indicate the highest AUC.

In addition, biologists usually select lncRNA candidates from the top part of the ranking list and then further validate their associations with diseases. Therefore, the recall values of the top 30, 60, . . . , 240 candidates are calculated. They represent the fraction of the positive samples that are successfully recovered in the top k list among all the positive samples.
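The recall in the top k list can be computed as in the following sketch. The scores and labels are toy values, and the function name `recall_at_k` is illustrative.

```python
def recall_at_k(scores, labels, k):
    """Fraction of all positive samples that appear among the k highest-scored candidates."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = sum(label for _, label in ranked[:k])
    total_positives = sum(labels)
    return hits / total_positives if total_positives else 0.0


# Toy candidate scores for one disease (1 = known association), hypothetical values.
scores = [0.9, 0.2, 0.8, 0.4, 0.7]
labels = [1, 0, 0, 1, 1]
recalls = {k: recall_at_k(scores, labels, k) for k in (2, 3, 4)}
```

Recall grows with k here, from one of three positives in the top 2 to all three in the top 4.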

#### Comparison With Other Methods

To evaluate the performance of CNNLDA, we compare it with several state-of-the-art methods for lncRNA-disease association prediction, including SIMCLDA (Lu et al., 2018), Ping's method (Ping et al., 2018), MFLDA (Fu et al., 2017) and LDAP (Lan et al., 2017). As shown in **Figure 3A** and **Table 1**, CNNLDA achieves the highest average AUC on all of the tested 402 diseases (AUC = 0.952). It outperforms SIMCLDA by 20.6%, Ping's method by 8.05%, MFLDA by 32.6% and LDAP by 8.85%. We also list the AUCs of the five methods on 10 well-characterized diseases that are associated with at least 15 lncRNAs (**Table 1**). CNNLDA yields the best performance for 9 out of 10 diseases. CNNLDA also achieves the best average AUPR (AUPR = 0.251), which is 15.6%, 3.19%, 18.5%, and 8.51% better than SIMCLDA, Ping's method, MFLDA and LDAP respectively (**Figure 3B**). In addition, CNNLDA achieves the highest AUPRs on 9 out of 10 well-characterized diseases (**Table 2**). The performance of Ping's method is similar to that of LDAP as they exploit different types of similarities of lncRNAs and diseases. These two methods achieve the second and third best performance respectively. The performance of MFLDA is not as good as the other four methods as it did not exploit the disease similarities and the lncRNA similarities. The improvement of CNNLDA over the compared methods is primarily due to its deep learning of the global and attention representations of lncRNA-disease associations.

TABLE 2 | AUPRs of PR curves of CNNLDA and other methods for all of the diseases and 10 well-characterized diseases.


The bold values indicate the highest AUPR.

TABLE 3 | A pairwise comparison with a paired Wilcoxon-test on the prediction results in terms of AUCs and AUPRs.


We perform a paired Wilcoxon-test to evaluate whether CNNLDA's AUCs and AUPRs across all of the tested diseases are significantly higher than those of another method. CNNLDA achieves significantly higher performance than the other methods in terms of both AUCs and AUPRs as the corresponding P-values are smaller than 0.05 (**Table 3**).

The higher the recall rate on the top k ranked lncRNA-disease associations is, the more genuine associations are determined correctly. Under different k cutoffs, CNNLDA consistently outperforms the other methods (**Figure 4**), ranking 89.6% of the positive samples in the top 30, 96.2% in the top 60, and 98.8% in the top 90. Most of the recalls of Ping's method are very close to those of LDAP: Ping's method ranks 68.9% in the top 30, 81.3% in the top 60, and 88% in the top 90, while LDAP ranks 68.5% in the top 30, 81.3% in the top 60, and 88% in the top 90. SIMCLDA ranks 49.3% in the top 30, 63% in the top 60, and 74.1% in the top 90, which is not as good as Ping's method but better than MFLDA (42%, 53.9% and 61%).

In addition, to validate the effectiveness of exploiting the information related to the miRNAs, we construct another instance of CNNLDA that is trained without this kind of information, and the instance is referred to as CNNLDA-nM. The instance of CNNLDA that is trained with the miRNA-related information is still named CNNLDA. CNNLDA's AUC and AUPR are 0.2% and 0.94% greater than those of CNNLDA-nM, which confirms the importance of integrating this information for improving CNNLDA's prediction performance.

TABLE 4 | The candidate lncRNAs associated with stomach cancer, lung cancer and colon cancer.

(1) "Lnc2Cancer" and "LncRNADisease" are manually curated databases. (2) "literature<sup>1</sup>" means that published literature supports the dysregulation of the lncRNA in the cancer. (3) "literature<sup>2</sup>" or "miRCancer, StarBase" means that the lncRNA is related to some important factors affecting the development of the cancer.

#### Case Studies: Stomach Cancer, Lung Cancer, and Colon Cancer

To demonstrate CNNLDA's ability to discover potential candidate disease lncRNAs, we execute the case studies on stomach cancer, lung cancer, and colon cancer and analyze the top 15 candidates respectively related to these cancers (**Table 4**).

First, the Lnc2Cancer database curates the lncRNAs that are differentially expressed in disease tissues compared to normal ones. Lnc2Cancer contains lncRNAs related to cancers that have been identified by analyzing the results of northern blot experiments, microarray experiments, and quantitative real-time polymerase chain reaction experiments (Gao et al., 2018). LncRNADisease is a database that includes 2,947 lncRNA-disease entries (Chen et al., 2013). These associations are extracted from the published literature by text mining techniques, and the dysregulation of the lncRNAs is then manually confirmed. As shown in **Table 4**, 33 candidate lncRNAs are contained in Lnc2Cancer and 13 candidate lncRNAs are included in LncRNADisease, which confirms that these lncRNAs are upregulated or downregulated in these cancers.

Next, 2 candidates for stomach cancer, 1 candidate for lung cancer and 2 candidates for colon cancer, labeled with "literature<sup>1</sup>", are supported by published literature. These lncRNAs are confirmed to be dysregulated in the cancers when compared with normal tissues (Bahari et al., 2015; Zhang et al., 2016; Shen et al., 2017; Gu et al., 2018; Sun et al., 2018).

Finally, 5 candidates labeled with "literature<sup>2</sup>" or "miRCancer, StarBase" are related to important factors affecting the development of the corresponding cancers. In the metabolic network, lncRNA HCP5 is regulated by three miRNAs, and these miRNAs are downregulated in stomach cancer. This indicates that the expression of HCP5 is likely to be associated with stomach cancer (Mo et al., 2018). Four lncRNAs (CBR3-AS1, NPTN-IT1, CDKN2B-AS1 and SNHG4) interact with four corresponding miRNAs (hsa-miR-217, hsa-miR-520c-3p, hsa-miR-320a and hsa-miR-4458) (Li et al., 2014b). These four miRNAs have been observed to be associated with stomach cancer, lung cancer and colon cancer (Xie et al., 2013). Hence these lncRNAs are probably involved in the progression of these cancers.

#### Predicting Novel Disease-Related lncRNAs

After evaluating its prediction performance through cross-validation and case studies, CNNLDA is applied to all 402 diseases. All the positive and negative samples are used to train CNNLDA to predict novel disease-associated lncRNAs. The potential candidate lncRNAs for these diseases are listed in **Supplementary Table 1**. In addition, the lncRNA similarities based on the diseases associated with these lncRNAs are shown in **Supplementary Table 2**.

# CONCLUSIONS

A novel method based on dual convolutional neural networks, CNNLDA, is developed for predicting potential disease-related lncRNAs. We construct attention mechanisms at the feature level and the relationship level to discriminate the different contributions of features and to learn a more informative representation of lncRNA-disease associations. The new framework based on dual convolutional neural networks is developed for learning the global representation and the attention representation of lncRNA-disease associations. The experimental results indicate that CNNLDA is superior to the other compared methods in terms of both AUCs and AUPRs. The case studies on 3 diseases demonstrate CNNLDA's ability to discover potential disease-associated lncRNAs.

# DATA AVAILABILITY

All datasets analyzed for this study are cited in the manuscript and the **Supplementary Files**.

# AUTHOR CONTRIBUTIONS

PX and YC conceived the prediction method, and they wrote the paper. YC and ZZ developed computer programs. TZ and RK analyzed the results and revised the paper.

# FUNDING

The work was supported by the Natural Science Foundation of China (61702296, 61302139), the Heilongjiang Postdoctoral Scientific Research Staring Foundation (LBH-Q18104, LBH-Q16180), the Natural Science Foundation of Heilongjiang Province (FLHPY2019329), the Fundamental Research Foundation of Universities in Heilongjiang Province for Technology Innovation (KJCX201805), the Fundamental Research Foundation of Universities in Heilongjiang Province for Youth Innovation Team (RCYJTD201805), the Young Innovative Talent Research Foundation of Harbin Science and Technology Bureau (2016RQQXJ135), the Research Fund of the First Affiliated Hospital of Harbin Medical University (2019M24), and the Foundation of Graduate Innovative Research (YJSCX2018-140HLJU).

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00416/full#supplementary-material

Supplementary Table 1 | Potential candidate lncRNAs related to 402 diseases.

Supplementary Table 2 | The lncRNA similarities.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xuan, Cao, Zhang, Kong and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# rSeqTU—A Machine-Learning Based R Package for Prediction of Bacterial Transcription Units

*Sheng-Yong Niu1 , Binqiang Liu2 , Qin Ma3 \* and Wen-Chi Chou4 \**

*1 Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, United States, 2 School of Mathematics, Shandong University, Jinan, China, 3 Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States, 4 Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States*

A transcription unit (TU) is composed of one or multiple adjacent genes on the same strand that are co-transcribed, mostly in prokaryotes. Accurate identification of TUs is a crucial first step to delineate the transcriptional regulatory networks and elucidate the dynamic regulatory mechanisms encoded in various prokaryotic genomes. Many genomic features, for example, gene intergenic distance, and transcriptomic features, including continuous and stable RNA-seq read count signals, have been collected from a large amount of experimental data and integrated into classification techniques to computationally predict genome-wide TUs. Although some tools and web servers are able to predict TUs based on bacterial RNA-seq data and genome sequences, there is a need for an improved machine learning prediction approach and a more comprehensive pipeline handling QC, TU prediction, and TU visualization. To enable users to efficiently perform TU identification on their local computers or high-performance clusters and to provide a more accurate prediction, we develop an R package named rSeqTU. rSeqTU uses a random forest algorithm to select essential features describing TUs and then uses support vector machine (SVM) to build TU prediction models. rSeqTU (available at https://s18692001.github.io/rSeqTU/) has six computational functionalities including read quality control, read mapping, training set generation, random forest-based feature selection, TU prediction, and TU visualization.

Keywords: machine learning, bacteria, transcription unit, R package, transcriptome

*Edited by: Dariusz Mrozek, Silesian University of Technology, Poland Reviewed by: Erliang Zeng, The University of Iowa, United States Liang Yu, Xidian University, China*

*\*Correspondence:* 

*Qin Ma qin.ma@osumc.edu Wen-Chi Chou wcc957@gmail.com*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> *Received: 15 February 2019 Accepted: 09 April 2019 Published: 15 May 2019*

#### *Citation:*

*Niu S-Y, Liu B, Ma Q and Chou W-C (2019) rSeqTU—A Machine-Learning Based R Package for Prediction of Bacterial Transcription Units. Front. Genet. 10:374. doi: 10.3389/fgene.2019.00374*

## INTRODUCTION

Gene expression and regulation in bacteria use different machinery from that of eukaryotic organisms. An operon has been defined as a set of genes controlled by a single promoter that are first co-transcribed into one mRNA molecule, which is then translated into multiple proteins (Jacob et al., 1960). Operationally, an operon uses a single promoter to regulate the set of genes. Functionally, the set of genes in the operon encodes proteins with related biological functions. The *lac* operon in *Escherichia coli* is a typical operon that consists of a promoter, an operator, and three structural genes. The three genes, *lacZ*, *lacY*, and *lacA*, are co-transcribed into one mRNA transcript and are subsequently translated into three proteins, β-galactosidase, β-galactoside permease, and galactoside acetyltransferase. The *lac* operon is responsible for the transport and metabolism of lactose in many enteric bacteria. The discovery of the *lac* operon earned Jacob and Monod a share of the Nobel Prize in Physiology or Medicine in 1965 (Jacob et al., 1960).

Recently, many studies have revealed that bacterial genes are not transcribed only in single operons but may be dynamically co-transcribed into mRNAs with different gene sets under different growth environments or conditions (Yan et al., 2018). Each co-transcribed gene set is called a transcription unit (TU). The concept of a TU is analogous to alternatively spliced protein isoforms in eukaryotic systems, which use different exons to produce protein isoforms. However, whereas alternative splicing can use nonadjacent exons, a TU consists of a set of adjacent genes.

Several operon databases, such as RegulonDB (Santos-Zavaleta et al., 2019), MicrobesOnline (Dehal et al., 2010), and ProOpDB (Taboada et al., 2012), provide various levels of operon information describing genes expressed only in a single TU or operon, while DOOR2 (Mao et al., 2014) and OperomeDB (Chetal and Janga, 2015) provide more comprehensive TUs describing genes that are co-transcribed into different gene sets. Some TU or operon databases provide experimentally verified results, while most of them rely on TU or operon predictions. Studies including DOOR2 (Mao et al., 2014), SeqTU (Chou et al., 2015), and Rockhopper (McClure et al., 2013) use genomic information and gene expression profiles to predict operons or TUs with machine learning and other approaches. Taboada et al. (2018) recently developed a new operon prediction method based on an artificial neural network (ANN).

Beyond *in silico* prediction, Yan et al. recently used SMRT-Cappable-seq and PacBio sequencing to re-examine the transcription units of *E. coli* grown under different conditions and provide a higher-resolution map of dynamic TUs (Yan et al., 2018). Their work revealed that TUs better describe real bacterial transcription profiles and that a gene can be contained in many different co-transcribed gene sets (TUs) under the same or different growth conditions. In our previous works (Chou et al., 2015; Chen et al., 2017), we assumed that a gene can be co-transcribed into only one adjacent gene set, that is, one TU. We also assumed that co-transcribed gene pairs follow a transitive relation, and thus we connected co-transcribed gene pairs into larger gene sets to form TUs.

In this study, we focused on improving our machine learning model for the prediction of co-transcribed gene pairs and on providing a user-friendly R package, rSeqTU, with a comprehensive pipeline including RNA-seq read analysis, TU prediction, and TU visualization.

#### RESULTS

In the rSeqTU R package, we updated the TU prediction model with random forest-based feature selection and a support vector machine (SVM). In addition, rSeqTU has a complete workflow that performs RNA-seq read quality control (QC), RNA-seq read mapping, generation of TU results in two formats, and generation of IGV files for visualization (**Figure 1**).

rSeqTU requires three inputs: RNA-seq data in FASTQ format, a reference genome sequence in FASTA format, and gene annotations in GFF format. With these inputs, rSeqTU first performs RNA-seq data QC and RNA-seq read mapping to generate QC reports and mapping results in BAM format.

Then, rSeqTU uses whole-genome per-base read coverage and gene annotations to generate constructed TUs as the training data set. The constructed TUs are generated based on the SeqTU algorithm that was first presented by Chou et al. (2015). Briefly, the constructed TUs come from real single genes that are split into two adjacent sub-genes with an intergenic region between them, which enables us to capture the continuity and stability features of RNA-seq signals of the real TUs. rSeqTU then applies random forest to select informative features using the constructed TUs and applies SVM to build a TU prediction model with the selected features.

rSeqTU reports the prediction accuracy and uses the TU prediction model to identify all co-transcribed gene pairs in the given genome. rSeqTU outputs TU prediction results in single gene pairs and concatenated gene pairs. Last, rSeqTU converts TU results into IGV-compatible files for TU visualizations. In short, rSeqTU produces RNA-seq read QC reports, RNA-seq mapping statistics and results, TU prediction results, and files for IGV visualization.
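The step from single gene pairs to concatenated gene pairs can be illustrated with a minimal Python sketch. It is not the rSeqTU implementation (which is an R package); it only assumes, as stated in the Introduction, that co-transcription of adjacent gene pairs is treated as transitive. The function name `concatenate_gene_pairs` and the gene identifiers are hypothetical.

```python
def concatenate_gene_pairs(genes, cotranscribed_pairs):
    """Merge adjacent co-transcribed gene pairs into TUs.

    genes: gene identifiers in genomic order on one strand (toy assumption).
    cotranscribed_pairs: set of (upstream_gene, downstream_gene) tuples that
        a classifier predicted as co-transcribed.
    Returns a list of TUs, each a list of adjacent genes.
    """
    tus = []
    current = []
    for gene in genes:
        # Extend the current TU only if this gene is paired with the last one.
        if current and (current[-1], gene) in cotranscribed_pairs:
            current.append(gene)
        else:
            if current:
                tus.append(current)
            current = [gene]
    if current:
        tus.append(current)
    return tus


# Hypothetical example: five adjacent genes and three predicted pairs.
genes = ["g1", "g2", "g3", "g4", "g5"]
pairs = {("g1", "g2"), ("g2", "g3"), ("g4", "g5")}
tus = concatenate_gene_pairs(genes, pairs)
print(tus)  # [['g1', 'g2', 'g3'], ['g4', 'g5']]
```

A gene that is not part of any predicted pair ends up as a single-gene TU, which matches how single-gene TUs are counted in the results.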

To evaluate the rSeqTU R package, we used two sets of bacterial RNA-seq data of *Bacteroides fragilis* (*B. fragilis*) produced and published by Donaldson et al. (2018). These *B. fragilis* RNA-seq data were used to discover that the human gut microbiome can use immunoglobulin A (IgA) to trigger robust host-microbial symbiosis for mucosal colonization. The study focused on investigating commensal colonization factors (CCFs), an operon that was previously found to be essential for *B. fragilis* colonization of colonic crypts (Lee et al., 2013). The CCF operon has five genes, *ccf*A-E, which are homologous to polysaccharide utilization systems; *ccf*A is activated by extracellular glycan sensing and is hypothesized to activate genes involved in mucosal colonization (Martens et al., 2009). To understand the function of the *ccfA* gene, Donaldson et al. compared gene expression profiles between *ccfA*-overexpressing *B. fragilis* and wild-type *B. fragilis* during laboratory culture growth. The RNA-seq data helped identify 24 out of 25 non-CCF genes that were differentially expressed and mapped to the biosynthesis loci for capsular polysaccharides A and C (PSA and PSC).

With the two RNA-seq data sets, the reference genome sequence, and the gene annotations, we performed a full run of rSeqTU analysis. The RNA-seq data QC and RNA-seq mapping results are shown in **Figure 2**.

In **Figure 2**, we generated QC reports for both the *ccfA* overexpression and wild-type data. The read quality score plot shows that quality is generally good, with scores over 30 (**Figure 2A**). rSeqTU also generated the nucleotide frequency plot (**Figure 2B**), sequence duplication plot (**Figure 2C**), percentage of aligned bases plot (**Figure 2D**), and percentage of unique and mapped reads (**Figure 2E**). We could observe that the sequence duplication is not severe. The nucleotide frequency, aligned bases, and mismatched bases information are in the normal range. The percentages of mapped reads and unique reads are lower than 30%, as expected, because most of the RNAs in the samples belong to the mouse host rather than to the bacteria.

major results. In the input data layer, rSeqTU needs RNA-seq data, reference genome sequence, and gene annotations. In the core process layer, rSeqTU performs QC, builds prediction models, and predicts TUs. The results layer includes the QC and mapping results, TU prediction tables, and files for visualization in IGV.

FIGURE 2 | rSeqTU generates RNA-seq read QC reports and RNA-seq mapping statistics. SRR6899499 is *ccfA* overexpression data, and SRR6900706 is wild-type data. The panels (A–E) present quality scores, nucleotide frequency, sequence duplication, percentage of aligned bases plot, and percentage of unique and mapped reads.

The two RNA-seq read mapping results were used to generate training data for TU prediction models, respectively. For *ccfA* overexpression data set, rSeqTU reported the sensitivity, specificity, and accuracy at 0.857, 0.999, and 0.963.

The visualization also includes read coverage, gene annotations, and mapping results. SRR6899499 is *ccfA* overexpression data, and SRR6900706 is wild-type data.

For wild-type data set, rSeqTU reported the sensitivity, specificity, and accuracy at 0.885, 0.996, and 0.964. In general, we could find that rSeqTU generated high accuracy models after proper feature selections and cross-validation.
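For readers less familiar with these three metrics, the following Python sketch shows how sensitivity, specificity, and accuracy follow from the counts of a binary confusion matrix. The counts used here are made-up placeholders, not values from the rSeqTU evaluation, and the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: co-transcribed pairs predicted correctly / missed.
    tn/fp: non-co-transcribed pairs predicted correctly / wrongly called positive.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only.
sens, spec, acc = classification_metrics(tp=90, fn=10, tn=190, fp=10)
print(round(sens, 3), round(spec, 3), round(acc, 3))  # 0.9 0.95 0.933
```

A specificity close to 1 combined with a lower sensitivity, as reported above, means that the model rarely calls a non-co-transcribed pair positive but misses some true co-transcribed pairs.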

The two TU prediction models were used to predict co-transcribed gene pairs. There are 1,759 and 1,626 co-transcribed gene pairs predicted in the *ccfA* overexpression and wild-type RNA-seq data sets, respectively. After concatenating co-transcribed gene pairs, rSeqTU identified 2,727 TUs, including 2,079 single-gene TUs, 271 two-gene TUs, and 377 TUs with more than two genes, in the *ccfA* overexpression RNA-seq data set. In the wild-type RNA-seq data set, rSeqTU identified 2,860 TUs, including 2,249 single-gene TUs, 256 two-gene TUs, and 355 TUs with more than two genes. rSeqTU then uses the TU results to generate bedgraph files for visualization in IGV (**Figure 3**). **Figure 3** shows a region of the *B. fragilis* genome containing eight genes. rSeqTU identified four TUs in the *ccfA* overexpression data (SRR6899499) and four TUs in the wild-type data (SRR6900706). However, the structure of the TUs differs considerably between the two RNA-seq data sets. The two genes with locus tags BF9343\_RS17275 and BF9343\_RS17280 were identified as a co-transcribed gene pair in the *ccfA* overexpression data but not in the wild-type data. The four genes with locus tags BF9343\_RS17295, BF9343\_RS17305, BF9343\_RS17310, and BF9343\_RS17315 were predicted as a single TU in the wild-type data but as two TUs in the *ccfA* overexpression data.

To check that rSeqTU also performs well on RNA-seq data sets of different species, we ran TU predictions on two RNA-seq data sets of uropathogenic *Escherichia coli* strain CFT073. The two data sets were used to investigate how *Escherichia coli* strain CFT073 senses and detoxifies nitric oxide (NO), which is a defense mechanism generated by host immune cells (Mehta et al., 2015). For the RNA-seq data set without NO treatment, rSeqTU reported the sensitivity, specificity, and accuracy at 0.879, 0.997, and 0.952. For the RNA-seq data set with NO treatment, rSeqTU reported the sensitivity, specificity, and accuracy at 0.824, 0.996, and 0.945.

# MATERIALS AND METHODS

#### New Functions Integrated or Invented by rSeqTU

rSeqTU uses QuasR R package to perform RNA-seq data QC and RNA-seq read mapping. The read mapping results are then processed by an algorithm named SeqTU first presented by Chou et al. (2015). In brief, the SeqTU algorithm splits relatively long single genes into three parts including two sub-gene regions and an intergenic region, and then SeqTU uses RNA-seq per-base read coverage over the three parts to generate TU features to describe the continuity and stability of RNA-seq read coverage. SeqTU assumes the RNA-seq read coverage within a TU is continuous and stable like it is within a gene.
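The idea of building training examples by splitting a long gene into three parts can be sketched in Python as follows. This is only an illustration of the description above, not the SeqTU code: the source does not state how the split points are chosen, so the `gap_length` parameter, the function names, and the toy coverage vector are all assumptions.

```python
def split_gene(start, end, gap_length=20):
    """Split the half-open interval [start, end) into three consecutive parts.

    Returns (sub_gene_1, pseudo_intergenic, sub_gene_2) as (start, end) tuples.
    gap_length is a hypothetical parameter for the length of the middle part.
    """
    length = end - start
    if not 0 < gap_length < length:
        raise ValueError("gap_length must be positive and shorter than the gene")
    left_end = start + (length - gap_length) // 2
    right_start = left_end + gap_length
    return (start, left_end), (left_end, right_start), (right_start, end)


def mean_coverage(coverage, region):
    """Mean per-base read depth over a half-open region of a coverage list."""
    region_start, region_end = region
    values = coverage[region_start:region_end]
    return sum(values) / len(values) if values else 0.0


# Toy per-base coverage for a 100-bp "gene" with uniform depth 10.
coverage = [10] * 100
parts = split_gene(0, 100)
part_means = [mean_coverage(coverage, part) for part in parts]
print(parts)       # ((0, 40), (40, 60), (60, 100))
print(part_means)  # [10.0, 10.0, 10.0]
```

Because all three parts lie inside one real gene, their coverage is expected to be continuous and stable, which is the property the constructed TUs are meant to capture for the positive class.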

rSeqTU selects essential TU features by random forest and builds the TU prediction model by SVM using the R packages Caret and e1071. rSeqTU converts TU prediction results into IGV-compatible files in bedgraph format for TU visualization.
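As a rough illustration of the last step, the Python sketch below writes predicted TU intervals as bedGraph lines. It is not the rSeqTU export routine. It assumes that TU coordinates are 1-based and inclusive (as in GFF) and converts them to the 0-based, half-open intervals used by bedGraph; using the TU index as the value column is an arbitrary choice made here so that neighboring TUs are distinguishable in IGV.

```python
def tus_to_bedgraph(tus, chrom, track_name="TU"):
    """Format TU intervals as bedGraph text.

    tus: list of (start, end) tuples in 1-based inclusive coordinates,
        sorted by start (assumption).
    chrom: sequence name as used in the reference FASTA.
    """
    lines = [f'track type=bedGraph name="{track_name}"']
    for index, (start, end) in enumerate(tus, start=1):
        # bedGraph uses 0-based starts and exclusive ends.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{index}")
    return "\n".join(lines)


# Hypothetical TU coordinates and sequence name.
bedgraph_text = tus_to_bedgraph([(101, 500), (650, 900)], chrom="chr1")
print(bedgraph_text)
```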

#### Feature Selection by Random Forest

Random forest is a supervised learning algorithm that uses ensemble learning based on decision trees. Random forest has been successfully used on biological data types such as genomics, transcriptomics, epigenomics, proteomics, and metabolomics (Degenhardt et al., 2019). rSeqTU uses recursive feature elimination with random forest and selects the top eight features. The top eight features may vary across RNA-seq data sets, but the top few features are consistently the fold change of adjacent gene expression and the proportion of gap positions in the whole given gene pair region.
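To make the two consistently selected features concrete, here is a minimal Python sketch that computes them from per-base coverage. The exact definitions used by rSeqTU are not given in the text, so the formulas below (ratio of mean coverages with a pseudocount, and the fraction of positions below a depth threshold) as well as the parameter names `pseudocount` and `min_depth` are assumptions for illustration.

```python
def expression_fold_change(cov_gene_a, cov_gene_b, pseudocount=1.0):
    """Fold change (>= 1) between the mean coverages of two adjacent genes."""
    mean_a = sum(cov_gene_a) / len(cov_gene_a)
    mean_b = sum(cov_gene_b) / len(cov_gene_b)
    high, low = max(mean_a, mean_b), min(mean_a, mean_b)
    # The pseudocount avoids division by zero for unexpressed genes.
    return (high + pseudocount) / (low + pseudocount)


def gap_proportion(cov_region, min_depth=1):
    """Fraction of positions in the gene pair region with depth below min_depth."""
    gaps = sum(1 for depth in cov_region if depth < min_depth)
    return gaps / len(cov_region)


# Toy coverage vectors (hypothetical values).
gene_a = [10, 10, 10, 10]
gene_b = [20, 20]
region = [10, 10, 0, 0, 20, 20, 0, 5]
fold = expression_fold_change(gene_a, gene_b)
gaps = gap_proportion(region)
print(round(fold, 3), gaps)  # 1.909 0.375
```

Under these definitions, a co-transcribed pair would be expected to show a fold change near 1 and a gap proportion near 0, since one mRNA covers both genes and the region between them.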

#### *Bacteroides fragilis* RNA-seq Data

We used two RNA-seq data sets, one from each triplicate experiment, from NCBI's SRA database under project accession number PRJNA445716. The accession numbers of the two data sets are SRR6899499 (*ccfA* overexpression) and SRR6900706 (wild-type). The reference genome sequence and gene annotations of *Bacteroides fragilis* NCTC 9343 are GCF\_000025985.1\_ASM2598v1\_genomic.fna and GCF\_000025985.1\_ASM2598v1\_genomic.gff.

#### *Escherichia coli* RNA-seq Data

We used two RNA-seq data sets, one from each triplicate experiment, from NCBI's SRA database under project accession number PRJNA286883. The accession numbers of the two data sets are SRR2061823 (without NO treatment) and SRR2061826 (with NO treatment). The reference genome sequence and gene annotations of *Escherichia coli* strain CFT073 are GCF\_000007445.1\_ASM744v1\_genomic.fna and GCF\_000007445.1\_ASM744v1\_genomic.gff.

#### DISCUSSION

rSeqTU is a machine learning-based R package for TU prediction, empowered by a random forest algorithm for feature selection and by multiple graphical visualizations and interactive tables for customized downstream analysis. Its high prediction performance has been demonstrated by testing multiple RNA-seq data sets from *B. fragilis* and *E. coli*. The source code and tutorial of rSeqTU are available at https://s18692001.github.io/rSeqTU/.

rSeqTU will be useful for understanding transcriptional profiles of bacterial genomes at the gene level and the TU level. In addition to a single bacterium, rSeqTU may also be applied to metatranscriptomic data, the RNA-seq data of a microbiome. The TUs of multiple bacteria may provide a systemic view of how a microbiome regulates functional translation and can be integrated with other metagenomic and metabolomic data (Niu et al., 2018).

A TU is dynamically composed of different adjacent genes under various conditions, and different TUs may overlap with each other under the same or different conditions. The dynamic TUs sharing the same gene(s) are called alternative transcription units (ATUs), and the identification of ATUs is recognized as a more challenging computational problem due to their condition-dependent nature. Meanwhile, third-generation sequencing technology will shortly generate substantial genome-scale ATU data sets in the public domain for various prokaryotic organisms. Hence, advanced computational models are urgently needed for ATU prediction based on RNA-seq data.

Intuitively, the output of rSeqTU can lay a solid foundation for ATU prediction, as (1) a TU identified by our method can represent a maximal ATU cluster with an apparent promoter and terminator and (2) the TU can be used as an independent genomic region for further ATU prediction based on other genomic and transcriptomic features. If available, the ATUs, along with related cis-regulatory motif analysis, will reveal the dynamic regulatory networks in a bacterial genome at a higher resolution.

#### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here: https://www.ncbi.nlm.nih.gov/sra/?term=SRR6899499.

#### AUTHOR CONTRIBUTIONS

S-YN, QM, and W-CC designed the study. S-YN implemented the R package with W-CC's help. S-YN, BL, QM, and W-CC wrote the manuscript.

#### FUNDING

This work was supported by Dr. Qin Ma's startup funding in the Department of Biomedical Informatics at the Ohio State University. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation #ACI-1548562. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation.

#### ACKNOWLEDGMENTS

We thank Anjun Ma for help with proofreading the references.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor declared a past co-authorship with one of the authors QM.

*Copyright © 2019 Niu, Liu, Ma and Chou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# An Effective Method to Measure Disease Similarity Using Gene and Phenotype Associations

Shuhui Su<sup>1</sup> , Lei Zhang<sup>2</sup> and Jian Liu<sup>1</sup> \*

*<sup>1</sup> School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, <sup>2</sup> School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou, China*

Motivation: In order to create controlled vocabularies for shared use in different biomedical domains, a large number of biomedical ontologies, such as Disease Ontology (DO) and Human Phenotype Ontology (HPO), have been created in the bioinformatics community. Quantitative measures of the associations among diseases could help researchers gain deep insight into human diseases, since similar diseases are usually caused by similar molecular origins or have similar phenotypes; such measures help reveal the common attributes of diseases and improve the corresponding diagnoses and treatment plans. Several approaches have been proposed during the past few years to measure disease similarity using a particular biomedical ontology, but for a newly discovered disease or a disease with little related genetic information in Disease Ontology (i.e., a disease with few disease-gene associations), these previous approaches usually ignore the joint computation of disease similarity by integrating gene and phenotype associations.

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Jun Wan, Indiana University, United States Wuritu Yang, Inner Mongolia University, China*

> \*Correspondence: *Jian Liu jianliu@hit.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *15 March 2019* Accepted: *30 April 2019* Published: *21 May 2019*

#### Citation:

*Su S, Zhang L and Liu J (2019) An Effective Method to Measure Disease Similarity Using Gene and Phenotype Associations. Front. Genet. 10:466. doi: 10.3389/fgene.2019.00466*

Results: In this paper we propose a novel method called GPSim to effectively deduce the semantic similarity of diseases. In particular, GPSim calculates the similarity by jointly utilizing gene, disease and phenotype associations extracted from multiple biomedical ontologies and databases. We also explore the phenotypic factors such as the depth of HPO terms and the number of phenotypic associations that affect the evaluation performance. A final experimental evaluation is carried out to evaluate the performance of GPSim and shows its advantages over previous approaches.

Keywords: disease similarity, phenotype association, genomic annotation, disease ontology, biomedical ontology

# INTRODUCTION

The emergence of massive biomedical data offers a great opportunity for life science research and modern disease diagnosis. The wealth of knowledge contained in biomedical big data also brings great challenges, since many biologists have long constructed their biomedical database applications using their own terms to represent biomedical knowledge. In order to create controlled vocabularies for the shared use of knowledge, a large number of biomedical ontologies, such as Disease Ontology [DO (Schriml et al., 2012; Kibbe et al., 2014)] and Human Phenotype Ontology [HPO (Köhler et al., 2014)], have been created in the bioinformatics community. Biomedical ontologies (Lee et al., 2008; Köhler et al., 2009; Meehan et al., 2013; Groza et al., 2015; Patel et al., 2015; Denny et al., 2018; Lovering et al., 2018) reduce the complexity of life science concepts and advance the understanding of human diseases with controlled terminology. Currently, these ontologies are used in a variety of biomedical applications. For example, HPO-based analysis tools have been used to assist in clinical diagnosis (Westbury et al., 2015) and exon sequencing research (Peng et al., 2018). In addition, by using DO, researchers have built a chain knowledge base of etiology (Harrow et al., 2017; Kozaki et al., 2017) and annotated human genes to improve the coverage of disease gene annotations (Osborne et al., 2009).

Exploring the associations (Landrum et al., 2014) among diseases by using biomedical ontologies has attracted significant attention in biomedical domains (Zhao and Halang, 2006; Zhang et al., 2008; Zeng et al., 2017). Quantitative measures of these associations could help researchers gain deep insight into human diseases, since similar diseases are usually caused by similar molecular origins or have similar phenotypes. Deducing the semantic similarity of diseases helps reveal their common attributes (e.g., the classification of diseases, disease-related genes, disease-related symptoms, etc.), which could facilitate the understanding of underlying causes and improve disease diagnoses and treatment plans. For example, the gene "SH2D3C" is one of the common genes of "Amnesia" and "Alzheimer's disease," which suggests that they may involve the same biological processes. A greater similarity means that two concepts are more closely related and share more common information (Liu and Yan, 2016; Liu and Zhang, 2017; Liu et al., 2017). A good quantitative method for computing the similarity among diseases could directly help researchers retrieve closely related diseases from massive biomedical data and carry out the corresponding experiments for further analysis, which could significantly reduce experimental cost and improve the efficiency of discovering potential pathogenic mechanisms and drugs.

DO regulates the controlled vocabularies about diseases and integrates disease terms and medical data through external links. It provides accurate, non-duplicative terms with high disease coverage and has been used during the last decade to compute the degree of correlation among diseases (i.e., the disease similarity) (Osborne et al., 2009). DO is usually selected as the source of disease terms for disease similarity calculation. Several previous approaches, including those based on information content (IC) (Resnik, 1995; Lin, 1998; Schlicker et al., 2006; Wang et al., 2007; Bandyopadhyay and Mallick, 2014), ontology Directed Acyclic Graph (DAG) structure (Kim et al., 1993; Zhang et al., 2010; Santos et al., 2012), and biological functional processes (Mathur and Dinakarpandian, 2012; Cheng et al., 2014; Jeong and Chen, 2015; Zou et al., 2016; Yang et al., 2017; Ni et al., 2018), have been proposed to measure disease similarity by using DO. Among the IC-based approaches, Resnik (1995) uses the IC of the most informative common ancestor (MICA) to measure the similarity of two diseases. To improve on Resnik's method, Lin (1998) proposes the ratio of the IC of the MICA to that of the two DO terms, and Schlicker et al. (2006) then improve Lin's approach through the Bernoulli probability distribution to reduce the impact of shallow annotations (Li et al., 2010). However, IC-based approaches only focus on the semantic information of two terms in different layers of the ontology DAG. They ignore the information from the ontology DAG structure, and it is difficult for them to reveal the semantic differences between two terms under the same MICA. DAG-based approaches are susceptible to shallow annotations, since shallow concepts are too generalized to carry much information (Li et al., 2010). Among the DAG-based approaches, Kim et al. (1993) use the reciprocal of the shortest distance between two diseases in the DAG to measure their similarity. Zhang et al. (2010) take into account not only the shortest distance but also the depth of the least common ancestor. Among the methods based on biological functional processes, BOG (Mathur and Dinakarpandian, 2012) calculates the disease similarity from the overlap of related gene sets. PSB (Cheng et al., 2014) additionally takes the gene similarity into account to improve on BOG's performance. By adding associations obtained from external databases, BOG and PSB perform better than previous IC-based and DAG-based approaches. Nevertheless, they ignore the joint computation of disease similarities by integrating gene and phenotype associations, and they perform poorly when evaluating disease similarities for diseases with little genetic information, such as viral infectious disease (Common Wart, DOID:11165) and vein disease (Esophageal Varix, DOID:112) in DO.
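The IC-based measures of Resnik and Lin mentioned above can be illustrated on a toy ontology with a short Python sketch. The four terms, their parent links, and the annotation probabilities are invented for illustration and do not come from DO. IC is computed as the negative logarithm of a term's annotation probability; the base-2 logarithm is an arbitrary choice here.

```python
import math

# Toy ontology (hypothetical): each term maps to its set of parents.
PARENTS = {
    "disease": set(),
    "cancer": {"disease"},
    "lung cancer": {"cancer"},
    "colon cancer": {"cancer"},
}
# Hypothetical annotation probabilities p(t); the root has p = 1.
PROB = {"disease": 1.0, "cancer": 0.5, "lung cancer": 0.125, "colon cancer": 0.25}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: rarer terms are more informative."""
    return -math.log2(PROB[term])


def resnik(term_1, term_2):
    """IC of the most informative common ancestor (MICA)."""
    common = ancestors(term_1) & ancestors(term_2)
    return max(ic(term) for term in common)


def lin(term_1, term_2):
    """Twice the IC of the MICA, normalized by the ICs of both terms."""
    denominator = ic(term_1) + ic(term_2)
    return 2 * resnik(term_1, term_2) / denominator if denominator else 0.0


print(resnik("lung cancer", "colon cancer"))  # 1.0
print(lin("lung cancer", "colon cancer"))     # 0.4
```

Both measures depend on the two terms only through their ICs and the MICA, which is why, as noted above, they cannot separate two term pairs that share the same MICA and the same ICs but sit in different positions in the DAG.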

To effectively evaluate the similarities of newly discovered diseases or diseases with little genetic information in current medical research (i.e., diseases with few disease-gene associations), we propose a novel semantic similarity measure called GPSim in this paper. GPSim takes genes, diseases, and phenotypes into account, and calculates the similarities by jointly utilizing their associations extracted from multiple biomedical ontologies and databases. In addition, we explore the phenotypic factors influencing the performance of GPSim. The experimental results show that, in comparison with previous similarity evaluation methods, our proposed approach has the best performance in terms of ROC (receiver operating characteristic curve) and AUC (area under curve).

# METHODS

In this section, we introduce the details of our proposed method GPSim. GPSim relies on disease-gene and disease-phenotype associations. We first integrate the association data extracted from HPO, DO, and other biomedical databases, and then compute the corresponding disease similarity.

# Disease-Genetic and Disease-Phenotypic Relationship Integrations

Disease-phenotype and disease-gene-phenotype mapping relations mainly come from the HPO mapping file (downloaded from http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/lastStableBuild/artifact/annotation/ALL\_SOURCES\_ALL\_FREQUENCIES\_diseases\_to\_genes\_to\_phenotypes.txt).

The disease information is extracted from the DO, which has a total of 11,191 disease terms; 2,140 of them have disease-gene and disease-phenotype mapping relations, and 808 have disease-phenotype mapping relations. Additionally, we integrate the proven disease-gene relationships in the SIDD (Cheng et al., 2014) and Dancer databases (downloaded from http://wodaklab.org/dancer/downloads). By exactly matching the names and synonyms of DO terms, we identify the DO terms and obtain the corresponding disease-gene mappings from Dancer. In this scenario, the number of terms having disease-gene and disease-phenotype mapping relations is 2,505.

As shown in **Figure 1**, we extract the disease terms, including each disease's ID, label, definition, synonyms, related databases, parents, and children, from DO (Peng et al., 2013). We then obtain the disease-gene associations from Dancer, SIDD, and HPO, and the relationships between diseases and phenotypes from HPO. The format of the data from Dancer is "Disease's name: GeneID." We identify the disease term in Dancer by exactly matching the disease's name, obtaining associations of the form "DOID: GeneID." The format of the data obtained from SIDD and HPO is "OMIMID/ORPHA ID/DOID: GeneID", and we obtain the associations by matching their ID information. Similarly, we transform the disease-phenotype associations into the format "DOID: HPOID" through the IDs of OMIM, Orphanet, and DO. Finally, the disease-gene and disease-phenotype association data are loaded and integrated into a database (denoted as DGP).
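The name-matching step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the in-memory dictionaries, the case-insensitive comparison, and the gene identifiers are all assumptions made for illustration.

```python
def map_gene_associations(do_terms, name_to_genes):
    """Turn 'disease name -> gene IDs' records into 'DOID -> gene IDs'.

    do_terms: dict mapping a DOID to its label and synonyms.
    name_to_genes: dict mapping a disease name to a set of gene IDs.
    Matching is exact on the whole name (case-insensitive here, an assumption).
    """
    # Index every DO label and synonym by its DOID.
    index = {}
    for doid, names in do_terms.items():
        for name in names:
            index[name.strip().lower()] = doid

    doid_to_genes = {}
    for name, genes in name_to_genes.items():
        doid = index.get(name.strip().lower())
        if doid is not None:  # names without an exact match are skipped
            doid_to_genes.setdefault(doid, set()).update(genes)
    return doid_to_genes


# Toy input; the gene IDs are hypothetical placeholders.
do_terms = {
    "DOID:10652": ["Alzheimer's disease", "Alzheimer disease"],
    "DOID:5419": ["schizophrenia"],
}
records = {
    "Alzheimer disease": {"g1", "g2"},
    "Schizophrenia": {"g3"},
    "unknown disorder": {"g4"},
}
mapped = map_gene_associations(do_terms, records)
```

Records keyed by OMIM or Orphanet IDs would need an analogous ID-to-DOID index instead of a name index.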

# Computing the Similarity

The similarity evaluation of any two DO terms relies on disease-gene and disease-phenotype associations. We first compute the similarity of the disease-related gene sets and the similarity of the disease-related phenotype sets, and then integrate them as follows:

$$\begin{aligned} \textit{simGPS}(d1, d2) &= \beta \times \textit{simGeneSet}(\textit{G1, G2}) \\ &+ \ (1 - \beta) \times \textit{simHPOSet}(\textit{P1, P2}) \quad (1) \end{aligned}$$

Here, simGPS represents the disease similarity computed by GPSim. For two DO terms d1 and d2, G1 and G2 represent the disease-related gene sets of d1 and d2, respectively, and P1 and P2 represent the disease-related phenotype sets of d1 and d2, respectively. simGeneSet represents the similarity between G1 and G2, and simHPOSet represents the similarity between P1 and P2. β is the weight that tunes the contributions of genes and phenotypes to the disease similarity; its value depends on the quality of the disease-gene and disease-phenotype associations (e.g., the number of associations and the depth of terms in HPO) of the diseases in the tested dataset.

Computing the similarity of two gene sets relies on a gene-gene similarity network, which we extract from HumanNet (Lee et al., 2011). HumanNet is a probabilistic functional gene network. Each interaction in HumanNet carries a log-likelihood score (LLS) that measures the probability of a true functional linkage between two genes. The functional similarity of two genes is computed by normalizing the HumanNet scores (denoted as LLSN) as follows (Cheng et al., 2014):

$$\text{LLSN}(g1, g2) = \frac{\text{LLS}(g1, g2) - \text{LLS}_{\min}}{\text{LLS}_{\max} - \text{LLS}_{\min}} \tag{2}$$

$$\text{sim}_{\text{gene}}(g1, g2) = \begin{cases} 1, & g1 = g2 \\ \text{LLSN}(g1, g2), & g1 \neq g2 \text{ and } e(g1, g2) \in \text{HumanNet} \\ 0, & g1 \neq g2 \text{ and } e(g1, g2) \notin \text{HumanNet} \end{cases} \tag{3}$$

Here, LLS<sub>min</sub> and LLS<sub>max</sub> represent the minimum and maximum LLS values in HumanNet, respectively, and sim<sub>gene</sub> represents the similarity of two genes g1 and g2. If there is no linkage between two genes in HumanNet, their similarity is 0. The similarity of two gene sets is then defined as follows (Cheng et al., 2014):

$$simGeneSet(G1, G2) = \frac{\sum_{g_i \in G1} \text{sim}_{\max}(g_i, G2) + \sum_{g_i' \in G2} \text{sim}_{\max}(g_i', G1)}{n + m} \tag{4}$$

$$\text{sim}_{\max}(k, G) = \max_{k_i \in G} \text{sim}_{\text{gene}}(k, k_i) \tag{5}$$

Here gi represents a gene in the gene set G1 and gi' represents a gene in the gene set G2. k represents a gene, G represents a gene set, and ki represents any gene in G. We define the similarity between a gene k and a gene set G as the maximum of the similarities between k and each gene ki in G. As shown in **Figure 2** and formula (4), we compute the similarity between every gene gi (i = 1, 2, ..., m) in G1 and the gene set G2, and between every gene gi' (i = 1, 2, ..., n) in G2 and the gene set G1, and then take the average of all these values as the similarity of the two gene sets G1 and G2.
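As an illustration of formulas (2) to (5), the following sketch computes the gene-set similarity on a toy network. It is not the authors' code: the dictionary representation of HumanNet, the function names, and the LLS values are assumptions, and both gene sets are assumed to be non-empty.

```python
def llsn(lls, lls_min, lls_max):
    """Min-max normalization of a log-likelihood score, formula (2)."""
    return (lls - lls_min) / (lls_max - lls_min)


def sim_gene(g1, g2, network, lls_min, lls_max):
    """Gene-gene similarity, formula (3). `network` maps frozenset({a, b}) to LLS."""
    if g1 == g2:
        return 1.0
    lls = network.get(frozenset((g1, g2)))
    if lls is None:  # no edge between the two genes
        return 0.0
    return llsn(lls, lls_min, lls_max)


def sim_max(k, gene_set, network, lls_min, lls_max):
    """Best match of gene k against a gene set, formula (5)."""
    return max(sim_gene(k, ki, network, lls_min, lls_max) for ki in gene_set)


def sim_gene_set(G1, G2, network):
    """Average of the best matches in both directions, formula (4)."""
    lls_min = min(network.values())
    lls_max = max(network.values())
    total = sum(sim_max(g, G2, network, lls_min, lls_max) for g in G1)
    total += sum(sim_max(g, G1, network, lls_min, lls_max) for g in G2)
    return total / (len(G1) + len(G2))


# Toy network with made-up scores (not real HumanNet data).
network = {
    frozenset(("a", "b")): 1.0,
    frozenset(("b", "c")): 3.0,
    frozenset(("a", "d")): 2.0,
}
example_value = sim_gene_set({"a", "b"}, {"b", "c"}, network)
print(example_value)  # 0.75
```

In this toy case gene `a` has no useful match in the second set, while `b` and `c` each find a match with similarity 1, so the average over the four genes is 0.75. Note that under formula (2) the edge with the lowest score in the network is normalized to 0.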

Computing the similarity of two disease-related phenotype sets relies on the associations between diseases and phenotypes. We measure the similarity of two phenotype sets by their overlap, defined as follows:

$$simHPOSet(P1, P2) = \frac{2 \times |P1 \cap P2|}{|P1| + |P2|} \tag{6}$$
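Formula (6) is the Dice coefficient of the two phenotype sets. A minimal sketch follows; the function name and the HPO identifiers are placeholders, and returning 0 when both sets are empty is an assumption that the text does not specify.

```python
def sim_hpo_set(P1, P2):
    """Dice overlap of two phenotype sets, formula (6)."""
    if not P1 and not P2:
        return 0.0  # assumed convention for two empty sets
    return 2 * len(P1 & P2) / (len(P1) + len(P2))


# Placeholder HPO identifiers, for illustration only.
P1 = {"HP:A", "HP:B", "HP:C"}
P2 = {"HP:C", "HP:D"}
print(sim_hpo_set(P1, P2))  # 0.4
```

One shared term out of five terms in total gives 2 × 1 / 5 = 0.4.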

The overall process of the disease similarity computation is shown in **Figure 3**. For instance, to calculate the similarity of two disease terms, "Alzheimer's disease" (DOID:10652) and "schizophrenia" (DOID:5419), we first obtain the disease-related gene sets and disease-related phenotype sets of the two diseases from the integrated DGP database described in Disease-Genetic and Disease-Phenotypic Relationship Integrations. By using formulas (4) and (5), the similarity of the two gene sets is calculated as 0.4784. The similarity of the two phenotype sets computed by formula (6) is 0.1111. Finally, we integrate the similarities of the gene sets and phenotype sets by using formula (1); in this scenario, the corresponding similarity value is 0.4417.
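The worked example can be checked against formula (1). The paragraph does not state the weight used; the three reported values are consistent with β = 0.9, which is an inference from the arithmetic rather than a statement in the text. The function name below is hypothetical.

```python
def sim_gps(gene_set_sim, hpo_set_sim, beta):
    """Weighted combination of gene-set and phenotype-set similarity, formula (1)."""
    return beta * gene_set_sim + (1 - beta) * hpo_set_sim


# Values reported for Alzheimer's disease vs. schizophrenia; beta = 0.9 is inferred.
value = sim_gps(0.4784, 0.1111, beta=0.9)
print(round(value, 4))  # 0.4417
```

Here 0.9 × 0.4784 + 0.1 × 0.1111 = 0.43056 + 0.01111 = 0.44167, which rounds to the reported 0.4417.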

Let N be the total number of diseases in DO, and K and L be the sizes of the disease-related gene and disease-related phenotype sets, respectively. There are N<sup>2</sup> pairs of diseases, so computing all the similarities requires O(N<sup>2</sup>) pairwise computations. For each disease pair, we need to compute the similarities based on both the disease-related gene sets and the disease-related phenotype sets. Calculating the similarity of two diseases based on their disease-related gene sets costs O(K<sup>2</sup>). The intersection of two disease-related phenotype sets takes O(L). As a result, it takes O(N<sup>2</sup> × (K<sup>2</sup> + L)) to compute the similarity of all disease pairs.

#### RESULTS

In our experiments, we explore the phenotypic factors, including the depth of HPO terms and the number of disease-phenotype associations, when each disease has few disease-gene associations, and we compare GPSim with previous disease similarity measurement methods, including Resnik (Resnik, 1995), Zhang (Zhang et al., 2010), BOG (Mathur and Dinakarpandian, 2012), and SemFunSim (Cheng et al., 2014).

All the experiments are performed on a 2.50 GHz Intel Core i7 CPU with 8.00 GB RAM running a 64-bit Windows 10 system. We implemented all the approaches in Java with JDK 1.8.0 and Python 3.0.

To provide a fair comparison with previous approaches, we select the disease pairs with disease-gene and disease-phenotype associations from the SIDD benchmark and carry out the experiments by using the test procedure of the previous approach (Cheng et al., 2014). In particular, we take the disease pairs in the benchmark set as the positive examples and randomly generate 500 disease pairs as the negative examples, combining the positive and negative examples into a tested set. To reduce the test error, we generate 100 tested sets to compare the performance of the different methods and average the 100 test results. For each tested set, we calculate the similarity of each disease pair by using Resnik's method, Zhang's method, BOG, SemFunSim, and GPSim, and compare their performance by using the receiver operating characteristic (ROC) curve. The ROC curve is drawn with the true positive rate (TPR) on the Y axis and the false positive rate (FPR) on the X axis over a series of different decision thresholds (boundary values). Generally, the closer the ROC curve is to the upper left corner, the more accurate the corresponding method is. To show the performance of the different methods more directly, the area under the curve (AUC) of the ROC is also given. The greater the AUC, the better the performance.
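For one tested set, the AUC can be computed directly from the similarity scores, since the area under the ROC curve equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair (ties counted as one half). The sketch below uses that pairwise definition; the function name and the scores are made up for illustration, and it is not the evaluation code used in the paper.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up similarity scores for benchmark (positive) and random (negative) pairs.
positives = [0.9, 0.8, 0.4]
negatives = [0.5, 0.3, 0.2, 0.1]
score = auc(positives, negatives)
print(score)  # 11 of the 12 pairs are ranked correctly
```

Averaging this value over the 100 tested sets, each with its own random negative pairs, gives the kind of mean AUC reported in the comparison.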

In the first set of experiments, for diseases having few genetic associations (e.g., <9) in DO, we first calculate the similarities by using GPSim with different values of β (see formula 1 in Computing the Similarity). From the results in **Figure 4A**, we see that a β value of 0.9 is optimal on the tested dataset, which also reveals that jointly using disease-gene and disease-phenotype associations can improve disease similarity measurement. In this scenario, we also investigate the impact of phenotypic factors such as the depth of HPO terms and the number of disease-phenotype associations. To test the impact of the depth of HPO terms on similarity evaluation, we vary the depth over 3, 5, 7, 9, and 11. As shown in **Figure 4B**, when only the disease-phenotype associations whose HPO terms have depth ≥5 are used, GPSim obtains the best performance (the AUC is 81.51%), which illustrates that shallow HPO disease-phenotype associations degrade the performance of disease similarity calculation. **Figure 4C** shows the experimental results using different numbers of disease-phenotype associations in the deep layers of the HPO (e.g., depth ≥5). From the figure, we see that the more disease-phenotype associations there are, the better GPSim performs.

In the second set of experiments, we first compare the performance of Resnik, Zhang, BOG, SemFunSim, and GPSim in terms of the ROC curve and the AUC for diseases with few disease-gene associations (e.g., <9). As shown in **Figure 5**, GPSim again presents the best performance. **Figure 6** shows the performance for diseases with multiple disease-gene associations; consistent results are obtained, and GPSim has the best performance. In particular, the AUCs of GPSim, SemFunSim, BOG, Zhang, and Resnik are 99.05, 97.69, 80.99, 67.80, and 59.05%, respectively. Note that, since the negative samples are randomly generated, the average AUC values of these methods may fluctuate by about 2%, and the fluctuations of the different methods are in the same direction. The performance differences arise because (i) the similarity evaluation of Resnik's and Zhang's methods is centered on Disease Ontology only, and additional information such as associations between genes and diseases is not taken into account; (ii) BOG and SemFunSim improve the similarity measurement by adding associations between genes and diseases to alleviate information insufficiency; and (iii) GPSim further integrates gene, disease, and phenotype associations extracted from multiple biomedical ontologies and databases, and jointly utilizes these associations to effectively deduce the semantic similarity. Therefore, GPSim is more suitable for similarity evaluation, as we expected.

#### CONCLUSION

The vast amount of biomedical data has brought huge benefits to disease diagnosis and life science research, but it has also brought challenges to understanding and searching biological information across different disease terms. Thus, a large number of biomedical ontologies with controlled vocabularies have been created for sharing biomedical knowledge. Currently, quantitative measures of the associations among diseases based on biomedical ontologies have become a research hotspot. In this paper, we focus on the joint computation of disease similarities by integrating gene and phenotype associations. In particular, we propose an effective method to measure the similarity of diseases in Disease Ontology with disease-related gene and phenotype associations extracted from HPO and other biomedical databases, which calculates the similarities by jointly utilizing these associations. The experiments show that our proposed method has the best performance in terms of ROC and AUC compared with previous methods. In the future, we plan to apply GPSim to disease annotation applications to provide researchers with a more powerful annotation tool based on biomedical ontologies. Additionally, we would like to involve more information, such as gene sequence and expression information, to improve our disease similarity model.


#### DATA AVAILABILITY

All datasets analyzed for this study are included in the manuscript and the supplementary files. The source code of GPSim is freely available at https://github.com/lyotvincent/GPSim.

#### AUTHOR CONTRIBUTIONS

JL conceived the project, conceptualized the method, designed the studies, and contributed to writing the manuscript. JL, SS, and LZ implemented the algorithms, performed the analysis and contributed to writing the manuscript. All authors read and approved the final version of the manuscript.

#### FUNDING

The work was partially supported by the National Key R&D Program of China (2017YFC1200200, 2017YFC1200205, 2018YFC1603800, and 2018YFC1603802), National Natural Science Foundation of China (61602130 and 61872115), China Postdoctoral Science Foundation funded project (2015M581449 and 2016T90294), Heilongjiang Postdoctoral Fund (LBH-Z14089), Natural Science Foundation of Heilongjiang Province of China (QC2015067), Fundamental Research Funds for the Central Universities (HIT.NSRIF.2017036), and Shanghai Municipal Science and Technology Major Project (Grant No. 2017SHZDZX01).

#### ACKNOWLEDGMENTS

The authors thank the referees for their valuable comments and suggestions.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Su, Zhang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A New Method of RNA Secondary Structure Prediction Based on Convolutional Neural Network and Dynamic Programming

Hao Zhang<sup>1</sup>, Chunhe Zhang<sup>1</sup>, Zhi Li<sup>2</sup>, Cong Li<sup>1</sup>, Xu Wei<sup>1</sup>, Borui Zhang<sup>3</sup> and Yuanning Liu<sup>1</sup>\*

*<sup>1</sup> College of Computer Science and Technology and Symbol Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun, China, <sup>2</sup> College of Computer Science and Technology, Changchun University of Science and Technology, Changchun, China, <sup>3</sup> Columbia Independent School, Columbia, MO, United States*

In recent years, obtaining RNA secondary structure information has played an important role in RNA and gene function research. Although some RNA secondary structures can be obtained experimentally, in most cases efficient and accurate computational methods are still needed to predict RNA secondary structure. Current RNA secondary structure prediction methods are mainly based on the minimum free energy algorithm, which finds the optimal folding state of RNA *in vivo* using an iterative method to meet the minimum energy or other constraints. However, due to the complexity of the biotic environment, a true RNA structure always keeps a balance of biological potential energy, rather than the optimal folding state that meets the minimum energy. For short RNA sequences, the equilibrium energy state of the folded RNA in the organism is close to the minimum free energy state; therefore, the minimum free energy algorithm predicts their secondary structure with higher accuracy. In longer RNA sequences, however, continued folding causes the biopotential energy balance to deviate far from the minimum free energy state. This deviation stems from the complex structure of long RNAs and results in a serious decline in the prediction accuracy of their secondary structure. In this paper, we propose a novel RNA secondary structure prediction algorithm that combines a convolutional neural network model with a dynamic programming method to improve the accuracy with large-scale RNA sequence and structure data. We analyze current experimental RNA sequence and structure data to construct a deep convolutional network model, and then extract implicit features that are effective for classification from large-scale data to predict the pairing probability of each base in an RNA sequence. An enhanced dynamic programming method is then applied to the obtained base pairing probabilities to obtain the optimal RNA secondary structure. Results indicate that our proposed method is superior to the common RNA secondary structure prediction algorithms in predicting three benchmark RNA families. Based on the characteristics of deep learning algorithms, it can be inferred that the method proposed in this paper has a 30% higher prediction success rate than other algorithms, which will be needed as the amount of real RNA structure data increases in the future.

Keywords: convolutional neural network, dynamic programming, RNA secondary structure, base pairing probability, energy balance status

#### Edited by:

*Arun Kumar Sangaiah, VIT University, India*

#### Reviewed by:

*Fei Guo, Tianjin University, China*

*Yan Huang, Harvard Medical School, United States*

#### \*Correspondence:

*Yuanning Liu lyn@jlu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *14 February 2019* Accepted: *30 April 2019* Published: *22 May 2019*

#### Citation:

*Zhang H, Zhang C, Li Z, Li C, Wei X, Zhang B and Liu Y (2019) A New Method of RNA Secondary Structure Prediction Based on Convolutional Neural Network and Dynamic Programming. Front. Genet. 10:467. doi: 10.3389/fgene.2019.00467*

# INTRODUCTION

RNA is an important basic substance in living organisms. It plays an important role in encoding, decoding, regulating, and expressing genes. The function of RNA in an organism depends mainly on its tertiary structure. However, the tertiary structure of RNA molecules is complex and lacks an effective representation to describe it; thus, it is very difficult to directly predict the tertiary structure from the primary structure of RNA molecules. Therefore, predicting the secondary structure of RNA from the primary structure of RNA becomes the main process for studying RNA structure.

At present, RNA secondary structures are identified mainly by means of biological experiments such as X-ray diffraction and NMR. However, experimental methods are inefficient, expensive, and arduous when measuring structures on a large scale (Novikova et al., 2012); furthermore, they are not effective for all RNA molecules (Fürtig et al., 2003). Howard and Eran proposed the PARS technique to determine RNA secondary structure (Kertesz et al., 2010). It applies endonucleases to cleave the single-stranded and double-stranded portions of the RNA to create two libraries of RNA fragments, and then sequences the two libraries separately to obtain the RNA secondary structure. However, endonucleases cannot pass through the cell membrane, so the RNA must first be extracted from the cells. This destroys the natural structure of the RNA and results in structural changes. Ding et al. (2014) use DMS for biological experiments. DMS reacts with adenine and cytosine in unpaired RNA regions in cells, and RNA regions that have reacted with DMS cannot be reverse transcribed into DNA. The DNA reverse-transcribed from the RNA is sequenced to determine the unpaired RNA regions. DMS technology still has drawbacks: it can only probe two of the four nucleotides (adenine and cytosine), and the rest requires computer algorithms for simulation. In addition, researchers have used SHAPE reagents instead of DMS reagents (Wilkinson et al., 2008; Novikova et al., 2013). These reagents acylate the 2' hydroxyl groups of all four nucleotides in an unpaired state, which makes it possible to analyze the single-strand flexibility of the RNA backbone at any position and to infer whether the bases are paired. However, the pairing partner of a base cannot be determined. Up until now, no experimental method has been able to determine true RNA secondary structures at large scale; thus, computational prediction algorithms are still needed to effectively predict RNA secondary structures.

There are two main types of mainstream RNA secondary structure prediction algorithms. One is the deterministic dynamic programming algorithm. The earliest dynamic programming algorithm is the Nussinov algorithm, which is based on the maximum number of base pairings (Nussinov et al., 1978). This algorithm simply assumes that the RNA single strand folds onto itself so that as many bases as possible form pairs, which constitute the secondary structure of the RNA. However, this algorithm has low prediction accuracy because its underlying assumption is too simple, and the base pairs it forms are often discontinuous and cannot form stem regions. Based on the Nussinov algorithm and energy information, Zuker proposed a minimum free energy algorithm (Zuker and Stiegler, 1981). The minimum free energy algorithm assumes that RNA structure is closely related to free energy. The free energy depends not only on the type of base pairing but also on the adjacent base pairs. The free energies of different structures (hairpin loop, internal loop, etc.) also differ considerably. The minimum free energy algorithm still uses the idea of dynamic programming, but the calculated object is a series of complex free energy parameters obtained from experiments. Many well-known RNA secondary structure prediction software applications, such as the mfold web server (Zuker, 2003) and RNAfold (Hofacker et al., 1994), have adopted the minimum free energy algorithm and its improvements. However, experiments show that, due to the complexity of the internal environment, RNA seldom folds in a manner that minimizes the free energy of the structure, and it is generally found in a folded structure of suboptimal energy (Zou et al., 2008). Notably, the Zuker algorithm gives better prediction results for the secondary structures of shorter RNAs. For longer RNAs, however, its prediction accuracy decreases sharply.
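The pair-maximization idea behind the Nussinov algorithm can be sketched as follows. This is the textbook form of the recursion, written for illustration and not taken from the cited paper; it counts only Watson-Crick pairs, and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) is an added option.

```python
def nussinov(seq, min_loop=0):
    """Maximum number of nested base pairs in an RNA sequence."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])  # leave i or j unpaired
            if (seq[i], seq[j]) in pairs and j - i > min_loop:
                inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)  # pair i with j
            for k in range(i + 1, j):  # split into two independent parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


print(nussinov("GGGAAAUCC"))              # 3
print(nussinov("GGGAAAUCC", min_loop=3))  # 2
```

The sequence has only two C bases and one U, so at most two G-C pairs and one A-U pair can form; requiring at least three unpaired bases inside every pair rules out the A-U pair.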

The second category of mainstream RNA secondary structure prediction algorithms comprises the comparative sequence analysis methods. In biological experiments, it is usually necessary to process one or more sets of homologous RNA sequences simultaneously. It is generally believed that, in homologous RNA molecules, the structure is more conserved than the sequence. For example, the secondary structures of all tRNA molecules are clover-shaped. This consistency of shape gives tRNA molecules the structural consistency they need to perform similar functions. Therefore, comparative sequence analysis can improve prediction accuracy to a certain extent. There are three main methods of comparative sequence analysis. The first method compares first and predicts afterwards, using a prior distribution of RNA structures that incorporates evolutionary history (Knudsen and Hein, 1999). The results obtained by this method depend strongly on the quality of the multiple sequence alignment. The second method performs structure prediction and sequence comparison simultaneously, but this algorithm consumes excessive computational resources (Sankoff, 1985). The third method predicts first and compares afterwards. This method can obtain multiple candidate structures, but the candidates cannot be guaranteed to contain the real structure (Allali and Sagot, 2005).

Artificial intelligence methods have been applied in many fields. Several artificial intelligence learning algorithms, such as the genetic algorithm (Hu, 2003), the neural network algorithm (Zhang et al., 2006), and the support vector machine algorithm, have been used to predict the secondary structure of RNA, and all achieved good results. However, all these methods are based on small samples, and their prediction accuracy is low for single-class data samples. With the development of computer technology, deep learning methods have emerged in the field of artificial intelligence, which can effectively improve the accuracy of prediction. Deep learning methods can extract effective and implicit features from large-scale data through deep networks and use these features to construct effective prediction models. At present, deep learning methods have made great breakthroughs in the field of protein secondary structure prediction (Wang et al., 2016). However, compared with protein secondary structure prediction, RNA secondary structure prediction is more complicated and difficult, since each paired base in an RNA must be matched to another base in the chain, whereas each amino acid of a protein is not tied to other amino acids in the chain during structure prediction. This paper proposes a novel computational method that combines deep learning with dynamic programming to predict RNA secondary structure, which can effectively solve the problems above. Compared with the current mainstream algorithms, our method achieves better results.

# DATA AND METHODS

The RNA secondary structure is mainly composed of stem structures formed by complementary pairing of contiguous bases and loop structures formed by unpaired bases. This RNA secondary structure is also called the stem-and-loop structure. As long as all the paired bases of an RNA sequence are determined, the secondary structure of the entire RNA can be determined. Based on the RNA secondary structure prediction problems identified in our literature search, this paper proposes a more efficient algorithm for RNA secondary structure prediction. This algorithm, referred to as CDPfold, combines a convolutional neural network and dynamic programming as well as a sequence alignment method. In comparative sequence analysis, we constructed a convolutional neural network to extract effective implicit features from large-scale data and predicted the matching probability of each base in the RNA sequence. Convolutional neural networks can use the currently collected RNA sequences as training samples, which removes the constraint of requiring homologous sequences in comparative sequence analysis. For the probabilistic results obtained by the convolutional neural network, we used the iterative idea of dynamic programming and the definition of the RNA secondary structure to obtain the RNA secondary structure with the maximum sum of base matching probabilities. This operation avoids the degradation of long-sequence prediction accuracy caused by the use of the free energy method. The process by which CDPfold predicts an RNA secondary structure is shown in **Figure 1**.

## RNA Matrix Representation Based on RNA Sequence Pairing

An RNA sequence is mainly composed of four types of bases, "A," "U," "G," and "C," but most algorithm models do not accept a sequence of these letters as input data, so we had to encode the sequence. Currently, the most common encoding method is one-hot encoding, but since one-hot encoding does not reflect the implicit matching between bases, we developed a new encoding method.

We built the matrix W<sub>i×i</sub> for each RNA, where each row of the matrix represents possible pairings of bases at that position, as follows:

1. According to the number of hydrogen bonds between the paired bases, the pairing weight between A and U is set to 2, and the pairing weight between G and C is set to 3. Since the U-G pair is a wobble base pair, the pairing weight between U and G is set to x (0 < x < 2), which leads to:

$$P(R_i, R_j) = \begin{cases} 2, & \text{if } (R_i = A \text{ and } R_j = U) \text{ or } (R_i = U \text{ and } R_j = A) \\ 3, & \text{if } (R_i = G \text{ and } R_j = C) \text{ or } (R_i = C \text{ and } R_j = G) \\ x, & \text{if } (R_i = G \text{ and } R_j = U) \text{ or } (R_i = U \text{ and } R_j = G) \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
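The pairing weight in formula (1) can be written as a small lookup. In this sketch the wobble weight `x = 0.8` is an arbitrary value chosen for illustration (the text only requires 0 < x < 2), and the function names are hypothetical. The table built by `pair_weight_table` holds only the raw pairwise weights; it is not the full coding matrix W, whose computation is given in Figure 2.

```python
WOBBLE_WEIGHT = 0.8  # assumed value; the source only states 0 < x < 2


def pair_weight(ri, rj, x=WOBBLE_WEIGHT):
    """Pairing weight P(Ri, Rj) from formula (1)."""
    pair = {ri, rj}
    if pair == {"A", "U"}:
        return 2.0
    if pair == {"G", "C"}:
        return 3.0
    if pair == {"G", "U"}:
        return x
    return 0.0


def pair_weight_table(seq, x=WOBBLE_WEIGHT):
    """n x n table of raw pairing weights between all positions of seq."""
    return [[pair_weight(a, b, x) for b in seq] for a in seq]


table = pair_weight_table("GAUC")
print(table[0])  # weights of G against G, A, U, C: [0.0, 0.0, 0.8, 3.0]
```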


Combining these points of view, this paper introduces the following algorithm flow to calculate the specific value at each position of the coding matrix W<sub>i×i</sub>, as shown in **Figure 2**.

The base pairing information can be obtained by calculation from the RNA sequence coding matrix. Analysis of the matrix shows that a stem region in the real structure of the RNA appears in the coding matrix as a sub-diagonal line with large values in the middle and small values on both sides. The advantage of convolutional neural networks among deep learning methods is that they can effectively extract the regional features of blocks in a matrix. Therefore, we used convolutional neural networks instead of other machine learning models to predict the pairing of bases in RNA sequences.

# Convolutional Neural Network Predicts the Probabilities of RNA Sequence Base Pairing

Our goal is to predict the pairing of each base in an RNA sequence, so we had to split the RNA sequence coding matrix. The RNA representation method converts a sequence of length n into a matrix of size n × n. We use the sliding window method to divide this matrix into n matrices of size d × n, where d is the size of the sliding window. Thus, each base in the RNA sequence is represented by a matrix of size d × n. The size of the sliding window has a great influence on the experiment. If the window is too small, the extracted features will be incomplete. If it is too large, the matrix will contain more redundant information, which lengthens model training and may reduce the accuracy of the final prediction model. Our analysis shows that the size of the sliding window should be related to the length of the stem regions in the RNA. Therefore, we had to collect statistics on the stem regions of the experimental data to determine the size of the sliding window.
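The splitting step can be sketched as follows. The text does not say how the window is positioned or how the sequence ends are handled, so centring the window on each base and padding with zero rows are assumptions, as are the function name and the toy matrix.

```python
def sliding_windows(matrix, d):
    """Split an n x n matrix into n windows of size d x n, one per base.

    The window for base i covers rows i - d // 2 through i - d // 2 + d - 1
    (centred on i when d is odd); rows outside the matrix are zero rows.
    """
    n = len(matrix)
    half = d // 2
    windows = []
    for i in range(n):
        window = []
        for r in range(i - half, i - half + d):
            if 0 <= r < n:
                window.append(list(matrix[r]))
            else:
                window.append([0.0] * n)  # zero padding at the sequence ends
        windows.append(window)
    return windows


# Toy 4 x 4 matrix whose entries encode their own position.
toy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
wins = sliding_windows(toy, d=3)
print(len(wins), len(wins[0]), len(wins[0][0]))  # 4 3 4
```

Bringing windows from sequences of different lengths to a uniform width, as required by the network, is a separate step that this sketch does not cover.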

The convolutional neural network requires that the data input into the model be of a uniform size, but the size of the coding matrix differs between RNA sequences because their lengths differ. Therefore, during the experiment, we calculate the mean RNA sequence length in the experimental data set and use that mean value to normalize the data. The sliding window method and normalization of the RNA coding matrix convert an RNA sequence of length n into n matrices of the same size, which satisfies the requirements of the convolutional neural network for input data.

This article uses the dot-bracket representation for the RNA secondary structure. In this representation, the secondary structure is written as a sequence of the symbols "(", ")" and ".". Therefore, the output layer of the convolutional neural network designed in this paper is composed of three nodes, and for the matrix corresponding to each base the output gives the probabilities of the three labels "(", ")" and ".".

### Maximum Probability Sum Algorithm Corrects Predictions

The deep learning method has a high accuracy rate for classification problems. However, RNA secondary structure prediction is not a simple classification problem. We can consider RNA secondary structure prediction as a combination of multiple classification problems under certain restrictions.

From the result of the previous step, we obtain Pleft, Pright, and Ppoint, which are the probabilities of the three labels "(", ")" and "." for each base in the RNA sequence. However, if the label with the highest predicted probability is simply taken as the prediction for each base, the combined result is not guaranteed to satisfy the definition of an RNA secondary structure: the number of left brackets may differ from the number of right brackets, or matched brackets may correspond to bases that cannot pair. We therefore need to modify the prediction results so that they meet the definition of an RNA secondary structure.
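A short sketch illustrates why per-base maximum labels need correction. The function names are hypothetical, and the check below covers only bracket balance; it does not test whether the matched bases are complementary.

```python
def argmax_labels(probs):
    """Pick the most probable label per base.

    probs is a list of (p_left, p_right, p_point) tuples, one per base.
    """
    symbols = "()."
    return "".join(
        symbols[max(range(3), key=lambda k: p[k])] for p in probs
    )


def is_balanced(structure):
    """True if every ')' closes an earlier '(' and no '(' stays open."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

For made-up probabilities in which two bases favor "(" but only one favors ")", `argmax_labels` returns a string that `is_balanced` rejects, which is the situation the correction step is meant to repair.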

Based on the probabilistic results obtained from the convolutional neural network in the previous step, the goal of this paper is to find a compatible bracketed sequence that represents the secondary structure of the RNA. To achieve this, the process requires:


To find a sequence that meets these requirements, this article modifies the Nussinov dynamic programming algorithm. The quantity accumulated over the iterations of the Nussinov algorithm, the number of paired bases, is replaced by the sum of the label probabilities of the bases. The result is the proposed maximum probability sum algorithm. Through multiple iterations, a secondary structure of the RNA that satisfies the requirements can be obtained. The iteration formula is as follows:

$$N(i,j) = \max\begin{cases} N(i+1,j) + p\_{point}(R\_i) \\ N(i,j-1) + p\_{point}(R\_j) \\ N(i+1,j-1) + \delta(R\_j, R\_i) \\ \max\_{i < k < j} \left[ N(i,k) + N(k+1,j) \right] \end{cases} \tag{2}$$

where N(i, j) is the maximum probability sum over the subsequence from the i-th base to the j-th base of the RNA sequence, and Pleft, Pright, and Ppoint are the probabilities output by the convolutional neural network for the three labels of the i-th base.
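A minimal sketch of a Nussinov-style recursion with probability sums is given below. Several points are assumptions rather than the authors' stated method: the text does not define δ, so the sketch takes it to be Pleft of the opening base plus Pright of the closing base when the two bases can pair (A-U, G-C, G-U) and minus infinity otherwise; no minimum hairpin loop length is enforced; and the function and variable names are hypothetical.

```python
import math

# Assumption: only canonical and G-U pairs are allowed to form.
PAIRABLE = {("A", "U"), ("U", "A"), ("G", "C"),
            ("C", "G"), ("G", "U"), ("U", "G")}


def max_probability_sum(seq, p_left, p_right, p_point):
    """Return (best score, dot-bracket string) for one RNA sequence."""
    n = len(seq)
    N = [[0.0] * n for _ in range(n)]

    def score(i, j):
        # An empty interval (i > j) contributes nothing.
        return N[i][j] if i <= j else 0.0

    def delta(i, j):
        # Assumed pair score: P_left(i) + P_right(j) if pairable.
        if (seq[i], seq[j]) in PAIRABLE:
            return p_left[i] + p_right[j]
        return -math.inf

    for i in range(n):
        N[i][i] = p_point[i]  # a single base can only be unpaired
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(score(i + 1, j) + p_point[i],
                       score(i, j - 1) + p_point[j],
                       score(i + 1, j - 1) + delta(i, j))
            for k in range(i + 1, j):  # bifurcation
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best

    # Traceback: re-derive which case produced each table entry.
    structure = ["."] * n
    stack = [(0, n - 1)] if n else []
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if math.isclose(N[i][j], score(i + 1, j) + p_point[i]):
            stack.append((i + 1, j))
        elif math.isclose(N[i][j], score(i, j - 1) + p_point[j]):
            stack.append((i, j - 1))
        elif math.isclose(N[i][j], score(i + 1, j - 1) + delta(i, j)):
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if math.isclose(N[i][j], N[i][k] + N[k + 1][j]):
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    total = N[0][n - 1] if n else 0.0
    return total, "".join(structure)
```

Because every bracket is written as a matched pair during the traceback, the returned string is always balanced, whatever the input probabilities are.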

### RESULTS

# Prediction of Secondary Structure of Single Family RNA by CDPfold

The data used in our experiment are derived from Turner and Mathews (2009). The data contained in the data set is shown in **Table 1**.

Among the RNA families included in the dataset, we first selected 5sRNA, which has the largest number of sequences and the most concentrated distribution and contains no pseudoknots. Sequence analysis of the 5sRNA dataset reveals that it contains some identical or similar sequences. To keep such sequences from affecting the experiments, the dataset must be preprocessed: identical or similar sequences are removed from the 5sRNA data set programmatically. After this de-duplication, 1,059 5sRNA sequences remain for the experiment. To train the model and accurately evaluate it, we divide the de-duplicated 5sRNA data set into a training set, a validation set and a test set in a ratio of 7:2:1.

TABLE 1 | Distribution of RNA types and their number in each dataset.


The experiment uses the training set to train the network model and determine the model parameters; then, the validation set is used for model selection. Finally, once the model has been optimized and fixed, the test set is used to measure the generalization ability of the whole prediction method.
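A 7:2:1 split can be sketched as below. The shuffling seed, the integer rounding, and the choice to give any remainder to the test set are assumptions made for the illustration; the paper does not report the exact set sizes, and the function name is hypothetical.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train, validation and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Splitting by whole sequences, rather than by individual base windows, keeps all windows of one RNA in the same subset.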

Several parameters in CDPfold can affect the results of the experiment, and these parameters must be fixed before the experiment. The first is the size of the sliding window. We calculated the length of the largest stem region of every RNA in the 5sRNA data set used in the experiment. The results are shown in **Figure 3**. The length of the longest stem region in the 5sRNA dataset is used as the size of the sliding window. We also calculated the average length of the 5sRNA sequences in the data set, as shown in **Figure 4**.

**Figures 3**, **4** show that the maximum stem length of the 5sRNA is 11 continuous base pairs, and the average length of the sequence is 120 nt. Since the convolutional neural network retains good accuracy on shifted and scaled images, this paper applies the idea of image scaling: the matrix representation of each base obtained through the sliding window is uniformly scaled to a matrix of size 11 × 120.
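The scaling step can be sketched with nearest-neighbor resampling. The paper does not say which interpolation method was used, so nearest-neighbor is an assumption chosen for simplicity, and the function name is hypothetical.

```python
def resize_nearest(matrix, out_rows, out_cols):
    """Rescale a 2-D list to out_rows x out_cols by nearest-neighbor lookup."""
    in_rows, in_cols = len(matrix), len(matrix[0])
    return [
        [matrix[r * in_rows // out_rows][c * in_cols // out_cols]
         for c in range(out_cols)]
        for r in range(out_rows)
    ]
```

For example, a window of size 11 × 100 from a sequence of 100 nt would be stretched to 11 × 120 by repeating some columns, while a window from a longer sequence would be shrunk by skipping columns.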

The framework used to build the convolutional neural network model in this paper is TensorFlow. The model consists of an input layer, three convolutional layers, three pooling layers, two fully connected layers, and a final output layer. In the test phase, the tf.nn.top\_k() function of the output layer is removed to obtain the probability that each base corresponds to each of the three labels. The convolutional neural network model used in this paper is shown in **Figure 5**.

The input to the convolutional neural network is the base matrix produced by the sliding window algorithm and normalization. The parameters are optimized by mini-batch stochastic gradient descent with 256 samples per iteration. The network consists of three convolutional layers, three pooling layers, and two fully connected layers; each convolutional layer uses 16 convolution kernels of size 3 × 3, each pooling layer uses 3 × 3 max pooling, and each fully connected layer has 32 nodes. The output layer maps the data to the three labels of the dot-bracket representation, giving the probability that the base belongs to each label. All model parameters are initialized with the Xavier initialization method, and the error function of the output layer is the maximum entropy function. During training, the model is iterated 400 times.

Through the sliding window method and normalization, a matrix representation of each base in the 5sRNA sequences of the training set is obtained, and each base has a corresponding structural label. Analysis of the data shows that the number of unpaired bases in each 5sRNA sequence is slightly larger than the number of paired bases, which leaves the three classes of samples in the data set imbalanced, so the imbalance must be handled. Since the amount of experimental data is sufficient, the upsampling method is adopted to balance the classes in the data set.
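One common form of upsampling is random duplication of minority-class samples until every class matches the largest one. The sketch below assumes this variant, which the paper does not spell out; the function name and seed are hypothetical.

```python
import random
from collections import defaultdict


def oversample(samples, labels, seed=0):
    """Balance classes by randomly repeating samples of the smaller classes."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra copies with replacement until the class reaches target.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

Balancing should be applied to the training set only, so that validation and test accuracy still reflect the natural label distribution.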

The processed data are used to train the convolutional neural network model. The performance of the model on the training set and test set is shown in **Figure 6**. **Figure 6** shows that the model reaches similar accuracy on the training set and the test set, which indicates that it is not over-fitting.

After determining the model used in the experiment, we need to select an appropriate value for the weight x of G-U pairing (Formula 1). The weight of the wobble pair should be neither too large nor too small, because an unsuitable weight decreases the prediction accuracy. To select an appropriate weight, we conducted a number of experiments. The results are shown in **Figure 7**. The experiments show that when the weight of G-U pairing is 0.8, the mean and variance of the overall model's accuracy are optimal.

The test set data are input into the trained CDPfold, and the pairing probability of each base on each RNA obtained by the convolutional neural network is used as an intermediate result. These intermediate results are passed to our maximum probability sum correction algorithm, which yields the optimal secondary structure that satisfies the definition of an RNA secondary structure. This structure is compared with the corresponding real structure, thereby validating our complete model design.

To evaluate the RNA secondary structures predicted by CDPfold, we used two indicators, sensitivity and specificity. Sensitivity is the percentage of base pairs in the real structure that are correctly predicted, corresponding to recall in machine learning. Specificity is the percentage of predicted base pairs that are correct, corresponding to precision in machine learning. It is difficult for an RNA secondary structure prediction algorithm to score well on both at once, since it is usually biased toward one side. The F-score combines precision and recall into a single measure.

$$F\text{-score} = \frac{2 \times \text{Sensitivity} \times \text{Specificity}}{\text{Sensitivity} + \text{Specificity}} \tag{3}$$
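The three measures can be computed from a predicted and a reference dot-bracket string, as in the sketch below. It assumes both strings are balanced and of equal length, and it counts a predicted pair as correct only when both positions match exactly; the function names are hypothetical.

```python
def base_pairs(structure):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            pairs.add((stack.pop(), idx))
    return pairs


def evaluate(predicted, reference):
    """Return (sensitivity, specificity, F-score) as defined in Formula 3."""
    pred, ref = base_pairs(predicted), base_pairs(reference)
    correct = len(pred & ref)
    sensitivity = correct / len(ref) if ref else 0.0   # recall
    specificity = correct / len(pred) if pred else 0.0  # precision
    denom = sensitivity + specificity
    f_score = 2 * sensitivity * specificity / denom if denom else 0.0
    return sensitivity, specificity, f_score
```

As a worked case, predicting "(....)" for the reference "((..))" recovers one of two true pairs with no false pair, so sensitivity is 0.5, specificity is 1.0, and the F-score is 2 × 0.5 × 1.0 / 1.5 ≈ 0.67.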

Based on the above metrics, we evaluated the designed model on the 5sRNA dataset and ran other published algorithms on the same data. **Table 2** compares the accuracy of our algorithm with that of other popular programs on the 5sRNA dataset. The sensitivity and specificity of our algorithm are significantly higher than those of the other algorithms.

# Prediction of Secondary Structure of Multiple Family RNAs by CDPfold

Based on the above studies, we used the model trained on the 5sRNA dataset to predict the secondary structure of tRNA. The results have a sensitivity of 0.2 and a specificity of 0.15. This is quite different from the performance of the same model on the 5sRNA data set. We analyzed this result and found that the function of 5sRNA is quite different from that of tRNA. Models trained on the 5sRNA dataset only extract features that favor the classification of 5sRNA, and the lack of these features in tRNA greatly reduced the prediction accuracy. Therefore, when the function or family of an RNA is unknown, its structure cannot be predicted directly with a model trained on a single family. Thus, a general model must be built from the entire sample data.

First, we analyzed all the data in the dataset and found pseudoknots in some of the RNA structures. Since pseudoknots belong to the category of RNA tertiary structure, all data with pseudoknots were deleted during pre-processing. **Table 3** shows the number of RNAs after the pseudoknot deletion.

We chose the 5sRNA, srpRNA, and tRNA families, each of which has more than 100 sequences after the pseudoknot deletion, and first removed redundancy from these three types of RNA by deleting identical or similar sequences from the data set. **Figure 8** shows the RNAs of each family and the data distribution after the removal of identical or similar sequence data.

We also calculated the maximum stem length of each RNA sequence in the data set and the average length of the RNA sequence, which are shown in **Figures 9**, **10**.

As can be seen from **Figures 9**, **10**, the maximum length of the stem region is 19 continuous base pairs, and the average length of the sequence is 128 nt. Thus, after the sliding window and normalization operations, each base is represented as a matrix of size 19 × 128. The experiment divides the data into training, validation and test sets, keeping the ratio for each type of RNA at 7:2:1. Because the general model covers significantly more RNA types and sequences, the experiment adjusts the convolutional neural network of the original 5sRNA prediction model: to extract more generalized hidden features of the various RNAs, the number of convolutional and pooling layers is reduced from three to two. Other configuration parameters are unchanged. During training, the batch size for each iteration is increased from 256 to 512, and the number of iterations is increased to 2,000. The convolutional neural network model used in the experiment is shown in **Figure 11**.

After our convolutional neural network model is trained, the test set data are input into the trained general model to obtain the pairing probability of each base on each RNA, and the maximum probability sum correction algorithm is applied to these probabilities and the RNA sequence. In this manner, the optimal secondary structure of the RNA sequence is obtained.

Using the F-score, we can get the predicted effect of the designed generic model on the three types of RNA datasets. We use the same test data to perform experiments under other published algorithms. The comparison results are shown in **Table 4**.

# DISCUSSION

This paper proposes CDPfold, a prediction method for RNA secondary structure based on a convolutional neural network. The method uses the convolutional neural network to extract the hidden features of RNA sequence data and applies them to structural prediction. The results are corrected using a dynamic programming-based correction algorithm to obtain an optimal RNA secondary structure. Experimentally, our method predicts RNA secondary structure with good accuracy.

Although CDPfold has achieved good results in RNA secondary structure prediction, some problems encountered during the experiment process are summarized below, and suggestions for solving the problem follow.

First, RNA can form stem regions because of the hydrogen bonds formed by the complementary pairing of bases. The secondary structure of DNA mainly exists in the form of a double helix. Due to the constraints of the double helix structure, base pairing in DNA can only occur between a pyrimidine and a purine. Therefore, DNA molecules have only two pairing modes: A-T and G-C. RNA molecules are different. They mainly exist in single-stranded form, and their double-stranded regions are composed of different regions of the same chain. They do not have a long, regular double-helix structure. Therefore, in addition to the standard A-U and G-C base pairs, there are also G-U wobble pairs. The hydrogen bonds formed by a wobble pair are unstable, and not all G and U bases can form pairs. Current methods handle G-U wobble pairing in a fixed way, treating it either as always pairable or as never pairable. In this paper, a smaller number is selected as the pairing weight of the G-U wobble pair, but the G-U pairing problem is not well explained. We believe that it is necessary to dynamically determine whether G and U can form a base pair according to the different states during the process of RNA folding, but this dynamic method is extremely difficult, and no research has yet proposed a corresponding solution.

TABLE 4 | Comparison of three types of RNA based on their prediction accuracy.

In the CDPfold method proposed in this paper, the results predicted by the convolutional neural network still need to be further corrected. This is because all machine learning algorithms have generalization errors, and because of these errors the results obtained by the convolutional neural network do not always form a valid RNA secondary structure. A similar situation has emerged in other studies that use machine learning algorithms for RNA secondary structure prediction. There are two main ways to obtain a valid RNA secondary structure from such predictions. One is to directly optimize the results of the machine learning model. This paper adopts this approach. The second is to use the results as conditional constraints, and use these constraints to optimize other algorithms. In essence, both approaches are an optimization process for intermediate results. For this problem, a generative adversarial network may be an effective solution. The generator of the generative adversarial network is used to generate the RNA secondary structure, and the discriminator is used to determine whether the result satisfies the definition of an RNA secondary structure. The optimal RNA secondary structure is obtained through the contest between the generator and the discriminator. The difficulty of this method is how to design a good training method. Otherwise, the output may be unsatisfactory because of the freedom of the generative model.

In selecting an optimization algorithm, we also tried swarm intelligence optimization algorithms such as the genetic algorithm. These intelligent algorithms can solve complex non-linear problems by simulating biological evolution. In that approach, the probabilistic results provided by the convolutional neural network are used as the probabilities of selection, mutation, crossover, etc. in the genetic algorithm, and the number of mismatches in the simulated population of RNA structures is used as the optimization target. Although this method can also obtain an RNA secondary structure that meets the requirements, the randomness of each step of the swarm intelligence algorithm and the discreteness of the data prevent the algorithm from having a fixed number of optimization iterations. In addition, since the goal is to find a secondary structure of the RNA without mismatches, and the number of such structures is large, the result of each optimization run is uncertain, so the swarm intelligence algorithm cannot be used as the optimization algorithm of this paper. Therefore, the dynamic programming algorithm was chosen as the optimization algorithm, and the maximum probability sum correction method was proposed based on the Nussinov algorithm.

In current RNA secondary structure prediction, the prediction of pseudoknots is still a difficult point. In this study, it was found that 5sRNA, srpRNA, and tRNA are free of pseudoknots, while most RNaseP RNA and tmRNA sequences have pseudoknots. In the RNAs containing pseudoknots, the number of pseudoknots in each RNA is relatively small, but their existence cannot be ignored. Pseudoknots not only play an important role in the function of the RNA; if a pseudoknot is predicted incorrectly, it also causes errors in the normal stem regions. The RNA structure representation used in this paper is the dot-bracket representation. However, the dot-bracket representation cannot express the pseudoknots present in the RNA structure. Therefore, the data containing pseudoknots are deleted in the experiment. If a secondary structure representation of RNA can be found that can represent a pseudoknot, the CDPfold method proposed in this paper can be modified accordingly to predict the secondary structure of RNA with pseudoknots.

The experimental data used in this paper focus on 5sRNA, srpRNA, and tRNA. The lengths of these three types of RNA sequences are mostly between 50 and 350 nt. In this length range, CDPfold outperforms the existing RNA prediction software. The prediction of the secondary structure of longer RNA sequences is not covered. This is because current experimental methods are not yet good enough, and too few secondary structures of long RNA sequences have been measured experimentally. In the data set provided by Turner and Mathews (2009) and used in the experiment, fewer than 200 RNA sequences are longer than 1,000 nt, which is <10% of the entire data set. The most important factor affecting the predictive power of deep learning is the amount of data, so we did not study longer sequences. However, with the continuous improvement of experimental techniques, the number of experimentally measured long-sequence structures continues to increase. On this basis, models based on deep learning have an advantage.

The last point is the instability of the RNA structure. The structure of the RNA molecule is highly susceptible to environmental factors. Studies have shown that the natural structures of RNA molecules can be damaged when they are exposed to an in vitro environment; thus, in vivo structural prediction experiments are not perfect, which means that the currently determined RNA secondary structure is not necessarily the real structure.

In addition, unlike proteins, which function differently, not all RNA molecules can function in the body; furthermore, RNA that encodes proteins accounts for only 2% of the total RNA. Thus, RNA structures that do not have an actual function may not be as fixed as functional RNA structures. These problems all have an impact on RNA secondary structure prediction.

In general, the CDPfold algorithm based on the convolutional neural network for RNA secondary structure prediction achieved good results in data sets without pseudoknots. Many difficulties remain in the research of RNA secondary structure prediction, and many parts still need to be improved. Our research provides new ideas for the study of the RNA secondary structure and serves as a very good source of structural prediction problems and solutions for other researchers.

# DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/zhangch994/CDPfold.

### AUTHOR CONTRIBUTIONS

YL, HZ, and ZL conceived and directed the project. CZ, CL, and XW obtained the raw data and interpreted the data. HZ, CZ, and ZL conducted the data analysis and interpreted the results. HZ, CZ, XW, and BZ helped to design the study and reviewed the data. YL, HZ, CZ, and BZ wrote and edited the manuscript. All the authors helped with the draft and reviewed the manuscript before approving for publication.

#### FUNDING

This work was supported by the Natural Science Foundation of Jilin Province (20150101056JC and 20140101194JC), the China Postdoctoral Science Foundation (2015M570273), the National Natural Science Foundation of China (61471181), and the Postdoctoral Research Station of the College of Computer Science and Technology, Jilin University, Project of Development and Reform Commission of Jilin Province (No. 2019C053-6).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Zhang, Li, Li, Wei, Zhang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

fgene-10-00456 May 23, 2019 Time: 15:19 # 1

# A Systems Analysis of the Relationships Between Anemia and Ischemic Stroke Rehabilitation Based on RNA-Seq Data

Yingying Wang<sup>1</sup>† , Xingxian Huang<sup>2</sup>† , Jianfeng Liu<sup>3</sup> , Xuefei Zhao<sup>4</sup> , Haibo Yu<sup>2</sup> \* and Yunpeng Cai<sup>1</sup> \*

<sup>1</sup> Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China, <sup>2</sup> Shenzhen Traditional Chinese Medicine Hospital, Shenzhen, China, <sup>3</sup> Department of Neurology, The First Affiliated Hospital of Harbin Medical University, Harbin, China, <sup>4</sup> Institute of Harbin Hematology & Oncology, The First Hospital of Harbin, Harbin, China

#### Edited by:

Dariusz Mrozek, Silesian University of Technology, Poland

#### Reviewed by:

Fei Guo, Tianjin University, China Haijing Wang, Microsoft (United States), United States

#### \*Correspondence:

Haibo Yu 13603066098@163.com Yunpeng Cai yp.cai@siat.ac.cn †These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 02 February 2019; Accepted: 30 April 2019; Published: 24 May 2019

#### Citation:

Wang Y, Huang X, Liu J, Zhao X, Yu H and Cai Y (2019) A Systems Analysis of the Relationships Between Anemia and Ischemic Stroke Rehabilitation Based on RNA-Seq Data. Front. Genet. 10:456. doi: 10.3389/fgene.2019.00456

Ischemic stroke (IS) is one of the main causes of morbidity and disability worldwide due to its complex mechanism. Anemia was characterized as a risk factor of IS because of the direct connection between the central nervous system, blood supply, and tissue oxygen delivery. As the key oxygen-carrying molecule in the blood, hemoglobin (Hb) may be decisive in the destiny of the penumbral area or influence the brain recovery and neurologic function, which could finally affect the outcome of IS. However, more detailed information on the expression levels of Hb related genes was still lacking, possibly because the concentration of Hb was determined by the genes' expression several hours earlier, which may make the research more difficult to perform. This time gap between gene expression and protein concentration could make these genes predictive bio-markers for IS outcome. In this study, we chose 28 IS patients, of which 12 were suffering from anemia. Statistical analysis results showed that the outcomes of the patients were different when dividing them into two groups characterized by Hb concentration. Two sex- and age-matched patients were first chosen for RNA-seq analysis at two different time points, after which the Hb counts were tested at least 24 h after sequencing. Results showed that the outcome of anemia patients was poor compared with non-anemia patients. Two other patients were then chosen for analysis to exclude the coincidence of other factors. The results showed that a low Hb value under 13 g/dL in men was closely related to the poor outcome of IS patients. Differentially expressed Hb related genes were tested and six genes were shown to be positively correlated with the recovery degree of IS patients: ELANE, FGF23, HBB, PIEZO1, RASA4, and PRTN3. Gene CPM was shown to be negatively correlated with clinical outcomes. All of the seven genes were validated to be related to strokes using real-time PCR or literature searches. Taken together, these genes could be considered as new predictors for the recovery of IS patients.

Keywords: hemoglobin, ischemic stroke, bioinformatics, anemia, predictive

# INTRODUCTION

Ischemic stroke (IS) is one of the main causes of morbidity and disability worldwide due to its complex mechanism. Many bio-markers and risk factors have been successfully identified using different methods (Jiang et al., 2018a,b, 2019). Alongside smoking, diabetes mellitus, hypertension, and hypercholesterolemia, anemia was characterized as 'the fifth cardiovascular risk factor' (Kaiafa et al., 2015). For example, the mortality rate was shown to be significantly higher in atherosclerosis-related IS patients suffering from anemia when admitted (Huang et al., 2009). The results of ASTRAL (Acute Stroke Registry and Analysis of Lausanne) showed that anemia on admission could predict both short-term and long-term outcomes in patients with IS and that the risk of recurrent stroke was higher in anemia patients (Milionis et al., 2015).

The relationships between IS and anemia may be explained partly by the direct connection between the central nervous system (CNS), blood supply, and tissue oxygen delivery (Dubyk et al., 2012). A sudden interruption of the oxygen supply to brain tissue is a crucial step in the pathophysiology of IS. The restoration of oxygen supply through timely reperfusion or collateral perfusion then determines the condition of the brain tissue (Kellert et al., 2011). As the key oxygen-carrying molecule in the blood, hemoglobin (Hb) may be decisive for the destiny of the penumbral area or influence the brain recovery and neurologic function, which could finally affect the outcome of IS (Kimberly et al., 2013; Park et al., 2013). It has been shown that poor outcome and mortality after IS were strongly associated with low and further decreasing hemoglobin (Hb) and hematocrit (Hct) levels (Kellert et al., 2011). Another study found that the poor outcome of acute IS was related to the lower but not the higher end of the Hb range, regardless of the time point at which and the method by which Hb concentrations were measured (Kimberly et al., 2013). It is widely accepted that women often have worse outcomes after suffering a stroke compared to men. A study focusing on the relationship between Hb and clinical outcome measured by mRS found that sex differences in stroke outcome were linked to lower Hb level, which was more prevalent in women (Park et al., 2013).

However, more detailed information on the expression levels of Hb related genes was still lacking, which might be caused by the fact that Hb resides in mature red blood cells with no nucleus. In other words, the concentration of Hb was determined by the genes' expression several hours earlier, which may make the research more difficult to perform. However, this time gap between gene expression and protein concentration could make these genes predictive bio-markers for IS outcome. Compared with commonly used clinical scales such as the NIH Stroke Scale (NIHSS) score and the modified Rankin Scale (mRS), and with biochemical tests such as routine blood tests, which describe the current recovery degree of patients, bio-markers at the gene level could act as early predictors for IS patients, which may help reduce the chance or severity of disability and the chance of recurrence.

In this study, we chose 28 IS patients, of which 12 were suffering from anemia. Statistical analysis results showed that the outcome of the patients was different when dividing them into two groups characterized by Hb concentration. Two sex- and age-matched patients were first chosen for RNA-seq analyses at two different time points, after which the Hb counts were tested at least 24 h after sequencing. Results showed that the outcome of anemia patients was poor compared with the other patients. Then two other patients were chosen to exclude factors other than Hb. The results showed that a low Hb value under 13 g/dL in men was closely related to the poor outcome of IS patients. Differentially expressed Hb related genes were tested and six genes were shown to be positively correlated with the recovery degree of IS patients: ELANE, FGF23, HBB, PIEZO1, RASA4, and PRTN3. Gene CPM was shown to be negatively correlated with clinical outcome. All of the seven genes were validated to be related to stroke using real-time PCR or literature search. Taken together, these genes could be considered as new predictors for the recovery of IS patients.

# MATERIALS AND METHODS

The framework of this work could be divided into several steps shown in **Figure 1**, which will be elaborated on in the remainder of this section.

#### Clinical Samples

Whole blood samples were collected from 46 IS patients at different stages to perform analyses on three levels as follows:


#### Data Generation

Besides Hb concentration, different types of data sets were generated from the three clinical sample groups as follows:

(1) C-group: the NIH Stroke Scale (NIHSS) score, pre-stroke disability, and clinical outcome defined using the modified Rankin Scale (mRS), BI (Barthel Index), and Myodynamia (including four levels) of all the 28 patients were collected (see **Supplementary Table S1** for details). Higher NIHSS, mRS, or Myodynamia scores indicated a worse outcome for IS patients, whereas a lower BI score indicated a worse outcome.


The study was approved by SIAT Institutional Review Committee with IRB number SIAT-IRB-16515-H0107 and all the procedures were in accordance with the SIAT-IRB guidelines and the Declaration of Helsinki.

### Statistical Analyses for C-Group

The baseline characteristics of the patients were compared between the ISA and non-ISA groups using Student's t-test for continuous variables and Pearson's χ² test for categorical variables. Categorical variables, including sex and medical history, were reported as percentages for each group of patients, and continuous variables were reported as mean ± SD.

Regression analyses were performed using linear models to explore the relationships between the Hb count and NIHSS, mRS, BI, and Myodynamia. A small Pr(>|t|) value indicated a significant correlation between Hb counts and the corresponding factors mentioned above.

#### Bioinformatics Analyses for R-Group

(1) Differential expression analyses: the raw sequencing datasets generated in the R-group were analyzed using TopHat2 and Cufflinks. The differentially expressed genes (DEGs) between the two time points for the same patient were identified using Cuffdiff with a q-value not over 0.05. The biological changes of the DEGs were further analyzed using the fold change (FC) method, where FC is the ratio between the two time points based on the average expression values. The common DEGs were selected as initial candidate bio-markers.

(2) Hemoglobin pattern analyses: we selected the 141 hemoglobin related genes that were retrieved using 'hemoglobin' as the keyword in NCBI Gene and that had at least one expression value over 0 in the 8 samples we collected. A co-expression network was constructed by calculating the Pearson correlation coefficient for each pair of the 141 genes. All pairs with a p-value not over 0.05 were kept to construct the HB-network (hemoglobin network).

(3) Candidate HB-network (CHB-network) construction: the common DEGs were mapped into the HB-network. All the genes connecting to the common DEGs in the HB-network were kept to form the CHB-network.

(4) Extended CHB-network (ECHB-network) construction and analyses: all the genes in the CHB-network were mapped into the HB-network. Any genes connecting to any of them were kept to construct the ECHB-network. The topological analyses were performed using the R package "igraph." Each gene in the ECHB-network was considered a candidate bio-marker, and an expression changed score (ECS) was calculated as follows:

$$\text{ECS} = (E_2 - E_1) / E_1$$

Here, E<sub>2</sub> represents the expression value of the gene at time point 2, and E<sub>1</sub> represents the expression value of the gene at time point 1. This score reflects the change each gene undergoes between the two time points, which could help estimate the biological roles of different genes. DAVID was used to perform the functional analyses for these bio-markers.
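The network steps (2) to (4) and the ECS formula can be illustrated with a small Python sketch. It is not the authors' pipeline (they report using R and igraph): the gene names `GENE_A` and `GENE_B`, the toy expression values, and the fixed cutoff on |r| used in place of the p-value filter are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def build_network(expression, r_min):
    """Keep gene pairs whose |r| reaches r_min (stand-in for p <= 0.05)."""
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= r_min:
            edges.add((g1, g2))
    return edges


def extend_by_neighbors(edges, seeds):
    """Return the seed genes plus every gene directly connected to one."""
    seeds = set(seeds)
    extended = set(seeds)
    for a, b in edges:
        if a in seeds:
            extended.add(b)
        if b in seeds:
            extended.add(a)
    return extended


def expression_changed_score(e1, e2):
    """ECS = (E2 - E1) / E1 for one gene between two time points."""
    return (e2 - e1) / e1


# Toy expression profiles (hypothetical values, four samples per gene).
toy_expression = {
    "HBB": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [1.0, 2.0, 4.0, 4.0],
    "GENE_B": [1.0, 1.0, 4.0, 4.0],
}
hb_edges = build_network(toy_expression, r_min=0.9)
chb_genes = extend_by_neighbors(hb_edges, {"HBB"})    # step (3)
echb_genes = extend_by_neighbors(hb_edges, chb_genes)  # step (4)
```

In this toy network `GENE_B` is linked to `GENE_A` but not to `HBB`, so it only enters at the extension step, which mirrors how the ECHB-network grows beyond the CHB-network.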

### Real-Time PCR Analysis for P-Group

The relative differences in expression between the four different groups were measured using ΔΔcycle time (CT) values: the CT values of the genes of interest were semi-quantitatively corrected using the internal reference gene β-actin. The expression of each gene was reported as mean ± SD. The Wilcoxon rank test was used to perform the statistical analysis.
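As a rough illustration of the ΔΔCT correction, the following Python sketch computes a relative expression value. It assumes the conventional 2^(−ΔΔCT) formula with perfect doubling per cycle, which the text does not state explicitly; the function name and the CT values in the usage lines are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression of a target gene via the 2^(-ddCT) convention.

    Each CT of the target gene is first corrected by the CT of the
    reference gene (beta-actin in the text) measured in the same sample.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    # Assumption: amplification doubles the product every cycle.
    return 2.0 ** (-delta_delta)


# Hypothetical CT values: the target crosses the threshold two cycles
# earlier in the patient sample than in the control sample.
fold = relative_expression(22.0, 18.0, 24.0, 18.0)
```

A value above 1 corresponds to up-regulation relative to the control, and a value below 1 to down-regulation.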

# RESULTS AND DISCUSSION

# Baseline Characteristics of Patients in C-Group

The baseline characteristics of the 28 patients in the C-group are listed in **Table 1**. There were no significant differences (p-value over 0.05) between the two groups in sex, lateral upper limb-distal, hypertension, diabetes mellitus, hyperlipemia, hyperuricemia, cerebral arteriosclerosis, atrial fibrillation, coronary disease, and carotid atherosclerosis. The regression analysis between Hb count and these baseline characteristics showed similar results: there were no significant associations between Hb counts and sex, lateral upper limb-proximal, lateral upper limb-distal, or any item of the medical history besides pulmonary infection. Interestingly, the relationships between stroke, Hb counts, and pulmonary infection differed from those of other related diseases, which might be explained by a 'double hit model' proposed to account for stroke-induced respiratory distress syndrome (Mascia, 2009).

In contrast, the patients in the ISA group were older than those in the non-ISA group. Importantly, the higher NIHSS and mRS scores and the lower BI of the ISA group compared with the non-ISA group showed the worse outcome of ISA patients. Besides, the myodynamia scores of the ISA group were lower than those of the non-ISA group. These results demonstrate the important role that anemia plays in recovery from IS: patients with lower Hb counts were likely to have a worse outcome. The regression analysis results listed in **Table 1** show the close relationship between Hb counts and the outcome of IS patients, with smaller Pr(>|t|) values.

# Differential Expression Analyses in R-Group

Differential expression analyses were performed to find bio-markers of IS based on two whole blood samples taken from the same patient at different time points in the R-group. Patients P1 (non-ISA) and P2 (ISA) were chosen for this purpose since their basic clinical characteristics were similar, as follows:


#### TABLE 1 | Baseline characteristics.



Significance codes: <sup>∗</sup> for p-values less than 0.05, ∗∗ for p-values less than 0.01, ∗∗∗ for p-values less than 0.001, and ∗∗∗∗ for p-values less than 0.0001.

TABLE 2 | Patients information.


time points, which suggested that this could be a possible reason for the different outcomes.

Despite the above correlation, we still cannot conclude that the different outcomes were related to Hb counts, since the initial conditions of P1 and P2 were different: the NIHSS of P1 was 2 while that of P2 was 5, showing that the condition of P1 was much better than that of P2 on admission. To exclude this influence, we chose two other patients (marked as P3 and P4) from the non-ISA group with higher NIHSS scores (9 and 10, respectively). Interestingly, we found similar results as follows:


Under the threshold of q-value ≤ 0.05, the numbers of DEGs between the two time points were small: 0 for P1, 8 for P2, 48 for P3, and 30 for P4. The common DEGs between P2 and P3 were PRKCSH, HBD, and HBB. The common DEGs between P2 and P4 were DHPS, CFD, IRF9, FTH1P10, and HBB. HBB was the only DEG common to P2, P3, and P4. The fold change calculated for HBB was positive for the better outcomes of P1 and P3 (2.42268 for P1 and 4.59503 for P3). In contrast, the fold change was negative for the worse outcomes (−3.43035 for P2 and −3.35511 for P4). This indicated that the decrease in HBB expression levels was closely related to the worse outcome of IS, especially in anemia or older IS patients.

FIGURE 2 | Network composed of 23 Hb related genes. Nine genes directly connected with HBB were shown by the black border lines. Different shapes represent the relationships between these genes and IS as follows: HBB, diamond; to be validated, rectangle; stroke biomarker, ellipse; stroke-related risk factor, octagon.

TABLE 3 | Expression changed score and correlations with IS outcome for seven hemoglobin related genes.

# Correlations Between Candidate Bio-Markers and IS Outcome

The HB-network was composed of 141 nodes (hemoglobin related genes) and 977 edges between them. HBB was found to be connected to nine genes: PIEZO1, FGF23, ELANE, PRTN3, FN3KRP, TERT, HFE, TNF, and REN (see **Figure 2** for details). These 10 genes formed the CHB-network. Six of these genes were shown to be related to stroke as follows: (1) PIEZO1 was involved in hypertension-dependent arterial re-modeling (Retailleau et al., 2015). (2) FGF23 was a risk factor for overall stroke (Wright et al., 2014). (3) TERT was a determinant of risk of IS in the Atherosclerosis Risk in Communities (ARIC) study (Bressler et al., 2015). (4) HFE may play a role in modifying the relationship between smoking and stroke (Njajou et al., 2002). (5) TNF was differently expressed between stroke patients and controls (Bokhari et al., 2014). (6) REN was an important part of the renin angiotensin aldosterone system (RAAS), which played important roles in acute IS. Besides, the change of RAAS was shown to be related to increased blood pressure (Back et al., 2015).

We further extended the CHB-network by finding the neighbors of the 9 nodes and obtained 14 other nodes: AGT, CD59, CPM, ICAM1, MMP9, POU5F1, PRKCA, SMG5, UGT1A1, CSNK2A1, RASA4, GP1BA, TUBGCP3, and PKM (see **Figure 2** for details). These 24 genes formed the ECHB-network. Their functional analysis results based on KEGG pathways using DAVID are shown in **Supplementary Table S2**. Seven genes were shown to be related to stroke as follows: (1) AGT: its polymorphisms were shown to be related to the risk of IS in a meta-analysis in the Chinese population (Gao et al., 2015). (2) CD59: 50% of patients with three additional mutations in CD59 were shown to have recurrent strokes (Tabib et al., 2017). (3) ICAM1: a neuro-inflammatory biomarker in post-stroke (Deddens et al., 2013). (4) MMP9: the combination therapy of an MMP-9 inhibitor along with tPA was shown to be beneficial in IS (Chaturvedi and Kaczmarek, 2014). (5) PRKCA was related to blood pressure (Shahin and Johnson, 2016). (6) RASA4 was involved in the Ras signaling pathway, which contributed to neuro-protective signaling cascades in stroke (Shi et al., 2011). (7) GP1BA was considered one of the candidate 'stroke risk' genes affecting hemostasis (Stankovic and Majkic-Singh, 2010).

As shown in **Table 2**, the patients P1 and P3 had better outcomes compared with P2 and P4. We then compared the ranking of these genes based on their ECSs to find the correlations between them and the clinical outcome. As shown in **Table 3**, six genes were shown to be positively correlated with the recovery degree of IS patients: ELANE, FGF23, HBB, PIEZO1, RASA4, and PRTN3. The gene CPM was shown to be negatively correlated with clinical outcome. Of these, the relationships between stroke and PIEZO1, FGF23, and RASA4 had been validated in previous studies as mentioned above (Shi et al., 2011; Wright et al., 2014; Retailleau et al., 2015).

# Biological Validation of Candidate Bio-Markers

Real-time PCR was performed for the following genes, which were identified as candidate bio-markers without supporting literature: ELANE, HBB, PRTN3, CPM, FN3KRP, POU5F1, SMG5, UGT1A1, CSNK2A1, TUBGCP3, and PKM. Of these, ELANE, HBB, PRTN3, POU5F1, and PKM were shown to be up-regulated in IS patients in both the ISA and non-ISA groups compared with the control group. Interestingly, the ranking of the genes' mean expression values was: ISA, non-ISA, TIA, control (see **Figures 3A–E** for details). In contrast, CPM showed the opposite trend, with the mean expression ranking: control, TIA, non-ISA, ISA (see **Figure 3F**), which was in accordance with our results listed in **Table 3**. Taken together, we believe that these genes could be considered new predictors for the recovery of IS patients.

#### ETHICS STATEMENT


The study was approved by SIAT Institutional Review Committee with IRB number SIAT-IRB-16515-H0107 and all the procedures were in accordance with the SIAT-IRB guidelines and the Declaration of Helsinki.

## AUTHOR CONTRIBUTIONS

YW performed the bio-informatics analyses and wrote the manuscript. XH collected all the blood samples for sequencing. JL analyzed the clinical information. XZ performed the PCR experiment. HY directed the clinical analyses and revised the manuscript. YC directed the bio-informatics analyses and revised the manuscript. All authors agreed to be accountable for the content of the work.

# FUNDING

This study was partly funded by National Natural Science Foundation of China (61702496 and 81601575), Shenzhen Technology R&D Program (JSGG20170413152936281), and Shenzhen Science and Technology Research Funding (20170502171625936).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00456/full#supplementary-material

TABLE S1 | Clinical information of patients.

TABLE S2 | Common pathways of different groups.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Huang, Liu, Zhao, Yu and Cai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# NCNet: Deep Learning Network Models for Predicting Function of Non-coding DNA

#### Hanyu Zhang<sup>1,2</sup>, Che-Lun Hung<sup>3,4,5,6</sup>\*, Meiyuan Liu<sup>7</sup>, Xiaoye Hu<sup>7</sup> and Yi-Yang Lin<sup>6</sup>

<sup>1</sup> College of Computing and Informatics, Providence University, Taichung City, Taiwan, <sup>2</sup> Labo MICS, École CentraleSupélec, Université Paris Saclay, Gif-sur-Yvette, France, <sup>3</sup> Department and Graduate Institute of Computer Science and Information Engineering, Chang Gung University, Taoyuan City, Taiwan, <sup>4</sup> Division of Rheumatology, Allergy and Immunology, Chang Gung Memorial Hospital, Taoyuan City, Taiwan, <sup>5</sup> AI Innovation Research Center, Chang Gung University, Taoyuan City, Taiwan, <sup>6</sup> Department of Computer Science and Communication Engineering, Providence University, Taichung City, Taiwan, <sup>7</sup> Affiliated Cancer Hospital & Institute of Guangzhou Medical University, Guangzhou, China

The human genome consists of 98.5% non-coding DNA sequences, and most of them have no known function. However, a majority of disease-associated variants lie in these regions. Therefore, it is critical to predict the function of non-coding DNA. Hence, we propose the NCNet, which integrates deep residual learning and sequence-to-sequence learning networks, to predict the transcription factor (TF) binding sites, which can then be used to predict non-coding functions. In NCNet, deep residual learning networks are used to enhance the identification rate of regulatory patterns of motifs, so that the sequence-to-sequence learning network may make the most out of the sequential dependency between the patterns. With the identity shortcut technique and deep architectures of the networks, NCNet achieves significant improvement compared to the original hybrid model in identifying regulatory markers.

#### Edited by:

Dariusz Mrozek, Silesian University of Technology, Poland

#### Reviewed by:

Hai Jiang, Arkansas State University, United States Leyi Wei, Tianjin University, China

\*Correspondence: Che-Lun Hung clhung@mail.cgu.edu.tw

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

> Received: 28 February 2019 Accepted: 24 April 2019 Published: 29 May 2019

#### Citation:

Zhang H, Hung C-L, Liu M, Hu X and Lin Y-Y (2019) NCNet: Deep Learning Network Models for Predicting Function of Non-coding DNA. Front. Genet. 10:432. doi: 10.3389/fgene.2019.00432

Keywords: Non-coding DNA, residual learning, LSTM, sequence to sequence learning, deep learning

# 1. INTRODUCTION

Owing to the rapid development of next-generation sequencing (NGS) technologies, sequencing data at various scales can be produced in days. Large amounts of omics data, including genomics, transcriptomics, proteomics, and metabolomics data, have accumulated rapidly. Biologists can utilize these datasets to extract knowledge (Mrozek et al., 2016; Małysiak-Mrozek et al., 2018; Mrozek, 2018). Machine learning (ML) algorithms have been applied to various bioinformatics applications (Mohri et al., 2012), resulting in significant improvement. In particular, ML algorithms such as linear and logistic regression, random forests, hidden Markov models, Bayesian networks, Gaussian networks, and support vector machines are most commonly used in gene function prediction.

Recently, deep neural networks (DNNs), also known as deep learning, have proven superior to traditional ML algorithms in most applications aimed at finding patterns in training data and building models to make predictions (Hinton et al., 2012; Krizhevsky et al., 2012). Typically, supervised deep learning algorithms learn a model from labeled training data, and the learned model is then used to predict labels for unseen data (Mohri et al., 2012). As a result of the rapid growth of graphics processing unit (GPU) hardware technologies by NVIDIA, numerous deep neural networks have been proposed. In 2012, AlexNet (Krizhevsky et al., 2012), the first deep convolutional neural network (CNN) approach using a GPU, was introduced for image classification. Since then, various new architectures have been proposed, including VGG (Simonyan and Zisserman, 2014), NiN (Lin et al., 2013), Inception (Chen et al., 2017), ResNet (He et al., 2015), DenseNet (Huang et al., 2016), NASNet (Zoph et al., 2017), and SENet (Hu et al., 2017). The top-1 classification accuracy on ImageNet (Russakovsky et al., 2014) has increased from 62.5% (AlexNet) to 82.7% (NASNet-A). All of these networks are based on a CNN. Another useful family of deep learning networks is recurrent neural networks (RNNs), which have been applied with tremendous success (Graves et al., 2013; Bahdanau et al., 2014; Xu et al., 2015) in tasks such as speech recognition, neural machine translation, and image caption generation. To improve the training efficiency of RNNs, units such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (GRU) (Cho et al., 2014) have been proposed to control the gradient information in the training procedure. LSTM is one of the most well-known RNN units and has been applied in many deep learning applications (Graves et al., 2013; Kalchbrenner et al., 2015; Danihelka et al., 2016). LSTM can reduce the vanishing or exploding gradient problem in RNNs with gates that are used to memorize past information.

Following these successes, deep learning approaches have been introduced into the bioinformatics domain to improve the performance of prediction and classification tasks. For example, CNNs surpass previous algorithms such as support vector machines and random forests in predicting protein binding and accessibility from a DNA sequence (Alipanahi et al., 2015). DeepSEA (Zhou and Troyanskaya, 2015) is a useful tool for predicting the chromatin effects of sequence alterations with single-nucleotide sensitivity. It adopts a CNN to learn a regulatory sequence code from large-scale chromatin-profiling data. DeepBind (Alipanahi et al., 2015) is another useful tool for discovering the sequence specificities of DNA- and RNA-binding proteins based on patterns learned from experimental data using CNNs. In 2016, Quang and Xie (2016) proposed the DanQ model, which is similar to DeepSEA but is a hybrid framework integrating a CNN and a bi-directional LSTM RNN for predicting non-coding function de novo from a sequence. Kelley et al. (2016) introduced Basset, a tool that uses CNNs to learn the functional activities of DNA sequences and thereby predict their accessibility. More recently in 2017, Wei et al. proposed the DeepPSL predictor (Wei et al., 2018), which is based on stacked auto-encoder networks and learns high-level feature representations of proteins to predict protein subcellular localization without handcrafted features. Later in 2018, Wei et al. developed the DeepM6APred predictor (Wei et al., 2019), in which a support vector machine is trained on features extracted by a deep belief network together with handcrafted features, to improve the prediction of N6-methyladenosine (m6A) sites. All the models above either introduce a new method or outperform existing methods in accomplishing their tasks.

In this work, we propose several enhancements to the convolutional part of the hybrid framework proposed in the DanQ model. In particular, we employ the identity shortcut technique proposed in ResNet, as it significantly reduces the difficulty of training a very deep neural network and successfully improves the performance of convolutional networks. We confirm that the depth of deep neural networks is crucial in improving performance, as a deep architecture can build a better representation of the underlying problem than a shallow one. We also investigate how reversing the arrangement of the convolutional and recurrent parts in the hybrid framework may improve the performance.

In the remainder of this paper, we first introduce the materials that we use and describe the proposed models in detail in section 2. Then, in section 3, we report the evaluation results of the proposed models in comparison with a re-implemented DanQ model on several metrics. In the same section, we also analyze and discuss the improvements obtained by the proposed models, and we investigate their costs in terms of time and space. Finally, we conclude in section 4 that the proposed models are valuable enhancements that outperform the original hybrid model, and we suggest some possible directions for future work.

### 2. MATERIALS AND METHODS

### 2.1. Features and Data

Our models use the segmented GRCh37 reference genome as the data. Target TF bindings are computed from the intersections of the ChIP-seq and DNase-seq peak sets, which are uniformly processed from the ENCODE (The ENCODE Project Consortium, 2012) and Roadmap Epigenomics (Roadmap Epigenomics Consortium et al., 2015) data. In brief, we use the same dataset that was used for the DanQ (Quang and Xie, 2016) model. The complete dataset is divided into three non-overlapping sets: a training set (4,400,000 samples), a validation set (8,000 samples), and a testing set (455,024 samples). The former two are used in the training phase, whereas the last one is used in the testing phase.

In this dataset, each 1,000-bp input genome fragment is transformed into a 1,000 × 4 one-hot encoded vector; part (15 bp) of an imaginary sample is shown in **Figure 1**. Components from top to bottom correspond to the nucleobases adenine (A), cytosine (C), guanine (G), and thymine (T), respectively. When one of the nucleobases appears, the corresponding component is set to 1 and the others are set to 0. For each of the 1,000-bp genome fragments, 919 target TF bindings are labeled as "True" or "False" in a fixed order to denote their presence or absence. The distribution of target TF bindings for the first 1,838 samples extracted from the training set is illustrated on the left of **Figure 2**, in which each column represents a sample and each row represents a target TF binding; hence, a black spot denotes the presence of the target in that row for the sample in that column. We also give the probability density of the occurrence of every target TF binding in the middle of **Figure 2**, from which we can easily deduce that the dataset is quite imbalanced, as the majority of the targets are observed in <5% of the training samples. Moreover, more than half of the targets occur in fewer than 2.5% of the samples. On the right of **Figure 2**, we plot the histogram of the number of targets possessed by each sample in the training set; most samples have only a few of the target TF bindings. The testing set has statistics similar to those of the training set; therefore, we omit them here.
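The encoding described above can be sketched in a few lines of Python. This is only an illustration of the A, C, G, T ordering given in the text; the function name and the choice to map any other character (for example `N`) to an all-zero row are assumptions, since the text does not say how ambiguous bases are handled.

```python
BASES = "ACGT"  # row order given in the text: A, C, G, T


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-component rows, one per base."""
    encoded = []
    for base in sequence.upper():
        # Assumption: characters outside A/C/G/T give an all-zero row.
        encoded.append([1 if base == b else 0 for b in BASES])
    return encoded


# A 1,000-bp fragment becomes a 1,000 x 4 structure.
fragment = "ACGT" * 250
encoded_fragment = one_hot_encode(fragment)
```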

#### 2.2. NCNet Models

We derive three novel hybrid networks, aiming to enhance the performance of the hybrid DanQ model by employing techniques developed recently in deep learning. We retain the main idea of introducing a bi-directional LSTM network in the framework; however, we modify the convolutional part by applying deep residual CNNs, which have proven successful in many domains, such as image classification and object detection. We also attempt to reverse the arrangement of the recurrent part and the convolutional part in the hybrid framework. The comparisons show that our models are either better in some aspects or comparable to the DanQ model.

#### 2.2.1. Re-implementation of DanQ Model

At the time of writing, our implementation uses the latest version of the Keras library (2.2.4) with the latest version of the Theano backend (1.0.2) on Python 3.5.6. The original implementation of the DanQ model is no longer compatible with these more recent libraries. Hence, we re-implement the DanQ model, and we refer to this model as the r-DanQ model in the rest of the paper. We then re-build the model with the training set and set the performance baseline with the testing set in our environment; therefore, the results are not exactly the same as in the original paper.

For completeness, we concisely describe the DanQ model; an illustration of the model is given in **Figure 3**. The input layer regards the one-hot encoded vector of a 1,000-bp genome fragment as a linear sequence of fixed length 1,000 with 4 channels and feeds it to the following regular convolution layer to extract the local consecutive spatial features. This single convolution layer constitutes the convolutional part of the hybrid framework. A max pooling layer is then used to reduce the length of the sequence before connecting to the bi-directional LSTM layer, which constitutes the recurrent part of the framework. The main reason for including a recurrent network is that the regulatory grammar and repetitive combinations extracted by the convolutional layer may be recognized more easily by the bi-directional LSTM, because it can, in some sense, remember important patterns seen previously. Finally, the output of the recurrent layer is flattened and fully connected to a dense layer, which is in turn fully connected to the multi-task output layer of 919 nodes, each of which conducts a binary classification for the corresponding target TF binding. There are also two dropout layers, one before and one after the recurrent bi-directional LSTM layer, which are represented by dash-dot arrows in the illustration.
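To make the data flow through these layers concrete, the following Python sketch walks the tensor shapes through the described pipeline. It is not the authors' Keras code. The hyperparameter defaults (320 kernels of width 26, pooling window and stride 13, 320 LSTM units per direction, 925 dense units) are not given in this section; they are assumptions based on the values commonly reported for the original DanQ model, and only the input length 1,000, the 4 channels, and the 919 outputs come from the text.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1


def max_pool_length(length, pool_size, stride):
    """Output length of a 1D max pooling layer without padding."""
    return (length - pool_size) // stride + 1


def danq_shapes(seq_len=1000, n_kernels=320, kernel_size=26, pool=13,
                lstm_units=320, dense_units=925, n_targets=919):
    """List (layer name, output shape) pairs for a DanQ-like stack."""
    conv_len = conv1d_valid_length(seq_len, kernel_size)
    pooled_len = max_pool_length(conv_len, pool, pool)
    return [
        ("input", (seq_len, 4)),
        ("conv1d", (conv_len, n_kernels)),
        ("max_pool", (pooled_len, n_kernels)),
        # A bi-directional LSTM concatenates forward and backward outputs.
        ("bi_lstm", (pooled_len, 2 * lstm_units)),
        ("flatten", (pooled_len * 2 * lstm_units,)),
        ("dense", (dense_units,)),
        ("output", (n_targets,)),
    ]


layer_shapes = dict(danq_shapes())
```

Under these assumed values the pooling step shortens the sequence from 975 to 75 positions, which is what keeps the recurrent part tractable.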

#### 2.2.2. NCNet-RR Model (Residual Then Recurrent Network Model)

It is known that deep neural networks tend to build hierarchical features along the layers (Goodfellow et al., 2016). Base features are first constructed in the low-level convolution layers; the high-level layers then build on them to form more abstract features that contain rich information. Furthermore, by stacking layers, the network indirectly extends the span of the kernels so that it can learn more local spatial information. However, a deep network is more difficult to train. For a classical CNN, as shown in He et al. (2015), a shallow network can yield better results than a deep network because the latter is more prone to local optima. However, the authors of that work found that this phenomenon could be partially overcome by adding a shortcut connection linking the beginning of a convolution block directly to the end of the same block, then combining both streams of information before flowing to the next block. Such a block is shown in **Figure 4A**. Conventionally, the shortcut is an identity mapping and the combination method is addition; therefore, the input and the output must be congruent. In such a case, the classical flow of a convolutional layer only needs to learn the optimal residual part of the target function, which facilitates the learning phase and helps a deeper network work much better. Such a block is called a residual block. This design helped the deep residual network win first place in the ILSVRC 2015 classification task.

Hence, we borrow this idea to enhance the convolutional part of the DanQ model, expecting a better result. We replace the single convolution layer with a stack of two residual blocks. In each block, the following layers are connected in order: a 1D convolution layer, a batch normalization layer, a ReLU activation layer, and finally another 1D convolution layer with another batch normalization layer. We keep the other parts intact, as in the r-DanQ model. We refer to this Residual then Recurrent network model as the NCNet-RR model in the following text.
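The identity shortcut can be illustrated with a deliberately simplified Python sketch. It assumes a single channel, zero padding that preserves the sequence length, and odd-sized kernels, and it omits batch normalization entirely; the kernel values are hypothetical. It is meant to show the addition of the shortcut, not to reproduce the NCNet-RR layers.

```python
def conv1d_same(x, kernel):
    """Single-channel 1D convolution with zero padding (odd kernel size)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(x))]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def residual_block(x, kernel_1, kernel_2):
    """conv -> ReLU -> conv, then add the unchanged input (the shortcut).

    Batch normalization, which the text places after each convolution,
    is left out to keep the sketch short.
    """
    residual = conv1d_same(relu(conv1d_same(x, kernel_1)), kernel_2)
    # Input and residual have the same length, so they can be added.
    return [r + s for r, s in zip(residual, x)]


signal = [1.0, -2.0, 3.0]
# With all-zero kernels the residual branch vanishes and the block
# reduces to the identity mapping.
identity_out = residual_block(signal, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
# With pass-through kernels the rectified signal is added to the input.
passthrough_out = residual_block(signal, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

The all-zero case shows why such blocks are easier to optimize: a block that has learned nothing still passes its input through unchanged instead of degrading it.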

#### 2.2.3. NCNet-bRR Model (Bottleneck Residual then Recurrent Network Model)

However, a deeper network usually has more weights to train than a shallow one with similar blocks, which leads to more time for both training and testing. This in turn limits the number of layers that can be implemented in the former model. To overcome this problem, we employ the "bottleneck" design to reduce the input/output dimensions. An illustration of the bottleneck design is given in **Figure 4B**. The main idea is to use three 1D convolutions instead of two, with the first and last convolutions equipped with kernels of size one. These two convolutions simply reduce and restore the channels for the middle convolution. In this way, the number of weights is cut down in favor of more layers. We also decide to reduce the kernel size of the middle convolution compared to that in the r-DanQ model, as a deep CNN may compensate for a smaller kernel size with its depth when accessing contiguous local spatial data. With such a reduction, we are able to stack 8 bottleneck residual blocks before connecting to the recurrent part, which is kept the same. We refer to this bottleneck Residual then Recurrent network model as the NCNet-bRR model.
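The saving in weights can be checked with simple arithmetic, sketched here in Python. The channel counts and kernel size in the usage lines are hypothetical values chosen for illustration, not the settings of NCNet-bRR, and bias terms and batch normalization parameters are ignored.

```python
def conv1d_weights(kernel_size, in_channels, out_channels):
    """Number of kernel weights in a 1D convolution (biases ignored)."""
    return kernel_size * in_channels * out_channels


def plain_block_weights(channels, kernel_size):
    """Two full-width convolutions, as in the plain residual block."""
    return 2 * conv1d_weights(kernel_size, channels, channels)


def bottleneck_block_weights(channels, bottleneck_channels, kernel_size):
    """Size-one reduce, narrow middle convolution, size-one restore."""
    reduce = conv1d_weights(1, channels, bottleneck_channels)
    middle = conv1d_weights(kernel_size, bottleneck_channels,
                            bottleneck_channels)
    restore = conv1d_weights(1, bottleneck_channels, channels)
    return reduce + middle + restore


# Hypothetical setting: 64 channels squeezed to 16, kernel size 3.
plain = plain_block_weights(64, 3)
bottleneck = bottleneck_block_weights(64, 16, 3)
```

In this toy setting the bottleneck block needs roughly a ninth of the weights of the plain block, which is the budget that allows more blocks to be stacked.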

#### 2.2.4. NCNet-RbR Model (Recurrent Then Bottleneck Residual Network Model)

In the previous models, the recurrent part, which mostly consists of a bidirectional LSTM layer, comes second after the convolutional part, as if it were trying to learn the grammar translated by the convolutional transformations. However, it also makes sense to learn the grammar embedded in the raw DNA fragments directly with a recurrent network, without a convolutional transformation. The convolutional part can then be used to interpret these globally learned patterns locally. Because the region that a convolution node can access is limited by the kernel size and depth, important information may be missed simply because the kernel is not big enough and/or the network is not deep enough, whereas a recurrent network looks at the data globally and may avoid such shortcomings. Hence, the rationale behind reversing the arrangement of the two network parts is that the recurrent part may first recognize important sequences in the DNA fragments and make it easier for the convolutional part to combine this global and local spatial information. As an analogy, we may regard the patterns found by the recurrent part as "alphabet letters," which the convolutional part then combines into meaningful "words." We also employ the bottleneck design of residual networks for the convolutional part and keep the other parts as intact as possible. However, appropriate modifications must be made to the connecting parts: the bidirectional LSTM is fed directly with the complete input sequences, no dropout is applied before the recurrent part, and a global max pooling layer is used instead of a flatten layer to connect the convolution layer to the dense layer. We refer to this Recurrent then bottleneck Residual network model as NCNet-RbR in the rest of the paper.
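The difference between a flatten layer and a global max pooling layer can be illustrated with a small sketch. The feature map below (3 sequence positions, 2 channels) is a made-up toy input, and the two helper functions are hypothetical stand-ins for the corresponding layers rather than the Keras implementations.

```python
def flatten(feature_map):
    """Concatenate all positions: output size is length * channels."""
    return [value for position in feature_map for value in position]


def global_max_pool(feature_map):
    """Keep the maximum over positions per channel: output size is channels."""
    return [max(channel) for channel in zip(*feature_map)]


# Toy feature map: 3 positions x 2 channels (illustrative values).
feature_map = [[0.1, 0.5], [0.7, 0.2], [0.3, 0.9]]
print(len(flatten(feature_map)), global_max_pool(feature_map))
```

Because the pooled output no longer grows with the sequence length, the dense layer that follows needs fewer weights, which is one plausible reason for the small model size of this arrangement.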

# 2.3. Training Method

All models are initialized with the built-in Glorot uniform random initializer (Glorot and Bengio, 2010) in Keras, and the RMSprop algorithm (Tieleman and Hinton, 2012) is then applied to train each model for at most 60 epochs, with each mini-batch consisting of 100 samples. To prevent overfitting, the validation set is used to stop training early if the loss does not improve for five consecutive epochs.
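The stopping rule can be sketched in plain Python as follows. This is only an illustration of the logic (at most 60 epochs, patience of five epochs on the validation loss); the function name and the sequence of validation losses are hypothetical, and in practice the rule would be supplied by the deep learning framework rather than written by hand.

```python
def epochs_until_stop(val_losses, max_epochs=60, patience=5):
    """Return (last_epoch_run, best_epoch) under an early-stopping rule.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far, or after `max_epochs`.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return last_epoch, best_epoch


# Made-up validation losses for illustration.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.7, 0.75, 0.73, 0.9, 0.6]
print(epochs_until_stop(losses))
```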

# 2.4. Environments

Our implementations are written in Python 3.5.6 with Keras 2.2.4 running on Theano 1.0.2. The experiments were carried out on a computer running Ubuntu 16.04.5 x86\_64 with kernel version 4.4.0-131-generic, equipped with two Intel Xeon E5-2620 v4 CPUs with 16 processors clocked at 2.10 GHz, 2 MBytes of L3 cache, 256 KBytes of L2 cache, 64 KBytes of L1 cache, and 128 GBytes of RDIMM main memory at a clock speed of 2,400 MHz. In addition, two NVIDIA Tesla P100 12GB GPUs were used to accelerate the experiments. NVIDIA driver v384.130 was installed, and version 7101 of the cuDNN (NVIDIA CUDA Deep Neural Network) library was used to exploit the parallel computational power of these GPUs.

# 3. RESULTS AND DISCUSSION

In this section, we evaluate the performance of the three proposed models and compare them with the r-DanQ model, which is used as the baseline. All models are tested on the same testing dataset. For each target, a separate confusion matrix is calculated, and we first compute several conventional metrics to evaluate the performance: accuracy, sensitivity, specificity, F1 score, and the areas under the Receiver Operating Characteristic curve (ROC AUC) and the Precision-Recall curve (PR AUC). As there are 919 targets in total, it is reasonable to use a weighted average for each of these metrics, where the weights are the percentages of positive occurrences of the corresponding targets in the training set, after normalization, since it is meaningful to give less weight to extremely imbalanced datasets than to more balanced ones. As the middle part of **Figure 2** shows, the black spots representing positive samples are sparsely distributed, and even in the most balanced dataset the positive samples do not exceed 20%. Hence, accuracy and specificity are not very useful criteria, as they may remain high even with a random predictor. Moreover, in our case, the presence of a target TF binding is a much more important indicator than its absence; hence, it is natural to pay more attention to the other criteria. However, even with a weighted scheme, the sensitivity is much lower than the other criteria for all models, again owing to the imbalanced dataset. Therefore, we are mostly interested in how much the enhancement of the convolutional part and the reversed arrangement of the framework improve the performance relative to the r-DanQ model, so we set it as the baseline, with all of its scores defined as 100%.
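The weighting scheme and the normalization against the baseline might be computed as in the minimal sketch below. The function names are hypothetical and the two-target example uses made-up numbers; only the idea (weights proportional to each target's share of positive samples, scores reported relative to a baseline fixed at 100%) is taken from the text.

```python
def weighted_average(metric_per_target, positive_rate_per_target):
    """Average a metric over targets, weighting each target by its
    normalized share of positive samples in the training set."""
    total = sum(positive_rate_per_target)
    weights = [rate / total for rate in positive_rate_per_target]
    return sum(w * m for w, m in zip(weights, metric_per_target))


def relative_score(model_value, baseline_value):
    """Express a score as a percentage of the baseline score."""
    return 100.0 * model_value / baseline_value


# Made-up example: a very imbalanced target and a more balanced one.
sensitivity = [0.9, 0.5]
positive_rate = [0.01, 0.19]
avg = weighted_average(sensitivity, positive_rate)
print(avg, relative_score(avg, 0.40))
```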

The resulting relative performance is listed in **Table 1**, which outlines the comparison among the four models. A score <100% means that the model performs worse than the r-DanQ model on the corresponding metric. In this respect, the NCNet-RR model is only comparable to the r-DanQ model, since it outperforms the r-DanQ model only in ROC AUC and PR AUC and declines slightly in the others. The other two models generally outperform the r-DanQ model except in specificity; at a small cost of <1% in specificity, they gain a large improvement in sensitivity, which is 43.1% for the NCNet-bRR model and 78.5% for the NCNet-RbR model. The F1 score is also raised considerably, as it is the harmonic mean of precision and sensitivity. In fact, ROC AUC and PR AUC are two criteria usually considered more suitable for evaluating binary classification tasks than the former four. As we use a sigmoid activation for the output layer, the outputs may be interpreted as the probability of presence of the target, and they must therefore be binarized according to a threshold to make a prediction. The ROC and PR curves reflect how the other criteria change as the threshold varies from 0 to 1, so the AUCs are overall measurements, whereas the other criteria only report the performance at a single threshold of 0.5.

TABLE 1 | Relative performance of NCNet models compared to r-DanQ model.

Models Accuracy(%) Sensitivity(%) Specificity(%) F1 score(%) ROC AUC(%) PR AUC(%)
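The dependence of the point metrics on the threshold can be seen in a short sketch. The helper below is a hypothetical illustration with made-up labels and probabilities for one imbalanced target; it binarizes the sigmoid outputs at a given threshold and derives the metrics from the resulting confusion matrix.

```python
def point_metrics(y_true, y_prob, threshold=0.5):
    """Binarize probabilities at `threshold` and compute point metrics."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}


# Made-up outputs for one target: 3 positives among 10 samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05, 0.3, 0.45]
at_half = point_metrics(y_true, y_prob, threshold=0.5)
at_quarter = point_metrics(y_true, y_prob, threshold=0.25)
print(at_half, at_quarter)
```

Lowering the threshold raises sensitivity at the expense of specificity in this toy case, which is the trade-off that the ROC and PR curves summarize over all thresholds.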

Since all NCNet models are preferable to the r-DanQ model in terms of ROC AUC and PR AUC, we examine these two metrics more closely. The ROC and PR curves of several individual target TF bindings are shown row by row in **Figure 5**. These curves are typical of targets whose datasets are not extremely imbalanced. The average results for the ROC and PR curves are given at the right end of the corresponding rows in the figure.

Generally speaking, deeper models perform better than shallow ones. As the ROC curves show, the NCNet-bRR and NCNet-RbR models perform roughly equally, and both perform better than the NCNet-RR model, which in turn is slightly better than the r-DanQ model. The PR curves show the same ordering, but in some cases the NCNet-RbR model beats the NCNet-bRR model by a reasonable margin. This result implies that both the bottleneck design and small kernels in the convolution blocks, which trade for network depth, are valuable for enhancing the hybrid framework.

However, four individual observations are not enough to consolidate this implication, so we examine all individual targets statistically. We calculate the percentage of targets for which each NCNet model outperforms the r-DanQ model on the testing dataset and report the ratios in **Table 2**. The result is consistent with **Table 1**. Notice that both the NCNet-bRR and NCNet-RbR models perform worse than the r-DanQ model on almost all targets in terms of specificity; however, since their weighted average specificity scores are very close to that of the r-DanQ model, the decline in specificity for each target must be negligible. In exchange, the two models perform better than the r-DanQ model in sensitivity on nearly or more than 60% of the targets. Such a trade-off is quite favorable. The ratios are even higher for ROC AUC and PR AUC: by these two metrics, all three NCNet models improve the performance on a large majority of the targets, and for the NCNet-bRR and NCNet-RbR models the ratios reach as high as 95%. Therefore, the observations firmly support the implication mentioned above.
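The per-target comparison amounts to counting wins against the baseline. A minimal sketch follows, with a hypothetical function name and made-up scores; treating ties as "not better" is an assumption of this sketch, since the text does not say how ties are handled.

```python
def percent_targets_better(model_scores, baseline_scores):
    """Percentage of targets on which the model strictly beats the baseline."""
    wins = sum(m > b for m, b in zip(model_scores, baseline_scores))
    return 100.0 * wins / len(baseline_scores)


# Made-up per-target ROC AUC values for illustration.
model_auc = [0.80, 0.70, 0.90, 0.60]
baseline_auc = [0.75, 0.72, 0.85, 0.60]
print(percent_targets_better(model_auc, baseline_auc))
```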

An interesting point is that the NCNet-bRR model beats the r-DanQ model on more targets than the NCNet-RbR model does on all metrics, yet the NCNet-RbR model has higher weighted average scores than the NCNet-bRR model on most metrics. We therefore visualize the extent of the improvements with a series of scatter plots comparing the NCNet models with the r-DanQ model for ROC AUC and PR AUC, so that we can investigate exactly how much improvement the NCNet models make for each target. In **Figure 6**, points above the anti-diagonal segment represent targets on which the NCNet models outperform the r-DanQ model; the further a point lies vertically from the anti-diagonal segment, the bigger the improvement. The figure also gives two overlapped histograms of the relative improvements for ROC AUC and PR AUC, respectively. A remark is needed for the points near the bottom-left corner for PR AUC, as they stand for small AUC values: we exclude those points from the histogram if the r-DanQ model's value is less than 0.2. Although the NCNet-bRR model outperforms the r-DanQ model on more targets, it is now clear that the NCNet-RbR model generally improves more than the NCNet-bRR model on those targets where it beats the r-DanQ model, especially for PR AUC, which explains our observation. We conjecture that some useful information may be lost when data are passed through the max pooling layer and dropout layer before reaching the bidirectional LSTM layer, information that could have been captured if the raw data were fed directly to the recurrent layer.

TABLE 2 | Percentage of targets that perform better with NCNet models than r-DanQ model.

TABLE 3 | Time and Spatial cost of the models.

In addition to the comparisons of conventional metrics above, we also consider the cost in terms of both time and space, as listed in **Table 3**. The modification to the convolutional part in the NCNet-RR model is disappointing: it takes much more time to train or to make predictions and requires more space to store the model than the r-DanQ model, yet it brings only limited improvements, which suggests that the depth of the convolution network is crucial to the performance. For the NCNet-RbR model, the modification requires even more time than the NCNet-RR model, owing to the width of the recurrent layer, which is fed with raw DNA fragments. However, the improvements cannot be ignored, and the model size is reduced to <5% of that of the r-DanQ model. It is therefore suitable for cases where storage is the main concern, and it retains potential as fast training methods for recurrent networks develop. In contrast, the enhancement introduced in the NCNet-bRR model is quite successful in terms of both time and space. Compared to the r-DanQ model, it not only takes <20% of the space to store the model, but also cuts the training time by half and the prediction time for unseen samples by two thirds. Considering that its performance is similar to that of the NCNet-RbR model, it should be the best option in most cases.

### 4. CONCLUSION AND PERSPECTIVES

In this work, we mainly explore modifying the convolutional part of the hybrid framework with a deep residual network to enhance performance. In conclusion, all the proposed NCNet models, especially the NCNet-bRR and NCNet-RbR models, outperform the r-DanQ model in identifying TF binding sites directly from noncoding DNA fragments in terms of ROC AUC and PR AUC. Their outputs are therefore appropriate candidates as input data for the subsequent phases in DeepSEA to predict chromatin effects or variant functionality. We also confirmed that the depth of the convolution network is crucial for improving performance. Since the bottleneck design and small kernels are effective techniques for trading for depth, they should be considered whenever possible. Moreover, the residual convolution network is most beneficial when it runs deep. Besides, the bottleneck design and small kernels may even save time and space if the whole hybrid framework is connected appropriately. We also show that it is possible to reverse the arrangement of the convolutional and recurrent parts of the hybrid framework and maintain similar performance or even achieve better results, although the reversed model may take much longer to train or to predict while effectively reducing the model size.

Given the success of applying the residual convolution network in the hybrid framework, we believe that, in principle, other powerful convolution networks could play this role as well as the residual network or even better. This idea may also be applied to the recurrent part through some enhancement. Thus, we suggest testing other combinations of convolution and recurrent networks to improve performance in the future. Moreover, as the arrangement of the different parts of the framework matters, one may even add multiple convolutional and recurrent parts to the hybrid framework and try different arrangements of them.


#### DATA AVAILABILITY

The datasets generated for this study can be found in ENCODE (The ENCODE Project Consortium, 2012), https://www.encodeproject.org/, or with the DanQ model (Quang and Xie, 2016), http://github.com/uci-cbcl/DanQ.

#### AUTHOR CONTRIBUTIONS

HZ and C-LH designed the models, experiments, and revised the manuscript. HZ implemented the models. C-LH wrote the Introduction section, whereas HZ wrote the remaining part of manuscript text. Y-YL carried out these experiments. ML and XH verified the experimental results.

#### ACKNOWLEDGMENTS

The research is supported by the Ministry of Science and Technology under the grants MOST 107-2218-E-126-001 and MOST 107-2218-E-029-001.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Hung, Liu, Hu and Lin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Corrigendum: NCNet: Deep Learning Network Models for Predicting Function of Non-Coding DNA

*Hanyu Zhang 1,2, Che-Lun Hung 3,4,5,6\*, Meiyuan Liu 7, Xiaoye Hu 7 and Yi-Yang Lin 6*

#### *Approved by:*

*Frontiers Editorial Office, Frontiers Media SA, Switzerland*

#### *\*Correspondence:*

*Che-Lun Hung clhung@mail.cgu.edu.tw*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 03 September 2019 Accepted: 04 September 2019 Published: 13 September 2019*

#### *Citation:*

*Zhang H, Hung C-L, Liu M, Hu X and Lin Y-Y (2019) Corrigendum: NCNet: Deep Learning Network Models for Predicting Function of Non-Coding DNA. Front. Genet. 10:923. doi: 10.3389/fgene.2019.00923*

*1 College of Computing and Informatics, Providence University, Taichung City, Taiwan, 2 Labo MICS, École CentraleSupélec, Université Paris Saclay, Gif-sur-Yvette, France, 3 Department and Graduate Institute of Computer Science and Information Engineering, Chang Gung University, Taoyuan City, Taiwan, 4 Division of Rheumatology, Allergy and Immunology, Chang Gung Memorial Hospital, Taoyuan City, Taiwan, 5 AI Innovation Research Center, Chang Gung University, Taoyuan City, Taiwan, 6 Department of Computer Science and Communication Engineering, Providence University, Taichung City, Taiwan, 7 Affiliated Cancer Hospital & Institute of Guangzhou Medical University, Guangzhou, China*

#### Keywords: Non-coding DNA, residual learning, LSTM, sequence to sequence learning, deep learning

#### **A corrigendum on**

#### **NCNet: Deep Learning Network Models for Predicting Function of Non-Coding DNA**

*by Zhang H, Hung C-L, Liu M, Hu X and Lin Y-Y (2019). Front. Genet. 10:432. doi: 10.3389/fgene.2019.00432*

In the published article, there was an error in affiliation "1." The affiliation "Affiliated Cancer Hospital & Institute of Guangzhou Medical University, Guangzhou, China" should be moved to affiliation "7" and should be removed for the first and corresponding authors. Additionally, all subsequent affiliations should move up in order.

The authors apologize for this error and state that this does not change the scientific conclusions of the article in any way. The original article has been updated.

*Copyright © 2019 Zhang, Hung, Liu, Hu and Lin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*


# Gradient Boosting Decision Tree-Based Method for Predicting Interactions Between Target Genes and Drugs

Ping Xuan<sup>1</sup> , Chang Sun<sup>1</sup> \*, Tiangang Zhang<sup>2</sup> \*, Yilin Ye<sup>1</sup> , Tonghui Shen<sup>1</sup> and Yihua Dong<sup>1</sup>

*<sup>1</sup> School of Computer Science and Technology, Heilongjiang University, Harbin, China, <sup>2</sup> School of Mathematical Science, Heilongjiang University, Harbin, China*

Determining the target genes that interact with drugs—drug–target interactions—plays an important role in drug discovery. Identification of drug–target interactions through biological experiments is time consuming, laborious, and costly. Therefore, using computational approaches to predict candidate targets is a good way to reduce the cost of wet-lab experiments. However, the known interactions (positive samples) and the unknown interactions (negative samples) display a serious class imbalance, which has an adverse effect on the accuracy of the prediction results. To mitigate the impact of class imbalance and completely exploit the negative samples, we proposed a new method, named DTIGBDT, based on gradient boosting decision trees, for predicting candidate drug–target interactions. We constructed a drug–target heterogeneous network that contains the drug similarities based on the chemical structures of drugs, the target similarities based on target sequences, and the known drug–target interactions. The topological information of the network was captured by random walks to update the similarities between drugs or targets. The paths between drugs and targets could be divided into multiple categories, and the features of each category of paths were extracted. We constructed a prediction model based on gradient boosting decision trees. The model establishes multiple decision trees with the extracted features and obtains the interaction scores between drugs and targets. DTIGBDT is a method of ensemble learning, and it effectively reduces the impact of class imbalance. The experimental results indicate that DTIGBDT outperforms several state-of-the-art methods for drug–target interaction prediction. In addition, case studies on *Quetiapine*, *Clozapine*, *Olanzapine*, *Aripiprazole*, and *Ziprasidone* demonstrate the ability of DTIGBDT to discover potential drug–target interactions.

Keywords: drug–target interaction prediction, class imbalance, ensemble learning, path category-based features, gradient boosting decision tree

# INTRODUCTION

Computational prediction of drug–target interactions (DTIs) plays a key role in drug discovery and repositioning (Chen et al., 2015; Yu et al., 2015, 2017b). Drugs exert their functions by interacting with various targets, of which genes are one important group. Through binding, drugs can either enhance or inhibit the expressions of genes and thereby affect disease processes

#### Edited by:

*Quan Zou, University of Electronic Science and Technology of China, China*

#### Reviewed by:

*Zhu-Hong You, Xinjiang Technical Institute of Physics & Chemistry (CAS), China Fang Bai, Rice University, United States*

#### \*Correspondence:

*Chang Sun sunchangcn@outlook.com Tiangang Zhang zhang@hlju.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 12 February 2019 Accepted: 30 April 2019 Published: 31 May 2019*

#### Citation:

*Xuan P, Sun C, Zhang T, Ye Y, Shen T and Dong Y (2019) Gradient Boosting Decision Tree-Based Method for Predicting Interactions Between Target Genes and Drugs. Front. Genet. 10:459. doi: 10.3389/fgene.2019.00459*

(Overington et al., 2006; Yu et al., 2016; Santos et al., 2017). However, in most cases, drugs may cause multiple side-effects because they can interact with several unintended targets. The identification of targets that interact with drugs by biological and chemical experiments is very laborious and expensive (Langley et al., 2017). Therefore, many studies have attempted to predict DTIs by using computational methods, to reduce the workload and costs in providing candidate DTIs for biologists to verify (Ding et al., 2017a,b, 2019; Shen et al., 2017).

Several prediction methods concentrate primarily on incorporating information from drug–target homogeneous networks (Mei et al., 2012; Xu et al., 2014a,b, 2016; Li et al., 2015; Hao et al., 2017; Yu et al., 2017a). For example, Bleakley and Yamanishi constructed a support vector machine (SVM) framework named BLM, which is based on a bipartite local model, to predict DTIs (Bleakley and Yamanishi, 2009). However, because this method is trained with a large-scale bipartite graph model, high computational power is needed. Mei et al. analyzed DTI features from neighbors and predicted novel interactions (Mei et al., 2012); however, it is difficult to obtain enough neighbor information for this method. Ezzat et al. and Luo et al. incorporated topological information by applying a random walk on the homogeneous network and used graph regularized matrix factorization to calculate the propensities of DTIs (Ezzat et al., 2017; Luo et al., 2017). However, the accuracy of the results may be influenced when the features are projected into low-dimensional space, because some valuable information may be lost. Hao et al. proposed a method based on non-linear integral of similarity measurements (Hao et al., 2017). Although this method showed good performance, its accuracy depended heavily on the similarity measurements. Lee and Nam treated DTI prediction as a binary classification problem (Lee and Nam, 2018). The features of drugs and targets that were used for training a k-nearest-neighbors model were weighted by random walks. However, the known and unknown DTIs have a serious class imbalance, which has an adverse impact on prediction accuracy. In DDR, which was proposed by Olayan et al., path category-based feature vectors were constructed to incorporate the topological information of the network, and a random forest was used for DTI prediction (Olayan et al., 2017). However, random forest does not perform as well on regression problems as on classification, because it cannot yield a continuous output.

In this work, in order to further improve the accuracy of DTI prediction and mitigate the impact of class imbalance, we propose a novel computational method named DTIGBDT. We construct a drug–target heterogeneous network to extract features. A gradient boosting decision tree (GBDT)-based prediction model is used for calculating the propensities of interactions. We compare our approach with other prediction methods using various performance measurements: the results show that DTIGBDT outperforms the other methods.

### MATERIALS AND METHODS

Our goal is to predict novel (that is, unknown) interactions between drugs and targets. In order to integrate the information of various connections and the node attributes, we construct a drug–target heterogeneous network. We then design a novel prediction model based on GBDT for the network, to obtain the interaction scores of drug–target pairs. The higher the score, the more likely they are to interact (Zou et al., 2015; Zeng et al., 2017a).

# Dataset for DTI Prediction

We obtained the drug–target interaction data from a published work (Luo et al., 2017). This dataset contains 1,923 known DTIs, involving 708 drugs from DrugBank 5.0 (Wishart et al., 2017) and 1,412 targets from HPRD 9.0 (Keshava Prasad et al., 2008). For each pair of drugs and each pair of targets, we also extracted the similarities between them from these two databases. The similarity between two drugs was calculated by using the Tanimoto coefficient (Francesco et al., 2010), based on their chemical structures. The similarity between two targets was measured by the Smith-Waterman score (Wenhui et al., 2014), based on their primary sequences.
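For readers unfamiliar with the Tanimoto coefficient, a minimal sketch is given below. It assumes that each drug's chemical structure has already been encoded as a set of "on" fingerprint bit positions; the fingerprints shown are made up, and the function name is hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits:
    size of the intersection divided by size of the union."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up fingerprints: two shared bits out of five distinct bits.
drug_a = {1, 2, 3, 4}
drug_b = {3, 4, 5}
print(tanimoto(drug_a, drug_b))
```

The coefficient lies in [0, 1], with 1 for identical fingerprints and 0 for fingerprints that share no bits.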

# Heterogeneous Network-Based Feature Extraction

#### Construction of Drug–Target Heterogeneous Network

We defined a set of DTIs, which consists of a set of drugs D and a set of targets T, where D = {d<sub>1</sub>, d<sub>2</sub>, ..., d<sub>m</sub>} includes m drug nodes, and T = {t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>n</sub>} contains n target nodes. The drug–target network can be considered as a heterogeneous network, which is constructed from a drug network and a target network. In these two networks, we added an edge to connect two drug nodes or two target nodes when the similarity between them was >0. Furthermore, the edge was weighted by the similarity between the two nodes. The edge between a drug and a target represented a known DTI and was weighted by 1. This heterogeneous network is shown in **Figure 1A**.

The interactions between D and T could also be represented as a matrix Y, where Y<sub>ij</sub> is 1 if drug d<sub>i</sub> and target t<sub>j</sub> are observed to interact and 0 otherwise. The set of similarities between drugs was represented by S<sub>D</sub> ∈ R<sup>m×m</sup>, and the set of similarities between targets was represented by S<sub>T</sub> ∈ R<sup>n×n</sup>. The element values in S<sub>D</sub> and S<sub>T</sub> are in the range [0, 1] and represent how similar drugs or targets are to each other.
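As a minimal sketch of these definitions, the interaction matrix and the weighted similarity edges can be assembled as follows. The helper names, the 0-based indices, and the toy similarity values are assumptions made for illustration only.

```python
def build_interaction_matrix(m, n, known_pairs):
    """Y[i][j] = 1 if drug i and target j are known to interact, else 0."""
    Y = [[0] * n for _ in range(m)]
    for i, j in known_pairs:
        Y[i][j] = 1
    return Y


def similarity_edges(S):
    """Edges between distinct nodes whose similarity is greater than 0,
    weighted by that similarity."""
    size = len(S)
    return {(a, b): S[a][b]
            for a in range(size) for b in range(a + 1, size)
            if S[a][b] > 0}


# Toy network: 2 drugs, 3 targets (illustrative values).
Y = build_interaction_matrix(2, 3, [(0, 1), (1, 0)])
S_D = [[1.0, 0.5], [0.5, 1.0]]
S_T = [[1.0, 0.0, 0.4], [0.0, 1.0, 0.5], [0.4, 0.5, 1.0]]
print(Y, similarity_edges(S_D), similarity_edges(S_T))
```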

#### Similarity Calculation Based on Network and Selection of k Neighbors

Random walk with restart, a network diffusion algorithm, has been widely used to analyze complex biological network data (Köhler et al., 2008; Tong et al., 2008; Berger et al., 2010; Li and Patra, 2010; Xu et al., 2016; Cheng et al., 2018b; Gao et al., 2018). Random walk can consider the topological information of the network to fully analyze the potential associations between nodes. We conduct random walks on the drug and target networks separately, to extract the topological information of the networks. Based on these similarities, we select the k most similar neighbors for each node.

FIGURE 1 | Algorithm flow of DTIGBDT. (A) Construct the heterogeneous network. (B) Random walk on drug network and target network, respectively. (C) Select most similar *k* neighbors. (D) Get feature vectors for each drug–target pair. (E) Train the DTIGBDT with the feature vectors.

We take the drug network as an example to illustrate the random walk procedure. We defined a matrix N<sub>D</sub>, in which each element N<sub>D</sub>(i, j) describes the probability of a transition from d<sub>i</sub> to d<sub>j</sub>.

$$N_D(i,j) = \frac{S_D(i,j)}{\sum_{j'} S_D(i,j')} \tag{1}$$

where S<sub>D</sub>(i, j) represents the similarity between two drugs, d<sub>i</sub> and d<sub>j</sub>. Next, we defined a matrix W<sub>D</sub><sup>t</sup> ∈ R<sup>m×m</sup>, where W<sub>D</sub><sup>t</sup>(i, j) is the probability that the walker reaches d<sub>j</sub> from d<sub>i</sub> after t iterations in the random walk process. The matrix W<sub>D</sub><sup>t</sup> can be calculated as in Equation (2).

$$W_D^{t+1} = (1 - a)\, N_D W_D^{t} + a W_D^{0} \tag{2}$$

where parameter a is the restart probability. The matrix W<sub>D</sub><sup>0</sup> can be initialized by Equation (3).

$$W_D^0(i,j) = \begin{cases} 1, & i=j\\ 0, & i \neq j \end{cases} \tag{3}$$

The convergence condition of the random walk procedure is ‖W<sub>D</sub><sup>t</sup> − W<sub>D</sub><sup>t−1</sup>‖<sub>1</sub> < 10<sup>−6</sup>. After the condition is satisfied, the converged probability W<sub>D</sub><sup>t</sup>(i, j) can be regarded as a similarity score between two drugs. This score incorporates the topological information in the drug network and is used to update the weight of the edge between d<sub>i</sub> and d<sub>j</sub>. Next, we selected the k most similar neighbors of d<sub>i</sub> based on the similarities. We obtained the matrix K<sub>D</sub> ∈ R<sup>m×k</sup>, where the ith row stores the k most similar neighbors of d<sub>i</sub>. Similarly, we conducted a random walk on the target network to obtain the similarity matrix W<sub>T</sub><sup>t</sup> ∈ R<sup>n×n</sup> and the matrix of the k most similar neighbors, K<sub>T</sub> ∈ R<sup>n×k</sup> (**Figures 1B,C**).
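Equations (1) to (3) and the neighbor selection can be sketched in plain Python as below. Several details are assumptions of this sketch rather than statements about the authors' implementation: the restart probability of 0.5, the interpretation of the norm as the sum of absolute entry-wise differences, the exclusion of a node from its own neighbor list, and the toy similarity matrix.

```python
def row_normalize(S):
    """Equation (1): divide each row of S by its row sum."""
    normalized = []
    for row in S:
        total = sum(row)
        normalized.append([v / total if total > 0 else 0.0 for v in row])
    return normalized


def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in A]


def random_walk_with_restart(S, restart=0.5, tol=1e-6, max_iter=1000):
    """Iterate Equation (2) from the identity matrix of Equation (3)."""
    n = len(S)
    N = row_normalize(S)
    W0 = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    W = W0
    for _ in range(max_iter):
        NW = matmul(N, W)
        W_next = [[(1 - restart) * NW[i][j] + restart * W0[i][j]
                   for j in range(n)] for i in range(n)]
        # Assumed convergence measure: entry-wise L1 difference.
        diff = sum(abs(W_next[i][j] - W[i][j])
                   for i in range(n) for j in range(n))
        W = W_next
        if diff < tol:
            break
    return W


def top_k_neighbors(W, k):
    """For each node, indices of the k other nodes with the highest score."""
    neighbors = []
    for i, row in enumerate(W):
        others = [j for j in range(len(row)) if j != i]
        others.sort(key=lambda j: row[j], reverse=True)
        neighbors.append(others[:k])
    return neighbors


# Toy drug similarity matrix (illustrative values).
S_toy = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
W_toy = random_walk_with_restart(S_toy)
print(top_k_neighbors(W_toy, 1))
```

Because N<sub>D</sub> is row-stochastic and the walk starts from the identity matrix, every row of the result remains a probability distribution, which is a convenient sanity check.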

#### Path Category-Based Features

Based on the assumption that similar drugs can usually interact with the same target and vice versa, we extracted an 18-dimensional feature vector based on the path category for each drug–target pair. In this study, we worked with the path categories whose lengths are 2 and 3 (but not longer, because of the computational cost). If we limit paths to start at drug nodes and end at target nodes, there are only two path categories with length 2. These two categories can be denoted as C<sub>1</sub>: (D–D–T) and C<sub>2</sub>: (D–T–T), where D represents a drug node and T represents a target node. The four categories with paths of length 3 are C<sub>3</sub>: (D–T–T–T), C<sub>4</sub>: (D–D–T–T), C<sub>5</sub>: (D–D–D–T), and C<sub>6</sub>: (D–T–D–T). We considered these six categories of paths to predict whether the drug can interact with the target. In this process, we started from a given drug d<sub>i</sub> to reach a given target t<sub>j</sub> through a specific path category C<sub>h</sub>, where h is selected from {1, 2, 3, ..., 6}. We only considered paths that pass through the k nearest neighbors of d<sub>i</sub> or t<sub>j</sub>. We denoted the set of such paths as R<sub>ijh</sub>. Next, for the qth path p<sub>q</sub> between d<sub>i</sub> and t<sub>j</sub>, we calculated a weight s by multiplying all weights on the edges of path p<sub>q</sub> as in Equation (4).

$$s(i,j,h,q) = \prod_{\forall e_x \in p_q} w_x \tag{4}$$

where e<sub>x</sub> is the xth edge of p<sub>q</sub>, and w<sub>x</sub> is the weight of that edge. We defined three matrices V<sub>1</sub> ∈ R<sup>i×j×h</sup>, V<sub>2</sub> ∈ R<sup>i×j×h</sup>, and V<sub>3</sub> ∈ R<sup>i×j×h</sup>, to store the features between d<sub>i</sub> and t<sub>j</sub> under each path category C<sub>h</sub>. V<sub>1</sub>(i, j, h) is the sum of the s-values in set R<sub>ijh</sub>, V<sub>2</sub>(i, j, h) is the maximum s-value in set R<sub>ijh</sub>, and V<sub>3</sub>(i, j, h) is the number of paths in the set.

$$V_1(i,j,h) = \sum_{\forall p_q \in R_{ijh}} s(i,j,h,q) \tag{5}$$

$$V_2(i,j,h) = \max_{\forall p_q \in R_{ijh}} \left( s(i,j,h,q) \right) \tag{6}$$

$$V_3(i,j,h) = \mathrm{num}_{\forall p_q \in R_{ijh}} \left( p_q \right) \tag{7}$$

We combined the three matrices into a new matrix V<sub>f</sub> ∈ R<sup>i×j×(3×h)</sup>, where the row V<sub>f</sub>(i, j) represents the feature vector of d<sub>i</sub> and t<sub>j</sub> (**Figure 1D**).

We take the drug–target pair (d<sub>7</sub>, t<sub>3</sub>) in **Figure 1A** as an example to describe the process of heterogeneous network-based feature extraction. The paths from d<sub>7</sub> to t<sub>3</sub> are shown in **Figure 2A**, and the values of s for each path are listed in **Figure 2B**. There are two paths in the set R<sub>733</sub>, p<sub>1</sub>: d<sub>7</sub>-t<sub>5</sub>-t<sub>2</sub>-t<sub>3</sub> and p<sub>2</sub>: d<sub>7</sub>-t<sub>5</sub>-t<sub>4</sub>-t<sub>3</sub>, and the values of s for these paths are 0.03 and 0.05, respectively. V<sub>1</sub>(7,3,3) is set as the sum of these s-values, 0.08. V<sub>2</sub>(7,3,3) is set as the maximum of them, 0.05. V<sub>3</sub>(7,3,3) is set as the number of the paths, 2.

For the fifth path category C<sub>5</sub>, there is only one path p<sub>1</sub>: d<sub>7</sub>-d<sub>3</sub>-d<sub>2</sub>-t<sub>3</sub> in the set R<sub>735</sub>, and the s of p<sub>1</sub> is 0.02. Therefore, V<sub>1</sub>(7,3,5) and V<sub>2</sub>(7,3,5) are both set as 0.02 and V<sub>3</sub>(7,3,5) is set as 1. Similarly, we can compute the features for the other path categories. As a result, the rows which represent the feature vectors of (d<sub>7</sub>, t<sub>3</sub>) in matrices V<sub>1</sub>, V<sub>2</sub>, and V<sub>3</sub> are set as (0.16, 0.16, 0.08, 0.08, 0.02, 1), (0.16, 0.16, 0.05, 0.05, 0.02, 1), and (1, 1, 2, 2, 1, 1), respectively (**Figure 2C**). Finally, these three vectors are combined into a single vector of V<sub>f</sub>, namely V<sub>f</sub>(7,3) (**Figure 2D**).
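The feature extraction of Equations (4) to (7) can be sketched on a toy network as follows. This is a simplified illustration, not the authors' code: it uses 0-based indices, enumerates only simple paths (no node visited twice), and omits the restriction to the k nearest neighbors; the similarity values and interactions are made up.

```python
CATEGORIES = ["DDT", "DTT", "DTTT", "DDTT", "DDDT", "DTDT"]  # C1..C6


def edge_weight(u, v, S_D, S_T, Y):
    """Weight of the edge between nodes u and v, or 0 if there is none."""
    (kind_u, iu), (kind_v, iv) = u, v
    if kind_u == "D" and kind_v == "D":
        return S_D[iu][iv] if iu != iv else 0.0
    if kind_u == "T" and kind_v == "T":
        return S_T[iu][iv] if iu != iv else 0.0
    if kind_u == "D":                      # drug -> target
        return float(Y[iu][iv])
    return float(Y[iv][iu])                # target -> drug


def path_weights(i, j, pattern, S_D, S_T, Y):
    """s-values (Equation 4) of all simple paths from drug i to target j
    whose node types follow `pattern`."""
    m, n = len(S_D), len(S_T)
    end = ("T", j)
    found = []

    def extend(path, weight):
        if len(path) == len(pattern):
            if path[-1] == end:
                found.append(weight)
            return
        kind = pattern[len(path)]
        for idx in range(m if kind == "D" else n):
            node = (kind, idx)
            if node in path:
                continue
            w = edge_weight(path[-1], node, S_D, S_T, Y)
            if w > 0:
                extend(path + [node], weight * w)

    extend([("D", i)], 1.0)
    return found


def pair_features(i, j, S_D, S_T, Y):
    """18-dimensional vector: sums (Eq. 5), maxima (Eq. 6), counts (Eq. 7)."""
    sums, maxima, counts = [], [], []
    for pattern in CATEGORIES:
        s_values = path_weights(i, j, pattern, S_D, S_T, Y)
        sums.append(sum(s_values))
        maxima.append(max(s_values) if s_values else 0.0)
        counts.append(len(s_values))
    return sums + maxima + counts


# Toy network: 2 drugs, 3 targets (illustrative values).
S_D = [[1.0, 0.5], [0.5, 1.0]]
S_T = [[1.0, 0.2, 0.4], [0.2, 1.0, 0.5], [0.4, 0.5, 1.0]]
Y = [[0, 1, 0], [1, 0, 0]]
features = pair_features(0, 0, S_D, S_T, Y)
print(features)
```

In this toy case only the categories D–D–T, D–T–T, and D–T–T–T contain a path between the chosen pair, so the remaining entries of the sum, maximum, and count parts are zero.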

#### DTI Prediction Model Based on GBDT

In our dataset, there are only 1,923 known drug–target interactions, while more than 300,000 interactions are unknown, which causes a serious class imbalance. Aiming to reduce the impact of class imbalance and make full use of the negative samples in the dataset, we constructed an ensemble learning model based on GBDT (Ye et al., 2009), and refer to it as DTIGBDT.

The features of a drug–target pair (d<sub>i</sub>, t<sub>j</sub>) are denoted by a vector V<sub>f</sub>(i, j). Let X<sub>i,j</sub> = {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>z</sub>} represent z subsets of V<sub>f</sub>(i, j), where each x<sub>k</sub> was obtained by randomly sampling some of the features from V<sub>f</sub>(i, j). For each element in X<sub>i,j</sub>, we built a decision tree model for predicting potential DTIs. In this way, we obtained a set T<sub>i,j</sub> = {T<sub>1</sub>, T<sub>2</sub>, ..., T<sub>z</sub>} of z decision trees. Finally, we obtained the interaction score of the pair by averaging the weighted scores of all decision trees, as given in Equation (8).

$$score\left(i,j\right) = \frac{1}{z} \sum\_{k=1}^{z} \lambda\_k T\_k(\mathbf{x}\_k) \tag{8}$$

where T<sub>k</sub>(x<sub>k</sub>) represents the score of the decision tree T<sub>k</sub>, and λ<sub>k</sub> adjusts the contribution of T<sub>k</sub>. The greater the value of score(i, j), the more likely d<sub>i</sub> is to interact with t<sub>j</sub>. We thereby obtained a matrix Ŷ ∈ R<sup>m×n</sup>, where Ŷ<sub>ij</sub> = score(i, j) (**Figure 1E**). We used the negative log-likelihood to calculate the loss of DTIGBDT.

$$loss = \sum\_{i,j} \log(1 + \exp(-2Y\_{ij}\hat{Y}\_{ij})) \tag{9}$$

where Y<sub>ij</sub> is the actual interaction between d<sub>i</sub> and t<sub>j</sub>. We defined the objective function as Equation (10).

$$\min L\left(\hat{Y}\right) = \text{loss} + \lambda ||\hat{Y}||\tag{10}$$

The first term is the loss of DTIGBDT. The second term is the regularization term that prevents overfitting, and λ is the regularization parameter that adjusts this term's contribution. The converged Ŷ is the interaction score matrix, which can be calculated with the procedure shown in **Figure 3**.
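Equations (8) to (10) can be evaluated directly once the tree outputs are available. The sketch below is illustrative only and does not train any trees. It rests on two assumptions that the text does not state: the labels Y<sub>ij</sub> are coded as -1 (unknown) and +1 (known), which is the coding under which the loss in Equation (9) is a binomial log-likelihood, and the norm in Equation (10) is taken to be the Frobenius norm. The function names and the toy inputs are hypothetical.

```python
import math
from typing import Sequence


def ensemble_score(tree_outputs: Sequence[float], weights: Sequence[float]) -> float:
    """Equation (8): weighted tree outputs averaged over the z trees."""
    if not tree_outputs or len(tree_outputs) != len(weights):
        raise ValueError("need at least one tree and exactly one weight per tree")
    z = len(tree_outputs)
    return sum(w * t for w, t in zip(weights, tree_outputs)) / z


def softplus(x: float) -> float:
    """Numerically stable evaluation of log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def objective(
    y_true: Sequence[Sequence[float]],
    y_score: Sequence[Sequence[float]],
    lam: float,
) -> float:
    """Equations (9) and (10): summed loss plus lam times the norm of y_score.

    Assumptions: labels are coded as -1/+1 and ||.|| is the Frobenius norm.
    """
    loss = 0.0
    squared_norm = 0.0
    for true_row, score_row in zip(y_true, y_score):
        for y, s in zip(true_row, score_row):
            loss += softplus(-2.0 * y * s)  # log(1 + exp(-2 * Y_ij * Yhat_ij))
            squared_norm += s * s
    return loss + lam * math.sqrt(squared_norm)


# Toy inputs, chosen only for illustration.
score = ensemble_score([1.0, 0.5, -0.5], [1.0, 1.0, 2.0])
baseline = objective([[1, -1]], [[0.0, 0.0]], lam=0.1)
```

With all scores at zero, each entry contributes log 2 to the loss; scores that agree in sign with the labels reduce the loss, while the penalty term grows with the magnitude of the scores.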

FIGURE 2 | Feature vector calculation of *d*7-*t*3. The edges between drug nodes or target nodes are weighted by the similarities between two nodes. The edges between drugs and target nodes represent the known DTIs and are weighted by 1. (A) Paths between d7 and t3. (B) The *s*-values of all the paths. (C) Three types of path feature vectors. (D) Connection of three feature vectors.

# EXPERIMENTAL EVALUATION AND DISCUSSION

#### Performance Evaluation Metrics

To evaluate our method and the state-of-the-art methods for DTI prediction, we performed five-fold cross-validation (Cheng et al., 2015; Chen et al., 2017; Lin et al., 2017; Wei et al., 2017a, 2018; Zeng et al., 2017b; Bu et al., 2018; Su et al., 2018; Xu et al., 2018b,c). All known DTIs were randomly divided into five subsets of equal size, and the same operation was applied to the unknown interactions (Liu et al., 2017; Zhang et al., 2017; Zeng et al., 2018). In each cross-validation trial, one subset of known DTIs and one subset of unknown interactions were selected in turn as the test set, while the remaining interactions were used to train a prediction model. The known and unknown interactions were regarded as the positive and negative samples, respectively. After prediction, each sample was given a predicted score that represents the propensity of the drug to interact with the target. The positive and negative samples were ranked by their scores. The higher the positive samples were ranked, the better the prediction performance.

For a given threshold δ, a positive sample was considered a true positive (TP) if its score was >δ and a false negative (FN) if its score was <δ. A negative sample was regarded as a true negative (TN) if its score was <δ and as a false positive (FP) if its score was >δ. We obtained a receiver operating characteristic (ROC) curve (Streiner and Cairney, 2007) by calculating the true positive rates (TPRs) and false positive rates (FPRs) for various values of δ.

$$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{TN + FP} \tag{11}$$

The areas under the ROC curves (AUCs) were used to evaluate the performance of each method (Lobo et al., 2008; Cheng et al., 2014, 2018a; Dao et al., 2018; Feng et al., 2018; Nie et al., 2018; Tang et al., 2018; Xu et al., 2018a; Yang et al., 2018). It is generally believed that the closer the value of AUC is to 1, the better the performance is. However, in the case of imbalanced data, AUPR (the area under the precision–recall curve) can provide a more valuable metric (van Laarhoven et al., 2011; Saito and Rehmsmeier, 2015; Patel et al., 2017; Sahiner et al., 2017; Wei et al., 2017b; Jiang et al., 2018a,b). Therefore, we also used AUPR as another measurement to evaluate the performance of each method. The precision–recall curve was constructed by precision rates and recall rates, which are defined as Equation (12).

$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN} \tag{12}$$
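The quantities in Equations (11) and (12) follow from the four confusion counts at a single threshold. The sketch below shows one way to compute them; the function name `rates_at_threshold` and the toy scores are hypothetical. Two conventions are assumptions: a score exactly equal to δ is treated as a negative prediction, and a ratio with a zero denominator is reported as 0.

```python
from typing import Dict, Sequence


def rates_at_threshold(
    scores: Sequence[float], labels: Sequence[int], delta: float
) -> Dict[str, float]:
    """Compute TPR, FPR, precision, and recall at threshold delta.

    labels use 1 for positive (known DTI) and 0 for negative (unknown).
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > delta  # ties count as negative (assumption)
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "TPR": ratio(tp, tp + fn),        # Equation (11)
        "FPR": ratio(fp, tn + fp),        # Equation (11)
        "precision": ratio(tp, tp + fp),  # Equation (12)
        "recall": ratio(tp, tp + fn),     # Equation (12)
    }


# Toy scores and labels, chosen only for illustration.
rates = rates_at_threshold([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], delta=0.5)
```

Sweeping δ over the observed scores and collecting the (FPR, TPR) or (recall, precision) pairs yields the points of the ROC and precision–recall curves, respectively.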

In addition, biologists usually select the top-ranked portion of the prediction results for further validation in wet-lab experiments. As a result, the accuracy of the top k candidates is more important for discovering novel DTIs. We report the recall rates within the top k (k = 50, 100, 150, 200, 250, 300) candidates to show how many of the positive samples are identified successfully.
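Recall within the top k candidates can be sketched as follows for a single ranked candidate list. The function name `recall_at_k` and the toy inputs are hypothetical, and returning 0 when a list contains no positive sample is an assumption. Averaging the result over drugs is left to the caller.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of all positive samples that appear among the k highest scores."""
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0  # no positives to recover (assumption)
    # Rank candidates from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / total_positives


# Toy candidate list, chosen only for illustration.
toy_scores = [0.9, 0.8, 0.3, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 1]
recall_top2 = recall_at_k(toy_scores, toy_labels, k=2)
recall_top3 = recall_at_k(toy_scores, toy_labels, k=3)
```

In the toy list, one of the three positives is ranked in the top 2 and two of them in the top 3.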

#### Comparison With Other Methods

We compared DTIGBDT with four state-of-the-art methods for DTI prediction, including GRMF (Ezzat et al., 2017), DTINet (Luo et al., 2017), Lee's method (Lee and Nam, 2018), and DDR (Olayan et al., 2017). We describe these methods in more detail below.

**GRMF:** This method uses a matrix factorization-based model to predict novel DTIs. The drug–target interaction matrix **Y** was decomposed into two low-rank latent feature matrices **A** (for drugs) and **B** (for targets) by using the SVD algorithm. Alternating least squares was used to iteratively update **A** and **B**.

The optimization problem can be described as:

$$\begin{aligned} \min\_{A,B} \; & \|Y - AB^T\|\_F^2 + \lambda\_l \left( \|A\|\_F^2 + \|B\|\_F^2 \right) \\ & + \lambda\_d Tr\left(A^T \tilde{L}\_d A\right) + \lambda\_t Tr\left(B^T \tilde{L}\_t B\right) \end{aligned} \tag{13}$$

where L̃<sub>d</sub> and L̃<sub>t</sub> are the normalized graph Laplacians computed from the similarities between drugs and between targets, respectively. λ<sub>l</sub>, λ<sub>d</sub>, and λ<sub>t</sub> are parameters that adjust the contribution of the terms. The interaction score Ŷ<sub>i,j</sub> of drug d<sub>i</sub> and target t<sub>j</sub> can be calculated as:

$$\hat{Y}\_{i,j} = a\_i b\_j^T \tag{14}$$

where a<sub>i</sub> is the ith row of A and b<sub>j</sub> is the jth row of B.

**DTINet:** Heterogeneous data sources provide diverse information for DTI prediction, so Luo et al. integrated four types of drug similarities and three types of target similarities. The random walk with restart algorithm was applied to extract the topological information of the drug network and the target network, and the result of the algorithm was a matrix S<sub>D</sub>. The low-rank model S<sub>D</sub> ≈ XW<sup>T</sup> used X to represent the corresponding low-dimensional feature vector of each drug. Similarly, the low-dimensional feature vectors of targets could be calculated and were represented by a matrix Y. Let P denote the interactions between drugs and targets; matrix Z can then be calculated by Equation (15).

$$XZY^T \approx \mathbf{P} \tag{15}$$

The interaction score between drug d<sub>i</sub> and target t<sub>j</sub> was defined as follows:

$$\text{score}\left(i,j\right) = \mathbf{x}\_i \mathbf{Z} \mathbf{y}\_j^T \tag{16}$$

where x<sub>i</sub> is the ith row of X and is the feature vector of d<sub>i</sub>, and y<sub>j</sub> is the jth row of Y and is the feature vector of t<sub>j</sub>.
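Equations (14) and (16) are both bilinear scores: Equation (16) evaluates xZy<sup>T</sup> for a drug vector x and a target vector y, and Equation (14) is the special case in which Z is the identity matrix. The sketch below evaluates such a score with plain Python lists. The function name `bilinear_score` and the toy vectors are hypothetical, and the code does not learn A, B, X, Y, or Z.

```python
from typing import Sequence


def bilinear_score(
    x: Sequence[float], z: Sequence[Sequence[float]], y: Sequence[float]
) -> float:
    """Return x Z y^T for row vectors x (length p), y (length q), and a p x q matrix z."""
    if len(z) != len(x) or any(len(row) != len(y) for row in z):
        raise ValueError("z must have len(x) rows and len(y) columns")
    return sum(
        x[a] * z[a][b] * y[b]
        for a in range(len(x))
        for b in range(len(y))
    )


# Toy vectors, chosen only for illustration.
identity = [[1.0, 0.0], [0.0, 1.0]]
dot_product_score = bilinear_score([1.0, 2.0], identity, [3.0, 4.0])
swapped_score = bilinear_score([1.0, 2.0], [[0.0, 1.0], [1.0, 0.0]], [3.0, 4.0])
```

With the identity matrix the score reduces to the inner product of the two latent vectors, as in Equation (14); a general Z lets each drug feature interact with each target feature.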

**Lee's method:** In this method, each drug was represented by a bit vector, in which each bit indicates whether the drug contains a specific molecular substructure. In addition, Lee et al. constructed a model based on random walk with restart to extract the topological information of the drug–drug interaction network. The rows of the matrix F<sup>d</sup> store the bit vectors of the drugs, and a matrix N<sup>d</sup> denotes the result of the random walk. The final representation of drug d<sub>i</sub>, denoted by ν<sub>i</sub><sup>d</sup>, was calculated by Equation (17):

$$\nu\_i^d = N\_i^d \ast F\_i^d \tag{17}$$

where N<sub>i</sub><sup>d</sup> and F<sub>i</sub><sup>d</sup> are the ith rows of N<sup>d</sup> and F<sup>d</sup>, respectively. Similarly, Lee et al. calculated a vector ν<sub>j</sub><sup>t</sup> to represent the target t<sub>j</sub>. The feature vector of the drug–target pair (d<sub>i</sub>, t<sub>j</sub>) was obtained by concatenating ν<sub>i</sub><sup>d</sup> and ν<sub>j</sub><sup>t</sup>. On the basis of the Euclidean distance between each pair of drug and target, a k-nearest-neighbor model was trained to infer whether a target interacted with the drug.

**DDR:** DDR constructed a drug–target heterogeneous graph that contains the known DTIs together with multiple drug similarities and target similarities. A non-linear similarity fusion method was applied to obtain the optimized drug similarities and target similarities. For each drug–target pair, DDR constructed a path-category-based feature, which integrates the sum of the path weights and the maximum path weight. A random forest-based model was then used with these features to analyze the potential association of each drug–target pair.

Several parameters may influence the performance of DTIGBDT, including the restart probability a, the number of neighbors k, and the regularization parameter λ. The ranges of a, k, and λ are set to {0.2,0.4,0.6,0.8}, {10,20,30,40,50}, and {0.01,0.1,1,10}, respectively. The results of cross validation showed that our method achieves the best performance when a = 0.4, k = 30, and λ = 0.1. For fair comparison, the parameters of the other methods were also adjusted to obtain their best performance (n = 600, k = 5 in DDR; r = 0.8 in Lee's method; η = 0.5, d = 0.1, t = 0.1, l = 2 in GRMF; and λ = 1, r = 0.8 in DTINet). The performance of each method was obtained by using the optimum parameters in each case. The ROC curves and precision–recall curves of all these methods are shown in **Figure 4**.
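The parameter ranges above define a grid of 4 × 5 × 4 = 80 combinations. An exhaustive search over such a grid can be sketched as follows. This is an illustration under stated assumptions rather than the procedure the authors used: `evaluate` is a hypothetical callback that would run cross-validation for one combination and return a score to maximize, and the text does not say which metric was used for selection.

```python
from itertools import product
from typing import Callable, Optional, Tuple

# Parameter ranges as listed in the text.
RESTART_PROBABILITIES = (0.2, 0.4, 0.6, 0.8)
NEIGHBOR_COUNTS = (10, 20, 30, 40, 50)
REGULARIZATION_VALUES = (0.01, 0.1, 1, 10)

Params = Tuple[float, int, float]


def grid_search(evaluate: Callable[[float, int, float], float]) -> Tuple[Optional[Params], float]:
    """Try every (a, k, lambda) combination and keep the highest-scoring one.

    evaluate is a hypothetical callback, for example one that returns the
    mean cross-validated AUC for the given parameters.
    """
    best_params: Optional[Params] = None
    best_value = float("-inf")
    for params in product(RESTART_PROBABILITIES, NEIGHBOR_COUNTS, REGULARIZATION_VALUES):
        value = evaluate(*params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


# Number of combinations in the grid.
num_combinations = (
    len(RESTART_PROBABILITIES) * len(NEIGHBOR_COUNTS) * len(REGULARIZATION_VALUES)
)
```

Because every combination requires a full cross-validation run, the cost of the search grows with the product of the range sizes.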

TABLE 1 | *P*-values between DTIGBDT and other methods based on AUCs and AUPRs.


DTIGBDT achieves the best performance (AUC = 0.877, AUPR = 0.129), with 2.3% higher AUC and 4.3% higher AUPR than the second-best method, GRMF. Compared with DTINet, DTIGBDT achieves 7.3% higher AUC and 5.7% higher AUPR. Both GRMF and DTINet apply a low-rank model to reduce the dimension of the drug features and target features, and a great deal of valuable information may be lost in this process. Lee's method does not perform well because it used only as many negative samples as positive samples to train the k-nearest-neighbor model, so most of the negative samples were discarded. The AUC and AUPR of DTIGBDT are 11.6% and 9.7% higher than those of Lee's method, respectively. DDR shows the worst performance because its prediction model fails to accurately estimate the interaction scores; the AUC and AUPR of DTIGBDT are 12.9% and 6.6% higher than those of DDR, respectively. The superior performance of DTIGBDT is mainly due to our GBDT-based model, which exploits all the negative samples.

We performed a paired t-test to evaluate whether DTIGBDT's performance (AUC and AUPR) is significantly better than that of other methods (Ruxton, 2006). The p-values are listed in **Table 1**. These statistical results show that DTIGBDT achieves a significantly better performance than all other methods at the significance level 0.05.

A higher recall value for the top k reveals that more positive samples are identified successfully. The average recall values over all drugs, for various values of k, are shown in **Figure 5**. DTIGBDT outperforms the other methods at each of the k cutoffs, identifying 78.1% of the positive samples in the top 50, 82.1% in the top 100, and 90.9% in the top 200. GRMF achieved the second-best performance, identifying 73.1% in the top 50, 77.5% in the top 100, and 86.1% in the top 200. DTINet identified 68.1% in the top 50, 72.2% in the top 100, and 79.9% in the top 200. Lee's method identified 52.9% in the top 50, 66.8% in the top 100, and 79.4% in the top 200, which is worse than DTINet but better than DDR. DDR had the worst performance, identifying only 59.1% of the positive samples in the top 50, 71.4% in the top 100, and 75.1% in the top 400.

#### Case Studies on Five Drugs

To demonstrate the ability of DTIGBDT to discover potential DTIs, we used it to predict novel drug-related targets. We applied DTIGBDT to all the drugs, using all the known DTIs to train the model; the prediction results are listed in **Supplementary Table 1**. In particular, we carried out case studies on five drugs: Quetiapine, Clozapine, Olanzapine, Aripiprazole, and Ziprasidone. The five top-ranked candidate targets for each drug are listed in **Table 2**. To confirm these novel interactions, we consulted several reference databases and the biomedical literature.

DrugBank (Wishart et al., 2017) is a database of annotated cheminformatics resources that combines detailed drug data with target information. As shown in **Table 2**, 10 of the 25 novel interactions were reported in DrugBank, which confirms that these drugs do interact with the targets. ChEMBL (Gaulton et al., 2016) contains the binding and functional information of drug-like bioactive compounds and information about their binding targets. Three of the 25 interactions were contained in ChEMBL, indicating that these drugs can interact with their candidate targets. KEGG (Kanehisa and Goto, 2000) is another useful database that covers genomes, biological pathways, drugs, and chemical substances. Fifteen interactions can be found in KEGG, which suggests that the expression of the genes can be upregulated or downregulated by the drugs. For example, the drug Aripiprazole can act as a potentiator to enhance the expression of the target gene GABRA1 in combination with another drug, Phenobarbital.

In addition, a database named UniProt (Consortium, 2014), which collects the protein sequence and function information from research literature, is used to find whether a drug can interact with a specific target; this database includes two interactions. Specifically, the expression of two target genes, GABRG3 and GABRA4, can be reduced by drug Olanzapine to inhibit the activity of extracellular ligand-gated ion channels.



Finally, four novel interactions, which are labeled with "literature," were confirmed by published literature that can be found in PubMed (McEntyre and Lipman, 2001). These drugs were confirmed to enhance or inhibit the expression of their candidate genes. For instance, Sugawara et al. found that the drug Quetiapine can decrease the DNA methylation level of the promoter region of the gene SLC6A4 (Sugawara et al., 2015). The case studies suggest that DTIGBDT has a powerful ability to discover potential drug-interacting targets.

*The novel DTIs are proved by other existing evidence (public databases or literature) and the supporting databases are listed in the evidence.*

# CONCLUSIONS

In this paper, we proposed a novel method, DTIGBDT, for predicting the target genes that interact with drugs. We incorporated topological information from the heterogeneous interaction network, and the feature vectors of the drug–target pairs were constructed based on the path categories. A GBDT-based model was constructed for predicting candidate target genes, and it can mitigate the impact of class imbalance by exploiting all the negative samples. The results of 5-fold cross-validation experiments confirm the superiority of DTIGBDT for DTI prediction. The case studies on five drugs further demonstrate the ability of our model to discover potential interactions. Therefore, DTIGBDT is a powerful tool that may provide reliable candidate target genes for subsequent identification of actual drug–target interactions with wet-lab experiments. In the future, we will develop our methods on parallel platforms (Zou et al., 2013; Guo et al., 2018) for handling the big data problem.

#### DATA AVAILABILITY

All datasets analyzed for this study are included in the manuscript and the **Supplementary Files**.

#### AUTHOR CONTRIBUTIONS

PX and CS conceived the prediction method. PX, CS, and YY wrote the paper. CS and TS developed the computer programs. TZ and YD analyzed the results and revised the paper.

#### FUNDING

The work was supported by the Natural Science Foundation of China (61702296, 61302139), the Natural Science Foundation of Heilongjiang Province (LH2019F049, LH2019A029), China Postdoctoral Science Foundation (2019M650069), the Heilongjiang Postdoctoral Scientific Research Staring Foundation (BHL-Q18104), the Fundamental Research Foundation of Universities in Heilongjiang Province for Technology Innovation (KJCX201805), the Fundamental Research Foundation of Universities in Heilongjiang Province for Youth Innovation Team (RCYJTD201805), and the Foundation of Graduate Innovative Research (YJSCX2018-047HLJU, YJSCX2018-139HLJU).

#### ACKNOWLEDGMENTS

We would like to thank Editage (www.editage.com) for English language editing.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00459/full#supplementary-material

Supplementary Table 1 | Potential candidate target genes interacting with 549 drugs.

#### REFERENCES

Berger, S. I., Ma'ayan, A., and Iyengar, R. (2010). Systems pharmacology of arrhythmias. Sci. Signal. 3, ra30–ra30. doi: 10.1126/scisignal.2000723

Bleakley, K., and Yamanishi, Y. (2009). Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics 25, 2397–2403. doi: 10.1093/bioinformatics/btp433


Technology Assessment: International Society for Optics and Photonics (Orlando, FL), 101360G.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xuan, Sun, Zhang, Ye, Shen and Dong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
