# MACHINE LEARNING ADVANCED DYNAMIC OMICS DATA ANALYSIS FOR PRECISION MEDICINE

EDITED BY: Tao Zeng, Tao Huang and Chuan Lu

PUBLISHED IN: Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-554-2 DOI 10.3389/978-2-88963-554-2

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# MACHINE LEARNING ADVANCED DYNAMIC OMICS DATA ANALYSIS FOR PRECISION MEDICINE

Topic Editors:

Tao Zeng, Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, China

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

Chuan Lu, Aberystwyth University, United Kingdom

Citation: Zeng, T., Huang, T., Lu, C., eds. (2020). Machine Learning Advanced Dynamic Omics Data Analysis for Precision Medicine. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-554-2

# Table of Contents

*Editorial: Machine Learning Advanced Dynamic Omics Data Analysis for Precision Medicine*

Tao Zeng, Tao Huang and Chuan Lu

Chang Gu, Zhenyu Huang, Chenyang Dai, Yiting Wang, Yijiu Ren, Yunlang She, Hang Su and Chang Chen

Jingxuan Qiu, Yuxuan Shang, Zhiliang Ji and Tianyi Qiu

Apolipoprotein E *Overexpression is Associated With Tumor Progression and Poor Survival in Colorectal Cancer*

Zhixun Zhao, Shuangmei Zou, Xu Guan, Meng Wang, Zheng Jiang, Zheng Liu, Chunxiang Li, Huixin Lin, Xiuyun Liu, Runkun Yang, Yibo Gao and Xishan Wang

*Whole Exome Sequencing Identifies a Novel Pathogenic RET Variant in Hirschsprung Disease*

Wei Wu, Li Lu, Weijue Xu, Jiangbin Liu, Jun Sun, Lulu Zheng, Qingfeng Sheng and Zhibao Lv

*FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier*

Victor Tkachev, Maxim Sorokin, Artem Mescheryakov, Alexander Simonov, Andrew Garazha, Anton Buzdin, Ilya Muchnik and Nicolas Borisov

*Identification of Rare Copy Number Variants Associated With Pulmonary Atresia With Ventricular Septal Defect*

Huilin Xie, Nanchao Hong, Erge Zhang, Fen Li, Kun Sun and Yu Yu

*Exposing the Causal Effect of Body Mass Index on the Risk of Type 2 Diabetes Mellitus: A Mendelian Randomization Study*

Liang Cheng, He Zhuang, Hong Ju, Shuo Yang, Junwei Han, Renjie Tan and Yang Hu

*Using Pan RNA-Seq Analysis to Reveal the Ubiquitous Existence of 5′ and 3′ End Small RNAs*

Xiaofeng Xu, Haishuo Ji, Xiufeng Jin, Zhi Cheng, Xue Yao, Yanqiang Liu, Qiang Zhao, Tao Zhang, Jishou Ruan, Wenjun Bu, Ze Chen and Shan Gao

Xin Feng, Jialiang Li, Han Li, Hang Chen, Fei Li, Quewang Liu, Zhu-Hong You and Fengfeng Zhou

*Identification of Target Genes at Juvenile Idiopathic Arthritis GWAS Loci in Human Neutrophils*

Junyi Li, Xiucheng Yuan, Michael E. March, Xueming Yao, Yan Sun, Xiao Chang, Hakon Hakonarson, Qianghua Xia, Xinyi Meng and Jin Li

Ping Luo, Qianghua Xiao, Pi-Jing Wei, Bo Liao and Fang-Xiang Wu

Jie Zhang, Zhi Wei, Christopher J. Cardinale, Elena S. Gusareva, Kristel Van Steen, Patrick Sleiman, International IBD Genetics Consortium and Hakon Hakonarson

*UltraStrain: An NGS-Based Ultra Sensitive Strain Typing Method for* Salmonella enterica

Wenxian Yang, Lihong Huang, Chong Shi, Liansheng Wang and Rongshan Yu

*Identifying Critical State of Complex Diseases by Single-Sample-Based Hidden Markov Model*

Rui Liu, Jiayuan Zhong, Xiangtian Yu, Yongjun Li and Pei Chen

*A Convergent Study of Genetic Variants Associated With Crohn's Disease: Evidence From GWAS, Gene Expression, Methylation, eQTL and TWAS*

Yulin Dai, Guangsheng Pei, Zhongming Zhao and Peilin Jia

Xiangbo Chen, Yunjie Jin and Yu Feng

*High-Order Correlation Integration for Single-Cell or Bulk RNA-seq Data Analysis*

Hui Tang, Tao Zeng and Luonan Chen

Shengyang Jiang, Changfa Guo, Wei Zhang, Wenliang Che, Jie Zhang, Shaowei Zhuang, Yiting Wang, Yangyang Zhang and Ban Liu

*MildInt: Deep Learning-Based Multimodal Longitudinal Data Integration Framework*

Garam Lee, Byungkon Kang, Kwangsik Nho, Kyung-Ah Sohn and Dokyoon Kim

*Efficient Mining of Variants From Trios for Ventricular Septal Defect Association Study*

Peng Jiang, Yaofei Hu, Yiqi Wang, Jin Zhang, Qinghong Zhu, Lin Bai, Qiang Tong, Tao Li and Liang Zhao

*Multi-view Subspace Clustering Analysis for Aggregating Multiple Heterogeneous Omics Data*

Qianqian Shi, Bing Hu, Tao Zeng and Chuanchao Zhang

*Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid–Binding Proteins*

Wenchuan Wang, Robert Langlois, Marina Langlois, Georgi Z. Genchev, Xiaolei Wang and Hui Lu

*Genomic Profiling of Driver Gene Mutations in Chinese Patients With Non-Small Cell Lung Cancer*

Hongxue Meng, Xuejie Guo, Dawei Sun, Yuebin Liang, Jidong Lang, Yingmin Han, Qingqing Lu, Yanxiang Zhang, Yanxin An, Geng Tian, Dawei Yuan, Shidong Xu and Jingshu Geng

# Editorial: Machine Learning Advanced Dynamic Omics Data Analysis for Precision Medicine

Tao Zeng<sup>1,2</sup>\*, Tao Huang<sup>3</sup> and Chuan Lu<sup>4</sup>

<sup>1</sup> Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China, <sup>2</sup> Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai, China, <sup>3</sup> Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences (CAS), Shanghai, China, <sup>4</sup> Department of Computer Science, Aberystwyth University, Aberystwyth, United Kingdom

Keywords: machine learning, dynamic, OMICS data, precision medicine, integration

Editorial on the Research Topic

#### Machine Learning Advanced Dynamic Omics Data Analysis for Precision Medicine

By utilizing high-throughput technologies, precision medicine is being developed as a preventative, diagnostic and treatment tool to combat complex human diseases. It is therefore necessary to investigate how to integrate these multi-scale 'omics datasets to distinguish the novel individual-specific disease causes from conventional cohort-common disease causes. Currently, machine learning plays an important role in biological and biomedical research, especially in the analysis of big 'omics data. This Research Topic focuses on the application of wet 'omics technology and dry machine learning approaches together to further develop precision medicine.

#### Edited and reviewed by:

Richard D. Emes, University of Nottingham, United Kingdom

> \*Correspondence: Tao Zeng zengtao@sibs.ac.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 20 October 2019 Accepted: 09 December 2019 Published: 04 February 2020

#### Citation:

Zeng T, Huang T and Lu C (2020) Editorial: Machine Learning Advanced Dynamic Omics Data Analysis for Precision Medicine. Front. Genet. 10:1343. doi: 10.3389/fgene.2019.01343

# STUDIES BASED ON INDIVIDUAL TEMPORAL 'OMICS DATA FROM DISEASE COHORTS OR ANIMAL MODELS

Liu, R. et al. proposed a single-sample-based hidden Markov model approach that detects the dynamical differences between a normal state and a pre-disease state, and thereby signals the imminent critical transition from the pre-disease state. Lee et al. implemented a deep learning-based Python package for multimodal longitudinal data integration, especially for numerical data including time series and non-time series data. Yu et al. implemented an adjusted individual-specific edge-network analysis (iENA) method for cases where only a limited number of samples from one individual are available, and carried out a proof-of-concept study on individual-specific disease classification based on microbiota compositional dynamics.

# STUDIES BASED ON MULTIPLE 'OMICS DATA, E.G., THE COMBINATION OF GENOMIC, TRANSCRIPTOMIC, EPIGENOMIC, OR PROTEOMIC DATA FOR A SINGLE DISEASE/CONDITION

Chen et al. analyzed the miRNA expression profiles in whole plasma, Extracellular Vesicle (EV) and EV-free plasma of lung cancer patients and identified several discriminative miRNAs and classification rules as potential non-invasive biomarkers using the Monte-Carlo feature selection method and the Repeated Incremental Pruning to Produce Error Reduction method. Liu, Z. et al. conducted a genome-wide analysis of allele-specific expression (ASE) in colorectal cancer patients, providing a systematic understanding of how ASE is implicated in both tumor and normal tissues. Hu et al. used RNA sequencing data to identify and quantify the circRNAs in atrial fibrillation (AF) by bioinformatics analysis and characterized their potential functions through the competing endogenous RNA network and protein-protein interaction network. Shi, X. et al. screened a cohort of total anomalous pulmonary venous connection cases and healthy controls for rare copy number variants by whole exome sequencing, providing candidate genes associated with this rare congenital birth defect. Wu et al. performed whole exome sequencing on seven members of a Hirschsprung disease (HSCR) family, reporting for the first time that the in-frame variant p.Phe147del in RET is responsible for heritable HSCR. Xie et al. investigated rare copy number variants (CNVs) in a recruited cohort of unrelated patients with pulmonary atresia and a population-matched control cohort of healthy children by whole-exome sequencing, helping to elucidate critical disease genes and offering new insights into pathogenesis. Meng et al. presented a brief research report on driver gene mutations in Chinese patients with non-small cell lung cancer, combining targeted sequencing with the Hotspot3D computational approach.

Ho et al. provided a review of polygenic risk scoring and machine learning in complex disease risk prediction with tissue-specific targets, anticipating their power to manage complex diseases through customized preventive interventions. Li et al. identified target genes at juvenile idiopathic arthritis risk loci in neutrophils by an integrated multi-omics approach, constructing a protein-protein interaction network on the basis of a machine learning approach. Dai et al. applied the mega-analysis of Odds Ratio (MegaOR) method to prioritize candidate genes of Crohn's disease, based on comprehensively collected multi-dimensional data. Wang, C.H. et al. detected differentially expressed lncRNAs and mRNAs in atherosclerosis by analyzing public datasets with weighted gene co-expression network analysis; this bioinformatics study may provide potential novel therapeutic and prognostic targets for atherosclerosis. Jiang, S. et al. collected and profiled the circRNA expression of heart tissues from atrial fibrillation patients and healthy controls, providing new insights into the roles of circRNAs in AF, with potential interaction mechanisms among circRNAs, microRNAs, and mRNAs.

Gu et al. reused the Surveillance, Epidemiology, and End Results registry database to conduct stratification analyses and univariable and multivariable analyses, indicating that surgery is an important component of multidisciplinary treatment and that sublobar resection is not inferior to lobectomy for specific patients. Zhang, J. et al. exploited the largest Crohn's disease and ulcerative colitis datasets with a two-step approach, exhaustively searching for epistasis with dense markers and exploiting marker dependencies. Du et al. analyzed the genome-wide splicing data in 16 cancer types with normal samples by a network-based and modularized approach and captured the pan-cancer splicing and modularized perturbation, which support the dominant patterns of cancer-associated splicing. Zhao et al. assessed the prognostic value of Apolipoprotein E and explored its potential relationship with tumor progression in colorectal cancer (CRC), by collecting microarray data from the Gene Expression Omnibus and exploring genes with prognostic significance in the TCGA database. Tang et al. proposed an effective data integration framework, HCI (High-order Correlation Integration), to realize high-dimensional data feature extraction with extensive flexibility and applicability to sample clustering with RNA-seq data at bulk and single-cell levels. Chang et al. identified new susceptibility genes and causal sub-networks in schizophrenia by an integrated network-based approach, and reported that the N-methyl-D-aspartate receptor interactome is highly targeted by multiple types of genetic risk factors. Wang and Liu recognized potential diagnostic biomarkers of Alzheimer's disease by integrating gene expression profiles from six brain regions in a machine-learning manner and validating marker genes in multiple cross-validations and functional enrichment analyses. Xu et al. provided an effective way to annotate nuclear non-coding and mitochondrial genes and to identify new steady RNAs, using a pan RNA-seq analysis to suggest the ubiquitous existence of both 5' and 3' end small RNAs.

# STUDIES BASED ON THE GUT METAGENOME AND HOST 'OMICS FOR COMPLEX DISEASES DIAGNOSIS AND TREATMENT

Yang et al. presented a new pathogen detection and strain typing method, UltraStrain, for Salmonella enterica based on whole genome sequencing data, which includes a noise filtering step, a strain identification step based on statistical learning, and a final refinement step. Tan et al. conducted comprehensive and systematic experiments, including in vitro genetic assessments and an in vivo acute toxicity study, to study safety issues associated with Bacteroides ovatus ELH-B2. Qiu et al. set up an in silico model for emerging or re-emerging dengue virus (DENV) based on possible antigenicity-dominant positions of the envelope (E) protein, so that DENV serotyping may be reconsidered antigenically rather than genetically. Zhang, B. et al. collected and re-analyzed published fecal 16S rDNA sequencing datasets to identify biomarkers that classify and predict colorectal tumors with the random forest method; the trained random forest model showed good AUC performance for CRC when all samples were combined, although prediction performed poorly for advanced adenoma and adenoma.

# STUDIES BASED ON CONDITIONAL GENOTYPE-PHENOTYPE DETECTION WITH DEEP LEARNING OR OTHER BRAIN-LIKE ARTIFICIAL INTELLIGENCE (AI) TECHNOLOGIES

Luo et al. proposed a manifold learning-based method to predict disease-gene associations by assuming that the geodesic distance of related disease-gene pairs should be shorter than that of non-associated disease-gene pairs. Tkachev et al. proposed a heuristic technique termed FLOating Window Projective Separator (FloWPS) for data trimming with SVM and applied it to personalized predictions based on molecular data. Wang, W. et al. developed a new multiple-instance learning algorithm derived from AdaBoost and assessed this algorithm on annotating proteins that bind DNA and RNA. Xiao et al. proposed a method called BPLLDA to predict lncRNA-disease associations from a heterogeneous lncRNA-disease association network, assuming association paths of fixed length on the network. Zou et al. used decision tree, random forest and neural network models to predict diabetes mellitus from hospital physical examination data; the best prediction was achieved by random forest after dimensionality reduction by principal component analysis and minimum redundancy maximum relevance.

Guo et al. proposed a new approach, SGL-LMM, for mining multivariate associations of quantitative traits by combining sparse group lasso with a linear mixed model, which can account for confounding effects and groups of SNPs simultaneously. Zhang, W. et al. developed a new method for calling differentially expressed genes, DECtp, by integrating tumor purity information into a generalized least square procedure followed by a Wald test. Cheng et al. used Mendelian randomization (MR) to test the influence of body mass index (BMI) on the risk of type 2 diabetes mellitus (T2DM) based on GWAS data, validating the causal effect of high BMI on the risk of T2DM. Feng et al. applied a single analysis procedure of feature selection and classification to both cancer transcriptome and methylome data, suggesting that age should be treated as an essential factor rather than a confounding factor in the training and optimization of disease diagnosis models.

Qin et al. developed a new joint gene set analysis statistical framework, aiming to improve the power of identifying enriched gene sets by integrating multiple similar disease datasets when the sample size is limited. Shi, Q. et al. proposed a new computational framework, "Multi-view Subspace Clustering Analysis", to capture the underlying heterogeneity of samples from multiple data types, by first measuring the local similarities of samples in the same subspace and then extracting the global consensus sample patterns. Jiang, P. et al. developed a new variant mining algorithm for trio-based sequencing data, applied it to a ventricular septal defect (VSD) trio, and identified several genes and lncRNAs highly related to VSD.

Finally, we sincerely thank the reviewers for their great efforts to ensure the high quality of all contributing articles, and we hope this Research Topic can attract wide attention in these topics of precision medicine based on machine learning and omics data.

# AUTHOR CONTRIBUTIONS

TZ drafted the manuscript. TZ, TH, and CL revised the manuscript.

# FUNDING

This study was supported by the National Natural Science Foundation of China (11871456), the Shanghai Municipal Science and Technology Major Project (2017SHZDZX01), and the Natural Science Foundation of Shanghai (17ZR1446100).

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Zeng, Huang and Lu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# DECtp: Calling Differential Gene Expression Between Cancer and Normal Samples by Integrating Tumor Purity Information

#### Weiwei Zhang<sup>1</sup>, Haixia Long<sup>2</sup>, Binsheng He<sup>3</sup>\* and Jialiang Yang<sup>4,5</sup>\*

*<sup>1</sup> School of Science, East China University of Technology, Nanchang, China, <sup>2</sup> Department of Information Science and Technology, Hainan Normal University, Haikou, China, <sup>3</sup> The First Affiliated Hospital, Changsha Medical University, Changsha, China, <sup>4</sup> College of Information Engineering, Changsha Medical University, Changsha, China, <sup>5</sup> Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, United States*

Identifying differentially expressed genes (DEGs) between tumor and normal samples is critical for studying tumorigenesis, and has been routinely applied to identify diagnostic, prognostic, and therapeutic biomarkers for many cancers. It is well-known that solid tumor tissue samples obtained from clinical settings are always mixtures of cancer and normal cells. However, tumor purity information is largely ignored in traditional differential expression analyses, which might decrease the power of differential gene identification or even bias the results. In this paper, we have developed a novel differential gene calling method called DECtp by integrating tumor purity information into a generalized least square procedure, followed by the Wald test. We compared DECtp with popular methods like *t*-test and limma on nine simulation datasets with different sample sizes and noise levels. DECtp achieved the highest area under the curve (AUC) in all comparisons, suggesting that tumor purity information is critical for DEG calling between tumor and normal samples. In addition, we applied DECtp to cancer and normal samples of 14 tumor types collected from The Cancer Genome Atlas (TCGA) and compared the DEGs with those called by limma. As a result, DECtp achieved more sensitive, consistent, and biologically meaningful results and identified a few novel DEGs for further experimental validation.

Keywords: differentially expressed genes, tumor purity, generalized least square, the Wald test

#### INTRODUCTION

Nowadays, RNA sequencing (RNA-Seq) has become routine for measuring RNA expression levels (Mortazavi et al., 2008; Wang et al., 2009). Due to continuous improvements in sequencing accuracy and reductions in cost, this technology has revolutionized most fields in the life sciences, especially clinical medicine (Berger et al., 2010). Among the many goals of RNA-Seq studies, identifying differentially expressed genes (DEGs) between (usually) two conditions is probably the most common (Ritchie et al., 2015). Generally speaking, DEG analysis uses statistical tests to discover significant gene expression changes between the experimental and control groups, which are critical for explaining transcriptomic changes incurred by experimental conditions. For instance, DEGs between normal and tumor samples help to study tumorigenesis, and have been routinely applied to identify diagnostic, prognostic, and therapeutic biomarkers for many cancers (Wu et al., 2013).

#### Edited by:

*Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China*

#### Reviewed by:

*Yuannyu Zhang, The University of Texas at Dallas, United States Minxian Wang, Broad Institute, United States Cheng Guo, Columbia University, United States*

#### \*Correspondence:

*Binsheng He hbscsmu@163.com Jialiang Yang jialiang.yang@mssm.edu*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *14 June 2018* Accepted: *30 July 2018* Published: *28 August 2018*

#### Citation:

*Zhang W, Long H, He B and Yang J (2018) DECtp: Calling Differential Gene Expression Between Cancer and Normal Samples by Integrating Tumor Purity Information. Front. Genet. 9:321. doi: 10.3389/fgene.2018.00321*


Over the past years, a number of statistical methods and software tools have been developed for identifying DEGs by considering the distributions of gene transcript abundance measured by read counts, Fragments Per Kilobase of transcript per Million (FPKM) (Trapnell et al., 2012), RNA-Seq by Expectation Maximization (RSEM) (Li and Dewey, 2011), and so on. Gene read counts usually follow a multinomial distribution, which can be approximated by a Poisson distribution if they are independently sampled from a population with fixed fractions of genes. Consequently, the Poisson distribution has been widely assumed when testing for differential expression (Marioni et al., 2008; Wang et al., 2010). However, the Poisson distribution has only a single parameter, which forces the variance to equal the mean; real count data are typically more variable than this, so the resulting statistical test does not control the type-I error (Robinson and Smyth, 2007). To solve this so-called over-dispersion problem, the negative binomial (NB) distribution has been proposed to model count data (Anders and Huber, 2010; Zhou et al., 2011; McCarthy et al., 2012; Wu et al., 2013). Alternatively, the read counts can be converted to log2 transformed counts per million, for which the Bayes moderated Student's t-test and linear modeling methods like limma can be used. For instance, limma uses a linear model to assess differential expression from microarray or RNA-Seq technologies in multifactor designed experiments. Its advantages include stability even with small sample sizes and good performance in complex experiments with a variety of experimental conditions and predictors (Ritchie et al., 2015).
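The over-dispersion argument can be illustrated with a small simulation. The following is only a sketch under assumed values: the mean, dispersion, and sample count are arbitrary, and the mean/dispersion parameterization (variance = mean + dispersion × mean²) is the form commonly used for RNA-Seq counts rather than anything specified in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
mean, dispersion, n_samples = 100.0, 0.2, 200_000  # illustrative values only

# Poisson counts: the single rate parameter fixes variance == mean.
poisson_counts = rng.poisson(lam=mean, size=n_samples)

# Negative binomial counts in mean/dispersion form:
# variance = mean + dispersion * mean**2.
# NumPy parameterizes NB by (n, p); here n = 1/dispersion and p = n / (n + mean).
nb_n = 1.0 / dispersion
nb_p = nb_n / (nb_n + mean)
nb_counts = rng.negative_binomial(n=nb_n, p=nb_p, size=n_samples)

# Variance-to-mean ratio: about 1 for Poisson, about 1 + dispersion * mean for NB.
poisson_ratio = poisson_counts.var() / poisson_counts.mean()
nb_ratio = nb_counts.var() / nb_counts.mean()
expected_nb_ratio = 1.0 + dispersion * mean

print(f"Poisson variance/mean: {poisson_ratio:.2f}")
print(f"NB variance/mean:      {nb_ratio:.2f} (theory {expected_nb_ratio:.2f})")
```

A test that assumes the Poisson ratio of 1 when the data behave like the NB sample would understate the variance and therefore reject the null too often.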

Differential expression analyses have been widely performed in cancer (Liang and Pardee, 2003). It is known that clinical tumor samples contain not only tumor cells but also tumor-associated normal epithelial and stromal cells, immune cells, and vascular cells (Joyce and Pollard, 2009), which play important roles in tumor growth, disease progression, and drug resistance (Hanahan and Weinberg, 2011; Junttila and de Sauvage, 2013). As a result, tumor purity, i.e., the percentage of cancer cells in a solid tumor sample, is critical in genomic, transcriptomic, and methylation analyses in cancer (Aran et al., 2015; Zheng et al., 2017). For example, we recently developed InfiniumPurify by integrating tumor purity into differential methylation (DM) analysis, which significantly improved the accuracy of DM identification (Zheng et al., 2017). In addition, we developed a rigorous statistical method, InfiniumClust, to perform sample clustering on DNA methylation data using tumor purity, which also exhibited superior accuracy (Zhang et al., 2017). There have also been a few attempts to account for tumor purity in differential expression analysis (Wang et al., 2015; Shen et al., 2016) by adding it as an additive or semi-additive covariate in linear models (Aran et al., 2015). For example, contamDE proposed a few statistical models to call differential genes between unmatched or matched normal and tumor samples, in which the mean expression for a "contaminated" tumor cell sample follows a semi-additive pattern (Shen et al., 2016). Briefly, let $w_i$ be the proportion of tumor cells in the $i$th tumor sample. For the $j$th gene, contamDE models the distribution of reads from normal cell samples as $N_{ij} \sim \mathrm{NB}(k_i \mu_j, \phi_j)$ and those from "contaminated" tumor samples as $T_{ij} \sim \mathrm{NB}(k'_i(\mu_j + w_i \delta_j), \phi_j)$, where NB denotes the negative binomial distribution, $k_i$ and $k'_i$ are normalization size factors for normal and tumor samples, $\mu_j$ and $\mu_j + w_i \delta_j$ are the adjusted means for normal and tumor samples, and $\phi_j$ is the dispersion. Differential expression is then assessed by testing whether $\delta_j$ is 0. UNDO is designed for deconvoluting array-based gene expression data of tumor samples (Wang et al., 2015), and models the mixing proportions of pure tumor and stroma cells as latent variables. However, tumor purity has multiplicative effects on gene expression, which might not be additive (Zheng et al., 2017). Thus, it is inadequate to simply treat tumor purity as an additive or semi-additive covariate in computational models.

To solve this problem, we have developed a novel method called Differential Expression Caller by combining tumor purity information (DECtp) to identify DEGs between tumor and normal samples. DECtp models expression profiles of tumor samples as a mixed Gaussian distribution, where the mixing proportion is tumor purity. With known or estimated tumor purity, differential expressions are then called based on a generalized least square procedure followed by the Wald test. We performed analyses on extensive simulated data with different sample sizes and noise levels and on TCGA data of various cancers. DECtp achieves more accurate, consistent, and biologically meaningful results than other state-of-the-art methods, such as limma (Ritchie et al., 2015).

#### MATERIALS AND METHODS

Supposing that the input data consist of expression profiles of $N$ genes on $n_0$ normal and $n_1$ cancer samples, we first transform the expression values on each sample group (by log2 transformation, quantile normalization, and so on) such that they will follow a Gaussian distribution. This transformation allows for the introduction of a linear model with Gaussian noise in subsequent steps.
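As an illustration of this preprocessing step, the sketch below applies a log2 transformation followed by a basic quantile normalization to a genes-by-samples matrix. The function names, the pseudocount of 1, and the simulated counts are assumptions made for the example; the text does not specify the exact transformation settings used.

```python
import numpy as np

def log2_transform(expr, pseudocount=1.0):
    """Log2-transform expression values; the pseudocount avoids log2(0)."""
    return np.log2(np.asarray(expr, dtype=float) + pseudocount)

def quantile_normalize(mat):
    """Give every sample (column) the same empirical distribution.

    Rows are genes, columns are samples. Ties are broken arbitrarily
    by the sort order, which is a simplification of this sketch.
    """
    mat = np.asarray(mat, dtype=float)
    order = np.argsort(mat, axis=0)                 # row indices that sort each column
    ranks = np.argsort(order, axis=0)               # rank of each entry within its column
    reference = np.sort(mat, axis=0).mean(axis=1)   # mean of the k-th smallest values
    return reference[ranks]                         # replace each value by its rank's reference

# Hypothetical example: 1000 genes measured on 6 samples.
rng = np.random.default_rng(1)
counts = rng.poisson(50, size=(1000, 6))
normalized = quantile_normalize(log2_transform(counts))
```

After this step every column of `normalized` contains the same set of values, only assigned to different genes.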

Specifically, for any gene $i$, let $X_i$ be its transformed expression on all normal samples. We assume that $X_i \sim N(m_i, \sigma_i^2)$, where $m_i$ and $\sigma_i^2$ represent the mean and variance of $X_i$. Similarly, let $Y_i$ be the transformed expression on "pure" cancer samples for gene $i$, which also follows a normal distribution. Without loss of generality, we assume $Y_i = X_i + \delta_i$, where $\delta_i$ represents the difference between cancer and normal samples. Here, $\delta_i$ is a random variable following a normal distribution with mean $\mu_i$ and variance $\tau_i^2$, i.e., $\delta_i \sim N(\mu_i, \tau_i^2)$. Thus, differential genes could be inferred by the hypothesis test $H_0: \mu_i = 0$. In practice, however, the expression profile of the "pure" cancer sample $Y_i$ is not observed. Instead, the observed expressions of solid tumor samples are always a mixture of expressions on cancer and normal cells.

Let $Y'_i$ be the expression profile of gene $i$ on observed tumor samples. For a tumor sample $s$ with known purity $\lambda_s$ estimated by existing methods, we use $Y'_{is}$ to denote the expression of gene $i$ on sample $s$. Then $Y'_{is}$ can be modeled by a linear formula: $Y'_{is} = (1 - \lambda_s) X_{is} + \lambda_s Y_{is} = (1 - \lambda_s) X_{is} + \lambda_s (X_{is} + \delta_{is}) = X_{is} + \lambda_s \delta_{is}$, so $Y'_{is} \sim N(m_i + \lambda_s \mu_i, \sigma_i^2 + \lambda_s^2 \tau_i^2)$. Clearly, the gene expression variance of tumor samples is greater than or equal to that of normal samples, since $\sigma_i^2 + \lambda_s^2 \tau_i^2 \geq \sigma_i^2$, and bias can arise when directly testing the mean difference between $X_{is}$ and $Y'_{is}$ due to the influence of tumor purity. It is worth noting that tumor purity has a multiplicative (instead of additive) effect (Zheng et al., 2017) on differential expression under this assumption. Previous DEG calling methods that model tumor purity as an additive covariate might therefore be inappropriate (Aran et al., 2015).
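The mean and variance stated for the mixture model can be checked numerically. The following is a minimal sketch, not part of the original method; the function name and all parameter values are hypothetical and chosen only for illustration, and $X_{is}$ and $\delta_{is}$ are assumed to be independent.

```python
import random
import statistics


def simulate_mixed_expression(m, sigma, mu, tau, purities, rng):
    """Draw one observed tumor value per purity: Y' = X + lambda * delta."""
    values = []
    for lam in purities:
        x = rng.gauss(m, sigma)      # expression in normal cells, X_is
        delta = rng.gauss(mu, tau)   # cancer-vs-normal shift, delta_is
        values.append(x + lam * delta)
    return values


# Hypothetical parameters: m_i = 5, sigma_i = 1, mu_i = 2, tau_i = 0.5.
rng = random.Random(0)
lam = 0.6
draws = simulate_mixed_expression(5.0, 1.0, 2.0, 0.5, [lam] * 200_000, rng)

# Model prediction: mean m + lam*mu = 6.2, variance sigma^2 + lam^2*tau^2 = 1.09.
print(statistics.fmean(draws), statistics.variance(draws))
```

The empirical moments should be close to the model values, and repeating the run with a different `lam` shows the mean shift scaling with purity, which is the multiplicative effect described above.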

To solve this problem, we propose a simple linear model and a generalized least squares procedure that take $X_{is}$ and $Y'_{is}$ as input data. Specifically, for gene $i$, the linear regression model is specified as follows: $Z_i = W\beta_i + \epsilon_i$, where

$$Z_{i} = \begin{bmatrix} X_{i1} \\ X_{i2} \\ \vdots \\ X_{i n_0} \\ Y'_{i1} \\ Y'_{i2} \\ \vdots \\ Y'_{i n_1} \end{bmatrix}, \quad W = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & \lambda_1 \\ 1 & \lambda_2 \\ \vdots & \vdots \\ 1 & \lambda_{n_1} \end{bmatrix}, \quad \beta_i = \begin{bmatrix} m_i \\ \mu_i \end{bmatrix}, \quad \text{and} \quad \epsilon_i = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{n_0} \\ \epsilon_{n_0 + 1} \\ \epsilon_{n_0 + 2} \\ \vdots \\ \epsilon_{n_0 + n_1} \end{bmatrix}.$$

Here, the $(n_0 + n_1) \times 1$ vector $Z_i$ represents expression values from normal and tumor samples, with the first $n_0$ entries from normal samples and the last $n_1$ entries from tumor samples. In addition, $W$ is a matrix of dimension $(n_0 + n_1) \times 2$, with the first column consisting of all 1s and the second column consisting of $n_0$ zeros followed by the $n_1$ tumor purities (i.e., $\lambda_1, \lambda_2, \ldots, \lambda_{n_1}$) of the respective tumor samples. $\beta_i$ is the linear model parameter to be determined, and $\epsilon_i$ is the random error. The objective is to test $H_0: \mu_i = 0$.

The parameters can be fitted by a least squares procedure that minimizes $\lVert Z_i - W\beta_i \rVert_2^2$. As a result,

$$\hat{\beta}_i = (W^T W)^{-1} W^T Z_i = H Z_i, \quad \text{where } H = (W^T W)^{-1} W^T, \quad \text{and} \quad \mathrm{var}(\hat{\beta}_i) = H \, \mathrm{var}(Z_i) \, H^T.$$

The variance of $Z_i$ is $\begin{bmatrix} \Sigma & 0 \\ 0 & \Sigma' \end{bmatrix}$, where $\Sigma = \sigma^2 I_{n_0}$ and $\Sigma' = \sigma'^2 I_{n_1}$ are diagonal matrices of size $n_0 \times n_0$ and $n_1 \times n_1$, respectively. Writing $H = [H_1 \; H_2]$, where $H_1$ holds the first $n_0$ columns and $H_2$ the last $n_1$ columns of $H$, we have

$$\mathrm{var}(\hat{\beta}_i) = H \, \mathrm{var}(Z_i) \, H^T = [H_1 \; H_2] \begin{bmatrix} \Sigma & 0 \\ 0 & \Sigma' \end{bmatrix} \begin{bmatrix} H_1^T \\ H_2^T \end{bmatrix} = H_1 \Sigma H_1^T + H_2 \Sigma' H_2^T,$$

so $\mathrm{var}(\hat{\beta}_i)$ can be obtained from $\sigma^2$ and $\sigma'^2$, the residual variances of the normal and cancer groups, respectively. Given the estimate $\hat{\beta}_i$, the regression residuals are $\hat{\epsilon} = Z_i - W\hat{\beta}_i$, and the residual variances of the normal and cancer groups are obtained as

$$\sigma^2 = \frac{\sum_{k=1}^{n_0} \hat{\epsilon}_k^2}{n_0 - 2}, \qquad \sigma'^2 = \frac{\sum_{k=n_0+1}^{n_0+n_1} \hat{\epsilon}_k^2}{n_1 - 2}.$$

We apply a shrinkage estimator similar to that of Cui et al. (2005) to the estimated cancer/normal variances and obtain $\tilde{\sigma}^2$ and $\tilde{\sigma}'^2$. The procedure shrinks all residual variances toward the geometric mean and stabilizes the estimates. After obtaining $\hat{\beta}_i$ and $\mathrm{var}(\hat{\beta}_i)$, the Wald test statistic for testing $H_0: \mu_i = 0$ is calculated as

$$t_i = \frac{\hat{\beta}_i[2]}{\sqrt{\mathrm{var}(\hat{\beta}_i)[2,2]}},$$

where $\hat{\beta}_i[2]$ is the second entry of $\hat{\beta}_i$ and $\mathrm{var}(\hat{\beta}_i)[2,2]$ is the element of the matrix $\mathrm{var}(\hat{\beta}_i)$ at indices [2,2]. Finally, we assume that the Wald statistic follows a t distribution with $n_0 + n_1 - 2$ degrees of freedom, and the p-values are obtained accordingly. The false discovery rate (FDR) can be estimated using established procedures such as the Benjamini-Hochberg method (Benjamini et al., 2001).
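The fitting procedure above can be sketched for a single gene as follows. This is an illustrative reading of the text, not the authors' implementation: the function name `dectp_style_wald` and the toy data are hypothetical, the shrinkage step on the variances is omitted, and the p-value from the t distribution is left out to keep the sketch dependency-free.

```python
import math


def dectp_style_wald(normal, tumor, purity):
    """OLS fit of Z = W beta + eps with W = [1, purity] (purity 0 for normals),
    group-wise residual variances, and the Wald statistic for H0: mu = 0.
    Requires more than two samples in each group."""
    n0, n1 = len(normal), len(tumor)
    z = list(normal) + list(tumor)
    lam = [0.0] * n0 + list(purity)
    n = n0 + n1
    # W^T W = [[n, s1], [s1, s2]]; invert the 2 x 2 matrix by hand.
    s1 = sum(lam)
    s2 = sum(l * l for l in lam)
    det = n * s2 - s1 * s1
    inv = [[s2 / det, -s1 / det], [-s1 / det, n / det]]
    # Rows of H = (W^T W)^{-1} W^T, one entry per sample.
    h_m = [inv[0][0] + inv[0][1] * l for l in lam]
    h_mu = [inv[1][0] + inv[1][1] * l for l in lam]
    m_hat = sum(h * v for h, v in zip(h_m, z))
    mu_hat = sum(h * v for h, v in zip(h_mu, z))
    resid = [v - (m_hat + mu_hat * l) for v, l in zip(z, lam)]
    var_normal = sum(r * r for r in resid[:n0]) / (n0 - 2)
    var_tumor = sum(r * r for r in resid[n0:]) / (n1 - 2)
    # [2,2] entry of H1 Sigma H1^T + H2 Sigma' H2^T.
    var_mu = (var_normal * sum(h * h for h in h_mu[:n0])
              + var_tumor * sum(h * h for h in h_mu[n0:]))
    return m_hat, mu_hat, mu_hat / math.sqrt(var_mu)


# Toy data (hypothetical): normals near 5, tumors near 5 + 2 * purity.
normal = [5.1, 4.9, 5.0, 5.2, 4.8]
purity = [0.2, 0.4, 0.6, 0.8, 0.9]
tumor = [5.45, 5.75, 6.25, 6.55, 6.85]
m_hat, mu_hat, t_stat = dectp_style_wald(normal, tumor, purity)
print(m_hat, mu_hat, t_stat)
```

Because the purity enters the design matrix as a regressor, the estimate of $\mu_i$ refers to the difference at purity 1, that is, between pure cancer and normal cells.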

# RESULTS

We applied DECtp and compared it with canonical DEG calling algorithms such as limma on several simulated datasets and on cancer datasets downloaded from The Cancer Genome Atlas (https://cancergenome.nih.gov/). Before stepping into detailed analyses, it is instructive to first examine the relationship between gene expression and tumor purity.

#### Correlation Between Gene Expression and Tumor Purity

Through extensive analyses of the TCGA data, we discovered that the expression levels of many genes are strongly correlated with tumor purity in cancer, and that the correlation increases with the difference in gene expression between cancer and normal samples. Specifically, the tumor purities, which were calculated by InfiniumPurify (Zhang et al., 2015; Zheng et al., 2017), were downloaded from https://zenodo.org/record/253193. Purity estimation in InfiniumPurify is based on an important observation from the Illumina Infinium 450k methylation data: the number of probes with intermediate methylation levels is significantly greater in tumor samples than in normal samples. InfiniumPurify first identifies a number of informative differentially methylated CpG sites (iDMCs) from the cancer-normal comparison by using a non-parametric Wilcoxon rank-sum test and ANOVA for each probe, and then estimates purity from the probability density of the methylation levels of the iDMCs.

#### Expression Levels of Many Genes Have Strong Correlation With Tumor Purity

We used prostate adenocarcinoma (PRAD) in TCGA as an example to illustrate the correlation between gene expression and tumor purity. Specifically, after quantile-normalizing the expression profiles (quantified by RSEM; Li and Dewey, 2011) of the tumor samples, the purity value of each sample was estimated by InfiniumPurify (Zheng et al., 2017). For each gene, we computed the Spearman correlation between expression levels and tumor purities across tumor samples (termed "Observed" in **Figure 1A**). This yielded 20,440 correlation values, one per gene. As a comparison, we also randomly shuffled the purities of all tumor samples and used the shuffled tumor purities as input to compute the correlation (termed "Random" in **Figure 1A**). As can be seen from **Figure 1A**, the distribution of observed correlations has a longer right tail, demonstrating that many more genes are highly correlated with tumor purity than expected by chance. In particular, we identified 1,252 genes with absolute observed correlation over 0.5 (accounting for 6.2% of all genes), while this number is close to 0 for the shuffled purities.
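The "Observed" versus "Random" comparison can be sketched for one gene as below. This is a minimal, dependency-free illustration under stated assumptions: the helper names and the toy values are hypothetical, and a real analysis would loop over all genes and would typically use a statistics library.

```python
import random


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def observed_and_random(expression, purity, rng):
    """Correlation with the real purities and with a shuffled copy of them."""
    shuffled = list(purity)
    rng.shuffle(shuffled)
    return spearman(expression, purity), spearman(expression, shuffled)


# Toy values for one gene across six tumor samples (hypothetical).
expression = [6.1, 6.9, 7.4, 8.2, 8.8, 9.5]
purity = [0.30, 0.45, 0.50, 0.70, 0.80, 0.95]
print(observed_and_random(expression, purity, random.Random(0)))
```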

#### Correlation Between Gene Expression and Tumor Purity Increases With the Difference of Gene Expressions Between Cancer and Normal Samples

We identified genes highly correlated with tumor purity. What are these genes? To answer this question, we studied the relationship between previously calculated correlations and gene expression changes between tumor and normal samples. Specifically, we first conducted a t-test on the normalized expression profiles of each gene between tumor and normal samples, and then divided all genes into 10 subsets by the test statistics. We then plotted in **Figure 1B** the distribution of observed correlations (between tumor purity and gene expression) in each group. As can be seen, the mean observed correlation in each group increases with the t-test statistics (measuring the extent of gene expression difference between tumor and normal samples). Similarly, we also classified the genes into 10 subgroups according to their correlations with tumor purity and observed a positive correlation between the t-test statistics and group labels (see **Figure 1C**).

We conducted the above analyses across 14 cancer types with sufficient normal tissues (each cancer type with over 10 normal samples), including Bladder Carcinoma (BLCA), Breast Invasive Carcinoma (BRCA), Esophageal Carcinoma (ESCA), Head-Neck Squamous Cell Carcinoma (HNSC), Kidney Chromophobe (KICH), Kidney Renal Clear Cell Carcinoma (KIRC), Kidney Renal Papillary Cell Carcinoma (KIRP), Liver Hepatocellular Carcinoma (LIHC), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC), PRAD, Stomach Adenocarcinoma (STAD), Thyroid Cancer (THCA), and Uterine Corpus Endometrial Carcinoma (UCEC). The top 1,000 genes with the largest correlations for each cancer type are shown in **Supplementary Table S1**. The results were similar for all cancers and are well explained by our linear regression model of gene expression (see Materials and Methods). When there are significant differences between tumor and normal samples (i.e., $\delta_{is}$ is large), gene expression is more strongly correlated with purity. However, when there is no difference between tumor and normal samples (i.e., $\delta_{is}$ is close to 0), gene expression has a low correlation with purity. These results reveal that tumor purity will bias differential expression analysis if not correctly accounted for, and our method was motivated by this observation.
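One way to make this explanation explicit is to compute the correlation under the mixture model $Y'_{is} = X_{is} + \lambda_s \delta_{is}$. The short derivation below is an illustration rather than part of the original analysis: it treats the purity $\lambda_s$ as random across samples, assumes that $X_{is}$, $\delta_{is}$ and $\lambda_s$ are mutually independent, and uses the Pearson correlation, whereas the analysis above reports Spearman correlations.

$$\mathrm{cov}(Y'_{is}, \lambda_s) = \mu_i \, \mathrm{var}(\lambda_s), \qquad \mathrm{var}(Y'_{is}) = \sigma_i^2 + \tau_i^2 \, E[\lambda_s^2] + \mu_i^2 \, \mathrm{var}(\lambda_s),$$

$$\mathrm{corr}(Y'_{is}, \lambda_s) = \frac{\mu_i \sqrt{\mathrm{var}(\lambda_s)}}{\sqrt{\sigma_i^2 + \tau_i^2 \, E[\lambda_s^2] + \mu_i^2 \, \mathrm{var}(\lambda_s)}}.$$

Under these assumptions, the correlation is 0 when $\mu_i = 0$ and its magnitude increases with $|\mu_i|$, in line with the trend observed across cancer types.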

#### Analyses on Simulated Data

To evaluate DECtp and compare it with other methods, we simulated a few datasets resembling true biological scenarios with different sample sizes and noise levels.

#### Simulated Datasets

We first downloaded from TCGA the LUAD gene expression data (in RSEM values) consisting of 517 tumor and 59 matched normal samples. Each RSEM value was transformed to log2(RSEM + min), where min is the minimum nonzero RSEM value. The log2-transformed data were quantile normalized and then used to generate the simulation data.
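The two preprocessing steps can be sketched as follows. This is a minimal illustration under assumptions not stated in the text: quantile normalization is taken to act across samples (columns), ties are broken arbitrarily instead of being averaged, and the function names and toy matrix are hypothetical.

```python
import math


def log2_transform(matrix):
    """Apply log2(value + min), where min is the smallest nonzero entry."""
    floor = min(v for row in matrix for v in row if v > 0)
    return [[math.log2(v + floor) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Rows are genes, columns are samples. Within each sample, the value at
    rank r is replaced by the mean over samples of the rank-r values."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / n_samples
                 for r in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = reference[rank]
    return result


# Toy matrix (hypothetical): 4 genes x 3 samples of RSEM-like values.
rsem = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]
normalized = quantile_normalize(log2_transform(rsem))
for row in normalized:
    print(row)
```

After normalization every sample has the same empirical distribution, which is what makes the per-gene Gaussian assumption easier to apply across samples.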

It is worth mentioning that our purpose is to call DEGs between pure normal and pure tumor samples. However, neither kind of sample can be obtained in practice, so we adopted a compromise strategy as follows:


(3) We generated tumor purity values $\lambda_j$ uniformly from [0.05, 0.95]. Plugging $X_{ij}$, $Y_{ij}$ and $\lambda_j$ into the formula $Y'_{ij} = \lambda_j Y_{ij} + (1 - \lambda_j) X_{ij}$, we simulated $Y'_{ij}$ as the observed expression of gene $i$ in sample $j$, which is a mixture of the expression profiles of "pure" tumor and "pure" normal samples.

We then called DEGs between the simulated pure normal (e.g., $X_{ij}$) and mixed (e.g., $Y'_{ij}$) samples and compared them with the underlying true DEGs to assess accuracy. Because the true mean expression levels are known, we can construct a gold standard for comparison: a gene is defined as a DEG if the absolute difference of its true expression between normal and pure tumor samples is greater than a threshold $\delta$. The simulations were repeated for $\delta$ = 1, 2, 3, which correspond to DEG proportions of roughly 38%, 16%, and 8% of the total number of genes. We also tested the performance of the algorithms with sample sizes of 10, 50, and 100.

#### DECtp Outperforms Other Methods in Simulated Datasets

We performed DEG calling on the 9 simulated datasets using DECtp and a few popular methods, including the t-test and limma. Receiver operating characteristic (ROC) curve analysis (Davis and Goadrich, 2006), using the true DEGs as a gold standard, was performed to compare the performance of the methods (see **Figure 2**). Compared with traditional DEG calling methods, DECtp takes purity as an experimental design factor in a linear model. We therefore added Gaussian noise with mean 0 and standard deviation 0.1 to the tumor purities to test the robustness of our method to errors in purity estimation. DECtp achieved the best AUCs in all simulated datasets, even when the estimated tumor purities were noisy. In addition, limma and the t-test have very similar performance, which is not surprising since they are known to behave similarly for normally distributed data (Murie et al., 2009). Moreover, as expected, the performance of all methods improved as the threshold ($\delta$) or the sample size increased. Overall, these real data-based simulation results demonstrate the robustness and accuracy of DECtp in DE detection when tumor purity is a confounding factor.
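For reference, the gold standard and the AUC used in this kind of comparison can be sketched as below. The function names and toy numbers are hypothetical; the AUC is computed through its pairwise (Mann-Whitney) interpretation, which is quadratic in the number of genes and is meant only as an illustration, and the score of each gene is assumed to be the absolute value of its test statistic.

```python
def gold_standard(true_difference, threshold):
    """Label a gene as a true DEG when |true mean difference| > threshold."""
    return [abs(d) > threshold for d in true_difference]


def roc_auc(scores, labels):
    """AUC as the probability that a true DEG outscores a non-DEG,
    with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): five genes, threshold delta = 1.
true_difference = [2.5, -1.8, 0.2, 0.9, -0.1]
abs_statistics = [7.1, 4.3, 0.8, 5.0, 0.3]
labels = gold_standard(true_difference, 1.0)
print(labels, roc_auc(abs_statistics, labels))
```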

#### Analyses on Real Data

With the success of DECtp on simulated data, we next tested DECtp on real TCGA tumor data on 14 cancer types including BLCA, BRCA, ESCA, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PRAD, STAD, THCA, and UCEC respectively. There are overall 6289 tumor and 632 normal samples. For all cancers, we estimated tumor purities by InfiniumPurify (Zheng et al., 2017).

#### The Top Differential Genes Identified by DECtp Are More Associated With Tumor Purity Than Those of Limma

To study the correlation between tumor purity and top ranked differential genes, we first ranked genes by their false discovery rate calculated by DECtp or limma. We then calculated the average absolute correlation between tumor purity and top n ranked genes. In **Figure 3A**, we plotted the average absolute correlation against n (0 ≤ n ≤ 20000) for BLCA and PRAD. Similar to previous findings, we found that top differentially expressed genes are more correlated with purity than other genes for both DECtp and limma. The trend is clearer for DECtp, indicating that it is better in identifying tumor purity-associated differential genes. The observation holds for all 14 cancer types (see **Supplementary Figure S1**).
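The curve plotted against n can be computed with a running mean over the ranked gene list. The sketch below is illustrative only; the function name and the toy values are hypothetical, and ties in FDR are left in input order.

```python
from itertools import accumulate


def mean_abs_correlation_curve(fdr, correlation):
    """Entry n-1 holds the mean |correlation| of the top n genes by FDR."""
    order = sorted(range(len(fdr)), key=lambda g: fdr[g])
    running = accumulate(abs(correlation[g]) for g in order)
    return [total / n for n, total in enumerate(running, start=1)]


# Toy values for three genes (hypothetical).
fdr = [0.5, 0.01, 0.2]
correlation = [0.1, -0.8, 0.3]
print(mean_abs_correlation_curve(fdr, correlation))
```

A curve that starts high and decays toward the genome-wide average indicates that the top-ranked genes are the ones most associated with purity.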

We also examined the overlaps of DEGs (at FDR 0.001) called by the t-test, limma and DECtp. **Figure 3B** shows the Venn diagrams of the overlaps for BLCA and PRAD. For BLCA, the t-test identified 5,689 DEGs, of which 4,231 (74%) overlap with those identified by DECtp, and limma identified 5,393 DEGs, of which 4,180 (78%) overlap with those identified by DECtp. Similarly, for PRAD, the t-test identified 9,223 DEGs, of which 8,696 (94%) overlap with those identified by DECtp, and limma identified 8,682 DEGs, of which 8,271 (95%) overlap with those identified by DECtp. The overlaps of DEGs for the other cancer types are shown in **Supplementary Figure S2**. In summary, across all tested cancer types, 114,842 DEGs overlap between DECtp (151,327 DEGs overall) and the t-test (136,918 DEGs overall), 107,621 DEGs overlap between DECtp and limma (121,378 DEGs overall), and 112,772 DEGs overlap between the t-test and limma, suggesting that the three methods are generally consistent. We also downloaded RNA-seq count data of six cancer types from TCGA, including BLCA, BRCA, HNSC, LUAD, LUSC, and PRAD, to investigate the overlaps of DEGs called by DECtp, limma and edgeR. To have a fair comparison, we selected the same tumor and normal samples from the two different data types (count vs. RSEM value) when using DECtp and edgeR (332 normal samples versus 2,858 tumor samples). The overlaps of DEGs for the three methods are shown in **Supplementary Figure S3**. The DEGs called by the three methods overlap substantially for the six cancer types. Specifically, for the six cancer types, limma identified 55,593 DEGs, edgeR identified 59,860 DEGs, and DECtp identified 71,115 DEGs, and 44,532 of the DECtp DEGs (62.6%) overlap with those identified by limma and edgeR.
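The overlap percentages above follow from calling DEGs at a fixed FDR and intersecting the resulting gene sets. The sketch below shows one way to do this with Benjamini-Hochberg adjusted p-values; the function names, gene labels and p-values are hypothetical and do not come from the study.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def called_degs(genes, pvalues, fdr):
    """Set of genes whose adjusted p-value is at most the FDR cutoff."""
    adjusted = benjamini_hochberg(pvalues)
    return {g for g, q in zip(genes, adjusted) if q <= fdr}


def overlap_fraction(a, b):
    """Fraction of the genes in set a that are also in set b."""
    return len(a & b) / len(a)


# Toy example (hypothetical gene labels and p-values for two methods).
genes = ["g1", "g2", "g3", "g4"]
method_a = called_degs(genes, [0.01, 0.04, 0.03, 0.005], fdr=0.05)
method_b = called_degs(genes, [0.02, 0.90, 0.01, 0.001], fdr=0.05)
print(sorted(method_a), sorted(method_b), overlap_fraction(method_a, method_b))
```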

Next, we examined the Pearson correlation among test statistics for different cancer types. Even though different cancer types have distinct etiologies, they might still share many genomic and transcriptomic features. We plotted in **Figure 3C** the correlation of test statistics among the 14 cancer types using both DECtp and limma. Overall, the correlations for DECtp are higher than those for limma.

#### DECtp Identifies New Biologically Meaningful Differential Genes

We selected several gene expression profiles from BRCA to demonstrate the confounding effect of tumor purity on differential expression analysis. As shown in **Figure 4**, the left panel displays boxplots of the expression profiles of three genes, IRF8, CECR1 and IL10RA, for tumor and normal samples. The limma p-values are not statistically significant: 0.872 for IRF8, 0.959 for CECR1, and 0.867 for IL10RA. The middle panel shows scatter plots of the expression profiles versus the InfiniumPurify purities, in which the correlations are all very high, in particular −0.68 for IL10RA. The high correlation indicates that the large within-group variance of the cancer samples is mostly caused by variation in purity across samples, which dilutes the signals of DEGs. Thus, after removing the effect of tumor purity, we could observe significant differences in the expression of these genes between the normal and tumor groups. Indeed, many studies have linked these three genes to breast cancer (Heinonen et al., 2008; Takaoka et al., 2008; Pavlides et al., 2010). We also selected the differentially expressed genes detected only by DECtp for DAVID enrichment analysis (at FDR < 0.05). **Supplementary Table S2** shows the enrichment of DE genes for the 10 cancer types. Many biological functions were enriched. For example, GO:0006955∼immune response is the most enriched GO term for BLCA and PRAD, with FDR values of 6.161542e-29 and 1.45e-12, respectively. Thus, by considering tumor purity, DECtp can identify new biologically meaningful DEGs for further experimental validation.

#### DECtp Is More Consistent and Identifies More Biologically Meaningful Differential Genes Than Limma

Consistency is a very important criterion for evaluating DE calling methods on real data. Generally speaking, a robust method should obtain consistent results on technical or biological replicates. To compare the consistency of DECtp with that of limma, we randomly divided the tumor samples of each cancer into two groups, and then detected DEGs by comparing each of the two tumor groups with the normal samples. This process was repeated 50 times. **Figure 5A** shows the overlapping odds ratios of the top 500 DE genes for all 14 cancers. DECtp detected more overlapping DE genes than limma in most cancer types, which suggests that it is more consistent. We then examined the biological implications of the DE calling results. To have a fair comparison, we selected the top 4,000 differential genes of each method and tested their enrichment in "PATHWAYS\_IN\_CANCER" from KEGG (Kanehisa and Goto, 2000), which contains 328 biologically meaningful genes. In UCEC, DECtp detects 110 of these genes compared to 80 genes detected by limma. **Figure 5B** shows the −log10 of the p-values for the enrichment of DEGs in "PATHWAYS\_IN\_CANCER" obtained with the Chi-square test. As can be seen, DECtp shows much smaller p-values than limma in most cancer types, especially in UCEC and BRCA. Overall, these results suggest that DECtp can detect more enriched DEGs in "PATHWAYS\_IN\_CANCER" than limma.
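A Chi-square enrichment test of this kind can be sketched with a 2 x 2 contingency table. The example below uses the gene counts quoted for UCEC, but the background size of 20,000 genes is an assumption made purely for illustration, as is the premise that all 328 pathway genes belong to the background. The function name is hypothetical and no continuity correction is applied, so the resulting p-values are not those reported in **Figure 5B**.

```python
import math


def enrichment_chi2(hits, list_size, pathway_size, background):
    """Pearson Chi-square test (1 degree of freedom, no continuity correction)
    for a gene list against a pathway. Returns (statistic, p-value)."""
    a = hits                         # in list, in pathway
    b = list_size - hits             # in list, not in pathway
    c = pathway_size - hits          # not in list, in pathway
    d = background - list_size - c   # in neither
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


BACKGROUND = 20_000  # hypothetical number of tested genes
chi2_dectp, p_dectp = enrichment_chi2(110, 4000, 328, BACKGROUND)
chi2_limma, p_limma = enrichment_chi2(80, 4000, 328, BACKGROUND)
print(-math.log10(p_dectp), -math.log10(p_limma))
```

With the list size and pathway size held fixed, a larger number of pathway genes in the list gives a larger statistic, which is why the method with more hits shows the larger −log10 p-value.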

# DISCUSSIONS

In this work, we systematically investigated the impact of tumor purity as a confounding factor in differential expression analysis (Aran et al., 2015; Wang et al., 2015; Shen et al., 2016), and proposed a novel statistical model to adjust for tumor purity in DE calling. We first examined the correlations between cancer expression profiles and tumor purity, and found that DE genes have high correlations with tumor purity. It is known that tumor purity has a multiplicative, rather than additive, effect on gene expression, so traditional DE calling methods that ignore tumor purity or model it as an additive covariate may produce biased results. To solve this problem, we proposed DECtp, in which gene expression profiles from tumor samples are modeled as mixed Gaussian distributions, where the mixing proportion is tumor purity. DECtp identified DEGs more robustly and accurately than canonical methods such as limma in both simulation and real data studies, which reinforces our previous claim that tumor purity may confound genomic analyses if not correctly accounted for (Zhang et al., 2017; Zheng et al., 2017).

DECtp is specifically developed to identify DEGs from gene expression profiles that follow normal distributions. However, RNA-sequencing technology has led to a rapid increase in gene expression data in the form of counts. Count data are usually modeled by negative binomial (NB) models, so DECtp cannot be directly applied. In the future, it will be interesting to develop similar models that use NB distributions and incorporate tumor purity information.

Finally, we would like to point out that DECtp may have a few further applications. Similar to differential gene analysis, differential protein and differential methylation analyses have also been widely performed between cancer and normal samples. In principle, DECtp could be applied to any differential analysis between cancer and normal samples, provided that the data are Gaussian. In addition, Aran et al. found that identifying co-expression networks from genomics data without accounting for tumor purity is problematic (Aran et al., 2015). We therefore believe that principles similar to those proposed in this work can be applied to the analysis of gene co-expression. Moreover, tumor purity information might be useful in identifying cancer-associated expression quantitative trait loci (eQTLs). However, this is beyond the scope of this study.

## AUTHOR CONTRIBUTIONS

WZ and JY conceived the concept of the work. WZ, BH, and HL performed the experiments. WZ, JY, and BH wrote the paper.

# FUNDING

This work was supported by the Hainan Provincial Natural Science Foundation of China (Grant No. 618MS057), National Natural Science Foundation of China (Grant Nos. 61762034 and 11661003), the Natural Science Foundation of Hunan, China (Grant No. 2018JJ2461), the research grant (Grant No. GJJ170445) from Science and Technology Project of Education Department of Jiangxi Province, the Key Program of Hunan Provincial Education Department (Grant No. 15A026), and the General Program of Hunan Provincial Philosophy and Social Science Planning Fund office (Grant No. 15YBA035).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00321/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhang, Long, He and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Common and Rare Genetic Risk Factors Converge in Protein Interaction Networks Underlying Schizophrenia

Xiao Chang<sup>1</sup>\*, Leandro de Araujo Lima<sup>1</sup>, Yichuan Liu<sup>1</sup>, Jin Li<sup>1,2</sup>, Qingqin Li<sup>3</sup>, Patrick M. A. Sleiman<sup>1,4,5</sup> and Hakon Hakonarson<sup>1,4,5</sup>\*

<sup>1</sup> The Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, United States, <sup>2</sup> Affiliated Cancer Hospital & Institute of Guangzhou Medical University, Guangzhou, China, <sup>3</sup> Janssen Research & Development, LLC, Titusville, NJ, United States, <sup>4</sup> Department of Pediatrics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States, <sup>5</sup> Division of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA, United States

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Wei Wang, Rockefeller University, United States Jie Ping, Vanderbilt University Medical Center, United States

#### \*Correspondence:

Xiao Chang changx@email.chop.edu Hakon Hakonarson hakonarson@email.chop.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 05 August 2018 Accepted: 12 September 2018 Published: 28 September 2018

#### Citation:

Chang X, Lima LdA, Liu Y, Li J, Li Q, Sleiman PMA and Hakonarson H (2018) Common and Rare Genetic Risk Factors Converge in Protein Interaction Networks Underlying Schizophrenia. Front. Genet. 9:434. doi: 10.3389/fgene.2018.00434

Hundreds of genomic loci have been identified through recent advances in genome-wide association studies (GWAS) and sequencing studies of schizophrenia. However, the functional interactions among those genes remain largely unknown. We developed a network-based approach to integrate multiple genetic risk factors, which led to the discovery of new susceptibility genes and causal sub-networks, or pathways, in schizophrenia. We identified significantly and consistently over-represented pathways in the largest schizophrenia GWA studies, which are highly relevant to synaptic plasticity, neural development and signal transduction, such as long-term potentiation, the neurotrophin signaling pathway, and the ERBB signaling pathway. We also demonstrated that genes targeted by common SNPs are more likely to interact with genes harboring de novo mutations (DNMs) in the protein-protein interaction (PPI) network, suggesting a mutual interplay of both common and rare variants in schizophrenia. We further developed an edge-based search algorithm to identify the top-ranked gene modules associated with schizophrenia risk. Our results suggest that the N-methyl-D-aspartate receptor (NMDAR) interactome may play a leading role in the pathology of schizophrenia, as it is highly targeted by multiple types of genetic risk factors.

Keywords: schizophrenia, GWAS, PPI Network, copy number variation (CNV), gene modules

## INTRODUCTION

Schizophrenia is a psychiatric disorder with profound genetic heterogeneity. Genetic risk factors of schizophrenia range in frequency from common to rare, including common single nucleotide polymorphisms (SNPs), recurrent rare copy number variants (CNVs) and de novo mutations (DNMs) (Friedman et al., 2008; International Schizophrenia Consortium, 2008; Vrijenhoek et al., 2008; Walsh et al., 2008; Xu et al., 2008; Glessner et al., 2010; Mulle et al., 2010; Girard et al., 2011; Levinson et al., 2011; Vacic et al., 2011; Kirov et al., 2012; Xu et al., 2012; Ripke et al., 2013; Sleiman et al., 2013; Fromer et al., 2014; Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014). Current genome-wide association studies (GWAS) in schizophrenia have reported 108 genome-wide significant loci, each of small effect size (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014). It has also been reported that at least a quarter of the genetic contribution to schizophrenia risk can be explained by common SNPs (Lee et al., 2012; Ripke et al., 2013; Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014). On the other hand, multiple case-control studies have identified rare CNVs with a strong effect on the risk of schizophrenia (International Schizophrenia Consortium, 2008; Vrijenhoek et al., 2008; Walsh et al., 2008; Xu et al., 2008; Glessner et al., 2010; Mulle et al., 2010; Levinson et al., 2011; Bergen et al., 2012; Kirov et al., 2012; Szatkiewicz et al., 2014). In addition, recent sequencing studies have shed new light on the genetic basis of schizophrenia, showing that DNMs play a prominent part in the sporadic form of schizophrenia (Xu et al., 2012; Gulsuner et al., 2013; Fromer et al., 2014; McCarthy et al., 2014).

In these studies, multiple lines of evidence show that genetic susceptibility to schizophrenia involves disruption across a group of functionally related genes, implying a complex genetic network underlying schizophrenia (Glessner et al., 2010; Gulsuner et al., 2013; Fromer et al., 2014). To explore the network structure of schizophrenia, many network-based approaches have been applied to different types of genetic variation (Bullmore and Sporns, 2009; Gilman et al., 2012; Jia et al., 2012; Luo et al., 2014a,b). Among the different types of gene networks, protein-protein interaction (PPI) networks have been shown to be a powerful tool for identifying disease-associated modules and pathways and for revealing the biological significance of diverse genetic variations (Barabasi et al., 2011; Jia et al., 2011; Chang et al., 2013; Han et al., 2013; International Multiple Sclerosis Genetics Consortium, 2013; Leiserson et al., 2013; Luo et al., 2014b; Zhou et al., 2014). For example, instead of pursuing genome-wide significance, two GWA studies have successfully identified disease-associated gene modules, which comprise many closely interacting genes showing nominal significance, by integrating PPI network analysis into GWAS (Han et al., 2013; International Multiple Sclerosis Genetics Consortium, 2013). However, it is still a challenge to conduct a comprehensive PPI network analysis, in particular one that incorporates different types of genetic factors from different tissue types.

In the present study, we established a network-based approach to investigate the gene modules and pathways underlying schizophrenia, and to explore the inherent associations among multiple genetic risk factors. Our analysis uncovered significantly enriched association signals in pathways relevant to synaptic plasticity, neural development and signal transduction, such as long-term potentiation, the neurotrophin signaling pathway, the ERBB signaling pathway and the MAPK signaling pathway, suggesting that these pathways play contributory roles in the pathophysiology of schizophrenia. We also demonstrated that genes targeted by common SNPs are more likely to interact with genes carrying DNMs. Finally, we identified a group of interacting genes showing a significant combined effect on the genetic susceptibility to schizophrenia.

# MATERIALS AND METHODS

# GWAS Data Sets

Gene-level P values were calculated based on SNP P values from the largest GWAS conducted by Schizophrenia Psychiatric Genome-Wide Association Study Consortium (PGC), which recruited 36,989 cases and 113,075 controls (PGC phase 2, abbreviated as PGC2) (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014). The association results were downloaded from the website of PGC<sup>1</sup> . As a control, we used the GWAS data of Crohn's disease (CD) from the International IBD Genetics Consortium<sup>2</sup> including a total of 3,685 cases and 5,968 controls (Jostins et al., 2012).

## Gene-Level Associations

Gene-level associations were calculated by VEGAS (Liu et al., 2010). VEGAS performs Monte Carlo simulations from the multivariate normal distribution based on the LD pattern of reference populations and assigns an estimated P value to each gene. SNPs located within 50 kb upstream and 50 kb downstream of gene boundaries were used in the analysis in order to capture regulatory regions and SNPs in LD. Previous studies suggested P < 0.05 as the threshold of gene-level significance (Liu et al., 2010; International Multiple Sclerosis Genetics Consortium, 2013). However, since the PGC2 study reports many more genome-wide significant loci than previous studies as a result of differences in study size, gene-level significance was evaluated at both P < 0.01 (2,501 significant genes) and P < 0.05 (4,698 significant genes) in this study. Genes located in the MHC region (25–34 Mb on chr6) were excluded from the analysis.
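The idea behind a simulation-based gene-level test can be sketched as follows. This is a simplified, VEGAS-style illustration and not the VEGAS software itself: the function names are hypothetical, the gene statistic is taken to be the sum of 1-df Chi-square statistics of the SNPs assigned to the gene, the LD matrix is a toy positive-definite correlation matrix, and the "+1" in the empirical P value is a convention chosen here.

```python
import math
import random
from statistics import NormalDist


def cholesky(a):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def gene_level_p(snp_pvalues, ld, n_sim, rng):
    """Empirical P value of the summed Chi-square statistic of a gene's SNPs,
    with the null distribution simulated from N(0, ld)."""
    inv = NormalDist().inv_cdf
    # Two-sided SNP p-value -> squared z-score (1-df Chi-square statistic).
    observed = sum(inv(p / 2) ** 2 for p in snp_pvalues)
    low = cholesky(ld)
    k = len(snp_pvalues)
    exceed = 0
    for _ in range(n_sim):
        e = [rng.gauss(0.0, 1.0) for _ in range(k)]
        # Correlated z-scores: z = L e has covariance L L^T = ld.
        z = [sum(low[i][j] * e[j] for j in range(i + 1)) for i in range(k)]
        if sum(v * v for v in z) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


# Toy gene with two SNPs in moderate LD (hypothetical values).
ld = [[1.0, 0.5], [0.5, 1.0]]
rng = random.Random(1)
print(gene_level_p([1e-6, 1e-5], ld, 2000, rng))
print(gene_level_p([0.9, 0.8], ld, 2000, rng))
```

Simulating from the LD-aware null is what keeps a gene with many correlated SNPs from appearing more significant simply because its SNPs carry redundant signal.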

# Rare Variations Curation

In this study, we used the sequencing results from previous studies (Xu et al., 2012; Gulsuner et al., 2013; Fromer et al., 2014) and annotated the variants by wANNOVAR<sup>3</sup> (Chang and Wang, 2012). We used SIFT and Polyphen2 (HDIV) scores compiled by dbNSFP2 database as well as the AVSIFT score based on annotations at http://sift.bii.a-star.edu.sg to assess whether the missense variants are benign or damaging (**Supplementary Table S1**).

For the CNVs, we collected the genes disrupted by CNVs reported in large case-control studies of schizophrenia (**Supplementary Table S2**).

#### Network Analysis

A schematic overview of the network analysis pipeline used in this study is provided in **Supplementary Figure S2**.

The PPI network was constructed based on the iRefIndex database, which collects protein interactions from a number of primary interaction databases (Razick et al., 2008). In order to control the rate of false positive interactions, we selected only those interactions that were supported by at least two independent PubMed publications. A high-confidence network with 9,090 proteins (nodes) and 25,864 interactions (edges) was subsequently built for downstream analyses.

<sup>1</sup>http://www.med.unc.edu/pgc/downloads

<sup>2</sup>http://www.ibdgenetics.org

<sup>3</sup>http://wannovar.usc.edu
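The literature-support filter applied to the PPI network can be sketched as follows. The record layout `(protein_a, protein_b, pubmed_id)` and the function name are hypothetical and do not reflect the actual iRefIndex file format; dropping self-interactions is likewise an assumption of this sketch.

```python
from collections import defaultdict

def high_confidence_edges(records, min_pubmed=2):
    """Keep undirected interactions supported by at least `min_pubmed` distinct PubMed IDs.

    records: iterable of (protein_a, protein_b, pubmed_id) tuples (hypothetical layout).
    """
    support = defaultdict(set)
    for a, b, pmid in records:
        if a == b:
            continue  # self-interactions are skipped (assumption of this sketch)
        support[frozenset((a, b))].add(pmid)
    return {edge for edge, pmids in support.items() if len(pmids) >= min_pubmed}
```

Storing each edge as a `frozenset` makes the count independent of the order in which the two proteins are listed.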

We next mapped the significant genes (P < 0.05) identified by VEGAS to the PPI network, and obtained a sub-network comprised of the significant genes and the interactions among them. The sub-network contains several connected components and many singletons. We then extracted the largest connected component (LCC) of the sub-network for downstream analysis.

To test whether the LCC is larger than would be expected by chance, we randomly reassigned the P values among the genes of the same network and generated simulated LCCs. We repeated this procedure 10,000 times and used these simulations as the background to estimate the significance of the LCCs generated from the real data (**Figure 1** and **Supplementary Table S3**). To investigate the biological significance of the genes in the LCC, we carried out a gene function enrichment analysis against the KEGG database using DAVID (**Supplementary Table S4**) (Huang et al., 2007).
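The LCC extraction and its permutation test can be sketched in a few lines. This is a minimal illustration rather than the authors' code: shuffling P values across the genes of a fixed network is treated here as equivalent to drawing a random gene set of the same size as the significant set, and all function names are hypothetical.

```python
import random
from collections import defaultdict, deque

def largest_component_size(edges, nodes):
    """Number of nodes in the largest connected component induced by `nodes`."""
    nodes = set(nodes)
    adj = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for nb in adj[node] - seen:
                seen.add(nb)
                queue.append(nb)
        best = max(best, size)
    return best

def lcc_permutation_p(edges, all_genes, significant, n_perm=10_000, seed=0):
    """Observed LCC size and the fraction of random gene sets with an LCC at least as large."""
    rng = random.Random(seed)
    all_genes = sorted(all_genes)
    observed = largest_component_size(edges, significant)
    hits = sum(
        largest_component_size(edges, rng.sample(all_genes, len(significant))) >= observed
        for _ in range(n_perm)
    )
    return observed, hits / n_perm
```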

## Gens (GWAS Edge-Based Network Search) Algorithm

The Gens algorithm is adapted from a previously published node-based network search method (Ideker et al., 2002; Chuang et al., 2007; Jia et al., 2011).

Gens first assigns a weight to each edge of the network, calculated from the gene-wise P values and the mRNA expression correlation of each interacting gene pair (**Supplementary Data Sheet 1**). The weight of each edge is defined as

$$W_{ij} = C_{ij} \times \sqrt{P_{i} \times P_{j}}$$

where C<sub>ij</sub> denotes the Pearson correlation coefficient of the interacting gene pair, gene i and gene j, P<sub>i</sub> is the P value of gene i, and P<sub>j</sub> is the P value of gene j.

The gene mRNA expression data were downloaded from the Allen Brain Atlas<sup>4</sup>.

The weight of each edge was then converted into a Z score

$$Z_{ij} = \phi^{-1}\left(1 - W_{ij}\right)$$

where φ<sup>−1</sup> represents the inverse normal cumulative distribution function.
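The two formulas above translate directly into code. The sketch below is illustrative only: the function name is hypothetical, and because the text does not state how zero or negative correlations are handled, the sketch raises an error for any weight outside the open interval (0, 1) instead of guessing.

```python
import math
from statistics import NormalDist

def edge_z_score(corr, p_i, p_j):
    """Z score of an edge from its co-expression correlation and the two gene P values."""
    weight = corr * math.sqrt(p_i * p_j)  # edge weight W_ij
    if not 0.0 < weight < 1.0:
        # The inverse normal CDF is only defined on (0, 1); handling of
        # non-positive correlations is not specified in the text.
        raise ValueError("edge weight must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - weight)  # inverse CDF of 1 - W_ij
```

For instance, a correlation of 0.5 between two genes with P values of 0.01 and 0.04 gives a weight of 0.01 and a Z score of roughly 2.33.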

The score of a gene module is defined as

$$Z_m = \frac{\sum Z_{ij}}{\sqrt{k}}$$

where k is the number of edges in the module.

The search procedure starts from a seed edge; neighboring interactors are added to the module if they yield an increment greater than Z<sub>m</sub> × r, where r is set to 0.05 in this study.
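A greedy expansion consistent with this description can be sketched as follows. The text does not specify the order in which candidates are tried or how ties are broken, so adding the single best-scoring neighbor per round is an assumption, as are the function names and the representation of edges as `frozenset` pairs without self-loops.

```python
import math

def module_score(edge_z, module_edges):
    """Sum of edge Z scores in the module divided by the square root of the edge count."""
    k = len(module_edges)
    return sum(edge_z[e] for e in module_edges) / math.sqrt(k) if k else 0.0

def grow_module(edge_z, seed_edge, r=0.05):
    """Grow a module from a seed edge; edge_z maps frozenset({gene_a, gene_b}) to a Z score."""
    module_edges = {seed_edge}
    genes = set(seed_edge)
    improved = True
    while improved:
        improved = False
        current = module_score(edge_z, module_edges)
        best_gain, best_gene, best_edges = 0.0, None, None
        # Genes outside the module that share an edge with a module gene
        candidates = {g for e in edge_z if e & genes for g in e - genes}
        for gene in sorted(candidates):
            # All edges linking the candidate to genes already in the module
            new_edges = {e for e in edge_z if gene in e and (e - {gene}) <= genes}
            gain = module_score(edge_z, module_edges | new_edges) - current
            if gain > current * r and gain > best_gain:
                best_gain, best_gene, best_edges = gain, gene, new_edges
        if best_gene is not None:
            genes.add(best_gene)
            module_edges |= best_edges
            improved = True
    return genes, module_score(edge_z, module_edges)
```

Because each accepted step adds one gene, the loop ends after at most as many rounds as there are genes in the network.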

To evaluate the likelihood that the detected modules were identified by chance, Gens creates a background distribution by scoring 100,000 randomly generated modules with the same number of genes as the detected module. The significance is calculated as the proportion of those randomly generated modules whose Z<sub>m</sub> is larger than or equal to that of the identified module. Gens also adjusts for the size of the identified module by defining a normalized module score

$$Z_n = \frac{Z_m - \mathrm{mean}\left(Z_m(\pi)\right)}{\mathrm{sd}\left(Z_m(\pi)\right)},$$

where Z<sub>m</sub>(π) represents the distribution of Z<sub>m</sub> generated by 100,000 simulations.

<sup>4</sup>http://human.brain-map.org/static/download
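The significance estimate and the normalized score can be computed together from one set of simulated module scores. In this sketch, `score_random_module` is a hypothetical callback that returns the score of one randomly generated module with the same number of genes as the detected module; the use of the sample standard deviation is also an assumption, since the text does not say which estimator is used.

```python
import random
import statistics

def module_significance(observed_zm, score_random_module, n_sim=100_000, seed=0):
    """Empirical P value and normalized score of a module against random modules."""
    rng = random.Random(seed)
    null = [score_random_module(rng) for _ in range(n_sim)]
    # Proportion of random modules scoring at least as high as the observed module
    p_value = sum(z >= observed_zm for z in null) / n_sim
    # Observed score standardized by the mean and SD of the simulated scores
    z_n = (observed_zm - statistics.mean(null)) / statistics.stdev(null)
    return p_value, z_n
```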

# RESULTS

## Enriched Pathways Underlying Schizophrenia

fgene-09-00434 September 26, 2018 Time: 15:23 # 4

We first used VEGAS to convert the SNP associations into gene-level P values (**Supplementary Figure S1**). We next extracted the sub-networks formed by genes with a significant gene-level P value. The identified sub-networks comprise connected components and singletons. Among the connected components, the LCC contains most of the nodes and edges in the sub-network and may therefore capture pathways underlying schizophrenia. To investigate the biological significance of the LCCs, we carried out a gene function enrichment analysis on the gene sets of the LCCs. We found significantly over-represented KEGG pathways that are highly relevant to synaptic plasticity, neural development and signal transduction, such as long-term potentiation, the neurotrophin signaling pathway, the ERBB signaling pathway, the MAPK signaling pathway, and the T cell receptor signaling pathway. Other enriched pathways include the proteasome, the ubiquitin mediated proteolysis pathway and multiple cancer-associated pathways (**Supplementary Table S4**).

We further confirmed that the LCCs are significantly larger than the LCCs generated from simulated random networks (**Figure 1** and **Supplementary Table S3**). For comparison, we performed the same analysis on a CD cohort, in which the LCC was also larger than in random simulations (**Supplementary Table S3**). This result is consistent with a previous study suggesting that it is biologically plausible for a set of genes to contribute coherently to disease risk through interactive co-function and co-regulation (International Multiple Sclerosis Genetics Consortium, 2013).

## Mutual Interplay of Common and Rare Genetic Risk Factors in Schizophrenia

To examine whether the GWAS-identified genes in the LCC network are more likely to interact with genes harboring DNMs, we added the genes carrying potential DNMs (frameshift insertions/deletions, missense variants, or nonsense variants) and extracted the LCC based on the merged gene set. The size of the LCC increased significantly more than in 10,000 simulations of the same procedure based on the same number of randomly selected genes. As a control, we tested the same number of top significant genes from the CD GWAS. The size of the resulting LCC was not significantly different from random simulations. Furthermore, the size of the LCC did not increase significantly relative to random simulations when genes with silent de novo variants in schizophrenia cases were included (**Figure 2** and **Supplementary Table S5**).

# Causal Gene Modules Identified by Network Search Algorithm

To add further pieces to the schizophrenia genetic puzzle, we collected literature-reported genes that are known to be disrupted by CNVs in schizophrenia patients (**Supplementary Table S2**) and added them to the PPI network analysis. We subsequently derived the LCC from genes targeted by SNPs, DNMs, and CNVs.

To pinpoint a small group of interacting genes with a significant combined/additive effect on schizophrenia, we developed an edge-based network search algorithm (Gens) for detecting causal gene modules in PPI networks (**Supplementary Figure S2**). The results obtained at gene-level significance thresholds of 0.05 and 0.01 were highly consistent with each other: the top-ranked gene modules overlapped considerably in their gene content. The genes shared among the top-ranked modules pointed to the interactome of N-methyl-D-aspartate receptor (NMDAR) genes, including DLG1, DLG2, DLG4, ERBB4, GRIN2A, and GRIN2B (**Supplementary Figure S3**). All of those genes exhibited strong associations with schizophrenia susceptibility (DLG1, rs436564, P = 8.97 × 10<sup>−4</sup>; DLG2, rs12294291, P = 4.90 × 10<sup>−7</sup>; DLG4, rs222854, P = 3.76 × 10<sup>−5</sup>; ERBB4, rs16846200, P = 1.62 × 10<sup>−5</sup>; GRIN2A, rs9922678, P = 6.72 × 10<sup>−9</sup>; GRIN2B, rs11757887, P = 8.81 × 10<sup>−7</sup>; **Supplementary Figure S4**), with GRIN2A reaching genome-wide significance in the PGC2 study.

Some of the NMDAR genes are also targeted by rare variations. For example, DLG1 and GRIN2A have been reported to be targeted by DNMs; DLG1, DLG2, and ERBB4 have been reported to be targeted by CNVs. To further explore the risk genes from the PPI network, we next selected all the gene modules with P < 0.05 (P value calculated by random simulation, see Methods) and calculated the frequency with which each gene occurred in the selected modules. Genes with a frequency above the upper quartile were defined as 'top genes'. The 'top genes' were used to construct a new PPI network of 152 nodes and 324 edges (**Figure 3**), which reflects the most significant gene module derived from the network analysis.
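The selection of 'top genes' by module frequency can be sketched as below. The function name is hypothetical, and the quartile convention (the default of Python's `statistics.quantiles`) is an assumption, since the text does not state how the upper quartile was computed.

```python
import statistics
from collections import Counter

def top_genes(modules):
    """Genes whose occurrence count across modules exceeds the upper quartile of all counts.

    modules: iterable of gene collections, one per significant module.
    """
    freq = Counter(gene for module in modules for gene in set(module))
    upper_quartile = statistics.quantiles(freq.values(), n=4)[2]
    return {gene for gene, count in freq.items() if count > upper_quartile}
```

The input would be the gene sets of all modules with P < 0.05; the returned set is then used to induce a sub-network from the PPI network.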

Enrichment analysis indicated that these top genes are enriched in the neurotrophin signaling pathway (P = 7.27 × 10<sup>−13</sup>), ERBB signaling pathway (P = 1.84 × 10<sup>−7</sup>), long-term potentiation (P = 5.37 × 10<sup>−5</sup>), MAPK signaling pathway (P = 3.16 × 10<sup>−5</sup>), T cell receptor signaling pathway (P = 1.17 × 10<sup>−5</sup>), and pathways in cancer (P = 4.87 × 10<sup>−8</sup>), to name a few (**Supplementary Table S6**). Moreover, in this network, we found that multiple genes are connected with the core members of the NMDAR interactome, such as ATP2B2, DLGAP, MAP1A, NOS1, PTK2B, PTPRG and PRKCA. Among them, ATP2B2 (rs9879311, P = 2.77 × 10<sup>−6</sup>) and NOS1 (rs2293052, P = 1.24 × 10<sup>−6</sup>) exhibited strong associations with schizophrenia risk in the PGC2 GWAS.
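The enrichment P values in this study were obtained with DAVID, which uses its own modified Fisher exact statistic. As a rough illustration of the underlying over-representation test only, the sketch below computes a plain one-sided hypergeometric tail probability; the function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """Probability of at least `n_overlap` pathway genes among `n_selected` genes
    drawn without replacement from `n_background` genes, `n_pathway` of which
    belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```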

Besides the NMDAR interactome, we also found candidate genes showing strong associations with schizophrenia risk in the network, such as ANKS1B (rs10745841, P = 1.28 × 10<sup>−6</sup>), CHUK (rs975752, P = 2.52 × 10<sup>−6</sup>), CNTN2 (rs16937, P = 8.69 × 10<sup>−7</sup>), CNTNAP2 (rs6961013, P = 4.80 × 10<sup>−5</sup>), CREB1 (rs2709410, P = 4.07 × 10<sup>−6</sup>), CREB5 (rs4722797, P = 7.58 × 10<sup>−6</sup>; rs887622, P = 8.79 × 10<sup>−6</sup>), CUL3 (rs11685299, P = 1.11 × 10<sup>−8</sup>), EP300 (rs9607782, P = 6.76 × 10<sup>−12</sup>), GABBR2 (rs2304389, P = 3.81 × 10<sup>−7</sup>), GNA13 (rs11868185, P = 4.44 × 10<sup>−5</sup>), NCOR2 (rs2229840, P = 2.90 × 10<sup>−4</sup>), NTRK3 (rs146797905, P = 3.35 × 10<sup>−7</sup>; rs8042993, P = 7.84 × 10<sup>−6</sup>), PAK2 (rs10446497, P = 5.30 × 10<sup>−6</sup>), PTK2 (rs4961278, P = 1.86 × 10<sup>−5</sup>), PTK2B (rs2565065, P = 1.94 × 10<sup>−7</sup>), PTN (rs3735025, P = 7.75 × 10<sup>−9</sup>), PTPRF (rs11210892, P = 4.97 × 10<sup>−10</sup>), STK4 (rs6065777, P = 5.92 × 10<sup>−6</sup>), and TCF4 (rs9636107, P = 9.09 × 10<sup>−13</sup>). Among them, CUL3, EP300, NCOR2, PTK2B, and PTPRF were targeted by DNMs, and PAK2, PARK2 and PTK2 were targeted by CNVs.

FIGURE 2 | (A,B) Comparison of the number of nodes between the real network (damaging events) and random networks. (C,D) Comparison of the number of nodes between the real network (benign events) and random networks. Connectedness of the LCC based on gene-level significant genes (P<sub>gene</sub> < 0.01) from the PGC2 study and genes harboring DNMs. The original LCC based on gene-wise significant genes comprises 402 nodes and 620 edges. 635 genes harboring DNMs are added to generate the new LCC. The background distribution is generated from 10,000 LCCs obtained by adding 635 randomly selected genes. P values are estimated as the proportion of LCCs from random networks with more nodes or edges than the real network. As a control, we use the LCC generated by adding the top 635 gene-level significant genes from Crohn's disease. The dashed line denotes the size of the LCC generated by adding DNMs. The solid line denotes the size of the LCC generated by adding CD top genes. Adding DNMs significantly increased the size of the LCC (DNMs: P<sub>node</sub> = 0.0022, P<sub>edge</sub> = 0.0032; CD: P<sub>node</sub> = 0.1941, P<sub>edge</sub> = 0.0678), while adding top CD genes did not. For comparison, we also added the synonymous and non-frameshift substitutions to generate the new LCC. The size of the new LCC is not significantly larger than in random simulations (Benign substitutions: P<sub>node</sub> = 0.698, P<sub>edge</sub> = 0.0571; CD: P<sub>node</sub> = 0.1922, P<sub>edge</sub> = 0.1900).

#### DISCUSSION

Given the heterogeneity and complexity of the genomic landscape in schizophrenia, we employed multiple network-based methods to reveal the intrinsic associations among different types of genetic risk variants, resulting in the discovery of novel gene modules and pathways underlying schizophrenia (**Supplementary Figure S2**).

With the recent success of GWAS in schizophrenia, which uncovered 108 genome-wide significant loci (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014), the genetic underpinnings of this complex disease have begun to unravel. However, a considerable number of nominally significant loci are likely to be identified in future studies through the analysis of larger sample sizes or the application of new and innovative methods. For example, the schizophrenia susceptibility gene CAMKK2, which showed nominal significance (rs1063843, P = 2.32 × 10<sup>−5</sup>) in the PGC2 study, was successfully identified by integrative analysis of gene expression and PPI (Luo X.J. et al., 2014).

We hypothesize that a group of functionally related genes with nominal significance could jointly contribute to schizophrenia susceptibility. We further performed a PPI network-based pathway analysis on two GWA studies of schizophrenia and identified significantly enriched KEGG pathways in both studies. Some pathways have been strongly associated with schizophrenia, such as long-term potentiation, the ERBB signaling pathway and the MAPK signaling pathway (Fazzari et al., 2010; Pitcher et al., 2011; Funk et al., 2012; Savanthrapadian et al., 2013; Salavati et al., 2015). Interestingly, we found both the proteasome pathway and the ubiquitin mediated proteolysis pathway to be significantly enriched (**Supplementary Table S4**). Dysfunction of the ubiquitin-proteasome pathway (UPP) has been implicated in the pathology of various neurodegenerative conditions, and has been linked to several late-onset neurodegenerative diseases caused by aggregate-prone proteins, such as Alzheimer's disease, Parkinson's disease and Huntington's disease (Rubinsztein, 2006; Hegde and Upadhya, 2011). Cumulative evidence also suggests that schizophrenia patients have aberrant gene expression patterns and protein expression disruptions in the UPP, suggesting that the UPP may also contribute to the deficits in schizophrenia (Vawter et al., 2001; Aston et al., 2004; Altar et al., 2005; Bousman et al., 2010; Rubio et al., 2013). Our results are consistent with these findings and provide new evidence in support of the association between the UPP and the pathogenesis of schizophrenia.

Cumulative evidence suggests that DNMs are an important cause of mental disorders such as schizophrenia, autism and intellectual disability (Veltman and Brunner, 2012). DNMs occurring in different genes of different patients may be collectively responsible for a portion of sporadic schizophrenia cases. However, unlike CNVs, genes recurrently mutated by SNVs are rare, and the overlap of genes disrupted by DNMs from recent studies is also small (**Supplementary Figure S5**). Thus, two questions naturally arise: are genes targeted by common SNPs more likely to be targeted by DNMs, and are genes targeted by common SNPs more likely to interact with genes carrying DNMs? For the first question, the PGC2 study unveiled significant overlap between genes in the schizophrenia GWAS associated intervals and those with DNMs in schizophrenia (P = 0.0061) (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014). For the second question, our analysis provides new evidence suggesting that genes targeted by common SNPs or DNMs are likely to interact with each other or participate in the same pathway. Collectively, these results suggest that schizophrenia susceptibility involves a mutual interplay of both common and rare genetic risk factors.

We additionally developed an edge-based network search algorithm to identify the leading disease-associated modules underlying schizophrenia. The network search method was initially node-based, and was developed to detect a group of interacting genes that show significant changes in mRNA expression (Ideker et al., 2002). Later, this method was successfully applied to post-GWAS network analysis (Jia et al., 2011; Jia et al., 2012; Han et al., 2013; International Multiple Sclerosis Genetics Consortium, 2013). Here, the advantage of Gens is that the edge-based method can utilize not only the P values of the nodes but also the gene co-expression information as edge weights to score and rank the detected modules (**Methods**).

Using this approach, we found that the top-ranked modules were significantly enriched in NMDAR pathway associated genes, including DLG1, DLG2, DLG4, ERBB4, GRIN2A, and GRIN2B. All of those genes show strong association with schizophrenia in GWAS. DLG1, DLG2, ERBB4, and GRIN2A were also targeted by DNMs or CNVs. In addition to GRIN2A, which has surpassed genome-wide significance (rs9922678, P = 6.72 × 10<sup>−9</sup>) in the PGC2 study, DLG2 (rs12294291, P = 4.90 × 10<sup>−7</sup>) and GRIN2B (rs11757887, P = 8.81 × 10<sup>−7</sup>) also showed strong associations nearly reaching genome-wide significance. These results suggest that dysfunction of the NMDAR complex plays a leading role in the pathology of schizophrenia and is highly impacted by multiple genetic risk factors.

We further pinpointed two genes, ATP2B2 (rs9879311, P = 2.77 × 10<sup>−6</sup>) and NOS1 (rs2293052, P = 1.24 × 10<sup>−6</sup>), which were closely connected to the NMDAR interactome and showed strong associations with schizophrenia risk. ATP2B2 encodes the plasma membrane calcium-transporting ATPase 2, which plays an important role in intracellular calcium homeostasis and extrudes Ca<sup>2+</sup> from the cytosol into the extracellular space. Family-based association studies suggested ATP2B2 as a risk gene for autism in multiple ethnicities (Carayol et al., 2011; Prandini et al., 2012; Yang et al., 2013). A previous study also suggested that ATP2B2 could confer risk to schizophrenia (Ikeda et al., 2010). NOS1 encodes a member of the nitric oxide synthases, which functions as a biologic mediator in neurotransmission. Previous studies also provided evidence of associations between NOS1 and schizophrenia risk (Shinkai et al., 2002; Reif et al., 2011; Zhang et al., 2014).

Besides the NMDAR interactome, CUL3, EP300, PTN, PTPRF, and TCF4 reached genome-wide significance in the PGC2 study. CUL3, EP300, and PTPRF were also targeted by DNMs. EP300 serves as an important hub in the network, directly interacting with 14 genes (TCF4, EGR1, SREBF1, and SREBF2 located in genome-wide significant regions; AKT1 and SMAD7 targeted by DNMs). The product of EP300 functions as a histone acetyltransferase and regulates transcription via chromatin remodeling. Defects of EP300 can cause Rubinstein-Taybi syndrome (a disease with short stature and intellectual disability) and may result in the formation of tumors (Tillinghast et al., 2003; Roelfsema et al., 2005; Negri et al., 2015). Interestingly, the DNM (NM\_001429, exon14, c.C2656G, p.P886A) found in EP300 is not predicted as damaging by either SIFT or PolyPhen2, and a common missense variant in EP300 is also strongly associated with schizophrenia (rs20551, P = 1.38 × 10<sup>−8</sup>; NM\_001429, exon15, c.A2989G, p.I997V), which suggests that slight changes in the protein conformation of EP300 may confer risk to schizophrenia. EP300 also interacts and is co-expressed with CREB1 in the network. EP300 has been reported to mediate cAMP-gene regulation through phosphorylated CREB proteins. CREB1 also showed a strong association (rs2709410, P = 4.07 × 10<sup>−6</sup>) in the PGC2 study. CREB1 has been linked to drug addiction, memory disorders and neurodegenerative diseases (Bilecki and Przewlocki, 2000; Nestler, 2002; Josselyn and Nguyen, 2005; Lee et al., 2005). There is also some evidence of an association between CREB1 and schizophrenia (Li et al., 2013; Ma et al., 2014; Kumar et al., 2015). PTN is another important hub, which interacted with eight genes (NCAN, PSMB10, and SGSM2 located in genome-wide significant regions; NCAN, PSMD2, and SGSM2 targeted by DNMs). PTN encodes pleiotrophin, which may suppress long-term potentiation induction (Pavlov et al., 2002).

In the network, candidate genes with nominal significance, such as ANKS1B, CNTN2, CNTNAP2, GABBR2, NCOR2, and NTRK3, may also be involved in the pathology of schizophrenia. The product of ANKS1B is predominantly expressed in brain tissue and interacts with amyloid beta protein precursor, which may play a role in brain development. A recent study demonstrated that the ANKS1B product regulates synaptic GluN2B levels and thereby influences NMDAR function. Multiple pieces of evidence have linked CNTN2, CNTNAP2, GABBR2, and NTRK3 to neuropsychiatric disorders, including schizophrenia (Weickert et al., 2005; Friedman et al., 2008; Otnaess et al., 2009; Fazzari et al., 2010; Fatemi et al., 2011; Roussos et al., 2012; Bormuth et al., 2013; Fatemi et al., 2013; Karayannis et al., 2014). SNPs in NCOR2 were associated with cocaine dependence in a recent GWAS (Gelernter et al., 2014).

In conclusion, the heterogeneity and complexity of the genetic landscape in schizophrenia are high. Here, we demonstrate that common and rare genetic risk factors converge on PPI networks that are enriched for schizophrenia candidate genes involved in synaptic plasticity and neural development. We also provide new evidence demonstrating that the NMDAR interactome is highly targeted by multiple types of genetic risk factors and may play a leading role in the risk of schizophrenia. Furthermore, we pinpointed many genes that are nominally significant in GWAS and whose network properties provide strong evidence that they influence schizophrenia risk. These genes may reach genome-wide significance, or be found to carry DNMs, in further genetic studies with more samples.

# AUTHOR CONTRIBUTIONS


XC and HH designed the research. XC, LL, YL, and JL performed the analysis. QL and PS provided guidance for the analysis. XC and HH wrote and finalized the paper.

# FUNDING

This study is funded by an Institutional Development Award to the Center for Applied Genomics from The Children's Hospital of Philadelphia; Adele and Daniel Kubert donation; MH096891-03S1 (NIMH); RC2 MH089924-02 (NIMH); R01 MH097284-03 (NIMH); U01-HG008684 (NIH).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00434/full#supplementary-material

#### REFERENCES

schizophrenia and epilepsy. Mol. Psychiatry 13, 261–266. doi: 10.1038/sj.mp.4002049




**Conflict of Interest Statement:** QL was employed by the company Janssen Research & Development, LLC.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Chang, Lima, Liu, Li, Li, Sleiman and Hakonarson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# BPLLDA: Predicting lncRNA-Disease Associations Based on Simple Paths With Limited Lengths in a Heterogeneous Network

Xiaofang Xiao<sup>1†</sup>, Wen Zhu<sup>2†</sup>, Bo Liao<sup>1,2</sup>\*, Junlin Xu<sup>1</sup>, Changlong Gu<sup>1</sup>, Binbin Ji<sup>2</sup>, Yuhua Yao<sup>2</sup>, Lihong Peng<sup>3</sup> and Jialiang Yang<sup>2,4</sup>\*

*<sup>1</sup> College of Information Science and Engineering, Hunan University, Changsha, China, <sup>2</sup> School of Mathematics and Statistics, Hainan Normal University, Haikou, China, <sup>3</sup> School of Computer Science, Hunan University of Technology, Zhuzhou, China, <sup>4</sup> Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, United States*

#### Edited by:

*Tao Zeng, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences (CAS), China*

#### Reviewed by:

*Jianbo Pan, Johns Hopkins Medicine, United States; Xianwen Ren, Peking University, China*

#### \*Correspondence:

*Bo Liao, dragonbw@163.com; Jialiang Yang, jialiang.yang@mssm.edu*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *01 July 2018* Accepted: *05 September 2018* Published: *16 October 2018*

#### Citation:

*Xiao X, Zhu W, Liao B, Xu J, Gu C, Ji B, Yao Y, Peng L and Yang J (2018) BPLLDA: Predicting lncRNA-Disease Associations Based on Simple Paths With Limited Lengths in a Heterogeneous Network. Front. Genet. 9:411. doi: 10.3389/fgene.2018.00411*

In recent years, it has become increasingly clear that long noncoding RNAs (lncRNAs) play critical roles in many biological processes associated with human diseases. Inferring potential lncRNA-disease associations is essential to reveal the secrets behind diseases, develop novel drugs, and optimize personalized treatments. However, biological experiments to validate lncRNA-disease associations are very time-consuming and costly. Thus, it is critical to develop effective computational models. In this study, we have proposed a method called BPLLDA to predict lncRNA-disease associations based on paths of fixed lengths in a heterogeneous lncRNA-disease association network. Specifically, BPLLDA first constructs a heterogeneous lncRNA-disease network by integrating the lncRNA-disease association network, the lncRNA functional similarity network, and the disease semantic similarity network. It then infers the probability of an lncRNA-disease association based on the paths connecting them and their lengths in the network. Compared to existing methods, BPLLDA has a few advantages, including not requiring negative samples and the ability to predict associations related to novel lncRNAs or novel diseases. BPLLDA was applied to a canonical lncRNA-disease association database called LncRNADisease, together with two popular methods, LRLSLDA and GrwLDA. The leave-one-out cross-validation areas under the receiver operating characteristic curve of BPLLDA are 0.87117, 0.82403, and 0.78528, respectively, for predicting overall associations, associations related to novel lncRNAs, and associations related to novel diseases, higher than those of the two compared methods. In addition, cervical cancer, glioma, and non-small-cell lung cancer were selected as case studies, for which the predicted top five lncRNA-disease associations were verified by recently published literature. In summary, BPLLDA exhibits good performance in predicting novel lncRNA-disease associations and associations related to novel lncRNAs and diseases. It may contribute to the understanding of lncRNA-associated diseases such as certain cancers.

Keywords: disease similarity, lncRNA similarity, path with limited length, Gaussian interaction profile kernel similarity, leave-one-out cross validation, ROC curve

# INTRODUCTION

It is known that there are about 20,000 protein-coding genes, constituting less than 2% of the human genome (Bertone et al., 2004; Claverie, 2005). Most DNA regions in the human genome are either not transcribed or transcribed into noncoding RNAs (ncRNAs), which were long deemed to be transcriptional noise. However, many recent studies have suggested that ncRNAs play key regulatory roles in many important biological processes such as cell proliferation (Esteller, 2011). Based on their sizes, ncRNAs can be divided into long ncRNAs (lncRNAs) (Pauli et al., 2011) and small ncRNAs such as microRNAs (miRNAs) (Farazi et al., 2013), transfer RNAs (tRNAs) (Birney et al., 2007), and Piwi-interacting RNAs (piRNAs) (Li et al., 2013). LncRNAs are ncRNAs of lengths greater than 200 nucleotides (Mercer et al., 2009; Mitchell Guttman et al., 2013). Compared to protein-coding RNAs, lncRNAs are less conserved among species (Harrow et al., 2012; Cabili et al., 2016), and have a relatively low expression level, more tissue-specific patterns (Guttman et al., 2010), and longer but fewer exons (Chen, 2015). Recently, more and more lncRNAs have been identified in eukaryotes from nematodes to human beings due to advances in sequencing technologies and computational methods (Awan et al., 2017).

Previous studies have suggested that lncRNAs are critical in cell proliferation, cell differentiation, chromatin remodeling, genome splicing, epigenetic regulation, transcription, and many other important biological processes (Guttman et al., 2009). The dysregulation of lncRNAs has also been associated with the development of many diseases, including diabetes (Pasmant et al., 2011), cardiovascular diseases (Congrains et al., 2012), HIV (Zhang et al., 2013), neurological disorders (Johnson, 2012), and several cancers such as lung cancer (Ji et al., 2003; Zhang et al., 2003), breast cancer (Barsyte-Lovejoy et al., 2006; Gupta et al., 2010), and prostate cancer (Kok et al., 2002; Szell et al., 2008). As a result, identifying lncRNA-disease associations has recently become a hot topic, and many important disease-associated lncRNAs have been discovered. For example, HOTAIR expression in patients with breast cancer metastasis is about 100 to 2,000 times higher than in healthy people, based on a quantitative PCR study (Gupta et al., 2010). HOTAIR is also related to the metastasis and progression of other cancers, such as liver cancer (Hrdlickova et al., 2014), lung cancer (Li et al., 2014), colorectal cancer (Res, 2011; Maass et al., 2014), gastric cancer (Li et al., 2014; Liu et al., 2014), and so on. Therefore, HOTAIR is deemed to be a potential biomarker for cancers (Maass et al., 2014). In addition, dysfunction of the lncRNA H19 is found in several diseases, such as bladder cancer (Ariel et al., 2000). The downregulation of H19 also significantly reduces the clonogenic and anchorage-independent growth of breast cancer cells, based on a knock-down study (Barsyte-Lovejoy et al., 2006).

Known lncRNA-disease associations have been stored in a few databases, including LncRNADisease (Chen et al., 2013), Lnc2Cancer (Ning et al., 2016), MNDR (Wang et al., 2013), and so on, which are the basis for predicting novel associations using efficient computational methods. The computational models to predict lncRNA-disease associations are generally divided into two categories: machine learning-based models and network-based models (Chen et al., 2017). Machine learning-based models usually train predictors from features based on training samples and test their performance by cross-validation or on independent data. For example, Chen et al. developed Laplacian Regularized Least Squares for LncRNA-Disease Association (LRLSLDA) for inferring candidates of disease-associated lncRNAs by applying a semisupervised learning framework (Chen and Yan, 2013). LRLSLDA assumes that similar diseases tend to correlate with functionally similar lncRNAs, and vice versa. Thus, known lncRNA-disease associations and lncRNA expression profiles are combined to prioritize disease-associated lncRNA candidates by LRLSLDA, which does not require negative samples (i.e., confirmed uncorrelated lncRNA-disease associations). However, LRLSLDA faces difficulty in optimizing the model parameters. Zhao T. et al. (2015) proposed a naïve Bayesian classifier, which exploits various information related to cancer-associated lncRNAs, including regulome, genome, transcriptome, and multiomic data. As a result, 707 potential cancer-related lncRNAs were identified. However, this method requires negative samples, which are usually unknown. In contrast, network-based methods take advantage of the lncRNA-disease association network, the disease similarity network, and the lncRNA similarity network to study the connectivity of lncRNAs and diseases. For instance, Sun et al. (2014) developed RWRlncD, which infers potential lncRNA-disease associations by a random walk with restart (RWR) on the lncRNA functional similarity network. However, the method cannot predict lncRNAs related to novel diseases (i.e., diseases with no known associated lncRNA). Gu et al. (2017) provided a global network random walk model for predicting lncRNA-disease associations (GrwLDA), which performs RWR on both the lncRNA functional similarity network and the disease similarity network. However, GrwLDA also faces a dilemma in optimizing model parameters.

In this study, we propose a novel method, BPLLDA, to predict lncRNA-disease associations based on the paths of limited length that connect them in a heterogeneous network. Specifically, BPLLDA first establishes a heterogeneous network consisting of the known lncRNA-disease association network, the disease similarity network, and the lncRNA similarity network. It then calculates the association between a disease and an lncRNA from the paths connecting them and their lengths. BPLLDA does not require negative samples and is capable of making predictions for novel diseases and novel lncRNAs.

# MATERIALS AND METHODS

#### lncRNA-Disease Associations

The lncRNA-disease association data were retrieved from the database LncRNADisease (Chen et al., 2013; Sun et al., 2014). After eliminating identical lncRNA-disease entries from distinct pieces of evidence, there were 352 experimentally confirmed lncRNA-disease associations, containing 156 lncRNAs and 190 diseases (see **Supplementary Figure 1** and **Supplementary Tables 2**, **3**). We summarize some basic characteristics (e.g., the average degree) of the dataset in **Table 1**.

TABLE 1 | The basic characteristics of the lncRNA-disease association dataset.


We then established the lncRNA-disease association network, whose adjacency matrix is denoted by LD. That is, LD(i, j) is set to 1 if lncRNA l(i) is associated with disease d(j), and 0 otherwise. Before presenting the details of BPLLDA, we first introduce two important notions, namely, disease semantic similarity and lncRNA functional similarity.

#### Disease Semantic Similarity

The Disease Ontology (DO) is an open source ontology of human diseases (http://www.disease-ontology.org/). The terms in DO are diseases or disease-correlated concepts, which are organized in a directed acyclic graph (DAG). On the basis of the Disease Ontology, Li et al. (2011) provided an R package called DOSim to calculate the disease semantic similarity, and we adopted this method in this study. Specifically, we used a symmetric matrix SS to record semantic similarity values among diseases, in which SS(i, j) represents the semantic similarity between diseases d(i) and d(j) as calculated by DOSim. We plot the distribution of SS in **Figure 1A**. There are 36100 (190 × 190) values overall, among which 21148 (58.58%) are zeros.

#### lncRNA Functional Similarity

We adopted a method similar to that of Sun et al. for measuring the functional similarity between two lncRNAs (Wang et al., 2010; Sun et al., 2014). Specifically, suppose lncRNA l(i) is associated with a disease set D<sub>i</sub> = {d<sub>ik</sub> | 1 ≤ k ≤ m} and lncRNA l(j) is associated with D<sub>j</sub> = {d<sub>jl</sub> | 1 ≤ l ≤ n}. The method first calculates the semantic similarity between a disease, say d<sub>i1</sub>, and a disease group, say D<sub>j</sub>, as the largest semantic similarity between d<sub>i1</sub> and any member of D<sub>j</sub>:

$$\text{SIM}\left(d_{i1}, D_j\right) = \max_{1 \le l \le n} \text{SS}\left(d_{i1}, d_{jl}\right).$$

Then, the functional similarity between l(i) and l(j) is calculated as

$$\text{FS}\left(l(i), l(j)\right) = \frac{\sum_{1 \le k \le m} \text{SIM}\left(d_{ik}, D_j\right) + \sum_{1 \le l \le n} \text{SIM}\left(d_{jl}, D_i\right)}{m+n}.$$

It is clear that the lncRNA functional similarity matrix FS is symmetric. We plot the distribution of FS in **Figure 1B**. There are 24336 (156 × 156) values, among which 8662 (35.59%) are zeros.
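The two formulas above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: it assumes that SIM takes the maximum semantic similarity over the disease group, as in Sun et al. (2014), and the function names, the toy matrix `SS_toy`, and the disease sets are all hypothetical.

```python
def sim_to_group(d, group, SS):
    """SIM(d, D): largest semantic similarity between disease d and any disease in D."""
    return max(SS[d][e] for e in group)


def functional_similarity(D_i, D_j, SS):
    """FS(l(i), l(j)) computed from the disease sets of the two lncRNAs."""
    total = sum(sim_to_group(d, D_j, SS) for d in D_i)
    total += sum(sim_to_group(d, D_i, SS) for d in D_j)
    return total / (len(D_i) + len(D_j))


# Toy symmetric semantic similarity matrix over three hypothetical diseases.
SS_toy = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
D_a = [0, 1]  # diseases associated with a hypothetical lncRNA a
D_b = [1, 2]  # diseases associated with a hypothetical lncRNA b

fs_ab = functional_similarity(D_a, D_b, SS_toy)
fs_ba = functional_similarity(D_b, D_a, SS_toy)
print(fs_ab, fs_ba)
```

Swapping the two arguments only swaps the two sums in the numerator, which is why FS is symmetric.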

#### Gaussian Interaction Profile Kernel Similarity for lncRNAs

There are many zeros in FS because the known lncRNA-disease associations are rather incomplete. To mitigate this sparsity, we introduced the Gaussian interaction profile kernel similarity between lncRNAs l(i) and l(j) as

$$\text{GL}\left(l(i), l(j)\right) = \exp\left(-\gamma_l \left\| \text{IP}\left(l(i)\right) - \text{IP}\left(l(j)\right) \right\|^2\right),$$

where IP(l(i)) and IP(l(j)) are the vectors in the ith and jth rows of the lncRNA-disease association matrix LD. The parameter γ<sub>l</sub> regulates the kernel bandwidth and is defined as

$$\gamma_l = \gamma'_l \Big/ \left(\frac{1}{\mathit{ln}} \sum_{i=1}^{\mathit{ln}} \left\| \text{IP}\left(l(i)\right) \right\|^2\right),$$

where ln is the number of all lncRNAs studied and γ′<sub>l</sub> is usually set to 1 according to van Laarhoven et al. (2011).

#### Gaussian Interaction Profile Kernel Similarity for Diseases

Similarly, we defined the Gaussian interaction profile kernel similarity for diseases as

$$\text{GD}\left(d(i), d(j)\right) = \exp\left(-\gamma_d \left\| \text{IP}\left(d(i)\right) - \text{IP}\left(d(j)\right) \right\|^2\right),$$

with

$$\gamma_d = \gamma'_d \Big/ \left(\frac{1}{\mathit{dn}} \sum_{i=1}^{\mathit{dn}} \left\| \text{IP}\left(d(i)\right) \right\|^2\right),$$

where IP(d(i)) and IP(d(j)) are the binary vectors in the ith and jth columns of the adjacency matrix LD and dn is the number of diseases. Clearly, GD is also symmetric.
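As a minimal sketch of both kernels, the helper below computes the Gaussian interaction profile kernel between the rows of a binary matrix, so the lncRNA kernel is obtained from LD and the disease kernel from its transpose. The function name `gip_kernel`, the parameter `gamma_prime`, and the toy matrix are assumptions made for illustration; the sketch also assumes that at least one association exists, since the bandwidth is otherwise undefined.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of a binary matrix."""
    n = len(profiles)
    # Bandwidth: gamma' divided by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    gamma = gamma_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


# Toy association matrix: 3 hypothetical lncRNAs (rows) x 4 hypothetical diseases (columns).
LD_toy = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]
GL_toy = gip_kernel(LD_toy)                               # lncRNA kernel: rows of LD
GD_toy = gip_kernel([list(col) for col in zip(*LD_toy)])  # disease kernel: columns of LD
```

Unlike SS and FS, the kernel is strictly positive for every pair, which is what makes it useful for filling in similarities that would otherwise be zero.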

#### Integrated Similarity Between lncRNAs and Between Diseases

We integrated disease semantic similarity (lncRNA functional similarity) with the Gaussian interaction profile kernel similarity for diseases (lncRNAs) as follows:

$$\begin{aligned} \text{DS}\left(d(i), d(j)\right) &= \begin{cases} \text{GD}\left(d(i), d(j)\right) & \text{if } d(i) \in NS \text{ or } d(j) \in NS \\ \text{SS}\left(d(i), d(j)\right) & \text{otherwise} \end{cases} \\ \text{LS}\left(l(i), l(j)\right) &= \begin{cases} \text{GL}\left(l(i), l(j)\right) & \text{if } l(i) \in NF \text{ or } l(j) \in NF \\ \text{FS}\left(l(i), l(j)\right) & \text{otherwise} \end{cases} \end{aligned}$$

where NS is the set of diseases with no semantic similarity to any other disease, and NF is the set of lncRNAs with no functional similarity to any other lncRNA. By definition, DS and LS are symmetric. We plot the distributions of DS and LS in **Figure 2**, in which the numbers of zeros are greatly reduced compared with SS and FS.
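The integration rule can be illustrated with a short sketch. It is an assumption-laden example rather than the authors' code: the function name `integrate` and the toy matrices are hypothetical, and an item is treated as belonging to NS (or NF) when all of its off-diagonal similarity values are zero.

```python
def integrate(semantic, gip):
    """Use the kernel value for every pair that involves an item with no
    semantic (or functional) similarity to any other item."""
    n = len(semantic)
    # Items whose similarity to every *other* item is zero (the sets NS / NF).
    isolated = {
        i for i in range(n)
        if all(semantic[i][j] == 0 for j in range(n) if j != i)
    }
    return [
        [gip[i][j] if (i in isolated or j in isolated) else semantic[i][j]
         for j in range(n)]
        for i in range(n)
    ]


# Toy example: disease 2 has no semantic similarity to diseases 0 and 1.
SS_example = [
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
GD_example = [
    [1.0, 0.3, 0.6],
    [0.3, 1.0, 0.5],
    [0.6, 0.5, 1.0],
]
DS_example = integrate(SS_example, GD_example)
```

In this toy case only the entries in the row and column of disease 2 are taken from the kernel; the pair (0, 1) keeps its semantic similarity of 0.4.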

FIGURE 1 | The distributions of disease semantic and lncRNA functional similarity. (A) Disease semantic similarity (SS) distribution. (B) lncRNA functional similarity (FS) distribution. The *x*-axis indicates the intervals of similarity values and the *y*-axis indicates the numbers of values in the interval. The actual values are also marked above the histograms.

FIGURE 2 | The distributions of integrated similarities. (A) Distribution of the integrated similarity for diseases (DS). (B) Distribution of the integrated similarity for lncRNAs (LS). The *x*-axis indicates the intervals of similarity values and the *y*-axis indicates the numbers of values in the interval. The actual values are also marked above the histograms.

#### BPLLDA

The general workflow of BPLLDA is illustrated in **Figure 3**, in which a heterogeneous network is first constructed with nodes denoting lncRNAs or diseases. For any two diseases d(i) and d(j), the weight of the edge between them is defined to be

$$\text{WD}\left(d(i), d(j)\right) = \begin{cases} 0 & \text{if } \text{DS}\left(d(i), d(j)\right) < T \\ \text{DS}\left(d(i), d(j)\right) & \text{otherwise} \end{cases},$$

where T is a threshold value to avoid all diseases being connected (You et al., 2017). Similarly, the weight of the edge between two lncRNAs l(i) and l(j) is

$$\text{WL}\left(l(i), l(j)\right) = \begin{cases} 0 & \text{if } \text{LS}\left(l(i), l(j)\right) < T \\ \text{LS}\left(l(i), l(j)\right) & \text{otherwise} \end{cases}$$

The weight of an edge between an lncRNA l(i) and a disease d(j) is LD(l(i), d(j)); that is, the weight is 1 if they are associated and 0 otherwise. We tuned T from 0.1 to 0.5 in steps of 0.1 by a leave-one-out cross-validation (LOOCV) process and finally chose T = 0.2.
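The thresholding step is simple enough to show directly. The sketch below is illustrative only; the function name `threshold_weights` and the toy similarity matrix are hypothetical, and T = 0.2 is the value reported in the text.

```python
def threshold_weights(similarity, T=0.2):
    """Set similarity values below T to zero and keep all other values unchanged."""
    return [[s if s >= T else 0.0 for s in row] for row in similarity]


# Toy integrated similarity matrix for three hypothetical diseases.
DS_small = [
    [1.0, 0.15, 0.6],
    [0.15, 1.0, 0.2],
    [0.6, 0.2, 1.0],
]
WD_small = threshold_weights(DS_small, T=0.2)
```

Note that a value exactly equal to T is kept, because only values strictly below the threshold are removed.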

For a given lncRNA node l(i) and a disease node d(j), we performed a depth-first search (Hopcroft and Tarjan, 1974) to identify all noncyclic paths between them. To avoid long paths, we restricted the maximum number of edges in a path to τ. Similarly, we performed an LOOCV search for τ from 1 to 4 and set τ to 3. Intuitively, l(i) and d(j) tend to be associated if there are many paths with high edge weights connecting them. Therefore, a score measuring their association confidence can be defined using the paths together with a decay function F<sub>decay</sub>(p<sub>w</sub>):

$$\text{score}\left(l(i), d(j)\right) = \sum_{w=1}^{n} \left( \prod p_w \right)^{F_{\text{decay}}\left(p_w\right)},$$

where p = {p<sub>1</sub>, p<sub>2</sub>, ..., p<sub>n</sub>} is the set of paths connecting l(i) and d(j), and ∏ p<sub>w</sub> denotes the product of the weights of all edges in the path p<sub>w</sub>. Because every edge weight is at most 1, raising this product to a power that grows with the path length ensures that long paths contribute little to the total score. The decay function is defined as

$$F_{\text{decay}}\left(p_w\right) = \alpha \times \text{len}\left(p_w\right),$$

where the decay factor α is set to 2.26 based on previous studies (Ba-Alawi et al., 2016; You et al., 2017) and len(p<sub>w</sub>) is the length of the path p<sub>w</sub>. Clearly, the higher score(l(i), d(j)) is, the more likely l(i) and d(j) are to be associated.
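The scoring procedure can be sketched as a bounded depth-first enumeration of simple paths. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the node encoding `("l", i)` / `("d", j)`, the helper names `build_network` and `path_score`, and the toy weights are all hypothetical, while `max_len=3` and `alpha=2.26` follow the values reported in the text.

```python
def build_network(WL, WD, LD):
    """Adjacency dict of the heterogeneous network; zero-weight edges are omitted."""
    n_l, n_d = len(WL), len(WD)
    adj = {("l", i): {} for i in range(n_l)}
    adj.update({("d", j): {} for j in range(n_d)})
    for i in range(n_l):
        for k in range(n_l):
            if i != k and WL[i][k] > 0:
                adj[("l", i)][("l", k)] = WL[i][k]
    for j in range(n_d):
        for k in range(n_d):
            if j != k and WD[j][k] > 0:
                adj[("d", j)][("d", k)] = WD[j][k]
    for i in range(n_l):
        for j in range(n_d):
            if LD[i][j]:
                adj[("l", i)][("d", j)] = 1.0
                adj[("d", j)][("l", i)] = 1.0
    return adj


def path_score(adj, source, target, max_len=3, alpha=2.26):
    """Sum of (product of edge weights) ** (alpha * length) over all simple
    paths from source to target that use at most max_len edges."""
    total = 0.0
    visited = {source}

    def dfs(node, product, length):
        nonlocal total
        if node == target:
            total += product ** (alpha * length)
            return
        if length == max_len:
            return
        for nxt, w in adj[node].items():
            if nxt not in visited:
                visited.add(nxt)
                dfs(nxt, product * w, length + 1)
                visited.discard(nxt)

    dfs(source, 1.0, 0)
    return total


# Toy network: 2 hypothetical lncRNAs and 2 hypothetical diseases.
WL_net = [[1.0, 0.8], [0.8, 1.0]]
WD_net = [[1.0, 0.5], [0.5, 1.0]]
LD_net = [[1, 0], [0, 1]]
adj_net = build_network(WL_net, WD_net, LD_net)
score_01 = path_score(adj_net, ("l", 0), ("d", 1))
```

In this toy network, lncRNA 0 and disease 1 are linked by exactly two paths of length 2, one through lncRNA 1 (weight product 0.8) and one through disease 0 (weight product 0.5), so the score is the sum of the two decayed contributions.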

#### Analysis of the Computational Complexity

We analyzed the time complexity and space complexity of BPLLDA. Let m be the number of diseases and n the number of lncRNAs, with m > n. The algorithm mainly consists of two steps. First, a heterogeneous network is constructed, for which two matrices are established, so both the time complexity and the space complexity of this step are O(m<sup>2</sup>). Then, BPLLDA infers the probability of an lncRNA-disease association based on paths with limited lengths in the network. We performed a depth-first search to identify all noncyclic paths between nodes, and the time complexity is O((m + n)<sup>2</sup>) for each node. Because there are m diseases, the time complexity of this step is O(m<sup>3</sup>). The space complexity is O(mn) because only the prediction results need to be saved. In summary, the time complexity and space complexity of BPLLDA are at most O(m<sup>3</sup>) and O(m<sup>2</sup>), respectively.

#### RESULTS AND DISCUSSIONS

#### Performance of BPLLDA in Predicting lncRNA-Disease Associations

We applied BPLLDA to the known lncRNA-disease association dataset LD, together with two popular methods, GrwLDA (Gu et al., 2017) and LRLSLDA (Chen and Yan, 2013). We selected these two methods for comparison because both can make predictions for novel lncRNAs and novel diseases. Specifically, two LOOCV schemes, namely global LOOCV and local LOOCV, were adopted to evaluate their performances. Global LOOCV takes each experimentally confirmed lncRNA-disease association as the test sample in turn, whereas local LOOCV takes all associations of one lncRNA, or all those of one disease, as the test samples in turn. The other known lncRNA-disease associations are considered as training samples. The performances of the methods were evaluated by the area under the receiver operating characteristic (ROC) curve (AUC).
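For readers unfamiliar with the evaluation metric, the AUC can be computed directly from scores and labels as the fraction of positive-negative pairs in which the positive sample receives the higher score. The sketch below shows only this metric, not the LOOCV loop itself; the function name `auc` and the toy scores are hypothetical, and the pairwise formulation is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two held-out known associations (label 1) and two unknown pairs (label 0).
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

Here three of the four positive-negative pairs are ordered correctly, so the toy AUC is 0.75; a value of 0.5 corresponds to random ranking.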

We plotted the global LOOCV ROC curves and the associated AUCs of BPLLDA, GrwLDA, and LRLSLDA in **Figure 4**. BPLLDA achieved an AUC of 0.87117, outperforming LRLSLDA (0.81952) and GrwLDA (0.78246). Similarly, we plotted the local LOOCV ROC curves and AUCs of the three methods on novel lncRNAs in **Figure 5**. As can be seen, BPLLDA has an AUC of 0.82403, about 8 and 18% higher than that of LRLSLDA (0.76542) and GrwLDA (0.69817), respectively. Finally, the AUC of BPLLDA (0.78528) in predicting novel diseases is significantly higher than that of LRLSLDA (0.65812) with an increase of 19% and GrwLDA (0.65802) with an increase of 20% (see **Figure 6**). In summary, our method is better than LRLSLDA and GrwLDA both in lncRNA-disease association prediction and in prediction related to novel lncRNAs and diseases.

Meanwhile, we list in **Table 2** the precision versus the prediction scores in the global LOOCV. In general, the higher the score, the more likely the disease is related to the lncRNAs. The association confidence is greater than 0.9 when the prediction score is larger than 21.58.

#### Effects of Parameters

There are two model parameters in BPLLDA: the maximum path length L (denoted τ above) and the weight threshold T. We tested the effects of these parameters on the LOOCV AUC with L (L = 2, 3, 4) and T (T = 0.2, 0.4, 0.5), and we list the results in **Table 3**. As can be seen, the parameter L has significant effects on the performance of BPLLDA, and the best AUC is achieved at L = 3. In contrast, T has only minor effects on the performance of our method. To further illustrate this, we fixed L at 3 and let T vary from 0.1 to 0.5 in steps of 0.1 (see **Table 4**). The AUC values are between 0.85568 and 0.87117, a difference of only about 2%.

FIGURE 4 | Performance evaluation of BPLLDA, LRLSLDA, and GrwLDA in predicting lncRNA-disease associations by global LOOCV.

FIGURE 5 | Performance evaluation of BPLLDA, LRLSLDA, and GrwLDA in predicting novel lncRNA-associated diseases.

#### Effects of Gaussian Interaction Profile Kernel Similarity for lncRNAs and Diseases

Disease similarity and lncRNA similarity are calculated by integrating disease semantic similarity and lncRNA functional similarity with the Gaussian interaction profile kernel similarity for diseases and lncRNAs, respectively. We tested the effects of the Gaussian interaction profile kernel similarity for lncRNAs and diseases on LOOCV with L = 3 and T = 0.2 under four settings: (1) using neither of the two Gaussian interaction profile kernel similarities; (2) using only the Gaussian interaction profile kernel similarity for lncRNAs; (3) using only the Gaussian interaction profile kernel similarity for diseases; (4) using the Gaussian interaction profile kernel similarity for both lncRNAs and diseases. The results are summarized in **Table 5**. As can be seen, the two similarities indeed have a significant influence on the LOOCV AUC. The best AUC (0.87117) was achieved when both similarities were adopted into our model.

TABLE 3 | Tuning two model parameters: the maximum path length L and the weight threshold T by LOOCV.

*The value in each cell represents LOOCV AUC.*

\**T = 0.2 and L = 4 was not calculated because it takes more than 48 h.*

TABLE 5 | The effects of the Gaussian interaction profile kernel similarity for lncRNAs and diseases on LOOCV.


*The value in each cell represents LOOCV AUC.*

TABLE 6 | The top five lncRNA candidates predicted for cervical cancer, glioma, and non-small-cell lung cancer.


## Case Studies on Predicted lncRNA-Disease Associations

It is known that lncRNAs play critical roles in the development of many diseases. To further evaluate the ability of BPLLDA in inferring novel lncRNA-disease associations, we used all known lncRNA-disease associations in LD as training data and assessed the potential of predicted associations by our model. The novel lncRNA-disease associations were ranked according to the predicted score of BPLLDA. To validate the predictions, the newest LncRNADisease database was used, which curated 1766 distinct known lncRNA-disease associations among 888 lncRNAs and 328 diseases. Specifically, we listed the top five lncRNAs associated with three diseases, including cervical cancer, glioma, and non-small-cell lung cancer (NSCLC), respectively, in **Table 6** and the paths of cervical cancer in **Supplementary Table 1**. For a better view, we also plotted the associations of the three diseases and their top 10 predicted lncRNAs in **Figure 7**.

Cervical cancer is a cancer of the cervix, and its early symptoms are hard to detect. As the second most common cancer among women worldwide, cervical cancer causes numerous deaths in developing countries (Forouzanfar et al., 2011). It was reported that approximately 500,000 new cases of cervical cancer are diagnosed annually (Tewari et al., 2014). Therefore, there is an urgent need to explore its biological mechanisms and develop effective treatment strategies. Interestingly, all of the top five novel cervical cancer-associated lncRNAs predicted by BPLLDA were confirmed by the newest updates of the LncRNADisease database. For example, the top predicted lncRNA, MEG3, can inhibit tumor growth in cervical cancer by regulating miR-21-5p, which is regarded as a tumor suppressor (Zhang J. et al., 2016). Serum PVT1 can accurately differentiate patients with cervical cancer from healthy controls (Yang et al., 2016). The high expression of HOTAIR is involved in cervical cancer progression and may be a potential target for diagnosis and gene therapy (Huang et al., 2014).

Glioma is considered to be the most common malignant tumor in the central nervous system and is characterized by aggressive blood vessel formation (Khasraw et al., 2010). Despite the continuous improvement of various treatments, including surgery, radiotherapy, and chemotherapy, the overall survival of patients with glioma is only about 12–14 months after diagnosis (Wang et al., 2015). The poor treatment effect is mainly due to the prominent tumor angiogenesis. BPLLDA also achieved good performance in predicting glioma-associated lncRNAs, as all top five predicted lncRNAs were confirmed by the newest LncRNADisease database and literature. For example, it was shown that H19 regulates the development of glioma by deriving miR-675 and offers an essential clue to understanding the key role of the lncRNA-miRNA functional network in glioma (Shi et al., 2014). The expression level of lncRNA MALAT1 is significantly correlated with the overall survival of patients with glioma and can be used as a convincing prognostic biomarker for these patients (Ma et al., 2015). In addition, Gas5 inhibits tumor malignancy by downregulating miR-222, which may be a promising treatment for glioma (Zhao X. et al., 2015).

NSCLC, including adenocarcinoma and squamous cell carcinoma, is a predominant form of lung cancer (Siegel et al., 2012). Despite the progress in clinical and experimental oncology, the prognosis remains difficult. Increasing evidence indicates that ncRNAs could take part in the pathogenesis of NSCLC. Here too, the top five NSCLC-correlated lncRNA candidates predicted by BPLLDA were validated by literature. For example, HOTAIR is significantly upregulated in NSCLC tissues and partly regulates cell invasion and metastasis of NSCLC by HOXA5 downregulation (Liu X. H. et al., 2013). HOTAIR is therefore a potential therapeutic target for NSCLC intervention. In addition, patients with NSCLC with high PVT1 expression have a significantly lower overall survival rate than those with low PVT1 expression (Yang et al., 2014). Finally, the expression of CDKN2B-AS1 (ANRIL) might impair cell proliferation and lead to cell apoptosis in vitro and in vivo (Nie et al., 2015), which is linked to the survival of patients with NSCLC.

#### Case Studies on Predicted Novel Diseases and Novel lncRNAs

To test the ability of BPLLDA to predict novel disease-associated lncRNAs, all known lncRNA-disease associations of a given disease were removed. We selected two diseases: colorectal cancer and breast cancer (see **Table 7**). As can be seen, all top five predicted lncRNAs associated with colorectal cancer were confirmed by the newest LncRNADisease database, whereas four of the top five lncRNAs associated with breast cancer were validated by the database or literature.

TABLE 7 | The top five novel disease-correlated lncRNA candidates predicted for colorectal cancer and breast cancer.

TABLE 8 | The top five novel disease-correlated lncRNA candidates predicted for *H19* and *HOTAIR*.

Similarly, to test the ability of BPLLDA to predict novel lncRNA-associated diseases, all known lncRNA-disease associations of a given lncRNA were removed. As two case studies, we selected the lncRNAs H19 and HOTAIR (see **Table 8**). In both cases, four of the top five associated diseases were validated by the database and literature. In summary, BPLLDA achieves favorable performance in predicting novel disease-associated lncRNAs and novel lncRNA-associated diseases.

#### CONCLUSIONS

Many studies have demonstrated that lncRNAs are essential in many physiological processes related to human diseases. They could be important biomarkers for the diagnosis, prognosis, and treatment of these diseases. However, biological experiments to validate lncRNA-disease associations are both time consuming and costly, which motivates the development of computational prediction models. In this study, we proposed BPLLDA, a novel computational method to predict lncRNA-disease associations based on simple paths with limited lengths in a heterogeneous network consisting of the lncRNA similarity network, the disease similarity network, and the lncRNA-disease association network. BPLLDA outperforms the two compared methods in prediction accuracy, and most top predicted novel lncRNA-disease associations were validated by literature. However, BPLLDA has a few limitations. First, the available experimentally validated lncRNA-disease associations are rather incomplete. Second, lncRNA similarity is computed on the basis of known lncRNA-disease associations, and the sparseness of the disease semantic similarity and lncRNA functional similarity is remedied by integrating the Gaussian interaction profile kernel similarity for diseases and lncRNAs, respectively, which is itself derived from the known associations. As a result, BPLLDA may produce biased predictions. Finally, the distance-decay function in BPLLDA is relatively simple and could be improved by machine learning methods.

#### AUTHOR CONTRIBUTIONS

JY and BL: conceived the concept of the work and designed the experiments; XX, JX, BJ and YY: performed the literature search; XX, WZ, CG, and LP: collected and analyzed the data; XX and JY: wrote the paper, and all authors have approved the manuscript.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61863010, 61873076, 61370171, 61300128, 61472127, 11171369, 61272395, 61572178, 61672214, and 61702054) and the Natural Science Foundation of Hunan, China (Grant Nos. 2018JJ2461 and 2018JJ3568).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00411/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Xiao, Zhu, Liao, Xu, Gu, Ji, Yao, Peng and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Predicting Diabetes Mellitus With Machine Learning Techniques

#### Quan Zou1,2 \*, Kaiyang Qu<sup>1</sup> , Yamei Luo<sup>3</sup> , Dehui Yin<sup>3</sup> , Ying Ju<sup>4</sup> and Hua Tang<sup>5</sup> \*

<sup>1</sup> School of Computer Science and Technology, Tianjin University, Tianjin, China, <sup>2</sup> Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China, <sup>3</sup> School of Medical Information and Engineering, Southwest Medical University, Luzhou, China, <sup>4</sup> School of Information Science and Technology, Xiamen University, Xiamen, China, <sup>5</sup> Department of Pathophysiology, School of Basic Medicine, Southwest Medical University, Luzhou, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Jianbo Pan, Johns Hopkins Medicine, United States Zhu-Hong You, Xinjiang Technical Institute of Physics & Chemistry (CAS), China Chao Pang, Columbia University Medical Center, United States

\*Correspondence:

Quan Zou zouquan@nclab.net Hua Tang huatang@swmu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 29 July 2018 Accepted: 12 October 2018 Published: 06 November 2018

#### Citation:

Zou Q, Qu K, Luo Y, Yin D, Ju Y and Tang H (2018) Predicting Diabetes Mellitus With Machine Learning Techniques. Front. Genet. 9:515. doi: 10.3389/fgene.2018.00515

Diabetes mellitus is a chronic disease characterized by hyperglycemia, and it may cause many complications. Given the growing morbidity in recent years, the number of diabetic patients worldwide is projected to reach 642 million by 2040, which means that one in ten adults will be suffering from diabetes. This alarming figure demands great attention. With its rapid development, machine learning has been applied to many aspects of medical health. In this study, we used decision tree, random forest, and neural network to predict diabetes mellitus. The dataset is hospital physical examination data from Luzhou, China, and contains 14 attributes. Five-fold cross validation was used to examine the models. To verify the universal applicability of the methods, we chose the methods with better performance to conduct independent test experiments. We randomly selected 68994 records each from the healthy people and the diabetic patients as the training set. Because the data are unbalanced, we repeated the random extraction five times and report the average of the five experiments. We used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality. The results showed that prediction with random forest could reach the highest accuracy (ACC = 0.8084) when all the attributes were used.

Keywords: diabetes mellitus, random forest, decision tree, neural network, machine learning, feature ranking

# INTRODUCTION

Diabetes is a common chronic disease and poses a great threat to human health. It is characterized by blood glucose above the normal level, which is caused by defective insulin secretion, impaired biological effects of insulin, or both (Lonappan et al., 2007). Diabetes can lead to chronic damage and dysfunction of various tissues, especially the eyes, kidneys, heart, blood vessels, and nerves (Krasteva et al., 2011). Diabetes can be divided into two categories, type 1 diabetes (T1D) and type 2 diabetes (T2D). Patients with type 1 diabetes are normally younger, mostly less than 30 years old. The typical clinical symptoms are increased thirst, frequent urination, and high blood glucose levels (Iancu et al., 2008). This type of diabetes cannot be treated effectively with oral medications alone, and the patients require insulin therapy. Type 2 diabetes occurs more commonly in middle-aged and elderly people and is often associated with obesity, hypertension, dyslipidemia, arteriosclerosis, and other diseases (Robertson et al., 2011).


With the improvement of living standards, diabetes is increasingly common in people's daily life. Therefore, how to diagnose and analyze diabetes quickly and accurately is a topic worth studying. In medicine, diabetes is diagnosed according to fasting blood glucose, glucose tolerance, and random blood glucose levels (Iancu et al., 2008; Cox and Edelman, 2009; American Diabetes Association, 2012). The earlier the diagnosis is obtained, the easier the disease is to control. Machine learning can help people make a preliminary judgment about diabetes mellitus according to their daily physical examination data, and it can serve as a reference for doctors (Lee and Kim, 2016; Alghamdi et al., 2017; Kavakiotis et al., 2017). For machine learning methods, selecting valid features and the correct classifier are the most important problems.

Recently, numerous algorithms have been used to predict diabetes, including traditional machine learning methods (Kavakiotis et al., 2017) such as support vector machine (SVM), decision tree (DT), and logistic regression. Polat and Günes (2007) distinguished diabetic patients from normal people by using principal component analysis (PCA) and neuro fuzzy inference. Yue et al. (2008) used the quantum particle swarm optimization (QPSO) algorithm and weighted least squares support vector machine (WLS-SVM) to predict type 2 diabetes. Duygu and Esin (2011) proposed a system to predict diabetes, called LDA-MWSVM, in which Linear Discriminant Analysis (LDA) is used to reduce the dimensions and extract the features. To deal with high dimensional datasets, Razavian et al. (2015) built prediction models based on logistic regression for different onsets of type 2 diabetes prediction. Georga et al. (2013) focused on glucose and used support vector regression (SVR) to predict diabetes, treating it as a multivariate regression problem. Moreover, more and more studies have used ensemble methods to improve the accuracy (Kavakiotis et al., 2017). Ozcift and Gulten (2011) proposed a new ensemble approach, namely rotation forest, which combines 30 machine learning methods. Han et al. (2015) proposed a machine learning method that changed the SVM prediction rules.

Machine learning methods are widely used in predicting diabetes, and they achieve good results. Decision tree is one of the popular machine learning methods in the medical field and has good classification power. Random forest generates many decision trees. Neural network is a recently popular machine learning method that performs well in many respects. Therefore, in this study, we used decision tree, random forest (RF), and neural network to predict diabetes.

#### MATERIALS AND METHODS

#### Data

The dataset was obtained from hospital physical examination data in Luzhou, China. This dataset is divided two parts: the healthy people and the diabetes. There are two healthy people physical examination data. We used one of healthy people physical examination data that contains 164431 instances as the training set. In the other data set, 13700 samples were randomly selected as an independent test set. The physical data include 14 physical examination indexes: age, pulse rate, breathe, left systolic pressure (LSP), right systolic pressure (RSP), left diastolic pressure (LDP), right diastolic pressure (RDP), height, weight, physique index, fasting glucose, waistline, low density lipoprotein (LDL), and high density lipoprotein (HDL). In the training dataset, there are many missing data. We deleted the abnormal and missing samples to reduce the impact of data processing on result. Consequently, we got 151598 diabetic physical data and 69082 healthy people physical data. So, we randomly selected 68994 healthy people and diabetic patients' data, respectively as training set. Due to the data unbalance, we randomly extracted 5 times. The final result was the mean value of 5 experiments. The 13,700 patients physical examination data, which were randomly selected as the independent test set, were different from the previous five sets which were used as training set.
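The repeated balanced sampling described above can be sketched as follows. This is a hypothetical illustration, not the authors' pipeline: the function names, the fixed seed, the toy record lists, and the placeholder `evaluate` callback (which stands in for training and testing a classifier) are all assumptions.

```python
import random


def balanced_subsets(class_a, class_b, n_per_class, repeats=5, seed=0):
    """Yield `repeats` balanced training sets, each drawn without replacement."""
    rng = random.Random(seed)
    for _ in range(repeats):
        yield rng.sample(class_a, n_per_class), rng.sample(class_b, n_per_class)


def mean_metric(class_a, class_b, n_per_class, evaluate, repeats=5, seed=0):
    """Average a user-supplied evaluation function over the repeated draws."""
    results = [
        evaluate(a, b)
        for a, b in balanced_subsets(class_a, class_b, n_per_class, repeats, seed)
    ]
    return sum(results) / len(results)


# Toy stand-ins for the two classes of records (integers instead of real examinations).
records_a = list(range(100))
records_b = list(range(100, 130))

# Placeholder metric: the share of class-a records in each draw (always 0.5 when balanced).
share_a = mean_metric(records_a, records_b, 20, lambda a, b: len(a) / (len(a) + len(b)))
draws = list(balanced_subsets(records_a, records_b, 20))
```

In practice the `evaluate` callback would train a classifier on the drawn records and return its accuracy, so that the reported figure is the mean over the five draws.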

Another dataset is the Pima Indians diabetes data (Jegan, 2014). All patients are females at least 21 years old of Pima Indian heritage. The dataset contains 8 attributes: number of pregnancies, plasma glucose concentration after a 2-h oral glucose tolerance test, diastolic blood pressure, triceps skin fold thickness, 2-h serum insulin, body mass index, diabetes pedigree function, and age. In this dataset, the original 786 records were reduced to 392 after the records with missing data were deleted.

#### Classification

In this section, we used decision tree, RF and neural network as the classifiers. Decision tree and RF were implemented in WEKA, which is free, non-commercial, open-source machine learning and data mining software based on the Java environment. The neural network was implemented in MATLAB, which is commercial mathematical software developed by MathWorks, Inc. It is used for algorithm development, data visualization and data analysis, and provides a high-level computational language and an interactive environment for numerical calculation.

#### Decision Tree

The decision tree is a basic classification and regression method. A decision tree model has a tree structure, which describes the process of classifying instances based on features (Quinlan, 1986). It can be considered as a set of if-then rules, or equivalently as conditional probability distributions defined on the feature space and the class space.

The tree begins with a single node representing the training samples (Friedl and Brodley, 1997; Habibi et al., 2015; Liao et al., 2018). If the samples are all in the same class, the node becomes a leaf and is labeled with that class. Otherwise, the algorithm chooses the most discriminatory attribute as the current node of the decision tree. According to the value of the current decision node attribute, the training samples are divided into several subsets, each of which forms a branch, so that there are as many branches as attribute values (Quinlan, 1986; Kohabi, 1996). For each subset or branch obtained in the previous step, the previous steps are repeated, recursively forming a decision tree on each of the partitioned samples (Quinlan, 1986; Friedl and Brodley, 1997; Habibi et al., 2015).

The typical algorithms of decision tree are ID3, C4.5, CART and so on. In this study, we used the J48 decision tree in WEKA.

J48, also known as C4.8, is an upgrade of C4.5. J48 (Salzberg, 1994; Kohabi, 1996) follows a top-down, recursive divide-and-conquer strategy. The method selects an attribute as the root node, generates a branch for each possible attribute value, and divides the instances into multiple subsets, each of which corresponds to a branch of the root node; it then repeats the process recursively on each branch (Kohabi, 1996). When all instances in a branch have the same classification, the algorithm stops. In J48, the nodes are decided by information gain. According to the following formulas, in each iteration J48 calculates the information gain of each attribute and selects the attribute with the largest information gain as the node of this iteration (Quinlan, 1996a,b; Sharma et al., 2014).

Attribute A information gain:

$$\text{Gain} \left( \mathbf{A} \right) = \text{Info} \left( \mathbf{D} \right) - \text{Info}\_{\mathbf{A}} \left( \mathbf{D} \right)$$

Pre-segmentation information entropy:

$$\text{Info}(D) = \text{Entropy}(D) = -\sum\_{j} p(j|D) \log p(j|D)$$

Distributed information entropy:

$$\text{Info}\_{\text{A}}(D) = \sum\_{\mathbf{i}=1}^{\mathbf{v}} \frac{n\_{\mathbf{i}}}{n} \text{Info}\left(D\_{\mathbf{i}}\right)$$
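As an illustration of the three formulas above, the following minimal sketch computes the information gain of categorical attributes on a toy sample. It is not WEKA's J48 implementation; the base-2 logarithm, the function names and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): entropy of the class labels (log base 2 is an assumption)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one categorical attribute A."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)  # D_i: labels of samples with A = v
    # Info_A(D): size-weighted entropy of the subsets D_i.
    info_a = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - info_a

# Hypothetical toy data: two discretized attributes and a diabetes label.
glucose = ["high", "high", "low", "low"]
pulse = ["fast", "slow", "fast", "slow"]
label = [1, 1, 0, 0]
gains = {"glucose": info_gain(glucose, label), "pulse": info_gain(pulse, label)}
best = max(gains, key=gains.get)  # attribute chosen as the node
```

In this toy sample the glucose attribute separates the two classes completely, so it receives the largest gain and would be selected as the node.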

#### Random Forest

RF is a classification method that uses many decision trees. The algorithm was proposed by Breiman (Breiman, 2001). RF is a multifunctional machine learning method that can perform both classification and regression tasks. In addition, RF is based on Bagging and plays an important role in ensemble machine learning (Breiman, 2001; Lin et al., 2014; Svetnik et al., 2015). RF has been employed in several biomedical studies (Zhao et al., 2014; Liao et al., 2016).

RF generates many decision trees, which distinguishes it from the single decision tree algorithm (Pal, 2005). When the RF predicts a new object based on some attributes, each tree in the RF gives its own classification result and 'votes,' and the overall output of the forest is the class that receives the most votes. In a regression problem, the RF output is the average of the outputs of all decision trees (Liaw and Wiener, 2002; Svetnik et al., 2015).
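The two aggregation rules described above can be sketched as follows. The sketch covers only the combination step, not the training of the trees, and the function names and toy tree outputs are hypothetical.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Classification: return the class that receives the most votes."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Regression: return the average of the individual tree outputs."""
    return mean(tree_outputs)

# Hypothetical outputs of five trees for one new sample.
votes = ["diabetic", "healthy", "diabetic", "diabetic", "healthy"]
predicted_class = forest_classify(votes)
predicted_value = forest_regress([5.0, 6.0, 7.0])
```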

#### Neural Network

A neural network is a mathematical model that imitates the behavior of animal neural networks. Depending on the complexity of the system, the model processes information by adjusting the relationships between its internal nodes (Mukai et al., 2012). According to the style of the connections, neural network models can be divided into forward networks and feedback networks. In this paper, we used the Neural Pattern Recognition app in MATLAB, which is a two-layer feed-forward network with sigmoid hidden and softmax output neurons. The neural network structure is shown in **Figure 1**.

A neural network has three important parts, namely the input layer, the hidden layer and the output layer. The input layer is responsible for accepting input data, and the results are obtained from the output layer. The layer between the input layer and the output layer is called the hidden layer because it is invisible to the outside. There is no connection between neurons in the same layer. In this network, the number of hidden neurons was set to 10, which gives a better performance. Suppose the input vector is $\vec{x}$, the weight vector is $\vec{w}$, and the activation function is a sigmoid function; then the output is:

$$\mathbf{y} = \text{sigmoid}\left(\vec{\boldsymbol{w}}^{\mathrm{T}} \cdot \vec{\boldsymbol{x}}\right),$$

and the sigmoid is:

$$\text{sigmoid}\left(\mathbf{x}\right) = \frac{1}{1 + e^{-\mathbf{x}}}$$
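The output of a single sigmoid neuron, as given by the two formulas above, can be sketched as follows. The weights and inputs are hypothetical, and a bias term is omitted because the formula in the text does not include one.

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(w, x):
    """y = sigmoid(w^T x) for a weight vector w and an input vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical weights and inputs.
y_zero = neuron_output([0.5, -0.5], [1.0, 1.0])  # w^T x = 0
y_pos = neuron_output([2.0, 1.0], [1.0, 1.0])    # w^T x = 3
```

A weighted sum of zero maps to the midpoint of the sigmoid, while larger positive sums push the output toward one.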

#### Model Validation

fgene-09-00515 November 2, 2018 Time: 17:6 # 4

In many studies, authors have used two validation methods, namely the hold-out method and the k-fold cross validation method, to evaluate the capability of a model (Kohavi, 1995; Bengio and Grandvalet, 2005; Kim, 2009; Chen et al., 2016; Refaeilzadeh et al., 2016; Yang et al., 2016, 2018; Su et al., 2018; Tang H. et al., 2018). The method can be chosen according to the goal of the problem and the size of the data. In the hold-out method, the dataset is divided into two parts, a training set and a test set. The training set is used to train the machine learning algorithm and the test set is used to evaluate the model (Kim, 2009). The training set is different from the test set. In this study, we used this method to verify the universal applicability of the methods. In the k-fold cross validation method, the whole dataset is used to train and test the classifier (Kim, 2009). First, the dataset is divided evenly into k sections, which are called folds. In the training process, k-1 folds are used to train the model and the remaining fold is used for testing. This process is repeated k times, so that each fold serves once as the test set. The final result is the average of the test performance over all folds (Kohavi, 1995). The advantage of this method is that all samples in the dataset are used for both training and testing, which avoids high variance (Refaeilzadeh et al., 2016; Kavakiotis et al., 2017). In this study, we used the five-fold cross validation method.
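The fold construction in k-fold cross validation can be sketched as follows. This is a minimal illustration, not the procedure used by WEKA or MATLAB; the function name is hypothetical, the folds are contiguous index ranges, and in practice the samples would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample, so fold
        # sizes differ by at most one when n_samples is not divisible by k.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Five folds over ten hypothetical samples.
splits = list(k_fold_indices(10, 5))
```

Each sample appears in exactly one test fold, and the reported score would be the average of the k test scores.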

#### Feature Selection

Feature selection methods reduce the number of attributes and thereby avoid redundant features. There are many feature selection methods. In this study, we used PCA and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality.

#### Principal Component Analysis

PCA (Wang and Paliwal, 2003; Polat and Günes, 2007; You et al., 2018) obtains K eigenvalues and the corresponding unit eigenvectors by solving the characteristic equation of the correlation matrix of the observed variables. The eigenvalues are sorted from large to small and represent the variance of the observed variables explained by each of the K principal components (Smith, 2002).

The model for extracting principal component factors is:

$$F\_i = T\_{i1}X\_1 + T\_{i2}X\_2 + \dots + T\_{ik}X\_k \ (i = 1, 2, \dots, m)$$

where F<sub>i</sub> is the i-th principal component factor; T<sub>ij</sub> is the loading of the i-th principal component factor on the j-th index; m is the number of principal component factors; and k is the number of indicators.
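The linear combination above can be sketched as follows for given loadings. The sketch only evaluates the principal component factors; it does not derive the loadings from the correlation matrix (done in SPSS in this study), and the loading values and indicator values are hypothetical.

```python
def component_scores(loadings, x):
    """F_i = sum over j of T_ij * X_j, for every principal component i."""
    return [sum(t_ij * x_j for t_ij, x_j in zip(row, x)) for row in loadings]

# Hypothetical loadings for m = 2 components over k = 3 indicators.
T = [[0.5, 0.5, 0.0],
     [0.0, 0.5, -0.5]]
X = [2.0, 4.0, 6.0]  # hypothetical (standardized) indicator values
F = component_scores(T, X)
```

Here the three original indicators are replaced by two comprehensive indicators, which is the dimensionality reduction the method provides.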

The PCA method can reduce the original multiple indicators to one or a few comprehensive indicators. These few comprehensive indicators reflect the vast majority of the information carried by the original indicators, are not related to each other, and avoid repeated information (Jackson, 1993; Jolliffe, 1998). At the same time, the reduction of indicators facilitates further calculation, analysis and evaluation.

TABLE 2 | Predict the diabetes by using blood glucose.

indicate the index has the highest information gain and insulin and age play important roles in this method.

TABLE 4 | Predict diabetes of using PCA to reduce dimensionality.

We used Statistical Product and Service Solutions (SPSS) to implement the PCA algorithm. SPSS is a general term for a series of software products and related services launched by IBM. It is mainly used for statistical analysis, data mining, predictive analysis and other tasks. SPSS has a friendly visual interface and is easy to operate.



#### Minimum Redundancy Maximum Relevance

mRMR (Jackson, 1993; Sakar et al., 2012; Li et al., 2016; Wang et al., 2018) ensures that the selected features have the maximum Euclidean distances between them, or that their pairwise correlations are minimized. The minimum redundancy criterion is usually combined with a maximum relevance criterion, such as maximum mutual information between the features and the target phenotype. This brings two benefits. First, with the same number of features, an mRMR feature set is more representative of the target phenotype and therefore generalizes better. Second, a smaller mRMR feature set can effectively cover the same space as a larger conventional feature set. For categorical variables, the similarity between features is measured by mutual information. Minimum redundancy means choosing the features that differ most from each other. Similar to mRMR, researchers have also developed Maximum Relevance Maximum Distance (MRMD) (Zou et al., 2016b) for feature ranking. Both methods have been employed in several biomedical studies (Zou et al., 2016a; Jia et al., 2018; Tang W. et al., 2018; Wei et al., 2018).
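A greedy mRMR selection for categorical features can be sketched as follows. The text does not state which mRMR criterion was used, so the sketch assumes the common difference form (relevance minus mean redundancy, both measured by mutual information); the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def mrmr(features, target, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            redundancy = (sum(mutual_information(features[name], features[s])
                              for s in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: f2 duplicates f1, while f3 carries new information.
features = {"f1": [0, 0, 1, 1], "f2": [0, 0, 1, 1], "f3": [0, 1, 0, 1]}
target = [0, 1, 2, 3]
chosen = mrmr(features, target, k=2)
```

Although f2 is as relevant to the target as f1, it is fully redundant once f1 has been chosen, so the second pick is f3.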

TABLE 5 | Predict diabetes of using all features without blood glucose.


TABLE 6 | Predict diabetes of using 11 features.



#### Measurement

In this study, we used sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC) to measure the effectiveness of the classification. The formulas are as follows:

$$\text{SN} = \frac{TP}{TP + FN}$$

$$\text{SP} = \frac{\text{TN}}{\text{TN} + FP}$$

$$\text{ACC} = \frac{\text{TN} + \text{TP}}{\text{TN} + \text{TP} + \text{FP} + \text{FN}}$$

$$\text{MCC} = \frac{(\text{TP} \times \text{TN}) - (\text{FN} \times \text{FP})}{\sqrt{(\text{TP} + \text{FN}) \times (\text{TN} + \text{FP}) \times (\text{TP} + \text{FP}) \times (\text{TN} + \text{FN})}}$$

where true positive (TP) is the number of samples in the positive set that are identified as positive, true negative (TN) is the number of samples in the negative set that are classified as negative, false positive (FP) is the number of samples in the negative set that are identified as positive, and false negative (FN) is the number of samples in the positive set that are identified as negative. These measures are often used to evaluate the quality of classification models. The accuracy is defined as the ratio of the number of samples correctly classified by the classifier to the total number of samples. In medical statistics, there are two basic characteristics, sensitivity (SN) and specificity (SP). Sensitivity is the true positive rate, and specificity is the true negative rate. The MCC is a correlation coefficient between the actual classification and the predicted classification. Its value range is [-1, 1]. An MCC of 1 indicates a perfect prediction, an MCC of 0 indicates that the prediction is no better than random prediction, and an MCC of -1 means that the predicted classification is completely inconsistent with the actual classification.
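The four measures can be computed from the confusion-matrix counts as in the following sketch. The counts are hypothetical, and returning an MCC of 0 when the denominator is zero is a convention assumed here, not something stated in the text.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return SN, SP, ACC and MCC from the confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fn * fp) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}

# Hypothetical counts for 50 positive and 50 negative samples.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```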

#### RESULTS AND DISCUSSION

In the tables, we used Luzhou to represent the dataset from hospital physical examination data in Luzhou, China and Pima Indians represents the Pima Indians diabetics data. The two datasets contain 14 and 8 attributes, respectively.

As a baseline for comparison, we first used all features to predict diabetes. The results are shown in **Table 1**.

According to **Table 1**, we obtained good results. In addition, RF has the best result among the three classifiers on the Luzhou physical examination dataset. On the Pima Indians dataset, random forest performs similarly to the neural network. The decision tree structure for the Luzhou dataset is shown in **Figure 2**, and that for the Pima Indians dataset is shown in **Figure 3**. According to **Figures 2**, **3**, the root node is glucose, which shows that glucose has the maximum information gain; this is consistent with common sense and with the clinical basis for diagnosis. However, there are diabetic patients in the Luzhou dataset whose fasting blood glucose is less than 6.8; we consider that they may have injected insulin before the physical examination to control their blood sugar levels.

FIGURE 4 | The results of using Luzhou dataset. According to this figure, the method that used all features and random forest has the best performance, and the methods without blood glucose do not perform well.

TABLE 8 | Predict diabetes of using all features without blood glucose.

According to the relevant literature, three indicators are used to determine diabetes mellitus: fasting blood glucose, random blood glucose and blood glucose tolerance. Because the Luzhou dataset only contains fasting blood glucose and the Pima Indians dataset only contains blood glucose tolerance, we used fasting blood glucose and blood glucose tolerance, respectively, for prediction. The results are shown in **Table 2**.

According to **Table 2**, J48 performs better than the other classifiers on the Luzhou dataset, with an accuracy above 0.76. On the Pima Indians dataset, using only blood glucose tolerance does not give good results.

Then, we used mRMR to select features and obtained a score for each feature. According to the scores, we chose the first five features (height, HDL, fasting glucose, breathe, and LDL) to predict diabetes on the Luzhou dataset, and the first three attributes (glucose, 2-h serum insulin and age) to predict diabetes on the Pima Indians dataset. The results are shown in **Table 3**.

On the Luzhou dataset, J48 has the best performance, but the results are not better than those obtained with all features. On the Pima Indians dataset, this method has the best performance when RF is used as the classifier.

Then we used PCA to reduce the features. Because height and weight are related to the physique index, we did not include height and weight in the PCA for the Luzhou dataset. We used SPSS to analyze the factors. According to the KMO and Bartlett tests, PCA can be applied to both datasets to reduce the features. From the component matrix and the eigenvalues, together with the total variance explained, we obtained five new features for the Luzhou dataset and three new features for the Pima Indians dataset. We used the new features to conduct experiments, and the results are shown in **Table 4**.

The ACC on the Luzhou dataset is lower than with the above methods, which shows that PCA is not suitable for this data. For the Pima Indians dataset, the accuracy is better than when only glucose is used. In this case, the neural network has the best performance for predicting diabetes.

In order to explore the importance of the other indexes in predicting diabetes, we designed the following experiments using the Luzhou dataset. First, we used all features except blood glucose to predict diabetes, and the results are shown in **Table 5**.

Then, we removed blood glucose, LDL and HDL, which can only be measured by testing at a hospital. This leaves 11 features in this experiment, and the results are shown in **Table 6**.

According to **Tables 5**, **6**, RF predicts diabetes better than the other classifiers. Although the accuracy is not the best, the prediction can be used as a reference.

We summarized the results of the above experiments in **Figures 4**, **5**, which show the accuracy of each method more clearly and allow a better comparison.

From **Figures 4**, **5**, we can see that PCA is not very suitable for the two datasets, and that using all features performs well, especially for the Luzhou dataset. There is not much difference among random forest, decision tree and neural network when the feature set contains blood glucose. When the features do not include blood glucose, random forest has the best performance, whereas the neural network performs relatively poorly.

According to **Figure 4**, we selected the methods that performed better and conducted independent test experiments using the Luzhou dataset. We chose three methods (all features, mRMR and blood glucose) for these experiments. The results are shown in **Table 7**.

According to **Table 7**, the method using all features still gives the better result. The method using only blood glucose does not perform well, especially with the neural network as the classifier. The reason may be that blood glucose alone contains too little information.

Because the Luzhou dataset was collected by ourselves, it cannot be used for comparison experiments. In order to compare with the methods in other papers, we used the Pima Indians dataset for 10-fold cross validation experiments. The results are shown in **Table 8**.

#### CONCLUSION

Diabetes mellitus is a disease that can cause many complications. How to accurately predict and diagnose this disease by using machine learning is worth studying. According to all of the above experiments, the accuracy obtained with PCA is not good, while using all features and using mRMR give better results. Using only fasting glucose also performs relatively well, especially on the Luzhou dataset. This means that fasting glucose is the most important index for prediction, but fasting glucose alone cannot achieve the best result, so more indexes are needed for accurate prediction. In addition, by comparing the results of the three classifiers, we find that there is not much difference among random forest, decision tree and neural network, but random forest is clearly better than the other classifiers in some methods. The best result for the Luzhou dataset is 0.8084, and the best result for the Pima Indians dataset is 0.7721, which indicates that machine learning can be used for predicting diabetes, but finding suitable attributes, classifiers and data mining methods is very important. Because of the limitations of the data, we cannot predict the type of diabetes, so in the future we aim to predict the type of diabetes and to explore the proportion of each indicator, which may improve the accuracy of diabetes prediction. We uploaded the Pima Indians dataset to http://121.42.167.206/PIMAINDIANS/data.html.

#### AUTHOR CONTRIBUTIONS

QZ designed the experiments. KQ and YL performed the experiments. KQ wrote the paper. DY and YJ analyzed the data. HT provided the data.

#### FUNDING

The work was supported by the National Key R&D Program of China (SQ2018YFC090002), the Natural Science Foundation of China (Nos. 61771331 and 61702430), the Scientific Research Foundation of the Health Department of Sichuan Province (120373), the Scientific Research Foundation of the Education Department of Sichuan Province (11ZB122), and the Scientific Research Foundation of Luzhou city (2012-S-36).

#### REFERENCES

Alghamdi, M., Al-Mallah, M., Keteyian, S., Brawner, C., Ehrman, J., and Sakr, S. (2017). Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: the henry ford exercise testing (FIT) project. PLoS One 12:e0179805. doi: 10.1371/journal.pone.0179805

American Diabetes Association (2012). Diagnosis and classification of diabetes mellitus. Diabetes Care 35(Suppl. 1), S64–S71. doi: 10.2337/dc12-s064

Bengio, Y., and Grandvalet, Y. (2005). Bias in Estimating the Variance of K-Fold Cross-Validation. New York, NY: Springer, 75–95. doi: 10.1007/0-387-24555-3\_5



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zou, Qu, Luo, Yin, Ju and Tang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Pilot Safety Evaluation of a Novel Strain of Bacteroides ovatus

Huizi Tan<sup>1,2†</sup>, Zhiming Yu<sup>3†</sup>, Chen Wang<sup>1,2</sup>, Qingsong Zhang<sup>1,2</sup>, Jianxin Zhao<sup>1,2</sup>, Hao Zhang<sup>1,2,4</sup>, Qixiao Zhai<sup>1,2,5\*</sup> and Wei Chen<sup>1,2,4,6</sup>

<sup>1</sup> State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, China, <sup>2</sup> School of Food Science and Technology, Jiangnan University, Wuxi, China, <sup>3</sup> Wuxi People's Hospital Affiliated to Nanjing Medical University, Wuxi, China, <sup>4</sup> National Engineering Research Center for Functional Food, Jiangnan University, Wuxi, China, <sup>5</sup> International Joint Research Laboratory for Probiotics, Jiangnan University, Wuxi, China, <sup>6</sup> Beijing Advanced Innovation Center for Food Nutrition and Human Health, Beijing Technology and Business University, Beijing, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Jiachao Zhang, Hainan University, China

Yun Li, University of Pennsylvania, United States

#### \*Correspondence:

Qixiao Zhai zhaiqixiao@sina.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 26 September 2018 Accepted: 24 October 2018 Published: 06 November 2018

#### Citation:

Tan H, Yu Z, Wang C, Zhang Q, Zhao J, Zhang H, Zhai Q and Chen W (2018) Pilot Safety Evaluation of a Novel Strain of Bacteroides ovatus. Front. Genet. 9:539. doi: 10.3389/fgene.2018.00539

Bacteroides ovatus ELH-B2 is considered a potential next-generation probiotic due to its preventive effects on lipopolysaccharide-associated inflammation and intestinal microbiota disorders in mice. To study safety issues associated with B. ovatus ELH-B2, we conducted comprehensive and systematic experiments, including in vitro genetic assessments of potential virulence and antimicrobial resistance genes, and an in vivo acute toxicity study in both immunocompetent mice and mice immunosuppressed via cyclophosphamide treatment. The results indicated that this novel strain is non-toxigenic, that fragilysin is not expressed, and that most of the potential virulence genes are correlated with cellular structures such as capsular polysaccharide and with polysaccharide utilization. The antibiotic resistance features are unlikely to be transferred to other intestinal microorganisms, as neither plasmids nor related genomic islands were identified. Side effects were not observed in mice. B. ovatus ELH-B2 also alleviated the damage caused by cyclophosphamide injection.

Keywords: Bacteroides ovatus, safety evaluation, antibiotic resistance, virulence genes, next-generation probiotics

#### INTRODUCTION

Probiotics, prebiotics, and antibiotics are the most relevant therapies for disorders induced by disturbed microbiota. Traditional probiotics mainly refer to Lactobacillus and Bifidobacterium, which are normally obtained from traditional fermented foods and are widely accepted as food ingredients or supplements for daily intake, with a prediction of global turnover value of US\$46.55 billion by 2020 (O'Toole et al., 2017).

Beneficial strains other than the traditional probiotics have been discovered due to the developments in bacterial culture methodologies and sequencing techniques and have started to be authorized as ingredients in food, particularly from Bacteroides which is one of the most abundant genera in the human intestine. For example, Bacteroides xylanisolvens DSM23964, which promotes the maturation of natural antibodies against cancers in humans, has recently been permitted to be added to pasteurized milk products under Novel Food Regulation No. 258/97 by the European Commission (Brodmann et al., 2017); B. uniformis CECT7771 can improve overweight-induced disorders by reducing the levels of cholesterol and triglyceride (Cano et al., 2012), with no obvious damages identified in vivo (Fernandez-Murga and Sanz, 2016).


B. ovatus is another dominant species identified as a next-generation probiotic, either in its original form, with sufficient tumor-specific Thomsen–Friedenreich antigen expressed for cancer prevention (Ulsemer et al., 2013), or genetically modified with genes encoding human keratinocyte growth factors (Hamady et al., 2010) or transforming growth factors (Hamady et al., 2011) to facilitate therapies for bowel diseases. However, the beneficial functions of Bacteroides are strain-dependent. For example, polysaccharide A (PSA)-producing B. fragilis is capable of relieving Helicobacter hepaticus-associated inflammation and autism spectrum disorders (Mazmanian et al., 2008; Hsiao et al., 2013), whereas fragilysin (bft)-carrying B. fragilis directly contributes to severe colitis (Yim et al., 2013). This emphasizes the necessity of safety evaluations for each potentially beneficial strain.

Previously, we established an efficient method for purifying low-abundance Bacteroides species from the human intestine (Tan et al., 2018), and discovered that one of the isolates, B. ovatus ELH-B2, displayed promising potential for modulating lipopolysaccharide (LPS)-induced disorders in cytokine secretion and intestinal microbiota by restoring the balance of regulatory T cells (Tregs) and T helper 17 (Th-17) cells (data unpublished). In this study, a pilot safety assessment of this novel strain was carried out, including explorations of its hemolytic and motile characteristics, antibiotic resistance, genetic virulence factors, and underlying side effects in both normal and immunosuppressed mice.

#### MATERIALS AND METHODS

#### Bacterial Strains and Culture Conditions

Bacteroides ovatus ELH-B2 was recovered from the in-house preservations at Culture Collections of Food Microbiology (CCFM), Jiangnan University (Wuxi, China). B. ovatus JCM5824 was purchased from RIKEN BioResource Center, Japan. Salmonella enterica CMCC50335 and Escherichia coli CMCC44102 were acquired from the National Center for Medical Culture Collections, China.

Salmonella enterica and E. coli were anaerobically cultured in brain heart infusion (BHI, Hopebio, China) at 37°C. The B. ovatus strains were cultured in BHI supplemented with hemin (Sangon Biotech, China) and vitamin K1 (BHIS) at 37°C in an anaerobic chamber for further analysis of bacterial characterizations. The bacterial solutions for in vivo tests were prepared with cells at early stationary phase after centrifugation at 6000 rpm for 15 min and re-suspension in phosphate buffer saline supplemented with 20% glycerol, and maintained at −80°C. Cell viability after freezing and thawing was evaluated via colony-forming unit (cfu) enumeration on BHIS agar before use.

#### Bacterial Characterizations

Bacteroides ovatus type strain JCM5824 was used as control for the bacterial characterization assessments of ELH-B2. Hemolytic capabilities were examined by dropping 5 µl of overnight culture on Brucella agar (Hopebio, China) supplemented with hemin, vitamin K1 and 5% sheep blood (Nanjing SenBeiJia Biological Technology Co., Ltd., China) (Robertson et al., 2006). Motility was tested via standard motility agar assays using BHIS broth supplemented with 0.5% (w/v) agar (soft agar) (Cousin et al., 2015), inoculated with 5 µl of overnight culture and incubated anaerobically for 48 h. S. enterica CMCC50335 was adopted as the positive control and E. coli CMCC44102 as the negative control. All of the experiments were carried out in three biological replicates.

#### Genome Sequencing and Screening of Potential Virulence Factors

The genomic DNA was extracted from a B. ovatus ELH-B2 culture at early stationary phase and sequenced with the Illumina Hiseq system by Majorbio (China). A library with an average insert size of 410 bp was generated, and low-quality reads were filtered out. The genome was assembled using SOAPdenovo v2.04<sup>1</sup> (Li et al., 2008; Li et al., 2010) followed by gap closure and base correction using GapCloser v1.12. A K-mer value of 23 was determined according to the accuracy evaluation. Gene annotation was performed by blastp (BLAST 2.2.28+) against the Nr, Swiss-prot, string and GO databases. In order to show the relationships between B. ovatus ELH-B2 and other B. ovatus isolates whose genome sequences were available from the NCBI database<sup>2</sup>, a neighbor-joining phylogenetic tree (Bottacini et al., 2014) was established by phyML<sup>3</sup> (Guindon et al., 2009) after alignment of homologous genes identified by a graph theory-based Markov clustering algorithm using mafft<sup>4</sup> (Katoh and Standley, 2013) (**Figure 1**).

Putative antibiotic resistance genes and virulence genes were identified in the genome of B. ovatus ELH-B2 by using a Protein-translated nucleotide Basic Local Alignment Search Tool (tblastn) against the Comprehensive Antibiotic Resistance Database (CARD<sup>5</sup>) (McArthur et al., 2013) and the Virulence Factor Database (VFDB<sup>6</sup>) (Chen et al., 2012), respectively. Positive results were accepted with at least 30% identity and 70% coverage, and an e-value less than 0.01 (Salvetti et al., 2016). Genomic islands were also predicted using IslandPath-DIMOB and Islander (Lu and Leong, 2016) to identify putative virulence factors and the possibility of transfer of antibiotic resistance genes between bacteria. Moreover, Bacteroides-specific virulence factors including bft, ompW, upaY, upaZ, wcfR, wcfS, cfiA, cepA, and cfxA, whose amino acid sequences were acquired from the NCBI database (**Table 1B**), were screened in B. ovatus ELH-B2 using tblastn based on BioEdit v7.2.5 with an e-value less than 1e-5. The genome sequence of B. ovatus JCM5824 (GenBank accession number NZ\_CP012938) was analyzed for comparison.
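The acceptance thresholds stated above can be expressed as a simple filter, as in the following sketch. It does not run tblastn or parse its output; the hit tuples, the gene names and the function name are hypothetical and serve only to show how the three cut-offs combine.

```python
def accept_hit(identity_pct, coverage_pct, evalue):
    """Thresholds from the text: >= 30% identity, >= 70% coverage, e-value < 0.01."""
    return identity_pct >= 30.0 and coverage_pct >= 70.0 and evalue < 0.01

# Hypothetical hits: (name, % identity, % coverage, e-value).
hits = [
    ("geneA", 45.2, 88.0, 1e-30),  # passes all three thresholds
    ("geneB", 28.0, 95.0, 1e-12),  # identity too low
    ("geneC", 60.0, 50.0, 1e-8),   # coverage too low
    ("geneD", 35.0, 75.0, 0.5),    # e-value too high
]
accepted = [name for name, ident, cov, ev in hits if accept_hit(ident, cov, ev)]
```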

<sup>1</sup>http://soap.genomics.org.cn/

<sup>2</sup>https://www.ncbi.nlm.nih.gov/

<sup>3</sup>http://www.atgc-montpellier.fr/phyml/

<sup>4</sup>https://mafft.cbrc.jp/alignment/software/

<sup>5</sup>http://arpcard.mcmaster.ca/

<sup>6</sup>http://www.mgc.ac.cn/VFs/main.htm

#### Minimum Inhibitory Concentration (MIC) of Different Antibiotics

Fourteen antibiotics, corresponding to ampicillin, cefoxitin, ceftriaxone, penicillin G, and vancomycin which suppress cell wall synthesis; chloromycetin, clindamycin, erythromycin, kanamycin, streptomycin, and tetracycline which restrain protein synthesis; ciprofloxacin and metronidazole which inhibit nucleic acid synthesis; and polymyxin B which suppresses cytoplasmic functions, were applied to determine the antibiotic resistance profiles of B. ovatus ELH-B2, with the type strain of JCM5824 as comparison. B. ovatus overnight culture (100 µl) at a concentration of 10<sup>7</sup> cfu/ml was treated with serially diluted antibiotics from 0.125 to 1024 µg/ml in sterile 96-well plates. The optical density at 600 nm was checked with a microplate reader (Multiskan GO, Thermo Scientific, United States) after anaerobic cultivation at 37°C for 48 h. The MIC of each antibiotic was determined by the lowest concentration that inhibited 90% of the growth of the tested B. ovatus strains (D'Aimmo et al., 2007). All of the experiments were carried out in three biological replicates.
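The MIC criterion above (the lowest concentration inhibiting 90% of growth) can be sketched as follows. The optical density readings are hypothetical, and the sketch assumes that growth inhibition is measured relative to the OD600 of an untreated control and that readings are already blank-corrected, which the text does not specify.

```python
def mic(od_by_concentration, od_control, inhibition=0.90):
    """Lowest concentration whose OD600 is reduced by at least `inhibition`
    relative to the untreated control; None if no tested concentration is."""
    for conc in sorted(od_by_concentration):
        if od_by_concentration[conc] <= (1.0 - inhibition) * od_control:
            return conc
    return None

# Hypothetical OD600 readings (keys in µg/ml) for one antibiotic.
readings = {0.125: 0.95, 0.25: 0.90, 0.5: 0.60, 1: 0.30, 2: 0.08, 4: 0.05}
value = mic(readings, od_control=1.0)
```

With three biological replicates, the function would be applied to each replicate or to the mean readings, depending on the reporting convention.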

# Animals

Male C57 mice (7 weeks old, SPF grade) were purchased from Shanghai Laboratory Animal Center (China) and raised within the IVC rodent caging system at Jiangnan University. The mice were maintained under a 12-h light/dark cycle with temperature and humidity strictly controlled. Treatment was initiated after acclimatization for at least 1 week. The entire experiment was approved by the Animal Ethics Committee of Jiangnan University (JN. No. 20180415c0450730[61]), and protocols for the care and use of experimental animals were based on the European Community guidelines (Directive 2010/63/EU).

# Acute Toxicity to Immunocompetent and Immunosuppressed Mice

Both immunocompetent and immunosuppressed mice were involved in this acute toxicity assessment of B. ovatus ELH-B2. The immunocompetent mice, comprising a control group (CTRL, 6 mice) and a B. ovatus ELH-B2 group (BO, 6 mice), were given 150 µl of PBS/glycerol solution or 10<sup>9</sup> cfu B. ovatus ELH-B2 solution by gavage, respectively, every 24 h for 5 days. The other 12 mice were immunosuppressed by intraperitoneal injection with 250 mg/kg of cyclophosphamide (CTX, Sigma-Aldrich, United States) and were allocated to the CTX group or the CTX + BO group 3 days later, followed by daily oral administration of 150 µl of PBS/glycerol solution or 10<sup>9</sup> cfu B. ovatus ELH-B2 solution, respectively, for 5 days (Hirsh et al., 2004; Salva et al., 2014).

The behavior and body weight of each mouse were monitored and recorded throughout the experiments. All of the mice were anesthetized with sodium pentobarbital and sacrificed by cervical dislocation. Liver, spleen and colon tissues and blood samples were collected immediately after sacrifice for further investigations.

# Assays of Hematological and Liver Parameters

Hematological parameters were assessed in fresh blood samples using an automatic hematology analyzer (BC-5000, Mindray, China) and associated buffers. The liver parameters were examined using an automatic biochemical analyzer (BS-480) and corresponding kits (Mindray, China) with serum obtained by centrifuging the blood samples at 2500 rpm for 10 min. The serum standard (Shanghai Zhicheng Biological Technology Co. Ltd., China) was used for quality control.

# Cytokine Concentrations in Serum

The concentrations of tumor necrosis factor alpha (TNF-α), interleukin-6 (IL-6), interleukin-8 (IL-8), and interleukin-10 (IL-10) in serum samples were determined using mouse ELISA kits purchased from Nanjing SenBeiJia Biological Technology Co., Ltd. (China), according to the manufacturer's instructions.

## Histological Analysis


Liver, spleen and colon tissues were preserved in 4% paraformaldehyde solution and then embedded in paraffin. The histological analysis was conducted using Hematoxylin-Eosin (H&E) staining as published (Al-Hashmi et al., 2011). Images were recorded using Pannoramic digital slide scanner (Pannoramic MIDI II, 3DHISTECH Ltd., Hungary).

## Statistical Analysis

Significant differences between groups were determined by unpaired Student's t-test using GraphPad Prism v5.0 (Graph Pad Software Inc., United States), with p-values of less than 0.05 considered significant. All of the data are presented as mean ± SD.
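As an illustration of the summary statistics named above, the following sketch computes mean ± SD and the pooled-variance (Student's) t statistic for two independent groups using only the Python standard library. It is not the GraphPad Prism workflow used in the study; the function names and the sample values are hypothetical, and the p-value lookup against the t distribution is deliberately omitted.

```python
from math import sqrt
from statistics import mean, stdev, variance


def mean_sd(values):
    """Return (mean, sample standard deviation) for one group."""
    return mean(values), stdev(values)


def student_t(a, b):
    """Unpaired Student's t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Invented measurements for two groups of three animals each.
ctrl = [1.0, 2.0, 3.0]
treated = [2.0, 3.0, 4.0]
t_stat, dof = student_t(ctrl, treated)
print(mean_sd(ctrl), mean_sd(treated), round(t_stat, 4), dof)
```

The resulting statistic would be compared against the t distribution with `dof` degrees of freedom to obtain the two-sided p-value.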

# RESULTS

# Microbiological Properties

Bacteroides ovatus ELH-B2 grew well at 37°C on the Brucella agar supplemented with laked sheep blood under strict anaerobic conditions. The colonies were round, semi-opaque with smooth edges, and the bacteria were Gram-negative and rod shaped, which matches the description of B. ovatus in Bergey's Manual (Krieg et al., 2001). Similar to the type strain B. ovatus JCM5824, ELH-B2 was confirmed to be non-motile, but slightly hemolytic.

## Genetic Characteristics and Identification of Potential Virulence Factors

The size of the complete genome of B. ovatus ELH-B2 is 1 206 654 732 base pairs, including 102 scaffolds and 5909 genes. The GC content is 41.98%. The genomic information indicates that the strain most similar to B. ovatus ELH-B2 is B. ovatus CL03T12C18 (**Figure 1**).

According to the BLAST search against the VFDB database, 44 virulence factor homologs were identified in B. ovatus ELH-B2 (**Table 1A**), most of which correlate with cellular structures such as capsular polysaccharide and with polysaccharide utilization such as glycosyltransferase, and have not yet been implicated in the pathogenesis of B. ovatus. There are nine predicted genomic islands of over 30 kb (**Table 1B**), six of which are metabolism-related. Toxin-antitoxin system-associated genes were discovered in two of the potential genomic islands, and one of the components of the Type IV secretion system (T4SS) was also identified.

As for the Bacteroides-specific virulence factors (**Table 1C**), B. ovatus ELH-B2, similar to B. ovatus JCM5824, does not contain the diarrhea-associated B. fragilis enterotoxin bft. The coding gene of the TonB-linked outer membrane protein (ompW), which has been implicated in inflammatory bowel disease (IBD) (Wei et al., 2001), was identified in both B. ovatus strains with high similarity. No hits or only low matches were identified in ELH-B2 for the highly conserved open reading frames, upaY and upaZ, and another two genes critical for synthesizing capsular PSA, wcfR and wcfS. Among the three β-lactamase-associated genes, only cepA was found in the two genomic sequences with high similarity.

# Minimum Inhibitory Concentrations of Antibiotics

As shown in **Table 2A**, the potential antibiotic resistance genes suggested that B. ovatus ELH-B2 could survive treatment with tetracycline, kanamycin, macrolide antibiotics such as erythromycin, cationic antibiotics such as polymyxin B, and glycopeptides such as vancomycin. Accordingly, the MIC experiments verified that B. ovatus ELH-B2 was resistant to these antibiotics except tetracycline (**Table 2B**). It was also clinically susceptible to penicillin, cefoxitin, chloromycetin, and metronidazole with MICs of no more than 32 µg/ml (D'Aimmo et al., 2007). The lowest MIC of ELH-B2 was 4 µg/ml, observed with metronidazole. Clindamycin and erythromycin were able to inhibit the growth of B. ovatus JCM5824 but not ELH-B2.

# In vivo Toxicity of B. ovatus ELH-B2

All of the animals were alive and healthy at the end of the experiment, and no abnormal behaviors were observed. Although the immunosuppressed mice displayed significantly lower body weights, B. ovatus ELH-B2 treatment did not induce obvious alterations in the body mass of healthy or cyclophosphamide-injected mice (**Figure 2A**). Concerning the organ index, which refers to the ratio of organ weight to body weight, ELH-B2 had very little effect on liver indexes in immunocompetent mice, and did not further increase the enlarged spleen indexes of the drug-injected mice (**Figure 2B**). Colon length was also not notably influenced by the administration of ELH-B2.

B. ovatus ELH-B2 intervention did not cause significant alterations in the hematological (**Table 3A**) or liver parameters (**Table 3B**) of normal mice. After CTX injection, the percentage of lymphocytes (p < 0.01), hemoglobin concentration (p < 0.001), hematocrit value (p < 0.01) and platelet count (p < 0.05) dropped markedly, and the percentage of neutrophils increased dramatically (p < 0.01). However, during the treatment with ELH-B2, the hematocrit value of the immunosuppressed mice recovered (p < 0.01) and corpuscular volume also increased significantly (p < 0.01). As for the liver parameters, CTX induced notable upregulation of alanine aminotransferase (p < 0.05), and the concentration of alkaline phosphatase dropped to the normal level following the administration of ELH-B2 in CTX-treated mice (p < 0.01).

No obvious modifications in cytokine production were observed after treatment with B. ovatus ELH-B2 in either healthy or immunosuppressed mice (**Figure 3**). Treatment with B. ovatus ELH-B2 did not lead to any histopathological damage in the liver, spleen or colon of the healthy mice (**Figure 4**). However, the CTX-treated mice suffered from hypertrophy of the spleen, the histological structure of which was severely damaged with obvious fibrosis and hemorrhage. The red and white pulps could not be well-defined and splenocytes were irregularly aligned (**Figure 4**).

TABLE 1 | (A) Identification of potential virulence factors in the genome of Bacteroides ovatus ELH-B2 according to VFDB database; (B) prediction of putative genomic islands of over 30 kb in the genome of B. ovatus ELH-B2; (C) comparison of Bacteroides-specific virulence genes in B. ovatus ELH-B2 and JCM5824.

# DISCUSSION

Evidence that indigenous and genetically modified intestinal commensals can modulate immune and metabolic disorders has extended the range of probiotics, and such organisms are termed "next-generation probiotics" or "live biotherapeutic products." The United States Food and Drug Administration (FDA) defines live biotherapeutic products as "a biological product that contains live organisms, such as bacteria, and is applicable to the prevention, treatment or cure of a disease or condition of human beings, but is not a vaccine," a definition that is also suitable for next-generation probiotics (O'Toole et al., 2017). In the meantime, the FDA drafted guidance that next-generation probiotics should be authorized as food ingredients when first entering the market. However, specific guidelines for applications of these promising microorganisms are yet to be developed. Therefore, a safety evaluation of B. ovatus ELH-B2 was carried out according to the regulations of the FAO/WHO for the development of probiotics, which explain the importance of complete bacterial characterization (original source, culture history, phenotype and genotype, antibiotic resistance, and manufacturing methods) and of three-step clinical trials (safety assessment and functional characterization; double-blind, randomized, placebo-controlled human studies; and efficiency comparisons with standard treatments). The evaluation also drew on published toxicity analyses of Lactobacillus spp. (Yakabe et al., 2009; Jia et al., 2011), B. xylanisolvens DSM23964 (Ulsemer et al., 2012a,b), B. uniformis CECT7771 (Fernandez-Murga and Sanz, 2016), and B. fragilis ZY312 (Wang et al., 2017), along with previous results which revealed that ELH-B2 had little effect on the production of secretory immunoglobulin A (sIgA) and chemokine (C-X-C motif) ligand 2 (CXCL2) or the balance of Treg and Th-17 cells, and even upregulated the diversity of intestinal microbiota (unpublished data).

TABLE 2 | (A) Predicted genes associated with antibiotic resistance in the genome of B. ovatus ELH-B2; (B) minimum inhibitory concentrations for antibiotics against B. ovatus ELH-B2 and B. ovatus JCM5824 (µg/ml).

FIGURE 2 | Impact of B. ovatus ELH-B2 on (A) the body weight and (B) the organ indexes of immunocompetent and immunosuppressed mice. The organ indexes of the liver and spleen are expressed as the ratio of the corresponding weight of the organ and body, while the colon index is expressed as the colon length of each animal. Data are displayed as mean ± SD, "##" indicates statistically significant differences between the CTX + BO group and the CTRL group (p < 0.01).

Morphological analysis showed that B. ovatus ELH-B2 cells were non-motile, which excluded the pathogenic factor of flagella for inducing inflammation via activating the NF-κB pathway through Toll-like receptor 5 and secreting IL-8 (Neville et al., 2012), and facilitating nutrient acquisition, niche colonization (Lane et al., 2007; Neville et al., 2012), and biofilm formation (Houry et al., 2010). The translucent circles around ELH-B2 colonies on blood agar plates indicated the possible existence of hemolysin, which is a pore-forming toxin with cytolytic functions on various types of cells, such as keratinocytes, epithelial cells, and lymphocytes (Kennedy et al., 2010; Wilke and Wardenburg, 2010).

The virulence factors discovered via the Virulence Factor Database include genes facilitating protein secretion, carbohydrate degradation and maintenance of cellular structures. These elements could be probiosis-related and contribute to bacterial adhesion and colonization, rather than pathogenicity (Wassenaar et al., 2015). The majority of the predicted islands were found to be metabolism-related. Although TrbF of the T4SS was identified, the remaining conserved genes were absent. Thus B. ovatus ELH-B2 is not capable of producing the entire secretion system needed to transfer virulence proteins and antibiotic resistance genes (Aguilar et al., 2010). Moreover, toxin-antitoxin systems exist widely in natural bacterial strains, having evolved to improve adaptation to the environment (Buts et al., 2005); their virulence characteristics require further analysis.

Bacteroides species are, to some extent, considered to be opportunistic pathogens as some of them are carriers of virulence factors, such as the enterotoxigenic B. fragilis with bft (Sears, 2009) and B. caccae with ompW (Wei et al., 2001). The metalloprotease bft was not identified in ELH-B2, but ompW was found to be 95% identical. The protein encoded by ompW in B. caccae was discovered by pANCA monoclonal antibody in IBD patients and is closely associated with a pathogenic factor of Porphyromonas gingivalis that contributes to tissue damage. However, this TonB-linked structure of ompW is also conserved in the starch-utilization system of Bacteroides species that helps to break down resistant carbohydrates in the host. Hence, the underlying role of the ompW-like structure in ELH-B2 requires further investigation. Meanwhile, the BLAST results of the four genes essential for constructing PSA indicated the absence of the capsular polysaccharide, which suggests that ELH-B2 does not possess the PSA-associated anti-inflammatory character (Mazmanian et al., 2008), but also avoids the potential consequent abscess formation (Cohen-Poradosu et al., 2011).

TABLE 3 | Profiles of (A) hematological values and (B) liver parameters in immunocompetent and immunosuppressed mice after 5-day treatments with B. ovatus ELH-B2.

WBC, white blood cells; Neu, neutrophils; Lym, lymphocytes; Mon, monocytes; Eos, eosinophils; Bas, basophils; RBC, red blood cells; HGB, hemoglobin concentration; HCT, hematocrit value; MCV, mean corpuscular volume; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; RDW, red cell distribution width; CV, coefficient of variation; SD, standard deviation; PLT, platelet; MPV, mean platelet volume; PDW, platelet distribution width; PCT, thrombocytocrit. Data are displayed as mean ± SD, "∗∗" indicates statistically significant differences between the CTX + BO group and the CTX group (p < 0.01); "#", "##," and "###" indicate statistically significant differences between the CTX + BO group and the CTRL group (p < 0.05, p < 0.01 and p < 0.001, respectively).

Glu, glucose; TC, total cholesterol; TG, triglyceride; ALT, alanine aminotransferase; AST, aspartate aminotransferase; TBIL, total bilirubin; ALP, alkaline phosphatase; TP, total protein; ALB, albumin; CK, creatine kinase; LDH, lactic dehydrogenase.

Overall, the putative antibiotic resistance gene list based on the Comprehensive Antibiotic Resistance Database largely corresponds to the antibiotic resistance profile of B. ovatus ELH-B2 acquired from the MIC experiments. B. ovatus ELH-B2 was found to be less susceptible to antibiotics compared with the type strain. The three genes encoding β-lactamase, corresponding to cfxA for class A cephalosporinase, cfiA for class B metallo-β-lactamase and cepA for endogenous cephalosporinase (Garcia et al., 2008), were not or were only partially aligned in the genome of ELH-B2, which was consistent with the result that the novel strain was clinically susceptible to penicillin G and cefoxitin. It is notable that neither plasmids nor antibiotic resistance-related genomic islands were found in the genome, indicating a low chance of transferring antibiotic resistance to other intestinal commensals.

In general, the results demonstrated that B. ovatus ELH-B2 showed no oral pathogenicity in healthy animals with a daily dose of 10<sup>9</sup> cfu live cells, which is the appropriate concentration for commercial application that guarantees both the viability and integrity of bacterial cells after freeze-drying and restoration procedures (Miyamoto-Shinohara et al., 2000). Based on a mean body weight of 20 g for the mice, 3.5 × 10<sup>12</sup> cfu of ELH-B2 cells are predicted to be safe for a 70 kg healthy human adult.
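The human-equivalent figure quoted above follows from simple proportional scaling by body weight. The sketch below reproduces that arithmetic; the function name is hypothetical, and linear body-weight scaling is taken here only because it matches the numbers stated in the text.

```python
def scale_dose_by_body_weight(dose_cfu, from_weight_g, to_weight_g):
    """Scale a dose linearly with body weight (the proportionality implied by the text)."""
    return dose_cfu * to_weight_g / from_weight_g


# 10^9 cfu per 20 g mouse, scaled to a 70 kg (70,000 g) adult.
human_dose = scale_dose_by_body_weight(1e9, 20, 70_000)
print(f"{human_dose:.1e}")  # 3.5e+12
```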

In the meantime, confidence in authorizing treatments for industrial and clinical applications would be enhanced once the absence of side effects of the bacteria has been confirmed in immunodeficient animals (FAO/WHO, 2002). In this study, an immunosuppressed condition was established via injection of CTX, which is one of the most commonly used chemotherapeutic treatments for cancers and is also used as an immunosuppressive agent before myeloablative therapies (Ehrke, 2003). Accordingly, the cytotoxicity was reflected in the liver, which is the first target organ for all toxic drugs (Singh et al., 2018), and the spleen, which is responsible for the immune status by controlling the proliferation of T cells, B cells and lymphocytes (Gong et al., 2015). CTX treatment led to a distinctly upregulated liver parameter, alanine aminotransferase, injuries in the spleen and disturbed hematological values.

Nevertheless, B. ovatus ELH-B2 did not aggravate the toxicities of CTX. Moreover, the restoration of alkaline phosphatase suggested a recovery effect on liver and pancreatic functions. Alkaline phosphatase contributes to the dephosphorylation of gut-derived LPS in the circulation (Moreira et al., 2012); its downregulation therefore underlines the capability of the Bacteroides strain to help reduce the threat of endotoxemia in the mice. This result corresponds to the previous characterization study of this novel strain. A reduction in the secretion of TNF-α was observed with B. ovatus ELH-B2 treatment, suggesting a potential anti-inflammatory function, although the change was not statistically significant.

In summary, B. ovatus ELH-B2, a novel strain which was confirmed to be capable of attenuating LPS-induced inflammation in vivo, did not raise severe safety issues in either immunocompetent or immunosuppressed mice, and even partially relieved the side effects associated with the chemotherapeutic drug. Further assessments of viable doses and extreme conditions, such as intraperitoneal injection of bacteria, should be considered (Ulsemer et al., 2012b).

#### AUTHOR CONTRIBUTIONS

HT and ZY carried out the experiment and drafted the manuscript. CW and QinZ participated in the analysis of the data. JZ and HZ provided the essential reagents and materials. QixZ conceived of the study and managed the project design. WC helped to revise the manuscript. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the National Natural Science Foundation of China (No. 31871773 and 31470161, and Key Program No. 31530056), the Natural Science Foundation of Jiangsu Province (BK20160175 and BK20141098), the National First-Class Discipline Program of Food Science and Technology (JUFSTR20180102), and the Collaborative Innovation Center of Food Safety and Quality Control in Jiangsu Province.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Tan, Yu, Wang, Zhang, Zhao, Zhang, Zhai and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

fgene-09-00568 November 20, 2018 Time: 15:8 # 1

# Prognostic Analysis of Limited Resection Versus Lobectomy in Stage IA Small Cell Lung Cancer Patients Based on the Surveillance, Epidemiology, and End Results Registry Database

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Fen Xue, Fudan University Shanghai Cancer Center, China Jun Yong Qian, West China Hospital of Sichuan University, China

#### \*Correspondence:

Chang Chen chenthoracic@163.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 06 October 2018 Accepted: 06 November 2018 Published: 22 November 2018

#### Citation:

Gu C, Huang Z, Dai C, Wang Y, Ren Y, She Y, Su H and Chen C (2018) Prognostic Analysis of Limited Resection Versus Lobectomy in Stage IA Small Cell Lung Cancer Patients Based on the Surveillance, Epidemiology, and End Results Registry Database. Front. Genet. 9:568. doi: 10.3389/fgene.2018.00568

Chang Gu<sup>1</sup>†, Zhenyu Huang<sup>2</sup>†, Chenyang Dai<sup>1</sup>†, Yiting Wang<sup>3</sup>, Yijiu Ren<sup>1</sup>, Yunlang She<sup>1</sup>, Hang Su<sup>1</sup> and Chang Chen<sup>1</sup>\*

<sup>1</sup> Department of Thoracic Surgery, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai, China, <sup>2</sup> Department of Colorectal and Anal Surgery, Xinhua Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China, <sup>3</sup> Department of Radiation Oncology, Shanghai Chest Hospital, Shanghai Jiao Tong University, Shanghai, China

Objective: The prognostic analysis of limited resection vs. lobectomy in stage IA small cell lung cancer (SCLC) remains scarce.

Methods: Using the Surveillance, Epidemiology, and End Results registry (SEER) database, we identified patients who were diagnosed with pathological stage IA (T1a/bN0M0) SCLC from 2004 to 2013. The overall survival (OS) and lung cancer-specific survival (LCSS) rates of patients with different treatment schemes were compared in stratification analyses. Univariable and multivariable analyses were also performed to identify the significant predictors of OS and LCSS.

Results: In total, we extracted 491 pathological stage IA SCLC patients, 106 (21.6%) of whom received lobectomy, 70 (14.3%) received sublobar resection and 315 (64.1%) received non-surgical treatment. There were significant differences among the groups based on different treatment schemes in OS (log-rank p < 0.0001) and LCSS (log-rank p < 0.0001). Furthermore, in subgroup analyses, we did not identify any differences between the sublobar resection group and the lobectomy group in OS (log-rank p = 0.14) or LCSS (log-rank p = 0.4565). Patients with four or more lymph nodes dissected had a better prognosis. Multivariable analyses revealed that age, laterality, tumor location, and N number were still significant predictors of OS, whereas age, tumor location, and N number were significant predictors of LCSS.

Conclusion: Surgery is an important component of multidisciplinary treatment for stage IA SCLC patients, and sublobar resection is not inferior to lobectomy for these patients.

Keywords: lung cancer, small cell lung cancer, prognosis, sublobar resection, lobectomy

# INTRODUCTION


Lung cancer is the second most commonly diagnosed cancer and the leading cause of death from cancer worldwide (Jemal et al., 2011). Small cell lung cancer (SCLC), as the most common neuroendocrine tumor, comprises almost 14% of all lung cancer patients (Siegel et al., 2016). Moreover, SCLC is recognized as an aggressive neoplasm characterized by rapid growth and early development of widespread metastases (especially hematogenous metastases) (Ettinger et al., 2018). The 5-year overall survival rate of SCLC is only 6.2%, whereas the rate reaches 18.0% in non-small cell lung cancer (NSCLC) patients (Kalemkerian et al., 2013).

As the Veterans Administration (VA) Lung Study Group proposed, SCLC is typically classified as limited-stage and extensive-stage disease (Argiris and Murren, 2001). Approximately 30% of SCLC patients present with limited-stage disease, most of whom have experienced lymphatic metastasis at first diagnosis. SCLC is sensitive to chemotherapy and radiotherapy. Thus, systemic therapy is recommended for all patients with SCLC by the National Comprehensive Cancer Network (NCCN) Guidelines. However, the NCCN Guidelines indicate that, for stage I SCLC patients without mediastinal lymph node metastasis, surgery should be considered (Ettinger et al., 2018).

In early-staged NSCLC patients, surgical resection could offer a potential cure in clinical practice (Islam et al., 2013). Lobectomy with mediastinal lymph node dissection has been recommended as the standard scheme for early-staged NSCLC patients (Darling et al., 2011). However, limited resection (anatomical segmentectomy and non-anatomical wedge resection) is considered a compromise surgical procedure for high-risk patients, although it has the advantage of preserving lung function and providing the chance for a second operation (Kocaturk et al., 2011; Zuin et al., 2013; Gu et al., 2017). Although the efficacy of limited resection for early-staged NSCLC patients has been doubted, many studies have shown that it achieves oncological outcomes equivalent to those of lobectomy in both elderly and younger patients (Altorki et al., 2014; Sihoe and Van Schil, 2014; Gu et al., 2017).

As for early-staged SCLC patients, the role of surgery has not been assessed by any prospective studies. However, data from retrospective studies showed favorable results when additional surgery was applied in patients with stage I SCLC (James et al., 2010; Yang et al., 2016), and the 5-year survival rate could be improved to 40–60% by surgery in stage I SCLC patients (James et al., 2010). Nevertheless, given the characteristics of rapid growth and the sensitivity to chemoradiotherapy, limited resection, especially for high-risk patients, is only recognized as a compromise procedure by many surgeons. Few studies have evaluated the oncological effect of limited resection and the equivalency of limited resection versus lobectomy among stage IA SCLC patients. In this study, we used the population-based Surveillance, Epidemiology, and End Results (SEER) registry to compare the oncological efficacy of limited resection and lobectomy in patients with stage IA SCLC, and further investigated the prognostic factors for these patients.

## MATERIALS AND METHODS

#### Study Population

The study population was confined to patients who were diagnosed with pathological stage IA (T1a/bN0M0) SCLC from 2004 to 2013 in the SEER database. The exclusion criteria were as follows: (1) patients with a second primary neoplasm or with synchronous multiple primary lung cancer; (2) surgical patients treated with neoadjuvant/intraoperative radiotherapy, which could be neoplasms of higher stage; (3) patients with lung metastases (pathologically confirmed SCLC) from other locations; (4) unknown tumor location or primary main bronchus tumor; (5) patients with unknown medical records on survival status. All the data were extracted from the Surveillance, Epidemiology, and End Results registry, which is a public population-based database, and the Institutional Review Board of our hospital approved our study with a waiver for the requirement of patient consent.

The codes of tumor histology were consistent with the International Classification of Diseases for Oncology (Percy et al., 1990). Relevant sociodemographic information was extracted from the SEER database, along with all the available tumor features, including age, gender, race, laterality (left or right), primary tumor location (which lobe), grade, tumor size, and treatment strategy. Tumor pathologic TNM stage was determined according to the 7th edition of the TNM staging system proposed by the American Joint Committee on Cancer (AJCC) (Edge et al., 2010).

Survival time was defined as the interval from the date of diagnosis to the date of death. Patients still alive on December 31, 2013 were treated as censored cases. Furthermore, deaths from other causes were censored at the time of death when investigating the lung cancer-specific survival (LCSS).
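The two endpoint definitions above differ only in which deaths count as events. A minimal sketch of that coding is given below; the `Patient` record, its field names, and the cause labels are hypothetical and do not correspond to actual SEER variable names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    """Hypothetical follow-up record (not the SEER data layout)."""
    months_followed: float        # diagnosis to death or to end of follow-up
    dead: bool                    # False if alive at the end of follow-up
    cause: Optional[str] = None   # "lung_cancer" or "other" when dead


def os_event(p):
    """Overall survival: any death is an event; patients still alive are censored."""
    return p.months_followed, p.dead


def lcss_event(p):
    """Lung cancer-specific survival: deaths from other causes are censored at death."""
    return p.months_followed, p.dead and p.cause == "lung_cancer"


cohort = [
    Patient(24, True, "lung_cancer"),
    Patient(36, True, "other"),
    Patient(60, False),
]
print([os_event(p) for p in cohort])
print([lcss_event(p) for p in cohort])
```

Each function returns a (time, event) pair of the kind consumed by standard survival estimators.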

#### Statistical Analysis

All the patients were grouped by treatment strategy and the baseline variables of the different groups were compared. Data with continuous covariates were presented as median ± standard deviation (SD) and were analyzed using Student's t-test, while data with categorical covariates were presented as number (%) and were analyzed using the Pearson χ<sup>2</sup> test. The distributions of overall survival (OS) and LCSS were calculated with the Kaplan-Meier method, and the significance of differences among groups was explored by the log-rank test. Furthermore, a Cox proportional hazards model was established to probe prognostic factors for OS and LCSS by univariable and multivariable analyses.
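The Kaplan-Meier method mentioned above can be sketched in a few lines of standard-library Python. This is an illustrative implementation of the product-limit estimator, not the SPSS or Prism routine used in the study, and the follow-up times and event flags are invented.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: returns [(t, S(t))] at each distinct time with an event."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t (events and censored cases).
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical follow-up in months; True marks a death, False a censored case.
times = [5, 8, 8, 12, 20]
events = [True, True, False, True, False]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Cases censored at the same time as a death are kept in the risk set for that time, which is the usual convention. Group comparison by the log-rank test and Cox regression are not covered by this sketch.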

All the clinicopathological data were analyzed using the SPSS 22.0 software package (SPSS Inc., Chicago, IL, United States), while the distributions of OS and LCSS were drawn using Prism 5 (Graph Pad Software Inc., La Jolla, CA, United States). Statistical significance was set as p < 0.05.

#### RESULTS

In total, we identified 491 stage IA SCLC patients from the SEER database. Of these, 106 (21.6%) received lobectomy, 70 (14.3%) received sublobar resection, 215 (43.8%) received radiotherapy alone, and 100 (20.3%) received no treatment. Furthermore, of all the patients who underwent surgical resection, 83 underwent lobectomy only, 23 underwent lobectomy plus adjuvant radiotherapy, 54 underwent sublobar resection only, and 16 underwent sublobar resection plus adjuvant radiotherapy.

The baseline characteristics of all the patients are listed in **Table 1**. Elderly patients accounted for the majority of the cohort. Based on the data in **Table 1**, there was no statistical difference among the three groups of different treatment schemes in gender, race and tumor location. However, compared with patients who had surgical treatment, patients without surgical resection tended to be older (p < 0.001) and to have a higher tumor stage (p = 0.003), larger tumor size (p < 0.001), and more radiotherapy (p < 0.001). Moreover, between patients who underwent sublobar resection and patients who received lobectomy, there was no significant difference in age at diagnosis (p = 0.152), gender (p = 0.994), race (p = 0.464), laterality (p = 0.129), tumor location (p = 0.071), T stage (p = 0.275), tumor size (p = 0.143), grade (p = 0.619), and radiotherapy (p = 0.857), whereas more lymph nodes were dissected in the lobectomy group (p < 0.001).

As for survival, there were significant differences among the groups with different treatment schemes in OS (log-rank p < 0.0001) and LCSS (log-rank p < 0.0001) (**Figure 1**). In addition, patients who received surgery plus postoperative radiotherapy experienced the longest survival time (**Figure 1**). In subgroup analyses, there was no difference among the groups based on different surgical procedures in either OS (log-rank p = 0.14) or LCSS (log-rank p = 0.4565). However, survival in patients with lobectomy tended to be better than in those with sublobar resection (**Figure 2**). Moreover, postoperative radiotherapy appeared to improve survival in both the lobectomy group and the sublobar resection group (**Figure 2**). Dissection of more lymph nodes was associated with better survival in both OS (log-rank p < 0.0001) and LCSS (log-rank p = 0.0007) (**Figure 3**).

Univariable analysis revealed that age, laterality, tumor location, N number, and grade were significant predictors of OS while age, laterality, tumor location, and N number were significant predictors of LCSS (**Table 2**). Furthermore, age, laterality, tumor location, and N number were still significant predictors of OS, whereas age, tumor location, and N number were significant predictors of LCSS in multivariable analysis (**Table 3**).

cancer-specific survival (B) based on different treatment schemes in patients with stage IA small cell lung cancer.

#### DISCUSSION

Lung cancer remains the leading cause of cancer death worldwide. The treatment of SCLC, which is characterized by rapid growth and early metastasis, is still an intractable problem. Although some researchers have verified the effect of surgery on early-stage SCLC (stage I) (Ahmed et al., 2017), no previous studies focused on the equivalency of lobectomy versus sublobar resection among stage IA SCLC patients. In the current study of stage IA SCLC patients, we analyzed the prognosis (OS and LCSS) among groups based on different treatment schemes. Our findings revealed that surgery is an important part of multidisciplinary treatment for stage IA SCLC patients and that sublobar resection is not inferior to lobectomy for these patients. Sublobar resection could preserve more lung parenchyma and may reduce overall mortality compared with lobectomy. However, because many clinicopathological data are unavailable in the SEER database, whether sublobar resection can be recommended for stage IA SCLC patients still needs further study.

As the NCCN Guidelines suggest, chemotherapy is an essential part of appropriate regimens for all SCLC patients, especially for those with surgical resection, whether the disease is limited-stage or extensive-stage (Ettinger et al., 2018). Radiotherapy is also recommended for concurrent use with chemotherapy, but no consensus has been reached on the optimal dose and schedule of radiotherapy. In our study, patients who received radiotherapy alone had better survival than those without treatment in both OS (log-rank p < 0.0001) and LCSS (log-rank p = 0.0016). Moreover, surgery plus radiotherapy achieved the best prognosis. Ahmed et al. (2017) analyzed stage I SCLC patients based on the SEER database and also found that patients with surgery plus radiation had the longest survival, which is in concordance with our findings.

As for stage IA SCLC patients without mediastinal lymph node involvement, surgery should be considered (Schneider et al., 2011). In early studies, surgery alone could not be shown to provide a significant benefit for patients with limited-stage SCLC (Fox and Scadding, 1973; Osterlind et al., 1985). More recently, most retrospective studies of surgery in early-stage SCLC patients have revealed improved survival with surgical resection (James et al., 2010; Combs et al., 2015). Weksler et al. (2012) identified 3566 stage I or II SCLC patients in the SEER database from 1988 to 2007, and the findings showed that patients who underwent surgical resection had better outcomes than those without surgery (median, 34.0 months versus 16.0 months, p < 0.001). Moreover, they also found that patients who underwent lobectomy or pneumonectomy experienced significantly longer survival than those who underwent wedge resection (median, 39.0 months versus 28.0 months, p < 0.001). Similar findings were verified by another study (Ahmed et al., 2017). Although many researchers favor lobectomy for early-stage SCLC patients because of the aggressive characteristics of the tumor, and consider that lobectomy plus lymph node dissection can achieve complete resection, we did not observe any survival difference between the lobectomy group and the sublobar resection group in our study. Possible reasons are: (1) stage IA SCLC tumors are small and relatively less invasive; (2) the number of patients with sublobar resection in the cohort was relatively small, which could introduce some bias.

#### TABLE 2 | Univariable analyses for OS and LCSS.

OS, overall survival; LCSS, lung cancer specific survival; HR, hazard ratio; CI, confidence interval.

#### TABLE 3 | Multivariable analyses for OS and LCSS.

OS, overall survival; LCSS, lung cancer specific survival; HR, hazard ratio; CI, confidence interval.

Adequate lymph node dissection was also associated with better survival. The number of dissected lymph nodes was identified as a significant predictor of OS and LCSS. The removal of four or more lymph nodes yielded an important long-term survival benefit for stage IA SCLC patients. Besides, adequate lymph node dissection is helpful in determining the pathological tumor stage, choosing therapy, and predicting prognosis.

Our results also showed that elderly patients (65 years or more) were less likely to receive surgical resection (p < 0.001). The probable reasons are the higher incidence of comorbidities and poorer lung function (Jazieh et al., 2002). Similarly, McCann et al. (2005) suggested that the lower rate of surgical resection in older patients was due to lower performance status and concurrent comorbidities.

The limitations of the study are as follows. First, it was a retrospective study, and the nature of retrospective analysis may cause selection bias. Second, although the SEER database is population-based, many clinicopathological variables are unavailable, such as lung function, clinical tumor stage, comorbidities, adequacy of resection margin, and neoadjuvant or adjuvant chemotherapy. Consequently, the effect of chemotherapy could not be evaluated, and heterogeneity among the enrolled patients is likely. However, far fewer patients with pathological stage IA SCLC than with advanced SCLC receive preoperative radiochemotherapy. Thus, the lack of preoperative radiochemotherapy data has limited influence on our conclusions. Third, the number of patients who underwent surgery was relatively small, and thus we could not further investigate the equivalency of wedge resection versus segmentectomy in stage IA SCLC patients. Prospective studies are required to further confirm the role of different surgical procedures in stage IA SCLC patients.

In summary, our findings revealed that surgery is an important component of multidisciplinary treatment for stage IA SCLC patients and that sublobar resection is not inferior to lobectomy for these patients. However, these findings still need to be verified by further prospective studies.

#### AUTHOR CONTRIBUTIONS

CC conceived and designed the study, and provided the administrative support. CG, ZH, and CD provided the study materials or patients, and collected and assembled the data. CG, YW, and YR analyzed and interpreted the data. All authors wrote the manuscript and approved the final version of the manuscript.

#### FUNDING

This work was supported by the projects from Shanghai Hospital Development Center (SHDC12015116), Health and Family Planning Commission of Shanghai Municipality (2013ZYJB0003 and 20154Y0097), Science and Technology Commission of Shanghai Municipality (15411968400 and 14411962600), Shanghai Lingjun Program (2015057), and Shanghai Pujiang Program (15PJD034).

#### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Gu, Huang, Dai, Wang, Ren, She, Su and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Rare Copy Number Variants Identify Novel Genes in Sporadic Total Anomalous Pulmonary Vein Connection

Xin Shi<sup>1</sup>† , Liangping Cheng<sup>1</sup>† , XianTing Jiao<sup>1</sup> , Bo Chen<sup>1</sup> , Zixiong Li<sup>2</sup> , Yulai Liang<sup>3</sup> , Wei Liu<sup>4</sup> , Jing Wang<sup>1</sup> , Gang Liu<sup>5</sup> , Yuejuan Xu<sup>1</sup> , Jing Sun<sup>1</sup> , Qihua Fu<sup>5</sup> , Yanan Lu<sup>4</sup> \* and Sun Chen<sup>1</sup> \*

<sup>1</sup> Department of Pediatric Cardiovascular, Xinhua Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China, <sup>2</sup> Department of Medical Oncology, Bayi Hospital, Nanjing University of Chinese Medicine, Nanjing, China, <sup>3</sup> Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China, <sup>4</sup> Department of Cardiothoracic Surgery, Xinhua Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China, <sup>5</sup> Medical Laboratory, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Shanghai, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Yujun Shen, Tianjin Medical University, China Daizhan Zhou, Shanghai Jiao Tong University, China

#### \*Correspondence:

Yanan Lu, luscmc@189.cn; Sun Chen, chensun@xinhuamed.com.cn

†These authors share first authorship

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 29 August 2018 Accepted: 02 November 2018 Published: 23 November 2018

#### Citation:

Shi X, Cheng L, Jiao X, Chen B, Li Z, Liang Y, Liu W, Wang J, Liu G, Xu Y, Sun J, Fu Q, Lu Y and Chen S (2018) Rare Copy Number Variants Identify Novel Genes in Sporadic Total Anomalous Pulmonary Vein Connection. Front. Genet. 9:559. doi: 10.3389/fgene.2018.00559

Total anomalous pulmonary venous connection (TAPVC) is a rare congenital heart anomaly. Several genes have been associated with TAPVC, but the mechanisms remain elusive. To search for novel CNVs and candidate genes, we screened a cohort of 78 TAPVC cases and 100 healthy controls for rare copy number variants (CNVs) using whole exome sequencing (WES). We then identified pathogenic CNVs by statistical comparisons between the case and control groups. Altogether, we identified eight pathogenic CNVs in seven candidate genes (PCSK7, RRP7A, SERHL, TARP, TTN, SERHL2, and NBPF3). None of these seven genes has previously been described as related to TAPVC. In a network analysis of these candidate genes and 27 known pathogenic genes derived from the literature and public databases, PCSK7 and TTN emerged as the most important genes for TAPVC. Our study provides novel candidate genes potentially related to this rare congenital heart defect (CHD), which warrant further fundamental research, and sheds light on the possible molecular pathogenesis of TAPVC.

Keywords: congenital heart defects, total anomalous pulmonary venous connection, whole-exome sequencing, copy number variants, pathogenesis

#### INTRODUCTION

Total anomalous pulmonary venous connection (TAPVC) is a rare but heterogeneous congenital heart anomaly in which the pulmonary veins do not connect normally to the left atrium but instead connect to the right atrium or the systemic venous system. The incidence of TAPVC is approximately 1 in 15,000 live births (Ammash et al., 1997; Bjornard et al., 2013; Thummar et al., 2014). Although TAPVC is rare, its mortality approaches 80% without proper intervention in the first year of life (Burroughs and Edwards, 1960). However, the molecular mechanism of TAPVC remains unknown.

So far, only a few genes have been demonstrated to be pathogenic for TAPVC, and these genes explain the disease in only some patients. Bleyl et al. (2006) used genetic linkage analysis and found a locus for TAPVR at 4p13-q12, termed total anomalous pulmonary venous return 1 (TAPVR1); other important pathogenic genes in this region include vascular endothelial growth factor receptor 2 (VEGFR2) and platelet-derived growth factor receptor α (PDGFRA). Nash et al. (2015) used whole genome sequencing to identify a non-synonymous variant in the retinol binding protein 5 (RBP5) gene that is probably related to TAPVC. Li et al. (2017) proposed activin A receptor type II-like 1 (ACVRL1) and sarcoglycan delta (SGCD) as TAPVC pathogenic genes using whole exome sequencing of 6 TAPVC cases. However, these pathogenic genes explain only a small fraction of the molecular mechanism of TAPVC, and the underlying cause in most patients remains unknown.

A copy number variant (CNV) is defined as a segment of DNA at least 1 kb in size that differs in copy number compared with a representative reference genome (Wellcome Trust Case Control Consortium et al., 2010; Pinto et al., 2011). CNVs have been shown to play an important role in the pathogenicity of complex birth defects (Greenway et al., 2009). CNVs, i.e., submicroscopic chromosomal deletions or duplications, have emerged as important contributors to congenital genetic disorders, and their study has identified critical dosage-sensitive genes important for cardiac development (Hitz et al., 2012; Southard et al., 2012; Mlynarski et al., 2015). Whether CNV detection can serve as a genetic screen for novel pathogenic genes in TAPVC has not been reported previously and needs further study.

#### MATERIALS AND METHODS

#### Patient Ascertainment

Our study recruited patients with TAPVC from Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine whose diagnoses were confirmed by echocardiography, cardiac catheterization, computed tomography, and other medical records. Patients with multiple major developmental anomalies, developmental syndromes, or major cytogenetic abnormalities were excluded. Ethical approval was given by the medical ethics committee of Xinhua Hospital.

#### Detection of CNVs From WES Data

Peripheral blood samples were obtained and DNA was extracted using the QIAamp DNA Blood Midi Kit (Qiagen, Germany). WES samples were captured with the Agilent SureSelect Target Enrichment kit (V6 58 Mb; Agilent Technologies, United States) and sequenced on the Illumina HiSeq 2500 platform (Glessner et al., 2014; Li et al., 2015). CNV coordinates were converted to the GRCh37/hg19 build using the UCSC Genome Browser LiftOver tool. CNVs with 50% or larger overlap with telomeres, centromeres, segmental duplications, or immunoglobulin regions were excluded (Hanemaaijer et al., 2012). After filtering, we screened for rare CNVs by comparison with the Database of Genomic Variants (DGV<sup>1</sup> ) and Online Mendelian Inheritance in Man (OMIM<sup>2</sup> ).
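The overlap-based exclusion step can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the half-open coordinate convention, and the toy intervals are all assumptions made for illustration, and real excluded regions (telomeres, centromeres, segmental duplications, immunoglobulin regions) would come from annotation tracks.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end) with half-open coordinates; this layout is an assumption.
Interval = Tuple[str, int, int]


def merge_intervals(regions: List[Interval]) -> Dict[str, List[List[int]]]:
    """Merge overlapping excluded regions per chromosome so shared bases count once."""
    merged: Dict[str, List[List[int]]] = {}
    for chrom, start, end in sorted(regions):
        runs = merged.setdefault(chrom, [])
        if runs and start <= runs[-1][1]:
            runs[-1][1] = max(runs[-1][1], end)
        else:
            runs.append([start, end])
    return merged


def excluded_fraction(cnv: Interval, merged: Dict[str, List[List[int]]]) -> float:
    """Fraction of the CNV length that lies inside the merged excluded regions."""
    chrom, start, end = cnv
    covered = sum(
        max(0, min(end, r_end) - max(start, r_start))
        for r_start, r_end in merged.get(chrom, [])
    )
    return covered / (end - start)


def filter_cnvs(cnvs: List[Interval], excluded: List[Interval],
                max_fraction: float = 0.5) -> List[Interval]:
    """Keep CNVs whose overlap with excluded regions is below max_fraction."""
    merged = merge_intervals(excluded)
    return [c for c in cnvs if excluded_fraction(c, merged) < max_fraction]


# Toy data (hypothetical coordinates, not taken from the study).
excluded_regions = [("chr1", 0, 10_000), ("chr1", 5_000, 20_000), ("chr2", 100_000, 150_000)]
candidate_cnvs = [("chr1", 15_000, 25_000), ("chr1", 19_000, 40_000), ("chr2", 90_000, 200_000)]
print(filter_cnvs(candidate_cnvs, excluded_regions))
```

In this toy input the first CNV overlaps the merged excluded region by exactly 50% of its length and is therefore dropped, while the other two are kept.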

#### Identification of Pathogenic CNV Candidates

The CNV regions were first annotated with RefSeq genes. For each gene and each sample, the copy number status was determined from the annotated CNV regions. The pathogenic CNV candidates were then identified by statistical comparisons between the case and control groups. These comparisons used a one-sided Fisher's exact test with the alternative hypothesis that the mutation frequency is greater in the case group than in the control group. CNV candidates were defined as potentially pathogenic if P < 0.01. The analysis and visualization were implemented in R (version 3.5.0).
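As an illustration of the one-sided test described above, the following Python sketch computes the upper-tail hypergeometric probability for a 2 × 2 table of carriers and non-carriers. The study itself used R; this standalone function, its name, and the control count in the example are assumptions made for illustration only.

```python
from math import comb


def fisher_exact_greater(case_with: int, case_without: int,
                         ctrl_with: int, ctrl_without: int) -> float:
    """One-sided Fisher's exact test.

    Returns P(X >= case_with) under the hypergeometric null, i.e. the
    probability of seeing at least this many carriers among the cases if
    carrier status were unrelated to case/control status.
    """
    n_cases = case_with + case_without
    n_carriers = case_with + ctrl_with
    n_total = n_cases + ctrl_with + ctrl_without
    denominator = comb(n_total, n_cases)
    upper = min(n_cases, n_carriers)
    tail = sum(
        comb(n_carriers, x) * comb(n_total - n_carriers, n_cases - x)
        for x in range(case_with, upper + 1)
    )
    return tail / denominator


# Hypothetical gene: 13 of 78 cases carry the CNV; the control count (0 of 100)
# is an assumed value, since per-gene control counts are not given in the text.
p_value = fisher_exact_greater(13, 65, 0, 100)
print(p_value, p_value < 0.01)
```

A gene would be flagged as a candidate when the returned value falls below the 0.01 threshold stated above.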

#### Protein–Protein Interaction (PPI) Analysis

Protein–protein interactions (PPIs) are physical contacts with molecular associations between chains that occur in a cell or in a living organism in a specific biomolecular context (De Las Rivas and Fontanillo, 2010). Our candidate pathogenic genes with CNVs, combined with 27 known disease-causing genes derived from the literature and publicly available databases, were mapped onto the PPI network in the STRING database<sup>3</sup> (Brohee et al., 2008), which identified the connections between the candidate pathogenic genes and the known disease-causing genes. Information in the STRING database supports the construction of interaction networks (McDowall et al., 2009).

#### Expression Patterns of the Selected Genes During Human Embryonic Heart Development

Expression patterns of the candidate genes in the human embryonic heart were measured using an Affymetrix HTA 2.0 microarray. To determine whether these candidate genes could affect human embryonic heart development, human embryonic heart samples from Carnegie stages 11 through 15 were collected at Xinhua Hospital. RNA was extracted using the TissueLyser II (Qiagen, Germany) and the RNeasy MinElute Cleanup Kit (Qiagen, Germany) as previously described (Nolan et al., 2006). The integrity and purity of the RNA were assessed with the Experion automated gel electrophoresis system (Bio-Rad, United States) and the NanoDrop 2000c spectrophotometer (Thermo Fisher Scientific, United States).

#### RESULTS

#### Clinical Data

A total of 78 sporadic TAPVC cases and 100 healthy controls were recruited in our research. None of these patients had central nervous system malformations, vertebral defects, or genitourinary malformations. The patients' ages ranged from 27 days to 7 years; 45 patients were male (57.7%) and 33 were female (42.3%). Among these patients, 47 had atrial septal defect (ASD), 16 had patent foramen ovale (PFO), 10 had ventricular septal defect (VSD), and 16 had patent ductus arteriosus (PDA). Double outlet right ventricle (DORV) was found in three patients and atrioventricular septal defect in three patients. The detailed clinical data and cardiac phenotypes are summarized in **Table 1**. All patients were recruited via Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine and all signed an informed consent approved by the Ethics Committee of Xinhua Hospital.

<sup>1</sup>http://dgv.tcag.ca/

<sup>2</sup>http://omim.org

<sup>3</sup>https://string-db.org/


#### CNVs in Patients With TAPVC and Identification of Candidate Genes

To discover the pathogenic CNV candidates, we analyzed the WES data by statistical comparisons between the case and control groups. We used a Circos plot to visualize CNVs at the chromosome level (**Figure 1**). Among all chromosomes, chromosome 1 carried the most CNVs in our patients.

Based on these data, we identified statistically significant CNVs at different genomic loci. CNVs were classified as potentially pathogenic if P < 0.01 (**Table 2**). Finally, we identified eight potentially pathogenic CNVs in seven genes (PCSK7, RRP7A, SERHL, TARP, TTN, SERHL2, and NBPF3) among 45 patients with TAPVC (**Figure 2**). The percentage of subjects with candidate CNVs was 57.7% (45 of 78 TAPVC subjects).

#### Expression Pattern in Human Embryonic Heart

We then examined the time-course expression patterns of the candidate genes across Carnegie stages of human heart development using microarray (**Table 3**). Expression of TTN in human embryonic hearts was significantly higher than that of the other genes. Expression of PCSK7, RRP7A, and NBPF3 was also high, second only to TTN.

FIGURE 1 | A whole-genome view of copy number variations in case and control groups. Circos plot for variants visualization with broad horizontal area from chromosome level. The outer, middle, and inner tracks display the chromosomes, CNV frequency in case group, and CNV frequency in control group. The lines above or under zero represent gain or loss.

#### STRING Network Analysis

We obtained 27 known pathogenic genes from the literature and public databases. We then used the STRING database to explore the PPI network between the CNV candidate genes and the known pathogenic genes. In the PPI network, PCSK7 and TTN had the most direct and evident relations to known pathogenic genes (**Figure 3**). PCSK7 directly interacts with KDR, and TTN indirectly interacts with ANKRD1 and SGCD. These two genes could interact with other pathogenic genes via several other genes.
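The distinction between direct and indirect interactions can be made concrete with a short breadth-first search over an interaction graph. The sketch below is a toy illustration only: the edge list is hypothetical, and `LINKER_1` is an invented placeholder for whatever intermediate proteins STRING reports, which are not named in the text.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set, Tuple


def shortest_path_length(edges: Iterable[Tuple[str, str]],
                         source: str, target: str) -> Optional[int]:
    """Number of edges on the shortest path between two proteins, or None if unconnected.

    A length of 1 corresponds to a direct interaction; a length of 2 or more
    corresponds to an indirect interaction through intermediate proteins.
    """
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if source not in adjacency or target not in adjacency:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Hypothetical toy network; LINKER_1 is a made-up intermediate node.
toy_edges = [
    ("PCSK7", "KDR"),
    ("TTN", "LINKER_1"),
    ("LINKER_1", "ANKRD1"),
    ("LINKER_1", "SGCD"),
]
print(shortest_path_length(toy_edges, "PCSK7", "KDR"))
print(shortest_path_length(toy_edges, "TTN", "ANKRD1"))
```

With real STRING edges, the same function could rank candidate genes by their distance to each known pathogenic gene.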

#### DISCUSSION

Total anomalous pulmonary venous connection is a rare congenital heart defect characterized by the misconnection of all four pulmonary veins, which can cause severe morbidity and mortality (Bando et al., 1996). Several genes have been associated with TAPVC, but its etiology remains complicated. To investigate the underlying mechanism of TAPVC, we screened a cohort of 78 TAPVC cases and 100 healthy controls for rare CNVs and novel candidate genes using whole exome sequencing (WES). We obtained seven novel candidate genes (PCSK7, RRP7A, SERHL, TARP, TTN, SERHL2, and NBPF3) that were associated with TAPVC. STRING network analysis demonstrated that PCSK7 and TTN, which are highly related to known pathogenic genes, appear to play an important role in the genetic mechanism of TAPVC.

Both deletions and duplications have been associated with congenital disorders (McLysaght et al., 2014). Recent data show that duplications are approximately half as frequent as deletions and that their heart malformation phenotypes are much more diverse. It is possible that genomic deletions are more likely than duplications to cause dosage sensitivity because the fold change is greater for deletions. Among the seven candidate genes, we found deletion CNVs only in PCSK7 and TTN.

TABLE 3 | Expression patterns of candidate genes in human embryonic hearts at different time points.

A total of 9 (9/78, 11.5%) patients had a duplication and 6 (6/78, 7.7%) patients had a deletion in PCSK7. PCSK7 (proprotein convertase subtilisin/kexin type 7) is a member of the subtilisin-like proprotein convertase family (Constam et al., 1996). Genetic regulation of PCSK expression, especially that of PCSK7, may act together with other genes to strongly influence blood pressure (Peloso et al., 2014; Turpeinen et al., 2015). A recent cardiovascular disease (CVD) network study using 1512 SNPs associated with 21 traits in genome-wide association studies showed that PCSK7 is closely connected to the incidence of CVD (Yao et al., 2015). In our study, 8 (8/78, 10.3%) patients were found to carry a deletion in TTN. TTN (titin) encodes the sarcomere protein titin. Its related pathways include dilated cardiomyopathy (DCM) and cardiac conduction (Hinson et al., 2015). A large body of literature suggests that the majority of familial and sporadic DCM cases carry rare variants in TTN (Herman et al., 2012; Ware et al., 2016). One study found that TTN and ANKRD1, an important pathogenic gene of TAPVC, could act together to cause DCM (Arimura et al., 2009). In addition, expression of TTN and PCSK7 was higher than that of the other genes in human embryonic hearts. Taken together, PCSK7 and TTN are novel candidate genes for TAPVC pathogenesis, but the underlying mechanism remains unclear.

Seven patients had duplications in SERHL and RRP7A, and six patients had duplications in SERHL2. SERHL, SERHL2, and RRP7A are all located on chromosome 22q13. The SERHL (serine hydrolase-like) mRNA contains an open reading frame of 311 amino acids that shows identity to a family of serine hydrolases (Sadusky et al., 2001). SERHL was found in tetralogy of Fallot patients and was associated with DNA methylation abnormalities (Serra-Juhe et al., 2015). SERHL2 also belongs to the serine hydrolase family, although its functional role is yet to be elucidated, and other nearby genes in the region, such as RRP7A, could also be biological candidates linked to 22q13 deletion syndrome (Okada et al., 2018). Patients with 22q13 duplication have been reported to have clinical diagnoses of cardiovascular abnormalities and intrauterine growth restriction (Chen et al., 2003; Rahikkala et al., 2013). The relationship between RRP7A, SERHL, and SERHL2 and TAPVC needs to be further validated experimentally. Thus far, the functions of these genes in cardiovascular development remain unknown, and they might be newly associated with TAPVC pathogenesis.

In our research, TARP was affected in more patients than any other gene: a duplication in TARP was detected in 13 (13/78, 16.7%) patients. TARP (TCR gamma alternate reading frame protein) is a marker for T cells and NKT cells and is uniquely expressed in males in prostate epithelial cells and prostate cancer cells (Littman et al., 1987). It has been reported to be a biomarker for viral myocarditis (Rowe et al., 2018). We also found that 6 (6/78, 7.7%) patients had a duplication in NBPF3. NBPF3 (NBPF member 3) is a member of the neuroblastoma breakpoint family (NBPF), which consists of dozens of recently duplicated genes primarily located in segmental duplications on human chromosome 1 (Vandepoele et al., 2005). NBPF3 is reported to be expressed in a variety of tissues (Vandepoele and van Roy, 2007).

Our study has several limitations. First, the lack of parental samples limited our ability to study the genetic backgrounds of the variants. Second, we lack prognostic information for the TAPVC cases. In addition, the functions of our candidate genes need to be further verified by fundamental research.

In summary, an effective analytical bioinformatics strategy allowed us to identify CNVs in novel genes that may play an important role in TAPVC pathology. Based on the results of CNV discovery in a case-control cohort, our study found evidence that CNVs of seven candidate genes (PCSK7, RRP7A, SERHL, TARP, TTN, SERHL2, and NBPF3) could contribute to the genetic etiology of TAPVC. Our candidate genes open new fields of investigation into TAPVC pathology and provide novel insights into pulmonary vein development.

#### AUTHOR CONTRIBUTIONS

SC conceived and designed the project, was responsible for the overall content, and revised the manuscript. XS, LC, WL, JW, and XJ performed bioinformatics analysis of CNV data. BC, JS, YX, QF, and YaL collected the clinical data. ZL, GL, and YuL carried out all experiments. XS and LC prepared the manuscript. All authors have seen and approved the final manuscript.

#### ACKNOWLEDGMENTS

We would like to acknowledge funding from the National Natural Science Foundation of China (81720108003, 81670285, and 81741066), a multi-center clinical research project of Shanghai Jiao Tong University (DLY201609), and Innovation Program of Shanghai Municipal Education Commission (15ZZ055).

#### REFERENCES

two siblings with distinctive dysmorphic features, heart defect and mental retardation. Eur. J. Med. Genet. 56, 389–396. doi: 10.1016/j.ejmg.2013.05.004


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer DZ declared a shared affiliation, with no collaboration, with several of the authors, XS, LC, XJ, BC, WL, JW, GL, YX, JS, QF, YaL, and SC, to the handling Editor at the time of review.

Copyright © 2018 Shi, Cheng, Jiao, Chen, Li, Liang, Liu, Wang, Liu, Xu, Sun, Fu, Lu and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# A Genome-Wide Study of Allele-Specific Expression in Colorectal Cancer

Zhi Liu<sup>1</sup> , Xiao Dong<sup>2</sup> \* and Yixue Li<sup>3,4,5</sup> \*

<sup>1</sup> Department of Epidemiology and Biostatistics, Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, School of Public Health, Nanjing Medical University, Nanjing, China, <sup>2</sup> Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, United States, <sup>3</sup> Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China, <sup>4</sup> Shanghai Center for Bioinformation Technology, Shanghai Industrial Technology Institute, Shanghai, China, <sup>5</sup> Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, China

#### Edited by:

Chuan Lu, Aberystwyth University, United Kingdom

#### Reviewed by:

Shihao Shen, University of California, Los Angeles, United States Younghee Lee, University of Utah, United States Yanfeng Zhang, HudsonAlpha Institute for Biotechnology, United States

#### \*Correspondence:

Xiao Dong, biosinodx@gmail.com; Yixue Li, yxli@sibs.ac.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 19 July 2018 Accepted: 06 November 2018 Published: 27 November 2018

#### Citation:

Liu Z, Dong X and Li Y (2018) A Genome-Wide Study of Allele-Specific Expression in Colorectal Cancer. Front. Genet. 9:570. doi: 10.3389/fgene.2018.00570

Accumulating evidence from small-scale studies has suggested that allele-specific expression (ASE) plays an important role in tumor initiation and progression. However, little is known about genome-wide ASE in tumors. In this study, we conducted a comprehensive analysis of ASE in individuals with colorectal cancer (CRC) on a genome-wide scale. We identified 5.4 thousand genome-wide ASEs of single nucleotide variations (SNVs) from tumor and normal tissues of 59 individuals with CRC. We observed an increased ASE level in tumor samples, and the ASEs were enriched in hotspots on the genome. Around 63% of the genes located there were previously reported to contain complex regulatory elements, e.g., human leukocyte antigen (HLA), or were implicated in tumor progression. Focussing on the allelic expression of somatic mutations, we found that 37.5% of them exhibited ASE, and genes harboring such somatic mutations were enriched in important pathways implicated in cancers. In addition, by comparing the expected and observed ASE events in tumor samples, we identified 50 tumor-specific ASEs, which may be related to somatic events in the regulatory regions of the genes and were significantly enriched in known cancer driver genes. By analyzing CRC ASEs from several perspectives, we provide a systematic understanding of how ASE is implicated in both tumor and normal tissues, which will be of critical value in guiding ASE studies in cancer.

Keywords: allele-specific expression, colorectal cancer, cis-regulatory variation, somatic mutation, tumor

#### BACKGROUND

Allele-specific expression (ASE) refers to the phenomenon that occurs in diploid or polyploid genomes, where two or more alleles of a gene have imbalanced expression (Kwaepila et al., 2006; Ge et al., 2009; Heap et al., 2010; Tung et al., 2011). It is common in both humans (Lo et al., 2003) and other organisms (Tung et al., 2011; Graze et al., 2012; Hasin-Brumshtein et al., 2014), and potentially contributes to multiple phenotypes and complex traits (Frazer et al., 2009). Because the two alleles of a gene in the same individual serve as intrinsic controls that reduce background genetic and environmental effects, ASE is also an accurate and sensitive marker for cis-regulatory variation (Pastinen, 2010). For example, an ASE can indicate a heterozygous variant within the translated region, resulting in a modified or truncated protein (Kukurba et al., 2014); at a regulatory site, it can cause differential binding of transcription factors or epigenetic modifiers (Prendergast et al., 2012; Reddy et al., 2012); or at a splice site or UTR, it can affect transcript processing (Li et al., 2012).
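To make the notion of allelic imbalance concrete, the sketch below applies an exact two-sided binomial test to reference and alternative allele read counts at a heterozygous site. This is a common generic approach and is an assumption here, not the ASE-calling method of this study; the function name and the read counts are hypothetical.

```python
from math import comb


def allelic_imbalance_p(ref_reads: int, alt_reads: int, p_ref: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance.

    Under balanced expression each read comes from the reference allele with
    probability p_ref. The p-value sums the probabilities of all outcomes that
    are no more likely than the observed one. Intended for moderate read depth,
    since very deep sites can exceed floating-point range in this naive form.
    """
    total = ref_reads + alt_reads
    probabilities = [
        comb(total, k) * p_ref ** k * (1 - p_ref) ** (total - k)
        for k in range(total + 1)
    ]
    observed = probabilities[ref_reads]
    # Small relative tolerance so ties with the observed outcome are included.
    p_value = sum(p for p in probabilities if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical read counts at two heterozygous sites.
print(allelic_imbalance_p(30, 10))  # skewed toward the reference allele
print(allelic_imbalance_p(21, 19))  # close to balanced
```

In practice, `p_ref` is often set slightly above 0.5 to account for reference mapping bias, and p-values are corrected for multiple testing across sites; both choices are left out of this sketch.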

Allele-specific expression is also frequently observed in tumors (Valle et al., 2008; Curia et al., 2012; Walker et al., 2012; Wei et al., 2013). ASE was first proposed as a direct approach for connecting a genotype to disease susceptibility in 2002 (Yan et al., 2002). In 2013, ASE at the death-associated protein kinase 1 (DAPK1) gene locus, detected using single-nucleotide primer extension (SNuPE) and MALDI-TOF mass spectrometry, was found to potentially predispose to chronic lymphocytic leukemia (CLL) (Wei et al., 2013). In colorectal cancer (CRC), a decrease in expression of one allele of the adenomatous polyposis coli (APC) tumor suppressor gene leads to the development of familial adenomatous polyposis (Curia et al., 2012). In addition to APC, ASE of transforming growth factor beta receptor 1 (TGFBR1), which leads to reduced expression of the gene, can also cause an increased risk of CRC (Valle et al., 2008). Beyond its association with cancer risk, ASE also affects the prognosis and outcome of cancer patients. For example, the monoallelic expression of TP53 and IDH1 was found to determine the oncogenic progression and survival in brain tumors (Walker et al., 2012).

With the development of large-scale transcriptome sequencing, systematic analysis of ASE in the transcriptome became possible at single-nucleotide resolution (Tuch et al., 2010; Smith et al., 2013). To date, several studies have reported genome-wide ASE in humans, mice, and cell lines, and identified hundreds of genes exhibiting ASE (Heap et al., 2010; Smith et al., 2013; Hasin-Brumshtein et al., 2014). However, little is known about genome-wide patterns of ASE in tumor tissues. In this study, we carried out an ASE study in a cohort of 59 patients with CRC (Seshagiri et al., 2012) and revealed the comprehensive landscape of ASE in CRC patients.

# MATERIALS AND METHODS

#### Data Preprocessing

RNA and exon sequencing data of matched human colorectal tumor-normal samples were downloaded from the European Genome-Phenome Archive (EGA) under accession number EGAS00001000288 (Seshagiri et al., 2012). Fifty-nine pairs of correctly processed samples were retained for ASE analysis.

Quality-controlled DNA and RNA sequencing data were mapped with bowtie2 (Langmead and Salzberg, 2012) with default parameters to report the best alignment. The base qualities were then recalibrated using the procedure recommended by GATK (DePristo et al., 2011).

Somatic mutations were called with both Mutect (Cibulskis et al., 2013) and Varscan (Koboldt et al., 2009), and the intersection was considered a reliable result and used for the following analysis. Germline SNVs were called using the GATK best practices from DNA sequencing data, and filtered using the following four criteria to obtain a final SNV list ready for ASE analysis.


Allele counts for each germline SNV and somatic mutation in the DNA and RNA sequencing data were generated with SAMtools (Li et al., 2009) in pileup format.

The lists of germline SNVs and somatic mutations, together with the corresponding pileup files, were each submitted separately to cisASE for ASE identification.

#### ASE Identification

Allele-specific expression SNVs and genes were identified by the cisASE pipeline (Liu et al., 2016). SNVs with a coverage of less than 10 in RNA or DNA sequencing data were filtered. SNVs or genes with a log likelihood ratio (LLR) value more than the LLR cutoff, at a significance level of 0.01, were defined as ASE. In addition, genes with a heterogeneity p-value less than 0.05, which indicates inconsistent ASE status of SNVs within the gene, were excluded from further analysis.
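The coverage filter and likelihood-ratio decision described above can be illustrated with a minimal sketch. This is not the cisASE implementation: it assumes a simple binomial model in which the null allele fraction is taken from the DNA read counts, and the function names (`binom_loglik`, `ase_llr`, `is_ase`) and the `llr_cutoff` parameter are hypothetical. The text only states that the cutoff corresponds to a significance level of 0.01, so it is left as an input here.

```python
import math

MIN_COVERAGE = 10  # coverage threshold stated in the text


def binom_loglik(k, n, p):
    """Binomial log-likelihood of k reference reads out of n (constant term dropped)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite at p = 0 or p = 1
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def ase_llr(rna_ref, rna_alt, dna_ref, dna_alt):
    """Return a log likelihood ratio for one SNV, or None if coverage is too low."""
    rna_n = rna_ref + rna_alt
    dna_n = dna_ref + dna_alt
    if rna_n < MIN_COVERAGE or dna_n < MIN_COVERAGE:
        return None  # SNV filtered out
    p_null = dna_ref / dna_n  # allele fraction expected from DNA (assumed null model)
    p_alt = rna_ref / rna_n   # maximum-likelihood allele fraction in RNA
    return binom_loglik(rna_ref, rna_n, p_alt) - binom_loglik(rna_ref, rna_n, p_null)


def is_ase(llr, llr_cutoff):
    """An SNV is called ASE when its LLR exceeds the (externally supplied) cutoff."""
    return llr is not None and llr > llr_cutoff
```

Using the DNA allele fraction as the null, rather than a fixed 0.5, is a design choice of this sketch: in tumor tissue, copy-number changes can shift the expected ratio away from 0.5.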

#### Identifying ASE Hotspots

Allele-specific expression counts and frequencies were calculated in consecutive sliding windows of fixed size along the genome. We randomly assigned ASE labels to the SNVs across chromosomes according to the total ASE frequency. By repeating this process 1000 times, we obtained a null distribution of ASE density in each of the sliding windows. For each window, a p-value was derived by counting the number of permutations in which the number of ASEs in the window exceeded the observed number, adjusted with add-one smoothing. These p-values were then corrected for multiple testing using the Benjamini-Hochberg method.
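The permutation procedure can be sketched as follows, under stated assumptions. The labels are shuffled so that the total number of ASE SNVs stays fixed, which is one reading of "according to the total ASE frequency". A permutation is counted when its window count is greater than or equal to the observed count, which is the conventional form of the test. All function and parameter names are hypothetical, and SNVs are treated as lying on a single chromosome for brevity.

```python
import random


def window_counts(positions, labels, starts, size):
    """Count ASE-labelled SNVs falling in each window [start, start + size)."""
    return [
        sum(1 for pos, lab in zip(positions, labels) if lab and s <= pos < s + size)
        for s in starts
    ]


def permutation_pvalues(positions, labels, starts, size, n_perm=1000, seed=0):
    """Empirical per-window p-values with add-one smoothing."""
    rng = random.Random(seed)
    observed = window_counts(positions, labels, starts, size)
    exceed = [0] * len(starts)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign ASE labels; total ASE count is preserved
        null = window_counts(positions, shuffled, starts, size)
        for i, (n, o) in enumerate(zip(null, observed)):
            if n >= o:
                exceed[i] += 1
    # add-one smoothing keeps every p-value strictly above zero
    return [(e + 1) / (n_perm + 1) for e in exceed]
```

The resulting p-values would then be passed to a multiple-testing correction such as Benjamini-Hochberg before windows are selected.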

#### Group of ASE Somatic Mutation

We mapped the ASE somatic mutations to genes and then classified the genes into two categories, i.e., genes with over-expressed mutant alleles and genes with under-expressed mutant alleles. Genes harboring multiple somatic mutations with conflicting mutant allele expression were excluded from the following analysis. Gene expression profiles were generated with tophat2 (Kim D. et al., 2013) and cufflinks (Trapnell et al., 2012). For genes in each category, we compared the FPKM values in the tumor and normal tissues of patients with the somatic mutation, and used FPKM fold changes of 2 and 1/2 as the thresholds for up-regulated and down-regulated expression in tumor samples, respectively. This resulted in three groups for each category according to the gene expression.
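A minimal sketch of this two-by-three grouping is given below. The group letters follow the description of Groups a to f in the Results; the inclusive thresholds and the small pseudocount guarding against division by zero are assumptions of this sketch, and the names `expression_change` and `classify_gene` are hypothetical.

```python
# Group letters inferred from the Results text: a/b/c for over-expressed mutant
# alleles, d/e/f for under-expressed mutant alleles.
GROUPS = {
    ("over", "up"): "a",
    ("over", "unchanged"): "b",
    ("over", "down"): "c",
    ("under", "up"): "d",
    ("under", "unchanged"): "e",
    ("under", "down"): "f",
}


def expression_change(tumor_fpkm, normal_fpkm, pseudocount=0.01):
    """Label tumor-versus-normal expression using fold-change thresholds of 2 and 1/2."""
    fold = (tumor_fpkm + pseudocount) / (normal_fpkm + pseudocount)
    if fold >= 2.0:
        return "up"
    if fold <= 0.5:
        return "down"
    return "unchanged"


def classify_gene(mutant_allele, tumor_fpkm, normal_fpkm):
    """mutant_allele is 'over' or 'under' (mutant allele expression relative to wild type)."""
    return GROUPS[(mutant_allele, expression_change(tumor_fpkm, normal_fpkm))]
```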

#### Identifying Somatic ASE Genes

fgene-09-00570 November 24, 2018 Time: 16:25 # 3

For each gene, we counted the number of ASE somatic events (s<sub>i</sub>) and the number of total tested pairs (t<sub>i</sub>) in a population of 118 individuals. An ASE somatic event is a case in which the gene shows ASE in a tumor sample but not in the matched normal sample. A tested pair is a sample pair in which the gene is tested in both the tumor sample and the matched normal sample. The expected ASE somatic event rate was calculated by the following equation,

$$f = \frac{\sum_{i=1}^{n} s_i}{\sum_{i=1}^{n} t_i}$$

where n is the number of genes.

The expected number of ASE somatic events for each gene was calculated as the product of the total tested pairs and the expected ASE somatic event rate (f × t<sub>i</sub>). A p-value was obtained for each gene from the Poisson distribution as the probability of observing at least as many ASE somatic events as were actually observed, given the expected number (P[X ≥ x]). These p-values were corrected for multiple testing using the Benjamini-Hochberg method, and genes with a corrected p-value <0.05 were called somatic ASE genes.
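The calculation above can be sketched in a few lines. This is an illustrative reading of the text rather than the authors' code: the pooled rate is the ratio of summed events to summed tested pairs, the Poisson upper tail is evaluated directly, and the Benjamini-Hochberg adjustment is written out by hand. All function names are hypothetical.

```python
import math


def poisson_upper_tail(x, lam):
    """P[X >= x] for X ~ Poisson(lam)."""
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    cdf = sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)) for k in range(x))
    return max(0.0, 1.0 - cdf)


def somatic_ase_pvalues(s, t):
    """s[i]: ASE somatic events for gene i; t[i]: tested tumor-normal pairs for gene i."""
    f = sum(s) / sum(t)  # pooled expected ASE somatic event rate
    return [poisson_upper_tail(si, f * ti) for si, ti in zip(s, t)]


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below 0.05 would be the candidates called somatic ASE genes.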

# RESULTS

#### Increased ASE Level in Tumor Samples

We identified SNV-level and gene-level ASEs in both normal and tumor tissues of 59 CRC patients with our previously developed pipeline for ASE identification (Liu et al., 2016; **Figure 1** and **Supplementary Table S1**). The major steps included sequence alignment, variant calling, ASE detection using cisASE (Liu et al., 2016), and further filtration (see section "Materials and Methods" for details). We detected 431 (SD = 133.3) SNV-level ASEs per normal tissue and 477 (SD = 181.6) per tumor tissue, and 137 (SD = 39.3) and 216 (SD = 108.5) gene-level ASEs per normal and tumor tissue, respectively (**Supplementary Table S1**). The frequency of ASE SNVs (the ratio of the number of ASE SNVs to the number of non-ASE SNVs) in normal tissue is in agreement with previous studies (Zhang et al., 2009; Skelly et al., 2011).

We compared the proportion of sites exhibiting ASE in tumors with that in the matched normal tissues. On average, 20.0% of the SNVs in tumor samples and 16.8% in normal samples exhibited ASE (two-tailed paired t-test, p-value = 7.1e−09), indicating a significantly higher ASE level in tumor samples than in normal samples. When testing only the SNVs identified in both tumor and normal tissues, the result was the same, i.e., a significantly higher proportion of ASE in tumor samples (21.6%) than in normal samples (18.1%; two-tailed paired t-test, p-value = 1.2e−04; **Figure 2A**).

For each tumor-normal pair, we found that 68% of the ASEs are either normal specific (29%) or tumor specific (39%) (**Figure 2B**). The remaining 32% of the ASEs are shared by both the tumor and normal samples (**Figure 2B**), and most of these had the same ASE direction. This indicates that the majority of ASEs (about 2/3) are dynamic in tumorigenesis, while the other 1/3 are consistent.

Next, we identified genes with recurrent ASE events in tumor and normal samples. There were 94 and 95 genes with ASE events in at least 20% of the tumor and normal samples, respectively, of which 63 genes were shared by both tumor and normal samples (common ASE genes) (**Supplementary Table S2** and **Figure 3**). The allele ratio of recurrent ASE genes was significantly segregated from the background and the total pool of ASE genes (**Supplementary Figure S1**), and the average major haplotype allele ratio of common ASE genes was 0.92.

The ASE genes most recurrently observed in both tumor and normal samples had a high allele imbalance, such as AP3P1, BCLAF1, STED8, PRIM2, IL32, SEC22, and MAP2K3 (**Figure 3A**). The recurrent ASE genes in tumor samples include chromosome-associated kinesin KIF4B and spindle and kinetochore associated complex subunit 3 (SKA3), among others. We also found that the ASE of TP53 was specifically and recurrently observed in tumor samples (observed in 12 tumor samples and 1 normal sample). There were 32 ASE genes that were recurrent only in normal samples. For example, PYY, CD177, PEG3, and FAM83D were observed in more than 11 normal samples but in fewer than 3 tumor samples.

#### The ASE Hotspots in the Normal and Tumor Genome

Variants in cis-regulatory elements tend to affect the expression of one or more nearby genes (Pastinen, 2010), and if the variant is heterozygous, the genes regulated by it will exhibit ASE. We therefore prioritized such variants by identifying clusters of ASEs in genomic regions. We calculated the ASE density and frequency in the tumor and normal samples using a sliding window approach with a window size of 100 kbp and a step size of 10 kbp. Windows with an adjusted p-value <0.05 were kept, and overlapping windows were manually checked and merged to obtain focal hotspot regions.
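The merging of overlapping significant windows was done manually in this study. As an illustration only, an automated first pass could look like the sketch below, which assumes all window starts lie on the same chromosome and uses the 100 kbp window size from the text. The function name `merge_windows` is hypothetical.

```python
def merge_windows(starts, size=100_000):
    """Merge overlapping or touching significant windows into (start, end) regions."""
    regions = []
    for s in sorted(starts):
        e = s + size
        if regions and s <= regions[-1][1]:
            # window overlaps the current region: extend the region's end
            regions[-1] = (regions[-1][0], max(regions[-1][1], e))
        else:
            regions.append((s, e))
    return regions
```

With a 10 kbp step, three consecutive significant windows starting at 0, 10,000, and 20,000 would collapse into a single candidate region spanning 0 to 120,000, which could then be inspected manually.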

We identified 32 ASE hotspots in normal samples (**Supplementary Table S3**) and 27 in tumor samples (**Supplementary Table S4**), affecting a total of 57 genes (**Supplementary Figure S2**). These hotspots accounted for 4.0% (723 out of 17866) and 4.4% (748 out of 17866) of the ASE SNVs identified in normal and tumor samples, respectively. There were 21 genes located within hotspots identified in both normal and tumor samples, as well as 22 and 14 genes located within hotspots specific to tumor and normal samples, respectively. Because differential expression between tumor and normal tissues might result in different power of ASE detection, we checked the expression of all these genes in tumor and normal samples (**Supplementary Table S5**). We found no significant difference in the tumor-to-normal FPKM ratio among the three groups of genes (Kruskal–Wallis test p-value = 0.07), indicating that the differences between the ASE hotspots in tumor and normal samples did not result from different detection power due to expression differences. In addition, one gene with relatively low expression (PRSS1, FPKM < 0.1) was excluded (**Figure 4**).

To investigate the biological processes affected by ASE, we conducted GO and KEGG enrichment analyses for the ASE genes. The ASE genes shared by tumor and normal tissues were significantly enriched in antigen processing. The significantly enriched GO and KEGG terms of tumor-specific ASE genes were closely associated with immune activity. However, the normal ASE genes were not enriched in specific functions. The results raise the possibility that ASE plays a role in maintaining regular immune activities, and that excessive ASE events are activated in tumor tissues. Among the genes shared by normal and tumor tissues were several that are highly polymorphic or that were reported to be related to cancer predisposition or progression, such as the human leukocyte antigen (HLA) genes on chromosome 6, members of the mucin gene family (MUC) on chromosome 3, and the MAP2K3 and CDC27 loci on chromosome 7, which are involved in cell proliferation and the cell cycle. Three members of the carcinoembryonic antigen (CEA) family (CEACAM5, CEACAM6, and CEACAM7) were also included.

Although 35.6% (21 out of 59) of the ASE genes were common to both normal and cancer tissues, a change in the allele ratio of an ASE can also lead to phenotypic diversity. For example, one study reported that the proportion of the JAK2 V617F mutant allele at the RNA level is significantly associated with distinct subtypes of BCR/ABL-negative myeloproliferative neoplasms (MPNs) (Kim H.R. et al., 2013). Therefore, we tested whether there are similar cases among the ASE genes shared between tumor and normal tissues. We found that four out of the 21 shared ASE genes (HLA-A, HLA-B, HLA-C, and CEACAM7) showed significant differences in the allele ratios between tumor and normal tissues (paired t-test; **Supplementary Table S6**). Three of the four genes belong to the HLA family, i.e., HLA-A, HLA-B, and HLA-C, and all revealed a lower allelic heterozygosity in tumor tissues (**Figure 5**). Loss of heterozygosity (LOH) of the HLA loci was reported in many cancers (Maleno et al., 2002; Wang et al., 2006; Zollikofer et al., 2014). In our case, these loci are heterozygous at the DNA level, while at the mRNA level, one of the copies showed a significantly reduced expression compared with the other. The results suggest the possibility that in tumor tissues, allele-specific regulation at the transcriptional level may lead to a consequence similar to that of LOH.

The other 18 shared ASE genes showed no difference in the allele ratio between normal and tumor tissues (**Supplementary Figure S3**), indicating that most of the shared ASEs are conserved during tumorigenesis. However, because the normal tissues we studied were obtained from CRC patients, ASE genes shared by tumor and normal tissues can be involved either in normal physiological functions or associated with tumor predisposition. Since it is hard to obtain gut tissue samples from healthy people, we cannot distinguish these two possibilities.

Of the twenty-two genes located in the tumor-specific hotspots (**Supplementary Table S4**), several were reported to play an important role in tumor progression. For example, over-expression of FAT1 was observed in different tumors, including DCIS breast cancer (Kwaepila et al., 2006), melanoma (Sadeqzadeh et al., 2011), and leukemia (de Bock et al., 2012). MKI67 is widely used as a marker for cell proliferation, and its increased expression in human cancer specimens generally denotes an aggressive phenotype (Cidado et al., 2016). The observed allele-specific expression of these genes may help to prioritize the underlying mechanisms that contribute to their abnormal expression in tumors. Furthermore, 14 genes (ACSF3, AHNAK, APOBR, CCBL2, CLN3, EPPK1, FAM104B, FUT2, HLA-DRA, HLA-G, MUC12, NBPF1, RASIP1, RBMXL1, and SLC25A5) were located in the normal-specific hotspots (**Supplementary Table S3**), which suggests that precise control of ASE may be important for maintaining the normal function of cells. These results might provide opportunities for mining new therapeutic targets.

#### Overexpressed Allele With Somatic Mutations in Tumors

Somatic mutations (missense and nonsense mutations) within the coding region may lead to abnormal protein products. However, the impact of a heterozygous coding SNV depends on whether the SNV-containing allele is transcribed into RNA. In addition, clinical therapy selection for targeted drugs often assays mutations using DNA as the analyte, such as KRAS assays designed to identify responders to anti-EGFR therapy (Allegra et al., 2009). If the mutant allele is selectively lost in the transcript, the mutation may not have a therapeutic impact, and the merit of using a DNA-based assay for clinical decision-making may be questionable. These are the major reasons for us to further analyze the allelic expression of somatic mutations in tumors. A genome-wide study in mouse tumor cell lines reported that mutations are transcribed in proportion to their DNA allele frequency (Castle et al., 2014). However, to our knowledge, a genome-wide study of the relationship between DNA and RNA mutation allele frequency in tumor tissues has not been done.

We found that 37.5% of the 2,754 somatic mutations exhibit ASE in the colorectal tissues (**Figure 6A**), which is more than twice the rate for germline polymorphisms (18%) (**Figure 6B**). This indicates a significant imbalance of allelic expression for somatic mutations. Furthermore, 78% of the ASE somatic mutations over-expressed the mutant allele, compared with 52% for germline polymorphisms (chi-square test p-value <2.2e−16). The results reveal that gene copies with somatic mutations are preferentially expressed over the wild-type copies in tumor tissues.

Next, we explored the functional significance of the ASE somatic mutations with different mutant/wild-type allele expression patterns. We mapped the ASEs to genes and classified them into six groups according to the alteration of both mutant allele expression and total gene expression in tumor tissues (**Figure 7** and **Supplementary Table S7**; see section "Materials and Methods").

Ideally, if an ASE somatic mutation is functional, the direction of the mutant allele expression change should be the same as the direction of the gene expression change in tumor cells compared with normal cells (Groups a and f in **Figure 7**). Genes that exhibited an ASE somatic mutation but an unchanged total gene expression (Groups b and e in **Figure 7**) might be regulated by other trans-regulatory factors, so that the effects of the ASE were buffered. Genes in which the direction of mutant allele expression conflicted with the direction of tumor gene expression (Groups c and d in **Figure 7**) were possibly artifacts, or the ASE was a random event without functional significance.

FIGURE 4 | Gene expression in tumor and normal samples for genes located in ASE hotspots.

As expected, only the genes in Groups a and f were significantly enriched in KEGG pathways (Du et al., 2014; **Table 1**). Group a, which contains genes over-expressing the mutant allele and showing an up-regulated gene expression level in tumor samples compared with the matched normal samples, is enriched in the DNA replication and mismatch repair pathways. Dysfunction of the DNA replication and DNA mismatch repair pathways is implicated in many cancer types (Boyer et al., 2016; Puigvert et al., 2016), where it initiates cancer or promotes cancer cell proliferation (Padmanabhan et al., 2004; Dudderidge et al., 2007). The average mutant allele fraction for the genes enriched in these two pathways is 80%, indicating a widely over-expressed mutant allele. This suggests that in tumor tissues, genes involved in the DNA replication and DNA mismatch repair pathways tend to selectively express mutant proteins with abnormal functions, which may compete with normal proteins to disrupt normal signaling pathways, or decrease the dosage of normal proteins available for normal functions.

Group f, which contains genes with under-expressed mutant alleles and down-regulated gene expression in tumor samples compared with the matched normal samples, is enriched in the focal adhesion signaling pathway. The genes enriched in the focal adhesion pathway showed low mutant allele fractions, only 10% of the two alleles, suggesting that mutation-containing alleles are effectively silenced by epigenetic and chromatin modification mechanisms (Jaenisch and Bird, 2003) or that mutation-containing transcripts are degraded by activated RNA surveillance mechanisms (Rehwinkel et al., 2006), resulting in an overall decrease of gene expression levels and thus an abnormal signaling pathway.

#### Somatic ASE Genes Are Enriched in Known Cancer-Related Genes

Genes specifically exhibiting ASE in cancer tissues are likely linked to somatic variations in regulatory regions. In order to detect genes with an excess of somatic cis-regulatory events, we used matched tumor and normal samples to identify genes specifically and significantly exhibiting ASE in tumor samples (which we defined as "somatic ASE genes"). We found 50 somatic ASE genes (**Supplementary Table S8**), which were significantly enriched for TCGA pan-cancer drivers (Gonzalez-Perez et al., 2013) (five pan-cancer drivers, p-value = 0.010) and CRC drivers (Gonzalez-Perez et al., 2013) (two CRC drivers, p-value = 0.04). This indicates that the tumor-specific ASE gene analysis captures known cancer genes and has the potential to be a complementary method for driver detection. Next, we compared the somatic ASE genes with differentially expressed genes (DEGs) between tumor and normal samples (**Supplementary Table S9**), and found that the somatic ASE genes were significantly enriched in DEGs (Fisher's exact test p-value = 5.0e−07, odds ratio = 3.22).
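The enrichment tests reported above are based on 2×2 contingency tables. As a hedged illustration, the sketch below computes the odds ratio and a one-sided Fisher's exact p-value from hypergeometric probabilities. The text does not state whether the reported p-values are one-sided or two-sided, nor the underlying counts, so the function name `fisher_enrichment`, the table layout, and the one-sided choice are all assumptions.

```python
import math


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    a: somatic ASE genes that are DEGs      b: somatic ASE genes that are not DEGs
    c: other genes that are DEGs            d: other genes that are not DEGs
    Returns (odds_ratio, p_value).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = math.comb(n, row1)
    # probability of seeing a or more overlaps under the hypergeometric null
    tail = sum(
        math.comb(col1, k) * math.comb(n - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return odds_ratio, tail / denom
```

The same function would apply to the driver-gene comparisons by replacing the DEG column with membership in the driver gene list.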

# DISCUSSION

ASE is a measure of the effect of genetic variants on gene expression that does not require any assumption about the genetic structure of the population studied, and is hence a direct measurement of how gene expression changes at the individual level (Yan et al., 2002). The development of next generation sequencing technologies and our unbiased computational method cisASE (Liu et al., 2016) have enabled us to characterize the genome-wide landscape of ASE in tumor and normal tissues of CRC patients from diverse perspectives.

The higher incidence of ASE in tumor samples than in normal samples is consistent with the fact that gene expression in tumor cells is under more complicated cis-regulation (Maurano et al., 2012). Furthermore, 29 and 39% of the ASE SNVs were specific to normal and cancer samples, respectively, indicating both the gain and loss of cis-regulatory variation as possible contributors to tumor initiation or development. We also observed a high percentage (32%) of ASEs shared by normal and tumor tissues of patients, which might be a mixture of CRC predisposition sites and sites where ASE plays a role in maintaining regular cellular function. Since it is difficult to obtain gut tissue samples from healthy people to distinguish these two categories of ASE, some researchers suggest using blood samples from healthy people (Valle et al., 2008). However, cis-regulatory variation is a tissue-dependent feature, and so is ASE (GTEx Consortium, 2015); therefore, using a different tissue as a control might result in high false discovery rates. Creative and accurate methods are needed to distinguish cancer risk sites from regular sites.

By summarizing ASE in a region-based fashion, we identified ASE hotspots under true and recurrent cis-regulation in the studied samples. Although the majority of the ASE hotspots, including the HLA loci, were shared by both normal and tumor tissues, three of the HLA genes (HLA-A, HLA-B, and HLA-C) revealed a significantly lower heterozygosity in the tumor tissues compared with normal tissues. LOH in the short arm of chromosome 6 is the most frequent mechanism contributing to HLA haplotype loss in human cancer, which is a tumor escape mechanism from the host's immune surveillance system (So et al., 2005). The selective expression of one allele of an HLA gene might be another mechanism that contributes to HLA haplotype loss in cancer.

One category of targeted drugs targets specific genes with or without certain somatic mutations, such as osimertinib targeting EGFR (with the EGFR T790M mutation) and afatinib targeting EGFR (with EGFR L858R mutations) in non-small cell lung cancer, vemurafenib targeting BRAF (with the BRAF V600 mutation) in melanoma, and panitumumab targeting EGFR (with wild-type KRAS) in CRC. A DNA assay is usually used to test whether the target gene carries the specific mutation. However, expression at the RNA level is not necessarily a faithful replication of the DNA. We found that 38% of the somatic mutations exhibited ASE, indicating that DNA-assay-based therapy selection might be problematic. Somatic mutations whose mutant allele expression followed the same direction as the total gene expression, i.e., Groups a and f, were enriched in important signaling pathways involved in tumor initiation and progression. Mutations belonging to the other groups, although not significantly enriched in KEGG pathways, may also have biological implications, since we cannot exclude the possibility that, in some cases, homeostatic or feedback mechanisms act to constrain the total expression so that an imbalance in allelic expression does not change the total output.

Somatic ASE genes are presumably regulated by cis-regulatory elements carrying somatic variations, which may be driver mutations implicated in cancers. The fact that the identified somatic ASE genes were enriched for pan-cancer and CRC driver genes supports this speculation.

In this study, we focused on the ASE of protein-coding regions. However, in recent years, lncRNAs were reported to be involved in gene regulation and other cellular processes (Quinn and Chang, 2016). With an ASE analysis, Almlof et al. (2014) found that 22.9% (258 out of 1122) of intergenic lncRNAs were regulated by cis-rSNPs in human primary monocytes, which is comparable to our analysis. Although the number of lncRNAs exceeds that of protein-coding genes, their much lower expression (Iyer et al., 2015) means that a higher sequencing depth and more sensitive detection are required to quantify ASE in lncRNAs efficiently.

#### CONCLUSION

By applying ASE analysis to CRC patients, we found a higher incidence of ASE in tumor tissues, which implies more complicated cis-regulation in tumors. ASEs under recurrent cis-regulation were enriched as hotspots on the genome, and the majority of the genes (∼63%) located in the hotspots were previously reported to have complex regulatory elements or were implicated in tumor progression. In addition, the ASE analysis of somatic mutations revealed a significantly increased ASE rate for somatic mutations, and genes harboring such somatic mutations were enriched in important pathways implicated in CRC (DNA replication and focal adhesion). Furthermore, the somatic ASE gene analysis captures known cancer genes.

In summary, this study provides a systematic understanding of how the ASE is implicated in tumors and a schema of the application of the ASE studies in patients with cancerous tumors.

### DATA AVAILABILITY

The datasets supporting the conclusions of this article are included within the article and its additional files. Raw RNA and Exon sequencing data were downloaded from the European Genome-Phenome Archive (EGA) under accession number EGAS00001000288 by proper application.

#### AUTHOR CONTRIBUTIONS

ZL and XD conceived the study and wrote the manuscript. ZL carried out all the analysis in this study. YL supervised the study and revised the manuscript. All authors read and approved the final manuscript.

## FUNDING

This work was supported by the National Key R&D Program of China (2016YFC0901704, 2017YFA0505500, 2017YFC0907505, and 2017YFC0908405).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00570/full#supplementary-material

FIGURE S1 | The allele ratio of recurrent ASE genes was significantly segregated from the background and the total pool of ASE genes.

FIGURE S2 | ASE frequency across chromosomes.

FIGURE S3 | Comparison of the allele ratio of 18 shared ASE genes between normal and tumor tissues.

TABLE S1 | SNV and gene levels of ASE in the normal and tumor tissues.

TABLE S2 | Genes with recurrent ASE events in tumor and normal samples.

TABLE S3 | ASE Hotspots in normal.

TABLE S4 | ASE Hotspots in Cancer.

TABLE S5 | Expression for genes in hotspots region.

TABLE S6 | Differences in allele ratios between tumor and normal tissues.

TABLE S7 | Allele ratio and FPKM for somatic ASE and involved genes.

TABLE S8 | Somatic ASE genes.

TABLE S9 | Gene expression of somatic ASE genes in tumor and normal samples.

#### REFERENCES


Allegra, C. J., Jessup, J. M., Somerfield, M. R., Hamilton, S. R., Hammond, E. H., Hayes, D. F., et al. (2009). American society of clinical oncology provisional clinical opinion: testing for KRAS gene mutations in patients with metastatic colorectal carcinoma to predict response to anti-epidermal growth factor receptor monoclonal antibody therapy. J. Clin. Oncol. 27, 2091–2096. doi: 10.1200/Jco.2009.21.9170

Almlof, J. C., Lundmark, P., Lundmark, A., Ge, B., Pastinen, T., Cardiogenics Consortium, et al. (2014). Single nucleotide polymorphisms with cis-regulatory effects on long non-coding transcripts in human primary monocytes. PLoS One 9:e102612. doi: 10.1371/journal.pone.0102612


Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., et al. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219. doi: 10.1038/nbt.2514

Cidado, J., Wong, H. Y., Rosen, D. M., Cimino-Mathews, A., Garay, J. P., Fessler, A. G., et al. (2016). Ki-67 is required for maintenance of cancer stem cells but not cell proliferation. Oncotarget 7:6281. doi: 10.18632/oncotarget.7057

Curia, M. C., De Iure, S., De Lellis, L., Veschi, S., Mammarella, S., White, M. J., et al. (2012). Increased variance in germline allele-specific expression of APC associates with colorectal cancer. Gastroenterology 142, 71.e1–77.e1. doi: 10.1053/j.gastro.2011.09.048

de Bock, C. E., Ardjmand, A., Molloy, T. J., Bone, S. M., Johnstone, D., Campbell, D. M., et al. (2012). The Fat1 cadherin is overexpressed and an independent prognostic factor for survival in paired diagnosis-relapse samples of precursor B-cell acute lymphoblastic leukemia. Leukemia 26, 918–926. doi: 10.1038/leu.2011.319

DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498. doi: 10.1038/ng.806

Du, J. L., Yuan, Z. F., Ma, Z., Song, J., Xie, X., and Chen, Y. (2014). KEGG-PATH: kyoto encyclopedia of genes and genomes-based pathway analysis using a path analysis model. Mol. Biosyst. 10, 2441–2447. doi: 10.1039/c4mb00287c

Dudderidge, T. J., McCracken, S. R., Loddo, M., Fanshawe, T. R., Kelly, J. D., Neal, D. E., et al. (2007). Mitogenic growth signalling, DNA replication licensing, and survival are linked in prostate cancer. Br. J. Cancer 96, 1384–1393. doi: 10.1038/sj.bjc.6603718

Frazer, K. A., Murray, S. S., Schork, N. J., and Topol, E. J. (2009). Human genetic variation and its contribution to complex traits. Nat. Rev. Genet. 10, 241–251. doi: 10.1038/nrg2554

Ge, B., Pokholok, D. K., Kwan, T., Grundberg, E., Morcos, L., Verlaan, D. J., et al. (2009). Global patterns of cis variation in human cells revealed by high-density allelic expression analysis. Nat. Genet. 41, 1216–1222. doi: 10.1038/ng.473

Gonzalez-Perez, A., Perez-Llamas, C., Deu-Pons, J., Tamborero, D., Schroeder, M. P., Jene-Sanz, A., et al. (2013). IntOGen-mutations identifies cancer drivers across tumor types. Nat. Methods 10, 1081–1082. doi: 10.1038/Nmeth.2642

Graze, R. M., Novelo, L. L., Amin, V., Fear, J. M., Casella, G., Nuzhdin, S. V., et al. (2012). Allelic imbalance in Drosophila hybrid heads: exons, isoforms, and evolution. Mol. Biol. Evol. 29, 1521–1532. doi: 10.1093/molbev/msr318

GTEx Consortium (2015). The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660. doi: 10.1126/science.1262110

Hasin-Brumshtein, Y., Hormozdiari, F., Martin, L., van Nas, A., Eskin, E., Lusis, A. J., et al. (2014). Allele-specific expression and eQTL analysis in mouse adipose tissue. BMC Genomics 15:471. doi: 10.1186/1471-2164-15-471

Heap, G. A., Yang, J. H., Downes, K., Healy, B. C., Hunt, K. A., Bockett, N., et al. (2010). Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum. Mol. Genet. 19, 122–134. doi: 10.1093/hmg/ddp473

Iyer, M. K., Niknafs, Y. S., Malik, R., Singhal, U., Sahu, A., Hosono, Y., et al. (2015). The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208. doi: 10.1038/ng.3192

Jaenisch, R., and Bird, A. (2003). Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat. Genet. 33(Suppl.), 245–254. doi: 10.1038/ng1089

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S. L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14:R36. doi: 10.1186/Gb-2013-14-4-R36

Kim, H. R., Choi, H. J., Kim, Y. K., Kim, H. J., Shin, J. H., Suh, S. P., et al. (2013). Allelic expression imbalance of JAK2 V617F mutation in BCR-ABL negative myeloproliferative neoplasms. PLoS One 8:e52518. doi: 10.1371/journal.pone.0052518

Koboldt, D. C., Chen, K., Wylie, T., Larson, D. E., McLellan, M. D., Mardis, E. R., et al. (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25, 2283–2285. doi: 10.1093/bioinformatics/btp373

Kukurba, K. R., Zhang, R., Li, X., Smith, K. S., Knowles, D. A., Tan, M. H., et al. (2014). Allelic expression of deleterious protein-coding variants across human tissues. PLoS Genet. 10:e1004304. doi: 10.1371/journal.pgen.1004304

Kwaepila, N., Burns, G., and Leong, A. S. (2006). Immunohistological localisation of human FAT1 (hFAT) protein in 326 breast cancers. Does this adhesion molecule have a role in pathogenesis? Pathology 38, 125–131. doi: 10.1080/00313020600559975

Langmead, B., and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359. doi: 10.1038/Nmeth.1923

Li, G., Bahn, J. H., Lee, J. H., Peng, G., Chen, Z., Nelson, S. F., et al. (2012). Identification of allele-specific alternative mRNA processing via transcriptome sequencing. Nucleic Acids Res. 40:e104. doi: 10.1093/nar/gks280

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al. (2009). The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079. doi: 10.1093/bioinformatics/btp352

Liu, Z., Gui, T. T., Wang, Z., Li, H., Fu, Y., Dong, X., et al. (2016). cisASE: a likelihood-based method for detecting putative cis-regulated allele-specific expression in RNA sequencing data. Bioinformatics 32, 3291–3297. doi: 10.1093/bioinformatics/btw416

Lo, H. S., Wang, Z. N., Hu, Y., Yang, H. H., Gere, S., Buetow, K. H., et al. (2003). Allelic variation in gene expression is common in the human genome. Genome Res. 13, 1855–1862. doi: 10.1101/gr.1006603

Maleno, I., Lopez-Nevot, M. A., Cabrera, T., Salinero, J., and Garrido, F. (2002). Multiple mechanisms generate HLA class I altered phenotypes in laryngeal carcinomas: high frequency of HLA haplotype loss associated with loss of heterozygosity in chromosome region 6p21. Cancer Immunol. Immunother. 51, 389–396. doi: 10.1007/s00262-002-0296-0

Maurano, M. T., Humbert, R., Rynes, E., Thurman, R. E., Haugen, E., Wang, H., et al. (2012). Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195. doi: 10.1126/science.1222794

Padmanabhan, V., Callas, P., Philips, G., Trainer, T. D., and Beatty, B. G. (2004). DNA replication regulation protein Mcm7 as a marker of proliferation in prostate cancer. J. Clin. Pathol. 57, 1057–1062. doi: 10.1136/jcp.2004.016436

Pastinen, T. (2010). Genome-wide allele-specific analysis: insights into regulatory variation. Nat. Rev. Genet. 11, 533–538. doi: 10.1038/nrg2815

Prendergast, J. G., Tong, P., Hay, D. C., Farrington, S. M., and Semple, C. A. (2012). A genome-wide screen in human embryonic stem cells reveals novel sites of allele-specific histone modification associated with known disease loci. Epigenetics Chromatin 5:6. doi: 10.1186/1756-8935-5-6

Puigvert, J. C., Sanjiv, K., and Helleday, T. (2016). Targeting DNA repair, DNA metabolism and replication stress as anti-cancer strategies. FEBS J. 283, 232–245. doi: 10.1111/febs.13574


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Liu, Dong and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# In-silico Antigenicity Determination and Clustering of Dengue Virus Serotypes

#### Jingxuan Qiu<sup>1</sup>, Yuxuan Shang<sup>2</sup>, Zhiliang Ji<sup>3</sup>\* and Tianyi Qiu<sup>4</sup>\*

*<sup>1</sup> School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, Shanghai, China, <sup>2</sup> Shanghai Qibao Dwight High School, Shanghai, China, <sup>3</sup> State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Xiamen University, Xiamen, China, <sup>4</sup> Shanghai Public Health Clinical Center & Institutes of Biomedical Sciences, Shanghai Medical School, Fudan University, Shanghai, China*

Emerging and re-emerging dengue virus (DENV) causes dengue fever epidemics globally. Current DENV serotypes are defined by genetic clustering, yet discrepancies are frequently observed between genetic clustering and antigenicity experiments. Rapid, high-throughput antigenicity determination of DENV mutants is critical for vaccine selection and epidemic prevention during early outbreaks, but accurate prediction methods have seldom been reported for DENV. Here, a highly accurate and efficient *in-silico* model was set up for DENV based on possible antigenicity-dominant positions (ADPs) of the envelope (E) protein. Independent testing showed high performance, with an AUC-value of 0.937 and an accuracy of 0.896 for the quantitative Linear Regression (LR) model. More importantly, our model can detect cross-reactions between inter-serotype strains that current genetic clustering fails to capture. Predicted clustering of 1,143 historical strains revealed new DENV clusters, and we propose that DENV2 should be further classified into two subgroups. Thus, DENV serotyping may need to be reconsidered antigenically rather than genetically. As the first algorithm tailor-made for DENV antigenicity measurement based on mutated sequences, our model may offer a fast-responding tool for antigenicity surveillance of DENV variants and for potential vaccine studies.

#### Edited by:

*Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China*

#### Reviewed by:

*Xin Chen, Zhejiang University, China*

*Qi Liu, Vanderbilt University Medical Center, United States*

\*Correspondence:

*Zhiliang Ji appo@xmu.edu.cn*

*Tianyi Qiu ty\_qiu@126.com*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *24 October 2018* Accepted: *23 November 2018* Published: *07 December 2018*

#### Citation:

*Qiu J, Shang Y, Ji Z and Qiu T (2018) In-silico Antigenicity Determination and Clustering of Dengue Virus Serotypes. Front. Genet. 9:621. doi: 10.3389/fgene.2018.00621*

Keywords: dengue virus, envelope protein, bioinformatics, antigenicity-dominant positions, antigenicity clustering

# INTRODUCTION

Dengue virus (DENV) is a mosquito-borne RNA virus of the Flaviviridae family that causes dengue fever epidemics in tropical and subtropical countries (Rodenhuis-Zybert et al., 2010). Every year, nearly 390 million people are infected by DENV; among them, 96 million develop an acute systemic illness and over 500 thousand experience potentially life-threatening complications such as dengue hemorrhagic fever (DHF) and dengue shock syndrome (DSS) (WHO/TDR, 2009; Bhatt et al., 2013). Traditionally, DENV is genetically divided into four subtypes (Lanciotti et al., 1997; Zhang et al., 2005; Chau et al., 2008). In 1952, an early clinical study reported that individuals with primary DENV infections are often protected against the homologous type, but show only partial cross-protection against heterologous types (Sabin, 1952). As such, DENV serotypes were simply defined based on genetic clusters. This classification was subsequently supported by in vitro experiments in which DENV strains were better neutralized by antisera from homologous rather than heterologous types (Hammon et al., 1960). Nevertheless, it has frequently been observed that antigenic variation does occur within the same DENV serotype. Initially, this intra-serotype difference was considered substantially smaller than the inter-serotype difference and therefore negligible (Russell and Nisalak, 1967; Gentry et al., 1982). Yet, with the accumulation of clinical and epidemiological evidence, researchers noted that the traditional classification of DENV serotypes based on genotypes can no longer explain the clinical observations. Cross-reactions were often found between antisera from different serotypes, which led to a rethinking of DENV antigenicity clustering (Katzelnick et al., 2015). It is currently believed that the antigenicity of DENV is actually volatile, and that the traditional genotypic categorization may not be sensitive enough to evaluate antigenic differences (Katzelnick et al., 2015). Also, the epidemic magnitude of DENV might be affected not only by traditional serotypes but, more importantly, by antigenic differences between particular infecting viruses (Kochel et al., 2002; Adams et al., 2006; OhAinle et al., 2011). Since antigenic differences among the DENV types correlate not only with disease outcome and vaccine-induced protection, but also with epidemic magnitude and viral evolution, accurate antigenic analysis of DENV serotypes is highly desirable.

To investigate the antigenic relationship among DENV subtypes, comprehensive serological tests have been performed on both animals and vaccinated or infected humans. In the study by Katzelnick et al. (2015), 36 DENV isolates covering the four serotypes were used to inoculate African green monkeys, and the antiserum of each monkey was tested against 47 DENV strains to generate a dengue antigenic map. According to this antigenic cartography, DENV isolates are usually antigenically similar to viruses of the same serotype. However, a substantial number of strains were antigenically more similar to viruses of other types than to viruses of their own type (Katzelnick et al., 2015).

These results suggest that the traditional genotype classification cannot fully meet the needs of antigenicity clustering, and that more accurate methods for antigenicity evaluation are needed. With the development of bioinformatics, computational approaches have begun to make accurate, high-throughput evaluation possible (Liao et al., 2009; Qiu et al., 2016). Although this task may be handled by a few general in-silico models (Qiu et al., 2018), their limitations often include the need for clearly defined epitope residues and low computational efficiency. In this study, a rapid model tailor-made for DENV was established to infer the antigenic relationship between inter- and intra-serotype DENV strains, considering the conformational environment of the major surface envelope (E) protein. Based on a comprehensive experimental dataset collated from previous research, antigenicity-dominant positions (ADPs) of DENV as a whole and of each of the four serotypes were first derived from the correlation between residue mutations of the E protein and antigenicity variance. Then, the position specific scoring matrix (PSSM) was combined with physicochemical descriptors (PCDs) to build the antigenicity calculation model. Finally, 1,143 historical sequences of DENV E antigens from NCBI (Resch et al., 2009) were predicted, and the antigenic relationship between DENV serotypes was analyzed.

# MATERIALS AND METHODS

# Dataset

For model construction, virus-antiserum neutralization titers, which reflect the antigenic relationship, were collated from previous research (Katzelnick et al., 2015), in which the binding ability between DENV and antisera from DENV-infected African green monkeys was determined. The corresponding envelope protein sequences of DENV were collected from the National Center for Biotechnology Information (NCBI) (Resch et al., 2009). Considering the time since injection and the integrity of the data, antisera samples derived from African green monkeys 3 months after injection with the corresponding vaccine were chosen for model construction and validation. In total, 1,444 strain pairs with experimental antigenicity distances, involving 46 strains, were retained; titers labeled as <10 were arbitrarily set to 5 to simplify the calculation. For model construction, 80% of the strain pairs (1,155) were randomly selected as the training dataset, and the remaining 20% (289 strain pairs) were defined as the independent validation set.
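The preprocessing and split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `prepare_pairs`, the random seed, and the synthetic placeholder records (sized only to give 1,444 pairs) are assumptions, and the log2 transform follows the normalization stated in the Methods.

```python
import math
import random

def prepare_pairs(records, seed=0, train_fraction=0.8):
    """Clean titers and split strain pairs into training and validation sets.

    records: list of (strain_a, strain_b, titer); titer is a number or the string "<10".
    """
    cleaned = []
    for strain_a, strain_b, titer in records:
        value = 5.0 if titer == "<10" else float(titer)  # below-detection titers set to 5
        cleaned.append((strain_a, strain_b, math.log2(value)))  # log2 normalization
    rng = random.Random(seed)  # hypothetical fixed seed, for reproducibility only
    rng.shuffle(cleaned)
    n_train = round(train_fraction * len(cleaned))
    return cleaned[:n_train], cleaned[n_train:]

# Synthetic placeholder records: 38 x 38 = 1,444 pairs (not real titers).
records = [("s%d" % i, "s%d" % j, "<10" if (i + j) % 3 == 0 else 20 * (1 + (i * j) % 4))
           for i in range(38) for j in range(38)]
train, validation = prepare_pairs(records)
```

With 1,444 pairs, rounding 80% gives the 1,155/289 split reported in the text.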

Further, historical DENV strains with envelope protein sequences were collected from the Virus Variation Resource at NCBI (Resch et al., 2009), and a total of 4,633 E protein sequences were retained. After removing redundancy at 100% sequence identity, 1,143 non-redundant E protein sequences were selected for further analysis. The three-dimensional structure of the envelope protein was taken from the Protein Data Bank (PDB id: 1OAN) (Berman et al., 2000; Modis et al., 2003).
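A simple way to remove redundancy at 100% identity is sketched below, assuming that identical means an exact string match between equal-length sequences. The function name and the toy records are hypothetical; the source does not say which tool was used for this step.

```python
def remove_identical_sequences(sequences):
    """Keep the first occurrence of each distinct sequence.

    sequences: list of (name, sequence) tuples. Two entries count as redundant
    when their sequences match exactly (an assumed reading of "100% identity").
    """
    seen = set()
    unique = []
    for name, seq in sequences:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique

toy = [("strainA", "MRCIGISNRD"), ("strainB", "MRCIGISNRD"), ("strainC", "MRCVGISNKD")]
non_redundant = remove_identical_sequences(toy)
```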

# Identifying Antigenicity-Dominant Positions of E Protein Surface

Since recognition between antigen and antibody occurs at the interaction interface on the antigen surface, mutations exposed on the protein surface in the training set were initially selected as candidates. After mapping all positions to the template structure (PDB id: 1OAN), 357 surface positions were collected with solvent accessible surface area (SASA) over 1 Å<sup>2</sup>, as calculated by Naccess V2.1.1. As antigenic variation is often related to mutations at multiple positions, these positions can be further correlated with antisera titer values by linear regression (LR).

For each strain pair to be compared, the candidate ADPs are defined as the set P, which initially covers the 357 surface positions. By marking positions with amino acid mutations as 1 and all others as 0, a 357-bit vector vec(P) is generated. These vectors were regressed against the normalized antisera titer values by LR, and positions with a nonzero weight (in absolute value) were defined as positions correlated with antigenicity distance. In this way, 97 ADPs were retained. According to geometric distance, these 97 positions can be classified into four antigenic patches. Here, the antisera titer value (V) was normalized by logarithm (Log<sub>2</sub>V). For individual serotypes, the ADPs were derived from intra-serotype experimental titers.
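The encoding of vec(P) and the selection of positions by nonzero weight can be sketched as below. The function names, the 0-based indexing into pre-aligned sequences, the short toy sequences, and the example weights are all illustrative assumptions; the regression fit itself is not shown, and the weights stand in for the output of whatever LR implementation is used.

```python
def mutation_vector(seq_a, seq_b, positions):
    """Return a 0/1 list: 1 where two aligned sequences differ at a candidate position."""
    return [1 if seq_a[p] != seq_b[p] else 0 for p in positions]

def select_adps(positions, weights, tol=0.0):
    """Keep the positions whose fitted regression weight is nonzero in absolute value."""
    return [p for p, w in zip(positions, weights) if abs(w) > tol]

# Toy example: four candidate surface positions on two short aligned sequences.
candidates = [0, 3, 5, 8]
vec = mutation_vector("MRCIGISNRD", "MRCVGISNKD", candidates)

# Hypothetical weights, as if returned by a fitted linear regression.
adps = select_adps(candidates, [0.0, 0.42, 0.0, -0.13])
```

In the paper the same encoding runs over 357 candidate positions and retains 97 of them.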

# Quantitative Model Construction Based on Antigenicity-Dominant Positions

#### Position Specific Scoring Matrix

To quantitatively describe amino acid mutations at each antigenicity-dominant position, the amino acid distribution was calculated to reflect the effect of residue mutations. A PSSM was generated by position-specific iterated basic local alignment search tool (PSI-BLAST) (Altschul et al., 1997) based on the 1,143 historical envelope protein sequences. For each position, the scores form a 1 × 20 matrix, in which each score represents the frequency with which one amino acid occurs at that position. For a pair of DENV strains, the PSSM vector was constructed position by position, with each element defined as the absolute difference between the matrix scores of the two compared residues. For each queried pair of E proteins, a 97-bit PSSM descriptor was thus formulated to summarize amino acid mutations at the 97 positions.
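The pairwise PSSM descriptor can be illustrated with the sketch below. As a loud simplification, the per-position profile here is a plain amino acid frequency table computed from pre-aligned sequences; it is a stand-in for the PSI-BLAST output and not a reimplementation of PSI-BLAST. All names and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(aligned_seqs, position):
    """Frequency of each of the 20 amino acids at one alignment column (1 x 20 profile)."""
    counts = Counter(seq[position] for seq in aligned_seqs if seq[position] in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}

def pssm_descriptor(seq_a, seq_b, profile, positions):
    """One value per position: |score of residue in a - score of residue in b|."""
    return [abs(profile[p].get(seq_a[p], 0.0) - profile[p].get(seq_b[p], 0.0))
            for p in positions]

# Toy alignment standing in for the historical E protein sequences.
aligned = ["ACD", "ACE", "GCE", "ACE"]
positions = [0, 1, 2]
profile = {p: column_frequencies(aligned, p) for p in positions}
descriptor = pssm_descriptor("ACD", "GCE", profile, positions)
```

Unmutated positions contribute 0, so the descriptor is nonzero only where the two strains carry different residues.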

#### Physical Chemical Property Descriptors

Physicochemical property descriptors were based on amino acid indexes from the AAindex database (Kawashima and Kanehisa, 2000). The set of physicochemical indexes was reduced as follows: (1) The pairwise Pearson Correlation Coefficient (PCC) was calculated between any two AAindexes. (2) Two indexes were defined as similar only when the corresponding PCC value was over 0.8. (3) All indexes were ranked in descending order by the number of similar indexes they can represent. (4) From the top to the bottom of the ranked list, indexes that can be represented by others were removed sequentially. (5) A minimum index set that can represent the full index list was obtained. The physicochemical property descriptor was generated from the absolute difference of AAindex values summed over each antigenic region; further, the relationship between PCDs and experimental titers was modeled through LR for feature selection. Here, the neighborhood region of conformational structures was set as 1 Å according to a distance screen from 1 to 5 Å (**Supplementary Table 2**). In each round, indexes with nonzero weight were retained, and after iterative selection, 20 antigenicity-related indexes were selected for further model construction. For the four antigenic patches, 4 × 20 = 80 bits of descriptors were generated as physicochemical property descriptors.
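Steps (1) to (5) above can be read as a greedy redundancy reduction, sketched below under that interpretation. The function names and the toy index values are hypothetical (real AAindex entries hold 20 values each), and the raw PCC, not its absolute value, is compared with the 0.8 cutoff, following the wording of step (2).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def reduce_indexes(indexes, cutoff=0.8):
    """Greedy reduction: indexes is a dict name -> values; returns representative names."""
    names = list(indexes)
    # Steps 1-2: pairwise PCC, "similar" when PCC exceeds the cutoff.
    similar = {n: {m for m in names if m != n and pearson(indexes[n], indexes[m]) > cutoff}
               for n in names}
    # Step 3: rank by the number of similar indexes, descending.
    ranked = sorted(names, key=lambda n: len(similar[n]), reverse=True)
    # Steps 4-5: walk down the list, dropping indexes already represented.
    kept, represented = [], set()
    for name in ranked:
        if name in represented:
            continue
        kept.append(name)
        represented |= similar[name]
    return kept

toy_indexes = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 3.2, 3.9, 5.1],  # nearly collinear with "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly anti-correlated with both
}
minimal_set = reduce_indexes(toy_indexes)
```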

#### Modeling the Antigenic Variance

Based on antigenic descriptors incorporating the PSSM profile and physicochemical properties, a prediction model for antigenicity could be constructed. Here, both qualitative and quantitative models were fitted between normalized experimental titers and antigenicity descriptors. To further analyze the antigenic relationship, different antigenic cutoffs were set for classification based on the homologous titers between DENV strains and the antisera against themselves. For a strain pair with E proteins denoted E<sub>a</sub> and E<sub>b</sub>, a 177-dimensional quantitative descriptor for antigenicity-dominant positions (QDAP) was derived as below, containing the 97 PSSM profile values and the 80 PCD values. A machine learning model can then be trained to fit the parameters of the 177-dimensional descriptor to the antigenic variation, defined as the logarithm of the experimental titer (LogV<sub>ab</sub>), on the training set, as follows:

$$\begin{cases} QDAP_{1:177}^{(E_a, E_b)} = \left\{ PSSM_{1:97}^{(E_a, E_b)} + PCD_{1:80}^{(E_a, E_b)} \right\} \\ LogV_{ab} = Train_{1155}\left( \alpha_1 QDAP_1, \alpha_2 QDAP_2, \cdots, \alpha_{177} QDAP_{177} \right) + \varepsilon_d \end{cases} \tag{1}$$

Training continues until the optimized model is reached, as below:

$$\widehat{LogV_{ab}} = \wp_0 + \wp_1 QDAP_1 + \wp_2 QDAP_2 + \dots + \wp_{177} QDAP_{177} \tag{2}$$

Here, the score $\widehat{LogV_{ab}}$ stands for the predicted antigenicity variation between two DENV strains. The experimental LogV<sub>ab</sub> represents the logarithm of the experimental titers used for model training. Thus, the escape threshold for the predicted $\widehat{LogV_{ab}}$ is the same as that of LogV<sub>ab</sub> from experimental titers.
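The descriptor assembly and the linear form of the optimized model can be sketched as follows. The function names are hypothetical, and the intercept and weights in the example are placeholder numbers, not fitted parameters from the paper.

```python
def build_qdap(pssm_part, pcd_part):
    """Concatenate the 97 PSSM values and 80 PCD values into a 177-dimensional QDAP."""
    if len(pssm_part) != 97 or len(pcd_part) != 80:
        raise ValueError("expected 97 PSSM values and 80 PCD values")
    return list(pssm_part) + list(pcd_part)

def predict_log_titer(descriptor, intercept, weights):
    """Linear prediction: intercept plus the weighted sum of descriptor elements."""
    if len(descriptor) != len(weights):
        raise ValueError("descriptor and weights must have the same length")
    return intercept + sum(w * x for w, x in zip(weights, descriptor))

# Placeholder inputs: no PSSM differences, unit PCD differences, uniform weights.
qdap = build_qdap([0.0] * 97, [1.0] * 80)
predicted = predict_log_titer(qdap, intercept=2.0, weights=[0.5] * 177)
```

Because the prediction is on the same log scale as the training titers, a cutoff chosen on experimental titers can be applied directly to the predicted value.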

#### Parameter Definition

To evaluate the performance of our model, statistical parameters were defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{5}$$

Here, TP represents true positive, TN represents true negative, FP represents false positive, and FN represents false negative. Also, to evaluate our regression model, the correlation coefficient (CC) was introduced as follows:

$$Correlation\ coefficient = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \overline{Y})^2}} \tag{6}$$

where X<sub>i</sub> represents the predicted value, Y<sub>i</sub> represents the actual value, $\overline{X}$ refers to the average of X<sub>i</sub>, and $\overline{Y}$ refers to the average of Y<sub>i</sub>.
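Equations (3) to (6) translate directly into code. The sketch below uses hypothetical function names and made-up confusion-matrix counts purely to show the arithmetic.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and sensitivity as defined in Equations (3)-(5)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
    }

def correlation_coefficient(predicted, actual):
    """Pearson correlation coefficient as defined in Equation (6)."""
    n = len(predicted)
    mean_x = sum(predicted) / n
    mean_y = sum(actual) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, actual))
    denominator = (math.sqrt(sum((x - mean_x) ** 2 for x in predicted))
                   * math.sqrt(sum((y - mean_y) ** 2 for y in actual)))
    return numerator / denominator

# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
cc = correlation_coefficient([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```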

# RESULTS

### Determination of Antigenicity-Dominant Positions

Antigenicity-dominant positions (ADPs), whose mutations are correlated with antigenicity variation, were determined according to two criteria: (1) surface-exposed residues with the potential to become epitopes for the immune response, and (2) essential positions where mutations are likely to lead to antigenicity variation (see section Materials and Methods). Three hundred and fifty-seven surface-exposed positions were initially retrieved. According to their correlation with the experimental antigenicity variance in the training data (Katzelnick et al., 2015), 97 were identified as potential ADPs. The ADPs are mainly located in domains ED1, ED2, and ED3 on the E protein surface, as illustrated in **Figure 1A**. These ADPs can be clustered into four major surface patches according to spatial distance, which may correspond to potential epitope areas on the E protein. Here, the four patches were labeled D1, D2, D3, and D4, marked in blue, red, black, and green in **Figure 1B**, respectively.

The 97 ADPs overlap substantially with broadly neutralizing epitopes derived from antibodies targeting all four serotypes (**Figure 1C**). For instance, the cross-neutralizing mAb 4E11 was reported to recognize the ED3 region of the E protein monomer structure (Cockburn et al., 2012), and another mAb, 1F4, binds to the ED1 region (Fibriansah et al., 2014). These epitopes match well with our regions D1 (blue) and D2 (red), respectively. Also, the epitopes of cross-neutralizing antibodies against polymer structures, such as 1B7 (Aaskov et al., 1989) and 2D22 (Fibriansah et al., 2015), partially overlap with our regions D3 (black) and D4 (green).

Besides the general ADPs for DENV, serotype-specific antigenic sites were also determined for each DENV serotype from the antisera of the corresponding serotype in a similar way (see section Materials and Methods). Finally, 30, 40, 47, and 37 sites were derived as serotype-specific ADPs for DENV 1–4, respectively, as illustrated in **Figures 1D–G** and **Supplementary Table 1**.

#### Model Construction and Evaluation

Machine learning models were adopted for DENV antigenicity prediction. The molecular features mainly cover the position specific scoring matrix (PSSM) and the physicochemical environment of each ADP, both of which were previously reported to be important for antigenicity prediction (Qiu et al., 2016, 2018). The workflow of the DENV antigenicity prediction model covers four steps: (1) deriving the PSSM for each ADP, (2) generating physicochemical properties of the neighboring regions for each ADP cluster, (3) selecting an appropriate machine learning approach, and (4) calculating the antigenicity distance between two compared DENV strains. Detailed information can be found in section Materials and Methods.

For machine learning, both qualitative and quantitative approaches were tested. Five qualitative models, namely Sequential Minimal Optimization (SMO), Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Classifier (LC), and Random Forest (RF), were used to establish classification models. Note that no titer threshold has been reported for DENV. According to the experimental results of Katzelnick et al. (2015), over 90% of self-reactive titer values are over 20. Therefore, titer values of 10, 15, 20, and 40 were each tried in turn as the classification cutoff for evaluation. In 10-fold cross-validation of all five algorithms, the NB classifier obtained the best overall performance across thresholds and achieved an AUC-value over 0.88 at the threshold of 20 (**Figure 2A**). Also, the average (AVG) accuracy of our model ranged from 0.763 to 0.931, and the fluctuation of accuracy was extremely small, with variance (VAR) no more than 0.002 (**Supplementary Figure 1**). These results illustrate that our model could provide accurate and robust predictions for antigenic classification, and the NB classifier was chosen to establish our qualitative model. After that, the performance of our model was evaluated on the independent testing dataset from previous experiments (Katzelnick et al., 2015). The NB classifier achieved AUC from 0.81 to 0.90 and accuracy (ACC) from 0.77 to 0.86 under different thresholds, which indicates the strong ability of our model for qualitative antigenicity classification between compared DENV strains (**Supplementary Figure 2**).

For quantitative approaches, different regression models, including Additive Regression (AR), Support Vector Regression (SVR), Gaussian Processes (GP), LR, and Isotonic Regression (IR), were evaluated. LR achieved the best quantitative predictions, with a CC of 0.744 (**Supplementary Figure 3**). Thus, LR was chosen to establish our quantitative model. Further, by setting different thresholds, the classification performance of the quantitative LR model was also evaluated and compared with that of the qualitative NB classifier (**Figure 2B**). The quantitative LR model outperformed the qualitative NB classifier at every threshold tested. Thus, the quantitative LR model was adopted for the final analysis.
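One way to score a quantitative model as a classifier at a given titer cutoff is sketched below. It rests on assumptions not stated in the source: a pair is treated as positive when its log2 titer reaches the cutoff, and AUC is computed as the rank-based probability that a positive pair scores higher than a negative one. The function names and the toy titers are hypothetical.

```python
import math

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def evaluate_at_cutoff(true_log_titers, predicted_log_titers, cutoff):
    """Binarize at log2(cutoff); return (accuracy, AUC) of the predicted values."""
    threshold = math.log2(cutoff)
    labels = [v >= threshold for v in true_log_titers]
    calls = [v >= threshold for v in predicted_log_titers]
    accuracy = sum(l == c for l, c in zip(labels, calls)) / len(labels)
    return accuracy, auc(labels, predicted_log_titers)

# Toy titers (not experimental data), evaluated at the cutoff of 20.
true_values = [math.log2(t) for t in (5, 10, 20, 40, 80, 160)]
predicted_values = [math.log2(t) for t in (5, 30, 10, 40, 80, 160)]
accuracy, area = evaluate_at_cutoff(true_values, predicted_values, cutoff=20)
```

Repeating the call with cutoffs of 10, 15, 20, and 40 gives the threshold sweep described in the text.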

#### The Discrepancy Between DENV Serotypes And Antigenicity Clusters

With the above model, we performed large-scale antigenicity mapping of 1,143 historical DENV strains retrieved from NCBI to investigate the relationship between DENV serotypes (genetic clusters) and antigenicity clusters. First, the pairwise antigenicity similarity of all 1,143 historical strains was calculated through our model for intra- and inter-serotype pairs. Similarly, the genetic distance was calculated by counting the number of residue mutations for intra- and inter-serotype pairs (**Figure 3**). The genetic distance within one serotype is significantly smaller than that between serotypes, and the distribution ranges of genetic distance are clearly distinguishable, without any overlap between the two classes (**Figure 3A**). However, for antigenicity similarity, the two distributions overlap despite slightly different trends (**Figure 3B**). Because cluster prediction relies on the similarity of intra-serotype strains being separable from that of inter-serotype strains, this blurred border will inevitably lead to discrepancies between DENV genetic clusters (serotypes) and antigenic clusters.

FIGURE 2 | Performance of our model. (A) Cross-validation performance of qualitative models constructed by Sequential Minimal Optimization (SMO), Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Classifier (LC), and Random Forest (RF). Here, the Y axis represents the AUC-value of the different computational models. (B) Performance of Linear Regression and Naïve Bayes on the independent dataset.
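The intra- versus inter-serotype comparison of genetic distance can be sketched as follows. The function names, the overlap test on value ranges, and the four-residue toy sequences are illustrative assumptions; real E protein sequences would first need to be aligned to equal length.

```python
from itertools import combinations

def genetic_distance(seq_a, seq_b):
    """Number of differing residues between two aligned, equal-length sequences."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def split_distances(strains):
    """strains: list of (serotype, sequence). Returns (intra, inter) distance lists."""
    intra, inter = [], []
    for (type_a, seq_a), (type_b, seq_b) in combinations(strains, 2):
        target = intra if type_a == type_b else inter
        target.append(genetic_distance(seq_a, seq_b))
    return intra, inter

def ranges_overlap(intra, inter):
    """True when the largest intra-serotype distance reaches the smallest inter one."""
    return max(intra) >= min(inter)

# Toy strains: two serotypes with two sequences each.
toy_strains = [(1, "AAAA"), (1, "AAAT"), (2, "TTTT"), (2, "TTTA")]
intra, inter = split_distances(toy_strains)
overlap = ranges_overlap(intra, inter)
```

The same split applies to predicted antigenic similarities by swapping the distance function, which is how the separable genetic border and the overlapping antigenic border can be compared side by side.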

As a reference, the large-scale animal data from Katzelnick et al. (2015) were analyzed in the same way to show the difference between the 4 intra- and 6 inter-serotype groups (**Figures 3A,B**). Again, a discriminable genetic border, but not an antigenic border, can be observed in the experimental data. These consistent results indicate that DENV can be clearly clustered into four groups genetically, but not antigenically. Thus, the traditional DENV antigenic clustering should be re-evaluated.

Then, the pairwise antigenic similarities of all 1,143 historical strains were mapped into a clustering tree (**Figure 3C**), in which different colors represent different DENV serotypes. Most intra-type strains tend to cluster together, consistent with the serotype classification, as in the case of serotype 1 and serotype 3. However, a substantial number of strains clearly cluster with other serotypes. For instance, a number of serotype 4 strains are grouped with serotypes 2 and 3. More interestingly, two different antigenic groups can be clearly distinguished for DENV 2. Therefore, we propose further subtyping of DENV 2 into two sub-clusters, where DENV 2a is antigenically closer to serotype 4 than to DENV 2b (**Figure 3D**).

Further, we clustered DENV strains by antigenicity distance based on neutralization titer values from the monkey experiments (Katzelnick et al., 2015). Because of noise, the raw experimental data were cleaned as follows: (1) null values were discarded; (2) small and uncertain titers labeled as "<10" were set to 5; (3) the logarithm of each titer value was defined as the antigenic similarity of the two compared strains. Antigenic clustering of the remaining pairs, involving 28 DENV strains, is illustrated in **Supplementary Figure 4**.

Besides the four clusters representing the four traditionally defined serotypes, a new cluster of DENV2 strains can be detected, which is antigenically closer to serotype 4 than to serotype 2. This experimental result supports our proposal of two sub-clusters for DENV2.

# DISCUSSION

The antigenic difference between DENV viruses plays an essential role in DENV epidemic control, vaccine-based prevention, and clinical treatment. In this paper, we built an accurate and efficient model to calculate the antigenic similarity of DENV strains based on mutated sequences of E proteins. To achieve that, we primarily considered the possible ADPs instead of all mutations in E antigens, for reasons of both computing efficiency and prediction accuracy.

It is known that not all mutations cause antigenicity variation. After deriving the possible ADPs, where mutations could significantly affect antigenicity, we further clustered the ADPs into spatial patches based on geometric distance on the protein surface to detect potential epitope regions. Note that the ADPs were calculated from previously accumulated experimental data, and more abundant experimental data will lead to a more accurate model. Despite the credibility of our model, the set of ADPs might be updated and slightly adjusted as new binding assays accumulate, which may in turn cause minor changes in antigenic grouping.

Apart from the contribution of ADPs, the performance of our model also benefits from full consideration of the PSSM profile and the physicochemical environment around the ADPs. Here, the PSSM generated by PSI-BLAST (Altschul et al., 1997) provides a detailed description of the evolutionary pressure on ADPs at the sequence level. Moreover, the physicochemical environment described by amino acid indexes is also considered, to better reflect the micro-environment variations between two compared strains (Qiu et al., 2016). Thus, by incorporating PSSM profiles and PCDs, our model could better predict the antigenicity variation of DENV strains.

It was reported that, many DENV isolates are antigenic similar to those viruses from different types rather than those from the same type (Katzelnick et al., 2015). In this paper, we explained the reasons why canonical DENV types are not antigenically homogenous. Both data of experiments

#### REFERENCES


and historically published sequences showed that, the mutation accumulation is discrete but the antigenicity variation of mutants tends to be continuous among the DENV mutant populations (**Supplementary Figure 5**). The discrete genetic distance between intra- and inter-groups make it easy to define DENV subgroups but that may not correlate with the antigenic similarity. Thus, we suggest the re-consideration of the traditional serotype definition via DENV antigenic similarity instead of genetic distance. Our model provides convenient way to calculate the relative antigenicity difference.

In summary, we established a fast and efficient model of DENV antigenicity based on sequence input of E antigens. With updated ADPs and the incorporation of additional antigens, it will be possible to establish an online tool for epidemic monitoring and broad-spectrum vaccine design for DENV.

#### DATA AVAILABILITY

Sequence data of the dengue virus envelope (E) protein were derived from NCBI virus resources at https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/Database/nph-select.cgi?cmd=database&taxid=12637.

## AUTHOR CONTRIBUTIONS

JQ and TQ developed the algorithm. TQ and YS wrote the manuscript. JQ constructed the computational model. TQ and ZJ supervised the whole project and modified the manuscript. All authors read and approved the final manuscript.

#### FUNDING

This work was supported in part by grants from the National Postdoctoral Program for Innovative Talents (BX201600033), the China Postdoctoral Science Foundation Funded Project (2017M611451), and the National Natural Science Foundation of China (31671362).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00621/full#supplementary-material

#### REFERENCES

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Qiu, Shang, Ji and Qiu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Apolipoprotein E Overexpression Is Associated With Tumor Progression and Poor Survival in Colorectal Cancer

Zhixun Zhao<sup>1</sup>†, Shuangmei Zou<sup>2</sup>†, Xu Guan<sup>3</sup>, Meng Wang<sup>1</sup>, Zheng Jiang<sup>3</sup>, Zheng Liu<sup>3</sup>, Chunxiang Li<sup>4</sup>, Huixin Lin<sup>5</sup>, Xiuyun Liu<sup>2</sup>, Runkun Yang<sup>1</sup>, Yibo Gao<sup>4</sup>\* and Xishan Wang<sup>1,3</sup>\*

<sup>1</sup> Department of Colorectal Surgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, China, <sup>2</sup> Department of Pathology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China, <sup>3</sup> Department of Colorectal Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China, <sup>4</sup> Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China, <sup>5</sup> Genesis (Beijing) Co., Ltd., Beijing, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Hauke Busch, Universität zu Lübeck, Germany Changlong Gu, Hunan University, China Min Tang, The University of Arizona, United States

#### \*Correspondence:

Yibo Gao gaoyibo@cicams.ac.cn Xishan Wang wxshan1208@126.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 10 October 2018 Accepted: 30 November 2018 Published: 13 December 2018

#### Citation:

Zhao Z, Zou S, Guan X, Wang M, Jiang Z, Liu Z, Li C, Lin H, Liu X, Yang R, Gao Y and Wang X (2018) Apolipoprotein E Overexpression Is Associated With Tumor Progression and Poor Survival in Colorectal Cancer. Front. Genet. 9:650. doi: 10.3389/fgene.2018.00650

Apolipoprotein E (ApoE) plays a key role in tumorigenesis and progression, including cell proliferation, angiogenesis and metastasis. According to previous studies, ApoE overexpression is associated with aggressive biological behavior and poor prognosis in a variety of tumors. This study aimed to assess the prognostic value of ApoE and explore its potential relationship with tumor progression in colorectal cancer (CRC). We collected expression profiling microarray data from the Gene Expression Omnibus (GEO), investigated the ApoE expression pattern between primary CRC and liver metastasis of CRC, and then explored the prognostic significance of the gene based on the TCGA database. High ApoE expression was associated with poor overall survival (OS, p = 0.015) and progression-free survival (PFS, p = 0.004) based on the public databases. Next, ApoE expression was evaluated by immunohistochemistry in two CRC cohorts, comprising 306 stage II cases and 201 cases of CRC with liver metastasis. In the liver metastasis cohort, ApoE expression increased in order from normal mucosa to primary colorectal cancer (PC) to colorectal liver metastases (CLM). Meanwhile, the level of ApoE expression in stage II tumor samples with no evidence of progression within 5 years was lower than that in PC with synchronous liver metastases. High ApoE expression in PC was an independent risk factor in both stage II (HR = 2.023, [95% CI 1.297–3.154], p = 0.002; HR = 1.883, [95% CI 1.295–2.737], p = 0.001; OS and PFS respectively) and simultaneous liver metastasis (HR = 1.559, [95% CI 1.096–2.216], p = 0.013; HR = 1.541, [95% CI 1.129–2.104], p = 0.006; OS and PFS respectively). However, ApoE overexpression could not predict benefit from chemotherapy in stage II. The study revealed the relevance of ApoE overexpression to CRC progression, conferring a poor prognosis in CRC patients, especially those with stage II disease or simultaneous liver metastasis. These findings may improve the prognostic stratification of patients for clinical strategy selection and promote CRC clinical outcomes.

Keywords: colorectal cancer, Apolipoprotein E (ApoE), prognosis, stage II, simultaneous liver metastasis, biomarkers, chemotherapy

# INTRODUCTION

fgene-09-00650 December 11, 2018 Time: 17:39 # 2

Colorectal cancer (CRC) is one of the most common malignant tumors of the digestive tract and threatens public health worldwide. According to data published by the Chinese National Cancer Center, over 376.3 thousand new CRC cases and 191.0 thousand CRC-related deaths were estimated in China in 2015 alone (Chen et al., 2016). The current choice of treatment regimen mainly depends on the American Joint Committee on Cancer (AJCC) TNM staging classification system, which is based on clinicopathologic characteristics. However, owing to tumor heterogeneity, patients with the same stage and similar treatment may have different clinical outcomes. Moreover, chemotherapy, as one of the principal therapeutic means, is recommended for stage III, stage IV and a subset of stage II CRC patients according to the CRC treatment guideline. In stage II CRC, chemotherapy could improve patient survival, but the absolute improvement in survival was less than 5% (Chen et al., 2016). Adverse events from adjuvant chemotherapy also affect patients' quality of life (Rosmarin et al., 2014). Therefore, there remains an urgent need to identify valuable biomarkers that improve the prognostic stratification of patients for clinical strategy selection.

Apolipoprotein E (ApoE) plays a multi-functional role in cholesterol transport and metabolism, mediating the cellular uptake of lipoprotein particles by binding to receptors of the low-density lipoprotein (LDL) receptor family and the receptor for chylomicron remnants (Gliemann, 1998). Previous research has suggested that abnormal ApoE function is associated with Alzheimer's disease, atherosclerosis and chronic heart disease (Wilson et al., 1996; Hofman et al., 1997; Greenow et al., 2005). In addition, ApoE has been implicated in DNA synthesis, cell proliferation, angiogenesis and metastasis, so aberration of these functions may lead to tumorigenesis and progression. ApoE overexpression has previously been reported in gastric, lung, prostate, thyroid, ovarian and endometrial cancer and in glioblastoma (Nicoll et al., 2003; Venanzoni et al., 2003; Oue et al., 2004; Ito et al., 2006; Huvila et al., 2009; Su et al., 2011). One study showed that ApoE was associated with advanced tumor grade and stage in gastric carcinomas and was involved in invasion, metastasis and carcinogenesis (Oue et al., 2004). Another study found that increased expression of ApoE might represent a late event in the progression of endometrioid endometrial adenocarcinoma (Huvila et al., 2009). In lung adenocarcinoma, ApoE overexpression promotes cancer proliferation and migration and is related to chemoresistance (Su et al., 2011). A recent study indicated that ApoE moderates colon homeostasis and constitutes a risk factor for colon pathologies (El-Bahrawy et al., 2016).
However, the prognostic value of ApoE expression in CRC remains unclear. Previous research has suggested that ApoE might influence CRC development through three potential pathways: cholesterol and bile metabolism, triglyceride and insulin regulation, and prolonged inflammation (Slattery et al., 2005; Mrkonjic et al., 2009; Kato et al., 2010). To the best of our knowledge, however, there has been no prior study of the functional expression and prognostic significance of ApoE in CRC.

In the present study, we analyzed Affymetrix gene microarray data from liver metastatic CRC in the GEO to study the expression pattern of ApoE in primary CRC and liver metastasis samples. Next, we evaluated the expression patterns of ApoE in CRC and assessed its prognostic significance based on The Cancer Genome Atlas (TCGA). Subsequently, we studied the expression pattern in stage II CRC and in liver metastasis of CRC, and performed survival analysis in the two cohorts to explore the relationship of ApoE expression with clinical features and prognosis.

# MATERIALS AND METHODS

# Patients and Tissue Samples

The specimens in this study were collected from CRC patients who underwent surgical resection from January 2006 through December 2012 and were all archived by the Pathology Department of the Cancer Institute and Hospital, Chinese Academy of Medical Sciences. All diagnoses were confirmed according to the 7th edition of the TNM staging system. The inclusion criteria for stage II were as follows: (A) AJCC pathological stage II (T3-4N0M0); (B) no systemic therapy or chemotherapy before surgery; (C) complete clinical information available, such as age, gender, tumor location, histology, differentiation, TNM classification, adjuvant therapy regimen and follow-up information. In addition, we used primary tumors and the corresponding metastatic liver specimens to establish a second cohort of simultaneous liver metastatic CRC (LMCRC). In total, 306 cases of stage II CRC and 201 cases of liver metastatic CRC were collected based on the inclusion criteria. The Clinical Research Ethics Committee of the Cancer Institute and Hospital, Chinese Academy of Medical Sciences approved this study. All patients were followed up regularly, every 3 months up to the 5th year, until December 31st, 2017.

## ApoE Expression Analyses in the GEO and TCGA Databases

To investigate the ApoE expression pattern in primary CRC (PC) and liver metastasis of CRC (CLM), we collected expression profiling microarray data from the Gene Expression Omnibus (GEO<sup>1</sup>) database under the accession numbers GSE41258 (Affymetrix Human Genome U133A Array), GSE62322 (Affymetrix Human Genome U133B Array) and GSE68468 (Affymetrix Human Genome U133A Array). Gene expression was first measured at the probe set level using the RMA (Robust Multi-array Average) methodology on perfect match probes, followed by quantile normalization. Probe set annotation for the U133 arrays was downloaded from Affymetrix's website. The probe set with the greatest average expression across all samples was chosen to represent each gene. Information about the datasets is summarized in **Supplementary Table S1**. All sample preparation and microarray processing were performed according to standard protocols.
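The probe set selection rule described above can be illustrated with a minimal sketch. It assumes the expression matrix is already RMA-processed and is held as a plain dictionary; the function name `pick_representative_probes`, the probe IDs and the toy values are hypothetical and are not taken from the study data.

```python
from statistics import mean


def pick_representative_probes(expression, probe_to_gene):
    """Return {gene: probe_id}, keeping for each gene the probe set
    with the greatest mean expression across all samples."""
    best = {}  # gene -> (probe_id, mean expression)
    for probe_id, values in expression.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # skip probe sets without a gene annotation
        avg = mean(values)
        if gene not in best or avg > best[gene][1]:
            best[gene] = (probe_id, avg)
    return {gene: probe for gene, (probe, _) in best.items()}


# Hypothetical log2 expression values for three probe sets in three samples
expression = {
    "probe_a": [8.1, 8.4, 9.0],
    "probe_b": [6.2, 6.0, 6.5],
    "probe_c": [7.0, 7.2, 7.1],
}
# Hypothetical annotation: two probe sets map to the same gene
probe_to_gene = {"probe_a": "APOE", "probe_b": "APOE", "probe_c": "GENE_X"}

representative = pick_representative_probes(expression, probe_to_gene)
print(representative)
```

In this toy input, `probe_a` is kept for the gene covered by two probe sets because its mean expression is higher than that of `probe_b`.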

<sup>1</sup>http://www.ncbi.nlm.nih.gov/geo/

After annotation, the standardized ApoE expression values in each dataset were divided into N (normal mucosa), PC and CLM groups.

The TCGA project provides 271 cases of colon cancer and 89 cases of rectal cancer (**Supplementary Table S8**). After merging the colon and rectal cancer cases, the cohort was classified into a high expression group and a low expression group according to the ApoE expression value (cut-off = 50%). Box plots were generated to compare the ApoE expression level between tumor and normal tissues of CRC and to show ApoE expression in different pathological stages. GEPIA<sup>2</sup>, an interactive web server for analyzing RNA sequencing expression data from the TCGA projects, was used for batch TCGA data processing and visualization in this study (Tang et al., 2017).
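The high/low grouping can be sketched as follows, assuming that the 50% cut-off corresponds to a split at the median expression value. The function name `split_by_median`, the sample IDs and the values are hypothetical, and the handling of values exactly equal to the median (assigned to the low group here) is an assumption not stated in the text.

```python
from statistics import median


def split_by_median(expression_by_sample):
    """Split samples into 'high' and 'low' groups at the median value.
    Assumption: values equal to the median go to the 'low' group."""
    cutoff = median(expression_by_sample.values())
    groups = {"high": [], "low": []}
    for sample, value in expression_by_sample.items():
        groups["high" if value > cutoff else "low"].append(sample)
    return cutoff, groups


# Hypothetical expression values (e.g., TPM) for four samples
toy_expression = {"s1": 12.0, "s2": 30.5, "s3": 18.2, "s4": 45.1}
cutoff, groups = split_by_median(toy_expression)
print(cutoff, groups)
```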

# Tissue Microarray and Immunohistochemistry


The stage II cohort included tumor and normal tissue from each patient, and the LMCRC cohort consisted of the primary tumor, metastatic tumor, normal intestinal mucosa and normal liver tissue from each patient. The tissue microarrays (TMAs) were built after verification by HE staining, and punched cores measuring 1.0 mm were taken from the center of the tumor. The different specimens derived from one patient were placed on the same TMA, and every TMA was made in duplicate to reduce systematic errors.

Immunohistochemical staining was performed on slides (5 µm thick) from the TMAs using an ApoE (pan) (D7I9N) rabbit monoclonal antibody (#13366; 1:500; Cell Signaling Technology, United States), as described previously (Holtzman et al., 2000). The SI score was calculated by multiplying the staining intensity (0, negative; 1, weak; 2, moderate; 3, strong) by the score for the percentage of positively stained cells (no staining, 0; 1–10%, 1; 11–50%, 2; 50–100%, 3). In this study, moderate/strong cytoplasmic staining (SI = 3–9) was defined as positive staining, while weak or negative staining (SI = 0–2) was defined as negative staining. Representative staining of ApoE in the specimens is illustrated in **Figure 1**. The positive rate refers to the proportion of samples with positive ApoE staining, namely positive rate = positive samples/(positive samples + negative samples).
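The scoring rule above can be written out as a short sketch. The function names are hypothetical. Two boundary choices are assumptions made only for illustration: a sample with exactly 50% positive cells is placed in the 11–50% category, and any non-zero percentage up to 10% is placed in the 1–10% category.

```python
def percent_category(percent_positive):
    """Map the percentage of positively stained cells to a 0-3 score.
    Assumptions: exactly 50% falls in category 2; any value above 0
    and up to 10 falls in category 1."""
    if percent_positive <= 0:
        return 0
    if percent_positive <= 10:
        return 1
    if percent_positive <= 50:
        return 2
    return 3


def si_score(intensity, percent_positive):
    """SI = staining intensity (0-3) x percentage category (0-3)."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2 or 3")
    return intensity * percent_category(percent_positive)


def is_positive(si):
    """Positive staining is defined in the text as SI = 3-9."""
    return si >= 3


def positive_rate(si_scores):
    """Positive samples / (positive samples + negative samples)."""
    positives = sum(1 for s in si_scores if is_positive(s))
    return positives / len(si_scores)


# Hypothetical (intensity, percent positive) readings for four cores
readings = [(3, 60), (1, 5), (2, 30), (0, 0)]
scores = [si_score(i, p) for i, p in readings]
print(scores, positive_rate(scores))
```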

#### Statistical Analysis

The statistical significance of differences was assessed using Student's t-test, and one-way ANOVA with Tukey's post-test was conducted for multiple comparisons. The chi-square test or Fisher's exact test was used to evaluate differences in rates among groups. All statistical results are summarized in **Supplementary Table S9**. Survival curves were plotted according to the Kaplan–Meier method, and the log-rank test was used to compare overall survival (OS) and progression-free survival (PFS) in the study cohorts. Univariate and multivariate analyses of CRC prognosis were undertaken using the Cox proportional hazards regression model. The calculations were performed with IBM SPSS Statistics 24.0 and R version 3.3.3. A value of p < 0.05 was considered significant.
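For readers unfamiliar with the Kaplan–Meier method, the product-limit estimate can be sketched in a few lines. This is an illustrative re-implementation in plain Python, not the SPSS or R procedure used in the study; the function name `kaplan_meier` and the follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.
    times: follow-up time per patient; events: True if the event
    (death or progression) was observed, False if censored.
    Returns [(time, survival probability)] at each event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all patients whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Events and censored patients both leave the risk set after t
        at_risk -= removed
    return curve


# Hypothetical follow-up times in months; False marks censoring
toy_times = [6, 12, 12, 20, 30]
toy_events = [True, False, True, True, False]
curve = kaplan_meier(toy_times, toy_events)
print(curve)
```

Patients censored at a given time are still counted as at risk for events occurring at that same time, which is the usual convention.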

FIGURE 1 | Representative immunohistochemistry staining pictures of ApoE expression in CRC tissues. High expression (4X for A, 10X for C) and low expression (4X for B, 10X for D) of the ApoE protein are shown. Each punched sample in the tissue microarrays is 1.0 mm.

## RESULTS

#### ApoE Is Highly Expressed in Colorectal Liver Metastasis and Has Prognostic Significance in Colorectal Cancer Based on the Public Databases

We first assessed the ApoE expression level in normal intestinal mucosa, PC and CLM based on three datasets from GEO (**Supplementary Table S1**). In GSE41258 and GSE62322, PC refers to the primary tumor from metastatic CRC, but because of limited clinical data, PC in GSE68468 included all stages. As demonstrated in **Figure 2**, ApoE expression was significantly higher in CLM than in normal tissue and PC in all three datasets. However, there was no significant difference between normal mucosa and PC.

To further investigate the ApoE expression pattern and prognostic significance in CRC, we analyzed the expression of ApoE in the TCGA database. There was no significant difference between PC and normal tissues in either the colon cancer (COAD) or the rectal cancer (READ) dataset, which was consistent with the GEO data (**Figure 3A**). Meanwhile, ApoE expression showed a generally rising tendency with advancing pathological stage, and the ApoE expression level in stage I was significantly lower than in the other stages (**Figure 3B**). As illustrated by the Kaplan–Meier survival curves, ApoE overexpression was associated with poorer OS and disease-free survival (DFS) in CRC patients (p = 0.015 for OS; p = 0.004 for DFS; **Figures 3C,D**).

<sup>2</sup>http://gepia.cancer-pku.cn/index.html

∗∗Represents p-value < 0.01.

# The ApoE Expression Features in the LMCRC and the Stage II CRC Cohort

Based on the results from the public data, we preliminarily identified the expression patterns and the potential prognostic value of ApoE in CRC. We therefore further investigated, by immunohistochemical staining, the expression patterns of ApoE in 201 cases of PC and CLM from patients with simultaneous liver metastasis, together with the corresponding adjacent normal mucosa and liver tissues. As shown in **Table 1**, ApoE protein expression was detected in 103/201 (51.2%) of the PC samples, 128/201 (63.7%) of the CLM samples and 43 (21.4%) of the adjacent normal mucosa samples. Thus, at the protein level, ApoE expression was higher in PC than in normal mucosa (51.2% vs. 21.4%, p < 0.001), and ApoE was upregulated in CLM tissues compared with PC (63.7% vs. 51.2%, p = 0.012).
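The comparison of positive rates can be reproduced approximately with a Pearson chi-square test on a 2 × 2 table. The sketch below uses only the Python standard library and omits the continuity correction; whether the authors applied one is not stated, so this is an assumption. The function name `chi_square_2x2` is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(pos_a, n_a, pos_b, n_b):
    """Pearson chi-square test (no continuity correction) comparing
    two proportions: pos_a/n_a versus pos_b/n_b."""
    a, b = pos_a, n_a - pos_a  # group A: positive, negative
    c, d = pos_b, n_b - pos_b  # group B: positive, negative
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# CLM (128/201 positive) versus PC (103/201 positive), as reported above
chi2, p_value = chi_square_2x2(128, 201, 103, 201)
print(round(chi2, 2), round(p_value, 3))
```

For CLM versus PC this gives a p-value of about 0.012, in line with the reported value.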

In the stage II cohort, there was no significant difference between tumor and normal tissue (34.3% vs. 39.2%, p = 0.209). According to the follow-up, the 306 stage II patients were divided into a non-progression group (195 cases) and a progression group (111 cases), the latter including 30 cases with liver metastasis after surgery. Immunohistochemical staining indicated that the progression group had a higher ApoE positive rate than the non-progression group (45.9% vs. 27.7%, p = 0.001). When the ApoE expression level of the primary tumor was compared between the stage II and simultaneous liver metastatic groups, the latter was higher (34.3% vs. 51.2%, p = 0.001). We further compared the ApoE expression pattern in the primary tumor between stage II cases with liver metastasis after surgery and the simultaneous liver metastatic group, and found no significant difference (53.3% vs. 51.2%, p = 0.831).

# The Low ApoE Expression Is Associated With Improved Survival Outcome in Two Cohorts

Two cohorts of CRC patients were classified into a low ApoE expression group (SI 0–4) and a high ApoE expression group (SI 6–12) based on the immunohistochemistry staining of the primary tumor. The relationships between ApoE expression and the clinicopathologic characteristics of stage II and LMCRC patients are summarized in **Supplementary Tables S2**, **S3**, respectively. ApoE was highly expressed in the LMCRC cohort patients who underwent neoadjuvant therapy. Other clinicopathologic characteristics, such as age, gender, tumor location, gross pathology type, differentiation grade, T stage, MSI (microsatellite instability) status, preoperative CEA level and preoperative CA19-9 level, had no significant correlation with ApoE expression in either cohort (**Supplementary Tables S2**, **S3**).

for differential gene expression analysis is one-way ANOVA, using the pathological stage as a variable for calculating differential expression. The ApoE high expression group was associated with decreased overall survival (C) and disease-free survival (D) in CRC according to the data from TCGA, which were calculated using a log-rank test. CRC, Colorectal cancer; TPM, transcript per million; #Represents p-value < 0.05.

Stage II, stage II colorectal cancer; LMCRC, liver metastatic colorectal cancer; Progression, tumor recurrence after surgery in 5 years; Non-Progression, no recurrence signs after surgery in 5 years.

To identify the prognostic significance of ApoE expression in CRC, we further conducted survival analysis in the two cohorts separately. In the stage II cohort, the median follow-up was over 59 months, during which 78 patients died and 111 patients relapsed. Kaplan–Meier curves revealed that patients with low ApoE expression had longer 5-year OS and PFS (p = 0.002 for OS and p = 0.001 for PFS; **Figures 4A,B**) in the stage II cohort. Multivariate Cox regression analysis confirmed that high ApoE expression was independently associated with worse OS (HR 2.023, [95% CI 1.297–3.154]) and PFS (HR 1.883, [95% CI 1.295–2.737]) (**Tables 2**, **3** and **Supplementary Tables S4**, **S5**). MSI status was independently associated with better OS (HR 0.328, [95% CI 0.120–0.897]), and neurological involvement was an independent prognostic factor for PFS in multivariate analysis (HR 2.115, [95% CI 1.133–3.949]).
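A hazard ratio, its confidence interval and the corresponding p-value are linked through the standard error of the log hazard ratio. The following sketch back-calculates a p-value from a reported HR and 95% CI, assuming a Wald-type interval with z = 1.96, which the text does not state explicitly; `wald_p_from_hr` is a hypothetical helper name.

```python
from math import erfc, log, sqrt


def wald_p_from_hr(hr, ci_low, ci_high, z_crit=1.96):
    """Recover the standard error of log(HR), the Wald z statistic and
    the two-sided p-value from a hazard ratio and its 95% CI.
    Assumes the interval is exp(log(HR) +/- z_crit * SE)."""
    se = (log(ci_high) - log(ci_low)) / (2 * z_crit)
    z = log(hr) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return se, z, p_value


# Stage II cohort, overall survival: HR 2.023, 95% CI 1.297-3.154
se, z, p_value = wald_p_from_hr(2.023, 1.297, 3.154)
print(round(se, 3), round(z, 2), round(p_value, 4))
```

For the stage II OS estimate, this back-calculation gives a p-value of roughly 0.002, in line with the value reported in the abstract.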

Kaplan–Meier analysis was also conducted in patients with simultaneous liver metastasis. With a median follow-up of 27 months, 141 patients died and 168 patients relapsed. The ApoE-low group had significantly improved OS and PFS (p = 0.002 for OS and p = 0.008 for PFS; **Figures 4C,D**). Multivariate analysis demonstrated that ApoE expression in PC was an independent prognostic factor for OS (HR 1.559, [95% CI 1.096–2.216]) and PFS (HR 1.541, [95% CI 1.129–2.104]) in CRC patients with synchronous liver metastasis (**Tables 4**, **5** and **Supplementary Tables S6**, **S7**). In addition, in the LMCRC cohort, N staging was an independent prognostic indicator for both OS (HR 0.488, [95% CI 0.302–0.789]) and PFS (HR 0.462, [95% CI 0.302–0.706]).

#### The Expression of ApoE Could Not Predict the Benefit From the Adjuvant Chemotherapy for Stage II CRC

Next, we investigated the potential role of ApoE as a predictor of benefit from adjuvant chemotherapy in stage II. In the stage II cohort, 131 patients received 5-FU-based adjuvant chemotherapy, 63 patients with lower rectal cancer underwent radiotherapy (50 Gy) and 112 patients underwent surgery alone. High ApoE expression had a negative impact on survival both in patients who underwent surgery alone (25.5% vs. 43.9%, p = 0.049) and in those who received 5-FU-based chemotherapy (53.3% vs. 73.3%, p = 0.019) (**Figures 4E,F**). We explored the association between ApoE expression and PFS among patients who did or did not receive chemotherapy. However, there was no significant interaction between chemotherapy and high ApoE expression in CRC. Further analysis showed that the benefit observed in the high ApoE expression group was superior to that in the low expression group.

### DISCUSSION

In the present study, we examined ApoE expression profiles and the prognostic value of ApoE in CRC, especially in stage II disease and liver metastasis. We compared the expression level of ApoE in primary lesions, liver metastases and corresponding normal mucosa in three GEO datasets and two cohorts from our center. We found that ApoE expression was significantly higher in CLM than in normal tissue and PC. We hypothesized that genes differentially expressed between primary CRC and liver metastatic tumors may play roles in metastasis or progression and may therefore have prognostic value. We found that the ApoE expression level rose in order from stage II tumors to primary tumors of LMCRC to liver metastases, and that high ApoE expression was associated with shorter PFS in the stage II cohort. We therefore conducted survival analysis based on TCGA data and validated the result in our two cohorts. Survival analysis of the TCGA data demonstrated that ApoE expression was significantly associated with OS and PFS in CRC. Next, survival analysis was performed in the two cohorts to validate the prognostic significance in stage II and metastatic CRC. In both cohorts, higher ApoE expression was independently associated with worse prognosis. In addition, neurological involvement was independently related to PFS in stage II. In liver metastatic CRC, we found that N staging was an independent risk factor for both OS and PFS. Patients should be stratified on the basis of these independent prognostic factors to receive a suitable treatment regimen.

Previous studies have demonstrated that ApoE overexpression is associated with a series of malignant behaviors, and ApoE has been regarded as a prognostic marker in a variety of cancers (Nicoll et al., 2003; Oue et al., 2004; Ito et al., 2006). Related studies have revealed that the activity of ApoE on cancer cells is dual, depending on the tissue, and that ApoE affects several signaling cascades, including by increasing disabled phosphorylation and by activating the ERK1/2 pathway (Hoe et al., 2005; Zheng et al., 2018). At the same time, ApoE could activate the PI3K/AKT/mTOR signaling pathway, which has been confirmed as a critical regulator of tumor progression, including cell–cell adhesion, proliferation, and migration (Thorpe et al., 2015). Meanwhile, aberrant ApoE expression might also lead to the development of CRC. Niemi et al. (2002) found that overexpression of ApoE in the HT29 cell line enhanced cell polarity, which is one potential step of tumor metastasis. Mrkonjic et al. (2009) reported that ApoE-expressing cells induce proliferative signals and inhibit apoptosis in CRC. These findings suggest that ApoE might be a potential predictive marker during the development of CRC. Intriguingly, in our cohorts ApoE was more highly expressed in liver metastasis than in the primary tumor; however, there was no significant difference in the primary lesion between stage II and stage IV according to the TCGA data, because stage IV samples in the TCGA database were not exclusively liver metastases but also included other types of metastatic CRC. Consequently, we suspect that ApoE may be a potential liver metastasis-specific biomarker in CRC, but this assumption remains to be verified in a larger sample.

TABLE 2 | Cox analyses of potential prognostic factors for overall survival in the stage II CRC cohort.

TABLE 3 | Cox analyses of potential prognostic factors for progression-free survival in the stage II CRC cohort.

TABLE 4 | Cox analyses of potential prognostic factors for overall survival in the simultaneous liver metastatic CRC cohort.

TABLE 5 | Cox analyses of potential prognostic factors for progression-free survival in the simultaneous liver metastatic CRC cohort.

However, ApoE turned out to have prognostic rather than predictive value, and it did not seem to be associated with resistance to chemotherapy in stage II. Even so, we demonstrated that stage II CRC with overexpressed ApoE was more prone to recurrence or metastasis and had a worse prognosis. These results suggest that the postoperative follow-up frequency of stage II patients should be increased appropriately according to the ApoE expression level. The study indicated an association between high ApoE expression and MSS (microsatellite stability) status in stage II. Previous studies have shown that, among stage II patients, the prognosis of those with MSI is better than that of those with MSS (Salazar et al., 2011), and this result was also confirmed in our study. This may indicate some interaction between the ApoE expression level and DNA mismatch-repair (MMR) functional status, which needs to be further explored. The results also showed that neoadjuvant therapy for liver metastasis significantly increased the ApoE expression level. We hypothesize that high-dose chemotherapy might lead to abnormal lipid metabolism throughout the body, so that a high level of ApoE was detected in CRC. The alteration of ApoE after neoadjuvant chemotherapy cannot reflect the real expression level in the tumor. Consequently, if ApoE is to be used as a prognostic marker, the effects of preoperative chemotherapy should be taken into account in advance. In addition, we think ApoE could be considered a valuable molecular marker for predicting CRC prognosis, and it might be a potential new therapeutic target in CRC.

One limitation of this study is that we did not detect the three common isoforms, ApoE2, E3 and E4, which arise from amino acid substitutions (Raffai et al., 2001). Different ApoE isoforms binding to the LDL receptor could lead to various biological behaviors in tumors, so research stratified by ApoE phenotype is required. On the other hand, although the TCGA analysis indicated that the ApoE expression level increases with the development of CRC and that ApoE has potential prognostic value in CRC, our validation cohorts consisted only of stage II and liver metastatic CRC patients. In particular, the TCGA data showed that the ApoE expression level in stage I was significantly lower than in the other stages. Whether ApoE could be regarded as a potential biomarker for diagnosis, or whether ApoE plays a critical role in the development from stage I to higher stages, should be further verified. Therefore, we next need to establish CRC cohorts of stage I, stage III and even precancerous lesions in order to further validate the current results.

## CONCLUSION

In this study, we found that ApoE expression was higher in the primary tumor of CRC with liver metastasis than in stage II CRC. A high level of ApoE was an independent prognostic indicator for OS and PFS in stage II and simultaneous liver metastatic CRC.

#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of 7th edition of TNM staging system, the Clinical Research Ethics Committee of Cancer Institute and Hospital, Chinese Academy of Medical Sciences with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Clinical Research Ethics Committee of Cancer Institute and Hospital, Chinese Academy of Medical Sciences.

# AUTHOR CONTRIBUTIONS

ZZ and YG designed the study. ZZ, SZ, XG, ZL, and YG collected the data. ZZ, SZ, XG, and YG analyzed the data. SZ, XG, and YG interpreted the data. MW, ZJ, ZL, HL, and XL sourced the literature. ZZ, SZ, MW, and ZJ wrote the draft. CL and RY edited the manuscript. XW acquired the funding and supervised the whole study.

# FUNDING

This study was supported by the National Key Research and Development Program of the Ministry of Science and Technology of China (2016YFC0905303), CAMS Innovation Fund for Medical Sciences (CIFMS) (2016-I2M-1-001), and Beijing Science and Technology Program (D17110002617004).

#### ACKNOWLEDGMENTS

The authors would like to thank the donors and the Cancer Institute and Hospital, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China for the tumor specimens provided for the study, and acknowledge the efforts of The Cancer Genome Atlas (TCGA) project, the sharing of gene expression microarray data through GEO and the development of GEPIA by Tang et al. (2017). The interpretation of these data is the sole responsibility of the authors.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00650/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhao, Zou, Guan, Wang, Jiang, Liu, Li, Lin, Liu, Yang, Gao and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Whole Exome Sequencing Identifies a Novel Pathogenic RET Variant in Hirschsprung Disease

Wei Wu, Li Lu, Weijue Xu, Jiangbin Liu, Jun Sun, Lulu Zheng, Qingfeng Sheng and Zhibao Lv\*

Department of General Surgery, Shanghai Children's Hospital, Shanghai Jiao Tong University, Shanghai, China

Hirschsprung disease (HSCR) is a birth defect characterized by complete absence of neuronal ganglion cells from a portion of the intestinal tract. To uncover genetic variants contributing to HSCR, we performed whole exome sequencing on seven members of an HSCR family. Using minor allele frequencies (MAFs) from gnomAD, we retained a total of 1,059 rare variants in this family (MAF < 0.1%). Based on the mode of inheritance and pathogenicity scores from bioinformatics tools, we identified an in-frameshift variant, p.Phe147del in RET, as the disease-causing variant. Further analysis revealed that the in-frameshift variant may function by disrupting the glycosylation of RET protein. To our knowledge, this is the first study to report the in-frameshift variant p.Phe147del in RET as responsible for heritable HSCR.

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Li Zhang, East China Normal University, China Jun Jiang, Fudan University, China Xuechao Wan, Northwestern University, United States

#### \*Correspondence:

Zhibao Lv lvzhibao@sohu.com; Zhibao\_Lv@126.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 29 October 2018 Accepted: 31 December 2018 Published: 14 January 2019

#### Citation:

Wu W, Lu L, Xu W, Liu J, Sun J, Zheng L, Sheng Q and Lv Z (2019) Whole Exome Sequencing Identifies a Novel Pathogenic RET Variant in Hirschsprung Disease. Front. Genet. 9:752. doi: 10.3389/fgene.2018.00752

Keywords: whole exome sequencing, Hirschsprung disease, RET variant, minor allele frequencies, bioinformatics

#### INTRODUCTION

Hirschsprung disease (HSCR) is a congenital genetic disorder of the enteric nervous system (ENS) characterized by complete absence of neuronal ganglion cells from a portion of the intestinal tract. The incidence of HSCR is approximately 1 in 5,000 live births and varies among ethnic groups (Parisi and Kapur, 2000). HSCR can be classified into short-segment HSCR (S-HSCR), long-segment HSCR (L-HSCR), and total colonic aganglionosis (TCA) based on the length of the aganglionic segment (Lantieri et al., 2006). Treatment options for HSCR include surgical resection of the aganglionic segment and reconstitution of the intestinal passage after the first year of life, following bridging therapy with colostomy (Bachmann et al., 2015).

The mode of inheritance of HSCR varies from dominant with reduced penetrance or recessive in familial cases to a more complex, non-Mendelian mode in sporadic cases (Amiel et al., 2008). So far, several genes have been found to contribute to HSCR, such as RET (Tomuschat and Puri, 2015), ECE1 (Vohra et al., 2007), EDN3 (Sanchez-Mejias et al., 2010), EDNRB (Sanchez-Mejias et al., 2010), GDNF (Eketjall and Ibanez, 2002), NRTN (Doray et al., 1998), SOX10 (Lecerf et al., 2014), PHOX2B (Fernandez et al., 2013), and KIAA1279 (Amiel et al., 2008). With the exception of the RET proto-oncogene, which is responsible for approximately 50% of familial and up to 15% of sporadic cases, the other HSCR genes account for only a small proportion of cases (Amiel et al., 2008). RET encodes a transmembrane tyrosine kinase receptor that, during development of specific neuronal cell lineages, transduces extracellular signals for cell growth and differentiation. The mechanism by which loss-of-function mutations in RET cause HSCR is highly dependent on their location in the protein (So et al., 2011). For example, mutations affecting the intracytoplasmic domain could impair the kinase activity required for proper signal transduction, altering either the catalytic function, the stability of the enzyme structure or the binding of transduction effectors (Hyndman et al., 2013). In contrast, mutations of the extracellular domain (ECD) can affect RET function through a number of different mechanisms, such as lack of ligand binding and impairment of protein folding (Kjaer and Ibanez, 2003). However, these mechanisms have mostly been revealed for missense, nonsense, frameshift, and splicing mutations, and the pathogenicity of in-frameshift variants has been underestimated in previous studies.

In the present study, we collected an HSCR family with four affected and three unaffected members. To uncover novel pathogenic genes or variants, we performed whole exome sequencing on the seven family members. Using filtering steps based on minor allele frequency (MAF), the mode of inheritance, co-segregation, and pathogenicity scores from bioinformatics tools, we identified a novel in-frameshift variant, p.Phe147del in RET, as the disease-causing variant, which may function by disrupting RET N-glycosylation. To our knowledge, this is the first study to report p.Phe147del in RET as a disease-causing variant for heritable HSCR.

## RESULTS

#### Clinical Features of the HSCR Cases

The proband (III-2) was a 2-month-old Chinese boy who had symptoms of abdominal distension and vomiting after birth (**Figure 1A**). The proband was diagnosed with Hirschsprung disease by barium enema examination (**Figures 1B–D**), which showed the typical signs of congenital megacolon. In detail, microscopic image-based histologic examination showed that ganglion cells were present in the dilated segment but were not detected in the narrow segment of the mucosa of the intestinal wall and the myenteric nerve plexus. Unmyelinated nerve fibers and Schwann cells were increased in the narrow segment. A further family history survey revealed that the proband's father (II-1), older brother (III-1), and cousin (III-3) had similar symptoms (**Figure 1A**). The proband, and his older brother and cousin, were diagnosed with short-form and long-form aganglionosis, respectively, based on the length of aganglionosis. Unfortunately, the length of aganglionosis was unclear due to loss of the medical record. All patients were cured by radical surgery.

#### Identification of Rare Variants in Coding or Splicing Regions

To uncover the genetic variants contributing to HSCR, we performed whole exome sequencing on the seven family members (II-1, II-2, II-3, II-4, III-1, III-2, and III-3). Variants were called with VarScan (Koboldt et al., 2012) in trio-based mode. The steps of the genetic variant analysis are illustrated in **Figure 2**. In total, we detected 239,225 variants at which at least one family member had an allele that differed from the reference genome, including 217,172 substitutions and 22,053 indels (insertions and deletions) (**Figure 2A**). For the three affected boys (III-1, III-2, and III-3), the Mendelian error rates were estimated at about 1.10, 1.43, and 1.33%, respectively, suggesting that the variants called by the trio-based strategy were highly reliable.
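The Mendelian error check mentioned above can be illustrated with a minimal sketch. It counts sites where the child's genotype cannot be formed from one allele of each parent. The genotype encoding (pairs of allele strings), the function names, and the example sites are assumptions made for illustration; this is not the authors' code, nor VarScan's internal logic.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """Return True if the child genotype can be formed from one allele of each parent.

    Genotypes are unordered pairs of alleles, e.g. ("A", "G").
    """
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


def mendelian_error_rate(trio_genotypes):
    """Fraction of sites whose (father, mother, child) genotypes violate Mendelian inheritance."""
    trio_genotypes = list(trio_genotypes)
    if not trio_genotypes:
        return 0.0
    errors = sum(not is_mendelian_consistent(f, m, c) for f, m, c in trio_genotypes)
    return errors / len(trio_genotypes)


# Hypothetical example sites (father, mother, child); not data from the study.
sites = [
    (("A", "A"), ("A", "G"), ("A", "G")),  # consistent
    (("A", "A"), ("A", "A"), ("A", "G")),  # error: G is absent from both parents
    (("C", "T"), ("C", "T"), ("T", "T")),  # consistent
    (("C", "C"), ("T", "T"), ("C", "C")),  # error: child must be C/T
]
print(f"Mendelian error rate: {mendelian_error_rate(sites):.2%}")
```

A low rate over all called sites of a trio, as reported here, is commonly read as a sign that genotype calls are consistent within the family.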

FIGURE 1 | (A) Pedigree of the three-generation Chinese family with four affected individuals. Squares indicate males, and circles represent females. Black and white symbols represent affected and unaffected individuals, respectively. The proband is indicated by an arrow. (B–D) Colon X-ray images of the proband in three different views.

Because heritable Hirschsprung disease is rare in the population, the pathogenic variants are likely to be rare in the healthy population. To identify rare variants, we first obtained their MAFs from the gnomAD database (Lek et al., 2016). After excluding non-coding and synonymous variants, we retained a total of 1,059 rare variants in this family (MAF < 0.1%). Because Hirschsprung disease can be inherited in autosomal dominant and recessive patterns (Amiel et al., 2008), both autosomal dominant and recessive variants were considered. Among the homozygous and biallelic variants, no recessive variants were shared by the four patients. Notably, under the hypothesis of autosomal dominant inheritance, the unaffected female, II-4, may be a carrier of the pathogenic variant due to incomplete penetrance. Finally, we identified 18 dominant variants (**Figure 2B**) that co-segregated in the three boys, including 14 missense, 1 nonsense, 2 frameshift, and 1 in-frameshift variants.
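The filtering logic of this paragraph (rarity, coding consequence, and dominant co-segregation) can be sketched as follows. The record layout, the rule that unaffected non-carriers must lack the variant, and the example variants are assumptions for illustration only; the authors' actual implementation is not described at this level of detail.

```python
AFFECTED = {"II-1", "III-1", "III-2", "III-3"}
ASSUMED_CARRIER = {"II-4"}      # unaffected, treated as a carrier under incomplete penetrance
NON_CARRIERS = {"II-2", "II-3"}  # assumption: required to lack the variant
MAF_CUTOFF = 0.001               # MAF < 0.1%
KEPT_EFFECTS = {"missense", "nonsense", "frameshift", "in-frameshift", "splicing"}


def is_dominant_candidate(variant):
    """Check rarity, coding/splicing consequence and dominant co-segregation.

    variant["genotypes"] maps a sample ID to its number of alternate alleles (0, 1 or 2).
    """
    genotypes = variant["genotypes"]
    rare = variant["maf"] < MAF_CUTOFF
    coding = variant["effect"] in KEPT_EFFECTS
    carried = all(genotypes[s] >= 1 for s in AFFECTED | ASSUMED_CARRIER)
    absent = all(genotypes[s] == 0 for s in NON_CARRIERS)
    return rare and coding and carried and absent


# Hypothetical variants, not taken from the study's data.
segregating = {
    "maf": 0.0,
    "effect": "in-frameshift",
    "genotypes": {"II-1": 1, "II-2": 0, "II-3": 0, "II-4": 1,
                  "III-1": 1, "III-2": 1, "III-3": 1},
}
common = dict(segregating, maf=0.05)
print(is_dominant_candidate(segregating), is_dominant_candidate(common))
```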

# Identification of Pathogenic Candidate for HSCR

To identify pathogenic variants, we evaluated the pathogenicity of the 18 dominant variants using bioinformatics tools, such as SIFT (Ng and Henikoff, 2003), PolyPhen (Adzhubei et al., 2013), MutationTaster (Schwarz et al., 2010), M-CAP (Jagadeesh et al., 2016), DDIG-in (Zhao et al., 2013), and SIFT-indel (Hu and Ng, 2013) (**Supplementary Table S1**).

The 14 missense variants were filtered using the pathogenic scores in SIFT (≤0.05), PolyPhen (≥0.957), MutationTaster ('disease causing'), and M-CAP (>0.025), and CAPN9, GLYCTK, and DRD5 were recognized. Moreover, RET, FANCI, and CALN1 were identified by the two pathogenicity prediction algorithms for indels, DDIG-in and SIFT-indel. The nonsense variant in NPHP3 was recognized as potentially pathogenic by MutationTaster and DDIG-in. In summary, seven genes, including CAPN9, GLYCTK, DRD5, NPHP3, FANCI, CALN1, and RET, were pathogenic candidates by the bioinformatics tools.
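The score cut-offs quoted above can be applied as in the following sketch. Whether the authors required all four tools to agree is not stated in the text, so the conjunction used here is an assumption, as are the dictionary keys and the example scores.

```python
def missense_predicted_pathogenic(scores):
    """Apply the cut-offs quoted in the text to one missense variant.

    Assumption: a variant passes only if all four predictors agree.
    """
    return (
        scores["sift"] <= 0.05
        and scores["polyphen"] >= 0.957
        and scores["mutationtaster"] == "disease causing"
        and scores["mcap"] > 0.025
    )


# Hypothetical score sets for illustration.
passing = {"sift": 0.01, "polyphen": 0.99, "mutationtaster": "disease causing", "mcap": 0.30}
failing = {"sift": 0.20, "polyphen": 0.99, "mutationtaster": "disease causing", "mcap": 0.30}
print(missense_predicted_pathogenic(passing), missense_predicted_pathogenic(failing))
```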

To further evaluate the relationship between the variants and HSCR, we performed a literature review of the seven pathogenic candidates. One of them was the in-frameshift variant p.Phe147del in RET, the most commonly observed pathogenic gene for HSCR. The remaining genes, such as GLYCTK (Sass et al., 2010), DRD5 (Daly et al., 1999), NPHP3 (Bergmann et al., 2008), and FANCI (Mehta and Tolar, 1993), are well-characterized pathogenic genes for other rare diseases with a recessive mode of inheritance or for multi-gene diseases, such as D-glyceric aciduria, attention deficit-hyperactivity disorder, Meckel syndrome, and Fanconi anemia. These genes were excluded due to the recessive inheritance of their associated diseases. To further clarify the implications of CALN1 and CAPN9 in HSCR, we mapped RET, CALN1 and CAPN9, together with known HSCR pathogenic genes (ECE1, EDN3, EDNRB, GDNF, NRTN, SOX10, PHOX2B, and KIAA1279), to the protein–protein interaction (PPI) network curated in the STRING database (Szklarczyk et al., 2015). CALN1 and CAPN9 did not connect with any of these known genes, directly or indirectly, within five nodes, suggesting that the two genes may not be pathogenic for HSCR (**Figure 3**). Only RET, itself a known HSCR gene, connected with the known pathogenic genes in the PPI network. The result indicated that the in-frameshift variant p.Phe147del in RET was pathogenic for the HSCR family.
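The connectivity test described above can be sketched as a bounded breadth-first search. The toy graph below uses placeholder node names and made-up edges; it is not STRING data. Interpreting "within five nodes" as a path of at most five edges is also an assumption.

```python
from collections import deque


def reachable_within(graph, source, targets, max_steps=5):
    """Return the targets reachable from source through at most max_steps edges."""
    targets = set(targets)
    seen = {source}
    queue = deque([(source, 0)])
    found = set()
    while queue:
        node, dist = queue.popleft()
        if node in targets and node != source:
            found.add(node)
        if dist == max_steps:
            continue  # do not expand beyond the allowed path length
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return found


# Toy undirected network with placeholder names (hypothetical edges).
graph = {
    "candidate_A": ["known_1"],
    "known_1": ["candidate_A", "known_2"],
    "known_2": ["known_1"],
    "candidate_B": ["other_1"],
    "other_1": ["candidate_B"],
}
known = {"known_1", "known_2"}
print(sorted(reachable_within(graph, "candidate_A", known)))
print(sorted(reachable_within(graph, "candidate_B", known)))
```

A candidate for which the returned set is empty would be treated, under this heuristic, as unconnected to the known disease genes.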

# Potential Impact of the In-Frameshift Variant on RET Protein Function

As illustrated in **Figure 4A**, the ECD of RET is composed of four cadherin-like domains (CLD1-4) and a cysteine-rich domain (CRD). The in-frameshift variant p.Phe147del is located within CLD1. A previous study (Leon et al., 2012) reported that a disease-causing mutation in RET, p.Val145Gly, which is close to p.Phe147del, could disrupt RET N-glycosylation, hinting that the in-frameshift variant may also act by disrupting RET N-glycosylation. To determine the consequence of p.Phe147del for RET glycosylation, we predicted the N-glycosites of wild-type and p.Phe147del RET proteins using GlycoEP (Chauhan et al., 2013), a webserver for predicting potential N- and O-glycosites in protein sequences. The N-glycosites of p.Val145Gly were also predicted as a positive control. Asn151, Asn834 and Asn1084 were predicted to be glycosylated in wild-type RET protein (GlycoEP score > 0.85). However, Asn151, the glycosylated site closest to the two mutations, p.Phe147del and p.Val145Gly, was predicted not to be glycosylated in either mutant RET protein (**Table 1**). The result indicated that the in-frameshift variant p.Phe147del could function by disrupting RET N-glycosylation.
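For orientation only, the sketch below scans a sequence for the canonical N-glycosylation sequon (Asn-X-Ser/Thr, with X not Pro) and applies a single-residue in-frame deletion. It does not reproduce GlycoEP, which is a trained predictor; the peptide is a toy string, not the RET sequence, and the function names are hypothetical.

```python
import re

# Lookahead so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(protein):
    """1-based positions of Asn residues that start an N-X-S/T sequon (X != Pro)."""
    return [match.start() + 1 for match in SEQUON.finditer(protein)]


def delete_residue(protein, position):
    """Remove the residue at a 1-based position (an in-frame single-residue deletion)."""
    return protein[:position - 1] + protein[position:]


toy = "MKFAVNGTLQ"  # toy peptide, NOT the RET sequence
mutant = delete_residue(toy, 3)  # delete the Phe at position 3
print(sequon_positions(toy), sequon_positions(mutant))
```

Note that a motif-only scan still reports the sequon after an upstream deletion, with its position shifted by one residue. It cannot capture the sequence-context effect that a predictor such as GlycoEP scores.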

# Validation of Pathogenic Variant by Sanger Sequencing

We validated the two variants by Sanger sequencing (**Figure 4B**). Our results demonstrated that the four patients carried the in-frameshift variant p.Phe147del in RET. Moreover, we confirmed our hypothesis that the carrier of the pathogenic variant, II-4, was unaffected due to incomplete penetrance. In accordance with the genotyping by whole exome sequencing, the other family members, II-2 and II-3, had wild-type genotypes. The result indicated that whole exome sequencing is efficient for identifying pathogenic variants in monogenic inherited diseases.

unaffected carrier, and two unaffected members with wild genotype are validated by Sanger sequencing. The deleted three bases were enclosed by the red box.

# MATERIALS AND METHODS

#### Ethics Statement

The present study was approved by the Ethics Committee of the Children's Hospital of Shanghai Jiao Tong University, Shanghai, China, and was conducted according to the principles expressed in the Declaration of Helsinki. Participants and/or their legal guardians involved in this study gave a written informed consent prior to inclusion in the study.

# Sample Collection

The present study included DNA samples from four patients (II-1, III-1, III-2, and III-3) and three unaffected family members (II-2, II-3, and II-4), as shown in **Figure 1A**. Genomic DNA samples were obtained with written informed consent. The TIANamp Blood DNA Kit (Tiangen Biotech, Co., Ltd., Beijing) was used to extract genomic DNA from blood samples.



+, glycosylated site (GlycoEP score > 0.85).

−, non-glycosylated site (GlycoEP score < 0.85).

## Reads Mapping and Variants Calling

Paired-end reads of 300 bp (150 bp at each end) from whole exome sequencing were mapped to the UCSC human reference genome (GRCh37/hg19 assembly) using BWA (Li and Durbin, 2010) version 0.7.7-r441 in 'mem' mode with default options, followed by removal of PCR duplicates and low-quality reads (BaseQ < 20). The BAM files were then sorted and indexed with samtools (Li et al., 2009) and converted to three-sample mpileup format for each parents-offspring trio. Variant calling was performed with the trio mode of VarScan (Koboldt et al., 2012) software<sup>1</sup>, version 2.3.7.
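The order of the steps above can be sketched as a small helper that only builds command strings for one trio and does not run them. File names are hypothetical, the duplicate-removal and quality-filtering steps are omitted because the text does not name the tool used for them, and the father, mother, child order assumed for the mpileup should be checked against the VarScan documentation.

```python
def trio_calling_commands(reference, trio_fastqs, out_prefix):
    """Build shell command strings for one trio; nothing is executed.

    trio_fastqs: list of (sample_name, fastq_1, fastq_2), assumed to be in
    father, mother, child order.
    """
    commands = []
    bams = []
    for name, fq1, fq2 in trio_fastqs:
        bam = f"{name}.sorted.bam"
        commands.append(f"bwa mem {reference} {fq1} {fq2} > {name}.sam")  # align reads
        commands.append(f"samtools sort -o {bam} {name}.sam")            # sort alignments
        commands.append(f"samtools index {bam}")                         # index sorted BAM
        bams.append(bam)
    mpileup = f"{out_prefix}.mpileup"
    # Three-sample pileup for the trio, then VarScan in trio mode.
    commands.append(f"samtools mpileup -f {reference} {' '.join(bams)} > {mpileup}")
    commands.append(f"java -jar VarScan.jar trio {mpileup} {out_prefix}")
    return commands


# Hypothetical file names for illustration.
commands = trio_calling_commands(
    "hg19.fa",
    [("father", "father_1.fq", "father_2.fq"),
     ("mother", "mother_1.fq", "mother_2.fq"),
     ("child", "child_1.fq", "child_2.fq")],
    "trio1",
)
for command in commands:
    print(command)
```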

#### Variant Annotation and Prioritization

We used the ANNOVAR (Wang et al., 2010) software to annotate variants with the MAF in the gnomAD (Lek et al., 2016) database, pathogenicity scores from SIFT (Ng and Henikoff, 2003), PolyPhen (Adzhubei et al., 2013), MutationTaster (Schwarz et al., 2010), and M-CAP (Jagadeesh et al., 2016), the RefSeq gene, and the consequence on the protein, such as missense, frameshift, in-frameshift, stop-gain, and splicing. For the indels, we used DDIG-in (Zhao et al., 2013) and SIFT-indel (Hu and Ng, 2013) to evaluate their pathogenicity. Rare variants (MAF < 0.1% in the Asian cohort) were selected based on the gnomAD (Lek et al., 2016) database.

## Prediction of Glycosylated Sites

The glycosylated sites of wild-type RET and of the p.Phe147del and p.Val145Gly mutants were predicted with the standard predictor of GlycoEP (Chauhan et al., 2013)<sup>2</sup>, a webserver for predicting potential N- and O-glycosites in protein sequences. The prediction was based on Average Surface Accessibility (ASA+BPP) with a threshold score of 0.85.

# DISCUSSION

Hirschsprung disease is a birth defect characterized by complete absence of neuronal ganglion cells from a portion of the intestinal tract. In the present study, we performed whole exome sequencing on seven members of the HSCR family to identify the disease-causing gene. Microscopic image-based histologic examination of the proband's diseased tissue also showed the absence of neuronal ganglion cells in the narrow segment of the mucosa of the intestinal wall and the myenteric nerve plexus.

To uncover genetic variants contributing to HSCR, we identified hundreds of thousands of variants in seven members of the HSCR family using whole exome sequencing data. The low Mendelian error rate demonstrated that trio-based variant calling is an effective strategy for family-based sequencing. Because heritable Hirschsprung disease is rare in the population, the disease-causing variants are likely to be rare in the healthy population. Using MAFs from gnomAD, we retained a total of 1,059 rare variants in this family (MAF < 0.1%). In addition to the MAF in the healthy population, the dominant and recessive inheritance modes of HSCR were also considered in this family. In general, autosomal recessive genes may be altered by two compound heterozygous variants or by one homozygous variant. However, we did not detect any recessive variants under this assumption. Under the assumption of autosomal dominant inheritance, the unaffected female, II-4, must be a carrier of the pathogenic variant with incomplete penetrance, in accordance with previous studies (Parisi and Kapur, 2000; Belknap, 2002). Moreover, the pathogenic variant must co-segregate in the four patients and the carrier. The filtering steps based on MAF, mode of inheritance and co-segregation excluded a large number of non-pathogenic variants.

To accurately identify the pathogenic variants, we further performed pathogenicity analysis on the 18 dominant variants. Generally, different algorithms are used to evaluate the pathogenicity of single nucleotide substitutions and of indels. Specifically, among the single nucleotide substitutions, the missense variants were evaluated by SIFT, PolyPhen, MutationTaster and M-CAP, while the nonsense variants were evaluated by MutationTaster and DDIG-in. On the other hand, we evaluated the pathogenicity of indels, both frameshift and in-frameshift variants, using DDIG-in (Zhao et al., 2013) and SIFT-indel (Hu and Ng, 2013). Among the 18 dominant variants, seven genes or variants, including CAPN9, GLYCTK, DRD5, NPHP3, FANCI, CALN1, and RET, were predicted as pathogenic candidates by these algorithms. To our knowledge, GLYCTK (Sass et al., 2010), DRD5 (Daly et al., 1999), NPHP3 (Bergmann et al., 2008), and FANCI (Mehta and Tolar, 1993) are well-characterized pathogenic genes for other rare diseases with a recessive mode of inheritance or for multi-gene diseases, such as D-glyceric aciduria, attention deficit-hyperactivity disorder, Meckel syndrome, and Fanconi anemia. These genes were excluded due to the recessive inheritance of their associated diseases. In addition, CAPN9 and CALN1 are thought to be associated with gastric cancer (Yoshikawa et al., 2000) and schizophrenia (Schizophrenia Psychiatric Genome-Wide Association Study (GWAS) Consortium, 2011), respectively. To further narrow down the list of genes that may contribute to HSCR, we performed a literature review and mapped the pathogenic candidates to the protein–protein interaction network. The literature review and PPI network analysis excluded all six genes other than RET. Notably, p.Phe147del in RET is a novel pathogenic variant in HSCR according to the ClinVar database (Landrum et al., 2018). Finally, the in-frameshift variant p.Phe147del in RET, the most commonly observed pathogenic gene for HSCR, was identified as the pathogenic variant.

<sup>1</sup>http://varscan.sourceforge.net/

<sup>2</sup>http://crdd.osdd.net/raghava/glycoep/

To further examine the functional impact of the in-frameshift variant p.Phe147del in RET on the occurrence of the disease, we mapped the variant to the RET protein structure. It is well recognized that variants located within specific functional domains or post-translational modification sites can alter protein conformation, protein–ligand binding, or protein–protein interaction. In this study, the in-frameshift variant p.Phe147del in RET was located within the CLD1 domain. We consulted the UniProt database and found that the five amino acids adjacent to the in-frameshift variant were annotated only as glycosylated, not as phosphorylated or methylated. Further literature investigation agreed with the UniProt annotations. RET encodes a transmembrane receptor composed of an ECD, a transmembrane domain, and an intracellular tyrosine kinase domain; the CLD1 domain belongs to the ECD. Further analysis of the consequence of the p.Phe147del variant for the RET protein revealed that this in-frameshift variant may disrupt glycosylation of RET protein, which may be the cause of HSCR in this family.

Compared with previous studies, our study focused on cases of familial HSCR. The pathogenic genes for familial HSCR, including RET, EDNRB and EDN3, exhibit high penetrance. However, in sporadic cases of Hirschsprung disease, such as those reported by Tang et al. (2018), some novel pathogenic or susceptibility genes, such as PLD1, had reduced penetrance, indicating that the penetrance of pathogenic genes is higher in familial than in sporadic Hirschsprung disease. Sporadic Hirschsprung disease may be caused by the additive effect of susceptibility genes and environmental factors.

The lack of experimental validation is a major limitation of this research. However, we conducted systematic bioinformatics analysis to assess the pathogenicity and functionality of the variant in this family. To our knowledge, this is the first study to report the in-frameshift variant p.Phe147del in RET as responsible for heritable HSCR. In conclusion, the systematic analysis in this study not only improves understanding of the causes of this disease but is also useful for clinical and prenatal diagnosis.

#### AUTHOR CONTRIBUTIONS

ZL and WW collaborated in the conception and design of the present study. LZ and LL collected and assembled the data. WW and QS were involved in data analysis and interpretation. All authors contributed to writing the manuscript and approved the final version.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00752/full#supplementary-material

TABLE S1 | The evaluation of the pathogenicity of the 18 dominant variants using bioinformatics tools.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wu, Lu, Xu, Liu, Sun, Zheng, Sheng and Lv. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

fgene-09-00752 January 11, 2019 Time: 16:8 # 8

# FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier

Victor Tkachev<sup>1</sup>, Maxim Sorokin<sup>1,2</sup>, Artem Mescheryakov<sup>3</sup>, Alexander Simonov<sup>1</sup>, Andrew Garazha<sup>1</sup>, Anton Buzdin<sup>1,2,4</sup>, Ilya Muchnik<sup>5</sup> and Nicolas Borisov<sup>1,4</sup>\*

<sup>1</sup> Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA, United States, <sup>2</sup> Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia, <sup>3</sup> Yandex N.V. Corporation, Moscow, Russia, <sup>4</sup> I.M. Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia, <sup>5</sup> Hill Center, Rutgers University, Piscataway, NJ, United States

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Guilherme De Alencar Barreto, Universidade Federal do Ceará, Brazil Firoz Ahmed, Jeddah University, Saudi Arabia

#### \*Correspondence:

Nicolas Borisov borisov@oncobox.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 01 September 2018 Accepted: 21 December 2018 Published: 15 January 2019

#### Citation:

Tkachev V, Sorokin M, Mescheryakov A, Simonov A, Garazha A, Buzdin A, Muchnik I and Borisov N (2019) FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier. Front. Genet. 9:717. doi: 10.3389/fgene.2018.00717

Here, we propose a heuristic technique of data trimming for SVM termed FLOating Window Projective Separator (FloWPS), tailored for personalized predictions based on molecular data. This procedure can operate with high throughput genetic datasets such as gene expression or mutation profiles. Its application prevents SVM from extrapolating by excluding non-informative features. FloWPS requires training on data for individuals with known clinical outcomes to create a clinically relevant classifier. The genetic profiles linked with the outcomes are split as usual into training and validation datasets. The unique property of FloWPS is that irrelevant features in the validation dataset that do not have a significant number of neighboring hits in the training dataset are removed from further analyses. Next, similarly to the k nearest neighbors (kNN) method, for each point of the validation dataset, FloWPS takes into account only the proximal points of the training dataset. Thus, for every point of the validation dataset, the training dataset is adjusted to form a floating window. FloWPS performance was tested on ten gene expression datasets for 992 cancer patients either responding or not responding to different types of chemotherapy. We confirmed experimentally, by leave-one-out cross-validation, that FloWPS significantly increases the quality of a classifier built on the classical SVM in most of the applications, particularly for polynomial kernels.

Keywords: bioinformatics, machine learning, oncology, gene expression, support vector machines, personalized medicine

**Abbreviations:** ALL, acute lymphoblastic leukemia; AML, acute myelogenous leukemia; ASCT, allogeneic stem cell transplantation; AUC, area under curve; FDR, false discovery rate; FloWPS, floating window projective separator; FP, false positive; FN, false negative; GEO, gene expression omnibus; GSE, GEO series; HER2, human epidermal growth factor receptor 2; kNN, k nearest neighbors; MCC, Matthews correlation coefficient; mRNA, messenger ribonucleic acid; NGS, nextgeneration sequencing; PC, principal component; PCA, principal component analysis; ROC, receiver operating characteristic; SVM, support vector machine; TN, true negative; TP, true positive; VTD, velcade, thalidomide and dexamethasone.

# INTRODUCTION

fgene-09-00717 January 14, 2019 Time: 16:17 # 2

The support vector machine (SVM) is one of the most popular machine learning methods in biomedical sciences, with constantly growing impact and more than 11,000 citations in the PubMed-indexed literature<sup>1</sup>, of which ∼2,300 are from 2017 and the first 6 months of 2018 alone. This method has been successfully applied to a wide variety of biomedical problems, such as searching for Dicer RNase cleavage sites on pre-miRNA (Ahmed et al., 2013), prediction of miRNA guide strands (Ahmed et al., 2009a), identification of poly(A) signals in genomic DNA (Ahmed et al., 2009b), and finding conformational B-cell epitopes in antigens by nucleotide sequence (Ansari and Raghava, 2010). More recent developments include drug design according to physicochemical properties (Yosipof et al., 2018), learning on transcriptomic profiles for age recognition (Mamoshina et al., 2018), and predictions of drug toxicities and other side effects (Zhang et al., 2018).

The performance quality of classifiers based on these methods may reach 0.80 or higher for metrics such as ROC AUC<sup>2</sup> and/or accuracy rate, e.g., for problems of age recognition (Mamoshina et al., 2018) and drug compound selection (Yosipof et al., 2018). However, although generally helpful, the SVM approach frequently demonstrates insufficient performance when separating groups of patients with different clinical outcomes (Mulligan et al., 2007; Ray and Zhang, 2009; Babaoglu et al., 2010; Kim et al., 2018). These failures were most likely caused by an insufficient number of preceding clinical cases, which provokes overtraining of all machine learning algorithms. In particular, the sparsity of training points in the feature space leads to frequent extrapolations, and the SVM method is known to be highly vulnerable to such conditions (Arimoto et al., 2005; Balabin and Lomakina, 2011; Balabin and Smirnov, 2012; Betrie et al., 2013).

In order to increase the performance of SVM for distinguishing between clinically relevant features, such as degrees of response to cancer therapies, we propose here a new method for data trimming, termed FloWPS, that generalizes the SVM technique by precluding extrapolation in the feature space. FloWPS acts by selecting for further analysis only those features that lie within the intervals of data projections from the training dataset. This approach avoids extrapolation in favor of interpolation and thus increases the prediction quality. FloWPS in a sense combines two methods, SVM and kNN (Altman, 1992), where kNN plays a particular role in extracting informative features. The idea of combining feature extraction methods with SVM is well known (Tan and Gilbert, 2003; Kourou et al., 2015; Tan, 2016; Liu et al., 2017; Tarek et al., 2017). The approach proposed in this paper is nevertheless novel, at least because its selection is performed separately for every single point available for prediction.
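The two trimming steps described above can be sketched for a single query point as follows. This is a minimal illustration in plain Python, not the published FloWPS implementation: the parameters `m` (minimum number of training values required on each side of the query value) and `k` (number of nearest training points kept) are hypothetical names, Euclidean distance is an assumption, and the SVM that would then be trained on the trimmed window is omitted.

```python
import math


def trim_features(train, point, m=1):
    """Indices of features for which at least m training values lie on each side of the point."""
    kept = []
    for j, value in enumerate(point):
        below = sum(row[j] <= value for row in train)
        above = sum(row[j] >= value for row in train)
        if below >= m and above >= m:  # the point is interpolated, not extrapolated
            kept.append(j)
    return kept


def floating_window(train, point, m=1, k=3):
    """Return (kept feature indices, indices of the k nearest training rows in the trimmed space)."""
    features = trim_features(train, point, m)

    def distance(row):
        return math.sqrt(sum((row[j] - point[j]) ** 2 for j in features))

    order = sorted(range(len(train)), key=lambda i: distance(train[i]))
    return features, order[:k]


# Hypothetical training rows (two features) and one query point.
train = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [4.0, 13.0]]
point = [2.5, 20.0]  # the second feature lies outside the training range
features, neighbors = floating_window(train, point, m=1, k=2)
print(features, neighbors)
```

Because the window is recomputed for every query point, each prediction would use its own subset of features and training samples.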

We tested FloWPS on ten published gene expression datasets comprising a total of 992 cancer patients treated with different types of chemotherapy and with known clinical outcomes. In all cases, the classifiers built using FloWPS outperformed standard SVM classifiers.

# RESULTS

#### Data Sources and Feature Selection

In this study, we investigated gene expression features associated with responses to chemotherapy. The gene expression profiles were extracted from the datasets summarized in **Table 1**. The clinical outcome information related to the response to different chemotherapy regimens and was linked with high throughput gene expression profiles for the individual patients.

Each patient was primarily labeled as either a responder or a non-responder to a treatment. For all the datasets taken from the GEO repository, we used the response criteria formulated in the original papers that first published these data. Namely, for two breast cancer datasets, GSE25066 (Hatzis et al., 2011; Itoh et al., 2014) and GSE41998 (Horak et al., 2013), we considered partial responders as responders. For the first multiple myeloma dataset, GSE9782 (Mulligan et al., 2007), we took the (non)responder classification used by the authors, where patients with complete and partial response were annotated as responders, and those with no change and progressive disease as non-responders. For three other multiple myeloma datasets, GSE39753 (Chauhan et al., 2012), GSE68871 (Terragna et al., 2016), and GSE55145 (Amin et al., 2014), we considered complete, near-complete and very good partial responders as responders, and partial, minor and worse responders as non-responders. For the datasets of pediatric Wilms kidney tumor, ALL and AML, extracted from the TARGET gene expression repository of the National Cancer Institute (Goldman et al., 2015), the cases were classified according to the distribution of the event-free survival time, which appeared to have two modes with different slopes (**Supplementary Figure S1**).

To preclude any possible bias that may affect the performance of machine-learning classifiers due to unequal representation of samples in the two classes (clinical responders and non-responders), the numbers of responding and non-responding cases were equalized within each dataset. Equalization was done by taking the whole of the smaller class and then randomly selecting the same number of samples from the bigger class. Thus, each resulting dataset contained equal numbers of cases classified as responders and non-responders.
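The equalization step can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the authors' code: the function name `balance_classes`, the 1/0 label encoding and the fixed random seed are hypothetical choices made here.

```python
import random

def balance_classes(labels, seed=0):
    """Return sample indices with equal numbers of responders (1) and non-responders (0).

    The smaller class is kept in full; the bigger class is randomly subsampled
    to the same size. The seed is an assumption added for reproducibility.
    """
    responders = [i for i, y in enumerate(labels) if y == 1]
    non_responders = [i for i, y in enumerate(labels) if y == 0]
    smaller, bigger = sorted((responders, non_responders), key=len)
    rng = random.Random(seed)
    return sorted(smaller + rng.sample(bigger, len(smaller)))

# Toy labels (invented): 3 responders and 7 non-responders.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
balanced = balance_classes(labels)
```

With these toy labels, all three responders are kept and three of the seven non-responders are drawn at random.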

To engineer a plausible feature space, where the SVM can be applied efficiently, we proposed to select from tens of thousands of individual gene expression features only a few that produce a good separation of clinical responders from non-responders. To do so, for every dataset under investigation we selected its particular top 30 genes, whose expression levels taken one by one had the highest ROC AUC values for distinguishing responder and non-responder profiles. We made a number of

<sup>1</sup>This is the result of a PubMed query https://www.ncbi.nlm.nih.gov/pubmed/?term=support+vector+machine\_

<sup>2</sup>The ROC (receiver operating characteristic) curve is a widely used graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The area under the ROC curve, called ROC AUC, or simply AUC, is routinely employed for assessing a classifier's quality.

#### TABLE 1 | Clinically annotated gene expression datasets.

fgene-09-00717 January 14, 2019 Time: 16:17 # 3


top informative features equal to 30 because the usual number of samples in the considered datasets was not lower than 50 (a simple heuristic for the number of degrees of freedom). These 30 top marker genes, and the response statuses (100 for a responder, 0 for a non-responder) for all selected patients from all datasets, are listed in **Supplementary Table S1**.

To make the feature selection more robust, a leave-one-out procedure was performed for each dataset of, say, N samples. Each individual sample was removed from the investigation one at a time, so that N subdatasets, each having N-1 individuals, were generated. For each subdataset, the ROC AUC between responders and non-responders was calculated for each gene. The genes were then sorted according to their ROC AUC, and the top 30 were selected for each subdataset. The final list of such core informative genes was generated as the intersection of the top 30 genes selected for all N subdatasets. For every dataset under investigation, these final core sets are listed in **Supplementary Table S2**; the number of core marker genes is also shown in **Table 1**.
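The single-gene ranking and the leave-one-out intersection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary layout of the expression data and the toy values are hypothetical. The AUC is computed here as the fraction of responder/non-responder pairs in which the responder has the higher expression, which is equivalent to the area under the ROC curve.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a responder (1) scores above a non-responder (0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top_genes(expression, labels, n_top):
    """expression maps gene name -> list of per-sample values; rank genes by single-gene AUC."""
    ranked = sorted(expression, key=lambda g: roc_auc(expression[g], labels), reverse=True)
    return ranked[:n_top]

def core_genes(expression, labels, n_top=30):
    """Intersect the top-n gene lists obtained after removing each sample in turn."""
    n = len(labels)
    core = None
    for left_out in range(n):
        keep = [i for i in range(n) if i != left_out]
        sub_expr = {g: [v[i] for i in keep] for g, v in expression.items()}
        sub_labels = [labels[i] for i in keep]
        selected = set(top_genes(sub_expr, sub_labels, n_top))
        core = selected if core is None else core & selected
    return core

# Toy data (invented): three genes, three responders followed by three non-responders.
expression = {
    "GENE_A": [5, 6, 7, 1, 2, 3],
    "GENE_B": [1, 5, 2, 6, 3, 4],
    "GENE_C": [3, 3, 3, 3, 3, 3],
}
labels = [1, 1, 1, 0, 0, 0]
core = core_genes(expression, labels, n_top=1)
```

In this toy example GENE_A separates the classes perfectly in every subdataset, so it is the only gene that survives the intersection.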

#### Data Trimming for Application in SVM

We developed a first-in-class data trimming<sup>3</sup> tool termed FloWPS that has the potential to improve the performance of machine learning methods. Since extrapolation is a widely recognized Achilles heel of SVM (Arimoto et al., 2005; Balabin and Lomakina, 2011; Balabin and Smirnov, 2012; Betrie et al., 2013), FloWPS avoids it by using rectangular projections along all irrelevant expression features that cause extrapolation during the SVM-based predictions for every validation point.

In this section we describe and investigate our data trimming procedure (FloWPS) as a preprocessing for SVM application.

Since the number of samples in most of the datasets used here was relatively low, we tested our classifier using the leave-one-out cross-validation method, which introduces smaller errors than the standard five-bin cross-validation scheme generally applied to bigger datasets. According to the leave-one-out approach, each sample i = 1,. . .,N in turn serves as a validation case whose response to the treatment has to be predicted, whereas all remaining samples, j = 1,. . .,(i−1),(i+1),. . .,N, collectively act as a training dataset; this procedure is repeated for all the samples. For machine learning without data trimming, in a predefined feature space **F** = (f<sup>1</sup>,. . ., f<sup>s</sup>) every test sample i is classified by a classifier built on the (N-1) samples used for training.

According to the current data trimming approach, instead of a fixed space **F** for all N testing samples, we propose using an individual space **F**<sup>i</sup>, which contains individually adapted training data (of N-1 samples) for the testing sample i. It can be implemented using the following heuristics (**Figure 1**).

(1) From the whole predefined feature space **F** = (f<sup>1</sup>,. . ., f<sup>s</sup>) we extract a subset **F**<sup>i</sup>(m), where m is a parameter. A feature f<sup>j</sup> is kept in **F**<sup>i</sup>(m) if on its axis there are at least m projections from training samples that are larger than f<sup>j</sup>(i) and, at the same time, at least m that are smaller than f<sup>j</sup>(i). The procedure for extraction of

<sup>3</sup>Data trimming is the process of removing or excluding extreme values, or outliers, from a dataset (Turkiewicz, 2017).

Turquoise dots stand for the points from the training dataset. The features (here: f<sup>1</sup> and f<sup>2</sup>) are considered relevant when they satisfy the criterion that at least m flanking training points must be present on both sides relative to the validation point along the feature-specific axis. In the figure, the m-condition is satisfied for the feature f<sup>1</sup> only when m = 0, and for the feature f<sup>2</sup> when m ≤ 5. (B) After selection of the relevant features, only the k nearest neighbors in the training set are selected to construct the SVM model. In the figure, k = 4, although values of k starting from 20 were used in our calculations to build the SVM model.

a subset **F**<sup>i</sup>(m) is illustrated in **Figure 1A** for a two-dimensional space **F** = (f<sup>1</sup>, f<sup>2</sup>). A violet point stands for the validation sample in the feature space. Turquoise dots represent the scatter of the training points. For example, the m-condition for the feature f<sup>2</sup> is satisfied when m = 0,1,2,3,4,5 (the projection of the training set on the f<sup>2</sup> axis has five points both below and above the validation point), whereas for the feature f<sup>1</sup> it is satisfied only for m = 0 (the projection of the validation point on the f<sup>1</sup> axis lies outside of the cloud of training points).

(2) In **F**<sup>i</sup> (m) we keep for training only k closest samples (from given (N-1) samples); k is also a parameter (**Figure 1B**; note that although for the sake of simplicity k = 4 in the picture, in the computational trials we varied k from 20 to N-1).
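The two trimming steps above can be sketched in a few lines. This is a hedged illustration rather than the deposited FloWPS module: the function names are hypothetical, and the use of Euclidean distance for choosing the k closest samples is an assumption, since the text does not name the metric. The SVM itself would then be fitted on the returned samples and features.

```python
import math

def relevant_features(train, query, m):
    """Step (1): keep feature j if at least m training values lie strictly below
    and at least m strictly above the query value on that axis."""
    kept = []
    for j, q in enumerate(query):
        below = sum(row[j] < q for row in train)
        above = sum(row[j] > q for row in train)
        if below >= m and above >= m:
            kept.append(j)
    return kept

def nearest_neighbours(train, query, features, k):
    """Step (2): indices of the k training samples closest to the query in the
    retained subspace (Euclidean distance is an assumption of this sketch)."""
    def distance(row):
        return math.sqrt(sum((row[j] - query[j]) ** 2 for j in features))
    order = sorted(range(len(train)), key=lambda i: distance(train[i]))
    return order[:k]

# Toy data (invented), loosely mimicking Figure 1: the query lies outside the
# training cloud along the first feature and inside it along the second.
train = [(1, 0), (2, 6), (3, 1), (4, 5), (5, 2), (6, 4)]
query = (7, 3)
features = relevant_features(train, query, m=3)
neighbours = nearest_neighbours(train, query, features, k=2)
```

Here the first feature is dropped for m = 3 because no training point lies above the query on that axis, and the two neighbors are then chosen along the second feature only.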

Hence, for every individual i = 1,. . .,N and for every pair of m and k parameter values, a predicted classification value is obtained [i.e., predictions P<sup>i</sup>(m,k), i = 1,. . .,N]. Given the known response status of each sample i, it is possible to calculate AUC values for the whole set of samples as a function over the whole range of the parameters m and k (**Figure 2B**). Since the predicted classification efficiencies depend upon the chosen values of m and k, it is possible to interrogate the AUC values over the full lattice of all possible (m, k) pairs.

We propose an algorithm for finding the optimal (m,k) settings for a final classifier (**Figure 2A**). The AUC threshold (θ) is set to θ = p · max(AUC), where max(AUC) is the maximal value of AUC taken over the set of all possible (m, k) pairs, and the parameter p is a user-defined confidence threshold. To illustrate the performance of this approach, we took two alternative values, p = 0.95 or 0.90, and then considered all the (m,k) pair positions on the AUC(m,k) topogram. We next screened for the positions where AUC exceeded the threshold θ, and the whole collection of these positions was taken as the prediction-accountable set S (**Figure 2B**; prediction-accountable positions are shown in yellow). The final prediction of FloWPS (P<sup>F</sup>) for a certain validation case is calculated by averaging the SVM predictions, P(m,k), over all positions belonging to the prediction-accountable set S, according to the formula: P<sup>F</sup> = mean<sub>S</sub>(P(m,k)).
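The thresholding and averaging rule can be written out as a small sketch. The function names and the toy AUC and prediction values below are hypothetical; only the rule itself (keep the positions with AUC above p · max(AUC), then average P(m,k) over them) comes from the text.

```python
def accountable_set(auc, p=0.90):
    """auc maps (m, k) -> AUC; return the pairs whose AUC exceeds theta = p * max(AUC)."""
    theta = p * max(auc.values())
    return {pair for pair, value in auc.items() if value > theta}

def flowps_prediction(predictions, accountable):
    """Average the SVM outputs P(m, k) over the prediction-accountable set S."""
    return sum(predictions[pair] for pair in accountable) / len(accountable)

# Toy topogram and per-position SVM outputs (values invented for illustration).
auc = {(0, 20): 0.60, (1, 20): 0.78, (2, 20): 0.80, (2, 30): 0.70}
predictions = {(0, 20): 10.0, (1, 20): 80.0, (2, 20): 60.0, (2, 30): 30.0}

S = accountable_set(auc, p=0.95)
p_f = flowps_prediction(predictions, S)
```

With p = 0.95 the threshold is about 0.76, so only the positions (1, 20) and (2, 20) enter S and the final prediction is the mean of their two outputs.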

The usual SVM method, i.e., without FloWPS data trimming, corresponds to the bottom-right corner of the AUC(m,k) topogram (**Figure 2B**), with the parameter settings m = 0, k = N − 1. In the example shown in **Figure 2B**, the classical SVM clearly provides substantially lower accuracy than FloWPS.

FIGURE 2 | Optimization of data trimming parameters m and k for a given individual. (A) Overall scheme for prediction for an individual sample i = 1, N. All but one individuals serve as a training dataset. For a training dataset at the fitting step, the AUC for a classifier prediction is calculated and plotted (B) as a function of data trimming parameters m and k. Positions of this AUC topogram where AUC > p · max(AUC), p = 0.95, are considered prediction-accountable (highlighted with bright yellow color) and form the prediction-accountable set S. This AUC topogram, as well as the set S, is individual for every validation point i.



Area-under-curve (AUC) and false discovery rate (FDR) values calculated for each version of a classifier are shown. All calculations were made using leave-one-out cross-validation approach.

#### FloWPS Performance for Default SVM Settings

At first, we investigated the performance of FloWPS on ten cancer gene expression datasets (**Table 1**) with the default SVM settings (linear kernel and cost/penalty parameter C = 1). During our calculations, the FloWPS classifier was first fitted on the training dataset without the sample (say, i) to be classified. For all these (N-1) samples, AUC<sup>i</sup>(m,k) was calculated as a function of the data trimming parameters m and k (see **Figure 2A**). This enabled finding the prediction-accountable set S<sup>i</sup> in the AUC<sup>i</sup>(m,k) topogram (in **Figure 2B**, the set is marked with bright yellow). The m and k values from the set S<sup>i</sup> were then used for data trimming and classifying of the single sample i. In parallel, we applied the standard SVM algorithm with leave-one-out cross-validation without data trimming, i.e., m = 0, k = N-1 for each training sub-dataset. The comparison is shown in **Table 2**, **Supplementary Table S3**, and **Figures 3**, **4**.

The discrimination threshold (τ), which is shown as a black horizontal line on **Figure 3** (so that any sample with FloWPS prediction value above τ is classified as a responder, and below it – as a non-responder), was set to minimize the sum of FP and FN predictions.
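One simple way to choose such a threshold is to scan candidate cut-offs and count the errors at each. The sketch below is an assumption-laden illustration: the function name, the choice of midpoints between neighboring scores as candidates, and the rule that ties are resolved in favor of the lowest candidate are all choices made here, not details given in the text.

```python
def best_threshold(scores, labels):
    """Return (tau, errors) minimising FP + FN, where scores above tau are called responders.

    Candidates are midpoints between consecutive distinct scores; with fewer than
    two distinct scores there are no candidates and None is returned.
    """
    distinct = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for tau in candidates:
        fp = sum(s > tau and y == 0 for s, y in zip(scores, labels))
        fn = sum(s <= tau and y == 1 for s, y in zip(scores, labels))
        if best is None or fp + fn < best[1]:
            best = (tau, fp + fn)
    return best

# Toy FloWPS outputs (invented): responders are labeled 1, non-responders 0.
scores = [90, 70, 40, 60, 20, 10]
labels = [1, 1, 1, 0, 0, 0]
tau, errors = best_threshold(scores, labels)
```

In this toy case two cut-offs (30 and 65) give a single misclassification each, and the sketch returns the lower one.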

For every dataset, confidence parameter p and scheme of gene selection, the FloWPS classifier demonstrated a ROC AUC exceeding the corresponding value for the classical SVM (**Table 2**). For three datasets out of ten, the AUC for classical SVM was between 0.64 and 0.68. In all these cases, application of FloWPS with confidence level p = 0.90 yielded substantially better AUC values, ranging between 0.71 and 0.78.

The comparison of classifier quality by another metric, the FDR<sup>4</sup>, demonstrated similar results: the FDR was lower for FloWPS than for classical SVM in almost all cases (**Table 2**, columns without boldface font). Other metrics, such as sensitivity (Sn), specificity (Sp), accuracy rate (ACC) and MCC<sup>5</sup>, also strongly tend to be higher for FloWPS than for classical SVM without data trimming (**Supplementary Table S3**).
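For reference, all of these metrics follow from the four cells of the confusion matrix. The sketch below uses the standard definitions (the FDR and MCC formulas match those given in the footnotes); the function name and the toy counts are hypothetical.

```python
import math

def classifier_metrics(tp, fp, tn, fn):
    """Compute FDR, Sn, Sp, ACC and MCC from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "FDR": fp / (fp + tp),                   # false discovery rate
        "Sn": tp / (tp + fn),                    # sensitivity
        "Sp": tn / (tn + fp),                    # specificity
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # accuracy rate
        "MCC": (tp * tn - fp * fn) / denom,      # Matthews correlation coefficient
    }

# Toy counts (invented) for a balanced dataset of 100 cases.
metrics = classifier_metrics(tp=40, fp=10, tn=40, fn=10)
```

The sketch does not guard against empty classes or empty prediction groups, where some denominators would be zero.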

#### FloWPS Performance at Different Settings and Comparison With Alternative Data Reduction Approach

Although the classifier quality tended to be higher with data trimming than without it at the default SVM settings, the advantage differed between cancer datasets. The FloWPS performance, therefore, was investigated for different SVM kernels (linear vs. polynomial) and different values of the cost/penalty parameter C (ranging from 0.1 to 1000) (**Figure 5** and **Supplementary Table S4**). These calculations were done for the core marker gene datasets and the FloWPS confidence parameter p = 0.90. The advantage of FloWPS over SVM is more pronounced in conditions vulnerable to SVM overtraining, e.g., for the linear kernel with high values of the cost/penalty parameter (C = 100 or 1000) or for the polynomial kernel, where SVM may be easily overfitted. FloWPS precludes such overfitting, thus raising AUC and decreasing FDR. The same pattern was also seen for the Sn, Sp, ACC and MCC values (**Supplementary Table S4**).

Note that FloWPS is not the only possible data reduction/feature selection method that may be used as preprocessing to improve the classifier's quality. To try a simple alternative to FloWPS, which is, however, not specific to individual samples, we did calculations based on principal components (PCs) obtained by PCA rather than on the original features. The number of PCs taken for building the SVM model may act as a parameter, which is optimized in a manner similar to the optimization of m and k for FloWPS. Namely, a maximum for AUC as a function of PC

<sup>4</sup>FDR shows the percentage of false positive (FP) predictions among all those classified as positive, FDR = FP/(FP + TP), where TP means true positive.

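<sup>5</sup>MCC (Matthews correlation coefficient) can be calculated from the confusion matrix: $\mathrm{MCC} = \dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$, where TN and FN mean true negative and false negative, respectively.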

number is found and then used as the optimal number of PCs for an SVM-based prediction.

Thus, we compared the classifier qualities for three methods, namely classical SVM without data reduction, PCA-assisted SVM with a pre-trained number of PCs, and FloWPS with the confidence parameter p = 0.90 (**Table 3**; note that both classical SVM and FloWPS calculations were done using gene expression features rather than PCs). The calculations were done for the core marker gene datasets and the cost/penalty SVM parameters C = 1 and 100. For the linear kernel, several datasets had comparable AUC for simple PCA-assisted data reduction and for FloWPS (**Table 3**). However, for the polynomial kernel FloWPS substantially outperformed the PCA-assisted data reduction, most likely due to the bigger risk of overtraining for SVM with nonlinear kernels.

TABLE 3 | AUC of (non)responder classifier for classical SVM without data reduction (SVM), PCA-assisted SVM (PCA) and FloWPS with confidence parameter p = 0.90.


# DISCUSSION

It was seen previously that SVM sometimes fails when it is intended for distinguishing fine biomedical properties, such as disease progression prognosis or assessment of the clinical efficiency of drugs for an individual patient, using high-throughput molecular data, e.g., complete DNA mutation or gene expression profiles (Ray and Zhang, 2009; Babaoglu et al., 2010). In particular, for many biologically relevant applications, SVM proved either fully incapable of predicting drug sensitivity (Turki and Wei, 2016), or demonstrated poorer performance than competing machine learning methods (Davoudi et al., 2017; Cho et al., 2018; Jeong et al., 2018; Leite et al., 2018; Sauer et al., 2018; Yosipof et al., 2018). Thus, a tool for improving SVM performance is certainly needed.

trimming works locally and may separate classes more accurately.

In this study, we investigated ten sets of gene expression data for cancer patients treated with different anti-cancer drugs with known clinical outcomes, where the original dimensionality of the feature space (the number of genes) is many hundreds of times larger than the number of patients. So, the first problem in such applications was to extract an appropriate number of features spanning a space in which one could achieve a classifier-predictor with a high level of quality. Many authors have focused on resolving this preprocessing problem (Tan and Gilbert, 2003; Kourou et al., 2015; Tan, 2016; Liu et al., 2017; Tarek et al., 2017). Some feature selection methods, like the DWFS wrapping tool (Soufan et al., 2015), use sophisticated approaches such as genetic algorithms to improve the classifier quality. In this paper we proposed one more method, FloWPS, which is very different from all known ones. Its critical characteristic is that for every single new sample whose class has to be predicted, the method extracts an individual subspace and, moreover, in that subspace takes an appropriate subset of samples as training data.

The FloWPS data trimming method combines the advantages of both global (like SVM) and local (like kNN) (Altman, 1992) methods of machine learning, and acts successfully even when purely local and global approaches fail. The failure of SVM, which we observed for at least 3 out of 10 datasets in the current study (**Table 2**), means that there is no strict distant order in the placement of responder and non-responder points in the space of gene expression features. Yet, the lack of distant order does not necessarily mean the absence of local order (**Figure 6**). The latter may be detected using local methods such as kNN, which has been confirmed by our FloWPS results (**Table 2** and **Figures 3**, **5**). The FloWPS advantages are more evident for SVM with a polynomial kernel than with a linear kernel, due to the higher risk of overtraining for such models (**Figure 5** and **Table 3**).

We hypothesize that FloWPS and data trimming may be also helpful for improving other learning methods based on multi-omics data, including nowadays-flourishing deep learning approaches (Bengio et al., 2013; LeCun et al., 2015; Schmidhuber, 2015).

# MATERIALS AND METHODS

#### Preprocessing of Gene Expression Data

For the datasets investigated using the Affymetrix microarray hybridization platforms, gene expression data were taken from the series matrices deposited in the GEO public repository and then quantile-normalized (Bolstad et al., 2003) using the R package preprocessCore (Bolstad, 2018). All pediatric datasets taken from the TARGET database (Goldman et al., 2015) contained results of NGS mRNA profiling on Illumina HiSeq 2000 platforms; they were normalized using the R package DESeq2 (Love et al., 2014).
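To illustrate the idea behind quantile normalization, the following Python sketch forces every sample to share the same distribution of values. It is not the preprocessCore implementation used in the study: the function name is hypothetical, and tied values are not averaged here, which is a simplification.

```python
def quantile_normalize(columns):
    """columns: list of per-sample value lists of equal length.

    Each value is replaced by the mean, across samples, of the values that
    share its rank. Ties are ranked in input order (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized

# Toy matrix (invented): two samples, three genes each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After normalization both toy samples contain the same three values (1.5, 3.5 and 5.5), arranged according to each sample's original gene ranking.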

#### SVM Calculations

All the SVM calculations with linear and polynomial kernels were performed using the Python package sklearn (Pedregosa et al., 2012) that employs the C++ library 'libsvm' (Chang and Lin, 2011). The penalty parameter C varied from 0.1 to 1000 for different calculations. Other SVM parameters had the default settings for the sklearn package.

#### Plot Preparations

AUC(m,k) topograms, like **Figure 2B**, were plotted using the matplotlib Python library (Hunter, 2007). Violin plots for FloWPS predictions (see **Figure 3**) for responders and non-responders were plotted using the ggplot2 R package (Wilkinson, 2011).

# AVAILABILITY OF DATA AND MATERIALS

The datasets analyzed during the current study are available in the GEO and TARGET repositories:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25066

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41998

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9782

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39754

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68871

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55145

ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/WT/mRNAseq/

ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/AML/mRNAseq/

ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/ALL/mRNAseq/

The Python module that performs data trimming according to the FloWPS method for different values of the parameters m and k, the R code that makes FloWPS predictions using the results obtained with the Python module, and a README manual on how to use this code were deposited on GitLab and are available at: https://gitlab.com/oncobox/flowps.

# ETHICS STATEMENT

Current research did not involve any new human material. All the gene expression data that were used for research, were taken from publicly available repositories Gene Expression Omnibus (GEO) and TARGET, and had been previously anonymized by the teams, who had worked with them.

# AUTHOR CONTRIBUTIONS

NB designed the overall research, suggested the principles of data trimming and the prediction-accountable set, and wrote most parts of the manuscript. VT performed most of the calculations. MS suggested datasets with clinical responders and non-responders and performed feature selection. AM wrote the initial version of the computational code. AS adapted this code for parallel calculations. AG tested and debugged the computational code. IM and AB substantially improved the manuscript after the draft version had been prepared. AB performed the overall scientific supervision of the project.

# FUNDING

This work was supported by Amazon and Microsoft Azure grants for cloud-based computational facilities. We thank Oncobox/OmicsWay research program in machine learning and digital oncology for software and pathway databases for this study. Financial support was provided by the Russian Science Foundation grant no. 18-15-00061.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00717/full#supplementary-material

#### REFERENCES



chemotherapy for invasive breast cancer. JAMA 305, 1873–1881. doi: 10.1001/jama.2011.593



**Conflict of Interest Statement:** VT, MS, AS, AG, AB, and NB were employed by OmicsWay Corporation, Walnut, CA, United States. AM was employed by Yandex N.V. Corporation, Moscow, Russia.

The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tkachev, Sorokin, Mescheryakov, Simonov, Garazha, Buzdin, Muchnik and Borisov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification of Rare Copy Number Variants Associated With Pulmonary Atresia With Ventricular Septal Defect

Huilin Xie<sup>1</sup> , Nanchao Hong<sup>1</sup> , Erge Zhang<sup>1</sup> , Fen Li<sup>2</sup> , Kun Sun<sup>1</sup> \* and Yu Yu<sup>1</sup> \*

<sup>1</sup> Department of Pediatric Cardiology, Xin Hua Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China, <sup>2</sup> Department of Pediatric Cardiology, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Shanghai, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Lihua Ding, Beijing Institute of Technology, China Fang Liu, Fudan University, China

#### \*Correspondence:

Kun Sun sunkun@xinhuamed.com.cn Yu Yu yuyu@xinhuamed.com.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 14 November 2018 Accepted: 14 January 2019 Published: 28 January 2019

#### Citation:

Xie H, Hong N, Zhang E, Li F, Sun K and Yu Y (2019) Identification of Rare Copy Number Variants Associated With Pulmonary Atresia With Ventricular Septal Defect. Front. Genet. 10:15. doi: 10.3389/fgene.2019.00015

Copy number variants (CNVs) are major variations contributing to the gene heterogeneity of congenital heart diseases (CHD). Pulmonary atresia with ventricular septal defect (PA-VSD) is a rare form of cyanotic CHD characterized by complex manifestations, and the genetic determinants underlying PA-VSD are still largely unknown. We investigated rare CNVs in a recruited cohort of 100 unrelated patients with PA-VSD, PA-IVS, or TOF and a population-matched control cohort of 100 healthy children using whole-exome sequencing. Comparing rare CNVs in PA-VSD cases with those in PA-IVS or TOF positive controls, we observed twenty-two rare CNVs only in PA-VSD, five rare CNVs only in PA-VSD and TOF, and thirteen rare CNVs only in PA-VSD and PA-IVS. Six of these CNVs were considered pathogenic or potentially pathogenic for PA-VSD: 16p11.2 del (PPP4C and TBX6), 5q35.3 del (FLT4), 5p13.1 del (RICTOR), 6p21.33 dup (TNXB), 7p15.2 del (HNRNPA2B1), and 19p13.3 dup (FGF22). The gene networks showed that four putative candidate genes for PA-VSD, PPP4C, FLT4, RICTOR, and FGF22, had strong interactions with well-known cardiac genes relevant to heart or blood vessel development. Meanwhile, the transcriptome array analysis revealed that PPP4C and RICTOR were also significantly expressed in the human embryonic heart. In conclusion, three rare novel CNVs were identified only in PA-VSD: 16p11.2 del (PPP4C), 5q35.3 del (FLT4) and 5p13.1 del (RICTOR), implicating novel candidate genes of interest for PA-VSD. Our study provides new insights into the pathogenesis of PA-VSD and helps elucidate critical genes for PA-VSD.

Keywords: copy number variants, congenital heart defects, pulmonary atresia with ventricular septal defect, whole exome sequencing, network, PPP4C, FLT4, RICTOR

# INTRODUCTION

Pulmonary atresia with ventricular septal defect (PA-VSD) is a rare, complex form of congenital heart disease (CHD), characterized by the lack of luminal continuity and blood flow from the right ventricle to the pulmonary artery, together with a ventricular septal defect (Digilio et al., 1996; Tchervenkov and Roy, 2000). PA-VSD is considered one of the most complex and unmanageable forms of CHD, with an estimated prevalence of 0.2% of live births and roughly 2% in


congenital heart defects (Hoffman and Kaplan, 2002; Abid et al., 2014). Surgical interventions and medical care are always needed for patients with PA-VSD; nevertheless, PA-VSD remains a leading cause of neonatal death (Leonard et al., 2000; Amark et al., 2006).

Copy number variants (CNVs) contribute to the gene heterogeneity of CHD (Soemedi et al., 2012; Tomita-Mitchell et al., 2012; Warburton et al., 2014), providing important genetic information on complex CHD. Previous studies showed that the 22q11.2 deletion is a well-known pathogenic variant in CHD and is most common in tetralogy of Fallot (TOF) and PA-VSD (Momma, 2010; Xu et al., 2011; Warburton et al., 2014). Deletions in 15q11.2 and 8p23.1 also contribute to the risk of sporadic CHD (Soemedi et al., 2012). Additionally, some rare CNVs and relevant genes were associated with pulmonary atresia (PA-IVS and PA-VSD), such as 5q14.1dup (DHFR), 10p13dup (CUBN), and 17p13.2del (CAMTA2) (Xie et al., 2014). However, genetic evidence for PA-VSD is lacking in current studies, the majority of which have focused on diagnostic instruments and surgical procedures; the genetic determinants underlying PA-VSD still need to be identified.

The aim of our study is to determine the contribution of rare CNVs to the etiology of sporadic PA-VSD and to distinguish the genetic pattern of PA-VSD from that of PA-IVS or TOF. Here we genotyped sixty PA-VSD patients by whole-exome sequencing and investigated which rare CNVs in PA-VSD are shared with or differ from those in a non-PAVSD CHD cohort (PA-IVS or TOF), to explain their common or diverse phenotypes. In addition, we detected putative candidate genes encompassed in rare CNVs and identified functional gene sets associated with heart development through gene network analysis.

# MATERIALS AND METHODS

#### Study Population

We recruited unrelated patients with PA-VSD (n = 60), TOF (n = 20) or PA-IVS (n = 20), diagnosed by echocardiogram, cardiac catheterization, or surgery at Shanghai Xin Hua Hospital. Patients with TOF and PA-IVS served as a non-PAVSD CHD cohort, and 100 healthy children without heart diseases served as controls. Written informed consent was obtained from the parents or guardians of the participants in this study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Xin Hua Hospital. The genomic DNA of participants was extracted using the QIAamp DNA Blood Mini Kit (QIAGEN, Germany) following the manufacturer's instructions and was then stored at −80 °C.

#### Whole-Exome Sequencing and Data Analysis

Whole-exome sequencing was performed to detect copy number variants (CNVs) in all participants. Whole-exome sequencing data generated on a HiSeq<sup>TM</sup> sequencer were filtered (removing the adaptor sequences, reads with >5% ambiguous bases (noted as N) and low-quality reads containing more than 20 percent of bases with qualities of <20) and mapped to the human reference genome (GRCh37, Ensembl 75) using BWA-MEM with the following parameters (bwa mem -t 8 -R) (Li and Durbin, 2010). Duplicated reads were marked with Picard<sup>1</sup> and recalibration was applied based on the GATK standard calling pipeline tools<sup>2</sup>.

#### TABLE 1 | Cardiac diagnoses for study population of patients.

PA-VSD, pulmonary atresia with ventricular septal defect; PA-IVS, pulmonary atresia with intact ventricular septum; TOF, tetralogy of Fallot; F, female; M, male; d, day(s); m, month(s); and y, year(s).
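The read-level quality criteria stated in this section (more than 5% ambiguous bases, or more than 20 percent of bases with quality below 20) can be sketched as a simple predicate. This is only an illustration: the function name and argument layout are hypothetical, adaptor removal is not shown, and the study's actual filtering tool is not named in the text.

```python
def passes_filter(sequence, qualities, max_n_fraction=0.05,
                  min_quality=20, max_low_fraction=0.20):
    """Return True if a read survives both quality criteria.

    sequence: read bases as a string; qualities: per-base Phred scores.
    """
    n_fraction = sequence.upper().count("N") / len(sequence)
    low_fraction = sum(q < min_quality for q in qualities) / len(qualities)
    return n_fraction <= max_n_fraction and low_fraction <= max_low_fraction

# Toy reads (invented) of ten bases each.
good = passes_filter("ACGTACGTAC", [30] * 10)
too_many_n = passes_filter("ACGTNCGTAC", [30] * 10)
low_quality = passes_filter("ACGTACGTAC", [30] * 7 + [10] * 3)
```

The second toy read fails because one ambiguous base in ten is 10%, and the third fails because 30% of its bases fall below quality 20.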

#### CNV Determination From WES Data

CNVkit (Talevich et al., 2016) was used to call CNVs from the WES data. This method uses the copy number in the control group as the baseline, and CNVs in the case group that were below 1.5 times the baseline were excluded. The CNVs were then compared with the Database of Genomic Variants (DGV<sup>3</sup>), and CNVs overlapping regions recorded there were filtered out. CNVs were also excluded if they were shorter than 10 kb.
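The last two filters (overlap with known variants and minimum length) can be sketched as follows. The sketch makes several assumptions that the text does not specify: a CNV is discarded if it shares even a single base with a known variant, coordinates are treated as closed intervals, and all names and coordinates are invented for illustration. The fold-change criterion relative to the control baseline is not reproduced here.

```python
MIN_LENGTH = 10_000  # 10 kb, the minimum CNV length stated in the text

def overlaps(a, b):
    """True if two (chrom, start, end) intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def rare_cnvs(calls, known_variants, min_length=MIN_LENGTH):
    """Keep calls that are long enough and overlap no known variant (any-overlap rule assumed)."""
    return [
        call for call in calls
        if call[2] - call[1] >= min_length
        and not any(overlaps(call, known) for known in known_variants)
    ]

# Invented coordinates, not real CNV positions.
calls = [
    ("chr1", 1_000_000, 1_344_500),  # long, no overlap: kept
    ("chr2", 500_000, 505_000),      # shorter than 10 kb: dropped
    ("chr3", 200_000, 260_000),      # overlaps a known variant: dropped
]
known_variants = [("chr3", 250_000, 300_000)]
rare = rare_cnvs(calls, known_variants)
```

A stricter or looser overlap rule (for example, a minimum reciprocal overlap fraction) would change which calls count as rare.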

#### Tissue Collection and Transcriptome Array

Human embryos from Carnegie stages 10–16 were acquired after medical termination of pregnancy at Shanghai Xin Hua

<sup>1</sup>http://broadinstitute.github.io/picard/

<sup>2</sup>https://software.broadinstitute.org/gatk/

<sup>3</sup>http://dgv.tcag.ca/

Hospital. The medical ethics committee of Xin Hua Hospital approved the study. Human embryonic heart samples were retained for the transcriptome array. TissueLyser II (Qiagen) and the RNeasy MinElute Cleanup Kit (Qiagen) were used for RNA extraction. The integrity and purity of the RNA were assessed with the Experion automated gel electrophoresis system (Bio-Rad) and the NanoDrop 2000c spectrophotometer (ThermoFisher Scientific). The time-course expression patterns of the candidate genes were measured using an Affymetrix HTA 2.0 microarray.

#### Network Analysis

fgene-10-00015 January 24, 2019 Time: 19:9 # 3

We used the bioinformatics software Cytoscape with the STRING database to perform network analysis. Three different gene lists derived from the literature and the MalaCards database<sup>4</sup> were used. The lists were constructed as follows: (1) Genes associated with CHD outflow tract development, the secondary heart field (SHF) or cardiac neural crest (CNC) from previous studies; (2) Genes involved in blood vessel development; (3) Genes related

<sup>4</sup>http://www.malacards.org/

#### TABLE 2 | Rare CNVs only in PA-VSD.

to well-known syndromes with heart defects from previously reported studies and databases (**Supplementary Table S1**). We selected 30 genes from the rare CNV loci and then separately analyzed the network between these genes and each of the three gene lists.

# RESULTS

#### Identification of Rare CNVs

Of the 100 patients, sixty were PA-VSD, twenty were TOF and another twenty were PA-IVS. The patients are from the Chinese Han population with ages ranging from 2 months to 13 years (**Table 1**). We studied the 100 patients genotyped by WES analysis.

Using the stringent CNV analysis strategy described in the Methods, 129 CNVs were identified, of which 66 (51%) were duplications and 63 (49%) were deletions. These CNVs were compared with DGV for overlap, and those not detected in DGV were considered rare CNVs. Moreover, rare CNVs were excluded if they were shorter than 10 kb. We compared rare


PA-VSD, pulmonary atresia with ventricular septal defect; Locus, cytogenetic location of CNV; Size, in base pairs; CN, type of copy number aberration.

CNVs in PA-VSD cases with those in the PA-IVS and TOF positive controls and found twenty-two rare CNVs only in PA-VSD, five rare CNVs only in PA-VSD and TOF, and thirteen rare CNVs only in PA-VSD and PA-IVS (**Figure 1**).
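The filtering and group comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "overlap with the DGV" is treated as any shared base, CNVs are matched across patients by a (locus, type) key, and the function names, record fields, and toy coordinates are all hypothetical rather than taken from the study.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def is_rare(cnv, known_cnvs, min_size=10_000):
    """Keep a CNV only if it is >= min_size bp and overlaps no known CNV."""
    if cnv["end"] - cnv["start"] + 1 < min_size:
        return False
    return not any(
        known["chrom"] == cnv["chrom"]
        and overlaps(cnv["start"], cnv["end"], known["start"], known["end"])
        for known in known_cnvs
    )


def groups_per_cnv(cnvs):
    """Map each (locus, type) key to the set of patient groups carrying it."""
    carriers = defaultdict(set)
    for cnv in cnvs:
        carriers[(cnv["locus"], cnv["type"])].add(cnv["group"])
    return dict(carriers)


# Hypothetical toy data; none of these coordinates come from the study.
known = [{"chrom": "1", "start": 100_000, "end": 200_000}]
calls = [
    {"chrom": "1", "start": 150_000, "end": 180_000, "locus": "locusA", "type": "dup", "group": "PA-VSD"},
    {"chrom": "2", "start": 500_000, "end": 505_000, "locus": "locusB", "type": "del", "group": "PA-VSD"},
    {"chrom": "3", "start": 900_000, "end": 950_000, "locus": "locusC", "type": "del", "group": "PA-VSD"},
    {"chrom": "4", "start": 300_000, "end": 340_000, "locus": "locusD", "type": "dup", "group": "PA-VSD"},
    {"chrom": "4", "start": 300_000, "end": 340_000, "locus": "locusD", "type": "dup", "group": "TOF"},
]

rare = [cnv for cnv in calls if is_rare(cnv, known)]
carriers = groups_per_cnv(rare)
only_pa_vsd = sorted(k for k, g in carriers.items() if g == {"PA-VSD"})
pa_vsd_and_tof = sorted(k for k, g in carriers.items() if g == {"PA-VSD", "TOF"})
```

In this toy input, `locusA` is dropped because it overlaps a known CNV and `locusB` because it is shorter than 10 kb, so only `locusC` and `locusD` are carried into the group comparison.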

#### Rare CNVs Only in PA-VSD


Twenty-two rare CNVs were identified only in PA-VSD, with sizes ranging from 12.3 to 344.5 kb (**Table 2**). Among these rare CNVs, some have been reported to be implicated in CHD. The most compelling was the 16p11.2 deletion, previously detected in a neonate with TOF with pulmonary atresia (Hernando et al., 2002) and identified in a CHD cohort (Zhu et al., 2016). In addition, duplications of 1q21.2 and 17q12 were previously related to TOF (Liu et al., 2016). In a previous study, a 13q33.3 deletion together with a 4p12 duplication was detected in a patient with double outlet right ventricle (McMahon et al., 2015). In our study, by contrast, deletions of 4p12 and 13q33.1 (near the 13q33.3 locus) were observed in the same patient. Additionally, two rare CNVs were previously linked to syndromes with heart defects: one was the 11q23.3 deletion involved in Jacobsen syndrome with severe cardiac malformations (Mattina et al., 2009), and the other was the 5q35.3 deletion associated with 5q35.3 subtelomeric deletion syndrome, which shows developmental delay and CHD (Rauch et al., 2003).

Among these notable rare CNVs, the deletions of 16p11.2 and 5q35.3 implicated specific candidate genes of interest. The 344.5 kb 16p11.2 deletion included two candidate genes, PPP4C (BMP signaling pathway) and TBX6 (T-box family), which we considered might affect cardiac development or belong to a relevant gene family. We also identified a 50.7 kb deletion at the 5q35.3 locus containing the FLT4 gene, which is also called VEGFR3 (Ferrara and Alitalo, 1999). Another interesting rare CNV, identified in two patients with PA-VSD, was a 5p13.1 deletion containing the RICTOR gene, which plays a crucial role in heart development (**Figure 2**).

FIGURE 2 | Rare CNVs overlapping novel candidate genes for PA-VSD: RICTOR, PPP4C, and FLT4. The dotted rectangles represent the parts of the candidate genes that are not within the CNVs. Genomic parameters from Ensembl (GRCh37.p13).

#### Rare CNVs in PA-VSD and TOF


Pulmonary atresia with ventricular septal defect shows a phenotype and hemodynamics similar to those of TOF with PA, so PA-VSD is sometimes considered the most severe type of TOF. To explore a possible shared developmental mechanism of PA-VSD and TOF, we compared the rare CNVs in PA-VSD and TOF and identified five rare CNVs present in both: 6p21.33 duplication, 22q13.2 duplication, 7p22.1 deletion, 16q22.1 duplication, and 15q21.1 deletion (**Table 3**). A duplication of 15q21.1 (chr15: 48023616-49017024) was previously reported to be associated with CHD and was identified in a patient with TOF and PA (Molck et al., 2017). However, the 15q21.1 deletion in our study, which spanned approximately 17.8 kb, did not include the vital genes involved in heart morphogenesis.

An interesting CNV in eight PA-VSD patients and two TOF patients at the 6p21.33 locus was observed as a recurrent rare event, and the gain CNV overlapped the TNXB gene, which was reported to be highly expressed in fetuses and pregnancies with isolated ventricular septal defects (VSD) (Arcelli et al., 2010; Morano et al., 2018).

#### Rare CNVs in PA-VSD and PA-IVS

Although the genetic developmental patterns of PA-VSD are partly different from those of PA-IVS, we believe that the common CNVs and genes in these two populations may help explain the similarity in phenotypes. In this study, PA-VSD and PA-IVS had thirteen rare CNVs in common (**Table 4**). We identified a duplication of 2q37.1 in four PA-VSD patients and one PA-IVS patient. It was previously reported that 2q37 microdeletion syndrome showed developmental delay and congenital heart disease phenotypes (Doherty and Lacbawan, 2007).

In this group, we focused on the 7p15.2 deletion and the 19p13.3 duplication, which encompassed candidate genes. The loss CNV at 7p15.2 containing the HNRNPA2B1 gene was identified in one PA-VSD patient and one PA-IVS patient. Another rare CNV at the 19p13.3 locus identified in two PA-VSD patients

#### TABLE 3 | Rare CNVs only in PA-VSD and TOF.


PA-VSD, pulmonary atresia with ventricular septal defect; TOF, tetralogy of Fallot; Locus, cytogenetic location of CNV; Size, in base pairs; CN, type of copy number aberration.


PA-VSD, pulmonary atresia with ventricular septal defect; PA-IVS, pulmonary atresia with intact ventricular septum; Locus, cytogenetic location of CNV; Size, in base pairs; CN, type of copy number aberration.

and one PA-IVS patient contained the candidate gene FGF22.

#### Expression Pattern of Candidate Genes in Human Embryonic Heart

We collected human embryonic hearts at different Carnegie stages from S10 to S16 and performed gene expression analysis using a transcriptome array. Among the candidate genes, HNRNPA2B1 was the most highly expressed in embryonic heart; additionally, the expression levels of RICTOR and PPP4C were also significantly higher than those of other genes (**Figure 3**).

#### Gene Networks

The cardiovascular malformations of PA-VSD are caused by abnormal heart and vessel development, such as the formation and development of the cardiac outflow tract, pulmonary artery, SHF, or CNC. Additionally, multiple systemic syndromes involve heart defects, such as LEOPARD syndrome, Noonan syndrome, and DiGeorge syndrome. Therefore, we consider that the genes implicated in these aspects may play roles in

represent candidate genes, the blue nodes represent rare CNVs loci genes in this study and the yellow nodes represent the genes in list 1. The different width of line connecting proteins represents different intensity of the protein interaction, and the wider the connecting line is, the closer the interaction is.

the pathogenesis of PA-VSD. To determine which aspects of heart development the genes in the rare CNVs identified in our study were related to, we screened previous studies and the MalaCards database for known genes involved in heart morphogenesis, blood vessel development, and syndromes with heart defects, and then analyzed the networks between the candidate genes and each of the three gene groups (**Figures 4**–**6**). We found that PPP4C, FLT4, RICTOR, and FGF22 were directly connected to all three gene groups. TBX6 directly interacts with FGF8, BMP4, and PAX3 in gene list 1, which are related to heart development. HNRNPA2B1 directly interacts with RAF1 and DGCR8 in gene list 3, which are associated with syndromes. These data suggested that the four genes PPP4C, FLT4, RICTOR, and FGF22 had strong roles in cardiac development and the pathogenesis of PA-VSD.

# DISCUSSION

Copy number changes appear to be important genetic variants contributing to the etiology of PA-VSD; however, the current understanding of the role of CNVs in the etiology of PA-VSD is limited. There is just one report of rare de novo CNVs in patients with pulmonary atresia, by Xie et al. (2014); however, it did not separate PA-IVS from PA-VSD. Thus, to investigate the pathogenesis of PA-VSD, we collected genomic DNA samples from sixty patients with PA-VSD and 100 controls; meanwhile, samples of PA-IVS and TOF were also collected as positive controls. All cases and controls were assayed using WES. Rare CNVs were identified in 100 patients, and six of these CNVs were considered pathogenic or potentially pathogenic for PA-VSD.

Pulmonary atresia with ventricular septal defect is considered the most severe type of TOF, and we intended to discover the genomic causes underlying this difference in severity; meanwhile, the genetic developmental pattern of PA-VSD partly differs from that of PA-IVS. Therefore, we compared rare CNVs between PA-VSD cases and positive controls (PA-IVS and TOF) to find the CNVs unique to PA-VSD. One CNV was a 344.5 kb deletion on 16p11.2 that contained PPP4C and TBX6. PPP4C, the catalytic subunit of protein phosphatase 4, which plays a role in various cellular signaling and regulatory processes, is highly conserved from invertebrates to vertebrates (Cohen et al., 2005). Knockdown of ppp4c inhibits ventral development in zebrafish embryos via enhancing BMP signaling responses through its direct interaction with Smad1. Meanwhile, PPP4C also enhances BMP2 cellular responses in mammalian cells including mouse C2C12 myoblast cells (Jia et al., 2012). It is well established that Bmp2-null mice show abnormal cardiac formation and that BMP2 plays a pivotal role in cardiac development in humans (Zhang and Bradley, 1996; Tan et al., 2017). These findings suggest that PPP4C could be implicated in human PA-VSD. The second gene, TBX6, was found in the same

deletion as PPP4C. TBX6 is a member of the T-box family of transcription factors, which are critical for normal heart development (Plageman and Yutzey, 2005). T-box genes, including TBX1 (Yagi et al., 2003), TBX5 (Li et al., 1997), and TBX20 (Plageman and Yutzey, 2004), are considered important in early cardiac lineage determination and valvuloseptal development. Previous studies have revealed that Tbx6 has important roles in the formation of somite borders and the specification of presomitic mesoderm (Chapman et al., 1996; Chapman and Papaioannou, 1998; Chapman et al., 2003; White et al., 2003). Moreover, a recent study further indicated that Tbx6 is essential for the differentiation of pluripotent stem cells (PSCs) into mesoderm and inhibits cardiac specification (Sadahiro et al., 2018). The deletion CNV encompassing TBX6 in our study may cause loss of its function and result in heart defects.

In addition, the 5q35.3 deletion contains the FLT4 gene, which encodes a receptor tyrosine kinase for VEGF-C and VEGF-D and promotes lymphangiogenesis as well as angiogenesis (Alitalo, 2011; Benedito et al., 2012). Moreover, FLT4 is highly expressed in pulmonary arterial endothelial cells and interacts closely with BMPR2 to regulate BMP signaling, which is intimately associated with cardiac development; genetic deletion of Flt4 in endothelial cells led to impaired BMP signaling in mice (Hwangbo et al., 2017). This supports a crucial role for FLT4 in the pathogenesis of PA-VSD.

The 5p13.1 deletion includes another interesting candidate gene, RICTOR. The RICTOR protein is an essential regulatory protein and structural subunit of the mammalian target of rapamycin complex 2 (mTORC2), a signaling protein complex involved in the epithelial-mesenchymal transition during embryonic development (Lamouille et al., 2012). Homozygous endothelial loss of Rictor results in mouse embryonic lethality at E11.5 (Aimi et al., 2015). It has been reported that Rictor/mTORC2 may play a key role in the cardiomyocyte differentiation of mouse embryonic stem cells, with reduced protein levels of Nkx2.5 (cardiac progenitor cell protein), α-Actinin (cardiomyocyte biomarker), and brachyury (mesoderm protein) in Rictor knockdown mice during cardiogenesis. Furthermore, Rictor knockdown specifically suppressed the differentiation of mouse embryonic stem cells into ventricular-like cells (Zheng et al., 2017). These findings demonstrate the crucial functions of RICTOR in heart development and its potential role in the pathogenesis of PA-VSD.

To a certain extent, PA-VSD and TOF show similar phenotypes and may share a similar genetic mechanism. We identified a 6p21.33 duplication in both PA-VSD and TOF overlapping the TNXB gene, which encodes the tenascin-X protein. Tenascin is one of the tendon-related extracellular matrix components (Peacock et al., 2008); it is expressed in the valve leaflets of chicken and mouse embryos and plays an important role in heart

valve development (Lincoln et al., 2004; Combs and Yutzey, 2009).

Furthermore, because PA-VSD shares the phenotype "pulmonary atresia" with PA-IVS, we intended to explain the similarity at the genetic level by comparing the rare CNVs between PA-IVS and PA-VSD. Among the rare CNVs common to PA-VSD and PA-IVS, a 7p15.2 deletion comprising the HNRNPA2B1 gene was detected in this study. The HNRNPA2B1 gene, a molecular homolog of HNRNPA1 (Hutchison et al., 2002), has a structure and function similar to those of HNRNPA1 (Biamonti et al., 1994; Patry et al., 2003). The HNRNPA1 gene encodes the heterogeneous nuclear ribonucleoprotein (hnRNP) A1 protein, a well-known trans-acting splicing factor that inhibits splice site recognition (Matlin et al., 2005) and promotes alternative splicing of target genes (Jean-Philippe et al., 2013). A recent study observed that hnRNP A1 knockout mice had heart structure defects and that the alternative splicing of mef2c was evidently affected (Liu et al., 2017). In our results, HNRNPA2B1 was highly expressed in human embryonic heart. Therefore, we inferred that HNRNPA2B1 may also play a role in heart development, especially in the formation of PA.

FGF22, another candidate gene within a duplication CNV at the 19p13.3 locus, is most closely related to FGF7 and FGF10; these three FGFs constitute a subfamily among FGF family members (Miki et al., 1992; Ornitz et al., 1996; Yeh et al., 2003). Previous studies revealed that Fgf10 has dosage-sensitive requirements in multiple aspects of early murine cardiovascular development (Urness et al., 2011) and plays an essential role in outflow tract morphogenesis (Kelly et al., 2001; Golzio et al., 2012). Although FGF22 has mostly been reported in neurology (Pasaoglu and Schikorski, 2016; Terauchi et al., 2016; Williams et al., 2016), we detected its variant in our patient populations with heart defects and speculated that FGF22 may have functions similar to those of FGF10 in heart development. Additionally, from the network we found that FGF22 directly interacts with PDGFRA and KDR (Flk1), which mark cardiac progenitor populations in differentiating embryonic stem cells (Kattman et al., 2006; Yang et al., 2008; Kattman et al., 2011). This result implies that FGF22 is indeed related to heart development.

Among the candidate genes, PPP4C, FLT4, RICTOR, and FGF22 showed strong interactions with all gene groups in the network analysis; meanwhile, RICTOR and PPP4C had high expression levels in human embryonic heart. This provides evidence that the rare CNVs involving RICTOR and PPP4C have great potential to contribute to the pathogenesis of PA-VSD.

In conclusion, we identified three rare CNVs only in patients with PA-VSD and the putative candidate genes: 16p11.2 del (PPP4C), 5q35.3 del (FLT4) and 5p13.1 del (RICTOR). These


rare CNVs and genes were not previously described and may contribute significantly to the genetic basis of PA-VSD. There were, however, limitations to this study. Our cohorts lacked parental samples, and large or multicenter studies with trio samples may be needed for replication to define the significance of the novel rare CNVs identified in our study. To minimize false positives in our small cohorts, restrictive CNV analytic methods were used for rare CNVs, which might have resulted in missing some rare variants of interest. Additionally, further mechanistic studies are needed to prove the functional significance of the putative candidate genes of PA-VSD in vivo or in vitro. Nevertheless, the discovery in our study of novel rare CNVs in patients with PA-VSD helps elucidate critical genes for PA-VSD and may provide new insights into its pathogenesis.

#### AUTHOR CONTRIBUTIONS

YY, KS, and HX contributed to conception and design of the study and performed the statistical analysis. NH, EZ, and FL collected the blood samples from all subjects. HX wrote the first draft of the manuscript. YY and KS revised the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

# FUNDING

This study received financial support from the National Natural Science Foundation of China (81670285), the key international (regional) cooperation projects of the National Natural Science Foundation of China (81720108003), the Shen Kang municipal hospital's emerging frontier technology joint research project (SHDC12015102), and the special key projects of integrated traditional Chinese and Western Medicine in Shanghai General Hospital (ZHYY-ZXYJHZX-1-04).

#### ACKNOWLEDGMENTS

We express our gratitude to all subjects who participated in this study.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00015/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xie, Hong, Zhang, Li, Sun and Yu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Exposing the Causal Effect of Body Mass Index on the Risk of Type 2 Diabetes Mellitus: A Mendelian Randomization Study

Liang Cheng<sup>1</sup> , He Zhuang<sup>1</sup> , Hong Ju<sup>2</sup> , Shuo Yang<sup>1</sup> , Junwei Han<sup>1</sup> \*, Renjie Tan<sup>1</sup> \* and Yang Hu<sup>3</sup> \*

<sup>1</sup> College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China, <sup>2</sup> Department of Information Engineering, Heilongjiang Biological Science and Technology Career Academy, Harbin, China, <sup>3</sup> School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Chong Wang, Brigham and Women's Hospital and Harvard Medical School, United States Sicheng Hao, Northeastern University, United States Leyi Wei, Tianjin University, China

#### \*Correspondence:

Junwei Han hanjunwei1981@163.com Renjie Tan renjie.tan@outlook.com Yang Hu huyang@hit.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 21 October 2018 Accepted: 29 January 2019 Published: 14 February 2019

#### Citation:

Cheng L, Zhuang H, Ju H, Yang S, Han J, Tan R and Hu Y (2019) Exposing the Causal Effect of Body Mass Index on the Risk of Type 2 Diabetes Mellitus: A Mendelian Randomization Study. Front. Genet. 10:94. doi: 10.3389/fgene.2019.00094

Introduction: High body mass index (BMI) is a phenotype positively associated with type 2 diabetes mellitus (T2DM), as abundant studies have observed from a clinical perspective. With the rapid increase in the number of genetic variants reported by genome-wide association studies (GWAS), common SNPs of BMI and T2DM have been identified as the genetic basis for understanding their association. However, the direction of causality between them remains unclear.

Materials and Methods: To clarify it, Mendelian randomisation (MR), which uses genetic instrumental variables (IVs) to explore the causality between an intermediate phenotype and a disease, was used here to test the effect of BMI on the risk of T2DM. In this article, MR was carried out on GWAS data using 52 independent BMI SNPs as IVs. The pooled odds ratio (OR) of these SNPs was calculated using the inverse-variance weighted method to assess the effect of a 5 kg/m<sup>2</sup> higher BMI on the risk of T2DM. Leave-one-out validation was conducted to identify the effect of individual SNPs. MR-Egger regression was used to detect potential pleiotropic bias of the variants.

Results: We obtained a high OR (1.470; 95% CI 1.170 to 1.847; P = 0.001), a low intercept (0.004, P = 0.661), and small fluctuations of ORs {from −0.039 [(1.412 – 1.470) / 1.470] to 0.075 [(1.568 – 1.470) / 1.470]} in leave-one-out validation.

Conclusion: We validate the causal effect of high BMI on the risk of T2DM. The low intercept shows no pleiotropic bias of the IVs. The small changes in ORs caused by removing individual SNPs show that no single SNP drives our estimate.

Keywords: body mass index, type 2 diabetes mellitus, causal effect, Mendelian randomisation, phenotype

# INTRODUCTION

Diabetes mellitus (DM) is a group of chronic metabolic diseases involving insulin secretion deficiency (Olokoba et al., 2012; Pan et al., 2013; Shi and Hu, 2014). Prolonged high blood sugar levels in DM patients impair body tissues and organs such as the eyes, kidneys, and heart. Currently, more than 400 million people worldwide suffer from diabetes, of which type 2 DM (T2DM) makes up about 90% (Olokoba et al., 2012; Pan et al., 2013; Shi and Hu, 2014). Most patients who suffer from T2DM are over the age of 40 (Olokoba et al., 2012; Pan et al., 2013;

Shi and Hu, 2014). In theory, people therefore have a long time in which to prevent T2DM, given the right guidance. To this end, researchers have made great efforts to investigate the causes of T2DM.

Observational studies have shown that body mass index (BMI) is strongly associated with the risk of being diagnosed with T2DM (Sanada et al., 2012; Ganz et al., 2014; Chen et al., 2015, 2016; Zhao et al., 2017). Sanada et al. (2012) conducted a 10-year retrospective cohort study on 969 men and 585 women and observed that high BMI was an independent and dose-dependent risk factor for T2DM in Japanese patients. Ganz et al. (2014) conducted a case-control study to assess the association between BMI and the risk of T2DM in the United States; a positive association was found in 12,179 cases (≥18 years old) and 25,177 controls. Analogous studies that did not consider genetic factors came to largely consistent conclusions.

After a large number of BMI-associated and T2DM-associated loci had been identified in genome-wide association studies (GWAS), their shared variants were interpreted as the underlying link between BMI and the risk of T2DM. In 2007, the first common variant of BMI and T2DM, in the FTO gene, was reported in individuals of European descent (Frayling et al., 2007). Subsequently, many investigations sprang up to validate the existing common locus and identify novel common variants of BMI and T2DM (Andreasen et al., 2008; Herder et al., 2008; Cauchi et al., 2009; Legry et al., 2009; Webster et al., 2010; Song et al., 2012; Xi et al., 2014). In 2014, a meta-analysis of 42 studies of variants associated with BMI and T2DM was conducted (Xi et al., 2014). Eventually, 4 variants (FTO rs9939609, SH2B1 rs7498665, FAIM2 rs7138803, GNPDA2 rs10938397) were identified as statistically significantly associated with both in Europeans.

Whether a higher BMI increases the risk of T2DM, T2DM affects BMI, or their common genetic factors are responsible remains unknown from current observations. In addition, after considering confounding factors, the causal relationship between BMI and T2DM may be reversed. To estimate the causal effect of BMI on the risk of T2DM, we conducted this Mendelian randomization (MR) study. MR is an instrumental variable (IV) based method for inferring causality between exposure and disease in observational studies. Genetic variants that are associated with intermediate phenotypic exposures are introduced as IVs by MR to estimate the effect of phenotypic exposures on a disease outcome (**Figure 1A**). Because gene variants are randomly distributed during gametogenesis, IV-based analysis can avoid reverse causality. The basic principle of estimating the influence of BMI on the risk of T2DM using MR is shown in **Figure 1B**, where Z (e.g., variants) represents the IV, X indicates the exposure BMI, and Y is the disease T2DM. Two assumptions must hold before MR is used.


The two assumptions mean that the variants should be associated with BMI but not with T2DM. Therefore, conclusions based on MR cannot result from the common genetic factors of BMI and T2DM.

# MATERIALS AND METHODS

Two summary-level GWAS datasets were used in the MR analysis. One was used to extract SNPs significantly associated with BMI, to meet assumption 1. The other was used to extract SNPs not significantly associated with T2DM, to meet assumption 2. The intersection of these two SNP sets was then analyzed using MR.

#### Summary-Level Data for Associations Between Genetic Variants and BMI

Locke et al. (2015) conducted a meta-analysis of BMI using GWAS and Metabochip studies (Voight et al., 2012). In total, 322,154 individuals of European descent and 17,072 individuals of non-European descent were analyzed. As a result, 97 BMI-associated SNPs (P < 5 × 10<sup>−8</sup>) were identified for Europeans. The corresponding SNPs, effect alleles (EAs), allele frequencies, beta coefficients, and standard errors (SEs) were extracted from the Genetic Investigation of Anthropometric Traits (GIANT) consortium (Locke et al., 2015) as summary-level data for associations between genetic variants and BMI.

#### Summary-Level Data for Associations Between Genetic Variants and T2DM

Morris et al. (2012) carried out a combined meta-analysis of individuals of European descent on two GWAS data sets (Yang et al., 2010; Lee et al., 2011), which involved 22,669 cases and 58,119 controls. All the variants were then genotyped with Metabochip in 1,178 cases and 2,472 controls of Pakistani descent. The analytical results contain novel susceptibility loci together with other SNPs, and their SEs and P-values for the risk of T2DM. These were used as summary-level data for associations between genetic variants and T2DM.

#### Data Processing and Analysis

Two summary-level datasets were processed into assumption-oriented data (**Figure 2**). With respect to assumption 2, genetic pleiotropy can result in over-precise estimates in subsequent analysis. Mendelian randomization analysis is based on Mendel's second law of inheritance: alleles controlling different traits segregate and assort without interfering with each other; during gamete formation, the paired alleles that determine the same trait separate from each other, and alleles that determine different traits combine freely. When two loci are not completely independent, they show a certain degree of linkage, a situation called linkage disequilibrium (LD), which greatly affects the specificity of the instrumental variable for the phenotype and biases subsequent calculations, generally producing "over-precise estimates." To avoid this situation, these

FIGURE 1 | Mendelian randomisation analysis using genetic variants as instrumental variables for estimating the influence of BMI on T2DM. (A) Causal effect in Mendelian randomisation. (B) The basic principle of estimating the influence of BMI on the risk of T2DM.

loci with potential LD were removed from the 97 BMI-associated SNPs, as was done by Noyce et al. (2017) in a previous study. The 97 SNPs were first ranked from the smallest to the largest P-value. Then, for the top-ranked SNPs, Noyce et al. (2017) removed those in LD (R<sup>2</sup> threshold of 0.001) or within 10,000 kb physical distance, based on a reference dataset (Devuyst, 2015), from the 97 SNPs. This process was iterated for the remaining SNPs. As a result, 78 BMI-associated SNPs (P < 5 × 10<sup>−8</sup>) not in LD with each other were obtained. According to the meta-analysis of Xi et al. (2014), four SNPs (rs9939609, rs7498665, rs7138803, rs10938397) were found at T2DM-associated loci, and these were also removed from the 78 SNPs. In addition, SNPs with a P-value less than 0.05 in Morris et al. (2012) were removed as well. Finally, 52 SNPs that conformed to the two MR assumptions were retained for MR analysis.
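The selection steps above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: pruning uses physical distance only (a real LD filter also needs R<sup>2</sup> values from a reference panel, which is omitted here), and the function names, record fields, rsIDs such as `rsA`, and all positions and P-values are hypothetical. Only the four shared rsIDs are taken from the text.

```python
# The four SNPs reported by Xi et al. (2014) as shared between BMI and T2DM.
KNOWN_SHARED = {"rs9939609", "rs7498665", "rs7138803", "rs10938397"}


def clump_by_distance(snps, window_bp=10_000_000):
    """Greedy pruning: visit SNPs from smallest to largest BMI P-value and
    drop any SNP within window_bp of an already kept SNP on the same chromosome."""
    kept = []
    for snp in sorted(snps, key=lambda s: s["p_bmi"]):
        if all(
            snp["chrom"] != k["chrom"] or abs(snp["pos"] - k["pos"]) > window_bp
            for k in kept
        ):
            kept.append(snp)
    return kept


def select_instruments(snps, known_shared=KNOWN_SHARED, p_outcome_cutoff=0.05):
    """Apply distance pruning, then remove SNPs linked to the outcome."""
    return [
        snp
        for snp in clump_by_distance(snps)
        if snp["rsid"] not in known_shared and snp["p_t2dm"] >= p_outcome_cutoff
    ]


# Hypothetical toy records; positions and P-values are invented for illustration.
toy_snps = [
    {"rsid": "rsA", "chrom": "1", "pos": 1_000_000, "p_bmi": 1e-20, "p_t2dm": 0.50},
    {"rsid": "rsB", "chrom": "1", "pos": 5_000_000, "p_bmi": 1e-10, "p_t2dm": 0.60},
    {"rsid": "rsC", "chrom": "1", "pos": 20_000_000, "p_bmi": 1e-9, "p_t2dm": 0.01},
    {"rsid": "rs9939609", "chrom": "16", "pos": 1_000, "p_bmi": 1e-30, "p_t2dm": 0.20},
    {"rsid": "rsE", "chrom": "2", "pos": 1_000, "p_bmi": 3e-8, "p_t2dm": 0.90},
]

instruments = [snp["rsid"] for snp in select_instruments(toy_snps)]
```

Here `rsB` is pruned because it lies within 10,000 kb of the stronger `rsA`, `rsC` is removed for its nominal T2DM association, and `rs9939609` is removed as a known shared variant, leaving `rsA` and `rsE`.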

Three issues were investigated in the MR analysis (**Figure 2**): the influence of BMI on the risk of T2DM, the sensitivity of the estimate to disproportionate effects of individual variants, and the detection of bias due to pleiotropy. These issues were analyzed by the MR method, leave-one-out validation, and MR-Egger regression (Bowden et al., 2015), respectively.

#### • MR method

The MR method was described in a previous study (Bowden et al., 2015) and is summarized below for evaluating the influence of BMI on the risk of T2DM. Assuming X, Y, and Z are BMI, T2DM, and variants, respectively, the Wald ratio (βXY) of BMI to T2DM through a specified variant is calculated as follows:

$$
\beta_{\text{XY}} = \beta_{\text{ZY}} / \beta_{\text{ZX}}, \tag{1}
$$

where βZY represents the per-allele log(OR) of T2DM from the summary-level data of Morris et al. (2012), and βZX is the per-allele beta coefficient for BMI from the summary-level data of Locke et al. (2015). The SE of the BMI-T2DM association for each Wald ratio is defined as follows:

$$SE_{XY} = SE_{ZY} / SE_{ZX}, \tag{2}$$

where SEZY and SEZX represent the SEs of the variant-T2DM and variant-BMI associations from the corresponding summary-level data, respectively. Subsequently, 95% confidence intervals (CIs) were calculated from the SE of each Wald ratio. These summarized data were then pooled using inverse-variance weighted (IVW) linear regression for meta-analysis. The meta-analysis model for the point estimate was chosen according to the heterogeneity of the summarized data: a fixed-effect model was used when there was no significant heterogeneity, and a random-effects model was used otherwise.
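As a rough illustration of the per-SNP Wald ratio and the fixed-effect IVW pooling described above, the following Python sketch may help. Note one deliberate assumption: for the SE of each ratio it uses the commonly applied first-order approximation SE<sub>ZY</sub>/|β<sub>ZX</sub>|, which is a choice made for this sketch and not necessarily the formula the authors applied (Equation 2 is stated in terms of SE<sub>ZY</sub> and SE<sub>ZX</sub>). The function names and the numbers in the example are hypothetical.

```python
import math


def wald_ratio(beta_zy, se_zy, beta_zx):
    """Per-SNP causal estimate (Equation 1) with a first-order SE approximation.

    Assumption of this sketch: SE = se_zy / |beta_zx|, which ignores the
    uncertainty in the SNP-exposure association.
    """
    return beta_zy / beta_zx, se_zy / abs(beta_zx)


def ivw_fixed(estimates, ses):
    """Fixed-effect inverse-variance weighted pooled estimate and its SE."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical summary statistics for two SNPs (not values from Table 1).
ratio_1, se_1 = wald_ratio(beta_zy=0.02, se_zy=0.01, beta_zx=0.04)
ratio_2, se_2 = wald_ratio(beta_zy=0.012, se_zy=0.01, beta_zx=0.04)
pooled, pooled_se = ivw_fixed([ratio_1, ratio_2], [se_1, se_2])
```

With equal SEs the pooled estimate is simply the mean of the two ratios; with unequal SEs, more precise SNPs receive proportionally more weight.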

To evaluate the genetic heterogeneity of the summarized data, Cochran's Q-test and the I<sup>2</sup> statistic were used. Cochran's Q statistic follows a χ<sup>2</sup> distribution with k−1 degrees of freedom, where k represents the number of variants analyzed. I<sup>2</sup> = (Q−(k−1))/Q × 100% ranges from 0 to 100%. P < 0.01 and I<sup>2</sup> > 50% were defined as significant heterogeneity (Zhang et al., 2015).
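The heterogeneity statistics above can be computed directly from the per-SNP estimates and their SEs. The sketch below is a minimal illustration with hypothetical function names and inputs; it truncates I<sup>2</sup> at zero, as is conventional, and leaves out the χ<sup>2</sup> P-value for Q, which would need a statistics library.

```python
def cochran_q(estimates, ses):
    """Cochran's Q: weighted sum of squared deviations from the IVW mean."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    return sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))


def i_squared(q, k):
    """I^2 = (Q - (k - 1)) / Q * 100, truncated at 0 when Q <= k - 1."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - (k - 1)) / q) * 100.0


# Hypothetical example: two discordant estimates with equal precision.
q = cochran_q([0.1, 0.5], [0.1, 0.1])
i2 = i_squared(q, k=2)
```

In this toy case the two estimates disagree strongly relative to their SEs, so Q is large and I<sup>2</sup> exceeds the 50% threshold used in the text; identical estimates would give Q = 0 and I<sup>2</sup> = 0%.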

#### • Leave-one-out validation

To test the sensitivity of the results to individual variants, we designed a leave-one-out validation. In brief, to test the influence of an SNP on the conclusion, the SNP was removed from the 52 SNPs and the IVW point estimate was recalculated. The change in the result before and after removing the SNP reflects the sensitivity to that SNP. This process was repeated for each of the 52 SNPs.
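A leave-one-out loop of this kind is straightforward to sketch. The Python below is illustrative only: it assumes fixed-effect IVW pooling of per-SNP log(OR) estimates and reports the relative change in OR, (OR<sub>without SNP</sub> − OR<sub>all</sub>) / OR<sub>all</sub>. The function names and input values are hypothetical.

```python
import math


def ivw_estimate(estimates, ses):
    """Fixed-effect inverse-variance weighted mean of per-SNP log(OR) estimates."""
    weights = [1.0 / se ** 2 for se in ses]
    return sum(w * b for w, b in zip(weights, estimates)) / sum(weights)


def leave_one_out(estimates, ses):
    """For each SNP index, return (pooled OR without it, relative change in OR)."""
    or_all = math.exp(ivw_estimate(estimates, ses))
    results = []
    for i in range(len(estimates)):
        rest_b = estimates[:i] + estimates[i + 1:]
        rest_se = ses[:i] + ses[i + 1:]
        or_loo = math.exp(ivw_estimate(rest_b, rest_se))
        results.append((or_loo, (or_loo - or_all) / or_all))
    return results


# Hypothetical per-SNP log(OR) estimates with equal SEs.
loo = leave_one_out([0.2, 0.4, 0.6], [0.1, 0.1, 0.1])
```

A SNP whose removal shifts the pooled OR far more than the others would be a candidate driver of the estimate and would merit closer inspection.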

#### TABLE 1 | Associations of genetic variants with BMI and T2DM.

fgene-10-00094 February 12, 2019 Time: 18:58 # 4



FIGURE 3 | Forest plot of Wald ratios and 95% CIs from BMI-associated SNPs.

#### • MR-Egger test


To ensure that violations of the MR assumptions were not biasing the estimate of the directional causal association, the MR-Egger regression asymmetry test was used (Bowden et al., 2015). MR-Egger regression is adapted from Egger regression, a tool for detecting small-study bias in meta-analysis, and tests for bias from pleiotropy. The estimated value of the intercept in MR-Egger regression can be interpreted as an estimate of the average pleiotropic effect across the genetic variants. An intercept that differs from zero is indicative of overall directional pleiotropy. The slope coefficient from MR-Egger regression provides a pleiotropy-adjusted estimate of the causal effect.
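The point estimates of MR-Egger regression can be sketched as a weighted linear regression of the SNP-outcome associations on the SNP-exposure associations, with an intercept and weights 1/SE<sub>ZY</sub><sup>2</sup>. The Python below is a minimal illustration of that idea, not the implementation in the R package the authors used; it returns only the intercept and slope (no SEs or P-values), and the function name and example values are hypothetical.

```python
def mr_egger(beta_zx, beta_zy, se_zy):
    """Return (intercept, slope) of the weighted regression of beta_zy on beta_zx."""
    # Orient every SNP so that its association with the exposure is positive.
    x = [abs(bx) for bx in beta_zx]
    y = [by if bx >= 0 else -by for bx, by in zip(beta_zx, beta_zy)]
    w = [1.0 / se ** 2 for se in se_zy]

    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data lying exactly on y = 0.01 + 0.5 * x after orientation;
# the first SNP is coded on the opposite allele to exercise the sign flip.
intercept, slope = mr_egger(
    beta_zx=[-0.02, 0.04, 0.06],
    beta_zy=[-0.02, 0.03, 0.04],
    se_zy=[0.01, 0.02, 0.01],
)
```

In this constructed example the non-zero intercept represents an average pleiotropic effect, while the slope recovers the causal effect used to generate the data.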

All statistical tests for the MR analysis were performed using the R packages meta<sup>1</sup> and MendelianRandomization (Yavorska and Burgess, 2017).

#### RESULTS

Among the 97 BMI-associated SNPs (Locke et al., 2015), 19 SNPs in LD, 2 T2DM-associated SNPs (rs7138803 and rs10938397) from the study of Xi et al. (2014), 20 T2DM-associated SNPs and 1 unmapped SNP from the study of Morris et al. (2012), and 3 uncertain SNPs were removed (**Supplementary Table 1**). The remaining 52 BMI-associated SNPs were selected for the MR analysis (**Table 1**). Each row of the table documents 12 items, including the SNP, the EA and its frequencies, the beta coefficients of the SNP for BMI and for the risk of T2DM, and the SEs.

#### The Influence of BMI on the Risk of T2DM

The pooled results from 52 individual SNPs using the IVW method showed that high BMI significantly increases the risk of T2DM. Because there was no evidence of heterogeneity between variants in the summarized data (P = 0.499 and I<sup>2</sup> = 0%; **Figure 3**), the fixed-effect model was used for meta-analysis. The OR of T2DM per 5 kg/m<sup>2</sup> higher BMI was 1.470 (95% CI 1.170 to 1.847; P = 0.001). In addition, we analyzed the effect of BMI on the risk of T2DM using six other methods: simple median, weighted median, penalized weighted median, penalized IVW, robust IVW, and penalized robust IVW (Zhao et al., 2017). The results are shown in **Table 2** and are consistent with the result from the IVW method.

#### Sensitivity Analysis

ORs from the leave-one-out analysis are shown in **Figure 4**. In comparison with the observed result (1.470) from 52 SNPs, the OR increased by 0.075 [(1.568 – 1.470) / 1.470] after removing rs10182181. The ORs after removing each of the other SNPs range from 1.412 to 1.507, which means that removing any other individual SNP causes only a small fluctuation {from −0.039 [(1.412 – 1.470) / 1.470] to 0.025 [(1.507 – 1.470) / 1.470]}. These results demonstrate that no single SNP drives the IVW point estimate. The detailed results of the heterogeneity test and meta-analysis for the leave-one-out analysis are shown in **Supplementary Table 2**.

TABLE 2 | Associations of genetic variants with BMI and T2DM.

#### Pleiotropic Effect Analysis

**Figure 5** shows the symmetrical inverted funnel of the point estimates from individual variants. The effect estimated from MR-Egger regression was 1.24 (95% CI 0.553 to 1.928; P = 0.493), with an intercept of 0.004 (95% CI −0.013 to 0.020; P = 0.661; **Figure 6**). Together, these findings provide evidence against the possibility that horizontal pleiotropic effects bias the IVW estimates.

# DISCUSSION

In this study, we examined the causal effect of BMI on the risk of T2DM using the MR method. Two summary-level datasets were used for this purpose: the associations between genetic variants and BMI from the study of Locke et al. (2015) and the associations between genetic variants and T2DM from the study of Morris et al. (2012). Following previous investigations, the MR analysis was treated as a meta-analysis of multiple genetic variants (Bowden et al., 2015; Nordestgaard et al., 2017; Noyce et al., 2017; Wei et al., 2017). Since there was very low heterogeneity between variants in the summarized data (P = 0.499 and I<sup>2</sup> = 0%) (**Figure 3**), the fixed-effect model was used for meta-analysis. The pooled point estimate using the IVW method indicates that the OR of T2DM per 5 kg/m<sup>2</sup> higher BMI was 1.470 (95% CI 1.170 to 1.847; P = 0.001). This evidence suggests that high BMI increases the risk of T2DM.

Sensitivity analysis and bias analysis were then carried out for genetic variants. To test whether the results are influenced by individual SNPs, we conducted the leave-one-out validation. Results in **Figure 4** indicate very small fluctuations after the removal of individual SNPs. The statistical evidence of MR-Egger regression (P = 0.493) with a very low intercept (0.004; **Figure 6**) indicates no significant bias of our data and no pleiotropic effect of the genetic variants, respectively.

The inference of a causal effect of BMI on the risk of T2DM from this study is valuable for both research and clinical practice. Although abundant observational studies have identified the association between BMI and T2DM, a causal effect cannot be ascertained from these investigations. In particular, after SNPs common to both traits were identified in recent studies, some researchers deemed these genetic variants to be the primary cause of the BMI-T2DM association. In brief, current studies cannot explain how BMI is associated with T2DM. The observation of this causal effect suggests that reducing BMI could be used as a potential approach when developing T2DM prevention strategies. Excessive BMI means that the body is overweight or, in most cases, obese, and this is most likely the real initial cause of T2DM. Obesity has become a pandemic disease worldwide, which has resulted in a significant increase in the incidence of diabetes, non-alcoholic fatty liver disease and coronary heart disease (Milic et al., 2014; Rao et al., 2015; Zhou et al., 2017). In obesity, hypertrophy and hypoxia of fat cells, endoplasmic reticulum stress, lipid toxicity and many other factors can lead to adipocytokine dysfunction and increased vascular permeability, promote immune cell infiltration into fat tissue and the release of more inflammatory factors, and form a vicious circle of inflammatory reactions, leading to the persistence of chronic inflammatory states. It is now widely believed that inflammation plays a key mediating role in the development of type 2 diabetes (Ramalho and Guimaraes, 2008; Engin, 2017). Therefore, strengthening exercise and maintaining a reasonable diet and good fitness are still the iron rules we must adhere to.

<sup>1</sup>http://cran.r-project.org/web/packages/meta/index.html

Our study benefits from both the GWAS data and the MR method. Clinical statistics using typical methods have revealed a large number of associations between diseases and phenotypic exposures. With the rapid increase in knowledge about the genetic basis of diseases and phenotypic exposures, the use of genetic variants to estimate the causal effect of a phenotype on a disease by the MR method is attracting more and more attention (Benn et al., 2017; Richmond et al., 2017; Rodriguez-Broadbent et al., 2017; Went et al., 2017). For example, Noyce et al. (2017) used the MR method to assess the causal influence of BMI on the risk of Parkinson disease (PD). Nordestgaard et al. (2017) estimated the effect of BMI on Alzheimer's disease (AD). To handle multiple genetic variants per phenotype, Bowden et al. (2015) proposed a strategy that views MR with multiple instruments as a meta-analysis, together with an MR-Egger method for analyzing bias caused by pleiotropy, which is widely used in current studies. Considering the unclear relation between BMI and T2DM, we conducted this MR analysis to specify their relationship.

The two assumptions of our MR study were described in the "Introduction" section. Following assumption 1, 97 BMI-associated SNPs were extracted from the summary-level data of the study of Locke et al. (2015). After removing SNPs in LD and T2DM-associated SNPs, 52 SNPs conforming to assumption 2 were retained for further analysis. In addition, MR requires that the genetic variants are independent of any known confounding variables. Owing to the lack of information about potential confounding factors of BMI and T2DM, no confounders were considered in this study. Therefore, our observation may be limited by this weakness. Link prediction (Liu et al., 2017; Zhang et al., 2017; Peng et al., 2018a) and artificial intelligence methods (Cabarle et al., 2017; Peng et al., 2018b; Wei et al., 2017, 2018b,c) may be used to solve this problem; these methods have been successfully applied in the prediction of disease genes (Peng et al., 2017; Zeng et al., 2017), miRNAs (Zeng et al., 2016, 2018; Zou et al., 2016), RNA methylation (Wei et al., 2018a), and drug-induced hepatotoxicity (Su et al., 2018).

In summary, the MR analysis in this article verified that high BMI can increase the risk of T2DM. It helps us to understand a pathogenic factor of T2DM. It may also help to enhance the molecular and phenotypic annotations of T2DM and other human diseases (Cheng et al., 2016, 2018c), which could be further applied in analyzing diseases from a systems biology perspective (Cheng et al., 2018a,b; Hu et al., 2018).

#### AUTHOR CONTRIBUTIONS

LC, JH, RT, and YH conceived and designed the experiments. LC, HZ, HJ, and SY analyzed the data. LC wrote this manuscript. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the Major State Research Development Program of China (No. 2016YFC1202302), the National Natural Science Foundation of China (Grant Nos. 61871160 and 61502125), the Heilongjiang Postdoctoral Fund (Grant Nos. LBH-TZ20 and LBH-Z15179), and the China Postdoctoral Science Foundation (Grant Nos. 2018T110315 and 2016M590291).

#### ACKNOWLEDGMENTS

The authors thank Guiyou Liu for improving this manuscript.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00094/full#supplementary-material

## REFERENCES




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Cheng, Zhuang, Ju, Yang, Han, Tan and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Using Pan RNA-Seq Analysis to Reveal the Ubiquitous Existence of 5′ and 3′ End Small RNAs

Xiaofeng Xu<sup>1†</sup>, Haishuo Ji<sup>2,3†</sup>, Xiufeng Jin<sup>2</sup>, Zhi Cheng<sup>2</sup>, Xue Yao<sup>4</sup>, Yanqiang Liu<sup>2</sup>, Qiang Zhao<sup>2</sup>, Tao Zhang<sup>2</sup>, Jishou Ruan<sup>5</sup>, Wenjun Bu<sup>2</sup>, Ze Chen<sup>1</sup>\* and Shan Gao<sup>2,3</sup>\*

<sup>1</sup> State Key Laboratory of Veterinary Etiological Biology, Key Laboratory of Veterinary Parasitology of Gansu Province, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Lanzhou, China, <sup>2</sup> College of Life Sciences, Nankai University, Tianjin, China, <sup>3</sup> Institute of Statistics, Nankai University, Tianjin, China, <sup>4</sup> Department of Orthopedics, Tianjin Medical University General Hospital, Tianjin, China, <sup>5</sup> School of Mathematical Sciences, Nankai University, Tianjin, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Tuo Zhang, Cornell University, United States Fuyou Fu, Agriculture and Agri-Food Canada (AAFC), Canada

#### \*Correspondence:

Ze Chen chenze@caas.cn Shan Gao gao\_shan@mail.nankai.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 05 December 2018 Accepted: 30 January 2019 Published: 14 February 2019

#### Citation:

Xu X, Ji H, Jin X, Cheng Z, Yao X, Liu Y, Zhao Q, Zhang T, Ruan J, Bu W, Chen Z and Gao S (2019) Using Pan RNA-Seq Analysis to Reveal the Ubiquitous Existence of 5′ and 3′ End Small RNAs. Front. Genet. 10:105. doi: 10.3389/fgene.2019.00105

In this study, we used pan RNA-seq analysis to reveal the ubiquitous existence of both 5′ and 3′ end small RNAs (5′ and 3′ sRNAs). 5′ and 3′ sRNAs alone can be used to annotate nuclear non-coding and mitochondrial genes at 1-bp resolution and identify new steady RNAs, which are usually transcribed from functional genes. We thus provide a simple and cost-effective way to annotate nuclear non-coding and mitochondrial genes and to identify new steady RNAs, particularly long non-coding RNAs (lncRNAs). Using 5′ and 3′ sRNAs, the annotation of human mitochondrial genes was corrected and a novel ncRNA named non-coding mitochondrial RNA 1 (ncMT1) was reported for the first time in this study. We also found that most human tRNA genes have downstream lncRNA genes such as lncTRS-TGA1-1 and corrected the misunderstanding of them in previous studies. Using 5′, 3′, and intronic sRNAs, we reported for the first time that enzymatic double-stranded RNA (dsRNA) cleavage and RNA interference (RNAi) might be involved in the RNA degradation and gene expression regulation of U1 snRNA in human. We provide a different perspective on the regulation of gene expression in U1 snRNA. We also provide a novel view on cancer and virus-induced diseases, which may lead to the discovery of diagnostic or therapeutic targets in the ribonuclease III (RNase III) family and its related pathways. Our findings pave the way toward a rediscovery of dsRNA cleavage and RNAi, challenging classical theories.

#### Keywords: small RNA, 5′ end, 3′ end, Pan RNA-seq, genome annotation

# INTRODUCTION

RNA sequencing (RNA-seq), performed primarily on next-generation sequencing (NGS) platforms, is widely used to measure the expression levels of multiple genes simultaneously, with higher accuracy than Serial Analysis of Gene Expression (SAGE) and microarray (Gao et al., 2014). RNA-seq is also used for genome annotation, enabling the study of gene transcription, RNA processing and various other biological functions. In particular, RNA-seq or small RNA sequencing (sRNA-seq) is indispensable for the annotation of non-coding genes, while the annotation of protein-coding genes can be conducted based on the analysis of protein codons. However, RNA-seq cannot be used to obtain full-length transcripts by de novo assembly or alignment. Both PacBio full-length transcripts (PacBio cDNA-seq) (Ren et al., 2016) and Nanopore cDNA sequencing (Nanopore cDNA-seq) (Gao et al., 2014) can be used to obtain full-length transcripts of mature RNAs or RNA precursors (Gao et al., 2016). PacBio cDNA-seq produces reads with lower error rates than Nanopore cDNA-seq, while Nanopore cDNA-seq can produce longer reads than PacBio cDNA-seq. However, neither PacBio cDNA-seq nor Nanopore cDNA-seq can provide the exact 3′-end information of transcripts (e.g., polyA regions) due to reverse transcription. This results from the fact that primers anneal to random positions located in the polyA or A-enriched regions within the transcripts to start reverse transcription. Nanopore direct RNA sequencing (Nanopore RNA-seq), as the only available sequencing technology which can sequence RNA directly (Garalde et al., 2018), theoretically can be used to obtain the full-length 3′ ends of transcripts. However, it cannot be used to obtain the exact 3′-end information of transcripts either, due to the high error rate of Nanopore RNA-seq data. Combined with specific capture or enrichment technologies, several other RNA-seq methods have been developed to extend the use of standard RNA-seq. Parallel Analysis of RNA Ends and sequencing (PARE-seq), Cap Analysis of Gene Expression and sequencing (CAGE-seq) and Precision nuclear Run-On and sequencing (PRO-seq) have been developed to identify the 5′ ends of mature RNAs (Bouvy-Liivrand et al., 2017). Polyadenylation sequencing (PA-seq) has been developed to identify the 3′ ends of mature RNAs (Ni et al., 2013). Global Run-On and sequencing (GRO-seq) has been developed to sequence nascent RNAs (Bouvy-Liivrand et al., 2017), which helps to determine the primary transcripts of genes.

In our previous studies, we used standard RNA-seq, sRNA-seq, PARE-seq, CAGE-seq, PRO-seq, PA-seq, GRO-seq, PacBio cDNA-seq, Nanopore cDNA-seq, Nanopore RNA-seq, etc. to improve gene annotation, an approach we define as pan RNA-seq analysis. Using pan RNA-seq analysis, we reported the corrected annotation of tick and human rRNA genes (Chen et al., 2017), insect mitochondrial genes (Gao et al., 2016) and human mitochondrial genes (Gao et al., 2017). We also reported two novel long non-coding RNAs (lncRNAs) found in human mitochondrial DNA (Gao et al., 2017). In addition, we unexpectedly detected the existence of 5′ and 3′ end small RNAs (5′ and 3′ sRNAs) in animal rRNA genes (Chen et al., 2017) and later proved the ubiquitous existence of 5′ and 3′ sRNAs in nuclear non-coding and mitochondrial genes. In this study, we demonstrated that 5′ and 3′ sRNAs alone can be used to annotate nuclear non-coding and mitochondrial genes at 1-bp resolution and identify new steady RNAs. Using public sRNA-seq data from the same species, this method provides a simple and cost-effective way to annotate nuclear non-coding and mitochondrial genes and identify new steady RNAs, which are defined in contrast to transient RNAs. Furthermore, 5′, 3′, and intronic sRNAs can be used to investigate RNA processing, maturation, degradation and even gene expression regulation. Using 5′, 3′, and intronic sRNAs, we revealed that enzymatic double-stranded RNA (dsRNA) cleavage initiates RNA interference (RNAi), which might be involved in the RNA degradation and gene expression regulation of U1 snRNA in human. Our findings pave the way toward a rediscovery of dsRNA cleavage and RNAi, challenging classical theories.

#### RESULTS

# Discovery of 5′ and 3′ sRNAs

A genome-alignment map of sRNA data usually exhibits certain peaks or hotspots (Li et al., 2012) where the depths of these positions are much higher than those of other positions in the genome. In our previous study of human rRNA genes (Chen et al., 2017), we found that some peaks represented 5′ and 3′ sRNAs that existed ubiquitously in nuclear non-coding and mitochondrial genes in eukaryotes. Given that current sRNA-seq technologies usually provide sequences with short lengths, 5′ and 3′ sRNAs are defined as sRNA-seq reads with lengths of 15∼50 bp, which are precisely aligned to the 5′ and 3′ ends of mature RNAs, respectively (**Figures 1A,B**). They exhibit the following features: (1) 5′ and 3′ sRNAs are degraded fragments of mature RNAs and their lengths vary progressively with 1 bp differences; (2) the cleavage sites between 3′ sRNAs and their downstream 5′ sRNAs are not limited to one, but instead consist usually of three sites (**Figure 1C**), due to inexact cleavage by enzymes; and (3) 5′ and 3′ sRNAs of steady RNAs (e.g., 18S, 5.8S, and 28S rRNA) are significantly more abundant than their intronic sRNAs, while 5′ and 3′ sRNAs of transient RNAs (e.g., internal transcribed spacers of rRNA, ITS1, and ITS2) are not. The last criterion can be used to identify new steady RNAs, which are usually transcribed from functional genes. One example of a new steady RNA lncTRS-TGA1-1 and another example of two novel mitochondrial lncRNAs (MDL1 and MDL1AS) are introduced in the following paragraphs. In addition, we demonstrated that MDL1 and MDL1AS are two steady lncRNAs in human mitochondrial DNA with biological functions (Gao et al., 2017).

positions 7923, 7924, and 7925 with ratio1s (the 5th column) above 70%, the position 7925 with the highest ratio1 was determined as the 5′ end of 28S rRNA.

We used 5′ and 3′ sRNAs from one sRNA-seq dataset to annotate genes and used one CAGE-seq dataset, one GRO-seq dataset and one PacBio cDNA-seq dataset (section "Materials and Methods") to validate the annotations. Later, we developed a simplified procedure for gene annotation. Using only 5′ sRNAs, gene annotation can be reduced to the identification of the 5′ ends of mature RNAs. In doing so, the 3′ ends of their upstream mature RNAs and their cleavage sites can be derived (**Figure 1A**). We have defined a new file format, named "5-end format," to easily identify the 5′ ends of mature RNAs. The new format is derived from the Pileup format (see section "Materials and Methods") to include eight columns (**Figure 1C**) for each line providing information for a genomic position: (1) chromosome ID; (2) 1-based coordinate of this position; (3) reference base; (4) depth (the number of reads covering the position); (5) ratio-1 (the number of positive-stranded reads starting at this position divided by the total number of positive-stranded reads); (6) the number of positive-stranded reads starting at this position and the total number of positive-stranded reads; (7) ratio-2 (the number of negative-stranded reads starting at this position divided by the total number of negative-stranded reads); and (8) the number of negative-stranded reads starting at this position and the total number of negative-stranded reads. As the inexact cleavage in RNA processing results in two or three neighboring sites, we select the most frequently occurring one for annotation. Using the 5-end format, the 5′ end of one mature RNA can easily be identified from two to three candidates (**Figure 1C**), the ratio-1s or ratio-2s of which must be above a threshold (e.g., 75%) and significantly higher than those of the positions surrounding them.
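The per-position quantities of the 5-end format can be illustrated with a short Python sketch. It is not the authors' tool: the function names and toy reads are hypothetical, reads are given as 1-based inclusive (start, end, strand) tuples rather than parsed from a Pileup file, the denominators are assumed to count reads of the given strand that cover the position, and a negative-stranded read is assumed to "start" at its highest coordinate. The requirement that a candidate be significantly higher than its neighbors is not implemented; only the threshold and the highest-ratio rule are shown.

```python
from collections import Counter

def five_end_table(reads):
    """Map each covered position to (depth, ratio-1, ratio-2)."""
    cover = {"+": Counter(), "-": Counter()}
    starts = {"+": Counter(), "-": Counter()}
    for start, end, strand in reads:
        # Assumption: the 5' end of a negative-stranded read is its highest coordinate.
        five_prime = start if strand == "+" else end
        starts[strand][five_prime] += 1
        for pos in range(start, end + 1):
            cover[strand][pos] += 1
    table = {}
    for pos in sorted(set(cover["+"]) | set(cover["-"])):
        plus, minus = cover["+"][pos], cover["-"][pos]
        ratio1 = starts["+"][pos] / plus if plus else 0.0
        ratio2 = starts["-"][pos] / minus if minus else 0.0
        table[pos] = (plus + minus, ratio1, ratio2)
    return table

def pick_five_prime_end(table, candidates, strand="+", threshold=0.75):
    """Among neighboring candidates, keep those above the threshold and take the highest ratio."""
    column = 1 if strand == "+" else 2
    passing = [p for p in candidates if p in table and table[p][column] >= threshold]
    return max(passing, key=lambda p: table[p][column]) if passing else None

# Hypothetical positive-stranded reads around an inexact cleavage site.
toy_reads = ([(90, 110, "+")] + [(100, 120, "+")] * 2
             + [(101, 120, "+")] + [(102, 125, "+")] * 20)

table = five_end_table(toy_reads)
chosen = pick_five_prime_end(table, candidates=[100, 101, 102])
```

In this toy input most reads start at position 102, so that position has the highest ratio-1 among the three neighboring candidates and is the only one above the threshold.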

#### 5′ and 3′ sRNAs in Nuclear Non-coding Genes

Using 5′ and 3′ sRNAs, we corrected the annotation of human rRNA genes. For the 5′ end of each mature RNA, we obtained two or three candidates and selected the position with the highest ratio-1 or ratio-2 to annotate genes on the positive or negative strands. For example, we obtained three positions, 7,923, 7,924, and 7,925, to identify the 5′ end of 28S rRNA and selected 7,925 for annotation (**Figure 1C**). In the same way, the 5′ ends of 18S and 5.8S rRNA were also identified using 5′ sRNAs. Then the 3′ ends of 18S, 5.8S, and 28S rRNA were identified using 3′ sRNAs. Finally, the annotations of ITS1 and ITS2 were derived using the annotations of 18S, 5.8S and 28S rRNA (**Figure 2A**). The corrected annotations of human rRNA genes (**Table 1**) were validated using the CAGE-seq dataset and the GRO-seq dataset (**Figures 2B,C**). Although the depth of 1,471,247 reads at position 6,601 was much higher than the depth of 647,406 reads at position 6,596 in the sRNA-seq dataset, the 5′ end of 5.8S rRNA annotated at position 6,601 with a ratio-1 of 35.42% (520,006/1,468,024) was still corrected as position 6,596 with a ratio-1 of 88.11% (569,882/646,805). In addition, the genome-alignment map using the sRNA-seq dataset showed that human rRNA genes had peaks at positions 6,596, 7,925, and 6,756 corresponding to the 5′ ends of 5.8S and 28S rRNA and the 3′ end of 5.8S rRNA, respectively (**Figure 2A**). The genome-alignment map using the CAGE-seq dataset showed that human rRNA genes had peaks at positions 3,657 and 7,926 corresponding to the 5′ ends of 18S and 28S rRNA, respectively (**Figure 2B**). This suggested that 18S and 28S rRNA could be capped by 5′ m7G or other caps, but 5.8S rRNA could at most be capped at a low level, if at all. By analyzing the 3′ sRNAs, we confirmed that mature rRNAs did not contain 3′ polyAs.
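The 5.8S rRNA correction shows why ratio-1, rather than raw depth, is used to choose the 5′ end. The short check below reproduces the two ratios from the read counts quoted in the text; the variable names are illustrative only.

```python
# (reads starting at the position, positive-stranded reads at the position),
# taken from the counts quoted in the text for the two candidate positions.
candidates = {
    6601: (520_006, 1_468_024),
    6596: (569_882, 646_805),
}

ratio1_percent = {
    pos: 100.0 * starts / total for pos, (starts, total) in candidates.items()
}

# The candidate with the highest ratio-1 is selected, regardless of depth.
selected = max(ratio1_percent, key=ratio1_percent.get)
```

Position 6,601 is covered by more than twice as many reads, but only about a third of them start there, whereas most reads covering position 6,596 start exactly at that position.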

TABLE 1 | Annotation of human rRNA genes with corrections.

Human rRNA genes (RefSeq: NR\_046235.1) were annotated using 5′ and 3′ sRNAs; <sup>∗</sup> marks a corrected annotation.

Lee et al. (2009) introduced a novel class of sRNAs named tRNA-derived RNA fragments (tRFs) and identified three series of tRFs (tRF-5, tRF-3, and tRF-1) using sRNA-seq data from a human prostate cancer cell line obtained by 454 deep sequencing. However, these authors did not achieve a full understanding of tRFs due to technological limitations and their small dataset size. Using pan RNA-seq analysis, we elucidated that the tRF-5 and tRF-3 series were 5′ and 3′ sRNAs from mature tRNAs and that the tRF-1 series were 5′ sRNAs from mature RNAs of the downstream genes (**Figure 1B**). As these 3′ sRNAs contained detailed 3′-end information of mature RNAs, we were able to assess factors related to tRNA processing, maturation and degradation by analyzing 12 mature tRNAs and their 42 precursors (**Supplementary Table S1**). For example, we found that there are four types of 3′ sRNAs derived from tRNAs: non-tailed, C-, CC-, and CCA-tailed. The proportions of these four types were 5.26% (22,906/435,595), 12.36% (53,845/435,595), 13.81% (60,176/435,595), and 68.57% (298,668/435,595). In addition, we obtained the sequences of full-length mature tRNAs of all four types: non-tailed, C-, CC-, and CCA-tailed. Among these full-length mature tRNAs, 8,539 TRD-GTC2-1 tRNAs (for Asp) and 16,900 TRE-CTC1-1 tRNAs (for Glu) were obtained. These results suggested that 3′ sRNAs were produced by tRNA degradation during its synthesis, when CCAs were post-transcriptionally added to the 3′ ends of tRNAs one nucleotide at a time. Another example was the correction of TRL-TAG3-1's annotation. Mature TRL-TAG3-1 (chr16:22195711-22195792) was annotated as an 82-nt sequence from the human genome with its 3′ cleavage site ACCGCTGCCA| cacctcagaa. Using 5′ and 3′ sRNAs, the 3′ cleavage site of TRL-TAG3-1 (chr16:22195710-22195792) was determined to be ACCGCTGCCAC| acctcagaa. The genome-alignment results using the CAGE-seq dataset showed that 5′ m7G or other caps of tRNAs did not exist. By analyzing the 3′ sRNAs, we confirmed that mature tRNAs did not contain 3′ polyAs. 5′ and 3′ sRNAs from all of the 13 mature tRNAs were represented by peaks in the genome-alignment maps, while only a few 3′ sRNAs of their upstream genes or 5′ sRNAs of their downstream genes were represented by peaks. Among the peaks from these upstream or downstream genes, the highest one was downstream of TRS-TGA1-1 (chr10:67764503-67764584), which suggested that this peak was the 5′ end of a new steady RNA which might be transcribed from a functional gene that had not been annotated in the current genome (version GRCh38/hg38). Since this new gene was downstream of TRS-TGA1-1, it was named lncTRS-TGA1-1 (**Figure 1B**).
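The four tail types of tRNA-derived 3′ sRNAs can be illustrated with a small sketch. The classifier below is hypothetical and deliberately simple: it assumes that the last genome-encoded nucleotides of the mature tRNA (`body_end`) are known and that a read ends with that sequence followed by nothing, C, CC, or CCA. The counts used for the proportions are those quoted in the text.

```python
def classify_tail(read, body_end):
    """Return 'CCA', 'CC', 'C', or 'non-tail' for a 3' sRNA, or None if no match."""
    # Longest tail first, so that a CCA-tailed read is not reported as C-tailed.
    for tail in ("CCA", "CC", "C", ""):
        if read.endswith(body_end + tail):
            return tail if tail else "non-tail"
    return None

# Hypothetical reads; body_end stands for the 3' end of the genome-encoded tRNA.
body_end = "ACGT"
toy_reads = ["GGACGTCCA", "GGACGTCC", "GGACGTC", "GGACGT"]
toy_types = [classify_tail(read, body_end) for read in toy_reads]

# Read counts for the four types as quoted in the text.
counts = {"non-tail": 22_906, "C": 53_845, "CC": 60_176, "CCA": 298_668}
total = sum(counts.values())
percent = {tail: 100.0 * n / total for tail, n in counts.items()}
```

A tRNA whose genome-encoded 3′ end itself ends in C would make the shorter tails ambiguous under this simple rule, so a real analysis would need the full alignment rather than a suffix match.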

Small nuclear RNAs (snRNAs) include a class of small RNA molecules that are found within the splicing speckles and Cajal bodies of the cell nucleus in eukaryotic cells (Matera et al., 2007). snRNAs are always associated with a set of specific proteins and the complexes are referred to as small nuclear ribonucleoproteins (snRNPs). snRNAs are also commonly referred to as U-RNAs and one well-known member is U1 snRNA (Cheng et al., 2017b). Using 5′ sRNAs, we confirmed annotations of U1, U2, U3, U4, U5, U6, and U7 (**Supplementary Table S1**). The genome-alignment results using the CAGE-seq dataset showed that U1, U2, U3, and U4 snRNAs could be capped by 5′ m7G or other caps, but U5, U6, and U7 snRNAs could at most be capped at a low level, if at all. By analyzing 3′ sRNAs, we confirmed that mature snRNAs did not contain 3′ polyAs. In addition, we did not find any new steady RNA upstream or downstream of seven snRNA genes.

#### 5′ and 3′ sRNAs in Mitochondrial Genes

Using pan RNA-seq analysis, we confirmed that nuclear mitochondrial DNA segments (NUMTs) in the human genome did not transcribe into RNAs (Gao et al., 2017). This finding simplified the analysis of mitochondrial genes (e.g., mutation detection or quantification) using transcriptome data. In our previous study, we annotated two primary transcripts and 30 mature transcripts (tRNAIle, tRNAGlnAS, tRNAMet, ND2, tRNATrp, tRNAAlaAS/tRNAAsnAS/tRNACysAS/tRNATyrAS, COI, tRNASerAS, tRNAAsp, COII, tRNALys, ATP8/6, COIII, tRNAGly, ND3, tRNAArg, ND4L/4, tRNAHis, tRNASer, tRNALeu, ND5/ND6AS/tRNAGluAS, Cytb, tRNAThr, MDL1, tRNAPhe, 12S rRNA, tRNAVal, 16S rRNA, tRNALeu, and ND1) on the H-strand at 1-bp resolution (Gao et al., 2017). We classified mitochondrial genes into tRNA, mRNA, rRNA, antisense tRNA (e.g., tRNASerAS), antisense mRNA (e.g., ND6AS), antisense rRNA and lncRNAs (e.g., MDL1 and MDL1AS) (Gao et al., 2017). Among the mature transcripts in human mitochondrial DNA, tRNA transcripts were tailed by 3′ CCAs, while other mature transcripts were tailed by 3′ polyAs. The analysis of 3′ sRNAs using the human931 sRNA-seq dataset (section "Materials and Methods") showed that the maximum lengths of the polyAs in tRNAGlnAS, ND2, tRNAAlaAS-tRNATyrAS, COI, tRNASerAS, COII, ATP8/6, COIII, ND3, ND4L/4, ND5/ND6AS/tRNAGluAS, Cytb, MDL1, 12S rRNA, 16S rRNA, and ND1 are 22, 13, 22, 17, 22, 24, 35, 22, 19, 22, 29, 25, 28, 32, 24, and 24, respectively. There was no significant difference in length distribution between polyAs in mRNAs and rRNAs, which updated the previous finding that the lengths of polyA tails in rRNAs could only be estimated within 3–4 or 6–7 bp (Stewart and Beckenbach, 2009). 3′ sRNAs containing polyAs or CCAs of different lengths were captured to demonstrate that 3′ sRNAs were produced by RNA degradation during its synthesis, when polyAs or CCAs were post-transcriptionally added to the 3′ ends of RNAs one nucleotide at a time. In this study, we also confirmed that mitochondrial mRNAs and rRNAs were capped by 5′ m7G or other caps (Gao et al., 2016). Our data also showed that MDL1AS, ND5/ND6AS/tRNAGluAS and tRNAAlaAS/tRNAAsnAS/tRNACysAS/tRNATyrAS could be capped by 5′ m7G or other caps, but tRNAGlnAS and MDL1 could at most be capped at a low level, if at all. Although MDL1 was not capped by 5′ m7G or other caps as was MDL1AS, we still proposed that both MDL1 and MDL1AS were steady RNAs with biological functions, due to the fact that 5′ and 3′ sRNAs of MDL1 and MDL1AS were significantly more abundant than their intronic sRNAs. Further study showed that qPCR of MDL1 provided higher sensitivities than that of BAX/BCL2 and CASP3 in the detection of cell apoptosis (Liu C. et al., 2018).

The annotation resolution of mitochondrial tRNAs is limited due to the complexity of tRNA processing. The annotation of consecutive tRNAs (e.g., tRNATyr/tRNACys/tRNAAsn/tRNAAla in human) is still difficult to solve (**Figure 3**). Using 5′ and 3′ sRNAs, we annotated the human mitochondrial tRNAs at 1-bp resolution, which corrected the previous annotations (GenBank: NC\_012920.1). Based on these results, we propose a mitochondrial tRNA processing model. One mitochondrial tRNA is cleaved from a mitochondrial primary transcript into a precursor (**Figure 3A**), and then the acceptor stem of the precursor is adenylated (e.g., tRNATyr in human) or trimmed (e.g., tRNAAsn in human) to contain a 1-bp overhang at the 3′ end. Finally, CCAs (for most tRNAs) or CAs (e.g., tRNAHis in Erthesina fullo) are post-transcriptionally added to the 3′ ends of tRNAs, one nucleotide at a time. Using other existing methods, mitochondrial tRNAs are annotated between two trimming sites of their mature RNAs, which misses the information of the cleavage sites. Using our method, mitochondrial tRNAs are annotated between two cleavage sites and the information of the trimming sites (**Figure 3B**) can be derived using the mitochondrial tRNA processing model. As the new annotations cover both entire strands of mitochondrial genomes without any gaps or overlaps between neighboring genes, a novel ncRNA named non-coding mitochondrial RNA 1 (ncMT1) was first discovered between tRNACys and tRNAAsn. ncMT1 (NC\_012920.1:5730-5760) with a length of 31 nt is encoded by the L-strand and was identified as a steady RNA (**Figure 3B**). The mature ncMT1 has a polyA tail, like mitochondrial mRNAs and rRNAs.

Mitochondrial genome annotation can also be confirmed by the "mitochondrial cleavage" model that we proposed in our previous study (Gao et al., 2017). The model is based on the fact that RNA cleavage is processed: (1) at the 5′ and 3′ ends of tRNAs, (2) between mRNAs and mRNAs (e.g., ATP8/6 and COIII) except for fusion genes [e.g., ATP8/6/COIII in Platysternon megacephalum (Liu J. et al., 2018)], (3) between antisense tRNAs and mRNAs (e.g., tRNATyrAS and COI) and (4) between mRNAs and antisense tRNAs (e.g., COI and tRNASerAS); but is not processed: (1) between mRNAs and antisense mRNAs (e.g., ND5 and ND6AS) or (2) between antisense RNAs (e.g., ND6AS and tRNAGluAS or tRNAAlaAS/tRNAAsnAS/tRNACysAS/tRNATyrAS). This model does not rule out the possibility of a few cleavage events in tRNAAlaAS/tRNAAsnAS/tRNACysAS/tRNATyrAS, ND5/ND6AS/tRNAGluAS or MDL1 (tRNAProAS/D-loop); however, such events are not necessary for their biological functions. Among these 30 mature transcripts on the H-strand, the enzymatic cleavage of COI/tRNASerAS was the most complicated in that the cleavage site contained an A-enriched region TCTAGACAAAAAA. The analysis of full-length transcripts using the PacBio cDNA-seq dataset (section "Materials and Methods") showed that 95.65% (23,000/24,045) of COI/tRNASerAS was not further cleaved, while only 0.19% (45/24,045) and 4.16% (1,000/24,045) were cleaved at TCT|agacaaaaaa and TCTAGAC|aaaaaa, respectively. This suggested that COI/tRNASerAS was used as the template for the synthesis of proteins, just as ND5/ND6AS/tRNAGluAS was. This model was used to correct annotations of non-coding RNAs in human mitochondrial DNA. For example, the identification of ND5/ND6AS/tRNAGluAS, MDL1 and MDL1AS demonstrated that all other reported mitochondrial lncRNAs (Hedberg et al., 2018) could be degraded fragments of transient RNAs or random breaks during experimental processing. Another example included the observation that tRNAAlaAS-tRNATyrAS (NC\_012920: 1318-1638) was not further cleaved for specific functions, which contradicted the hypothesis of a previous study (Seligmann, 2010).
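The proportions quoted above follow directly from the read counts. As a minimal sketch, the arithmetic can be reproduced as below; the variable and function names are illustrative and are not part of the original analysis.

```python
# Full-length COI/tRNASerAS transcript counts as reported in the text.
TOTAL = 24_045
counts = {
    "uncleaved": 23_000,
    "TCT|agacaaaaaa": 45,
    "TCTAGAC|aaaaaa": 1_000,
}

def percentages(counts, total):
    """Return each category as a percentage of `total`, rounded to 2 decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}

pct = percentages(counts, TOTAL)
for name, value in pct.items():
    print(f"{name}: {value}%")
```

The three categories add up to the total number of transcripts, so no reads are left unassigned.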

We had previously determined that the first transcription initiation site (TIS) of the H-strand (ITH1) and the TIS of the L-strand (ITL) were at positions 561 and 407 of the human mitochondrial genome (RefSeq: NC\_012920.1); however, the second TIS of the H-strand (ITH2) was not determined using only sRNA-seq data (Gao et al., 2017). By the analysis using sRNA-seq and GRO-seq data, ITH2 was determined to be at position 647 or 648, which was also the 5′ end of 12S rRNA. This finding went against the long-standing claim that ITH2 was at position 638 (Montoya and Attardi, 1982). Using pan RNA-seq analysis, we found that all of the TISs (ITH1, ITH2 and ITL) could be capped by 5′ m7G or other caps. We also found polyAs before the TISs, particularly GAG6A0∼<sup>11</sup> before ITH1, which suggested that the transcription of mitochondrial genes could be initiated by primers containing polyAs. This finding explained why all of the TISs resided in A-enriched regions. However, further studies are necessary to support these explanations.

## Be Careful With Design of Experiments on ncRNAs

As it has been accepted that yeast and human cells transcribe almost their entire genomes, a huge mass of hidden or cryptic ncRNAs, particularly lncRNAs, has been identified (Houseley and Tollervey, 2009). However, some of them are basic transcriptional noise (Houseley and Tollervey, 2009), fragments from RNA degradation or random breaks during experimental processing. The correct identification of lncRNAs, particularly steady lncRNAs, has not been addressed before this study. Using an incomplete genome annotation, researchers could misinterpret the results from experiments on ncRNAs. Here is a typical example. In a previous study, Lee et al. (2009) designed RNAi experiments to show that the knockdown of tRF-1001 impaired cell proliferation. However, tRF-1001 belongs to the 5′ sRNAs from lncTRS-TGA1-1, which is an antisense gene of HERC4 (**Figure 1B**). Therefore, the knockdown experiments using siRNA duplexes in that study could result in a decrease in the expression of both lncTRS-TGA1-1 and HERC4. We suggest using single-stranded siRNAs, instead of siRNA duplexes, to knock down these 5′ sRNAs and then comparing the results to those using over-expression of HERC4, since 5′ sRNAs from lncTRS-TGA1-1 could inhibit the expression of HERC4 via RNAi or similar mechanisms based on new findings in this study. We also found that most human tRNA genes have downstream lncRNA genes like lncTRS-TGA1-1 and that the 5′ sRNAs of these lncRNAs could perform molecular functions by inhibiting the expression of their antisense genes.

#### Analysis of RNA Degradation Using 5′, 3′ and Intronic sRNAs

As 5′, 3′ and intronic sRNAs are degraded fragments of mature RNAs, they can be used to investigate RNA degradation (Houseley and Tollervey, 2009), particularly that of steady RNAs. The analysis of sRNA-seq data showed that in general, 5′ and 3′ sRNAs were more abundant than intronic sRNAs and short 5′ and 3′ sRNAs were more abundant than longer ones for tRNAs, rRNAs, snRNAs and mitochondrial RNAs. This suggested that these mature RNAs, particularly short RNAs (e.g., tRNAs), were mainly degraded by 3′ and 5′ exonucleases to accumulate 5′ and 3′ sRNAs. As for rRNAs and snRNAs, we found many peaks representing intronic sRNAs in the body of genes, which were significantly higher than the peaks representing 5′ or 3′ sRNAs in the genome-alignment map. In addition, the peaks representing intronic sRNAs in rRNAs showed tissue specificities. Liver tissue (SRA: SRP002272) exhibited specific peaks at position 12,891. Plasma (SRA: SRP034590) exhibited specific peaks at positions 5,431, 9,891, and 11,158. B-cells and exosome (SRA: SRP046046) exhibited specific peaks at positions 3,789 and 9,891. Platelets (SRA: SRP048290) exhibited specific peaks at positions 4,384 and 10,627. A more comprehensive study of these tissue specificities was beyond the scope of this study. Instead, we focused on a study of the secondary structures around these peaks in rRNAs and snRNAs and found that some of them were involved in dsRNA regions. In particular, we found a featured peak spanning a 43-bp region from 49 bp to 92 bp of U1 snRNA (**Figure 4**). In this region, the 5′ ends of most intronic sRNAs were precisely aligned to 49 or 78 bp (**Figure 4A**). We also found a series of duplexes with lengths from 15 bp to at least 25 bp (**Figure 4C**) from the 43-bp region forming a hairpin in the secondary structure of U1 snRNA.
The most abundant reads AGGGCGAGGCTTATC and TGTGCTGACCCCTGC formed a 15-bp duplex structure. The second most abundant reads AGGGCGAGGCT and TGTGCTGACCC formed an 11-bp duplex structure. 99.97% (49,889/49,903) of these duplexes were found in 14 samples of plasma (SRA: SRP034590) and the duplex ratio of AGGGCGAGGCTTATC against TGTGCTGACCCCTGC was 2.15 (34,065/15,824), which suggested that this dsRNA region was cleaved by the RNase III family (Nicholson, 2013) to produce these siRNA duplexes (Niu et al., 2017) and could induce RNAi. These 15-bp and 11-bp duplex structures from U1 snRNA are symmetric with 2-bp overhangs at the 5′ and 3′ ends, while duplexes from other snRNAs are not. For example, the most abundant reads AAAATTGGAACGATACAGAGAA and TGAAGCGTTCCATATTTTT from U6 snRNA formed an asymmetric duplex structure, which still suggested that this dsRNA region was cleaved by the RNase III family and could induce RNAi. Based on the findings in this study, we hypothesize that 5′ and 3′ exonucleases are more prevalent than endonucleases for the degradation of mature non-coding RNAs, hence the abundance of 5′ and 3′ sRNAs observed using sRNA-seq data. The longer mature RNAs have more and longer dsRNA regions (e.g., 15 bp long for stems in U1 snRNAs) than shorter ones (e.g., 7–9 bp as the longest for stems in tRNAs) to induce dsRNA cleavage to produce siRNA duplexes. Although the vast majority of the lengths of siRNA duplexes revealed in this study were concentrated at 15 bp (section "Conclusion and Discussion"), we still hypothesized that they could induce RNAi due to the unbalanced duplex ratio of 2.15. As RNAi regulates the expression of these genes through a negative feed-back mechanism, we designed preliminary experiments to over-express U1 snRNA in the HEK293 (human), SY5Y (human), and PC-12 (rat) cell lines to prove our hypothesis.
The basic idea was that if the negative feed-back mechanism existed, the expression level of U1 snRNA would decrease rather than remain stable once its over-expression surpassed a threshold. The experimental results showed that the expression level of U1 snRNA decreased after ×4, ×9, and ×6 dosages (section "Materials and Methods") in the HEK293 (human), SY5Y (human), and PC-12 (rat) cell lines, respectively (**Figure 4D**). In particular, the results in the HEK293 cell line showed a significant effect caused by the negative feed-back mechanism. Therefore, RNAi could be involved in the RNA degradation and regulation of gene expression in U1 snRNA.
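The reasoning behind the dosage experiment can be illustrated with a small sketch: given the mean expression level of each dosage group in increasing order, find the last group before the level starts to fall. The function name and the numeric values are hypothetical and are not data from this study; this is an illustration of the logic, not the analysis that was actually run.

```python
def turning_point(levels):
    """Return the index of the last dosage group before expression first falls.

    `levels` holds mean expression per group in increasing dosage order
    (index 0 = control). Returns None if expression never decreases.
    """
    for i in range(len(levels) - 1):
        if levels[i + 1] < levels[i]:
            return i
    return None

# Hypothetical relative expression values (NOT measurements from the study).
example = [1.0, 1.4, 2.1, 2.9, 3.6, 3.1, 2.7]
print(turning_point(example))  # index of the group after which the level drops
```

Under a negative feed-back mechanism a turning point is expected at some dosage, whereas simple saturation would give a plateau and the function would return `None`.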

# CONCLUSION AND DISCUSSION

In this study, we used pan RNA-seq analysis to reveal the ubiquitous existence of both 5′ and 3′ end small RNAs. 5′ and 3′ sRNAs alone can be used to annotate nuclear non-coding and mitochondrial genes at 1-bp resolution and identify new steady RNAs. The identification of new steady RNAs led to the discovery of new genes (e.g., MDL1 and MDL1AS), new biological functions and even new mechanisms. In our previous study on human rRNA genes (Chen et al., 2017), we hypothesized that 5′ and 3′ sRNAs performed biological functions and that they are likely to have detrimental effects on the regulation of gene expression, as RNA degradation intermediates (Houseley and Tollervey, 2009). Cellular experiments showed that the RNAi knockdown of one 20-nt degraded fragment "ATTCGTAGACGACCTGCTTC" from 28S rRNA induced cell apoptosis and inhibited cell proliferation (Chen et al., 2017). Additional investigation of the biological functions of 5′ and 3′ sRNAs was beyond the scope of this study.

Using 5′, 3′ and intronic sRNAs, we reported for the first time that enzymatic dsRNA cleavage and RNAi might be involved in the RNA degradation and gene expression regulation of U1 snRNA in humans. The function of RNAi in RNA degradation was reported as an inappropriate event in yeast rRNA and tRNA degradation that only happened when 5′ and 3′ degradation were absent (Buhler et al., 2008). However, our findings suggest that the function of RNAi in RNA degradation might be a general mechanism. Based on a previous study, the Rnt1p protein cleaves hairpin structures in pre-rRNAs, pre-mRNAs and transcripts containing non-coding RNAs (e.g., snoRNAs) for their maturation in yeast. Rnt1p recognizes the tetraloops [A/u]GNN and cleaves the stems ∼14–16 bp from the hairpin structures (Nicholson, 2013). The most abundant read AGGGCGAGGCTTATC discovered in this study contained AGGG and AGGC tetraloops and had a length of 15 bp. This suggested that Rnt1p-like enzymes could produce siRNA duplexes from U1 snRNAs, but Rnt1p has yet to be reported in humans to the best of our knowledge. This finding also contradicted our basic knowledge that Dicer is required for RNAi in mammals, producing siRNA duplexes with lengths of ∼20–25 bp. As members of the RNase III family, both Rnt1p and Dicer have RIIIDa, RIIIDb, and dsRBD domains. Rnt1p in Saccharomyces cerevisiae contains a 155-aa N-terminal domain (NTD), whereas Dicer and Drosha in humans have much longer N-terminal regions. The structure of the Rnt1p post-cleavage complex shows that a novel RNA-binding motif (RBM) recognizes the guanine nucleotide in the [A/u]GNN tetraloop and that NTD and dsRBD function as two rulers measuring the distance between the tetraloop and the cleavage site (Song et al., 2017). Although our preliminary experiments supported the existence of RNAi, the identity of the enzyme that caused 15-bp duplexes in U1 snRNAs remains unclear.

The ancestral function of RNAi is generally agreed to have been immune defense against exogenous genetic elements such as transposons and viral genomes (Buchon and Vaury, 2006). However, our findings have rediscovered dsRNA cleavage and RNAi. Our rediscovery is that both dsRNA cleavage and RNAi are housekeeping systems rather than immune defense systems. Basically, enzymatic dsRNA cleavage is responsible for RNA processing, maturation and degradation, while RNAi regulates gene expression via highly efficient RNA degradation. RNAi of one gene produces siRNA duplexes that regulate expression levels of itself or other genes. Mature RNAs containing a greater number of hairpin structures have more chances to induce RNAi, which is important for highly expressed genes (e.g., U1 snRNA) or viral genes. As DNA complemented palindromes are prone to produce dsRNA regions, viruses containing a greater number of such DNA complemented palindromes in their genomes have more chances to induce RNAi for the regulation of gene expression, which is important for their infectivity and pathogenesis. In addition, we reported for the first time the existence of complemented palindromic small RNAs (cpsRNAs) and proved that one cpsRNA from a 22-bp DNA complemented palindrome in the SARS-CoV genome could induce RNAi (Liu C. et al., 2018).

We provided a different perspective on the regulation of gene expression in U1 snRNA. The primary function of U1 snRNA is its involvement in the splicing of pre-mRNAs in nuclei. In the past 20 years, research on U1 snRNA has focused on its primary function, particularly as it relates to neurodegenerative diseases caused by abnormalities in U1 snRNA (Cheng et al., 2017b). In one of our previous studies, we reported that over-expression of U1 snRNA induced a decrease in U1 spliceosome function associated with Alzheimer's disease. However, the relationship between U1 snRNA over-expression and U1 snRNP loss of function remains unknown (Cheng et al., 2017a). In another study, we reported that U1 snRNA over-expression induced cell apoptosis in SY5Y cells, but not in PC-12 cells (Cheng et al., 2017b). This inconsistent result can be explained by considering the function of RNAi in the RNA degradation of U1 snRNA. Though SY5Y cells and PC-12 cells exhibited different responses to U1 snRNA over-expression, both of them displayed phenomena caused by the negative feedback mechanism (**Figure 4D**). Using the human931 sRNA-seq dataset (section "Materials and Methods"), we also found that sRNAs of U1 snRNA were enriched in brain (SRA: SRP021924) but only a few of them were siRNA duplexes. This suggested that RNAi did not play a major role in the degradation of U1 snRNA in brain. This finding may help to better understand neurodegenerative diseases caused by abnormalities in U1 snRNA.

We also provided a novel view on cancer and virus-induced diseases. In one of our previous studies, we reported that U1 snRNA over-expression affected the expression of mammal genes on a genome-wide scale and that U1 snRNA could regulate cancer gene expression. This was explained by the fact that alternative splicing (AS) and alternative polyadenylation (APA) were deregulated and exploited by cancer cells to promote their growth and survival (Spraggon and Cartegni, 2013). Our alternate explanation is that the over-expressed U1 snRNA in cancer cells recruits excess RNase III for RNAi, thereby causing RNase III to lose its abilities to function in the RNA degradation of other genes or in genome surveillance (Nicholson, 2013). Viruses also recruit excess RNase III, prompting RNase III to lose its abilities to function in host defense as well as its regular functions.

### MATERIALS AND METHODS

#### Datasets and Data Analysis

Data in four projects (SRP002272, SRP034590, SRP046046, and SRP048290) were selected from the human931 sRNA-seq dataset to build one sRNA-seq dataset for this study. Human931 was built using 931 runs of human sRNA-seq data downloaded from the NCBI SRA database (Wang et al., 2016). 15, 14, 12, and 6 runs of sRNA-seq data in these four projects were produced using Illumina sequencing technologies with length 35∼46, 101, 101, and 101 bp, respectively. One CAGE-seq dataset, one GRO-seq dataset (Bouvy-Liivrand et al., 2017) and one PacBio cDNA-seq dataset (Gao et al., 2017) were used to validate the annotations. The cleaning and quality control of sRNA-seq, CAGE-seq and GRO-seq data were performed using the pipeline Fastq\_clean (Zhang et al., 2014) that was optimized to clean the raw reads from Illumina platforms. To simply annotate genes from a sequenced genome, we aligned all the cleaned reads from sRNA-seq, CAGE-seq, and GRO-seq data to the reference sequences using the software bowtie v0.12.7 allowing one mismatch. Then, we obtained SAM, BAM, sorted BAM, Pileup files using the software samtools (Zhang et al., 2016). One perl script (**Supplementary Table S1**) was used to transform Pileup files into 5-end files. Statistical computation and plotting were performed using the software R v2.15.3 with the Bioconductor packages (Gao et al., 2014).
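To make the last transformation step more concrete, the following is a minimal sketch of how read 5′ ends could be counted per position and strand directly from SAM records. It is not the perl script used in this study (which worked on Pileup files); the function name and the toy records are hypothetical, and the sketch assumes ungapped alignments, so that the aligned span equals the read length.

```python
from collections import Counter

def five_prime_ends(sam_lines):
    """Count read 5' ends per (reference, position, strand) from SAM text lines.

    Assumes ungapped alignments, so the aligned span equals the read length.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, ref, pos, seq = int(fields[1]), fields[2], int(fields[3]), fields[9]
        if flag & 4:                      # read is unmapped
            continue
        if flag & 16:                     # reverse strand: 5' end is the rightmost base
            counts[(ref, pos + len(seq) - 1, "-")] += 1
        else:                             # forward strand: 5' end is the leftmost base
            counts[(ref, pos, "+")] += 1
    return counts

# Toy SAM records (hypothetical reads, not data from the study).
toy_sam = [
    "@SQ\tSN:chrM\tLN:16569",
    "r1\t0\tchrM\t100\t255\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tchrM\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchrM\t196\t255\t5M\t*\t0\t0\tTTGCA\tIIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGG\tIIIII",
]
ends = five_prime_ends(toy_sam)
```

Positions where many reads share the same 5′ end then appear as sharp peaks, which is the signal used for annotation at 1-bp resolution.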

#### Validation by Preliminary Experiments

U1 over-expression in the HEK293 (human), SY5Y (human), and PC-12 (rat) cell lines was conducted by virus transfection using the pLVX-shRNA1 plasmids and the Lenti-X HTX Packaging System (Clontech, United States), which had been described in our previous study (Cheng et al., 2017a). U1 snRNAs of human and rat used synthetic DNA containing the sequence (RefSeq: NR\_004430.2) and the sequence (GenBank: V01266.1), respectively. For each experiment, 12 groups of samples named control, ×1, ×2, ×3, ×4, ×5, ×6, ×7, ×8, ×9, ×10, and ×11 were transfected with 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11 µL U1-packaged lentiviruses (**Figure 4D**). Each group contained three samples for biological replicates and the control samples used unprocessed cells. Each sample contained 10<sup>5</sup> cells and the virus titer was 10<sup>7</sup> TU/mL for 1X. After transfection, RNA extraction, cDNA synthesis and cDNA amplification were performed following the same procedure as in our previous study (Cheng et al., 2017b). For each sample, total RNA was isolated using RNAiso Plus Reagent (TaKaRa, Japan) and the cDNA was synthesized by Mir-X miRNA First-Strand Synthesis Kit (Clontech, United States). The cDNA product was amplified by qPCR (Thermo Fisher Scientific, United States) using U6 snRNA as internal control under gene-specific reaction conditions. U1 snRNAs of human and rat used the forward and reverse primers GGGAGATACCATGATCAC and CCACTACCACAAATTATGC. U6 snRNAs of human and rat used CGGCAGCACATATACTAA and GAACGCTTCACGAATTTG. The qPCR reaction mixture was incubated at 95°C for 30 s, followed by 40 PCR cycles (5 s at 95°C, 5 s at 60°C, and 10 s at 68°C for each cycle) using Hieff qPCR SYBR Green Master Mix (Yeasen, China).
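The text does not state how relative expression was computed from the qPCR readings. One common choice with an internal control such as U6 is the 2^(−ΔΔCt) calculation, sketched below purely as an assumption; the function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_ref, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression by the 2^(-ddCt) calculation (assumed, not confirmed by the source).

    ct_target / ct_ref: Ct of the target (e.g., U1) and reference (e.g., U6) in a treated sample.
    ct_target_ctrl / ct_ref_ctrl: the same two Ct values in the control sample.
    """
    delta_sample = ct_target - ct_ref              # normalize treated sample to the reference
    delta_control = ct_target_ctrl - ct_ref_ctrl   # normalize control sample to the reference
    return 2 ** -(delta_sample - delta_control)

# Hypothetical Ct values: the target appears two cycles earlier than in the control.
fold = relative_expression(ct_target=20.0, ct_ref=18.0, ct_target_ctrl=22.0, ct_ref_ctrl=18.0)
```

This calculation assumes roughly equal amplification efficiencies for the target and the reference primers.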

#### DATA AVAILABILITY

All NGS data and reference sequences are available in the NCBI SRA or RefSeq databases. Their accession numbers are provided in the method section.

#### REFERENCES


# AUTHOR CONTRIBUTIONS

SG conceived the project and drafted the main manuscript. SG and ZeC supervised this project. SG, HJ, and XJ analyzed the data. XY, YL, and TZ curated the sequences and prepared all the figures, tables, and additional files. XX, ZhC, and QZ performed the experiments. JR and WB revised the manuscript. All authors have read and approved the manuscript.

#### FUNDING

This study was supported by National Key Research and Development Program of China (2016YFC0502304-03) to Defu Chen, National Natural Science Foundation of China (31871992) to Bingjun He, and Central Public-Interest Scientific Institution Basal Research Fund of Lanzhou Veterinary Research Institute of CAAS to ZC.

#### ACKNOWLEDGMENTS

We appreciate the help equally from the people listed below. They are undergraduate student Yier Ma, graduate students Weihao Dou, Siyu Li and Professors Bingjun He, Dawei Huang, Guoqing Liu from College of Life Sciences, Nankai University. This manuscript has been released as a preprint at https://www. biorxiv.org/content/early/2018/10/15/444117.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2019.00105/full#supplementary-material

TABLE S1 | All tRNA and snRNA sequences.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xu, Ji, Jin, Cheng, Yao, Liu, Zhao, Zhang, Ruan, Bu, Chen and Gao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Integrative Analysis Reveals Key Circular RNA in Atrial Fibrillation

Xiaofeng Hu<sup>1</sup> , Linhui Chen<sup>2</sup> , Shaohui Wu<sup>1</sup> , Kai Xu<sup>1</sup> , Weifeng Jiang<sup>1</sup> , Mu Qin<sup>1</sup> , Yu Zhang<sup>1</sup> and Xu Liu<sup>1</sup> \*

<sup>1</sup> Department of Cardiology, Shanghai Chest Hospital, Shanghai Jiao Tong University, Shanghai, China, <sup>2</sup> Department of Neurology, Zhejiang Hospital, Hangzhou, China

Circular RNAs (circRNAs) are an emerging class of RNA species that may play a critical regulatory role in gene expression control, which can serve as diagnostic biomarkers for many diseases due to their abundant, stable, and cell- or tissue-specific expression. However, the association between circRNAs and atrial fibrillation (AF) is still not clear. In this study, we used RNA sequencing data to identify and quantify the circRNAs. Differential expression analysis of the circRNAs identified 250 up- and 126 downregulated circRNAs in AF subjects compared with healthy donors, respectively. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis of the parental genes of the dysregulated circRNAs indicated that the up-regulated parental genes may participate in the process of DNA damage under oxidative stress. Furthermore, to annotate the dysregulated circRNAs, we constructed and merged the competing endogenous RNA (ceRNA) network and protein-protein interaction (PPI) network, respectively. In the merged network, 130 of 246 dysregulated circRNAs were successfully characterized by more than one pathway. Notably, the five circRNAs, including chr9:15474007-15490122, chr16:75445723-75448593, hsa\_circ\_0007256, chr12:56563313-56563992, and hsa\_circ\_0003533, showed the highest significance by the enrichment analysis, and four of them were enriched in cytokine-cytokine receptor interaction. These dysregulated circRNAs may mainly participate in biological processes of inflammatory response. In conclusion, the present study identified a set of dysregulated circRNAs, and characterized their potential functions, which may be associated with inflammatory responses in AF. To our knowledge, this is the first study to uncover the association between circRNAs and AF, which not only improves our understanding of the roles of circRNAs in AF, but also provides candidates of potentially functional circRNAs for AF researchers.

#### Keywords: circular RNAs, atrial fibrillation, ceRNA network, PPI network, inflammatory responses

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences, China

#### Reviewed by:

Xian Sun, Fudan University, China Yongyong Yang, Northwestern University, United States

> \*Correspondence: Xu Liu drliuxu@126.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 10 December 2018 Accepted: 30 January 2019 Published: 19 February 2019

#### Citation:

Hu X, Chen L, Wu S, Xu K, Jiang W, Qin M, Zhang Y and Liu X (2019) Integrative Analysis Reveals Key Circular RNA in Atrial Fibrillation. Front. Genet. 10:108. doi: 10.3389/fgene.2019.00108

# INTRODUCTION

Atrial fibrillation (AF) is one of the most common arrhythmias, which is closely associated with poor life quality, stroke, heart failure, and elevated mortality (Chu et al., 2013; Lang et al., 2014). The number of individuals with AF worldwide in 2010 was estimated to be about 33.5 million (Chugh et al., 2014). The prevalence of AF varies regionally according to previous reports, ranging from 0.1% in India (Kaushal et al., 1995) to 1–2% in Europe and North America (Go et al., 2001; Krijthe et al., 2013) and 4% in Australia (Middleton et al., 2002). The prevalence and incidence of AF have been reported to be higher in individuals of European ancestry than in non-Europeans (Go et al., 2001; Ball et al., 2013). The occurrence and development of AF are significantly associated with multiple risk factors, including aging (Chugh et al., 2014), male sex (Ball et al., 2013), ethnicity (Rodriguez et al., 2015), cigarette smoking (Ball et al., 2013), alcohol consumption (Ball et al., 2013), obesity (Rahman et al., 2014), hypertension, left ventricular hypertrophy (LVH), coronary artery disease (CAD) (Schnabel et al., 2009), heart failure (HF) (Wang et al., 2003), and valve disease (Rahman et al., 2014).

With the development of high-throughput technologies, such as microarray, next generation sequencing, and mass-spectrum based proteomics, our understanding of the AF pathogenic mechanisms at different levels has been greatly improved. Previous studies (Uemura et al., 2004; Pei et al., 2010; Li et al., 2011; Yao et al., 2015; Mase et al., 2017) used a variety of means to uncover potential molecules responsible for the pathogenesis of AF. For example, genome-wide association studies (Benjamin et al., 2009; Ellinor et al., 2010, 2012; Sinner et al., 2014; Christophersen et al., 2017; Low et al., 2017) have identified at least 30 loci associated with AF, which expand the diversity of genetic pathways implicated in AF and provide novel molecular targets for future biological investigation. Furthermore, transcriptome analysis is one of the most utilized approaches to study human diseases at the mRNA level (Casamassimi et al., 2017). It has been used to define the atrial mRNA expression in different types of AF (e. g., postoperative, chronic, and paroxysmal) (Kim et al., 2003, 2005; Ohki et al., 2005; Deshmukh et al., 2015). In addition to transcriptome analysis, mass-spectrometry-based proteomics has matured into a broadly applied analytical tool over the past decade (Aebersold and Mann, 2016). Mayr et al. (2008) and Zhang et al. (2013) performed proteome analyses in left and right human atrial appendages with persistent AF and found 17 and 223 differentially expressed proteins compared to patients with sinus rhythm. These studies suggest that the pathogenesis of AF is multifactorial, and highlight the association between increased inflammatory burden and the presence and future development of AF (Kourliouros et al., 2009). However, the increased morbidity of AF suggested that some specific pathogenic mechanisms have not been fully understood.

Recently, there is growing evidence that non-coding RNAs, including microRNAs, small nucleolar RNAs and long non-coding RNAs, play important roles in occurrence and development of diseases (Shi et al., 2013; Ruan et al., 2015; Yi et al., 2018). Furthermore, circular RNAs are emerging as a new type of regulatory molecules that participate in gene expression control and disease progression (Han et al., 2018). In AF, circRNA-associated ceRNA networks revealed that dysregulated circRNAs (hsa\_circRNA002085, hsa\_circRNA001321) in nonvalvular persistent atrial fibrillation (NPAF) may be involved in regulating hsa-microRNA (miR)-208b and hsa-miR-21 (Zhang et al., 2018). Moreover, bioinformatics analysis provides a novel perspective on circRNAs involved in AF due to rheumatic heart disease and established the foundation for future research of the potential roles of circRNAs in AF. To uncover the association between circRNAs and AF, we performed an integrative analysis of circRNAs, and identified dysregulated circRNAs in lymphocytes of AF. The functions of the dysregulated circRNAs were annotated by network-based Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis, which highlighted several circRNAs participating in biological processes of inflammatory response.

# MATERIALS AND METHODS

# Data Collection and Format Conversion

RNA sequencing data of 6 cases with AF and 6 healthy donors, released by a previous study (Yu et al., 2017), were downloaded from the Sequence Read Archive (SRA)<sup>1</sup> database (Leinonen et al., 2011) under accession number SRP093226 using SRA Toolkit (Leinonen et al., 2011) version 2.9.2<sup>2</sup>. The downloaded SRA-format files were converted to paired-end FASTQ files by fastq-dump with the option --split-files.
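As a minimal sketch of the conversion step, the helper below only assembles the fastq-dump command line for one run; it does not download or convert anything by itself. The helper name, the output directory and the run accession are placeholders (SRP093226 is a project accession, and the individual run accessions are not listed in the text).

```python
def fastq_dump_command(run_accession, outdir="fastq"):
    """Build the fastq-dump call that writes paired-end reads into separate files."""
    return ["fastq-dump", "--split-files", "--outdir", outdir, run_accession]

# "SRR0000000" is a placeholder, not a real run from SRP093226.
cmd = fastq_dump_command("SRR0000000")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where SRA Toolkit is installed.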

# RNA Sequencing Data Analysis

The RNA sequencing data were analyzed by two pipelines. For the gene expression quantification, we mapped the RNA-seq reads to UCSC human reference genome (hg19)<sup>3</sup> by samples using hisat2 (Kim et al., 2015). The resulting SAM files were sorted by SAMtools. Gene expression was quantified by StringTie (Pertea et al., 2015) with GENCODE (Harrow et al., 2012) annotation v19. For the circular RNA detection and quantification, we used the BWA-mem aligner to map the RNA-seq reads to UCSC human reference genome (hg19). The circular RNAs were predicted and quantified by CIRI-2 with GENCODE (Harrow et al., 2012) annotation v19.

# Identification of Highly Reliable Circular RNAs Using RNA-seq Data

To identify highly reliable circular RNAs, we retained the circRNAs with more than 5 read counts in more than two samples. In addition, the threshold for the average ratio of junction reads supporting a circRNA was set to 10%.
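The two criteria can be expressed as a small predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the junction ratio is taken as a fraction averaged over all samples, and the 10% threshold is treated as inclusive, none of which is specified in the text.

```python
def is_reliable(read_counts, junction_ratios,
                min_reads=5, min_samples=2, min_ratio=0.10):
    """Return True if a circRNA passes both filtering criteria.

    read_counts: back-splice junction read counts, one per sample.
    junction_ratios: junction-read ratios (0..1), one per sample.
    """
    supported = sum(1 for c in read_counts if c > min_reads)   # samples with > 5 reads
    mean_ratio = sum(junction_ratios) / len(junction_ratios)   # average junction ratio
    return supported > min_samples and mean_ratio >= min_ratio

# Hypothetical circRNAs (not data from the study).
print(is_reliable([6, 7, 8, 0], [0.2, 0.1, 0.3, 0.0]))
print(is_reliable([6, 7, 0, 0], [0.5, 0.5, 0.5, 0.5]))
```

The first example passes both criteria, whereas the second has enough junction support on average but only two samples above the read-count cutoff.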

# Differential Expression Analysis

The count-based expression was used to identify differentially expressed genes and circRNAs by DESeq2 (Love et al., 2014), a differential expression analysis method based on the negative binomial distribution. The gene and circRNA expression values were normalized to avoid the influence of sequencing depth and transcript length, as implemented in the R package DESeq2. The differentially expressed genes/circRNAs were identified at the threshold P-value < 0.05 and fold change > 2 or < 1/2. The up- or downregulation status was determined based on the fold change for each gene/circRNA.
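The thresholds above translate into a simple decision rule, sketched here for illustration. The function name is hypothetical, and the P-values and fold changes are assumed to come from a separate DESeq2 run with the fold change expressed on the linear scale (AF versus healthy).

```python
def classify(p_value, fold_change, alpha=0.05, fc_cutoff=2.0):
    """Label a gene/circRNA as 'up', 'down' or 'not significant'."""
    if p_value >= alpha:                 # fails the P-value threshold
        return "not significant"
    if fold_change > fc_cutoff:          # more than 2-fold higher in AF
        return "up"
    if fold_change < 1 / fc_cutoff:      # more than 2-fold lower in AF
        return "down"
    return "not significant"             # significant P-value but small effect

print(classify(0.01, 3.2))
print(classify(0.01, 0.4))
print(classify(0.20, 5.0))
```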

<sup>1</sup>https://www.ncbi.nlm.nih.gov/sra

<sup>2</sup>http://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/

<sup>3</sup>http://www.genome.ucsc.edu

#### GO and KEGG Enrichment Analysis

The Gene Ontology (GO) and KEGG enrichment analysis was implemented at WEB-based Gene Set Analysis Toolkit (WebGestalt) (Wang et al., 2017). The Gene Ontology (Ashburner et al., 2000) biological processes and KEGG pathways (Kanehisa et al., 2017) were selected as the functional database.

#### Protein-Protein Interaction Analysis

The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) (Szklarczyk et al., 2017) online software<sup>4</sup> was used to search for interactions among the proteins encoded by the differentially expressed genes.

#### MiRNA Target Prediction

The miRNA binding sites of circRNAs were predicted by miRanda (Betel et al., 2008) with the option -strict; default settings were used for the other parameters. The miRNA-mRNA interactions were extracted from miRTarBase (Chou et al., 2018). These interactions were combined with a reverse co-expression analysis of miRNAs and mRNAs to predict miRNA-mRNA interaction pairs.

#### Competing Endogenous RNA Prediction

Competing endogenous RNAs (ceRNAs) function by competing with mRNAs for miRNAs. For a circRNA to act as a ceRNA of an mRNA, the number of miRNAs shared by the circRNA-mRNA pair should therefore be significantly higher than expected by chance. For each mRNA-circRNA pair, Fisher's exact test was used to estimate the significance of the shared miRNAs (P-value < 0.0001).
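As an illustration, the significance of the miRNA overlap for one circRNA-mRNA pair can be computed as the upper tail of a hypergeometric distribution, which is the one-tailed form of Fisher's exact test. The sketch below assumes the one-tailed (enrichment) version and a user-supplied miRNA background size; the function name `shared_mirna_pvalue` and all argument names are hypothetical.

```python
from math import comb

def shared_mirna_pvalue(n_total, n_circ, n_mrna, n_shared):
    """One-tailed Fisher's exact test (hypergeometric upper tail).

    n_total : size of the miRNA background (an assumption of this sketch)
    n_circ  : miRNAs predicted to bind the circRNA
    n_mrna  : miRNAs targeting the mRNA
    n_shared: miRNAs present in both sets
    Returns P(overlap >= n_shared) under random sampling.
    """
    denom = comb(n_total, n_mrna)
    upper = min(n_circ, n_mrna)
    tail = sum(
        comb(n_circ, i) * comb(n_total - n_circ, n_mrna - i)
        for i in range(n_shared, upper + 1)
    )
    return tail / denom

# Hypothetical pair: 40 circRNA-binding and 30 mRNA-targeting miRNAs, 8 shared,
# drawn from an assumed background of 2,000 miRNAs.
p = shared_mirna_pvalue(2000, 40, 30, 8)
is_cerna_pair = p < 0.0001
```

The pair would be kept as a candidate ceRNA interaction only if `p` falls below the 0.0001 cut-off.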

#### Functional Annotation of circRNAs

The protein-protein interaction (PPI) and ceRNA networks were merged and visualized using Cytoscape software<sup>5</sup>. The functions of circRNAs were predicted by KEGG pathway (Kanehisa et al., 2017) enrichment analysis of the genes connected to these circRNAs within one node in the merged network.
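Collecting the genes around a circRNA in the merged network amounts to a bounded breadth-first search. The sketch below is a minimal illustration, not the authors' implementation: the helper names `merge_networks` and `neighbors_within` are hypothetical, the node names are toy placeholders, and the number of steps is left as a parameter because the phrase "within one node" could be read either as direct neighbors (`max_steps=1`) or as neighbors reached through one intermediate node (`max_steps=2`).

```python
from collections import defaultdict, deque

def merge_networks(*edge_lists):
    """Build one undirected adjacency map from several lists of (a, b) edges."""
    graph = defaultdict(set)
    for edges in edge_lists:
        for a, b in edges:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def neighbors_within(graph, start, max_steps=1):
    """Nodes reachable from `start` in at most `max_steps` edges (start excluded)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_steps:
            continue  # do not expand beyond the allowed distance
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    seen.discard(start)
    return seen

# Toy example: one ceRNA edge and two PPI edges (all names are placeholders).
cerna_edges = [("circ_X", "geneA")]
ppi_edges = [("geneA", "geneB"), ("geneB", "geneC")]
merged = merge_networks(cerna_edges, ppi_edges)
gene_set = neighbors_within(merged, "circ_X", max_steps=2)
```

The resulting gene set would then be passed to the pathway enrichment step.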

#### Statistical Analysis

The statistical analyses, such as hierarchical clustering analysis and Fisher's exact test, were implemented in the R programming software<sup>6</sup>.

#### RESULTS

#### Identification of circRNAs in Lymphocytes From Atrial Fibrillation and Healthy Donors

We collected RNA sequencing data of 6 patients with atrial fibrillation (AF) and 6 healthy donors from the SRA<sup>7</sup> database under accession number SRP093226 (Yu et al., 2017) (see section "Materials and Methods"). The RNA libraries were constructed with an rRNA-removal protocol and could therefore be used to identify circular RNAs (circRNAs). As described in the previous study, the AF and healthy control groups included two and three male subjects, respectively. None of the subjects had smoking habits or a history of alcohol abuse. The average age of the AF patients was about 62 years, slightly older than that of the healthy controls. Analysis of the sequencing data identified 52,024 circRNAs in total, of which 28,384 were found in both AF patients and healthy donors (**Figure 1A**). Among the identified circRNAs, 13,899 were curated in the circRNA database circBase<sup>8</sup> (Glazar et al., 2014). Moreover, 13,733 and 9,907 circRNAs were specific to AF patients and healthy donors, respectively (**Figure 1A**). Genomic annotation revealed that these circRNAs mostly originated from exons (77%), followed by introns (13%) and intergenic regions (10%), suggesting that a considerable portion of circRNAs were circularized at unannotated splicing sites in lymphocytes (**Figure 1B**). The ratio of circRNAs transcribed from the sense strand was close to 0.5, indicating no strand preference in circRNA biogenesis (**Figure 1C**). We also examined the distribution of circRNA expression levels in each sample and observed that most circRNAs were expressed at low levels (**Figure 1D**). However, about 25% of the circRNAs in each sample were expressed at a higher level (> 30 read counts, **Figure 1D**).

#### Identification of Dysregulated Genes and circRNAs in Atrial Fibrillation

To identify dysregulated genes and circRNAs, we performed differential expression analysis on the gene and circRNA expression profiles, respectively. We identified 713 up- and 994 downregulated genes, and 250 up- and 126 downregulated circRNAs, in atrial fibrillation compared with healthy donors (P < 0.05 and fold change > 2 or < 1/2, **Figures 2A,B**). Hierarchical clustering analysis of the dysregulated circRNA expression profiles revealed that the AF samples could be clearly distinguished from the healthy donors (**Figure 2C**), suggesting that the dysregulated circRNAs may act as potential AF diagnostic biomarkers in lymphocytes. Notably, we observed an up-regulated circRNA, hsa\_circ\_0030569, in AF patients (P-value < 0.05 and fold change > 1), which has been reported to respond to Mycobacterium tuberculosis (Mtb) infection in human monocyte-derived macrophages (MDMs), suggesting that this circRNA may participate in immune or inflammatory processes (Huang et al., 2017).

#### GO and KEGG Analysis of the Dysregulated circRNA Parental Genes

It has been shown in previous studies that there is a close association between circRNAs and their parental genes as they could affect the expression of their parental genes (Zhang et al., 2016; Wei et al., 2017). To investigate the functions of the parental genes of dysregulated circRNAs in AF samples compared with normal samples, we conducted a gene set enrichment analysis of their parental genes based on biological processes from GO and pathways from KEGG database (**Supplementary Table S1**).

<sup>4</sup>https://string-db.org

<sup>5</sup>http://www.cytoscape.org

<sup>6</sup>http://www.r-project.org/

<sup>7</sup>https://www.ncbi.nlm.nih.gov/sra

<sup>8</sup>http://www.circbase.org/

Gene ontology analysis indicated that the upregulated genes were mainly involved in the regulation of chromosome segregation, response to radiation, cell cycle phase transition, DNA repair, cilium organization, mRNA processing, mitotic nuclear division, cell projection assembly, microtubule cytoskeleton organization, and peptidyl-lysine modification (**Figure 3A**). Furthermore, the downregulated genes were mainly enriched in categories associated with the regulation of histone modification, forebrain development, microtubule cytoskeleton organization, chromosome segregation, protein acylation, macromolecule deacylation, skeletal system development, organelle localization, in utero embryonic development, and reproductive system development (**Figure 3B**). The GO terms enriched among the up-regulated parental genes suggest that these genes may participate in the process of DNA damage under oxidative stress.

log2-transformed and scaled by circRNA.

Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis revealed that upregulated genes were primarily enriched in pathways associated with RNA transport, endocytosis, cell cycle, Fanconi anemia pathway, terpenoid backbone biosynthesis, protein processing in endoplasmic reticulum, p53 signaling pathway, and hepatitis C (**Figure 3C**). In accordance with the enriched GO terms, the up-regulated genes were significantly enriched in pathways related to DNA damage under oxidative stress. Downregulated genes were mainly associated with homologous recombination, HTLV-I infection, transcriptional misregulation in cancer, N-Glycan biosynthesis, FoxO signaling pathway, lysine degradation, and breast cancer (**Figure 3D**).

TABLE 1 | The top-five circRNAs with the highest significance level by KEGG enrichment analysis.

## Alternative Circularization of Dysregulated circRNAs in Exonic Regions

Alternative RNA circularization was determined only by back-splicing sites; we therefore inferred the gene structure of circRNAs from annotated transcripts. To avoid ambiguous gene structures, only exonic circRNAs were included in these analyses. We found that 24 genes had two or more circRNA isoforms, of which 20 produced two isoforms and 4 produced three isoforms (**Figure 4A**). Interestingly, six genes, NCOA1, ANKRD36BP2, PAPD4, PRRC2C, SCLT1, and EIF2AK1, produced circRNA isoforms with opposite expression patterns (**Figure 4B**), indicating that these expression-switched circRNA isoforms may have opposite functions. Moreover, for 5 of the 6 genes, the expression-switched circRNA isoforms did not have overlapping exons. The exception was PRRC2C, whose two circRNA isoforms, hsa\_circ\_0015210 and chr1:171493960-171502100, shared the 10th exon (**Figure 4C**), indicating that differential usage of the 10th exon was associated with AF.

## Functional Annotation of circRNAs by Integrating Potential ceRNA and PPI Networks

To further investigate the regulatory mechanism of circRNAs, we predicted the miRNA binding sites for each circRNA. In total, we predicted 43,307 miRNA-circRNA interactions using miRanda v3.3a in strict mode. As circRNAs can also act as ceRNAs by competing with mRNAs for miRNAs, we also collected 322,389 experimentally validated miRNA-mRNA interactions from miRTarBase (Chou et al., 2018), of which 12,930 involved dysregulated mRNAs.

To construct the ceRNA network, we estimated the significance of shared miRNAs for each circRNA-mRNA pair. We predicted 1,025 up-regulated and 245 down-regulated circRNA-mRNA pairs by one-tailed Fisher's exact test (P-value < 0.0001), involving 246 dysregulated circRNAs. Furthermore, we mapped the up-regulated and down-regulated protein-coding genes to the PPI network, respectively. To characterize the biological functions of circRNAs, we merged the potential ceRNA network with the PPI network (**Supplementary Table S2**) and performed KEGG enrichment analysis on the genes connected to the circRNAs within one node in the merged network. Finally, 130 of the 246 dysregulated circRNAs in the merged network were successfully characterized by more than one pathway. Notably, five circRNAs, chr9:15474007-15490122, chr16:75445723-75448593, hsa\_circ\_0007256, chr12:56563313-56563992, and hsa\_circ\_0003533, showed the highest significance in the enrichment analysis, and four of them were enriched in cytokine-cytokine receptor interaction (**Table 1** and **Figure 5A**). In addition, CCR5, which acts as a receptor for chemokines, was the target of three circRNAs in the ceRNA network, suggesting that these three circRNAs may enhance the activity of cytokine-cytokine receptor interaction through CCR5. As shown in **Figure 5B**, the ten pathways characterizing the largest numbers of circRNAs, such as the RIG-I-like receptor signaling pathway, Toll-like receptor signaling pathway, NOD-like receptor signaling pathway, and JAK-STAT signaling pathway, were mostly related to inflammation, suggesting that the circRNAs enriched in these pathways may participate in biological processes of the inflammatory response (**Supplementary Table S3**).

# DISCUSSION

Circular RNAs are an emerging class of RNA species that may play a critical regulatory role in gene expression control. CircRNAs can serve as diagnostic biomarkers for many diseases (Han et al., 2018) due to their abundant, stable, and cell- or tissue-specific expression (Bachmayr-Heyda et al., 2015; Li et al., 2018). However, the association between circRNAs and AF is still not clear.

In this study, we used RNA sequencing data to identify and quantify circRNAs. Differential expression analysis identified 250 up- and 126 down-regulated circRNAs in atrial fibrillation patients compared with healthy donors (**Figures 2A,B**). Hierarchical clustering analysis of the dysregulated circRNA expression profiles revealed that the AF samples could be clearly distinguished from the healthy donors (**Figure 2C**), suggesting that the dysregulated circRNAs may act as potential AF diagnostic biomarkers in lymphocytes. GO and KEGG analysis indicated that the parental genes of dysregulated circRNAs may participate in the process of DNA damage under oxidative stress (**Figures 3A,C**). The down-regulated parental genes were mainly associated with homologous recombination, HTLV-I infection, transcriptional misregulation in cancer, N-Glycan biosynthesis, FoxO signaling pathway, lysine degradation, and breast cancer (**Figures 3B,D**). To examine whether circRNA isoforms originating from the same genes were dysregulated in AF, we inferred the gene structure of circRNAs from annotated transcripts. Interestingly, among the dysregulated circRNA isoforms, six genes, NCOA1, ANKRD36BP2, PAPD4, PRRC2C, SCLT1, and EIF2AK1, were found to produce circRNA isoforms with opposite expression patterns, indicating that these expression-switched circRNA isoforms may have opposite functions (**Figure 4B**). Notably, the two circRNA isoforms produced by PRRC2C, hsa\_circ\_0015210 and chr1:171493960-171502100, shared the 10th exon (**Figure 4C**), indicating that differential usage of the 10th exon was associated with AF. To further annotate the dysregulated circRNAs, we constructed and merged the ceRNA network and the PPI network. In the merged network, 130 of 246 dysregulated circRNAs were successfully characterized by at least one pathway. Notably, five circRNAs, chr9:15474007-15490122, chr16:75445723-75448593, hsa\_circ\_0007256, chr12:56563313-56563992, and hsa\_circ\_0003533, showed the highest significance in the enrichment analysis, and four of them were enriched in cytokine-cytokine receptor interaction (**Table 1**). In summary, these dysregulated circRNAs may participate in biological processes of the inflammatory response.

This study has some limitations. First, the sample size was small, and more samples are needed. Second, although we provide a set of dysregulated circRNAs associated with AF, experimental validation is still required. Moreover, the specific functions of these dysregulated circRNAs were not explored further in this study. We hope to conduct further research with a larger sample group, including experimental validation and deeper analysis, in the near future.

#### CONCLUSION

We identified six genes, including NCOA1, ANKRD36BP2, PAPD4, PRRC2C, SCLT1 and EIF2AK1, producing circRNA isoforms with opposite expression patterns, and characterized some inflammation-related circRNAs, such as chr9:15474007-15490122, chr16:75445723-75448593, hsa\_circ\_0007256, chr12:56563313-56563992, and hsa\_circ\_0003533, which may be associated with inflammatory responses in AF. To our knowledge, this is the first study to uncover the association between circRNAs and AF, which not only improves our understanding of the roles of circRNAs in AF, but also provides candidates of potentially functional circRNAs for AF researchers.

# AUTHOR CONTRIBUTIONS

XL led the research team. XL and XH conceived and designed the study. LC and SW developed the methodology. KX and WJ collected the sample. MQ and YZ analyzed and interpreted the data. XH wrote, reviewed, and revised the manuscript.

# FUNDING

This work was supported by the National Natural Science Foundation of China (Grant No. 81670305).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00108/full#supplementary-material

TABLE S1 | A gene set enrichment analysis of their parental genes based on biological processes from GO and pathways from KEGG database.

TABLE S2 | The potential ceRNA network merged with the PPI network of circRNAs.

TABLE S3 | The KEGG pathways involving 102 circRNAs.





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Hu, Chen, Wu, Xu, Jiang, Qin, Zhang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification of Key lncRNAs Associated With Atherosclerosis Progression Based on Public Datasets

Chuan-hui Wang<sup>1</sup>† , Hui-hua Shi<sup>1</sup>† , Lin-hui Chen<sup>2</sup> , Xiao-li Li<sup>1</sup> , Guo-liang Cao<sup>1</sup> \* and Xiao-feng Hu<sup>3</sup> \*

<sup>1</sup> Department of Geriatrics, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China, <sup>2</sup> Department of Neurology, Zhejiang Hospital, Zhejiang University, Hangzhou, China, <sup>3</sup> Department of Cardiology, Shanghai Chest Hospital, Shanghai Jiao Tong University, Shanghai, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Guoping Li, Massachusetts General Hospital and Harvard Medical School, United States Shu Yang, National Center for Advancing Translational Sciences (NCATS), United States

#### \*Correspondence:

Guo-liang Cao guolcao@126.com Xiao-feng Hu ncdx2006hu@126.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 01 December 2018 Accepted: 04 February 2019 Published: 28 February 2019

#### Citation:

Wang C-h, Shi H-h, Chen L-h, Li X-l, Cao G-l and Hu X-f (2019) Identification of Key lncRNAs Associated With Atherosclerosis Progression Based on Public Datasets. Front. Genet. 10:123. doi: 10.3389/fgene.2019.00123

Atherosclerosis is one of the most common types of cardiovascular disease and the prime cause of mortality in the aging population worldwide. However, the detailed mechanisms and specific biomarkers of atherosclerosis remain to be further investigated. Lately, long non-coding RNAs (lncRNAs) have attracted much more attention than other types of ncRNAs. In our work, we found and confirmed differentially expressed lncRNAs and mRNAs in atherosclerosis by analyzing GSE28829. We performed weighted gene co-expression network analysis (WGCNA) on GSE40231 to confirm highly correlated genes. Gene Ontology (GO) analysis was used to assess the potential functions of differentially expressed lncRNAs in atherosclerosis. Co-expression networks were also constructed to confirm hub lncRNAs in atherosclerosis. A total of 5784 mRNAs and 654 lncRNAs were found to be dysregulated in the progression of atherosclerosis. A total of 15 lncRNA-mRNA co-expression modules were identified in this study based on WGCNA. Moreover, a few lncRNAs, such as ZFAS1, LOC100506730, LOC100506691, DOCK9-AS2, RP11-6I2.3, and LOC100130219, were confirmed as important lncRNAs in atherosclerosis. Taken together, bioinformatics analysis revealed that these lncRNAs were involved in regulating the leukotriene biosynthetic process, gene expression, actin filament organization, t-circle formation, antigen processing and presentation, the interferon-gamma-mediated signaling pathway, and activation of GTPase activity. We believe that this study provides potential novel therapeutic and prognostic targets for atherosclerosis.

Keywords: long non-coding RNA, atherosclerosis, WGCNA analysis, co-expression analysis, biomarker

#### INTRODUCTION

Atherosclerosis is characterized by intima-media thickness (IMT) in the middle membrane of cervical artery and the formation of atherosclerotic plaque (Sluimer and Daemen, 2009). Atherosclerosis is one of the most common types of cardiovascular disease and the prime cause of mortality in the aging population worldwide (Sarnak et al., 2003; Mannino and Buist, 2007). Although previous studies indicated that immune system responses and inflammatory responses are involved in the progression of atherosclerosis, the detailed mechanisms and specific biomarkers of atherosclerosis remain to be further investigated.

Previous studies have revealed that non-coding RNAs, such as miRNAs, lncRNAs, and circRNAs, play important regulatory roles in human diseases. For instance, miRNAs are post-transcriptional regulators involved in mRNA degradation or translation blocking (Chekulaeva and Filipowicz, 2009; Huntzinger and Izaurralde, 2011). Lately, lncRNAs have attracted much more attention than other types of ncRNAs. lncRNAs are ncRNAs longer than 200 nucleotides and have been observed to be dysregulated in human diseases, including cancers, diabetes, neurodegenerative disease, and cardiovascular diseases (Zamore and Haley, 2005; Hauptman and Glavač, 2013; Zhang et al., 2013). lncRNAs play crucial roles in regulating genome epigenetic modification, RNA splicing, protein translation, and mRNA decay. For instance, XIST is a well-known lncRNA involved in X chromosome inactivation (McHugh et al., 2015). Of note, emerging studies indicate that lncRNAs could serve as novel biomarkers for diseases. For example, PCA3 is a potential prognostic marker of prostate cancer that is more sensitive than the most widely used biomarker, prostate specific antigen (PSA) (Leyten et al., 2014).

Over the past decades, a few lncRNAs have been found to be involved in the progression and prognosis of atherosclerosis. For example, Shen and She (2018) reported that rs145204276 in the promoter region of GAS5 was associated with the risk of atherosclerosis. Yao et al. (2018) found that ENST00113 promotes cell growth and metastasis in atherosclerosis via the PI3K/Akt/mTOR pathway. Moreover, H19 was also involved in atherosclerosis through influencing the NF-kB and MAPK pathways (Ding et al., 2018). However, a systematic identification of differentially expressed lncRNAs in atherosclerosis was still lacking. Exploring the functions and mechanisms of atherosclerosis-related lncRNAs will be useful for the identification of novel biomarkers for this disease.

Weighted gene co-expression network analysis (WGCNA) is widely used to identify key genes involved in the progression of human diseases. In our work, atherosclerosis-related lncRNAs were detected by analyzing the GEO dataset GSE28829. Furthermore, we performed WGCNA on GSE40231 to confirm highly correlated genes. Bioinformatics analyses were also performed to reveal the potential functions of atherosclerosis-related lncRNAs. We expect that this study will provide novel biomarkers associated with atherosclerosis prognosis and progression.

# MATERIALS AND METHODS

#### Data Sources

The public datasets GSE28829 and GSE40231 were downloaded from the NCBI Gene Expression Omnibus database. GSE28829 included 13 primary atherosclerotic plaques and 16 advanced atherosclerotic plaques. GSE40231 included 278 atherosclerotic samples from 66 patients. The original data were converted into a recognizable format in R, and the preprocessCore package was used for normalization. Afterward, the limma package in R was used to identify the differentially expressed genes (DEGs) in the progression of atherosclerosis.

## Data Preprocessing

The R software package affy (Gautier et al., 2004) was used to read the microarray data. The robust multi-array average (RMA) method (Hoffmann et al., 2006) was used for data preprocessing. For the GSE28829 dataset, we identified differentially expressed genes using the limma package (Ritchie et al., 2015). DEGs with an adjusted P-value < 0.05 were selected.
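The text does not state which multiple-testing correction produced the adjusted P-values. As an illustration only, the sketch below implements the Benjamini-Hochberg procedure, which is the default adjustment in limma's `topTable`; treating it as the method used here is an assumption, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
selected = [i for i, q in enumerate(adj) if q < 0.05]
```

With these toy values only the first gene passes the 0.05 cut-off after adjustment, although three of the four raw p-values are below 0.05.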

# lncRNA Classification Pipeline

In this work, the lncRNA expression pattern in atherosclerosis was identified with the pipeline described by Zhang et al. (2012).

## Weighted Gene Co-expression Network Analysis (WGCNA)

In this study, we conducted WGCNA to predict the potential roles of lncRNAs in atherosclerosis progression. The WGCNA R package was used to evaluate the significance of the lncRNAs and their module membership. We assessed the weighted co-expression relationships among all dataset subjects in an adjacency matrix based on pairwise Pearson correlation. After the weighted correlations were identified, the network was visualized with Cytoscape 3.4.0.
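In an unsigned WGCNA network, the adjacency between two genes is the absolute Pearson correlation raised to a soft-thresholding power β, that is, a_ij = |cor(x_i, x_j)|^β. The Python sketch below illustrates this definition on toy data; it is not the WGCNA R package, the function names are hypothetical, and the default β = 4 simply echoes the value reported in the Results. Profiles are assumed to be non-constant, since a constant profile has no defined correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def soft_threshold_adjacency(profiles, beta=4):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta.

    profiles: dict gene -> list of expression values across samples
    Returns a dict keyed by (gene_i, gene_j).
    """
    genes = list(profiles)
    adjacency = {}
    for g1 in genes:
        for g2 in genes:
            r = pearson(profiles[g1], profiles[g2])
            adjacency[(g1, g2)] = abs(r) ** beta
    return adjacency

# Toy profiles: the correlation between "a" and "b" is 0.5.
toy = {"a": [1, 2, 3], "b": [1, 3, 2]}
adj_matrix = soft_threshold_adjacency(toy, beta=4)
```

Raising the correlation to a power suppresses weak correlations much more strongly than strong ones: here a correlation of 0.5 becomes an adjacency of 0.0625.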

## Functional Group Analysis

Here, we used GO and KEGG analyses to predict the potential roles of genes using the DAVID system<sup>1</sup> (Huang et al., 2009).

# Identification of lncRNA-Associated PPI Modules

We analyzed the interactions between lncRNAs and proteins using the STRING online software (Zhang et al., 2016), and a combined score > 0.4 was used as the cut-off criterion (Yan et al., 2017). The PPI network was built with Cytoscape software (Kohl et al., 2011).

# RESULTS

### Identification of Atherosclerosis Progression Related mRNAs and lncRNA

We analyzed the public dataset GSE28829 to identify atherosclerosis-related mRNAs and lncRNAs. GSE28829 was reported by Döring et al. (2012) and contains 13 early and 16 advanced atherosclerosis samples. As shown in **Figure 1A**, we identified 3542 up-regulated mRNAs and 2487 down-regulated mRNAs in advanced atherosclerosis samples compared to early atherosclerosis samples.

<sup>1</sup>https://david.ncifcrf.gov/

After applying the lncRNA classification according to the report of Hägg et al. (2009), we identified 356 up-regulated mRNAs and 412 down-regulated lncRNAs in advanced atherosclerosis samples compared to early atherosclerosis samples (**Figure 1B**). Notably, most of these lncRNAs, such as RP11-212P7.2, RP11-498E2.7, RP11-803D5.4, RP11-646J21.6, and RP11-334C17.5, were observed to be associated with atherosclerosis for the first time.

# Construction and Analysis of Gene Co-expression Network

To explore the potential functions and mechanisms of these lncRNAs in the progression of atherosclerosis, we conducted WGCNA using GSE40231. The network was built with the WGCNA package in R (Langfelder and Horvath, 2008). After identifying the best soft-thresholding parameter (β = 4), we applied WGCNA according to the report of Langfelder et al. (2008).

On this basis, we obtained 15 gene modules (**Figure 2A**). We obtained 78 gene modules using cutreeDynamic in the WGCNA package (Langfelder and Horvath, 2008). Following Pidsley et al. (2014), a soft-thresholding power of five was selected, and a large minimum module size of 10 and a medium sensitivity (deepSplit = 2) were used to cut the cluster tree (**Figure 2B**). After the Pearson correlation coefficients between modules were calculated, the key network was built (**Figures 2C,D**). Two modules were connected when the absolute value of their correlation was greater than 0.45.

# Construction of Atherosclerosis Related lncRNA-mRNA Co-expression Networks

Furthermore, we built atherosclerosis-associated lncRNA-mRNA co-expression networks from the Pearson correlation coefficients of lncRNA-mRNA pairs in the 15 gene modules identified by WGCNA. lncRNA-mRNA pairs with |R| > 0.65 were selected for network construction. The composition of each module-related network was as follows:

| Module | lncRNAs | DEGs | Figure |
| --- | --- | --- | --- |
| 1 | 36 | 520 | **Figure 3** |
| 2 | 22 | 229 | **Figure 3** |
| 3 | 17 | 195 | **Figure 3** |
| 4 | 10 | 143 | **Figure 3** |
| 5 | 19 | 175 | **Figure 3** |
| 6 | 15 | 104 | **Figure 3** |
| 7 | 11 | 89 | **Figure 4** |
| 8 | 10 | 61 | **Figure 4** |
| 9 | 14 | 88 | **Figure 4** |
| 10 | 7 | 79 | **Figure 4** |
| 11 | 9 | 66 | **Figure 4** |
| 12 | 7 | 60 | **Figure 4** |
| 13 | 4 | 34 | **Figure 5** |
| 14 | 5 | 34 | **Figure 5** |
| 15 | 4 | 20 | **Figure 5** |

A few lncRNAs, such as ZFAS1 (degree = 358), LOC100506730 (degree = 183), LOC100506691 (degree = 170), DOCK9-AS2 (degree = 167), RP11-6I2.3 (degree = 166), LOC100130219 (degree = 157), LOC100268168 (degree = 138), DAPK1-IT1 (degree = 130), LOC100507250 (degree = 129), HLA-J (degree = 128), and LOC102723845 (degree = 121), were identified as key regulators in this network.
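The degree of an lncRNA in such a network is simply the number of mRNAs whose correlation with it passes the cut-off. A minimal sketch of that count is given below, assuming each lncRNA-mRNA pair appears once in the input; the function name `lncrna_degrees` and the toy identifiers are hypothetical and do not correspond to the study's data.

```python
from collections import Counter

def lncrna_degrees(correlations, cutoff=0.65):
    """Count co-expressed mRNAs per lncRNA.

    correlations: iterable of (lncRNA, mRNA, r) triples, one per pair
    Returns a Counter mapping lncRNA -> number of mRNAs with |r| > cutoff.
    """
    degrees = Counter()
    for lnc, _mrna, r in correlations:
        if abs(r) > cutoff:  # both positive and negative co-expression count
            degrees[lnc] += 1
    return degrees

# Toy input with placeholder names.
pairs = [
    ("lncA", "mRNA1", 0.80),
    ("lncA", "mRNA2", -0.70),
    ("lncA", "mRNA3", 0.30),
    ("lncB", "mRNA1", 0.66),
]
hub_ranking = lncrna_degrees(pairs).most_common()
```

Sorting by degree, as `most_common()` does, ranks candidate hub lncRNAs from the most to the least connected.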

# Function Annotation of Atherosclerosis Related lncRNAs

Furthermore, we performed bioinformatics analysis of atherosclerosis-related lncRNAs using the DAVID system. Our results showed that lncRNAs in module 1 were involved in the leukotriene biosynthetic process, response to heat, integrin-mediated signaling pathway, Fc-gamma receptor signaling pathway involved in phagocytosis, signal transduction, positive regulation of catalytic activity, and inflammatory response. lncRNAs in module 2 were involved in positive regulation of gene expression, sequestering of actin monomers, gene silencing by RNA, muscle cell differentiation, and ubiquitin-dependent protein catabolic process. lncRNAs in module 3 were involved in actin filament organization, regulation of focal adhesion assembly, I-kappaB kinase/NF-kappaB signaling, response to stress, and apical constriction. lncRNAs in module 4 were involved in positive regulation of t-circle formation, t-circle formation, interstrand cross-link repair, protein phosphorylation, and mRNA processing (**Figure 6**).

lncRNAs in module 5 were involved in antigen processing and presentation, interferon-gamma-mediated signaling pathway, immunoglobulin production, humoral immune response, and positive regulation of insulin secretion. lncRNAs in module 6 were involved in activation of GTPase activity, positive regulation of transcription, positive regulation of receptor biosynthetic process, negative regulation of angiogenesis, and negative regulation of protein ubiquitination. lncRNAs in module 7 were involved in positive regulation of DNA topoisomerase activity and ossification. lncRNAs in module 8 were involved in positive regulation of protein import into nucleus. lncRNAs in module 9 were involved in circadian rhythm, multicellular organism development, intracellular protein transport, and protein polyglutamylation. lncRNAs in module 10 were involved in response to virus, positive regulation of GTPase activity, and cellular response to caffeine. lncRNAs in module 12 were involved in cell cycle, positive regulation of t-circle formation, and t-circle formation (**Figure 6**).

# DISCUSSION

Atherosclerosis is the prime cause of mortality in the aging population worldwide. However, the detailed mechanisms underlying atherosclerosis progression and accurate biomarkers for its prognosis remain to be investigated. In our work, we aimed to confirm atherosclerosis-related lncRNAs and mRNAs using GSE28829 and GSE40231. In total, 5784 mRNAs and 654 lncRNAs were identified as dysregulated in the progression of atherosclerosis. WGCNA was performed to identify highly correlated lncRNAs and mRNAs. Moreover, co-expression network and bioinformatics analyses were used to explore the potential functions of lncRNAs in atherosclerosis.

lncRNAs play crucial roles in human diseases by binding to DNA, proteins, and RNA molecules. Recently, a few lncRNAs, such as GAS5 (Chen et al., 2017), NEAT1 (Jian et al., 2016), and MALAT1, were reported to be associated with atherosclerosis progression and prognosis. For example, interactions among MALAT1 (Han et al., 2018), NEAT1, and key immune effector molecules could regulate the development of atherosclerosis. However, a systematic identification of differentially expressed lncRNAs in atherosclerosis was still lacking. In this study, we identified 356 up-regulated mRNAs and 412 down-regulated lncRNAs in advanced compared to early atherosclerosis samples. Among these lncRNAs, MBNL1-AS1, HAND2-AS1, and RP11-999E24.3 were the most down-regulated, and PSMB8-AS1, LINC01094, and RP11-389C8.2 were the most up-regulated lncRNAs in advanced atherosclerosis patients. Interestingly, we observed several well-known lncRNAs, such as TUG1, PCA3, and HOTAIR, which were also involved in regulating atherosclerosis progression. A previous study showed that TUG1 knockdown could ameliorate atherosclerosis via inducing FGF1 expression (Zhang et al., 2018). Moreover, TUG1 was reported to be an oncogene in various types of human cancers, such as colorectal cancer, ovarian cancer, and gastric cancer (Huarte, 2015). PCA3 is a novel potential biomarker for prostate cancer (Leyten et al., 2014). HOTAIR is abnormally expressed in cancers and involved in regulating cancer proliferation, cell cycle, and apoptosis (Esteller, 2011). This study, together with previous studies, indicates that lncRNAs also play key roles in the progression of atherosclerosis.

One of the biggest challenges in exploring the functions of lncRNAs in human diseases is that lncRNAs cannot be used directly in GO and KEGG analysis. In previous studies, many groups conducted bioinformatics analysis for lncRNAs using their co-expressed genes. For instance, Feng et al. (2018) identified and predicted the functions of implantation failure related lncRNAs by constructing a lncRNA-mRNA co-expression network. To study the potential functions of atherosclerosis-related lncRNAs, we performed WGCNA. A total of 15 lncRNA-mRNA co-expression modules were identified in this study. A few lncRNAs, such as ZFAS1, LOC100506730, LOC100506691, DOCK9-AS2, RP11-6I2.3, LOC100130219, LOC100268168, DAPK1-IT1, LOC100507250, and LOC102723845, were regarded as important lncRNAs because each of them was co-expressed with more than 100 different mRNAs in atherosclerosis. Apart from ZFAS1, the roles of these lncRNAs remain unknown. Here, we found that ZFAS1 played the most important role in this network, being co-expressed with 358 mRNAs. ZFAS1 was dysregulated in breast cancer, gastric cancer, and colorectal cancer, and acted as an oncogene in cancer progression by promoting cancer metastasis, growth, and EMT. Furthermore, we performed bioinformatics analysis and observed that these dysregulated lncRNAs were significantly associated with the regulation of the leukotriene biosynthetic process, gene expression, actin filament organization, t-circle formation, antigen processing and presentation, the interferon-gamma-mediated signaling pathway, and activation of GTPase activity. Of note, we observed that lncRNAs in module 5, such as RP11-171N4.1, DKFZP434K028, LOC101929153, LGALS8-AS1, and LINC01410, were significantly involved in regulating immune system responses and inflammation responses, which have been reported to be key regulators in atherosclerosis.

We should point out that this study has several limitations. First, the expression levels of key lncRNAs in atherosclerosis were not validated using clinical samples. Second, the detailed molecular functions of key lncRNAs in the progression of atherosclerosis have not been investigated. Therefore, validation and functional investigation still require further study.

# CONCLUSION

In conclusion, a total of 275 lncRNAs were found to be dysregulated in the progression of atherosclerosis. WGCNA was performed to identify highly correlated lncRNAs and mRNAs. Moreover, ZFAS1, LOC100506730, LOC100506691, DOCK9-AS2, RP11-6I2.3, and LOC100130219 were identified as key lncRNAs in atherosclerosis. Bioinformatics analysis revealed that these lncRNAs were involved in regulating the leukotriene biosynthetic process, gene expression, actin filament organization, t-circle formation, antigen processing and presentation, the interferon-gamma-mediated signaling pathway, and activation of GTPase activity. This research may provide potential novel therapeutic and prognostic targets for atherosclerosis.

#### AUTHOR CONTRIBUTIONS


C-hW, H-hS, G-lC, and X-fH conceived and designed the study. C-hW, H-hS, L-hC, and X-lL developed the methodology. L-hC and X-lL collected the sample. C-hW, H-hS, L-hC, and X-lL analyzed and interpreted the data. C-hW, H-hS, L-hC, X-lL, G-lC, and X-fH wrote, reviewed, and/or revised the manuscript.

#### FUNDING

This study was supported by Shanghai Jiao Tong University School of Medicine Affiliated Ninth People's Hospital Integration Fund.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Shi, Chen, Li, Cao and Hu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Detecting Diagnostic Biomarkers of Alzheimer's Disease by Integrating Gene Expression Data in Six Brain Regions

#### Lihua Wang and Zhi-Ping Liu\*

*Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, China*

Alzheimer's disease (AD) is a progressive neurodegenerative disease, which often causes irreversible damage to the cerebrum. The pathogenesis of AD is far from being fully understood, although there are some popular hypotheses. So far, the diagnosis of AD relies only on clinical screening in the form of imaging techniques or cerebrospinal fluid analysis, which may lead to inaccurate evaluation and thus delay suitable treatments. Molecular biomarkers provide promising alternatives for establishing correct relationships between genotypes and the phenotypes of clinical symptoms. In this paper, we propose a machine-learning-based method for identifying potential diagnostic biomarkers of AD based on a gene coexpression network built by integrating gene expression profiles in six brain regions. After building an integrated gene coexpression network of multiple brain regions, we decompose the differential network into subnetwork modules. The module candidates from these coexpressed gene communities are then identified by screening their power to discriminate control from disease samples. The potential biomarkers are then validated by multiple cross-validations and functional enrichment analyses. If the biomarkers successfully pass clinical significance tests, they can be used as a reference for clinical diagnosis after wet-experimental validations.

#### Edited by:

*Tao Zeng, Shanghai Institutes for Biological Sciences (CAS), China*

#### Reviewed by:

*Qi Zhao, Liaoning University, China Wenyuan Li, University of California, Los Angeles, United States Guangxu Jin, Wake Forest University, United States*

\*Correspondence:

*Zhi-Ping Liu zpliu@sdu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *25 September 2018* Accepted: *13 February 2019* Published: *12 March 2019*

#### Citation:

*Wang L and Liu Z-P (2019) Detecting Diagnostic Biomarkers of Alzheimer's Disease by Integrating Gene Expression Data in Six Brain Regions. Front. Genet. 10:157. doi: 10.3389/fgene.2019.00157*

Keywords: Alzheimer's disease, biomarker discovery, gene expression, data integration, classification, machine learning

# 1. INTRODUCTION

Alzheimer's disease (AD) is a progressive neurodegenerative disease, which causes irreversible damage to the cerebrum with cognitive and functional impairments (Porteri et al., 2017). Approximately 50 million people are suffering from AD worldwide. The pathogenesis of AD is still poorly understood, and several popular hypotheses have been proposed, such as the genetic, cholinergic, amyloid, and Tau protein hypotheses (Goedert and Spillantini, 2006). AD progresses over a long period because its pathological change is a slowly accumulating process. It often takes years to decode, reveal, and recognize the neuronal dysfunctions and neurodegeneration with dominant symptoms (Hardy and Selkoe, 2002; Goedert and Spillantini, 2006).

Currently, the diagnosis of AD generally relies on clinical screening in the form of imaging techniques or cerebrospinal fluid analysis (Jack et al., 2010). Because dementia is limited at an early stage, diagnosis is often inaccurate, which delays beneficial treatments. Thus, the discovery of effective biomarkers that can establish correct correspondences and relationships with clinical symptoms has become an urgent need (Porteri et al., 2017).


Considering the complicated genetic and environmental risk factors for developing AD in the human brain, there are thousands, or even tens of thousands, of candidates among genes, transcripts, and proteins together with their interactions (Wang et al., 2016). It is a big challenge to identify AD biomarker molecules by making full use of the available big data. Because of the underlying complexity, network-based computational methods have become important options for meeting this challenge (Liu et al., 2011, 2012a).

In this paper, we aim to detect AD biomarkers by integrating gene expression data in six brain regions. Gene expression profiling generates genome-wide measurements of RNA abundance in parallel, which provide material for bridging the gap between genotype and phenotype and thus form the foundation of biomarker screening. Physiological and cellular processes are executed through interactions among genes and their products. Through the analysis of the genetic network, which models these interactive activities, it is possible to screen out the core genes that play crucial roles in AD development and progression (Liu et al., 2009). Moreover, AD affects brain regions sequentially during disease progression. It is therefore necessary to identify molecular biomarkers by integrating gene expression data from multiple brain regions (Jack et al., 2013). To these ends, we provide a bioinformatics framework for detecting potential diagnostic biomarkers based on a differential gene coexpression network obtained by integrating gene expression profiles in multiple brain regions.

# 2. METHODS

#### 2.1. Framework of Biomarker Discovery

**Figure 1** illustrates our proposed framework for identifying diagnostic biomarkers of AD by integrating gene expression data in six brain regions. Briefly, we calculate the correlation coefficients between differentially expressed genes in control and disease samples. By integrating the correlations of the six brain regions, differentially coexpressed gene pairs are selected by a statistical test, and these pairs form a differential coexpression network. Then, we employ a network clustering method to partition it into subnetwork modules. The modules are screened individually by machine learning algorithms that evaluate their ability to distinguish controls from diseases. The modules with the highest performance are identified as biomarkers after functional enrichment analysis and validation. The details shown in **Figures 1A–D** are introduced as follows.

#### 2.2. Data Pre-processing

The microarray gene expression datasets are downloaded from the NCBI GEO database (ID: GSE5281) (www.ncbi.nlm.nih.gov/geo) (Liang et al., 2007). The experiments contain the gene expression profiles of 161 samples in six brain regions, i.e., EC (entorhinal cortex), HIP (hippocampus), MTG (medial temporal gyrus), PC (posterior cingulate cortex), SFG (superior frontal gyrus), and VCX (primary visual cortex). Each brain region has both disease and control samples. The numbers of affected/control samples are 10/13 in EC, 10/13 in HIP, 16/12 in MTG, 9/13 in PC, 23/11 in SFG, and 19/12 in VCX. According to the GPL570 annotation table, we map the probe set IDs to Entrez gene IDs and official gene symbols, respectively. When there are two or more corresponding gene IDs, we select only the one with the maximum interquartile range. In each sample, the gene expression values are then normalized into Z-scores (Cheadle et al., 2003). In total, 23,643 unique genes have expression measurements after data pre-processing.
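The two pre-processing steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that the ambiguous case is several probe sets mapping to the same gene, that the sample standard deviation is used for the Z-score, and the function names and the dictionary layout are hypothetical.

```python
import statistics


def zscore_sample(values):
    """Normalize the expression values of one sample into Z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]


def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def pick_probe_per_gene(probe_to_gene, probe_expr):
    """For each gene, keep the probe set with the largest IQR across samples.

    probe_to_gene: dict probe ID -> gene ID (hypothetical layout)
    probe_expr:    dict probe ID -> list of expression values across samples
    """
    best = {}
    for probe, gene in probe_to_gene.items():
        spread = iqr(probe_expr[probe])
        if gene not in best or spread > best[gene][1]:
            best[gene] = (probe, spread)
    return {gene: probe for gene, (probe, _) in best.items()}
```

Selecting by interquartile range favors the probe set whose signal varies most across samples, which is a common heuristic when a single representative per gene is needed.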

#### 2.3. Integration of Data in Six Brain Regions

#### 2.3.1. Differential Gene

First of all, we identify the differentially expressed genes in the six brain regions from the pre-processed gene expression data. Specifically, we evaluate the differential p-value of each gene between the control and disease samples via Welch's two-sample t-test. To control the high probability of committing type I errors in multiple hypothesis testing, the corresponding FDR is also calculated (Noble, 2009). With thresholds of p ≤ 0.05 and FDR ≤ 0.01, we screen out the differential genes in each brain region separately. We integrate the top 200 (top 10%) differential genes in each brain region and take the union of differentially expressed genes.
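The per-gene test and the multiple-testing correction can be sketched as below. This is an illustration under assumptions: the text does not name the FDR procedure, so the Benjamini-Hochberg adjustment is used here as one common choice, and the function names are hypothetical. Converting the t statistic into a p-value needs the t distribution, which is outside the standard library; `scipy.stats.ttest_ind(..., equal_var=False)` is one existing option for that step.

```python
import math
import statistics


def welch_t(control, disease):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(control), len(disease)
    v1, v2 = statistics.variance(control), statistics.variance(disease)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (statistics.fmean(disease) - statistics.fmean(control)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene would then be kept in a region when its p-value and adjusted value both pass the stated thresholds.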

#### 2.3.2. Correlation Analysis

To build gene-gene coexpression relationships in multiple brain regions, we pick out the dysregulated interactions between genes using differential correlation analysis in each region individually. We first pair the identified differential genes in an all-against-all manner. In other words, we generate all the non-repetitive gene pairs that can be formed from these differential genes. For each gene pair, we calculate the coexpression status in the samples via the PCC (Pearson correlation coefficient) (Liu et al., 2012b), i.e.,

$$r(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{(n-1)S_X S_Y},\tag{1}$$

where $X$ and $Y$ are the gene expression vectors, $\overline{X}$ and $\overline{Y}$ are their mean values, and $S_X$ and $S_Y$ are their standard deviations. The coexpression values for all gene pairs are then obtained in control and disease samples, respectively. For each gene pair, we integrate the six coexpression values under the control condition and the six under the disease condition into two vectors across the six brain regions. The differentially coexpressed gene pairs are identified via nonparametric statistical testing. For the two vectors of six elements, we implement Spearman's t-test to detect the differential gene coexpressions with thresholds of p-value ≤ 0.05 and FDR ≤ 0.1.
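Equation (1) and the construction of the per-condition correlation vectors can be sketched as follows. The nested dictionary layout `expr_by_region[region][gene]` and the function names are assumptions made for illustration; the statistical test applied to the two resulting vectors is not reproduced here.

```python
import math

REGIONS = ("EC", "HIP", "MTG", "PC", "SFG", "VCX")


def pearson(x, y):
    """Pearson correlation coefficient as written in Equation (1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)


def coexpression_vector(expr_by_region, gene_a, gene_b):
    """One PCC per brain region for a gene pair under a single condition."""
    return [
        pearson(expr_by_region[r][gene_a], expr_by_region[r][gene_b])
        for r in REGIONS
    ]
```

Calling `coexpression_vector` once with the control expression data and once with the disease expression data gives the two six-element vectors that are compared for each gene pair.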

#### 2.4. Differential Co-expression Network

After collecting these differentially coexpressed gene pairs, we put them together to form a differential coexpression network, as shown in **Figure 1C**. It can be visualized by importing these dysregulated gene interactions into Cytoscape (Shannon et al., 2003). The subnetworks of this network will be targeted for identifying module biomarkers.

FIGURE 1 | The framework of detecting AD biomarkers from gene expression data. (A) The gene expressions of AD samples and controls in the six brain regions of EC, HIP, MTG, PC, SFG, and VCX, respectively. (B) The union of differentially expressed genes in each brain region. We generate the non-repetitive gene pairs from the pairwise differential genes. For each gene pair, we calculate their coexpression status in the control and disease samples via PCC. For the two correlation vectors, we implement Spearman's *t*-test to detect the differential gene coexpressions with thresholds of *p*-value ≤ 0.05 and FDR ≤ 0.1. (C) The differentially correlated gene pairs form a differential coexpression network. (D) The differential coexpression network is grouped into several subnetwork modules by clustering. They are screened out as candidate biomarkers when they successfully classify controls and diseases. Functional enrichment analysis is performed to examine the dysfunctions underlying these candidates. Then, validations in independent experimental settings check the classification performances of the identified biomarkers.

#### 2.5. Clustering

To decompose the whole differential coexpression network into subnetwork modules, we group the nodes with a network clustering algorithm, namely MCL (Markov clustering) (Van Dongen, 2000). MCL is a fast and scalable unsupervised network clustering algorithm based on topological structures and features. It repeats two basic algebraic operations on matrices to simulate random walks on the network (Vlasblom and Wodak, 2009). The first operation is expansion, which calculates the probability of a random walk of length n between any two nodes in the network. Because matrix multiplication behaves like a random walk on a graph, the Markov matrix associated with the graph serves as the foundation for simulating these random walks. In a network, flow is much easier within dense regions than across sparse boundaries. Thus, the second operation of MCL is inflation, which reinforces this property by rescaling the transition values of each vertex in the Markov matrix such that high values become higher and low values become lower. When iterating these two steps produces a convergent matrix, the final clustering is obtained (Van Dongen, 2000).
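The expansion and inflation steps can be sketched in a few lines. This is a simplified illustration of the MCL idea rather than the reference implementation: the self-loops, the parameter defaults, the convergence tolerance, and the way clusters are read from the converged matrix are assumptions chosen for this sketch.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (column-stochastic Markov matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, expansion=2, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a 0/1 adjacency matrix."""
    n = len(adjacency)
    # Self-loops are added before normalization (a common convention).
    m = normalize_columns([[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        previous = m
        expanded = m
        for _ in range(expansion - 1):       # expansion: matrix power
            expanded = matmul(expanded, m)
        inflated = [[v ** inflation for v in row] for row in expanded]
        m = normalize_columns(inflated)      # inflation: power, then rescale
        change = max(abs(m[i][j] - previous[i][j])
                     for i in range(n) for j in range(n))
        if change < tol:
            break
    # Rows that keep non-zero mass are attractors; their columns form a cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)
```

A larger inflation value tends to produce more, smaller clusters, so it is the main parameter to tune when the module sizes matter.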

#### 2.6. Classification

These gene subnetwork modules provide the candidates for screening out the module biomarkers that classify control and disease samples in brain regions. We perform an SVM (support vector machine) classification procedure to evaluate the ability of each module to distinguish the disease state from the normal state. An SVM classifier seeks an optimal hyperplane that satisfies the classification requirement, and the optimal margin criterion is based on the distance between the support vectors of the two classes (Suykens and Vandewalle, 1999). For classification with two categories, the classifier can be constructed as follows. Given a training set of data points $(\mathbf{x}_i, y_i)$, $i = 1, 2, \cdots, m$, $\mathbf{x}_i \in R^n$, $y_i \in \{\pm 1\}$, the optimal hyperplane H is:

$$(\mathbf{w} \cdot \mathbf{x}) + b = 0.\tag{2}$$

The SVM classifier should meet some constraints. One of them is:

$$\mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \quad \text{if} \quad y_i = +1; \qquad \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \quad \text{if} \quad y_i = -1, \tag{3}$$

which is equivalent to

$$y_i[\mathbf{w} \cdot \mathbf{x}_i + b] \ge 1, \quad i = 1, 2, \cdots, m. \tag{4}$$

The other is to maximize the margin, which is calculated as $2/\|\mathbf{w}\|$. In other words, it is to minimize $\|\mathbf{w}\|$ or, equivalently, $\frac{1}{2}\|\mathbf{w}\|^2$. To solve this constrained optimization problem, the Lagrange function is introduced:

$$L(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{m} \lambda_i \left( y_i \left( (\mathbf{w} \cdot \mathbf{x}_i) + b \right) - 1 \right), \tag{5}$$

where $\lambda_i \ge 0$ are the Lagrangian multipliers. By setting the partial derivatives of (5) with respect to $\mathbf{w}$ and $b$ to 0, we finally find the optimal hyperplane and construct a classifier as:

$$f(\mathbf{x}) = \text{sign}\left[\sum_{i=1}^{m} \lambda_i y_i \, \mathbf{x}_i^T \, \mathbf{x} + b\right]. \tag{6}$$
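The step from the Lagrange function to the classifier can be spelled out as follows. This is the standard hard-margin derivation, assuming the objective $\frac{1}{2}\|\mathbf{w}\|^2$ and one multiplier $\lambda_i$ per training point. Setting the partial derivatives to zero gives

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{m} \lambda_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{m} \lambda_i y_i \mathbf{x}_i, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \lambda_i y_i = 0.$$

Substituting this expression for $\mathbf{w}$ into $\mathbf{w} \cdot \mathbf{x} + b$ yields the sum over training points inside the sign function. Only the support vectors, for which $\lambda_i > 0$, contribute to that sum.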

In the case of binary classification, we assess the classification performance of the SVM-based classifier by a leave-one-out cross validation (Cawley and Talbot, 2004). For a comparison study, we also implement several machine learning algorithms in the classification, such as naive Bayes, neural network and random forest (Liu, 2016).
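The leave-one-out procedure itself is independent of the classifier and can be sketched as below. To keep the example self-contained, a nearest-centroid scorer is used as a stand-in; it is not an SVM and is not the authors' classifier, and all function names are hypothetical. Labels follow the convention above, with +1 for disease and -1 for control.

```python
def centroid_score(train_x, train_y, x):
    """Stand-in scorer (NOT an SVM): positive when x is closer to the
    disease centroid than to the control centroid."""
    def centroid(label):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    return dist(x, centroid(-1)) - dist(x, centroid(+1))


def leave_one_out_scores(samples, labels, scorer=centroid_score):
    """Score each sample with a model trained on all the other samples."""
    scores = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]  # hold out sample i
        train_y = labels[:i] + labels[i + 1:]
        scores.append(scorer(train_x, train_y, samples[i]))
    return scores
```

Because every sample is scored by a model that never saw it, the resulting scores can be passed directly to an ROC analysis without optimistic bias from training-set reuse.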

#### 2.7. Classification Evaluation

We evaluate the classification performance of these modules by ROC (receiver operating characteristic) curves and their corresponding AUC (area under the ROC curve) values. For each gene module, we compare the classification AUC values achieved by integrating gene expressions in six brain regions with those achieved in a single brain region. In addition, we implement naive Bayes, neural network, and random forest algorithms for classification. From this comparison, the target module that SVM classifies with consistently high performance is selected to serve as the AD module biomarker. We also assess the rationality of integrating data from six brain regions in the identification. The subnetwork module with the highest AUC values is identified as the module biomarker of AD for further cross-brain-region and cross-dataset validations. The target network module with the best classification performance is then regarded as the final identified AD biomarker.
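For reference, the AUC can be computed directly from classifier scores without drawing the curve. The sketch below uses the rank interpretation of the AUC, i.e., the probability that a randomly chosen disease sample receives a higher score than a randomly chosen control; the function name and the +1 label convention for disease are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four disease/control pairs are ordered correctly.
print(roc_auc([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1]))  # prints 0.75
```

The pairwise form is quadratic in the number of samples, which is negligible for datasets with a few dozen samples per region.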

#### 2.8. Enrichment Analysis

The functional implications of the network modules with good classification performance are obtained by GO (gene ontology) enrichment analysis. We use our NOA (network ontology analysis) method (http://app.aporc.org/NOA/) to identify the enriched dysfunctions in these biomarker genes. From the functional implications, we can partially validate the roles of these identified biomarkers in AD development and progression.

# 3. RESULTS

#### 3.1. Differentially Expressed Genes

After data pre-processing, we obtain 23,643 genes with their expression profiles. In each brain region, we identify the top 200 differential genes (the top 10% of the genes that pass p ≤ 0.05 and FDR ≤ 0.01). Altogether, we identify 1,001 differentially expressed genes. **Figure 2** summarizes how these differential genes overlap across the six brain regions. We find that most of the differential genes are differentially expressed in only one brain region. Few genes are simultaneously differential across several brain regions.

#### 3.2. Coexpression Network and Modules

For each pair of differential genes, we calculate the differential correlation values via statistical testing between control and disease samples. We identify the differentially coexpressed gene pairs and put them together to form a differential coexpression network with 615 dysregulated interactions. By employing the MCL algorithm, we identify dysregulated subnetwork modules from the network. **Figure 3A** shows the five modules containing the largest numbers of genes. We note that each of these modules has an obvious hub gene, which indicates a topological feature of these differential coexpression networks.
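Assembling the network from gene pairs and locating the hub of a module can be sketched as follows. The sketch assumes that the hub is the gene with the most within-module interactions; the function names and the placeholder gene names in the usage example are hypothetical.

```python
from collections import defaultdict


def build_network(gene_pairs):
    """Undirected network as a mapping gene -> set of interacting genes."""
    adjacency = defaultdict(set)
    for a, b in gene_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def hub_gene(adjacency, module):
    """Gene with the most interactions inside the given module."""
    members = set(module)
    # Sorting first makes the choice deterministic when degrees are tied.
    return max(sorted(members), key=lambda g: len(adjacency[g] & members))


# Placeholder gene names, for illustration only.
pairs = [("hub", "g1"), ("hub", "g2"), ("hub", "g3"), ("g1", "g2")]
network = build_network(pairs)
print(hub_gene(network, ["hub", "g1", "g2", "g3"]))  # prints hub
```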

As shown in **Figure 3A**, the gene NPIPA1 (nuclear pore complex interacting protein family member A1) is the identified hub differential gene, with differential correlations with all the other genes in Cluster 1. NPIPA1 is known to perform the biological functions of mRNA transport and protein transport. One of its interacting genes is MAP2K4, which encodes an important membrane protein of the MAPK (mitogen-activated protein kinase) family. The interacting partners in Cluster 1 reveal biologically cooperative dysfunctions. The differentially coexpressed interaction between NPIPA1 and MAP2K4 implies dysfunctional signal transduction in AD. With the network-based approach, the global scenario of dysfunctions in AD development and progression is displayed in the form of molecular subnetworks.

#### 3.3. Biomarker Classification

To evaluate the performance of these clusters in distinguishing control and disease, we perform leave-one-out classifications. The ROC curves of these five clusters in the six brain regions are shown in **Figure 3B**. We also carry out the evaluations in each brain region separately. The sensitivity, specificity, and AUC values are shown together. The detailed AUC values in six brain regions are shown in **Table 1**.

From **Table 1**, we find that the five clusters reach high AUC values in the six brain regions. The 5th cluster reaches the highest AUC value of 1.0. These results provide direct evidence for the effectiveness of these candidate biomarkers in distinguishing between control and disease states. We also calculate the AUC values obtained by pooling these individual brain regions and their average values. The good classification performances indicate that these modules can serve as biomarkers for classifying the disease states in multiple brain regions. Because of their better AUC values in various brain regions, we select Cluster 1 and Cluster 5 for further screening with different classification algorithms.

We further test the discriminative capability of the two clusters with three other classification algorithms, i.e., naive Bayes, neural network, and random forest. Together with SVM, **Figures 4A,B** show the ROC curves of the classifiers based on the four algorithms. In Cluster 1, random forest achieves the best AUC of 0.994, as shown in **Figure 4A**, whereas in Cluster 5 it achieves an AUC of 0.755, as shown in **Figure 4B**. By comparison, SVM obtains stably high AUC values of 0.984 and 1.0, respectively. Thus, we prefer the SVM classifier for distinguishing normal and disease states, and Cluster 1 is the identified AD biomarker.

For a comparison with conventional biomarker discovery methods, we implement two widely used methods, i.e., the method using differentially expressed genes (denoted as the 'DiffGene' method) (Liu, 2016) and the variable/feature selection method based on the SVM-RFE algorithm (denoted as the 'SVM-RFE' method) (Guyon et al., 2002). **Figure 5** shows the AUC values of the classification results. As shown in **Figure 5A**, the AUC values of the 'DiffGene' method are not as good as those of our proposed method shown in **Table 1**. In **Figure 5B**, the AUC values of the 'SVM-RFE' method are not consistently high. In the brain regions of HIP, SFG, and VCX, the AUC values of our proposed method (**Table 1**) exceed those of 'SVM-RFE'. The comparisons indicate that our method outperforms the conventional methods in terms of classification performance.

#### 3.4. Biomarker Dysfunctional Analysis

To analyze the functional implications of these identified diagnostic biomarkers of AD, we use NOA to find the enriched GO annotations underlying these gene modules. **Table 2** shows the significant GO terms of biological process. As shown in **Table 2**, the function of 'lipid transport' is enriched, which indicates dysfunctional metabolism and energy transformation in AD. The epigenetic term 'regulation of DNA methylation' indicates dysfunctional modifications related to AD. These important enrichments provide a functional map of the identified biomarker genes. They provide more evidence of the functional importance of these biomarkers and offer insight into AD pathogenesis.

TABLE 1 | The classification AUC values of the five clusters.

FIGURE 5 | The classification performances by conventional biomarker discovery methods in the six brain regions. (A) The ROC curves of 'DiffGene' method on top 44 differential genes. (B) The ROC curves of 'SVM-RFE' method on top 1,000 differential genes.

# 4. DISCUSSION

#### 4.1. Cross-Region Biomarker Classification

AD is a chronic neurodegenerative disease that affects various brain regions controlling various physical functions (Liang et al., 2007). The module biomarker of Cluster 1, which has good power to classify control and disease samples, has been identified by integrating gene expression data of six brain regions. It is of interest to investigate the cross-region classification performances to check the potential pathogenic relationship between brain regions.

TABLE 2 | The enriched GO biological processes in the identified AD biomarkers.

To evaluate the classification accuracy of the module biomarker between the six brain regions, we train the SVM classifier on gene expression data from one brain region and then test it in the other brain regions. Taking the EC brain region as an example, we first extract the expression data of the module biomarker genes in EC and train the classifier to recognize their patterns in control and disease samples. Then we test how well the trained classifier distinguishes controls from diseases using the expression of these biomarker genes in each of the other five brain regions. The five resulting AUC values are plotted as bars in **Figure 6A**. Secondly, we train the SVM classifier on the gene expressions in each of the other five brain regions in turn and then test the classification performance in EC. These five AUC values are shown as the other bars in **Figure 6A**.

From the AUC values of the cross-brain-region validations, we can roughly estimate the dysfunctional relationships between the six brain regions from the viewpoint of dynamic gene expression. In **Figure 6A**, the classifiers trained on the expressions of biomarkers in EC achieve higher AUC values in HIP, MTG, PC, and SFG than in VCX (0.657, 0.707, 0.598, and 0.809 vs. 0.508). This indicates that the gene expressions in VCX differ from those in the other five brain regions. Differences in how VCX is affected during AD progression across brain regions have been reported (Liang et al., 2007; Liu et al., 2011). When we train the classifiers on the gene expression of biomarkers in the other five brain regions, the classification performance for the samples in EC reaches AUCs of 0.912 for HIP, 0.827 for MTG, 0.843 for PC, 0.802 for SFG, and 0.496 for VCX, respectively. The AUC of VCX is again the lowest. This provides more evidence for the distinctness of VCX during AD development. Moreover, a high AUC in a specific brain region implies its dysfunctional specificity. However, we mainly focus on integrating the gene expression data of six brain regions to identify general biomarkers for AD rather than on detecting specific biomarkers for individual brain regions.

These AUC values are higher than the former ones obtained by training the classifiers in EC and testing them in the other five regions, which indicates a significant gene expression deviance of these biomarkers in EC. When we train the classifiers in the other five brain regions, the accurate classification performance in EC indicates that the gene expressions in four of these brain regions (all except VCX) contain the information needed to distinguish controls from diseases. The asymmetric cross-brain-region classification results also motivate us to integrate the gene expressions in six brain regions when identifying AD biomarkers, in order to compensate for the diversity of gene expressions in multiple brain regions.

#### 4.2. Individual-Region Biomarker Classification

Instead of detecting AD biomarkers in the six individual brain regions, we integrate the differentially coexpressed gene pairs in these regions by a systematic strategy. For comparison, we also identify candidate biomarkers from the gene expression data in the six brain regions individually and investigate their classification power. We follow the whole previously described process of biomarker discovery except for the selection of differentially coexpressed gene pairs. In individual brain regions, the differential gene correlation pairs are instead based on the absolute difference between the PCCs in control and disease samples. In each brain region, we rank the gene pairs according to their differential correlations and select the same number of pairs as in the former integration method. These differential gene pairs form the individual gene coexpression networks in the six brain regions, respectively.
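The single-region selection rule can be sketched as follows. The function name and the dictionary layout, which maps a gene pair to its PCC under one condition, are assumptions made for illustration; `k` would be set to the number of pairs kept by the integration method.

```python
def top_differential_pairs(r_control, r_disease, k):
    """Rank gene pairs by |PCC_control - PCC_disease| and keep the top k.

    r_control, r_disease: dict (gene_a, gene_b) -> PCC in that condition
    """
    diffs = {pair: abs(r_control[pair] - r_disease[pair]) for pair in r_control}
    return sorted(diffs, key=diffs.get, reverse=True)[:k]


# Toy correlations for three hypothetical gene pairs.
control = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): -0.5}
disease = {("a", "b"): 0.8, ("a", "c"): -0.7, ("b", "c"): -0.4}
print(top_differential_pairs(control, disease, 1))  # prints [('a', 'c')]
```

Fixing `k` to the size of the integrated network keeps the two approaches comparable, since both then produce networks with the same number of interactions.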

For each gene coexpression network, we also employ the MCL algorithm to decompose it into subnetwork clusters. Similarly, the clusters with the largest number of genes are taken as the candidate biomarkers. To compare the classification performance of the individual-region candidate biomarkers with that of the region-integrated biomarkers, we perform leave-one-out cross-validations on these competitors and on the identified AD biomarkers.

**Figure 6B** shows the comparison of AUC values in the six brain regions. Using the gene expressions in each brain region, we perform the cross-validations of classification for the individual-region biomarkers and the region-integrated biomarkers. Except in EC, the module biomarker achieves higher AUC values than the candidate biomarkers from individual brain regions. In EC, the candidate biomarkers achieve a perfect AUC of 1.0 (vs. 0.938 for the identified biomarker). However, the identified module biomarker obtains higher classification AUC values than those in the other four individual brain regions. The results also support the rationality of identifying AD biomarkers by integrating gene expression datasets from several brain regions.

## 4.3. Cross-Dataset Biomarker Classification

For cross-dataset validation of our identified AD biomarkers, we also test their classification performance in independent datasets. The other AD gene expression profiles are downloaded from NCBI GEO (access ID: GSE48350). The dataset consists of two sample-paired subsets in EC. One contains 15 AD brain samples and 21 control samples (from young donors aged 20 to 52). The other contains 15 AD brain samples and 18 control samples (from old donors aged 64 to 99). Using the biomarkers, we test the classification in the two subsets, respectively. The ROC curves of classification by our module biomarker are shown in **Figure 6C**.

0.719, 0.847, 0.902, and 0.954 in the six brain regions respectively. For the 'integration' module biomarker, the corresponding AUCs are 0.938, 0.953, 1.000, 0.936, 0.966, and 0.972, respectively. (C) The ROC curves of classification by AD biomarkers in independent datasets. 'Young' and 'Old' represent the datasets with different types of control sample respectively. The gray region refers to the standard deviations of classification in 30 random-choosing gene sets. (D) The ROC curve of biomarker classification in independent blood samples.

In classifying the AD samples against old-aged controls, the module biomarker achieves an AUC of 0.877. The AUC value for the samples with young-aged controls is 0.972. The two AUC values support the effectiveness of our identified module biomarker in distinguishing AD samples from controls. **Figure 6C** also shows the ROC curve (with the gray range of standard deviations) for gene sets of the same size randomly chosen from the gene expression profiling data. The higher classification performance of the identified biomarkers provides further evidence for the efficiency and advantage of our proposed method.
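For readers who want to reproduce an AUC value from classifier scores, a minimal sketch is given below. It uses the rank-based definition of the AUC (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one); the function name `auc_from_scores` and the toy scores are assumptions for illustration, not the evaluation code used in this study.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    labels: 1 for disease (positive), 0 for control (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy decision scores (hypothetical): three AD samples, three controls.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_from_scores(scores, labels)  # 7 of the 9 pairs are ordered correctly
```

The same function could be applied to each of the randomly chosen gene sets to obtain the spread of baseline AUC values, although how the random sets were scored is not specified here.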

## 4.4. Blood Validation

Currently, the accurate detection of AD in clinics is often based on nuclear magnetic resonance imaging, cerebrospinal fluid, and PET-CT (positron emission tomography-computed tomography). The diagnostic biomarkers identified here provide possible alternatives, pending further clinical validation. Note that our identification is based on gene expression profiles in human brains. From a practical clinical perspective, peripheral blood plasma testing is much more convenient, cheaper, and less invasive for AD diagnosis (Suhre et al., 2017). Thus, we validate these potential gene markers in blood gene expression samples to check their classification performance. The gene expression profiling data in blood mononuclear cells is downloaded from NCBI GEO (Access ID: GSE4226) (Maes et al., 2007). By mapping the 44 genes in Cluster 1 to the measured blood gene expressions, we get 6 overlapping markers in blood samples of 14 AD patients and 14 normal controls. Using these six biomarker genes, the ROC curve for distinguishing controls from disease samples is shown in **Figure 6D**. The AUC value reaches 0.76. Although the number of biomarker genes measured in the samples is small, the diagnostic accuracy is competitive with the available clinical approaches. With the cross-dataset and blood validations, we partially verify the identified biomarkers in public data.

Recently, circulating microRNAs in serum have emerged as a promising alternative source of diagnostic biomarkers for complex diseases (Chen et al., 2017, 2018a). The development of computational methods for identifying potential diagnostic lncRNA biomarkers is also promising for AD biomarker screening, especially when these kinds of high-throughput data are available (Chen et al., 2016, 2018b). Discovering AD biomarkers from epigenetic transcripts in blood is an interesting research direction.

## 4.5. Relationship Between Biomarkers and AD Genes

Although APP (Jonsson et al., 2012), APOE (Morris et al., 2010) and PSEN (Hjermind, 2016) have been recognized as genetic risk factors of AD, we have not identified them in the diagnostic biomarkers because they are not differentially expressed genes in any of the six brain regions. It is of interest to study the relationship between biomarkers and AD genes. We first build an integrative human protein-protein interaction (PPI) network by combining the interactions in various PPI databases (Liu et al., 2011). We employ the 28 genes in the KEGG AD pathway as the documented AD genes (Liu et al., 2011). Then we identify the intersection of the first-order neighbors of the biomarker genes in Cluster 1 and those of the AD genes. **Figure 7** shows their linkages. There are 16 AD genes that share 38 first-order neighbors with the 44 biomarker genes. This indicates that the biomarkers are closely related to these AD genes, although the AD genes themselves are not contained in the identified biomarkers. The results also suggest that AD causal genes lie close to the biomarker genes in the molecular interactome.
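The neighbor-intersection step can be illustrated with a few set operations. The sketch below assumes the PPI network is available as an undirected edge list; the helper names and the toy gene identifiers (`BM1`, `AD1`, `X`, and so on) are hypothetical placeholders rather than genes from this study.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from an iterable of (gene, gene) interactions."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def first_order_neighbors(adj, genes):
    """Direct interaction partners of a gene set, excluding the set itself."""
    genes = set(genes)
    neighbors = set()
    for g in genes:
        neighbors |= adj.get(g, set())
    return neighbors - genes


def shared_neighbors(adj, biomarker_genes, ad_genes):
    """Neighbors common to both gene sets, and the AD genes that reach them."""
    shared = (first_order_neighbors(adj, biomarker_genes)
              & first_order_neighbors(adj, ad_genes))
    linked_ad = {g for g in ad_genes if adj.get(g, set()) & shared}
    return shared, linked_ad


# Toy network with hypothetical gene names.
edges = [("BM1", "X"), ("AD1", "X"), ("BM2", "Y"), ("AD1", "Y"), ("AD2", "Z")]
adj = build_adjacency(edges)
shared, linked_ad = shared_neighbors(adj, {"BM1", "BM2"}, {"AD1", "AD2"})
```

Here `shared` plays the role of the 38 overlapping first-order neighbors and `linked_ad` the role of the 16 AD genes connected through them.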

# CONCLUSION

In this paper, we proposed a computational method for detecting AD biomarkers by integrating gene expression data from six brain regions. The framework is based on differential coexpression networks and machine learning. The network modules are screened by their classification power via SVM classifiers. We identified five module candidates and, after further screening with three other classification algorithms, regarded Cluster 1 as the identified AD biomarker. The cross-brain-region, cross-dataset, and blood gene expression validations provide evidence of its efficiency, efficacy, and advantage. In total, 44 genes in Cluster 1 are targeted as potential biomarkers in the form of a network module. Furthermore, blood biomarkers are also important in clinical applications (Ngo et al., 2018), and we should screen more genetic biomarkers from different datasets to map more potential blood biomarkers and improve classification accuracy. In the future, we also intend to incorporate the AD risk genes in our identification and investigate the causality between disease genes and marker genes. Considering the false positives in computational strategies for identifying disease biomarkers, clinical validation of these potential biomarkers is urgently needed. If the identified AD biomarkers pass the multiple phases of clinical trials, they will be highly beneficial for the early diagnosis of AD.

# DATA AVAILABILITY

The datasets analyzed for this study can be found in the NCBI GEO dataset: www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5281.

# AUTHOR CONTRIBUTIONS

Z-PL conceived and designed the study. LW wrote the code. LW and Z-PL analyzed the data and drafted the manuscript.

# FUNDING

This work was partially supported by the National Nature Science Foundation of China (NSFC) under Grant Nos. 61572287 and 61533011; the Innovation Method Fund of China (Ministry of Science and Technology of China, 2018IM020200); Shandong Provincial Key Research and Development Program (2018GSF118043); Department of Science and Technology of Shandong Province, China (2017CXGC1502 and 2015ZDXX0801A01); the Fundamental Research Funds of Shandong University under Grant No. 2016JC007. The paper was also supported by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, Ministry of Education of China.

# ACKNOWLEDGMENTS

Thanks are due to the reviewers for their valuable comments. We also thank Haixia Shang and Ruth Mwale for their assistance in this work.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Age Is Important for the Early-Stage Detection of Breast Cancer on Both Transcriptomic and Methylomic Biomarkers

Xin Feng<sup>1,2</sup>, Jialiang Li<sup>3,4</sup>, Han Li<sup>3,4</sup>, Hang Chen<sup>1,2</sup>, Fei Li<sup>3,4</sup>, Quewang Liu<sup>1,2</sup>, Zhu-Hong You<sup>5</sup> and Fengfeng Zhou<sup>1,2,3,4</sup>\*

<sup>1</sup> BioKnow Health Informatics Lab, College of Computer Science and Technology, Jilin University, Changchun, China, <sup>2</sup> Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China, <sup>3</sup> BioKnow Health Informatics Lab, College of Software, Jilin University, Changchun, China, <sup>4</sup> Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China, <sup>5</sup> Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China

#### Edited by:

Tao Zeng, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Wen Zhang, Huazhong Agricultural University, China Jianbo Pan, Johns Hopkins Medicine, United States Sen Peng, Translational Genomics Research Institute, United States

#### \*Correspondence:

Fengfeng Zhou fengfengzhou@gmail.com; ffzhou@jlu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 18 November 2018 Accepted: 27 February 2019 Published: 26 March 2019

#### Citation:

Feng X, Li J, Li H, Chen H, Li F, Liu Q, You Z-H and Zhou F (2019) Age Is Important for the Early-Stage Detection of Breast Cancer on Both Transcriptomic and Methylomic Biomarkers. Front. Genet. 10:212. doi: 10.3389/fgene.2019.00212

Patients at different ages have different rates of cell development and metabolisms. As a result, age should be an essential part of how a disease diagnosis model is trained and optimized. Unfortunately, most of the existing studies have not taken age into account. This study demonstrated that disease diagnosis models could be improved by merely applying individual models for patients of different age groups. Both transcriptomes and methylomes of the TCGA breast cancer dataset (TCGA-BRCA) were utilized for the analysis procedure of feature selection and classification. Our experimental data strongly suggested that disease diagnosis modeling should integrate patient age into the whole experimental design.

Keywords: age, feature selection, TriVote, BRCA, classification, transcriptome, methylome

# INTRODUCTION

Some types of cancers grow faster in younger hosts. Renal cancer has an average growth rate of 0.3 cm per year and many clinical studies focused on the surveillance of small tumors only in elderly patients (Mues et al., 2010; Mehrazin et al., 2014). However, renal cancers in younger patients may grow at a much larger rate of 2.13 cm per year (Gofrit et al., 2015), which requires more frequent follow-up examinations. Prostate cancer was mostly diagnosed at an older age (>65 years old), but the early-onset cases (<55 years old) had a much faster growth rate and a stronger genetic association (Salinas et al., 2014).

Breast cancer has the highest incidence rate among females in both China (Chen et al., 2016) and the United States (Siegel et al., 2018) and tends to grow faster in younger females (Weedon-Fekjaer et al., 2008). One of twenty breast tumors may double in diameter from 10 mm within 1.2 months, compared with 6.3 years for the same proportion with the slowest growth rates (Weedon-Fekjaer et al., 2008). Generally, younger age was one of the risk factors for poor prognosis and high aggressiveness (Bardia and Hurvitz, 2018; Lee et al., 2018). Even the genomic or transcriptomic biomarkers demonstrated different associations with younger breast cancer patients compared to older ones (Wang et al., 2018) and required age-specific treatments (Kim et al., 2018).

Breast cancer diagnosed at its early stage may be treated with mastectomy or lumpectomy, which systematically reduces relapse risk (Kummerow et al., 2015; Santa-Maria et al., 2015). Early-stage breast cancer was usually diagnosed by radiological imaging technologies (Simos et al., 2014) or molecular biomarkers (Duffy et al., 2015). X-ray-based mammogram (Kashyap et al., 2017; Sinthia and Malathi, 2018) and breast magnetic resonance imaging (MRI) were the predominant choices for detecting the candidate lesion sites of breast cancer (Wang et al., 2013; Loggers et al., 2016). Serum microRNA and urine DNA damage were also recently observed to have strong associations with early-stage breast cancer (Guo et al., 2017; An et al., 2018). Unfortunately, these early-stage breast cancer detection technologies did not integrate the age information in the decision-making process.

This study hypothesized that the integration of age information may improve the performance of the biomarker detection problem, which is known as the feature selection problem in the machine learning area (Alshawaqfeh et al., 2017; Xu et al., 2018). Following this, we split the transcriptomic and methylomic datasets of breast cancer into multiple age groups and investigated whether a machine learning procedure achieved better performance after the split of age groups.

# MATERIALS AND METHODS

#### Summary of Datasets

This study utilized the transcriptomic and methylomic datasets from The Cancer Genome Atlas database (TCGA) (Ma and Ellis, 2013). The level-3 transcriptomes of the TCGA breast cancer (BRCA) project were hybridized and measured by the Agilent 244K Custom Gene Expression G4502A-07-3 array (TCGA platform code AgilentG4502A\_07\_3), which was designed by the University of North Carolina on the Agilent (Santa Clara, CA, United States) Sure Print G3 Microarray Platform (Cancer Genome Atlas Network, 2012). Each sample has the expression levels of 17,814 probe sets. The developmental stage of each sample was retrieved from the entry "tumor\_stage" in the clinical annotations of the TCGA-BRCA project at the NIH National Cancer Institute GDC Data Portal (Cancer Genome Atlas Network, 2012; Ciriello et al., 2015). There were 502 transcriptomic samples with the stage annotations, among which there were 90, 291, 108, and 13 samples for stages I, II, III, and IV, respectively.

Methylome was generated by the Illumina Infinium HumanMethylation450K BeadChip, and each sample had 485,577 features (Morris and Beck, 2015). There were 765 methylomic samples with the stage annotations in the TCGA-BRCA project, among which there were 125, 433, 196, and 11 samples for stages I, II, III, and IV, respectively.

#### Feature Selection Algorithms

Biomedical datasets have two major types, either a large feature number with a small sample number or a large sample number with a small feature number. The OMIC datasets usually extract a large number of features for a small number of samples, and the number of features must be reduced to avoid the overfitting problem for machine learning modeling (Lyu et al., 2017; Ye et al., 2017; Ali and Aittokallio, 2018; Xu et al., 2018). For the second style of biomedical datasets, although it is not a required step, reducing the dimensions may substantially increase modeling performance (Guan et al., 2018; Zou et al., 2018).

Seven feature selection algorithms were evaluated for their classification performances on the datasets with different age groups. The F-test evaluated the analysis of variance between two variables, or between a variable and the phenotype (Lomax and Hahs-Vaughn, 2013). The PCC (Pearson Correlation Coefficient) was used to evaluate how significantly a feature was associated with the phenotype (Yoon and Chung, 2013). The classic T-test was also chosen to rank the features by their association significance with the phenotype (Kim, 2015; Ye et al., 2017).

The Recursive Feature Elimination (RFE) strategy was evaluated based on three different algorithms. The Support Vector Machine (SVM) was frequently used to drive the procedure of recursive feature elimination, and this procedure was denoted as rfeSVM (Xu et al., 2018). L1 regularization, known as the least absolute shrinkage and selection operator (Lasso), generated weights for each chosen feature (Guyon and Elisseeff, 2003). The RFE procedure based on Lasso was denoted as rfeLasso (Sfakianakis et al., 2014). The logistic regression (LR) model was also used to eliminate features by their weights (Pandey et al., 2018), and this procedure was denoted as rfeLR.

TriVote (Tri-Step Feature Voting algorithm) was recently reported to perform very well on both transcriptomic and methylomic data (Xu et al., 2018) and was also evaluated on the datasets in this study.

#### Classification Algorithms

Classification algorithms may achieve drastically different performances on the same dataset (Ge et al., 2016; Liu et al., 2017; Xu et al., 2018). As a result, in this study, we chose three representative classification algorithms to evaluate the classification performance of a given feature subset, i.e., Logistic Regression (LR), Support Vector Machine (SVM) and Gaussian Naïve Bayes (GaussianNBayes).

Logistic regression calculated the probability of a binary response for a given dataset (Menard, 2018). SVM optimized the maximal separation margin of a discrimination hyperplane between the groups of positive and negative samples, and the discrimination hyperplane tended to have a good binary classification performance (Suthaharan, 2016). The Gaussian Naïve Bayes (GaussianNBayes) assumed the inter-feature independence and calculated the probability that a given query sample belonged to a class (Bouckaert, 2004).

Ten-fold cross-validation was utilized to calculate the binary classification performances (Ren et al., 2018).
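As an illustration of the evaluation protocol, the following sketch generates ten train/test index splits. It is a plain, non-stratified split with a fixed random seed; whether the original experiments stratified the folds by class is not stated, so this should be read as an assumption, and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are shuffled once, then dealt round-robin into k folds so that
    fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fi, fold in enumerate(folds) if fi != i for j in fold]
        yield train, test


# Example with the number of transcriptomic samples reported above (502).
splits = list(k_fold_indices(502, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Each sample appears in exactly one test fold, and the reported accuracy would then be aggregated over the ten held-out folds.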

#### Performance Measurements

A binary classification problem was usually evaluated by the performance metrics accuracy (Acc), sensitivity (Sn), and specificity (Sp) (Xu et al., 2017; Ye et al., 2017). There were two classes of samples in a binary classification problem, denoted as Positive and Negative ones, respectively. There were P and N samples in the classes of Positive and Negative samples.

Sensitivity (Sn) was defined as the percentage of correctly predicted positive samples, i.e., Sn = TP/P, where TP (True Positive) was the number of correctly predicted positive samples, and FN (False Negative) was defined as FN = P−TP. The measurement Specificity (Sp) was defined as the percentage of correctly predicted negative samples, i.e., Sp = TN/N, and the number of false positive samples (FP) was defined as FP = N−TN. The overall accuracy was Acc = (TP + TN)/(P + N).

The balanced accuracy [bAcc = (Sn + Sp)/2] was usually utilized to evaluate the classification model without generating bias for a dataset with significantly different numbers of positive and negative samples (Feng et al., 2018). The Matthews correlation coefficient (MCC) was defined as MCC = (TP × TN − FP × FN)/sqrt[(TP + FP) × (TP + FN) × (TN + FP) × (TN + FN)], where sqrt() is the square root (Xu et al., 2018; Zhang et al., 2018; Zhao et al., 2018).
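The five measurements defined above can be computed directly from the four confusion-matrix counts. The sketch below follows those definitions term by term; the function name `binary_metrics`, the zero-denominator convention for MCC, and the example counts are assumptions for illustration.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sn, Sp, Acc, bAcc and MCC from confusion-matrix counts.

    Assumes both classes are present (P = TP + FN > 0 and N = TN + FP > 0).
    """
    p = tp + fn  # number of positive samples
    n = tn + fp  # number of negative samples
    sn = tp / p
    sp = tn / n
    acc = (tp + tn) / (p + n)
    bacc = (sn + sp) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "bAcc": bacc, "MCC": mcc}


# Hypothetical counts: 60 positives (40 found), 40 negatives (30 found).
m = binary_metrics(tp=40, tn=30, fp=10, fn=20)
```

With these counts, Sn = 40/60, Sp = 30/40 and Acc = 70/100, which shows how Acc and bAcc diverge once the two classes differ in size.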

#### Experimental Design

This study modeled the early detection of breast cancer as a binary classification problem, due to the fact that there were much fewer samples in stage IV than the other three stages. A binary classification problem was defined as a discrimination function to separate samples between stages I/II and III/IV. The investigations in this study were planned as shown in the outline in **Figure 1**.

First, a given dataset was screened by variance, defined as the average of the squared deviations from the mean and computed with the Python function numpy.var(). This study supposed that an OMIC feature with a large standard deviation may be clinically detected more easily. Thus, this step kept the 10,000 features with the largest standard deviations for further biomarker screening.
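A minimal sketch of this screening step is shown below. It uses `statistics.pvariance` from the standard library, which computes the same population variance as `numpy.var()` with its default settings; because the standard deviation is the square root of the variance, ranking by either quantity gives the same order. The function name, data layout, and toy values are illustrative assumptions.

```python
from statistics import pvariance


def top_variance_features(data, n_keep):
    """Return the names of the n_keep features with the largest variance.

    data: dict mapping feature name -> list of values across samples.
    Ties are broken by feature name so the result is deterministic.
    """
    ranked = sorted(data, key=lambda f: (-pvariance(data[f]), f))
    return ranked[:n_keep]


# Toy dataset with hypothetical feature names and values.
data = {
    "f1": [1, 1, 1, 1],    # variance 0
    "f2": [0, 10, 0, 10],  # variance 25
    "f3": [4, 5, 6, 7],    # variance 1.25
}
kept = top_variance_features(data, n_keep=2)
```

In the study itself `n_keep` would be 10,000; the toy call keeps two of three features only to make the ranking visible.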

Then, the dataset was screened by one of the three algorithms (F-test, PCC, and T-test) for the associations of each feature with the class label. The top 1000 ranked features were kept for further analysis. Iteratively, the remaining dataset was evaluated by one of the recursive feature elimination algorithms (rfeSVM, rfeLasso, and rfeLR), and the feature with the smallest weight was removed from the dataset while the remaining dataset was processed repeatedly. This study decided that the numbers of features would be between 10 and 100 with a step size of 5.
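The elimination loop and the grid of feature counts can be sketched as follows. The sketch is deliberately model-agnostic: `weight_fn` is a hypothetical stand-in for refitting an SVM, Lasso, or LR model on the remaining features and returning one weight per feature, and the toy weights are invented for illustration.

```python
def feature_count_grid(start=10, stop=100, step=5):
    """Feature-subset sizes to evaluate: 10, 15, ..., 100."""
    return list(range(start, stop + 1, step))


def recursive_feature_elimination(features, weight_fn, targets):
    """Drop the feature with the smallest absolute weight, one at a time.

    weight_fn(remaining) must return a dict feature -> weight; it stands in
    for refitting the model on the remaining features at every iteration.
    A snapshot of the remaining features is stored for each target size.
    """
    remaining = list(features)
    targets = set(targets)
    smallest = min(targets)
    snapshots = {}
    while remaining:
        if len(remaining) in targets:
            snapshots[len(remaining)] = list(remaining)
        if len(remaining) <= smallest:
            break
        weights = weight_fn(remaining)
        weakest = min(remaining, key=lambda f: abs(weights[f]))
        remaining.remove(weakest)
    return snapshots


# Toy run: 120 hypothetical features whose weight grows with their index.
fixed = {f"f{i}": float(i + 1) for i in range(120)}
grid = feature_count_grid()
snapshots = recursive_feature_elimination(
    fixed, lambda feats: {f: fixed[f] for f in feats}, grid
)
```

The grid contains 19 subset sizes, and each snapshot would then be scored by cross-validation with the chosen classifier.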

# RESULTS AND DISCUSSION

#### Data Preprocessing

First of all, we need to rule out the hypothesis that the sample age was correlated with the tumor stages. The Pearson correlation coefficient (PCC) (Mpairaktaris et al., 2017; Zhang et al., 2017) between the sample age and the tumor stage was −0.0221 with P-value = 0.6206 for the transcriptome samples. The methylome samples had PCC = −0.0223 with P-value = 0.5377 between the sample age and the tumor stage. The hypothesis was rejected for both the transcriptome and methylome samples. The maximal information coefficient (MIC) is very sensitive in detecting weak or non-linear correlations (Reshef et al., 2011) and has been widely used in feature selection (Ge et al., 2016) and inter-gene synergy (Xing et al., 2017), etc. The MIC value was in the range [0, 1] and a larger MIC value means a higher correlation between the two variables. The MIC values between age and tumor stage were 0.0591 and 0.0490 for transcriptome and methylome samples, respectively. These two MIC values were similar to that of the random correlations, as described in Reshef et al. (2011).

As a result, both PCC and MIC correlation measurements rejected the hypothesis that the sample age was correlated with the tumor stages.

Among the 502 transcriptomic samples in the TCGA-BRCA project, there were 381 and 121 samples in the early stages (I and II) and late stages (III and IV), respectively. This dataset was denoted as RNA(1). The early-stage patients were regarded as the positive class, and the late-stage ones were the negative class.

Each class of samples was split into two or three bins with equally-sized sample age ranges, as illustrated in **Table 1**. The minimum age of samples with either transcriptome or methylome was 26, and the maximum age was 90. We used the upper integers of (20 + 70 × i/k) as the thresholds. The age bins for k = 2 were [20, 55) and [55, 90], while the age bins for k = 3 were [20, 44), [44, 67), and [67, 90].

Samples in the early stages (I and II) were denoted as positives, and the other samples in the late stages (III and IV) were the negatives.

FIGURE 2 | Classification performances of rfeSVM screening of top-ranked 1000 features by T-test. The accuracy was calculated by the 10-fold cross validation of three classifiers, i.e., LR, SVM, GaussianNBayes. The horizontal axis was the number of features screened by rfeSVM, and the vertical axis was accuracy. The plots were for the datasets (A) RNA(1), (B) RNA(2)(0), (C) RNA(2)(1), (D) RNA(3)(0), (E) RNA(3)(1), and (F) RNA(3)(2).

TABLE 2 | The number of times each classifier achieved the best accuracy for the RFE-screened features of a given dataset.

There were 19 feature subsets screened by rfeSVM/rfeLasso/rfeLR, with the numbers of features 10, 15, 20, . . ., 100.

Column "MaxAcc" provides the maximal accuracy achieved by the three classifiers on the 19 feature subsets screened by the RFE algorithm given in the Column "RFE". The column "Classifiers" provides the algorithms achieving the maximal accuracy in the column "MaxAcc." More than one classifier may achieve the same best accuracy. Best model of each dataset was illustrated in bold.

| Dataset | FS | MaxAcc | Classifiers |
| --- | --- | --- | --- |
| RNA(1) | T-test | 0.9183 | SVM |
| RNA(1) | F-test | 0.9422 | SVM |
| RNA(1) | PCC | 0.9223 | SVM |
| RNA(2)(0) | T-test | 0.9951 | SVM |
| RNA(2)(0) | F-test | 1.0000 | LR, SVM |
| RNA(2)(0) | PCC | 0.9951 | SVM |
| RNA(2)(1) | T-test | 0.9732 | LR |
| RNA(2)(1) | F-test | 0.9966 | SVM |
| RNA(2)(1) | PCC | 0.9765 | SVM |
| RNA(3)(0) | T-test | 1.0000 | SVM |
| RNA(3)(0) | F-test | 1.0000 | LR, SVM |
| RNA(3)(0) | PCC | 1.0000 | SVM |
| RNA(3)(1) | T-test | 0.9732 | LR |
| RNA(3)(1) | F-test | 1.0000 | SVM |
| RNA(3)(1) | PCC | 0.9765 | SVM |
| RNA(3)(2) | T-test | 1.0000 | LR, SVM |
| RNA(3)(2) | F-test | 1.0000 | LR, SVM |
| RNA(3)(2) | PCC | 1.0000 | SVM |

Column "MaxAcc" provides the maximal accuracy achieved by the three classifiers on the 19 feature subsets screened by rfeSVM. The initial subset of 1000 features was ranked by the algorithm given in the Column "FS." The column "Classifiers" provides the algorithms achieving the maximal accuracy in the column "MaxAcc." More than one classifier may achieve the same best accuracy. Best model of each dataset was illustrated in bold.

The 121 negative samples were split into two groups with 51 and 70 samples. Moreover, the two groups of positive samples had 153 and 228 members. This dataset was denoted as RNA(2). The two pairs of negative and positive groups were denoted as RNA(2)(0) and RNA(2)(1). The dataset RNA(1) was also split into three bins with equally-sized sample age ranges, which was denoted as RNA(3). The three groups of negative samples in RNA(3) had 21, 71 and 29 members, respectively, and the positive class was split into three groups with 56, 227 and 98 members. The three pairs of negative and positive groups were denoted as RNA(3)(0), RNA(3)(1) and RNA(3)(2).
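The age thresholds follow from rounding (20 + 70 × i/k) up to the next integer, and the short sketch below reproduces the bins quoted for k = 2 and k = 3. The function names are hypothetical; the last bin is treated as closed at 90, matching the intervals given in the text.

```python
import math


def age_thresholds(k, low=20, span=70):
    """Bin edges ceil(low + span * i / k) for i = 0..k."""
    return [math.ceil(low + span * i / k) for i in range(k + 1)]


def age_bin(age, thresholds):
    """Index of the bin containing age.

    Bins are half-open [t_i, t_{i+1}) except the last one, which also
    includes its upper edge.
    """
    last = len(thresholds) - 2
    for i in range(last + 1):
        if thresholds[i] <= age < thresholds[i + 1]:
            return i
    if age == thresholds[-1]:
        return last
    raise ValueError(f"age {age} is outside the binned range")


edges_two = age_thresholds(2)    # edges of [20, 55) and [55, 90]
edges_three = age_thresholds(3)  # edges of [20, 44), [44, 67) and [67, 90]
```

Applying `age_bin` to each sample within a class would yield groups such as RNA(2)(0) and RNA(2)(1), one classification dataset per age bin.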

FIGURE 3 | Comparison of early-stage breast cancer detection models with different age groups based on transcriptome data. The patients were split into (A) two groups and (B) three groups with equally-sized age ranges. The horizontal axis shows the numbers of features chosen by rfeSVM and the vertical axis shows the 10-fold cross validation accuracy of the classifier SVM. F-test was used to generate the initial subset of the 1000 top-ranked features.

FIGURE 4 | Comparison of early-stage breast cancer detection models with different age groups based on methylome data. The patients were split into (A) two groups and (B) three groups with equally-sized age ranges. The horizontal axis depicts the numbers of features chosen by rfeSVM and the vertical axis depicts the 10-fold cross-validation accuracy of the classifier SVM. F-test was used to generate the initial subset of the 1000 top-ranked features.

FIGURE 5 | Comparison of early-stage breast cancer detection models with different age groups based on both transcriptome and methylome data. Transcriptomes of the patients were split into (A) two groups and (B) three groups with equally-sized age ranges. The methylomes were split into (C) two groups and (D) three groups in the same way. The horizontal axis shows the numbers of features chosen by TriVote and the vertical axis shows the 10-fold cross-validation accuracy of the classifier SVM. F-test was used to generate the initial subset of the 1000 top-ranked features.

The 765 methylomic samples had 558 early-stage and 207 late-stage samples and were denoted as the dataset Methy(1). The two classes in Methy(1) were split into two bins with equally-sized sample age ranges, which was denoted as the dataset Methy(2). There were 93 and 114 members in the two negative groups. The sizes of the two positive groups were 222 and 336. Thus, we had two pairs of negative and positive groups, denoted as Methy(2)(0) and Methy(2)(1). The dataset Methy(3) was constructed by splitting the two classes of samples in Methy(1) into three bins with equally-sized sample age ranges. There were 31, 121 and 55 members in the three negative groups. The sizes of the three positive groups were 67, 345 and 146. The three pairs of negative and positive groups Methy(3)(0), Methy(3)(1) and Methy(3)(2) refer to the three split datasets.

The 17,814 features were first reduced to the 10,000 with the largest variance, as described in the Section "Materials and Methods."

#### An Initial Investigation of T-Test-Selected Features on Transcriptomes

The T-test was widely used to evaluate how significantly a feature was associated with the phenotype for various biomedical data types, including transcriptome (Ye et al., 2017), methylome (Aref-Eshghi et al., 2015), imaging data (Beheshti et al., 2016), etc. As described in the above Section "Materials and Methods," the top 1000 features ranked by the T-test were further screened by the three RFE algorithms, i.e., rfeSVM, rfeLasso, and rfeLR.

**Figure 2** demonstrated that the classifier GaussianNBayes did not perform very well on the features screened by rfeSVM. For the first dataset of 10 rfeSVM-screened features, GaussianNBayes (Acc = 0.7629) performed slightly worse than the other two classifiers LR (Acc = 0.7849) and SVM (Acc = 0.7769). When more features were chosen by rfeSVM, the classification performance of GaussianNBayes became even worse. It is interesting to observe that LR and SVM seemed to perform similarly well. As a result, we generated a more precise summary of how the three classifiers performed, as shown in **Table 2**. The data suggested that SVM achieved maximal accuracy in 75 cases while LR achieved the same in 39 cases. Unfortunately, GaussianNBayes did not achieve maximal accuracy at any point.

**Table 2** also suggested that GaussianNBayes outperformed the other two classifiers SVM and LR only on very few feature subsets screened by rfeSVM/rfeLasso/rfeLR. For most of the feature subsets chosen by the three RFE algorithms, the two classifiers SVM and LR performed similarly well. We further generated another summary table to demonstrate whether each of the three classifiers achieved the best accuracy across the 19 feature subsets of each dataset, as shown in **Table 3**. We may observe that the best classifier was usually SVM or LR, and sometimes these two classifiers achieved the same best accuracy. Moreover, for all six datasets, rfeSVM outperformed the other two RFE feature selection algorithms. As a result, the following sections use rfeSVM as the RFE screening choice.

FIGURE 6 | Comparison of early-stage breast cancer detection models with different age groups based on both transcriptome and methylome data. Transcriptomes of the patients were split into (A) two groups and (B) three groups with equally-sized age ranges. The methylomes were split into (C) two groups and (D) three groups in the same way. The horizontal axis shows the numbers of features chosen by TriVote and the vertical axis shows the 10-fold cross-validation accuracy of the classifier RFC. F-test was used to generate the initial subset of the 1000 top-ranked features.

#### Comparison of T-Test, F-Test, and PCC for Association Evaluation

A comparison was carried out to evaluate whether the choice of the top 1000 features was important for the binary classification problem of early-stage breast cancer detection, as shown in **Table 4**. The pair comprised of the feature selection algorithm F-test and the classifier SVM achieved the best accuracies for all six datasets. The classifier LR also achieved the same best accuracy for the three datasets RNA(2)(0), RNA(3)(0), and RNA(3)(2). Thus, the default modeling procedure in the following sections started with the top 1000 features ranked by F-test. Then, rfeSVM was utilized to find the number of features with the best accuracy calculated by the 10-fold cross-validation of the classifier SVM.

#### Age Grouping for Transcriptomes

We first split the negative and positive samples into two equally-sized groups, as shown in **Figure 3A**. The SVM models trained over RNA(2)(0) and RNA(2)(1) were much better than that on the whole dataset RNA(1). The averaged improvement in accuracy was 0.0900 for the dataset RNA(2)(0) compared to RNA(1). The model accuracy of RNA(2)(1) was also improved by 0.0654 on average. If we chose the best model of each dataset as the final result, both RNA(2)(0) and RNA(2)(1) were improved by at least 0.0544 in accuracy compared against RNA(1). The best model of RNA(1) used 100 features to achieve 0.9422 in accuracy, while only 40 features were needed for both RNA(2)(0) and RNA(2)(1) to outperform this model.

Similar results were observed when RNA(1) was split into three equally-sized groups of samples, as shown in **Figure 3B**. The average improvements in accuracy were 0.1078, 0.0673, and 0.1086 for the three datasets RNA(3)(0), RNA(3)(1), and RNA(3)(2), respectively. An improvement of at least 0.0578 in accuracy over the best model of RNA(1) was achieved for all three datasets. Only 50 features were required for the three datasets RNA(3)(0), RNA(3)(1), and RNA(3)(2) to outperform the complete dataset RNA(1) (0.9422 in accuracy with 100 features).
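The age grouping itself is simple to express in code. The sketch below assumes equal-width age ranges (as in the figure captions) and uses hypothetical ages; the function name and the reading of RNA(k)(i) as "group i of a k-way split" are illustrative assumptions rather than details confirmed by the text.

```python
def split_by_age_range(ages, n_groups):
    """Return n_groups lists of sample indices, binned by equal-width age ranges."""
    lo, hi = min(ages), max(ages)
    width = (hi - lo) / n_groups
    groups = [[] for _ in range(n_groups)]
    for idx, age in enumerate(ages):
        if width == 0:
            g = 0  # all samples share one age; keep them together
        else:
            # The oldest sample falls exactly on the upper edge, so clamp it
            # into the last group.
            g = min(int((age - lo) / width), n_groups - 1)
        groups[g].append(idx)
    return groups


ages = [30, 40, 50, 60, 70, 90]           # hypothetical patient ages
two_groups = split_by_age_range(ages, 2)   # e.g. RNA(2)(0), RNA(2)(1)
three_groups = split_by_age_range(ages, 3) # e.g. RNA(3)(0), RNA(3)(1), RNA(3)(2)
print(two_groups, three_groups)
```

Each index list would then select the rows of the expression or methylation matrix used to train one age-specific model.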

#### Age Grouping for Methylomes

The same default classification procedure on the datasets with smaller age groups outperformed that on the complete dataset Methy(1), as shown in **Figure 4**. An improvement of at least 0.0524 in accuracy over the complete dataset Methy(1) was achieved when the dataset was split into two groups with equally-sized age ranges. The best model for Methy(1) achieved 0.8745 in accuracy with 100 features, while the classifier SVM achieved 0.9910 and 0.9353 in accuracy for the two datasets with smaller age groups, i.e., Methy(2)(0) and Methy(2)(1). Even larger improvements were achieved when the dataset was split into three age groups. The classifier SVM achieved accuracies of 1.0000, 0.9958, and 1.0000 for the three smaller datasets Methy(3)(0), Methy(3)(1), and Methy(3)(2), respectively. Only 40 features were needed for these three datasets to outperform the complete dataset Methy(1).

FIGURE 7 | Comparison of early-stage breast cancer detection models with different age groups based on both transcriptome and methylome data. Transcriptomes of the patients were split into (A) two groups and (B) three groups with equally-sized age ranges. The methylomes were split into (C) two groups and (D) three groups in the same way. The horizontal axis shows the numbers of features chosen by TriVote and the vertical axis shows the 10-fold cross-validation accuracy of the classifier XGB. F-test was used to generate the initial subset of the 1000 top-ranked features.

FIGURE 8 | Comparison of early-stage breast cancer detection models with different age groups based on both transcriptome and methylome data using the classifier SVM on the training datasets. Transcriptomes of the patients were split into (A) two groups and (B) three groups with equally-sized age ranges. The methylomes were split into (C) two groups and (D) three groups in the same way. The horizontal axis shows the numbers of features chosen by rfeSVM and the vertical axis shows the 10-fold cross-validation accuracy of the classifier SVM. F-test was used to generate the initial subset of 1000 top-ranked features.

FIGURE 9 | Comparison of early-stage breast cancer detection models with different age groups based on both transcriptome and methylome data using the classifier SVM on the independent test datasets. Transcriptomes of the patients were split into (A) two groups and (B) three groups with equally-sized age ranges. The methylomes were split into (C) two groups and (D) three groups in the same way. The horizontal axis shows the numbers of features chosen by rfeSVM, and the vertical axis shows the accuracy of the classifier SVM on the independent test dataset. F-test was used to generate the initial subset of 1000 top-ranked features.

## TriVote Selected Features for Both Transcriptomes and Methylomes

A comparison between different age groups was also conducted using a recently published feature selection algorithm, TriVote (Xu et al., 2018), as shown in **Figure 5**. The features selected by TriVote achieved very good accuracies on both transcriptomes and methylomes with SVM, the best classifier identified above. We observed a similar pattern: a biomedical classification problem may be improved simply by splitting the samples into multiple groups with equally-sized age ranges. The best model on the dataset RNA(1) achieved an accuracy of 0.9223 with 95 features, as shown in **Figure 5**, while the two smaller groups RNA(2)(0) and RNA(2)(1) achieved their best accuracies, 0.9412 and 0.9664, with only 35 and 65 features, respectively. Moreover, the best models of both datasets outperformed the best model of RNA(1) by at least 0.0508 in accuracy. An average improvement of 0.0676 was achieved by merely splitting the dataset RNA(1) into three smaller groups with equally-sized age ranges.

Similar patterns were also observed for the methylomes on the TriVote-selected feature subsets, as shown in **Figures 5C,D**. TriVote achieved average accuracy improvements of 0.0607 and 0.0965 for the cases of two and three groups with equally-sized age ranges, respectively.

We further evaluated our hypothesis using two more classifiers, Random Forest Classifier (RFC) (Pal, 2005; Gislason et al., 2006) (**Figure 6**) and XGBoost (XGB) (Chen and Guestrin, 2016) (**Figure 7**). A similar pattern was observed, but RFC achieved weaker improvements in Acc, as shown in **Figure 6**, and did not achieve an Acc higher than 0.8500. Even weaker improvements in Acc were achieved by the age-specific models trained with the classifier XGB, as shown in **Figure 7**. For example, the age-specific XGB models achieved Acc improvements of only 0.0123 and 0.0294.

## SVM Models on the Independent Test Datasets Using the Features Selected by F-Test and rfeSVM

This section evaluates the best-performing algorithms on independent test sets. Features selected by F-test and rfeSVM tended to achieve the best performances, as demonstrated in **Figures 3–5**. **Table 4** suggests that the classifier SVM usually achieved the best classification accuracies. A stratified splitting strategy was used to hold out 10% of the samples as an independent test dataset, which was used to evaluate the model trained on the remaining samples. The classification performance was then calculated iteratively over the next 10% of samples, so that every sample was tested.
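The iterated 10% hold-out can be pictured as dealing each class's samples round-robin into ten folds, then using each fold once as the independent test set. The following is a dependency-free sketch under that assumption; the helper names and the toy label vector are hypothetical, and the model fitting step is left out.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin across the folds.
        for j, idx in enumerate(members):
            folds[(offset + j) % n_folds].append(idx)
        offset += len(members)
    return folds


def iterate_holdouts(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices); every sample is tested exactly once."""
    for test_idx in stratified_folds(labels, n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        yield train_idx, sorted(test_idx)


labels = [0] * 60 + [1] * 40   # hypothetical class labels
splits = list(iterate_holdouts(labels))
tested = sorted(i for _, test in splits for i in test)
print(len(splits), tested == list(range(len(labels))))
```

In a full run, feature selection and SVM training would use only `train_idx`, and accuracy would be computed on the matching `test_idx`.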

**Figure 8** demonstrates that the age-specific models outperformed the age-independent models for both transcriptomes and methylomes on the training datasets, and **Figure 9** suggests that a similar relationship between the age-independent and age-specific models held on the independent test datasets.

## Comparison of Age-Independent and Age-Specific Models on the Head-Neck Squamous Cell Carcinoma (HNSC) Samples

We further analyzed the TCGA-HNSC (Head-Neck Squamous Cell Carcinoma) dataset to test whether the age-specific models outperformed the age-independent ones, as shown in **Figure 10**. The analysis procedure with the best performance was used for the TCGA-HNSC dataset, i.e., the SVM classifier on features selected by the F-test + rfeSVM combination.

The age-independent models, shown as solid lines in **Figure 10**, achieved very good accuracies (Acc = 0.9223 for the transcriptome and Acc = 0.8758 for the methylome). However, an improvement of at least 0.05 in Acc may be achieved by building two age-specific transcriptome models, as in **Figure 10A**. An average improvement of 0.0676 in Acc may be achieved if the transcriptome dataset is split into three age groups, as in **Figure 10B**. The classification accuracy of the age-independent methylome model may be improved by 0.0607 and 0.0965 on average for the two-group and three-group age-specific models, respectively (**Figures 10C,D**).

# CONCLUSION

This study carried out a series of extensive modeling experiments and demonstrated that age was an essential factor in selecting biomarkers. A biomarker-based disease diagnosis model may be improved by simply splitting the samples into multiple groups with smaller age ranges. SVM achieved the largest Acc improvements compared with the other classification algorithms. It should be further investigated how age could be directly integrated into the biomarker selection and diagnosis modeling.

We also tried to investigate a discrimination model between cancer and control samples. Unfortunately, only 1 transcriptome control sample and 6 methylome control samples contained both stage and age data. These sample numbers were much smaller than those of the cancer samples. We regret that we did not find a dataset with which to compare cancer and normal samples using our proposed age-specific models.

## DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here: https://portal.gdc.cancer.gov/projects/TCGA-BRCA.

## AUTHOR CONTRIBUTIONS

FZ and XF conceived the project and designed the experiments. XF, JL, HL, HC, FL, and QL wrote the code and conducted the experiments. XF, JL, FL, and QL generated the experimental results and drafted the discussions. FZ and Z-HY discussed the experimental design and polished the manuscript. FZ and XF drafted and polished the manuscript.

# FUNDING

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040400), Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC), the Education Department of Jilin Province (JJKH20180145KJ), and the startup grant of the Jilin University. This work was also partially supported by the BioKnow MedAI Institute (BMCPP-2018-001). This work was supported in part by the NSFC Excellent Young Scholars Program (61722212), the Pioneer Hundred Talents Program of Chinese Academy of Sciences, and the National Natural Science Foundation of China under Grant 61572506, 61702444, and 61732012.

#### ACKNOWLEDGMENTS

The constructive comments from the editor and the three reviewers were greatly appreciated.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Feng, Li, Li, Chen, Li, Liu, You and Zhou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification of Target Genes at Juvenile Idiopathic Arthritis GWAS Loci in Human Neutrophils

Junyi Li<sup>1</sup> , Xiucheng Yuan<sup>1</sup> , Michael E. March<sup>2</sup> , Xueming Yao<sup>1</sup> , Yan Sun<sup>1</sup> , Xiao Chang<sup>2</sup> , Hakon Hakonarson<sup>2,3,4</sup> , Qianghua Xia<sup>1</sup> , Xinyi Meng<sup>1</sup> \* and Jin Li<sup>1</sup> \*

<sup>1</sup> Department of Cell Biology, 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, China, <sup>2</sup> Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, United States, <sup>3</sup> Division of Human Genetics, Children's Hospital of Philadelphia, Philadelphia, PA, United States, <sup>4</sup> Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Abhay Sharma, Institute of Genomics and Integrative Biology (CSIR), India Haitao Zhang, National Institutes of Health (NIH), United States Jiangong Niu, University of Texas MD Anderson Cancer Center, United States

#### \*Correspondence:

Xinyi Meng mengxy@tmu.edu.cn Jin Li jli01@tmu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 15 October 2018 Accepted: 19 February 2019 Published: 27 March 2019

#### Citation:

Li J, Yuan X, March ME, Yao X, Sun Y, Chang X, Hakonarson H, Xia Q, Meng X and Li J (2019) Identification of Target Genes at Juvenile Idiopathic Arthritis GWAS Loci in Human Neutrophils. Front. Genet. 10:181. doi: 10.3389/fgene.2019.00181

Juvenile idiopathic arthritis (JIA) is the most common chronic rheumatic disease among children and can cause severe disability. Genomic studies have discovered a substantial number of risk loci for JIA; however, the mechanism by which these loci affect JIA development is not fully understood. Neutrophils are an important cell type involved in autoimmune diseases. To better understand the biological function of genetic loci in neutrophils during JIA development, we took an integrated multi-omics approach to identify target genes at JIA risk loci in neutrophils and constructed a protein-protein interaction network via a machine learning approach. We identified genes likely to be targets of JIA risk loci in neutrophils, which could contribute to JIA development.

Keywords: juvenile idiopathic arthritis, target gene identification, epigenetic regulation, protein-protein interaction, pathway enrichment

#### INTRODUCTION

Juvenile idiopathic arthritis (JIA) is the most common chronic rheumatic disease in childhood, with a prevalence of 1 in 1000, and is a common cause of disability among children (Oen and Cheang, 1996). The typical clinical manifestation of JIA is joint enlargement of unknown origin lasting more than 6 weeks in children under 16 years old (Petty et al., 2004). JIA has long been considered a type of autoimmune disease; however, its etiology is still not fully understood. Similar to other complex diseases, genetic and environmental factors both contribute to its pathogenesis (Glass and Giannini, 1999). Substantial evidence suggests a large contribution of genetic components. Previous studies showed that monozygotic twin concordance rates for JIA are between 25 and 40%, much higher than the population prevalence rate (Savolainen et al., 2000). Affected sibling studies showed that siblings of JIA probands had an over 10-fold increased risk of developing the disease (Frisell et al., 2016). Our recent heritability study based on SNP-h2 estimated the heritability of JIA at 0.73, placing it among the most highly heritable pediatric autoimmune diseases (Li et al., 2015b). Several genome-wide association studies (GWAS) have been carried out and have discovered a number of JIA susceptibility loci, but how these loci affect the pathogenesis and development of JIA remains to be explored (Behrens et al., 2008; Hinks et al., 2009, 2013; Thompson et al., 2012; Cobb et al., 2014; Aydin-Son et al., 2015; Li et al., 2015a; Finkel et al., 2016; Ombrello et al., 2017; Haasnoot et al., 2018).

Neutrophils are among the most important innate immune cells in the human body. When infection or inflammation occurs, they are recruited to the disease site by chemokines. In recent years, studies have found that neutrophils can secrete a variety of cytokines and thereby play a key role in immunomodulation. The clinical manifestations of JIA are highly similar to those of classical autoinflammatory diseases. The large accumulation of white blood cells is one of the causes of local tissue damage and loss of joint function due to the inflammatory response at the joint (Fattori et al., 2016). Neutrophils likely play an important role in the effector phase of autoimmune diseases including JIA, and their action can cause or exacerbate articular inflammation (Németh and Mócsai, 2012). Neutrophil extracellular traps (NETs) are a recently discovered mechanism by which neutrophils fight infection, and have been demonstrated to play a role in the pathogenesis of systemic immune diseases such as systemic lupus erythematosus (SLE) (Hakkim et al., 2010), antineutrophil cytoplasmic antibodies (ANCA)-associated systemic vasculitis (Kessenbrock et al., 2009) and multiple sclerosis (Naegele et al., 2012). However, little is known about the genes involved in JIA development in neutrophils.

A number of JIA loci have been identified in GWAS (Behrens et al., 2008; Hinks et al., 2009, 2013; Thompson et al., 2012; Cobb et al., 2014; Aydin-Son et al., 2015; Li et al., 2015a; Finkel et al., 2016; Ombrello et al., 2017; Haasnoot et al., 2018), but few have been functionally characterized, as most of the GWAS SNPs are located in intronic or intergenic regions and do not directly affect the sequence of any protein product. We hypothesize that they may function as cis-regulatory elements that regulate target gene expression. Therefore, we focused on understanding the target genes of JIA GWAS loci in neutrophils.

JIA is a heterogeneous group of diseases including several different subtypes. In recent years, progress in disease management has greatly improved its prognosis, but there are still few effective treatments. Our study took an in silico analysis approach, utilizing genomic, transcriptome, epigenome, and methylome data to find genes targeted by JIA risk loci in neutrophils, which may facilitate the design of precision strategies for JIA prevention and treatment.

# MATERIALS AND METHODS

#### Extraction of JIA GWAS Loci

Juvenile idiopathic arthritis loci identified in previous GWAS were retrieved from the GWAS Catalog (MacArthur et al., 2017) by searching with the keyword "Juvenile idiopathic arthritis." All loci found were downloaded without imposing any further significance threshold.

#### eQTL Analysis

eQTL analysis was performed via the Genotype-Tissue Expression (GTEx) Project website (Lonsdale et al., 2013), from which the correlation between each SNP genotype and gene expression level in whole blood was extracted. We set the significance threshold at P-value < 0.05. The boxplots for the SNP-gene pairs were reviewed via the GTEx Portal.

# Analysis of Microarray Data

Series matrix files of the microarray datasets GSE11083 (Frank et al., 2009a,b) and GSE67596 (Jiang et al., 2015), containing transcriptome data from neutrophils of 36 JIA patients and 26 healthy controls, were downloaded from the NCBI Gene Expression Omnibus (GEO) (Edgar et al., 2002; Barrett et al., 2013). Gene expression levels were compared between the JIA patient group and the control group. Expression values across studies were summarized through median polish, normalization was performed using the Robust Multi-array Average (RMA) algorithm, which minimizes variance across arrays, and log transformation was conducted for variance stabilization (Irizarry et al., 2003). Meta-analysis was performed using the RankProd package (Hong et al., 2006) in R 3.5.1 (R Core Team, 2018). Differentially expressed genes were selected using the thresholds of percentage of false positives (PFP) < 0.05 and absolute value of fold change (FC) > 1.2.

# Histone Modification Analysis

The SNPs of interest were entered into the web portal HaploReg<sup>1</sup> (Ward and Kellis, 2012), and their overlap with histone modification regions in the neutrophil sample E030 BLD.CD15.PC (primary neutrophils from peripheral blood) was evaluated using epigenome data from the ROADMAP Epigenomics database (Kundaje et al., 2015).

# Methylation Data Analysis

The methylation data were extracted from the genome-wide methylation profiles of 843 subjects processed on the Infinium HumanMethylation450 BeadChip at the Center for Applied Genomics, the Children's Hospital of Philadelphia, as described in a previous publication (Van Ingen et al., 2016). M-values represent the log2 ratio between the methylated and unmethylated intensities of each probe on the chip. The association between JIA SNP genotype and methylation probes in each of the 11 genes was assessed in a linear regression model conditioned on gender, age and 10 genotype-derived principal components.
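The two steps above, computing M-values and regressing them on genotype with covariates, can be sketched with NumPy. This is an illustration on simulated data, not the study's pipeline: the small `offset` added to both intensities is a common stabilizing convention assumed here rather than stated in the text, genotype is coded as 0/1/2 allele counts by assumption, and only the effect estimate is computed (no P-value).

```python
import numpy as np


def m_values(meth, unmeth, offset=1.0):
    """log2 ratio of methylated to unmethylated intensity (offset is an assumption)."""
    meth = np.asarray(meth, dtype=float)
    unmeth = np.asarray(unmeth, dtype=float)
    return np.log2((meth + offset) / (unmeth + offset))


def genotype_effect(m, genotype, covariates):
    """Ordinary least squares slope for genotype, adjusted for the covariates."""
    n = len(m)
    design = np.column_stack([np.ones(n), genotype, covariates])
    coef, *_ = np.linalg.lstsq(design, m, rcond=None)
    return coef[1]  # column 0 is the intercept, column 1 the genotype


# Simulated example: 200 subjects, 12 covariates (sex, age, 10 principal components).
rng = np.random.default_rng(0)
n = 200
genotype = rng.integers(0, 3, size=n)
covariates = rng.normal(size=(n, 12))
m = 0.5 * genotype + covariates @ rng.normal(size=12) + 0.01 * rng.normal(size=n)

beta = genotype_effect(m, genotype, covariates)
print(round(float(beta), 3))
```

In practice a statistics package would be used to obtain the standard error and P-value for each SNP-probe pair.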

## Construction of Protein-Protein Interaction (PPI) Network

A protein-protein interaction (PPI) network of the 11 target genes was constructed via NetworkAnalyst<sup>2</sup> (Xia et al., 2014), which is based on the integration of machine learning and Walktrap algorithms (Pons and Latapy, 2005). The source of protein-protein interaction data was the IMEx Interactome database (Orchard et al., 2012). The hypergeometric test for gene set enrichment analysis was implemented in NetworkAnalyst, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa et al., 2017) was used as the pathway database resource. In addition to the FDR P-value calculated from the hypergeometric test with multiple-testing adjustment, an empirical P-value for each pathway was derived from permutation analysis. A list of 11 genes was randomly generated from the human genome, and this resampling was performed 100 times. For each randomly drawn 11-gene list, network construction and pathway analysis were performed in the same way as for the JIA target genes in neutrophils, yielding a list of significantly enriched pathways with FDR < 0.05 from each resampling. For each enriched pathway in the PPI network of JIA target genes, its empirical P-value was derived from the number of times it appeared as a significantly enriched pathway across the 100 permutations.

<sup>1</sup>https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php <sup>2</sup>https://www.networkanalyst.ca

#### Hi-C Data Visualization


Hi-C data visualization for the JIA loci and target genes was carried out via the 3D Genome Browser<sup>3</sup> (Wang et al., 2018) and FUMA GWAS<sup>4</sup> (Watanabe et al., 2017). Hi-C data from the cell line K562 (Rao et al., 2014; Schmitt et al., 2016) were used.

#### RESULTS

A large number of GWAS loci have been identified for human complex diseases, including JIA. We extracted all 127 genomic regions that have been reported to be associated with JIA from the GWAS Catalog (Behrens et al., 2008; Hinks et al., 2009, 2013; Thompson et al., 2012; Cobb et al., 2014; Aydin-Son et al., 2015; Li et al., 2015a; Finkel et al., 2016; MacArthur et al., 2017; Ombrello et al., 2017; Haasnoot et al., 2018). All these SNPs are located outside of gene exons and may contribute to disease etiology by affecting gene expression. We then queried the GTEx database (Lonsdale et al., 2013) with these SNPs to identify genes that they regulate. Because GWAS SNPs and their target genes may not always exhibit highly significant correlation in eQTL analysis, as exemplified by the obesity SNP rs9930506 and the IRX3 gene (Smemo et al., 2014), we set the significance threshold at a nominal P-value < 0.05. We found that the expression levels of 238 genes correlate with JIA SNP genotype in whole blood.

As we are particularly interested in identifying genes regulated by JIA GWAS loci in neutrophils, we examined which of these 238 genes showed differential expression in neutrophils between JIA cases and controls. We extracted two microarray datasets from the Gene Expression Omnibus (GEO) database, GSE11083 (Frank et al., 2009a,b) and GSE67596 (Jiang et al., 2015). Gene expression data from a total of 36 JIA cases and 26 controls were meta-analyzed. Among the 238 eQTL genes for JIA SNPs, only 11 genes showed significant differential expression, including 5 up-regulated and 6 down-regulated genes (**Table 1**). Our in silico analysis suggested that these genes may function as JIA loci targeted genes in neutrophils. Among the 13 pairs of JIA SNPs and target genes, only SNP rs79893749 is located in an intron of its target gene CCR3; all the other SNPs are located outside of the transcript region of their target genes. Hi-C data provide additional supporting evidence for plausible chromatin interactions between some JIA SNPs and their target genes (**Supplementary Figures 1**, **2**), with the caveat that these data came from the chronic myelogenous leukemia cell line K562 (Rao et al., 2014; Schmitt et al., 2016). Experiments using neutrophils would be necessary to further explore their possible interactions.

Li et al. JIA Loci Targeted Genes in Neutrophils

TABLE 1 | Summary of JIA GWAS loci targeted genes in neutrophils.

<sup>3</sup>promoter.bx.psu.edu/hi-c/

<sup>4</sup>http://fuma.ctglab.nl/

To understand how these genes coordinately contribute to JIA development, we constructed a PPI network among the proteins encoded by these genes and their direct interactors (**Figure 1**) using NetworkAnalyst, which integrates statistical analyses and machine learning for interactive PPI network visualization. We further conducted pathway analysis and found several signaling pathways significantly enriched among proteins in this network, including the neurotrophin signaling pathway, cardiac muscle contraction, cell cycle and hypertrophic cardiomyopathy (HCM) (**Table 2**). To test the cell type specificity of the 11 target genes and enriched pathways, we repeated the whole process using microarray gene expression data from PBMC samples of the same GEO datasets. We found that among the 11 target genes in neutrophils, one gene (TRIM58) was shared with PBMC (**Table 1** and **Supplementary Table 1**). No enriched pathway was shared between neutrophils and PBMC (**Figure 1**, **Table 2**, **Supplementary Figure 3**, and **Supplementary Table 2**). To further determine the specificity of the 4 enriched pathways among JIA target genes in neutrophils, we checked the distribution of significantly enriched pathways from 100 randomly generated 11-gene lists and derived the empirical P-value for each of the 4 pathways of interest (**Table 2**). The pathways of cardiac muscle contraction and hypertrophic cardiomyopathy had empirical P-values < 0.01. Based on these two control gene set analyses, we demonstrated the specificity of these target genes and pathways in neutrophils, which serves the initial screening purpose for further functional validation.
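The empirical P-value described here is the fraction of random 11-gene lists for which a pathway comes out as significantly enriched. A minimal sketch follows; the gene universe, the helper names and the per-permutation enrichment results are hypothetical placeholders, since in the study each random list went through network construction and pathway analysis in NetworkAnalyst.

```python
import random


def draw_random_gene_lists(gene_universe, list_size=11, n_permutations=100, seed=0):
    """Draw n_permutations random gene lists without replacement within each list."""
    rng = random.Random(seed)
    return [rng.sample(gene_universe, list_size) for _ in range(n_permutations)]


def empirical_p_value(pathway, enriched_per_permutation):
    """Fraction of permutations in which the pathway was significantly enriched."""
    hits = sum(1 for enriched in enriched_per_permutation if pathway in enriched)
    return hits / len(enriched_per_permutation)


# Hypothetical gene universe; a real analysis would use annotated human genes.
gene_universe = [f"GENE{i}" for i in range(1000)]
random_lists = draw_random_gene_lists(gene_universe)

# Placeholder enrichment results (toy values, not taken from the paper):
# one set of pathways with FDR < 0.05 per random list.
enriched_per_permutation = [{"Cell cycle"} if i < 30 else set() for i in range(100)]

p_cell_cycle = empirical_p_value("Cell cycle", enriched_per_permutation)
p_cardiac = empirical_p_value("Cardiac muscle contraction", enriched_per_permutation)
print(p_cell_cycle, p_cardiac)
```

With 100 permutations, an empirical P-value below 0.01 under this definition means the pathway was never enriched in any random list.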

Next, we investigated how JIA loci may regulate the expression of their targeted genes. To address this question, we examined the ROADMAP database (Kundaje et al., 2015) through HaploReg (Ward and Kellis, 2012). We found that rs79893749 and rs149850873 overlap with histone marks in the neutrophil sample E030 BLD.CD15.PC, suggesting that these loci may regulate their targeted gene expression through histone modifications in the promoter or enhancer region. We also looked into the potential mechanism of DNA methylation. In our methylation analysis, we tested the 10 JIA SNPs against their corresponding one or two genes, each of which contains ∼11 methylation probes on average. A total of 144 SNP-methylation-probe pairs were tested; thus, the multiple-testing adjusted P-value cutoff was set at 3.5 × 10<sup>−4</sup>. Four SNP-methylation-probe pairs reached this experiment-wide significance threshold, suggesting that these JIA SNPs may regulate the expression of their target genes through DNA methylation (**Table 1**).
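The reported cutoff is consistent with a Bonferroni-style adjustment. As a worked check, assuming a family-wise significance level of 0.05 (the level is not stated in the text) divided by the 144 tests:

$$
\alpha_{\text{adj}} = \frac{0.05}{144} \approx 3.47 \times 10^{-4} \approx 3.5 \times 10^{-4}
$$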

#### DISCUSSION

In this study, we conducted data mining in existing datasets to gain a better understanding of the molecular mechanism of JIA GWAS loci. By eQTL and transcriptome analyses, we identified 11 genes that may be JIA loci target genes in neutrophils. We further built a PPI network and found pathways enriched among the target genes and their interactors. We also found that multiple JIA GWAS SNPs overlap with histone marks and/or correlate with methylation levels in their target genes.

We did not observe extensive overlap between JIA eQTL genes in whole blood and genes differentially expressed in neutrophils. This possibly resulted from the small sample size of our microarray datasets, which did not have enough power to detect certain differentially expressed genes. In addition, JIA eQTL genes may be expressed in cell types other than neutrophils, the cell type we are particularly interested in.

Total = the total number of genes in each pathway; Expected = the expected number of genes in each pathway given the number of JIA target genes; Hits = the actual number of JIA target genes falling into each pathway; P-value = P-value of each pathway in enrichment test; FDR = false discovery rate of each pathway; <sup>∗</sup>empirical P-value < 0.01 based on permutation analysis.
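The Total, Expected, Hits and P-value columns defined in the Table 2 legend correspond to a standard hypergeometric (over-representation) test. The sketch below shows how such values are commonly computed; the function name and the toy numbers are illustrative assumptions, and NetworkAnalyst's exact gene universe and implementation may differ.

```python
from math import comb


def hypergeom_enrichment(universe, pathway_size, list_size, hits):
    """Return (expected hits, P(X >= hits)) for a hypergeometric enrichment test.

    universe     -- number of genes that could have been drawn
    pathway_size -- genes annotated to the pathway ("Total")
    list_size    -- genes in the query list
    hits         -- query genes that fall in the pathway ("Hits")
    """
    expected = list_size * pathway_size / universe
    total_ways = comb(universe, list_size)
    # Sum the upper tail: probability of seeing at least `hits` pathway genes.
    tail_ways = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, list_size - k)
        for k in range(hits, min(pathway_size, list_size) + 1)
    )
    return expected, tail_ways / total_ways


# Toy numbers (not from the paper): 20 genes overall, 5 in the pathway,
# a query list of 4 genes of which 2 fall in the pathway.
expected, p_value = hypergeom_enrichment(universe=20, pathway_size=5,
                                         list_size=4, hits=2)
print(expected, round(p_value, 4))
```

The FDR column would then come from adjusting these P-values across all tested pathways, for example with the Benjamini-Hochberg procedure.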

Several of the target genes we identified are highly related to the immune system, such as CCR3, ELL2, and HLA-DPA1. Others play a role in cell proliferation, carcinogenesis and/or other biological functions. The human leukocyte antigen (HLA) gene complex encodes the human major histocompatibility complex (MHC), a group of cell-surface proteins that play important roles in the regulation of the human immune system. HLA genes have been reported to be associated with autoimmune diseases (Sollid, 2017; Kawabata et al., 2018), including rheumatoid arthritis (Onuora, 2015; Smolen et al., 2018) and JIA (Smerdel et al., 2002). The HLA-DPA1 locus has been particularly linked to ankylosing spondylitis, a type of chronic inflammatory rheumatic disease (Diaz-Pena et al., 2011). As expected, HLA genes were also found as target genes in PBMC, suggesting that they contribute to the pathogenesis of JIA in diverse immune cell types. The CCR3 gene encodes a member of the G protein-coupled receptor family that responds to C-C type chemokines. A SNP in the CCR3-CCR5 region has been linked to a family history of autoimmune disease among children with type I diabetes (Parkkola et al., 2017). It has been reported that CCR3 expression was increased under rheumatoid arthritis conditions, and that it mediated eotaxin-1 induced matrix metalloproteinase (MMP)-9 upregulation in fibroblast-like synoviocytes, which may further result in articular damage (Liu et al., 2017). Previous studies have also demonstrated that CCR3 is highly expressed in infiltrated synovial neutrophils of rheumatoid arthritis patients (Hartl et al., 2008). The ELL2 gene encodes Elongation Factor for RNA Polymerase II 2, a component of the super-elongation complex. It functions in immune regulation by affecting IgH alternative processing, Ig secretion and plasma cell differentiation. A missense mutation in the ELL2 gene affects IgA and IgG levels and is associated with multiple myeloma (Swaminathan et al., 2015). A study has shown that ELL2 is expressed in mature neutrophils and that its expression is elevated in response to inflammatory stimuli (Zhang et al., 2004). Our results suggest that these genes may also play a role in neutrophils, mediating the effect of JIA risk loci during JIA pathogenesis, which should be further investigated by experimental approaches.

The pathways of cardiac muscle contraction and hypertrophic cardiomyopathy are significantly and specifically enriched in the PPI network of JIA target genes and their interactors in neutrophils. Multiple studies have reported that patients with rheumatoid arthritis have a higher incidence of and mortality from cardiovascular disease (Maradit-Kremers et al., 2005; Voskuyl, 2006; Avina-Zubieta et al., 2008; Georgiadis et al., 2008). Cardiac involvement has similarly been found in JIA patients (Svantesson et al., 1983; Hull, 1988). However, whether JIA increases the long-term risk of cardiovascular disease is still uncertain (Coulson et al., 2013). Our results suggest that JIA and cardiovascular disease may share common underlying molecular mechanisms.

High-throughput omics technology provides a wealth of experimental data for disease gene discovery. Multi-omics studies of the interplay between genes, RNA, proteins and small molecules reveal new directions for research on complex diseases (Bersanelli et al., 2016; Bock et al., 2016). Integrating data from different omics layers via different analytical approaches facilitates the prioritization of genes for efficient functional studies and contributes to the understanding of disease etiology.

#### DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the **Supplementary Files**.

#### AUTHOR CONTRIBUTIONS

JL, XM, and QX conceived and designed the study. JYL, XCY, XMY, YS, and QX performed the analysis. MM, XC, and HH provided the methylation analysis data. JYL wrote the manuscript. JL and XM reviewed and edited the manuscript critically. All authors read and approved the manuscript.

#### FUNDING

This study was supported by National Natural Science Foundation of China (81771769); Tianjin Natural Science Foundation (18JCYBJC42700); Startup Funding from Tianjin Medical University and the Thousand Youth Talents Plan of Tianjin.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00181/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Li, Yuan, March, Yao, Sun, Chang, Hakonarson, Xia, Meng and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Machine Learning SNP Based Prediction for Precision Medicine

Daniel Sik Wai Ho<sup>1</sup> , William Schierding<sup>1</sup> , Melissa Wake<sup>2</sup> , Richard Saffery<sup>2</sup> and Justin O'Sullivan<sup>1</sup> \*

<sup>1</sup> Liggins Institute, University of Auckland, Auckland, New Zealand, <sup>2</sup> Murdoch Children Research Institute, Melbourne, VIC, Australia

In the past decade, precision genomics based medicine has emerged to provide tailored and effective healthcare for patients depending upon their genetic features. Genome Wide Association Studies have also identified population based risk genetic variants for common and complex diseases. In order to meet the full promise of precision medicine, research is attempting to leverage our increasing genomic understanding and further develop personalized medical healthcare through ever more accurate disease risk prediction models. Polygenic risk scoring and machine learning are two primary approaches for disease risk prediction. Despite recent improvements, the results of polygenic risk scoring remain limited due to the approaches that are currently used. By contrast, machine learning algorithms have increased predictive abilities for complex disease risk. This increase in predictive abilities results from the ability of machine learning algorithms to handle multi-dimensional data. Here, we provide an overview of polygenic risk scoring and machine learning in complex disease risk prediction. We highlight recent machine learning application developments and describe how machine learning approaches can lead to improved complex disease prediction, which will help to incorporate genetic features into future personalized healthcare. Finally, we discuss how the future application of machine learning prediction models might help manage complex disease by providing tissue-specific targets for customized, preventive interventions.

Keywords: machine learning, polygenic risk score, precision medicine, genetic disease risk prediction, personalized medicine, complex disease risk

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Deepak Singla, Punjab Agricultural University, India; Leyi Wei, Tianjin University, China

\*Correspondence: Justin O'Sullivan justin.osullivan@auckland.ac.nz

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 15 October 2018; Accepted: 11 March 2019; Published: 27 March 2019

#### Citation:

Ho DSW, Schierding W, Wake M, Saffery R and O'Sullivan J (2019) Machine Learning SNP Based Prediction for Precision Medicine. Front. Genet. 10:267. doi: 10.3389/fgene.2019.00267

#### PRECISION MEDICINE

Since the completion of the Human Genome Project, DNA sequencing technologies have been advancing rapidly (Laksman and Detsky, 2011; Johnson, 2017). These advances have been most notable in terms of a dramatic decrease in the cost per base pair sequenced (Schuster, 2008). This has led to an exponential increase in the abundance of individual-specific genotype data and other forms of human biological "omics" information (Laksman and Detsky, 2011; Spiegel and Hawkins, 2012). As a result of these technological developments, the concept of precision medicine, or personalized medicine, has undergone a world-wide upsurge in support as a way of transforming disease prediction, prognosis, and individual participation in preventative strategies (Laksman and Detsky, 2011; Johnson, 2017).

The objective of precision medicine is to deliver tailored medical treatments for patients according to their genetic characteristics. This primarily involves customizing proactive and preventive care to maximize medical efficacy and cost-effectiveness (Laksman and Detsky, 2011). Personalization is achieved by integrating and utilizing various types of omics information to generate and understand disease risks (Laksman and Detsky, 2011; Spiegel and Hawkins, 2012; Redekop and Mladsi, 2013). The application of precision medicine to pharmacogenomics has allowed for customized drug and dosage use with considerable success. For example, genetic information is regularly incorporated into treatment strategies such as trastuzumab for HER2-positive breast cancers, erlotinib for EGFR-overexpressing lung cancers, or imatinib for Philadelphia chromosome-positive chronic myelogenous leukaemias (Salari et al., 2012; Wald and Morris, 2012). However, in the context of population health, it is hotly debated whether precision genomics is yet at a point where it offers cost benefits over and above fully implemented standard public health approaches.

#### GENOME-WIDE ASSOCIATION STUDIES

There are millions of single nucleotide polymorphisms (SNPs, also known as genetic variants) in each human genome (Auton et al., 2015). Genome-wide association (GWA) studies identify SNPs that mark genomic regions that are strongly associated with phenotypes in a population (Visscher et al., 2012). These genomic regions must contain the variant that is causally associated with the phenotype; however, it does not follow that the SNP identified by the GWA study is itself causal. Notably, many common and complex diseases [e.g., type 2 diabetes (T2D) and obesity] are influenced by multiple SNPs, each with small per-SNP effect sizes (Visscher et al., 2017). Of note, the majority of these SNPs are located in non-coding regions and thus must be indirectly involved in their disease association, likely through tissue-specific regulatory activities (Visscher et al., 2017; Schierding et al., 2018). New methods to understand these regulatory activities include the integration of spatial and temporal aspects of gene expression data (Schierding and O'Sullivan, 2015; Schierding et al., 2016; Fadason et al., 2017, 2018; Nyaga et al., 2018). These approaches are providing insights into the impacts of genetic variants that can reassign population-based risk to individualized risk.

#### PREDICTING RISK SCORES AND AUC

Traditional epidemiology-based models of disease risk (with limited predictive power) have been primarily informed by lifestyle risk factors such as family history (Jostins and Barrett, 2011; Wang et al., 2016). Recently, the inclusion of genetic risk factors, including disease or phenotype associated SNPs, into risk modeling has improved the accuracy of individual disease prediction (Jostins and Barrett, 2011; Wang et al., 2016). Perhaps the greatest promise of risk prediction models lies in their potential to guide disease prevention and treatment without the need for costly and potentially adverse medical screening procedures (e.g., invasive biopsies) (Wray et al., 2007; Ashley et al., 2010; Manolio, 2013; Abraham and Inouye, 2015).

Currently, the main focus of developing genetic risk models is to achieve accurate predictive power for recognizing at-risk individuals in a robust manner (Ashley et al., 2010; Manolio, 2013; Montañez et al., 2015). As stated earlier, GWA studies define SNPs according to their association with a disease/phenotype at a population level. Therefore, the incorporation of SNPs into a risk prediction model requires integration into models that score an individual's genotype to enable the estimation of risk. Genetic risk prediction models are typically constructed by: (1) polygenic risk scoring; or (2) machine learning (Wei et al., 2009; Abraham and Inouye, 2015). The predictive performance of both model types is evaluated by receiver operating characteristic curves (ROCs) (Kooperberg et al., 2010; Jostins and Barrett, 2011; Vihinen, 2013; Wang et al., 2016), where the sensitivity and specificity of the predictions are assessed at various cut-off values (Kooperberg et al., 2010; Jostins and Barrett, 2011; Vihinen, 2012; Wang et al., 2016). The area under a ROC curve (AUC) is the probability of the examined model correctly identifying a case out of a randomly chosen pair of case and control samples (Kooperberg et al., 2010; Jostins and Barrett, 2011; Kruppa et al., 2012; Vihinen, 2012; Wang et al., 2016). In practice, AUC results range from 0.5 (i.e., no better than random) to 1 (i.e., perfect discrimination between cases and controls) (Kooperberg et al., 2010; Jostins and Barrett, 2011; Vihinen, 2012; Wang et al., 2016).
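The pairwise interpretation of the AUC given above can be made concrete with a minimal sketch. The function name `pairwise_auc` and the risk scores are hypothetical and invented purely for illustration; ties are counted as half a correct ranking, which is a common convention rather than something specified in the text.

```python
from itertools import product


def pairwise_auc(case_scores, control_scores):
    """Fraction of case/control pairs in which the case has the higher score.

    Tied pairs count as half a correct ranking (an assumed convention).
    """
    wins = 0.0
    for case, control in product(case_scores, control_scores):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted risk scores, not taken from any study.
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2]

auc = pairwise_auc(cases, controls)
print(round(auc, 3))  # 8 of the 9 case/control pairs are ranked correctly
```

A model whose scores carry no information about disease status ranks about half of the pairs correctly, which is why 0.5 corresponds to random prediction.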

#### POLYGENIC RISK SCORING

Polygenic risk scoring uses a fixed model approach to sum the contribution of a set of risk alleles to a specific complex disease (Belsky et al., 2013; Che and Motsinger-Reif, 2013; Wang et al., 2016; So et al., 2017). Polygenic risk scores can be unweighted or weighted. In weighted polygenic risk scores, the contributions of the risk alleles are typically weighted by their odds ratios or effect sizes (Evans et al., 2009; Purcell et al., 2009; Wei et al., 2009; Carayol et al., 2010; Medicine and Manolio, 2013). By contrast, unweighted polygenic risk scores are equal to the sum of the number of associated variant alleles in a genome. The unweighted model assumes that all variants have an equivalent effect size (Carayol et al., 2010; Abraham and Inouye, 2015; Hettige et al., 2016). This simplistic assumption limits the utility of unweighted polygenic risk scores for complex traits with underlying genetic architectures that include uneven variant effects (Carayol et al., 2010; Abraham and Inouye, 2015; Hettige et al., 2016).
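In symbols, the two scores can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear in the original: $g_{ij} \in \{0, 1, 2\}$ is the number of risk alleles carried by individual $i$ at SNP $j$, $M$ is the number of SNPs in the score, and the weight $\beta_j$ is taken to be the natural logarithm of the odds ratio $\mathrm{OR}_j$, which is one common choice of effect size.

$$
\mathrm{PRS}^{\text{unweighted}}_i = \sum_{j=1}^{M} g_{ij},
\qquad
\mathrm{PRS}^{\text{weighted}}_i = \sum_{j=1}^{M} \beta_j \, g_{ij},
\qquad
\beta_j = \ln\left(\mathrm{OR}_j\right)
$$

The unweighted score is the special case in which every $\beta_j$ equals 1, which makes explicit the assumption that all variants have an equivalent effect size.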

There are two stages to the development of a polygenic risk score: (1) the discovery stage; and (2) the validation stage. The discovery stage of a weighted polygenic risk score uses statistical association testing (e.g., linear or logistic regression) to estimate effect sizes from a large case and control dataset of individual genotype profiles (Evans et al., 2009; Che and Motsinger-Reif, 2013; Dudbridge, 2013). The discovery stage of an unweighted polygenic risk score requires strict SNP selection parameters to prevent incorporation of SNPs with minor effect sizes. For both weighted and unweighted polygenic risk scores, once developed, the discovery model is passed to the validation stage. Validation of the polygenic risk score requires the extraction of informative SNP identities and effect sizes from the discovery set, using a stringent association p-value threshold (e.g., 5 × 10<sup>−8</sup>) (Dudbridge, 2013; Wray et al., 2014), which are subsequently passed to a scoring phase of the validation. During this process, the polygenic risk score model is applied to a testing dataset [i.e., an independent set of case and control genotype data (Che and Motsinger-Reif, 2013; Dudbridge, 2013)]. Polygenic risk scores are calculated for each individual genotype profile in the testing data (Che and Motsinger-Reif, 2013; Dudbridge, 2013). The predictive power of the individual polygenic risk scores for the complex trait is then established by the strength of the score associations with the clinically measured outcomes (phenotypes) in the testing dataset (Che and Motsinger-Reif, 2013; Dudbridge, 2013).
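The discovery-to-validation hand-off described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a reproduction of any published pipeline: the SNP identifiers, effect sizes, p-values, and genotypes are all invented, and the comparison of mean scores between cases and controls stands in for a formal association test.

```python
GENOME_WIDE_THRESHOLD = 5e-8  # stringent threshold mentioned in the text


def build_score_model(summary_stats, p_threshold=GENOME_WIDE_THRESHOLD):
    """Keep SNPs whose discovery p-value passes the threshold.

    Returns a mapping of SNP id -> effect size (here a log odds ratio).
    """
    return {snp: beta for snp, (beta, p) in summary_stats.items() if p < p_threshold}


def polygenic_score(genotype, model):
    """Sum of effect size x risk-allele count over the SNPs in the model."""
    return sum(beta * genotype.get(snp, 0) for snp, beta in model.items())


# Hypothetical discovery-stage output: SNP id -> (log odds ratio, p-value).
discovery = {
    "rs_a": (0.30, 1e-12),
    "rs_b": (0.10, 3e-9),
    "rs_c": (0.05, 2e-4),  # does not pass the stringent threshold
}

# Hypothetical independent testing data: (risk-allele counts, phenotype; 1 = case).
testing = [
    ({"rs_a": 2, "rs_b": 1, "rs_c": 0}, 1),
    ({"rs_a": 1, "rs_b": 2, "rs_c": 2}, 1),
    ({"rs_a": 0, "rs_b": 1, "rs_c": 1}, 0),
    ({"rs_a": 0, "rs_b": 0, "rs_c": 2}, 0),
]

model = build_score_model(discovery)
scored = [(polygenic_score(genotype, model), phenotype) for genotype, phenotype in testing]

case_scores = [s for s, y in scored if y == 1]
control_scores = [s for s, y in scored if y == 0]
case_mean = sum(case_scores) / len(case_scores)
control_mean = sum(control_scores) / len(control_scores)
print(sorted(model), case_mean, control_mean)
```

Raising `p_threshold` admits more SNPs with smaller effects into the model, which is the trade-off between signal and noise that the choice of threshold controls.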

Early attempts to use weighted polygenic risk scores were based on small numbers of highly significant SNPs identified from GWA studies, and achieved only limited predictive value for complex diseases (Amin et al., 2009; Dudbridge, 2013). This illustrates a key limitation of weighted polygenic risk score modeling: the p-value threshold for SNP choice in the discovery dataset affects the model's performance and predictive power. The selection of limited numbers of SNPs, with large effect sizes, over-simplifies the biological underpinnings of complex diseases by ignoring the bulk of the variants that make much smaller individual contributions to the phenotype (Visscher et al., 2017). For example, the average odds ratio per T2D risk allele ranges from 1.02 to 1.35 (Shigemizu et al., 2014). Recent polygenic risk score models incorporate expanded SNP selection to achieve better predictive results for complex polygenic traits (Dudbridge, 2013; Escott-Price et al., 2015; So et al., 2017). For example, the use of relaxed p-value thresholds (e.g., 0.01, 0.1, or 0.2) has enabled the development of improved polygenic risk score models for psychiatric diseases, with minimal increases in false positive errors (i.e., the models have an acceptable power-to-noise ratio) (Amin et al., 2009; Kooperberg et al., 2010; Wray et al., 2014). The weighted polygenic risk score approach has enabled the risk prediction of schizophrenia to achieve reasonable efficacy with an AUC of ∼0.65 (Jostins and Barrett, 2011). Similarly, significant results from weighted polygenic risk score predictions were also obtained for other complex traits including Type 1 diabetes and celiac disease (CD) (Jostins and Barrett, 2011; Wray et al., 2014; So et al., 2017).

#### MACHINE LEARNING DISEASE PREDICTION MODELS

Machine learning approaches adapt a set of sophisticated statistical and computational algorithms [e.g., Support vector machine (SVM) or Random forest] to make predictions by mathematically mapping the complex associations between a set of risk SNPs and complex disease phenotypes (Quinlan, 1990; Wei et al., 2009; Kruppa et al., 2012; Mohri et al., 2012). These methods use supervised or unsupervised approaches to map the associations with complex diseases (Dasgupta et al., 2011). Despite the utility of unsupervised machine learning methods and non-genetic data in disease predictions (Singh and Samavedham, 2015; Worachartcheewan et al., 2015), we will focus the remainder of this manuscript on supervised modeling that is informed by SNP data.

Supervised machine learning disease prediction models are generated by training the pre-set learning algorithms to map the relationships between individual sample genotype data and the associated disease (Dasgupta et al., 2011; Okser et al., 2014). Optimal predictive power for the target disease is achieved by mapping the pattern of the selected features (variables) within the training genotype data (Quinlan, 1990; Mohri et al., 2012; Okser et al., 2014). Some models use gradient descent procedures and iterative rounds of parameter estimation to search through the training data space for optimized predictive power (Yuan, 2008; Mehta et al., 2019). This iterative process continues until the optimal predictive performance is reached (Yuan, 2008; Mehta et al., 2019). At the end of the training stage, the models with the maximum predictive power on the training dataset are selected for validation (Vihinen, 2012; Abraham and Inouye, 2015). A generalized workflow for creating a machine learning model from a genotype dataset is illustrated in **Figure 1**.
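To illustrate what iterative parameter estimation by gradient descent looks like, the following minimal sketch fits a logistic regression model to a handful of genotypes. It is an assumed example rather than the procedure of any cited study: the genotypes (risk-allele counts at two SNPs), the labels, the learning rate, and the fixed number of iterations are all invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, genotype):
    """Predicted probability of being a case for one genotype."""
    return sigmoid(bias + sum(w * g for w, g in zip(weights, genotype)))


def mean_log_loss(weights, bias, genotypes, labels):
    """Average negative log-likelihood; lower means a better fit."""
    total = 0.0
    for genotype, y in zip(genotypes, labels):
        p = predict(weights, bias, genotype)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def train(genotypes, labels, learning_rate=0.1, n_iterations=500):
    """Batch gradient descent on the logistic log-loss."""
    weights = [0.0] * len(genotypes[0])
    bias = 0.0
    n = len(labels)
    for _ in range(n_iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for genotype, y in zip(genotypes, labels):
            error = predict(weights, bias, genotype) - y
            for j, g in enumerate(genotype):
                grad_w[j] += error * g / n
            grad_b += error / n
        # Step against the gradient to reduce the loss.
        weights = [w - learning_rate * gw for w, gw in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Hypothetical training data: risk-allele counts at two SNPs; 1 = case, 0 = control.
genotypes = [[2, 0], [2, 1], [1, 0], [1, 1], [1, 2], [0, 1], [0, 2], [0, 0]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

initial_loss = mean_log_loss([0.0, 0.0], 0.0, genotypes, labels)
weights, bias = train(genotypes, labels)
final_loss = mean_log_loss(weights, bias, genotypes, labels)
print(initial_loss, final_loss, weights)
```

In practice the loop would stop on a convergence criterion or on performance in held-out data rather than after a fixed number of iterations.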

During the validation stage, the performance of the predictive machine learning models is evaluated to determine their power for generalized prediction. As with polygenic risk scoring, the validation stage is accomplished by evaluating the algorithm on an independent dataset. The validation stage is essential for ensuring the prediction models do not overfit the training data (Dasgupta et al., 2011; Okser et al., 2014; Abraham and Inouye, 2015). Cross validation is a commonly used procedure for validating the model's performance using the original dataset (Schaffer, 1993; Kruppa et al., 2012; Vihinen, 2012; Nguyen et al., 2015; Zhou and Troyanskaya, 2015). However, external validation (testing) using an independent dataset is required to finally confirm the predictive power of a machine learning model. The utility of the algorithm is finally determined through randomized controlled comparisons to current clinical best practice. Only if the algorithm adds information to more accurately stratify populations, predict disease risk or treatment responses does it ultimately prove its clinical utility.
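Cross validation, as mentioned above, repeatedly partitions the original dataset so that every sample is held out exactly once. The sketch below shows one way to generate k-fold partitions; the function name `k_fold_indices`, the sample count, and the choice of five folds are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training indices, held-out indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out


# Hypothetical dataset of 10 genotyped individuals, split into 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
for training, held_out in splits:
    print(sorted(held_out), len(training))
```

A model would be trained on each `training` set and evaluated on the matching `held_out` set, and the k performance estimates averaged. Because all folds come from the same dataset, this does not replace external validation on independent data.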

#### FACTORS THAT IMPROVE THE POWER OF PREDICTIVE MODELS FOR COMPLEX DISEASES

Despite initial promise, the predictive performance of polygenic risk scores for complex diseases has only been moderately successful (Wei et al., 2009; Kruppa et al., 2012; Abraham and Inouye, 2015). A significant contributor to this relatively poor performance is that experimental GWA study data suggest that risk allele contributions to complex diseases have average odds ratios of between 1.1 and 2 (Wray et al., 2007). However, GWA studies are typically underpowered and only capable of detecting risk SNPs with odds ratios of >1.3 (Dudbridge, 2013; Wray et al., 2014). Thus, improving the predictive power of polygenic disease risk models could be as simple as increasing GWA study sample sizes (Wei et al., 2009; Okser et al., 2014; Abraham and Inouye, 2015). Rapidly decreasing DNA sequencing costs have led to meta-GWA studies analyzing datasets containing half a million or more samples (The Wellcome Trust Case Control Consortium, 2007; Amin et al., 2009; Lyall et al., 2018). The use of larger datasets has increased the frequency of detection of SNPs with small effect sizes. Incorporating SNPs with small effect sizes into polygenic risk models has resulted in an increase in the accuracy of complex disease predictions (Wei et al., 2009; Jostins and Barrett, 2011; Vihinen, 2012; Abraham and Inouye, 2015). It remains likely that this trend to use SNPs identified from bigger datasets will continue into the future, with associated increases in the accuracy of the resulting risk prediction models.

The size of the training and validation datasets is another critical element in machine learning modeling. However, size alone is not enough: the datasets must also be of high quality, with accurate phenotyping, to ensure the generalizing predictive power of the resultant machine learning models (Vihinen, 2012; Wei et al., 2014). Wei et al. (2013) illustrated the impact of training sample size on the predictive power of a machine learning classification algorithm for inflammatory bowel disease (IBD). The dataset used in the study contained 60,828 individual genotypes from 15 European countries (Wei et al., 2013). A machine learning prediction model for Crohn's disease (a subtype of IBD) created from a small subset (n = 1,327) of the dataset only performed moderately (AUC = 0.6). However, the predictive power of the model improved consistently with increases in the size of the training datasets until the predictive performance reached the maximum (AUC = 0.86) with the full training dataset (n = 11,943) (Wei et al., 2013).

Technological advances are constantly improving the quality and quantity of the complex integrative datasets that are collected on human phenotypes and disease. Integration of these highly dimensional genomic data within machine learning models can lead to improvements in genetic risk prediction over that achieved for polygenic risk scores (Wei et al., 2009; Okser et al., 2010, 2014; Kruppa et al., 2012; Fourati et al., 2018; Joseph et al., 2018). Polygenic risk score predictions are based on a linear parametric regression model that incorporates strict assumptions, which include additive and independent predictor effects, a normal distribution for the underlying data, and non-correlated data observations (Wei et al., 2009; Abraham et al., 2013; Che and Motsinger-Reif, 2013; Casson and Farmer, 2014; Abraham and Inouye, 2015). These assumptions do not necessarily hold true for the fundamental genetic structures of complex polygenic diseases, thus leading to greatly reduced predictive efficacy (Wei et al., 2009; Abraham et al., 2013; Che and Motsinger-Reif, 2013). Notably, linear additive regression modeling is incapable of accounting for complex interactive effects between associated alleles (Abraham et al., 2013; Che and Motsinger-Reif, 2013; Okser et al., 2014), which have been reported to make major contributions to phenotypes (Furlong, 2013). Thus, linear additive regression based modeling leads polygenic risk scores toward biased and less effective predictions (Clayton, 2009; Huang and Wang, 2012; Che and Motsinger-Reif, 2013; Okser et al., 2014). By contrast, machine learning algorithms employ multivariate, non-parametric methods that robustly recognize patterns from non-normally distributed and strongly correlated data (Wei et al., 2009; Okser et al., 2010, 2014; Ripatti et al., 2010; Silver et al., 2013). The capacity of machine learning algorithms to model highly interactive complex data structures has led to these approaches receiving increasing levels of interest for complex disease prediction (Wei et al., 2009; Okser et al., 2010, 2014; Ripatti et al., 2010; Silver et al., 2013). The strengths and weaknesses of both polygenic risk scoring and predictive machine learning models are summarized in **Figure 2**.
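The limitation of purely additive scoring in the presence of interactive effects can be illustrated with a deliberately extreme toy example. The data are invented, not drawn from any study: disease is assumed to occur only when exactly one of two risk variants is carried, which is a purely interactive architecture that real traits are unlikely to follow so cleanly.

```python
def pairwise_auc(case_scores, control_scores):
    """Probability that a random case outranks a random control (ties count half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical carrier status (1 = carrier) at two SNPs for 100 individuals.
samples = [(a, b) for a in (0, 1) for b in (0, 1)] * 25
# Assumed purely interactive trait: affected only if exactly one variant is carried.
labels = [a ^ b for a, b in samples]

# Unweighted additive score: count of risk variants carried.
additive = [a + b for a, b in samples]
# Score that includes an interaction term between the two SNPs.
with_interaction = [a + b - 2 * a * b for a, b in samples]


def auc_for(scores):
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    return pairwise_auc(cases, controls)


additive_auc = auc_for(additive)
interaction_auc = auc_for(with_interaction)
print(additive_auc, interaction_auc)
```

Under this assumed architecture the additive count discriminates no better than chance, whereas a score that models the interaction separates cases from controls completely. Non-parametric learners can, in principle, discover such terms from the data without their being specified in advance.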

#### MACHINE LEARNING FEATURE SELECTION AND REGULARIZATION

Data feature selection is the major factor that impacts on a machine learning model's predictive performance (Okser et al., 2014). Data feature selection occurs during the machine learning training stage with the aim of reducing data dimensionality, removing noisy and irrelevant data, and thus preserving the most useful signals from the dataset (Kwak and Choi, 2002; Okser et al., 2014). Data feature selection procedures can be broadly implemented using filtering, embedded modules, or wrapper methods (Pal and Foody, 2010; Kruppa et al., 2012; Okser et al., 2013, 2014; Shi et al., 2016). The choice of selection procedures depends on the original data attributes and prediction model criteria (Pal and Foody, 2010; Okser et al., 2014). For complex polygenic diseases, SNPs are currently considered the most informative data features within genotype data (Abraham et al., 2013; Okser et al., 2013; Wei et al., 2013; Shi et al., 2016). It is assumed that the SNPs that are selected for inclusion in the predictive models are associated with loci that contribute mechanistically to the underlying disease etiology (Pal and Foody, 2010; Okser et al., 2014; López et al., 2017). Despite this, how the SNP mechanistically contributes to the disease may not be understood. Commonly, in the first stage of the model building, variants within the genotype data are filtered and subdivided into groups according to their GWA study P-value thresholds (Wei et al., 2009, 2013; Okser et al., 2013, 2014; Montañez et al., 2015). Embedded methods are implemented inside the model building algorithm and function to select SNPs following the detection of their interactive effects (Okser et al., 2013) and thus enable incorporation of only informative SNPs into the predictors (Wu et al., 2009; Okser et al., 2013; Wei et al., 2013). Wrappers serve the same purpose as embedded methods. However, wrappers are independent stand-alone SNP selection modules implemented before the model building process (Pahikkala et al., 2012; Okser et al., 2013).

FIGURE 2 | The strengths and weaknesses of polygenic risk scoring and machine learning models.

Overfitting is a phenomenon whereby models are so closely fitted to a dataset that they cannot be used to generalize to other datasets. The chances of overfitting models can be reduced by regularization, which is a process that maximizes the generalized predictive power of machine learning models (Tibshirani, 1996; Zou and Hastie, 2005; Okser et al., 2014). The two most common types of regression-based regularization are L1 and L2. L1 and L2 regularizations both use a penalized loss function to assign weights that adjust data feature effects and reduce the complexity of the regression models (Tibshirani, 1996; Zou and Hastie, 2005; Okser et al., 2014). L1 regularization sets the weights of non-informative data features to zero, thus eliminating their effects and allowing only essential and valuable data feature effects to be included in the machine learning regression modeling (Tibshirani, 1996; Zou and Hastie, 2005; Okser et al., 2014). By contrast, L2 regularization shrinks the weights of non-essential data features toward zero but leaves them non-zero (Tibshirani, 1996; Zou and Hastie, 2005; Okser et al., 2014). As a result of this, L2 regularization is not typically used for feature selection.
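The contrast between the two penalties can be sketched numerically. The example relies on a simplifying assumption not made in the text: for an orthonormal design, and up to the scaling convention of the penalty, the L1-penalized solution is a soft-thresholded version of the unpenalized weight, while the L2-penalized solution is a proportionally shrunken version. The SNP identifiers, weights, and penalty strength are hypothetical.

```python
def l1_shrink(weight, penalty):
    """Soft-thresholding: move toward zero by `penalty`, clipping to exactly zero."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


def l2_shrink(weight, penalty):
    """Proportional shrinkage: smaller in magnitude, but never exactly zero."""
    return weight / (1.0 + penalty)


# Hypothetical unpenalized SNP weights, invented for illustration.
unpenalized = {"rs_a": 0.80, "rs_b": -0.30, "rs_c": 0.05, "rs_d": -0.02}
penalty = 0.1

l1_weights = {snp: l1_shrink(w, penalty) for snp, w in unpenalized.items()}
l2_weights = {snp: l2_shrink(w, penalty) for snp, w in unpenalized.items()}

# Features that survive the L1 penalty form the selected SNP set.
selected = [snp for snp, w in l1_weights.items() if w != 0.0]
print(selected)
print(l2_weights)
```

Under L1 the two weakest SNPs are removed from the model outright, whereas under L2 every SNP remains with a reduced weight, which is why only the former doubles as a feature selection step.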

Regression-based L1-regularization is one of the most commonly used machine learning feature selection methods, with Lasso and Elastic Net currently being the most popular L1 regularization modules (Tibshirani, 1996; Zou and Hastie, 2005; Wu et al., 2009; Okser et al., 2014). There are many examples where L1-regularization has enhanced the machine learning algorithm's predictive performance for different diseases (Abraham et al., 2013; Wei et al., 2013; Shigemizu et al., 2014; Shieh et al., 2017). For example, Wei et al. (2013) implemented a two-step model training process in the development of an L1-regularized algorithm for Crohn's disease prediction. Firstly, the Lasso-logistic regression method identified a set of essential and informative SNPs. Subsequently, the selected SNPs were applied to a SVM and a logistic predictor for Crohn's disease. Following SNP optimization by L1-regularization, both the nonparametric and parametric predictors achieved similar results with an AUC = 0.86 compared to an AUC = 0.73 for the simple polygenic risk score.

Abraham et al. (2014) used six European genotype datasets to develop a Lasso–SVM integrated model, with an AUC = 0.9, for CD. Following data cleaning and adjustment for population structure effects by principal components, Abraham et al. (2014) created an L1-SVM predictor from each dataset with cross-validation. They then used the other five datasets for external validation. Data feature selection for all the predictors was accomplished by the Lasso method embedded within the SVM algorithm. The best predictor that was generated had an AUC = 0.9 and its clinical utility is being explored for CD prediction (Abraham and Inouye, 2015). Notably, the identification of the essential SNPs by the Lasso-SVM model has provided insights that will help decipher the genetic basis underlying the etiologic pathways of CD pathogenesis.

#### SUPERVISED LEARNING ALGORITHMS

Supervised learning algorithms can be classified as regression-based or tree-based methods (**Table 1**; Dasgupta et al., 2011; Okser et al., 2014). Logistic regression, linear regression, neural networks, and SVM are popular examples of regression-based supervised learning algorithms (Dasgupta et al., 2011; Kruppa et al., 2012). Regression-based supervised learning methods employ polynomial parametric or non-parametric regression methods to map the associations of multidimensional input data to outputs (Dasgupta et al., 2011; Okser et al., 2014; Mehta et al., 2019). By contrast, tree-based supervised learning algorithms, which include Decision trees and Random forests, typically utilize binary decision splitting rule approaches to model the relationships between the input and output data (Dasgupta et al., 2011; Okser et al., 2014; Mehta et al., 2019).
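A binary decision splitting rule of the kind used by tree-based methods can be sketched as a search for the genotype threshold that best separates cases from controls. The sketch below uses Gini impurity as the splitting criterion, which is one common choice and an assumption here, since the text does not name a criterion. The genotypes and labels are invented.

```python
def gini(labels):
    """Gini impurity of a set of binary labels (0 = pure, 0.5 = evenly mixed)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(genotypes, labels):
    """Find the rule 'risk-allele count at SNP j <= threshold' with the lowest
    size-weighted Gini impurity. Returns (snp_index, threshold, impurity)."""
    best = None
    n = len(labels)
    for j in range(len(genotypes[0])):
        for threshold in (0, 1):  # allele counts take the values 0, 1, or 2
            left = [y for g, y in zip(genotypes, labels) if g[j] <= threshold]
            right = [y for g, y in zip(genotypes, labels) if g[j] > threshold]
            if not left or not right:
                continue  # the rule does not actually split the samples
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (j, threshold, impurity)
    return best


# Hypothetical risk-allele counts at two SNPs; 1 = case, 0 = control.
genotypes = [[0, 2], [0, 1], [1, 0], [1, 2], [2, 1], [2, 0]]
labels = [0, 0, 1, 1, 1, 1]

snp_index, threshold, impurity = best_split(genotypes, labels)
print(snp_index, threshold, impurity)
```

A decision tree applies this search recursively to each resulting subset, and a Random Forest averages many such trees built on resampled data and random subsets of SNPs.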

Regression-based machine learning approaches have been widely employed in risk prediction for many diseases including: cancer; Alzheimer's; cardiovascular disease; and diabetes (Capriotti et al., 2006; Cruz and Wishart, 2006; Palaniappan and Awang, 2008; Yu, 2010; Zhang and Shen, 2012). For example, an SVM regression-based non-parametric machine learning model of the genetics of type 1 diabetes was built and trained from 3443 individual genotype samples (Mieth et al., 2016) achieving an AUC = 0.84, which is significantly higher than the polygenic risk scoring model AUC = 0.71 (Clayton, 2009; Wei et al., 2009; Jostins and Barrett, 2011). Notably, validation testing confirmed that the predictive power of the non-parametric SVM consistently outperformed the logistic regression control prediction model on two independent datasets (Wei et al., 2009).

TABLE 1 | A brief view of common machine learning algorithms.

The examples include the founding papers and current examples as of December 2018.

Deep learning prediction models developed from neural network algorithms have been gaining a lot of interest following their successful implementation in image recognition and natural language processing applications (He et al., 2016; Young et al., 2018). In genomics, deep learning applications are helping to identify functional DNA sequences, protein binding motifs and epigenetic marks (Alipanahi et al., 2015; Zhou and Troyanskaya, 2015; Zhang et al., 2018). A deep learning model incorporating SNPs associated with obesity has demonstrated a remarkable ability to correctly identify a case out of a randomly chosen pair of case and control samples with an AUC = 0.99 (Montañez et al., 2015). After data cleaning, a genotype dataset of 1997 individuals including 879 cases and 1118 controls with 240,950 SNPs was obtained. The dataset was subsequently filtered into four SNP feature sets, according to P-value thresholds obtained from the GWA study. The numbers of SNPs in the feature sets were: 5 (P-value: 1 × 10<sup>−5</sup>); 32 (P-value: 1 × 10<sup>−4</sup>); 248 (P-value: 1 × 10<sup>−3</sup>); and 2465 (P-value: 1 × 10<sup>−2</sup>). The feature set with 2465 SNPs (P-value: 1 × 10<sup>−2</sup>) was used to construct an artificial neural network (ANN) deep learning model from 60% of the original genotypes as training, 20% as internal validation, and 20% as testing. The ANN deep learning model delivered a significant predictive performance for obesity on the testing set with an AUC = 0.9908 (Montañez et al., 2015). Montañez et al. (2015) clearly demonstrated the ability of the ANN deep learning algorithm to capture combined SNP effects and predict complex polygenic diseases.
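A 60/20/20 partition of the kind described above can be sketched as follows. Only the proportions and the sample count of 1997 come from the text; how the original study assigned individuals to each subset (for example, whether the split was random or stratified by case status) is not stated, so the simple random shuffle and the function name used here are assumptions.

```python
import random


def train_validation_test_split(sample_ids, fractions=(0.6, 0.2), seed=0):
    """Randomly partition samples into training, validation, and testing subsets.

    `fractions` gives the training and validation proportions; the remainder
    forms the testing subset.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_validation = int(fractions[1] * len(ids))
    training = ids[:n_train]
    validation = ids[n_train:n_train + n_validation]
    testing = ids[n_train + n_validation:]
    return training, validation, testing


# 1997 individuals, as in the dataset described above; ids are placeholders.
training, validation, testing = train_validation_test_split(range(1997))
print(len(training), len(validation), len(testing))
```

The validation subset guides choices made during model development, while the testing subset is reserved for a single final estimate of performance.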

Tree-based machine learning commonly uses a Random Forest algorithm (Jiang et al., 2009; Boulesteix et al., 2012; Touw et al., 2013; López et al., 2017). Random Forest algorithms construct prediction models using an ensemble method with many decision trees. Specifically, Random Forest algorithms select for and evaluate SNPs that are informative in the decision-tree building process (Boulesteix et al., 2012; Nguyen et al., 2015). A strength of Random Forest models is their ability to effectively handle missing and highly dimensional data structures that contain complex interactions (Boulesteix et al., 2012; Nguyen et al., 2015). For example, in a recent study a Random Forest algorithm was used to predict T2D risk, outperforming both SVM and logistic regression models (López et al., 2017). In this study, a set of 1074 individual genotypes and 101 preselected T2D related SNPs were collected and cleaned. The cleaned data (677 samples with 96 related SNPs) were fed into a Random Forest learning algorithm and produced a T2D predictor that delivered an AUC = 0.85 with cross validation (López et al., 2017). In so doing, the Random Forest model also refined the preselected SNPs to identify a subset that are strongly associated with T2D and can be used to interrogate the etiology of the disease (Boulesteix et al., 2012; Nguyen et al., 2015; López et al., 2017). The Random Forest algorithm therefore remains a useful machine learning method for complex disease risk modeling (Boulesteix et al., 2012; Chen and Ishwaran, 2012; Austin et al., 2013; López et al., 2017).

# INDIVIDUAL TISSUE-SPECIFIC HETEROGENEITY

Although PRS and machine learning approaches have been extensively used in complex disease prediction, little attention has been given to the utility of machine learning applications in calculating tissue-specific disease risk in individuals. This is largely because GWAS studies identify relationships between global somatic SNPs and their associated phenotypes (Visscher et al., 2017). However, GWAS-identified, disease-associated SNPs are recognized as modifying regulatory mechanisms which affect gene expression in a tissue-specific manner (Parker et al., 2013; Ardlie et al., 2015). Therefore, by expanding GWAS methodology to include expression measures (i.e., expression quantitative trait locus, eQTL), genetic analyses could help to interrogate the inter-related biological networks between cell and tissue types that propagate the causal effects to complex diseases (Ardlie et al., 2015; Ongen et al., 2017). For example, incorporating eQTL data led to the identification of adipose-specific gene expression patterns that could have an inferred causal role in obesity (Nica and Dermitzakis, 2013). Similarly, genes with liver-specific expression are now thought to be a major contributor to T2D (Rusu et al., 2017). By extending eQTL analyses to include chromatin spatial interaction (Hi-C) data, it was shown that T2D- and obesity-associated SNPs have spatial-eQTLs which implicate dysfunction of specific regulatory actions in various tissue types (Fadason et al., 2017). These studies strongly suggest that by aggregating biological data types (e.g., DNA, RNA, and epigenetic data), the accumulated result becomes a tissue-specific network analysis of associated dysfunctionally regulated genes. Thus, specific disease risk to individuals should be calculated using a tissue-by-tissue approach, concluding with tissue-specific networks and pathways that are particular to the development of a disease.

In so doing, it may be possible to leverage the tissue-effect heterogeneity of patients by identifying the correct genes and tissue loads to provide essential targets for potential therapeutic interventions, leading to enhanced therapeutic effectiveness. The tissue-effect heterogeneity could also help to recognize individual subtypes of complex disease, facilitating personalized treatments. By targeting the tissue-specific effects of causally associated SNPs, predictions of patient-specific tissue-effect disease risks could provide informative biomarkers for early disease prevention, bringing about a substantial reduction of later disease burdens and costs. Zhou and Troyanskaya (2015) have utilized a deep learning algorithm to predict the functional effects of non-coding variants by modeling the pattern of genomic and chromatin profiling information. They have been able to employ this method to distinguish important eQTLs and disease-related SNPs from various eQTL and SNP databases. Nevertheless, despite the immense promise of machine learning, it is important to recognize that at present there is insufficient research on its application to the identification of disease-associated tissue-specific risks. It is likely that these caveats will be attenuated in the near future through advanced tissue-specific studies of complex traits and disease.

# CONCLUSION

Precision medicine is a rapidly advancing field that already provides customized medical treatments and preventative interventions for specific diseases, especially cancer. Using a patient's SNPs to predict individual disease risks is an essential element for delivering the fuller promise of precision medicine. Polygenic risk scoring is a straightforward model for assigning genetic risk to individual outcomes, but has achieved only limited success in complex disease predictions due to its dependency on linear regression. The polygenic risk scoring method is ineffective in modeling highly dimensional genotype data with complex interactions. By contrast, the strength of machine learning data modeling in complex disease prediction lies in its handling of interactive high-dimensional data. Coupled with large new population datasets with high-quality phenotyping at different stages in the lifecourse, machine learning models are capable of classifying individual disease risks with high precision. Notably, machine learning predictors that include tissue-specific disease risks for individuals show even greater promise of insights that could ultimately provide cost-effective and proactive healthcare with great efficacy.

#### DATA AVAILABILITY

No datasets were generated or analyzed for this study.

#### REFERENCES


#### AUTHOR CONTRIBUTIONS

DH conceived and wrote the manuscript. MW and RS advised DH and commented on the manuscript. WS and JO'S supervised DH and co-wrote the manuscript.

# FUNDING

This review was funded by grant UOAX1611: New Zealand – Australia Lifecourse Collaboration on Genes, Environment, Nutrition and Obesity (GENO) from the Ministry of Business Innovation and Employment of New Zealand.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ho, Schierding, Wake, Saffery and O'Sullivan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identifying Disease-Gene Associations With Graph-Regularized Manifold Learning

#### Ping Luo<sup>1</sup>, Qianghua Xiao<sup>2</sup>, Pi-Jing Wei<sup>1,3</sup>, Bo Liao<sup>4</sup> and Fang-Xiang Wu<sup>1,4,5,6</sup>\*

*<sup>1</sup> Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK, Canada, <sup>2</sup> School of Mathematics and Physics, University of South China, Hengyang, China, <sup>3</sup> College of Computer Science and Technology, Anhui University, Hefei, China, <sup>4</sup> School of Mathematics and Statistics, Hainan Normal University, Haikou, China, <sup>5</sup> Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, Canada, <sup>6</sup> Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada*

Complex diseases are known to be associated with disease genes. Uncovering disease-gene associations is critical for diagnosis, treatment, and prevention of diseases. Computational algorithms which effectively predict candidate disease-gene associations prior to experimental proof can greatly reduce the associated cost and time. Most existing methods are disease-specific and can only predict genes associated with one specific disease at a time. Similarities among diseases are not used during the prediction. Meanwhile, most methods predict new disease genes based on known associations, making them unable to predict disease genes for diseases without known associated genes. In this study, a manifold learning-based method is proposed for predicting disease-gene associations by assuming that the geodesic distance between any disease and its associated genes should be shorter than that of other non-associated disease-gene pairs. The model maps the diseases and genes into a lower dimensional manifold based on the known disease-gene associations, disease similarity and gene similarity to predict new associations in terms of the geodesic distance between disease-gene pairs. In the 3-fold cross-validation experiments, our method achieves scores of 0.882 and 0.854 in terms of the area under the receiver operating characteristic (ROC) curve (AUC) for diseases with more than one known associated gene and diseases with only one known associated gene, respectively. Further *de novo* studies on Lung Cancer and Bladder Cancer also show that our model is capable of identifying new disease-gene associations.

Keywords: disease gene identification, manifold learning, disease module theory, gene ontology, multi-task learning

#### Edited by:

*Chuan Lu, Aberystwyth University, United Kingdom*

#### Reviewed by:

*Ling-Yun Wu, Academy of Mathematics and Systems Science (CAS), China Min Chen, Hunan Institute of Technology, China*

> \*Correspondence: *Fang-Xiang Wu faw341@mail.usask.ca*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *21 December 2018* Accepted: *12 March 2019* Published: *02 April 2019*

#### Citation:

*Luo P, Xiao Q, Wei P-J, Liao B and Wu F-X (2019) Identifying Disease-Gene Associations With Graph-Regularized Manifold Learning. Front. Genet. 10:270. doi: 10.3389/fgene.2019.00270*

# 1. INTRODUCTION

Complex diseases are caused by a group of genes known as disease genes. Identifying disease-gene associations is of critical importance since it helps us unravel the mechanisms of diseases, which has many applications such as diagnosis, treatment and prevention of disease. With the advances in high-throughput experimental techniques, a large amount of data that indicate associations between diseases and their associated genes have been generated, which could accelerate the identification of disease-associated genes. However, it is expensive and time-consuming to experimentally prove an association between a gene and a disease. Computational methods that translate the experimental data into legible disease-gene associations are necessary for in-depth experimental validation.

Currently, many algorithms have been developed to predict disease-gene associations, and they can be broadly divided into two categories: machine learning-based methods and network-based methods. Typical machine learning-based methods extract gene-related features and train models that can discriminate between disease genes and passenger genes (Mordelet and Vert, 2011; Yang et al., 2012; Singh-Blom et al., 2013; Luo et al., 2019a,b). Since the features are extracted for genes, these algorithms are usually single-task algorithms which can predict disease genes for only one specific disease at a time. This raises two issues. First, for diseases that have few or no known associated genes, the number of genes is too small to train the model. Second, the relationships among diseases are usually not used in the prediction since only one disease is considered at a time. Matrix completion methods, as a type of machine learning method, can solve these two issues by jointly predicting disease-gene associations and leveraging the similarities among diseases during the calculation (Natarajan and Dhillon, 2014; Zeng et al., 2017). However, matrix completion methods generally do not have globally optimal solutions and could take a very long time to converge to even a locally optimal solution. Network-based methods are based on the assumption that genes closely related in the network are associated with the same diseases. Centrality indices, random walk and network energy are used in many methods to predict disease-gene associations (Köhler et al., 2008; Vanunu et al., 2010; Chen et al., 2014a,b). Although most network-based methods are not affected by the above two issues, their performance is strongly affected by the quality of networks, and they usually perform worse than machine learning-based methods on diseases with many known associated genes (Chen et al., 2015, 2016).

In this study, we propose a manifold learning-based method (dgManifold) to predict disease-gene associations. In our dgManifold, genes and diseases are regarded as points in the same high-dimensional Euclidean space. Our assumption is that diseases and their associated genes should be consistent in some lower dimensional manifold, and the geodesic distance between a disease and its associated genes should be shorter than that of other non-associated disease-gene pairs. Although the Euclidean distance between diseases and genes in the high-dimensional space may not reflect their true geodesic distance, we can map the diseases and genes into a low-dimensional manifold based on the experimentally verified disease-gene associations (Tenenbaum et al., 2000; Ham et al., 2005). Then, the true geodesic distance between all the disease-gene pairs can be calculated. In the meantime, the mapping process is regularized by two affinity graphs, the disease similarity network and the gene similarity network, so that the learned representations with the similarity information can further increase the prediction accuracy. Additionally, since our dgManifold is a supervised method, it is difficult (if at all possible) to learn valuable representations for diseases that have only a few or no known associated genes. A prior information vector calculated from the disease similarities and known disease-gene associations is therefore combined with the original association data to address this issue. Similar strategies have been applied to calculate the initial probabilities used in the random walk, which have improved the accuracy of predicting miRNA-disease associations (Chen et al., 2016b, 2018a,b).

In the rest of the manuscript, section 2 describes our algorithm as well as the data sources and evaluation metrics used in the study. Section 3 discusses the evaluation results. Section 4 draws some conclusions.

#### 2. MATERIALS AND METHODS

#### 2.1. General Model

Given n diseases and m genes, the associations among them can be represented by a matrix A ∈ ℝ<sup>n×m</sup> in which a<sub>ij</sub> = 1 if disease i is associated with gene j, and a<sub>ij</sub> = 0 otherwise. Intuitively, each disease can be represented by a binary m-dimensional row vector while each gene can be represented by a binary n-dimensional column vector. However, in these high-dimensional spaces, it is hard to calculate the actual distance between a disease and a gene.

If we map the diseases and genes into the same manifold with a lower dimensionality and assume that the distance between a disease and its associated genes should be as short as possible on this manifold, predicting disease-gene associations can be solved by computing this mapping based on known disease-gene associations. Mathematically, this can be formulated as finding k-dimensional representations of diseases **r**<sub>1</sub>, …, **r**<sub>n</sub> and k-dimensional representations of genes **q**<sub>1</sub>, …, **q**<sub>m</sub> such that the following objective function is minimized

$$O\_k = \sum\_{i=1}^{n} \sum\_{j=1}^{m} a\_{ij} \|\mathbf{r}\_i - \mathbf{q}\_j\|^2. \tag{1}$$

However, without any constraints, the objective function (1) is not well defined. To illustrate this, if the k-dimensional vectors **r**<sub>i</sub><sup>+</sup> and **q**<sub>j</sub><sup>+</sup> for i = 1, …, n and j = 1, …, m minimize the objective function (1), then ε**r**<sub>i</sub><sup>+</sup> and ε**q**<sub>j</sub><sup>+</sup> reduce the objective function further when 0 ≤ ε < 1, because every term is scaled by ε<sup>2</sup>. In particular, when ε = 0, all the scaled vectors collapse to the origin and the objective function is trivially zero for any choice of **r**<sub>i</sub><sup>+</sup> and **q**<sub>j</sub><sup>+</sup>. Therefore, to make the optimization problem well defined, the following constraints are added

$$\sum\_{i=1}^{n} \mathbf{r}\_i \mathbf{r}\_i^T = I\_k \quad \text{and} \quad \sum\_{j=1}^{m} \mathbf{q}\_j \mathbf{q}\_j^T = I\_k. \tag{2}$$

where I<sub>k</sub> is the k × k identity matrix. As a result, the learned representations are unique with these constraints.
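The scaling argument above can be checked on a toy example. The following is a minimal sketch rather than part of the original method; the matrix `A`, the vectors `R` and `Q`, and the helper names `objective` and `scale` are illustrative assumptions.

```python
def objective(A, R, Q):
    """Objective (1): sum over i, j of a_ij * ||r_i - q_j||^2."""
    total = 0.0
    for i, r in enumerate(R):
        for j, q in enumerate(Q):
            if A[i][j]:
                total += A[i][j] * sum((ri - qj) ** 2 for ri, qj in zip(r, q))
    return total


def scale(vectors, eps):
    """Multiply every vector by the scalar eps."""
    return [[eps * v for v in vec] for vec in vectors]


# Toy data (illustrative only): 2 diseases, 3 genes, k = 2.
A = [[1, 0, 1],
     [0, 1, 0]]
R = [[1.0, 0.0], [0.0, 1.0]]               # disease representations r_i
Q = [[0.5, 0.5], [0.0, 2.0], [2.0, 1.0]]   # gene representations q_j

# The objective shrinks by eps^2 and vanishes at eps = 0.
for eps in (1.0, 0.5, 0.0):
    print(eps, objective(A, scale(R, eps), scale(Q, eps)))
```

Without a normalization such as constraints (2), shrinking all vectors toward the origin always lowers the objective, so the unconstrained minimizer carries no information about the associations.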

To ensure that the mapped representations of diseases and genes are in concert with their intrinsic properties, two affinity graphs, the disease similarity network and the gene similarity network, are used to regularize the objective function (1), and the new objective function is as follows

$$\begin{aligned} O\_k = {} & \sum\_{j=1}^{m} \sum\_{i=1}^{n} a\_{ij} \|\mathbf{r}\_i - \mathbf{q}\_j\|^2 + \frac{\alpha}{2} \sum\_{i=1}^{n} \sum\_{j=1}^{n} s\_{ij}^{d} \|\mathbf{r}\_i - \mathbf{r}\_j\|^2 \\ & + \frac{\beta}{2} \sum\_{i=1}^{m} \sum\_{j=1}^{m} s\_{ij}^{g} \|\mathbf{q}\_i - \mathbf{q}\_j\|^2 \end{aligned} \tag{3}$$

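where S<sup>d</sup> and S<sup>g</sup> are the adjacency matrices of the disease similarity network and the gene similarity network, respectively. Note that

$$\begin{aligned} O\_k = {} & \sum\_{i=1}^{n} \Big(\sum\_{j=1}^{m} a\_{ij}\Big) \mathbf{r}\_i^T \mathbf{r}\_i + \sum\_{j=1}^{m} \Big(\sum\_{i=1}^{n} a\_{ij}\Big) \mathbf{q}\_j^T \mathbf{q}\_j - 2 \sum\_{i=1}^{n} \sum\_{j=1}^{m} a\_{ij} \mathbf{r}\_i^T \mathbf{q}\_j \\ & + \alpha \sum\_{i=1}^{n} \Big(\sum\_{j=1}^{n} s\_{ij}^{d}\Big) \mathbf{r}\_i^T \mathbf{r}\_i - \alpha \sum\_{i=1}^{n} \sum\_{j=1}^{n} s\_{ij}^{d} \mathbf{r}\_i^T \mathbf{r}\_j + \beta \sum\_{i=1}^{m} \Big(\sum\_{j=1}^{m} s\_{ij}^{g}\Big) \mathbf{q}\_i^T \mathbf{q}\_i - \beta \sum\_{i=1}^{m} \sum\_{j=1}^{m} s\_{ij}^{g} \mathbf{q}\_i^T \mathbf{q}\_j \\ = {} & \sum\_{i=1}^{n} A\_{ri} \mathbf{r}\_i^T \mathbf{r}\_i + \sum\_{j=1}^{m} A\_{cj} \mathbf{q}\_j^T \mathbf{q}\_j - 2 \sum\_{i=1}^{n} \sum\_{j=1}^{m} a\_{ij} \mathbf{r}\_i^T \mathbf{q}\_j \\ & + \alpha \sum\_{i=1}^{n} S\_i^{d} \mathbf{r}\_i^T \mathbf{r}\_i - \alpha \sum\_{i=1}^{n} \sum\_{j=1}^{n} s\_{ij}^{d} \mathbf{r}\_i^T \mathbf{r}\_j + \beta \sum\_{j=1}^{m} S\_j^{g} \mathbf{q}\_j^T \mathbf{q}\_j - \beta \sum\_{j=1}^{m} \sum\_{i=1}^{m} s\_{ij}^{g} \mathbf{q}\_i^T \mathbf{q}\_j \\ = {} & \sum\_{i=1}^{n} (A\_{ri} + \alpha S\_i^{d}) \mathbf{r}\_i^T \mathbf{r}\_i + \sum\_{j=1}^{m} (A\_{cj} + \beta S\_j^{g}) \mathbf{q}\_j^T \mathbf{q}\_j \\ & - 2 \sum\_{i=1}^{n} \sum\_{j=1}^{m} a\_{ij} \mathbf{r}\_i^T \mathbf{q}\_j - \alpha \sum\_{i=1}^{n} \sum\_{j=1}^{n} s\_{ij}^{d} \mathbf{r}\_i^T \mathbf{r}\_j - \beta \sum\_{j=1}^{m} \sum\_{i=1}^{m} s\_{ij}^{g} \mathbf{q}\_i^T \mathbf{q}\_j \end{aligned} \tag{4}$$

where

$$S\_i^{d} = \sum\_{j=1}^{n} s\_{ij}^{d}, \quad S\_i^{g} = \sum\_{j=1}^{m} s\_{ij}^{g}, \quad A\_{ri} = \sum\_{j=1}^{m} a\_{ij}, \quad A\_{cj} = \sum\_{i=1}^{n} a\_{ij}.$$

Let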

$$\begin{aligned} L^{11} &= \operatorname{diag} [A\_{r1} + \alpha S\_1^d, A\_{r2} + \alpha S\_2^d, \dots, A\_{rn} + \alpha S\_n^d] - \alpha S^d, \\ L^{22} &= \operatorname{diag} [A\_{c1} + \beta S\_1^{g}, A\_{c2} + \beta S\_2^{g}, \dots, A\_{cm} + \beta S\_m^{g}] - \beta S^{g}, \end{aligned} \tag{5}$$

the objective function (3) can be simplified as

$$O\_k = \sum\_{i=1}^n \sum\_{j=1}^n L^{11}\_{ij} \mathbf{r}\_i^T \mathbf{r}\_j + \sum\_{i=1}^m \sum\_{j=1}^m L^{22}\_{ij} \mathbf{q}\_i^T \mathbf{q}\_j - 2 \sum\_{i=1}^n \sum\_{j=1}^m a\_{ij} \mathbf{r}\_i^T \mathbf{q}\_j \tag{6}$$

Furthermore, let

$$\mathbf{r}\_i = \begin{bmatrix} x\_{i1} \\ x\_{i2} \\ \vdots \\ x\_{ik} \end{bmatrix}, \quad \mathbf{q}\_j = \begin{bmatrix} y\_{j1} \\ y\_{j2} \\ \vdots \\ y\_{jk} \end{bmatrix}, \quad \mathbf{z}\_t = \begin{bmatrix} x\_{1t} \\ \vdots \\ x\_{nt} \\ y\_{1t} \\ \vdots \\ y\_{mt} \end{bmatrix} = \begin{bmatrix} \mathbf{x}\_t \\ \mathbf{y}\_t \end{bmatrix},\tag{7}$$

$$\begin{aligned} A\_r &= \operatorname{diag}[A\_{r1}, \dots, A\_{rn}], \quad A\_c = \operatorname{diag}[A\_{c1}, \dots, A\_{cm}], \\ L^d &= \operatorname{diag}[S\_1^d, \dots, S\_n^d] - S^d, \quad L^g = \operatorname{diag}[S\_1^g, \dots, S\_m^g] - S^g, \end{aligned} \tag{8}$$

$$L = \begin{bmatrix} A\_r + \alpha L^d & -A \\ -A^T & A\_c + \beta L^g \end{bmatrix},\tag{9}$$

objective function (6) can be simplified as

$$\begin{aligned} O\_k &= \sum\_{t=1}^{k} \sum\_{i=1}^{n} \sum\_{j=1}^{n} L^{11}\_{ij} x\_{it} x\_{jt} + \sum\_{t=1}^{k} \sum\_{i=1}^{m} \sum\_{j=1}^{m} L^{22}\_{ij} y\_{it} y\_{jt} - 2 \sum\_{t=1}^{k} \sum\_{i=1}^{n} \sum\_{j=1}^{m} a\_{ij} x\_{it} y\_{jt} \\ &= \sum\_{t=1}^{k} [\mathbf{x}\_{t}^{T} L^{11} \mathbf{x}\_{t} + \mathbf{y}\_{t}^{T} L^{22} \mathbf{y}\_{t} - 2 \mathbf{x}\_{t}^{T} A \mathbf{y}\_{t}] \\ &= \sum\_{t=1}^{k} [\mathbf{x}\_{t}^{T} \ \mathbf{y}\_{t}^{T}] \begin{bmatrix} L^{11} & -A \\ -A^{T} & L^{22} \end{bmatrix} \begin{bmatrix} \mathbf{x}\_{t} \\ \mathbf{y}\_{t} \end{bmatrix} \\ &= \operatorname{Tr}(Z^{T} L Z) \end{aligned} \tag{10}$$
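The chain of simplifications from the regularized objective (3) to the trace form (10) can be verified numerically. The sketch below is illustrative only: it assumes NumPy, symmetric similarity matrices, and randomly generated toy inputs, and the function names `objective_eq3` and `build_L` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 5, 3            # toy sizes (illustrative only)
alpha, beta = 0.2, 0.2

A = rng.integers(0, 2, size=(n, m)).astype(float)   # disease-gene associations
Sd = rng.random((n, n))
Sd = (Sd + Sd.T) / 2                                 # symmetric disease similarity
Sg = rng.random((m, m))
Sg = (Sg + Sg.T) / 2                                 # symmetric gene similarity

X = rng.normal(size=(n, k))   # row i holds r_i
Y = rng.normal(size=(m, k))   # row j holds q_j


def objective_eq3(A, Sd, Sg, X, Y, alpha, beta):
    """Direct evaluation of the regularized objective (3)."""
    d_rq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # ||r_i - q_j||^2
    d_rr = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||r_i - r_j||^2
    d_qq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # ||q_i - q_j||^2
    return ((A * d_rq).sum()
            + alpha / 2 * (Sd * d_rr).sum()
            + beta / 2 * (Sg * d_qq).sum())


def build_L(A, Sd, Sg, alpha, beta):
    """Block matrix L of Equation (9)."""
    Ar = np.diag(A.sum(axis=1))
    Ac = np.diag(A.sum(axis=0))
    Ld = np.diag(Sd.sum(axis=1)) - Sd
    Lg = np.diag(Sg.sum(axis=1)) - Sg
    return np.block([[Ar + alpha * Ld, -A], [-A.T, Ac + beta * Lg]])


Z = np.vstack([X, Y])                     # column t of Z is z_t
L = build_L(A, Sd, Sg, alpha, beta)
lhs = objective_eq3(A, Sd, Sg, X, Y, alpha, beta)
rhs = np.trace(Z.T @ L @ Z)
print(np.isclose(lhs, rhs))
```

The same construction also shows that every row of L sums to zero, which is why the constant vector is an eigenvector with eigenvalue 0.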

Therefore, minimizing the objective function (4) with constraints (2) is equivalent to minimizing the following function of Z = [**z**<sub>1</sub>, …, **z**<sub>k</sub>]

$$O\_k = \operatorname{Tr}(Z^T L Z) \tag{11}$$

with constraints

$$Z^T Z = X^T X + Y^T Y = 2I\_k \tag{12}$$

According to Bolla (2013), minimizing objective function (11) with constraints (12) can be solved by

$$Z^\* = (\mathbf{u}\_0, \mathbf{u}\_1, \dots, \mathbf{u}\_{k-1})\tag{13}$$

where **u**<sub>0</sub>, **u**<sub>1</sub>, …, **u**<sub>k−1</sub> are the k eigenvectors corresponding to the k smallest eigenvalues of L. Meanwhile, the smallest eigenvalue is 0, and the corresponding eigenvector **u**<sub>0</sub> is a constant vector which does not contribute to the calculation of the geodesic distance. Thus, let Ẑ denote the matrix obtained by removing the first column of Z<sup>∗</sup>. The first n rows of Ẑ are the learned (k − 1)-dimensional representations of diseases, and the remaining m rows of Ẑ are the learned representations of genes. The geodesic distance between a disease i and a gene j can be calculated by

$$gdist\_{ij} = \|\hat{\mathbf{r}}\_i - \hat{\mathbf{q}}\_j\|^2. \tag{14}$$
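The steps in Equations (9), (13), and (14) can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name `dg_manifold_distances`, the use of a dense eigendecomposition through `numpy.linalg.eigh`, and the toy matrices are all assumptions introduced here.

```python
import numpy as np


def dg_manifold_distances(A, Sd, Sg, k, alpha, beta):
    """Return an (n x m) matrix of squared distances as in Equation (14)."""
    n, m = A.shape
    Ar = np.diag(A.sum(axis=1))
    Ac = np.diag(A.sum(axis=0))
    Ld = np.diag(Sd.sum(axis=1)) - Sd
    Lg = np.diag(Sg.sum(axis=1)) - Sg
    L = np.block([[Ar + alpha * Ld, -A], [-A.T, Ac + beta * Lg]])

    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(L)
    Z_hat = vecs[:, 1:k]                  # drop u_0, keep u_1 .. u_{k-1}
    R_hat, Q_hat = Z_hat[:n], Z_hat[n:]   # disease rows, then gene rows
    diff = R_hat[:, None, :] - Q_hat[None, :, :]
    return (diff ** 2).sum(axis=-1)


# Toy data (illustrative only): 2 diseases, 4 genes.
# Gene 3 has no known association but is highly similar to gene 2.
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
Sd = np.array([[0.0, 0.1],
               [0.1, 0.0]])
Sg = np.full((4, 4), 0.1)
np.fill_diagonal(Sg, 0.0)
Sg[2, 3] = Sg[3, 2] = 0.9

# k = 2 keeps a single non-constant eigenvector (one-dimensional representations).
dist = dg_manifold_distances(A, Sd, Sg, k=2, alpha=0.2, beta=0.2)
print(dist.round(3))
```

In this toy setting, gene 3 lands closer to disease 1 than to disease 0 because it is tied to gene 2 through the gene similarity network, which illustrates how the regularization terms propagate association information.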

#### 2.2. Similarity Network

#### 2.2.1. Gene Similarity

In this study, the learning process is regularized by similarity networks, and the similarities of genes are calculated based on the Gene Ontology (GO). The GO database provides a set of vocabularies to describe the function of genes and gene products (Ashburner et al., 2000; Consortium, 2017). The GO terms and their relationships are represented as a directed acyclic graph (DAG) where nodes represent terms while edges represent semantic relationships. Many algorithms have been proposed to calculate the similarities of genes using ontology data, and the approach proposed by Wang et al. (2007) is used in this study.

Let DAG<sub>h</sub> = (T<sub>h</sub>, E<sub>h</sub>) denote GO term h, where T<sub>h</sub> contains all the successor GO terms of h in the DAG, and E<sub>h</sub> contains the semantic relationships between h and other terms in T<sub>h</sub>. Each term t in T<sub>h</sub> has a τ-value related to h:

$$\begin{cases} \tau\_h(t) = 1, & \text{if } t = h \\ \tau\_h(t) = \max\{w\_e \cdot \tau\_h(t') \mid t' \in \text{children of } t\}, & \text{otherwise} \end{cases} \tag{15}$$

where w<sub>e</sub> is the weight of the edge (semantic relationship) in the DAG. Two types of semantic relationships ("is\_a" and "part\_of") are used in the DAG, and the corresponding w<sub>e</sub> is set to 0.8 and 0.6, respectively, as recommended in Wang et al. (2007).

Given DAG<sub>h</sub> = (T<sub>h</sub>, E<sub>h</sub>) and DAG<sub>b</sub> = (T<sub>b</sub>, E<sub>b</sub>) for GO terms h and b, their similarity can be computed by

$$\text{SGO}(h, b) = \frac{\sum\_{t \in T\_h \cap T\_b} (\tau\_h(t) + \tau\_b(t))}{\sum\_{t \in T\_h} \tau\_h(t) + \sum\_{t \in T\_b} \tau\_b(t)} \tag{16}$$

Then, the similarity of one GO term t′ and a set of GO terms GO = {t<sub>1</sub>, t<sub>2</sub>, …, t<sub>l</sub>} is defined as

$$\text{SGO}(t', GO) = \max\_{1 \le i \le l} \text{SGO}(t', t\_i) \tag{17}$$

Finally, the functional similarity of two genes g<sub>1</sub> and g<sub>2</sub> is calculated by

$$s^{g}\_{g\_1, g\_2} = \frac{\sum\_{1 \le i \le n\_1} \text{SGO}(t\_{1i}, GO\_2) + \sum\_{1 \le j \le n\_2} \text{SGO}(t\_{2j}, GO\_1)}{n\_1 + n\_2} \tag{18}$$

where GO<sub>1</sub> = {t<sub>11</sub>, t<sub>12</sub>, …, t<sub>1n<sub>1</sub></sub>} and GO<sub>2</sub> = {t<sub>21</sub>, t<sub>22</sub>, …, t<sub>2n<sub>2</sub></sub>} are two sets of GO terms that describe g<sub>1</sub> and g<sub>2</sub>, respectively.
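Equations (15) to (18) can be illustrated on a very small hand-made ontology. The sketch below is an assumption-laden illustration, not the implementation used in the study: the four-term DAG, the term names, and the function names are all hypothetical, and the τ-values are propagated from a term up to the terms it points to through "is\_a" or "part\_of" edges.

```python
def tau_values(term, parents, weights):
    """tau-values of `term` and of every term reachable upward from it (Eq. 15).

    parents[t] lists (parent, relation) pairs; weights maps relation -> w_e.
    """
    tau = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents.get(child, []):
            candidate = weights[relation] * tau[child]
            if candidate > tau.get(parent, 0.0):
                tau[parent] = candidate
                stack.append(parent)   # propagate the improved value further up
    return tau


def term_similarity(h, b, parents, weights):
    """Similarity of two terms (Eq. 16)."""
    th = tau_values(h, parents, weights)
    tb = tau_values(b, parents, weights)
    shared = set(th) & set(tb)
    numerator = sum(th[t] + tb[t] for t in shared)
    return numerator / (sum(th.values()) + sum(tb.values()))


def term_to_set(t, terms, parents, weights):
    """Similarity of one term to a set of terms (Eq. 17)."""
    return max(term_similarity(t, u, parents, weights) for u in terms)


def gene_similarity(go1, go2, parents, weights):
    """Functional similarity of two genes from their term sets (Eq. 18)."""
    total = sum(term_to_set(t, go2, parents, weights) for t in go1)
    total += sum(term_to_set(t, go1, parents, weights) for t in go2)
    return total / (len(go1) + len(go2))


# Hypothetical toy ontology: c is_a a, c part_of b, a is_a root, b is_a root.
weights = {"is_a": 0.8, "part_of": 0.6}
parents = {
    "a": [("root", "is_a")],
    "b": [("root", "is_a")],
    "c": [("a", "is_a"), ("b", "part_of")],
}

print(tau_values("c", parents, weights))
print(gene_similarity(["a"], ["c"], parents, weights))
```

For disease similarities, the same functions could be reused with a single "is\_a" weight, as described in the next subsection.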

#### 2.2.2. Disease Similarity

The similarities among diseases are also calculated with ontology data. Instead of GO, the Human Phenotype Ontology (HPO) (Köhler et al., 2018) is used to characterize human diseases. The HPO provides a vocabulary of phenotypic terms related to human diseases. Each term represents a clinical abnormality, and all the terms are structured as a DAG, in which every term is related to its parent terms by "is\_a" relationships. Although diseases are not directly described by the HPO, the annotation file provided by HPO contains the terms associated with every disease, and thus Equations (17) and (18) can be used to compute the similarities of diseases. When we calculate the similarities of phenotypic terms based on the DAG, w<sub>e</sub> in Equation (15) is set to 0.7 as recommended in Li et al. (2011).

#### 2.3. Prior Information

For diseases with only a few associated genes, the limited information would affect the performance of any computational algorithm. This problem is especially serious for diseases with no known associated genes. To solve this problem, we add some prior information for diseases with no known associations.

Specifically, given a disease i′, a vector **p**<sub>i′</sub> is added to the i′-th row of the matrix A as prior information so that the shortage of known information can be alleviated. The j-th entry of **p**<sub>i′</sub> is calculated by

$$p\_{i'j} = \left(\sum\_{i=1, i \neq i'}^{n} s\_{ii'}^{d} a\_{ij}\right) / \left(\sum\_{i=1, i \neq i'}^{n} a\_{ij}\right) \tag{19}$$

In our experiments, when cross-validation is used to evaluate the algorithm, the prior information is added to the i-th row of matrix A as long as one of the associated genes of disease i is left to test the model. Meanwhile, in the de novo study, prior information is also added to the diseases used for evaluation.
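Equation (19) can be sketched in a few lines of plain Python. The example is illustrative only: the function names `prior_vector` and `add_prior`, the toy matrices, and in particular the choice to return 0 when no other disease is associated with a gene (where the denominator of Equation (19) would be zero) are assumptions made here, not details given in the text.

```python
def prior_vector(A, Sd, target):
    """Prior row p_{i'} of Equation (19) for the disease with index `target`."""
    n, m = len(A), len(A[0])
    others = [i for i in range(n) if i != target]
    prior = []
    for j in range(m):
        numerator = sum(Sd[i][target] * A[i][j] for i in others)
        denominator = sum(A[i][j] for i in others)
        # Assumption: a gene with no association in any other disease gets 0.
        prior.append(numerator / denominator if denominator else 0.0)
    return prior


def add_prior(A, Sd, target):
    """Return a copy of A whose `target` row has the prior vector added."""
    p = prior_vector(A, Sd, target)
    updated = [row[:] for row in A]
    updated[target] = [a + pj for a, pj in zip(A[target], p)]
    return updated


# Toy data (illustrative only): disease 2 has no known associated gene.
A = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0]]
Sd = [[1.0, 0.3, 0.8],
      [0.3, 1.0, 0.2],
      [0.8, 0.2, 1.0]]

print(prior_vector(A, Sd, 2))
```

Each entry is the average similarity between the target disease and the other diseases already associated with that gene, so genes linked to similar diseases receive larger prior weights.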

#### 2.4. Data Sources

The disease-gene association data were downloaded from the Online Mendelian Inheritance in Man (OMIM) database (Amberger et al., 2014) in August 2018. The Morbid Map at OMIM contains nearly seventy-five hundred entries sorted alphabetically by disorder name. Each entry represents an association between a gene and a disease. Different entries are labeled with different tags ("(3)," "[]," and "?") which indicate their reliabilities. To obtain a reliable association dataset, three steps were performed to preprocess the originally downloaded data, following Goh et al. (2007). First, entries with the tag "(3)" are selected while others are discarded. We adopt this strategy because the tag "(3)" indicates that the molecular basis of the disease is known and the association is reliable, while entries with "[]" represent abnormal laboratory test values, and entries with "?" represent provisional disease-gene associations. Second, disease entries are classified into distinct diseases by merging disease subtypes based on their given disorder names. For instance, 17 entries of "Leigh syndrome" are merged into the disease "Leigh syndrome," and the 19 complementary terms of "Lung cancer somatic" are merged into "Lung Cancer." During the classification, string matching was used to classify adjacent entries, followed by a manual verification. Third, 74 diseases are removed because they are not annotated by any HPO terms. Finally, we obtain a dataset consisting of 4,770 associations between 1,537 diseases and 3,320 genes. Among the 1,537 diseases, 917 have only one associated gene (single-gene diseases), while the remaining diseases have at least two associated genes (multiple-gene diseases).
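The first two preprocessing steps can be sketched as follows. This is only an illustration under explicit assumptions: the record layout (a phenotype string ending in a parenthesized tag, paired with a gene symbol), the placeholder gene names, and the rule of merging subtypes by the text before the first comma are all hypothetical simplifications, and the study additionally relied on manual verification.

```python
import re
from collections import defaultdict

# Matches a trailing parenthesized single-digit tag such as "(3)".
TAG_RE = re.compile(r"\s*\((\d)\)\s*$")


def preprocess(records):
    """Keep "(3)" entries and merge subtypes that share a disorder name."""
    merged = defaultdict(set)
    for phenotype, gene in records:
        match = TAG_RE.search(phenotype)
        # Drop bracketed ("[]") and provisional ("?") entries and other tags.
        if phenotype.startswith(("[", "?")) or not match or match.group(1) != "3":
            continue
        # Hypothetical merge rule: the disorder name is the text before the first comma.
        name = TAG_RE.sub("", phenotype).split(",")[0].strip().lower()
        merged[name].add(gene)
    return dict(merged)


# Hypothetical toy records; gene symbols are placeholders.
records = [
    ("Leigh syndrome, hypothetical subtype 1 (3)", "GENE_A"),
    ("Leigh syndrome, hypothetical subtype 2 (3)", "GENE_B"),
    ("Lung cancer, somatic (3)", "GENE_C"),
    ("[Hypothetical laboratory value] (3)", "GENE_D"),
    ("?Hypothetical provisional disorder (3)", "GENE_E"),
    ("Hypothetical unconfirmed disorder (2)", "GENE_F"),
]

print(preprocess(records))
```

A real pipeline would also need the third step, removing diseases without HPO annotations, which requires the HPO annotation file and is not shown here.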

The ontology data of genes and phenotypes are downloaded from the GO database (Ashburner et al., 2000; Consortium, 2017), and the HPO database (Köhler et al., 2018), respectively. The PPI network used in the competing algorithms is downloaded from the InWeb\_InBioMap database (version 2016\_09\_12) (Li et al., 2016).

#### 2.5. Evaluation Metrics

In this study, the algorithm is evaluated in two steps. In the first step, our dgManifold is compared with two competing algorithms: PCFM (Zeng et al., 2017) and Katz (Singh-Blom et al., 2013). PCFM is a matrix completion method which integrates disease similarities and gene similarities to predict disease-gene associations. Katz is a classic network-based method which uses Katz centrality to rank the disease-gene associations. We choose these two algorithms because both are multi-task algorithms which can predict all disease-gene associations, as our dgManifold does. The AUC (area under the receiver operating characteristic (ROC) curve) scores calculated from 3-fold cross-validation are used to compare these three algorithms.

The ROC curve plots the true positive rate [TP/(TP+FN)] versus the false positive rate [FP/(FP+TN)] at different thresholds, and a larger AUC represents better overall performance. In this study, a true positive (TP) is a known disease-gene association (positive sample) predicted as a disease-gene association, while a false positive (FP) is a non-disease-gene association (negative sample) predicted as a disease-gene association. A false negative (FN) is a positive sample predicted as negative while a true negative (TN) is a negative sample predicted as negative. Since negative samples are not included in existing databases, we randomly select a set of unknown disease-gene pairs as negative samples. The number of negative samples is equal to that of positive samples. Considering that the selected negative samples have a small probability of being real disease-gene associations, the random selection was repeated five times to generate 5 sets of negative samples. The final AUC score is the average score obtained from the 5 sets of samples.
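The evaluation protocol described above can be sketched in plain Python. The AUC is computed here through its standard rank interpretation (the probability that a randomly chosen positive is ranked ahead of a randomly chosen negative), with smaller distances counting as stronger predictions. The function names, the dictionary-based distance lookup, and the use of seeds 0 to 4 are assumptions for illustration.

```python
import random


def auc_from_distances(pos, neg):
    """AUC when a smaller distance means a more confident positive prediction."""
    wins = 0.0
    for p in pos:
        for q in neg:
            if p < q:
                wins += 1.0      # positive ranked ahead of negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def sample_negatives(unknown_pairs, size, seed):
    """Draw `size` unknown disease-gene pairs to act as negative samples."""
    return random.Random(seed).sample(unknown_pairs, size)


def averaged_auc(dist, positives, unknown_pairs, repeats=5):
    """Average the AUC over several random, equally sized negative sets."""
    pos_scores = [dist[pair] for pair in positives]
    scores = []
    for seed in range(repeats):
        negatives = sample_negatives(unknown_pairs, len(positives), seed)
        neg_scores = [dist[pair] for pair in negatives]
        scores.append(auc_from_distances(pos_scores, neg_scores))
    return sum(scores) / repeats


# Toy distances keyed by (disease, gene); values are illustrative only.
dist = {(0, 0): 0.1, (0, 1): 0.2, (1, 2): 0.15,
        (0, 2): 0.7, (1, 0): 0.9, (1, 1): 0.6, (0, 3): 0.8}
positives = [(0, 0), (0, 1), (1, 2)]
unknown = [(0, 2), (1, 0), (1, 1), (0, 3)]

print(averaged_auc(dist, positives, unknown))
```

Averaging over several negative sets reduces the influence of any sampled "negative" pair that is in fact an undiscovered association.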

During the cross-validation, the known disease-gene associations are split into 3 groups, and the algorithm is run for 3 rounds. In each round, one group of associations is regarded as unknown (a<sub>ij</sub> = 0), while the remaining two groups of associations are used to train the model. The prior information is recomputed during every round of the cross-validation. Considering that single-gene diseases would have no known associated genes if they are left for testing the model during the cross-validation, predicting disease genes for these diseases is similar to predicting disease genes for a completely new disease. Thus, the three algorithms are compared on multiple-gene diseases and single-gene diseases separately. Additionally, to show the effect of the prior information, the AUC scores of our method without prior information are also calculated.

In the second step, the model is trained with all the known associations, and the geodesic distance between every unknown disease-gene pair is calculated. To find out whether our new predictions are in concert with existing experimental studies, the top-10 predictions for two diseases, Lung Cancer and Bladder Cancer, are searched for in the existing literature. In our dataset, Lung Cancer has 16 associated genes, and Bladder Cancer has 4 associated genes. We choose these two types of cancer because they are experimentally well studied, which makes it more likely that our predictions can be checked against published evidence.
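The 3-fold splitting of known associations can be sketched as follows. The code is an illustration under assumptions: associations are represented as (disease index, gene index) pairs, the shuffle seed and the function names are hypothetical, and the model training itself is left out.

```python
import random


def three_fold_rounds(associations, folds=3, seed=0):
    """Yield (train, held_out) lists of (disease, gene) pairs, one per round."""
    shuffled = associations[:]
    random.Random(seed).shuffle(shuffled)
    groups = [shuffled[f::folds] for f in range(folds)]
    for f in range(folds):
        train = [pair for g, group in enumerate(groups) if g != f for pair in group]
        yield train, groups[f]


def build_matrix(pairs, n, m):
    """Association matrix with a_ij = 1 only for the given training pairs."""
    A = [[0] * m for _ in range(n)]
    for i, j in pairs:
        A[i][j] = 1
    return A


def diseases_needing_prior(held_out):
    """Diseases with at least one held-out gene; their rows receive prior information."""
    return sorted({i for i, _ in held_out})


# Toy associations (illustrative only): 4 diseases, 4 genes.
pairs = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 3), (3, 3)]
for train, held_out in three_fold_rounds(pairs):
    A_train = build_matrix(train, 4, 4)
    print(held_out, diseases_needing_prior(held_out))
```

Because held-out associations are set to zero in the training matrix, a single-gene disease whose only association is held out has an all-zero row before the prior information is added.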

#### 3. RESULTS

#### 3.1. Model Parameters

In our study, several parameters affect the performance of the model. To obtain the optimal parameters, a grid search is conducted over k ∈ {20, 30, 50, 100, 500, 800, 1000, 1200, 1500} and α ∈ {0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5}, with β set equal to α. The AUC score is used to determine whether the selected parameters are optimal. For multiple-gene diseases, the model performs best when k = 30, α = β = 0.2, and for single-gene diseases, the optimal parameters are k = 30, α = β = 0.1.

FIGURE 1 | ROC curves of the three competing algorithms on multiple-gene diseases.
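The parameter search can be sketched as a plain grid search. In this illustration, `evaluate_auc` is a hypothetical caller-supplied function standing in for a full cross-validation run, and the dummy scorer used at the end exists only to make the example executable; it does not reproduce any result of the study.

```python
from itertools import product

K_GRID = [20, 30, 50, 100, 500, 800, 1000, 1200, 1500]
ALPHA_GRID = [0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5]


def grid_search(evaluate_auc):
    """Return (best_score, k, alpha) over the grid, with beta tied to alpha.

    evaluate_auc(k, alpha, beta) -> float is a hypothetical scoring callback.
    """
    best = None
    for k, alpha in product(K_GRID, ALPHA_GRID):
        score = evaluate_auc(k, alpha, alpha)
        if best is None or score > best[0]:
            best = (score, k, alpha)
    return best


def dummy_scorer(k, alpha, beta):
    """Stand-in scorer that peaks at k = 30, alpha = 0.2 (not a real AUC)."""
    return -abs(k - 30) - abs(alpha - 0.2)


print(grid_search(dummy_scorer))
```

Tying β to α reduces the grid from three dimensions to two, at the cost of not exploring settings where the disease and gene similarity networks are weighted differently.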

#### 3.2. Cross-Validation

**Figures 1**, **2** show the resulting ROC curves and AUC scores of the three competing algorithms on multiple-gene diseases and single-gene diseases, respectively. For multiple-gene diseases, our dgManifold achieves an AUC score of 0.882 with prior information and 0.873 without prior information, while the AUC scores of Katz and PCFM are 0.742 and 0.636, respectively. For single-gene diseases, the AUC score of our dgManifold is 0.854 when prior information is used and 0.485 with no prior information, while the AUC scores of Katz and PCFM are 0.455 and 0.322, respectively. These results show that our method is superior to the competing methods in terms of the AUC scores.

It is worth noting that, without prior information, the AUC scores of all three algorithms are less than 0.5 when they are applied to single-gene diseases. This is mainly because single-gene diseases have no known associated genes during the cross-validation, so the algorithms can only use disease similarities and the association data of other diseases to perform the prediction. These data are not enough to generate accurate results, especially for supervised algorithms. Thus, prior information is necessary for the algorithm. In fact, the results of our experiments have shown that the prior information is beneficial to the prediction of disease-gene associations, especially when the diseases have no known associated genes.

#### 3.3. De novo Study

In addition to AUC scores, we evaluate the performance of our dgManifold in predicting new disease-gene associations. Specifically, Lung Cancer and Bladder Cancer are selected, and the prior information corresponding to these two diseases is added to matrix A. Then, all known disease-gene associations are used to train the model (k = 30, α = β = 0.2), and the geodesic distance between all the unknown disease-gene pairs is calculated. For each of the two selected diseases, the unknown disease-gene pairs are ranked by geodesic distance in ascending order, and the top-10 predictions are searched for in the existing literature.
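The ranking step can be sketched as below, assuming the geodesic distances for one disease are already available as a vector over genes. The function and argument names are illustrative and not taken from the reference implementation.

```python
import numpy as np

def top_predictions(dist_row, known_row, gene_names, top_n=10):
    """Rank genes without a known association by ascending distance."""
    dist_row = np.asarray(dist_row, dtype=float)
    unknown = np.flatnonzero(np.asarray(known_row) == 0)  # skip known associations
    order = unknown[np.argsort(dist_row[unknown], kind="stable")]
    return [gene_names[i] for i in order[:top_n]]
```

Genes that are already associated with the disease are excluded before ranking, so the returned list contains only candidate new associations.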

**Table 1** shows the results of the de novo studies. Five out of 10 predicted genes have been experimentally confirmed as associated with Lung Cancer. Among these genes, KCNK9 is a potential therapeutic target (Sun et al., 2016). HTRA1 contributes to tumor formation by inhibiting the TGF-beta pathway (Esposito et al., 2006). ATP6AP1 and MYL2 are two potential biomarkers (Che et al., 2013; Sabrkhany et al., 2018). Mutation of the C282Y allele in HFE is associated with Lung Cancer (McLarty et al., 2008). Although SEMA4A has not yet been proven to be associated with Lung Cancer, it is related to Lung Inflammation and Colorectal Cancer, and its role in Lung Cancer genesis might be discovered in the future (Iyer and Chapoval, 2019). For Bladder Cancer, 3 out of 10 genes have been experimentally verified. Among them, SMAD3 mediates epithelial-mesenchymal transition, which affects the invasion and migration of Bladder Cancer (Tong et al., 2018). DMP1 is a tumor suppressor gene of Bladder Cancer (Peng et al., 2015). CALR is a potential biomarker (Kageyama et al., 2004). These results show that our predictions agree with existing reports, and thus our dgManifold is valuable for predicting new disease-gene associations.

# 4. CONCLUSION

In this study, we have proposed dgManifold to predict disease-gene associations with manifold learning. Our dgManifold assumes that the distance between diseases and their associated genes should be shorter than that of non-associated disease-gene pairs, and it maps the diseases and genes into a lower-dimensional manifold based on known disease-gene associations, disease similarity and gene similarity. New associations can be predicted by sorting the geodesic distances between unknown disease-gene pairs. The cross-validation results show that our model outperforms the competing algorithms in terms of AUC scores for both multiple-gene diseases and single-gene diseases. The de novo studies further demonstrate that our dgManifold is valuable in predicting new disease-gene associations.

Note that, in the current version, dgManifold is only regularized by disease similarities and gene similarities, and the prior information is also obtained from the disease similarities. In the future, we can improve our method by regularizing the objective function with more types of data and computing the prior information from clinical evidence.

# DATA AVAILABILITY

The datasets generated for this study and a reference implementation of the algorithm can be found in the GitHub repository of the study.

# AUTHOR CONTRIBUTIONS

F-XW conceived this study. F-XW, PL, QX, P-JW, and BL discussed the methods. PL implemented the algorithm and designed and performed the experiments. PL and F-XW wrote the manuscript. All authors read and approved the final manuscript.

# FUNDING

This work is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the China Scholarship Council (CSC).

# REFERENCES


to catalyze genomic interpretation. Nat. Methods 14, 61–64. doi: 10.1038/nmeth.4083


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer MC declared a past co-authorship with one of the authors, BL, to the handling editor.

Copyright © 2019 Luo, Xiao, Wei, Liao and Wu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Modularized Perturbation of Alternative Splicing Across Human Cancers

Yabing Du<sup>1†</sup>, Shoumiao Li<sup>2†</sup>, Ranran Du<sup>3†</sup>, Ni Shi<sup>4</sup>, Seiji Arai<sup>5,6</sup>, Sai Chen<sup>7</sup>, Aijie Wang<sup>7</sup>, Yu Zhang<sup>8</sup>, Zhaoyuan Fang<sup>9</sup>\*, Tengfei Zhang<sup>1,10</sup>\* and Wang Ma<sup>1,10</sup>\*

*<sup>1</sup> Department of Oncology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China, <sup>2</sup> Department of Surgery, Anyang Tumor Hospital, Anyang, China, <sup>3</sup> Institute of Medical Information, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China, <sup>4</sup> Comprehensive Cancer Center, Ohio State University, Columbus, OH, United States, <sup>5</sup> Department of Hematology and Oncology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, United States, <sup>6</sup> Department of Urology, Gunma University Graduate School of Medicine, Maebashi, Japan, <sup>7</sup> Department of Clinical Medicine, Xinxiang Medical University, Xinxiang, China, <sup>8</sup> Bioinformatics Group, Medcurius Co., Zhejiang, China, <sup>9</sup> Shanghai Institutes for Biological Sciences, Chinese Academy of Science, Shanghai, China, <sup>10</sup> Medical College, Henan Polytechnic University, Jiaozuo, China*

#### Edited by:

*Chuan Lu, Aberystwyth University, United Kingdom*

#### Reviewed by:

*Igor B. Rogozin, National Institutes of Health (NIH), United States Sandeep Kumar Dhanda, La Jolla Institute for Immunology (LJI), United States*

#### \*Correspondence:

*Zhaoyuan Fang fangzhaoyuan@sibs.ac.cn Tengfei Zhang fcczhangtf@zzu.edu.cn Wang Ma doctormawang@126.com*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *20 November 2018* Accepted: *05 March 2019* Published: *03 April 2019*

#### Citation:

*Du Y, Li S, Du R, Shi N, Arai S, Chen S, Wang A, Zhang Y, Fang Z, Zhang T and Ma W (2019) Modularized Perturbation of Alternative Splicing Across Human Cancers. Front. Genet. 10:246. doi: 10.3389/fgene.2019.00246*

Splicing perturbations in cancers contribute to different aspects of cancer cell progression. However, the complete functional impact of cancer-associated splicing has not been fully characterized. Comprehensive large-scale studies are essential to unravel the dominant patterns of cancer-associated splicing. Here we analyzed the genome-wide splicing data in 16 cancer types with normal samples and identified differential splicing events in each cancer type. Then we took a network-based and modularized approach to reconstruct cancer-associated splicing networks, determine the module structures, and evaluate their prognosis relevance. In total, this approach identified 51 splicing modules, among which 10/51 modules are related to patient survival, 8/51 are related to progression-free interval, and 5/51 are significant in both. Most of the 51 modules show significant enrichment of important biological functions, such as stem cell proliferation, cell cycle, cell growth, DNA repair, receptor or kinase signaling, and VEGF vessel development. Module-based clustering grouped cancer types according to their tissues of origin, consistent with previous pan-cancer studies based on integrative clustering. Interestingly, 13/51 modules are highly common across different cancer types, suggesting the existence of pan-cancer splicing perturbations. Together, modularized perturbation of splicing represents a functionally important and common mechanism across cancer types.

Keywords: alternative splicing, splicing network, splicing modules, cancer splicing, prognosis

# INTRODUCTION

Newly transcribed messenger RNAs undergo processing steps such as capping, splicing and polyadenylation to derive mature RNAs for export and translation (Hocine et al., 2010). The splicing process, as accomplished by the spliceosome machinery, can produce multiple alternative products, a well-known phenomenon called alternative splicing (AS) (Lee and Rio, 2015). Since its first discovery in 1977, many classical studies have characterized its widespread participation in biological processes such as cell proliferation, apoptosis, angiogenesis, neuronal functions, and transcriptional regulation (Kelemen et al., 2013). Deregulation of AS also contributes to human diseases and various aspects of cancer development (David and Manley, 2010; Scotti and Swanson, 2016).

To systematically characterize the extensive cancer-associated AS perturbations, it is essential to design effective analytic strategies suitable for the ever-growing cancer genomic datasets, largely from projects such as The Cancer Genome Atlas (TCGA). Several approaches have already been taken. One of the most popular is the event-driven approach, which aims to detect individual events that are correlated with cancer or prognosis (Danan-Gotthold et al., 2015; Dvinge and Bradley, 2015; Shen et al., 2016). A second approach focuses on the splicing machinery and tries to determine the deregulation of splicing factors in tumors (Sebestyén et al., 2016; Sveen et al., 2016; Seiler et al., 2018). Since these approaches emphasize different aspects of AS perturbations, several recent studies have linked splicing factors and events together to identify AS deregulation and the corresponding functional impacts (Li et al., 2017; Kahles et al., 2018). However, this approach may overlook the vast majority of perturbed splicing events that are not easily explained by the few known regulatory factors (Li et al., 2017). Moreover, these studies essentially relied on single-event analysis and have missed the inter-event linkages, which could be equally important for fully understanding cancer-specific AS perturbations. To complement these analyses, a fully network-based approach is needed to capture the concurrent perturbation patterns of cancer-associated AS. In addition, such an approach might also discover more robust AS patterns in one or multiple cancer types.

We carried out an extensive analysis of AS events and their interactions in different cancer types. For each cancer type, a network of cancer-associated events is reconstructed. To uncover the potential modularized control in these splicing networks, a random walk-based community identification algorithm is employed. These analyses have revealed representative splicing modules in each cancer type, a number of which are prognosis-relevant and involved in cancer-related functional processes. Finally, our work supports the unique value of a splicing network-based approach in understanding cancer splicing deregulation.

# MATERIALS AND METHODS

# Data Sets and Processing

Splicing data were downloaded from the TCGASpliceSeq database (Ryan et al., 2016). Clinical information is from the GDC TCGA project. Splicing events that could not be quantified in >10% of normal samples or in >1% of cancer samples were filtered out. Cancer samples with >0.1% missing data were also removed. The remaining missing values were imputed with the Bioconductor impute package.
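A minimal sketch of this filtering is shown below. It assumes PSI values are stored as events × samples arrays with missing values encoded as NaN, and that events are filtered before samples; both the data layout and the order of the two filters are assumptions, and the imputation step is not shown.

```python
import numpy as np

def filter_psi(psi_normal, psi_cancer,
               max_na_normal=0.10, max_na_cancer=0.01, max_na_sample=0.001):
    """Drop poorly quantified events, then cancer samples with too many gaps."""
    # Fraction of missing values per event (row) in each sample group.
    keep_events = ((np.isnan(psi_normal).mean(axis=1) <= max_na_normal)
                   & (np.isnan(psi_cancer).mean(axis=1) <= max_na_cancer))
    psi_normal = psi_normal[keep_events]
    psi_cancer = psi_cancer[keep_events]
    # Fraction of missing values per cancer sample (column) over retained events.
    keep_samples = np.isnan(psi_cancer).mean(axis=0) <= max_na_sample
    return psi_normal, psi_cancer[:, keep_samples], keep_events, keep_samples
```

The returned boolean masks make it easy to carry the same selection over to event annotations and clinical tables.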

#### Network Reconstruction and Module Identification

For each cancer type, Pearson correlation coefficients were computed between each pair of differentially spliced events (Wilcoxon signed-rank test FDR < 0.1, |delta PSI| > 0.1) and were used as edge weights in the reconstructed undirected graph. The Pons-Latapy random walk algorithm (step = 4) was used to partition the weighted graph. The identified modules from each cancer type were named in order of module size (from larger to smaller), so M1 is always at least as large as M2, M2 at least as large as M3, etc.
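The edge-weight computation and the module naming convention can be sketched as follows. The random-walk partitioning itself is not reimplemented here; the `membership` labels are assumed to come from an existing walktrap implementation (for example, the one provided by igraph). How negative correlations are handled as edge weights is not stated in the text, so the raw coefficients are returned unchanged.

```python
import numpy as np
from collections import Counter

def correlation_edges(psi):
    """psi: events x samples array of PSI values for differential events."""
    corr = np.corrcoef(psi)                 # Pearson correlation between rows (events)
    i, j = np.triu_indices_from(corr, k=1)  # each unordered pair once
    return list(zip(i.tolist(), j.tolist(), corr[i, j].tolist()))

def name_modules_by_size(membership):
    """Map raw community labels to M1, M2, ... in order of decreasing size."""
    sizes = Counter(membership)
    # Ties in size are broken by the raw label (an arbitrary choice).
    ranked = sorted(sizes, key=lambda lab: (-sizes[lab], lab))
    names = {lab: f"M{rank}" for rank, lab in enumerate(ranked, start=1)}
    return [names[lab] for lab in membership]
```

With this convention, a module name such as KIRC\_M1 always refers to the largest module found in that cancer type.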

# Overall and Progression-Free Survival Analysis

Module scores are computed for each sample by averaging over all splicing events in the module, after subtracting the normal-sample PSI values as references. Thus, the score measures how strongly the module is perturbed in a cancer sample. To ensure robustness, both the average and median scores have been calculated. Overall survival (OS) and progression-free interval (PFI) are each tested for association with module scores. Kaplan-Meier survival curves are fitted and compared between samples with a higher vs. a lower module score using the log-rank test. Hazard ratios and confidence intervals are estimated from the Cox proportional hazards regression model. In total, 13 modules were found to be significantly correlated with either OS (10 modules) or PFI (8 modules). Of these, 5 modules were significant in both OS and PFI. In addition, 2 Lung squamous cell carcinoma (LUSC) modules were nearly significant in OS and were also included as candidates. Therefore, 15 modules were retained after filtering with OS and PFI analyses.
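One possible reading of the scoring procedure is sketched below. It assumes that the reference for each event is its mean PSI across normal samples, that signed differences are averaged, and that samples are split into higher and lower groups at the median score; none of these details is specified in the text, so they should be treated as assumptions. The survival tests themselves are not shown.

```python
import numpy as np

def module_scores(psi_cancer, psi_normal, event_idx, summary="mean"):
    """Per-sample module score: PSI deviation from the normal reference."""
    # Reference PSI of each module event = mean over normal samples (assumption).
    reference = np.nanmean(psi_normal[event_idx], axis=1, keepdims=True)
    delta = psi_cancer[event_idx] - reference
    reducer = np.nanmean if summary == "mean" else np.nanmedian
    return reducer(delta, axis=0)  # one score per cancer sample

def split_high_low(scores):
    """True for samples in the higher-score group (median cut, an assumption)."""
    return scores > np.median(scores)
```

The two groups returned by `split_high_low` would then be compared with a log-rank test and a Cox model in a survival analysis package.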

# Functional Enrichment Analyses

Gene ontology (GO) enrichment was used to assess the functional properties of each module. The enrichment was determined by Fisher's exact test. For all significant GO terms, careful manual inspection and curation were performed to find the most relevant and biologically important functions, which are often a subset of the significant terms.
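As an illustration, the enrichment p-value for one GO term can be computed from four counts. The sketch below uses the one-sided form of Fisher's exact test (the hypergeometric upper tail); whether a one-sided or two-sided test was used, and which gene set served as background, are assumptions here.

```python
from math import comb

def enrichment_pvalue(n_background, n_annotated, n_module, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    n_background: genes in the background set
    n_annotated:  background genes annotated with the GO term
    n_module:     genes in the module
    n_overlap:    module genes annotated with the GO term
    """
    total = comb(n_background, n_module)
    upper = min(n_annotated, n_module)
    tail = sum(
        comb(n_annotated, x) * comb(n_background - n_annotated, n_module - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total
```

Because many GO terms are tested per module, the resulting p-values would normally be adjusted for multiple testing before interpretation.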

## RESULTS

# Splicing Network-Based Flowchart for Identifying Prognosis-Relevant Splicing Modules

The main analytic flowchart consists of six steps (**Figure 1**): (1) Collection of annotated events from the TCGASpliceSeq database across 33 cancer types. The splicing classes included are exon skipping (ES), retention of introns (RI), alternative donor (AD), alternative acceptor (AA), mutually exclusive exons (ME), and alternative terminator (AT). The number of events quantified in at least 99% of the samples of a cancer type ranges from 21,129 to 43,937 across the cancer types. (2) Identification of differential splicing events between cancer samples and adjacent normal samples for the 16 cancer types with at least 10 normal samples, using the Wilcoxon signed-rank test. The number of differential events obtained ranges from 228 in ESCA to 1,133 in LUSC. (3) Reconstruction of a splicing network for each cancer type, with linkages based on Pearson correlation coefficients. Pearson and Kendall correlation coefficients showed good consistency in subsequent community detection, confirming the reliability of this procedure (**Figure S1**). (4) Network module identification with the Pons-Latapy algorithm, which uses random walks of 3–5 steps to measure vertex distances for hierarchical clustering and subsequent modularity-optimized graph partitioning (Pons and Latapy, 2005). (5) Scoring of modules with the averaged splicing deregulation between each cancer sample and the normal samples, which provides a reasonable quantification of module-level perturbation across cancer samples. (6) Prognosis analyses for each module and its corresponding cancer type. **Figure 1** shows a schematic diagram of these steps.
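Step (2) combines a significance cut-off with an effect-size cut-off (FDR < 0.1 and |delta PSI| > 0.1, as given in Materials and Methods). The sketch below assumes the per-event Wilcoxon p-values are already available and that the FDR is controlled with the Benjamini-Hochberg procedure; the text does not name the FDR method, so that choice is an assumption.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adjusted, 1.0)
    return out

def select_differential(pvalues, delta_psi, fdr_cut=0.1, dpsi_cut=0.1):
    """Boolean mask of events passing both the FDR and |delta PSI| cut-offs."""
    fdr = benjamini_hochberg(pvalues)
    return (fdr < fdr_cut) & (np.abs(np.asarray(delta_psi, dtype=float)) > dpsi_cut)
```

Requiring both criteria keeps events that are statistically supported and also show a sizeable shift in PSI between tumor and normal samples.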

#### Cancer Splicing Networks and Modules for TCGA Cancer Types

We reconstructed splicing networks on differential events for each cancer type with at least 10 normal samples. There are 16 cancer types that meet this criterion, namely: Bladder urothelial carcinoma (BLCA), Breast invasive carcinoma (BRCA), Colon adenocarcinoma (COAD), Esophageal carcinoma (ESCA), Head and Neck squamous cell carcinoma (HNSC), Kidney Chromophobe (KICH), Kidney renal clear cell carcinoma (KIRC), Kidney renal papillary cell carcinoma (KIRP), Liver hepatocellular carcinoma (LIHC), Lung adenocarcinoma (LUAD), Lung squamous cell carcinoma (LUSC), Prostate adenocarcinoma (PRAD), Rectum adenocarcinoma (READ), Stomach adenocarcinoma (STAD), Thyroid carcinoma (THCA), and Uterine Corpus Endometrial Carcinoma (UCEC). The number of modules identified with the Pons-Latapy algorithm varied between 2 and 5 for these cancer types. Visual inspection revealed quite clear network partitions (**Figure 2**). In total, 51 modules were identified, and the number of events and genes in each module can be found in **Table 1** and **Table S1**. For example, the KIRC\_M1 module consisted of 710 events in 630 genes, with 41 AA events, 36 AD events, 229 AT events, 210 ES events, 3 ME events, and 191 RI events.

TABLE 1 | A list of 51 cancer splicing modules.


*AA, alternative acceptor; AD, alternative donor; AT, alternative terminator; ES, exon skipping; ME, mutually exclusive exons; RI, retention of introns. For cancer type abbreviations, see text.*

#### Overall and Progression-Free Survival Analyses for Cancer Splicing Modules

Since the motivation of this study is to discover prognosis-related splicing modules, we quantified module scores in each cancer sample and tested associations between module scores and patient survival. Both the average score and median score were computed and assessed for prognosis correlation, and a very good consistency was found (**Figure S2**), indicating robustness of the module scoring procedure. At a 0.05 significance level, log-rank tests identified 10 prognosis-related modules: BLCA\_M1, BLCA\_M2, KIRC\_M1, KIRC\_M2, LIHC\_M1, LIHC\_M2, LUAD\_M1, LUAD\_M3, PRAD\_M1, and UCEC\_M3 (**Table 2**, **Figure 3**). Two additional modules, LUSC\_M2 (P = 0.0595, HR = 0.75 with a confidence interval of 0.55–1.01) and LUSC\_M3 (P = 0.099, HR = 0.77 with a confidence interval of 0.57–1.05), are close to the significance level and therefore are still likely to be potential prognosis biomarkers (**Figure 3E**). Notably, LUSC\_M2 contains an ME event on the known LUSC amplification gene FGFR1 (exons 12.1:12.2 vs. exon 13), which could be functionally important in LUSC (Weiss et al., 2010; Heist et al., 2012).

Besides overall survival (OS), which reflects long-term prognosis, it is often of interest to evaluate short-term effects on disease progression. Therefore, to capture more prognosis-related modules, we also tested the correlation between module scores and progression-free intervals (PFI). This analysis returned 8 significant modules (P ≤ 0.05), namely, BLCA\_M1, BLCA\_M2, BLCA\_M4, LUAD\_M3, PRAD\_M1, PRAD\_M2, THCA\_M1, and UCEC\_M3 (**Table 2**, **Figure 4**). Note that BLCA\_M4 is also marginally significant in OS analysis (P = 0.076, HR = 0.75 with a confidence interval of 0.54–1.03), while PRAD\_M2 (OS P = 0.29) and THCA\_M1 (OS P = 0.7) are only significant in PFI analysis. Five modules are strictly significant in both the OS and PFI settings (BLCA\_M1, BLCA\_M2, LUAD\_M3, PRAD\_M1, and UCEC\_M3), and interestingly, their HRs in these two settings follow a similar trend, either both reducing or both increasing malignancy risks. BLCA\_M1 lowers both the death risk (0.57, 0.41–0.78) and the disease progression risk (0.67, 0.48–0.91); BLCA\_M2 increases both the death risk (1.78, 1.29–2.48) and the progression risk (1.42, 1.04–1.95); LUAD\_M3 also increases both risks (1.86, 1.34–2.58 and 1.42, 1.05–1.93, respectively); PRAD\_M1 also increases both risks (6.42, 0.78–52.60 and 1.80, 1.16–2.79, respectively); UCEC\_M3 similarly increases both risks (1.81, 1.15–2.84 and 1.66, 1.12–2.47, respectively). These results strongly indicate the consistency of splicing modules as potential prognosis biomarkers, suggesting underlying functional involvement of these modules in their corresponding cancer types.

#### Cancer Splicing Modules Are Enriched for Critical Biological Functions

The above analyses yielded 15 modules with potential prognosis relevance (**Table 2**). To characterize the functional properties of each splicing module, GO enrichment analysis was performed on the 15 modules. The major functional implications of each module were manually examined and curated from the enrichment results (**Table 2**). Since nearly all genes transcribed in the genome, including many long non-coding genes, undergo alternative splicing, typically very few events could drive strong functional changes, and the majority of alternative splicing events function at most as weaker modifiers. Surprisingly, we found that the 15 modules, when compared to the splicing events catalog, showed very strong enrichment of important biological functions, such as stem cell proliferation and epithelial-mesenchymal transition (EMT) (BLCA\_M1, LIHC\_M2, LUSC\_M1, THCA\_M1, THCA\_M2, UCEC\_M2), cell cycle control (BRCA\_M1, BRCA\_M2, COAD\_M4, KICH\_M1, STAD\_M1), DNA repair or regulation (COAD\_M3, ESCA\_M2, HNSC\_M3, KICH\_M1, LUAD\_M3, LUSC\_M3, UCEC\_M3), developmental cell growth (BRCA\_M4, COAD\_M1, LIHC\_M1, LIHC\_M3, READ\_M1, READ\_M2, STAD\_M2, STAD\_M3), receptor or kinase signaling pathways (BLCA\_M4, HNSC\_M1, KICH\_M1, KIRC\_M2, KIRP\_M1, KIRP\_M2, LIHC\_M1, LIHC\_M2, LUAD\_M1, LUAD\_M2, LUSC\_M1, LUSC\_M3, READ\_M2), and VEGF-mediated vessel development (LUAD\_M1, LUSC\_M1) (**Table 2**). Among these major functions, EMT is required for cancer invasion and metastasis, which is closely related to cancer mortality and prognosis (Singh and Settleman, 2010). The important EMT-related gene modulated in BLCA and LIHC is FGFR2, which regulates mesenchymal condensation in BLCA (Chaffer et al., 2006). Targeting FGFR signaling through splicing factors might expand the current toolkits (Touat et al., 2015). Vessel development controlled by VEGF signaling is another pathway directly involved in cancer metastasis and patient survival (Stacker et al., 2002; Su et al., 2006). Both VEGFA and its receptor FLT4 (VEGFR-3) were altered in splicing in the lung cancers LUAD and LUSC, which might modulate angiogenesis through splicing control. In summary, these results suggest that the splicing network-based module identification approach taken in this study was powerful enough to extract the few critically functional events from a much larger splicing background.

*OS, overall survival; PFS, progression-free survival; HR, hazard ratio; P, Log-rank test p-value.*

# Splicing Modules Across Cancer Types Reveal Pan-Cancer Signatures

Having obtained those functionally coherent modules, we next asked whether it would be helpful to explore the pan-cancer landscape at the module level. Hierarchical clustering of cancer types with the 51 modules revealed a clear pattern that is closely related to tissue origins (**Figure 5A**). Lung cancers (LUAD, LUSC), colon cancers (COAD, READ), gynecological cancers (BRCA, UCEC), and kidney and prostate cancers (KIRC, KIRP, KICH, PRAD) each cluster according to tissue origin. This is consistent with a recent pan-cancer analysis using multi-platform integrative clustering (Hoadley et al., 2018), suggesting that splicing events can also be useful for cancer classification and subtyping.

Due to the intra-type and between-type heterogeneity of cancers, it is important to know which of the splicing modules are shared by multiple cancer types and which modules are restricted to one or a few cancer types. We summarized the scores for each module in the cancer samples and categorized them by cancer type (**Figure 5B**). The diagonal line here reflects the scores of modules in their corresponding cancer types, while the off-diagonal regions depict their pan-cancer potential. Although a few modules from kidney and liver cancers show strong cancer type specificity and are largely not perturbed in other types (KICH\_M3, LIHC\_M3, LIHC\_M4), many other modules display strong pan-cancer perturbation patterns, suggesting their wider involvement in most cancer types. With a strict criterion (perturbation found in at least 15/16 cancer types), we found that 13/51 modules are highly common across different cancer types (marked with ∗∗, **Figure 5B**), again suggesting that the modules identified with splicing network analysis are highly informative and important.
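The counting behind this criterion can be sketched as follows. The text does not state how a module is called "perturbed" in a given cancer type, so the boolean input matrix (modules × cancer types) is an assumption and would have to be derived from the module scores by some thresholding rule.

```python
import numpy as np

def pan_cancer_modules(perturbed, module_names, min_types=15):
    """Return modules flagged as perturbed in at least `min_types` cancer types.

    perturbed: boolean array, rows = modules, columns = cancer types.
    """
    perturbed = np.asarray(perturbed, dtype=bool)
    counts = perturbed.sum(axis=1)  # number of cancer types per module
    return [name for name, c in zip(module_names, counts) if c >= min_types]
```

With 16 cancer types, the default threshold of 15 tolerates at most one cancer type in which a module is not perturbed.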

## DISCUSSION

Although AS has been identified and studied for many years, the full regulation pattern of the many AS events within and across cancer types is still not completely understood. Previous studies have taken advantage of single-event analyses and linked splicing to splicing factors as well as cis-elements. Very recently, an interesting study set out to determine the involvement of spliceosome RNAs in cancer-specific AS regulation (Dvinge et al., 2018).

In this study, we have taken a novel approach that emphasizes the inter-event correlations and uncovers the modularized perturbation of splicing events in cancers. Previous studies have not emphasized the modularized control of splicing events, which according to our study is quite important. Indeed, a relatively small number of functionally important and prognosis-relevant modules have been successfully identified, with some of them being common across cancer types and others being more specific to one or few cancer types, indicating that our approach is both powerful and useful.

To focus on the more typical AS classes, we have not considered alternative promoters in this study, as their regulation is more relevant to transcription factors, enhancers or even epigenetic modifications (Maunakea et al., 2010; Kowalczyk et al., 2012). Nonetheless, it would be interesting to investigate the possibility of combining transcriptional events and splicing events in the future, as co-transcriptional splicing has already been proposed and supported by various studies. This might serve as a plausible framework for those interactions (Lee and Rio, 2015).

# REFERENCES


# DATA AVAILABILITY

The datasets used for this study are publicly accessible, and are also available by contacting the corresponding authors.

# AUTHOR CONTRIBUTIONS

WM, TZ, and ZF: conceived and oversaw the study, wrote the manuscript. YD, SL, RD, and YZ: performed the data analysis. NS, SA, SC, and AW: assisted in data analysis and interpretation.

# ACKNOWLEDGMENTS

This study was supported by the National Natural Science Foundation of China (No. 31570917, 31400752, 81602459), the Program of China International Medical Foundation (No. CIMF-F-H001-285), the Science and Technology Project of Henan Province (No. 162102310136, 132102310148, 112102310150), the General project of Henan Medical Science and Technology Research Plan (No. 201403067), the Zhengzhou University School Joint Cultivation Fund, the Science and Technology Project of Zhengzhou City (No. 112PPTSF-317-6).

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00246/full#supplementary-material

Figure S1 | Comparison of Pearson and Kendall correlation coefficients on network and module detection. NMI (normalized mutual information) and ARI (adjusted Rand index) are used to evaluate consistency.

Figure S2 | Consistency between average and median module scores on survival analyses. (Left) OS analysis. (Right) PFI analysis. Each circle or "+" refers to a module.

Table S1 | Detailed splicing events within each of the 51 splicing modules. Each event is named as "gene::eventID::eventType".

squamous cell carcinoma of the lung. J. Thorac. Oncol. 7, 1775–1780. doi: 10.1097/JTO.0b013e31826aed28


**Conflict of Interest Statement:** Author YZ was employed by the company Medcurius Co.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Du, Li, Du, Shi, Arai, Chen, Wang, Zhang, Fang, Zhang and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multiple Epistasis Interactions Within MHC Are Associated With Ulcerative Colitis

Jie Zhang<sup>1,2</sup>, Zhi Wei<sup>1</sup>\*, Christopher J. Cardinale<sup>3</sup>, Elena S. Gusareva<sup>4</sup>, Kristel Van Steen<sup>4,5</sup>, Patrick Sleiman<sup>3,6</sup>, International IBD Genetics Consortium and Hakon Hakonarson<sup>3,6</sup>\*

<sup>1</sup> Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, United States, <sup>2</sup> Adobe Inc., San Jose, CA, United States, <sup>3</sup> The Children's Hospital of Philadelphia, Center for Applied Genomics, Philadelphia, PA, United States, <sup>4</sup> GIGA-R Medical Genomics - BIO3, University of Liege, Avenue de l'Hôpital 11, Liège, Belgium, <sup>5</sup> WELBIO—Walloon Excellence in Life Sciences and BIOtechnology, Liège, Belgium, <sup>6</sup> Division of Human Genetics, Department of Pediatrics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Xiang Yu, University of Pennsylvania, United States Richa Gupta, University of Helsinki, Finland

#### \*Correspondence:

Zhi Wei zhiwei@njit.edu Hakon Hakonarson hakonarson@email.chop.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 11 December 2018 Accepted: 08 March 2019 Published: 03 April 2019

#### Citation:

Zhang J, Wei Z, Cardinale CJ, Gusareva ES, Van Steen K, Sleiman P, International IBD Genetics Consortium and Hakonarson H (2019) Multiple Epistasis Interactions Within MHC Are Associated With Ulcerative Colitis. Front. Genet. 10:257. doi: 10.3389/fgene.2019.00257

Searching successfully for epistasis is very challenging and generally requires very large sample sizes and/or very dense marker information. We exploited the largest Crohn's disease (CD) dataset (18,000 cases + 34,000 controls) and ulcerative colitis (UC) dataset (14,000 cases + 34,000 controls) to date. Leveraging the dense marker information and the large sample size of this IBD dataset, we employed a two-step approach to exhaustively search for epistasis. We detected abundant genome-wide significant (p < 1 × 10<sup>−13</sup>) epistatic signals, all within the MHC region. These signals were reduced substantially when conditioned on the additive background, but nine pairs still remained significant at the Immunochip-wide level (P < 1.1 × 10<sup>−8</sup>) in conditional tests for UC. All nine epistatic interactions come from the MHC region, and each explains on average 0.15% of the phenotypic variance. Eight of them were replicated in a replication cohort. There are multiple but relatively weak interactions independent of the additive effects within the MHC region for UC. Our promising results warrant the search for epistasis in large data sets with dense markers, exploiting dependencies between markers.

Keywords: epistasis, genome-wide association study, immunochip, major histocompatibility complex, ulcerative colitis

#### INTRODUCTION

Genome-wide association studies (GWAS) have been conducted widely to interrogate the genetic architecture of common and complex diseases (McCarthy et al., 2008). For Crohn's disease (CD) and ulcerative colitis (UC), the two common forms of inflammatory bowel disease (IBD), GWAS have been fruitful in identifying their susceptibility loci with independent, additive, and cumulative effects (Franke et al., 2010; Anderson et al., 2011; Jostins et al., 2012). Like most other GWAS, these studies have employed a single-locus analysis strategy, namely, testing the variants one at a time for association. Complementary to single-locus analysis, searching for gene-gene interactions, or epistasis, has also attracted extensive research interest in the past decades (Cordell, 2009). However, in contrast to the fruitful identification of independent additive effects, success in searching for epistasis has been very limited so far.

For IBD, epistasis in CD was once searched for in an exhaustive epistatic SNP association analysis of the expanded Wellcome Trust data on seven complex diseases. However, no significant epistasis in CD was identified (Lippert et al., 2013). Indeed, searching for epistasis is much more challenging than detecting additive effects for various reasons, including weaker capture of linkage disequilibrium by a pair of tagging SNPs, increased model complexity, and the curse of dimensionality. Very dense marker information and very large sample sizes are therefore required to overcome these challenges (Wei et al., 2014), as are standardized analysis protocols (Gusareva and Van Steen, 2014).

The Immunochip®, a custom Illumina genotyping microarray, is designed to perform both deep replication of suggestive associations and fine mapping of established GWAS significant loci of major autoimmune and inflammatory diseases (Parkes et al., 2013). For each disease, about 3,000 top-ranked SNPs are selected from available GWAS data. At loci with established disease associations, it includes all known SNPs in the dbSNP database, from the 1000 Genomes project (Feb. 2010 release), and from any other sequencing initiatives that were available to the consortium. As a result, it has in total 196,524 variants, including 718 small insertions/deletions and 195,806 SNPs. Thus, it provides a more comprehensive catalog of the most promising candidate variants by picking up the remaining common variants and rare variants that were missed in the first generation of GWAS. Recently, using three large Immunochip datasets, Wei et al. confirmed multiple interactions within the major histocompatibility complex (MHC) and reported novel non-MHC epistatic signals of suggestive significance in their analyses of epistasis in rheumatoid arthritis (Wei et al., 2016).

Here we used the largest data set to date for IBD, compiled by the International IBD Genetics Consortium from its members' Immunochip projects, to examine epistasis in IBD. Leveraging the dense marker information and the large sample size of this IBD dataset, we searched for epistasis in the hope of identifying gene-gene interactions for IBD that were missed in previous single-locus analyses.

# MATERIALS AND METHODS

#### Subjects, Genotyping, and Quality Control

We used the large IBD cohort samples from the International IBD Genetics Consortium. These cohort samples have been described in detail elsewhere (Jostins et al., 2012). Briefly, a total of 68,427 samples were recruited from 15 European countries, including 18,227 CD cases with 34,050 CD controls and 14,224 UC cases with 33,954 UC controls, and typed by 11 different genotyping centers on the Immunochip. As shown in **Table 1**, we randomly split the dataset into a discovery cohort and a replication cohort of approximately equal size (see **Tables S1, S2** for details). We refer to the MHC as residing between 28.7 and 34.0 Mb on chromosome 6, based on SNP genomic locations in the GRCh38/hg38 version throughout.
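As a minimal sketch of the two bookkeeping steps just described, the random cohort split and the MHC coordinate window could be expressed as follows. The function names, the fixed seed, and the unstratified split are assumptions made for illustration; the source does not say how the split was implemented.

```python
import random

# MHC window as defined in the text: chromosome 6, 28.7-34.0 Mb.
MHC_CHROM = "6"
MHC_START = 28_700_000
MHC_END = 34_000_000


def split_cohort(sample_ids, seed=0):
    """Randomly split samples into two halves (hypothetical helper; seed is arbitrary)."""
    ids = sorted(sample_ids)              # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]         # (discovery, replication)


def in_mhc(chrom, pos):
    """Return True if a variant position falls inside the MHC window."""
    name = str(chrom)
    if name.startswith("chr"):
        name = name[3:]
    return name == MHC_CHROM and MHC_START <= pos <= MHC_END


discovery, replication = split_cohort(range(10))
print(len(discovery), len(replication))
print(in_mhc("chr6", 32_000_000))
```

In practice one might stratify the split by case/control status or batch; that refinement is omitted here.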

The IBD dataset used has gone through rigorous quality control (QC) by the IBD consortium (Jostins et al., 2012). Briefly, initial SNP QC was conducted by removing SNPs that fail Hardy-Weinberg equilibrium (HWE) tests across the entire collection or within each batch, SNPs that have significant missing genotypes across the entire collection or within each batch, and SNPs that have different missing genotype rates in cases vs. controls. Sample QC followed, removing individuals with a high missing genotype rate, individuals with an outlying heterozygosity rate, and duplicated/related individuals. Then, another round of SNP QC was conducted by removing SNPs that show heterogeneous allele frequencies across batches, and SNPs not identified in 1000G project phase 2. For our epistasis analysis, we performed further QC by filtering out markers with Hardy-Weinberg equilibrium P < 10<sup>−6</sup> or minor allele frequency < 10<sup>−5</sup>, which resulted in 149,532 and 150,424 markers for CD and UC, respectively.
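The final marker filter can be illustrated with a short sketch that works from genotype counts. It assumes a 1-degree-of-freedom chi-square HWE test for simplicity (tools such as PLINK typically use an exact test instead), and the function names are hypothetical.

```python
import math


def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium; a simplification of the exact test."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(counts, hwe_min_p=1e-6, maf_min=1e-5):
    """Keep a marker unless HWE P < 1e-6 or MAF < 1e-5 (thresholds from the text)."""
    return hwe_chisq_pvalue(*counts) >= hwe_min_p and maf(*counts) >= maf_min


print(passes_qc((250, 500, 250)))   # genotype counts in exact HWE proportions
print(passes_qc((500, 0, 500)))     # no heterozygotes at all
```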

#### Statistical Analysis

We employed PLINK (Purcell et al., 2007) with default parameters (i.e., "--logistic --hide-covar --adjust") to perform a GWAS in each cohort using a logistic regression model with 5 principal components and the batch indicators as covariates. The consensus genome-wide significance threshold of 5 × 10<sup>−8</sup> was applied. We expected many of these genome-wide significant SNPs to be correlated. To obtain independent signals, we pooled the selected SNPs (P < 5 × 10<sup>−8</sup>) and re-fit the logistic regression model using all of them jointly. Among fully correlated SNPs, only one was kept. We then considered the SNPs with P < 0.05 in this joint logistic regression, fitted with all marginally genome-wide significant SNPs, to be independent additive signals. As we describe later, these significant independent GWAS SNPs are used as the additive background against which we screen the SNP pairs we identify, to ensure that they are truly novel epistasis signals not captured by any single additive signal.

We used a 2-step approach to detect epistasis. First, we utilized the fast approximate tests provided by BOOST (Wan et al., 2010) to screen candidate gene-gene interactions. BOOST can quickly perform a full pairwise screening, without correction for covariates, in the discovery cohort. The BOOST P-values were approximate, and we retained all possible epistatic pairs of SNPs with an interaction P < 10<sup>−10</sup> and r<sup>2</sup> < 0.2. We subsequently computed accurate P-values for the retained SNP pairs using a full logistic regression model accounting for population stratification and batch effect covariates. The epistatic interaction was tested using a likelihood ratio test with 4 degrees of freedom as previously described (Gyenesei et al., 2012). Following Wei et al. (2016), we adopted two P-value thresholds: 10<sup>−13</sup> for claiming genome-wide significant epistatic SNP pairs (Wei et al., 2014), and 1.1 × 10<sup>−8</sup> for claiming Immunochip-wide 5% significance.
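The 4-degree-of-freedom likelihood ratio test can be sketched for a single SNP pair as below. This is an illustrative simplification, not the authors' implementation: it works on a 3 × 3 table of case and total counts, omits the principal-component and batch covariates, and fits both models with a small pure-Python Newton-Raphson routine. All function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def probs(design, beta):
    """Fitted case probability for each genotype cell."""
    return [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row)))) for row in design]


def max_loglik(design, cases, totals, n_iter=50):
    """Grouped logistic regression by Newton-Raphson; returns the maximised log-likelihood."""
    n_par, cells = len(design[0]), range(len(design))
    beta = [0.0] * n_par
    for _ in range(n_iter):
        mu = probs(design, beta)
        grad = [sum((cases[i] - totals[i] * mu[i]) * design[i][j] for i in cells)
                for j in range(n_par)]
        hess = [[sum(totals[i] * mu[i] * (1 - mu[i]) * design[i][j] * design[i][k] for i in cells)
                 for k in range(n_par)] for j in range(n_par)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    mu = probs(design, beta)
    return sum(cases[i] * math.log(mu[i]) + (totals[i] - cases[i]) * math.log(1 - mu[i])
               for i in cells)


def epistasis_lrt(cases, totals):
    """LRT of the 4 interaction terms; cases[i][j] and totals[i][j] index genotypes 0, 1, 2."""
    reduced, full, y, n = [], [], [], []
    for i in range(3):
        for j in range(3):
            main = [1.0, float(i == 1), float(i == 2), float(j == 1), float(j == 2)]
            inter = [float(i == a and j == b) for a in (1, 2) for b in (1, 2)]
            reduced.append(main)
            full.append(main + inter)
            y.append(cases[i][j])
            n.append(totals[i][j])
    stat = max(2.0 * (max_loglik(full, y, n) - max_loglik(reduced, y, n)), 0.0)
    p_value = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)  # chi-square survival function, 4 df
    return stat, p_value


totals = [[1000] * 3 for _ in range(3)]
cases = [[500, 500, 500], [500, 500, 500], [500, 500, 900]]  # excess risk only in one cell
stat, p = epistasis_lrt(cases, totals)
print(p < 1.1e-8)  # compare with the Immunochip-wide threshold from the text
```

The closed-form tail probability is valid only because the test has an even number of degrees of freedom; with covariates in the model, a general-purpose regression library would be the more practical choice.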

Finally, to identify significant epistatic SNP pairs conditioning on the additive background, we added the selected independent GWAS SNPs as covariates into the logistic regression model for testing the epistasis. Variance explained by the selected epistatic SNP pairs was estimated using a full logistic regression model including all the covariates, independent SNPs and SNP pairs. The SNP pairs identified in the discovery cohort were tested similarly in the replication cohort. We considered a pair to be directly replicated if its epistatic P-value remained <0.05 after conditioning on the additive background.

#### RESULTS

The univariate GWAS scan of the discovery cohorts identified 2,765 and 1,123 genome-wide significant SNPs (P < 5 × 10<sup>−8</sup>) for CD and UC, respectively. After the further independence screening, we obtained 306 and 121 independent SNPs for CD and UC, respectively. Detailed information about the SNPs selected in each stage is presented in the **Supplemental Data**. The full pairwise scan by BOOST produced 13,843 and 35,373 candidate pairs of SNPs that have BOOST interaction P < 10<sup>−10</sup> and r<sup>2</sup> < 0.2 for CD and UC, respectively. We computed their accurate P-values, and detected 11 and 513 genome-wide significant pairs (P < 10<sup>−13</sup>) for CD and UC, respectively. Conditioning on the additive background of the 306 independent CD SNPs, none of the 11 CD pairs remained significant (smallest P = 3.0 × 10<sup>−4</sup>). For UC, we obtained 9 pairs significant at the Immunochip-wide level (P < 1.1 × 10<sup>−8</sup>), and all of them came from the MHC region. Conditional on the additive background, these epistatic pairs jointly explain an additional 0.49% of the phenotypic variance on the observed scale, of which 0.36% is explained by interactions only, suggesting that these interaction effects were not negligible jointly, but weak individually (i.e., on average explained 0.15% of the phenotypic variance). Except for the pair of lowest significance, the top 8 of these 9 pairs were replicated in the independent replication cohort with P < 0.05 (see **Table 2** and **Figure S1** for details).

Following Hemani et al. (2014), we decompose the genetic effects of each of the SNP pairs into orthogonal additive (A1, A2), dominant (D1, D2) and epistatic effects (A1xA2, A1xD2, D1xA2, D1xD2), and then display them (regression coefficients) as a heatmap (**Figure 1**). For these interactions, we observe that the epistatic effects (A1xA2, A1xD2, D1xA2, D1xD2) generally act in the opposite direction to the main effects (A1, A2, D1, D2). In addition, we note that the effects across the discovery and replication cohorts are largely concordant.
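To make the eight terms concrete, the sketch below builds them for one genotype pair using a simple textbook coding: additive as the centred allele count and dominance as a heterozygote indicator. This coding is an assumption for illustration; an orthogonal parameterisation such as the one the text refers to would additionally adjust the codes by genotype frequencies.

```python
def effect_terms(g1, g2):
    """Additive, dominance and epistatic codes for genotypes g1, g2 in {0, 1, 2}.

    Simple unorthogonalised coding, used here only for illustration.
    """
    a1, a2 = g1 - 1, g2 - 1                  # additive: -1, 0, 1
    d1, d2 = int(g1 == 1), int(g2 == 1)      # dominance: 1 for heterozygotes
    return {
        "A1": a1, "A2": a2, "D1": d1, "D2": d2,
        "A1xA2": a1 * a2, "A1xD2": a1 * d2,
        "D1xA2": d1 * a2, "D1xD2": d1 * d2,
    }


# One design-matrix row per two-locus genotype (9 rows, 8 effect columns).
design = {(g1, g2): effect_terms(g1, g2) for g1 in range(3) for g2 in range(3)}
print(design[(2, 1)])
```

Regressing disease status on these eight columns (plus an intercept) yields one coefficient per term, which is the kind of quantity a heatmap of effects would display.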

All 10 SNPs contributing to the 8 epistatic pairs are non-coding, with rs3852215 close to HLA-DQB1, and rs6928482 and rs7744001 close to HLA-DQB1-AS1. It has been known for some time that the single strongest genetic association for IBD is the HLA-DRB1\*0103 allele, which is located within the MHC region (Silverberg et al., 2003). A recent study by Goyette et al. demonstrated the importance of HLA-DRB1\*0103 in both CD and UC by genotyping 7,406 MHC SNPs in 32,000 IBD cases and an equal number of controls (Goyette et al., 2015). The fine mapping resolution allowed localization of the association signal to specific amino acid substitutions in the MHC molecule, which revealed that the causal variants are located within the peptide binding groove and thereby influence antigen presentation directly (Goyette et al., 2015). The mechanism by which these mutations produce autoimmunity could be by enabling self- or commensal-antigenic peptides to bind and be presented to helper T cells. The non-coding SNPs identified in our study, which have gene-gene interactions with these high-risk variants, could affect transcriptional regulation of the high-risk MHC molecules themselves, enabling greater amounts of self or commensal antigens to be presented in a differential manner and provoking an inflammatory response.

# DISCUSSION

In this study we present results from an epistatic analysis of a large data set from the International IBD Genetics Consortium genotyped on the Immunochip array. Most previous studies have used relatively small sample sizes (with <2,000 cases and 3,000 controls) and a GWAS array (Wan et al., 2010; Lippert et al., 2013), while here we had the largest IBD dataset to date, with a much increased sample size (14,000+ cases and 30,000+ controls) genotyped on the high-density Immunochip. The large sample size has enabled us to identify genome-wide significant interactions within the MHC region for UC for the first time. All 8 replicated epistasis signals are local interactions (within a distance <1 Mb), which is consistent with the recent finding that examining local interactions between SNPs that are closely located but in low LD (r<sup>2</sup> < 0.2) could increase the power to detect missing variants and/or functional interactions (Wei et al., 2013, 2014). We note that 3 of these 8 interaction signals are genome-wide significant and the other 5 remain Immunochip-wide significant after conditioning on the additive background. These independent interactions, each with a substantial effect, were statistically replicated in an independent replication cohort. These results confirm the increased level of complexity in the entire MHC region, as also observed in rheumatoid arthritis (Wei et al., 2016); namely, there can still exist additional epistatic interactions over and above the well-established multiple independent MHC signals. The SNP resolution provided by the Immunochip should be able to capture most MHC diversity. Even if some of the additive background is derived imperfectly, the large sample size of the IBD dataset should compensate and lend sufficient power for capturing the additive background comprehensively. Therefore, the 8 epistatic pairs we report here should be independent of the additive background.
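The notion of a local, low-LD pair used above can be sketched as a simple predicate. Here r<sup>2</sup> is approximated by the squared Pearson correlation of genotype dosages (0, 1, 2), both SNPs are assumed to lie on the same chromosome, and the helper names are hypothetical.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:       # monomorphic marker: correlation undefined, report 0
        return 0.0
    return cov * cov / (v1 * v2)


def is_local_low_ld(pos1, pos2, g1, g2, max_dist=1_000_000, max_r2=0.2):
    """True if two same-chromosome SNPs are < 1 Mb apart and have r^2 < 0.2."""
    return abs(pos1 - pos2) < max_dist and r_squared(g1, g2) < max_r2


snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 1, 2, 2, 1, 0]
print(is_local_low_ld(32_100_000, 32_600_000, snp_a, snp_b))
```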

Finally, we would like to point out that multiple approaches to screening genome-wide data for epistasis exist. In this work, we developed a protocol that led to replicable statistical results. Notably, small changes in the protocol (including the use of prior biological knowledge about the disease, LD pruning, analytic methodology, correction for population structure or multiple testing) may give rise to widely varying results (Bessonov et al., 2015). More work is needed to investigate the relationships among MHC susceptibility genes and between these genes and other genomic regions, aiding the hunt for non-MHC driven epistasis.

In conclusion, by leveraging the large sample size available through the International IBD Genetics Consortium genotyped on the Immunochip, we have identified and replicated concordant epistatic interactions within the MHC region for UC. Further examination of these identified epistatic interactions may help to understand the molecular mechanisms underlying epistatic interactions with the MHC locus and their contributions to immunological diseases. Our promising results warrant the search for epistasis in large data sets to address the missing heritability in complex disease. Optimal epistasis analysis protocols need to be derived in order to exploit the richness potentially harbored by dense marker panels.

#### DATA AVAILABILITY

Data have been deposited in NCBI's database of Genotypes and Phenotypes (dbGaP) through study accession numbers phs000130.v1.p1 and phs000345.v1.p1.

# AUTHOR CONTRIBUTIONS

ZW and HH conceptualized and led the study. JZ and ZW performed the experiments and analyses. JZ, ZW, CC, ESG, KVS, and HH contributed to writing. PS and HH contributed to samples and phenotypes.

#### FUNDING

This study was supported by Institutional Development Fund from The Children's Hospital of Philadelphia to the Center for Applied Genomics, and The Children's Hospital of Philadelphia Endowed Chair in Genomic Research to HH.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00257/full#supplementary-material


**Conflict of Interest Statement:** JZ is employed by company Adobe.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer XY declared a shared affiliation, with no collaboration, with one of the authors, PS, to the handling editor at time of review.

Copyright © 2019 Zhang, Wei, Cardinale, Gusareva, Van Steen, Sleiman, International IBD Genetics Consortium and Hakonarson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# UltraStrain: An NGS-Based Ultra Sensitive Strain Typing Method for Salmonella enterica

Wenxian Yang<sup>1†</sup>, Lihong Huang<sup>2†</sup>, Chong Shi<sup>2</sup>, Liansheng Wang<sup>2</sup>\* and Rongshan Yu<sup>1,2</sup>\*

*<sup>1</sup> Aginome-XMU Joint Lab, Xiamen University, Xiamen, China, <sup>2</sup> School of Information Science and Engineering, Xiamen University, Xiamen, China*

In the last few years, advances in next-generation sequencing (NGS) technology for whole genome sequencing (WGS) of foodborne pathogens have provided drastic improvements in food pathogen outbreak surveillance. WGS of foodborne pathogens enables identification of pathogens from food or environmental samples, including difficult-to-detect pathogens in culture-negative infections. Compared to traditional low-resolution methods such as pulsed-field gel electrophoresis (PFGE), WGS can differentiate even closely related strains of the same species, and thus enables rapid identification of the food source associated with a pathogen outbreak event for a fast mitigation plan. In this paper, we present UltraStrain, a fast and ultra sensitive pathogen detection and strain typing method for *Salmonella enterica* (*S. enterica*) based on WGS data analysis. In the proposed method, a noise filtering step is first performed, in which the raw sequencing data are mapped to a synthetic species-specific reference genome generated from *S. enterica* specific marker sequences to avoid potential interference from closely related species for low spike samples. After that, a statistical learning based method is used to identify candidate strains, from a database of known *S. enterica* strains, that best explain the retained *S. enterica* specific reads. Finally, a refinement step is performed by mapping all the reads before filtering onto the identified top candidate strains, and recalculating the probability of presence for each candidate strain. Experimental results using both synthetic and real sequencing data show that the proposed method is able to identify the correct *S. enterica* strains from low-spike samples, and outperforms several existing strain-typing methods in terms of sensitivity and accuracy.

Keywords: metagenomes, next-generation sequencing (NGS), whole genome sequencing (WGS), Salmonella enterica, strain typing

#### Edited by:

*Tao Zeng, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences (CAS), China*

#### Reviewed by:

*Cuncong Zhong, University of Kansas, United States Xiangtao Liu, The University of Iowa, United States*

#### \*Correspondence:

*Liansheng Wang lswang@xmu.edu.cn Rongshan Yu rongshanyu@ieee.org*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *21 December 2018* Accepted: *12 March 2019* Published: *03 April 2019*

#### Citation:

*Yang W, Huang L, Shi C, Wang L and Yu R (2019) UltraStrain: An NGS-Based Ultra Sensitive Strain Typing Method for Salmonella enterica. Front. Genet. 10:276. doi: 10.3389/fgene.2019.00276*

## 1. INTRODUCTION

Rapid pathogen identification is one of the most important issues for microbial community studies for infectious diseases and food security. It is reported that in the United States alone, each year 31 major pathogens cause 9.4 million episodes of foodborne illness, resulting in 55,961 hospitalizations and 1,351 deaths (Scallan et al., 2011). Foodborne illness poses a \$77.7 billion economic burden in the United States annually, excluding indirect costs to the food industry such as reduced consumer confidence, recall losses, or litigation (Mandernach et al., 2013). The faster the sources linked with the outbreak being investigated are identified, the faster the outbreak can be stopped, limiting the potential loss it may cause.

A large number of laboratory (in vitro) tools have been developed over the past decades for pathogen identification to assist the diagnosis, treatment, and monitoring of infectious diseases. Traditionally, in vitro diagnostics of infectious diseases have been performed using culture-based testing, which usually yields diagnostic results in days. In addition, cultivation of bacteria is not always successful under laboratory conditions due to possibly unsuitable methods. In recent years, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) based molecular assays (Barghouthi, 2011) have become more routine. A DNA-based in vitro assay may take the form of a quantitative or qualitative polymerase chain reaction (PCR) assay where the target for detection is a pathogen-specific gene or an antimicrobial resistance marker. The most common bacterial broad-range PCR methods use primers that recognize conserved DNA sequences of bacterial genes that encode ribosomal RNA (rRNA 16S or 23S) (Greisen et al., 1994). Such methods allow the detection of multiple targets in a single experiment and are faster and more sensitive than culture-based methods. However, these targeted approaches require the clinician's a priori knowledge of the potential targets to order the appropriate diagnostic tests.

The application of NGS in metagenomics has revolutionized the field of microbial ecology and greatly facilitates the identification and classification of microbes. The enormous increase in sequencing throughput has enabled the adoption of metagenomic sequencing approaches in which highly complex communities of microorganisms are sequenced in parallel. Compared to the traditional culture-based and assay-based approaches, metagenomic approaches are less biased because they do not require any a priori knowledge of the sample composition. Clinical samples may contain a mixture of microbes with varying levels of constituents and additional DNA from a host organism. Metagenomic sequencing data obtained from such samples provide a qualitative and quantitative profile of the individual components of the respective microbial community. Genus, species and even strain-level taxonomic assignments of microorganisms, as well as their relative abundance, could potentially be obtained. For example, metagenomic sequencing data can identify infections with a specific pathogen strain (Maxson and Mitchell, 2016). It also allows the detection and identification of antibiotic resistance genes and virulence factors in complex samples (Jitwasinkul et al., 2016). The ability to rapidly characterize and identify the entire microbial composition of a complex sample provides a unique and novel strategy for pathogen detection and identification in diagnosis and outbreak investigation of infectious diseases, or to guide treatment options.

On the other hand, metagenomic data bring new challenges for downstream analysis and biologically meaningful interpretation. First of all, the vast amount of sequencing data, which can contain billions of short reads, leads to long processing times. The short read length and low coverage result in many short contigs and unassembled sequences, leading to the prediction of a large number of small, fragmented genes which may not exhibit any matches in the reference sequence database, or match only with low confidence. The second challenge lies in the sample complexity (Rose et al., 2015), as the target pathogens could be surrounded by a complex background of commensal organisms at a range of abundances, in addition to host nucleic acids. In addition, problems arise from variation between similar subspecies, genomic sequence similarity between different species, differences in abundance among species in a sample, and different sequencing depths for individual species.

In pathogen identification from metagenome data, strain-level bacterial typing from uncultured food samples is an especially challenging task. Advances in metagenome bioinformatics over the last decade have refined the resolution of microbial community taxonomic profiling from phylum to species, but it is still challenging to characterize microbes in communities at the strain level (Truong et al., 2017). Strain typing distinguishes between different strains of the same species, and is more valuable than species-level typing in a number of specialized fields including epidemiology. More specifically, strain typing helps to trace the source of food poisoning and relate individual cases to an outbreak of infectious disease. Strain-level variants within microbial species are crucial in determining their functional capacities within the human microbiome (Truong et al., 2017). Strain typing of a single genome has been well studied (Li et al., 2009). However, the tools built under the assumption of assembling a single genome often underperform when used for complex metagenome assemblies. Salmonella is a diverse genus of Gram-negative bacilli and a major foodborne pathogen responsible for more than a million illnesses annually in the United States alone. In particular, strain typing for foodborne pathogens such as S. enterica is of special interest and importance (Bell et al., 2016). Methods specific for Salmonella detection and identification have been proposed in the literature, including serotyping (Zhang et al., 2015; Yachison et al., 2017), multilocus sequence typing (MLST) (Ranjbar et al., 2017), and strain typing (Hong et al., 2014b; Wood and Salzberg, 2014; Ahn et al., 2015; Truong et al., 2015). However, as different S. enterica strains share many common genome regions that are very similar to those from other bacteria in food samples, the accuracy of traditional strain typing methods is not satisfactory, especially when the target strain has very low abundance.

In this paper, we introduce UltraStrain, a highly sensitive strain typing method based on shotgun sequencing data. The method exploits the concept of species-specific marker genes (Segata et al., 2012), which are used as genetic proxies of species, to efficiently extract high-confidence S. enterica reads from the metagenomics sample; strain typing is then performed against a large S. enterica reference database using these high-confidence reads. More specifically, in UltraStrain, we first perform a noise filtering step to remove ambiguous reads that may come from bacteria or species other than S. enterica. This is done by mapping the raw shotgun sequencing reads to a synthetic reference genome that contains only genome regions specific to S. enterica, and keeping only reads that map to the synthetic reference genome according to certain criteria. After that, we compare the resulting high-confidence S. enterica specific reads against a pool of known S. enterica strains, and formulate strain identification as a statistical learning problem: estimating, for each S. enterica strain, the probability that it would produce those reads if it were present in the original sample. A preliminary version of UltraStrain was used in our submission to PrecisionFDA's CFSAN Pathogen Detection Challenge in 2018 and was one of the top performers in this competition (https://precision.fda.gov/challenges/2/view/results).

# 2. RELATED WORK

Taxonomic profiling of metagenome data can be done by aligning every read to a large database of genomic sequences using BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi). However, this is generally not clinically practical because of the large volume of data. Other methods for strain typing from metagenome data include de novo assembly based methods and mapping based methods. Depending on how the reference sequence library is constructed, mapping based methods further include k-mer and marker-gene based methods, and those that map reads to full reference genomes.

Metagenomic assembly of single isolates can be used to identify strains of uncharacterized species with high sensitivity. Strain level metagenomic assembly methods, such as the Lineage (O'Brien et al., 2014) and the DESMAN algorithms (Quince et al., 2017), typically use contig binning and statistical analysis of base frequencies across different strains in the sample to resolve ambiguities. The intuition is that the frequencies of variants associated with a strain fluctuate with the abundance of that strain. However, metagenomic assembly for multiple strains is computationally challenging. In addition, especially for complex clinical samples in which multiple similar strains co-exist, it is generally impossible for assembly based methods to achieve high accuracy at the strain level because of the conserved regions between strains. Instead, direct assembly of multiple similar strains tends to produce highly fragmented assemblies which represent aggregates of multiple similar strains. Therefore, it is difficult to generalize assembly-based approaches to large sets of metagenomes and low abundance microbes.

Mapping based methods align the reads to a target reference library and apply statistical and probabilistic analysis techniques to the alignment results to identify the multiple strains that are present in the sample. Raw reads of a metagenome can be aligned against full reference genomes for microbe identification if the library of target reference genomes can be constructed. Short read alignment-based methods can achieve high accuracy in strain level identification and are considerably faster than metagenome assembly based methods. Sigma (Ahn et al., 2015) is a read mapping based method that maps the metagenomic dataset onto a user-defined database of reference genomes. A probabilistic model is used to identify and quantify genomes, and the reads are assigned to their most likely reference genomes for variant calling. PathoScope2 (Hong et al., 2014b) builds a complete pipeline for taxonomic profiling and abundance estimation from metagenomic data, integrating modules for read quality control (Hong et al., 2014a), reference library preparation, filtering of host and non-target reads (Byrd et al., 2014), alignment, and Bayesian statistical inference to estimate the posterior probability profiles of identified organisms (Francis et al., 2013). It can quantify the proportions of reads from individual microbial strains in metagenomic data from environmental or clinical samples.
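The general idea behind probabilistic read assignment can be illustrated with a generic expectation-maximisation (EM) sketch for a mixture model. This is not the specific algorithm of Sigma or PathoScope2; it only assumes that each read comes with a likelihood of having originated from each reference genome, and the function name is hypothetical.

```python
def em_abundance(likelihoods, n_iter=100):
    """Estimate genome proportions from per-read likelihoods with a basic mixture EM.

    likelihoods[r][g] is the (relative) likelihood that read r came from genome g.
    """
    n_genomes = len(likelihoods[0])
    pi = [1.0 / n_genomes] * n_genomes          # start from uniform proportions
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        for row in likelihoods:
            weights = [p * l for p, l in zip(pi, row)]
            total = sum(weights)
            if total == 0:                      # read matches no genome: skip it
                continue
            for g, w in enumerate(weights):     # E-step: fractional assignment
                counts[g] += w / total
        norm = sum(counts)
        pi = [c / norm for c in counts]         # M-step: renormalise
    return pi


# Two reads unique to genome 0, one ambiguous read, one unique to genome 1.
reads = [[1, 0], [1, 0], [1, 1], [0, 1]]
print(em_abundance(reads))
```

The ambiguous read is shared between the two genomes in proportion to their current estimated abundances, which is what lets uniquely mapped reads disambiguate multi-mapped ones.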

To speed up the alignment process, the reference library may contain only the parts of the reference genomes that have differentiating power among different but closely related strains. In such methods, metagenomic reads are aligned to a set of preselected marker sequences, e.g., k-mers, marker genes, or even pangenomes, and each read is assigned to its most likely origin according to the alignment results. The taxonomic classification can be inferred from phylogenetic distances to these marker sequences. These methods differ in terms of the selection of the markers and the probabilistic algorithms for read assignment. The performance also heavily depends on the completeness of the reference database, and how the marker sequences are extracted.

Kraken (Wood and Salzberg, 2014) is a fast k-mer based method for metagenomic sequence classification. Kraken builds a database that contains records consisting of a k-mer and the lowest common ancestor (LCA) of all organisms whose genomes contain that k-mer. The database is built from a user-specified library of genomes and allows quick look-up of the most specific node in the taxonomic tree, leading to fast and accurate strain identification. StrainSeeker (Roosaare et al., 2017) constructs a list of specific k-mers for each node of a given guide tree, whose leaves are all the strains, and analyzes the observed and expected fractions of node-specific k-mers to test the presence of each node in the sample. MetaPhlAn (Segata et al., 2012) is a taxonomic profiling method using marker genes. The method estimates the relative abundance of microbial cells by mapping reads against a reduced set of clade-specific marker sequences that unequivocally identify specific microbial clades at the species level and cover all of the main functional categories. MetaPhlAn2 (Truong et al., 2015) further extends the reference library from species level markers to subspecies markers that enable strain-level analysis, and increases the accuracy of taxonomic composition reconstruction. PanPhlAn (Scholz et al., 2016) builds a pangenome of the species of interest by extracting all genes from available reference genomes and merging them into gene family clusters. The method then leverages gene family co-abundance within a metagenomic sample to identify strain-specific gene repertoires, with the assumption that single-copy genes from the same genome should have comparable sequencing coverage within the sample.
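The k-mer-to-LCA idea can be shown with a toy sketch. The taxonomy, the genome strings, and the helper names below are invented for illustration, and the read classification is reduced to counting k-mer hits per taxon, which is much simpler than what a real classifier such as Kraken does.

```python
from collections import Counter

# Toy taxonomy (child -> parent); all names are hypothetical.
PARENT = {
    "root": None,
    "Salmonella": "root", "Escherichia": "root",
    "strainA": "Salmonella", "strainB": "Salmonella", "ecoli": "Escherichia",
}


def lineage(taxon):
    """Path from a taxon up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa."""
    ancestors = set(lineage(a))
    return next(t for t in lineage(b) if t in ancestors)


def kmers(seq, k):
    """Yield every k-mer of a sequence, in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_db(genomes, k):
    """Map each k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for km in kmers(seq, k):
            db[km] = lca(db[km], taxon) if km in db else taxon
    return db


def classify(read, db, k):
    """Count how many k-mers of a read hit each taxon (simplified scoring)."""
    return Counter(db[km] for km in kmers(read, k) if km in db)


genomes = {"strainA": "ACGTACGGTT", "strainB": "ACGTACGCTT", "ecoli": "TTTTGGGGCC"}
db = build_db(genomes, k=4)
print(db["ACGT"], db["ACGG"])      # shared k-mer vs strain-specific k-mer
print(classify("ACGGTT", db, k=4))
```

K-mers shared by both toy strains resolve only to the genus node, which mirrors why reads from conserved regions carry no strain-level information.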

# 3. METHODS

In this paper, we present an ultra sensitive pipeline for S. enterica strain typing from metagenomics samples based on NGS data analysis. The processing modules involved in the proposed pipeline are illustrated in **Figure 1**. The major components of the pipeline include quality control (QC), reads filtering and strain identification.

#### 3.1. Quality Control

The first step of the metagenomic sequencing data processing is quality control (QC). The QC procedure usually includes identification and filtration of sequencing artifacts such as low-quality reads and contaminating reads, which would significantly affect and sometimes mislead downstream analysis. In our method, we apply fastp (v.0.19.4; http://opengene.org/fastp/fastp) (Chen et al., 2018) to trim the reads at the front and the tail. For all the raw reads used in our experiments, we trim the front of both reads in a pair with the fastp options (-f 15 -F 15), and perform per-read cutting by quality at the tail (--cut\_by\_quality3).

# 3.2. Reads Filtering

Metagenomics samples could be contaminated with DNA from host genomes or commensal species. Such background noise will often dominate metagenomics samples, which can swamp out target signal, resulting in inaccurate analysis and even leading to incorrect strain identification results. To mitigate this issue, in this step we filter out reads that are not specific to S. enterica to minimize potential false positive results in strain identification. This is achieved by aligning the reads after QC to a synthetic reference genome which is composed of S. enterica specific regions. Only the properly mapped paired reads that meet certain criteria will be retained for further analysis. The read filtering module consists of the following two steps.

#### 3.2.1. Generating a Synthetic Reference Genome

We follow the method in Laing et al. (2017) to identify species-specific regions for S. enterica. First, Panseq (Laing et al., 2010) is used to identify regions of 1,000 bp from closed S. enterica genomes in GenBank. These regions are then screened against the online GenBank non-redundant (nr) database to filter out genomic regions that are also present in other bacterial genomic sequences. The resulting 403 regions, 1,000 bp each, are identified as marker genomic regions that represent the S. enterica species.

These regions are concatenated into a single sequence to create a synthetic reference genome that represents the S. enterica species. During the concatenation, we insert "separating regions" of repeating N's between adjacent regions, as shown in **Figure 1B**. The purpose of inserting such separating regions is to avoid the unfavorable case in which a read is mapped to a subsequence of the synthetic reference genome that overlaps two different S. enterica specific regions. The length of the separating regions, i.e., the number of N's, can be set to one more than the maximum read length. In our experiments, we use a conservatively large value of 500. The resulting synthetic reference genome is then used to identify reads from the shotgun sequencing data that can be mapped to unique S. enterica genome regions for further strain typing.
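The concatenation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_synthetic_reference`, the toy sequences, and the returned interval list are hypothetical and are not taken from the UltraStrain implementation.

```python
def build_synthetic_reference(regions, spacer_len=500):
    """Concatenate marker regions, separated by runs of N.

    `regions` is a list of DNA strings (hypothetical input format).
    Returns the concatenated sequence and the 0-based, half-open
    interval that each region occupies in it.
    """
    spacer = "N" * spacer_len
    sequence = spacer.join(regions)

    intervals = []
    start = 0
    for region in regions:
        intervals.append((start, start + len(region)))
        # Skip over the region itself and the spacer that follows it.
        start += len(region) + spacer_len
    return sequence, intervals


# Toy example with three short "regions" and a 5-base spacer.
toy_regions = ["ACGT" * 3, "GGCC" * 2, "TTAA"]
toy_sequence, toy_intervals = build_synthetic_reference(toy_regions, spacer_len=5)
```

Because each spacer is longer than any read, an aligned read cannot cover bases from two different marker regions, which is the property the separating regions are meant to guarantee.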

#### 3.2.2. Read Filtering Through Alignments

We align the sample reads after QC to the synthetic reference genome using BWA (v.0.7.12-r1039; https://github.com/lh3/bwa.git) (Li, 2013). We then analyze the resulting SAM file to filter the reads such that only high confidence S. enterica specific reads that are "properly mapped" to the synthetic reference genome are retained.

A read is considered to be "properly mapped" if all the following criteria are met. First, its edit distance to the reference genome is no larger than a predefined threshold, with a default value of 5 in our implementation. Second, the total length of soft-clipped bases is no larger than a predefined threshold, with a default value of 10. Lastly, paired-end reads are retained only if both reads satisfy the above two criteria. The filtering is implemented in Python using the pysam (https://github.com/pysam-developers/pysam) module.
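The three criteria can be illustrated with a small, dependency-free sketch. It assumes that each alignment is summarized by its edit distance (in SAM output this is conventionally the `NM` tag) and its CIGAR string; the function names and the `(edit_distance, cigar)` tuple layout are hypothetical, and the authors' pysam-based code may be organized differently.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Total number of soft-clipped (S) bases in a CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op == "S")


def is_properly_mapped(edit_distance, cigar, max_edit=5, max_clip=10):
    """Criteria 1 and 2: bounded edit distance and bounded soft clipping."""
    return edit_distance <= max_edit and soft_clipped_bases(cigar) <= max_clip


def keep_pair(mate1, mate2, max_edit=5, max_clip=10):
    """Criterion 3: a pair is kept only if both mates pass.

    Each mate is an (edit_distance, cigar) tuple (hypothetical layout).
    """
    return (is_properly_mapped(*mate1, max_edit=max_edit, max_clip=max_clip)
            and is_properly_mapped(*mate2, max_edit=max_edit, max_clip=max_clip))
```

For example, `keep_pair((1, "250M"), (2, "3S247M"))` keeps the pair, whereas a pair in which one mate has an edit distance of 7 is discarded even if the other mate aligns perfectly.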

The alignments in the SAM file that pass the filtering are then converted back to fastq format using Picard tools (http://broadinstitute.github.io/picard) as input to the strain identification module.

# 3.3. Strain Identification

#### 3.3.1. Building a Reference Library of S. enterica Genomes

A basic step for strain identification from metagenomics sequencing data is to build a library of reference genomes, which contains all the possible strains that may exist in the sample. In this work, we create a reference genome library containing known S. enterica strains. First, we download all the closed S. enterica reference genomes from NCBI. At the time when the experiments presented in this paper were performed, we downloaded 380 whole S. enterica genomes and 157 chromosomes from NCBI, which contain the main sequence and plasmids. We remove the plasmids and keep only the main sequence.

#### 3.3.2. Identification of S. enterica Strains

At this stage, we try to identify a subset of S. enterica strains from the reference library that best explains the S. enterica specific reads present in the sample. The strain identification problem can be formulated as a statistical inference problem that identifies a set of S. enterica strains that maximizes the likelihood of the observed S. enterica specific reads, as it is unlikely that those reads are from non S. enterica strains. Let Φ = {φ<sub>m</sub> | m = 1, ..., M} denote the reference library, where each φ<sub>m</sub> represents a known S. enterica strain. Let R = {r<sub>n</sub> | n = 1, ..., N} denote the set of high confidence S. enterica specific reads after the QC and read filtering steps. The strain typing problem can be formulated as:

$$\arg\max_{I \subseteq \Phi} \left[ L(R|I) - \gamma |I| \right],\tag{1}$$

where L(R|I) is the likelihood of R under the assumption that a subset of S. enterica strains I is present in the sample under test, | · | is the cardinality of a set, and γ is a regularization parameter introduced to avoid trivial solutions such as using the entire reference library as the optimal solution. Note that the parameter γ controls the sparsity level of the solution. The larger the value of γ, the fewer candidate strains will be included in the solution.

The optimization problem in Equation (1) is a minimum set cover problem, which is typically solved using integer linear programming (ILP) (Garfinkel and Nemhauser, 1972). However, the minimum set cover problem is NP-hard, and finding its optimal solution is intractable for large data sets. Instead, in this work we propose an alternative statistical learning based method to solve this problem. More specifically, denote x<sub>nm</sub> = 1 if a read r<sub>n</sub> is from strain φ<sub>m</sub>, and x<sub>nm</sub> = 0 otherwise. We notice that x<sub>nm</sub> is a random variable whose probability distribution by and large depends on how well r<sub>n</sub> maps to φ<sub>m</sub>, and on the number of reference genomes in Φ that r<sub>n</sub> can be successfully mapped to.

Denote such a conditional probability as P(x<sub>nm</sub> = 1 | µ<sub>nm</sub>, ν<sub>n</sub>), where µ<sub>nm</sub> is the edit distance from read r<sub>n</sub> to reference φ<sub>m</sub>, and ν<sub>n</sub> denotes the number of reference genomes in the library that read r<sub>n</sub> has successfully mapped to. The probability that a strain φ<sub>m</sub> is present in the sample is given by 1 minus the joint probability of x<sub>nm</sub> = 0 for all the reads r<sub>n</sub> ∈ R, i.e.,

$$f(m) = 1 - \prod_{r_n \in R} \left[ 1 - p(\mu_{nm}, \nu_n) \right],\tag{2}$$

where p(µ, ν) ≜ P(x = 1 | µ, ν). In the actual implementation, p(µ, ν) can be trained from generated metagenomic samples with spike-in reads from known S. enterica strains. Once the values of p(µ, ν) are trained, strain typing for a given sample under test can be accomplished by identifying the strains with the highest f(m), calculated using Equation (2) from the alignment information (µ<sub>nm</sub>, ν<sub>n</sub>) of all the S. enterica specific reads from the sample.
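As a rough illustration of Equation (2), the sketch below computes f(m) for each strain from the per-read probabilities p(µ<sub>nm</sub>, ν<sub>n</sub>) and ranks the strains. The function name, the dictionary layout, and the toy probabilities are assumptions made for this example only. The product is accumulated in log space because, with thousands of reads, multiplying many factors close to 1 directly loses precision.

```python
import math


def presence_probability(read_probs):
    """f(m) = 1 - prod_n (1 - p_n), accumulated in log space.

    `read_probs` holds p(mu_nm, nu_n) for every read mapped to strain m.
    """
    log_absent = 0.0
    for p in read_probs:
        if p >= 1.0:
            # A single certain read makes the strain certainly present.
            return 1.0
        log_absent += math.log1p(-p)  # log(1 - p)
    # 1 - exp(x) written as -expm1(x) for accuracy when x is near 0.
    return -math.expm1(log_absent)


# Hypothetical per-strain read probabilities looked up from a trained table.
toy_probs = {
    "strain_A": [0.5, 0.5],
    "strain_B": [0.1],
}
toy_scores = {strain: presence_probability(ps) for strain, ps in toy_probs.items()}
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy case, strain_A scores 1 − 0.5 × 0.5 = 0.75 and strain_B scores 0.1, so strain_A is ranked first.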

#### 3.3.3. Refinement

In our experiments, we observed that for samples with very low S. enterica abundance, more than one candidate S. enterica strain could attain the highest f(m), since there are not enough S. enterica specific reads to identify the true target strain using Equation (2). To further improve the specificity of the proposed algorithm, in this case an additional reassignment step is conducted, in which the statistical inference procedure of Equation (2) is performed again on a subset of the reference library that contains only the top N candidate strains obtained from the previous step, using all the reads from the entire sample after the quality control step. The final candidate strains are those with the highest probability f(m) after the refinement step.

#### 4. EXPERIMENTAL RESULTS

In this section, we first describe the training of the conditional probability distribution table from simulated training data. Then, we evaluate the sensitivity of the proposed UltraStrain method and compare it with three existing methods, namely, Kraken (Wood and Salzberg, 2014), Sigma (Ahn et al., 2015), and Pathoscope2 (Hong et al., 2014b). For all the algorithm tests, the same library of S. enterica genomes as described in section 3 was used. Simulated metagenome sequencing data, which were created by merging reads from target strains with reads from real background microbial samples at various spike-in levels, were used in the performance evaluation as they provide the necessary ground truth information. We then further evaluated the performance of the proposed method using data from PrecisionFDA's CFSAN Pathogen Detection Challenge (https://precision.fda.gov/challenges/2). Finally, we compared the runtime performance of these methods using two sets of samples generated from the dataset of the PrecisionFDA CFSAN Pathogen Detection Challenge.

#### 4.1. Training of Conditional Probability Distribution Table

First, we created a training data set for the purpose of learning the conditional probability distribution table. The training set included 1,100 simulated samples, which were created using the ART simulator (Huang et al., 2011) from various S. enterica genomes. All simulated reads were 250 bases long, with an error profile that mimics a typical MiSeq v1 sequencing machine (options: "-ss MSv1 -p -l 250 -m 300 -s 10 -na"). The simulated reads were then filtered using the synthetic S. enterica specific reference to obtain reads that mapped to the S. enterica specific regions for constructing the conditional probability distribution table as follows.

The S. enterica specific reads r<sub>n</sub> obtained from the previous step were mapped to the reference library Φ, and a condition matrix C<sub>N×M</sub> was extracted from the alignment results, where N denotes the total number of reads being analyzed and M denotes the size of the reference library. Each element of C is a 2-tuple C<sub>nm</sub> = (µ<sub>nm</sub>, ν<sub>n</sub>), where µ<sub>nm</sub> is the edit distance from read r<sub>n</sub> to reference φ<sub>m</sub>, and ν<sub>n</sub> denotes the number of reference genomes in the library that read r<sub>n</sub> has successfully mapped to. Note that read r<sub>n</sub> could map to different reference genomes with different edit distance values. For each read r<sub>n</sub>, the ground truth label x<sub>nm</sub> is also available for all reference strains φ<sub>m</sub>, i.e., x<sub>nm</sub> = 1 if read r<sub>n</sub> comes from strain φ<sub>m</sub> and x<sub>nm</sub> = 0 otherwise.

For each (µ<sub>nm</sub>, ν<sub>n</sub>)-tuple, we counted the number of occurrences when x<sub>nm</sub> = 1 and x<sub>nm</sub> = 0, respectively, as follows:

$$c^{+}_{\left(\mu_{nm}, \nu_n\right)} = \left| \bigcup_{x_{nm} = 1} \{ (\mu_{nm}, \nu_n) \} \right|, \tag{3}$$

$$c^{-}_{\left(\mu_{nm}, \nu_n\right)} = \left| \bigcup_{x_{nm} = 0} \{ (\mu_{nm}, \nu_n) \} \right|. \tag{4}$$

The conditional probability of a positive hit can then be calculated as

$$p(\mu_{nm}, \nu_n) = \frac{c^{+}_{\left(\mu_{nm}, \nu_n\right)}}{c^{+}_{\left(\mu_{nm}, \nu_n\right)} + c^{-}_{\left(\mu_{nm}, \nu_n\right)}}.\tag{5}$$

Due to the large number of strains in the reference library, the total number of possible conditions is large. This may cause the so-called "null context" problem, where some conditions have only a very small number of occurrences, leading to inaccurate probability estimates. This problem can be overcome by reducing the number of conditions using a nonuniform binning method on ν<sub>n</sub>. Specifically, we grouped the values of ν<sub>n</sub> into a number of bins of different sizes. The conditional probabilities are then calculated on the grouped bins, using counts accumulated over all the ν<sub>n</sub> values inside each bin. In our simulation, we used 6 bins, {[0, 2), [2, 5), [5, 10), [10, 30), [30, 100), [100, ∞)}, where the last bin covers all ν<sub>n</sub> values that are not less than 100.
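The counting in Equations (3)-(5), combined with the six bins listed above, might look as follows in Python. The input layout (an iterable of `(mu, nu, x)` triples) and the function names are hypothetical; only the bin boundaries are taken from the text.

```python
from bisect import bisect_right
from collections import defaultdict

# Left edges of the bins [0,2), [2,5), [5,10), [10,30), [30,100), [100,inf).
BIN_EDGES = [0, 2, 5, 10, 30, 100]


def nu_bin(nu):
    """Index (0-5) of the bin that contains nu."""
    return bisect_right(BIN_EDGES, nu) - 1


def train_table(observations):
    """Estimate p(mu, bin(nu)) = c+ / (c+ + c-) from labeled alignments.

    `observations` yields (mu, nu, x) triples, where x is the
    ground-truth label (1 if the read comes from that strain, else 0).
    """
    positives = defaultdict(int)
    negatives = defaultdict(int)
    for mu, nu, x in observations:
        key = (mu, nu_bin(nu))
        if x == 1:
            positives[key] += 1
        else:
            negatives[key] += 1
    keys = set(positives) | set(negatives)
    return {k: positives[k] / (positives[k] + negatives[k]) for k in keys}


# Toy labeled alignments: (edit distance, number of genomes hit, label).
toy_observations = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (2, 3, 0), (2, 4, 1), (1, 150, 0)]
toy_table = train_table(toy_observations)
```

Conditions that never occur in the training data are simply absent from the returned dictionary, so a caller would need to decide how to handle such lookups (for instance with a default probability); the text does not specify this.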

The learned conditional probability table was then used in the following experiments for strain identification by calculating the probability of presence of each candidate strain from the library as described in section 3.

#### 4.2. Experiment on Abundance

To evaluate the performance of the proposed UltraStrain, we generated 65 synthetic samples with spike-ins of different S. enterica strains at different abundance levels for testing. The background reads in the synthetic samples were produced from a mixture of simulated reads generated from the 10 non S. enterica genomes listed in **Table 1**, and the foreground reads were simulated from the 13 target S. enterica genomes listed in **Table 2**. In both cases the simulated reads were generated using the ART read simulator (Huang et al., 2011) with the same parameters as in section 4.1. For the background, the reads were generated at 10x coverage from each of the 10 listed non S. enterica genomes. In addition, to avoid potential contamination from the background sample, reads that could be aligned to the synthetic S. enterica specific reference genome at high quality were removed. Finally, the foreground reads were randomly down-sampled to 5 different abundance levels of 10%, 1%, 0.1%, 0.01%, and 0.001% relative to the total read number in the background sample, and mixed with the background sample to generate the synthetic testing samples.
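The down-sampling and mixing step can be sketched as below. The sketch assumes that "abundance" means the number of spiked-in reads divided by the number of background reads, which is one reading of the description above; the function name, the seed handling, and the toy read identifiers are hypothetical.

```python
import random


def spike_in(foreground, background, abundance, seed=0):
    """Down-sample foreground reads and mix them into the background.

    Assumption: the number of spiked-in reads is
    round(abundance * len(background)), capped at len(foreground).
    Returns the shuffled mixture and the number of spiked-in reads.
    """
    n_target = min(round(abundance * len(background)), len(foreground))
    rng = random.Random(seed)  # fixed seed for reproducibility
    sampled = rng.sample(foreground, n_target)
    mixed = background + sampled
    rng.shuffle(mixed)
    return mixed, n_target


# Toy read identifiers standing in for real read records.
toy_background = [f"bg_{i}" for i in range(10000)]
toy_foreground = [f"fg_{i}" for i in range(2000)]
toy_mixed, toy_n = spike_in(toy_foreground, toy_background, abundance=0.001)
```

With 10,000 background reads, an abundance of 0.1% corresponds to only 10 spiked-in reads, which illustrates how few target reads are available at the lowest abundance levels.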

The strain identification results on the 65 data sets for the abundance test are shown in **Figure 2**. It can be seen from the results that UltraStrain performs best in correctly identifying the target strains. In particular, UltraStrain correctly identifies all 13 strains at 0.1% abundance, while Pathoscope2, Sigma, and Kraken2 correctly identify only 7, 5, and 0 strains, respectively. In addition, UltraStrain could still correctly identify 4 out of 13 strains at 0.01% abundance, while all the other algorithms under test failed to identify the correct strain at this abundance level.

#### 4.3. Experiments on Coverage

It is interesting to note that, due to the filtering process used in the algorithm, the sensitivity of UltraStrain will increase if more metagenomic data are available for a given sample. That is, for a given sample with low abundance of S. enterica contamination, the chance that UltraStrain correctly identifies its strain will be higher if the sample is sequenced to higher coverage. This is because, with higher coverage, more S. enterica specific reads will be retained after the filtering operation, which gives UltraStrain a better chance of correctly identifying the target strain. Note that this property is in general not applicable to other strain typing software, since the ratio of reads from S. enterica vs. other species simultaneously present in the sample will remain constant without the filtering operation.

TABLE 1 | The 10 *non-S. enterica* genomes used as background strains in the simulated data sets.

TABLE 2 | The 13 *S. enterica* genomes used as target strains in the simulated data sets.

To illustrate that the sensitivity of UltraStrain increases with higher coverage data, we further evaluated the performance of UltraStrain on metagenomic data of different coverage. The same procedure as in the previous section was followed to create the testing data. The synthetic background reads were generated from 10 non S. enterica strains at 17 different coverage values ranging from 10×, 15×, · · · , to 500×, and the target S. enterica reads were spiked in at a constant abundance level of 0.01%. In total, 102 test data sets were generated for this experiment. **Figure 3** shows the performance of UltraStrain on the testing data. It can be seen that, with increasing coverage, the calculated probability of the target strain also increases. Note that the increase is not monotonic, owing to the random nature of the number of spiked-in reads that fall in the S. enterica specific genome regions. However, at higher coverage, UltraStrain is able to correctly identify targets that it is not able to detect at lower coverage.

We also tested the three other algorithms (Pathoscope2, Sigma, and Kraken2). However, none of them correctly identified the target strain under any of the testing conditions.

## 4.4. Results on FDA CFSAN Pathogen Detection Challenge

The PrecisionFDA CFSAN Pathogen Detection Challenge (https://precision.fda.gov/challenges/2/) aims at detecting S. enterica in shotgun metagenomic samples from contaminated cilantro. The goal of the challenge was to identify and type Salmonella in naturally and in silico contaminated samples. The Challenge provided 24 test samples, and the participants were asked to identify the serotype, sequence type (i.e., MLST), and strain of Salmonella present in positive challenge samples.

FIGURE 2 | Comparison of UltraStrain, Pathoscope2, Sigma, and Kraken2 in strain identification from 65 simulated data sets. The reads from each of the 13 target strains, as listed in the rightmost column, are mixed with the reads from the background strains, at 5 different abundance levels from 10% to 0.001%. "1" means that the method successfully identifies the target strain from the simulated sample data, while "0" means failure, i.e., the method either identifies a different *S. enterica* strain as the most probable strain, or does not identify any *S. enterica* strain from the simulated sample data.

We tested the performance of UltraStrain on the 24 challenge samples, and the results are shown in **Figure 4**. Among these 24 samples, 13 are positive, including 5 in silico synthetic samples, in which a known S. enterica target strain was spiked into culture-negative samples, and 8 culture-positive samples. The remaining 11 samples are culture-negative. UltraStrain correctly identified the target S. enterica strain in 8 positive samples (5 in silico and 3 culture-positive samples). Both Pathoscope2 and Sigma successfully identified the target strain in 7 samples, while Kraken failed in all samples. However, for culture-positive samples C01, C08, C18, C21, and C24, none of the four methods could identify the correct S. enterica strain.

It can be seen from the results that, for some negative samples, UltraStrain still identifies target strains with very high probabilities. This could be due to two reasons. First, given the high sensitivity of UltraStrain, the negative sample may not be truly negative. In particular, some S. enterica specific reads remain after the filtering process, which may suggest that the sample contains a certain level of S. enterica contamination. Second, it is possible that the sensitivity of UltraStrain is too high for real-life samples. Therefore, a higher probability cut-off value (e.g., 0.99) could be selected when the method is used for S. enterica detection.

FIGURE 4 | Comparison of performance of UltraStrain, Pathoscope2, Sigma, and Kraken2 on the PrecisionFDA CFSAN Pathogen Detection Challenge data set. For each testing sample, the most probable strains identified by the algorithms are shown. Correctly identified strains are marked in red. For UltraStrain, Pathoscope2, and Sigma, the scores reported in the figure indicate the probabilities that the identified strains are present in the sample. For Kraken2, the scores indicate the relative abundances of the identified strains.

FIGURE 5 | Comparison of the runtime performance of UltraStrain, Pathoscope2, Sigma, and Kraken2. (A) Performance on four different samples (C03, C04, C06, C19 from the PrecisionFDA CFSAN challenge) that have different levels of *S. enterica* abundance of 0.00 (C06), 0.005 (C03), 0.03 (C04), and 0.06 (C19). For fair comparison, all the files are truncated to 1.2 GB. (B) Performance on four samples with increasing file size from 318 MB (1×) to 1.3 GB (4×), constructed by duplicating testing sample C13. A higher bar indicates a computationally more expensive process. Note that the runtime results are shown on a logarithmic scale.

# 4.5. Experiments on Runtime

To compare the computational complexity of UltraStrain, in terms of runtime, with that of the other methods, we tested the runtime performance of all four methods using two sets of samples selected from the PrecisionFDA CFSAN challenge dataset. The experiments were conducted on an Intel Xeon workstation with 48 CPU threads and 256 GB RAM. All methods were run with their default settings and set to utilize up to 44 CPU threads whenever possible. The results are shown in **Figure 5**. It can be seen that the runtime of these tools varies dramatically, from 10<sup>1</sup> to 10<sup>4</sup> seconds per test, depending on the method as well as on the sizes and compositions of the samples under test. In general, the runtime of each tool increases as the file size of the testing sample increases. In addition, the runtimes of UltraStrain and Pathoscope2 also increase as the abundance of the target spike-in strain increases, which is reasonable as both algorithms have more matched reads to process when the abundance of the target strain increases. Overall, Kraken2 has the lowest complexity among all tools. UltraStrain has the second lowest complexity, followed by Pathoscope2. Sigma has the highest complexity in all cases.

# 5. CONCLUSIONS

UltraStrain is a highly sensitive, rapid and efficient method for metagenomic taxonomic classification at the strain level. In the UltraStrain pipeline, the reads filtering step uses a synthetic reference genome consisting of differentiating regions from known S. enterica strains to filter out the reads that are not specific to the S. enterica species, greatly improving the efficacy as well as the efficiency of the process. Strain identification through the proposed statistical learning provides a fast and accurate solution for metagenome sample data analysis. Experiments on both simulated data sets and real samples demonstrate that UltraStrain achieves high accuracy even at very low abundance levels. UltraStrain achieves higher sensitivity than the compared tools and a shorter run time than Pathoscope2 and Sigma, which indicates its usability as a standalone pathogen identification pipeline. In addition, our experiments show that the sensitivity of UltraStrain can be further improved by deeper sequencing of the sample, which could be particularly useful when it is necessary to perform strain typing on samples with extremely low abundance of target strains.

The proposed algorithm can be further improved in many aspects. For example, although it was developed with high-sensitivity S. enterica typing in mind, the proposed framework can be easily extended to taxonomic profiling and to the analysis of other bacterial strains by adapting its filter and reference library designs. In addition, the ability of the current algorithm to deal with samples containing more than one target strain from the same species still needs further investigation. Importantly, because its primary goal is ultra sensitive strain typing, the current approach lacks the ability to accurately estimate the relative abundance of multiple bacterial species/strains present in a sample, as provided by other similar tools. Therefore, it is anticipated that it could be used in conjunction with other metagenomic pipelines when necessary.

# DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: https://precision.fda.gov/challenges/2/view.

# AUTHOR CONTRIBUTIONS

WY and RY: conceptualized the algorithm design and interpreted the data. WY, LW, and RY: designed the study. WY, LH, CS, LW, and RY: collected the data. LH, CS, WY, and RY: analyzed the data. WY, LH, CS, LW, and RY: sourced the literature. WY, LH, LW, and RY: wrote the draft. WY, LW, and RY: edited the manuscript. LW and RY: acquired the funding and supervised the whole study.

#### FUNDING

This work was supported by National Natural Science Foundation of China (Grant No. 61671399).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Yang, Huang, Shi, Wang and Yu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identifying Critical State of Complex Diseases by Single-Sample-Based Hidden Markov Model

#### Rui Liu<sup>1</sup> , Jiayuan Zhong<sup>1</sup> , Xiangtian Yu<sup>2</sup> , Yongjun Li <sup>3</sup> and Pei Chen<sup>1</sup> \*

*<sup>1</sup> School of Mathematics, South China University of Technology, Guangzhou, China, <sup>2</sup> Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China, <sup>3</sup> School of Computer Science and Engineering, South China University of Technology, Guangzhou, China*

The progression of complex diseases is generally divided into a normal state, a pre-disease state or tipping point, and a disease state. Developing an individual-specific method that can identify the pre-disease state just before a catastrophic deterioration is critical for patients with complex diseases. However, with only a single case sample, it is challenging to detect a pre-disease state, which differs little from a normal state in terms of phenotypes and gene expressions. In this study, by regarding the tipping point as the end point of a stationary Markov process, we proposed a single-sample-based hidden Markov model (HMM) approach that explores the dynamical differences between a normal and a pre-disease state, and thus can signal the upcoming critical transition immediately after a pre-disease state. Using this method, we identified the pre-disease state or tipping point in a numerical simulation and two real datasets, including stomach adenocarcinoma and influenza infection, which demonstrates the effectiveness of the method.

Keywords: hidden Markov process, single-sample-based diagnosis, dynamical network biomarker (DNB), pre-disease state, critical transition, early-warning signal

# INTRODUCTION

Considerable evidence suggests that during the progression of many complex diseases the deterioration is not necessarily smooth but abrupt (Litt et al., 2001; McSharry et al., 2003; Scheffer et al., 2009). In order to describe the underlying mechanism of complex diseases, their evolutions are often modeled as time-dependent non-linear systems, in which the abrupt deterioration is viewed as the phase transition at a tipping point (Murray, 2002; Venegas et al., 2005; Hirata et al., 2010; He et al., 2012; Liu et al., 2012). Therefore, from a dynamical systems perspective, the general progression of complex diseases was modeled as three states or stages (**Figure 1A**): (i) a normal state, which represents a relatively healthy stage with high stability and robustness to perturbations; (ii) a pre-disease state, which was defined as the limit of the normal state and is located just before the occurrence of sudden deterioration, and therefore has low stability and robustness; (iii) a disease state, which represents a seriously deteriorated stage, generally with high stability and robustness, because it is usually very difficult to return to the normal state even with intensive treatment (Liu et al., 2014a). In contrast to the irreversible disease state, the pre-disease state is sensitive to perturbation and thus reversible to the normal state if timely and appropriate treatment is received during this stage. It is thus crucial to detect the pre-disease state for patients with complex diseases. However, it is hard to detect a pre-disease state with traditional biomarkers, since it is similar to the normal state in terms of phenotype and gene expression.

#### Edited by:

*Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China*

#### Reviewed by:

*Huanfei Ma, Soochow University, China Ling-Yun Wu, Academy of Mathematics and Systems Science (CAS), China*

> \*Correspondence: *Pei Chen chenpei@scut.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *19 December 2018* Accepted: *15 March 2019* Published: *04 April 2019*

#### Citation:

*Liu R, Zhong J, Yu X, Li Y and Chen P (2019) Identifying Critical State of Complex Diseases by Single-Sample-Based Hidden Markov Model. Front. Genet. 10:285. doi: 10.3389/fgene.2019.00285*


FIGURE 1 | The outline for identifying the SSI score based on HMM. (A) The progression of complex diseases is generally modeled as three states, i.e., a normal state, a pre-disease state, and a disease state. The pre-disease state is immediately before the sudden deterioration, which is sensitive to treatment and reversible to the normal state. The disease state is usually irreversible even with intensive medical care. For an individual, samples from a few initial time points can be regarded as reference. Each single case sample was added to the reference, forming a series of combining samples. (B) At each time point *t* = 1, 2, … *T*, a differential network N*<sup>t</sup>* was constructed by PCC. (C) The sharp increase of SSI score signals the upcoming critical transition into the disease state.

Recently, the dynamical network biomarker (DNB) method was proposed to detect the pre-disease state (Chen et al., 2012), that is, by identifying a group of DNB biomolecules (e.g., genes and proteins) which together signal the occurrence of the pre-disease state in the following three ways: (i) the DNB members fluctuate widely; (ii) the correlation between any two DNB members increases significantly; (iii) the correlation between a DNB member and a non-DNB molecule decreases significantly. Different from traditional biomarkers, DNB aims at signaling the pre-disease state before the occurrence of catastrophic deterioration. This method has been employed by many groups and applied to a number of cases, including detecting the tipping points of cell fate decision (Mojtahedi et al., 2016) and cellular differentiation (Richard et al., 2016), studying immune checkpoint blockade (Lesterhuis et al., 2017), and identifying the critical transition states during various biological processes (Liu et al., 2014b, 2018; Chen et al., 2015, 2017, 2018). However, the DNB method works only when there are multiple case samples, so that the above three statistical conditions can be evaluated. This limits the practical application of DNB in many clinical cases, because it is generally impossible to collect multiple samples for each individual at a time point.

In this work, by exploring the differential information between the normal and pre-disease states, we proposed a single-sample-based hidden Markov model (HMM) to signal the tipping point, even if there was only one case sample available. Specifically, the normal state was modeled as a stationary Markov process due to its highly stable dynamics, while the pre-disease state was viewed as a time-varying Markov process considering its dynamical instability. Taking multiple normal samples as the references or background, a differential network, whose edges carried the differential information before and after combining a single sample with the references, was obtained specific to the single sample derived at a time point (**Figure 1A**). Then, under the hypothesis that a time point t = T (T > 2) is the candidate tipping point, a probabilistic score, namely the single-sample-based inconsistency score (SSI score), was developed for quantitatively measuring the difference between samples from a normal state and those from a pre-disease state. The calculation of the SSI score was based on an HMM, which was trained by taking a series of differential networks derived up to t = T−1 as the training set (**Figure 1B**). An abrupt increase of this probabilistic score indicates the occurrence of the tipping point (**Figure 1C**). Clearly, this approach is individual-specific, and thus may help to achieve personalized diagnosis based on the historical information of patients. To validate its effectiveness, this method has been applied to a numerical simulation and two real datasets, i.e., the stomach adenocarcinoma (STAD) dataset from the TCGA database and an influenza infection dataset from the GEO database.

# METHODS

#### Theoretical Basis

The theoretical basis of this study is the DNB theory, which provides the following generic properties when a dynamical system approaches a bifurcation point (Chen et al., 2012):


The detailed description and derivation of DNB can be found in Liu et al. (2015) and its **Supplementary Information**. In view of the dynamical characteristics of the normal state, i.e., stable dynamics with little fluctuation and high resilience, it was modeled as a stationary Markov process. The pre-disease state was modeled as a time-varying Markov process due to its highly unstable dynamics with strong fluctuation and low resilience. The disease state can be regarded as another stationary Markov process because of its dynamical stability (Chen et al., 2016). Identifying the pre-disease state is therefore equivalent to detecting a switching point at which a stationary Markov process ends and turns into a time-varying Markov process.

#### Algorithm

A sketch of the single-sample-based HMM algorithm is provided in **Figure 2**. Specifically, detecting the onset of a pre-disease state is equivalent to identifying the end of the stationary Markov process, which requires a detailed model of that stationary Markov process. Therefore, an HMM was trained and employed to describe the dynamical characteristics of the system in the normal state, and a probability index was proposed to evaluate the inconsistency between a sample from a testing point and the trained HMM. We carried out the following algorithm to identify the tipping point by using only one case sample.

#### **(i) Choosing Reference Samples**

A few samples that represent the relatively healthy condition were chosen as the reference or background. Generally, for individual-specific samples (e.g., samples for each symptomatic subject in the influenza infection dataset), samples from a few initial time points of an individual (as shown in **Figure 1A**) can be regarded as the reference. For stage-course data (e.g., TCGA data for stomach adenocarcinoma), samples from a normal cohort or normal tissue can be viewed as the reference.

#### **(ii) Training Process**

First, we added each single case sample to the reference (**Figure 1A**), forming a series of combined samples. In other words, if there were n samples in the reference, at each time point we obtained a set of n + 1 samples, which can be viewed as a perturbation of the n samples in the reference group.

Second, based on the observation samples at each time point t, a differential network N<sub>t</sub> was constructed from the difference in the corresponding Pearson correlation coefficient (PCC) between the reference and combined samples (**Figures 1A,B**), that is,

$$\Delta\text{PCC}(g_i, g_j) = |\text{PCC}_{n+1}(g_i, g_j)| - |\text{PCC}_n(g_i, g_j)|,$$

where g<sub>i</sub> and g<sub>j</sub> represent the expression of any pair of genes, and PCC<sub>n</sub> and PCC<sub>n+1</sub> are computed from the n reference samples and the n + 1 combined samples, respectively. Then |ΔPCC(g<sub>i</sub>, g<sub>j</sub>)| was employed to construct the differential network, i.e., when |ΔPCC(g<sub>i</sub>, g<sub>j</sub>)| > d, there was a differential link between g<sub>i</sub> and g<sub>j</sub> (**Figure 1B**). The threshold d was selected based on the specific real data, that is, d was chosen such that few differential links arise in the initial differential networks of the normal state, thus highlighting the pre-disease state when many links appear. After this step, we obtained a differential-network series {N<sub>1</sub>, N<sub>2</sub>, …, N<sub>T</sub>, …}.
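The construction of a single differential network can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation: the function names, the dictionary-based data layout, and the rule of treating a constant gene as uncorrelated are assumptions made for the example. It scans all gene pairs, whereas the real-data analysis described later restricts the candidate edges to a background interaction network.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant gene is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def differential_network(reference, case, d):
    """Return the set of gene pairs whose |delta PCC| exceeds d.

    reference: dict gene -> list of n expression values (reference samples)
    case:      dict gene -> expression value in the single case sample
    d:         threshold on |delta PCC|
    """
    edges = set()
    for gi, gj in combinations(sorted(reference), 2):
        pcc_n = pearson(reference[gi], reference[gj])
        # PCC after adding the single case sample to the n reference samples
        pcc_n1 = pearson(reference[gi] + [case[gi]], reference[gj] + [case[gj]])
        delta = abs(pcc_n1) - abs(pcc_n)
        if abs(delta) > d:
            edges.add((gi, gj))
    return edges


# Toy data (hypothetical values): g1 and g2 are strongly correlated in the reference.
reference = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [2.0, 1.0, 2.5, 1.5],
}
perturbed_case = {"g1": 10.0, "g2": 0.0, "g3": 9.0}
print(differential_network(reference, perturbed_case, d=0.1))
```

A case sample that sits at the reference mean of every gene leaves all correlations unchanged and therefore yields an empty differential network, which matches the expectation of few differential links in the normal state.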

Third, suppose a time point t = T (T > 2) is a candidate tipping point. The differential-network series was then divided into a training part ranging from t = 1 to t = T−1, i.e., the observation sequence O<sub>T−1</sub> = {o<sub>1</sub>, o<sub>2</sub>, …, o<sub>T−1</sub>} = {N<sub>1</sub>, N<sub>2</sub>, …, N<sub>T−1</sub>}, and a testing part starting from t = T, i.e., o<sub>T</sub> = {N<sub>T</sub>}. Let {s<sub>1</sub>, s<sub>2</sub>, …, s<sub>t</sub>} denote the state sequence up to t. The symbols P<sub>0</sub> and P<sub>1</sub> denote the normal state and a possible pre-disease state, respectively, which are the two unobserved (hidden) states. Then, based on the training samples O<sub>T−1</sub> = {N<sub>1</sub>, N<sub>2</sub>, …, N<sub>T−1</sub>}, an HMM

$$\theta^{T-1}(O_{T-1}) = \langle A^{T-1}, B^{T-1}, \pi \rangle$$

was trained by the Baum-Welch procedure (Bilmes, 1998). Here, the superscript T−1 of θ denotes that the HMM θ was obtained from the training samples up to t = T−1. The state transition matrix at time point T−1 is

$$A^{T-1} = \left(a_{ij}\right)_{2 \times 2}$$

with

$$a_{ij} = P(s_q = P_i \mid s_{q-1} = P_j), \quad i, j \in \{0, 1\}.$$

Here, q − 1 ∈ {1, …, T − 2} stands for a time point in the training process, and q stands for the next time point after q − 1. The observation matrix at time point T−1 is

$$B^{T-1} = \left(b_{jk}\right)_{2 \times (M+1)}$$

with

$$b_{jk} = P(\#\Delta(q) = k \mid s_q = P_j), \quad j \in \{0, 1\}, \ k \in \{0, 1, \dots, M\},$$

where #Δ(q) = k represents that there are k edges in the differential network N<sub>q</sub>, and M is the number of all possible edges, e.g., M = C<sub>m</sub><sup>2</sup> = m(m − 1)/2 if there are m nodes in N<sub>q</sub>. The initial probabilities are

$$\pi = \{\pi_0, \pi_1\}$$

with π<sub>i</sub> = P(s<sub>q−1</sub> = P<sub>i</sub>), i ∈ {0, 1}.
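To make the roles of A, B, and π concrete, the sketch below evaluates a two-state discrete HMM on a sequence of edge counts with the standard scaled forward recursion, which is also the building block of Baum-Welch training. It is a generic textbook routine rather than the authors' code; the numerical parameter values are invented for illustration, and the index order of `A` follows the definition of a<sub>ij</sub> given above (columns sum to one).

```python
import math


def forward(A, B, pi, observations):
    """Scaled forward pass for a discrete HMM.

    A[i][j] = P(s_q = i | s_{q-1} = j)   (index order as in the text)
    B[j][k] = P(k differential edges observed | s_q = j)
    pi[i]   = initial probability of hidden state i
    Returns (log_likelihood, filtered), where
    filtered[t][i] = P(s_t = i | o_1, ..., o_t).
    """
    n_states = len(pi)
    log_lik = 0.0
    filtered = []
    prev = None
    for t, k in enumerate(observations):
        if t == 0:
            predicted = list(pi)
        else:
            # One-step prediction of the hidden state from the previous posterior
            predicted = [
                sum(A[i][j] * prev[j] for j in range(n_states))
                for i in range(n_states)
            ]
        alpha = [predicted[i] * B[i][k] for i in range(n_states)]
        scale = sum(alpha)
        if scale == 0.0:
            raise ValueError("observation has zero probability under the model")
        log_lik += math.log(scale)
        prev = [a / scale for a in alpha]
        filtered.append(prev)
    return log_lik, filtered


# Hypothetical parameters: state 0 = normal (P0), state 1 = pre-disease (P1),
# observation symbols 0..2 = number of differential edges (M = 2).
A = [[0.9, 0.2],
     [0.1, 0.8]]
B = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]
pi = [0.9, 0.1]

log_lik, filtered = forward(A, B, pi, [0, 0, 2])
print(log_lik, filtered[-1])
```

With these toy numbers, a jump from zero to two differential edges shifts the filtered probability toward the pre-disease state, which is the qualitative behavior the scoring step is designed to pick up.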

#### **(iii) Testing Process**

Based on the testing sample o<sub>T</sub> = {N<sub>T</sub>}, we tested whether the candidate point t = T is a "real" tipping point. A single-sample-based inconsistency score (SSI score) was proposed, i.e.,

$$\begin{split} \text{SSI}(T) &= P\left(s_T = P_1 \,\Big|\, s_1 = P_0, s_2 = P_0, \dots, s_{T-1} = P_0; \theta^{T-1}\right) \\ &= 1 - P\left(s_T = P_0 \,\Big|\, s_1 = P_0, s_2 = P_0, \dots, s_{T-1} = P_0; \theta^{T-1}\right) \\ &= 1 - P\left(s_T = P_0 \,\Big|\, s_{T-1} = P_0; \theta^{T-1}\right) \\ &= 1 - \frac{P\left(s_T = P_0, s_{T-1} = P_0; \theta^{T-1}\right)}{P\left(s_{T-1} = P_0; \theta^{T-1}\right)}. \end{split}$$

Given the HMM θ<sup>T−1</sup>, the SSI score was calculated by a forward algorithm. According to the above settings, the calculation of the probability SSI(T) (the inconsistency probability) at a time point t = T relies only on the samples from T−1 and T. If SSI(T) increases significantly, then the candidate point t = T is determined to be the tipping point, and the algorithm ends (**Figure 2**). Otherwise, if there is no significant change in SSI(T), then t = T is classified as a time point belonging to the normal state. Accordingly, the differential network o<sub>T</sub> = {N<sub>T</sub>} is added to the training set, and the algorithm continues with t = T+1 as the new candidate tipping point (**Figure 2**).
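The candidate-by-candidate procedure can be written as a short loop. The sketch below is only a control-flow illustration: `score_fn` is a hypothetical callable standing in for HMM training plus SSI evaluation, and the fold-change rule `jump` is an assumed stand-in for "increases significantly", which the text does not quantify.

```python
def find_tipping_point(networks, score_fn, jump=2.0, min_train=2):
    """Scan candidate time points T = min_train + 1, min_train + 2, ...

    networks: list of differential networks N_1, N_2, ... (any objects)
    score_fn: hypothetical callable score_fn(training, test) -> SSI(T), where
              training = [N_1, ..., N_{T-1}] and test = N_T
    jump:     assumed fold-change criterion for a 'significant' increase
    Returns (T, scores): T is the 1-based tipping point, or None if not found.
    The first candidate has no earlier score to compare with, so it can
    never be flagged by this simple rule.
    """
    scores = []
    for idx in range(min_train, len(networks)):
        training, test = networks[:idx], networks[idx]
        score = score_fn(training, test)
        if scores and score > jump * max(scores[-1], 1e-12):
            scores.append(score)
            return idx + 1, scores  # tipping point found, algorithm ends
        scores.append(score)        # normal state: N_T joins the training set
    return None, scores


def toy_score(training, test):
    """Toy stand-in for SSI: test edge count relative to the training average."""
    mean_edges = sum(len(n) for n in training) / len(training)
    return len(test) / (1.0 + mean_edges)


# Edge sets of sizes 1, 1, 1, 2, 1, 8, 9 (hypothetical series)
series = [set(range(k)) for k in (1, 1, 1, 2, 1, 8, 9)]
print(find_tipping_point(series, toy_score))
```

In this toy series the sixth network is the first one whose score more than doubles relative to the previous candidate, so the loop stops there and never examines the seventh.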

According to the DNB theory, there are few differential edges in a differential network constructed in a normal stage, due to the high stability nature of the system during the normal stage. However, when the system approaches the critical transition point, there are many differential edges appearing in the differential network due to the time-varying and fluctuating dynamics of the system. Specifically, the algorithm is guaranteed by the generic properties 2 and 3 listed in section Theoretical Basis.

# Data Accessing and Processing for Real Datasets

Two gene expression profiling datasets were analyzed: the time-course dataset for the influenza virus infection process (GSE30550), downloaded from the NCBI GEO database (www.ncbi.nlm.nih.gov/geo), and the stage-course dataset for stomach adenocarcinoma (STAD) from the TCGA database (http://cancergenome.nih.gov). For the omics data (GSE30550), we discarded the probes without a corresponding NCBI Entrez gene symbol. After removing any redundancy in dataset GSE30550, we obtained 11,451 molecules through probe mapping. For each gene mapped by multiple probes, the average value was employed as the gene expression.
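The probe-to-gene step described above can be illustrated as follows. This is a minimal sketch in plain Python under an assumed data layout (dictionaries keyed by probe identifier); the function name and the toy identifiers are hypothetical and do not come from the original analysis.

```python
from collections import defaultdict


def collapse_probes(probe_values, probe_to_gene):
    """Average all probes that map to the same gene.

    probe_values:  dict probe_id -> list of expression values (one per sample)
    probe_to_gene: dict probe_id -> gene symbol; unmapped probes are missing or None
    Returns dict gene -> list of per-sample mean expression values.
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene:  # discard probes without a gene symbol
            grouped[gene].append(values)
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Toy example: two probes for gene A, one for gene B, one unmapped probe
values = {"p1": [1.0, 3.0], "p2": [3.0, 5.0], "p3": [7.0, 7.0], "p4": [9.0, 9.0]}
mapping = {"p1": "A", "p2": "A", "p3": "B"}
print(collapse_probes(values, mapping))
```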

When the algorithm was applied to the two disease datasets, there were two extra steps, as follows.

First, the expression profiling information was mapped to the protein-protein interaction network from STRING (http://string-db.org) (Szklarczyk et al., 2014) for Homo sapiens. In this network, the edges were filtered by the confidence level with a threshold of 0.700, and all isolated nodes were discarded. Then we chose the cutoff parameter d so that the first differential network retained only 10% of the edges of the original STRING network, that is, 90% or more of the edges disappeared. This choice follows from the generic property that the network structure remains stable during the normal stage, so that there are few edges in a differential network based on samples generated from the normal stage.
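One simple way to realize this choice of d is to take a quantile of the |ΔPCC| values over the background edges at the first time point. The sketch below is an assumption about how such a cutoff could be computed, not the authors' procedure; `choose_threshold` and `keep_fraction` are hypothetical names.

```python
import math


def choose_threshold(abs_delta_pcc, keep_fraction=0.10):
    """Pick a cutoff d from the first (normal-state) differential network.

    abs_delta_pcc: list of |delta PCC| values, one per background-network edge
    keep_fraction: fraction of edges allowed to exceed d (0 <= value < 1)
    Returns d such that at most keep_fraction of the edges satisfy |delta PCC| > d.
    """
    if not abs_delta_pcc:
        raise ValueError("need at least one edge")
    if not 0 <= keep_fraction < 1:
        raise ValueError("keep_fraction must lie in [0, 1)")
    ranked = sorted(abs_delta_pcc, reverse=True)
    n_keep = min(math.floor(keep_fraction * len(ranked)), len(ranked) - 1)
    # d is the largest value that is not kept; ties can only lower the kept count
    return ranked[n_keep]


# Toy example: 100 hypothetical edge scores 0.01, 0.02, ..., 1.00
scores = [0.01 * i for i in range(1, 101)]
d = choose_threshold(scores)
print(d, sum(s > d for s in scores))
```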

Second, the differential network was partitioned into local networks to reduce computational complexity. Each local network contained a center node and its first-order neighbors. The local SSI score for each local network was calculated through the above algorithm. Given k local networks, a weighted average SSI score was derived as follows,

$$\text{SSI} = \frac{n_1 \text{SSI}_1 + n_2 \text{SSI}_2 + \dots + n_k \text{SSI}_k}{n_1 + n_2 + \dots + n_k},$$

where n<sub>i</sub> denotes the number of nodes in the i-th local network (i = 1, 2, …, k) and SSI<sub>i</sub> stands for the local SSI score of this subnetwork.
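The partition into local networks and the node-weighted average can be sketched as below. The helper names are hypothetical, and the local scores in the example are made-up numbers used only to show the arithmetic.

```python
def local_networks(edges):
    """Map each node to the set containing itself and its first-order neighbors.

    edges: iterable of (u, v) pairs of an undirected network
    """
    neighborhoods = {}
    for u, v in edges:
        neighborhoods.setdefault(u, {u}).add(v)
        neighborhoods.setdefault(v, {v}).add(u)
    return neighborhoods


def weighted_ssi(local_scores, local_sizes):
    """Node-count-weighted average of local SSI scores."""
    total = sum(local_sizes)
    if total == 0:
        raise ValueError("local networks contain no nodes")
    return sum(n * s for n, s in zip(local_sizes, local_scores)) / total


# Toy example: a star with center 'a'
nets = local_networks([("a", "b"), ("a", "c")])
print(nets["a"])                           # center 'a' with neighbors 'b' and 'c'
print(weighted_ssi([0.2, 0.8], [3, 1]))    # (3*0.2 + 1*0.8) / 4
```

Weighting by node count means that large neighborhoods dominate the average, so a hub whose local score rises contributes more than a peripheral node with the same rise.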

The networks were visualized using Cytoscape (www.cytoscape.org) and the functional analysis was based on Ingenuity Pathway Analysis (IPA, http://www.ingenuity.com/products/ipa) and KEGG enrichment analysis (http://www.genome.jp/kegg/tool/map/\_pathway2.html).

#### RESULTS

## Identifying the Critical Transition for a Numerical Simulation Model

The proposed computational method and SSI score were applied to a numerical simulation dataset, which was generated from a nine-node regulatory network (**Figure 3A**) with a set of nine stochastic differential equations [Equation (S1)] provided in **Supplementary Information**. Such a regulatory network model of Michaelis-Menten form is commonly employed to study genetic regulation, including transcription, translation, diffusion, and translocation processes (Chen et al., 2009). With the parameter p varying from −0.45 to 0.15, a dataset was generated for numerical simulation.

In Equation (S1), the parameter value p = 0 was set as a bifurcation value, at which the system undergoes a critical transition. The dynamical change in the SSI score is shown in **Figure 3B**. Clearly, there is an abrupt increase of the SSI score when the system approaches the tipping point (p = 0). Thus, the significant increase of the SSI score indicates the upcoming critical transition at p = 0. In **Figure 3C**, the distribution of differential edges over 1,000 simulations is illustrated for the network specific to each parameter value. The frequency of occurrence of differential edges was significantly different in the vicinity of the tipping point (p = 0), which implies that many more edges occur in the differential network when the system approaches the tipping point.

# Identifying the Critical Transition for Stomach Adenocarcinoma

Cancer of the stomach is difficult to cure unless it is found at an early stage (before its metastasis). Unfortunately, because early stomach cancer causes few symptoms, the disease is usually advanced when the diagnosis is made (Wadhwa et al., 2013). According to a clinical-stage division (Guide, 2009), stage IV is generally regarded as a severely deteriorated stage, at which cancer has spread to nearby tissues and distant lymph nodes or has metastasized to other organs. Generally, a cure is very rarely possible at stage IV. Therefore, it is important to detect the early-warning signal for metastasis before stage IV.

The proposed method was applied to the STAD dataset from TCGA and identified the tipping point of distant metastasis (stage IIIA). This dataset contained RNA-Seq data and included 141 tumor samples and 33 tumor-adjacent samples. The tumor samples were grouped into seven stages of stomach cancer, that is, stage IA (9 samples), stage IB (18 samples), stage IIA (23 samples), stage IIB (29 samples), stage IIIA (27 samples), stage IIIB (20 samples), and stage IV (15 samples). The tumor-adjacent samples were regarded as control data and were employed as reference samples.

As shown in **Figure 4A**, the abrupt increase of the average SSI score indicated the imminent critical transition at the tipping point stage (IIIA), after which cancer would spread to the serosal layer of the stomach wall (stage IIIA) and ultimately cause distant metastasis (stage IV). In **Figure 4B**, the box plot shows that the expression deviation of differentially expressed genes fails to provide any effective signal for the tipping point, where the differentially expressed genes were obtained by comparison with tumor-adjacent (TA) samples at each stage. **Figure 4C** shows the dynamical evolution of the whole gene regulatory network, including 3,247 nodes and 22,301 edges. These edges were selected from the STRING network with a high confidence level (higher than 0.700). A group of 214 nodes, i.e., genes with the most significant increases in their local SSI scores, were intentionally arranged at the bottom right corner. This group of genes together exhibited an obvious signal at the tipping point (stage IIIA) and can be regarded as the dynamical network biomarker for distant metastasis of STAD. These top 1% genes with the most significant increases in local SSI scores were considered the SSI-signaling genes, which form a dynamical network biomarker and may be highly related to the catastrophic deterioration. Thus, we carried out functional analysis on these SSI-signaling genes.

Based on IPA analysis, the common SSI-signaling genes were highly related to the functional annotations "Digestive organ tumor" (P-value = 3.0E-34), "Abdominal adenocarcinoma" (P-value = 7.1E-29), "Cancer of cells" (P-value = 2.2E-10), "Metastasis" (P-value = 2.0E-04), etc. In addition, from KEGG enrichment analysis, the SSI-signaling genes were enriched in cancer-related pathways including Pathways in cancer, the AMPK signaling pathway, and the Ras signaling pathway. Some SSI-signaling genes have been reported in the literature to be associated with the process of cancer metastasis. For example, COL11A1 was reported as a remarkable biomarker for carcinoma progression and metastasis (Vázquez-Villa et al., 2015). BLNK is known as one of the downstream targets of Pax-5, which plays an important role in metastasis (Crapoulet et al., 2011). HNRNPC, whose specific siRNA was reported to inactivate the Akt pathway (Hwang et al., 2012), was also identified to control the metastatic potential of glioblastoma by regulating PDCD4 (Park et al., 2012). MMP1 proteolytically engages EGF-like ligands in an osteolytic signaling cascade for metastasis (Lu et al., 2009). LIN9 is a component of the metastasis-predicting Mammaprint gene signature in breast cancer (Van't Veer et al., 2002). The functional analysis showed that the SSI-signaling genes were highly related to metastasis or related biological functions, which also supported the sensitivity and effectiveness of the identified SSI-signaling genes. A list of common SSI-signaling genes for STAD is provided in **Table S1**.

# Identifying the Critical Transition for Influenza Infection

We applied the proposed method to a time-course dataset of a live influenza infection challenge (GSE30550), in which 17 subjects received an injection of influenza virus (H3N2/Wisconsin). Among the 17 subjects, nine (subjects 1, 5, 6, 7, 8, 10, 12, 13, and 15) were infected and showed clinical symptoms, while the other eight (subjects 2, 3, 4, 9, 11, 14, 16, and 17) stayed healthy and did not show any clinical symptoms during the whole period of the infection challenge (**Figure 5A**). The gene expression profiles were derived from whole peripheral blood drawn from all subjects at 16 time points, i.e., 24 h before injection and 0, 5, 12, 21, 29, 36, 45, 53, 60, 69, 77, 84, 93, 101, and 108 h after the injection. At each time point, there was only a single sample for each subject. By employing the proposed method, we obtained the individual-specific SSI score for each subject in both the symptomatic and asymptomatic groups.

The individual-specific SSI scores in **Figure 5B** demonstrate that there were obvious signals provided by the SSI score for all symptomatic subjects (9 red curves), while there were few significant changes in the SSI scores for asymptomatic subjects (8 blue curves). The specific SSI scores for the nine symptomatic subjects are shown in **Figure 5C**. Clearly, the SSI score indicated the pre-disease state (the state before the appearance of clinical symptoms) for each symptomatic individual, with 100% accuracy. However, there was a 25% false positive rate (**Figure 5A**). To demonstrate the evolution of the individual-specific differential network, two sets of differential networks for two symptomatic subjects, i.e., subject 1 and subject 12, are illustrated in **Figure 6**. Clearly, at the respective tipping point, many differential edges arose just before the emergence of clinical symptoms. At the tipping point of each symptomatic subject, the top 1% genes with the largest local SSI scores were regarded as a dynamical network biomarker and were selected for further functional analysis.

Based on IPA analysis, the common SSI-signaling genes were highly related to the functional annotations "Quantity of lymphocytes" (P-value = 2.23E-11), "Inflammation" (P-value = 2.47E-10), "Viral Infection" (P-value = 1.06E-09), and "Homeostasis of leukocytes" (P-value = 1.14E-08). From KEGG enrichment analysis, the common SSI-signaling genes were enriched in Influenza A and a variety of cellular pathways including the PI3K-Akt signaling pathway, MAPK signaling pathway, NF-kappa B signaling pathway, etc. The functional analysis again supported the effectiveness of the SSI-signaling genes. A list of common SSI-signaling genes for influenza infection is provided in **Table S2**.

#### DISCUSSION

Detecting the early-warning signal before a sudden deterioration into a severe disease state is crucial to patients all over the world. However, it is generally challenging to signal such critical transition through only a single case sample, since the lack of samples disables statistical indices and thus makes conventional methods fail. In this work, we proposed a computational method to identify the pre-disease state on the basis of a single sample. Specifically, given a number of reference samples which can be the normal samples of an individual (**Figure 1A**), the proposed method can distinguish the abnormal single sample by a differential-network-based HMM scheme. The proposed method has been validated by both the numerical simulation (**Figure 3**) and two real datasets (**Figures 4**, **5**).

Compared with the traditional methods, which are mostly based on the differential expression of observed biomolecules, the proposed method aims at exploring the dynamic information of differential associations among biomolecules when a biological system is in the vicinity of a tipping point. This method thus possesses several obvious advantages. First, it works when only a single case sample is available, which benefits the analysis in personalized medicine. Second, it detects the pre-disease state rather than a disease state, which may help to achieve early diagnosis of some complex diseases. Third, it exhibits the critical properties at a network level, such as the abnormally arising differential associations, which may provide new insights into catastrophic deterioration.

Although the proposed method is merely a step toward the identification of the pre-disease state, and the algorithm is expected to be improved in both sensitivity and accuracy, it follows the idea of personalized medicine by providing a computational way to achieve individual-specific analysis and prediction using only a single sample.

#### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30550.

#### REFERENCES


# AUTHOR CONTRIBUTIONS

RL and PC conceived the project. PC supervised the project. JZ, XY, and YL performed the computation and analysis. All authors wrote the manuscript and read and approved the final manuscript.

# FUNDING

This work was supported by National Natural Science Foundation of China (Nos. 11771152, 91530320, 61803360, and 11871456), Pearl River Science and Technology Nova Program of Guangzhou (No. 201610010029), Major Science and Technology Projects in Guangdong Province under Grant No. 2015B010128008.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00285/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Liu, Zhong, Yu, Li and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Convergent Study of Genetic Variants Associated With Crohn's Disease: Evidence From GWAS, Gene Expression, Methylation, eQTL and TWAS

#### Yulin Dai<sup>1</sup> , Guangsheng Pei<sup>1</sup> , Zhongming Zhao1,2,3 \* and Peilin Jia<sup>1</sup> \*

<sup>1</sup> Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States, <sup>2</sup> Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States, <sup>3</sup> Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Jia Wen, University of North Carolina at Chapel Hill, United States Xingming Zhao, Tongji University, China

#### \*Correspondence:

Zhongming Zhao zhongming.zhao@uth.tmc.edu Peilin Jia peilin.jia@uth.tmc.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 22 December 2018 Accepted: 21 March 2019 Published: 09 April 2019

#### Citation:

Dai Y, Pei G, Zhao Z and Jia P (2019) A Convergent Study of Genetic Variants Associated With Crohn's Disease: Evidence From GWAS, Gene Expression, Methylation, eQTL and TWAS. Front. Genet. 10:318. doi: 10.3389/fgene.2019.00318

Crohn's Disease (CD) is one of the predominant forms of inflammatory bowel disease (IBD). A combination of genetic and non-genetic risk factors has been reported to contribute to the development of CD. Many high-throughput omics studies, such as genome-wide association studies (GWAS) and next generation sequencing studies, have been conducted to identify disease-associated risk variants that might contribute to CD. A pressing need remains to prioritize and characterize candidate genes that underlie the etiology of CD. In this study, we collected comprehensive multi-dimensional data from GWAS, gene expression, and methylation studies and generated transcriptome-wide association study (TWAS) data to further interpret the GWAS association results. We applied our previously developed method called mega-analysis of Odds Ratio (MegaOR) to prioritize CD candidate genes (CDgenes). As a result, we identified consensus sets of CDgenes (62–235 genes) based on the evidence matrix. We demonstrated that these CDgenes interacted with each other significantly more frequently than expected by chance. Functional annotation of these genes highlighted critical immune-related processes such as immune response, MHC class II receptor activity, and immunological disorders. In particular, the constitutive photomorphogenesis 9 (COP9) signalosome related genes were found to be significantly enriched in CDgenes, implying a potential role of the COP9 signalosome in the pathogenesis of CD. Finally, we found that some of the CDgenes shared biological functions with known drug targets of CD, such as the regulation of inflammatory response and leukocyte adhesion to vascular endothelial cells. In summary, we identified highly confident CDgenes from multi-dimensional evidence, providing insights for the understanding of CD etiology.

Keywords: GWAS, TWAS, eQTL, integrative study, Crohn's Disease, COP9 signalosome, IL12RB2, LTBR

# INTRODUCTION

fgene-10-00318 April 9, 2019 Time: 12:44 # 2

Crohn's Disease (CD) is one of the major forms of inflammatory bowel disease (IBD). CD has a prevalence of 26 to 200 per 100,000 persons in populations with European ancestry (Loftus, 2004). Family studies have shown that CD has a heritability of 0.25 to 0.42 (Gordon et al., 2015). A dysregulated immune response to environmental factors such as the gut microbiome (Khor et al., 2011; Jostins et al., 2012; Ananthakrishnan, 2013) has been reported in CD. Complex diseases like CD are usually affected by a large number of genetic and environmental factors (Rivas et al., 2011). Recent genome-wide association studies (GWAS) of CD have successfully identified more than two hundred disease-associated loci at the genome-wide significance level (Franke et al., 2010; Liu et al., 2015). However, these findings could only explain a moderate proportion of the heritability (Verstockt et al., 2018). Recently, integrating GWAS signals with transcriptome-wide association study (TWAS) and expression quantitative trait loci (eQTL) annotation has become an effective approach to identify new susceptibility loci and has been successfully applied in several complex diseases including CD (He et al., 2013; Marigorta et al., 2017; Gusev et al., 2018). Other forms of genetic variants are also implicated, such as copy number variation (CNV) and rare variants, and they are expected to have large effects (Visscher et al., 2017). For example, a genome-wide association study of CNVs identified IRGM (immunity-related GTPase family, M) and the HLA gene family for CD (Wellcome Trust Case Control Consortium et al., 2010). Several genes were reported to harbor rare variants associated with CD, such as NOD2 (Nucleotide Binding Oligomerization Domain Containing 2, alias CARD15) and ADCY7 (Adenylate Cyclase 7) (Hunt et al., 2013; Luo et al., 2017). Apart from those genetic variants, epigenetic alterations were also observed in CD patients. For example, altered methylation levels in peripheral blood were reported for the genes MIR21 (MicroRNA 21), TXK (TXK Tyrosine Kinase), ITGB2 (Integrin Subunit Beta 2) and HLA loci in case-control studies (Adams et al., 2014; Ventham et al., 2016). Lastly, a number of transcriptome profiling studies have been conducted, revealing genes that were differentially expressed in CD compared to controls, such as IFITM1 (Interferon Induced Transmembrane Protein 1), STAT1 (Signal Transducer And Activator Of Transcription 1), TAP1 (Transporter 1, ATP Binding Cassette Subfamily B Member), and PSMB8 (Proteasome Subunit Beta 8) identified using endoscopic pinch biopsies (Wu et al., 2007), and SERPINB2 (Serine (or cysteine) proteinase inhibitor, clade B (ovalbumin), member 2, PAI 2), NCK2 (NCK Adaptor Protein 2), and ITGB3 (Integrin Subunit Beta 3) identified using peripheral blood mononuclear cells (PBMCs) (Burczynski et al., 2006). Each of these unbiased, genome-wide studies has provided unique insights and candidate pathogenic variants and genes to understand the etiology of CD. However, challenges remain in how to effectively integrate these heterogeneous association data, which span a wide variety of biological processes.

Considerable work has been done on integrating high-throughput multi-omics data, ranging from unsupervised to supervised data integration (Jiang et al., 2014; Wang et al., 2015; Huang et al., 2017; Jia et al., 2017). However, most of these tools require domain expertise, especially for the investigated diseases. Under the assumption that the number of susceptibility genes for a complex disease is limited (Yang et al., 2005), we developed an unsupervised machine learning approach named mega-analysis of Odds Ratio (MegaOR) to prioritize candidate genes from multiple omics data (Jia et al., 2018). MegaOR relies on the assumption that each single-omics study was conducted with control of false discoveries using domain-specific criteria (e.g., fold change for gene expression studies and a stringent genome-wide significance threshold for GWAS data). We successfully demonstrated the method in schizophrenia (Jia et al., 2018). In this study, we collected five types of omics data, each representing a genome-wide association study of a molecular type with CD. We investigated the disease-relevant tissues using unbiased GWAS data and conducted TWAS for CD in these tissues. By applying MegaOR, we prioritized consensus sets of candidate genes and investigated their characteristics using functional enrichment analysis and drug target crosstalk.

## MATERIALS AND METHODS

#### GWAS Summary Statistics

We collected the summary statistics from a GWA study for CD conducted by the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC) (Liu et al., 2015). The study included 27,726 individuals (5,956 cases and 21,770 controls) of European ancestry genotyped using a combination of array platforms, including Affymetrix GeneChip Human Mapping 500K, Affymetrix Genome-Wide Human SNP Array 6.0, and Illumina HumanHap300 BeadChip. The genotype data were also imputed based on the 1000 Genomes Project reference panel (1000 Genomes Project Consortium et al., 2015). In total, the GWAS summary statistics included association results for a total of 11,002,658 SNPs either genotyped or imputed (score > 0.3).

#### Gene Expression Data

We obtained the gene expression data from a recent study that profiled the whole blood expression of 24 CD patients and 23 healthy controls (Ventham et al., 2016) (GEO accession ID: GSE86434). The expression data were generated using the Illumina HumanHT-12 V4.0 expression BeadChip platform (GPL10558), which contained about 31,000 annotated genes with more than 47,000 probes. We used the online tool GEO2R<sup>1</sup> to conduct differential gene expression analysis, comparing the expression of whole blood mRNA between CD cases and controls. Following the method used in the original paper, log2 transformation was conducted for the expression data, and then Limma (R package) was used to adjust for covariates (age and gender) to obtain the differentially expressed genes (DEGs) between CD cases and controls. Genes with fold change (FC) ≥ 1.5 or ≤ 0.67 and adjusted p-value < 0.05 (the Benjamini and Hochberg method) were defined as DEGs (Mitra et al., 2015; Ritchie et al., 2015; Hu et al., 2018).
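The DEG criterion can be expressed compactly in code. The sketch below is a plain-Python illustration of the Benjamini-Hochberg adjustment followed by the fold-change filter; it does not reproduce GEO2R or Limma, the function names are hypothetical, and it assumes fold changes are given on the linear scale (tools such as Limma typically report log2 fold changes, which would need to be converted first).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_degs(genes, fold_changes, pvalues, fc_up=1.5, fc_down=0.67, alpha=0.05):
    """Select genes with FC >= fc_up or FC <= fc_down and adjusted p < alpha."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, fc, q in zip(genes, fold_changes, adjusted)
        if (fc >= fc_up or fc <= fc_down) and q < alpha
    ]


# Toy example with hypothetical genes A-D
genes = ["A", "B", "C", "D"]
fold_changes = [2.0, 1.1, 0.5, 1.6]
pvalues = [0.01, 0.04, 0.03, 0.005]
print(call_degs(genes, fold_changes, pvalues))
```

In the toy input, gene B passes the significance filter but not the fold-change filter, so it is excluded even though its adjusted p-value is below 0.05.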

<sup>1</sup>https://www.ncbi.nlm.nih.gov/geo/geo2r/

## Methylation Data

We obtained the methylation data from a recent study that conducted differential methylation analysis using 121 CD cases and 191 healthy controls (Ventham et al., 2016) (GEO accession ID: GSE87648). The study provided whole genome methylation using Illumina HumanMethylation450 BeadChip platform (GPL13534), which contained ∼485,000 probes. We requested the methylation results from the author of the study. This differential methylation genes was generated using whole blood leukocyte samples. In the original work (Ventham et al., 2016), the authors normalized the methylation matrix using the R package lumi and estimated the cell proportion by the R package minfi. Lastly, Limma was used to identify differentially methylated CpG probes. Probes were mapped to genes according to the annotation file of the chip (Jiang et al., 2016). For genes with multiple probes, we selected the most significant probe for the gene.
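The last step, keeping the most significant probe per gene, can be sketched as follows. This is a minimal illustration that assumes the probe-level results are available as (probe, gene, p-value) tuples; the function name and the probe and gene labels are hypothetical.

```python
def most_significant_probe(probe_rows):
    """Map each gene to its (probe_id, p_value) pair with the smallest p-value.

    probe_rows: iterable of (probe_id, gene, p_value) tuples.
    """
    best = {}
    for probe_id, gene, p_value in probe_rows:
        # Keep the probe only if the gene is new or this p-value is smaller.
        if gene not in best or p_value < best[gene][1]:
            best[gene] = (probe_id, p_value)
    return best


if __name__ == "__main__":
    rows = [
        ("probe1", "GENE_A", 0.20),
        ("probe2", "GENE_A", 0.01),
        ("probe3", "GENE_B", 0.05),
    ]
    print(most_significant_probe(rows))
```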

#### Gene-Based Association Test Using Pascal

Because our analysis is gene-based whereas the GWAS summary statistics provide association results for SNPs, we computed a p-value for each gene from the association results of the SNPs mapped to it. Specifically, we considered all SNPs mapped to the gene body or within 50 kb upstream or downstream of the gene. We used the method Pascal to calculate the gene-based p-values (Lamparter et al., 2016). Pascal utilizes the sums of chi-squares and controls potential biases from gene length, SNP density, and the local LD structure. We used the European panel as the reference, as was done in a recent study (Sun et al., 2018).
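The SNP-to-gene mapping window can be illustrated with a short sketch. It covers only the windowing step (gene body plus 50 kb on each side) and not Pascal's test statistic; the function name is hypothetical, and all positions are assumed to be base-pair coordinates on the same chromosome.

```python
def snps_in_gene_window(gene_start, gene_end, snp_positions, flank=50_000):
    """Return SNP positions within the gene body or `flank` bp on either side."""
    lower = max(0, gene_start - flank)
    upper = gene_end + flank
    return [pos for pos in snp_positions if lower <= pos <= upper]


if __name__ == "__main__":
    # Hypothetical gene spanning 100,000-120,000 bp.
    print(snps_in_gene_window(100_000, 120_000,
                              [40_000, 50_000, 110_000, 170_000, 170_001]))
```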

#### Tissue-Specific Enrichment Analysis (TSEA)

To identify the tissues in which the GWAS genes were specifically expressed, we conducted a tissue-specific enrichment analysis using our in-house R package, deTS (Pei et al., 2019a). deTS provides a preprocessed reference panel with 47 tissues (each with ≥ 30 samples) from the GTEx (v7) expression data (GTEx Consortium et al., 2017) and implements Fisher's Exact Test for the enrichment analysis. We applied deTS to genes defined by the Pascal results.
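The enrichment test itself can be sketched with a one-sided Fisher's exact test, which is equivalent to an upper-tail hypergeometric probability. This is a generic illustration and not the deTS interface; the function name and argument layout are assumptions.

```python
from math import comb


def fisher_enrichment_p(overlap, query_size, tissue_size, background_size):
    """One-sided Fisher's exact p-value for over-representation.

    Returns P(X >= overlap), where X is the number of tissue-specific genes
    among `query_size` genes drawn without replacement from a background of
    `background_size` genes, `tissue_size` of which are tissue-specific.
    """
    total = comb(background_size, query_size)
    upper = min(query_size, tissue_size)
    tail = sum(
        comb(tissue_size, i) * comb(background_size - tissue_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10 background genes, 4 tissue-specific, 3 query genes, 2 overlap.
    print(fisher_enrichment_p(2, 3, 4, 10))
```

With the toy numbers, the tail probability is (6 × 6 + 4 × 1) / 120 = 1/3.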

#### Transcriptome-Wide Association Studies (MetaXcan)

A transcriptome-wide association study estimates genetically regulated expression (GReX) for each gene and tests the association between genes and traits by assessing the difference in GReX between trait samples and control samples. We utilized the method MetaXcan for a TWAS analysis of the CD GWAS summary statistics (Barbeira et al., 2018). The pre-calculated weight matrix was downloaded from http://predictdb.org/. We utilized three disease-relevant tissues for the analyses, which were determined based on previous knowledge and the deTS results.

#### Integrative Analysis of eQTL and GWAS Data (Sherlock)

Considering that many disease-associated genetic variants have regulatory roles, we applied the method Sherlock to integrate eQTLs and GWAS with the aim to identify concordant evidence between the two platforms (He et al., 2013). Sherlock uses a Bayesian statistical method to match the signature of genes from eQTLs to GWAS. As eQTL data have population and tissue specificity, we applied Sherlock for the CD GWAS data using the same tissues as for MetaXcan. A gene-based p-value was calculated from Sherlock for each gene in each tissue.

#### Mega-Analysis of Odds Ratio (MegaOR)

We adopted our previous work MegaOR to identify a consensus set of candidate genes that collectively had the most intensive load of evidence for their association with CD (hereafter referred to as CDgenes). MegaOR took a multidimensional data matrix as the input. In each dimension, genes that were determined to be significantly associated with the trait based on the domain-specific threshold were labeled as 1, while genes that failed the significance threshold were labeled as 0. For example, in the category of gene expression, significantly differentially expressed genes (FDR < 0.05 and FC ≥ 1.5 or ≤ 0.67) were labeled 1 and other genes 0. The same preprocessing was performed for each dimension following its particular domain-specific threshold. As a result, the multidimensional data matrix included only binary values. MegaOR took this binary data matrix and defined a combined OR (cOR): $\mathrm{cOR} = \mu - \frac{\sum (\mathrm{OR} - \mu)^2}{d}$, where OR represented the Odds Ratio for each dimension, d was the number of evidence dimensions, and µ was the average OR across dimensions. The term $\frac{\sum (\mathrm{OR} - \mu)^2}{d}$ was introduced as a penalty to control the deviation of any dimensional OR and served to balance the multidimensional lines of evidence. MegaOR implemented an iterative optimization procedure to find the best set of genes (denoted by S) with the pre-defined size n such that, at the stable status, genes in S had the best cOR. The workflow is illustrated in **Figure 1**. Further details can be found in our previous work (Jia et al., 2018).
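To make the combined OR concrete, the sketch below computes it for one candidate gene set over a binary evidence matrix. It is a sketch under stated assumptions rather than the MegaOR implementation. The text does not spell out how each dimensional OR is computed, so the sketch assumes it compares evidence inside versus outside the candidate set in a 2 × 2 table, and it adds a 0.5 pseudo-count to avoid division by zero. The iterative optimization over gene sets is not shown.

```python
def odds_ratio(labels, in_set, pseudo=0.5):
    """Odds ratio of carrying evidence (label 1) inside vs. outside the set.

    Assumed definition: 2 x 2 table of set membership by evidence label,
    with a pseudo-count added to every cell.
    """
    a = sum(1 for lab, s in zip(labels, in_set) if s and lab)          # in set, evidence
    b = sum(1 for lab, s in zip(labels, in_set) if s and not lab)      # in set, no evidence
    c = sum(1 for lab, s in zip(labels, in_set) if not s and lab)      # outside, evidence
    d = sum(1 for lab, s in zip(labels, in_set) if not s and not lab)  # outside, no evidence
    return ((a + pseudo) * (d + pseudo)) / ((b + pseudo) * (c + pseudo))


def combined_or(columns, in_set):
    """cOR = mean OR minus the mean squared deviation of the ORs from that mean."""
    ors = [odds_ratio(col, in_set) for col in columns]
    n_dims = len(ors)
    mu = sum(ors) / n_dims
    penalty = sum((o - mu) ** 2 for o in ors) / n_dims
    return mu - penalty


if __name__ == "__main__":
    in_set = [True, True, False, False]
    balanced = [[1, 1, 0, 0], [1, 1, 0, 0]]    # both dimensions support the set
    unbalanced = [[1, 1, 0, 0], [1, 0, 1, 0]]  # only one dimension supports it
    print(combined_or(balanced, in_set), combined_or(unbalanced, in_set))
```

In this toy case the balanced set keeps its full OR because the penalty is zero, whereas the unbalanced set is penalized heavily, which is the behavior the penalty term is meant to produce.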

#### Functional Enrichment Analysis

We used the R package RDAVIDWebService (version 1.16.0) for functional enrichment analysis (Fresno and Fernandez, 2013). We focused on Gene Ontology (GO) and the Genetic Association Database (GAD). The GO functional annotation tool (FAT) was used to filter out very broad terms based on a measured specificity of each term (not level-specificity). We further used the ClueGO plug-in of Cytoscape to display the relationships between genes and GO terms (Shannon et al., 2003; Bindea et al., 2009). Only GO terms with more than five CDgenes were displayed.

#### Drug Target Gene Enrichment Analysis

We queried the Therapeutic Target Database<sup>2</sup> to identify Food and Drug Administration (FDA) approved drugs that were used for CD (Li et al., 2018). Medication target genes for CD were extracted from the database.

<sup>2</sup>http://bidd.nus.edu.sg/group/cjttd/ (accessed 2 February 2019).

#### Protein-Protein Interaction (PPI) Analysis

We searched the STRING database<sup>3</sup> to identify protein-protein interactions (PPIs) between CD drug target genes and our CDgenes (Szklarczyk et al., 2017). We selected Homo sapiens as the organism and considered only experimentally validated PPIs with medium confidence (score > 0.35).

# RESULTS

#### Multi-Dimensional Evidence for Crohn's Disease

Using the approaches described in methods, we organized our data into five major categories: Pascal (combined GWAS information), Sherlock (integrative information of GWAS and eQTL), MetaXcan (TWAS), gene expression (with DEGs labeled as 1), and methylation (with differentially methylated genes (DMGs) labeled as 1). Particularly for Sherlock and MetaXcan, the analyses were performed for different tissues and thus, each had multiple sets of omics data. Each dimension presents a unique biological aspect to assess the potential association between a gene and CD.

As previously reported, interpretation of disease-associated genetic variants is more appropriate in tissues that are related to the disease, as genetic regulation has strong tissue specificity. To determine the tissues relevant to CD, we conducted TSEA using the CD GWAS data (see the section "TSEA to Determine CD Related Tissues") and selected three tissues for the Sherlock and MetaXcan analyses: whole blood (the most significant p-value was 9.75 × 10<sup>−7</sup>), spleen (p = 4 × 10<sup>−3</sup>), and small intestine (terminal ileum) (p = 5.48 × 10<sup>−3</sup>) (**Figure 2A**). As a result, we had a total of nine groups of genes: Pascal, three groups of Sherlock results, three groups of MetaXcan results, DEGs, and DMGs. For each group, we applied group-specific thresholds to select positive genes (i.e., genes to be labeled as 1 in the matrix) (**Table 1**). Specifically, there were 773 Pascal genes (p<sub>BH</sub> < 0.05), 289 Sherlock genes in whole blood (p<sub>BH</sub> < 0.2), 170 Sherlock genes in spleen (p<sub>BH</sub> < 0.2), 108 Sherlock genes in small intestine (terminal ileum) (p<sub>BH</sub> < 0.2), 200 MetaXcan genes in whole blood (p<sub>BH</sub> < 0.2), 112 MetaXcan genes in spleen (p<sub>BH</sub> < 0.2), 69 MetaXcan genes in small intestine (terminal ileum) (p<sub>BH</sub> < 0.2), 282 DEGs (p<sub>BH</sub> < 0.05 and |log2(FC)| > 0.58), and 337 DMGs (p<sub>BH</sub> < 0.2). These data collectively nominated a total of 1,668 genes, each with at least one type of association evidence. By applying TSEA to each gene set (**Figure 2B**), we found that whole blood, spleen, lung, and small intestine (terminal ileum) were the most enriched tissues. Specifically, Pascal genes (p = 1.44 × 10<sup>−6</sup>), DEGs (p = 8.05 × 10<sup>−52</sup>), and DMGs (p = 5.82 × 10<sup>−5</sup>) were all most significantly enriched in whole blood. Six gene sets were most significantly enriched in spleen: the three Sherlock gene sets, the MetaXcan genes calculated using small intestine (terminal ileum), the MetaXcan genes calculated using whole blood, and the merged set of Sherlock genes.

<sup>3</sup>https://string-db.org/ (accessed 13 February 2019).

FIGURE 2 | […] p-values from each of the three disease-relevant tissues [whole blood, spleen and small intestine (terminal ileum)]. The red line indicates the –log<sub>10</sub>(0.2) threshold. (D) Distribution of Sherlock adjusted p-values. X-axis: –log<sub>10</sub> Sherlock adjusted p-values from each of the three disease-relevant tissues [whole blood, spleen and small intestine (terminal ileum)]. The red line indicates the –log<sub>10</sub>(0.2) threshold. (E) Pair-wise comparison among the five lines of evidence. Fisher's Exact Test was used for the significance test. The values in each cell represent the –log<sub>10</sub> p-value. The figure was based on 1,668 genes that had at least one line of evidence.

Among the 1,668 genes, 1,287 (79.3%) genes had only one line of evidence and no gene was found with more than eight lines of evidence. We further merged the Sherlock genes from the three tissues and obtained a union of 398 Sherlock genes (5.6%, **Figure 2C**) for the subsequent MegaOR analysis. Similarly, a union of 305 MetaXcan genes (4.1%, **Figure 2D**) was obtained from the three tissue-specific MetaXcan result sets. Collectively, these multidimensional data were organized as the input matrix with 1,668 genes in five dimensions, each representing one kind of disease association evidence. We refer to the genes in this matrix as the evidence set (ES) genes.
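A minimal sketch of how such a binary evidence matrix could be assembled is shown below. It is not the authors' code. The tissue-level gene sets are merged by set union, and each gene with at least one line of evidence gets one row of 0/1 labels. The function name and the gene labels g1 to g4 are placeholders.

```python
def build_evidence_matrix(evidence_sets):
    """Build a gene-by-dimension binary matrix.

    evidence_sets: dict mapping dimension name -> set of positive genes.
    Returns (dimension names, dict mapping gene -> list of 0/1 labels).
    """
    dims = list(evidence_sets)
    genes = sorted(set().union(*evidence_sets.values()))
    matrix = {g: [int(g in evidence_sets[d]) for d in dims] for g in genes}
    return dims, matrix


if __name__ == "__main__":
    # Hypothetical tissue-level Sherlock results merged into one dimension.
    sherlock_blood, sherlock_spleen, sherlock_ileum = {"g2"}, {"g3"}, {"g2", "g4"}
    evidence = {
        "Pascal": {"g1", "g2"},
        "Sherlock": sherlock_blood | sherlock_spleen | sherlock_ileum,
        "DEG": {"g4"},
    }
    dims, matrix = build_evidence_matrix(evidence)
    print(dims)
    for gene, row in matrix.items():
        print(gene, row)
```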

As a control, we generated a second set of genes containing all the protein-coding genes that were expressed in the three CD-related tissues, without requiring them to have at least one line of evidence for association with CD. Specifically, we obtained 13,763 protein-coding genes (GENCODE v19) that had an average RPKM (Reads Per Kilobase of transcript, per Million mapped reads) value > 1 in whole blood, spleen, or small intestine (terminal ileum) (GTEx v7 data). These genes, referred to as tissue set (TS) genes, were considered to have very weak support for their potential association with CD. A total of 1,286 genes were shared between the TS genes and the 1,668 genes with evidence. After removing redundancy, we built a second matrix with a union of 14,065 genes (13,763 TS genes expressed in CD-relevant tissues and 1,668 genes with at least one line of evidence for association with CD). We applied MegaOR to both matrices, expecting that MegaOR could prioritize disease genes with or without the TS genes that had weak association evidence.

<sup>∗</sup>FDR, false discovery rate; FC, fold change; DEG, differentially expressed genes; DMG, differentially methylated genes.

#### TSEA to Determine CD Related Tissues

Crohn's Disease causes inflammation of the gastrointestinal tract (Fakhoury et al., 2014). Digestive tissues such as colon and small intestine (terminal ileum) have long been considered to be related to CD (Wu et al., 2007). Among the multidimensional data and methods, Sherlock and MetaXcan both require pre-defined disease-relevant tissues, and DEGs and DMGs were obtained using blood samples. Hence, only Pascal genes from GWAS data were suitable for the determination of tissues (Pei et al., 2019b). We performed TSEA using Pascal genes defined at different thresholds (p < 0.05, p < 0.01, p < 5 × 10<sup>−3</sup>, p < 1 × 10<sup>−3</sup>, p < 5 × 10<sup>−4</sup>, p < 1 × 10<sup>−4</sup>, p < 5 × 10<sup>−5</sup>, p < 1 × 10<sup>−5</sup>, and p < 5 × 10<sup>−6</sup>, **Figure 2**). As shown in **Figure 2**, Pascal genes were most significantly enriched in whole blood across thresholds (e.g., the most significant p-value was 9.75 × 10<sup>−7</sup> when using genes with p<sub>Pascal</sub> < 0.05), followed by small intestine (terminal ileum) (the most significant p-value was 3.22 × 10<sup>−3</sup> when using genes with p<sub>Pascal</sub> < 0.005). Both spleen and lung were also enriched with Pascal genes. However, considering that the spleen acts as a filter for blood as part of the immune system while the lung has no obvious link to CD, we selected whole blood, small intestine (terminal ileum), and spleen as the three tissues most relevant to CD and used them for the application of Sherlock and MetaXcan.

#### Pair-Wise Comparison of the Multidimensional Association Data

To explore the correlation among the different dimensions of data, we conducted a pair-wise comparison using genes from each group. We used Fisher's exact test to test whether any two types of evidence were associated. As shown in **Figure 2E**, among all possible pairs (n = 15), we only observed a significant correlation between Sherlock and MetaXcan genes (p = 2.63 × 10<sup>−43</sup>). This is within expectation because both data types measure the integrative signals of genetic variants and their regulatory roles in diseases. Surprisingly, Pascal genes had no correlation with either Sherlock genes (p = 0.95) or MetaXcan genes (p = 0.98), even though both Sherlock and MetaXcan used the same GWAS data as the input to calculate gene-based p-values. This lack of association implied that independent information could be obtained by integrating eQTL and GReX in interpreting GWAS data, providing fundamental support for our work of integrating these diverse lines of evidence. In addition, DEGs and DMGs showed no association with any of the other dimensions.

#### CDgenes Identified by MegaOR

To identify a set of candidate genes that have the most intensive load of evidence, we applied MegaOR to each of the two multidimensional evidence matrices: the ES matrix with 1,668 genes (each with at least one type of evidence) and the TS matrix with 14,065 genes (the union of the genes expressed in disease-relevant tissues and the genes from the ES matrix). We tested eight set sizes separately, i.e., S = 150, 190, 230, 270, 310, 350, 390, 430 for the ES matrix and T = 230, 270, 310, 350, 390, 430, 470, 510 for the TS matrix. For each set size, different sets of genes could reach the best cOR, even though they have the same number of genes. Thus, we applied MegaOR 100 times for each set size. The average ORs at each set size are displayed in **Figures 3A,B**. Taking the ES matrix as an example, we obtained eight sets of CDgenes. At each size, we selected genes that were retained in more than 50% of the runs (**Figure 3E**). We denote the resulting sets as S1 (set size: S = 150, CDgenes: 62), S2 (S = 190, CDgenes: 121), S3 (S = 230, CDgenes: 148), S4 (S = 270, CDgenes: 162), S5 (S = 310, CDgenes: 210), S6 (S = 350, CDgenes: 234), S7 (S = 390, CDgenes: 235), and S8 (S = 430, CDgenes: 235). CDgenes obtained using larger set sizes covered nearly all the CDgenes obtained using smaller set sizes. For example, the 121 genes in S2 included all the 62 genes in S1. For the TS matrix, the corresponding sets were T1 (T = 230, CDgenes: 124), T2 (T = 270, CDgenes: 148), T3 (T = 310, CDgenes: 155), T4 (T = 350, CDgenes: 165), T5 (T = 390, CDgenes: 196), T6 (T = 430, CDgenes: 222), T7 (T = 470, CDgenes: 230), and T8 (T = 510, CDgenes: 235). In both cases, a converged stable status could be observed, from S6 to S8 and from T7 to T8, respectively (**Figures 3C,D**). Thus, we suggest that the 235 CDgenes in S7 and the 235 CDgenes in T8 were close to consensus sets of CDgenes that reach the global maximum load of evidence. Interestingly, the two sets of CDgenes (S7 and T8) shared 234 genes. Thus, MegaOR performed stably in generating such consensus sets of candidate genes.
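The selection of genes retained in more than 50% of the repeated runs can be sketched as follows. This is an illustrative sketch, not the authors' code; it assumes that each run returns a collection of gene identifiers, and the function name and toy gene labels are hypothetical.

```python
from collections import Counter


def consensus_genes(runs, min_fraction=0.5):
    """Return genes that occur in strictly more than `min_fraction` of the runs."""
    # Count each gene at most once per run.
    counts = Counter(gene for run in runs for gene in set(run))
    cutoff = min_fraction * len(runs)
    return sorted(gene for gene, count in counts.items() if count > cutoff)


if __name__ == "__main__":
    runs = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "d"}]
    print(consensus_genes(runs))
```

In the toy example, genes a and b each appear in three of four runs and are kept, while c and d appear once and are dropped.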

#### CDgenes Interact With Each Other Significantly

FIGURE 3 | […] (E) The frequency of genes covered by 100 stable sets at an example size S = 390 in the at-least-one-type-of-evidence set (ES). Genes on the left part of the plot in green were less frequently recovered (<50% occurrence). Genes on the right part of the plot were selected as the CDgenes for the corresponding set size.

Many disease genes were reported to interact with each other more often than with randomly selected genes, especially genes associated with the same diseases (Barabasi et al., 2011). This was likely because genes underlying the same disease are often involved in related biological pathways. To investigate whether our CDgenes tended to interact more often with each other, we curated protein-protein interaction (PPI) data from three sources. The first network was from HumanNet and has been previously used to study GWAS data (Lee et al., 2011). The second network was from a precomputed influence graph that was recently used in cancer (Ding et al., 2015). The third network was a combined dataset of HPRD and STRING (MAGI) (Hormozdiari et al., 2015). For each set of CDgenes, we recorded the number of interactions among the CDgenes and resampled 10,000 random gene sets, each with the same number of genes as the CDgene set. The number of random gene sets whose interactions exceeded the actual number of interactions was used to calculate an empirical p-value. We performed this analysis in each human PPI network separately. Interestingly, CDgenes showed significantly more PPIs than random gene sets in all three networks (HumanNet, the influence graph, and MAGI) (**Figure 4**), implying that our CDgenes tended to interact with each other significantly more frequently than expected for random gene sets.
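The resampling test can be illustrated with the sketch below. It is written under stated assumptions and is not the authors' pipeline. The network is assumed to be a list of undirected gene pairs, and random sets are drawn uniformly from a supplied background. The sketch also uses the common conservative estimator (r + 1) / (n + 1), counting random sets with at least as many interactions as observed, whereas the text counts sets that exceed the observed number. All names are hypothetical.

```python
import random


def count_internal_edges(genes, edges):
    """Count network edges whose two endpoints both lie in `genes`."""
    gene_set = set(genes)
    return sum(1 for u, v in edges if u in gene_set and v in gene_set)


def empirical_ppi_pvalue(candidates, background, edges, n_resamples=10_000, seed=0):
    """Return (observed edge count, empirical p-value) for a candidate gene set."""
    rng = random.Random(seed)
    observed = count_internal_edges(candidates, edges)
    pool = sorted(set(background))
    size = len(set(candidates))
    at_least = 0
    for _ in range(n_resamples):
        sample = rng.sample(pool, size)  # size-matched random gene set
        if count_internal_edges(sample, edges) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_resamples + 1)


if __name__ == "__main__":
    background = [f"g{i}" for i in range(30)]
    candidates = background[:4]
    # Toy network: the four candidates form a fully connected clique.
    edges = [(u, v) for i, u in enumerate(candidates) for v in candidates[i + 1:]]
    print(empirical_ppi_pvalue(candidates, background, edges, n_resamples=200))
```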

results using the ClueGO method in Cytoscape (Bindea et al., 2009). Each dot represented a gene or a GO term. Dots in the same color were considered from the same functional group by ClueGO annotation. Gene names were highlighted in red. Each edge indicated the gene was a component gene of the linked GO term.

#### Functional Enrichment Analysis of CDgenes

To identify the biological roles of the genes in the significant modules, we performed functional enrichment analysis using DAVID (see section "Materials and Methods"). We focused on GO terms and gene sets from the GAD. Our findings showed that the 235 CDgenes in S7 were enriched with MHC class II receptor activity (GO: 0032395, Molecular Function, p = 9.08 × 10<sup>−6</sup>), immune response (GO: 0006955, Biological Process, p = 1.02 × 10<sup>−14</sup>), and MHC protein complex (GO: 0042611, Cellular Component, p = 2.85 × 10<sup>−5</sup>) (**Figure 5A**). In GAD, immunological disorders such as Systemic Lupus Erythematosus (adjusted p = 2.52 × 10<sup>−24</sup>) and Psoriasis (adjusted p = 7.56 × 10<sup>−20</sup>) were found to be most significantly enriched (**Figure 5A**). Importantly, the category "Crohn's Disease" from GAD was also significantly enriched in our CDgenes (adjusted p = 2.44 × 10<sup>−13</sup>). Evidence for the 235 CDgenes is provided in **Supplementary Tables S1**, **S2**.

# DISCUSSION

In this work, we collected five types of omics data to prioritize CD-associated genes. Using tissue-specific enrichment analysis and GWAS data, we determined three tissues that were most related to CD [whole blood, spleen, and small intestine (terminal ileum)]. With these tissues, we calculated integrative association signals between tissue eQTL and GWAS data and conducted tissue-specific TWAS. We constructed two evidence matrices and applied MegaOR to identify a consensus set of CD-associated genes. The candidate CDgenes in this consensus set tended to interact with each other more often than size-matched random genes, indicating that these CDgenes could functionally cooperate with each other. Functional enrichment analysis showed that these CDgenes were enriched in immune-related diseases and biological processes. Moreover, integrative methods such as MegaOR are powerful tools for unraveling the etiology of complex diseases (Wang et al., 2016; O'Brien et al., 2018). With the increasing volume of omics data, these methods could be readily extended to other complex diseases, such as cancer, psychiatric diseases, and immune diseases.

#### Consensus CDgenes Overlap With Known Disease Risk Genes

Although we did not include rare mutations as evidence, two of our CDgenes were previously reported to harbor rare variants associated with CD: ADCY7 (adjusted p<sub>Pascal</sub> = 4.76 × 10<sup>−10</sup>, adjusted p<sub>Sherlock</sub> = 2.20 × 10<sup>−3</sup> in whole blood, adjusted p<sub>MetaXcan</sub> = 9.15 × 10<sup>−4</sup> in whole blood, and adjusted p<sub>DMG</sub> = 0.045) and NOD2 (adjusted p<sub>Pascal</sub> = 4.76 × 10<sup>−10</sup>, adjusted p<sub>Sherlock</sub> = 2.20 × 10<sup>−3</sup> in whole blood, and adjusted p<sub>MetaXcan</sub> = 0.096 in whole blood) (Hunt et al., 2013; Luo et al., 2017). Moreover, previously known DEGs and DMGs (MIR21, TXK, IFITM1, and TAP1) were also observed among our CDgenes, suggesting that these genes have a robust association with CD (Adams et al., 2014; Ventham et al., 2016).

#### Functional Enrichment Analysis of CDgenes Highlighted COP9 Signalosome

Our consensus CDgenes provided a promising list of candidate genes for CD. The significantly enriched pathways and functional sets suggested that CDgenes were biologically related to CD. In addition, we observed quite a number of promising genes with various types of evidence, such as genes involved in antigen binding (HLA-DOA, HLA-DOB, HLA-DQA2, HLA-DQB1, TAP1, and TAP2) and genes involved in the immune response (NOD2, IFITM1, PSMB8, TXK, and AIM2). Other genes of interest included NCKIPSD (NCK interacting protein with SH3 domain, adjusted p<sub>Pascal</sub> = 1.00 × 10<sup>−3</sup>, adjusted p<sub>Sherlock</sub> = 0.037 in whole blood, adjusted p<sub>MetaXcan</sub> = 0.13 in whole blood), WDR6 [WD repeat domain 6, adjusted p<sub>Pascal</sub> = 0.029, adjusted p<sub>Sherlock</sub> = 5.60 × 10<sup>−3</sup> in small intestine (terminal ileum), adjusted p<sub>MetaXcan</sub> = 0.025 in whole blood], DOCK7 (dedicator of cytokinesis 7, adjusted p<sub>Pascal</sub> = 2.00 × 10<sup>−3</sup>, adjusted p<sub>Sherlock</sub> = 2.40 × 10<sup>−3</sup> in whole blood, adjusted p<sub>MetaXcan</sub> = 2.10 × 10<sup>−3</sup> in whole blood), SPNS1 [sphingolipid transporter 1 (putative), p<sub>Pascal</sub> = 4.02 × 10<sup>−3</sup>, adjusted p<sub>Sherlock</sub> = 2.19 × 10<sup>−3</sup> in whole blood, p<sub>MetaXcan</sub> = 7.43 × 10<sup>−3</sup>], FLOT1 (flotillin 1, p<sub>Pascal</sub> = 3.79 × 10<sup>−3</sup>, p<sub>Sherlock</sub> = 0.13 in whole blood, p<sub>DEG</sub> = 5.87 × 10<sup>−3</sup>), and HSPA7 [heat shock protein family A (Hsp70) member 7, p<sub>Sherlock</sub> = 0.14 in whole blood, p<sub>DEG</sub> = 1.34 × 10<sup>−3</sup>]. Together with NOD2, these seven genes were all from the COP9 signalosome (CSN) term (53 genes in this term from ClueGO annotation, **Figure 5B** and **Supplementary Tables S3**, **S4**). Interestingly, these seven genes are not subunits of the CSN complex, but they interact with the CSN complex, as suggested by affinity purification and mass spectrometry experiments (Fang et al., 2008). CSN is a multi-subunit protease that regulates the activity of cullin-RING ligase (CRL) families of ubiquitin E3 complexes with isopeptidase activity. The major activities in which CSN is involved include de-ubiquitination and phosphorylation of important signaling regulators in protein kinase activities (Wei and Deng, 2003; Wei et al., 2008). Previous studies have revealed that COP9 signalosome subunit 5 (CSN5/Jab1) could regulate the development of the immune system in Drosophila (Harari-Steinberg et al., 2007). In mice, deficiency of one subunit of COP9 resulted in dysfunction of Paneth cells and colonic enterocytes, which could lead to impaired antimicrobial peptide production and might change the composition of the intestinal microbiota (Wang et al., 2014). This evidence suggests that dysregulation of CSN might affect the intestinal microbiota and contribute to the pathogenesis of inflammatory bowel disease. In addition, disrupting a CSN subunit affected T-cell development and antigen response, indicating that CSN might be involved in the homeostasis of T cells (Menon et al., 2007; Panattoni et al., 2008). Although debate continues on whether microbiota, innate immunity, or T cell activation leads to CD, our study sheds light on the potential etiology of CD through the dysregulation of the COP9 signalosome. These seven genes could only be discovered when integrating multi-dimensional evidence, demonstrating the advantage of MegaOR in unveiling such signals, which cannot be achieved by traditional single-domain approaches.

#### CDgenes as Potential Drug Targets

Disease-associated genes are natural candidates for drug development in both complex disease and cancer (Butcher et al., 2004; Zhao et al., 2015; Lee et al., 2016). We further compared our CDgenes with known target genes of CD medication using the Therapeutic Target Database (TTD) (Li et al., 2018).

Overall, six FDA approved drugs were found for CD: Clofazimine, Metronidazole, Ustekinumab, MLN0002, Infliximab, and Vedolizumab. These drugs had seven target genes: ABCB11, CYP51A1, IL12B, IL23A, ITGA4, ITGB7, and TNF (**Supplementary Table S5**). None of them were included in our CDgenes. We queried the STRING database (see text footnote 3) for interactions between the seven drug target genes and the 235 CDgenes (Szklarczyk et al., 2017). We observed that two CDgenes had experimentally supported medium-confidence (>0.35) interactions with two drug target genes: IL12RB2 (CDgene) interacting with IL12B (drug target) and LTBR (CDgene) interacting with TNF (drug target) (**Supplementary Figure S1**). IL12RB2 is the receptor of the drug target gene IL12B and was discovered from Pascal (p = 4.76 × 10<sup>−10</sup>), Sherlock (p = 2.19 × 10<sup>−3</sup>), and MetaXcan (p = 0.12). LTBR (Tumor Necrosis Factor Receptor Superfamily Member 3) is the receptor of tumor necrosis factor ligand superfamily member 14 and was discovered from Pascal (p = 0.013) and Sherlock (p = 0.16). Moreover, two TNF superfamily ligand genes (TNFSF10 and TNFSF15) and three interleukin family genes (IL18RAP, IL27, and IL4) were found in our CDgenes. These findings suggest that our CDgenes may offer insights into the identification of drug targets from multi-omics datasets.

#### Limitations

There were some limitations of the current work. First, although we collected five dimensions of data, other types of omics data were not covered in our work. For example, previous studies have reported that copy number variations (CNVs) could be associated with CD (Wellcome Trust Case Control Consortium et al., 2010). However, the number of genes implicated by CNV studies was very limited (∼10) and we could not include them in our matrix. Second, due to the limited tissue data, our DEGs and DMGs were both generated using PBMCs from CD patients and controls, instead of disease tissues from the patients. PBMCs can carry signs of infection and auto-immune diseases (Burczynski et al., 2006). Future studies are warranted to use samples from disease-related tissues, such as intestinal biopsies (Wu et al., 2007). Lastly, due to the data heterogeneity, we used different thresholds to control FDR for each individual omics dataset, e.g., adjusted p < 0.05 in selecting DEGs but adjusted p < 0.2 for MetaXcan, Sherlock, and DMGs. This inconsistency among different omics data may lead to an inaccurate estimate of the actual OR. In future studies, when more data are generated, either from different omics or from multiple datasets for the same omics, an enhanced evidence matrix could be constructed to validate the current CDgenes.

# CONCLUSION

In summary, we conducted an integrative analysis of genetic, epigenetic, and transcriptomic data in CD. Our approach prioritized candidate genes associated with CD from multidimensional data, and such methods could be extended to many other complex diseases for which multi-dimensional omics data are available. Functional analysis of these CDgenes revealed strong immune response enrichment. We further highlighted the potential involvement of the COP9 signalosome in CD and suggested interactions between our CDgenes and CD drug target genes.

# DATA AVAILABILITY

Publicly available datasets were analyzed in this study. The data used in the R package "deTS" can be found here: https://gtexportal.org/home/. Other data can be obtained from the resources described in Materials and Methods.

# AUTHOR CONTRIBUTIONS

PJ and ZZ conceived and designed the study. YD performed the data preparation and analysis. YD and GP performed the result demonstration. YD, PJ, and ZZ wrote the manuscript. All authors have read, edited, and approved the current version of the manuscript.

# FUNDING

This work was supported by the UTHealth Presidential Collaborative Research Award. This work was partially supported by National Institutes of Health grant (R01LM012806). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

# ACKNOWLEDGMENTS

The authors would like to extend their gratitude to Dr. Ventham from The University of Edinburgh for sharing the methylation results with us and for answering our questions.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00318/full#supplementary-material

FIGURE S1 | STRING-network interaction of genes.

TABLE S1 | Binary table for 1,688 Crohn's Disease related genes collected from five lines of evidence. We collected these data based on the criteria from Table 1.

TABLE S2 | Binary table for 1,688 Crohn's Disease related genes collected from nine lines of evidence. We collected these data based on the criteria from Table 1.

TABLE S3 | Binary table for seven genes shared by COP9 signalosome and Crohn's Disease consensus genes from five lines of evidence.

TABLE S4 | Binary table for seven genes shared by COP9 signalosome and Crohn's Disease consensus genes from nine lines of evidence.

TABLE S5 | FDA approved Crohn's Disease drugs and their target genes obtained from Therapeutic Target Database.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Dai, Pei, Zhao and Jia. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits

#### Yingjie Guo<sup>1,2</sup>, Chenxi Wu<sup>3</sup>, Maozu Guo<sup>1,4</sup>\*, Quan Zou<sup>5</sup>, Xiaoyan Liu<sup>1</sup> and Alon Keinan<sup>2,6</sup>\*

*<sup>1</sup> School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, <sup>2</sup> Department of Computational Biology, Cornell University, Ithaca, NY, United States, <sup>3</sup> Department of Mathematics, Rutgers University, Piscataway, NJ, United States, <sup>4</sup> School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China, <sup>5</sup> Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China, <sup>6</sup> Cornell Center for Comparative and Population Genomics, Center for Vertebrate Genomics, and Center for Enervating Neuroimmune Disease, Cornell University, Ithaca, NY, United States*

#### Edited by:

*Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China*

#### Reviewed by:

*Yan Huang, Harvard Medical School, United States Yen-Wei Chu, National Chung Hsing University, Taiwan*

#### \*Correspondence:

*Maozu Guo guomaozu@bucea.edu.cn Alon Keinan alon.keinan@cornell.edu*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *17 December 2018* Accepted: *12 March 2019* Published: *10 April 2019*

#### Citation:

*Guo Y, Wu C, Guo M, Zou Q, Liu X and Keinan A (2019) Combining Sparse Group Lasso and Linear Mixed Model Improves Power to Detect Genetic Variants Underlying Quantitative Traits. Front. Genet. 10:271. doi: 10.3389/fgene.2019.00271*

Genome-wide association studies (GWAS), based on testing one single nucleotide polymorphism (SNP) at a time, have revolutionized our understanding of the genetics of complex traits. In GWAS, there is a need to account for confounding effects, such as those due to population structure, and to take groups of SNPs into account simultaneously because of the "polygenic" nature of complex quantitative traits. In this paper, we propose a new approach, SGL-LMM, that combines the sparse group lasso (SGL) and the linear mixed model (LMM) for multivariate association analysis of quantitative traits. LMM, as often used in GWAS, controls for confounders, while SGL maintains sparsity of the underlying multivariate regression model. SGL-LMM first sets the fixed effects to zero to learn the parameters of the random effects using LMM, and then estimates the fixed effects using SGL regularization. We present efficient algorithms for hyperparameter tuning and feature selection using stability selection. While controlling for confounders and constraining for sparse solutions, SGL-LMM also provides a natural framework for incorporating prior biological information into the group structure underlying the model. Results based on both simulated and real data show that SGL-LMM outperforms previous approaches in terms of power to detect associations and accuracy of quantitative trait prediction.

Keywords: genome-wide association studies, single nucleotide polymorphisms, quantitative traits, linear mixed model, sparse group lasso

# 1. INTRODUCTION

Quantitative traits are important in medicine, agriculture, and evolution but, until recently, few polymorphisms had been shown to be related to these traits. Genome-wide association studies (GWAS) are a statistical technique that has been used successfully to identify over 65,000 single-nucleotide polymorphisms (SNPs) that are connected to various traits or diseases (MacArthur et al., 2017). Typically, GWAS are carried out using single-locus models (i.e., testing for association between each SNP and a given phenotype independently using linear or logistic regression). However, according to the popular "polygenic theory" (Li et al., 2015b; Dudbridge, 2016), complex traits are often controlled by multiple SNPs collectively. To avoid the multiple-testing corrections that decrease statistical power, and to better understand the underlying heritable genetic architecture of complex traits, one must move beyond single-locus models to multivariate linear regression models that explicitly incorporate the joint effects of multiple SNPs (Ma et al., 2013).

Usually, multi-locus GWAS are large-p, small-n problems: the number of features (SNPs) far exceeds the number of samples, and one expects only a small number of features to be associated with the phenotype. Therefore, as is customary for similar regression problems, it is necessary to regularize by demanding sparsity in the coefficients of the final model to prevent over-fitting and to maintain interpretability. The most popular regularizing penalty that serves this purpose is the lasso (i.e., least absolute shrinkage and selection operator) (Tibshirani, 1996), which is the L1 norm of the coefficients of features. Yang et al. (2012) fit sparse predictors for all genome-wide SNPs using stepwise forward selection. Li et al. (2011) imposed a Laplace prior, which led to the Bayesian lasso. Arbet et al. (2017) developed a permutation-based selection procedure to test the significance of lasso coefficients.

In GWAS, one expects the effective SNPs to be clustered in a few genes or pathways, hence, adding group structure by mandating sparsity on the group level is a good way to apply this prior knowledge that can potentially outperform the simple lasso. Yuan and Lin (2006) proposed using the group lasso for the linear regressions, which imposed a regularization penalty of the sum of the L2 norm on groups that guaranteed that few groups were selected. But if a group is selected, so are all the predictors in it.

The group lasso has already enjoyed much success in GWAS (Li et al., 2015a; Lim and Hastie, 2015). A caveat, however, is its assumption that either all SNPs in a group are associated or none of them are. It is desirable to constrain sparsity not only between groups (only a few groups are associated) but also within groups (only a few SNPs in each active group are associated). Hence, we propose to employ a sparse group lasso (SGL), which is a regularization method aimed at achieving both between- and within-group sparsity simultaneously (Rao et al., 2013, 2016; Simon et al., 2013). The SGL has an L2 penalty that promotes the selection of only a subset of the groups and an L1 penalty that promotes the selection of only a subset of the predictors within a group.

Another important factor in genetic association studies is the existence of confounding, that is, indirect associations between markers and traits due to factors like population structure, family structure, and cryptic relatedness. Methods for correcting these confounding factors include EIGENSTRAT (Price et al., 2006), family-based association, genomic control, and linear mixed models (LMMs) (Fisher, 1919; Hoffman, 2013; Hoffman et al., 2014). Compared with other methods, LMMs provide more fine-grained control by modeling the contribution of these confounders as a random effect term. They are capable of capturing the cumulative effect of all types of confounding simultaneously, without prior knowledge of which confounders are present and without the need to estimate them individually. However, the time and space costs of LMM are high compared with simpler confounding models. Previous attempts to improve the performance of LMM include Zhou and Stephens (2012) (EMMA), Kang et al. (2010) (EMMAX), Zhang et al. (2010) (P3D), Lippert et al. (2011) (FaST-LMM), and Li et al. (2017) (StepLMM). All of these methods are univariate models that are powerful in detecting a few associations with large effect sizes.

Although joint modeling of multiple weak effects and correction for population structure have been tackled individually, few existing methods are capable of addressing them simultaneously. Segura et al. (2012) proposed a multilocus, mixed model approach using stepwise forward selection. Rakitsch et al. (2012) and Papachristou et al. (2016) developed new association methods that combined LMM and lasso to enjoy the benefits of both methods.

There are a variety of patterns that typically arise in regularization (**Figure 1**). Prior knowledge can be utilized by using the SGL, which maintains both between- and within-group sparsity. The relative strength between L1 and L2 norms can be used to represent prior knowledge on the comparative degrees of sparsity at the SNP and gene level. In particular, by varying the ratio between L1 and L2 norms, the approach includes both group lasso and lasso as special cases.

In this paper, we propose a novel analysis that not only combines multivariate analysis with population correction using FaST-LMM, but also incorporates the group structure of the SNPs as biological priors. We use the gene as the group unit, and it is reasonable to assume that the model should be sparse not only on the SNP level (only relatively few SNPs are involved), but on the gene level as well (those functional SNPs belong to relatively few genes). Experiments on semi-empirical data showed that the combination of sparse group lasso and a linear mixed model yielded better power to identify marker associations in a large range of settings, and application to real datasets has verified that SGL-LMM generated a sparse solution with accurate prediction of phenotypes and interpretable detection of marker associations.

### 2. MATERIALS AND METHODS

#### 2.1. Method

We used a linear mixed model to model the genetic effects on the phenotypes. More precisely, we modeled the phenotype as a sum of three terms: a fixed effect determined by the association SNPs, a random confounding effect due to population structure, and an i.i.d. noise as follows:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{y}_{pop} + \boldsymbol{\phi} \tag{1}$$

where **y** is a vector of observed phenotypes of size m × 1 for m samples, **X** is an m × q matrix that consists of SNPs and other (e.g., environmental, familial, etc.) variables of the m samples, **y**<sub>pop</sub> is an m × 1 random vector with distribution N(**0**, σ<sub>g</sub><sup>2</sup>**K**), where **K** is an m × m matrix called the realized relationship matrix (RRM) that captures the overall genetic similarity between all pairs of samples, and φ ∼ N(**0**, σ<sub>e</sub><sup>2</sup>**I**).

FIGURE 1 | … (b). In (c), we show the pattern in which we are interested in this paper.

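To make a prediction on **y**, one only needs β and δ = σ<sub>e</sub><sup>2</sup>/σ<sub>g</sub><sup>2</sup>. Following FaST-LMM, our overall strategy for estimating the parameters β and δ is as follows:

(1) Set β = **0** and estimate δ by maximum likelihood under the resulting random effects model.

(2) With δ fixed at its estimate, estimate β using SGL regularization.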


Now we describe each of the two steps in more detail.

#### 2.1.1. Estimate of δ

To calculate δ, we use an approach similar to FaST-LMM. Because β was set to **0**, we have:

$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \sigma_{g}^{2} (\mathbf{K} + \delta \mathbf{I})) \tag{2}$$

Hence the log likelihood for a given **y** is

$$\begin{split} LL(\delta, \sigma_{g}^{2}) &= \log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \sigma_{g}^{2}(\mathbf{K} + \delta \mathbf{I})) \\ &= -\frac{1}{2} \left( m \log(2\pi \sigma_{g}^{2}) + \log(\det(\mathbf{K} + \delta \mathbf{I})) \right. \\ &\quad \left. + \frac{1}{\sigma_{g}^{2}} \mathbf{y}^{T}(\mathbf{K} + \delta \mathbf{I})^{-1} \mathbf{y} \right) \end{split} \tag{3}$$

Diagonalize **K** into **K** = **USU**<sup>T</sup> where **U** is orthogonal and **S** is diagonal, and we have:

$$\begin{split} LL(\delta, \sigma_{g}^{2}) = &-\frac{1}{2} \left( m \log(2\pi \sigma_{g}^{2}) + \log(\det(\mathbf{S} + \delta \mathbf{I})) \right. \\ &\left. + \frac{1}{\sigma_{g}^{2}} (\mathbf{U}^{T} \mathbf{y})^{T} (\mathbf{S} + \delta \mathbf{I})^{-1} (\mathbf{U}^{T} \mathbf{y}) \right) \end{split} \tag{4}$$

Substituting σ<sub>g</sub><sup>2</sup> with its optimal value:

$$\hat{\sigma}_{g}^{2} = \frac{(\mathbf{U}^{T}\mathbf{y})^{T}(\mathbf{S} + \delta \mathbf{I})^{-1}(\mathbf{U}^{T}\mathbf{y})}{m} \tag{5}$$

we have:

$$LL(\delta) = -\frac{1}{2} \left( \log(\det(\mathbf{S} + \delta \mathbf{I})) + m \log \frac{(\mathbf{U}^T \mathbf{y})^T (\mathbf{S} + \delta \mathbf{I})^{-1} (\mathbf{U}^T \mathbf{y})}{m} \right) + C \tag{6}$$

where C does not depend on δ. The optimal δ can then be calculated from the expression above as a one-dimensional optimization problem:

$$\hat{\delta} = \arg\min \left( \log(\det(\mathbf{S} + \delta I)) + m \log \frac{(\mathbf{U}^T \mathbf{y})^T (\mathbf{S} + \delta I)^{-1} (\mathbf{U}^T \mathbf{y})}{m} \right) \tag{7}$$
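As a minimal sketch of Equations (5) to (7), the one-dimensional search for δ can be written with a single eigendecomposition of **K**. The function names `neg_profile_loglik` and `estimate_delta`, and the choice of a grid search on a log10 scale, are assumptions made for illustration; any one-dimensional optimizer could be used in place of the grid.

```python
import numpy as np

def neg_profile_loglik(delta, S, Uty):
    """Objective of Equation (7): log det(S + delta*I) + m * log(sigma_g^2 hat)."""
    m = Uty.shape[0]
    d = S + delta                        # diagonal of S + delta*I
    sigma2_g = np.sum(Uty ** 2 / d) / m  # Equation (5)
    return np.sum(np.log(d)) + m * np.log(sigma2_g)

def estimate_delta(K, y, log_grid=np.linspace(-5, 5, 201)):
    """Grid search over delta = 10**log_grid (the grid itself is an assumption)."""
    S, U = np.linalg.eigh(K)             # K = U diag(S) U^T, the O(m^3) step
    S = np.clip(S, 0.0, None)            # guard against tiny negative eigenvalues
    Uty = U.T @ y                        # rotate the phenotype once
    deltas = 10.0 ** log_grid
    values = [neg_profile_loglik(d, S, Uty) for d in deltas]
    best = float(deltas[int(np.argmin(values))])
    sigma2_g = float(np.sum(Uty ** 2 / (S + best)) / len(y))
    return best, sigma2_g, S, U
```

Because the eigendecomposition is computed once, each evaluation of the objective costs only O(m), which is what makes the search over δ cheap.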

#### 2.1.2. Estimate of β

In this subsection, we describe the estimation of β based on the model described by Equation (1). In the next subsection, we introduce the SGL regularization.

Equation (1) implies that:

$$\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma_{g}^{2} (\mathbf{K} + \delta \mathbf{I})) \tag{8}$$

Hence, using the diagonalization, we see that, after δ and σ<sub>g</sub><sup>2</sup> have been estimated as in the previous subsection, the log-likelihood becomes:

$$\begin{split} LL(\boldsymbol{\beta}) &= \log \mathcal{N}(\mathbf{y} \mid \mathbf{X}\boldsymbol{\beta}, \hat{\sigma}_{g}^{2} (\mathbf{K} + \hat{\delta}\mathbf{I})) \\ &= -\frac{m}{2} \log(2\pi \hat{\sigma}_{g}^{2}) - \frac{1}{2} \log(\det(\mathbf{S} + \hat{\delta}\mathbf{I})) \\ &\quad - \frac{1}{2\hat{\sigma}_{g}^{2}} (\mathbf{U}^{T} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}))^{T} (\mathbf{S} + \hat{\delta}\mathbf{I})^{-1} (\mathbf{U}^{T} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})) \\ &= -\frac{1}{2\hat{\sigma}_{g}^{2}} (\mathbf{U}^{T} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}))^{T} (\mathbf{S} + \hat{\delta}\mathbf{I})^{-1} (\mathbf{U}^{T} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})) + C \end{split} \tag{9}$$

Let **S**<sub>δ̂</sub> be the non-negative diagonal matrix defined by **S**<sub>δ̂</sub><sup>−2</sup> = **S** + δ̂**I**, or, more concretely, (**S**<sub>δ̂</sub>)<sub>ii</sub> = (**S**<sub>ii</sub> + δ̂)<sup>−1/2</sup>. Then the MLE of β is

$$\begin{split} \hat{\boldsymbol{\beta}} &= \arg\min_{\boldsymbol{\beta}} \, (\mathbf{U}^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}))^T (\mathbf{S} + \hat{\delta} \mathbf{I})^{-1} (\mathbf{U}^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})) \\ &= \arg\min_{\boldsymbol{\beta}} \, (\mathbf{S}_{\hat{\delta}} \mathbf{U}^T \mathbf{y} - \mathbf{S}_{\hat{\delta}} \mathbf{U}^T \mathbf{X} \boldsymbol{\beta})^T (\mathbf{S}_{\hat{\delta}} \mathbf{U}^T \mathbf{y} - \mathbf{S}_{\hat{\delta}} \mathbf{U}^T \mathbf{X} \boldsymbol{\beta}) \\ &= \arg\min_{\boldsymbol{\beta}} \, \|\mathbf{S}_{\hat{\delta}} \mathbf{U}^T \mathbf{y} - \mathbf{S}_{\hat{\delta}} \mathbf{U}^T \mathbf{X} \boldsymbol{\beta}\|_{2}^{2} \end{split} \tag{10}$$

Here ‖·‖<sub>2</sub> is the L2 norm. **S**<sub>δ̂</sub>**U**<sup>T</sup>**y** and **S**<sub>δ̂</sub>**U**<sup>T</sup>**X** are obtained from **y** and **X** by a rotation and a scaling, and to simplify notation we denote them as **ỹ** and **X̃**, respectively.
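The rotation and scaling can be illustrated with a short sketch. It assumes that the eigenvalues `S`, the eigenvectors `U`, and the estimate `delta_hat` are already available from the previous step; the function name `rotate_and_scale` is hypothetical.

```python
import numpy as np

def rotate_and_scale(X, y, S, U, delta_hat):
    """Return (X_tilde, y_tilde) = (S_d U^T X, S_d U^T y),
    where S_d is diagonal with entries (S_ii + delta_hat)^(-1/2)."""
    scale = 1.0 / np.sqrt(S + delta_hat)   # diagonal of S_delta_hat
    X_tilde = scale[:, None] * (U.T @ X)   # rotate the columns, then scale the rows
    y_tilde = scale * (U.T @ y)
    return X_tilde, y_tilde
```

After this transformation, the generalized least squares problem in Equation (10) becomes an ordinary least squares problem in the transformed variables, so any standard regularized regression solver can be applied to them.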

#### 2.1.3. Sparse Group Lasso

To maintain sparsity in the estimated β, we need to add a regularizer to Equation (10). We used the SGL regularizer: let 𝒢 be a family of possibly overlapping groups of components in β; for each group G ∈ 𝒢, let β<sub>G</sub> be the vector that consists of these components; and let λ > 0 and 0 ≤ α ≤ 1. Then the regularized optimization problem becomes:

$$\hat{\boldsymbol{\beta}}_{\text{reg}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{S}_{\hat{\delta}}\mathbf{U}^{T}\mathbf{y} - \mathbf{S}_{\hat{\delta}}\mathbf{U}^{T}\mathbf{X}\boldsymbol{\beta}\|_{2}^{2} + \sum_{G\in\mathcal{G}} \left( \lambda(1-\alpha)\|\boldsymbol{\beta}_{G}\|_{2} + \lambda\alpha\|\boldsymbol{\beta}_{G}\|_{1} \right) \tag{11}$$

Here λ is the strength of regularization, and α is the comparative strength of the L1 and L2 regularization, indicating how much sparsity is desired at the SNP level compared with sparsity at the group level. From a Bayesian perspective, one can think of it as adding a regularizing prior on β of the form:

$$\log p(\boldsymbol{\beta}) \propto \sum_{G \in \mathcal{G}} \left( (1 - \alpha) \|\boldsymbol{\beta}_G\|_2 + \alpha \|\boldsymbol{\beta}_G\|_1 \right) \tag{12}$$
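To make the role of α concrete, the following sketch evaluates the SGL penalty for a small coefficient vector. It assumes non-overlapping groups given as lists of indices; the function name `sgl_penalty` and the toy values are illustrative only.

```python
import numpy as np

def sgl_penalty(beta, groups, lam, alpha):
    """lam*(1-alpha)*sum_G ||beta_G||_2 + lam*alpha*sum_G ||beta_G||_1."""
    group_l2 = sum(np.linalg.norm(beta[g]) for g in groups)
    group_l1 = sum(np.abs(beta[g]).sum() for g in groups)
    return lam * (1.0 - alpha) * group_l2 + lam * alpha * group_l1

beta = np.array([3.0, -4.0, 0.0, 0.0])
groups = [[0, 1], [2, 3]]                                    # two groups of two SNPs each
lasso_case = sgl_penalty(beta, groups, lam=1.0, alpha=1.0)   # pure L1 penalty: 7.0
group_case = sgl_penalty(beta, groups, lam=1.0, alpha=0.0)   # pure group penalty: 5.0
mixed_case = sgl_penalty(beta, groups, lam=1.0, alpha=0.5)   # equal mix: 6.0
```

Setting α = 1 recovers the lasso penalty and α = 0 recovers the group lasso penalty, so intermediate values trade off SNP-level against gene-level sparsity.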

#### 2.1.4. Phenotype Prediction

With estimated β and δ, phenotype prediction follows from a straightforward MLE using Equation (1). Suppose there are other samples with genotype **X**′ and unknown phenotype **y**′; then

$$LL(\mathbf{y}') \propto \left( \begin{bmatrix} \mathbf{y}'\\ \mathbf{y} \end{bmatrix} - \begin{bmatrix} \mathbf{X}'\\ \mathbf{X} \end{bmatrix} \hat{\boldsymbol{\beta}} \right)^T (\mathbf{K} + \hat{\delta} \mathbf{I})^{-1} \left( \begin{bmatrix} \mathbf{y}'\\ \mathbf{y} \end{bmatrix} - \begin{bmatrix} \mathbf{X}'\\ \mathbf{X} \end{bmatrix} \hat{\boldsymbol{\beta}} \right) \tag{13}$$

Here **K** is the block matrix

$$\mathbf{K} = \begin{bmatrix} \mathbf{K}_{X'X'} & \mathbf{K}_{X'X} \\ \mathbf{K}_{X'X}^{T} & \mathbf{K}_{XX} \end{bmatrix}$$

So, by linear algebra, the MLE estimate for **y**′ is

$$\hat{\mathbf{y}}' = \mathbf{X}'\hat{\boldsymbol{\beta}} + \mathbf{K}_{X'X}(\mathbf{K}_{XX} + \hat{\delta}\mathbf{I})^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \tag{14}$$
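The prediction rule can be sketched as the fixed-effect prediction plus a correction obtained by propagating the training residuals through the relationship matrix. The sketch assumes that the cross-relationship block between new and training samples (`K_new_train`, of size n′ × m) and the training block (`K_train`, of size m × m) are supplied by the caller; all names are hypothetical.

```python
import numpy as np

def predict_phenotype(X_new, X, y, beta_hat, K_new_train, K_train, delta_hat):
    """Fixed-effect prediction plus the random-effect correction from training residuals."""
    resid = y - X @ beta_hat                       # part of y not explained by fixed effects
    m = K_train.shape[0]
    weights = np.linalg.solve(K_train + delta_hat * np.eye(m), resid)
    return X_new @ beta_hat + K_new_train @ weights
```

When the new samples are unrelated to the training samples (a zero cross-relationship block), the prediction reduces to the fixed-effect term alone.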

We can summarize the SGL-LMM significant SNP selection in the following algorithm:

**Algorithm 1:** Parameter estimate for LMM with SGL regularization

#### 2.1.5. Complexity Analysis

Let n be the number of samples and s be the number of SNPs. When training the null model, the complexity is O(n<sup>3</sup>), which comes from the computation of eigenvalues and eigenvectors. This is reasonable when n is about 10k, but for larger n one can improve the time complexity by taking into account only the dominant eigenvalues. The proximal gradient step has a complexity of about O(ns) and, since n is usually much smaller than s, one can regard it as roughly O(s). The prediction step has a complexity of O(nn′s), where n′ is the size of the testing set. From the complexity analysis, we can see that SGL-LMM is scalable for genome-wide association analysis. However, when analyzing a huge genome such as the human genome, we recommend analyzing each chromosome individually or performing a second-stage analysis based on loci suggested by GWAS.
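To illustrate where the O(ns) cost per proximal gradient step comes from, the following is a minimal sketch of plain proximal gradient descent on the SGL-regularized least squares problem. It is an illustrative assumption, not the solver used by the SGL R package: it assumes non-overlapping groups that together cover every coefficient, a fixed step size, and hypothetical function names `sgl_prox` and `fit_sgl`.

```python
import numpy as np

def sgl_prox(v, groups, lam, alpha, step):
    """Proximal operator of the SGL penalty for non-overlapping groups:
    soft-threshold each entry, then shrink each group as a whole."""
    out = np.zeros_like(v)
    for g in groups:
        z = np.sign(v[g]) * np.maximum(np.abs(v[g]) - step * lam * alpha, 0.0)
        norm = np.linalg.norm(z)
        if norm > 0.0:
            out[g] = max(0.0, 1.0 - step * lam * (1.0 - alpha) / norm) * z
    return out

def fit_sgl(X_t, y_t, groups, lam, alpha, n_iter=500):
    """Minimize ||y_t - X_t b||_2^2 + SGL penalty on the transformed data."""
    # Step size 1/L, where L = 2 * (largest singular value of X_t)^2
    step = 1.0 / (2.0 * np.linalg.norm(X_t, 2) ** 2)
    beta = np.zeros(X_t.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X_t.T @ (X_t @ beta - y_t)   # two matrix-vector products: O(n*s)
        beta = sgl_prox(beta - step * grad, groups, lam, alpha, step)
    return beta
```

Each iteration touches the n × s design matrix twice and the proximal step is linear in s, which matches the per-step cost stated above.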

#### 2.2. Model Selection

To solve Equation (11), we employ the SGL R package. Instead of performing a two-dimensional grid search over λ and α to determine the optimal parameters, the package fixes the mixing parameter α and computes solutions along a path of many λ values. The path begins with λ sufficiently large to set β̂ = 0 and decreases λ until the solution is close to the unregularized one. Taking advantage of this mechanism, we carry out feature selection using SGL-LMM through the following steps:

(1) Finding the λ that optimizes phenotype prediction accuracy

To find the best λ for phenotype prediction, we first fitted the sparse group lasso model to the whole dataset to obtain a λ path. We then used 5-fold cross-validation to find the λ that maximized the average explained variance on the test folds.
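The cross-validation step can be sketched generically. The sketch assumes a user-supplied callable `fit(X, y, lam)` that returns a coefficient vector (for SGL-LMM this would be an SGL fit on the rotated and scaled data); the names `explained_variance` and `select_lambda`, and the random fold assignment, are assumptions for illustration.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """1 - Var(residual) / Var(y_true)."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

def select_lambda(X, y, lambdas, fit, n_folds=5, seed=0):
    """Return the lambda with the highest mean explained variance over the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for lam in lambdas:
        fold_scores = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            beta = fit(X[train], y[train], lam)          # fit on the training folds only
            fold_scores.append(explained_variance(y[test], X[test] @ beta))
        scores.append(float(np.mean(fold_scores)))
    return lambdas[int(np.argmax(scores))], scores
```

Passing the fitting routine as an argument keeps the selection logic independent of the particular regularized solver.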

(2) Stability selection

To evaluate the significance of individual SNPs, we carried out stability selection (Meinshausen and Bühlmann, 2010). To obtain a more accurate ranking of SNPs, after the optimal λ was selected in the step above, we chose another nine evenly spaced λ values from the larger λs in the λ path. This set of λs was used in each stability selection run to rank the features by their order of inclusion into the model. We randomly drew no more than 50% of the samples, as proposed in the original article, and repeated this 100 times. We selected all SNPs that were found in ≥ 50% of all results. A significance estimate can be deduced from the selection frequency of individual SNPs.
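A minimal sketch of the subsampling loop follows. It assumes a user-supplied callable `select(X, y, lam)` that returns a boolean mask of selected features, and it counts a feature as selected in a draw if it is selected for any λ in the set. That aggregation rule, like the function name `stability_selection`, is an assumption made for illustration.

```python
import numpy as np

def stability_selection(X, y, lambdas, select, n_draws=100, frac=0.5,
                        threshold=0.5, seed=0):
    """Return per-feature selection frequencies and the indices above the threshold."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_draws):
        idx = rng.choice(n, size=int(frac * n), replace=False)  # subsample without replacement
        chosen = np.zeros(p, dtype=bool)
        for lam in lambdas:
            chosen |= select(X[idx], y[idx], lam)                # union over the lambda set
        counts += chosen
    freq = counts / n_draws
    return freq, np.flatnonzero(freq >= threshold)
```

The returned frequencies can be used directly to rank SNPs, and the threshold of 0.5 corresponds to the selection rule described above.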

We summarize the process in the algorithm below and the overall pipeline of the SGL-LMM method in **Figure 2**:

#### **Algorithm 2:** Feature selection using SGL-LMM

**Data**: Genotype, Phenotype, groupstructure, α, nlam\_times

**Result**: List of features and their importance measured by frequencies


#### 2.3. Simulation Study

To evaluate the accuracy of SGL-LMM and previous methods for association mapping, we considered a semi-empirical example based on the genotypic and phenotypic data for up to 1,307 world-wide accessions of Arabidopsis thaliana from Atwell et al. (2010). The data can be downloaded from https://github.com/Gregor-Mendel-Institute/atpolydb. Following standard GWAS quality control, we excluded a SNP if its minor allele frequency (MAF) was < 0.05, if its missing rate was > 0.05 of the population, or if its allele frequencies were not in Hardy-Weinberg equilibrium (P < 0.0001). After filtering, 200,155 SNPs remained.
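The first two filters can be sketched as follows. The sketch assumes a samples × SNPs matrix of minor-allele counts coded 0/1/2 with `np.nan` for missing calls, and it omits the Hardy-Weinberg test; the function name `qc_filter` and the coding scheme are assumptions.

```python
import numpy as np

def qc_filter(G, maf_min=0.05, miss_max=0.05):
    """Return a boolean mask of SNP columns that pass the MAF and missing-rate filters."""
    miss_rate = np.mean(np.isnan(G), axis=0)       # fraction of missing calls per SNP
    allele_freq = np.nanmean(G, axis=0) / 2.0      # frequency of the counted allele
    maf = np.minimum(allele_freq, 1.0 - allele_freq)
    return (maf >= maf_min) & (miss_rate <= miss_max)

G = np.array([[0, 2, 0, np.nan],
              [0, 2, 1, np.nan],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
keep = qc_filter(G)   # first SNP is monomorphic, last SNP has 50% missing calls
```

In this toy matrix only the two middle SNPs are retained.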

To simulate the effect of population structure, we used the real phenotype leaf number at flowering time (LN, 16°C, 16 h daylight), which is available for 177 of the 1,307 A. thaliana plants. Univariate analyses showed that the phenotype had an excess of associations when population structure was not taken into account (Atwell et al., 2010). After correction for the population effect, the p-values are approximately uniformly distributed, which means that this phenotype is dominated by population structure. Hence, we used this phenotype to simulate the confounding effect. First, to determine the ratio δ between residual and genetic variance, we fit a random effects model to LN, which we subsequently used to predict the population structure for the remaining 1,130 plants. We ran the random effects model multiple times and chose the final dataset for which the difference in the genetic variance parameter between the real and synthetic data was less than 0.0001. In addition to this empirical background, we added simulated associations with different effect sizes and a range of complexities of genetic models.

We then simulated the phenotype as follows:

$$\mathbf{y} = \sigma_{\text{sig}} \mathbf{y}_{\text{sig}} + (1 - \sigma_{\text{sig}}) [\sigma_{\text{pop}} \mathbf{y}_{\text{pop}} + (1 - \sigma_{\text{pop}}) \boldsymbol{\phi}] \tag{15}$$

where **y**<sub>sig</sub> = **X**<sup>k</sup>β and **X**<sup>k</sup> is the genotype data for the k causal SNPs. To introduce the group structure, we considered N<sub>g</sub> = 200 genes (groups) on chromosome 1, which covered 2,000 SNPs, and we set m groups to be active. We varied the sparsity level of the active groups so that the total number of active SNPs was k. We drew β ∼ N(**0**, **I**) and φ ∼ N(**0**, **I**). During the simulation, we maintained the original LD structure in each gene.

The initial setting used for simulation was 3 active groups, each containing 5 effective SNPs (k = 15 and m = 3). To investigate the influence of the confounding effect strength and the overall noise, we varied σ<sub>pop</sub> ∈ {0.5, 0.7, 0.9} and σ<sub>sig</sub> ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. For each combination of σ<sub>pop</sub> and σ<sub>sig</sub>, we generated 10 datasets, resulting in 120 datasets in total for the 12 combinations.
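The mixing in Equation (15) can be sketched as follows. The sketch assumes that the causal genotype columns and the population-structure component are already available as arrays; the function name `simulate_phenotype` and the use of NumPy's random generator are assumptions for illustration.

```python
import numpy as np

def simulate_phenotype(X_causal, y_pop, sigma_sig, sigma_pop, rng):
    """Mix signal, population structure, and i.i.d. noise as in Equation (15)."""
    n, k = X_causal.shape
    beta = rng.standard_normal(k)     # effect sizes, beta ~ N(0, I)
    noise = rng.standard_normal(n)    # i.i.d. noise, phi ~ N(0, I)
    y_sig = X_causal @ beta           # signal from the k causal SNPs
    y = sigma_sig * y_sig + (1.0 - sigma_sig) * (
        sigma_pop * y_pop + (1.0 - sigma_pop) * noise)
    return y, beta
```

Larger values of σ<sub>sig</sub> give the causal SNPs more weight, while larger values of σ<sub>pop</sub> shift the remaining weight from independent noise toward population structure.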

#### 2.4. Application With Arabidopsis thaliana Data

To assess the capacity of SGL-LMM to deal with real association mapping of quantitative phenotypes, we investigated the susceptibility of a set of SNPs that belong to genes of several flowering phenotypes in A. thaliana. We used the same dataset as in the simulation study. From the 107 phenotypes, we chose 10 flowering time phenotypes (**Table S1**).

To verify our method, we constructed our dataset in the following ways:


# 3. RESULTS

#### 3.1. Existing Methods

To compare our SGL-LMM method with existing techniques, we considered standard regularization methods that included Lasso and SGL, which model all SNPs simultaneously without correcting for population structure. Also, we combined LMM with different regularization strategies (e.g., Lasso-LMM was listed as a comparison). All the methods that were related to regularization were fit in identical ways (see section 2.2).

#### 3.2. Performance Measurements

In this paper, all the models output a ranking list of SNPs with their frequencies of being chosen; true significant markers were rare and accounted for only 15 out of 1,993 in our simulation datasets. Hence, we treated this as a binary classification problem with an imbalanced dataset where we assigned association markers as label 1 and background markers as label 0. The frequency of each marker was treated as the predicted probability for label 1.

The ROC (receiver operating characteristic) curve and the PR (precision-recall) curve are commonly used to evaluate the performance of classification models. The ROC curve is created by plotting the sensitivity (true positive rate, TPR) against the false positive rate (FPR = 1 − specificity) while varying the threshold settings:

$$\text{sensitivity (TPR)} = \frac{TP}{TP + FN}$$

$$\text{specificity} = \frac{TN}{TN + FP}, \qquad \text{FPR} = 1 - \text{specificity} = \frac{FP}{FP + TN}$$

The PR curve is created by plotting the Precision against the Recall at various threshold settings:

$$precision = \frac{TP}{TP + FP}$$

$$recall = \frac{TP}{TP + FN}$$

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.

In our imbalanced setting, the ROC curve was not a good visual illustration because, with a very large number of true negatives, the false positive rate remained small and changed little even when many false positives were selected. In contrast, the PR curve was highly sensitive to false positives and was not affected by the large number of true negatives. Hence, we chose the PR curve to evaluate the performance of all the methods, and we used the average AUC (area under the curve) of the PR curve to explore the impact of various simulation settings.
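As a hedged sketch of this evaluation, the PR curve and its area can be computed from the selection frequencies by treating each rank as a threshold. The step-wise summation used for the area (often called average precision) is an assumption, since the integration rule is not specified here; the function names are hypothetical, and tied scores are ordered arbitrarily.

```python
import numpy as np

def pr_curve(labels, scores):
    """Precision and recall after each ranked marker, highest score first."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranked = np.asarray(labels)[order]
    tp = np.cumsum(ranked)            # true positives among the top-ranked markers
    fp = np.cumsum(1 - ranked)        # false positives among the top-ranked markers
    precision = tp / (tp + fp)
    recall = tp / ranked.sum()
    return precision, recall

def average_precision(labels, scores):
    """Area under the PR curve as a step-wise sum of precision over recall increments."""
    precision, recall = pr_curve(labels, scores)
    prev_recall = np.concatenate([[0.0], recall[:-1]])
    return float(np.sum((recall - prev_recall) * precision))
```

For example, `average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])` equals 1.0 because both association markers are ranked above all background markers.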

#### 3.3. Results of the Simulation Study

#### 3.3.1. SGL-LMM Ranks Causal SNPs Higher Than Alternative Methods

We assessed the performance in recovering causal SNPs with a true simulated association. PR curves were constructed while varying σ<sub>pop</sub> in {0.5, 0.7, 0.9} with σ<sub>sig</sub> set at 0.2 (**Figure 3**). Note that a larger AUC score indicates better performance. For this experiment, we chose effective SNPs from three of the 200 groups, while taking sparsity into account, and we set the ratio α of the L1 and L2 penalties in SGL-LMM to 0.95. The two methods that incorporated LMM for population correction performed better than those without, and SGL-LMM was the best model (**Figure 3**). For most sets of parameters, SGL-LMM outperformed Lasso-LMM in AUC by about 10%.

Next, we explored the impact of various simulated settings. As mentioned in section 3.2, the area under the precision-recall curve is a summary performance measurement for assessing different methods. The area under the PR curve is shown as a function of an increasing ratio between true genetic marker signals and confounding plus noise (**Figure 4**). The performance of all methods improved as σ<sub>sig</sub> became larger, and the AUC reached 1 at σ<sub>sig</sub> = 0.5 for all methods. Among them, SGL-LMM was the best. We also noticed that when σ<sub>sig</sub> = 0.1, SGL alone was more accurate than Lasso-LMM in the majority of datasets. SGL and Lasso-LMM performed similarly (**Figure 3**). One possible explanation is that when the variation explained by causal SNPs was relatively small, noise dominated the results. Under this scenario, eliminating false positives caused by population structure did not improve the performance of the models significantly. However, imposing group structure seems to be useful in generating accurate results.

The area under the PR curve is shown as a function of an increasing ratio of population structure to independent random noise at a specific σ<sub>sig</sub>. As expected, strong confounding was harmful to performance, because the AUC of all methods decreased when the confounding ratio increased. Again, SGL-LMM was superior to its counterparts. However, when σ<sub>sig</sub> = 0.3, the performance of methods with the population correction exhibited an upward trend when σ<sub>pop</sub> varied from 0.5 to 0.7 (**Figure 5C**). The results for σ<sub>sig</sub> = 0.1, 0.2, and 0.4 can be found in **Figures 5A,B,D**. This effect indicated that with a medium signal-to-noise ratio, it was advantageous to include a genetic covariance matrix **K** that accounted for confounding caused by population structure. SGL-LMM performed better than alternative methods for the entire range of considered settings. The benefits of population correction and inclusion of group structure in SGL-LMM were most pronounced in the scenario with strong confounding.

#### 3.4. Application With Arabidopsis thaliana Data

Having shown the accuracy of SGL-LMM in recovering the association SNPs in the simulation study, we now demonstrate that SGL-LMM models association mapping in the A. thaliana dataset better than other models. For this experiment based on real data, we compared the performance of SGL-LMM and Lasso-LMM in predicting phenotype and in selecting predictive SNPs. For the ratio α between the L1 and L2 penalties, we considered eight possible values {0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25}; we picked the one that resulted in the largest correlation coefficient between the predicted and the real phenotype for subsequent stability selection. Because it is a verification experiment, we did not cover all genes in the experimental design. It may be the case that few, or even none, of the related genes in the selected phenotypes were covered in our genotype file. As a consequence, when setting the threshold for stability selection to 50%, few SNPs are chosen by Lasso-LMM, and usually no more than 20 SNPs are chosen by SGL-LMM. Hence, we chose to rank the SNPs by their frequency of being chosen in both approaches and to investigate the first 100 SNPs. We summarized the genes to which these 100 SNPs belonged and the number of these genes in the candidate gene list (**Table 1**).

SGL-LMM had the following two advantages (**Table 1**):

#### 3.4.1. SGL-LMM Had Higher Prediction Accuracy

For most of the 10 phenotypes, correlation coefficients between the predicted and the true phenotypes were more than 10% higher with SGL-LMM than with Lasso-LMM; for FT10, the predictions by SGL-LMM had a correlation coefficient 100% higher than that obtained by Lasso-LMM. Therefore, incorporating prior knowledge of genetic structure significantly improved the accuracy of models of quantitative phenotypes.

#### 3.4.2. SGL-LMM Selected Fewer Genes, and It Tended to Find More Genes That Were Known to be Functional

Compared with Lasso-LMM, associations that were located by SGL-LMM were more enriched to known candidate genes (**Table 1**). It linked more candidate genes in five phenotypes, and it linked the same number of candidate genes in the phenotypes SD and SDV. However, SGL-LMM linked many fewer genes compared with Lasso-LMM, which was consistent with our assumption that phenotypes should be related to a few SNPs in a few genes. Hence, adding group information into SGL-LMM made the results more interpretable and more meaningful biologically. The remaining three phenotypes that were related to leaf numbers seemed to be largely unrelated to the 19 candidate genes and to the randomly selected background genes and, therefore, both methods performed badly.

# 4. DISCUSSION

Quantitative traits are important in medicine, agriculture, and evolution, but the association mapping studies of these traits are insufficient. In this paper, we have proposed a sparse group lasso, multi-marker mixed model (SGL-LMM) to identify genetic associations in quantitative traits in the presence of confounding influences, such as population structure. The approach benefits from the attractive properties of linear mixed models that allow for elegant correction of confounding effects and those of group-based, multi-marker models that not only consider the joint effects of sets of genetic markers rather than one single locus at a time, but that also incorporate biological group information as prior knowledge. As a consequence, SGL-LMM was able to better predict the phenotype and to identify true genetic associations, even in challenging scenarios with complex underlying genetic models, weak effects of individual markers, or presence of strong confounding effects.

TABLE 1 | Summary of associations found in SGL-LMM and Lasso-LMM in real data application.

*We report the correlation between the predicted phenotype and the real phenotype in the column titled "correlation." A bold entry indicates that the method located more true positives than its competitor.*

SGL-LMM is useful for genome-wide association studies of complex quantitative phenotypes. In this paper, we have illustrated such practical use through a semi-empirical simulation study and retrospective analysis of A. thaliana. First, we found that imposing gene structure as group structure on the model improved both the prediction of phenotype from genotype and the selection of association SNPs, which suggested that incorporating prior biological knowledge into models led to a better fit to real genetic architectures. Second, the combination of a random effect model and a multivariate linear model is a way to reveal the true association of complex phenotypes, especially with a medium signal-to-noise ratio. It is widely accepted that parts of the unexplained portion of genetic variance can be due to a large number of loci that have a joint effect on the phenotype, but which lead to only a weak signal if considered independently. In addition, SGL-LMM yields much more biologically meaningful and interpretable associations, which suits the biological assumption that complex traits are only related to a few SNPs in a few genes. Our experiments on the flowering phenotypes of A. thaliana showed that SGL-LMM linked more candidate genes while selecting a smaller overall gene set than the Lasso-LMM method.

SGL-LMM includes both GL-LMM (group lasso with linear mixed model) and Lasso-LMM as special cases, obtained by varying the ratio between the L1 and L2 norms. Both the sparsity within groups and the group-wise sparsity influenced the performance of SGL-LMM. Small groups did not benefit from the within-group sparsity, which led the method to act as group lasso with LMM. In practical use, we recommend doing imputation first, which can ensure a moderate size for each group. SGL-LMM can be made even more powerful by adding a strategy to deal with overlapping groups, which has been shown to be feasible by Jacob et al. (2009). Assessing the statistical significance of association results of SGL-LMM remains a challenge for future research. In summary, SGL-LMM is a useful addition to the current toolbox of computational models for unraveling associations of quantitative traits.

#### AUTHOR CONTRIBUTIONS

YG, AK, MG, and XL conceived and designed the project. YG and CW derived the formula of the method. YG implemented the software, performed the experiment, analyzed data, and wrote the paper with CW and QZ. All authors read, edited, and approved the final version of the manuscript.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61571163, 61532014, and 61671189), the National Key Research and Development Plan of China (Grant No. 2016YFC0901902), and the National Institutes of Health (grants R01HG006849 and R01GM108805 to AK).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00271/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer YWC declared a past co-authorship with one of the authors QZ to the handling editor.

Copyright © 2019 Guo, Wu, Guo, Zou, Liu and Keinan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Characterizing the Personalized Microbiota Dynamics for Disease Classification by Individual-Specific Edge-Network Analysis

Xiangtian Yu<sup>1</sup> \*, Xiaoyu Chen<sup>1</sup> and Zhenjia Wang<sup>2</sup>

<sup>1</sup> Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China, <sup>2</sup> Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States

Environmental factors such as the gut microbiome are thought to play an important role in the development and treatment of many diseases. However, our understanding of microbiota compositional dynamics is still incomplete because the intestinal microbial community is an easily changed ecosystem, and the large variations among individuals urgently need to be understood. These variations, however, will be an asset rather than a limitation to personalized medicine. For a proof-of-concept study on individual-specific disease classification based on microbiota compositional dynamics, we implemented an adjusted individual-specific edge-network analysis (iENA) method to address a limited number of samples from one individual, and applied it to the temporal 16S rRNA (ribosomal RNA) gene sequencing data from individuals in a challenge study. Our identified individual-specific OTU markers or their combined markers are consistent with previously reported markers, and the predictive score based on them achieves a higher AUROC than the previously reported 0.83 and about 90% accuracy in predicting whether an individual developed diarrhea [i.e., was symptomatic (Sx)] or not. In addition, iENA also showed satisfactory efficiency on another dataset about bacterial vaginosis (BV). All these results suggest that the combination of high-throughput microbiome experiments and computational systems biology approaches can efficiently recommend potential candidate species in the defense against various pathogens for precision medicine.

Keywords: network, individual-specific edge-network analysis, complex diseases, personalized microbiota dynamics, omics data

# INTRODUCTION

In addition to genetic risks, evidence is accumulating that environmental factors play critical roles in complex human diseases (Qin et al., 2012; Hoyles et al., 2018). As one of these key factors, the gut microbiome is gradually being accepted as a key player in controlling disease development and progression (Claesson et al., 2012; Forslund et al., 2015). Many studies have concluded that alterations of the commensal microbiota may contribute to a range of significant pathogenic states such as antibiotic-associated diarrhea, inflammatory bowel disease, irritable bowel syndrome, pseudomembranous colitis, and cancer (Pop et al., 2016). The high-throughput sequencing of microbial communities provides a bio-technical foundation to characterize the associations of the host microbiome (Blow, 2008; Pushkarev et al., 2018), which is helpful to detect pathogens and identify the crosstalk between an organism's microbiome and the environment (Wagner et al., 2018). This frontier research not only links intestinal microbial communities with health or disease phenotypes but also provides a large amount of processed data for public access and reuse.

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Zhangran Chen, Xiamen University, China Bingqiang Liu, Shandong University, China

> \*Correspondence: Xiangtian Yu graceyu1985@163.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 21 December 2018 Accepted: 15 March 2019 Published: 11 April 2019

#### Citation:

Yu X, Chen X and Wang Z (2019) Characterizing the Personalized Microbiota Dynamics for Disease Classification by Individual-Specific Edge-Network Analysis. Front. Genet. 10:283. doi: 10.3389/fgene.2019.00283

As is known, the intestinal microbial community is a complex ecosystem with essential influences on host health status in numerous ways, such as regulating metabolism, developing immunity, and suppressing enteropathogens (Gill et al., 2006; Round and Mazmanian, 2009; Maynard et al., 2012). These beneficial co-evolved interactions between host and microbiota can be disrupted by different environmental stresses such as changes in dietary habits, natural physiology, virus infections, and medical treatments (Dethlefsen et al., 2008; Wu et al., 2011; Pop et al., 2014). Specifically, antibiotic treatments for enteric infections such as ETEC may even lead to immediate and significant changes of the gut microbiota (Dethlefsen and Relman, 2011), e.g., loss of beneficial species, increase of drug-resistant strains, and predisposition to pathogen infections. The intestinal ecosystem is easily changed, and although it is able to recover, the recovery is often incomplete (Lozupone et al., 2012). Thus, it is necessary to carry out long-term observational studies to detect possible permanent functional alterations among certain microbiota (Jernberg et al., 2010).

Although the critical role of microbiota in human health is attracting more attention, our understanding of microbiota compositional dynamics is still incomplete, and more well-designed analytical methods are required to make full use of rich data resources. In the gradually increasing number of observational studies of the gut microbiota, sequencing data of microbial communities, e.g., metagenomics data, are widely tested and analyzed (Vedoy et al., 2018). Unlike other high-throughput data in genetic studies (Yu and Zeng, 2018), metagenomics data can change easily across conditions and individuals. Thus, individual heterogeneity is particularly important and individual variation should not be ignored in analytical approaches. In fact, in the era of precision medicine, many methods have focused on the common molecular biomarkers which can diagnose disease states at the cohort/population level. However, to study the occurrence and progression of a disease in a given patient (Zeng et al., 2014; Yu et al., 2015), accurate diagnosis of individuals by sample-specific biomarkers is a key concept and action (Zeng et al., 2016). In contrast to the traditional molecular biomarker analysis, our previously proposed individual-specific edge-network analysis (iENA) (Yu et al., 2017) combined with the dynamic network biomarker (DNB) (Li et al., 2014) has detected early-warning signals, or the pre-disease state before serious disease deterioration, based on second-order statistics from the observed data, e.g., "covariance" for expressions among genes or proteins.

Assuming that the microbiota, like genetic molecules, will have significant network characteristics associated with phenotypes (Rakoff-Nahoum et al., 2016), it is worth extracting discriminative and interpretable features from the microbiota community, treated like a gene network, to monitor the disruption of microbial communities during disease occurrence and development (Wang et al., 2018). To carry out a proof-of-concept study on the dynamic change of the intestinal ecosystem and its disease signals, we have adjusted the iENA method (Yu et al., 2017) with a reference group to address the limited number of samples from one individual, and applied it to analyze temporal high-throughput 16S rRNA data from individuals, which is expected to overcome the great individual difference and changeability of the intestinal ecosystem and reveal biological and biomedical insights.

To carry out a proof-of-concept study on individual-specific disease classification based on microbiota compositional dynamics, we captured the temporal changes from microbiota data of volunteers during the ETEC challenge and subsequent treatment with ciprofloxacin (Pop et al., 2016), and we found the following: (i) the common microbiota biomarkers (OTUs) reported in the previous work can be mostly recovered and are also more effective in distinguishing clinical phenotypes of individuals; (ii) individual-specific biomarkers can be detected depending on the temporal 16S rRNA data from each subject and the given reference data from multiple subjects; (iii) the individual microbiota data can be effectively summarized, explored, and integrated for personalized diagnosis, prognosis, and prediction. In addition, in order to further validate the efficacy and robustness of our concept and method, we have employed iENA on another real-world dataset of the daily composition and relative abundance of bacteria in vaginal samples from twenty-five women with and without bacterial vaginosis (BV), and again satisfactory performance was achieved in distinguishing BV occurrence from healthy controls. In total, this work supplies novel evidence that individual biomarkers can promote microbiota-based disease classification, while the high-ranked critical OTUs deserve future clinical validation.

### MATERIALS AND METHODS

# Description of Data Organization Used in Proof-of-Concept Study

Challenge with enterotoxigenic Escherichia coli (ETEC) has two expected outcomes: watery diarrhea, i.e., symptomatic (Sx), or the host remaining asymptomatic (Asx) (Pop et al., 2016). The wild-type virulent ETEC strain (E. coli O78:H11) was most frequently used in volunteer studies; it induces severe diarrhea, with mild fever and vomiting being reported in a relatively higher proportion of subjects. The 16S rRNA data from gut microbiota reported in previous volunteer challenge studies with ETEC H10407 were selected for our study (Pop et al., 2016), and can be obtained from NCBI under project ID: PRJNA298336. A simple summary of the challenge protocol is as follows: the health status of subjects in this volunteer challenge was assessed before the challenge; early antibiotic treatment was given to the patients when some symptoms appeared; and starting on day 5, all subjects received a 3-day ciprofloxacin treatment. Importantly, the stool specimens were collected at 12 time points: prior to ETEC infection (days −1 and 0) and on days 1–7, 9, 28, and 84 after the infection (Pop et al., 2016). After sequencing, 124 samples passed quality controls and time matching and were used in our analysis, corresponding to 50 samples from 5 Sx volunteers and 74 samples from 7 Asx volunteers (Pop et al., 2016).

## Temporal Microbiota Data Analysis by Individual-Specific Edge-Network Analysis (iENA)

We previously developed an advanced computational framework, i.e., iENA, based on our proposed high-order correlation measurement as shPCC for one-sample omics data (Yu et al., 2017). In brief, iENA provides a powerful network-analysis tool for studying temporal omics data of complex diseases in a manner of individual samples, which is suitable for applications in precision medicine or personalized medicine. As noted in previous iENA analysis, each individual used some samples in the early stages as network references in dynamics analysis. However, when the number of samples for each individual is limited, this strategy cannot work. Thus, to investigate the microbiota dynamics in this work, we implemented an adjusted iENA particularly using samples from the baseline of individuals as the network reference and applied it for analyzing the temporal 16S rRNA data (or even other metagenomics data) as in **Figure 1**.

#### Collecting Data

To apply iENA, we downloaded temporal 16S rRNA datasets from NCBI, which include the ETEC challenge infection samples on individual subjects.

#### Selecting Reference Samples

In order to obtain the mean and variance of microbiota compositions used for evaluating each new single sample (i.e., each sample of one subject at one time point), a group of reference samples (i.e., control samples, or normal samples) must be fixed ahead of the follow-up analysis. Here, we set the samples from the normal stage, i.e., the samples at baseline, as the reference group. Whether these samples come from the same subject or from different subjects depends on the data organization. In theory, any samples with similar properties could serve as a reference group.

#### Selecting OTUs Based on Non-zero Value

One difficulty for processing 16S rRNA data is to deal with the large number of zero values for iENA, e.g., during any division computation; thus, similar to previous studies, we deleted OTUs with a large percent of zero values (i.e., 85% or other percent determined by a given threshold) to reduce the bias impact.
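The filtering step above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `filter_otus`, the parameter `max_zero_fraction`, and the toy abundance table are hypothetical and are not taken from the original analysis; the cutoff is left as a parameter because the threshold is data dependent.

```python
import numpy as np

def filter_otus(abundance, max_zero_fraction=0.85):
    """Keep OTUs (columns) whose fraction of zero entries is below the cutoff.

    abundance: 2-D array, rows = samples, columns = OTUs.
    Returns the filtered matrix and the indices of the retained columns.
    """
    abundance = np.asarray(abundance, dtype=float)
    # Fraction of samples in which each OTU has a zero count.
    zero_fraction = (abundance == 0).mean(axis=0)
    keep = np.flatnonzero(zero_fraction < max_zero_fraction)
    return abundance[:, keep], keep

# Hypothetical toy table: 4 samples x 3 OTUs (not real data).
toy = np.array([
    [0.0, 5.0, 1.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
    [1.0, 4.0, 0.0],
])
filtered, kept = filter_otus(toy, max_zero_fraction=0.6)
```

In this toy table the first OTU is zero in three of four samples, so it is dropped at a cutoff of 0.6 while the other two OTUs are retained.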

#### Constructing Microbiota Network by sPCC Calculation

Once the reference samples were available, we constructed the co-expression network of one sample by our single-sample measurement of the Pearson correlation coefficient (sPCC), consistent with previous studies (Yu et al., 2017). Considering the absence of a background network for microbial communities, we selected edges (i.e., one edge represents the association between two microbiota, represented by a pair of OTUs here) by a direct rank cut-off on the correlations, because the new PCC values are not normally distributed. The top-ranked edges with strong relations were finally selected; they constituted the conventional node-network or microbiota community (Wang et al., 2016; Sung et al., 2017), and were used as the background "nodes" for constructing the following edge-network (e.g., a network of OTU-pairs).
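A minimal sketch of this step is given below. The exact sPCC formula is defined in Yu et al. (2017) and is not reproduced in this section, so the code assumes one plausible form: the product of the z-scores of two OTUs in a single sample, computed with the mean and standard deviation of the reference group. The function names, the small constant `eps`, and the toy data are hypothetical.

```python
import numpy as np

def single_sample_pcc(sample, reference, eps=1e-8):
    """Assumed sPCC: outer product of one sample's z-scores w.r.t. the reference group.

    sample: 1-D array of OTU abundances for one sample.
    reference: 2-D array, rows = reference samples, columns = OTUs.
    """
    reference = np.asarray(reference, dtype=float)
    mu = reference.mean(axis=0)
    sigma = reference.std(axis=0, ddof=1)
    # eps guards against division by zero for OTUs that are constant in the reference.
    z = (np.asarray(sample, dtype=float) - mu) / (sigma + eps)
    return np.outer(z, z)

def top_edges(spcc, k):
    """Return the k OTU pairs (i < j) with the largest |sPCC|, strongest first."""
    i, j = np.triu_indices_from(spcc, k=1)
    strength = np.abs(spcc[i, j])
    order = np.argsort(-strength, kind="stable")[:k]
    return [(int(i[t]), int(j[t])) for t in order]

# Hypothetical toy data: 3 reference samples x 3 OTUs, and one new sample.
reference = np.array([[1.0, 2.0, 3.0],
                      [2.0, 4.0, 3.5],
                      [3.0, 6.0, 4.0]])
sample = np.array([4.0, 4.0, 4.5])
spcc = single_sample_pcc(sample, reference)
edges = top_edges(spcc, k=1)
```

The rank cut-off (`k`) plays the role of the direct rank threshold described above; no distributional assumption on the sPCC values is needed to apply it.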

#### Constructing Microbiota-Pair Network by shPCC Calculation

Between two OTU-pairs, we carried out the estimation of the fourth-order single-sample correlation coefficient for each edge-pair (i.e., two OTU-pairs) by shPCC (Yu et al., 2017) for each single sample (e.g., for each sample of one subject at one time point). Note that, in this step, we actually only computed the correlations between the pre-selected OTU-pairs from the above steps, and thus we could reduce the unnecessary computations drastically. Finally, we obtained the microbiota-pair network model corresponding to each sample at a particular time point, and each subject had personalized features on a time series in the OTU-pair networks.

#### Recognizing Individual OTU-Pair Biomarkers

Similar to the OTU-pair selection, we selected top-ranked edge-pairs as edge-biomarkers (i.e., OTU-pair biomarkers), which have strong relations with each other in terms of the high-order compositional correlations. Those strongly correlated OTU-pairs can be viewed as DNB candidates, represented as a set called "Marker." Then, for each individual, the OTU-pairs in the edge-network were used as individual OTU-pair biomarkers, and these OTUs were applied in the clinical phenotype prediction.

#### Quantifying the Predictive Markers by sCI

As is known, the DNB has been developed to identify the pre-disease state before a sudden deterioration during disease development and progression as general disease-warning signals (Chen et al., 2012; Zeng et al., 2013). Recently, the DNB model with its quantification criterion (i.e., CI, composite index) based on multiple samples has been widely adopted:

$$\text{CI} = \frac{\overline{PCC_{in}}}{\overline{PCC_{out}}} \times \overline{SD_{in}} \tag{1}$$

In our previous work on gene networks, the DNB criterion is further re-defined from the above correlation measurements in a manner of single-sample, i.e., sCI is defined as:

$$sCI = \frac{\overline{\sum_{x, y \in Marker} |sPCC(x, y)|}}{\overline{\sum_{x \in Marker,\, y \notin Marker} |sPCC(x, y)|}} \times \overline{\sum_{x \in Marker} |x - u_{x}|} \tag{2}$$

where PCCin is the average absolute PCC between the compositions of OTUs within the dominant group or DNB (e.g., a group of marker OTUs or molecules) in one sample; PCCout is the average absolute PCC between the compositions of OTUs in the dominant group and those of the other OTUs in one sample; SDin is the average standard deviation of the compositions of OTUs in the dominant group or DNB; and "Marker" is the set of DNB members.

Then, the sCI of individual OTU markers was used to indicate the disease-warning signals when its value was greater than a given threshold.
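As a worked illustration of Equation 2, the sketch below computes an sCI value from a precomputed sPCC matrix. It rests on assumptions that are not stated explicitly in the text: the overlines are read as arithmetic means, u_x is taken to be the reference-group mean of OTU x, and the diagonal of the sPCC matrix is excluded from the within-marker average. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def single_sample_ci(spcc, sample, ref_mean, marker_idx):
    """Sketch of Equation 2: (mean |sPCC| inside Marker / mean |sPCC| Marker-to-rest)
    multiplied by the mean absolute deviation of marker OTUs from the reference mean.

    Requires at least two marker OTUs and at least one non-marker OTU.
    """
    spcc = np.abs(np.asarray(spcc, dtype=float))
    marker = np.zeros(spcc.shape[0], dtype=bool)
    marker[list(marker_idx)] = True

    inside = spcc[np.ix_(marker, marker)]
    off_diagonal = ~np.eye(inside.shape[0], dtype=bool)  # skip self-correlations
    pcc_in = inside[off_diagonal].mean()
    pcc_out = spcc[np.ix_(marker, ~marker)].mean()

    sample = np.asarray(sample, dtype=float)
    ref_mean = np.asarray(ref_mean, dtype=float)
    deviation = np.abs(sample[marker] - ref_mean[marker]).mean()
    return pcc_in / pcc_out * deviation

# Hypothetical toy values: 4 OTUs, of which OTUs 0 and 1 form the Marker set.
spcc = np.array([[1.0, 0.8, 0.2, 0.1],
                 [0.8, 1.0, 0.1, 0.2],
                 [0.2, 0.1, 1.0, 0.5],
                 [0.1, 0.2, 0.5, 1.0]])
sample = np.array([5.0, 3.0, 1.0, 1.0])
ref_mean = np.array([2.0, 2.0, 1.0, 1.0])
score = single_sample_ci(spcc, sample, ref_mean, marker_idx=[0, 1])
```

For these toy values the within-marker average is 0.8, the marker-to-rest average is 0.15, and the mean deviation is 2, so the score is 0.8 / 0.15 × 2 ≈ 10.67. A warning would be raised when such a score exceeds the chosen threshold.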

#### Comparing OTU Markers and Their Disease Classification

For each individual, we obtained the differential OTU-pairs in each single sample (i.e., the edge associations at each time point) as novel edge-biomarkers to indicate the disease-warning signal. We obtained the sCI value with edge-biomarkers for each subject or sample, and we observed different sCI scores at consecutive time points. Thus, the value of sCI changed with time, and we defined a threshold to indicate criticality, i.e., whether or not a disease warning is raised for a subject. In addition, for the challenge data, we also examined the OTU markers derived from each subject, and compared them with the previously reported 32 OTU markers from the original research on the experimental data (Pop et al., 2016).

# RESULTS AND DISCUSSION

#### Parameter Setting of the Analysis on ETEC Challenge Data

To make full use of iENA, we used 16S rRNA (ribosomal RNA) gene sequencing data to describe changes in the fecal microbiota from 12 human volunteers during the challenge study with ETEC (H10407), where three males and two females developed diarrhea symptoms while four males and three females did not (Pop et al., 2016).

As shown in **Figure 2**, for the iENA analysis, we divided subjects into two groups according to clinical symptoms: a Sx group with 5 subjects (subjects 4, 11, 16, 17, and 38 in **Figure 2**) and an Asx group with 7 subjects (subjects 3, 13, 22, 29, 30, 33, and 41 in **Figure 2**). Samples before infection from baseline time (green in **Figure 2**, days −1 and 0) were used as the reference group. After selecting OTUs (non-zero percent > 0.85), we could calculate the sPCC (with mean and variance from the reference group) for each sample. We focused on the edges with strong correlations and finally determined the 1500 strongest relations at each time point. Then, these pre-selected edges were used as the background "nodes" for constructing the edge-network, and the significant signal peaks of edge-biomarkers were captured for each subject across multiple time points, which were candidate DNB members. Different from previous iENA applications, there was another parameter to control, namely the number of final OTU markers, because the number of tested microbiota was much smaller than the number of human genes or proteins tested in those applications.

## OTU Markers Identified by iENA Are Consistent With Reports in the Literature

Based on the above temporal data, we determined different numbers of OTUs as marking features on each time/sampling point of each subject by iENA, and the OTU-index score (i.e., CI index of OTU markers) is an average measurement against the effect of OTU number. To further prove OTU markers' satisfactory discrimination of the eventual clinical outcomes of individuals, we identified individual biomarkers comprising differently numbered bacterial OTUs.

FIGURE 2 | The sample organization of the ETEC challenge dataset. The subjects are divided into two groups according to the clinical symptom chart based on standardized symptom scoring: symptomatic (Sx) group with 5 subjects (subjects 4, 11, 16, 17, and 38 in the original data) and asymptomatic (Asx) group with 7 subjects (subjects 3, 13, 22, 29, 30, 33, and 41 in the original data). The samples before the challenge (in green) were used as a reference group; the non-symptom samples (in gray) have no significant clinical symptom; samples in orange indicate administration of ciprofloxacin; red marks represent diarrhea symptoms; pink element indicates the overlapping time/day of diarrhea symptoms and administration of ciprofloxacin.

Next, we checked the individual-specific biomarkers by combining all OTUs detected on each sample for the same subject. OTU markers found in five Sx individuals were very different from those identified in Asx individuals, which may be the reason why these OTUs can be used to predict displayed symptoms (or disease occurrence). We finally obtained 19 common OTUs in the Sx group, which were also distinguished from the Asx group in a combination manner (**Figure 3**).

These 19 common OTU markers represent robust signatures and most of them have been reported in previous works (Pop et al., 2016), which demonstrates the effectiveness of iENA on OTU marker discovery. Patients who eventually developed diarrhea symptoms were primarily affected by the abundance of OTUs from the genus Bacteroides as well as Dialister. The microbiota predictors included previously observed Bacteroides sp., Blautia sp., Alistipes sp., and our newly found Escherichia and Lachnospiraceae with a potential role during disease occurrence. Looking at **Figure 3**, on the one hand, globally, the OTU signatures seem to be absent in samples of Sx individuals but abundant in samples of Asx individuals; and on the other hand, locally, Escherichia and Lachnospiraceae appeared most in the samples from Sx individuals. By contrast, some OTUs from Bacteroides and Dialister are more frequently observed in samples from Asx individuals. These results indicate the biological significance of our OTU markers.

# OTU Markers Outperform Previously Reported OTU Signatures

To further explore whether the microbiota could predict the eventual clinical outcome, we used the OTU-index scores of the above 19 common OTUs to divide individuals into normal and disease groups. With an optimal threshold, the model was able to achieve an AUC of 0.9, higher than the previously reported 0.83 (Pop et al., 2016), which means these predictors are robust. Based on these OTUs, the accuracy is about 90% in **Figure 4**, much higher than the previously reported 76% (Pop et al., 2016), which again supports that the new OTU markers and their quantifications are efficient in judging whether a patient developed diarrhea symptoms or not from individual microbiota data. Following our assumption, the abundance variance rather than the abundance level would have more predictive power according to DNB theory, meaning that the OTU-index score of OTU markers based on abundance variance achieved higher performance.
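For readers who wish to evaluate a score of this kind, the sketch below shows how an AUC and a thresholded accuracy can be computed from per-subject scores. It is a generic illustration: the scores and labels are hypothetical toy values, not the study data, and the function names are assumptions. The AUC is computed with the rank-based definition (the probability that a symptomatic subject scores higher than an asymptomatic one, with ties counted as one half).

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: fraction of (positive, negative) pairs ranked correctly."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]          # all positive-negative score gaps
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def accuracy_at(scores, labels, threshold):
    """Accuracy when subjects with score > threshold are called symptomatic."""
    predicted = np.asarray(scores, dtype=float) > threshold
    return float((predicted == np.asarray(labels, dtype=bool)).mean())

# Hypothetical toy scores for 3 symptomatic (1) and 3 asymptomatic (0) subjects.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
toy_auc = auroc(scores, labels)
toy_acc = accuracy_at(scores, labels, threshold=0.5)
```

Note that the AUC does not depend on any threshold, whereas the accuracy does; reporting both, as done above, separates the ranking quality of the score from the choice of cut-off.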

# Another Case Study on Bacterial Vaginosis (BV)

In order to further validate the efficacy and robustness of our model, we applied this method to another dataset (Ravel et al., 2013). This dataset was obtained from the daily composition and relative abundance of bacteria in vaginal samples from twenty-five women: 15 SBV women diagnosed with Sx BV, six ABV women with Asx BV, and four healthy women at twenty time points during the 10-week study (Ravel et al., 2013). Because SBV treatment greatly influences bacterial abundance and association, the bacteria data of the Sx group (9 SBV) and the Asx group (6 ABV and 4 healthy) at the first nine time points, ahead of most treatments, were used in the following analysis.

Similar to the above case, the samples at the first time points of all individuals were used as the reference group. After selecting OTUs (non-zero percent > 0.5), we could calculate the sPCC (with mean and variance from the reference group) for each sample. Due to the limited number of bacteria in this dataset, we focused on the edges with strong correlations and finally determined the 10 strongest relations at each time point. Then, these pre-selected edges were used as the background "nodes" for constructing the edge-network and capturing the significant signal peaks of edge-biomarkers for each subject. As shown in **Figure 5**, to explore whether bacteria could be predictive of the eventual outcome as BV or not, we again simply used the OTU-index scores to divide individuals into Sx (BV) and Asx (not BV) groups. With an optimal threshold, the single OTU-index score achieved an AUC larger than 0.8, which means these predictors are efficient.

We also checked the individual-specific biomarkers by combining all OTUs detected on each sample for the same subject. Finally, we found 3 OTU markers in nine Sx (BV) individuals: Aerococcus christensenii, Veillonellaceae, and Bacteria. Meanwhile, in the Asx group (6 ABV and 4 healthy) the common markers were Gardnerella vaginalis, Aerococcus christensenii, and Bacteria. In order to observe more OTU markers distinguishing the two groups, we relaxed the selection conditions, and 13 markers appeared in more than half of the Sx members while 9 markers appeared in more than half of the Asx members. The Sx-specific OTU markers were Lactobacillus iners, BVAB2, Bifidobacteriaceae, and Parvimonas micra, and most of them have been reported in previous works (Pop et al., 2016) or are BV-associated. These results indicate again the biological significance of our selected OTU markers.

#### CONCLUSION

There is growing interest in bolstering resistance to infections or diseases by altering the microbiota (Jia et al., 2008; Holmes, 2016; Waterman et al., 2016; Delzenne and Bindels, 2018). Here, we have presented a computational framework, i.e., iENA, to identify the key OTU features that distinguish normal and disease states, by extracting higher-order statistics and dynamic information from 16S rRNA (ribosomal RNA) gene sequencing data in a one-sample manner. As a proof-of-concept study, we carried out iENA on the temporal development data of twelve subjects (healthy adults) whose intestinal microbiota underwent an ETEC challenge. Although the sample size is relatively small and the variations among individuals are large, our iENA achieved robust results that may lead to more confirmed conclusions. The analysis outcome from iENA indicates the following: (i) for challenged subjects, the individual symptom-related OTU markers are characterized by stable relations (higher-order information) rather than by sensitive OTU abundance; (ii) the OTU markers are significantly related to the disease development and progression (e.g., ETEC infection), and can predict whether an individual would develop symptoms or not with reasonable accuracy. In addition, iENA also showed satisfactory efficiency on another dataset about BV. These results all demonstrate the effectiveness of iENA with DNB on an individual's microbiota dynamics. Apart from the limitations of individual heterogeneity and sample numbers, network-based approaches like iENA can provide more universal tools for different types of real sequencing data (Davis-Richardson et al., 2014), which makes precision medicine more practical in clinical applications.

For the intestinal microbiota, iENA can explore differential microbiota-pair networks based on differential OTU abundance, variance, and covariance. Although iENA has previously been validated on transcriptome datasets (Yu et al., 2017), it is also able to detect the individual-specific OTU markers on metagenomic datasets like 16S rRNA data, and disclose the higher-order associations between the microbiota and clinical symptoms during the ETEC challenge, or other disease developments like BV. Thus, the combination of new high-throughput microbiome experiments and computational systems biology approaches has the power to recommend potential candidate species in the defense against various pathogens for precision medicine.

#### AUTHOR CONTRIBUTIONS

XY and XC executed the experiment and did the data analysis. XY and ZW wrote the manuscript. XY, XC, and ZW revised the manuscript. All authors read and approved the final manuscript.

# FUNDING

This work was supported by National Natural Science Foundation of China (Nos. 61803360 and 11771152).


hepatic steatosis in non-diabetic obese women. Nat. Med. 24, 1070–1080. doi: 10.1038/s41591-018-0061-3


stools of experimentally infected human volunteers. Gut Pathog. 10:46. doi: 10.1186/s13099-018-0273-6


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Yu, Chen and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# A Novel Joint Gene Set Analysis Framework Improves Identification of Enriched Pathways in Cross Disease Transcriptomic Analysis

Wenyi Qin<sup>1,2,3</sup>, Xujun Wang<sup>4</sup>, Hongyu Zhao<sup>4,5</sup> and Hui Lu<sup>1,2,4,5</sup> \*

*<sup>1</sup> Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai Jiaotong University, Shanghai, China, <sup>2</sup> Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States, <sup>3</sup> Department of Genetics, School of Medicine, Yale University, New Haven, CT, United States, <sup>4</sup> Department of Bioinformatics and Biostatistics, SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, Shanghai, China, <sup>5</sup> Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, United States*

Motivation: Gene set enrichment analysis is a widely accepted expression analysis tool that aims to detect coordinated expression changes within pre-defined gene sets rather than in individual genes. Compared with individual differentially expressed (DE) gene analysis, gene set analysis yields more reproducible and interpretable results and can detect small but consistent changes across a gene set that DE gene analysis would miss. There have been many successful applications of gene set analysis in human diseases. However, when the sample size of a disease study is small and no other public data sets of the same disease are available, the analysis lacks the power to detect pathways of importance to the disease.

#### Edited by:

*Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China*

#### Reviewed by:

*Xingming Zhao, Tongji University, China*

*Long Lu, Cincinnati Children's Hospital Medical Center, United States*

\*Correspondence:

*Hui Lu huilu@sjtu.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *21 December 2018* Accepted: *19 March 2019* Published: *12 April 2019*

#### Citation:

*Qin W, Wang X, Zhao H and Lu H (2019) A Novel Joint Gene Set Analysis Framework Improves Identification of Enriched Pathways in Cross Disease Transcriptomic Analysis. Front. Genet. 10:293. doi: 10.3389/fgene.2019.00293*

Results: We have developed a novel joint gene set analysis statistical framework that aims to improve the power of identifying enriched gene sets by integrating multiple similar disease data sets. Through comprehensive simulation studies, we demonstrated that the proposed framework obtained higher AUC scores than single data set analysis and another meta-analysis method in the identification of enriched pathways. When applied to two real data sets, the proposed framework retained the enriched gene sets identified by single data set analysis and exclusively obtained up to 200% more disease-related gene sets, demonstrating the improved identification power gained through information shared between similar diseases. We expect that the proposed framework will enable researchers to better explore public data sets when the sample size of their study is limited.

Keywords: public data integration, cross disease transcriptome, gene expression, gene set enrichment analysis, mixture model, EM algorithm

# BACKGROUND

High-throughput technologies such as microarrays and next-generation sequencing (NGS) allow researchers to measure the expression levels of thousands of genes or microRNAs in one sample simultaneously. These high-throughput genomic data have enabled researchers to better identify disease-related genes and pathways (Gu et al., 2014, 2017; Zheng et al., 2015, 2016; Liu et al., 2016, 2017, 2018; Gong et al., 2018). Gene set enrichment analysis has become a widely accepted expression analysis tool whose purpose is to identify coherent expression changes within a predefined gene set or pathway rather than to identify individual differentially expressed (DE) genes (Mootha et al., 2003; Kim and Volsky, 2005; Subramanian et al., 2005; Nam and Kim, 2008). Compared with DE gene analysis, gene set enrichment analysis yields more reproducible and interpretable results. It can also detect small but consistent changes that DE gene analysis ignores (Luo et al., 2009). There are many successful applications of gene set enrichment analysis in human disease-related gene/pathway discovery. For example, Drier et al. (2013) showed that enriched gene sets could serve as biomarkers for predicting survival time in glioblastoma and colorectal cancer patients. Zhao et al. combined gene set enrichment analysis information and microRNA target gene sets to identify cancer-related microRNAs (Zhao et al., 2014). Lee et al. used gene set enrichment analysis based on mutation and transcriptional data to identify driver mutations behind breast cancer metastasis (Lee et al., 2016). Identifying enriched gene sets provides crucial information about the molecular functions and mechanisms underlying different diseases.

Many gene set enrichment analysis methods have been developed to identify differentially expressed gene sets under different assumptions and for different data types (Edgar et al., 2002; Kim and Volsky, 2005; Subramanian et al., 2005; Dinu et al., 2007; Freudenberg et al., 2010; Rahmatallah et al., 2015; Zhao and Li, 2017). These methods focus on the analysis of a single data set and thus cannot make full use of the large amount of public expression data. With the cost of microarray and next-generation sequencing techniques decreasing and experimental protocols stabilizing, there are now over 1,000,000 samples deposited in public databases such as the Gene Expression Omnibus (GEO) (Subramanian et al., 2005). Meta-analysis is one way to improve identification power by integrating data sets of the same condition (Qin et al., 2016). Shen and Tseng (2010) and Chen M. et al. (2013) both proposed meta gene set enrichment analysis frameworks to integrate public data sets of the same biological condition and demonstrated improved identification power. However, these meta-analysis frameworks assume a simple concordance model: a gene is either differentially expressed in all studies or non-differentially expressed in all studies. This is a reasonable assumption when analyzing data sets of the same biological condition, but it might be problematic when few public studies are available for the disease of interest.

On the other hand, the joint analysis approach has proven more effective than the meta-analysis approach in combining multiple different but similar sources of data. Joint analysis methods developed in other fields of omics data analysis have proven useful in increasing identification power by borrowing information from other similar diseases (Chen X. et al., 2013; Chung et al., 2014; Wang et al., 2016; Lin et al., 2017). In our previous study, we also demonstrated that our joint analysis framework for DE gene detection has advantages over single data set analysis and meta-analysis, both in simulation studies and in real data cases combining different similar disease data sets (Qin and Lu, 2018).

In this study, we extended our previous joint gene analysis framework to a joint gene set analysis framework. Based on the assumption that similar diseases tend to share similar disease-related genes and pathways (Carson et al., 2017; Qin and Lu, 2018), we developed two joint gene set analysis frameworks that aim to improve the identification power of enriched gene sets by borrowing different levels of information from other similar diseases. Compared with the previous joint gene analysis framework, we unified DE gene/pathway statistic modeling through a two-component beta-uniform mixture model of p-values and combined the model with the normalized Kolmogorov-Smirnov (KS) statistic for joint gene set enrichment analysis. These novel frameworks were then compared in simulation studies with single data set analysis and with the MAPE framework proposed by Shen and Tseng (2010); Chen's method could not be included because it is not available from their website (Chen M. et al., 2013). The simulation results demonstrated that our proposed joint analysis framework outperformed all other methods in AUC under different simulation scenarios. When applied to two real data examples, the proposed joint analysis framework recovered most of the enriched gene sets that are identified by single data set analysis and further identified more pathways with better biological interpretability. These results demonstrate that the proposed joint gene set analysis framework improves the identification power of enriched gene sets by borrowing information across similar diseases.

#### METHODS

#### EM Algorithm Implementation for Joint Gene Set Analysis Framework

To perform joint gene set analysis, we first need to model the DE gene/enriched pathway statistics in a single data set. In this study, p-values derived from differential test statistics (for example, the two-sample t-statistic, or the Kolmogorov–Smirnov (KS) statistic designed for detecting enriched pathways) in a single data set are modeled directly by a two-component beta-uniform mixture model as described in Pounds and Morris (2003). The p-values of non-DE genes/non-enriched pathways are assumed to follow a uniform distribution, and the p-values of DE genes/enriched pathways follow a beta distribution with parameters α and 1, i.e., $f(p \mid D = 1) = \alpha p^{\alpha - 1}$ and $f(p \mid D = 0) = 1$, where the categorical variable D indicates whether a gene/pathway is DE/enriched (D = 1) or non-DE/non-enriched (D = 0). The marginal density of a p-value is thus written as follows:

$$f\left(\mathbf{p}\right) = \Pr\left(\mathbf{D} = 1\right) \alpha p^{\alpha - 1} + \left(1 - \Pr\left(\mathbf{D} = 1\right)\right) \tag{1}$$

where Pr(D = 1) is the proportion of DE genes/enriched pathways in a single data set and α ∈ (0, 1) is the parameter of the beta distribution. In the joint analysis framework, let $\mathbf{p}_g = (p_{g1}, \ldots, p_{gN})$ represent the p-values computed for the g-th gene/pathway across N diseases. Formula (1) can then be extended to N diseases:

$$f\left(\mathbf{p}_g\right) = \sum_{(D_1, \ldots, D_N)} \Pr\left(D_1, \ldots, D_N\right) \prod_{i=1}^{N} f\left(p_{gi} \mid D_i\right) \tag{2}$$

where $\Pr(D_1, \ldots, D_N)$ is the probability of a global configuration of DE gene/enriched pathway status across all diseases, and the sum runs over all $2^N$ configurations. In this model, $\Pr(D_1, \ldots, D_N)$ and $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}$ need to be estimated from the data. This is a typical mixture model problem, so an EM algorithm is implemented to obtain the maximum likelihood estimates of these parameters following the derivation in previous literature (Pounds and Morris, 2003; Qin and Lu, 2018). The details are described as follows:

Given the initial guesses $\Pr^{(0)}(D_1, \ldots, D_N) = 1/2^N$ and $\alpha^{(0)} = (\alpha_1^{(0)}, \alpha_2^{(0)}, \ldots, \alpha_N^{(0)})$ with $\alpha_i^{(0)} = 0.5$, the EM update at the t-th step for $\alpha^{(t)}$ and $\Pr^{(t)}(D_1, \ldots, D_N)$ is written as follows:

#### E-Step

The posterior probability of the configuration status of the g-th gene, given the observed $\mathbf{p}_g$ and $\alpha^{(t)}$, is given by:

$$\Pr\left(D_1, \ldots, D_N \mid \mathbf{p}_g, \alpha^{(t)}\right) = \frac{f\left(\mathbf{p}_g \mid D_1, \ldots, D_N, \alpha^{(t)}\right) \Pr^{(t)}\left(D_1, \ldots, D_N\right)}{f\left(\mathbf{p}_g \mid \alpha^{(t)}\right)} \tag{3}$$

#### M-Step

The updated $\Pr^{(t+1)}(D_1, \ldots, D_N)$ and $\alpha^{(t+1)}$ are then given by:

$$\Pr^{(t+1)}\left(D_1, \ldots, D_N\right) = \frac{\sum_{g=1}^{G} \Pr\left(D_1, \ldots, D_N \mid \mathbf{p}_g, \alpha^{(t)}\right)}{G} \tag{4}$$

$$\alpha_j^{(t+1)} = \frac{\sum_{g=1}^{G} \Pr\left(D_1, \ldots, D_j = 1, \ldots, D_N \mid \mathbf{p}_g, \alpha^{(t)}\right)}{\sum_{g=1}^{G} \Pr\left(D_1, \ldots, D_j = 1, \ldots, D_N \mid \mathbf{p}_g, \alpha^{(t)}\right) \left(-\log p_{gj}\right)} \tag{5}$$
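The E-step and M-step above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function name `em_joint_bum`, the stopping tolerance `tol`, the iteration cap `n_iter` and the p-value `floor` (used to avoid taking the logarithm of zero) are all assumptions introduced here. The weight attached to $D_j = 1$ in Eq. 5 is taken as the sum of the posterior probabilities of every configuration in which $D_j = 1$.

```python
import itertools
import math


def em_joint_bum(pvals, n_iter=200, tol=1e-8, floor=1e-12):
    """Fit the joint beta-uniform mixture by EM (Eqs. 3-5).

    pvals: list of G rows, each holding the N p-values of one gene/pathway.
    Returns (prior, alpha): prior maps each configuration tuple
    (D_1, ..., D_N) to its estimated probability; alpha is a list of N values.
    """
    n = len(pvals[0])
    configs = list(itertools.product((0, 1), repeat=n))
    prior = {c: 1.0 / len(configs) for c in configs}  # Pr^(0) = 1 / 2^N
    alpha = [0.5] * n                                  # alpha_i^(0) = 0.5
    # Assumption: clip p-values so that log(p) and p ** (alpha - 1) stay finite.
    pv = [[min(max(p, floor), 1.0) for p in row] for row in pvals]
    neg_log = [[-math.log(p) for p in row] for row in pv]

    for _ in range(n_iter):
        post_sum = {c: 0.0 for c in configs}
        num = [0.0] * n
        den = [0.0] * n
        for row, nl in zip(pv, neg_log):
            # Beta(alpha_j, 1) density of each p-value; the uniform density is 1.
            beta = [a * p ** (a - 1.0) for p, a in zip(row, alpha)]
            weight = {}
            for c in configs:
                dens = prior[c]
                for j in range(n):
                    if c[j]:
                        dens *= beta[j]
                weight[c] = dens
            total = sum(weight.values())
            for c in configs:
                q = weight[c] / total          # E-step posterior, Eq. 3
                post_sum[c] += q
                for j in range(n):
                    if c[j]:                   # configurations with D_j = 1
                        num[j] += q
                        den[j] += q * nl[j]
        new_prior = {c: s / len(pv) for c, s in post_sum.items()}  # Eq. 4
        new_alpha = [num[j] / den[j] for j in range(n)]            # Eq. 5
        change = max(abs(a - b) for a, b in zip(alpha, new_alpha))
        prior, alpha = new_prior, new_alpha
        if change < tol:
            break
    return prior, alpha
```

Calling `em_joint_bum(rows)` on a list of per-gene p-value rows returns the estimated configuration probabilities and one α per disease. The sketch does not constrain α to (0, 1); whether such a constraint is enforced in the original implementation is not stated in the text.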

#### Normalized KS Statistic and Corresponding p-value Calculation for a Pathway

The normalized KS statistic defined in Mootha et al. (2003) is used to detect significantly enriched pathways by measuring whether the genes of a pathway are more concentrated at the top of an ordered gene list than expected by chance, while controlling for pathway size. The normalized KS statistic for a pathway P containing M members is computed as follows:


$$\text{nKS}_P = \max_{1 \le j \le G} \sum_{i=1}^{j} R_i \tag{6}$$

To evaluate the significance of the observed nKS for a pathway, a gene-based permutation test is used to calculate the p-value.

The permutation test contains the following steps:


$$p\left(\text{nKS}_P\right) = \frac{\sum I\left(\text{nKS}_P \ge \text{nKS}_{perm}\right) + 1}{B \ast P + 1} \tag{7}$$

where I(·) is the indicator function.
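As a rough illustration of the statistic and its permutation test, the sketch below computes a running-sum statistic over a ranked gene list and a gene-based permutation p-value. Several details are assumptions rather than statements from the text: the increments $R_i$ follow the form given in Mootha et al. (2003), namely $\sqrt{(G-M)/M}$ for a pathway member and $-\sqrt{M/(G-M)}$ otherwise; the null distribution is built per pathway from `n_perm` random gene sets of the same size rather than pooled across pathways as the denominator of Eq. 7 suggests; and the p-value counts permuted statistics at least as large as the observed one, the usual upper-tail convention. The function names are hypothetical.

```python
import math
import random


def normalized_ks(ranked_genes, pathway):
    """Maximum running sum over a ranked gene list (form of Eq. 6).

    ranked_genes: genes ordered from most to least significant.
    pathway: iterable of gene identifiers belonging to the pathway.
    """
    members = set(pathway) & set(ranked_genes)
    g, m = len(ranked_genes), len(members)
    if m == 0 or m == g:
        return 0.0
    # Assumed increments, following Mootha et al. (2003).
    hit = math.sqrt((g - m) / m)
    miss = -math.sqrt(m / (g - m))
    running = 0.0
    best = float("-inf")
    for gene in ranked_genes:
        running += hit if gene in members else miss
        best = max(best, running)
    return best


def permutation_pvalue(ranked_genes, pathway, n_perm=1000, seed=0):
    """Gene-based permutation p-value for one pathway (simplified)."""
    rng = random.Random(seed)
    observed = normalized_ks(ranked_genes, pathway)
    size = len(set(pathway) & set(ranked_genes))
    exceed = 0
    for _ in range(n_perm):
        # A random gene set of the same size plays the role of a permuted pathway.
        fake = rng.sample(ranked_genes, size)
        if normalized_ks(ranked_genes, fake) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

With these increments the running sum returns to zero at the end of the list, so a pathway whose members all sit at the bottom of the ranking scores approximately zero, while one concentrated at the top scores $\sqrt{M(G-M)}$.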

#### Gene-Level Joint Gene Set Enrichment Analysis Framework (JointNormKS)

Based on the two-component mixture model of p-values for a single data set defined above, a gene-level joint gene set enrichment framework built on the normalized KS statistic (JointNormKS) is developed. The framework can be outlined as follows:


$$\begin{aligned} \Pr\left(D_i = 1 \mid p_{g1}, \ldots, p_{gN}\right) &= \\ \frac{\sum_{D_i = 1} f\left(p_{g1}, p_{g2}, \ldots, p_{gN} \mid D_1, D_2, \ldots, D_i = 1, \ldots, D_N\right) \Pr\left(D_1, D_2, \ldots, D_i = 1, \ldots, D_N\right)}{f\left(p_{g1}, p_{g2}, \ldots, p_{gN}\right)} \end{aligned} \tag{8}$$
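The marginal posterior in Eq. 8 can be illustrated with a short sketch that sums the joint density over every configuration in which $D_i = 1$ and divides by the full mixture density. The function name `marginal_posterior` and the numeric values in the usage note are hypothetical and chosen only for illustration; in practice `prior` and `alpha` would come from the fitted EM estimates.

```python
def marginal_posterior(p_row, prior, alpha, i):
    """Pr(D_i = 1 | p_g1, ..., p_gN) as in Eq. 8.

    p_row: the N p-values of one gene.
    prior: dict mapping configuration tuples (D_1, ..., D_N) to probabilities.
    alpha: list of N beta parameters, one per disease.
    i: zero-based index of the disease of interest.
    """
    numerator = 0.0
    total = 0.0
    for config, prob in prior.items():
        dens = prob
        for p, d, a in zip(p_row, config, alpha):
            if d:
                dens *= a * p ** (a - 1.0)  # beta(alpha, 1) density; uniform is 1
        total += dens
        if config[i] == 1:
            numerator += dens
    return numerator / total
```

For example, with an illustrative prior `{(0, 0): 0.8, (1, 0): 0.05, (0, 1): 0.05, (1, 1): 0.10}` and `alpha = [0.2, 0.2]`, a gene with p-values `[0.05, 0.001]` receives a higher posterior for the first disease than a gene with p-values `[0.05, 0.5]`, because the strong signal in the second disease shifts weight toward the shared configuration. This is the information borrowing that the framework relies on.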


#### Pathway-Level Joint Pathway Enrichment Analysis Framework (JointPathway)

In this section, JointPathway is proposed as another joint gene set enrichment analysis framework. It first summarizes the enrichment evidence at the pathway level within each disease data set and then performs joint analysis on the pathway-level p-values to identify potentially enriched pathways. The framework assumes that similar diseases tend to share similar dysregulated pathways. The outline of the framework is summarized as follows:


#### Meta-Analysis for Pathway Enrichment Analysis (MAPE)

Meta-Analysis for Pathway Enrichment Analysis (MAPE) is a series of meta-analysis frameworks proposed by Shen and Tseng (2010), which is specifically designed for pathway/gene set enrichment meta-analysis. It consists of three different frameworks: MAPE\_Gene, MAPE\_Pathway, and MAPE\_I. Here, we briefly introduce the implementation of each framework. MAPE\_Gene could be summarized by the following steps:


MAPE\_Pathway could be summarized by the following steps:


MAPE\_I is a hybridization of MAPE\_Gene and MAPE\_Pathway frameworks which takes the minimum p-value of a pathway obtained through MAPE\_Gene and MAPE\_Pathway as its test statistic. The p-value and FDR of this statistic are then determined through permutation test.

#### Simulation Study

To evaluate the effectiveness of the proposed joint gene set analysis frameworks, we performed comprehensive simulation studies. Assume that there is a total of 1,000 DE genes out of 10,000 genes. The expression value of each gene in a sample within each data set is generated as described in our previous study (Qin and Lu, 2018), with different means and variances set for each gene. We further assume that the number of data sets to be jointly analyzed is fixed at N = 2 and that the number of shared DE genes between the two data sets is fixed at 600, 700, 800, or 900, so that the DE gene similarity between the two data sets, defined as the average shared percentage of DE genes, i.e., $\frac{1}{2}\left(\Pr(D_2 = 1 \mid D_1 = 1) + \Pr(D_1 = 1 \mid D_2 = 1)\right)$, is 60, 70, 80, or 90%. After the gene expression data are generated, we further assume that there is a total of 1,000 pathways, each of which contains 50 genes. We would therefore expect to see 5 DE genes within each pathway, and any pathway containing more than 5 DE genes is considered an enriched pathway. In this simulation study, we set the number of DE genes in an enriched pathway at 10 and 15, respectively. Within each data set, there is a total of 100 enriched pathways. Similar to the DE gene similarity definition, we set the number of enriched pathways shared between the two data sets at 60, 70, 80, or 90 and consider it the enriched pathway similarity between the two diseases. Each pathway is formed by randomly sampling DE and non-DE genes, as represented in **Table 1**, where each row gives the enrichment status of a pathway in the two data sets and the number in each cell indicates how genes are sampled from the two data sets. Finally, to systematically evaluate the performance of the different frameworks, the receiver operating characteristic (ROC) curve (Fawcett, 2006) is used. Each parameter setup is repeated 30 times, and the average area under the curve (AUC) is calculated and recorded for each framework.

TABLE 1 | Simulation parameter setup under different scenarios.

*EP: Enriched Pathway*
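The bookkeeping behind this setup can be sketched as follows. The code is a simplified illustration under stated assumptions, not the simulation code used in the study: it only draws DE gene labels (no expression values), samples a pathway within a single data set rather than following the two-data-set sampling scheme of Table 1, and computes AUC with a rank-based (Mann-Whitney) formula. All function names are hypothetical.

```python
import random


def simulate_de_sets(n_genes=10000, n_de=1000, n_shared=600, seed=0):
    """Draw the DE gene sets of two data sets with a fixed number of shared DE genes."""
    rng = random.Random(seed)
    genes = list(range(n_genes))
    rng.shuffle(genes)
    shared = genes[:n_shared]
    only_first = genes[n_shared:n_de]
    only_second = genes[n_de:2 * n_de - n_shared]
    return set(shared + only_first), set(shared + only_second)


def de_similarity(de1, de2):
    """Average shared fraction: (Pr(D2=1 | D1=1) + Pr(D1=1 | D2=1)) / 2."""
    both = len(de1 & de2)
    return 0.5 * (both / len(de1) + both / len(de2))


def sample_pathway(de, n_genes, size=50, n_de_in_pathway=10, rng=random):
    """Sample one pathway with a fixed number of DE genes (single data set only)."""
    non_de = [g for g in range(n_genes) if g not in de]
    return (rng.sample(sorted(de), n_de_in_pathway)
            + rng.sample(non_de, size - n_de_in_pathway))


def auc(scores, labels):
    """Rank-based AUC: chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

With 1,000 DE genes per data set and 600 shared, `de_similarity` returns 0.6, matching the 60% setting; a pathway of 50 genes drawn at random would contain 50 × 1,000 / 10,000 = 5 DE genes on average, which is why 10 or 15 DE genes mark an enriched pathway.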

#### Gene Set Collection Database

The up-to-date C2 canonical pathway collection (Version 6.1) of MsigDB (Subramanian et al., 2007), which contains 1,329 gene sets, is used in this study. Before the gene set enrichment analysis, any gene set that contains fewer than 15 genes or more than 500 genes is removed from further analysis.
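The size filter can be illustrated with a small helper. One detail is an assumption: the sketch counts only genes that are actually measured in the data set before applying the 15 to 500 bounds, which is a common convention but is not stated explicitly in the text. The function and argument names are hypothetical.

```python
def filter_gene_sets(gene_sets, measured_genes, min_size=15, max_size=500):
    """Keep gene sets whose measured size lies within [min_size, max_size].

    gene_sets: dict mapping a gene set name to an iterable of gene symbols.
    measured_genes: iterable of gene symbols present in the expression data.
    """
    measured = set(measured_genes)
    kept = {}
    for name, genes in gene_sets.items():
        # Assumption: size is counted after restricting to measured genes.
        present = set(genes) & measured
        if min_size <= len(present) <= max_size:
            kept[name] = sorted(present)
    return kept
```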

# Lung Adenocarcinoma and Colorectal Adenocarcinoma

Because adenocarcinomas were observed to share similar DE genes in our previous study (Qin and Lu, 2018), we use lung adenocarcinoma (GEO accession no.: GSE32863) and colorectal adenocarcinoma (GEO accession no.: GSE41258) as one evaluation of our proposed joint gene set analysis frameworks. After combining multiple probe sets representing the same gene by taking the maximum expression value in each sample, a total of 12,054 unique genes and 991 canonical pathways are used in the analysis.
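The probe-collapsing step described above (maximum expression value per sample across probe sets of the same gene) can be sketched as follows. The data layout, a dictionary from probe identifier to a list of per-sample values, and the function name are assumptions made for illustration only.

```python
def collapse_probes(expr, probe_to_gene):
    """Collapse probe-level expression to gene level by per-sample maximum.

    expr: dict mapping a probe ID to a list of expression values (one per sample).
    probe_to_gene: dict mapping a probe ID to a gene symbol.
    Probes without a gene annotation are dropped.
    """
    collapsed = {}
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        if gene in collapsed:
            # Element-wise maximum over samples for probes of the same gene.
            collapsed[gene] = [max(a, b) for a, b in zip(collapsed[gene], values)]
        else:
            collapsed[gene] = list(values)
    return collapsed
```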

# Alzheimer's Disease (AD) and Huntington's Disease (HD)

AD and HD are known to share highly similar pathology (Narayanan et al., 2014). In this study, GSE33000, which contains both AD and HD postmortem samples, is used to evaluate the performance of joint gene set enrichment analysis. Multiple probe sets representing the same gene are combined by taking the maximum expression value in each sample. A total of 21,576 genes and 1,071 pathways are used in the analysis.

# RESULTS

# Overview of Proposed Joint Gene Set Enrichment Analysis Frameworks

**Figure 1** outlines the flowchart of the two joint gene set enrichment frameworks proposed in this study. The details of the algorithm implementation can be found in the Methods section. Here we briefly discuss the difference between the two frameworks. The joint gene set enrichment framework can be split into a gene-level variant (JointNormKS) and a pathway-level variant (JointPathway). In JointNormKS, the differential expression status of each gene is first jointly analyzed across all similar disease data sets, and gene set enrichment analysis is then performed on the jointly analyzed results, which incorporate information from the other similar diseases. In this framework, we would expect to observe increased identification power for pathway enrichment when a gene successfully borrows information from the same gene in the other diseases. In JointPathway, gene-level information is first summarized by pathway within each data set, and joint analysis is then performed on the pathway-level evidence. Under this framework, we would expect to see increased identification power when similar diseases share many enriched pathways.

#### Comparisons Among JointNormKS, JointPathway, Single Data Set Analysis and MAPE Methods in Simulated Data Sets

In this section, we evaluated the performance of the proposed joint gene set enrichment analysis frameworks through a simulation study and compared their performance with single data set analysis and the published MAPE methods (Shen and Tseng, 2010). The detailed implementation of the simulation study and the parameter setup can be found in the Methods section and **Table 1**. Briefly, expression data sets of two similar diseases are generated with different numbers of DE genes within a pathway, different DE gene similarities and different enriched pathway similarities. Furthermore, we consider two different scenarios for the DE gene configuration in a pathway. In the first scenario, the enriched pathway in the target disease data set contains DE genes that fully overlap with those of the similar disease data set from which information is borrowed. In the second scenario, the DE genes in the enriched pathway of the target disease data set do not overlap with any DE genes in the similar disease data set. This is a reasonable assumption because a similar situation, in which one pathway is enriched in both data sets but the DE genes differ, has been observed in other literature (Shen and Tseng, 2010). The comparison results are summarized in **Figure 2**.

In Scenario 1, we assume that each enriched pathway is composed of shared DE genes. In this scenario, our proposed JointNormKS outperforms all other methods when the enrichment strength is set to 20% DE genes in an enriched pathway. JointNormKS is not sensitive to the DE gene similarity: different DE gene similarities yield similar, significant AUC improvements over single data set analysis. On the other hand, enriched pathway similarity has a stronger impact on the performance of JointNormKS: the AUC improves as the enriched pathway similarity increases. JointPathway does not differ from single data set analysis when the enrichment strength is low, mainly because the p-value signals of enriched and non-enriched pathways are not separable in this case, so the information borrowing in the joint analysis does not work in the low-signal case. MAPE methods do not work well in this case. MAPE\_Gene performs worse under all enriched pathway parameter setups, mainly because, when it summarizes evidence at the gene level, it takes the maximum p-value of a gene across both diseases, which causes it to miss many disease-specific DE genes in a pathway. MAPE\_Pathway performs better as the enriched pathway similarity increases. However, even when the enriched pathway similarity is set to 90%, JointNormKS still outperforms MAPE\_Pathway because a disease-specific pathway is regarded as a false positive by MAPE\_Pathway and thus receives a low rank. The MAPE\_I method combines the best results from MAPE\_Gene and MAPE\_Pathway and thus cannot perform better than JointNormKS. When the enrichment strength increases from 20% DE genes to 30% DE genes, JointNormKS still outperforms all other methods. We also observe that JointPathway shows improved AUC over single data set analysis when the enrichment strength increases, because the signal of an enriched pathway in a single data set can then be distinguished from that of a non-enriched pathway, which enables information sharing between the two similar diseases. MAPE\_Gene performs similarly to before, while MAPE\_Pathway does not improve on single data set analysis, mainly because, when the signal in a single data set is strong enough, a meta-analysis-based method instead lowers the rank of disease-specific enriched pathways.

In Scenario 2, we assume that enriched pathways are composed of non-overlapping DE genes in the two data sets. JointNormKS still outperforms all other methods in this scenario, and the AUC improvement is even larger than in Scenario 1. On further examination, we find that JointNormKS can efficiently borrow shared enriched pathway information because of the combined use of the normalized KS statistic and joint analysis at the gene level (see Conclusions and Discussion for details). MAPE\_Gene performs even worse in this scenario because there are no shared DE genes within a pathway. Meta-analysis by taking the maximum p-value would thus produce many false positives in DE gene detection. The other methods, which summarize evidence at the pathway level, perform the same as in Scenario 1.

To sum up, the simulation tests with different parameter setups and two different scenarios demonstrate that JointNormKS performs best among all methods, even when there are no shared DE genes within an enriched pathway. We therefore use the JointNormKS method for the real data applications in the next section.

## Comparison of JointNormKS With Single Data Set Analysis in Real Data Application

Based on the simulation test results, we apply the JointNormKS framework to two real data sets and compare the identified enriched gene sets with those derived from single data set analysis. We use lung and colorectal adenocarcinoma as one example because both adenocarcinomas develop from gland cells, albeit in different tissues, and because our previous study showed that lung and colorectal adenocarcinoma share a significantly higher percentage of DE genes than other cancers (Qin and Lu, 2018). Alzheimer's disease and Huntington's disease are selected as another example because of their highly similar clinical phenotypes.

#### Real Data Application: Lung Adenocarcinoma and Colorectal Adenocarcinoma

JointNormKS is first applied to the adenocarcinoma data sets, and the results are compared with those obtained through single data set analysis using the NormKS statistic, with the FDR cutoff set at 0.1. The comparison results are summarized in **Figure 3**. In the lung adenocarcinoma data set, single data set analysis identified 19 pathways, while JointNormKS identified all of these pathways plus 12 more enriched pathways. The common pathways identified by both methods include "KEGG\_CELL\_CYCLE", which is documented in the KEGG disease pathway database among the pathways known to be involved in non-small cell lung cancer (hsa05223). The p-value and FDR of this pathway are improved in JointNormKS (FDR∼0.005) compared with single data set analysis (FDR∼0.012). We also examined other pathways recorded in KEGG as involved in non-small-cell lung cancer and found that most of them have improved significance in JointNormKS over single data set analysis (**Additional File 1A**). Among the other commonly identified pathways, many are cancer related, including cell cycle related pathways such as "REACTOME\_DNA\_REPLICATION" and cancer signaling pathways such as "PID\_E2F\_PATHWAY" (Nevins, 2001; Bracken et al., 2003; Tazawa et al., 2007) and "PID\_AURORA\_B\_PATHWAY", all of which play an important role in tumor progression (Chieffi et al., 2006; Girdler et al., 2006; Qi et al., 2007). For the pathways exclusively identified by JointNormKS, shown in **Table 2**, an extensive literature search indicates that many are related to lung cancer. For instance, "PID\_MYC\_ACTIV\_PATHWAY" is a classic cancer-related pathway regulating cell proliferation that is found in many cancers (Zajac-Kaye, 2001; Bild et al., 2006; Chou et al., 2010). "BIOCARTA\_MCM\_PATHWAY", which controls the initiation of DNA replication, was reported in several lung cancer studies (Ho et al., 2007; Brambilla and Gazdar, 2009). Other pathways that are closely related to cancer progression include pathways of amino acid metabolism and DNA synthesis. The full list of identified pathways in lung adenocarcinoma can be found in **Additional File 1B**.

In the colorectal adenocarcinoma data set, single data set analysis identified slightly more enriched pathways than JointNormKS. One hundred and twenty-six pathways were identified by both methods. Three pathways are exclusively identified by JointNormKS, while six are exclusively identified by single data set analysis. The biological processes represented by the 126 commonly identified enriched pathways are similar to those observed in the lung adenocarcinoma data set. Among them, "KEGG\_CELL\_CYCLE" and "KEGG\_P53\_SIGNALING\_PATHWAY" are two pathways documented in the KEGG database as related to colorectal cancer (hsa05210). When examining all eight pathways known to be related to colorectal cancer, we also observed that JointNormKS overall improved the FDR of these pathways compared with single data set analysis. The full result is summarized in **Additional File 2A**. We further examined the enriched pathways exclusively identified by JointNormKS and by single data set analysis, respectively. All three pathways exclusively identified by JointNormKS are closely related to cancer. "BIOCARTA\_P53\_PATHWAY" and "PID\_MYC\_PATHWAY" are two canonical cancer-related pathways. As for "REACTOME\_TRANSCRIPTION," the gene family categorization on MsigDB shows that many genes in this gene set belong to cancer-related gene families such as "oncogene" and "tumor suppressor." On the other hand, among the six gene sets exclusively identified by single data set analysis, only one, "WNT\_SIGNALING," represents a process known to be related to cancer progression. The remaining gene sets might be false positives because very few reports could be found linking these biological processes to cancer. The full list of identified enriched gene sets in colorectal adenocarcinoma can be found in **Additional File 2B**.

#### Real Data Application: Alzheimer's Disease and Huntington's Disease

Furthermore, we apply JointNormKS to two neurodegenerative disorder data sets and evaluate the identified enriched gene sets. The comparison results are summarized in **Figure 4**. JointNormKS demonstrated improved statistical power by identifying more enriched gene sets than single data set analysis, while the enriched gene sets identified by single data set analysis were also identified by JointNormKS. In the AD data set, JointNormKS exclusively identified 13 enriched gene sets, and in the HD data set the number is 57. A clear gain in statistical power is observed for JointNormKS over single data set analysis here.

In the AD data set, we first examined three pathways known to be related to AD that are documented in the KEGG disease pathway database (hsa05010). "KEGG\_APOPTOSIS" and "KEGG\_OXIDATIVE\_PHOSPHORYLATION" are identified by both methods with a similar level of significance. The results for the three known AD-related pathways are summarized in **Additional File 3A**. Further examination of the 13 gene sets exclusively identified by JointNormKS shows that they belong to the categories of apoptosis/cell survival, neuron development and energy metabolism, all of which have a close relationship to AD (**Table 3**). The full list of identified enriched gene sets is summarized in **Additional File 3B**.

In the HD data set, seven pathways known to be related to HD that are documented in the KEGG disease pathway database (hsa05016) are first examined. "KEGG\_CALCIUM\_SIGNALING\_PATHWAY," "KEGG\_OXIDATIVE\_PHOSPHORYLATION," "KEGG\_PROTEASOME," and "KEGG\_APOPTOSIS" are identified by both methods, with JointNormKS showing better statistical significance on average. It is worth noting that one HD-related pathway, "KEGG\_RNA\_POLYMERASE," is exclusively identified by JointNormKS. The full result for these HD-related pathways is summarized in **Additional File 4A**. Furthermore, among the 57 gene sets exclusively identified by JointNormKS, we are surprised to find many cancer-related pathways. A further literature search shows that biological processes such as cell cycle, DNA repair, apoptosis and kinase signaling are implicated in both diseases, suggesting a potential link between the two (Plun-Favreau et al., 2010; Driver, 2012). The full list of enriched gene sets identified in HD is summarized in **Additional File 4B**.

# CONCLUSIONS AND DISCUSSION

In this study, we proposed two novel joint gene set enrichment analysis frameworks, JointNormKS and JointPathway, which borrow shared information across similar diseases at the gene level and the pathway level, respectively. Compared with our previously developed joint gene analysis framework, the framework proposed here focuses on pathway-level detection, and the results support the assumption that similar diseases share similar pathways. The framework gives researchers a new angle from which to view their data and can complement gene-level analysis where the latter is limited.

The two frameworks were first tested in simulations and compared with MAPE, an existing meta-analysis method for gene set enrichment analysis. The results showed that JointNormKS performed best among all tested methods under all simulation scenarios. JointNormKS was then applied to two real data sets and identified a comparable or greater number of enriched gene sets than analyzing each data set alone. Further examination revealed that JointNormKS recovered most of the enriched gene sets identified by single data set analysis, and that the enriched gene sets exclusively identified by JointNormKS were mostly related to the disease. These results demonstrate that when similar diseases are jointly analyzed, the proposed joint gene set framework can borrow information across diseases and improve identification power.

In the simulation test, we observed that in Scenario 1, JointNormKS was not sensitive to the DE gene similarity (**Figure 2**). The reason is that after the joint analysis at the gene level, genes that are DE in both data sets are prioritized to the top of the gene list ordered by posterior probability of DE status, and the improvement in the rank of these genes is similar across different DE gene similarity values. Since the Normalized KS statistic is rank-sensitive, the ranks of enriched pathways remain the same, and so does the ROC, although the posterior probability of these DE genes within an enriched pathway keeps increasing. In Scenario 2, when an enriched gene set in both data sets is composed of non-overlapping DE genes across the two data sets, we observed that JointNormKS was still able to detect these gene sets and even had a better AUC improvement. The reason is that after gene-level joint analysis, the ranks of DE genes in the disease being borrowed from improve, and the Normalized KS statistic, which is sensitive to these changes, raises the rank of these shared pathways. This might raise the concern that it will lead to an increased number of false positives. We would argue that the whole framework is designed on the assumption that similar diseases share similar enriched pathways. If this assumption holds, the JointNormKS framework works well, as demonstrated in the simulation tests.

Three improvements need to be implemented in future work. The first is to design a likelihood test to detect shared DE gene or enriched pathway similarity before joint analysis is performed, so that researchers using this framework have a better sense of whether the disease data sets should be jointly analyzed. The test procedure would be similar to that described in Chung et al. (2014). The second is to enable the framework to borrow from more disease data sets, as the size of the prior probability vector currently increases exponentially with the total number N of data sets (2<sup>N</sup>). A heuristic approximation or a hierarchical structure could be implemented, as described in Lai et al. (2017). The third is to incorporate gene set dependence into the joint gene set enrichment analysis framework. In this study, gene set independence is assumed even though many gene sets share common genes, so the assumption hardly holds in the real world. How to address gene set/pathway dependence has been discussed and is a hot topic in the field of statistics (Tamayo et al., 2016; Tomoiaga et al., 2016; Xie et al., 2017). Extra work is needed to include it in the framework proposed in this study, and several options will be explored in the future.
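The exponential growth of the prior probability vector can be made concrete with a small sketch. It assumes, as the text implies, that each gene is either DE or non-DE in each of the N data sets, so that the prior has one entry per joint status pattern. The function name `de_configurations` is hypothetical and is not part of the authors' software.

```python
from itertools import product

def de_configurations(n_datasets):
    """Enumerate every joint DE (1) / non-DE (0) status pattern across data sets."""
    return list(product((0, 1), repeat=n_datasets))

# The number of patterns, and hence the prior vector length, doubles with each data set.
for n in (2, 3, 5, 10):
    print(n, len(de_configurations(n)))
```

Under this assumption, full enumeration quickly becomes impractical as data sets are added, which is consistent with the heuristic or hierarchical alternatives suggested above.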

#### AUTHOR CONTRIBUTIONS

WQ and HL conceived and designed the study. WQ and HL developed the method; XW and HZ helped in method development. WQ wrote the computer program, analyzed the data, and interpreted the results. WQ, XW, HZ, and HL wrote the manuscript. All authors read and approved the final manuscript.

#### REFERENCES


#### FUNDING

The work is partially supported by National Key R&D Program of China 2018YFC0910500, the Neil Shen's SJTU Medical Research Fund, SJTU-Yale Collaborative Research Seed Fund; NSFC 31728012, Science and Technology Commission of Shanghai Municipality (STCSM) grant 17DZ 22512000, and the University of Illinois at Chicago Department of Bioengineering.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00293/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Qin, Wang, Zhao and Lu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Evaluation of Plasma Extracellular Vesicle MicroRNA Signatures for Lung Adenocarcinoma and Granuloma With Monte-Carlo Feature Selection Method

Xiangbo Chen<sup>1,2</sup>, Yunjie Jin<sup>3</sup> and Yu Feng<sup>4</sup> \*

<sup>1</sup> Key Laboratory of Molecular Epigenetics of the Ministry of Education, Northeast Normal University, Changchun, China, <sup>2</sup> Hangzhou Baocheng Biotechnology Co., Ltd., Hangzhou, China, <sup>3</sup> Department of Oncology, Shanghai Putuo People's Hospital, Shanghai, China, <sup>4</sup> Shuguang Hospital, Shanghai University of Traditional Chinese Medicine, Shanghai, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Shijia Zhu, The University of Texas Southwestern Medical Center, United States Cheng Guo, Columbia University, United States

> \*Correspondence: Yu Feng Fengyu@shutcm.edu.cn; dryufeng021@163.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 18 December 2018 Accepted: 05 April 2019 Published: 26 April 2019

#### Citation:

Chen X, Jin Y and Feng Y (2019) Evaluation of Plasma Extracellular Vesicle MicroRNA Signatures for Lung Adenocarcinoma and Granuloma With Monte-Carlo Feature Selection Method. Front. Genet. 10:367. doi: 10.3389/fgene.2019.00367

Extracellular Vesicle (EV) refers to a collection of secreted vesicles, including microvesicles, large oncosomes, and exosomes. It can be used in non-invasive diagnosis. MicroRNAs (miRNAs) processed by exosomes can be detected by liquid biopsy. To objectively evaluate the discriminative ability of miRNAs from whole plasma, EV, and EV-free plasma, we analyzed the miRNA expression profiles in whole plasma, EV, and EV-free plasma of 10 lung adenocarcinoma and 9 granuloma patients. With the Monte-Carlo feature selection method, the top discriminative miRNAs in whole plasma, EV, and EV-free plasma were identified, and they were quite different. Using the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) method, we learned the classification rules: in whole plasma, granuloma patients did not express hsa-miR-223-3p while lung adenocarcinoma patients did; in EV, hsa-miR-23b-3p was highly expressed in granuloma patients but not in lung adenocarcinoma patients; in EV-free plasma, hsa-miR-376a-3p was expressed in granuloma patients but barely expressed in lung adenocarcinoma patients. For prediction performance, whole plasma had the highest weighted accuracy and EV outperformed EV-free plasma. Our results suggested that EV can be used as a lung cancer biomarker. However, since it is less stable and not easy to detect, there are still technological difficulties to overcome.

Keywords: microRNA signatures, biomarker, classification, lung adenocarcinoma, granuloma

# INTRODUCTION

Blood is a mixture of plasma, platelets, and various blood cells, such as erythrocytes, leukocytes, neutrophilic granulocytes, eosinophilic granulocytes, basophilic granulocytes, monocytes, and lymphocytes (Basu and Kulkarni, 2014). It can reflect the health status of the body. Extracellular Vesicle (EV) refers to a collection of secreted vesicles, including microvesicles, large oncosomes, and exosomes (Lawson et al., 2018). Exosomes, with a diameter of 30–100 nm, are a kind of membrane-bound EV and originate from endosomes (Raposo and Stoorvogel, 2013). Nearly all kinds of cells can secrete exosomes, whether under normal or stressful conditions (Srivastava et al., 2015). Compared with normal cells, tumor cells of specific organs have been proven to secrete more exosomes. In addition, the exosome membrane is rich in functional proteins, including tetraspanins, endosome-related membrane transport and fusion proteins, and multivesicular body biogenesis proteins, and thus exosomes could be applied as biomarkers (O'Driscoll, 2015). Exosomes can be extracted from diverse body fluids, which contain numerous biological molecules (DNAs, RNAs, and proteins). Recently, liquid biopsy has been developed as a novel, non-invasive diagnosis method to explore tumor development (Sheridan, 2016).

MicroRNAs (miRNAs) processed by exosomes can be detected by liquid biopsy (Iranifar et al., 2019). miRNAs are a group of non-coding RNAs that regulate gene expression at the post-transcriptional and translational levels (Inamura, 2017). Dysregulation of miRNA expression is related to the progression of lung adenocarcinoma. In addition, Nadal et al. (2014) demonstrated that different morphological subtypes of lung adenocarcinoma have specific miRNA expression profiles; for instance, miR-212-3p, miR-132-5p, and miR-27a-3p are significantly upregulated in adenocarcinomas of the solid subtype. Many miRNAs play important roles in lung cancer pathogenesis and are recognized as potential diagnostic biomarkers and tumor-targeted therapeutic molecules (Inamura, 2017).

Lung cancer, a well-studied and common cancer, remains the leading cause of cancer-specific death around the world. Adenocarcinoma accounts for nearly half of all lung cancers and remains the most common histologic subtype (Travis et al., 2011; Rosell and Karachaliou, 2018). Although the development of new therapies has significantly improved the prognosis of patients with lung adenocarcinoma, the 5-year survival rate remains low (less than 16%) (Crino et al., 2010).

To evaluate the discriminative ability of miRNAs from whole plasma, EV, and EV-free plasma, we analyzed the miRNA expression profiles in whole plasma, EV, and EV-free plasma of lung adenocarcinoma and granuloma patients. The same feature selection method, Monte-Carlo feature selection, and the same rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), were applied to the three miRNA expression datasets for lung adenocarcinoma and granuloma patients. The prediction performances and classification rules of whole plasma, EV, and EV-free plasma were compared and analyzed. Our results suggested that the prediction performance of EV miRNAs was better than that of EV-free plasma miRNAs. Moreover, we identified an EV-specific miRNA expression pattern in lung cancer. These results supported the use of EV miRNAs as lung cancer biomarkers, although whole plasma achieved better prediction performance. The utilization of EV biomarkers still has a long way to go.

#### MATERIALS AND METHODS

#### The MicroRNA Expression Profiles in Whole Plasma, EV, and EV-Free Plasma

We downloaded the processed miRNA expression profiles in whole plasma, EV, and EV-free plasma of 10 lung adenocarcinoma patients and 9 granuloma patients from GEO (Gene Expression Omnibus) under accession number GSE71661 on August 30, 2018. The expression levels of miRNAs were measured with next-generation sequencing using Illumina HiSeq 2500. The reads were mapped onto known human miRNAs in miRBase Release 21 using BLAST and Bowtie. The mapped reads were normalized by the total number of reads. Each of the three miRNA datasets (whole plasma, EV, and EV-free plasma) contained 10 lung adenocarcinoma and 9 granuloma patients and 1,509 miRNAs. The downloaded miRNA profiles are provided in **Supplementary Table S1**.

To systematically compare the miRNA expression difference between lung adenocarcinoma and granuloma patients, whole plasma, EV, and EV-free plasma were analyzed separately. Our goal was to compare their prediction performance and unique expression of miRNAs.

# Key MicroRNAs in Whole Plasma, EV, and EV-Free Plasma Identified With Monte-Carlo Feature Selection

Since there were 19 samples and 1,509 miRNA features in each of the whole plasma, EV, and EV-free plasma datasets, the number of features was much greater than the sample size. If all miRNAs were used to build the classification model, all samples could be perfectly classified, but the model would be overfitted and would have no practical meaning. Therefore, we adopted Monte-Carlo feature selection (Draminski et al., 2008) to identify the key miRNA features and then used these few key features to construct the classification model. The Monte-Carlo feature selection method has been widely used and has achieved great performance in many fields (Chen et al., 2018c,e; Pan et al., 2019).

The Monte-Carlo feature selection method will randomly choose several features multiple times and then construct a series of tree classifiers (Chen et al., 2018a; Pan et al., 2018b; Wang et al., 2018). Based on the frequency and classification accuracies of the feature nodes on these classification trees, each feature will be assigned with a relative importance. Intuitively speaking, if a feature has been selected many times to construct the classification tree, it is important as the classification tree will find the most discriminative features to be the nodes.

Let d denote the total number of miRNA features, i.e., 1,509 in this study. In each of s rounds, m miRNA features (m ≪ d) are randomly selected and used to construct t classification trees. Each of the t trees is trained and tested on training and test patient samples randomly divided from the full dataset. Therefore, s · t classification trees are constructed in total. Based on how many times a miRNA feature g has been selected by these s · t trees and how much this miRNA feature g has contributed to the classification of the s · t trees, its relative importance (RI) can be calculated:

$$\mathrm{RI}_{g} = \sum_{\tau=1}^{st} (\mathrm{wAcc})^{u} \sum_{n_{g}(\tau)} \mathrm{IG}\left(n_{g}(\tau)\right) \left( \frac{\text{no. in } n_{g}(\tau)}{\text{no. in } \tau} \right)^{v} \tag{1}$$

where wAcc is the weighted classification accuracy of decision tree τ, IG(n<sub>g</sub>(τ)) is the information gain of node n<sub>g</sub>(τ), which is a decision rule using the expression levels of miRNA feature g, (no. in n<sub>g</sub>(τ)) is the number of samples under node n<sub>g</sub>(τ), (no. in τ) is the number of samples in decision tree τ, and u and v are adjustable parameters.

By analyzing the s · t classification trees, each miRNA feature is assigned an RI, and the features are ranked in decreasing order of RI.
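To make Eq. 1 concrete, the following minimal sketch accumulates RI from a summary of already-fitted trees. It is not the dmLab implementation: the input layout (a list of dicts with `wacc`, `n_samples`, and `nodes`), the function name `relative_importance`, and the toy numbers are all assumptions made for illustration.

```python
from collections import defaultdict

def relative_importance(trees, u=1.0, v=1.0):
    """Accumulate Eq. 1 over a collection of fitted trees.

    Each tree is a dict with:
      'wacc'      : weighted accuracy of the tree on its test split
      'n_samples' : number of samples in the tree
      'nodes'     : list of (feature, info_gain, n_in_node) split nodes
    """
    ri = defaultdict(float)
    for tree in trees:
        tree_weight = tree["wacc"] ** u
        for feature, info_gain, n_in_node in tree["nodes"]:
            # Fraction of the tree's samples that reach this node, raised to v.
            coverage = (n_in_node / tree["n_samples"]) ** v
            ri[feature] += tree_weight * info_gain * coverage
    return dict(ri)

# Hypothetical toy input: two trees with made-up accuracies and gains.
toy_trees = [
    {"wacc": 0.8, "n_samples": 10,
     "nodes": [("miR-A", 0.5, 10), ("miR-B", 0.2, 5)]},
    {"wacc": 0.5, "n_samples": 10,
     "nodes": [("miR-A", 0.4, 10)]},
]
ri = relative_importance(toy_trees)
ranking = sorted(ri, key=ri.get, reverse=True)  # decreasing RI
print(ranking)
```

In this toy input, a feature that splits at the root of accurate trees accumulates more RI than one that appears once on a smaller node, which matches the intuition given in the text.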

The Monte-Carlo feature selection method was applied using the dmLab software (Draminski et al., 2008) downloaded from http://www.ipipan.eu/staff/m.draminski/mcfs.html.

## Classification Rules for Lung Adenocarcinoma and Granuloma in Whole Plasma, EV, and EV-Free Plasma Learned With RIPPER

Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a widely used method for learning classification rules (Cai et al., 2018; Chen et al., 2018a,c,e,f; Pan et al., 2018a). To evaluate the prediction performance objectively, we performed 10-fold cross-validation (Wang et al., 2017; Zhang et al., 2017; Chen et al., 2018b,d; Li et al., 2018). In each cross-validation, the samples were randomly divided into 10 parts and each part was used once as the test dataset, so that after 10 rounds all samples had been tested. Because a single random split of the data may introduce bias, we repeated the 10-fold cross-validation three times and combined the results of the three runs. In this study, the lung adenocarcinoma patients and granuloma patients were treated as positive samples and negative samples, respectively. We used weighted accuracy, i.e., the average of the accuracies on positive samples and negative samples, to evaluate the RIPPER prediction performance.
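The evaluation protocol can be sketched in a few lines. This is an illustration only, not the authors' code: the splitting scheme (plain shuffled folds, no stratification), the seed, and the function names `repeated_kfold` and `weighted_accuracy` are assumptions.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled k-fold runs."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [order[i::k] for i in range(k)]
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, test_idx

def weighted_accuracy(y_true, y_pred, positive=1, negative=0):
    """Average of the accuracy on positive samples and on negative samples."""
    pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    neg = [p for t, p in zip(y_true, y_pred) if t == negative]
    acc_pos = sum(p == positive for p in pos) / len(pos)
    acc_neg = sum(p == negative for p in neg) / len(neg)
    return (acc_pos + acc_neg) / 2

splits = list(repeated_kfold(19))  # 19 patients, as in this study
n_tested = sum(len(test_idx) for _, test_idx in splits)
print(len(splits), n_tested)
```

Because every sample is held out exactly once per run, three runs over 19 patients give 57 test predictions in total. Averaging the two per-class accuracies keeps the majority class from dominating the score.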

# RESULTS

#### The Discriminative MicroRNAs Between Lung Adenocarcinoma and Granuloma Patients in Whole Plasma, EV, and EV-Free Plasma

The miRNA expression profiles of lung adenocarcinoma and granuloma patients in whole plasma, EV, and EV-free plasma were analyzed separately. In whole plasma, the top 10 discriminative miRNAs were hsa-miR-223-3p, hsa-miR-501-5p, hsa-miR-130b-3p, hsa-miR-5010-5p, hsa-miR-330-5p, hsa-miR-378f, hsa-miR-3158-3p, hsa-miR-542-3p, hsa-miR-183-5p, and hsa-miR-942-5p. In EV, the top 10 discriminative miRNAs were hsa-miR-23b-3p, hsa-miR-548ac, hsa-miR-3126-3p, hsa-miR-15b-5p, hsa-miR-205-5p, hsa-miR-5010-5p, hsa-miR-331-5p, hsa-miR-1249-3p, hsa-miR-548c-5p, and hsa-miR-1827. In EV-free plasma, the top 10 discriminative miRNAs were hsa-miR-511-3p, hsa-miR-376a-3p, hsa-miR-3150b-3p, hsa-miR-3150b-5p, hsa-miR-3168, hsa-miR-98-5p, hsa-miR-3136-5p, hsa-miR-210-5p, hsa-miR-340-3p, and hsa-miR-636. **Figure 1** shows the heatmaps of the top 10 miRNAs in whole plasma, EV, and EV-free plasma. The miRNAs and patients were clustered using the Ward D2 method (Murtagh and Legendre, 2014) based on Euclidean distance. The R package pheatmap<sup>1</sup> was applied to plot the heatmaps. It can be seen from **Figure 1** that for whole plasma, two cancer patients were misclustered; for EV, one granuloma patient was misclustered; for EV-free plasma, the cluster pattern was not clear. The miRNAs in EV-free plasma were therefore not suitable as cancer biomarkers.

<sup>1</sup>https://CRAN.R-project.org/package=pheatmap

We plotted the Venn diagram of the top 10 discriminative miRNAs in whole plasma, EV, and EV-free plasma in **Figure 2A**. Only one miRNA, hsa-miR-5010-5p, overlapped between whole plasma and EV. The miRNA expression patterns thus differed among whole plasma, EV, and EV-free plasma, so it was necessary to investigate which blood compartment should be used for biomarker discovery.
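The top 10 overlaps can be checked directly with set operations. The sketch below uses the three miRNA lists as transcribed from the Results text above; the variable names are illustrative.

```python
from itertools import combinations

# Top 10 discriminative miRNAs per compartment, transcribed from the text.
top10 = {
    "whole plasma": {
        "hsa-miR-223-3p", "hsa-miR-501-5p", "hsa-miR-130b-3p",
        "hsa-miR-5010-5p", "hsa-miR-330-5p", "hsa-miR-378f",
        "hsa-miR-3158-3p", "hsa-miR-542-3p", "hsa-miR-183-5p",
        "hsa-miR-942-5p",
    },
    "EV": {
        "hsa-miR-23b-3p", "hsa-miR-548ac", "hsa-miR-3126-3p",
        "hsa-miR-15b-5p", "hsa-miR-205-5p", "hsa-miR-5010-5p",
        "hsa-miR-331-5p", "hsa-miR-1249-3p", "hsa-miR-548c-5p",
        "hsa-miR-1827",
    },
    "EV-free plasma": {
        "hsa-miR-511-3p", "hsa-miR-376a-3p", "hsa-miR-3150b-3p",
        "hsa-miR-3150b-5p", "hsa-miR-3168", "hsa-miR-98-5p",
        "hsa-miR-3136-5p", "hsa-miR-210-5p", "hsa-miR-340-3p",
        "hsa-miR-636",
    },
}

# Pairwise intersections correspond to the two-set regions of the Venn diagram.
overlaps = {(a, b): top10[a] & top10[b] for a, b in combinations(top10, 2)}
shared_by_all = set.intersection(*top10.values())

for pair, shared in overlaps.items():
    print(pair, sorted(shared))
print("all three:", sorted(shared_by_all))
```

The same pattern extends to the top 15 and top 20 lists, which are not reproduced in the text and are therefore not included here.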

To investigate whether the overlap pattern would change when more miRNAs were analyzed, we plotted Venn diagrams of the top 15 and top 20 miRNAs as **Figures 2B,C**, respectively. There was still no miRNA shared by all three of whole plasma, EV, and EV-free plasma. The overlap between whole plasma and EV became larger when more top miRNAs were included, but the overlap between EV and EV-free plasma remained at one whether the top 15 or top 20 miRNAs were analyzed. The EV miRNAs were more similar to the whole plasma miRNAs than to the EV-free plasma miRNAs.

## The Prediction Accuracies of MicroRNA Signatures for Lung Adenocarcinoma and Granuloma Patients in Whole Plasma, EV, and EV-Free Plasma

We evaluated the prediction accuracies of miRNA signatures for lung adenocarcinoma and granuloma patients in whole plasma, EV, and EV-free plasma with 10-fold cross-validation. To avoid bias from the random split of samples, we repeated the 10-fold cross-validation three times. Therefore, the sample size in each confusion matrix is the original sample size of 19 multiplied by 3, i.e., 57.

The confusion matrices of miRNA signatures in whole plasma, EV and EV-free plasma were given in **Table 1**. The weighted accuracies using whole plasma, EV and EV-free plasma miRNA data were 77.22, 65.19, and 64.82%, respectively. The EV miRNAs performed better than the EV-free plasma miRNAs. The accuracy of granuloma in EV-free plasma, 29.63%, was extremely low.

# The Classification Rules in Whole Plasma, EV, and EV-Free Plasma

With the RIPPER method, we learned the classification rules based on miRNA expression levels in whole plasma, EV, and EV-free plasma. These rules are given in **Table 2**. In whole plasma, granuloma patients did not express hsa-miR-223-3p while lung adenocarcinoma patients did. In EV, hsa-miR-23b-3p was highly expressed in granuloma patients but not in lung adenocarcinoma patients. In EV-free plasma, hsa-miR-376a-3p was expressed in granuloma patients but barely expressed in lung adenocarcinoma patients. We compared the mean expression levels of hsa-miR-23b-3p in whole plasma cancer, whole plasma granuloma, EV cancer, and EV granuloma samples. We found that in EV, hsa-miR-23b-3p was more highly expressed in granuloma than in cancer, with a fold change of 1.82, while in whole plasma it was less expressed in granuloma than in cancer, with a fold change of 0.84. We also compared the mean expression levels of hsa-miR-376a-3p in EV-free plasma and whole plasma. In EV-free plasma, the mean expression levels of hsa-miR-376a-3p in cancer and granuloma were 0 and 10.30, respectively, while in whole plasma they were 1.45 and 0, respectively. The expression patterns in EV and EV-free plasma thus differed from those in whole plasma. These results suggested that it was necessary to measure EV, EV-free plasma, and whole plasma separately.
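The comparison above can be sketched as a small helper. It assumes that fold change here means the granuloma mean divided by the cancer mean, which is consistent with the reported directions but is not stated explicitly in the text. The function names are hypothetical.

```python
def fold_change(mean_granuloma, mean_cancer):
    """Granuloma-to-cancer ratio of mean expression; None when undefined."""
    if mean_cancer == 0:
        return None  # a ratio against a zero mean is undefined
    return mean_granuloma / mean_cancer

def direction(fc):
    """Describe a granuloma/cancer fold change in words."""
    if fc is None:
        return "undefined (cancer mean is zero)"
    if fc > 1:
        return "higher in granuloma"
    if fc < 1:
        return "lower in granuloma"
    return "unchanged"

# Fold changes reported in the text for hsa-miR-23b-3p.
reported = {"EV": 1.82, "whole plasma": 0.84}
for compartment, fc in reported.items():
    print(compartment, direction(fc))

# Mean expression levels reported in the text for hsa-miR-376a-3p.
print("EV-free plasma:", direction(fold_change(10.30, 0)))
print("whole plasma:", direction(fold_change(0, 1.45)))
```

For hsa-miR-376a-3p, one group mean is zero in each compartment, so a ratio is either undefined or zero. This may be why the text reports the means themselves for this miRNA rather than a fold change.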

hsa-miR-223-3p was reported to have increased expression in H. pylori-infected gastric cancer patients, which was related to progressive proliferation and migration of cancer cells (Ma et al., 2014; Wang et al., 2015). Thus, in plasma, the expression of hsa-miR-223-3p in granuloma patients would not be expected to be as high as in cancer patients.

Zhou et al. (2015) found that cancer patients with higher expression of hsa-miR-23b had better outcomes than those with lower expression. In our study, we found that hsa-miR-23b-3p had higher expression in granuloma patients than in lung adenocarcinoma patients.

Joerger et al. (2014) reported that hsa-miR-376a was insensitive to perturbations in advanced non-small cell lung cancer patients. We found that hsa-miR-376a-3p had higher expression in granuloma patients, while its expression was very low in lung adenocarcinoma patients.

#### DISCUSSION

We identified the discriminative miRNAs in different blood compartments, such as hsa-miR-501-5p and hsa-miR-130b-3p in plasma, hsa-miR-548ac in EV, and hsa-miR-511-3p in EV-free plasma.

hsa-miR-501 has been shown to be associated with clear cell renal cell carcinoma (Liu et al., 2018), pancreatic ductal adenocarcinoma (Liao et al., 2018), and cervical cancer (Guo et al., 2018), among others. Moreover, these studies all found that upregulation of hsa-miR-501 enhances tumor cell proliferation, migration, and invasion.

hsa-miR-130b-3p is a novel miRNA in lung cancer. We found that hsa-miR-130b-3p was upregulated in the plasma of lung cancer patients; it could be applied as a new biomarker to distinguish cancer from granuloma and to further guide clinical therapeutic decisions.

As for hsa-miR-548, Liu et al. (2015) investigated hsa-miR-548 expression in fresh tumor tissues from 22 patients with primary non-small cell lung cancer via RT-PCR and they found that the hsa-miR-548 expression level was significantly higher (p < 0.01) in adjacent non-tumor tissues than that in the tumor. That is, non-small cell lung cancer would down-regulate the expression of hsa-miR-548. Furthermore, they also observed that hsa-miR-548 was involved in the migration and invasion of non-small cell lung cancer cells by targeting the AKT1 signaling pathway.

For hsa-miR-511-3p, it has been reported to be related to lung adenocarcinoma by triggering BAX (Zhang et al., 2014) and TRIB2 (Zhang et al., 2012).

As for diagnostic value, plasma was the most valuable, followed by EV and then EV-free plasma. Previous studies have demonstrated that exosomes can be used as a novel type of biomarker for tumors and some benign diseases (Principe et al., 2013; Vella et al., 2016). Given that the diagnostic value of testing plasma was better than that of testing exosomes in plasma, much useful information may be missed when only exosomes in plasma are tested. Possible reasons are as follows: (1) methods such as OptiPrep<sup>TM</sup> density-based separation (DG-Exos), ultracentrifugation (UC-Exos), and immunoaffinity capture using anti-EpCAM-coated magnetic beads (IAC-Exos) are not effective enough at isolating exosomes and may destroy exosomes during isolation (Greening et al., 2015); (2) exosomes are not stable and are easily degraded, which could introduce bias (Kumar et al., 2018).

Since the sample size of this study was limited, the results should be validated in a large independent cohort. Another factor that may have affected the results is the disease type. The results reported here apply to lung adenocarcinoma; for other diseases that release large amounts of RNAs and proteins directly into the circulatory system, the importance of exosomes may be lower.

#### CONCLUSION

Extracellular Vesicle is a promising tool for non-invasive diagnosis. miRNAs processed by exosomes can be detected by liquid biopsy and used as biomarkers. To evaluate the discriminative ability of miRNAs from whole plasma, EV, and EV-free plasma, we analyzed the miRNA expression profiles in whole plasma, EV, and EV-free plasma of lung adenocarcinoma and granuloma patients. We found that the top discriminative miRNAs in whole plasma, EV, and EV-free plasma were quite different, and the classification rules also varied. The prediction performance of whole plasma was the best, and EV outperformed EV-free plasma. Our results suggested that EV can be used as a lung cancer biomarker, but EV may be less stable and more difficult to detect than whole plasma; therefore, whole plasma remains a good choice as a source of lung cancer signatures.

#### AUTHOR CONTRIBUTIONS

XC and YJ did the conception and design, performed the sample collection, analyzed and interpreted the data, and wrote, reviewed, and/or revised the manuscript. XC oversaw the developmental methodology. All authors read and approved the final manuscript.

#### REFERENCES


#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00367/full#supplementary-material

TABLE S1 | The processed microRNA expression profiles.


**Conflict of Interest Statement:** XC was employed by company Rongze Biotechnology Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Chen, Jin and Feng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# High-Order Correlation Integration for Single-Cell or Bulk RNA-seq Data Analysis

Hui Tang<sup>1</sup>, Tao Zeng<sup>1</sup> \* and Luonan Chen<sup>1,2,3,4</sup> \*

*<sup>1</sup> Key Laboratory of Systems Biology, CAS Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China, <sup>2</sup> CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China, <sup>3</sup> School of Life Science and Technology, ShanghaiTech University, Shanghai, China, <sup>4</sup> Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai, China*

#### Edited by:

*Shihua Zhang, Academy of Mathematics and Systems Science (CAS), China*

#### Reviewed by:

*Guoxian Yu, Southwest University, China Lihua Zhang, University of California, Irvine, United States*

#### \*Correspondence:

*Tao Zeng zengtao@sibs.ac.cn Luonan Chen lnchen@sibs.ac.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *21 December 2018* Accepted: *09 April 2019* Published: *26 April 2019*

#### Citation:

*Tang H, Zeng T and Chen L (2019) High-Order Correlation Integration for Single-Cell or Bulk RNA-seq Data Analysis. Front. Genet. 10:371. doi: 10.3389/fgene.2019.00371*

Quantifying or labeling the sample type with high quality is a challenging task and a key step for understanding complex diseases. Reducing noise in the data and ensuring that the extracted intrinsic patterns agree with the primary data structure are important in sample clustering and classification. Here we propose an effective data integration framework named HCI (High-order Correlation Integration), which takes advantage of a high-order correlation matrix combined with pattern fusion analysis (PFA) to extract features from high-dimensional data. On the one hand, the high-order Pearson's correlation coefficient can highlight the latent patterns underlying noisy input datasets and thus improve the accuracy and robustness of the algorithms currently available for sample clustering. On the other hand, PFA can efficiently identify intrinsic sample patterns from different input matrices by optimally adjusting the signal effects. To validate the effectiveness of our new method, we first applied HCI to four single-cell RNA-seq datasets to distinguish the cell types, and we found that HCI is capable of identifying the previously known cell types of single-cell samples from scRNA-seq data with higher accuracy and robustness than other methods under different conditions. Second, we integrated heterogeneous omics data from TCGA and GEO datasets, including bulk RNA-seq data, where HCI outperformed the other methods at identifying distinct cancer subtypes. In an additional case study, we also constructed the mRNA-miRNA regulatory network of colorectal cancer based on the feature weights estimated by HCI, where the differentially expressed mRNAs and miRNAs were significantly enriched in well-known functional sets of colorectal cancer, such as KEGG pathways and IPA disease annotations. All these results support that HCI has extensive flexibility and applicability for sample clustering with different types and organizations of RNA-seq data.

Keywords: high–order, integration, clustering, single-cell, bulk data analysis

# INTRODUCTION

Cells, the fundamental unit in biology, can be distinguished by their size and shape using a microscope. More recently, technological advances have made it possible to isolate large numbers of cells, and along with improvements in RNA isolation and amplification methods, next-generation sequencing technologies are now used to profile the transcriptome of individual cells. Single-cell RNA sequencing (scRNA-seq) allows for omics analysis of individual cells, which can expose exciting biological processes, novel medical insights, and efficient clinical applications (Dunham et al., 2012; Kolodziejczyk et al., 2015; Wagner et al., 2016). The advances in single-cell technologies have led to more comprehensive studies of multicellular organisms than previous approaches. Recently, 10X Genomics released a single-cell dataset of more than 1.3 million cells (2017)<sup>1</sup>. With the production of large amounts of single-cell data, and because understanding the development of an organ requires characterizing all of its cell types, it is important to identify single-cell cell types with high quality. Conventionally, one key application of scRNA-seq is to cluster cell types based on the cells' transcriptome profiles through unsupervised computational methods (Lloyd, 1982; Jaitin et al., 2014; Mahata et al., 2014; Grün et al., 2015; Kiselev et al., 2017; Jiang et al., 2018; Shi et al., 2018; Dai et al., 2019). These approaches have performed well in recently published studies at determining different cell types (Xue et al., 2013; Patel et al., 2014; Pollen et al., 2014; Shalek et al., 2014). SAFE-clustering (Yang Y. et al., 2018) can take as input the results from multiple clustering methods, and scmap (Kiselev et al., 2018) can compare clusters across data sets without merging them. RaceID (Grün et al., 2015) augments k-means to identify rare cell types by detecting outliers, but k-means is not guaranteed to find the global optimum. Meanwhile, SC3 (Kiselev et al., 2017) applies k-means repeatedly, using a small subset of principal components or different initial conditions, and then finds the consensus clusters. SC3 is a user-friendly clustering method that works well for smaller datasets. However, it is computationally slow because of the cost of calculating the correlation matrix of cells. In addition, CIDR (Lin et al., 2017) adapts hierarchical clustering (HCA) to single-cell datasets by adding an implicit imputation of zeros into the distance calculation. However, an important shortcoming of hierarchical clustering is that it is prohibitively expensive for large datasets. Therefore, a more efficient and accurate method for clustering cell types is still urgently needed.

At the same time, large amounts of bulk data have become widely available with the rapid development of high-throughput technologies. To take full advantage of these rich datasets, integrating multiple datasets offers more opportunities to address biological dynamics and cancer heterogeneity (Hamid et al., 2009; Wang et al., 2014). Several integration methods have been developed in recent years, such as iClusterPlus, SNF, NMF, and PFA (Zhang et al., 2011; Mo et al., 2013; Mahata et al., 2014; Wang et al., 2014; Shi et al., 2017). However, these approaches still have several limitations. For example, iClusterPlus is based on a Gaussian assumption, which may not hold when the signal distributions of the data are too heterogeneous. The recently developed pattern fusion analysis (PFA) can integrate multidimensional data (Shi et al., 2017) and thus provides a comprehensive, multi-view way to understand biological processes and complex diseases. In theory, PFA aligns the local sample-patterns derived from each data type into a global sample-pattern that characterizes the sample types in a low-dimensional feature space; PFA is therefore expected to model sample types (i.e., cell types) from scRNA-seq data. However, the original PFA was designed for multi-source data rather than single-source data, and it analyzes the sample features insufficiently. Thus, the original PFA needs to be extended, within a unified integration framework, to sample clustering even with single-source data.

To overcome the above challenges, we propose a unified computational framework for distinguishing cell types from single-cell RNA-seq data, which also retains the ability to cluster sample types from bulk RNA-seq data. The new method, named HCI (High-order Correlation Integration), integrates joint high-order correlation matrices, in which the iterative use of Pearson's correlation coefficient on the sample data is incorporated into our previously developed pattern fusion analysis (PFA) method (Shi et al., 2017). Technically, HCI integrates single-cell datasets and the different distance matrices corresponding to different sample correlation feature spaces (i.e., the distances between cells) by joint matrix factorization.

On the one hand, HCI was compared with existing methods [i.e., SC3 (Kiselev et al., 2017) and SEURAT (Macosko et al., 2015)] for identifying cell types on various single-cell RNA-seq datasets. The robustness of HCI was also tested under different settings (e.g., one-order correlation, second-order correlation, and different percentages of differentially expressed genes). Furthermore, a case study applying HCI to a diabetes scRNA-seq dataset successfully clustered the ambiguous cells left unassigned in a previous study. On the other hand, HCI was also applied, like the original PFA, to bulk RNA-seq and other omics data (Schuster, 2008). Comparing HCI with the original PFA on three datasets with multiple data types (e.g., gene expression and miRNA expression) showed that HCI improves the computational efficiency of sample clustering and can recognize gene regulatory networks accurately and reliably (Joung et al., 2007; Tran et al., 2008; Hamid et al., 2009; Peng et al., 2009).

In summary, HCI can not only cluster cell types from scRNA-seq data efficiently, but also capture biologically meaningful sample types and extract network modules from bulk RNA-seq or other omics data. It provides a new and general way to detect sample-specific characteristics from high-order correlation information in an integrative manner.

#### MATERIALS AND METHODS

The HCI pipeline is shown schematically in **Figure 1**. The input is the expression matrix **M**, in which columns correspond to cells or samples and rows correspond to genes or molecules; e.g., each element of **M** corresponds to the expression of a gene in a given cell. The analysis procedure of HCI can be summarized in the following steps.

<sup>1</sup> 10X Genomics single cell gene expression datasets from https://support.10xgenomics.com/single-cell-gene-expression/datasets

#### Pre-processing

Gene filtering removes genes with zero expression in all cells (or samples), since they are not informative for cell clustering. Each column is then normalized to maintain the feature stability of each cell or sample. This yields the filtered expression matrix **X**.
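The pre-processing step can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' MATLAB implementation; the function name `preprocess` is hypothetical, and scaling each column to unit Euclidean norm is an assumption, since the text does not state which column normalization is used.

```python
import numpy as np

def preprocess(M):
    """Filter all-zero genes and normalize each column (cell or sample).

    M has shape (genes, cells). Returns the filtered, normalized matrix X
    and the boolean mask of retained genes.
    """
    M = np.asarray(M, dtype=float)
    # Keep genes (rows) that are expressed in at least one cell.
    keep = np.any(M != 0, axis=1)
    X = M[keep]
    # Assumption: unit Euclidean norm per column; the paper does not
    # specify the normalization.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # leave all-zero columns unchanged
    return X / norms, keep

# Toy matrix: 3 genes x 3 cells; the first gene is zero everywhere.
M = np.array([[0, 0, 0],
              [3, 0, 4],
              [4, 5, 3]])
X, keep = preprocess(M)
```

Any other per-column normalization (for example, library-size scaling) could be substituted on the marked line without changing the rest of the sketch.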

## High-Order Correlation Matrix Construction

We first calculate **F**<sup>(1)</sup>, the correlation matrix of the gene expression profiles **X**<sub>m×n</sub>, in which the expression of m genes is measured for n samples and x<sub>kj</sub> denotes the expression level of gene k in sample j. The correlation of samples i and j is calculated by the Pearson correlation coefficient (Rodgers and Nicewander, 1988):

$$f\_{ij}^{(1)} = \frac{\sum\_{k=1}^{m} (x\_{ki} - \bar{x}\_{i})(x\_{kj} - \bar{x}\_{j})}{\sqrt{\sum\_{k=1}^{m} (x\_{ki} - \bar{x}\_{i})^2} \sqrt{\sum\_{k=1}^{m} (x\_{kj} - \bar{x}\_{j})^2}} \tag{1}$$

where x<sub>ki</sub> and x̄<sub>i</sub> are the expression level of gene k in sample i and the average gene expression level of sample i, respectively. Similarly, x<sub>kj</sub> and x̄<sub>j</sub> are the expression level of gene k in sample j and the average gene expression level of sample j, respectively. We thus obtain the n × n correlation matrix **F**<sup>(1)</sup> of **X**, whose element f<sub>ij</sub><sup>(1)</sup> is the correlation coefficient between sample i and sample j. Based on **F**<sup>(1)</sup>, we can further calculate the n × n matrix **F**<sup>(2)</sup> as follows:

$$f\_{ij}^{(2)} = \frac{\sum\_{k=1}^{n} \left( f\_{ki}^{(1)} - \bar{f}\_{i}^{(1)} \right) \left( f\_{kj}^{(1)} - \bar{f}\_{j}^{(1)} \right)}{\sqrt{\sum\_{k=1}^{n} \left( f\_{ki}^{(1)} - \bar{f}\_{i}^{(1)} \right)^2} \sqrt{\sum\_{k=1}^{n} \left( f\_{kj}^{(1)} - \bar{f}\_{j}^{(1)} \right)^2}} \tag{2}$$

where f̄<sub>i</sub><sup>(1)</sup> is the average of column i of **F**<sup>(1)</sup>. **F**<sup>(1)</sup> is called the first-order correlation matrix of **X**, and **F**<sup>(2)</sup> the second-order correlation matrix of **X**; higher-order correlation matrices can clearly be constructed in the same way. The advantage of this transformation of the expression matrix **X** is that it can highlight latent structures between noisy samples (Hubert, 1985; Ren et al., 2013). We also investigated distance matrices built with other measures, such as the Spearman correlation; however, in that case **F**<sup>(2)</sup> is similar to **F**<sup>(1)</sup>, because the Spearman correlation considers the ranks rather than the values of the matrix elements. Therefore, in this paper, we only use the Pearson correlation to construct the high-order correlation matrices. Notably, such high-order matrices can enhance sample clustering performance. In our preliminary analysis, the clustering accuracy increased quickly with the first-order correlation features, nearly reached its maximum with the second-order correlation features, and tended to saturate as the order increased further. We therefore incorporated only the first-order and second-order matrices into HCI in this work.
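As a minimal sketch of Equations 1 and 2, assuming NumPy and a toy random matrix in place of real expression data, the first- and second-order matrices can be obtained by applying the Pearson correlation repeatedly to the columns. The helper name `high_order_correlations` is hypothetical and is not part of the published HCI code.

```python
import numpy as np

def high_order_correlations(X, order=2):
    """Return [F1, ..., F_order], each an n x n Pearson correlation matrix.

    X has shape (genes, samples); correlations are taken between columns.
    """
    matrices = []
    current = np.asarray(X, dtype=float)
    for _ in range(order):
        # rowvar=False treats each column (sample) as a variable, so the
        # sum in the Pearson formula runs over the rows of `current`.
        current = np.corrcoef(current, rowvar=False)
        matrices.append(current)
    return matrices

rng = np.random.default_rng(0)
X = rng.random((50, 6))            # toy data: 50 genes, 6 samples
F1, F2 = high_order_correlations(X)
```

Each pass replaces the current matrix by the correlation matrix of its columns, so raising `order` yields the higher-order matrices mentioned in the text. A column with zero variance would produce undefined (NaN) correlations and would need to be handled before this step.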

## Correlation Matrix Induced Pattern Fusion Analysis (PFA)

The input data **X** has m rows and n columns, and the matrices **F**<sup>(1)</sup> and **F**<sup>(2)</sup> each have n rows and n columns. We integrated these three input datasets by pattern fusion analysis. This methodology was developed and evaluated in previous work (Shi et al., 2017), and the key steps used in our work are as follows:

The first step is to obtain the optimal local information sets **U**<sup>i</sup> and **Y**<sup>i</sup> by minimizing the error **E**<sup>i</sup>:

$$\min \parallel \ E^i \parallel = \min\_{c^i, U^i, Y^i} \parallel W^i - \left(c^i 1^T + U^i Y^i\right) \parallel\_F^2 \tag{3}$$

where **W**<sup>i</sup> is one of the input datasets **X**, **F**<sup>(1)</sup>, and **F**<sup>(2)</sup>, and ‖·‖<sub>F</sub> is the Frobenius norm. Then, we have

$$\begin{cases} \boldsymbol{U}^{i} = \boldsymbol{Q}\_{d^{i}}^{i} \\ \boldsymbol{Y}^{i} = \left(\boldsymbol{U}^{i}\right)^{T} \left(\boldsymbol{W}^{i} - \boldsymbol{c}^{i}\boldsymbol{1}^{T}\right) \\ \boldsymbol{c}^{i} = \dfrac{\boldsymbol{W}^{i}\boldsymbol{1}}{n} \end{cases} \tag{4}$$

where **Q**<sup>i</sup><sub>d<sup>i</sup></sub> is an orthogonal matrix formed by the eigenvectors corresponding to the d<sup>i</sup> largest eigenvalues of (**W**<sup>i</sup> − **c**<sup>i</sup>**1**<sup>T</sup>)(**W**<sup>i</sup> − **c**<sup>i</sup>**1**<sup>T</sup>)<sup>T</sup>. Importantly, the default value of d<sup>i</sup> for the matrix **X** is chosen according to

$$\frac{\sum\_{r=1}^{d^{i}} \delta\_{r}}{\sum\_{r=1}^{p} \delta\_{r}} \geq 0.8$$

where δ<sub>r</sub> is the r-th largest eigenvalue of (**W**<sup>i</sup> − **c**<sup>i</sup>**1**<sup>T</sup>)(**W**<sup>i</sup> − **c**<sup>i</sup>**1**<sup>T</sup>)<sup>T</sup> and p is the number of its non-zero eigenvalues. For the matrices **F**<sup>(1)</sup> and **F**<sup>(2)</sup>, d<sup>i</sup> is chosen by the same criterion with a threshold of 0.9, because their feature dimensions differ from that of **X**.
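The local step in Equations 3 and 4 can be illustrated with the following sketch, which assumes NumPy and toy random input. The function name `local_pattern` is hypothetical. Instead of forming the eigendecomposition explicitly, it uses a singular value decomposition of the centred matrix, which gives the same eigenvectors and eigenvalues; this is a substitution made here for brevity, and the subsequent adaptive alignment step of PFA is not shown.

```python
import numpy as np

def local_pattern(W, ratio=0.8):
    """Closed-form solution of Eq. 4 for one input matrix W (features x n).

    Returns the centre c, the basis U (features x d), and the local
    sample pattern Y (d x n).
    """
    W = np.asarray(W, dtype=float)
    c = W.mean(axis=1, keepdims=True)        # c = W 1 / n
    Wc = W - c                               # W - c 1^T
    # Left singular vectors of Wc are the eigenvectors of Wc Wc^T, and the
    # squared singular values are its eigenvalues in descending order.
    Q, s, _ = np.linalg.svd(Wc, full_matrices=False)
    eigvals = s ** 2
    eigvals = eigvals[eigvals > 1e-12 * eigvals.max()]   # the p non-zero ones
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Smallest d whose leading eigenvalues reach the requested ratio.
    d = min(int(np.searchsorted(cumulative, ratio)) + 1, len(eigvals))
    U = Q[:, :d]
    Y = U.T @ Wc
    return c, U, Y

rng = np.random.default_rng(1)
W = rng.random((8, 30))                      # toy input: 8 features, 30 samples
c, U, Y = local_pattern(W, ratio=0.8)
```

Following the text, `ratio=0.8` would be used for **X** and `ratio=0.9` for the two correlation matrices.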

Then, adaptive optimal alignment is used to capture the global sample-pattern matrix **Y**. The detailed adaptation method is described in the original study (Shi et al., 2017), and the related parameters can easily be adjusted by the user.

#### Sample Clustering and Cluster Number Estimation

The global sample-spectrum **Y** obtained in the above step, instead of the conventional data matrix **X**, can be clustered by many clustering methods, such as K-means or HCA. In this paper, K-means clustering (Ding and He, 2004) is performed on the global sample-spectrum matrix **Y** using the MATLAB function "kmeans()".

The ratio of distance between clusters (RDC) is calculated to estimate the number K of clusters. K-means clustering is run 100 times for each candidate K, and K is inferred from the average RDC curve as the smallest K at which the slope of the average RDC is nearly 0. The RDC is calculated as:

$$RDC = \frac{D\_{in}}{D\_{out}}\tag{5}$$

where D<sub>in</sub> is the average distance between samples within clusters, and D<sub>out</sub> is the average distance between samples in different clusters.
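One way to compute the RDC of Equation 5 for a given labelling is sketched below with NumPy. The exact averaging scheme is not spelled out in the text, so this sketch assumes Euclidean distances averaged over all sample pairs; the function name `rdc` is hypothetical, and the labels would come from any K-means implementation. The rule for choosing K from the slope of the averaged curve is not implemented here.

```python
import numpy as np

def rdc(Y, labels):
    """Ratio of mean within-cluster to mean between-cluster distance.

    Y has shape (d, n) with samples in columns; labels has length n.
    """
    points = np.asarray(Y, dtype=float).T
    labels = np.asarray(labels)
    # All pairwise Euclidean distances between samples.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)  # count each pair once
    d_in = dist[same & upper].mean()      # pairs inside the same cluster
    d_out = dist[~same & upper].mean()    # pairs in different clusters
    return d_in / d_out

# Two well-separated groups on a line.
Y = np.array([[0.0, 1.0, 10.0, 11.0],
              [0.0, 0.0, 0.0, 0.0]])
good = rdc(Y, [0, 0, 1, 1])
bad = rdc(Y, [0, 1, 0, 1])
```

A labelling that matches the true groups gives a small ratio, while a labelling that mixes them gives a larger one.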

Since the reference labels of cells or samples are known for all published datasets, the Adjusted Rand Index (ARI) (Hubert, 1985) is used to measure the similarity between the HCI clustering results and the known clusters; it is further used to evaluate HCI against other methods and settings [e.g., SC3 (Kiselev et al., 2017), PFA, and the one-order, second-order, and CV situations].
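For reference, the ARI can be computed from the contingency table of two labellings. The sketch below uses the standard pair-counting formula with NumPy; it is an independent illustration with a hypothetical function name, not the code used in the study.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labellings of the same samples."""
    _, true_idx = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, pred_idx = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    # Contingency table: number of samples for each (true, predicted) pair.
    table = np.zeros((true_idx.max() + 1, pred_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (true_idx, pred_idx), 1)

    def pairs(x):
        # Number of unordered pairs that can be drawn from x items.
        return x * (x - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(table.sum())
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:   # degenerate case, e.g., a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
split_cluster = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

The index is invariant to relabelling of clusters, so the first call compares two identical partitions, while the second penalizes the split of one reference cluster.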

# Molecular Network Construction for Case Study on Bulk RNA-seq Data

The multi-level network is constructed integratively using HCI, as shown schematically in **Figure 4A**. As before, we calculated the n × n high-order matrices **F**<sub>I</sub><sup>(1)</sup> and **F**<sub>I</sub><sup>(2)</sup> of each input dataset **X**<sup>I</sup> (e.g., RNA-seq, methylation, microRNA), where n is the number of samples. We then integrated all input datasets **X**<sup>I</sup> and their high-order correlation matrices **F**<sub>I</sub><sup>(1)</sup> and **F**<sub>I</sub><sup>(2)</sup> by pattern fusion analysis. Based on the global sample-spectrum matrix **Y**, we can obtain the differentially expressed mRNAs (or miRNAs) from heterogeneous genomic datasets according to the coefficient matrix **U**<sup>I∗</sup>. In this work, we calculated a coefficient of variation for each row of **U**<sup>I∗</sup>:

$$c\_i = \frac{\delta\_i}{\mu\_i} \tag{6}$$

where µ<sub>i</sub> is the average weight of mRNA i (or miRNA i) in **U**<sup>I∗</sup>, and δ<sub>i</sub> is the corresponding standard deviation. An mRNA (or miRNA) i is defined as differentially expressed if c<sub>i</sub> is greater than a given threshold T; such molecules are called DEGs (or DE-miRNAs).
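Equation 6 and the thresholding rule amount to a row-wise computation, sketched here with NumPy on a toy coefficient matrix. The function name `select_by_cv`, the toy values, and the threshold of 0.3 are illustrative assumptions; the text does not state the threshold T used in the study. Setting the coefficient to 0 for rows with zero mean is also an assumption made to avoid division by zero.

```python
import numpy as np

def select_by_cv(U, threshold):
    """Coefficient of variation of each row of U, and the indices of the
    rows whose coefficient exceeds the threshold."""
    U = np.asarray(U, dtype=float)
    mu = U.mean(axis=1)       # average weight per row
    delta = U.std(axis=1)     # standard deviation per row
    # Assumption: rows with zero mean get a coefficient of 0.
    cv = np.divide(delta, mu, out=np.zeros_like(mu), where=mu != 0)
    return cv, np.flatnonzero(cv > threshold)

U = np.array([[1.0, 1.0, 1.0],    # constant weights
              [1.0, 2.0, 3.0],    # variable weights
              [0.0, 0.0, 0.0]])   # zero row
cv, selected = select_by_cv(U, threshold=0.3)
```

Because the weights in a coefficient matrix may be negative, a row mean can be negative or close to zero; whether to use the absolute mean in that case is a design choice that the text leaves open.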

Besides, we also performed functional enrichment analysis for genes by Gene Ontology and KEGG. We also analyzed DEGs using Ingenuity Pathway Analysis (IPA), providing the association between a particular gene set and known functions, pathways, networks and associated diseases. An online database miRDB was used for miRNA target prediction and functional annotations.

We defined key genes as those significantly enriched in cancer according to the KEGG, GO, and IPA analyses. We identified the key genes among the DEGs that can be linked and correlated through the combined functional couplings of protein-protein interactions in STRING. MicroRNAs that can regulate key DEGs were defined as key miRNAs (degree > 80) (Hu et al., 2018). Cytoscape was used to reconstruct and visualize the gene-gene and miRNA-gene networks.

# RESULTS

#### Performance Comparison and Robustness Evaluation

To demonstrate the performance of HCI on single-cell datasets, we first downloaded four publicly available scRNA-seq datasets (**Figure 2A**) (Yan et al., 2013; Deng et al., 2014; Wang et al., 2016; Xin et al., 2016). These datasets were selected because their cell labels can be regarded with high confidence as representing cells from different stages, conditions, and lines. To quantify the similarity between the reference cell types and the clusters obtained by HCI or other comparable methods, we calculated the average ARI of the clustering results (**Figure 2D**, **Figure S1**) and estimated the cluster number K according to the RDC by running K-means 100 times (**Figure 2C**). Incorporating high-order correlation matrices into PFA improves both the accuracy and the stability of the solutions: the accuracy was significantly improved compared with using only the one-order correlation matrix (**F**<sub>I</sub><sup>(1)</sup>) or only the second-order matrix (**F**<sub>I</sub><sup>(2)</sup>), according to the ARI and the RDC (**Figures 2B,D**). In addition, to assess robustness, i.e., consistent performance under different conditions, the same analysis on the four datasets was repeated 50 times under different systematic conditions (e.g., using 60% or 80% of CV genes), where CV genes are those with the largest expression variances. HCI again performed better (i.e., more robustly) than the other methods under different correlation matrices or conditions according to the ARI and the RDC (**Figures 2B,D**, **Figure S1**). Overall, HCI consistently outperformed the compared methods in distinguishing single-cell types.

in previous studies and used as reference of comparison among HCI, One-order situation, and Second-order situation. (C) RDC was applied 100 times in global sample-pattern matrix *Y* to each dataset. The solid lines correspond to the value of each RDC calculation. The dashed black lines correspond to the average of these solid lines. Y-coordinate in each graph represents the RDC value and the x-coordinate represents the number of cluster *K*. The star indicates *K* which we choose (see methods). (D) The mean and standard deviation of ARI in four datasets by running k-means 50 times separately in different situations.

## Comparison of Sample-Cluster Identification With One-Level Data

We applied HCI and SC3 to the above four datasets to evaluate and compare cell clustering. We calculated the cluster number K and the running time on each dataset using the R package of SC3 (Kiselev et al., 2017). On the one hand, as shown in **Table 1**, HCI estimates the number K of clusters better than SC3 on almost all datasets (with similar performance on the Deng dataset). On the other hand, the running time of SC3 for 2,000 cells is more than 1 h, whereas that of HCI for 2,000 cells is <10 min, as shown in **Table 2**. Notably, HCI, run in MATLAB, can be applied efficiently even to large datasets with more than 10,000 cells, such as the 10k datasets from 10x Genomics (**Table 2**, **Figure S2**). From these results, we concluded that HCI performs better than SC3 because it considers high-order correlation information and integrates this potentially heterogeneous information well through the PFA framework.

TABLE 1 | The estimation of K compared with SC3 on real datasets.

TABLE 2 | The running time compared with SC3 on real datasets.

# Case Study on the scRNA-seq Data of Diabetes

We then applied HCI to a diabetes scRNA-seq dataset (Wang et al., 2016) with 430 annotated cells belonging to six cell types and 205 ambiguous cells that were previously unassigned. For the 430 annotated cells, the RDC of HCI suggested that K is 5 or 6 (**Figures 2B,D**), which is a reasonable number of cell clusters. When we applied HCI to all cells, including the 430 annotated cells and the 205 dropped cells, the results suggested that K is 7. This indicates that potential new cell types are included, and we found that there were 27 annotated mesenchymal cells among the ambiguous cells. This result also showed that the other ambiguous cells can be clustered well into the seven cell types (**Figure 3A**). In addition, other methods (e.g., tSNE, HCA) were used to visualize the clusters of these dropped cells (**Figure 3**). As a control, the well-known scRNA-seq analysis method SEURAT (Macosko et al., 2015) was also applied. As the results show (**Figure S3**), HCI performed better than these traditional methods in distinguishing cell types. Cluster dendrograms of the global sample-pattern matrix **Y** and of **F**<sup>(1)</sup> and **F**<sup>(2)</sup> are shown in **Figure S4** to illustrate the influence of HCI on information integration.

In addition, marker genes are particularly useful because they can usually uniquely indicate a cell cluster, e.g., α-cells show high expression of IRX2 and ARX. To further interpret the biological meaning of HCI-based cell clustering, we used the 50 key marker genes of the annotated cell types to categorize the previously dropped cells that are now clustered well by HCI. The violin plots show that the expression levels of IRX2 and ARX are significantly high both in the previously identified alpha cells and in the alpha-dropped cells newly clustered by HCI (**Figure S5**). Furthermore, a high degree of expression similarity in these key markers was observed between the annotated cells and their corresponding clustered dropped cells (**Figures S6, S7**). Taken together, these results show that HCI is able to identify new cell types with high accuracy and biological significance.

## Comparison of Sample-Cluster Identification With Multi-Level Data

To demonstrate the effectiveness that HCI inherits from PFA for integrating multi-level datasets, we applied HCI to three cancer omics datasets: two from the TCGA Data Portal, namely kidney renal clear cell carcinoma (KIRC) and adrenocortical carcinoma (ACC), and one from GEO (colorectal cancer) (Sayagués et al., 2016). For the two TCGA datasets, the gene expression, miRNA expression, and DNA methylation profiles were prepared in a similar way to those in Shi et al. (2017). For the colon cancer dataset, gene expression and miRNA expression were obtained, and we removed mRNAs or miRNAs with zero expression values in more than 80% of samples. The resulting datasets comprise 122 patients for KIRC, 79 for ACC, and 51 for colon cancer (**Figure 4B**).

After running HCI and PFA on these datasets, we compared their results according to the RDC, which shows that HCI indeed performs better in terms of cluster quality across datasets (**Figures 4C–E**). In this comparison, the heterogeneity factors, including different complex conditions, varying data resources, and dissimilar sample sizes, provide strong evidence for the ability of HCI to identify clinically relevant disease subtypes and predict network modules involved in complex diseases (Zhang et al., 2011; Zang et al., 2016).

#### Case Study on the Matched mRNA and miRNA Data of Colorectal Cancer

Finally, we carried out a case study on the colorectal cancer data, in particular constructing the integrated mRNA-miRNA network according to the global sample-spectrum matrix **Y**. First, the HCI results suggested that the normal (9 samples) and disease (42 samples) samples can be clustered into two distinct groups (**Figure 5A**). Then, 6,930 differentially expressed genes and 2,976 differentially expressed miRNAs were obtained. Functional enrichment analysis of these differentially expressed genes was performed with GO BP terms, KEGG pathways, and IPA annotations; all significant physiological system development and function terms, diseases, and networks are listed in **Tables S1, S2**. We found that 2,289 genes (nearly 33% of DEGs) are significantly correlated with colon cancer. In addition, according to the miRNA target prediction from miRDB, 1,661 DEGs can be regulated by 141 DE-miRNAs (**Figure 5B**). Note that all enrichment analysis results involve 25 key genes, 14 of which can be regulated by 22 key miRNAs (**Figure 5B**). The survival risks of these genes were also evaluated, as shown in **Figure 5C**.

As an illustrative instance, we constructed the gene-gene network of the 25 key genes (**Figure 5D**) based on STRING (p = 1.0e-16) (2018)<sup>2</sup>. The enrichment analysis results of this network are listed in **Figure 5E** (**Table S3**); the network is significantly enriched in cytosol (P = 3.86e-05), beta-catenin destruction complex (P = 1.57e-04), colorectal cancer (P = 2.73e-46), and pathways in cancer (P = 6.82e-41). We also found that the hub genes (e.g., MAPK8, EGF, FALGDS, CCND1, MYC) in this network have been widely linked to cancer in the literature. For example, the MAPK-signaling pathways have been identified as among the gene markers most strongly associated with colorectal cancer (CRC) (Cummins et al., 2006; Barault et al., 2008; Lascorz et al., 2010; Slattery et al., 2012). MAPK8 has been shown to interact with MYC, which is frequently observed in numerous human cancers. Strikingly, 22 key miRNAs are correlated with 14 key genes in this network. MiRNA-647 and miRNA-449a have been reported to be associated with colorectal cancer (Noguchi et al., 1999; Feng et al., 2018). These results show that HCI can classify the sample types clearly and integrate the multi-level regulatory network from multiple heterogeneous data. All relevant DEGs and DE-miRNAs are worthy of future experimental investigation and are listed in **Tables S4, S5**.

<sup>2</sup>https://string-db.org/

# DISCUSSION AND CONCLUSION


Distinct types of biological data can provide a precise explanation of complex biological processes (Ghazalpour et al., 2006; Kutalik et al., 2008; Li et al., 2012; Zhang et al., 2012; Chen and Zhang, 2016; Zeng et al., 2016; Feng et al., 2018; Yu and Zeng, 2018). In recent decades, many approaches have been proposed for analyzing single-cell or multi-omics data to identify subtypes and construct biological networks (Gygi et al., 1999; Ding and He, 2004; Chari et al., 2010; Zhang et al., 2011; Kiselev et al., 2017; Guo et al., 2018a,b; Wang et al., 2018). However, most methods have limitations in reliably identifying sample types from multiple datasets, such as sensitivity to noise in the data and high computational cost. Some methods also fail to make full use of the similarity information between samples, which makes their results unreliable. To overcome these problems, a flexible and efficient integration method with automated information fusion and bias correction is needed. In this work, we introduced the data-driven integration method HCI. The key idea of this method is to incorporate high-order similarity matrices (e.g., Pearson correlation matrices) into pattern fusion analysis, so that the determination of the sample cluster or subtype structure benefits from the high-order correlations. The combinatorial sample patterns obtained from HCI can represent a comprehensive characterization of the inherent sample relations in the data. To demonstrate the benefits of HCI, various evaluations were carried out on both scRNA-seq and bulk RNA-seq datasets for complex diseases. As expected, HCI effectively captured the sample (e.g., cell or patient) clusters and outperformed the existing methods under different conditions in terms of accuracy and robustness. Two in-depth case studies further supported the flexibility and applicability of HCI.
Notably, HCI is based on PFA, which was evaluated and compared with several multi-view clustering methods in a previous study (Shi et al., 2017). Meanwhile, SC3 has also been evaluated and compared with many existing approaches (Kiselev et al., 2017). Thus, in this study of scRNA-seq data, we directly compared HCI and SC3 on multiple datasets. More benchmark studies in this field are a worthwhile future topic (Zeng et al., 2016). Also as a future topic, HCI can be improved by further exploiting dynamics and network information, such as by applying network biomarkers (Zhang et al., 2015; Liu et al., 2016; Zhao et al., 2016; Liu, X. et al., 2018) or dynamic network biomarkers (Chen et al., 2012; Li et al., 2017; Liu et al., 2017; Liu, R. et al., 2018; Yang B. et al., 2018) for accurate and reliable clustering and classification of omics data from the perspectives of dynamics and networks.

FIGURE 4 | The enhanced framework flow for integrating bulk datasets and comparison with PFA. (A) The flow chart of HCI for integrating multiple heterogeneous omics data. (B) A brief introduction to the datasets used in this comparison. (C–E) Bars correspond to the average of the RDC values over 100 runs in each dataset. Red and gray colors correspond to the results of HCI and PFA, respectively.

FIGURE 5 | Case study on colorectal cancer. (A) Hierarchical clustering diagram of samples in matrix *Y*. Color bars represent the normal samples and disease samples. (B) The process diagram of selecting key genes. (C) Evaluation of the selected 25 important genes related to colorectal cancer in (B). In SurvExpress, we used the average for selected genes, two risk groups and Cox fitting to generate survival curves. The total number of each group is shown in the top right corner of the graph, and the number of censored samples is marked with +. The CI per curve is also included. The *P*-value is shown at the top of the figure. (D) The highly connected network consists mainly of 25 DEG genes and 22 miRNAs. The 22 miRNAs target 16 genes based on the miRDB database. The size of a node indicates the network degree of the gene. The PPI enrichment *P*-value of the genes is shown in the top right corner of this figure. (E) Top-ranked pathways and biological functions enriched in the 25 genes in (D).

As genomic data sources increase in diversity and volume, HCI can fit the data structures of both single-level and multi-level data, and could thus provide new avenues for the systematic explanation of various data and complex biological phenotypes at a system-wide level. There remain several directions in which to extend the HCI method, e.g., integrating discrete data types including somatic, SNP, and CNV information.

# AUTHOR CONTRIBUTIONS

HT and TZ developed the methodology. HT executed the experiment. HT and TZ carried out the data analysis and wrote this paper. HT, TZ, and LC revised the manuscript. LC and TZ supervised the work, and LC critically reviewed the paper. All authors read and approved the final manuscript.

### FUNDING

This work was supported by the National Key Research and Development Program of China (No. 2017YFA0505500), the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS) (No. XDB13040700), the National Natural Science Foundation of China (NSFC) (Nos. 61403363, 11401222, 11871456, 31771476), the Shanghai Municipal Science and Technology Major Project (Grant No. 2017SHZDZX01), and the Natural Science Foundation of Shanghai (17ZR1446100).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00371/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tang, Zeng and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Leveraging Fecal Bacterial Survey Data to Predict Colorectal Tumors

Bangzhou Zhang<sup>1,2†</sup>, Shuangbin Xu<sup>3†</sup>, Wei Xu<sup>4†</sup>, Qiongyun Chen<sup>1,2</sup>, Zhangran Chen<sup>2</sup>, Changsheng Yan<sup>2</sup>, Yanyun Fan<sup>1</sup>, Huangkai Zhang<sup>3</sup>, Qi Liu<sup>4</sup>, Jie Yang<sup>4</sup>, Jinfeng Yang<sup>4</sup>, Chuanxing Xiao<sup>1,2</sup>, Hongzhi Xu<sup>1,2\*</sup> and Jianlin Ren<sup>1,2\*</sup>

<sup>1</sup> Department of Gastroenterology, Zhongshan Hospital Xiamen University, Xiamen, China, <sup>2</sup> Institute for Microbial Ecology, School of Medicine, Xiamen University, Xiamen, China, <sup>3</sup> Xiamen Treatgut Biotechnology Co., Ltd., Xiamen, China, <sup>4</sup> Department of Gastroenterology, The Affiliated Hospital of Guizhou Medical University, Guiyang, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Marius Vital, Hannover Medical School, Germany Bin Yang, Chinese Academy of Medical Sciences and Peking Union Medical College, China

#### \*Correspondence:

Hongzhi Xu civilben@163.com Jianlin Ren jianlin.ren@126.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 21 December 2018 Accepted: 30 April 2019 Published: 28 May 2019

#### Citation:

Zhang B, Xu S, Xu W, Chen Q, Chen Z, Yan C, Fan Y, Zhang H, Liu Q, Yang J, Yang J, Xiao C, Xu H and Ren J (2019) Leveraging Fecal Bacterial Survey Data to Predict Colorectal Tumors. Front. Genet. 10:447. doi: 10.3389/fgene.2019.00447

Colorectal cancer (CRC) ranks second in cancer-associated mortality and third in incidence worldwide. Most CRC cases follow the adenoma-carcinoma sequence and have a more than 90% chance of survival if diagnosed at an early stage. However, the recommended screening by colonoscopy is invasive, expensive, and poorly adhered to. Recently, several studies reported that fecal bacteria might provide non-invasive biomarkers for CRC and precancerous tumors. We therefore collected and uniformly re-analyzed these published fecal 16S rDNA sequencing datasets to verify the association and to identify biomarkers for classifying and predicting colorectal tumors with the random forest method. A total of 1674 samples (330 CRC, 357 advanced adenoma, 141 adenoma, and 846 control) from 7 studies were analyzed in this study. Using random effects and fixed effects models, we observed significant differences in alpha-diversity and beta-diversity between individuals with CRC and those with a normal colon, but not between those with adenoma and those with a normal colon. We identified various bacterial genera with significant odds ratios for colorectal tumors at different stages. By building random forest models with 10-fold cross-validation as well as new test datasets, we classified individuals with CRC, advanced adenoma, adenoma, and normal colon. All approaches obtained comparable performance at the entire OTU level, the entire genus level, and the common genus level, as measured by AUC. When all samples were combined, the AUC of the random forest model based on 12 common genera reached 0.846 for CRC, although the prediction performed poorly for advanced adenoma and adenoma.

Keywords: fecal bacteria, colorectal cancer, colorectal adenoma, random forest, random effects model

# INTRODUCTION

Colorectal cancer (CRC) ranks second in terms of cancer-associated mortality and third in terms of incidence, with an estimated 881000 deaths and over 1.8 million new cases in 2018 in both sexes globally (Bray et al., 2018). CRC incidence rates are about 3-fold higher in developed countries than in developing ones. The incidence and mortality rates have also shown an increasing trend in China in the past decades. The age-standardized incidence and mortality rates by world standard population were 17.52 and 7.91 per 100000 in 2014, respectively (Chen W. et al., 2018). Survival exceeds 90% if the cancer is detected at an early stage, but decreases to 13% with advanced metastatic disease (Shah et al., 2018). Moreover, the development of most CRC cases follows the adenoma-carcinoma sequence, spanning more than 10–15 years on average. Therefore, targeting CRC by early screening and treatment, ideally as early as the adenoma stage, would have profound clinical and socioeconomic significance.

Colonoscopy is regarded as the gold standard of CRC screening. However, this test is poorly adhered to because of its invasiveness, frequency, and cost. For example, it is reported that more than 25% of adults aged 50–75 years, the high-risk group, have never participated in CRC screening in the United States (Centers for Disease Control and Prevention, 2018). A recent survey in China showed an even more serious situation, with only 14% of the people rated as high risk by a scoring system finally undergoing colonoscopy screening (Chen H. et al., 2018). Home-based fecal occult blood tests (FOBT) have low sensitivity for colorectal adenoma (CRA) or pre-cancers (Hundt et al., 2009) and are used less frequently. Thus, non-invasive and sensitive tests for the early diagnosis of CRC or precancerous lesions are urgently needed to improve the patient participation rate.

In the past years, numerous studies using mouse models or case-control designs have shown the effects of both individual gut microbes (Goodwin et al., 2011; Rubinstein et al., 2013; Abed et al., 2016) and the overall community (Baxter et al., 2014; Zackular et al., 2016) on the disease progression of CRA and CRC. The hypothesized role of the gut microbiota in tumorigenesis, acting as an environmental factor, also accords with the sporadic nature of CRC and CRA. Therefore, extensive efforts have been put into identifying microbiota-associated biomarkers for colorectal tumors (Ahn et al., 2013; Zeller et al., 2014; Baxter et al., 2016; Yu et al., 2017; Flemer et al., 2018). Although some taxa, including Fusobacterium, Peptostreptococcus, and Porphyromonas, were consistently reported to be enriched in CRC, a unifying set of signal taxa has not been defined. Moreover, most studies focused on CRC, whereas attention to CRA is in great clinical need to facilitate early detection of the tumors. Recently, two meta-analyses based on 16S rRNA gene sequences were published, which helped to distill possible biomarkers and to classify patients with adenoma or carcinoma. However, the aggregate number of samples was small (n = 509) (Shah et al., 2018), and the sequencing depths of some of the included studies were quite low (Shah et al., 2018; Sze and Schloss, 2018). Furthermore, several case-control studies with higher depths have been reported since the publication of these two meta-analyses. Therefore, it is meaningful and urgent to update the analysis to facilitate the development of non-invasive diagnostic tests for colorectal tumors based on fecal microbiota.

In this study, we updated the meta-analysis using fecal 16S rRNA gene sequence data from 7 studies with a relatively high sequencing depth (more than 5000 reads/sample). Using the most frequently used methods, we sought to determine the bacterial variation among studies, to characterize the differences in fecal bacterial diversity and communities in patients with colorectal tumors, and to identify a universal set of microbial markers to predict or diagnose the presence of colorectal cancer.

# MATERIALS AND METHODS

#### Datasets

The studies included in this meta-analysis were screened from two sources: a systematic PubMed search for colorectal (colon) cancer (CRC) or adenoma (CRA) and gut microbiota in the past 10 years, and recently published reviews and meta-analyses. Studies were excluded if (1) samples were not from feces, (2) samples were not sequenced by NGS for the 16S rRNA gene, (3) sequences, barcodes, or metadata were not publicly available or had not been provided by the authors by Sep 20, 2018 after requests by email, or (4) the sequencing depth was lower than 5000 raw reads. In the end, we obtained sequence datasets and metadata from 7 studies with CRC and/or CRA (Zeller et al., 2014; Baxter et al., 2016; Flemer et al., 2017; Hale et al., 2017; Deng et al., 2018; Flemer et al., 2018; Mori et al., 2018). An additional 12 studies on the gut microbiota of colorectal lesions were excluded because of lower sequencing depth or incomplete sequences, barcodes, or metadata (Sobhani et al., 2011; Chen et al., 2012; Wang et al., 2012; Ahn et al., 2013; Brim et al., 2013; Chen et al., 2013; Weir et al., 2013; Wu et al., 2013; Goedert et al., 2015; Mira-Pascual et al., 2015; Ai et al., 2017; Zhang et al., 2018). In summary, all 7 studies had CRC samples, 4 studies had advanced adenoma (Adv\_adenoma, >10 mm in size) samples, and 4 studies had samples with adenoma smaller than 10 mm (**Table 1**).

#### Sequence Processing

Paired-end reads were assembled using FLASH with default parameters, except for -x 0.2 and a region-specific -M (200 for V3-V4, 250 for V3-V5, and 150 for the V4 region). The assembled sequences were quality filtered with a minimum quality score of 25. To assign de novo OTUs, we removed chimeric sequences and clustered the remaining sequences at 97% similarity using Usearch (Edgar, 2013), separately for each study. The representative sequences of the OTUs were aligned to the SILVA database for taxonomic classification by RDP Classifier (Wang et al., 2007) and aggregated to various taxonomic levels.

#### Community Analyses

The alpha-diversity metrics, including observed OTUs (Obs), Shannon diversity, and Pielou's evenness (J), were calculated from the OTU table evenly rarefied to the lowest sequencing depth within each study. The differences between individuals with normal colon, adenoma, or CRC were tested for significance by the Wilcoxon test. We also calculated the ORs of these metrics by assigning any value above the median of the metric within the study as positive. Beta diversity based on the Bray-Curtis distance was measured within each study, and the differences between groups were determined using permutational analysis of variance (PERMANOVA) with 9999 permutations. At the genus level, the differences between groups were examined using the Wilcoxon test within each study, and the ORs were determined in the same manner as for the alpha-diversity metrics. Finally, both a random effects (RE) model and a fixed effects (FE) model were used to obtain the summary estimates of change.
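The alpha-diversity metrics and the median-split odds ratio described above can be illustrated with a minimal Python sketch. It is not the authors' code (they worked in R with vegan and epiR); the function names are hypothetical, rarefaction is omitted, and the direction of the OR (cases vs. controls above the pooled median), the Woolf-type confidence interval, and the 0.5 continuity correction are assumptions made only for illustration.

```python
import math

def alpha_diversity(counts):
    """Return (observed OTUs, Shannon H, Pielou's J) for one sample's OTU counts."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    observed = len(present)
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    # Pielou's evenness is undefined for a single taxon; report 0.0 in that case.
    evenness = shannon / math.log(observed) if observed > 1 else 0.0
    return observed, shannon, evenness

def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    return ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def median_split_odds_ratio(case_values, control_values):
    """Odds ratio and approximate 95% CI of exceeding the pooled median (assumed: cases vs. controls)."""
    cutoff = median(list(case_values) + list(control_values))
    a = sum(v > cutoff for v in case_values)       # cases above the median
    b = len(case_values) - a                       # cases at or below the median
    c = sum(v > cutoff for v in control_values)    # controls above the median
    d = len(control_values) - c                    # controls at or below the median
    if 0 in (a, b, c, d):
        # Assumed continuity correction so the ratio stays finite.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)
```

For instance, `alpha_diversity([10, 10, 10, 10])` describes a perfectly even community of four OTUs, so its evenness J equals 1.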

#### Classification by Random Forest

To estimate the predictive power of the gut microbiota for distinguishing individuals with a normal colon from those with colorectal tumors, we selected random forest (RF) models, which are widely used and robust, and built them for each study based on all OTUs, all genera, and the common genera detected in every study. RF models based on all studies and on n-1 studies (leave-one-study-out) were also built to assess, respectively, the classification performance of the common genera and the weight of any particular study in the overall performance. To test generalizability, we built RF models based on the common genera from one study and validated them in the other studies, and we also performed leave-one-study-out analyses with the left-out study as the test dataset. All models were built using 10-fold cross-validation with ten repeats, and the number of features tried at each split (mtry) was set to the square root of the total number of microbial features.
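The evaluation scheme described above can be sketched in Python as follows. This is only scaffolding under stated assumptions: it shows leave-one-study-out splitting, AUC scoring, and the square-root rule for mtry, but it does not train a random forest (the authors used caret and randomForest in R). All function names, the toy study labels, and the fixed scores are hypothetical, and rounding the square root down is an assumption.

```python
import math

def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) for each distinct study."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test

def default_mtry(n_features):
    """Features tried per split: square root of the feature count, rounded down (assumed)."""
    return max(1, math.isqrt(n_features))

# Hypothetical toy data: study label, true class (1 = CRC, 0 = normal), and a score.
# In a real analysis the scores for each held-out study would come from a model
# trained on the corresponding train indices; here they are fixed placeholders.
studies = ["A", "A", "B", "B", "C", "C"]
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.7, 0.7]
per_study_auc = {
    held_out: auc([labels[i] for i in test], [scores[i] for i in test])
    for held_out, _, test in leave_one_study_out(studies)
}
```

Scoring each held-out study separately, rather than pooling all predictions, makes it visible when a single cohort generalizes poorly.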

#### Statistical Analyses and Visualization

All statistical analyses were conducted in R-3.4.1 (R Core Team, 2017). The alpha-diversity metrics, Bray-Curtis distances (vegdist function), and PERMANOVA (adonis function) were all computed in vegan (Oksanen et al., 2015). The ORs were analyzed using epiR (Stevenson et al., 2018) and metafor (Viechtbauer, 2017), with significance tested by the chi-square test. In addition, the RF, SVM, KNN, and Adaboost models were built using caret (Kuhn et al., 2017) and randomForest (Breiman et al., 2015) with default parameters, and the test cohorts were predicted using pROC (Robin et al., 2017). The random effects model and fixed effects model were fitted in metafor (Viechtbauer, 2017). All figures were plotted using ggplot2-v3.0.0 (Wickham et al., 2017) and gridExtra (Auguie and Antonov, 2016).

<sup>∗</sup>NA indicates studies that were not included in the analysis because the datasets were not available, the barcode sequences needed to split the datasets were missing, or the sequencing depth per sample was low.

# RESULTS

#### Sample Variation

We included 16S rRNA gene sequencing data from 7 fecal studies covering CRC, adv\_adenoma, and adenoma (**Table 1**). A total of 1674 samples from 7 countries were retained after quality filtering, including 330 CRC, 357 adv\_adenoma, 141 adenoma, and 846 control samples. We first tried to combine all samples using a closed\_reference OTU assignment strategy, for compatibility across the different sequenced regions, but found that samples clustered primarily by study because of the strong effects of the DNA extraction methods, PCR amplification conditions, and sequencing platforms adopted by each study (**Figure 1**). Therefore, we processed each study separately using the same parameters in the following analyses.

#### Alpha-Diversity Differences

To compare alpha-diversity between disease stages, we considered microbial richness (observed OTUs, Obs), Shannon diversity, and evenness (J). We found significantly higher richness and Shannon diversity in normal colon than in CRC in 2 of 7 studies, and significantly higher microbial evenness in normal colon in 1 of 7 studies (**Supplementary Table S1**). For the comparisons of adenoma vs. normal colon and adv\_adenoma vs. normal colon, only one study showed a significant difference in richness and evenness. Because of these inconsistent results, we also calculated the odds ratios (ORs). The OR for Shannon diversity was significantly higher than 1.0 for CRC (OR = 1.48, CI 1.04 to 2.10) (**Figure 2**) in both the RE model and the FE model with low heterogeneity (**Supplementary Table S1**), indicating significantly lower microbial Shannon diversity in CRC than in the normal colon group. In contrast, the ORs for J, Obs, and Shannon were not significantly greater than 1.0 for adenoma and adv\_adenoma in the random effects model, which showed higher heterogeneity, despite a trend in the same direction (**Figure 2**).
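How per-study ORs are combined into fixed effects and random effects summaries, and how heterogeneity (I²) is quantified, can be sketched as below. This is a hedged illustration in Python, not the authors' metafor analysis: the between-study variance is estimated with the DerSimonian-Laird method, which is an assumption (the estimator used in the paper is not stated), the function name is hypothetical, and any input values are placeholders rather than results from the study.

```python
import math

def pool_log_odds_ratios(log_ors, variances):
    """Pool per-study log odds ratios with fixed effects and DerSimonian-Laird random effects."""
    k = len(log_ors)
    w = [1 / v for v in variances]                      # inverse-variance weights
    fixed = sum(wi * y for wi, y in zip(w, log_ors)) / sum(w)
    # Cochran's Q measures the spread of study estimates around the fixed estimate.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_ors))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]          # weights inflated by tau^2
    random_est = sum(wi * y for wi, y in zip(w_re, log_ors)) / sum(w_re)
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0

    def summary(estimate, weights):
        se = math.sqrt(1 / sum(weights))
        return {
            "or": math.exp(estimate),
            "ci_lb": math.exp(estimate - 1.96 * se),
            "ci_ub": math.exp(estimate + 1.96 * se),
        }

    return {"fixed": summary(fixed, w), "random": summary(random_est, w_re),
            "tau2": tau2, "I2": i2}
```

When the studies agree, the between-study variance is zero and the two summaries coincide; when they disagree, the random effects interval widens, which is why both models are worth reporting side by side.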

#### Beta-Diversity Differences

To measure overall community differences between individuals with colorectal tumors and those with normal colons, we calculated a Bray-Curtis distance matrix for each dataset and tested significance by PERMANOVA. We found significantly different community structure in CRC relative to normal colons in 6 of 7 studies (**Supplementary Table S2** and **Supplementary Figure S1**). However, we found significant community differences in adv\_adenoma vs. normal in only 1 of 4 studies and in adenoma vs. normal in only 1 of 4 studies. Again, by calculating the ORs based on the Bray-Curtis metric in each study, we found significant bacterial community differences between CRC and normal colons in both the RE and FE models with high heterogeneity (**Supplementary Table S2**), but no significant differences when comparing adv\_adenoma or adenoma with normal colons (**Figure 3**). These results showed that there were dependable and significant community-wide changes in the bacterial community structure of CRC patients.
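The Bray-Curtis and PERMANOVA steps can be illustrated with a small Python sketch. It is written from the standard definitions (Bray-Curtis dissimilarity and the distance-based pseudo-F statistic with label permutation) and is not a reimplementation of vegan's vegdist or adonis; the function names are hypothetical, and the default of 999 permutations is chosen for speed, whereas the study used 9999.

```python
import random

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    total = sum(x) + sum(y)
    return sum(abs(a - b) for a, b in zip(x, y)) / total if total else 0.0

def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F from a square distance matrix and one group label per sample."""
    n = len(groups)
    levels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in levels:
        idx = [i for i, lab in enumerate(groups) if lab == g]
        ss_within += sum(dist[i][j] ** 2 for a, i in enumerate(idx) for j in idx[a + 1:]) / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (len(levels) - 1)) / (ss_within / (n - len(levels)))

def permanova(samples, groups, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value) for abundance vectors and labels."""
    dist = [[bray_curtis(a, b) for b in samples] for a in samples]
    observed = pseudo_f(dist, groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)                 # reassign labels at random
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

With only a handful of samples the smallest attainable p-value is limited by the number of distinct label arrangements, which is one reason small cohorts rarely reach significance in such tests.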

#### Different Taxa

Given the overall community differences, we tried to identify the taxa that differed significantly between subjects with colorectal tumors and normal controls. However, the results of the Wilcoxon tests were not consistent across studies (**Supplementary Tables S3–S5**). By quantifying the ORs, a total of 13 genera were identified as associated with CRC (**Supplementary Figure S2**). Five genera had significant ORs lower than 1.0 for the presence of CRC in the RE and FE models (**Supplementary Table S6**), including Fusobacterium, Lachnospiraceae\_UCG-010, Mogibacterium, Oscillibacter, and Prevotella\_7. Eight genera had significant ORs higher than 1.0 for the absence of CRC, most of which are thought to be beneficial for butyrate production in the intestine, including Anaerostipes, Butyricicoccus, Coprococcus\_2, and Roseburia. In addition, a total of 10 genera had significant ORs for adenoma, and 6 genera had significant ORs for adv\_adenoma.

#### Development of Fecal Bacteria-Based Classifier

Since the gut microbial communities were greatly shifted in colorectal tumors, especially in CRC compared with the normal colon, it is meaningful to identify microbial biomarkers for the development of non-invasive diagnostic methods. For this purpose, we built RF models based on OTU abundance (finer level) and genus abundance (more general) to classify colorectal tumors and controls within each study.

FIGURE 3 | Forest plot of the Bray-Curtis distances between individuals with colorectal tumors and those with normal colons. (A) Adenomas vs. normal colons; (B) Adv\_adenomas vs. normal colons; (C) CRC vs. normal colons. The error bar depicts the 95% confidence interval. The left-hand side (negative values) of the dashed line indicates that the distances between cases and normal controls are higher than the distances among controls. The right-hand side of the dashed line indicates that the distances between cases and normal controls are lower than the distances among controls. The difference between cases and controls is significant if the error bar does not cross the dashed line.

We found that the RF models using all OTUs performed well in distinguishing CRC from individuals with normal colons [median AUC = 0.765, ranging in (0.531, 0.8757)] (**Figure 4C**). As expected, the RF models based on the genera showed comparable performance in differentiating CRC from the normal colon [median AUC = 0.755, ranging in (0.533, 0.977)] (**Figure 4F**). However, the performance of the RF models differentiating adv\_adenoma or adenoma from normal colons was unsatisfactory, only slightly better than a random predictor at both the OTU level [adv\_adenoma: median AUC = 0.568, ranging in (0.514, 0.898); adenoma: median AUC = 0.589, ranging in (0.524, 0.721)] (**Figures 4A,B**) and the genus level [adv\_adenoma: median AUC = 0.650, ranging in (0.515, 0.99); adenoma: median AUC = 0.598, ranging in (0.515, 0.650)] (**Figures 4D,E**).

Because OTUs were clustered separately for each study, the above RF models based on all OTUs and all genera could not be transferred between studies. Therefore, we built models based on the common genera that were detected in every study. Surprisingly, the models distinguishing CRC from individuals with normal colons still performed well [median AUC = 0.735, ranging in (0.5258, 0.888)] (**Figure 5C**), while the models for adv\_adenoma or adenoma remained weak [adv\_adenoma: median AUC = 0.632, ranging in (0.520, 0.693); adenoma: median AUC = 0.603, ranging in (0.521, 0.700)] (**Figures 5A,B**). When all samples from all studies were combined, the RF model returned an AUC of 0.835 for CRC vs. the normal colon (**Figure 5F**), which is better than the median AUC of the RF models based on single studies, although the prediction of adv\_adenoma or adenoma vs. the normal colon was still poor (**Figures 5D,E**). To test whether any particular study dominated the performance, we re-built the RF models based on n-1 studies (leave-one-study-out) and found that the performance was not greatly affected (**Figures 5D–F**), indicating the stability of the RF model for CRC based on all 7 studies and the common genera.

To further test the generalizability of the models based on common genera, we evaluated how well they would perform when given data from a different cohort. First, we used one study as training data and each of the other studies as test data. We found that model performance differed among the training cohorts, probably in association with sample size (**Figure 6**). In addition, the models performed better for CRC than for adv\_adenoma and adenoma. For adv\_adenoma, the models based on the Baxter\_16 and Hale\_17 studies were better than the other two (**Figure 6C**). Second, we ran the leave-one-study-out analysis again, this time using the left-out study as the test dataset. As expected, model performance was still good for CRC [median AUC = 0.754, ranging in (0.569, 0.916)] (**Figure 7C**), but still weak for adv\_adenoma [median AUC = 0.550, ranging in (0.496, 0.578)] and for adenoma [median AUC = 0.539, ranging in (0.494, 0.684)] (**Figures 7A,B**).

#### Important Microbial Taxa as Potential Biomarkers

By looking deeper into the microbial features selected for the RF model for CRC based on all studies, we obtained the 12 most important distinguishing taxa based on the mean decrease in Gini value (**Table 2**). All these genera are frequently detected in human fecal samples, and several were previously reported to be harmful to human health, such as Fusobacterium, Escherichia\_Shigella, and Streptococcus, which had higher abundance in the CRC group. In addition, some genera selected by the RF model are considered beneficial and had higher abundance in individuals with normal colons, including Bifidobacterium and Lachnospira. Furthermore, 4 genera overlapped with the taxa that had significant ORs in the RE model. In short, the microbial features selected for the RF model coincided with their abundance and might reflect their physiological effects.

FIGURE 6 | The performance of models to classify cases and normal controls. (A) CRC vs. normal colons; (B) Adenoma vs. normal colons; (C) Adv\_adenoma vs. normal colons. The horizontal axis shows the studies used as the training dataset. The vertical axis shows the AUC for each test study. The black line represents the median of all test AUCs for a specific model. The dashed gray lines mark an AUC of 0.5, corresponding to a random predictor.

TABLE 2 | Importance, odds ratio, heterogeneity, and relative abundance of the 9 common genera selected for the RF model for CRC based on all samples.

CI\_lb, confidence interval\_lower bound; CI\_ub, confidence interval\_upper bound; I2, heterogeneity measure.

# DISCUSSION

In this study, we conducted a comprehensive meta-analysis of a diverse collection of 16S rDNA sequencing studies with relatively high sequencing depth from 6 countries to reveal the differences in fecal bacterial communities between individuals with colorectal tumors and those with normal colons. By analyzing all datasets in a uniform manner, we further identified and validated fecal bacterial biomarkers and their roles in classifying subjects with colorectal tumors, especially CRC vs. the normal control. The good performance of the RF model based on common bacterial genera demonstrates the clinical significance and feasibility of developing a non-invasive screening or diagnostic method for CRC through the detection of fecal bacterial communities.

Although there was great heterogeneity among the original studies, the RF model we built for distinguishing CRC from the normal colon still performed well, with an AUC of 0.835. Our model outperformed or was comparable with the results of two recently published meta-analyses, one based on 16S rRNA sequencing with a smaller sample size (Shah et al., 2018) and one based on metagenomic data (Dai et al., 2018), as well as some independent studies based on microbiota (Zeller et al., 2014; Baxter et al., 2016; Flemer et al., 2018) and other non-invasive procedures (FOBT and fecal immunological test) (Zeller et al., 2014; Liang et al., 2017). Unexpectedly, the models for distinguishing adv\_adenoma or adenoma from the normal colon were poor, which is consistent with the results of the previous meta-analyses (Shah et al., 2018; Sze and Schloss, 2018). However, some studies did report better prediction for adenoma (Goedert et al., 2015; Baxter et al., 2016; Hale et al., 2017). Two potential reasons might explain the inconsistency between the results of the meta-analysis and those of the independent studies. First, samples included in an individual study usually meet consistent criteria, are processed with the same experimental and optimized analysis protocols, and can be analyzed together with additional clinical data (e.g., FIT) to improve model performance (Baxter et al., 2016). In contrast, there was great variation in these aspects in the meta-analysis. Second, the number of studies and the sample size for adv\_adenoma and adenoma in our meta-analysis were limited. Therefore, we look forward to more studies on adenoma to validate the potential of fecal bacteria for distinguishing individuals with adenoma from those with a normal colon.

We also found that the RF model constructed using the common genera performed comparably with models based on the entire communities of all genera and even all OTUs, which means that the finer level (OTUs at 97% similarity) did not further improve the classification model. This phenomenon was also reported in a previous meta-analysis (Sze and Schloss, 2018) and in an individual study (Hannigan et al., 2018). The "patchy" hypothesis can be used to explain it (Sze and Schloss, 2018): because the microbial distribution between individuals is patchy, classification based on common genera pools the fine-level diversity and reduces the variation in the microbial features. Finally, twelve common genera were identified as the most important features for distinguishing CRC from the normal colon, 4 of which had significant ORs. Fusobacterium, one of the most frequently reported bacteria in CRC studies (Rubinstein et al., 2013; Yu et al., 2017), was enriched in CRC cases relative to controls, as were other pernicious genera, including Escherichia\_Shigella and Streptococcus. We also identified the depletion of potentially beneficial microbes, such as the butyrate-producing Anaerostipes, Faecalibacterium, Lachnospira, and Coprococcus (Rivière et al., 2016; Vital et al., 2017). These genera could also be used for further validation by qPCR for more efficient diagnosis.

Even with our best efforts, this study has limitations. We did not conduct further analyses to improve the RF model or to examine more subgroups, since we were unable to collect sufficient demographic data (age, gender, BMI, etc.) and clinical data (FIT, FOBT, cancer stage, tumor location, adenoma growth patterns, etc.). Given this, we appeal to researchers to share their sequencing data and associated metadata, which would greatly facilitate research with larger sample sizes and more complete meta information (Quince et al., 2017). Moreover, better RF models for early screening and diagnosis can be expected when both microbial features and other metadata (including clinical data) are considered (Baxter et al., 2016; Liang et al., 2017). An advantage of this study was that we obtained the tumor sizes and could split the adenoma samples into small adenoma and advanced adenoma, a distinction that was not made in the previous meta-analyses.

In summary, our study uniformly analyzed a diverse collection of fecal 16S rDNA sequencing datasets and suggests a strong association between the fecal bacterial community and colorectal tumors. By revealing significant differences in diversity, identifying key taxa, and building RF models, we provide evidence for the use of fecal bacterial biomarkers to develop non-invasive diagnostic methods for colorectal tumors, especially CRC.



#### AUTHOR CONTRIBUTIONS

BZ, HX, and JR designed this study. BZ, SX, WX, QC, ZC, CY, YF, HZ, QL, JieY, JinY, and CX collected and organized the data. BZ, SX, WX, and HZ analyzed the data. BZ, SX, and WX wrote the manuscript.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (81800517 and 81602052), Xiamen Joint Projects for Major Diseases (3502Z20149031) and China Postdoctoral Science Foundation funded project (2018M632588 and 2018M632585).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00447/full#supplementary-material



**Conflict of Interest Statement:** SX and HZ were employed by company Xiamen Treatgut Biotechnology Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Xu, Xu, Chen, Chen, Yan, Fan, Zhang, Liu, Yang, Yang, Xiao, Xu and Ren. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# The Integrative Regulatory Network of circRNA, microRNA, and mRNA in Atrial Fibrillation

Shengyang Jiang<sup>1†</sup>, Changfa Guo<sup>2†</sup>, Wei Zhang<sup>3†</sup>, Wenliang Che<sup>3,4†</sup>, Jie Zhang<sup>5</sup>, Shaowei Zhuang<sup>1</sup>, Yiting Wang<sup>5,6</sup>, Yangyang Zhang<sup>5,7\*‡</sup> and Ban Liu<sup>3\*‡</sup>

<sup>1</sup> Department of Cardiology, Seventh People's Hospital of Shanghai University of Traditional Chinese Medicine, Shanghai, China, <sup>2</sup> Department of Cardiac Surgery, Zhongshan Hospital, Fudan University, Shanghai, China, <sup>3</sup> Department of Cardiology, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Shanghai, China, <sup>4</sup> Department of Cardiology, Shanghai Tenth People's Hospital Chongming Branch, Tongji University School of Medicine, Shanghai, China, <sup>5</sup> Key Laboratory of Arrhythmias, Ministry of Education, Shanghai East Hospital, Tongji University School of Medicine, Shanghai, China, <sup>6</sup> Basic Medical College, Jinzhou Medical University, Jinzhou, China, <sup>7</sup> Department of Cardiovascular Surgery, Shanghai East Hospital, Tongji University School of Medicine, Shanghai, China

#### Edited by:

Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China

#### Reviewed by:

Xingwang Li, Shanghai Jiao Tong University, China Lei Kong, Peking University, China

#### \*Correspondence:

Yangyang Zhang zhangyangyang\_wy@vip.sina.com Ban Liu 2013liuban@tongji.edu.cn

†These authors have contributed equally to this work as co-first authors

‡These authors have contributed equally to this work as co-corresponding authors

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 19 December 2018 Accepted: 14 May 2019 Published: 13 June 2019

#### Citation:

Jiang S, Guo C, Zhang W, Che W, Zhang J, Zhuang S, Wang Y, Zhang Y and Liu B (2019) The Integrative Regulatory Network of circRNA, microRNA, and mRNA in Atrial Fibrillation. Front. Genet. 10:526. doi: 10.3389/fgene.2019.00526

Atrial fibrillation (AF) is the most common irregular heart rhythm, affecting approximately 1–2% of the general population. As a potential factor for ischemic stroke, AF can also cause heart failure. The mechanisms behind AF pathogenesis are complex and remain elusive. As a new category of non-coding RNAs (ncRNAs), circular RNAs (circRNAs) are known to be key players in developmental processes, regulation of cell function, pathogenesis of heart diseases, and pathological responses, and could therefore provide novel insight into the pathogenesis of AF. circRNAs function as modulators of microRNAs in cardiac disease. To investigate the regulatory mechanism of circRNAs in AF, especially the complex interactions among circRNAs, microRNAs, and mRNAs, we collected heart tissues from three AF patients and three healthy controls and profiled their circRNA expression with a circRNA microarray. The differentially expressed circRNAs were identified, and the biological functions of their interacting microRNAs and mRNAs were analyzed. Our results provide novel insights into the roles of circRNAs in AF and propose highly plausible interaction mechanisms among circRNAs, microRNAs, and mRNAs.

Keywords: atrial fibrillation, non-coding RNA, circular RNA, microRNA, mRNA

# INTRODUCTION

Atrial fibrillation (AF) is the most common irregular heart rhythm, affecting approximately 1–2% of the general population (Graham et al., 2015; Wang, 2018). Several important factors may increase the risk of developing AF, including age, sex, obesity, excessive alcohol consumption, hypertension, abnormal heart valves, and lung diseases (Dagres and Anastasiou-Nana, 2010; Soliman et al., 2014). As a potential factor for ischemic stroke, AF can also lead to hospitalization for heart failure and to death, and it is associated with high mortality, morbidity, and socioeconomic burden (Voukalis et al., 2016). However, current treatments for AF still lack sufficient utility and efficacy and may have adverse effects (Vallabhajosyula et al., 2016; Wan et al., 2016). The mechanisms behind AF pathogenesis are complex and remain elusive. Further study of the potential mechanisms of AF could provide novel treatments that serve as effective alternatives to current therapy (Ogawa et al., 2017).

Non-coding RNAs (ncRNAs) are known to play important roles in regulating gene expression. The main groups of ncRNAs include long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and circular RNAs (circRNAs) (McMullen and Ooi, 2017). As a new category of ncRNAs, circRNAs are known to be key players in developmental processes, regulation of cell function, pathogenesis of heart diseases, and pathological responses, and could therefore provide novel insight into the pathogenesis of AF (Zhang et al., 2018). Unlike linear RNAs, which terminate in 5′ caps and 3′ tails, circRNAs are characterized by covalently closed loop structures, which are presumably more stable and conserved and may play important roles in many pathophysiological processes (Wang et al., 2016). Recent work on circRNAs in cardiac disease has demonstrated their important function as modulators of miRNA levels (Stepien et al., 2018). circRNAs may be a new class of potential biomarkers and therapeutic targets, and their role in heart disease is becoming increasingly evident.

To investigate the regulatory mechanism of circRNAs in AF, especially the complex interactions among circRNAs, microRNAs, and mRNAs, we collected heart tissues from three AF patients and three healthy controls and profiled their circRNA expression with a circRNA microarray. The differentially expressed circRNAs were identified, and the biological functions of their interacting microRNAs and mRNAs were analyzed. Our results provide novel insights into the roles of circRNAs in AF and propose highly plausible interaction mechanisms among circRNAs, microRNAs, and mRNAs.

# MATERIALS AND METHODS

#### The circRNA Expression Profiles of Atrial Fibrillation Patients

We collected heart tissues from three AF patients and three healthy controls. The clinical information of these six samples is given in **Table 1**. The circRNA expression profiles of these samples were measured with the Arraystar Human circRNA Array V2 (8 × 15K, Arraystar). The arrays were scanned with the Agilent Scanner G2505C and analyzed with Agilent Feature Extraction software (version 11.0.1.1). The circRNAs present in at least 3 of the 6 samples were retained. Finally, the expression levels of 12,515 circRNA probes were log2 transformed and quantile normalized. The circRNA expression profiles are given in **Supplementary Table S1** and were uploaded to GEO (Gene Expression Omnibus) under accession number GSE129409.
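The preprocessing steps named above (presence filter, log2 transformation, and quantile normalization) can be sketched in Python as follows. This is a simplified illustration, not the Agilent or Arraystar pipeline: the function names are hypothetical, the presence flags are assumed to be supplied by the upstream software, the order of the two transformations should follow the actual pipeline, and ties are broken by probe order rather than averaged.

```python
import math

def present_in_enough(flags_per_probe, min_samples=3):
    """Indices of probes flagged as present in at least min_samples samples."""
    return [i for i, flags in enumerate(flags_per_probe) if sum(flags) >= min_samples]

def log2_transform(columns):
    """Apply log2 to every intensity (one list per sample; values must be positive)."""
    return [[math.log2(v) for v in col] for col in columns]

def quantile_normalize(columns):
    """Quantile-normalize samples (one list per sample, same probes in the same order)."""
    n_probes = len(columns[0])
    # The rank-wise mean across samples defines the shared reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n_probes)]
    normalized = []
    for col in columns:
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = reference[rank]   # ties are broken by probe order in this sketch
        normalized.append(out)
    return normalized
```

After normalization every sample has exactly the same set of values, only assigned to different probes, which removes sample-wide intensity shifts before groups are compared.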

Written informed consent was obtained from patients before collection of the abandoned left atrial appendages. All experimental procedures were conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of Shanghai East Hospital (approval no. 040-2017).

## Identify the Differentially Expressed circRNAs Between Atrial Fibrillation Patients and Healthy Controls

The statistical significance of differential expression between the two groups was estimated with a t-test using the limma package in R and further filtered by fold change. CircRNAs with a t-test p-value smaller than 0.05 and a fold change greater than 2 were considered significantly differentially expressed.
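The two-threshold filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a plain pooled-variance Student's t statistic rather than the moderated statistic that limma computes, it assumes three samples per group (so that 4 degrees of freedom and the two-sided 5% critical value 2.776 apply), and the function name and toy matrices are hypothetical.

```python
import numpy as np

# Two-sided 5% critical value of Student's t with 4 degrees of freedom
# (3 + 3 samples); an assumption tied to this sample size.
T_CRIT_DF4 = 2.776

def differential_mask(af, ctrl, t_crit=T_CRIT_DF4, min_fold=2.0):
    """af, ctrl: arrays of shape (n_probes, n_samples) with log2 expression."""
    n1, n2 = af.shape[1], ctrl.shape[1]
    diff = af.mean(axis=1) - ctrl.mean(axis=1)            # log2 fold change
    pooled_var = ((n1 - 1) * af.var(axis=1, ddof=1) +
                  (n2 - 1) * ctrl.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    t_stat = diff / np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    significant = np.abs(t_stat) > t_crit                 # p < 0.05 when df = 4
    large_change = np.abs(diff) > np.log2(min_fold)       # fold change > 2
    return significant & large_change, diff

# Toy log2 expression values for three hypothetical probes.
af = np.array([[8.0, 8.2, 8.1],
               [5.0, 5.1, 4.9],
               [6.0, 9.0, 3.0]])
ctrl = np.array([[6.0, 6.1, 5.9],
                 [5.0, 4.9, 5.1],
                 [6.0, 6.0, 6.1]])
mask, log2fc = differential_mask(af, ctrl)
```

Because the data are log2 transformed, a fold change greater than 2 corresponds to an absolute difference of group means greater than 1.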

# Construct the Integrative Regulatory Network of circRNAs, microRNAs, and mRNAs

The interactions between circRNAs and microRNAs play important roles in disease regulation (Ghosal et al., 2013). Some circRNAs contain microRNA binding sites and act as endogenous microRNA "sponges" that adsorb microRNAs and quench their normal biological functions (Lukiw, 2013). To discover such circRNA-microRNA interactions, we applied TargetScan (Enright et al., 2003) and miRanda (Pasquinelli, 2012) to predict the microRNA targets within circRNAs. In addition, we predicted the microRNA targets in mRNAs. Finally, we constructed the genome-wide integrative regulatory network of circRNAs, microRNAs and mRNAs.

## Analyze the Biological Functions of circRNAs, microRNAs, and mRNAs in Atrial Fibrillation

Since the functions of circRNAs are still poorly annotated, we investigated the functions of the microRNAs that interact with the differentially expressed circRNAs. These microRNAs may reflect the functions of the differentially expressed circRNAs. We extracted 92 AF related microRNAs from HMDD (the Human microRNA Disease Database) v3.0 (Huang et al., 2018). These 92 AF related microRNAs are listed in **Supplementary Table S2**. If a circRNA interacts with these microRNAs, it may also be related to AF. A complete interaction module of circRNAs, microRNAs and mRNAs with strong literature support for each component would be a promising regulatory model for AF.

# RESULTS

# The Differentially Expressed circRNAs Between Atrial Fibrillation Patients and Healthy Controls

TABLE 1 | Demographic characteristics of patients.

NYHA, New York heart association; INR, international normalized ratio; AF, atrial fibrillation.

A circRNA was considered differentially expressed between AF patients and healthy controls if its t-test p-value was smaller than 0.05 and its fold change was greater than 2. With these criteria, there were 537 up-regulated circRNAs and 199 down-regulated circRNAs in AF patients. These differentially expressed circRNAs are listed in **Supplementary Table S3**. Because the sample size of the AF patients and healthy controls was small, we did not apply an FDR (False Discovery Rate) cutoff to identify the differentially expressed circRNAs; we still calculated the FDR, which was 0.556. We also calculated the mean and standard deviation (SD) of the AF patients and healthy controls. For the 537 up-regulated circRNAs, Mean<sub>AF</sub> − SD<sub>AF</sub> was always greater than Mean<sub>Control</sub> + SD<sub>Control</sub>; for the 199 down-regulated circRNAs, Mean<sub>AF</sub> + SD<sub>AF</sub> was always smaller than Mean<sub>Control</sub> − SD<sub>Control</sub>. These mean and SD results confirmed that there was a difference between AF patients and healthy controls and that the difference was greater than their variance.
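The mean and SD comparison can be written as a short check. The sketch below is illustrative only: the function name and toy values are hypothetical, and it assumes the sample standard deviation (`ddof=1`), which the text does not specify.

```python
import numpy as np

def separated_by_sd(af, ctrl):
    """Return +1 where Mean_AF - SD_AF > Mean_Control + SD_Control,
    -1 where Mean_AF + SD_AF < Mean_Control - SD_Control, and 0 otherwise.
    Rows are probes, columns are samples."""
    m_af, s_af = af.mean(axis=1), af.std(axis=1, ddof=1)   # ddof=1 is an assumption
    m_ct, s_ct = ctrl.mean(axis=1), ctrl.std(axis=1, ddof=1)
    up = (m_af - s_af) > (m_ct + s_ct)
    down = (m_af + s_af) < (m_ct - s_ct)
    return up.astype(int) - down.astype(int)

# Toy values for three hypothetical probes.
af = np.array([[8.0, 8.2, 8.1],
               [3.0, 3.1, 2.9],
               [5.0, 6.0, 7.0]])
ctrl = np.array([[6.0, 6.1, 5.9],
                 [5.0, 5.1, 4.9],
                 [5.5, 6.5, 6.0]])
direction = separated_by_sd(af, ctrl)
```

A probe flagged +1 or -1 has group means that differ by more than the sum of the two standard deviations, which is the separation the paragraph above reports for all differentially expressed circRNAs.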

We calculated the frequencies of the microRNAs that targeted the up-regulated and down-regulated circRNAs, respectively. The three most frequent microRNAs for up-regulated circRNAs were hsa-miR-597-3p, which interacted with 18 up-regulated circRNAs; hsa-miR-136-5p, which interacted with 16; and hsa-miR-103a-2-5p, which interacted with 15. The three most frequent microRNAs for down-regulated circRNAs were hsa-miR-103a-2-5p, which interacted with 11 down-regulated circRNAs; hsa-miR-4739, which interacted with 8; and hsa-miR-627-3p, which interacted with 8. These microRNAs may be associated with the differential expression pattern of circRNAs in AF.
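Counting how often each microRNA appears among the predicted partners of a set of circRNAs is a simple tally. The following sketch uses made-up identifiers and assumes that a microRNA is counted once per circRNA even if several binding sites are predicted; the text does not state how multiple sites were handled.

```python
from collections import Counter

# Hypothetical circRNA -> predicted microRNA partners (identifiers are made up).
predicted = {
    "circ_A": ["miR-1", "miR-2"],
    "circ_B": ["miR-1", "miR-3"],
    "circ_C": ["miR-1", "miR-2", "miR-2"],  # two predicted sites for miR-2
}

def mirna_frequencies(interactions):
    """Number of circRNAs each microRNA is predicted to interact with."""
    counts = Counter()
    for mirnas in interactions.values():
        counts.update(set(mirnas))  # count each microRNA once per circRNA
    return counts

freq = mirna_frequencies(predicted)
top = freq.most_common(2)
```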

#### The circRNA–microRNA Interactions in Atrial Fibrillation

We extracted 92 AF related microRNAs from HMDD (the Human microRNA Disease Database) v3.0 (Huang et al., 2018). Differentially expressed circRNAs that interact with these reported AF related microRNAs are more likely to be AF associated circRNAs. Therefore, we highlighted the differentially expressed circRNAs that interact with AF related microRNAs.

Eight up-regulated and two down-regulated circRNAs interacted with AF related microRNAs. **Figures 1**, **2** plot the expression patterns of these eight up-regulated circRNAs and these two down-regulated circRNAs, respectively.

Of the eight up-regulated circRNAs, five interacted with hsa-miR-892a, three interacted with hsa-miR-3149, and two interacted with hsa-miR-3171. Of the two down-regulated circRNAs, one interacted with hsa-miR-892a and the other interacted with hsa-miR-133b.

hsa-miR-892a interacted with both up-regulated and down-regulated circRNAs, and a large number of differentially expressed circRNAs interact with it. Xu et al. (2016) reported that the expression level of hsa-miR-892a increased significantly from the early stage to the end stage of AF and that it can be used as an early diagnostic biomarker of AF.

hsa-miR-3149 had an expression pattern similar to that of hsa-miR-892a, and its expression level also increased in AF (Xu et al., 2016).

In contrast, hsa-miR-3171 had the opposite expression pattern: its expression level decreased in AF (Xu et al., 2016).

The association between hsa-miR-133b and AF has been reported by several studies, but different expression patterns were observed. da Silva et al. found that hsa-miR-133b was up-regulated in acute new-onset AF patients, with a 1.4-fold increase in expression compared with well-controlled AF patients and control patients (da Silva et al., 2018). Li et al. (2012) reported that miR-133 was down-regulated in chronic AF canines. It is not clear whether this difference is caused by species or whether miR-133b functions differently at different stages of AF.

# DISCUSSION

#### The Potential Roles of hsa\_circRNA\_100612, hsa-miR-133b, and KCNIP1/JPH2/ADRB1 in Atrial Fibrillation

circRNAs are members of the ncRNA family that can capture other RNA molecules and have recently been shown to regulate proteins and other RNAs, including miRNAs. Recent studies have paid increasing attention to the potential of circRNAs to contribute to disease etiology. The expression patterns of circRNAs vary widely across organisms and cell types. Several recent studies have suggested that circRNAs may play essential roles in the initiation and development of cardiovascular diseases (Li et al., 2017). miRNAs, in turn, can regulate cardiac function by regulating the proliferation, migration, apoptosis, and differentiation of cells during disease progression. A large body of literature has reported associations between miRNAs and AF related remodeling processes, and miRNAs might have important roles in signaling during the pathogenesis of AF (Flemming, 2014; Danielson et al., 2018). It has been shown that circRNAs may act as endogenous sponge RNAs that interact with miRNAs and influence the expression of miRNA target genes.

In our research, we found that circRNA\_100612, which is located on chromosome 10, could lead to AF by interacting with miR-133b. One of the target genes of miR-133b, KCNIP1, is a member of the family of cytosolic voltage-gated potassium (Kv) channel-interacting proteins and is related to the cardiac conduction pathway. In zebrafish, overexpression of KCNIP1 could lead to inducible AF. A genome-wide approach revealed a common 4,470 bp CNV in most AF patients, indicating that KCNIP1 could be a genetic predictor of AF risk (Tsai et al., 2016).

Another important downstream target gene of miR-133b is JPH2, which has an important role in sarcoplasmic reticulum Ca<sup>2+</sup> handling and modulation of ryanodine receptor Ca<sup>2+</sup> channels. Knockdown of JPH2 in mice was associated with a loss of junctional membrane complexes, reduced Ca<sup>2+</sup>-induced Ca<sup>2+</sup> release, and acute heart failure (van Oort et al., 2011). The E169K mutation in JPH2 could result in AF because of defective RyR2-mediated SR Ca<sup>2+</sup> release events, which represent a potential novel therapeutic target for AF (Beavers et al., 2013).

miR-133b could affect ADRB1, which is a member of the superfamily of cell surface receptors and has a great effect on the myocardium (Cresci, 2012; Pasquier et al., 2016). ADRB1 is also an effective target for pharmacotherapy in cardiovascular diseases, and β-blocking medications are acknowledged as first-line agents for ventricular rate control in patients with AF (Chen et al., 2003; McMurray and van Veldhuisen, 2014).

FIGURE 2 | The expression pattern of the two down-regulated circRNAs that interact with atrial fibrillation related microRNAs. (A) The expression pattern of down-regulated hsa\_circRNA\_100612, which interacts with hsa-miR-133b; (B) The expression pattern of down-regulated hsa\_circRNA\_405917, which interacts with hsa-miR-892a.

# The Potential Roles of Differentially Expressed circRNAs, hsa-miR-892b, and GJA1 in Atrial Fibrillation

hsa-miR-892b interacts with down-regulated hsa\_circRNA\_405917 and up-regulated hsa\_circRNA\_008132, hsa\_circRNA\_104052, hsa\_circRNA\_101021, hsa\_circRNA\_101020, and hsa\_circRNA\_102341.

One important target gene of miR-892b is GJA1, which encodes the gap junction protein connexin 43 on chromosome 6q22.31 (Van Norstrand et al., 2012). A recent study using large-scale genotyping reported novel AF risk loci at or near GJA1. They found that SNPs associated with AF could influence the transcription of GJA1 in both left atrial tissue and whole heart (Thibodeau et al., 2010; Sinner et al., 2014).

# CONCLUSION

As the most common irregular heart rhythm disease, AF affects approximately 1–2% of the general population. The mechanisms behind AF pathogenesis are complex and remain elusive. circRNAs have been implicated in developmental processes, regulation of cell function, pathogenesis of heart diseases, and pathological responses, and could provide novel insight into the pathogenesis of AF. By analyzing the circRNA expression profiles of AF patients and healthy controls, we identified 537 up-regulated circRNAs and 199 down-regulated circRNAs in AF patients. We investigated the interactions between these differentially expressed circRNAs and reported AF microRNAs. Eight up-regulated and two down-regulated circRNAs interacted with AF related microRNAs. By analyzing the functional interactions among circRNAs, microRNAs and target mRNAs, we proposed an integrative regulatory network model of circRNAs, microRNAs and target mRNAs for AF, as shown in **Figure 3**. Our results provide novel insights into how circRNAs and microRNAs function in AF, and the proposed regulatory network model of circRNAs, microRNAs and target mRNAs merits further study and validation.

#### DATA AVAILABILITY

The datasets generated for this study can be found in GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129409.

#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the International Ethical Guidelines for Biomedical Research Involving Human Subjects and of the Shanghai East Hospital (Tongji University School of Medicine) ethics committee. The protocol was approved by the Shanghai East Hospital (Tongji University School of Medicine) ethics committee. All subjects gave written informed consent in accordance with the Declaration of Helsinki.

#### REFERENCES


# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# FUNDING

This study was supported by National Key Research and Development Program (Grant No. 2018-YFC-1312505), Foundation of Shanghai Municipal Commission of Health and Family Planning (Grant No. 201640053), the National Natural Science Foundation of China (Grant Nos. 81501203 and 81570436), and the Outstanding Clinical Discipline Project of Shanghai Pudong (Grant No. PWYgy2018-05).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00526/full#supplementary-material

TABLE S1 | The circRNA expression profiles in three atrial fibrillation patients and three healthy controls.

TABLE S2 | The 92 atrial fibrillation related microRNAs extracted from HMDD (the Human microRNA Disease Database) v3.0.

TABLE S3 | The differentially expressed circRNAs between atrial fibrillation patients and healthy controls.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Jiang, Guo, Zhang, Che, Zhang, Zhuang, Wang, Zhang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# MildInt: Deep Learning-Based Multimodal Longitudinal Data Integration Framework

*Garam Lee1,2, Byungkon Kang1, Kwangsik Nho3,4, Kyung-Ah Sohn1\* and Dokyoon Kim2,5,6\**

*1 Department of Software and Computer Engineering, Ajou University, Suwon, South Korea, 2 Biomedical & Translational Informatics Institute, Geisinger, Danville, PA, United States, 3 Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, United States, 4 Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN, United States, 5 Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States, 6 Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, United States*

#### *Edited by:*

*Tao Zeng, Shanghai Institutes for Biological Sciences (CAS), China*

#### *Reviewed by:*

*Min Chen, Hunan Institute of Technology, China Liansheng Wang, Xiamen University, China*

#### *\*Correspondence:*

*Dokyoon Kim dokyoon.kim@pennmedicine.upenn.edu Kyung-Ah Sohn kasohn@ajou.ac.kr*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 21 December 2018 Accepted: 13 June 2019 Published: 28 June 2019*

#### *Citation:*

*Lee G, Kang B, Nho K, Sohn K-A and Kim D (2019) MildInt: Deep Learning-Based Multimodal Longitudinal Data Integration Framework. Front. Genet. 10:617. doi: 10.3389/fgene.2019.00617*

As large amounts of heterogeneous biomedical data become available, numerous methods for integrating such datasets have been developed to extract complementary knowledge from multiple domains of sources. Recently, deep learning approaches have shown promising results in a variety of research areas. However, applying deep learning requires expertise in constructing a deep architecture that can take multimodal longitudinal data. Thus, in this paper, a deep learning-based python package for data integration is developed. The python package, the deep learning-based multimodal longitudinal data integration framework (MildInt), provides a preconstructed deep learning architecture for a classification task. MildInt contains two learning phases: learning a feature representation from each modality of data and training a classifier for the final decision. Adopting a deep architecture in the first phase leads to learning a more task-relevant feature representation than a linear model. In the second phase, a logistic regression classifier is used for detecting and investigating biomarkers from multimodal data. Thus, by combining the linear model and the deep learning model, higher accuracy and better interpretability can be achieved. We validated the performance of our package using simulation data and real data. For the real data, as a pilot study, we used clinical and multimodal neuroimaging datasets in Alzheimer's disease to predict disease progression. MildInt is capable of integrating multiple forms of numerical data, including time series and non-time series data, to extract complementary features from the multimodal dataset.

Keywords: multimodal deep learning, data integration, gated recurrent unit, Alzheimer's disease, python package

# INTRODUCTION

As the number of biomedical datasets grows exponentially, relevant data integration methods that can extract biological insight by incorporating heterogeneous data are required (Larranaga et al., 2006). Recently, deep learning approaches have shown promising results in numerous applications such as natural language processing, computer vision, and speech recognition. In addition, in the field of translational research, deep learning-based predictive models have shown comparable results (Chaudhary et al., 2017; Choi et al., 2017; Lu et al., 2018; Lee et al., 2019). These previous studies integrated multiple domains of data using deep learning models to discover integrative features that cannot be explained by a single domain of data. For example, a multimodal neuroimaging dataset is combined in (Lu et al., 2018) using a deep learning-based framework for discriminating cognitively normal subjects from those with Alzheimer's disease (AD), which resulted in considerable performance improvement. For multi-omics data integration, RNA-seq, miRNA-seq, and methylation data from The Cancer Genome Atlas (TCGA) are incorporated using an auto-encoder for predicting hepatocellular carcinoma survival (Chaudhary et al., 2018). Furthermore, in (Lee et al., 2019), a multimodal gated recurrent unit (GRU) is used to integrate cognitive performance, cerebrospinal fluid (CSF), demographic data, and neuroimaging data to predict AD progression. Data integration is believed to improve classification performance by extracting complementary information from each domain of source.

However, integrating heterogeneous data is a challenging task. First of all, multimodal data might hinder learning a complementary feature representation due to the presence of mutually exclusive data; that is, a useful feature representation of the data might not be learned well because the task-irrelevant portion of the data could interfere with the task-relevant portion. In addition, dealing with datasets that consist of multiple time points is another issue for data integration. Time series data include multiple time points, whose number varies across samples, while non-time series data consist of a single time point. Thus, additional transformation steps must first be applied to time series datasets to convert the variable-length sequence data into fixed-size representations without losing information. Finally, in most cases, the more datasets are used, the fewer samples are available. Traditional data integration methods use only the samples for which all modalities are available. Since only a few samples contain all modalities of data, only a small portion of the samples can be used, even though abundant samples are available.

In this paper, we provide a deep learning-based python package for heterogeneous data integration. The most significant advantage of our package is the flexibility with which irregular time series data are processed. As the main component of our package, we combine multiple GRUs with simple concatenation-based vector integration, which makes it possible to incorporate any number of modalities. Furthermore, non-overlapping samples, as well as overlapping samples, can be used for training a classifier. To demonstrate the validity of our package, we conduct experiments on simulation data and real data. For the simulation data, we generate multimodal time series data using the autoregressive model and solve a binary classification task. For the real data, as a pilot test, data from patients with mild cognitive impairment (MCI) are used to predict AD progression.

#### Methods

As shown in **Figure 1**, MildInt comprises two learning phases: 1) feature extraction from each modality of data and 2) learning the integrative feature representation to make the final prediction. In phase 1, time series data from a single domain are transformed into a fixed-size vector. Then, in phase 2, the vectors from each modality of data are integrated and fed to a logistic regression (LR) classifier for the final decision. We use GRU as our main component for learning feature representations from the time series data. Additionally, we apply the concatenation-based data integration method to integrate multiple sources of data into a single vector.

FIGURE 1 | Longitudinal total intracranial volume, hippocampal volume, and entorhinal cortex thickness from brain imaging data, genomic data, cognitive assessment, and any other form of numerical data can be taken as input by our framework. In phase 1 (blue-dashed rectangle), each modality of data is separately processed for learning feature representation. Both time series and non-time series data can be accepted to produce fixed-size feature vectors using a gated recurrent unit (GRU) component (green-dashed rectangle). Then, the learned representations (rectangles colored red, green, and yellow) are simply concatenated to form an input for the logistic regression (LR) classifier in phase 2 (red-dashed rectangle).

#### PHASE 1: FEATURE EXTRACTION FROM EACH SINGLE MODAL TIME SERIES DATA

#### Recurrent Neural Network

A recurrent neural network (RNN) is a class of deep learning architecture composed of multiple recurring processing layers that learn a representation of sequential data (LeCun et al., 2015). An RNN processes an input sequence one element at a time and updates its memory state, which implicitly contains information about the history of all past elements of the sequence. The memory state is represented as a Euclidean vector (i.e., a sequence of real numbers) and is updated recursively from the input at the given step and the value of the previous memory state. Given a sequence *X* = {*x*<sub>1</sub>, *x*<sub>2</sub>, …, *x*<sub>*t*</sub>, …, *x*<sub>*T*</sub>}, the memory state and output at each time step are computed as follows:

$$s_t = \tanh\left(U^{(s)} x_t + W^{(s)} s_{t-1}\right) \tag{1}$$

$$o_t = \text{softmax}\left(V^{(o)} s_t\right) \tag{2}$$

where *U*, *W*, and *V* are parameters to be learned for computing the input, memory state, and output, respectively. The output results from the softmax function, whose role is to convert the vector of hidden state into a probability vector *via* the following operation:

$$\sigma(u_i) = \frac{e^{u_i}}{\sum_k e^{u_k}} \tag{3}$$

where *u*<sub>*i*</sub> is the *i*-th element of the vector *u* and the sum over *k* runs over all labels. Finally, the loss function is defined with cross-entropy to quantify the distance between the true label and the estimated one. In our package, only the last output *o*<sub>*T*</sub> is used for the estimate, because it is regarded as carrying the past features relevant to estimation.
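Equations (1)–(3) can be traced in a few lines of NumPy. The sketch below is not the MildInt implementation; the function names, the dimensions, the random parameters, and the zero initial memory state are assumptions made for illustration. It shows that sequences of different lengths yield an output of the same size, since only the last output is kept.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())          # subtract the max for numerical stability
    return e / e.sum()               # equation (3)

def rnn_last_output(X, U, W, V):
    """X: array of shape (T, d_in). Returns o_T, the output at the last step."""
    s = np.zeros(W.shape[0])         # initial memory state, assumed to be zero
    for x_t in X:
        s = np.tanh(U @ x_t + W @ s) # equation (1)
    return softmax(V @ s)            # equation (2) evaluated at t = T

rng = np.random.default_rng(0)
d_in, d_hidden, n_labels = 3, 4, 2   # hypothetical sizes
U = rng.normal(size=(d_hidden, d_in))
W = rng.normal(size=(d_hidden, d_hidden))
V = rng.normal(size=(n_labels, d_hidden))

short_seq = rng.normal(size=(2, d_in))   # two time points
long_seq = rng.normal(size=(7, d_in))    # seven time points
o_short = rnn_last_output(short_seq, U, W, V)
o_long = rnn_last_output(long_seq, U, W, V)
```

Both `o_short` and `o_long` are probability vectors over the two labels, regardless of sequence length.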

In natural language processing, speech recognition, and anomaly detection in time series, RNNs are widely used for analyzing sequences of words and time series data (Deng et al., 2013). One of the main advantages of RNNs is that time series data of variable length can be processed. This advantage is critical to our framework, which accepts time series data of any length. However, extracting features from a long sequence of data is hard for an RNN, which is known as the long-term dependency problem (Bengio et al., 1994). To handle this problem, long short-term memory (LSTM) and GRU have been developed and are used in practice.

#### Learning Feature Representation Using Gated Recurrent Unit

GRU and LSTM are extensions of the RNN in which additional parameters regulate the memory state, making it possible to "forget" irrelevant, outdated past information. Although both LSTM and GRU can handle the long-term dependency problem, we selected GRU as the main component of MildInt. Since GRU has fewer parameters than LSTM, GRU is expected to be easier to train in the field of translational informatics, where only a few samples are available.

Long-term information is regulated by the reset and update gates. Parameters for both gates are learned to determine how *x*<sub>*t*</sub> is processed [equations (4)–(7)]. The update gate decides how much of the previous memory value *s*<sub>*t*−1</sub> is passed on. If *z*<sub>*t*</sub> is computed as 1 by equation (4), then only the previous memory is passed on, while the newly computed hidden value *h*<sub>*t*</sub> is forgotten [equation (7)]. The reset gate, on the other hand, controls how the previous memory *s*<sub>*t*−1</sub> is combined with the current input *x*<sub>*t*</sub>: in equation (6), it determines how much of the previous memory value *s*<sub>*t*−1</sub> enters the computation. Note that GRU generalizes the RNN, because setting *r*<sub>*t*</sub> to 1 and *z*<sub>*t*</sub> to 0 for *t* = 1, 2, …, *T* makes the GRU function exactly like an RNN.

$$z_t = \sigma\left(U^{(z)} x_t + s_{t-1} W^{(z)}\right) \tag{4}$$

$$r_t = \sigma\left(U^{(r)} x_t + s_{t-1} W^{(r)}\right) \tag{5}$$

$$h_t = \tanh\left(U^{(h)} x_t + (s_{t-1} \odot r_t) W^{(h)}\right) \tag{6}$$

$$s_t = (1 - z_t) \odot h_t + z_t \odot s_{t-1} \tag{7}$$

In **Figure 1**, *x*<sub>*t*</sub><sup>*m*</sup> represents the *m*-th modality of data at time point *t*, and *T*<sub>*m*</sub> is the maximum time length of the *m*-th modality. A single GRU takes each modality of time series data separately to learn a fixed-length representation in the first phase. Note that every modality of data is assumed to be time series data in our package. Single time point modalities are treated as length-1 time series for ease of integration. Without multiple time points of input data, the GRU is simply a fully connected network with a prior hidden state. Thus, the GRU component can take non-time series data as well as time series data. The feature representations learned in the first phase are optimized using only a single modality of data. Thus, phase 1 can be used as a feature learning phase for a single domain of source.
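The gate equations (4)–(7) and the treatment of single time point data as a length-1 series can be sketched as follows. This is a minimal NumPy illustration, not the package's code: the parameter names, sizes, random initialization, and zero initial state are assumptions, and vectors are treated as rows, so the products are written `x_t @ U` and `s_prev @ W`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, s_prev, p):
    z = sigmoid(x_t @ p["Uz"] + s_prev @ p["Wz"])        # update gate, equation (4)
    r = sigmoid(x_t @ p["Ur"] + s_prev @ p["Wr"])        # reset gate, equation (5)
    h = np.tanh(x_t @ p["Uh"] + (s_prev * r) @ p["Wh"])  # hidden value, equation (6)
    return (1.0 - z) * h + z * s_prev                    # new memory, equation (7)

def gru_encode(X, p):
    """Map a sequence X of shape (T, d_in) to a fixed-size vector (the last state)."""
    s = np.zeros(p["Wz"].shape[0])   # initial memory state, assumed to be zero
    for x_t in X:
        s = gru_step(x_t, s, p)
    return s

def init_params(d_in, d_hidden, rng):
    # Hypothetical random initialization; a real model would learn these values.
    p = {k: rng.normal(scale=0.5, size=(d_in, d_hidden)) for k in ("Uz", "Ur", "Uh")}
    p.update({k: rng.normal(scale=0.5, size=(d_hidden, d_hidden))
              for k in ("Wz", "Wr", "Wh")})
    return p

rng = np.random.default_rng(0)
params = init_params(d_in=3, d_hidden=4, rng=rng)
series = rng.normal(size=(5, 3))   # a modality with five time points
single = rng.normal(size=(1, 3))   # a non-time series modality as a length-1 series
v_series = gru_encode(series, params)
v_single = gru_encode(single, params)
```

Both calls return a vector of the same size, which is what allows modalities with different numbers of time points to be concatenated later.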

#### PHASE 2: FINAL CLASSIFICATION

In the second phase, the multiple domains of data are integrated. The feature representations are learned separately in the first phase. Thus, a vector produced by a GRU component contains only the information of a single modality. To learn an integrative feature representation in the second phase, the vectors are simply concatenated (**Figure 1**). Based on the concatenated vector, any classification algorithm can be used in phase 2. In our package, we provide LR because it yields good interpretability through analysis of the beta coefficients of the trained classifier. In the experiments with real data and simulation data, an LR model was also used for the final decision.

LR is a classification algorithm whose outcome is the probability of each of two classes. The sigmoid function transforms a linear combination of the input features into a probability value that can be mapped to the binary class. We apply *l*1-regularized LR for the classification. The python library scikit-learn (sklearn) (Bengio et al., 1994) is used for LR in our package.
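The concatenation step and the LR decision function can be illustrated with a short sketch. It is written in plain NumPy rather than with scikit-learn, and everything in it (function names, feature values, the regularization weight `lam`) is hypothetical; it only shows how the per-modality vectors become one input and how the *l*1 penalty enters the loss.

```python
import numpy as np

def integrate(features):
    """Concatenate per-modality feature vectors into a single input vector."""
    return np.concatenate(features)

def lr_probability(v, beta, intercept):
    """Sigmoid of the linear combination of the input features."""
    return 1.0 / (1.0 + np.exp(-(v @ beta + intercept)))

def l1_penalized_loss(prob, label, beta, lam):
    """Cross-entropy for one sample plus an l1 penalty on the coefficients."""
    eps = 1e-12                                   # guards against log(0)
    ce = -(label * np.log(prob + eps) + (1 - label) * np.log(1 - prob + eps))
    return ce + lam * np.abs(beta).sum()

# Hypothetical phase 1 outputs for three modalities of one subject.
features = [np.array([0.2, -0.1]), np.array([0.5]), np.array([0.0, 0.3, -0.4])]
v = integrate(features)
beta = np.zeros(v.size)                           # untrained coefficients
p = lr_probability(v, beta, 0.0)
loss = l1_penalized_loss(p, 1, beta, lam=0.1)
```

Because each coefficient in `beta` corresponds to one position of the concatenated vector, a coefficient driven to zero by the *l*1 penalty marks a learned feature that the classifier does not use.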

#### RESULTS

To validate the performance of our package, experiments on simulation data and real data are conducted. In the experiment with simulation data, multimodal time series data are generated and tested for binary classification. The classification performance of our package is compared with other well-known methods such as logistic regression (LR), random forest (RF), and support vector machine (SVM). In the experiment with real data, four modalities of datasets, such as cognitive performance, cerebrospinal fluid (CSF), demographic data, and MRI data of patients in Alzheimer's disease, are used for MCI conversion prediction that is also set to binary classification.

#### CLASSIFICATION TASK ON THE SIMULATION DATA

In this section, we demonstrate the performance improvement using multimodal data and time series data. In the first experiment, only a single time point of data is used to evaluate the performance improvement of MildInt over other prominent classification algorithms such as SVM, LR, and RF. In the following experiment, the performance of using time series data is observed to evaluate the effectiveness of applying additional time points of data.

To generate time series data for binary classification, we apply the autoregressive model. First, two underlying networks *A*<sub>0</sub> and *A*<sub>1</sub> are generated as the parameters of the autoregressive model. Each individual record is assumed to be generated from an underlying network: 0-labeled data are generated from network *A*<sub>0</sub> and 1-labeled data from *A*<sub>1</sub>. The underlying network *A*<sub>0</sub> is built by randomly setting each edge to either 0 or 1, and the network *A*<sub>1</sub> is built from *A*<sub>0</sub> with a distance *d* ranging from 0 to 1 as in equation (8), where *A*<sub>0</sub><sup>*ij*</sup> is the element in the *i*-th row and *j*-th column of the *n* × *n* network *A*<sub>0</sub>.

$$A_1^{ij} = \left| A_0^{ij} - d \right|, \qquad \text{for } 1 \le i, j \le n \tag{8}$$

The distance *d* controls how distinguishable the two matrices *A*<sub>0</sub> and *A*<sub>1</sub> are. For example, if *d* = 1, then *A*<sub>0</sub> and *A*<sub>1</sub> are opposite matrices: edges in *A*<sub>0</sub> are not in *A*<sub>1</sub>, and edges in *A*<sub>1</sub> are not in *A*<sub>0</sub>. On the other hand, if *d* = 0, *A*<sub>0</sub> and *A*<sub>1</sub> are exactly the same. Thus, a dataset generated with a higher *d* is easier to separate. Second, we pick sets of nodes from the underlying network to make subnetworks. Each subnetwork is treated as one modality of data, because each modality is assumed to carry part of the information needed to understand the entire network. Finally, time series data are generated using the nonlinear autoregressive model in equation (9), where *M* is a subnetwork and ε is an error term with mean 0 and variance 0.1.

$$x^t = \sigma\left(M x^{t-1} + \varepsilon\right), \qquad \varepsilon \sim \mathcal{N}(0, 0.1) \tag{9}$$

$$x^0 \sim \mathcal{U}(-1, 1)$$
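The data generation procedure can be sketched as below. Several details are assumptions rather than facts from the text: σ in equation (9) is taken to be the logistic sigmoid, the initial value *x*<sup>0</sup> is drawn uniformly from (−1, 1), the subnetwork is obtained by selecting the same node set for rows and columns, and the network size and node set are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_networks(n, d, rng):
    A0 = rng.integers(0, 2, size=(n, n)).astype(float)  # each edge is 0 or 1
    A1 = np.abs(A0 - d)                                  # equation (8)
    return A0, A1

def simulate_series(M, T, rng, noise_var=0.1):
    """Generate T time points from subnetwork M with equation (9)."""
    k = M.shape[0]
    x = rng.uniform(-1.0, 1.0, size=k)                   # x^0, assumed uniform
    series = []
    for _ in range(T):
        eps = rng.normal(0.0, np.sqrt(noise_var), size=k)  # variance 0.1
        x = sigmoid(M @ x + eps)                         # sigma assumed to be sigmoid
        series.append(x)
    return np.stack(series)

rng = np.random.default_rng(0)
A0, A1 = make_networks(n=6, d=1.0, rng=rng)
nodes = np.array([0, 2, 3])              # hypothetical node set for one modality
M0 = A0[np.ix_(nodes, nodes)]            # subnetwork of A0 restricted to these nodes
sample = simulate_series(M0, T=10, rng=rng)
```

With `d=1.0` the second network is the complement of the first, and with `d=0.0` the two networks coincide, matching the two extreme cases described in the text.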

We generated 1,000 samples, each with 10 time points. Among the 1,000 samples, only 500 contain all modalities of data, while the rest have only some of the modalities. For evaluation, we ran fivefold cross-validation 10 times, with every fold having the same ratio of positive and negative samples.
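The evaluation protocol, fivefold cross-validation repeated 10 times with the class ratio preserved in every fold, can be sketched without any machine learning library. The helper name and the balanced toy labels are assumptions; the class balance of the simulated data is not stated in the text.

```python
import numpy as np

def stratified_folds(labels, n_folds, rng):
    """Assign each sample a fold index so that every fold keeps the class ratio."""
    labels = np.asarray(labels)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold_of[idx] = np.arange(len(idx)) % n_folds   # deal samples round-robin
    return fold_of

rng = np.random.default_rng(0)
labels = np.array([0] * 500 + [1] * 500)   # assumed balanced toy labels
n_folds, n_repeats = 5, 10
for repeat in range(n_repeats):
    fold_of = stratified_folds(labels, n_folds, rng)
    for k in range(n_folds):
        test_idx = np.flatnonzero(fold_of == k)
        train_idx = np.flatnonzero(fold_of != k)
        # A classifier would be fit on train_idx and scored on test_idx here.
```

Reshuffling within each class before every repeat gives 10 different partitions while keeping the positive-to-negative ratio identical across folds.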

In **Figure 2**, we used only a single time point of data to compare classification performance by modality. **Figure 2A** shows that the accuracies of SVM, RF, LR, and MildInt are inconsistent across distances, since a single modality of data does not contain enough information for understanding the whole underlying networks; the performance is therefore more affected by the error term. In contrast, performance using multiple modalities of data is less affected by the error term. As shown in **Figure 2B**, accuracy improves consistently over distances from 0.5 to 1.0. In particular, MildInt reaches an accuracy of 1.0 over distances from 0.8 to 1.0, since MildInt can take non-overlapping as well as overlapping samples as input, while SVM, RF, and LR can only use overlapping samples.

From **Figure 3**, we can see the effectiveness of using time series data. As the number of time points increases, the performance using a single modality increases consistently (**Figure 3A**). Using multi-modality time series data with more than six time points, the two sets of data are perfectly classified for distances from 0.5 to 1.0, as seen in **Figure 3B**. Intuitively, data from multiple time points carry more information than data at a single time point. Thus, MildInt can exploit temporal changes in time series data for correct classification.

#### CLASSIFICATION TASK ON THE REAL DATASET

For the experiment with real data, we used 865 subjects with MCI obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort for predicting AD progression. The overall objective of ADNI is to test whether neuroimaging, biological markers, and clinical and neuropsychological assessment can be combined to measure AD progression. We downloaded four modalities of data, namely cognitive performance, CSF, magnetic resonance imaging (MRI), and demographic information, which have 802, 601, 865, and 865 samples, respectively, from the ADNI data repository (http://adni.loni.usc.edu). Informed consent was obtained for all subjects, and the study was approved by the relevant institutional review board at each data acquisition site (for up-to-date information, see http://adni.loni.usc.edu/wp-content/themes/freshnews-dev-v2/documents/policy/ADNI\_Acknowledgement\_List%205-29-18.pdf). All methods were performed in accordance with the relevant guidelines and regulations. Among the four modalities of samples, 601 overlapping samples are available, with 200 MCI converter and 401 MCI non-converter samples. Cognitive performance and CSF are time series data with average lengths of 4.05 and 1.69, respectively. MRI and demographic information are treated as length-1 time series data in our package. Note that all modalities are given in numerical vector form. For example, we extracted gender, age, level of education, and cognitive assessment from patients' records. For MRI data in particular, preprocessing was performed to extract features relevant to predicting MCI conversion, such as total intracranial volume, hippocampal volume, and entorhinal cortex thickness. Recent feature extraction methods (Lama et al., 2017; Sandeep et al., 2017) can also be used before running our package. The summary statistics of samples and hyperparameters are shown in **Table 1**.

**Figure 4** shows the accuracies of our package using time series data. We removed the accuracy from the model with demographic data because the prediction performance was too low. The performance improvement using time series data is marginal due to the sparsity of time points. More than half of the samples contain missing values, and even the length of time points is short. Furthermore, we have longitudinal samples for only two modalities of data (cognitive performance and CSF). Thus, it is hardly expected that the performance is enhanced using longitudinal data. However, classification accuracy was improved using multiple domains of data. As seen in **Figure 4**, integrating four sources of data shows the best predictive performance compared with the performance with single modalities. Finally, we compared the performance of MildInt with previously developed methods for MCI conversion prediction. As observed in **Table 2**, MildInt showed comparable prediction results.



*CSF, cerebrospinal fluid; MRI, magnetic resonance imaging.*


*MCI-C, MCI-Converter; MCI-NC, MCI-NonConverter; ACC, Accuracy; SEN, Sensitivity; SPE, Specificity; APOE, Apolipoprotein E; FDG, Fluorodeoxyglucose.*

#### CONCLUSION

MildInt provides a multimodal GRU for heterogeneous data integration. The main advantage of our framework is that variable-length time series data and multimodal data can be processed. In addition, every available sample from all modalities, including non-overlapping samples, can be used for training the classifier. The performance of MildInt was evaluated with simulation data and real data. In the experiment with simulation data, it showed the best performance when multimodal data and time series data were integrated. Additionally, in the experiment with real data, integrating cognitive performance, demographic information, CSF, and MRI data showed the best performance for MCI conversion prediction. Also, any numerical form of data, such as gene expression, methylation, and single nucleotide polymorphism data, can be combined in our package. MildInt is suitable for cases where time series data (such as methylation data at multiple time points) and non-time series data (such as single nucleotide polymorphisms) should be incorporated to learn an integrative feature representation. Furthermore, MildInt showed prediction ability comparable to that of previously developed methods while efficiently incorporating multiple domains of resources.

#### REQUIREMENTS

This package works on Python 2.7.x on platforms such as Mac OS X, Windows, and Linux. MildInt requires Python packages such as Pandas, Numpy, Tensorflow, and Sklearn to be installed independently. For MildInt to be fully functional, Tensorflow should be installed with support for NVIDIA graphics processing units (GPUs). The GPU-enabled version of Tensorflow has requirements such as 64-bit Linux, NVIDIA CUDA 7.5 (CUDA 8.0 required for Pascal GPUs), and NVIDIA cuDNN v4.0 (minimum) or v5.1 (recommended).
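A quick way to check that the listed packages are importable before running the package is sketched below. The helper name `missing_packages` is hypothetical and not part of MildInt; the module names are the usual import names of the packages listed above.

```python
import importlib

# Import names of the packages listed in the requirements.
REQUIRED = ["pandas", "numpy", "tensorflow", "sklearn"]


def missing_packages(names=REQUIRED):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Calling `missing_packages()` with no arguments returns an empty list when all four requirements are installed. It does not verify GPU, CUDA, or cuDNN availability.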

#### AUTHOR CONTRIBUTIONS

This study was conceived by GL, K-AS, and DK. Experiments were designed and performed by all authors. The manuscript was initially written by GL. All the authors revised the manuscript and approved the final version prior to submission.

#### FUNDING

The support for this research was provided by NLM R01 LM012535, NIA R03 AG054936, and the Pennsylvania Department of Health (#SAP 4100070267). The department specifically disclaims responsibility for any analyses, interpretations, or conclusions. This work was also supported by the National Research Foundation of Korea grant funded by the Korea government (MSIT) (no. NRF-2019R1A2C1006608).



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Lee, Kang, Nho, Sohn and Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Efficient Mining of Variants From Trios for Ventricular Septal Defect Association Study

*Peng Jiang<sup>1†</sup>, Yaofei Hu<sup>1†</sup>, Yiqi Wang<sup>1</sup>, Jin Zhang<sup>1</sup>, Qinghong Zhu<sup>1</sup>, Lin Bai<sup>2</sup>, Qiang Tong<sup>1\*</sup>, Tao Li<sup>1\*</sup> and Liang Zhao<sup>1,2\*</sup>*

*<sup>1</sup> Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, China, <sup>2</sup> School of Computing and Electronic Information, Guangxi University, Nanning, China*

#### *Edited by:*

*Tao Zeng, Shanghai Institutes for Biological Sciences (CAS), China*

#### *Reviewed by:*

*Wenting Liu, Genome Institute of Singapore, Singapore Zhenhua Li, National University of Singapore, Singapore*

#### *\*Correspondence:*

*Qiang Tong tttrrrxxx@163.com Tao Li 317371983@qq.com Liang Zhao s080011@e.ntu.edu.sg*

*†These authors share first authorship*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 22 December 2018 Accepted: 27 June 2019 Published: 08 August 2019*

#### *Citation:*

*Jiang P, Hu Y, Wang Y, Zhang J, Zhu Q, Bai L, Tong Q, Li T and Zhao L (2019) Efficient Mining of Variants From Trios for Ventricular Septal Defect Association Study. Front. Genet. 10:670. doi: 10.3389/fgene.2019.00670*

Ventricular septal defect (VSD) is a fatal congenital heart disease with severe consequences in affected infants. Early diagnosis plays an important role, particularly through genetic variants. Existing panel-based approaches to variant mining suffer from a shortage of large panels, costly sequencing, and missed rare variants. Although a trio-based method alleviates these limitations to some extent, it is agnostic to novel mutations and computationally intensive. Considering these limitations, we propose a novel variant mining algorithm for trio-based sequencing data and apply it to a VSD trio to identify associated mutations. Our approach starts by filtering irrelevant *k*-mers from the sequences of a trio *via* a newly conceived coupled Bloom Filter, then corrects sequencing errors using a statistical approach and extends the kept *k*-mers into long sequences. These extended sequences are used as input for variant calling. Later, the obtained variants are comprehensively analyzed against existing databases to mine VSD-related mutations. Experiments show that our trio-based algorithm narrows down candidate coding genes and lncRNAs by about 10- and 5-fold, respectively, compared with single sequence-based approaches. Meanwhile, our algorithm is 10 times faster and two orders of magnitude more memory-frugal than the existing state-of-the-art approach. By applying our approach to a VSD trio, we fish out an unreported gene (CD80), a combination of two genes (MYBPC3 and TRDN), and a lncRNA (NONHSAT096266.2), which are highly likely to be VSD-related.

Keywords: trio-sequencing, k-mer filtering, variant calling, ventricular septal defect, association study, long non-coding RNA

#### INTRODUCTION

Ventricular septal defect (VSD) is a major kind of congenital heart disease (CHD), constituting about 20% of all CHD cases (Spicer et al., 2014). By taking conservative treatment, mortality is around 90% to 95%, whereas *via* surgery, this rate reduces to 19% to 60% (Serpytis et al., 2015). Very often, diagnosis of a VSD patient is at its late stage due to the obvious communication obstacles in infants; this poses a need for early diagnosis, particularly through genetic variants.

Mining genetic variants and associating them with diseases is a hot topic, by which thousands of disease-associated variants have been identified (The International HapMap 3 Consortium, 2010; The 1000 Genomes Project Consortium et al., 2015). Obtaining these findings usually starts with a panel containing hundreds to thousands of patients diagnosed as having the same specific disease; later their genetic materials are extracted and sequenced. This is followed by disease-associated variant mining through a series of analytic procedures. Using this protocol, 89,251 single-nucleotide polymorphism (SNP) trait associations have been successfully pinpointed according to the genome-wide association study (GWAS) catalog (MacArthur et al., 2017), including more than 400 CHD-related genes (Jin et al., 2017a). Although association studies are fruitful and promising, several issues weaken their applicability. First, panel-based association studies only identify common variants, and rare variants are overlooked due to low statistical significance. Thus, a large number of samples must be collected, i.e., hundreds or even thousands of cases. Second, almost all existing studies mine a one-to-one correspondence between genes and diseases rather than a many-to-many scheme, which is considerably more challenging. Unfortunately, the majority of diseases are caused by mutations in many genes. For instance, more than 400 genes have been discovered to be associated with CHD (Jin et al., 2017a), and more than 700 genes are involved in adult height (Wood et al., 2014), and even more according to a later study (Marouli et al., 2017). Third, it is costly to obtain the whole DNA (deoxyribonucleic acid) sequence of a sample. Although ever-increasing throughput and decreasing cost have made whole-genome sequencing possible for general research, it still costs a few hundred to a thousand dollars for a single genome. To partially overcome the aforementioned limitations of single sequencing (SS) data, trio-based sequencing has emerged.

Typically, a trio contains two parents and one child. This trio-based approach is effective for identifying disease-associated genes according to the basic rule of inheritance. It is also powerful for pinpointing *de novo* mutations without a large panel. Various studies have been conducted to identify disease-associated genes by using trio-sequencing (TS). For instance, trio-based exome sequencing has been used to identify *de novo* mutations in early-onset high myopia (Jin et al., 2017b), and ~440 CHD-related genes have been discovered based on 2,645 trios (Jin et al., 2017a). The typical procedure for using trios to identify variants is mapping-calling-filtering, i.e., mapping all sequences of each individual from a trio to a reference genome, calling variants based on the mapped sequences, and filtering out variants shared by members of the trio. This protocol is inefficient for identifying *de novo* mutations from child sequences: a large portion of the sequences contribute nothing to variant calling, yet they are carried through the whole process for all samples within the trio. To solve this problem, we propose a novel way of calling *de novo* variants from a trio and have applied it to identify VSD-related genetic variants, including coding genes and long non-coding RNAs (lncRNAs).

Our approach starts from a trio with a child diagnosed as having VSD but with healthy parents. Later, unique *k*-mers (*k*-length consecutive bases from a genomic sequence) belonging to the child only are fished out through a newly proposed counted *k*-mer-encoding algorithm. This is followed by sequence error correction and *k*-mer extension before mapping to a reference genome. Finally, variants are fished out and analyzed against existing databases to mine VSD-related coding genes and lncRNAs.

#### METHODS

Our approach is composed of two major parts: TS-based variant mining and VSD-related variant filtering.

#### Variant Mining

Unlike the conventional mapping-calling-filtering approach to variant identification, e.g., SAMtools (Li et al., 2009) and GATK (DePristo et al., 2011), we devise a *de novo* variant identification algorithm for trios that achieves good computational efficiency. Our approach contains four steps: *k*-mer filtering, error correction, *k*-mer extension, and variant identification. Details are shown below.

#### *k*-mer Filtering

Let a trio be *Rf*, *Rm*, *Rc*, representing the reads of the father (*Rf*), the reads of the mother (*Rm*), and the reads of the child (*Rc*, supposing only one child is available); the sets of *k*-mers contained in the samples are *Kf*, *Km*, and *Kc* for father, mother, and child, respectively. Herein, each *k*-mer has its count (the number of times it appears within the sequenced data) available, i.e., a *k*-mer, say κ, is a tuple (*s*κ, *f*κ), where *s*κ is the *k*-length string of κ, and *f*κ is its count. To fish out *de novo* mutations from *Kc*, we go through all the *k*-mers of *Kc* and check them against *Kf* and *Km*. If the count ratio of a *k*-mer between each parent and the child is less than a threshold (τ0), the *k*-mer is kept as a variant-containing candidate.
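The ratio test described above can be sketched with plain dictionaries standing in for the memory-efficient encoding introduced next. The reads, the value of τ0, and the function names below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-length substring across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def child_specific_kmers(K_f, K_m, K_c, tau0):
    """Keep child k-mers whose parent/child count ratio is below tau0 for both parents."""
    kept = {}
    for s, f_c in K_c.items():
        if K_f.get(s, 0) / f_c < tau0 and K_m.get(s, 0) / f_c < tau0:
            kept[s] = f_c
    return kept


# Toy reads (hypothetical): the child carries a T>A change absent from both parents.
K_f = count_kmers(["ACGTAC"], k=4)
K_m = count_kmers(["ACGTAC"], k=4)
K_c = count_kmers(["ACGTAC", "ACGAAC", "ACGAAC"], k=4)
kept = child_specific_kmers(K_f, K_m, K_c, tau0=0.5)
```

Only the three *k*-mers spanning the child-specific base survive; the *k*-mers shared with the parents are discarded.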

It seems trivial to filter out the large number of *k*-mers shared between *Kf*/*Km* and *Kc*. However, the number of *k*-mers obtained from whole human genome sequencing reads is usually too large to fit into main memory, not to mention holding three such sets together. For instance, the 31-mers with a count larger than one from the HapMap sample NA12878 (https://www.ncbi.nlm.nih.gov/sra/ERR091571/) take 90 Gb of disk space. To solve this problem, we have designed a novel coupled Bloom Filter-based algorithm that achieves a high memory saving ratio and good retrieval efficiency (Jiang et al., 2019). Let *f*max be the maximum frequency in *K*, which can be represented by at most *h* bits (in binary). We take the following steps to represent *K*:


$$\begin{aligned} B^{+}\left[H_i\left(s_\kappa\right)\right] &= 1, & i \in \{0, 1, \cdots, h-1\}, \\ B^{-}\left[H_i\left(s_\kappa\right)\right] &= \text{Binary}\left(f_\kappa\right)^{h}\left[i\right], & i \in \{0, 1, \cdots, h-1\}, \end{aligned}$$

where Binary(*f*κ)<sup>*h*</sup> is the binary representation of *f*κ *via h* bits, and Binary(*f*κ)<sup>*h*</sup>[*i*] returns the value of the *i*th bit.

4. Repeat Step 3 above until all *k*-mers are inserted.

Based on the above steps, *Kf* , *K*m, and *Kc* can be saved into *B*<sup>f</sup> , *B*m, and *B*c economically; more details are shown in Jiang et al. (2019).

Based on above algorithm, we are able to store *Kf* , *K*m, and *K*c within a memory simultaneously and compute the ratio of a *k*-mer between a parent and the child efficiently. Note that, the time efficiency of *k*-mer retrieval from a coupled Bloom Filter mainly comes from the hash operation, which is in O(1) time complexity.
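The two assignments defining *B*<sup>+</sup> and *B*<sup>−</sup> can be sketched as a small class. This is an illustration under stated assumptions, not the implementation of Jiang et al. (2019): the hash functions (salted SHA-256), the least-significant-first bit order, the decoding rule, and the use of one byte per bit are all choices made here for readability, and collision handling is not reproduced.

```python
import hashlib


class CoupledBloomFilter:
    """Toy coupled Bloom filter: B+ marks presence, B- stores the count bits."""

    def __init__(self, n_bits, h):
        self.n_bits = n_bits              # size of each array
        self.h = h                        # bits per count = number of hash functions
        self.b_plus = bytearray(n_bits)   # presence array B+ (one byte per bit, for clarity)
        self.b_minus = bytearray(n_bits)  # count-bit array B-

    def _hash(self, i, s):
        # Hypothetical H_i: SHA-256 of the k-mer salted with the index i.
        digest = hashlib.sha256(("%d:%s" % (i, s)).encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.n_bits

    def insert(self, s, f):
        if f >= 1 << self.h:
            raise ValueError("count does not fit in h bits")
        for i in range(self.h):
            pos = self._hash(i, s)
            self.b_plus[pos] = 1              # B+[H_i(s)] = 1
            self.b_minus[pos] = (f >> i) & 1  # B-[H_i(s)] = i-th bit of f

    def count(self, s):
        # Assumed decoding rule: absent if any B+ position is 0,
        # otherwise reassemble the count from the h stored bits.
        f = 0
        for i in range(self.h):
            pos = self._hash(i, s)
            if not self.b_plus[pos]:
                return 0
            f |= self.b_minus[pos] << i
        return f


bf = CoupledBloomFilter(n_bits=1 << 20, h=4)
bf.insert("ACGT", 5)
bf.insert("TTTT", 3)
```

As with any Bloom filter, hash collisions can produce false positives or distorted counts, so the array size must be chosen relative to the number of *k*-mers stored.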

Due to sequencing bias, the *k*-mers are error-prone. To mitigate the impact of errors on variant identification, we perform error correction before further analysis (Zhao et al., 2018). For a *k*-mer κ in *Kx*, we search its neighbors *N*κ from *Bx*, where *x* ∈ {*f*, *m*, *c*}. A neighbor of κ is defined as a *k*-mer having an edit distance of 1 from κ. Later, a *z*-score *z*κ is calculated from *N*′κ = *N*κ ∪ {κ}, where *z*κ = (*f*κ − μ)/σ, μ is the mean frequency of the *k*-mers in *N*′κ, and σ is their standard deviation. We consider κ error-free when *z*κ > *z*0 and *f*κ > *f*0. In this study, *z*0 = 0.8 and *f*0 = 4. More details are presented in Zhao et al. (2018).
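The *z*-score test can be sketched as follows. Several details are assumptions of this sketch rather than statements about Zhao et al. (2018): neighbors are restricted to single-base substitutions (which keep the length *k*), σ is the population standard deviation, a *k*-mer with no observed neighbor is judged on its count alone, and a zero standard deviation is treated as failing the test.

```python
import statistics


def substitution_neighbors(s):
    """Yield all strings at Hamming distance 1 from s (assumed neighbor definition)."""
    for i, base in enumerate(s):
        for alt in "ACGT":
            if alt != base:
                yield s[:i] + alt + s[i + 1:]


def is_error_free(s, counts, z0=0.8, f0=4):
    """Apply the z-score and count thresholds to the k-mer s."""
    f = counts.get(s, 0)
    if f <= f0:
        return False
    neighbor_counts = [counts[n] for n in substitution_neighbors(s) if n in counts]
    if not neighbor_counts:
        return True  # assumption: no neighbors observed, count threshold only
    freqs = neighbor_counts + [f]  # N' = N ∪ {kappa}
    mu = statistics.mean(freqs)
    sigma = statistics.pstdev(freqs)
    if sigma == 0:
        return False  # assumption: z is undefined, treated as not exceeding z0
    return (f - mu) / sigma > z0


# Hypothetical counts: one abundant k-mer with two rare, error-like neighbors.
counts = {"ACGT": 30, "ACGA": 2, "ACGC": 1}
```

Here `ACGT` has a count well above its neighbors and passes, whereas `ACGA` fails the count threshold.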

Data: (*Kf*, *Km*, *Kc*), *k*-mers of a trio; *H*, a hash function vector

Result: *K*′*c*, mutation-containing *k*-mers of the child

    begin
      for x in {f, m, c} do
        Bx ← Encoding(Kx, H)
      for κ in Kc do
        νf ← Decoding(Bf, H, sκ)
        νm ← Decoding(Bm, H, sκ)
        νc ← Decoding(Bc, H, sκ)
        if νf/νc < τ0 and νm/νc < τ0 then K′c ← Correction(Bc, H, K′c)
      return K′c
    // Details of Encoding, Decoding and Correction are shown in Appendix A.

#### *k*-mer Extension

A *k*-mer is usually not long enough to map uniquely to a specific location of a reference genome. Hence, extending a *k*-mer into a long sequence is necessary before mapping. To this end, we take a candidate variation-containing *k*-mer as a seed and elongate it on both sides. For right-hand extension, one base at a time is attached to the right of the current string *s*, i.e., *s*′ = *s* ⋅ *x*, *x* ∈ {*A*, *C*, *G*, *T*}, and the *k*-length suffix of *s*′, i.e., suffix(*s*′)<sub>*k*</sub> = *s*′<sub>(*l*−*k*−1):*l*</sub>, is checked against *B*c. If the suffix is absent, the extension is retried with another base, or terminated if all alternatives have failed. The left-hand extension is similar to the right-hand extension but proceeds in the opposite direction. An extension is also terminated when the length limit is reached or multiple extensions are available. We set the length limit to 1,000 in this study. Extension details are shown in Algorithm 2.

#### ALGORITHM 2: *k*-mer extension.

```
Data: Bc, child k-mers; H, a hash function vector; K′c, kept k-mers;
      maxLen, maximum length
Result: S, set of variant-containing sequences

begin
  for κ in K′c do
    // right-hand extension
    hasBranch ← 0
    s′ ← sκ
    repeat
      c ← 0, e ← ''
      for x in {A, C, G, T} do
        s″ ← suffix(s′, k – 1) · x
        val ← Decoding(Bc, H, s″)
        if val > 0 then
          c ← c + 1, e ← x
      if c > 1 then
        hasBranch ← 1
      else
        s′ ← s′ · e
    until hasBranch or c = 0 or |s′| > maxLen
    // left-hand extension
    hasBranch ← 0
    repeat
      c ← 0, e ← ''
      for x in {A, C, G, T} do
        s″ ← x · prefix(s′, k – 1)
        val ← Decoding(Bc, H, s″)
        if val > 0 then
          c ← c + 1, e ← x
      if c > 1 then
        hasBranch ← 1
      else
        s′ ← e · s′
    until hasBranch or c = 0 or |s′| > maxLen
    S ← S ∪ {s′}
  return S
```

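The extension procedure can also be sketched in Python. Here a plain dictionary of child *k*-mer counts stands in for the coupled Bloom Filter *B*c, the function name is hypothetical, and the loop stops when there is no unique extension (none or several) or when the length limit is reached; *k* is assumed to be at least 2.

```python
def extend_kmer(seed, child_counts, max_len=1000):
    """Extend a seed k-mer to both sides while exactly one extension exists."""
    k = len(seed)
    s = seed
    # Right-hand extension: append a base if exactly one resulting k-mer is present.
    while len(s) < max_len:
        cands = [x for x in "ACGT" if child_counts.get(s[-(k - 1):] + x, 0) > 0]
        if len(cands) != 1:  # dead end (0) or branch (>1)
            break
        s += cands[0]
    # Left-hand extension: same rule, prepending instead of appending.
    while len(s) < max_len:
        cands = [x for x in "ACGT" if child_counts.get(x + s[:k - 1], 0) > 0]
        if len(cands) != 1:
            break
        s = cands[0] + s
    return s


# Toy child k-mer set (hypothetical), taken from the sequence ACGTTGCA with k = 4.
child = {"ACGT": 3, "CGTT": 3, "GTTG": 3, "TTGC": 3, "TGCA": 3}
extended = extend_kmer("GTTG", child)

# Adding TTGA creates a branch on the right, so only the left side extends.
branched = extend_kmer("GTTG", dict(child, TTGA=2))
```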
#### Variant Identification

All extended *k*-mers are mapped to GRCh38/hg38 by BWA (Li and Durbin, 2009), and variants as well as their positions are pinpointed by using SAMtools (Li et al., 2009) and the best practice of GATK (DePristo et al., 2011). Unlike most existing approaches, which use read coverage to filter out low-confidence variants, we use the previously identified variant-containing *k*-mers (from the first step, with counts included) to refine the obtained variants. More precisely, a variant is kept if it satisfies the following criteria: 1) a *k*-mer containing the variant (formed jointly by the reference genome and the variant) can be found in the set of kept *k*-mers obtained from the first step; 2) the sum of the counts of all *k*-mers covering the variant is not less than 3. Note that the first criterion is necessary because extended *k*-mers introduce additional variants that are not unique to the child.
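For a single-nucleotide variant, the two criteria can be sketched as below. The sketch assumes 0-based positions and handles substitutions only; the toy reference, the kept counts, and the function names are hypothetical.

```python
def kmers_covering_snv(reference, pos, alt, k):
    """List the k-mers that span position pos after substituting the alternate base."""
    mutated = reference[:pos] + alt + reference[pos + 1:]
    start_min = max(0, pos - k + 1)
    start_max = min(pos, len(mutated) - k)
    return [mutated[i:i + k] for i in range(start_min, start_max + 1)]


def keep_variant(reference, pos, alt, kept_counts, k, min_support=3):
    """Criterion 1: some covering k-mer is in the kept set.
    Criterion 2: the summed count of covering kept k-mers is at least min_support."""
    covering = set(kmers_covering_snv(reference, pos, alt, k))
    found = any(s in kept_counts for s in covering)
    support = sum(kept_counts.get(s, 0) for s in covering)
    return found and support >= min_support


reference = "ACGTACGT"  # toy reference segment
well_supported = keep_variant(reference, 3, "A", {"ACGA": 2, "CGAA": 2}, k=4)
weakly_supported = keep_variant(reference, 3, "A", {"ACGA": 2}, k=4)
```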

#### Variant Filtering

We focus on VSD-related variants; thus, those obtained from the previous step undergo filtering to fish out VSD-related variants. Two types of variants are considered, viz., those contained in coding and non-coding regions. For affected non-coding genes, we pay special attention to long non-coding RNAs.


#### Identifying VSD-Related Variants

Only a tiny portion of variants obtained from GATK could be VSD-related. To obtain these variants, we first filter out irrelevant ones by GATK built-in modules with various parameters, including "QD < 2.0," "QUAL < 30.0," "SOR > 3.0," "FS > 60.0," "MQ < 40.0," "MQRankSum<-12.5," and "ReadPosRankSum<-8.0" for SNPs and "QD < 2.0," "QUAL < 30.0," "FS > 200.0," and "ReadPosRankSum<-20.0" for indels (insertions and deletions). This step is followed by using ANNOVAR (Wang et al., 2010) to filter out variants presented in known individuals with minor allele frequency (MAF) of 0.01. Reference databases used in this stage are the phase 3 of 1000 Genomes Project (The 1000 Genomes Project Consortium et al., 2015), ExAC (Lek et al., 2016), ESP (Exome Variant Server, 2019), and gnomAD (Lek et al., 2016). That is, a variant that appears in these databases having MAF no less than 0.01 is excluded.
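The thresholds listed above can be expressed as a small rule table. The sketch below applies them to a dictionary of annotation values; it illustrates the logic only and is not the GATK or ANNOVAR interface. Skipping annotations that are missing for a variant is an assumption of this sketch, and the function and database key names are hypothetical.

```python
# (annotation, comparison, threshold): a variant is filtered out if any rule matches.
SNP_RULES = [("QD", "<", 2.0), ("QUAL", "<", 30.0), ("SOR", ">", 3.0),
             ("FS", ">", 60.0), ("MQ", "<", 40.0),
             ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0)]
INDEL_RULES = [("QD", "<", 2.0), ("QUAL", "<", 30.0),
               ("FS", ">", 200.0), ("ReadPosRankSum", "<", -20.0)]


def fails_hard_filter(annotations, is_indel):
    """Return True if any hard-filter expression matches the variant."""
    rules = INDEL_RULES if is_indel else SNP_RULES
    for key, op, threshold in rules:
        value = annotations.get(key)
        if value is None:
            continue  # assumption: a missing annotation is not evaluated
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


def is_rare(mafs, cutoff=0.01):
    """Keep a variant only if its MAF is below the cutoff in every database.
    mafs maps a database name to a MAF, or to None if the variant is absent."""
    return all(m is None or m < cutoff for m in mafs.values())


good = {"QD": 12.0, "QUAL": 250.0, "SOR": 1.1, "FS": 3.2,
        "MQ": 60.0, "MQRankSum": 0.1, "ReadPosRankSum": 0.4}
strand_biased = dict(good, FS=75.0)
```

The same `FS` value of 75.0 fails the SNP rule (threshold 60.0) but passes the indel rule (threshold 200.0).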

After filtering, we use DAVID (Huang et al., 2009) to analyze functions of remaining variants. These variants are also validated by using Gene Ontology (GO) (The Gene Ontology Consortium, 2017), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2018), the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), and the Human Gene Mutation Database (HGMD) (Stenson et al., 2017). Functions and pathways of coding genes can be easily obtained by using DAVID, whereas they are unable to be obtained directly for noncoding transcripts. Hence, we handle them separately; see below.

#### Fishing Out Coding Genes

Taking the results generated by ANNOVAR, we select the variants having consequence of "Nonsense\_Mutation," "Frame\_Shift\_ Ins," "Frame\_Shift\_Del," "Translation\_Start\_Site," "Splice\_Site," "In\_Frame\_Ins," "In\_Frame\_Del," and "Missense\_Mutation." In addition, variants having SIFT score (Sim et al., 2012) larger than 0.05 and PolyPhen-2 index (Adzhubei et al., 2010) smaller than 0.446 are further filtered out. The remaining genes are input into DAVID to analyze gene-disease association, gene-annotation enrichment analysis, pathway mapping, and so on. Those genes related to cardiovascular diseases are fished out. In addition, the neurodegenerative diseases-related genes are also obtained because many studies have shown that these two diseases are closely related (Jin et al., 2017a).
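The selection rule for coding variants can be sketched as a predicate. It reads the SIFT and PolyPhen-2 condition literally as a conjunction (a variant is dropped only when both scores indicate a tolerated or benign change) and keeps variants lacking either score; both choices are assumptions of this sketch, and the function name is hypothetical.

```python
KEPT_CONSEQUENCES = {
    "Nonsense_Mutation", "Frame_Shift_Ins", "Frame_Shift_Del",
    "Translation_Start_Site", "Splice_Site",
    "In_Frame_Ins", "In_Frame_Del", "Missense_Mutation",
}


def is_candidate(consequence, sift=None, polyphen2=None):
    """Keep a variant with a selected consequence unless both scores call it benign."""
    if consequence not in KEPT_CONSEQUENCES:
        return False
    predicted_benign = (
        sift is not None and polyphen2 is not None
        and sift > 0.05 and polyphen2 < 0.446
    )
    return not predicted_benign
```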

The pinpointed genes are also checked with GO, OMIM, HGMD, and KEGG to verify their functions if available.

#### Pinpointing Out lncRNAs

A lncRNA is not translated into protein; however, it plays many roles in gene transcription regulation, post-transcriptional regulation, epigenetic regulation, aging, and so on (Marchese et al., 2017). Hence, mutations occurring in lncRNAs may affect downstream products. To identify the VSD-related variant-containing lncRNAs within the child, we extend the VSD-related genes (listed in Jin et al., 2017a) upstream and downstream by 100, 200, 500, and 1,000 bp. A variant is considered a candidate if it is within the extension region and overlaps with lncRNAs listed in LNCipedia (Volders et al., 2018) or NONCODE (Zhao et al., 2016). Note that this protocol identifies VSD-related lncRNAs approximately rather than directly. The rationale is that regulatory elements within proximity usually play a role together (Razin et al., 2013; Andrey and Mundlos, 2017).
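The proximity rule can be sketched with simple interval checks. The sketch treats the extended region as the gene interval widened by the flank on both sides, which is one reading of the description; all coordinates, the lncRNA name, and the function name are hypothetical.

```python
def candidate_lncrnas(chrom, variant_pos, genes, lncrnas, flank):
    """Return lncRNAs overlapping a variant that lies within `flank` bp of a known gene.

    genes:   list of (chrom, start, end)
    lncrnas: list of (name, chrom, start, end)
    """
    near_gene = any(
        g_chrom == chrom and start - flank <= variant_pos <= end + flank
        for g_chrom, start, end in genes
    )
    if not near_gene:
        return []
    return [
        name for name, l_chrom, start, end in lncrnas
        if l_chrom == chrom and start <= variant_pos <= end
    ]


# Hypothetical coordinates for illustration only.
genes = [("chr4", 10000, 20000)]
lncrnas = [("lnc-A", "chr4", 20300, 21000)]
hits_by_flank = {
    flank: candidate_lncrnas("chr4", 20400, genes, lncrnas, flank)
    for flank in (100, 200, 500, 1000)
}
```

In this toy case the variant lies 400 bp downstream of the gene, so it is reported only for the 500 and 1,000 bp extensions.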

Functions of identified lncRNAs are fully explored by using LNCipedia and NONCODE.

# RESULTS

#### Data Preparation

A trio consisting of a 3-year-old boy diagnosed as having typical VSD and his two healthy parents is collected. The DNA of each individual is extracted from 5 ml of venous blood and is sequenced on an Illumina HiSeq X Ten platform with a coverage of 30× and a read length of 151 bp. All DNA sequences of the three samples are obtained from one batch. As a result, 356,781,358 paired-end reads are obtained from the child, and 368,280,232 and 330,790,178 paired-end reads are obtained from his father and mother, respectively.

Before the sample collection, a written informed consent is obtained from the parents of the child.

## TS-Based Variants

Based on the TS data, we obtained 2,585,348 variants by using GATK (DePristo et al., 2011) with default settings. These variants are further divided into two types, i.e., protein coding and non-coding. For the non-coding variants, we focus on the regions transcribed into lncRNAs. Variants associated with both cardiovascular and neurodegenerative diseases are explored because they usually occur together (Jin et al., 2017a). Details are shown below.

#### Coding Genes Related to VSD

From the 2,585,348 variants, 193 within exonic regions and 6 from splicing regions pass various filtering criteria that are obtained by using ANNOVAR (Wang et al., 2010). The 193 variants are associated with 61 unique genes, whereas the 6 are involved in 8 genes; see more details in the **Supplementary File**.

Taking the 61 genes as input, we identify 14 genes related to cardiovascular diseases, including RASA1, CNOT2, MICALCL, MDFIC, PRDM7, ATXN1, CSGALNACT1, DYSF, GJB2, KRT35, MUC16, P2RX6, ZNF618, and CD80, and 5 genes related to neurodegenerative diseases, which are ATXN1, EPB41L1, PNPLA6, SYN2, and ERO1B (see **Figure 1** and **Table 1**). Among the 18 genes (ATXN1 appears in both categories), 5 of them have been confirmed by OMIM and GAD (Genetic Association Database) (Becker et al., 2004), of which 2 (RASA1 and ATXN1) are cardiovascular disease-related and 4 are neurodegenerative disease-related (ATXN1, EPB41L1, PNPLA6, SYN2), whereas the rest only appear in one database. Compared with all the 457 VSD-related genes shown in Jin et al. (2017a), we found that MUC16 is common for both data.

We also performed pathway enrichment test for the identified genes; however, no significant cardiovascular-related pathway can be identified. We found that only five genes overlap with the genes saved in KEGG.

After careful investigation of the 14 cardiovascular disease-related genes *via* literature review, we found that 13 of them have literature support for an association with CHD (VSD is the most common type of CHD), while CD80 has no explicit support. Hence, we take further effort to explore the possible roles of CD80 in VSD development.

FIGURE 1 | Variant-containing coding genes obtained from the trio that are associated with cardiovascular and neurodegenerative diseases. Panel (A) shows the 18 genes attached to the two categories [generated by using the STRING database (Szklarczyk et al., 2017)], panel (B) presents the connections between CD80- and CHD-related genes, and panel (C) illustrates the 3D structure of the mutated CD80 (PDB ID: 1I8L) discovered in this study. The 18 genes are fished out by using DAVID from the OMIM and GAD databases; genes identified by OMIM are shaded by a polygon. Note that all the genes identified by OMIM have also been confirmed by GAD.

*<sup>a</sup>Variant classes include MM (missense mutation), NM (nonsense mutation), FSD (frame shift deletion), IFD (in frame deletion), and IFI (in frame insertion); <sup>b</sup>affected number of transcripts; <sup>c</sup>reads coverage; <sup>d</sup>mutant allele frequency.*

CD80 is well known in providing co-stimulatory signal necessary for T-cell activation and survival, which has been found on dendritic cells, activated B cells, and monocytes (Peach et al., 1995). To our knowledge, no study shows direct relation between CD80 and CHD. However, Kallikourdis et al. (2017) have reported that CD80 involved in T-cell costimulation complex contributes to a heart failure, suggesting that mutated CD80 has impact on heart defect. Hence, we carefully explored the role of CD80 with VSD computationally.

To examine whether the mutated CD80 has a connection with VSD as shown in this study, we retrieved all cardiovascular-related genes from GO, OMIM, and HGMD, and built connections between these genes and CD80 by using the STRING database (Szklarczyk et al., 2017). Results show that 31 genes in GO and 18 genes in OMIM have connections with CD80 (in protein association, including known interactions, predicted interactions, co-expression, etc.). In total, 41 genes have a connection with CD80. The details are shown in **Table 2**. Among these 41 genes, 7 are known interactions (experimentally determined or curated from databases, shown in italic in **Table 2**; see **Figure 1B**). Among these genes, AKT1, PDPK1, CDC42, AKT3, and PIK3CA have concrete evidence of a relation with congenital heart disease; two of them (AKT1, CDC42) even have an explicit association with VSD (Chang et al., 2010; Liu et al., 2017).

TABLE 2 | CD80 interacting genes in GO and OMIM that are associated with cardiovascular diseases.

*Genes in italic have experimentally determined interactions with CD80, whereas the rest are computationally predicted.*

Regarding the mutated CD80 in this study, it has a "T" deletion on the reverse strand at chr3:119537362, leading to a frame shift at position 159 of the translated protein (cf. **Figure 1C**). As a result, the protein is no longer able to insert into the membrane of a cardiac myocyte; see **Figure 1C**. Therefore, the downstream pathway will be affected.

Regarding the eight variants that reside in the splicing region, we found only one (MROH5) that is related to cardiovascular disease.

#### LncRNAs Related to VSD

Rather than fishing out candidate VSD-related genes from variants directly, we use known VSD-associated genes as seeds and then pinpoint lncRNAs having variants near these seeds. A set of 457 known VSD-related genes is obtained from Jin et al. (2017a), whereas the whole set of lncRNAs is retrieved from LNCipedia and NONCODE. A variant-containing lncRNA is considered to be VSD-related if it is within a certain distance of a known VSD-related gene. Instead of a single distance, we use several distances: 100, 200, 500, and 1,000 bp.

We identified 6, 7, 27, and 49 lncRNAs from LNCipedia at distances of 100, 200, 500, and 1,000 bp, respectively, whereas these numbers are 6, 8, 32, and 57 when checked against NONCODE. Details are shown in the **Supplementary File**. To examine whether these lncRNAs have a potential effect on VSD, we carefully studied their expression in different tissues, particularly in the heart. We found that among all the lncRNAs (both from LNCipedia and NONCODE), 29 are expressed in the heart, notably NONHSAT096266.2 (NONCODE ID), which is highly and uniquely expressed in the heart, with an FPKM score of 13.97. More interestingly, this lncRNA is very close to NFXL1, which has been identified as a VSD-associated gene (Jin et al., 2017a). See **Table 3**.

#### Results Comparison

We use TrioDeNovo (Wei et al., 2015) with default settings (depth of coverage equals 5) to call *de novo* variants from the trio and compare results with that of our approach. As a result, TrioDeNovo identifies 79,082 variants contained in 357 genes. After filtering out common variants with MAF of 0.01, 51 variants located in 25 genes are obtained. Among these genes, 21 overlap with our findings. Regarding lncRNAs, no variant can be found within 1,000 bp of known VSD-related genes.

#### SS-Based Variants

Intuitively a TS-based approach is able to significantly narrow down candidate genes; however, it is hard to speculate to what extent the improvement is. Hence, we have conducted experiments on the sequences of the VSD sample (the child) only with the same protocols as the TS-based experiments.

Based on the single-sequencing (SS) data, we have obtained 4,826,899 variants by using GATK (DePristo et al., 2011). Similar to the trio-sequencing (TS) data analysis, we divide them into protein-coding and non-coding variants.


*FPKM: fragments per kilobase of exon per million reads mapped (Mortazavi et al., 2008).*


#### Coding Genes Related to VSD

After annotation and filtering by using ANNOVAR, we obtained 1,552 variants contained in 436 genes. Among these genes, 424 have exonic variations and the other 12 have splicing variations. For the 424 genes, we identified MYBPC3 (Chr11:47342683, C/T, missense mutation, p.G507R) and TRDN (Chr6:123571021, C/A, missense mutation, p.S45I), which are highly related to ventricles, by using DAVID based on OMIM. More details are shown in the **Supplementary File**. Unfortunately, these two genes cannot be identified based on the trio. Further investigation has shown that the variation in MYBPC3 is inherited from the father, whereas the variation in TRDN is inherited from the mother. Considering that both the father and mother are healthy but the child has VSD, we speculate that the combination of mutated MYBPC3 and TRDN may make a noteworthy contribution to VSD. Regarding the 12 genes having variants in splicing regions, we have not found cardiovascular- or neurodegenerative-related genes.

The genes identified by GAD are excluded from the child sample analysis. This is because more than 60 genes can be found, and the significance of the relations between these variants and VSD can hardly be determined.

#### LncRNAs Related to VSD

Similar to the TS-based lncRNA identification, we carried out the same experiments on the VSD patient only. Whereas about 10 times more coding-gene candidates are selected from the SS-based data than from the TS-based data, the number of lncRNA variants obtained from the SS-based data is about five times larger than that from the TS-based data.

The numbers of lncRNAs having variants close to VSD-related genes are 37, 60, 129, and 197 for distances of 100, 200, 500, and 1,000 bp, respectively. Among all these lncRNAs, only 97 are present in heart cells, of which 23 are highly expressed having FPKM score larger than 1. Details are shown in the **Supplementary File**. For instance, the lncRNA NONHSAT181468.1 (Chr2:27217145, CT/C), which has the highest FPKM score of the identified lncRNAs, is highly expressed in the heart having FPKM of 31.7. This lncRNA is within the first intron of SLC5A6, which has been confirmed as a VSD-associated gene.

#### Variants Profile of the Trio

We use the ratio of *k*-mers between the parents and child to reflect their genetic variations. **Figure 2** shows the detailed ratio distribution. It is clear that only a small portion of *k*-mers have a ratio of 0 (see **Figure 2A**, **C**). That is, among the *k*-mers of the child, only 0.442% contain *de novo* mutations compared with his father, and this value is 0.438% compared with his mother. After combining them together, 0.43% are unique *k*-mers.

The ratio of *k*-mers may be affected by sequencing errors. To alleviate this impact, we also include *k*-mers having a small non-zero ratio (less than 0.3) in addition to the ones having a ratio of 0. Generally, the number of *k*-mers having a ratio of 0 is four to five orders of magnitude larger than the number having non-zero ratios (see **Figure 2C**). If these *k*-mers contain mutations, they will be fished out during the downstream variant calling.

Unlike the distribution of *k*-mer counts for all *k*-mers (approximately normal), the *k*-mers having mutations follow a Poisson distribution (see **Figure 2D**). The *k*-mers having counts smaller than 20 form 97.97% of all *k*-mers having a ratio less than 0.3. The distribution breakdowns of these *k*-mers are shown in **Figure 2B**.

#### Run-Time Analysis

Our experiments are conducted on a computer having 128G RAM and two E5-2683V4 CPUs (32 cores in total), installed with CentOS 7.0. Throughout the entire experiments, we use 24 threads as default if applicable.

Unlike existing approaches that filter out irrelevant variants from trios after mapping, e.g., TrioDeNovo (Wei et al., 2015), we conduct filtering before mapping. This small change is not trivial since the input data are usually very large. For instance, the input size of the VSD sample used in this study is 242 Gb in fastq format, and the total size is over 700 Gb for the trio. To solve this problem, we have conceived a novel coupled Bloom Filter-based *k*-mer encoding algorithm. This algorithm achieves a compression ratio of 12 under default settings. That is, a typical set of *k*-mers obtained from a human genome (usually around 120 Gb) can be compressed into 10 Gb. Using this approach, we are able to handle a trio within main memory.

Experiments show that the total memory used to encode counted *k*-mers obtained from the trio is 31.7 Gb. Based on the encoded *k*-mers having count available, we calculate the count ratio of all *k*-mers between the parents and the child. Mathematically, suppose the count of a *k*-mer κ from the child is $f\_\kappa^c$, and the counts are $f\_\kappa^f$ and $f\_\kappa^m$ for his father and mother, respectively; then, the count ratio between his father and himself is $r\_\kappa^{f/c} = f\_\kappa^f / f\_\kappa^c$. Analogously, the count ratio between his mother and himself is $r\_\kappa^{m/c} = f\_\kappa^m / f\_\kappa^c$. If both $r\_\kappa^{f/c}$ and $r\_\kappa^{m/c}$ are smaller than the threshold $r\_0$, then κ is kept, where $r\_0$ is set as 0.3 in this study. Results show that *k*-mer counting takes 129 min, *k*-mer encoding takes 175 min, and *k*-mer filtering takes 20.3 s. As a result, 3.9% of the *k*-mers are left for further analysis.
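The ratio-based filter described above can be illustrated with a minimal Python sketch. It assumes the three *k*-mer count tables fit in ordinary dictionaries (the study itself uses a Bloom Filter-based encoding), and the function name `filter_kmers` and the toy counts are hypothetical.

```python
def filter_kmers(child, father, mother, r0=0.3):
    """Keep child k-mers whose father/child and mother/child count ratios are both below r0."""
    kept = {}
    for kmer, fc in child.items():
        if fc <= 0:
            continue  # the ratio is undefined without a positive child count
        rf = father.get(kmer, 0) / fc  # father-to-child count ratio
        rm = mother.get(kmer, 0) / fc  # mother-to-child count ratio
        if rf < r0 and rm < r0:
            kept[kmer] = (rf, rm)
    return kept


# Toy counts (hypothetical values, not taken from the study).
child = {"ACGT": 20, "CGTA": 18, "GTAC": 10}
father = {"ACGT": 19, "GTAC": 2}
mother = {"ACGT": 21, "CGTA": 1}
kept = filter_kmers(child, father, mother)
print(sorted(kept))  # k-mers that are rare or absent in both parents
```

Here `ACGT` is dropped because both parents carry it at a count similar to the child's, whereas `CGTA` and `GTAC` are kept as candidates for downstream variant calling.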

Because sequencing errors exist, we perform error correction on the remaining *k*-mers (Zhao et al., 2017). It takes 1.7 s and 0.12-Gb RAM to correct 93.7% of the errors in the kept *k*-mers. As a result, 293.2M *k*-mers are left for variant identification.

Before mapping variant-containing *k*-mers to a reference genome, we have also conducted *k*-mer extension to avoid the multimapping problem caused by short input sequences such as *k*-mers. An extension takes a *k*-mer as a seed and extends it to both sides based on the reads in which the *k*-mer is contained. Finally, we mapped the extended sequences to the reference genome GRCh38/hg38 *via* BWA (Li and Durbin, 2009), which takes 52 min to finish. This is followed by variant calling through SAMtools (Li et al., 2009) and GATK (DePristo et al., 2011) jointly, which takes 50 min to finish.

Regarding TrioDeNovo, it takes 572 min to get the sorted sam file from a raw fastq file and 8,179 min to merge and generate the final vcf file by using GATK and TrioDeNovo. Compared with our approach, TrioDeNovo is about 10 times (= (8179 + 572\*3)/((129 + 175)\*3 + 102)) slower. Besides, the maximum RAM required by our approach is two orders of magnitude smaller.

# CONCLUSIONS


As the most common CHD, VSD affects a noteworthy portion of newborns, leading to high mortality. Unveiling the biological mechanism, particularly the underpinning genetic variants, is essential for both early diagnosis and clinical treatment. Existing approaches to mining genetic variants rely on large panels, which are challenging in terms of cost and sample collection. They are also prone to overlooking rare variants and have difficulty handling multiple variants. We designed a novel algorithm for identifying variants from a trio and associating them with VSD. Experiments show that the trio-sequencing-based approach is able to narrow down VSD-related candidates by about 10 times in coding genes and 5 times in lncRNAs; meanwhile, our approach is about 10 times faster than the existing state-of-the-art approach. Applying our method to a VSD trio, we fish out 14 coding genes closely correlated to cardiovascular diseases and 5 coding genes associated with neurodegenerative diseases. Among them, CD80 has not been reported yet. More promisingly, results show that the combination of MYBPC3 and TRDN has a high possibility of being VSD-related. Analysis of lncRNAs shows that six lncRNAs within 1,000 bp of VSD-related genes are highly expressed in the heart, particularly NONHSAT096266.2, which has an FPKM score of 13.97 and is uniquely expressed in the heart.

#### AUTHOR CONTRIBUTIONS

LZ conceived the algorithm, designed the experiments, and wrote the manuscript. PJ participated in program coding. PJ, YH, YW, JZ, LB, QT, and TL participated in data analysis. All authors read and approved the final manuscript.

#### FUNDING

This study is collectively supported by the Free Exploration Fund of Hubei University of Medicine (FDFR201805), the National Natural Science Foundation of China (31501070), the Natural Science Foundation of Hubei (2017CFB137) and Guangxi (2016GXNSFCA380006, 2018GXNSFAA281275 and 2018GXNSFAA138085), the Scientific Research Fund of GuangXi University (XGZ150316), and Taihe Hospital (2016JZ11).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00670/full#supplementary-material

#### REFERENCES

Zhao, Y., Li, H., Fang, S., Kang, Y., Wu, W., Hao, Y., et al. (2016). NONCODE 2016: an informative and valuable data source of long non-coding RNAs. *Nucleic Acids Res.* 44, D203–D208. doi: 10.1093/nar/gkv1252

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Jiang, Hu, Wang, Zhang, Zhu, Bai, Tong, Li and Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*


#### APPENDIX A

The procedures of *Encoding*, *Decoding* and error *Correction* are shown below.

```
Function Encoding (K, H):
    B ← Ø, fmax ← max (F(K)), h ← |Binary (fmax)|
    while K ≠ Ø do
        initialize new B+, B– and Kʹ
        for κ in K do
            flag ← False, freq ← Binary (fκ)
            * Roll-back point
            for i ← 0 to h – 1 do
                j ← Hi (sκ)
                if B+[j] == 1 and B–[j] ≠ freq[i] then
                    flag ← True
                    Kʹ ← Kʹ ∪ {κ}
                else
                    B+[j] ← 1, B–[j] ← freq[i]
                if flag == True then
                    roll back B to the point *
                    break
        B ← B ∪ {(B+, B–)}, K ← Kʹ
    return B

Function Decoding (B, H, sκ):
    h ← |H|
    for i ← 0 to h – 1 do
        if B+[Hi (sκ)] == 0 then
            return False
    for i ← 0 to h – 1 do
        bi ← B–[Hi (sκ)]
    val ← Denary (b0b1⋯b(h–1))
    return val

Function Correction (B, H, Kʹ):
    for κ in Kʹ do
        νκ ← Decoding (B, H, κ)
        Nκ ← {νκ}
        for i ← 1 to k do
            for x in {A, C, G, T} do
                if sκ[i] ≠ x then
                    sκʹ ← sκ[1:(i – 1)]·x·sκ[(i + 1):k]
                    νκʹ ← Decoding (B, H, sκʹ)
        zκ ← (νκ – mean (Nκ))/std (Nκ)
        if not zκ > z0 and νκ < f0 then
            Kʹc ← Kʹc – {κ}
    return Kʹc
```
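As an illustration only, the *Encoding* and *Decoding* procedures can be sketched in Python as follows. This is not the authors' implementation: salted BLAKE2b stands in for the hash family *H*, each hash index gets its own partition of the bit arrays so that a *k*-mer cannot conflict with itself (an assumption of this sketch), and conflicts are checked before writing, which replaces the explicit roll-back. The names `encode`, `decode`, `decode_layer` and the array size `M` are hypothetical.

```python
import hashlib

M = 1 << 12  # slots per hash partition (hypothetical size)


def _pos(i, kmer, m):
    """Slot of the i-th hash of a k-mer inside the i-th partition of size m."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=bytes([i])).digest()
    return i * m + int.from_bytes(digest, "big") % m


def encode(counts, m=M):
    """Encode {k-mer: count} into layers of coupled (presence, value) bit arrays."""
    h = max(counts.values()).bit_length()  # bits needed for the largest count
    layers, pending = [], dict(counts)
    while pending:
        present, value, deferred = bytearray(h * m), bytearray(h * m), {}
        for kmer, count in pending.items():
            slots = [(_pos(i, kmer, m), (count >> i) & 1) for i in range(h)]
            # A slot conflicts when it is already set to the opposite bit.
            if any(present[j] and value[j] != bit for j, bit in slots):
                deferred[kmer] = count  # retry this k-mer in the next layer
                continue
            for j, bit in slots:  # commit only conflict-free k-mers
                present[j], value[j] = 1, bit
        layers.append((present, value))
        pending = deferred
    return layers, h


def decode_layer(layer, kmer, h, m=M):
    """Read the count of a k-mer from one layer, or None if it is not stored there."""
    present, value = layer
    slots = [_pos(i, kmer, m) for i in range(h)]
    if not all(present[j] for j in slots):
        return None
    return sum(value[j] << i for i, j in enumerate(slots))


def decode(layers, kmer, h, m=M):
    """Return the count from the first layer that appears to hold the k-mer."""
    for layer in layers:
        count = decode_layer(layer, kmer, h, m)
        if count is not None:
            return count
    return None


counts = {"ACGTA": 5, "CGTAC": 3, "GTACG": 12}  # toy counts, not from the study
layers, h = encode(counts)
print(len(layers), [decode(layers, s, h) for s in counts])
```

As with any Bloom filter, a lookup may report a *k*-mer that was never inserted, so decoded counts are best treated as approximate.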

# Multi-view Subspace Clustering Analysis for Aggregating Multiple Heterogeneous Omics Data

*Qianqian Shi1\*, Bing Hu2, Tao Zeng3,4 and Chuanchao Zhang5\**

*1 Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China, 2 Department of Applied Mathematics, College of Science, Zhejiang University of Technology, Hangzhou, China, 3 Key Laboratory of Systems Biology, Institute of Biochemistry and Cell Biology, Shanghai Institute of Biological Sciences, Chinese Academy of Sciences, Shanghai, China, 4 Shanghai Research Center for Brain Science and Brain-Inspired Intelligence, Shanghai, China, 5 Wuhan Institute of Huawei Technologies, Wuhan, China*

#### *Edited by:*

*Shihua Zhang, Academy of Mathematics and Systems Science (CAS), China*

#### *Reviewed by:*

*Wenyuan Li, University of California, Los Angeles, United States Jinyu Chen, Academy of Mathematics and Systems Science (CAS), China*

#### *\*Correspondence:*

*Qianqian Shi qqshi@mail.hzau.edu.cn Chuanchao Zhang chaozhangchuan@163.com*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 21 December 2018 Accepted: 16 July 2019 Published: 20 August 2019*

#### *Citation:*

*Shi Q, Hu B, Zeng T and Zhang C (2019) Multi-view Subspace Clustering Analysis for Aggregating Multiple Heterogeneous Omics Data. Front. Genet. 10:744. doi: 10.3389/fgene.2019.00744*

Integration of distinct biological data types could provide a comprehensive view of biological processes or complex diseases. The combinations of molecules responsible for different phenotypes form multiple embedded (expression) subspaces; thus, identifying the intrinsic data structure is challenging for regular integration methods. In this paper, we propose a novel framework of "Multi-view Subspace Clustering Analysis (MSCA)," which could measure the local similarities of samples in the same subspace and obtain the global consensus sample patterns (structures) for multiple data types, thereby comprehensively capturing the underlying heterogeneity of samples. Applied to various synthetic datasets, MSCA effectively recognizes the predefined sample patterns and is robust to data noise. Given a real biological dataset, i.e., Cancer Cell Line Encyclopedia (CCLE) data, MSCA successfully identifies cell clusters of common aberrations across cancer types. A remarkable superiority over the state-of-the-art methods, such as iClusterPlus, SNF, and ANF, has also been demonstrated in our simulation and case studies.

Keywords: multi-view subspace clustering analysis, data integration, heterogeneity, low-rank representation, graph diffusion

# INTRODUCTION

The rapid advance of high-throughput technologies makes large amounts of various omics data available to study biological problems (Schuster, 2008). Meanwhile, different types of data could provide complementary or common information to each other, since a biological system consists of a series of highly ordered molecular and cellular events (Wang et al., 2014; Ma and Zhang, 2017; Shi et al., 2017a). Thus, compared to single data types (e.g., gene expression), the integration of multiple omics data is more likely to reveal the full molecular mechanisms underlying particular biological processes or complex diseases, and therefore offers more opportunities to better address biological or medical issues, e.g., to identify cancer subtypes with different biological or clinical outcomes (Xiong et al., 2012; Chen and Zhang, 2016; Shi et al., 2017a).

So far, quite a lot of data-integration methods have been proposed and they can be briefly summarized into two main categories: firstly, to extract signals from each data type; secondly, to acquire comprehensive information by a sample-centric integration (Arneson et al., 2017; Zhang et al., 2017a). In addition, these data integration methods mainly depend on two strategies, one is space projection method (Fan et al., 2016), and the other one is metric (similarity measures) fusion technique (Wang et al., 2014). These ideas match the nonlinear characteristics of biological systems and should really work when capturing the whole phenotype landscape.

However, their solutions to obtain the sample or gene patterns from multiple data domains are really distinct from each other. The earliest proposed methods identify multi-dimensional genomic modules (e.g., mRNA-miRNA functional pairs) (Ghazalpour et al., 2006; Kutalik et al., 2008; Li et al., 2012; Zhang et al., 2012; Chen and Zhang, 2016), which present high correlations over the samples in data sets. Such "co-modules" can only uncover common sample structures across data types and likely lead to biased clustering because much phenotype-associated differential information is missing. Later, Mo et al. developed a method, iClusterPlus (Mo et al., 2013), which considers different properties of omics data (e.g., continuous, count or binary valued variables) through corresponding linear regression models. However, some assumptions held by this method are too strong for heterogeneous tumor samples, and may also lose biologically meaningful information. As a nearly assumption-free and fast approach, SNF (Wang et al., 2014) (similarity network fusion) can overcome such issues and it uses local structure preservation method (i.e., *K*-nearest neighbors) to adjust sample similarity networks for each data type. But, SNF can only characterize pair-wise Euclidean (or other) distances in the sample neighborhoods, and is sensitive to local data noises or outliers. Recently, Ma and Zhang proposed ANF, an "update" of SNF, which incorporates weights of views for each data type (Ma and Zhang, 2017). ANF presents more general and interpretable power than SNF, but it still reserves the unstable nature of pair-wise clustering. Notably, increasing biological evidence suggests that distinct regulatory mechanisms preside over physiological phenotypes (e.g., Waddington's canalization) or even the tumor cell states (Mark and Aviv, 2002). 
Cell types or patients present extremely strong heterogeneity due to the different master gene sets, implying that these individuals are scattered in multiple biological states (feature subspaces) even at a single data level (Shi et al., 2017b; Haghverdi et al., 2018). This means that pair-wise similarity measurement (e.g., in SNF) cannot capture the true heterogeneity spanning different subspaces, further leading to inaccurate integrative clustering. Thus, a more effective integration approach is still lacking.

Motivated by the above requirements from methodology and biology, we propose a novel framework called "Multi-view Subspace Clustering Analysis (MSCA)" by using representation-based methods (e.g., low-rank representation, namely LRR) (Lin et al., 2011; Liu et al., 2013). LRR and related subspace clustering algorithms were originally developed and applied in image recognition (Zhang et al.; Cao et al., 2015; Gao et al., 2016; Brbić and Kopriva, 2017; Zhang et al., 2017b). These methods make it possible to recover the signal spaces of the images, providing a better description of the visual patterns. Furthermore, they generate a block-diagonal representation graph of samples, which measures sample similarities by linear combinations of the remaining samples and is more robust than pair-wise clustering. However, when applied to highly heterogeneous data, such as biological omics profiles, these methods are often fragile since they assume that linear embedded structures underlie the original data and cannot exploit the local geometric relationships of objects (Zhuang et al., 2015). Hence, we should improve the utility of subspace clustering to make it more appropriate for biological cases. In our proposed MSCA model, we incorporate the advantage of local structure preservation to force the representations to be locally linear for each data type, and capture the integrative clustering pattern by fusing the multiple informative graphs from local sample representations. In particular, MSCA implements two steps of nonlinear pattern identification for different omics data during pattern fusion, where the multi-view is able to recover more details of the systems' complexity and heterogeneity. 
To validate the effectiveness of our method, we firstly applied MSCA to various synthetic datasets, and found that MSCA not only successfully recognizes the predefined subgroups with a better performance than several state-of-the-art methods, but also shows great robustness on different parameters' variation. In addition, MSCA has demonstrated a good ability to yield biologically relevant subgroups of tumor cells of multiple origins in CCLE (Barretina et al., 2012) data set.

## METHODS

#### Method Overview

MSCA takes two steps, as schematically shown in **Figure 1**: i) construction of the sample representation matrix from each type of genomic profile by a subspace clustering algorithm (**Figures 1A**, **B**); ii) a graph diffusion process over the sample similarity matrices, which are derived from the representation matrices corresponding to all data types (**Figure 1C**). MSCA was implemented as a Matlab package and is freely available at https://github.com/ZCCQQWork/MSCA.

The representation graph *Z* of step (i) presents each single sample as a linear combination of the remaining ones in the same subspace/cluster, and therefore it can be shown as a block-diagonal and sparse matrix. This low-rank characteristic of *Z* makes it more robust to data outliers and able to retain more structural information of the data, thus paving the way for the subsequent integration. After that, MSCA implements the graph diffusion step (ii), which iteratively propagates information across multiple graphs and thereby fuses biological signals from the involved genomic data. After a few iterations, MSCA converges to the optimal graph (**Figure 1D**), which serves as a multi-view similarity measurement revealing the underlying relationship of samples. Note that both steps follow nonlinear criteria to maximize the chance of characterizing the true complexity and heterogeneity of data; in particular, common information strengthens the supported sample patterns, whereas discordant local structures weaken their similarities.

#### Extracting the Sample Representation Graph From Each Data Type

Suppose we describe a genomic profile (e.g., mRNA expression) with *h* biological measurements and *n* samples as a data matrix *X =* [*x*1 *,x*2, … ,*xn*], *xi* and *xj* correspond to two samples; then the representation relationships of all samples can be calculated as follows:

$$\min\_{Z,E} \|Z\|\_\ast + \lambda \|E\|\_{2,1} \tag{1}$$

$$\text{s.t.} \begin{cases} \quad X = XZ + E\\ \quad Z^T \mathbf{1} = \mathbf{1} \\\quad Z\_{ij} = 0, (i, j) \in \overline{\Omega} \end{cases}$$

where *Z* = [*z1, z2, … ,zn*] is an *n × n* matrix containing all the coefficient measurements between pairs of samples *xi* (1 ≤ *i* ≤ *n*), and *zi* is the coefficient vector of sample *i*. ||*Z*||\* represents the nuclear norm of *Z*, i.e., the sum of all singular values of *Z*; $\|E\|\_{2,1} = \sum\_{j=1}^{n} \sqrt{\sum\_{i=1}^{h} e\_{ij}^2}$ is the *l2,1*-norm of the error matrix *E*, where *eij* is the (*i*,*j*)-th entry of *E*.

Note that, in the first constraint, the linear representation of samples can capture the global structure in the data; thus, a large similarity coefficient means that the two samples are spatially close. In the second constraint, **1** is an all-one vector used to normalize *Z* so that $\sum\_{i} Z\_{ij} = 1$. In the third constraint, $\overline{\Omega}$ denotes the complement of Ω, where Ω is the set of edges between the samples in a predefined adjacency graph. For example, if *xi* and *xj* are not graph neighbors, we have $(i, j) \in \overline{\Omega}$. In this work, we use *K*-nearest neighbors to predetermine the local sample structure in terms of pair-wise Euclidean distances. The tuning parameter *λ* is used to balance the two optimization terms; it could be selected according to their respective properties or tuned empirically. The selection of the parameters *K* and *λ* is discussed in more detail in the section *Evaluation of MSCA on Synthetic Examples*. By solving problem (1), we obtain the optimal solution *Z*\*, which is block-diagonal, indicating that samples in the same subspace are clustered together due to the joint constraints on global and local data structures. The corresponding sample affinity matrix *W* is obtained by $W = (Z^\* + (Z^\*)^T)/2$, which is passed on to the integration step.
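The neighborhood constraint and the final symmetrization can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the edge set is made symmetric and excludes self-loops (the text does not specify either choice), and the names `knn_edges` and `affinity` are hypothetical.

```python
import math


def knn_edges(samples, K):
    """Return the edge set Omega linking every sample to its K nearest neighbors."""
    omega = set()
    for i, xi in enumerate(samples):
        others = (j for j in range(len(samples)) if j != i)
        nearest = sorted(others, key=lambda j: math.dist(xi, samples[j]))[:K]
        for j in nearest:
            omega.update({(i, j), (j, i)})  # symmetric edges (assumption of this sketch)
    return omega


def affinity(Z):
    """Symmetrize a representation matrix: W = (Z + Z^T) / 2."""
    n = len(Z)
    return [[(Z[i][j] + Z[j][i]) / 2 for j in range(n)] for i in range(n)]


samples = [(0, 0), (0, 1), (5, 5), (5, 6)]  # toy 2-D samples
omega = knn_edges(samples, K=1)
W = affinity([[0, 1], [0.5, 0]])
print(sorted(omega), W)
```

Entries of *Z* whose index pair is not in `omega` would be fixed at zero during the optimization, which is how the local structure enters problem (1).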

In fact, the optimization problem (1) can be solved *via* the ADMM (alternating direction method of multipliers) algorithm (Lin et al., 2010) as follows. First, the problem is converted into an equivalent problem with an auxiliary variable *J*:

$$\min\_{J,\, L\_{\overline{\Omega}}(Z) = 0,\, E} \left\| J \right\|\_{\*} + \lambda \left\| E \right\|\_{2,1}$$

$$\text{s.t.} \begin{cases} X = XZ + E \\ Z^{T}\mathbf{1} = \mathbf{1} \\ J = Z \end{cases} \tag{2}$$

And its augmented Lagrangian function is:

$$L\_{\mu}(Z, E, J) = \left\| J \right\|\_{\*} + \lambda \left\| E \right\|\_{2, 1} + \left\langle Y\_1, X - XZ - E \right\rangle + \left\langle Y\_2, \mathbf{1}^T - \mathbf{1}^T Z \right\rangle$$

$$+ \left\langle Y\_3, Z - J \right\rangle + \frac{\mu}{2} \left( \left\| X - XZ - E \right\|\_F^2 + \left\| \mathbf{1}^T - \mathbf{1}^T Z \right\|\_F^2 + \left\| Z - J \right\|\_F^2 \right) \tag{3}$$

where *μ* is a penalty parameter larger than 0, $\|\cdot\|\_F$ denotes the Frobenius norm, and *Y1*, *Y2* and *Y3* are the Lagrangian multipliers corresponding to the three constraints in equation (2), respectively; $L\_{\overline{\Omega}}(Z) = 0$ corresponds to the third constraint in the original optimization problem (1). Following ADMM, the above problem is minimized by updating the variables *J*, *E* and *Z* in turn, each time with the other variables fixed.

Suppose that after *k* updates we have $Z^k$, $J^k$, $E^k$, $Y\_1^k$, $Y\_2^k$ and $Y\_3^k$; the alternating process and its update functions are summarized below.

Firstly, assuming all the other five matrices are fixed, we can compute $J^{k+1}$:

$$\begin{aligned} J^{k+1} &= \arg\min\_{J} \left\| J \right\|\_{\*} + \left\langle Y\_3^k, Z^k - J \right\rangle + \frac{\mu^k}{2} \left\| Z^k - J \right\|\_{F}^2 \\ &= \arg\min\_{J} \left\| J \right\|\_{\*} + \frac{\mu^k}{2} \left\| Z^k + \frac{Y\_3^k}{\mu^k} - J \right\|\_{F}^2 \end{aligned} \tag{4}$$
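For reference, a subproblem of this form (a nuclear norm plus a squared Frobenius distance) has a well-known closed-form solution by singular value thresholding. The notation below is ours and is not taken from the original text: $U \Sigma V^T$ is the singular value decomposition of the shifted matrix and $\mathcal{S}\_{\tau}$ is the soft-thresholding operator applied to the singular values $\sigma\_i$.

$$J^{k+1} = U\, \mathcal{S}\_{1/\mu^k}(\Sigma)\, V^T, \qquad U \Sigma V^T = Z^k + \frac{Y\_3^k}{\mu^k}, \qquad \mathcal{S}\_{\tau}(\Sigma) = \mathrm{diag}\big(\max(\sigma\_i - \tau, 0)\big)$$

Singular values below $1/\mu^k$ are set to zero, which is what drives the solution toward low rank.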

Secondly, assuming $J^{k+1}$, $Z^k$ and $Y\_1^k$ are fixed, we can compute $E^{k+1}$:

$$\begin{split} E^{k+1} &= \arg\min\_{E} \lambda \left\| E \right\|\_{2,1} + \left\langle Y\_1^k, X - XZ^k - E \right\rangle + \frac{\mu^k}{2} \left\| X - XZ^k - E \right\|\_F^2 \\ &= \arg\min\_{E} \lambda \left\| E \right\|\_{2,1} + \frac{\mu^k}{2} \left\| X - XZ^k + \frac{Y\_1^k}{\mu^k} - E \right\|\_F^2 \end{split} \tag{5}$$
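A subproblem of this type is commonly solved column by column with a shrinkage operator: each column of the shifted residual is scaled toward zero, and columns with a small norm vanish entirely. The sketch below illustrates that standard operator in dependency-free Python; the function name `l21_shrink` is hypothetical, `Q` stands for $X - XZ^k + Y\_1^k/\mu^k$, and `tau` stands for $\lambda/\mu^k$.

```python
import math


def l21_shrink(Q, tau):
    """Column-wise minimizer of tau * ||E||_{2,1} + 0.5 * ||Q - E||_F^2."""
    rows, cols = len(Q), len(Q[0])
    E = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        norm = math.sqrt(sum(Q[i][j] ** 2 for i in range(rows)))
        if norm > tau:  # columns with norm <= tau are set to zero
            scale = (norm - tau) / norm
            for i in range(rows):
                E[i][j] = scale * Q[i][j]
    return E


Q = [[3.0, 0.1], [4.0, 0.0]]  # toy residual: one large column, one small column
E = l21_shrink(Q, tau=1.0)
print(E)
```

In this toy case the first column (norm 5) is shrunk by a factor of 0.8, while the second column (norm 0.1) is removed, so only samples with a large residual contribute to the error matrix.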

Thirdly, assuming $J^{k+1}$, $E^{k+1}$, $Y\_1^k$, $Y\_2^k$ and $Y\_3^k$ are fixed, we can compute the updated *Z* from the following optimization problem:

$$\begin{aligned} \min\_{L\_{\overline{\Omega}}(Z)=0} \left\langle Y\_1^k, X - XZ - E^{k+1} \right\rangle + \left\langle Y\_2^k, \mathbf{1}^T - \mathbf{1}^T Z \right\rangle + \left\langle Y\_3^k, Z - J^{k+1} \right\rangle \\ + \frac{\mu^k}{2} \left( \left\| X - XZ - E^{k+1} \right\|\_F^2 + \left\| \mathbf{1}^T - \mathbf{1}^T Z \right\|\_F^2 + \left\| Z - J^{k+1} \right\|\_F^2 \right) \end{aligned} \tag{6}$$

In fact, this problem is equivalent to

$$\min\_{L\_{\overline{\Omega}}(Z)=0} \left\| X - XZ - E^{k+1} + \frac{Y\_1^k}{\mu^k} \right\|\_F^2 + \left\| \mathbf{1}^T - \mathbf{1}^T Z + \frac{Y\_2^k}{\mu^k} \right\|\_F^2$$

$$+ \left\| Z - J^{k+1} + \frac{Y\_3^k}{\mu^k} \right\|\_F^2 \tag{7}$$

Then, it can be further linearized with respect to *Z* at $Z^k$ based on the LADMAP (linearized alternating direction method with adaptive penalty) algorithm (Lin et al., 2011):

$$\begin{split} \min\_{L\_{\overline{\Omega}}(Z)=0} & \left\langle -X^{T} \left( X - XZ^{k} - E^{k+1} + \frac{Y\_1^k}{\mu^k} \right) - \mathbf{1} \left( \mathbf{1}^{T} - \mathbf{1}^{T}Z^{k} + \frac{Y\_2^k}{\mu^k} \right) \right. \\ & \left. + \left( Z^{k} - J^{k+1} + \frac{Y\_3^k}{\mu^k} \right),\; Z - Z^{k} \right\rangle + \frac{\eta}{2} \left\| Z - Z^{k} \right\|\_{F}^{2} \end{split} \tag{8}$$

where $\eta = \left\| X \right\|\_2^2 + \left\| \mathbf{1}^T \right\|\_2^2 + 1$.

In the end, we obtain *Zk*+1 according to the following updating rule:

$$\begin{aligned} Z^{k+1} &= \arg\min\_{L\_{\overline{\Omega}}(Z)=0} \left\langle H^k, Z - Z^k \right\rangle + \frac{\eta}{2} \left\| Z - Z^k \right\|\_F^2 \\ &= \arg\min\_{L\_{\overline{\Omega}}(Z)=0} \frac{\eta}{2} \left\| Z - Z^k + \frac{H^k}{\eta} \right\|\_F^2 \\ &= \begin{cases} \left( Z^k - \frac{H^k}{\eta} \right)\_{ij}, & (i, j) \in \Omega \\ 0, & (i, j) \in \overline{\Omega} \end{cases} \end{aligned} \tag{9}$$

where

$$\begin{aligned} H^k &= -X^T \left( X - XZ^k - E^{k+1} + \frac{Y\_1^k}{\mu^k} \right) - \mathbf{1} \left( \mathbf{1}^T - \mathbf{1}^T Z^k + \frac{Y\_2^k}{\mu^k} \right) \\ &+ \left( Z^k - J^{k+1} + \frac{Y\_3^k}{\mu^k} \right) \end{aligned}$$

Fourthly, assuming that $E^{k+1}$, $Z^{k+1}$ and $J^{k+1}$ are fixed, we can update the multipliers simultaneously:

$$Y\_1^{k+1} = Y\_1^k + \mu^k (X - XZ^{k+1} - E^{k+1}) \tag{10}$$

$$Y\_2^{k+1} = Y\_2^k + \mu^k (\mathbf{1}^T - \mathbf{1}^T Z^{k+1}) \tag{11}$$

$$Y\_3^{k+1} = Y\_3^k + \mu^k (Z^{k+1} - J^{k+1}) \tag{12}$$

The above subproblems are solved in a loop until convergence, and the whole procedure to derive the graph weight matrix *W* is summarized in Algorithm 1.

ALGORITHM 1 Algorithm to extract the sample representation matrix for each data type.

Input: the profile of the *i*-th data type, i.e., $X^i = [x\_1^i, x\_2^i, \ldots, x\_n^i]$, tuning parameter *λ*, and nearest-neighbors parameter *K*.

Output: the sample representation matrix *Wi* of the *i*-th data type.

1. Obtain the neighbors in data *Xi* using the *K*-nearest neighbors method, and assign the edge set Ω.

2. Solve problem (1) by updating (4), (5), (9)-(12) until the iteration converges, and obtain the optimal *Z*\*.

3. Construct the sample similarity matrix *Wi* by $W^i = (Z^\* + (Z^\*)^T)/2$.

#### Capturing Multi-View Graph From Various Omics Data

Given *m* different genomics data types, we could obtain the respective affinity matrices *Wi*, *i* = 1, 2, …, *m*, as nonlinear similarity measurements of all samples by Algorithm 1. The next step fuses the individual affinity graphs into a systematic one. The graph diffusion process is implemented as in SNF (Wang et al., 2014). In this step, we continue to take advantage of the locality-preserving strategy and define a kernel matrix, *S*, to ensure that samples in the same neighborhood still stay close across data sources. Simultaneously, we normalize the raw affinity matrix *W* to a new status matrix *P*, which keeps the original information and reduces the scale bias. Note that matrix *P* still carries the full information about the similarity of each sample to all others, whereas matrix *S* only encodes the similarity to the local neighborhood of each sample.

For the *m* different biological data types, the matrices $P^i$ and $S^i$ of the *i*-th data type are obtained from $W^i$ (*i* = 1, 2, …, *m*) by equations (13) and (14).

$$P^i(i,j) = \left\{ \begin{array}{ll} \dfrac{W^i(i,j)}{2 \sum\_{k \neq i} W^i(i,k)}, & j \neq i \\ 1/2, & j = i \end{array} \right. \tag{13}$$

$$S^i(i,j) = \left\{ \begin{array}{ll} \dfrac{W^i(i,j)}{\sum\_{k \in N\_i} W^i(i,k)}, & j \in N\_i \\ 0, & \text{otherwise} \end{array} \right. \tag{14}$$

where $N\_i$ is the set of *K* nearest neighbors of sample $x\_i$ based on $W^i$.
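The two normalizations in Eqs. (13) and (14) can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, a sample is assumed not to count as its own neighbor, and every row of the affinity matrix is assumed to have a positive off-diagonal sum.

```python
import numpy as np

def status_matrix(W):
    """Eq. (13): off-diagonal entries of each row sum to 1/2; the diagonal is 1/2."""
    W = np.asarray(W, dtype=float)
    off = W - np.diag(np.diag(W))                     # drop self-similarities
    P = off / (2.0 * off.sum(axis=1, keepdims=True))  # W(i,j) / (2 * sum_{k != i} W(i,k))
    np.fill_diagonal(P, 0.5)
    return P

def kernel_matrix(W, K):
    """Eq. (14): keep each sample's K most similar neighbours and row-normalise them."""
    W = np.asarray(W, dtype=float)
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        sims = W[i].copy()
        sims[i] = -np.inf                    # assumption: a sample is not its own neighbour
        neighbours = np.argsort(sims)[-K:]   # indices of the K largest similarities
        S[i, neighbours] = W[i, neighbours] / W[i, neighbours].sum()
    return S

# Hypothetical affinity matrix for four samples.
W = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
P = status_matrix(W)
S = kernel_matrix(W, K=2)
```

Every row of *P* sums to 1 while keeping all pairwise similarities, whereas each row of *S* is non-zero only on the *K* selected neighbors.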

The key step of MSCA is to iteratively update status matrix in graph diffusion across data types as follows:

$$P\_{t+1}^{1} = S^1 \times \left(\frac{\sum\_{k \neq 1} P\_t^k}{m - 1}\right) \times (S^1)^T$$

$$P\_{t+1}^i = S^i \times \left(\frac{\sum\_{k \neq i} P\_t^k}{m - 1}\right) \times (S^i)^T$$

$$P\_{t+1}^m = S^m \times \left(\frac{\sum\_{k \neq m} P\_t^k}{m - 1}\right) \times (S^m)^T \tag{15}$$

where $P\_{t+1}^i$ is the status matrix of the *i*-th data type after *t* + 1 iterations and $P\_1^i = P^i$ represents the initial status matrix at *t* = 1.

Equation (15) updates the status matrices, generating *m* parallel, interchanging diffusion processes in each iteration. After *t* steps, the overall status matrix, or multi-view matrix $W^{\#}$, is computed as:

$$W^{\#} = \sum\_{i=1}^{m} P\_t^i \tag{16}$$
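The cross-view update of Eq. (15) and the final aggregation can be sketched as below. The function name, the fixed iteration count used in place of a convergence test, the plain sum used for the aggregation, and the toy matrices are all assumptions for illustration; any normalization of the fused matrix is left out.

```python
import numpy as np

def cross_view_diffusion(P_list, S_list, n_iter=20):
    """Apply the Eq. (15) update n_iter times, then sum the status matrices."""
    m = len(P_list)
    if m < 2:
        raise ValueError("cross-view diffusion needs at least two data types")
    P = [np.asarray(p, dtype=float) for p in P_list]
    S = [np.asarray(s, dtype=float) for s in S_list]
    for _ in range(n_iter):
        total = sum(P)
        # Each view is diffused through its own kernel S[i], using the average
        # of the OTHER views' status matrices from the previous round.
        P = [S[i] @ ((total - P[i]) / (m - 1)) @ S[i].T for i in range(m)]
    return sum(P)

# Toy inputs for two data types and three samples (hypothetical values).
P1 = np.array([[0.5, 0.4, 0.1], [0.4, 0.5, 0.1], [0.1, 0.1, 0.5]])
P2 = np.array([[0.5, 0.1, 0.4], [0.1, 0.5, 0.1], [0.4, 0.1, 0.5]])
S1 = np.array([[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]])
S2 = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]])
W_fused = cross_view_diffusion([P1, P2], [S1, S2], n_iter=5)
```

All views are updated from the previous round's matrices, so the *m* diffusion processes run in parallel rather than one after another.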

#### Iterative Updating Process and Clustering Method

Given a series of sample representation matrices generated by Algorithm 1, the iterative integration process is summarized as Algorithm 2.

ALGORITHM 2 The Iterative Updating Process for MSCA.

Input: The profiles of the *m* data types, i.e., $X = [X^1, X^2, \ldots, X^m]$, tuning parameter *λ*, and nearest-neighbors parameter *K*.

Output: The multi-view similarity matrix $W^{\#}$ across the *m* data types

1. Compute the representation matrix $W^i$ (*i* = 1, 2, …, *m*) of each data type according to Algorithm 1

2. Update the status matrix $P^i$ (*i* = 1, 2, …, *m*) of each data type by equation (15) until the process converges

3. Capture the multi-view similarity matrix $W^{\#}$ by equation (16)

The final undirected graph $W^{\#}$ therefore carries multi-layer signals, i.e., both local and global information, and is able to represent the intrinsic complexity of the data. The multi-view fused matrix can be passed to a spectral clustering algorithm [e.g., Ratio Cuts (Ding et al., 2013)] to identify meaningful groups of samples, e.g., prognostically distinct subtypes, or used in other potential applications.
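To make the clustering step concrete, here is a minimal NumPy sketch of spectral clustering on a fused similarity matrix. It uses the standard relaxation of the ratio-cut objective (eigenvectors of the unnormalized graph Laplacian) followed by a simple k-means with farthest-point initialization. These choices, the function name and the toy matrix are assumptions for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def ratio_cut_clusters(W, n_clusters, n_iter=50):
    """Cluster samples from a similarity matrix via the unnormalised Laplacian."""
    W = np.asarray(W, dtype=float)
    W = (W + W.T) / 2.0                         # enforce symmetry
    L = np.diag(W.sum(axis=1)) - W              # unnormalised graph Laplacian
    _, vecs = np.linalg.eigh(L)                 # eigenvalues come back in ascending order
    U = vecs[:, :n_clusters]                    # spectral embedding of the samples

    # Deterministic farthest-point initialisation for k-means.
    centers = [U[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):                     # plain Lloyd iterations (n_iter >= 1)
        dists = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = U[labels == c].mean(axis=0)
    return labels

# Hypothetical fused matrix: two tight groups of three samples, weakly linked.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
labels = ratio_cut_clusters(W, n_clusters=2)
```

On this toy matrix the two blocks are separated by the second eigenvector of the Laplacian, so the first three and last three samples fall into different clusters.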

#### RESULTS

#### Evaluation of MSCA on Synthetic Examples

To demonstrate the ability of MSCA to identify multi-view subgroups, we conducted simulation experiments and compared it with the above-mentioned methods (Mo et al., 2013; Wang et al., 2014; Cao et al., 2015; Brbić and Kopriva, 2017; Ma and Zhang, 2017; Zhang et al., 2017b). The selection of parameters in MSCA is also examined on these synthetic examples.

#### Synthetic Data

Two categories of numeric data sets were considered for a complete evaluation. Each contains two types of data and 90 samples with predefined sample structures imposed by singular value decomposition (Meng et al., 2015). To preserve the feature characteristics (e.g., amount, diversity and variance) of biological data types (e.g., gene expression and methylation profiles), the two data types in the synthetic examples were generated directly from real data sets (i.e., GSE49278 and GSE49277) (Assié et al., 2014) (**Supplementary Information**). Each data type provides partial but effective information about the whole sample pattern (e.g., type 1 and type 2 in **Figure 2A** and **Supplementary Figure S1A**). We refer to the "weak heterogeneity" example, in which samples are distributed in a single subspace, as simData1, and to the "strong heterogeneity" example, in which different manifold subspaces exist, as simData2. Briefly, the 90 samples in simData1 and simData2, with three established clusters (namely, samples 1-30, 31-60 and 61-90), were randomly selected from the real data such that samples 31-90 have similar distributions in data type 1 and samples 1-60 appear close in data type 2. However, the samples in 31-90 and in 1-60 have different embedded structures or manifold subspaces. Note that in both synthetic examples the true clusters cannot be recovered from any single data type (**Figure 2A** and **Supplementary Figure S1A**).

#### Evaluation and Comparison Based on Cluster Identification

FIGURE 2 | … space, i.e., Cluster 2 and Cluster 3 for data type 1, Cluster 1 and Cluster 2 for type 2. (B) Clustering accuracy of MSCA, SNF, ANF and iClusterPlus under different noise conditions, measuring their effectiveness in detecting integrated sample patterns.

We first applied MSCA and the other methods to the generated data sets (i.e., simData1 and simData2) with predetermined clustering structures. To rule out chance results, the experiments on both data sets were randomly repeated 500 times under different noise conditions (i.e., low: 0% extra noise; moderate: 10% extra noise; high: 30% extra noise). The performance of each algorithm was measured by the adjusted Rand index (ARI) (Santos and Embrechts, 2009), where a higher value indicates closer agreement with the true clusters and a value of 1 indicates an identical clustering. Across all results, MSCA consistently pieced together the information from each data type and clearly distinguished the three pre-designed clusters (**Figure 2B** and **Supplementary Figure S1B**). On the less heterogeneous simData1, almost all of the compared methods performed excellently (**Supplementary Figure S1B**). However, as complexity increases, large performance differences emerge among the methods. MSCA still identified the sample patterns accurately and robustly, even across varying noise strengths (**Figure 2B**). In contrast, the pair-wise clustering-based methods, i.e., SNF and ANF, could not recognize the multiple manifolds embedded in high-dimensional space. Even the subspace clustering algorithms did not perform well when integrating data sets with biological characteristics (**Supplementary Figure S2**), which highlights the suitability of MSCA for biological cases. iClusterPlus ranked second in accuracy, but its accuracy distribution showed a long tail, exposing its instability. This is probably because iClusterPlus uses a random sampling procedure to solve its equations (Mo et al., 2013) and is sensitive to data noise. Overall, the novel nonlinear similarity measurement in MSCA is shown to be robust to data noise and heterogeneity, which helps provide a more accurate multi-view of sample patterns in multi-level data sets.
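For readers unfamiliar with the adjusted Rand index used above, the following is a small pure-Python sketch of the standard pair-counting formula. The function name and the toy labelings are illustrative assumptions; the 30/30/30 layout merely mirrors the three clusters described for the synthetic data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1 for identical partitions, about 0 for random agreement."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))          # contingency table
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:       # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = [0] * 30 + [1] * 30 + [2] * 30
relabelled = [2] * 30 + [0] * 30 + [1] * 30   # same partition, different names
merged = [0] * 30 + [1] * 60                  # two true clusters collapsed into one

ari_same = adjusted_rand_index(truth, relabelled)
ari_merged = adjusted_rand_index(truth, merged)
```

The index ignores cluster names, so a relabelled but identical partition scores 1, while collapsing two of the three true clusters gives a clearly lower value.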

#### Robustness Analysis of MSCA Under Different Parameters

The MSCA model has two parameters, λ and *K* (see *Methods*), so it is important to examine their effects on its performance. In particular, the parameter *K* determines the predefined neighborhoods, which constrain the solutions of the sample representation matrices. We used simData2 to test the robustness of MSCA under different choices of *K* and λ. To avoid chance results, we repeated each experiment 1,000 times and took the average ARI as the evaluation measure. The results (**Figure 3**) show that MSCA performs stably and accurately over a wide range of *K* and λ. Once again, combining low-rank representation with local preservation makes MSCA less dependent on parameter choice, and sheds new light on developing bioinformatic tools for integrating heterogeneous biological data.

#### Study on CCLE Data

To demonstrate the effectiveness of MSCA on practical problems, we applied it to the CCLE data sets (Barretina et al., 2012), using matched mRNA expression profiles from the Affymetrix Human Genome U133 Plus 2.0 array and copy number data from the Affymetrix SNP Array 6.0. Although CCLE contains thousands of cell lines, we kept only 415 cell lines, from tissues of origin that are each represented by more than 25 cell lines (**Supplementary Table S1**). For each tissue, we obtained its specifically expressed genes from two databases: The Human Protein Atlas (Uhlen et al., 2015) and PaGenBase (Pan et al., 2013). Several organs, including the tongue, trachea and esophagus, fall under upper aerodigestive tract (UADT) cancer; their gene sets were therefore all treated as UADT-specific genes. Tumor-associated genes were collected from GeneCards (Safran et al., 2010), and the top 100 by the provided relevance scores were selected to illustrate the corresponding aberration patterns among subgroups. We adopted a one-sided Wilcoxon signed-rank test to identify tissue-specific genes between each cluster and all the remaining ones. More highly expressed genes with *P* < 0.05 (adjusted by FDR) indicate that the cluster is strongly correlated with a certain tissue of origin. Similarly, differential expression or copy number was calculated for each gene using a two-sided Wilcoxon signed-rank test. A significant *P*-value shows that the gene expression or copy number in one group differs from that in the other cell lines, and we regard differential genes with *P* < 0.05 (after FDR correction) as cluster markable features. Because clusters may share markable features, we count the number of clusters sharing each feature to measure the inter-cluster heterogeneity.
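The text states only that *P*-values were "adjusted by FDR". Assuming the common Benjamini-Hochberg procedure (an assumption, since the exact method is not named), the per-gene filtering step could look like the sketch below. The *P*-values are made-up numbers for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for five genes tested in one cluster.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(raw)
markers = [i for i, q in enumerate(adj) if q < 0.05]   # genes kept as features
```

Here the third and fourth genes pass the raw 0.05 cut-off but not the adjusted one, which is the kind of borderline call the correction is meant to remove.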

First, we used the silhouette score (Rousseeuw, 1999) to evaluate how coherent the identified clusters are, and accordingly assigned the cell lines to nine clusters (**Supplementary Figure S3**). MSCA had a better silhouette score than the compared methods, indicating superior subgroup identification for CCLE samples (**Figure 4A**). We then compared the integrative clusters with the original tissue groups (**Figure 4B**) and found that some cell lines still show high lineage dependency (Pearson correlation 0.42). For example, all the AML and M. myeloma cell lines are assigned to single clusters (i.e., cluster1 and cluster5, respectively), separate from the solid tumor cell lines. Accordingly, cluster1 preserves about 77% of blood genes and cluster5 holds 85% of lymph-associated genes (**Supplementary Figure S4**). The preservation of tissue specificity in some clusters can in turn explain their homogeneity. Beyond that, cell lines from histologically different cancers are grouped into the same integrative clusters because they share gene alterations (**Supplementary Figures S5, S6**). Notably, the markable features between clusters, especially the copy number variants (**Figure 4C**), tend to be held by only a few clusters, revealing strong heterogeneity between the MSCA-identified clusters (*P*-value < 10<sup>−12</sup> and < 10<sup>−23</sup> for expression and copy number data, respectively, identified by a sample shifting test repeated 5,000 times). Thus, the integrated pan-cancer analysis by MSCA may challenge separation by tissue of origin and point to common molecular aberrations across tumor types.
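As a reminder of how a silhouette score is obtained, here is a pure-Python sketch computed from a precomputed distance matrix. How the distances were derived from the fused similarity matrix is not stated in the text, so the distance input, the function name and the toy points are assumptions.

```python
def silhouette_score(dist, labels):
    """Mean silhouette over all samples; needs at least two clusters."""
    n = len(labels)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        same = [dist[i][j] for j in range(n) if labels[j] == labels[i] and j != i]
        if not same:                       # singleton cluster: silhouette is 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)          # mean distance within the own cluster
        b = min(                           # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        # Assumes max(a, b) > 0, i.e. not all points coincide.
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

# Four hypothetical samples on a line: two near 0 and two near 10.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouette_score(dist, [0, 0, 1, 1])   # matches the geometry
bad = silhouette_score(dist, [0, 1, 0, 1])    # splits each natural pair
```

A partition that follows the geometry scores close to 1, whereas one that cuts across the natural groups scores below 0, which is why the score can guide the choice of the number of clusters.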

# DISCUSSION

It is widely accepted that integrating distinct types of biological data can provide more complete information for understanding system complexity and disease heterogeneity (Ghazalpour et al., 2006; Kutalik et al., 2008; Li et al., 2012; Zhang et al., 2012; Zhang et al., 2017c). Over the past decades, integration methods have progressed toward capturing biological detail: from focusing on common information to specific signals, from strict hypotheses to assumption-free approaches, and from linear models to nonlinear methods. However, accurately capturing the underlying sample/gene structures from multiple omics data remains a challenging task for bioinformatics.

Here, we propose the MSCA model, which is able to identify precise manifolds of samples in data space. MSCA is closely related to the previously published methods SNF and ANF, which attempt to recognize sample patterns based on cross-view diffusion. The biggest difference is that SNF treats all samples as lying in the same feature space, whereas MSCA considers the multiple subspaces embedded therein, i.e., different functional molecule sets. We used both synthetic examples and a real cancer data set to demonstrate the capacities of MSCA. In the *in silico* studies, MSCA effectively fused the concordant information associated with certain sample subgroups and outperformed several state-of-the-art integrative methods in terms of clustering accuracy and robustness. In the real case study, the sample patterns derived by MSCA correspond to biological differences, as confirmed with independent knowledge and analytic methods. Beyond complex diseases, we believe MSCA can also help other studies that need to integrate various data sources.

Although MSCA implements two nonlinear steps that have proven effective in theory and practice, over-learning may still occur because the local similarities are used twice (see *Methods*). This design may introduce bias when data types contain a lot of shared noise, which deserves careful consideration and improvement. Furthermore, MSCA currently handles only continuous data types (e.g., mRNA expression, copy number variants); its effectiveness on other forms of data, e.g., binary data (somatic mutations) and categorical data (clinical covariates), still needs to be investigated and improved.

#### REFERENCES


#### AUTHOR CONTRIBUTIONS

CZ and QS completed the majority of the project and wrote the article. TZ and BH revised the article.

#### FUNDING

This paper was supported by the National Natural Science Foundation of China (No. 61802141), Natural Science Foundation of Hubei Province (No. 2018CFB098) and Huazhong Agricultural University Scientific & Technological Self-innovation Foundation (No. 2662017QD043).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00744/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Shi, Hu, Zeng and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid–Binding Proteins

*Wenchuan Wang1†, Robert Langlois2†, Marina Langlois2, Georgi Z. Genchev1,2,3, Xiaolei Wang1,4 and Hui Lu1,2,5\**

*1 SJTU-Yale Joint Center for Biostatistics and Data Science, Department of Bioinformatics and Biostatistics, College of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai, China, 2 Department of Bioengineering and Department of Computer Science, University of Illinois at Chicago, Chicago, IL, United States, 3 Bulgarian Institute for Genomics and Precision Medicine, Sofia, Bulgaria, 4 Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China, 5 Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai, China*

#### *Edited by:*

*Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China*

#### *Reviewed by:*

*Weidong Tian, Fudan University, China Dong Xu, University of Missouri, United States*

*\*Correspondence: Hui Lu huilu@sjtu.edu.cn*

*†These authors have contributed equally to this work.*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 21 December 2018 Accepted: 11 July 2019 Published: 30 August 2019*

#### *Citation:*

*Wang W, Langlois R, Langlois M, Genchev GZ, Wang X and Lu H (2019) Functional Site Discovery From Incomplete Training Data: A Case Study With Nucleic Acid–Binding Proteins. Front. Genet. 10:729. doi: 10.3389/fgene.2019.00729*

Function annotation efforts provide a foundation for our understanding of cellular processes and the functioning of the living cell. This motivates high-throughput computational methods to characterize new protein members of a particular function. Research work has focused on discriminative machine-learning methods, which promise to make efficient, *de novo* predictions of protein function. However, available function annotation exists predominantly for individual proteins rather than for residues, of which only a subset is necessary for the conveyance of a particular function. This limits discriminative approaches to predicting functions for which there is sufficient residue-level annotation, e.g., identification of DNA-binding proteins, or for which an excellent global representation can be divined. Complete understanding of the various functions of proteins requires discovery and functional annotation at the residue level. Herein, we cast this problem into the setting of multiple-instance learning, which only requires knowledge of the protein's function yet identifies functionally relevant residues and need not rely on homology. We developed a new multiple-instance learning algorithm derived from AdaBoost and benchmarked this algorithm on two well-studied protein function prediction tasks: annotating proteins that bind DNA and RNA. This algorithm outperforms certain previous approaches in annotating protein function while identifying functionally relevant residues involved in binding both DNA and RNA, and on one protein-DNA benchmark, it achieves near perfect classification.

Keywords: machine learning, protein sequence and structural analysis, multiple-instance learning, decision trees, semi supervised learning, protein function annotation, DNA binding proteins, RNA binding proteins

# INTRODUCTION

Computational tools have become indispensable in guiding, analyzing, and simulating the mechanistic details underlying experimental studies. Recent innovations in high-throughput experiments for function discovery have provided sufficient data to model and understand the characteristics that govern specific function using machine-learning methods. Such methods have been used to address biological problems including microarray analysis and its application in diagnosis, therapy decisions, and clinical testing (Juneau et al., 2014; Peterson et al., 2015; Shen et al., 2018); interdisease relationships and similarities (Carson et al., 2017; Qin and Lu, 2018); image-based diagnostics (Mehta et al., 2017); prediction of protein structural characteristics (Langlois and Lu, 2010a; Abbass and Nebel, 2015; Andreeva, 2016; Kashani-Amin et al., 2018); and clinically relevant discovery enabled by next-generation sequencing data of genomes and transcriptomes of diseased and normal cells (Gunaratne et al., 2012; Hayes and Kim, 2015; Gu et al., 2017; Liu et al., 2017; Gong et al., 2018; Liu et al., 2018).

High-throughput sequence and structural genomics projects have continued to outpace corresponding functional discovery projects producing a deluge of protein data, with only a fraction having some functional annotation. This annotation typically provides an indication of the general function but rarely, and when available—less reliably—provides mechanistic detail for a particular function. Systems biology research has focused on analyzing and predicting known interactions between proteins whereas pharmaceutical research requires greater knowledge in the mechanistic details of molecular function. Both efforts would benefit from machine-learning methods that can accurately classify protein function using the limited amount of training data available.

There are two approaches to the classification problem, motivated by different statistical views: generative and discriminative learning. On the one hand, the generative approach attempts to solve a more general problem, i.e., modeling the joint distribution *p*(*x*, *y*), which provides greater flexibility at the cost of computational complexity. To design an efficient generative algorithm, strong assumptions must be made; e.g., in sequence alignment, one assumes that sequence similarity equals function similarity. On the other hand, discriminative classifiers attempt to find a direct mapping from the input vectors (*x*) to the class label (*y*). Since this approach solves the specific problem at hand rather than a more general one, discriminative approaches should be preferred to generative ones (Libbrecht and Noble, 2015). Nevertheless, generative sequence alignment techniques remain predominant despite recently developed discriminative approaches. So why have these discriminative techniques not been more successful? The fundamental problem seems to be that research has focused on a single type of discriminative method, classification, which requires labeled training examples. Since protein function annotation data are limited, only a few functional groups, such as nucleic acid–binding proteins, provide sufficient labeled training data.

A number of discriminative techniques have been developed to deal with incomplete knowledge of the training data such as: semi-supervised learning (Chapelle et al., 2010), active learning (Reker and Schneider, 2015), positive and unlabeled learning (Bhardwaj et al., 2010), and multiple-instance learning (MIL) (Carbonneau et al., 2018). While the first three approaches have demonstrated that unlabeled training data can be used to improve learning, the last approach leverages additional information, i.e., labeled groupings of unlabeled data. In MIL, examples (also referred to as instances) are organized into groups called bags. The class label is associated with the bag rather than the instance; the bag is labeled positive if at least one instance in the bag is labeled positive; otherwise, the bag is labeled negative. Consider the functional site discovery problem: functional data usually pertains to the protein rather than to specific functional sites. Hence, in the MIL formulation, the protein is a labeled bag and the residues (or motifs or pockets) are the instances belonging to that protein/bag.
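The bag-labeling rule described above can be written down directly. In this sketch the protein names and residue labels are hypothetical; in the actual MIL setting only the bag labels are observed, and the instance labels are shown here purely to illustrate the rule.

```python
def bag_label(instance_labels):
    """A bag is positive if at least one of its instances is positive."""
    return int(any(instance_labels))

# Hypothetical proteins (bags) with per-residue binding labels (instances).
proteins = {
    "protein_A": [0, 0, 1, 0],   # one binding residue -> positive bag
    "protein_B": [0, 0, 0],      # no binding residue  -> negative bag
}
bag_labels = {name: bag_label(residues) for name, residues in proteins.items()}
```

The asymmetry is the key point: a negative bag certifies that all of its instances are negative, whereas a positive bag only guarantees that at least one instance is positive.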

MIL was originally developed for handwritten digit recognition by Keeler et al. (1990) and was later popularized by Dietterich et al. (1997) to predict drug activity. It has subsequently been applied to a number of problem domains including context-based image retrieval (Maron and Lozano-Perez, 1998; Andrews et al., 2003a), protein super-family annotation (TrX proteins) (Scott et al.), and text categorization (Ray and Craven, 2005). A number of algorithms have been developed to solve MIL including convolutional neural networks (Keeler et al., 1990), axis parallel (Dietterich et al., 1997), support-vector machines (Doran and Ray, 2014), diverse density (Maron and Lozano-Perez, 1998), and standard binary classifiers (Ray and Craven, 2005).

MIL algorithm–based approaches have recently found increased use in the diagnosis of cancer (Li et al., 2015; Mercan et al., 2018; Yousefi et al., 2018), application in neurology for classification of brain abnormalities (Tong et al., 2014), and the prediction of phenotype from metagenomics data (Rahman et al., 2017) to name a few. Recent work has utilized MIL-based methods to predict major histocompatibility complex class II (MHC-II)–binding peptides (Xu et al., 2014) and transcription factor-DNA interaction (Gao et al., 2015; Gao and Ruan, 2017).

The boosting framework has also been conscripted to solve MIL problems. These approaches fall into two groups: those that modify the weak learner and those that modify the boosting cost function. Auer and Ortner (2004) took the first approach by boosting a weak MIL algorithm based on hyper-balls. Other algorithms have been developed using the second approach. For example, Andrews and Hofmann (2003b) used disjunctive logic programming (Lee and Grossmann, 2000) to create a boosting algorithm that achieves a large margin for at least one instance in each bag. Likewise, other groups (Xu and Frank, 2004; Viola et al., 2006) have used a derivation of the AnyBoost framework (Mason et al., 1999) to design an MIL cost function, which can be solved by numerical optimization.

Our work herein formulates the function prediction problem in the setting of MIL. In our approach, the function of a protein is identified through the discovery of key residue microenvironments that strongly signal the existence of a particular functional site. This method requires only two sets of example sequences or structures: one that has the function of interest and another that does not. We do not require knowledge of the functional sites yet this method automatically discovers such sites in order to predict the function of the protein. In the formulation of this approach, we predict function rather than superfamily assignment of a protein; moreover, we represent the protein by each residue's microenvironment rather than by precalculated conserved motifs.

To solve this problem, we developed a novel boosting algorithm (Langlois, 2008) derived from the AdaBoost framework (Schapire and Singer, 1999) that efficiently and accurately identifies residue microenvironments that correspond to functional sites. We then benchmark this approach on two protein function assignment problems: the identification of DNA- and RNA-binding proteins. These proteins play an essential role in nearly every cellular process. A number of experimental (Cajone et al., 1989; Freeman et al., 1995; Chou et al., 2003; Buck and Lieb, 2004; Nutiu et al., 2011; Gordan et al., 2013) and computational (Bhardwaj et al., 2005; Szilagyi and Skolnick, 2006; Bhardwaj and Lu, 2007; Langlois et al., 2007; Tjong and Zhou, 2007; Gao and Skolnick, 2009; Langlois and Lu, 2010b; Weirauch et al., 2013; Xu et al., 2015) approaches have been developed to identify these proteins and their functional sites. Since DNA- and RNA-binding proteins provide a substantial number of labeled examples, e.g., residues known to bind DNA or RNA, these problems have been studied extensively thus presenting an excellent proof of concept for our approach.

#### RESULTS

We demonstrate the ability of an MIL algorithm to accurately predict the function of a protein using its constituent residues with two benchmark nucleic acid–binding datasets: DNA- and RNA-binding proteins. The characteristics of each dataset are summarized in **Table 1**. Both datasets have been used in previous studies to identify residues that bind DNA (Szilagyi and Skolnick, 2006; Langlois et al., 2007) and RNA (Terribilini et al., 2006; Langlois et al., 2007; Kumar et al., 2008). During training and cross-validation, each residue in a DNA-binding protein is considered DNA-binding and each residue in a non-DNA-binding protein is considered non-binding. The known residue-level labels are nevertheless used for later evaluation of the algorithm at the residue level.

#### Protein Function Annotation

We compare two learning algorithms to solve the MIL problem: AdaBoost and AdaBoost.C2MIL on decision trees. The first algorithm, AdaBoost on decision trees is a classification algorithm, which views MIL as a classification problem with positive class noise (Blum, 1998). While other classifiers have been extensively tested on MIL problems (Ray and Craven, 2005), AdaBoost on decision trees has not; this is due to its past poor performance on problems with mislabeled data (Schapire, 1999). The second algorithm AdaBoost.C2MIL is a modification of the original AdaBoost algorithm we developed specifically to handle MIL, which gives special treatment to instances (residues) in a positive bag (DNA-binding protein).

**Table 2** summarizes the performance of each algorithm in terms of area under the receiver operating characteristic (ROC) curve at the protein level (first column), at the residue level over the entire dataset (second column), and at the residue level over just the DNA-binding proteins (third column). The protein-level results demonstrate the effectiveness of the proposed C2MIL variant over the standard AdaBoost algorithm: C2MIL outperforms AdaBoost by 5% on the DNA-binding task and by 6% on the RNA-binding task. The residue-level performance over the entire dataset is worse in both cases. However, this is due to the inclusion of residues from non-binding proteins, which skew the results. When considering the more pertinent case of only nucleic acid–binding proteins, the C2MIL algorithm outperforms AdaBoost in both cases: by almost 9% for the DNA-binding task and 3% for the RNA-binding task.

The protein-level performance over the DNA-binding set exceeds several previously published results. First, the C2MIL algorithm achieves 95.8% area under the ROC, whereas the best previous results were 93% (Szilagyi and Skolnick, 2006) and 91.0% (Langlois and Lu, 2010b). At 85.0% specificity, C2MIL achieves 94.4% sensitivity compared to 89.0% (Szilagyi and Skolnick, 2006). At 95.0% specificity, Stawiski et al. (2003) achieved 81.0% sensitivity while C2MIL achieves 86.1%. Finally, at 98% specificity, Langlois and Lu (2010b) achieved 48.1% sensitivity and C2MIL achieves 70.8%. Overall, C2MIL shows marked improvement in accurately predicting whether a protein binds DNA.
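The two kinds of figures quoted above, area under the ROC curve and sensitivity at a fixed specificity, can be computed as in the following pure-Python sketch. The function names and the toy scores are assumptions for illustration and do not reproduce the evaluation code behind the reported numbers.

```python
import math

def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_at_specificity(scores, labels, specificity):
    """Sensitivity when the threshold is set to reach the requested specificity."""
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Number of negatives that must score at or below the threshold.
    k = math.ceil(specificity * len(neg) - 1e-9)
    threshold = neg[k - 1] if k > 0 else float("-inf")
    return sum(s > threshold for s in pos) / len(pos)

# Hypothetical protein-level scores: three binders and four non-binders.
scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
sens_strict = sensitivity_at_specificity(scores, labels, 1.0)
sens_loose = sensitivity_at_specificity(scores, labels, 0.75)
```

Raising the required specificity moves the threshold up, so sensitivity can only stay the same or drop, which is the trade-off visible in the comparisons at 85%, 95% and 98% specificity.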

#### Functional Site Prediction

Since no residue-level labels were given during training, i.e., the algorithm does not know which residues bind DNA or RNA, the residue-level performance of C2MIL over the DNA-binding proteins is significantly lower than the current best: 72% (**Table 2**) *versus* 83% (Langlois et al., 2007) in terms of area under the ROC. At the same time, the performance over the full dataset (both DNA-binding and non-binding proteins) is significantly better than over just the DNA-binding proteins: 82.7% area under the ROC (**Table 2**). This seems to indicate that non-binding residue environments or substructures on non-NA-binding proteins are easier to predict than the corresponding ones on NA-binding proteins.

TABLE 2 | Performance of algorithms in the multiple-instance learning (MIL) function prediction task: area under the receiver operating characteristic (ROC) curve.

The ROC plots in **Figure 1** and **Figure 2** compare the performance of C2MIL with the standard AdaBoost algorithm over the DNA-binding dataset. In **Figure 1A**, the curves of the two algorithms cross several times with no clear winner. However, at low false-positive rates (**Figure 1B**), C2MIL dominates standard AdaBoost, which explains C2MIL's better performance at the protein level. Since a single residue predicted positive makes the entire bag positive, this is the important region of the residue-level ROC curve.
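The importance of the low false-positive region can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that residue-level false positives occur independently and that a protein has 200 residues; neither assumption comes from the text, and the function names are hypothetical.

```python
def protein_is_positive(residue_scores, threshold):
    """A protein (bag) is called positive if any residue score exceeds the threshold."""
    return max(residue_scores) > threshold

def protein_false_positive_rate(residue_fpr, n_residues):
    """Chance that at least one of n independent residues is a false positive."""
    return 1.0 - (1.0 - residue_fpr) ** n_residues

# Hypothetical non-binding protein with 200 residues.
loose = protein_false_positive_rate(0.01, 200)     # 99% residue-level specificity
strict = protein_false_positive_rate(0.0001, 200)  # 99.99% residue-level specificity
```

Under these assumptions, a residue-level false-positive rate of 1% already mislabels most non-binding proteins, so protein-level accuracy depends almost entirely on how the residue-level classifier behaves at very high specificity.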

The ROC plots in **Figure 2** compare the performance of C2MIL with the standard AdaBoost algorithm over the residues from only DNA-binding proteins. This evaluation follows that of other DNA-binding papers (Langlois et al., 2007). On this task, C2MIL dominates the standard AdaBoost algorithm over the entire range of the ROC plot. As the protein-level results indicate, C2MIL finds at least one residue microenvironment that strongly indicates a given protein is DNA binding. Moreover, these instance-level results demonstrate that not many residues fit the bill given the rather low sensitivity at low false-positive rates.

#### Trends in Residue-Level Prediction

To better understand the residue microenvironments that characterize NA-binding proteins, we plot each type of residue that has been correctly predicted as NA binding in terms of recall and precision (**Figure 3**). Precision measures the fraction of residues predicted NA binding that are actually DNA binding (in blue) or RNA binding (in red). Recall measures the fraction of NA-binding residues correctly predicted NA binding.
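The two measures can be computed per residue type as in the following minimal sketch. The record layout `(residue_type, predicted_binding, actually_binding)` and the toy records are assumptions made for illustration; they are not data from this study.

```python
from collections import defaultdict


def per_residue_precision_recall(records):
    """Compute (precision, recall) for each residue type.

    records: iterable of (residue_type, predicted_binding, actually_binding).
    This layout is hypothetical and chosen only for the example.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for residue, predicted, actual in records:
        if predicted and actual:
            counts[residue]["tp"] += 1
        elif predicted and not actual:
            counts[residue]["fp"] += 1
        elif actual and not predicted:
            counts[residue]["fn"] += 1
    result = {}
    for residue, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        # None marks an undefined ratio (no predictions or no true binders).
        precision = c["tp"] / predicted_pos if predicted_pos else None
        recall = c["tp"] / actual_pos if actual_pos else None
        result[residue] = (precision, recall)
    return result


# Made-up records: two serines and two arginines.
toy = [("S", True, True), ("S", False, True), ("R", True, True), ("R", True, False)]
scores = per_residue_precision_recall(toy)
```

In this toy input, a precision of 1.0 for serine means that every serine predicted as binding was in fact binding, which is the sense in which the term is used below.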

The first trend evident in **Figure 3** is that far more residues can be used to predict a protein as RNA binding (red) as opposed to DNA binding (blue). This suggests that more residues are involved in protein-RNA interactions than in protein-DNA interactions. Second, arginine is unsurprisingly the dominant residue predicted for both types of NA-binding proteins.

Third, DNA-binding proteins can unexpectedly be well characterized by microenvironments centered on either serine (S) or glycine (G) with a precision of 1.0; that is, every serine predicted as DNA binding actually was DNA binding. While previous works have suggested glycine (specifically its content) as more correlated with the non-binding set (Bhardwaj et al., 2005; Szilagyi and Skolnick, 2006; Langlois et al., 2007), it has been observed that glycine can make non-specific interactions with DNA (Luscombe and Thornton, 2002) and that glycine-rich linkers are critical to regulatory protein function (Singh et al., 2014).

Fourth, a set of RNA-binding proteins can be accurately characterized by microenvironments centered on either valine (V) or methionine (M) with a precision of 1.0. These residues, as well as histidine and threonine, have been found to be important experimentally. Threonine has been shown to make specific interactions with both splice sites (Colwill et al., 1996; Zhang and Fuller, 2003) and rRNA (Clemens et al., 1993). Likewise, histidine has been found to be important for specificity (Hake et al., 1998), and valine makes unique interactions with viral RNA (Pinck et al., 1970).

Note that, in proteins predicted as DNA/RNA binding, these four residues (V, M, S, and G) provide a rough location of the NA-binding site on each protein. This demonstrates that the MIL algorithm identifies DNA-/RNA-binding proteins based on residues important to their function.

# DISCUSSION

Conventional approaches that apply machine learning to function prediction have relied on a global representation of the sequence or structure, or a local representation of a residue's environment on a target protein. In the first case, only examples of known proteins with a particular function are required whereas the second case requires knowing the location of the active sites. Our proposed approach is similar to sequence alignment techniques in that we require only knowing the function of a particular protein and not the functional residues. Moreover, similar to sequence analysis techniques, it identifies a subset of probable functional residues. Nevertheless, our proposed algorithm does not require sequence similarity or homology to be effective (unlike sequence analysis techniques).

In this work, we demonstrate the ability of our MIL algorithm-based approach to identify potential binding sites and, through the presence of such a site, the function of the protein. This is done without knowledge of the binding sites during the training process. Essentially, one can both identify the function of a test protein and locate a binding site on it without knowing, during the training process, the location of such sites. One can view MIL over structure-based features as sub-structure analysis in which we consider a sliding window along the amino acid chain throughout the structure. Thus, a user only requires knowledge of the protein function, not the particular site, yet the resulting learning algorithm can predict both.

The proposed approach also has several advantages over traditional homology-based methods:


Our method does not require homologous sequences or structures; instead, it relies on physicochemical characteristics in combination with (when available) structural features. It can also be applied to problems where knowledge of the functional site is limited. We also provide an analysis of MIL algorithms on the instance level. In some previously published MIL works, the authors evaluate their algorithms on the bag level since instance-level labels are either unavailable or unreasonably expensive to obtain.

This work establishes the ability of our MIL algorithm-based method to outperform classification in discriminating RNA- or DNA-binding proteins from non-binding proteins. The success of this approach relies on the better representation of function permitted by the MIL problem formulation. Instead of representing the protein sequence or structure by some global representation, the MIL approach allows the entire protein to be decomposed into potential functional units and discovers which unit actually performs the function. Developing a feature encoding for a single functional unit is far easier than for the entire protein sequence or structure.

While multiple-instance (MI) learning has several advantages over classification, it remains a harder learning problem in that the learning algorithm does not have access to instance-level labels. Nevertheless, the experiments clearly show that the proposed MI learner does not perform substantially worse when identifying residues that bind DNA or RNA. Indeed, these results compare favorably with the current state-of-the-art in residue classification.

There are several limitations to the present work. First, we do not limit the algorithm to only sequence information; yet, this will provide the primary source of data for this application. Second, this work does not consider open conformations, e.g., proteins not in complex with DNA. Since the current set of features does not require the exact residue orientation, this may not be a significant limitation. Third, it does not incorporate known binding residues, which could provide additional information about the binding sites. This problem can be remedied through the application of active MIL (Zhang et al., 2008). Fourth, this algorithm could utilize, and would benefit from, far larger datasets such as the sequences in the UniProt database (Leinonen et al., 2004). Finally, the analysis of the important residues was just a first-order approximation to the potential wealth of information this technique can glean from both sequence and structural data.

#### MATERIALS AND METHODS

#### Dataset

Two stringent benchmark datasets are used for the DNA- and RNA-binding protein prediction tasks. The first set consists of 60 DNA-binding proteins and 250 non-DNA-binding proteins derived by Liu et al. (2014) and later used by Shen et al. (2017) and Wei et al. (2017) (**Supplementary Table 1**). The second set consists of 80 RNA-binding proteins and 224 non-RNA-binding proteins used by Miao and Westhof (2015) and Paz et al. (2016) (**Supplementary Table 2**). Both datasets were acquired from the Protein Data Bank, and short sequences (fewer than 50 amino acids) and sequences containing the consecutive character "X" have been removed. To eliminate the redundancy and homology bias that likely leads to overestimated performance, sequences with ≥25% pairwise sequence identity to any other sequence in the dataset were removed using the program CD-HIT.
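The length and "X" filters described above can be sketched as follows. This is an illustration only: the function name is hypothetical, "consecutive X" is interpreted here as two or more adjacent `X` characters (an assumption), and the CD-HIT redundancy step is not reproduced.

```python
def keep_sequence(seq, min_length=50):
    """Return True if a sequence passes the length and 'X' filters."""
    seq = seq.upper()
    # Drop short sequences (fewer than min_length residues).
    if len(seq) < min_length:
        return False
    # Assumption: "consecutive X" means at least two adjacent 'X' characters.
    return "XX" not in seq


# Made-up sequences for illustration.
toy_sequences = {
    "short": "MKV",
    "ok": "A" * 60,
    "masked": "A" * 30 + "XX" + "A" * 30,
}
kept = [name for name, seq in toy_sequences.items() if keep_sequence(seq)]
```

Only the entry named `ok` survives both filters in this toy input.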

Each residue in the protein is represented using the following features (feature count within parenthesis) (**Figure 4**):


The residue identifier is a 20-dimensional vector where the residue type is indicated by a non-zero value in the corresponding column. Likewise, there is a corresponding secondary structure identifier feature vector. The structure neighbors count the frequency of each residue type within 3 Å (measured heavy atom to heavy atom). The PSSM feature scores the conservation of this residue position. The BLOSUM window also estimates the residue conservation within a window around the specific residue. Finally, the properties of charge and surface area are estimated for each residue. For more details concerning the feature representation, see Langlois et al. (2007).
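As an illustration of the residue identifier described above, the following sketch builds the 20-dimensional indicator vector. The column order (alphabetical by one-letter code) is an assumption; the original feature code may order the columns differently.

```python
# 20 standard amino acids; the column order is an assumption for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_identifier(residue):
    """Return a 20-dimensional vector with a 1 in the column of the residue type."""
    vec = [0] * len(AMINO_ACIDS)
    # Only single-letter codes are looked up; anything else maps to all zeros.
    idx = AMINO_ACIDS.find(residue.upper()) if len(residue) == 1 else -1
    if idx >= 0:
        vec[idx] = 1
    return vec


arginine = residue_identifier("R")
unknown = residue_identifier("X")
```

The secondary structure identifier can be encoded in the same way over the set of secondary structure classes.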

#### Algorithm

The Adaptive Boosting (AdaBoost) algorithm transforms a weak classifier *L*(·) into a strong ensemble classifier *H*(·) (44). AdaBoost proves most effective with decision trees as the weak classifier (often referred to as "the best off-the-shelf classifier") and has one tunable parameter: the number of boosting iterations (*T*). Rather than the general boosting framework as in prior work (Mason et al., 1999), we propose to modify the AdaBoost algorithm itself to reduce MIL to importance-weighted classification.

EQUATION 1 | Proposed AdaBoost.C2MIL Algorithm

Given: $\{(X_1, y_1), \ldots, (X_n, y_n)\}$ where $X_i = \{x_{i,1}, \ldots, x_{i,n_i}\}$ and $y_i \in \{-1, +1\}$

Reorganize the dataset such that each negative bag contains one instance

Initialize: $w_{1,i} = \frac{1}{n}$, $i = 1, \ldots, n$

For $t = 1, \ldots, T$

1. Map dataset to instance level: $\hat{D} = \{(x_{i,j},\ y_i,\ w_{t,i}/n_i)\}$
2. Train weak classifier on instance-level dataset: $L(\hat{D})$
3. Get confidence-rated instance-level hypothesis: $\hat{h}_t : \mathcal{X} \to \mathbb{R}$
4. $\hat{p}_t(x_{i,j}) = \dfrac{1}{1 + \exp(-\hat{h}_t(x_{i,j}))}$
5. Get weak bag-level hypothesis: $h_t(X_i) = \dfrac{\sum_j \hat{p}_t(x_{i,j})\, \mathrm{sgn}(\hat{h}_t(x_{i,j}))}{\sum_j \hat{p}_t(x_{i,j})}$
6. $\varepsilon_t = \sum_{i:\ \mathrm{sgn}[h_t(X_i)] \neq y_i} w_{t,i}$
7. $\alpha_t = \dfrac{1}{2} \ln \dfrac{1 - \varepsilon_t}{\varepsilon_t}$
8. $w_{t+1,i} = w_{t,i} \cdot \exp(-\alpha_t\, \mathrm{sgn}[h_t(X_i)]\, \mathrm{sgn}[y_i]) / Z_t$, $i = 1, \ldots, n$

Output:

$$H(x_{i,j}) = \sum_{t=1}^{T} \alpha_t\, \hat{h}_t(x_{i,j})$$

The proposed algorithm, AdaBoost.C2MIL, is outlined in **Equation 1**. The first step in the algorithm is to set up the dataset. It starts by reorganizing the dataset such that each negative instance becomes its own bag while the positive instances remain grouped in their original bags. Note that, since we know each instance in a negative bag must be negative, this step does not disregard useful information. It then sets up a uniform weighted distribution on the bag level. Since each negative instance is a bag, it has its own weight whereas instances in a positive bag share a single weight.

The second step, within the for loop, starts by mapping the MIL dataset to a classification dataset where every instance in a positive bag is labeled positive, and the weight is split uniformly among the instances. Next, the algorithm trains a weak classifier (*L*) over the current distribution of the dataset, which gives the confidence-rated hypothesis $\hat{h}_t$. The confidence-rated prediction follows Schapire and Singer (1999) and can be converted to a probability using the sigmoid function. Finally, for positive bags (and negative bags during evaluation), the bag-level prediction is a summation of the instance-level predictions (step 5).
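The instance-level mapping, the sigmoid conversion, and the bag-level aggregation can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, bags are assumed to be non-empty, and the aggregation shown (a probability-weighted vote over the signs of the instance scores) is only one possible reading of step 5.

```python
import math


def map_to_instances(bags, labels, bag_weights):
    """Flatten bags: each instance inherits the bag label and an equal
    share of the bag weight. Bags are assumed to be non-empty."""
    rows = []
    for bag, label, weight in zip(bags, labels, bag_weights):
        share = weight / len(bag)
        for x in bag:
            rows.append((x, label, share))
    return rows


def sigmoid(score):
    """Convert a confidence-rated score to a probability (overflow-safe form)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


def bag_score(instance_scores):
    """One reading of step 5: probability-weighted vote of instance signs."""
    probs = [sigmoid(s) for s in instance_scores]
    signs = [1.0 if s >= 0 else -1.0 for s in instance_scores]
    total = sum(probs)
    if total == 0.0:
        # Every instance is confidently negative (probabilities underflowed).
        return -1.0
    return sum(p * g for p, g in zip(probs, signs)) / total


# Toy example: one positive bag with two instances, one negative singleton bag.
rows = map_to_instances([["a", "b"], ["c"]], [1, -1], [0.5, 0.5])
score = bag_score([2.0, -2.0])
```

With this aggregation, a single confidently positive instance outweighs an equally confident negative one, which matches the MIL assumption that one positive instance makes the bag positive.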

The rest of the algorithm follows AdaBoost on the bag level. First, the algorithm estimates the bag-level error and then calculates the step size α. This step size is then used to increase the weight on incorrectly predicted bags and decrease it on correctly predicted bags.
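The bag-level error, step size, and weight update follow the standard AdaBoost recipe and can be sketched as follows. The function name is hypothetical, and the clipping of the error away from 0 and 1 is an assumption added here to keep the logarithm finite; it is not described in the text.

```python
import math


def boosting_round(bag_predictions, labels, weights):
    """One bag-level AdaBoost update.

    bag_predictions: real-valued bag scores; labels in {-1, +1};
    weights: current bag weights, assumed to sum to 1.
    Returns (error, alpha, new_weights).
    """
    signs = [1 if h >= 0 else -1 for h in bag_predictions]
    # Weighted error: total weight of bags whose predicted sign is wrong.
    error = sum(w for s, y, w in zip(signs, labels, weights) if s != y)
    # Assumption: clip the error so that alpha stays finite.
    eps = 1e-10
    error = min(max(error, eps), 1.0 - eps)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified bags (s * y = -1) are up-weighted, correct ones down-weighted.
    unnormalized = [w * math.exp(-alpha * s * y)
                    for s, y, w in zip(signs, labels, weights)]
    z = sum(unnormalized)  # normalization constant
    return error, alpha, [u / z for u in unnormalized]


# Toy round: four bags with uniform weights; the second bag is misclassified.
error, alpha, new_weights = boosting_round(
    [0.8, -0.3, 0.4, -0.9], [1, 1, 1, -1], [0.25] * 4
)
```

After the update, the misclassified bags together carry half of the total weight, which is the usual AdaBoost property.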

The output of the ensemble multiple-instance learner acts on both the bag and instance level. Each classifier contributes to the prediction of an instance whereas the bag-level prediction is made by the equation in step 5.

#### Experiments

The overall framework of our experiment is represented in **Figure 4**. The AdaBoost algorithm requires a weak learner and, as a weak learner, the decision tree works well across the board; we use a custom implementation with a top-down (Kearns and Mansour, 1996) impurity function for confidence-rated boosting. The algorithms, metrics, and graphs used in this work were generated using Python. The performance is measured using 5-fold stratified cross-validation. The code is available at https://github.com/WintrumWang/AdaBoost.C2MIL.
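A stratified 5-fold split can be sketched in plain Python as below. This is an illustration under assumptions: the function name is hypothetical, stratification is assumed to be done on protein-level labels, and the class sizes are those of the DNA-binding benchmark described in the Dataset section.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign item indices to folds so each fold keeps roughly the class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# 60 binding and 250 non-binding proteins, as in the DNA-binding benchmark.
labels = [1] * 60 + [-1] * 250
folds = stratified_folds(labels)
```

Each fold serves once as the test set while the remaining four folds are used for training.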

#### AUTHOR CONTRIBUTIONS

RL and HL designed the project, WW and RL performed the project, ML and GG helped in method development and manuscript writing, XW participated in the computation. All authors approved the writing.

#### REFERENCES


#### FUNDING

This work is partially supported by National Key R&D Program of China 2018YFC0910500, the Neil Shen's SJTU Medical Research Fund, SJTU-Yale Collaborative Research Seed Fund; NSFC 31728012, Science and Technology Commission of Shanghai Municipality (STCSM) grant 17DZ 22512000, Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), LCNBI and ZJLab. RL acknowledges the support from NIH training grant T32 HL 07692.

#### ACKNOWLEDGMENTS

We thank Matthew B. Carson for assistance with the dataset preparation and Irina Irodova for useful discussions regarding the proposed C2MIL algorithm.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00729/full#supplementary-material

SUPPLEMENTARY TABLE 1 | The DNA-binding protein benchmark dataset.

SUPPLEMENTARY TABLE 2 | The RNA-binding protein benchmark dataset.


and reduced alphabet profile into the general pseudo amino acid composition. *PLoS One* 9 (9), e106691. doi: 10.1371/journal.pone.0106691


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Wang, Langlois, Langlois, Genchev, Wang and Lu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Genomic Profiling of Driver Gene Mutations in Chinese Patients With Non-Small Cell Lung Cancer

*Hongxue Meng1†, Xuejie Guo2†, Dawei Sun3†, Yuebin Liang2, Jidong Lang2, Yingmin Han2, Qingqing Lu2, Yanxiang Zhang2, Yanxin An2, Geng Tian2, Dawei Yuan2\*, Shidong Xu3\* and Jingshu Geng1\**

*1 Department of Pathology, Harbin Medical University Cancer Hospital, Harbin, China, 2 Department of Medicine, Geneis (Beijing) Co., Ltd., Beijing, China, 3 Department of Thoracic Surgery, Harbin Medical University Cancer Hospital, Harbin, China*

#### *Edited by:*

*Tao Huang, Shanghai Institutes for Biological Sciences (CAS), China*

#### *Reviewed by:*

*Harinder Singh, J. Craig Venter Institute, United States Caiguo Zhang, University of Colorado Denver, United States*

#### *\*Correspondence:*

*Dawei Yuan yuandw@geneis.cn Shidong Xu xusd163@163.com Jingshu Geng 13836111022@163.com*

*†These authors have contributed equally to this work*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 08 January 2019 Accepted: 23 September 2019 Published: 18 October 2019*

#### *Citation:*

*Meng H, Guo X, Sun D, Liang Y, Lang J, Han Y, Lu Q, Zhang Y, An Y, Tian G, Yuan D, Xu S and Geng J (2019) Genomic Profiling of Driver Gene Mutations in Chinese Patients With Non-Small Cell Lung Cancer. Front. Genet. 10:1008. doi: 10.3389/fgene.2019.01008*

Worldwide, and especially in China, lung cancer is a major cause of cancer-related mortality. Treatment decisions mainly depend on oncogenic driver mutations, which offer novel therapeutic targets for anticancer therapy. However, studies of genomic profiling of driver gene mutations in mainland China are rare. Hence, this is an extensive study of these mutations in Chinese patients with non-small-cell lung cancer (NSCLC). Comparison of driver gene mutations of lung adenocarcinoma with other races showed that the mutational frequencies were similar within the different East Asian populations, while there were differences between East Asian and non-Asian populations. Further, four promising candidates for druggable mutations of epidermal growth factor receptor (*EGFR*) were revealed, opening up avenues to develop and design personalized therapeutic approaches for patients harboring these mutations. These results will help to develop personalized therapy targeting NSCLC.

Keywords: lung cancer, driver mutations, epidemiology, EGFR, personalized medicine

# INTRODUCTION

Globally, lung cancer causes more deaths than any other cancer type. Non-small-cell lung cancer (NSCLC) accounts for 85% to 90% of all lung cancer cases (Planchard et al., 2018). There are three types of NSCLC based on histopathology: adenocarcinoma (ADC), squamous cell carcinoma (SCC), and large cell carcinoma (LCC) (Travis et al., 2015). Treatment strategies for NSCLC have been revolutionized since the identification in 2004 of epidermal growth factor receptor (*EGFR*) activating mutations, which predict response to EGFR tyrosine kinase inhibitors (TKIs) (Lynch et al., 2004; Paez et al., 2004). Examples of such drugs are erlotinib and gefitinib, which have improved response and relapse-free survival in patients (Mitsudomi et al., 2010; Rosell et al., 2012). Clinical practice guidelines recommend analysis of *EGFR* mutations before the start of therapy for advanced NSCLC (D'Addario et al., 2009; Azzoli et al., 2010; Ettinger et al., 2018). To date, at least nine important driver mutations causing NSCLC have been described, and several markers are already used to select the best treatment strategy. The prevalence of these mutations differs across populations: compared with white patients, East Asian patients have more mutations in *EGFR* and fewer mutations in Kirsten Rat Sarcoma Viral Proto-Oncogene (*KRAS*) (Kohno et al., 2015). With very little data in this regard from mainland China, a study that describes the pattern of driver mutations will facilitate personalized medicine for NSCLC and the design of clinical trials.

The identification of these mutations has been facilitated by the use of three-dimensional (3D) protein structures to analyze interactions between proteins found more in mutations associated with cancer, as done by Porta-Pardo et al. (2015) and Engin et al. (2016). Hotspot3D (Niu et al., 2016) is another tool that analyzes 3D structures for spatial clusters, or hotspots, to later study putative variants and their functions. Such studies have shown the potential function and relevance of driver mutations in a clinical scenario.

The current work reports an inclusive set of driver mutations in a large set of probable NSCLC patients of Chinese origin. Several rarely reported mutations were discovered, including *EGFR* mutations (V742I, I789M, N842H) related to erlotinib, gefitinib, and lapatinib, and an *EGFR* mutation (S811C) related to afatinib.

# MATERIALS AND METHODS

#### Patient and Sample Collection

From July 2016 to October 2018, 5,003 patients with lung adenocarcinoma (3,243 tumor tissue samples and 1,760 blood samples) and 230 patients with lung squamous cell carcinoma (134 tumor tissue samples and 96 blood samples) from Harbin Medical University Cancer Hospital were enrolled in this work. Specimens from surgery or biopsies were formalin-fixed and paraffin-embedded (FFPE) to generate samples, while blood samples were collected in 10 ml cell-free DNA BCT tubes (Streck, Inc). Written informed consent following the Declaration of Helsinki was collected from all patients, and all protocols were within the recommendations and framework of the Ethics Committee of the aforementioned hospital.

#### DNA Extraction From Tumor Tissue and Plasma

GeneRead DNA FFPE Kit (Qiagen) was used for DNA extraction from the FFPE samples. In parallel, plasma was extracted by centrifugation in accordance with previous work (Diehl et al., 2008; Madic et al., 2012). Briefly, Streck tubes were centrifuged at 1,600 g for 10 min at 4°C within 3 h of the blood draw. Supernatants were further centrifuged at 16,000 g for 10 min at 4°C to remove debris. Plasma was harvested and stored at -80°C until use. QIAamp Circulating Nucleic Acid kit (Qiagen) was used to isolate circulating DNA. Quantification of DNA from both sets of samples was done using Qubit (Life Technologies) in accordance with instructions from the manufacturer.

# Screening Mutations

Screening of mutations was performed by targeted sequencing using the Lung Cancer Ten Genes Panel (Geneis Co. Ltd) along with the Accel-NGS 2S Plus DNA Library Kit (Swift Biosciences) and NextSeq CN500 Personal Genome Machine (Illumina). The Lung Cancer Ten Genes Panel was used to test mutations in the *EGFR* kinase domain, *KRAS*, *NRAS*, *PIK3CA*, *HER2* kinase, and *BRAF*, as well as fusions of *ALK*, *ROS1*, and *RET*, along with Mesenchymal Epithelial Transition Proto-Oncogene (*MET*) amplifications. An average sequencing depth of 500X for tissue and 1,000X for blood samples was considered reliable. DNA samples were normalized to yield 100–250 ng input. The Accel-NGS 2S Plus DNA Library Kit (Swift Biosciences) was used to prepare whole genome libraries through a series of steps including Covaris shearing (ctDNA can skip this step), end-repair, A-base addition, barcoded adapter ligation, and PCR amplification. The Qubit dsDNA HS Kit (Invitrogen) was used to quantify the libraries, while the 2100 (Agilent) was used to assess quality in accordance with the manufacturer's instructions. Capture probes with 5' biotin were used for specific pull-down of library fragments containing target sequences to achieve enrichment. The kits mentioned above were used to quantify and check the quality of the captured library, and templates were sequenced on the NextSeq CN500 in accordance with the manufacturer's instructions.

# Mutation and Statistical Analysis

Variant calling on the Lung Cancer Ten Genes Panel (Geneis Co. Ltd) data from NextSeq CN500 sequencing was done with the BWA and FreeBayes software. Common clinical databases were used in this study, including PharmGKB, the Human Gene Mutation Database (HGMD), ClinVar, COSMIC, SNPedia, 1000 Genomes, and dbSNP. A blinded approach was followed, using frequency thresholds of ≥0.4% and ≥1% to call a mutation in ctDNA samples and tumor tissues, respectively.
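The sample-type-specific thresholds can be applied as in the following sketch. It assumes that the stated frequency is the variant allele frequency (alternate reads divided by total reads at the site); the function name and the read counts in the example are hypothetical.

```python
# Minimum frequency per sample type, as stated in the text (0.4% and 1%).
VAF_THRESHOLDS = {"ctDNA": 0.004, "tissue": 0.01}


def passes_threshold(alt_reads, total_reads, sample_type):
    """Return True if the variant frequency meets the sample-type threshold.

    Assumption: frequency = alt_reads / total_reads at the variant site.
    """
    if total_reads <= 0:
        return False
    return alt_reads / total_reads >= VAF_THRESHOLDS[sample_type]


# Made-up read counts: a 0.5% variant is called in ctDNA but not in tissue.
called_in_blood = passes_threshold(5, 1000, "ctDNA")
called_in_tissue = passes_threshold(5, 1000, "tissue")
```

The lower threshold for ctDNA reflects the deeper sequencing of blood samples described in the previous section.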

## Mutational Data Collection and Hotspot3D Processing

More than 800 promising candidates for druggable mutations were predicted by Hotspot3D through mutation-drug cluster and network analysis (Niu et al., 2016). Here, data from the 3,243 tissue samples of lung adenocarcinoma patients in China were collected, and several rare mutations were found by filtering them against the 800 potential druggable mutations of Hotspot3D. Droplet digital PCR was used to validate these potential driver gene mutations in our clinical cases.
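The filtering step amounts to intersecting the mutations observed in the cohort with the candidate list, as in the sketch below. The function name, the `(gene, protein_change)` representation, and the observed list are hypothetical; only the four *EGFR* variants named in this article are used as candidates, and this is not the Hotspot3D interface.

```python
def filter_candidates(observed, candidates):
    """Keep observed (gene, protein_change) pairs that appear in the candidate set."""
    candidate_set = set(candidates)
    # Deduplicate observations and return them in a stable, sorted order.
    return sorted(m for m in set(observed) if m in candidate_set)


# Candidate entries: the four EGFR variants reported in this work.
candidates = [("EGFR", "V742I"), ("EGFR", "I789M"),
              ("EGFR", "N842H"), ("EGFR", "S811C")]
# Made-up cohort observations for illustration.
observed = [("EGFR", "L858R"), ("EGFR", "S811C"),
            ("KRAS", "G12D"), ("EGFR", "S811C")]
hits = filter_candidates(observed, candidates)
```

Common mutations such as L858R drop out because they are not in the candidate list, leaving only the rare candidate variant.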

# RESULTS

#### Distribution of Oncogenic Driver Mutations in Lung Adenocarcinoma Tissue Samples

Sequencing of 3,243 tissue samples between July 2016 and October 2018 for oncogenic driver mutations was carried out. The distribution is as follows: mutations in *EGFR* kinase: 55.9%, *KRAS*: 11.7%, *NRAS*: 0.7%, *PIK3CA*: 2.9%, *HER2* (the analysis involved insertions in exon 20): 2.1%, *BRAF*: 1.6%. Fusions were found in *ALK*: 2.8%, *ROS1*: 0.6%, and *RET*: 0.6%, while *MET* amplifications were found in 1.3% (**Table 1**). As shown in **Figure 1**, 55.9% of patients showed *EGFR* mutations; the most frequent was L858R, observed in 28.1% of the patients, followed by exon 19 deletion (20.6%). *KRAS* mutations were detected in 11.7% of patients, most of which were located in codon 12 (9.4%).

Of the 3,243 lung adenocarcinoma cases, 901 (27.8%) were negative, 2,185 (67.4%) harbored single mutations, and 157 (4.8%) were found to have multiple mutations. Co-occurrence of mutations in *ALK*, *KRAS*, *BRAF*, and *EGFR* was studied. Thirteen patients carried coexisting *EGFR*+*KRAS* mutations, and three patients carried *EGFR*+*BRAF* mutations. However, *EGFR* and *ALK* mutations, *KRAS* and *BRAF* mutations, *KRAS* and *ALK* mutations, and *BRAF* and *ALK* mutations were mutually exclusive in our study (**Figure 2**). In addition, one patient carried a triple mutation, *EGFR* L858R + *EGFR* T790M + *KRAS* G12D, which has rarely been reported to date.
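Co-occurrence and mutual exclusivity of this kind can be tallied from per-patient gene sets, as in the following sketch. The function name and the toy patients are hypothetical and do not reproduce the cohort data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(patients):
    """Count, for every gene pair, the patients in whom both genes are mutated.

    patients: iterable of sets of mutated genes, one set per patient.
    """
    pairs = Counter()
    for genes in patients:
        # Sort so each pair is always stored in the same (alphabetical) order.
        for a, b in combinations(sorted(genes), 2):
            pairs[(a, b)] += 1
    return pairs


# Made-up patients: two single-gene cases, two double-gene cases, one negative.
toy = [{"EGFR"}, {"EGFR", "KRAS"}, {"ALK"}, {"BRAF", "EGFR"}, set()]
pairs = cooccurrence_counts(toy)
```

A pair with a count of zero, such as `("ALK", "EGFR")` in this toy input, is mutually exclusive within the cohort examined.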

#### Comparison of Driver Gene Mutations of Lung Adenocarcinoma With Other Races

Current research indicates that race plays a role in the genomics of NSCLC. To compare the frequency of driver mutations of lung adenocarcinoma with other races, we obtained the available data from several related studies (Serizawa et al., 2014; George et al., 2015; Yeung et al., 2015; Campbell et al., 2017). These results are summarized in **Table 2**. First, we found that the mutational frequencies were similar for black and white groups, but there were large differences between East Asian populations and non-Asian populations. Specifically, we found that *EGFR* was mutated at a much higher frequency in East Asian populations than in non-Asian populations (35.0-55.9% *vs* 11.6-14.4%). The same was true for the most common mutations, exon 19 deletions and exon 21 L858R. Another notable difference was that 33.5-34.2% of non-Asian patients had a *KRAS* mutation, which was significantly higher than the rate of 8.5-11.7% found for East Asians. In addition, *ALK* translocations are also important oncogenic drivers of NSCLC. *ALK* appeared to be altered at a slightly higher frequency in East Asian populations than in non-Asian populations, as presented in **Table 2**. However, data from previous reports showed that *ALK* alteration frequencies were similar (3-5% *vs* 3-6%) between patients in East Asia (Japan, Korea, and China) and those of European descent (Kohno et al., 2015). Further, overall mutational frequencies and copy number changes in lung adenocarcinoma were not significantly different between the mainland China (this study), Hong Kong, and Japan populations. No significant difference was observed in *BRAF*, *HER2*, *MET*, *PIK3CA*, *NRAS*, *RET*, and *ROS1* mutation status.

TABLE 2 | Comparison of driver gene mutations of lung adenocarcinoma between mainland China (this study), Hong Kong (Diehl et al., 2008), Japan (Madic et al., 2012), Black, and White (George et al., 2015).


*aThe mutation frequency was not mentioned in the related study.*

#### Distribution of Oncogenic Driver Mutations in Blood Samples of Patients

Sequencing of 1,760 blood samples (from July 2016 to October 2018) of patients revealed the following distribution: Mutations in *EGFR*: 32.6% patients, *KRAS*: 11.2% patients, *NRAS*: 1.0% patients, *PIK3CA* mutations: 2.9%, *HER2* kinase domain mutations: 0.9%, *BRAF*: 2.9% patients while *MET* amplifications were 0.7% (**Table 3**). It is noteworthy that the frequency of drug sensitive mutations, such as *EGFR* exon 19del and L858R, was reduced when compared with tissues (10.8% in blood and 20.6% in tissues for Exon 19del; 13.1% in blood and 28.1% in tissues for L858R). However, the frequency of drug resistant mutations, such as *EGFR* T790M, was increased when compared with tissues (5.1% in blood and 2.1% in tissues).

## Frequency of Oncogenic Driver Mutations in Squamous Cell Carcinoma of Lung

Targeted DNA sequencing of samples from 230 Chinese patients with lung squamous cell carcinoma was done. Among those, there were 134 tissue samples and 96 blood samples, of which 107 (79.9%) and 74 (77.1%) were negative, respectively. In the 134 lung squamous cell carcinoma tissue samples, there were 7 (5.2%) *EGFR* mutations, 6 (4.5%) *KRAS* mutations, 12 (9.0%) *PIK3CA* mutations, 1 (0.7%) *BRAF* mutation, and 1 (0.7%) *MET* amplification. In the 96 lung squamous cell carcinoma blood samples, there were 8 (8.3%) *EGFR* mutations, 7 (7.3%) *KRAS* mutations, 6 (6.3%) *PIK3CA* mutations, and 1 (1.0%) *BRAF* mutation. No *MET* amplifications were detected (**Table 4**). Our data confirm earlier work showing that the two ubiquitous mutations seen in lung adenocarcinomas, *KRAS* and *EGFR*, are rare in lung squamous cell carcinoma (Ding et al., 2008). It is noteworthy that the rate of *PIK3CA* mutation in these samples is relatively high compared with lung adenocarcinoma.

#### *EGFR* Candidate Druggable Mutations Were Discovered by Filtering Hotspot3D Results

We first collected the data from the 3,243 tumors of lung adenocarcinoma patients in China. By filtering them against the 800 potential druggable mutations of Hotspot3D, several rarely reported mutations were discovered, including *EGFR* mutations (V742I, I789M, N842H) related to erlotinib, gefitinib, and lapatinib, and an *EGFR* mutation (S811C) related to afatinib (**Table 5**, **Figure 3**). Droplet digital PCR was used to validate these *EGFR* variants in our clinical cases. We noticed that these rare *EGFR* variants always coexist with some common mutations, a combination that showed poor prognosis in previous studies. The mechanism is still unclear. Functional verification of these *EGFR* druggable mutations will be performed in subsequent work.

#### DISCUSSION

Identification of oncogenic driver mutations in NSCLC has greatly promoted the clinical use and development of targeted drugs. Previous genomic studies of Chinese lung adenocarcinoma have not adequately represented patients. The current work subjected a sizeable Chinese NSCLC patient sample set to comprehensive investigation for oncogenic driver mutations. Our results were comparable with those of previous studies in Chinese lung adenocarcinoma (Gou and Wu, 2014); the differences lie mainly in the detection frequencies of several fusion genes. We suspect that this is mainly due to the different platforms and detection methods. Comparison of driver gene mutations of lung adenocarcinoma with other races showed that the mutational frequencies were similar between the mainland China (this study), Hong Kong, and Japan populations, but there were large differences between East Asian populations and non-Asian populations. As in Western populations, the two most ubiquitous mutations in lung adenocarcinoma samples were those in *EGFR* and *KRAS*. However, the *EGFR* mutation frequencies in East Asian lung adenocarcinoma were higher than previously reported in USA/Europe patients, whereas the overall frequency of *KRAS* mutations was much lower than in the West (D'Angelo et al., 2011; Smits et al., 2012; Kohno et al., 2015). A previous study found that Asians had the highest proportion of patients with mutations (81%) and the highest percentage of patients treated with targeted therapies (51%), while African American patients were the least likely to harbor mutations and to receive targeted therapy. However, there were no significant differences in overall survival between the four race groups (Steuer et al., 2016). A large dataset is still needed to verify this conclusion.

TABLE 4 | Frequency of mutations in lung squamous cell carcinoma samples.

Although the *EGFR* S768I mutation is considered very rare, we detected it in 1.1% of patients with lung adenocarcinoma. Because of its rarity and the variable responses of treated cases, its exact role in TKI therapy is still not fully understood (Asahina et al., 2007; Masago et al., 2010). *BRAF* mutations were carried by 1.6% of subjects, and most of them were V600E. In addition, our data showed that *KRAS* and *BRAF* V600E mutations are mutually exclusive, which is in line with previous studies (Rajagopalan et al., 2002; De Roock et al., 2010). The stimuli to cancer development from these two genes are therefore considered equivalent or at least redundant. In addition, *EGFR* and *ALK* mutations, *KRAS* and *ALK* mutations, and *BRAF* and *ALK* mutations were mutually exclusive in our study. Previous studies showed that *KRAS* mutations seem to be incompatible with *EGFR* mutations, but we found 13 cases with coexisting *KRAS* and *EGFR* mutations, in which EGFR-TKIs would be expected to be ineffective. We also found three cases with simultaneous *EGFR* and *BRAF* mutations, a combination first reported by Li et al. (2014). Interestingly, we identified a triple mutation, *EGFR* L858R + *EGFR* T790M + *KRAS* G12D, which has rarely been reported to date. Clinical follow-up of these cases will be necessary in future research.
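As an illustration of how mutual exclusivity or co-occurrence between two genes can be assessed, the following minimal Python sketch applies a two-sided Fisher's exact test to a 2x2 table of mutation calls. It is not the procedure used in this study; the function names and all counts are hypothetical placeholders chosen for demonstration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a: samples with both mutations, b: only gene A mutated,
    c: only gene B mutated,         d: neither mutated.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x co-mutated samples, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are at least as unlikely as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def cooccurrence_test(both, only_a, only_b, neither):
    """Return (odds_ratio, p_value) for two binary mutation calls."""
    cross = only_a * only_b
    odds_ratio = float("inf") if cross == 0 else (both * neither) / cross
    return odds_ratio, fisher_exact_two_sided(both, only_a, only_b, neither)


# HYPOTHETICAL counts for illustration only (not data from this study):
# 1,000 samples, gene A mutated in 500, gene B in 100, both in 5.
odds_ratio, p_value = cooccurrence_test(both=5, only_a=495, only_b=95, neither=405)
print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.3g}")
```

An odds ratio below 1 together with a small p-value indicates that the two mutations co-occur less often than expected by chance, which is the pattern described as mutual exclusivity; an odds ratio above 1 would instead suggest co-occurrence.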

Blood samples from 1,760 lung adenocarcinoma patients were tested for mutations in *EGFR*, *KRAS*, *NRAS*, *PIK3CA*, *HER2*, *BRAF*, and *MET* in cfDNA. The frequency of drug-sensitive mutations, such as *EGFR* exon19del and L858R, was lower than in tissues, while the frequency of drug-resistant mutations, such as *EGFR* T790M, was higher. We speculate that this is related to the patient populations: most patients whose tumor tissues were analyzed were seeking targeted agents for the first time, whereas some of the patients whose blood samples were analyzed had already developed resistance to EGFR-TKIs. This can be seen from the mutation frequency of *EGFR* T790M (2.1%), which was close to the *de novo* T790M frequency reported in the literature (Su et al., 2012). Almost all NSCLC patients treated with EGFR-TKIs gradually develop resistance, and it is now recommended to analyze samples from such patients to identify the cause of the resistance. A challenge, however, is that a single biopsy may not fully reflect the mutations underlying advanced disease, particularly if the cancer is heterogeneous. Analysis of cfDNA (cell-free DNA fragments) can be an alternative to tissue samples, as these fragments are released by apoptotic or necrotic cells and their level correlates with tumor stage and prognosis (Diaz and Bardelli, 2014).
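A difference in mutation frequency between two cohorts, such as tissue and cfDNA samples, can be checked with a two-proportion z-test. The sketch below is a generic illustration under stated assumptions, not the analysis performed here: the cohort sizes are taken from the text, whereas the mutation counts are hypothetical placeholders.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1/n1 and x2/n2 are the mutated/total counts in each cohort.
    Returns (z, p_value) using the pooled standard error.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return z, erfc(abs(z) / sqrt(2))


# Cohort sizes from the text; mutation counts are HYPOTHETICAL placeholders.
tissue_mutated, tissue_total = 65, 3243
blood_mutated, blood_total = 70, 1760

z, p = two_proportion_z_test(blood_mutated, blood_total, tissue_mutated, tissue_total)
print(f"z = {z:.2f}, two-sided p = {p:.3g}")
```

The normal approximation is reasonable only when each cohort contains a fair number of mutated and non-mutated samples; for rare variants an exact test would be the safer choice.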

Studies of NSCLC have mainly involved adenocarcinoma, in which molecular profiling of tumors can improve outcomes when therapies are targeted. However, such an approach fails in SCC, which accounts for approximately 30% of all NSCLC. Here, we screened samples from 230 Chinese patients with lung SCC and found that mutations in *KRAS* and *EGFR*, the two most frequently mutated genes in lung adenocarcinoma, were rare, while *PIK3CA* mutations were relatively frequent compared with lung adenocarcinoma. Most of the mutations in lung SCC remain unknown, and further research is needed.

In addition, we performed an in-depth analysis of the tumor data from the 3,243 lung adenocarcinoma patients in China and, by filtering against the Hotspot3D results, discovered three *EGFR* mutations (V742I, I789M, N842H) associated with erlotinib, gefitinib, and lapatinib, and one *EGFR* mutation (S811C) associated with afatinib. Next, we will validate the function of these four rare druggable *EGFR* mutations by (i) predicting drug interactions based on protein structures; (ii) performing biological validation in cultured cells; and (iii) evaluating the clinical significance of these mutations by following up patients who carry them. Interestingly, we found that these rare *EGFR* variants always coexisted with some common activating mutations in clinical samples. Whether this phenomenon has specific clinical significance needs further analysis. Our analysis lends weight to novel approaches to personalized medicine for patients with particular genetic profiles.
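The filtering and co-occurrence check described above can be sketched in a few lines of Python. This is only an illustrative outline and does not reproduce the Hotspot3D workflow: the variant-to-drug mapping follows the text, while the sample identifiers, the per-sample calls, the helper name `annotate_samples`, and the choice of L858R and exon19del as the "common activating" set are assumptions made for demonstration.

```python
# Candidate druggable EGFR mutations and associated drugs, as listed in the text.
DRUGGABLE_EGFR = {
    "V742I": ["erlotinib", "gefitinib", "lapatinib"],
    "I789M": ["erlotinib", "gefitinib", "lapatinib"],
    "N842H": ["erlotinib", "gefitinib", "lapatinib"],
    "S811C": ["afatinib"],
}

# ASSUMPTION: which variants count as common activating mutations.
COMMON_ACTIVATING = {"L858R", "exon19del"}


def annotate_samples(calls):
    """Keep samples carrying a candidate variant and list coexisting common ones.

    calls: dict mapping a sample ID to an iterable of EGFR variant labels.
    """
    report = {}
    for sample_id, variants in calls.items():
        variants = set(variants)
        rare = sorted(v for v in variants if v in DRUGGABLE_EGFR)
        if not rare:
            continue  # sample carries none of the candidate variants
        report[sample_id] = {
            "rare": rare,
            "drugs": sorted({d for v in rare for d in DRUGGABLE_EGFR[v]}),
            "coexisting_common": sorted(variants & COMMON_ACTIVATING),
        }
    return report


# HYPOTHETICAL per-sample calls, for illustration only.
example_calls = {
    "S1": ["V742I", "L858R"],
    "S2": ["exon19del"],
    "S3": ["S811C"],
}
report = annotate_samples(example_calls)
for sample_id, info in report.items():
    print(sample_id, info)
```

Listing the coexisting common mutations per sample makes it straightforward to count how often a rare variant appears alone versus alongside an activating mutation, which is the question raised in the paragraph above.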

In conclusion, we present a clear panorama of driver mutation frequencies in a sizeable population of NSCLC patients from China. We identified four rare *EGFR* mutations in these patients; these results raise new possibilities for designing personalized treatments for patients carrying these mutations.

#### CONCLUSION

Genomic profiling of driver gene mutations of a sizeable Chinese patient set with NSCLC was performed. Four promising candidates for druggable mutations of *EGFR* were revealed, which opens up new avenues in the development of therapies that target individual patients carrying such genetic alterations. These results will help to develop personalized therapy targeting NSCLC.

#### DATA AVAILABILITY STATEMENT

The data generated in this study can be found at http://www.ncbi.nlm.nih.gov/SNP/snp\_viewTable.cgi?handle=DPSEQ\_SNP


#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Ethics Committee of the Affiliated Tumor Hospital of Harbin Medical University, with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Ethics Committee of the Affiliated Tumor Hospital of Harbin Medical University.

#### AUTHOR CONTRIBUTIONS

HM and XG designed the study; HM, XG, DS, YH, and YA collected the data; YL and JL analysed the data; XG and DY interpreted the data; XG wrote the draft; YZ, QL, SX, and GT edited the manuscript; JG acquired the funding and supervised the whole study.

#### ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (81600539), the Natural Science Foundation of Heilongjiang Province of China (LC2016038), the Nn10 program of Harbin Medical University Cancer Hospital (Nn10 2017-03), the Youth Elite Training Foundation of Harbin Medical University Cancer Hospital (JY2016-06), the Outstanding Youth Foundation of Harbin Medical University Cancer Hospital (JCQN-2018-05), and the Postdoctoral Scientific Research Developmental Fund of Heilongjiang Province (LBH-Q18076).


**Conflict of Interest:** Authors XG, YL, JL, YH, QL, YZ, YA, GT, and DY were employed by the company Geneis, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Meng, Guo, Sun, Liang, Lang, Han, Lu, Zhang, An, Tian, Yuan, Xu and Geng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
