# INTEGRATION OF MULTISOURCE HETEROGENOUS OMICS INFORMATION IN CANCER

EDITED BY: Victor Jin, Junbai Wang and Binhua Tang

PUBLISHED IN: Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-448-4 DOI 10.3389/978-2-88963-448-4



Topic Editors: Victor Jin, The University of Texas Health Science Center at San Antonio, United States; Junbai Wang, Oslo University Hospital, Norway; Binhua Tang, Hohai University, China

Multisource heterogenous omics data can provide unprecedented perspectives and insights into cancer studies, but also pose great analytical problems for researchers due to the vast amount of data produced. This Research Topic aims to provide a forum for sharing ideas, tools and results among researchers from various computational cancer biology fields such as genetic/epigenetic and genome-wide studies.

Citation: Jin, V., Wang, J., Tang, B., eds. (2020). Integration of Multisource Heterogenous Omics Information in Cancer. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-448-4

# Table of Contents

*Long Noncoding RNA FAM201A Mediates the Radiosensitivity of Esophageal Squamous Cell Cancer by Regulating ATM and mTOR Expression via miR-101*

Mingqiu Chen, Pingping Liu, Yuangui Chen, Zhiwei Chen, Minmin Shen, Xiaohong Liu, Xiqing Li, Anchuan Li, Yu Lin, Rongqiang Yang, Wei Ni, Xin Zhou, Lurong Zhang, Ye Tian, Jiancheng Li and Junqiang Chen

Xiaowen Feng, Edwin Wang and Qinghua Cui

*SALMON: Survival Analysis Learning With Multi-Omics Neural Networks on Breast Cancer*

Zhi Huang, Xiaohui Zhan, Shunian Xiang, Travis S. Johnson, Bryan Helm, Christina Y. Yu, Jie Zhang, Paul Salama, Maher Rizkalla, Zhi Han and Kun Huang

*Recent Advances of Deep Learning in Bioinformatics and Computational Biology*

Binhua Tang, Zixiang Pan, Kang Yin and Asif Khateeb

*Autoencoder Based Feature Selection Method for Classification of Anticancer Drug Response*

Xiaolu Xu, Hong Gu, Yang Wang, Jia Wang and Pan Qin

Aodan Xu, Jiazhou Chen, Hong Peng, GuoQiang Han and Hongmin Cai

Beste Turanli, Kubra Karagoz, Gholamreza Bidkhori, Raghu Sinha, Michael L. Gatza, Mathias Uhlen, Adil Mardinoglu and Kazim Yalcin Arga

*Gene Co-expression Network and Copy Number Variation Analyses Identify Transcription Factors Associated With Multiple Myeloma Progression*

Christina Y. Yu, Shunian Xiang, Zhi Huang, Travis S. Johnson, Xiaohui Zhan, Zhi Han, Mohammad Abu Zaid and Kun Huang

*Abundance of HPV L1 Intra-Genotype Variants With Capsid Epitopic Modifications Found Within Low- and High-Grade Pap Smears With Potential Implications for Vaccinology*

Jane Shen-Gunther, Hong Cai, Hao Zhang and Yufeng Wang

*Survival Analysis of Multi-Omics Data Identifies Potential Prognostic Markers of Pancreatic Ductal Adenocarcinoma*

Nitish Kumar Mishra, Siddesh Southekal and Chittibabu Guda

# Long Noncoding RNA FAM201A Mediates the Radiosensitivity of Esophageal Squamous Cell Cancer by Regulating ATM and mTOR Expression via miR-101

Mingqiu Chen<sup>1,2,3†</sup>, Pingping Liu<sup>4†</sup>, Yuangui Chen<sup>5†</sup>, Zhiwei Chen<sup>6</sup>, Minmin Shen<sup>4</sup>, Xiaohong Liu<sup>4</sup>, Xiqing Li<sup>4</sup>, Anchuan Li<sup>5</sup>, Yu Lin<sup>7</sup>, Rongqiang Yang<sup>8</sup>, Wei Ni<sup>8</sup>, Xin Zhou<sup>8</sup>, Lurong Zhang<sup>7</sup>, Ye Tian<sup>2,3</sup>, Jiancheng Li<sup>7</sup>\* and Junqiang Chen<sup>7</sup>\*

#### Edited by:

*Binhua Tang, Hohai University, China*

#### Reviewed by:

*Shaoli Das, National Institutes of Health (NIH), United States; Suman Ghosal, National Institutes of Health (NIH), United States*

#### \*Correspondence:

*Jiancheng Li, jianchengli6@126.com; Junqiang Chen, junqiangc@163.com*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *13 August 2018* Accepted: *19 November 2018* Published: *05 December 2018*

#### Citation:

*Chen M, Liu P, Chen Y, Chen Z, Shen M, Liu X, Li X, Li A, Lin Y, Yang R, Ni W, Zhou X, Zhang L, Tian Y, Li J and Chen J (2018) Long Noncoding RNA FAM201A Mediates the Radiosensitivity of Esophageal Squamous Cell Cancer by Regulating ATM and mTOR Expression via miR-101. Front. Genet. 9:611. doi: 10.3389/fgene.2018.00611*

*<sup>1</sup> Department of Radiation Oncology, Fujian Medical University Union Hospital and Fujian Provincial Platform for Medical Laboratory Research of First Affiliated Hospital, Fujian, China, <sup>2</sup> Department of Radiation Oncology, The Second Affiliated Hospital of Soochow University, Suzhou, China, <sup>3</sup> Institute of Radiotherapy & Oncology, Soochow University, Suzhou, China, <sup>4</sup> Shengli Clinical Medical College, Fujian Medical University, Fuzhou, China, <sup>5</sup> Department of Radiation Oncology, Fujian Medical University Union Hospital, Fuzhou, China, <sup>6</sup> Fuzhou Center for Disease Control and Prevention, Fuzhou, China, <sup>7</sup> Department of Radiation Oncology, Fujian Cancer Hospital & Fujian Medical University Cancer Hospital, Fuzhou, China, <sup>8</sup> Cancer and Genetics Research Complex, Department of Molecular Genetics and Microbiology, College of Medicine, University of Florida, Gainesville, FL, United States*

Background: The present study aimed to identify a long non-coding RNA (lncRNA), and its associated molecular mechanisms, involved in regulating the radiosensitivity of esophageal squamous cell cancer (ESCC), and to assess whether it could serve as a biomarker for predicting the response to radiotherapy and the prognosis of patients with ESCC.

Methods: Microarrays and bioinformatics analysis were used to screen for lncRNAs potentially associated with radiosensitivity in radiosensitive (*n* = 3) and radioresistant (*n* = 3) ESCC tumor tissues. Reverse transcription-quantitative polymerase chain reaction (RT-qPCR) was performed in 35 ESCC tumor tissues (20 radiosensitive and 15 radioresistant) to validate the lncRNA that contributed the most to the radiosensitivity of ESCC (named the candidate lncRNA). MTT, flow cytometry, and western blot assays were conducted to assess the effect of the candidate lncRNA on radiosensitivity *in vitro* in ECA109/ECA109R ESCC cells. A mouse xenograft model was established to confirm the function of the candidate lncRNA in the radiosensitivity of ESCC *in vivo*. The putative downstream target genes regulated by the candidate lncRNA were predicted using Starbase 2.0 software and the TargetScan database. The interactions between the candidate lncRNA and the putative downstream target genes were examined by luciferase reporter assay and confirmed by PCR.

Results: A total of 113 aberrantly expressed lncRNAs were identified by microarray analysis, of which family with sequence similarity 201-member A (FAM201A) was identified as the lncRNA that contributed the most to the radiosensitivity of ESCC. FAM201A was upregulated in radioresistant ESCC tumor tissues, and its upregulation was associated with a poorer short-term response to radiotherapy and inferior overall survival. FAM201A knockdown enhanced the radiosensitivity of ECA109/ECA109R cells by upregulating ataxia telangiectasia mutated (ATM) and mammalian target of rapamycin (mTOR) expression via the negative regulation of miR-101 expression. The mouse xenograft model demonstrated that FAM201A knockdown improved the radiosensitivity of ESCC.

Conclusion: The lncRNA FAM201A, which mediated the radiosensitivity of ESCC by regulating ATM and mTOR expression via miR-101 in the present study, may be a potential biomarker for predicting radiosensitivity and patient prognosis, and may be a therapeutic target for enhancing cancer radiosensitivity in ESCC.

Keywords: ATM, esophageal squamous cell carcinoma, FAM201A, long noncoding RNA, miR-101, mTOR, radiosensitivity

#### INTRODUCTION

Globally, esophageal cancer (EC) is one of the most common types of cancer, with the 7th highest incidence rate and 6th greatest rate of cancer-associated death (Bray et al., 2018). Surgery still plays an important role in the treatment of EC (Pennathur et al., 2013; Rustgi and El-Serag, 2014). However, due to the patients' physiological conditions, the tumor location or the tumor stage, only ∼25% of newly diagnosed patients are suitable for surgery (Short et al., 2017). For patients with unresectable EC, radiotherapy (RT) combined with chemotherapy is considered to be the optimal treatment (Sasaki and Kato, 2016).

However, the 5-year survival rate of EC patients following RT is as low as 10–30% (Cooper et al., 1999; Gwynne et al., 2011), predominantly because of local failure (Lloyd and Chang, 2014; Versteijne et al., 2014), which has been associated with intrinsic and/or acquired radioresistance (Chen X. et al., 2017). Predicting radiosensitivity and resensitizing resistant tumors are therefore imperative for patients with EC treated with RT. Unfortunately, the molecular mechanism of radioresistance is intricate and has not been thoroughly elucidated; it is known to involve DNA repair proteins (Zafar et al., 2010), cell signaling pathways (Dumont and Bischoff, 2012), angiogenesis (Francescone et al., 2011), cancer stem cells (Moncharmont et al., 2012), and autophagy (Chaachouay et al., 2011). Consequently, there are currently no accurate biomarkers to predict radioresistance or therapeutic targets to enhance the radiosensitivity of EC.

Long non-coding RNAs (lncRNAs) are a class of non-protein-coding transcripts that are longer than 200 nucleotides (Qi and Du, 2013). A number of previous studies have demonstrated that lncRNAs are important regulators of gene expression that control both physiological and pathological processes in development and in diseases such as cancer (Kung et al., 2013). Recent studies have reported that lncRNAs also function as regulators of tumor radiosensitivity and may serve as biomarkers for tumor response to RT (Spizzo et al., 2012; Yu et al., 2012). However, radiosensitivity-associated lncRNAs in esophageal squamous cell carcinoma (ESCC) have rarely been reported (Tong et al., 2014; Zhang et al., 2015; Li et al., 2016; Zhou et al., 2016).

In the present study, we demonstrated that the lncRNA family with sequence similarity 201-member A (FAM201A) contributed the most to the radioresistance of ESCC. Furthermore, functional and mechanistic analyses revealed that FAM201A contributed to radioresistance by upregulating ataxia telangiectasia mutated (ATM) and mammalian target of rapamycin (mTOR) expression by acting as a miR-101 sponge. This study is the first to establish a FAM201A-miR-101-ATM/mTOR regulatory network in ESCC, revealing a promising therapeutic strategy for treating radioresistant ESCC.

#### MATERIALS AND METHODS

#### Patients and Tissue Specimens

The present study was approved by the Fujian Medical University Union Hospital Institutional Review Board (No. 2014KY001). All patients provided signed informed consent prior to treatment, and all information was anonymized prior to analysis. The pretreatment work-up and eligibility criteria, the details of radiotherapy and chemotherapy, the criteria for toxicity and short-term response, the follow-up, and the statistical analysis of survival were presented in our previous study (Chen M. Q. et al., 2017).

Between July 2015 and March 2017, a total of 41 patients with ESCC who received RT were recruited. Tissue specimens obtained by esophagogastroduodenoscopy before treatment were histopathologically examined by two independent pathologists, snap-frozen in liquid nitrogen, and then stored at −80 °C until RNA extraction.

**Abbreviations:** ATM, ataxia telangiectasia mutated; AUC, area under the curve; CASC2, cancer susceptibility candidate 2; CR, complete response; CTV, clinical target volume; DLEU2, deleted in lymphocytic leukemia 2; DLX6-AS1, DLX6 antisense RNA 1; DNA, deoxyribonucleic acid; DNA-PKcs, DNA-dependent protein kinase catalytic subunit; ECOG, Eastern Cooperative Oncology Group; ESCC, esophageal squamous cell carcinoma; FAM201A, family with sequence similarity 201-member A; GTV, gross tumor volume; HRR, homologous recombination repair; IC, induction chemotherapy; lncRNA, long non-coding RNA; MCF2L-AS1, MCF2L antisense RNA 1; mTOR, mammalian target of rapamycin; MTT, 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide; miR, microRNA; NHEJ, non-homologous end joining; OS, overall survival; PCR, polymerase chain reaction; PD, progression of disease; PR, partial response; RT-qPCR, reverse transcription-quantitative polymerase chain reaction; RI-DSB, radiation-induced double-strand breaks; RNA, ribonucleic acid; ROC, receiver operating characteristic; RT, radiotherapy; SD, stable disease; sh-RNA, short hairpin RNA; siRNA, small interfering RNA; TP, platinum compound plus taxane.

Tissue specimens were divided into a radiosensitive group (n = 23) and a radioresistant group (n = 18) based on the short-term response to RT. The short-term responses to RT were classified as a clinically complete response (CR), partial response (PR), stable disease (SD), or progressive disease (PD) according to the Japanese Classification of Esophageal Cancer guidelines (Japan Esophageal Society, 2017). In the current study, patients with CR or PR formed the radiosensitive group, and patients with SD or PD formed the radioresistant group.
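The grouping rule above can be written as a small lookup. The following is only an illustrative sketch: the function names are hypothetical and are not part of the study's workflow. The counts in the usage note are the response counts reported in the Results (4 CR, 19 PR, 9 SD, 9 PD).

```python
# Response categories that define each group, as stated in the text.
RADIOSENSITIVE = {"CR", "PR"}
RADIORESISTANT = {"SD", "PD"}


def response_group(response: str) -> str:
    """Map a short-term response code to the study's two-group label."""
    code = response.strip().upper()
    if code in RADIOSENSITIVE:
        return "radiosensitive"
    if code in RADIORESISTANT:
        return "radioresistant"
    raise ValueError(f"unknown response category: {response!r}")


def count_groups(responses):
    """Count how many patients fall into each group."""
    counts = {"radiosensitive": 0, "radioresistant": 0}
    for response in responses:
        counts[response_group(response)] += 1
    return counts
```

Applied to `["CR"] * 4 + ["PR"] * 19 + ["SD"] * 9 + ["PD"] * 9`, `count_groups` reproduces the group sizes of 23 and 18 given above.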

### Microarray Screening and Bioinformatics Analysis

Microarray profiling was performed using three radiosensitive ESCC tumor tissues and three radioresistant ESCC tumor tissues. RNA extraction and sequential microarray hybridization were conducted by Biotechnology Company (Shanghai, China), and the detected human genome transcripts were obtained using the Human lncRNA array V6.0 (4 × 180K; Agilent Technologies, Inc., Santa Clara, CA, USA). Bioinformatics analysis was performed using GeneSpring Software to obtain differentially expressed lncRNAs correlated with ESCC radiosensitivity.

#### Cell Lines and Culture

The ESCC cell line Eca109 was obtained from the Chinese Academy of Sciences (Beijing, China). The corresponding radioresistant cells (Eca109R) were established from the parental cell line Eca109 by stepwise X-ray irradiation at 30 Gy in three fractions (10 Gy per fraction) (Da et al., 2017). Cells were cultured in RPMI-1640 medium (HyClone; GE Healthcare Life Sciences, Logan, UT, USA) with 10% (v/v) fetal bovine serum (Thermo Fisher Scientific, Inc., Waltham, MA, USA) and antibiotics (100 U/mL penicillin and 100 µg/mL streptomycin; HyClone) in an atmosphere of 95% air/5% CO<sub>2</sub> at 37 °C.

#### RNA Isolation and Reverse Transcription-Quantitative Polymerase Chain Reaction (RT-qPCR)

Total RNA from either tissue samples or cultured cells was extracted with TRIzol reagent (Thermo Fisher Scientific, Inc.) according to the manufacturer's instructions. The RNA concentration and quality were measured using a NanoDrop ND-2000 spectrophotometer, which measured the absorbance at 260 and 280 nm. Samples with an A<sub>260</sub>:A<sub>280</sub> ratio ≥ 2.0 were selected for further analysis.

First-strand cDNA for the potential lncRNAs and putative microRNA (miRNA) was synthesized using the PrimeScript™ RT reagent kit with gDNA Eraser (Takara Biotechnology Co., Ltd., Dalian, China) according to the manufacturer's protocol. Briefly, 1 µg total RNA, 2 µl 5X gDNA Eraser Buffer, 1 µl gDNA Eraser and RNase Free dH<sub>2</sub>O were combined in a total reaction volume of 10 µl and incubated at 42 °C for 2 min to eliminate the genomic DNA. A total of 10 µl of the RT reaction mixture (consisting of 4 µl 5X PrimeScript Buffer 2, 1 µl PrimeScript RT Enzyme Mix 1, 1 µl RT Primer Mix, and 4 µl RNase Free dH<sub>2</sub>O) was then added, and the mixture was incubated at 37 °C for 15 min, followed by 85 °C for 5 s to generate the cDNA.

The expression of the potential lncRNAs in the radiosensitive tumor tissues, compared with the radioresistant tumor tissues, was quantified using SYBR® Premix Ex Taq (Takara Biotechnology Co., Ltd.) according to the manufacturer's instructions on the ABI 7500 Real-Time PCR System (Applied Biosystems; Thermo Fisher Scientific, Inc.). Briefly, the 20 µl reaction mixtures were incubated at 95 °C for 30 s for the initial denaturation, followed by 40 cycles at 95 °C for 5 s and 60 °C for 34 s. The expression levels of lncRNAs were calculated using the ΔCt method, where ΔCt = Ct<sub>target</sub> − Ct<sub>reference</sub>; a smaller ΔCt value indicates greater expression. The relative expression of lncRNAs was analyzed using the 2<sup>−ΔΔCt</sup> method (Livak and Schmittgen, 2001); data were normalized to the endogenous control GAPDH. Each sample was examined in triplicate. The primers and oligonucleotides of the plasmid were synthesized by Invitrogen (Thermo Fisher Scientific, Inc.); the sequences are presented in **Table 1**. The aberrant lncRNA that had the greatest sensitivity and specificity for predicting ESCC radiosensitivity (in radiosensitive and radioresistant tissues), as identified by receiver operating characteristic (ROC) curves, and that was associated with survival, was selected as the candidate lncRNA for further study.
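As a worked illustration of the quantification described above, the sketch below computes ΔCt (Ct of the target minus Ct of the reference gene) and the relative expression 2^(−ΔΔCt). It is a minimal sketch under stated assumptions: the function names and all Ct values are hypothetical, and the choice of calibrator sample is an assumption, since the text does not specify one.

```python
def delta_ct(ct_target: float, ct_reference: float) -> float:
    """Delta-Ct: target Ct minus reference (e.g., GAPDH) Ct.

    A smaller value corresponds to greater expression of the target.
    """
    return ct_target - ct_reference


def mean(values):
    """Arithmetic mean, used here to average technical triplicates."""
    return sum(values) / len(values)


def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_calibrator, ct_ref_calibrator):
    """Relative expression by the 2^(-ddCt) method (Livak and Schmittgen, 2001).

    ddCt = dCt(sample) - dCt(calibrator); the calibrator is an assumed
    comparison sample chosen by the analyst.
    """
    ddct = (delta_ct(ct_target_sample, ct_ref_sample)
            - delta_ct(ct_target_calibrator, ct_ref_calibrator))
    return 2.0 ** (-ddct)


# Hypothetical triplicate Ct values, for illustration only.
sample_target = mean([24.1, 23.9, 24.0])
sample_gapdh = mean([18.0, 18.1, 17.9])
fold = relative_expression(sample_target, sample_gapdh, 26.0, 18.0)
```

With these invented values the sample has a ΔCt of about 6 against a calibrator ΔCt of 8, so the target is roughly four times as abundant in the sample as in the calibrator.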

### Transient Transfection

Small interfering RNA (siRNA) specifically targeting candidate lncRNA (si-candidate-lncRNA) and putative-miRNA, negative control (NC) si-candidate-lncRNA and si-putative-miRNA, candidate-lncRNA mimic, putative-miRNA mimic, and the inhibitor control were constructed by Nanjing Dongji Biotechnology Company (Nanjing, China). Ectopic expression of the candidate lncRNA was achieved by introducing the candidate lncRNA sequence into a pcDNA3.1 vector (Thermo Fisher Scientific, Inc.). Eca109/Eca109R cells were seeded into 6-well plates at a density of 1 × 10<sup>6</sup> cells/well and cultured overnight prior to transfection. Then, transient transfection of oligonucleotides or plasmids into Eca109/Eca109R cells was performed using Lipofectamine 2000™ (Thermo Fisher Scientific, Inc.). Cells were harvested 48 h post-transfection for subsequent analysis. PCR was used to validate the efficiency of Eca109/Eca109R cell transfection with si-candidate-lncRNA and candidate-lncRNA-mimic.

### Western Blot Analysis

Protein samples from tissues or cells were subjected to 10% SDS-PAGE and transferred to PVDF membranes. Following blocking in 5% skim milk for 2 h, the membranes were incubated overnight at 4 °C with the primary antibodies against P-glycoprotein (Pgp; 1:1,000), glutathione S-transferase π (GST-π; 1:500), ATM (1:750), mTOR (1:1,000), and β-actin (1:5,000) purchased from Zen Bioscience Biotechnology, Inc. (Chengdu, China), followed by incubation with horseradish peroxidase-conjugated goat anti-rabbit secondary antibodies for 2 h (1:5,000). The antigen-antibody complexes were visualized using chemiluminescence.



#### Radiosensitivity Assay

Radiosensitivity was assessed by 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) assay. Eca109/Eca109R cells (5,000/well) were incubated for 48 h prior to exposure to various doses of radiation (0, 2, 4, 6, 8, and 10 Gy). Subsequently, 10 µl of 5 mg/mL MTT was added to each well for a further 3 h, followed by the addition of 150 µl DMSO to dissolve the generated formazan crystals. The absorbance at a wavelength of 570 nm was detected using a microplate reader.

#### Flow Cytometry Analysis of Apoptosis

Eca109/Eca109R cells (5,000/well) were incubated for 48 h prior to exposure to various doses of radiation (0, 2, 4, 6, 8, and 10 Gy). The proportion of apoptotic cells was detected using an Annexin V-FITC Apoptosis Detection Kit (BD Biosciences, Franklin Lakes, NJ, USA) and analyzed using a BD Calibur flow cytometer with CellQuest software (BD Biosciences).

#### Candidate lncRNA Downstream Target Genes and Luciferase Reporter Assay

The potential target genes downstream of the candidate lncRNA were predicted using Starbase 2.0 software (http://starbase.sysu.edu.cn/starbase2/index.php) and the TargetScan database (www.targetscan.org/vert_71/).

The full fragments of the candidate lncRNA or its mutant containing the putative miRNA-binding sites were synthesized and cloned downstream of the firefly luciferase gene in pGL3 plasmids (Promega Corporation, Madison, WI, USA), and were termed the pGL3-candidate lncRNA-wild type (Wt) and pGL3-candidate lncRNA-mutant (Mut). Eca109 and Eca109R cells were maintained in 96-well plates and co-transfected with 400 ng of the constructed luciferase reporter plasmids, 50 ng of Renilla luciferase reporter vector and 50 nM of the putative miRNA mimic, miR-con, or putative miRNA-vector using Lipofectamine 3000™ (Thermo Fisher Scientific, Inc.). Cells were harvested 48 h after transfection, and luciferase activity was determined using a Dual Luciferase Reporter Assay Kit (Promega Corporation). Renilla luciferase activity was used as the internal control for the normalization of firefly luciferase activity.
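The normalization step described above amounts to dividing the firefly signal by the Renilla signal from the same well. A minimal sketch follows; the function names and readings are hypothetical, and expressing the result relative to a control transfection is a common convention assumed here rather than a step stated in the text.

```python
def normalized_luciferase(firefly: float, renilla: float) -> float:
    """Firefly activity normalized to the Renilla internal control."""
    if renilla <= 0:
        raise ValueError("Renilla reading must be positive")
    return firefly / renilla


def relative_activity(sample, control):
    """Normalized activity of a sample relative to a control well.

    Each argument is a (firefly, renilla) pair of raw readings.
    """
    return normalized_luciferase(*sample) / normalized_luciferase(*control)


# Hypothetical raw readings (arbitrary luminescence units).
wt_with_mimic = (500.0, 1000.0)    # Wt reporter + miRNA mimic
wt_with_control = (1000.0, 1000.0)  # Wt reporter + control miRNA
ratio = relative_activity(wt_with_mimic, wt_with_control)
```

In this invented example the ratio is 0.5, that is, the mimic halves reporter activity. A reduction seen with the Wt construct but not with the Mut construct is the pattern usually read as evidence of direct binding.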

#### In vivo Experiments

The animal experiments were approved by the Animal Care and Use Committee of Fujian Medical University Union Hospital and were performed in accordance with the Institutional Guide for the Care and Use of Laboratory Animals. A lentiviral vector [Lenti-short hairpin (sh)-candidate lncRNA] for stable silencing of the candidate lncRNA was obtained from Shanghai GenePharma Co., Ltd. (Shanghai, China) and transfected into Eca109/Eca109R cells. The success of transfection was verified by PCR, and cell survival was determined by an MTT assay. Then, equal numbers of siRNA-candidate lncRNA-transfected Eca109, NC and control cells were implanted into 8-week-old nude mice (n = 5 per group; Model Animal Research Center of Nanjing University) by subcutaneous injection.

Two weeks after the injection (to allow for tumor growth), the tumors were irradiated with X-rays at 10 Gy. Tumor size was measured every 3 days with a caliper, and tumor volume was calculated according to the following formula: Volume = (length × width<sup>2</sup>)/2. All mice were sacrificed on day 42 after inoculation. The resected tumor masses were harvested for subsequent weight measurements. A growth curve was constructed to determine tumor radiosensitivity, and the effect of the siRNA of the candidate lncRNA on tumorigenicity in nude mice was analyzed.
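The volume formula above, Volume = (length × width²)/2, can be applied to serial caliper measurements to build a growth curve. The sketch below is illustrative only: the function names are hypothetical, the measurements are invented, and millimeters are an assumed unit (the text does not state one).

```python
def tumor_volume(length: float, width: float) -> float:
    """Tumor volume from caliper measurements: (length x width^2) / 2."""
    return (length * width ** 2) / 2.0


def growth_curve(measurements):
    """Convert (day, length, width) records into (day, volume) points."""
    return [(day, tumor_volume(length, width))
            for day, length, width in measurements]


# Hypothetical caliper readings taken every 3 days (assumed to be in mm).
readings = [(0, 4.0, 2.0), (3, 6.0, 4.0), (6, 10.0, 6.0)]
curve = growth_curve(readings)
```

With lengths and widths in mm, the volumes are in mm³. Plotting the resulting points for each treatment group gives the growth curves that are compared between groups.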

#### Statistical Analysis

The overall survival data were analyzed using SPSS software 23.0 (IBM Corp., Armonk, NY, USA). Survival curves were established using the Kaplan-Meier method and compared by a log-rank test.
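To make the Kaplan-Meier method concrete, the following sketch computes the product-limit survival estimate from follow-up times and event indicators. It is not the SPSS procedure used in the study: the function name and the data are hypothetical, and the log-rank comparison is not implemented here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.

    times  : follow-up duration for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Patients censored at t are still counted as at risk at t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical follow-up data (months), for illustration only.
km = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

For these invented data the estimate steps down to about 0.8, 0.6 and 0.3 at months 2, 3 and 5. Censored patients lower the number at risk without producing a step.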

A multivariable analysis of patient demographic and clinical parameters (gender, age, ECOG score, tumor location, clinical T and N stages, the radiotherapy doses for GTV and CTV, and the tumor response to treatment) was performed using the Cox proportional hazards model.

Experimental data are presented as the mean ± standard deviation (x̄ ± s) from independent experiments performed in triplicate. For comparisons, paired or independent Student's t-tests, Chi-square tests or ANOVA with post hoc tests (Tukey's) were performed. ROC curves were used for selecting an optimal cut-off point for each test and for comparing the accuracy of diagnostic tests. Two-tailed P < 0.05 (∗P < 0.05, ∗∗P < 0.01, ∗∗∗P < 0.001) was considered to indicate a statistically significant difference.

### RESULTS

### Patient Characteristics, Treatment Response, and Survival

Between July 2015 and March 2017, a total of 41 ESCC patients treated with RT combined with chemotherapy were enrolled in the present study. After RT, a total of 4 patients achieved CR, 19 patients reached PR, 9 patients maintained SD and 9 cases had PD. There were no significant differences between radiosensitive (4 CR and 19 PR) and radioresistant (9 SD and 9 PD) patients regarding the distributions of gender, age, ECOG score, tumor location, and clinical stage (**Table 2**).

### Differential Expression of lncRNAs Potentially Correlated With Radiosensitivity

A total of 113 aberrantly expressed lncRNAs were identified in the microarray analysis using three radiosensitive ESCC tumor tissues and three radioresistant ESCC tumor tissues, of which 71 lncRNA transcripts were upregulated (fold change >2, P < 0.05) and 42 lncRNA transcripts were downregulated (fold change < 0.5, P < 0.05) in the radiosensitive ESCC tumor tissues when compared with the radioresistant ESCC tumor tissues. The lncRNAs CASC2, FAM201A, DLEU2, DLX6-AS1, and MCF2L-AS1 were considered to be the potential lncRNAs related to radiosensitivity when analyzed using GeneSpring Software 12.6 (Agilent Technologies, Inc.) (**Figure 1A**, **Supplementary File 1**).
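The selection thresholds quoted above (fold change > 2 or < 0.5, with P < 0.05) can be expressed as a simple filter. This is a sketch of the stated criteria only, not of the GeneSpring analysis itself; the function names and the example records are hypothetical.

```python
def classify_transcript(fold_change: float, p_value: float,
                        up: float = 2.0, down: float = 0.5,
                        alpha: float = 0.05) -> str:
    """Label a transcript using the thresholds stated in the text."""
    if p_value >= alpha:
        return "unchanged"
    if fold_change > up:
        return "up"
    if fold_change < down:
        return "down"
    return "unchanged"


def summarize(records):
    """Count labels over (fold_change, p_value) records."""
    counts = {"up": 0, "down": 0, "unchanged": 0}
    for fold_change, p_value in records:
        counts[classify_transcript(fold_change, p_value)] += 1
    return counts


# Hypothetical (fold change, P value) pairs, for illustration only.
example = [(3.1, 0.01), (0.4, 0.03), (2.5, 0.20), (1.2, 0.001)]
summary = summarize(example)
```

In this invented example one transcript is called up, one down, and two unchanged (one fails the P threshold, the other the fold-change threshold).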

Tumor tissues from the remaining 35 enrolled patients (20 radiosensitive and 15 radioresistant) were collected to measure the expression of the lncRNAs CASC2, FAM201A, DLEU2, DLX6-AS1, and MCF2L-AS1 by RT-qPCR. The results revealed that the expression of CASC2, FAM201A, and DLX6-AS1 differed significantly between the radioresistant and radiosensitive groups, whereas the expression of DLEU2 and MCF2L-AS1 did not (**Figure 1B**; **Supplementary File 2**).

#### FAM201A Is a Novel lncRNA With a Potential Function in the Radiosensitivity and Survival of ESCC

Based on the above data, ROC curves for the lncRNAs CASC2, FAM201A, and DLX6-AS1 were used to identify the lncRNA most strongly correlated with radiosensitivity and survival. The areas under the curve (AUC) were 0.783 (95% CI: 0.609–0.957, P = 0.005), 0.817 (95% CI: 0.673–0.960, P = 0.002), and 0.340 (95% CI: 0.150–0.530, P = 0.110), respectively. Compared with DLX6-AS1, FAM201A and CASC2 yielded a superior AUC, with specificity and sensitivity for distinguishing radiosensitive ESCC tumor tissues from radioresistant ESCC tumor tissues (**Figure 2A**).

CASC2 was associated with short-term response to RT but not with survival, while FAM201A was correlated with both the short-term response and survival (**Figures 2B,C**). This indicated that FAM201A, as opposed to CASC2, may be a suitable biomarker of ESCC treated with RT.

To determine whether FAM201A could serve as a biomarker for radiosensitivity and survival in ESCC, the maximum Youden index method (Fluss et al., 2005) was used to establish the cutoff value of FAM201A on the ROC curve. A total of 22 patients were classified as FAM201A-low, with an average ΔCt expression value of 6.155, whereas the remaining 13 patients, classified as the FAM201A-high expression group, had an average ΔCt expression value of 8.437 (**Supplementary File 3**).
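The maximum Youden index selects the ROC threshold that maximizes J = sensitivity + specificity − 1. The sketch below illustrates this together with a rank-based AUC. It rests on several assumptions: the function names and data are hypothetical, a sample is called positive when its score is at or above the threshold, and higher scores are assumed to indicate the positive class (for a marker measured as ΔCt the direction may need to be reversed).

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A sample is predicted positive when score >= threshold (an assumption).
    labels: 1 for the positive class, 0 for the negative class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, tp / pos, 1.0 - fp / neg))
    return points


def youden_cutoff(scores, labels):
    """Threshold maximizing J = sensitivity + specificity - 1."""
    thr, sens, spec = max(roc_points(scores, labels),
                          key=lambda p: p[1] + p[2] - 1.0)
    return thr, sens + spec - 1.0


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical marker values and outcome labels, for illustration only.
cutoff, j_stat = youden_cutoff([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

In this perfectly separated toy example the cutoff is 4 with J = 1. Real data give J below 1, and ties in J are resolved here by taking the lowest threshold.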

Compared with the FAM201A-low group, the FAM201A-high group exhibited a poorer short-term response to RT and a shorter survival time. However, neither high nor low FAM201A expression was correlated with tumor stage, whether T or N stage (**Table 3**). Furthermore, univariate and multivariate analysis indicated that FAM201A was the only independent risk factor for survival (OR, 0.642; 95% CI, 0.4668–0.885; P = 0.007). These data suggested that FAM201A could be a robust molecular marker for predicting RT sensitivity and survival in patients with ESCC.

### FAM201A Regulated Radiosensitivity in vitro

Based on the above results, the effect of FAM201A on radiosensitivity in ESCC cells was further explored by performing an X-ray irradiation experiment using Eca109/Eca109R cells transfected with si-FAM201A and FAM201A-mimic (**Figures 3A,B**; **Supplementary File 4**).

The results revealed that the survival rates of both Eca109 and Eca109R cells decreased with increasing X-ray irradiation dose, and the percentage of apoptotic cells in each line increased with increasing X-ray irradiation dose (**Figures 3C,D**; **Supplementary File 4**). The dose-dependent decrease in survival was more pronounced in Eca109 cells than in Eca109R cells, demonstrating that the Eca109R cells were more resistant to X-ray irradiation.

In Eca109 cells, when compared with the control cells, FAM201A-mimic significantly promoted cell proliferation, while si-FAM201A significantly increased proliferation inhibition, indicating that for Eca109 cells, upregulated FAM201A expression likely resulted in cell radioresistance to X-rays (**Figure 3E**; **Supplementary File 4**).

*There were no significant differences between radiosensitive and radioresistant patients regarding the distributions of gender, age, ECOG score, tumor location and clinical stage. ECOG, Eastern Cooperative Oncology Group; GTV, gross tumor volume; CTV, clinical target volume; PF, platinum plus fluorouracil; TP, platinum plus taxane; a, M1 means supraclavicular lymph node metastasis; IC, induction chemotherapy; b, according to the 7th AJCC TNM staging system.*

In Eca109R cells, when compared with the control cells, si-FAM201A significantly inhibited cell proliferation, while FAM201A-mimic did not increase cell proliferation as it did in Eca109 cells. This indicated that FAM201A expression in Eca109R cells was already high, so that further elevation of FAM201A expression could not enhance radioresistance any further. These results indicated that, in cases of either intrinsic or acquired radioresistance, si-FAM201A may enhance ESCC cell radiosensitivity and may therefore be a novel and effective targeting strategy for sensitizing ESCC to radiotherapy (**Figure 3F**; **Supplementary File 4**).

### FAM201A Knockdown Enhanced the Radiosensitivity of ESCC in vivo

To confirm the efficacy of si-FAM201A on radiosensitivity in vivo, a xenograft tumor mouse model was established. A total of 15 mice with similar weights and dates of birth were selected in the present study (male:female = 8:7). When compared with the control groups, FAM201A knockdown (sh-FAM201A) significantly blocked tumor growth (decreased tumor volume and weight), suggesting that silencing FAM201A expression enhanced radiosensitivity, thereby confirming that FAM201A knockdown could induce radiosensitivity in vivo (**Figures 3G,H**; **Supplementary File 4**).

the gene expression levels in RNA samples isolated from three radiosensitive and three radioresistant ESCC tumor tissues by microarray assays. (B) Differential expression of the potential lncRNAs related to radiosensitivity (CASC2, FAM201A, DLEU2, DLX6-AS1, and MCF2L-AS1) in radiosensitive (*n* = 20) and radioresistant (*n* = 15) ESCC tumor tissues by reverse transcription-quantitative polymerase chain reaction. \**P* < 0.05, \*\**P* < 0.01.

### FAM201A Negatively Regulated the Expression of miR-101

Using Starbase 2.0, miR-101 and miR-590 were predicted to have complementary base pairings with FAM201A. Accordingly, luciferase reporter vectors containing the wild-type (Wt) or mutant (Mut) FAM201A binding site were constructed and co-transfected with miR-101 into Eca109 cells. The same process was performed for miR-590.

The results demonstrated that the ectopic expression of miR-101 was markedly suppressed by co-transfection with the FAM201A mutant sequence in the Eca109 cell luciferase activity reporter assay. However, neither pGL3-FAM201A-Wt reporter nor pGL3-FAM201A-Mut transfection in Eca109 cells affected miR-590 expression (**Figures 4A,B**; **Supplementary File 5**).



*The expression level of FAM201A was not correlated with tumor stage, in terms of either T stage or N stage. Compared with patients with low FAM201A expression, patients with high FAM201A expression had a poorer short-term response to RT. CR, complete response; PR, partial response; SD, stable disease; PD, progressive disease.*

#### FAM201A Upregulated ATM and mTOR Expression by Acting as a miR-101 Sponge

To further evaluate the regulatory relationship between FAM201A and miR-101, Eca109 cells were transfected with si-FAM201A and FAM201A-mimic sequences and matched controls. The results revealed that miR-101 expression was significantly downregulated in FAM201A-mimic Eca109/Eca109R cells, and was notably upregulated in si-FAM201A-transfected Eca109/Eca109R cells (**Figures 4C,D**). Taken together, these results indicated that FAM201A suppressed the expression of miR-101 (**Supplementary File 6**).

Using TargetScan, ATM and mTOR were predicted to be the downstream targets of miR-101. In Eca109/Eca109R cells, the expression of ATM and mTOR was increased while that of miR-101 was decreased in FAM201A-mimic cells when compared with control cells. When FAM201A expression was decreased, the expression of ATM and mTOR was downregulated while that of miR-101 was increased. Compared with non-irradiated cells, the expression of ATM and mTOR increased after X-ray irradiation. Western blotting confirmed the results of PCR (**Figure 5**; **Supplementary File 6**).

### DISCUSSION

The earliest study on lncRNAs associated with radiosensitivity in ESCC was reported by Tong et al. (2014). In that study, they revealed that, when compared with normal paracarcinoma tissue, tumor tissues with low expression of lncRNA LOC285194 exhibited a larger tumor size, poorer histological grade, more advanced TNM stage, and more lymph node and distant metastases than the LOC285194-high group, and that low LOC285194 expression was significantly negatively correlated with the pathological response to RT. Subsequently, researchers have revealed another three lncRNAs related to ESCC radiosensitivity: BOKAS (Zhang et al., 2015), MALAT1 (Li et al., 2016), and AFAP1-AS1 (Zhou et al., 2016). However, clinical trials evaluating such lncRNAs are lacking because the mechanism by which lncRNAs regulate radiosensitivity has yet to be fully elucidated, and so no promising lncRNAs have been applied in the clinic.

In the present study, we identified that the lncRNA FAM201A contributed the most to the radioresistance of ESCC regardless of the tumor stage. The FAM201A gene is a 2.9 kbp gene located at 9p13.1 (Humphray et al., 2004) whose RNA transcripts contain no ORFs, which means that it has no protein-coding potential. In human disease, FAM201A has been briefly reported in obsessive-compulsive disorder and Tourette's syndrome by Yu et al. (2015), while it was first mentioned in cancer (colorectal) by Matsumura et al. (2017). Recently, Huang et al. revealed that FAM201A was involved in the development of osteonecrosis of the femoral head (Huang et al., 2018). However, the molecular mechanism of lncRNA FAM201A function has not been studied. To the best of our knowledge, the present study was the first to report on the correlation of FAM201A with ESCC radiosensitivity and to investigate its potential molecular mechanism, in order to elucidate whether it may be a biomarker for the prognosis and prediction of the patient's response to RT.

The results revealed that patients with FAM201A overexpression had poorer radiosensitivity and inferior survival. Conversely, lower FAM201A expression in ESCC was associated with improved radiosensitivity and a good prognosis, indicating that lnc-FAM201A may serve as a predictor of radiosensitivity in ESCC.

Subsequently, we performed experiments in vitro and in vivo to confirm the functions of FAM201A. In vitro, the overexpression of FAM201A was demonstrated to promote Eca109 cell proliferation, while decreasing FAM201A expression inhibited cell proliferation. The difference in radioresistance following the overexpression of FAM201A in Eca109 and Eca109R cells indicated that FAM201A upregulation likely resulted in cell radioresistance to X-ray irradiation. In addition, the similar levels of radiosensitivity following the reduction in FAM201A expression in Eca109 and Eca109R cells suggested that si-FAM201A may enhance the radiosensitivity of both intrinsically and acquired-radioresistant tumor cells, indicating that si-FAM201A may serve as an effective sensitizing molecular strategy for ESCC. In vivo, when compared with control groups, FAM201A knockdown significantly blocked xenograft tumor growth (decreased tumor volume and weight), which confirmed that si-FAM201A was able to enhance radiosensitivity.

FIGURE 3 | Reverse transcription-quantitative polymerase chain reaction analysis confirmed the efficiency of transfection of (A) Eca109 or (B) Eca109R cells with si-FAM201A and FAM201A-mimic. In both (C) Eca109 and (D) Eca109R cancer cells transfected with si-FAM201A or FAM201A-mimic, the percentage of apoptotic cells in each line increased with increasing X-ray irradiation dose. In contrast to the levels of apoptosis, cell proliferation decreased with increasing radiation doses in (E) Eca109 and (F) Eca109R cells. The effect of sh-FAM201A on xenograft tumor survival was also evaluated: (G) tumor survival curve, (H) tumor volume and weight. \*\**P* < 0.01, \*\*\**P* < 0.001.

109R cells. \**P* < 0.05, \*\**P* < 0.01, \*\*\**P* < 0.001.

X-ray irradiation. Western blotting validation of ATM and mTOR in Eca109 and Eca109R cells. \*\**P* < 0.01.

Recently, the competing endogenous RNA hypothesis proposed that lncRNAs may exert their biological function by acting as a molecular sponge for miRNAs, in turn leading to derepression of miRNA targets (Tay et al., 2014). To explore the molecular mechanism of FAM201A-modulated radiosensitivity in ESCC, we used the online software Starbase 2.0 to predict the downstream targets, and found that miR-101 and miR-590 had complementary base pairings with FAM201A. Only miR-101, and not miR-590, was observed to directly interact with FAM201A, as determined by the luciferase reporter assay. The qPCR analysis further demonstrated that FAM201A overexpression downregulated miR-101 expression while si-FAM201A transfection upregulated miR-101. These results suggested that FAM201A may modulate target gene expression by serving as a "sponge" for miR-101 (Kung et al., 2013).

Further, the role of miRNAs usually depends on which genes they target. The TargetScan analysis showed that ATM and mTOR were targets of miR-101. Furthermore, qPCR revealed that overexpression of FAM201A led to the downregulation of miR-101 and the upregulation of ATM and mTOR, and resulted in radioresistance; by contrast, depletion of FAM201A led to the upregulation of miR-101 and the downregulation of ATM and mTOR, and resulted in radiosensitivity. Additionally, western blotting confirmed these PCR results.

ATM is the major repair protein involved in the homologous recombination repair (HRR) of ionizing radiation-induced double-strand breaks (RI-DSB). ATM deficiency leads to HRR disorders, increased apoptosis and radiosensitivity (Cliby et al., 1998; Cuddihy and Bristow, 2004; Hammond and Muschel, 2014). Therefore, we hypothesize that FAM201A may regulate ESCC radiosensitivity via a "FAM201A-miR-101-ATM-HRR" axis.

HRR occurs only in the S and G2 phases of the cell cycle, due to the requirement of homologous sister chromatids as a template (Pâques and Haber, 1999). DSBs that arise in the absence of a homologous sequence require non-homologous end joining (NHEJ) to achieve DNA repair, which is a repair function performed throughout the cell cycle and was initially considered to be the primary mechanism of RI-DSB repair (Branzei and Foiani, 2008; Beucher et al., 2009). Yan et al. (2010) reported that miR-101 regulates the radiosensitivity of cells by regulating the DNA-dependent protein kinase catalytic subunit, an important member of the NHEJ machinery, via mTOR. Therefore, we hypothesize that lncRNA-FAM201A may also modulate cellular sensitivity to ionizing radiation via a "FAM201A-miR-101-mTOR-NHEJ" axis. In future research, we will focus on the upstream mechanism underlying FAM201A upregulation in regulating ESCC radiosensitivity.

#### CONCLUSIONS

In conclusion, the present study revealed that lncRNA FAM201A may be a potential biomarker for predicting radiosensitivity and prognosis, as well as a therapeutic target for enhancing cancer radiosensitivity in ESCC. FAM201A contributed to radioresistance through a FAM201A-miR-101-ATM/mTOR regulatory network in ESCC. However, the upstream mechanism for FAM201A upregulation in regulating ESCC radiosensitivity requires further study.

#### ETHICS STATEMENT

This study was subject to approval by the Fujian Medical University Union Hospital Institutional Review Board (No. 2014KY001). All patients signed an informed consent prior to treatment, and all information was anonymized and deidentified prior to its analysis.

### AUTHOR CONTRIBUTIONS

MC, PL, YT, JC, and JL conceived the study and contributed to the manuscript and statistical analysis. YL, XiaL, MS, XiqL, and AL assisted with collecting clinical data. RY, WN, XZ, YC, and LZ provided assistance with study design and revisions of the manuscript. All authors read and approved the final manuscript.

#### FUNDING

This study was supported in part by grants from the Fujian Provincial Health & Family Planning Commission (Project Number: 2016-ZQN-32), the Fujian Provincial Department of Science & Technology (Project Number: 2018J01306), the Fujian Provincial Department of Science & Technology (Project Number: 2017Y9079), the Fujian Provincial Platform for Medical Laboratory Research, and Key Laboratory for Tumor Individualized Active Immunity (Project Number: FYKFKT-2017015).

### ACKNOWLEDGMENTS

The authors thank all patients who participated in the present study.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00611/full#supplementary-material


File S6 | Overexpression of FAM201A and its effect on miR-101/ATM/mTOR in Eca109 and Eca109R cells.

ionizing radiation. Radiother Oncol. 99, 287–292. doi: 10.1016/j.radonc.2011.06.002


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Chen, Liu, Chen, Chen, Shen, Liu, Li, Li, Lin, Yang, Ni, Zhou, Zhang, Tian, Li and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# CaDrA: A Computational Framework for Performing Candidate Driver Analyses Using Genomic Features

Vinay K. Kartha<sup>1,2</sup>, Paola Sebastiani<sup>1,3</sup>, Joseph G. Kern<sup>4</sup>, Liye Zhang<sup>5</sup>, Xaralabos Varelas<sup>4</sup> and Stefano Monti<sup>1,2,3</sup>\*

<sup>1</sup> Bioinformatics Program, Boston University, Boston, MA, United States, <sup>2</sup> Section of Computational Biomedicine, Boston University School of Medicine, Boston, MA, United States, <sup>3</sup> Department of Biostatistics, Boston University School of Public Health, Boston, MA, United States, <sup>4</sup> Department of Biochemistry, Boston University School of Medicine, Boston, MA, United States, <sup>5</sup> School of Life Sciences and Technology, ShanghaiTech University, Shanghai, China

#### Edited by:

Binhua Tang, Hohai University, China

#### Reviewed by:

Ao Li, University of Science and Technology of China, China

Samir B. Amin, The Jackson Laboratory for Genomic Medicine, United States

#### \*Correspondence:

Stefano Monti, smonti@bu.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 07 October 2018; Accepted: 04 February 2019; Published: 19 February 2019

#### Citation:

Kartha VK, Sebastiani P, Kern JG, Zhang L, Varelas X and Monti S (2019) CaDrA: A Computational Framework for Performing Candidate Driver Analyses Using Genomic Features. Front. Genet. 10:121. doi: 10.3389/fgene.2019.00121

The identification of genetic alteration combinations as drivers of a given phenotypic outcome, such as drug sensitivity, gene or protein expression, and pathway activity, is a challenging task that is essential to gaining new biological insights and to discovering therapeutic targets. Existing methods designed to predict complementary drivers of such outcomes lack analytical flexibility, including the support for joint analyses of multiple genomic alteration types, such as somatic mutations and copy number alterations, multiple scoring functions, and rigorous significance and reproducibility testing procedures. To address these limitations, we developed Candidate Driver Analysis or CaDrA, an integrative framework that implements a step-wise heuristic search approach to identify functionally relevant subsets of genomic features that, together, are maximally associated with a specific outcome of interest. We show CaDrA's overall high sensitivity and specificity for typically sized multi-omic datasets using simulated data, and demonstrate CaDrA's ability to identify known mutations linked with sensitivity of cancer cells to drug treatment using data from the Cancer Cell Line Encyclopedia (CCLE). We further apply CaDrA to identify novel regulators of oncogenic activity mediated by Hippo signaling pathway effectors YAP and TAZ in primary breast cancer tumors using data from The Cancer Genome Atlas (TCGA), which we functionally validate in vitro. Finally, we use pan-cancer TCGA protein expression data to show the high reproducibility of CaDrA's search procedure. Collectively, this work demonstrates the utility of our framework for supporting the fast querying of large, publicly available multi-omics datasets, including but not limited to TCGA and CCLE, for potential drivers of a given target profile of interest.

#### Keywords: oncogenic driver analysis, stepwise search, TCGA, CCLE, R package

**Abbreviations:** BRCA, breast carcinomas; CaDrA, candidate driver analysis; CCLE, Cancer Cell Line Encyclopedia; COSMIC, Catalogue of Somatic Mutations in Cancer; FDR, false discovery rate; FPR, false positive rate; KS, Kolmogorov–Smirnov; qRT-PCR, quantitative real-time polymerase chain reaction; RPPA, reverse phase protein array; SCNA, somatic copy number alteration; TCGA, The Cancer Genome Atlas; TN, triple-negative; TPR, true positive rate.

## INTRODUCTION


Advances in high-throughput sequencing technology have led to a rapid rise in the availability of large multi-omic datasets through compendia such as the CCLE, TCGA, the Genotype-Tissue Expression (GTEx) project, and others (Barretina et al., 2012; Chang et al., 2013; Ardlie et al., 2015). These data include genetic alterations, comprising SCNAs and somatic mutations, epigenetic information, such as microRNA expression and DNA methylation, as well as gene expression profiling through microarray or RNA-sequencing (RNASeq) technology, across tens of thousands of samples representing varying biological contexts. Concomitantly, several computational methods have been developed and applied to effectively query and integrate different types of genome-wide datasets in order to make meaningful predictions about the biological processes driving the phenotypes of interest (Drier et al., 2013; Kristensen et al., 2014). An important application of such methods is the identification of recurrent genomic alterations, and their potential effects on downstream pathway activity or phenotypes associated with development and disease states. For example, in many cancers, samples exhibiting elevated activity of a given oncogenic signature may be enriched for, or driven by, functionally relevant somatic mutations or SCNAs. Identifying such associations may help elucidate underlying mechanisms contributing to abnormal pathway activity, further enabling disease subtyping and sample classification (Bea et al., 2005; Savage et al., 2003; Monti et al., 2012). Alternatively, linking these genomic features with their close interactors through protein-protein interaction networks, gene function annotations or phenotypic readouts such as drug sensitivity may support the discovery of novel druggable targets and further guide precision medicine regimens (Bild et al., 2006; Heiser et al., 2011; Daemen et al., 2013; Hou and Ma, 2014; Jia and Zhao, 2014).

Recently, computational methods and models have been developed for performing driver gene analyses applied to high-dimensional 'omics' data from cancer cell lines and patients. These are typically motivated either by frequency or exclusivity of alterations across samples (Youn and Simon, 2011; Ciriello et al., 2012; Dees et al., 2012; Vandin et al., 2012; Lawrence et al., 2013; Leiserson et al., 2013; Kim et al., 2016), or their functional interplay based on biological interaction networks and pathway ontology (Ng et al., 2012; Creixell et al., 2015; Leiserson et al., 2015; Cho et al., 2016). Indeed, certain approaches integrate interactome and functional information to further guide driver gene prioritization in cancer (Chen et al., 2014; Xi et al., 2017; Sanchez-Vega et al., 2018). Some of these tools have been proposed to specifically identify subsets or combinations of genomic features that are collectively associated with a given phenotypic response, explaining a larger fraction of the biological context than any individual feature alone (Kim et al., 2016). These methods, while useful, do not offer simultaneous support for: (i) the joint analyses of multi-type features, including SCNAs and somatic mutations, with possible extension to other genomic data, (ii) multiple feature scoring functions and, most importantly, (iii) rigorous assessment of the statistical significance of the discovered associations. Of equal

Here, we present CaDrA, a methodology that searches for the set of genomic alterations, here denoted as features (mutations, SCNAs, translocations, etc.), associated with a user-provided ranking of samples within a dataset. Our method specifically employs a stepwise heuristic search to identify a subset of features whose union is maximally associated with the observed sample ranking, and carries out rigorous statistical significance testing based on sample permutation, thereby allowing for the identification of candidate genetic drivers associated with aberrant pathway activity or drug sensitivity, while still exploiting aspects of feature complementarity and sample heterogeneity. To highlight the method's overall performance, along with its relevance and ability to select sets of genomic features that indeed drive certain oncogenic phenotypes in cancer, we perform extensive evaluation of CaDrA based on simulated data, as well as real genomic data from cancer cell lines and primary human tumors. The results from simulations show that CaDrA has high sensitivity for mid- to large-sized datasets, and high specificity for all sample sizes considered. Using genomic data drawn from CCLE and TCGA, we demonstrate CaDrA's capacity to correctly identify well-characterized driver mutations in cancer cell lines and primary tumors spanning multiple cancer types, along with its ability to discover novel features associated with invasive phenotypes in human breast cancer samples, which we functionally validate in vitro. Our framework, which is publicly available as an R package, will allow for rapidly mining numerous multi-omics datasets for candidate drivers of user-specified molecular readouts, such as pathway activity, drug sensitivity, protein expression, or other quantitative measurements of interest, further enabling targeted queries and novel hypothesis generation.

### RESULTS

#### CaDrA Overview

An overview of CaDrA's workflow is summarized in **Figure 1**. CaDrA implements a step-wise heuristic approach that searches through a set of binary features [each represented as a 1/0-valued vector, indicating the presence/absence of a SCNA, somatic mutation, or other (epi)genetic alterations across samples, respectively], and returns a final subset of features whose union (logical OR) defines an alteration 'meta-feature' that is maximally associated with the defined sample ranking provided as input (see section "Methods"). The strength of the association of a meta-feature with a sample ranking is a function of the agreement between the skewness of the alterations' occurrences and the sample ranking. The input sample ranking is usually a function of a sample-specific measurement, e.g., the activity level of a pathway, the response to a targeted treatment, the expression level of a given transcript or protein, etc. Therefore, the meta-feature returned by the search is the set of features maximally predictive of that same sample-specific measurement variable. The logical OR operator used in the iterative search framework specifically takes advantage of heterogeneity seen across samples (i.e., samples harboring similar phenotypes but different drivers of the given outcome), thus enabling the potential identification of complementary drivers of target phenotypes (Kim et al., 2016). CaDrA allows for multiple modes to query ranked binary datasets with user-specified parameters defining search criteria, enables rigorous permutation-based significance testing of results, and reduces computation time by exploiting pre-computed score distributions and parallel computing, when available (see section "Methods").
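The search described above can be illustrated with a minimal Python sketch. This is not the CaDrA R package's API or its exact algorithm: the names `ranksum_z`, `union` and `stepwise_search` are hypothetical, the scoring function is assumed to be a Wilcoxon rank-sum z-score (normal approximation) chosen for brevity, and only forward steps are taken. Features are 0/1 vectors whose positions follow the sample ranking, with the top-ranked sample first.

```python
from math import sqrt


def ranksum_z(hits):
    """Rank-sum z-score: positive when the 1s concentrate at the top of the ranking."""
    n, n1 = len(hits), sum(hits)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        return 0.0
    w = sum(rank for rank, h in enumerate(hits, start=1) if h)
    expected = n1 * (n + 1) / 2
    sd = sqrt(n1 * n0 * (n + 1) / 12)
    return (expected - w) / sd


def union(a, b):
    """Logical OR of two binary vectors."""
    return [x | y for x, y in zip(a, b)]


def stepwise_search(features, max_size=7, score=ranksum_z):
    """Greedy forward search for a meta-feature (union of selected features).

    features: dict mapping feature name -> 0/1 list in ranked sample order.
    Returns (selected names, meta-feature vector, best score).
    """
    n = len(next(iter(features.values())))
    selected, meta, best = [], [0] * n, 0.0
    while len(selected) < max_size:
        # Score the union of the current meta-feature with each unused feature.
        candidates = {
            name: score(union(meta, vec))
            for name, vec in features.items() if name not in selected
        }
        if not candidates:
            break
        name = max(candidates, key=candidates.get)
        if candidates[name] <= best:  # no improvement: terminate the search
            break
        selected.append(name)
        meta, best = union(meta, features[name]), candidates[name]
    return selected, meta, best


# Toy data (hypothetical): two complementary top-skewed features and one null feature.
toy = {
    "mutA": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "mutB": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    "null": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
}
names, meta, best = stepwise_search(toy)
```

In this toy case the union of the two complementary features scores higher than either alone, while adding the null feature lowers the score, so the search stops after two iterations.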

### Analysis of Simulated Data to Evaluate CaDrA Performance

To assess the overall performance of CaDrA to recover (statistically) significantly associated meta-features, we simulated two types of datasets for a range of sample sizes: (i) the true-positive datasets consist of both left-skewed (i.e., true positive with skewness concordant with sample ranking) as well as uniformly distributed (i.e., null) features; and (ii) the null datasets consist of null features only (see section "Methods" and **Supplementary Figure S1**). This enabled us to estimate the overall sensitivity and specificity of CaDrA using the true positive and null datasets, respectively. By running CaDrA on multiple simulated datasets of different sample sizes (n = 500 true positive and null datasets for each sample size), we first evaluated the resulting meta-features based on the number of true positive features and the total number of features contained within each returned meta-feature (i.e., the meta-feature size; **Figures 2A,B**). The true positive datasets had a maximum of five positive features to be detected, while the maximum number of features CaDrA was allowed to add was set to 7, to evaluate the ability of the search to recover all but no more than the positive features. With progressively higher sample sizes, we observed an increase in the fraction of CaDrA-identified meta-features that include all 5 true positive features (**Figure 2A**). The TPR and FPR of CaDrA on the simulated positive and null data, respectively, for different sample sizes are shown in **Figures 2C,D**, and were calculated as the fraction of searches returning meta-features with permutation p-value significant at α = 0.05 (**Supplementary Figure S2**). The TPR was estimated for different numbers of recovered true positive features (in the true positive datasets), while the FPR was estimated for different numbers of returned features (by definition, false positives) in the null datasets; both are summarized in **Table 1**. CaDrA returned all of the simulated true positive features with 100% TPR for sample sizes larger than N = 100. CaDrA also yielded a very high mean TPR of >95% at N = 100, with the sensitivity dropping to 7.7% only at the smallest sample size of N = 50 (**Table 1**). Further, when applied to the null datasets (**Figure 2B**), the majority of meta-features returned by CaDrA were correctly deemed as non-significant at α = 0.05, with a maximum mean FPR of 7.2% for the lowest sample size analyzed (**Figure 2D** and **Table 1**).

These results suggest that CaDrA requires mid- to large-sized datasets for sufficient sensitivity, while maintaining high specificity at all sample sizes assessed.
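The permutation-based significance test and the TPR/FPR summaries used in this evaluation can be sketched as follows. This is an illustrative outline rather than CaDrA's implementation: the helper names are hypothetical, `top_half_score` is a deliberately simple stand-in for the full search, and the add-one correction in the p-value is an assumption not stated in the text.

```python
import random


def permutation_pvalue(features, search_score, n_perm=1000, seed=0):
    """Empirical p-value of a search score under permuted sample rankings.

    features: dict name -> 0/1 list in ranked sample order.
    search_score: callable returning the best score found for a feature dict.
    """
    rng = random.Random(seed)
    observed = search_score(features)
    n = len(next(iter(features.values())))
    exceed = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)  # one shared permutation of samples for all features
        permuted = {k: [v[i] for i in order] for k, v in features.items()}
        if search_score(permuted) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one correction (assumed) avoids p = 0


def rate_significant(pvalues, alpha=0.05):
    """Fraction of searches called significant: TPR on positive data, FPR on null data."""
    return sum(p <= alpha for p in pvalues) / len(pvalues)


def top_half_score(features):
    """Toy stand-in for a search: largest excess of hits in the top half of the ranking."""
    n = len(next(iter(features.values())))
    return max(sum(v[: n // 2]) - sum(v[n // 2:]) for v in features.values())


# Toy dataset (hypothetical): one strongly top-skewed feature and one balanced feature.
skewed = {"f1": [1] * 8 + [0] * 12, "f2": [0, 1] * 10}
p = permutation_pvalue(skewed, top_half_score, n_perm=200, seed=1)
```

Repeating this over many simulated datasets and passing the resulting p-values to `rate_significant` gives the TPR (true-positive datasets) or FPR (null datasets) at a chosen α.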

### CaDrA Identifies Known Regulators of Ras/Raf/Mek/ERK Signaling Sensitivity in Cancer Cell Lines

The mitogen-activated protein kinase (MAPK) kinase (MEKK)/extra-cellular signal-regulated kinase (ERK) pathway is a well-conserved kinase cascade known to play a regulatory role in cell proliferation, differentiation, and survival in response to extracellular signaling (Kim and Choi, 2010; Cargnello and Roux, 2011; Burotto et al., 2014). Increased MAP/ERK kinase (MEK) activity is a feature of many cancers, and is often triggered by missense mutations in BRAF and NRAS, two upstream oncogenes and potent regulators of Ras/Raf/Mek/ERK signaling (Cantwell-Dorris et al., 2011; Burotto et al., 2014). Small molecules targeting these mutated proteins have been shown to be effective in treating these cancers via inactivation of Ras/Raf/Mek/ERK signaling (Roberts and Der, 2007; Chapman et al., 2011; Barretina et al., 2012; Johnson and Puzanov, 2015). To highlight CaDrA's ability to recover independent genomic features that may confer hypersensitivity of cancer cells to targeted small molecule treatment, we utilized drug sensitivity profiles for MEK inhibitor AZD6244 (Yeh et al., 2007), along with matched genomic data from CCLE. Specifically, we used per-sample estimates of 'ActArea' or area under the fitted dose response curve, a metric that has been shown to accurately capture drug response behavior (Jang et al., 2014), to rank cell lines from high to low sensitivity, as well as data comprising somatic mutations and SCNAs as the binary feature matrix (see section "Methods"). CaDrA was then run to look for a subset of features associated with increased sensitivity to treatment with AZD6244 (i.e., increased ActArea scores).

Weight-averaged TPRs and FPRs were computed per sample size for true positive and null simulated datasets, respectively (n = 500 simulated datasets per sample size; see section "Methods").

The resulting feature set (i.e., meta-feature) is shown in **Figure 3**. Remarkably, CaDrA selected the BRAF<sup>V600E</sup> and NRAS somatic mutations in the first two iterations, respectively. Subsequent iterations identified mutations in APAF1, TGFBR2, and AMHR2, before terminating the search process (P ≤ 0.001). APAF1 is a pro-apoptotic factor and known regulator of cell survival and tumor development (Ferraro et al., 2003), the depleted expression of which has been observed in malignant melanoma cell lines and specimens (Soengas et al., 2006). TGFBR2 and AMHR2 are both type II receptors functioning as part of the transforming growth factor (TGF)/bone morphogenetic protein (BMP) superfamily, together serving as mediators of cellular differentiation, proliferation and survival, and play important roles in directing epithelial-mesenchymal transition (EMT) (Rojas et al., 2009; Stone et al., 2016). Notably, MAPK signaling activity can also be regulated by TGF/BMP stimulation (Derynck and Zhang, 2003; Moustakas and Heldin, 2005; Chapnick et al., 2011), suggesting that these mutations are potential independent drivers of increased MEK signaling, and hence, of increased sensitivity to treatment with AZD6244. We next extended our analysis of cancer cell line sensitivity profiles to alternative small molecules targeting MEK (PD-0325901), as well as RAF (PLX4720 and RAF265). The meta-features associated with increased sensitivity to each of the four drug treatments assessed are shown in **Supplementary Figure S3** and summarized in **Table 2**. Importantly, both BRAF<sup>V600E</sup> and NRAS mutations were identified as candidate drivers of sensitivity to MEK inhibition by AZD6244 and PD-0325901. Furthermore, the BRAF<sup>V600E</sup> mutation was returned by CaDrA for all four independent queries, highlighting its association with increased sensitivity to inhibitors targeting the same protein (BRAF) as well as its downstream effector (MEK).

FIGURE 3 | CaDrA identifies mutations in MAPK/ERK signaling genes that contribute to hyper-sensitivity to MEK inhibition in vitro. ActArea measurements reflecting sensitivity to MEK inhibitor AZD6244 were used to rank CCLE cell lines (n = 477). CaDrA was then run to identify sets of genomic features that were most associated with decreasing ActArea (i.e., increasing sensitivity) scores. Through step-wise search iterations, CaDrA identified somatic mutations in known regulators upstream of MEK, including an activating mutation in BRAF (BRAF<sup>V600E</sup>) and NRAS, as well as those in APAF1, TGFBR2, and AMHR2, before terminating the search process. The resulting meta-feature (red track) and its corresponding enrichment score (ES) is shown.

Collectively, these results confirm CaDrA's capability to accurately identify upstream drivers of cellular response to treatment that are both components of independently linked pathways, as well as part of the same signaling branch, which in turn suggests their role in driving the disease state of interest.

#### CaDrA Identifies Hallmark Drivers Associated With Protein Biomarkers in Human Cancers

Protein abundance levels have widely been utilized to histologically classify several human tumor subtypes, with relevant diagnostic and therapeutic implications. Epidermal Growth Factor Receptor (EGFR) expression, for instance, together with EGFR mutation status can be used to predict response to existing anti-EGFR treatments in patients with lung cancers (Pao et al., 2004; Mascaux et al., 2011). To demonstrate CaDrA's targeted search mode when identifying genomic alterations that track with a pre-defined starting feature, we ran CaDrA using phosphorylated EGFR (EGFRTyr1068) protein expression levels to stratify TCGA lung adenocarcinomas (LUAD), and seeded the search process with EGFR mutations. Subsequent search iterations selected well-known regulators of EGFR activity in lung cancers, including mutations in epithelial-to-mesenchymal transition mediators SMAD4 and LAMC2, as well as ERBB2 (Liu et al., 2015; Moon et al., 2015), with the meta-feature being statistically significant based on the permuted null background obtained for the same search criterion (P ≤ 0.02; **Supplementary Figure S4**).

Mutation meta-features identified as associated with increased sensitivity to inhibitors targeting MEK (AZD6244, PD-0325901) and RAF (PLX4720) are shown, along with the corresponding permutation p-value of each search result.

We then wished to more systematically determine whether CaDrA can identify known drivers of target profiles previously associated with oncogenic and tumor-suppressive markers in human cancers. To do so, we queried TCGA expression profiles of proteins encoded by a set of hallmark genes that are defined in the COSMIC database (Forbes et al., 2017), along with genomic data from nine different cancer types in TCGA (Forbes et al., 2017). Briefly, for each cancer type, a CaDrA query was performed with respect to each of the proteins corresponding to the COSMIC-defined oncogenes or tumor suppressor genes (n = 57). In particular, CaDrA was applied to search for sets of genomic features associated with elevated protein expression for each protein under consideration. The features selected by CaDrA were then pooled across all protein queries, and the resulting feature set was tested for enrichment against the reference COSMIC list of frequently mutated oncogenes and tumor suppressor genes (n = 554; see section "Methods"). We observed a significant enrichment of the reference cancer driver mutations among the CaDrA-identified features in all cancer types tested (Hyper-enrichment FDR < 0.05; **Figure 4** and **Supplementary Table S1**). These results validate CaDrA's ability to identify independently cataloged, functionally relevant genomic drivers in primary human malignancies.

#### CaDrA Reveals Novel Drivers of Oncogenic YAP/TAZ Activity in Human Breast Cancer

Next, we tested whether our framework can be applied to the discovery of novel drivers of oncogenic pathways in cancer. The Hippo signaling pathway is a highly conserved developmental pathway known to play an essential role in cell proliferation and survival (Varelas, 2014). YAP (Sudol, 1994) and TAZ (Kanai et al., 2000) serve as central downstream transcriptional effectors of the pathway. Aberrant nuclear YAP/TAZ localization and transcriptional activity is associated with a range of cancers, including BRCAs (Hiemer et al., 2015; Moroishi et al., 2015; Zanconato et al., 2015, 2016).

p-values are shown, with cancer types sorted in decreasing order of FDR q-value. BLCA, bladder urothelial carcinoma; BRCA, breast invasive carcinomas; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell carcinoma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; OV, ovarian serous cystadenocarcinoma; PAAD, pancreatic adenocarcinoma; PRAD, prostate adenocarcinoma. Points are plotted in -log<sub>10</sub> space.

To identify alternative genetic events that can potentially explain the elevated YAP/TAZ activity exhibited in some human breast cancers, we applied CaDrA using genomic data from the TCGA BRCA sample cohort, along with corresponding per-sample estimates of YAP/TAZ activity derived using a gene expression signature of YAP/TAZ knockdown in MDA-MB-231 cells (see section "Methods"). Samples with available RNASeq, somatic mutation and SCNA profiles (n = 957) were first ranked in decreasing order of their overall YAP/TAZ activity estimates. The ranked binary matrix of mutation and SCNA features was then used as input to CaDrA. In the first iteration, CaDrA identified the top scoring genomic feature to be a deletion on chromosomal locus chr5q21.3 (**Figure 5A**), harboring tyrosine kinase receptor-encoding gene EFNA5. EFNA5, a member of the Eph receptor family, has been hypothesized to function as a tumor suppressor, whose expression has been shown to be reduced in human BRCAs relative to normal epithelial tissue (Fu et al., 2010). Advancing to a second iteration, CaDrA then identified an additional deletion of chr20p13 as the next-best feature (**Figure 5A**). The chr20p13 genomic deletion spans multiple genes (**Supplementary Table S2**), including RBCK1, whose reduced expression has been shown to be associated with increased tumor cell proliferation and survival, as well as with poor prognosis in breast cancer (Donley et al., 2014). CaDrA then proceeded to identify somatic mutations in the RELN gene, before terminating the search process (P ≤ 0.001; **Figure 5A**). Loss of RELN expression has indeed been shown to induce cell migration in esophageal carcinoma, and to be associated with poor prognosis in breast cancer (Stein et al., 2010; Yuan et al., 2012).

To ensure that the derived meta-feature association is not a spurious consequence of correlation with tumor subtype, we tested for the association of YAP/TAZ activity with the meta-feature while controlling for BRCA TN status using a linear regression model. The results confirmed that the positive association between YAP/TAZ activity and the occurrence of these genomic alterations is independent of BRCA patho-histology (linear regression meta-feature coefficient P < 0.0001; **Figure 5B**). Analysis of YAP/TAZ activity based on the same knockdown signature in CCLE BRCA cell lines (n = 59; **Supplementary Figure S5A**) shows that RBCK1 and RELN display the highest anti-correlation between their gene expression and YAP/TAZ activity (**Supplementary Figure S5B**). In order to assess whether these identified candidates indeed drive the elevated YAP/TAZ activity phenotype, we performed siRNA-mediated knockdown of RELN or RBCK1 in HS578T breast cancer cells, followed by expression quantification of YAP/TAZ canonical targets, which serves as a read-out of nuclear YAP/TAZ activity (Piccolo et al., 2014). HS578T cells, like the MDA-MB-231 cells from which the gene signature was derived, are TN BRCA cells, but display lower overall YAP/TAZ activity (rank 7/59) than the latter (rank 54/59). Importantly, knockdown of either of these candidate drivers in these cells yielded a significant increase in expression levels of YAP/TAZ targets CTGF and CYR61 (FDR < 0.05; two-tailed Student's t-test), validating the association of their loss of function with increased YAP/TAZ transcriptional activity (**Figure 5C**).

FIGURE 5 | CaDrA identifies novel drivers of oncogenic YAP/TAZ activity in human breast carcinomas. (A) TCGA BRCA RNASeq data (n = 951) was projected onto the space of YAP/TAZ-activating genes (blue area plot; see section "Methods"). A CaDrA search for features associated with elevated YAP/TAZ activity identified two chromosomal deletions (Del5q21.3, Del20p13), and a somatic mutation in RELN (black tracks). The union of the three features (red track) and the corresponding running enrichment score (ES) is also shown. (B) Box plot of YAP/TAZ activity estimates for triple negative (TN) and non-TN TCGA BRCA samples. Sample groups are further stratified by the presence or absence of the union alteration status of the meta-feature identified by CaDrA (panel A, red track). Only samples with known TN status were considered. (C) siRNA-mediated knockdown of 20p13-harboring gene RBCK1, and RELN in HS578T cells resulted in significant increase in the expression levels of canonical YAP/TAZ targets CTGF and CYR61, as indicated by their relative qRT-PCR expression, confirming the identified CaDrA hits as potential regulators of BRCA-associated YAP/TAZ activity. (D) Sub-sampling-based reproducibility assessment for candidate drivers of YAP/TAZ activity compared to a CaDrA query for a random profile ranking in TCGA BRCAs. Jaccard (J) indices of the returned meta-features obtained with and without sub-sampling (repeated for n = 100 independent sub-sampling iterations) were computed and compared for the two queries, yielding a significantly higher J index distribution for the original query relative to the permuted ranking query (Wilcoxon P < 0.0001). Ctrl: Scrambled control; YT: YAP/TAZ; <sup>∗</sup>FDR < 0.05; two-tailed Student's t-test.

Thus, application of CaDrA to the analysis of YAP/TAZ activity in primary BRCA samples identified multiple new candidate drivers, with in vitro validation confirming the causal role of the top two candidates, RBCK1 and RELN, in driving this activity. These results highlight our tool's ability to discover novel oncogenic genomic drivers.

#### Evaluation of CaDrA Reproducibility

Next, we sought to determine CaDrA's reproducibility, and how this may be influenced by the statistical significance of the returned meta-feature (as determined by permutation p-value). To do so, we implemented a sub-sampling procedure and applied it to the search for YAP/TAZ activity drivers in TCGA BRCAs. Specifically, the original meta-feature returned by the search on the full dataset, and the meta-feature returned when performing the same search on a random subset (80%) of samples were compared by the Jaccard (J) index (see section "Methods"). We performed this sub-sampling search procedure both with respect to the original sample ranking (**Figure 5A**), and with respect to a permuted sample ranking (n = 100 iterations each). Comparison of the resulting J index distributions yielded a significantly higher reproducibility of results when sub-sampling from the original sample ranking than from the randomly permuted one (Wilcoxon P < 0.0001; **Figure 5D**). These results support the conclusion that the CaDrA-based significance testing is a strong predictor of a search result's reproducibility, and a rigorous criterion to discriminate between true and false positives.

To systematically validate this conclusion, we extended the sub-sampling analysis to CaDrA queries of protein expression profiles across the nine different cancer types previously described. Briefly, for each cancer type we assessed whether the meta-features corresponding to the top five most significant CaDrA protein queries (CaDrA P ≤ 0.05) were more reproducible than those corresponding to a randomly selected subset of five non-significant protein queries (CaDrA P > 0.05). To this end, the J index distribution obtained upon sub-sampling from the significant queries (n = 100 iterations each) was compared to the equivalent distribution from the non-significant queries, and a significantly higher reproducibility of the former was observed in all nine cancer types tested (Wilcoxon FDR < 0.001; **Figure 6**).

Taken together, these results show that CaDrA-based significance testing is a strong predictor of a search result's reproducibility. Most importantly, it provides a statistically rigorous decision rule, which would not be available based on the sub-sampling results alone.

## DISCUSSION

Identifying (epi)genetic drivers of molecular readouts is of fundamental importance to determining alternative mechanisms influencing the phenotype in question. Existing methods attempting to extract functionally relevant sets of genomic alterations associated with a given context either do not support the analysis of data beyond somatic mutations, do not incorporate multiple feature scoring functions and search modes, or do not implement rigorous statistical significance testing of the obtained results. Importantly, no existing computational framework bundles all of these features, and such a package could significantly help identify novel drivers of signature activity.

Here, we presented CaDrA as a tool that determines the subset of queried binary features most associated with a phenotypic signature of interest by specifically exploiting a step-wise heuristic search method. CaDrA was applied to identify both known and novel genomic drivers of sample signature activity, comprising drug sensitivity, protein expression and gene set activity estimates, using publicly available multi-omics datasets from cancer cell lines and primary tumors. Querying CCLE data for features associated with increased sensitivity to MEK/RAF inhibitors, CaDrA recovered known driver mutations in oncogenes known to be gate-keepers of MEK pathway activity, including NRAS and BRAF. Importantly, BRAF<sup>V600E</sup> mutations account for >90% of BRAF mutations and are generally found to be mutually exclusive with NRAS mutations (Sensi et al., 2006; Cantwell-Dorris et al., 2011), as also observed in the CCLE, highlighting CaDrA's ability to identify features exhibiting mutual exclusivity. Further, the large-scale investigation of expression profiles of annotated hallmark proteins in tumors from nine different cancer types in TCGA confirmed CaDrA's ability to systematically identify known mutations of oncogenes and tumor suppressor genes in human cancers, as defined in the COSMIC database.

Through our extensive evaluation on simulated data, we were able to highlight CaDrA's high sensitivity for mid-to-large sized datasets (N > 90), and high specificity for all sample sizes considered. Importantly, multi-omics datasets produced by networks such as CCLE and TCGA, also presented in this study, are well above this sample size limit. CaDrA's specificity was further evident when querying genetic drivers of increased sensitivity to treatment with PLX4720, a potent and selective inhibitor designed to preferentially inhibit active B-Raf protein bearing the V600E allele (Tsai et al., 2008). In this scenario, the search process correctly identified the BRAF<sup>V600E</sup> mutation as the sole feature associated with elevated sensitivity to treatment, in agreement with the known specificity of the small molecule inhibitor, with the feature association being highly statistically significant. It is important to emphasize that the evaluation of CaDrA's sensitivity and specificity crucially relied on the statistical testing procedure we defined, a feature missing in most of the other existing methods.

We were also able to demonstrate the utility of our framework in the discovery of novel drivers in human breast cancers. Specifically, we asked whether there were genomic alterations associated with elevated activity of Hippo pathway co-activators YAP/TAZ, known to control pro-tumorigenic signals in multiple cancer types (Hiemer et al., 2015; Moroishi et al., 2015; Zanconato et al., 2016). The mechanisms contributing to dysregulated YAP/TAZ activity in cancer remain poorly understood. To date, very few genomic alterations have been associated with driving tumorigenic YAP/TAZ activity (Harvey et al., 2013). Our CaDrA search with respect to a sample ranking of decreasing YAP/TAZ activity, as measured by the coordinated expression of YAP/TAZ-activated genes, yielded a meta-feature consisting of chromosomal deletions of 5q21.3 and 20p13, and mutations in RELN. Subsequent functional validation by knockdown of select targets, namely RELN and RBCK1, in HS578T BRCA cells exhibiting low YAP/TAZ activity resulted in a significant increase in the expression of canonical YAP/TAZ targets CTGF and CYR61. These results confirmed the selected targets' involvement in the regulation of YAP/TAZ-mediated activity, and the capability of CaDrA to identify new drivers of pathway activity. Importantly, this case study highlights the capability of the method to integrate information, and discover targets pertaining to multiple DNA alteration types.

A sub-sampling-based assessment of CaDrA's results shows that the ability to recover reproducible meta-features was higher for the true (significant) YAP/TAZ activity ranking than for a randomly permuted sample ranking. This sub-sampling procedure was independently assessed using a systematic pan-cancer comparison of reproducibility results from significant and non-significant protein queries, which revealed a significantly higher concordance of the former compared to the latter in all cases tested. Together, these results confirm the agreement between the estimated permutation p-values and the reproducibility of the meta-features identified by CaDrA, and emphasize the importance of our statistical testing procedure in supporting normative decision making.

Previously developed methods have indeed been shown to aid in the selection of functionally relevant genomic features in cancer (Ciriello et al., 2012; Vandin et al., 2012; Leiserson et al., 2013, 2015; Kim et al., 2016). However, CaDrA is to our knowledge the only method performing rank-based prediction in this context, which we believe is well-suited to: (i) model the noisy relationship between (epi)genetic alterations and a functional readout, and (ii) privilege the accurate prediction of highly ranked samples over lowly ranked samples, a desirable feature when modeling oncogenic activity. Furthermore, the framework as defined is flexible enough that non-rank-based scoring functions can be easily incorporated. We emphasize that the use of rank-based scoring functions, while advantageous for the reasons mentioned, relies on accurate stratification of samples based on the dependent variable to yield concordant associations for a given biological question. Thus, the soundness of predictions is dependent on the quality of signatures used to query the target profile of interest.

The method that most resembles CaDrA in its approach is REVEALER (Kim et al., 2016), an iterative search algorithm that functions in a similar fashion to CaDrA, while specifically seeking only those features that are mutually exclusive given the sample context. We note that a direct and rigorous comparison between CaDrA and REVEALER was not possible given the lack of a formal procedure to estimate statistical significance of results in the latter. We further emphasize that our tool defines a flexible framework capable of incorporating additional feature scoring functions, including the mutual information criterion implemented in REVEALER. Indeed, the incorporation of such scoring functions would benefit from the statistical significance estimation module built into CaDrA.

Current implementations of CaDrA and other similar methods are limited to the use of summarized input genomic features that are treated as binary events, denoting the presence or absence of a given mutation or SCNA in a sample. As we have demonstrated, this summarization approach is indeed sufficient to identify genomic feature sets that may drive the target profile of interest. However, since different types of point mutations (missense, truncating, etc.) may impose differing functional impacts in oncogenes versus tumor suppressor genes, we surmise that these methods could be further improved by qualitatively differentiating between the different types of alterations being considered. One possibility would be to separate mutations by predicted gain or loss-of-function, as well as to distinguish between low (1) and high (≥2) DNA copy number gains or losses, although this may lead to excessive sparsity in the input matrix for low-frequency point mutations and SCNAs.

While our evaluations focused on somatic mutations and SCNAs, CaDrA's search functionality can be applied to additional sequencing readouts capturing regulatory features, including, but not limited to, DNA methylation and microRNA expression, albeit with proper discretization of these continuous features. A joint analysis of these additional data types might provide insight into epigenetic mechanisms that complement the assessed genetic features in driving phenotypic variation. Furthermore, we envision the adoption of CaDrA for the study of germ-line variation as well, thus helping to move beyond the "one feature at a time" paradigm typical of GWAS, although issues of computational efficiency in that problem space will likely become more challenging.

## CONCLUSION

CaDrA enables the efficient identification of subsets of genomic features, including somatic mutations and SCNAs, as candidate drivers of a pre-defined phenotypic variable. Given the rapid rise in the availability of multi-omics datasets, as well as an increased need to interrogate targeted molecular readouts within these contexts, we believe that our methodology will accelerate feature prioritization for further follow-up and consideration, in turn aiding in the discovery of potential drivers of the phenotype of interest. Thus, we propose CaDrA as a tool for both targeted hypothesis testing and novel hypothesis generation.

## METHODS

#### The CaDrA Algorithm

An overview of CaDrA's workflow is summarized in **Figure 1**. CaDrA takes as input the sample ranking induced by a sample-specific measurement, a matrix of binary features (1/0 indicating the presence/absence of a given feature in a sample), and a scoring method specification to measure the significance of the concordance between the occurrence of alteration events and the defined sample ranking. The pre-defined sample ranking can be based on quantitative estimates of gene expression, a signature or pathway activity, or other experimentally derived measurements. Each row in the matrix of binary features denotes the presence or absence of a somatic alteration (mutation, CNA, or other) in each of the samples in the ranked cohort. The score function is a measure of the left-skewness of a binary vector with respect to the sample ranking. The more the occurrences of an alteration are skewed toward higher rankings (i.e., the more the 1's in the feature vector are skewed toward the left), the higher the score. The scores currently implemented are the KS test (default), and the Wilcoxon rank-sum test, but additional scoring functions can easily be added.

Given the sample ranking, the matrix of binary features, and the score of choice (KS or Wilcoxon), CaDrA implements a step-wise greedy search. It begins by selecting the single feature that maximizes the score (Step 1; **Figure 1**). It then generates the union (logical OR) of this starting feature with every other remaining feature in the dataset and computes scores for the resulting 'meta-features' (Step 2; **Figure 1**), and selects the second feature that, added to the first (as a union), maximally increases the score; this union then serves as the new top reference hit (Step 3; **Figure 1**). This process is repeated until no further improvement to the cumulative score can be attained, and the search output is a set of features (i.e., a meta-feature) whose union has the (local) maximum skewness score with respect to the input sample ranking. The significance of a CaDrA search and its cumulative score are determined by generating an empirical null distribution of scores based on the exact same data and search parameters, but with randomly permuted sample rankings, providing a permutation p-value per search result. Since the CaDrA algorithm specifically returns feature-sets maximally left-skewed given the provided sample ranking variable, it can be applied to identify features that are either positively correlated or anti-correlated with the continuous variable of interest by ranking samples in decreasing or increasing order of that variable, respectively.
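The step-wise search and its permutation test can be illustrated with a minimal Python sketch. This is not the CaDrA R implementation: the function names, the toy features, and in particular the scaling applied to the one-sided KS-type statistic are assumptions made here for illustration only.

```python
import math
import random

def ks_left_skew(feature):
    """Scaled one-sided KS-type statistic for a 0/1 vector whose positions are
    already ordered by sample rank (index 0 = top-ranked sample)."""
    n, hits = len(feature), sum(feature)
    if hits == 0 or hits == n:
        return 0.0
    seen, d_max = 0, 0.0
    for i, x in enumerate(feature, start=1):
        seen += x
        d_max = max(d_max, seen / hits - i / n)
    # Assumed scaling (the usual two-sample KS factor), so that larger unions
    # are not penalized purely for containing more events.
    return d_max * math.sqrt(hits * (n - hits) / n)

def greedy_search(features, score=ks_left_skew):
    """features: dict of feature name -> 0/1 list in ranked sample order.
    Returns (selected feature names, union vector, cumulative score)."""
    start = max(features, key=lambda k: score(features[k]))       # Step 1
    selected, meta = [start], list(features[start])
    best = score(meta)
    while len(selected) < len(features):
        unions = {k: [a | b for a, b in zip(meta, v)]              # Step 2
                  for k, v in features.items() if k not in selected}
        name = max(unions, key=lambda k: score(unions[k]))
        if score(unions[name]) <= best:                            # no improvement
            break
        selected.append(name)                                      # Step 3
        meta, best = unions[name], score(unions[name])
    return selected, meta, best

def permutation_pvalue(features, n_perm=200, seed=0):
    """Empirical p-value: rerun the same search on permuted sample rankings."""
    rng = random.Random(seed)
    observed = greedy_search(features)[2]
    n = len(next(iter(features.values())))
    exceed = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)  # one permutation of the samples, shared by all features
        permuted = {k: [v[i] for i in order] for k, v in features.items()}
        exceed += greedy_search(permuted)[2] >= observed
    return (exceed + 1) / (n_perm + 1)

def make(indices, n=40):
    """Hypothetical helper: 0/1 vector of length n with 1's at `indices`."""
    return [1 if i in indices else 0 for i in range(n)]

toy = {
    "f_a": make(range(0, 4)),          # left-skewed
    "f_b": make(range(4, 8)),          # left-skewed, exclusive with f_a
    "f_right": make(range(32, 40)),    # right-skewed
    "f_scatter": make({2, 15, 28, 37}),
}
selected, meta, best = greedy_search(toy)
p_value = permutation_pvalue(toy)
```

In this toy example the search selects `f_a` first and then adds `f_b`, after which no remaining union improves the score and the search stops.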

#### CaDrA Features Search Modes

CaDrA supports multiple search modalities: it allows for the selection of a user-specified feature from which to start the search (rather than selecting the feature with highest score as depicted in Step 1 of **Figure 1**); alternatively, since the greedy search is not guaranteed to find the global maximum, it also allows for a "top-N" search modality, whereby the search is started from each of the first N features (as measured by their individual skewness scores), and the result of the best search can be determined by selecting the set of features with the best cumulative score over the top-N runs.
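The two alternative search modes can be sketched as thin wrappers around a greedy search that accepts an explicit starting feature. The following Python sketch is illustrative only: the function names and toy data are hypothetical, and the score used is a normal-approximation Wilcoxon rank-sum z-score (without tie correction), chosen here as one plausible form of a rank-sum-based skewness score.

```python
from math import sqrt

def rank_sum_z(feature):
    """Rank-sum z-score of a 0/1 vector in ranked sample order; positive when
    the 1's concentrate at the top ranks (left side)."""
    n, n1 = len(feature), sum(feature)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        return 0.0
    w = sum(rank for rank, x in enumerate(feature, start=1) if x)
    mean = n1 * (n + 1) / 2
    sd = sqrt(n1 * n0 * (n + 1) / 12)
    return (mean - w) / sd

def search_from(features, start, score=rank_sum_z):
    """Greedy union search seeded with a user-specified starting feature."""
    selected, meta = [start], list(features[start])
    best = score(meta)
    while len(selected) < len(features):
        unions = {k: [a | b for a, b in zip(meta, v)]
                  for k, v in features.items() if k not in selected}
        name = max(unions, key=lambda k: score(unions[k]))
        if score(unions[name]) <= best:
            break
        selected.append(name)
        meta, best = unions[name], score(unions[name])
    return selected, meta, best

def top_n_search(features, n_top, score=rank_sum_z):
    """Start one search from each of the n_top best single features and keep
    the run with the highest cumulative score."""
    starts = sorted(features, key=lambda k: score(features[k]), reverse=True)[:n_top]
    runs = [search_from(features, s, score) for s in starts]
    return max(runs, key=lambda run: run[2])

def make(indices, n=20):
    """Hypothetical helper: 0/1 vector of length n with 1's at `indices`."""
    return [1 if i in indices else 0 for i in range(n)]

toy = {"a": make({0, 1, 2}), "b": make({3, 4, 5}),
       "c": make({17, 18, 19}), "d": make({1, 8, 13})}

seeded = search_from(toy, "d")      # user-specified starting feature
top = top_n_search(toy, n_top=2)    # best result over the top-2 starting points
```

Seeding the search with the weakly associated feature `d` forces it into the meta-feature, whereas the top-N run is free to return the better-scoring union of `a` and `b`.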

#### Visualization of Search Results

For a given search, CaDrA outputs a set of features (meta-feature), which can be visualized as a 'meta-plot'. This includes (panels from top to bottom): an area plot of the sample-specific measurements used to obtain the sample ranks; a color-coded matrix of all features in the meta-feature (in the step-wise order that they were added), one feature per row, with the corresponding union of the meta-feature (red) last; and a corresponding enrichment score (ES) plot below. Additionally, top-N search results can be visualized for overlapping features to evaluate robustness across different search starting points.

#### Parallelization Support

The generation of the empirical null distribution for significance testing is typically done for ≥500 iterations (i.e., permuted sample ranks). In order to speed up this potentially time-consuming task, CaDrA supports exploiting parallel computing with the help of the parallel R package functionality, should multiple compute cores be available to users.

#### Permutation Caching

Since the generation of the null distribution used for significance testing is a time-consuming step, and since the null distribution of scores depends solely on the feature dataset and the search parameters specified (scoring method, starting feature versus top-N search mode, etc.), and not on the input sample ranking, we can implement caching of the null distribution corresponding to each dataset and set of search parameters. When submitting multiple subsequent queries (each with its own sample ranking) that utilize the same dataset and search criteria, CaDrA can then fetch the corresponding cached null distribution to generate permutation p-values almost instantaneously, avoiding the need for repetitive computation and thus significantly reducing overall query run time.
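The caching idea can be sketched in Python as a lookup keyed on a hash of the feature dataset and the search parameters. The in-memory dictionary, the key construction, and all names below are assumptions for illustration and do not describe how CaDrA stores its cache; the `search` argument stands for any function that maps a feature matrix in ranked order to a best cumulative score.

```python
import hashlib
import json
import random

_NULL_CACHE = {}  # in-memory only; a real tool might persist this to disk

def cache_key(features, params):
    """Hash of the dataset plus search parameters. `features` must be in the
    dataset's fixed column order, NOT reordered by any query ranking."""
    payload = json.dumps({"features": features, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def null_distribution(features, params, search, n_perm=500, seed=0):
    """Return cached null scores if available, otherwise compute and store them."""
    key = cache_key(features, dict(params, n_perm=n_perm, seed=seed))
    if key not in _NULL_CACHE:
        rng = random.Random(seed)
        n = len(next(iter(features.values())))
        scores = []
        for _ in range(n_perm):
            order = list(range(n))
            rng.shuffle(order)  # a random sample ranking
            permuted = {k: [v[i] for i in order] for k, v in features.items()}
            scores.append(search(permuted))
        _NULL_CACHE[key] = scores
    return _NULL_CACHE[key]

def permutation_pvalue(observed, null_scores):
    exceed = sum(s >= observed for s in null_scores)
    return (exceed + 1) / (len(null_scores) + 1)

# Toy stand-in for a full search: best count of 1's among the top two samples.
calls = {"n": 0}
def toy_search(feats):
    calls["n"] += 1
    return max(sum(v[:2]) for v in feats.values())

dataset = {"a": [1, 1, 0, 0, 0, 0, 0, 0], "b": [0, 0, 1, 0, 1, 0, 0, 1]}
first = null_distribution(dataset, {"score": "toy"}, toy_search, n_perm=50)
second = null_distribution(dataset, {"score": "toy"}, toy_search, n_perm=50)  # cache hit
```

The second call returns the stored scores without running any further permutations, which is the behavior that makes repeated queries against the same dataset inexpensive.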

#### Data Availability and Processing

CaDrA is freely available for download and use as a documented R package under the git repository https://github.com/montilab/CaDrA, and will further be deposited and maintained for future use under Bioconductor, including complete code and example use-cases.

DNA copy number (GISTIC2), mutation and RPPA data for TCGA analyses were obtained using Firehose v0.4.3 corresponding to the Jan 28th, 2016 (SCNA and somatic mutations) and Jul 15th, 2016 (RPPA) Firehose release. Somatic mutation data was processed at the gene level by assigning either 1 or 0 based on the presence or absence of any given mutation in that gene, respectively (excluding synonymous mutations). Annotated Level 3 RPPA data was used for all protein-related TCGA data queries. For pan-cancer analyses, these three data sets were obtained for nine cancer types, including bladder urothelial carcinoma (BLCA), breast invasive carcinomas (BRCA), glioblastoma multiforme (GBM), head and neck squamous cell carcinoma (HNSC), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), ovarian serous cystadenocarcinoma (OV), pancreatic adenocarcinoma (PAAD), and prostate adenocarcinoma (PRAD). RNASeq version 2 data processed as Level 3 RSEM-normalized gene expression values corresponding to the Feb 4th, 2015 Firehose release was used for the TCGA BRCA analysis. CCLE genomic data were downloaded from https://portals.broadinstitute.org/ccle and processed as previously described (Kim et al., 2016). Somatic mutation binary calls per gene were used as is, and SCNA data was processed using GISTIC2 (Mermel et al., 2011) with all default parameters barring the confidence level, which was set to 99%. ActArea estimates pertaining to drug treatment sensitivity across CCLE samples were used as previously described (Barretina et al., 2012).

In all cases presented, SCNA and somatic mutation data were jointly analyzed as a single input dataset to CaDrA, thereby including samples for which both data were available. All input data to CaDrA were further pre-filtered so as to exclude alteration frequencies below 3% and above 60% to reduce feature sparsity and redundancy, respectively, across samples (CaDrA's default feature pre-filtering settings).
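The frequency-based pre-filter amounts to a simple per-feature check, sketched below in Python. The function and feature names are hypothetical; only the 3% and 60% thresholds are taken from the text.

```python
def prefilter(features, min_freq=0.03, max_freq=0.60):
    """Keep binary features whose alteration frequency across samples lies
    within [min_freq, max_freq]. features: dict of name -> 0/1 list."""
    kept = {}
    for name, calls in features.items():
        freq = sum(calls) / len(calls)
        if min_freq <= freq <= max_freq:
            kept[name] = calls
    return kept

def with_ones(k, n=100):
    """Hypothetical helper: vector of n samples with k altered."""
    return [1] * k + [0] * (n - k)

example = {
    "rare": with_ones(2),      # 2%: too sparse, dropped
    "ok_low": with_ones(5),    # 5%: kept
    "ok_high": with_ones(50),  # 50%: kept
    "common": with_ones(70),   # 70%: too frequent, dropped
}
kept = prefilter(example)
```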

#### Simulated Data Generation

To evaluate both the sensitivity and specificity of CaDrA, we generated simulated data to represent cases where there was a mix of left-skewed ("true positive") and randomly distributed ("null") features, as well as cases where there were only null features. The left-skewness of a feature is a measure of its association with the sample ranking, since samples are sorted from left (high rank) to right (low rank). The design and parameter specification of the simulated data matrix is shown in **Supplementary Figure S1**. Each feature/row is a binary (0/1) vector, with 1 (0) in the ith position denoting the occurrence (non-occurrence) of the genetic event (e.g., SCNA or mutation) in the ith sample. This simulation of binary features relies on the following parameters:

N: Dataset sample size (number of columns in the matrix).

n: Total number of features in the dataset (number of rows in the matrix).

p: Number of true positive features generated per dataset [a positive feature is a feature whose distribution of events (i.e., the number of 1's) is significantly associated with the sample ranking, i.e., left-skewed].

f : Left-skew proportion. The proportion of samples that are cumulatively left-skewed in the sample ranking.

λ: The mean (and variance) of the Poisson distribution from which the number of events in the null features is sampled. This is equal to the number of 1's per skewed positive feature. A Poisson distribution is used so that we can partially control (through the mean) the number of 1's in a null feature, which are then uniformly distributed across samples (see description of Null feature generation below).

The resulting simulated binary data matrix will consist of two main types of features:

True Positive (TP) Features: A total of p TP features are generated. Events (i.e., 1's) are assigned to the TP features in a mutually exclusive fashion, with each of these features having (f × N)/p entries set to 1, with their cumulative OR yielding an N-sized vector with the left-most f × N entries set to 1's. For example, if we generate data for 100 samples and 5 positive features, with the left-skew proportion set to 0.5, each non-overlapping feature will have 10 among the 50 left-most entries (columns) set to 1, such that the union (logical OR) of the 5 features will have 1's in the first 50 entries.

Null Features: Null features are generated for a total of (n–p) features. To generate these features, we sample the number of 1's per null feature based on a Poisson distribution with mean parameter λ = (f × N)/p. In this fashion, the number of 1's in the null features will have a distribution centered on the corresponding number for the TP features. For instance, if we generate data for 100 samples and 5 TP features with left-skew proportion f = 0.5, then each of the TP features will have ten 1's, and each of the remaining 995 null features will have a number of 1's sampled from Poisson (λ = 10), uniformly distributed over the N samples.

A schematic representation of these data, along with the parameters that define their composition, is shown in **Supplementary Figure S1**.

#### Evaluation of CaDrA Performance on Simulated Data

Evaluation of CaDrA performance was performed considering two main scenarios: (a) True positive datasets: Data containing both true positive and null features (where the sensitivity of CaDrA is tested); and (b) Null datasets: Data containing only null features (where the specificity of CaDrA is tested), with the following parameter specifications for data generation:

$N \in \{50, 60, 70, 80, 90, 100, 250, 500\}$, $m = 1000$, $p = 5$, $f = 0.5$

CaDrA was run using default input parameters, returning the meta-feature with the best score, along with a permutation p-value based on the empirical null search distribution (**Supplementary Figure S2**). These results were then used to determine performance estimates for different sample sizes, as well as the composition (i.e., distribution of TP versus null features per returned meta-feature), size (i.e., the number of features within the returned meta-feature), and statistical significance of the returned meta-features. Mean TPR percentages shown in **Table 1** are weighted averages of the TPRs corresponding to different numbers of true positive features per meta-feature, weighted by the total number of searches returning such meta-features (gray circles, **Figure 2C**). Mean FPR percentages shown in **Table 1** are weighted averages of the FPRs corresponding to different meta-feature sizes, weighted by the total number of searches returning such meta-features (gray circles, **Figure 2D**).

#### COSMIC Enrichment Analyses

For enrichment analyses, RPPA protein data for the nine cancer types (see section "Data Availability and Processing") were first restricted to those proteins representing hallmark oncogenes or tumor suppressor genes included in the COSMIC v84 database (n = 57)<sup>1</sup> (Forbes et al., 2017). For each cancer type, a CaDrA query was then performed with respect to the protein expression-induced sample ranking, using somatic mutation and copy number alteration data as input features, in order to search for features associated with elevated protein expression of each of the hallmark proteins queried. The features thus selected were then pooled across all queries, and the resulting gene list was tested for significant enrichment (based on the hypergeometric distribution) with respect to a set of annotated oncogenes and tumor suppressor genes in COSMIC (n = 554), compared to the pooled list of non-selected features.

<sup>1</sup>https://cancer.sanger.ac.uk/census
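As an illustration of the hypergeometric enrichment test described above, the following sketch computes the upper-tail probability of observing at least a given overlap between the selected genes and the annotated COSMIC genes. The function name and argument layout are assumptions for this example; the universe is taken to be the pooled selected plus non-selected features.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_universe genes, of which n_annotated carry the annotation."""
    upper = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A small p-value indicates that annotated oncogenes and tumor suppressors are over-represented among the selected features relative to chance.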

#### Sub-Sampling Analyses


For all sub-sampling analyses presented, CaDrA was run after sub-sampling 80% of the original data, with consistency of CaDrA results computed as the Jaccard (J) index between the meta-features returned with and without sub-sampling (repeated for n = 100 independent sub-sampling iterations). To assess reproducibility of drivers associated with YAP/TAZ activity, the search was repeated either preserving the observed ranking (decreasing YAP/TAZ activity) or using a permuted ranking. J indices were then compared between the original and permuted ranking cases using a Wilcoxon rank-sum test. For the pan-cancer protein query analysis, all available proteins profiled as part of the RPPA data were used, with J indices similarly computed for the top 5 protein queries that yielded significant meta-features (P ≤ 0.05), and 5 queries randomly selected from the non-significant list (P > 0.05) in each cancer type. J indices were then pooled for the five significant and five non-significant results, respectively, and compared using a Wilcoxon rank-sum test. FDR correction was used for all tests of significance in the pan-cancer analyses.
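The consistency measure can be sketched as follows. Here `search` is a hypothetical callable standing in for a CaDrA run (it takes a list of sample identifiers and returns the features of the best meta-feature); it is not part of the CaDrA package interface, and the default values simply mirror the 80% fraction and 100 iterations stated above.

```python
import random


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two feature collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def subsample_consistency(samples, search, fraction=0.8, n_iter=100, seed=0):
    """Jaccard index between the full-data result and each sub-sampled result."""
    rng = random.Random(seed)
    reference = search(samples)
    k = int(round(fraction * len(samples)))
    return [
        jaccard(reference, search(rng.sample(samples, k)))
        for _ in range(n_iter)
    ]
```

The resulting list of J indices is what would be compared between conditions, for example observed versus permuted rankings.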

### YAP/TAZ Signature Projection and Assessment in TCGA BRCAs

A signature comprising YAP/TAZ-activating genes (n = 717) in MDA-MB-231 cells was obtained from a previous study (Enzo et al., 2015). The TCGA BRCA RNASeq data (n = 1,186 samples) were projected onto the signature genes, and per-sample estimates of YAP/TAZ activity were derived using ASSIGN (Shen et al., 2015); these estimates were then used as a continuous ranking variable with CaDrA. The association of YAP/TAZ activity with the CaDrA-derived meta-feature, and with BRCA subtype (i.e., TN status), was determined using a linear regression model.

### Cell Culture, siRNA Knockdown and qRT-PCR

HS578T BRCA cells were purchased from ATCC and cultured using media and conditions suggested by ATCC. For RNA interference, cells were transfected using RNAiMAX (Thermo Fisher) with control siRNA (Qiagen, 1027310) or an equimolar mixture of siRNA targeting RELN (Sigma), RBCK1 (Sigma), or TAZ and YAP (Hiemer et al., 2014). At 48 h post transfection, RNA was extracted from cells using the RNeasy kit (Qiagen) and cDNA was synthesized as previously described (Hiemer et al., 2014). Quantitative real-time PCR (qRT-PCR) was performed using Taqman Universal master mix II (Thermo Fisher) and measured on a ViiA 7 real-time PCR system. Taqman probes used included those recognizing CTGF (Thermo Fisher Hs00170014\_m1), CYR61 (Thermo Fisher Hs00155479\_m1), RELN (Thermo Fisher Hs01022646\_m1), RBCK1 (Thermo Fisher Hs00934608\_m1), WWTR1 (Thermo Fisher Hs01086149\_m1), YAP (Thermo Fisher Hs00902712\_g1), and GAPDH (Thermo Fisher 4326317E). Expression levels of each gene were calculated using the ΔΔCt method and normalized to GAPDH. Knockdown efficiency of YAP, TAZ, RELN, and RBCK1 was verified for each experiment. Mean transcriptional knockdown of YAP, TAZ, and RBCK1 in HS578T cells was >80%. Basal RELN levels in HS578T cells were low, and relative knockdown in these cells was 28.3% (±14.1). Data from qRT-PCR experiments are shown as mean ± S.D., with each knockdown compared to the scrambled siRNA control (siCtl) using an unpaired, two-tailed Student's t-test.
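For readers unfamiliar with the comparative Ct calculation, a minimal sketch is given below. It assumes approximately 100% amplification efficiency (so that each cycle corresponds to a two-fold change); the function name and the Ct values in the usage note are illustrative and are not data from this study.

```python
def relative_expression(ct_target_kd, ct_ref_kd, ct_target_ctl, ct_ref_ctl):
    """Fold expression of a target gene in knockdown versus control cells.

    Each Ct is first normalized to the reference gene (e.g., GAPDH), then the
    control delta-Ct is subtracted and the result is converted to a fold change.
    """
    delta_kd = ct_target_kd - ct_ref_kd        # delta-Ct in knockdown cells
    delta_ctl = ct_target_ctl - ct_ref_ctl     # delta-Ct in control cells
    return 2.0 ** -(delta_kd - delta_ctl)      # 2^(-delta-delta-Ct)
```

For example, `relative_expression(27, 18, 25, 18)` returns 0.25, i.e., the target is expressed at 25% of the control level, corresponding to a 75% knockdown.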

### CaDrA Search Parameters

For evaluation using genomic data, CaDrA was run in the top-N mode using the default of N = 7, choosing the best resulting meta-feature (see section "Methods"; CaDrA features: Search modes). For evaluation of simulated data, only the top-scoring feature was considered as a starting feature per search run (i.e., N = 1). The "ks" method was chosen for evaluating skewness of features at each step in all cases presented. Default values were used for all other input search parameters.

## AVAILABILITY OF DATA AND MATERIAL

The datasets generated and/or analyzed during the current study are available in the TCGA repository (https://tcga-data.nci.nih.gov/docs/publications/tcga) and the CCLE repository (https://portals.broadinstitute.org/ccle), and are available from the corresponding author on reasonable request.

## AUTHOR CONTRIBUTIONS

VK developed the R package and conducted the analyses. VK and SM wrote the manuscript, with input from PS and XV. JK performed the siRNA and qRT-PCR experiments. LZ assisted in obtaining the gene expression signature for TCGA data projection. PS assisted in the evaluation of CaDrA on simulated data. SM and VK designed the CaDrA framework and features, and interpreted the results. XV designed the experimental validation of novel candidate drivers, and interpreted the results thereof. All authors read and approved the final manuscript.

## FUNDING

This work was supported by National Institutes of Health NIDCR fellowship F31 DE025536 (VK), CDMRP grant W81XWH-14-1-0336 (XV), the Dahod breast cancer research program at Boston University School of Medicine (XV and SM), as well as the Clinical and Translational Science Institute (supported by Clinical and Translational Research Award CTSA grant UL1-TR001430) at Boston University School of Medicine (SM). The funding sources played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of this manuscript.

#### ACKNOWLEDGMENTS

We would like to thank Joshua Klein for making suggestions toward the implementation of specific package features. We further acknowledge dbGaP for granting access to the TCGA data (phs000178.v9.p8).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00121/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Kartha, Sebastiani, Kern, Zhang, Varelas and Monti. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Gene Expression-Based Predictive Markers for Paclitaxel Treatment in ER+ and ER− Breast Cancer

*Xiaowen Feng<sup>1,2</sup>, Edwin Wang<sup>2,3</sup>\* and Qinghua Cui<sup>1</sup>\**

*<sup>1</sup>Department of Biomedical Informatics, School of Basic Medical Sciences, MOE Key Lab of Cardiovascular Sciences, Peking University, Beijing, China, <sup>2</sup>Cumming School of Medicine, University of Calgary, Calgary, AB, Canada, <sup>3</sup>Faculty of Medicine, McGill University, Montreal, QC, Canada*

#### *Edited by:*

*Victor Jin, The University of Texas Health Science Center at San Antonio, United States*

#### *Reviewed by:*

*Ao Li, University of Science and Technology of China, China Tianbao Li, The University of Texas Health Science Center at San Antonio, United States*

#### *\*Correspondence:*

*Edwin Wang edwin.wang@ucalgary.ca Qinghua Cui cuiqinghua@hsc.pku.edu.cn* 

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 15 November 2018 Accepted: 13 February 2019 Published: 01 March 2019*

#### *Citation:*

*Feng X, Wang E and Cui Q (2019) Gene Expression-Based Predictive Markers for Paclitaxel Treatment in ER+ and ER− Breast Cancer. Front. Genet. 10:156. doi: 10.3389/fgene.2019.00156*

One of the objectives of precision oncology is to identify a patient's responsiveness to a given treatment and to prevent potential overtreatment through molecular profiling. Predictive gene expression biomarkers are a promising and practical means to this end. The overall response rate to paclitaxel in breast cancer has been reported to be in the range of 20–60%, and is even lower for ER-positive patients. Predicting the responsiveness of breast cancer patients, either ER-positive or ER-negative, to paclitaxel treatment could spare individuals who respond poorly to the therapy from excess exposure to the agent. In this study, we identified six sets of gene signatures whose gene expression profiles could robustly predict nonresponding patients with precision above 94% and recall above 93% on various discovery datasets (*n* = 469 for the largest set) and independent validation datasets (*n* = 278), using the previously developed Multiple Survival Screening algorithm, a random-sampling-based methodology. The gene signatures reported were stable even when half of the discovery datasets were swapped, demonstrating their robustness. We also report a set of optimizations that enable the algorithm to be trained on small-scale computational resources. The gene signatures and optimized methodology described in this study could be used to identify unresponsiveness in patients with ER-positive or ER-negative breast cancer.

Keywords: microarray gene expression profile, breast cancer, signature genes, drug resistance, predictor

## INTRODUCTION

Predicting whether a given patient would not respond to a specific treatment could save enormous health care resources and potentially make it possible to reallocate the individual to a better-suited medication program earlier (Garraway et al., 2013; Collins and Varmus, 2015). Paclitaxel, which targets cell cycle processes by stabilizing microtubules, is a prevalent medication used in various cancer types including breast, ovarian, and prostate cancer. Up to 20% of ER-positive (ER+) breast cancer patients, who represent 80% of the breast cancer population, could gain survival benefit from paclitaxel treatment. With high-confidence prediction, it would be possible to spare nearly 20,000 women in the United States alone from ineffective paclitaxel treatment, which might cause additional neurotoxicity and adverse effects. Network representation learning, as well as the integration of somatic mutation profiles and gene functional annotation information, has been utilized to discover driver genes related to drug treatment responsiveness (Xi et al., 2017, 2018; Yang et al., 2018; Zhang et al., 2018). Existing studies either focused on triple-negative cases, or provided insights on a small number of tipping point genes from a biological rather than a computational perspective. For example, ABCB1/PgP and ABCC3/MRP3 were reported to be closely associated with resistance to paclitaxel (Němcová-Fürstová et al., 2016; Delou et al., 2017), while the resistance might be driven by hundreds of genes (Duan et al., 2004). Xu et al. collected 22 key genes involved in paclitaxel treatment resistance for miscellaneous cancer types by analyzing the literature (Xu et al., 2016) with the assistance of GeneMANIA (Warde-Farley et al., 2010), a gene/protein function prediction tool.

In this study, we improved Multiple Survival Screening (MSS), a methodology developed by Li et al. (2010) for identifying cancer prognostic markers with high robustness and predictive power, and applied it to five microarray gene expression datasets [GSE20194 (MAQC Consortium, 2010; Popovici et al., 2010), GSE20271 (Tabchy et al., 2010), GSE22093 (Iwamoto et al., 2010), GSE23988 (Iwamoto et al., 2010), and GSE25066 (Hatzis, 2011; Itoh et al., 2013)], which were partitioned into a discovery set and an independent validation set, in search of signature genes of nonresponsiveness in ER+ breast cancer. We discovered sets of such genes that gave precision up to 94.6% and recall rate up to 93.3%, and that performed consistently both in cross validation within the discovery datasets and when different discovery datasets were tested against their corresponding independent validation datasets. Similar results were obtained for ER-negative patients, demonstrating the predictive power and the potential for real-life application of the optimized methodology and reported gene sets.

## RESULTS

#### Gene Signatures for Unresponsiveness of Paclitaxel Treatment in ER-Positive Breast Cancer

To explore efficient and generalizable gene signatures for predicting whether a given breast cancer patient should receive paclitaxel treatment, we constructed a discovery dataset comprising microarray data generated by four cohorts (GSE20271, GSE22093, GSE23988, and GSE25066; referred to as *T1pos*; see Methods for details), with 469 patients in total (*nRD* = 418, *nCR* = 51; RD, residual disease; CR, complete response). Similarly, an independent validation dataset was formed using microarray data from the GSE20194 cohort (*nRD* = 213, *nCR* = 65; referred to as *V1pos*). MAS5 normalization was applied to *T1pos* and *V1pos* separately. Both expression profile matrices then underwent additional normalization to address batch effects between the cohorts, as well as merging of multiple probes that represented the same gene on the gene expression microarray (see Methods).

Implementing a methodology based on Multiple Survival Screening (MSS) (Li et al., 2010), which is a random search computational scheme that can identify reliable signature genes, we obtained six gene signatures ("Signatures," A1–F1) from *T1pos* corresponding to six groups of Gene Ontology (GO) terms closely associated with carcinogenesis (**Figure 1**): cell adhesion, apoptosis, cell cycle, immune response, phosphorylation, and DNA damage & repair. Each signature gene set contained 30 unique genes and was used to translate a given expression profile into a feature vector. Testing the six signatures against *V1pos*, we observed that the prediction achieved a precision of 94.4% and a recall rate of 90.1% for the RD (residual disease; mutually exclusive with CR, complete response) subgroup, where a true positive was defined as predicting a nonresponding patient to be nonresponding, and a false positive as predicting a patient who responded to the treatment to be nonresponding. Precision and recall rate follow their conventional definitions. Compared with the genes with the most significantly differential expression profiles (see Methods), fewer than 50% of the most significant genes were selected (i.e., when selecting 130 genes, fewer than 65 of them were among the top 130 listed genes). Simply using the most significant genes gave inferior predictive power in the independent validation dataset (recall rate of 88%), implying that the most prominent differential expression patterns contained cohort-specific features and might not be suitable for direct use.
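Because the positive class here is the nonresponding (RD) group rather than the responders, it may help to spell the metrics out. The following is a minimal sketch of the conventional definitions with RD as the positive label; the function name and the string labels are assumptions for illustration.

```python
def precision_recall(true_labels, predicted_labels, positive="RD"):
    """Precision and recall with the nonresponding (RD) class as positive."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1          # nonresponder correctly flagged
        elif pred == positive:
            fp += 1          # responder wrongly flagged as nonresponder
        elif truth == positive:
            fn += 1          # nonresponder missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under this convention, high recall means few nonresponders are missed, and high precision means few responders would be wrongly advised against treatment.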

Further, we examined the predictive performance of all possible combinations of the six signatures (*k* = 2, 3, 4, 5) (**Figures 2**–**4**) through 10-fold cross validation tests in *T1pos*. While all choices gave precision above 94%, recall rates varied between 80 and 95%, exhibiting differences in predictive power. The combination of Signatures B1 (apoptosis), C1 (cell cycle), and F1 (DNA damage and repair) provided the best-balanced precision and recall rate (using the average values of the 10-fold cross validations), of 94.0 and 93.4%, respectively. The predictor comprising the selected combination of signatures performed better on the independent validation dataset (precision of 93.1% and recall rate of 92.7%). We considered the recall rate to be the most important metric, as the methodology was intended to reliably predict whether an individual can skip a treatment without adverse consequences. For comparison, we tested seven signature genes (BRCA1, APC, p16/CDKN2A, FRMD6/hEx, YAP, BAX, and LZTS1/FEZ1) related to drug resistance in breast cancer, collected by Xu et al. (2016), for their predictive power. In the four-cohort discovery dataset, the two-cohort discovery dataset, and the validation dataset, this signature gave precision of 92.3, 89.5, and 94.0% and recall rates of 82.7, 78.9, and 85.2%, respectively. Overall, our proposed signature genes provided better predictive power, and the methodology allows accumulating datasets to be aggregated to discover potentially better gene combinations.
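The enumeration of signature combinations can be sketched as below. The `evaluate` callable is hypothetical: it stands in for a cross-validation routine that returns a (precision, recall) pair for a combination. Ranking by recall first reflects the stated priority on recall, but the exact selection rule used in the study is not specified, so this ordering is an assumption.

```python
from itertools import combinations

SIGNATURES = ["A1", "B1", "C1", "D1", "E1", "F1"]


def signature_combinations(signatures, sizes=(2, 3, 4, 5)):
    """List every combination of the given sizes, smallest first."""
    return [combo for k in sizes for combo in combinations(signatures, k)]


def rank_combinations(combos, evaluate):
    """Sort combinations by recall, then precision, best first.

    `evaluate(combo)` must return a (precision, recall) tuple.
    """
    scored = [(combo, *evaluate(combo)) for combo in combos]
    return sorted(scored, key=lambda item: (item[2], item[1]), reverse=True)
```

For six signatures and *k* = 2 to 5 this produces 15 + 20 + 15 + 6 = 56 candidate combinations.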

To demonstrate the contribution of the signature genes to drug resistance, we calculated their relative contribution scores (RCS) based on randomization tests. Similar to the signature selection process but with a reduced randomization count per iteration (50,000) and a higher total iteration count (200 for each of the six GO terms), fuzzy K-means clustering combined with Fisher's test was performed to measure the power of randomized gene sets to partition patients by responsiveness, where a gene set that exhibited statistical significance stronger than *p* < 0.001 was collected as a "candidate gene set." The relative prevalence of a given signature gene was then obtained by measuring its presence among the candidate gene sets and normalizing the value by dividing by the largest absolute prevalence value.
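The normalization step of the RCS calculation can be illustrated with a short sketch. It assumes the candidate gene sets (those passing the significance threshold) have already been collected; the function name is hypothetical and the gene symbols in the usage note are placeholders only.

```python
from collections import Counter


def relative_contribution_scores(signature_genes, candidate_gene_sets):
    """Presence count of each signature gene across candidate gene sets,
    divided by the largest count so that scores fall in [0, 1]."""
    signature = set(signature_genes)
    counts = Counter()
    for gene_set in candidate_gene_sets:
        counts.update(set(gene_set) & signature)   # count each gene once per set
    largest = max(counts.values(), default=0)
    if largest == 0:
        return {gene: 0.0 for gene in signature_genes}
    return {gene: counts[gene] / largest for gene in signature_genes}
```

For instance, with signature genes `["G1", "G2", "G3"]` and candidate sets `[["G1", "G2"], ["G1", "X"], ["G2", "G1", "G3"]]`, the scores are 1.0, 2/3, and 1/3, respectively.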

#### Robustness and Generalizability of Signature Gene Sets

To examine whether the identified gene signatures were affected by random factors, we performed another round of the signature discovery process on *T1pos* with the same set of hyperparameters and a new initial random state. We found that 99.2% (129 out of 130) of gene selections remained the same in the new iteration, with the only altered gene selection residing in Signature A1 (adhesion). Expanding the number of random gene sets or iterations of the algorithm (see Methods) did not significantly affect the gene signatures.

Further, the same gene signature discovery methodology was applied to *T2pos*, a discovery dataset comprising two cohorts (GSE22093 and GSE25066), and validated against the remaining three cohorts (GSE20271, GSE23988, and GSE20194) to demonstrate the generalizability of the signatures. Despite the reduced dataset size, the identified Signatures B2 (apoptosis), C2 (cell cycle), and F2 (DNA damage & repair) were exactly the same as Signatures B1, C1, and F1 above. This signature combination achieved the best precision and recall rates in GSE20194 (a.k.a. *V1pos*; 94.6 and 93.4%, respectively), GSE20271 (95.4 and 91.2%, respectively), and GSE23988 (95.7 and 96.0%, respectively). Swapping the components of the discovery dataset did not significantly affect signature discovery (fewer than two gene selections were altered in each GO term signature) or the prediction power reported above. These results demonstrated that Signatures C and E were generic and stable for nonresponsive ER-positive breast cancer cases and might be applied to new incoming datasets.

#### Gene Signatures for Unresponsiveness of Paclitaxel Treatment in ER-Negative Breast Cancer

We further demonstrated that the methodology may work equally well for the ER-negative population. To obtain signature genes for the ER-negative (ER−) group, we constructed a discovery dataset comprising the four cohorts described above (GSE20271, GSE22093, GSE23988, and GSE25066; see Methods; referred to as *Tneg*; *nRD-and-ERneg* = 152, *nCR-and-ERneg* = 217). Similarly, GSE20194 (*nRD-and-ERneg* = 62, *nCR-and-ERneg* = 45; referred to as *Vneg*) was utilized as an independent validation dataset. MAS5 normalization and additional normalizations addressing batch effects were performed as described previously. We obtained five sets of signature genes ("Signatures," a–e) corresponding to five groups of GO terms closely associated with carcinogenesis: phosphorylation, immune response, apoptosis, DNA damage and repair, and cell cycle. Despite the distinct ratio of the sample sizes of the RD and CR subgroups (ratios in the range 0.7–1.4) compared to the ER+ datasets (ratios in the range 3–10), the predictive power of the signature gene sets was similarly stable. When validated in *Vneg*, the combination of Signatures b (immune response), c (apoptosis), and d (DNA damage and repair) (**Figure 5**) achieved a precision of 94.8% and a recall rate of 92.0%.

#### Optimizing Methodology to Use 50-Fold Less Computation Resources

The original MSS methodology essentially relied on random searching, implemented by randomly generating sets of genes, ranking their ability to represent nonresponding patients, and selecting consensus genes from the top-ranked gene sets to serve as gene signatures in the predictor. This process was computationally expensive: training a model distributed over 672 cores (2.60 GHz) took 30–60 min to finish the 6 million iterations for six GO subsets (see Methods). It also had undefined hyperparameters that governed the total number of iterations as well as the ranking criteria.

We found that the signature genes were prominent enough in most discovery datasets, as long as the overall sample size was reasonable, to allow optimization of the signature discovery process. First, the hyperparameters that determine the base "gene pool" for random sampling could be replaced by simply picking the 500 most significantly differentially expressed genes, trivializing parameter tuning. Then, by introducing a single threshold and an ensemble method (see Methods), we were able to reduce the 1 million iterations required by the original methodology to 20,000 iterations while retaining the same predictive power. While the signatures reported above could be used for nonresponsiveness screening in breast cancer without redoing the discovery process, the optimization makes the methodology suitable for implementation on small computational resources, e.g., a personal computer.

## DISCUSSION

Precision oncology addresses two complementary aspects of targeted therapy: while developing medications that benefit patients with a certain phenotype or symptom helps improve overall survival, finding means to confidently advise patients to opt out of treatments that provide little benefit to them is equally important. Paclitaxel, a drug that targets a microtubule component (the β subunit of tubulin) involved in cell cycle regulation to suppress the expansion of cancer cells, has been considered an important agent for treating breast cancer, providing valid efficacy and tolerability with low cross-resistance with other drugs. However, paclitaxel's response rate among breast cancer patients lies in a wide range of 10–60%. Only 20% of ER-positive patients would respond or partially respond to the drug. Accurately and confidently predicting whether a given patient will respond to paclitaxel treatment would help spare a large number of breast cancer patients from ineffective treatment and its adverse effects. Gene expression profile was reported to be the strongest indicator of paclitaxel sensitivity in breast cancer patients (Dorman et al., 2015). Although resistance to paclitaxel has been reported to be associated with the expression levels of hundreds of transcripts, and its underlying molecular mechanisms and key pathways have been studied, existing signature genes did not perform well in predicting the lack of response in breast cancer patients.

As microarray and RNA-seq become more applicable and affordable for clinical diagnostics, preventing patients from receiving excessive treatment is desirable. In this study, we reported six sets of robust and generalizable gene signatures for the prediction of nonresponding individuals in the ER+ and ER− groups of breast cancer, where the combination of Signatures B (30 genes related to apoptosis), C (30 genes related to cell cycle), and F (30 genes related to DNA damage and repair) achieved the best precision (>94%) and recall (>93%) in predicting nonresponding patients in independent validation datasets, a significant improvement compared to previous studies [e.g., 82% accuracy in cell lines, using the expression profile of 15 genes and an SVM model (Dorman et al., 2015)]. Signature genes were given relative contribution scores (RCS) based on randomization tests to demonstrate their contribution to the predictor, or the relative extent to which they contributed to resistance. Moreover, we described a potential optimization of the methodology that renders the algorithm less computationally demanding, thereby enabling faster gene signature discovery in new datasets.

## MATERIALS AND METHODS

#### Data Processing and Normalization

The following five microarray-based gene expression profiles (samples examined before treatment) were collected from the Gene Expression Omnibus (GEO) repository: (1) GSE20194, comprised of 278 samples using Affymetrix Human Genome U133A Array (GPL96). In total, 161 samples were labeled as ER+, where 151 samples were marked as residual disease (RD) and 10 samples as pathologic complete response (pCR) or complete response (CR); (2) GSE20271, comprised of 178 samples using Affymetrix Human Genome U133A Array (GPL96). In total, 98 samples were labeled as ER+, where 91 samples were marked as RD and 7 samples as pCR or CR; (3) GSE22093, comprised of 103 samples using Affymetrix Human Genome U133A Array (GPL96). In total, 42 samples were labeled as ER+, where 32 samples were marked as RD and 10 samples as pCR or CR; (4) GSE23988, comprised of 61 samples using Affymetrix Human Genome U133A Array (GPL96). In total, 32 samples were labeled as ER+, where 25 samples were marked as RD and 7 samples as pCR or CR; (5) GSE25066, comprised of 508 samples using Affymetrix Human Genome U133A Array (GPL96). In total, 297 samples were labeled as ER+, where 270 samples were marked as RD and 27 samples as pCR or CR.

We retrieved all five cohorts in their raw data format (CEL files) along with clinical data records. Expression profiles of each cohort were then normalized through MAS5.0 normalization (using RMA normalization instead in this step had no visible impact on the results reported). After log2 transformation, we mapped the probes to Entrez Gene IDs (mapping provided by GEO) and merged duplicated reads of a given gene by retaining their average. In total, 4,075 unique genes were preserved. Probes pointing to unidentified genes (i.e., genes without an Entrez ID) were not explicitly removed; however, they were practically invisible during the downstream analysis (see below). Data were further median-centered and z-scored across cohorts to address batch effects.
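The probe-merging and standardization steps can be sketched as follows. The exact standardization procedure is not detailed in the text, so this example makes two assumptions: values are standardized per gene within each cohort (median subtraction followed by division by the standard deviation), and probes without a gene mapping are simply skipped. All function and variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def collapse_probes(probe_values, probe_to_gene):
    """Average the log2 values of all probes mapping to the same gene.

    probe_values: {probe_id: [value per sample]}
    probe_to_gene: {probe_id: gene_id}; unmapped probes are skipped here.
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(values)
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


def standardize_within_cohort(values):
    """Median-center one gene's values in one cohort, then scale by the SD."""
    center = median(values)
    centered = [v - center for v in values]
    spread = pstdev(centered)
    return [c / spread if spread else 0.0 for c in centered]
```

Applying `standardize_within_cohort` to each gene separately in each cohort before concatenating cohorts is one simple way to reduce cohort-level shifts; dedicated batch-correction methods may be preferable in practice.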

The four-cohort discovery dataset comprised GSE20271, GSE22093, GSE23988, and GSE25066, with GSE20194 used as the independent validation dataset. The two-cohort discovery dataset comprised GSE22093 and GSE25066, with GSE20194, GSE20271, and GSE23988 used as the validation set.

#### MSS Methodology and Optimization

Based on the study of Li et al., we utilized the following random-sampling-focused methodology in a given pair of discovery dataset and independent validation dataset.


(104), cell adhesion (56), cell cycle (84), and phosphorylation (77). For the two-cohort discovery dataset, the numbers of genes in the subgroups were as follows: apoptosis (290), DNA damage & repair (81), immune response (142), cell adhesion (93), cell cycle (115), and phosphorylation (111).

	- a. The number of RGSs can be reduced by up to 20-fold by monitoring the list of the most frequently appearing genes in the RGSs, without affecting the reported results. In the original MSS, an arbitrary 1 or 2 million iterations were performed to obtain the "gilded RGSs" and then the signature genes (see below). Instead, we observed that combinations of signature genes were prominent enough that it was possible to set a stopping criterion T, such that if the 30 most frequently appearing genes of the "gilded RGSs" did not change after T iterations, this step was terminated and the "gilded RGSs", along with the list of the 30 most frequent genes, were accepted as the final results. It was safe to choose T in the range of 100–500, where a smaller T traded robustness of the gene list for lower computational cost.
	- b. Computational complexity could be further reduced by using an ensemble model. Instead of allowing each signature gene set to claim one vote in the prediction (see below), we lowered the parameter T to as low as 30 and obtained five gene lists for each GO-defined subpool. Each gene list was then treated as one independent voter during voting.

Combining a and b, the total number of executed iterations could be reduced by 50-fold. In this study, we implemented the original MSS methodology distributed on a cluster with 672 CPUs, parallelizing all 1 million iterations for each GO-defined subpool, and the runtime was around half an hour. Using the optimization, it was possible to compute the desired predictor on regular PCs or workstations in a reasonable time frame.
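The stopping criterion T described in (a) can be sketched as an early-stopping loop over a stream of gene sets. This is a simplified illustration: it assumes each item in the stream is a gene set that has already qualified as "gilded", and the names `gene_set_stream` and `patience` (playing the role of T) are hypothetical.

```python
from collections import Counter


def top_genes_with_early_stop(gene_set_stream, patience, top_n=30):
    """Accumulate gene frequencies and stop once the set of the top_n most
    frequent genes has stayed unchanged for `patience` consecutive iterations.

    Returns (sorted top genes, number of gene sets consumed).
    """
    counts = Counter()
    previous_top = None
    current_top = frozenset()
    stable = 0
    used = 0
    for gene_set in gene_set_stream:
        used += 1
        counts.update(set(gene_set))
        current_top = frozenset(g for g, _ in counts.most_common(top_n))
        if current_top == previous_top:
            stable += 1
            if stable >= patience:
                break
        else:
            stable = 0
            previous_top = current_top
    return sorted(current_top), used
```

For the ensemble variant in (b), the same routine could be run several times with a small `patience` on independent streams, with each resulting gene list acting as one voter.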

Altering the proportion of CR and RD cases in RPSs would not significantly affect reported results, as long as the ratio was kept around 1:2 to 1:5.

4. Each RGS was tested against all 40 RPSs (when not using the optimized version): patients in an RPS were partitioned into two clusters by K-means (Euclidean distance; using the fuzzy K-means implemented in sklearn-extension with a fuzzy factor of 2 did not significantly alter the reported results but was much less efficient). Fisher's test was used to determine whether the clusters were enriched for CR or RD individuals, respectively. The p-values yielded by Fisher's tests were recorded, and the reciprocal of their average was taken as the enrichment score of the RGS. For each GO term, the 3,000 most significant RGSs by enrichment score were selected as the "gilded RGSs". This threshold could be chosen freely between 1,000 and 3,000 without significantly affecting the reported results.

5. The 30 unique genes most frequently picked across the gilded RGSs of a GO term were drawn as the set of signature genes for that GO term.
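The enrichment score in step 4 can be sketched as follows. The sketch assumes that the K-means cluster labels are already available, that Fisher's test is the two-sided exact test on the 2×2 cluster-by-response table, and that one p-value is recorded per RPS; the function names are illustrative.

```python
from math import comb

def contingency(cluster_labels, responses):
    """Counts for [[cluster 0 & CR, cluster 0 & RD], [cluster 1 & CR, cluster 1 & RD]]."""
    pairs = list(zip(cluster_labels, responses))
    return tuple(pairs.count((c, r)) for c in (0, 1) for r in ("CR", "RD"))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x items in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)

def enrichment_score(p_values):
    """Reciprocal of the average p-value over all RPSs tested for one RGS."""
    return 1.0 / (sum(p_values) / len(p_values))

# Toy example: cluster 0 holds 8 CR and 2 RD patients, cluster 1 holds 1 CR and 5 RD.
clusters = [0] * 10 + [1] * 6
responses = ["CR"] * 8 + ["RD"] * 2 + ["CR"] * 1 + ["RD"] * 5
table = contingency(clusters, responses)
p_value = fisher_exact_two_sided(*table)
```

Because the score is the reciprocal of the mean p-value, an RGS that separates CR from RD patients consistently across RPSs receives a large score.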

#### Gene Sets Selection

Combinations of gene sets were tested using 10-fold cross-validation and an independent validation dataset. Prediction of labels (the given individual being either nonresponsive or responsive to paclitaxel treatment) was made through voting: (1) for each GO term, we used its 30 signature genes to translate the expression profiles of patients in the training dataset into 1D vectors of shape (30, 1). (The expression profile of the individual being predicted underwent the same transformation.) Centroids of the feature vectors were calculated for the RD subgroup and the CR subgroup, respectively. If the cosine distance between the individual's feature vector and the RD subgroup's centroid was smaller than the cosine distance between the feature vector and the CR subgroup's centroid, the individual gained one point toward RD; otherwise, one point was given to CR. (2) After all signature gene sets had cast their votes, the individual was labeled with the prediction that received the most votes. Having an even number of signature gene sets was rarely a problem in this study; we observed that predictions of nonresponsive labels were mostly agreed on by the majority or all of the gene sets. If ties are a concern, fuzzy votes based on cosine distances could be used in place of the binary votes.
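The voting scheme can be sketched as below. Expression profiles are represented as hypothetical dictionaries mapping gene names to values, the toy gene sets contain two genes instead of 30, and a tie is resolved in favor of RD purely as an assumption of this sketch.

```python
from math import sqrt

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def predict_by_voting(sample, train_profiles, train_labels, gene_sets):
    """Each signature gene set casts one vote for the subgroup with the closer centroid."""
    votes = {"RD": 0, "CR": 0}
    for genes in gene_sets:
        def project(profile):
            return [profile[g] for g in genes]
        rd = centroid([project(p) for p, y in zip(train_profiles, train_labels) if y == "RD"])
        cr = centroid([project(p) for p, y in zip(train_profiles, train_labels) if y == "CR"])
        x = project(sample)
        if cosine_distance(x, rd) < cosine_distance(x, cr):
            votes["RD"] += 1
        else:
            votes["CR"] += 1
    # Assumption: with an equal number of votes, the first key ("RD") wins.
    return max(votes, key=votes.get), votes

# Toy training data (hypothetical values).
train_profiles = [
    {"A": 5, "B": 1, "C": 4, "D": 1},
    {"A": 6, "B": 1, "C": 5, "D": 2},
    {"A": 1, "B": 5, "C": 1, "D": 4},
    {"A": 1, "B": 6, "C": 2, "D": 5},
]
train_labels = ["RD", "RD", "CR", "CR"]
gene_sets = [("A", "B"), ("C", "D"), ("A", "D")]
label, votes = predict_by_voting({"A": 5, "B": 2, "C": 4, "D": 2},
                                 train_profiles, train_labels, gene_sets)
```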

#### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: https://www.ncbi.nlm.nih.gov/geo/.

### AUTHOR CONTRIBUTIONS

QC, EW, and XF designed the study. XF performed data preparation, coding, signature extraction, optimization, and downstream analysis.

#### FUNDING

This work was supported by Natural Science Foundation of China (81670462).



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Feng, Wang and Cui. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# SALMON: Survival Analysis Learning With Multi-Omics Neural Networks on Breast Cancer

Zhi Huang<sup>1,2,3</sup>, Xiaohui Zhan<sup>2,4</sup>, Shunian Xiang<sup>4,5</sup>, Travis S. Johnson<sup>2,6</sup>, Bryan Helm<sup>2</sup>, Christina Y. Yu<sup>2,6</sup>, Jie Zhang<sup>5</sup>, Paul Salama<sup>3</sup>, Maher Rizkalla<sup>3</sup>, Zhi Han<sup>2,7</sup>\* and Kun Huang<sup>2,3,7</sup>\*

*<sup>1</sup> School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, United States, <sup>2</sup> Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States, <sup>3</sup> Department of Electrical and Computer Engineering, Indiana University-Purdue University Indianapolis, Indianapolis, IN, United States, <sup>4</sup> National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, China, <sup>5</sup> Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States, <sup>6</sup> Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States, <sup>7</sup> Regenstrief Institute, Indianapolis, IN, United States*

Improved cancer prognosis is a central goal of precision health medicine. Though many models can predict differential survival from data, there is a strong need for sophisticated algorithms that can aggregate and filter relevant predictors from increasingly complex data inputs. In turn, these models should provide deeper insight into which types of data are most relevant for improving prognosis. Deep Learning-based neural networks offer a potential solution to both problems because they are highly flexible and account for data complexity in a non-linear fashion. In this study, we implement Deep Learning-based networks to determine how gene expression data predict survival under Cox regression in breast cancer. We accomplish this through an algorithm called SALMON (Survival Analysis Learning with Multi-Omics Neural Networks), which aggregates and simplifies gene expression data and cancer biomarkers to enable prognosis prediction. The results revealed improved performance when more omics data were used in model construction. Rather than using raw gene expression values as model inputs, we use eigengene modules obtained from gene co-expression network analysis. The high-impact co-expression modules and other omics data are identified by a feature selection technique and then examined through enrichment analysis of their biological functions, which raises the interpretation of input features from the gene level to the co-expression module level. Our study shows the feasibility of discovering breast cancer-related co-expression modules and sketches a blueprint for future endeavors in Deep Learning-based survival analysis. SALMON source code is available at https://github.com/huangzhii/SALMON/.

Keywords: deep learning, co-expression analysis, survival prognosis, breast cancer, multi-omics, neural networks, Cox regression

#### Edited by:

*Victor Jin, The University of Texas Health Science Center at San Antonio, United States*

#### Reviewed by:

*Long Gao, University of Pennsylvania, United States Dong Xu, University of Missouri, United States*

\*Correspondence:

*Kun Huang kunhuang@iu.edu Zhi Han zhihan@iu.edu*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *01 December 2018* Accepted: *14 February 2019* Published: *08 March 2019*

#### Citation:

*Huang Z, Zhan X, Xiang S, Johnson TS, Helm B, Yu CY, Zhang J, Salama P, Rizkalla M, Han Z and Huang K (2019) SALMON: Survival Analysis Learning With Multi-Omics Neural Networks on Breast Cancer. Front. Genet. 10:166. doi: 10.3389/fgene.2019.00166*

### BACKGROUND AND INTRODUCTION

There is a strong need to identify effective prognostic biomarkers to help optimize and personalize treatment (Liu et al., 2016). Among cancers, breast invasive carcinoma is one of the most heterogeneous cancers with distinct prognoses based on morphological, phenological, and molecular stratifications (Nagini, 2017; Wu et al., 2017). Breast invasive carcinoma patients have a 77% survival rate after 5 years and 44% survival rate after 15 years (Pereira et al., 2016), so developing accurate prognostic models could significantly improve risk stratification after diagnosis.

Deep Learning-based approaches have recently been widely applied in Computational Biology and Bioinformatics (Huang et al., 2017; Zhang et al., 2018b). The ability to learn nonlinear functions and retrieve lower-dimensional representations (Ching et al., 2018) is a key advantage of Deep Learning models. Survival prognosis that incorporates Cox proportional hazards regression, either with a single transcriptomic dataset (Ching et al., 2018; Katzman et al., 2018; Shao et al., 2018) or with multi-omics data (Chaudhary et al., 2018; Poirion et al., 2018; Ramazzotti et al., 2018; Sun et al., 2018; Zhang et al., 2018a), is of major interest in precision health.

For these reasons, we integrate multi-omics data with Deep Learning-based survival prognosis models. While most contemporary approaches incorporate one or a few types of omics data, such as mRNA-seq data and miRNA-seq data (Gupta et al., 2015; Nassar et al., 2017), we propose that integrating more diverse data may lead to improved modeling, especially when driven by machine learning. Moreover, classic cancer biomarkers can often stratify patients into risk groups, and these too should be integrated when available. Specifically, copy number burden (CNB) and tumor mutation burden (TMB) are important for predicting tumor progression (Marshall et al., 2017; Thomas et al., 2018) and immunotherapy response (Birkbak et al., 2013; Chalmers et al., 2017; Goodman et al., 2017). Other demographical and clinical information, such as diagnosis age, estrogen receptor (ER) status, and progesterone receptor (PR) status, should also be considered during model construction. One of the challenges with such diverse data is high dimensionality.

Most Deep Learning approaches employ neural networks (multilayer perceptrons) with huge numbers of parameters to be optimized. Optimizing such large sets of parameters with limited patient samples tends to introduce overfitting that renders the models ineffective. In this paper, we advocate the use of eigengene matrices, derived from co-expression analysis with the R package "lmQCM," instead of the original mRNA-seq and miRNA-seq data. Using a neural network architecture, multi-omics data, and the Cox proportional hazards model, we develop our model called SALMON (Survival Analysis Learning with Multi-Omics Neural Networks). SALMON adopts co-expression modules as input, namely, the eigengene matrix derived from co-expression network analysis. This greatly reduces the dimension of the original feature space, addressing the "curse of dimensionality," and increases the robustness and learnability of the model. This technique has not been adopted by any other Deep Learning-based survival prognosis model such as Cox-nnet (Ching et al., 2018).

SALMON is trained on co-expression module eigengenes instead of gene expression values, and thus we were able to investigate the contribution of co-expression modules to the hazard ratio (**Figure 1**). These gene co-expression modules contained individual genes from the initial lmQCM gene co-expression network analysis. Genes from modules that contributed strongly to the hazard ratio were further evaluated with gene enrichment analysis to confirm certain gene regulations and biological processes. These biological findings support the validity of our models and provide insight into the complex regulatory relationships at work in breast invasive carcinoma.

### MATERIALS AND METHODS

### Datasets and Study Design

In this experiment, we analyzed 583 female breast invasive carcinoma (BRCA) patients with five omics data types. Gene expression data (illuminahiseq\_rnaseqv2-RSEM\_genes\_normalized) and miRNA data (illuminahiseq\_mirnaseq-miR\_gene\_expression) were obtained from Broad GDAC Firehose (https://gdac.broadinstitute.org/). Copy number burden (CNB) was measured by total Kb length, and the data (broad.mit.edu\_PANCAN\_Genome\_Wide\_SNP\_6\_whitelisted.seg) were provided by the Pan-Cancer Atlas (PanCanAtlas) Initiative (https://gdc.cancer.gov/about-data/publications/pancanatlas). Tumor mutation burden (TMB) was calculated as the total number of mutated genes based on MAF files (Mutation\_Packager\_Oncotated\_Calls) from Broad GDAC Firehose. Demographical and clinical information (diagnosis age, Estrogen Receptor (ER) status, Progesterone Receptor (PR) status) and overall survival (OS) events and months were collected from cBioPortal (http://www.cbioportal.org/). HER2 status was not considered in this article because of insufficient data. **Table 1** shows the statistical information of this patient cohort.

We performed 5-fold cross-validation on the dataset. In each fold, 80% of the data were used for model training and 20% for model testing. mRNA and miRNA data were pre-processed with the TSUNAMI online analysis suite (https://apps.medgen.iupui.edu/rsc/tsunami/). Pre-processing had two steps: first, genes in the lowest 20% of mean expression across all patients were removed; then, genes in the lowest 20% of expression variance were removed. These pre-processing steps were necessary to ensure the robustness of the downstream correlation computation in the gene co-expression module analysis step.
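The two filtering steps can be sketched in a few lines. This is an illustration only, not the TSUNAMI implementation: it assumes that the variance filter is applied to the genes that survive the mean filter and that the number of genes to drop is rounded down; `expr` is a hypothetical dictionary mapping gene names to per-patient expression values.

```python
from statistics import mean, pvariance

def drop_lowest_fraction(expr, score, fraction=0.2):
    """Remove the `fraction` of genes with the lowest score (e.g., mean or variance)."""
    ranked = sorted(expr, key=lambda gene: score(expr[gene]))
    dropped = set(ranked[:int(len(ranked) * fraction)])
    return {gene: values for gene, values in expr.items() if gene not in dropped}

def preprocess(expr, fraction=0.2):
    kept = drop_lowest_fraction(expr, mean, fraction)       # step 1: low mean expression
    return drop_lowest_fraction(kept, pvariance, fraction)  # step 2: low variance

# Toy data: gene g<i> has mean 10*i; g3 has the smallest spread among the survivors.
spread = {i: (1 if i == 3 else 5) for i in range(10)}
expr = {f"g{i}": [10 * i - spread[i], 10 * i + spread[i]] for i in range(10)}
filtered = preprocess(expr)
```

Here `g0` and `g1` are removed by the mean filter and `g3` by the variance filter, leaving seven genes.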

#### Gene Co-expression Module Analysis

Instead of feeding mRNA-seq and miRNA-seq data to the neural networks and analyzing results at the gene level, we used eigengene matrices of gene co-expression modules obtained from the lmQCM algorithm (Zhang and Huang, 2014) as the input to the SALMON algorithm. This removed 99.46% of the input features and greatly reduced the number of parameters in the neural networks. Using eigengenes as features can be considered a bias/variance (error/complexity) trade-off in machine learning (Weigend et al., 1990; Geman et al., 1992), which simplifies the networks significantly. The total number of neural network weights to be learned was thereby narrowed down from 107193 to 521, which makes the learning process more robust and alleviates overfitting (Caruana et al., 2001; Schmidhuber, 2015).

TABLE 1 | Demographical and clinical characteristics of 583 female breast invasive carcinoma (BRCA) patients.

*mRNA and miRNA stand for mRNA-seq data and miRNA-seq data. OS stands for overall survival. The status of ER and PR were derived from IHC (immunohistochemistry). All clinical information was collected from cBioPortal.*

There are many gene co-expression network analysis packages, such as the R package for weighted correlation network analysis (WGCNA) (Langfelder and Horvath, 2008) and local maximal Quasi-Clique Merger (lmQCM) (Zhang and Huang, 2014), which can discover densely connected gene modules across samples/patients. Co-expression network analyses are increasingly used to reveal latent gene-gene interactions, biomarkers, and novel gene functions (Horvath et al., 2012; Chandran et al., 2016; Han et al., 2016, 2017; Zhang and Huang, 2017; Xiang et al., 2018). Compared with WGCNA, the weight normalization process in lmQCM was inspired by spectral clustering (Ng et al., 2002) in machine learning. As an efficient revision of the eQCM (edge-covering quasi-clique merger) algorithm (Xiang et al., 2012), lmQCM allows module overlap and mines smaller, densely co-expressed modules, and was thus adopted in this article. The generally smaller size of the mined modules can also generate more meaningful gene ontology (GO) enrichment results (Zhang et al., 2012, 2013, 2016; Shroff et al., 2016; Cheng et al., 2017). The implementation was performed on TSUNAMI. For mRNA-seq data, we set the lmQCM parameters γ = 0.7, λ = 1, t = 1, β = 0.4, and minimum cluster size = 10, and adopted Spearman's rank correlation coefficient (Mukaka, 2012) to calculate gene-wise correlations. The parameter settings for miRNA-seq data were the same except for γ = 0.4, β = 0.6, and minimum cluster size = 4.

After calculating gene co-expression modules with lmQCM, the eigengene matrices were determined. An eigengene matrix contains, for each gene co-expression module, the expression values summarized into the first principal component by singular value decomposition (SVD) (Golub and Reinsch, 1970). Taking the first right-singular vector of each module as the summarized expression values projects the co-expressed genes onto a 1-D space, so the result can be treated as a "super gene." In our experiment with breast invasive carcinoma, an eigengene matrix with 57 dimensions was derived from mRNA-seq data and an eigengene matrix with 12 dimensions was derived from miRNA-seq data. Details of the co-expression modules and eigengene matrices derived for this paper are available in the **Supplementary files**. These eigengene matrices were used in place of the original expression inputs.
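As an illustration of the eigengene computation, the sketch below extracts the first right-singular vector of one module (genes in rows, patients in columns) by power iteration, which avoids a full SVD. It assumes that each gene is centered across patients beforehand; the scaling actually used by lmQCM/TSUNAMI is not stated here and may differ. The sign of a singular vector is arbitrary, so only the direction is meaningful.

```python
from math import sqrt

def eigengene(module, n_iter=100):
    """First right-singular vector (one value per patient) of a row-centered module."""
    centered = []
    for row in module:
        mu = sum(row) / len(row)
        centered.append([x - mu for x in row])
    # Start from the centered gene profile with the largest norm; a constant
    # vector would be a poor start because centered rows are orthogonal to it.
    v = max(centered, key=lambda r: sum(x * x for x in r))
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in centered]  # A v
        w = [sum(s * row[j] for s, row in zip(scores, centered))           # A^T (A v)
             for j in range(len(v))]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("module has no variance across patients")
        v = [x / norm for x in w]
    return v

# Toy module: three perfectly co-expressed genes measured in five patients.
pattern = [1.0, 2.0, 3.0, 4.0, 6.0]
module = [pattern, [2 * x for x in pattern], [0.5 * x for x in pattern]]
super_gene = eigengene(module)
```

Stacking one such vector per module yields a patients-by-modules matrix of the kind used as network input.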

#### Neural Networks Design, Architecture, and Evaluation Metric

SALMON was designed and implemented in PyTorch 1.0. The mRNA-seq and miRNA-seq eigengene matrices were first connected to hidden layers with dimensions 8 and 4, respectively, and then connected to the final output (hazard ratio) through the Cox proportional hazards regression networks. In contrast, CNB, TMB, and the demographical and clinical information (diagnosis age, ER status, PR status) had no hidden layer and were connected to the final output directly as covariates. This architecture is illustrated in **Figure 1**. The rationale for this network architecture, instead of a simple fully connected network such as Cox-nnet (Ching et al., 2018), rests on two assumptions: (1) each omics type affects the hazard ratio independently; (2) downscaling the eigengene matrices through hidden layers forces the multi-omics data to contribute to the hazard ratio on a relatively equal scale in the Cox proportional hazards regression part of the network.
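The branch structure can be made concrete with a small dependency-free sketch (not the authors' PyTorch code). Under the assumption that bias terms are not counted, the layer sizes given in the text reproduce the 521 weights reported earlier: 57 × 8 + 12 × 4 + (8 + 4 + 5) = 521. The names `BRANCHES`, `init_weights`, and `forward` are illustrative, and the random initialization is arbitrary.

```python
import random
from math import exp

BRANCHES = {"mRNA": (57, 8), "miRNA": (12, 4)}  # (input eigengenes, hidden units)
N_COVARIATES = 5                                # CNB, TMB, age, ER status, PR status

def count_weights():
    branch = sum(n_in * n_hid for n_in, n_hid in BRANCHES.values())
    cox = sum(n_hid for _, n_hid in BRANCHES.values()) + N_COVARIATES
    return branch + cox

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def init_weights(seed=0):
    rng = random.Random(seed)
    branch = {name: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hid)]
              for name, (n_in, n_hid) in BRANCHES.items()}
    n_merged = sum(n_hid for _, n_hid in BRANCHES.values()) + N_COVARIATES
    beta = [rng.uniform(-0.1, 0.1) for _ in range(n_merged)]
    return branch, beta

def forward(inputs, covariates, branch, beta):
    """Each omics type passes through its own hidden layer; covariates skip it."""
    merged = []
    for name in BRANCHES:
        merged += [sigmoid(sum(w * x for w, x in zip(row, inputs[name])))
                   for row in branch[name]]
    merged += list(covariates)
    return sigmoid(sum(b * m for b, m in zip(beta, merged)))  # Cox output layer

branch, beta = init_weights()
output = forward({"mRNA": [0.1] * 57, "miRNA": [0.2] * 12},
                 [0.3, 0.1, 0.5, 1.0, 0.0], branch, beta)
```

Keeping the branches separate means that a weight in the final layer can be traced back to a single data type, which is the property later used for feature ranking.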

SALMON adopts the Adaptive moment estimation (Adam) optimizer (Kingma and Ba, 2015). We set the number of epochs to 100, with learning rates fine-tuned for each of the 5-fold cross-validation experiments. LASSO (least absolute shrinkage and selection operator) regularization (Santosa and Symes, 1986) is applied to the networks. The Sigmoid activation function is applied right after each forward propagation and after the Cox proportional hazards regression networks. The Sigmoid function

$$sigmoid\left(\mathbf{x}\right) = \frac{1}{1 + e^{-\mathbf{x}}} \tag{1}$$

forces the output range to be within 0 to 1 and introduces non-linearity to the system. In this model, we set the batch size to 64, and batch normalization was not adopted. The number and dimensions of the hidden layers can be fine-tuned; in this paper, a single hidden layer was attached to each type of transcriptomic data, with size 8 for mRNA-seq modules and size 4 for miRNA-seq modules.

#### Cox Proportional Hazards Regression Networks

Our algorithm SALMON, which integrates the Cox proportional hazards model, differs from previous work (Ma and Zhang, 2018; Sun et al., 2018) that uses survival status (living or deceased) in a binary classification problem. In contrast, we also took survival times (overall survival months), denoted as Y<sub>i</sub>, into account and framed the training of our neural networks as a Cox regression learning task. Maximum likelihood estimation (MLE) is then applied to the log partial likelihood

$$\ell\left(\beta\right) = \sum_{i:\,C_{i}=1} \left(\sum_{k=1}^{K} \beta_{k} X_{ik} - \log\left(\sum_{j:\,Y_{j}\ge Y_{i}} \exp\left(\sum_{k=1}^{K} \beta_{k} X_{jk}\right)\right)\right) \tag{2}$$

where β are the parameters to be estimated. C<sub>i</sub> = 1 indicates the occurrence of the death event for patient i with K-dimensional input vector X<sub>i</sub>.

#### Objective Function

Based on the Cox proportional hazards regression networks, we formulated the objective function of the neural networks as:

$$\hat{\Theta} = \operatorname{argmin}_{\Theta} \left\{ -\sum_{i:\, C_i = 1} \left( \sum_{k=1}^{K} \beta_k X_{ik} - \log \left( \sum_{j:\, Y_j \ge Y_i} \exp\left(\sum_{k=1}^{K} \beta_k X_{jk}\right) \right) \right) + \lambda \left\| \Theta \right\|_1 \right\} \tag{3}$$

where Θ denotes all network weights (including β) to be optimized via back-propagation, and λ is the weight multiplier of the LASSO regularization. We set λ = 1 × 10<sup>−5</sup> in the experiments.
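A minimal numerical sketch of this objective is given below, written in the form that is usually minimized: the negative log partial likelihood plus an L1 penalty. Here `risk` stands for the linear predictor Σ<sub>k</sub> β<sub>k</sub>X<sub>ik</sub> of each patient; the function names are illustrative and ties in survival time receive no special handling.

```python
from math import exp, log

def neg_log_partial_likelihood(risk, time, event):
    """Negative Cox log partial likelihood for per-patient linear predictors."""
    total = 0.0
    for i in range(len(risk)):
        if event[i] != 1:
            continue  # censored patients contribute only through the risk sets
        at_risk = sum(exp(risk[j]) for j in range(len(risk)) if time[j] >= time[i])
        total += risk[i] - log(at_risk)
    return -total

def objective(risk, time, event, weights, lam=1e-5):
    """Negative log partial likelihood plus a LASSO (L1) penalty on all weights."""
    return neg_log_partial_likelihood(risk, time, event) + lam * sum(abs(w) for w in weights)

# Toy check: with equal risks the loss is log(3) + log(2) for events at times 1 and 2.
loss = neg_log_partial_likelihood([0.0, 0.0, 0.0], [1, 2, 3], [1, 1, 0])
```

In a PyTorch implementation the same quantity would be computed with tensors so that gradients can flow back to all weights.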

#### Evaluation Metric

Concordance index (Steck et al., 2007), which takes values from 0 to 1, is used in this article as the evaluation metric of survival prognosis. It is widely adopted to evaluate the performance of survival prognosis models (Ching et al., 2018; Katzman et al., 2018) and is equivalent to the area under the ROC curve (AUC) (Bradley, 1997), which measures the model's ability to distinguish between living and deceased groups. A concordance index of 0.5 indicates that the model performs no better than random prediction, and a higher concordance index (> 0.5) indicates a better survival prognosis model. For breast invasive carcinoma, we consider a concordance index > 0.7 to indicate good model performance.
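For reference, the concordance index can be computed as in the following sketch, which uses the common pairwise definition (Harrell's C): a pair is comparable when the patient with the shorter time had an event, and tied risks count as half-concordant. The function name is illustrative.

```python
def concordance_index(time, event, risk):
    """Fraction of comparable patient pairs ordered correctly by predicted risk."""
    concordant = comparable = 0.0
    for i in range(len(time)):
        if event[i] != 1:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

time = [1, 2, 3, 4]
event = [1, 1, 0, 1]
perfect = concordance_index(time, event, [0.9, 0.7, 0.4, 0.1])
uninformative = concordance_index(time, event, [0.5, 0.5, 0.5, 0.5])
```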

#### Survival Analysis

Survival analysis with the log-rank test (Mantel, 1966) is used to inspect the performance of SALMON on the 5-fold cross-validation testing sets. The Kaplan-Meier survival curves are generated by dichotomizing all testing patients into low-risk and high-risk groups at the median hazard ratio. The corresponding log-rank p-value reflects the ability of the model to differentiate the two risk groups. Lower p-values indicate better model performance.
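The dichotomization and the two-group log-rank test can be sketched as follows. This is a textbook version for illustration: patients with a predicted hazard above the median are assigned to the high-risk group (how values equal to the median are assigned is an assumption), and the p-value uses the chi-square distribution with one degree of freedom.

```python
from math import erfc, sqrt
from statistics import median

def split_by_median(hazards):
    cut = median(hazards)
    return [h > cut for h in hazards]  # True = high-risk group

def logrank_test(time, event, high_risk):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    observed = expected = variance = 0.0
    for t in sorted({t for t, e in zip(time, event) if e == 1}):
        n = sum(1 for tt in time if tt >= t)                              # at risk overall
        n1 = sum(1 for tt, g in zip(time, high_risk) if tt >= t and g)    # at risk, high-risk
        d = sum(1 for tt, e in zip(time, event) if tt == t and e == 1)    # events overall
        d1 = sum(1 for tt, e, g in zip(time, event, high_risk)
                 if tt == t and e == 1 and g)                             # events, high-risk
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    return chi2, erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 d.o.f.

# Toy example (hypothetical values): high predicted hazards coincide with early deaths.
hazards = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
groups = split_by_median(hazards)
chi2, p_value = logrank_test([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1], groups)
```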

#### Gene Ontology and Functional Enrichment Analysis

Co-expression modules generated by lmQCM are then exported to ToppGene Suite (Chen et al., 2009) (https://toppgene.cchmc.org/) and Enrichr (Kuleshov et al., 2016) (http://amp.pharm.mssm.edu/Enrichr/). Using ToppGene, we performed functional analysis including Gene Ontology (GO) and cytoband analysis. Terms with a false discovery rate (FDR) < 0.05 and FDR < 1.0 were considered significantly enriched for GO analysis and cytoband analysis, respectively. Human Gene Atlas [up-regulated genes in human tissues from BioGPS (http://biogps.org)] and ARCHS4 tissues were also investigated for certain co-expression modules with Enrichr.

#### RESULTS

The experiments were performed with six different combinations of multi-omics data as input sources: (i) mRNA-seq data (mRNA) (57 features); (ii) miRNA-seq data (miRNA) (12 features); (iii) integration of mRNA and miRNA (69 features); (iv) integration of mRNA, miRNA, copy number burden (CNB), and tumor mutation burden (TMB) (71 features); (v) integration of mRNA, miRNA, and demographical and clinical (diagnosis age, ER status, PR status) data (72 features); (vi) integration of mRNA, miRNA, CNB, TMB, and demographical and clinical (diagnosis age, ER status, PR status) data (74 features). Both sets of RNA-seq co-expression modules (mRNA and miRNA) are included in all integrative combinations. For each combination, the unused network substructures were removed from the SALMON architecture in **Figure 1**, and 5-fold cross-validation was performed with the 583 patients. Concordance index was used to evaluate the performance. SALMON was then compared with several other survival prognosis algorithms, namely Cox-nnet (Ching et al., 2018), DeepSurv (Katzman et al., 2018), the generalized linear model with Cox regression (GLMNET) (Friedman et al., 2010), and RSF (Ishwaran et al., 2008), with all omics data fed in. Since these models did not take multiple omics data sources into consideration, we modified their original frameworks to integrate the multi-omics data (with co-expression modules) altogether as a single input vector. The feature importance of all 74 covariates was also investigated by repeated feature deletion and ranked by the median decrease in concordance index, which revealed certain biological interpretations.

#### Integrating Multi-Omics Features Increased the Performances

From **Figure 2A**, we observed an upward trend in median/mean concordance indices as more omics data were integrated. Integrating all omics data (74 features) gave the best performance (concordance index: median = 0.7285; mean = 0.6918). Next, the hazard ratios from all five testing sets were concatenated and subjected to the log-rank test (Mantel, 1966), as shown in **Figures 2C–E** and **Figure S1**. Another feature set without transcriptomics data (5 features containing CNB, TMB, and demographical and clinical features) was also considered as a reference, with median concordance index = 0.6949; its Kaplan-Meier plot is shown in **Figure S1F** (log-rank test p-value = 3.67E-03). We found that integrating all omics data (**Figure 2E**) gave the most significant p-value (1.201E-04) in the log-rank test, suggesting that integrating more multi-omics data into SALMON can enhance the prediction.

We further performed pairwise paired t-tests on the resulting concordance indices. In **Table 2**, a negative t-statistic implies that set 1 performed worse than set 2. The results suggest that integrating more omics data can generally increase the performance of survival prognosis in breast cancer.
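The paired t-statistic compares two feature sets fold by fold. The sketch below computes only the statistic, since the p-value requires the t-distribution with n − 1 degrees of freedom, which is not in the Python standard library. The per-fold concordance indices are hypothetical placeholders and are not values from this study.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(set1, set2):
    """t = mean(d) / (sd(d) / sqrt(n)) for the per-fold differences d = set1 - set2."""
    diffs = [a - b for a, b in zip(set1, set2)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical concordance indices from five cross-validation folds.
fewer_omics = [0.60, 0.62, 0.65, 0.70, 0.66]
more_omics = [0.64, 0.66, 0.67, 0.70, 0.72]
t_stat = paired_t_statistic(fewer_omics, more_omics)
```

A negative statistic here means that the first set scored lower than the second across folds, matching the sign convention used for **Table 2**.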

Next, we compared SALMON with the state-of-the-art Deep Learning-based cancer survival prognosis model Cox-nnet (Ching et al., 2018), the recently proposed DeepSurv (Katzman et al., 2018), and two traditional models, the generalized linear model with Cox regression (GLMNET) (Friedman et al., 2010) and RSF (Ishwaran et al., 2008). We modified their original implementations to take all omics data as inputs. As shown in **Figure 2B**, the median concordance index of SALMON (0.7285) was higher than those of the modified Cox-nnet (0.7234), DeepSurv (0.6563), GLMNET (0.6490), and RSF (0.6229). Compared with the modified Cox-nnet, which performs similarly in terms of concordance index, SALMON has a more significant log-rank test result (p-value = 1.201E-04) than the modified Cox-nnet (p-value = 2.282E-04) with all testing sets and all 74 features as inputs (**Figure S2**). The difference in performance between SALMON and the modified Cox-nnet is not significant (paired t-test statistic = −2.105, p-value = 0.103), suggesting that the two methods are comparable. From the neural network structure perspective, however, SALMON is more flexible, since it separates forward propagation for each omics data type, which enables scalable integration of multi-omics data.

#### Interpreting and Ranking the Importance of Co-expression Modules

Interpreting feature importance in neural networks has been studied for years. One approach is to set each feature to zero in turn; the feature whose removal changes the resulting accuracy the least is the least important to the prediction model. This approach is widely adopted for feature selection and for ranking the importance of features in neural networks (Setiono and Liu, 1997; Zhang, 2000; Sung and Mukkamala, 2003). Based on this approach, we analyzed the contribution of each eigengene module to the final hazard ratio by forcing each input feature of the testing sets to zero. By feeding the modified testing sets to the pre-trained SALMON networks, we ranked the importance of features by how much the concordance index decreased. Features that decrease the testing concordance index more are considered more important. For this analysis, we integrated all omics data for training and testing. **Table 3** presents the top features that reduced the concordance index the most. The leading two features are diagnosis age and PR status, followed by five mRNA-seq co-expression modules.
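The zeroing procedure can be sketched with a toy model. Here `toy_predict` is a hypothetical linear stand-in for the pre-trained SALMON network, the data are invented for illustration, and importance is measured on a single test set rather than as the median over folds.

```python
def concordance_index(time, event, risk):
    concordant = comparable = 0.0
    for i in range(len(time)):
        if event[i] != 1:
            continue
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                concordant += 1.0 if risk[i] > risk[j] else 0.5 if risk[i] == risk[j] else 0.0
    return concordant / comparable

def zeroing_importance(predict, X, time, event):
    """Decrease in concordance index when each feature is forced to zero in turn."""
    baseline = concordance_index(time, event, [predict(x) for x in X])
    drops = []
    for k in range(len(X[0])):
        masked = [x[:k] + [0.0] + x[k + 1:] for x in X]  # feature k set to zero
        drops.append(baseline - concordance_index(time, event, [predict(x) for x in masked]))
    return drops

WEIGHTS = [2.0, 0.0, 0.5]  # hypothetical trained weights

def toy_predict(x):
    return sum(w * v for w, v in zip(WEIGHTS, x))

X = [[3.0, 1.0, 0.0], [2.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
time, event = [1, 2, 3, 4], [1, 1, 1, 1]
drops = zeroing_importance(toy_predict, X, time, event)
ranking = sorted(range(len(drops)), key=lambda k: drops[k], reverse=True)
```

In this toy case only the first feature carries the survival ordering, so zeroing it lowers the concordance index while zeroing the others leaves it unchanged.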

Next, we selected the features (33 in total) whose median values were < 0 in **Figure 3** and repeated the training and testing in SALMON. The results showed that the difference in performance before and after feature selection was not significant in terms of concordance index (before feature selection: mean = 0.6918, median = 0.7285; after feature selection: mean = 0.7108, median = 0.7200; paired t-test statistic = −0.861, p-value = 0.438) (**Figure S3**). This implies that training with the selected "important" multi-omics features instead of all features can still preserve the prognosis performance.

#### Identification of Breast Cancer Related Genes and Cytobands Associated With Important Modules

To infer the biological implications of the feature ranking, we performed Gene Ontology (GO) and cytoband enrichment analysis with ToppGene Suite (https://toppgene.cchmc.org/) (Chen et al., 2009). Specifically, we focused on analyzing the top five mRNA co-expression modules (**Table 3**). In total, we identified 10 genes, including MST1, CPT1B, MAP3K7, and CCNC. We also identified various enriched cytobands and other biological functions. **Table 3** is further discussed and explained in the Discussion section. The gene list for each mRNA-seq and miRNA-seq module is provided in the **Supplementary Material**.

#### Investigating Feature Importance With Different Age Groups

As shown in **Figure 3**, we observed the strong predictive power of diagnosis age, which is consistent with previous studies demonstrating age as one of the most prominent cancer risk factors (Adami et al., 1986). Thus, it is crucial to further investigate whether patients in different age groups can be stratified using the same set of features. In this paper, we define three age groups: (1) age in range 26–50 (191 patients), (2) age in range 51–70 (280 patients), and (3) age in range 71–90 (112 patients), representing younger, middle-aged, and elderly patients. By training and testing SALMON on these three distinct groups, we aim to answer two questions: (1) whether diagnosis age is still a strong factor affecting prognosis performance; (2) how the feature rankings differ among the three groups.

FIGURE 2 | (A) Performances of SALMON with multi-omics data integrated in terms of concordance index. (B) Performance comparison between SALMON and the modified Cox-nnet, DeepSurv, GLMNET, and RSF in terms of concordance index with all omics data used for learning. (C–E) Kaplan-Meier plot of survival prognosis. Hazard ratios were derived from all five testing sets. Log-rank test was used to find the corresponding p-value with low risk and high risk groups dichotomized by the median hazard ratio. Omics data used for training and testing: (C) mRNA-seq data (mRNA); (D) miRNA-seq data (miRNA); (E) integration of mRNA, miRNA, CNB, TMB, and demographical & clinical (diagnosis age, ER status, PR status) data. All other combinations of multi-omics results are in Figure S1.

TABLE 2 | Performance comparison of different combinations of multi-omics data by pairwise paired *t*-test, according to the concordance indices of the 5-fold cross-validation results.

*Note that a negative t-statistic indicates that set 1 performed worse than set 2. Multi-omics datasets applied as inputs: (i) mRNA-seq data (mRNA) (57 features); (ii) miRNA-seq data (miRNA) (12 features); (iii) integration of mRNA and miRNA (69 features); (iv) integration of mRNA, miRNA, copy number burden (CNB), and tumor mutation burden (TMB) (71 features); (v) integration of mRNA, miRNA, and demographical and clinical (diagnosis age, ER status, PR status) data (72 features); (vi) integration of mRNA, miRNA, CNB, TMB, and demographical and clinical (diagnosis age, ER status, PR status) data (74 features).*

*t denotes the pairwise paired Student's t-test statistic, P denotes the p-value obtained. P-values* < *0.05 are considered to be significant and indicated with the* \* *symbol.*

The performances in terms of concordance index obtained by integrating all omics and clinical data (including mRNA, miRNA, CNB, TMB, diagnosis age, ER status, PR status) are shown in **Figure 4**. As expected, they are all slightly inferior to the performance obtained without stratifying patients by age (median = 0.7285; mean = 0.6918), although the difference is not statistically significant. When inspecting the feature rankings, as shown in **Table 4**, we observed that in the age group 26–50, PR status (Progesterone Receptor status) plays a pivotal role in prognosis, while other features, including diagnosis age, do not contribute substantially to the prognosis (we still listed some modules). This situation changed in the age group 51–70, where ER status (Estrogen Receptor status) becomes the most important feature, while diagnosis age ranks #5 with only a marginal contribution. In the age group 71–90, neither ER status, PR status, nor diagnosis age ranked near the top; instead, mRNA-seq co-expression modules appeared to have the major influence on prognosis. The top-ranked modules are #11, #1, #29, #35, and #4. By performing enrichment analysis, we found that module #11 is significantly enriched with epithelium development genes (GO:0060429, p = 2.253E-9); module #1 is significantly enriched with chromosome organization genes (GO:0051276, p = 5.344E-17), and two well-known breast cancer genes, NCOA3 (Burwinkel et al., 2005) and FOXA1 (Meyer and Carroll, 2012; Rangel et al., 2018), were identified in this module; module #29 was enriched on cytoband 19q13.41 (p = 1.517E-25) and consists exclusively of zinc-finger proteins; module #35 was enriched on cytoband 1q34 (p = 1.252E-15) and contains multiple genes that have previously been detected in multiple breast cancer studies, including UQCRH, PSMB2, PPIH, and YBX1 (Miller et al., 2005; Pujana et al., 2007; Barry et al., 2010); and module #4 is highly enriched with mitotic cell cycle genes (GO:1903047, p = 2.183E-70), including well-known breast cancer genes such as MKI67 (Gyorffy et al., 2010) and AURKA (Cox et al., 2006). Detailed feature rankings are in **Figures S5**–**S7**.

#### DISCUSSION

In this work, we demonstrated the feasibility of breast cancer survival prognosis by integrating multi-omics data using Deep Learning-based approaches, and opened up a new avenue for deriving new prognostic biomarkers in breast cancer. We introduced our SALMON (Survival Analysis Learning with Multi-Omics Neural Networks) algorithm, which implements Cox proportional hazards regression networks, and applied it to breast invasive carcinoma. Instead of using gene-level mRNA-seq or miRNA-seq data directly, SALMON adopts eigengene matrices derived from weighted gene co-expression network analysis as the network input. SALMON performs forward propagation separately for each type of omics or clinical data, in contrast with other models such as Cox-nnet [which originally neither integrated multi-omics data nor used co-expression modules as inputs (Ching et al., 2018)]. Separating the forward propagation prevents interactions across omics data types and thus enables easier examination of module/feature importance for interpretability. SALMON showed good prognosis results in terms of concordance index and log-rank test. Although experiments showed that the performance of SALMON is competitive with, but not significantly superior to, the state-of-the-art Cox-nnet (Ching et al., 2018), we follow a different paradigm by investigating how prognosis performance increases when more omics and clinical data types are integrated, since other models such as Cox-nnet (Ching et al., 2018) and DeepSurv (Katzman et al., 2018) do not handle multi-omics data as input. The improved performance (concordance index) obtained by integrating more omics data supports the hypothesis that integrative analysis enhances survival prognosis accuracy for breast cancer. Moreover, using gene co-expression modules rather than individual gene expressions to reduce features upfront is the feature engineering technique we introduced based on bioinformatics techniques. By bridging the gap between gene co-expression analysis and Deep Learning, the advantages can be observed when we backtrack to identify which modules/features affect the performance. The detected modules reveal clear cancer-related biological processes, functions, or structural variations, allowing further biomedical investigations.

TABLE 3 | Top features that reduced the concordance index, including two demographical and clinical features, and five mRNA-seq co-expression modules (eigengene matrices as inputs to the SALMON).
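The idea of propagating each data type through its own branch before a shared Cox layer can be sketched as follows. This is a rough NumPy illustration and not the authors' implementation: the layer sizes, the sigmoid activation, the treatment of clinical variables as a separate branch, and all names are assumptions, and tied event times receive no special handling.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_branch(n_in, n_hidden):
    # One small hidden layer per data type (sizes are hypothetical).
    return {"W": rng.normal(0.0, 0.1, size=(n_in, n_hidden)),
            "b": np.zeros(n_hidden)}

def forward(inputs, branches, w_out):
    # Each data type passes through its own branch; the branches only meet
    # at the final linear (Cox) layer.
    hidden = [sigmoid(x @ branches[name]["W"] + branches[name]["b"])
              for name, x in inputs.items()]
    return np.concatenate(hidden, axis=1) @ w_out  # log hazard ratio per patient

def cox_neg_log_partial_likelihood(log_hazard, times, events):
    # Negative log partial likelihood, averaged over observed events.
    order = np.argsort(-times)                     # longest follow-up first
    lh, ev = log_hazard[order], events[order]
    log_risk_set = np.log(np.cumsum(np.exp(lh)))   # patients still at risk
    return -np.sum((lh - log_risk_set) * ev) / max(ev.sum(), 1)

n_patients = 6
inputs = {
    "mrna_eigengenes": rng.normal(size=(n_patients, 5)),
    "mirna_eigengenes": rng.normal(size=(n_patients, 3)),
    "clinical": rng.normal(size=(n_patients, 2)),
}
branches = {
    "mrna_eigengenes": init_branch(5, 4),
    "mirna_eigengenes": init_branch(3, 2),
    "clinical": init_branch(2, 2),
}
w_out = rng.normal(0.0, 0.1, size=8)               # 4 + 2 + 2 hidden units
times = np.array([12.0, 30.0, 7.0, 45.0, 22.0, 3.0])
events = np.array([1, 0, 1, 1, 0, 1])

log_hazard = forward(inputs, branches, w_out)
loss = cox_neg_log_partial_likelihood(log_hazard, times, events)
print(log_hazard.shape, float(loss))
```

Because no weight connects one branch to another before the output layer, the contribution of a single module can be examined without cross-omics interactions, which is the interpretability argument made above.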

As feature importance has been conveyed and ranked by SALMON, we discovered that keeping only the top important features can still preserve the testing performance. Based on the feature ranking, we also investigated the biological interpretation behind each demographical feature, clinical feature, and co-expression module. For the leading two features, since the importance of diagnosis age and PR status has been widely examined and recognized in breast cancer (Adami et al., 1986; Boyd et al., 1995; Huang et al., 2000; Bauer et al., 2007) and further confirmed by our results (**Figure 2C**), we focused on the top five mRNA-seq co-expression modules, ranked from 3 to 7: modules #13, #47, #5, #36, and #51.
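Ranking features by how much the concordance index drops when a feature is knocked out can be illustrated with a small sketch. The knockout rule used here (replacing a column by its mean), the fixed linear risk model, and the simulated data are all assumptions made for illustration and are not claimed to match the procedure in SALMON.

```python
import numpy as np

def c_index(times, events, risk):
    # Pair (i, j) is comparable when patient i had an observed event before time j.
    comparable = (times[:, None] < times[None, :]) & (events[:, None] == 1)
    higher = risk[:, None] > risk[None, :]
    tied = risk[:, None] == risk[None, :]
    return (np.sum(comparable & higher) + 0.5 * np.sum(comparable & tied)) / np.sum(comparable)

def knockout_importance(X, times, events, predict):
    baseline = c_index(times, events, predict(X))
    drops = {}
    for j in range(X.shape[1]):
        X_ko = X.copy()
        X_ko[:, j] = X[:, j].mean()   # neutralize feature j (assumed knockout rule)
        drops[j] = baseline - c_index(times, events, predict(X_ko))
    return baseline, drops

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
weights = np.array([2.0, 0.0, 0.5])   # hypothetical fixed model

def predict(data):
    return data @ weights

times = np.exp(-predict(X))           # simulated: higher risk, shorter survival
events = np.ones(40, dtype=int)

baseline, drops = knockout_importance(X, times, events, predict)
ranking = sorted(drops, key=drops.get, reverse=True)
print(baseline, ranking)
```

In this toy setting the feature with zero weight produces no drop, while the dominant feature produces the largest one, which mirrors how a ranking such as Table 3 would be read.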

Module #13 appears to be significantly associated with CD8+ T cells (p-value = 6.54E-06) and CD4+ T cells (p-value = 1.50E-02) based on Human Gene Atlas analysis. CD8+ and CD4+ T cells are important components of the immune system, which has been shown to correlate strongly with cancers (Hung et al., 1998; Hadrup et al., 2013). The module contains multiple breast cancer related genes: (1) MST1 kinase, a core component of the Hippo pathway, whose phosphorylation can inhibit the oncoproteins TAZ/YAP and regulate T-cell function (Arash et al., 2017; Ercolani et al., 2017); and (2) CPT1B, which encodes the critical enzyme for fatty acid beta-oxidation (FAO); inhibition of FAO can inhibit breast cancer stem cells, chemoresistance, and breast tumor growth (Wang et al., 2018). In addition, tissue enrichment analysis using ARCHS4 (https://amp.pharm.mssm.edu/archs4/) revealed that nearly one third of the genes (11 out of 36) in this module were associated with breast cancer bulk tissue (p-value = 1.867E-03) (**Figure S4**).

*Experiments were performed separately for three age groups (26–50, 51–70, and 71–90), integrating all omics and clinical data (mRNA, miRNA, CNB, TMB, diagnosis age, ER status, PR status). Detailed feature rankings are in Figures S5–S7. The bold values are of interest and are discussed in the text.*

In module #47, two genes related to breast cancer have been identified: (1) MAP3K7, also known as TAK1, a key mediator between survival and cell death in TNF-α-mediated signaling (Totzke et al., 2017); and (2) CCNC, an important transcriptional regulator whose higher expression is associated with shorter relapse-free survival (RFS) and impacts the response to adjuvant therapy in breast cancer. Gene amplification of CCNC is also the most frequent type of genetic alteration in breast cancers (Broude et al., 2015). Module #47 was also enriched in cytoband chr6q.

In module #5, genes are highly enriched in tumor microenvironment (TME) related processes such as extracellular matrix (ECM), cell adhesion, and cell migration. Among them, DDR2 plays an indispensable role in a series of hypoxia-induced behaviors of breast cancer cells, such as migration, invasion, and epithelial-mesenchymal transition (EMT); activated DDR2 can promote the metastasis of breast cancer (Ren et al., 2014). In addition, overexpression of FLNA is associated with advanced stage, lymph node metastasis, and vascular or neural invasion of breast cancer (Feng et al., 2006), and FLNA also contributes to the development of breast cancer (Tian et al., 2013). Finally, TCF4 is an important transcription factor whose loss is related to breast cancer chemoresistance (Ruiz de Garibay et al., 2018).

In module #36, SNW1 is a component of the spliceosome in RNA splicing. Its deletion can induce apoptosis, so the inhibition of SNW1 or its associated proteins may be a novel therapeutic strategy for cancer treatment (Sato et al., 2015). Module #36 was also enriched in cytobands chr14q23-q24 and chr14q31-q32.

In module #51, TCP1 functions as a cytosolic chaperone in the biogenesis of tubulin (Yaffe et al., 1992) and has been shown to be associated with breast cancer (Bassiouni et al., 2016). Overexpression of HDAC2 correlates with the DNA-damage response and promotes tumor progression (Shan et al., 2017). Module #51 was also enriched on cytoband chr6q.

In addition to identifying breast cancer related genes, the enrichment analysis of the selected modules also revealed important biological functions. Modules #47 and #51 were enriched in chr6q. Not surprisingly, previous studies have identified frequent alterations at chr6q in archival breast cancer specimens (Shadeo and Lam, 2006), and chr6q21 is a hotspot region of copy number alteration (Chin et al., 2007). Copy number alterations at chr6q26 can affect MAP3K4, which plays an important role in the epidermal growth factor receptor pathway (Shadeo and Lam, 2006). Module #36 was enriched in chr14q, the cytoband where high-level alterations at 14q31.3-32.12 were found in breast cancer by Shadeo and Lam (2006). Besides, the deletion of chr14q is a common feature of tumors with BRCA2 mutations (Rouault et al., 2012). Module #5 was specifically associated with TME related biological processes such as extracellular matrix (ECM), cell adhesion, and cell migration. All these GO Biological Processes (BPs) have been shown to play pivotal roles in TME development in cancers, and the TME is now widely recognized as a critical participant in tumor progression (Quail and Joyce, 2013). Abnormal ECM in tumors can promote the aggressiveness of breast cancer (Robertson, 2016). Cell adhesion, as a common event in cancer, can promote cell growth as well as tumor dissemination (Moh and Shen, 2009; Saadatmand et al., 2013). All these discoveries not only confirmed the existing literature on breast cancer, but also justified the feature importance that SALMON generated.

Another interesting finding is that no miRNA-seq module was ranked among the top features, although miRNA-seq modules show a better prognosis performance than mRNA-seq modules. This could be because the modules within miRNA-seq are more dependent on each other than the modules within mRNA-seq, so simply knocking out one module/feature may not reduce the performance much. Indeed, by performing pairwise Pearson correlation analysis, we found that 3.03% of miRNA-seq modules have strong correlations (Pearson ρ > 0.8), while for mRNA-seq modules this ratio is only 0.94%. This offers a new perspective for inspecting module dependency in the future.
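The pairwise dependency check described above can be sketched as follows, assuming the percentage is taken over unordered pairs of module eigengenes. The function name and the simulated eigengene matrix are hypothetical.

```python
import numpy as np

def fraction_strongly_correlated(eigengenes, threshold=0.8):
    """eigengenes: array of shape (n_patients, n_modules)."""
    corr = np.corrcoef(eigengenes, rowvar=False)     # module-by-module Pearson matrix
    pairs = corr[np.triu_indices_from(corr, k=1)]    # each unordered pair once
    return float(np.mean(pairs > threshold))

# Simulated data: modules 0 and 1 share a signal, modules 2 and 3 are independent.
rng = np.random.default_rng(0)
shared = rng.normal(size=50)
eigengenes = np.column_stack([
    shared + 0.05 * rng.normal(size=50),
    shared + 0.05 * rng.normal(size=50),
    rng.normal(size=50),
    rng.normal(size=50),
])
fraction = fraction_strongly_correlated(eigengenes)
print(f"{100 * fraction:.2f}% of module pairs exceed the threshold")
```

With four modules there are six pairs, so a single strongly correlated pair gives one sixth; the same computation applied separately to miRNA-seq and mRNA-seq eigengenes would yield the two ratios being compared.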

Since we confirmed that diagnosis age is the most powerful predictor, we examined the feature rankings in three different age groups, namely a younger group (age 26–50), a middle-aged group (age 51–70), and an elderly group (age 71–90). We confirmed that after separating the 583 patients into three distinct age groups, diagnosis age becomes unimportant to the prognosis outcome. In the younger group, PR status is the most important feature, while in the middle-aged group, ER status is the most important feature. When we inspected the elderly group (age 71–90), we found that only mRNA-seq co-expression modules were ranked at the top, and the five most conspicuous ones are modules #11, #1, #29, #35, and #4. These observations suggest that specific biological processes may play different roles in breast cancer patients of different ages, and that different biomarkers and predictive models may be needed for different age groups. Further inspection found that three of these modules are related to known breast cancer related processes such as epithelium development (Vincent-Salomon and Thiery, 2003), chromosome organization (Muleris et al., 1995), and mitotic cell cycle (Kastan and Bartek, 2004), and include well-known breast cancer genes such as NCOA3, AURKA, MKI67, and FOXA1. The other two modules are highly enriched on specific cytobands on different chromosomes, implying potential copy number variations in these regions. Indeed, both cytobands (19q13.41 and 1q34) are known to be associated with breast cancer outcomes (Han et al., 2006; Ton et al., 2009). For module #35, while most of the genes are located on 1q34, many of them, such as UQCRH, PSMB2, PPIH, and YBX1, are involved in RNA processing and have been associated with breast cancer in multiple studies (Miller et al., 2005; Pujana et al., 2007; Barry et al., 2010). Interestingly, all genes identified from module #29 are zinc finger transcription factors. While it is not clear whether any of them are specifically related to breast cancer, it is of great interest to further investigate the roles of the ZNF family genes in breast cancer development.

#### CONCLUSION

We performed survival prognosis on breast cancer and proposed a Deep Learning-based algorithm, SALMON (Survival Analysis Learning with Multi-Omics Network), which integrates the Cox proportional hazards model, adopts gene co-expression network analysis results as input, and predicts patient hazard ratios precisely. Performance (concordance index and log-rank test p-value) improved when more omics data were integrated into the input of SALMON. SALMON also showed competitive performance compared to other Deep Learning survival prognosis models. By inspecting how each feature contributes to the hazard ratios, SALMON confirmed that certain mRNA-seq co-expression modules and clinical information play pivotal roles in breast cancer prognosis, and revealed several biological functions. By further stratifying patients by diagnosis age, SALMON confirmed that different age groups have different main features that control survival prognosis performance. To sum up, SALMON fuses gene co-expression network analysis, Deep Learning, feature selection, the Cox proportional hazards model, integrative analysis, and module-level enrichment analysis, and offers a new avenue for future integrative analysis and Deep Learning-based cancer survival prognosis.

#### DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the supplementary files.

#### AUTHOR CONTRIBUTIONS

ZHu conceived and designed the algorithm and analysis, conducted the experiments, and wrote the paper. XZ, KH, ZHu performed the biological analysis and wrote the paper. SX, JZ performed the biological analysis. TJ, CY, ZHa collected the data. TJ, BH, KH edited the paper. JZ, PS, MR, ZHa, KH provided the research guide. PS, ZHa, KH supervised this project.

### FUNDING

This work was partially supported by Indiana University School of Medicine (IUSM) start-up fund (JZ), the National Cancer Institute Informatics Technology for Cancer Research (NCI ITCR) U01 [CA188547] (KH, JZ), Indiana University Precision Health Initiative (KH, JZ, ZHu, ZHa, TJ, BH, CY), and Shenzhen Peacock Plan [KQTD2016053112051497] (XZ, SX).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00166/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Huang, Zhan, Xiang, Johnson, Helm, Yu, Zhang, Salama, Rizkalla, Han and Huang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Recent Advances of Deep Learning in Bioinformatics and Computational Biology

#### Binhua Tang<sup>1,2</sup>\*<sup>†</sup>, Zixiang Pan<sup>1†</sup>, Kang Yin<sup>1</sup> and Asif Khateeb<sup>1</sup>

<sup>1</sup> Epigenetics & Function Group, Hohai University, Nanjing, China, <sup>2</sup> School of Public Health, Shanghai Jiao Tong University, Shanghai, China

#### Edited by:

Juan Caballero, Universidad Autónoma de Querétaro, Mexico

#### Reviewed by:

Wenhai Zhang, Hengyang Normal University, China Zhuliang Yu, South China University of Technology, China

#### \*Correspondence:

Binhua Tang bh.tang@hhu.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 20 August 2018 Accepted: 27 February 2019 Published: 26 March 2019

#### Citation:

Tang B, Pan Z, Yin K and Khateeb A (2019) Recent Advances of Deep Learning in Bioinformatics and Computational Biology. Front. Genet. 10:214. doi: 10.3389/fgene.2019.00214

Extracting inherent valuable knowledge from omics big data remains a daunting problem in bioinformatics and computational biology. Deep learning, an emerging branch of machine learning, has exhibited unprecedented performance in quite a few applications from academia and industry. We highlight the differences and similarities among widely utilized models in deep learning studies by discussing their basic structures and reviewing their diverse applications and disadvantages. We anticipate that this work can serve as a meaningful perspective for further development of the theory, algorithms, and applications of deep learning in bioinformatics and computational biology.

Keywords: computational biology, bioinformatics, application, algorithm, deep learning

### INTRODUCTION

Deep learning is the emerging generation of artificial intelligence techniques, specifically within machine learning. The earliest artificial intelligence was first implemented on hardware systems in the 1950s. A newer concept with more systematic theorems, named machine learning, appeared in the 1960s. Its newly evolved branch, deep learning, was first brought up around the 2000s and soon led to rapid applications in different fields, owing to its unprecedented prediction performance on big data (Hinton and Salakhutdinov, 2006; LeCun et al., 2015; Nussinov, 2015).

The basic concepts and models in deep learning derive from the artificial neural network, which mimics the activity patterns of the human brain to make algorithms intelligent and save tedious human labor (Mnih et al., 2015; Schmidhuber, 2015; Mamoshina et al., 2016). Although deep learning is a recently emerged subfield of machine learning, it has found immense use in machine vision, voice and signal processing, sequence and text prediction, and computational biology, altogether shaping the productive AI fields (Bengio and LeCun, 2007; Alipanahi et al., 2015; Libbrecht and Noble, 2015; Zhang et al., 2016; Esteva et al., 2017; Ching et al., 2018). Deep learning has several implementation models, such as artificial neural network, deep structured learning, and hierarchical learning, which commonly apply a class of structured networks to infer the quantitative properties between responses and causes within a group of data (Ditzler et al., 2015; Liang et al., 2015; Xu J. et al., 2016; Giorgi and Bader, 2018).

The subsequent sections summarize the essential concepts and recent applications of deep learning, and highlight its key achievements and future directions, especially from the perspectives of bioinformatics and computational biology.


#### ESSENTIAL CONCEPTS IN DEEP NEURAL NETWORK

#### Basic Structure of Neural Network

A neural network is a class of information processing modules frequently utilized in machine learning. Within a multi-layer context, the basic building units, namely neurons, are connected to neurons in adjacent layers via internal links, but neurons belonging to the same layer have no connection, as depicted in **Figure 1**.

In **Figure 1**, each hidden layer processes its inputs via a connection function denoted as below,

$$h_{W,b}(X) = f(W^TX + b) \tag{1}$$

where W refers to the weight and b to the bias. When all input layer neurons are active, each input is multiplied by its respective weight, and the weighted sum is added to a bias and then fed into the adjacent hidden layer. Although the input-output formalization repeats similarly among hidden layers, there is usually no direct connection between neurons within the same layer. The activation function f(·) quantifies the connection between two neighboring neurons across two (hidden) layers.
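Equation (1) can be illustrated with a short NumPy sketch that chains two layers. The weights, the choice of tanh as f(·), and the layer sizes are arbitrary values chosen for illustration.

```python
import numpy as np

def dense_layer(X, W, b, f=np.tanh):
    """Compute h_{W,b}(X) = f(W^T X + b) for a single input vector X."""
    return f(W.T @ X + b)

X = np.array([1.0, 2.0, 3.0])                 # three input neurons
W1 = np.array([[0.1, -0.2],
               [0.0,  0.3],
               [0.2,  0.1]])                  # 3 inputs -> 2 hidden neurons
b1 = np.array([0.0, 0.1])
W2 = np.array([[1.0],
               [-1.0]])                       # 2 hidden neurons -> 1 output
b2 = np.array([0.5])

hidden = dense_layer(X, W1, b1)               # tanh([0.7, 0.8])
output = dense_layer(hidden, W2, b2, f=lambda z: z)  # linear output layer
print(hidden, output)
```

Each layer only sees the outputs of the previous layer, which reflects the absence of connections between neurons of the same layer.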

Specifically, the input of the activation function is the combination W<sup>T</sup>X + b denoted in Equation (1), and the function output is then fed into the next neuron as a new input. Following the connection formula, the former input features can be passed to the next layer; by this means the features can be well extracted and refined further. The performance of the feature extraction depends significantly on the selection of the activation function.

FIGURE 1 | The network structure of a deep learning model. Here we select a network structure with two hidden layers as an illustration, where the X nodes constitute the input layer, the H nodes the hidden layers, and the Y nodes the output layer, and f(·) denotes an activation function.

Before training the network structure, the raw input datasets are usually separated into two or three groups, namely a training set and a test set, and sometimes a validation set to examine the performance of previously trained network models, as depicted in **Figure 2**. In practice, the original datasets are split at random to avoid potential local bias, but the proportion of each set can be determined manually.
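A random split with manually chosen proportions might look like the sketch below. The 70/15/15 proportions and the function name are assumptions made for illustration.

```python
import numpy as np

def split_dataset(n_samples, fractions=(0.7, 0.15, 0.15), seed=0):
    """Return shuffled index arrays for the training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # random order avoids local bias
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]            # the remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_dataset(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, so that later comparisons between models are made on the same held-out samples.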

### Learning by Training, Validation, and Testing

Normally, training a neural network refers to a process in which the network self-tunes its parameters or weights to meet prespecified performance criteria, so that the trained model can be further used for regression or classification. As depicted in **Figure 2**, a complete dataset collected beforehand from a specific experiment is generally split into training and testing sets, and sometimes a validation set, followed by conventional tasks such as model training, validation, and performance comparison.

During training with initial batches of data samples, model parameters and their characteristics can normally be tuned by various learning paradigms, including appropriate activation and rectification functions. The trained network should then be further tested, or even validated, with another batch of samples to acquire high robustness and satisfactory predictability; these processes are often referred to as model testing and validation.

Usually, the three procedures above are faithfully implemented in conventional machine learning studies, and the same paradigm is observed in its quickly evolving subfield, deep learning (LeCun et al., 2015; Schmidhuber, 2015).

## Activation and Loss Function

After training is completed, the neural network can perform a regression or classification task on testing data, although there usually exists a difference between the predicted outputs and the actual values. This difference should be minimized to acquire optimal model performance.

Within a certain layer, error reduction requires scaling a neuron's output back within a preset range before passing it on to the next layer of neurons. Activation is herein defined as controlling neurons' outputs in an "active" or "inactive" status, using non-linear functions such as the rectified linear unit (ReLU), tanh, and logistic (sigmoid or soft step) functions (LeCun et al., 2015).
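The three activation functions named above can be written in a few lines. This sketch only shows how each one maps raw values into its output range; the sample inputs are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # zero for negative inputs, identity otherwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes values into (0, 1)

z = np.array([-2.0, 0.0, 2.0])
relu_out = relu(z)
tanh_out = np.tanh(z)                  # squashes values into (-1, 1)
sigmoid_out = sigmoid(z)
print(relu_out, tanh_out, sigmoid_out)
```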

Besides, a loss function measures the total difference between the predicted and actual values, which is reduced through fine-tuning in the backpropagation process. It also acts as an ending threshold for parameter optimization by means of iteratively evaluating the trained models.

With an activation function in each neuron throughout diverse layers, a training procedure continues searching the whole hyperparameter space until the ending threshold is reached, comparing candidates and detecting an optimal parameter combination by minimizing the preset loss function.
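The role of the loss as both an optimization target and an ending threshold can be shown with a deliberately small example: a one-variable linear model fitted by gradient descent on a mean squared error loss. The learning rate, tolerance, and data are illustrative assumptions, and a real network would obtain its gradients through backpropagation rather than from the closed-form expressions used here.

```python
import numpy as np

def train_linear(x, y, lr=0.1, tol=1e-6, max_iter=10_000):
    """Fit y ~ w * x + b by gradient descent, stopping once the loss is below tol."""
    w, b = 0.0, 0.0
    for step in range(max_iter):
        err = (w * x + b) - y
        loss = float(np.mean(err ** 2))       # mean squared error
        if loss < tol:                        # ending threshold reached
            break
        w -= lr * 2.0 * np.mean(err * x)      # gradient of the loss w.r.t. w
        b -= lr * 2.0 * np.mean(err)          # gradient of the loss w.r.t. b
    return w, b, loss, step

x = np.linspace(-1.0, 1.0, 21)
y = 3.0 * x + 1.0                             # noise-free target
w, b, loss, steps = train_linear(x, y)
print(w, b, loss, steps)
```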

### TYPICAL ALGORITHMS AND APPLICATIONS

With the substantial progress in advanced computation and Graphics Processing Unit (GPU) technologies, systematic interrogation of massive data to understand its inherent mechanisms becomes possible, especially through deep learning approaches. Hereinafter, we illustrate several models frequently utilized in the deep learning literature, covering both recent computational theories and diverse applications.

#### Recurrent Neural Network

Recurrent Neural Network (RNN) is a deep learning model that differs from traditional neural networks in that it can integrate previously learned status through a recurrent approach, namely backpropagation, whereas a traditional neural network usually outputs its prediction based on the status of the current layer.

Compared with traditional network models, an RNN has only one hidden layer, but it can be unfolded horizontally, so that multiple vertical groups can utilize most of the previous results, namely "using memory".

As depicted in **Figure 3**, the hidden layer neuron H<sub>n</sub> is defined by Equation (2),

$$H_n = \sigma_1(W_{1,n}^T H_{n-1} + W_{2,n}^T X_n + b_{1,n}) \tag{2}$$

where W<sub>1,n</sub> and W<sub>2,n</sub> represent weight matrices, b<sub>1,n</sub> is a bias matrix, and σ<sub>1</sub>(·) (usually tanh(·)) is an activation function. Thus, each layer will generate a part of the output from the current hidden layer neuron with a weight matrix W<sub>3,n</sub> and bias b<sub>2,n</sub>, defined by Equation (3),

$$\hat{Y}_n = \sigma_2(W_{3,n}H_n + b_{2,n}) \tag{3}$$

And the total loss L<sub>total</sub> will be the sum of the loss functions from each hidden layer, defined as below,

$$L_{\text{total}} = \sum_{n=1}^{N} L_n = \sum_{n=1}^{N} L(\hat{Y}_n, Y_n) \tag{4}$$

Thus, fine-tuning of an RNN by backpropagation is based on three weights, W<sub>1,n</sub>, W<sub>2,n</sub>, and W<sub>3,n</sub>. Since the multi-parameter setting in weights adds to the optimization burden, RNN usually performs worse than the Convolutional Neural Network (CNN) in terms of fine-tuning. It is, however, frequently combined with CNN in diverse applications, such as dimension reduction and image and video processing (Hinton and Salakhutdinov, 2006; Hu and Lu, 2018). Angermueller et al. proposed an ensemble RNN-CNN architecture, DeepCpG, for single-cell DNA methylation data, to better predict missing CpG status for genome-wide analysis; the model's interpretable parameters also shed light on the connection between sequence composition and methylation variability (Angermueller et al., 2017). Section Convolutional Neural Network will specifically discuss CNN and its typical applications.
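Equations (2) to (4) can be traced with a short forward pass. The sketch assumes that the three weight matrices are shared across time steps (a common simplification of the step-indexed notation), that the output activation is the identity, and that each per-step loss is a mean squared error; the dimensions and random data are arbitrary.

```python
import numpy as np

def rnn_forward(X_seq, Y_seq, W1, W2, W3, b1, b2):
    H = np.zeros(W1.shape[0])                      # initial hidden state H_0
    outputs, total_loss = [], 0.0
    for X_n, Y_n in zip(X_seq, Y_seq):
        H = np.tanh(W1.T @ H + W2.T @ X_n + b1)    # Equation (2), sigma_1 = tanh
        Y_hat = W3 @ H + b2                        # Equation (3), identity sigma_2
        total_loss += float(np.mean((Y_hat - Y_n) ** 2))  # one term L_n of Equation (4)
        outputs.append(Y_hat)
    return np.array(outputs), total_loss

rng = np.random.default_rng(0)
d_x, d_h, d_y, n_steps = 3, 4, 2, 5
W1 = rng.normal(0.0, 0.5, size=(d_h, d_h))         # hidden-to-hidden weights
W2 = rng.normal(0.0, 0.5, size=(d_x, d_h))         # input-to-hidden weights
W3 = rng.normal(0.0, 0.5, size=(d_y, d_h))         # hidden-to-output weights
b1, b2 = np.zeros(d_h), np.zeros(d_y)
X_seq = rng.normal(size=(n_steps, d_x))
Y_seq = rng.normal(size=(n_steps, d_y))

outputs, total_loss = rnn_forward(X_seq, Y_seq, W1, W2, W3, b1, b2)
print(outputs.shape, total_loss)
```

The hidden state carried from one step to the next is what the text refers to as "using memory".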

Moreover, RNN outperforms conventional models such as logistic regression and SVM, and it can be implemented in various environments and accelerated by GPUs (Li et al., 2017). Due to its structural characteristics, RNN is suitable for dealing with long and sequential data, such as DNA array and genomics sequence data (Pan et al., 2008; Ray et al., 2009; Jolma et al., 2013; Lee and Young, 2013; Alipanahi et al., 2015; Xu T. et al., 2016).

However, an RNN cannot interact with hidden neurons far from the current one. To construct an efficient framework for recalling deep memory, many improved algorithms have been proposed, like BRNN in protein secondary structure prediction (Baldi et al., 1999), and MD-RNN in analyzing electron microscopy and MRIs of breast cancer samples (Kim et al., 2018).

LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are two improved derivatives of RNN designed to solve the long-term dependence issue. GRU shares a similar structure with LSTM, which has several gates used for modeling its memory center. The current memory output is jointly influenced by the current input feature, the context (namely the past influence), and the inner action toward the input, as shown in **Figure 4**.

In **Figure 4**, the yellow track refers to an input gate transferring its total past features, and it is accessible for any new feature to be added. The green track is a mixture of an input gate and its former hidden layer neurons; it decides what to omit, namely by resetting the activation function close to 0, and what to update into the yellow track. The blue track is the output gate integrating the inner influence from the yellow track; it decides the output of the current hidden neurons and what is passed to the next hidden neuron.
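For readers who prefer code to a diagram, one step of a standard LSTM cell is sketched below. It follows the common textbook formulation (forget, input, and output gates plus a candidate memory) and is not a transcription of the colour-coded tracks in Figure 4; all parameter names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    z = np.concatenate([h_prev, x])                # context plus current input
    forget = sigmoid(p["Wf"] @ z + p["bf"])        # what to drop from the memory
    write = sigmoid(p["Wi"] @ z + p["bi"])         # how much new content to add
    candidate = np.tanh(p["Wc"] @ z + p["bc"])     # proposed new memory content
    expose = sigmoid(p["Wo"] @ z + p["bo"])        # what to reveal as output
    c = forget * c_prev + write * candidate        # updated memory (cell state)
    h = expose * np.tanh(c)                        # output passed to the next step
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
p = {}
for gate in "fico":                                # forget, input, candidate, output
    p["W" + gate] = rng.normal(0.0, 0.5, size=(d_h, d_h + d_x))
    p["b" + gate] = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(6, d_x)):                # a sequence of six inputs
    h, c = lstm_step(x, h, c, p)
print(h, c)
```

Because the memory is updated additively, information can persist over many steps, which is how gated units address long-term dependence.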

Recently an attention-based architecture, DeepDiff, utilizes a hierarchy of LSTM modules to characterize how various histone modifications cooperate simultaneously, and it can effectively predict cell-type-specific gene expression (Sekhon et al., 2018).

#### Convolutional Neural Network

Convolutional neural networks (CNN or ConvNet) are suitable for processing information in the form of multiple arrays (LeCun et al., 2015; Esteva et al., 2017; Hu and Lu, 2018). Reducing the number of parameters without compromising learning capacity is the general design principle of CNN (LeCun et al., 2015; Krizhevsky et al., 2017). Each convolution kernel's parameters in a CNN are trained by the backpropagation algorithm.

Especially in image-related applications, CNN can cope with pixel scanning and processing, and thus greatly accelerates the implementation of optimized algorithms in practice (Esteva et al., 2017; Quang et al., 2018). Structurally, a CNN consists of linear convolution operations, followed by nonlinear activators, pooling layers, and a deep neural network classifier, as depicted in **Figure 5**.

In **Figure 5**, several filters are applied to convolve an input image, and the output is subsampled as a new input into the next layer; the convolution and subsampling processes are repeated until high-level features, namely shapes, can be extracted. The more layers a CNN model has, the higher-level features it will extract.

In feature learning, the convolution operation scans a 2D image with a given pattern and calculates the matching degree at each step; pooling then identifies the presence of the pattern in the scanned region (Angermueller et al., 2016). An activation function defines a neuron's output based on a set of given inputs: the weighted sum of inputs is passed through the activation function for non-linear transformation. In its simplest form, an activation function returns a binary output, 0 or 1: when a neuron's accumulation exceeds a preset threshold, the neuron is activated and passes its information to the next layers; otherwise, the neuron is deactivated. Sigmoid, tanh, ReLU, leaky ReLU, and softmax are the commonly used activation functions (LeCun et al., 2015; Schmidhuber, 2015).
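The scan-and-match behaviour of convolution followed by pooling can be seen on a tiny image. In this sketch the kernel is a hand-written vertical-edge detector and the image is a 6 × 6 array with an edge in the middle; as in most deep learning libraries, the "convolution" is implemented as a sliding dot product without flipping the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and record the matching degree at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep the strongest response in each non-overlapping size x size block."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                                 # dark left half, bright right half
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)          # responds to left-to-right increases

feature_map = np.maximum(0.0, conv2d_valid(image, kernel))   # convolution + ReLU
pooled = max_pool(feature_map)
print(feature_map)
print(pooled)
```

The feature map is large only where the window straddles the edge, and pooling keeps that evidence while halving each spatial dimension.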

After the pooling layers, pixels are flattened into a single column vector. The vectorized and concatenated pixel information is fed into dense layers, known as fully connected layers, for further classification. The fully connected layer renders the final decision, where the CNN returns a probability that an object in the image belongs to a specific type.
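The final classification stage can be sketched as flattening followed by one dense layer and a softmax. The pooled values, the three output classes, and the random weights are placeholders chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
pooled = np.array([[3.0, 0.0],
                   [1.0, 2.0]])        # hypothetical output of a pooling layer
flat = pooled.reshape(-1)              # stretch the 2 x 2 map into a vector of length 4

W = rng.normal(0.0, 0.5, size=(3, flat.size))   # dense layer for three classes
b = np.zeros(3)
probs = softmax(W @ flat + b)          # one probability per class
predicted_class = int(np.argmax(probs))
print(probs, predicted_class)
```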

Following the fully-connected layer is a loss layer, which guides the adjustment of weights across the network. A loss function is used to measure model performance as the inconsistency between the actual and predicted values; model performance improves as the loss decreases. For an output vector y<sub>i</sub> and an input x = (x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>n</sub>), the mapping loss function L(·) between x and y is defined as,

$$L(y_i, \hat{y}_i) = \frac{1}{n} \sum_{i=1, j=1}^{n,k} \phi[y_i, f(x_i, \sigma_i, \omega_{ij}, b_i)] \tag{5}$$

where ϕ denotes an empirical risk for each output, ŷ<sub>i</sub> the i-th prediction, n the total number of training samples, k the count of the weights ω<sub>ij</sub>, and b<sub>i</sub> the bias for the activation function σ<sub>i</sub>.
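As a concrete instance of averaging a per-sample risk over the training samples, the sketch below uses cross-entropy as the risk ϕ. The choice of cross-entropy, the one-hot labels, and the predicted probabilities are illustrative assumptions; Equation (5) itself does not fix a particular risk.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Risk of one prediction against a one-hot label.
    return -float(np.sum(y_true * np.log(y_pred + eps)))

def empirical_risk(Y_true, Y_pred, phi):
    # Average of the per-sample risk phi over all n samples.
    return float(np.mean([phi(y, p) for y, p in zip(Y_true, Y_pred)]))

Y_true = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
Y_pred = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
risk = empirical_risk(Y_true, Y_pred, cross_entropy)
print(risk)
```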

Recently, CNN has been adopted rapidly in biomedical imaging studies for its outstanding performance in computer vision and concurrent computation with GPUs (Ravi et al., 2017). Usually, a convolution-pooling structure can better learn imaging features from CT scans and MRI images for head trauma, stroke diagnosis, and brain EPV (enlarged perivascular space) detection (Chilamkurthy et al., 2018; Dubost et al., 2019).

In recent computational biology, a discriminative CNN framework, DeepChrome, was proposed to predict gene expression by feature extraction from histone modifications. The deep learning model outperforms traditional Random Forests and SVM on 56 cell types from the REMC database (Singh et al., 2016).

Furthermore, CNN can be combined with other deep learning models, such as RNN, to predict imaging content, where the CNN encodes an image and the RNN generates the corresponding image description (Angermueller et al., 2016). To date, quite a few variants of CNN have also been proposed in diverse classification applications, like AlexNet with GPU support and DQN in reinforcement learning (Mnih et al., 2015).

#### Autoencoder

The autoencoder is another typical artificial neural network, trained in an unsupervised manner and designed to precisely extract coding or representation features using data-driven learning (Min et al., 2017; Zeng et al., 2017; Yang et al., 2018). For high-dimensional data, it is time-consuming and often infeasible to load all raw data into a network, so dimension reduction or compression is a necessity in the preprocessing of raw data.

An autoencoder compresses and encodes information from the input layer into a short code and then, after specific processing, decodes it into an output that closely matches the original input. **Figure 6** illustrates its basic model structure and processing steps.

Convolution and pooling are the two major steps in the encoder, depicted in **Figure 6B**, while the decoder performs the two opposite steps, namely unpooling and deconvolution, shown in **Figure 6C**. Convolution and pooling both compress data while preserving the most representative features, but in different ways. Convolution scans the data continuously with a rectangular window, for example of size 3 × 3; after each convolution operation, the window moves to the next position (pixel), replacing the oldest elements with new ones. After the whole input has been scanned and convolved, pooling is applied to further compress redundant information.
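The two encoder steps can be illustrated with a minimal sketch in plain Python. The toy 6 × 6 input, the averaging kernel, the stride of 1 without padding, and the 2 × 2 max pooling are all assumptions chosen for illustration; they are not taken from Figure 6.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products.

    As in most CNN code, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled


# Hypothetical toy data: a 6 x 6 "image" and a 3 x 3 averaging window.
image = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]

features = conv2d_valid(image, mean_kernel)  # 4 x 4 feature map
compressed = max_pool2d(features, size=2)    # 2 x 2 map after pooling
```

In this sketch the 36 input values are reduced to 16 by the convolution and to 4 by the pooling, which is the compression effect described above.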

To some extent, an autoencoder resembles traditional PCA in dimension reduction, but it is more robust and effective in extracting data features because of the non-linear transformations in its hidden layers. Given an input x, the model extracts its main features and generates x̂ = Wb, where W and b denote the weight and bias vectors, respectively. In general, the output cannot fit the input exactly, and the discrepancy is measured with a mean squared error (MSE) loss function, defined in Equation (6),

$$L\left(W, b\right) = \frac{1}{m} \sum\_{i=1}^{m} \left(\hat{\mathbf{x}} - \mathbf{x}\right)^{2} \tag{6}$$

Thus, the learning process minimizes the loss L through iterative optimization.

Recently, the sparse autoencoder (SAE) has been frequently discussed for its strong performance in dimension reduction and in denoising corrupted data. The loss function of the SAE is defined in Equation (7),

$$L\_{SAE} = L\left(W, b\right) + \beta \sum\_{k} KL\left(\rho || \widehat{\rho\_k}\right) \tag{7}$$

where KL refers to the KL-divergence in Equation (8), ρ is the target activation level of the neurons, usually set to 0.05 for sigmoid activation so that most neurons are inactive, ρ̂<sub>k</sub> is the average activation level of neuron k, and β is the regularization coefficient.

$$\text{KL}(\rho \| \widehat{\rho}\_k) = \rho \log \frac{\rho}{\widehat{\rho}\_k} + (1 - \rho) \log \frac{1 - \rho}{1 - \widehat{\rho}\_k} \tag{8}$$

where ρ̂<sub>k</sub> represents the average activation level of neuron k over the m test samples, as given in Equation (9), and x<sup>(i)</sup> is the i-th test sample.

$$
\widehat{\rho\_k} = \frac{1}{m} \sum\_{i=1}^{m} \left[ a\_k(\boldsymbol{x}^{(i)}) \right] \tag{9}
$$
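As a rough numerical illustration of Equations (6) to (9), the sketch below evaluates the SAE loss for given reconstructions and hidden activations. It assumes that each sample is a list of numbers, that the hidden activations lie strictly between 0 and 1 (as with a sigmoid), and that β = 3.0; the value of β and all function names are illustrative assumptions, not taken from the cited works.

```python
import math


def mse_loss(x_hat, x):
    """Equation (6): squared reconstruction error averaged over the m samples."""
    m = len(x)
    return sum(
        sum((xh_j - x_j) ** 2 for xh_j, x_j in zip(xh, xs))
        for xh, xs in zip(x_hat, x)
    ) / m


def mean_activation(activations):
    """Equation (9): average activation of each hidden neuron over the m samples."""
    m = len(activations)
    n_hidden = len(activations[0])
    return [sum(a[k] for a in activations) / m for k in range(n_hidden)]


def kl_bernoulli(rho, rho_hat_k):
    """Equation (8): KL divergence between the target and the observed activation."""
    return (rho * math.log(rho / rho_hat_k)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat_k)))


def sae_loss(x_hat, x, activations, rho=0.05, beta=3.0):
    """Equation (7): reconstruction loss plus the weighted sparsity penalty."""
    penalty = sum(kl_bernoulli(rho, r) for r in mean_activation(activations))
    return mse_loss(x_hat, x) + beta * penalty
```

The penalty vanishes when every neuron's average activation equals ρ and grows as neurons become more active on average, which is how the extra term encourages sparse codes.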

For high-dimensional data, multiple autoencoders can be stacked to form a deep autoencoder (Hinton and Salakhutdinov, 2006). Because it relies on gradient-based learning with backpropagation, this architecture may suffer from vanishing gradients; current remedies include ReLU activation and dropout (Szegedy et al., 2015; Krizhevsky et al., 2017). During configuration and pretraining, the model weights can be obtained by greedy layer-wise training, after which the network can be fine-tuned with the backpropagation algorithm.

Many variations of the autoencoder have been proposed recently, such as the sparse autoencoder (SAE) and the denoising autoencoder (DAE). For example, a stacked sparse autoencoder (SSAE) was proposed to analyze high-resolution histopathological images in breast cancer (Xu J. et al., 2016). Using an SAE with three iterations, Heffernan et al. reported the successful prediction of protein secondary structure, local backbone angles, and solvent accessible surface area (Heffernan et al., 2015). Miotto et al. introduced a stack of DAEs to predict features from large-scale electronic health records (EHR) via an unsupervised representation approach (Miotto et al., 2016). Ithapu et al. proposed a randomized denoising autoencoder marker (rDAm) to predict future cognitive and neural decline in Alzheimer's disease, with performance surpassing that of existing methods (Ithapu et al., 2015).

### Deep Belief Network

As a generative graphical model, a Deep Belief Network (DBN) is composed of multiple Restricted Boltzmann Machines (RBMs) or autoencoders stacked on top of each other, where each hidden layer of a subnetwork serves as the visible layer for the next one (Hinton et al., 2006). The main network structures of the RBM and DBN are depicted in **Figure 7**, which shows how the two network models are related.

A DBN is trained layer by layer with an unsupervised greedy approach that initializes the network weights of each layer separately; it can then be fine-tuned with the wake-sleep or backpropagation algorithm. When traditional backpropagation is used for fine-tuning, however, a DBN may encounter several problems: (1) labeled data are required for training; (2) the learning rate is low; and (3) inappropriate parameters tend to lead to local optima.

In recent applications, Plis et al. classified schizophrenia patients from brain MRIs with a DBN (Plis et al., 2014). In drug design based on high-throughput screening, a DBN was used to perform a quantitative structure-activity relationship (QSAR) study, and the results showed that optimized parameter initialization greatly improves the capability of a DNN to provide high-quality model predictions (Ghasemi et al., 2018). A DBN was also used to study the combination of resting-state fMRI (rsfMRI), gray matter, and white matter data by exploiting latent, abstract high-level features (Akhavan Aghdam et al., 2018). Meanwhile, a comparison involving DBN and CNN indicated that deep learning yields better discriminative results and holds promise for medical image diagnosis (Hua et al., 2015).

### Transfer Learning in Deep Learning

Besides the deep learning models above, transfer learning is frequently used in cases where labeling information or dimensionality is insufficient (Pan and Yang, 2010). Although conceptually it does not belong to deep learning, transfer learning has attracted increasing attention in the deep learning field because high-level semantic classification in deep neural networks is transferable (O'Shea et al., 2013; Anthimopoulos et al., 2016).

In quite a few deep learning studies, transfer learning enables a previously trained model to transfer its optimized parameters to a new model, thereby passing on knowledge and avoiding repeated training from scratch, as depicted in **Figure 8**.

Normally, the source and target domains share a certain statistical relationship or similarity, which directly affects transferability. The domain contains the original dataset, for example an image matrix, and the task refers to a certain process, such as classification or pattern recognition. Transfer learning transfers not only parameters such as weights, but also a condensed, small-size matrix derived from the original data domain, which is called knowledge distillation.

FIGURE 8 | Schematic illustration of transfer learning. Given a source domain and its learning task, together with a target domain and its respective task, transfer learning aims to improve the learning of the target prediction function using the knowledge in the source domain and its task.

Knowledge distillation usually uses both "hard targets" and "soft targets" to train the model and obtain lower information entropy. The softmax function below is usually used to soften the targets,

$$f(\alpha\_k) = \frac{e^{\frac{\alpha\_k}{T}}}{\sum\_j e^{\frac{\alpha\_j}{T}}} \tag{10}$$

where the logit α<sub>k</sub> is the input, f(·) produces the soft target and can offer smaller gradient variance, and k denotes the k-th segmented data slice. The parameter T is called the temperature: the larger T is, the softer the target.
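The effect of the temperature in Equation (10) can be checked with a small sketch. The example logits and the two temperatures are hypothetical values chosen for illustration; subtracting the maximum before exponentiation is a standard numerical safeguard that does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Equation (10): softmax of the logits divided by the temperature T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [a / temperature for a in logits]
    shift = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]  # hypothetical inputs
hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=5.0)
```

With T = 5 the largest probability is smaller and the remaining classes receive more mass than with T = 1, while the ranking of the classes is unchanged.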

Furthermore, transfer learning is categorized into instance-based, feature-based, parameter-based, and relation-based approaches, depicted in **Figure 9**. Transfer learning is currently much discussed in the deep learning field for its broad applicability and performance. Combined with a CNN, transfer learning can attain better prediction performance on CT scans of interstitial lung disease (Anthimopoulos et al., 2016). It was also used as a bridge between a multi-layer LSTM and a conditional random field (CRF), and the results showed that the LSTM-CRF approach outperformed the baseline methods on the target datasets (Giorgi and Bader, 2018).

### CONCLUSIONS

In this work, we summarized the basic but essential concepts and methods of deep learning, together with its recent applications in diverse biomedical studies. By reviewing typical deep learning models such as RNN, CNN, autoencoder, and DBN, we highlight that the specific application scenario or context, such as data features and model applicability, is the prominent factor in designing a suitable deep learning approach to extract knowledge from data; thus, deciphering and characterizing data features remains a non-trivial task in the deep learning workflow. In recent deep learning studies, the many derivatives of classic network models, including those described above, show that model selection affects the effectiveness of a deep learning application.

Secondly, regarding its limitations and directions for further improvement, we should revisit the nature of the method: deep learning is essentially a continuous manifold transformation among diverse vector spaces, but quite a few tasks cannot be converted into a deep learning model, or into a learnable form, because of the complex geometric transformations involved. Moreover, deep learning is generally a big-data-driven technique, which distinguishes it from conventional statistical learning or Bayesian approaches. Thus, integrating or embedding deep learning with other conventional algorithms is a new direction for tackling such complicated tasks.

Thirdly, deep learning depends on innovation in computational algorithms and hardware. As an inference technique driven by big data, deep learning demands high-performance parallel computation facilities; together with more algorithmic breakthroughs and the fast accumulation of diverse perceptual data, it is achieving pervasive success in many fields and applications. Bioinformatics and computational biology, a typical data-oriented field, has in particular witnessed remarkable changes in its research methods.

Finally, given the unprecedented innovation and success achieved with deep learning in diverse subfields, some have even argued that deep learning could bring about another wave like the internet. In the long term, deep learning is shaping the future of our lives and societies. However, deep learning should not be misinterpreted or overestimated in either academia or the AI industry, and it still has many technical problems to solve because of its nature. In all, we anticipate that this review will provide a meaningful perspective to help researchers gain comprehensive knowledge and make further progress in this fast-developing field.

#### AUTHOR CONTRIBUTIONS

BT conceived the study. ZP, KY, AK, and BT drafted the application sections and revised and approved the final manuscript.

#### FUNDING

This work was supported by the Natural Science Foundation of Jiangsu, China (BE2016655 and BK20161196), and the Fundamental Research Funds for China Central Universities (2019B22414). This work made use of the resources supported by the NSFC-Guangdong Mutual Funds for Super Computing Program (2nd Phase), and the Open Cloud Consortium sponsored project resource, supported in part by grants from Gordon and Betty Moore Foundation and the National Science Foundation (USA) and major contributions from OCC members.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tang, Pan, Yin and Khateeb. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Autoencoder Based Feature Selection Method for Classification of Anticancer Drug Response

Xiaolu Xu<sup>1</sup> , Hong Gu<sup>1</sup> , Yang Wang<sup>2</sup> , Jia Wang<sup>3</sup> \* and Pan Qin<sup>1</sup> \*

*<sup>1</sup> Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, China, <sup>2</sup> Institute of Cancer Stem Cell, Dalian Medical University, Dalian, China, <sup>3</sup> Department of Breast Surgery, Institute of Breast Disease, Second Hospital of Dalian Medical University, Dalian, China*

Anticancer drug responses can vary among individual patients. This difference is mainly caused by genetic factors, such as mutations and RNA expression. Thus, these genetic features are often used to construct classification models to predict the drug response. This research focuses on feature selection for such classification models. Because the feature space for predicting drug response is of vast dimension, an autoencoder network was first built, and a subset of inputs with important contributions was selected. Then, using the Boruta algorithm, a further reduced set of features was determined for the random forest, which was used to predict drug response. Two datasets, GDSC and CCLE, were used to illustrate the efficiency of the proposed method.

#### Edited by:

*Binhua Tang, Hohai University, China*

#### Reviewed by:

*Sandeep Kumar Dhanda, La Jolla Institute for Immunology (LJI), United States*

*Firoz Ahmed, Jeddah University, Saudi Arabia*

#### \*Correspondence:

*Jia Wang wangjia77@hotmail.com*

*Pan Qin qp112cn@dlut.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *28 October 2018*; Accepted: *04 March 2019*; Published: *27 March 2019*

#### Citation:

*Xu X, Gu H, Wang Y, Wang J and Qin P (2019) Autoencoder Based Feature Selection Method for Classification of Anticancer Drug Response. Front. Genet. 10:233. doi: 10.3389/fgene.2019.00233*

Keywords: anticancer drug response, autoencoder, classification model, feature selection, random forest

## 1. INTRODUCTION

The prediction of drug responses for individual patients is an essential issue in the research of precision medicine. It is known that the drug response for various patients can be different (Wilkinson, 2005). Thus, there are different therapeutic effects when using the same anticancer drug for a cohort of patients (Dong et al., 2015). It has been suggested that the patients with similar response to an anticancer drug can have similar genetic features, like gene mutations and expressions (Wang et al., 2017). These features can be used as the biomarkers to predict the drug response (La Thangue and Kerr, 2011).

Because clinical trials are costly in both time and money, researchers prefer to use cell lines obtained from cancer patients to investigate drug responses. These investigations have led to several drug response databases, such as Genomics of Drug Sensitivity in Cancer (GDSC) (Yang et al., 2012) and Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012). These databases make it feasible to construct models for the prediction of drug response. Researchers commonly use IC50 (Barretina et al., 2012; Garnett et al., 2012), which indicates the concentration required for 50% inhibition in vitro, to measure the sensitivity of drug response. Taking IC50 as the dependent variable, linear regression models, including ridge regression, lasso, and elastic net, were developed to predict drug response (Barretina et al., 2012; Garnett et al., 2012; Basu et al., 2013; Iorio et al., 2016). More complex models, such as support vector regression, artificial neural networks, and random forest (RF), were also constructed for this purpose (Riddick et al., 2010; Menden et al., 2013; Ammad-Ud-Din et al., 2014; Ammad-ud din et al., 2016; Costello et al., 2014; Ospina et al., 2014; Cichonska et al., 2015; Dong et al., 2015; Zhang et al., 2015). Neto et al. (2014) proposed the STREAM algorithm, which combines a Bayesian inference strategy with ridge regression for the prediction of drug response. Besides the regression models, several network-based models were also proposed (Wang et al., 2014; Fey et al., 2015; Zhang et al., 2015). Model ensembles have also been considered in some works (Wan and Pal, 2014; Cortés-Ciriano et al., 2015). Meanwhile, deciding whether or not an individual patient is sensitive to an anticancer drug is meaningful for treatment. By setting a proper threshold value for IC50, drug response can be divided into two categories: sensitivity and non-sensitivity. In this case, classification models can be fitted for predicting drug response. To this end, the recommender system, naive Bayes classifier, and support vector machine have been used (Barretina et al., 2012; Dong et al., 2015; Suphavilai et al., 2018).

Nilsson et al. (2007) indicated that an appropriately selected small feature set gives the best possible classification results. Thus, selecting an appropriate feature set from a large number of genetic feature candidates is crucial for classification models that predict drug response. In this paper, we developed a drug response prediction model, called AutoBorutaRF, which uses an autoencoder (Liou et al., 2008) and the Boruta algorithm (Kursa et al., 2010) for feature selection and RF for classification. We first constructed an autoencoder network (Liou et al., 2008), a type of artificial neural network, to reduce the genetic features, using the Gedeon method (Gedeon, 1997) for an initial reduction of the total number of features. We then selected a smaller feature set feasible for RF by using the Boruta algorithm. By applying AutoBorutaRF to GDSC and CCLE, we showed that the proposed method achieves high prediction accuracy. We further analyzed the biomarkers obtained from the lung cell lines in GDSC by the proposed feature selection method.

## 2. MATERIALS AND METHODS

#### 2.1. Datasets and Preprocessing

In this research, we used two datasets, GDSC (Garnett et al., 2012) and CCLE (Barretina et al., 2012). The datasets were downloaded with the R package PharmacoGx (Smirnov et al., 2015). We used the sensitivity measure IC50 (Barretina et al., 2012; Garnett et al., 2012) as the response variable (denoted by y<sub>rs,c</sub>) for cell line c. We used three types of genetic features as the explanatory variables: the gene expression (denoted by **x**<sub>rna,g</sub>), the single-nucleotide mutation (denoted by **x**<sub>snv,g</sub>), and the copy number alteration (denoted by **x**<sub>cna,g</sub>) for gene g. Note that the elements of **x**<sub>rna,g</sub> and **x**<sub>cna,g</sub> are real-valued, whereas the elements of **x**<sub>snv,g</sub> are binary-valued, i.e., "1" for mutation and "0" for wild type. In the two datasets, some cell lines lacked values for the response variable, the single-nucleotide mutation features, or the copy number alteration features. There were no missing values in the gene expression features. We first removed the features whose values were missing in more than 50% of the cell lines. Then, we removed the cell lines in which more than 50% of the feature values were missing. For the remaining cell lines with missing values, we used a weighted mean method to compensate for the missing values as follows:

1. Let z<sup>∗</sup><sub>c,g</sub> denote the missing value for the cell line c in the response variable or the genetic feature g. Let **x**<sub>rna,c</sub> denote the vector of gene expression features for the cell line c.


$$\widehat{z}\_{c,g}^{\*} = \sum\_{k=1}^{K} \frac{d(c,k)}{\sum\_{k'=1}^{K} d(c,k')} z\_{k,g}$$

4. If g is the single-nucleotide mutation feature, z<sup>∗</sup><sub>c,g</sub> is compensated by

$$\widehat{z}\_{c,g}^{\*} = \begin{cases} 1 & \sum\_{k=1}^{K} \mathbf{1}(z\_{k,g} = 1) > \sum\_{k=1}^{K} \mathbf{1}(z\_{k,g} = 0) \\\\ 0 & \text{otherwise} \end{cases}$$

where **1**(·) = 1 if the statement in the parentheses is true and **1**(·) = 0 otherwise.

We set K = 10 for the preprocessing of GDSC and CCLE datasets.
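The two compensation formulas can be sketched as follows. How the K reference cell lines are chosen and how d(c, k) is computed are not specified in the text above, so the sketch simply takes the weights d(c, k) and the observed values z<sub>k,g</sub> as inputs; the function names are hypothetical.

```python
def impute_real(weights, neighbor_values):
    """Weighted mean of the observed values, with weights d(c, k) normalized to sum to 1."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * z for w, z in zip(weights, neighbor_values)) / total


def impute_binary(neighbor_values):
    """Majority vote: 1 if more reference cell lines carry the mutation than not, else 0."""
    ones = sum(1 for z in neighbor_values if z == 1)
    zeros = sum(1 for z in neighbor_values if z == 0)
    return 1 if ones > zeros else 0
```

Note that, following the case definition above, a tie between mutated and wild-type reference cell lines is resolved as 0.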

#### 2.2. Label Assignment for Cell Lines According to IC50

This research constructs classification models for predicting how cell lines respond to the drugs under study. The drug responses can be divided into two categories: "sensitivity" and "non-sensitivity" (Liu et al., 2016). So far, several works have used various IC50 threshold values to classify drug responses (Brubaker et al., 2014; Li et al., 2015). Brubaker et al. (2014) used a hard threshold of 0.1, labeling sensitivity for IC50 < 0.1 and non-sensitivity (i.e., resistance in that work) for IC50 ≥ 0.1. However, by investigating the histograms of IC50, we found that the IC50 statistics vary across drugs, which suggests that the labels should be driven by the data of each individual drug. To this end, we adopted the strategy introduced in Li et al. (2015), which uses the median of the observed IC50 values as a data-driven threshold. For an individual drug, we labeled a cell line as "sensitivity" if its IC50 was smaller than the median over all the cell lines, and as "non-sensitivity" if its IC50 was equal to or larger than that median.
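The median-based labeling rule for a single drug can be written in a few lines. The dictionary layout (cell line name mapped to IC50) and the function name are assumptions made for this sketch.

```python
from statistics import median


def label_by_median(ic50_by_cell_line):
    """Label each cell line for one drug using the median IC50 as the threshold."""
    threshold = median(ic50_by_cell_line.values())
    return {
        cell: ("sensitivity" if value < threshold else "non-sensitivity")
        for cell, value in ic50_by_cell_line.items()
    }
```

Because values equal to the median fall into the "non-sensitivity" class, the two classes need not be exactly the same size when several cell lines share the median IC50.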

#### 2.3. Classification Model and Feature Selection for Predicting Drug Response

#### 2.3.1. Classification Model

The drug response data are often class-imbalanced. Because RF performs well on imbalanced classification problems, we used it as the classification model, with the classification and regression trees (CART) algorithm as the base classifier. RF randomly generates 1,000 CARTs. Each CART is trained on ⌈0.632 × N<sub>sample</sub>⌉ bootstrap samples, where N<sub>sample</sub> is the total number of cell lines. The final result is determined by voting over the predictions of all CARTs.
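The sampling and voting parts of this scheme can be sketched without any tree implementation. The sketch assumes that the bootstrap samples are drawn with replacement, which the text does not state explicitly, and both function names are hypothetical.

```python
import math
import random
from collections import Counter


def bootstrap_indices(n_samples, rng, fraction=0.632):
    """Draw ceil(fraction * n_samples) sample indices with replacement for one tree."""
    size = math.ceil(fraction * n_samples)
    return [rng.randrange(n_samples) for _ in range(size)]


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, with 100 cell lines each tree would be trained on 64 drawn samples, and the forest prediction for a cell line is the label returned by `majority_vote` over the 1,000 tree predictions.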

#### 2.3.2. Feature Selection With the Autoencoder and Boruta Algorithm

Feature selection is crucial for improving the prediction performance of classification models. We used the Boruta algorithm, which is designed for feature selection with RF (Kursa et al., 2010) (**Figure 1**). However, the considerable cardinality of the candidate feature set leads to the curse of dimensionality for the Boruta algorithm. Thus, we first used the autoencoder network to roughly screen the features down to a manageable dimension. The two-step feature selection procedure is described as follows:

Step 1: We trained two single-hidden-layer autoencoder networks, with the hyperbolic tangent as the activation function, to screen the gene expression features and the copy number alteration features, respectively. Rather than directly using the hidden layers of the autoencoder, we used the Gedeon method (Gedeon, 1997) to calculate the proportional contributions of the inputs and select the significant genes. The contribution of the ith input (gene) to the jth output (gene) is calculated as

$$Q\_{ij} = \sum\_{k=1}^{K} (P\_{ik} \times P\_{kj})$$

Here K denotes the total number of neurons in the hidden layer. P<sub>ik</sub> is the contribution of the ith input to the kth neuron of the hidden layer, calculated by

$$P\_{ik} = \frac{|W\_{ik}|}{\sum\_{i^\*=1}^{G} |W\_{i^\*k}|}$$

with G being the total number of inputs and the W<sub>i∗k</sub> being the weights linking the corresponding pairs of neurons. P<sub>kj</sub> is the contribution of the kth neuron of the hidden layer to the jth output, and is calculated similarly to P<sub>ik</sub>. The total contribution of the ith input is calculated by

$$q\_i = \sum\_{j=1}^{G} \frac{Q\_{ij}}{\sum\_{i^\*=1}^{G} Q\_{i^\*j}}$$

We ranked the inputs of the autoencoder in descending order of q<sub>i</sub> and removed the last 50% of the features. We also removed the features whose mean correlation coefficients with the other features exceeded 0.95.
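The contribution scores of Step 1 can be sketched from the two weight matrices of a trained single-hidden-layer autoencoder. The sketch assumes that `w_in` is a G × K nested list (input to hidden), that `w_out` is K × G (hidden to output), that P<sub>kj</sub> is normalized over the hidden neurons feeding output j, and that no column of weights is entirely zero; the function names are hypothetical, and the correlation filter is not included.

```python
def column_normalize_abs(weights):
    """Divide |w[r][c]| by the sum of |w[.][c]| over the rows of column c."""
    n_rows, n_cols = len(weights), len(weights[0])
    col_sums = [sum(abs(weights[r][c]) for r in range(n_rows)) for c in range(n_cols)]
    return [[abs(weights[r][c]) / col_sums[c] for c in range(n_cols)]
            for r in range(n_rows)]


def gedeon_total_contributions(w_in, w_out):
    """Return q_i for every input from the input-hidden and hidden-output weights."""
    p_in = column_normalize_abs(w_in)    # P_ik, shape G x K
    p_out = column_normalize_abs(w_out)  # P_kj, shape K x G
    n_inputs, n_hidden, n_outputs = len(w_in), len(w_out), len(w_out[0])
    # Q_ij = sum_k P_ik * P_kj
    q_matrix = [[sum(p_in[i][k] * p_out[k][j] for k in range(n_hidden))
                 for j in range(n_outputs)] for i in range(n_inputs)]
    col_sums = [sum(q_matrix[i][j] for i in range(n_inputs)) for j in range(n_outputs)]
    # q_i = sum_j Q_ij / sum_i* Q_i*j
    return [sum(q_matrix[i][j] / col_sums[j] for j in range(n_outputs))
            for i in range(n_inputs)]


def keep_top_half(feature_names, contributions):
    """Rank features by q_i in descending order and drop the lower half."""
    ranked = sorted(zip(feature_names, contributions), key=lambda t: t[1], reverse=True)
    return [name for name, _ in ranked[: (len(ranked) + 1) // 2]]
```

With an odd number of features, this sketch keeps the larger half; how such a case was handled in the study is not stated.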

Step 2: We applied the Boruta algorithm to the features retained after Step 1:

- 2-1. Extend the dataset by adding copies of all the features obtained in Step 1.
- 2-2. Shuffle the values of the copied features, called shadow features, to remove their correlations with the response variable, i.e., IC50.
- 2-3. Combine the shadow features with the original ones.
- 2-4. Run a random forest classifier on the combined dataset and compute a variable importance measure, for which the mean decrease accuracy (MDA) is used.
- 2-5. Calculate the Z score by dividing the MDA by the standard deviation of the accuracy loss.
- 2-6. Find the maximum Z score among the shadow attributes (MZSA).
- 2-7. Permanently remove from the dataset the features with importance significantly lower than MZSA. Retain the features with importance significantly higher than MZSA as important features.
- 2-8. Remove the shadow features from the dataset.
- 2-9. Repeat the above steps until the prefixed number of iterations is reached (200 in our study) or all the retained features are important features.
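The shadow-feature mechanics of steps 2-1 to 2-3 and the comparison against MZSA in steps 2-6 and 2-7 can be sketched as below. This is a simplification under stated assumptions: the Z scores are taken as given (in the procedure above they come from the random forest importance), the statistical test for "significantly" higher or lower importance across iterations is omitted, and the function names are hypothetical.

```python
import random


def add_shadow_features(rows, rng):
    """Steps 2-1 to 2-3: append a shuffled copy of every feature column to each row."""
    n_features = len(rows[0])
    shadow_columns = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # shuffling breaks any link between the copy and the response
        shadow_columns.append(column)
    return [row + [col[i] for col in shadow_columns] for i, row in enumerate(rows)]


def hits_against_mzsa(z_scores, n_features):
    """Steps 2-6 and 2-7, simplified: flag real features whose Z score exceeds MZSA.

    z_scores holds the real features first, followed by their shadow copies.
    """
    mzsa = max(z_scores[n_features:])
    return [z > mzsa for z in z_scores[:n_features]]
```

In a full run, these hit flags would be accumulated over the iterations and tested for significance before a feature is confirmed or rejected.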

#### 2.4. EasyEnsemble for Imbalanced Datasets

The total number of cell lines sensitive to drugs is much smaller than that of cell lines non-sensitive to drugs. Thus, the datasets in this research are class-imbalanced. Let N and R denote the sample set of the majority class (non-sensitivity) and that of the minority class (sensitivity), respectively. The imbalance ratio IR = |N|/|R| is used to measure the class imbalance, with | · | being the cardinality of a set. The values of IR differ across the drugs under study. In this research, for the drugs with IR ≤ 2, the feature selection and classification method was used directly; for the drugs with IR > 2, we used the EasyEnsemble (Liu et al., 2009) resampling strategy to deal with the class imbalance problem. The core procedure of EasyEnsemble used here is described as follows:


#### 2.5. Evaluation Criteria

We used the following metrics to evaluate the performance of the classification models:

$$\text{Accuracy:} \qquad \text{ACC} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\text{Recall:} \qquad \text{REC} = \frac{TP}{TP + FN}$$

$$\text{Specificity:} \qquad \text{Specific} = \frac{TN}{TN + FP}$$

$$F\_1 \text{ score:} \qquad F\_1 = \frac{2TP}{2TP + FP + FN}$$

$$\text{Matthews correlation coefficient:} \qquad MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(FP + TN)(FN + TN)}}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.


Besides the metrics above, AUC was also obtained.
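For reference, the five count-based metrics can be computed from a confusion matrix as in the sketch below. The function name and the dictionary keys are illustrative choices, and the sketch assumes that both classes are present so that the recall and specificity denominators are non-zero; AUC is not covered because it requires the predicted scores rather than the counts.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, REC, Specific, F1, and MCC from the confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (fp + tn) * (fn + tn))
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "REC": tp / (tp + fn),
        "Specific": tn / (tn + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
        # MCC is set to 0 when a marginal count is zero (a common convention).
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```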

Because the total number of samples was much smaller than that of the features, the above evaluation criteria were obtained by using 10-fold cross validation (CV). The dataset was randomly partitioned into 10 equal sized subsets. Of the ten subsets, a single subset was used as the test set to calculate the evaluation criteria of the models trained by the remaining nine subsets. The above process was then repeated 10 times, and the mean of the evaluation criteria obtained in the 10 times was used as the final criteria. In this way, the test datasets can be ensured to be independent of the training datasets.
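The partitioning scheme can be sketched as follows. The seed, the function names, and the `fit_and_score` callback (which would train a model on the training indices and return one evaluation criterion on the test indices) are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k roughly equal-sized random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score(train, test) over the k folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Each sample appears in exactly one test fold, so every model is evaluated on cell lines it did not see during training.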

## 3. RESULTS

#### 3.1. Data Description

There are missing data in both datasets. These missing data were compensated for with the weighted mean method described in the Materials and Methods section. The total numbers of samples for each variable are listed in **Table 1**.

According to their histograms, most of the distributions of the drug responses of cell lines in the two datasets can be approximated by a Gaussian distribution (**Figure 2**). A t-test showed that the difference between the two groups divided by the median of IC50 was significant, with p-values from 4.27 × 10<sup>−160</sup> to 6.89 × 10<sup>−46</sup> in GDSC and from 7.14 × 10<sup>−95</sup> to 4.05 × 10<sup>−4</sup> in CCLE.

#### 3.2. Prediction Performance of AutoBorutaRF

To illustrate the effectiveness of our AutoBorutaRF method, we evaluated its prediction performance on the GDSC and CCLE datasets. We also compared it with four other algorithms: the naive Bayes classifier (Barretina et al., 2012), SVM-RFE (Dong et al., 2015), FSelector for the k-nearest-neighbors (KNN) algorithm (Soufan et al., 2015), and AutoHidden. The naive Bayes method first selected the top 30 features using either the non-parametric Wilcoxon rank-sum test (for the gene expression features) or the Fisher exact test (for the gene mutations). The remaining significant features (p < 0.25) were then clustered using a message-passing algorithm for each type of feature. Finally, the two sets of features were combined and a naive Bayes classifier was used to predict the drug response class. SVM-RFE is a wrapper method that uses recursive feature elimination and an SVM classifier. The feature number, gamma, and cost parameters were set to 10, 0.5, and 10, respectively, which were the optimal parameters selected by SVM-RFE. FSelector selects features based on information entropy and applies them to the KNN algorithm. In AutoHidden, we directly used the hidden layer of the autoencoder constructed in our AutoBorutaRF as the features.

TABLE 1 | Total numbers of samples for three features.

*The number in parentheses is the total number of cell lines corresponding to the features.*

TABLE 2 | Mean values of six evaluation metrics obtained from GDSC.


*The bold number indicates the best result.*

The overall prediction performance of the five methods on the two datasets is shown in **Tables 2**, **3** and **Figure 3**. All the metrics in the figure were obtained by 10-fold CV. **Figure 3** shows that our method performed best with respect to AUC, accuracy, recall, specificity, F<sub>1</sub> score, and Matthews correlation coefficient.

Among the 98 drugs in GDSC, ABT-888 had the worst prediction, with an AUC of 0.5935, and RDEA119 had the best, with an AUC of 0.8282. RDEA119, PD-0325901, 17-AAG, and Vorinostat were the only four drugs with AUC > 0.8, but 59 drugs had AUCs higher than 0.7. Among the 24 drugs in CCLE, the worst prediction was for AEW541, with an AUC of 0.6509. The best three predictions were for Nutlin-3, LBW242, and AZD6244, with AUCs of 0.9633, 0.9300, and 0.9079, respectively. The AUCs of Irinotecan, Panobinostat, PD-0332991, PD-0325901, PHA-665752, PLX4720, and Topotecan were higher than 0.85. The receiver operating characteristic (ROC) curves are provided in **Supplementary File 1**.

#### 3.3. Identified Biomarkers Are Associated With Cancer and Drug Target Pathway

We used 95 lung cell lines in the GDSC database to illustrate the biological significance of the identified biomarkers. **Figure 4A** shows the prediction performance of AutoBorutaRF for the lung cell lines, which was satisfactory for predicting drug responses. We used the non-parametric Wilcoxon rank-sum test for the gene expression and copy number alteration features and the Fisher exact test for the single-nucleotide mutation features to test for significant differences in the genetic features between the sensitive and non-sensitive populations. Among the 1,087 identified features (**Supplementary File 2**), 1,029 had p < 0.05, as shown in **Figure 4B**. These results show that most of the identified features had significantly different genetic profiles between the two classes (**Supplementary File 3**).

We further used PLX4720 and BIBW2992 as two examples to illustrate the biological significance of the features selected for the lung cell lines. Prediction metrics for these two drugs are shown in **Figure 5**. PLX4720 is an inhibitor of B-raf and targets the MAPK signaling pathway (Michaelis et al., 2014). The selected significant features for PLX4720 were CCL19, CCRL2, CST7, GPR143, HDAC5, and IDO1. CCRL2 inhibits p38 MAPK phosphorylation and up-regulates the expression of E-cadherin (Wang et al., 2015). In addition, CCR7, CST7, GPR143, HDAC5, and IDO1 are related to lung cancer or the MAPK pathway (Liu et al., 2014, 2018; Li and Seto, 2016; Matthews et al., 2016; Rose et al., 2016).

BIBW2992 inhibits ERBB2 and EGFR, targets the EGFR signaling pathway (Iorio et al., 2016), and has been widely investigated in cancers such as lung cancer and melanoma (Rinehart et al., 2004; Nehs et al., 2010; Varmeh et al., 2016). The selected significant features were FYN, KCNH2, REST, CDH12, LRRC8E, SCG2, PHF8, PCSK1, ANXA2, and MIR6730. FYN is an authentic effector of oncogenic EGFR signaling, acting by limiting EGFR tumor cell motility (Lu et al., 2009). CDH12 plays an important role in the genesis of non-small-cell lung cancer (NSCLC), as mutations of CDH12 and other PRAME family members were equally distributed among tumors of different grades and stages (Bankovic et al., 2010). SCG2 is connected with the alteration of miRNA profiles in A549 human non-small-cell lung cancer cells (Shin et al., 2009). KCNH2, REST, LRRC8E, PHF8, PCSK1, ANXA2, and MIR6730 have also been shown to be related to the EGFR signaling pathway and lung cancer (Bonilla and Geha, 2006; de Castro et al., 2006; Kreisler et al., 2010; Wang et al., 2012; Demidyuk et al., 2013; Shen et al., 2014; Díaz-Rodríguez et al., 2018). The function descriptions and interaction networks of the identified features for PLX4720 and BIBW2992 are included in **Supplementary File 4**.

FIGURE 3 | Box plots of the six evaluation metrics over all the cell lines in the (A) GDSC and (B) CCLE datasets. Our method performed best with respect to AUC, accuracy, recall, specificity, *F*<sub>1</sub> score, and Matthews correlation coefficient. The naive Bayes classifier and SVM-RFE outperformed at specificity.

FIGURE 4 | Prediction performance for the lung cell lines in GDSC. (A) Box plots of the six metrics over all the lung cell lines show satisfactory prediction performance. (B) Histogram of *p*-values obtained by the statistical significance tests for the identified features, showing that most of the identified features had significantly different genetic profiles between the sensitive and non-sensitive populations.

### DISCUSSION

The prediction of anticancer drug response is crucial for many applications, such as preclinical studies and clinical trial design. Prediction models for drug response include regression models and classification models. This research developed AutoBorutaRF for predicting drug response with a twofold aim: obtaining proper features for RF and identifying biologically significant biomarkers for explaining drug response. Because the set of candidate genetic features is vast, the well-developed Boruta algorithm cannot be applied directly for feature selection. We therefore first drastically reduced the dimension by constructing an autoencoder network. Unlike the typical application, which uses the hidden layer of the autoencoder, we extracted the inputs with large contributions as evaluated by the Gedeon method.

Taking AUC = 0.7 as a pass mark, 22 of 24 drugs in CCLE and 59 of 98 drugs in GDSC reached qualified prediction performance. Further analysis is needed to investigate the reasons for the difference in prediction performance between the two datasets.

We further investigated the biological significance of the selected features. We showed that most of the identified genetic features differed significantly between the sensitive and non-sensitive cell lines. Using PLX4720 and BIBW2992 as two examples, we illustrated that many genes identified by AutoBorutaRF have been reported to be closely related to tumorigenesis or cancer progression. Detailed function explanations and interaction networks of the selected features can be found in **Supplementary File 4**. Thus, AutoBorutaRF can be considered a capable machine learning method for determining biomarkers that predict drug response for preclinical and clinical purposes.

Note that our proposed method used no prior information to obtain the optimal feature set in the sense of prediction performance. In future research, predetermined information, such as pathway knowledge, and prior distributions describing the uncertainties of anticancer drugs could be embedded in our method.

### DATA AVAILABILITY

The source code and datasets for this study can be downloaded from https://github.com/bioinformatics-xu/AutoBorutaRF.

#### AUTHOR CONTRIBUTIONS

XX and PQ processed the data, designed the algorithm and the program code, and wrote the manuscript. YW supported result interpretation and manuscript writing. JW and HG supervised the project and contributed to writing the manuscript.

#### FUNDING

This work was supported by the National Natural Science Foundation of China (61633006, 61502074, 81602309, 81422038, 81872247, 91540110, and 31471235).

#### ACKNOWLEDGMENTS

We thank Pi Xu Liu and Hailing Cheng for useful discussion.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00233/full#supplementary-material

Supplementary File 1 | ROC curve of ten-fold cross validation.

Supplementary File 2 | Selected features.

Supplementary File 3 | Results of feature significance test.

Supplementary File 4 | Function descriptions and interaction networks for PLX4720 and BIBW2992.


to drugs based on genomic and chemical properties. PLoS ONE 8:e61318. doi: 10.1371/journal.pone.0061318


blocking CCL2-induced phosphorylation of p38 MAPK in human breast cancer cells. Med. Oncol. 32:254. doi: 10.1007/s12032-015-0696-6


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xu, Gu, Wang, Wang and Qin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Simultaneous Interrogation of Cancer Omics to Identify Subtypes With Significant Clinical Differences

Aodan Xu, Jiazhou Chen, Hong Peng, GuoQiang Han and Hongmin Cai\*

*School of Computer Science and Engineering, South China University of Technology, Guangzhou, China*

Recent advances in high-throughput sequencing have accelerated the accumulation of omics data on the same tumor tissue from multiple sources. Intensive study of multi-omics integration on tumor samples can stimulate progress in precision medicine and is promising for detecting potential biomarkers. However, current methods are restricted by the highly unbalanced dimensions of omics data or by the difficulty of assigning weights to different data sources. Therefore, the appropriate approximation and constraints of integrated targets remain a major challenge. In this paper, we propose an omics data integration method named high-order path elucidated similarity (HOPES). HOPES fuses the similarities derived from various omics data sources to resolve the dimensional discrepancy, and progressively elucidates the similarities from each type of omics data into an integrated similarity with various high-order connected paths. Through a series of incremental constraints for commonality, HOPES can take both the specificity of single data types and the consistency between different data types into consideration. The fused similarity matrix gives global insight into patients' correlation and efficiently distinguishes subgroups. We tested the performance of HOPES on both a simulated dataset and several empirical tumor datasets. The test datasets contain three omics types, namely gene expression, DNA methylation, and microRNA data, for five different TCGA cancer projects. Our method achieved superior accuracy and high robustness compared with several benchmark methods on simulated data. Further experiments on five cancer datasets demonstrated that HOPES achieved superior performance in cancer classification. The stratified subgroups were shown to have statistically significant differences in survival. We further located and identified the key genes, methylation sites, and microRNAs within each subgroup. They were shown to have high potential prognostic value and were enriched in many cancer-related biological processes or pathways.

Keywords: similarity integration, omics data, survival analysis, DNA methylation, gene expression, miRNA

## 1. INTRODUCTION

In current clinical practice, cancer is typically categorized based on its tissue source and pathological histology. However, cancer is also a well-characterized pathological system at the molecular level. Most cancers emerge along with complex molecular alterations at the germline and/or somatic level (Kristensen et al., 2014). Molecule-level cancer re-classification and

Edited by:

*Binhua Tang, Hohai University, China*

#### Reviewed by:

*Xiaofeng Dai, Jiangnan University, China Pu-Feng Du, Tianjin University, China*

> \*Correspondence: *Hongmin Cai hmcai@scut.edu.cn*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

Received: *13 November 2018* Accepted: *04 March 2019* Published: *28 March 2019*

#### Citation:

*Xu A, Chen J, Peng H, Han G and Cai H (2019) Simultaneous Interrogation of Cancer Omics to Identify Subtypes With Significant Clinical Differences. Front. Genet. 10:236. doi: 10.3389/fgene.2019.00236*

subtyping based on genome-scale data sets can act as an entry point for precision oncology (Wu et al., 2017), for example in evaluating the metastatic potential of patients and selecting the most promising treatment (Forbes et al., 2010). Although enormous quantities of molecular data have been accumulated by various cancer profiling projects, for example, the Catalog of Somatic Mutations in Cancer (COSMIC) database (Forbes et al., 2008), the International Cancer Genome Consortium (ICGC) (International Cancer Genome Consortium et al., 2010), and The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013), interpreting such data is difficult. In recent years, many sophisticated statistical and mathematical models have been proposed to analyze biological data, most of which are based on a single data type (e.g., gene expression or methylation). However, all biological mechanisms consist of multiple molecular phenomena, and genomes exhibit variation owing to gene mutations, epigenetic changes, individual differences, and environmental influences. It is difficult for conventional analysis based on a single type of genomic data to capture the heterogeneity of all biological processes and clearly differentiate phenotypes. Thus, the focus has now shifted to integrating multi-omics data to achieve more promising and stable cancer classification results.

To perform such simultaneous interrogation, there are two major challenges. First, distinct omics data are heterogeneous in scale, dimension, and quality, and such heterogeneity requires careful processing. Second, there are internal relationships between single data layers (e.g., promoter DNA methylation may suppress expression), so information on these regulatory patterns can improve integrated analysis. Existing methods can be roughly divided into three categories based on their methodology: latent variable representation methods, probabilistic modeling methods, and network-based methods (Huang et al., 2017; Rappoport and Shamir, 2018). Latent variable representation methods map diverse features from different data types into a shared low-dimensional space under the assumption that a set of latent variables is shared across multi-omics data. For example, iCluster+ employs an expectation-maximization (EM) algorithm to build a regularized regression that models latent variables and observed data (Mo et al., 2013). Joint non-negative matrix factorization (jNMF) is used to detect the shared characteristic space (Zhang et al., 2012). The moCluster algorithm defines a joint latent variable using a modified consensus PCA (CPCA) (Meng et al., 2015). The major drawback of these methods is that, when the dimensions and variances of different omics datasets differ greatly, the basic assumption may be hard to justify. Moreover, the unobserved latent variables possess little biological meaning and have far fewer dimensions than the original spaces. Probabilistic models presume different prior distributions for multi-omics data, construct a mixture model, and then estimate the parameters and mixture ratios. For instance, a Beta-Gaussian mixture model can integrate gene expression data and protein-DNA binding probabilities into a single probabilistic modeling framework (Dai et al., 2009). Besides modeling the original data, one can also model the probability distribution of clusters at the local and global levels using a hierarchical Dirichlet mixture model (Gabasova et al., 2017). However, the accuracy relies heavily on the inherent distribution of the data, and overfitting may occur when the sample size is far smaller than the number of features. Instead of searching for common latent variables in measurement space, network-based methods begin with each single data layer and propagate information through interactions between samples to construct a global graph structure. Similarity network fusion (SNF) (Wang et al., 2014) follows this route: using message-passing theory, it fuses the similarities of each available data type into one network by iteratively updating every network as the matrix product of a single layer's similarity and the average of the remaining layers. A network structure can effectively handle differences in dimension and scale. However, the main difficulty lies in determining the contribution of each local pattern and in interpreting the clustering result in terms of the original features. Hence, there is still strong demand for efficient and precise multi-omics data integration methods that can overcome dimension variance and heterogeneous scale.

In this paper, we propose a method that interrogates omics data simultaneously to achieve multi-scale cancer subtyping. The proposed high-order path elucidated similarity (HOPES) integrates the similarities for each type of omics data into a unified and stable one, thus providing a simplified link to the underlying mechanism of the various types of expression. We modeled the integrated similarity as an approximation to various high-order paths across each local dataset; the progressively increasing path order represents different consistency requirements. We especially emphasized the interaction within each pair of local layers rather than updates using a single layer and the average of the remaining layers. HOPES models such similarity integration as a minimization problem consisting of three sub-objective functions, for which an efficient numerical algorithm was designed to obtain the solution. Through the optimization procedure, we strengthened the strong correlations between patients and removed the weak ties mainly caused by noise. We thereby successfully identified cancer subtypes with significant clinical differences. Experiments on real data from five TCGA cancer projects and a normal control set, covering cancer diagnosis and prognosis tasks, demonstrated the excellent performance of HOPES in subtyping and in identifying key oncogenesis pathways. Subsequent biological analysis showed that the resulting key pathways possess potential prognostic value and biological significance.

#### 2. MATERIALS AND METHODS

### 2.1. Tumor Datasets With Comprehensive Omics Measurements

We tested the proposed HOPES on five distinct tumor datasets downloaded from TCGA. The tested samples covered five tumor types: glioblastoma multiforme (GBM), lung squamous cell carcinoma (LUSC), kidney renal clear cell carcinoma (KIRC), colon adenocarcinoma (COAD), and cervical cancer (CESC). Each tumor was measured by DNA methylation, gene expression, and miRNA expression. The overall survival information corresponding to each sample was also considered. The first four projects were the same as the experimental data used in a previous study (Wang et al., 2014). The gene expression data for GBM and LUSC were collected using the Broad Institute HT-HG-U133A platform, while those for COAD were collected by the UNC-Agilent-G4502A-07 platform and those for KIRC by the UNC-Illumina-Hiseq-RNASeq platform. The miRNA expression data for GBM were collected by the UNC-miRNA-8X15K platform, while those for LUSC, KIRC, and COAD were collected by the BCGSC-Illumina-GA-miRNAseq platform. The methylation data for GBM were obtained with the JHU-USC-Illumina-DNA-Methylation platform, while for the others the JHU-USC-Human-Methylation-27 platform was used. The fifth dataset, CESC, contains data on clinical and pathological features, genomic alterations, DNA methylation profiles, and RNA and proteomic signatures, and is available from TCGA (Cancer Genome Atlas Research Network et al., 2017). We collected gene expression profiles, DNA methylation profiles, miRNA expression, and clinical data from the Broad Institute TCGA Genome Data Analysis Center (Broad Institute TCGA Genome Data Analysis Center, 2016). A total of 284 samples with these four types of data were included in the study. For each data type, we removed signatures whose missing rate across all samples was higher than 20%. For the remaining missing values, a K-nearest neighbor (KNN) imputation scheme (Troyanskaya et al., 2001) was used, which fills each missing entry with the mean value of its non-missing neighbors. Finally, we normalized each dataset across samples and obtained a gene expression dataset of 20,118 genes, a methylation dataset of 396,065 CpG sites, and a miRNA dataset of 885 miRNAs. To reduce computational cost, for analyses involving methylation data, the 1,000 most variable CpG sites based on the standard deviation of beta values were selected.
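Two of the preprocessing steps described above, the 20% missing-rate filter and the selection of the most variable CpG sites, can be sketched with the standard library as follows. The function names, the features-by-samples layout, the use of `None` for missing values, and the toy matrices are assumptions made for illustration; the KNN imputation and normalization steps are not shown.

```python
import statistics


def drop_sparse_features(matrix, max_missing_rate=0.2):
    """Keep feature rows whose fraction of missing values (None) is at most the threshold."""
    kept = []
    for row in matrix:
        missing = sum(value is None for value in row)
        if missing / len(row) <= max_missing_rate:
            kept.append(row)
    return kept


def top_variable_features(matrix, k):
    """Return the (sorted) indices of the k rows with the largest standard deviation.

    Assumes the matrix has already been imputed, i.e. contains no None.
    """
    sds = [statistics.pstdev(row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: sds[i], reverse=True)
    return sorted(order[:k])


# Toy matrix: rows are features, columns are samples.
raw = [
    [0.1, 0.9, 0.5, 0.2, 0.8],
    [0.5, None, None, 0.5, 0.5],   # 40% missing, removed
    [0.4, 0.4, 0.4, 0.4, None],    # 20% missing, kept
    [0.0, 1.0, 0.0, 1.0, 0.5],
]
filtered = drop_sparse_features(raw)

# Toy imputed beta values: the second row is constant and therefore least variable.
imputed = [
    [0.1, 0.9, 0.5, 0.2, 0.8],
    [0.4, 0.4, 0.4, 0.4, 0.4],
    [0.0, 1.0, 0.0, 1.0, 0.5],
]
selected = top_variable_features(imputed, k=2)
```

In the setting described in the text, `k` would be 1,000 and the rows would be the CpG sites of the methylation dataset.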

#### 2.2. Comparative Healthy Dataset as a Control

Besides the tumor samples, we also prepared normal samples as a control set to evaluate the capacity of HOPES for diagnosis. A few healthy cases with data on gene expression, methylation, and miRNA expression are also included in TCGA. Finally, we merged 35 samples derived from several normal tissues adjacent to cancerous tissue from six TCGA disease projects (BRCA, GBM, KIRC, COAD, LUSC, and CESC). The preprocessing described above was also performed on the 35 normal controls. Although we simply integrated healthy samples from different tissues as a control set, the normalization step can remove differences between tissues and ensure separability between cancer samples and healthy controls.

### 2.3. Methods

#### 2.3.1. SNF

Similarity network fusion (SNF) is an algorithm that integrates different omics data by computing and fusing patient similarity networks. SNF conducts the similarity fusion by iteratively updating every similarity network, making it more similar to the others with every iteration, as follows:

$$P^{(v)} = S^{(v)} \times \left(\frac{\sum_{k \neq v} P^{(k)}}{m - 1}\right) \times \left(S^{(v)}\right)^{T}, \quad v = 1, 2, \dots, m$$

where P<sup>(v)</sup> represents the similarity matrix derived from the v-th dataset, S<sup>(v)</sup> represents the local affinity, which contains only the nearest neighbors' information, and m is the number of data types. In effect, each iteration updates the similarity between node i and node j in P<sup>(v)</sup> to the weighted sum of similarities between the K nearest neighbors of node i and those of node j, where the neighbors' similarities are derived from the other m − 1 datasets.
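The update rule above can be sketched in a few lines of dependency-free Python. This is only an illustration of a single iteration under the stated formula: the helper names are hypothetical, the matrices are plain nested lists, and any normalization that an SNF implementation may apply between iterations is left out.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def snf_step(P, S):
    """One update P[v] <- S[v] x mean(P[k], k != v) x S[v]^T for every data type v."""
    m = len(P)
    n = len(P[0])
    updated = []
    for v in range(m):
        others = [P[k] for k in range(m) if k != v]
        # Element-wise average of the similarity matrices of the other m - 1 types.
        avg = [[sum(mat[i][j] for mat in others) / (m - 1) for j in range(n)]
               for i in range(n)]
        updated.append(matmul(matmul(S[v], avg), transpose(S[v])))
    return updated


# Toy check with two data types and identity local affinities:
# each network is then simply replaced by the other one.
identity = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5], [0.5, 0.5]]
B = [[0.75, 0.25], [0.25, 0.75]]
P_next = snf_step([A, B], [identity, identity])
```

All networks are updated from the values of the previous iteration, which is why the toy example swaps the two matrices exactly.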

The main contribution of SNF is that it resolves the discrepancy of dimensions and variances among omics datasets, which may be the biggest challenge for omics data integration, and it has been widely used in practical biological tasks. However, the algorithm still has some limitations. (1) The procedure treats every network equally, without weight constraints. (2) There is only one type of connection path between different datasets, which crosses two intermediate nodes; this is insufficient for depicting complex network interactions. (3) Information is exchanged only between one dataset and the average of the others. There are no direct mutual adjustments between different datasets, which may obscure some interconnections between specific data types. This incomplete network connection model makes it difficult to recover the most precise global similarity pattern or to resist the high level of noise in biological data.

#### 2.3.2. Similarity Fusion Through High-Order Path

To obtain a consistent and highly representative global similarity, HOPES simulates three network connection models with different path lengths and seeks the fused pattern that retains the maximal commonality. As depicted in **Figure 1**, (1) path-0 similarity preserves the characteristics of each local affinity obtained using K nearest neighbors, (2) path-1 similarity introduces one intermediate node to enhance the effect of each local affinity, and (3) path-2 similarity introduces two intermediate nodes to integrate interactions between different local affinities and enhance the commonality. The detailed numerical expressions and constraints of the different path orders are as follows.

Suppose we have C different omics datasets whose local affinities S<sub>i</sub> (i ∈ {1, ..., C}) were evaluated by a scaled exponential similarity kernel (Wang et al., 2014; see details in **Supplementary Methods**). First, for the path-0 similarity, the fused similarity is required to be close to each underlying affinity, which can be characterized simply by minimizing the average loss as follows:

$$\min_{W} \sum_{i=1}^{C} \left\| W \cdot \Omega_{i} - S_{i} \right\|_{F}^{2} \tag{1}$$

where W is an n × n fused similarity matrix, S<sub>i</sub> is the local affinity extracted from the i-th omics data, and Ω<sub>i</sub> is an n × n matrix whose entries denote whether the corresponding entries in S<sub>i</sub> are equal to 0. There are C types of omics data.
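To make the notion of a K-nearest-neighbor local affinity concrete, the sketch below builds one from a full similarity matrix by keeping, for every sample, only its K most similar neighbors and normalizing their weights to sum to one. This follows the general construction used in similarity network fusion; the exact definition used by HOPES is given in the Supplementary Methods, so the function name, the exclusion of the sample itself from its neighbor set, and the toy matrix are assumptions.

```python
def knn_local_affinity(W, k):
    """Row-normalized affinity that keeps only the k most similar neighbors of each sample."""
    n = len(W)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Rank the other samples by similarity to sample i (self excluded by assumption).
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: W[i][j], reverse=True)[:k]
        total = sum(W[i][j] for j in neighbours)
        for j in neighbours:
            S[i][j] = W[i][j] / total if total else 0.0
    return S


# Toy full similarity matrix for four samples forming two loose pairs.
W_full = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.3, 0.1, 0.7, 1.0],
]
S_local = knn_local_affinity(W_full, k=2)
```

Entries outside the neighborhood are exactly zero, which is the sparsity pattern that a mask such as Ω<sub>i</sub> can record.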

Different from the path-0 similarity, we further propose the path-1 similarity to retain the maximal commonality when filtering through each underlying affinity. We thus require the fused global similarity to be close to every one-step transformed similarity obtained by multiplying it by each local affinity:

$$\min_{W} \sum_{i=1}^{C} \left\| W - S_{i} W \right\|_{F}^{2} \tag{2}$$

Note that (S<sub>i</sub>W)(m, n) = Σ<sub>k</sub> S<sub>i</sub>(m, k)W(k, n) can be interpreted as the weighted sum of the distances between the K nearest neighbors of node m and node n, where the neighbors' information comes from dataset i; it represents W filtered by S<sub>i</sub>. Therefore, the aim of Equation (2) is to ensure proximity between the global affinity and the transformed affinity after it has been weighted by each local affinity. One can impose a stricter requirement that the fused global similarity be close to the transformed similarity filtered by each underlying local affinity through higher-order paths. For example, with path-2 proximity,

$$\min_{W} \sum_{i=1}^{C} \sum_{j=1}^{C} \left\| W - S_{i} W S_{j}^{T} \right\|_{F}^{2} \tag{3}$$

where (S<sub>i</sub>WS<sub>j</sub><sup>T</sup>)(m, n) = Σ<sub>k,l</sub> S<sub>i</sub>(m, k)W(k, l)S<sub>j</sub><sup>T</sup>(l, n). It likewise represents the weighted sum of the distances between the K nearest neighbors of node m and those of node n, where the neighbors' information for the two vertices comes from two different datasets. This interaction between different local affinities sharply strengthens the commonality requirement. The filtration process is expected to weaken an original edge in W unless the correlation between nodes m and n is simultaneously supported by each pair of data types.

Finally, combining the aforementioned constraints for modeling proximities of various path orders, we propose the determination of the global affinity by minimizing the following energy function:

$$\min_{W} \sum_{i=1}^{C} \left( \left\| W \cdot \Omega_{i} - S_{i} \right\|_{F}^{2} + \alpha \left\| W - S_{i} W \right\|_{F}^{2} + \beta \sum_{j=1}^{C} \left\| W - S_{i} W S_{j}^{T} \right\|_{F}^{2} \right) \tag{4}$$

where α and β are hyperparameters that adjust the weights of the different order constraints and can be set empirically. Details on parameter tuning are given in the **Supplementary Methods**. The optimization problem can be solved using a consensus alternating direction minimization method (ADMM) (see **Supplementary Methods** for the detailed solution procedure).
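To show how the three terms of the energy function combine, the sketch below evaluates it for a given W and a list of local affinities. It only evaluates the objective and does not implement the ADMM solver. Two points are assumptions rather than statements from the text: the product W · Ω<sub>i</sub> is treated as element-wise, and Ω<sub>i</sub> is taken to be 1 where S<sub>i</sub> is non-zero and 0 elsewhere. All function names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_sq(a, b):
    """Squared Frobenius norm of the difference a - b."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def hopes_objective(W, S_list, alpha, beta):
    """Evaluate the path-0, path-1 and path-2 terms summed over all data types."""
    total = 0.0
    for S_i in S_list:
        # Path-0: compare W with S_i only on the non-zero support of S_i (assumed mask).
        masked = [[w if s != 0 else 0.0 for w, s in zip(w_row, s_row)]
                  for w_row, s_row in zip(W, S_i)]
        total += frob_sq(masked, S_i)
        # Path-1: W filtered once by S_i.
        S_iW = matmul(S_i, W)
        total += alpha * frob_sq(W, S_iW)
        # Path-2: W filtered by every ordered pair (S_i, S_j).
        for S_j in S_list:
            total += beta * frob_sq(W, matmul(S_iW, transpose(S_j)))
    return total


# Toy example with a single data type whose affinity swaps the two samples.
swap = [[0.0, 1.0], [1.0, 0.0]]
W_toy = [[1.0, 0.0], [0.0, 0.0]]
value = hopes_objective(W_toy, [swap], alpha=0.5, beta=0.25)
```

For this toy input each of the three norms equals 2, so the objective is 2 + 0.5 · 2 + 0.25 · 2 = 3.5; increasing α or β shifts the balance toward the higher-order consistency terms.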

In conclusion, the three path orders represent an incremental relationship from specificity to commonality and from weak to strong constraints. They can simulate much more complex network connection models and set increasing consistency requirements on the global similarity. We can therefore take the specificity of every single dataset, the interconnections between datasets, and global consistency into consideration and construct a more comprehensive and robust global similarity network. Moreover, the weights can be adjusted manually according to real-world conditions, which makes HOPES more flexible.

#### 2.3.3. Downstream Applications

Once we have the fused global similarity matrix, it can serve as the fundamental structure for many downstream analyses. The most direct application is spectral clustering, which groups the samples into subgroups that can be used for cancer diagnosis or molecular subtyping. In this paper, to eliminate variation due to clustering initialization, consensus clustering (Monti et al., 2003) was used to enhance reliability. It records the consensus across repeated clustering trials based on a given global similarity matrix to assess the stability of the clustering results.
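The idea of recording the consensus across repeated clustering trials can be illustrated with a small co-clustering matrix: for every pair of samples, count the fraction of trials in which they received the same label. This is a simplified sketch; the original consensus clustering procedure is resampling-based, and the spectral clustering step that would produce the label vectors is not shown. The function name and the example runs are hypothetical.

```python
def consensus_matrix(label_runs):
    """Fraction of runs in which each pair of samples shares a cluster label."""
    n = len(label_runs[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in label_runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    runs = len(label_runs)
    return [[value / runs for value in row] for row in counts]


# Three hypothetical clustering trials of four samples. Label names may differ
# between trials; only co-membership matters.
trials = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
M = consensus_matrix(trials)
```

Entries close to 0 or 1 indicate stable pairwise assignments, whereas intermediate values flag samples whose cluster membership depends on initialization.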

Besides clustering, we also tried to project the global structure onto specific characteristics in every single dataset. Since these features are the most relevant to the fused results, they may not only be prognostically valuable but also indicate interconnections between different omics layers. We located these features using MCFS, an unsupervised feature selection algorithm for multi-cluster data (Cai et al., 2010). Given our fused similarity matrix W and the original omics data as input, the feature selection task can be modeled as an L1-regularized regression problem that yields sparse coefficient vectors for the features. In this way, we can easily select a series of the most relevant features (corresponding to the non-zero coefficients).

### 3. RESULTS

We designed a series of experiments to demonstrate the advantages of HOPES by comparing it with four representative methods belonging to three popular integration frameworks: network fusion-based SNF (Wang et al., 2014), joint latent variable-based iCluster+ (Mo et al., 2013) and moCluster (Meng et al., 2015), and probabilistic model-based Clusternomics (Gabasova et al., 2017). Simulations and real data experiments were performed to evaluate performance in global cluster structure detection and usability in clinical practice, respectively.

#### 3.1. Experiments to Demonstrate the Accuracy and Robustness of HOPES With Simulated Data

To demonstrate the performance of HOPES in fusing multi-omics data, we first tested it on simulated datasets and compared it with SNF and moCluster. The simulated dataset was generated similarly to the one reported elsewhere (Shi et al., 2017). It was created to recapitulate the features of actual genomic data by combining biological variation levels from real data with a pre-defined cluster structure. The actual genomic profiles were downloaded from GEO (Barrett et al., 2013) with the following GEO accession codes: GSE51557, GSE73002, and GSE106453. These three datasets focused on DNA methylation (Conway et al., 2015), RNA expression (Nakagawa et al., 2008), and miRNA expression (Shimomura et al., 2016), respectively. Based on these actual genomic data, we used singular value decomposition (SVD) to fuse them with the pre-defined cluster structure and constructed two synthetic datasets (SimData1 and SimData2). SimData1 has clear boundaries between clusters, while SimData2 has fuzzy boundaries (see **Supplementary Methods** for more details).

FIGURE 3 | Cluster accuracy comparison between different methods on different simulation datasets. The upper panel represents the NMI of HOPES, SNF, and moCluster in SimData1 (A) and SimData2 (B) under incremental standard deviation of Gaussian noise. The lower panel shows the NMI boxplots in SimData1 (C–E) and SimData2 (F–H) among three methods under different noise levels (from left to right in the order of low, intermediate, and high), measuring their accuracy and stability on recovering the integrated pattern through partial layers.

TABLE 2 | The accuracy for cancer diagnosis of different methods.

We tested HOPES and the other methods on both simulated datasets under different levels of noise intensity to assess information integration capability and robustness. We used normalized mutual information (NMI) as the performance criterion, and for each noise condition we ran 20 repeated trials to reduce the effect of random error. Collectively, the simulation results suggested that HOPES can consistently recover the four pre-defined clusters from incomplete layers (**Figure 2**). As described in the data construction, each of the three single layers contained a part that could not be divided within that layer. To uncover the real cluster information, an effective integration method was required. The proposed HOPES uses the high-order path distance among different data types to approximate the global similarity. The correlation between nodes i and j is weakened if it exists in only a single data layer, which ensures the separation of groups that are mixed in a single data source. Moreover, the progressive proximity model not only sets constraints on the high-order path distance but also reconciles the highly specific characteristics of each single data layer. Thus, it is promising for detecting hidden cluster structure shaped by multi-source data.
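For reference, NMI between a predicted clustering and the pre-defined cluster labels can be computed as in the following sketch. Several normalizations of mutual information are in use; the geometric mean of the two entropies is chosen here as an assumption, since the text does not state which variant was applied. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information, normalized by sqrt(H(a) * H(b))."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one clustering has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)


truth = [0, 0, 1, 1]
same_up_to_relabeling = nmi(truth, [1, 1, 0, 0])
unrelated = nmi(truth, [0, 1, 0, 1])
```

NMI is invariant to how clusters are named, so a perfect recovery scores 1 regardless of label permutation, while a clustering that is independent of the truth scores 0.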

The numerical results are shown in **Table 1** and **Figure 3**; they suggest that HOPES outperformed the compared methods irrespective of the signal and noise conditions (highlighted in bold in **Table 1**). It should be noted that Clusternomics showed little tolerance to noise because it lacks a noise model. For the other three methods, the variance of the Gaussian noise could be raised to 3, whereas Clusternomics could only tolerate noise with variance lower than 1 (see **Supplementary Figures** for more details). In this section, we therefore mainly discuss the performance of the remaining four methods. SNF achieved high precision when the noise level remained low; however, its robustness to noise was insufficient. The low stability may be ascribed to SNF updating a fused network through a single local affinity and the average of the other similarities at every iteration. This update rule raises concern about the amplification of erroneous information derived from one data layer, especially when edge points exist. In contrast, HOPES provides a path-2 elucidated similarity determined by each pair of data types, which effectively addresses this issue. The latent variable-based methods, iCluster+ and moCluster, showed fairly good stability but poor accuracy on both synthetic datasets as noise increased. iCluster+ models a continuous variable as a linear combination of a specific intercept term, common latent variables, and residual variance, all of which follow a normal distribution. This assumption fits our noise and original data setting; however, it cannot accurately model the distribution of latent variables as a discrete sequence. Thus, iCluster+ handled noise well but was unable to capture the global structure. moCluster is based on a joint latent variable derived by consensus PCA, so it relies strongly on the selection of principal components. Moreover, the large gap between the feature magnitudes of distinct data types also affects accuracy. More specifically, the boxplots indicate the degree of dispersion and skewness in the data and show outliers over 20 repeated trials under low, medium, and high noise levels. As depicted in **Figures 3C–H**, HOPES achieved higher accuracy and more stable results than the other two methods in SimData1. In contrast, the results of moCluster were highly dispersed across repeated trials, which makes them less credible. After we introduced edge points in SimData2, the dispersion of every method increased slightly, but HOPES still performed best, in accordance with the previous results. Interestingly, moCluster appeared very stable when the noise level was low, but with moderate noise almost half of the trials were flagged as outliers, which suggests that this method exhibits large fluctuations.

FIGURE 6 | Clustering results of CESC and KIRC. The Kaplan-Meier survival curves *P*-values are recorded in Table 3 for (A) CESC and (B) KIRC, and 3D scatter plots for (C) CESC and (D) KIRC. Vertexes of scatter plots represent samples colored by their cluster label; the *x*-, *y*-, and *z*-axis represent the first three principal components of the fused data matrix.

### 3.2. Experiments for Cancer Diagnosis on Actual Cancer Datasets

We then tested whether HOPES can distinguish tumor samples from normal controls based on their omics measurements. We applied HOPES and the other comparative methods to combinations of COAD (92 samples), KIRC (122 samples), LUSC (106 samples) and 35 normal controls. The gene expression, methylation, and miRNA expression data for these case/control sets and the overlap among them are shown in **Figure 4**. The normal samples tested in this work were selected to have matching characteristics. The number of variables varies from the expression of 280 miRNAs to 23,360 methylation sites, the proportion of common variables differs between data types, and miRNA measurements have the largest proportion of overlap among all of the cancer types.

TABLE 3 | Survival analysis by Log-rank test on five tumor datasets.

We calculated the classification accuracy on the collected tumor vs. normal samples. **Table 2** shows the classification performance obtained either from a single data type or from the fusion methods, with the highest accuracy highlighted in bold. At the single data level, miRNA, which has the smallest number of measurements, showed the best sample classification performance, while methylation showed the worst. On average, the performance on fused data derived by HOPES and SNF is better than that for a single source. The good performance of data fusion is attributed to its ability to resist erroneous correlations or even negative effects, which not only enhances accuracy but also generates more stable results.

Nevertheless, integration methods such as iCluster+, which concatenate all of the features, rely strongly on a priori gene selection; therefore, if the number of variables is imbalanced, it is difficult to retain useful information. Thus, their classification accuracy falls between the worst and best of the single-level analyses, and the same holds for moCluster. Clusternomics extracts the global assignment from a mixture of local partitions, so if the clustering results are obscure in a single data layer, the global performance cannot be satisfactory. The sample size also strongly influences the performance of Clusternomics. We take the KIRC dataset as an example for further analysis. The fused data clustered tightly and uniformly, as shown by the heatmap of the similarity matrix (**Figure 5**). The clustering result obtained by HOPES (**Figure 5D**) was superior to that of each single source (**Figures 5A–C**): **Figure 5D** shows distinct boundaries between different clusters and a uniform structure within each cluster. The fused similarity between healthy samples is far greater than that between cancer samples, which demonstrates the heterogeneity of cancer. We also created a Venn diagram to examine the sample assignment by each single source and by the fused one. We found that the data fused by HOPES are robust to mistakes in each single source. More precisely, for 65% (102 of 157) of samples, there were incorrect assignments in at least one single data type analysis, while for 33% (53 of 157) of cases, the classification results were wrong in at least two single data types. However, only 7.6% (12 of 157) of cases were mis-assigned by our method (**Figure 5E**).

### 3.3. Prognostic Performance on Actual Cancer Datasets

To illustrate the prognostic ability of the elucidated similarity, we applied HOPES to five tumor omics datasets, namely CESC, GBM, COAD, KIRC, and LUSC. The similarities obtained by SNF and HOPES were used to cluster the tumor samples into three subtypes. The corresponding survival curves were drawn and compared by the log-rank test, with the statistical significance of the differences between them given by the P-value. To facilitate visual comparison, the survival curves and the first three principal components are shown in **Figure 6** and **Supplementary Figure 3**. The survival curves resulting from HOPES achieved the smallest P-values, highlighted in bold in **Table 3**. Consistent with the results of the synthetic experiments, HOPES showed the most clinically significant and reliable performance in all datasets. Since COAD contains only 92 samples with more than 20,000 gene features, Clusternomics cannot fit a mixture model for COAD.
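The log-rank comparison of survival curves can be illustrated with a minimal sketch. For brevity it covers the two-group case only; the analysis above compares three subtypes, which requires the k-sample generalization of the same statistic. The function name and the toy follow-up times are hypothetical.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*: follow-up durations; events_*: 1 if death observed, 0 if censored.
    Returns (chi-squared statistic, p-value with 1 degree of freedom).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    observed_a = expected_a = variance = 0.0
    # Iterate over the distinct times at which at least one event occurred.
    for t in sorted({ti for ti, e, _ in data if e}):
        n = sum(1 for ti, _, _ in data if ti >= t)               # at risk, both groups
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        d = sum(1 for ti, e, _ in data if ti == t and e)         # events at time t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)

        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group-A event count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    # Hypothetical follow-up times (months) for two subtypes.
    stat, p = logrank_two_groups([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                                 [10, 11, 12, 13, 14], [1, 1, 1, 1, 1])
    print(stat, p)
```

A smaller P-value indicates that the subtypes separate patients with more clearly different survival, which is the sense in which the P-values in **Table 3** are compared.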

To clarify the beneficial characteristics of the similarity elucidated by HOPES, we took CESC as another example for further analysis. We compared the clustering results on each single type of omics data alone with those for the elucidated similarity. The results are plotted as heatmaps in **Figure 7**. Notably, it is difficult to cluster each single type of omics data into sub-clusters. There are no legible block structures in **Figure 7A**, and only tiny sub-clusters in **Figures 7B,C**. Between different clusters, the cross sections show small differences in color, implying that the differences were negligible. At most, two clusters can be deduced from clustering by gene expression (**Figure 7A**), and there are no obvious sub-clusters either by methylation level (**Figure 7B**) or by miRNA expression (**Figure 7C**). In comparison, the clustering results after HOPES feature three distinct sub-clusters, and the last sub-cluster in the bottom-right corner exhibits a fairly homogeneous color within the cluster. The elucidated similarity makes it markedly easier to find sub-clusters that were concealed in the analyses of each type of omics data alone.

We also found that the elucidated similarity highlights the molecular heterogeneity in cervical carcinomas. The subtyping by HOPES differed from the histological classification, showing a discrepancy between phenotype and gene-level types. For instance, the sub-clusters by HOPES largely corresponded to those by methylation level. The CESC project classified samples into six subgroups by histology. To determine the correspondence between the histological classification and HOPES, we merged four different types of adenocarcinoma into one type, as done in studies of cervical cancer (Cancer Genome Atlas Research Network et al., 2017). The clusters produced by HOPES correlated strongly with the histological types, but were not the same; our cluster 3 contained all of the adenosquamous cases, while cluster 2 mainly consisted of cervical squamous cell carcinoma samples. We used the χ<sup>2</sup> test to determine whether two clustering results are significantly associated, and our cluster results showed a strong correlation with each single genomic data cluster, with small P-values (gene expression P = 1.28 × 10<sup>−6</sup>; methylation P = 7.94 × 10<sup>−9</sup>; miRNA expression P = 2.2 × 10<sup>−16</sup>).
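The χ<sup>2</sup> test of association between two clusterings operates on their contingency table. A minimal sketch of the Pearson statistic follows; the function name and labels are illustrative, and converting the statistic to a P-value requires the χ<sup>2</sup> distribution with the returned degrees of freedom, which is left to a statistics library.

```python
from collections import Counter


def chi_square_statistic(labels_a, labels_b):
    """Pearson chi-squared statistic and degrees of freedom for two labelings."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    rows, cols = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))  # contingency table as a sparse map

    stat = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n  # count expected under independence
            stat += (joint[(r, c)] - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof


if __name__ == "__main__":
    # Hypothetical fused clusters versus single-layer clusters for six samples.
    fused = [1, 1, 2, 2, 3, 3]
    single_layer = ["a", "a", "b", "b", "b", "c"]
    print(chi_square_statistic(fused, single_layer))
```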

#### 3.4. Functional Annotation of Relevant Features Among Cervical Cancer Subtypes

To demonstrate the biological significance of the subtypes derived by HOPES, we extracted the subset of the most relevant features among the original features and conducted a series of functional analyses on it. We chose the 15 most relevant features in each of the gene expression, methylation, and miRNA data for further analysis.

First, we constructed a corresponding heatmap with different clustering labels. In **Figure 8**, the selected signatures of all three data types are merged, showing a clear block form corresponding to the HOPES subgroups. Provided that these selected features are differentially expressed according to our clustering result, their biological annotation can help us confirm that the separation created by HOPES is not only clinically meaningful but also biologically significant. In terms of the gene expression pattern, subtype 1 (red), corresponding to lower expression of EPCAM, PPP1R9A, DDAH1, C17orf28, RICH2, and DNALI1, showed a longer survival time, while subtype 2 exhibited the opposite pattern in the same gene set. The subgroup with the poorest prognosis (blue) significantly corresponded to over-expression of LOC84931, PRAME, DBN1, SCAND3, and TUBB3. The methylation data specifically highlight subgroup 1 in the first five CpG sites (cg11796219, cg04778236, cg00757822, cg06888746, cg08749305); subgroup 2 shows down-regulation in the last three CpG sites (cg22958104, cg14193097, cg04206484); while subgroup 3 is relatively down-regulated in cg07258916, cg05869617, cg15966877, and cg22831949. The miRNA heatmap shows increased expression of hsa-miR-767, hsa-miR-3200, and hsa-miR-483, which correlates with decreased survival probability, and clearly up-regulated expression of hsa-miR-10a, hsa-miR-194-1, and hsa-miR-375 in subgroup 2.

Second, we performed survival analysis on each single feature using k-means as a general clustering method, and found that more than one third of the relevant features showed good partition ability, with a log-rank test p-value < 0.05, including five genes (LOC84931, DBN1, SCAND3, TUBB3, ICOS), six CpG sites (cg11796219, cg08749305, cg07258916, cg05869617, cg01762070, cg15966877), and six miRNAs (hsa-miR-767, hsa-miR-3200, hsa-miR-483, hsa-miR-9-2, hsa-miR-584, hsa-miR-342). **Figure 9** shows the Kaplan-Meier survival curves of the top 3 most significant features among genes, CpG sites, and miRNAs. Among these genes, DBN1 was identified as a useful oncofetal biomarker (Iyama et al., 2016) and is involved in the migration and invasion of glioma, colon, bladder and lung cancer (Mitra et al., 2011; Terakawa et al., 2013; Lin et al., 2014; Zwiener et al., 2014; Xu et al., 2015); TUBB3 was assessed as one of the predictive and prognostic factors in cervical cancer patients under different neoadjuvant regimens (Zwenger et al., 2015), and was also reported to be a useful prognostic biomarker in patients with advanced NSCLC (Li Z. et al., 2014). Moreover, ICOS was included in one of the genotype combinations (CD28/IFNG/ICOS) associated with cervical cancer (Guzman et al., 2008). To analyze each single CpG site, the R package "IlluminaHumanMethylation450kanno.ilmn12.hg19" was used to match each CpG site with its reference gene region. Among the most significant features, cg22831949 falls in PTPRN2, which was found to inhibit apoptosis and promote cancer formation in breast cancer (Sorokin et al., 2015); cg07258916 corresponds to PLXNA4, which belongs to the plexin family and was previously indicated to inhibit tumor cell migration (Balakrishnan et al., 2009); and cg11796219 matches C3orf21, whose ablation was shown to promote cell proliferation, inhibit apoptosis and accelerate cell migration in lung cancer.
The selected miR-767 contributes to the decrease of TET activity, which is a hallmark of cancer (Loriot et al., 2014). It is also known as a risk miRNA that significantly correlates with clinical outcomes in GBM (Li R. et al., 2014). Moreover, miR-483 can act as an antiapoptotic oncogene in many human cancers, such as Wilms' tumors and colon, liver, and breast cancers (Veronese et al., 2010). It was also identified as a predictor of poor prognosis in adrenocortical cancer (Soon et al., 2009). miR-9 was shown to be correlated with MYCN amplification, tumor grade, and metastatic status (Ma et al., 2010); more specifically, it was found to be associated with clear cell renal cell carcinoma, breast cancer, gastric carcinoma, and brain tumors (Lehmann et al., 2008; Luo et al., 2009; Nass et al., 2009; Hildebrandt et al., 2010).

To determine the functional relevance of the selected features, the identified genes and the target genes of the CpG sites and miRNAs were merged into a core set. We then performed GO enrichment analysis (Ashburner et al., 2000) and KEGG pathway analysis (Kanehisa et al., 2011) on it using the DAVID tools (Huang et al., 2008, 2009). The genes targeted by miRNAs were predicted by miRTarBase, an experimentally validated miRNA-target interaction database (Chou et al., 2017). We only used interactions supported by strong experimental evidence (reporter assay or western blot). The final core gene set included 173 genes, consisting of 15 original genes, 15 methylation-related genes, and 143 miRNA targets. We found that the whole core gene set was enriched in 56 GO biological process terms with Benjamini-corrected p-value < 0.05. **Figure 10** depicts the GO terms with p-value < 10<sup>−6</sup>; notably, these significant terms correlate strongly with cancer. One example is the most significant term, response to hypoxia. Numerous studies have confirmed that pathological hypoxia plays a pivotal role in cancer progression and migration (Muz et al., 2015). In addition, hypoxia-inducible factor 1α, which regulates genes involved in the response to hypoxia, was shown to be a strong prognostic marker in early stage cervical cancer (Birner et al., 2000). The regulation of cell proliferation, regulation of transcription from RNA polymerase II promoter, and regulation of apoptotic process participate in the full life cycle of tumors (Takeshima et al., 2009; Vander Heiden et al., 2009; Wong, 2011). For the KEGG analysis, a total of 46 pathways (Benjamini-corrected p-value < 0.05) were identified; **Figure 11** shows the pathways with p-value < 10<sup>−4</sup>. Among these pathways, cancer was the most common subclass, including pathways in cancer, microRNAs in cancer, bladder cancer, colorectal cancer and pancreatic cancer.
Besides direct cancer pathways, the PI3K-AKT-FoxO signaling cascade was identified, which has previously been shown to be involved in cancer and aging (Zhang et al., 2011). The PI3K/Akt signaling pathway leads to the inhibition of its downstream targets, the FoxO transcription factors, and FoxO is associated with cell cycle progression (Medema et al., 2000), apoptosis (Urbich et al., 2005), and angiogenesis (Tang and Lasky, 2003). Another study revealed that the activation of AMPK impedes cervical cancer cell growth through this PI3K-AKT-FoxO axis (Yung et al., 2013).
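The Benjamini-corrected p-values used to filter the GO terms and KEGG pathways above can be illustrated with a short sketch. It assumes that "Benjamini-corrected" refers to the Benjamini-Hochberg step-up adjustment; the function name and the example p-values are hypothetical and are not taken from the enrichment results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical enrichment p-values
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(p, q, "kept" if q < 0.05 else "dropped")
```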

In conclusion, we performed survival analysis, GO enrichment analysis, and KEGG pathway analysis on a subset of the most relevant gene expression, methylation and miRNA features corresponding to our HOPES subgroups. We found that these selected features were of great significance for clinical outcomes in cancer and for biological functions such as cancer cell proliferation, apoptosis, and angiogenesis. These findings not only demonstrate the biological meaning of our integrated clustering results, but also indicate that HOPES can serve as a preliminary step for prognostic biomarker detection.

### 4. DISCUSSION

The integrated analysis of multi-omics data can facilitate the study of molecular events at different periods of cancer progression and development, and complementary information can remove the effect of noise, leading to precise and useful classification results. Our proposed HOPES method integrates the similarity of different data layers to overcome the dimension and scale heterogeneity that hinders latent variable-based methods. The progressive fusion model based on high-order path similarity evaluates the strength of single-level specificity and global-level consistency together to produce a consistent and highly representative global similarity. The derived global similarity can filter out erroneous or single-level-specific ties. This procedure addresses the problem of introducing too much noise or distortion from partial structures in a single data set when the similarity information from all data types is integrated. Downstream consensus spectral clustering then helps to obtain reliable clustering results.

In practice, our method shows superior capabilities in distinguishing global patterns across multiple data sources. In addition, HOPES shows great robustness compared to the other methods, which are constrained by sample size or a priori feature selection. Since HOPES only uses the sample similarity information, its performance is independent of the data source, so it is promising for general usage. The fused similarity matrix shows higher tumor classification accuracy than any single data type or other integration method. Moreover, the clustering results of cancer patients feature significant separation regarding a prognostic indicator (survival time), which can contribute to cancer subtyping at the molecular level and to further clinical treatment. The obtained subgroups are also shown to be promising for the identification of potential biomarkers by revealing the key components that drive the differences between subgroups. The enrichment analysis of the key components confirmed the power of HOPES in discriminating the biomarkers.

#### DATA AVAILABILITY

The CESC dataset analyzed during the current study is available from the Broad Institute TCGA Genome Data Analysis Center with identifier "https://doi.org/10.7908/C11G0KM9" (Broad Institute TCGA Genome Data Analysis Center, 2016). The BRCA, LUSC, COAD, and GBM datasets that support the findings of this study are provided by Wang et al. (2014). The code used in this publication is freely available at github.com/scutbioinformatics/HOPES.

#### REFERENCES


### AUTHOR CONTRIBUTIONS

AX and HC conceived, designed, and supervised all phases of the project. AX performed experiments and wrote the manuscript. AX and JC performed the bioinformatics analysis. JC, HP, and GH contributed to discussions, and editing of the paper. All authors read and approved the final manuscript.

### FUNDING

This work is partially supported by the National Natural Science Foundation of China (61472145, 61372141, 61771007), Science and Technology Planning Project of Guangdong Province (2016A010101013, 2017B020226004), Applied Science and Technology Research and Development Project of Guangdong Province (2016B010127003), Guangdong Natural Science Foundation (2017A030312008), the Fundamental Research Fund for the Central Universities (2017ZD051) and Health Medical Collaborative Innovation Project of Guangzhou City (201803010021).

#### ACKNOWLEDGMENTS

We thank Liwen Bianji, Edanz Group China, for editing the English text of a draft of this manuscript.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00236/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xu, Chen, Peng, Han and Cai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# BayesPI-BAR2: A New Python Package for Predicting Functional Non-coding Mutations in Cancer Patient Cohorts

#### Kirill Batmanov<sup>1</sup> , Jan Delabie<sup>2</sup> and Junbai Wang<sup>1</sup> \*

<sup>1</sup> Department of Pathology, Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway, <sup>2</sup> Department of Pathology, University Health Network, Toronto, ON, Canada

#### Edited by:

Marko Djordjevic, University of Belgrade, Serbia

#### Reviewed by:

Dusanka Savic Pavicevic, University of Belgrade, Serbia Martin Taylor, The University of Edinburgh, United Kingdom Philipp Bucher, École Polytechnique Fédérale de Lausanne, Switzerland

\*Correspondence: Junbai Wang junbai.wang@rr-research.no

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 15 October 2018 Accepted: 15 March 2019 Published: 02 April 2019

#### Citation:

Batmanov K, Delabie J and Wang J (2019) BayesPI-BAR2: A New Python Package for Predicting Functional Non-coding Mutations in Cancer Patient Cohorts. Front. Genet. 10:282. doi: 10.3389/fgene.2019.00282

Most somatic mutations in cancer occur outside of gene coding regions. These mutations may disrupt gene regulation by affecting protein-DNA interactions. Studying these disruptions is important for understanding tumorigenesis. However, current computational tools process DNA sequence variants individually when predicting the effect on protein-DNA binding. Thus, it is a daunting task to identify functional regulatory disturbances among thousands of mutations in a patient. Previously, we reported and validated a pipeline for identifying functional non-coding somatic mutations in cancer patient cohorts by integrating diverse information such as gene expression, the spatial distribution of the mutations, and a biophysical model for estimating protein binding affinity. Here, we present a new user-friendly Python package, BayesPI-BAR2, based on the proposed pipeline for integrative whole-genome sequence analysis. This may be the first prediction package that considers information from both multiple mutations and multiple patients. It is evaluated in follicular lymphoma and skin cancer patients, focusing on sequence variants in gene promoter regions. BayesPI-BAR2 is a useful tool for predicting functional non-coding mutations in whole genome sequencing data: it allows the identification of novel transcription factors (TFs) whose binding is altered by non-coding mutations in cancer. The BayesPI-BAR2 program can analyze multiple datasets of genome-wide mutations at once and generate concise, easily interpretable reports on potentially affected gene regulatory sites. The package is freely available at http://folk.uio.no/junbaiw/BayesPI-BAR2/.

Keywords: gene regulation, transcription factors, cancer, bioinformatics, non-coding mutations

## INTRODUCTION

Somatic mutations are the primary cause of cancer. Although most studies of cancer genomes to date have focused on mutations occurring within exons, recent efforts have made whole genome sequences of paired tumor and normal samples widely available, facilitating the analysis of noncoding variants in cancer. In many cases, such variants have been shown to affect gene expression and to promote tumorigenesis (Khurana et al., 2016). One mechanism by which non-coding variants can affect gene expression is the alteration of TF binding to mutated DNA sequences. For example, a mutation may disrupt a TF binding site, preventing the TF from recognizing its target sequence, or a new binding site may be created by a mutation. Several computational tools are available to predict such effects, e.g., GERV (Zeng et al., 2016), atSNP (Zuo et al., 2015), BayesPI-BAR (Wang and Batmanov, 2015), among others. All these tools have the same mode of operation: given a mutation, typically a SNV, and a set of TF-DNA binding models, they produce a list of TFs whose binding is possibly affected by the SNV, ordered by the effect size and/or certainty. However, the predicted list may contain dozens of TFs for every SNV. Adding to the complexity of the issue, each cancer sample may have thousands of SNVs, which makes it difficult to interpret the results. Importantly, there is no software package available today to perform such an analysis for a patient cohort based on genome-wide sequencing data, considering recurring effects of mutations among several patients.

**Abbreviations:** BayesPI-BAR, Bayesian modeling of Protein-DNA Interaction and Binding Affinity Ranking; FL, follicular lymphoma; PWM, position weight matrix; SNV, single nucleotide variant; TF, transcription factor.

The BayesPI-BAR2 package presented here aims to solve these problems. It ranks TFs affected by SNVs through a new BayesPI-BAR algorithm (Batmanov et al., 2017), augmented with a set of tools to find mutation hotspots among patients and mutations linked to differentially expressed genes. The pipeline collects information about the SNVs of all patients in the mutation hotspot regions, and then evaluates the significance of the predicted effects against randomly generated background mutation models. The methodology behind the BayesPI-BAR2 package and the robustness of its predictions were validated in a previous study (Batmanov et al., 2017). Now, a user-friendly Python package has been developed based on the proposed pipeline. The package is evaluated in both FL and skin cancer patients, using mutations called from whole genome sequencing experiments. Using genome-wide sequencing data, BayesPI-BAR2 may reveal novel regulatory sites that are disrupted by mutations in cancer or other diseases, similar to the findings in Weinhold et al. (2014). Additionally, it can identify novel TFs whose binding is altered by non-coding mutations in the genome (Batmanov et al., 2017). It is useful not only for the study of regulatory mutations in cancer, but also for similar research in other diseases.

### MATERIALS AND METHODS

#### Overview of BayesPI-BAR2 Python Package

The operation of the BayesPI-BAR2 pipeline is illustrated in **Figure 1**. It is motivated by the work of Batmanov et al. (2017), in which novel mutations affecting gene regulation were discovered in FL patients by considering diverse genome information. The original analysis pipeline consisted of various scripts implemented in different programming languages. Here, a completely new Python package was built with enhanced functionality and user-friendly command line options. In particular, the old BayesPI-BAR program (Wang and Batmanov, 2015), a combination of R and Perl programs, was reimplemented in Python with a more efficient algorithm and flexible parallelization. This computationally demanding task can now be parallelized automatically, either on a single multi-core machine or on a cluster supporting the SLURM job queue manager.

The BayesPI-BAR2 Python package first finds DNA regions that have a high mutation density and are close to differentially expressed genes, then predicts TF affinity changes in these regions using the new BayesPI-BAR, and finally tests the significance of these predicted changes against a background model. All analyses are carried out by a set of command line tools written in Python 2. The package also includes binary files of the new BayesPI program (Wang and Morigen, 2009), which can infer new TF binding affinity models (PWMs), such as models with dinucleotide interdependence (Wang, 2014) and DNA shape-restricted dinucleotide models (Batmanov and Wang, 2017), and can compute TF-DNA differential binding affinity (dbA) scores (Wang et al., 2015). There is also a demo script in the package that shows a full pipeline execution. The BayesPI-BAR2 Python package is a useful tool for identifying functional regulatory mutations in cancers or other diseases, based on whole genome sequencing experiments. For a more detailed description of the package, please refer to the following sections and to Batmanov et al. (2017).

#### Identification of Mutation Hot Regions and Patient-Specific Mutation Blocks

In the first step of the BayesPI-BAR2 pipeline, highly mutated DNA sequence regions (mutation hotspots) are identified by a method described in Batmanov et al. (2017), which considers mutations from several patients to define a set of regions. By default, BayesPI-BAR2 searches for putative mutation hotspot regions near the transcription start sites (TSS) of differentially expressed genes, because important regulatory sequences (e.g., those carrying functional regulatory mutations) are often located in promoters. To obtain robust mutation calls (Alioto et al., 2015) in the promoter region, a minimum sequencing depth of 30 is recommended. The significance of differential expression is tested by the two-sample Kolmogorov–Smirnov test, in which the reads per kilobase of exon model per million mapped reads (RPKM) values of the patients' RNA-seq data are compared to those of the normal samples (e.g., P < 0.05). Since RPKM-based differential expression tests may be affected by experimental biases (Bullard et al., 2010) and result in imprecise predictions, a multiple testing correction of the P-values is not recommended. Nevertheless, by changing the threshold value of the pipeline, it is easy to apply the Bonferroni correction to the P-values. Alternatively, users can apply external software to perform the differential gene expression analysis and directly input the gene list into the BayesPI-BAR2 package.
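As an illustration of the differential expression test described above, the sketch below computes the two-sample Kolmogorov–Smirnov statistic for one gene. It is not the package's own code: the function name and the RPKM values are hypothetical, and turning the statistic into a P-value is left to a statistics library (for example, `scipy.stats.ks_2samp`).

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D = max |F_a(x) - F_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    i = j = 0
    d = 0.0
    # Sweep through the pooled values in increasing order and track the
    # gap between the two empirical distribution functions.
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


if __name__ == "__main__":
    # Hypothetical RPKM values of one gene in patients and in normal samples.
    patient_rpkm = [12.1, 15.3, 14.8, 20.2, 18.7]
    normal_rpkm = [5.2, 6.1, 4.9, 7.3]
    print(ks_statistic(patient_rpkm, normal_rpkm))
```

Once one sample is exhausted, the remaining gap can only shrink, so the sweep can stop early without missing the maximum.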

Subsequently, the MuSSD (Mutation filtering based on the Space and Sample Distribution) algorithm (Batmanov et al., 2017) is applied to the promoter regions of differentially expressed genes. Based on the mutation hotspot regions identified by MuSSD, patient-specific mutation blocks are built: the reference sequence is taken from the reference genome assembly according to the region covered by the mutation hotspot (possibly including patient germline variants), and the alternate sequence contains all mutations from the same patient in the region. In the BayesPI-BAR2 package, the computational predictions of both the mutation hotspot regions and the patient-specific mutation blocks are implemented in Python, with a more efficient algorithm than the original MATLAB script (Batmanov et al., 2017).
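The construction of a patient-specific mutation block can be sketched as follows. This is a simplified illustration that handles SNVs only; the function name, the coordinate convention and the example sequence are assumptions and do not reproduce the package's interface.

```python
def build_mutation_block(reference, region_start, snvs):
    """Apply one patient's SNVs to the reference sequence of a hotspot region.

    reference    : reference sequence covering the hotspot region
    region_start : genomic coordinate of reference[0] (assumed convention)
    snvs         : iterable of (position, ref_base, alt_base) for one patient
    Returns (reference sequence, alternate sequence).
    """
    alternate = list(reference)
    for position, ref_base, alt_base in snvs:
        offset = position - region_start
        if not 0 <= offset < len(alternate):
            raise ValueError("SNV at %d lies outside the region" % position)
        if reference[offset] != ref_base:
            raise ValueError("reference base mismatch at %d" % position)
        alternate[offset] = alt_base  # all of the patient's SNVs go into one block
    return reference, "".join(alternate)


if __name__ == "__main__":
    # Hypothetical hotspot starting at coordinate 100 with two SNVs in one patient.
    ref_seq, alt_seq = build_mutation_block(
        "ACGTAC", 100, [(101, "C", "T"), (104, "A", "G")]
    )
    print(ref_seq, alt_seq)
```

Keeping all mutations of one patient in the same alternate sequence is what lets the later affinity comparison reflect the combined effect of nearby SNVs rather than each SNV in isolation.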

#### BayesPI TF-DNA Binding Affinity Model

The basic biophysical model for computing TF-DNA binding affinity, named BayesPI, was first reported in Wang and Morigen (2009). The TF-DNA binding probability is derived from the statistical mechanical theory of TF-DNA interactions (Djordjevic et al., 2003; Foat et al., 2006), which can be shown as

$$P(S, w, \mu) = \sum_{i=0}^{N-M} \frac{1}{1 + e^{E_{\text{indep}}(S_{i:i+M},\, w) - \mu}}$$

where S<sub>i,a</sub> = 1 if the DNA sequence has nucleotide a (one of A, C, G, T) at position i and S<sub>i,a</sub> = 0 otherwise, N is the sequence length, M is the length of the binding motif, and µ is the chemical potential of the TF or its concentration in the nucleus. The selection of µ (e.g., µ = 0, −10, −13, −15, −18, −20) is based on a previous study (Wang and Batmanov, 2015) of the effect of DNA sequence variants on TF binding affinity changes, in which verified regulatory mutations in the human genome were used to infer the dynamic range of chemical potentials.

$$E_{\text{indep}}(S, w) = \sum_{j=0}^{M-1} \sum_{a=1}^{4} w_{j,a} S_{j,a}$$

E<sub>indep</sub>(S, w) is the binding energy of the TF to a short DNA fragment of length M bp. This model assumes that the nucleotide at each binding position contributes to the binding energy independently. The matrix w ∈ ℝ<sup>M×4</sup> is called the position-specific affinity matrix (PSAM), where w<sub>j,a</sub> is the binding energy of nucleotide a at position j of the DNA fragment. The BayesPI-BAR2 Python package includes a collection of PSAMs derived from previously published work (Kheradpour and Kellis, 2014), and several new BayesPI features have been added [e.g., PSAMs with dinucleotide interdependence (Wang, 2014) and DNA shape-restricted dinucleotide models (Batmanov and Wang, 2017)].
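The two formulas above translate directly into a sliding-window computation. The following Python sketch is only an illustration of the model under the independence assumption: the function names are hypothetical, the PSAM is represented as a list of M rows with four energies in the order A, C, G, T (an assumed layout), and the code is not taken from the BayesPI-BAR2 package.

```python
import math

NUCLEOTIDES = "ACGT"  # assumed column order of the PSAM


def binding_energy(window, psam):
    """E_indep: sum of the energies w[j][a] of the nucleotides in the window."""
    return sum(psam[j][NUCLEOTIDES.index(base)] for j, base in enumerate(window))


def binding_probability(sequence, psam, mu):
    """P(S, w, mu): sum over all N - M + 1 windows of 1 / (1 + exp(E - mu))."""
    m = len(psam)
    total = 0.0
    for i in range(len(sequence) - m + 1):
        energy = binding_energy(sequence[i:i + m], psam)
        total += 1.0 / (1.0 + math.exp(energy - mu))
    return total


# Toy PSAM of length M = 2 with all energies equal to zero (illustrative only).
toy_psam = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
toy_value = binding_probability("ACGT", toy_psam, mu=0.0)
```

With all energies and µ equal to zero, each of the three windows of "ACGT" contributes 1/2, so the sum is 1.5. This also shows that the quantity is a sum over windows and can therefore exceed 1 for long sequences.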

#### BayesPI-BAR Approach

The Bayesian modeling of Protein-DNA Interaction and Binding Affinity Ranking (BayesPI-BAR) method (Wang and Batmanov, 2015) is used to evaluate the significance of TF binding affinity changes caused by DNA sequence variants. It is based on an idea for distinguishing direct from indirect TF binding described in Wang et al. (2015). A new quantity, dbA, is introduced to measure the binding strength above the background level. The BayesPI-BAR Python code computes the shifted differential binding affinity (δdbA) for each sequence variant and TF:

$$\delta \text{dbA}(S_{\text{ref}}, S_{\text{alt}}) = \text{dbA}(S_{\text{alt}}) - \text{dbA}(S_{\text{ref}})$$

S<sub>ref</sub> and S<sub>alt</sub> represent the reference and alternate sequences, respectively, and δdbA is the measure of affinity change used by BayesPI-BAR. More details about the BayesPI-BAR approach are available in the Supplementary Material and in Batmanov et al. (2017).

#### Significance Testing for TF Binding Affinity Changes

To test the significance of the disruption of TF-DNA binding by patient SNVs, the patient-specific δdbA values of a given regulatory mutation block are compared with those of randomly generated background mutation blocks, using the two-sided rank-sum test. BayesPI-BAR2 offers three alternative mutation models to generate the background: a tumor-derived mutation model, a k-mer mutation signature such as those available from COSMIC (Tate et al., 2018), and a uniform mutation model. BayesPI-BAR2 exports a list of TF binding effects that are significantly stronger than estimated by the background model.
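As an illustration of the simplest of the three background models, the sketch below generates background blocks under a uniform mutation model. It is a simplified reading of that model and not the package's implementation: the function name is hypothetical, and the sketch assumes that the block length and the number of substitutions are kept fixed and that the reference sequence contains only A, C, G, and T.

```python
import random

NUCLEOTIDES = "ACGT"


def uniform_background_block(reference, n_mutations, rng):
    """Return a copy of `reference` carrying `n_mutations` random substitutions.

    Positions are drawn uniformly without replacement, and each selected
    position is changed to one of the three other nucleotides with equal
    probability (the uniform assumption of this sketch).
    """
    positions = rng.sample(range(len(reference)), n_mutations)
    mutated = list(reference)
    for pos in positions:
        alternatives = [base for base in NUCLEOTIDES if base != mutated[pos]]
        mutated[pos] = rng.choice(alternatives)
    return "".join(mutated)


# Example: 1000 background blocks for a toy 10-bp region with 3 mutations each.
rng = random.Random(0)
toy_reference = "ACGTACGTAC"
background = [uniform_background_block(toy_reference, 3, rng) for _ in range(1000)]
```

A tumor-derived or k-mer signature model would replace the two uniform draws with draws weighted by the observed mutation frequencies.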

Since patient mutation blocks are pre-filtered by the MuSSD algorithm based on the space and sample distribution of mutations, there are several constraints on the background mutation blocks: (a) both the size and the mutation count of each background mutation block are kept the same as those of the patient blocks; (b) the DNA sequence is selected randomly from the same regions as the patient mutation block; and (c) the distributions of the mutation positions and the nucleotide changes are based on a specific mutation signature, such as tumor-derived mutations. To evaluate the relationship between the number of background blocks and the precision of the background δdbA model, a few simulations are displayed in **Figure 2**. They show that the fraction of significant TFs reaches a plateau when more than 1000 blocks are used. The significance test for TF-DNA binding affinity changes proceeds in the following three steps:


FIGURE 2 | Estimation of sufficient background samples for BayesPI-BAR2 package. The plot displays the dependency of significant TF discovery on the number of background samples used. Significant TFs in the mutation blocks from two different datasets are considered: (1) two BCL2 blocks from FL dataset with 14 patients affected, blue line; (2) and the TERT block from skin cancer dataset with 58 patients affected, green line. On the X-axis, we plot the number of background mutation blocks taken. On the Y-axis, we plot the number of significant TFs found when using X background mutation blocks, which are also significant when using the full set of 10000 background blocks. Y is normalized by the number of significant TFs discovered using the full background set. Therefore, Y = 1 corresponds to the same result as using the full background set.

The significance testing considers both the strength of the TF binding affinity change and the recurrence of δdbA values across samples, using the Bonferroni correction for the number of TFs tested. A stronger P-value correction procedure may not be suitable here. For example, the Benjamini–Hochberg (BH) false discovery rate procedure requires the P-values to be independent (or to have limited dependencies) (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001), but there are strong dependencies among the P-values of the significance tests for TF binding affinity changes. Often, the P-values of very similar PWMs are close to each other, which may result in unreliable correction by the BH procedure. The Bonferroni correction makes no assumptions about the process used to generate the P-values, which makes it suitable for the current study. At least 10 samples are needed to perform a proper statistical test in BayesPI-BAR2. If the sample size is too small, it will be difficult to achieve statistical significance with the rank-sum test, even if the effects are large (Wild and Seber, 2011).
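The comparison of patient and background δdbA values, followed by the Bonferroni correction over the tested TFs, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the package's code: the function names are hypothetical, and the rank-sum P-value uses the normal approximation with midranks for ties and no continuity correction. In practice, a library routine such as `scipy.stats.ranksums` would normally be preferred.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation). Returns (z, p)."""
    n, m = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    total = n + m
    ranks = [0.0] * total
    tie_term = 0.0
    i = 0
    while i < total:
        j = i
        while j < total and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n * (total + 1) / 2.0
    variance = n * m / 12.0 * ((total + 1) - tie_term / (total * (total - 1)))
    if variance == 0.0:
        return 0.0, 1.0  # all values identical: no evidence of a shift
    z = (rank_sum_x - mean) / math.sqrt(variance)
    return z, math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni_significant(p_values, alpha=0.05):
    """Return the TF names whose P-value is below alpha / (number of TFs tested)."""
    threshold = alpha / len(p_values)
    return sorted(tf for tf, p in p_values.items() if p < threshold)
```

Here `x` would hold the patient δdbA values of one TF in one mutation block and `y` the δdbA values of the background blocks; the dictionary passed to `bonferroni_significant` maps each tested TF to its P-value.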

### Algorithm Efficiency and Parallel Computation

Computation of scores is the most time-consuming task that is needed for both the patient and the background mutation blocks.

The old R program (Wang and Batmanov, 2015) was designed to evaluate TF binding affinity changes for a single mutation and was unable to process multiple mutations simultaneously. In the new Python package, a parallel computation paradigm is implemented using a more efficient data processing library. Additionally, the efficiency of the BayesPI code was improved by rewriting a subexpression of the TF binding probability (please refer to the BayesPI TF-DNA binding affinity model section):

$$e^{\sum_{j=0}^{M-1} \sum_{a=1}^{4} w_{j,a} S_{j,a} - \mu} = e^{-\mu} \prod_{j=0}^{M-1} \prod_{a=1}^{4} \left( e^{w_{j,a}} \right)^{S_{j,a}}$$

where the terms exp(w<sub>j,a</sub>) and exp(−µ) on the right-hand side of the formula are precomputed and stored, which avoids computing the exponential in every sliding window. The new implementation reduces the computational time by about 90%. In addition, in the BayesPI-BAR2 Python package, all calculations are parallelized across either multiple local CPUs or multiple nodes on a cluster using the SLURM workload manager. For instance, it takes about 5 h to process all mutation blocks in the skin cancer dataset (263 patients; ∼100000 mutations) using 8 nodes with 8 CPUs each. The overall waiting time can be further reduced if more parallel processes are used or fewer mutation blocks are selected for testing. The user guide and package architecture of BayesPI-BAR2 are available in the **Supplementary Section**.
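The effect of the identity above can be checked with a small Python sketch that evaluates the binding probability in both ways. The function names and the PSAM layout (M rows of four energies in the order A, C, G, T) are assumptions of this example, and the toy numbers are arbitrary; the sketch only demonstrates that the precomputed factors give the same result as the direct formula.

```python
import math

NUCLEOTIDES = "ACGT"  # assumed column order of the PSAM


def binding_probability_naive(sequence, psam, mu):
    """Direct evaluation: one exponential per sliding window."""
    m = len(psam)
    total = 0.0
    for i in range(len(sequence) - m + 1):
        energy = sum(psam[j][NUCLEOTIDES.index(sequence[i + j])] for j in range(m))
        total += 1.0 / (1.0 + math.exp(energy - mu))
    return total


def precompute_factors(psam, mu):
    """Compute exp(w[j][a]) and exp(-mu) once, before scanning any sequence."""
    exp_w = [[math.exp(w) for w in row] for row in psam]
    return exp_w, math.exp(-mu)


def binding_probability_fast(sequence, exp_w, exp_neg_mu):
    """Same quantity, using only multiplications inside the sliding window."""
    m = len(exp_w)
    codes = [NUCLEOTIDES.index(base) for base in sequence]
    total = 0.0
    for i in range(len(codes) - m + 1):
        boltzmann = exp_neg_mu
        for j in range(m):
            boltzmann *= exp_w[j][codes[i + j]]
        total += 1.0 / (1.0 + boltzmann)
    return total


# Arbitrary toy values, for illustration only.
toy_psam = [[0.5, -1.0, 0.2, 1.3], [-0.7, 0.4, 0.0, 0.9], [1.1, -0.3, 0.6, -1.2]]
toy_sequence = "ACGTTGCAAC"
exp_w, exp_neg_mu = precompute_factors(toy_psam, mu=-10.0)
fast = binding_probability_fast(toy_sequence, exp_w, exp_neg_mu)
naive = binding_probability_naive(toy_sequence, toy_psam, mu=-10.0)
```

The precomputed table has only 4M entries per PSAM and chemical potential, so it can be reused for every reference, alternate, and background sequence scored with the same TF model.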

#### RESULTS

#### Validating New Python Code in Verified Regulatory Mutations

The precision of the new BayesPI-BAR Python program, which is the basis of the BayesPI-BAR2 package, was first assessed on a benchmark dataset of 67 SNVs with experimentally verified effects on TF binding. The results match those of the previous study (Wang and Batmanov, 2015).

#### Evaluating the New BayesPI-BAR2 Package in Follicular Lymphoma

A previous analysis of regulatory mutations in FL cancer patients was performed by running various scripts manually. The new BayesPI-BAR2 Python package was applied to the same FL patients, considering only the gene promoter regions (e.g., TSS ± 1000 bp with 795 called SNVs) that were investigated before (Batmanov et al., 2017). Putative mutation hot blocks near the BCL6, BCL2, and HIST1H2BM genes are detected automatically, containing 34, 40, and 2 SNVs, respectively. The results match the earlier report (Batmanov et al., 2017). The mutation effects on TF binding at the promoters of two important FL genes (BCL6 and BCL2) (Pasqualucci et al., 2014) were also recovered: for example, the regulatory activities of two TFs (FOXD2 and FOXD3) on BCL6 and BCL2 were confirmed previously by knockdown experiments in SUDHL4 lymphoma cells (Batmanov et al., 2017). The new BayesPI-BAR2 Python package can reproduce the previous results (Batmanov et al., 2017) and is robust in predicting functional regulatory mutations.

### Applying BayesPI-BAR2 on Genome-Wide Sequencing Data of Skin Cancer

The somatic mutations and RNA-Seq counts for the skin cancer evaluation were downloaded from the public DCC data release 23 at the International Cancer Genome Consortium (ICGC) data portal, from the MELA-AU, SKCA-BR, and SKCM-US projects. The dataset contains 23 million mutations called from whole genome sequence analysis of 263 patients. Melanoma, or skin cancer, has the highest prevalence of somatic mutations across human cancer types, more than ten times higher than that in lymphoma (Alexandrov et al., 2013). There are frequent driver coding mutations in melanoma (Hodis et al., 2012; Roberts and Gordenin, 2014). Therefore, DNA regions from 2 Kbp upstream to 100 bp downstream of the TSS of protein-coding genes [e.g., GENCODE (Harrow et al., 2012)] were selected, and genes differentially expressed between the patient RNA-Seq data and the normal melanocyte RNA-Seq data (Haltaufderhyde and Oancea, 2014) were used in this study (10015 genes with ∼99173 mutations).

After applying the BayesPI-BAR2 Python package, 166 putative regulatory mutation blocks were detected (containing 2746 mutations). A list of the 15 most highly mutated blocks is shown in **Supplementary Table 1**, where blocks matching previous findings are marked and the corresponding publications are cited. A mutation block near the TERT gene has the most patients affected (58), closely followed by blocks near several housekeeping genes (RPL\*, RPS\*, and others). This is in agreement with previous studies (Weinhold et al., 2014; Poulos et al., 2015). It has been suggested that these mutations are due to the vulnerability of some DNA positions to ultraviolet light damage (Fredriksson et al., 2017). In the TERT mutation block, significantly affected TFs were also predicted automatically by BayesPI-BAR2 (e.g., Wilcoxon rank-sum test P < 0.001 with Bonferroni correction; **Figure 3**). These TFs split into two groups: positive change (creation of binding sites) at the top, in orange, and negative change (destruction of existing binding sites) at the bottom, in blue. The heatmap in **Figure 3** shows the variation of affinity changes among the 58 patients who harbor at least one mutation in the TERT block. Nine of the seventeen positively affected TFs belong to the ETS protein family and are the most significantly affected ones. This is also in agreement with the well-known pathomechanisms of melanoma (Huang et al., 2013). When testing the significance of affinity changes against the skin cancer specific mutation signature model and a uniform model, the same significantly affected TFs were found in the TERT block, with small differences in the ranking (**Supplementary Figures 1**, **2**).

Additionally, BayesPI-BAR2 discovers novel regulatory mutations that affect gene expression in skin cancer. For instance, binding of TFs from the Sp/KLF and ETS families was found to be disrupted (e.g., about 47 patients with mutations; **Supplementary Table 1**) in a mutation block near RALY. RALY is differentially expressed between the skin cancer patients and the normal control samples. It is an RNA-binding protein that may play a role in pre-mRNA splicing. Based on human phenotype association evidence for RALY from the GWAS Catalog (MacArthur et al., 2017), we found that mutations of this gene are associated with melanoma, skin pigmentation, and skin sensitivity to sun. The next most frequent mutation block was predicted near RPS27 (e.g., 46 patients with mutations), where binding of the TBP, ETS, and IRF TF families is interrupted. RPS27 mutation and its elevated expression have been detected in many melanoma patients and in various human cancers (Dutton-Regester et al., 2014). The two newly discovered regulatory mutation blocks may contribute to the dysregulation of RALY and RPS27 and are worthy of further investigation because both genes are known to be significantly associated with melanoma. Thus, BayesPI-BAR2 can not only automatically recover known gene regulatory disturbances but also discover novel ones that can be tested in the wet lab. The BayesPI-BAR2 Python package comes with the code to perform the complete analysis of this melanoma dataset.

#### DISCUSSION AND CONCLUSION

The new BayesPI-BAR2 Python package has been evaluated in both small (e.g., 14 FL patients) and large (e.g., 263 skin cancer patients) cancer patient cohorts, based on whole genome sequencing experiments. It achieves good prediction accuracy and automatically reproduces the published results. The new package can be used to investigate previously unknown regulatory effects, even if the sample size is small and the recurrent mutation frequency is low. Nevertheless, the robustness of the significance test in BayesPI-BAR2 depends on the sample size (Biau et al., 2008); a small sample size may make it difficult to reach statistical significance. For example, 3 mutation blocks from 14 FL patients pass the test of significant TF binding affinity changes (P-values < 0.05), whereas 15 mutation blocks from 263 skin cancer samples pass a more stringent criterion (P-values < 0.001). Therefore, a large sample size is preferred when using BayesPI-BAR2 to predict putative functional non-coding mutations.

The BayesPI-BAR2 approach is more general than a previous mutation recurrence analysis (Weinhold et al., 2014), because it takes into account both the recurrence of the mutation among multiple patients and the effect on TF binding. In other words, different mutations may contribute to the creation or disruption of the same regulatory link in different patients. For example, there are two canonical, highly recurrent mutations in the TERT promoter: C > T at chr5:1,295,228 and chr5:1,295,250. Both of these mutations create ETS binding sites. Though six of the fifty-eight patients did not have these two mutations, some ETS factors are positively affected in five of them (**Figure 3**). This indicates that other non-canonical mutations in the TERT promoter may also create ETS binding sites.

Although BayesPI-BAR2 requires heavy computation, the waiting time can be significantly reduced by distributing more jobs across a high-performance computing system. In the study of 263 skin cancer patients, the total waiting time was reduced to 1 h and 30 min when using 10 nodes with 10 CPUs each on the ABEL computer cluster at the University of Oslo. On average, approximately 6 min are needed to complete the calculation for one mutation block. The efficiency of BayesPI-BAR2 can be further improved by applying advanced sampling methods and parallel algorithms, or by implementing it on graphics processing units (GPUs) (Zou et al., 2018). Alternatively, if more prior information regarding mutation blocks (e.g., differential methylation, nucleosome occupancy, active enhancer/promoter histone markers, and predicted long distance gene regulations) (Wang et al., 2013; Cao et al., 2017; Dhingra et al., 2017) is available, then fewer mutation blocks will be selected for testing against the background models. Thus, additional information can also reduce the total computation time significantly. These new features will be implemented in the future.

The new BayesPI-BAR2 Python package allows analysis of non-coding mutations in cancer patient cohorts, discovering mutation hotspots, and predicting the effects of these mutations on TF-DNA binding. Unlike previously available tools, it considers the frequency of mutations and their recurrence across patients, and integrates this information with the predicted affinity changes using a simple and statistically sound approach. Although it is in principle applicable to any mutation dataset, BayesPI-BAR2 is designed for the typical cancer use case, with the goal of finding a few non-random effects among many somatic mutations. The package can be a useful tool for in-depth analysis of non-coding mutations detected in whole genome sequencing experiments, as well as for predicting their effects on genome regulation in cancer. All in all, it provides a reasonable number of predictions for further experimental validation.

#### DATA AVAILABILITY

The package source code, binaries for Linux and OS X, and demo datasets are available at http://folk.uio.no/junbaiw/BayesPI-BAR2/; Project name: BayesPI-BAR2 Package; Operating system(s): Linux and OS X; Programming language: Python; License: General Public License (GNU GPLv3); Any restrictions to use by non-academics: None. The datasets analyzed during the current study are available in the public DCC data release 23 at the ICGC data portal: https://dcc.icgc.org/releases/release_23/Projects.

### AUTHOR CONTRIBUTIONS

KB implemented the BayesPI-BAR2 pipeline in Python. JD validated study. JW conceived project, designed BayesPI-BAR2 pipeline, and contributed in developing package. KB and JW drafted manuscript. All authors read and approved the final manuscript.

### FUNDING

This work was supported by the Norwegian Cancer Society (DNK 2192630-2012-33376, DNK 2192630-2013-33463, and DNK 2192630-2014-33518), South-Eastern Norway Regional Health Authority (HSØ 2017061 and HSØ 2018107), and the Norwegian Research Council NOTUR project (nn4605k).

#### ACKNOWLEDGMENTS

The authors thank Prof. Magnar Bjørås for proofreading the article and Ms. Amna Farooq for manuscript editing.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00282/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Batmanov, Delabie and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Multi-Omic Data Interpretation to Repurpose Subtype Specific Drug Candidates for Breast Cancer

Beste Turanli<sup>1,2,3†</sup>, Kubra Karagoz<sup>4†</sup>, Gholamreza Bidkhori<sup>2</sup>, Raghu Sinha<sup>5</sup>, Michael L. Gatza<sup>4</sup>, Mathias Uhlen<sup>2</sup>, Adil Mardinoglu<sup>2,5,6</sup>\* and Kazim Yalcin Arga<sup>1</sup>\*

#### Edited by:

Junbai Wang, Oslo University Hospital, Norway

#### Reviewed by:

Woonyoung Choi, The Johns Hopkins Hospital, United States Diego Bonatto, Federal University of Rio Grande do Sul, Brazil

#### \*Correspondence:

Adil Mardinoglu adilm@scilifelab.se Kazim Yalcin Arga kazim.arga@marmara.edu.tr

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 23 November 2018 Accepted: 17 April 2019 Published: 07 May 2019

#### Citation:

Turanli B, Karagoz K, Bidkhori G, Sinha R, Gatza ML, Uhlen M, Mardinoglu A and Arga KY (2019) Multi-Omic Data Interpretation to Repurpose Subtype Specific Drug Candidates for Breast Cancer. Front. Genet. 10:420. doi: 10.3389/fgene.2019.00420

<sup>1</sup> Department of Bioengineering, Marmara University, Istanbul, Turkey, <sup>2</sup> Science for Life Laboratory, KTH – Royal Institute of Technology, Stockholm, Sweden, <sup>3</sup> Department of Bioengineering, Istanbul Medeniyet University, Istanbul, Turkey, <sup>4</sup> Department of Radiation Oncology, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, United States, <sup>5</sup> Department of Biochemistry and Molecular Biology, Penn State College of Medicine, Hershey, PA, United States, <sup>6</sup> Faculty of Dentistry, Oral and Craniofacial Sciences, Centre for Host-Microbiome Interactions, King's College London, London, United Kingdom, <sup>7</sup> Department of Chemical and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden

Triple-negative breast cancer (TNBC), which is largely synonymous with the basal-like molecular subtype, is the 5th leading cause of cancer deaths for women in the United States. The overall prognosis for TNBC patients remains poor given that few treatment options exist, including targeted therapies (not FDA approved) and multi-agent chemotherapy as the standard-of-care treatment. TNBC, like other complex diseases, is governed by perturbations of complex interaction networks; elucidating the underlying molecular mechanisms of this disease in the context of network principles therefore has the potential to identify targets for drug development. Here, we present an integrated "omics" approach based on the use of transcriptome and interactome data to identify dynamic/active protein-protein interaction networks (PPINs) in TNBC patients. We have identified three highly connected modules, EED, DHX9, and AURKA, which are highly activated in TNBC tumors compared to both normal tissues and other breast cancer subtypes. Based on the functional analyses, we propose that these modules are potential drivers of proliferation and, as such, should be considered candidate molecular targets for drug development or drug repositioning in TNBC. Consistent with this argument, we repurposed steroids, anti-inflammatory agents, anti-infective agents, and cardiovascular agents for patients with basal-like breast cancer. Finally, we have performed essential metabolite analysis on personalized genome-scale metabolic models and found that metabolites such as sphingosine-1-phosphate and cholesterol-sulfate are of utmost importance in TNBC tumor growth.

Keywords: breast cancer, drug repositioning, non-cancer therapeutics, repurposing, basal subtype, personalized metabolic models

### INTRODUCTION


Breast cancer is the most commonly diagnosed cancer and the second leading cause of cancer-related deaths in women in the United States, with an estimated 268,600 new cases and 41,760 deaths in 2019 (Siegel et al., 2019). Overall survival has significantly improved over the past several decades, owing in part to advances in early diagnostic techniques and an increasing understanding of the underlying biological basis of the disease, which has led to improved treatment strategies. On a molecular level, breast cancer can be divided into five predominant molecular subtypes: the luminal A (LumA), luminal B (LumB), and Normal-like (NL) subtypes, which are predominantly estrogen receptor (ER) and progesterone receptor (PR) positive; the HER2-Enriched (HER2E) subtype; and basal-like tumors, which are largely synonymous with triple-negative breast cancer (TNBC) and are ER/PR/HER2 negative. The considerable differences among these molecular subtypes are a consequence of dramatically altered genomic and proteomic profiles, which manifest as changes in activated signaling networks (Gatza et al., 2014) and as differences in risk factors, incidence, age, prognosis and response to treatment. Therefore, there is a clear need to develop reliable biomarkers and to identify potential drug targets in each molecular and clinical subtype (Perou et al., 2000; Curtis et al., 2012; Weigman et al., 2012; Gatza et al., 2014; Ciriello et al., 2015; Mertins et al., 2016).

Basal-like breast cancers disproportionately affect younger women and women of African American descent. This subtype, which is highly concordant with TNBC, accounts for ∼15–20% of diagnosed breast tumors but more than 1-in-4 breast cancer related deaths each year. This is due in part to the lack of effective therapeutic options for TNBC patients aside from multi-agent chemotherapy, which remains the standard-of-care treatment despite a limited and varied response among patients and the related toxic side-effects (Solzak et al., 2017). In this context, we and others have proposed that systems level analyses can assist in revealing the underlying molecular mechanisms of the disease, discovering biomarkers for specific subtypes, identifying subtype specific drug targets, and repositioning drugs that can be used in the effective treatment of patients (Mardinoglu and Nielsen, 2015; Mardinoglu et al., 2018; Turanli et al., 2018).

Publicly available "omics" datasets, including The Cancer Genome Atlas (TCGA) (Ciriello et al., 2015), the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) (Curtis et al., 2012), and the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (NCI-CPTAC) (Mertins et al., 2016), enhance our understanding of the subtype specific molecular mechanisms of breast cancer. Moreover, integrative and comparative analysis of "omics" data together with network modeling has provided a comprehensive platform for drug repositioning and multi-target drug design (Kibble et al., 2015; Vitali et al., 2016; Turanli et al., 2017). A number of studies have also combined genomic, transcriptomic, and proteomic data with protein-protein interaction networks (PPINs) and identified putative druggable candidates in breast cancer by analyzing topological features of the reconstructed networks (Karagoz et al., 2015; Liu et al., 2017; Li et al., 2018; Nuncia-Cantarero et al., 2018). The strength of these bioinformatics pipelines lies in reducing the number of candidate therapeutic targets/drugs and proposing potential treatment strategies for subsets of breast cancer patients.

The overall prognosis for patients with basal-like breast cancer remains poor and there is an urgent need to identify molecular targets to develop effective therapeutic strategies. To take advantage of the extensive publicly available "omics" data, we integrated transcriptome with interactome data and calculated the network entropy of each protein-protein interaction (PPI) to identify the dynamic states in basal-like breast cancer. Our analyses identified modules as systems biomarkers at the gene expression level, and these networks were confirmed at the proteomic level. Importantly, functional annotation and analysis of module activity scores demonstrated that these modules were subtype specific. Using these models, essential metabolites and drug candidates were identified within the context of basal-like specific modules. Collectively, these analyses suggest that the proposed strategy incorporating multi-omics analyses of human breast tumors has the capacity to define novel signaling networks and link these features to existing therapeutic opportunities.

### MATERIALS AND METHODS

#### Data Collection

Throughout the study, we integrated multi-omics data including genomics, transcriptomics, and proteomics using network analysis (**Table 1**). TCGA data were obtained from https://gdac.broadinstitute.org/; METABRIC and CPTAC data were collected from the **Supplementary Files** of these studies. At the transcriptomic level, gene expression values were obtained from two major initiatives: RNA-Seq data from the TCGA study and microarray data from the METABRIC study. Normalized gene expression values for 179 basal and 852 non-basal like breast cancer samples (n = 1031) from TCGA, and 331 basal and 1665 non-basal samples from the METABRIC project (n = 1992) were used in the integrative analysis. At the protein level, two different sources were used: (i) expression data of 160 basal and 777 non-basal like samples (n = 937) in TCGA, using Reverse Phase Protein Array (RPPA)-based analysis of 226 proteins, and (ii) expression data of 19 basal and 58 non-basal like samples (n = 77) from CPTAC, which applied comprehensive mass-spectrometry methods covering around 10,000 proteins (Mertins et al., 2016).

RNA sequencing data from TCGA (n = 1031) were used as the discovery set, whereas microarray data from METABRIC and proteomic data from TCGA and/or CPTAC were used as independent validation data sets in the study (**Table 1**).

#### Differential Interactome

To obtain a differential view of the human interactome between two phenotypes, and to identify PPIs that are up- or down-regulated in each phenotype relative to the other, we used the gene expression profiles of interacting protein pairs and applied the differential interactome analysis as previously described (Ayyildiz et al., 2017). For this purpose, normalized gene expression profiles from TCGA (179 basal-like and 852 non-basal like samples) were categorized into three levels, high (1), moderate (0), and low (-1), by comparing the expression of each gene with the average expression within each sample. The probability distributions for any possible co-expression profile of gene pairs (encoding proteins interacting with each other) were estimated, and the uncertainty of determining whether or not a PPI is encountered in a phenotype was estimated through an entropy formulation. To define possible PPIs, we used the high-confidence human PPIs (Karagoz et al., 2016), comprising 147,923 interactions among 13,213 proteins. Karagoz and coworkers assembled and integrated physical PPIs of Homo sapiens from six publicly available databases: BioGRID (Chatr-Aryamontri et al., 2015), DIP (Salwinski, 2004), IntAct (Orchard et al., 2014), HIPPIE (Schaefer et al., 2012), HomoMINT (Persico et al., 2005), and HPRD (Prasad et al., 2009). Then, the differential view of the human interactome between the basal and non-basal subtypes of breast cancer was analyzed for these PPIs; P < 0.05 was considered statistically significant for these analyses.
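The discretization and entropy steps can be illustrated with a short Python sketch. It does not reproduce the formulation of Ayyildiz et al. (2017): the cut-offs (sample mean plus or minus one standard deviation), the use of Shannon entropy over the joint pattern of a gene pair, and all function names are assumptions made only to show the general idea.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def discretize_sample(expression):
    """Map each gene of one sample to 1 (high), 0 (moderate), or -1 (low).

    `expression` maps gene names to normalized expression values. The
    thresholds (sample mean +/- one standard deviation) are an assumption.
    """
    values = list(expression.values())
    center, spread = mean(values), pstdev(values)
    levels = {}
    for gene, value in expression.items():
        if value > center + spread:
            levels[gene] = 1
        elif value < center - spread:
            levels[gene] = -1
        else:
            levels[gene] = 0
    return levels


def pair_entropy(discretized_samples, gene_a, gene_b):
    """Shannon entropy (bits) of the co-expression patterns of one gene pair.

    Each sample contributes one of nine patterns (level of gene_a, level of
    gene_b); a low entropy means the pair behaves consistently in the phenotype.
    """
    patterns = Counter((s[gene_a], s[gene_b]) for s in discretized_samples)
    n = sum(patterns.values())
    return -sum((c / n) * math.log2(c / n) for c in patterns.values())
```

Under these assumptions, a PPI whose two genes are both high in almost every basal-like sample but show mixed patterns in non-basal samples would receive a low entropy in the first phenotype and a higher one in the second.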

#### Differentially Expressed Genes and Proteins

Differentially expressed genes (DEGs) between 179 basal and 852 non-basal samples in the TCGA cohort and differentially expressed proteins (DEPs) between 19 basal and 61 non-basal samples in the CPTAC cohort were identified using the Significance Analysis of Microarrays (SAM) method implemented in R (Tusher et al., 2001; Hu et al., 2016; Gámez-Pozo et al., 2017). The False Discovery Rate (FDR)-adjusted p-value threshold was set at p < 0.05, and genes and proteins with fold changes > 1 between basal-like and non-basal samples were considered up-regulated in basal tumors.

### Module Extraction From Basal Specific Networks

Basal subtype specific PPI networks were constructed by using the differential interactome from basal-like tumors. The interactions associated with proteins corresponding to DEGs that are up-regulated in basal-like tumors were identified and used to construct up-regulated PPI networks specific to basal-like breast cancer. The networks were visualized using Cytoscape (version 3.4.0) (Lopes et al., 2011). The topological analysis of the networks was performed with the CytoNCA plugin of Cytoscape (version 2.1) (Tang et al., 2015). Two topological metrics were employed simultaneously to define hub nodes: degree, defined as the number of adjacent nodes of a node in the network, and betweenness centrality, which characterizes nodes by how often they occur on the shortest path between two other nodes in the network. Hub nodes with higher degree or betweenness values have been reported to play significant roles in cellular signal trafficking and could be potential candidate biomarkers or drug targets. Modules were identified as highly connected subnetworks within the up-regulated networks. Gene expression data from METABRIC were used for validation of the gene expression modules in basal-like breast cancer.
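The two topological metrics can be computed for a small unweighted network with the sketch below. It is only an illustration of the definitions given above and is not related to the CytoNCA implementation: the function names and the toy protein names are hypothetical, and betweenness is computed with Brandes' algorithm for undirected graphs, without normalization.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from a list of (protein, protein) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree(graph):
    """Number of adjacent nodes for every node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    centrality = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                                  # nodes in order of discovery
        predecessors = {node: [] for node in graph}
        n_paths = dict.fromkeys(graph, 0.0)         # number of shortest paths
        n_paths[source] = 1.0
        distance = dict.fromkeys(graph, -1)
        distance[source] = 0
        queue = deque([source])
        while queue:                                # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        dependency = dict.fromkeys(graph, 0.0)
        for w in reversed(order):                   # accumulate from the leaves
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair of end points was counted twice in an undirected graph.
    return {node: value / 2.0 for node, value in centrality.items()}


# Toy star-shaped network with hypothetical protein names.
toy_graph = build_graph([("HUB", "P1"), ("HUB", "P2"), ("HUB", "P3")])
toy_degree = degree(toy_graph)
toy_betweenness = betweenness(toy_graph)
```

In the toy network, the central node has degree 3 and lies on the shortest path of all three leaf pairs, so it ranks first by both metrics.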

#### Functional Annotation

Functional enrichment of the three protein-protein interaction modules was analyzed using QIAGEN's Ingenuity® Pathway Analysis (IPA®, QIAGEN Redwood City)<sup>1</sup>.

#### Module Activity

In order to convert the identified EED, AURKA, and DHX9 modules to gene expression signatures that can be used to quantify pathway activity in a given sample from independent datasets, each module was converted to a gene list and the unweighted mean expression of the gene list was used to calculate a pathway score. For these studies, a score was calculated for each sample in the TCGA (discovery) and METABRIC (validation) cohorts. Analysis of variance (ANOVA) tests were used to quantify differences in the EED, DHX9, and AURKA module activity scores between breast cancer subtypes in each dataset. A Student's t-test was used to compare EED, DHX9, and AURKA signature scores between adjacent normal breast tissue and basal-like tumors. To infer the functional roles of these modules, a panel of 270 experimentally derived gene expression signatures that predict activation of various oncogenic signaling pathways was applied to the gene expression data as described previously (Gatza et al., 2014). To identify the association of the modules with oncogenic pathways, Spearman's rank correlation was calculated between oncogenic pathway activity scores and EED, DHX9, and AURKA activity scores.
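As an illustration of the scoring and correlation steps, the following minimal sketch computes an unweighted mean-expression score for a module and Spearman's rank correlation between two score vectors. The function names are hypothetical, and the handling of genes missing from a sample (they are skipped) is an assumption.

```python
from statistics import mean


def module_score(sample_expr, module_genes):
    """Unweighted mean expression of the module genes measured in one sample."""
    values = [sample_expr[g] for g in module_genes if g in sample_expr]
    return mean(values)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Calling `module_score` once per sample yields a vector of module activity scores, which can then be passed to `spearman` together with the corresponding pathway activity scores.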

#### Module Specific Drug Repositioning

To identify small molecules that can potentially reverse gene expression of basal-like tumors, we utilized the Library of Integrated Network-based Cellular Signatures (LINCS) L1000 data, which include gene expression data from ∼50 human cell lines in response to ∼20,000 compounds (Campillos et al., 2008). As input, we queried signatures consisting of the basal-like specific module genes, which are all up-regulated, and the down-regulated DEGs (Fold Change < 0.2). We used the L1000CDS2 (Duan et al., 2016) search engine, which contains 30,000 significant signatures that were processed from the LINCS L1000 data, to identify small molecule signatures associated with each module. The identified drugs were ranked based on their scores and the top 50 were acquired for each query. Drugs were checked through literature review and publicly available datasets such as CTD (Davis et al., 2017) and KEGG DRUG (Kanehisa et al., 2012) to identify those that were previously investigated within the context of breast cancer.

<sup>1</sup>www.qiagen.com/ingenuity

TABLE 1 | Validation and discovery sets used in this study.

#### Subtype Specific Essential Metabolites

We next acquired 917 personalized genome-scale metabolic models (GEMs) of breast cancer patients (Uhlen et al., 2017). We analyzed each patient GEM to identify essential metabolites for tumor growth by removing the reactions in which the metabolite functions as a substrate, regardless of compartmentalization (Bidkhori et al., 2018). Next, we categorized personalized models based on clinical information to create subtype-specific patient metabolic models and determined the percentage of models in each subtype in which each metabolite was essential. A Fisher exact test was applied to identify statistically significant differences between basal-like and non-basal-like tumors (i.e., all other tumors) for each metabolite. Significant differences between subtypes were determined based on P < 0.05.
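The per-metabolite comparison can be sketched with a two-sided Fisher exact test on a 2x2 table. In the intended use, `a` and `b` would be the numbers of basal-like models in which a metabolite is and is not essential, and `c` and `d` the corresponding counts for non-basal-like models. The function name is hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is the common convention, assumed rather than taken from the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )
```

A metabolite would then be flagged as subtype-associated when the returned p-value falls below the chosen threshold.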

#### RESULTS

#### Basal-Like Subtype Specific PPI Elucidation via Differential Interactome

Cancer cells are characterized by an increase in network entropy comprising high uncertainty, pathway redundancy and promiscuous signaling resulting from intra-sample heterogeneity. Recently, a differential interactome network analysis was presented to show the uncertainties of PPIs in ovarian cancer (Ayyildiz et al., 2017). In this study, we employed the differential interactome algorithm, which utilizes the entropy concept, with comprehensive gene expression data and the human PPI network to reveal the heterogeneity among the breast cancer subtypes (i.e., basal-like vs. non-basal-like). To do so, we categorized the expression of each gene in each patient, using 179 basal and 852 non-basal-like samples from TCGA, into three classes: -1, 0, and 1. These classes were then integrated with a high-confidence PPI network (Karagoz et al., 2016) and the frequency of PPIs was estimated for both basal-like and non-basal-like tumors. Using a 95% confidence level (p < 0.05), significant values (< 0.2 and > 0.8) and the corresponding entropies (H < 0.7) were calculated for each class. As a result, 3,002 interactions among 1,652 proteins were considered significant across the entire dataset. These analyses identified 2,291 interactions among 1,391 proteins as being significantly activated in basal-like tumors whereas 712 interactions among 612 proteins were identified as significant in non-basal-like samples; 351 proteins were common across both subgroups of tumors (**Supplementary Table S1**).

Since low entropy reflects low uncertainty, low redundancy and deterministic signaling resulting in homogeneity in the population, we next focused on the basal-like subtype to identify low entropy interactions (H < 0.1). These analyses identified the EED protein network, which is defined by 82 interactions within a group of 98 proteins. Importantly, the lowest entropy profile of the EED centroid network identified an interaction with only one protein (CTCF) in non-basal-like tumors. We further identified a subset of proteins, excluding the 351 common signatures evident in both basal-like and non-basal-like tumors, to define a basal-like subtype specific network (**Supplementary Table S2**). All differential interactome networks and basal-like subtype specific networks were restricted to up-regulated genes in the basal-like tumors through 2-class SAM analysis (Tusher et al., 2001; **Supplementary Table S3**). Through the integration of SAM analysis and the differential interactome framework detailed above, we identified three significant modules: the EED centroid module, covering relatively low entropy PPIs (**Figure 1A**); the DHX9 centroid module, covering a mix of low and high entropy PPIs (**Figure 1B**); and the AURKA centroid module, covering relatively high entropy PPIs (**Figure 1C**).

Further analyses of the EED, DHX9, and AURKA modules determined that genes included in the EED module play roles in cyclins and cell cycle regulation (p = 6.1e-19), cell cycle: G1/S checkpoint regulation (p = 3.5e-18), regulation of cellular mechanics by calpain protease (p = 1.6e-11), aryl hydrocarbon receptor signaling (p = 4.3e-11), apoptosis signaling (p = 7.0e-10), TWEAK signaling (p = 1.8e-09), and GADD45 signaling (p = 4.3e-9). In contrast, the genes in the DHX9 module contribute to mTOR signaling (p = 4.1e-06), regulation of eIF4 and p70S6K signaling (p = 7.9e-06), EIF2 signaling (p = 7.2e-05), inflammasome pathway (p = 1.4e-04), assembly of RNA Polymerase I complex (p = 1.1e-03), DNA double strand break repair (p = 1.8e-03) and cell cycle (p = 3.5e-03), while the genes associated with the AURKA module are involved in DNA damage-induced 14-3-3σ signaling (p = 1.8e-10), mitotic roles of Polo-like kinase (p = 2.1e-09), role of CHK proteins in cell cycle checkpoint control (p = 6.0e-08), ATM signaling (p = 9.3e-07), mismatch repair (p = 3.1e-06), role of BRCA1 in DNA damage response (p = 1.3e-05), and cell cycle (p = 9.8e-05). These data suggest that each module represents a unique aspect of basal-like breast cancer signaling. Several of these pathways, such as TWEAK, apoptosis, mTOR, and ATM signaling, are targeted by chemotherapy, indicating that chemotherapy-targeted pathways are activated in basal-like tumors, for which chemotherapy is currently the front-line treatment option (**Supplementary Figure S1**).

### Proteomic Analysis of Basal Specific Modules

We next reconstructed PPI networks using transcriptome data and validated our findings at the proteomic level by leveraging orthogonal genomic and proteomic data from the TCGA and CPTAC projects. Transcriptome data from 937 samples were compared to RPPA analysis of the same samples to assess the relationship between each network and the 226 proteins and phosphoproteins from TCGA. Likewise, the gene expression data from a subset of 77 of these samples were used to examine the relationship between each module and 10,062 proteins and phosphoproteins using mass spectrometry-derived proteomic data from the CPTAC project. First, we used CPTAC proteome data to compare each gene to its corresponding protein across all basal-like tumors and assessed correlation for those pairs. Overall, 52.6–64.5% of the mRNA-protein pairs showed statistically significant positive Spearman correlations (P < 0.05) when changes in mRNA abundance were compared to changes in relative protein abundance. These proteins in basal-like samples are shown in darker colors in **Figures 1A–C**. Then, we identified DEPs between basal-like and non-basal-like samples by using both RPPA and CPTAC data. Although the RPPA data cover a limited number of proteins, we identified several up-regulated proteins, including CCNE1, RAF1, SRC, CDK1, EGFR, MYC, MYH9, and PCNA, associated with the EED module. Similarly, NDRG1 and CCNB1 were associated with the DHX9 and AURKA modules, respectively. We also analyzed DEPs between basal-like and non-basal-like tumors by using CPTAC data, which are more comprehensive than the RPPA data; they covered 69.4–56.4% of the module genes, and 29.4–36.4% of these proteins were identified as being up-regulated in basal-like tumors (**Supplementary Table S3**).

### Modules as Basal Specific Signatures

In order to quantitatively assess the activity of each module in each patient sample, we next generated a gene expression signature on the basis of median expression of each gene in the module. This strategy was used to calculate a module score for each sample in the TCGA (discovery set) and METABRIC (validation set) datasets. We then quantitatively evaluated the differences in the module activities across breast cancer subtypes by an ANOVA test. These analyses demonstrated that EED (P = 1.13e-244), DHX9 (P = 2.4e-236), and AURKA (P = 2.05e-175) activity was highest in basal-like tumors in the TCGA cohort (**Figures 2A–C**); these findings were validated by analysis of module activity in the METABRIC cohort (**Figures 2D–F**). Finally, we determined that the EED (P = 1.06e-96), DHX9 (P = 2.44e-85), and AURKA modules (P = 6.61e-127) were expressed at significantly higher levels in basal-like tumors compared to adjacent normal tissue (**Figures 3A–C**).
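For readers unfamiliar with the test, the one-way ANOVA comparison of module scores across subtypes can be sketched as below. The sketch returns only the F statistic and its degrees of freedom; converting F to a p-value requires the F distribution (available in statistical packages) and is omitted. The function name is hypothetical, and the numbers in the usage note are toy values, not study data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists.

    Each inner list holds the module scores of the samples in one subtype.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    # Variation of the subtype means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own subtype mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

With three toy groups `[[1, 2, 3], [2, 3, 4], [5, 6, 7]]`, the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = 13 on (2, 6) degrees of freedom.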

## Functionality of Basal Specific Modules

We examined the functional roles of these modules by exploring the correlations with a series of previously published gene expression signatures that are capable of measuring oncogene or tumor suppressor pathway activity, aspects of the tumor microenvironment and other tumor characteristics. We identified pathway activities that were positively (or negatively) correlated with module activities using a Spearman rank correlation to assess the relationship between pathway activity and the EED, DHX9, or AURKA module activity scores. As expected, our data recapitulated known characteristics of basal-like tumors, including low hormone receptor signaling and high proliferation pathway activity, and demonstrated the relationship between these characteristics and the expression of each module (i.e., EED, DHX9, and AURKA). Moreover, multiple indicators of proliferation, including RB\_LOSS, RB\_LOH, and bMYB, were highly correlated with module activities, as were the RAS, PIK3CA, β-catenin, MYC, HER1\_Cluster 1, HER1\_Cluster 2, and HER1\_Cluster 3 signatures (**Figure 4A**). Consistent results were obtained using the METABRIC data (**Figure 4B**). Importantly, we also confirmed the ability of the transcriptomic module signatures to assess the functional roles of the EED, DHX9, and AURKA modules by exploring relationships between the module signature scores and protein expression. Analysis of RPPA data from basal-like samples confirmed that tumors with high module scores have significantly higher levels of CHK1, CHK2, CDK1, Cyclin B1, Cyclin E1, FOXM1, and PCNA protein expression, consistent with their role in cell cycle regulation and proliferation (**Figure 4C**).

### Drug Repositioning Based on Basal Subtype Specific Modules

As discussed above, the EED, DHX9, and AURKA modules were converted to gene expression signatures on the basis of up-regulated genes specific to each module; as would be expected, down-regulated genes (Fold Change < 0.2) were common to all modules. We asked whether each module/signature identified potential therapeutic opportunities. To do so, we queried each gene signature separately against the LINCS database L1000CDS2 (Duan et al., 2016) in order to identify concordant and discordant patterns of gene expression between each module and gene expression profiles associated with drug-induced and/or disease expression. Drugs that resulted in a gene expression profile that was negatively correlated with each module were identified and selected as candidate compounds with the potential to reverse the activity of each module network associated with basal-like tumors (**Supplementary Figure S2**). Since we have demonstrated the specificity of the modules to basal-like tumors, we may also propose that our candidate drugs specifically target basal-like tumors.
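The idea of selecting perturbations whose expression profile opposes the disease signature can be illustrated with a simple overlap score. This is a stand-in for illustration only and is not the L1000CDS2 scoring scheme: the score definition, the function names, and the dictionary layout of `drug_signatures` are assumptions. The default of 50 candidates per query follows the Methods.

```python
def reversal_score(disease_up, disease_down, drug_up, drug_down):
    """Fraction of disease genes that a perturbation moves the opposite way."""
    reversed_genes = len(disease_up & drug_down) + len(disease_down & drug_up)
    total = len(disease_up) + len(disease_down)
    return reversed_genes / total if total else 0.0


def rank_candidates(disease_up, disease_down, drug_signatures, top_n=50):
    """Rank perturbations by reversal score and keep the top_n.

    drug_signatures maps a compound name to {"up": set, "down": set}.
    """
    scored = [
        (reversal_score(disease_up, disease_down, sig["up"], sig["down"]), name)
        for name, sig in drug_signatures.items()
    ]
    # Highest reversal first; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:top_n]]
```

A compound that down-regulates most of the up-regulated module genes, and up-regulates the down-regulated genes, receives a score close to 1 and appears at the top of the ranking.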

After removing the duplicated drugs from query results, we found that the EED and AURKA modules were associated with 41 candidate compounds while DHX9 was associated with 31 candidate small molecules. Networks comprising drug candidates and modules were found to have 114 interactions between three modules and 80 drugs (**Figure 5A**). The 80 identified drugs were categorized as molecular inhibitors (23%), antineoplastic agents (15%), heterocyclic compounds (10%), anti-infective agents (6%), or steroids (6%). Moreover, a number of the drugs specific to each module (as well as some common candidates) were also identified in each drug category (**Figure 5B**). At least 19 approved, 24 investigational, and 6 experimental drugs are listed in DrugBank (version 5.1.1); however, some perturbagens used in the L1000 platform lack detailed information (**Supplementary Table S4**).


Nine of the drugs, including selumetinib, trametinib, and several other investigational drugs, were common to all three modules. Consistent with our results, the MEK inhibitor selumetinib was reported to suppress cell proliferation and migration and to trigger apoptosis following G1 arrest in TNBC cells (Zhou et al., 2016). Furthermore, the MEK inhibitor trametinib is also a therapy of significant interest for the treatment of TNBC, since TNBC cell lines have been shown to be especially sensitive to this drug (Jing et al., 2012; Davis et al., 2014). Finally, we noted some overlap between drugs associated with each module. For instance, three common drugs (i.e., wortmannin, mestanolone, NVP-TAE684) are associated with both the EED and AURKA modules, while 12 drugs (e.g., radicicol, lapatinib, alvocidib, zileuton, geldanamycin, exemestane) are shared between the EED and DHX9 modules (**Supplementary Table S4**). Intriguingly, 10 of our candidate drugs were previously associated with breast cancer based on at least one of the sources including CTD, KEGG Drug, Clinical Trials, and scientific literature (**Table 2**).

Since the EED module has the lowest entropy level among its PPIs, we focused on 17 drug candidates that are related only to the EED module, in addition to the common drugs. Three of these drugs are anti-neoplastic agents and five are uncharacterized; the others belong to steroids (BRD-A94793051, Oxymetholone, Testosterone propionate), PLK inhibitors (BI-2536), heterocyclic compounds (BRD-K17953061, GDC-0980, TG101348), cardiovascular agents (BRD-K52080565, S-2500), anti-inflammatory agents (oxaprozin), and anti-infective agents (5-fluorocytosine).

### Essential Metabolites and Anti-metabolites as Drug Candidates

GEMs reconstructed for different cancer tissues have been used for characterization of metabolic modifications, disease stratification and determination of drug targets using essential genes or metabolites (Folger et al., 2011; Agren et al., 2012; Bidkhori et al., 2018). To this end, we first identified a panel of 917 personalized GEMs derived from breast cancer patients (Uhlen et al., 2017). We then categorized each GEM based on clinical information to create subtype-specific patient metabolic models. These models were then used to identify subtype-specific metabolites essential for tumor growth. After categorization of BCS, the percentage of abundance for each essential metabolite was calculated. Significant differences in abundance between basal-like and non-basal BCS were determined based on an FDR-adjusted P-value threshold (P-adj < 0.05) (**Supplementary Table S5**). These analyses identified 27 essential metabolites (**Supplementary Table S6**); 11 were significantly enriched in basal-like tumors while the remaining 16 were enriched in non-basal-like samples. Further analyses determined that the essential metabolites enriched in basal-like tumors were associated with steroid metabolism, biotin metabolism, nucleotide metabolism, sphingolipid metabolism and transport. Conversely, the metabolites depleted in basal-like samples were involved in beta-alanine metabolism, arginine and proline metabolism, cysteine and methionine metabolism, and the carnitine shuttle (**Figure 6**).

TABLE 2 | Various drug candidates that are already associated with breast cancer via different sources.

#### DISCUSSION

The dynamics of cells are regulated by PPIs, and properties of networks such as entropy provide information about the current state of the network. Given that cancer cells are reported to have an increase in network entropy, several previous studies have integrated gene expression data with PPI network information to compute the energetic state of cancer cells by calculating entropy (West et al., 2012; Teschendorff et al., 2015; Rietman et al., 2016). Likewise, a number of studies have used a network-based entropy approach to identify disease specific PPIs as biomarker candidates, proliferative and prognostic markers in lung and breast cancer, as well as to demonstrate the association between network entropy and tumor initiation, progression, and anticancer drug responses (Varadan and Anastassiou, 2006; Xiong et al., 2010; Banerji et al., 2013; Lecca and Re, 2015; Cheng et al., 2016; Ayyildiz et al., 2017).

The current study employed a novel multi-omics-based approach to integrate genomic, proteomic and metabolomic tumor data. Our analyses of mRNA expression data identified three highly connected modules centered on the activation of the EED, DHX9, and AURKA signaling networks. These data demonstrated that each module is highly activated in basal-like tumors compared to non-basal-like tumors as well as adjacent normal tissues. Importantly, by analyzing proteome data, our results confirmed the correlation between the expression of genes and proteins that comprise each identified module. By analyzing the association between module expression and oncogenic signaling using a panel of more than 250 gene expression signatures, we were able to assess the functional relationship of these modules with known oncogenic and signaling features. Our results demonstrated the correlation between EED, DHX9, and AURKA module activity and proliferative oncogenic pathways including RAS, PI3K, and Rb/E2F signaling in basal-like tumors. Consistent with these results, CHK1, CHK2, CDK1, Cyclin B1, Cyclin E1, and PCNA protein expression levels were found to be higher in tumors with high module scores. Through integrated analyses, we identified candidate drugs to target the three modules by drug repositioning. Utilizing multiple omics data including genome, transcriptome, and interactome, we repurposed 519 agents for breast cancer by incorporating data from the LINCS project (Duan et al., 2016) into our analyses. In another drug repositioning study, five of the identified repurposed candidate agents showed superior therapeutic indices compared to doxorubicin in in vitro assays in a basal subtype cell line (SUM149) in addition to a luminal cell line (MCF7) (Chen et al., 2016). Moreover, Lee et al. (2016) developed an integrative approach for drug repositioning using the expression signature, chemical structure, target signatures and LINCS data, and applied this strategy to identify candidate anti-cancer drugs for breast cancer. Although previous computational drug-repositioning efforts have utilized LINCS, as mentioned, those methodologies addressed breast cancer without regard to disease heterogeneity and subtype information.

In addition, our analyses identified subtype-specific metabolites, including several specific to basal-like tumors, which may provide an opportunity to design anti-metabolite drugs for breast cancer. Results of the essential metabolite analysis emphasized sphingolipid and steroid metabolism for basal-like breast cancer. Sphingolipid levels in breast cancer tissue are generally higher than in normal breast tissue, and bioactive sphingolipids such as sphingosine-1-phosphate (S1P) have many cellular functions, including cell proliferation, migration, survival, immune cell trafficking, and angiogenesis, which are related to cancer progression and metastasis (Nagahashi et al., 2016). Moreover, sphingosine and S1P were recently highlighted as important for signaling mechanisms in metastatic TNBC and its targeted therapy (Maiti et al., 2017). A recent lipidomics profiling of TNBC tumors also supported sphingolipids as potential prognostic markers and associated enzymes as candidate therapeutic targets (Purwaha et al., 2018), in parallel with our results.

TNBC was associated with the expression pattern of 2-pore domain potassium (K2p) channels, which enable background leak of potassium (K+). Differential expression of K2p channels may be suggested as a novel molecular marker related to potassium levels in basal-like BCS (Dookeran et al., 2017). In another study, expression of calcium-activated potassium (SK4) channels was also associated with TNBC and with cellular functions such as proliferation, migration, apoptosis, and EMT processes (Zhang et al., 2016).

Breast cancer is known as one of the malignancies in which steroid hormones drive cellular proliferation (Capper et al., 2017). Cholesterol sulfate, a metabolite associated with steroid metabolism, is quantitatively the most important known sterol sulfate in human plasma and may play a role in cell adhesion, differentiation and signal transduction (Strott and Higashi, 2003). Given that current standard-of-care therapy for TNBC is largely limited to multi-agent cytotoxic chemotherapy, incorporating the identified repurposed drugs and/or targeting the identified modules and/or metabolites represents a potential therapeutic opportunity for a subset of patients with limited treatment options.

Given these data, we propose that the strategy outlined here can be used to repurpose drugs in order to identify novel candidate compounds or drugs to be utilized not only in monotherapy but also in combination therapy for the treatment of TNBC. Consistent with this argument, a number of the candidate drugs identified by our analyses have been incorporated in ongoing clinical trials. For instance, TNBC patients who received pre-operative sequential epirubicin and cyclophosphamide followed by docetaxel were found to have a significant increase in pathological complete response (pCR) (Warm et al., 2010). Although a great number of pre-clinical studies will be necessary to support the in silico modeling detailed in the current study prior to initiation of clinical trials, a large number of identified candidates have significant in vitro and in vivo support to indicate that they represent potential therapeutic opportunities. For instance, drugs inhibiting cyclin-dependent kinases (CDKs), including the CDK9 inhibitor alvocidib, have been reported to be effective against TNBC (Ocana and Pandiella, 2015).

Erlotinib also showed anti-tumor effect on TNBC in a xenograft model (Ueno and Zhang, 2011). Likewise, targeting the MET and EGFR receptors, which regulate RAS/ERK and PI3K/AKT signaling, resulted in improved treatment compared to monotherapy (Linklater et al., 2016).

The current study has defined a novel approach to identify breast cancer subtype-specific network modules via a network entropy-based approach. This strategy can be used both to identify potentially novel signaling networks and to identify subtype-specific therapeutic opportunities through drug repositioning. Importantly, we demonstrate that this approach can be used to link signaling networks with subtype-specific essential metabolites, which represent additional therapeutic opportunities. As such, the current studies have the potential to enhance the impact of existing therapeutics or multi-agent therapeutic strategies by identifying novel drug/target networks in the context of breast cancer and in breast cancer subtypes. On a broader scale, this strategy is largely applicable to all cancer and disease types/subtypes where multi-platform genomic, proteomic, and metabolomic data exist and thus represents a potential strategy to define novel signaling networks unique to each disease and identify disease/subtype-specific therapeutic strategies.

### AUTHOR CONTRIBUTIONS

BT and KK designed the study, performed all other analyses, and wrote the manuscript. GB performed essential metabolite analysis. MG, RS, AM, MU, and KA supervised the work and contributed to the manuscript during the progress of the work. All authors reviewed and approved the final manuscript.

### FUNDING

This work was supported by the Knut and Alice Wallenberg Foundation and the Marmara University Scientific Projects Committee (BAPKO) in the context of the project FEN-C-DRP-250816-0417, by R00CA166228 from the National Cancer Institute of the National Institutes of Health and V2016-013 from the V Foundation for Cancer Research to MG, and by DHFS-18PPC-024 from the New Jersey Commission for Cancer Research to KK.

### ACKNOWLEDGMENTS

We thank the TUBITAK BIDEB 2211 National Doctoral Fellowship Program provided to BT.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00420/full#supplementary-material

FIGURE S1 | Functional enrichment results of the genes involved in each basal-like module using Ingenuity Pathway Analysis (IPA).

FIGURE S2 | The gene signatures of the three modules were queried separately on L1000CDS2 to elucidate the differences and similarities between drug-induced expression profiles and disease expression. Drugs were ranked for each module and we selected drugs that showed action mechanisms negatively correlated with the module gene signatures to reverse disease gene expression.

TABLE S1 | Non-basal and basal-like subtype specific PPI elucidation via differential interactome.

TABLE S2 | Three modules found only in basal-like subtype specific networks.

TABLE S3 | Statistical values of differentially expressed genes and proteins.

TABLE S4 | Information of repurposed module specific and common drug signatures.

TABLE S5 | Essential metabolite and personalized model matrix and breast cancer categorization of personalized GEMs.

TABLE S6 | Significant essential metabolites between non-basal and basal-like breast cancer.

#### REFERENCES


on genome-wide RNA interference transcriptomes. Genes 8:E86. doi: 10.3390/genes8030086


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Turanli, Karagoz, Bidkhori, Sinha, Gatza, Uhlen, Mardinoglu and Arga. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Gene Co-expression Network and Copy Number Variation Analyses Identify Transcription Factors Associated With Multiple Myeloma Progression

Christina Y. Yu<sup>1,2</sup>, Shunian Xiang<sup>3,4</sup>, Zhi Huang<sup>2,5</sup>, Travis S. Johnson<sup>1,2</sup>, Xiaohui Zhan<sup>2,4</sup>, Zhi Han<sup>2,6</sup>, Mohammad Abu Zaid<sup>2</sup> and Kun Huang<sup>2,6</sup>\*

<sup>1</sup> Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States, <sup>2</sup> Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, United States, <sup>3</sup> Department of Medical and Molecular Genetics, Indiana University, Indianapolis, IN, United States, <sup>4</sup> National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, China, <sup>5</sup> School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, United States, <sup>6</sup> Regenstrief Institute, Indianapolis, IN, United States

#### Edited by:

Victor Jin, The University of Texas Health Science Center at San Antonio, United States

#### Reviewed by:

Zhengqing Ouyang, The Jackson Laboratory for Genomic Medicine, United States Vishal Acharya, Institute of Himalayan Bioresource Technology (CSIR), India

> \*Correspondence: Kun Huang kunhuang@iu.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 01 December 2018 Accepted: 01 May 2019 Published: 17 May 2019

#### Citation:

Yu CY, Xiang S, Huang Z, Johnson TS, Zhan X, Han Z, Abu Zaid M and Huang K (2019) Gene Co-expression Network and Copy Number Variation Analyses Identify Transcription Factors Associated With Multiple Myeloma Progression. Front. Genet. 10:468. doi: 10.3389/fgene.2019.00468

Multiple myeloma (MM) has two clinical precursor stages of disease: monoclonal gammopathy of undetermined significance (MGUS) and smoldering multiple myeloma (SMM). However, the mechanism of progression is not well understood. Because gene co-expression network analysis is a well-known method for discovering new gene functions and regulatory relationships, we utilized this framework to conduct differential co-expression analysis to identify interesting transcription factors (TFs) in two publicly available datasets. We then used copy number variation (CNV) data from a third public dataset to validate these TFs. First, we identified co-expressed gene modules in two publicly available datasets each containing three conditions: normal, MGUS, and SMM. These modules were assessed for condition-specific gene expression, and then enrichment analysis was conducted on condition-specific modules to identify their biological function and upstream TFs. TFs were assessed for differential gene expression between normal and MM precursors, then validated with CNV analysis to identify candidate genes. Functional enrichment analysis reaffirmed known functional categories in MM pathology, the main one relating to immune function. Enrichment analysis revealed a handful of differentially expressed TFs between normal and either MGUS or SMM in gene expression and/or CNV. Overall, we identified four genes of interest (MAX, TCF4, ZNF148, and ZNF281) that aid in our understanding of MM initiation and progression.

Keywords: multiple myeloma, MGUS, SMM, gene co-expression, copy number variation

### INTRODUCTION

Multiple myeloma (MM) is a B-cell malignancy caused by the proliferation of aberrant clonal plasma cells that secrete monoclonal immunoglobulin protein, also known as M protein. MM is consistently preceded by a premalignant phase called monoclonal gammopathy of undetermined significance (MGUS), which is clinically defined by thresholds in serum M protein and clonal bone marrow plasma cell content with the absence of hypercalcemia, renal insufficiency, anemia, and bone lesions (known as CRAB features) or amyloidosis relating to the plasma cell proliferative disorder (Landgren et al., 2009; Rajkumar et al., 2014). The risk of developing MGUS is low, thought to be around 3.2% of individuals aged 50 or older, and increases to 5.3% for those aged 70 or older (Kyle et al., 2006). An individual with MGUS lives with an increased risk of developing MM at a rate of 1% per year (Kyle et al., 2002). Additionally, there is an intermediate precursor between MGUS and MM known as smoldering multiple myeloma (SMM). This phase is clinically defined by a higher threshold in M protein or clonal bone marrow plasma cell content with the continued absence of CRAB features (Rajkumar et al., 2014). The risk of progression for SMM varies over time: 10% per year for the first 5 years, 3% per year for the next 5 years, and 1% per year in the following 10 years (Kyle et al., 2007). The biological basis of MM progression from these precursors is still unclear.

Gene expression profiling studies have been applied to MM to identify subgroups and biomarkers in order to better understand the molecular basis of disease, improve prognostic models, and characterize features associated with a high risk of disease progression (Davies et al., 2003; Zhan et al., 2006; Chng et al., 2007a; Shaughnessy et al., 2007; Broyl et al., 2010; Dhodapkar et al., 2014; López-Corral et al., 2014; Shao et al., 2018). A few studies have analyzed the disease precursors using hierarchical clustering and differential expression analysis to identify gene signatures (Davies et al., 2003; Zhan et al., 2007; López-Corral et al., 2014). We approached gene expression profiling analysis from the transcription factor (TF) perspective, using gene co-expression networks (GCNs).

Gene co-expression networks have been widely used in discovery of new gene functions and regulatory relationships (Langfelder and Horvath, 2008; Zhang et al., 2010, 2012; Kais et al., 2011; Yin et al., 2015; Zhang and Huang, 2016; Miao et al., 2018). GCNs have been implemented in a few MM studies albeit these studies focused on differential gene expression and not co-expression (Dong et al., 2015; Wang et al., 2016; Liu et al., 2017). We applied GCN analysis on two publicly available MM datasets to identify regulatory genes specifically associated with or disrupted in MM precursors.

The GCN algorithm we employed is local maximal Quasi-Clique Merger (lmQCM) (Zhang and Huang, 2016), previously developed to mine densely correlated gene modules in weighted GCNs (Zhang et al., 2010; Zhang and Huang, 2016, 2017; Xiang et al., 2018). The advantages that lmQCM has over a similar method such as WGCNA (Langfelder and Horvath, 2008) are the ability to allow genes to belong to more than one module and the ability to produce smaller modules, many of which are related to copy number variations (CNVs) in cancers (Han et al., 2016; Zhang and Huang, 2016; Xiang et al., 2018).

We further supported and validated our gene expression findings with CNVs from microarray technology based on single-nucleotide polymorphism (SNP) arrays. SNP arrays can be used in numerous ways to identify genomic imbalances (She et al., 2008; López-Corral et al., 2012; Johnson et al., 2016; Mitchell et al., 2016; Mikulasova et al., 2017). We surmised that some gene expression changes from normal to MM precursors can be explained by CNVs, and we examined them in order to better understand the genomic changes of myeloma progression.

## MATERIALS AND METHODS

### Gene Expression Profiling Datasets: Processing and GCN

We applied an integrative network-based approach to identify modules of co-expressed genes associated with MM precursors. MM microarray datasets GSE5900 and GSE6477 from the Gene Expression Omnibus (GEO) were obtained, annotated, and filtered using the TSUNAMI web-tool<sup>1</sup>. The web-tool retrieved the gene expression matrices via the R package GEOquery. We converted probe IDs to corresponding HGNC symbols according to the GEO Platform accession number. In the case of duplicate gene symbols, we retained the one with the largest mean expression value. Probes without gene symbols were removed. We further filtered the data by expression level and variance. For GSE5900, we removed the lowest 20% of genes quantified by absolute average value and the lowest 50% of genes quantified by variance. For GSE6477, we removed the lowest 10% of genes quantified by absolute average value and the lowest 10% of genes quantified by variance. We applied a stricter cutoff on GSE5900 because its microarray platform had a much larger probeset than the platform in GSE6477 (54,675 vs. 22,283 probes), and we wanted expression sets with similar numbers of genes. The resulting datasets had 15,388 and 12,530 genes for GSE5900 and GSE6477, respectively. Normalization of the datasets was confirmed by inspecting the boxplots of the samples for consistent median values.
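The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TSUNAMI implementation: the function names (`collapse_duplicates`, `drop_lowest`, `filter_expression`), the toy data, and the order of the two filters (mean first, then variance) are all hypothetical choices made for the example.

```python
from statistics import mean, pvariance

def collapse_duplicates(rows):
    """Keep, for each gene symbol, the probe with the largest mean expression.

    rows: list of (symbol, [expression values]); probes without a symbol are dropped.
    """
    best = {}
    for symbol, values in rows:
        if not symbol:
            continue  # probe has no gene symbol
        if symbol not in best or mean(values) > mean(best[symbol]):
            best[symbol] = values
    return best

def drop_lowest(expr, score, fraction):
    """Remove the given fraction of genes with the lowest score(values)."""
    ranked = sorted(expr, key=lambda gene: score(expr[gene]))
    n_drop = int(len(ranked) * fraction)
    return {gene: expr[gene] for gene in ranked[n_drop:]}

def filter_expression(rows, mean_fraction, var_fraction):
    """Assumed order: collapse duplicates, filter by |mean|, then by variance."""
    expr = collapse_duplicates(rows)
    expr = drop_lowest(expr, lambda v: abs(mean(v)), mean_fraction)
    expr = drop_lowest(expr, pvariance, var_fraction)
    return expr

# Toy data (hypothetical), filtered with the GSE5900-style fractions.
toy_rows = [
    ("A", [1, 1, 1]), ("A", [5, 6, 7]), ("", [9, 9, 9]),
    ("B", [2, 2, 2]), ("C", [3, 9, 3]), ("D", [0.1, 0.2, 0.1]), ("E", [4, 8, 4]),
]
kept = filter_expression(toy_rows, mean_fraction=0.2, var_fraction=0.5)
```

Calling `filter_expression(rows, 0.1, 0.1)` would correspond to the milder GSE6477 setting.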

### SNP Array Dataset: Processing and CNV Analysis

We obtained raw CEL files from GEO study GSE31339, generated on the Affymetrix Genome-Wide Human SNP Array 6.0. The CEL files were analyzed with the R package Rawcopy (Mayrhofer et al., 2016) and then aggregated by the following conditions: normal (n = 10), MGUS (n = 20), and SMM (n = 19). SMM sample GSM777173 was removed from our analysis after the sample identity distogram suggested some cell or DNA contamination with other samples (**Supplementary Figure S1**). CNVs were detected in genomic segments using PSCBS, an enhanced method of circular binary segmentation (Bengtsson et al., 2010; Olshen et al., 2011). We used the reference data included in Rawcopy for calculating logarithm (base 2) ratios (log<sub>2</sub> ratios) of genome segmentation. Rawcopy defined the threshold for copy number gain as segment median log<sub>2</sub> ratio > 0.2 and for copy number loss as segment median log<sub>2</sub> ratio < −0.3 (Mayrhofer et al., 2016). The package also annotated probes with their corresponding genes.
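As an illustration of how the gain and loss thresholds above translate into calls, the following minimal Python sketch classifies segment medians and summarizes the fraction of samples in each state. It is not Rawcopy code; the function names and the example values are hypothetical.

```python
GAIN_THRESHOLD = 0.2    # segment median log2 ratio above this is called a gain
LOSS_THRESHOLD = -0.3   # segment median log2 ratio below this is called a loss

def call_segment(median_log2_ratio):
    """Classify one segment from its median log2 ratio."""
    if median_log2_ratio > GAIN_THRESHOLD:
        return "gain"
    if median_log2_ratio < LOSS_THRESHOLD:
        return "loss"
    return "neutral"

def aberration_frequency(segment_medians):
    """Fraction of samples called gain, loss, or neutral at one locus.

    segment_medians: one segment median log2 ratio per sample (hypothetical input).
    """
    calls = [call_segment(x) for x in segment_medians]
    return {state: calls.count(state) / len(calls)
            for state in ("gain", "loss", "neutral")}

# Four hypothetical samples at a single locus.
example = aberration_frequency([0.3, 0.0, -0.5, 0.1])
```

Note that the thresholds are asymmetric, so a segment with a median of exactly 0.2 or −0.3 is treated as neutral in this sketch.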

<sup>1</sup>https://apps.medgen.iupui.edu/rsc/tsunami/

#### Gene Co-expression Network Mining

We separated GSE5900 into three datasets: normal (n = 22), MGUS (n = 44), and SMM (n = 12). The GSE6477 dataset was separated in the same fashion into three datasets: normal (n = 15), MGUS (n = 22), and SMM (n = 22). GCN mining was conducted using the R package lmQCM. The lmQCM algorithm has an option for normalizing the edge weights of the weighted co-expression network by setting the sums of both the rows and the columns of the weight matrix to one, similar to the weight normalization in spectral clustering (Ng et al., 2001). Another important parameter for lmQCM is gamma, which controls the initiation of new gene modules in the iterative mining process. Here, we applied the edge weight normalization and tested varying gamma values; the remaining parameters were kept at their defaults. The normalization process suppresses high weights between nodes and boosts edges with relatively lower weights, which overcomes the issue of unbalanced edge weights in dense module mining algorithms (Zhang and Huang, 2016). The gamma variable ranges from 0 to 1 and controls the number of generated modules and the maximum module size. For normalized weights, the suggested range of gamma is 0.3–0.75. A higher gamma results in more total modules with fewer genes in the largest module. A lower gamma results in fewer total modules with more genes in the largest module. We selected gamma values that struck a balance between these two outcomes and elected to keep the largest module under 500 genes. Different values for gamma were selected to obtain a similar number of modules between the same conditions (i.e., normal, MGUS, or SMM) in GSE5900 and GSE6477. This allows the identified modules to be more comparable between datasets of the same condition. We chose the following gamma values for GSE5900: 0.60 for normal, 0.40 for MGUS, and 0.75 for SMM. The following gamma values were chosen for GSE6477: 0.65 for normal, 0.60 for MGUS, and 0.55 for SMM.

For comparison, we also applied the widely used weighted GCN mining algorithm WGCNA (Langfelder and Horvath, 2008) on the same datasets specifying a minimum module size of 10 and using power 5 or 6 as appropriate, leaving the rest of the settings as default. We then selected the most similar modules from lmQCM and WGCNA and calculated gene-wise Spearman correlations to quantify the co-expression density of each module. The most similar modules were determined using the Jaccard index between lmQCM and WGCNA modules in the same condition, where the Jaccard index is simply defined as the size of the intersection between two gene modules divided by the size of the union of the same two modules.
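The Jaccard index defined above is straightforward to compute from two gene lists. The sketch below is illustrative only; `jaccard` and `most_similar_pair` are hypothetical helper names, and the gene symbols are placeholders.

```python
def jaccard(module_a, module_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(module_a), set(module_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar_pair(modules_x, modules_y):
    """Return (index_x, index_y, jaccard) for the best-matching module pair."""
    return max(
        ((i, j, jaccard(mx, my))
         for i, mx in enumerate(modules_x)
         for j, my in enumerate(modules_y)),
        key=lambda triple: triple[2],
    )

# Placeholder modules from two algorithms run on the same condition.
modules_x = [["A", "B"], ["C", "D", "E"]]
modules_y = [["C", "D"], ["A", "X"]]
best = most_similar_pair(modules_x, modules_y)
```

Here `best` points at the second module of `modules_x` and the first of `modules_y`, which share two of three genes.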

### Identification of Condition-Specific Modules

Condition-specific modules are those in which the expression profiles of the genes in one module are more correlated in one condition than in the others (e.g., normal, MGUS, or SMM). We utilized a previously developed metric called the Centralized Concordance Index (CCI), which evaluates the co-expression of genes within modules identified from GCN analysis (Han et al., 2016). The CCI describes how strongly genes co-express and is calculated from a subset of gene expression data containing the genes from a module and samples from a single condition. CCI values range from 0 to 1, with a higher number indicating more densely correlated genes. For each gene module identified from lmQCM, we calculated the corresponding CCI in normal, MGUS, and SMM. The CCIs for each module were then compared across the three conditions, and modules with a difference of ∼0.2 in CCI values between MM precursors (MGUS or SMM) and normal were identified as potentially interesting.
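One way to read the CCI comparison is as a simple threshold rule, sketched below. The exact decision rule is not spelled out in the text, so this is an assumption made for illustration: a module mined in one condition is flagged when its CCI there exceeds its CCI in every other condition by at least the cutoff. The function name is hypothetical, and the CCI values themselves are taken as given inputs rather than computed.

```python
CCI_CUTOFF = 0.2  # approximate difference used in the text

def is_condition_specific(cci, mined_in, cutoff=CCI_CUTOFF):
    """Assumed rule: CCI in the mined condition beats all others by >= cutoff.

    cci: dict mapping condition name to the module's CCI in that condition.
    """
    return all(
        cci[mined_in] - value >= cutoff
        for condition, value in cci.items()
        if condition != mined_in
    )

# CCI values reported for the normal-specific module in Supplementary Figure S3.
s3_module = {"normal": 0.697, "MGUS": 0.226, "SMM": 0.252}
flagged = is_condition_specific(s3_module, "normal")
```

Under this assumed rule the Supplementary Figure S3 module is flagged as normal-specific, since its CCI drops by more than 0.4 in both precursor conditions.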

### Module Similarity Between Datasets

We further reduced our modules of interest by identifying modules with similar genes between GSE5900 and GSE6477. The Jaccard index, described above in Section "Gene Co-expression Network Mining," was used to calculate the similarity of modules in the same condition between GSE5900 and GSE6477. This calculation was conducted between every pair of modules in each condition: normal, MGUS, and SMM. Each resulting similarity matrix was then transformed into z-scores, and the top one percentile of similar module pairs from each condition was kept to filter the list of potentially interesting modules for enrichment analysis.
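The z-score filtering step can be sketched as follows. This is a minimal illustration, assuming that "top one percentile" means keeping the 1% of module pairs with the highest z-scores within a condition; the function name, the input layout (a dict keyed by module pair), and the toy matrix are hypothetical.

```python
from statistics import mean, pstdev

def top_pairs_by_zscore(similarity, percentile=0.01):
    """Standardize pairwise similarities and keep the top fraction of pairs.

    similarity: dict mapping (module_x, module_y) to a Jaccard index.
    Returns a list of ((module_x, module_y), z_score), highest first.
    """
    values = list(similarity.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all pairs equally similar; nothing stands out
    z = {pair: (s - mu) / sigma for pair, s in similarity.items()}
    ranked = sorted(z, key=z.get, reverse=True)
    n_keep = max(1, round(len(ranked) * percentile))
    return [(pair, z[pair]) for pair in ranked[:n_keep]]

# Toy matrix: 10 x 20 module pairs, two of which overlap substantially.
toy = {(i, j): 0.0 for i in range(10) for j in range(20)}
toy[(0, 0)] = 0.9
toy[(3, 5)] = 0.5
kept_pairs = top_pairs_by_zscore(toy)
```

Because standardization preserves rank order, the z-scores mainly make the cutoff comparable across conditions with different numbers of modules.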

#### Functional Enrichment Analysis and Identification of Upstream Regulators

We used the R package enrichR (Kuleshov et al., 2016) to conduct enrichment analysis of the genes in each module of interest. We specified the "GO Biological Process 2017b" and "KEGG 2016" databases for functional and pathway enrichment analyses. For determining the significance of GO and KEGG pathway terms, we used Bonferroni significance cutoffs of 0.05/nMods, where nMods is the number of modules corresponding to the specific dataset. For instance, the p-value cutoff for GSE5900 normal-specific data is 0.05/31 = 0.00161. We took GO terms with significant p-values and summarized them using the web-tool REVIGO (Supek et al., 2011).

Using enrichR, we specified the "TRANSFAC and JASPAR PWMs" database to identify TFs that regulate the genes in our modules of interest, using a less stringent Bonferroni cutoff of 0.1/nMods. We then narrowed down the list of TFs to those that were differentially expressed among the three conditions in either the gene expression data or the CNV segment median data, as assessed by Mann–Whitney tests between normal and MGUS samples and between normal and SMM samples.
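The two Bonferroni cutoffs reduce to simple arithmetic, shown here for the 31 normal-specific GSE5900 modules used as the example in the text. The helper names and the example p-values are hypothetical; only the cutoff formula comes from the Methods.

```python
def bonferroni_cutoff(alpha, n_modules):
    """Per-term p-value cutoff after dividing alpha by the number of modules."""
    return alpha / n_modules

def significant_terms(term_pvalues, cutoff):
    """Keep enrichment terms whose p-value falls below the cutoff."""
    return [term for term, p in term_pvalues.items() if p < cutoff]

n_mods = 31                                   # GSE5900 normal-specific modules
go_cutoff = bonferroni_cutoff(0.05, n_mods)   # GO and KEGG terms
tf_cutoff = bonferroni_cutoff(0.1, n_mods)    # TRANSFAC and JASPAR TFs

# Hypothetical p-values for two terms.
hits = significant_terms({"term_1": 0.0005, "term_2": 0.002}, go_cutoff)
```

With these inputs `go_cutoff` is about 0.00161 and `tf_cutoff` is exactly twice as large, so `term_2` would pass the TF cutoff but not the GO cutoff.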

#### Network Analysis of TF Targets

We used Ingenuity Pathways Analysis (IPA, Qiagen) for network analysis of TFs and their targets determined from enrichR to explore possible signaling pathways. We conducted core analyses (which is a function of IPA) for each TF and its targets, using experimentally observed knowledge in the Ingenuity Knowledge Base and specifying direct and indirect gene relationships in human tissue and cell lines.

### RESULTS

### lmQCM Produces Smaller-Sized Modules Than WGCNA With Stronger Gene Correlations

Our workflow is shown in **Figure 1**. After applying the lmQCM algorithm using the specified gamma values to the GSE5900 datasets, we obtained 78, 60, and 95 modules for normal, MGUS, and SMM, respectively; module sizes ranged from 10 to 400 genes. In GSE6477, using the specified gamma values, we obtained 79, 85, and 70 modules for the normal, MGUS, and SMM samples, respectively; module sizes ranged from 10 to 352 genes. Applying WGCNA to GSE5900, we obtained 40, 41, and 98 modules for normal, MGUS, and SMM, respectively; module sizes ranged from 11 to 4694 genes. In applying WGCNA to GSE6477, we obtained 34, 99, and 74 modules for normal, MGUS, and SMM, respectively; module sizes ranged from 11 to 4324 genes. Detailed breakdowns by sample type are shown in **Table 1**.

The most similar gene modules were identified from two SMM modules in lmQCM and WGCNA. The lmQCM module contained 224 genes and the WGCNA module contained 393 genes. The Jaccard index was 0.396, with an overlap of 175 genes. Within each respective module, we calculated the Spearman correlation in a gene-wise manner and conducted a two-sided Mann–Whitney test between the absolute value of the correlation coefficients in each population. The correlation coefficients were significantly higher in the lmQCM module (median: 0.399) compared to the WGCNA module (median: 0.322) with a p-value of 2.2E-16 (**Supplementary Figure S2**).
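As a quick consistency check on the reported similarity, the Jaccard index can be recomputed from the module sizes and the overlap, using the fact that the union of two sets has size $|A| + |B| - |A \cap B|$:

$$
J = \frac{|A \cap B|}{|A \cup B|} = \frac{175}{224 + 393 - 175} = \frac{175}{442} \approx 0.396
$$

This agrees with the stated value of 0.396.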

### Module Reduction Using CCI and Jaccard Similarity

Normal-, MGUS-, and SMM-specific modules were identified by calculating the CCI difference between normal and MGUS samples and normal and SMM samples and setting a cutoff of around 0.2 CCI difference. This resulted in 68 and 79 normal-specific modules, 45 and 72 MGUS-specific modules, 95 and 63 SMM-specific modules across GSE5900 and GSE6477 datasets, respectively. An example of a normal-specific gene module is visualized using Spearman correlation heatmaps in **Supplementary Figure S3**.

To further reduce the modules of interest, we compared module similarity between datasets using the Jaccard index. This reduced the interesting modules to more manageable numbers than using CCI alone, leaving 31 and 39 normal-specific modules, 22 and 31 MGUS-specific modules, and 47 and 30 SMM-specific modules across the GSE5900 and GSE6477 datasets, respectively. The module sizes ranged from 10 to 400 genes.

#### Frequency of CNVs Increases From MGUS to SMM

Chromosomes 2, 4, 10, 11, 12, and 21 were mostly unchanged and showed 10% or less allelic imbalance in all conditions. Chromosomes 1q, 3, 5, 6, 7, 9, 15, 18, and 19 were slightly amplified in MGUS and more amplified in SMM, with chromosomes 1q, 5, 9, and 19 showing the highest frequencies of change in SMM of around 40%. For instance, 1q had about 10% of MGUS samples amplified and around 40% of SMM samples amplified. We observed an increased frequency of deletions in chromosomes 1p, 6, 7, 8p, 10, 12p, 13, 14q, 16q, 18, 20, and 22q; the highest deletion frequency was around 25% and was observed in 8p, 13, 16q, and 22q of SMM patients. The CNV landscape across conditions is shown in **Figure 2**.

### EnrichR GO Results Are Highly Enriched in Immune-Related Terms

The top GO BP terms from all condition-specific modules are shown in **Supplementary Table S1**. In the normal-specific data, there were 95 significant GO BP terms that appeared in both GSE5900 and GSE6477, the top few being neutrophil degranulation, antigen processing and presentation of exogenous peptide antigen via MHC class II, and antifungal humoral response. These GO terms are mostly related to immune system response.

The MGUS-specific data had 40 significant GO BP terms in common from GSE5900 and GSE6477, with many immune function terms such as positive regulation of B cell activation, response to interferon-alpha, and B cell receptor signaling pathway.

The SMM-specific data shared 125 GO BP terms between GSE5900 and GSE6477 data, the most significant ones relating to the process of transcription and translation. There were also terms related to immune function such as B cell receptor signaling pathway.

#### Condition-Specific Modules From Four Identified TFs Describe Different Aspects of Myeloma

We identified these TFs as interesting: MAX, TCF4, ZNF148, and ZNF281. MAX was identified from a normal-specific module, TCF4 and ZNF148 were identified from MGUS-specific modules, and ZNF281 was identified from an SMM-specific module. Three TFs (MAX, TCF4, and ZNF148) were differentially expressed between normal and an MM precursor (MGUS or SMM) in the gene expression datasets and/or the CNV dataset (**Table 2**). While ZNF281 was not differentially expressed, it showed an interesting increase in copy number gain from normal to MGUS and to SMM.

#### Module Descriptions

The gene co-expression module containing MAX was functionally enriched in bleb assembly and activation of MAPKKK activity involved in innate immune response.

The gene co-expression module containing ZNF148 was functionally enriched in antigen processing and presentation of exogenous peptide antigen via MHC class II and negative regulation of peptide hormone processing.

In the gene co-expression module containing TCF4, multiple assembly complexes containing the genes GEMIN5, PPARGC1A, and TEAD1 were significantly enriched. They include apoptosome assembly, mitotic checkpoint complex assembly, and Wnt signalosome assembly.

TABLE 1 | GCN results from algorithms lmQCM and WGCNA.

The total number of resulting modules and size range are detailed by dataset and sample type.

The gene co-expression module containing ZNF281 was functionally enriched in genes involved in transcription. The enriched terms include transcription, DNA-templated; transcription from RNA polymerase II promoter; telomeric repeat-containing RNA transcription; and mRNA transcription.

The details of the GO BP enrichment results for these modules (top enriched terms and their corresponding p-values) are listed in **Supplementary Table S1**.

#### TFs Exhibit Consistent CNV and Gene Expression Trends During the Course of Myeloma Progression

MAX did not show differential gene expression; however, its copy number significantly decreased in MGUS and SMM compared to normal (p-val = 1.17E-05 and 6.10E-04, respectively, **Figures 3A,B**). The CNV pattern showed deletions in MGUS and amplification and deletions in SMM (**Figure 3B**).

ZNF148 was the only TF that showed significantly different CNV aberrations and gene expression, with gene expression and copy number amplification both increasing in MGUS and SMM (p-val range: 1.75E-02–3.11E-04, **Figures 4A,B**).

TCF4 was differentially expressed between normal/MGUS (p-val = 3.65E-03) and normal/SMM (p-val = 1.49E-02), with gene expression progressively increasing from MGUS to SMM (**Figure 5A**). In regard to CNVs, TCF4 exhibited amplifications in MGUS and amplifications and deletions in SMM (**Figure 5B**).

ZNF281 did not show differential gene expression (**Figure 6A**). ZNF281 showed increasing CNV amplifications from MGUS to SMM, but it was not considered significant by Mann–Whitney tests (**Figure 6B**).

#### TF Signaling Networks Are Related to Cancer Progression

IPA network analysis showed MAX and its targets interact with other TFs CCNT1, KLF10, and MYC. MAX is further predicted to target CCNG2 and TXNIP. BRD4 is shown to regulate expression of BHLHE40 and SLC7A2 (**Figure 3C**).

ZNF148 and its targets were shown to interact with TFs TP53, FOXO1, SP1, TCF3, HSF1, SMARCA4, and E2F1. Additionally, CDKN1A was shown to be a common target of the TFs listed above (**Figure 4C**).


TABLE 2 | Transcription factors of interest, identified from condition-specific modules in normal, MGUS, and SMM samples.

The chromosomal regions were determined by Rawcopy. The TF targets were identified by enrichR.

TCF4 and its targets were shown to interact with TFs RUNX2, CCND1, and HNF4A in addition to nuclear receptor PPARG and junction protein JUP (**Figure 5C**).

ZNF281 and its targets were shown to interact with TFs CREB1, CTNNB1, RELA, NPM1, and POU5F1. ZNF281 was shown to directly target GADD45A. TP53 was shown to be an intermediate interactor that connected each subnetwork (**Figure 6C**).

#### DISCUSSION

We conducted GCN analyses on two publicly available MM datasets and identified four TFs by a condition-specific method. This pipeline has not previously been applied to studying MM precursors. Our approach identified TFs expressed in condition-specific gene modules in publicly available MM data. We then validated our TFs with CNV data taken from a third publicly available dataset, looking for genes located on chromosomal segments that showed a consistent trend in aberration from normal to SMM, and identified four TFs: MAX, ZNF148, TCF4, and ZNF281.

The gene module that MAX belongs to was determined to be condition-specific in normal samples. This means that the genes in the module were observed to be co-expressed in normal samples and less so in MGUS and SMM samples. This suggests that MAX is dysregulated in MGUS and SMM, which we observed to be true in the CNV data. MAX is known to complex with MYC to regulate transcription (Kato et al., 1992) and MYC is commonly known to be constitutively active in MM. The MAX–MYC relationship has been targeted in previous studies to inhibit c-MYC activity in MM cell lines (Holien et al., 2012). This association appears to conflict with our data, which shows the chromosomal region of MAX deleted in some MGUS and SMM samples and decreased gene expression in some SMM samples. An alternate explanation can be found in studies that show MYC can function independently of MAX in pheochromocytoma and small cell lung cancer (Ribon et al., 1994; Romero et al., 2014). MAX-independent expression of MYC in MM and its precursors requires further investigation; a recent abstract identified MAX as a tumor suppressor driver gene in MM (Garcia et al., 2017), which is a promising start.

ZNF148 has been implicated in other MM studies (Magrangeas et al., 2003; Dong et al., 2015), but to our knowledge, none have directly associated this gene with MGUS or SMM. The associated chromosomal segment of ZNF148 was progressively amplified from normal to MGUS and to SMM, corresponding with increased ZNF148 gene expression. This suggests that this TF is involved as a driver in disease progression earlier than previously thought.

TCF4 was differentially overexpressed in MGUS and SMM compared to normal. TCF4 was not significantly amplified in MGUS, although this may be due to small sample size. We suggest that copy number amplification may play a part in TCF4 dysregulation and may be involved in the initiation of MGUS but not SMM. This reasoning is based on the observation that the TCF4 region is solely amplified in MGUS, whereas there is a mix of amplified and deleted regions in SMM. This is consistent with our identification of TCF4's gene module as MGUS-specific. Module enrichment and network analysis suggest Wnt signaling through TCF4 contributes to RUNX2 and CCND1 overexpression. RUNX2 overexpression has been shown to be a driver of MM progression (Li et al., 2014; Trotter et al., 2015). CCND1 overexpression has typically been observed to occur in MM precursors with translocations involving chromosomes 11 and 14 (Miura et al., 2003; Zhan et al., 2006). In gastric cancer, CCND1 has been shown to directly interact with TCF4 through the Wnt signaling pathway (Zheng et al., 2018), suggesting that other mechanisms of CCND1 overexpression may also occur in MM.

ZNF281 was increasingly amplified from MGUS to SMM patients. However, this is not considered statistically significant, possibly due to small sample size. Module enrichment results suggest transcriptional genes are more active in SMM, consistent with the fact that cancer cells require continued transcription in order to grow and proliferate. Increased transcription increases the chances of mutations in the DNA, which would activate tumor suppressor p53 and lead to cell cycle arrest or apoptosis in normal functioning cells. Cancer cells commonly have mutated TP53 to avoid transcriptional control and apoptosis. However, TP53 mutations are relatively rare in newly diagnosed MM patients (Chng et al., 2007b; Abdi et al., 2017). Our IPA network analysis suggests that TP53 may be regulated by CTNNB1. A previous study showed CTNNB1 suppressed TP53 in smooth muscle cells during artery formation (Riascos-Bernal et al., 2016). Something similar may be occurring in MM.

FIGURE 3 | (A) MAX expression across sample groups. Mann–Whitney tests between groups showed no significant difference. (B) Observations of MAX copy number. Mann–Whitney tests showed significant copy number variation between Normal and MGUS (p = 1.17E-05) and between Normal and SMM (p = 6.10E-04). (C) A predicted interaction network of MAX and its downstream targets. The gray nodes indicate genes from our module and the white nodes are gene interactions defined in IPA. Solid lines between nodes indicate a direct interaction supported by the Ingenuity Knowledge Base while the dashed line indicates an indirect interaction. Significance levels: <sup>∗</sup>p ≤ 0.05; ∗∗p ≤ 0.01; ∗∗∗p ≤ 0.001.

As previously observed by the original authors (López-Corral et al., 2012), the incidence of CNVs progressively increased from normal to MGUS and to SMM. Our analysis with Rawcopy identified similar regions of amplification and deletion from normal to MGUS and from MGUS to SMM. While not all the chromosomal regions were considered statistically different in the original study, it is visually striking how the frequency of chromosomal aberrations increases in patients from MGUS to SMM. The chromosomal regions of our identified TFs exhibited copy number changes. We suggest that these copy number alterations affect gene expression to an extent. The limitation is that we cannot offer direct evidence for this; therefore, we suggest further exploration of this relationship in the laboratory.

FIGURE 5 | (A) TCF4 expression across sample groups. Mann–Whitney tests showed significant differential expression between Normal and MGUS (p = 3.65E-03) and between Normal and SMM (p = 1.49E-02). (B) Observations of TCF4 copy number. Mann–Whitney tests showed no significant differences between any groups. (C) A predicted interaction network of TCF4 and its downstream targets. The gray nodes indicate genes from our module and the white nodes are gene interactions defined in IPA. Solid lines between nodes indicate a direct interaction supported by the Ingenuity Knowledge Base. Significance levels: <sup>∗</sup>p ≤ 0.05; ∗∗p ≤ 0.01; ∗∗∗p ≤ 0.001.

There are other limitations to our study that we should acknowledge. We filtered our gene lists down to 12,000–15,000 genes out of ∼22,000 and ∼54,000 microarray probes and identified TFs that showed consistent trends across groups. We may have removed or overlooked genes that could also play a part in myelomagenesis or progression. Although we inferred potential biological mechanisms of the four TFs from the literature, the clinical significance of these genes remains to be investigated. Further research can be conducted to assess the pertinence of our TFs and to integrate other data modalities into additional analyses. Despite these drawbacks, these genes appear to have a relevant role in MM initiation and progression.

#### CONCLUSION

In conclusion, we interrogated the role that TFs have in MM progression using a pipeline of GCN analysis, condition-specific gene module selection, TF enrichment analysis, and CNV analysis. We identified the TFs MAX, ZNF148, TCF4, and ZNF281 from gene expression data and validated that their CNVs change from normal to MGUS and SMM. We examined the biological relevance of these TFs in MM and suggest further study of these genes in the laboratory.

#### AUTHOR CONTRIBUTIONS

KH and CY conceptualized the study. CY analyzed and interpreted the multiple myeloma data and was the major contributor in writing the manuscript. TJ, SX, and ZHu contributed to the design of the experiments and data interpretation. MA and XZ critically reviewed the manuscript. ZHa and KH gave research direction.


#### FUNDING

The funding for this study was partially supported by the NLM grant (4 T15 LM 11270-5), Indiana University Precision Health Initiative, and NCI ITCR U01 CA188547.

#### ACKNOWLEDGMENTS

The authors thank Dr. Jie Zhang for her suggestions on module analyses and Ms. Megan Metzger for proofreading the manuscript.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00468/full#supplementary-material

FIGURE S1 | Sample identity distograms of SMM samples produced by Rawcopy. (A) Distogram including GSM777173 that suggests this sample has some relatedness to other samples. (B) Distogram after removing GSM777173.

FIGURE S2 | Gene-wise correlation heatmap of the two most highly similar modules in (A) lmQCM (n = 224) and (B) WGCNA (n = 393). The correlation coefficients are the absolute value of the Spearman correlation. The median correlation coefficient is higher in lmQCM (0.403) compared to WGCNA (0.344). SCC, Spearman correlation coefficient.

FIGURE S3 | Gene-wise correlation heatmap of a normal-specific gene module. The genes in the module were identified by lmQCM in the normal samples. Gene-wise correlation coefficients are calculated from gene expression in each respective condition: (A) Normal, (B) MGUS, and (C) SMM. The correlation coefficients are the absolute value of the Spearman correlation. The genes are more correlated in normal samples and decrease in correlation in MGUS and SMM samples. The CCI values are 0.697, 0.226, and 0.252, respectively. SCC, Spearman correlation coefficient.

TABLE S1 | GO BP enrichment results identified by enrichR. The most relevant enrichment terms are included along with the enrichment size and p-value associated with the corresponding dataset.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Yu, Xiang, Huang, Johnson, Zhan, Han, Abu Zaid and Huang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Abundance of HPV L1 Intra-Genotype Variants With Capsid Epitopic Modifications Found Within Low- and High-Grade Pap Smears With Potential Implications for Vaccinology

#### Edited by:

Junbai Wang, Oslo University Hospital, Norway

#### Reviewed by:

Xiangyun Wang, Novartis, United States Manoj Kumar, Institute of Microbial Technology (CSIR), India Luis Felipe Jave-Suarez, Centro de Investigación Biomédica de Occidente (CIBO), Mexico

#### \*Correspondence:

Jane Shen-Gunther (jane.shengunther.mil@mail.mil; shengunther@livemail.uthscsa.edu); Yufeng Wang (yufeng.wang@utsa.edu)

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 04 December 2018 Accepted: 06 May 2019 Published: 24 May 2019

#### Citation:

Shen-Gunther J, Cai H, Zhang H and Wang Y (2019) Abundance of HPV L1 Intra-Genotype Variants With Capsid Epitopic Modifications Found Within Low- and High-Grade Pap Smears With Potential Implications for Vaccinology. Front. Genet. 10:489. doi: 10.3389/fgene.2019.00489

#### Jane Shen-Gunther<sup>1</sup>\*, Hong Cai<sup>2,3</sup>, Hao Zhang<sup>2</sup> and Yufeng Wang<sup>2,3</sup>\*

<sup>1</sup> Gynecologic Oncology and Clinical Investigation, Department of Clinical Investigation, Brooke Army Medical Center, Fort Sam Houston, TX, United States, <sup>2</sup> Department of Biology, University of Texas at San Antonio, San Antonio, TX, United States, <sup>3</sup> South Texas Center for Emerging Infectious Diseases, University of Texas at San Antonio, San Antonio, TX, United States

Background: The aim of this study was to explore the Human Papillomavirus (HPV) genotype composition and intra-genotype variants within individual samples of low- and high-grade cervical cytology by deep sequencing. Clinical, cytological, sequencing, and functional/structural data were forged into an integrated variant profiling pipeline for the detection of potentially vaccine-resistant genotypes or variants.

Methods: HPV-positive low- and high-grade squamous intraepithelial lesion (LSIL and HSIL) cytology samples were subjected to amplicon (L1 gene fragment) sequencing by dideoxy (Sanger) and deep sequencing methods. Taxonomic, abundance, diversity, and phylogenetic analyses were conducted to determine HPV genotypes/sub-lineages, relative abundance, species diversity, and phylogenetic distances within and between samples. Variant detection and functional analysis of translated L1 amino acid sequences determined structural variations of interest.

Results: Pure and mixed HPV infections were common among LSIL (n = 6) and HSIL (n = 6) samples. Taxonomic profiling revealed loss of species richness and gain of dominance by carcinogenic genotypes in HSIL samples. Phylogenetic analysis showed excellent correlation between HPV-type specific genetic distances and carcinogenic potential. For combined LSIL/HSIL samples (n = 12), 11 HPV genotypes and 417 mutations were detected: 375 single-nucleotide variants (SNV), 29 insertion/deletion (indel), 12 multi-nucleotide variants (MNV), and 1 replacement variant. The proportion of nonsynonymous mutations was lower for HSIL (0.38) than for LSIL samples (0.51) (p < 0.05). HPV variant analysis pinpointed nucleotide-level mutations and amino acid-level structural modifications.

Conclusion: HPV L1 intra-host and intra-genotype variants are abundant in LSIL and HSIL samples with potential functional/structural consequences. An integrated multiomics approach to variant analysis may provide a sensitive and practical means of detecting changes in HPV evolution and dynamics within individuals or populations.

Keywords: human papillomavirus, HPV genotyping, HSIL, late major capsid protein L1, metagenome, next generation sequencing, protein structure prediction, vaccine

#### INTRODUCTION

In 1932, Richard Shope isolated the first papillomavirus (PV) from crude extracts of "warty" tumors found on the skin of a wild cottontail rabbit (Shope, 1932). Since then, 183 animal PVs and 225 HPVs have been discovered and classified in The Papillomavirus Episteme (PaVE) (Van Doorslaer et al., 2017)<sup>1</sup>. With the advent of metagenomic sequencing, the rate of HPV discovery has accelerated rapidly (Bzhalava et al., 2014), and the resolution of HPV viromes and variants has sharpened immensely (Shen-Gunther et al., 2017), allowing in-depth analysis of genetic variations and functional consequences (van der Weele et al., 2017; Dube Mandishora et al., 2018).

PVs are believed to have co-evolved with their hosts over 350 million years (Doorbar et al., 2015). Through phylogenetic analysis, Chen et al. (2018a) demonstrated that viral niche-adaptation to host ecosystems (tissue tropism) anteceded viral-host codivergence. The PV-host tissue tropism apparently played a vital role in shaping the molecular evolution of oncogenic HPV from archaic hominins to modern humans. HPV-16, an extraordinary result of evolutionary processes over the last 40 million years (Chen et al., 2018a), has emerged as a highly potent carcinogen with a predilection for human mucosa. HPV-16 is now the leading cause of invasive cervical cancer and other cancers of the oropharyngeal and anogenital tracts (Bosch et al., 2013).

The HPV genome is a ∼8,000 base pair (bp), double-stranded, circular DNA packaged within a protein capsid. The prototypical genome encodes 6 early genes (E1, E2, E4, E5, E6, and E7) and 2 late genes (L1 and L2) (Van Doorslaer et al., 2017). Specifically, the L1 gene encodes the major capsid protein, which forms a pentameric capsomer that self-arranges into a 72-subunit icosahedral capsid. The capsid is essential for viral binding and entry into host-specific tissues (Buck et al., 2013). Furthermore, the L1 coding sequences of the immunogenic surface loops are distinctively poorly conserved due to selective pressures for mutagenesis and immune evasion (Buck et al., 2013).

<sup>1</sup>https://pave.niaid.nih.gov

Recently, whole-genome Sanger and deep sequencing studies have shown a surprisingly high level of intra-host diversity of HPV-16, −18, −52, and −58 (van der Weele et al., 2017, 2018; Hirose et al., 2018). Extensive intra-host HPV L1 sequence variability in 35 HPV genotypes was also discovered in samples from Zimbabwean women by deep sequencing (Dube Mandishora et al., 2018). Such intra-host viral sequence variability is believed to be caused by error-prone host replication machinery used for viral replication and HPV-induced APOBEC deaminase activity with ensuing selective shaping by host tissues and immune responses (Dube Mandishora et al., 2018; Hirose et al., 2018). These remarkable findings of L1 genetic variability are clinically important due to potential structural changes on the epitopes of virions arising from nonsynonymous mutations. The result may be ineffectual binding by host neutralizing antibodies induced by either natural infections or prophylactic vaccines (Bissett et al., 2016; El-Aliani et al., 2017).

Using a multi-omics approach, we aimed to explore the HPV genotype composition and intra-genotype variants within individual samples of low and high-grade cervical cytology. We also focused on the genetic and translated amino acid sequence variations of L1 informed by next-generation sequencing (NGS) for mapping onto the structure of HPV antigenic loops as a means of variant profiling and visualization.

#### MATERIALS AND METHODS

#### Subjects and Samples

Residual liquid-based cervical cytology samples were consecutively procured from the Department of Pathology after completion of cytological diagnosis. Demographic and cytohistological data were abstracted from the electronic health record (AHLTA) of the Department of Defense (DoD) and linked to each sample. In our previous study, three categories of samples, i.e., negative for intraepithelial lesion or malignancy (NILM), low-grade squamous intraepithelial lesion (LSIL) and high-grade squamous intraepithelial lesion (HSIL) were collected for HPV genotyping and DNA methylation analysis (Shen-Gunther et al., 2016). For this pilot study, we randomly selected a subset of HPV-positive LSIL (n = 6) and HSIL (n = 6) for characterization and comparison of viral diversity and variant analysis.

**Abbreviations:** BLAST, Basic Local Alignment Search Tool; CIN, cervical intraepithelial neoplasia; HPV, Human Papillomavirus; HSIL, high-grade squamous intraepithelial lesion; IARC, International Agency for Research on Cancer; Indel, insertion/deletion; LSIL, low-grade squamous intraepithelial lesion; MCL, Maximum Composite Likelihood; ML, Maximum Likelihood; MNV, multi-nucleotide variant; NGS, next-generation sequencing; NJ, Neighbor-Joining; ORF, open reading frame; Pap, Papanicolaou smear; PaVE, papillomavirus genome database; PCoA, principal coordinate analysis; PDB, Protein Data Bank; QC, Quality Control; SNV, single nucleotide variant; WHO, World Health Organization.

### HPV L1 DNA Amplification and Deep Sequencing

DNA extraction from residual liquid-based cervical cytology for HPV DNA amplification and deep sequencing was performed as described previously (Shen-Gunther et al., 2017). Briefly, HPV DNA was amplified using the consensus primer set MY09/11 to target a 450 bp region (corresponding to flanking nucleotide positions 6584/7035 on HPV-16) of the L1 gene for genotype identification (Shen-Gunther and Yu, 2011). The PCR products were then purified for construction of DNA libraries using the Nextera XT kit (Illumina). Each DNA sample (1 ng) with a standardized concentration of 0.1–0.2 ng/µL was "tagmented" (fragmented and tagged with sequencing adapters) and barcoded with dual index adaptors. The DNA libraries were normalized quantitatively for equal representation from each sample prior to pooling and sequencing. Paired-end bidirectional sequencing (2 × 300 bp) was performed on the MiSeq (Illumina) instrument using the MiSeq Reagent Kit v3 (600 cycle) for bridge amplification. Quality sequences were subjected to nucleotide BLAST (Altschul et al., 1997) against the HPV sequences in the papillomavirus genome database (PaVE) (Van Doorslaer et al., 2017)<sup>2</sup> to determine the HPV genotype(s) (Shen-Gunther and Yu, 2011).

The PCR products were concurrently subjected to dideoxy (Sanger) sequencing for validation of deep-sequenced results. Briefly, amplicons (∼200 ng DNA/sample) were sequenced using primer MY11 at Eurofins Operon (United States). The resulting quality sequences were BLAST aligned for HPV genotyping as described above.

#### Next-Generation Sequencing (NGS) Data Analysis, Genotyping, and Taxonomic Profiling

The pre-configured, automated Quality Control (QC) workflow implemented in Illumina MiSeq output a series of QC metrics, including the summary statistics of the reads and the Phred quality scores Q, which correspond to the base-calling error probabilities (Ewing and Green, 1998; Ewing et al., 1998). The reads were processed using the CLC Genomics Workbench 11.0.1 (QIAGEN). The Core NGS workflow was implemented, including: (1) Preprocessing reads with quality trimming based on quality scores with a limit cutoff of 0.05, an ambiguity number ≤2, and adapter trimming. (2) Merging overlapping pairs to improve the read quality. The parameter setting was mismatch cost 2, gap cost 3, and minimum score 8. (3) Mapping to the nonredundant HPV reference genome database, which was constructed based on the collection and annotation of the PaVE database (Van Doorslaer et al., 2017). Mapping parameters included a read alignment match score of 1, a mismatch penalty of 2, and a linear gap cost of 3 for insertion or deletion. (4) Taxonomic profiling. The Microbial Genomics Module was implemented to perform qualification, by assigning a read to an HPV genotype if a match was found, and quantification of the abundance of each qualified HPV genotype to generate an abundance table for each sample. Reads matching the host genome were filtered.

<sup>2</sup>https://pave.niaid.nih.gov

### Diversity Analysis of HPV Communities in LSIL and HSIL Samples

The diversity of the HPV genotypes was analyzed for each sample using the Microbial Genomics Module of the CLC Genomics Workbench 11.0.1 (QIAGEN). α-diversity of the HPV communities was computed to measure within-sample variation by (1) Simpson's index (Simpson, 1949): $SI = 1 - \sum_{i=1}^{n} p_i^2$, and (2) Shannon entropy (Shannon, 1948): $H = -\sum_{i=1}^{n} p_i \log_2 p_i$, where $n$ was the number of HPV genotypes found in the sample, and $p_i$ was the proportion of reads that were identified as the $i$th HPV genotype. β-diversity analysis was performed with principal coordinate analysis (PCoA) of Bray-Curtis distances (Bray and Curtis, 1957): $B = \frac{\sum_{i=1}^{n} |x_i^A - x_i^B|}{\sum_{i=1}^{n} (x_i^A + x_i^B)}$, where $n$ is the number of operational taxonomic units (OTUs) and $x_i^A$ and $x_i^B$ are the respective abundances of OTU $i$ in samples A and B, to measure the dissimilarity or "distance" of HPV genotype composition between samples. Principal component analysis (PCA) was used to determine the correlative relationship between variables (HPV genotypes) in the LSIL or HSIL group. PCA was performed on the covariance matrix of natural log-transformed abundance data [ln (n + 1)] of HPV genotypes within each sample (Rencher and Christensen, 2012). Log transformation was applied to reduce the influence (skewness) of highly abundant genotypes. PCA was performed using STATA/IC 15.0 (StataCorp).

### Phylogenetic Analysis and Tree Construction of HPV Genotypes

Multiple alignment of consensus sequences of each HPV genotype detected in the HSIL and LSIL samples was obtained using the T-coffee program (Notredame et al., 2000). The evolutionary history of the HPV L1 sequences was inferred by using the Maximum Likelihood (ML) method (Felsenstein, 1981) and Tamura-Nei model (Tamura and Nei, 1993). Initial trees for the heuristic search were obtained automatically by applying Neighbor-Joining (NJ) (Saitou and Nei, 1987) and BioNJ (Gascuel, 1997) algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. Bootstrap resampling with 1,000 pseudoreplicates was carried out to assess support for each individual branch (Felsenstein, 1985). Branches with bootstrap values of <50% were collapsed and treated as unresolved polytomies. Evolutionary analyses were conducted in MEGA X (Kumar et al., 2018).

#### Detection of HPV L1 Sequence Variants, Amino Acid Alterations, and Structural Modifications

Variants were detected by comparison to reference sequences of each HPV type, using the Low Frequency Variant Detection Module in the CLC Bio Genomics Workbench 11.0.1 (QIAGEN), where an error model was included to exclude variants that were likely due to sequencing errors. Variants were classified into four categories: SNV, MNV, indel, or replacement of one or more bases.
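The four variant categories named above can be illustrated with a minimal Python sketch. It is not the CLC Genomics Workbench implementation: the function name `classify_variant`, the convention of passing an empty string for an absent allele, and the exact boundaries between categories (equal-length changes of two or more bases as MNV, unequal-length substitutions as replacement) are assumptions made here for illustration.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Assign a variant to SNV, MNV, indel, or replacement.

    Assumed conventions (hypothetical, not taken from the CLC documentation):
    - ref/alt are the reference and observed alleles as base strings
    - an empty string means the allele is absent (insertion or deletion)
    """
    if ref == alt:
        raise ValueError("reference and alternate alleles are identical")
    if len(ref) == 0 or len(alt) == 0:
        # Bases gained or lost relative to the reference
        return "indel"
    if len(ref) == len(alt):
        # One substituted base is an SNV; a run of adjacent substitutions is an MNV
        return "SNV" if len(ref) == 1 else "MNV"
    # Both alleles present but of unequal length
    return "replacement"
```

For example, `classify_variant("A", "G")` falls in the SNV category, whereas `classify_variant("AC", "G")` is treated as a replacement under these assumptions.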

The functional consequences of detected variants in each sample were inferred based on the predicted changes at the codon level. These changes were classified as nonsynonymous (with amino acid changes), synonymous (silent mutation without alteration in amino acid designation), or indels, which can lead to a reading frame shift or an early stop codon. To map the amino acid changes to protein structure, BLAST searches were conducted to identify the homologous HPV L1 structure(s) collected in the Protein Data Bank (PDB)<sup>3</sup> (Berman et al., 2000). 3D models showing the structure of HPV L1 protein with variant and reference sites were created using the CLC Bio Genomics Workbench 11.0.1 (QIAGEN). Another protein structural feature useful for identification of antigenic determinants, i.e., surface probability, was calculated using the protein module of CLC Bio Genomics Workbench 11.0.1 (QIAGEN). The surface probability (accessibility) of an amino acid is predicted using Emini's formula: $S_n = \left[\prod_{i=1}^{6} \delta_{n+4-i}\right] \times (0.37)^{-6}$, where $S_n$ is the surface probability of amino acid $n$, equating to the normalized product of fractional surface probabilities ($\delta_x$) of six amino acids flanked by positions $n - 2$ and $n + 3$ (Emini et al., 1985). The $S_n$ of a random hexapeptide is 1.0 (threshold); a value >1.0 indicates increased surface probability.
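As a worked illustration of Emini's formula, the following Python sketch computes the surface probability for one residue from a list of per-residue fractional surface probabilities. The function name, the 0-based indexing, and the input format are assumptions made here; the per-residue values themselves come from the published scale and are not reproduced, so the caller must supply them.

```python
from math import prod

# Normalizing constant from the formula: the value assumed for each
# position of a random hexapeptide, so that its S_n equals 1.0
RANDOM_RESIDUE_VALUE = 0.37


def emini_surface_probability(deltas, n):
    """Surface probability S_n for the residue at 0-based index n.

    deltas: fractional surface probability of every residue in the
            sequence, in order (values supplied by the caller).
    The window spans positions n-2 through n+3, six residues in total.
    """
    start, stop = n - 2, n + 3
    if start < 0 or stop >= len(deltas):
        raise ValueError("hexapeptide window extends beyond the sequence")
    window = deltas[start:stop + 1]
    return prod(window) * RANDOM_RESIDUE_VALUE ** -6
```

A window in which every residue has the value 0.37 returns 1.0, the threshold quoted in the text; doubling one value in the window doubles the result.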

#### HPV Taxonomy and Carcinogenicity Classifications

The genotype classification of PV is based on the DNA sequence of the L1 gene (de Villiers et al., 2004; Bernard et al., 2010). The definitions for taxonomic ranks (PaVE) are as follows: (1) Genera: members of the same genus share >60% nucleotide sequence identity in the L1 open reading frame (ORF), (2) Species: PV types within a species share between 71 and 89% nucleotide identity within the complete L1 ORF, (3) Genotypes: PV of the same type share ≥90% nucleotide sequence identity, (4) Variants: <2% sequence difference from a known type, (5) Variant lineage: PV genomes with approximately 1.0% nucleotide sequence difference (proposed nomenclature), and (6) Sub-lineage: PV genomes with 0.5–1.0% nucleotide sequence difference (proposed nomenclature).

The World Health Organization (WHO) International Agency for Research on Cancer (IARC) Working Group assessed the carcinogenic potential of HPV types and classified them into three categories (International Agency for Research on Cancer, 2012): (1) carcinogenic: HPV types 16, 31, 33, 35, 52, and 58 in α-9, HPV types 18, 39, 45, 59, and 68 in α-7, HPV type 51 in α-5, and HPV type 56 in α-6; (2) possibly carcinogenic: HPV types 26, 69, and 82 in α-5, HPV types 30, 53, and 66 in α-6, HPV types 70, 85, and 97 in α-7, HPV type 67 in α-9, and HPV types 34 and 73 in α-11; and (3) not classifiable/not carcinogenic: the viruses in this group are from α-1, -2, -3, -4, -8, -10, -13, and -14/15. HPV types 6 and 11 were not classifiable, and all others were probably not carcinogenic (Schiffman et al., 2009; Bernard et al., 2010).
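The carcinogenicity categories listed above can be encoded as a simple lookup, sketched here in Python. The type lists are copied from the text; the function name `iarc_category` and the choice to return "probably not carcinogenic" for any type not explicitly listed are simplifying assumptions for illustration.

```python
# HPV type numbers taken from the classification described in the text
CARCINOGENIC = {16, 31, 33, 35, 52, 58,   # alpha-9
                18, 39, 45, 59, 68,       # alpha-7
                51,                       # alpha-5
                56}                       # alpha-6
POSSIBLY_CARCINOGENIC = {26, 69, 82,      # alpha-5
                         30, 53, 66,      # alpha-6
                         70, 85, 97,      # alpha-7
                         67,              # alpha-9
                         34, 73}          # alpha-11
NOT_CLASSIFIABLE = {6, 11}


def iarc_category(hpv_type: int) -> str:
    """Return the carcinogenicity category for an HPV type number."""
    if hpv_type in CARCINOGENIC:
        return "carcinogenic"
    if hpv_type in POSSIBLY_CARCINOGENIC:
        return "possibly carcinogenic"
    if hpv_type in NOT_CLASSIFIABLE:
        return "not classifiable"
    # Assumed default for all remaining types (simplification)
    return "probably not carcinogenic"
```

With this lookup, a sample's genotype list can be annotated in one pass, e.g. `[iarc_category(t) for t in (16, 53, 61)]`.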

#### RESULTS

#### Deep Sequencing Resolved Viromes and Genotypes of Mixed HPV Infections for Differentiation Between LSIL and HSIL Samples

This study included 12 cytology samples, classified as LSIL (n = 6) and HSIL (n = 6) (**Table 1**). The median age of the cohort was 28 years (range, 21–40). For the LSIL group, the median age [34 years (range, 22–40)] was slightly greater than that of the HSIL group [27 years (range, 21–29)]. Histological results from cervical biopsies or excisions were available for 9 of 12 (75%) samples. Histological validation of the cytology samples showed overall good agreement (78%) (**Table 1**).

Both traditional Sanger and NGS platforms were used to detect HPV genotypes and sub-lineages within each sample. Sanger sequencing resolved the single dominant HPV genotype within each sample. Compared to Sanger sequencing, NGS achieved a better resolution in detection of mixed genotypes (up to four in this cohort) and low-abundance genotypes (**Table 2**). Comparing the dominant genotypes and sub-lineages derived from both sequencing methods, the inter-assay agreement was 100%. A tabulated summary of NGS reads is shown in **Table 2**. The median number of reads that passed quality check for the 12 samples was 328,197. The proportion of merged reads that were mapped to reference HPV genotype(s) ranged from 94.9 to 99.8%.

#### TABLE 1 | Cytohistological correlation.

CIN, cervical intraepithelial neoplasia; HSIL, high-grade squamous intraepithelial lesion; HPV, human papillomavirus; LSIL, low-grade squamous intraepithelial lesion. <sup>a</sup>Cervical histopathology is based on the highest grade documented on cervical biopsy or therapeutic excisional biopsy, i.e., cold knife conization (CKC) and loop excisional procedure (LEEP). Presence or absence of pathology reports in the DoD electronic health records was categorized as "Documented" or "Not documented," respectively. <sup>b</sup>Cytohistological agreement was calculated using samples with documented histopathology. LSIL cytology corresponds to CIN I histology; HSIL cytology corresponds to CIN II/III histology.

<sup>3</sup>https://www.rcsb.org/

### HPV Communities Were Dissimilar Between LSIL and HSIL With Loss of Species Richness and Gain of HPV-16 Dominance in HSIL Samples

The composition of HPV genotypes in each sample is illustrated in **Figure 1**. For six LSIL L1 samples, the number of genotype(s) per sample was distributed as: 1 (16.7%), 2 (66.6%), and 3 (16.7%). For six HSIL samples, the number of genotype(s) per sample was distributed as: 1 (33.3%), 2 (33.3%), 3 (16.7%), and 4 (16.7%). Notably, all HSIL samples contained at least one carcinogenic HPV genotype, whereas only half of the LSIL samples were found to have a carcinogenic genotype (**Table 2**).

We analyzed the HPV diversity, dominance and community structure between LSIL and HSIL samples. A total of 10 different genotypes were found in single and mixed-infected LSIL samples, whereas seven different genotypes were identified in HSIL samples. The respective Shannon entropy indices for LSIL and HSIL samples were 0.32 and 0.16, suggesting reduced diversity in HSIL samples (**Figure 2**). The dominant (most abundant) genotype in LSIL samples was HPV-61 versus HPV-16 for HSIL. HPV-16, one of the most important carcinogens, responsible for almost half of cervical cancer cases (Taylor et al., 2016; Mirabello et al., 2017), was found in 5 of 6 (83.3%) HSIL samples. Two additional carcinogenic genotypes, HPV-18 and HPV-39, were also discovered in HSIL samples. By contrast, HPV-61, which was considered noncarcinogenic, had 50% occurrence in LSIL samples, indicative of low risk for cervical cancer (Schiffman et al., 2009). It is worth noting that two LSIL samples contained carcinogenic genotypes (HPV-58 in Sample 81, and HPV-16 in Sample 160), suggesting a finer resolution by HPV molecular profiling than cytological grading for carcinogenic potential.

We further examined the diversity of each sample estimated through read counts and Simpson's index (**Figure 2**). The reduced diversity in high-grade cytology samples is supported by the mean Simpson's indices (0.12 versus 0.05 for LSIL and HSIL, respectively). Sample 81 showed a relatively high diversity among LSIL samples, likely due to the presence of two abundant genotypes, HPV-58 and HPV-61. Sample 305 had the highest diversity among HSIL samples, with mixed infection of four genotypes (carcinogenic HPV-18, and possibly carcinogenic HPV-53, HPV-66, and HPV-70). Samples with pure HPV genotypes, 137 and 399, exhibited low diversity.
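The per-sample α-diversity values discussed above follow directly from genotype read counts. The Python sketch below applies the definitions given in the Methods (Simpson's index as one minus the sum of squared proportions, Shannon entropy with a base-2 logarithm). The function names and the input format, a list of read counts per genotype, are assumptions for illustration and do not reproduce the CLC Microbial Genomics Module.

```python
from math import log2


def _proportions(read_counts):
    """Convert per-genotype read counts to proportions, skipping zeros."""
    total = sum(read_counts)
    if total <= 0:
        raise ValueError("sample has no mapped reads")
    return [count / total for count in read_counts if count > 0]


def simpson_index(read_counts):
    """Simpson's index: 1 - sum(p_i^2); 0 for a single-genotype sample."""
    return 1.0 - sum(p * p for p in _proportions(read_counts))


def shannon_entropy(read_counts):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * log2(p) for p in _proportions(read_counts))
```

Under these definitions a pure infection scores zero on both indices, while two genotypes at equal abundance give a Simpson's index of 0.5 and a Shannon entropy of 1 bit, which is why samples dominated by one genotype show low diversity.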

Dissimilarity of HPV communities across HSIL and LSIL samples was visualized by principal coordinate analysis (PCoA) of Bray-Curtis distances (Bray and Curtis, 1957; **Figure 3**). PCoA showed HPV-16 (PCo 1, 60%) as being the most influential genotype in HSIL. In contrast, LSIL was influenced about equally (PCo 1–3, 21–26%) by carcinogenic, possibly carcinogenic, and probably not carcinogenic/not classifiable genotypes. As for the PCA results, the component loadings plot for LSIL and HSIL showed the correlative relationship between HPV genotypes along the first two principal component axes (PC1 and PC2) (**Supplementary Figure 3**). The sum of PC1 and PC2 explained 51.6 and 96.2% of the total variance for LSIL and HSIL, respectively. Comparing LSIL and HSIL, HPV-16 emerged from all other genotypes as the dominant component in HSIL. The score variables plots displayed each sample's contribution to the principal components. HSIL compared to LSIL had a preponderance of samples containing a high composition of HPV-16.
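The Bray-Curtis dissimilarity underlying the PCoA can be sketched in a few lines of Python. This uses the standard form of the index, with the summed absolute differences in the numerator and the summed abundances of both samples in the denominator; the function names and the dictionary-based input are assumptions for illustration, and the ordination step itself is not shown.

```python
def bray_curtis(x_a, x_b):
    """Bray-Curtis dissimilarity between two abundance vectors.

    x_a, x_b: genotype abundances for samples A and B, listed in the
    same genotype order. Returns 0 for identical composition and 1
    when the samples share no genotype.
    """
    if len(x_a) != len(x_b):
        raise ValueError("abundance vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(x_a, x_b))
    denominator = sum(a + b for a, b in zip(x_a, x_b))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


def pairwise_distances(samples):
    """Dissimilarity for every ordered pair of samples in a dict
    mapping a sample identifier to its abundance vector."""
    return {(s, t): bray_curtis(samples[s], samples[t])
            for s in samples for t in samples}
```

The resulting pairwise matrix is the input that an ordination method such as PCoA would then project onto a small number of axes.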

#### Molecular Taxonomy of HPV Genotypes Based on NGS Is Highly Discriminatory and Correlated With IARC-Defined Carcinogenicity

A prototypical HPV genome based on the genetic information of HPV-16 (GenBank ID: K02718) was created using the CLC Bio Genomics Workbench 11.0.1 (QIAGEN) and is shown in **Figure 4**. The L1 (450 bp) gene fragment of each sample was the target used for sequencing, genotyping, and phylogenetic analysis. A maximum likelihood tree was inferred from the L1 sequences derived from single and multi-infected samples (**Figure 5**). The tree topology is consistent with the HPV species trees (Schiffman et al., 2009; Bernard et al., 2010; International Agency for Research on Cancer, 2012). These L1 sequences were clustered into four clades with strong bootstrap support: (1) α-9 clade included HPV-16 from four HSIL and two LSIL samples, and HPV-58 from LSIL sample 81. Both HPV-16 and HPV-58 are carcinogenic. (2) α-7 clade included carcinogenic HPV-18 and HPV-39, and a possibly carcinogenic HPV-70, which were shown in three mixed-infected HSIL samples. (3) α-6 clade included possibly carcinogenic HPV-53 and HPV-66. (4) α-3 clade included all the probably not carcinogenic genotypes found in HSIL and LSIL samples (Chen et al., 2018b). Clearly, the broad categorical grade designation based on precancerous cervical lesions (HSIL versus LSIL) was imprecise at predicting carcinogenicity. Conversely, the molecular taxonomy based on NGS was highly discriminatory and correlated well with IARC-defined carcinogenicity.

FIGURE 3 | HPV dominance and community structure between LSIL and HSIL. HPV L1 3D-Principal Coordinate Analysis (PCoA) plots of HSIL and LSIL samples showing dissimilarity between the two HPV communities with HPV-16 (PCoA 1) being the most influential genotype in HSIL versus HPV-61 for LSIL (PCoA 1). β-diversity was measured by Bray-Curtis index. Carc, carcinogenic; HSIL, high-grade squamous intraepithelial lesion; ID, identification; LSIL, low-grade squamous intraepithelial lesion.

were obtained automatically by applying Neighbor-Joining (Saitou and Nei, 1987) and BioNJ (Gascuel, 1997) algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach, and then selecting the topology with superior log likelihood value. Bootstrap resampling with 1,000 pseudo-replicates was carried out to assess support for each individual branch (Felsenstein, 1985). The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. Evolutionary analyses were conducted in MEGA X (Kumar et al., 2018). Carc, carcinogenic; NC, not classifiable; Poss Carc, possibly carcinogenic; Prob Not Carc, probably not carcinogenic; REF, reference genome; URR, upstream regulatory region.

#### Sequence and Structural Variations Identified at HPV Antigenic Sites May Alter Viral Recognition by Innate or Vaccine-Induced Host Defense

We hypothesized that variation in HPV L1 within and among the clinical samples can reveal critical details about the genetic basis for evolution of HPV immune evasion and host-pathogen interactions, because L1 encodes the major capsid protein that plays an important role in virion attachment and entry to the host (Knappe et al., 2007; Dasgupta et al., 2011; Surviladze et al., 2015; Chabeda et al., 2018). Being a natural antigen, the capsid surface is the target of HPV prophylactic vaccines (Harper, 2009; Harper and Williams, 2010; Yang et al., 2016). **Supplementary Table 1** lists the position, predicted mutation type and change at the coding region for HPV variants, compared to the respective reference HPV types. For the combined LSIL/HSIL samples (n = 12), a total of 417 mutations were detected, including 375 SNVs, 29 indels, 12 MNVs, and one replacement variant. The distribution of these variants for the 12 samples by Pap grade and HPV genotype is shown in **Figures 6A,B**, respectively. The proportion of nonsynonymous mutations was lower for HSIL (0.38) than for LSIL samples (0.51) (p = 0.017, Fisher's exact test) (**Figure 7**). On the other hand, possibly or probably not carcinogenic HPV types in LSIL samples appeared to be under relaxed functional constraint to accumulate mutations.
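The comparison of nonsynonymous proportions relies on Fisher's exact test for a 2 × 2 table (nonsynonymous versus other mutations, by cytology grade). The text reports only the proportions and the p-value, not the underlying counts, so the Python sketch below shows the test itself rather than a reproduction of the reported result. The function name and argument order are assumptions; the two-sided p-value is computed as the sum of hypergeometric probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Rows could be cytology grades and columns mutation classes; all four
    cells are counts. Returns the p-value.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of a table with x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    return sum(table_probability(x)
               for x in range(lowest, highest + 1)
               if table_probability(x) <= observed * (1 + 1e-9))
```

Applying the function to the per-grade counts of nonsynonymous and remaining mutations would yield the kind of p-value quoted above; those counts would need to be taken from the supplementary data.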

The distribution of variants in HPV L1 by amino acid positions according to HPV genotype is shown in **Supplementary Figure 1**.

It is important to identify mutations that are potentially driven by vaccine- or natural infection-induced host immune responses. To visualize mutations in 3D, first the structural model of HPV-16 L1 (PDB ID: 2R5H) (Bishop et al., 2007) was reconstructed with demarcated hypervariable surface loops: BC, DE, EF, FG, and HI (**Figure 8**). Additionally, the HPV-16 L1 protein sequence with surface probability plot for prediction of antigenic determinants on surface proteins (Emini et al., 1985) is provided in **Supplementary Figure 2**. In the case of HSIL Sample 179, we identified seven nonsynonymous mutations in HPV-16. **Figure 9** shows 3D conformational changes visualized by overlaying the mutated amino acid residues (cyan) on those (purple) in the reference HPV-16 structure (PDB ID: 1DZL) (Chen et al., 2000). It is particularly noticeable that the mutation at position 353 corresponded to a threonine to proline change (T353P) located at the HI loop. The T353P change also increased the surface probability from 3.40 to 3.63 (range, 0–6.47; threshold = 1.0) (**Supplementary Figure 2**). The HI loop is one of the loops in the L1 protein that extends to the outer surface of the capsid complex (Chen et al., 2000; Bishop et al., 2007). This hypervariable HI loop (AA 339–365) contains an HPV-16 immunodominant epitope (Christensen et al., 2001). As seen in human influenza virus, antigenic drift, where mutations are accumulated in antigenic sites, is a potent force driving the evolution of immune evasion and reduced vaccine efficacy (Fitch et al., 1997; Bush et al., 1999; Smith et al., 2004). Similarly, codon changes like T353P at the antigenic regions may confer selective advantage by increasing the likelihood of immune evasion. In addition to T353P, other mutations in this sample may lead to changes in the secondary structure, including W325C at G2 β-sheet, G367P at β-I sheet, T389S at α-2 helix, L441I at a β-turn, Q461P and F462Y near α-5 helix (Bishop et al., 2007).

### DISCUSSION

This study revealed the complex genetic diversity of HPV viromes within low- and high-grade Pap samples. Both pure and mixed infections were common as shown by deep amplicon sequencing. Taxonomic profiling revealed the difference between LSIL and HSIL viral communities with loss of species richness and gain of dominance by carcinogenic genotypes, particularly HPV-16, in HSIL samples. Deep sequencing allowed the detection of carcinogenic HPVs constituting a minor component of a virome which was undisclosed by Sanger sequencing or cytological grading. Phylogenetic inference of the patient-derived L1 sequences showed excellent correlation between HPV type-specific distances and IARC-defined carcinogenic potential. Together with taxonomic profiling, this "Taxo-Phylo" approach holds promise as a molecular taxonomy-based classifier of cervical cytology.

HPV variant detection and analysis pinpointed the nucleotide-level mutations and their potential functional as well as structural consequences. Localizing mutations to primary sequences and structures can help in understanding the functional consequences of mutations and in identifying causal or adaptive mutations. Furthermore, in silico modeling of mutations may direct laboratory testing and confirmation of their significance through antigen-antibody binding assays. For example, hepatitis B virus (HBV) genotypes are known to vary by ethno-geography. Mutations in the major hydrophilic regions (MHR) of the hepatitis B surface antigen (HBsAg) have resulted in stable, vaccine-escape mutant virions that are infectious and pathogenic (Carman et al., 1990; Gencay et al., 2018). Recently, investigators have used ultra-deep sequencing and clinical immunoassays (monoclonal antibodies) to detect single-nucleotide, vaccine-escape mutations and associated changes in the HBsAg amino acid residues in clinical samples (Gencay et al., 2018). Similarly, liquid-based cervical cytology samples may be interrogated by deep sequencing and multiplexed immunoassays, e.g., Luminex xMAP® (Peters et al., 2013), to survey HPV L1 mutant virions that may escape from innate or vaccine-induced immunity.

Longitudinal HPV metagenomic surveillance may also provide a sensitive means of detecting changes in HPV evolution and dynamics within individuals or populations. This is clinically important because virulent genotype(s) of low abundance may later dominate the virome if they are inherently more carcinogenic or confer a selective advantage with ensuing clonal expansion. Current published literature on HPV L1 variant analysis is scarce. As noted previously, a high intra-type L1 sequence variability was discovered in 35 HPV genotypes by deep sequencing. These investigators also found unique genotypes and their variants associated with distinct anatomical sites, supporting the notion of viral niche-adaptation as a shaper of viral evolution (Chen et al., 2018a; Dube Mandishora et al., 2018). However, functional consequences of these mutations were not studied. Another investigation found multiple mutations within the L1 fragment of HPV-16 (MY09/11-primed amplicons) of 35 invasive cervical cancer samples from Morocco (El-Aliani et al., 2017). A distinct mutation in the HI loop (T389P) found in 51.4% of cases could potentially interact with vaccine-induced neutralizing antibodies (El-Aliani et al., 2017). In view of this information, our results are highly consistent with the findings of high intra-host and intra-type L1 sequence variability that could potentially impact vaccine efficacy.

highlight the hypervariable surface loops: BC, DE, EF, FG, and HI and amino acid (AA) positions. These loops are antigenic regions of interest in vaccinology.

FIGURE 9 | Structural location of L1 variants. Visualization of L1 variants from a HSIL sample (Sample 179) linked to a 3D protein structure. The reference structure is a HPV 16 L1 monomer with accession number 1DZL (Chen et al., 2000) shown in backbone representation. Variant consequences in 3D are identified by the variant in cyan collocated on top of the reference amino acid in purple with attention toward the surface loops. AA, amino acid position.

The strength of this study lies in the multi-omics approach developed herein. Integration of clinical metadata, genomic data, and functional/structural information to reveal patient-specific metagenomic profiles and variant structures in 3D is novel and practical. Such individualized virome profiling may provide guidance to clinicians on the risk of cervical cancer and potentially deleterious viral variants/mutations. We acknowledge that our study has limitations: the sample size was small and only a fragment of L1 was studied, so broad, generalizable conclusions cannot be drawn. However, an integrated, holistic approach was established from this dataset to further HPV metagenomics research. Our future direction will be to conduct a large-scale, whole-genome or full-sequence L1 variant analysis to survey type-specific variant patterns by cytological grades.

## CONCLUSION


In this pilot study, NGS provided a cost-effective platform for an unbiased discovery of HPV communities in clinical samples. The HPV genotype composition was shown to be correlated with clinical severity and the carcinogenic risk for cervical cancer. Multi-omics analyses afforded an unprecedented opportunity to better characterize the L1 complexity in clinical samples. Ultimately, this approach will lead to greater understanding of the dynamic interplay between virus and host in HPV pathogenesis.

### AUTHOR'S NOTE

This paper has undergone PAO review at Brooke Army Medical Center and was cleared for publication. The opinions or assertions contained herein are the private views of the authors and are not to be construed as official or reflecting the views of the U.S. Department of the Army, U.S. Department of Defense, or the U.S. government.

### ETHICS STATEMENT

This study was approved by the institutional review board of Brooke Army Medical Center, Fort Sam Houston, Texas.

### AUTHOR CONTRIBUTIONS

JS-G and YW conceived and designed the study and participated in the acquisition of data. JS-G, YW, HC, and HZ analyzed and interpreted the data. JS-G, YW, and HC wrote the manuscript. All authors read and approved the final manuscript.

#### FUNDING

Laboratory materials for this work were supported in part by the Department of Clinical Investigation Intramural Funding Program at Brooke Army Medical Center, Fort Sam Houston, Texas.

#### ACKNOWLEDGMENTS

We thank the staff at the Greehey Children's Cancer Research Institute and the Bioanalytics and Single-Cell Core of the University of Texas Health Science Center at San Antonio for their invaluable service in supporting the next-generation sequencing experiments; and the staff, Ms. Roxanne Toscano and Ms. Rosalyn Miller, at the Cytopathology Laboratory of Brooke Army Medical Center for collecting the clinical samples in support of the HPV Research Program in the Department of Clinical Investigation at Brooke Army Medical Center.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00489/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Shen-Gunther, Cai, Zhang and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Survival Analysis of Multi-Omics Data Identifies Potential Prognostic Markers of Pancreatic Ductal Adenocarcinoma

#### *Edited by:*

*Junbai Wang, Oslo University Hospital, Norway*

#### *Reviewed by:*

*Ashok Sharma, Augusta University, United States*
*Hui-Chen Wu, National University of Tainan, Taiwan*

> *\*Correspondence: Chittibabu Guda babu.guda@unmc.edu*

*†These authors have contributed equally to this work.*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 01 December 2018 Accepted: 14 June 2019 Published: 18 July 2019*

#### *Citation:*

*Mishra NK, Southekal S and Guda C (2019) Survival Analysis of Multi-Omics Data Identifies Potential Prognostic Markers of Pancreatic Ductal Adenocarcinoma. Front. Genet. 10:624. doi: 10.3389/fgene.2019.00624*

*Nitish Kumar Mishra†, Siddesh Southekal† and Chittibabu Guda\**

*Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE, United States*

Pancreatic ductal adenocarcinoma (PDAC) is the most common and among the deadliest of pancreatic cancers. Its 5-year survival is only ~8%. Pancreatic cancers are a heterogeneous group of diseases, of which PDAC is particularly aggressive. Like many other cancers, PDAC starts as a pre-invasive precursor lesion (known as pancreatic intraepithelial neoplasia, PanIN), which offers an opportunity for both early detection and early treatment. Even advanced PDAC can benefit from prognostic biomarkers. However, reliable biomarkers for early diagnosis or for prognosis of therapy remain an unfulfilled goal for PDAC. In this study, we selected 153 PDAC patients from the TCGA database and used their clinical, DNA methylation, gene expression, micro-RNA (miRNA), and long non-coding RNA (lncRNA) expression data for multi-omics analysis. Differential methylation at about 12,000 CpG sites was observed in PDAC tumor genomes, with about 61% of the sites hypermethylated, predominantly in promoter regions and in CpG islands. We correlated promoter methylation and gene expression for mRNAs and identified 17 genes that were previously recognized as PDAC biomarkers. Similarly, several genes (B3GNT3, DMBT1, DEPDC1B) and lncRNAs (PVT1 and GATA6-AS) that have not been reported in PDAC before are strongly correlated with survival. Other genes such as EFR3B, whose biological roles are not well known in mammals, were also found to be strongly associated with survival. We further identified 406 promoter methylation target loci associated with patient survival, including known esophageal squamous cell carcinoma biomarkers, cg03234186 (ZNF154), and cg02587316, cg18630667, and cg05020604 (ZNF382). Overall, this is one of the first studies to identify survival-associated genes using multi-omics data from PDAC patients.

Keywords: Dm-CpG: Differentially methylated CpG, DMR: differentially methylated region, DEG: differentially expressed gene, HR: hazard ratio, TCGA: The Cancer Genome Atlas, GDC: The Genomic Data Commons, FDR: false discovery rate


### INTRODUCTION

Pancreatic ductal adenocarcinoma (PDAC) originates from the ductal epithelial cells of the pancreas and is the most common malignancy of the pancreas. Due to a lack of early symptoms, PDAC commonly presents at the metastatic stage, and as a result, fewer than 20% of patients can be considered for surgical removal of the tumors (Adamska et al., 2017). Unfortunately, removing frank tumors from the pancreas cannot be expected to cure a metastatic disease, as reflected in the current 5-year survival rate, which remains pegged at a dismal 8% (Chiaravalli et al., 2017). By 2030, PDAC is projected to become the second leading cause of mortality from cancer, only behind lung cancer (Rahib et al., 2014). This is an alarming situation, and there is an urgent need to develop early detection methods and effective treatment regimens.

Recent studies of molecular profiling and epigenetic regulation in PDAC pathophysiology have provided a valuable roadmap for this effort. We are beginning to gather information about the early-onset and PDAC-specific epigenetic alterations that alter gene expression (Neureiter et al., 2014), especially those that induce metastatic changes such as genome structure reorganization and affect tumor grade, stage, and patient survival (Thompson et al., 2015). Such studies are helping to identify targets for designing epigenetic inhibitors to treat PDAC. Not surprisingly, these targets belong to growth signaling and tumor suppressor-silencing pathways, as well as pathways that affect cell cycle checkpoints (Paradise et al., 2018).

There is also no doubt that early detection and early initiation of therapy will be key for defeating PDAC. Identification of early-onset DNA methylation changes in PDAC target genes should provide biomarker candidates for early diagnosis. We also know from earlier studies that certain critical genes are hypomethylated in pancreatic cancer. The mucin 4 (MUC4) gene is one example of promoter hypomethylation in pancreatic cancer (Zhu et al., 2011). However, pancreatic cancer appears to be affected by both hyper- and hypomethylated genes (Mishra and Guda, 2017). In particular, inside the promoters of ~72% of human genes, there are stretches of CpG dinucleotides (known as CpG islands), which are hypermethylated in cancer (Saxonov et al., 2006). Frequently, transcription of tumor suppressor genes is silenced by CpG island hypermethylation, while hypomethylation of promoters appears to cause overexpression of oncogenes and genomic instability (Tan et al., 2009). Abnormal DNA methylation affects many genes in cancer patients. In PDAC, genes involved in axon guidance, cell adhesion, epithelial-mesenchymal transition (EMT), and other pathways of tumor development, as well as genes involved in pancreatic development including the HOX-family genes, show abnormal DNA methylation (Nones et al., 2014; Mishra and Guda, 2017). Some of these genes may be useful for diagnosing PDAC stage and for the prognosis of successful therapy.

The availability of bisulfite-sequencing and array-based DNA methylation data in The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013; Tomczak et al., 2015) and the International Cancer Genome Consortium (ICGC) (Zhang et al., 2011) has greatly aided the search for candidate biomarkers. The study of differentially methylated loci between tumor and normal samples has great scientific merit for cataloging the genomic changes in PDAC. However, integrated genomic analysis of differences in DNA methylation and their impact on gene expression, correlated with patient survival, will bring us closer to the goal of identifying candidate biomarkers. Until recently, integrative analyses have mostly examined the methylation status of promoters and CpG islands (Vincent et al., 2011). For example, Raphael et al. (2017) used integrative analysis of TCGA pancreatic ductal cancer data, but their focus was somatic alterations and molecular subtyping. Using the TCGA data, a number of DNA methylation pattern analyses have been reported for multiple cancers (Noushmehr et al., 2010; Aine et al., 2015; Yang et al., 2015), but for PDAC, this is still lacking. Unlike our previous study (Mishra and Guda, 2017), in which we performed integrative analysis of all types of pancreatic cancers (PC) in the TCGA database, the present work focuses exclusively on PDAC; that is, this report does not cover any other subtypes of PC. In this PDAC study, we analyzed differential DNA methylation, gene expression, miRNA and lncRNA expression, and the association of promoter DNA methylation with gene expression and lncRNA expression (**Figure S1**). Next, we examined whether those genomic and transcriptional changes corresponded with patient survival in a significant way. Overall, in the current study, we identified several prognostic markers for pancreatic ductal cancer.

### MATERIALS AND METHODS

#### Clinical Data and Samples

We downloaded the current study view clinical data as of August 2018 from cBioPortal (Gao et al., 2013). The TCGA database has a total of 186 pancreatic cancer patients. Based on the neoplastic and histological information described for these patients in the clinical files, we selected 154 patients who unambiguously had PDAC. We excluded the other patients, who had endocrine, invasive adenocarcinoma, undifferentiated, or mixed pancreatic cancers (**Table S2**). CpGs/genes/miRNAs/lncRNAs with missing values in ≥20% of samples, and similarly, samples with missing values for ≥20% of CpGs/genes/miRNAs/lncRNAs, were excluded from further analysis.

#### DNA Methylation, RNAseq, and miRNAseq Data

The Bioconductor tool *TCGAbiolinks* (Colaprico et al., 2016) was used to download the TCGA level-3 data on DNA methylation (Illumina HumanMethylation450 BeadArray), gene expression (IlluminaHiSeq RNASeqV2), and lncRNA and microRNA expression (IlluminaHiSeq miRNAseq). The DNA methylation data contain β values for 485,577 CpG sites with annotations for transcripts from GENCODE v22, the associated CpG island (CGI), the distance of each CpG site from the nearest transcription start site (TSS), and CpG coordinates as per the GRCh38 reference genome. The β values are calculated as β = M/(M + U), which ranges between 0 and 1, where M is the methylated allele frequency and U is the unmethylated allele frequency. Therefore, higher β values indicate a higher level of methylation. The gene expression data were obtained for each of the 60,483 GENCODE v22 genes in each sample. The miRNASeq data for each sample have raw read counts and reads-per-million (RPM) counts for 1,881 miRNAs that are annotated in miRBase v21. As TCGA PDAC samples were processed in batches at different sites of the consortium, the data can be vulnerable to batch effects. Before starting the PDAC data analysis, we first checked for possible batch effects in the different types of data using Mbatch (Akbani et al., 2010).
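The β value definition above can be illustrated with a minimal Python sketch. The function name `beta_value` and the example signal values are hypothetical and are not part of the TCGA pipeline; the sketch only restates β = M/(M + U).

```python
def beta_value(methylated: float, unmethylated: float) -> float:
    """Return beta = M / (M + U), a methylation level between 0 and 1.

    `methylated` (M) and `unmethylated` (U) are non-negative signals;
    the names are illustrative, not taken from any TCGA file format.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("M and U must be non-negative")
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("M + U must be positive")
    return methylated / total


# Hypothetical probe signals: a mostly methylated and a mostly unmethylated site.
print(beta_value(3.0, 1.0))
print(beta_value(1.0, 9.0))
```

Because M and U are non-negative, the result is bounded by 0 (fully unmethylated) and 1 (fully methylated), which is why larger β values correspond to higher methylation.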

#### Methylation Data Processing

Beta values of CpG probes mapped to the X, Y, and mitochondrial chromosomes were excluded from analyses to eliminate gender bias. CpGs with missing β values in ≥20% of the samples were also excluded. To estimate the remaining missing values in the data, we used the k-nearest neighbor-based imputation method using the *imputeKNN* module of the R (R Core Team, 2019) tool *impute* (Troyanskaya et al., 2001). We also removed the data from CpG probes that overlapped with RepeatMasker-annotated repeats or with SNPs from dbSNP v151 with minor allele frequency (MAF) > 1% (Zhou et al., 2017). Statistical analyses of DNA methylation of 162 samples (153 primary tumors and nine normal samples) were performed at two different levels, i.e., the CpG site level and the region level.
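The idea behind k-nearest neighbor imputation can be sketched as follows. This is a simplified illustration in Python, not the *impute* implementation: the distance (root-mean-square difference over samples observed in both probes), the default `k`, and the function name `knn_impute` are all assumptions made for the example.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries of each row (probe) from its k nearest rows.

    Simplified sketch: distance is the root-mean-square difference over
    columns (samples) observed in both rows, and only rows observed in the
    missing column can serve as neighbours.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Rank candidate neighbours by distance to the incomplete row.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical beta values: 4 probes x 3 samples, one missing entry.
betas = [
    [0.10, 0.20, None],
    [0.10, 0.20, 0.30],
    [0.12, 0.22, 0.50],
    [0.90, 0.80, 0.70],
]
print(knn_impute(betas, k=2)[0])
```

The missing entry is replaced by the average of the two probes whose observed profiles are most similar, so a distant probe (the last row here) does not influence the estimate.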

CpG probes were independently mapped in six different subregions of the genes: TSS200 (the region from TSS to 200 bp upstream of TSS), TSS1500 (200–1,500 bp upstream from TSS), 5'UTR, 1st exon, gene body, and 3'UTR. DNA methylation characteristics in the known UCSC CpG island, shores (regions 0–2 kb from CpG islands), and shelfs (regions 2–4 kb from CpG islands) were also analyzed.
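The distance-based definitions above can be written as a small Python sketch. The function names, the strand handling, and the label "open sea" for sites beyond the shelf are assumptions introduced for illustration; the actual analysis relied on the array annotation rather than on code like this.

```python
def promoter_subregion(cpg_pos, tss, strand="+"):
    """Classify a CpG as TSS200 or TSS1500 by its distance upstream of the TSS.

    Returns None when the CpG is not within 1,500 bp upstream of the TSS.
    """
    upstream = tss - cpg_pos if strand == "+" else cpg_pos - tss
    if 0 <= upstream <= 200:
        return "TSS200"
    if 200 < upstream <= 1500:
        return "TSS1500"
    return None


def island_context(cpg_pos, island_start, island_end):
    """Classify a CpG relative to one CpG island: island, shore, or shelf."""
    if island_start <= cpg_pos <= island_end:
        return "island"
    if cpg_pos < island_start:
        gap = island_start - cpg_pos
    else:
        gap = cpg_pos - island_end
    if gap <= 2000:
        return "shore"   # 0-2 kb from the island
    if gap <= 4000:
        return "shelf"   # 2-4 kb from the island
    return "open sea"    # assumed label for anything farther away


# Hypothetical coordinates on a forward-strand gene.
print(promoter_subregion(9_900, tss=10_000))
print(island_context(13_500, island_start=10_000, island_end=11_000))
```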

#### Logistic Regression Analysis

We used logistic regression in R to classify the tumor and normal samples on the basis of their DNA methylation, gene expression, lncRNA expression, and miRNA expression data. Logistic regression was performed by using the *lm* function in R. The R package *ROCR* was used to evaluate logistic regression performance, calculate the area under the curve (AUC), and generate receiver operating characteristic (ROC) curve plots (Sing et al., 2005).
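To make the AUC concrete, the following Python sketch computes it from classifier scores as the probability that a randomly chosen tumor sample scores higher than a randomly chosen normal sample (ties count as one half). This is a generic formulation, not the *ROCR* code, and the labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for tumor, 0 for normal; scores: predicted probabilities.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for two tumors and two normals.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC of 1 means every tumor sample is scored above every normal sample, whereas 0.5 corresponds to random ranking.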

#### Differential Methylation Analysis

The β values for CpGs after preprocessing and imputation were further normalized with the beta-mixture quantile normalization (BMIQ) method, implemented in the R tool *BMIQ* (Teschendorff et al., 2013), to adjust for differences between type I and type II probes. The R package *limma* was used for supervised differential methylation analyses. A CpG site was considered differentially methylated (dm-CpG) when the mean β values of the primary tumor and normal samples differed by at least 0.2 (|∆β| ≥ 0.2) and the BH adjusted *p*-value was less than 0.005. Using the R tool *gtrellis*, we generated circular plots of 10 Mb sliding windows for each chromosome to examine the frequency of differentially methylated CpGs (Gu et al., 2016). Next, we determined the methylation frequency per megabase pair (Mb) for each chromosome by dividing the total number of dm-CpGs on the chromosome by the length of the chromosome (in Mb) in GRCh38. Hypermethylation and hypomethylation frequencies were calculated for each autosomal chromosome in a similar manner. For each chromosome, when the ratio of hypermethylation to hypomethylation frequency was ≥1.5, we considered that chromosome to be predominantly hypermethylated. Conversely, when the ratio of hypomethylation to hypermethylation frequency was ≥1.5, we considered that chromosome to be predominantly hypomethylated.
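The two-part selection rule (effect size plus BH adjusted *p*-value) can be sketched in Python as below. The Benjamini–Hochberg step is the standard procedure; the function names and the example numbers are hypothetical, and the raw *p*-values are assumed to come from a separate test such as the moderated *t*-statistic.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_dm_cpg(mean_tumor, mean_normal, adj_p, min_delta=0.2, alpha=0.005):
    """Return 'hyper', 'hypo', or None for one CpG site."""
    delta = mean_tumor - mean_normal
    if abs(delta) < min_delta or adj_p >= alpha:
        return None
    return "hyper" if delta > 0 else "hypo"


# Hypothetical mean beta values (tumor, normal) and raw p-values for 4 CpGs.
means = [(0.80, 0.30), (0.20, 0.60), (0.55, 0.50), (0.90, 0.40)]
raw_p = [0.0001, 0.0004, 0.0002, 0.2]
adj_p = bh_adjust(raw_p)
print([call_dm_cpg(t, n, p) for (t, n), p in zip(means, adj_p)])
```

In this toy input the third site fails the effect-size criterion and the fourth fails the significance criterion, so only the first two would be reported as dm-CpGs.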

#### Differentially Methylated Regions (DMRs) Analysis

Differentially methylated region (DMR) analyses were performed using the Bioconductor tool *DMRcate* (Peters et al., 2015). *DMRcate* first calculates differential methylation at individual CpG sites using the moderated *t*-statistic from *limma* (Ritchie et al., 2015). After correcting for false discovery rate (FDR), significant dm-CpGs were agglomerated into groups in which the distance between two consecutive probes is within 1 kb. Only those DMRs that have at least two dm-CpGs with adjusted *p*-value < 0.01 within a 1-kb distance were considered for DMR analysis. Next, we annotated the overlapping promoter regions (±2,000 bp from TSS) and generated a plot of DMRs by using the Bioconductor package *Gviz*.
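The grouping rule described above (consecutive significant CpGs no more than 1 kb apart, at least two per region) can be illustrated with a short Python sketch. It deliberately omits the statistical modeling that *DMRcate* performs and only shows the agglomeration step; the function name and coordinates are hypothetical.

```python
def agglomerate_dm_cpgs(positions, max_gap=1000, min_cpgs=2):
    """Group significant CpG positions on one chromosome into regions.

    Consecutive positions at most `max_gap` bp apart join the same region;
    regions with fewer than `min_cpgs` CpGs are discarded.
    Returns a list of (start, end, n_cpgs) tuples.
    """
    regions = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_cpgs:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_cpgs:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Hypothetical coordinates of significant dm-CpGs on one chromosome.
print(agglomerate_dm_cpgs([100, 600, 1500, 5000, 9000, 9500]))
```

Under this rule an isolated significant CpG never forms a region on its own, which is consistent with the requirement of at least two dm-CpGs per DMR.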

#### RNASeq and miRNASeq Data Processing

The TCGA level-3 RNASeq data contain a single raw read count and a normalized expression value for each gene. In contrast, the GDC data portal has different types of level-3 data. From the GDC, we used HT-Seq raw read counts data for differential gene expression and the FPKM-UQ for correlation analysis. These expression values were generated by aligning the reads with the GRCh38 reference genome and then quantifying the mapped reads for the genes. TCGA level-3 miRNASeq data contain raw read count for each miRNA in the miRBase database, which was derived by exact mapping of miRNASeq data (Chu et al., 2016).

#### Differential Gene Expression Analysis

For differential gene expression analysis, the expected counts data from 146 primary PDAC and three normal samples were used. Before differential expression analysis, we removed all genes with missing expression values in ~20% of the samples and also genes that had CPM (counts per million) values of less than one in about 25% of the samples. After preprocessing, we used the Bioconductor tool *DESeq2* (Love et al., 2014) for differential gene expression analysis, for which a cutoff value of 0.01 was applied to both the raw *p*-value and the Benjamini–Hochberg (BH) (Benjamini and Hochberg, 1995) adjusted *p*-value. For differential miRNA analysis, we used raw read counts in *DESeq2* with a BH adjusted *P*-value of ≤0.01.
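A CPM-based low-expression filter of this kind might look like the Python sketch below. The exact rule is an assumption: here a gene is dropped when its CPM is below one in at least 25% of samples, which is one reading of the criterion stated above. Function names and counts are illustrative only.

```python
def counts_per_million(counts, library_sizes):
    """Convert raw read counts of one gene to CPM, sample by sample."""
    return [c * 1_000_000 / n for c, n in zip(counts, library_sizes)]


def keep_gene(counts, library_sizes, min_cpm=1.0, max_low_fraction=0.25):
    """Keep a gene unless CPM < min_cpm in at least max_low_fraction of samples.

    The 25% threshold direction is an assumed interpretation of the filter.
    """
    cpm = counts_per_million(counts, library_sizes)
    low = sum(1 for value in cpm if value < min_cpm)
    return low / len(cpm) < max_low_fraction


# Hypothetical gene counts in four samples of one million reads each.
library_sizes = [1_000_000] * 4
print(keep_gene([10, 3, 5, 20], library_sizes))
print(keep_gene([10, 0, 5, 20], library_sizes))
```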

#### Correlation Between DNA Methylation and Gene Expression

For the correlation analysis, primary tumor samples of 146 patients that contained both DNA methylation and gene expression data were used. Correlation between promoter DNA methylation and corresponding gene expression was computed using the R function *cor.test*. Methylation and expression levels (log2(FPKM-UQ + 1)) of genes were tested for non-zero correlation using Spearman's correlation, after excluding all samples with a correlation value of zero. Any association between DNA methylation and gene expression was considered significant if the *p*-value was ≤ 0.005 and |rho| ≥ 0.25.
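Spearman's rho is the Pearson correlation of the ranks, which the following Python sketch computes for one CpG–gene pair after the log2(FPKM-UQ + 1) transform. It omits the *p*-value that *cor.test* also reports, and the data are invented for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical promoter beta values and FPKM-UQ values across four tumors.
beta = [0.1, 0.3, 0.5, 0.9]
expression = [math.log2(v + 1) for v in [900.0, 300.0, 100.0, 10.0]]
rho = spearman_rho(beta, expression)
print(rho, abs(rho) >= 0.25)
```

Because the log transform is monotonic, it does not change the ranks and therefore does not change rho; it matters only for analyses that use the expression values themselves.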

#### Pathway Enrichment Analysis

The Bioconductor package *clusterProfiler* (Yu et al., 2012) was used for enrichment analysis of differentially expressed genes (DEG). KEGG canonical pathways were used for pathway enrichment analysis. We used a BH adjusted *p*-value cutoff of 0.05 and a minimum of five and maximum of 500 genes as selection criteria for every significant pathway. For the pathway enrichment analysis of dm-CpGs, we used the '*gometh*' module of the Bioconductor tool *missMethyl* (Phipson et al., 2016). Genes associated with dm-CpGs (|Δβ| ≥ 0.2) in the Illumina Human 450K BeadChip were obtained from the annotation package *IlluminaHumanMethylation450kanno.ilmn12.hg19*. All GO and KEGG terms were tested using the '*gometh*' function, and false discovery rates were calculated using the BH method.
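Over-representation analysis of a gene list is commonly based on a hypergeometric tail probability. The Python sketch below shows that generic calculation together with the pathway size filter described above; it is an assumption-level illustration and does not reproduce the internals of *clusterProfiler* or the probe-number bias correction applied by *missMethyl*. All names and numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(universe, pathway_size, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    universe: number of background genes; pathway_size: genes in the pathway;
    selected: genes in the query list; overlap: query genes in the pathway.
    """
    upper = min(pathway_size, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def pathway_size_ok(pathway_size, minimum=5, maximum=500):
    """Apply the pathway size criterion (between 5 and 500 genes)."""
    return minimum <= pathway_size <= maximum


# Hypothetical example: 40 of 300 query genes fall in a 200-gene pathway.
if pathway_size_ok(200):
    print(enrichment_p_value(universe=20_000, pathway_size=200, selected=300, overlap=40))
```

The resulting raw *p*-values would then be adjusted across all tested pathways with the BH method before applying the 0.05 cutoff.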

#### Survival Analysis

To reveal the roles of differentially expressed genes and miRNAs in patient survival, PDAC patients were classified into high and low expression groups, using the median expression of genes as the cut-off value. For the analysis of promoter region DNA methylation, we used β value cutoffs of ≥0.5 (high group) and ≤0.3 (low group). We analyzed only those CpG sites that were differentially methylated (±1,500 bp from TSS) and also negatively correlated with gene expression. We used the R tool *survival* for survival analysis, and Kaplan–Meier (KM) survival plots were generated. In addition, we performed Cox-regression analyses. For both analyses, we selected CpGs that had *p*-value ≤ 0.05. For gene, miRNA, and lncRNA expression and patient survival analyses, we used all available genes in the analysis and divided PDAC patients into two classes based on the median expression. PDAC patients above the median were classed as the high expression group, and those below the median were classed as the low expression group.
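The grouping rules and the Kaplan–Meier estimate can be sketched in Python as follows. This is an illustrative stand-in for what the R *survival* package computes, without confidence intervals or significance tests; the function names and the follow-up data are hypothetical.

```python
from statistics import median


def expression_groups(values):
    """Label each patient 'high' (above median) or 'low' (below median)."""
    cut = median(values)
    return ["high" if v > cut else "low" if v < cut else None for v in values]


def methylation_group(beta):
    """Label a promoter beta value: >=0.5 is 'high', <=0.3 is 'low'."""
    if beta >= 0.5:
        return "high"
    if beta <= 0.3:
        return "low"
    return None  # intermediate values are left out of the comparison


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time.

    events: 1 if the patient died at that time, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up times (months) and event indicators for one group.
print(expression_groups([1.0, 2.0, 3.0, 4.0]))
print(kaplan_meier([5, 8, 8, 12], [1, 0, 1, 1]))
```

A KM curve would be computed separately for the high and low groups, and the two curves compared to judge whether the marker is associated with survival.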

### RESULTS

We downloaded level-3 DNA methylation, gene expression, and miRNA expression data from TCGA using the Bioconductor tool *TCGAbiolinks*, and systematically carried out data cleaning, global unsupervised analyses, and detailed individual and integrative analyses on the DNA methylation, mRNA, and miRNA expression datasets. To understand the functional significance and relevance of the differentially expressed and differentially methylated genes in PDAC, we also performed downstream analyses using pathway enrichment tools, Cox regression, and Kaplan–Meier survival plots. A complete flowchart of the data analysis is available in **Figure S1**.

#### Global DNA Methylation Analysis

We performed the Wilcoxon rank test to analyze the overall difference in DNA methylation levels in six different gene subregions (TSS200, TSS1500, 1st exon, 5'UTR, 3'UTR, and gene body) and five methylated genomic regions (CpG island, s-shore, n-shore, s-shelf, and n-shelf). For this analysis, we combined the β values of all CpGs in the corresponding regions for tumor and normal samples. Our analyses revealed that CpG segments close to the TSS, and the islands themselves, have in general a higher level of DNA methylation in tumor samples (**Figure S2**). Specifically, DNA methylation levels of TSS200, TSS1500, 1st exon, 5'UTR, island, s-shore, and n-shore regions were higher in the tumor. In contrast, DNA methylation levels were low in genomic regions that are away from the TSS and the CpG islands (**Figure S2**).

We observed a total of 12,083 differentially methylated CpGs (dm-CpGs) with |∆β| ≥ 0.2 between tumor and normal samples; of these, 7,378 were hypermethylated and 4,705 were hypomethylated (**Table S3**, **Figure S3**). At an even higher threshold (|∆β| ≥ 0.3), the number of dm-CpG sites dwindled to 1,741. **Figure 1A** shows all dm-CpG results from each autosomal chromosome at |∆β| ≥ 0.2, depicted in the outer circle of the circos plot. The two innermost circles show the density of hyper- and hypomethylation in a 10 Mb sliding window across the genome. The distribution of dm-CpGs in twelve different genomic subregions is shown in **Table 1** and **Figure 1C**. A total of 4,610 dm-CpGs were observed within the promoter regions of genes, i.e., ±1.5 kb from the TSS of genes. We also observed that the regions close to the CpG islands (island, shore) and the promoters (TSS200, TSS1500, promoter, 1st exon, 5'UTR) were predominantly hypermethylated (**Figure S2**, 1.5 kb distribution plot), while regions away from the CpG islands (shelf) and the promoters (3'UTR, gene body) were hypomethylated (**Table 1**, **Figure 1C**).

In PDAC tumors, we observed that chromosomes 1 and 2 contained the highest numbers of dm-CpGs, while chromosomes 14 and 15 had the lowest. Such differences are expected given the large sizes of chromosomes 1 and 2. To size-normalize across chromosomes, we calculated the methylation frequency per Mb for each chromosome to compare the net differential methylation. The size-normalized DNA methylation frequencies indicated that chromosome 20 has the highest differential methylation frequency (14.76 dm-CpGs/Mb) while chromosome 18 has the lowest (0.82 dm-CpGs/Mb), as shown in **Table 2** and **Figure 1B**. Except on chromosome 9, hypermethylated CpG sites were more prominent than hypomethylated sites on all chromosomes (**Table 2**). We also observed that chromosomes 10 and 18 were extensively hypermethylated, to the extent that the hypermethylation frequencies for these two chromosomes were three times higher than the hypomethylation frequencies (**Figure 1B**, **Figure S4**).
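The size normalization and the 1.5-fold predominance rule can be expressed as a brief Python sketch. The function names, the "mixed" label, and the example counts and chromosome length are hypothetical and do not reproduce the values in Table 2.

```python
def dm_frequency_per_mb(n_dm_cpgs, chrom_length_bp):
    """Number of dm-CpGs per megabase of chromosome length."""
    return n_dm_cpgs / (chrom_length_bp / 1_000_000)


def predominance(hyper_freq, hypo_freq, ratio=1.5):
    """Classify a chromosome by the ratio of hyper- to hypomethylation frequency."""
    if hyper_freq == 0 and hypo_freq == 0:
        return "none"
    if hyper_freq >= ratio * hypo_freq:
        return "predominantly hypermethylated"
    if hypo_freq >= ratio * hyper_freq:
        return "predominantly hypomethylated"
    return "mixed"  # assumed label when neither ratio reaches the threshold


# Hypothetical chromosome of 50 Mb with 300 hyper- and 100 hypomethylated CpGs.
length_bp = 50_000_000
hyper = dm_frequency_per_mb(300, length_bp)
hypo = dm_frequency_per_mb(100, length_bp)
print(hyper, hypo, predominance(hyper, hypo))
```

Since both frequencies are divided by the same chromosome length, the predominance call depends only on the ratio of hyper- to hypomethylated counts; the normalization matters when comparing different chromosomes.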

Chromosomes are sorted by total differential methylation per megabase pair of chromosome length. (C) Bubble plot of differentially methylated CpGs in genomic regions. The size of each bubble represents the total number of dm-CpGs.

To locate genomic regions with high epigenomic perturbations, we calculated dm-CpG frequencies of chromosomal segments in 10 Mb sliding windows. Our analysis revealed that chr7:27,000,001–28,000,000 has the highest dm-CpG frequency, with the entire region mostly hypermethylated (**Figure 1A**, inner red circle). The region contains several HOX-family genes such as HOXA1, HOXA3, HOXA7, HOXA10, HOXA11, and HOXA13.

#### Genome-Wide Analysis of Differentially Methylated Regions (DMRs)

Standard differential methylation analysis performs statistical testing on individual CpG sites, but regulatory methylation targets are most commonly clustered into short regions. Clusters of hypermethylated CpG sites in the promoter region of a gene are usually associated with epigenetic silencing of the gene (Jones and Baylin, 2002). Differentially methylated regions (DMRs) comprise multiple consecutive methylated CpG sites with at least two dm-CpGs; detecting DMRs is therefore more biologically relevant (Weaver et al., 2004; Bert et al., 2013).

In all, we identified 779 DMRs across the genome in PDAC. Chromosome 7 showed the highest number of DMRs (74) and chromosome 21 the lowest (6) (**Table S3**). The DMRs varied in length, ranging from 3 bp to ~11 kb. There were 116 short (<100 bp) DMRs and 84 long (>2 kb) DMRs. The number of dm-CpGs within DMRs ranged from 2 to 45. These DMRs also overlap with the promoters of several HOX-family genes (**Table S3**). Examples of DMRs showing contrasting methylation patterns between normal and tumor samples on chromosome 9 and chromosome 2 are presented in **Figure S5**.

TABLE 1 | Distribution of differentially methylated CpG sites in different genomic and gene regions in pancreatic ductal adenocarcinoma (|∆β| ≥ 0.2).

#### Differential Gene Expression Analysis

HTSeq read counts for 146 PDAC patient tumors and three normal samples were downloaded from TCGA, and differential gene expression analysis was performed on them using the DESeq2 package. Trimmed mean of M-values (TMM) normalization was employed to account for library size variations among samples (Robinson and Oshlack, 2010). We identified 90 differentially expressed genes (80 protein-coding, seven lncRNA, two antisense, and one Ig-V gene) at an adjusted *p*-value < 0.05 (significance corrected using the Benjamini–Hochberg method) (**Figure 2**, **Table S4**). From the 147 tumors and three normal samples, 10 differentially expressed miRNAs were found (**Table S4**).

#### Promoter DNA Methylation and Gene Expression Correlation Analysis

We used Spearman's test to examine correlations between promoter DNA methylation (within 1.5 kb from TSS) and gene expression using the R function *cor.test*. Correlations with |rho| ≥ 0.25 and BH adjusted *p*-values of < 0.005 were taken as significant. We observed correlations of 30,619 promoter CpGs with the expression of 8,932 genes, the majority of which were negatively correlated (25,077 CpGs with 7,518 genes), with only a minority (5,605 CpGs with 2,937 genes) showing positive correlations. At a higher rho threshold (|rho| ≥ 0.5) and low FDR (<0.005), we observed correlations of 4,971 CpGs with the expression of 1,744 genes, out of which most (4,568 CpGs with 1,602 genes) were negatively correlated and fewer (407 CpGs with 212 genes) were positively correlated (**Table S5**, **Figure S6**).

Similar Spearman's analyses were performed to find correlations between CpGs and lncRNAs. We identified 1,216 CpGs that were significantly correlated with 442 lncRNAs; of these, the great majority (1,039 CpGs with 368 lncRNAs) were negatively correlated and fewer (177 CpGs with 95 lncRNAs) were positively correlated. At higher thresholds (|rho| ≥ 0.5 and BH adjusted *p*-value ≤ 0.005), we observed that 199 CpGs were correlated with 84 lncRNAs, out of which 174 CpGs showed negative correlations with 72 lncRNAs, and 25 CpGs were positively correlated with 12 lncRNAs (**Table S5**).

#### Pathway Enrichment Analysis

Analyses of differentially methylated CpGs using the Bioconductor missMethyl pathway tool indicated the enrichment of several KEGG pathways (**Table 3**). Several critical cancer-related pathways, such as MAPK signaling, Rap1 signaling, and calcium signaling, appeared in the list. We also observed the enrichment of the nicotine addiction pathway, consistent with the fact that these patients were cigarette smokers (**Table S3**). In the case of differential expression, we observed only 80 differentially expressed genes, and no significant pathways were enriched from that list of genes.

#### Survival Analysis

We used in-house R code to perform survival analysis based on the DNA methylation, gene expression, miRNA, and lncRNA results. This code uses the R packages *survival* and *survminer* in the background, performs Cox regression and log-rank tests, and generates KM plots for CpGs, genes, miRNAs, and lncRNAs, in each case testing for a significant difference in patient survival between the high and low groups. In the Cox regression analysis, we used the low expression or low methylation group of samples as the reference. A hazard ratio (HR) > 1 indicates that patients in the high group have lower survival, and an HR < 1 indicates higher survival.

TABLE 3 | KEGG pathway analysis for differentially methylated genes. We used the *missMethyl* tool for pathway analysis. For each enriched pathway, N is the total number of genes in the pathway, DN is the number of genes (hg38) mapped to differentially methylated CpGs, P.DM is the *p*-value, and FDR is the BH adjusted *p*-value.

We conducted survival analysis of PDAC patients with respect to differentially methylated CpGs (*p*-values for both the log-rank test and Cox regression ≤ 0.05). The results identified 439 CpGs that may be associated with survival. Of these, 80 remained associated with survival at a more stringent criterion (*p*-value ≤ 0.01). Survival analysis of the gene expression data indicated 1,954 genes that may influence PDAC patient survival at *p*-value ≤ 0.05 (**Table S5**). When we reduced the survival *p*-value cutoff to 0.01, this number dropped to 518. Similarly, we observed 236 lncRNAs that correlated with survival at *p*-value ≤ 0.05, and 74 at a *p*-value cutoff of 0.01. For miRNAs, 25 correlated with survival at *p*-value ≤ 0.05 and 7 at *p*-value ≤ 0.01.
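The two-group comparison that underlies a KM plot can be sketched as a log-rank test. The example below is a plain-Python illustration, not the authors' in-house R code. The median split used to define "high" and "low" groups is an assumption, since the text does not state how the groups were formed, and the function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import median

def split_by_median(expression):
    """Label a sample 'high' if its value is above the cohort median, else 'low'.
    The median cutoff is an assumption made for this illustration."""
    cut = median(expression)
    return ["high" if v > cut else "low" for v in expression]

def logrank_test(times, events, groups):
    """Two-group log-rank test.
    times: follow-up times; events: 1 = death, 0 = censored;
    groups: 'high' or 'low' per sample.
    Returns (chi-square statistic, p-value with 1 degree of freedom)."""
    observed_high = 0.0
    expected_high = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n = len(at_risk)
        n_high = at_risk.count("high")
        deaths = [g for ti, e, g in zip(times, events, groups)
                  if ti == t and e == 1]
        d = len(deaths)
        observed_high += deaths.count("high")
        expected_high += d * n_high / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in the high group
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0  # only one group is ever at risk; no comparison possible
    chi2 = (observed_high - expected_high) ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2))  # chi-square survival function for df = 1
    return chi2, p_value
```

A feature would count as survival-associated in this sketch when the returned *p*-value falls at or below the chosen cutoff (0.05 or 0.01 in the text); the hazard ratio itself comes from the separate Cox regression.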

#### Correlative Analysis of Gene Expression and Survival

Genes and genomic regulatory loci that are differentially expressed and correlated with patient survival could be important for understanding the initiation and progression of PDAC. Integrative analysis of patient survival and differential expression identified 17 genes that passed our tests at BH adjusted *P*-value ≤ 0.05 for both differential expression and patient survival, or five genes when the thresholds were tightened to 0.01 for both DEG and survival analysis (**Table 4**). In these tests, we did not observe any differentially expressed lncRNAs that correlated with PDAC patient survival.

Further analysis of genes that have dm-CpGs in their promoter regions (|∆β| ≥ 0.2, FDR < 0.005) and show a negative correlation with the corresponding gene expression (rho ≤ -0.5, FDR < 0.005) showed that a total of 93 CpGs have a significant difference (*p*-value ≤ 0.05) in survival between the high and low patient groups. This number drops to 4 if we use *p*-value ≤ 0.01 in the survival analysis (**Figure 3**).

In the case of lncRNAs, we observed that three promoter dm-CpGs showing a negative association with lncRNA expression are associated with overall patient survival (*p*-value ≤ 0.05). This number drops to two if we reduce the survival *p*-value cutoff to 0.01. The list of these CpGs with survival details is shown in **Table S7**.
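The integrative selection described in this section combines three criteria per CpG. A minimal sketch follows, in plain Python and not taken from the authors' pipeline. The `CpGRecord` fields, the function name, and the toy records are hypothetical; the thresholds are those stated in the text.

```python
from dataclasses import dataclass

@dataclass
class CpGRecord:
    cpg_id: str        # probe identifier (toy values below are made up)
    delta_beta: float  # tumor minus normal methylation difference
    dm_fdr: float      # BH adjusted p-value for differential methylation
    rho: float         # Spearman rho between methylation and expression
    cor_fdr: float     # BH adjusted p-value for the correlation
    survival_p: float  # p-value from the survival analysis

def passes_integrative_filter(rec, survival_cut=0.05):
    """Keep promoter CpGs that are differentially methylated, negatively
    correlated with expression, and associated with survival."""
    differentially_methylated = abs(rec.delta_beta) >= 0.2 and rec.dm_fdr < 0.005
    negatively_correlated = rec.rho <= -0.5 and rec.cor_fdr < 0.005
    survival_associated = rec.survival_p <= survival_cut
    return (differentially_methylated
            and negatively_correlated
            and survival_associated)

# Toy records for illustration only; none of these values come from the study.
records = [
    CpGRecord("cg_example_1", 0.31, 0.001, -0.62, 0.0004, 0.008),
    CpGRecord("cg_example_2", 0.25, 0.002, -0.55, 0.0010, 0.030),
    CpGRecord("cg_example_3", 0.10, 0.001, -0.70, 0.0001, 0.001),
]
relaxed = [r.cpg_id for r in records if passes_integrative_filter(r)]
strict = [r.cpg_id for r in records if passes_integrative_filter(r, 0.01)]
```

Tightening `survival_cut` from 0.05 to 0.01 shrinks the selected set, which mirrors the drop in counts reported above.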

#### Analysis of Genes of Mucin Family

Our DEG analysis showed that MUC2, MUC5B, and MUC13 were significantly upregulated in PDAC (**Table S8**). MUC1, MUC6, and MUC16 showed overexpression, but it was not statistically significant (BH adjusted *P*-value > 0.05). We noted that MUC5B, which was overexpressed in PDAC (BH adjusted *P*-value = 0.018), also has two hypomethylated CpGs (cg20911165 and cg03609102) in its promoter region, both of which showed a negative correlation with MUC5B expression (**Figure 4**). We also observed that expression of the MUC1, MUC3, MUC4, MUC6, MUC15, MUC17, MUC20, and MUC21 genes was negatively correlated with promoter methylation (**Table S5**).


TABLE 4 | List of probable prognostic gene/miRNA biomarkers for pancreatic ductal adenocarcinoma. Genes and miRNAs with very low *p*-values in both the survival analysis and the *DESeq2* differential expression analysis, and a high area under the curve (AUC).

*Log2FC, log2 fold change; HR, hazard ratio; 95% CI, upper and lower 95% confidence interval values of hazard ratio (HR), and beta is β coefficient of a given variable for the Cox regression analysis.*

### DISCUSSION

Alterations in promoter DNA methylation, as well as miRNA and lncRNA expression, play critical roles in cancer biology by up- or downregulating gene expression (Merlo et al., 1995; Ramachandran et al., 2016). Alterations in DNA methylation patterns can serve as useful biomarkers for distinguishing tumors from normal samples (Oh et al., 2013). Two previous studies, by Sato et al. (2008) and Tan et al. (2009), explored DNA methylation patterns in pancreatic cancer. Sato et al. used methylation site-specific PCR, and Tan et al. used the GoldenGate methylation cancer panel array. Both of these techniques have limited genome coverage and sensitivity. In addition, those studies used formalin-fixed paraffin-embedded samples, xenografts, and pancreatic cancer cell lines, which might affect the quality of the results. In contrast, the current study is based on TCGA Illumina HumanMethylation450 chip data from fresh tissue samples, which offers higher genome coverage with greater consistency and accuracy. Our study is more comprehensive, since we examined differential methylation and differential gene, miRNA, and lncRNA expression in a genome-wide manner, and we also correlated these results with patient survival. To avoid gender bias, we excluded all CpG probe and gene expression data from the X and Y chromosomes. Our results demonstrated that all chromosomes had dm-CpGs in PDAC (**Figure 1A**, **Table 2**). CpG islands, promoters, and their proximal regions had more hypermethylated CpG sites than regions away from islands and promoters (**Figure 1C**, **Table 1**, **Figure S6**). We observed that several chromosomal regions with a high frequency of dm-CpGs were also differentially methylated regions.

In this study, CpG sites in the zinc finger protein 154 (ZNF154) promoter region were hypermethylated and showed a negative correlation with ZNF154 gene expression. We found that the promoter of ZNF154 overlaps with the region that has the highest differential methylation frequency on chromosome 19. The survival analyses indicated that patients in the cg03234186 high methylation group had low overall survival (HR = 1.7) in PDAC (**Table 5**). ZNF154 hypermethylation is a urine-based prognostic biomarker for bladder cancer, where hypermethylation correlates with recurrence-free survival of the patients (Reinert et al., 2012). ZNF154 hypermethylation may also be a blood-based prognostic biomarker for solid tumors (Sanchez-Vega et al., 2013; Margolin et al., 2016). Recently, Zhang *et al*. located hypermethylated CpGs in the ZNF154 promoter (cg03234186, cg12506930, cg26465391) by studying the TCGA prostate cancer archive. Hypermethylation downregulates ZNF154 expression, and survival analysis suggests that hypermethylation of this site is associated with poor patient survival (Zhang, Shu et al., 2018).

Expression of the KRAB zinc-finger tumor suppressor ZNF382 is suppressed by promoter methylation in esophageal squamous cell carcinoma (Zhang, Xiang et al., 2018). In PDAC, we identified hypermethylation at five CpG sites in the ZNF382 promoter region, which is negatively correlated with gene expression. Logistic regression-based classification showed an AUC of 1.0 for all these CpGs. Hypermethylation of three of them (cg02587316, cg18630667, and cg05020604) was associated with low survival of PDAC patients (**Table 5**). These findings suggest that methylation of cg03234186 (ZNF154), and of cg02587316, cg18630667, and cg05020604 (ZNF382), has the potential to serve as a prognostic biomarker for PDAC (**Figure 5**).

*AUC, area under curve; HR, hazard ratio; P-value Cox, the P-value for Cox regression analysis. P-value and P-adj are the raw P-value and BH adjusted P-value, respectively, for the Spearman rank correlation.*

The differentially expressed miRNAs include hsa-mir-196a-1/2 and hsa-mir-196b, which are HOX-cluster embedded miRNA members of the evolutionarily conserved miR-196 gene family (Mansfield and McGlinn, 2012; Fantini et al., 2018). The hsa-mir-196a-1 gene is located in the intergenic region between HOXB9 and HOXB13 on human chromosome 17, hsa-mir-196a-2 lies between HOXC9 and HOXC10 on chromosome 12, and hsa-mir-196b is on chromosome 7. HOX genes such as HOX-B7 (Braig et al., 2010), HOXB8 (Yekta et al., 2004), and HOXA9 (Li et al., 2012) are targets of the miR-196 family. MiR-196b directly targets HOXA9, whose overexpression is associated with poor prognosis in leukemia (Li et al., 2012). The hsa-mir-196a-regulated HOX-B7 expression has a role in melanoma (Braig et al., 2010). It would therefore be worth investigating the role of HOX-cluster gene regulation by miRNA and/or promoter methylation in pancreatic cancers.

Hsa-mir-196b has been reported as a biomarker for digestive tract cancers (Lu et al., 2016) and familial pancreatic cancer (Slater et al., 2014). Multiple studies indicate that hsa-mir-196b overexpression is associated with poor outcomes for cancer patients. For example, hsa-mir-196b overexpression is associated with poor prognosis in gastric cancer (Lim et al., 2013; Ge et al., 2014) and with accelerated invasiveness in epithelial ovarian cancer (Chong et al., 2017). Kanno et al. (2017) reported that hsa-mir-196b overexpression might be a prognostic biomarker for a poor outcome. In our current study, we also found that PDAC patients with hsa-mir-196b overexpression showed worse survival (**Table 4**, **Figure 6**), which further corroborates the role of hsa-mir-196b as a biomarker for PDAC.

MiR-125a is a tumor suppressor that induces apoptosis, mitochondrial energy disorder, and impaired cellular migration through suppressing mitochondrial fission, and plays an important role in pancreatic cancer (Pan et al., 2018). Metastatic colorectal cancer patients treated with bevacizumab in combination with FOLFOX have better progression-free survival (Kiss et al., 2017). In the current study, we observed that hsa-mir-125a is overexpressed, although the *P*-value was not significant. However, univariate Cox regression analysis suggested that patients with higher expression of mir-125a had better overall survival (HR = 0.57) (**Table S6**). This finding suggests that hsa-mir-125a might be useful as a prognostic biomarker for PDAC.

Hsa-mir-135a-2 is a precursor of hsa-mir-135a; univariate log-rank test (*P*-value = 0.01) and Cox-regression analysis (HR = 0.55) suggest that higher expression is associated with better overall survival of PDAC patients. Cheng *et al*. reported that mir-135a is a metastasis inhibitor, and they observed similar survival trends in gastric cancer cell line data (Cheng et al., 2017). In our study, we also observed that hsa-mir-3200 expression is associated with good prognosis of PDAC (HR = 0.5) (**Table S6**).

From the survival analyses of protein-coding genes in PDAC, we observed 518 genes that had significant correlations with patient survival in both high and low expression cohorts. The aryl hydrocarbon receptor nuclear translocator like 2 (ARNTL2) gene, which codes for a helix-loop-helix transcription factor, was the most significant among them. Overexpression of this gene was reported to predict poor outcome for lung adenocarcinoma patients (Brady et al., 2016). To our knowledge, the role of ARNTL2 in PDAC has not been explored before, and the current study showed that ARNTL2 overexpression had a strong association with poor survival (HR = 2.2) in PDAC patients.

In contrast, overexpression of certain genes was associated with longer patient survival. Overexpression of CELF2 and EFR3B was correlated with better PDAC patient survival (**Table S6**). CELF2 is a tumor suppressor (Subramaniam et al., 2011; Ramalingam et al., 2012), and EFR3B contributes to the control of the phosphorylation state and could affect the responsiveness of G-protein-coupled receptors in higher eukaryotes (Bojjireddy et al., 2015). The role of EFR3B in mammals is still unexplored; nevertheless, our results indicated that its expression is a key indicator of patient survival.

The abnormal expression of many long non-coding RNAs (lncRNAs) has been reported to affect the progression of various cancers. Some of these lncRNAs may be useful as diagnostic indicators and anti-cancer targets (Petrovics et al., 2004; Gutschner et al., 2013). We explored whether lncRNAs were involved in PDAC and whether we could find any indication of their utility for the diagnosis and treatment of PDAC. However, none of their expression patterns were correlated with patient survival. It is possible that we needed more than the three tumor-adjacent normal samples for examining lncRNAs. Unfortunately, the present TCGA database has expression values for only three lncRNAs. However, we did find a few correlations between lncRNA expression and survival at a less stringent threshold (*P*-value ≤ 0.05) that could be tested further for their role in patient survival (**Table S6**).

LINC00941 is an epigenetically silenced lncRNA found in pan-cancer TCGA data analysis (Wang et al., 2018). In our study, we found that LINC00941 is overexpressed (*P*-value = 0.02) and that high expression correlated with poor prognosis (HR = 1.8). PVT1 is another lncRNA, which is upregulated in lung cancer and plays a crucial role in lung cancer progression (Li et al., 2018). In our study, PVT1 was also overexpressed (*P*-value = 0.009) and correlated with poor PDAC patient survival (HR = 1.60), with a logistic regression classification AUC of 0.88 (**Figure 7**). Therefore, PVT1 may prove useful as a potential biomarker for PDAC therapy. RP11-54H7.4 is another overexpressed lncRNA in the TCGA database that was reported as a candidate biomarker for lung squamous cell carcinoma prognosis (Tang et al., 2017). We also observed elevated expression of RP11-54H7.4 (not significant), and PDAC patients in the high expression group had worse survival (HR = 1.6) (**Figure 7**).

A few other lncRNAs were associated with PDAC patient survival but were not differentially expressed. The cancer susceptibility candidate 11 (CASC11) lncRNA is among them. Based on a knockdown study, CASC11 is thought to promote colorectal cancer growth and metastasis (Zhang et al., 2016). Our current study showed that CASC11 overexpression was associated with low survival. The antisense lncRNA of GATA6 (GATA6-AS) interacts with the epigenetic regulator LOXL2 to regulate endothelial gene expression *via* changes in histone methylation (Neumann et al., 2018). Our study showed that GATA6-AS overexpression correlated with poor prognosis of PDAC patients. A second similar lncRNA (GATA6-AS1) was also overexpressed and correlated with poor survival of PDAC patients (HR = 0.5) (**Table S6**).

Regarding protein-coding genes (**Table 4**), our study found 17 differentially expressed genes that also correlated with PDAC patient survival, five of which were identified at a stringent *P*-value of ≤ 0.01. Expression of ASPM, Nek2, B3GNT3, DMBT1, and DEPDC1 is associated with the survival of PDAC patients in this study. ASPM (abnormal spindle-like microcephaly associated) is an oncogene that promotes tumor aggression in PDAC, and its overexpression is associated with poor prognosis (Wang et al., 2013). We also observed that the ASPM-overexpressing patient group showed low survival. NIMA-related kinase 2 (Nek2) is a serine/threonine kinase that plays a critical role in mitosis. Nek2 was reported as a prognostic biomarker for lung cancer (Shi et al., 2017), and knockdown of the Nek2 gene with siRNA in xenograft mice decreased tumor size and increased survival for liver-metastasized pancreatic cancer (Kokuryo et al., 2016). This gene was also reported as a prognostic biomarker for PDAC, as patients with high Nek2 expression showed shorter survival (Ning et al., 2014). In the current study, we observed a similar trend. Our logistic regression model analysis also suggests that Nek2 expression may be a distinctive trait in PDAC vs. normal samples (AUC = 0.95). Our finding reconfirms that Nek2 is a potential prognostic biomarker of PDAC.

We observed that overexpression of B3GNT3 (beta-1,3-N-acetylglucosaminyltransferase 3) is associated with shorter survival in PDAC (**Figure 8**). The high AUC of the logistic regression model (AUC = 0.93), together with a low *P*-value and high hazard ratio in the Cox regression analysis, suggests that this gene can be a potential prognostic biomarker for PDAC. Previous reports also confirmed that B3GNT3 overexpression was associated with shorter patient survival in cervical cancer (Zhang et al., 2015) and non-small cell lung cancer (Gao et al., 2018). Similarly, overexpression of DEP domain containing 1 (DEPDC1) is associated with shorter overall survival of PDAC patients. Overexpression of DEPD1B has already been reported in several types of human cancers (Su et al., 2014; Huang et al., 2017), and we also observed overexpression in PDAC. The high classification AUC (0.95) and Cox regression HR (1.9) suggest that it is a good candidate prognostic biomarker for PDAC (**Table 4**). These findings suggest that our proposed methodology works well for detecting known biomarkers, so it may also detect novel prognostic biomarkers.
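The classification AUC values quoted for these candidates can be read as the probability that a randomly chosen tumor sample receives a higher classifier score than a randomly chosen normal sample. The sketch below computes AUC from scores in that pairwise form. It is a plain-Python illustration with a hypothetical function name, it assumes the scores come from an already fitted classifier, and it does not reproduce the authors' logistic regression models.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.
    scores: classifier outputs (higher = more tumor-like);
    labels: 1 = tumor, 0 = normal. Ties count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 1.0 means every tumor sample scored above every normal sample, and 0.5 corresponds to chance. With only a few normal samples, a high AUC rests on few pairwise comparisons and should be interpreted with that in mind.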

On the other hand, overexpression of DMBT1 and Bcl2 modifying factor (Bmf) was associated with improved survival in our study. Patients in the DMBT1 (deleted in malignant brain tumors 1) high expression cohort have better survival (HR = 0.6), and the high logistic regression classification AUC (0.95) suggests its role as a potential biomarker (**Figure 8**). DMBT1 is a tumor suppressor involved in immune defense and epithelial differentiation in cancer (Mollenhauer et al., 2000). Expression of DMBT1 goes down in breast cancer (Braidotti et al., 2004; Blackburn et al., 2007), and we observed a similar trend in our analysis. The proapoptotic protein Bmf, which regulates the death of CD8 T cells (Hubner et al., 2010), is a probable prognostic biomarker for PDAC (HR = 0.62); samples with high Bmf expression have a good prognosis.

Mucins are high molecular weight glycoproteins with oligosaccharides attached to serine or threonine residues of the mucin core protein backbone, and they play important roles as diagnostic and prognostic markers for carcinogenesis and tumor invasion (Hollingsworth and Swanson, 2004). We separately analyzed promoter DNA methylation and mucin gene expression in pancreatic ductal cancer. We observed significant upregulation of MUC2, MUC5B, and MUC13 in PDAC. MUC5B and MUC13 are overexpressed in pancreatic ductal cancer (Kaur et al., 2013), and MUC5B expression is highly sensitive to changes in promoter methylation (Yamada et al., 2011). We observed hypomethylation of the MUC5B promoter CpGs cg20911165 and cg03609102, which is negatively correlated with gene expression (**Figure 4**). We also observed overexpression of the MUC2 gene; in general, its expression goes down in PDAC, but some reports also suggest overexpression of MUC2 (Niv, 2017). Survival analysis of the PDAC data reveals that patients with higher expression of MUC21 have a low survival rate (Cox *P*-value = 0.04, HR = 1.6).

Pathway enrichment analysis did not identify any significantly enriched pathways for the differentially expressed genes, as the number of genes was too small for the analysis. However, pathway analysis of loci with dm-CpGs suggested that MAPK signaling, Rap1 signaling, cAMP signaling, cancer signaling, and mucin type O-glycan biosynthesis pathways were enriched. We conjecture that the nicotine and morphine addiction pathways showed up in our analysis because these PDAC patients are current or past smokers (**Table 3**). Many other cancer-related genes were differentially expressed in PDAC, including MUC2, MUC5B, MUC13, ALDH3A1, CDCA7, and CCL2. Several histone core proteins were overexpressed in PDAC. Our current study also indicated that HIST1H2BC, HIST1H2BJ, and HIST1H3H were associated with poor survival of PDAC patients (**Table 4**).

#### CONCLUSIONS

To our knowledge, this study represents the first TCGA-based PDAC methylome data analysis. The DNA methylome of pancreatic ductal cancer showed significant changes from normal samples. Most of the hypermethylation takes place within promoter regions, and methylation in the promoter region has a strong association with corresponding gene expression. A 10 Mb region of chromosome 7 has the highest hypermethylation density, and this region harbors a number of HOX cluster genes. MUC family genes and histone core proteins are overexpressed, and expression of MUC21 and several histone core genes (HIST1H2AC, HIST1H2BC, and HIST3H2A) is also associated with patient survival. The role of hsa-mir-196b and Nek2 in PDAC patient survival is reconfirmed. Our analysis reveals that the protein-coding genes ARNTL2, CELF2, EFR3B, and B3GNT3 and the long non-coding genes CASC11 and GATA6-AS are potential prognostic biomarkers of PDAC. Promoter methylation of ZNF154 and ZNF382, which were previously reported as early stage urine/blood-based biomarkers, has the potential to be a prognostic biomarker for PDAC.

#### DATA ANALYSIS

All analyses were performed using R version 3.5.1 (R Development Core Team 2015). We performed differential methylation/expression and survival analysis using R/Bioconductor tools. The list of tools used for this analysis is available in **Table S1**.

#### AUTHOR CONTRIBUTIONS

NM, SS, and CG are responsible for the study design. NM and SS performed the statistical analysis and generated figures. NM and SS drafted the manuscript, and CG edited, improved, and approved the manuscript.

#### FUNDING

The authors thank the Bioinformatics and Systems Biology Core, which receives partial support from National Institutes of Health grants [P20GM103427, P30CA036727].

#### SUPPLEMENTARY MATERIALS

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00624/full#supplementary-material

#### REFERENCES

targeting Mfn2-related mitochondrial fission. *Int. J. Oncol.* 53 (1), 124–136. doi: 10.3892/ijo.2018.4380


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Mishra, Southekal and Guda. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
