# MULTI-OMIC DATA INTEGRATION IN ONCOLOGY

EDITED BY: Chiara Romualdi, Enrica Calura, Davide Risso, Sampsa Hautaniemi and Francesca Finotello

PUBLISHED IN: Frontiers in Oncology and Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-151-0 DOI 10.3389/978-2-88966-151-0

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# MULTI-OMIC DATA INTEGRATION IN ONCOLOGY

Topic Editors: Chiara Romualdi, University of Padua, Italy; Enrica Calura, University of Padua, Italy; Davide Risso, University of Padua, Italy; Sampsa Hautaniemi, University of Helsinki, Finland; Francesca Finotello, Innsbruck Medical University, Austria

Citation: Romualdi, C., Calura, E., Risso, D., Hautaniemi, S., Finotello, F., eds. (2020). Multi-omic Data Integration in Oncology. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-151-0

# Table of Contents

*Editorial: Multi-omic Data Integration in Oncology*

Francesca Finotello, Enrica Calura, Davide Risso, Sampsa Hautaniemi and Chiara Romualdi

*Identification of Specific Long Non-Coding Ribonucleic Acid Signatures and Regulatory Networks in Prostate Cancer in Fine-Needle Aspiration Biopsies*

Zehuan Li, Jianghua Zheng, Qianlin Xia, Xiaomeng He, Juan Bao, Zhanghan Chen, Hiroshi Katayama, Die Yu, Xiaoyan Zhang, Jianqing Xu, Tongyu Zhu and Jin Wang

*Computational Methods for the Integrative Analysis of Genomics and Pharmacological Data*

Jimmy Caroli, Martina Dori and Silvio Bicciato

*Genomic and Transcriptomic Landscape of Tumor Clonal Evolution in Cholangiocarcinoma*

Geng Chen, Zhixiong Cai, Xiuqing Dong, Jing Zhao, Song Lin, Xi Hu, Fang-E Liu, Xiaolong Liu and Huqing Zhang

Soledad Ochoa, Guillermo de Anda-Jáuregui and Enrique Hernández-Lemus

*Big Data-Based Identification of Multi-Gene Prognostic Signatures in Liver Cancer*

Meiliang Liu, Xia Liu, Shun Liu, Feifei Xiao, Erna Guo, Xiaoling Qin, Liuyu Wu, Qiuli Liang, Zerui Liang, Kehua Li, Di Zhang, Yu Yang, Xingxi Luo, Lei Lei, Jennifer Hui Juan Tan, Fuqiang Yin and Xiaoyun Zeng

*Impact of Data Preprocessing on Integrative Matrix Factorization of Single Cell Data*

Lauren L. Hsu and Aedin C. Culhane

Giovanna Nicora, Francesca Vitali, Arianna Dagliati, Nophar Geifman and Riccardo Bellazzi

*Integrated Transcriptome Analysis of Human Visceral Adipocytes Unravels Dysregulated microRNA-Long Non-coding RNA-mRNA Networks in Obesity and Colorectal Cancer*

Sabrina Tait, Antonella Baldassarre, Andrea Masotti, Enrica Calura, Paolo Martini, Rosaria Varì, Beatrice Scazzocchio, Sandra Gessani and Manuela Del Cornò

*Unraveling the Complexity of the Cancer Microenvironment With Multidimensional Genomic and Cytometric Technologies*

Natasja L. de Vries, Ahmed Mahfouz, Frits Koning and Noel F. C. C. de Miranda

*Multi-Omics Characterization of the 4T1 Murine Mammary Gland Tumor Model*

Barbara Schrörs, Sebastian Boegel, Christian Albrecht, Thomas Bukur, Valesca Bukur, Christoph Holtsträter, Christoph Ritzel, Katja Manninen, Arbel D. Tadmor, Mathias Vormehr, Ugur Sahin and Martin Löwer

# Editorial: Multi-omic Data Integration in Oncology

Francesca Finotello<sup>1</sup>, Enrica Calura<sup>2</sup>, Davide Risso<sup>3</sup>, Sampsa Hautaniemi<sup>4</sup> and Chiara Romualdi<sup>2</sup>\*

*<sup>1</sup> Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, Innsbruck, Austria, <sup>2</sup> Department of Biology, University of Padua, Padua, Italy, <sup>3</sup> Department of Statistical Sciences, University of Padua, Padua, Italy, <sup>4</sup> Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland*

Keywords: multi-omic, single-cell, transcriptomics, pathways, cancer, data integration

#### **Editorial on the Research Topic**

#### **Multi-omic Data Integration in Oncology**

In the next few years, we are going to witness changes in the treatment of cancer patients due to molecular and personalized medicine. Indeed, many hospitals are already starting routine genome-wide screening to complement and inform diagnosis and treatment choices. However, the majority of molecular aberrations identified in cancers have synergistic interactions in many aspects of cell signaling beyond the genome. The complexity of cancer also crosses cell boundaries, especially when the tumor microenvironment is studied as a heterogeneous and dynamic network of interacting cells (1), one of the new hot topics in anticancer treatment development. In this scenario, multi-omic technologies and single-cell data can shed light on these interactions by generating high-throughput datasets portraying the genomes, transcriptomes, proteomes, metabolomes, and epigenomes of tumors.

Large-scale cancer genomic projects, such as The Cancer Genome Atlas (TCGA) (2), have generated petabytes of multi-omic data portraying this heterogeneity. Importantly, these data have been made available to the scientific community, shifting the main challenge from data collection to data analysis and integration, and allowing for development of novel data analysis methods. However, while computational and statistical analyses of single-omics datasets are well-established—excluding the still challenging single-cell data analyses—the integration of multi-omic data is still far from being standardized. As the number of datasets grows and the biological knowledge increases, existing methods should be extended or generalized, and new computational tools need to be proposed to cope with the complexity and multi-level structure of the available information. In this special issue, de Anda-Jáuregui and Hernández-Lemus presented a comprehensive review of the state of the art of multi-omic data analysis in oncology, encompassing a wide range of tasks, such as data acquisition and processing, data management, identification of therapeutic targets, as well as patient classification, diagnosis, and prognosis.

One of the major challenges in the analysis of multi-omic data is how to integrate the different data modalities. Nicora et al. reviewed a selection of recent tools for the computational integration of multi-omic data sets based on deep learning, network integration, data clustering or factorization, and feature extraction or transformation. This emerging field has already contributed a rich catalog of freely available tools: the most widely used approaches are network-based methods, but deep learning strategies are becoming increasingly popular. In this context, Chierici et al. proposed a computational framework for high-throughput data integration (called Integrative Network Fusion, INF), which leverages network structures and machine learning models to extract multi-omic predictive biomarkers for cancer subtype identification. By integrating gene expression, protein expression, and copy-number data across three TCGA cancer types, INF showed higher predictive performance than a simple juxtaposition of single-omics analyses and enabled the extraction of more biologically meaningful biomarkers. INF was designed to integrate an arbitrary number of omic layers, so the framework can be extended to other types of data, such as histopathological and radiological images.

#### Edited and reviewed by:

*Claudio Sette, Catholic University of the Sacred Heart, Rome, Italy*

\*Correspondence: *Chiara Romualdi chiara.romualdi@unipd.it*

#### Specialty section:

*This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology*

Received: *05 August 2020* Accepted: *07 August 2020* Published: *15 September 2020*

#### Citation:

*Finotello F, Calura E, Risso D, Hautaniemi S and Romualdi C (2020) Editorial: Multi-omic Data Integration in Oncology. Front. Oncol. 10:1768. doi: 10.3389/fonc.2020.01768*

The main goal of most integrative methods is the identification of multi-omic signatures that can be diagnostic (healthy vs. disease), prognostic (good vs. poor patient outcome), or predictive (good vs. poor response to therapeutic interventions). The selection of the optimal signature size, that is, the number of molecular features needed to stratify patients, is not trivial. In general, the smaller the signature, the easier its clinical application but the lower its accuracy, owing to patient heterogeneity. From this perspective, meta-analyses that exploit data from previously published studies can increase the robustness and reliability of a signature. Liu et al. combined extensive text mining and transcriptomic data to identify and validate a small prognostic signature in liver cancer. Starting from more than a thousand genes known to be involved in liver cancer initiation and progression, they identified a triplet of genes associated with survival. Using three independent cohorts and specific experimental assays to confirm transcript and protein expression levels, they found that low expression of F2, GOT2, and TRPV1 is associated with poor prognosis in liver cancer. In a parallel study, Li et al. identified a small diagnostic signature composed of long non-coding RNAs (RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1) that, combined with clinical and previously published molecular biomarkers, is able to predict prostate cancer from fine-needle aspiration biopsies with high sensitivity and specificity. Looking for potential molecular functions of the signature elements, the authors suggested and validated a sponge mechanism in which miR-7, miR-24-3p, and miR-30 are the three main miRNAs sequestered by the long non-coding RNAs, which in turn interact with the RNA binding protein FUS.
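To make the notions of a signature score and of sensitivity and specificity concrete, the following minimal Python sketch scores each sample by the mean expression of a small signature and evaluates a threshold-based diagnostic call. It is an illustration only, not the procedure used in any of the studies above: the gene names, expression values, thresholds, and the choice of a mean-expression score are all hypothetical assumptions.

```python
from statistics import mean


def signature_score(sample, genes):
    """Mean expression of the signature genes in one sample (assumed scoring rule)."""
    return mean(sample[g] for g in genes)


def sensitivity_specificity(scores, labels, threshold):
    """Call a sample positive when its score is >= threshold.

    labels: True for disease, False for control.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y and s >= threshold)
    fn = sum(1 for s, y in pairs if y and s < threshold)
    tn = sum(1 for s, y in pairs if not y and s < threshold)
    fp = sum(1 for s, y in pairs if not y and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical three-gene signature and made-up expression values.
GENES = ["gene_a", "gene_b", "gene_c"]
samples = [
    ({"gene_a": 5.1, "gene_b": 4.8, "gene_c": 5.5}, True),
    ({"gene_a": 4.9, "gene_b": 5.2, "gene_c": 4.7}, True),
    ({"gene_a": 3.0, "gene_b": 3.4, "gene_c": 2.9}, True),
    ({"gene_a": 2.1, "gene_b": 2.5, "gene_c": 2.0}, False),
    ({"gene_a": 2.8, "gene_b": 3.1, "gene_c": 2.6}, False),
    ({"gene_a": 4.2, "gene_b": 4.0, "gene_c": 4.4}, False),
]

scores = [signature_score(expr, GENES) for expr, _ in samples]
labels = [is_case for _, is_case in samples]

# A strict and a lenient threshold illustrate the usual trade-off.
strict = sensitivity_specificity(scores, labels, threshold=4.5)
lenient = sensitivity_specificity(scores, labels, threshold=3.0)
print("strict threshold  (sensitivity, specificity):", strict)
print("lenient threshold (sensitivity, specificity):", lenient)
```

In this toy data set the strict threshold misses one case but produces no false positives, whereas the lenient threshold detects every case at the cost of one false positive, which mirrors the trade-off that has to be balanced when a small signature is tuned for clinical use.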

While the identification of precise molecular signatures is fundamental for clinical practice, understanding the actual mechanisms driving these alterations in specific cancers or cancer subtypes is crucial to design new pharmacological treatments. Ochoa et al. investigated the regulatory elements that drive the various expression behaviors of the PAM50 signature (3) in different breast cancer subtypes. The authors integrated coding and non-coding gene expression, methylation levels, and transcription factor (TF)-target interaction data via a generalized elastic-net model. Using breast tumors and normal adjacent tissues from the TCGA, they identified both subtype-specific regulators and regulators acting across subtypes, such as miR-21 and miR-10b. With a similar aim, Tait et al. combined transcriptomic data to study the expression patterns of non-coding RNAs (ncRNAs; miRNAs and long non-coding RNAs) underlying the dysfunctional adipocyte phenotype in obesity and colorectal cancer. The authors inferred lncRNA-miRNA-mRNA modules, highlighting several ncRNA modulations and dysregulated pathways that are common to both obesity and colorectal cancer. Chen et al., using whole exome and transcriptome sequencing, studied the genomic and transcriptomic landscape of cholangiocarcinoma. The authors investigated subnetworks that were greatly influenced by tumor clonal or subclonal mutations impacting gene expression.
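As a point of reference, the standard elastic-net estimator can be written as below. This is a sketch under stated assumptions, not the exact model of Ochoa et al.: the symbols $y$ (expression of one signature gene across $n$ samples), $X$ (a matrix whose columns are candidate regulators, such as methylation levels and miRNA or TF expression), $\lambda \ge 0$, and $\alpha \in [0, 1]$ are notation introduced here for illustration, and the generalized variant used in that study may differ in its loss or penalty.

$$
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right)
$$

The $\ell_1$ term sets many coefficients exactly to zero, which selects a sparse set of candidate regulators, while the $\ell_2$ term stabilizes the selection when predictors are correlated, as omic features typically are.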

Immunotherapy with checkpoint blockers has drastically advanced treatment of different types of cancer over the past years, improving overall patient survival compared to standard therapy. However, response to treatment remains hard to predict due to the large intra- and inter-patient heterogeneity. Lapuente-Santana and Eduati reviewed the benefit of multi-omic approaches for biomarker discovery in the immuno-oncology field. They present multi-omic approaches that could help understand how different immune cell types can influence the efficacy of immunotherapy with checkpoint blockers and how the cells interact in the tumor microenvironment, shaping the immune response and resistance to immunotherapy. The authors suggest that a combination of dynamic mathematical models and longitudinal data could further improve our understanding of the role of the tumor microenvironment in the response to immunotherapy and provide the rationale for alternative personalized treatments.

Another field that has recently received a boost from multi-omic integration strategies is pharmacogenomics. The term pharmacogenomics is generally used to define the variability of drug response due to the patients' genomic landscape. In this context, cancer cell lines have been the most widely used models to explore the molecular basis of drug sensitivity. Starting from the first NCI-60 project (4), several other studies investigating the link between the genomic makeup and drug response in cancer cell lines have been carried out (5–7). Caroli et al. reviewed the databases and computational tools that have been developed to integrate the genomic profiles of cancer cell lines with their sensitivity to small-molecule perturbations obtained from different screenings.

Multimodal omics can be integrated in silico to answer complex biological questions that require a systems biology approach. One such example is the prediction of tumor neoantigens, namely mutated peptides that are bound to the major histocompatibility complex molecules of cancer cells and can elicit anticancer immune responses. Schrörs et al. derived an integrated map of the genome, transcriptome, and neoantigen landscape of one of the most widely used breast cancer models: the 4T1 murine mammary cancer cell line. They found that 4T1 cells share molecular features with triple-negative breast cancer and, thus, represent a promising model for preclinical studies. Moreover, using IFNγ-ELISpot assays, the authors experimentally confirmed the antigenic potential of 23 mutated peptides selected from the pool of neoantigens predicted in silico.
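One early step that in silico neoantigen prediction workflows commonly share is the enumeration of candidate peptides that span a mutated residue. The Python sketch below illustrates only that step, under stated assumptions: the function name, the toy protein sequence, the substitution, and the peptide length of nine residues are hypothetical, and downstream steps such as expression filtering and MHC binding prediction are not shown. It is not the pipeline used by Schrörs et al.

```python
def mutated_peptides(protein, position, alt_residue, k=9):
    """Return all k-mers of the mutated protein that contain the mutated residue.

    position is 1-based, as in the usual protein-change notation.
    """
    if not 1 <= position <= len(protein):
        raise ValueError("position outside the protein sequence")
    idx = position - 1
    mutant = protein[:idx] + alt_residue + protein[idx + 1:]
    # Window start positions are clipped at both ends of the protein.
    first = max(0, idx - k + 1)
    last = min(idx, len(mutant) - k)
    return [mutant[start:start + k] for start in range(first, last + 1)]


# Hypothetical 20-residue protein with a made-up substitution at position 10.
toy_protein = "MKTAYIAKQRQISFVKSHFS"
peptides = mutated_peptides(toy_protein, position=10, alt_residue="W", k=9)
for peptide in peptides:
    print(peptide)
```

For a substitution far from either end of the protein, this yields k candidate peptides per mutation; each would then be scored for presentation and immunogenicity before experimental validation.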

Despite their recognized value in advancing and informing immuno-oncology and precision medicine, standard "bulk" technologies are intrinsically limited by the sequencing of heterogeneous cell mixtures, which renders a blended average portrayal of the tumor microenvironment. Rapidly emerging single-cell technologies make it possible to disentangle the phenotypes of individual cells, providing unprecedented insights into the cellular and spatial diversity of the tumor microenvironment. However, the sparsity, noise, and high dimensionality of single-cell data pose unique challenges to data analysis. Hsu and Culhane provide a guide to dimensionality reduction techniques that are vital to extract the major sources of variation from single-cell RNA-sequencing data prior to performing downstream data integration, clustering, and analysis. The authors focused on principal component analysis (PCA), a matrix factorization method that can easily scale to large datasets when used with sparse-matrix representations; they described its relationship with singular value decomposition, the differences between using correlation or covariance matrices, the impact of data scaling, log-transformation, and standardization, and how to recognize artifacts in PCA plots. Moreover, they described how canonical correlation analysis (CCA), another popular matrix factorization approach, can be used to integrate single-cell data from different platforms or studies.
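The difference between covariance-based and correlation-based PCA can be illustrated with a small, self-contained Python sketch. The count matrix is made up, and the first principal axis is obtained by power iteration on the gene-by-gene covariance matrix purely to keep the example dependency-free; this is an assumption of the sketch, not the approach recommended by Hsu and Culhane. In practice, a numerical library with sparse-matrix support and a singular value decomposition would be used, since the principal axes are the right singular vectors of the centered data matrix.

```python
import math


def log_transform(counts):
    """log2(count + 1) for every entry of a cells x genes matrix."""
    return [[math.log2(c + 1) for c in row] for row in counts]


def center(matrix, scale=False):
    """Center each gene (column); optionally scale it to unit variance."""
    n = len(matrix)
    cols = list(zip(*matrix))
    means = [sum(col) / n for col in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
           for col, m in zip(cols, means)]
    return [[(v - m) / (sd if scale and sd > 0 else 1.0)
             for v, m, sd in zip(row, means, sds)] for row in matrix]


def first_pc(matrix, iterations=500):
    """Leading eigenvector of X^T X / (n - 1), found by power iteration."""
    n, p = len(matrix), len(matrix[0])
    cov = [[sum(row[i] * row[j] for row in matrix) / (n - 1)
            for j in range(p)] for i in range(p)]
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


# Made-up counts for 6 cells x 3 genes: gene 0 has a very large dynamic
# range, while genes 1 and 2 vary together on a much smaller scale.
counts = [
    [1023, 1, 0],
    [0, 1, 0],
    [1023, 3, 3],
    [0, 3, 3],
    [1023, 7, 15],
    [0, 7, 15],
]

logged = log_transform(counts)
pc_cov = first_pc(center(logged))              # covariance-based PCA
pc_cor = first_pc(center(logged, scale=True))  # correlation-based PCA
print("covariance PCA loadings: ", [round(v, 3) for v in pc_cov])
print("correlation PCA loadings:", [round(v, 3) for v in pc_cor])
```

In this toy matrix the first covariance-based component is driven almost entirely by the high-variance gene, whereas after standardization the first component is shared by the two co-varying genes, which is why the choice of scaling has to be made deliberately before interpreting PCA plots.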

Despite their promise, single-cell technologies, such as flow cytometry, mass cytometry, or single-cell RNA sequencing, are still limited by the lack of information on spatial context and multicellular interactions. de Vries et al. show how multimodal and spatially resolved single-cell data can advance our understanding of the inter-cellular organization and communication in the tumor microenvironment. They present recent developments in spatial, tissue-based techniques, such as multiparameter fluorescence, imaging mass cytometry, and in situ transcriptomics, as well as multidimensional single-cell technologies and studies that integrate multiple single-cell modalities to disentangle complex cell interactions in the tumor microenvironment. These approaches hold the promise to uncover the sources of intra-tumor heterogeneity that hamper cancer treatment, but they require the development of dedicated bioinformatic tools for data analysis and interpretation, as well as tight collaboration between oncologists, immunologists, pathologists, and bioinformaticians for the extraction of mechanistic rationales and actionable targets.

Overall, our collection of original research articles and reviews covers a wide range of multi-omic applications in oncology. The scenario that emerges is that transcriptomics, methylomics, and genomics are the three most frequently analyzed and integrated data types, both in bulk and single-cell studies. To fully understand the complex interactions of the molecular processes underlying cellular mechanisms, a fine temporal and spatial resolution is required. Spatial transcriptomics (8), a set of techniques that allow the (sub-)cellular characterization of gene expression, has the potential to unveil the complex interplay between cell types but gives rise to new computational and statistical challenges, also in terms of data integration. In addition, important information can be exploited by integrating omics data and biomedical images (9), a field that is experiencing new advances in terms of sensitivity and resolution. Multimodal integrative analysis will soon become the standard to study complex systems, and we look forward to exciting new computational developments that tackle data heterogeneity, computational efficiency, and the interpretation of results, and that can ultimately push the oncology field forward.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### FUNDING

FF was supported by the Austrian Science Fund (FWF) (project no. T 974-B30). DR was supported by the Programma per Giovani Ricercatori Rita Levi Montalcini granted by the Italian Ministry of Education, University, and Research, by the National Cancer Institute of the National Institutes of Health (2U24CA180996), and by the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (CZF2019-002443). SH was supported by the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No. 667403 for HERCULES. CR and EC were supported by Italian Association for Cancer Research (IG 21837 to CR and MFAG 2019 23522 to EC).

#### REFERENCES

biomarker discovery in cancer cells. Nucleic Acids Res. (2013) 41:D955–61. doi: 10.1093/nar/gks1111


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Finotello, Calura, Risso, Hautaniemi and Romualdi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification of Specific Long Non-Coding Ribonucleic Acid Signatures and Regulatory Networks in Prostate Cancer in Fine-Needle Aspiration Biopsies

Zehuan Li<sup>1,2†</sup>, Jianghua Zheng<sup>3†</sup>, Qianlin Xia<sup>4†</sup>, Xiaomeng He<sup>1†</sup>, Juan Bao<sup>1</sup>, Zhanghan Chen<sup>2</sup>, Hiroshi Katayama<sup>5</sup>, Die Yu<sup>1</sup>, Xiaoyan Zhang<sup>1</sup>, Jianqing Xu<sup>1</sup>, Tongyu Zhu<sup>1,6</sup> and Jin Wang<sup>1</sup>\*

#### Edited by:

Chiara Romualdi, University of Padova, Italy

#### Reviewed by:

Smrithi Rajendiran, University of California, Santa Cruz, United States; Chin-Yo Lin, University of Houston, United States

\*Correspondence:

Jin Wang wjincityu@yahoo.com

† These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Cancer Genetics, a section of the journal Frontiers in Genetics

Received: 04 June 2019 Accepted: 17 January 2020 Published: 14 February 2020

#### Citation:

Li Z, Zheng J, Xia Q, He X, Bao J, Chen Z, Katayama H, Yu D, Zhang X, Xu J, Zhu T and Wang J (2020) Identification of Specific Long Non-Coding Ribonucleic Acid Signatures and Regulatory Networks in Prostate Cancer in Fine-Needle Aspiration Biopsies. Front. Genet. 11:62. doi: 10.3389/fgene.2020.00062

<sup>1</sup> Scientific Research Center, Shanghai Public Health Clinical Center, Fudan University, Shanghai, China, <sup>2</sup> Department of General Surgery, Zhongshan Hospital, Fudan University, Shanghai, China, <sup>3</sup> Department of Laboratory Medicine, Zhoupu Hospital Affiliated to Shanghai University of Medicine & Health Sciences, Shanghai, China, <sup>4</sup> Department of Laboratory Medicine, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China, <sup>5</sup> Department of Molecular Oncology, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan, <sup>6</sup> Department of Urology, Shanghai Key Laboratory of Organ Transplantation, Zhongshan Hospital, Fudan University, Shanghai, China

Prostate cancer (PCa) is one of the most common tumors in men and can be lethal, especially if left untreated. A substantial majority of PCa patients are not only diagnosed based on fine needle aspiration (FNA) biopsies, but their treatment choices are also largely driven by the pathological findings obtained with these FNA specimens. It is widely believed that long non-coding RNAs (lncRNAs) have strong biological significance, but their specific functions and regulatory networks have not been elucidated. LncRNAs may serve as key players and regulators of PCa carcinogenesis and could be novel biomarkers of this cancer. To identify potential markers for early detection of PCa, in this study, we employed a competing endogenous RNA (ceRNA) microarray to identify differentially expressed lncRNAs (DelncRNAs) in PCa tissue and quantitative real-time PCR (qRT-PCR) analysis to validate these DelncRNAs in FNA biopsies. We demonstrated that a total of 451 lncRNAs were differentially expressed in four pairs of PCa/adjacent tissues, and upregulation of the lncRNAs RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1 was confirmed in FNA biopsies of PCa by qRT-PCR and was consistent with the ceRNA array data. The association between the expression of the lncRNA LAMTOR5-AS1 and aggressive cancer was also investigated. Regulatory network analysis of DelncRNAs showed that the lncRNAs RP11-33A14.1 and RP11-423H2.3 targeted miR-7, miR-24-3p, and miR-30 and interacted with the RNA binding protein FUS. Knockdown of these DelncRNAs in PCa cells also demonstrated the effects of RP11-423H2.3 on miR-7/miR-24/miR-30 or LAMTOR5-AS1 on miR-942-5p/miR-542-3p via direct interaction. The results of these studies indicate that these three specific lncRNA signatures and regulatory networks might serve as risk prediction and diagnostic biomarkers for prostate cancer, even in biopsies obtained by FNA.

Keywords: prostate cancer, long non-coding ribonucleic acid, regulatory networks, fine needle aspiration biopsies, microribonucleic acid, ribonucleic acid binding proteins, biomarker

# INTRODUCTION

Prostate cancer (PCa) is the second most common tumor among men worldwide and, together with lung and bronchial cancer, accounts for the highest morbidity and mortality. In 2018, PCa accounted for 19% of all new cancer cases, and in the USA, ~29,000 men died from prostate cancer (Siegel et al., 2017; Siegel et al., 2018). PCa is usually diagnosed at a localized stage by a combination of prostate-specific antigen (PSA) testing, magnetic resonance imaging (MRI), digital rectal examination (DRE), and transrectal ultrasound (TRUS)-guided biopsy (Carroll et al., 2018); most panel members favor informed testing beginning at the age of 45 years. Despite these detection methods and systemic therapies, including radiation therapy, prostatectomy, androgen deprivation therapy, immunotherapy, and chemotherapy (Mohler et al., 2018), several patients are still diagnosed at a late stage of development (Siegel et al., 2018). Moreover, while PCa remains indolent in most individuals, in a minority of patients it behaves aggressively. PSA, which is the most common prostatic marker, has a high specificity for prostate cancer, but its expression cannot be detected in ~5% of patients with high-grade PCa (Epstein, 1993; Van Der Toom et al., 2019); conversely, PSA testing can lead to the overdiagnosis of clinically insignificant cancer (Tan et al., 2019). Thus, biomarkers that accurately diagnose prostate cancer and, more importantly, differentiate indolent from life-threatening prostate cancer are urgently required.

Noncoding RNAs (ncRNAs) play key roles in cancer progression and could be used to develop novel biomarkers of prostate cancer (Shan et al., 2017; Xia et al., 2018). Many questions regarding the participation of ncRNAs in prostate cancer progression remain open, such as how ncRNAs participate in the pathological processes leading to the development of prostate cancer, how they interact with proteins, and how specific and easy to detect they are in tissues, serum, plasma, and urine; answering them could lead to the development of novel biomarkers of this aggressive cancer. In our previous studies, we demonstrated that four differentially expressed genes (TGBL1, HOXA7, KRT15, and TGM4) in FNA biopsies could facilitate the diagnosis of prostate cancer, with performance significantly improved over PSA (Shan et al., 2017), and we found that differentially expressed circular RNAs (circRNAs) (circ\_0062019 and circ\_0057558) and the host gene SLC19A1 of circ\_0062019 could be used as potential novel biomarkers of PCa (Xia et al., 2018).

Long noncoding RNAs (lncRNAs) are currently defined as RNA transcripts longer than 200 nucleotides that do not appear to code for proteins but control cell fate during development through complex mechanisms, and their dysregulation underlies some human disorders caused by chromosomal deletions and translocations (Batista and Chang, 2013). LncRNAs include several types of RNA transcripts, such as antisense, intronic, and intergenic transcripts, pseudogenes, and retrotransposons (Lee, 2012). They are more cell type-specific than protein-coding genes, and their aberrant expression has been documented in various cancers, including PCa (Hon et al., 2017; Misawa et al., 2017). LncRNAs were found to be involved in prostate carcinogenesis by mediating enhancer-promoter looping, alternative splicing, and antisense gene silencing, antagonizing transcription regulators, and repressing DNA repair (Walsh et al., 2014). For example, the lncRNA SChLAP1 promotes aggressive PCa mechanistically by impairing SWI/SNF-mediated regulation of gene expression and genomic binding (Prensner et al., 2013). The lncRNA NEAT1, which is regulated by estrogen receptor alpha (ERa), drives an oncogenic cascade in PCa and is associated with therapeutic resistance (Chakravarty et al., 2014). The lncRNA HOTAIR increases the androgen receptor-mediated transcriptional program and promotes the growth of castration-resistant prostate cancer (Zhang et al., 2015). Other lncRNAs, such as lncRNA ZEB1-AS1 (Su et al., 2017) and lncRNA HOXD-AS1 (Gu et al., 2017), can also regulate cell proliferation and chemoresistance as oncogenes. However, some lncRNAs, such as lncRNA TUG1 and lncRNA CTB-89H12.4, can mediate sponge regulatory networks as tumor suppressors (Du et al., 2016). Preclinically, interfering with the lncRNA MALAT1 can suppress enzalutamide-resistant PCa progression (Wang et al., 2017b). Therefore, lncRNAs play multifaceted roles in PCa and may serve as risk prediction, diagnostic, prognostic, and predictive biomarkers of PCa.

In this study, we applied a competing endogenous RNA (ceRNA) microarray to identify differentially expressed lncRNAs in PCa tissue. Through further validation of the most differentially expressed lncRNAs in prostate biopsy tissues, we found that three lncRNAs, i.e., RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1, and their regulatory networks may serve as novel diagnostic biomarkers of PCa.

Abbreviations: AUC, area under the curve; BPH, benign prostatic hyperplasia; ceRNA, competing endogenous RNA; DEGs, differentially expressed genes; DelncRNAs, differentially expressed lncRNAs; DRE, digital rectal examination; ERα, estrogen receptor alpha; FNA, fine-needle aspiration; FPG, fasting plasma glucose; FUS/TLS, fused in sarcoma/translocated in liposarcoma; GEO, Gene Expression Omnibus; lncRNAs, long noncoding RNAs; MRI, magnetic resonance imaging; ncRNAs, noncoding RNAs; NEAT1, nuclear paraspeckle assembly transcript 1; NMD, nonsense-mediated RNA decay; PCa, prostate cancer; PSA, prostate-specific antigen; PTBP1, polypyrimidine tract-binding protein 1; qRT-PCR, quantitative real-time polymerase chain reaction; ROC, receiver operating characteristic; RBPs, RNA-binding proteins; TC, total cholesterol; TG, total triglyceride; TRUS, transrectal ultrasound.

# MATERIALS AND METHODS

# Cell Lines and Cell Culture

The prostate cancer cell lines 22Rv1 (ATCC No. CRL-2505), DU145 (ATCC No. HTB-81), LNCaP (ATCC No. CRL-1740), and PC3 (ATCC No. CRL-1435) were purchased from the Culture Collection of the Chinese Academy of Sciences, Shanghai, China (http://www.cellbank.org.cn/). DU145 and PC3 were cultured in MEM (Cat # : 41500034, Life Technologies) and F-12 (GIBCO, 21700075, Life Technologies), respectively; LNCaP and 22Rv1 were maintained in RPMI-1640 (Cat # : 31800022, Life Technologies) supplemented with 10% fetal bovine serum (FBS) (Thermo Fisher Scientific, Waltham, MA, US) at 37°C in 5% CO2. The human prostatic epithelial cell lines (HPEpic) were purchased from Shanghai Xinyu Biological Technology Co., Ltd. All cells were cultured according to the ATCC standard procedure.

#### Prostate Tumor and Benign Prostatic Hyperplasia Tissue Samples

Four pairs of fresh prostate tumor and paracancerous tissues and 105 fine-needle aspiration (FNA) biopsy specimens of prostate tissue, comprising 48 PCa tissues and 57 benign prostatic hyperplasia (BPH) tissues, were acquired from Zhongshan Hospital Affiliated with Fudan University. This research was approved by the Ethics Committee of Zhongshan Hospital Affiliated with Fudan University and Shanghai Public Health Clinical Center. Written informed consent was obtained from all patients for the use of their tissue samples and clinical records. Each tissue was confirmed by a pathologist specializing in prostate cancer, and a Gleason score was provided for risk stratification. All samples were stored at −80°C after surgical resection.

#### Ribonucleic Acid Purification, Competing Endogenous Ribonucleic Acid Microarray, and Data Analysis

Total RNA was extracted and purified using TRIzol reagent (Invitrogen, Carlsbad, CA, US) and an RNeasy Mini Kit (QIAGEN, GmbH, Germany) following the manufacturer's instructions. The total RNA was quantified with a NanoDrop 2000 spectrophotometer (NanoDrop, US), and only samples with a 260/280 nm absorbance ratio of 1.8–2.0 were retained. The RNA integrity of the selected samples was assessed with an Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, US). Four pairs of prostate tumor and paracancerous tissues were used for the microarray assay to investigate the differentially expressed lncRNAs between the cancer tissues and paracancerous tissues (Xia et al., 2018). The total RNA was amplified and labeled with Cy3-labeled CTP using a Low Input Quick Amp WT Labeling Kit (Agilent Technologies, Santa Clara, CA, US) and T7 RNA polymerase. The labeled cRNAs were purified with an RNeasy Mini Kit (QIAGEN, GmbH, Germany) and loaded onto SBC human (4 × 180K) ceRNA microarrays, which include 68,423 ncRNAs, 88,371 circRNAs, and 18,853 messenger RNAs (mRNAs) (Shanghai Biotech Co., Ltd., Shanghai, China). The microarray hybridization was performed following the manufacturer's standard protocols using a Gene Expression Hybridization Kit (Agilent Technologies, Santa Clara, CA, US) in a hybridization oven (Agilent Technologies, Santa Clara, CA, US). The hybridized slides were washed, fixed, and finally scanned to obtain images using an Agilent Microarray Scanner (Agilent Technologies, Santa Clara, CA, US). The data were extracted with Feature Extraction software 10.7 (Agilent Technologies, Santa Clara, CA, US), and the raw data were normalized by the quantile algorithm in the limma package in R. The significantly differentially expressed lncRNAs between the prostate cancer and paracancerous tissues were identified and retained by screening for fold change > 2.0 at p < 0.05. The prostate cancer microarray datasets were deposited in the Gene Expression Omnibus (GEO) database under accession number GSE140927.
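
The normalization and screening steps described above can be illustrated with a minimal Python sketch. It is not the limma implementation used in the study: the function names are hypothetical, ties are not averaged as a production implementation would do, and the fold-change screen is assumed to be symmetric (more than 2-fold up or down), which the text implies but does not state.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize equal-length sample columns (simplified, no tie handling)."""
    n = len(columns[0])
    # Sort each sample, then average across samples at every rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[i] for col in sorted_cols) for i in range(n)]
    normalized = []
    for col in columns:
        # Indices of the probes ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def is_differential(fold_change, p_value, fc_cutoff=2.0, p_cutoff=0.05):
    """Screen on a linear tumor/normal ratio; assumes a symmetric cutoff."""
    changed = fold_change > fc_cutoff or fold_change < 1.0 / fc_cutoff
    return changed and p_value < p_cutoff
```

After normalization every sample shares the same intensity distribution, so fold changes between tumor and paracancerous tissue are less affected by array-wide intensity differences.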

#### Regulatory Network Analysis of Differential Long Non-Coding Ribonucleic Acids and Microribonucleic Acids in Prostate Cancer

For an integrative analysis of prostate cancer-specific differentially expressed lncRNAs and miRNAs, we searched the GEO database for miRNA expression profiling studies related to prostate cancer. Two miRNA expression datasets (GSE76260 and GSE36802) were downloaded from the National Center for Biotechnology Information GEO database. All patients' records/information were anonymized and deidentified prior to the analysis. In total, 106 prostate clinical specimens (53 cancer and 53 non-neoplastic tissues/matched benign prostate tissues) were collected from GEO; the downloaded data came from 47 patients with prostate cancer and two different platforms, an Affymetrix Multispecies miRNA-1 Array and an Illumina Human v2 MicroRNA Expression BeadChip. We applied unpaired Student's t-tests to determine the expression differences between the groups. The differential expression values are displayed as the log of the fold change. All analyses were performed with R statistical software. We predicted the candidate genes targeted by these miRNAs based on TargetScan (Whitehead Institute for Biomedical Research, Cambridge, MA, US) (Lewis et al., 2003) or miRecords (LC Sciences, Houston, TX, US) (Xiao et al., 2009). We also applied GEO2R to determine the involvement of dysregulated miRNAs in PCa and used the microRNA.org databases and the hypergeometric method to calculate the p-values in the miRNA target analysis. Furthermore, we analyzed the potential target microRNAs (miRNAs) of the differential lncRNAs online (http://www.mircode.org). To understand the protein-lncRNA interactions of the differentially expressed lncRNAs, we constructed a lncRNA-mRNA network based on the transcripts. By analyzing the possible combination of lncRNAs and mRNAs, we predicted the target mRNAs of the differentially expressed lncRNAs (http://starbase.sysu.edu.cn/starbase2/) (Li et al., 2014) and generated a lncRNA-mRNA regulatory network map with Cytoscape 3.5.1 software (Shannon et al., 2003).
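
As an illustration of the two quantities mentioned above, the following Python sketch computes a log2 fold change and a one-sided hypergeometric p-value. The analysis in the study was run in R; the function names and the way the population and hit counts are defined here are assumptions made for illustration only.

```python
from math import comb, log2


def log2_fold_change(mean_tumor, mean_normal):
    """Log2 ratio of mean expression in tumor versus normal tissue."""
    return log2(mean_tumor / mean_normal)


def hypergeom_enrichment_p(population, successes, draws, observed):
    """P(X >= observed) for sampling `draws` items without replacement
    from `population` items, of which `successes` are hits."""
    upper = min(draws, successes)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return tail / total
```

In a target analysis, `population` could be all genes considered, `successes` the predicted targets of one miRNA, `draws` the size of a gene list of interest, and `observed` the overlap between the two; a small p-value then suggests that the overlap is larger than expected by chance.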

#### Knockdown of Differentially Expressed Long Non-Coding Ribonucleic Acids in Prostate Cancer Cells

We applied si-RP11-423H2.3 and si-LAMTOR5-AS1 to knock down the expression of RP11-423H2.3 and LAMTOR5-AS1 in the prostate PC3 and DU145 cancer cells (the target sequence of RP11-423H2.3 was AAGGACAGCTTGCCTGACT; the target sequence of LAMTOR5-AS1 was CTGGTCTACTGTCACAACA; and siRNA-GFP was the control). All siRNAs were designed and synthesized by Ribobio (Guangzhou, China). qRT-PCR was applied to validate the transfection efficiency and expression level of relevant lncRNAs and target miRNAs. The siRNA with the best transfection efficiency was selected for subsequent experiments. Prostate PC3 and DU145 cancer cells were transfected with siRNAs at a concentration of 50 nM using 5 µl of Lipofectamine 3000 (Invitrogen, CA, US) according to the manufacturer's protocol.

### Quantitative Real-Time Polymerase Chain Reaction Analysis

Total RNA was isolated from 105 clinical specimens and prostate cells using TRIzol reagent (Invitrogen, Carlsbad, CA, USA). In total, 600 ng of total RNA per sample was used for complementary DNA (cDNA) synthesis using a PrimeScript™ RT Reagent Kit with gDNA Eraser (Takara, Cat# : RR047A, Japan). Real-time quantitative reverse transcription PCR (qRT-PCR) was performed with SYBR Premix Ex Taq™ II (Takara, Cat# : RR820A, Japan) using the LightCycler 480 II Instrument (Roche Molecular Systems, Inc). We performed qRT-PCR in a total reaction volume of 10 µl, including 5 µl of 2× SYBR Green PCR buffer, 0.4 µl of forward primer (10 µM), 0.4 µl of reverse primer (10 µM), 0.2 µl of ROX Reference Dye II, 3.5 µl of ddH2O, and 15 ng of cDNA. The reaction was initiated at 95°C for 1 min followed by 95°C (5 s) and 60°C (30 s) for 40 cycles. The expression of the lncRNAs was normalized to the level of 18S. The specific primers of the lncRNAs, miRNAs, and 18S are presented in Table S1. The data were collected and analyzed using the 2^−ΔΔCt method.
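
The relative quantification used here can be written out as a short calculation. The sketch below is a generic Python illustration of the ΔΔCt method with 18S as the reference gene; the function name and the example Ct values are hypothetical and are not taken from the study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target transcript by the delta-delta-Ct method.

    delta Ct = Ct(target) - Ct(reference, e.g. 18S), computed separately
    for the sample of interest and for the control sample.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target crosses threshold two cycles earlier
# in the sample than in the control, at equal 18S levels.
fold = relative_expression(24.0, 10.0, 26.0, 10.0)
```

Because each PCR cycle ideally doubles the product, a ΔΔCt of −2 corresponds to a 4-fold higher expression in the sample than in the control.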

# Immunoblots

Prostate PC3 and DU145 cancer cells were transfected with si-RP11-423H2.3, si-LAMTOR5-AS1, or siRNA-GFP (si-Control) using Lipofectamine 3000 (Invitrogen, CA, US) according to the manufacturer's protocol. After 72 h, the cells were lysed in radioimmunoprecipitation assay (RIPA) buffer supplemented with protease inhibitors. Thirty micrograms of total protein was loaded per lane and separated on a 10% sodium dodecyl sulfate (SDS)-polyacrylamide gel by electrophoresis, and the proteins were transferred onto nitrocellulose membranes. The membranes were blocked with 5% milk in phosphate-buffered saline with Tween 20 (PBST) and then incubated with rabbit anti-UPF1 (Cat#: D161327, BBI Solutions), rabbit anti-FUS (Cat#: D223360, BBI Solutions), or β-actin (N-21) rabbit polyclonal antibody (Cat#: sc-130656, Santa Cruz Biotechnology, Inc) at 4°C overnight. After washing with PBST, the blots were treated with a horseradish peroxidase (HRP)-conjugated anti-rabbit IgG. Detection of blots was performed using Meilunbio® fg super sensitive ECL luminescence reagent (Dalian Meilun Biotechnology Co., Ltd.) (Zhang et al., 2019).

## Statistical Analyses

We collected clinical data from 105 prostate tissues, and a Student's t-test was used to analyze the differences in lncRNA expression between the prostate cancer group and BPH group. A Pearson correlation analysis was used to investigate the relationship between the differential lncRNAs and clinical parameters. The results were regarded as statistically significant at p < 0.05. All graphs were generated using GraphPad Prism 7.0 software (GraphPad Software Inc., La Jolla, CA, USA). The statistical analysis was performed using SPSS 22.0 (IBM-SPSS Inc., Chicago, IL, USA). Receiver operating characteristic (ROC) curves were applied to evaluate the clinical diagnostic value of the differential lncRNAs and the combination of PSA and lncRNAs.
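
To make the ROC metrics concrete, the following Python sketch computes the area under the curve as the probability that a randomly chosen cancer sample scores higher than a randomly chosen BPH sample, together with sensitivity and specificity at a given cutoff. The study used SPSS; the function names and the toy scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly
    (ties count as half), equivalent to the Mann-Whitney U statistic."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def sensitivity_specificity(positive_scores, negative_scores, threshold):
    """Call a sample positive when its score is at or above the threshold."""
    true_pos = sum(1 for p in positive_scores if p >= threshold)
    true_neg = sum(1 for n in negative_scores if n < threshold)
    return true_pos / len(positive_scores), true_neg / len(negative_scores)
```

An AUC of 0.5 corresponds to a marker that does not separate the groups, and an AUC of 1.0 to perfect separation; sensitivity and specificity depend on the cutoff chosen along the curve.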

# RESULTS

#### Differential Profiling of Long Non-Coding Ribonucleic Acids in Prostate Cancer

To identify potential biomarkers of PCa, we collected four pairs of tumor/adjacent normal tissue paraffin specimens and applied a ceRNA microarray to profile the transcripts in the PCa and adjacent normal tissues (Xia et al., 2018). A heatmap (Figure 1A) and scatter plots (Figure 1B) of the differential lncRNAs between the PCa tissues and normal tissues are shown in Figure 1. The heatmap indicates that 451 lncRNAs (Figure 1A) were differentially expressed with a fold change > 2.0 at p < 0.05. Among these lncRNAs, 217 were upregulated and 234 were downregulated in the four pairs of PCa/adjacent tissues (Table 1). Among the differentially expressed lncRNAs, the most upregulated lncRNA is LINC00675, and the most downregulated lncRNA is RP11-864N7.4.

#### Validation of Key Differentially Expressed Long Non-Coding Ribonucleic Acids (RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1) Using Fine Needle Aspiration Samples

We further carried out a qRT-PCR analysis of the related differential lncRNAs, including RP11-33A14.1, RP11-423H2.3, LAMTOR5-AS1, LINC00675, RP11-118K6.2, and RP11-423H2.3, in the normal prostate cell line HPEpic, PCa cells (22Rv1, DU145, LNCaP, and PC3 cells), and 105 FNA prostate tissues (48 PCa tissues and 57 BPH tissues) (Figure S1). The results revealed that the lncRNAs RP11-33A14.1 (Figure 2A), RP11-423H2.3 (Figure 2B), and LAMTOR5-AS1 (Figure 2C) were upregulated in the four PCa cells. We further validated these lncRNAs in 48 PCa tissues and 57 BPH tissues. The results showed that in the PCa tissues, the lncRNAs RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1 were upregulated by

FIGURE 1 | Heatmap and scatter plots of differential long non-coding RNAs (lncRNAs) in prostate tumor tissues and normal tissues. (A) Heatmap of differential lncRNAs; (B) scatter plots of differential lncRNAs.



11.12 ± 3.66-fold (Figure 2D), 4.44 ± 1.87-fold (Figure 2E), and 1.89 ± 0.76-fold (Figure 2F), respectively (p < 0.05), further confirming the results of our ceRNA microarray.

#### Differentially Expressed Long Non-Coding Ribonucleic Acids as Novel Biomarkers of Prostate Cancer Associated With Prostate-Specific Antigens Levels and the Progression of Prostate Cancer

We assessed the diagnostic effectiveness of the differential lncRNAs in differentiating between PCa and BPH tissues by an ROC curve (Figure 3). The areas under the curve (AUCs) of lncRNAs RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1 were 0.697, 0.620, and 0.641, respectively (Figure 3 and Table 2). When the three differential lncRNAs were combined, the AUC was 0.754 (Figure 3D). When the PSA level was combined with the three differential lncRNAs, the AUC was 0.984, with a sensitivity of 97.9% and a specificity of 84.2% (Figure 3E). To clarify the characteristics of these differential lncRNAs in PCa, we applied a Pearson correlation analysis to analyze the correlation between these lncRNAs and the corresponding clinical parameters. As shown in Table 3,

TABLE 2 | ROC analysis of the diagnostic efficiency of differential long noncoding ribonucleic acids (lncRNAs) (RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1) and serum prostate-specific antigen (PSA) in prostate cancer (PCa) patients and benign prostatic hyperplasia (BPH) controls.


TABLE 3 | Association between the differential long non-coding ribonucleic acids (lncRNAs) and clinical parameter in prostate cancer (PCa) patients.


\*Bold values denote statistical significance at the p < 0.05 level.

the lncRNA LAMTOR5-AS1 is positively correlated with the PSA level of the patients (p < 0.001). A combined Gleason score of 6 or 7 indicates that PCa is likely to grow but may not spread quickly. A score of 8–10 is suggestive of aggressive prostate cancer that is potentially lethal [24]. In this study, we investigated the association between the expression of lncRNA LAMTOR5-AS1 and aggressive cancer (Gleason score 8–10, p < 0.05) (Table 4) and found that lncRNA LAMTOR5-AS1 expression was higher in the less aggressive PCa (Gleason score 6–7; GS6-7) than in the aggressive PCa (Gleason score 8–10; GS8-10), yet its expression in GS8-10 was higher than in



\*Bold values denote statistical significance at the p < 0.05 level.

non-cancer tissues (p = 0.023) (Figure S2), which indicated that LAMTOR5-AS1 might be useful in the early diagnosis of PCa.

#### Regulatory Network Analysis of Differentially Expressed Long Non-Coding Ribonucleic Acids, Their Target Microribonucleic Acids, and Their Interaction With Ribonucleic Acids Binding Protein in Prostate Cancer

Subsequently, we predicted the miRNAs likely to be targeted by these three lncRNAs. In total, 100 miRNAs with binding sites for lncRNA RP11-33A14.1 and 47 miRNAs with binding sites for lncRNA RP11-423H2.3 were selected for subsequent analysis (Figures 4A, B). We also analyzed the miRNA expression profiles of GSE76260 and GSE36802 from the GEO databases. The microarray dataset GSE76260 included 32 prostate cancer and 32 non-neoplastic tissue samples; GSE36802 included 21 pairs of prostate cancer samples and matched benign prostate tissues. We identified 53 miRNAs that were differentially expressed between the prostate cancer tumor tissue and the normal controls. We found that compared with the normal controls, 28 miRNAs were upregulated (Figure 4A), and 25 miRNAs were repressed in the prostate cancer tissue samples (Figure 4B) in the two GEO datasets. Taken together, these results indicate that miR-7 predicted from lncRNAs RP11-33A14.1 and RP11-423H2.3 was upregulated in the prostate cancer tumor tissue in the two GEO datasets (Figure 4A). In contrast, two miRNAs (miR-24 and miR-30c) predicted from the two lncRNAs were repressed in the prostate cancer tumor tissue in the two GEO datasets (Figure 4B). Furthermore, we found that lncRNAs RP11-33A14.1 and RP11-423H2.3 both target miR-7, miR-24-3p, and miR-30 (Figure 4C). However, we only obtained two predicted miRNAs (miR-542-3p and miR-30c) for LAMTOR5-AS1 if we combined these two GEO datasets and utilized the miRDB database to identify target miRNAs. Next, we applied three reference datasets, DIANA-TarBase (http://www.microrna.gr/tarbase) (Karagkouni et al., 2018), lncRNASNP2 (http://bioinfo.life.hust.edu.cn/lncRNASNP/#!/mirna/), and miRDB (http://www.mirdb.org/), to predict the targeted miRNAs of LAMTOR5-AS1 and overlapped the three predicted results.
Furthermore, we selected the top miRNAs (miR-550b-3p, miR-942-5p, miR-542-3p, miR-7162-3p, miR-4653, miR-3921, and miR-181b-3p) (Table S3) with the highest context scores (score > 70 in two predicted datasets) to establish a lncRNA-miRNA network for LAMTOR5-AS1 (Figure 4C). Finally, we analyzed the regulatory networks of lncRNAs RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1 and predicted their potential RNA binding proteins (RBPs) using the starBase database. We found that lncRNAs RP11-423H2.3 and LAMTOR5-AS1 shared common RBPs, including eIF4AIII, U2AF65, and UPF1. More intriguingly, lncRNAs RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1 interact with the same RBP FUS (Figure 4D).
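
The overlap and score-based selection described above amount to simple set operations. The Python sketch below illustrates them under stated assumptions: the function names, database labels, miRNA identifiers, and scores are placeholders rather than values from the study, and "score > 70 in two predicted datasets" is read as "above the cutoff in at least two resources".

```python
def overlap_predictions(*prediction_sets):
    """miRNAs predicted by every resource (intersection of all sets)."""
    return set.intersection(*map(set, prediction_sets))


def high_confidence(scores_by_db, cutoff=70, min_dbs=2):
    """miRNAs scoring above `cutoff` in at least `min_dbs` resources."""
    counts = {}
    for db_scores in scores_by_db.values():
        for mirna, score in db_scores.items():
            if score > cutoff:
                counts[mirna] = counts.get(mirna, 0) + 1
    return sorted(m for m, c in counts.items() if c >= min_dbs)


# Placeholder predictions for one lncRNA from three hypothetical resources.
scores = {
    "db_a": {"miR-x1": 90, "miR-x2": 60},
    "db_b": {"miR-x1": 80, "miR-x2": 95},
    "db_c": {"miR-x3": 99},
}
selected = high_confidence(scores)
```

Requiring agreement between independent resources trades sensitivity for specificity: fewer candidate miRNAs survive, but each is supported by more than one prediction method.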

# DISCUSSION

Prostate cancer is one of the most common cancers in men and ranges from low risk states amenable to active surveillance to high-risk states that can be lethal, especially if left untreated (Eskra et al., 2019). Although the diagnosis cornerstone of PCa has been prostate-specific antigen levels and numerous biomarkers have been introduced over the past decade, there is still a critical need for the development of relatively noninvasive and clinically useful methods for the screening, detection, prognosis, disease monitoring, and prediction of treatment efficacy of PCa.

FIGURE 4 | (C) lncRNAs and their targeted miRNAs; (D) RNA-binding proteins shared by lncRNAs RP11-423H2.3 and LAMTOR5-AS1.

Noncoding RNAs (ncRNAs) are typically classified as small ncRNAs or lncRNAs according to whether they are shorter or longer than 200 nucleotides, and these RNAs are actively transcribed into a versatile group of RNA transcripts without protein-coding potential (over 80% of the genome) (Kapranov et al., 2007; Djebali et al., 2012). The dysregulation of lncRNAs has been implicated in the development and progression of a variety of cancers (Das et al., 2019). However, notably few lncRNAs have been functionally characterized and experimentally validated in PCa. In this study, the lncRNAs RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1 were found to be upregulated in FNA biopsies of PCa. Several members of the lncRNA RP11 family are related to malignancies, including glioblastoma, renal cell carcinoma, and colorectal cancer. The lncRNA RP11-838N2.4 enhances the cytotoxic effects of temozolomide by inhibiting the functions of miR-10a in glioblastoma cell lines (Liu et al., 2016). The lncRNA RP11-436H11.5 functions as a ceRNA to upregulate BCL-W expression by sponging miR-335-5p, thereby promoting proliferation and invasion in renal cell carcinoma (Wang et al., 2017a). The downregulation of long noncoding RNA RP11-708H21.4 is associated with a poor prognosis in colorectal cancer and promotes tumorigenesis by regulating the AKT/mTOR pathway (Sun et al., 2017). RP11-380D23.2 drives the distal-proximal patterning of the lung by regulating PITX2 expression (Banerjee et al., 2018). The lncRNA LAMTOR5-AS1, which is known as late endosomal/lysosomal adaptor, MAPK and MTOR activator 5 (LAMTOR5) antisense RNA 1, was first shown to be associated with PCa in this report. Subsequently, we assessed the diagnostic effectiveness of differential lncRNAs in differentiating between PCa and BPH tissues. When the PSA level was combined with the three differential lncRNAs, the AUC was 0.984, the sensitivity was 97.9%, and the specificity was 84.2%, which are better than the values obtained using PSA only.
We previously demonstrated that different levels of two circRNAs (circ\_0057558 and circ\_0062019) and four DEGs (ITGBL1, TGM4, KRT15, and HOXA7) could help to distinguish PCa patients from non-PCa patients (Shan et al., 2017; Xia et al., 2018); thus, we proposed that combining these biomarkers might improve the diagnostic efficiency of PCa. We demonstrated that when the expression of the two circRNAs (circ\_0057558 and circ\_0062019) or the four differentially expressed genes (DEGs) (ITGBL1, TGM4, KRT15, and HOXA7) was considered along with the three differentially expressed lncRNAs (DelncRNAs), the AUC was 0.935 (Figure S3A) and 0.968 (Figure S3B), the sensitivity was 85.0% and 93.8%, and the specificity was 89.2% and 92.7%, respectively. We also attempted to include only one gene (ITGBL1) and one circRNA (circ\_0062019), which were the best biomarkers for the diagnosis of PCa in our previous publications, and found that when the expression of ITGBL1 and circ\_0062019 was considered along with the three DelncRNAs, the AUC was 0.957 (Figure S3C), the sensitivity was 93.3%, and the specificity was 92.3% (Table S2), which were significantly improved compared with the three lncRNAs alone. We also demonstrated that the lncRNA LAMTOR5-AS1 is positively correlated with the PSA level in patients and is more closely related to less aggressive PCa than to aggressive PCa, indicating that LAMTOR5-AS1 may be useful in the early diagnosis of PCa and that these differentially expressed lncRNAs might be novel biomarkers of PCa.

We further performed a regulatory network analysis of the differentially expressed lncRNAs and predicted that miR-7, miR-24, and miR-30 were target miRNAs of lncRNAs RP11-33A14.1 and RP11-423H2.3. Among these miRNAs, two miRNAs (miR-7 and miR-30d) were upregulated (Figure 4A), but four miRNAs (miR-24, miR-30a, miR-30c, and miR-30e) were repressed in the prostate cancer tumor tissue (Figure 4B) in the two GEO datasets. To determine whether target miRNA expression was affected by these DelncRNAs, we knocked down RP11-423H2.3 and LAMTOR5-AS1 in PCa cells. Our results revealed that knockdown of RP11-423H2.3 reduced the expression levels of miR-24-3p, miR-30a, miR-30d, and miR-30e and upregulated miR-7-1-3p in both PC3 and DU145 cells (Figures 5A–C). We also found that when LAMTOR5-AS1 was knocked down (Figure 5D), miR-942-5p and miR-542-3p were repressed in PC3 cells (Figure 5E) but upregulated in DU145 cells (Figure 5F). In keeping with the ceRNA regulatory mechanism, lncRNAs can function as molecular decoys or sponges of microRNAs (Salmena et al., 2011), which might explain the increased expression of miR-7-1-3p following knockdown of RP11-423H2.3. On the other hand, some lncRNAs can also be processed to generate miRNAs or activate miRNA expression (Yoon et al., 2014), so that several miRNAs were deregulated after knockdown of RP11-423H2.3 or LAMTOR5-AS1, which supported the effects of RP11-423H2.3 on miR-7/miR-24/miR-30 or LAMTOR5-AS1 on miR-942-5p/miR-542-3p via direct interaction. miR-7 can inhibit the stemness of prostate cancer stem-like cells and tumorigenesis by repressing the KLF4/PI3K/Akt/p21 pathway (Chang et al., 2015). miR-24 acts as a tumor suppressor in PCa and was repressed in prostate cancer cell lines and tumor tissue, which was correlated with high PSA serum levels and related to prostate cancer progression (Lynch et al., 2016).
miR-30 was also downregulated in prostate cancer cells compared to that in the prostate immortalized normal epithelial-derived cell line RWPE-1, which may be associated with tumor suppressor functions in prostate cancer (Kao et al., 2014), and miR-30 has been identified as a direct regulator of androgen receptor signaling in prostate cancer by complementary functional microRNA library screening (Kumar et al., 2016). miR-30a-5p and miR-30b were not only found to be lower in PCa tumors than in benign tissues but also significantly increased when VCaP and PC3 cells were treated with saracatinib and PP2. However, miR-30c was different (Kao et al., 2014). miR-30b-3p and miR-30d-5p can be direct regulators of androgen receptor signaling in prostate cancer, and inhibition of miR-30b-3p and miR-30d-5p can increase androgen receptor (AR) expression and promote androgen-independent cell growth (Kumar et al., 2016). Finally, we determined that the lncRNAs RP11-423H2.3 and LAMTOR5-AS1 shared common RBPs, including eIF4AIII, U2AF65, and UPF1. Some lncRNAs can recruit regulatory compounds and affect gene expression by interacting with RBPs (Jia et al., 2017). The lncRNA MEG3 interacts with the RBP polypyrimidine tract-binding protein 1 (PTBP1) and induces cholestatic liver injury (Zhang et al., 2017). LncRNAs might affect the expression level of neighboring genes by a cis-regulated function. We found that all three lncRNAs, i.e., RP11-33A14.1, RP11-423H2.3, and LAMTOR5-AS1, interacted with FUS, while the loss of FUS expression may contribute to cancer progression (Brooke et al., 2011). The DNA and RNA helicase UPF1 plays key roles in nonsense-mediated RNA decay (NMD), which can selectively degrade aberrant RNA transcripts (Azzalin and Lingner, 2006). FUS is a multifunctional protein that participates in many RNA metabolism pathways, and mutant FUS suppressed protein biosynthesis and disrupted NMD regulation (Kamelgarn et al., 2018). FUS expression was also inversely correlated with the Gleason grade of prostate cancer (Ghanbarpanah et al., 2018). We demonstrated that FUS and UPF1 were deregulated in both PC3 and DU145 cells following knockdown of RP11-423H2.3 or LAMTOR5-AS1 (Figure S4), which implied that the interactions of the RBPs FUS and UPF1 with lncRNA RP11-423H2.3 or LAMTOR5-AS1 might affect prostate cancer progression.
Deregulation of the RNA-binding protein fused in sarcoma/translocated in liposarcoma (FUS/TLS) in breast cancer cells by interacting with the lncRNA nuclear paraspeckle assembly transcript 1 (NEAT1) and miR-548ar could induce cell apoptosis (Wang et al., 2016). As FUS is a member of the TET protein family, this protein was found to be inversely regulated by miR-141 in human neuroblastoma (Wang et al., 2016) and can be activated by lncRNA XIST, which also served as a ceRNA in cervical cancer progression while competitively binding with miR-200a (Zhu et al., 2018). FUS promoted conditions that favored cell-cycle arrest by reducing proliferator factors and was a key link between androgen receptor signaling and the progression of the cell cycle in prostate cancer (Brooke et al., 2011; Ghanbarpanah et al., 2018).

# CONCLUSIONS

While we continue to search for smarter and more reliable, precise, and cost-effective screening methods, we continue to advocate shared decision-making in prostate cancer screening to serve our patients' best interests. The differentially expressed lncRNAs and their specific regulatory networks may serve as potential biomarkers for the clinical diagnosis and treatment of PCa, which could guide decisions regarding whom to biopsy and whom to re-biopsy after an initial negative biopsy with continued suspicion of PCa and might support an individual oncological approach in the future.

# DATA AVAILABILITY STATEMENT

The prostate cancer microarray datasets were deposited in the Gene Expression Omnibus (GEO) database under accession number GSE140927.

# ETHICS STATEMENT

The studies involving human participants were reviewed and approved by the Ethics Committee of Zhongshan Hospital Affiliated with Fudan University and Shanghai Public Health Clinical Center. Written informed consent was obtained from all patients for the use of their tissue samples and clinical records.

# AUTHOR CONTRIBUTIONS

JW and JZ planned overall concepts and designed the experiments. ZL, QX, XH, ZC, and DY performed the experiments. QX, ZL, HK and JW interpreted the data. XZ, TZ, JB and JX supported the study. ZL participated in drafting the manuscript. JW wrote and revised the manuscript.

# ACKNOWLEDGMENTS

This research was supported by a grant from the National Natural Science Foundation of China (81672383, 81372318), a grant (2018ZX10302103-003) from the National Special Research Program of China for Important Infectious Diseases, China, and a grant (PWRL2017-07) supported by Pudong New District Commission of Health and Family Planning Leading Talent Program, Shanghai, China. The authors also want to thank Ms. Xiaoxiao Sun (Sinotech Genomics Co., Ltd., Shanghai, China) for microarray data analysis of our manuscript.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00062/full#supplementary-material

#### REFERENCES


Karagkouni, D., et al. (2018). DIANA-TarBase v8: a decade-long collection of experimentally supported miRNA-gene interactions. Nucleic Acids Res. 46, D239–D245. doi: 10.1093/nar/gkx1141


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Li, Zheng, Xia, He, Bao, Chen, Katayama, Yu, Zhang, Xu, Zhu and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computational Methods for the Integrative Analysis of Genomics and Pharmacological Data

Jimmy Caroli, Martina Dori and Silvio Bicciato\*

*Department of Life Sciences, University of Modena and Reggio Emilia, Modena, Italy*

Since the pioneering NCI-60 panel of the late 1980s, several major screenings of genetic profiling and drug testing in cancer cell lines have been conducted to investigate how genetic backgrounds and transcriptional patterns shape cancer's response to therapy and to identify disease-specific genes associated with drug response. Historically, pharmacogenomics screenings have been largely heterogeneous in terms of investigated cell lines, assay technologies, number of compounds, type and quality of genomic data, and methods for their computational analysis. The analysis of this enormous and heterogeneous amount of data required the development of computational methods for the integration of genomic profiles with drug responses across multiple screenings. Here, we review the computational tools that have been developed to integrate cancer cell lines' genomic profiles and sensitivity to small molecule perturbations obtained from different screenings.

#### Edited by:

*Davide Risso, University of Padova, Italy*

#### Reviewed by:

*Nehme El-Hachem, McGill University, Canada Jun Zhong, National Cancer Institute (NCI), United States*

> \*Correspondence: *Silvio Bicciato silvio.bicciato@unimore.it*

#### Specialty section:

*This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology*

Received: *05 December 2019* Accepted: *03 February 2020* Published: *27 February 2020*

#### Citation:

*Caroli J, Dori M and Bicciato S (2020) Computational Methods for the Integrative Analysis of Genomics and Pharmacological Data. Front. Oncol. 10:185. doi: 10.3389/fonc.2020.00185*

Keywords: genomics, pharmacogenomics, integration, bioinformatics, online databases

# INTRODUCTION

Clinical responses to cancer treatment are strongly influenced by the patient's genomic landscape, pushing modern therapeutics toward a more personalized approach (1). To this end, despite their inability to reflect many aspects of a drug's behavior in the human body, cancer cell lines have been the most widely used models to explore the molecular basis of drug activity. Indeed, since the NCI-60 project, several major screenings that unite genetic profiling and drug testing have been conducted to investigate how genomic portraits can shape cancer response to therapy. These efforts required the definition of integrated frameworks that, leveraging high-throughput technologies and computational methods, addressed the identification of genomic factors of cancer vulnerability associated with drug sensitivity. The NCI-60 project (https://dtp.cancer.gov/discovery\_development/nci-60/) was the first extensive screening of a massive number of chemical compounds (>50,000) on a well-defined set of cancer cell lines (60 across nine different tumor tissues) (2, 3). Building on the NCI-60 approach, several other projects investigated the interplay between genomic backgrounds and responses to drug treatment in cancer cell lines (**Figure 1A**). Cancer cell line screenings basically adopt one of two approaches. In the first strategy, the molecular profiles of untreated cells and their response to various compounds are investigated in parallel to assess or predict how the molecular portraits determine intrinsic cell sensitivity and resistance to drugs or potential drugs. In the second, cell lines are profiled both before and after treatment to assess how their expression profiles respond to perturbation by the various agents tested. In particular, the Cancer Cell Line Encyclopedia (CCLE, https://portals.broadinstitute.org/ccle) project fully characterized the molecular profiles of more than 1,000 untreated cancer cell lines along with their response to a panel of 24 Food and Drug Administration (FDA)-approved drugs (4–6). Similarly, the Genomics of Drug Sensitivity in Cancer (GDSC, https://www.cancerrxgene.org) and the Cancer Therapeutics Response Portal (CTRP, http://portals.broadinstitute.org/ctrp/) linked genomic features of more than 800 cancer cell lines to their sensitivity to hundreds of chemical compounds comprising FDA-approved drugs, clinical candidates, and small molecules (7–11). Conversely, the Connectivity Map (CMap) and its recent development, L1000 (CLUE, https://clue.io), profiled cancer cell lines before and after treatment with several chemical compounds and genomic perturbagens, retrieving gene signatures directly associated with their administration (12–14). Although these screenings share a similar experimental pipeline, most of the produced data are heterogeneous and lack concordance in terms of investigated cell lines, tested compounds, and genomic information. In this review, we describe some computational tools for the integrative analysis of data from different pharmacogenomics resources.

## INTEGRATIVE ANALYSIS OF GENOMICS AND PHARMACOLOGICAL DATA

Inspired by the NCI-60 project, several collaborative efforts scaled up the number of cancer cell lines investigated in pharmacogenomics studies from the original 60 to more than 1,400, planning to reach over 10,000 publicly available cancer models in the near future (15). The massive amount of genomic and drug response data generated by these screenings is commonly collected in databases that, through dedicated web portals, provide direct insights into potential interactions between the analyzed cancer cell lines and the tested drugs. These databases are commonly equipped with computational resources specifically designed for the navigation and analysis of pharmacogenomics data, such as GDSCTools (16), CellMiner (17), Enrichr (18), L1000 Viewer (19), PharmacoGx, and PharmacoDB (20, 21), and the recently deployed RING (22). However, most of these tools are database specific and have limited capabilities in integrating data obtained from different screenings. This limitation is mostly due to the heterogeneity of data provided by the various studies, with drug tests not standardized across projects and genomic profiling not always available for the entire panel of cell lines. In addition, data are often unbalanced, with experiments comprising a high number of cell lines screened on few drugs (e.g., CCLE and GDSC) and, vice versa, screenings of large pools of chemical compounds performed on small cohorts of cancer cell lines (as in the NCI-60). Finally, while genomic data are rather homogeneous and can be easily integrated across studies after removing batch effects, pharmacological data derived from distinct experimental designs must be kept separate as they are profoundly different in terms of analytical assays, tested drug concentrations, and retrieved inhibitory potential (23, 24).
Despite these intrinsic limitations, several approaches have been proposed for the integrative analysis of genomics and pharmacological data collected from different screenings (**Figure 1B**). In particular, CellMinerCDB combines genomic profiles from NCI-60, CCLE, GDSC, and CTRP with the pharmacological data provided by the NCI-60 screening (25); the Genomics and Drugs integrated Analysis portal (GDA) integrates pharmacological data derived from the NCI-60 with the genomic information of NCI-60 and CCLE (26); and the CMap enables the investigation of the L1000 data through the correlation of gene lists and transcriptional signatures modulated by the drug treatment (12, 14, 27).

## CellMinerCDB: Integrative Cross-Database Genomics and Pharmacogenomics Analyses

CellMinerCDB (https://discover.nci.nih.gov/cellminercdb/) expands the analysis power of CellMiner, the original NCI-60 analysis tool, by integrating the cancer cell line data from the Sanger/Massachusetts General Hospital GDSC, the Broad/Novartis CCLE, and the Broad CTRP (25, 28). The integrated database comprises the molecular profiles of almost 1,400 different cancer cell lines, together with drug activity for more than 20,000 compounds. The guiding element, used to link pharmacological information to genomic data from different sources, is the set of cancer cell lines shared between the NCI-60 and the other resources, with 55 NCI-60 lines shared with GDSC, 44 with CCLE, and 671 in common between CCLE and GDSC. CellMinerCDB performs correlation analyses to investigate and visualize relationships between the drug activity of a compound and the specific profile of a selected molecular feature across all the available cell lines (univariate analysis). In addition, linear regression methods are implemented for the integrative analysis of multiple identifiers (multivariate analysis). The confidence of the associations is assessed by statistical analyses conducted through a basic linear regression model or using the least absolute shrinkage and selection operator (LASSO). An interesting feature of CellMinerCDB is the possibility to compare patterns associated with either drug activity or molecular data via the Compare Pattern function of the univariate analysis search. This analysis allows the identification of genomic determinants of drug response, as exemplified by the connection found between the expression of Schlafen 11 (SLFN11) and the response to several DNA-targeted anticancer drugs such as platinum derivatives, topoisomerase inhibitors, and poly (ADP-ribose) polymerase (PARP) inhibitors (25).
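The univariate analysis described above can be illustrated with a minimal sketch. It assumes that drug activity and a molecular feature are each available as a mapping from cell line name to value, and it uses a plain Pearson correlation over the shared cell lines; the function names, cell line labels, and numbers are hypothetical and are not taken from CellMinerCDB itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def univariate_association(drug_activity, feature):
    """Correlate drug activity with one molecular feature.

    Both arguments map cell line name -> value; only cell lines present
    in both mappings are used (an assumption of this sketch).
    """
    shared = sorted(set(drug_activity) & set(feature))
    r = pearson([drug_activity[c] for c in shared],
                [feature[c] for c in shared])
    return r, shared


# Hypothetical values for illustration only (not real measurements).
activity = {"LINE_A": 0.2, "LINE_B": 1.1, "LINE_C": 1.9, "LINE_D": 3.1}
expression = {"LINE_A": 1.0, "LINE_B": 2.0, "LINE_C": 3.0,
              "LINE_D": 4.0, "LINE_E": 9.0}

r, shared = univariate_association(activity, expression)
print(f"r = {r:.3f} over {len(shared)} shared cell lines")
```

Restricting the calculation to the shared cell lines mirrors the role that common cell lines play as the linking element between resources; the multivariate (linear regression or LASSO) analysis is not covered by this sketch.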

## Genomics and Drugs Integrated Analysis

GDA (gda.unimore.it/) is a web-based tool designed for the integrative analysis of drug response, mutations, and gene expression profiles derived from the NCI-60 consortium and the CCLE (26, 29). GDA comprises 73 cancer cell lines shared by NCI-60 and CCLE and treated with 50,816 compounds, and integrates the drug response data from the NCI-60 screening with the mutations and genomic information derived from both CCLE and NCI-60. GDA allows four different types of analyses, namely, from drug to gene, from gene to drug, from signature to drug, and from drug to signature. Pharmacological and genomic data can be queried to identify drugs correlated to gene mutations (from gene to drug), gene mutations associated with drug responses (from drug to gene), and drugs associated with active gene signatures (from signature to drug). Starting from a drug correlated to gene mutations, gene expression profiles can be used to identify genes differentially expressed in cell lines sensitive to the selected compound. The statistics behind GDA are based on drug response data. Basically, each pair of cell line and drug is defined as responsive if the relative sensitivity is smaller than two standard deviations of the left tail of the distribution of all relative sensitivities, and as non-responsive otherwise. Based on genomic data, cell lines are classified as mutant if treated with the compound and carrying the selected set of mutations and as wild type if treated with the compound but without the specific set of mutations. Given these classifications, compounds are ranked using a score defined as the fraction of responsive cell lines among mutants multiplied by the fraction of non-responsive cell lines among wild types. This score ranks each drug based on the enrichment of responsive cell lines in the mutant group. The statistical significance of this ranking is computed using a one-tailed Fisher's exact test for the enrichment of responsive in mutant as compared to non-responsive in wild type, given the number of non-responsive in mutant and responsive in wild type. Results are accessible through interactive graphical representations and tables and can be directly fed to external tools such as Enrichr for functional annotation (18). When used to identify compounds able to inhibit the proliferation potential of cancer cell lines with aberrant nuclear YAP/TAZ activation, GDA retrieved imatinib analogs and statins as potentially active drugs. Following GDA indications, in vitro studies demonstrated that the combination of statins with dasatinib, an imatinib analog, enhances YAP/TAZ nuclear exclusion, is able to block YAP/TAZ transcriptional activity, and is much more active in inducing apoptosis in different tissues (29).
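The ranking score and the one-tailed Fisher's exact test described for GDA can be sketched as follows. The sketch assumes that each cell line has already been labeled as responsive or non-responsive and as mutant or wild type, so only the four counts of the resulting 2×2 table are needed; the function names and the counts in the example are hypothetical, and the code is an illustration of the stated definitions rather than the GDA implementation.

```python
from math import comb


def gda_score(resp_mut, nonresp_mut, resp_wt, nonresp_wt):
    """Fraction of responsive among mutants times fraction of
    non-responsive among wild types."""
    n_mut = resp_mut + nonresp_mut
    n_wt = resp_wt + nonresp_wt
    if n_mut == 0 or n_wt == 0:
        return 0.0
    return (resp_mut / n_mut) * (nonresp_wt / n_wt)


def fisher_one_tailed(resp_mut, nonresp_mut, resp_wt, nonresp_wt):
    """One-tailed Fisher's exact test: probability of observing at least
    `resp_mut` responsive cell lines in the mutant group, with all table
    margins fixed (hypergeometric upper tail)."""
    n_mut = resp_mut + nonresp_mut
    n_resp = resp_mut + resp_wt
    total = n_mut + resp_wt + nonresp_wt
    upper = min(n_mut, n_resp)
    tail = sum(comb(n_resp, k) * comb(total - n_resp, n_mut - k)
               for k in range(resp_mut, upper + 1))
    return tail / comb(total, n_mut)


# Hypothetical counts for one compound over 73 cell lines:
# 8 mutants (6 responsive) and 65 wild types (5 responsive).
score = gda_score(6, 2, 5, 60)
p_value = fisher_one_tailed(6, 2, 5, 60)
print(f"score = {score:.3f}, one-tailed p = {p_value:.2e}")
```

The score is highest when responsive cell lines are concentrated in the mutant group and wild-type cell lines are mostly non-responsive, which is the pattern the Fisher's test then evaluates for significance.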

## Connectivity Map and the CMap Linked User Environment

CMap (https://www.broadinstitute.org/connectivity-map-cmap) was one of the first computational resources developed for the investigation of connections between transcriptomics and drug-induced perturbations (12). As extensively reviewed in Musa et al. (30), the goal of CMap is to identify drug- or disease-associated gene signatures correlating with transcriptomic changes induced by the administration of drugs or chemical compounds (31, 32). The original project comprised the gene expression profiling of three cancer cell lines before and after treatment with 164 different small molecules, obtaining drug-associated gene signatures for each cell line. This initial version has been recently scaled up through the L1000 Assay Platform, a method to analyze the expression levels of 978 selected landmark transcripts (assayed with 1,058 probes, including 80 controls) that have been shown to be sufficient to recover more than 80% of the information relative to the full transcriptome (14). This new approach translated into the screening of 86 different cancer cell lines using 27,927 unique perturbagens, including 19,811 small molecules and 7,494 genetic perturbations (consisting of overexpression or knockdown of different genes associated with human diseases or biological pathways). This large-scale screening finally resulted in a collection of 476,251 gene expression signatures that can be analyzed through the CMap Linked User Environment (CLUE, https://clue.io). In CLUE, the Query tool allows users to input a gene signature (i.e., a list of upregulated and downregulated genes) and search for perturbagens (chemical and/or genetic) that induce a similar (or opposite) expression profile in the treated cells. The statistical significance of the association is assessed through a connectivity score that takes into account the strength of the similarity between the query and the induced signature as compared to the enrichment of all other signatures in the database (14).
This approach proved its efficacy in the identification of a novel inhibitor for the serine-threonine kinase CSNK1A1, an enzyme essential in specific subtypes of myelodysplastic syndrome and acute myeloid leukemia. Starting from the loss-of-function signature of CSNK1A1, the authors searched CMap for compounds mimicking the loss of this kinase and identified one compound (BRD-1868) with a high connectivity score relative to this signature. Further enzymatic assays confirmed both the binding between BRD-1868 and CSNK1A1 and its inhibitory effect on enzymatic activity (14). Since its first publication, CLUE has been expanded to also include proteomics analyses ranging from expression arrays to histone modification signatures.
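To make the idea of a signature query more concrete, the following sketch scores a query (lists of up- and downregulated genes) against one perturbagen profile using an unweighted Kolmogorov-Smirnov-style enrichment statistic, in the spirit of the original CMap method. This is a simplifying assumption: it is not the weighted and database-normalized connectivity score used by CLUE, and the gene labels and function names are hypothetical.

```python
def ks_enrichment(ranked_genes, gene_set):
    """Signed KS-style enrichment of `gene_set` in `ranked_genes`.

    `ranked_genes` is ordered from most upregulated to most downregulated
    after treatment. Positive values mean the set sits near the top of
    the ranking, negative values near the bottom.
    """
    n = len(ranked_genes)
    positions = sorted(i + 1 for i, g in enumerate(ranked_genes)
                       if g in gene_set)
    t = len(positions)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - v / n for j, v in enumerate(positions))
    b = max(v / n - j / t for j, v in enumerate(positions))
    return a if a > b else -b


def connectivity(ranked_genes, up_genes, down_genes):
    """Positive when the perturbagen mimics the query, negative when it
    reverses it, zero when up and down sets are enriched on the same side
    (or when either set has no enrichment)."""
    ks_up = ks_enrichment(ranked_genes, set(up_genes))
    ks_down = ks_enrichment(ranked_genes, set(down_genes))
    if ks_up * ks_down >= 0:
        return 0.0
    return ks_up - ks_down


# Hypothetical perturbagen profile: G1 is the most upregulated gene,
# G10 the most downregulated one.
profile = [f"G{i}" for i in range(1, 11)]

mimic = connectivity(profile, up_genes=["G1", "G2"], down_genes=["G9", "G10"])
reverse = connectivity(profile, up_genes=["G9", "G10"], down_genes=["G1", "G2"])
print(f"mimicking query: {mimic:+.2f}, reversing query: {reverse:+.2f}")
```

In a database-wide search, a score of this kind would be computed for every stored signature and then compared against the distribution of scores obtained for all other signatures, which is the step that this sketch leaves out.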

# CONCLUDING REMARKS

Efforts to decipher the molecular mechanisms of cancer stimulated scientists to explore the interconnection between the genomic landscape of cancer models and their response to drug treatments. This resulted in large pharmacogenomics screenings that, with the advent of high-throughput technologies, generated large amounts of genomics and pharmacological data. However, the integration of this valuable information is still challenging due to the variable type and number of drugs and cancer cell lines that have been screened by the various projects and the heterogeneous assays used for drug testing in the different studies (23, 24, 33–35). Despite these intrinsic difficulties, several computational approaches have been developed for the integrative analysis of genomics and pharmacological data. Their application allowed the discovery of several new connections between drug sensitivity and genomic backgrounds, enabling the potential repurposing of commercially available drugs for cancer treatment (36–38). However, these computational resources, although proven effective, still suffer from the limitations of the original studies, such as the sparsity of the drug and cell interaction matrices, the effective impossibility of merging drug response data across different screenings, and the shortcomings of cancer cell lines as a reliable cancer model (39–41). To this end, the project for a Patient-Derived Model Database (PDMB) launched in 2012 by the NCI might represent a potential breakthrough, as genomic and drug response data directly collected from patients and patient-derived xenografts (PDXs) will reproduce the cancer disease and its environment more accurately than any cell line model (42). Furthermore, while novel experimental models are generating more accurate data, advanced computational methods are under development to enhance the analytical potential of existing algorithms.
As recently discussed (43–45), artificial intelligence approaches such as network-based models, deep-learning frameworks, and machine-learning techniques are increasingly applied to investigate pharmacogenomics connections and drug repositioning. These methods can be effective not only for data integration but also for predicting new interactions and applications of already approved drugs (46–48). In summary, computational approaches for the integration of genomic and pharmacological data have the potential to become crucial for the systematic identification of new biomarkers of drug sensitivity and the discovery of novel anticancer drugs on the basis of specific genetic abnormalities, as long as reliable cellular models and highly curated data become available.

# AUTHOR CONTRIBUTIONS

JC and SB conceived the project. JC, MD, and SB wrote and revised the manuscript.

# FUNDING

This work was supported by funds from the Italian Association for Cancer Research (AIRC) Special Program Molecular Clinical Oncology 5 per mille (grant no. 10016) and from the Italian Epigenomics Flagship Project (Epigen) of the Italian Ministry of Education, University and Research.

# REFERENCES


drug screening studies. Nucleic Acids Res. (2018) 46:D994–1002. doi: 10.1093/nar/gkx911


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Caroli, Dori and Bicciato. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genomic and Transcriptomic Landscape of Tumor Clonal Evolution in Cholangiocarcinoma

Geng Chen<sup>1</sup> , Zhixiong Cai<sup>1</sup> , Xiuqing Dong<sup>2</sup> , Jing Zhao<sup>1</sup> , Song Lin<sup>1</sup> , Xi Hu<sup>1</sup> , Fang-E Liu<sup>3</sup> , Xiaolong Liu<sup>2</sup> and Huqing Zhang<sup>1</sup> \*

<sup>1</sup> School of Life Sciences and Technology, Xi'an Jiaotong University, Xi'an, China, <sup>2</sup> The United Innovation of Mengchao Hepatobiliary Technology Key Laboratory of Fujian Province, Mengchao Hepatobiliary Hospital of Fujian Medical University, Fuzhou, China, <sup>3</sup> Department of Nursing, School of Medicine, Xi'an Peihua University, Xi'an, China

Cholangiocarcinoma remains a severe threat to human health. Deciphering the genomic and/or transcriptomic profiles of tumors has proved to be a promising strategy for exploring the mechanisms of tumorigenesis and development, and could also provide valuable insights into cholangiocarcinoma. However, little is known about how alterations at different omics levels are connected. Here, using whole exome sequencing and transcriptome sequencing, we performed a thorough evaluation of the genomic and transcriptomic landscape of cholangiocarcinoma and illustrated the alterations of the tumor at different biological levels. We also identified the clonal structure of each included tumor sample and discovered different clonal evolution patterns related to patients' survival. Furthermore, we extracted subnetworks that were greatly influenced by tumor clonal/subclonal mutations or transcriptome changes. The topological relationship between genes affected by genomic/transcriptomic changes in biological interaction networks revealed that alterations of the genome and transcriptome were highly correlated, and that somatic mutations located in important genes might affect the expression of numerous genes in close range.

Keywords: cholangiocarcinoma, clonal evolution, sequencing, transcriptome, genome

#### INTRODUCTION

Cholangiocarcinoma (CCA), a heterogeneous malignant tumor currently acknowledged as the second most common primary liver cancer, has shown increasing incidence worldwide during the past decades. Although CCA is considered a rare cancer in most countries due to its relatively low incidence (lower than 6 cases per 100,000 people), the situation is different in several countries including China and Thailand, where CCA incidence reaches an exceptionally high level. Among all CCA cases, intrahepatic cholangiocarcinoma accounts for only 10%, and only a minority (15%) of these patients are diagnosed with resectable disease (Cardinale et al., 2018; Rizvi et al., 2018). While the most promising therapeutic strategy for CCA is surgery combined with chemo-/radiotherapy, this approach is considered suitable only for early-stage CCA, and later-stage CCA patients often lack effective treatment options. Thus, most CCA patients suffer from poor prognosis (5-year survival rate less than 10%). Meanwhile,

#### Edited by:

Enrica Calura, University of Padova, Italy

#### Reviewed by:

Xueqiu Lin, Stanford University, United States Jun Zhong, National Cancer Institute (NCI), United States

\*Correspondence:

Huqing Zhang huqzhang@mail.xjtu.edu.cn

#### Specialty section:

This article was submitted to Cancer Genetics, a section of the journal Frontiers in Genetics

Received: 11 June 2019 Accepted: 19 February 2020 Published: 13 March 2020

#### Citation:

Chen G, Cai Z, Dong X, Zhao J, Lin S, Hu X, Liu F-E, Liu X and Zhang H (2020) Genomic and Transcriptomic Landscape of Tumor Clonal Evolution in Cholangiocarcinoma. Front. Genet. 11:195. doi: 10.3389/fgene.2020.00195


the heterogeneity of tumors at multiple levels (e.g., genomic, transcriptional) often results in resistance to therapy, which further intensifies the challenge of CCA treatment. Thus, a thorough evaluation of the genomic and transcriptomic landscape of CCA could provide clinically relevant insights into the genesis and progression of CCA.

Like other tumors, CCA develops through the acquisition of somatic mutations and clonal evolution. As a tumor arises and progresses, somatic mutations are acquired randomly, resulting in different groups of tumor cells with distinct genetic features. The tumor clone, made up of a complex mixture of groups of tumor cells (which can be referred to as subclones), evolves during its development, dynamically changing its structure to better fit the micro-environment (Greaves and Maley, 2012; McGranahan and Swanton, 2017). During this entire evolutionary process, certain somatic mutations can give tumor cells a survival advantage, so that subpopulations carrying these genomic alterations expand, while subclones with mutations that reduce survival capacity diminish. Thus, deciphering the clonal evolution in CCA could provide valuable information regarding crucial genetic events in tumorigenesis and progression and how different biological pathways might be affected by these genetic events, which in turn could help further understand the intrinsic mechanisms of tumor progression. Indeed, such efforts have been made in other types of cancer including leukemia (Ferrando and López-Otín, 2017) and solid tumors such as hepatocellular carcinoma (Chen et al., 2018) and breast cancer (Hoadley et al., 2016), and different clonal evolution patterns have been found to be highly correlated with patients' clinical course.

However, the evolutionary process in CCA still requires further investigation. Moreover, although the importance of clonal evolution is widely acknowledged, how tumor clonal structure affects the tumor transcriptome remains poorly explored. Understanding how somatic mutations interact with such transcriptome changes could provide further valuable insights into the evolutionary mechanism of CCA development. To explore the genetic and transcriptional landscape of intrahepatic CCA, we performed whole exome sequencing and transcriptome sequencing on tumor and corresponding peritumor tissue of 9 CCA patients. The differences at the genetic and transcriptional levels were investigated, and tumor clonal evolution was deciphered to discover the molecular pathways taking part in the deregulation of tumor cells. These findings will be of great value in understanding the mechanism of CCA development and how the transcriptome interacts with genetic alterations.

#### MATERIALS AND METHODS

#### Sample Collection

Tumor and corresponding peritumor tissue samples were collected from 9 patients diagnosed with intrahepatic cholangiocarcinoma during their surgical operation for tumor removal. The detailed clinical information is provided in **Table 1**. All human tissue sample collection procedures and usage of these samples were approved by the Institutional Review Board of Mengchao Hepatobiliary Hospital of Fujian Medical University, and written consent was obtained from all patients included in this study.

TABLE 1 | Clinical characteristics of 9 enrolled CCA patients.

PVTT, Portal vein tumor thrombosis; TNM, The TNM Classification of Malignant Tumors; BCLC, the Barcelona Clinic Liver Cancer staging system.

#### Whole Exome/Transcriptome Sequencing

Whole-exome and transcriptome sequencing were performed on an Illumina HiSeq 3000 system to capture the genetic and transcriptional features of the acquired tumor and corresponding peritumor tissues.

# Whole Exome Sequencing Data Processing


Somatic single nucleotide variants (SNVs) and copy number alterations (CNAs) were detected in the whole exome sequencing data of tumor tissue samples using the corresponding peritumor tissue as control. To identify SNVs, SomaticSniper (version 1.0.5.0) (Larson et al., 2012) was applied using the default parameters provided in the algorithm manual, and only SNVs with somatic score ≥ 40 were accepted for downstream analysis. The identified SNVs were further filtered with the following criteria to rule out possible false discoveries: (1) read depth ≥ 50 in both tumor and peritumor tissues; (2) variant allele frequency ≥ 10% in tumor tissue; (3) variant allele frequency < 10% in normal peritumor tissues. The detected SNVs were then annotated using wANNOVAR to obtain related gene and functional information. For CNAs, TitanCNA (version 1.17.1) (Ha et al., 2014) was applied to the tumor tissue's whole exome sequencing data, with the corresponding peritumor tissue as control, using the workflow script provided with the algorithm.
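The SNV filtering criteria listed above translate directly into a simple predicate. The sketch below assumes that each candidate variant has already been parsed into a record holding its somatic score, read depths, and variant allele frequencies; the record layout, field names, and example values are hypothetical and do not reflect the SomaticSniper output format.

```python
from dataclasses import dataclass


@dataclass
class SNV:
    chrom: str
    pos: int
    somatic_score: int
    tumor_depth: int
    peritumor_depth: int
    tumor_vaf: float       # variant allele frequency in tumor (0-1)
    peritumor_vaf: float   # variant allele frequency in peritumor (0-1)


def passes_filters(v, min_score=40, min_depth=50,
                   min_tumor_vaf=0.10, max_peritumor_vaf=0.10):
    """Apply the score, depth, and allele-frequency thresholds from the text."""
    return (v.somatic_score >= min_score
            and v.tumor_depth >= min_depth
            and v.peritumor_depth >= min_depth
            and v.tumor_vaf >= min_tumor_vaf
            and v.peritumor_vaf < max_peritumor_vaf)


# Hypothetical candidate calls for illustration only.
candidates = [
    SNV("chr1", 1000, 55, 120, 98, 0.32, 0.01),   # passes all criteria
    SNV("chr2", 2000, 35, 150, 140, 0.40, 0.00),  # somatic score too low
    SNV("chr3", 3000, 60, 45, 200, 0.25, 0.02),   # tumor depth too low
    SNV("chr4", 4000, 70, 90, 85, 0.30, 0.12),    # present in peritumor
]

kept = [v for v in candidates if passes_filters(v)]
print(f"{len(kept)} of {len(candidates)} candidate SNVs retained")
```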

# Transcriptome Sequencing Data Processing

All acquired transcriptome sequencing reads were first aligned to ribosomal RNA (rRNA) sequences to remove rRNA reads. The unmapped reads were then aligned to the human reference genome (GRCh37) using STAR with the GENCODE gene annotation. Gene expression was quantified as fragments per kilobase of exon per million mapped fragments (FPKM), and genes with no read counts in > 50% of samples were excluded from downstream analysis. Differentially expressed genes were identified using the limma package. Genes with adjusted p-value < 0.05 (Benjamini-Hochberg correction) and fold-change > 2 or < 0.5 were considered significantly differentially expressed between CCA tumor and peritumor tissues.
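The significance thresholds above can be illustrated with a short sketch that applies a Benjamini-Hochberg adjustment to raw p-values and then the fold-change cutoffs. It only illustrates the thresholding step under the assumption that per-gene fold changes and raw p-values are already available; it does not reproduce limma's moderated statistics, and the gene labels and values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(results, alpha=0.05, fc_up=2.0, fc_down=0.5):
    """`results` maps gene -> (fold_change, raw_p_value).

    Returns a dict gene -> "up" or "down" for genes passing both the
    adjusted p-value and the fold-change thresholds.
    """
    genes = list(results)
    padj = benjamini_hochberg([results[g][1] for g in genes])
    calls = {}
    for gene, q in zip(genes, padj):
        fold_change = results[gene][0]
        if q < alpha and fold_change > fc_up:
            calls[gene] = "up"
        elif q < alpha and fold_change < fc_down:
            calls[gene] = "down"
    return calls


# Hypothetical tumor vs. peritumor results for illustration only.
results = {
    "GENE_A": (4.00, 0.001),
    "GENE_B": (0.25, 0.010),
    "GENE_C": (1.50, 0.030),  # significant p-value, fold change too small
    "GENE_D": (3.00, 0.500),  # large fold change, not significant
}

calls = significant_genes(results)
print(calls)
```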

# Clonal Evolution in CCA

For each CCA tumor sample, inference of subclonal populations was conducted using Sclust (Cun et al., 2018). Sclust provides a copy-number analysis method combined with mutational clustering to accurately determine copy-number states and subclonal populations. In brief, whole exome sequencing data of the paired tumor and peritumor samples were first processed using the command bamprocess to extract the read ratio and SNP information. Then, copy number analysis was conducted with the command cn for each patient, using the obtained read ratio and SNP information together with the SomaticSniper mutation calling results. Finally, mutational clustering was performed using the command cluster, based on the above results, to identify the tumor clonal structure.

# Discovery of Altered Subnetworks Influenced by Somatic Mutations and Transcriptome Change

HotNet2 was applied to discover altered subnetworks in large gene interaction networks. HotNet2 requires two inputs for subnetwork identification: heat scores and an interaction network. For somatic mutations, heat scores were generated based on the mutation distribution across all patients; for the transcriptome, heat scores were generated based on the adjusted p-values produced by the DESeq2 package. The hint+hi2012 and irefindex9 networks provided with HotNet2 were used as the interaction networks for this analysis. The algorithm was run using the parameters recommended by its authors, and the identified subnetworks were visualized using Cytoscape (version 3.4.0) (Shannon et al., 2003).
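The text does not specify exactly how heat scores were derived, so the following is only one plausible sketch. It assumes that the mutation heat of a gene is the fraction of patients carrying a somatic mutation in it, and that the transcriptome heat is the negative base-10 logarithm of the adjusted p-value; both definitions, the function names, the patient labels, and the values are assumptions made for illustration.

```python
from math import log10


def mutation_heat(mutated_genes_by_patient):
    """Heat = fraction of patients with at least one mutation in the gene
    (an assumed definition, not stated in the text)."""
    n_patients = len(mutated_genes_by_patient)
    counts = {}
    for genes in mutated_genes_by_patient.values():
        for gene in set(genes):  # count each gene once per patient
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: c / n_patients for gene, c in counts.items()}


def expression_heat(adjusted_pvalues, floor=1e-300):
    """Heat = -log10(adjusted p-value), with a floor to avoid log(0)
    (an assumed definition, not stated in the text)."""
    return {gene: -log10(max(p, floor))
            for gene, p in adjusted_pvalues.items()}


# Hypothetical inputs for illustration only.
mutations = {
    "PATIENT_1": ["MUC16", "KRT18"],
    "PATIENT_2": ["MUC16"],
    "PATIENT_3": ["MUC4", "MUC16", "MUC16"],
}
padj = {"GENE_A": 0.001, "GENE_B": 0.5}

mut_heat = mutation_heat(mutations)
expr_heat = expression_heat(padj)

# One gene and its score per line, tab-separated.
for gene, score in sorted(mut_heat.items()):
    print(f"{gene}\t{score:.3f}")
```

Either set of scores could then be paired with an interaction network so that genes with high heat and their network neighbors are considered together.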

# RESULTS

### Case Summary

In total, 9 patients who were diagnosed with CCA and underwent surgery at Mengchao Hepatobiliary Hospital were included in this study. Based on previous reports regarding the inflammatory context of liver tumors (Bishayee, 2014; Banales et al., 2016), we chose peritumor tissue as the sequencing control to better capture the CCA characteristics. During surgery, cholangiocarcinoma tumor tissues along with corresponding peritumor tissues were collected, and the presence of tumor was histologically confirmed for all patients. Then, whole-exome and transcriptome sequencing were performed on the acquired tissue samples. Among all included patients, 77.8% (7/9) were diagnosed with TNM stage I-II and the other 22.2% were diagnosed with TNM stage III. The average tumor diameter was 5.1 cm (range, 2.0–9.5 cm), while vascular tumor thrombus was seen in 44.4% (4/9) of all patients. Detailed clinical information for all included patients before surgery is presented in **Table 1**, and the corresponding clinical courses are shown in **Figure 1A**.

# Landscape of CCA Genome and Transcriptome

Whole-exome sequencing achieved a mean depth of 194.67× across all collected tissue samples. To identify tumor somatic mutations, SomaticSniper was applied to all tumor tissue samples using the corresponding peritumor tissue as control. Meanwhile, copy number variation was identified using TitanCNA. An average of 378 somatic SNVs (range, 260–529) were detected in tumor tissues, and the distribution of SNVs across the human genome is visualized in **Figure 1C**. Annotation of the acquired SNVs revealed a number of commonly mutated genes across tumor samples, including several known cancer-related genes (**Figure 1B**). Several mucin genes (MUC16, MUC3A, MUC6, and MUC4) were among the most frequently mutated genes, which is consistent with previous reports (Chang et al., 2006; Pereira et al., 2016; Liu et al., 2018; Pareja et al., 2019). Other noteworthy genes included DSPP, PER3, MTCH2, and KRT18, all of which have been reported to play important roles in tumor formation and development. In addition, a number of copy number variations were identified in tumor samples, showing widespread instability of the cancer genome (**Figure 1D**).

Meanwhile, transcriptome sequencing revealed a significant change at the transcriptional level, with a total of 2366 differentially expressed genes identified between CCA tumor and peritumor

FIGURE 1 | The clinical courses and the genome and transcriptome landscape of CCA. (A) The clinical course of the 9 included CCA patients. RFA, Radiofrequency ablation; TACE, Transarterial chemoembolization. (B) The commonly mutated genes with somatic SNVs identified in the included CCA patients. Different colors indicate the functional type of somatic SNVs in these genes (orange: non-synonymous mutation; light blue: synonymous mutation; gray: not mutated). (C) The genomic distribution of somatic SNVs for the included CCA patients. Each circle represents a single patient. Dots in the dot plot represent identified somatic SNVs and their heights indicate the corresponding variant allele frequencies. (D) The genomic distribution of somatic CNVs for the included CCA patients. Each circle represents a single patient. The scatter plot shows the logR value for each segment, and regions with different colors indicate their copy number status (red: copy number gain; gray: normal; green: copy number loss). (E) Principal component analysis of the CCA transcriptome. The image shows the three-dimensional distribution of each sample on the first three principal components. Red dots represent peritumor samples and black dots represent tumor samples. (F) Clustering of the included tissue samples using the top genes correlated with the first three principal components. Gene names and sample names are provided.

samples. To provide a clear classification based on samples' transcriptional features, principal component analysis was conducted to better characterize these samples. Not surprisingly, tumor samples and peritumor samples were well divided by the first three principal components, which explained 21.96%, 10.60%, and 8.68% of variation in samples' transcriptome, respectively (**Figure 1E**).

The top genes positively associated with PC1 included RBP4, SLC27A5, and PCK2, all of which are known tumor-related genes correlated with cancer patients' survival (Anderson and Stahl, 2013; Leithner et al., 2014, 2015; Balsa-Martinez and Puigserver, 2015). Genes negatively associated with PC1 included FLNA, ARF5, and SLC25A6, suggesting a connection between PC1 and cancer development (Savoy and Ghosh, 2013; Casalou et al., 2016; Shao et al., 2016; Cho et al., 2019). For PC2, the top positively correlated genes included IFITM1 and GPX1, both of which have been reported to be associated with the risk of numerous cancers (Ravn-Haren et al., 2006; Arsova-Sarafinovska et al., 2009; Lee et al., 2012; Ogony et al., 2016), while the most negatively correlated genes included genes commonly over-expressed in tumors, such as EFNA1 (Nakamura et al., 2005; Xiang-Dan et al., 2010).

For PC3, the most noteworthy positively correlated genes were ZFP36 and DUSP1, both of which are known to regulate cancer progression (Montorsi et al., 2016; Nagahashi et al., 2018). Other important correlated genes included CXCL9 and CXCL10, which serve as important regulators of immune activation in the tumor microenvironment (Bronger et al., 2016; Ding et al., 2016; Tokunaga et al., 2018).

Using the top genes correlated with the first three principal components, transcriptome clustering confirmed that tumor and peritumor samples could be well separated (**Figure 1F**), suggesting that CCA tumors have gene expression patterns distinct from those of peritumor tissues.

# Clonal Evolution in CCA

To explore the evolutionary process driving tumorigenesis and development, the Sclust algorithm was applied to infer subclonal populations in the cancer genomes. By combining copy-number analysis with a mutation clustering approach, Sclust can determine copy-number states as well as the cellular prevalence of mutations. As shown in **Figure 2A** and **Supplementary Figure S1**, different types of clonal structure were revealed. For 7 of the included patients (CCA-1218, CCA-1431, CCA-1461, CCA-950, CCA-1429, CCA-1590, and CCA-1600), no subclonal mutations were identified, since all mutations within each sample could be grouped into a single cluster according to their allele frequencies. These results showed that during the tumor clonal evolution of these patients, the

using known immune signatures. (C) GO term enrichment results in biological pathways for each identified subnetwork. Subnetworks 2–4 are the subnetworks altered by clonal mutations, and "subnetwork of subclonal mutation" is the subnetwork altered by subclonal mutations. Subnetwork 1 contained only two genes and did not show significant enrichment in Gene Ontology biological pathways.

randomly accumulated mutations might not create subclones with a significant survival advantage. The other 2 patients (CCA-1141 and CCA-1174), in contrast, presented a considerable portion of subclonal mutations. In patient CCA-1141, two large subclonal mutation clusters were observed, with cellular frequencies of 46.70% and 86.88%, respectively. The other patient, CCA-1174, showed one considerable subclonal mutation cluster, accounting for 63.48% of all tumor cells. The existence of a large number of subclonal mutations might suggest that these tumor subclones emerged at a later stage of tumor development, while their high cellular frequency further indicated that they possessed a notable survival advantage. Surprisingly, these two patients showed better prognostic outcomes than the other patients, with relapse-free survival and overall survival both longer than 20 months. One possible explanation is that in these patients, critical mutations that greatly benefit tumor growth arose late in tumor development (which would explain the expanding tumor subclones), whereas the other tumors acquired such genetic alterations at an early stage, resulting in the difference in prognosis. Evaluation of known immune signatures based on gene expression further revealed that CCA-1141 and CCA-1174 could be categorized as cold tumors, with relatively low levels of cells associated with immune response (**Figure 2B**). This result suggested that the clonal evolution of CCA might be closely related to its immune microenvironment, and that a high level of immune infiltration might suppress the evolutionary process of tumor cells.
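As a rough illustration of what a cellular frequency means, a commonly used approximation converts a variant allele frequency (VAF) into the fraction of tumor cells carrying the mutation, given tumor purity, local tumor copy number, and the number of mutated copies per cell. The sketch below is not Sclust's implementation; it assumes a diploid normal genome, and the function name and example numbers are hypothetical and not taken from this study.

```python
def cancer_cell_fraction(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Approximate fraction of tumor cells carrying a mutation.

    vaf          -- observed variant allele frequency (0..1)
    purity       -- fraction of tumor cells in the sample (0..1]
    tumor_cn     -- total copy number at the locus in tumor cells
    multiplicity -- mutated copies per mutated tumor cell (assumed, default 1)
    normal_cn    -- copy number in normal cells (assumed diploid)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of allele copies per cell in the tumor/normal mixture.
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    ccf = vaf * total_copies / (purity * multiplicity)
    # Sampling noise can push the estimate above 1; cap it for readability.
    return min(ccf, 1.0)


# Hypothetical example: 80% purity, copy-neutral locus.
subclonal = cancer_cell_fraction(vaf=0.2, purity=0.8, tumor_cn=2)
clonal = cancer_cell_fraction(vaf=0.4, purity=0.8, tumor_cn=2)
```

Under these assumptions, a heterozygous mutation present in every tumor cell yields a value near 1, while lower values point to a subclonal mutation; clustering such values across all mutations is what separates clonal from subclonal populations.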

To better understand how tumor clonal evolution affected different biological pathways and processes in tumor cells, we first divided patients' somatic mutations into clonal and subclonal mutations, and then used the HotNet2 algorithm to scan gene interaction networks for subnetworks altered by each category of mutations. For clonal mutations, four subnetworks were identified (**Figures 3A–E**). The first subnetwork contained only 2 core genes: RUNX1T1 and TAL2 (**Figure 3A**). These two genes are both related to

gene transcription, and their dysregulation has been reported to promote tumorigenesis in various cancers. The second subnetwork (**Figure 3B**) contained three core genes (FBLN1, FBLN2, and ZNF8, labeled in red) and six expansion genes (CDC42EP4, EIF2AK4, EXPH5, GIGYF1, VPS8, and ZNF233, labeled in blue). Gene Ontology (GO) term enrichment analysis revealed that this subnetwork is closely related to extracellular matrix structure, cell-substrate adhesion, and cell morphogenesis (**Figure 2C**), suggesting that tumor clonal mutations tend to affect biological pathways related to the cell's interaction with its microenvironment, which is critical for tumor development. The third subnetwork was made up of eight highly interacting genes, namely ATXN1, BCR, GLI1, HTT, LZTR1, SPTBN4, SYNE1, and TP53 (**Figure 3C**). All these genes are known cancer-related genes, including TP53, a well-known driver gene in various cancers. The last and largest subnetwork (**Figure 3D**) included 10 core genes (ALK, DEF6, GRIK2, GRIN2B, HIVEP2, KRT18, LRP2, LRRC7, TIAM1, UBXN11) and 6 expansion genes (KLC2, MYO5B, PTPRE, SETD5, TRMT2A, and ZC3H12A), most of which serve as important components of multiple signaling pathways and are involved in the regulation of cancer cells.

Interestingly, several subnetworks altered by tumor clonal mutations were closely related to major metabolic pathways. This is expected, since abnormal metabolism is a well-known intrinsic characteristic of tumor cells.

We also analyzed the subnetworks affected by tumor subclonal mutations. Given that only two of the nine patients carried subclonal mutations, HotNet2 identified only one subnetwork altered by subclonal mutations (**Figure 3E**). GO analysis revealed that the mutated genes were most relevant to cell adhesion. This result suggested that subclonal mutations benefiting tumor metastasis might confer a survival advantage on the corresponding tumor subclones.

#### Transcriptome Analysis Revealed Alteration in Pathways Enriched in CCA Clonal Evolutionary Process

We next explored the transcriptome landscape to evaluate the change in gene expression during CCA development. Using the limma algorithm, a total of 2366 differentially expressed genes [|log(fold-change)| ≥ 1 and adjusted P < 0.05] were identified in CCA tumor compared with peritumor samples (**Figure 4A**). Among these genes, 1833 were significantly upregulated in CCA and 533 were downregulated. Transcriptome clustering using the top 20 differentially expressed genes also showed an excellent separation between tumor and peritumor samples (**Figure 4B**). GO term enrichment analysis revealed that the upregulated genes (**Figure 4C**) were mostly enriched in the regulation of biological processes (GO:0048519, GO:0048522, and GO:0048523), while the downregulated genes (**Figure 4D**) were mostly enriched in metabolism-related biological processes, including carboxylic acid metabolic process (GO:0019752) and oxoacid metabolic process (GO:0043436).
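The selection thresholds quoted above can be applied to a differential-expression table in a few lines. The following is a minimal sketch rather than the authors' pipeline: the record layout (`gene`, `log_fc`, `adj_p`) and the toy values are hypothetical, and only the two cutoffs come from the text.

```python
from typing import NamedTuple


class DEResult(NamedTuple):
    gene: str      # gene identifier (toy names below)
    log_fc: float  # log fold-change, tumor vs. peritumor
    adj_p: float   # multiple-testing-adjusted P value


def split_degs(results, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (upregulated, downregulated) gene lists passing both cutoffs."""
    up, down = [], []
    for r in results:
        # Keep a gene only if |log fold-change| >= cutoff AND adjusted P < cutoff.
        if r.adj_p >= p_cutoff or abs(r.log_fc) < lfc_cutoff:
            continue
        (up if r.log_fc > 0 else down).append(r.gene)
    return up, down


# Toy table with made-up values, for illustration only.
toy = [
    DEResult("GENE_A", 2.3, 0.001),    # passes, upregulated
    DEResult("GENE_B", -1.4, 0.02),    # passes, downregulated
    DEResult("GENE_C", 0.6, 0.0001),   # fails the fold-change cutoff
    DEResult("GENE_D", -3.0, 0.20),    # fails the adjusted-P cutoff
]
up, down = split_degs(toy)
```

Requiring both conditions at once is what distinguishes this filter from ranking on significance alone: a gene with a very small adjusted P but a modest fold-change is excluded.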

Next, HotNet2 was again applied to identify the subnetworks altered by transcriptome aberration. Surprisingly, genes identified in subnetworks affected by somatic mutations (clonal or subclonal) rarely appeared in subnetworks affected by transcriptome change. However, mapping the genes affected by transcriptome change back to biological interaction networks revealed that many of them were in close proximity to the subnetworks altered by tumor somatic mutations (**Figures 5A–E**). It appeared that tumor genomic alterations created an aberration that spread across the biological interaction network, placing a number of genes under their influence and resulting in a wide-ranging change of the tumor transcriptome. Meanwhile, Gene Ontology enrichment analysis revealed that subnetworks altered by transcriptome change were predominantly enriched in biological processes related to cell division and cell cycle (**Figure 4E**), including cell division (GO:0051301), cell cycle (GO:0004857), protein localization (GO:0008104), and cellular component organization (GO:0016043), indicating that a notable change in proliferation capacity occurred during tumor clonal evolution. Not surprisingly, cell morphogenesis (GO:0000902), cellular localization (GO:0051641), intracellular transport (GO:0046907), and maintenance of protein location in cell (GO:0032507), four biological processes that were significantly enriched in the mutation-affected subnetworks, were also enriched in these transcriptome-change-affected genes.

Furthermore, we found that these multi-omics-altered subnetworks overlapped significantly with pathways in the KEGG database (**Supplementary Figures S2–S21**). Notably, all hot subnetworks overlapped significantly with pathways in cancer (hsa05200), while other enriched pathways included cell cycle (Vermeulen et al., 2003), ECM-receptor interaction (Lu et al., 2012), and the VEGF signaling pathway (Roskoski, 2007), all of which have been reported to be related to tumor progression.

To further investigate whether the altered pathways were clinically relevant, we obtained gene expression profiles from the TCGA-CHOL dataset and used Cox regression analysis to identify potential biomarkers for CCA patients' overall survival. Univariate Cox regression analysis revealed that 14 genes within the hot subnetworks showed expression patterns significantly correlated with patients' overall survival (**Supplementary Figure S22**), including PTN and EGFR, two major players in tumor progression. These genes were then used to build a multivariate Cox regression model using stepwise forward selection. The resulting model consisted of 4 genes (PTPRZ1, CFH, RCN2, and VPS4B), and the corresponding model parameters are summarized in **Supplementary Table S1**. The prognostic value was then calculated from the model score as follows:

$$
\text{Prognostic value} = \frac{e^{\text{score}}}{1 + e^{\text{score}}}
$$

Applying a 50% cutoff on the prognostic value, the TCGA-CHOL dataset could be divided into two risk groups with distinct prognostic patterns (Kaplan-Meier survival analysis, p = 0.00015, **Supplementary Figure S23**).
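The scoring and stratification step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that the model score is the usual Cox linear predictor (a weighted sum of gene expression values), that the prognostic value is the logistic transform of that score, and that the 50% cutoff is a threshold of 0.5 on the prognostic value (a split at the cohort median would be an alternative reading). The coefficients are placeholders, not the fitted values from Supplementary Table S1.

```python
import math

# Placeholder coefficients: NOT the fitted values from Supplementary Table S1.
HYPOTHETICAL_COEFS = {"PTPRZ1": 0.8, "CFH": -0.5, "RCN2": 0.3, "VPS4B": -0.2}


def model_score(expression, coefs=HYPOTHETICAL_COEFS):
    """Linear predictor: sum of coefficient * expression over the model genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def prognostic_value(score):
    """Logistic transform; algebraically equal to e^score / (1 + e^score)."""
    return 1.0 / (1.0 + math.exp(-score))


def value_group(expression, cutoff=0.5):
    """Assign a sample to the 'high' or 'low' prognostic-value group."""
    return "high" if prognostic_value(model_score(expression)) >= cutoff else "low"


# Hypothetical expression profiles for two samples.
sample_1 = {"PTPRZ1": 1.0, "CFH": 0.0, "RCN2": 0.0, "VPS4B": 0.0}
sample_2 = {"PTPRZ1": 0.0, "CFH": 2.0, "RCN2": 0.0, "VPS4B": 0.0}
groups = [value_group(sample_1), value_group(sample_2)]
```

Because the logistic transform is monotonic, a cutoff of 0.5 on the prognostic value is equivalent to a cutoff of 0 on the score itself; the transform only rescales the score to the interval (0, 1).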

Together, these results suggested that alterations of the tumor genome and transcriptome were closely related, and that the influence of driver gene mutations might spread far downstream.

# DISCUSSION

Clonal evolution has proved to be one of the most important concepts in tumor genesis and development. Numerous studies across various tumor types have revealed different clonal evolution patterns during cancer development, providing insight into the underlying evolutionary mechanisms. This knowledge is of great value for prognosis evaluation and treatment selection. In our analysis of 9 cholangiocarcinoma patients, we found that the majority (7/9) of CCA cases did not show detectable subclones within the primary tumors, indicating the existence of a mature clonal structure after tumorigenesis. Interestingly, the other two CCA patients, who had considerable subclones, demonstrated significantly longer RFS and OS than the patients without detectable subclones. These observations suggest that the formation of a stable and lasting clonal structure at an early stage might lead to worse clinical outcomes in CCA. Another intriguing finding is that the expanding subclones were associated with relatively low immune signatures (as shown above), pointing to a close interaction between the tumor and its immune microenvironment. Meanwhile, identification of subnetworks

affected by CCA clonal/subclonal mutations revealed that the influence of clonal mutations spread across a number of different biological pathways, while that of subclonal mutations was concentrated on pathways that benefit tumor metastasis. This result indicated that most mutations conferring a survival advantage were acquired during the early stage of CCA development, and that the acquisition of mutations in key regulator genes could affect how the tumor evolved.

Cancer development involves alteration and dysregulation at multiple biological levels, including the genomic, epigenomic, and transcriptomic levels. Although many studies have been conducted at each single omics level, uncovering a variety of patterns and mechanisms by which these alterations contribute to tumorigenesis, one major question remains largely unanswered: how do the alterations at multiple biological levels interact? In our analysis, we identified key subnetworks that were greatly affected by genomic and transcriptomic changes. Interestingly, although genes in subnetworks affected by genomic change rarely overlapped with those affected by transcriptome alteration, the two groups of genes were in close proximity within biological interaction networks, suggesting that dysregulation of the genome and of the transcriptome were closely related. One possible explanation is that mutated genes serve as sources of disturbance that affect the expression of their neighboring genes. This disturbance could spread further, creating a large-scale change in the tumor transcriptome.

# CONCLUSION

In conclusion, by integrating whole-exome and transcriptome sequencing, our analysis described the landscape of the CCA genome and transcriptome and uncovered different clonal evolution patterns in these patients. We also identified biological pathways significantly altered by tumor somatic mutations and transcriptome change and revealed the connection among alterations at different omics levels, which could provide insight into the mechanism of CCA development and help future prognosis evaluation.

#### DATA AVAILABILITY STATEMENT

The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Wang et al., 2017) in BIG Data Center (Big Data Center Members, 2018), Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under accession number HRA000085, which can be accessed at https://bigd.big.ac.cn/gsa-human.

### ETHICS STATEMENT


The studies involving human participants were reviewed and approved by the Institution Review Board of Mengchao Hepatobiliary Hospital of Fujian Medical University. The patients/participants provided their written informed consent to participate in this study.

# AUTHOR CONTRIBUTIONS

GC, ZC, and HZ contributed the conception and design of the study. GC, JZ, and XH performed the bioinformatic analysis. XD, SL, and ZC performed the sample collection and clinical data collection. HZ, GC, F-EL, and XL interpreted the analysis results. GC and HZ wrote the manuscript. ZC, F-EL, and XL wrote the sections of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

# FUNDING

This work was supported by the National Science and Technology Major Project of China (Grant No. 2018ZX10302205) and National Natural Science Foundation of China (Grant No. 61372151).

### ACKNOWLEDGMENTS

The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00195/full#supplementary-material

## REFERENCES

to glucose depletion in lung cancer. Oncogene 34, 1044–1050. doi: 10.1038/onc.2014.47


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Chen, Cai, Dong, Zhao, Lin, Hu, Liu, Liu and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computational Oncology in the Multi-Omics Era: State of the Art

Guillermo de Anda-Jáuregui 1,2 \* and Enrique Hernández-Lemus 1,3 \*

<sup>1</sup> Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico, <sup>2</sup> Cátedras Conacyt Para Jóvenes Investigadores, National Council on Science and Technology, Mexico City, Mexico, <sup>3</sup> Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico

Cancer is the quintessential complex disease. As technologies evolve faster each day, we are able to quantify the different layers of biological elements that contribute to the emergence and development of malignancies. In this multi-omics context, the use of integrative approaches is mandatory in order to gain further insights into oncological phenomena, and to move forward toward the precision medicine paradigm. In this review, we will focus on computational oncology as an integrative discipline that incorporates knowledge from the mathematical, physical, and computational fields to further the biomedical understanding of cancer. We will discuss the current roles of computation in oncology in the context of multi-omic technologies, which include: data acquisition and processing; data management in the clinical and research settings; classification, diagnosis, and prognosis; and the development of models in the research setting, including their use for therapeutic target identification. We will discuss machine learning and network approaches as two of the most promising emerging paradigms in computational oncology. These approaches provide a foundation on how to integrate different layers of biological description into coherent frameworks that allow advances both in the basic and clinical settings.

#### Edited by:

Francesca Finotello, Innsbruck Medical University, Austria

#### Reviewed by:

Raoul Jean Pierre Bonnal, Istituto Nazionale Genetica Molecolare (INGM), Italy Barbara Di Camillo, University of Padova, Italy Dietmar Rieder, Innsbruck Medical University, Austria

#### \*Correspondence:

Guillermo de Anda-Jáuregui gdeanda@inmegen.edu.mx Enrique Hernández-Lemus ehernandez@inmegen.gob.mx

#### Specialty section:

This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology

Received: 01 December 2019 Accepted: 10 March 2020 Published: 07 April 2020

#### Citation:

de Anda-Jáuregui G and Hernández-Lemus E (2020) Computational Oncology in the Multi-Omics Era: State of the Art. Front. Oncol. 10:423. doi: 10.3389/fonc.2020.00423

Keywords: multi-omics analysis, computational oncology, data integration, cancer complexity, machine learning, network science

# 1. CANCER: THE COMPLEX DISEASE

Cancer is by now widely accepted to be the quintessential complex disease: a proper description of the pathological phenotype can only be achieved by integrating the myriad of interconnected biological elements and their relationships with their environment (1). As a complex system, cancer exhibits features such as emergent patterns, adaptive and collective behaviors, self-organization, non-linear dynamics, and interactions forming complex networks (2). Examples of these can be found in the Hallmarks of Cancer (3, 4), as seen in **Figure 1**.

In a system-wide fashion, every tumor is involved in interactions with non-cancer elements, such as gene-environment interactions (GxE) (5), micro-environmental interactions (6), and those with the immune system (7); in intercellular interactions within the tumor environment (8); and in intracellular interactions, such as transcriptional regulation and gene co-expression (9, 10), signaling (11, 12) and metabolic pathways (13, 14), as well as protein interactions (15). These are exemplified in **Figure 2**. It soon becomes evident that a major source of cancer complexity lies in the many layers of interacting elements involved in the phenomenon.


# 2. THE MULTI-OMICS PARADIGM

#### 2.1. Multi-Omics in a Nutshell

Multiomics is the name given to the modeling approach in biology that makes use of more than one of the current high-throughput biomolecular experimental techniques (a.k.a.

FIGURE 1 | Hallmarks of cancer complexity. The defining features of cancer (3, 4) are intrinsically connected to the defining features of complex systems (2).

omics) in order to characterize biological systems at the phenomenological level. It is understood that every omic contributes in a specific fashion to shaping the actual biological phenotype under study. For this reason, it has become evident that there is a need for integrating frameworks to gather and organize the knowledge gained with each experimental approach into mechanistic or semi-mechanistic descriptions of the biological phenomenon. This issue has been deemed particularly relevant for the study of complex phenotypes, such as tumors (16).

The rapid development of sequencing strategies, as well as genotyping and expression microarrays, led to the development of gene models to account for the molecular aspects of biology at the whole cellular level (and even at the organ and organism scales). The coming of age and popularization of next-generation sequencing techniques (driven by an almost exponential lowering of costs) has led to an explosion of new approaches to understanding complex phenotypes, which in turn has sped up the rise of high-throughput proteomics, with metabolomics catching up. Single-cell technologies and a number of emerging sequence-based approaches (ChIP-seq, ATAC-seq) are becoming usual tools of biomedical, and in particular cancer, research (see **Figure 3** for an account of the rapidly increasing number of PubMed publications based on these omic tools).

Nevertheless, the integrative approach to multi-omic modeling is far from trivial, owing to the broad diversity of data types, dynamic ranges, and sources of experimental and analytical error characteristic of each omic. Despite these difficulties, a number of approaches to multi-omic integration have been proposed [see, for instance, discussions in Hernández-Lemus (17, 18)]. Said approaches make use of tools from statistics, probability, machine

FIGURE 2 | The many levels of interactions found in a cancer system. (A) Depicts intracellular interactions that can be measured via the different omic technologies, such as genomics, transcriptomics, metabolomics, lipidomics, and so on. (B) Shows intercellular interactions, such as the ones orchestrated through immune responses, microbial interactions (metagenomics) and other instances of cell-cell interactions.

learning and network science to classify, explore and provide guidelines for feature selection and their application is very much rooted in the tenets of systems biology.

The systematic study of cancer given by multi-omics is founded on the acknowledgment that many different factors contribute to the development and maintenance of the malignant state, including genetic aberrations, epigenetic alterations, changes in the response to cellular signaling, metabolic alterations, and beyond (19). Hence, by analyzing cancer as a complex pathology, the systems biology paradigm tries to gain insight into the molecular origins of the disease by looking at these diverse contributions, from DNA mutations (both germline and somatic) to deregulation of gene expression programmes and hormone disruption, which may or may not be accompanied by metabolic abnormalities and aberrant pathway signaling.

Cancer is also a multiscale pathology: aside from the biomolecular events just mentioned, the environment and lifestyle are known to modify the onset, development, and outcome of tumors and their metastases. Multiomic analysis under a systems biology framework makes it possible to use the unprecedented power of current high-throughput molecular and computational tools to draw a more complete picture of the different players in tumorigenesis and tumor establishment. At the same time, it may provide us with new instruments and strategies useful in basic and clinical research laboratories, but also in translational medicine and therapeutic endeavors.

These different levels of description have been independently studied for years. However, even though the advent of high-throughput technologies has permitted the development of systems biology, system-level models (which form the theoretical foundations of these multiomic studies) are still under development.

#### 2.2. The Systems Biology Framework

In essence, the foundational basis of systems biology is that of considering biological phenomena as systems, i.e., constructs formed by a large number of complex molecular and environmental components interacting at different levels to shape the functional features of said system. Tumor behavior, for instance, is determined by a combination of changes in genomic information that may (or may not) be associated with abnormal gene expression profiles, affecting protein abundance but also modifying protein structure and folding, as well as supramolecular assembly. Changes in the regulatory patterns may also affect cell signaling mechanisms and their responses. Hence, the complex interactions of nucleic acids and proteins in replication, transcription, metabolic, and signaling networks are considered the ultimate causes of the functioning (or malfunctioning, if preferred) of the tumor cell. We can notice that these are interdependent phenomena that cannot be treated separately, hence the need for integrative methodologies.

Another pivotal challenge in contemporary studies undertaken following a systems biology view is hence data integration. Data integration allows for the understanding of the enormous datasets generated by experimental multi-omics. This is indeed a highly non-trivial task, since just the data management of such large amounts of information represents a challenge that has been called the big data paradigm.

#### 3. THE ROLES OF COMPUTATION IN THE AGE OF CANCER MULTI-OMICS

We have identified four main roles that computation plays in the analysis of high-throughput data. These are the raw data acquisition from high-throughput instruments; the processing of raw data to quantitative data; the storage and management of massive omics data, for instance in remote repositories; and finally the deployment of data analysis models. These roles are illustrated in **Figure 4**. In this section, we will discuss select aspects of each of these roles.

#### 3.1. Data Acquisition and Processing

The acquisition, processing, and manipulation of omic data generated in high-throughput experiments requires, due to the very nature of these experiments (see **Figure 5**), the use of specialized bioinformatics pipelines. As the complexity of these datasets increases due to the natural evolution of these technologies, so do the associated challenges evolve (20). Bioinformatics workflow management systems can be used to develop, maintain, and foster reproducibility of a given pipeline or workflow. Examples of these systems include Galaxy (21), Snakemake (22), Nextflow (23), and the general purpose Common Workflow Language (24).

It should be noted that a large number of tools for omic data analysis are available as packages for the R language contained in the Bioconductor project (25), a repository of bioinformatics open source software. It is important, however, to acknowledge the existence of other software ecosystems, such as the Biopython project (26). Although the number of packages in Bioconductor is greater than that found in Biopython [see for instance (27)], the main takeaway should be that there is a large number of tools available to researchers that can be used in any combination suitable for their research question.

#### 3.1.1. Genomics

The oldest of the omic technologies, genomic analyses focus on the genomic sequence and its variations: insertions, deletions (INDELs), single nucleotide variations (SNVs), copy number

variations (CNVs), and so forth. The relationship between genomic alterations and cancer is well-known (28).

Microarrays have long been used for genotyping. Although specifics of microarray technology may vary across manufacturers, most modern DNA microarrays can be analyzed using well-established tools available in the Arrays (29). Such tools can handle arrays for different genotyping tasks, including SNP and copy number assays [for instance, copy number detection from exome sequencing using CODEX (30)].

Although DNA microarrays remain in use, next generation sequencing (NGS) technologies are quickly becoming commonplace. The analysis of NGS data entails a workflow that involves sequence acquisition and alignment to a reference genome. A number of downstream analysis pipelines can follow; for instance, a variant discovery workflow would involve variant calling, filtering, annotation, and prioritization (31). The first step in analyzing NGS data is to run a sequence aligner on the sequence data (stored in FASTQ format). Some popular aligners are the stand-alone BWA (32), Bowtie (33), Bowtie2 (34), and SNAP (35), with aligned sequences being stored in SAM (Sequence Alignment Map, text-based) or BAM (Binary Alignment Map) files. These aligned sequences are the input for downstream genotyping analyses (36, 37).
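To make the SAM format concrete, the sketch below parses text-based SAM records into their eleven mandatory fields and uses two standard FLAG bits (0x4, read unmapped; 0x10, read on the reverse strand) to keep only mapped reads. It is a teaching example with made-up records; the function names are hypothetical, and in practice a dedicated library such as pysam would normally be used instead of hand-written parsing.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line into a dict."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("SAM record has fewer than 11 mandatory fields")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(SAM_FIELDS, parts)}
    record["tags"] = parts[len(SAM_FIELDS):]             # optional TAG:TYPE:VALUE fields
    record["is_unmapped"] = bool(record["flag"] & 0x4)   # FLAG bit: read unmapped
    record["is_reverse"] = bool(record["flag"] & 0x10)   # FLAG bit: reverse strand
    return record


def mapped_records(lines):
    """Yield parsed records for mapped reads, skipping '@' header lines."""
    for line in lines:
        if line.startswith("@"):
            continue
        record = parse_sam_line(line)
        if not record["is_unmapped"]:
            yield record


# Made-up records: one header line, one mapped read, one unmapped read.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTT\tIIIII",
]
records = list(mapped_records(example))
```

Downstream genotyping tools consume exactly this information (reference name, position, CIGAR string, and base qualities), which is why the aligned file is the hand-off point between alignment and variant calling.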

Such standards are very much the state of the art in the academic research community. Regarding pipelines approved by regulatory bodies, there is an official FDA guidance document: "Considerations for Design, Development, and Analytical Validation of Next Generation Sequencing (NGS)-Based in vitro Diagnostics (IVDs) Intended to Aid in the Diagnosis of Suspected Germline Diseases," available for download at https://www.fda.gov/media/99208/download. This document (99208) refers to a software documentation guideline, "General Principles of Software Validation; Final Guidance for Industry and FDA Staff," which is, however, quite outdated (last revised January 11, 2002) (https://www.fda.gov/media/73141/download). Some NGS tools are nevertheless available as a web service at https://precision.fda.gov/. For a review of these guidelines and tools, see (38).

#### 3.1.2. Epigenomics

With the recent advent of high-throughput omic technologies to probe chemical modifications in the tumor genomes it has become more and more evident that such epigenomic modifications are present and likely play relevant roles in many cancers. These variations include DNA methylation and histone modifications, both in oncogenes and in other cancer-associated genes. Mutations in genes involved in epigenetic regulation have also been found in several tumor types. The computational analysis of epigenomic data may provide us new insights about cancer initiation and progression. More relevant perhaps, such studies will pave the way for a more efficient identification of genetic and epigenetic biomarkers for diagnosis, prognosis or response to therapy. These in turn, may accelerate the development of novel therapeutic approaches.

Epigenomics often presents another view of functional processes, complementary to that of genomics. Sometimes epigenomic techniques even allow for a better understanding of genome-associated phenomena. Such is the case of high-throughput immunoprecipitation assays, such as ChIP-Seq. ChIP-Seq and other experiments based on the analysis of short reads are affected by multi-reads, i.e., reads that map to more than one genomic region. Determining the origin of such multi-reads is critical for the accurate mapping of reads to repetitive regions, such as copy number variants (39, 40). Current computational approaches have been refined to account for this phenomenon even at the single-cell level (41).

The epigenome contains the set of potentially inheritable chemical modifications of DNA and histone proteins that can control gene expression activity (42). There are several mechanisms which are contained within the epigenomics concept, each requiring a different high throughput molecular technique for its measurement. Each of these techniques, in turn, requires the use of a dedicated set of computational tools. These include:


ChIP-seq (51) data is used to identify genomic locations with an overabundance of proteins of interest; such identification uses the so-called peak callers (52, 53). These include SICER2 (54), PeakRanger (55), GEM (56), MUSIC (57), PePr (58), DFilter (59), and MACS (60); benchmarks for these algorithms can be found at https://github.com/skchronicles/PeakCalling.

MACS is a popular peak caller that uses a dynamic Poisson distribution; its successor, MACS2 (61), improves the algorithm to, amongst other things, make it more suitable for calling differential regions. Differential binding analysis (that is, identifying sites that exhibit different binding behavior between biological conditions) can be useful to identify, from ChIP-seq data, relevant regions that may be driving cancer phenotypes. Tools for this task include DiffBind (62), a package that provides functions to handle the results of peak set callers, such as MACS. Another tool for this task is csaw (63), useful for de novo detection of differentially bound regions using a sliding window approach. An in-depth comparison of differential ChIP-seq analysis tools can be found in (64).
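The idea of testing read counts against a locally adjusted Poisson background can be sketched as follows. This is a simplified illustration in the spirit of a dynamic Poisson model, not the MACS implementation: the window counts, the background rates, the significance threshold, and all function names are assumptions made for the example.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """Upper tail P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Adequate for a sketch; very small tail probabilities lose precision here.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def local_lambda(genome_rate: float, local_rates) -> float:
    """Use the largest of the genome-wide and local background rates."""
    return max((genome_rate, *local_rates))


def call_enriched(counts, local_rates, genome_rate: float, alpha: float = 1e-5):
    """Return indices of windows whose count is unlikely under the background."""
    hits = []
    for i, (k, rates) in enumerate(zip(counts, local_rates)):
        lam = local_lambda(genome_rate, rates)
        if poisson_sf(k, lam) < alpha:
            hits.append(i)
    return hits
```

Taking the maximum over several background estimates makes the test conservative in regions where the control sample itself shows locally elevated coverage.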

• Chromosome conformation: The three-dimensional organization of the genome allows for interactions between regions that are distant in terms of sequence, even belonging to other chromosomes. These higher-order chromosome structures are a current area of research in oncology (65). Chromosome conformation capture techniques are able to quantify interactions between genomic loci. These C-techs are based on the original 3C, chromosome conformation capture (66), which is able to quantify interactions between a single pair of loci. It was followed by: 4C (chromosome conformation capture-on-chip) (67), which captures interactions between one locus and all others; 5C (chromosome conformation capture carbon copy) (68), which captures all interactions between two sets of loci; and Hi-C (high-throughput chromosome conformation capture) (69, 70), which detects interactions between all possible loci pairs. Development of computational analysis tools for chromosome conformation capture data is ongoing, although there are available packages for the detection of significant interactions for all these technologies (71–73).

It has been known for some time that higher order chromatin arrangements are associated with chromosomal alterations in cancer. For instance, it has been argued that spatial chromosome conformation and negative selection may be powerful driving forces behind somatic copy number alterations (74). More recently, chromatin conformation capture has allowed the identification of putative pharmacological targets in breast cancer (75). Genomic loci interactions may even affect the expression of biomarkers related to hallmarks of cancer, such as hypoxia (76).

Packages such as methylPipe and compEpiTools provide an integral platform for the comprehensive and integrative analysis of the first two classes of epigenomic data (77). ATACseqQC (78) offers quality control tools for ATAC-seq data, while esATAC (79) offers a whole analysis pipeline. The GenomicInteractions package (80) offers a complete framework for the analysis of chromosome conformation data.

#### 3.1.3. Transcriptomics

Transcriptomic analyses are used to measure the presence and abundance of RNA in a given physiological context (81). Perhaps the most common application of transcriptomic technologies is to measure gene expression. The gene expression profile of a phenotype can be used as a barcode of its biological state. Such barcodes can be compared, through differential expression analyses, to pinpoint cellular changes in cancers (82). The expression profile is the product of the gene regulatory program encoded in the genome and the epigenome. By measuring gene expression, we are indirectly capturing the regulatory changes that are at the core of the disease.

The development of gene expression microarray technology (83) has made gene expression measurement more technically and economically viable than the measurement of protein abundance. Therefore, methods for the measurement of biological activity (i.e., pathways) have been developed with transcriptomic data in mind (84). Studying the molecular phenotype of cells via transcriptomics has become an invaluable tool, providing a proxy to the functional state of cells and their regulatory interactions, both in cancer (85, 86) and in healthy phenotypes (87). Nevertheless, it should be noted that the correspondence between gene and protein abundance is far from perfect (88), which highlights the need for multi-omics.

Beyond gene expression, whole transcriptomic analyses involve the measurement of non-coding (nc) RNA, such as micro-RNA (miR), long non-coding RNAs (lnc-RNA), small nucleolar, Piwi-interacting, enhancer RNAs, among others (89, 90). The role of these transcripts, particularly in terms of their contribution to the regulatory program, remains an active area of study.

As previously mentioned, transcriptomics is one of the most developed omics, second only to genomics itself. Measurement of transcript abundance can be done using either expression microarrays or RNA-sequencing (91, 92). Each methodology has technical considerations, but the general steps for their analyses are similar: acquire and preprocess data, removing technical artifacts; quality control; and data normalization. The resulting data can be represented as an expression matrix: an N × M matrix where rows represent transcripts, and columns represent samples (or observations). It should be noted that most expression pipelines are oriented toward differential expression analyses [see for instance (93)]; this should be taken into account if that is not the intended use case.
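The expression matrix layout and a basic normalization step can be sketched as follows. Counts-per-million scaling followed by a log2 transform is used here only as a simple, widely known example of library-size normalization; it is an assumption of this sketch, not the method of any specific pipeline cited above, and the function names are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw counts by library size.

    counts: list of rows (transcripts), each a list of per-sample read counts.
    Returns a matrix of the same shape in counts per million (CPM).
    """
    n_samples = len(counts[0])
    # Library size = total counts observed in each sample (column sums).
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [row[j] / lib_sizes[j] * 1e6 if lib_sizes[j] else 0.0
         for j in range(n_samples)]
        for row in counts
    ]


def log2_transform(matrix, pseudocount: float = 1.0):
    """log2(x + pseudocount), so that zero counts remain defined."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]


# Toy N x M matrix: 3 transcripts (rows) measured in 2 samples (columns).
raw = [[10, 20], [30, 60], [60, 120]]
normalized = log2_transform(counts_per_million(raw))
```

In the toy matrix the second sample was sequenced twice as deeply as the first, so after CPM scaling both columns carry the same values.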

Starting points for RNA-seq data analysis include either alignment based methods, such as Bowtie (33), and STAR (94), or alignment-free methods, such as kallisto (95) and Salmon (96).

Cancer-related omic experiments often rely on specific, tailor-made analytics. One instance of this is provided by alignment-free RNA-Seq analysis methods, such as the ones implemented in kallisto, Salmon, etc. Alignment-free methods (AFMs) are particularly well-suited to cancer transcriptomics for examining the role and abundance of fusion transcripts that may give rise to chimeric proteins (97, 98). Another reason behind the use of AFMs is that different RNA-Seq pipelines are known to present differences that may be important when analyzing cancer genomes and transcriptomes (99, 100).

Further steps require different tools for quantification, quality control, and normalization of expression data. For instance, a popular pipeline is composed of the aforementioned Bowtie as a short read aligner, TopHat (101) for the identification of splice junctions, Cufflinks (102) for transcriptome assembly and differential expression analysis, and CummeRbund (103) for result exploration. It should be noted that, while this pipeline is still widely used and maintained (e.g., the latest Bowtie2 release was 02/28/20), other approaches are being gradually embraced by the community (104); for instance, HISAT2 (105), StringTie (106), and Ballgown (107).

In the case of tools like STAR, we need to be aware that fusion detection using STAR-Fusion is mainly limited by the length of single-end reads. The STAR-Fusion wiki (https://github.com/STAR-Fusion/STAR-Fusion/wiki) indicates the need for reads of at least 100 bases. In the case of other approaches, such as FusionHunter (108), the authors recommend aligning to a pseudo-reference and discarding junction-spanning reads with <6 bp matches on either gene. Arriba is a relevant tool for calling gene fusions, also based on STAR alignments (https://github.com/suhrig/arriba/). Arriba was the winner of the DREAM SMC-RNA Challenge (https://www.synapse.org/#!Synapse:syn2813589/wiki/401435) (109).

An advantage of the modular design of these pipelines is that it is possible to combine tools from different frameworks, depending on experimental and analytical needs. For instance, Salmon provides tools to connect with differential expression tools, such as DESeq2 (110), edgeR (111), limma (112), or sleuth (113). A detailed discussion of these methods is beyond the scope of this article; please see Conesa et al. (114) for an in-depth review.

#### 3.1.4. Proteomics

Proteomic analyses are used to identify and quantify the set of proteins present within a biological system of interest (115). The study of cancer proteomes is promising as a way of identifying biomarkers and therapeutic targets (116). This is not surprising: proteins are the molecular units from which cellular structure and function arise.

Historically, high throughput proteomics technologies have developed at a slower pace than genomics and transcriptomics technologies. Microarray approaches to proteomics have been developed, with varied levels of success and applications (117, 118). However, the bigger breakthroughs have come through the use of mass spectrometry (119).

Various steps of proteomics analysis involve data analysis (120). During data acquisition, the detected molecular fragments must be identified. This is often done by comparing fragments to databases in real-time (121, 122). Later, the assembly of proteins from identified peptide fragments requires another set of computational methods (123). The development of such methods remains an active area of research (124, 125). Bioconductor offers a streamlined set of tools for the management of proteomics data, from data processing to functional analysis (126). Another alternative for protein quantification is the MaxQuant toolset (127).

#### 3.1.5. Metabolomics and Lipidomics

Metabolic alterations are important contributors to cancer development (128). Cancer metabolomics has become an important research topic in oncology (129), with the promise of providing novel insights on cancer development and potential therapeutic options. Lipidomics is actually a subset of metabolomics (130). The study of cancer lipidomics may lead to biomedically important findings, such as novel biomarkers (131).

As with proteomics, metabolomics and lipidomics studies have been made possible by the use of mass spectrometry. The analytical considerations for the extraction and quantification of these types of compounds differ somewhat from those used for proteomics. This is expected, as the chemical nature of metabolites and lipids is fundamentally different (132, 133). In turn, bioinformatic and chemoinformatic approaches to high-throughput metabolite profiling exhibit some modifications (134).

Analysis frameworks for metabolomic and lipidomic data are currently available. The metab package (135) provides an analysis pipeline for metabolomics derived from gas chromatography mass spectrometry data. The metaRbolomics package (136) is a general toolbox that goes from data processing to functional analysis. Finally, the lipidr package (137) is a similar framework focused on lipidomics data.

#### 3.1.6. Unraveling the Complexity Within Samples: Single Cell, Imaging, Microbiome

The aforementioned technologies were all developed for the detection and quantification of analytes extracted from a complex biological matrix, obtained from tissue, plasma, or a similar fluid. As such, the data from these omics is an aggregate of the different cellular contexts present in the sample. The environment within and surrounding cancer tumors is notably heterogeneous (138, 139). There is knowledge to be gained by recovering the omics diversity within samples.

Cancer is an extremely heterogeneous disease at the cellular and molecular level. Tumor heterogeneity is caused by the concurrence of multiple cell lineages and differentiation stages, determined to an extent by the processes of clonal evolution. This has led to an early adoption of single cell analysis techniques. Single cell sequencing, which studies the genomic and epigenomic features of the different cell populations within a tumor by considering the characteristics of individual cells, has emerged as an appealing approach to deal with said cell-to-cell variability (140–142).

Cancer cell heterogeneity also exists beyond the genome. Tumor evolution under complex environmental scenarios often leads to variability in epigenetic modifications. Single cell sequencing and imaging techniques have proven to be quite effective in characterizing cellular plasticity induced by epigenomic phenomena (143). Aside from scMethSeq and scDNAse Seq, other techniques, such as single-cell chromatin accessibility assays, are starting to shed light on how epigenomic subpopulations in cancer may impact tumor features, such as drug sensitivity and clonal dynamics (144).

Single-cell omics analyses rely on experimental techniques for the isolation of single cells from a sample, using microfluidics or fluorescence-activated cell sorting methods (145). Single-cell RNA-seq (scRNA-seq) is currently the most developed high-throughput omics technology for individual cell analysis (146).

At first glance, data from scRNA-seq experiments may seem very similar to so-called "bulk" data. In fact, scRNA-seq data is sparser, more variable, and has more complex distributions of expression values. As such, data analysis techniques may need to rely on different assumptions than their "bulk" counterparts (147). Again, the development of these novel bioinformatics tools is an active area of research (148). The Bioconductor ecosystem has a complete framework for the analysis of scRNA-seq data, from low-level (149) to functional analyses (150). Scanpy (151) provides a toolkit for single-cell gene expression analysis in a Python environment. Another single-cell genomics toolkit is Seurat (152) for R.
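The contrast between bulk and single-cell count data can be made concrete with two simple summary statistics. The sketch below computes the fraction of zero entries and a per-gene coefficient of variation on toy matrices; the numbers are invented for illustration, and the function names are hypothetical rather than part of Scanpy, Seurat, or Bioconductor.

```python
import statistics


def zero_fraction(matrix) -> float:
    """Fraction of entries equal to zero in a genes x cells count matrix."""
    values = [v for row in matrix for v in row]
    return sum(v == 0 for v in values) / len(values)


def coefficient_of_variation(row) -> float:
    """Population standard deviation divided by the mean of one gene's counts."""
    mean = statistics.fmean(row)
    return statistics.pstdev(row) / mean if mean > 0 else float("nan")


# Invented toy data: two genes measured in four bulk samples or four cells.
bulk_counts = [[5, 6, 7, 6], [10, 12, 11, 9]]
single_cell_counts = [[0, 0, 3, 0], [1, 0, 0, 0]]

bulk_sparsity = zero_fraction(bulk_counts)
cell_sparsity = zero_fraction(single_cell_counts)
```

Summaries of this kind are a quick way to check whether a dataset violates the assumptions built into methods designed for bulk data.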

Integration of single-cell RNA-seq with other profiling tools is an important research area (153), since, along with single-cell methods, there are other technologies that can provide a more complete picture of cancer heterogeneity. High-throughput imaging data (154) can be generated and computationally analyzed (155, 156). Imaging techniques can be used along with omics to recover the spatial distribution of molecules within cells and throughout tissues. Tools such as CellProfiler (157) allow for a high-throughput analysis of data. Imaging techniques can be combined with single-cell methods: for instance, MERFISH can simultaneously measure copy number and distribution of RNA in single cells (158); Slide-seq (159) can measure transcriptomes at a high spatial resolution.

Space-resolved transcriptomics or spatial transcriptomics (ST) is a set of in situ transcript capturing methodologies aimed at the quantification and visualization of gene expression patterns in individual tissue sections or regions. ST methods have revealed relevant tissue-level phenomena linked to tumor evolution and in some cases have allowed the prediction of clinical outcomes in, for instance, breast cancer subtypes (160).

ST mapping of prostate tumors, on the other hand, has been key to the identification of gene expression gradients in stroma adjacent to tumor regions. This, in turn, has resulted in patient re-stratification based on tumor microenvironment features (161). A similar approach has been taken to trace tumor advance in malignant melanoma (162). A combination of ST with scRNA-seq has led some researchers to propose the concept of a "tumor atlas," a roadmap to navigate tumor spatial and cellular heterogeneity (163).

Multi-omic analysis is not devoid of technical and logistic challenges. Perhaps the most obvious is obtaining the different sample types from a single source in the same experiment. Cell cultures may provide a way around this problem; however, in vitro conditions often fail to reproduce some aspects of interest in complex phenotypes, such as cancer. In recent times, three-dimensional cell culture techniques have allowed the design and development of more realistic models, such as organoids and tumoroids. These models may represent a good compromise between cell line studies and biopsy-captured tissue experiments (164). Multi-omic approaches are starting to be applied to lab-grown organoids with relative success (165, 166). In order to analyze such data, some novel computational tools are being developed and adapted (167).

The role of the immune system in cancer response is another area of active research. CITE-Seq is an RNA-seq method that incorporates epitope analysis, thus providing semiquantitative information on surface protein abundance via antibody assays, even at the single cell level (168). This novel technique is starting to be applied to fundamental questions in oncology, such as tumorigenesis (169).

Finally, the role of the microbiome in cancer is being recognized (170); the integration of metagenomic, and perhaps meta-omics data (171), could provide key insights into cancer pathogenesis and therapeutics.

#### 3.2. Data Management

The push for open data in the field of biomedical genomics since the gestation of the Human Genome Project has led to the emergence of a rich Genomic Commons (172). Making data available in public repositories makes for faster scientific discovery, although there are challenges to be overcome, both ethical/legal (173), and technological.

Challenges of data management include defining the type of data to be stored and how to store it; the policies for data access, sharing, and re-use; and long term archiving policies (174). Arguably, the most successful repository of cancer multiomics is the NIH's Genomic Data Commons (GDC) (175). The Genomic Data Commons contains all data generated by the Cancer Genome Atlas (TCGA) project (176), although it should be noted that not all data is publicly accessible. The data is organized as a directed graph of interconnected entities (**Figure 6**), with each entity having an associated set of properties and links. Publicly accessible data can be retrieved through the gdc-client command line tool, through the REST API for programmatic access to the database, or through dedicated packages, such as RTCGA (177). A recent account by The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium (PCAWG) of these resources and analyses is presented in (178). Furthermore, a larger collection of datasets can be accessed through the Broad Institute's Firehose (http://gdac.broadinstitute.org/); cloud computing enabled data access is provided through the Cancer Genome Collaboratory (https://cancercollaboratory.org/).
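Programmatic access through the REST API typically amounts to sending a JSON filter to an endpoint. The sketch below only builds such a request URL and does not contact the server. The base URL `https://api.gdc.cancer.gov`, the `files` endpoint, and the field names used (`cases.project.project_id`, `data_type`, `file_id`, `file_name`) are assumptions of this example and should be checked against the current GDC API documentation; the function name is hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed public base URL of the GDC REST API (verify before use).
GDC_API = "https://api.gdc.cancer.gov"


def build_files_query(project_id: str, data_type: str, fields, size: int = 10) -> str:
    """Build a GET URL that asks the `files` endpoint for matching file records."""
    # Nested filter: both conditions must hold ("and" of two "in" clauses).
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in",
             "content": {"field": "data_type", "value": [data_type]}},
        ],
    }
    params = {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),   # which properties to return per entity
        "format": "JSON",
        "size": str(size),            # number of records per page
    }
    return f"{GDC_API}/files?{urlencode(params)}"


url = build_files_query(
    "TCGA-BRCA", "Gene Expression Quantification", ["file_id", "file_name"]
)
```

The resulting URL can be passed to any HTTP client; the file identifiers returned by such a query are what a download tool would then consume.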

The impact of TCGA at the forefront of multiomics research is inarguable. As a publicly available resource, it provides data for method development and validation, and many current projects rely on it. However, there are other datasets, either single-layer or multiomic, that can also be integrated. Wet-lab researchers also continue to carry out their own projects, contributing to the cancer multiomics community. Integrating data from both local experimental projects and large collaborative endeavors, such as TCGA, is a common practice in many places, including our institution, the National Institute of Genomic Medicine in Mexico. Doing so allows the specific hypotheses of the different research groups to be tested with the statistical power obtained via the much larger datasets generated by international multicentric collaborative projects.

As mentioned, it is possible to extract a lot of knowledge from the systematic re-analysis of data available in large public datasets. Perhaps the most comprehensive of these databases is the one maintained by the TCGA/Genomic Data Commons/International Cancer Genome Consortium. Retrieving the data via their Application Programming Interface (API) (https://gdc.cancer.gov/developers/gdc-application-programming-interface-api) demands some familiarity with command line tools and coding that may be beyond most non-bioinformaticians. The project's data portal (https://portal.gdc.cancer.gov/) provides easy-to-use interfaces, but may be limited in its application to broader analyses. To date, there are a number of commercially available platforms that provide gentler access to the TCGA data. Such is the case of Qiagen's OncoLand database (https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/content-exploration-and-databases/qiagen-oncoland/) and the cloud-based analytics solution Seven Bridges (https://docs.sevenbridges.com/docs/tcga-data). A limitation of these alternatives, aside from being subscription-based, is that they are not customizable, which means that not all possible (or desired) analyses may be performed.

There are, however, a number of resources not only to access the data but to actually perform different levels of downstream analysis. Such is the case of imputation approaches to missing data in the TCGA database (179) (https://github.com/mrendleman/MachineLearningTCGAHNSC-BINF/).

Perhaps the best combination of usability and versatility is present in the TCGA Workflow suite available as an R/Bioconductor package (180) (https://www.bioconductor.org/packages/release/workflows/vignettes/TCGAWorkflow/inst/doc/TCGAWorkflow.html).

#### 4. COMPUTATIONAL TOOLS FOR MULTI-OMICS DATA INTEGRATION

An often-asked question is why one should try to integrate multiple omics technologies using complex models. Perhaps the simplest argument is that biological phenomena are not composed of independent layers of biological features: integrative models will be, due to this simple fact, closer to the system of study. As omics technologies have become available, researchers have used them together to try to capture a better description of these phenomena (see **Figure 7**).

Improving our current cancer diagnostic capabilities is a major goal of biomedical research: the role of molecular technologies in the development of these tools has long been recognized (181). It is expected that multi-omic integration is able to provide better predictive tools than single molecular technologies, due to the fact that each technology is capturing just a slice of the whole complex pathological system; multiomics data are expected to be of value for both basic and clinical research, as long as they are able to recover biological insights beyond those obtainable from the simple addition of each analysis layer (182, 183).

It may soon become evident that the formalisms that can lead to such level of description are, by necessity, complex (184). A remaining question is what multiomic combinations are able to achieve better diagnostic results. Selecting this optimal omics combination is not trivial, since there are practical constraints (such as economic and technical limitations) in the clinical setting in which such diagnostic tools are to be deployed (185). Computational tools and bioinformatic approaches play an important role in the design of such studies. A list of such tools is presented in Supplementary Materials as **Table 1**.

#### 4.1. Multi-Omics Data Representation and Preparation

The success of a computational method could arguably be influenced by the design principles implemented in its data representation. The MultiAssayExperiment package (186) provides an eponymous data class to contain multiomics experiments. Like other Bioconductor classes, MultiAssayExperiment is object-oriented. It can contain the information of different (multi-omics) experiments, linking features, patients, and experiments. Furthermore, by sharing design principles with the rest of the S4-Bioconductor classes, it is highly interoperable.
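The idea of a single container that links patients, samples, and per-assay measurements can be sketched in a few lines. The class below is a hypothetical, heavily simplified Python analog written for illustration only; it is not the MultiAssayExperiment API, and its attribute and method names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class MultiOmicsContainer:
    """Hypothetical container linking patients, assays, and samples."""
    patients: dict = field(default_factory=dict)    # patient_id -> clinical annotations
    assays: dict = field(default_factory=dict)      # assay -> {sample_id: {feature: value}}
    sample_map: list = field(default_factory=list)  # (assay, patient_id, sample_id) links

    def add_sample(self, assay: str, patient_id: str, sample_id: str, values: dict):
        """Register one sample of one assay and link it to a known patient."""
        if patient_id not in self.patients:
            raise KeyError(f"unknown patient: {patient_id}")
        self.assays.setdefault(assay, {})[sample_id] = dict(values)
        self.sample_map.append((assay, patient_id, sample_id))

    def samples_for(self, patient_id: str) -> dict:
        """All sample identifiers of one patient, grouped by assay."""
        out = {}
        for assay, pid, sid in self.sample_map:
            if pid == patient_id:
                out.setdefault(assay, []).append(sid)
        return out

    def complete_cases(self) -> list:
        """Patients that have at least one sample in every registered assay."""
        seen = {}
        for assay, pid, _ in self.sample_map:
            seen.setdefault(pid, set()).add(assay)
        return sorted(pid for pid, found in seen.items() if found == set(self.assays))
```

Keeping the patient-to-sample links in one explicit table is what makes operations such as restricting the analysis to patients profiled on every platform straightforward.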

An important issue with large scale multi-omics studies is the problem of missing and mislabeled samples. Whether by technical limitations or human error, the samples associated with a given patient may not have all measurements; or samples from two different patients may get mixed-up. There are packages available to handle these problems. The missRow package (187) can be used to handle missing data, combining multiple imputation with multiple factor analysis. The omicsPrint package (188), in turn, can be used to evaluate data linkage through the use of linear discriminant analysis.

The STATegRa (189) and mixOmics (190) projects, the latter descended from the integrOmics project (191), provide frameworks for multiomics data analysis and integration. Just as with the Bioconductor project, the major advantage of such projects is the increased interoperability due to the sharing of design principles. For instance, within the STATegRa project there are an Experiment Manager System (192); MOSim (193), a tool that provides methods for the generation of synthetic multi-omics datasets, which can be used for benchmarking and validating other integration tools; and an experimental multiomics dataset (194).

#### 4.2. Multi-Omics Data Integration as a Data Science Problem

For this review, we approached these methods from a data science perspective, considering that each method is in essence solving a machine learning task (or set of tasks). In **Figure 8** we show some of these mappings, although it should be noted that these categories may be fluid: an unsupervised clustering analysis can become the basis for a supervised classifier, with diagnostic and prognostic applications. This is the story of the PAM50 algorithm for breast cancer (195).
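How groups found by unsupervised clustering can become a supervised classifier is illustrated below with a nearest-centroid rule: the centroid of each group is stored, and new samples are assigned to the closest one. This is a generic sketch with invented toy data and Euclidean distance; it does not reproduce the published PAM50 predictor, and the function names are hypothetical.

```python
import math


def centroid(samples):
    """Feature-wise mean of a list of equally long feature vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def fit_centroids(samples, labels):
    """Compute one centroid per group label (e.g., labels from a clustering)."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return {label: centroid(members) for label, members in groups.items()}


def predict(centroids, sample):
    """Assign a new sample to the group with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], sample))


# Toy data: two features per sample, group labels assumed to come from clustering.
train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
groups = ["A", "A", "B", "B"]
model = fit_centroids(train, groups)
```

Once the centroids are fixed, classifying a new patient requires no re-clustering, which is what makes this kind of model attractive in a diagnostic setting.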

#### 4.3. Exploratory Data Analysis

Exploratory data analysis (EDA) is a vital first step in omics analyses (196). Through EDA the nature of the data can be understood, allowing for better decisions at a further modeling step.

Unsupervised learning approaches can provide a hypothesis-free understanding of the data behavior. This will reflect the nature of the underlying biological phenomenon. Unsupervised clustering analyses attempt to group samples based on the similarity of their measured features. The assumption is that this unsupervised classification will recover relevant biological differences. Multi-omics can increase the efficiency of such approaches (197).

Multi-omic data analysis is often performed with the aim of unveiling non-trivial molecular and systemic interactions that are difficult or impossible to see if one relies on a single omic approach. In doing so, we are tacitly assuming that the different omic levels of description may have synergistic effects that are key to developing more accurate models of tumor biology. Since multi-omic approaches may generate a plethora of interdependent data, it is useful to design analytical strategies for dimensionality reduction, feature selection, and integration of all this information.

Aside from intelligibility, there are additional reasons to apply dimensionality reduction schemes. One of these is that a multi-omic study combines different information sources, hence dramatically increasing the number of features while often keeping the number of samples constant. In order to preserve statistical power, we need to rely only on the most informative variables (198–200).

Computational tools to this end have been developed, such as mixOmics (https://www.bioconductor.org/packages/release/bioc/html/mixOmics.html) and STATegRa (https://bioconductor.org/packages/release/bioc/html/STATegRa.html). For an extensive list of computational tools in the context of cancer biology, see (186).

One can make use of dimensionality reduction techniques in order to embed multi-omic data observations into a lower-dimensional space that can be used either for manual (i.e., visual) inspection or as the input for unsupervised clustering (or other analysis tools). Popular dimensionality reduction methods:


Data visualization is an important part of EDA: the graphical representation of data can be sufficient for the identification of complex patterns (204). Visualizing high-dimensional biological data can be helpful from a purely data-driven point of view: for instance, to understand the variability within a phenomenon. Combinations of dimensionality reduction, data clustering, and visual inspection can be effective to identify subpopulations within a dataset. The most common visualization for these tasks is perhaps the scatterplot, but it is far from the only one: for instance, hexbins (205) can be used to explore scRNA-seq data, which can help overcome overplotting problems related to the order in which points are drawn on the canvas.

Visualization can also be coupled with other biological information, for instance locating the genomic regions in which epigenomic features are found. Visualizations such as the Circos plot (206) can be used for the detailed representation of multi-omics data and their location in specific genomic regions; the OmicCircos (207) implementation is compatible with the standard data classes used in Bioconductor. The multiOmicsViz package is useful to visualize the effects of one omics layer on another within the spatial chromosome context. The Gviz package (208) provides a full R graphics system solution for genome browser-style visualizations. Such a representation is useful to show the behavior of different experimental layers (as tracks) in a sequence context. For ChIP-seq data visualization, tools like PAVIS (209) may be used. Single-cell RNA-seq data visualization suites, such as SingleCell Signature Explorer (210), can be useful for exploratory analysis of such datasets. In the case of chromatin capture data, visualization toolboxes such as HiBrowse (211), the Epigenome Browser (212), and Juicebox (213) are available. For a thorough review of Hi-C visualization consult (214).

Common exploratory data analysis tools are implemented either in base R or as packages from CRAN (since their use is not necessarily limited to biological data). However, there are packages providing integrated EDA tools for multiomics and oncology. The OMICsPCA package (215) provides omics-oriented tools for principal component analysis (PCA). The CancerSubtypes package (216) contains several data preprocessing, quality control, and clustering methods, focused on the identification of cancer subpopulations from multi-omics data. Biocancer (217) provides an interactive multi-omics data exploratory toolkit. The omicade4 package (218) provides an implementation of multiple co-inertia analysis (MCIA), another dimensionality reduction technique; these tools were used for the integration of transcriptome and proteome data from the NCI-60 cancer cell line panel. The Multi-omics Autoencoder Integration (maui) is a Python tool for multi-omics data analysis. It combines a latent factor model with artificial neural networks for multiomics data integration. iClusterPlus is a Bioconductor package based on the original iCluster (219) algorithm for integrative cluster analysis combining different types of genomic data.
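As a minimal illustration of the dimensionality reduction that such packages build upon, the sketch below extracts the first principal component of a small samples-by-features matrix using power iteration on the covariance matrix. It is a didactic, standard-library sketch rather than the algorithm of any package named above; the toy data and function names are assumptions, and a production analysis would use an established numerical library.

```python
import math


def center(X):
    """Subtract the feature-wise mean from every sample (rows = samples)."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def covariance(Xc):
    """Sample covariance matrix (features x features) of centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(C, iters: int = 500):
    """Leading eigenvector of C by power iteration.

    The uniform starting vector is an assumption of this sketch; it fails in the
    rare case where it is exactly orthogonal to the leading eigenvector.
    """
    p = len(C)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(Xc, v):
    """Scores of each centered sample along direction v."""
    return [sum(a * b for a, b in zip(row, v)) for row in Xc]


# Toy data: 4 samples, 2 perfectly correlated features.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
Xc = center(X)
pc1 = first_component(covariance(Xc))
scores = project(Xc, pc1)
```

For multi-omic data, the same procedure could be applied to a matrix in which the scaled features of each omic layer are concatenated, although dedicated methods such as MCIA treat the layers more carefully.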

#### 4.4. Statistical Models: Classificators, Predictors, and Feature Selection

Exploratory methods provide a useful description of biological phenomena. Nevertheless, in the oncology context, the identification of actionable elements is most desired, to generate translational value. The generation of models and feature selection strategies can lead to such results.

In this context, statistical models are computational (and thus mathematical) representations of the relationships between observed variables. These models can be useful to solve a given task based on some input data (220). Examples of these tasks include the classification of samples and the prediction of the state of a feature of interest.

Classification models have important biomedical applications (185). If a classification model is able to discriminate between physiological states, it can have translational use: a model that discriminates between health and disease has diagnostic utility; a model that discriminates between different disease outcomes has prognostic utility, which can be used for stratification purposes. Molecular classifiers have been quite successful in oncology, perhaps the best example being breast cancer (221). Classification models can be developed using supervised methods (that is, the model is trained with class information), but unsupervised methods, such as the previously discussed clustering, may also be able to recover groupings that capture biological and clinical differences.

Predictive models can provide insights into the molecular mechanisms driving physiological states. These can reveal the interactions between different omics, as well as between individual biomolecules. Furthermore, predictive models can have translational applications, including their use in prognostic tools (222).

Feature selection consists in the selection of a subset of measured variables that are most informative: that is, they contribute the most for the model to accomplish its task. Proper feature selection is important for biomedical models (223), as (1) removing uninformative ("irrelevant" or "redundant") features simplifies the model and increases its performance; and (2) a smaller set of features is less expensive to measure, increasing the translational potential of a given model.

Common applications of statistical models in the clinical context of cancer are the prediction of susceptibility, recurrence, and survival (223). Additionally, classification and association models are regularly used for the interpretation of molecular studies of cancer. For instance, biomarker discovery (224) is an often sought target for modeling based on biochemical and multi-omics analyses. This is an important area of study, since actionable biomarkers are not particularly common (225).

#### 4.4.1. Implementations and Use-Cases

Novel tools for the implementation of oncology models using model data are being released constantly. Many of these packages combine exploratory, supervised, and unsupervised tools, providing a wide range of analysis tools. mixOmics (190) is a self-described omics data integration project; it includes an eponymous package that provides different exploratory and integrative multivariate methods, including (independent) PCA, Canonical Correlation Analysis, Partial Least Squares regression (PLS), and PLS-Discriminant Analysis (DA). Part of the larger project is the Data Integration Analysis for Biomarker discovery using Latent Variable approaches for 'Omics studies (DIABLO) framework, which has been used for the identification of a multiomics signature of breast cancer molecular subtypes (226).

Other tools also follow this combined design principle. The ropls package (227), for instance, incorporates tools for PCA, as well as (Orthogonal) PLS. Multi-Omics Factor Analysis (MOFA) is implemented in the eponymous package (228). This factor analysis model has been used for the unsupervised detection of groups in a leukemia dataset, and the selection of informative multi-omic features associated with oxidative stress. OmicsMarkeR (229) also provides a variety of classification and feature selection tools; originally developed for metabolomics, this tool has been used for the study of skin cancer progression (230). Some packages include different classifier methods to generate an ensemble model; such is the case of Biosigner (231), which combines PLS-DA, Random Forests, and Support Vector Machines to select discriminant features across omics.

We agree with the assumption that multi-omics specific tools can improve workflows by adhering to a single design philosophy. However, we also agree that this is convenient, but not necessary. For instance, a diagnostic panel for pancreatic cancer was recently identified with a Random Forest implementation (232) using genomics, transcriptomics, and immunohistochemistry data. In another study, biomarker candidates for pancreatic cancer are identified using a Support Vector Machine on miRNA and gene transcriptomics (233).

Predictive models can be used to identify the contribution of one omics layer to the activity of another. For instance, epigenomix (234) uses Bayesian mixture models to integrate ChIP-seq and gene transcription data. The Integrative analysis of Multi-omics data for Alternative Splicing (235) package integrates expression, sQTLs, and methylation to provide mechanistic insights behind the manifestation of alternative splicing.

Predictive methods have been used to integrate multi-omics with other sources of big data, with publicly available implementations. The packages rexposome and omicRexposome (236) have been used to study the exposome, defined as the set of environmental exposures. Using multi-canonical correlation analyses and multiple co-inertia analysis, exposome-wide associations have been made to multi-omic data. The OmicsLonDA package (237) offers a method that uses linear mixed-effect models and smoothing spline regression models to identify time periods with differential omics levels. A highlight of this package is the consideration for the use of physiological measurements from wearable sensors, which may provide applications for nowcasting, the prediction of near-future states.

#### 4.4.2. Functional Aggregation

One could argue that analysis methods can be more informative if there is a way of associating the findings to the wider body of biomedical knowledge. Mapping omics data to functional features, such as pathways and functional genesets, is a strategy that can provide such readily interpretable results. Functional enrichment approaches, such as over-representation analysis (ORA) and gene-set enrichment analysis (GSEA), are effectively feature extraction methods that can be used as biologically relevant dimensionality reduction methods. The results of such methods can serve as starting points for more complex models, such as interactions among functions (238). For a detailed discussion of functional analysis, see (84).

The development of methods for effective functional enrichment based on multi-omics data is ongoing. Multi-omics gene-set analysis (MOGSA) (239) approaches the problem by using multivariate analysis, and using projections of data and genesets to lower dimensional spaces, to generate an enrichment score. Massive integrative gene set analysis (MIGSA) (240) takes a different approach, making independent functional associations for each omics layer (using ORA and Functional Class Scoring). Instead of providing an aggregated measurement, the functional associations of each layer are stored in a special data structure, allowing flexible analyses. This method has been used to functionally characterize breast cancer molecular subtypes from a multi-omics perspective.

Functional aggregation can be used as the basis for other data analysis tasks. In pathwayPCA (241), exploratory data analysis is done by analyzing the functional enrichment of each omics set separately, and aggregating them via consensus. This method was used to study heterogeneity in an ovarian cancer dataset. In the original work for the Divergence analysis (242) method for high-dimensional omics data analysis, the authors evaluate the effect of using functional aggregation for their data classification task. Functional aggregation methods are an important part of high-throughput drug initiatives, as can be seen by their prominence in the iLINCS platform (243).

#### 4.5. The Network Paradigm

As we have stated throughout this work, biological phenomena are complex, interconnected systems. The data that we recover from high-throughput multi-omics is not isolated. Any biological system is not just the sum of its parts, but the sum of its biological elements and their relationships. With this in mind, the integration of high-throughput data within a network paradigm becomes appealing. Some advantages of a network approach to multi-omics integration are:


A network perspective can enhance every aspect of the multi-omics analysis. For instance, mapping omics data to pathway networks can provide an opportunity to biologically contextualize the data. A classic tool for this is the pathview (246) package. The Graphite (247) package is a more flexible alternative, as it allows the visualization of pathways from different data sources, and provides proper graph objects that can be manipulated using network visualization tools. Recently, the metaGraphite package provided a major update to the original tool, effectively incorporating multi-omics through the addition of a metabolomics layer.

Network approaches can be used for classification and prognosis. For instance, the micrographite (248) package provides a method to integrate micro-RNA and mRNA data through their association to canonical pathways. This approach has been useful in identifying key micro-RNAs in myeloma (249), primary myelofibrosis (250), and ovarian cancer (251). Mergeomics (252) integrates data from genomic, epigenetic, and transcriptional association studies through a functional enrichment method, the results of which are used as the basis for a network construction; however, this tool has not been used in a cancer context. pwOmics (253) is another tool that leverages biological network knowledge to integrate multi-omics data. In particular, this tool is well-suited for the study of time series analyses.

While mapping data to predefined networks can be useful to gain a much-needed biological context, high-throughput technologies offer the opportunity to actually infer networks from the data itself. With such an approach, data analysis problems can be transformed into network analysis problems. For instance, feature clustering becomes network module detection, which can then be used as the basis for a functional enrichment analysis (254).

While network reconstruction from omics data can be a powerful tool, it should be stated that every network reconstructed from data has an underlying hypothesis, which defines what the links between elements represent. This hypothesis should be at the center of any interpretation of the topological or functional associations recovered from a network. Furthermore, one must remember that comparison between reconstructed networks of different biological conditions will yield information about biological differences only if the method for network reconstruction does not deviate for each condition. For a discussion on this subject, see (255). This point is particularly relevant when discussing multi-omics data integration, as many of the network reconstruction methods available were developed for gene expression data. Proper validation of a method should be conducted before using it with other types of data.

There are some recent implementations of network reconstruction methods that have been developed with multi-omics data in mind. MAGIA<sup>2</sup> (256) is a tool for the reconstruction of micro-RNA and transcription factor regulatory circuits; it has been used for the analysis of expression regulation in the NCI60 cell panel. The Discordant method (257) uses a mixture model to identify differential correlation: that is, statistical dependencies between feature pairs that are lost or gained from one biological state to another. This method has been evaluated for its use with different types of omics data. Netboost (258) is a network reconstruction method that infers statistical dependency based on multi-omics data, and uses a modularity approach to reduce dimensionality; the method has been used for the classification and survival analysis of acute myeloid leukemia data. AMARETTO (259) identifies pairwise relationships between different omic layers to select cancer driver genes. A module detection approach is used to construct a dimensionally reduced module network, which is further analyzed to identify molecular signatures.

Probabilistic network reconstruction is a powerful data analysis technique. In such a model, features are connected based on an information-theoretical similarity measure, such as mutual information, between their expression profiles. Unlike correlation metrics (260), mutual information can capture non-linear relationships between features, which makes it suitable for the analysis of transcriptomics (261). We have applied these methods for the reconstruction of micro-RNA and gene co-expression bipartite networks with minor adjustments; the analysis of such networks has yielded interesting insights on the nature of functional control by micro-RNAs (262). A current research interest of the authors of this work is the extension of probabilistic network reconstruction to multi-omics data, in order to construct probabilistic multilayer networks (263) that can be studied using the recent tensorial formalism of multilayer networks (264).
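As a hedged illustration of the contrast drawn above between correlation and mutual information, the following sketch computes both for a pair of toy profiles. The equal-width binning, the number of bins, the function names, and the data are assumptions made for this example only; they are not the estimators used in the cited works.

```python
from collections import Counter
from math import log2, sqrt


def discretize(values, n_bins=3):
    """Assign each value to one of n_bins equal-width bins (a simple, assumed binning)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # a constant profile falls into a single bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=3):
    """Plug-in estimate of mutual information (in bits) from binned profiles."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    # sum over observed bin pairs of p(a, b) * log2(p(a, b) / (p(a) * p(b)))
    return sum((c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy profiles (made-up numbers): y depends on x, but not linearly.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
print("Pearson correlation:", pearson(x, y))
print("Mutual information (bits):", mutual_information(x, y))
```

In this toy case the linear correlation vanishes while the binned mutual information stays positive. In practice the estimate is sensitive to the binning and to sample size, so a dedicated estimator would normally be preferred.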

# 4.6. Data Science in Biology—A Word of Warning

An important aspect of any data science project is the crucial role of both technical and domain specific expertise. The analysis of biological networks in particular can pose some complication for biological scientists not familiar with the field of network science; a network visualization may be presented as result, without an adequate evaluation of network topology or other structural and dynamic parameters. Similar behaviors can be found with other applications of data science tools.

A data-driven analysis without the participation of a domain expert risks the pursuit of non-relevant questions. On the other hand, even though a bioinformatics tool may be developed with increased usability in mind, the level of complexity of the computational method may require a deeper understanding of the algorithm's assumptions and limitations in order to reach valid results. With this in mind, it is evident that proper computational approaches to biological questions require a fundamental understanding of both the biology and the computation in order to reach scientifically solid conclusions. In many cases, the key to achieving this is to strive for multidisciplinary approaches.

# 5. CONCLUSION

Cancer is the paradigmatic complex phenotype. We have been able to capture some of this complexity via experimental measurements with the different high throughput biomolecular technologies generically termed omics. Each single-technology derived data type has its own set of caveats and complexities. An additional challenge lies in the fact that each data type is able to account for a fraction of the large set of cancer aspects or features. Recent times have witnessed the development of new ways to gather and analyze these partial information layers together, under the name of multi-omics.

There are, however, multiple approaches to multi-omic computational modeling and integration; some of the most relevant have been described and discussed here. Our aim has been to present the current state of the art of computational oncology tools for multi-omic studies of complex cancer phenotypes. Novel developments in multi-omic computational analysis come from different fields, ranging from purely mathematical developments (263, 264), to machine learning and computational intelligence applications (179, 223), to single-cell sequencing and imaging studies (139, 145) and more. However, in our view, the development of methods to integrate all these different analytical approaches into intelligible and statistically robust frameworks will provide the field with unprecedented advances both in our understanding of cancer biology and in our impact in clinical settings. The field is fast-growing and currently under development, with novel algorithmic approaches being constantly released, but we believe that the present account is a good starting point.

#### AUTHOR CONTRIBUTIONS

GA-J and EH-L contributed to reviewing and classifying the literature, structured the review, prepared the figures, wrote, and revised the manuscript. EH-L contributed to funding and general oversight of the project.

#### FUNDING

This work was supported by the Consejo Nacional de Ciencia y Tecnología [SEP-CONACYT-2016-285544 and FRONTERAS-2017-2115], and the National Institute of Genomic Medicine, México. Additional support has been granted by the Laboratorio Nacional de Ciencias de la Complejidad, from the Universidad Nacional Autónoma de México. EH-L was recipient of the 2016 Marcos Moshinsky Fellowship in the Physical Sciences.

#### ACKNOWLEDGMENTS

The authors would like to thank Dr. Laura Lucila Gómez Romero (INMEGEN) for a recent discussion on current sequence-based methods. **Figures 2**, **4** were generated using Biorender (https://biorender.com/). **Figure 4** includes images from Wikipedia, released under a Creative Commons Attribution-Share Alike 3.0.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2020.00423/full#supplementary-material

#### REFERENCES

reveals folding principles of the human genome. Science. (2009) 326:289–93. doi: 10.1126/science.1181369


usability by addressing inconsistency, sparsity, and high-dimensionality. BMC Bioinformatics. (2019) 20:339. doi: 10.1186/s12859-019-2929-8


microRNA-transcription factor mixed regulatory circuits (2012 update). Nucleic Acids Res. (2012) 40:W13–21. doi: 10.1093/nar/gks460


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 de Anda-Jáuregui and Hernández-Lemus. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multi-Omic Regulation of the PAM50 Gene Signature in Breast Cancer Molecular Subtypes

Soledad Ochoa<sup>1,2</sup>, Guillermo de Anda-Jáuregui<sup>1,3</sup>\* and Enrique Hernández-Lemus<sup>1,4</sup>\*

<sup>1</sup> Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico, <sup>2</sup> Graduate Program in Biomedical Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico, <sup>3</sup> Cátedras Conacyt para Jóvenes Investigadores, National Council on Science and Technology, Mexico City, Mexico, <sup>4</sup> Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico

Breast cancer is a disease that exhibits heterogeneity that goes from the genomic to the clinical levels. This heterogeneity is thought to be captured (at least partially) by the so-called breast cancer molecular subtypes. These molecular subtypes were initially defined based on the unsupervised clustering of gene expression and its correlate with histological, morphological, phenotypic and clinical features already known. Later, a 50-gene signature, PAM50, was defined in order to identify the biological subtype of a given sample within the clinical setting. The PAM50 signature was obtained by the use of unsupervised statistical methods, and therefore no limitation was set on the biological relevance (or lack of) of the selected genes beyond its predictive capacity. An open question that remains is what are the regulatory elements that drive the various expression behaviors of this set of genes in the different molecular subtypes. This question becomes more relevant as the measurement of more biological layers of regulation becomes accessible. In this work, we analyzed the gene expression regulation of the 50 genes in the PAM50 signature, in terms of (a) gene co-expression, (b) transcription factors, (c) micro-RNAs, and (d) methylation. Using data from the Cancer Genome Atlas (TCGA) for the Luminal A and B, Basal, and HER2-enriched molecular subtypes as well as normal tumor adjacent tissue, we identified predictors for gene expression through the use of an elastic net model. We compare and contrast the sets of identified regulators for the gene signature in each molecular subtype, and systematically compare them to current literature. We also identified a unique set of predictors for the expression of genes in the PAM50 signature associated with each of the molecular subtypes. Most selected predictors are exclusive for a PAM50 gene and predictors are not shared across subtypes. There are only 13 coding transcripts and 2 miRNAs selected for the four subtypes. 
MiR-21 and miR-10b connect almost all the PAM50 genes in all the subtypes and normal tissue, but do so in a mutually exclusive manner, suggesting a cancer switch from miR-10b coordination in normal tissue to miR-21. The sets of selected predictors of PAM50 genes that enrich for a function across subtypes support the idea that different regulatory molecular mechanisms are taking place. With this study we aim at a wider understanding of the regulatory mechanisms that differentiate the expression of the PAM50 signature, which in turn could perhaps help understand the molecular basis of the differences between the molecular subtypes.

Keywords: multi-omic approaches, breast cancer subtypes, PAM50, elastic net, data integration

#### Edited by:

Chiara Romualdi, University of Padova, Italy

#### Reviewed by:

Tanja Kunej, University of Ljubljana, Slovenia Valentina Silvestri, Sapienza University of Rome, Italy

#### \*Correspondence:

Guillermo de Anda-Jáuregui gdeanda@inmegen.edu.mx Enrique Hernández-Lemus ehernandez@inmegen.gob.mx

#### Specialty section:

This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology

Received: 01 December 2019 Accepted: 29 April 2020 Published: 22 May 2020

#### Citation:

Ochoa S, de Anda-Jáuregui G and Hernández-Lemus E (2020) Multi-Omic Regulation of the PAM50 Gene Signature in Breast Cancer Molecular Subtypes. Front. Oncol. 10:845. doi: 10.3389/fonc.2020.00845

### 1. INTRODUCTION

Breast cancer is the most common cause of cancer death among females (1). Breast tumors have been classified into molecular subtypes with distinctive clinical characteristics and a recognizable gene expression signature (2). This signature has been reduced to the 50 genes that achieve the best separation of subtypes, yielding the PAM50 classifier (3). However, the physiological implications of the difference in gene expression, if any, are not well-understood.

Given that gene expression is regulated by several interconnected mechanisms (4–7), differences across subtypes are expected for these mechanisms. Evidence of this was found in the form of distinguishable patterns of DNA methylation, mutation and miRNA expression that shape groups partially equivalent to the molecular subtypes (8). These patterns imply a link between the different omics and PAM50 gene expression, but do not clarify which genomic, epigenetic or post transcriptional changes drive the expression signature of such molecular subtypes. To advance in the identification of such drivers of molecular subtypes expression, we propose the use of a sparse model of PAM50 gene expression.

Sparse models select the best predictors of a dependent variable by fitting penalized linear models. The aim of penalizing the regression coefficients is to shrink them toward zero, in such a way that predictors contributing little to the prediction, i.e., poorly associated with the dependent variable, end up with null coefficient values and are filtered out of the model (9). Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), and Elastic Network methods apply different penalizations. The elastic network approach selects groups of pairwise correlated variables instead of choosing a single predictor from the group (10, 11), enlarging the space of predictors of interest but also increasing false positive rates (12).
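For reference, one common way of writing the elastic net objective for a continuous response is sketched below, following the parameterization documented for the glmnet package. The notation ($N$ samples, response $y_i$, predictor vector $x_i$, intercept $\beta_0$, coefficients $\beta$, mixing parameter $\alpha$, and shrinkage parameter $\lambda$) is introduced here for illustration and is not taken from the original text.

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2+\alpha\,\lVert\beta\rVert_1\right]
$$

Under this form, $\alpha = 1$ gives the LASSO penalty, $\alpha = 0$ gives the ridge penalty, and intermediate values mix the two; larger values of $\lambda$ shrink the coefficients more strongly.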

Sparse models have been proposed for multi-omic sample classification (13, 14) and biomarker identification (15–17); but their capacity to simplify multi-omics co-interpretation has only been tested in the evaluation of the extent of different omics effects over a phenotype (18, 19). Here, the predictor selection capability of the elastic network approach is exploited to identify the CpGs, coding transcripts, and miRNAs most associated with the expression of the PAM50 genes in order to outline molecular differences behind the gene expression patterns characterizing breast cancer subtypes within a true multi-omic framework. The hypothesis is that PAM50 gene expression patterns are accompanied by distinctive regulatory elements, reflecting the way gene expression is controlled in the different breast cancer subtypes.

#### 2. METHODS

#### 2.1. Data Acquisition

Concurrent experimental samples of DNA methylation, transcript and miRNA expression were downloaded from the GDC (https://portal.gdc.cancer.gov/repository) in May 2019. Only samples with Illumina Human Methylation 450, RNA-seq and miRNA-seq measures were kept, filtering out samples quantified with the Illumina Human Methylation 27 BeadChip, which covers a smaller portion of the genome than the one we wanted to target. Subtype classification was also downloaded from the GDC through the TCGAbiolinks R package (20).

After preprocessing the data according to Aryee et al. (21), Tarazona et al. (22), and Tam et al. (23), and using biomaRt v95, values of methylation for 384,575 probes and expression for 16,475 coding transcripts and 433 miRNA precursors were obtained for 45 unique samples of Her2, 395 LumA, 128 LumB, and 125 Basal subtypes, plus 75 samples of non-tumor (normal adjacent) tissue.

#### 2.2. Elastic Network Implementation

The three different data types were concatenated and normalized to have mean = 0 and standard deviation = 1. Eighty percent of the samples for each subtype were used for training, leaving the rest for testing, as in Liu et al. (13). Using the R package glmnet (24), elastic network models were fitted per subtype for each gene in the PAM50 classifier with the linked script https://github.com/CSB-IG/PAM50multiomics/blob/master/enetGLMNET.R. The mixing parameter was held fixed at 0.5 because this value has shown good performance (10), while the shrinkage parameter (λ) was optimized over values between 0.001 and 1,000 through repeated cross-validation.
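The preparation steps described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' R script: the function names, the random seed, the use of the population standard deviation, and the logarithmic spacing and size of the λ grid are all assumptions, since the text only states the target mean and standard deviation, the 80/20 split, and the range of λ.

```python
import random
from math import log10
from statistics import mean, pstdev


def standardize(column):
    """Scale one feature to mean 0 and standard deviation 1 (population SD assumed)."""
    mu, sd = mean(column), pstdev(column)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in column]


def train_test_split(sample_ids, train_fraction=0.8, seed=0):
    """Shuffle the samples of one subtype and keep train_fraction of them for training."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


def log_lambda_grid(lo=1e-3, hi=1e3, n=13):
    """Candidate shrinkage values between lo and hi, evenly spaced on a log10 scale."""
    step = (log10(hi) - log10(lo)) / (n - 1)
    return [10 ** (log10(lo) + i * step) for i in range(n)]


# Example with the 45 Her2 samples mentioned in the text (identifiers are placeholders).
train_ids, test_ids = train_test_split(range(45))
print(len(train_ids), len(test_ids))
print(log_lambda_grid())
```

Splitting before any parameter tuning keeps the test samples out of the cross-validation used to choose λ.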

Cross-validation was repeated 100 times with k = 3 folds for the subtypes with <100 training samples (Her2+ subtype and normal tissue) and k = 5 folds for the more represented subtypes (Luminal A, Luminal B, and Basal). The chosen λ values were used to predict the testing data, and the root mean squared error (RMSE) was calculated per model. Fitting was repeated with the same specifications using only 40 samples per subtype, to verify the effect of data set size.
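A minimal sketch of the resampling scheme described above is given below. It only generates the repeated k-fold partitions and defines the RMSE; the model fitting itself is omitted. The function names, the seed, and the way folds are formed from a shuffled index list are assumptions for illustration, not a description of the linked script.

```python
import random
from math import sqrt


def choose_k(n_train):
    """Rule stated in the text: 3 folds below 100 training samples, 5 otherwise."""
    return 3 if n_train < 100 else 5


def repeated_kfold(n_samples, k, repeats, seed=0):
    """Yield (train_indices, validation_indices) for each fold of each repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k nearly equal-sized folds
        for held_out in folds:
            held = set(held_out)
            yield [i for i in order if i not in held], held_out


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# 36 training samples (80% of the 45 Her2 samples) give 3 folds x 100 repetitions.
n_train = 36
splits = list(repeated_kfold(n_train, choose_k(n_train), repeats=100))
print(len(splits))
```

Each candidate λ would be scored by its average validation error over these partitions, and the best-scoring value then used once on the held-out test samples.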

#### 2.3. Omics Comparison

For each PAM50 gene model, RMSE was calculated for the testing data either with (1) the complete set of selected predictors, (2) only the selected CpGs, (3) only the selected coding transcripts, or (4) only the selected miRNAs. Omic-specific RMSEs were evaluated by zeroing all coefficients not associated with the omic of interest in the already fitted models with the linked script https://github.com/CSB-IG/PAM50multiomics/blob/master/RMSEperOmics.R, in an approach similar to the one used by Setty et al. (25) to search for key regulators. The values obtained form RMSE distributions per omic, which were compared via the Kolmogorov–Smirnov test. This was done both per subtype per omic and pooling all the subtypes into one distribution per omic. The p-values obtained were corrected for multiple testing with the FDR method.
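The coefficient-zeroing idea can be illustrated with a small sketch. The fitted coefficients, feature names, omic labels, and test values below are all made up for the example, and the intercept is assumed to be zero (as would be expected for standardized data); how the linked script treats the intercept is not stated in the text.

```python
from math import sqrt


def predict(intercept, coefs, sample):
    """Linear prediction from a dict of coefficients and a dict of feature values."""
    return intercept + sum(b * sample[name] for name, b in coefs.items())


def keep_only(coefs, omic_of, omic):
    """Copy of the coefficients with every feature outside the given omic set to zero."""
    return {name: (b if omic_of[name] == omic else 0.0) for name, b in coefs.items()}


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical fitted model and test data (names and numbers are illustrative only).
coefs = {"cpg_1": 0.5, "transcript_1": -1.0, "mirna_1": 2.0}
omic_of = {"cpg_1": "CpG", "transcript_1": "transcript", "mirna_1": "miRNA"}
test_samples = [
    {"cpg_1": 1.0, "transcript_1": 0.0, "mirna_1": 0.5},
    {"cpg_1": -1.0, "transcript_1": 1.0, "mirna_1": 0.0},
]
observed = [1.5, -1.5]

rmse_by_omic = {}
for omic in ("all", "CpG", "transcript", "miRNA"):
    used = coefs if omic == "all" else keep_only(coefs, omic_of, omic)
    predicted = [predict(0.0, used, s) for s in test_samples]
    rmse_by_omic[omic] = rmse(observed, predicted)

print(rmse_by_omic)
```

Collecting such omic-specific errors over all PAM50 gene models gives the per-omic distributions that are then compared; the distribution comparison itself is not reproduced here.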

#### 2.4. Test vs. Reported Links Between Predictors and PAM50 Genes

Enrichment for previously reported regulatory links between PAM50 genes and CpGs, TFs, and miRNAs was tested with Fisher's Exact Test. When tests were repeated by subtype, p-values were adjusted by FDR. Regulatory targets were taken from Illumina's annotation in the case of CpGs and from databases accessible through R packages in the case of TFs and miRNAs, with the linked script https://github.com/CSB-IG/PAM50multiomics/blob/master/validateInteractions.R. The tftargets package (https://github.com/slowkow/tftargets) was used to retrieve TF targets. It queries both predicted and validated data from the TRED (2007), ITFP (2008), ENCODE (2012), and TRRUST (2015) databases, as of the date given in parentheses next to each resource, plus the lists curated by Neph et al. (26) and Marbach et al. (27).

genes are regulated (as predicted by the model). Results may shine some new light on the way PAM50 genes are able to capture intrinsic features of these phenotypes.

The package used to retrieve miRNA targets is multiMiR v2.2 (28); it queries DIANA-microT-CDS, ElMMo, MicroCosm, miRanda, miRDB, PicTar, PITA, TargetScan, miRecords, miRTarBase, and TarBase, also reporting both experimentally validated and predicted results. Universe sizes for the enrichment tests were taken from these databases, constrained to regulators measured in the input datasets. The hypothesis is that the models select previously reported associations between a PAM50 gene and a regulator measured in the input dataset more often than expected by chance.
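The enrichment test and the multiple-testing correction described in this section can be sketched as follows. The one-sided ("greater") alternative is an assumption based on the stated hypothesis, the FDR adjustment is assumed to be the Benjamini-Hochberg procedure, and the layout of the 2x2 table and all counts are illustrative rather than taken from the study.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a = selected and reported, b = selected and not reported,
    c = not selected but reported, d = neither. Returns P(X >= a) under the
    hypergeometric null with fixed margins.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(total, row1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts for one subtype, followed by adjustment of several made-up p-values.
print(fisher_exact_greater(3, 1, 1, 3))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The universe (the sum of the four cells) corresponds to the candidate regulator and gene pairs that could have been both measured and reported, which is why the text constrains it to regulators present in the input datasets.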

#### 2.5. Analysis of the Selected Predictors

Selected predictors and associated coefficient values were loaded into Cytoscape to construct a network of PAM50 gene predictors per subtype. PAM50 genes are taken as targets while predictors are sources; this makes a directed network in which out- and in-degree are estimated. Predictors with the largest outdegree were submitted to an analysis of differential expression, and their coefficient value distributions were compared to the global miRNA distribution via Kolmogorov–Smirnov tests. The differential analysis of miRNA expression was done per subtype with the treat function of the limma package, in order to control for both fold change and significance (29). A minimum fold change of 1.1 was used.
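The degree computation on the directed predictor-to-gene network can be sketched outside Cytoscape as below. The edge list, node names, and coefficient values are placeholders invented for the example; only the source and target convention (predictors as sources, PAM50 genes as targets) comes from the text.

```python
from collections import Counter


def degrees(edges):
    """Out-degree of each source and in-degree of each target in a directed edge list."""
    out_degree = Counter(source for source, _target, _coef in edges)
    in_degree = Counter(target for _source, target, _coef in edges)
    return out_degree, in_degree


# Each edge is (predictor, PAM50 gene, regression coefficient); values are made up.
edges = [
    ("predictor_A", "gene_1", 0.30),
    ("predictor_A", "gene_2", -0.10),
    ("predictor_B", "gene_1", 0.05),
    ("predictor_C", "gene_3", -0.20),
]

out_degree, in_degree = degrees(edges)
print(out_degree.most_common(1))  # predictor linked to the most genes
print(in_degree)                  # number of selected predictors per gene
```

A predictor with a large out-degree is one selected in the models of many PAM50 genes, which is the property used to pick candidates for the differential expression analysis.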

#### 2.6. Gene Enrichment Analysis

Every set of predictors selected for a PAM50 gene was submitted to functional enrichment analysis with the R package HTSanalyzeR v2.13.1 (30) versus the GO-BP with the linked script https://github.com/CSB-IG/PAM50multiomics/blob/master/enrichment.R. Sets enriched across subtypes were further tested via Fisher's Exact Test with the alternative hypothesis that selection in one subtype is exclusive with regard to selection in another subtype.

The code to perform all previous analyses (see **Figure 1**) can be found at the following GitHub repository: https://github.com/CSB-IG/PAM50multiomics

# 3. RESULTS

Elastic network models were fitted per gene, regressing PAM50 gene expression on DNA methylation, miRNA and coding transcript expression. Elastic network models shrink the regression coefficients toward 0, filtering predictors by their strength of association with the variable of interest. This ability for feature selection was applied to unfiltered omic data to identify the CpGs, coding transcripts and miRNAs most related to the PAM50 genes in cancer subtypes and normal tissue.

We fitted five models for each PAM50 gene, one per subtype and one for the normal tissue, since differences are expected for each of the 5 phenotypes. Descriptors of models per subtype and omic are reported in **Table 1**.

The output of the models is a list of associations between PAM50 genes and the selected predictors. Each selected predictor has a regression coefficient whose value reflects the extent of association with the PAM50 gene. Coefficients of selected predictors are never zero, since a zero value means the predictor is filtered out of the prediction, but they can be either negative or positive, indicating opposite effects on the predicted value. Lists of associations shape networks like the one represented in **Figure 2**. Networks for the other subtypes and the normal tissue can be found in **Figures S1–S4**.

TABLE 1 | Size of input and output of the models per subtype: Basal, Her2+, Luminal A, Luminal B as well as normal (i.e. tumor-adjacent healthy tissue).

From observation of the networks of selected predictors and PAM50 genes, it is evident that CpGs are the most frequently selected predictors, followed by transcripts, with only a few miRNAs selected. It can also be seen that most predictors are exclusive to one PAM50 gene, but all the PAM50 genes share predictors whose pattern of expression or methylation links one gene to another. This suggests that the expression of the complete PAM50 set is coordinated, regardless of whether a gene belongs to the luminal, basal, or any other signature.

### 3.1. Omics Contribute Differently to PAM50 Gene Expression Prediction in Normal Tissue and Cancer

In order to test the reliability of the fitted models, we checked the prediction error and the selection of previously reported associations. Previously reported regulation through DNA methylation, miRNA, or TF targeting is hence regarded as a true positive and compared with the models' results.

The proportion of selected predictors cannot be explained solely by the size of the omics taken as input (χ², p-value < 2.2e-16, **Figure 3**); specifically, coding transcripts and miRNAs are overrepresented in the models (Fisher's Exact Test, p-value < 2.2e-16). Concordantly, there are more true TF (Fisher's Exact Test, p-value ≤ 1.942846e-05) and miRNA (Fisher's Exact Test, p-value ≤ 7.573200e-11) relations than expected, but fewer CpGs (Fisher's Exact Test, p-value ≤ 4.311267e-03). The exception is the LumB subtype, which has as many true positive CpGs as expected.

Given the difference between input and selected proportion of omics, we hypothesized a discrepant prediction power of CpGs, coding transcripts, and miRNAs. To test this, we evaluated models carrying the complete set of selected predictors or just the predictors from each omic.

FIGURE 2 | Predictors selected per PAM50 gene for Basal subtype. (A) Topology of selected predictors and associated PAM50 genes. Brown circles are PAM50 genes. Transcripts are colored in dark green, miRNAs in blue, and CpGs in light green. Edges link each PAM50 gene with its selected predictors. The color of the line indicates the sign of the coefficient value associated with the predictor; negative values are in brown and positive ones in green. Zoom of the gray area shows the predictors selected for MYBL2. (B) Summary of the network. Barplot with the total representation of each omic plus heatmap of the count of predictors shared by PAM50 genes.

RMSE is a standard measure for comparing regression models: it quantifies how far the model prediction is from the observed data, in units of the response variable, so the lower its value the better. Normally, the error decreases as more independent predictors are included in the model, so we chose not to refit with the selected predictors of each omic, but to test the exact same model with the jointly fitted coefficient values, zeroing only the coefficients of predictors from omics other than the one of interest. This way, the RMSE distribution of a model containing only predictors of a given omic represents how much of the total prediction is contributed by the predictors from that omic.
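The zeroing procedure can be sketched as follows. This is a toy illustration and not the authors' code: the helper names, the predictor names, the coefficients, and the data are all hypothetical. The point is only that the jointly fitted coefficients are reused unchanged and predictors from other omics are dropped at prediction time.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error, in the units of the response variable."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def predict(rows, intercept, coefs, omic_of, keep_omic=None):
    """Linear prediction. If keep_omic is set, coefficients from other omics are zeroed."""
    predictions = []
    for row in rows:
        total = intercept
        for name, coef in coefs.items():
            if keep_omic is None or omic_of[name] == keep_omic:
                total += coef * row[name]
        predictions.append(total)
    return predictions

# Hypothetical jointly fitted model and data (all values are made up).
omic_of = {"cpg_1": "CpG", "tx_1": "transcript", "mir_1": "miRNA"}
coefs = {"cpg_1": -2.0, "tx_1": 0.5, "mir_1": 0.1}
intercept = 1.0
rows = [
    {"cpg_1": 0.2, "tx_1": 4.0, "mir_1": 3.0},
    {"cpg_1": 0.8, "tx_1": 2.0, "mir_1": 1.0},
]
observed = [3.0, 0.4]

rmse_full = rmse(observed, predict(rows, intercept, coefs, omic_of))
rmse_by_omic = {
    omic: rmse(observed, predict(rows, intercept, coefs, omic_of, keep_omic=omic))
    for omic in ("CpG", "transcript", "miRNA")
}
print(rmse_full, rmse_by_omic)
```

Because no refitting takes place, the per-omic errors are read relative to the full model rather than as the best achievable error for each omic alone.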

As suggested by the difference with the input proportions, DNA methylation is the least predictive omic for all the subtypes, though this difference is not always significant (CpGs vs. coding transcripts ks.test p-value ≤ 0.03192 for LumB, Her2+, and Basal, and CpGs vs. miRNAs ks.test p-value ≤ 0.02222 for Her2+ and Basal). This disagrees with the great prediction improvement reported by Huang et al. (16) for methylation data, a discrepancy that could be driven by the much larger and more heterogeneous input data used here, which we believe better capture the heterogeneity of breast cancer subtypes. Meanwhile, coding transcripts and miRNAs contribute the same, with no significant difference between their distributions for any subtype.

Remarkably, the error distribution obtained with the complete set of predictors significantly outperforms CpGs and, for some subtypes, miRNAs (ks.test p-value ≤ 0.02222 for LumA and Basal), but never outperforms coding transcripts. Single omics cannot beat the multi-omics error due to the design of the test, so the outperforming of CpGs and miRNAs is unsurprising. What is startling is the complete statistical agreement between the prediction power of multi-omics and that of coding transcripts, which supports gene expression as the current best biomarker of molecular subtypes. We must note, however, that this may be related to two facts: (1) more information is available on RNA, and (2) PAM50 was derived from expression signatures.

Finally, there is no significant difference across the RMSE distributions of the subtypes, for either single-omics or multi-omics, but the CpG (ks.test p-value ≤ 0.01601952), miRNA (ks.test p-value ≤ 0.002834981), and multi-omics (ks.test p-value ≤ 0.03919459) distributions of normal tissue differ from the distribution of each subtype. This suggests that these omics account for a different amount of PAM50 gene expression in normal tissue than in cancer; that is, the association of DNA methylation and miRNA expression with PAM50 gene expression is altered in cancer.

#### 3.2. The Association Strength Distributions of Predictors Are Different for Each Subtype

The difference between omics extends to coefficient values, shown in **Figure 4**. Since coefficients represent the strength of association between predictors and PAM50 expression (16), coefficient values suggest that each omic has a specific association with PAM50 gene expression. Coefficient value distributions are significantly different between subtypes (ks.test p-value ≤ 2.82E-02) and omics (ks.test p-value ≤ 0.01535), with a few exceptions for coding transcripts and miRNAs. The coding transcript coefficients of Basal, Her2+, and LumB are not significantly different. Neither are the miRNA coefficients of the pairs LumA and normal tissue, LumB and Basal, and Basal and Her2+.

According to these distributions, DNA methylation has a strong but noisy association with PAM50 gene expression, while miRNA (Fisher test p-values ≤ 0.001403597) and coding transcript (Fisher test p-values ≤ 1.086031e-29) associations tend to be positive (**Figure S3**) and more stable. The elevated association between DNA methylation and PAM50 gene expression explains why so many CpGs get selected in spite of their low prediction power. A stronger association between DNA methylation and gene expression than between gene and miRNA expression had previously been found for ovarian cancer by Sohn et al. (18) using a different penalized modeling approach.

#### 3.3. miR-21 and miR-10b Are the Only Relevant Predictors Selected Across Subtypes

Next, we wanted to see how variable the association between one predictor and the predicted PAM50 gene actually is, that is, the specific coefficient values rather than their distributions. For this, we focused on the predictors selected for a PAM50 gene across subtypes, shown in **Figure 5**. However, as noted before, the models selected a large number of predictors exclusive to each gene: 93.45% of the selected CpGs, 74.24% of the coding transcripts, and 81.37% of the miRNAs are not shared between any two genes. Consequently, there are no CpGs associated with any gene in all the subtypes, but there are 14 relations with coding transcripts and 51 with miRNAs that satisfy this.
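The exclusivity percentages quoted above amount to counting, within one omic, the predictors linked to a single PAM50 gene. A minimal sketch of that count, with a hypothetical helper name and placeholder pairs, could look like this:

```python
from collections import defaultdict

def exclusive_fraction(pairs):
    """Fraction of predictors that are linked to exactly one gene."""
    genes_per_predictor = defaultdict(set)
    for predictor, gene in pairs:
        genes_per_predictor[predictor].add(gene)
    exclusive = sum(1 for genes in genes_per_predictor.values() if len(genes) == 1)
    return exclusive / len(genes_per_predictor)

# Hypothetical (predictor, PAM50 gene) pairs for one omic (placeholders).
cpg_pairs = [
    ("CpG_1", "GENE_1"),
    ("CpG_2", "GENE_1"),
    ("CpG_3", "GENE_2"),
    ("CpG_4", "GENE_2"),
    ("CpG_4", "GENE_3"),   # CpG_4 is shared by two genes
]

print(f"{100 * exclusive_fraction(cpg_pairs):.2f}% of CpGs are exclusive to one gene")
```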

The 13 coding transcripts selected across subtypes as predictors of a specific PAM50 gene are trivial, since they just portray physical linkage. ELP2 and SLC39A6 are coded on opposite strands of the same locus, while the rest of the pairs are contiguous. Most of the associations, 84.77%, connect a PAM50 gene with a coding transcript on another chromosome, but these are not repeatedly selected across subtypes. It is worth mentioning that although all coefficient values are positive, even close predictors, like YEATS4 and SLC35E3, carry distinct coefficients.

Regarding miRNAs, only two are repeatedly selected across subtypes, miR-10b and miR-21. These are known breast cancer markers targeting some PAM50 genes (31). MiR-21 has been experimentally linked with BCL2, MYC, EGFR, and ERBB2 expression (32–35) and predicted to target ESR1 and FOXA1 (36, 37). On the other hand, miR-10b has been linked to CDC6, EGFR, and SFRP1 (38, 39). There is no particular pattern among validated associations or coefficients, other than miR-21 carrying mostly positive coefficient values and miR-10b selection extending to normal tissue (for the full set of validated interactions please see **Supplementary Table S1**).

#### 3.4. Micro-RNA miR-21 and miR-10b Are Universal PAM50 Predictors in Cancer and Health

Next, we wanted to check the role of miR-21 and miR-10b per subtype. With this in mind, we revisited the model-derived networks that link PAM50 genes and predictors per subtype.

The networks show that genes overexpressed in each subtype get larger models. About 30% of the luminal genes have models larger than average for LumA subtype, while almost 90% of basal genes have the equivalent for Basal subtype. Her2+ subtype and normal tissue have no clear pattern, but for LumB subtype, half the luminal genes and 28% of the proliferative ones have increased size models.

Predictors that bridge PAM50 genes can come from any omic, but CpGs are significantly underrepresented (Fisher test p-values ≤ 1.81E-88). CpGs are selected for at most two subtypes as predictors of a specific PAM50 gene. There are just 24 CpGs in this situation, of which 15 are shared between Her2+ and another subtype or the normal tissue, including nine CpGs associated with ERBB2 but located at loci outside chromosome 17.

Meanwhile, coding transcripts and miRNAs fulfill this role more often (Fisher test p-values ≤ 5.84E-03) than input proportions alone would explain. This is no surprise since both pertain to the same level of molecular features, that of transcripts, as the PAM50 gene expression signature; as such, coding transcripts and miRNAs may be subject to the same biomolecular pressures. The stunning observation is that one miRNA can link almost all of the PAM50 genes for all the cases (**Figure 6**). The outstanding miRNAs are again miR-21 and miR-10b.

For normal tissue, miR-10b was selected as a predictor of all PAM50 genes, while miR-21 is linked to only four genes. On the contrary, miR-21 is connected to most genes in all the breast cancer subtypes, while miR-10b is poorly linked. For LumA subtype, shown in **Figure 6B**, both miR-10b and miR-10a are highly connected, but still cannot reach genes like FOXC1, which is connected instead with miR-21.

Both miR-10a and miR-10b are members of the miR-10 family encoded within the Hox gene clusters; miR-10a resides upstream of HOXB4 and miR-10b upstream of HOXD4 (40). Due to their relatedness, they will be referred to as miR-10a/b.

The hub-like behavior of these miRNAs agrees with previous observations by our group of highly connected miRNAs per subtype (41), which are important for network cohesion (42). Although the coefficient networks maintain a large connected component when miR-10a/b and miR-21 are removed, tens to hundreds of predictors are then needed to link all the PAM50 genes, whereas only one of these miRNAs is required to achieve the same.

Given that each miRNA has the potential to target hundreds of genes (43), miR-10a/b and miR-21 are not so exceptional in this regard. However, as explained earlier, only a fraction of PAM50 genes have a regulatory relation with these miRNAs, suggesting most of the detected associations are indirect. Indirectness is consistent with the low values of their coefficients, which range from −0.2938690 to 0.4333184, whereas miRNA coefficient values overall reach up to two orders of magnitude higher. The coefficient value distributions of miR-10a/b and miR-21 are also significantly different from those of the rest of the miRNAs (ks.test p-value ≤ 9.068e-05).

#### 3.5. PAM50 Genes Enrich for Different Functions per Subtype

The selection of predictors we have presented is based on a statistical association with the pattern of expression of a PAM50 gene. The covariation sustaining such an association may respond to how a specific group of predictors is able to attain some biological function. To test this, functional enrichment was done with the set of selected predictors per gene per subtype, versus Gene Ontology Biological Processes categories (GO-BP) (**Figure 7**).

Only two PAM50 genes are enriched for some process in the Basal subtype, FOXC1 (basal cluster) and ANLN (proliferative cluster). Neither the ANLN enrichment for telomere protection nor the FOXC1 linkage to transforming growth factor response is among these genes' immediate annotated processes, though FOXC1 is actually related to TGFβ, since both are able to regulate EMT (44).

In the case of Her2+, just ORC6 (proliferative cluster) is enriched, for the totally unexpected process of synapse assembly; despite the significant p-value, we must note that this is based on only two genes.

LumA is the most enriched subtype. This is not surprising since it has the largest number of selected coding transcripts, which are the starting material for enrichment. The 20 enriched genes are mostly linked to distinct aspects of cell division. The exceptions are the three keratins, genes with basal expression, which are connected through their normal processes, suggesting that selected predictors respond to the gene's normal function. MYC and UBE2T are linked to rather wide categories (45), while MLPH associates with processes other than its normal ones. The remaining 14 genes are connected through categories consistent with their proliferative expression, which again alludes to a selection that followed the normal function of the genes, in line with the available evidence.

For LumB subtype, MELK and CCNB1 enrich for cell division, as would normally be expected, while MYBL2 is unintuitively linked to negative regulation of epithelial cell proliferation, which, however, has been reported (46). Finally, the normal tissue shows different cell division aspects coherent with the proliferative expression of its enriched genes.

Altogether, few genes have predictors with significant enrichment extended across subtypes. Eight genes are enriched in two subtypes, including CCNB1, MKI67, and UBE2C, which connect with the same, expected processes in both subtypes. MELK also connects with its normal process, but in LumA and LumB subtypes plus normal tissue. ANLN, CEP55, KRT17, MYBL2, and ORC6 enrich for different processes across subtypes; that is, a fifth of the genes with any kind of enrichment, but five of the nine genes enriched in more than one subtype.

To further test the functional enrichment per subtype, we compared the sets of predictors selected per subtype for each one of the 9 genes that enrich for several subtypes. Genes enriched for cell division across subtypes, CCNB1, MKI67, and MELK connect to the process via distinct sets of selected predictors. From the beginning, these genes bear different predictors (Fisher's Exact Test H1: less, p-value ≤ 1.281e-09), with a small intersection whose removal does not change the significant enrichment for cell division. This reflects the robustness of the process, which is so important that distinct subsets of the 603 genes annotated in the category are enough to call it.

The other two genes enriched for the same process across subtypes, UBE2C for mitotic cytokinesis and MELK for regulation of transcription involved in G1/S transition of mitotic cell cycle, lost the functional enrichment when the predictors selected in both LumA and normal tissue (the intersection) were removed. This implies that, in LumA, mitotic cytokinesis and regulation of transcription involved in G1/S transition of mitotic cell cycle may rely on the normal tissue mechanism.

The quantity of shared predictors between the sets selected for CEP55 indicates that predictor selection in the LumA subtype is mutually exclusive with selection in normal tissue (Fisher's Exact Test H1: less, p-value = 1.141e-10). This means that the differential enrichment between LumA and normal tissue is sustained by different predictors, suggesting CEP55 fulfills divergent roles in these phenotypes. This matches differences observed between cancer and normal tissue (47) but, to our knowledge, has not been reported for the LumA subtype.

The same reasoning supports divergent roles of KRT17 and ORC6 across subtypes. It is odd that KRT17 is linked to kinase signaling in normal tissue and not in a breast cancer subtype, when this has been described for another cancer (48), but this may be associated with tumor incidence over adjacent tissue (49). For ANLN and MYBL2, selection exclusion between subtypes is not significant, meaning that the differential enrichment of these genes could rest on the same predictors, suggesting functional diversity.

# 4. DISCUSSION

Sparse penalized models have already proven useful to discover molecular mechanisms, cluster samples, and predict outcomes such as survival (50). Penalization permits the fitting of models otherwise unattainable given the relatively small sample sizes and huge number of variables measured by the omics. Here, the elastic network approach was used for integrated interpretation of different omics measuring DNA methylation and expression of both coding transcripts and miRNAs.

However, a large training set is always preferable, and not all breast cancer subtypes have been extensively sampled, which is reflected in the models. For Luminal A, the most frequent and most sampled subtype, the highest number of predictors was selected by the models, while Her2+, with only 45 samples, got the lowest number of selected predictors. To ensure comparability across subtypes, we trained the models again using the same number of samples, 40, for all the subtypes. Patterns found with this subset persist in the analysis of the whole set of data, supporting comparability (**Figures S5–S8**). Nevertheless, it cannot be ruled out that predictors found for LumA are absent from the models of the smaller subtypes due to a lack of representation. This could specifically affect the functional enrichment of PAM50 neighborhoods of predictors, so the functional divergence between subtypes is not definitive and should be experimentally tested.
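The equal-size retraining described above starts from a balanced subsample. A minimal sketch of that step is shown below; the helper name, the seed, and the sample identifiers are hypothetical, and only the Her2+ count (45) and the subsample size (40) come from the text.

```python
import random

def subsample_equal(samples_by_group, size, seed=0):
    """Draw the same number of samples, without replacement, from every group."""
    rng = random.Random(seed)   # fixed seed so the draw is reproducible
    return {
        group: rng.sample(sample_ids, size)
        for group, sample_ids in samples_by_group.items()
    }

samples_by_group = {
    "Her2+": [f"her2_{i}" for i in range(45)],    # 45 samples, as stated in the text
    "LumA": [f"luma_{i}" for i in range(300)],    # hypothetical count
    "Basal": [f"basal_{i}" for i in range(120)],  # hypothetical count
}

balanced = subsample_equal(samples_by_group, size=40)
print({group: len(ids) for group, ids in balanced.items()})
```

Repeating the draw with several seeds would give a sense of how stable the selected predictors are under subsampling, although the text does not state whether this was done.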

Multi-omic modeling of PAM50 gene expression is no better than the sole use of coding transcripts, supporting gene expression as the best biomarker of molecular subtypes. However, our point in using the sparse model was not to predict PAM50 but to identify the molecular differences associated with PAM50 signatures that may lead to functional differences.

At the global level, a reduced prediction power of models containing DNA methylation and miRNAs was observed for all subtypes vs. normal tissue, indicating that the influence of these omics on PAM50 gene expression is reduced in cancer. Although this may stem from incomplete knowledge or incipient technology, an alteration of these omics has indeed been reported; specifically, a generalized hypomethylation has been observed for breast and other cancers (51).

Different predictors were expected per cancer subtype, but the exclusivity of predictors from all the omics was surprisingly high. Only 13 coding transcripts and 2 miRNAs were selected for the four subtypes. The lack of CpGs selected across subtypes is consistent with the high strength of association they have with PAM50 gene expression. If the pattern of expression is different between subtypes, the highly associated CpGs should be different.

The ubiquitous selection of miR-10b and miR-21 across subtypes suggests a central role for these miRNAs in breast cancer, which is supported by the literature. Proliferation, cell migration, and in vivo tumor growth of MCF7 and MDA-MB-231 cell lines implanted in nude mice are inhibited by antagomiR-21 (52), demonstrating the relevance of this miRNA, at least for luminal A and triple negative subtypes. In turn, both under- and overexpression of miR-10 are oncogenic. MiR-10b overexpression enhances cell migration and invasion by targeting HOXD10, while underexpression of miR-10b-3p, coded in the same miR-10b locus, participates in breast cancer onset by upregulating the cell cycle regulators BUB1, PLK1, and CCNA2 (53).

Coherent with the ubiquitous selection of miR-21 in breast cancer subtypes and its replacement by miR-10a/b in normal tissue, miR-21 is significantly overexpressed in all cancer subtypes while miR-10b is underexpressed, as previously reported (31). MiR-10a is significantly underexpressed in Basal and Her2+ subtypes and slightly overexpressed in luminal subtypes, although this is not significant for LumB. The proposal is that when miR-10b coordinates PAM50 genes, normal tissue expression is predicted; when miR-10b is underexpressed and miR-21 is overexpressed, the latter takes the place of miR-10b, coordinating cancer expression of the PAM50 genes. Since miR-10b has a known role in metastasis (31), it would be interesting to observe the dynamics of the networks throughout the evolution of the disease.

Additionally, the small coefficients associated with these miRNAs are consistent with indirect associations. Considering all these pieces, the transition from hub miR-10a/b in normal tissue to miR-21 in breast cancer, through the luminal subtypes, evokes a switch between two master regulators. Master regulators are genes needed for the specification of a lineage through their capacity to regulate downstream genes, either directly or indirectly, and whose misexpression can re-specify the fate of cells (54).

Nonetheless, sparse models cannot select regulators naively; they need to be fed with known regulators (16, 25, 55). Hence, the regulatory capacity of the selected predictors cannot be asserted, leaving miR-10a/b and miR-21 just as universal predictors of PAM50 genes.

Another limitation of the study is the absence of an estimator of significance or accuracy intrinsic to the methodology (56). Regression model quality is described in terms of RMSE, without an indication of how well the selected predictors describe PAM50 expression. A ROC curve is not feasible, since the models would have to be recast in a classification setting, and even this is unreachable because true negative regulators cannot be ascertained: non-regulators could simply be regulators yet to be discovered.

Finally, it is important to mention that applying the same shrinkage to inherently different molecular levels, like CpG methylation and transcript expression, could shrink to zero all the coefficients of subtler effect predictors (13). Thus, the next implementation of sparse multiomic models on PAM50 expression should adopt multiple penalizations, which could even ameliorate the bias on subtype representation (57). Distinct values for the mixing parameter should also be probed, as well as data decomposition into latent variables (58).

#### Future Directions

Apart from the exploration of alternative frameworks, the immediate follow-up should be the experimental assessment of the observations described here. Specifically, silencing and expression of miR-10a/b and miR-21 need to be tested for each breast cancer subtype. Dissection of the interaction between the miRNAs and the PAM50 genes is required too.

Then, more omics could be included in the models. Copy number variation is the first candidate to be incorporated since it is already available in the databases and has a proven effect on Her2+ subtype, in particular regarding the effect of the Her2 amplicon since it has been associated to regulation of growth and survival processes. But single nucleotide variation and chromatin accessibility are also available for some samples.

Other phenotypes with discriminant patterns of expression could benefit from sparse modeling. There could be significant predictors linked to the glioblastoma subtypes, as was observed for breast cancer. Predictors represent potential regulators of the mechanisms behind subtype heterogeneity and, as such, are interesting markers of cancer. In this sense, predictor selection across stages, not subtypes, could illuminate the driving forces behind disease development. Alternative methods like A–JIVE (59) and sPLS (60) could also yield exciting outcomes in these settings.

A relevant mid- to long-term future direction will be the implementation of experimental assays to test for multi-omic synergistic or cooperative phenomena, aiming to provide some mechanistic clues about the underlying biological functions. This poses a strong challenge, however, given the combinatorial mixture of effects that may be complex to disentangle. Some promising (yet preliminary) advances are starting to arise.

# 5. CONCLUSION

Holistic studies of cancer are needed to dissect its complexity. Initiatives like The Cancer Genome Atlas have delivered the distinct molecular perspectives that need to be interpreted as a whole. The elastic net models that are the subject of this work approach such an integration in a rather simplistic linear form. Yet, the methodology is powerful enough to prove the intuition that PAM50 gene expression patterns are accompanied by distinctive, potentially regulatory elements. Predictors are selected in an almost exclusive manner, heavily dictated by the omic of origin, with CpGs strongly associated with PAM50 expression not selected across subtypes. The way miR-10a/b and miR-21, the only relevant predictors selected for all subtypes, are connected and differentially expressed suggests a specific regulatory difference between breast cancer and normal tissue that merits further research.

### DATA AVAILABILITY STATEMENT

The datasets analyzed for this study can be found in the Genomic Data Commons site https://bit.ly/2Itoi2e. The code to perform all previous analyses can be found at the following GitHub repository: https://github.com/CSB-IG/PAM50multiomics.

### AUTHOR CONTRIBUTIONS

SO organized the database, performed the statistical analysis, and wrote the first draft of the manuscript. GA-J contributed to design of the study, generated programming code, and contributed to the writing of the manuscript. EH-L conceived the study, contributed to design of the study, provided funding, discussed findings, and reviewed the writing of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

### FUNDING

This work was supported by the Consejo Nacional de Ciencia y Tecnología [SEP-CONACYT-2016-285544 and FRONTERAS-2017-2115], and the National Institute of Genomic Medicine, México. Additional support has been granted by the Laboratorio Nacional de Ciencias de la Complejidad, from the Universidad Nacional Autónoma de México. EH-L is recipient of the 2016 Marcos Moshinsky Fellowship in the Physical Sciences.

### ACKNOWLEDGMENTS

This paper constitutes a partial fulfilment of the Graduate Program in Biomedical Sciences of the National Autonomous University of México (UNAM) requirements of SO (María de la Soledad Ochoa-Méndez). She acknowledges the scholarship and support provided by the National Council of Science and Technology (CONACyT) and UNAM. **Figure 1** was generated using Biorender (https://biorender.com/).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2020.00845/full#supplementary-material

**Figures S1–S4** depict the topology of the networks for the non-basal subtypes that were not shown. **Table S1** contains a list of all validated interactions.

#### REFERENCES

1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. (2018) 68:394–424. doi: 10.3322/caac.21492

2. Prat A, Pineda E, Adamo B, Galván P, Fernández A, Gaba L, et al. Clinical implications of the intrinsic molecular subtypes of breast cancer. Breast. (2015) 24:S26–35. doi: 10.1016/j.breast.2015.07.008


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Ochoa, de Anda-Jáuregui and Hernández-Lemus. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Big Data-Based Identification of Multi-Gene Prognostic Signatures in Liver Cancer

Meiliang Liu<sup>1†</sup>, Xia Liu<sup>2†</sup>, Shun Liu<sup>1</sup>, Feifei Xiao<sup>3</sup>, Erna Guo<sup>1,4</sup>, Xiaoling Qin<sup>1</sup>, Liuyu Wu<sup>1</sup>, Qiuli Liang<sup>1</sup>, Zerui Liang<sup>1</sup>, Kehua Li<sup>1</sup>, Di Zhang<sup>1</sup>, Yu Yang<sup>1</sup>, Xingxi Luo<sup>1</sup>, Lei Lei<sup>1</sup>, Jennifer Hui Juan Tan<sup>5</sup>, Fuqiang Yin<sup>6,7\*</sup> and Xiaoyun Zeng<sup>1,7\*</sup>

*<sup>1</sup> School of Public Health, Guangxi Medical University, Nanning, China, <sup>2</sup> Key Laboratory of Longevity and Ageing-Related Disease of Chinese Ministry of Education, Centre for Translational Medicine and School of Preclinical Medicine, Guangxi Medical University, Nanning, China, <sup>3</sup> Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC, United States, <sup>4</sup> School of International Education, Guangxi Medical University, Nanning, China, <sup>5</sup> School of Life Sciences and Chemical Technology, Ngee Ann Polytechnic, Singapore, Singapore, <sup>6</sup> Life Sciences Institute, Guangxi Medical University, Nanning, China, <sup>7</sup> Key Laboratory of High-Incidence-Tumor Prevention and Treatment, Guangxi Medical University, Ministry of Education, Nanning, China*

#### Edited by:

*Chiara Romualdi, University of Padova, Italy*

#### Reviewed by:

*Yanqiang Li, Houston Methodist Research Institute, United States Fang Wang, University of Texas MD Anderson Cancer Center, United States*

#### \*Correspondence:

*Fuqiang Yin yinfq@mail2.sysu.edu.cn Xiaoyun Zeng zengxiaoyun@gxmu.edu.cn*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology*

Received: *02 January 2020* Accepted: *29 April 2020* Published: *28 May 2020*

#### Citation:

*Liu M, Liu X, Liu S, Xiao F, Guo E, Qin X, Wu L, Liang Q, Liang Z, Li K, Zhang D, Yang Y, Luo X, Lei L, Tan JHJ, Yin F and Zeng X (2020) Big Data-Based Identification of Multi-Gene Prognostic Signatures in Liver Cancer. Front. Oncol. 10:847. doi: 10.3389/fonc.2020.00847*

Simultaneous identification of multiple single genes and multi-gene prognostic signatures with higher efficacy in liver cancer has rarely been reported. Here, 1,173 genes potentially related to liver cancer prognosis were mined with Coremine, and gene expression and survival data for 370 samples for overall survival (OS) and 319 samples for disease-free survival (DFS) were retrieved from The Cancer Genome Atlas. The results of numerous survival analyses revealed that 39 genes and 28 genes were significantly associated with DFS and OS in liver cancer, respectively, including 18 and 12 novel genes that have not been systematically reported in relation to liver cancer prognosis. Next, a total of 9,139 three-gene combinations (including 816 constructed from the 18 novel genes) for predicting DFS and 3,276 three-gene combinations (including 220 constructed from the 12 novel genes) for predicting OS were constructed based on the above genes, and the top 15 combinations of each of these four groups were selected and shown. Moreover, a large difference between the high and low expression groups of these three-gene combinations was detected, with a median survival difference of up to 65.01 months for DFS and up to 83.57 months for OS. Among the high or low expression groups of these three-gene combinations, the longest predicted DFS and OS are 71.91 months and 102.66 months, and the shortest are 6.24 months and 13.96 months, respectively. Quantitative real-time polymerase chain reaction and immunohistochemistry reconfirmed that three genes, *F2*, *GOT2*, and *TRPV1*, contained in one of the above combinations, are significantly dysregulated in liver cancer tissues, and that low expression of *F2*, *GOT2*, and *TRPV1* is associated with poor prognosis in liver cancer. Overall, we discovered several novel single genes and multi-gene combination biomarkers that are closely related to the long-term prognosis of liver cancer and that can be potential therapeutic targets for liver cancer.

Keywords: liver cancer, gene combinations, data mining, disease-free survival (DFS), overall survival (OS)


# INTRODUCTION

Liver cancer is the sixth most common cancer and the fourth leading cause of cancer-related deaths (1). Hepatocellular carcinoma (HCC) accounts for more than 90% of liver cancer cases from a histopathological perspective. According to the GLOBOCAN 2018 database, there are about 841,000 new HCC cases and 782,000 related deaths worldwide each year, with China accounting for nearly half of the total number of global HCC cases and deaths (2, 3). In China, the Guangxi province has higher morbidity and mortality rates than the national average (4). The high mortality and poor prognosis of HCC pose a global challenge. Despite a slight increase in the 5-year survival rate of liver cancer in China from 10.1% to 12.1% over 2003–2015, it remains low (5). A survival analysis of 2,887 liver cancer patients over 14 years showed that the 1-year, 3-year, and 5-year survival rates were 49.3%, 26.6%, and 19.5%, respectively (6).

Although there are many existing therapies for HCC including surgical resection, transplantation, ablation, and transcatheter chemoembolization, etc., the long-term survival of HCC patients remains poor due to their limited indications and different effects on prognosis (7–10). A 20-year prospective cohort analysis reported that the 5-year survival rates of TNM stage I, II, IIIA, and IVA patients after hepatectomy were 81.7, 77.2, 44, and 28.2%, respectively (11). Therefore, it is of crucial importance to explore new prognostic biomarkers and investigate treatment strategies to improve the overall prognosis of HCC patients.

Research on prognostic molecular markers of HCC is ongoing, and many single-gene or multi-gene combination molecular markers related to HCC invasion, metastasis, and prognosis are gradually being discovered. For example, the expression of HMGA1 in HCC is associated with poor prognosis and is found to promote tumor growth and migration in vitro (12). The overexpression of SYPL1 is associated with epithelial-mesenchymal transition (EMT) of HCC cells and can predict the prognosis of HCC (13). RBM8A and SIRT5 promote the migration and invasion of HCC cells by activating the EMT signaling pathway and targeting E2F1 (14, 15), respectively (16, 17). EpCAM (18), liver X receptor (LXR) (19), SPAG5 (20), and KOR (21) have been shown to be strongly correlated with HCC metastasis, invasion, or prognosis. Arginase-1, FTCD, and MOC-31 perform well in the diagnosis of HCC (22). TMEM88, CCL14, and CLEC3B can serve as potential prognostic markers of HCC (23). Some multi-gene combined prognostic studies on HCC have also been reported. For example, a three-gene combination marker (UPB1, SOCS2, RTN3) (24) and a four-gene combination model (CENPA, SPP1, MAGEB6, HOXD9) (25) can predict overall survival in patients with HCC.

However, because of limited sample sizes and heterogeneity among the samples in different studies, the efficiency of the identified prognostic markers for liver cancer still has ample room for improvement. In addition, because genes interact extensively and may synergistically promote disease progression, it is of great significance to find multi-gene combinations that may have better prognostic efficacy than single genes as prognostic targets for liver cancer. Therefore, leveraging the large sample sizes of public data platforms, together with new and effective mining and screening methods and reliable experimental verification, is a very promising direction for discovering effective single-gene and multi-gene combination prognostic markers of liver cancer.

High-throughput profiling technologies and bioinformatics methods are now being applied to all fields of biomedical research. A large amount of cancer data generated by those tools, such as mRNA expression, copy number variation, single nucleotide polymorphism (SNP), and microRNA expression, is collected in public archives such as The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/), Coremine (http://www.coremine.com/medical/), Oncomine (https://www.oncomine.org/resource/login.html), and the Gene Expression Omnibus database (GEO, https://www.ncbi.nlm.nih.gov/geo/). Making full use of the public data from these databases is meaningful for exploring and discovering effective HCC prognostic biomarkers. For instance, Li et al. (24) developed a three-gene prognostic signature comprising UPB1, SOCS2, and RTN3, which was shown to have prognostic value for HCC patients based on TCGA data. Our previous study used data retrieved from the Coremine, TCGA, and GEO databases and found that high expression of E2F transcription factor 3 is associated with poor prognosis of HCC (26).

In this study, we used a text mining approach to find a list of medically related candidate genes for liver cancer prognosis, obtaining a total of 1,173 genes that might be related to the prognosis of liver cancer. The association of the 1,173 genes with overall survival (OS) and disease-free survival (DFS) was assessed in a large TCGA cohort, in which subgroups of 319 patients with DFS data and 370 patients with OS data were available. Survival analyses were carried out for each of these genes to identify single prognostic markers. Moreover, we performed survival analyses of the gene combinations and multiple rounds of screening for these HCC prognostic molecular markers, revealing associations between the expression of numerous genes or gene combinations and survival in HCC patients. We then compared the ability of single genes and multi-gene combinations to predict the prognosis of HCC. A large difference between the high and low expression groups of these three-gene combinations was detected, with a median survival difference of up to 65.01 months for DFS and up to 83.57 months for OS. Across the high and low expression groups of these three-gene combinations, the longest predicted DFS and OS were 71.91 and 102.66 months, and the shortest were 6.24 and 13.96 months, respectively. Among the genes identified in the large sample data as potentially strongly correlated with the prognosis of HCC, the combination of three genes that have not been systematically reported, F2, GOT2, and TRPV1, had a strong ability to predict the prognosis of HCC. We further verified F2, GOT2, and TRPV1 using three independent expression profile microarray datasets for liver cancer acquired from the Oncomine database, and conducted quantitative real-time polymerase chain reaction (qRT-PCR) in 20 pairs of HCC and adjacent tissues and immunohistochemistry (IHC) staining in 90 pairs of HCC and adjacent tissues. These results validated that low expression of F2, GOT2, and TRPV1 in liver cancer was associated with poor prognosis of liver cancer.

### MATERIALS AND METHODS

#### Data Sources

We combined 3 corresponding concepts of the key word "liver cancer" with 2 concepts of the key word "prognosis" and 10 concepts of the key word "outcome" (**Supplementary Table S1**), and searched for their corresponding genes or proteins in the Coremine database (http://www.coremine.com/medical/). After deleting duplicates, we selected 1,173 gene entries with p-values < 0.05 that might be associated with the prognosis of liver cancer for further analyses (**Supplementary Table S2**).

The above genes mined in the Coremine database include some genes obtained from other gene-mining reports; however, the number of samples and data standards in each report is different. Therefore, we selected the cohort of The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/), a database with consistent sample size and data standards, to conduct unified batch verification of these genes and conduct three-gene combinations survival analyses.

We studied the relationship between each of the selected 1,173 genes and the prognosis of liver cancer patients in the TCGA cohort, which was downloaded from cBioPortal for Cancer Genomics (https://www.cbioportal.org/) in September 2018 (27, 28). A subgroup of 319 liver cancer samples with DFS follow-up data and a subgroup of 370 liver cancer samples with OS follow-up data were chosen.

#### Survival Analysis and Gene Selection

Kaplan-Meier estimation of survival functions and log-rank tests were used to evaluate the effect of genes on DFS and OS. The Cox proportional hazards model was used for multivariate analyses of HCC prognosis. Survival analyses were performed using the survival package in R (version 3.3.1). The Kaplan-Meier survival curves and Cox proportional hazards regression models for DFS and OS were generated with IBM SPSS (version 23.0). The median expression level of a gene was used as the cutoff value for classifying patients into high and low expression groups (29).
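The median-cutoff grouping and log-rank comparison described above can be illustrated with a minimal sketch. It is written in plain Python rather than the R survival package or SPSS used in the study, the function names `median_split` and `logrank_test` are hypothetical, and the toy expression and follow-up values are invented for illustration only. Samples exactly at the median are assigned to the low group here, which is an assumption the source does not specify.

```python
import math
from statistics import median


def median_split(expression):
    """Label a sample 'high' if its expression exceeds the cohort median, else 'low'."""
    cutoff = median(expression)
    return ["high" if x > cutoff else "low" for x in expression]


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value on 1 df).

    times  : follow-up time per sample
    events : 1 if the event (death or recurrence) was observed, 0 if censored
    groups : 'high' or 'low' per sample
    """
    observed_minus_expected = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Samples still under observation just before time t.
        at_risk = [(ti, e, g) for ti, e, g in zip(times, events, groups) if ti >= t]
        n = len(at_risk)
        n_high = sum(1 for _, _, g in at_risk if g == "high")
        d = sum(1 for ti, e, _ in at_risk if ti == t and e == 1)
        d_high = sum(1 for ti, e, g in at_risk if ti == t and e == 1 and g == "high")
        observed_minus_expected += d_high - d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance if variance > 0 else 0.0
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented toy data: expression of one gene, follow-up in months, event indicator.
expression = [1.2, 3.4, 0.8, 5.1, 2.2, 4.0, 0.5, 3.9]
times = [5, 30, 8, 42, 12, 36, 6, 40]
events = [1, 1, 1, 0, 1, 0, 1, 1]

groups = median_split(expression)
chi2, p_value = logrank_test(times, events, groups)
print(groups, round(chi2, 3), round(p_value, 4))
```

Repeating this test once per gene, and keeping the genes whose p-value falls below the chosen threshold, corresponds to the single-gene screening step; the multivariate Cox step is not covered by this sketch.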

#### Human Tissue Samples

For the validation studies, we enrolled 20 patients who underwent primary and curative hepatectomy from April 2016 to April 2018 at the First Affiliated Hospital of Guangxi Medical University. Patients with a definite pathologic diagnosis of HCC and no preoperative anticancer treatment were eligible for inclusion in this study. The paraffin-embedded pathologic specimens were collected during surgery and stored in a liquid nitrogen tank until mRNA isolation. All patients received an explanation of the purpose of the study and signed informed consent. The Ethics Committee of Guangxi Medical University approved this study. For IHC, a commercial tissue microarray (HLivH180Su14) containing 90 pairs of HCC and adjacent normal liver tissues, constructed by the biological sample library of Shanghai Outdo Biotech Company, was used; survival information was available for each case.

#### Quantitative Real-Time Polymerase Chain Reaction (qRT-PCR)

qRT-PCR was performed to evaluate the mRNA expression of selected genes in 20 HCC tissues and their matched non-tumorous tissues. Total RNA was isolated using Trizol reagent (Life Technologies, Inc., NY, USA) according to the manufacturer's instructions. The concentration and purity of the total RNA were measured using a microplate reader (BioTek Instruments, Inc., VT, USA). RNA reverse transcription was then performed with the PrimeScript™ RT reagent Kit with gDNA Eraser (Perfect Real Time) (Takara Biomedical Technology (Beijing) Co., Ltd.), and qRT-PCR was performed following the protocol of the TB Green™ Premix Ex Taq™ II (Tli RNaseH Plus) kit (Takara Biomedical Technology (Beijing) Co., Ltd.) on a StepOnePlus system (Applied Biosystems, Life Technologies Holdings Pte Ltd, Singapore).

The sequences of the primers are as follows: F2: forward primer, 5′-CTGAGGGTCTGGGTACGAACT-3′, reverse primer, 5′-TGGGTAGCGACTCCTCCATAG-3′; GOT2: forward primer, 5′-AAGAGTGGCCGGTTTGTCAC-3′, reverse primer, 5′-AGAAAGACATCTCGGCTGAACT-3′; TRPV1: forward primer, 5′-TGCACGACGGACAGAACAC-3′, reverse primer, 5′-GCGTTGACAAGCTCCTTCAG-3′. The cycle conditions were as follows: after an initial incubation at 95 °C for 30 s, the samples were cycled 40 times at 95 °C for 5 s and 60 °C for 30 s. The relative expression level of each gene in the individual samples was calculated using the $2^{-\Delta\Delta Ct}$ method and normalized using GAPDH as an endogenous control.
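As a worked illustration of the relative quantification step, the sketch below computes a fold change from four Ct values, with GAPDH as the reference gene and the matched non-tumorous tissue as the calibrator. The function name `relative_expression` and the Ct values are hypothetical and serve only to show the arithmetic.

```python
def relative_expression(ct_target_tumor, ct_ref_tumor, ct_target_normal, ct_ref_normal):
    """Fold change of a target gene in tumor vs. matched normal tissue (2^-ddCt)."""
    # Normalize each tissue to its own reference gene (GAPDH in the study).
    delta_ct_tumor = ct_target_tumor - ct_ref_tumor
    delta_ct_normal = ct_target_normal - ct_ref_normal
    # Compare the tumor to the matched normal tissue.
    delta_delta_ct = delta_ct_tumor - delta_ct_normal
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target amplifies two cycles later in tumor tissue.
fold_change = relative_expression(
    ct_target_tumor=26.0, ct_ref_tumor=18.0,
    ct_target_normal=24.0, ct_ref_normal=18.0,
)
print(fold_change)
```

A fold change below 1 indicates lower expression in the tumor than in the matched tissue, which is the direction reported for F2, GOT2, and TRPV1.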

#### Immunohistochemistry (IHC)

EnVision™ FLEX+, Mouse, High pH, (Link) (K8002, Dako) was used for the immunohistochemistry. After the tissue chips were baked and placed in a LEICA ST5010 (LEICA), PT Link (Dako North America, Inc.) was used for antigen retrieval. Primary antibodies were diluted (F2, 1:3000; GOT2, 1:80000; TRPV1, 1:1500) and incubated overnight at 4 °C. The secondary antibody reactions were carried out using the Autostainer Link 48 (Dako North America, Inc.). The sections were then subjected to color development with the DAB chromogenic kit and finally counterstained with hematoxylin (SLBT4555, Sigma Aldrich). The following antibodies were used: F2, anti-Thrombin (ab83981; Abcam); GOT2, anti-FABP-1 (ab171739; Abcam); TRPV1, anti-VR1 (ab3487; Abcam). All slides were evaluated by two independent pathologists who were blinded to the clinicopathologic data.

The expression levels were scored as the staining intensity (0, negative; 1+, weak; 2+, moderate; 3+, strong) multiplied by the proportion of immunopositive staining area (0, < 25%; 1+, 25–50%; 2+, 50–75%; 3+, > 75%). Expression scores < 5 were considered "low expression," and scores ≥ 5 were considered "high expression."
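The scoring rule can be written out as a short sketch. The function names are hypothetical, and because the published area categories overlap at their boundaries, the handling of exactly 50% (assigned here to grade 2+) is an assumption rather than something the source states.

```python
def area_grade(percent_positive):
    """Map the immunopositive area (percent) to the 0 to 3 grade used in the scoring rule."""
    if percent_positive < 25:
        return 0
    if percent_positive < 50:   # assumption: exactly 50% falls in the 2+ category
        return 1
    if percent_positive <= 75:
        return 2
    return 3


def ihc_score(intensity, percent_positive):
    """Staining intensity (0 to 3) multiplied by the area grade (0 to 3)."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2, or 3")
    return intensity * area_grade(percent_positive)


def expression_group(score):
    """Scores below 5 are 'low' expression; scores of 5 or more are 'high'."""
    return "low" if score < 5 else "high"


# Hypothetical cores: strong staining over 60% of the area, moderate over 60%.
print(ihc_score(3, 60), expression_group(ihc_score(3, 60)))
print(ihc_score(2, 60), expression_group(ihc_score(2, 60)))
```

Because the score is a product of two grades from 0 to 3, only the values 0, 1, 2, 3, 4, 6, and 9 can occur, so the cutoff of 5 effectively separates scores of 6 and 9 from all others.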

#### Statistics

Statistical analyses were conducted using R 3.3.1 (Auckland, NZ) and IBM SPSS 23.0 (Chicago, USA). The McNemar test was used for the paired fourfold-table IHC data. The paired t-test was used to analyze the qRT-PCR data. A p-value < 0.01 was considered statistically significant for the single-gene and three-gene combination survival analyses; for all other analyses, a two-sided p-value < 0.05 was considered statistically significant.
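For readers unfamiliar with the McNemar test on paired fourfold tables, the sketch below shows one common form, the exact two-sided version based on the binomial distribution. The source does not state which variant (exact or chi-square, with or without continuity correction) was used, so this choice is an assumption, and the discordant-pair counts in the example are hypothetical rather than taken from the study.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on the discordant pairs of a paired 2x2 table.

    b : pairs that are 'low' in tumor but 'high' in the matched adjacent tissue
    c : pairs that are 'high' in tumor but 'low' in the matched adjacent tissue
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis each discordant pair falls in either cell with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for one protein across paired tissue cores.
p_value = mcnemar_exact(b=30, c=5)
print(p_value)
```

Only pairs whose tumor and adjacent tissue fall into different expression groups carry information here, which is why the marginal percentages alone are not enough to reproduce the reported p-values.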

# RESULTS

#### Selection of Genes Related to Liver Cancer Prognosis and Liver Cancer Samples

We combined 3 corresponding concepts of the key word "liver cancer" [Liver neoplasms (alias Liver Cancer) (disease) (60,666 connections); Liver carcinoma (alias liver cell cancer) (disease) (55,739 connections); Carcinoma, Hepatocellular (alias Adult Liver Cancer) (mesh) (57,034 connections)] with 2 corresponding concepts of the key word "prognosis" [Prognosis (mesh) (77,312 connections); Prognostic Marker (alias Prognosis Marker) (chemical) (22,056 connections)] and 10 corresponding concepts of the key word "outcome" [Fatal Outcome (mesh) (34,016 connections); Outcome Assessment (Health Care) (alias Outcome Study) (mesh) (48,296 connections); Outcome studies (procedure) (9,545 connections); Treatment Outcome (mesh) (77,246 connections); Outcomes research (procedure) (5,540 connections); Outcome monitoring (procedure) (2,030 connections); Patient-focused outcomes (procedure) (3,830 connections); Treatment outcome in HSR (procedure) (998 connections); Patient Reported Outcome Measures (alias Patient Reported Outcome) (mesh) (2,301 connections); Patient Outcome Assessment (mesh) (9,066 connections)], respectively, (**Supplementary Table S1**), and searched for their corresponding genes or proteins in the Coremine database (http://www.coremine.com/medical/). With p-values < 0.05 as the criteria, a total of 1,173 genes that might be related to the prognosis of liver cancer were finally obtained after screening and elimination of duplicates. As the samples of liver cancer in the Coremine database were not uniform enough, we selected 319 samples for DFS and 370 samples for OS of liver cancer from the TCGA database and obtained the corresponding survival data as well as the expression information of the above 1,173 genes in these samples. This was necessary to carry out the subsequent survival analyses of these genes for liver cancer.

#### Single-Gene Prognostic Analyses

To clearly describe our process of screening genes, a flowchart of the analysis procedure was developed (**Figure 1**). First, we performed the Kaplan-Meier analysis of each of the 1,173 genes. It was found that the mRNA expression of 276 genes and 283 genes was significantly associated with DFS in 319 patients (p < 0.05) and OS in 370 patients (p < 0.05), respectively. Additionally, the mRNA expression of 166 of these genes was significantly associated with both DFS and OS (p < 0.05).

To further investigate the value of the genes in the prognosis of liver cancer, we chose 135 genes and 149 genes with p-values < 0.01 for DFS and OS, respectively. Next, we applied the Cox proportional hazards regression model in multivariate analyses of these genes to determine their DFS and OS prediction potential.

The DFS-related multivariate analysis results showed that the expression of 39 genes (ALDOB, APOB, AURKB, C5, CCNF, CD4, CENPJ, CETP, COL18A1, CPT2, DAND5, DNASE1, EBPL, F7, FLT3, G6PD, GNMT, ITGB2, KLRK1, KNG1, LMOD1, NEK2, PCLAF, PER1, PKM, POU2F1, PPAT, PPIA, PRF1, PTPN6, RUNX3, SELP, SLCO1B1, SPPL2A, STAT5A, TCF21, TRPV1, TUSC1, and TYMS) was significantly associated with DFS in HCC patients (p < 0.05, **Table 1**). The highly significant results of both the DFS-related single-gene survival analyses for each of these 39 genes and multivariate analysis confirmed that the above 39 genes have a strong association with the DFS of liver cancer, especially the 5-year disease free survival rate of liver cancer.

The OS-related multivariate analysis results showed that the expression of 28 genes (ABCC1, ANXA7, APOB, ATG7, BAK1, CA9, CCNA2, CHD1L, CYP3A4, E2F1, EZH2, F2, G6PC, GMPS, GOT2, HDAC2, HPX, KPNA2, LAPTM4B, MAGEB3, MAPT, MPV17, NTF3, PPAT, SLC2A1, SLC38A1, SPP1, and TRPV1) was significantly associated with OS in HCC patients (p < 0.05, **Table 1**). The highly significant results of both the OS-related single-gene survival analyses and the multivariate analysis confirmed that these 28 genes are significantly associated with the OS of liver cancer, especially the 5-year survival rate of liver cancer.

Additionally, among the above-mentioned genes selected after single-gene survival analyses and multivariate analyses, 3 genes (APOB, PPAT, and TRPV1) were significantly associated with both DFS and OS in HCC patients.

Heat maps of the expression of the above 39 DFS-related genes and 28 OS-related genes in 1,173 TCGA liver cancer samples, grouped by prognosis status, are shown in **Supplementary Figure S1**.

### Three-Gene-Combination Prognostic Model

To reflect the association of the expression of the combined genes with the prognosis of HCC, three-gene combinations of the above 39 and 28 single genes that are significantly associated with DFS and OS, respectively, were formed, resulting in 9,139 and 3,276 three-gene combinations for DFS and OS, respectively. In each combination, simultaneous high expression of the three genes in the same case was defined as the co-high expression group. Similarly, simultaneous low expression of the three genes in the same case was defined as the co-low expression group. To ensure comparability between the high and low expression groups, we removed combinations with 25 or fewer cases in the co-high or co-low expression group.
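The combination counts follow directly from choosing 3 genes out of n, and the grouping rule can be sketched in a few lines. In the sketch below, the helper names `co_expression_groups` and `screen_trios`, the gene labels, and the toy data are hypothetical; the group-size filter keeps a combination only when both groups contain more than `min_cases` samples, with 25 as the default following the study's description.

```python
from itertools import combinations
from math import comb


def co_expression_groups(labels_by_gene, trio):
    """Return sample indices with all three genes 'high' and with all three genes 'low'."""
    n_samples = len(labels_by_gene[trio[0]])
    co_high = [i for i in range(n_samples)
               if all(labels_by_gene[g][i] == "high" for g in trio)]
    co_low = [i for i in range(n_samples)
              if all(labels_by_gene[g][i] == "low" for g in trio)]
    return co_high, co_low


def screen_trios(labels_by_gene, min_cases=25):
    """Keep three-gene combinations whose co-high and co-low groups both exceed min_cases."""
    kept = {}
    for trio in combinations(sorted(labels_by_gene), 3):
        co_high, co_low = co_expression_groups(labels_by_gene, trio)
        if len(co_high) > min_cases and len(co_low) > min_cases:
            kept[trio] = (co_high, co_low)
    return kept


# Number of three-gene combinations for the gene sets described in the text.
counts = {n: comb(n, 3) for n in (39, 28, 18, 12)}
print(counts)

# Toy example with four hypothetical genes and six samples (median-split labels).
toy_labels = {
    "GENE_A": ["high", "high", "low", "low", "high", "low"],
    "GENE_B": ["high", "high", "low", "low", "low", "low"],
    "GENE_C": ["high", "low", "low", "low", "high", "low"],
    "GENE_D": ["low", "high", "high", "low", "high", "low"],
}
toy_kept = screen_trios(toy_labels, min_cases=0)
print(sorted(toy_kept))
```

Samples in which the three genes disagree belong to neither group, so the co-high and co-low groups together are usually much smaller than the full cohort, which is why a minimum group size is needed before comparing survival.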

#### Three-Gene-Combination of Prediction for DFS in Liver Cancer

K-M survival analysis of each of the above 9,139 combinations formed from the 39 DFS-related single genes was first performed. We then selected a total of 2,758 combinations with p-values < 0.01, excluding combinations with no more than 25 cases in the co-high or co-low expression group. These 2,758 combinations thus have significant prognostic implications for DFS in liver cancer.

In addition, 18 of the above 39 single genes have not yet been systematically reported to be associated with HCC prognosis, and these 18 genes can combine into 816 three-gene-combinations. The results of the K-M survival analyses showed that 317 combinations had significant association with DFS of liver cancer (p < 0.01).

The top 15 combinations of the above 2,758 and 317 combinations with the smallest p-values were chosen. The DFS-related survival analyses diagrams and tables of these combinations and the single genes they contain are as follows (**Figures 2**, **3**; **Tables 2**, **3**).

#### Three-Gene-Combination of Prediction for OS in Liver Cancer

Similarly, three-gene-combinations of the 28 single genes significantly associated with OS confirmed by the single gene survival analyses and the multivariate analysis were formed, resulting in 3,276 three-gene-combinations. 930 of these 3,276 combinations were screened out on the conditions that the number of cases in both the co-high and co-low expression groups was > 25, and the p-values were < 0.01 according to the OS-related K-M analyses results.

Furthermore, 12 of the above 28 single genes that were noted to have an unknown association with liver cancer prognosis formed 220 three-gene-combinations. Out of the 220 combinations, there were 31 combinations in which the number of cases in both the co-high and co-low expression groups was > 25 and the OS-related survival analyses results showed p < 0.01.

We found that 930 of the above 3,276 combinations and 31 of the above 220 unreported-gene combinations were significantly associated with OS in liver cancer patients. Among these 930 and 31 combinations, the OS-related survival analysis diagrams and tables for the top 15 combinations with the smallest p-values and the single genes they contain are as follows (**Figures 4**, **5**; **Tables 3**, **4**).

TABLE 1 | Multivariate analyses of prognosis of DFS of 319 HCC patients and OS of 370 HCC patients in a TCGA cohort.

\**The gene has not been systematically reported to be associated with HCC prognosis.*

*The Cox proportional hazards model was used to analyze the impact of 135 genes on DFS and the impact of 149 genes on OS, respectively; P* < *0.05 was considered significant. 39 genes and 28 genes were significantly associated with liver cancer DFS and OS, respectively.*

Among the 12 genes that have an unknown association with HCC prognosis, F2, GOT2, TRPV1, and their combination F2-GOT2-TRPV1 were all significantly associated with OS in 370 liver cancer samples from the TCGA data (F2: p = 0.005; GOT2: p < 0.001; TRPV1: p = 0.002; F2-GOT2-TRPV1: p < 0.001). The overall survival rates in HCC patients with low expression of F2, GOT2, TRPV1, and the three-gene combination F2-GOT2-TRPV1 were all significantly lower than those in liver cancer patients with high expression. In addition, the median survival time difference between the high expression group and the low expression group of F2, GOT2, TRPV1, and the three-gene combination F2-GOT2-TRPV1 was 23.62, 32.26, 35.61, and 55.68 months, respectively. The median survival time difference of this combination was greater than that of any single gene, which was one of the main reasons why we selected these three genes for qRT-PCR and immunohistochemical validation.
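The median survival time difference between two expression groups can be illustrated with a small Kaplan-Meier sketch. The function name `km_median` and the follow-up data are hypothetical, and the median is taken as the first event time at which the estimated survival drops to 0.5 or below, which is one common convention and not necessarily the exact rule applied by the software used in the study.

```python
from fractions import Fraction


def km_median(times, events):
    """Kaplan-Meier median survival: first event time where S(t) <= 0.5, else None.

    times  : follow-up time per sample (months)
    events : 1 if the event was observed, 0 if the sample was censored
    """
    survival = Fraction(1)  # exact arithmetic avoids rounding issues at S(t) = 0.5
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= Fraction(at_risk - deaths, at_risk)
        if survival <= Fraction(1, 2):
            return t
    return None  # median not reached within the follow-up


# Invented follow-up data for a co-high and a co-low expression group.
high_times, high_events = [30, 42, 36, 40], [1, 0, 0, 1]
low_times, low_events = [5, 8, 12, 6], [1, 1, 1, 1]

median_high = km_median(high_times, high_events)
median_low = km_median(low_times, low_events)
print(median_high, median_low, median_high - median_low)
```

When more than half of a group is still event-free at the end of follow-up, the median is not reached and no difference can be computed, which is one practical reason a minimum number of cases per group matters.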

### Low Expression of F2, GOT2, and TRPV1 Predicts Poor Prognosis

Based on the above results of the OS-related survival analyses and multivariate analyses on 28 genes, as well as the results of survival analyses on their three-gene-combinations, we selected three genes F2, GOT2, and TRPV1 with strong liver cancer prognostic potential for subsequent validation.

#### F2, GOT2, and TRPV1 Were Downregulated in HCC Tissues

Gene expression in HCC was determined from three independent microarrays, all collected in the Oncomine database (https://www.oncomine.org/resource/login.html). As shown in the Roessler Liver 2 Statistics (225 HCC tissues vs. 220 liver tissues), the expression of F2, GOT2, and TRPV1 was significantly downregulated in HCC tissues compared with normal liver tissues (p < 0.001; **Figure 6**). In addition, based on the Mas Liver Statistics (38 HCC tissues vs. 19 liver tissues), both F2 and TRPV1 were significantly downregulated in HCC tissues. Based on the Chen Liver Statistics (104 HCC tissues vs. 76 liver tissues), both F2 and GOT2 were significantly downregulated in HCC tissues.

The qRT-PCR results of F2, GOT2 and TRPV1 showed that 20/20, 19/20, and 16/19 of the HCC tissues exhibited significantly lower expression of F2 (p < 0.001; **Figure 7A**), GOT2 (p < 0.001; **Figure 7B**), and TRPV1 (p = 0.006; **Figure 7C**), respectively, when compared with their corresponding non-tumorous tissues.

The protein expression of F2, GOT2, and TRPV1 in HCC tissues was evaluated using IHC. Positive staining of F2, GOT2, and TRPV1 was mainly localized in the cytoplasm of HCC cells. The representative staining of F2, GOT2, and TRPV1 negative and positive protein expression in HCC are shown in **Figure 8A**.

Among 90 HCC tissues and adjacent non-malignant liver tissues, IHC was employed to measure the protein expression of F2, GOT2, and TRPV1, respectively. Low F2 expression was observed in 62/89 (69.66%) of the HCC tissues, compared to 33/89 (37.08%) in adjacent normal liver tissues (p < 0.001); low GOT2 expression was noted in 72/89 (80.90%) of the HCC tissues, compared to 32/89 (35.96%) in adjacent normal liver tissues (p < 0.001); low TRPV1 expression was also observed in 59/89 (66.29%) of the HCC tissues, compared to 38/89 (42.70%) in adjacent normal liver tissues (p = 0.002).

was >25. (A) Association of DFS and the top 15 combinations of the overall genes combinations. (B) Association of DFS and the top 15 combinations of the unreported genes combinations.

#### Expression of F2, GOT2, and TRPV1 and Their Combination F2-GOT2-TRPV1 With OS

Based on the above single-gene and three-gene combination survival analyses of TCGA HCC samples, low expression of F2, GOT2, TRPV1, and their combination F2-GOT2-TRPV1 was significantly associated with poor OS in HCC (F2: p = 0.005; GOT2: p < 0.001; TRPV1: p = 0.002; F2-GOT2-TRPV1: p < 0.001). In addition, the median survival time difference between the high expression group and the low expression group of F2-GOT2-TRPV1 was greater than that of any of the three single genes.

DFS and the 16 single genes contained in the first 15 unreported-gene combinations.

TABLE 2 | The associations of three-gene combinations with disease-free survival (DFS) of HCC patients in a TCGA cohort, analyzed by Kaplan-Meier method.

TABLE 3 | The associations of single genes contained in the multi-gene combinations with disease-free survival (DFS) and overall survival (OS) of HCC patients in a TCGA cohort, analyzed by Kaplan-Meier method. Columns report, for DFS and OS separately, the median estimate, standard error, 95% confidence interval (lower and upper boundary), median survival time difference (H-L), and P value.

FIGURE 4 | Association of the top 15 three-gene-combinations with smallest *p*-values with OS, using the data of HCC samples in a TCGA cohort and assessed by Kaplan-Meier analyses. The high expression group (blue line) of the combination consisted of samples with high expression of all three genes, and the low expression group (green line) of the combination consisted of samples with low expression of all three genes. The number of high and low expression groups in each combination was >25. (A) Association of OS and the top 15 combinations with the smallest *p*-values of the overall genes combinations. (B) Association of OS and the top 15 combinations with the smallest *p*-values of the unreported genes combinations.

The results of IHC for 90 liver cancer cases showed that low protein expression of F2, GOT2, and TRPV1 was significantly associated with lower 5-year survival in HCC patients (F2: p = 0.033, GOT2: p = 0.035, TRPV1: p = 0.046; K-M survival analyses). However, because of the insufficient number of events in the co-high expression group of the combination F2-GOT2-TRPV1, only a marginally significant difference in overall survival was found between the co-high and co-low expression groups of the protein combination F2-GOT2-TRPV1 (p = 0.051) (**Figure 8B**).

### DISCUSSION

Liver cancer is characterized by inconspicuous early symptoms, a high degree of malignancy, recurrence and spread, and unsatisfactory prognosis. With limited treatment options, it is one of the common malignancies that plague the world. Therefore, identification of effective prognostic biomarkers for liver cancer is the key to improving the efficacy of targeted therapy for HCC and reducing the adverse prognostic effects of liver cancer.

In our study, by combining and searching 15 corresponding concepts of the key words "liver cancer," "prognosis," and "outcome," and applying a threshold of p < 0.05, we mined 1,173 genes that may be related to the prognosis of liver cancer from the Coremine platform after merging and removing duplicates. However, because of the insufficient sample size and prognosis-related data for liver cancer in the Coremine platform, as well as the large heterogeneity among the samples, we also selected gene expression data and prognosis data of 319 samples for DFS and 370 samples for OS from the TCGA platform. We then separately conducted DFS-related and OS-related K-M survival analyses for each gene, followed by multivariate analyses. The large-scale gene mining and the large number of homogeneous samples gave us a reliable analytical foundation. To our knowledge, this is the first survival analysis at this scale, covering hundreds of genes, used for subsequent screening.

OS and the 11 single genes contained in the first 15 unreported-gene combinations.

TABLE 4 | The associations of three-gene combinations with overall survival (OS) of HCC patients in a TCGA cohort, analyzed by Kaplan-Meier method. Columns report, for combinations of the 28 genes and for combinations of the 12 genes with unknown association with HCC prognosis, the median OS estimate, standard error, 95% confidence interval, P value, and median survival time difference.

FIGURE 6 | Expression of *F2*, *GOT2*, and *TRPV1* in HCC and adjacent normal liver tissues confirmed by independent microarrays from the Oncomine database. The expression of (A) *F2*, (B) *GOT2*, and (C) *TRPV1* were all significantly reduced in HCC tissues by the Roessler Liver 2 Statistics [225 HCC tissues (dark blue) vs. 220 normal liver tissues (light blue)]. \*\*\**p* < 0.001.

In addition, the genes selected by K-M survival analyses with a low p-value (p < 0.01) were further screened by multivariate analyses using the Cox proportional hazards regression model. We found that 39 genes and 28 genes were significantly associated with DFS and OS, respectively, in liver cancer. Many of these genes have been confirmed to be associated with the prognosis of HCC by previous reports. For example, of the 39 DFS-related genes, ALDOB inhibits metastasis in HCC and can be a valuable novel prognosis-predicting marker (30); APOB was found to be a prognostic biomarker for patients with radical resection of HCC (31, 32); and CCNF is downregulated in HCC and is a promising prognostic marker (33). In addition, CPT2 (34), G6PD (35), GNMT (36), NEK2 (37), etc. have also been reported to be prognostic markers of HCC that affect the occurrence or invasion of HCC. These findings are consistent with what we identified. Other genes, such as C5, CD4, CETP, COL18A1, DAND5, DNASE1, EBPL, F7, FLT3, ITGB2, KNG1, LMOD1, PPAT, PPIA, PRF1, SELP, SPPL2A, and TRPV1, which have not been systematically reported in relation to the prognosis of liver cancer, are our newly discovered prognostic markers for DFS in liver cancer. Similarly, of the 28 OS-related genes, CA9 regulates the epithelial-mesenchymal transition and is a novel prognostic marker in HCC (38), and E2F1 expression has an impact on tumor aggressiveness and affects the prognosis of HCC (14, 15); CYP3A4 (39), HDAC2 (40), and KPNA2 (41) have also been identified as prognostic markers of HCC and are reflected in our findings. The other genes, such as ANXA7, F2, GOT2, HPX, MAGEB3, MAPT, MPV17, NTF3, PPAT, SLC2A1, SLC38A1, and TRPV1, are all novel prognostic markers associated with liver cancer OS found by our large-scale screening. Three genes (APOB, PPAT, and TRPV1) were associated with both DFS and OS of HCC, suggesting that APOB, PPAT, and TRPV1 may be significant and effective in predicting both the progression and the adverse outcomes of HCC.

with HCC patients prognosis. (A) Negative, weakly positive, intermediately positive, and strongly positive IHC staining of *F2*, *GOT2*, and *TRPV1*. *F2*, *GOT2*, and *TRPV1* were all expressed at low levels in liver cancer. (B) The lower protein expression levels of *F2*, *GOT2*, and *TRPV1* were all associated with 5-year OS of 90 HCC patients, examined by Kaplan-Meier analyses and log-rank test. However, there was only a marginally significant association between the *F2*-*GOT2*-*TRPV1* combination protein expression levels and the OS of HCC patients. (*F2*: *p* = 0.033, *GOT2*: *p* = 0.035, *TRPV1*: *p* = 0.046, *F2*-*GOT2*-*TRPV1: p* = 0.051).

Moreover, there may be connections among the selected genes, and they may work together to influence the development and prognosis of liver cancer to some extent. Although some of these genes have been reported as prognostic molecular markers of liver cancer, most reports focused on the impact of a single gene on prognosis, and few studies performed such a large-scale survival analysis. Studies of multiple gene combinations can be more effective than the analysis of single genes in predicting the prognosis of liver cancer.

In our study, we constructed three-gene combinations from the 39 DFS-related genes and 28 OS-related genes screened from the above survival analyses. To further study the predictive effect of these combinations on the prognosis of liver cancer, and to compare the predictive power of single genes with that of the corresponding gene combinations, we carried out thousands of K-M survival analyses on these combinations. To ensure comparability and credibility, we removed the combinations in which the co-high or co-low expression group contained fewer than 26 cases, and identified 2,758 DFS-related combinations and 930 OS-related combinations with p-values < 0.01. Moreover, we also built three-gene-combination models and performed K-M survival analyses on the 18 DFS-related genes and 12 OS-related genes that we found but that have not been systematically reported to be related to the prognosis of HCC. In total, 317 and 31 unreported-gene combinations significantly associated with DFS and OS, respectively, were identified.
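As a rough illustration of the scale of this screening, the sketch below counts the candidate three-gene combinations for the gene-set sizes given above and shows one way to form co-high and co-low groups. The helper `co_expression_groups`, the boolean high/low calls, and the toy data are hypothetical; how high and low expression were defined is not stated in this section, and the survival test itself is omitted.

```python
from itertools import combinations
from math import comb

# Number of candidate three-gene combinations for the gene-set sizes in the text.
n_candidates = {
    "DFS": comb(39, 3),
    "OS": comb(28, 3),
    "DFS unreported": comb(18, 3),
    "OS unreported": comb(12, 3),
}

MIN_GROUP_SIZE = 26  # minimum co-high / co-low group size stated in the text


def co_expression_groups(is_high, genes):
    """Return (co_high, co_low) sample indices for a gene combination.

    is_high maps gene -> list of booleans, one per sample (True = high expression).
    The high/low call itself is an assumption of this sketch.
    """
    n_samples = len(next(iter(is_high.values())))
    co_high = [i for i in range(n_samples) if all(is_high[g][i] for g in genes)]
    co_low = [i for i in range(n_samples) if not any(is_high[g][i] for g in genes)]
    return co_high, co_low


# Hypothetical toy calls for four samples.
toy_calls = {
    "F2": [True, True, False, False],
    "GOT2": [True, False, False, True],
    "TRPV1": [True, True, False, False],
}
toy_triplets = list(combinations(sorted(toy_calls), 3))
co_high, co_low = co_expression_groups(toy_calls, toy_triplets[0])
```

In a real screen, a combination would be kept only if both groups reach `MIN_GROUP_SIZE` before the K-M analysis is run.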

For the above four types of three-gene-combinations (the overall genes combinations associated with DFS, the unreported genes combinations associated with DFS, the overall genes combinations associated with OS, and the unreported genes combinations associated with OS), the top 15 combinations with the lowest p-values of the survival analyses and the genes they contained were, respectively, selected for comparison (**Tables 2**, **3**, **4**).

For example, among the overall gene combinations associated with OS, the median survival time difference between the co-high and the co-low expression groups of KPNA2-SLC38A1-SPP1 was 83.57 months. In contrast, that of the single genes KPNA2, SLC38A1, and SPP1 was 47.66, 35.61, and 29.64 months, respectively. After combining KPNA2, SLC38A1, and SPP1, the median survival time difference between the high and low expression groups was therefore larger than that of any of the three single genes by about 36 months or more. This suggests that KPNA2, SLC38A1, and SPP1 may have better predictive value for liver cancer prognosis in combination and may be more clinically useful for future treatment target selection.

We also selected genes that have not been previously reported for liver cancer prognosis and compared their prognostic efficacy with the corresponding three-gene combinations (the chart only shows the top 15 groups with the lowest p-values of the three-gene combination prognostic models). The expression of one of the combinations, F2-GOT2-TRPV1, had a greater effect on the median survival time of OS than any of the three individual genes (median survival time difference: F2-GOT2-TRPV1: 55.68 months; F2: 23.62 months; GOT2: 32.26 months; TRPV1: 35.61 months).

Coagulation factor II (F2) plays a major role in proteolysis to form thrombin in the first step of the coagulation cascade and eventually generates hemostasis. An enrichment analysis of genetic changes during the development of HCC identified several hub genes, including F2, which interacts in several groups of conditional specific PPI networks (42). Additionally, it was reported that F2 is associated with invasion in neuroendocrine prostate cancer (43). Glutamic-oxaloacetic transaminase 2 (GOT2) plays an important role in amino acid metabolism and the tricarboxylic acid cycle, and it affects the malate-aspartic acid shuttle activity and glycolysis in the liver under the stimulation of liver inflammation (44, 45). TRPV1 is a regulator of cell homeostasis; previous studies have revealed that the expression of TRPV1 is significantly decreased in renal cell carcinoma, colorectal cancer, and melanoma. In addition, TRPV1 can affect P53 and TRPV1-dependent pathways to inhibit the growth of colorectal cancer and melanoma (46–48), and can cause apoptosis in human osteosarcoma MG63 cells (49).

At present, there are few studies on the three genes F2, GOT2, and TRPV1, and particularly on their combination, in the prognosis of HCC. In our study, qRT-PCR on 20 pairs of HCC and paracancerous tissues, as well as IHC on 90 pairs of HCC biochips, confirmed that F2, GOT2, and TRPV1 are all significantly and consistently downregulated in HCC tissues, and this was reconfirmed by three independent microarrays. Moreover, low expression of F2, GOT2, and TRPV1 was in each case significantly associated with poor prognosis of HCC. However, because there were no death events in the F2-GOT2-TRPV1 high expression group in the HCC biochips, the survival difference between the F2-GOT2-TRPV1 high and low expression groups was only marginally significant (p = 0.051); this is still consistent with our big data-based multi-gene combination survival analysis results.

As there may be certain relationships between the genes we screened that are significantly associated with the prognosis of liver cancer, they may work together in the form of multi-gene combinations in the development of liver cancer. However, the predictive potency of different gene combinations varies. Some combinations are better predictors than individual genes, and therefore these combinations may be more valuable than individual genes as prognostic markers for liver cancer. Due to limitations in human and material resources, it remains unclear how these genes and gene combinations specifically affect HCC survival. Further investigation and experiments are needed to elucidate the biological mechanisms of the selected genes, particularly of the significant multi-gene combinations, in the development and progression of HCC.

Our findings cover a large number of genes, and we have also explored the predictive efficacy of a number of gene combinations for the prognosis of liver cancer. We believe that these highly significant prognosis-related genes and gene combinations derived from the above multiple screenings are promising, reliable molecular markers for the prognosis of liver cancer, and that our screening methods can be extended to other tumor types.

In conclusion, based on a large sample from a public data platform, novel and effective data mining and multiple screening methods, large-scale survival analyses, and supplemental experimental verification, we identified a series of novel genes and multi-gene combinations that are significantly associated with DFS or OS in liver cancer. Moreover, a large difference in survival between the high and low expression groups of these three-gene combinations was detected. Some of the three-gene combinations can predict much longer or shorter survival time for liver cancer patients than the single genes. qRT-PCR, immunohistochemistry, and three independent microarray results confirmed our findings that three of the selected novel genes, F2, GOT2, and TRPV1, as well as the corresponding combination F2-GOT2-TRPV1, showed significantly lower expression in HCC and are associated with OS in HCC. Some gene combinations may be better predictors of prognosis than single genes and may serve as potential therapeutic targets for liver cancer.

#### DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

#### ETHICS STATEMENT

The studies involving human participants were reviewed and approved by The Ethics Committee of Guangxi Medical University. The patients/participants provided their written informed consent to participate in this study.

# AUTHOR CONTRIBUTIONS

ML and XLi performed most of the analyses. ML led the writing of the manuscript. SL provided the clinical samples and participated in revising the manuscript. FX and JT participated in drafting and reviewing the manuscript. EG conducted a search for genes and preliminary screening work by keyword. XQ obtained and matched the TCGA sample data. ML, LW, and QL performed the single-gene and multi-gene-combination survival analyses. ZL and LL looked up the relevant information on the selected genes. XLu performed validation of the selected genes in three microarrays. KL and DZ performed the mRNA isolation and qRT-PCR, and collected and analyzed experimental data. YY and XLi performed immunohistochemistry and experimental data processing. FY and XZ participated in designing and reviewing the study. All authors reviewed the manuscript and read and approved the final version.

# FUNDING

This study was supported by the National Natural Science Foundation of China (Grant No. 81760611) and the Key Laboratory of High-Incidence-Tumor Prevention and Treatment (Guangxi Medical University), Ministry of Education (No. GKE 2019–01).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2020.00847/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Liu, Liu, Liu, Xiao, Guo, Qin, Wu, Liang, Liang, Li, Zhang, Yang, Luo, Lei, Tan, Yin and Zeng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Impact of Data Preprocessing on Integrative Matrix Factorization of Single Cell Data

#### Lauren L. Hsu<sup>1,2</sup> and Aedin C. Culhane<sup>1,2</sup>\*

*<sup>1</sup> Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States, <sup>2</sup> Division of Biostatistics and Computational Biology, Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, United States*

Integrative, single-cell analyses may provide unprecedented insights into cellular and spatial diversity of the tumor microenvironment. The sparsity, noise, and high dimensionality of these data present unique challenges. Whilst approaches for integrating single-cell data are emerging and are far from being standardized, most data integration, cell clustering, cell trajectory, and analysis pipelines employ a dimension reduction step, frequently principal component analysis (PCA), a matrix factorization method that is relatively fast, and can easily scale to large datasets when used with sparse-matrix representations. In this review, we provide a guide to PCA and related methods. We describe the relationship between PCA and singular value decomposition, the difference between PCA of a correlation and covariance matrix, the impact of scaling, log-transforming, and standardization, and how to recognize a horseshoe or arch effect in a PCA. We describe canonical correlation analysis (CCA), a popular matrix factorization approach for the integration of single-cell data from different platforms or studies. We discuss alternatives to CCA and why additional preprocessing or weighting datasets within the joint decomposition should be considered.

Keywords: data integration, matrix factorization, single cell, scRNA-seq, normalization, standardization, data preprocessing

## INTRODUCTION

Single-cell (sc) molecular profiling provides unprecedented resolution and incredible potential to discover the heterogeneity of cell types and states and the intercellular communication that drives complex cellular dynamics, homeostasis, response to environment, and disease. We will focus this review on the challenges and considerations when applying matrix factorization approaches to integration of sc RNA sequencing data (scRNA-seq). Matrix factorization methods, including principal component analysis (PCA), are central to scRNA-seq data analysis pipelines, but are often treated as "black boxes" within computational pipelines, with little consideration of what steps are included. We will "open the box" to illustrate the exact scaling and transformations that are performed on data in a PCA, and how different preprocessing steps impact data and cross-platform batch integration. These tips and considerations will also apply to other single cell omics data, as well as to multi-modal integration of different omics data.

#### Challenging Properties of Single Cell Data

Single-cell data present a set of unique challenges for data analysis and integration (1–3). In contrast to traditional bulk RNA-seq which provides the average expression of RNA molecules across tens of thousands or millions of cells, scRNA-seq measures RNA in each cell.

#### Edited by:

*Francesca Finotello, Innsbruck Medical University, Austria*

#### Reviewed by:

*Valentine Svensson, FL60 Inc, United States Jean Fan, Harvard University, United States Federico Marini, Johannes Gutenberg University Mainz, Germany*

#### \*Correspondence:

*Aedin C. Culhane aedin@ds.dfci.harvard.edu*

#### Specialty section:

*This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology*

Received: *20 February 2020* Accepted: *18 May 2020* Published: *23 June 2020*

#### Citation:

*Hsu LL and Culhane AC (2020) Impact of Data Preprocessing on Integrative Matrix Factorization of Single Cell Data. Front. Oncol. 10:973. doi: 10.3389/fonc.2020.00973*

The goal of scRNA-seq is frequently to define differential gene expression within specific cell types that characterize a phenotype, so cell type identification is a critical early step. In a tissue or biological sample, the population of cells is heterogeneous, containing many cell types including unidentified, new cell types, and cell states. Annotation of cell types in biological samples is challenging, as methods are still emerging and are limited by a lack of gold standard benchmarking data. To classify cell types and states, unsupervised clustering analysis is often used to partition cells into clusters. However, the biologically expected cell-to-cell variation within cell states is poorly understood, and cell clusters may be associated with systematic, batch, technical, or methodological artifacts (1). Toward the goal of creating a comprehensive cell type and state reference, the Human Cell Atlas will catalog the diversity of cell types in the human body (4); it anticipates discovering distinct tissue-specific, disease-specific, age-specific, and gender-specific cell phenotypes, and will identify many new cell types and states that are yet to be defined.

Most, or at least half, of the transcriptome is detected in a typical bulk RNA-seq study. In contrast, scRNA-seq studies frequently measure <5,000 genes in a single cell (1). Most genes are not measured, and these zero counts may represent zero gene expression or false negative dropout, that is, when a gene was expressed but was not detected due to technological limitations (3, 5) such as limited sequencing depth or stochastic variation. Gene expression may also be missed due to biological variance; single point-in-time measurements cannot capture dynamic processes, such as RNA transcriptional bursts. Emerging evidence suggests transcription occurs in bursts or pulses that depend on core promoter and enhancers (6) and a three-state model may be required to capture its biological complexity (7). These issues of scRNA-seq analysis underscore the importance of appropriate quality control, preprocessing, and normalization (1, 8).

#### Preprocessing of sc Sequencing Data

Several library preparation and read mapping approaches, including genome or transcriptome mapping and pseudoalignment, can be used to generate a "raw" or unique molecular identifier (UMI) count matrix from sequencing reads (9), but in a comparison of over 3,000 preprocessing and analysis pipelines, Vieth et al. found that normalization of the count matrix had the greatest impact on downstream analysis (9). Standard "normalization" pipelines include scaling using sample-specific size factors, log transformation to reduce skewness, and feature filtering before PCA. The selection of a particular normalization routine will itself embed assumptions about the underlying distribution of the data. Inappropriate preprocessing may introduce artifacts that impact the ability to perform further preprocessing (e.g., alignment and integration of batches of sc data both within and between studies) and downstream biological analyses [e.g., cell type identification, classification, and differential gene expression (1, 8, 9)].

Depending upon the analysis method selected, objective defined, and the dataset itself, different approaches to preprocessing may be appropriate; various data scaling, centering, standardization, and transformation (**Figure 1**) approaches can be applied. Frequently these terms are used interchangeably even though they represent different data manipulations (11, 12). Often the goal of preprocessing steps is to generate data that meet the linearity, homoscedasticity (that the points have the same scatter, i.e., there is no relationship between mean and variance), and normality assumptions that are required for most parametric statistical methods, including linear regression. A recent review of metabolomics data includes an extensive review of scaling and transformation approaches on sparse data (13).



FIGURE 1 | Common data preprocessing steps include scaling, centering, standardization, and transformation. Graphical examples of these preprocessing routines are applied to two datasets (1) "toy data" with a mean and standard deviation (SD) of 1.5 generated for purposes of illustration, and (2) the 10X raw counts matrix in the scMix benchmarking dataset used in Figure 2 (10).

• Normalization transforms the data points so that their distribution resembles a normal, also called Gaussian, distribution. In a normal distribution (i.e., the classic bell curve) points are distributed symmetrically around the mean, most observations are close to the mean, and the median and mean are the same. Depending upon the distribution of the original dataset, this may be achieved by a log transformation, or may require more extensive preprocessing. Two recent articles have proposed analysis of Pearson residuals rather than log normalized counts (8, 14). In bioinformatics and computational fields, this term may also refer to size and/or range scaling transformation which may not produce a normal distribution (21).
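The preprocessing operations named in Figure 1 can be sketched on a toy matrix as follows. This is a minimal illustration, not a recommended pipeline: the count values are invented, cells are assumed to be rows and genes columns, and the median library size and the pseudocount of 1 are arbitrary choices of this sketch.

```python
import numpy as np

# Hypothetical toy counts: rows are cells, columns are genes.
counts = np.array([
    [0, 3, 10],
    [2, 0, 40],
    [1, 6, 25],
    [5, 1, 5],
], dtype=float)

# Size-factor scaling: divide each cell by its total count, then rescale
# to a common depth (the median library size here, an arbitrary choice).
library_size = counts.sum(axis=1, keepdims=True)
scaled = counts / library_size * np.median(library_size)

# Log transformation with a pseudocount of 1 to reduce skewness.
logged = np.log2(scaled + 1.0)

# Centering: subtract each gene's mean so every column has mean 0.
centered = logged - logged.mean(axis=0)

# Standardization: additionally divide by each gene's standard deviation,
# so every column has mean 0 and unit variance.
standardized = centered / logged.std(axis=0, ddof=1)
```

Each step changes a different property of the data, which is why the terms are not interchangeable: scaling equalizes depth across cells, the log transformation changes the shape of the distribution, and centering and standardization change the location and spread of each gene.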

Feature selection, for instance restricting analysis to overdispersed genes which are expected to capture a disproportionate amount of the variance in the data, is included in many analysis pipelines to reduce the computation time (16, 22). Furthermore, selecting genes with high biological variability, to exclude many genes with low biological signal and high numbers of zeros, may increase the signal to noise ratio in dimension reduction.
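One simple way to rank genes by overdispersion is the variance-to-mean ratio of the counts. The sketch below is illustrative only; the function name `select_overdispersed` and the toy matrix are hypothetical, and published pipelines typically use more refined dispersion models than this index.

```python
import numpy as np


def select_overdispersed(counts, n_top):
    """Return sorted column indices of the n_top genes with the highest
    variance-to-mean ratio (a simple dispersion index)."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    # Genes with zero mean are assigned dispersion 0 to avoid division by zero.
    dispersion = np.divide(var, mean, out=np.zeros_like(mean), where=mean > 0)
    order = np.argsort(dispersion)[::-1]
    return np.sort(order[:n_top])


# Hypothetical toy counts: rows are cells, columns are genes.
toy_counts = np.array([
    [1, 0, 0, 2],
    [1, 10, 0, 2],
    [1, 0, 0, 3],
    [1, 30, 0, 1],
])
selected = select_overdispersed(toy_counts, n_top=2)
```

Here the constant gene (column 0) and the all-zero gene (column 2) are dropped, while the two genes that vary across cells are retained.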

### Dimension Reduction

Data dimension reduction is indispensable in single cell data analyses because it facilitates exploratory data analysis and visualization, and is an essential step in many downstream analyses including cell clustering (23, 24), cell-type identification, lineage reconstruction, and trajectory inference (25–27). It is also a critical first step in many algorithms that align and integrate sc datasets (11, 22, 28).

Dimension reduction transforms the data to a new coordinate system (i.e., a low-dimensional shared latent space) such that the greatest variance can be identified and distinguished from background noise, or less informative variance. The output is a set of embeddings for each data point which encode their location in the low-dimensional shared latent space. It is frequently achieved using matrix factorization, a class of unsupervised techniques that provide a set of principled approaches to parsimoniously reveal the low-dimensional structure while preserving as much information as possible from the original data.

Principal component analysis (PCA) is arguably the oldest, fastest, and the most commonly used matrix factorization approach (29). PCA is a deterministic algorithm that seeks linear combinations of the variables that explain the variance in the data and ranks these such that the first component explains most of the variance or "strongest" pattern in the data. PCA uses a Gaussian likelihood and is best applied to data that are approximately normally distributed. Although PCA is not recommended for highly skewed data (**Figure 1**), in a recent systematic analysis of 18 linear and non-linear dimension reduction approaches, PCA and other classical linear methods performed surprisingly well in both clustering and lineage inference analysis when assessed on 30 scRNA-seq datasets (30). Linear (straight-line) analysis methods including PCA, independent component analysis (ICA), and factor analysis (FA) ranked best in clustering. PCA, FA, non-negative matrix factorization [NMF, (31, 32)], and uniform manifold approximation and projection [UMAP, (33)] ranked top in lineage inference analysis (30). We compare ICA and NMF matrix factorization in a recent review (31).

Dimension reduction methods optimized for count data that apply a better-fitting likelihood model (e.g., Poisson or negative binomial) are promising for addressing the skewed distribution of sc count data (8, 14). However, glmPCA (8), Poisson factorization (34–36), and probabilistic count matrix factorization [pCMF, (37)], as well as methods designed to model zero-inflated sparse data, including ZIFA and ZINB-WaVE (38, 39), did not outperform PCA across the full range of analyses and evaluations performed in the study by Sun et al. (30). While there are particular settings where these methods may be most appropriate, they are not necessarily appropriate as "general-purpose" approaches. The high computational cost and long run time make many of these models difficult to integrate into multi-step bioinformatics pipelines.

Non-linear dimension reduction methods can identify variance in subsets of features by fitting local linear maps on subsets of points. Non-linear methods applied to sc data include diffusion maps (40), locally linear embedding, isoMap, kernel adaptations of linear methods, uniform manifold approximation and projection (UMAP) (41), and t-distributed stochastic neighbor embedding [tSNE, (42)]. However, similar to the methods that apply non-Gaussian likelihoods, non-linear dimension reduction methods are often computationally expensive and, since they are not deterministic, may produce different embeddings when re-applied to the same dataset. To improve computational tractability, PCA is frequently used as a preprocessing step prior to non-linear dimensionality reduction approaches including tSNE (43) and UMAP (33). Although PCA is not required to run UMAP, in practice it can be applied first to accelerate computation by significantly reducing dimensionality and noise while preserving underlying latent structure.

In this review, we focus on PCA because of its popularity, performance, and widespread use. PCA is a central step in many sc analysis algorithms and pipelines. When used with sparse-matrix representations, it can easily scale to large datasets. Excellent general tips for dimension reduction have been described (44), so we will focus on considerations and limitations when applying dimension reduction to sc data, including a step-by-step explanation of how PCA works, especially when applied to integrative sc analysis (**Figure 2A**).

#### The Impact of Data Preprocessing on Dimension Reduction

There are two types of PCA, which differ in data centering and scaling prior to matrix decomposition. PCA of a covariance matrix or a correlation matrix is achieved by applying matrix factorization to a centered but unscaled matrix, or a centered and scaled matrix, respectively (**Figure 2A**, Step 2). The latter is the most popular form of PCA. Linear regression using nonlinear iterative partial least-squares (NIPALS), eigen analysis, or singular value decomposition (SVD) are a few of the many ways to factorize or decompose a matrix. SVD is a basic matrix operation, and fast approximations of SVD, including IRLBA, are commonly applied to sc data [extensively reviewed by (45)].
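The two forms of PCA described above can be sketched with NumPy as SVD of a centered matrix (covariance-based) or of a centered and scaled matrix (correlation-based). The helper `pca_svd` and the simulated data are hypothetical and serve only to show that SVD and eigen analysis give the same component variances.

```python
import numpy as np


def pca_svd(X, scale=False):
    """PCA of X (rows are observations, columns are variables) via SVD.

    scale=False gives covariance-based PCA (centered only);
    scale=True gives correlation-based PCA (centered and scaled).
    Returns (scores, loadings, component variances).
    """
    Xc = X - X.mean(axis=0)
    if scale:
        Xc = Xc / X.std(axis=0, ddof=1)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * d                          # coordinates of each observation
    variances = d**2 / (X.shape[0] - 1)     # eigenvalues of the cov/cor matrix
    return scores, Vt.T, variances


rng = np.random.default_rng(1)
# Simulated variables with very different spreads.
X = rng.normal(size=(50, 4)) * np.array([1.0, 5.0, 0.5, 2.0])

_, _, var_cov = pca_svd(X, scale=False)
_, _, var_cor = pca_svd(X, scale=True)

# The same values are obtained by eigen analysis of the covariance
# and correlation matrices.
eig_cov = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
eig_cor = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
```

Without scaling, the variable with the largest spread dominates the first component; with scaling, every variable contributes unit variance, so the component variances sum to the number of variables.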

FIGURE 2 | Matrix Factorization of sc data: (A) schematic diagram of a PCA or CCA workflow, includes: (1) filtering of datasets to intersecting genes; (2) scaling, transformation, and normalization of individual and joint count matrices; (3) concatenating matrices and applying a matrix factorization, usually singular value decomposition (SVD); and (4) visualizing results. SVD is a matrix operation that finds for a given input matrix the left singular vectors (U), the right singular vectors (V), and the singular values (D), such that the product of U and V with their respective transpose matrices is the identity matrix. Each singular vector is orthogonal to the others, and they are ordered such that the first component explains the greatest variance, and each subsequent component explains less than the preceding. (B) The first two principal components of SVD performed on counts and log-transformed counts of the scMix benchmarking data (10), comprising 3 cell lines (HCC827, H1975, and H2228), that were unprocessed, centered, and centered and scaled, to reflect SVD, covariance-based and correlation-based PCA, respectively. Results from covariance-based and correlation-based PCA applied to log-transformed data are similarly effective, showing moderate data integration and separation by cell type, but an arch effect is visible on PC1 and PC2 in SVD of the raw counts. (C) Covariance-based and correlation-based PCA of log-transformed data, colored by sequencing depth, show that unadjusted differences in sequencing depth limit integration, forming a gradient across each cluster. (D) The first three principal components from Canonical Correlation Analysis (CCA) of scMix data. In both raw counts and log-transformed data, PC1 provides poor separation by cell type and batch integration. The plot of PC2 by PC3 from CCA on log-transformed data shows reasonable clustering by cell line but poor batch integration; in contrast, the PC2 by PC3 plot from CCA on raw data shows better batch integration and poorer separation by cell type.

SVD factors an input matrix into three matrices U, D, and V, as illustrated schematically in **Figure 2A** (46) (R code to perform PCA via both eigen analysis and SVD is provided in Supplementary Methods). The maximum number of principal components or rank of the analysis is the number of rows or columns of the matrix (whichever is lower, n-1, or p-1), though typically 30 or fewer components are examined in most scRNA-seq pipelines (22). Selection of the correct number of components is non-trivial and most commonly achieved by heuristic approaches. To understand the distribution of variance explained by each component, scree plots can also be a helpful visual tool (47, 48) and permutation-based approaches are recommended (49, 50).
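The properties of the three SVD factors, and the quantities a scree plot displays, can be checked numerically. The following is a minimal sketch on simulated data; the matrix dimensions and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 6))
Xc = X - X.mean(axis=0)   # column-centered, as in covariance-based PCA

# Thin SVD: Xc = U @ diag(d) @ Vt, with singular values in decreasing order.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# Reconstruction and orthonormality checks.
reconstructed = U @ np.diag(d) @ Vt
is_orthonormal_U = np.allclose(U.T @ U, np.eye(U.shape[1]))
is_orthonormal_V = np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))

# Proportion of variance explained by each component:
# these are the values plotted in a scree plot.
explained = d**2 / np.sum(d**2)
```

Plotting `explained` against the component index gives the scree plot; an elbow in that curve is one common heuristic for choosing how many components to retain.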

**Figure 2B** displays SVD of raw count or log<sub>2</sub>-transformed count matrices that were (1) unprocessed (top row); (2) centered by subtracting column means (middle row); or (3) centered and scaled (bottom row). These correspond to SVD alone, PCA of a covariance matrix (princomp in R), and PCA of a correlation matrix (prcomp in R), respectively (**Figure 2B**). These are applied to a small, well-described benchmarking dataset (10), comprising scRNA-seq measurements of a three cell line mixture on three technological platforms (10X, Dropseq, and CELseq2). Both forms of PCA had greater success in finding structure in the data as compared to SVD alone. However, clusters of cell lines could only be distinguished in data that were log transformed. Moderate cross-platform integration was observed in data that were centered, or centered and scaled (equivalent to PCA of a covariance or correlation matrix, respectively). Nonetheless, as illustrated in **Figure 2C**, we observe that systematic differences in sequencing depth between the three platforms still create a gradient across each cluster, preventing full integration. Whilst this analysis was performed on all variables (genes), we and others have found that excluding genes with low variability and high numbers of zeros prior to dimensionality reduction may increase the signal to noise ratio (12, 48, 51).

#### The Horseshoe or Arch Effect

PCA is optimized for continuous, normally distributed data and is suboptimal when applied to sparse data with many zero counts. The arch or horseshoe is a common pitfall and has been described in detail in the literature (44, 52, 53). This distortion results from the presence of a gradient or sequential latent ordering in the data [Tutorial by (54)]. In the top row of **Figure 2B** all of the cell lines on the first component (PC1) are on the same side of the origin, forming a classical horseshoe pattern, characterized by a distinctive "arched" shape, with points mostly on one side of the origin and folding back on itself in one of the dimensions. This indicates that additional data preprocessing is required; cell lines cannot be distinguished, and the data are not integrated across batches. In the top right plot of **Figure 2B** which shows SVD on unprocessed log counts, the first 2 PCs appear correlated, but are by definition orthogonal—their dot product is 0. Orthogonal vectors are uncorrelated only when at least one of them has mean 0. In contrast, when data are centered (e.g., middle and bottom row of **Figure 2B**), these artifacts are gone. It is vital that such arch effects are identified, especially when PCA forms part of a computational workflow that extracts the first n principal components without inspection. As seen in **Figure 2**, preprocessing and data normalization can remove arch artifacts and we refer the reader to excellent recent reviews on the subject (44, 52–54).
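The statement that orthogonal vectors are uncorrelated only when at least one of them has mean 0 follows from the fact that the sum of products of the centered vectors equals the dot product minus n times the product of the two means. A small numerical check, using invented vectors chosen purely for illustration:

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


u = np.array([1.0, 2.0, 3.0])    # mean 2
v = np.array([1.0, 1.0, -1.0])   # mean 1/3; orthogonal to u
w = np.array([1.0, -2.0, 1.0])   # mean 0; also orthogonal to u

dot_uv = float(u @ v)
dot_uw = float(u @ w)

# u and v are orthogonal yet strongly correlated, because neither has mean 0.
corr_uv = pearson(u, v)

# u and w are orthogonal and uncorrelated, because w has mean 0.
corr_uw = pearson(u, w)
```

This is the situation in uncentered SVD: the singular vectors are orthogonal, but their plotted coordinates can still appear correlated.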

Examining PC plots can illuminate issues beyond the arch effect, in this case for instance, showing that the 10X data are located further from the origin on PC1 and PC2 as a result of difference in sequencing depth between platforms (**Figures 2B,C**). This can be corrected for by scaling the size factors by dataset to account for these systematic differences prior to log-normalization (55).

## Integrating Two or More Datasets With K-table Matrix Factorization

Matrix factorization approaches have been highly effective and widely applied to removing batch effects in bulk omics data (56, 57). Whilst dimensionality reduction methods like PCA can discover batch effects (1, 11, 28), and could also be applied to remove within or even between batch effects in sc data, it is more common to explicitly define the blocks, groups, or datasets to be integrated and apply matrix factorization that is designed to extract correlated structure between groups. Emerging sc data integration and cross-study batch correction methods frequently use PCA or joint matrix decompositions as a first step.

Matrix factorization approaches that integrate multiple groups or matrices with matched rows or columns, often called K-table, multi-block component analysis, or tensor decompositions, have been applied to both bulk and scRNA-seq data integration (46). The simplest K-table approach is possibly Procrustean analysis (58, 59). Procrustes was a figure from Greek mythology who was famous for cutting the limbs of, or stretching, unknowing passers-by so that they fit into his bed. Similarly, Procrustean analysis involves rotation or reduction of a component from one PCA to best fit a second PCA. Several other matrix factorization approaches for K-table analysis exist (46).
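The rotation step of a Procrustean fit has a closed form in two dimensions, which makes for a compact illustration. The sketch below is a simplified, rotation-only version on invented scores (full Procrustes analysis may also include scaling and reflection); the function names are assumptions. It finds the angle that rotates one centered set of 2D component scores onto another with the smallest sum of squared distances.

```python
from math import atan2, cos, sin, pi

def center(points):
    """Translate 2D points so that their centroid is at the origin."""
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    return [(x - mean_x, y - mean_y) for x, y in points]

def procrustes_angle(source, target):
    """Angle (radians) rotating centered source onto centered target
    with the smallest sum of squared distances between matched points."""
    a, b = center(source), center(target)
    cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    dots = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    return atan2(cross, dots)

def rotate(points, theta):
    """Rotate 2D points counter-clockwise by theta radians."""
    c, s = cos(theta), sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Invented scores of four matched samples on the first two components of two PCAs.
scores_first = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
scores_second = [(0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (1.0, 0.0)]

theta = procrustes_angle(scores_first, scores_second)
aligned = rotate(center(scores_first), theta)
```

In higher dimensions the same least-squares rotation is usually obtained from an SVD of the cross-product of the two score matrices.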

Arguably the most popular K-table approach applied to omics data is canonical correlation analysis [CCA, (60, 61)], which maximizes the correlation between components, or canonical variables, of each dataset and has been widely applied to the integration of bulk omics data [reviewed by (46, 62)]. Classical CCA requires more observations than features, and therefore sparse implementations that include feature selection are used in the analysis of bulk omics data (63, 64). CCA and adaptations of CCA have been applied to integrate scRNA-seq data, including the cross-study integration of stimulated and resting human peripheral blood mononuclear cells (PBMCs); cross-platform integration of mouse hematopoietic progenitor scRNA-seq data; and heterogeneous case-control cell populations after drug exposure (16, 22). Seurat 3 uses CCA with anchors, extracted using mutual nearest neighbors on the CCA subspace, to align datasets (65). Harmony uses PCA as a first step (66). PCA or CCA is the first step in scAlign, a neural-network-based method for pairwise or data-to-reference alignment of single cell data (67), which was reported to outperform other single cell alignment methods (CCA in Seurat, scVI, MNN, scanorama, scmap, MINT, and scMerge). Non-linear matrix factorization approaches for integration of datasets include joint NMF [LIGER, (68)], but in a recent comparative study this was reported to be computationally slow and to potentially overlay samples of little biological resemblance compared to the other methods (69). A benchmark comparison of 14 methods for integration of scRNA-seq datasets, on datasets from different technologies with identical cell types, non-identical cell types, multiple batches, big data, and simulated data, revealed that Harmony, LIGER, and Seurat 3 CCA are the most performant (65).
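The mutual nearest neighbor idea mentioned above can be illustrated independently of any package. The following is a generic sketch, not Seurat's anchor-finding implementation: the function names, the brute-force distance search, and the toy coordinates are assumptions. Given two datasets already embedded in a shared low-dimensional space, a pair of cells is kept only if each is among the other's k nearest neighbors in the opposite dataset.

```python
def nearest(query, reference, k):
    """Indices of the k reference points closest to query (squared Euclidean)."""
    order = sorted(
        range(len(reference)),
        key=lambda j: sum((q - r) ** 2 for q, r in zip(query, reference[j])),
    )
    return set(order[:k])

def mutual_nearest_neighbors(first, second, k=1):
    """Pairs (i, j) where second[j] is among the k nearest neighbors of
    first[i] and first[i] is among the k nearest neighbors of second[j]."""
    first_to_second = [nearest(p, second, k) for p in first]
    second_to_first = [nearest(p, first, k) for p in second]
    return [(i, j)
            for i in range(len(first))
            for j in sorted(first_to_second[i])
            if i in second_to_first[j]]

# Invented 2D embeddings: the third cell of the first dataset has no counterpart.
first = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
second = [(0.2, 0.1), (5.1, 4.8)]

pairs = mutual_nearest_neighbors(first, second)
```

Requiring the match to hold in both directions is what prevents a cell type present in only one dataset from being forced onto an unrelated population.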

Other matrix decomposition approaches, including multiple co-inertia analysis (48, 70), multiple factor analysis (71, 72), and consensus PCA (73–75), maximize a covariance or squared covariance criterion and are not limited by a requirement for more observations than features. These have been applied to bulk omics data and clustering; for example, Meng et al. applied Westerhuis's modified implementation of consensus PCA to integrate methylation, proteomic, and genomics data, reporting that it was performant and faster than iCluster/iCluster+ (75). Dimension reduction methods for both single and K-table analysis, including a summary of the mathematical formulae and an overview of available software packages for each mode of analysis, have been recently reviewed (46). Of note, there is also a recently described generalized framework to easily modulate between covariance and correlation optimization in integrative matrix factorization (62, 76).

#### Horseshoes in CCA

Similar to PCA, a problematic arch effect is seen on PC1 and PC2 (**Figure 2D**) when CCA is applied to align and integrate raw counts or log counts of scRNA-seq measurements of three cell lines that were obtained on three technological platforms: 10X, Dropseq, and CELseq2 (10). The raw data had more platform overlap, and the log-transformed had less overlap in cell types in PC2 and PC3 (**Figure 2D**). These data demonstrate that, if CCA is used as a first step in a pipeline, it should include a check for the presence of such artifacts. For example, upon examining **Figure 2D**, one could exclude PC1, since CCA integrates the data across platforms in PC2 and PC3.

#### Scaling of Datasets in CCA

Simultaneous integration of multiple matrices is more complex than integrative analysis of a single dataset because each dataset may have different numbers of observations (cells), internal structure, and variance. In this CCA vignette (**Figure 2D**), the 10X dataset exhibited less correlated structure with the Dropseq and CELseq2 datasets, which had lower sequencing depth (**Figure 2C**). Therefore, in K-table matrix decomposition two levels of preprocessing are recommended. First, each individual dataset is normalized, centered, and scaled. Second, datasets are scaled by cross-dataset size factors (55) and weighted to inflate or deflate the contribution of individual datasets, for example by the square root of their total inertia, the percent variance on the first principal component, sample size, or another measure of data quality or expected contribution [reviewed by (46)].
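As an illustration of the second level of preprocessing, the sketch below weights each table by the square root of its total inertia. It is a minimal example under stated assumptions: total inertia is taken here as the sum of squared deviations from the column means (conventions differ, and some definitions also divide by the number of observations), and the function names and toy tables are invented.

```python
from math import sqrt

def column_means(table):
    """Mean of each column; rows are observations and columns are features."""
    n_rows = len(table)
    return [sum(row[j] for row in table) / n_rows for j in range(len(table[0]))]

def total_inertia(table):
    """Sum of squared deviations of every entry from its column mean."""
    means = column_means(table)
    return sum((x - m) ** 2 for row in table for x, m in zip(row, means))

def weight_table(table):
    """Center the columns and divide all entries by sqrt(total inertia)."""
    means = column_means(table)
    scale = sqrt(total_inertia(table))
    return [[(x - m) / scale for x, m in zip(row, means)] for row in table]

# Two invented tables on very different scales (e.g., deep and shallow platforms).
platform_a = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
platform_b = [[100.0, 0.0], [300.0, 50.0], [200.0, 400.0]]

weighted = [weight_table(t) for t in (platform_a, platform_b)]
inertias = [total_inertia(t) for t in weighted]
```

After weighting, each table has a total inertia of 1, so no single dataset dominates the joint decomposition simply because its values are larger.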

#### Key Takeaways

When applying matrix factorization methods including PCA, it is recommended to consider the impact of scaling, log-transforming, standardization, and normalization. Common data challenges, and tips to address them, include:


# SUMMARY

Single cell omics data are expanding our understanding of tumor heterogeneity, the tumor microenvironment, and tumor immunology. Algorithms for cell clustering, cell type identification, and cell trajectory analysis rely on dimension reduction to achieve computationally tractable solutions. The sparsity, noise, and high dimensionality of these data present unique challenges and underscore the importance of dimension reduction in sc analysis. PCA is widely used and popular for its speed, scalability, and performance, though it may not be the optimal method for sc data. Matrix factorization approaches optimized for count matrices or distance matrices have been described [reviewed by (38)], and it is likely that more performant data preprocessing, scaling, and transformation approaches will continue to be developed. These methods will improve the performance of dimension reduction approaches in sc data integration and analysis.

#### RESOURCES

We include below a short list of single cell analysis resources, vignettes, and reference materials:

https://osca.bioconductor.org/ https://github.com/seandavi/awesome-single-cell https://satijalab.org/seurat/ https://hemberg-lab.github.io/scRNA.seq.course/ https://github.com/SingleCellTranscriptomics

#### SUPPLEMENTAL MATERIAL

R code to reproduce these figures, which describes different implementations of SVD and PCA, is publicly available at https://github.com/aedin/Frontiers\_Supplement/. It includes code to compute PCA by SVD and by eigenanalysis, as well as with the R functions princomp and prcomp and the R packages ade4 and FactoMineR. In each case, the relationship between these methods is described.

#### AUTHOR CONTRIBUTIONS

LH and AC wrote the paper. LH wrote the code and performed analysis. AC wrote the online supplemental PCA vignette code.

#### FUNDING

We are grateful for funding from Stand Up to Cancer, National Institutes for Health (5U01CA214846 and 5P50CA101942), and the Assistant Secretary of Defense Health Program, through the Breast Cancer Research Program (W81XWH-15-1-0013 to AC). Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense. This project has been made possible in part by grant number CZF2019-002443 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Hsu and Culhane. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Toward Systems Biomarkers of Response to Immune Checkpoint Blockers

#### Óscar Lapuente-Santana<sup>1</sup> and Federica Eduati<sup>1,2</sup>\*

<sup>1</sup> Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, Netherlands, <sup>2</sup> Institute for Complex Molecular Systems, Eindhoven University of Technology, Eindhoven, Netherlands

Immunotherapy with checkpoint blockers (ICBs), aimed at unleashing the immune response toward tumor cells, has shown a great improvement in overall patient survival compared to standard therapy, but only in a subset of patients. While a number of recent studies have significantly improved our understanding of mechanisms playing an important role in the tumor microenvironment (TME), we still have an incomplete view of how the TME works as a whole. This hampers our ability to effectively predict the large heterogeneity of patients' response to ICBs. Systems approaches could overcome this limitation by adopting a holistic perspective to analyze the complexity of tumors. In this Mini Review, we focus on how an integrative view of the increasingly available multi-omics experimental data and computational approaches enables the definition of new systems-based predictive biomarkers. In particular, we will focus on three facets of the TME toward the definition of new systems biomarkers. First, we will review how different types of immune cells influence the efficacy of ICBs, not only in terms of their quantification, but also considering their localization and functional state. Second, we will focus on how different cells in the TME interact, analyzing how inter- and intra-cellular networks play an important role in shaping the immune response and are responsible for resistance to immunotherapy. Finally, we will describe the potential of looking at these networks as dynamic systems and how mathematical models can be used to study the rewiring of the complex interactions taking place in the TME.

Keywords: tumor microenvironment, precision immuno-oncology, multi-omics profiling, systems biology, predictive biomarkers, cancer signaling networks, immune checkpoint blockers

#### Edited by:

Enrica Calura, University of Padova, Italy

#### Reviewed by:

Howard Donninger, University of Louisville, United States Mirjana Efremova, Wellcome Sanger Institute (WT), United Kingdom

> \*Correspondence: Federica Eduati f.eduati@tue.nl

#### Specialty section:

This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology

Received: 30 January 2020 Accepted: 22 May 2020 Published: 24 June 2020

#### Citation:

Lapuente-Santana Ó and Eduati F (2020) Toward Systems Biomarkers of Response to Immune Checkpoint Blockers. Front. Oncol. 10:1027. doi: 10.3389/fonc.2020.01027

# A CHANGE IN THE LANDSCAPE OF BIOMARKERS DISCOVERY

Tumor cells are able to activate several mechanisms to evade the immune response by disguising themselves as "self" cells. By binding to inhibitory checkpoint molecules (i.e., immune checkpoints), they can block antitumor activities of the immune system. Immunotherapy with immune checkpoint blockers (ICBs) uses antibodies to target immune checkpoints, such as PD1, PD-L1, and CTLA-4, unleashing the immune response. In clinical trials, ICB therapy has been shown to achieve durable therapeutic response and to increase patient survival in different cancer types, although only a small number of ICBs are FDA-approved so far (1, 2). Even if clinically approved, ICB therapy is effective for a small subset of patients. Given the potential immunological toxicity (3, 4) and the elevated costs (>US\$100,000 per patient per year) (5) associated with ICBs, it is of paramount importance to be able to predict which patients will likely respond to the therapy, in order to administer the optimal treatment based on biomarkers.

The investigation of mechanisms supporting immune resistance has provided a great opportunity for biomarker discovery of patient response to ICBs (**Figure 1**). Two biomarkers have been clinically approved for PD-1/PD-L1 blockade therapy: the first is immunohistochemistry (IHC) staining of PD-L1 in non-small-cell lung cancer (NSCLC), melanoma, renal cell carcinoma (RCC), urothelial cancer, and triple-negative breast cancer (TNBC) (6); and the second is high microsatellite instability/defective mismatch repair (MSI-H/dMMR) regardless of tumor type (7, 8). Other emerging predictive biomarkers such as tumor mutational burden (TMB) (9, 10), signatures of a T cell inflamed tumor microenvironment (TME) either alone (10) or in combination (11), and neoantigen load (12–14) are still undergoing clinical trials. In addition, T cell receptor (TCR) diversity has been used as a biomarker to monitor the clonal expansion of T cells in breast cancer, glioma, cervical cancer, and leukemia/lymphoma (15–18). Further efforts both to exploit the utility of these biomarkers and to search for additional ones are still ongoing. For a complete review of these biomarkers and in which tumors they work, we refer to Havel et al. (19).

Despite being promising, these biomarkers also present some limitations. For instance, IHC enables measuring PD-L1 expressed on tumor cells; however, the expression of this biomarker fluctuates over time and varies between different tumor sites. This variability undermines the ability to evaluate the effectiveness of PD-1/PD-L1 therapies based on IHC, as reviewed in Topalian et al. (20) and Camidge et al. (21). Another example is TMB, which is known to correlate imperfectly with clinical response (12, 13, 22). Neoantigen burden should partially overcome this issue; however, most computational tools fail to estimate true neoantigens (19, 20, 23), and additional features should be considered to better determine neoantigen immunogenicity, as reviewed in Finotello et al. (24).

The above-mentioned examples shed light on the conceptual problem of looking only at individual components of the TME. While the characterization of the different parts playing a role in the interaction between the tumor and the immune system has been essential to elucidate the most important actionable mechanisms, further research is required to define biomarkers that harness a more coordinated joint action of these mechanisms. Predictive biomarkers for immunotherapy with ICBs have been extensively reviewed previously (19, 20, 23, 25). In this Mini Review we focus on how a holistic profiling of the TME can provide new opportunities for identifying systems-based biomarkers built on existing synergies between the different individual components of the TME. Such a shift toward multifaceted strategies has been favored by increasingly available multi-omics data from bulk populations, individual cells, and imaging technologies (26), which can be integrated using computational approaches. In the following sections we will describe how biomarkers can be derived by considering three increasing levels of complexity. The first is the cellular component, focusing on the immune contexture of tumors, such as immune cell quantification, functionality, and localization. The second is the network of communication between and within cells of the TME. Finally, we will elaborate on how mathematical models can be used to take the dynamic nature of these networks into account.

# THE ROLE OF THE IMMUNE CONTEXTURE ON ICB EFFICACY

It is well known that different types of immune cells can play different roles in the response to ICBs (27). For example, while the presence of CD8+ T cells within the TME is a good biomarker of ICB efficacy, a high abundance of regulatory T (Treg) cells is generally associated with poor prognosis. Different tools have been developed to quantify tumor-infiltrating immune cells from bulk (RNA-seq) and single-cell (scRNA-seq) RNA sequencing measurements, as extensively reviewed in Finotello and Eduati (26) and Finotello and Trajanoski (28).

Apart from quantification of immune cells, their spatial localization also plays a pivotal role in the response to immunotherapy (29). For instance, CD8+ T cells not only need to be present but also need to have infiltrated the tumor (hot tumor) for ICB therapy to work. In fact, pure quantification of CD8+ T cells is not always associated with favorable prognosis (30). Imaging techniques can be used to explore the spatial patterns of immune infiltration. A notable example of a biomarker that uses IHC to assess both the abundance and the location (tumor center and invasive margin) of two lymphocyte populations (CD3+ and CD8+ T cells) is the immunoscore (31), which was shown to accurately predict survival in colorectal cancer patients (32). More recently, spatial information on CD8+ T cells from IHC was integrated with transcriptomics data to study the effect of lymphocyte infiltration in patients with TNBC, providing predictive biomarkers of ICB response (33). Automatic approaches for image analysis could prove useful in the future for high-throughput identification of spatial biomarkers. A first attempt in this direction was the development of tumor-infiltrating lymphocyte maps by using deep learning on images from the cancer genome atlas (TCGA) (34).

Another important factor that affects patients' response to ICBs is the functional state of the different immune cells (35). Dysfunctional states of T cells can be characterized from bulk and single-cell RNA-seq (36–38) and epigenetic profiling (39–41). ICBs aim at rescuing dysfunctional T cells; therefore, the investigation of their functional state can inform on the success and limitations of ICB therapy (36–39, 41). Depending on the type of stimulatory signal, macrophages (42, 43) and B cells (44, 45) can develop into functional subsets that have either positive or negative effects on tumors. Another example is dendritic cells (DCs), which normally control cancer antigen presentation and the priming and activation of T cell responses; however, the TME can compromise their ability to stimulate the immune response (46, 47). Certain computational tools for cell-type quantification can also unmask the phenotypic state of cell subpopulations in the TME by inferring the transcriptomic profiles of individual cells (48, 49). A promising research direction for biomarker discovery is also offered by new technologies that allow the generation of omics data from tissue slides while preserving cell spatial identity (50, 51). These approaches would result in combined localization and characterization of the cells in the TME.

**Abbreviations:** CTLA-4, cytotoxic T lymphocyte antigen 4; DC, dendritic cell; ICB, immune checkpoint blocker; IFNγ, interferon gamma; IHC, immunohistochemistry; MMR, DNA mismatch repair; MSI-H, high microsatellite instability; NOS2, nitric oxide synthase 2; NSCLC, non-small-cell lung cancer; PD-L1, programmed cell death-ligand 1; PD-1, programmed cell death protein 1; RCC, renal cell carcinoma; RNA-seq, RNA sequencing; scRNA-seq, single-cell RNA sequencing; TCGA, the cancer genome atlas; TCR, T cell receptor; TMB, tumor mutational burden; TME, tumor microenvironment; TNBC, triple-negative breast cancer; TNF, tumor necrosis factor; Treg, regulatory T cell.

Analysis of immune infiltrate quantification, functionality, and localization can help both to explain the diversity of the tumor immune milieu and to develop informative biomarkers for ICBs (27, 52, 53). Pointing in this direction, different efforts have recently explored the use of bulk transcriptomics data to derive more complex immune-related scores to assess the likelihood that a patient will respond to ICBs (38, 54–63).

### INTRA- AND INTER-CELLULAR NETWORKS ORCHESTRATE THE IMMUNE RESPONSE

The functional state of cells in the TME is defined by a complex system of communication between molecules within the cells (intra-cellular networks) and among different cells (inter-cellular networks). Looking at intra- and inter-cellular networks can provide a more holistic perspective of the TME and inform a new class of biomarkers for immunotherapy and its potential combination with other targeted therapies (64).

Intra-cellular signaling pathways play a part in shaping the interaction with the immune system [(65, 66); **Figure 2**]. Abnormalities in tumor-intrinsic signaling, involving oncogenes and tumor suppressor genes, have been associated with mechanisms of inherent immune resistance (67). Examples are PTEN loss (68) or EGFR gain of function (69), both causing PI3K-Akt pathway activation and leading to over-expression of PD-L1 and consequent immunoresistance. Due to the complexity of signaling pathways, with numerous cross-talks and feedback loops, the adoption of individual oncogenic drivers as biomarkers is not expected to be effective in most cases (20). In fact, PD-L1 signal is directly regulated by numerous oncogenic pathways such as Ras, mTOR, EGFR, MEK, ERK, and MAPK (70). Besides pathways regulating immune checkpoints, other signaling cross-talks control the immune response from different perspectives, like inactivation of TP53 or activation of β-catenin pathway, both reducing chemokine production by tumor cells and thereby reducing recruitment of immune cells into the TME (71, 72).

In addition, cancer cells receive signals from other cells in the TME through ligand-receptor interactions. These inter-cellular communications lead to changes in the phenotype of the regulated cells, thus playing an important role in both progression and prognosis of cancer (73, 74). An example is the response elicited in cancer cells by two cytokines (TNF and IFNγ) produced by activated T cells. These cytokines induce PD-L1 expression through JAK-STAT and NF-κB signaling, inducing acquired resistance to the immune response (75, 76). Another study identified a relationship between high expression of NOS2 and prolonged IFN signaling in tumors resistant to PD-1 blockade (77).

While collections of intra- (78) and inter-cellular (79) interactions can be derived from literature and databases, additional data are required to characterize the networks for each patient or group of patients. Transcriptomics and proteomics data can provide the basis to study intra- and inter-cellular signaling networks. Imaging data can also be integrated to improve our understanding of the spatial localization of interacting cells. Computational methods have been developed to infer integrated inter- and intra-cellular networks from bulk (80, 81) and single-cell (81, 82) RNA-seq data. These tools could be exploited to derive biomarkers for immunotherapy by studying the functional effect of cell-cell communication. In a recent study, a curated database of ligand-receptor interactions (79) was integrated with gene expression data to deconvolute the transcriptional profile of cancer and stromal cells and infer cross-talks in the TME (83). Interestingly, the authors show that for different cancer types, PD-L1 expression is higher on cancer or stromal cells, which correlates well with the general responsiveness to immunotherapy. Further research is required to assess whether this also holds for individual patients, which would make it potentially a more effective biomarker than bulk PD-L1 expression. In another recent publication (84), researchers performed an extensive literature curation to derive a comprehensive signaling network of the innate immune response in cancer, including cell type-specific signaling in macrophages, DCs, myeloid-derived suppressor cells, and natural killer cells. This network was then integrated with scRNA-seq data from macrophages and natural killer cells in melanoma to study the heterogeneity of innate immune cell types and could potentially be used to predict patient survival and response to immunotherapies. Finally, Worzfeld et al. combined parallel bulk transcriptomics and proteomics data on tumor cell spheroids, tumor-associated T cells, and macrophages to derive inter-cellular signaling networks in the ovarian cancer microenvironment (85). Such networks included several immune checkpoint regulators and appeared to have potential clinical relevance. Overall, these studies have demonstrated the enormous benefit that holistic approaches combining complex multicellular networks can bring to the immuno-oncology field, and we expect that more research efforts will be spent in this direction in the near future. Recent developments in 3D cell culture models resembling the TME are expected to provide a powerful tool for further in vitro and ex vivo investigation of intra-cellular communication and for studying its effect on the response to ICBs (86).

### THE POTENTIAL OF LOOKING AT THE DYNAMICITY AND PLASTICITY OF THE TME

It is well known that the cellular functional state changes dynamically in response to environmental changes and perturbations such as drug treatment (87, 88), calling for identification of the dynamic properties of the networks. The ideal data for dynamic functional characterization of the system's response are obtained upon perturbation (89). Functional screening of the effect of cancer drugs has so far focused on cancer cell lines. While cell lines are a debatable model system, they have proved to be a valuable tool to explore novel biomarkers of drug response (90, 91). High-throughput drug screening studies are now also being increasingly performed on organoids (92) or other 3D experimental models (86), which are more physiological human cancer models of the TME. These efforts open new ways for pre-clinical investigation of the effect of immunotherapy. Finally, more recent technologies also allow screening of patient biopsies without the need for culturing steps (93–95), paving the way for functional characterization of ex vivo tumor samples and potentially improving personalized cancer treatment.

To capture the functional context of the immune response, statistical and mathematical approaches are developing into more comprehensive methods that integrate multi-omics data and prior knowledge on network structure (**Figure 2**). While mathematical models do not fall into the standard definition of biomarkers, they can provide predictions of response to immunotherapy. Additionally, they can be used to define dynamic biomarkers based on properties of the modeled system, as opposed to static biomarkers that only consider the initial conditions of the system (88).

Dynamic mathematical models can be used to study intracellular networks of the different cell types populating the TME (96). To characterize these networks at the patient-specific level, models of signaling pathways in cancer cells have been trained from perturbation experiments (97, 98), gene expression data (99), or integrating multi-omics data (100). The resulting parameters corresponding to these personalized models can be relevant biomarkers of clinical outcome (99–101). Mathematical models have also been used to study intra-cellular signaling in T cells. This includes the investigation of how PD-1 leads to deactivation of the T cell receptor signaling (102) or mechanistic understanding of T cell exhaustion (103). PD-1 is one of the main targets of ICB, and exhausted T cells have a higher number of targetable checkpoint proteins like PD-1 and CTLA-4, therefore the investigation of these aspects could be relevant to identify possible biomarkers.

More studies are now focusing on mathematical models incorporating inter-cellular interactions to better capture the complexity of the TME. Agent-based models can be used to simulate the interactions between cells in the tumor microenvironment, represented as a 2D or 3D grid (104). Each cell is seen as an agent that can perform different tasks with a certain probability (e.g., cells can remain quiescent, divide, or die). Since the immune response can be seen as a probabilistic outcome of a complex system (88), agent-based models are an adequate mathematical approximation to capture this stochasticity. These models can be refined using a multitude of data types and used to simulate the effect of immunotherapy (105, 106), providing a variety of possible outcomes given the same initial conditions that can be interpreted as a probability of success. It has been shown that tumor-bearing inbred mice, which have only minimal differences, can respond differently to immunotherapy (88); therefore, models that can incorporate stochasticity provide an interesting approximation of the in vivo situation. Another approach to model cell-cell communication is response-time modeling (107), where cells are modeled as black boxes that can receive inputs (e.g., cytokines) from other cells, process them, and change state accordingly with a certain probability (e.g., immune cells can switch between inactive and active). Recently, Grandclaudon et al. combined perturbation data with a multivariate quantitative model to study context-dependent interactions between DCs and helper T cells (108). A different approach based on quantitative systems pharmacology was recently used to simulate the effect of ICB therapy in metastatic breast cancer patients using a four-compartment (central, peripheral, tumor-draining lymph node, and tumor) model (109).
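The agent-based idea described above can be sketched with a deliberately reduced toy model. The code below is not one of the published models (104–106): it tracks tumor cells only, on a small 2D grid, and the function names, grid size, and the probabilities `p_divide` and `p_die` are invented for illustration. At each step every cell dies, divides into a free neighboring site, or stays quiescent, and repeating the simulation with different random seeds yields a distribution of outcomes from identical initial conditions.

```python
import random

NEIGHBORS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def step(cells, size, p_divide, p_die, rng):
    """Advance all cells by one time step on a size x size grid."""
    updated = set(cells)
    for x, y in sorted(cells):          # sorted for reproducible iteration order
        draw = rng.random()
        if draw < p_die:
            updated.discard((x, y))     # the cell dies
        elif draw < p_die + p_divide:
            free = [(x + dx, y + dy) for dx, dy in NEIGHBORS
                    if 0 <= x + dx < size and 0 <= y + dy < size
                    and (x + dx, y + dy) not in updated]
            if free:
                updated.add(rng.choice(free))   # daughter fills a free site
        # otherwise the cell remains quiescent
    return updated

def simulate(seed, steps=20, size=15, p_divide=0.3, p_die=0.1):
    """Number of tumor cells left after a run started from a single cell."""
    rng = random.Random(seed)
    cells = {(size // 2, size // 2)}
    for _ in range(steps):
        cells = step(cells, size, p_divide, p_die, rng)
    return len(cells)

# Same initial conditions, different random seeds: a distribution of outcomes.
outcomes = [simulate(seed) for seed in range(50)]
fraction_cleared = sum(1 for n in outcomes if n == 0) / len(outcomes)
```

In this toy setting, a treatment effect could be represented hypothetically by raising `p_die`, and the fraction of runs in which the tumor is cleared plays the role of the probability of success; a realistic model would add immune cell agents and data-derived parameters.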

Additionally, combining mathematical models with longitudinal data, i.e., data collected at different time points, can be used to investigate the evolutionary dynamics of treatment response. This aspect is particularly relevant for distinguishing at an early stage real tumor progression (the patient should be assigned to a different treatment) from what is called pseudoprogression, i.e., temporary progression followed by a response to the treatment (the patient should be kept on ICB). The latter behavior has been described using a model of immune activation incorporating the dynamics of antigen presentation (110). Based on a system of three ordinary differential equations describing the interaction between tumor cells, Treg cells, and cytotoxic T cells, this model could explain why, in response to ICBs, the tumor can worsen before starting to regress. Other multi-cellular models have been used to derive in silico patients to test different possible dynamics of treatment response (111, 112), which could be compared with longitudinal measurements of tumor load from PET/CT imaging (112). Longitudinal data are often limited to non-invasive imaging and, in a few cases, to transcriptomics, IHC, TCR, and genome sequencing data (113, 114) for a limited number of time points due to the invasiveness of biopsies. Computational modeling of longitudinal data is still in its infancy, but we envision that in the future more mechanistic dynamic models will be able to exploit this type of data for the definition of dynamic biomarkers.
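To make the notion of a three-equation dynamic model concrete, the sketch below integrates a hypothetical system for tumor cells, Treg cells, and cytotoxic T cells with the forward Euler method. The equations and all rate constants are invented for illustration and are not the published model (110); no claim is made that this toy system reproduces pseudoprogression. It only shows the mechanics of simulating such a system and obtaining a time course that could be compared with longitudinal measurements.

```python
def simulate(tumor=0.1, treg=0.0, cytotoxic=0.05, dt=0.01, steps=5000):
    """Forward-Euler integration of a hypothetical three-variable system.

    State variables are dimensionless abundances; every rate constant
    below is an invented illustrative value.
    """
    growth, capacity, kill = 0.2, 1.0, 0.5            # tumor growth and killing
    recruit_treg, decay_treg = 0.1, 0.1               # Treg recruitment and loss
    recruit_ctl, suppress, decay_ctl = 0.3, 5.0, 0.1  # cytotoxic T cell terms
    trajectory = [(tumor, treg, cytotoxic)]
    for _ in range(steps):
        # Logistic tumor growth minus killing by cytotoxic T cells.
        d_tumor = growth * tumor * (1 - tumor / capacity) - kill * cytotoxic * tumor
        # Tregs are recruited by the tumor and decay at a constant rate.
        d_treg = recruit_treg * tumor - decay_treg * treg
        # Cytotoxic T cell recruitment is dampened by Tregs.
        d_ctl = recruit_ctl * tumor / (1 + suppress * treg) - decay_ctl * cytotoxic
        tumor += dt * d_tumor
        treg += dt * d_treg
        cytotoxic += dt * d_ctl
        trajectory.append((tumor, treg, cytotoxic))
    return trajectory

trajectory = simulate()
tumor_course = [state[0] for state in trajectory]
```

In a model of this kind, the effect of an ICB could be represented by changing a parameter such as the hypothetical `suppress` constant, and the shape of the simulated tumor course, rather than a single baseline value, would serve as a dynamic biomarker.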

#### CONCLUSIONS AND FUTURE PERSPECTIVES

Current limitations in identifying predictive biomarkers for ICB therapy are partially due to overlooking the complexity of the TME. Following the advancements in technologies to measure multi-omics data, measurements of bulk populations, individual cells, and spatial information have paved the way to a more comprehensive view of the TME. Recent efforts are focused on searching for signatures of response to ICBs that consider quantification, localization, and functionality of different immune cells in the TME, showing improved predictive power with respect to simpler biomarkers (115). However, they still miss an integrative strategy that takes a view of the whole TME, rather than examining each factor in isolation. In this respect, mechanistic models incorporating existing biological basis, e.g., on intra- and inter-cellular pathways, can accompany both therapy and biomarker development in immuno-oncology (116).

There is compelling evidence that the interplay of the immune system, tumors, organs, and the external environment harmonizes antitumor immune responses (117). Therefore, we envision that novel systems medicine approaches entailing mathematical models can gradually build up a profile of the TME, both in the lab and, more importantly, in the clinic. To this end, building patient-specific models has become increasingly important, especially when they are based on data that can be measured in clinical settings. Moreover, systems approaches can be especially useful in providing a rationale for alternative personalized treatments such as combinatorial therapy.

#### AUTHOR CONTRIBUTIONS

ÓL-S and FE wrote and edited the article. All authors contributed to the article and approved the submitted version.

#### FUNDING

Costs related to this publication are covered by the Computational Biology group of the Department of Biomedical Engineering of Eindhoven University of Technology.

#### ACKNOWLEDGMENTS

The authors would like to thank the members of the CBio group for their helpful proofreading suggestions.




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Lapuente-Santana and Eduati. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Integrative Network Fusion: A Multi-Omics Approach in Molecular Profiling

Marco Chierici<sup>1\*†</sup>, Nicole Bussola<sup>1,2†</sup>, Alessia Marcolini<sup>1†</sup>, Margherita Francescatto<sup>1,3</sup>, Alessandro Zandonà<sup>4</sup>, Lucia Trastulla<sup>5</sup>, Claudio Agostinelli<sup>2</sup>, Giuseppe Jurman<sup>1\*‡</sup> and Cesare Furlanello<sup>1,6‡</sup>

<sup>1</sup> Fondazione Bruno Kessler, Trento, Italy, <sup>2</sup> University of Trento, Trento, Italy, <sup>3</sup> Department of Medical, Surgical and Health Sciences, University of Trieste, Trieste, Italy, <sup>4</sup> NIDEK Technologies Srl, Albignasego (PD), Italy, <sup>5</sup> Max Planck Institute of Psychiatry, Munich, Germany, <sup>6</sup> HK3 Lab, Milan, Italy

#### Edited by:

Chiara Romualdi, University of Padova, Italy

#### Reviewed by:

Prashanth N. Suravajhala, Birla Institute of Scientific Research, India

Jun Zhong, National Cancer Institute (NCI), United States

#### \*Correspondence:

Marco Chierici chierici@fbk.eu

Giuseppe Jurman jurman@fbk.eu

†These authors share joint first authorship

‡These authors share joint last authorship

#### Specialty section:

This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology

Received: 31 March 2020 Accepted: 28 May 2020 Published: 30 June 2020

#### Citation:

Chierici M, Bussola N, Marcolini A, Francescatto M, Zandonà A, Trastulla L, Agostinelli C, Jurman G and Furlanello C (2020) Integrative Network Fusion: A Multi-Omics Approach in Molecular Profiling. Front. Oncol. 10:1065. doi: 10.3389/fonc.2020.01065

Recent technological advances and international efforts, such as The Cancer Genome Atlas (TCGA), have made available several pan-cancer datasets encompassing multiple omics layers with detailed clinical information for large collections of samples. The need has thus arisen for the development of computational methods aimed at improving cancer subtyping and biomarker identification from multi-modal data. Here we apply the Integrative Network Fusion (INF) pipeline, which combines multiple omics layers exploiting Similarity Network Fusion (SNF) within a machine learning predictive framework. INF includes a feature ranking scheme (rSNF) on SNF-integrated features, used by a classifier over juxtaposed multi-omics features (juXT). In particular, we show instances of INF implementing Random Forest (RF) and linear Support Vector Machine (LSVM) as the classifier, and two baseline RF and LSVM models are also trained on juXT. A compact RF model, called rSNFi, trained on the intersection of top-ranked biomarkers from the two approaches juXT and rSNF is finally derived. All the classifiers are run in a 10x5-fold cross-validation schema to warrant reproducibility, following the guidelines for an unbiased Data Analysis Plan by the US FDA-led initiatives MAQC/SEQC. INF is demonstrated on four classification tasks on three multi-modal TCGA oncogenomics datasets. Gene expression, protein expression and copy number variants are used to predict estrogen receptor status (BRCA-ER, N = 381) and breast invasive carcinoma subtypes (BRCA-subtypes, N = 305), while gene expression, miRNA expression and methylation data is used as predictor layers for acute myeloid leukemia and renal clear cell carcinoma survival (AML-OS, N = 157; KIRC-OS, N = 181). In test, INF achieved similar Matthews Correlation Coefficient (MCC) values and 97% to 83% smaller feature sizes (FS), compared with juXT for BRCA-ER (MCC: 0.83 vs. 0.80; FS: 56 vs. 1801) and BRCA-subtypes (0.84 vs. 0.80; 302 vs. 1801), improving KIRC-OS performance (0.38 vs. 0.31; 111 vs. 2319). INF predictions are generally more accurate in test than one-dimensional omics models, with smaller signatures too, where transcriptomics consistently plays the leading role. Overall, the INF framework effectively integrates multiple data levels in oncogenomics classification tasks, improving over the performance of single layers alone and naive juxtaposition, and provides compact signature sizes<sup>1</sup>.

Keywords: multi-omics, classification, network, oncogenomics, predictive modeling

#### 1. INTRODUCTION

The challenge of integrating multi-omics data is as old as bioinformatics itself (1, 2) but, despite the extensive literature, it remains an open issue, one that major institutions still consider worth funding<sup>2</sup>.

This study introduces Integrative Network Fusion (INF), a reproducible network-based framework for high-throughput omics data integration that leverages machine learning models to extract multi-omics predictive biomarkers. Originally conceptualized and tested on multi-omics metagenomics data in an early preliminary version (3, 4), INF combines the signatures retrieved from both the early-integration approach of variable juxtaposition (juXT) and an intermediate-integration approach [SNF, (5)], to find the optimal set of predictive features. In particular, a set of top-ranked features is first extracted from juXT by a classifier, here Random Forest (RF) or linear Support Vector Machine (LSVM). Then, a feature ranking scheme (rSNF) is computed on SNF-integrated features, and finally an RF model (rSNFi) is trained on the intersection of the two sets of top-ranked features from juXT and rSNF, obtaining an approach that effectively integrates multiple omics layers and provides compact predictive signatures. Selection bias and data-leakage effects are controlled by performing the experiments within a rigorous Data Analysis Plan (DAP) to warrant reproducibility, following the guidelines of the US FDA-led initiatives MAQC/SEQC (6–8). In particular, to alleviate the computational burden of the full DAP pipeline, an approximated DAP is designed to lighten computing without significantly affecting the results. Further, experiments are run on samples with randomly shuffled labels as a sanity check vs. overfitting effects and, finally, INF robustness is verified by testing on different train/test splits.

We test INF on three datasets retrieved from the TCGA repository, to predict either the estrogen receptor status (ER) or the cancer subtype on the breast invasive carcinoma (BRCA) dataset, and to predict the overall survival (OS) on the kidney renal clear cell carcinoma (KIRC) and acute myeloid leukemia (AML) datasets. Overall, INF improves over the performance of single layers and naive juxtaposition on all four oncogenomics tasks, extracting a biologically meaningful compact set of predictive biomarkers. Notably, the transcriptomics layer is prevalent inside the inferred INF signatures, consistently with published findings (9).

BRCA, breast invasive carcinoma; AML, acute myeloid leukemia; KIRC, kidney renal clear cell carcinoma; gene, gene expression; cnv, copy number variants; prot, protein expression; meth, methylation; mirna, microRNA expression; ER, estrogen receptor; subtypes, breast cancer subtypes; OS, overall survival; ST, synthetic target.

The INF framework is currently designed to integrate an arbitrary number of one-dimensional omics layers. We plan to further extend the framework by enabling the integration of histopathological features extracted from whole slide images (10) or deep features from radiological images (11) extracted by deep neural network architectures, carefully addressing all potential caveats (12).

#### 2. MATERIALS AND METHODS

#### 2.1. Data

Three multi-modal cancer datasets generated by The Cancer Genome Atlas (TCGA) Research Network (https://www.cancer.gov/tcga) and four classification tasks are considered in this study. Protein expression (prot), gene expression (gene), and copy number variants (cnv) are used to predict breast invasive carcinoma (BRCA) estrogen receptor status (0: negative; 1: positive) and subtypes (luminal A, luminal B, basal-like, HER2 enriched). Methylation (meth), gene expression (gene), and microRNA expression (mirna) are used to predict acute myeloid leukemia (AML) and kidney renal clear cell carcinoma (KIRC) overall survival (0: alive; 1: deceased). The number of samples and features for each omic layer and classification task are detailed in **Table 1**; class balance, split by dataset, is reported in **Table 2**.

For AML (13) and KIRC (14), gene expression is profiled using the Illumina HiSeq2000 and quantified as log2-transformed RSEM normalized counts; miRNA mature strand expression is profiled using the Illumina Genome Analyzer and quantified as reads per million miRNA mapped; and methylation is assessed by Illumina Human Methylation 450K and expressed as beta values. For BRCA (15), gene expression is profiled with Agilent 244K custom gene expression microarrays; protein expression is assessed by reverse phase protein arrays; copy number profiles are measured using Affymetrix Genome-Wide Human SNP Array 6.0 platform, copy number variants are segmented by the TCGA Firehose pipeline using GISTIC2 method, and then mapped to genes.

<sup>1</sup> INF source code is publicly available on the GitLab repository https://gitlab.fbk.eu/MPBA/INF, while data is archived at http://dx.doi.org/10.6084/m9.figshare.12052995.v1

<sup>2</sup> European Call Multi-omics for genotype-phenotype associations (RIA) https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/biotec-07-2020

#### TABLE 2 | Class balance.

BRCA, breast invasive carcinoma; AML, acute myeloid leukemia; KIRC, kidney renal clear cell carcinoma; ER, estrogen receptor; subtypes, breast cancer subtypes; OS, overall survival.

Multiplicative factor, class separation, and random state refer to the parameters scale, class\_sep, and random\_state of the make\_classification function in scikit-learn.

The original data is publicly accessible on the National Cancer Institute GDC Data Portal (https://portal.gdc.cancer.gov/) and the Broad GDAC Firehose (https://gdac.broadinstitute.org/), where further details on data generation can be found. The data was retrieved in December 2019 and January 2020 using the RTCGA R library (16).

Furthermore, the INF pipeline has been tested on a synthetic dataset with 380 observations in two classes (70% class 1 and 30% class 2, defining the synthetic target ST), 3 pseudo-omics layers, and 400 features (layer 1: 100; layer 2: 50; layer 3: 250). The dataset is generated in-house using scikit-learn's make\_classification function with the arguments shuffle=False and flip\_y=0. The number of informative features and the difficulty of the task were set on a per-layer basis, as summarized in **Table 3**.

#### 2.2. In silico Workflow

The INF pipeline integrates two or more omics layers, e.g., gene expression, protein expression, or methylation, in a machine learning framework for improved patient classification and biomarker identification in cancer. The core consists of three main components, structured as in **Figure 1**, managing the integration of the omics layers and their predictive modeling. A baseline integration method (juXT) is first considered by training a Random Forest (RF) (17) or a linear Support Vector Machine (LSVM) (18) classifier on juxtaposed multi-omics data, ranking features by ANOVA F-value. Secondly, the multi-omics features are integrated by Similarity Network Fusion (SNF) (5), a method that computes a sample similarity network for each data type and fuses them into one network. INF introduces a novel feature ranking scheme (rSNF) that sorts multi-omics features according to their contribution to the SNF-fused network structure. An RF or LSVM classifier is trained on the juxtaposed multi-omics data, ranking features by rSNF. A compact RF model (rSNFi) is finally trained on the juxtaposed dataset restricted to the intersection of top-ranked biomarkers from juXT and rSNF.
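The last step of the workflow, restricting the data to features that are top-ranked by both juXT and rSNF, can be sketched in a few lines of Python. The feature names, the cut-offs, and the helper `intersect_ranked` are hypothetical and serve only to illustrate the set operation; they are not taken from the INF code base.

```python
def intersect_ranked(juxt_ranked, rsnf_ranked, k_juxt, k_rsnf):
    """Features that are top-ranked by both schemes, kept in juXT order."""
    rsnf_top = set(rsnf_ranked[:k_rsnf])
    return [f for f in juxt_ranked[:k_juxt] if f in rsnf_top]

# Hypothetical ranked lists over the same juxtaposed feature space.
juxt_ranked = ["gene:g1", "gene:g2", "prot:p1", "cnv:c1", "gene:g3"]
rsnf_ranked = ["gene:g2", "cnv:c1", "gene:g1", "gene:g3", "prot:p1"]

signature = intersect_ranked(juxt_ranked, rsnf_ranked, k_juxt=3, k_rsnf=3)
print(signature)  # ['gene:g1', 'gene:g2']
```

The rSNFi classifier would then be trained on the columns listed in `signature` only, which is why its signatures can be much smaller than those of juXT or rSNF.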

#### 2.3. Omics Integration

In a comparative review of scientific literature, SNF (5) emerged as one of the most reliable alternatives to simple juxtaposition-based integration. SNF is a non-Bayesian network-based method that can be divided into two main steps: the first step builds a sample-similarity network for each omics dataset, where nodes represent samples and edges encode a scaled exponential Euclidean distance kernel computed on each pair of samples; the second step implements a non-linear combination of these networks into a single similarity network through an iterative procedure. The multi-omics datasets are first converted into graphs, and for each graph two matrices are computed: a patient pairwise similarity matrix ("status matrix"), and a matrix with similarity of each patient to the K most similar patients, through K-nearest neighbors ("local affinity matrix"). At each iteration, the status matrix is updated through the local affinity matrix, generating two parallel interchanging processes. The status matrices are finally fused together into a single network. Spectral clustering is performed on the fused network, in order to identify sub-communities of samples, potentially reflecting phenotypes. The clustering performance is evaluated with respect to a ground truth, i.e., the real phenotype each sample belongs to, by the Normalized Mutual Information (NMI) score. SNF integrates multiple omics datasets into a single comprehensive network in the space of samples rather than measurements (e.g., gene expression values).
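To make the first SNF step concrete, the sketch below computes a scaled exponential similarity kernel in plain Python, following the form given in the original SNF publication (5): the squared Euclidean distance between two samples is scaled by α times the average of their mutual distance and their mean distances to their K nearest neighbors. The function names and the toy data are assumptions made for illustration, the samples are assumed to be distinct, and the fusion iterations are not shown.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scaled_exp_similarity(samples, K=2, alpha=0.5):
    """Pairwise similarity W[i][j] = exp(-d(i, j)^2 / (alpha * eps_ij)).

    eps_ij averages d(i, j) with the mean distances of sample i and of
    sample j to their K nearest neighbors.
    """
    n = len(samples)
    d = [[euclidean(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    # Mean distance of each sample to its K nearest neighbors (self excluded).
    knn_mean = []
    for i in range(n):
        nearest = sorted(d[i][j] for j in range(n) if j != i)[:K]
        knn_mean.append(sum(nearest) / len(nearest))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            eps = (knn_mean[i] + knn_mean[j] + d[i][j]) / 3.0
            W[i][j] = math.exp(-d[i][j] ** 2 / (alpha * eps))
    return W

# Toy layer: two tight pairs of samples far away from each other.
samples = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
W = scaled_exp_similarity(samples, K=2, alpha=0.5)
print(W[0][1] > W[0][2])  # True: neighbors are far more similar than distant samples
```

One such matrix would be built per omics layer; α and K here play the role of the two hyperparameters tuned in this study.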

This work proposes multi-omics integration as an approach to identify robust biomarkers of sample phenotypes or cancer subtypes (e.g., survival status vs. breast cancer subtyping); consequently, it is necessary to extract measurement information from the SNF-fused network of samples. To this aim, we extended SNF by implementing rSNF (ranked SNF), a feature-ranking scheme based on SNF-fused network clustering. In detail, a patient network W<sub>i</sub> is built for each feature f<sub>i</sub>, based on f<sub>i</sub> alone, and spectral clustering is performed on it. Then, the NMI score is computed comparing the sample clusters found inside W<sub>i</sub> with those in the fused network; the higher the score, the more similar the clustering between the fused network and W<sub>i</sub>. Thus, each feature f<sub>i</sub> is associated with a consistency score, ranking all multi-omics features with respect to their relative contribution to the whole network structure.
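The scoring step of rSNF can be illustrated as follows, assuming the cluster assignments are already available: each feature is scored by the NMI between the clustering obtained from its own network and the clustering of the fused network. The cluster labels below are made up, the helper names are hypothetical, and the NMI normalization (geometric mean of the two entropies) is an assumption; the R implementation may normalize differently.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings (0 to 1)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (n_ab / n) * math.log(n * n_ab / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        return 0.0  # degenerate clustering made of a single cluster
    return mutual_info / math.sqrt(h_a * h_b)

def rank_features(fused_clusters, per_feature_clusters):
    """Sort features by agreement of their clustering with the fused one."""
    scores = {f: nmi(c, fused_clusters) for f, c in per_feature_clusters.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical cluster labels for six patients.
fused = [0, 0, 0, 1, 1, 1]
per_feature = {
    "feat_C": [0, 1, 0, 1, 0, 1],   # unrelated to the fused clustering
    "feat_A": [1, 1, 1, 0, 0, 0],   # same partition, labels swapped
    "feat_B": [0, 0, 1, 1, 1, 1],   # one patient assigned differently
}
ranking, scores = rank_features(fused, per_feature)
print(ranking)  # ['feat_A', 'feat_B', 'feat_C']
```

Because NMI is invariant to label permutations, `feat_A` reaches the maximum score even though its cluster identifiers are swapped with respect to the fused network.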

The entire procedure of similarity networks inference and fusion relies on two hyperparameters: α, the scaling variance in the scaled exponential similarity kernel used for similarity networks construction, and K, the number of nearest neighbors in sparse kernel and scaled exponential similarity kernel construction. While the original method (5) assigned fixed values to α and K, in this study the optimal hyperparameters are chosen among the grids α<sub>grid</sub> = {0.3, 0.35, 0.4, 0.45, ..., 0.8} and K<sub>grid</sub> = {i ∈ N, 10 ≤ i ≤ 30} in a 10×5-fold cross-validation schema.
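The two grids can be enumerated directly, as in the short sketch below. The selection criterion is left abstract: `toy_score` is a placeholder with a known maximum, standing in for whatever cross-validated score is computed for each (α, K) pair, and `select_best` is a hypothetical helper.

```python
from itertools import product

alpha_grid = [round(0.30 + 0.05 * i, 2) for i in range(11)]  # 0.30, 0.35, ..., 0.80
k_grid = list(range(10, 31))                                  # 10, 11, ..., 30

def select_best(score_fn):
    """Return the (alpha, K) pair with the highest score."""
    return max(product(alpha_grid, k_grid), key=lambda pair: score_fn(*pair))

# Placeholder score peaking at alpha = 0.5 and K = 20 (not a real CV result).
def toy_score(alpha, k):
    return -((alpha - 0.5) ** 2) - ((k - 20) / 100) ** 2

print(len(alpha_grid) * len(k_grid))  # 231 candidate pairs
print(select_best(toy_score))         # (0.5, 20)
```

With 11 values of α and 21 values of K, the search covers 231 combinations, each of which has to be evaluated within the cross-validation schema.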

#### 2.4. Predictive Profiling

To ensure the reproducibility of results and limit overfitting, the development of classification models is performed inside a Data Analysis Plan (DAP) (**Figure 2**), following the guidelines derived by the U.S. Food and Drug Administration MAQC/SEQC studies (6, 19). Data is split into a training set (TR) and two non-overlapping test sets (TS, TS2), preserving the original proportion of patient phenotypes (classes). The TR/TS/TS2 partitions are 50/30/20% of the entire data set, respectively. The data splitting procedure is repeated 10 times so as to obtain 10 different TR/TS/TS2 splits. Predictive models are trained and developed on TR and TS for juXT and rSNF; in the case of rSNFi, the models are trained and developed on TS and TS2 to avoid information leakage due to using the same data both for feature selection and model training (see **Figure 3**). For each split, Random Forest (RF) or linear kernel Support Vector Machine (LSVM) classifiers are trained on the training partition within a stratified 10×5-fold cross-validation (10×5-CV). The model performance is assessed in terms of average precision, recall and Matthews Correlation Coefficient (MCC) (20, 21). The MCC is generally regarded as a balanced measure of accuracy and precision that can be used both in binary and multiclass problems (22, 23) and even when classes are imbalanced (24). MCC lies in [−1, 1], with 1 meaning perfect prediction, −1 inverse prediction, and 0 random guess. For binary classification tasks, MCC is calculated on true and predicted labels considering true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values, as in the following:

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}$$
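The formula above translates directly into code. The sketch below is a plain-Python transcription for the binary case; returning 0 when the denominator vanishes is a common convention that we assume here, not something stated in the text.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a marginal count is zero
    return (tp * tn - fp * fn) / denom

print(round(mcc(tp=90, tn=80, fp=20, fn=10), 3))  # 0.704
print(mcc(tp=50, tn=50, fp=0, fn=0))              # 1.0 (perfect prediction)
print(mcc(tp=0, tn=0, fp=50, fn=50))              # -1.0 (inverse prediction)
```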

At each CV round, features are ranked either by ANOVA F-value (for juXT, rSNFi) or by the rSNF ranking (see section 2.3) and different classification models are trained for increasing numbers of ranked features, namely 5, 10, 25, 50, 75, and 100% of the total features. A unified list of top-ranked features is then obtained by Borda aggregation of all the ranked CV lists (25, 26). The best model is later retrained on the whole training set restricted to the features yielding the maximum MCC in CV, and validated on the test partition. A global list of top-ranked features is derived for juXT, rSNF, and rSNFi by Borda aggregation of the Borda lists of each TR/TS split (Borda of Bordas, "BoB"). The signatures for juXT, rSNF, and rSNFi are defined by the top N features of the corresponding BoB lists, with N being the median size of top features across all experiments.
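The aggregation of ranked lists can be sketched with a basic Borda count, in which a feature earns more points the higher it appears in each list. This is a simplified stand-in: the Borda variant used in the DAP (25, 26) may score and break ties differently, and the feature names, the alphabetical tie-break, and the value of N below are illustrative assumptions.

```python
from collections import defaultdict

def borda(ranked_lists):
    """Aggregate ranked lists with a basic Borda count.

    A feature at 0-based position p in a list of length n earns n - p points;
    ties in total points are broken alphabetically.
    """
    points = defaultdict(int)
    for ranked in ranked_lists:
        n = len(ranked)
        for position, feature in enumerate(ranked):
            points[feature] += n - position
    return sorted(points, key=lambda f: (-points[f], f))

# Hypothetical ranked lists from the CV rounds of two TR/TS splits.
split_1 = [["g1", "g2", "g3", "g4"], ["g2", "g1", "g3", "g4"], ["g1", "g3", "g2", "g4"]]
split_2 = [["g2", "g1", "g4", "g3"], ["g2", "g3", "g1", "g4"]]

per_split = [borda(split_1), borda(split_2)]
bob = borda(per_split)          # "Borda of Bordas" across splits
signature = bob[:2]             # top N features, with N = 2 chosen for illustration
print(per_split)                # [['g1', 'g2', 'g3', 'g4'], ['g2', 'g1', 'g3', 'g4']]
print(signature)                # ['g1', 'g2']
```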

shuffled beforehand, the DAP runs in "random labels" mode as a sanity check to ensure that the procedure is not affected by systematic bias.

In the "full" version of the DAP (fDAP), described above, the rSNF ranking is performed at each CV round on the training portion of the data. Since this procedure is quite demanding in terms of computational time, even if parallelized (≈ 9 features/min), we devised an "accelerated" version of the DAP (aDAP), where the rSNF ranking is precomputed on the whole TR data and used as is at each CV round. We assessed the fDAP vs. aDAP performance on the synthetic dataset as well as BRCA-ER and BRCA-subtypes by comparing the overall metrics and measuring the dissimilarity of the rSNF BoB of the two DAPs by the Canberra distance (25).
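For two complete ranked lists over the same features, the Canberra distance compares the position of each feature in the two lists and weights disagreements near the top more heavily. The sketch below implements this plain form with 1-based positions; the indicator described in (25) also covers normalization and partial lists, which are not reproduced here, and the function name is our own.

```python
def canberra(list_a, list_b):
    """Canberra distance between two rankings of the same features."""
    pos_a = {f: i + 1 for i, f in enumerate(list_a)}   # 1-based positions
    pos_b = {f: i + 1 for i, f in enumerate(list_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("both lists must rank the same set of features")
    return sum(abs(pos_a[f] - pos_b[f]) / (pos_a[f] + pos_b[f]) for f in pos_a)

print(canberra(["a", "b", "c"], ["a", "b", "c"]))            # 0.0
print(round(canberra(["a", "b", "c"], ["b", "a", "c"]), 3))  # 0.667
print(round(canberra(["a", "b", "c"], ["a", "c", "b"]), 3))  # 0.4
```

Swapping the first two features costs more (0.667) than swapping the last two (0.4), which matches the intent of comparing biomarker lists where the top positions matter most.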

TABLE 4 | Summarized best predictive performances for each classification task using RF model and three omics layers.


CI: 95% bootstrap confidence interval; {MCC,PREC,REC}\_cv: best average MCC, precision, recall in cross-validation on training set splits; {MCC,PREC,REC}\_ts: average MCC, precision, recall on test set splits; Nf: median number of features leading to MCC\_cv. Bold indicates best performance (highest MCC and smallest signature size). Precision and recall were computed for binary classification tasks only.

RF models are trained using 500 trees, measuring the quality of a split as mean decrease in the Gini impurity index (17); the regularization parameter C of LSVM models is tuned over the grid C<sub>grid</sub> = {10<sup>i</sup>, i ∈ ℤ, −2 ≤ i ≤ 3} within a 10× stratified Monte Carlo cross-validation (50% training/validation proportion). Results for RF models are summarized in **Table 4**, while LSVM model performance is detailed in the **Supplementary Tables** BRCA-ER\_LSVM, KIRC-OS\_LSVM.

To ensure that the predictive profiling procedure is not affected by selection bias, the whole INF workflow, including the rSNF procedure, is also repeated after randomly scrambling the training set labels ("random labels" mode): in this setup, the performance of a classifier unaffected by systematic bias should be close to that of a random predictor, with MCC close to zero.
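The logic of the "random labels" check can be reproduced on toy data. In the sketch below, a deliberately simple nearest-centroid classifier on one synthetic feature stands in for the RF/LSVM models, and a repeated hold-out stands in for the DAP; all names and parameter values are illustrative assumptions. With true labels the average MCC is clearly positive, whereas scrambling the labels before training and evaluation should bring it close to zero.

```python
import math
import random

def mcc(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def centroid_classifier(x_train, y_train):
    """Return a predictor assigning each value to the nearest class mean."""
    means = {}
    for c in (0, 1):
        vals = [x for x, y in zip(x_train, y_train) if y == c]
        means[c] = sum(vals) / len(vals)
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def holdout_mcc(x, y, rng):
    """Train on a random half of the samples, return the MCC on the other half."""
    idx = list(range(len(x)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    predict = centroid_classifier([x[i] for i in train], [y[i] for i in train])
    return mcc([y[i] for i in test], [predict(x[i]) for i in test])

def scrambled(labels, rng):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
y = [0] * 100 + [1] * 100
x = [rng.gauss(2.0 * label, 1.0) for label in y]   # one informative toy feature

true_mean = sum(holdout_mcc(x, y, rng) for _ in range(50)) / 50
random_mean = sum(holdout_mcc(x, scrambled(y, rng), rng) for _ in range(50)) / 50
print(f"true labels: {true_mean:.2f}   random labels: {random_mean:.2f}")
```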

#### 2.5. Implementation

The complete INF pipeline is implemented through the workflow management tool Snakemake (27, 28), which allows automatic handling of all dependencies required to generate the INF output. The pipeline operates on N omics input files, one for each layer that should be integrated, and a single file describing the patient labels. The omics files are tab-separated text matrices with patients on the rows and features on the columns, with row and column identifiers. The label file is a single column file with patient phenotypes, with no header. This input structure, with one file per omic layer and a label file, simplifies the downstream analysis and reduces to a minimum the preprocessing burden for the end user.
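The input layout described above can be illustrated with a small parser. The sketch reads two toy tab-separated layers and a header-less label file from in-memory strings and juxtaposes the layers column-wise. The exact header layout (a corner cell followed by feature names), the `layer:feature` naming, and all function names are assumptions made for this example, not the pipeline's actual conventions.

```python
import csv
import io

def read_layer(handle):
    """Parse a tab-separated matrix with patients on rows, features on columns."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    features = rows[0][1:]                 # header row: skip the corner cell
    patients = [r[0] for r in rows[1:]]    # first column: patient identifiers
    values = [[float(v) for v in r[1:]] for r in rows[1:]]
    return patients, features, values

def read_labels(handle):
    """Parse a single-column label file without header."""
    return [line.strip() for line in handle if line.strip()]

def juxtapose(layers):
    """Column-wise concatenation of {name: (patients, features, values)}."""
    names = list(layers)
    patients = layers[names[0]][0]
    for name in names:
        if layers[name][0] != patients:
            raise ValueError(f"patient order differs in layer {name!r}")
    features = [f"{name}:{f}" for name in names for f in layers[name][1]]
    matrix = [
        [v for name in names for v in layers[name][2][i]]
        for i in range(len(patients))
    ]
    return patients, features, matrix

gene_txt = "sample\tG1\tG2\nP1\t1.0\t2.0\nP2\t3.0\t4.0\n"
prot_txt = "sample\tPR1\nP1\t0.5\nP2\t0.7\n"
labels_txt = "1\n0\n"

layers = {
    "gene": read_layer(io.StringIO(gene_txt)),
    "prot": read_layer(io.StringIO(prot_txt)),
}
patients, features, matrix = juxtapose(layers)
labels = read_labels(io.StringIO(labels_txt))
print(features)  # ['gene:G1', 'gene:G2', 'prot:PR1']
print(matrix)    # [[1.0, 2.0, 0.5], [3.0, 4.0, 0.7]]
```

Checking that all layers list the patients in the same order before concatenating guards against silently misaligned rows, which would otherwise corrupt both juXT and the label assignment.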

The predictive profiling module, including the DAP, is written in Python 3.6 on top of NumPy (29) and scikit-learn methods (30). The ranked SNF (rSNF) procedure is implemented in R (31) leveraging the original R scripts provided by SNF authors (5), extended by a dedicated script for SNF tuning and a main script for SNF analysis and the post-SNF feature selection procedure, which is parallelized over the features for efficiency using the foreach R library.

#### 2.6. Computational Details

The INF computations were run on the FBK Linux high-performance computing facility KORE, on an 8-core i7 3.4 GHz Linux workstation, and on a 72-vCPU 2.7 GHz Platinum Intel Xeon 8168 Microsoft Azure cloud machine (F72s v2 series).

#### 2.7. Data and Code Availability

To further foster reproducibility and support users and future developers, the full code of this benchmark is publicly shared on the GitLab repository https://gitlab.fbk.eu/MPBA/INF. Additional information is included in the **Supplementary Material** available on the publisher's website, while the full set of experimental data can be accessed at http://dx.doi.org/10.6084/m9.figshare.12052995.v1.

#### 3. RESULTS

The INF workflow was run on all tasks considering 3-layer integration and all 2-layer combinations; the DAP was also run separately on all single-layer datasets in order to obtain a baseline. All results presented here refer to experiments performed with the RF classifier. Experiments using LSVM were performed on BRCA-ER and KIRC-OS, obtaining similar classification performances, top features and layer contributions (**Supplementary Tables** BRCA-ER\_LSVM, KIRC-OS\_LSVM). The classifier performance for 3-layer integration is summarized in **Table 4**, in terms of average cross-validation MCC on the 10 training set splits (MCC\_cv) with 95% Studentized bootstrap confidence intervals (CI) as (MCC\_min, MCC\_max), average MCC on the 10 test set splits (MCC\_ts) with CI, and median number of features (Nf) yielding MCC\_cv. Similarly, precision (PREC) and recall (REC) are reported in **Table 4** as average cross-validation and test set values with CI. As expected, whenever there is a non-negligible imbalance toward the positive class, the number of false positives tends to increase, with more false positives yielding a comparatively low precision with higher recall, and vice versa. In both cases, the MCC effectively balances the two effects. The classifier performance on single-layer and 2-layer data is summarized in **Figure 4**.

A comparison between the "accelerated" flavor of the DAP (aDAP) and the full DAP (fDAP) was run on synthetic data, BRCA-ER and BRCA-subtypes data, with aDAP yielding similar performance metrics and top-ranked biomarker lists as fDAP (**Supplementary Tables** Synthetic\_RF, BRCA\_RF\_fDAP, canberra\_distances), while being ≈ 30× faster (for BRCA-ER, approx. 2 vs. 64 h, or 300 features/min vs. 9 features/min). All the results presented here were thus obtained using aDAP. Moreover, the INF workflow running in "random labels" mode achieved an average cross-validation MCC ≈ 0, as expected by a procedure unaffected by systematic bias.

Overall, integrating multiple omics layers with INF yields classification performance better than or comparable to that obtained using only features from a single layer or naïve omics juxtaposition, with much more compact signature sizes. On 3-layer BRCA-subtypes and 2- or 3-layer KIRC-OS, INF outperforms the single layers, as well as juXT and rSNF (**Figure 4**, **Table 4**). On 2-layer BRCA-subtypes, INF performance on gene-cnv and gene-prot is comparable to the best-performing single-layer data (gene) and superior to cnv and prot single layers, while INF on cnv-prot only improves over the cnv single layer. On the BRCA-ER task, the performance with INF integration of 2 or 3 layers is still better than using single layers, although to a smaller extent, except for cnv-prot integration, which performs better than cnv alone but slightly worse than gene and prot single layers. The good performance achieved by the gene and prot single layers is not unexpected, since the biological nature of the target ER status is defined at the transcriptomics level. On the more difficult AML-OS task, INF performs better than both rSNF and juXT on gene-mirna and meth-mirna integration, still improving over single-layer performance both in terms of MCC and reduced signature sizes.

#### 3.1. One or Multi-Omics Layers vs. juXT/rSNF/rSNFi

For BRCA-ER, three-layer INF (rSNFi) integration performs better than either rSNF or juXT (MCC test 0.830 vs. 0.804, 0.797 for rSNF and juXT, respectively). All two-layer INF integrations perform similarly to, or better than, the corresponding rSNF and juXT integrations, in particular for cnv-prot integration (MCC test 0.746 vs. 0.682, 0.692 resp. for rSNF and juXT).

On BRCA-subtypes, the 3-layer INF integration performs better than either rSNF or juXT (MCC test 0.838 vs. 0.811, 0.795 resp. for rSNF and juXT), nevertheless without improving over the gene single-layer performance (MCC test 0.821). However, the INF median signature size is only 301.5, compared to 1801 for rSNF and juXT, and 891 for the gene layer alone. All two-layer INF integrations yield better performance than their corresponding juXT or rSNF integrations.

Omics integration is particularly effective for KIRC-OS, as all 2- and 3-layer INF integrations outperform juXT, rSNF, and each of the single-layer classifiers. In fact, 3-layer rSNFi achieves MCC test 0.378 vs. 0.274, 0.305 (resp. for juXT, rSNF), 0.296, 0.327, 0.333 (resp. rSNFi meth-mirna, gene-mirna, gene-meth), and 0.253, 0.261, 0.249 (resp. gene, meth, mirna).

For AML-OS, INF feature sets are always more compact than either juXT or rSNF, with three-layer integration giving better MCC than any of the INF two-layer integrations (MCC test 0.176 vs. 0.125, 0.169, 0.047, respectively three-layer vs. meth-mirna, gene-mirna, gene-meth). Moreover, cross-validation MCCs corresponding to INF integration are better than any single layer MCC as well as rSNF and juXT.

#### 3.2. Characterization of the Signatures Identified by INF

For all tasks, INF signatures are markedly more compact with respect to both juXT and rSNF. With 91.5 vs. 6559 (1.4%) median features (rSNFi vs. juXT), the largest reduction in size occurs for AML-OS 3-layer integration, while the least reduction is observed for BRCA-subtypes task, with 301.5 vs. 1801 (16.7%) median features (rSNFi vs. juXT).

In terms of contributions from the omics datasets being integrated, the gene layer generally provides the largest number of features to the signatures identified by the INF workflow. In particular for the BRCA dataset, in both ER and subtypes tasks, the gene layer contributes over 95% of the top features for juXT and rSNFi, with rSNF signatures being slightly more balanced (prot contribution remains marginal, while cnv provides 28.3 and 17.7% of the top features in ER and subtypes tasks respectively). This is expected as the class label is defined mainly at transcriptomics level. In AML-OS experiments, the layer contributing the most is still gene, accounting for ca. 78, 73, and 81% of the top feature sets for RF juXT, rSNF and rSNFi experiments, respectively. In KIRC-OS experiments, gene is the layer contributing the most to the top juXT and rSNF feature sets, while meth is the major contributor for rSNFi. The percentage of features from each omic layer contributing to the top signatures for juXT, rSNF and rSNFi 3-layer integrations are reported in **Supplementary Tables** layer\_contribution. The RF rSNFi signatures for all tasks are available in **Supplementary Tables** BRCA-ER\_RF\_rSNFi, BRCA-subtypes\_RF\_rSNFi, AML-OS\_RF\_rSNFi and KIRC-OS\_RF\_rSNFi.

Even though a systematic biological interpretation of the identified signatures is beyond the scope of this work, to ascertain the reliability of our results we compared them with published data. The top features in the BRCA-ER rSNFi signature include multiple genes known to be associated with breast carcinoma progression and outcome such as AGR3, B3GNT, and MLPH (32–34). In addition we find the estrogen receptor gene (ESR1 from the gene and ER-alpha from the prot layer) and the transcription factor GATA3 (from both gene and prot layers) (35). Both the BRCA-ER and BRCA-subtypes signatures include genes previously identified as novel biomarkers for intrinsic breast carcinoma subtype prediction (36). Interestingly there is only partial overlap between the top features identified in BRCA ER vs. subtypes tasks. Considering AML-OS task, it is noteworthy to mention that the top feature identified has been recently reported as a potential biomarker predicting overall survival in a subset of AML patients (37).

Within the mirna features of the AML-OS signature, MIR-203 expression was recently found to be associated with AML patient survival (38); MIR-100 is highly expressed in AML and was found to regulate cell differentiation and survival (39); high expression of miR-504-3p was reported to be associated with favorable AML prognosis (40). Given that the rSNFi signature identified in the KIRC-OS task contains a large percentage of methylation data (86.5%), its direct interpretation is more difficult. It is however interesting to observe that all the 15 gene features in the signature are identified as prognostic markers for renal carcinoma according to the Human Protein Atlas (41).

#### 3.3. Unsupervised Analysis

The features selected by juXT, rSNF and rSNFi are projected on a bi-dimensional space using the UMAP unsupervised multidimensional projection method (42, 43). Here we show an example on the BRCA-subtypes 3-layer dataset, with a UMAP projection of the features selected by juXT (**Figure 5**) compared to the UMAP projection of the INF signature (**Figure 6**) for one of the 10 data splits (the UMAP plots for the remaining 9 splits are in **Figures S1, S2**). Colors represent cancer subtypes and shapes represent training/test partitions. Using the 1801 juXT features, cancer subtypes are roughly clustered, with HER2-enriched and Luminal B being more dispersed (**Figure 5**). The clusters appear to be more sharply defined in the projection of the 302-feature INF signature: in particular, Basal-like patients form a distinct cluster, while Luminal A, Luminal B and HER2-enriched patient clusters are close to each other, slightly overlapping yet hinting at a trajectory pattern (**Figure 6**). The HER2/luminal cluster contains two patients classified as basal-like subtype, consistent with the findings of (44).

# 4. DISCUSSION

#### 4.1. Background and Related Work

Ritchie et al. (45) defined omics data integration as the combination of multiple omics datasets that can be used for the development of models to predict complex traits or phenotypes. The problem of data integration in computational biology is far from having a consolidated and shared solution. Many long-standing obstacles are still far from being overcome, and the increasing availability of data [e.g., TCGA, (46)] and computational tools [see for instance (47–51) and https://github.com/mikelove/awesome-multi-omics], also interactive [e.g., (52)], is raising new issues that need to be addressed. In fact, not only are existing datasets still lacking standardization protocols to deal with their complexity and heterogeneity, but also the reliability, reproducibility and interpretability of new computational methods are emerging as urgent and relevant questions (53). Moreover, modern technologies allow the rapid extraction of high-dimensional, high-throughput features from different sources (e.g., gene expression, DNA sequencing, metabolomics, or high-resolution images), which in turn require collaboration between biologists, computer scientists, physicians and other experts. The lack of common methodologies and terminologies can transform this synergy into a further level of complexity in the process of data integration (54). As observed in (55, 56), specific technological limits, noise levels and variability ranges affect the different omics, thus confounding the underlying biological signals. As a result, truly integrative analysis is still very rare, and different methods often discover different kinds of patterns, as evidenced by the lack of consistency in the published results, although efforts in this direction have started to appear (57, 58).

Indeed, the underlying hypothesis of multi-omics integration is that different omics data can provide complementary information (56) [although sometimes redundant (9)], and thus a broader insight with respect to single-layer analysis, for a better understanding of disease mechanisms (59). This assumption has been confirmed by multiple studies on diverse diseases, such as cardiovascular disease (60), diabetes (61), liver disease (62), or mitochondrial diseases (63), and also longitudinally (64), suggesting that the more complex the disease the more advantageous the integration. As the co-occurrence of multiple causes and correlated events is a well-known characteristic of tumorigenesis and cancer development, the integration of data generated from multiple sources can thus be particularly useful for the identification of cancer hallmarks (65–68).

Many computational strategies have been introduced that combine multiple types of data to identify novel biomarkers and thus to predict a phenotype of interest or drive the development of intervention protocols. Given the heterogeneity of data and tasks, these techniques deal with the data integration at different levels of the learning process: (i) by concatenating the features before fitting a model (early-integration), (ii) by incorporating the integration step into the model training (intermediate-integration), or (iii) by combining the outputs of distinct models for the final prediction (late-integration) (69, 70).
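The difference between early and late integration can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: it uses a nearest-centroid classifier as a stand-in for any predictive model, the function names and the toy data are hypothetical, and intermediate integration (where fusion happens inside model training) is not covered.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class label."""
    return {c: centroid([x for x, label in zip(X, y) if label == c]) for c in set(y)}


def class_scores(model, x):
    """Score each class by the negative squared distance to its centroid."""
    return {c: -sum((a - b) ** 2 for a, b in zip(x, m)) for c, m in model.items()}


def to_probabilities(scores):
    """Softmax over class scores, so that layers become comparable."""
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp.values())
    return {c: v / total for c, v in exp.items()}


def predict_early(layers, y, sample_layers):
    """Early integration: concatenate the layers, then fit a single model."""
    X = [sum(rows, []) for rows in zip(*layers)]  # join per-sample vectors
    scores = class_scores(fit_centroids(X, y), sum(sample_layers, []))
    return max(scores, key=scores.get)


def predict_late(layers, y, sample_layers):
    """Late integration: one model per layer, then average class probabilities."""
    per_layer = [
        to_probabilities(class_scores(fit_centroids(X, y), x))
        for X, x in zip(layers, sample_layers)
    ]
    avg = {c: sum(p[c] for p in per_layer) / len(per_layer) for c in per_layer[0]}
    return max(avg, key=avg.get)


# Toy data: two layers measured on the same four samples (illustrative only).
gene = [[1.0, 1.2], [0.9, 1.1], [3.0, 3.1], [3.2, 2.9]]
meth = [[0.1], [0.2], [0.8], [0.9]]
labels = ["A", "A", "B", "B"]
new_patient = [[2.9, 3.0], [0.85]]

early = predict_early([gene, meth], labels, new_patient)
late = predict_late([gene, meth], labels, new_patient)
```

On this toy input both strategies agree; they can diverge when layers differ strongly in size or scale, because concatenation lets the largest layer dominate the distance, whereas averaging per-layer probabilities gives each layer the same weight.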

In the early-integration approach, also known as juxtaposition-based, the multi-omics datasets are first concatenated into one matrix. To deal with the high dimensionality of the joint dataset, these methods generally adopt matrix factorization (55, 56, 58, 71), statistical (47, 49, 58, 60, 62, 72–76), and machine learning tools (58, 76, 77). Alternatively, data models relying on polyglot approaches can be used especially in (bio)informatics applications (78, 79). Although the dimensionality reduction procedure is necessary and may improve the predictive performance, it can also cause the loss of key information (69).

Moreover, biomarkers identified purely on a computational statistics rationale from meta-omics features often lack biological plausibility (80).

In order to maximize the contribution of the single-omics layer, the late-integration methods first model each dataset individually, and then merge or average the results; they are also known as model-driven (70, 81). Although these techniques avoid the pre-selection of the features, they do not leverage the hidden correlations between the data, posing again the risk of signal loss (80, 82).

The intermediate-integration strategies aim at developing a joint model that accounts for the correlation between the omics layers, to boost their combined predictive power (83). Among these methods, the network-based models refer to the reconstruction of a graph representing the complex biological interactions (76, 84), known or predicted, between the variables to discover novel informative relationships (85). They have successfully been applied in cancer research for the identification of pan-cancer drug targets (86), the detection of subtype-specific pathways (83, 87) and of genetic aberrations (88), or the stratification of cancer patients (89–91). In particular, Koh et al. (44) predicted breast cancer subtypes by applying a modified shrunken centroid method in the development of their network-based tool, iOmicsPASS. Further, breast cancer datasets in TCGA represent a benchmark for integrative models (92–94), as well as AML (95).

More recently, the success of deep learning algorithms in various bioinformatics fields (96) prompted the adoption of deep neural networks for omics-integration in precision oncology. Autoencoders and convolutional neural networks have been effectively trained for the prediction of prognostic outcomes (9, 97), response to chemotherapeutic drugs (50), and gene targeting (98), by adopting either an early-integration (9, 98) or a late-integration (50, 97) approach. Although deep learning models hold the potential to include image-derived features in the integration workflow, they suffer from interpretability and generalization issues (99).

Although it is clear that no single method is consistently preferable, and that most of the proposed approaches are task and/or data dependent (80), the complexity of tumor analysis suggests that network-based approaches are needed (87, 100).

In this context, it is clear that omics-integration is one of the most promising and demanding challenges of modern bioinformatics, and that there is an urgent need to prove the reproducibility, interpretability, and generalization capability of the proposed methods (85, 101).

#### 4.2. Integrative Network Fusion

We present the INF framework for the characterization of cancer patient phenotypes by integrated multi-omics signatures, combining an improved version of a state-of-the-art integration technique (5) with predictive models developed inside a Data Analysis Plan (6) for machine learning. The framework is applied to TCGA data to predict clinically relevant patient phenotypes such as the overall survival or cancer subtypes.

The simplest approach for multi-omics data integration consists of juxtaposing normalized measurements into one joint matrix, followed by the development of a predictive model. Juxtaposition-based integration is considered a baseline technique, since it is the most naïve approach to combine two datasets; moreover, it enables the identification of multi-omics signatures by borrowing discriminatory strength from information derived from all datasets. However, juxtaposition further dilutes the possibly already low signal-to-noise ratio in each data type, affecting the understanding of the biological interactions at the different omics levels.

Conversely, the INF method for omics data integration is an improvement of the popular Similarity Network Fusion (SNF) approach (5), which has inspired several studies in the scientific literature, specifically in cancer genomics (77, 87, 102–106). SNF maximizes the shared or correlated information between multiple datasets by combining data through inference of a joint network-based model, accounting for how informative each data type is to the observed similarity between samples.
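The fusion idea can be illustrated with a simplified cross-diffusion sketch in plain Python. This is not the SNF implementation used in this work (which relies on the original R code), and it does not include the rSNF feature ranking. The Gaussian kernel, the `k` and `sigma` parameters, the row normalization applied at each iteration, and all function names are assumptions made for illustration; at least two layers are required.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def row_normalize(W):
    return [[w / total for w in row] for row, total in ((r, sum(r)) for r in W)]


def similarity(X, sigma=1.0):
    """Gaussian similarity between all pairs of samples (rows of X)."""
    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [[math.exp(-d2(a, b) / (2 * sigma ** 2)) for b in X] for a in X]


def knn_kernel(W, k):
    """Keep, for each sample, only its k strongest similarities (self included)."""
    S = []
    for row in W:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        S.append([w if j in keep else 0.0 for j, w in enumerate(row)])
    return row_normalize(S)


def fuse(layers, k=2, iterations=5, sigma=1.0):
    """Simplified cross-diffusion over one sample-similarity network per layer."""
    W = [similarity(X, sigma) for X in layers]
    P = [row_normalize(w) for w in W]   # dense "global" matrices
    S = [knn_kernel(w, k) for w in W]   # sparse "local" kernels
    n, m = len(P[0]), len(P)
    for _ in range(iterations):
        updated = []
        for v in range(m):
            others = [P[u] for u in range(m) if u != v]
            mean_other = [
                [sum(o[i][j] for o in others) / len(others) for j in range(n)]
                for i in range(n)
            ]
            diffused = matmul(matmul(S[v], mean_other), transpose(S[v]))
            updated.append(row_normalize(diffused))
        P = updated
    mean = [[sum(p[i][j] for p in P) / m for j in range(n)] for i in range(n)]
    # Symmetrize the averaged network before returning it.
    return [[(mean[i][j] + mean[j][i]) / 2 for j in range(n)] for i in range(n)]


# Toy data: two layers, four samples forming two groups (illustrative only).
layer1 = [[0.0], [0.1], [2.0], [2.1]]
layer2 = [[1.0, 0.0], [1.1, 0.1], [3.0, 2.0], [2.9, 2.1]]
fused = fuse([layer1, layer2])
```

In the fused matrix, samples from the same group retain strong edges while cross-group edges stay weak, which is the behavior the text attributes to SNF when both layers support the same neighborhood structure.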

Two innovative solutions have been implemented in this study: (i) we devised a SNF-based procedure to rank variables according to their importance in clustering samples with similar phenotypes; and (ii) predictive models were developed exploiting the SNF-ranked variables, inside a rigorous Data Analysis Plan which ensures reproducibility (6, 19).

The performance of INF was assessed in terms of both statistical properties and biological interest. Concerning the statistical aspect, INF was compared with predictive models developed on the juxtaposed datasets (juXT technique), as well as on the single-layer datasets. With INF, smaller signature sizes were systematically derived to achieve comparable or even better performance both in cross-validation and in test. This is an added value for INF, as biological validation of biomarkers can definitely benefit from signatures of small size in terms of both costs and required time. This achievement is mainly due to the novel rSNF ranking, which increases the signal-to-noise ratio from the combined layers by prioritizing the most discriminant biomarkers in terms of network mutual information. rSNF exploits two main SNF advantages: integration of heterogeneous data and clustering of sample networks. The main peculiarity of the SNF integrative procedure is its robustness to noise (5), because weak similarities among samples (low-weight edges) disappear, except for low-weight edges supported by all networks, which are conserved depending on how tightly connected their neighborhoods are across networks. Moreover, the rSNFi step further increases the signal-to-noise ratio by training a predictive classifier on multi-omics juxtaposed data restricted to the top-ranked biomarkers shared by juXT and rSNF models. The resulting signatures are compact in size (up to 99% reduction w.r.t. juXT) while allowing predictive models to achieve equal or better performance compared to naïve juxtaposition or the single layers alone. While a comprehensive evaluation of the biological meaning of the signatures identified through the INF framework is beyond the scope of this work, we assessed their general validity with a thorough literature search. Our investigation shows that the signatures identified through the INF framework include biological markers that are relevant in the tasks under analysis and are consistent with previously published data. Further, as in (9), the largest contribution in the biomarkers' lists is provided by gene expression, while epigenomics, proteomics and miRNA transcriptomics play a minor role.

It should be noted that, especially in computational biology, multicollinearity between pairs of predictors and/or layers is intrinsic in the problem. Nevertheless, most machine learning models are indeed designed to identify the relevant predictors even in the presence of strong linear or non-linear correlations, provided that an appropriate DAP, feature ranking method, and diagnostic tools (e.g., random labels) are adopted against selection bias. To this aim, the application of a DAP derived from the MAQC-II initiative for model selection is a core attribute of the INF framework.

A fair comparison of INF results with other integration methods is currently unfeasible due to the number and variety of computational pipelines with dissimilar datasets, preprocessing methods, data analysis plans, and performance metrics.

This work is based on the original R implementation of the SNF algorithm (5). However, we are aware that Open Source implementations exist in other programming languages, in particular snfpy for Python (107). In a future release of the INF workflow, we plan to migrate the SNF-related parts to snfpy or a similar Python-based implementation, in order to drop the dependency on R and to potentially improve the overall performance.

In its current version, the INF framework supports the integration of two or more one-dimensional omics layers. As part of our future effort we will add support for the integration of medical imaging layers, for example leveraging the extraction of histopathological features from whole slide images by deep learning (10) or using radiomics or deep features from radiological images (11). In both cases, further issues will emerge from the interactions between the omics and the non-omics data, needing particular care in the integration (12).

#### DATA AVAILABILITY STATEMENT

The original contributions presented in the study are included in the article/supplementary files; further inquiries can be directed to the corresponding author/s.

#### AUTHOR CONTRIBUTIONS

CA, LT, and GJ: conceptualization. MC, NB, AM, AZ, LT, CA, and GJ: methodology. MF: interpretation. GJ: coordination. MC, NB, AM, MF, GJ, and CF: writing. All authors contributed to the article and approved the submitted version.

#### ACKNOWLEDGMENTS

The authors wish to thank Dr. Valerio Maggio for helpful discussions on aspects of the machine learning workflow and for paper proofreading. The results published here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2020.01065/full#supplementary-material




**Conflict of Interest:** AZ was employed by the company NIDEK Technologies Srl. CF was employed by the company HK3 Lab.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Chierici, Bussola, Marcolini, Francescatto, Zandonà, Trastulla, Agostinelli, Jurman and Furlanello. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Integrated Multi-Omics Analyses in Oncology: A Review of Machine Learning Methods and Tools

Giovanna Nicora<sup>1†</sup>, Francesca Vitali<sup>2,3,4†</sup>, Arianna Dagliati<sup>1,5,6†</sup>, Nophar Geifman<sup>5,6</sup> and Riccardo Bellazzi<sup>1</sup>\*

*<sup>1</sup> Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy, <sup>2</sup> Center for Innovation in Brain Science, University of Arizona, Tucson, AZ, United States, <sup>3</sup> Department of Neurology, College of Medicine, University of Arizona, Tucson, AZ, United States, <sup>4</sup> Center for Biomedical Informatics and Biostatistics, University of Arizona, Tucson, AZ, United States, <sup>5</sup> Centre for Health Informatics, The University of Manchester, Manchester, United Kingdom, <sup>6</sup> The Manchester Molecular Pathology Innovation Centre, The University of Manchester, Manchester, United Kingdom*

#### Edited by:

*Francesca Finotello, Innsbruck Medical University, Austria*

#### Reviewed by:

*Federica Eduati, Eindhoven University of Technology, Netherlands; Giuseppe Jurman, Fondazione Bruno Kessler (FBK), Italy*

\*Correspondence: *Riccardo Bellazzi, riccardo.bellazzi@unipv.it*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology*

Received: *30 January 2020* Accepted: *26 May 2020* Published: *30 June 2020*

#### Citation:

*Nicora G, Vitali F, Dagliati A, Geifman N and Bellazzi R (2020) Integrated Multi-Omics Analyses in Oncology: A Review of Machine Learning Methods and Tools. Front. Oncol. 10:1030. doi: 10.3389/fonc.2020.01030*

In recent years, high-throughput sequencing technologies have provided unprecedented opportunities to depict cancer samples at multiple molecular levels. The integration and analysis of these multi-omics datasets is a crucial and critical step to gain actionable knowledge in a precision medicine framework. This paper explores recent data-driven methodologies that have been developed and applied to respond to major challenges of stratified medicine in oncology, including patients' phenotyping, biomarker discovery, and drug repurposing. We systematically retrieved peer-reviewed journal articles published from 2014 to 2019, and we select and thoroughly describe the tools presenting the most promising innovations regarding the integration of heterogeneous data, the machine learning methodologies that successfully tackled the complexity of multi-omics data, and the frameworks to deliver actionable results for clinical practice. The review is organized according to the applied methods: Deep learning, Network-based methods, Clustering, Feature Extraction and Transformation, and Factorization. We provide an overview of the tools available in each methodological group and underline the relationship among the different categories. Our analysis revealed how multi-omics datasets could be exploited to drive precision oncology, but also current limitations in the development of multi-omics data integration.

Keywords: multi-omics, machine learning, tools, systematic review, oncology, cancer

#### INTRODUCTION

The integration and analysis of high-throughput molecular assays is a major focus for precision medicine in enabling the understanding of patient and disease specific variations. Integrated approaches allow for comprehensive views of genetic, biochemical, metabolic, proteomic, and epigenetic processes underlying a disease that, otherwise, could not be fully investigated by single-omics approaches. Computational multi-omics approaches are based on machine learning techniques and typically aim at classifying patients into cancer subtypes (1–5), or are designed for biomarker discovery and drug repurposing (6, 7).

While the complexities underlying cancer still hamper our understanding of how this disease arises and progresses (8), multi-omics approaches have been suggested as promising tools to dissect patients' dysfunctions in multiple biological systems that may be altered by cancer mechanisms (9).

Several efforts have been made to generate comprehensive multi-omics profiles of cancer patients. The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov/) provides detailed clinical, genomics, transcriptomics, and proteomics data on about 20,000 subjects and plans to generate additional data in the coming years for a variety of cancer types. Analysis of datasets generated by multi-omics sequencing requires the development of computational approaches ranging from data integration (10) to statistical methods and artificial intelligence systems, in order to gain actionable knowledge from data.

Here we present a descriptive overview of recent multi-omics approaches in oncology, which summarizes the current state of the art in multi-omics data analysis, relevant topics in terms of machine learning approaches, and the aims of each surveyed work, such as disease subtyping or patient similarity. We provide an overview of each methodology group, and then focus on publicly available tools.

# METHODS

#### Search Strategy

We retrieved publications by querying the Scopus database as: (cancer OR tumor OR tumor OR oncolog\*) AND (multi-omic\* OR multiomic\* OR mixomic\*) AND ("machine learning" OR "data fusion" OR "network analysis").

#### Eligibility Criteria

Since other reviews covered previous years (10, 11), we included peer-reviewed journal articles published from 2014 to 2020 (last query 04-22-2020). If a study appeared in multiple publications, only the latest version was included. We selected relevant studies by screening titles and abstracts, then analyzing full texts. We excluded papers according to the following criteria:


#### Categories and Analyses

For each article, we extracted the publication year and the number of citations. We categorize the selected publications according to:

1. Stratified Medicine for subgroup discovery: studies aimed at finding groups of patients that exhibit different therapeutic/prognostic outcomes;

We highlight successful approaches for each criterion and identify promising ones that are either nascent or unexplored as potential opportunities.

# RESULTS

We retrieved 270 papers. The Scopus query did not retrieve 24 relevant works that were added manually based on our previous knowledge. After a screening of papers' abstracts, 58 papers meeting our criteria were selected. Retrieved papers were organized into a matrix table (**Table 1**) and analyzed with respect to the aforementioned categories. As highlighted in **Figure 1A**, categories are not mutually exclusive, thus we show links between groups, which relate papers applying multiple methods. **Figure 1B** depicts all considered publications by year of publication and the Field-Weighted Citation Impact, a metric that allows comparison of papers accounting for year of publication and number of citations. Studies are shown with different colors and shapes according to the method used and the aim/output type.

In the following sections, we describe the methodological categories that emerged from our literature review. For each methodological category, particular emphasis is placed on studies providing tools that can be exploited by other users, either with their own data or to reproduce their results.

#### Network-Based Methods

Network-based approaches were exploited to detect, reconstruct and study interactions among subnetwork modules (13, 19, 22, 25, 40); to assess functional correlation among multi-omics entities (12, 14, 20, 55, 61, 62); and to integrate and fuse networks to create a comprehensive view of a disease (16, 24, 32, 37, 41, 63, 65). A few works leverage Bayesian methods (4, 34) or Markov models (17, 67).

Some approaches integrate network analysis within frameworks that apply multiple algorithms (35, 51, 58). In (51), a multi-platform analysis used to profile pancreatic adenocarcinoma includes clustering and Similarity Network Fusion to integrate genomic, transcriptomic, and proteomic data from the different platforms. In (58), the authors develop a framework for drug repurposing and multi-target therapies by constructing a protein network for the disease under study and fusing several data sources. In (27), a functional interaction network predicts variations in expression caused by genomic alterations, and it is exploited to prioritize cancer genes. A few other interesting approaches (16, 19) have been discussed in (10).

#### TABLE 1 | Selected papers and categories.

FIGURE 1 | (A) Linkage between different methodological categories. References to papers (see Table 1) that could be categorized in different groups are reported near the link. (B) Publications by year of publication and Field-Weighted Citation Impact. Different colors indicate the methods exploited, and shapes indicate aims and outputs. Papers with red borders have source code or provide a tool. Papers in the "Subgroup identification" group and/or with a free tool turn out to be the most cited across years. The reference numbers are reported in Table 1.

#### iOmicsPASS

iOmicsPASS (40) implements a network-based method for integrating multi-omics profiles over genome-scale biological networks. The tool provides analysis components to transform quantitative multi-omics data into scores for biological interactions; it then uses the resulting scores as input to select predictive subnetworks; finally, it selects predictive edges for phenotypic groups using a modified nearest shrunken centroid algorithm. The authors validate iOmicsPASS on Breast Invasive Ductal Carcinoma data, integrating mRNA expression and protein abundance, with and without normalization of the mRNA data by the DNA Copy Number Variation (CNV). When compared with the original nearest shrunken centroid classification algorithm, iOmicsPASS outperforms the baseline method, indicating the importance of selecting predictive signatures from densely connected subnetworks, thus limiting the search space of predictive features to known interactions.
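The core of the nearest shrunken centroid idea mentioned above is soft-thresholding: each class centroid is pulled toward the overall centroid, and features whose class difference falls below a threshold stop contributing. The sketch below illustrates only this generic step under simplifying assumptions; it omits the standardization by within-class variability used in the original algorithm, it operates on plain feature vectors rather than on interaction scores, and it is not the modified variant implemented in iOmicsPASS. Function names, the `delta` threshold and the toy data are hypothetical.

```python
def soft_threshold(value, delta):
    """Shrink value toward zero by delta, clipping at zero."""
    magnitude = max(abs(value) - delta, 0.0)
    return magnitude if value >= 0 else -magnitude


def shrunken_centroids(X, y, delta):
    """Class centroids shrunk toward the overall centroid, feature by feature."""
    n_features = len(X[0])
    overall = [sum(row[j] for row in X) / len(X) for j in range(n_features)]
    shrunk = {}
    for c in set(y):
        rows = [row for row, label in zip(X, y) if label == c]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        shrunk[c] = [o + soft_threshold(m - o, delta) for m, o in zip(centroid, overall)]
    return overall, shrunk


# Toy data: feature 0 separates the classes, feature 1 does not (illustrative only).
X = [[1.0, 5.0], [1.2, 5.1], [3.0, 5.0], [3.2, 4.9]]
y = ["A", "A", "B", "B"]
overall, shrunk = shrunken_centroids(X, y, delta=0.5)
```

After shrinkage, the second feature has the same value in every class centroid and therefore no longer influences classification, while the first feature keeps a reduced class difference; this is how the threshold acts as a feature selector.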

#### AMARETTO

AMARETTO (22) is an algorithm developed for the integration of multiple omics profiles across different types of cancer. The authors illustrate how the algorithm identifies cancer driver genes based on multi-omics data fusion and detects subnetworks of modules across all cancers. The algorithm identifies potential cancer driver genes by investigating significant correlations between methylation, CNV and gene expression (GE) data. Once the driver genes are identified, it constructs a module network connecting them with the co-expressed target genes they control. This constructs a pan-cancer network that is able to identify novel pan-cancer driver genes.

#### DrugComboExplorer

DrugComboExplorer (35) identifies candidate drug combinations targeting cancer driver signaling networks by processing DNA sequencing, CNV, DNA methylation, and RNA-seq data from individual cancer patients using an integrated pipeline of algorithms. The pipeline is based on two components: the first one extracts dysregulated networks from transcriptome and methylation profiles of specific patients using bootstrapping-based simulated annealing and weighted co-expression network analysis. The second component generates driver network signatures for each drug, evaluates synergistic effects of drug combinations on different driver signaling networks, and ranks drug combinations according to their synergistic effects. In (35), the authors apply DrugComboExplorer to diffuse large B-cell lymphoma and prostate cancer, demonstrating the ability of the tool to discover synergistic drug combinations and its higher prediction accuracy compared with existing computational approaches.

#### Deep Network

Deep Networks (DNs) are widely used to analyse omics-data (68). In a multi-omics scenario, clustering on DNs features showed different survival groups in neuroblastoma and liver cancer (23, 29, 64). In (42) authors integrated GE, methylation and miRNA in a restricted Boltzmann machine, where hidden layers represent different survival groups in breast cancer patients. Subnetworks are used in (54) to project different omics views in latent spaces that are further concatenated and fed into a final network to predict drug response.

#### SALMON

SALMON (Survival Analysis Learning with Multi-Omics Neural Networks) is a Deep Learning framework that integrates omics data (mRNA and miRNA), clinical features and cancer biomarkers (36). Instead of feeding a neural network with mRNA and miRNA data, SALMON takes as input the eigengene matrices derived from co-expression analysis. Thus, it overcomes the high-dimensionality problem, reducing the number of input features by about 99%. The authors assume that mRNA and miRNA data affect survival outcome independently; therefore, the two corresponding eigengene matrices are connected to two different hidden layers whose output is linked to the final network with a Cox proportional hazards regression network. Results on breast carcinoma patients showed improvements in survival prediction ability compared to single-omics.

#### Clustering

Multi-omics clustering approaches are exploited to detect regularities and patterns that reveal different cancer molecular subtypes (21, 33, 43, 48, 57, 60) and prognostic groups in hepatocellular carcinoma (59). In (18), consensus clustering is performed on transcriptomics, metabolomics, and proteomics data to stratify patients with hepatocellular carcinoma based on their redox response. Clustering applications are often preceded by feature selection and/or feature transformation of multi-omics data, such as factorization, low-rank approximation, and neural networks. An exhaustive review of multi-omics integrative clustering approaches can be found in (69).

#### NEMO

NEMO (NEighborhood based Multi-Omics clustering) is a similarity-based tool that computes inter-patient similarity matrices for each omics through a radial basis function kernel. Spectral clustering is performed on the resulting average similarity matrix (52). NEMO addresses the problem of partial datasets, where not all the omics are measured for all the patients, and the final average matrix is computed on the observed omics values, without performing imputation. NEMO clustering shows higher performance compared to the same approach with imputed data, while on TCGA cancer datasets it detects significant differences in survival for six out of 10 cancer types.
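The way partial datasets can be handled without imputation is easy to sketch: for each pair of patients, similarities are averaged only over the omics in which both were measured. The code below illustrates this averaging step under stated assumptions; it uses a plain radial basis function kernel with a fixed, hypothetical `sigma`, it does not reproduce the neighborhood-based scaling of the actual tool, and the subsequent spectral clustering is omitted. Patient identifiers and values are illustrative.

```python
import math


def rbf_similarity(a, b, sigma=1.0):
    """Radial basis function kernel between two feature vectors."""
    d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))


def average_similarity(omics, patients, sigma=1.0):
    """Average per-omic similarities over the omics measured for both patients.

    `omics` is a list of dicts mapping patient id -> feature vector; a patient
    absent from a dict was not profiled with that omic (no imputation is done).
    """
    n = len(patients)
    S = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(patients):
        for j, q in enumerate(patients):
            shared = [layer for layer in omics if p in layer and q in layer]
            if shared:
                total = sum(rbf_similarity(layer[p], layer[q], sigma) for layer in shared)
                S[i][j] = total / len(shared)
    return S


# Toy data: patient p2 has no methylation profile (illustrative only).
mrna = {"p1": [0.0], "p2": [0.1], "p3": [2.0]}
meth = {"p1": [1.0], "p3": [1.0]}
S = average_similarity([mrna, meth], ["p1", "p2", "p3"])
```

The entry for p1 and p2 is based on mRNA alone, whereas the entry for p1 and p3 averages both omics; pairs with no shared omic would keep a similarity of zero in this sketch.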

#### Clusternomics

The main assumption of multi-omics clustering approaches relies on the existence of a consistent clustering structure across heterogeneous datasets. Alternatively, in (30) the authors introduced the context-dependent clustering method Clusternomics. Each omics is seen as a context describing a particular aspect of the underlying biological process. The global clustering structure is inferred from the combination of Bayesian clustering assignments. Then, by separating cluster assignment on two levels, Clusternomics allows the number of clusters to vary between the local and the global structure. Its performance was evaluated on a simulated dataset, where it showed a higher Adjusted Rand Index compared to other clustering techniques, and on breast, lung and kidney cancer data from the TCGA repository, where it identified clinically meaningful clusters.
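For readers unfamiliar with the Adjusted Rand Index, the following sketch computes it from two label assignments using the standard pair-counting formula. It is a generic illustration (requiring Python 3.8+ for `math.comb` and at least two samples), not code from the Clusternomics evaluation; the function name and the handling of the degenerate case are choices made here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same samples."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))          # contingency table
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)             # agreement expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:                           # degenerate case, e.g. one cluster
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # same partition, relabeled
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that disagree
```

The index equals 1 for identical partitions regardless of how clusters are named, is close to 0 for random assignments, and can be negative when agreement is worse than expected by chance.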

#### Affinity Network Fusion

Affinity Network Fusion (ANF) (44) is both a clustering and classification technique that applies graph clustering to a patient affinity matrix incorporating information from multiple views. For each omic, after feature selection and/or transformation, ANF computes patient pair-wise distances. A kNN graph kernel applied to the distance metric creates a patient affinity matrix for each view. The final affinity matrix is the weighted sum of the computed affinity matrices. The ANF approach showed improved clustering performance in detecting cancer subtypes on several TCGA datasets when compared to its application to single omics.

#### Feature Extraction

In multi-omics integration, variable selection to reduce the dimensionality of the omics dataset has a dominant role [(70), **Figure 1A**]. Recursive feature elimination was used to select subsets of expressed genes and methylation data to classify breast cancer subtypes with a Random Forest (3). Gene prioritization allowed prognosis prediction in different cancer types from epigenomics, transcriptomics, and genomics data (38), and biomarker discovery in prostate cancer (28). In (39), the authors weighted gene-gene interactions from transcriptomics and genomics data with a random walk-based method to select the most important interactions for survival prediction in breast cancer and neuroblastoma patients.

#### netDx

netDx is an algorithm that performs feature selection on Patient Similarity Networks (PSN) to classify patients into different prognostic groups (50). A PSN is built for each omics such that nodes represent patients and edges represent the similarity of two patients in the given view. netDx then identifies which networks (i.e., which omics) strongly relate high- and low-risk patients through the GeneMANIA algorithm (71), which solves a regression problem to maximize the edges that connect query patients. Finally, each network is weighted according to its ability to relate patients of the same group, and networks whose score exceeds a defined threshold are selected and combined into a single network by averaging their similarity scores. The authors benchmarked netDx against several machine-learning methods for predicting survival outcomes on PanCancer TCGA multi-omics datasets, showing comparable results. On a breast cancer dataset, the features selected by netDx correspond to pathways known to be dysregulated in this type of cancer.

#### Feature Transformation

Feature transformation (FT) refers to algorithms that replace existing features with new features that are still functions of the original ones. As shown in **Figure 1B**, the majority of FT techniques aim at identifying cancer subtypes, biomarkers, omics signatures, and key features from multi-omics data. Zhu et al. (66) proposed a kernel machine-learning method for pan-cancer prognostic assessment by integrating multi-omics data. This work is particularly interesting since it is the only FT method we reviewed that allows multi-omics profiles to be integrated individually and in combination with clinical factors. A kernel-based approach, combined with non-linear regression and Bayesian inference, was the best-performing algorithm in a drug sensitivity prediction challenge (26).

In the following, we report selected FT approaches, although a few other tools for subgroup discovery, such as iClusterBayes (47), Multi-Omics Factor Analysis (15), JIVE (49), and MCIA (46), are available.

#### mixOmics

One of the most recent and substantial efforts in this field resulted in an R package called mixOmics (53). mixOmics supports multivariate analysis of omics data, including data exploration, dimension reduction, and visualization. It can be applied in numerous studies with different aims, such as integration and biomarker identification from multi-omics studies. The package includes two different types of multi-omics integration. The first integrates different types of omics data measured on the same biological samples, while the second integrates independent datasets measured on the same predictors to increase sample size and statistical power (53). Both frameworks aim at extracting biologically relevant features (i.e., molecular signatures) by applying FT techniques (53). In (53), the authors presented results on 150 samples of mRNA, miRNA, and proteomics breast cancer data and showed the ability of mixOmics to correctly discriminate three types of breast cancer.

#### mixKernel

mixKernel (45) is an R package compatible with mixOmics that allows integration of multiple datasets by representing each dataset through a kernel that provides pairwise information between samples. The single kernels are then combined into one meta-kernel in an unsupervised framework. This meta-kernel can be used for exploratory analyses, such as clustering, or for more sophisticated analyses to gain insight into the integrated data. The authors showed that mixKernel applied to mRNA, miRNA, and methylation breast cancer data performed better than single-kernel approaches.

#### iProFun

iProFun (56) is a method aimed at elucidating the proteogenomic functional consequences of CNV and methylation alterations. The authors integrated mRNA expression levels, global protein abundances, and phosphoprotein abundances of a given cancer. The output consists of a list of genes whose CNVs and/or DNA methylation significantly influence some or all of the integrated data types. iProFun obtains summary statistics of the integrated data from a gene-level multiple linear regression. These statistics are then used to extract genes showing a cascading effect on all cis-molecular traits of interest, as well as genes whose functional regulation is unique to the global protein level. Applied to the ovarian cancer TCGA dataset, iProFun extracted genes that could be considered targets for future therapies.

#### Factorization

Traditional data mining methods are often inadequate to treat heterogeneous, sparse, and noisy data such as multi-omics. Heavy pre-processing operations could modify, and therefore lose, the inner structure of data coming from different sources. To discover latent characteristics hidden in huge amounts of information, factorization techniques have been applied to highlight complex interactions among omics data that are hard to detect using standard approaches.

Gao et al. (31) developed an integrated Graph Regularized Non-negative Matrix Factorization model focused on network construction by integrating gene expression, CNV, and methylation data. The authors used the factorization technique to decompose and fuse the multi-omics data. Then, by combining the results with network and mining analyses, they showed that their method was able to find potential new cancer-related genes in two different TCGA datasets. Another method, based on factor analysis, aims at identifying latent factors in the multi-omics data integrated in the model, which can be used for subsequent analyses such as subgroup identification (15). Given its aim of extracting hidden features, we described this method in detail in the feature transformation section.

### DISCUSSION

Along with technological advances in high-throughput sequencing, which characterize multiple "omes" from biological samples, holistic systems for data integration and knowledge discovery with machine-learning algorithms are still under development. Precision oncology would greatly benefit from actionable knowledge gained from multi-omics assays. In this paper, we provided an overview of recent works on this topic and highlighted current achievements and limitations.

We reviewed relevant tools for analyses based on different combinations of omics and observed that their number has grown in recent years, indicating a strong commitment to developing such tools. Several issues emerged, too. The majority of the proposed techniques were applied to TCGA datasets, and data integration was mainly focused on transcriptomics and genomics. Efforts should be devoted to making new data sources available to the research community (72), such as the UKBioBank (73) and DriverDBv3 (74), and to integrating other "omes," such as the metabolome, as well as patient-generated and environmental data. Research in this field would greatly benefit from databases specifically designed to contain and facilitate the analysis of multi-omics and clinical data, such as LinkedOmics (75). Another important improvement to increase usability and reproducibility would be to develop methods that can be applied and generalized to all omics data types.

The complexity of multi-omics data analysis requires collaborative efforts among the clinical and machine-learning communities and the joint application of methodologies derived from heterogeneous backgrounds. We noted that some promising methods, such as matrix factorization, have not been extensively exploited, while clustering and network-based approaches are the most extensively used, probably due to their flexibility and the possibility of integrating them into comprehensive frameworks that include feature extraction and transformation to deal with the curse of dimensionality. Deep learning methods, which are flexible and have achieved outstanding results in other fields, are increasingly used, even though many works share the same "pipeline" (i.e., the exploitation of autoencoder hidden layers for clustering). Interestingly, the number of open-source tools has increased in the very last years (**Figure 1B**).

We are aware of some limitations of our review. An important aspect not covered here is the quantitative comparison among tools (76), which could highlight possible overfitting (77) and issues that may prevent the actual translation of multi-omics approaches from bench to bedside. Nevertheless, by indicating works that provide a usable tool (**Table 1**), our review could be a starting point for a comprehensive quantitative comparison.

#### AUTHOR CONTRIBUTIONS

RB conceived the study. GN, FV, and AD ran the analyses and wrote the article. NG and RB revised the article. All authors contributed to the article and approved the submitted version.

#### FUNDING

This study was funded by Fondazione Regionale Ricerca Biomedica, Milan, Italy [FRRB project n. 2015-0042, Genomic profiling of rare hematologic malignancies, development of personalized medicine strategies, and their implementation into Rete Ematologica Lombarda (REL) clinical network] and by the NIHR Manchester BRC, MRC Molecular Pathology Node MMPathic (grant ref MR/N00583X/1).

#### ACKNOWLEDGMENTS

We would like to acknowledge Simone Marini for his valuable help in the initial phases of the study.




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Nicora, Vitali, Dagliati, Geifman and Bellazzi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Integrated Transcriptome Analysis of Human Visceral Adipocytes Unravels Dysregulated microRNA-Long Non-coding RNA-mRNA Networks in Obesity and Colorectal Cancer

Sabrina Tait<sup>1</sup>, Antonella Baldassarre<sup>2</sup>, Andrea Masotti<sup>2</sup>, Enrica Calura<sup>3</sup>, Paolo Martini<sup>3</sup>, Rosaria Varì<sup>1</sup>, Beatrice Scazzocchio<sup>1</sup>, Sandra Gessani<sup>1</sup> and Manuela Del Cornò<sup>1</sup>\*

<sup>1</sup> Center for Gender-Specific Medicine, Istituto Superiore di Sanità, Rome, Italy, <sup>2</sup> Bambino Gesù Children's Hospital-IRCCS, Research Laboratories, Rome, Italy, <sup>3</sup> Department of Biology, University of Padua, Padua, Italy

#### Edited by:

Margaret Jane Currie, University of Otago, Christchurch, New Zealand

#### Reviewed by:

Weifeng Ding, Nantong University, China Olga Brovkina, Federal Medical-Biological Agency, Russia

\*Correspondence: Manuela Del Cornò manuela.delcorno@iss.it

#### Specialty section:

This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology

Received: 06 March 2020 Accepted: 01 June 2020 Published: 02 July 2020

#### Citation:

Tait S, Baldassarre A, Masotti A, Calura E, Martini P, Varì R, Scazzocchio B, Gessani S and Del Cornò M (2020) Integrated Transcriptome Analysis of Human Visceral Adipocytes Unravels Dysregulated microRNA-Long Non-coding RNA-mRNA Networks in Obesity and Colorectal Cancer. Front. Oncol. 10:1089. doi: 10.3389/fonc.2020.01089

Obesity, and the obesity-associated inflammation, represents a major risk factor for the development of chronic diseases, including colorectal cancer (CRC). Dysfunctional visceral adipose tissue (AT) is now recognized as a key player in obesity-associated morbidities, although the biological processes underpinning the increased CRC risk in obese subjects are still a matter of debate. Recent findings have pointed to specific alterations in the expression pattern of non-coding RNAs (ncRNAs), such as microRNAs (miRNAs) and long non-coding RNAs (lncRNAs), as mechanisms underlying the dysfunctional adipocyte phenotype in obesity. Nevertheless, the regulatory networks and interrelated processes relevant for adipocyte functions that may contribute to a tumor-promoting microenvironment are still poorly understood. To this end, based on RNA sequencing data, we identified lncRNAs and miRNAs that are aberrantly expressed in visceral adipocytes from obese and CRC subjects, as compared to healthy lean controls, and validated a panel of modulated ncRNAs by real-time qPCR. Furthermore, by combining the differentially expressed lncRNA and miRNA profiles with the transcriptome analysis dataset of adipocytes from lean and obese subjects affected or not by CRC, lncRNA–miRNA–mRNA adipocyte networks were defined for obese and CRC subjects. This analysis highlighted several ncRNA modulations that are common to both obesity and CRC or unique to each disorder. Functional enrichment analysis of network-related mRNA targets revealed dysregulated pathways associated with metabolic processes, lipid and energy metabolism, inflammation, and cancer. Moreover, adipocytes from obese subjects affected by CRC exhibited a higher complexity, in terms of the number of genes, lncRNAs, miRNAs, and biological processes found to be dysregulated, providing evidence that the transcriptional and post-transcriptional program of adipocytes from CRC patients is deeply affected by obesity. Overall, this study adds further evidence for a central role of visceral adipocyte dysfunctions in the obesity–cancer relationship.

Keywords: obesity, colorectal cancer, adipocyte, RNASeq, microRNAs, long non-coding RNAs, networks

# INTRODUCTION

The increase in obesity is a major health problem currently afflicting adults and children worldwide (1). Obesity is a complex condition, characterized by excessive expansion and functional alteration of white adipose tissue (AT), that increases the risk of life-threatening diseases such as cardiovascular disease, diabetes, and cancer, including colorectal cancer (CRC). Indeed, white AT, particularly visceral fat, is a complex endocrine and immunocompetent organ that hosts adipocytes and resident immune cells, exhibits secretory as well as immunological, metabolic, and endocrine regulatory activities, and plays a central role in obesity-associated morbidities (2). Its functional units, the adipocytes, produce and secrete a large array of mediators, including cytokines/chemokines, extracellular matrix proteins, hormones, and growth and angiogenic factors, that influence, either locally or systemically, a variety of physiological and pathological processes, such as immune functions, cell proliferation, migration, and angiogenesis (3, 4). In addition to being an established risk factor (5), excess adiposity is also associated with worse CRC outcomes (6, 7), although the mechanisms underlying the detrimental link between obesity and CRC are complex and not yet precisely defined. In this respect, it has been postulated that this association may be due to the large spectrum of cytokines and metabolites that are produced by AT showing pro-inflammatory and cancer-prone features. Moreover, obesity-related metabolic alterations (i.e., triggering of insulin resistance, impairment in lipid metabolism, endocrinologic changes, and oxidative stress) may contribute to CRC initiation and progression (8). More recently, emerging evidence points to the role of non-coding RNAs (ncRNAs) in many obesity-related disorders, including cardiovascular and metabolic diseases, inflammation, and cancer (9), and more specifically in CRC (10).

NcRNAs are transcripts that are not translated into proteins. They are present in all organisms, where they regulate gene expression and, therefore, biological processes, at the transcriptional and post-transcriptional level (11). Multiple types of regulatory ncRNAs are emerging as key elements of cellular homeostasis and disease. They are arbitrarily classified according to their nucleotide length into long ncRNAs (lncRNAs, >200 nt) and small ncRNAs (<200 nt), such as microRNAs (miRNAs) and small interfering, Piwi-interacting, small nucleolar, small nuclear, and extracellular RNAs (12). Among them, miRNAs are evolutionarily conserved small ncRNAs (18–25 nt in length) playing a crucial role in cell transcriptional regulation (13, 14). Their expression correlates with different obesity-relevant parameters, such as body mass index (BMI), adipocyte size, and metabolic parameters, highlighting an important regulatory role in obesity (15–18). The importance of miRNAs in mediating the initiation, growth, and development of CRC has also been reported (19). In contrast with small ncRNAs, lncRNAs undergo post-transcriptional modifications, such as polyadenylation and splicing, although they lack protein-coding capacity (20). They are emerging as miRNA sponges and inhibitors, thus releasing downstream genes from miRNA control (21). Furthermore, lncRNAs can also interact with DNA, RNA, and proteins, overall regulating gene expression and epigenetic status (12). Accumulating evidence has revealed that the expression of lncRNAs is involved in the occurrence and development of many major diseases, including human cancers (22, 23), and that lncRNA-miRNA-mRNA networks are specifically associated with CRC (24). High-throughput methods and bioinformatics approaches have significantly contributed to the identification of new transcripts, including ncRNAs. However, only a few studies have described miRNAs and lncRNAs in human AT under obesity (9, 25–27).
Moreover, no studies have reported the expression of miRNAs and lncRNAs in AT from CRC patients. In this regard, we recently reported that obesity and CRC, conditions characterized by the common denominator of inflammation, are associated with changes in the transcriptional program of adipocytes, mostly involving pathways and biological processes linked to fibrosis, inflammation, and metabolism of pyruvate, lipids, and glucose (28). In this study, we analyzed the ncRNA expression profiles, specifically miRNAs and lncRNAs, of lean and obese subjects affected or not by CRC, by RNASeq/Small RNASeq analysis. This approach allowed us to highlight changes in adipocyte miRNA and lncRNA profiles that are specifically associated with obesity or CRC, or shared by both conditions. Finally, by integrating bioinformatics prediction, functional enrichment analysis, and previously described data on differential mRNA expression (28), we identified lncRNA-miRNA-mRNA regulatory networks and defined multiple pathways characterizing visceral adipocytes that are altered in obesity and/or CRC. Overall, this might contribute to setting the basis for a more tumor-prone microenvironment, thus adding further evidence for the central role of AT functional alterations in linking obesity to cancer.

# METHODS

### Ethics Statement

The investigation was conducted in accordance with ethical standards, the Declaration of Helsinki, and national and international guidelines. It was approved by the institutional review board of Istituto Superiore di Sanità. All enrolled subjects were provided with complete information about the study and asked to sign an informed consent form.

### Patient and Sample Collection

As previously described (28), "human visceral adipose tissue (VAT) was collected from age-matched lean and obese subjects undergoing abdominal surgery or laparoscopy for benign (i.e., gallbladder disease without icterus, umbilical hernia, and uterine fibromatosis) or CRC conditions (histologically proved primary colon adenocarcinoma, stage TNM 0–III). The exclusion's criteria were: clinical evidence of active infection, recent (within 14 days) use of antibiotics/anti-inflammatory drugs, pregnancy, hormonal therapies, severe mental illness, autoimmune diseases, family history of cancer, others neoplastic diseases. In the normal weight group, the BMI range was 20–25 Kg/m<sup>2</sup> . In the obese group the BMI was ≥ 30 Kg/m<sup>2</sup> , and waist circumference > 95 cm for men and > 80 cm for women. The total number of subjects was six/category."

### Adipocyte Isolation, RNA Preparation and Sequencing

Adipocytes were isolated from human VAT as previously described (29). Total RNA was isolated with the Total RNA Purification Plus Kit (Norgen Biotek, Canada). RNA quality and quantity were assessed with an Agilent 2100 Bioanalyzer, and samples were stored at −80°C until use. Total RNA (2 µg) was used to prepare the library for Illumina sequencing (Illumina TruSeq Small RNA Sample Preparation). Single-end reads (>10 M reads per sample) were produced on an Illumina HiSeq 2000.

### RNASeq Data Preprocessing and Differential Expression Analysis

Libraries were then processed with Illumina cBot for cluster generation on the flowcell, following the manufacturer's instructions, and sequenced in single-end mode at the multiplexing level requested on HiSeq2000 (Illumina, San Diego, CA). The CASAVA 1.8.2 version of the Illumina pipeline was used to process raw data for both format conversion and de-multiplexing. Adapters were removed and low-quality bases were trimmed with the TrimGalore script. Per-sample, per-read, and per-base quality of the raw sequence data was assessed with FastQC version 0.11.3 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and all the included samples passed the initial quality checks. For all sequencing data, the full range of per-base quality values fell within very good quality calls, fewer than 0.02% of the total sequences showed a low per-sequence quality score, and no adapter content was detected. Thus, no quality trimming was performed during preprocessing. The percentage of mapped reads was high, with a mean value of 97.5% (min 94.08% and max 98.41%).

The transcriptome reconstruction was performed as previously described (28). Re-annotation of previously unknown transcripts was performed using the biomaRt package (30) in R 3.6 (31), querying available Ensembl transcript IDs and retrieving gene names, Entrez gene IDs, and gene and transcript biotypes, thus allowing the identification of a higher number of lncRNAs. Multiple testing was controlled with the Benjamini & Hochberg method, hereafter referred to as False Discovery Rate (FDR). We then extracted the list of differentially expressed lncRNAs (DEL) with FDR ≤ 0.05. For small RNASeq analysis, raw reads were pre-processed using cutadapt 1.9.1 (http://code.google.com/p/cutadapt/), and reads shorter than 17 bases were excluded. miRNA expression quantification was carried out using miRDeep2 (version 2.0.0.8, Bowtie version 1.1.2) (32) with genome version hg38.p2 and Ensembl version 79. Mature and hairpin miRNA sequences were downloaded from miRBase version 21 (33); raw counts were then filtered to keep only miRNAs with at least 10 reads in at least one sample. miRNA expression was normalized with upper-quartile normalization (EDASeq version 2.10.0) (34), while differential expression (all comparisons) was computed from raw counts using edgeR (3.18.1) (35). All analyses were carried out in R and Bioconductor version 3.5 (https://bioconductor.org). Because few miRNAs were differentially expressed, the FDR threshold for differentially expressed miRNAs (DEM) was set to ≤ 0.06. For small RNA sequencing, six biological replicates per category were prepared, and the raw sequence data are available from the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra) under accession number SRA: PRJNA632999. For long RNA sequencing, we employed the previously published RNASeq datasets available under accession number SRA: PRJNA508473 (28).
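Two of the steps described above, the read-count filter and the Benjamini & Hochberg adjustment, are simple enough to illustrate directly. The sketch below is written in Python for illustration only; the analysis itself was carried out in R/Bioconductor, and the miRNA names, counts, and p-values shown here are hypothetical.

```python
def filter_low_counts(counts, min_reads=10):
    """Keep miRNAs with at least `min_reads` reads in at least one sample."""
    return {name: row for name, row in counts.items() if max(row) >= min_reads}

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw counts (three samples) and raw p-values.
raw_counts = {"miR-a": [0, 12, 3], "miR-b": [9, 2, 0]}
kept = filter_low_counts(raw_counts)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in fdr]
```

Changing the cut-off in the last line to 0.06 would reproduce the more permissive threshold used for miRNAs.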

### mRNA-miRNA-lncRNA Regulation Network Construction

Target genes of the identified differentially expressed miRNAs (DEM) were searched in the TarBase v.8 (36) and miRTarBase 7.0 (37) databases, which feature up-to-date experimentally validated miRNA-target interactions. Interactions between DEL and DEM were retrieved from both the DIANA-LncBase v2.0 database (38), using the prediction module and a score ≥ 0.6 as cut-off, and the ENCORI database (39), which features experimentally verified RNA-RNA interactions. The ENCORI database was also used to search for verified DEL-mRNA interactions. The overall targets of DEM and DEL were filtered against the lists of differentially expressed transcripts (DET) and integrated to define specific mRNA-miRNA-lncRNA interaction networks for each condition. The Cytoscape software (40) was used to visualize the obtained networks.
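The filtering and integration logic described above can be summarized in a few lines. This is a minimal sketch assuming that the database look-ups have already been exported as plain mappings; the function name, the edge labels, and all identifiers in the toy data are hypothetical placeholders, not interactions retrieved from the databases.

```python
def build_network(dem_targets, del_mirna, del_targets, det):
    """Return sorted (source, target, kind) edges supported by the DET list.

    dem_targets: DEM -> validated mRNA targets
    del_mirna:   DEL -> interacting miRNAs
    del_targets: DEL -> verified mRNA targets
    det:         differentially expressed transcripts for one condition
    """
    det = set(det)
    dem = set(dem_targets)
    edges = set()
    for mirna, targets in dem_targets.items():
        edges.update((mirna, gene, "miRNA-mRNA") for gene in targets if gene in det)
    for lnc, mirnas in del_mirna.items():
        # Keep only lncRNA-miRNA pairs whose miRNA is itself a DEM.
        edges.update((lnc, mirna, "lncRNA-miRNA") for mirna in mirnas if mirna in dem)
    for lnc, targets in del_targets.items():
        edges.update((lnc, gene, "lncRNA-mRNA") for gene in targets if gene in det)
    return sorted(edges)

# Hypothetical placeholder identifiers.
network = build_network(
    dem_targets={"miR-A": ["GENE1", "GENE2"]},
    del_mirna={"lnc-X": ["miR-A", "miR-Z"]},
    del_targets={"lnc-X": ["GENE3"]},
    det=["GENE1", "GENE3"],
)
```

The resulting edge list can be written to a delimited file and imported into a network viewer for visualization.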

### Functional Analysis

The cumulative list of DEM and DEL targets within the DET of each condition was explored for significantly enriched pathways with the Cytoscape plug-ins ClueGO and CluePedia (41), querying the KEGG, WikiPathways, and Reactome databases. Default settings were used for pathway selection, connectivity, and grouping. A two-sided enrichment analysis was performed, adjusting the p-values with the Benjamini-Hochberg correction, and only pathways with p < 0.05 were considered significant.
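To make the notion of pathway enrichment concrete, the sketch below computes a one-sided hypergeometric over-representation p-value for each pathway. It is a simplification: the analysis described above used a two-sided test as implemented in ClueGO, followed by Benjamini-Hochberg correction, neither of which is reproduced here. Gene and pathway names are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def enrichment(targets, pathways, background):
    """Over-representation p-value of each pathway among the target genes."""
    background = set(background)
    targets = set(targets) & background
    results = {}
    for name, members in pathways.items():
        members = set(members) & background
        overlap = len(targets & members)
        results[name] = hypergeom_upper_tail(
            overlap, len(targets), len(members), len(background))
    return results

# Hypothetical background of ten genes and two toy pathways.
background = [f"g{i}" for i in range(1, 11)]
pvalues = enrichment(
    targets=["g1", "g2", "g3"],
    pathways={"pathway_A": ["g1", "g2", "g3"], "pathway_B": ["g8", "g9"]},
    background=background,
)
```

In practice the raw p-values would then be corrected for the number of pathways tested before applying the significance threshold.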

### Real-Time qPCR Validation of Differentially Expressed lncRNAs and miRNAs

Twelve candidate ncRNAs, found differentially expressed by RNASeq, were selected for validation by real-time qPCR (RT-qPCR). The validation of lncRNA expression was performed by qPCR using SYBRGreen assays (**Supplemental Table 1**). cDNA was synthesized from 300–500 ng of total RNA in a 20 µL reaction volume using the Superscript III kit (Thermofisher Scientific) following the manufacturer's instructions. The reverse transcription conditions were as follows: 5 min at 25°C, 60 min at 50°C, and 15 min at 70°C. cDNA was mixed with 2 × SensiFast SYBR low rox (Bioline), and lncRNA expression values were normalized to the expression of GUSB as the endogenous control. For the validation of miRNA expression levels, 6 miRNAs were reverse transcribed using 2 µl (5 ng/µl) of total RNA with the miRCURY LNA RT Kit (Qiagen). The reverse transcription conditions were as follows: 60 min at 42°C and 5 min at 95°C. cDNA was mixed with 2 × miRCURY SYBR Green Master Mix (Qiagen) following the manufacturer's instructions. The expression values of miRNAs were normalized to the expression of let-7a-5p as the endogenous control. For each sample, the relative expression level was determined according to the 2<sup>−ΔΔCT</sup> method after running the samples on a QuantStudio 12 K Flex Real-Time PCR System (Thermofisher Scientific) following the manufacturer's instructions. Statistical comparisons of means from six biological replicates (five for the NwCRC group), matched with RNASeq analysis, were performed between the various subject groups by one-way analysis of variance (ANOVA) with LSD post hoc tests using SPSS software (Ver.20). Differences were considered statistically significant when p-values were ≤ 0.05. Analysis of correlation between qPCR and RNASeq data was performed by Spearman's rank test, setting significance at p < 0.05.
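The relative quantification used for the qPCR validation can be written out explicitly. The following minimal sketch shows the standard calculation, in which the target is first normalized to the endogenous control within each sample and then compared with the reference group; the function name and the threshold-cycle values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target by the 2^(-ddCT) method.

    Each dCT is the target CT minus the endogenous-control CT
    (e.g., GUSB for lncRNAs or let-7a-5p for miRNAs).
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_sample - delta_control)

# Hypothetical CT values: target is detected two cycles earlier,
# relative to the endogenous control, in the sample than in the control.
fold_change = relative_expression(24.0, 20.0, 26.0, 20.0)
unchanged = relative_expression(25.0, 20.0, 25.0, 20.0)
```

A value above 1 indicates higher expression in the sample than in the reference group, and a value below 1 indicates lower expression.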

# RESULTS

### Long and Small RNA Sequencing Analyses Identify Differentially Expressed lncRNAs and miRNAs That Are Associated With Obesity and/or CRC

We have previously analyzed the transcriptome profiles of human adipocytes isolated from visceral AT (VAT) biopsies obtained from healthy control lean (normal weight, Nw) and obese (Ob) subjects, or CRC patients (normal weight or obese, NwCRC and ObCRC, respectively), by RNA sequencing (28). Along with the protein-coding transcripts, the long RNASeq analysis also detected a total of 90 differentially expressed lncRNAs (DEL, FDR ≤ 0.05), 35 of which were novel transcripts (**Table 1**). In NwCRC subjects, 45 DEL were found dysregulated (11 downregulated, 33 upregulated, and one DEL with two inversely modulated transcripts, NUTM2A-AS1) compared to Nw healthy controls. In the Ob group, we found 27 DEL (3 downregulated, 23 upregulated, and one DEL with two inversely modulated transcripts, RASSF8-AS1). Finally, when comparing the ObCRC group with the control lean group, a total of 52 DEL were found, including 13 downregulated, 38 upregulated, and one with three transcripts (MIR4435-2HG, one up- and two downregulated). Among the overall 90 DEL, 10 were shared by all three subject categories (AC109460.3, AL031429.1, AL139260.1, APTR, FAM198B-AS1, LINC00968, LINC01106, LINC01348, MIR4435-2HG, SNHG16), 6 were shared by NwCRC and Ob patients (AC008105.3, AC021092.1, HIF1A-AS1, LINC00926, RASSF8-AS1, ZNF883), 12 were shared by NwCRC and ObCRC patients (AC009022.1, AC010457.1, AC016582.2, AC068888.1, AL356056.1, AP000317.2, FAM27E3, MINCR, MIR100HG, SLC14A2-AS1, STAG3L5P-PVRIG2P-PILRB, TPTEP1), and only one was shared by Ob and ObCRC patients (AC022007.1). On the other hand, a number of lncRNAs were selectively modulated in each subject category, with the ObCRC group exhibiting the highest number of specific DEL (**Table 1**).

TABLE 1 | Differentially expressed lncRNAs in normal weight affected by CRC (NwCRC), obese (Ob), and obese affected by CRC (ObCRC) individuals vs. healthy lean control.

TABLE 2 | Differentially expressed miRNAs in normal weight affected by CRC (NwCRC), obese (Ob), and obese affected by CRC (ObCRC) individuals vs. healthy lean control.

In parallel, small RNASeq analysis revealed a total of 58 differentially expressed miRNAs (DEM, FDR ≤ 0.06) in adipocytes of NwCRC, Ob, and ObCRC subjects compared to Nw individuals (**Table 2**). Specifically, 22 DEM were found in NwCRC (12 upregulated and 10 downregulated), 20 DEM were detected in Ob subjects (13 upregulated and 7 downregulated), while the comparison of ObCRC with Nw control revealed a higher number of dysregulated miRNAs (39 DEM, 20 upregulated and 19 downregulated), suggesting that the conditions of obesity and CRC interact concurrently, thus influencing the miRNA expression profile in adipocytes from ObCRC subjects. Among the overall 58 modulated DEM, only 3 were common to all groups of subjects (miR-1247-5p, miR-125a-5p, miR-193b-3p), 7 were shared by NwCRC and ObCRC subjects (miR-125b-1-3p, miR-22-5p, miR-29b-2-5p, miR-4455, miR-452-5p, miR-7706, miR-98-5p), 9 were shared by ObCRC and Ob subjects (let-7e-3p, miR-1287-5p, miR-152-3p, miR-181c-5p, miR-181d-5p, miR-185-5p, miR-24-3p, miR-34a-5p, miR-421), while only one was common to NwCRC and Ob subjects (miR-345-5p). As regards the subject group-specific DEM, again the ObCRC category exhibited the highest number of selectively dysregulated miRNAs (**Table 2**). A Venn diagram was then generated to identify the common or unique lncRNAs and miRNAs among the three experimental groups (Ob, NwCRC, and ObCRC subjects) (**Figure 1**). By intersecting DEL and DEM data from the three comparisons (NwCRC, Ob, and ObCRC individuals compared to Nw subjects), 13 ncRNAs were found to be shared between cancer and obese conditions. The identification of these differentially expressed ncRNAs, likely involved directly in creating a tumor-promoting microenvironment, may provide clues on the epigenetic mechanisms by which obesity favors CRC onset, as well as on how CRC development in obesity differs from that in lean individuals.
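The partition of differentially expressed ncRNAs into shared and group-specific sets, as displayed in a Venn diagram, reduces to set intersections and differences. The sketch below illustrates this with a small subset of the DEM named in the text; the sets are deliberately incomplete and the function name is our own.

```python
from itertools import combinations

def venn_regions(groups):
    """Map each combination of groups to the items found in exactly those groups."""
    names = list(groups)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(groups[n] for n in combo))
            outside = set().union(*(groups[n] for n in names if n not in combo))
            regions[combo] = inside - outside
    return regions

# Partial DEM lists per comparison (subset of the miRNAs reported above).
common = {"miR-1247-5p", "miR-125a-5p", "miR-193b-3p"}
dem = {
    "NwCRC": common | {"miR-98-5p", "miR-345-5p"},
    "Ob": common | {"miR-34a-5p", "miR-345-5p"},
    "ObCRC": common | {"miR-98-5p", "miR-34a-5p"},
}
regions = venn_regions(dem)
shared_by_all = regions[("NwCRC", "Ob", "ObCRC")]
```

Applying the same function to the complete DEL and DEM lists would give the counts for every region of the diagram.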

### Identification of Target Genes Regulated by Differentially Expressed miRNAs

To investigate the potential involvement of the aforementioned DEM in the pathogenic events related to obesity and/or CRC, we next analyzed the dysregulated miRNAs and assessed whether their targets showed consistent differential expression. For each identified DEM, we extracted the list of experimentally validated mRNA targets from the TarBase and miRTarBase repositories. Based on our previously obtained gene expression dataset (28), we considered only those targets included in the list of differentially expressed transcripts (DET). We then assembled an interaction network between DEM and their target genes for each group (**Figure 2**). The complete list of DEM-DET interactions for each condition is reported in **Supplemental Table 2**.
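The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the input is a hypothetical dictionary of validated targets per miRNA (as might be pooled from TarBase and miRTarBase), and all miRNA and gene names are placeholders.

```python
def build_dem_det_edges(validated_targets, det):
    """Keep only validated miRNA -> gene pairs whose gene is a DET."""
    det = set(det)
    edges = set()
    for mirna, genes in validated_targets.items():
        for gene in genes:
            if gene in det:
                edges.add((mirna, gene))
    return sorted(edges)


# Hypothetical toy input; names are placeholders, not real interactions.
validated_targets = {
    "miR-A": ["GENE1", "GENE2", "GENE9"],
    "miR-B": ["GENE2", "GENE8"],
}
det = ["GENE1", "GENE2", "GENE3"]

edges = build_dem_det_edges(validated_targets, det)
nodes = {node for edge in edges for node in edge}

print(edges)  # [('miR-A', 'GENE1'), ('miR-A', 'GENE2'), ('miR-B', 'GENE2')]
print(len(nodes), "nodes,", len(edges), "edges")  # 4 nodes, 3 edges
```

Targets that are validated but not differentially expressed (GENE8 and GENE9 here) drop out, so the network contains only DEM-DET pairs.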

In detail, the NwCRC network comprised 713 nodes (21 DEM and 692 target DET) and 1,669 edges (**Supplemental Table 2**), with two DEM having ≥ 200 directed edges (hsa-let-7f-5p and hsa-miR-98-5p) and five DEM having between 100 and 199 directed edges (hsa-miR-193b-3p, hsa-miR-29b-3p, hsa-miR-125a-5p, hsa-miR-22-5p, and hsa-miR-374b-5p). Among the modulated genes, BTG2 and SON were the targets of 10 DEM, and 33 other DET interacted with more than five DEM. In the interaction network of Ob subjects, 808 nodes (20 DEM and 788 DET) and 1,759 edges were found (**Supplemental Table 2**), with hsa-miR-34a-5p having 420 directed edges and six DEM having over 100 directed edges (hsa-let-7c-5p, hsa-miR-24-3p, hsa-miR-193b-3p, hsa-miR-185-5p, hsa-miR-181c-5p, hsa-miR-125a-5p). SON was the target gene of 10 DEM, and 33 DET interacted with more than five DEM. In ObCRC subjects, 1,056 nodes (37 DEM and 1,019 DET) and 3,449 edges were found (**Supplemental Table 2**). hsa-miR-34a-5p and hsa-miR-107 had 464 and 357 targets, respectively; four DEM had over 200 directed edges (hsa-miR-92a-3p, hsa-miR-24-3p, hsa-miR-98-5p, hsa-miR-30c-5p), and seven DEM had between 100 and 199 directed edges (hsa-miR-10b-5p, hsa-miR-193b-3p, hsa-miR-22-5p, hsa-miR-125a-5p, hsa-miR-185-5p, hsa-miR-181c-5p, hsa-miR-181d-5p). The top interacting DET was again SON, and 19 other DET had more than 10 directed edges.
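The edge counts reported above amount to tallying, for each miRNA, how many DET it targets, and, for each DET, how many DEM target it. A minimal sketch of that tally is given below, assuming the network is available as a list of (miRNA, gene) pairs. The thresholds mirror those in the text, while the function name and the toy data are hypothetical.

```python
from collections import Counter


def degree_summary(edges, high=200, mid=100):
    """Bin miRNAs by number of directed edges and count regulators per target."""
    out_degree = Counter(mirna for mirna, _ in edges)
    in_degree = Counter(gene for _, gene in edges)
    bins = {"high": [], "mid": [], "low": []}
    for mirna, k in sorted(out_degree.items()):
        if k >= high:
            bins["high"].append(mirna)   # >= 200 directed edges
        elif k >= mid:
            bins["mid"].append(mirna)    # between 100 and 199 directed edges
        else:
            bins["low"].append(mirna)
    return bins, in_degree


# Hypothetical toy network: three placeholder miRNAs with 250, 120, and 10 targets.
edges = (
    [("miR-A", f"GENE{i}") for i in range(250)]
    + [("miR-B", f"GENE{i}") for i in range(120)]
    + [("miR-C", f"GENE{i}") for i in range(10)]
)

bins, in_degree = degree_summary(edges)
print(bins)                # {'high': ['miR-A'], 'mid': ['miR-B'], 'low': ['miR-C']}
print(in_degree["GENE0"])  # 3
```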

### Identification of Target Genes and microRNAs Regulated by Differentially Expressed lncRNAs

In addition to the miRNA regulatory networks, the dysregulation of lncRNA expression was recently associated with obesity and CRC (27, 42). Therefore, we constructed lncRNA-mRNA regulatory networks for each category of subjects through an integrated analysis of the newly identified DEL and the previously described DET (28).

As shown in **Figure 3**, at least one experimentally validated interaction was found in the ENCORI database for only a subgroup of DEL. In particular, in NwCRC subjects, the up-regulated DEL SNHG16, AC109460.3, NUTM2A-AS1, and STAG3L5P-PVRIG2P-PILRB, as well as the down-regulated AP000317.2, were relevant hubs, each interacting with more than three DET (**Figure 3A**). In Ob subjects, the main nodes were the up-regulated SNHG16, AC109460.3, and MIR3142HG (**Figure 3B**). In ObCRC subjects, the down-regulated XIST interacted with 264 DET, while the up-regulated SNHG16 and AC109460.3 interacted with more than 10 DET. Three other DEL (LINC01184, STAG3L5P-PVRIG2P-PILRB, AP000317.2) had five or more directed edges (**Figure 3C**).

Since lncRNAs can bind to miRNAs to "communicate" with other RNA targets, as well as be reciprocally regulated by miRNAs (21), we then explored the ENCORI database for experimentally validated DEL-DEM interactions. As shown in **Figure 4**, DEL-DEM interaction networks in CRC patients, both lean and obese, displayed more interconnections than those in obese individuals not affected by CRC. In particular, 95 relationship pairs between 28 DEL and 19 DEM were found in NwCRC patients, with the DEL USP9Y and AC006504.5 interacting with 12 and 11 DEM, respectively; in addition, hsa-miR-664a-3p and hsa-miR-22-5p were the top interacting DEM, with 8 direct connections to DEL (**Figure 4A**). Likewise, in ObCRC subjects, 146 relationship pairs between 34 DEL and 28 DEM were found, with XIST and STAG3L5P-PVRIG2P-PILRB interacting with 23 and 11 DEM, respectively, and the top DEM hsa-miR-515-5p and hsa-miR-516b-5p interacting with 15 and 8 DEL, respectively (**Figure 4C**). Conversely, in Ob subjects only 37 relationship pairs between 16 DEL and 15 DEM were found, with AC021092.1 interacting with 5 DEM and hsa-miR-181d-5p and hsa-miR-181c-5p interacting with 5 DEL (**Figure 4B**).

numbers in the region of the overlapping circles indicate the ncRNAs that are expressed in two or more conditions. The complete list of the 13 ncRNAs shared by obesity and CRC is shown on the right.

### mRNA-miRNA-lncRNA Regulatory Networks

To identify novel key regulators of the transcriptional and post-transcriptional reprogramming of adipocytes under obesity and CRC conditions, integrated lncRNA-miRNA-mRNA networks were constructed for each condition by combining the miRNA/mRNA, lncRNA/miRNA, and lncRNA/mRNA interactions described above.

In this regard, stronger connectivity of an RNA node within a network is reported to reflect the importance of its biological function. Therefore, hub nodes with a degree exceeding 5 represent key players in biological networks (43). Based on this criterion, different numbers and distributions of hubs, according to RNA type, were identified in the three integrated networks. Specifically, we described 9 lncRNA, 20 miRNA, and 79 mRNA hubs in the NwCRC network; 3 lncRNA, 18 miRNA, and 70 mRNA hubs in the Ob network; and 10 lncRNA, 36 miRNA, and 308 mRNA hubs in the ObCRC network, consistent with the higher complexity already described for the ObCRC condition in terms of DEM-DET, DEL-DET, and DEL-DEM interactions. Due to the complexity of the networks, only nodes with a degree of 6 or higher are shown in **Figure 5**, whereas the description of the results refers to the whole network. Focusing on ncRNAs, the most highly connected hubs in the NwCRC network were let-7f-5p, miR-98-5p, miR-193b-3p, and miR-29b-3p, while SNHG16 and NUTM2A-AS1 had higher degrees than the other lncRNAs (**Figure 5A**). In the Ob network, miR-34a-5p, let-7c-5p, miR-24-3p, miR-193b-3p, and SNHG16, along with the novel lncRNA AC109460.3, were the most highly connected ncRNA hubs (**Figure 5B**). The predominant nodes in the ObCRC network were miR-34a-5p, miR-107, miR-92a-3p, and miR-24-3p, while the lncRNA XIST was the main interactor in the network (**Figure 5C**).
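The hub criterion described above (degree exceeding 5) can be illustrated with a small sketch. It assumes the three interaction types are available as edge lists and treats the integrated network as undirected for the purpose of counting degree; that treatment, the function names, and the toy data are our assumptions rather than details given in the text.

```python
from collections import defaultdict


def integrate(*edge_lists):
    """Merge edge lists into one undirected adjacency map (duplicate edges collapse)."""
    adjacency = defaultdict(set)
    for edges in edge_lists:
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def find_hubs(adjacency, min_degree=6):
    """Return nodes whose degree exceeds 5, i.e. is at least `min_degree`."""
    return {node for node, neighbours in adjacency.items() if len(neighbours) >= min_degree}


# Hypothetical toy edge lists with placeholder names.
mirna_mrna = [("miR-A", f"GENE{i}") for i in range(6)]
lncrna_mirna = [("lnc-X", "miR-A")]
lncrna_mrna = [("lnc-X", "GENE0"), ("lnc-X", "GENE1")]

network = integrate(mirna_mrna, lncrna_mirna, lncrna_mrna)
hubs = find_hubs(network)

print(len(network["miR-A"]))  # 7
print(hubs)                   # {'miR-A'}
```

Here miR-A becomes a hub only once its lncRNA partner is added to its six mRNA targets, which is the practical reason for integrating the three interaction types before ranking nodes.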

Searching for common key regulators, 23 genes were found to be pivotal nodes in all networks: two miRNAs (miR-193b-3p and miR-125a-5p), a known (SNHG16) and a novel (AC109460.3) lncRNA, and 19 mRNAs, indicating that these common elements and their interactors could be involved in relevant processes in obesity and CRC. Among the shared mRNA nodes, we found key players involved in the adipocyte transcriptional program (e.g., STAT3, RORA, CNOT1), in adipogenesis and lipogenesis (e.g., SEC31A, BMPR2), and in food intake and hypothalamic signaling (e.g., SON, PRRC2A, CUX1).

edge represents the interaction between genes. Only nodes with a number of directed edges ≥ 5 are shown (see Supplemental Table 2 for the extended network). Shades of green and red indicate, respectively, down- or up-regulated DEM/DET.

### Functional Enrichment Analysis of Networks-Related mRNA Targets

The biological function of a miRNA-lncRNA-mRNA network may be explained by the functions of the target mRNAs it includes. Thus, the target genes of DEM and/or DEL found in the interaction networks of each subject group were subjected to functional enrichment analysis combining different databases (KEGG, WikiPathways, and Reactome). The detailed list of terms, along with the genes involved in each term, is reported in **Supplemental Table 3**, and the results are summarized in **Figure 6**.
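The text does not state which statistical test underlies the enrichment analysis. A common choice for this kind of over-representation analysis is the one-sided hypergeometric test, sketched below purely as an illustration. The test, the function names, and the toy gene sets are assumptions and should not be read as the authors' method.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the pathway."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def enrich(targets, pathway, background):
    """Over-representation p-value of `pathway` genes among `targets`."""
    background = set(background)
    targets = set(targets) & background
    pathway = set(pathway) & background
    overlap = len(targets & pathway)
    return overlap, hypergeom_pvalue(overlap, len(targets), len(pathway), len(background))


# Hypothetical toy sets: 20 background genes, 5 in the pathway, 4 network targets.
background = [f"GENE{i}" for i in range(20)]
pathway = {"GENE0", "GENE1", "GENE2", "GENE3", "GENE4"}
targets = {"GENE0", "GENE1", "GENE2", "GENE10"}

overlap, p_value = enrich(targets, pathway, background)
print(overlap)            # 3
print(round(p_value, 4))  # 0.032
```

In practice, p-values computed this way across many terms would also need multiple-testing correction before terms are called enriched.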

In NwCRC patients, numerous pathway terms associated with metabolic processes (e.g., One-carbon metabolism, Purine metabolism, Cysteine/Methionine metabolism), lipid metabolism (e.g., Fatty acyl-CoA biosynthesis, AMPK, and SREBP signaling), and cancer (e.g., Signaling by FGFR1 in disease, Pathway in clear cell renal cell carcinoma) were obtained (**Figure 6A**). While the cancer pathways mainly featured up-regulated genes, the lipid metabolism pathways mainly included down-regulated genes. As expected, the obesity-associated network was also enriched in terms related to lipid metabolism (e.g., Cholesterol biosynthesis, Glycerophospholipid Biosynthetic Pathway). Further, in Ob individuals, we found enriched cancer pathways that were either shared with lean CRC subjects (e.g., Signaling by FGFR1, Pathway in clear cell renal cell carcinoma, Integrated Breast Cancer Pathway) or unique to the obese condition, such as a TP53-related pathway, all induced (**Figure 6B**). Finally, the ObCRC network (**Figure 6C**) was primarily enriched in fundamental biological functions implicated in inflammatory signaling (e.g., Platelet degranulation, TGF-beta signaling, IL-4 and IL-13 signaling), tumor suppression and insulin sensitivity (e.g., Regulation of PTEN gene transcription, Interleukin-37 signaling, Insulin resistance), along with categories related to metabolism (e.g., Pyruvate metabolism and Citric Acid cycle, AMPK signaling) and cancer (e.g., FGFR1 mutant receptor activation; signaling by VEGF). Interestingly, in contrast to what was observed for the Ob and NwCRC networks, the majority of enriched categories featured under-expressed genes in ObCRC patients, with the exception, among others, of pathways related to energy metabolism (e.g., mTOR signaling and AMPK signaling), to the growth factor EGF (e.g., EGF/EGFR signaling pathway), and to neuronal development (e.g., Netrin-1 signaling).

edge represents the interaction between genes. Shades of green and red indicate, respectively, down- or up-regulated DEL/DET.

Finally, pathways related to type I interferon signaling (e.g., Interferon type I signaling pathway, ISG15 antiviral mechanism, antiviral mechanisms by IFN-stimulated genes) are shared by obese and CRC networks. Furthermore, all the networks described showed dysregulated genes belonging to processes involved in RNA regulation (e.g., metabolism of RNA), endocytosis and vesicle-mediated transport (e.g., Membrane trafficking, Vesicle budding, Endocytosis, Extracellular matrix organization), and sumoylation (e.g., SUMO E3 ligases SUMOylate target proteins). Interestingly, in ObCRC patients we observed a predominant pathway repression state, again indicating that the interplay between obesity and CRC results in a specific modulation of the adipocyte transcriptional and post-transcriptional program.

### Validation by Using Real Time qPCR

The expression levels of pivotal transcripts were validated by RT-qPCR. Candidate transcripts were selected among the DEL and DEM found to be shared between cancer and obese conditions (e.g., LINC01106, LINC00968, SNHG16, miR-125a-5p, miR-193b-3p, miR-1247-5p), along with ncRNAs specific for CRC or obese subjects (e.g., XIST, H19, MINCR, miR-29b, miR-125b-1-3p, miR-181d-5p), on the basis of their relevance in the described regulatory networks. As shown in **Figure 7**, the lncRNAs belonging to all categories of subjects (obese and CRC affected) were found to be significantly modulated compared to healthy lean subjects. Specifically, LINC01106 was significantly up-modulated in Ob and ObCRC, while H19 was specifically down-modulated in NwCRC patients. We failed to observe a significant up-regulation of LINC00698, MINCR, and SNHG16 in NwCRC patients, although we confirmed their upregulation in the other subject groups (Ob and ObCRC for LINC00698, and ObCRC for MINCR and SNHG16), in agreement with the RNASeq analysis (**Figure 7A**). Overall, RNASeq and qPCR data displayed a significant positive correlation (Rho = 0.829; p < 0.0001). Similarly, in the case of miRNAs (**Figure 7B**), qPCR analysis confirmed the down-modulation of miR-125b-1-3p in all conditions and of miR-193b only in Ob and ObCRC, whereas the under-expression of miR-1247 and miR-125a-5p was validated in the Ob and ObCRC groups or in the ObCRC group only, respectively. We also observed an up-regulation of miR-181d-5p in both Ob and CRC-affected subjects, although RNASeq data showed its over-expression in Ob subjects only. In contrast to what was observed in the RNASeq analysis, no differential expression of miR-29b-3p was found in any group of subjects. Overall, although we did not achieve a complete correspondence between the miRNA expression data from the two techniques, qPCR and RNASeq results were significantly correlated (Rho = 0.6079; p = 0.0074).

DEM/DET/DEL. (A) NwCRC, (B) Ob, (C) ObCRC individuals in comparison to healthy lean subjects.
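The agreement between the two platforms is summarized in the text by a correlation coefficient reported as Rho, which suggests (but does not confirm) a Spearman rank correlation. The sketch below shows how such a coefficient could be computed from paired fold changes; the choice of Spearman, the function names, and the toy values are all assumptions, and no p-value is computed.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical log2 fold changes for five transcripts (not data from the study).
rnaseq_log2fc = [1.8, -2.1, 0.9, 2.5, -1.2]
qpcr_log2fc = [1.1, -1.5, 1.4, 2.0, -0.7]

rho = spearman_rho(rnaseq_log2fc, qpcr_log2fc)
print(round(rho, 3))  # 0.9
```

A rank-based coefficient is a natural fit here because RNASeq and qPCR fold changes are on different scales, so only the ordering of transcripts is compared.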

# DISCUSSION

The prevalence of obesity and obesity-associated diseases, including CRC, is steadily increasing and accounts for a large portion of public health challenges. These multifactorial and complex disorders are strongly interconnected, although the mechanisms underlying the higher susceptibility to cancer development and the poorer cancer prognosis in obese individuals are still a matter of debate. Different components of the AT microenvironment, such as chronic inflammation, vascularity and fibrosis, altered levels of sex hormones, and insulin resistance, are nowadays recognized as important determinants of CRC risk. Moreover, adipocytes release lipids that act as an energy reservoir for cancer cells, while the rapid expansion of AT in obesity produces hypoxia and promotes angiogenesis, favoring tumor spread (7, 44, 45). Recent findings in epigenetics have emphasized an important functional role of miRNAs, as well as of lncRNAs, in pathophysiological processes. Indeed, the dysregulation of these transcripts has been found in pathological conditions such as cancer and dysmetabolic disorders, including obesity. In AT, miRNAs regulate all aspects of adipocyte biology, including inflammation and adipokine production, metabolic responses, lipolysis and lipogenesis, adipogenesis, and browning (9, 46). Likewise, the total number of lncRNAs identified in AT and found to modulate adipose function is rapidly increasing (26, 47–49). Several studies have reported the involvement of lncRNAs in adipogenesis and lipid metabolism (27, 50), as well as in AT function and development in mouse models (51, 52). Nevertheless, their role in human adipocytes remains largely unknown. Likewise, no definitive conclusions regarding the molecular factors and the mechanistic processes underlying the relationships among obesity, AT dysfunction, and CRC have been reached so far. To the best of our knowledge, this is the first comparative study to perform an integrated multi-omic analysis on human visceral adipocytes to assess how obesity, alone or combined with CRC, affects miRNA and lncRNA expression and networks, as a potential mechanism linking obesity and CRC.

ObCRC patients compared to lean healthy subjects. Each node represents a significantly enriched KEGG/Wiki/Reactome term, the diameter being proportional to the significance. Shades of green and red indicate that the node features > 50% down-regulated or up-regulated genes, respectively. Gray nodes indicate terms with equal contribution of up- and down-regulated genes.

The expression of miRNAs in obese subjects relative to lean individuals has been previously investigated in both VAT and subcutaneous AT (SAT), the two main fat depots, which exhibit significant differences in anatomical, cellular, and molecular features (6, 53). Heterogeneity of subjects (fat depots, BMI) and of sample types (isolated adipocytes compared to adipose tissue), together with the use of different high-throughput techniques (arrays, RNA sequencing), has made it difficult to identify a specific "miRNA signature" altered in obesity (9). In this regard, differences in miRNA expression were observed when comparing visceral and subcutaneous fat (17, 54, 55), or isolated adipocytes and whole AT (56). In our study, we performed a comprehensive analysis of miRNAs in human adipocytes isolated from visceral fat. Among the miRNAs dysregulated in obese subjects compared to normal weight controls, we found those involved in adipogenesis (e.g., let-7 family, miR-193b, −483-5p), in lipid metabolism (e.g., miR-181d), or in glucose and insulin metabolism (e.g., miR-34a-5p, −24-3p, −144-5p, −361-3p), previously described in different AT depots of obese subjects (9, 57–59), further supporting a role of these miRNAs in the functional alterations of adipocytes occurring in obesity. Additionally, in obese adipocytes we also observed the dysregulation of miRNAs previously found to be involved in the regulation of immune response, adipokine secretion, and inflammation (e.g., miR-125a-5p, −181 family, −193b) (15, 60, 61) or implicated in many aspects of carcinogenesis in several cancer types, including CRC (e.g., miR-34a, let-7e-3p, −144-5p, −193b, −361-3p, −451a) (54, 62). Specifically, we found that miR-125a-5p and miR-193b-3p were downregulated in both obesity and CRC, in keeping with their previously reported down-regulation in VAT of obese subjects (63, 64), although contrasting results on miR-193b expression have been shown in human SAT (56, 64).
Notably, we have previously described an up-regulation of the target genes of miR-193b (i.e., CCL2) and miR-125a-5p (i.e., STAT3) as an important mechanism underlying obesity-associated inflammation (29, 65), in agreement with the literature (54, 56). Furthermore, we also report the characterization of 35 novel and 55 known lncRNAs in visceral adipocytes. An important property of lncRNAs is their cell- and tissue-specific expression (66). Therefore, the current annotation of lncRNAs is far from complete. Alterations in the expression of some lncRNAs, which act as important regulators of AT functions, have been reported in both SAT and VAT (26, 27, 67). In our study, we report the first analysis of lncRNAs in purified visceral adipocytes, and this could explain the discrepancies observed with previous studies, mainly conducted in whole AT (26, 27, 67). In general, we identified known and novel lncRNAs not previously described in other reports. Specifically, in obese subjects we found several lncRNAs (e.g., ZFAS1, LUCAT1, HIF1A-AS1, HOXB-AS3) already identified in the setting of different types of cancer, but not previously reported in human AT. Moreover, the lncRNA MIR3142HG, recently described as an important mediator of the inflammatory response in Idiopathic Pulmonary Lung Fibroblasts, where it positively regulates CXCL8 and CCL2 release (68), is specifically up-modulated in obesity. Notably, we previously reported an upregulation of both CCL2 and CXCL8 in adipocytes from Ob individuals (29), suggesting a role of this lncRNA in AT inflammation. Two other lncRNAs, SNHG16 and LINC01106, were found to be upregulated in obesity, and this modulation was shared between obese and cancer conditions.
In this regard, an abnormal expression of SNHG16 has been observed in multiple cancers and usually correlates with worse pathological features (69), while the novel lncRNA LINC01106 has recently been reported to be related to the overall survival of CRC patients by acting as an inflammatory mediator in inflammatory bowel disease (IBD)-related CRC. This lncRNA also showed an intimate interaction with miR-193a in epithelial tissue from IBD and CRC patients (70).

Despite the well-known link between AT-related inflammation and CRC development, no previous studies have considered the expression of ncRNAs in the AT of CRC patients. When overlapping the data from NwCRC, Ob, and ObCRC individuals, the down-regulation of miRNAs such as miR-193b-3p, miR-125a-5p, and miR-1247-5p was found to be shared between cancer and obese conditions. Interestingly, both miR-193b-3p and miR-1247-5p act as tumor suppressors in CRC or other types of cancer (71, 72), suggesting that their repression in AT from Ob and CRC individuals could have a protumorigenic role. Besides these common features, some ncRNAs are unique to tumor conditions. For instance, lncRNA H19, among others, was repressed only in NwCRC patients with respect to healthy controls. Interestingly, H19 has been described to play a role in obesity-induced cancer and to promote epithelial-mesenchymal transition in CRC, with a reported poor prognosis for cancer patients exhibiting H19 induction (73, 74). However, we observed the opposite expression pattern in AT compared to cancer cells, suggesting a different role of this lncRNA in visceral adipocytes that could potentially involve the H19 target genes STAT3 and SPARC (75, 76). Indeed, we and others previously reported a key role of STAT3 and SPARC in AT dysfunctions in both obese (28, 65, 77) and CRC conditions (28, 65). Similarly, the lncRNA XIST is highly down-modulated in the AT from the CRC group, although its up-regulation in CRC tissues and cell lines has been reported (75, 78, 79). Remarkably, XIST can act as an oncogene or a tumor suppressor depending on the malignancy (80) and was recently identified as a candidate mediator of glucose metabolism in glioma, contributing to cancer progression (81).

In this study, we not only identified specific lncRNAs and miRNAs across the adipocyte genome, but also described miRNA-lncRNA-mRNA interaction networks and performed a functional analysis of the pathways in which the target genes are involved. The target genes we identified in the networks were mainly enriched in pathways associated with metabolic processes, lipid and energy metabolism, inflammation, and cancer. Specifically, the SREBP pathway was remarkably inhibited in the NwCRC network, with implications not only for lipid metabolism but also for inflammation-mediated metabolic diseases and immune responses (82). Of note, the lncRNA SNHG16, which we identified as a main hub of this network, has been reported to modulate lipogenesis via regulation of SREBP2 expression (83) and to affect other genes involved in lipid metabolism (84). Another intriguing connection identified in the Ob network is the upregulated TP53 transcriptional regulation pathway. The activation of this pathway has been previously observed in obesity and correlated with the release of inflammatory cytokines fueling cancer initiation and progression (85), thus potentially setting the basis for a more tumor-prone AT microenvironment in obese subjects. Furthermore, p53 in human AT was shown to be involved in insulin resistance, adipogenesis, lipid metabolism, and nutrient sensing (86).

We also previously reported the influence of obesity on the adipocyte transcriptional program in CRC, with ObCRC subjects showing a higher number of dysregulated genes and processes (28). Likewise, in this study we observed a higher complexity of the ObCRC network in terms of lncRNA and miRNA profiles. Interestingly, we describe in ObCRC patients the deregulation of fundamental biological functions that are mainly implicated in inflammatory signaling pathways, such as IL-37 and IL-13 signaling. In this regard, an increased expression of the cytokine IL-13, contributing to AT inflammation, has been reported to play an important role in obesity-related colon carcinogenesis (87), while IL-37 signaling has been described to play an inhibitory role in innate immune responses. In fact, it acts by reducing systemic and local inflammation, and its expression in SAT was negatively correlated with BMI (88). Other enriched categories in the ObCRC network are: (i) TGF-beta signaling, which has been reported to regulate multiple aspects of AT biology (i.e., vascularization, inflammation, and fibrosis) (89); (ii) Netrin-1 signaling, recently described to play a role in tissue regulation outside the nervous system, specifically in tumor development (i.e., angiogenesis and inflammation); and (iii) PTEN regulation, for which a dual role as tumor suppressor and metabolic regulator has been reported (90). Finally, the networks described in all subject groups were enriched in: (i) type I IFN signaling, recently identified as essential in the regulation of metabolism and in maintaining AT function (91); (ii) SUMOylation, a post-translational modification mechanism that plays an emerging role in cellular metabolism and metabolic disease (92); and (iii) pathways involved in RNA metabolism, as expected. The identification of these pathways in both obese and cancer groups strongly points to local metabolic alterations in AT as a key element in colorectal carcinogenesis.

Additionally, pathways related to membrane trafficking, vesicle budding, and endocytosis were also found to be dysregulated in both obesity and CRC networks. In this regard, it is worth noting that, in addition to acting locally, adipocytes influence and communicate with distant organs and tissues by releasing bioactive molecules, such as triglycerides, adipokines, cytokines, and free fatty acids (93). This ability allows even tumors with no direct contact with AT to be affected by obesity, as indicated by epidemiological studies linking obesity with several types of cancer (94). Among the adipocyte products that could sustain cancer cell growth, circulating miRNAs, either naked or associated with exosomes, may regulate the function of the immune system and distant organs and could potentially be used as biomarkers for the diagnosis and prognosis of obesity and cancer (15). Likewise, exosomal lncRNAs have been shown to promote angiogenesis, cell proliferation, and drug resistance; being highly stable and detectable in several body fluids, they are considered potential tumor biomarkers (95).

In conclusion, the importance of understanding the role of lncRNAs and miRNAs in the AT of obese and CRC-affected subjects extends beyond the description of gene regulation mechanisms. The results obtained in this study, through a multi-omics approach and computational analysis, contribute to the identification of candidate genes, ncRNAs, and their regulatory networks relevant to many AT biological processes, although direct causality remains to be established and will require further experimental and functional studies. Nonetheless, the identification of AT miRNAs and lncRNAs as key components of interrelated processes and pathways may not only better define their role in human AT, but also identify promising mechanism-based targets to disrupt the relationship between obesity, metabolic dysregulation, and cancer, potentially improving intervention and treatment plans.

# DATA AVAILABILITY STATEMENT

The datasets generated for this study can be found in the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra) (PRJNA632999, PRJNA508473).

# ETHICS STATEMENT

The studies involving human participants were reviewed and approved by the institutional review board of Istituto Superiore di Sanità. The patients/participants provided their written informed consent to participate in this study.

# AUTHOR CONTRIBUTIONS

RV and BS isolated adipocytes from human visceral adipose tissue biopsies. AB prepared samples for RNA Sequencing and performed real-time qPCR for gene validation. ST, AM, EC, and PM performed bioinformatics and statistical analyses of RNASeq data. ST and AM provided intellectual input throughout the study. MD and SG provided substantial contributions to the conception of the work as well as interpretation of data and manuscript writing. All authors contributed to the article and approved the submitted version.

#### FUNDING

This work was supported by a grant of the Italian Association for Cancer Research (AIRC) (IG 2013 N14185) to SG.

#### REFERENCES


#### ACKNOWLEDGMENTS

We are indebted to Drs. R. Persiani, G. Silecchia, and A. Iacovelli for kindly providing clinical samples.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2020.01089/full#supplementary-material


experimentally supported miRNA-gene interactions. Nucleic Acids Res. (2018) 46:D239–45. doi: 10.1093/nar/gkx1141


obese white adipose tissue. J Clin Endocrinol Metab. (2014) 99:2821–33. doi: 10.1210/jc.2013-4259


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Tait, Baldassarre, Masotti, Calura, Martini, Varì, Scazzocchio, Gessani and Del Cornò. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Unraveling the Complexity of the Cancer Microenvironment With Multidimensional Genomic and Cytometric Technologies

#### Natasja L. de Vries<sup>1,2</sup>, Ahmed Mahfouz<sup>3,4,5</sup>, Frits Koning<sup>2</sup> and Noel F. C. C. de Miranda<sup>1</sup> \*

*<sup>1</sup> Pathology, Leiden University Medical Center, Leiden, Netherlands, <sup>2</sup> Immunohematology and Blood Transfusion, Leiden University Medical Center, Leiden, Netherlands, <sup>3</sup> Human Genetics, Leiden University Medical Center, Leiden, Netherlands, <sup>4</sup> Delft Bioinformatics Laboratory, Delft University of Technology, Delft, Netherlands, <sup>5</sup> Leiden Computational Biology Center, Leiden University Medical Center, Leiden, Netherlands*

#### Edited by:

*Francesca Finotello, Innsbruck Medical University, Austria*

#### Reviewed by:

*Itai Yanai, New York University, United States*

*Christina Stuelten, National Cancer Institute (NCI), United States*

*Pablo G. Camara, University of Pennsylvania, United States*

\*Correspondence:

*Noel F. C. C. de Miranda N.F.de\_Miranda@lumc.nl*

#### Specialty section:

*This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology*

Received: *01 April 2020* Accepted: *17 June 2020* Published: *23 July 2020*

#### Citation:

*de Vries NL, Mahfouz A, Koning F and de Miranda NFCC (2020) Unraveling the Complexity of the Cancer Microenvironment With Multidimensional Genomic and Cytometric Technologies. Front. Oncol. 10:1254. doi: 10.3389/fonc.2020.01254*

Cancers are characterized by extensive heterogeneity that occurs intratumorally, between lesions, and across patients. To study cancer as a complex biological system, multidimensional analyses of the tumor microenvironment are paramount. Single-cell technologies such as flow cytometry, mass cytometry, or single-cell RNA-sequencing have revolutionized our ability to characterize individual cells in great detail and, with that, shed light on the complexity of cancer microenvironments. However, a key limitation of these single-cell technologies is the lack of information on spatial context and multicellular interactions. Investigating spatial contexts of cells requires the incorporation of tissue-based techniques such as multiparameter immunofluorescence, imaging mass cytometry, or *in situ* detection of transcripts. In this Review, we describe the rise of multidimensional single-cell technologies and provide an overview of their strengths and weaknesses. In addition, we discuss the integration of transcriptomic, genomic, epigenomic, proteomic, and spatially-resolved data in the context of human cancers. Lastly, we will deliberate on how the integration of multi-omics data will help to shed light on the complex role of cell types present within the human tumor microenvironment, and how such system-wide approaches may pave the way toward more effective therapies for the treatment of cancer.

Keywords: cancer microenvironment, single-cell, data integration, multi-omics, mass cytometry, spatial analysis, immunophenotyping

### INTRODUCTION – HETEROGENEITY OF CANCER AND NEED FOR MULTIDIMENSIONAL APPROACHES

A genetic basis for cancer development was first proposed by the German zoologist Theodor Boveri, who speculated that malignant tumors might be the result of abnormal chromosome alterations in cells (1). At that time, a cancer cell-centric vision dominated, in which tumorigenesis was thought to be exclusively driven by multistep alterations in cellular genomes. During the last decades, however, it has become increasingly apparent that the study of cancers must also encompass other constituents of the cancer microenvironment, including immune cells, fibroblasts, and other stromal components, to capture all aspects of cancer biology (2). The immune system, for example, plays a dichotomous role in cancer development and progression, as different cells can antagonize or promote tumorigenesis (3). The mapping and understanding of the interplay between cancer cells and other constituents of the cancer microenvironment is thus fundamental for the clinical management of this disease.

The study of cancers as complex systems is further complicated by cancer heterogeneity that can occur at different levels: intratumorally, between lesions, and across patients. Intratumoral heterogeneity involves the near-stochastic generation of both genetic (e.g., mutations, chromosomal aberrations) and epigenetic (e.g., DNA methylation, chromatin remodeling, post-translational modification of histones) modifications. Within tumors, distinct niches can favor the outgrowth of different cancer cell clones that acquired characteristics compatible with regional microenvironments (e.g., nutrient and oxygen availability, exposure to immune cells). Other intrinsic sources of heterogeneity such as self-renewal of cancer cells and cell differentiation processes contribute further to intratumoral heterogeneity (4, 5). The immune system is also a major part of the tumor microenvironment and contains many different types of adaptive (e.g., CD4<sup>+</sup> and CD8<sup>+</sup> T lymphocytes) and innate (e.g., macrophages and innate lymphoid cells) immune cells that contribute to cancer heterogeneity (6). Their location within a tumor has been shown to significantly impact their anti- or pro-tumorigenic effects (7). In addition, the density of immune cell infiltration in tumors is a well-known determinant for the prognosis of cancer patients (8). Inter-lesional heterogeneity can be observed between multiple primary tumors at time of diagnosis, between a primary tumor and metastasis, and between different metastatic niches in individual patients. These differences can result from the outgrowth of subclones that can be (epi)genetically distinguished by mutations or structural variations (9). Moreover, the structure and composition of the cancer microenvironment can vary between the primary tumor and metastases. 
Upon extravasation, cancer cells from primary tumors are exposed to different types of immune cells, stromal cells, platelets, and metabolic stress, and have to adapt to the new tissue microenvironment. As such, the metastatic tissue ("soil") plays a critical role in regulating the growth of metastases ("seed") (10). Finally, interpatient heterogeneity is, on top of the aforementioned variables, also fueled by distinct germline genetic backgrounds and environmental and stochastic factors that can affect cancer progression but also immunity.

Major challenges in the field of cancer research are the identification of predictive biomarkers to select patients that are likely to respond to specific treatments, the detection of mechanisms of resistance to therapy, and the development of novel treatments to improve cancer survival. Here, we review the rise of cutting-edge multidimensional technologies such as spectral flow cytometry, multiparameter immunofluorescence, (imaging) mass cytometry, single-cell RNA-sequencing (scRNA-seq), and RNA spatial profiling that may play a crucial role in addressing these challenges. We will discuss how multi-omics data of dissociated cells as well as spatial data can be obtained (**Figure 1A**) and the importance of integrating them to reveal the full cellular landscape of the cancer microenvironment (**Figures 1B,C**). For example, single-cell data of dissociated cells can be used as a guide for cell type identification in spatial data (11) and, vice versa, spatial data can be used to predict the location of dissociated cells based on the similarity of their expression profiles to spatially-mapped data (12–14) (**Figure 1B**). In addition, mapping can be used to predict the spatial profile of genes or proteins that have not been experimentally measured, thereby expanding the coverage of spatial data (**Figure 1C**) (15–17).
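
The idea of predicting the location of dissociated cells from expression similarity can be illustrated with a minimal sketch. It assumes that each dissociated cell is assigned to the spatial spot whose profile (over a shared set of genes) has the highest Pearson correlation; the function names, the toy data, and the choice of correlation as similarity measure are illustrative assumptions and do not reproduce the methods of the cited studies.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def map_cells_to_spots(cells, spots):
    """Assign each dissociated cell the coordinates of its most similar spot."""
    assignment = {}
    for cell_id, profile in cells.items():
        best = max(spots, key=lambda s: pearson(profile, spots[s]["expr"]))
        assignment[cell_id] = spots[best]["xy"]
    return assignment

# Hypothetical toy data: expression of three shared genes.
spots = {
    "spot_a": {"xy": (0, 0), "expr": [9.0, 1.0, 0.5]},
    "spot_b": {"xy": (5, 2), "expr": [0.5, 8.0, 7.0]},
}
cells = {
    "cell_1": [7.5, 0.8, 1.0],
    "cell_2": [1.0, 6.0, 6.5],
}
predicted = map_cells_to_spots(cells, spots)
print(predicted)
```

The reverse direction (annotating spatial spots with cell types defined in dissociated data) would follow the same pattern with the roles of the two datasets swapped.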

#### MULTIDIMENSIONAL SINGLE-CELL TECHNOLOGIES AND THEIR STRENGTHS AND WEAKNESSES

#### Single-Cell DNA- and RNA-Sequencing

Next-generation sequencing (NGS) approaches have revolutionized our ability to generate high-throughput genomic data in which individual RNA and DNA molecules are represented by sequencing reads, thereby retaining information on genotypes, phenotypes, cellular states, and sub-clonal alterations. Traditional molecular profiling has, until recently, largely relied on the analysis of bulk cell populations. Deep sequencing of DNA and RNA from tissues enabled reconstruction of "average" genomes and "average" transcriptomes that could then be deconstructed by employing bioinformatic algorithms to perform clonal evolution analysis or determine the composition of cancer microenvironments (18–21). For an unbiased and systematic characterization of cells, high-throughput single-cell DNA- and RNA-sequencing have emerged as powerful tools. With single-cell DNA-sequencing, the genomic heterogeneity of tissues can be explored in detail. It can be used to detect nucleotide variations and chromosomal copy number alterations as well as more complex genomic rearrangements and cellular fractions. Single-cell genome sequencing involves whole-genome amplification of single cells, for which the three main methods are MDA (22), MALBAC (23), and DOP-PCR (24). In 2011, the first study of DNA-sequencing of human breast cancer single cells was published (25), which was followed by many single-cell studies charting genetic heterogeneity within individual tumors as well as between primary tumors and their metastases, thereby allowing for a detailed understanding of the evolution processes occurring in a tumor. Single-cell DNA-sequencing has myriad applications in cancer research, including examining intratumoral heterogeneity (26–28), investigating clonal evolution during tumorigenic processes (25, 29–32), tracing metastatic dissemination (33), genomic profiling of circulating tumor cells (34–36), measuring mutation rates (37), and gaining insight into resistance to therapy (38). 
By defining, in detail, the genetic composition of tumors, the rationalization of targeted cancer therapies is made possible. However, drawbacks of single-cell DNA-sequencing methods are non-uniform coverage and allelic dropout events as well as artifacts introduced during genomic amplification, all of which contribute to a high rate of false negative and false positive findings (39).

FIGURE 1 | Overview of the pipeline for the integration of single-cell data of dissociated cells and spatially-resolved data. (A) Single-cell data can be obtained by flow and mass cytometry that make use of antibodies coupled to fluorochromes or heavy metal isotopes, respectively, for the immunodetection of dissociated cells. For single-cell RNA-sequencing, antibodies coupled to oligonucleotides can be used to simultaneously retrieve information on protein and RNA expression of single cells. Spatially-resolved data can be obtained by multiplexed imaging or spatial transcriptomics by immunodetection of tissue sections with antibodies coupled to fluorochromes, heavy metal isotopes or oligonucleotides. Integration of single-cell data of dissociated cells with spatially-resolved data will reveal the full cellular landscape of the cancer microenvironment. (B,C) Integration approaches for single-cell data of dissociated cells and spatially-resolved data. Single-cell data of dissociated cells can be used as guide for cell type identification in spatial data and, vice versa, spatially-resolved data can be used to predict the location of dissociated cells based on the similarity of their expression profiles to spatially-mapped data (B). In addition, single-cell data can be used to predict the spatial profile of genes or proteins in the samples that have not been measured to expand the coverage of spatial data (C). Based on samples that have been measured (i.e., sample 1, 2, and 3), the expression of genes or proteins in sample 4, 5, and 6 can be predicted.

The first single-cell RNA-sequencing (scRNA-seq) experiment was published in 2009 by Tang and colleagues, who profiled the transcriptome of a single cell from early embryonic development (40). Rapid technological advances resulted in an exponential increase in the number of cells that can be studied by scRNA-seq analyses (41). Just 8 years later, 10x Genomics published a scRNA-seq dataset of more than one million individual cells from embryonic mouse brains (42). There are many different scRNA-seq library preparation platforms, which can be categorized into plate-based, droplet-based, and microwell-based (41). The selection of the method depends on the research question, the number of input cells, the sequencing depth, the need for full-length coverage of transcriptomes, and costs, among others [reviewed by (43, 44)]. ScRNA-seq has proven to be a powerful technique to decipher cancer biology. In 2012, Ramskold et al. applied scRNA-seq to study circulating tumor cells in melanoma, and could identify potential biomarkers for melanoma as well as SNPs and mutations in this relatively rare circulating tumor cell population (45). Thereafter, scRNA-seq has been used to study the microenvironment of several cancer types including prostate cancer (46), breast cancer (47), glioma (48–50), renal cancer (51), lung cancer (52), melanoma (53–56), colorectal cancer (57–59), pancreatic ductal adenocarcinoma (60), liver cancer (61), head and neck cancer (62), leukemia (63), and glioma (64). A pioneering study that applied scRNA-seq to primary glioblastomas uncovered inherent variability in oncogenic signaling, proliferation, immune responses, and regulators of stemness across cells sorted from five tumors (48). However, this study was restricted to cancer cells and did not further investigate other cell types of the cancer microenvironment. Subsequently, another scRNA-seq study examined distinct genotypic and phenotypic states of malignant, immune, stromal, and endothelial cells of melanomas from 19 patients (53). 
They identified cell states linked to resistance to targeted therapy, interactions between stromal factors and immune cell abundance, and potential biomarkers for distinguishing dysfunctional and cytotoxic T cells. A recent study in colorectal cancer broadened such scRNA-seq analysis by including a comparison of primary tumors to matched normal mucosa samples (58). By projecting their scRNA-seq data to a large reference panel, the authors identified distinct subtypes of cancer-associated fibroblasts and new expression signatures that were predictive of prognosis in colorectal cancer. Further, scRNA-seq has been applied to investigate changes in the tumor microenvironment of cancer patients treated with immune checkpoint blockade to find signatures associated with positive responses to this therapy (65, 66).

Currently, scRNA-seq can be combined with sequencing of T cell receptor (TCR) and immunoglobulin repertoires, making it possible to link the specificity of B and T cells to their phenotype. High-throughput single-cell B cell receptor (BCR) sequencing of more than 250,000 B cells from different species has recently been pioneered to obtain paired antibody heavy- and light-chain information at the single-cell level, and enabled rapid discovery of antigen-reactive antibody candidates (67). By a novel approach called RAGE-seq (Repertoire and Gene Expression by Sequencing), gene expression profiles can be paired with targeted full-length mRNA transcripts providing BCR and TCR sequences (68). This method has been applied to study cells from the primary tumor and tumor-associated lymph node of a breast cancer patient and demonstrated the ability to track clonally related lymphocytes across tissues and link TCR and BCR clonotypes with gene expression features (68). A limitation of scRNA-seq is that RNA levels are not fully representative of protein amounts. The advent of CITE-seq, REAP-seq, and Abseq overcame this limitation by enabling simultaneous detection of gene expression and protein levels in single cells by combining oligonucleotide-labeled antibodies against cell surface proteins with transcriptome profiling of thousands of single cells in parallel (69–71). When employed in a discovery setting, scRNA-seq can inform on the best markers to be used for the study of specific populations by complementary technologies such as flow or mass cytometry. However, aspects of sample preparation and handling have been shown to induce significant alterations in the transcriptome (72). Furthermore, throughput is limited by cost, protocol complexity, available sequencing depth, and dropout events. Together, these factors can affect downstream analyses such as clustering of cell populations and the inference of cellular relationships.

Computational analysis of scRNA-seq data is challenging and involves multiple steps, e.g., quality control, normalization, clustering, and identification of differentially expressed genes and/or trajectory inference. Multiple unsupervised clustering approaches are available to identify putative cell types, of which graph-based clustering is most widely used (73). For each of these steps, numerous computational tools are available, and software packages such as Seurat (16), scanpy (74), and SINCERA (75) implement the entire clustering workflow.
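
The first two steps of such a workflow, quality control and normalization, can be sketched in a few lines. The example below is a minimal illustration in plain Python, assuming a simple filter on the number of detected genes followed by library-size normalization and a log transform; the thresholds (`min_genes`, `target_sum`) and the toy counts are hypothetical, and dedicated packages such as those named above offer far more complete implementations.

```python
from math import log1p

def preprocess(counts, min_genes=2, target_sum=10_000):
    """Filter low-complexity cells, then library-size normalize and log-transform.

    counts maps a cell identifier to its list of raw counts per gene.
    """
    processed = {}
    for cell, row in counts.items():
        detected = sum(1 for c in row if c > 0)
        total = sum(row)
        if detected < min_genes or total == 0:
            continue  # quality control: drop cells with too few detected genes
        processed[cell] = [log1p(c / total * target_sum) for c in row]
    return processed

# Hypothetical raw counts for three cells and three genes.
raw = {
    "cell_1": [10, 0, 90],
    "cell_2": [0, 0, 3],   # only one detected gene, removed by the filter
    "cell_3": [5, 5, 0],
}
norm = preprocess(raw)
print(sorted(norm))
```

Clustering and differential expression would then operate on the normalized values rather than on raw counts.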

#### Single-Cell Epigenetic Characterization

Although most high-throughput profiling studies to date have focused on DNA, RNA, and protein expression, recent progress in studying the epigenetic regulation of gene expression, at single-cell level, has been made. Over the last decades, methods have been developed including ATAC-seq to measure chromatin accessibility (76), bisulfite sequencing to measure DNA methylation (77), ChIC-sequencing to measure histone modifications (78), and chromosome conformation capture (3C) to analyze the spatial organization of chromatin in a cell (79). Several studies revealed epigenetic programs that regulate T cell differentiation and dysfunction in tumors. Analysis of chromatin accessibility by ATAC-seq revealed that CD8<sup>+</sup> T cell dysfunction is accompanied with a broad remodeling of the enhancer landscape and transcription factor binding as compared to functional CD8<sup>+</sup> T cells in tumors (80–83). Also, an increased chromatin accessibility at the enhancer site of the PDCD1 gene (encoding for PD-1) has been found in the context of dysfunctional CD8<sup>+</sup> T cells (82). In addition, studies have applied epigenetics to determine mechanisms of resistance to cancer immunotherapies by characterizing chromatin regulators of intratumoral T cell dysfunction before and after PD-1, PD-L1, or CTLA-4 blockade therapy (84, 85). Lastly, DNA hypermethylation may result in the inactivation of genes, such as mismatch repair gene MLH1 associated with microsatellite instability in colorectal cancer (86). Until recently, studies on epigenetic modifications depended on correlations between bulk cell populations. Since 2013, with the development of single-cell technologies, epigenomic techniques have been modified for application to single cells to study cell-to-cell variability in for instance chromatin organization in hundreds or thousands of single cells simultaneously. 
Several single-cell epigenomic techniques have been reported recently, including measurements of DNA methylation patterns (scRRBS, scBS-seq, scWHBS) (87–89), chromatin accessibility (scATAC-seq) (90), chromosomal conformations (scHi-C) (91), and histone modifications (scChIC-seq) (92). A recent study applied scATAC-seq to characterize chromatin profiles of more than 200,000 single cells in peripheral blood and basal cell carcinoma. By analyzing tumor biopsies before and after PD-1 blockade therapy, Satpathy et al. could identify chromatin regulators of therapy-responsive T cell subsets at the level of individual genes and regulatory DNA elements in single cells (93). Interestingly, variability in histone modification patterns in single cells has also been studied by mass cytometry, an approach termed EpiTOF (94). In this way, Cheung et al. identified a variety of different cell-type and lineage-specific profiles of chromatin marks that could predict the identity of immune cells in humans. Lastly, scATAC-seq has been combined with scRNA-seq and CITE-seq analyses to find distinct and shared molecular mechanisms of leukemia (95). These single-cell strategies will help to clarify how the epigenome drives differentiation at the single-cell level and to unravel drivers of epigenetic states that could be used as targets for the treatment of cancer. Additionally, these methods may be used to measure genome structure in single cells to define the 3D structure of the genome. However, many of these single-cell epigenetic techniques have disadvantages, such as low coverage of regulatory regions such as enhancers (scRRBS), low coverage of sequencing reads (scChiP-seq, scATAC-seq), and low sequencing resolution (scHi-C) (96, 97).

#### Single-Cell Protein Measurements

Flow cytometry has been, in the past decades, the method of choice for high-throughput analysis of protein expression in single cells. The number of markers that can be simultaneously assayed was limited to ∼14 markers due to the broad emission spectra of the fluorescent dyes. Recent developments with spectral flow cytometer machines enable the detection of up to 34 markers in a single experiment by measuring the full spectra from each cell, which are unmixed by reference spectra of the fluorescent dyes and the autofluorescence spectrum (98). Fluorescence emission is registered by detectors consisting of avalanche photodiodes instead of photomultiplier tubes used in conventional flow cytometry. A variety of cellular features can be detected by flow cytometry including DNA and RNA content, cell cycle stage, detailed immunophenotypes, apoptotic states, activation of signaling pathways, and others [reviewed by (99)]. This technique has thus been paramount in characterizing cell types, revealing the existence of previously unrecognized cell subsets, and for the isolation of functionally distinct cell subsets for the characterization of tumors. However, the design of multiparameter flow cytometry antibody panels is a challenging and laborious task, and most flow cytometry studies have so far focused on the in-depth analysis of specific cellular lineages, instead of a broad and system-wide approach.
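
Spectral unmixing, as described above, can be viewed as a least-squares problem: the measured spectrum of a cell is modeled as a weighted sum of reference spectra. The sketch below assumes only two components (one dye and the autofluorescence spectrum) and solves the resulting 2×2 normal equations directly; the reference spectra and the function name are hypothetical, and instrument software may use different or more elaborate unmixing algorithms.

```python
def unmix_two(measured, ref_a, ref_b):
    """Least-squares abundances (a, b) such that measured ≈ a*ref_a + b*ref_b."""
    aa = sum(x * x for x in ref_a)
    bb = sum(y * y for y in ref_b)
    ab = sum(x * y for x, y in zip(ref_a, ref_b))
    am = sum(x * m for x, m in zip(ref_a, measured))
    bm = sum(y * m for y, m in zip(ref_b, measured))
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("reference spectra are collinear and cannot be unmixed")
    # Cramer's rule applied to the normal equations.
    return (am * bb - bm * ab) / det, (bm * aa - am * ab) / det

# Hypothetical reference spectra over four detector channels.
dye = [0.1, 0.6, 0.3, 0.0]
autofluor = [0.4, 0.3, 0.2, 0.1]

# Simulated cell: 2 units of dye signal on top of 5 units of autofluorescence.
measured = [2 * d + 5 * f for d, f in zip(dye, autofluor)]
dye_signal, af_signal = unmix_two(measured, dye, autofluor)
print(round(dye_signal, 6), round(af_signal, 6))
```

With dozens of fluorochromes the same idea applies, but the system is solved with a general linear-algebra routine rather than by hand.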

In 2009, the advent of a new cytometry methodology, mass cytometry (CyTOF, cytometry by time-of-flight), overcame the limitation of spectral overlap by using metal-isotope-conjugated antibodies to detect antigens (100). The metal isotopes attached to each cell are distinguished by mass and quantified in a quadrupole time-of-flight mass spectrometer. A mass cytometer is theoretically capable of detecting over 100 parameters per cell, but current chemical methods limit its use to ∼40–50 parameters, simultaneously. Mass cytometry has expanded the breadth of single-cell data in each experiment, making it highly suitable for systems-level analyses such as immunophenotyping of cancer microenvironments. By allowing the examination of large datasets at single-cell resolution, mass cytometry can be applied for the discovery of novel cell subsets as well as for the detection and identification of rare cells. Further advantages of mass cytometry are the irrelevance of autofluorescence, the low biological background as heavy metals are not naturally present in biological systems, and limited signal spillover between heavy metals, thereby reducing the complexity of panel design. Conversely, as compared to flow cytometry, mass cytometry suffers from a higher cell loss during acquisition, is more expensive, and is low-throughput, with a flow rate of up to 500 cells per sec as compared to thousands of cells per sec in flow cytometry. In addition, cells cannot be sorted for further analysis and forward- and side-scattered light is not detected.

Several studies have applied mass cytometry to further characterize immune cell profiles in peripheral blood or tissues from patients with breast cancer (101), renal cancer (102), melanoma (55, 56, 103–105), lung cancer (52, 106, 107), glioma (49, 50), colorectal cancer (57, 106, 108, 109), liver cancer (61, 110), ovarian cancer (111), and myeloma (112–115), among others. In addition to characterizing immune cell profiles of different tissue types, mass cytometry has also been used to characterize immunophenotypes in tumors and monitor changes during immunotherapy (56, 103–105, 114). In this way, factors that influence response to immunotherapy can be investigated and mechanisms at play during treatment can be characterized. This information can be used to understand and facilitate the identification and classification of responders vs. non-responders to cancer immunotherapy. Most of the studies so far have focused on the CTLA-4 and PD-1/PD-L1 axis of cancer immunotherapy, but novel immunotherapeutic targets such as the co-inhibitory molecules LAG-3 and TIM-3 or the co-stimulatory molecules OX40 and GITR are currently being explored in mouse models and clinical trials (116). Moreover, mass cytometry has been employed to study antigen-specific T cells with a multiplex MHC class I tetramer staining approach, which has led to the identification of phenotypes associated with tumor antigen-specific T cells (106). Most studies applied mass cytometry for measuring cell surface or intracellular markers, but it can also be used to evaluate cell signaling processes relying on the analysis of protein phosphorylation (117). Altogether, these studies showed that immune responses in cancer are extremely diverse, within tumors from individual patients as well as between patients with equivalent tumor types. 
Hence, finding clinically-relevant characteristics based on overall differences can be challenging because of inter-patient variability; differences between cancer patients can be so large that they compromise the discovery of biomarkers.

Because the number of potential phenotypes (resulting from the combination of different markers) increases exponentially with the number of antibodies being measured simultaneously, computational tools for the analysis and visualization of multidimensional data have become key in this field. Traditional workflows for analyzing flow cytometry datasets by manual gating are not efficient at capturing the phenotypic differences in mass cytometry and complex flow cytometry data and suffer from individual user bias. In addition, flow and mass cytometry datasets can easily contain millions of cells, illustrating the need for scalable clustering algorithms for efficient analysis. Current single-cell computational tools employed for complex flow cytometry and mass cytometry datasets include unsupervised clustering-based algorithms such as SPADE (118), Phenograph (119), and FlowSOM (120). However, these clustering-based tools do not provide single-cell resolution of the data. On the other hand, non-linear dimensionality reduction-based algorithms such as t-SNE (121) are widely used but limited by the number of cells that they can analyze simultaneously, resulting in down-sampled datasets and non-classified cells. Recently, a hierarchical approach of the t-SNE dimensionality-reduction-based technique, HSNE, was described to be scalable to tens of millions of cells (122, 123). In addition, a novel algorithm called uniform manifold approximation and projection (UMAP) has recently been implemented in the single-cell analysis field as a dimensionality reduction tool (124).
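
The exponential growth in potential phenotypes can be made concrete with a small calculation. The sketch below assumes, as a simplification, that each marker is scored as either positive or negative, so that a panel of n markers yields 2^n theoretical combinations; the panel sizes are the approximate numbers quoted earlier for conventional flow cytometry, spectral flow cytometry, and mass cytometry.

```python
def n_binary_phenotypes(n_markers):
    """Theoretical number of phenotypes if every marker is scored +/- ."""
    return 2 ** n_markers

# Panel sizes taken from the text: ~14 (conventional flow), 34 (spectral flow),
# and ~40 (mass cytometry).
phenotype_counts = {n: n_binary_phenotypes(n) for n in (14, 34, 40)}
for n_markers, n_phenotypes in phenotype_counts.items():
    print(f"{n_markers} markers -> {n_phenotypes:,} possible phenotypes")
```

Only a small fraction of these combinations occurs biologically, but the size of the search space illustrates why manual gating does not scale to such panels.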

#### Spatially-Resolved Data

Most of the multidimensional single-cell techniques such as flow cytometry, mass cytometry, and scRNA-seq require cellular dissociation to obtain cell suspensions prior to measuring the individual cells. Different dissociation methods are used, both mechanical and enzymatic, and may result in the loss of certain cell types and affect the expression of specific cell surface markers. Moreover, tissue specimens are often contaminated with blood or other tissues that are processed along with the tissue of interest. As such, not all subsets identified in single-cell data may be representative of the sample of interest. Another key limitation is the lack of information on spatial localization and cellular interactions within a tissue. Analysis of tissue sections by traditional immunohistochemistry (IHC)- and immunofluorescence-based methods is useful in providing spatial information (125), but is severely limited in the number of markers that can be measured simultaneously. Recent technological advances have greatly expanded the number of markers that can be captured on tissue slides, for instance by applying the principles of secondary ion mass spectrometry to image antibodies conjugated to heavy metal isotopes in tissue sections with imaging mass cytometry (IMC) (126) and multiplexed ion beam imaging by time-of-flight (MIBI-TOF) (127). In both imaging approaches, conventional IHC workflows are used but with metal-isotope-conjugated antibodies that are detected through a time-of-flight mass spectrometer. In IMC, a pulsed laser is used to ablate a tissue section by rasterizing over a selected region of interest. The liberated antibody-bound ions are subsequently introduced into the inductively coupled plasma time-of-flight mass spectrometer. IMC can currently image up to 40 proteins with a subcellular resolution of 1 µm. The principle of MIBI-TOF is similar, but it makes use of a time-of-flight mass spectrometer equipped with a duoplasmatron primary oxygen ion beam rather than a laser. 
It currently enables simultaneous imaging of 36 proteins at resolutions down to 260 nm (128). Both techniques are, however, low-throughput due to the relatively long imaging time of 2 h per field of 1 mm<sup>2</sup> in IMC and 1 h 12 min per field of 1 mm<sup>2</sup> in MIBI-TOF (129). IMC has been applied to study tumor heterogeneity in several types of cancers, such as pancreatic cancer (130), biliary tract cancer (131), breast cancer (126, 132, 133), and colorectal cancer (108, 134). MIBI-TOF has been used to study the tumor-immune microenvironment of breast cancer (127, 128, 135, 136) and the metabolic state of T cells in colorectal cancer (109). These spatially-resolved, single-cell analyses have great potential to characterize the spatial inter- and intratumoral phenotypic heterogeneity, which can guide cancer diagnosis, prognosis and the selection of treatment. A recent study was able to extend IMC data by integration with genomic characterization of breast tumors and could, in this way, investigate the effect of genomic alterations on multidimensional tumor phenotypes of breast cancer (137).

Other multiplexed imaging techniques such as the Digital Spatial Profiling (DSP) system from NanoString and co-detection by indexing (CODEX) make use of DNA oligonucleotides. In DSP, antibodies or probes are tagged with unique ultraviolet-photocleavable DNA oligos that are released after ultraviolet exposure in specific regions of interest (ROIs) and quantified (138). It enables simultaneous detection of up to 40 proteins or over 90 RNA targets from a tissue section and theoretically allows unlimited multiplexing using the NGS readout, but is time-consuming, does not allow for a reconstruction of the image, and has a lower resolution (10 µm) (129). In CODEX, antibodies conjugated to unique oligonucleotide sequences are detected in a cyclic manner by sequential primer extension with fluorescent dye-labeled nucleotides. CODEX currently allows the detection of over 50 markers with an automated fluidic setup platform including a three-color fluorescence microscope (139). Of note, throughput is limited by sequential detection of antibody binding. A disadvantage of CODEX, but also of IMC, is the lack of signal amplification, which hampers the detection of lowly abundant antigens. A novel imaging technique, called Immuno-SABER, overcame this limitation by implementing a signal amplification step using primer exchange reactions. Immuno-SABER makes use of multiple DNA-barcoded primary antibodies that are hybridized to orthogonal single-stranded DNA concatemers, generated via primer exchange reactions (140). These primer exchange reactions allow multiplexed signal amplification with rapid exchange cycles of fluorophore-bearing imager strands. The NanoString DSP platform has been used to study the tumor microenvironment and the outcome of various clinical trials of combination therapy for melanoma (141–144), interactions between macrophages and lymphocytes in metastatic uveal melanoma (145), immune cell subsets in lung cancer (129, 143), and tumor microenvironments of different metastases in prostate cancer (146). CODEX has been applied to study the immune tumor microenvironment of colorectal cancer patients with 56 protein markers simultaneously (147).

These multiplexed imaging techniques can be applied to snap-frozen as well as formalin-fixed, paraffin-embedded (FFPE) samples that are usually stored in clinical repositories. However, they raise new analysis challenges such as the visualization of 40 markers simultaneously, image segmentation for the delineation of cells, and algorithms for image-based expression profiles. To understand the tissue architecture, it is necessary to have prior knowledge on which cell types can be present and what their physical relationship to one another could be. Several computational approaches have been developed to enable data analysis of spatially-resolved multiplexed tissue measurements, including HistoCAT (148) and ImaCytE (149). These approaches apply cell segmentation masks [using a combination of Ilastik (150) and CellProfiler (151)] to extract single-cell data from each image, which allows for deep characterization using dimensionality reduction tools such as t-SNE combined with the assessment of spatial localization and cellular interactions. In addition to cell-based analysis, such imaging technologies also allow pixel-based analyses that do not depend on cell segmentation.
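
How a segmentation mask turns an image into single-cell data can be shown with a minimal sketch. It assumes that the mask assigns an integer label to every pixel (0 for background) and that a cell's expression of a marker is summarized as the mean pixel intensity within its label; the function name and the toy image are hypothetical, and the tools cited above provide many additional per-cell features.

```python
from collections import defaultdict

def per_cell_means(mask, image):
    """Mean marker intensity per segmented cell.

    mask[r][c] holds the cell label of a pixel (0 = background);
    image[r][c] holds the intensity of one marker channel at that pixel.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for mask_row, img_row in zip(mask, image):
        for label, value in zip(mask_row, img_row):
            if label != 0:
                sums[label] += value
                counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

# Hypothetical 3 x 3 image with two segmented cells (labels 1 and 2).
mask = [
    [1, 1, 0],
    [1, 0, 2],
    [0, 2, 2],
]
marker_channel = [
    [4.0, 6.0, 0.0],
    [5.0, 0.0, 1.0],
    [0.0, 2.0, 3.0],
]
cell_means = per_cell_means(mask, marker_channel)
print(cell_means)
```

Repeating this for every channel gives a cells-by-markers table that can be analyzed with the same clustering and dimensionality reduction tools used for suspension cytometry data.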

Integration of single-cell transcriptome profiles with their spatial position in tissue context can be achieved by labeling of DNA, RNA, or probes using in situ hybridization (ISH). Traditional ISH techniques have been improved to allow the detection of single molecules, named single-molecule fluorescence ISH (smFISH), which can be used to quantitate RNA transcripts at single-cell resolution within a tissue context (152, 153). However, only a small number of genes can be measured simultaneously and a main limitation is the lack of cellular resolution to hundreds of micrometers. To improve the throughput, several highly multiplexed methods of in situ RNA visualization have been developed such as osmFISH (154), sequential FISH [seqFISH (155) and seqFISH+ (156)] and error-robust FISH [MERFISH (157)]. These allow the subcellular detection of 100–10,000 transcripts simultaneously in single cells in situ by using sequential rounds of hybridization with temporal barcodes for each transcript, but require a high number of probes and are time-consuming. Furthermore, ISH can suffer from probe-specific noise due to sequence specificity and background binding. Another approach which may be more applicable for tumors is in situ RNA sequencing on tissue sections. STARmap (158) and FISSEQ (159) can profile a few hundred to thousands of transcripts by using enzymatic amplification methods, but at lower resolution and sensitivity compared to seqFISH and MERFISH. Spatial Transcriptomics (160) and Slide-seq (161) can profile whole transcriptomes by using spatially barcoded oligo-dT microarrays. The spatial transcriptomics method has been used to study and visualize the distribution of mRNAs within tissue sections of breast cancer (160, 162), metastatic melanoma (56, 163), prostate cancer (164), and pancreatic cancer (165). 
These studies highlight the potential of gene expression profiling of cancer tissue sections to reveal the complex transcriptional landscape in its spatial context to gain insight into tumor progression and therapy outcome.

# INTEGRATION OF TRANSCRIPTOMIC, (EPI)GENOMIC, PROTEOMIC, AND SPATIALLY-RESOLVED SINGLE-CELL DATA

Traditionally, each type of single-cell data has been considered independently to investigate a biological system. However, cancer is a spatially-organized system composed of many distinct cell types (**Figure 2A**). These different cell types including immune cells, stromal cells, and malignant cells can be visualized and investigated in an interactive manner (**Figure 2B**). By applying multi-omics to individual cells in the cancer microenvironment, the molecular landscape of every cell (44) can be defined with its proteome (proteins), transcriptome (RNA sequence), genome (DNA sequence), epigenome (DNA methylation, chromatin accessibility), and spatial localization (x, y, z-coordinates) (**Figure 2C**). Integrating these different molecular layers for each cell will allow a detailed profiling of cancer as a complex biological system (**Figure 2D**). Data integration approaches have classically been categorized in three groups: early (concatenation-based), intermediate (transformation-based), and late (model-based) stage integration (166). Early or intermediate stage integration approaches are more powerful than late stage integration since they can capture interactions between different molecular data types. However, such approaches are also more challenging methodologically given the different data distributions across data types.

A number of studies have used complementary forms of multidimensional analysis on the same sample type in the context of cancer. We searched the PubMed, Web of Science, and Embase databases for studies that have used mass cytometry in concert with scRNA-seq in the context of human cancer (**Supplementary Table 1**). An overview of the eight relevant studies that applied mass cytometry together with scRNA-seq to study human cancer and their integration stage is shown in **Table 1**. In addition, we searched the same databases for studies that applied single-cell mass cytometry in concert with spatially-resolved data obtained by IMC or MIBI-TOF in human cancer (**Supplementary Table 1**). An overview of the two relevant studies and their integration stage is shown in **Table 2**. Notably, the different multidimensional datasets in these studies were analyzed separately and follow a late (model-based) stage integration. Only Goveia and colleagues applied an integration of clustered mass cytometry and scRNA-seq data (107). They merged scaled average gene expression data for each scRNA-seq cluster with scaled average protein expression data for each CyTOF cluster, an approach based on a recently described method from Giordani et al. (167). As they integrated the data only after clustering each modality separately, it is still considered late stage integration.

Integrating multiple single-cell datasets is a challenging task because of the inherently high levels of noise and the large amount of missing data. Furthermore, the ever-expanding scale of single-cell experiments to millions of cells poses additional challenges. Several methods have been proposed to integrate multimodal single-cell data. State-of-the-art methods focus on embedding both spatial and standard datasets into a latent space using dimensionality reduction, such as Seurat (16), LIGER (17), and Harmony (168), or by employing factor analysis, such as MOFA (169), MOFA+ (170), scMerge (171), and scCoGAPS (172). In addition, a recent study introduced gimVI as a model for integrating spatial transcriptomics data with scRNA-seq data to impute missing gene expression measurements (15). Of note, most of the methods so far follow an intermediate or late integration approach (166). As such, these methods overcome challenges due to the different data distributions across data types, but they are less powerful in capturing interactions between different molecular data types.

FIGURE 2 | An integrated multicellular model of cancer. (A) From cells in a spatially-organized cancer microenvironment to (B) a three-dimensional view of individual cells. (C) From each individual cell in the cancer microenvironment, protein expression can be measured by single-cell protein analysis, RNA expression by single-cell RNA analysis, DNA and chromatin expression by single-cell (epi)genome analysis, and the x, y, z-coordinates with spatially-resolved analysis. (D) Integrating all four molecular layers for each cell will allow a detailed profiling from individual cell-to-cell interactions to whole tissue context.

Several methodologies have been developed to simultaneously acquire multiple measurements from the same cell (**Box 1**). Although obtaining simultaneous measurements from the same cell is becoming more feasible, it is still more common to perform subsequent measurements from the same sample (different sets of cells). Integrating spatial-based assays with mRNA or protein expression measurements can be beneficial for several reasons. For instance, spatial measurements are often limited in terms of the number of features they can assess simultaneously, although the latest generations of MERFISH and seqFISH(+) can measure around 10,000 transcripts per cell. By integrating these imaging techniques with scRNA-seq, the amount of biologically relevant information can be enhanced. Moncada et al. presented an integration of scRNA-seq with the spatial transcriptomics method generated from the same sample to study pancreatic cancer (165). A clear challenge when integrating spatial protein (e.g., IMC, MIBI-TOF, CODEX) with scRNA-seq data is the need to model relationships between mRNA and protein expression levels, thus adding an extra layer of complexity. The advent of CITE-seq, combining antibody-based detection of protein markers with transcriptome profiling, could be used to bridge this gap since it allows simultaneous measurement of both mRNA and surface protein marker expression. We foresee an important role for CITE-seq data in the integration of IMC, MIBI-TOF, and CODEX spatial data with scRNA-seq data. Recently, the integration of CITE-seq with CODEX as well as with IMC has been pioneered by Govek et al. (173).

TABLE 1 | Overview of studies applying mass cytometry together with single-cell RNA-sequencing to study human cancer heterogeneity.

*scRNA-seq, single-cell RNA-sequencing.*

TABLE 2 | Overview of studies applying mass cytometry together with imaging mass cytometry or MIBI-TOF to study human cancer heterogeneity.

*IMC, imaging mass cytometry; MIBI-TOF, multiplexed ion beam imaging by time-of-flight.*

#### POTENTIAL AVENUES OF HOW THE INTEGRATED DATA WILL HELP TO SHED LIGHT ON THE COMPLEX ROLE OF THE MICROENVIRONMENT IN CANCER

Cancer heterogeneity has long been recognized as a factor complicating the study and treatment of cancer but, until recently, it was difficult to account for in cancer research. The advent of multidimensional single-cell technologies has shed light on the tremendous cellular diversity that exists in cancer tissues and heterogeneity across patients. Moving forward, it will be important to work on the integration of available (spectral) flow cytometry, mass cytometry, scRNA-seq, and spatially-resolved datasets to investigate commonalities and differences in cellular landscapes between cancer tissues. Multiple flow and mass cytometry datasets can be matched if they include a shared marker set between panels, thereby extending the number of markers per cell and allowing meta-analysis of different mass cytometry datasets with a common core of markers (174). In addition, cell-type references from different single-cell datasets can improve the functional characterization of cells (175). Such a system-wide approach will improve insights into how different components of the cancer microenvironment interact in a tissue context. This requires an extensive collaboration between multi-disciplinary research fields such as oncology, immunology, pathology, and bioinformatics.

Nevertheless, the development and widespread use of innovative methodologies also implies the development of analytical tools for the interpretation of complex datasets and their standardization across laboratories. Furthermore, systems-level analyses challenge a researcher's capacity to reconnect findings to their biological relevance. Studies should focus on the removal of unwanted variation and experimental noise in high-throughput single-cell technologies as well as the development of cell-type references following the principles of the Human Cell Atlas (176) and the Allen Brain Atlas (177). We need to further develop algorithms to integrate data from different imaging and non-imaging single-cell technologies. Alternatively, technological developments should allow the acquisition of molecular profiles from single cells without the need to dissociate them from their tissue context. Lastly, it would be of great value to correlate multi-omics techniques with cell-to-cell signaling networks such as CellPhoneDB (178) and NicheNet (179). We expect that such integrated and comprehensive data can be used to create a multicellular model of cancer, from single cells to their tissue context, to understand and exploit cancer heterogeneity for improved precision medicine for cancer patients.

BOX 1 | Methods for the integration of transcriptomic, (epi)genomic, and proteomic single-cell data.

The analysis of protein expression has been extended to include transcript measurements at the single-cell level. CITE-seq (69), REAP-seq (70), and PLAYR (180) can be used to detect mRNA and protein levels simultaneously in single cells. In CITE-seq and REAP-seq, oligonucleotide-labeled antibodies are used to integrate cellular protein and transcriptome measurements. In PLAYR, mass spectrometry is used to simultaneously analyze the transcriptome and protein expression levels. The analysis of mRNA expression and methylation status in single cells can be achieved by scM&T-seq (181). In addition, mRNA expression and chromatin accessibility of single cells can be analyzed by sci-CAR (182), SNARE-seq (183), and Paired-seq (184). Chromatin organization and DNA methylation from a single nucleus can jointly be profiled by snm3C-seq (185). DR-seq (186) and G&T-seq (187) can assay genomic DNA and mRNA expression simultaneously in single cells, allowing correlations between genomic aberrations and transcriptional levels. Moreover, recent studies have reported on the development of single-cell triple-omics sequencing techniques, such as scTrio-seq (188) and scNMT-seq (189). In scTrio-seq, the transcriptome, genome, and DNA methylome of individual cells are jointly captured. Lastly, scNMT-seq jointly profiles transcription, DNA methylation, and chromatin accessibility, allowing for a thorough investigation of different epigenomic layers with transcriptional status.

How will such system-wide approaches contribute toward more effective therapies for the treatment of cancer? With the advent of targeted therapy and immunotherapy, remarkable advances have been made that changed the management of oncologic treatment for a significant number of patients. However, still only a minority of cancer patients benefit from these therapies, and resistance to treatment remains a major complication in the clinical management of advanced cancer patients. Integrated multi-omics data can help to improve our understanding of the variability in treatment response and resistance mechanisms. By linking detailed molecular and immunological profiles of cells in the cancer microenvironment with sensitivity to specific therapies, potential targets for cancer treatments and associated biomarkers can be identified. This would also support a rational selection of patients that are most likely to benefit from specific treatments. Furthermore, integrated multi-omics data has the potential to guide the development of alternative therapies, for instance through the identification of resistance mechanisms. We expect that such system-wide approaches, with technologies that include spatial information, will become standard methodologies in cancer research in the coming years.

#### AUTHOR CONTRIBUTIONS

NV and AM performed the bibliographic research for the manuscript and designed the figures. NV, AM, FK, and NM jointly wrote the manuscript. All authors contributed to the article and approved the submitted version.

### FUNDING

The authors acknowledge funding from the European Commission under a MSCA-ITN award (675743: ISPIC), the KWF Bas Mulder Award UL (2015-7664), the ZonMw Veni grant (916171144), and the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Programme (grant agreement no. 852832).

#### ACKNOWLEDGMENTS

We thank J.W. Schoones from Walaeus library of Leiden University Medical Center for his help with developing the literature search strategies and M.E. Ijsselsteijn from the Department of Pathology of Leiden University Medical Center for providing imaging mass cytometry images of colorectal cancer.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2020.01254/full#supplementary-material

### REFERENCES

to profile histone modification. Nat Methods. (2019) 16:323–5. doi: 10.1038/s41592-019-0361-7


muscle-resident cell populations. Mol Cell. (2019) 74:609–21.e6. doi: 10.1016/j.molcel.2019.02.026


heterogeneity. Nat Methods. (2016) 13:229–32. doi: 10.1038/nmeth.3728


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor declared a past co-authorship with one of the authors NM.

Copyright © 2020 de Vries, Mahfouz, Koning and de Miranda. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multi-Omics Characterization of the 4T1 Murine Mammary Gland Tumor Model

Barbara Schrörs <sup>1</sup>, Sebastian Boegel <sup>2</sup>, Christian Albrecht <sup>1</sup>, Thomas Bukur <sup>1</sup>, Valesca Bukur <sup>1</sup>, Christoph Holtsträter <sup>1</sup>, Christoph Ritzel <sup>2</sup>, Katja Manninen <sup>1</sup>, Arbel D. Tadmor <sup>1</sup>, Mathias Vormehr <sup>2,3</sup>, Ugur Sahin <sup>1,4</sup> and Martin Löwer <sup>1</sup>\*

<sup>1</sup> TRON gGmbH - Translationale Onkologie an der Universitätsmedizin der Johannes Gutenberg-Universität Mainz Gemeinnützige GmbH, Mainz, Germany, <sup>2</sup> University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany, <sup>3</sup> BioNTech SE, Mainz, Germany, <sup>4</sup> HI-TRON - Helmholtz-Institut für Translationale Onkologie Mainz, Mainz, Germany

#### Edited by:

Francesca Finotello, Innsbruck Medical University, Austria

#### Reviewed by:

Christina Stuelten, National Cancer Institute (NCI), United States Timothy O'Donnell, Icahn School of Medicine at Mount Sinai, United States

\*Correspondence: Martin Löwer martin.loewer@tron-mainz.de

#### Specialty section:

This article was submitted to Cancer Genetics, a section of the journal Frontiers in Oncology

Received: 29 January 2020 Accepted: 12 June 2020 Published: 23 July 2020

#### Citation:

Schrörs B, Boegel S, Albrecht C, Bukur T, Bukur V, Holtsträter C, Ritzel C, Manninen K, Tadmor AD, Vormehr M, Sahin U and Löwer M (2020) Multi-Omics Characterization of the 4T1 Murine Mammary Gland Tumor Model. Front. Oncol. 10:1195. doi: 10.3389/fonc.2020.01195

Background: Tumor models are critical for our understanding of cancer and the development of cancer therapeutics. The 4T1 murine mammary cancer cell line is one of the most widely used breast cancer models. Here, we present an integrated map of the genome, transcriptome, and immunome of 4T1.

Results: We found Trp53 (Tp53) and Pik3g to be mutated. Other frequently mutated genes in breast cancer, including Brca1 and Brca2, are not mutated. For cancer related genes, Nav3, Cenpf, Muc5Ac, Mpp7, Gas1, MageD2, Dusp1, Ros, Polr2a, Rragd, Ros1, and Hoxa9 are mutated. Markers for cell proliferation like Top2a, Birc5, and Mki67 are highly expressed, so are markers for metastasis like Msln, Ect2, and Plk1, which are known to be overexpressed in triple-negative breast cancer (TNBC). TNBC markers are, compared to a mammary gland control sample, lower (Esr1), comparably low (Erbb2), or not expressed at all (Pgr). We also found testis cancer antigen Pbk as well as colon/gastrointestinal cancer antigens Gpa33 and Epcam to be highly expressed. Major histocompatibility complex (MHC) class I is expressed, while MHC class II is not. We identified 505 single nucleotide variations (SNVs) and 20 insertions and deletions (indels). Neoantigens derived from 22 SNVs and one deletion elicited CD8<sup>+</sup> or CD4<sup>+</sup> T cell responses in IFNγ-ELISpot assays. Twelve high-confidence fusion genes were observed. We did not observe significant downregulation of mismatch repair (MMR) genes or SNVs/indels impairing their function, providing evidence for 6-thioguanine resistance. Effects of the integration of the murine mammary tumor virus were observed at the genome and transcriptome level.

Conclusions: 4T1 cells share substantial molecular features with human TNBC. As 4T1 is a common model for metastatic tumors, our data supports the rational design of mode-of-action studies for pre-clinical evaluation of targeted immunotherapies.

Keywords: immunotherapy, cancer models, computational immunology, triple negative breast cancer, 4T1 murine mammary gland tumor cell line

# INTRODUCTION

The translational value of pre-clinical cancer studies is dependent on the availability of model systems that mimic the situation in the patient. The murine mammary carcinoma cell line 4T1 is widely used as a syngeneic tumor model for human breast cancer [e.g., (1–3)], the tumor entity with the highest incidence worldwide<sup>1</sup>. This cell line was originally derived from a subpopulation of a spontaneously arising mammary tumor of a mouse mammary tumor virus (MMTV) positive BALB/c mouse foster nursed on a C3H mother (BALB/BfC3H) (4, 5). 4T1 can easily be transplanted into the mammary gland and has been described as poorly immunogenic, highly tumorigenic, invasive, and spontaneously metastasizing to distant organs (6). Thus, the location of the primary tumor and its metastatic spreading closely resemble the clinical course in patients. Moreover, 4T1 cells are used to specifically investigate triple-negative breast cancer (TNBC) [e.g., (7–9)], which lacks protein expression of estrogen receptor (ER), progesterone receptor (PgR), and epidermal growth factor receptor 2 (ErbB2) (10). This triple-negative phenotype is estimated to account for more than 17% of breast cancers that are annually diagnosed (11).

In spite of being such a widely used system, until now mainly phenotypic characteristics of 4T1 have been compared to human (triple-negative) breast cancer in the literature, while no comprehensive genomic, transcriptomic, and immunomic overview has been provided that would complement the evaluation of 4T1 as an adequate breast cancer or even TNBC model. In our study, we examined the 4T1 cell line from a multi-omic point of view to complete the picture.

# MATERIALS AND METHODS

#### Samples

BALB/cJ mice (Charles River) were kept in accordance with legal and ethical policies on animal research. The animal study was reviewed and approved by the federal authorities of Rhineland-Palatinate, Germany and all mice were kept in accordance with federal and state policies on animal research at the University of Mainz and BioNTech SE. Germline BALB/cJ DNA was extracted from mouse tail. 4T1 WT cells were purchased from ATCC. Third and fourth passages of cells were used for tumor experiments.

#### Data

ENCODE RNA sequencing data of adult BALB/c mammary gland tissue for differential expression analysis against 4T1 expression profiles was downloaded from the UCSC Genome Browser (12) repository:

wgEncodeCshlLongRnaSeqMamgAdult8wksFastqRd1Rep1.fastq.tgz

wgEncodeCshlLongRnaSeqMamgAdult8wksFastqRd1Rep2.fastq.tgz


Female BALB/c RNA-Seq data sets for the comparison of the MHC expression were described before (13) and are available in the European Nucleotide Archive (see Data Availability Statement).

# High-Throughput Sequencing and Read Alignment

Exome capture libraries from 4T1 cells and BALB/cJ mice were prepared in duplicate using the Agilent Sure-Select solution-based mouse protein coding exome capture assay. 4T1 oligo(dT)-isolated RNA for gene expression profiling was prepared in duplicate. Libraries were sequenced on an Illumina HiSeq2500 (2 × 50 nt). DNA-derived sequence reads were aligned to the mm9 genome using bwa [(14); default options, version 0.5.9\_r16]. Ambiguous reads mapping to multiple locations of the genome were removed. RNA-derived sequence reads were aligned to the mm9 genome using STAR [(15); default options, version 2.1.4a]. The sequencing reads are available in the European Nucleotide Archive (see Data Availability Statement).

#### Mutation Detection

Somatic SNV and short insertion/deletion (indel) calling was performed using Strelka [(16); default options for whole exome sequencing, version 2.0.14] on each cell line or normal library replicate pair individually. The individual analysis runs resulted in 1,115 and 1,108 SNV candidates, with an overlap of 886 SNVs (66%) and in 60 and 58 indel candidates, with an overlap of 50 (74%).
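The reported overlap percentages are consistent with dividing the shared calls by the union of both replicate call sets (a Jaccard index). The following minimal sketch reproduces that arithmetic from the counts given above; that the percentages were derived this way is an assumption, since the text does not state the formula.

```python
def overlap_fraction(n_first, n_second, n_shared):
    """Shared calls divided by the union of both call sets (Jaccard index).

    Assumption: this is how the replicate overlap percentages were derived;
    the source only reports the counts and the resulting percentages.
    """
    union = n_first + n_second - n_shared
    return n_shared / union


# Counts taken from the text: candidates per replicate pair and their overlap.
snv_overlap = overlap_fraction(1115, 1108, 886)
indel_overlap = overlap_fraction(60, 58, 50)

print(f"SNV overlap: {snv_overlap:.0%}, indel overlap: {indel_overlap:.0%}")
```
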

#### Transcriptome Profiling

Transcript abundance was estimated with kallisto [(17); default options, version 0.42.4] for each cell line or normal sample library replicate individually, and the mean transcripts per million (TPM) value per transcript across replicates was used as the final value. Differential expression analysis was performed with edgeR [(18); default options, version 3.26.8] using the transcript counts reported by kallisto, summarized to gene level by adding up the counts of the transcripts associated with each gene. The TPM values of the technical replicates have a Pearson's correlation coefficient of more than 0.99. Enriched pathways (KEGG 2019 Mouse<sup>2</sup>) and gene ontologies (GO Biological Process 2018<sup>3</sup>) in differentially up- or downregulated genes were determined using Enrichr (19). The associated Enrichr libraries were used as background lists for comparison with enrichment analysis in TNBC subtypes (20).
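The two summarization steps described in this paragraph (averaging TPM per transcript over replicates, and summing transcript counts per gene before differential expression analysis) can be sketched as follows. This is an illustration only: the transcript and gene identifiers are hypothetical, and the authors' actual scripts are not part of the text.

```python
from collections import defaultdict


def mean_tpm(replicate_tpms):
    """Average the TPM of each transcript across replicates.

    replicate_tpms: list of dicts mapping transcript ID -> TPM.
    """
    n = len(replicate_tpms)
    return {tx: sum(rep[tx] for rep in replicate_tpms) / n
            for tx in replicate_tpms[0]}


def gene_level_counts(transcript_counts, tx_to_gene):
    """Sum the estimated counts of all transcripts belonging to each gene."""
    totals = defaultdict(float)
    for tx, count in transcript_counts.items():
        totals[tx_to_gene[tx]] += count
    return dict(totals)


# Hypothetical identifiers and values, for illustration only.
tx_to_gene = {"tx1": "GeneA", "tx2": "GeneA", "tx3": "GeneB"}
rep1_tpm = {"tx1": 10.0, "tx2": 4.0, "tx3": 0.0}
rep2_tpm = {"tx1": 12.0, "tx2": 6.0, "tx3": 2.0}
rep1_counts = {"tx1": 100.0, "tx2": 40.5, "tx3": 0.0}

final_tpm = mean_tpm([rep1_tpm, rep2_tpm])
counts_per_gene = gene_level_counts(rep1_counts, tx_to_gene)
print(final_tpm)
print(counts_per_gene)
```
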

<sup>1</sup>http://gco.iarc.fr/today/data/factsheets/populations/900-world-fact-sheets.pdf

<sup>2</sup>https://amp.pharm.mssm.edu/Enrichr/geneSetLibrary?mode=text&libraryName=KEGG\_2019\_Mouse

<sup>3</sup>https://amp.pharm.mssm.edu/Enrichr/geneSetLibrary?mode=text&libraryName=GO\_Biological\_Process\_2018

Data from human TNBC studies (20–22) was obtained from the respective journal websites<sup>4,5,6</sup>. Data for mapping human and mouse gene symbols was obtained from the Jackson Laboratory<sup>7</sup>. TNBC and breast tissue short read data in fastq format was obtained from the short read archive (TNBC: accession number PRJNA607061, sample accession numbers are documented in **Table S7**).

TCGA BRCA expression values for ERBB2, ESR1, and PGR were obtained from the UCSC Xena browser (http://xena.ucsc.edu), using the "HTSeq FPKM-UQ" dataset. The clinical annotation including immunohistochemistry results was downloaded from the GDC Legacy site<sup>8</sup>. These tables were merged using the patient barcodes, keeping only patients with non-missing and non-inconclusive results for the immunohistochemistry status of "Her2", "Pr", and "Esr". This resulted in 808 data points. Principal component analysis was done in R with the "prcomp" function.
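The merge-and-filter step described above can be illustrated with a small sketch. The patient barcodes, expression values, and the status labels "Positive" and "Negative" are hypothetical placeholders; the real tables use TCGA barcodes and their own annotation vocabulary, and the subsequent principal component analysis was done in R according to the text.

```python
# Assumption: only these two labels count as conclusive IHC results.
CONCLUSIVE = {"Positive", "Negative"}
IHC_FIELDS = ("Her2", "Pr", "Esr")


def merge_expression_with_ihc(expression, clinical):
    """Join two tables on patient barcode and keep conclusive IHC rows only.

    expression: barcode -> {"ERBB2": float, "ESR1": float, "PGR": float}
    clinical:   barcode -> {"Her2": str, "Pr": str, "Esr": str}
    """
    merged = {}
    for barcode, expr in expression.items():
        ihc = clinical.get(barcode)
        if ihc is None:
            continue  # no clinical annotation for this patient
        if all(ihc.get(field) in CONCLUSIVE for field in IHC_FIELDS):
            merged[barcode] = {**expr, **ihc}
    return merged


# Hypothetical example data.
expression = {
    "PATIENT-01": {"ERBB2": 18.2, "ESR1": 12.1, "PGR": 9.7},
    "PATIENT-02": {"ERBB2": 21.5, "ESR1": 19.3, "PGR": 15.0},
    "PATIENT-03": {"ERBB2": 17.9, "ESR1": 11.8, "PGR": 8.9},
}
clinical = {
    "PATIENT-01": {"Her2": "Negative", "Pr": "Negative", "Esr": "Negative"},
    "PATIENT-02": {"Her2": "Equivocal", "Pr": "Positive", "Esr": "Positive"},
}

merged = merge_expression_with_ihc(expression, clinical)
print(sorted(merged))
```
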

#### Fusion Gene Detection

Fusion genes were detected with an in-house pipeline: We employed the "wisdom of crowds" approach (23), and applied four fusion detection tools, SOAPFuse, MapSplice2, InFusion and STARFusion (23–26) to two technical replicates of the 4T1 cell line. We used Ensembl GRCm38.95 as reference. SOAPFuse and STARFusion were run with default parameters, MapSplice2 was run with "–qual-scale phred33 –bam –seglen 20 –min-maplen 40" as additional parameters, and InFusion was run with "– skip-finished –min-unique-alignment-rate 0 –min-unique-splitreads 0 –allow-non-coding" as additional parameters. For run time improvement, we did a first manual pass of a STAR alignment to the mm10 reference genome and retained only nonmatching and chimeric reads for further processing by the four fusion detection tools. In order to combine the eight resulting datasets (four tools applied to two replicates) we first created the union of results of all four tools for each replicate, followed by the intersection of both independent runs (one per replicate cell line RNA library). This was considered as high confidence result set.
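The combination rule described above (union over the four tools within each replicate, then intersection across the two replicates) maps directly onto set operations. The sketch below uses hypothetical fusion labels and is not the in-house pipeline itself.

```python
def high_confidence_fusions(calls):
    """Union of tool results per replicate, intersected across replicates.

    calls: replicate name -> {tool name -> set of fusion labels}
    """
    per_replicate = [set().union(*tool_calls.values())
                     for tool_calls in calls.values()]
    return set.intersection(*per_replicate)


# Hypothetical fusion labels, for illustration only.
calls = {
    "replicate_1": {
        "SOAPFuse": {"GeneA--GeneB"},
        "MapSplice2": {"GeneA--GeneB", "GeneC--GeneD"},
        "InFusion": set(),
        "STARFusion": {"GeneE--GeneF"},
    },
    "replicate_2": {
        "SOAPFuse": {"GeneC--GeneD"},
        "MapSplice2": set(),
        "InFusion": {"GeneA--GeneB"},
        "STARFusion": {"GeneG--GeneH"},
    },
}

fusions = high_confidence_fusions(calls)
print(sorted(fusions))
```

With this rule, a fusion called by a single tool in each replicate is retained even if the two replicates were called by different tools, whereas a fusion seen in only one replicate is dropped regardless of how many tools report it.
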

#### DNA Copy Number Calling

Absolute copy numbers were detected from exome capture data using Control-FREEC [(27), version 11.5]. Control-FREEC was run multiple times with different ploidy input parameters (ploidy = x for values of x = 2, 3, 4, or 5) on the merged alignment files (merged with the "merge" command from samtools). In addition, the following non-default parameters were set: forceGCcontentNormalization = 1, intercept = 0, minCNAlength = 3, sex = XX, step = 0, uniqueMatch = TRUE, contaminationAdjustment = FALSE.
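As an illustration of the repeated runs, the sketch below generates one partial configuration per ploidy value from the parameters listed above. It assumes that all of these parameters belong to the `[general]` section of a Control-FREEC configuration file; entries that a real run also needs (for example chromosome length files, sample paths, and capture regions) are deliberately omitted because the text does not report them.

```python
import configparser
import io

# Non-default parameters as listed in the text.
FIXED_PARAMETERS = {
    "forceGCcontentNormalization": "1",
    "intercept": "0",
    "minCNAlength": "3",
    "sex": "XX",
    "step": "0",
    "uniqueMatch": "TRUE",
    "contaminationAdjustment": "FALSE",
}


def freec_general_section(ploidy):
    """Render a partial [general] section for one ploidy setting.

    Assumption: the section name and file layout follow the usual
    Control-FREEC config format; this is not a complete, runnable config.
    """
    cfg = configparser.ConfigParser()
    cfg.optionxform = str  # keep the camelCase parameter names unchanged
    cfg["general"] = {"ploidy": str(ploidy), **FIXED_PARAMETERS}
    buffer = io.StringIO()
    cfg.write(buffer)
    return buffer.getvalue()


# One configuration per ploidy value tried in the text.
configs = {ploidy: freec_general_section(ploidy) for ploidy in (2, 3, 4, 5)}
print(configs[5])
```
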


The CNV calls were processed with custom Python and R scripts: The output segment copy numbers were assigned to gene symbols by intersection with gene coordinates. Using the gene symbols, the previously detected SNVs were mapped to the copy numbers. Computed variant allele frequencies (VAF) from read alignments were then compared to the expected allele frequency distribution based on discrete copy numbers. For example, for a copy number of 3 (as predicted by Control-FREEC), one would expect SNV VAFs in associated genes clustered around values of 0.33 (one allele mutated), 0.66 (two alleles mutated), and 1 (three alleles mutated). The best match was manually determined for a Control-FREEC ploidy value of 5.
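The expected allele frequencies for a discrete copy number n are simply k/n for k = 1, ..., n mutated alleles. A minimal sketch of this expectation, together with a helper that snaps an observed VAF to the closest expected value, is given below; the helper is an illustrative addition, since the comparison in the text was done manually.

```python
def expected_vafs(copy_number):
    """Expected VAFs when 1..copy_number alleles carry the variant."""
    return [k / copy_number for k in range(1, copy_number + 1)]


def nearest_expected_vaf(observed_vaf, copy_number):
    """Return the expected VAF closest to an observed value (illustrative)."""
    return min(expected_vafs(copy_number),
               key=lambda expected: abs(expected - observed_vaf))


print(expected_vafs(3))
print(expected_vafs(4))
print(nearest_expected_vaf(0.48, 4))
```
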

### Transcript Assembly

RNA-Seq transcript assembly was done using Trinity [(28); default options, version r20140413p1]. Assembled transcript contigs were mapped to human transcript sequences and the MMTV genome (GenBank accession number NC\_001503.1) with blat (29).

#### MHC Typing

MHC type of the 4T1 cells was determined from RNA-Seq reads as described in Castle et al. (13).

#### MHC Expression

MHC expression was quantified using Sailfish [(30); default options, version 0.6.2] on an mm9 transcriptome index which represents C57BL/6 mice, combined with the expected BALB/cJ MHC sequences.

#### Mutation Signatures

Mutation signatures (31) were computed with the R package YAPSA (default settings, version 1.4.0).

#### Expression Profiling of Viral Genes

Virus genomes were downloaded in FASTA format from the NCBI Virus Genomes resource (32). Sequence reads were aligned using STAR [(15); version 2.5] to a combined reference genome containing murine genome sequences (mm9) and 7,807 virus genomes. We used a maximum mismatch ratio of 0.2, reporting ambiguous alignments only when the alignment scores matched the best alignment of the read.

For each of the virus accession numbers, the GenBank features "mRNA" and "CDS" were extracted from NCBI sources to create a virus gene database for expression analysis. Taxonomic information was extracted for filtering closely related viruses with lower read counts.

Viral gene expression was calculated using the built virus gene database and an in-house software as previously described (33). Any read overlapping a union model of all of a gene's isoforms was counted. All read counts were normalized to reads per kilobase of gene model per million mapped reads (RPKM) for all murine and viral genes.
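The RPKM normalization mentioned above follows the standard definition: read count scaled by gene model length in kilobases and by the number of mapped reads in millions. The sketch below shows the formula with made-up numbers; it does not reproduce the in-house software.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene model per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp union model, 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(value)
```
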

#### Neoantigen Selection for Immunogenicity Testing

The selection for the initial immunogenicity assessment was described earlier (34). For the subsequent testing of 11 additional 4T1-WT SNVs, the following stricter criteria were applied: (i) present in both replicates, (ii) hitting a transcript outside the untranslated region (UTR), (iii) resulting in a non-synonymous amino acid exchange (no stop gain or loss), (iv) mean expression in replicates > 0, (v) VAF in 4T1 DNA > 0, (vi) VAF in 4T1 RNA > 0.1, and (vii) VAF in RNA of an independent control mammary gland sample was 0. Indels were selected accordingly, but with a less stringent filter on the variant allele frequency in the tumor RNA (VAF\_in\_RNA > 0). Indels were subjected to confirmation via Sanger sequencing [performed as in (34)], which left two of the three pre-filtered indels for further experiments.

<sup>4</sup>https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4911051/bin/pone.0157368.s007.xlsx

<sup>5</sup>http://downloads.hindawi.com/journals/bmri/2018/2760918.f1.docx

<sup>6</sup>https://static-content.springer.com/esm/art%3A10.1186%2Fs13058-016-0690-8/MediaObjects/13058\_2016\_690\_MOESM1\_ESM.docx

<sup>7</sup>http://www.informatics.jax.org/downloads/reports/HOM\_MouseHumanSequence.rpt

<sup>8</sup>file URL: https://portal.gdc.cancer.gov/legacy-archive/files/735bc5ff-86d1-421a-8693-6e6f92055563
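The SNV selection criteria (i) to (vii) and the relaxed RNA VAF threshold for indels can be written as a single predicate. The record layout and field names below are hypothetical; only the thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class VariantCandidate:
    # Hypothetical record layout; field names are illustrative only.
    in_both_replicates: bool
    in_utr: bool
    non_synonymous: bool
    stop_gain_or_loss: bool
    mean_expression: float
    vaf_dna: float
    vaf_rna: float
    vaf_rna_control: float


def passes_selection(c, min_vaf_rna=0.1):
    """Apply criteria (i)-(vii); use min_vaf_rna=0.0 for indels."""
    return (c.in_both_replicates                                # (i)
            and not c.in_utr                                    # (ii)
            and c.non_synonymous and not c.stop_gain_or_loss    # (iii)
            and c.mean_expression > 0                           # (iv)
            and c.vaf_dna > 0                                   # (v)
            and c.vaf_rna > min_vaf_rna                         # (vi)
            and c.vaf_rna_control == 0)                         # (vii)


candidate = VariantCandidate(True, False, True, False, 12.3, 0.5, 0.05, 0.0)
print(passes_selection(candidate))                   # SNV threshold
print(passes_selection(candidate, min_vaf_rna=0.0))  # indel threshold
```
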

#### Immunogenicity Testing

The immunogenicity assessment of SNV-derived neoantigens was performed as described earlier (34). For the testing of indel-derived mutated peptides, mice (n = 3) were vaccinated with repetitive intravenous injections of 40 µg RNA lipoplexes (35) on days 0, 7, and 14. Five days after the last immunization, splenocytes of mice were tested for recognition of 15-mer peptides spanning the complete mutated sequence (11 amino acid overlap). T-cell responses were measured via IFN-γ enzyme-linked immunospot assay (ELISpot) as previously described (34). In brief, 5 × 10<sup>5</sup> splenocytes were stimulated overnight by addition of 2 µg/mL peptide at 37°C in anti-IFN-γ (10 µg/mL, clone AN18, Mabtech) coated Multiscreen 96-well plates (Millipore) and cytokine secretion was detected with an anti-IFN-γ antibody (1 µg/mL, clone R4-6A2, Mabtech). For subtyping of T-cell responses, CD8<sup>+</sup> T cells were isolated from splenocytes via magnetic-bead based cell separation [Miltenyi Biotech, CD8a (Ly-2) MicroBeads] according to the manufacturer's recommendations. CD8<sup>+</sup> T cell-depleted splenocytes served as a source for CD4<sup>+</sup> T cells. 1.5 × 10<sup>5</sup> isolated CD8<sup>+</sup> T cells and 5 × 10<sup>5</sup> cells derived from the CD4<sup>+</sup> T cell containing flow-through were restimulated in an IFN-γ ELISpot as described above. 1 × 10<sup>5</sup> syngeneic bone marrow derived-dendritic cells (34) served as antigen-presenting cells for CD8<sup>+</sup> T cells.
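Tiling a mutated sequence with 15-mers that overlap by 11 amino acids means advancing the window by 4 residues each time. The sketch below shows one way to generate such a peptide set; the input sequence is arbitrary, and the handling of a leftover C-terminal stretch (appending one final 15-mer) is an assumption not described in the text.

```python
def overlapping_peptides(sequence, length=15, overlap=11):
    """Tile a sequence with fixed-length peptides sharing `overlap` residues."""
    step = length - overlap
    if len(sequence) <= length:
        return [sequence]
    last_start = len(sequence) - length
    peptides = [sequence[i:i + length] for i in range(0, last_start + 1, step)]
    # Assumption: if the tiling does not end exactly at the C-terminus,
    # add one final peptide so that every residue is covered.
    if last_start % step != 0:
        peptides.append(sequence[-length:])
    return peptides


# Arbitrary 27-residue sequence, for illustration only.
sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGH"
peptides = overlapping_peptides(sequence)
for peptide in peptides:
    print(peptide)
```
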

# RESULTS

### The 4T1 Tumor Genome

Using whole exome and RNA-Seq data, we assessed genomic variation patterns by comparing 4T1 to BALB/c DNA, examining copy number aberrations, indels, SNVs, and gene fusions. Moreover, we determined absolute DNA copy numbers.

No reads mapped to the Y chromosome (DNA or RNA), which is expected as 4T1 originated from a female mouse. The analysis of the copy number profile revealed a median gene copy number of four, suggesting a tetraploid genome, although a sizable fraction of the genome seemed to be present in five copies (**Figure 1A**, second circle from the outside; **Table S1**). The findings were confirmed by a good agreement between the observed SNV allele frequencies and the allele frequency profile expected by the predicted gene copy number (e.g., for a copy number of four we expected SNV VAFs to be clustered around the values of 0.25, 0.5, 0.75, and 1). We observed known breast cancer oncogenes Akt1 and Sf3b1 (36) with focal amplifications (copy number six and seven, respectively), while pan-cancer oncogene Myc had a copy number of 11 (**Table S1**). Several known human tumor suppressor genes had a predicted copy number of less than four, with a possible functional impact (**Table S1**).

We identified 505 SNVs (**Table S2**, **Figure 1A**, outer circle, gray) and 20 short indels (**Table S3**, **Figure 1A**, outer circle, red) in transcripts, as well as 12 fusion events (**Table S4**, **Figure 1A**, middle). The majority of SNVs caused non-synonymous protein changes outside UTRs (264; 52%) including 248 missense and 16 non-sense variations (15 premature stops and one stop loss). Relative to the mouse genome (32 million protein-coding nucleotides), the 4T1 variation rate was 1.1 mutations per MB, which is within the range observed for human breast cancer (31). This number is an order of magnitude lower compared to the murine colon cancer model CT26, which suggests that CT26 is more likely to encode immunogenic epitopes than 4T1. The observed difference in the mutational load was in agreement with previous studies (37, 38), even though we detected a higher number of somatic mutations in both tumor models. We confirmed 45 of 47 (96%) and 193 of 246 (78%) previously reported SNVs in our data. Of the 264 non-synonymous SNVs, we found 91 (34%) mutations to be expressed (VAF > 0), which is comparable with a study in human TNBC that found ∼36% of mutations to be expressed (39). We have recently shown a high correlation between the DNA and RNA mutation allele frequencies in three murine tumor models (including 4T1) (13). Here, using updated methods for transcript quantification and mutation calling, we were able to reproduce these results (R<sup>2</sup> = 0.98, **Figure S1**), thus further corroborating that genes are equally transcribed from all alleles, mutated and wild-type (WT), in proportion to their DNA allele frequency.

Examining the mutational landscape in the 4T1 exome (**Figure 1B**), we found a higher prevalence of C>T, C>G, and C>A SNVs (**Figure S2**), which is in concordance with the somatic mutational signatures in human breast cancers (40). Interestingly, we found an overrepresentation of C>T transitions at XCG triplets (**Figure S2**; C is the mutated base, preceded by any nucleotide and followed by G), which is a known mutational mechanism due to deamination of methylated cytosines to thymine and has been observed in human breast cancers (41). C>T transitions showed the largest contribution to the mutational signatures in 4T1 and have been attributed to the activity of the APOBEC family of cytidine deaminases (42). Of note, Apobec3 has been found to provide partial protection in mice against infection with the oncogenic retrovirus MMTV (43), suggesting activation of this gene during MMTV infection and genome integration with subsequent cytosine deamination, resulting in the observed mutation pattern. The mutational signatures revealed a strong signal for signature AC3 (**Figure 1B**), which is associated with breast cancer and colloquially called "BRCAness," followed by signature AC1, which is associated with spontaneous deamination. In contrast, signature AC2 was not found at all (and therefore not shown in **Figure 1B**), which would further strengthen the potential connection to APOBEC cytidine deaminases, as described above.
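How an SNV is assigned to a substitution class and triplet context can be sketched as follows. This assumes the common convention of reporting each substitution relative to the pyrimidine (C or T) of the mutated base pair, so that a G>A change is counted as C>T on the opposite strand. The function names are illustrative and do not reproduce the tool used in the study.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_snv(triplet: str, alt: str) -> tuple[str, str]:
    """Return (substitution class, triplet context) in pyrimidine convention.

    `triplet` is the reference trinucleotide with the mutated base in the
    middle; `alt` is the alternative base on the same strand.
    """
    triplet, alt = triplet.upper(), alt.upper()
    if len(triplet) != 3 or len(alt) != 1 or alt == triplet[1]:
        raise ValueError("expected a reference triplet and a differing alt base")
    if triplet[1] in "AG":  # purine reference: describe the event on the other strand
        triplet, alt = revcomp(triplet), revcomp(alt)
    return f"{triplet[1]}>{alt}", triplet


def is_cpg_transition(triplet: str, alt: str) -> bool:
    """True for C>T changes at XCG triplets (C followed by G)."""
    substitution, context = classify_snv(triplet, alt)
    return substitution == "C>T" and context[2] == "G"


print(classify_snv("CGT", "A"))  # ('C>T', 'ACG')
```

Counting `classify_snv` results over all somatic SNVs yields a triplet-resolved substitution spectrum of the kind summarized in **Figure S2**.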

Of the most frequently mutated genes recently identified in breast cancer in general (41) and TNBC in particular (39) (Tp53, Pik3ca, Myc, Ccnd1, Pten, Erbb2, Znf703/Fgfr1 locus, Gata3, RB1, Map3k1, and Egfr), we only identified mutations in Trp53 (frameshift insertion of "A") and Pik3cg (synonymous SNV), which is the catalytic subunit of class I PI3 kinases (similar to Pik3ca). In addition, we did not find mutations in breast cancer susceptibility genes Brca1 and Brca2. Further mutations in cancer-related genes included Nav3 (V1129L), Cenpf (D1327E), Muc5ac (A429P), Mpp7 (Q158R), Gas1 (G326R), Maged2 (A473S), Dusp1 (C24R), Ros1 (W1875C), Polr2a (M1102I), Rragd (L385P), and Hoxa9 (insertion of "G" in UTR). Variations in immune-relevant genes included Tlr8 (R613H), Tlr9 (N332K), and Lilrb3 (S91R).

(red), with point size scaled by variant allele frequency; CNVs (second circle from the outside), log scaled, with gray lines marking CN = 5, 10, and 50; fusion genes (middle). (B) Mutation signature of 4T1 somatic SNVs. Signatures with a computed exposure value of 0% are not shown.

Using RNA-Seq data of 4T1 replicates, we identified 12 fusion events (**Table S4**), including a fusion of Siva1 and Gas8, one regulating cell cycle progression/proliferation and apoptosis, the other being a putative tumor suppressor gene. None of them have been reported before in breast cancer (44, 45).

#### MMTV Integration

MMTV is a milk-transmitted retrovirus that is oncogenic through integration into the host genome, thereby activating the expression of nearby genes (46). Multiple common insertion sites (CIS) have been identified and associated with candidate oncogenes and pathways involved in mammary tumorigenesis, including the Wnt and Fgf clusters (47, 48). A subset of CIS was significantly correlated with overexpression and deregulation of candidate oncogenes (49). We collected a set of 54 candidate genes for MMTV integration and compared their expression in 4T1 cells to that in normal mammary gland (**Figure S3**). About 68.5% of these genes showed significant down- or upregulation, while only about 42% of all genes of 4T1 cells were differentially expressed, suggesting MMTV integration as a possible cause. However, many of the 54 candidate genes are involved in oncogenic pathways, so it is not clear whether the observed differential expression is caused by the integration, potentially dysregulating a pathway, or is an effect of a pathway that was dysregulated in the first place.
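To give a feeling for how unusual the deregulated fraction among the candidate genes is, the following sketch computes a one-sided binomial tail probability. It is not a test reported in the study. The count of 37 deregulated candidates is derived here from 68.5% of 54, the background rate of 0.42 is taken from the text, and genes are treated as independent draws, which is a simplification.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


n_candidates = 54
n_deregulated = round(0.685 * n_candidates)  # 37, derived from the reported percentage
background_rate = 0.42                       # share of all 4T1 genes differentially expressed

p_value = binomial_tail(n_deregulated, n_candidates, background_rate)
print(f"{n_deregulated}/{n_candidates} deregulated, one-sided p = {p_value:.2e}")
```

A small tail probability would only indicate enrichment of deregulated genes in the candidate set; as noted above, it cannot distinguish integration effects from pathway-level dysregulation.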

Moreover, we had direct evidence from RNA-Seq based transcriptome assembly of an integration site 5′ to the Fgfr2 gene (**Figure S4**). A CIS near Fgfr2 was associated with an increased copy number and overexpression of Fibroblast growth factor receptor 2 (Fgfr2) (47). While we just observed a copy number of four, three of eleven isoforms were significantly overexpressed in 4T1. Fgfr2 is a transmembrane tyrosine kinase receptor and its activation triggers a complex signal transduction network (via e.g., Ras-Raf-Mapk or Pik3-Akt pathway), which leads to transcription of genes involved in angiogenesis, cell migration, proliferation, differentiation and survival. There is evidence of deregulated activation of FGFR signaling in the pathogenesis of human cancers (46). FGFR2 amplifications have been found in 10% of gastric cancers (50) and were also found in a subset of human TNBC patients (39, 51); FGFR2 amplifications are estimated to occur in ∼4% of TNBC samples, resulting in constitutive activation of FGFR2 (52). Increased expression of this gene is associated with poor overall survival and disease-free survival (53). This amplification is targetable with high sensitivity to FGFR inhibitors in vitro (52), an FGFR2-targeting antibody showed potent antitumor activity against human cancers in pre-clinical studies (54) and several FGFR tyrosine-kinase inhibitors are in clinical trials (54–56). However, the contribution of MMTV infection and initiation to human mammary carcinogenesis in general and FGFR2 amplification in particular is still highly debated (57). Of note, Notch4 and Krüppel-like factor 15 (Klf15) have been shown to be associated with MMTV CIS and although both genes are expressed in normal murine mammary gland, we do not find any isoform expressed in 4T1 possibly due to MMTV integration. 
Interestingly, while KLF15 has recently been proposed to be a tumor suppressor in breast cancer (58), so that silencing this transcription factor results in a fitness advantage for the tumor, Notch-4 is a potent breast oncogene that is overexpressed in TNBC (59), and Notch signaling is involved in mammary gland tumorigenesis (60).

#### The 4T1 Transcriptome

Differential expression analysis of 4T1 cell RNA expression vs. healthy mammary gland tissue RNA revealed 12,810 differentially expressed genes (FDR ≤ 5%, absolute log<sub>2</sub> fold-change >1) out of 29,955 total genes in mm9 (**Tables S5**, **S6**). This set of differentially expressed genes is very similar to differentially expressed genes in human breast cancer: we compared the gene sets of two studies comparing TNBC epithelium to adjacent microdissected stroma (21) and TNBC to non-TNBC cancers (22). These studies allowed a gene set enrichment test, yielding p-values of 2.2 × 10<sup>−16</sup> and 0.001002 (Fisher's exact test), respectively. Next, we compared pathways and gene ontologies (GO) that were significantly enriched (FDR ≤ 0.05, **Table S7**) in 4T1 differentially expressed genes to a study including different TNBC subtypes (20). Here, we only found significant overlap with top pathways and GO terms reported for subtype "Basal-like and immune suppressed (BLIS)" (p<sub>pathway</sub> = 0.04506 and p<sub>GO</sub> = 0.0142, Fisher's exact test). Furthermore, we analyzed RNA-Seq data of 57 TNBC breast cancer samples from the short read archive (PRJNA607061) and 66 breast tissue samples from the GTEx project (**Table S8**). All analysis steps were performed in analogy to the analysis of the 4T1 data. Here, we computed a p-value of 2.2 × 10<sup>−16</sup> with Fisher's exact test when comparing the sets of differentially expressed genes. Moreover, the mean gene expression in TNBC is well-correlated to the gene expression in 4T1, as demonstrated by a Pearson's correlation coefficient of 0.727 (**Figure S5**).
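The gene set comparisons above rest on Fisher's exact test applied to a 2 × 2 overlap table. The following is a minimal sketch of a one-sided (enrichment) version using only the Python standard library. The gene identifiers are placeholders, and the choice of a one-sided test and of the gene universe are assumptions, since the text does not specify them.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Returns the hypergeometric probability of observing an overlap of at
    least `a` given fixed row and column totals.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(total - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(total, col1)


def overlap_pvalue(set_a: set, set_b: set, universe: set) -> float:
    """Test whether two gene sets overlap more than expected within a universe."""
    set_a, set_b = set_a & universe, set_b & universe
    a = len(set_a & set_b)             # in both sets
    b = len(set_a - set_b)             # only in the first set
    c = len(set_b - set_a)             # only in the second set
    d = len(universe) - a - b - c      # in neither set
    return fisher_exact_greater(a, b, c, d)


# Toy example with placeholder gene names.
universe = {f"g{i}" for i in range(1, 9)}
mouse_de = {"g1", "g2", "g3", "g4"}
human_de = {"g1", "g2", "g3", "g5"}
print(round(overlap_pvalue(mouse_de, human_de, universe), 4))  # 0.2429
```

In a cross-species comparison, the universe would be restricted to genes with an ortholog in both species, because genes absent from one annotation cannot overlap by construction.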

**Figure 2** shows the expression of a selection of relevant genes discussed below. The murine homologs of the typical genes associated with TNBC are Esr1, Pgr, and Erbb2. While Esr1 was about 2-fold downregulated and Pgr showed zero expression, Erbb2 had a comparable expression in 4T1 vs. the non-cancer mammary gland samples (about 20 TPM). However, compared to the ERBB2 expression in the TCGA human breast cancer (BRCA) cohort, this value was on the lower end of the expression level spectrum [not shown<sup>9</sup> and (61)]. In order to investigate this detected mRNA expression, we compared the ERBB2, ESR1, and PGR mRNA expression in available TCGA breast cancer samples and grouped the expression values by the annotated result of the immunohistochemistry (IHC) assay. A principal component analysis (**Figure S6**) showed that mRNA expression can separate IHC positives from negatives (albeit not perfectly). The data also showed that a negative IHC result is not necessarily associated with zero mRNA expression (**Figure S7**). With copy numbers of five, the three genes also did not deviate from the general genomic copy number level. Moreover, genes Brca1 and Brca2 were highly overexpressed.

<sup>9</sup>http://gepia.cancer-pku.cn/detail.php?gene=ERBB2

4T1 is a widely used model for metastatic breast cancer (62) and consistently, we found known metastasis-associated genes such as the differentiation antigen Msln (mesothelin), Cdh1, Sema3e, Gjb3, and Ect2 to be overexpressed. The latter one is known to be a key factor in progression of breast cancer (63) as well as in metastasis, and high expression is associated with poor prognosis for TNBC patients (64, 65). Overexpression of mesothelin was shown to promote invasion and metastasis in breast cancer cells (66). Interestingly, we also found High-mobility group protein HMG-I/HMG-Y (Hmga1) and Hmga-related sequence 1 (Hmga1-rs1) to be upregulated in 4T1 cells. Hmga1 is involved in promoting metastatic processes in breast cancer (67) and it has also been found to stimulate retroviral integration (68). Hmga2 is a driver of tumor metastasis (69) and Igf2bp2 is a downstream target gene (70). Both genes were highly expressed in 4T1 cells. In addition, we found a 6-fold overexpression of Nephronectin (Npnt) in 4T1 compared to the normal murine breast samples examined, in which we detected only weak signals of this gene (22.4 vs. 3.6 TPM). Npnt plays a role in kidney development, is associated with embryonal precursors of the urogenital system (71) as well as with integrin expression (72). High expression levels of Npnt have been observed in human thyroid (median: 277 TPM), human blood vessels (e.g., aorta, 200 TPM), human lung (161 TPM) and to a much lesser extent in human mammary tissue (14 TPM)<sup>10</sup>. Furthermore, Npnt has been suggested to have a role in promoting metastasis, as decreased expression in 4T1 tumors significantly inhibited spontaneous metastasis to the lung (73), further indicating the highly metastatic phenotype of 4T1. In contrast, we found an extremely low expression of Gas1, which plays a role in growth suppression. Also, growth factor Vegfa and growth factor receptor Egfr were downregulated.

Other deregulated genes are also described as being cancer-related, including Srsf3, which has a proto-oncogenic function and is frequently upregulated in various types of cancer (74). FOXM1 is a proto-oncogene involved in regulating the expression of genes that are specific for the G2/M DNA damage checkpoint during cell cycle prior to mitosis. Foxm1 has been found overexpressed in a variety of solid tumors, including breast cancer (75) and indeed, we also observed a 9-fold increase in 4T1 cells. PLK1 is also involved in the G2/M transition, found to be significantly overexpressed in TNBC and targeting this gene has been described as a potential therapeutic option for TNBC patients (76). Tumor protein D52 (Tpd52) was 6-fold upregulated, which is consistent with reports showing high overexpression in many solid tumors and in particular breast cancer (77). Of note, we found the colon cancer antigen Gpa33 (78) to be highly expressed in 4T1 (143 TPM), not in normal murine breast (<1 TPM) and not in any other human non-cancer tissue except colon (median: 111 TPM) and small intestine (median: 75 TPM) (data from<sup>11</sup>).

<sup>10</sup>https://gtexportal.org/home/gene/NPNT (accessed January 9, 2020)

Among factors associated with a poor prognosis, proliferation markers Top2a, Mki67, and Birc5 (79–81) were highly expressed in 4T1, while almost absent in normal murine breast tissues. Pbk is also considered a marker for cellular proliferation (82) and is associated with poorer prognosis in lung cancer (83). Anln is highly expressed in breast cancer tissues (84) and a marker of poor prognosis in breast cancer (85) and indeed, we also found high expression of this gene in 4T1 (131.8 TPM). In addition, Pigf, which has been shown to enhance breast cancer motility (86), was overexpressed in 4T1 (42 TPM vs. 32.7 TPM). Genes related to metabolic regulation, such as Acly and Akt2, were downregulated. Polypeptide N-acetylgalactosaminyltransferase 3 (Galnt3) was upregulated in 4T1 and overexpression of this gene is associated with shorter progression-free survival in advanced ovarian cancer (87).

Moreover, Wnt7a and Wnt7b were upregulated in 4T1 cells, while other components of the Wnt/β-catenin pathway were downregulated (Wnt1, Wnt11, and Wnt5a). The role of Wnt10b in TNBC has been described before (88), indicating a direct effect on Hmga2 expression (see above). Furthermore, the gene Ezh2, known for its deregulatory activity of the Wnt pathway, was upregulated as well. Consequently, we found the Wnt target genes including the proto-oncogene Myc and the genes Ctnnb1, Ccnd1, and Fzd6 (Frizzled) to be upregulated (89).

As reported before (90), we found expression of the Murine Leukemia Virus (MuLV) gene coding for gp70, as well as of genes of the Murine osteosarcoma virus (NC\_001506.1) and (confirming the genomic findings on MMTV integration) of all MMTV genes (**Table S9**).

#### 6-Thioguanine Resistance

Due to the resistance to 6-thioguanine (6-TG) treatment, metastatic 4T1 cells can be precisely quantified even in distant organs (6). The cytotoxicity of 6-TG is based on the conversion of 6-TG into 2′-deoxy-6-thioguanosine triphosphate, which can be incorporated into DNA (91). Deficiency in mismatch repair (MMR), which is found in various cancer types (92), is associated with resistance to 6-TG (91). In 4T1, we observed significant downregulation of Pold4 only, but of none of the other MMR genes (Exo1, Lig1, Mlh1, Mlh3, Msh2, Msh3, Msh6, Pcna, Pcna-ps2, Pms2, Pold1, Pold2, Pold3, Rfc1, Rfc2, Rfc3, Rfc4, Rfc5, Rpa1, Rpa2, Rpa3, and Ssbp1; MSigDB: C2 curated gene sets, KEGG\_MISMATCH\_REPAIR, mouse orthologs obtained from<sup>12</sup>) at mRNA level (**Table S6**). Moreover, no non-synonymous SNVs or indels were detected in these genes, which might have impaired their function. In addition, mutational signatures AC6 and AC20 (associated with defective MMR) are present, but with relatively weak signals of about 5% and less (**Figure 1B**). Signatures AC15 and AC26 (also associated with defective MMR) are not detected. Diouf et al. (93) observed in human leukemia cells that MMR deficiency and thus an increased resistance to thiopurines can also result from a deregulated MSH2 degradation. While we again did not detect any mutations in the genes involved in regulating the stability of MSH2 (Mtor, Herc1, Prkcz, and Pik3c2b), we found Pik3c2b to be downregulated (**Table S6**). As the knockdown of PIK3C2B in human leukemia CCRF-CEM cells decreased sensitivity to 6-TG in comparison to control (93), lacking or reduced expression of Pik3c2b mRNA in 4T1 might explain the resistance to 6-TG treatment.

#### MHC Expression

The key players of the mammalian adaptive immune system are the major histocompatibility complex (MHC) molecules with the primary task to bind and present self, abnormal self, and foreign peptides derived from intracellular (MHC class I) or from extracellular proteins (MHC class II) on the surface of nucleated cells for recognition by T lymphocytes. Novel cancer immunotherapy concepts target tumor-specific antigens (either tumor-associated antigens or neo-epitopes) presented by MHC molecules of tumor cells. In general, non-cancer murine tissues show variable expression of MHC class I and class II, with lymphatic organs (i.e., lymph node, spleen) showing highest abundance of MHC transcripts and brain having the lowest MHC expression (**Figure 3**), which is in agreement with expression patterns of the human MHC system (94).

We confirmed that 4T1 cells have the same class I MHC haplotype as the parental BALB/c mice: H-2D<sup>d</sup>, H-2K<sup>d</sup>, and H-2L<sup>d</sup>. MHC class II could not be typed from RNA-Seq reads due to lack of expression. In 4T1, we found MHC class I and Ib loci to be expressed at comparable levels to normal (non-lymphatic) tissues (**Figure 3**, **Table S10**). In addition, β2 microglobulin (B2m), an essential component of the MHC class I complex, and members of the MHC class I antigen presenting pathway were expressed (**Figure S8**). This suggests that MHC class I antigen presentation is functional and thus 4T1 cells are capable of presenting peptides and neo-epitopes to T effector cells. In contrast, 4T1 cells expressed neither MHC class II nor the MHC class II master regulator and transcriptional coactivator Ciita [**Figure 3**, **Figure S8**; (95)]. Both findings suggest that 4T1 cells do not have functional MHC class II antigen presentation.

#### 4T1 Neoantigens

To investigate the mutations with regard to their potential to elicit immune responses in vivo, experiments in mice were conducted. In a previous study (34), we already examined 38 SNVs detected in the 4T1-luc2-tdtomato mammary carcinoma (4T1-Luc) cell line. Thirty-six of these were also present in the WT 4T1 cell line, 16 of which were immunogenic. Based on the subsequent re-analysis of WT 4T1, we selected eleven additional SNVs and two indels for immunogenicity assessment (**Figure 4A**). This selection was done by filtering the available set of potential neoantigens in order to enrich for likely immunogenic peptide sequences (see Methods). To this end, a vaccine for each of the newly selected 13 mutations was engineered using antigen-encoding pharmacologically optimized lipoplexed RNA as vaccine format. As before, SNVs were flanked by 13 amino acids of WT sequence, in-frame indel mutations were flanked by 15 amino acids of WT sequence and frameshift mutations were investigated covering 15 WT amino acids upstream of the mutations as well as the whole sequence of new amino acids until reaching a stop codon. Mice (n = 3–5) were immunized intravenously three times within a 2-week timeframe. IFN-γ ELISpot of splenocytes stimulated with overlapping 15-mer peptides covering the respective vaccinated sequence was performed 5 days after the last immunization. With this, we found immune responses against six additional SNVs and one deletion (see **Figure 4B** for the results on the indels, **Table S11** summarizes all immune responses). In total, we can thus report 22 SNVs and one deletion identified in 4T1 triggering immune responses in immunized mice. Of note, only four and 14 of these, respectively, were derived from SNVs already reported before (37, 38). For a subset of 15 SNVs the WT counterpart was tested, which revealed that 10 responses were clearly specific for the mutated sequence. As already observed (34), most of the reactivities were elicited by CD4<sup>+</sup> T cells (15 out of 21 analyzed mutations). Two SNVs were targeted by CD4<sup>+</sup> and CD8<sup>+</sup> T cells.

<sup>11</sup>https://gtexportal.org/home/gene/GPA33 (accessed January 9, 2020)

<sup>12</sup>http://bioinf.wehi.edu.au/software/MSigDB/
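The construct and peptide layout described above can be sketched as follows for the SNV case: the mutated residue is flanked by 13 WT amino acids on each side, and the resulting sequence is tiled with 15-mers overlapping by 11 residues. The protein sequence is a toy example, the function names are hypothetical, and aligning the last window with the C-terminus is an assumption about how the "complete mutated sequence" is covered.

```python
def snv_construct(protein: str, pos: int, alt: str, flank: int = 13) -> str:
    """Return the mutated residue with up to `flank` WT residues on each side.

    `pos` is the 0-based index of the mutated residue in `protein`.
    """
    start = max(0, pos - flank)
    end = min(len(protein), pos + flank + 1)
    return protein[start:pos] + alt + protein[pos + 1:end]


def tile_peptides(sequence: str, length: int = 15, overlap: int = 11) -> list[str]:
    """Split a sequence into overlapping peptides that cover it completely."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than peptide length")
    if len(sequence) <= length:
        return [sequence]
    last_start = len(sequence) - length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # assumption: final peptide ends at the C-terminus
    return [sequence[s:s + length] for s in starts]


toy_protein = "ACDEFGHIKLMNPQRSTVWY" * 2       # placeholder sequence, 40 residues
construct = snv_construct(toy_protein, pos=20, alt="W")
peptides = tile_peptides(construct)
print(len(construct), len(peptides))            # 27 4
```

With these settings every 15-mer of a 27-residue SNV construct contains the mutated position, so each peptide in the pool can report on the mutation.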

# CONCLUSION

The murine mammary cancer cell line 4T1 is one of the most often used model systems for breast cancer and in particular TNBC. Here, we could confirm that 4T1 indeed resembles metastatic TNBC at the transcriptional level with respect to key markers Esr1, Erbb2, and Pgr. In addition, compared to human TNBC data, we found good concordance on the level of differentially expressed genes and pathways and a reasonable correlation of raw expression values. The expression profile was in agreement with the metastatic phenotype of 4T1, as we found Msln, Ect2, and Plk1, and other genes associated with metastasis to be highly overexpressed in comparison to normal mammary gland. As described above, a number of genes involved in proliferation and survival were also deregulated. Moreover, it is known that the Wnt/β-catenin (Ctnnb1) pathway plays an important role in human breast cancers (96) with high activation rates and association with a poor prognosis (97). Some components of this pathway including Wnt target genes were upregulated in 4T1 cells. Overall, the observed profile reflected the complex interplay of various factors of tumorigenesis- and metastasis-driving signaling and allows for further mode-of-action investigation in the 4T1 tumor model.

On the mutation level, the raw numbers of mutations compare well against the CT26 colon cancer model. CT26 has 3,023 SNVs and 362 short indels, and in 4T1 we found an order of magnitude fewer variants (505 SNVs and 20 short indels). This relationship is similar to that observed for human colorectal and breast cancer (31) and supports previous findings (37, 38) as mentioned above. Differences in the absolute numbers in comparison to these reports might be due to genetic diversification of in vitro cell lines investigated at different laboratories at differing passage numbers (98) or different sequencing and mutation calling strategies.

Here and in a previous study (34), we determined in vivo immune responses against 22 SNVs (out of 49 tested, 45%) as well as one deletion (out of two indels tested) upon vaccination of BALB/c mice and 10 mutations (out of 15 immunogenic SNVs) showed mutation specificity. Although we did not examine all possible candidate neoantigens, the low mutational burden and the similarity to the basal-like and immune suppressed TNBC subtype suggest that 4T1 is a tumor model exhibiting relatively low immunogenicity. This is in agreement with others (37), while different studies argue the opposite, showing upregulation of many immune activation genes (38, 99) and thus immune cell infiltration in transplanted 4T1 tumors. Our 4T1 RNA-Seq data, however, was generated from the pure cell line. Accordingly, we could not see upregulation of immune-related genes. Nonetheless, 4T1 cells can secrete a plethora of inflammatory mediators and thereby modulate not only lymphocyte-mediated immune responses against the tumor, but also the innate microbial host defense (100–102). In future studies, the identified fusion transcripts might also be viable and interesting candidates for immunogenicity testing.

Besides the expression of MMTV at the RNA level and the deregulation of known genes with nearby insertion sites, we found direct evidence of MMTV integration near the gene Fgfr2. Combined with the relatively low mutational burden, we hypothesize that the MMTV infection and integration is the major genomic change that eventually causes the TNBC-like phenotype. Interestingly, despite no observed somatic mutations in Brca1 or Brca2, a "BRCAness" mutation signature could be found (**Figure 1B**, signature AC3).

A very recent publication (38) underlined the importance of profiling tumor models to appropriately translate preclinical findings. The genome, transcriptome, and immunome data presented here serve as a baseline for further studies, examining e.g., tumor-host interactions in terms of immunogenicity and TNBC in general. Although the data sources are highly heterogeneous (resulting from different studies and sequencing experiments), a distinct overlap between our qualitative and quantitative findings and studies on human TNBC can be found and confirms our approach. Together, our study supports the rational design of pre-clinical studies with an important and established tumor model.

#### DATA AVAILABILITY STATEMENT

The datasets generated for this study can be found in the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/home) with the study accession number PRJEB36287.

#### ETHICS STATEMENT

This animal study was reviewed and approved by the federal authorities of Rhineland-Palatinate, Germany and all mice were kept in accordance with federal and state policies on animal research at the University of Mainz and BioNTech SE.

# AUTHOR CONTRIBUTIONS

ML, BS, SB, and US contributed to the conception and design of the study. CA, VB, and KM were responsible for NGS sequencing. ML, BS, SB, TB, CH, and CR performed bioinformatics analyses. ML, BS, MV, and AT selected neoantigen candidates. MV planned and performed immunogenicity testing. ML, BS, and SB interpreted the data and wrote the first draft of the manuscript. CH, TB, and MV wrote sections of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

### FUNDING

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant Agreement No. 789256).

# ACKNOWLEDGMENTS

We thank John C. Castle, Sebastian Kreiter, and Mustafa Diken for discussions and advice, Luisa Bresadola and Jonas Ibn-Salem for proofreading the manuscript, and Karen Chu for project management support.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2020.01195/full#supplementary-material

Figure S1 | Comparison of DNA and RNA variant allele frequency (VAF) in 4T1 cells. The Pearson correlation coefficient is 0.977.

Figure S2 | Abundance of nucleotide substitutions in 4T1 cells with respect to nucleotide triplets.

Figure S3 | Differential expression of MMTV integration effector genes. Colored dots indicate differential expression in 4T1 vs. BALB/c mammary gland. Red gene labels indicate genes that are described as upregulated in the literature.

Figure S4 | Schematic view of proposed MMTV integration in Fgfr2 gene. Upper panel shows a UCSC Genome Browser view of an alignment of assembled sequence c75264\_g4\_i1 to the mm9 genome. The middle part shows the assembled sequence (blue) and the part mapping to Fgfr2 (red). Numbers indicate parts of the sequence mapping to Fgfr2 (red) and MMTV (green). The lower panel shows a schematic of Betaretrovirus genome, for which MMTV is a reference strain (taken from https://viralzone.expasy.org/66).

Figure S5 | Mean gene expression of TNBC plotted against mean gene expression in orthologous genes of 4T1. Counts per million (cpm) were computed by edgeR.

Figure S6 | Scatterplot of a principal component analysis of TCGA BRCA gene expression of genes ERBB2, ESR1, and PGR. Shown are the first two principal components (PC1 and PC2). Ellipses indicate normal-probability contours.

Figure S7 | Boxplot of TCGA BRCA gene expression of genes ERBB2, ESR1, and PGR, separated by TNBC status. Expression on y-axis is given as log<sub>2</sub>(FPKM+1) units.

Figure S8 | Gene expression of members of the MHC class I and II antigen presenting pathway in 4T1 and BALB/c mammary gland.

Table S1 | Raw Control-FREEC output (sheet 1) and predicted absolute gene copy numbers of 4T1 genes (sheet 2).

Table S2 | Somatic SNVs in 4T1, including annotation on amino acid substitutions, affected genes/transcripts, expression of these, and coverage/VAF in the DNA/RNA NGS libraries.

Table S3 | Somatic INDELs in 4T1, including annotation on frameshift, affected genes/transcripts and coverage/VAF in the DNA/RNA NGS libraries. A VAF of −1 means "not covered," while a VAF of 0 indicates coverage but absence of the variant allele.

Table S4 | Fusion genes in 4T1, including predicted positions of breakpoints, number of junction reads, and spanning read pairs and the program that detected a fusion.

Table S5 | Gene expression in 4T1 and BALB/c mammary gland in TPM.

Table S6 | Differential gene expression in 4T1 vs. BALB/c mammary gland, showing log fold change, FDR, and baseline expression values.

Table S7 | Gene set and pathway enrichment in differentially expressed genes of 4T1 cells for upregulated and downregulated genes in GO gene sets and KEGG pathways, respectively (sheets are labeled "up GO", "up KEGG", "down GO", and "down KEGG", respectively).


Table S8 | Differential gene expression in human TNBC vs. breast tissue, showing log fold change, FDR and baseline expression values.

Table S9 | Expression in RPKM of MMTV genes for two replicates of 4T1 RNA-Seq libraries.

Table S10 | Expression values in TPM of MHC genes in 4T1 and BALB/c tissues. The used reference sequences from the UCSC known genes or Genbank are also listed.

Table S11 | Results of immunogenicity testing, including details on mutation, amino acid substitution, the result of the ELISpot assay, the subtype of the T-cell response, and the specificity when compared to a WT control.


**Conflict of Interest:** MV and US are employed by the company BioNTech SE.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Schrörs, Boegel, Albrecht, Bukur, Bukur, Holtsträter, Ritzel, Manninen, Tadmor, Vormehr, Sahin and Löwer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.