# INTELLIGENT SYSTEMS FOR GENOME FUNCTIONAL ANNOTATIONS

EDITED BY : Shandar Ahmad, Michael Fernandez and Pedro Ballester PUBLISHED IN : Frontiers in Genetics and Frontiers in Bioengineering and Biotechnology

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-090-2 DOI 10.3389/978-2-88966-090-2

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# INTELLIGENT SYSTEMS FOR GENOME FUNCTIONAL ANNOTATIONS

Topic Editors: Shandar Ahmad, Jawaharlal Nehru University, India Michael Fernandez, The Vancouver Prostate Centre, Canada Pedro Ballester, INSERM U1068 Centre de Recherche en Cancérologie de Marseille (CRCM), France

Citation: Ahmad, S., Fernandez, M., Ballester, P., eds. (2020). Intelligent Systems for Genome Functional Annotations. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-090-2

# Table of Contents


Y-h. Taguchi and Turki Turki

*38 Identification and Analysis of Long Repeats of Proteins at the Domain Level*

David Mary Rajathei, Subbiah Parthasarathy and Samuel Selvaraj


Sina Ghadermarzi, Xingyi Li, Min Li and Lukasz Kurgan

*91* In silico *Metabolic Pathway Analysis Identifying Target Against Leishmaniasis – A Kinetic Modeling Approach* Nikita Bora and Anupam Nath Jha

# Editorial: Intelligent Systems for Genome Functional Annotations

#### Shandar Ahmad<sup>1</sup> \*, Pedro J. Ballester 2,3,4,5 and Michael Fernandez <sup>6</sup>

<sup>1</sup> School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India, <sup>2</sup> Cancer Research Center of Marseille, INSERM U1068, Marseille, France, <sup>3</sup> Institut Paoli-Calmettes, Marseille, France, <sup>4</sup> Aix-Marseille Université, Marseille, France, <sup>5</sup> CNRS UMR7258, Marseille, France, <sup>6</sup> Department of Urologic Sciences, Faculty of Medicine, Vancouver Prostate Centre, University of British Columbia, Vancouver, BC, Canada

Keywords: functional annotation, protein-protein interaction (PPI), machine learning, gene annotation, intelligent system applications

**Editorial on the Research Topic**

#### **Intelligent Systems for Genome Functional Annotations**

Functional annotation of an entire genome is critical to the understanding of any biological process or pathway. Yet, large parts of the human genome, and much more in the non-model organisms, remain without annotations. Simple, sequence-similarity based annotations have been found to be grossly inadequate for this purpose. More complex models, often based on intelligent systems, such as Machine Learning (ML) have proved to be very helpful. Indeed, ML models have key properties that makes them particularly useful for genome annotation (Yip et al., 2013). In their basic formulation, ML techniques have found their way into the field of biological functional annotation quite early. For example, secondary structure prediction using ML was carried out as early as in mid-1980's and many other areas of biological sequence, structure and/or function prediction have seen great advances in terms of the complexity of techniques, feature engineering, and other principles of data-driven analytics.

Several computational techniques have been developed exclusively for solving functional annotation problems. However, most of the growth has been in terms of the application of emerging and established computational techniques in the biological domain. ML software has often been used as a blackbox tool, while researchers focus on the biological concept of the problem and its solution. More recently, deep learning based neural networks methods have made rapid progress and have shown particular success with problems associated with large amounts of biological data and complex system representations. Typically popular amongst them have been convolutional neural networks (CNN), multi-layer perceptrons (MLP) and long short-term memory networks (LSTM), with widely different forms and learning strategies. On the other hand the biological understanding of molecular function and organization of knowledge on this subject has also undergone rapid advances. Instead of scattered and ambiguous labeling of function, systematic annotations in terms of biological (disease, gene, and protein etc.) ontologies with hierarchical and nested labels from semantic models have made the task of annotation learning and prediction of biological function much more robust. Clearly, much has been achieved on biological and technical aspects of functional annotations, but many hurdles remain. It was under this context, this special issue was proposed to promote the reporting of various aspects of biological functional annotations where different types of intelligent systems/ML have been used to solve functional annotation problems.

In the eight research papers forming this special issue, a special aspect of functional annotation i.e., for predicting drugability was reported. There is a need to annotate gene products to indicate whether these are likely to be druggable or not. Ghadermarzi et al. argued that the majority of

Edited and reviewed by: Richard D. Emes, University of Nottingham, United Kingdom

> \*Correspondence: Shandar Ahmad shandar@jnu.ac.in

#### Specialty section:

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Received: 17 July 2020 Accepted: 23 July 2020 Published: 25 August 2020

#### Citation:

Ahmad S, Ballester PJ and Fernandez M (2020) Editorial: Intelligent Systems for Genome Functional Annotations. Front. Genet. 11:915. doi: 10.3389/fgene.2020.00915

**4**

the druggable human proteome is yet to be annotated and explored. To advance on that front, these authors collected the data from three types of protein targets: druggable, nondruggable, and possibly druggable. Both new and established markers for each protein were extracted from its protein sequence or names/identifiers. They discovered that the possibly druggable proteins have significantly higher abundance of alternative splicing isoforms, relatively large number of domains, higher degree of centrality in the protein-protein interaction networks, and lower numbers of conserved and surface residues, when compared with the non-druggable proteins. These markers can be helpful to find novel druggable human proteins and provide interesting insights into the cellular functions and subcellular locations of both current drug targets and potentially druggable proteins.

The genome can also be annotated with those regions whose alterations in tumor cells are found to control patient response to a drug treatment. Thus, the contribution from Bomane et al. looked at this problem from a precision oncology perspective. In particular, authors investigated the extent to which it is possible to predict breast cancer patient response to the mitotic inhibitor paclitaxel using the US National Cancer Institute's Genomic Data Commons. These datasets comprised the responses of breast cancer patients to paclitaxel along with six molecular profiles of their tumors. Ten ML algorithms were applied to each of these profiles and the resulting 60 classifiers evaluated on the held-out patients. Only three of these 60 models were at all predictive, highlighting the crucial importance of a broad search to avoid suboptimal results. DNA methylation and miRNA profiles were the most informative overall. In combination with these two profiles, the ML algorithms selecting the smallest subset of molecular features were found to generate the most predictive classifiers.

In addition to supervised ML models for function annotation, unsupervised ML methods have shown to be extremely successful to interpret experiments when labeled experiments are not available. For example, annotations from newly invented singlecell RNA sequencing (scRNA-seq) technology (Sasagawa et al., 2019) is incredibly challenging because of the lack of labels for individual cells. Purely unsupervised clustering methods, such as t-distributed stochastic neighbor embedding and uniform manifold approximation and projection, have been employed to obtain low-dimensional embedding of cell–cell relationships with the foreseeable drawback of highly dependent upon genes selected for clustering. Taguchi and Turki contribution explored tensor decomposition (TD)-based unsupervised feature extraction (FE) (Taguchi, 2019) to integrate two scRNAseq expression profiles that measure human and mouse midbrain development. TD-based unsupervised FE showed to be a promising method to effectively integrate two scRNAseq profiles while outperforming other popular unsupervised selection methods.

The integration of biological experiments also requires informatics tools and platforms that combine and analyse different sources of biological data (Triplet and Butler, 2014). TargetMine is an integrative data analysis platform for target prioritization and broad-based biological knowledge discovery. The recent improvement of the platform described by Chen et al. forms a contribution in that direction and highlights newly modeled biological data types and the improvement of new analytical and visual tools. Enhanced coverage of gene–gene relations, and small molecule metabolite to pathway mappings are now implemented in TargetMine together with an improved literature survey feature. The platform also incorporated in silico predictions of gene functional associations such as protein– protein interactions and global gene co-expression. Authors demonstrated how the newer enhancements in TargetMine provides a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations.

Finding new biological targets is key for designing potential new drugs. Computational biology can help identifying targets by sorting the parasite's metabolic pathways that pins out proteins essential for its survival. Bora and Jha contributed a kinetic modeling for determining targets against Leishmania donovani, a deadly human pathogen responsible for causing Visceral Leishmaniasis. Metabolic pathway and Protein-Protein Interactions (PPI) were integrated to analyse the "purine salvage" pathway, which is mandatory for parasite survival. Available experimental data was used to develop a kinetic model of Purine salvage pathway that helped marking of crucial enzymes involved in the synthesis of the metabolites. Additionally, PPI analysis of the pathway assisted in building a static interaction network for selected proteins. Dynamic Modeling and Topological analysis of the PPI network through centrality measures were combined to detected targets. ADSL (Adenylosuccinate lyase) and IMPDH (Inosine-50-monophosphate dehydrogenase) enzymes appeared to be crucial and further modeling of three dimensional structure of ADSL enzyme aided toward the search for antiparasitic drugs for the treatment of Visceral Leishmaniasis.

In terms of specific issues discussed in this special issue, Wang et al. have reported the use of one of the top unsupervised machine learning technique known as overlapping cluster generator (OCG) for the functional characterization of hitherto poorly annotated P311. They propose that the proteins on the interface of OCGs represent multifunctional property and based on this PPI characterization propose that P311 may be involved in inflammatory responses, cell proliferation, and coagulation. While protein-wise functional annotation is a key biological problem of interest, more detailed characterization of proteins to gain general insights into their structure and function are critically important. In this regards, amino acid repeats in proteins play an important role in their structures and by consequence their functions. Thus, Rajathei et al. have reported an analysis of protein repeat regions and their role in structure and function of proteins. They also report that repeat regions of longer than 15 residues are present in about 67% of proteins in the Uniprot, when viewed in a non-redundant manner. Biological function annotation cannot be complete without looking at the sensitivities at the epigenetic and single nucleotide polymorphism-driven functional diversity. Database and resources form the centerstage in any such analysis. Ma et al. have presented a database FeatSNP that focuses specifically on the SNPs in epigenetic factors in human brain. In the absence of a thorough understanding of human brain and proteins involved in performing cognitive and behavioral functions, such a database will be of immense value for people working on understanding genomic perspectives of protein function in brain and its related disorders.

Overall, the special issue covered various aspects of ML both supervised and unsupervised with classical clustering, overlapping clustering and general statistical principles, neural networks, tensor decomposition and related techniques. The biological systems investigated included generalized druggability, P331 systems, brain associated proteins and Leishmania donovani. Thus, special issue brings together typical systems in which ML and intelligent systems have helped gaining predictive value and biological insights into the vast area of biological function annotation. We hope the

### REFERENCES


readers from computational biology and the domain specific researchers will be benefitted by reading articles included in this special issue.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

This work was supported by a grant from Department of Science and Technology, Government of India under the project DST/ICPS/Cluster/DataScience/2018/General/11 to SA.

genome annotation: a match meant to be?. Genome Biol. 14:205. doi: 10.1186/gb-2013-14-5-205

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Ahmad, Ballester and Fernandez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Reconstruction and Functional Annotation of P311 Protein–Protein Interaction Network Reveals Its New Functions

Song Wang<sup>1</sup> , Xiaorong Zhang<sup>1</sup> , Fen Hao<sup>1</sup> , Yan Li<sup>2</sup> , Chao Sun<sup>3</sup> , Rixing Zhan<sup>1</sup> , Ying Wang<sup>1</sup> , Weifeng He<sup>1</sup> , Haisheng Li1,4 \* and Gaoxing Luo<sup>1</sup> \*

1 Institute of Burn Research, State Key Laboratory of Trauma, Burn and Combined Injury, Southwest Hospital, Third Military Medical University, Chongqing, China, <sup>2</sup> Laboratory Center of Southwest Hospital, Third Military Medical University, Chongqing, China, <sup>3</sup> The Sixth Resignation Cadre Sanatorium of Shandong Province Military Region, Qingdao, China, <sup>4</sup> The 324th Hospital of Chinese People's Liberation Army, Chongqing, China

#### Edited by:

Rosalba Giugno, University of Verona, Italy

#### Reviewed by:

Hauke Busch, Universität zu Lübeck, Germany Lingyun Zou, Third Military Medical University, China Vincenzo Bonnici, University of Verona, Italy

\*Correspondence:

Haisheng Li lee58427@163.com Gaoxing Luo logxw@yahoo.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 19 September 2017 Accepted: 30 January 2019 Published: 19 February 2019

#### Citation:

Wang S, Zhang X, Hao F, Li Y, Sun C, Zhan R, Wang Y, He W, Li H and Luo G (2019) Reconstruction and Functional Annotation of P311 Protein–Protein Interaction Network Reveals Its New Functions. Front. Genet. 10:109. doi: 10.3389/fgene.2019.00109 P311 is a highly conserved multifunctional protein. However, it does not belong to any established family of proteins, and its biological function has not been entirely determined. This study aims to reveal the unknown molecular and cellular function of P311. OCG (Overlapping Cluster Generator) is a clustering method used to partition a protein-protein network into overlapping clusters. Multifunctional proteins are at the intersection of relevant clusters. DAVID is an analytic tool used to extract biological meaning from a large protein list. Here we presented OD2 (OCG + DAVID + 2 human PPI datasets), a novel strategy to increase the likelihood to identify biological functions most pertinent to the multifunctional proteins. The principle of OD2 is that OCG prepares the protein lists from multifunctional protein relevant overlapping clusters, for a functional enrichment analysis by DAVID, and the similar functional enrichments, which occurs simultaneously when analyzing two human PPI datasets, are supposed to be the predicted functions. By applying OD2 to two reconstructed human PPI datasets, we supposed the function of the P311 in inflammatory responses, cell proliferation and coagulation, which were confirmed by the following biological experiments. Collectively, our study preliminarily found that P311 could play a role in inflammatory responses, cell proliferation and coagulation. Further studies are required to validate and elucidate the underlying mechanism.

Keywords: P311, protein–protein interaction networks, inflammatory response, cell proliferation, coagulation

### INTRODUCTION

P311, with the official gene symbol NERP (neuronal regeneration related protein), is a highly conserved 8-kDa intracellular protein. The 68-amino acid sequence of P311 contains a PEST domain (rich in Pro, Glu, Ser, and Thr) in the N-terminus (Stradiot et al., 2018). The domain is also in short-lived proteins such as transcription factors, cytokines and signal molecules, which implies that P311 might belong to one of the protein families (Sommer and Wolf, 2014; Varshavsky, 2014). However, no more evidence was found to ascribe P311 to one of the protein families. So far,

**7**

P311 does not belong to any established family of proteins; therefore, it fails to provide any clues on its function. Studler et al. (1993) first reported that P311 was highly expressed in embryonic mouse brains, and then other groups demonstrated that P311 was expressed in motoneurons (Fujitani et al., 2004), glioblastomas (McDonough et al., 2005), smooth muscle cells (Badri et al., 2013), and fibroblasts (Tan et al., 2010; Cheng et al., 2017). Furthermore, these studies showed that P311 was involved in nerve and lung regeneration (Fujitani et al., 2004; Zhao et al., 2006), glioma invasion ((Mariani et al., 2001; McDonough et al., 2005), blood pressure homeostasis (Badri et al., 2013), myofibroblast differentiation (Pan et al., 2002), amoeboid-like migration (Shi et al., 2006), behavioral responses in learning and memory (Taylor et al., 2008) and the affective, but not the sensory component of pain (Sun et al., 2008).

Our group has been focused on P311 since 2004, when we found that the expression of P311 increased dramatically in hypertrophic scars through gene expression profiling and a comparative proteomics analysis (Wu et al., 2004). We then found that P311 induced fibroblast differentiation via enhancing the TGFβ1 signaling pathway in a human hypertrophic scar (Tan et al., 2010), and it also was a new inducer of EpMyT (Epidermal stem cell transdifferentiate into myofibroblasts) in wound healing (Li et al., 2016). Furthermore, we demonstrated that P311 played a crucial role in renal fibrosis via TGFβ1/Smad signaling (Yao et al., 2015). Recently we showed that P311 accelerated skin wound re-epithelialization by promoting epidermal stem cell migration through RhoA and Rac1 activation (Yao et al., 2017) and that P311 deficiency leads to attenuated angiogenesis in cutaneous wound healing (Wang et al., 2017). These studies have aroused our tremendous and sustained interest in P311 and we therefore continue to study its biological function.

Protein-protein interaction (PPI) networks can highlight the modularity of cellular processes and allow the deciphering of protein functions at the cellular level, as proteins tend to interact with each other when they are involved in the same molecular complex, pathway, or biological process (Hartwell et al., 1999). Meanwhile, a PPI network can be represented as a simple graph in which vertices correspond to proteins and edges, to direct physical interactions, which allow the graph partition method to highlight clusters of densely connected vertices (Aittokallio and Schwikowski, 2006). The identified clusters stand for groups of proteins involved in the same molecular complex, pathway, or biological process. Further analyzing the proteins from each group can predict the function of uncharacterized proteins (Sharan et al., 2007). OCG (Overlapping Cluster Generator) is a graph partition that decomposes a protein-protein network into overlapping clusters. Multifunctional proteins are at the intersection of relevant clusters (Becker et al., 2012). DAVID consists of a comprehensive biological knowledgebase as well as analytical tools designed to systematically extract biological meaning from a large gene/protein list (Huang et al., 2008).

In this study, we presented OD2 (OCG + DAVID + 2 human PPI datasets), a promising strategy to predict the function of P311. By applying OD2 to two reconstructed human PPI datasets, we supposed the function of P311 in inflammatory responses, cell proliferation and coagulation. Finally, we conducted relevant biological experiments to confirm the functions of P311.

### MATERIALS AND METHODS

### Datasets

A PPI network dataset (named Dataset 1) involving 80,930 binary interactions between 10,229 proteins (**Figure 1A** and **Supplementary Material 1**) was constructed by (1) eight interactions from our previous literature (Peng et al., 2012) and (2) 80,922 binary interactions from the PPI network dataset assembled by Bossi and Lehner (2009).

Additionally, another high confidence dataset (named Dataset 2) of 110,707 binary interactions involving 9,606 proteins (**Figure 1A** and **Supplementary Material 3**) was built by fusing the eight interactions from our previous literature (Peng et al., 2012) with 110699 binary interactions, whose combine-score is greater or equal to 0.5, from the STRING database (Szklarczyk et al., 2014). In the STRING dataset, each protein-protein interaction is annotated with a combine-score. The score does not necessarily indicate the strength or specificity of the interaction, instead, it indicates confidence, i.e., how likely STRING judges an interaction to be true, given the available evidence. All scores rank from 0 to 1, with 1 being the highest possible confidence.

## OCG (Overlapping Cluster Generator) Algorithm

The OCG algorithm was carried out by the software available in Becker et al. (2012) (**Supplementary Material 4**). The principle of OCG is to build a tree in which the leaves are introductory classes that are progressively and hierarchically joined (**Figure 1B**).

#### Initial Overlapping Class System

This program begins with building initial overlapping clusters from a simple unweighted graph G = (V, E). Let | V| = n and | E| = m. To obtain the initial overlapping class system, there are three strategies to cover the graph.

The covering class system can be:


The centered clique system has been chosen for further studies, because its graph density is better and is less time and memory consuming.

#### Hierarchy of Overlapping Class

To obtain the hierarchy overlapping class systems, the need to be fused with the clusters by optimization of the modularity Q. In each step, the connected clusters are the ones that maximize the average gain. This average is defined as the global modular gain divided by the number of newly connected vertex pairs. The chain effect can thus be avoided, which is caused by adding elements one by one and producing clusters that are not suitable for the following functional prediction.

#### Final Overlapping Clusters

Final overlapping clusters come out when the system of clusters reaches the maximized modularity. Alternatively, if the researcher sets either a minimum number of clusters or the maximum allowed cluster cardinality, the final overlapping cluster that maximizes the overall modularity within those constraints come out.

### P311 PPI Networks Reconstruction and Analysis

As shown in **Figure 1A**, through the OCG (Overlapping Cluster Generator) algorithm, the dataset was partitioned into overlapping PPI networks (**Supplementary Material 5**). Among those, we picked out the networks containing protein P311 (NERP) as P311-containing networks.

Following the analysis flow with DAVID (**Figure 1C**), all constituents in each P311-containing network were analyzed, separately, to find the significant terms. The Q-values < 0.05 were invoked as the threshold from which to choose the significant terms (Katsogiannou et al., 2014). Q-values are hypergeometric p-values corrected for multiple testing according to the Benjamini and Hochberg procedure. Cytoscape was utilized to visualize all the networks (Smoot et al., 2010). Significant GO terms occurring in both terms from dataset 1 derived P311-containing networks and dataset 2 derived P311-containing networks, were picked out as the common terms, which were supposed to be the functions of P311.

Following the bioinformatic analysis above, we identified the common predicted function as the function which would be confirmed in the follow-up experiments.

#### Animals

P311 WT and P311 KO mice were kindly gifted by Professor Gregory A Taylor (Taylor et al., 2008). All mice grew up in the animal Institute of Third Military (Army) Medical University. The mice were maintained in a specific, pathogenfree environment under controlled light, temperature, and humidity.

### Culture of Mouse Primary Fibroblasts

Cells were cultured as previously described (Varani et al., 2004). Briefly, after incubation in 0.25% Dispase II (04942078001. Roche) overnight at 4◦C, the dermis was separated and minced into tissue fragments. Then primary fibroblasts were grown from the fragments and cultured in Dulbecco's modified Eagle's medium (DMEM) (11965118, Gibco) supplemented with 10% fetal bovine serum (FBS) (10099141, Gibco), 100 U/mL of penicillin and streptomycin (15140122, Gibco).

#### Proliferation Assay

fgene-10-00109 October 30, 2019 Time: 15:41 # 4

Mouse primary fibroblasts (MPFs) were seeded in three replicates in 96-well plates in DMEM supplemented with 10% FBS. After 1, 2, 3, 4, 5, and 6 days, according to the manufacturer, the absorbance was measured at 450nm using an enzyme-linked immunosorbent assay reader with the addition of 10 µl/well of CCK8 reagent (CK04, Dojindo).

### Full-Thickness Excisional Skin Wound Model

The model was prepared as previously described (Wang et al., 2018). Briefly, hairs on the dorsal surface of mice were shaved with an electric shaver and cleaned with 75% alcohol before surgery. A full-thickness excisions skin wound was made on the dorsal surface, with a 4 mm round skin biopsy punch, while anesthetized, with 1% pentobarbital via an intraperitoneal injection.

### Superficial Second-Degree Burn Mouse Model

As previously described (Li et al., 2016), mice were anesthetized with intraperitoneal pentobarbital (35 mg/kg) and dorsal skin hairs were shaved 2 days before the burn. The scald apparatus (YSL-5Q) was then used to make the second-degree thermal burn injury with the condition (80◦C, 3 s under a pressure of 500 g weight). Two wounds were produced on each mouse along the posterior median line, and the distance between the two wounds was 1.0 cm. The burn depth was confirmed by pathology.

#### Hematoxylin-Eosin (H&E) Staining

The mice were sacrificed on the 3rd-day post-surgery, and the wound tissues were then carefully harvested, fixed with 4% paraformaldehyde, embedded in paraffin, sliced and stained with H&E. The wound area and inflammatory cell count on the hematoxylin and eosin (H&E) stained sections were determined by the ImageJ 1.41 software provided by the National Institute of Health.

### RNA Isolation and Quantitative Real-Time PCR

Total RNA was extracted from mouse skin with the RNeasy Mini Kit (QIAGEN, 74104), and cDNA was synthesized with the cDNA Synthesis Kit (Toyobo, FSK-100). Real-time PCR was performed with the SYBR Green Master Mix (Toyobo,

CD14, 5<sup>0</sup> -ACATCTTGAACCTCCGCAACGTGT-3<sup>0</sup> and 50TT GAGCGAGTGTGCTTGGGCAATA-3<sup>0</sup> ; CD16, 5<sup>0</sup> -TTGCAGT GGACACGGGCCTTTATT-3<sup>0</sup> and 50TTGTCTTGAGGAGCC TGGTGCTTT-3<sup>0</sup> ; β-actin, 5<sup>0</sup> -CGTGCGTGACATCAAAGAG AA-3<sup>0</sup> and 5<sup>0</sup> -TGGATGCCACAGGATTCCCAT-3<sup>0</sup> ; GAPDH, 5 0 -CGTGCCGCCTGGAGAAAC-3<sup>0</sup> and 5<sup>0</sup> -AGTGGGAGTTGC TGTTGAAGTC-3<sup>0</sup> ; P311, 5<sup>0</sup> -GAGGCTTCCTAAGGGAAGAC TT-3<sup>0</sup> and 5<sup>0</sup> -AAGTGGAGGTAAC TGATTCTTGG-3<sup>0</sup> .

### Flow Cytometry

After isolating cells from the dermal sheet, the appropriate primary antibody was added for 1 h at 4◦C to PerCP CY5.5-conjugated mAb specific F4/80 (Cell Signaling). After transduction with Ad-P311 or Ad-Vector for 24 h, MPFs (mouse primary fibroblasts) were scraped off the plates in PBS containing 5% BSA. Cells were then fixed in 70% ethanol in 4◦C overnight, washed in PBS two additional times, and then stained for 30 min at 37◦C in a 50 mg/ml propidium iodide (Sigma, United States) solution containing 200 mg/ml RNase A and 0.1% Triton-X-100. All the prepared cells were analyzed with the Attune Acoustic Focusing Cytometer (Applied Biosystems, Life Technologies, CA, United States), and the data were analyzed using the Flow Jo software (Tree Star Incorporation, United States). Experiments were replicated at least three times using the same conditions and settings.

### Thromboelastography (TEG)

Thromboelastography (TEG) is a method of testing the efficiency of blood coagulation (Branco et al., 2014). Blood was gathered from the retro-orbital plexus of the P311 WTburn, P311 KO-burn, P311 WT-sham-burn, and the P311 KO-sham-burn mice on the 7th-day post-burn, in tubes containing buffered sodium citrate. A minimum of 1 mL was achieved in each case. The previously described (Bolliger et al., 2012), the TEG analysis at 37◦C using the 'Citrated Native' protocol with TEG 5000 Thromboelastography Hemostasis Analyzer System (Haemonetics Corporation, Braintree, MA, United States) was used. Descriptions of the TEG parameters are provided in **Table 1**.

TABLE 1 | Description of thromboelastography (TEG) parameters.


For DAVID analysis, the entire genome-wide genes of humans were the default background, and the significance of the geneterm enrichment was analyzed by a modified Fisher's exact test (EASE score). For follow-up experiments, the data were described as mean ± SD (standard deviation) and analyzed by an unpaired, two-tailed Student's T-test with SPSS 18.0 software. P < 0.05 was considered as statistically significant.

### RESULTS

### Functional Enrichment Analysis of P311 PPI Networks Predicts Its New Functions

Previously, our group identified eight proteins that might interact with P311, utilizing the yeast two-hybrid (Y2H) technique. These proteins are HRG, SERPINC, MT2A, SRPR, HYI, ACY3, EIF6, and PDF (**Supplementary Material 2**) (Peng et al., 2012). The nine proteins then constructed the initial P311-containing network. Following the analytic flow of DAVID (**Figure 1C**), according to functional annotation and enrichment, the proteins, with a q-value more than 0.05, were enriched in the negative regulation of endopeptidase activity (Q = 0.9404).

As shown in **Figure 1A**, the initial P311-containing network was merged with the human PPI network dataset assembled by Bossi and Lehner (2009) and the human PPI network dataset downloaded from STRING database (Szklarczyk et al., 2014), was used separately, to build two datasets. We obtained two reconstructed human PPI datasets, one (named Dataset 1) with 80,930 binary interactions between 10,229 proteins (**Figure 1A** and **Supplementary Material 1**), another (named Dataset 2) with 110,707 binary interactions involving 9,606 proteins (**Figure 1A** and **Supplementary Material 3**).

The two large human PPI networks were partitioned by OCG using the centered clique system to initially cover the graph. The final overlapping clusters emerged when the maximal modularity was reached (**Figure 1B**). Finally, 732 overlapping clusters, four of which contained P311, were obtained from Dataset 1. The four reconstructed P311-containing networks were named M1, M2, M3, M4 (**Supplementary Materials 5, 6**). Meanwhile, we obtained four reconstructed P311-containing networks, among 1588 overlapping clusters, obtained from Dataset 2. The four reconstructed P311-containing networks were named N1, N2, N3, N4 (**Supplementary Materials 5, 7**). Overall, we obtained eight reconstructed P311-containing networks.

All constituents in each reconstructed P311-containing network were then analyzed by DAVID, separately (**Figure 1C**). In the analysis of dataset 1 derived P311-containing networks, according to functional annotation and enrichment, these functions range from biological processes already reported (**Supplementary Material 8**) to novel ones (**Supplementary Material 9**), such as the GPI anchor biosynthetic process, glucose metabolic process, peptidyl-serine phosphorylation, chemokinemediated signaling pathway, monocyte chemotaxis, cellular response to interferon-gamma,G1/S transition of mitotic cell cycle, DNA replication, platelet activation and the positive regulation of the establishment of protein localization to the plasma membrane. The top ten functions of each reconstructed P311-containing network are shown in **Figure 2A**.

In the analysis of dataset 2 derived P311-containing networks, according to functional annotation and enrichment, the predicted functions also included the already reported ones (**Supplementary Material 8**) and the new ones (**Supplementary Material 10**). The novel functions were enriched in the carboxylic acid catabolic process, monocarboxylic acid catabolic process, rRNA catabolic process, rRNA processing, the G1/S transition of mitotic cell cycle, DNA replication initiation and so on. The top 10 functions of each reconstructed P311-containing network are shown in **Figure 2B**.

To improve the confidence of the bioinformatic analysis, we compared the predicted functions from dataset 1 and dataset 2 to find the ones occurring on both sides. As shown in **Table 2**, one identified function (positive regulation of cell migration) (McDonough et al., 2005; Yao et al., 2017) was predicted by the analysis system. Another 13 unidentified ones were reported.

Integrated with the predicted functions above, and the phenomenon we observed before during our experiments, we supposed the function of P311 in inflammatory responses, cell proliferation and coagulation as the most confident ones.

### The Contribution of P311 to the Inflammatory Responses During Wound Healing

Inflammatory responses are the hallmark pathophysiological procedure for wound healing. After injuries such as trauma, burns or surgery, the prompt recruitment of inflammatory cells, such as monocytes and macrophages, which express CD14 and CD16 markers, occur in the wound site, followed by neutrophils and lymphocytes. These cells control the inflammatory response to wounds (Eming et al., 2007). Histological analysis of the wounds was carried out to check the microscopic appearances of wounds and the number of inflammatory cells that have infiltrated into the subcutaneous areas. The results showed that on the 3rd-day after the injury, the number of inflammatory cells in P311 WT was significantly higher than that in P311 KO mice (231.5 ± 63.9/field vs. 141.4 ± 39.2/field, p < 0.05) (**Figures 3A,B**). Further, we detected the mRNA levels of CD14 and CD16 expressed in wounds from the two groups on the 1st- and 3rd- day after the injury. We found more than a 150 fold change in CD14 expression and more than a 50-fold change of CD16 expression in wounds in P311 WT mice compared with non-wounded skin (p < 0.05) (**Figures 3C,D**). Additionally, changes in P311 KO mice were considerably smaller than those in P311 WT mice. On the 3rd day after the injury, the levels of CD14 and CD16 decreased compared with those on the 1st day after injury, but the expression levels in P311 WT mice were still significantly higher than that in P311 KO mice (p < 0.05) (**Figures 3C,D**). Detail CT values are provided in **Supplementary Material 11**. Finally, we utilized flow cytometry to detect the percentage of F4/80+ inflammatory cells in the wounds. Consistent with the findings above, we found that wounds from P311 KO mice showed a significant reduction

in the percentage of F4/80+ inflammatory cells in the wounds (13.3 ± 1.3% vs. 9.0 ± 1.4%, p < 0.05) on the 3rd day after injury (**Figures 3E,F**). All these findings confirmed that P311 influences


M, N stand for the P311-containing network from dataset one and dataset two, respectively.

the inflammatory responses during wound healing, by affecting the initial inflammatory cell recruitment.

### P311 Overexpression Promotes Proliferation of Mouse Primary Fibroblasts (MPF)

To confirm the function of P311 in proliferation, P311 was cloned into a pAdEasy vector which expresses GFP (green fluorescent protein), and then MPFs were transfected with a recombinant adenovirus vector (P311) or a negative vector (GFP). Before flow cytometry analysis and proliferation assay, the transfection efficiency was quantified and verified by observing GFP expression using a fluorescent microscope, and P311 mRNA expression levels were checked in real-time PCR. After transfection for 48 h, we observed that more than 90% of MPFs were transfected in both groups (**Figures 4A,B**). Realtime PCR result displayed that the mRNA level of P311 in the recombinant adenovirus vectors (P311) group was more than 10000-fold higher than that in the negative vector (GFP) group (p < 0.05), which confirmed the over-expression of P311 in the recombinant adenovirus vectors (P311) group. Detailed CT values are appended in the **Supplementary Material 11**. Further, proliferation capacity of these transfected cells was assessed using a proliferation assay. The result showed that from day 3, cells in the recombinant adenovirus vectors (P311) group had a higher proliferation capacity than cells in the negative vector (GFP) group (p < 0.05), and this tendency continued to the end (p < 0.05), which indicated that P311 over-expression significantly promoted the proliferation of MPFs (**Figure 4D**).

To understand how P311 promoted the proliferation of MPFs, we determined the cell cycle status of transfected cells by PI (Propidium iodide) staining. We found that the negative vector (GFP) group had a higher proportion of cells in the G0/G1 phrase (51.71 ± 4.21% vs. 41.05 ± 3.49%, p < 0.05), while the proportion of cells in the S phase was significantly lower in the negative vector (GFP) group than in the recombinant adenovirus vectors (P311) group (25.16 ± 4.92% vs. 36.68 ± 3.56%, p < 0.05) (**Figure 4F**). Moreover, there was no difference in the proportion of cells in the G2/M between the two groups (p > 0.05). These finding were consistent with the prediction by bioinformatics analysis. All of which indicated that P311 promoted MPFs proliferation via enhancing cells into the S phase.

### The Coagulation Profile of Blood Obtained From P311 WT Versus P311 KO Mice Immediately and 7-Days Post Burned

To assess the function of P311 in coagulation, we established a superficial second degree burn mouse model, as burn injury is traditionally thought to be a common triggering cause of coagulopathy, ranging from activation of coagulation to disseminated intravascular coagulation (DIC) (Curreri et al., 1975). Thromboelastography (TEG) was then used to monitor the coagulation profile of blood after trauma, as growing evidence shows that TEG is better than the conventional laboratory tests in evaluating the coagulation profile, platelet dysfunction, and fibrinolysis after trauma (Branco et al., 2014; Pekelharing et al., 2014).

**Figure 5A** represented a schematic TEG tracing. Results of TEG analysis are shown in **Table 3**, while **Figure 5B** represents TEG tracings from the two groups. On the 7th-day post-injury, no difference was detected between the P311 WT-sham and the P311 KO-sham injury. The clot formation time (R 8.850 ± 1.115 vs. 8.167 ± 1.291 min, p = 0.955) was similar, and the clot rate (α- angle 6.075 ± 1.656◦ vs. 6.275 ± 1.819◦ , P = 0.876) was also almost the same. No fibrinolysis was observed for both the P311 WT or the P311 KO mice. Comparing the P311 WT-burn to the P311 KO-burn, except for the clot formation time (R 8.850 ± 1.541 vs. 7.733 ± 1.210 min, p = 0.57), other parameters (Angle, MA, and G) were all significantly different (p < 0.05), which implied that P311 might regulate the process

FIGURE 4 | P311 overexpression promotes proliferation of primary mouse fibroblasts (MPF). (A) Representative morphology of primary mouse fibroblasts (MPF) transfected with Ad-P311 or a negative vector (GFP). Transfected for 48 h and then observed under a fluorescence microscope to confirm the infection efficiency by visualizing GFP expression. Scale bar = 100 mm. (B) Quantitation data of GFP<sup>+</sup> cells fraction in P311 transfected MPFs and vector group to determine the transfection efficiency. (C) P311 mRNA level in P311 transfected MPFs and vector group (n = 4 per group). (D) Representative flow cytometry cell cycle. (E) Quantitation data of flow cytometry cell cycle (n = 6). (F) The CCK8 assay was performed to assess the effect of P311 on the proliferation (n = 6). <sup>∗</sup>P < 0.05, ∗∗P < 0.01, Ad-P311 vs. Ad-Vector.

of coagulation. Altogether, the experiment demonstrated that P311 knockout significantly impacts burn-induced coagulopathy, suggesting a potential target for therapy.

### DISCUSSION

In this study, bioinformatic and experimental approaches were combined to predict and validate the function of P311. Firstly, by applying OD2 to two reconstructed human PPI datasets, we predicted the function of P311 in inflammatory responses, cell proliferation and coagulation. The principle of OD2 is that OCG prepares the protein lists from multifunctional protein relevant overlapping clusters, for functional enrichment analysis by DAVID. Similar functional enrichments, which occur simultaneously when analyzing two human PPI datasets, were chosen to be the predicted functions. Finally, we conducted relevant biological experiments to confirm these functions of P311.

### Integrating Initial P311 PPI Network With the Human Interactome Strengthens the Functional Landscape of P311

As proteins tend to interact with each other when they are involved in the same molecular complex, pathway, or biological process, the understanding of protein function is intrinsically tied to the understanding of this network (Hao et al., 2016). These networks are functional units of protein–protein interaction (PPI) networks and allow function prediction when involving unidentified proteins (Brun et al., 2003; Sharan et al., 2007). DAVID consists of a comprehensive biological knowledgebase


TABLE 3 | Comparison of P311 WT and P311-KO mice after burned and sham injury.

and analytical tools designed to systematically extract biological meaning from a large gene/protein list (Huang et al., 2008). To decipher the unknown function of P311, we performed functional annotation and enrichment analysis of constituent proteins in the initial P311-containing network, using DAVID. The binary analysis result showed that P311 might have the function to regulate endopeptidase activity, which was not observed in our follow-up experiments. This implied that the strategy to analyze the PPI network, which consists of binary interactions, has its own limitations, as it did not include the extended functional information hiding in the whole human PPI network, which is modular and consists of groups of highly related proteins in the same cellular function (Hartwell et al., 1999; Brun et al., 2003). Katsogiannou et al. (2014) reported that merging the initial PPI network with human interactome can enhance the functional information. Following their strategy, we merged the initial P311-containing network identified by our group, with two existing human PPI datasets, respectively (**Figure 1**).

Regarding a PPI network as a simple graph, in which vertices correspond to proteins and edges to direct physical interactions, allows the graph partition method to identify the networks (Newman, 2006). OCG is a clustering method used to identify the networks containing proteins involved in the same molecular complex, pathway, or biological process, and multifunctional proteins were identified at the intersection of the overlapping networks. Moreover, OCG had a better tradeoff between sensitivity and specificity than the CFfinder and Link communities did (Becker et al., 2012). By applying OD (OCG + DAVID) to PPI networks, we found that the result of analyzing different datasets by OD may be a little different, as the datasets were built at a different time by different researchers, which may cause them to not all have the same data in the dataset. Reversely, this may demonstrate the reliability of the result. To improve the confidence of the bioinformatic analysis, we chose the common predicted functions occurring on both sides to be the predicted function. So the bioinformatic system was developed into OD2 (OCG + DAVID + 2 human PPI datasets) with the principle that OCG prepared the protein lists from multifunctional protein relevant overlapping clusters for functional enrichment analysis by DAVID. The similar functional enrichments, which occurred simultaneously when analyzing two human PPI datasets, were chosen to be the predicted functions (**Figure 1**).

Through OD2, we predicted that P311 was involved in one well-known function-positive regulation of cell migration, which supported a reliable prediction function of the system. It has been reported that P311 accelerated cell migration mainly by enhancing the activity of GTPase. Mechanisms, involving P311 in enhancing the activity of GTPase, were cell specific. In epidermal stem cells, it enhanced the activity of Rho A and Rac1 (Yao et al., 2017). While in myofibroblasts, it enhanced the activity of Ral A (Shi et al., 2006). Other known functions of P311 like the regulation of blood pressure homeostasis (Badri et al., 2013), the regulation of development of fibrosis (Yao et al., 2015; Cheng et al., 2017), the regulation of wounds (Yao et al., 2017), angiogenesis (Wang et al., 2017) and so on are shown in **Supplementary Material 8**. P311–TGF-β axis signaling was demonstrated to be important signaling that is related to the processes above. Moreover, mechanisms involving P311 in the expression of TGF-β were also found to be cell-specific. In NIH3T3 fibroblasts, P311 downregulated the expression of TGF-β1 and TGF-β2 binds to the LAP (latency associated protein) (Paliwal et al., 2004). In vascular smooth muscle cells, P311 binds eukaryotic translation initiation factor 3 subunit b (eIF3b) to promote the translation of the transforming growth factor β1–3 (TGF-β 1–3) (Badri et al., 2013; Yue et al., 2014). In EpSCs, P311 stimulated TGFβ1 expression by promoting TGFβ1 promoter methylation and by activating the TGFβ1 5 0 /30UTR (Li et al., 2016). However, eIF6, an interacting protein of P311, downregulated the expression of TGFβ1 via H2A.Z occupancy and Sp1 recruitment in fibroblasts (Yang et al., 2015). Additionally, 13 predicted functions, in which P311 had never been implicated before, provided new insight into the function of P311 (**Table 2**).

## Predicted Functions Are Confirmed by Biological Experiments

With the proper biological experiments, we verified the predicted functions preliminarily. We confirmed the role of P311 in an inflammatory response, which was related to chemotaxis and immune responses (**Table 2** and **Supplementary Materials 9, 10**), by checking the recruitment of inflammatory cells in the wound healing model. Compared to the P311 WT mice, the number of inflammatory cells was much lower in P311 KO mice, while the mRNA level of CD14 and CD16 was also much lower in P311 KO mice (**Figure 3**). It has been proven that P311 could regulate wound healing by accelerating reepithelization (Yao et al., 2017) or by affecting angiogenesis (Wang et al., 2017). It therefore implies that P311 might also regulate wound healing by affecting the inflammation response during wound healing.

The role of P311 in proliferation was identified by a proliferation assay and flow cytometry analysis of the cell cycle. The results showed that P311 promoted MPFs proliferation by enhancing cells into the S phase (**Figure 4**). The expression of P311 was also found in small-cell lung carcinoma (SCLC), large-cell neuroendocrine carcinoma (LCNEC) (Jones et al., 2004) and glioblastoma (Mariani et al., 2001), whose cells had a really high proliferation capability, which indirectly implied that P311 might regulate the proliferation of cells. This all indicated that P311 might regulate the proliferation of cells. However, further studies are required to validate and elucidate the underlying mechanisms.

Moreover, we also verified a role of P311 in coagulation. As we had observed the difference in the formation of scabs between P311 WT and P311 KO mice when we made the mice wound healing model (Wang et al., 2017), the bioinformatic analysis predicted the function of coagulation. These strongly implied a function of P311 in coagulation. We utilized Thromboelastography (TEG) to monitor the coagulation profile of blood after the burn injury. The result displayed that Angle, MA and G in the P311 KO-burn group were all higher than those in the P311 WT-burn group, which preliminarily verified that P311 might regulate coagulation. Further studies are needed to validate and elucidate the underlying mechanisms, which may suggest P311 as a potential therapeutic target for coagulation related diseases.

However, there were still several predicted functions, which were confirmed to not be associated with P311, through biological experiments. This indicated again that the bioinformatic analysis provides clues but does not validate it as truth. The strategy reported by our group can provide us with specific clues for follow-up biological experiments. Our study preliminarily found that P311 could be involved in inflammatory response, cell proliferation and coagulation. Further studies are required to validate and elucidate the underlying mechanism.

#### REFERENCES


### ETHICS STATEMENT

All protocols involving animals were reviewed and approved by the Southwestern Hospital Institutional Review Board.

### AUTHOR CONTRIBUTIONS

HL and GL made substantial contributions to the conception and design of the work. SW and XZ performed the majority of the experiments. RZ, YW, CS, and FH contributed to the collection, analysis and interpretation of the data for this study. SW wrote the first draft of the manuscript. YL, WH, and GL revised and edited the manuscript critically with important intellectual contributions. GL was responsible for obtaining funds. All authors read and approved the manuscript.

#### FUNDING

This work was supported by grants from China's NSFC grants program (81630055).

#### ACKNOWLEDGMENTS

We wish to thank Prof. Gregory A. Taylor from the Duke University Medical Center for kindly providing the P311 KO mice.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2019.00109/full#supplementary-material



integrated over the tree of life. Nucleic Acids Res. 43, D447–D452. doi: 10.1093/ nar/gku1003


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer LZ declared a shared affiliation, with no collaboration, with several of the authors, SW, XZ, FH, YL, RZ, YW, WH, HL, and GL, to the handling Editor at the time of review.

Copyright © 2019 Wang, Zhang, Hao, Li, Sun, Zhan, Wang, He, Li and Luo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Corrigendum: Reconstruction and Functional Annotation of P311 Protein–Protein Interaction Network Reveals Its New Functions

*Song Wang1, Xiaorong Zhang1, Fen Hao1, Yan Li2, Chao Sun3, Rixing Zhan1, Ying Wang1, Weifeng He1, Haisheng Li1,4\* and Gaoxing Luo1\**

*1 Institute of Burn Research, State Key Laboratory of Trauma, Burn and Combined Injury, Southwest Hospital, Third Military Medical University, Chongqing, China, 2 Laboratory Center of Southwest Hospital, Third Military Medical University, Chongqing, China, 3 The Sixth Resignation Cadre Sanatorium of Shandong Province Military Region, Qingdao, China, 4 The 324th Hospital of Chinese People's Liberation Army, Chongqing, China*

Keywords: P311, protein–protein interaction networks, inflammatory response, cell proliferation, coagulation

#### *Approved by:*

*Frontiers Editorial Office, Frontiers Media SA, Switzerland*

#### *\*Correspondence:*

*Haisheng Li lee58427@163.com Gaoxing Luo logxw@yahoo.com*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 31 July 2019 Accepted: 07 August 2019 Published: 06 November 2019*

#### *Citation:*

*Wang S, Zhang X, Hao F, Li Y, Sun C, Zhan R, Wang Y, He W, Li H and Luo G (2019) Corrigendum: Reconstruction and Functional Annotation of P311 Protein–Protein Interaction Network Reveals Its New Functions. Front. Genet. 10:818. doi: 10.3389/fgene.2019.00818*

**A Corrigendum on:**

#### **Reconstruction and Functional Annotation of P311 Protein–Protein Interaction Network Reveals Its New Functions**

*By Wang S, Zhang X, Hao F, Li Y, Sun C, Zhan R, Wang Y, He W, Li H and Luo G (2019) Front. Genet. 10:109. doi: 10.3389/fgene.2019.00109*

There is an error in the *Funding* statement. The correct number for "China's NSFC grants program" is "81630055." A correction has therefore been made to the *Funding* statement and should read:

"This work was supported by grants from China's NSFC grants program (81630055)."

 Additionally, in the original article, the reference for "Becker et al., 2011" was incorrectly written as "Becker, E., Robisson, B., Chapple, C. E., Guénoche, A., and Brun, C. (2011). Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics 28, 84–90. doi: 10.1093/bioinformatics/btr621". It should be "Becker, E., Robisson, B., Chapple, C. E., Guénoche, A., and Brun, C. (2012). Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics 28, 84–90. doi: 10.1093/bioinformatics/btr621".

Lastly, the citation of Becker's work was missing in the **Supplementary Material 4**. Corrections have been made by placing a copy of the code from Becker et al., Boinformatics, 2012 Jan 1;28(1):84– 90, in the supplementary material along with an edited README file. Thus, corrections have also been made to the *Materials* and *Methods*, subsection *OCG (Overlapping Cluster Generator) Algorithm*, by removing the last sentence, which was redundant and to the first paragraph:

"The OCG algorithm was carried out by the software available in (Becker et al., 2012) (**Supplementary Material 4**)."

**18**

The authors apologize for these errors and state that they do not change the scientific conclusions of the article in any way. The original article has been updated.

### ACKNOWLEDGMENTS

We wish to thank Prof. Gregory A. Taylor from the Duke University Medical Center for kindly providing the P311 KOmice.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00818/ full#supplementary-material

*Copyright © 2019 Wang, Zhang, Hao, Li, Sun, Zhan, Wang, He, Li and Luo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CCBY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# FeatSNP: An Interactive Database for Brain-Specific Epigenetic Annotation of Human SNPs

#### Edited by:

Shandar Ahmad, Jawaharlal Nehru University, India

#### Reviewed by:

Wei-Hua Chen, Huazhong University of Science and Technology, China Sandeep Kumar Dhanda, La Jolla Institute for Immunology (LJI), United States

#### \*Correspondence:

Ting Wang twang@wustl.edu Bo Zhang bzhang29@wustl.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 19 November 2018 Accepted: 08 March 2019 Published: 02 April 2019

#### Citation:

Ma C-y, Madden P, Gontarz P, Wang T and Zhang B (2019) FeatSNP: An Interactive Database for Brain-Specific Epigenetic Annotation of Human SNPs. Front. Genet. 10:262. doi: 10.3389/fgene.2019.00262 Chun-yu Ma<sup>1</sup> , Pamela Madden<sup>2</sup> , Paul Gontarz<sup>1</sup> , Ting Wang<sup>3</sup> \* and Bo Zhang<sup>1</sup> \*

<sup>1</sup> Center of Regenerative Medicine, Department of Developmental Biology, Washington University School of Medicine, St. Louis, MO, United States, <sup>2</sup> Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, United States, <sup>3</sup> Department of Genetics, The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, United States

FeatSNP is an online tool and a curated database for exploring 81 million common SNPs' potential functional impact on the human brain. FeatSNP uses the brain transcriptomes of the human population to improve functional annotation of human SNPs by integrating transcription factor binding prediction, public eQTL information, and brain specific epigenetic landscape, as well as information of Topologically Associating Domains (TADs). FeatSNP supports both single and batched SNP searching, and its interactive user interface enables users to explore the functional annotations and generate publication-quality visualization results. FeatSNP is freely available on the internet at FeatSNP.org with all major web browsers supported.

#### Keywords: SNP, database, epigenetics, brain, transcription factor

### INTRODUCTION

Genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) analyses have identified thousands of genetic variants that are associated with a wide range of human phenotypes, shedding lights on the understanding of the genetic effect to human diseases. However, a key challenge for scientists in the human genetics community is to understand the molecular mechanism connecting significant genetic variant and specific phenotype. More than 90% of SNPs associated with human phenotypes are located in nonprotein-coding regions, and cannot be explained by alteration of amino acid sequence of proteins (Welter et al., 2014). Recently, mounting evidence suggests that disease-associated noncoding SNPs are highly enriched in tissue-specific regulatory elements including enhancers, which can be detected and defined by specific chromatin modifications (Carey et al., 2015;

**20**

Zhou et al., 2015; Agrawal et al., 2018). Moreover, some noncoding SNPs are found to be located within transcription factor (TF) binding motifs, which affect the TF binding affinity and result in allele switching and/or allele-specific regulation of target genes (Andersson et al., 2014; Roadmap Epigenomics et al., 2015; Nelson et al., 2016). These evidences underscore the potential causal role of non-coding genetic variants in affecting human diseases and phenotypes through regulation of gene expression (Claussnitzer et al., 2015).

Here we introduce FeatSNP, an online tool and database which provides an interactive user interface (UI) for inquiring brainspecific functional and epigenetic annotation of human SNPs. Unlike traditional SNP functional annotation databases, such as RegulomeDB (Boyle et al., 2012) and HaploReg (Ward and Kellis, 2012), FeatSNP focuses on the collection and curation of brain-specific functional genomics data, including epigenomes, transcriptomes, and eQTL data, to better annotate the regulatory potential of single SNP. Specifically, FeatSNP supplies a series of new features to facilitate research understanding the functional annotation of SNP on human brain (**Supplementary Table S1**). FeatSNP uses human brain transcriptomes to improve and refine the prediction of allele-specific TF binding motifs. The expression correlation between SNP-associated gene and predicted SNPassociated TFs was used to determine the best allele-associated TF candidate. The interactive UI allows the users easily to browse functional annotation and generate analysis results and high quality figures.

#### METHODS

FeatSNP consists of a front end UI implemented with HTML/PHP/JavaScript, and a backend NoSQL database implemented with MongoDB (v3.2.7) as shown in **Figure 1**. The current SNP dataset contains 81,144,876 bi-allelic SNPs from dbSNP (V144), with SNP accession number as unique identifier in the database. Human dbSNP build 144 was downloaded from ftp.ncbi.nih.gov/snp, which includes 84,435,229 SNPs records, 1,591,294 insertions records, 2,595,517 deletions records, 33,234 indel records, and 110 Multiple Nucleotide Polymorphisms (MNPs) records. After filtering redundant records, 81,144,876 of 84,435,229 biallelic SNPs were used to generate functional annotations and were curated by the FeatSNP database. The genome coordinates (hg19) of 81,144,876 SNPs were used to associate the SNPs with their nearest genes based on 56,642 records of GENOCDE gene annotation Release 19 (GRCh37.p13).

To predict impact of allele-specific TF binding affinity by SNPs, the Position Weight Matrix (PWM) of 519 vertebrate TFs were collected from JASPAR (Core Vertebrate 2016) (Mathelier et al., 2016). After evaluating the motif weight PWM of 519 TFs at base-pair resolution (**Supplementary Figure S2**), the reference and alternate alleles for every SNP with flanking 10 bp of genomic sequences both upstream and downstream were obtained from the UCSC Genome Browser. FIMO (Grant et al., 2011) was used to scan the 21 bp sequence to identify binding motifs matching any of the 519 TF PWMs, and calculate the TFBS motif scores. Only instances where a motif in the sequence (i) passed the threshold of P < 1e-2 in either the reference or the alternate allele, and (ii) contained the SNP location and (iii) the difference of motif scores between the reference and the alternate allele was greater than 2, were recorded in the database.

1,259 transcriptome datasets of 13 brain tissues generated by the GTEx consortium (Gibson, 2015) were used to calculate the Pearson correlation between each SNP associated gene and predicted binding TFs. The lowly expressed gene and TFs (expression of all samples in one tissue less than 0.2RPKM) were removed. The correlation and gene expression in 13 brain tissues were visualized by using JavaScript package Highcharts (v5.0.2).

**21**

eQTL data of 10 brain tissues generated by GTEx consortium were negative-log10 transformed and further visualized by using Highcharts (v5.0.2).

Histone modification ChIP-seq data of 10 brain tissues were downloaded from NIH Roadmap Epigenomics data portal. Bedtools was used to identify SNPs residing in peaks of 7 histone modification marks (H3K4me3, H3K36me3, H3K27me3, H3K4me1, H3K27ac, H3K9me3, and H3K9ac) that were identified by macs2 (Zhang et al., 2008) with default parameters. To enhance the user experience, the WashU epigenome browser (Zhou et al., 2015) was embedded in the UI to display epigenetic landscape in a 200 bp region surrounding each SNP. The browser also displays DNA methylation data (Whole Genome Bisulfite Sequencing) of 4 neuronal progenitor and brain tissues generated by Roadmap Epigenomics Project, enhanced epilogos visualization<sup>1</sup> of all 127 epigenomes, and topologically associating domains (TAD) data of GM12878, IMR90, and Hap1 cell lines (Rao et al., 2014; Sanborn et al., 2015). eQTL data of 10 brain tissues generated by GTEx consortium were also visualized on the embedded WashU epigenome browser.

The association records of SNP and human disease/traits (V1.0.2) were downloaded from GWAS Catalog. 33,894 associations with p-value smaller then 5E-8 were kept and classified based on 1,374 human disease/traits categories. The functional annotations of these 33,894 SNPs were

<sup>1</sup> epilogos.altiusinstitute.org

reported on FeatSNP.org/html\_file/disease\_classification.html (**Supplementary Figure S3**).

#### RESULTS

To illustrate the use of FeatSNP, we performed the analysis using rs8070723 as an example. rs8070723 is an intronic A/G SNP (major allele A frequency 0.881, minor allele G frequency 0.119) in MAPT, the gene that encodes the microtubuleassociated protein tau, and is associated with Progressive Supranuclear Palsy (Hoglinger et al., 2011) and with Parkinson's Disease (UK Parkinson's Disease Consortium et al., 2011). To better understand the regulatory potential of this human disease-associated SNP, we inquired the epigenetic annotation of rs8070723 in FeatSNP through Single SNP ID Searching function on SNP Query Page (**Supplementary Figure S4**). The database first reported the basic information of SNP rs8070723, including genomic location, allelic frequency, surrounding DNA sequence, and associated gene (**Figures 2A–C**). Users can further access the genetic information and associated human disease or traits of inquired SNPs on dbSNP and GWAS Catalog through external links.

FeatSNP found four potential TF binding motifs harboring rs8070723 with A allele, including PBX1, Hoxa9, Dux, and EN2. All four TF binding motifs had high TFBS scores in A allele,

CpG sites in four neuronal cells. Middle track: Epilogos visualization of chromHMM predicted chromatin status. Followed eQTL tracks: log10 transformed p-value of eQTL in 10 brain regions. TAD track: Topological Associated Domain tracks of GM12878, IMR90, and Hap1. Bottom: RefSeq gene annotation track. (C) Complete eQTL information of SNP rs8070723 in 10 brain regions.

and the TFBS motifs were destroyed with G allele with low TFBS scores (**Figure 2C**). PBX1 encodes a nuclear protein that belongs to the PBX homeobox family of transcriptional factors, and studies suggested PBX1 regulates the patterning of the cerebral cortex (Golonzhka et al., 2015) and its transcriptional network controls dopaminergic neuron development in Parkinson's disease (Villaescusa et al., 2016). EN2 encodes homeodomaincontaining proteins and has been implicated in the control of pattern formation during development of the central nervous system (Genestine et al., 2015). Hoxa9 is an important homeobox transcription factor and plays important roles in myeloid leukemogenesis (Siriboonpiputtana et al., 2017). Dux-family transcription factors were recently identified to regulate zygotic genome activation in placental mammals (De Iaco et al., 2017). Thus, PBX1 and EN2 could be the potential master TFs affected by the SNP rs8070723.

Since FeatSNP curated 1,259 transcriptome data of 13 brain tissues generated by the GTEx consortium (Gibson, 2015), we were able to further check the expression level of PBX1 and EN2 in multiple brain regions in FeatSNP database. EN2 was only expressed in the cerebellum of the brain (**Supplementary Figure S1A**) and did not correlate with expression level of MAPT (**Figure 3A**). We found that PBX1 highly expressed in different brain regions (**Supplementary Figure S1B**), and we also found the expression of MAPT had strong and specific correlation with PBX1 in multiple brain regions (**Figure 3A**), especially in anterior cingulate cortex (r = 0.808), nucleus accumbens (r = 0.768), and frontal cortex (r = 0.768) (**Figure 3B**), which were considered as major affected regions of Progressive Supranuclear Palsy (Salmon et al., 1997).

We further explored the epigenetic annotation of the genomic regions tagged by rs8070723 in 10 brain regions by using epigenome data generated from Roadmap Consortium, which were also curated in FeatSNP database. We found the regions tagged by SNP rs8070723 enriched for strong active histone modification signals including H3K4me1, H3K9ac, and H3K27ac in 8 brain tissues (**Figure 4A**). Such active histone modifications were generally associated with active enhancer and promoter functions. Chromatin epigenetic status prediction based on chromHMM (Ernst and Kellis, 2012) suggested that the regions tagged by SNP rs8070723 could be considered

as strong enhancers (**Figure 4B**). Finally, we explored the eQTL data in 13 brain tissues, and found rs8070723 was associated with several genes' expression, including MAPT (**Figures 4B,C**). MAPT gene mutations have been associated with several neurodegenerative disorders such as Alzheimer's disease and Parkinson's disease. Our result suggests that rs8070723 G allele might influence MAPT expression level by reducing the binding affinity of upstream regulatory protein PBX1, therefore providing a mechanistic association with neurodegenerative diseases including Progressive Supranuclear Palsy and Parkinson's Disease.

### CONCLUSION

In summary, FeatSNP is an interactive database providing brain-specific functional genomics resources to investigate the regulatory potential of human SNPs. This database provides a multitude types of functional annotations, including TF binding motif prediction, epigenetic landscape, expression correlation and eQTL information. We anticipate that this database will facilitate scientists to investigate the functional impact of their candidate genetic variants in a more streamlined, rapid, and efficient fashion.

### REFERENCES


### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. This data can be found here: http://www.roadmapepigenomics.org/.

### AUTHOR CONTRIBUTIONS

C-yM and BZ performed the data analysis, C-yM and PG developed the database and website. PM, TW, and BZ designed and supervised the study.

### FUNDING

This work was supported by National Institutes of Health grant DA027995, HG007175, HG007354, and Goldman Sachs Philanthropy Fund.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2019.00262/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ma, Madden, Gontarz, Wang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis

#### *Y-h. Taguchi1\* and Turki Turki2*

*1 Department of Physics, Chuo University, Tokyo, Japan, 2 Department of Computer Science, King Abdulaziz University, Jeddah, Saudi Arabia*

#### *Edited by:*

*Michael Fernandez, The Vancouver Prostate Centre, Canada*

#### *Reviewed by:*

*Yunierkis Perez, University of the Americas, Ecuador Jose Ignacio Abreu Salas, Catholic University of the Most Holy Conception, Chile*

> *\*Correspondence: Y-h. Taguchi tag@granular.com*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 26 June 2019 Accepted: 19 August 2019 Published: 19 September 2019*

#### *Citation:*

*Taguchi Y-h and Turki T (2019) Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis. Front. Genet. 10:864. doi: 10.3389/fgene.2019.00864*

Although single-cell RNA sequencing (scRNA-seq) technology is newly invented and a promising one, but because of lack of enough information that labels individual cells, it is hard to interpret the obtained gene expression of each cell. Because of insufficient information available, unsupervised clustering, for example, *t*-distributed stochastic neighbor embedding and uniform manifold approximation and projection, is usually employed to obtain low-dimensional embedding that can help to understand cell–cell relationship. One possible drawback of this strategy is that the outcome is highly dependent upon genes selected for the usage of clustering. In order to fulfill this requirement, there are many methods that performed unsupervised gene selection. In this study, a tensor decomposition (TD)-based unsupervised feature extraction (FE) was applied to the integration of two scRNA-seq expression profiles that measure human and mouse midbrain development. TD-based unsupervised FE could select not only coincident genes between human and mouse but also biologically reliable genes. Coincidence between two species as well as biological reliability of selected genes is increased compared with that using principal component analysis (PCA)-based FE applied to the same data set in the previous study. Since PCA-based unsupervised FE outperformed the other three popular unsupervised gene selection methods, highly variable genes, bimodal genes, and dpFeature, TD-based unsupervised FE can do so as well. In addition to this, 10 transcription factors (TFs) that might regulate selected genes and might contribute to midbrain development were identified. These 10 TFs, BHLHE40, EGR1, GABPA, IRF3, PPARG, REST, RFX5, STAT3, TCF7L2, and ZBTB33, were previously reported to be related to brain functions and diseases. TD-based unsupervised FE is a promising method to integrate two scRNA-seq profiles effectively.

Keywords: tensor decomposition, enrichment analysis, single-cell RNA-sequencing, midbrain development, interspecies analysis

## INTRODUCTION

Single-cell RNA sequencing (scRNA-seq) (Sasagawa et al., 2019) is a newly invented technology that enables us to measure the amount of RNA in a single-cell basis. In spite of its promising potential, it is not easy to interpret the measurements. The primary reason of this difficulty is the lack of sufficient information that characterizes individual cells. In contrast to the huge number of cells measured, which is often as many as several thousands, the number of labeling is limited, for example, measurement of conditions as well as the amount of expression of key genes measured by fluorescence-activated cell sorting, whose number is typically as little as tens. This prevents us from selecting genes that characterize the individual cell properties.

In order to deal with samples without suitable numbers of labeling, unsupervised method is frequently used, since it does not make use of labeling information directly. *K*-means clustering and hierarchical clustering are popular methodologies that are often applied to gene expression analysis. The popular clustering methods specifically applied to scRNA-seq are *t*-distributed stochastic neighbor embedding (tSNE) (van der Maaten and Hinton, 2008) and uniform manifold approximation and projection (UMAP) (McInnes et al., 2018), which are known to be useful to get low-dimensional embedding of a set of cells. In spite of that, the obtained clusters are highly dependent upon genes used for clustering. Thus, the next issue is, without labeling (i.e., pre-knowledge), to select genes that might be biologically meaningful.

The various unsupervised gene selection methods applicable to scRNA-seq were invented, for example, highly variable genes, bimodal genes, dpFeature, and principal component analysis (PCA)-based unsupervised feature extraction (FE) (Murakami et al., 2012; Taguchi and Okamoto, 2012; Taguchi and Murakami, 2013; Ishida et al., 2014; Kinoshita et al., 2014; Murakami et al., 2014; Taguchi, 2014; Taguchi and Murakami, 2014; Umeyama et al., 2014; Murakami et al., 2015; Taguchi, 2015; Taguchi et al., 2015a; Taguchi et al., 2015b; Taguchi et al., 2015c; Taguchi et al., 2016; Taguchi, 2016a; Taguchi, 2016b; Taguchi, 2016c; Taguchi, 2016d; Taguchi and Wang, 2017; Taguchi et al., 2017; Taguchi, 2017d; Taguchi and Wang, 2018a; Taguchi, 2018a; Taguchi and Wang, 2018b). Chen et al. (2018) recently compared genes selected by these methods and concluded that the genes selected are very diverse and have their own (unique) biological features. In this sense, it is required to invent more advanced unsupervised gene selection methods that can select more biologically relevant genes.

In this paper, we propose the application of tensor decomposition (TD)-based unsupervised FE (Taguchi, 2017a; Taguchi, 2017b; Taguchi, 2017c; Taguchi, 2017e; Taguchi, 2017f; Taguchi and Ng, 2018; Taguchi, 2018b; Taguchi, 2018c; Taguchi, 2019a). It is an advanced method of PCA-based unsupervised FE for scRNA-seq analysis. For more details about PCA-based unsupervised FE and TD-based unsupervised FE, see the recently published book (Taguchi, 2019b). Especially focusing on the integration of two scRNA-seq profiles, the advantages of TD-based unsupervised FE when compared with PCA-based unsupervised FE are as follows: The former can integrate more than two gene expressions prior to the analysis, while the latter can only integrate the results obtained by applying the method to individual data sets.

In the following, based on the previous study (Taguchi, 2018a) where PCA-based unsupervised FE was employed, we try to integrate human and mouse midbrain development gene expression profiles to obtain key genes that contribute to this process, by applying TD-based unsupervised FE. It turned out that TD-based unsupervised FE can identify biologically more relevant and more common genes between human and mouse than can PCA-based unsupervised FE that outperformed other compared methods.

### METHODS AND MATERIALS

#### scRNA-seq Data

#### Midbrain Development of Humans and Mice

The first scRNA-seq data used in this study were downloaded from Gene Expression Omnibus (GEO) under the GEO ID GSE76381; the files named "GSE76381\_EmbryoMoleculeCounts.cef.txt.gz" (for human) and "SE76381\_MouseEmbryoMoleculeCounts.cef. txt.gz" (for mouse) were downloaded. These two gene expression profiles were generated from scRNA-seq data set: One represents human embryo ventral midbrain cells between 6 and 11 weeks of gestation (287 cells for 6 weeks, 131 cells for 7 weeks, 331 cells for 8 weeks, 322 cells for 9 weeks, 509 cells for 10 weeks, and 397 cells for 11 weeks, for a total of 1,977 cells). Another is a set of mouse ventral midbrain cells at six developmental stages between E11.5 and E18.5 (349 cells for E11.5, 350 cells for E12.5, 345 cells for E13.5, 308 cells for E14.5, 356 cells for E15.5, 142 cells for E18.5, and 57 cells for unknown, for a total of 1,907 cells).

#### Mouse Hypothalamus With and Without Acute Formalin Stress

The second scRNA-seq data used in this study were downloaded from GEO under GEOID GSE74672; the file named "GSE74672\_ expressed\_mols\_with\_classes.xlsx.gz" was downloaded. It is generated from scRNA-seq data set that measures mouse hypothalamus with and without acute formalin stress. Various meta-data, which are included in the first 11 rows of the data set, are available. The meta-data available include sex, age, cell types [astrocytes, endothelial, ependymal, microglia, neurons, oligos, and vascular smooth muscle (VSM)], control vs stressed samples, and so on.

### TD-Based Unsupervised FE

#### Midbrain Development of Humans and Mice

TD-based unsupervised FE is a recently proposed method successfully applied to various biological problems. TD-based unsupervised FE can be used for integration of multiple measurements applied to the common set of genes. Suppose *xij* ∈ ℝ*<sup>N</sup>*×*<sup>M</sup>* and *xik* ∈ ℝ*<sup>N</sup>*×*<sup>K</sup>* are the *i*th expression of the *j*th and *k*th cells under the two distinct conditions (in the present study, they are human and mouse), respectively. Then the three-mode tensor, *xijk* ∈ ℝ*<sup>N</sup>*×*M*×*<sup>K</sup>*, where *N* (= 13,889) is total number of common genes between human and mouse, which share gene symbols,

*M* (= 1,977) is the number of human cells, and *K* (= 1,907) is total number of mouse cells, is defined as

$$
\boldsymbol{\mathfrak{x}}\_{ijk} = \boldsymbol{\mathfrak{x}}\_{ij} \cdot \boldsymbol{\mathfrak{x}}\_{ik}.\tag{1}
$$

It is Case II Type I tensor (Taguchi, 2017e). Since it is too large to be decomposed, it is further transformed into Type II tensor, as follows:

$$\boldsymbol{\omega}\_{jk} = \sum\_{i=1}^{N} \boldsymbol{\omega}\_{jk},\tag{2}$$

where *x jk M K* ∈ <sup>×</sup> is now not a tensor but a matrix. In this case, TD is equivalent to singular value decomposition (SVD). After applying SVD to *xjk* , we get SVD,

$$\varkappa\_{jk} = \sum\_{\ell=1}^{\min(M,K)} \lambda\_{\ell} \mu\_{\ell j} \nu\_{\ell k},\tag{3}$$

where *u <sup>j</sup> M M* ∈ <sup>×</sup> and *v <sup>k</sup> K K* ∈ <sup>×</sup> are singular value vectors attributed to cells of human scRNA-seq and those of mouse scRNA-seq, respectively. Here, Case II means that tensor is generated such that two matrices share the genes, while Type II means that summation is taken over as in Eq. (2). On the other hand, the tensor before taking summation as in Eq. (1) is Type I.

Singular value vectors attributed to genes of human and mouse scRNA-seq, *u <sup>i</sup> N M* ∈ <sup>×</sup> and *v <sup>i</sup> N K* ∈ <sup>×</sup> are defined as respectively.

$$
\mu\_{\ell i} = \sum\_{j=1}^{M} \mu\_{\ell j} \mathbf{x}\_{ij},
\tag{4}
$$

$$\nu\_{\ell i} = \sum\_{k=1}^{K} \nu\_{\ell k} \mathbf{x}\_{ik} \,. \tag{5}$$

In order to find genes associated with biological functions, we need to select *uℓj* and *vℓk* which are coincident with biological meaning. In this study, we employ time points of measurements as biological meanings. In other words, we seek for genes associated with time development. Since we would like to find any kind of time dependence, we simply deal with time points as un-ordered labeling. Thus, we apply categorical regression

$$u\_{\ell j} = a\_{\ell} + \sum\_{t=1}^{r} a\_{\ell t} \mathfrak{S}\_{jt},\tag{6}$$

(*T* = 6; *t* = 1 to *T*, which correspond to 6, 7, 8, 9, 10, and 11 weeks; see *Methods and Materials*) or

$$\nu\_{\ell k} = b\_{\ell} + \sum\_{t=1}^{T} b\_{\ell t} \mathfrak{S}\_{kt},\tag{7}$$

(*T* = 7; *t* = 1 to *T*, which correspond to E11.5, E12.5, E13.5, E14.5, F15.5, E18.5, and unknown; see *Methods and Materials*), where δ δ *jt* ( ) *kt* =1 when the *j*th (*k*th) cell is taken from the *t*th time point otherwise δ δ *jt* ( ) *kt* = 0 . *aℓ*, *aℓt*, *bℓ* and *bℓt* are the regression coefficients.

*P*-values are attributed to *ℓ*th singular value vectors using the above categorical regression [lm function in R (R Core Team, 2018) is used to compute *P*-values]. *P*-values attributed to singular value vectors are corrected by Benjamini-Hochberg (BH) criterion (Benjamini and Hochberg, 1995). Singular value vectors associated with corrected *P*-values of less than 0.01 are selected for the download analysis. Hereafter, the set of selected singular value vectors of human and mouse is denoted as Ω human and Ω mouse , respectively.

*P*-values are attributed to genes with assuming χ2 distribution for the gene singular value vectors, *uℓi* and *vℓi*, corresponding to the cell singular value vectors selected by categorical regression as

$$P\_i^{\text{human}} = P\_{\chi^2} \left[ > \sum\_{\ell \in \text{all}\_\ell^{\text{human}}} \left( \frac{u\_{\ell i} - \left< u\_{\ell i} \right>}{\mathfrak{o}\_\ell^{\text{human}}} \right)^2 \right] \tag{8}$$

for human genes and

$$P\_i^{\text{mouse}} = P\_{\chi^2} \left\lfloor > \left. \sum\_{\ell \in \Omega\_\ell^{\text{mouse}}} \left( \frac{\nu\_{\ell i} - \left(\nu\_{\ell i}\right)}{\sigma\_\ell^{\text{mouse}}} \right)^2 \right\rfloor \right\rfloor \tag{9}$$

for mouse genes, respectively. Here,

$$
\left< \mu\_{\ell i} \right> = \frac{1}{N} \sum\_{i=1}^{N} \mu\_{\ell i} \tag{10}
$$

and

$$
\left< \boldsymbol{\nu}\_{\ell i} \right> = \frac{1}{N} \sum\_{i=1}^{N} \boldsymbol{\nu}\_{\ell i} \,. \tag{11}
$$

σ human and σ mouse are the standard deviations of *ℓ*th gene singular value vectors for human and mouse, respectively, Ω human and Ω mouse are sets of *ℓ*s, selected by categorical regression for human [Eq. (6)] and mouse [Eq.(7)], respectively. *P x* <sup>χ</sup><sup>2</sup> [ ] > is the cumulative probability of χ2 distribution when the argument takes values larger than *x*. *Pi* human and *Pi* mouse are corrected by BH criterion, and genes associated with corrected *P*-values of less than 0.01 are selected.

#### Mouse Hypothalamus With and Without Acute Formalin Stress

The application of TD-based unsupervised FE to mouse hypothalamus is quite similar to that of mouse and human midbrain. There are also two matrices, *xij N M* ∈ <sup>×</sup> and *xik N K* ∈ <sup>×</sup> which correspond to the *i*th expression of the *j*th and *k*th cells under the two distinct conditions (in the present case, they are without and with acute formalin stress, respectively); *N*=24,341,*M*=1,785 and *K*=1,096. Case II Type II tensor, *xjk,* was also generated using Eqs. (1) and (2), and SVD was applied to *xjk* as Eq. (3). Then singular value vectors attributed to genes of samples without and with acute formalin stress, *uℓi* and *vℓi,* were computed by Eqs. (4) and (5). We also applied categorical regressions to *uℓi*. and *vℓi*, although categories considered here are not time points but cell types. Then categorical regressions applied to *uℓi* and *vℓi* in mouse hypothalamus without and with acute formalin stress are

$$
\mu\_{\ell j} = a\_{\ell} + \sum\_{s=1}^{7} a\_{\ell s} \mathfrak{S}\_{j s}, \tag{12}
$$

$$\nu\_{\ell k} = b\_{\ell} + \sum\_{s=1}^{7} b\_{\ell s} \delta\_{ks},\tag{13}$$

where *s* stands for one of seven cell types mentioned in Methods and Materials and δ δ *js* ( ) *ks* =1 when the *j*th (*k*th) cell is taken from the *s*th cell types otherwise δ δ *js* ( ) *ks* = 0 . **Table 1** lists the number of cells in these categories. The remaining procedures to select genes associated with identified cell type dependency are exactly the same as those in midbrain case.

#### Enrichment Analyses

Various enrichment analysis methods are performed with separate uploading selected human and mouse gene symbols, or genes selected commonly between samples without and samples with acute formalin stress, to Enrichr (Kuleshov et al., 2016).

#### RESULTS

#### Midbrain Development of Humans and Mice

As a result, following the procedure described in the *Methods and Materials*, we identified 55 and 44 singular value vectors attributed to cells, *uℓj*s and *vℓk*s for human and mouse, respectively.

TABLE 1 | The number of cells that belong to either without or with acute formalin stress or cell types.


*VSM, vascular smooth muscle.*

One possible validation of selected *uℓj*s and *vℓk*s is coincidence. Although cells measured are not related between human and mouse at all, if SVD works well, corresponding singular value vectors (i.e., *uℓj* and *vℓk* sharing the same *ℓ*s) attributed to cells should share something biological, for example, time dependence. This suggests that it is more likely that corresponding singular value vectors attributed to cells, *uℓj* and *vℓk,* are simultaneously associated with significant *P*-values computed by categorical regression. As expected, they are highly significantly correlated. **Table 2** shows confusion matrix of the coincidence of selected singular value vectors between human and mouse. For human cells, only the top 1,907 singular value vectors among all 1,977 singular value vectors are considered, since the total number of singular value vectors attributed to mouse cells is 1,907.

**Figure 1** shows the coincidence of selected singular value vectors between human and mouse. Singular value vectors with smaller *ℓ*s, that is, with more contributions, are more likely selected and coincident between human and mouse. This can be the side evidence that guarantees that TD-based unsupervised FE successfully integrated human and mouse scRNA-seq data.

Next, we selected genes with following the procedures described in *Methods and Materials.* (The list of genes is available as **Supplementary Data Sheet 1** and **2**). The first validation of selected genes is the coincidence between human and mouse. In Taguchi's previous study (Taguchi, 2018a), more number of common genes were selected by PCA-based unsupervised FE than other methods compared, that is, highly variables genes, bimodal genes, and dpFeature. **Table 3** shows the confusion matrix that describes the coincidence of selected genes between human and mouse. Odds ratio is as large as 133, and *P*-value is 0 (i.e., less than numerical accuracy), which is significantly better than coincidence of selected genes between human and mouse (53 common genes between 116 genes selected for human and 118 genes selected mouse), previously achieved by PCA-based unsupervised FE (Taguchi, 2018a), which outperformed other methods, that is, highly variable genes, bimodal genes, and dpFeature.

On the other hand, most of the genes selected by PCA-based unsupervised FE in the previous study (Taguchi, 2018a) are included in the genes selected by TD-based unsupervised FE in the present study. One hundred two genes are selected by TD-based unsupervised FE among 116 human genes selected by PCAbased unsupervised FE in the previous study (Taguchi, 2018a),

TABLE 2 | Confusion matrix of coincidence between selected 55 singular value vectors selected among all 1,977 singular value vectors, *uℓ<sup>j</sup> ,* attributed to human cells and 44 singular value vectors selected among all 1907 singular value vectors, *vℓk,* attributed to mouse cells.


*Selected: corrected P-values, computed with regression analysis [Eqs. (6) and (7)], are less than 0.01. Not selected: otherwise. Odds ratio is as many as 227, and P-values computed by Fisher's exact test are 1.44×10-44.*

TABLE 3 | Confusion matrix of coincidence between selected 456 genes for human and selected 505 genes for mouse among all 13,384 common genes.


*Selected: corrected P-values, computed with* χ*2 distribution [Eqs. (8) and (9)], are less than 0.01. Not selected: otherwise. Odds ratio is as many as 133, and P-values computed by Fisher's exact test are 0 (i.e., less than numerical accuracy).*

while 91 genes are selected by TD-based unsupervised FE among 118 mouse genes by PCA-based unsupervised FE. Thus, TD-based unsupervised FE is quite consistent with PCA-based unsupervised FE.

Biological significance tested by enrichment analysis is further enhanced (Full list of enrichment analysis is available as **Supplementary Tables 1** and **2**). Most remarkable advance achieved by TD-based unsupervised FE is "Allen Brain Atlas," to which only downregulated genes were enriched in the previous study (Taguchi, 2018a). As can be seen in **Table 4**, now much enrichment is associated with upregulated genes. In addition to this, most of the five top-ranked terms are related to paraventricular nucleus, which is adjusted to midbrain. This suggests that TD-based unsupervised FE successfully identified genes related to midbrain.

In addition to this, "Jensen TISSUES" (**Table 5**) for Embryonic\_brain is highly enhanced [i.e., more significant (smaller), with *P*-values ~10-100 which were as large as 10-10 to 10-20 in the previous study (Taguchi, 2018a)]. On the other hand, "ARCHS4 tissues" also strongly supports the biological reliability of selected genes (**Table 6**). The term "MIDBRAIN" is enriched highly, and it is top ranked for both human and mouse.

There is some brain-related enrichment found in other categories, although it is not strong enough compared with that of the top three. Brain-related terms in "GTEx Tissue Sample Gene Expression Profiles up" (**Table 7**) are also enhanced for mouse brain (top three terms are brain), although no brain terms are enriched within five top-ranked terms for human (this discrepancy cannot be understood at the moment). On the contrary, brain-related terms in "MGI Mammalian Phenotype

TABLE 4 | Five top-ranked terms from "Allen Brain Atlas up" by Enrichr for selected 456 human genes and 505 mouse genes.


TABLE 5 | Enrichment of embryonic brain by "JENSEN TISSUES" in Enrichr.


TABLE 6 | Enrichment of embryonic brain by "ARCHS4 Tissues" in Enrichr.


2017" (**Table 8**) are enhanced for human brain (fourth and fifth ranks), although no brain terms are enriched within the five top-ranked terms for mouse (this discrepancy also cannot be understood at the moment). The above observations suggest that TD-based unsupervised FE could identify genes related to mouse and human embryonic midbrain.

We also uploaded selected 456 human genes and 505 mouse genes to STRING server (Szklarczyk et al., 2014), which evaluates protein–protein interaction (PPI) enrichment. Among 456 human genes, 7,488 PPI are reported, while the expected number of PPI is as small as 3,524 (*P*-value is less than 1×10-6). Among 505 mouse genes, 6,788 PPI are reported, while the expected number of PPI is as small as 3,290 (*P*-value is again less than 1×10-6). Thus, TD-based unsupervised FE can successfully identify significantly interacting protein-coding genes.

Finally, we checked if transcription factors (TFs) that target selected genes are common between human and mouse (**Table 9**). These TFs are associated with adjusted *P*-values of less than 0.01 in "ENCODE and ChEA Consensus TFs from ChIP-X" of Enrichr. They are highly overlapped between human and mouse

TABLE 7 | Five top-ranked terms from "GTEx Tissue Sample Gene Expression Profiles up" by Enrichr for selected 456 human genes and 505 mouse genes. Brainrelated terms are asterisked.


TABLE 8 | Five top-ranked terms from "MGI Mammalian Phenotype 2017" by Enrichr for selected 456 human genes and 505 mouse genes. Brain-related terms are asterisked.




*TFs, transcription factors.*

(there are 10 common TFs between 16 TFs found in human and 24 TFs found in mouse). Although selected TFs are very distinct from those in the previous study (Taguchi, 2018a), they are highly interrelated with each other (see below). These TFs are uploaded to the regnetworkweb server (Liu et al., 2015), and TF networks shown in **Figure 2** are identified. Clearly, even partially, these TFs interact highly with each other.

We also checked if the 10 commonly selected TFs (in bold in **Table 9**) are related to brains. Lack of BHLHE40 was found to result in brain malfunction (Hamilton et al., 2018). The function of EGR1 was found in embryonic rat brain (Wells et al., 2011). GABPA is essential for human cognition (Reiff et al., 2014). IRF3 is related to brain disease (Schultz et al., 2019). PPAR, which PPARG belongs to, is believed to be the therapeutic target of neurodegenerative diseases (Warden et al., 2016). REST is a master regulator of neurogenesis (Mozzi et al., 2017). RFX5 is known to be expressive in fetal brain (Sugiaman-Trapman et al., 2018). STAT3 promotes brain metastasis (Priego et al., 2018). TCF7L2 regulates brain gene expression (Shao et al., 2013). ZBTB33 affects the mouse behavior through regulating brain gene expression (Kulikov et al., 2016). Thus, all 10 commonly selected TFs are related to brains.

#### Mouse Hypothalamus With and Without Acute Formalin Stress

Although the effectiveness of the proposed strategy toward scRNAseq is obvious in the results shown in the previous subsection, one might wonder if it is accidental. In order to dispel such doubts, we apply TD-based unsupervised FE to yet another scRNA-seq data set: mouse hypothalamus with and without acute formalin stress. Contrary to the data set analyzed in the previous subsection where very distant two data sets were analyzed, the data sets analyzed here are very close to each other. Both data sets are taken from the same tissue of mouse, hypothalamus. The only difference is if they are stressed by formalin dope or not. The motivation why we here specifically apply TD-based unsupervised FE to two close data sets is as follows: When two data sets are too close, it might be difficult to identify which genes are commonly altered by additional condition, in this case, the dependence upon cell types, because all genes might behave equally between the two. Thus, it is not a bad idea to check if TD-based unsupervised FE can work well when not only very distant data sets are analyzed but also very close data sets are analyzed.

With following the procedure described in the Materials and Methods, we identified 30 and 24 singular value vectors attributed to cells, *uℓj*s and *vℓk*s, without and with acute formalin stress, respectively. We again applied Fisher's exact test (**Table 10**). Although odds ratio is 10 times larger than that in **Table 2**, *P* -value is even smaller than that in **Table 2**; this suggests that TD-based unsupervised FE could

identify not all of genes but only limited genes as being common between two experimental conditions: without and with stress.

**Figure 3** shows the coincidence of selected singular value vectors between samples without and with stress. Singular value vectors with smaller *ℓ*s, that is, with more contributions, are more likely selected and coincident between samples without and with stress. This can be the side evidence that guarantees that TD-based unsupervised FE successfully integrated scRNA-seq data taken from samples without and with stress while avoiding to regard that all are coincident between two samples.

Next, we selected genes with following the procedures described in the *Methods and Materials*. The first validation of selected genes is the coincidence between human and mouse. **Table 11** shows the confusion matrix that describes the coincidence of selected genes between samples without and with stress. Odds ratio is as large as 270, and *P*-value is 0 (i.e., less than numerical accuracy). Thus, as expected, TD-based unsupervised FE could not identify all genes but only a limited number of genes associated with cell-type dependence.

Finally, we tried to evaluate if genes selected are tissue type specific, that is, hypothalamus. We have uploaded 3,324 commonly

TABLE 10 | Confusion matrix of coincidence between selected 30 singular value vectors selected among all 1,096 singular value vectors, *uℓ<sup>j</sup>* ,attributed to samples without stress and 24 singular value vectors selected among all 1,096 singular value vectors, *vℓ<sup>k</sup>* attributed to samples with stress.


*For samples without stress, only the top 1,096 singular value vectors among all 1,785 singular value vectors are considered, since total number of singular value vectors attributed to samples without stress is 1,096. Selected: corrected P -values, computed with regression analysis (Eqs. (12) and (*13*)), are less than 0.01. Not selected: otherwise. Odds ratio is as many as 2,483, and P-values computed by Fisher's exact test are 1.92×10-40.*

TABLE 11 | Confusion matrix of coincidence between selected 4,150 genes for samples without stress and selected 3,621 genes for samples with stress among all 24,341 genes.


*Selected: corrected P -values, computed with* χ*2s atribution that corresponds to Eqs. (8) and (9) in human and mouse midbrain study, are less than 0.01. Not selected: otherwise. Odds ratio is as many as 270, and P-values computed by Fisher's exact test are 0 (i.e., less than numerical accuracy).*

selected genes to Enrichr. "GTEx Tissue Sample Gene Expression Profiles up" suggest that all five top-ranked terms are brain with high significance (**Table 12**, adjusted *P*-values are less than 1×10-130). This suggests that TD-based unsupervised FE successfully identified limited number of genes related to brains even using closely related samples. In order to be more specific, we checked "Allen Brain Atlas up" in Enrichr. Then we found that all five top-ranked terms are hypothalamic (**Table 13**). It is interesting that TD-based unsupervised FE could successfully identify hypothalamus-specific genes by only using scRNA-seq retrieved from hypothalamus. It is usually required to use data taken from other tissues in order to identify tissue-specific genes because we need to compare targeted tissues and not targeted tissues in order to identify genes expressed specifically in target tissues. The successful identification of genes specific to something without using the comparison with other samples was also observed previously during an attempt to identify tumor-specific genes by TD-based unsupervised FE (Taguchi, 2017c). In this sense, TD-based unsupervised FE methods are effective not only when genes common between two distinct conditions are sought but also when genes common between two closely related conditions are sought. Thus, it is unlikely that the success of a TD-based unsupervised method applied to scRNA-seq is accidental.

TABLE 12 | Five top-ranked terms from "GTEx Tissue Sample Gene Expression Profiles up" by Enrichr for 3,324 genes selected commonly between samples without and with stress.


TABLE 13 | Five top-ranked terms from "Allen Brain Atlas up" by Enrichr for 3,324 genes selected commonly between samples without and with stress.


#### DISCUSSIONS AND FUTURE WORK

In this study, we applied TD-based unsupervised FE to the integration of scRNA-seq data sets taken from two species: human and mouse. In the sense of identification of biologically more relevant set of genes, TD-based unsupervised FE can outperform PCA-based unsupervised FE that previously (Taguchi, 2018a) could outperform three more popular methods: highly variable genes, bimodal genes, and dpFeature. Thus, it is expected that TD-based unsupervised FE can do so, too.

For the purpose of integration of two scRNA-seq data sets, TD-based unsupervised FE has many advantages than the other four methods, that is, PCA-based unsupervised FE, highly variable genes, bimodal genes, and dpFeature. At first, TD-based unsupervised FE can integrate two scRNA-seq data sets, not after but before the selection of genes. This enabled us to identify more coincident gene sets between two scRNA-seq in this study of human and mouse. As a result, we were able to identify more coincident results between human and mouse.

The criteria of gene selection are quite robust; they should be dependent upon time points when they are measured. We did not have to specify how they are actually correlated with time. It is another advantage of TD-based unsupervised FE.

By applying enrichment analysis to the genes selected, we found many valuable insights about the biological process. As a result, we identified 10 key TFs that might regulate embryonic midbrain developments. All of the 10 selected TFs turned out to be related to brains.

TD-based unsupervised FE turned out to be quite effective to integrate two scRNA-seq data sets. This method should be applied to various scRNA-seq data sets considering broader scope of investigations.

In future work, we plan to (1) utilize the proposed TD-based unsupervised FE under the transfer learning setting; (2) extend the proposed approach to handle the data integration from multiple related tasks; and (3) investigate the performance of the proposed approach when coupled with machine and deep learning algorithms.

### DATA AVAILABILITY

The data sets analyzed for this study can be found in the GEO.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE76381.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE74672

### AUTHOR CONTRIBUTIONS

Y-HT planned the research, performed analyses, and wrote a paper. TT discussed the results and wrote a paper.

### FUNDING

This study was supported by KAKENHI (17K00417 and 19H05270) and Okawa Foundation (grant number 17-10). This project was also funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, under grant no. KEP-8-611-38. The authors, therefore, acknowledge with thanks DSR for technical and financial support.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00864/ full#supplementary-material

#### REFERENCES


expression and glucose homeostasis. *Diabetes* 62, 789–800. doi: 10.2337/ db12-0365


discovery for posttraumatic stress disorder-mediated heart disease. *BMC Bioinform.* 16, 139. doi: 10.1186/s12859-015-0574-4


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Taguchi and Turki. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Identification and Analysis of Long Repeats of Proteins at the Domain Level

#### David Mary Rajathei, Subbiah Parthasarathy and Samuel Selvaraj\*

*Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, India*

#### Edited by:

*Shandar Ahmad, Jawaharlal Nehru University, India*

#### Reviewed by:

*Michael Gromiha, Indian Institute of Technology Madras, India Vladimir N. Uversky, University of South Florida, United States Tadashi Satoh, Graduate School of Pharmaceutical Sciences, Nagoya City University, Japan Bratati Kahali, Indian Institute of Science, India*

\*Correspondence:

*Samuel Selvaraj selvarajsamuel@gmail.com*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology*

Received: *18 April 2019* Accepted: *16 September 2019* Published: *08 October 2019*

#### Citation:

*Rajathei DM, Parthasarathy S and Selvaraj S (2019) Identification and Analysis of Long Repeats of Proteins at the Domain Level. Front. Bioeng. Biotechnol. 7:250. doi: 10.3389/fbioe.2019.00250* Amino acid repeats play an important role in the structure and function of proteins. Analysis of long repeats in protein sequences enables one to understand their abundance, structure and function in the protein universe. In the present study, amino acid repeats of length >50 (long repeats) were identified in a non-redundant set of UniProt sequences using the RADAR program. The underlying structures and functions of these long repeats were carried out using the Gene3D for structural domains, Pfam for functional domains and enzyme and non-enzyme functional classification for catalytic and binding of the proteins. From a structural perspective, these long repeats seem to predominantly occur in certain architectures such as sandwich, bundle, barrel, and roll and within these architectures abundant in the superfolds. The lengths of the repeats within each fold are not uniform exhibiting different structures for different functions. We also observed that long repeats are in the domain regions of the family and are involved in the function of the proteins. After grouping based on enzyme and non-enzyme classes, we observed the abundant occurrence of long repeats in specific catalytic and binding of the proteins. In this study, we have analyzed the occurrence of long repeats in the protein sequence universe apart from well-characterized short tandem repeats in sequences and their structures and functions of the proteins at the domain level. The present study suggests that long repeats may play an important role in the structure and function of domains of the proteins.

Keywords: long repeats, protein, domain, protein family, enzyme and non-enzyme classes, structural fold

## INTRODUCTION

Amino acid repeats are ubiquitous in protein sequences that often correspond to structural and functional units of proteins. The length of these repeats varies considerably from shorter units of homo repeats of single amino acid (Jorda and Kajava, 2010), oligopeptide repeats of 2–20 residues (Fraser and MacRae, 1973) and solenoid repeats of 20–40 residues to larger repetitions of length >50 called domain repeats (Andrade et al., 2001). These repeats occur as a single pair or as multiple copies in a tandem/non-tandem manner that are useful for structural packing or for one or more interactions with ligand (Katti et al., 2000; Luo and Nijveen, 2014). It has been observed that many proteins of length >500 contain internal repeats, suggesting the importance of repeats in producing larger proteins (Marcotte et al., 1998). However, these repeats possess weak identities due to extensive divergence, but retain similar folds and functions of the proteins (Holm and Sander, 1993). It has also been found out that long stretches of perfect repetitions are infrequent in protein sequences even though they are folded into recurrent structural motifs (Turjanski et al., 2016). Many methods and algorithms, such as Fourier transformation, short string extension, sequencesequence alignment, and sequence profiles comparison have been introduced for the identification of such diverged sequence repeats with insertion and deletion without prior knowledge. Web based servers such as the Internal Repeat Finder, RADAR, REPRO, TRUST, XSTREAM, HHRepID, T-REKS, and PTRStalker (Pellegrini et al., 1999, 2012; George and Heringa, 2000; Heger and Holm, 2000; Szklarczyk and Heringa, 2004; Newman and Cooper, 2007; Biegert and Söding, 2008; Jorda and Kajava, 2009) have been developed by implementing the above techniques to detect amino acid repeats in proteins.

Earlier, proteins containing homo repeats (Jorda and Kajava, 2010), fibrous repeats (Fraser and MacRae, 1973) and different well-characterized repeats types, namely tetratricopeptide, leucine-rich, ankyrin and armadillo/heat etc. (Fraser and MacRae, 1973; Yoder et al., 1993; Groves and Barford, 1999; Kobe and Kajava, 2001), possessing different structures and functions have been analyzed (Andrade et al., 2001). Further, short units of repeats in tandem that form repeats in the structural folds of solenoids (α, β, α/β), β-trefoil (Murzin et al., 1992; Ponting and Russell, 2000), β-prisms (Chothia and Murzin, 1993; Bourne et al., 1999), and β-propellers (Bork and Doolittle, 1994; Neer et al., 1994) have been reviewed (Kajava, 2012). Recently, a detailed analysis and classification of β-hairpin repeat structures has been carried out (Roche et al., 2017). Also, it has been pointed out that short tandem repeats accumulate in the intrinsically disordered regions (IDR) (van der Lee et al., 2014) and play an important role in protein interactions and stability (Tompa, 2012; Habchi et al., 2014).

Analysis of larger proteins has demonstrated that significant portions of proteins are composed of domains. They are the conserved parts of proteins which can fold and function independently. The folded domains can either serve as modules for building up large assemblies or provide specific catalytic enzyme functions or bindings of the proteins. It has been found that repeats of a length >50 residues often correspond to conserved regions that are present in proteins as single or multiple copies for the function of the proteins (Hemalatha et al., 2007). Our analysis of sequence repeats of the proteins with known 3D structures in the PDB (Berman et al., 2014) has shown that they retain similar folds in spite of divergences, in order to conserve the structure and function of the proteins and, repeats that are in the single/two domains from the same family contain conserved motifs for the function of the proteins (Mary Rajathei and Selvaraj, 2013). Further, the conservation of interresidues interactions in domain repeats have been analyzed in terms of long-range contact, surrounding hydrophobicity and pair-wise interaction energy (Mary et al., 2015). A database IR-PDB for repeats in the sequence of the proteins in the PDB has been developed for the analysis of impact of repeats in proteins (Selvaraj and Rajathei, 2017).

The widely used sequence database UniProtKB (UniProt Consortium T, 2017) contains more than 500,000 sequences that are annotated with well-characterized repeats of tetratricopeptide, leucine-rich repeats, ankyrin, and armadillo/heat etc. However, there has been no survey of repeats of length>50 in the UniProt sequences, which may provide insights into their role in the structure, function and evolution of the proteins. In the present study, we have analyzed the occurrence of long repeats and their underlying structures and functions in a non-redundant set of UniProt sequences. Since repeats of size exceeding 50 residues are large enough to fold independently into stable domains (Kajava, 2012), we used Gene3D for structural domains, Pfam for functional domains and enzyme and non-enzyme functions for specific catalytic and binding for their structure and function of long repeats proteins. It was found that long repeats occur in about 23% of the considered proteins. Analysis of the structure of long repeats reveals that these repeats are predominantly observed in the structural folds of sandwich, bundle, barrel and roll. We observed that repeats in the domains for the function of the proteins. Further, we observed that long repeats tend to occur both in enzyme and non-enzyme functions of proteins. While long repeats are found in all the major enzyme classes, these are more abundant among both ligases and isomerases. Among the non-enzyme proteins, such as DNA binding, metal binding, calcium binding, and Nucleotide binding (NP), these repeats are observed more in Nucleotide binding and DNA binding proteins. The present analysis shows that the occurrence of long repeats and their structures and functions of the proteins at the domain level.

### MATERIALS AND METHODS

### Data Collection

A collection of 555,100 proteins along with their assigned UniProt ID, amino acid sequence, protein name, protein family, enzyme function, and non-enzyme functions such as DNA binding, calcium binding, metal binding, and NP binding, as well as other annotation of the sequences from the databases of Pfam, Gene3D, PDB, and DisProt, was downloaded from UniProtKB/Swiss-Prot (UniProt Consortium T, 2017) and stored in a file. The Pfam is a database of protein domain families that assigns the domains, as well as their functional regions (Finn et al., 2014). Gene3D (Lewis et al., 2018), is a database that assigns the structure of the protein according to CATH hierarchy of class, architecture and fold in numerical values (Dawson et al., 2017). At the class level (C), the numerical value 1 is for all alpha class, 2 for all beta and 3 for a mixture of alpha and beta. Likewise, the numerical values are assigned for Architecture level (A) based on secondary structure arrangement in 3-D space and for Topology/Fold level (T) based on the connection of secondary structural elements. The PDB ID's of the 3D structure known proteins were obtained from the PDB database (http://www. rcsb.org/pdb/home/home.do). The intrinsic disordered regions of the proteins that were extracted from the literature are available in the DisProt database (Piovesan et al., 2017). A nonredundant representative set of 126,945 sequences that share <50% sequence identity was obtained by clustering the 555,100 sequences using the web server CD-HIT (Fu et al., 2012). The overall work-flow is summarized as a flowchart (**Figure 1**).

## Finding Sequence Repeats of the Proteins Using RADAR

The presence of internal repeats in each protein sequences was identified using the repeat detection program RADAR (Heger and Holm, 2000), which was downloaded from the URL (https:// sourceforge.net/projects/repeatradar). The RADAR program is efficient for ab initio detection of repeats of length >15 in a single sequence by aligning the sequence against itself, as well as by generating the sequence profile using multiple sequence alignment. RADAR evaluates the statistical significance of the observed repeats by measuring a Z-score for each repeat unit (McLachlan, 1983; Heringa and Argos, 1993). The Z-score of a repeat unit is the number of standard deviations of the repeat unit score above the mean. The score of each unit is determined from a profile derived from the multiple alignment of repeat unit without considering end-gaps. Repeats with Z-scores threshold of > 6 are reported by the RADAR program. An in-house Perl program that incorporated the RADAR executable was written to detect internal repeats of all sequences in the dataset in a single run. Proteins containing repeats of length >50 were considered for further analysis.

### Finding the Structure of Long Repeats Proteins

The UniProt ID's of proteins having long repeats were extracted and their Gene3D structural domain-based assignments of the proteins were extracted using a Perl program. Then, the name of class, architecture and fold of the protein was found out by using CATH search and grouped according to their name for the further analysis of architecture and fold of the protein with repeats.

## Finding the Functional Domains of Long Repeats Proteins

The UniProt ID's of long repeat proteins were extracted and their assigned Pfam domains of the sequences were identified. The domain regions and their functional residues information of the proteins were found out using Pfam database search (Finn et al., 2014), and repeats in the domain regions were identified by manual search. The level of similarity of the repeats within a protein and within a protein family was found out in terms of % sequence identity through using the Needleman–Wunsch algorithm (Needleman and Wunsch, 1970) implemented in the ggsearch36 program of the FASTA-36.3.5b package (Henikoff and Henikoff, 1992). Needleman–Wunsch alignment scores were calculated using the BLOSUM50 scoring matrix (Pearson, 2000) with a penalty of −12 for gap opening and −2 for gap extension. Further, the repeats in domains of the proteins were also analyzed for their functional involvement at the structure of the proteins using the server PDBsum by giving PDB ID as input (Laskowski et al., 2018).

Finding the Enzyme and Non-enzyme Functions of Long Repeats Proteins

3 in most of long repeat proteins (B).

The assigned enzyme numbers (EC) of long repeats proteins were extracted. The EC number of the protein at the first level corresponds to seven enzyme classes of Oxidoreductases (EC 1), Transferases (EC 2), Hydrolases (EC 3), Lyases (EC 4), Isomerases (EC 5), Ligases (EC 6), and finally, Translocases (EC 7). The enzyme numbers were extracted and grouped according their numbers for further analysis. The non-enzyme proteins that are assigned with DNA binding, calcium binding, metal binding and NP binding were also extracted and grouped according to their name.

## RESULTS

### Abundance of Proteins Having Long Repeats

The presence of amino acid repeats of length >15 was found out in 85,726 (67%) out of non-redundant set of 126,945 UniProt protein sequences (**Supplementary Data File 1**). The long repeats were found out in 29,768 (35%) proteins. These repeats are present as a single pair or multiple copies of repeats in tandem/non-tandem manner. For example, Nacetylmuramoyl-L-alanine amidase Rv3717 protein (UniProt ID: I6Y4D2) of length 241 contains a single pair of tandem repeats of length 96 in the continuous region of 12–116/118–226. Complement control protein C3 (P68639) of length 263 has three copies of repeats of length 55 in tandem (81–142/143– 200/201–254), whereas Transcriptional regulatory protein TyrR (UniProt ID: P44694) of length 318 contains a single pair of non-tandem repeats of length 76 in the discontinuous region of (19–99/200–279). The length of the repeats varies in the range of 51–1759 and lengths of >1,000 are mostly found out in enzyme proteins. For example, the non-ribosomal peptide synthetase 1 (Q4WT66) of sequence length 6,269 contains repeats of length 1,546 (277–905/906–2,334/2,335– 3,465/3,466–4,590/4,593–5,634). Through the analysis of length distribution of long repeats, as well as their repeat number distribution of long repeats against the number of proteins (**Figure 2**), we observed that the lengths <200 are observed in more than 90% of the proteins with an average of 100 residues, and repeated in 2–5 number of times with repeat numbers of 2 (61%) and 3 (26%) in most of the long repeat proteins. The Z-score values of the repeats were extracted and found that 74,089 out of 74,154 repeat units have Zscores >6. Among these, 66,400 repeat units have Z-scores of >20. This suggests that most of the observed repeats are statistically significant.

### Analysis the Structure of Long Repeats Proteins

The structural class, architecture and fold of the 14,176 proteins (48%) have been found out using structural domain based Gene3D assignments. Among these, some proteins are having two or more Gene3D assignments. In this study, 10,504 proteins that contained a single Gene3D assignment were considered for further analysis (**Supplementary Data File 2)**. For example, Annexin A1 (P04083) protein contains a single Gene3D assignment of 1.10.220, which means that this protein belongs to class alpha (1) of orthogonal bundle architecture (10) with Annexin V domain fold (220).

### Analysis of Long Repeats at the Architecture Level

According to CATH domain-based hierarchy (http://www. cathdb.info/browse/tree), the presence of long repeats in different architectures of alpha (α), beta (β), and alpha/beta (α/β) class proteins was observed (**Figure 3**). Out of five architectures of α class, these were observed in the four architectures, namely orthogonal bundle, up-down bundle, α horseshoe and

α/α barrel. Among these, substantial numbers were present in the architectures of bundle and horseshoe. Under β class, the repeats were present in 13 out of 20 architectures and the sandwich, propeller, roll and barrel were observed most. Likewise, repeats were found in 10 out of 14 architectures of α/β class and the architectures of 3-layer (αβα) sandwich, 2-layer sandwich, α/β barrel and αβ-complex were observed most. By combining the architectures from different classes of proteins, repeats in specific architectures of sandwich, bundle, barrel and roll compared to other architectures were found out.

### Analysis of Long Repeats at the Fold Level

The existence of repeats in different folds of sandwich, bundle, barrel and roll architectures was found out. Repeats were observed in 84 out of 287 folds in orthogonal bundle and 32 out of 101 folds in up-down bundle of α class. At the β class, 12 out of 43 folds in β sandwich, 18 out of 48 folds in β barrel, and 13 out of 40 folds in β roll architecture of the proteins were having repeats. Under α/β class, repeats in 47 out of 126 folds under 3-layer (αβα) sandwich, 57 out of 224 under 2-layer sandwich, 6 out of the 18 folds under α/β barrel and 16 out of 58 folds under α/β roll were observed. Among that, some folds were observed in a greater number of proteins compared to other folds (**Figure 4**). In α class, the Arc Repressor Mutant Subunit A fold and four Helix Bundle fold of bundle architectures were observed in most of the proteins compared to other folds (**Table 1**). Under β class, the Immunoglobulinlike fold and Jelly Roll fold of β-sandwich, PH-domain fold of β-roll, and OB fold of β-barrel were observed most. The Rossmann fold in 3-layer (αβα) sandwich, TIM Barrel in αβ barrel and Herpes Virus-1 followed by Alpha-Beta plaits fold of 2-layer Sandwich, and Ubiquitin-like (UB roll) of αβ roll were observed most. The results reveal the predominant occurrence of long repeats in the diverse structure exhibiting folds of the proteins.

## Analysis of Long Repeats for Structural Repeats

The long repeats in proteins with known 3-D structure (as available from UniProt annotation) were analyzed for structural repeats. The proteins with tandem repeats were found out and analyzed at the structural level. We observed long tandem repeats form structural repeats in the folds of up-down and orthogonal bundle of α-class, Immunoglobulin, Jelly Roll and OB fold of β-class, Rossmann fold, TIM barrel, α/β plait, and UB roll of α/β class. **Figure 5** shows the structural repeats of the proteins in the folds of up-down and orthogonal bundle of α-class, Immunoglobulin, Jelly Roll and OB fold of β-class, Rossmann fold, TIM barrel, α/β plait, and UB roll of α/β class. Further, we found out that the lengths of the repeats are not uniform and vary considerably within each fold. **Figure 6** shows the considerable variation in lengths, as well as in the secondary structures of different proteins possessing the Rossmann fold that usually contains βαβαβ secondary structure arrangements. The Desulfovibrio vulgaris CbiK(P) Cobaltochelatase (PDB ID: 2XVY) contains two repeats of βαβαβαβ secondary structure arrangement of length 103 (**Figure 6A**) (Malay et al., 2009), whereas, another protein Thermoplasma volcanium Phosphoribosyl pyrophosphate synthetase (PDBID: 3MBI) contains two repeats



Virus-1 domain of 2-layer sandwich, TIM barrel of Alpha-Beta Barrel and UB Roll of Alpha-Beta Roll in a substantial numbers of Long repeats proteins.

of βαβαβαβα of length 121 in **Figure 6B** (Cherney et al., 2011). The analysis results suggest that the length variations of repeats within the Rossmann fold lead to the presence of additional α-helices, β-strands, and coil regions. Thus, longer repeats of different lengths provide the structural differences within a fold of the proteins.

## Analysis of Long Repeats for Intrinsic Disordered Region

The intrinsically disordered regions (IDR) for 51 (<1%) of long repeats proteins were found out using DisProt database. While analyzing the predisposition of long repeats for IDR, most of the repeats were identified in the structured regions. However, we also identified long repeats in an IDR. For example, Nucleoporin NUP1 (P20676) protein of length 1,076 contains tandem repeats of length 62 in the region of (352–399/403–462/522–564/666–728/731–778/779–840/849– 906/907–972/978–1,031), which has been identified as an IDR (300–1,078). This analysis suggests that long repeats are generally structured in most of the proteins while few of them may have IDRs.

## Analysis of Functions of Long Repeats at the Domain Level

The Pfam domain assignments in 26,750 (90%) of proteins were found and suggested the occurrence of repeats in the functional domain families containing proteins. While grouping

by protein family, the existence of repeats in 5,258 distinct protein families was found out. Some of the protein families are having long repeats in a greater number of their member proteins (**Supplementary Data File 3**). **Figure 7** shows the list of 36 protein families such as Class II aminoacyl-tRNA synthetase, Ser/Thr Protein kinase, Class I aminoacyl-tRNA synthetase, Cytochrome P450, Mitochondrial carrier (TC 2.A.29), G-protein coupled receptor 1, and ABC transporter that are having repeats in more than 40 member proteins of the family. We observed long repeats in the domains of the family with varying lengths. For example, the Peptidase S8 family proteins contained long repeats in 41 member proteins of the family (**Table 2**). Among these, 38 protein repeats were in the Peptidase S8 domains with varying repeat lengths. **Figure 8** shows some of the proteins' repeat regions as well as their alignment that covers the Peptidase S8 domain regions. The level of similarity between the repeats in the Peptidase S8 domain within a protein and within the member proteins of the Peptidase S8 domain family was computed in terms of % sequence identity. For example, the sequence identity of 29% was observed for the repeats (157–215/228–313), within the Peptidase domain (157– 401) of the Aqualysin-1 protein (P08594) (**Table 2**). Further, the sequence similarities of repeat unit (157–215) of this protein, with the repeat units in the Peptidase S8 domain of the 37 member proteins, were also computed. We observed that 65 % of repeats were in the range of 20–40% sequence identity and the remaining protein repeats were in the range of 10–20% identity. This observation suggests that the repeats within a protein, as well as within a protein family, are considerably diverged.

Further, repeats in the domains are involved in the function through functional residues (highlighted in red color). For example, the regions (162–173) and (197–207) of repeats (157–215/228–313) of Aqualysin-1 (UniProt ID P08594) have contained functional residues VYVIDTGIRTTH and HGTHVAGTIGG for Serine proteases (**Figure 8**). The functional involvement of the repeats was also found out in the structure of the proteins using PDBsum. For example, the functionally involved residues (highlighted red in color) of repeats (157–215/228–313) in the structure of Aqualysin-1 (PDB ID 4DZT) were found out using PDBsum search (**Figure 9**). This suggests that these repeats occur in the domains of the family for the function of the proteins.

#### Analysis of Enzyme and Non-enzyme Functions of Long Repeats

Further, the enzyme functions in 13,333 proteins and nonenzyme functions in 2,437 proteins, of a total of 15,770 (53%) of long repeats proteins, were also found out. Of a total of 13,333 enzymes having long repeats, Ligases (35.91%) have the maximum number of repeats followed by Isomerases (28.98%), Translocases (11.81%), Transferases (11.12%), Lyases (5.12%), Hydrolases (4.74%), and Oxidoreductases (2.32%). Among the non-enzymes in 2,437 proteins, NP binding proteins (48.09%) have the maximum number of repeats followed by DNA binding (30.44%), metal binding (16.94%), and calcium binding (4.51%). These observations suggest the importance of long repeats in both the catalytic and binding function of proteins apart from serving as modules of large assemblies.

TABLE 2 | List of 41 member proteins of the Peptidase S8 family long repeat region's and their function domain regions assigned using Pfam.


*(Continued)*

#### TABLE 2 | Continued


### DISCUSSION

Our survey of long repeats in a non-redundant set of UniProt sequences has highlighted the occurrence of these repeats that play an important role in the structure and function of domains of the proteins. Previous studies have focused on structural and functional implications of proteins with homo repeats (Uthayakumar et al., 2012), fibrous repeats (Parry, 2005) and different well-characterized repeats of length 5–50 (Andrade et al., 2001). Therefore, an in-depth study of long repeats in

UniProt sequences was carried out for a better understanding of the correspondence of repeat sequences with their structures and functions. In this study, we used the RADAR program for internal repeat detection, since it often detects both tandem and interspersed repeats in larger size. Our earlier studies for repeats analysis (Mary Rajathei and Selvaraj, 2013; Mary et al., 2015) have shown the ability of RADAR to detect repeats of length > 50 that are structurally similar and conserved in a 3D structure environment. Further, the sensitivity and accuracy of RADAR repeats, by comparison with Pfam, indicate good coverage, accurate alignments, and reasonable repeat borders (Heger and Holm, 2000). The identified repeats vary in the range of 50–1,759 of lengths and diverged with more insertions and deletions, but the calculated z-scores by RADAR have shown their statistical significance.

From a structural perspective, long repeats tend to occur abundantly in certain architectures of sandwich, barrel, bundle, and roll. Within these architectures, they are predominately observed in the super folds of up-down and orthogonal bundle of α-class, Immunoglobulin, Jelly Roll and OB fold of β-class, Rossmann fold, TIM barrel, α/β plait, and UB roll of α/β class of the proteins. The adoption of classic super secondary elements (αα, βαβ, ββ) and incorporation of repetitive duplication of a small stable unit may be the possible reasons for abundance of larger duplication in these folds (Thornton et al., 1999). For example, the evolution of the (βα)<sup>8</sup> repeat in the TIM barrel is through repetitive duplication of a small stable unit (βα) (Lang et al., 2000). It has been observed that repeats in the folds may fulfill the physical demand (stable and fast folding conformation) of the protein chain during the process of evolution, in order to meet the cellular function (Lupas et al., 2001). Further, it has been shown that the existence of structural symmetries in the super-folds (6 out of 10) may also require larger duplication during evolution of the proteins (Brych et al., 2003). Kim et al. (2010), through their SymD (detecting symmetry in protein structures) method, have identified 33 folds that contain 10 or more symmetric domains. There is considerable overlap between the symmetry in the folds they identified and those observed in the present work (**Figure 5**). We observed that long repeats of different lengths within a fold provide the structural differences of the proteins for different functions. Further, the analysis of predisposition of long repeats for disordered regions has shown that long repeat proteins are mostly structured to form stable folds. However, it has been observed that short tandem repeats are highly disordered, which do not adopt a single defined configuration for specific function (Tompa, 2012; Habchi et al., 2014; van der Lee et al., 2014).

Further, repeats have been analyzed for a specific domain of the family, in which protein function could be found out through the domain (Rentzsch and Orengo, 2013). We found that repeats in the domain regions of the family are involved in the function through functional residues. Earlier, we analyzed the repeats in the individual proteins of PDB and found that the existence of repeats in single/two domains from the same family, for the function of the proteins and that are not in the domains, are also involved in the function of the proteins (Mary Rajathei and Selvaraj, 2013). We observed that the lengths of repeats in the domains of the family are not uniform. Further, the computation of sequence identity of the repeats within a protein and within a family of Peptidase S8 domain shows lower similarity, which may be the consequence of their divergences over a period. Earlier, it was observed that repeat proteins are indeed repetitive in their families, exhibiting abundant stretches of short perfect repetitions (Turjanski et al., 2016). The repeats of varying lengths in the structures of the fold, as well as in the functional domains of the family, have suggested that long repeats are considerably diverged and may not be overlapped. However, further studies would be needed to understand the conservation of long repeats of the proteins in the structure and function of the proteins.

Further, we observed the existence of long repeats in all seven enzyme classes of the proteins and are especially more abundant in ligases and isomerases. Among the nonenzyme proteins, long repeats are observed in DNA binding, calcium binding, metal binding and NP binding proteins with NP binding and DNA binding in a greater number of proteins. However, further studies are needed to understand why certain enzyme classes and non-enzyme classes are having long repeats in more numbers. This shows that the occurrence of long repeats, not only serves as modules of large assemblies, but also in the catalytic function or binding of the proteins.

While commenting on the evolution of the well-characterized short tandem repeats in many evolutionary lineages, it has been postulated that repeat-containing proteins are cheap to evolve, rather than the de nova sequence evolution, as the repeat units are thermodynamically stable (Andrade et al., 2001; Andersson et al., 2015). Through our analysis, we observed the occurrence of long repeats in the stable folds for different functions of the proteins and suggested that long repeats may play a role in the evolution of proteins with stable folds and novel functions.

### REFERENCES


### CONCLUSIONS

The present large scale study has focused on the presence of long repeats in a non-redundant set of the entire annotated UniProtKB/Swiss-Prot database and reveals that long repeats are found in 23% of the proteins. Regarding their three-dimensional structures, they are found in certain structural folds that are incorporated with repetitive duplication of small stable folds. Further, the long repeats of different lengths within each fold are observed in different structures of the proteins. From a functional perspective, these repeats are found in both enzyme and nonenzyme functions containing proteins. Hence, long repeats may have a role in the evolution of proteins with stable folds and novel functions.

### DATA AVAILABILITY STATEMENT

The UniProt annotated sequence files and the RADAR output files were analyzed for this study. Major results are available as **Supplementary Material**.

### AUTHOR CONTRIBUTIONS

DR developed the computer programs in perl platform for this study and drafted the manuscript. SP supported to analyze and computation the data. SS conceived the idea and helped in the preparation of the manuscript.

### ACKNOWLEDGMENTS

DR acknowledges University Grants Commission, India for a research fellowship through UGC Post-doctoral Fellow for Women Research Grant No. F.15-1/2017-18/PDFWM-2017-18- TAM-44286/(SA-II).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fbioe. 2019.00250/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Rajathei, Parthasarathy and Selvaraj. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The TargetMine Data Warehouse: Enhancement and Updates

*Yi-An Chen\*†, Lokesh P. Tripathi\*†, Takeshi Fujiwara, Tatsuya Kameyama, Mari N. Itoh and Kenji Mizuguchi\**

*Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, Osaka, Japan* 

#### *Edited by:*

*Shandar Ahmad, Jawaharlal Nehru University, India*

#### *Reviewed by:*

*Hamed Bostan, North Carolina State University, United States Marco Brandizi, Rothamsted Research (BBSRC), United Kingdom*

#### *\*Correspondence:*

*Yi-An Chen chenyian@nibiohn.go.jp Lokesh P. Tripathi lokesh@nibiohn.go.jp Kenji Mizuguchi kenji@nibiohn.go.jp*

*†These authors have contributed equally to this work*

#### *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 18 June 2019 Accepted: 05 September 2019 Published: 09 October 2019*

#### *Citation:*

*Chen Y-A, Tripathi LP, Fujiwara T, Kameyama T, Itoh MN and Mizuguchi K (2019) The TargetMine Data Warehouse: Enhancement and Updates. Front. Genet. 10:934. doi: 10.3389/fgene.2019.00934*

Biological data analysis is the key to new discoveries in disease biology and drug discovery. The rapid proliferation of high-throughput 'omics' data has necessitated a need for tools and platforms that allow the researchers to combine and analyse different types of biological data and obtain biologically relevant knowledge. We had previously developed TargetMine, an integrative data analysis platform for target prioritisation and broad-based biological knowledge discovery. Here, we describe the newly modelled biological data types and the enhanced visual and analytical features of TargetMine. These enhancements have included: an enhanced coverage of gene–gene relations, small molecule metabolite to pathway mappings, an improved literature survey feature, and *in silico* prediction of gene functional associations such as protein–protein interactions and global gene co-expression. We have also described two usage examples on trans-omics data analysis and extraction of gene-disease associations using MeSH term descriptors. These examples have demonstrated how the newer enhancements in TargetMine have contributed to a more expansive coverage of the biological data space and can help interpret genotype–phenotype relations. TargetMine with its auxiliary toolkit is available at https://targetmine.mizuguchilab.org. The TargetMine source code is available at https:// github.com/chenyian-nibio/targetmine-gradle.

Keywords: data warehouse, integrative data analysis, multi-omics data analysis, gene prioritisation, drug discovery, data mining, knowledge discovery

### INTRODUCTION

The rapid proliferation of high-throughput omics technologies has revolutionised biological research by significantly adding new omics data. However, as the experimental datasets increase in size and complexity, extraction of meaningful biological knowledge becomes qualitatively more difficult, expensive and labourious. Therefore, there is an ever widening gulf between data generation and the rate at which it can be properly analysed (Greene et al., 2014). Proper mining and curation of large biological datasets are necessary to develop an improved understanding of living systems and of disease pathogenesis.

An integrative multi-omics approach combines different types of biological data into a single analytical framework to understand the relationships between different cellular components (Zhu et al., 2012; Yan et al., 2018). Such analyses are useful to develop analytical models that can interpret genotype–phenotype relationships, garner the knowledge of pathways involved in cellular events and diseases, help pinpoint targets (such as gene and proteins) of biological and therapeutic interest and potentially develop intervention methods than can counteract undesirable phenotypic progression (i.e. diseases) (Sun and Hu, 2016; Hasin et al., 2017).

A major challenge in multi-omics data analysis is the availability of clean and usable biological data. We have previously developed TargetMine, an integrated data warehouse based on the objectoriented InterMine data warehouse framework (Smith et al., 2012; Kalderimis et al., 2014; Triplet and Butler, 2014), which models biological entities (such as genes and proteins) as 'objects' that are described by a set of attributes and their relationships with other objects are modelled as 'references'. The InterMine system allows for integration of different types of biological databases, and it comes pre-equipped with data integration features that are able to directly parse the data from commonly used data formats and sources (such as UniProt, OBO, FASTA and BioPAX). InterMine also allows the users to design their own data parsers (Smith et al., 2012; Lyne et al., 2007; Kalderimis et al., 2014). The TargetMine data model was developed by extending a customised version of the core InterMine data model. When integrating similar types of data from heterogeneous data sources, we first identified common attributes (gene identifiers for instance) which are then used to merge the overlapping datasets into a suitable data model. The data sources are prioritised based on their reliability, and the stored identifiers are constantly revised to update or discard outdated identifiers with every database update (Lyne et al., 2007; Smith et al., 2012).

TargetMine was initially developed and optimised for target discovery and prioritisation of candidate genes, especially in early stage drug discovery (Chen et al., 2011). We have continued to make significant additions and refinements to the TargetMine system to transform TargetMine into an integrative data analysis platform that can more effectively interpret information-rich omics datasets for biological knowledge discovery (Chen et al., 2019). Aside from periodically updating the existing datasets, these new developments have involved assimilation of newer biological data types and a new auxiliary toolkit to assist with data analysis and visualisation that addresses the limitations in the core InterMine framework (Chen et al., 2016).

Here, we describe our progressive efforts to enhance TargetMine as a data analysis platform that can better assist multi-omics data analysis and biological knowledge discovery especially in disease biology. These efforts broadly fall into three categories: (1) upgrading the existing data types with the up-to-date information available from the source repositories; (2) assimilating new data types, especially those data types that help to examine different types of gene-gene relations; and (3) augmenting the auxiliary toolkit to better analyse and visualise biological data.

We will now describe our efforts below individually.

### ADDITIONAL DATA SOURCES AND DATA MODELS IN TARGETMINE PROVIDE A DEEPER COVERAGE OF THE BIOLOGICAL DATA SPACE

A comprehensive coverage of the biological data space is necessary for drug discovery and related research. To achieve this, we have continuously expanded the repertoire of data types in TargetMine. Since the last major release, we have included within TargetMine new biological data associated with three major areas—drug-target interactions, gene-disease associations and biological mechanisms. The inclusions of these data types have offered deeper insights into gene-gene relations and have also enabled the users to perform more probing biological queries with TargetMine (**Table 1**). To enable the user to quickly and easily perform complex queries, TargetMine contains a library of 'templates' that consist of predefined queries with a simple form and description and are categorised by data types (Chen et al., 2011; Chen et al., 2016; Chen et al., 2019).

### KEGG Relations

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collective repository of genes, genomes, pathways, diseases and chemical compounds that provides a comprehensive



mapping of the biological systems (Kanehisa et al., 2016). The relation element in KEGG typically specifies relationship between two entities (proteins and compounds) in KEGG pathways. KEGG relations largely correspond to signalling pathway maps and encode regulatory information such as 'A activates B', 'A inhibits B' and 'A phosphorylates B'. The inclusion of KEGG relations is useful, since they often provide an additional context to interactions between gene products that are not always evident from standard PPI analysis. This integration has enabled the users to reconstruct probable signal transduction paths by performing queries such as 'Given a pair of genes A and B, find an intermediate gene and relations from gene A to gene B'.1

#### Post-Translational Modifications

Post-translational modifications (PTMs) are events that involve covalent addition of functional groups to proteins or their proteolytic processing during and after their biosynthesis. PTMs amplify the functional diversity of proteins and expand their influence over various cellular processes. Therefore, identifying and understanding PTMs help in a deeper understanding of cellular functions and in disease biology (Mnatsakanyan et al., 2018; Thygesen et al., 2018). We retrieved PTM associations from PhosphoSitePlus (Hornbeck et al., 2015), a knowledgebase of mammalian PTMs, and we carefully parsed the UniProt sequence annotation (features) that describe regions or sites of interest in proteins, to create an integrated repository of PTMs in TargetMine. The inclusion of PTMs in TargetMine enables the users to perform complex queries such as 'Given a list of proteins, identify upstream kinases that may phosphorylate them'2 or 'Given a protein and a specific residue position, identify any PTMs mapped to that residue'.3

### KEGG Reactions Compound-Pathway Mappings

Metabolites are the low molecular weight compounds such as amino acids, sugars and lipids, which are typically substrates and by-products of biological processes and enzymatic reactions; they are widely involved in feedback regulatory processes in the cell and, being the downstream products, often directly influence the phenotype. Thus, the metabolome is often regarded as the link between genotype and phenotype (Krumsiek et al., 2016). To facilitate a more effective metabolomics analysis with TargetMine, we first extended the existing compound class to create a new KEGG COMPOUND class. Subsequently, we referenced the KEGG COMPOUND class both with the existing Enzyme and Pathway classes using the relationships extracted from the KEGG COMPOUND database. We also defined a new Reaction class to describe the biochemical reactions in the KEGG reaction database, and this class was referenced with all of the KEGG COMPOUND, pathway and enzyme classes. Given a list of compounds, the users can now retrieve the corresponding enzymes involved in their metabolism, the enzymatic reactions involving these metabolites and, even map them to the corresponding pathways and diseases associated with the pathways. Users can also perform enrichment analyses to prioritise the enzymes/genes and pathways specifically associated with their metabolites of interest (see example below).

### Disease-Gene Mappings

A deeper understanding of disease pathogenesis requires a mapping of links between genes, pathways and specific diseases, but they are difficult to obtain in general. Recently, we have enhanced the integration of genetic linkages to diseases by improving the existing GWAS data model and adding the variation annotations from ClinVar (Landrum et al., 2015), and the disease associations that were extracted from the associated publications in dbSNP (Chen et al., 2019). We have also included gene-disease associations compiled in DisGeNET, an integrative platform of curated gene-disease associations (Pinero et al., 2017). This integration has enabled the users to perform queries such as 'Given a gene, find the related SNPs and any diseases associated with these SNPs'.4

### Scientific Literature Survey

Literature survey is indispensable to annotating gene information, interpreting gene sets and facilitating further research. However, scientific literature is increasing exponentially, making it difficult for the researchers to find, study and understand new publications of interest. To facilitate an easier sharing of scientific knowledge, we have incorporated document representations such as MeSH (Medical Subject Headings) (Rogers, 1963) descriptors (such as general article, review, clinical study, case report, etc.) and abstracts into TargetMine. This implementation allows the users to quickly screen for scientific texts (based on their MeSH descriptors) associated with their gene(s) of interest. For example, users may restrict their query to retrieving only those publications classified as 'case report' by constraining the 'Mesh Terms'-> 'Identifier' attribute for the MeSH Terms identifier 'D002363' (case reports). Researchers typically rely on abstracts to assess an article for further reading and often; abstracts are the only source of information that are freely available (Germini et al., 2017). To allow the users to easily access and scan article abstracts of interest, we leveraged the attribute 'Abstract Text' within the 'Publication' class. This implementation allows the users to retrieve publications associated with their gene(s) of interest along with their corresponding abstracts; this implementation also allows the users to quickly and easily scan multiple abstracts in a single webpage, instead of visiting the individual 'Publication report' pages and clicking on the available PMID links to access the corresponding abstracts on NCBI PubMed (as was the case previously).

### INCLUSION OF COMPUTATIONALLY PREDICTED ASSOCIATIONS AND SCORES IN TARGETMINE

### TF-Target Associations

Transcription factor (TF)–target gene interactions determine gene expression patterns, and therefore, regulate cellular functions. Previously, we had included expert-curated experimentally validated human TF-target gene interaction data from Amadeus (Linhart et al., 2008), ORegAnno (Griffith et al., 2008) and HTRIdb (Bovolenta et al., 2012) to create a combined repository in TargetMine (Chen et al., 2016). To expand the coverage of TF-target gene interactions, we examined and processed the vast amounts of TF-binding site data compiled by the Encyclopedia of DNA Elements (ENCODE) consortium (see *Methods*). Over 200,000 new TF-target gene interactions corresponding to 23 human TFs were incorporated into TargetMine in this manner. We also incorporated nearly 200,000 TF-target gene interactions corresponding to 39 TFs in the mouse genome, thereby providing a detailed coverage of gene-regulatory associations in mouse that were not available in the previous iterations of TargetMine.

#### PPI Confidence Scores Using Predicted Likelihood of PPIs

PPIs are vital to virtually every cellular process, and their dysregulation typically leads to cellular dysfunction including diseases. However, it is necessary to assess the PPI data properly to ensure the robustness of PPIN-based analyses in investigating disease-causing biological pathways and to discover druggable target proteins. We had previously performed a confidence assessment of our combined PPI repository and defined a reliable high-quality subset termed 'high-confidence direct physical PPIs' (HCDPs) (Chen et al., 2016). HCDPs have been helpful in analysing network topological properties and identifying key components of the presently characterised interactome maps, namely, network 'hubs' and 'bottlenecks' (Tripathi et al., 2013; Chen et al., 2016). However, HCDPs constitute only a small proportion of all available PPIs, and using HCDPs alone for PPI-based network analysis may often exclude potentially useful PPIs. This is especially true for the mouse and rat interactomes in TargetMine, where HCDPs are rather sparse. Therefore, we have included an additional measure for the assessment of PPI reliability. In our group, we had previously developed prediction server of protein–protein interactions (PSOPIA), an integrative averaged one-dependence estimators (AODE)–based method to predict the likelihood of interaction between a pair of proteins based on experimentally characterised homologous PPIs (Murakami and Mizuguchi, 2014). We employed PSOPIA to evaluate all PPIs within TargetMine, and the output PSOPIA scores were tagged to the individual PPIs as a new attribute PsopiaScore. This implementation has enabled the users to query the interacting partners of a gene/protein or a list of genes/ proteins of interest and to infer overall PPI networks involving these genes/proteins consisting of all interactions judged to be of sufficiently high quality by the user, either based on their HCDP status and/or their PSOPIA scores.

#### Gene Co-Expression Analysis for Prediction of Novel DNA-Binding Proteins and for Improved PPI-Based Network Analysis

Gene expression refers to the process where the genetic information encoded in a gene is transcribed into a functional gene product—RNA or the eventual protein. Gene expression analysis involves mapping and analysing collective gene expression patterns that dictate cellular function under different environments. The spread of technologies that can map global gene expression profiles has led to an abundance of genomewide gene expression data (transcriptome) for many cell and tissue types and physiological conditions (Barrett et al., 2012; Kolesnikov et al., 2015). Global gene expression profiles have been variously analysed to search for genes that are differentially expressed in different cellular and physiological conditions (such as development, infection and diseases) and to predict functions for genes of unknown function. A guiding principle of function prediction using gene expression is *guilt-by-association*, which assumes that genes with related functions are more likely to have correlated properties such as interactions and expression patterns (Singer et al., 2004).

In this study, we have sought to leverage global gene co-expression (GCE) patterns to minimise biological noise and to further refine and improve PPI-based network analysis (**Figure 1**). To assess the effectiveness of this approach, we performed GSFE analysis on a multiple gene sets gathered from literature and then repeated the tests on modified gene sets that included both co-expressed genes and randomly selected unrelated genes (biological noise) (see methods). Among the 298 curated gene sets that were tested, 81% (240) were associated with overall higher F1 scores when HCDPs or globally co-expressed HCDPs (GCE-HCDP) where added to the initial gene list, compared with initial gene sets (**Figure 2**; **Table 2**, **S1**). Furthermore, of the 240 gene sets where GCE showed an improved prioritisation performance, in 170 of them (~70%), inclusion of GCE-HCDP contributed to a better performance in terms of F1 scores as compared to inclusion of HCDPs alone (**Figure 2**; **Table 2**).

Our observations suggested that the inclusion of HCDPs and GCE-HCDPs contributed to improved target prioritisation and gene set analysis with TargetMine.

### ENHANCED DATA ANALYSIS AND VISUALISATION WORKFLOW WITH TARGETMINE

We had previously developed an auxiliary toolkit to assist with data analysis and visualisation in TargetMine without any scripting and/or programming efforts on the part of the user (Chen et al., 2016) (https://targetmine.mizuguchilab.org/ tutorials/auxiliary-toolkit/). We have subsequently added new analytical and visualisation features to further enhance the ability of TargetMine as a data analysis platform. For instance, we have now added a dendrogram to the association heatmap function, which permits users to quickly and more easily identify clusters of genes that share significant functional attributes. We have also introduced the 'Expressed Tissue' feature that allows the users to hierarchically assemble a heatmap of user-supplied genes and the cell/tissues where they are highly expressed and thereby obtain a contextual view of their expression patterns. We have also enhanced the efficacy of the network visualisation function. In the present form, the function would permit the users to supply a list of genes and construct and visualise a composite interaction network that

includes all the biomolecular interactions within TargetMine, i.e. PPIs, MTIs, PCIs/drug-target interactions and TF-target gene interactions that are associated with the query genes. However, adding too many interactions can also render the network very dense and complex, therefore becoming difficult TABLE 2 | The inclusion of HCDPs filtered by gene co-expression (GCE-HCDP) to generate extended gene sets led to an overall improved target prioritisation performance when compared with the inclusion of unfiltered HCDPs and un-extended gene sets.


to load and visualise properly in the browser. To address these concerns, we have added a series of features to select and filter biomolecular interactions by qualitative assessment and/or contextual information. For instance, the 'Interaction Network' feature allows the user to restrict the PPI selection to 'HCDPs' or expand them to include 'All' PPI types by selecting the corresponding circles. We have also introduced features that permit the users to filter the interaction types by expressed tissues and GELs. We have also introduced a feature to allow the users to specifically include and visualise directed gene–gene relations parsed from KEGG. Moreover, we have improved the network feature to restrict the MTIs, PCIs/drug-target interactions and TF-target gene interactions to the user supplied gene list.

### APPLICATIONS WITH USE CASES

#### Trans-Omics Data Analysis

To demonstrate the effectiveness of TargetMine in assisting multi-omics data analysis, we re-examined a previously published multi-omics data on mitochondrial links to liver metabolism, and the effects of a high-fat diet on it (Williams et al., 2016). We first retrieved the biomolecules (110 transcripts, 27 proteins and 25 metabolites) that were differentially expressed in high-fat diet (HFD) fed mice relative to control (see methods). Next, the differentially expressed transcript, protein and metabolite sets were transformed into the corresponding transcriptome, proteome and metabolome differentially expressed gene (DEG) sets, respectively (see **Supplementary File S1** for the detailed methodology with the help of an example). The inferred DEG sets (containing 84, 29 and 62 genes, respectively; **Supplementary Table S1**, **Supplementary Figure S1**) were first compared (using the list operations in TargetMine) to identify overlapping genes. Three DEGs (Ces2a, Cyp3a11 and Csad) were shared across transcriptome and proteome DEG sets, and a solitary gene, cysteine sulfinic acid decarboxylase (Csad), was downregulated across all the three DEG sets (**Supplementary File S1B**). Csad is an enzyme that plays a key role in generating taurine from cysteinesulfinate in liver, and its hepatic expression and abundance are typically downregulated by bile acids responsible for modulating lipid metabolism (Kerr et al., 2014). Our observations, therefore, clearly suggested that the fluctuations in Csad levels were likely to be modulated by dietary fat *via* bile acids. Additionally, we were also able to prioritise signatures that were consistent with an HFD model such as cytochrome p450 subunit Cyp3a11 that functions in retinoic acid metabolism.

Next, the individual DEG sets were then subjected to pathway enrichment analysis (see methods). Five enriched pathways were associated with the transcriptome DEG set, 11 enriched pathways with the proteome DEG set, and 17 enriched pathways were associated with the metabolome DEG set, respectively. KEGG pathway sub-categories 'Lipid metabolism', 'Global and overview maps', 'Cancers: Overview' and 'Metabolism of cofactors and vitamins' were commonly represented in the enriched pathways across all the three DEG sets (**Supplementary File S1B**). Specifically, KEGG pathway 'Metabolic pathways' was commonly enriched in all the three DEG sets; 'Steroid hormone biosynthesis', 'Linoleic acid metabolism', 'Retinol metabolism', 'Arachidonic acid metabolism' and 'Chemical carcinogenesis' were commonly enriched in Transcriptome and Proteome DEG sets (associated with Cyp3a11), and 'Drug metabolism—other enzymes' was commonly enriched in Proteome and Metabolome DEG sets, respectively (although no gene overlap was observed). Taken together, our observations have suggested that higher levels of dietary fat are responsible for dysregulation of cellular factors and pathways associated with lipid metabolism and as such have provided promising candidates for further research.

### Extracting Gene-Disease Associations From Literature Using Mesh Descriptors

A vast amount of untapped associations between genes and diseases are scattered across biomedical literature. A quick and efficient mining of such information can help interpret genotype-phenotype relationships and also speed up database curation. The inclusion of MeSH descriptors now allows the TargetMine users to easily survey for annotated gene-disease associations for their gene(s) of interest. As a case study, we extracted literature-embedded gene-disease associations for PPARγ (peroxisome proliferator–activated receptor gamma), a nuclear receptor that is implicated in the pathology of numerous diseases including obesity, diabetes and cancer. Next, we sought to retrieve all the publications that were indicative of the involvement of PPARγ in disease pathogenesis by constraining the query for MeSH term attribute 'Diseases'. We retrieved 397 unique disease associations for PPARG in this manner (**Supplementary File S2**).

### CONCLUSIONS

TargetMine is a versatile data analysis platform that provides a unified, homogenous representation of diverse types of omics and other biological data; it allows the users to query and navigate across the stored data types and analyse them in a singular interface. In this study, we have described the augmentation of the TargetMine by progressively improving and expanding the coverage of data types and by adding new and improved analytical features. We have also demonstrated how the extension of TargetMine system has significantly boosted its capabilities to survey the biological target space, to assist multiomics data analysis, to interpret novel genotype-phenotype relationships and to facilitate biologically relevant knowledge discovery, especially in disease biology.

For the future developments, we will continue to accommodate new and emerging data types of interest and expand the analytical features. We also aim to introduce a workflow function for multiomics data analysis that will allow the users to more easily and effectively analyse and interpret their omics datasets and advance their research.

## METHODS

### Tf-Target Gene Associations From Encode

The binding events (peaks) for human and mouse TFs with binding profiles in different cell types were downloaded from the ENCODE resource (Davis et al., 2017). To accommodate the additional TF-target information, we redefined the erstwhile protein-DNA interaction class into a new transcriptional regulation class. The promoter region was defined as 10,000 bp upstream of the transcriptional start site (TSS). We extracted binding site positions within this hypothesised promoter region, and we identified the corresponding genes by mapping the genomic coordinates downstream of the TSS to the genomic coordinates stored within TargetMine. Next, we mapped the TFs whose binding sites were identified in this manner with the downstream genes to generate new TF-target gene associations.

#### Gene Co-Expression Analysis

#### Data Sets

The gene sets were retrieved from GeneSigDB, a database of curated gene signatures (Culhane et al., 2012). To obtain a reasonable size of candidates, we selected only the human, mouse and rat gene sets that consisted of 30~80 genes; 642 gene sets were selected in this manner. The genes in the so-called 'standardised' gene list in GeneSigDB were represented by Ensembl identifier and symbol; for further analysis, we mapped them to Entrez Gene identifier (Gene ID) using TargetMine (build 20160629). The Ensembl identifiers that were not mapped to a corresponding Gene ID were excluded from the list, thereby marginally reducing the sizes of the gene sets. Next, we performed pathway enrichment analysis on each of these gene sets, and only those gene sets (298 out of 642) that were associated with at least one enriched pathway (pathways were judged to be significant if the adjusted *p*-value was 0.05 or less) were taken up for subsequent analyses (**Figure 1**).

#### Global Gene Co-Expression Analysis

Global gene expression profiles for human genes were retrieved from gene expression omnibus (GEO) (Barrett et al., 2012; Kolesnikov et al., 2015), and gene co-expression levels were computed as described earlier (Ahmad et al., 2018).

#### Gene Prioritisation With PPI and GCE

The aim of the gene prioritisation is to identify a relatively important subset of genes form a list of candidates for further analysis. The first step of target prioritisation within TargetMine involves uploading a list of initial candidate genes or proteins (e.g. a set of differentially expressed genes or a set of proteins that interact with a given protein) to create a TargetMine gene list. Next, the enrichment of specific biological themes such as KEGG/Reactome pathways, integrated pathway cluster (IPC) (Chen et al., 2014), gene ontology terms etc. is estimated by performing Fisher's exact test here followed by multiple testing correction to control the false discovery rate. The genes associated with the most significantly enriched biological associations (that satisfied, in this instance, a condition of *p* ≤ 0.05 after a multiple test correction with the Benjamini and Hochberg procedure (Benjamini and Hochberg, 1995; Benjamini et al., 2001)) are judged to be highly important to the biological phenomenon under study and therefore selected for further analyses.

For each of 298 gene sets, we replaced at random 75% genes with an equal number of unrelated randomly selected genes from the corresponding genome to generate test gene sets to incorporate biological noise. To avoid any bias incurred due to the selection of random genes, the process was repeated 10 times to infer 10 test gene sets for each curated gene list.

Next, the HCDPs for the genes within the test gene sets were retrieved from TargetMine and were appended to the initial test gene sets to create extended test gene sets. Independently, co-expressed HCDPs for the test gene sets were retrieved from TargetMine and were appended to the initial test gene sets to create extended test gene sets. Only the interacting partners that had a GCE value greater than 0.03 or less than −0.03 with the genes in the test gene sets were considered. Finally, the prioritisation tests (**Figure 1**) were then performed for each test gene set.

#### Evaluating the Performance of GCE-Filtered HCDPs in Target Prioritisation

To evaluate the protocols, we compared the enriched pathways among the reference gene sets, the 'noisy' gene sets and the extended noisy gene sets that were independently generated with the inclusion of HCDPs and GCE-HCDPs.

The enriched pathways (*p*-value < 0.05) were then mapped to their corresponding IPC (Chen et al., 2014). These enriched IPCs from the reference gene set were defined as the true positives (TPs) in this instance, and the rest were defined as false positives (FPs). TPs which were not found in the test results were defined as false negatives (FNs). For each gene set, the F1-score was estimated as follows:

$$F1 = \frac{2TP}{2TP + FP + FN}$$

The test was performed 10 times for each gene set (from the step that we randomly generated the 'noisy' gene set). A student t-test was also performed to compare the significance of the differences between the two approaches. If the new prioritisation protocol achieved an overall higher F1-score than standard enrichment analysis, it was assumed to have provided an improved prioritisation performance, even the difference was trivial.

#### Gene Set Inference and Pathway Enrichment for Multi-Omics Data Analysis

The biomolecules (transcripts/genes, proteins and metabolites) were judged to be differentially expressed if they were statistically significantly (*p* ≤ 0.05; t-test) increased or decreased more than 1.5-fold i.e. if the fold change (FC) ≥1.5 (upregulated) or FC ≤ 0.667 (downregulated) in mice fed with high-lipid diet relative to the control. The biological pathway data from KEGG were used to assign functional annotations to the DEGs, using TargetMine. Statistical significance of the pathway enrichment was determined by Fisher's exact test, and the *p*-values were corrected for multiple testing using the Benjamini–Hochberg procedure. The enriched pathways were considered statistically significant if the adjusted *p* ≤ 0.05.

### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the manuscript/**Supplementary Files**.

#### AUTHOR CONTRIBUTIONS

Y-AC and LT contributed equally to this work. Y-AC, LT, TF, TK, MI, and KM were responsible for data gathering and validation. Y-AC, LT, TF and TK performed all the data analysis. Y-AC, LT, and KM were responsible for overall data analysis and interpretation and wrote the manuscript. All authors read and approved the final version of the manuscript.

#### FUNDING

This work was in part supported by Grants-in-Aid for Scientific Research from the Japan Agency for Medical Research and Development (Grant Number 19ak0101068h0003; "The adjuvant database project" Grant Number 16ak0101010h0005) and from

#### REFERENCES


the Japan Society for the Promotion of Science (Grant Number 17K07268) to KM.

#### ACKNOWLEDGMENTS

The authors thank the members of the Mizuguchi laboratory for their critical review of the study and the manuscript.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00934/ full#supplementary-material

medicine journals: a protocol for a systematic survey of the literature. *BMJ Open* 7, e014981. doi: 10.1136/bmjopen-2016-014981


Pinero, J., Bravo, A., Queralt-Rosinach, N., Gutierrez-Sacristan, A., Deu-Pons, J., Centeno, E., et al. (2017). DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. *Nucleic Acids Res.* 45, D833–D839. doi: 10.1093/nar/gkw943

Rogers, F. B. (1963) Medical subject headings. *Bull. Med. Libr. Assoc.* 51, 114–116.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor declared a past co-authorship with the authors.

*Copyright © 2019 Chen, Tripathi, Fujiwara, Kameyama, Itoh and Mizuguchi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Paclitaxel Response Can Be Predicted With Interpretable Multi-Variate Classifiers Exploiting DNA-Methylation and miRNA Data

#### *Alexandra Bomane, Anthony Gonçalves and Pedro J. Ballester\**

*Cancer Research Center of Marseille, CRCM, INSERM, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Paris, France*

To address the problem of resistance to paclitaxel treatment, we have investigated to which extent is possible to predict Breast Cancer (BC) patient response to this drug. We carried out a large-scale tumor-based prediction analysis using data from the US National Cancer Institute's Genomic Data Commons. These data sets comprise the responses of BC patients to paclitaxel along with six molecular profiles of their tumors. We assessed 10 Machine Learning (ML) algorithms on each of these profiles and evaluated the resulting 60 classifiers on the same BC patients. DNA methylation and miRNA profiles were the most informative overall. In combination with these two profiles, ML algorithms selecting the smallest subset of molecular features generated the most predictive classifiers: a complexity-optimized XGBoost classifier based on CpG island methylation extracted a subset of molecular factors relevant to predict paclitaxel response (AUC = 0.74). A CpG site methylation-based Decision Tree (DT) combining only 2 of the 22,941 considered CpG sites (AUC = 0.89) and a miRNA expression-based DT employing just 4 of the 337 analyzed mature miRNAs (AUC = 0.72) reveal the molecular types associated to paclitaxelsensitive and resistant BC tumors. A literature review shows that features selected by these three classifiers have been individually linked to the cytotoxic-drug sensitivities and prognosis of BC patients. Our work leads to several molecular signatures, unearthed from methylome and miRNome, able to anticipate to some extent which BC tumors respond or not to paclitaxel. These results may provide insights to optimize paclitaxel-therapies in clinical practice.

Keywords: biomarker discovery, machine learning, artificial intelligence, precision oncology, tumor profiling

### INTRODUCTION

Breast cancer (BC) is the most common type of cancer in women worldwide resulting in half a million deaths annually (Golubnitschaja et al., 2016). BC is a disease presenting substantial intertumor heterogeneity (Russnes et al., 2011). Cytotoxic drugs are used to eradicate tumor cells, to complement surgery or radiotherapy as well as to alleviate cancer symptoms. Paclitaxel is a BC-approved cytotoxic drug from the taxane family, which acts by interfering with the normal function of microtubules during cell division (Perez, 1998). As with other cancer drugs (Brown and Böger-Brown, 1999; Cardoso et al., 2002; Ribeiro et al., 2012; Housman et al., 2014), resistance to paclitaxel have been regularly observed in BC patients (Flint et al., 2009; Ajabnoor et al., 2012).

#### *Edited by:*

*Rakesh Kaundal, Utah State University, United States*

#### *Reviewed by:*

*Deepak Singla, Punjab Agricultural University, India Haiquan Li, University of Arizona, United States*

> *\*Correspondence: Pedro J. Ballester pedro.ballester@inserm.fr*

> > *Specialty section:*

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

*Received: 18 June 2019 Accepted: 30 September 2019 Published: 25 October 2019*

#### *Citation:*

*Bomane A, Gonçalves A and Ballester PJ (2019) Paclitaxel Response Can Be Predicted With Interpretable Multi-Variate Classifiers Exploiting DNA-Methylation and miRNA Data. Front. Genet. 10:1041. doi: 10.3389/fgene.2019.01041*

1 **61**

Precision oncology requires predictors to guide the optimization of drug therapies for patients (Peck, 2016; Schwartzberg et al., 2017). Indeed, it is now well-established that gene polymorphisms and other genomic alterations play important roles in the observed heterogeneous response to drugs (Wang et al., 2011; Harper and Topol, 2012; Kadra et al., 2012). This has led to the identification of clinical biomarkers of drug response from molecular profiles of the patients' tumors (Huang et al., 2014). These predictive biomarkers now guide patient-specific treatment selection during clinical trials and are also used in clinical practice (Mandrekar and Sargent, 2009; Biankin et al., 2015). Most commonly, single-gene markers are used to discriminate between therapy responders and non-responders (Prahallad et al., 2012; Rodríguez-Antona and Taron, 2015), typically consisting of an actionable mutation (e.g. single-nucleotide variant) of a specific gene in the tumor sample.

Single-gene markers that are able to predict the efficacy of cytotoxic drugs are rare (Felip and Martinez, 2012), especially for taxanes (Murray et al., 2012; Bartlett et al., 2015; Norimura et al., 2018). For instance, Marsh et al. (2007) have proposed that the point mutation *CYP1B1\*3* could be an important factor that helps to differentiate between sensitive and resistant BC patients to paclitaxel. However, Gehrmann et al. (2008) have raised doubts about the association between this alteration and paclitaxel-treated patient prognosis, concluding that CYP1B1 alone is not sufficient to predict tumor response to paclitaxel, and that it could interact with still unknown factors involved in paclitaxel sensitivity. This is an example of inter-patient variability in drug response not being fully captured by the mutational status of single gene, as it has also been seen *in vitro* in a range of drugs (Naulaerts et al., 2017).

Machine Learning (ML) can be used to build *in silico* models able to predict tumor response to a given drug by combining multiple tumor features in an optimal manner (Libbrecht and Noble, 2015; Ali and Aittokallio, 2018). The scarcity of suitable clinical data to build such predictors has been a major roadblock, which has made predictors based on cancer cell line data thrive (Costello et al., 2014; Ding et al., 2016). Fortunately, response data from paclitaxel-treated BC patients along with comprehensive molecular profiles of their tumors are increasingly available. Such datasets represent an opportunity to improve our ability to anticipate which BC patients will respond to paclitaxel. We obtained them from the recently created Genomic Data Commons (GDC) of the US National Cancer Institute (NCI) (Jensen et al., 2017). The GDC provides a unified data repository enabling data sharing across cancer genomic studies in support of precision medicine. The GDC feeds from several cancer genome programs at the NCI Center for Cancer Genomics, notably The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013), and offers a range of information-rich genomic, transcriptomic and epigenomic profiles, as well as clinical drug response data.

These datasets, however, pose the challenge of being highdimensional. Each profile typically contains between hundreds and many thousands of features, but only tens of profiled tumors of the same cancer type and treated with the same drug. For example, a community challenge intended to predict drug response employed 53 BC cell lines (Costello et al., 2014), while thousands of features from DNA copy-number variation, transcript expression, mutations, DNA methylation, and protein abundance profiles were considered. In another study (Tripathi et al., 2016), predictive models of response to cytotoxic drugs were achieved using 60 pancancer cell lines and gene variants as features. A further example of predictive drug-sensitivity models is a study employing 60 diverse cell lines and protein abundances as features (Ma et al., 2006). Small sample sizes are not only typical of preclinical studies, but also of clinical studies addressing the same problem. For instance, gene expression signatures were identified and evaluated using 81 melanoma patients to predict their response to PD-1 checkpoint inhibitors (Ayers et al., 2017).

In this study, we will investigate whether it is possible to anticipate the response of BC patients to paclitaxel using GDC data. We also aim at discovering the molecular factors that, collectively, best discriminate between paclitaxel-resistant and paclitaxel-sensitive BC patients. High-dimensional data promotes model overfitting, which in turn results in poorer predictions. As predictive performance differences between ML algorithms are strongly problem-dependent (Tan and Gilbert, 2003; Fernández-Delgado et al., 2014), considering a range of algorithms to identify those that are most suitable for paclitaxel-treated BC patients is appealing. To this end, we apply 10 ML methods to build predictive models in combination with each available molecular profile. Some of the resulting multi-variate predictors are highly interpretable in that they can answer questions such as why this particular patient is non-responsive. This information should permit formulating hypothesis about the molecular mechanisms of BC patient resistance to paclitaxel.

### MATERIAL AND METHODS

#### GDC Data

GDC molecular profiling and clinical data from the TCGA Breast Invasive Carcinoma or BRCA (https://portal.gdc.cancer. gov/projects/TCGA-BRCA) provide the basis for this study. Molecular profiles and clinical data come from release version 4.0, except for miRNA and miRNA isoform (isomiR) expressions coming from release version 8.0 (Release Notes - GDC Docs).

TCGA-BRCA project gathers data from 1,098 patients, resulting in almost 13,000 files (around 130 GB). These datasets were retrieved and downloaded using the GDC Application Programming Interface (API). **Table S1** reports information about files collected from the GDC that have been used to generate datasets.

### Processing Clinical Data for Modelling

Patient population included primary or secondary advanced breast cancer receiving single-agent paclitaxel. For some patients it was observed that different drugs have the same or very close treatment start and end time. These entries may form part of a drug combination. However, available drug response annotations do not allow to check this information. Therefore, possible effects due to drug combinations are ignored in this study when identifying paclitaxel-treated patients. Patients with missing paclitaxel response were not retained. To only consider baseline tumors' molecular profiles, patient records were only retained if no treatment was administered before resection and the time of sample procurement is indicated. We assumed that a baseline molecular profile can explain drug responses observed in a given patient even if paclitaxel was administered at any time after sample resection (Geeleher et al., 2014). After these curation steps, 61 paclitaxel-treated BC patients with valid records remained (**Table S2** reports information about treatments and biospecimens). Annotated patient responses are divided into four categories based on the RECIST standard (Therasse et al., 2000): "Complete Response" (CR), "Partial Response" (PR), "Stable Disease" (SD), and "Clinical Progressive Disease" (CPD). We further classified clinical responses into two categories, namely "responder" (CR or PR) and "non-responder" (SD or CPD).

### Processing Molecular Profiles for Modelling

The GDC works on harmonization of raw genomic data developing specific workflows to provide consistent and up-todate molecular profiles (GDC Reference Files | NCI Genomic Data Commons, Genomic Data Harmonization | NCI Genomic Data Commons). Available profiles comprise copy-number variation (CNV), DNA methylation, mRNA, miRNA and isomiR (Ameres and Zamore, 2013) expressions. In order to produce suitable inputs for ML algorithms and/or extract some specific information, we processed some of them. More details are available in the homonym section of Supplementary Methods. All the datasets produced from these molecular profiles are made of real-valued features.

#### Predicting Drug Response Using ML Algorithms With Embedded Feature Selection

Most classifiers were built with ML algorithms capable of embedded feature selection to mitigate the impact of highdimensional data on their generalization on unseen data. Implementations of Classification And Regression Tree (CART) (Breiman et al., 1984) and Random Forest (RF) (Breiman, 2001) were taken from the python library *Scikit-learn* version 0.19.1, while the ones of XGBoost (XGB) (Chen and Guestrin, 2016) version 0.6 and LightGBM (LGBM) (Ke et al., 2017) version 2.0.10 were respectively downloaded from https://github.com/ dmlc/xgboost and https://github.com/Microsoft/LightGBM. We also applied a Deep Neural Network (DNN) algorithm (Bengio, 2009) implemented with the python library *Keras* version 2.2.4 using the TensorFlow backend. In addition to these nonlinear models, linear models were generated with Logistic Regression (LR) (Ranstam et al., 2016), which is also implemented in *Scikit-learn*. The visualization of Decision Trees (DTs) was done with the python package *dtreeviz* version 0.2.2. The homonym section in Supplementary Methods provides more details.

### Predicting Drug Response Using the Optimal Model Complexity (OMC)

OMC is a strategy to build ML models employing only the most relevant features (Nguyen et al., 2018). More details are available in the homonym section of **Supplementary Methods**.

#### Measuring the Predictive Performance of a Classifier

This is a binary classification problem, as each patient belongs to one of two classes, responder and non-responder, with the responder considered as the positive class. As it is customary with problems with a small number of data instances (**Table S3**), we are using LOO (Leave-One-Out) CV (Cross-Validation) to evaluate the developed classifiers. Several types of LOOCV will be used: standard for "all-features models", nested for "OMC models", and permutated for "permutation models". As with any other CV (Kohavi, 1995), each data instance (patient here) is exactly once in the test set. Thus, CV performance of a model is exclusively based on the prediction of test instances that were not used in any way to train or select the model (any feature selection is hence carried out on training folds only). Employing nested CV on algorithms requiring model selection (those employing OMC) provides an unbiased estimate of model performance, as it has been shown elsewhere (Cawley and Talbot, 2010; Varma and Simon, 2006).

Once known and predicted classes are compared for all held-out samples, true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are counted among BC patients. These counts are used for calculating classification metrics that summarize the predictive performance reached by a classifier: precision, recall, F1-scores (Van Rijsbergen, 1979), and Matthews Correlation Coefficient (MCC) (Matthews, 1975; Boughorbel et al., 2017). More details about these metrics can be found in the homonym section of Supplementary Methods. Classification scores and contingency matrices obtained from all produced classifiers are stored in **Tables S4** and **S5**, respectively.

### RESULTS

#### Benchmarking of All-Features Models (RF, XGB, LGBM, DNN, LR) Reveals Some Informative Molecular Profiles, But the Resulting Classifiers Perform Marginally Better Than Random

**Figure 1** shows that most of the all-features ML classifiers perform worse than random classification reaching slightly negative median MCCs (from -0.19 to -0.05). Poor performance was also obtained when using linear models: LR models perform randomly at best (median MCCs range

FIGURE 1 | DNA methylation and miRNA expression lead to the most predictive ML models. Each MCC of a given model is calculated by LOOCV. The experiment is repeated several times, each time with a different random seed, giving rise to a boxplot of MCCs for each case. Permutation models were generated after shuffling class labels on the considered training set. As RF models give undefined MCCs, blanks are found in bins where boxplots are expected. Substantial variability is observed, showing that this problem is both profile- and classification method-dependent. The dashed line shows the expected MCC from random classification. (A) Predictive performance of all-features models. All-features models are those in principle employing all the features in the profile to generate the prediction. Models are built with ML algorithms using the default operating threshold (0.5) to calculate the predicted class label from the predicted class probability. Five random seeds were set for each ML algorithm; thus, MCC values come from five runs of standard LOOCV. *x*-axis shows the employed molecular profile, while *y*-axis displays the MCCs obtained by classifiers. From the lightest to the darkest blue, boxplots summarize the distributions of MCCs obtained by XGB,LGBM, LR, and DNN models, respectively. Ellipses indicate which profiles employed by models obtain better-than-random predictive performance: DNA methylation profiles are the most predictive. This also suggests that the other profiles are less informative for the prediction of BC tumor response to paclitaxel. (B) Predictive performance applying OMC to methylation-based models. 10 random seeds were used to investigate further the most predictive profiles. OMC models had their hyperparameters complexity and operating threshold tuned and thus required nested LOOCV. Horizontal axes show the employed ML algorithms to process CpG site (left) and CGI (right) methylation datasets, while vertical axes display MCCs achieved by classifiers. Light-pink, medium-pink, and indigo boxplots summarize the distributions of MCCs obtained by all-features, OMC and permutations models, respectively. Circles indicate ML algorithms releasing models with predictive performance improved using OMC. This shows that predictive accuracy depends on both the molecular profile and the ML algorithm. Here, the best models found are CpG site methylation-based RF-OMC, CGI methylation-based RF-OMC, and CGI methylation-based XGB-OMC. (C) Predictive performance of CART models. These models (light-pink boxes) were built considering all features in the profile with no hyperparameter tuning. Permutation models (indigo boxes) were trained after that shuffling class labels in the training set. Each MCC is calculated by standard LOOCV, a process repeated with 10 different random seeds. *x*-axis shows the molecular profiles ('CNV' is short for copy-number variation, 'methy\_CpG' for CpG site methylation, and 'methy\_CGI' for CGI methylation), while *y*-axis displays the LOOCV MCCs achieved by each classifier. The dashed line shows the expected MCC from random classification. Ellipses indicate molecular profiles processed by CART models obtaining the highest predictive performance. These results reveal that CpG sites methylation-based and miRNA expression-based CART models are the most predictive. Predictive accuracy is substantially higher than that provided by all-features or OMC models (in A and B*)*, which demonstrates that the CART learning algorithm is more suitable for these problem instances.

from -0.17 to 0.1 depending on the profile). Poor predictive performance is primarily caused by FPs (i.e. misclassification of non-responders). This problem was particularly noticeable in RF models, which misclassified every non-responsive patient regardless of the employed profile, leading to undefined MCCs. The latter shows that all-features RF handles class imbalance poorly on these particular problem instances.

DNA methylation-based XGB, LGBM, and DNN models achieve median MCCs slightly higher than 0.0, and they perform hardly better than permutation models. On the one hand, CpG site methylation-based XGB, LGBM, and DNN models obtain a median MCC of 0.08 (*p*-value from two-sided paired Student's *t*-test obtained by class-permutation test = 1.05∙10-1), 0.04 (*p*-value = 1.41∙10-1), and 0.14 (*p*-value = 4.08∙10-1), respectively. On the other hand, CpG island (CGI) methylation-based XGB and LGBM models achieve a median MCC of 0.08 (*p*-value = 2.33∙10-1) and 0.09 (*p*-value = 8.04∙10-2), respectively. Moreover, miRNA and mRNA expression-based DNN models had a median MCC of 0.088 (*p*-value = 4.84∙10-2) and 0.015 (*p*-value = 4.76∙10-1), respectively.

#### Complexity-Optimized ML Models (RF-OMC, XGB-OMC, and LGBM-OMC) Provide Better Prediction and Extract Relevant Factors for Paclitaxel Response From CGI Methylation Data

Using OMC allows both to reduce considerably the number of features considered during model training and to adjust the operating threshold for assigning class labels to data instances. This leads to some OMC models that perform better than those considering all features from dataset (**Table S12** in **Supplementary Results**). This is especially the case for some methylation-based models that have been improved using OMC (**Figure 1**), unlike for models based on other profiles (**Figure S5**). The improvement of OMC over the all-features approach is ML algorithm-dependent.

CGI methylation-based OMC models have obtained improved predictive performance, using either RF or XGB. For instance, XGB-OMC models obtain a median MCC of 0.25, which is significantly better than both permutation and all-features models (*p*-values equal 9.30 ×∙10-4 and 2.16 ×∙10-2, respectively). In order to extract a robust subset of molecular factors potentially involved in paclitaxel response, the most informative features selected by these models were investigated (**Table S13** and **S14**). It results in 7 out of the 11,644 CGI coordinates encoded as CGI\_ID.24217, CGI\_ID.15915, CGI\_ ID.6919, CGI\_ID.5276, CGI\_ID.5459, CGI\_ID.16043, and CGI\_ID.11903. Moreover, we notice that 5 of them are common to the features used by the RF-OMC models. Consulting indices provided in **Tables S7** and **S8** (more details in **Supplementary Methods**), we found that these coordinates are related to the following 16 genes: CYP2D6, NDUFA6-AS1, RP4-669P10.19 (or C6orf108 pseudogene), MBTPS2, YY2, C2orf40 (or ECRG4), UXS1, IKZF1, APOBEC4, RGL1, ARPC5, NCF2, SMG7, C1orf177 (or LEXM), RP11-631M21.6 (or FAM166A pseudogene 7), and TUBB8 (**Table S14**).

#### Transparent ML Models (CART) Capture CpG Methylation Sites and Mature MIRNAS Relevant for the Sensitivity to Paclitaxel and Show How They Are Combined to Explain Drug Response

Most of the available profiles led to poor classification of test set patients when modelled with CART (**Figure 1**). By contrast, CART classifiers based on miRNA expression and CpG site methylation data provided high to very high predictive performance in the context of this problem (in 10 LOOCV runs, median MCCs of 0.43 and 0.54 were obtained, respectively) and performed significantly better than random models (*p*-values from twosided paired Student's *t*-test obtained by class-permutation test equal 4.57∙× 10-6 and 2.86∙× 10-4, respectively; see **Figure 1** and **Table 1**). For each case, the best model is defined as that obtaining the highest MCC in 10 standard LOOCVs of the full dataset (i.e. all data instances and all available features). **Figure S6** shows that the performance of these models is robust to different sizes of both training set and test set.

As observed in **Figure 2** CART models strongly reduce the number of features involved in the predictions. The miRNA expression-based model found that 4 out of 337 mature miRNAs were the most informative features (MIMAT0004985 or miR-942-5p, MIMAT0000084 or miR-27a-3p, MIMAT0000274 or miR-217, and MIMAT0004657 or miR-200c-5p), while the CpGsite methylation model identified 2 out of 22,941 CpG sites as the most informative features (cg09691574, which is related to the protein coding genes MRGPRX4 and SAA2-SAA4, and to the lincRNA RP11-113D6.6 also called antisense to MRGPRX4; and cg12542281, which is related to the protein coding gene N4BP2L2). The DTs represented in **Figure 2** show directly the interactions between independent features leading to the predictions. They also reveal the molecular types associated to paclitaxel-sensitive and paclitaxel-resistant BC tumors (the CpG site index is provided in **Table S7**).

Lastly, integrating different molecular profiles has sometimes been found to provide small predictive accuracy gains, e.g. see **Figure 4** in this study (Costello et al., 2014). Thus, since both miRNA and methy\_CpG profiles led to the most predictive models, it makes sense to integrate these data sets and train CART models on the features of the resulting hybrid profile. Using the same 10 random seeds as the methy\_CpG-based CART models (median LOOCV MCC of 0.54), the hybrid CART models obtained slightly worse accuracy (median LOOCV MCC of 0.52). The resulting CART tree is identical to that in **Figure 2**, suggesting that miRNA features were overshadowed by methy\_CpG features during CART induction.

### DISCUSSION

Owing to the wealth of curated data offered by the GDC, we could evaluate six profiles. The exhaustive evaluation of the 60

predictive models obtained, employing 10 ML algorithms with each profile, reveal strong variability in predictive performance (**Figure 3**). These results show the importance of considering multiple profiles and ML algorithms, the latter being always possible. For example, we could have carried out this study using the standard all-features versions of tree-ensemble, LR and DNN algorithms. However, this would have only resulted in models with near-random predictability despite using six profiles and thus, we could have concluded that precision oncology is not possible for paclitaxel-treated BC patients. Instead, we also tested algorithms generating models requiring only a handful of features (OMC-based and CART), which in addition, provided the best performance on these problem instances. Note that the most predictive of these models achieved an over 10,000-fold reduction in the number of features (**Table 1**).

Identifying a concise list of predictive molecular features is indeed beneficial for interpretability. The CGI methylationbased XGB-OMC model employs a dramatically reduced number of features (11 of the considered 11,644). The increased predictive performance comparing to all-features model (**Figure 1**) shows that the selected subset of features contains the information relevant for predictions (**Figure S2**). Therefore, applying OMC not only offers better predictivity, but also better interpretability of response to paclitaxel, as it revealed a molecular signature able to discriminate sensitive and resistant BC tumors from high-dimensional data. The best CART models reached the highest predictive performance among the generated predictors (**Figure 1**). Moreover, these models allow going further in the interpretation of response to paclitaxel (**Figure 2**). For example, the CpG-methylation DT unveils two rules employing only two features to predict which are the paclitaxel-sensitive BC tumors (**Figure 2**). The other example is the miRNA DT, which carries out these predictions using four induced rules based on only four features (**Figure 2**). Thus, the application of these rules to forthcoming tumors should improve paclitaxel treatment for BC patients. To facilitate such application, we are providing two python scripts in the supplementary materials, each implementing the rules for one of these predictive profiles.

Our best classifier obtained a median MCC of 0.54 in 10 LOOCV runs (an average MCC of 0.62, with MCC ranging from


*The predictive performance of CART models was presented in Figure 1. Here we summarize the characteristics of the two best models (i.e. those exploiting miRNA expression and CpG methylation profiles). A median MCC was calculated with the 10 MCCs coming from LOOCV experiments (each with a different random seed). This five additional LOOCV runs with respect to those presented in Figure 3 were carried out to better characterize the performance of the best models found in our study. The small difference found in median MCC (0.52 in Figure 3 versus 0.54 here) suggests that this performance metric is quite robust to the number of LOOCV runs for CART. The training sets were also class-permutated during cross-validation as explained in the Methods section and CART trained on the resulting data to provide a second set of 10 MCCs per profile. The p-value (two-sided paired Student's t-test) of this class-permutated test shows how likely are the MCCs of the CART models to arise by chance. The first model was trained on miRNA expression: 4 out of 337 mature miRNAs were retained to build this model reaching a median MCC of 0.43 and performing significantly better than models based on permutated data (p-value = 4.57∙10-6). The second model is obtained processing CpG site methylation (shorten as 'methy\_CpG'): 2 out of 22,941 CpG sites were retained to build this model achieving a median MCC of 0.54 and performing significantly better than permutation models (p-value = 2.86∙10-4).*

in Figure 1 exploiting (A) miRNA expression and (B) CpG site methylation data. Each DT node has a histogram showing the distribution of patients at that node against the selected feature (the proportion of responders vs non-responders in each feature bin is also shown). The triangle under the histogram marks the value of the best split for the selected feature, whose name can be found under the histogram as well. Each node has two leaves: to the left (patients with a feature value lower than that of the best split) and to the right (the rest of the patients). Terminal nodes (or leaves) are displayed as pie charts. The proportions of non-responders and responders are respectively colored yellow and green. The log2-transformed miRNA expression-based DT shown in (A) reveals four different molecular types of sensitive BC tumors and one molecular type associated to resistant BC tumors involving four mature miRNAs: MIMAT0004985, MIMAT0000084, MIMAT0000274, and MIMAT0004657 (also known as miR-942-5p, miR-27a-3p, miR-217, and miR-200c-5p, respectively). Thus, a tumor is classified as responsive to paclitaxel if: 1) the expression value of MIMAT0004985 is higher than 2.18, or 2) the expression value of MIMAT0004985 is lower than 2.18 and that of MIMAT0000084 is lower than 9.69, or 3) the expression value of MIMAT0004985 is lower than 2.18, and that of MIMAT0000084 is higher than 9.69 and that of MIMAT0000274 is higher than 4.55, or 4) the expression value of MIMAT0004985 is lower than 2.18, and that of MIMAT0000084 is higher than 9.69, and that of MIMAT0000274 is lower than 4.55, and that of MIMAT0004657 is higher than 5.39. Otherwise, the tumor is classified as non-responsive. The DT based on CpG site methylation (shortened as 'methy\_CpG') shown in (B) unveils two different molecular types of sensitive BC tumors and one type of resistant BC tumors involving two CpG sites represented by probes cg09691574, which is related to the protein coding genes, MRGPRX4 and SAA2-SAA4; and to the lincRNA RP11-113D6.6, also called antisense, to MRGPRX4; and to cg12542281, which is related to the protein coding gene N4BP2L2. Thus, a tumor is predicted to be sensitive to paclitaxel if: 1) the beta value associated to the methylation of cg09691574 is higher than 0.77, or 2) the beta value associated to the methylation of cg09691574 is lower than 0.77, and that of cg12542281 is higher than 0.05. Otherwise, the tumor is predicted to be resistant. Both DTs were found to give pure leaves (i.e. all data instances that are in terminal nodes belong to the same class).

0.48 to 0.87 in these runs as it can be seen in **Figure 1**). To put these predictive accuracies in the context of what is typically achieved when predicting tumor response to a drug from omics profiles, we have looked at other test set performances reported at the literature for this problem. One study (Kim et al., 2016) applied a range of ML algorithms to predict pancancer cell line response from transcriptomic profiles and obtained MCCs below 0.6 in all cases (see **Figure 1** in that paper). Maximum MCCs slightly above 0.5 and 0.3 were also obtained using RF with transcriptomic profiles (Nguyen et al., 2017) and genomic profiles (Naulaerts et al., 2017), respectively. Another study (Xu et al., 2019) also predicted drug response using many hundreds of pancancer cell lines using several ML algorithms from various omics profiles (gene expression, copynumber alterations, single-nucleotide mutations). Depending on the considered data resource, average MCCs across drug and profiles range from 0.15 to 0.31 or from 0.22 to 0.45 (see Tables 2 and 3 in that paper). Yet another example is by (Tripathi et al., 2016) using gene variants as features, where MCCs range across

FIGURE 3 | Employing multiple ML algorithms and tumor profiles increase the likelihood of discovering models able to predict BC patient response to paclitaxel. ML algorithms include the unaltered version of tree-ensemble and linear algorithms using all available features (RF, XGB, LGBM, and LR) and their OMC versions (RF-OMC, XGB-OMC, LGBM-OMC, and LR-OMC). The 9th algorithm was CART, employed to generate simpler and more interpretable classification models. The 10th algorithm was DNN, employed to generate more sophisticated but less interpretable models. Each of these algorithms was evaluated on each of the six molecular profiles, which resulted in 60 classifiers on the same BC patients. LOOCV evaluation was performed 5 times setting a different random seed for the employed ML algorithm, leading to 5 MCC determinations quantifying predictive performance. The heatmap shows the median MCC per classifier. Rows show the processed molecular profiles ('CNV' is short for copy-number variation, 'methy\_CpG' for CpG site methylation, and 'methy\_CGI' for CGI methylation), while columns display the employed ML algorithms. Thus, each cell corresponds to the median MCC of a given predictive model. Cells are colored in light-blue and dark-blue when this model reaches a negative or very negative median MCC (i.e. classification worse than random); in grey when it reaches a median MCC very close to 0.0 (i.e. random classification); in light-brown and dark-brown when it reaches a positive or very positive median MCC (i.e. classification better than random or close to perfect); in white and labelled NA (i.e. not available) when it reaches an undefined median MCC (i.e. misclassification of non-responders within several or all iterations). These results show that DNA methylation is the most informative profile (it leads to 2 of the 3 classifiers with a median MCC of a least 0.25). The choice of ML algorithm also affects the predictive performance. For example, none of the RF or LGBM classifiers obtain an MCC of at least 0.10. Thus, predictive performance depends strongly on of algorithm- profile combination: only one XGB-OMC models is predictive (that based on CGI methylation) and it is among the best predictors (median MCC of 0.25). Two other examples are the CART classifiers based on CpG methylation and miRNA expression, with median MCC of 0.52 and 0.43, respectively. Figures S4 and S5 further characterizes the performance of the best classifiers.

drugs from 0.32 to 0.56 or from 0.30 to 0.44 depending on data resource (see **Tables 1** and 2 in that paper). Lastly, single-gene drug response markers identified by MANOVA and Chi-Square tests on pancancer cell lines obtained maximum MCCs of 0.30 and 0.31, respectively (Dang et al., 2018).

The alteration of gene expression due to epigenetic modifications triggers the development of cancers, including BC. DNA methylation changes, occurring both within and around CGIs, can impact transcriptional activity of genes or transcription factors involved in malignant phenotypes (Esteller, 2002; Irizarry et al., 2009; Levenson, 2010; Deaton and Bird, 2011; Manjegowda et al., 2017; Stirzaker et al., 2017). It has been shown that biomarkers for prognosis and treatment can be unearthed from DNA methylation profiles (Xiang et al., 2013; Mikeska and Craig, 2014; Stirzaker et al., 2014; Li et al., 2015; Pouliot et al., 2015). Furthermore, it has been found that DNA methylation can interfere in chemo-resistance to paclitaxel (Wang et al., 2012; Ignatov et al., 2014; Yun et al., 2015; He et al., 2016; Zhang et al., 2018). Our DNA-methylation-based predictors selected CpG sites and CGIs related to genes previously found individually involved in cancer development and with transcriptional activity regulated by methylation (e.g. MBTPS2, YY2, ECRG4, IKZF1). Selected features by these models are also related to genes associated to response to cytotoxic drugs such as N4BP2L2 (paclitaxel), CYP2D6 (tamoxifen), APOBEC4 (tamoxifen, doxorubicin, and etoposide), and TUBB8 (paclitaxel) (**Table S15**).

miRNAs also play a key role in cancer development by acting as tumor suppressors or oncogenes. These molecules can be used as biomarkers, and modulation of their specific activities provides insight for therapeutic investigations (Hayes et al., 2014; Peng and Croce, 2016). Furthermore, the expression of some miRNAs has been associated to the sensitivity to paclitaxel (Zhou

et al., 2010; Chen et al., 2014; He et al., 2016; Lu et al., 2017). The miRNA expression-based CART model combines miR-27a-3p, miR-217, miR-200c-5p, and miR-942-5p to predict which BC tumors are paclitaxel-responsive with high accuracy (**Figures 1** and **2A**). Individually, each of these miRNAs have been linked to paclitaxel response and BC prognosis: the first three are related to paclitaxel resistance, whereas the last one is associated to shorter survival of BC patients (**Table S15**).

Our study has some limitations to be pointed out. First, for a given patient, molecular profiles were obtained from the primary tumor, while clinical response was registered later following tumor evolution. Both tumors may present some differences at the molecular level, due to temporal or spatial tumor heterogeneity, as well the possible impact of the treatment administered after tumor resection. Second, while we reported predictive accuracy on BC tumors not used in any way to build or select the model, an additional independent evaluation on a second cohort would shed further light into how general these models are. The latter is currently not possible due to the scarcity of paclitaxel-treated BC patients with DNA methylation or miRNA profiles of their tumors.

Yet, our work provides very predictive (in the context of the considered problem), robust (**Figure S4** and **Figure 4**), and even interpretable models to identify primary BC tumors sensitive to paclitaxel. These results also suggest that tumor methylomes and miRNomes can be a source of multi-variate models to predict prognosis and treatment response. Indeed, our predictive models reveal molecular features that can collectively anticipate which BC tumors are sensitive or resistant to paclitaxel. Previous studies have experimentally validated the involvement in BC development, and even in the resistance to paclitaxel, of these molecular factors individually, which further supports the applicability of these classifiers. Furthermore, our results also suggest novel predictive factors such as the antisense to MRGPRX4; the pseudogenes (Poliseno et al., 2015; Xiao-Jie et al., 2015) C6orf108 and FAM166A; and the coding genes NDUFA6-AS1, UXS1, RGL1, and LEXM.

### DATA AVAILABILITY STATEMENT

The datasets generated for this study can be found in the https:// portal.gdc.cancer.gov/.

### AUTHOR CONTRIBUTIONS

PB conceived the study and designed the experiments. AB and PB wrote the manuscript with the assistance of AG. AB carried out the numerical experiments. All authors analyzed the results and contributed to their discussion.

### FUNDING

This work was supported by the Institut Paoli-Calmettes (grant number 305/2016 to PB).

### REFERENCES


Cawley, G. C., and Talbot, N. L. C. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. *J. Mach. Learn. Res.* 11, 2079–2107.

Chen, N., Chon, H. S., Xiong, Y., Marchion, D. C., Judson, P. L., Hakam, A., et al. (2014). Human cancer cell line microRNAs associated with in vitro sensitivity to paclitaxel. *Oncol. Rep.* 31, 376–383. doi: 10.3892/or.2013.2847

Chen, T., and Guestrin, C. (2016). "XGBoost," in *Reliable Large-scale Tree Boosting System*. ACM New York, NY, USA. doi: 10.1145/2939672.2939785

Costello, J. C., Heiser, L. M., Georgii, E., Gönen, M., Menden, M. P., Wang, N. J., et al. (2014). A community effort to assess and improve drug sensitivity prediction algorithms. *Nat. Biotechnol.* 32, 1202–1212. doi: 10.1038/nbt.2877

Dang, C. C., Peón, A., and Ballester, P. J. (2018). Unearthing new genomic markers of drug response by improved measurement of discriminative power. *BMC Med. Genomics* 11, 10. doi: 10.1186/s12920-018-0336-z

Deaton, A. M., and Bird, A. (2011). CpG islands and the regulation of transcription. *Genes Dev.* 25, 1010–1022. doi: 10.1101/gad.2037511

### ACKNOWLEDGMENTS

Computing resources for this study were partly provided by the computing facilities DISC (Datacenter IT and Scientific Computing) of the Centre de Recherche en Cancérologie de Marseille.

### SUPPLEMENTARY MATERIALS

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01041/ full#supplementary-material

Ding, Z., Zu, S., and Gu, J. (2016). Evaluating the molecule-based prediction of clinical drug responses in cancer. *Bioinformatics* 32, 2891–2895. doi: 10.1093/ bioinformatics/btw344

Esteller, M. (2002). CpG island hypermethylation and tumor suppressor genes: a booming present, a brighter future. *Oncogene*. 21, 5427–5440. doi: 10.1038/ sj.onc.1205600


GDC Reference Files | NCI Genomic Data Commons.


Genomic Data Harmonization | NCI Genomic Data Commons.


He, D. X., Gu, F., Gao, F., Hao, J. J., Gong, D., Gu, X. T., et al. (2016). Genome-wide profiles of methylation, microRNAs, and gene expression in chemoresistant breast cancer. *Sci. Rep.* 6, 24706. doi: 10.1038/srep24706

Housman, G., Byler, S., Heerboth, S., Lapinska, K., Longacre, M., Snyder, N., et al. (2014). Drug resistance in cancer: an overview. *Cancers (Basel)*. 6, 1769–1792. doi: 10.3390/cancers6031769


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Bomane, Gonçalves and Ballester. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Sequence-Derived Markers of Drug Targets and Potentially Druggable Human Proteins

#### *Sina Ghadermarzi1, Xingyi Li2, Min Li2\* and Lukasz Kurgan1\**

1 Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States, 2 School of Computer Science and Engineering, Central South University, Changsha, China

Recent research shows that majority of the druggable human proteome is yet to be annotated and explored. Accurate identification of these unexplored druggable proteins would facilitate development, screening, repurposing, and repositioning of drugs, as well as prediction of new drug–protein interactions. We contrast the current drug targets against the datasets of non-druggable and possibly druggable proteins to formulate markers that could be used to identify druggable proteins. We focus on the markers that can be extracted from protein sequences or names/identifiers to ensure that they can be applied across the entire human proteome. These markers quantify key features covered in the past works (topological features of PPIs, cellular functions, and subcellular locations) and several novel factors (intrinsic disorder, residue-level conservation, alternative splicing isoforms, domains, and sequence-derived solvent accessibility). We find that the possibly druggable proteins have significantly higher abundance of alternative splicing isoforms, relatively large number of domains, higher degree of centrality in the protein-protein interaction networks, and lower numbers of conserved and surface residues, when compared with the non-druggable proteins. We show that the current drug targets and possibly druggable proteins share involvement in the catalytic and signaling functions. However, unlike the drug targets, the possibly druggable proteins participate in the metabolic and biosynthesis processes, are enriched in the intrinsic disorder, interact with proteins and nucleic acids, and are localized across the cell. To sum up, we formulate several markers that can help with finding novel druggable human proteins and provide interesting insights into the cellular functions and subcellular locations of the current drug targets and potentially druggable proteins.

Keywords: drug targets, druggability, druggable human proteome, drug-protein interactions, protein-protein interactions, intrinsic disorder

## INTRODUCTION

Knowledge of the drug-target interactions is essential for numerous applications including screening of drug candidates (Schneider, 2010; Núñez et al., 2012; Dalkas et al., 2013; Tseng and Tuszynski, 2015), drug repositioning and repurposing (Chong and Sullivan, 2007; Haupt and Schroeder, 2011; Oprea and Mestres, 2012; Hu and Bajorath, 2013; Li et al., 2016), characterization and mitigation of side-effects of drugs (Lounkine et al., 2012; Wang et al., 2012b; Kuhn et al., 2013; Tarcsay and Keserű, 2013; Hu et al., 2014), and prediction of novel protein-drug interactions (Wang et al., 2016a; Lotfi

#### Edited by:

Shandar Ahmad, Jawaharlal Nehru University, India

#### Reviewed by:

Marcelo Adrian Marti, University of Buenos Aires, Argentina Yuanyuan Wang, St. Jude Children's Research Hospital, United States

#### \*Correspondence:

Min Li limin@mail.csu.edu.cn Lukasz Kurgan lkurgan@vcu.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 30 May 2019 Accepted: 09 October 2019 Published: 15 November 2019

#### Citation:

Ghadermarzi S, Li X, Li M and Kurgan L (2019) Sequence-Derived Markers of Drug Targets and Potentially Druggable Human Proteins. Front. Genet. 10:1075. doi: 10.3389/fgene.2019.01075

1 **73** Ghadermarzi et al. Markers of Druggable Human Proteins

Shahreza et al., 2017; Ezzat et al., 2018; Hao et al., 2019; Wang and Kurgan, 2019; Wang and Kurgan, 2018; Wang et al., 2019). Recent analysis reveals that over 95% of the currently known drug targets are proteins and that these proteins facilitate about 93% of known drug-target interactions (Santos et al., 2017). Thus, we focus on the drug-protein interactions and we use the term "drug target" as a synonym for the protein drug target. While earlier works report about 400 drug targets (Hopkins and Groom, 2002; Russ and Lampel, 2005), subsequent studies annotate as many as over 600 drug targets in human (Santos et al., 2017). Furthermore, the druggable human proteome, defined as the full complement of the human drug targets (Hopkins and Groom, 2002; Russ and Lampel, 2005; Rask-Andersen et al., 2014; Cimermancic et al., 2016; Hu et al., 2016), is expected to be much larger. Early estimates place the number of human drug targets at around 3,000 (Hopkins and Groom, 2002; Russ and Lampel, 2005). A more recent analysis approximates this number at 4.5 thousand (Finan et al., 2017), which corresponds to about 22% of the human genome. While the historically typical drug targets include G-protein coupled receptors, nuclear receptors, ion channels, and some of the enzymes (Overington et al., 2006; Imming et al., 2007), recent works suggest that many of the nonenzymes (e.g., scaffolding, regulatory, and structural proteins) and proteins involved in specific protein-protein interactions (PPIs) should be targeted by drugs (Makley and Gestwicki, 2013; Ozdemir et al., 2019), effectively expanding the list of potential drug targets. These observations point to the fact that many of the drug targets remain to be discovered and characterized. The search for these proteins relies on the concept of druggability, which was originally defined based on the presence of structure that favors interactions with drug-like compounds where the corresponding interactions provide desired therapeutic effects (Hopkins and Groom, 2002; Russ and Lampel, 2005; Keller et al., 2006). In a purely structural context, druggability is related to binding of a compound to a given protein target with high affinity (< 1 µM) (Sheridan et al., 2010; Radusky et al., 2014). We focus on the former definition where both the interactions and the therapeutic effects are considered.

One of the key elements in the quest to find druggable proteins is to identify functional and structural characteristics that differentiate drug targets from the non-drug targets (Zheng et al., 2006; Lauss et al., 2007; Bakheet and Doig, 2009; Zhu et al., 2009b; Zhu et al., 2009c; Bull and Doig, Mitsopoulos et al., 2015; 2015; Feng et al., 2017; Kim et al., 2017). In one of the earliest works, Chen *et al.* concentrated on the analysis of structural fold types, target family representation and similarity, pathway associations, tissue distribution, and chromosome location for the drug targets (Zheng et al., 2006). A similar analysis that considered cellular functions, pathway associations, tissue distribution, and subcellular and chromosome location of the drug targets was published soon after by Lauss and colleagues (Lauss et al., 2007). More recent studies have shifted the focus towards characteristic features of the target protein sequence and structure. Bakheet and Doig used a relatively small set of 148 targets to analyze several sequence properties (chain length, hydrophobicity, charge, and isoelectric point), putative secondary structure and transmembrane regions, inclusion of signal peptides, selected set of post-translational modifications (PTMs), as well as the previously studied subcellular location and functions (Bakheet and Doig, 2009). Subsequently, Bull and Doig investigated a similar set of characteristics using a much larger set of 1324 drug targets (Bull and Doig, 2015). They considered a similar set of sequence properties, native secondary structure and signal peptides, selected PTMs, and a few new properties: the number of germline variants, expression levels, and the number of PPIs (Bull and Doig, 2015). The most recent study by Park, Lee, and colleagues expanded the above list of characteristics by inclusion of gene essentiality and tissue specificity (Kim et al., 2017). Moreover, several articles narrowly focused on characteristics that quantify topological features of the underlying PPI networks (Zhu et al., 2009b; Zhu et al., 2009c; Mitsopoulos et al., 2015; Feng et al., 2017). While these studies have considered a broad range of functional and structural features of drug targets, they identified the drug target-specific characteristics by comparing the drug targets against the other human proteins (non-drug targets). However, many of these non-drug targets could be in fact druggable, i.e., as many as 22% according to (Finan et al., 2017). Using the non-drug targets to represent the nondruggable proteins in order to define characteristic features of the druggable targets ultimately creates a bias toward describing the currently known drug targets. Consequently, this reduces our ability to use these characteristics to identify a complete set of druggable proteins.

We address the abovementioned shortcoming of the prior works by comparing sequence-derived characteristics of the drug targets, possibly druggable proteins, and non-druggable proteins using a large and well-curated dataset of human proteins. Our study is novel in four ways. First, we contrast the drug targets (D dataset) not only against all non-drug targets (N dataset), which was also done in prior studies, but also against nondruggable non-drug targets (Nn dataset; the non-drug targets that exclude disease associated proteins) and against possibly druggable non-drug targets (Nd dataset; the non-drug targets that are associated with multiple diseases). The association of the non-drug targets with diseases is necessary for the druggable proteins to exert therapeutic effects. Second, we further compare the D, N, Nd, and Nd proteins against highly promiscuous drug targets that interact with many drugs (Dh dataset) and drug targets that interact with low number of drugs (Dl dataset). This full-spectrum analysis allows us to pinpoint characteristics that differentiate between drug targets, possibly druggable proteins and non-druggable proteins, as well as features that are specific to promiscuous *vs*. non-promiscuous drug targets. Third, we focus on the characteristics that can be quantified directly from the protein sequence or protein name/identifier. This facilitates their use as potential markers for druggability across the entire human proteome. This is in contrast to several related studies that are limited to a relatively small subset of human proteins with solved structures (Hambly et al., 2006; Bull and Doig, 2015; Hu et al., 2016; Wang et al., 2016a; Wang et al., 2019). Fourth, we include several important sequence/protein-derived characteristic that were missed in the past studies including putative intrinsic disorder, residue-level conservation, presence and number of alternative splicing isoforms, inclusion of domains, and solvent accessibility (surface area). Moreover, we cover some of the key characteristics from the prior works, such as the topological features of PPIs, cellular functions, and subcellular locations.

## MATERIALS AND METHODS

## Datasets

#### Datasets of Drug Targets (D Dataset), Highly Promiscuous Drug Targets (Dh Dataset), and Low-Interaction Drug Targets (Dl Dataset)

We collect a comprehensive set of drug targets by combining interaction information extracted from several large bioactive compounds-protein interaction databases. We filter these bioactive compounds to include only approved and experimental drugs. Furthermore, we focus on human proteins by excluding protein fragments and proteins from other organisms. We maximize the coverage by first collecting an inclusive set of interactions (including all bioactive compounds and protein chains) and then applying the two filters to obtain a high quality and large set of drugs and proteins.

The data collection protocol follows the work in (Wang and Kurgan, 2019; Wang and Kurgan, 2018). We extract the source data from three large repositories: Drug2gene (Roider et al., 2014), TTD (Zhu et al., 2009a), and GtP (Harding et al., 2017). Drug2gene is one of the most inclusive repositories that aggregates 19 source databases including TTD and GtP and several other major databases like ChEMBL (Gaulton et al., 2016) and DrugBank (Wishart et al., 2017). However, Drug2gene includes older and substantially smaller version of the TTD and GtP resources. Therefore, we integrated the latest versions of these two databases into our dataset. These databases provide a list of drug-protein pairs that use different identifiers and which include other information that could be useful to identify these molecules (like drug structure). The arguably most popular way to identify drugs and proteins are the PubChem CIDs and UniProt accession numbers, respectively. We use these identifiers to map data between the resources. We also merged the drugs with different PubChem CID but identical *simplified molecularinput line-entry system* (SMILES) structures. First, we remove the data collected from TTD and GTP that lacks PubChem CID or UniProt identifiers. Next, we map the proteins in Drug2gene that are represented by Entrez Gene ID into the corresponding UniProt accession numbers. After mapping and combining these datasets and removing duplicates, we obtain 2,490,057 interactions for 591,684 bioactive compounds and 4,128 proteins. Next, we filter this list of compounds using the list of drugs obtained from the DrugBank and ChEMBL. We remove the compounds that do not have the same CID or SMILES structure when compared to the list of DrugBank and ChEMBL drugs. Finally, we remove non-human proteins using a reference human proteome from UniProt. At the end, the set of drug targets (D dataset) includes 33,104 interactions between 4,405 drugs (PubChem CID) and 1,638 protein (UniProt identifiers). We provide the complete D dataset in the **Supplementary Material**. Moreover, we generate an expanded set of human and human-like drug targets that includes proteins in the D dataset plus proteins from other organisms that share high sequence similarity to the human proteins (D+ dataset). More specifically, following recent works (Hu et al., 2014; Wang et al., 2016a; Wang et al., 2019), human proteins that share at least 90% sequence identity quantified using BLAST with default parameters (Altschul et al., 1997) to any of the drug targets were added into the D+ dataset. Consequently, the D+ dataset has 1,762 proteins including 124 proteins that were included based on the high similarity; we list these proteins in the **Supplementary Material**. The number of drug targets in our dataset is slightly higher than the sizes of the datasets used in related studies (in the inverse chronological order): 1604 in (Feng et al., 2017), 1578 in (Kim et al., 2017), 1324 in (Bull and Doig, 2015), and 1,030 in (Rask-Andersen et al., 2014). Compared to popular databases, such as KEGG DRUG and DrugBank, our dataset features a more complete set of interactions (33,104 *vs*. 14,222 and 23,380, respectively (Wang and Kurgan, 2019) while focusing on a smaller and relevant set drugs that specifically target human proteins [4,405 *vs*. 5,045 and 10,562, respectively (Wang and Kurgan, 2019).

Drug targets in our dataset interact with as few as 1 drug and as many as 443 drugs. We investigate whether sequence-derived and functional characteristics of highly promiscuous drug targets are different from the drug targets that interact with a few proteins. To do that we extracted two subsets of the drug targets, the highly promiscuous targets (Dh dataset) that correspond to the top quartile of the targets with the highest interaction counts, and the low-interaction drug targets (Dl dataset) that include the bottom quartile of the drug targets with the lowest numbers of interactions.

#### Dataset of Non-Drug Targets (N Dataset)

We contrast the sequence-derived and functional characteristics of the proteins in the D, D+, Dh, and Dl datasets against the proteins that are not current drug targets. We collect these non-drug targets (N dataset) by selecting proteins from the UniProt's human proteome that are not in the D dataset. The selection process follows two rules. First, we match the size of the N dataset to the size of the D dataset to ensure robust statistical comparisons between different datasets. Second, when down-sampling the human proteins we ensure that the selected proteins have similar size as the proteins in the D dataset. More specifically, for each protein in the D dataset we pick a human non-drug target at random (without replacement) that has a matching sequence length (with 10% tolerance). We introduce the latter rule since the amount of intrinsic disorder in proteins is dependent on proteins length (HOWELL et al., 2012). The same selection process was used in several related studies (Meng et al., 2015b; Na et al., 2016; Meng et al., 2018) to eliminate protein size bias when studying intrinsic disorder. We provide the list of the 1,638 size-matched proteins that constitute the N dataset in the **Supplementary Material**. Moreover, Section "Non-druggable and possibly druggable proteins" describes how the N dataset is used to derive the dataset of Non-druggable non-drug targets (Nn dataset; the non-drug targets that exclude disease associated proteins) and the dataset of possibly druggable non-drug targets (Nd dataset; the non-drug targets that are associated with multiple diseases).

## Characterization of Protein Properties

We characterize a broad collection of characteristics of human proteins that include their disease associations, structural properties derived from the sequence (putative intrinsic disorder and surface), sequence properties (domain annotations, alternative splicing, and residue-level conservation), topological properties of the corresponding PPI network (centrality measures and hubs), and functional properties (GO annotations and predicted protein-binding regions). We extract these characteristics directly from the protein sequence or protein names/identifiers. This means that they could be used as potential markers for druggability that cover the entire human proteome.

#### Disease Associations

The protein-disease association data were collected from DisGeNET (Gutiérrez-Sacristán et al., 2016). DisGeNET integrates several curated databases and offers arguably one of the most complete levels of coverage for human diseases. This database provides association between disease MeSH IDs and Entrez Gene IDs and also provides a mapping between Entrez Gene IDs and UniProt identifiers. We mapped these annotations to our dataset using the UniProt identifiers.

#### Sequence-Derived Structural Properties

We annotate two relevant structural properties that we can accurately derive from the protein sequences: intrinsic disorder and solvent accessibility. We are unable to directly collect structural data since significant majority of the proteins in the D, D+, and N datasets do not have solved structures.

Intrinsically disordered proteins and protein regions lack a stable tertiary structure in isolation (Dunker et al., 2013; Habchi et al., 2014; Uversky, 2014a). Proteins with disordered regions are crucial for many key cellular functions including molecular recognition and assembly, cell cycle and cell death regulation, signal transduction, transcription, translation, and viral cycle (Dyson and Wright, 2005; Uversky et al., 2005; Liu et al., 2006; Xie et al., 2007; Peng et al., 2012; Xue et al., 2012; Peng et al., 2013; Uversky et al., 2013; Fan et al., 2014; Fuxreiter et al., 2014; Peng et al., 2014b; Xue and Uversky, 2014; Dolan et al., 2015; Meng et al., 2015a; Meng et al., 2015b; Varadi et al., 2015; Babu, 2016; Na et al., 2016; Yan et al., 2016; Wang et al., 2016b; Kjaergaard and Kragelund, 2017). They are also the main contributors to the dark proteome (Hu et al., 2018; Kulkarni and Uversky, 2018). Intrinsic disorder is abundant in the human proteins. Computational studies estimate that about 19% amino acids in eukaryotic proteins are intrinsically disordered (Peng et al., 2015) and over 40% human proteins have at least one long disordered region with 30 or more consecutive residues (Oates et al., 2013). These proteins are particularly relevant to this study since they are associated with several human diseases (Uversky et al., 2008; Babu, 2016; Uversky et al., 2014; Uversky, 2014b) and since they attract recent interest as potent drug targets (Cheng et al., 2006; Uversky, 2012; Dunker and Uversky, 2010; Ambadipudi and Zweckstetter, 2016; Tantos et al., 2015). Intrinsic disorder can be predicted accurately from protein sequence using computational methods (Peng and Kurgan, 2012; Walsh et al., 2015; Lieutaud et al., 2016; Meng et al., 2017a; Meng et al., 2017b). We use one of the leading disorder predictors, IUPred (Dosztányi et al., 2005; Dosztanyi, 2018). This selection is motivated by the fact that IUPred is computationally efficient (i.e., it can be used to process large datasets of proteins, such as the D and N datasets) and since it provides accurate predictions (Peng and Kurgan, 2012; Walsh et al., 2015). We use the IUPred's results to compute the disorder content (fraction of disordered residues in a given protein) and the length of the putative disordered regions.

Solvent accessibility provides a crucial context for the analysis of the residue-level conservation since it allows us to separate conserved residues that are localized on the surface (which include residues that are instrumental for the drug-protein interaction) from those located in the protein core (which are likely responsible for structural stability of the protein). We predict the relative accessible surface area using the ASAquick method (Faraggi et al., 2014). This method predicts relative solvent accessibility from a single sequence (without alignment), and thus it much faster than the other predictors that require calculation of multiple sequence alignment. It also provides accurate prediction, which is why it was recently used in related studies (Zhang et al., 2017; Amirkhani et al., 2018; Meng and Kurgan, 2018). We convert the numeric relative solvent accessibility of residues into a binary annotation (solvent exposed *vs*. buried) using a threshold of 0.15. This value adequately splits the bimodal distribution of solvent accessibility values for the residues in the combined D and N datasets (**Figure S2** in the **Supplementary Material**). We use these results to quantify the fraction of the putative surface residues in a given protein.

We assess quality of these predictions by comparing values of the fraction of the native surface residues that are computed using a limited set of proteins that have structures against the fraction of the predicted surface residues for the same set of proteins. We utilize mapping generated with the SIFTS resource (Velankar et al., 2013) that is available in UniProt to identify structures of the human proteins from the D and N datasets in the PDB database (Berman et al., 2000). We consider structures that cover at least 90% of the corresponding full protein sequences collected from UniProt to ensure that they correspond to a similar set of residues that are covered by the predictions which rely on the full protein chains. We compute the native solvent accessibility from these structures in three steps. First, we remove other molecules (including other protein chains) from the PDB structures. Second, we use DSSP (Kabsch and Sander, 1983; Joosten et al., 2010) to compute solvent accessibility values. Third, we convert the solvent accessibility into the relative solvent accessibility values using the normalization procedure that is described in the ASAquick article (Faraggi et al., 2014). We were able to collect the native solvent accessibility values for 373 drug targets (including 343 proteins from the D dataset, 55 from the Dh dataset, and 103 from the Dl dataset) and 73 proteins non-drug targets (including 39 from the Nd dataset and 12 from the Nn dataset). This corresponds to (373 + 73)/(1762 + 1,638) = 13% structural coverage of the human proteins in our datasets. **Figure S3** compares the distributions of the fractions of the surface residues computed from the protein structures against the fractions that are based on the predicted solvent accessibility for the seven considered datasets. The distributions that rely on the native *vs*. putative solvent accessibility for each of the seven dataset are very similar. The differences are not statistically significant (*p*-values range between 0.17 for the N dataset and 0.88 for the Nd dataset). This results suggests that the solvent accessibility predicted with ASAquick provides an accurate approximation of the native fraction of the surface residues.

#### Protein Sequence Properties

We use the proteins sequences to annotate the domains, alternative splicing isoforms, and sequence conservation. We collect the domain annotations from Pfam (Calderone et al., 2013) using UniProt identifiers, and we use these annotations to compute the domain boundaries (fraction of the domainassigned residues) and the number of domains per protein. We obtain the number of alternative splicing isoforms from the UniProt database (UniProt: the universal protein knowledgebase, 2016). We calculate residue-level conservation scores using the relative entropy measure (Wang and Samudrala, 2006) from the PSSMs generated with PSI-BLAST (Altschul et al., 1997). We use a threshold to convert the numeric conservation scores to binary, i.e., a given residue is either conserved (if its conservation score > threshold) or non-conserved (otherwise). We selected the threshold that corresponds to the 80th percentile of the distribution of the conservation scores for the residues in the combined D and N datasets (**Figure S1** in the **Supplementary Material**). The corresponding threshold value of 0.63 corresponds to an inflection point in the distribution tail where the conserved residues should be located. Using these annotations, we quantify the rate of the conserved residues in the protein sequence and among the residues located on the putative protein surface, given that this is where the drug-protein interaction occurs.

#### Topological Properties of the Protein-Protein Interaction Network

Motivated by work in (Zhu et al., 2009b; Zhu et al., 2009c; Mitsopoulos et al., 2015; Feng et al., 2017), we quantify the topological characteristics of drug targets and non-drug targets in the human PPI network. We collected the interaction network from the MENTHA resource (Calderone et al., 2013) and directly mapped it to our datasets using UniProt identifiers. MENTHA integrates data coming from several popular databases of PPIs, such as IntAct (Orchard et al., 2014), MINT (Licata et al., 2012), DIP (Salwinski et al., 2004), BioGRID (Oughtred et al., 2019), and MatrixDB (Launay et al., 2015), providing arguably one of the most comprehensive coverage levels. Several different centrality measures can be used to define topological characteristics of proteins in PPI networks (Wang et al., 2013a). We considered a comprehensive set of measures including betweenness centrality (Freeman, 1977), eigenvector centrality (Bonacich, 1987), closeness centrality (Bavelas, 1950), information centrality (Stephenson and Zelen, 1989), degree centrality (Jeong et al., 2001), subgraph centrality (Estrada and Rodriguez-Velazquez, 2005), network centrality (Wang et al., 2012a), and local average connectivity (Li et al., 2011). We reduced this set by removing measures that are redundant (highly correlated). The corresponding subset of four measures (eigenvector, closeness, betweenness and information centrality) has relatively low mutual correlations (< 0.6) while being highly correlated (> 0.8) with at least one of the removed measures. We give the corresponding correlations between these measures on our datasets in **Table S1** in the **Supplementary Material**. The eigenvector centrality is an extension of the node degree in which connections to more important nodes have more impact on the score. The nodes that are connected to many highly connected nodes end up having higher score than nodes which are connected to the same number of less-connected nodes (Bonacich, 1987). The closeness centrality measures the average length of the shortest path from the node to other nodes. The nodes with higher closeness centrality on average have smaller distance to the other nodes (Bavelas, 1950). The betweenness centrality quantifies the frequency with which a given node appears in the shortest paths between nodes in the network. Thus, removal of nodes with high betweenness centrality has big impact on the shortest paths between nodes (Freeman, 1977). Finally, information centrality is based on information along the paths from a given node to the other nodes (Stephenson and Zelen, 1989).

Besides quantifying several different topological features, we also annotate hub proteins, defined as proteins that interact with many proteins (Jeong et al., 2001). While early works on hub proteins defined them using a fixed minimal number of (Jeong et al., 2001), more recent studies use a floating threshold defined as a certain percentage of the most connected nodes in a given interactome (Han et al., 2004; Batada et al., 2006; Dosztányi et al., 2006). This results in different cut-offs that define hubs for different interactomes (different organisms) and emphasizes the fact that hubs are a property of the whole interactome system rather than a property of individual proteins. We used the latter definition using the cut-off that corresponds to the 90th percentile of the interaction counts in the complete human PPI network, which is consistent with several recent studies (Han et al., 2004; Batada et al., 2006; Dosztányi et al., 2006). Therefore, we annotate hub proteins as those that have the number of PPIs in the complete interactome collected from MENTHA that is higher than this threshold (i.e., ≥ 77 interactions).

Hub proteins have increased levels of intrinsic disorder (Meng et al., 2015b; Patil et al., 2010) and the disordered regions are often employed to carry out PPIs (Mohan et al., 2006; Vacic et al., 2007; Yan et al., 2016). The disordered protein-binding regions are also linked to certain human diseases (Uversky, 2018). Thus, we also annotate putative disordered protein binding regions. We use ANCHOR (Dosztányi et al., 2009) to predict the disordered protein-binding residues and we aggregate this information to compute the content of disordered protein binding residues for the proteins in our datasets. The selection of this method is motivated by the fact that it is accurate and popular, and provides fast predictions (i.e., is capable of processing our large datasets) (Meng et al., 2017; Katuwawala et al., 2019).

#### Functional Properties

We annotate cellular functions and subcellular locations of the drug targets and the non-drug targets using the Gene Ontology (GO) terms (Consortium, 2004), which we collect using the PANTHER system (Muruganujan et al., 2018). We annotate and separately analyze the molecular functions, biological processes, and cellular components, where the latter define the subcellular locations.

### Statistical and Similarity Analyses

We compare the sequence-derived and functional characteristics between the drug targets, non-drug targets, and possibly druggable proteins using statistical tests of significance of differences. We quantify the significance of the differences using the *t*-test if the underlying measure of the sequence-derived/functional property has normal distribution, and Wilcoxon rank-sum test otherwise. We used the Anderson-Darling test with the *p*-value cutoff of 0.05 to test normality. We use the Fisher's exact test when comparing binary characteristics, including disease associations and presence of hubs.

We annotate the cellular functions and subcellular locations associated with a particular set of proteins using enrichment analysis offered by the PANTHER system (Muruganujan et al., 2018). This system generates a list of annotations that are statistically over-represented when compared with the annotations present in the whole human proteome. PANTHER quantifies the ratios of enrichment and the corresponding *p*-values for each GO term when compared with the reference human proteome. We focus on the GO terms that occur at least 10 times in our datasets (to ensure robustness of statistical analysis), and we annotate a given term as associated with a particular set of proteins if its ratio > 2 (at least two fold increase) and the associated *p*-value (quantified using the False Discovery Rate correction) is < 0.05.

We measure similarity between two sets of proteins by comparing the cellular function and subcellular location GO terms associated with these two protein sets. We calculate this similarity using the GOSemSim package (Li et al., 2010) with default parameters [Wang et al. measure (Wang et al., 2007)] and the reference set to human.

### RESULTS AND DISCUSSION

#### Non-Druggable and Possibly Druggable Proteins

The set of the non-drug targets likely includes a relatively large number of druggable proteins. The ability to characterize properties that differentiate the drug targets and druggable proteins from the non-drug targets hinges on the annotation of the non-druggable and possibly druggable proteins in the set of these non-drug targets. Druggability of proteins requires that they interact with a drug-like compound and that this interaction provides a desired therapeutic effects (Hopkins and Groom, 2002; Russ and Lampel, 2005; Keller et al., 2006). Thus, one way to annotate possibly druggable and non-druggable proteins is to analyze protein-disease associations. **Figure 1** shows the fractions of the proteins associated with different classes of diseases among the drug targets and the non-drug targets. As expected, the

FIGURE 1 | Fraction of drug targets and non-drug targets associated with different classes of diseases. The green and red bars show the fraction of disease associated proteins among the drug targets and non-drug targets for each disease class. The p-values quantify the significance of the differences between the two fractions using the Fisher's exact test. The disease classes are sorted by the value of the fraction of the drug targets.

number of the disease associated proteins is significantly higher among the drug targets compared to the non-drug targets. This difference is statistically significant for each of the 23 diseases classes (*p*-values < 0.0001). About 94% of the drug targets are associated with at least one disease, attesting to the relatively high coverage of these annotations and supporting the fact that the drug targets exert therapeutic effects. The largest fraction of the drug targets (82%) is associated with cancers. To compare, only about 64% of the non-drug targets are disease-associated. The latter suggests that the non-drug targets include both nondruggable proteins (those that lack association with any of the diseases) and possibly druggable proteins (those that are associated with diseases). We note that the use of the diseases associations provides a partial support for their druggability since it does not address the ability of the possibly druggable proteins to interact with drug-like molecules.

**Figure 2** analyzes relation between the drug targets, nondrug targets, and disease associations. **Figure 2A** reveals that the disease-associated proteins are likely to be drug targets. About 60% of proteins that are associated with at least one disease are drug targets. The fraction of drug targets increases for the proteins that are associated with more disease. This increase is sharper for a lower number of diseases and plateaus for proteins with about 10 or more disease associations. Therefore, we hypothesize that the non-drug targets with a relatively large number of disease associations can be used as a proxy for possibly druggable proteins. We use the inflection point in **Figure 2A**, which corresponds to proteins with ≥13 disease associations among which 75% are drug targets, to define the set of possibly druggable proteins. **Figure 2B** is a Venn diagram that visualizes overlap between the disease associated proteins (black borders), the drug targets (dataset D; green border), and the non-drug targets (dataset N; red border). We define the set of the non-drug targets that are associated with 13 or more diseases as possibly druggable proteins (Nd dataset; orange area in **Figure 2B**). **Figure 2B** also shows that virtually all drug targets are associated with at least one disease (black border with number of diseases K ≥ 1), while a large portion of the non-drug targets lacks any disease associations (brown area in **Figure 2B**). The latter set of proteins constitutes the set of the non-druggable proteins (Nn dataset).

We test reliability of annotations of the possibly druggable and non-druggable proteins using the 124 human-like drug targets from the D+ dataset that were annotated based on their high sequence similarity to drug targets in other organisms. We found only 4% (5 of the 124) of the human-like drug targets among the 4,869 non-drug targets that are not associated with diseases compared to 67% (83 human-like drug targets) that are among the 4,287 non-drug targets that are associated with 13 or more diseases. The high degree of the latter overlap suggests that the Nd dataset should include a substantial number of druggable proteins. We note that the 4% overlap with the non-drug targets

FIGURE 3 | Similarity in cellular processes and subcellular locations between the drug targets (D dataset), possibly druggable proteins (Nd dataset), non-druggable proteins (Nn dataset), and non-drug targets (N dataset). We measure similarity for four pairs of these datasets (D vs. Nd, D vs. Nn, D vs. N, and Nn vs. Nd) based on the comparison of the corresponding sets of GO terms associated with these datasets, i.e., GO terms over-represented in a given dataset when compared to the entire human proteome. The GO terms are divided into three categories: MF (molecular functions), BP (biological processes), and CC (cellular components). Similarity was measured with the GOSemSim package (Li et al., 2010). We describe details of these calculations in section "Statistical and similarity analyses". The gray markers show the similarity for each GO-term category while the blue markers are the average across the three categories.

FIGURE 2 | Relation between drug targets, non-drug targets and diseases associations. Panel A shows the fraction of the drug targets among proteins associated with a given minimal number of diseases K. Panel B is a Venn diagram that visualizes overlap between the disease associated proteins (with K = 1 and K = 13), the drug targets (dataset D; green border), and the non-drug targets (dataset N; red border). Among the non-drug targets we define the Nn dataset of non-druggable proteins (brown area), i.e., the non-drug targets that are not associated with any disease, and the Nd dataset of possibly druggable proteins (orange area), i.e., the non-drug targets that are associated with 13 or more diseases.

that lack diseases associations likely stems from incompleteness of the diseases association data.

**Figure 3** further tests the validity of the hypothesis that the Nd and Nn datasets include the possibly druggable and the nondruggable proteins, respectively. It quantifies similarity in the context of cellular functions and subcellular location between the drug targets, possibly druggable proteins, non-druggable proteins, and the non-drug targets. First, we generate a set of GO terms that are associated with each of these datasets, i.e., GO terms over-represented in a given dataset when compared to the human proteome. We perform this analysis separately for each of the three GO terms categories: molecular functions, biological processes, and cellular components; the latter is a proxy for the subcellular location. Next, we calculate similarity between the corresponding sets of dataset-specific GO terms; we describe the details in section "Statistical and similarity analyses". The gray lines in **Figure 3** shows the similarity values for each GO term category while the blue lines show the average across the three categories. The left-most set of results reveals that the cellular functions and subcellular location of the drug targets (D dataset) are similar to the possibly druggable proteins (Nd dataset), which aligns with our hypothesis that the Nd dataset in fact includes druggable proteins. The second set of results, which compares the drug targets against the non-druggable proteins (Nn dataset), shows lack of similarity in the biological processes and subcellular locations and modestly reduced levels of similarity in the molecular functions. The corresponding average similarity = 0.145 is lower by a factor of two when compared with the similarity = 0.303 between the drug targets and possibly druggable proteins. The other two sets of results, which compare the possibly druggable against the non-druggable proteins and the drug targets against the non-drug targets, similarly reveal the lack of similarity in the biological processes and subcellular

locations, while showing similarity in the molecular functions. The average similarities for these two dataset pairs are low and equal 0.177 and 0.115, respectively, suggesting that the corresponding two pairs of datasets include proteins involved in distinct cellular processes and subcellular locations. To sum up, the above analysis demonstrates that drug targets and the possibly druggable proteins share much higher levels of functional and subcellular location similarity compared to the similarity between possibly druggable proteins, non-druggable proteins, and non-drug targets. This finding, which uses an independent source of information compared to the approach we used to annotate the possibly druggable proteins, supports validity of our annotations of the possibly druggable and the non-druggable proteins.

#### Comparative Analysis of the Sequence-Derived Structural and Functional Characteristics of the Drug Targets, Possibly Druggable, and Non-Druggable Proteins

Our ability to identify novel druggable proteins relies on the understanding of functional and sequence-derived characteristics that differentiate drug targets from the non-drug targets. We focus specifically on the characteristics that can be quantified from the protein sequence and/or identifier, which allows for a proteome-wide deployment. We compare a broad range of these characteristics between the drug targets, non-drug targets, possibly druggable proteins, and non-druggable proteins. We also investigate differences between the above protein sets and the expanded set of drug targets that includes human and human-like targets (D+ dataset), highly promiscuous drug targets that interact with many drugs (Dh datasets), and drug targets that interact with a low number of drugs (Dl dataset).

FIGURE 4 | Distributions of the values of the sequence-derived characteristics for the highly promiscuous drug targets (Dh), drug targets that interact with a low number of drugs (Dl), all drug targets (D), all human and human-like targets (D+), non-drug targets (N), possibly druggable proteins (Nd), and non-druggable proteins (Nn). (Panels A) shows the amount of conserved residues. Panels B and C focus on the protein domains while Panel D quantifies the number of splicing isoforms. The whiskers show the 5 and 95 percentiles, the top and bottom of the box correspond to the first and third quartiles, the middle bar is the median, and the cross marker is the average. The annotation above the whiskers show the significance of differences with the other protein sets; only significant differences are listed where N\* means p-value 0.05 and N\*\* means p-value 0.0001 when compared with the N dataset. We explain calculation of statistical tests in section "Statistical and similarity analyses".

#### Characteristics Derived From the Protein Sequence

**Figure 4** focuses on the characteristics derived directly from the protein sequence, including the residue-level conservation (content of conserved residues in protein chains), number of domains and the content of domain-annotated residues, and the number of the alternative splicing isoforms. **Figure 4A** shows that the drug targets (both D and D+ datasets) have significantly fewer conserved residues than the non-drug targets, possibly druggable proteins and the non-druggable proteins (*p*-value < 0.05). The possibly druggable proteins (orange bars) have significantly lower numbers of conserved residues compared to the non-druggable proteins (brown bars) (*p*-value < 0.05).

Moreover, the highly promiscuous drug targets have significantly lower numbers of the conserved amino acids than the non-drug targets and the non-druggable proteins (*p*-value < 0.05), while maintaining similar levels compared to the possibly druggable proteins. Altogether, relatively low numbers of the conserved residues are characteristics for the drug targets and these numbers are also relatively low among the possibly druggable proteins. Interestingly, the residue-level conservation of the residues on the protein surface, where the protein-drug interaction occurs, follows the same pattern (**Figure 5E**). This finding complements prior results that show that drug targets have lower evolutionary rates and higher similarity to orthologous genes (Lv et al., 2016).

FIGURE 5 | Distributions of the values of the sequence-derived structural characteristics predicted from the protein sequence for the highly promiscuous drug targets (Dh), drug targets that interact with a low number of drugs (Dl), all drug targets (D), all human and human-like targets (D+), non-drug targets (N), possibly druggable proteins (Nd), and non-druggable proteins (Nn). Panels A, B, and C quantify the abundance of intrinsic disorder while Panels D and E quantify the amount of surface and the amount of conserved residues on the surface, respectively. The whiskers show the 5 and 95 percentiles, the top and bottom of the box correspond to the first and third quartiles, the middle bar is the median, and the cross marker is the average. The annotation above the whiskers show the significance of differences with the other protein sets; only significant differences are listed where N\* means p-value 0.05 and N\*\* means p-value 0.0001 when compared with the N dataset. We explain calculation of statistical tests in section "Statistical and similarity analyses".

**Figures 4B, C** reveal that the drug targets (both D and D+ datasets) have substantially more domains and have larger amounts of domain-annotated residues when compared to the non-druggable proteins (*p*-value < 0.0001). At the same time, they a similar number domains when contrasted with the possibly druggable proteins. Furthermore, the possibly druggable proteins have significantly higher levels of domain annotations when contrasted against the non-druggable proteins (*p*-value < 0.0001). The underlying reasons for this enrichment could be two-fold. First, there could be proportionally more multi-domain proteins among the drug targets and the possibly druggable proteins. Consequently, inclusion of a larger number of domains could increase the likelihood that these proteins host at least one druggable domain. However, our result could also mean that these proteins are more studied and understood, and thus their domain annotations are more complete. Moreover, the fact that at least close to half of proteins in all considered datasets have domain annotations, which suggests that they are functionally annotated, suggests that our functional similarity analysis in **Figure 3** should be robust.

The drug targets (both D and D+ datasets) and the possibly druggable proteins have significantly more splicing isoforms compared to the non-druggable proteins (*p*-value < 0.05) and this increase is even higher for the promiscuous drug targets (*p*-value < 0.001). This suggests that enrichment in the number of alternative splicing variants could serve as a marker for druggability. The alternative splicing was found to contribute to drug resistance (Siegfried and Karni, 2018; Zhao, 2019), which supports veracity of our result. Interestingly, recent studies suggest that targeting alternative splicing events could lead to therapeutic opportunities (Le et al., 2015; Siegfried and Karni, 2018). Our analysis also reveals that majority of the drug targets and the possibly druggable proteins have multiple isoforms. Thus, gene level analysis of drug targets may not be adequate, considering that these genes would encode multiple proteins.

Overall, we identified three potential sequence-derived markers of druggability. The drug targets and possibly druggable proteins share lower numbers of conserved residues and are more likely to have multiple domains and isoforms when compared to the non-druggable proteins. We also note that the results for the original set of human drug targets (D dataset) are consistent with the results for the expanded set of drug targets (D+ dataset).

#### Sequence-Derived Structural Properties

This study is the first to analyze two relevant sequence-derived structural characteristics that can be accurately predicted from the protein sequence: intrinsic disorder and solvent accessibility. Proteins with disordered regions are associated with a wide range of human diseases (Uversky et al., 2008; Uversky et al., 2014; Uversky, 2014b; Babu, 2016) while solvent accessibility determines protein surface where the drug-protein interaction happens. We note that while authors in (Kim et al., 2017) computed putative solvent accessibility, they only used it to analyze results concerning enrichment in the PTMs.

**Figures 5A–C** quantify two key aspects of the disorder: the overall content of disordered residues and the length of disordered regions. Proteins with higher disorder content are functionally distinct from structured proteins while long disordered regions are thought to correspond to disordered protein domains (Tompa et al., 2009; Pentony and Jones, 2010; Peng et al., 2014a). We observe that drug targets (both D and D+ datasets) are significantly less disordered (by a factor of two) and include much shorter disordered regions when compared with the non-drug targets, including both possibly druggable and non-druggable proteins (*p*-value < 0.001). This is in agreement with a recent study that demonstrates that the current drug targets are biased to exclude disordered proteins (Hu et al., 2016). There are several reasons for this bias. The protein structures are used during the rational drug design process (Gane and Dean, 2000; Lundstrom, 2006; Mavromoustakos et al., 2011; Lounnas et al., 2013) and to gain mechanistic insights into the protein-drug interactions (Pielak et al., 2009; Tan et al., 2013; Christopoulos, 2014) (Altschul et al., 1997; Wang and Samudrala, 2006; Calderone et al., 2013; Orchard et al., 2014; UniProt: the universal protein knowledgebase, 2016). The structures are also indispensable for modeling associated with drug repurposing and repositioning (Moriaud et al., 2011; Ma et al., 2013). This is while proteins with disordered regions are much less likely to have structures (Hu et al., 2018), partly because since they are explicitly avoided in the structural genomics pipeline (Linding et al., 2003; Oldfield et al., 2005; Mizianty et al., 2014). Interestingly, the highly promiscuous drug targets are enriched in disorder when contrasted with the overall set of drug targets and the low promiscuity drug targets (*p*-value < 0.0001), while their disorder levels are comparable to the possibly druggable proteins. This coincides with the observation that disordered regions are capable of interactions with multiple partners (Oldfield et al., 2008; Hu et al., 2017). Our results suggests that although low disorder amounts are a strong marker for the current drug targets, the set of possibly druggable proteins includes large amounts of disorder. In fact, the disordered proteins may become the key to unlocking a substantial portion of yet to be discovered druggable targets (Uversky, 2012; Hu et al., 2016), especially given their association with numerous human diseases (Uversky et al., 2008; Uversky et al., 2014; Uversky, 2014b; Babu, 2016).

The amount of the putative surface residues for the drug targets (both D and D+ datasets) is significantly smaller that for the non-drug targets, including the possibly druggable and nondruggable proteins (*p*-value < 0.0001), see **Figure 5D**. This could be driven by the fact that drug targets are often membrane proteins (Yildirim et al., 2007; Rajendran et al., 2010), which means that they have relatively low surface area compared to other proteins. They are also mostly structured proteins (Hu et al., 2016) that are more likely to have globular shape with more buried residues compared to more irregularly shaped/elongated disordered proteins (Peng et al., 2014b; Uversky, 2017). Moreover, presence of disordered regions on the protein surface also leads to an increase of the surface area compared to structured conformations (Wu et al., 2015). Interestingly, the possibly druggable proteins have comparable content of the putative surface residues with the low promiscuity drug targets, which is also significantly smaller when contrasted with the non-druggable proteins (*p*-value < 0.0001). This again, like in the case of the results in **Figure 4**, shows that the possibly druggable proteins are more similar to drug targets than to the non-druggable proteins. Finally, we observe that the number of conserved residues on the putative surface (**Figure 5E**) maintains the same relation between the different protein sets as the overall number of conserved residues shown in **Figure 4A**, i.e., significantly lower for drug targets (both D and D+ datasets), and lower for the possibly druggable proteins compared to the non-druggable proteins (*p*-value < 0.05).

#### Topological Features of the Protein-Protein Interaction Networks

Topological features of the PPI networks are among the most studied characteristics of the drug targets (Zhu et al., 2009b; Zhu et al., 2009c; Bull and Doig, 2015; Mitsopoulos et al., 2015; Feng et al., 2017; Kim et al., 2017). A unique aspect of our analysis is that we focus on a set of orthogonal measures, i.e., measures that have low mutual correlations. This offers a more focused and balanced analysis given the high degree of similarity between many of these measures. **Figure 6** reveals that the entire set of four measures of centrality has significantly higher values for the drug targets (both D and D+ datasets) compared to the non-druggable proteins (*p*-value < 0.0001). Our results are in line with several prior studies that correspondingly show that drug targets have more connected and denser local network neighborhoods (Zhu et al., 2009b; Zhu et al., 2009c; Mitsopoulos et al., 2015; Lv et al., 2016). This finding suggests that drug targets are possibly more relevant biologically or are at a higher point of control and thus can better modify physiology, making them better therapeutic targets. The novel element in our study is that we find that all considered network centrality measures for the possibly druggable are even higher than for the drug targets (orange *vs*. green bars in **Figure 6**; *p*-value < 0.05). Consequently, they are also significantly higher than for the non-druggable proteins (orange *vs*. brown bars in **Figure 6**; *p*-value < 0.0001). Thus, our study suggests that these measures can be used as markers of druggability.

**Figure 7** analyzes the abundance of the PPI network hubs among the drug targets, possibly druggable and non-druggable proteins. Approximately 17% of the drug targets (for both D and D+ datasets) are hubs and this rate is significantly higher than the 12% rate for the non-drug targets (green *vs*. red bars; *p*-value < 0.0001). Similarly large difference was observed in (Mitsopoulos et al., 2015). Our study reveals additional important details. We observe

FIGURE 6 | Distributions of the values of the selected orthogonal PPI network properties for the highly promiscuous drug targets (Dh), drug targets that interact with a low number of drugs (Dl), all drug targets (D), all human and human-like targets (D+), non-drug targets (N), possibly druggable proteins (Nd), and non-druggable proteins (Nn). Panels A, B, C, and D concern the betweenness centrality, eigenvector centrality, closeness centrality, and information centrality measures, respectively. The whiskers show the 5 and 95 percentiles, the top and bottom of the box correspond to the first and third quartiles, the middle bar is the median, and the cross marker is the average. The annotation above the whiskers show the significance of differences with the other protein sets; only significant differences are listed where N\* means p-value 0.05 and N\*\* means p-value 0.0001 when compared with the N dataset. We explain calculation of statistical tests in section "Statistical and similarity analyses".

FIGURE 7 | Fraction of hub proteins among the highly promiscuous drug targets (Dh), drug targets that interact with a low number of drugs (Dl), all drug targets (D), all human and human-like targets (D+), non-drug targets (N), possibly druggable proteins (Nd), and non-druggable proteins (Nn). The annotation next to the bars show the significance of differences with the other protein sets; only significant differences are listed where N\* means p-value 0.05 and N\*\* means p-value 0.0001 when compared with the N dataset. We explain calculation of statistical tests in section "Statistical and similarity analyses".

that the rate of hubs is very high among the highly promiscuous drug targets (25%) and the possibly druggable proteins (24%), and these rates are significantly higher than the 12% rate for the non-drug targets (*p*-value < 0.0001) and the 5% rate for the nondruggable proteins (*p*-value < 0.0001). This suggests that high connectivity in the PPI network is a strong marker for druggability.

#### Functions and Subcellular Locations of Drug Targets and Possibly Druggable Proteins

Several studies analyzed cellular functions and subcellular locations of the drug targets (Lauss et al., 2007; Bakheet and Doig, 2009; Wang et al., 2013b). The green bars in **Figure 8**

provide a list of significantly enriched functions and locations for our set of drug targets. Our results indicate that most of the drug targets are enzymes, including kinases and oxidoreductases, followed by substatial numbers of channels, and in particular ion channels. They are often involved in binding, signalling, regulation, and transport. These finding are in close agreement with the results in (Bakheet and Doig, 2009). **Figure 8** also shows that drug targets are primarily found in membranes, with a large numbers also found in the cytoplasm and the intracellular space. Consistent results are found in (Bakheet and Doig, 2009; Wang et al., 2013b), and these subcellular locations also agree with the observation that membrane proteins are the prime targets for the development of therapeutics (Yildirim et al., 2007; Rajendran et al., 2010).

FIGURE 8 | Molecular functions, processes, and subcellular locations that are enriched among the drug targets (D dataset) and the possibly druggable proteins (Nd dataset). We show the top 10 (with the highest counts) over-represented/significantly enriched GO terms for the drug targets (green bars) and the possibly druggable proteins (orange bars). The bars quantify the ratios of enrichment relative to the human proteome and the corresponding p-values are shown on the right. GO terms are identified on the left, including their names and the number of the correspnding proteins in the given dataset. We explain calculation of statistical tests in section "Statistical and similarity analyses".

This study is the first to perform this type of analysis for the possibly druggable proteins (orange bars in **Figure 8**). Our analysis suggests that the possibly druggable proteins share functional similarities with the drug targets. They are similarly involved in the catalysis, signaling, and binding. However, the possibly druggable proteins tend to bind proteins and nucleic acids, instead of anions and ions which are the main partners for the drug targets. Moreover, the possibly druggable proteins are often involved in the metabolic and biosynthesis processes, and in the cell death cycle. The preference for the protein-protein and protein-nucleic acids binding and the cell death cycle involvement are supported by their significant enrichment in the intrinsic disorder (compared to the drug targets,

FIGURE 9 | Content of putative protein binding regions in the highly promiscuous drug targets (Dh), drug targets that interact with a low number of drugs (Dl), all drug targets (D), all human and human-like targets (D+), non-drug targets (N), possibly druggable proteins (Nd), and non-druggable proteins (Nn). The annotation next to the bars show the significance of differences with the other protein sets; only significant differences are listed where N\* means p-value 0.05 and N\*\* means p-value 0.0001 when compared with the N dataset. We explain calculation of statistical tests in section "Statistical and similarity analyses".

see **Figures 5A, B**), and the fact that disordered regions are known to facilitate these types of functions (Vuzman and Levy, 2012; Uversky et al., 2013; Fuxreiter et al., 2014; Peng et al., 2015; Basu and Bahadur, 2016; Wang et al., 2016b; Hu et al., 2017; Srivastava et al., 2018). We further investigate this in **Figure 9** that analyzes the differences in the content of the putative disordered protein-protein binding regions. These results confirm the enrichment in the corresponding functional annotations for the possibly druggable proteins. The possibly druggable proteins include a substantial amount of the disordered protein-binding regions, on average about 14% of residues. Moreover, the drug targets (both D and D+ datasets) are significantly depleted in these protein-binding regions (on average only 7% of residues) when compared with the possibly druggable proteins (*p*-value < 0.0001). Interestingly, **Figure 8** also reveals that the possibly druggable proteins are localized across the cell and they do not have a specifically associated subcellular location, unlike the drug targets that are found mostly in the membranes and cytoplasm. Overall, our empirical analysis provides new insights into the cellular functions and subcellular locations of the druggable proteins.

### SUMMARY AND CONCLUSIONS

Recent research approximates that the druggable human proteome has about 4,500 proteins (Finan et al., 2017), while there are about 1,600 current drug targets (1,750 drug targets if we include proteins that share high sequence similarity to drug targets that were annotated in other organisms). Annotation of the remaining druggable human proteins would facilitate development and screening of drugs, drug repurposing and repositioning, understanding and mitigation of drug side-effects, and prediction of drug–protein interactions. We contrast the drug targets against the possibly druggable and non-druggable proteins to identify markers that could be used to identify novel druggable proteins. This is in contrast to the prior studies that compare drug targets against non-drug targets (Zheng et al., 2006; Lauss et al., 2007; Bakheet and Doig, 2009; Zhu et al., 2009b; Zhu et al., 2009c; Bull and Doig, 2015; Mitsopoulos et al., 2015; Feng et al., 2017; Kim et al., 2017), thus producing markers that describe current drug target and which implicitly exclude the druggable proteins that are included in the non-drug target set. We annotate the possibly druggable and non-druggable proteins based on the presence and promiscuity of disease associations, and we validate these annotations *via* functional similarity analysis.

We cover a wide range of sequence-derived characteristics to define these markers. These characteristics can be computed across the entire human proteome, allowing for a complete sweep of all potential candidate proteins. We investigate several important characteristic that were missed in the past studies including putative intrinsic disorder, residue-level conservation, presence and number of alternative splicing isoforms, inclusion of domains, and putative solvent accessibility (surface area), as well as the key features from the prior works, such as the topological features of PPIs, cellular functions and subcellular locations. **Figure 10** summarizes the results. It shows the difference in the values of the key markers when comparing the possibly druggable proteins (in orange), the nondruggable proteins (in brown), all non-drug targets (in red), and the expanded set of human and human-like drug targets (in light for the non-druggable proteins (in brown).

green) against the human drug targets (in dark green). We observe that the possibly druggable proteins are significantly more similar to the drug targets than the non-druggable proteins for majority of the markers. These markers include high abundance of alternative splicing isoforms, relatively large number of domains, higher degree of centrality in the corresponding PPI network (and correspondingly much higher rate of hubs), lower number of conserved residues, and lower number of residues on the putative (sequence-derived) surface. Thus, these factors could serve as high-quality markers for druggability. "Results and discussion" section discusses these findings in the context of the current literature. Moreover, **Figure 10** shows that drug targets (both D and D+ datasets) have significantly depleted levels of intrinsic disorder and intrinsically disordered protein-binding regions when compared with the much higher and comparable levels among the possibly druggable and non-druggable proteins. This suggests that the high levels of disorder combined with the presence of the abovementioned markers should be used together to effectively enlarge the current collection of drug targets. This is in accord with several recent studies that postulate inclusion of the disorder-enriched proteins into the set of druggable proteins (Cuchillo and Michel, 2012; Uversky, 2012; Chen and Tou, 2013; Joshi and Vendruscolo, 2015; Ambadipudi and Zweckstetter, 2016; Hu et al., 2016; Yu et al., 2016).

Our analysis also shows that the possibly druggable proteins are functionally similar to the drug targets, being involved in the catalysis, signaling, and binding. The main difference is that the possibly druggable proteins target interactions with proteins and nucleic acids, unlike the current drug targets that favor interactions with anions and ions. **Figure 10** points to the high amount of the disordered protein-binding regions for the possibly druggable proteins compared to the drug targets, which is in concert with the disordered nature of the druggable proteins. This is in agreement with the literature that shows that disordered regions often facilitate PPIs (Mohan et al., 2006; Vacic et al., 2007; Fuxreiter et al., 2014; Yan et al., 2016; Hu et al., 2017). Finally, we show that the possibly druggable proteins are involved in the metabolic and biosynthesis processes and that they are localized across the cell, without a preference for specific subcellular locations. This is unlike the current drug targets that are located primarily in the membranes.

To sum up, our empirical analysis has led us to formulate several markers that may help with identifying novel druggable human proteins and has produced interesting insights into the cellular functions and subcellular locations of potentially druggable proteins.

### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/ **Supplementary Material**.

## AUTHOR CONTRIBUTIONS

LK conceptualized the study. LK and ML designed the study. SG organized the source databases. SG and XL performed acquisition of data. SG and LK organized and performed statistical analysis. All authors organized, analyzed and interpreted the results. LK and SG wrote the first draft of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version, and provided approval for publication of the content.

## FUNDING

This research was supported in part by the Robert J. Mattauch Endowment funds to LK, the National Natural Science Foundation of China (No. 61832019), and the 111 Project (No. B18059).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01075/ full#supplementary-material

### REFERENCES


estimated energy content. *Bioinformatics* 21 (16), 3433–3434. doi: 10.1093/ bioinformatics/bti541


Hopkins, A. L., and Groom, C. R. (2002). The druggable genome. *Nat. Rev. Drug Discovery* 1 (9), 727–730. doi: 10.1038/nrd892


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Ghadermarzi, Li, Li and Kurgan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# In silico Metabolic Pathway Analysis Identifying Target Against Leishmaniasis – A Kinetic Modeling Approach

#### Nikita Bora and Anupam Nath Jha\*

Computational Biophysics Laboratory, Department of Molecular Biology and Biotechnology, Tezpur University, Tezpur, India

Edited by:

Shandar Ahmad, Jawaharlal Nehru University, India

#### Reviewed by:

Junjie Yue, Biotechnology Research Institute (CAAS), China Vahab Ali, Rajendra Memorial Research Institute of Medical Sciences, India

> \*Correspondence: Anupam Nath Jha anjha@tezu.ernet.in

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 27 June 2019 Accepted: 14 February 2020 Published: 06 March 2020

#### Citation:

Bora N and Jha AN (2020) In silico Metabolic Pathway Analysis Identifying Target Against Leishmaniasis – A Kinetic Modeling Approach. Front. Genet. 11:179. doi: 10.3389/fgene.2020.00179 The protozoan Leishmania donovani, from trypanosomatids family is a deadly human pathogen responsible for causing Visceral Leishmaniasis. Unavailability of proper treatment in the developing countries has served as a major threat to the people. The absence of vaccines has made treatment possibilities to rely solely over chemotherapy. Also, reduced drug efficacy due to emerging resistant strains magnifies the threat. Despite years of formulations for an effective drug therapy, complexity of the disease is also unfortunately increasing. Absence of potential drug targets has worsened the scenario. Therefore exploring new therapeutic approach is a priority for the scientific community to combat the disease. One of the most reliable ways to alter the adversities of the infection is finding new biological targets for designing potential drugs. An era of computational biology allows identifying targets, assisting experimental studies. It includes sorting the parasite's metabolic pathways that pins out proteins essential for its survival. We have directed our study towards a computational methodology for determining targets against L. donovani from the "purine salvage" pathway. This is a mainstay pathway towards the maintenance of purine amounts in the parasitic pool of nutrients proving to be mandatory for its survival. This study represents an integration of metabolic pathway and Protein-Protein Interactions analysis. It consists of incorporating the available experimental data to the theoretical methods with a prospective to develop a kinetic model of Purine salvage pathway. Simulation data revealed the time course mechanism of the enzymes involved in the synthesis of the metabolites. Modeling of the metabolic pathway helped in marking of crucial enzymes. Additionally, the PPI analysis of the pathway assisted in building a static interaction network for the proteins. Topological analysis of the PPI network through centrality measures (MCC and Closeness) detected targets found common with Dynamic Modeling. Therefore our analysis reveals the enzymes ADSL (Adenylosuccinate lyase) and IMPDH (Inosine-5<sup>0</sup> monophosphate dehydrogenase) to be important having a central role in the modeled network based on PPI and kinetic modeling techniques. Further the available three

dimensional structure of the enzyme "ADSL" aided towards the search for potential inhibitors against the protein. Hence, the study presented the significance of integrating methods to identify key proteins which might be putative targets against the treatment of Visceral Leishmaniasis and their potential inhibitors.

Keywords: Leishmania donovani, Visceral Leishmaniasis, kinetic modeling, purine salvage, protein protein interaction

### INTRODUCTION

One of the primary problems arising worldwide is the increasing risk of parasitic disease mostly infecting the people and animals in the underdeveloped countries (El Kouni, 2003). It includes infections from a wide source of microorganisms like fungi, bacteria, and protozoans etc. (Wang, 1984; Kokina et al., 2019). One such parasitic disease, Visceral leishmaniasis (VL or Kala Azar), caused by the protozoan species Leishmania donovani has served as a major threat to these countries, increasing the rate of fatality (Desjeux, 2004; Alvar et al., 2012). VL serves to be one of the most severe forms of leishmaniasis (Sundar, 2001) with the highest death rate (Cavalli and Bolognesi, 2009). This species can infect the internal organs threatening the human health (Sharma et al., 2017). Mostly affected are the poor people from the East African and the Indian subcontinent hence leading to a higher demand for the identification, treatment and control of the infection in the low and middle income countries (Chappuis et al., 2007).

Treatment of Leishmania infection relies on chemotherapy (Sundar and Chatterjee, 2006), however, failure towards the available chemotherapeutic agents and treatments still prevails. The first drugs for the treatment were made available around five decades ago. However, the formulation of a single drug is not sufficient to combat the species due to the differences in drug sensitivity among the Leishmania sp. (Croft and Coombs, 2003). The substantial side effects (Vijayakumar and Das, 2018) and difficulty in administration has also led to the evolution of drug resistant parasitic strains contributing to the increased rate of mortality (Sundar, 2001). Further, expensive treatment strategies acts as a hurdle towards an effective drug development (Bora, 1999; Croft et al., 2005). In a conclusive manner, a major challenge still exists in identifying effective cure and treatment for the parasite disease (Freitas-Junior et al., 2012) which requires an exploitation of current technologies for identifying novel chemotherapeutics (Davis et al., 2004). The mandate is to find novel drug-targets from the parasite's proteome (Guerin et al., 2002). The identification of such targets from a pathogen's biological pathway is reported to be an important feature in the drug discovery process (Chawla and Madhubala, 2010). This has helped in exploring ways for studying the protozoan's metabolic pathways to sort out the ones unique to them (Martin et al., 2016). The information embedded in the microbe's life cycle may pave the way for understanding the pathogenesis (Smith and Romesberg, 2007). It proves to be essential in controlling the microbial infections that are becoming resistant to the drugs available for their treatment causing fatal conditions.

One of the key factors in understanding the pathogen's biological pathway requires knowledge of the underlying kinetics governing the enzymes and molecules involved in the pathway. This complex biological system can be represented into a network of interconnecting links signifying the reactions involved in the pathway (Meshram et al., 2019). In silico analysis of metabolic pathways through systems biology approach has been on the forefront for providing the means to understand the whole network through the availability of the experimental data. Hence, availability of experimental details paves a way for describing the pathway mathematically (Van Riel, 2006). System level analysis has been used as a tool for identifying targets against different species of Leishmania (Chavali et al., 2008; Mandlik et al., 2012; Sharma et al., 2017). However, providing a detailed mathematical model is still a challenge in systems biology (Steuer et al., 2006). Biological databases (Stein, 2003), provide a means to attain the knowledge of biological reactions involved in the pathways. Also with the growth of high throughput technologies (Baker and Brass, 1998), the size of these databases are increasing making the interpretation of data a major challenge in the scientific field (Guimera and Amaral, 2005).

Life cycle of L. donovani exists as flagellated extracellular promastigotes in the phlebotomine sandfly vector and as immotile intracellular amastigotes within cells of the infected mammalian host. The amastigote has been reported to be the cause behind the pathogen infections including Visceral Leishmaniasis (McConville et al., 2007). L. donovani lacks the machinery for the synthesis of purines competing with the hosts for salvaging the required purines (Boitz et al., 2012). Incapability of these organisms to synthesize the purines de novo has led to their dependence on the salvaging of the purines from their hosts rendering the "purine salvage" pathway essential for its survival (Boitz and Ullman, 2013; Ansari et al., 2016). Hence, the pathway has served has a backbone towards the maintenance of purine amounts in the parasitic pool of nutrients (Marr et al., 1978; Looker et al., 1983; De Koning et al., 2005). A lack of effective cure for the disease has made its importance in the scientific community opening up ways for delineation of this pathway (Boitz et al., 2012). The enzymes involved in this survival pathway opens up the exploration space for newer drug targets (Carter et al., 2008) for chemotherapeutic agents against parasitic diseases (Berg et al., 2010; Doleželová et al., 2018).

In the current work, we have carried out a twin approach – both dynamical and static, for identifying potential drug targets against the pathogen L. donovani. Unlike the traditional ways of identifying a target, we have carried out the in silico simulation of the "purine salvage" pathway for the protozoan which represents the dynamic method. The selected pathway has been represented

in the form of a mathematical model defining the reactions catalyzed by the enzymes and the pathway components. Kinetic modeling of the pathway displayed the importance of the proteins Adenylosuccinate lyase (ADSL) and Inosine-5<sup>0</sup> -monophosphate dehydrogenase (IMPDH). The static method comprises of a Protein Protein Interaction (PPI) network for the proteins involved in the purine salvage pathway. PPI network analysis has been used as a tool for studying the proteomes of Leishmania species like L. braziliensis, L. major and L. infantum (Flórez et al., 2010; Rezende et al., 2012; Chávez-Fumagalli et al., 2018; Dos Santos Vasconcelos et al., 2018). The PPI networks are types of biological network representing a set of proteins and their interactions. It was carried out to analyze the importance of the sorted proteins from the dynamical model. A topological analysis of the PPI network showed that these proteins were also found to be ranked in the top central nodes (proteins) with higher connectivity. These nodes (hub proteins) have a higher number of connections that designates its physical associations with their other proteins responsible for performing the specialized functions. Hence a destruction of these nodes will lead to a loss of the connectivity and function which might be crucial for the parasite's survival.

### MATERIALS AND METHODS

A dual approach has been applied in our current work where a dynamic modeling and a static interaction network was generated. The dynamic modeling approach includes the selection of a pathway essential for the parasite survival, retrieving the enzymatic reactions and construction of a kinetic model reflecting the importance of certain enzymes which might be important targets in controlling Leishmaniasis. The static interaction network includes construction of a PPI and computing the hub nodes.

#### Selection of Metabolic Pathway

A series of reactions are indulged within the parasite which effect the host (Lambris et al., 2008). These reactions are regulated by enzymes for the generation of the desired products. Some metabolic pathways of the pathogen are proven to be essential for the survival of the organism inside the host. Analysis of those reactions is crucial in listing out the enzymes which could be potential drug targets. One such pathway considered to be mandatory for the survival of L. donovani is the "purine salvage" pathway. The enzymes and reactions involved in the pathway are compiled with the help of the available resources.

#### Biochemical Network Construction Reactions of the Biological Pathway

The enzymes and the reactions involved in the salvaging of purines in L. donovani are obtained from the biological pathway resource "KEGG" (Kanehisa and Goto, 2000). The conversions of the metabolites were further cross checked with the current literature. For building of our model of interest, we have considered the reactions only occurring at the amastigote stage, i.e., the infection causing stage of the parasite (McConville et al., 2007). The metabolic pathway is then graphically represented through the pathway editor "Cell Designer" (Funahashi et al., 2003). It represents every reaction along with the substrates and the metabolites formed with a simpler view. The species transformation is represented through reactions (whether reversible or irreversible).

#### Kinetic Model

A kinetic rate law defines every reaction included in the model. Enzymatic mechanisms were studied to specify the rate equations for the reactions governed by the enzyme. For enzymes whose detailed mechanisms were not known, the rate equations were constrained to the basic Michaelis-Menten equation, with the basic knowledge that all enzymatic reactions follow the Michaelis-Menten equation. The kinetic model of our pathway is set up using the pathway simulator "COPASI" (Hoops et al., 2006). It constitutes the species defined in their biochemical terms. All the rate equations for the enzymes were generated through COPASI. The mathematical model of the metabolic pathway is defined in the form of Ordinary Differential equations (ODE). Defining a kinetic model implies retrieving available kinetic parameters through different sources. These parameters for the enzymes in our model is collected from curated enzyme database "Brenda" (Schomburg et al., 2004). Parameters not available in the database are collected through literature. Parameters unavailable for a few enzymes for L. donovani are taken from other reference organisms (Franco and Canela, 1984) assuming that the enzyme mechanisms remains conserved across species. Further integrating a system requires setting up the initial conditions. Considering this, we have set the initial concentrations for the different species in our model.

### Model Simulation and Analysis

The kinetic behavior of the model has been assessed through evaluating the species with respect to time.

#### Steady State Analysis

Initially the steady state analysis was carried out for our model using COPASI. It enables to analyze the state at which the concentrations of the species do not vary with respect to time. The behavior of the system at that state tends to continue to be present in the future. Methods that COPASI uses for the steady state calculations are the damped Newton method and the integration method both forward and backward.

#### Time Course Analysis

Followed by the steady state calculation we conducted the time course analysis using the "time course utility" of COPASI. The deterministic method (LSODA) is used for simulating the dynamic behavior and carried out for a period of 500 sec.

#### Sensitivity Analysis

Assessing the effects of the parameters over the system variables is another crucial point in kinetic modeling. This was carried out through a sensitivity analysis of the model using the sensitivity utility of COPASI.

#### Protein-Protein Interaction Network

Enzymes sorted through our dynamic analysis were further checked for their importance in the pathway through a topological analysis of the Protein-Protein Interaction (PPI) Network. Compared to the kinetic models, the PPI represents the static interactions of a protein with their co-partners. These networks provide as an effective means towards understanding functional interactions, identification of important modules and prioritization of targets. Protein-Protein Interaction network are mathematically created networks where every protein is represented as a node and the interaction between two proteins as an edge. The functional protein interactions for L. donovani were retrieved from the available source of interaction database "STRING version 10.5" (Szklarczyk et al., 2014). The corresponding interactions of proteins involved in the purine salvage pathway were retrieved. Using this molecular interaction information, a Protein-Protein interaction (PPI) network has been constructed through the open source platform "Cytoscape" (Shannon et al., 2003) which represents the overall interaction of the selected proteins in the L. donovani sp. MCODE (Chin et al., 2014) clustering algorithm was applied to identify the clusters consisting of densely interconnected nodes. Proteins in a cluster tend to functionally link to each other. Topological analysis for the constructed PPI network was carried out to deduce the properties of the network. Central nodes in the network were analyzed through the Cytohubba (Chin et al., 2014) package of "Cytoscape." These nodes show a higher number of connections with the other proteins in the network and hence maintain their associated functions. Deletion of the proteins behaving as the central nodes will lead to a loss of connectivity in the network. Hence, nodes having a higher tendency to alter the topology of the network are identified. They might prove to be important targets for inhibiting the growth and survivability of the pathogen.

#### Inhibitor Search for the Selected Enzyme

The presence of the three dimensional structure of a protein assists towards the identification of possible inhibitors against the selected enzyme (Choudhary et al., 2019). Inhibitor search for our selected protein was carried out through pharmacophore mapping of the available ligands against the protein. In our study we highlighted two enzymes having a potential of being targets against L. donovani. However, the crystal structure of only the ADSL enzyme was available and hence we selected this enzyme for further study. A literature search was conducted to list out the possible inhibitors for the ADSL enzyme. These inhibitors were subjected to a pharmacophore generation through Pharmagist. The pharmacophore was then mapped against the ZINC database through the Zinc Pharmer to select out possible inhibitors for ADSL enzyme. A set of molecules were then selected from the database of molecules which were further considered for analysis. Molecular docking approach was applied to the selected molecules to observe the binding behavior of the ligands within the receptor molecule. Autodock version 4.2.6 was applied to carry out the molecular docking.

## RESULTS AND DISCUSSION

#### Selection of Metabolic Pathway

Purine nucleotides are considered important for maintaining the vital processes of the cell. However, the Leishmania parasite being unable to synthesize the purine rings has developed a pathway for scavenging the required purines (Boitz et al., 2012). Literature sources showed various reactions in the incorporation of purines from the host metabolism. Summarizing all the possible transformations of the metabolites, a generalized form of the pathway is constructed including all the reaction products and the enzymes involved in salvaging of the nutrients. Also, the complexity of simulating kinetic models assists the fact that single pathways are more feasible to study. The main strategy while building our model was to reduce the system while still maintaining the overall characteristics of the biological system. Therefore in our current study we have considered a total of 15 reactions with 14 enzymes found to be involved in these reactions generating the desired metabolites. These enzymes are tabulated in **Table 1** along with their reactions retrieved from BRENDA and literature. Several classes of enzymes like lyases, kinases, transferases etc. are found to be involved in the scavenging process which is shown by their EC numbers. The schematic representation of the pathway is created through Cell Designer. Both reversible and irreversible reactions were included in our model. The reactions and conversions playing an important role in the regulation of the metabolites are schematically arranged. These enzymes (**Table 1**) along with the reactions they catalyze are shown in **Figure 1**.

### Model Building

The kinetic model of the pathway was generated through the pathway simulation software "COPASI." Pathway species (enzymes, metabolites) were incorporated to generate the biochemical model followed by fixing the defined rate laws against each enzyme. The rate of the enzymes was described by the Michaelis-Menten equations which were able to capture the necessary details of the reactions and hence displaying the dynamic behavior. These rate equations depend on the concentrations of the metabolites and parameters like binding constants, maximum velocity etc. The model was subsequently fed with the collected kinetic parameters setting up their initial concentrations. This yielded a set of ODE's representing the enzyme kinetics of the system. Kinetic modeling solves these set of ODE's for their dynamic behavior. Once the model has been assembled, it is simulated for observing the variations in the system. The publicly available biochemical network simulator "COPASI" was used to simulate the constructed model and also to perform the analysis.

#### Model Simulation and Analysis Steady State Analysis

A characteristic feature of metabolic network is to evolve into a steady state frequently. Computational calculation of steady state requires solving the model through solving the system of ODE's numerically. Steady states of a system represent a

#### TABLE 1 | Enzymes involved in the purine salvage pathway.


certain physiological condition of the network. Changes in the physiological conditions results in transient states as the system transfers itself to a new steady state. The steady state of our kinetic model was calculated and it showed that the model acquired a steady state with our initial conditions and set of parameters. The steady state fluxes have been shown in **Table 2**.

TABLE 2 | Steady state fluxes of Purine salvage model.

fgene-11-00179 March 5, 2020 Time: 15:5 # 6


#### Time Course Analysis

Time evolution of the model was carried out through the "time course analysis" utility of COPASI. The main objective of performing this analysis is to observe the dynamics of the model. A graph shows the behavior of the model with respect to time (**Figure 2**). The concentration vs. time plot is generally used to infer the attainment of a steady state. A plateau has been observed in the trajectories of the concentrations against time plot where no further changes in the variables are observed. This suggests that the model has reached the steady state characterized by the constant species concentrations. The production and the consumption of the metabolites occur at the same rate.

#### Sensitivity Analysis

Sensitivity analysis is essentially performed to observe the consequences of the parameters over the model variables like concentration of the species. Two plots were generated showing the effects of the concentration of the species over the variables and the parameters. **Figure 3** displays that the reactions catalyzed by the enzymes IMPDH, ADSL, ADSS, XPRT, GDA, and GMPS were sensitive to the initial concentrations of the marked variables. Whereas **Figure 4** shows that the marked parameters involved in the reactions 1, 2, 4, 8, 9, and 12 were mostly sensitive to the model. These reactions were catalyzed by the enzymes AK, APRT, PNP, ADSL, HGPRT, and the IMPDH, respectively. It is observed from both the cases that the enzymes IMPDH and ADSL are the most sensitive to the effect of concentration. Therefore, the study indicates that these specific reactions probable to be sensitive in the metabolic chain could be further analyzed through future experiments to give valuable insights into the dynamics of the system. Moreover the experimental studies could be directed towards the therapeutic importance of these reactions. Thus, the behavior observed here provides generous amount of dynamics of the "Purine salvage" pathway reflecting the importance of these reactions.

#### Protein-Protein Interaction Network

To further analyze the importance of the two proteins, IMPDH and ADSL, we carried out a static analysis along with the dynamics study. The available protein-protein interactions for L. donovani from STRING consists a total of 3707 proteins and a total of 437053 interactions. The constructed PPI network of the purine salvage pathway showed the participation of 1581 proteins (nodes) and 5735 unique interactions (list of proteins attached at the end of **Supplementary Material**). The interactions showed proteins to be involved in other functions other than the purine salvage. Clustering of the network through MCODE resulted into two clusters as tabulated in **Table 3**. Comparison of both the clusters reveals that lesser number of proteins were involved with a higher number of interactions resulting into a dense network. Proteins in a dense network often forms functional modules that contributes to cellular processes (Rasti and Vogiatzis, 2019).

Therefore knocking out of these proteins should result into a much lesser number of interactions which will lead to an overall loss of a number of functions of the proteins. These interactions may also prove to be fatal. On the other hand, in the second cluster higher number of proteins were involved having lesser number of interactions compared to cluster 1. The proteins which were found to be sensitive in the kinetic modeling approach were analyzed for their presence in the cluster. The enzymes ADSL and IMPDH were found to be having interactions with other proteins in the first cluster. Along with these two proteins, two other proteins were also found to be present in the interaction pattern. These proteins along with their STRING ids have been tabulated in **Supplementary Table S1**. Cluster 1 of the MCODE clustering method has been shown in **Supplementary Figure S1** with the tabulated enzymes being highlighted in the network. The other protein APRT (Adenine phosphoribosyltransferase) with a STRING id of XP\_003861601.1 was found to be present in cluster 2 (**Supplementary Figure S2**). Rest of the proteins were not involved in formation of any interactions in these two clusters.

Network analysis enhances our understanding of the interactions of a node with other nodes in the network. As the biological network is heterogeneous, different topological parameters were used to identify the essential proteins. The MCC (Maximum Clique centrality) method of Cytohubba is reported to be a better method in identifying central nodes (Chin et al., 2014). It shows that the proteins ADSL and IMPDH were captured to be the central nodes. Another method to rank the proteins applied was global centrality based method "closeness." It showed that the two proteins IMPDH and ADSL were ranked to be essential.

Literature studies support the fact the enzyme IMPDH is a promising target for drug discovery in antibacterial, anticancer and antiviral treatments (Shu and Nair, 2008) and organism like Pneumocystis carinii (O'Gara et al., 1997). Further importance of this enzyme has also been demonstrated in the organism Leishmania amazonensis (Pitaluga et al., 2015). ADSL is known

FIGURE 2 | Results of the time course analysis for the metabolites.

as an important target in organisms like Plasmodium falciparum (malaria) (Bulusu et al., 2009), Cryptococcus neoformans (Chitty et al., 2017), Staphylococcus aureus (Fyfe et al., 2010) and Schistosoma mansoni (schistosomiasis) (Romanello et al., 2017). Experimental groups have also reported the fact that a decrease growth rate occurs for the phenotypes when the ADSL gene is knocked out (Boitz et al., 2013). Therefore, studies could be further conducted upon targeting these proteins of interest which might lead to a loss of Leishmania infections (Galina et al., 2017). In silico mutational analysis of the ADSL protein suggested a few mutations that bought conformational changes to the catalytic site of the protein (Bora and Nath Jha, 2019).

### Inhibitor Search for the Selected Enzyme

Availability of three dimensional structures for the proteins gives an insight into the molecular information of the proteins (Das et al., 2018). Exploring the protein structural repository "PDB", it was found that the crystal structure for only ADSL (PDB id: 4MX2) was available. This led to the exploration of the protein through other computational techniques highlighting it as a probable target. We carried out an analysis to

identify potent inhibitors against this target molecule having an available 3D structure.

A total of 14 molecules reported to be possible inhibitors of Adenylosuccinate Lyase has been listed out (shown in **Supplementary Table S2**). Pharmacophore generation of the 14 molecules was carried out to find out the potential regions of the ligands contributing towards the catalytic activity. It was found out that 11 molecules were best aligned in generating the desired pharmacophore. Three chemical features were found out to be involved in maintaining the activity of the ligand


molecules which includes two hydrogen bond acceptor regions and one aromatic region. The generated pharmacophore has been shown in **Figure 5**. It was then used as a query to search potential ligands. The reason behind the use of searching the database through a pharmacophore is to identify hits having similar chemical features to that of the query. We mapped the pharmacophore against the Zinc database. Twenty inhibitors with a root mean square deviation (rmsd) of zero were selected out as our dataset for further analysis. The list of molecules is tabulated in **Supplementary Table S3**. The molecules were found to be satisfying Lipinski's rule which designates the druglike property of the molecules. To find out the optimal binding pose of these ligands to the enzyme, the ligand molecules were docked to the active site of the ADSL enzyme. The results of the Docking have been shown in **Supplementary Table S4**.

Molecular docking analysis revealed the binding patterns of the ligand molecules within the ADSL protein. Docking results showed that the binding energy of all the molecules were quite stable but out of 20 molecules, 18 molecules were found to be forming hydrogen bonds with the receptor. The strength of binding usually depends of the interaction between the ligand atoms and the receptor atoms. Therefore the presence

FIGURE 6 | Docking pictures of Molecule 7 and Molecule 13 within the active site of ADSL.

of hydrogen bonds aids toward the stability of the ligand-receptor complexes. Also out of the 18 molecules, 12 molecules were observed to be forming hydrogen bonds with the active site residues. These residues are reported to be involved in the catalysis process and hence the binding of these ligands to the residues showcases their importance in inhibiting the enzyme. Also molecule 7 and molecule 13 has been observed to be interacting with the maximum number of active residues. The orientation of the molecules inside the cavity of the protein has been shown in **Figure 6**. Results showed that the molecules were able to bind deep within the activity that assists them in their catalytic properties. Therefore the effective binding of the molecules to the protein display their therapeutic importance as drug molecules.

#### CONCLUSION

The in silico study reflected that a systems level analysis for a protozoan parasite "L. donovani" is possible resulting into the identification of drug targets. In our work, a biochemical network of a metabolic pathway "Purine salvage" for the organism L. donovani has been constructed. The dynamic behavior of the model is analyzed through a mathematical representation of the reactions showcasing the biological events. The dynamic simulation allowed us to define the biological pathway of interest with respect to time. Further the model stands as a benchmark for incorporation of complex enzymatic mechanisms through the availability of experimental data. Reactions showing higher sensitivity to the model have been sorted out with the listing of the enzymes regulating those reactions. These enzymes and their reactions could be further experimentally tested out for their importance in therapeutics. Also to analyze the essentiality of these proteins, a static interaction network (the PPI network) for the proteins involved in purine salvage has been constructed. Clustering analysis resulted into a much higher dense network consisting of the proteins ADSL and IMPDH. It was observed through topological analysis that the mentioned proteins were ranked among the top 30% of the hub proteins. These proteins might serve as targets to further explore the pathway mechanism

#### REFERENCES


shedding light on the control of infections caused by the pathogens. Availability of the three dimensional structures for ADSL marks its possibility for other computational analysis like Virtual screening of inhibitors. Search for potential inhibitors for the ADSL protein was carried out through the identification of specific chemical features i.e., the pharmacophore. Mapping of the pharmacophore to available ligands lead to the selection of molecules having similar features and within minimum variation from the query molecules. To observe the binding mode of the ligand molecules to that of the receptor, molecular docking analysis was applied. Results of the docking analysis showed that two molecules were able to effectively bind to the active site residues of the protein. Hence it displayed their therapeutic importance as inhibitors having features similar to the known inhibitors against ADSL.

### DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

### AUTHOR CONTRIBUTIONS

NB performed the in silico studies, analyzed the data, and wrote the manuscript. AJ supervised this study.

#### ACKNOWLEDGMENTS

We acknowledge Department of Biotechnology (BT/PR15847/ NER/95/21/2015), Government of India for providing computational facilities and fellowship to NB.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2020.00179/full#supplementary-material




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Bora and Jha. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

digital media

of impactful research

article's readership