# BIOINFORMATICS OF GENOME REGULATION AND SYSTEMS BIOLOGY

EDITED BY: Yuriy L. Orlov and Ancha Baranova

PUBLISHED IN: Frontiers in Genetics and Frontiers in Plant Science

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-014-8 DOI 10.3389/978-2-88966-014-8



Topic Editors:

Yuriy L. Orlov, First Moscow State Medical University, Russia

Ancha Baranova, George Mason University, United States

Citation: Orlov, Y. L., Baranova, A., eds. (2020). Bioinformatics of Genome Regulation and Systems Biology. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-014-8

# Table of Contents


*Dynamical Modeling of the Core Gene Network Controlling Flowering Suggests Cumulative Activation From the FLOWERING LOCUS T Gene Homologs in Chickpea*

Vitaly V. Gursky, Konstantin N. Kozlov, Sergey V. Nuzhdin and Maria G. Samsonova

Maxim S. Kovalev, Anna A. Igolkina, Maria G. Samsonova and Sergey V. Nuzhdin

*Characterization of DNA Methylation Associated Gene Regulatory Networks During Stomach Cancer Progression*

Jun Wu, Yunzhao Gu, Yawen Xiao, Chao Xia, Hua Li, Yani Kang, Jielin Sun, Zhifeng Shao, Zongli Lin and Xiaodong Zhao

*Intracellular Vesicle Trafficking Genes,* RabC*-GTP, are Highly Expressed Under Salinity and Rapid Dehydration but Down-Regulated by Drought in Leaves of Chickpea (*Cicer arietinum *L.)*

Gulmira Khassanova, Akhylbek Kurishbayev, Satyvaldy Jatayev, Askar Zhubatkanov, Aybek Zhumalin, Arysgul Turbekova, Bekzak Amantaev, Sergiy Lopato, Carly Schramm, Colin Jenkins, Kathleen Soole, Peter Langridge and Yuri Shavrukov

*Using Ancestry Informative Markers (AIMs) to Detect Fine Structures Within Gorilla Populations*

Ranajit Das, Ria Roy and Neha Venkatesh

*The General Transcription Repressor* TaDr1 *is Co-expressed With* TaVrn1 *and* TaFT1 *in Bread Wheat Under Drought*

Lyudmila Zotova, Akhylbek Kurishbayev, Satyvaldy Jatayev, Nikolay P. Goncharov, Nazgul Shamambayeva, Azamat Kashapov, Arystan Nuralov, Ainur Otemissova, Sergey Sereda, Vladimir Shvidchenko, Sergiy Lopato, Carly Schramm, Colin Jenkins, Kathleen Soole, Peter Langridge and Yuri Shavrukov

*Natural Selection Equally Supports the Human Tendencies in Subordination and Domination: A Genome-Wide Study With* in silico *Confirmation and* in vivo *Validation in Mice*

Irina Chadaeva, Petr Ponomarenko, Dmitry Rasskazov, Ekaterina Sharypova, Elena Kashina, Maxim Kleshchev, Mikhail Ponomarenko, Vladimir Naumenko, Ludmila Savinkova, Nikolay Kolchanov, Ludmila Osadchuk and Alexandr Osadchuk

*Pan-Cancer Analysis of TCGA Data Revealed Promising Reference Genes for qPCR Normalization*

George S. Krasnov, Anna V. Kudryavtseva, Anastasiya V. Snezhkina, Valentina A. Lakunina, Artemy D. Beniaminov, Nataliya V. Melnikova and Alexey A. Dmitriev

*PreAIP: Computational Prediction of Anti-inflammatory Peptides by Integrating Multiple Complementary Features*

Mst. Shamima Khatun, Md. Mehedi Hasan and Hiroyuki Kurata

*Sexual Transcription Differences in* Brachymeria lasus *(Hymenoptera: Chalcididae), a Pupal Parasitoid Species of* Lymantria dispar *(Lepidoptera: Lymantriidae)*

Peng-Cheng Liu, Shuo Tian and De-Jun Hao

*Utility of cfDNA Fragmentation Patterns in Designing the Liquid Biopsy Profiling Panels to Improve Their Sensitivity*

Maxim Ivanov, Polina Chernenko, Valery Breder, Konstantin Laktionov, Ekaterina Rozhavskaya, Sergey Musienko, Ancha Baranova and Vladislav Mileyko

*Transcriptomic Analysis of Seed Germination Under Salt Stress in Two Desert Sister Species (*Populus euphratica *and* P. pruinosa*)*

Caihua Zhang, Wenchun Luo, Yanda Li, Xu Zhang, Xiaotao Bai, Zhimin Niu, Xiao Zhang, Zhijun Li and Dongshi Wan

*Conserved MicroRNA Act Boldly During Sprout Development and Quality Formation in Pingyang Tezaocha (*Camellia sinensis*)*

Lei Zhao, Changsong Chen, Yu Wang, Jiazhi Shen and Zhaotang Ding

*The Genomic Landscape of Crossover Interference in the Desert Tree* Populus euphratica

Ping Wang, Libo Jiang, Meixia Ye, Xuli Zhu and Rongling Wu

*Molecular Organization and Chromosomal Localization Analysis of* 5S *rDNA Clusters in Autotetraploids Derived From* Carassius auratus *Red Var. (♀) ×* Megalobrama amblycephala *(♂)*

QinBo Qin, QiWen Liu, ChongQing Wang, Liu Cao, YuWei Zhou, Huan Qin, Chun Zhao and ShaoJun Liu

*Dicyemida and Orthonectida: Two Stories of Body Plan Simplification*

Oleg A. Zverkov, Kirill V. Mikhailov, Sergey V. Isaev, Leonid Y. Rusin, Olga V. Popova, Maria D. Logacheva, Alexey A. Penin, Leonid L. Moroz, Yuri V. Panchin, Vassily A. Lyubetsky and Vladimir V. Aleoshin

*Searching for Signatures of Cold Climate Adaptation in* TRPM8 *Gene in Populations of East Asian Ancestry*

Alexander V. Igoshin, Konstantin V. Gunbin, Nikolay S. Yudin and Mikhail I. Voevoda

*Initial Characterization of the Chloroplast Genome of* Vicia sepium*, an Important Wild Resource Plant, and Related Inferences About Its Evolution*

Chaoyang Li, Yunlin Zhao, Zhenggang Xu, Guiyan Yang, Jiao Peng and Xiaoyun Peng

# Editorial: Bioinformatics of Genome Regulation and Systems Biology

Yuriy L. Orlov<sup>1,2,3</sup>\* and Ancha V. Baranova<sup>4</sup>\*

*<sup>1</sup> Institute of Digital Medicine, I.M. Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia, <sup>2</sup> Life Sciences Department, Novosibirsk State University, Novosibirsk, Russia, <sup>3</sup> Agrobiotechnology Department, Agrarian and Technological Institute, Peoples' Friendship University of Russia, Moscow, Russia, <sup>4</sup> School of Systems Biology, George Mason University, Fairfax, VA, United States*

Keywords: genomics, bioinformatics, systems biology, plant science, gene expression, special issue

#### **Editorial on the Research Topic**

#### **Bioinformatics of Genome Regulation and Systems Biology**

This Research Topic presents studies in the field of computational genomics. These papers were discussed at the BGRS\SB-2018 (Bioinformatics of Genome Regulation and Structure\Systems Biology) multi-conference, along with hybrid wet-lab/computational genetics studies focused on genome-wide regulation of gene expression. BGRS is a major event in the computational biology field and has been held in Novosibirsk, Russia, every two years since 1998. The main conference is typically followed by a series of special post-conference journal issues covering contemporary computational genetics and genomics applications (Orlov et al., 2016, 2019a; Tatarinova et al., 2019). The first Special Issues covering the BGRS\SB conference were presented in the Journal of Bioinformatics and Computational Biology in 2012 (Kolchanov and Orlov, 2013; Orlov et al., 2015, 2019b) and on other platforms (Chen et al., 2017; Baranova et al., 2019; Orlov, 2019; Medical Genetics and Bioinformatics special issue). Since 2018, extended discussion of the conference materials in genetics and genomics has been presented in Frontiers in Genetics.

In this Research Topic, we arranged the papers by area of application: clinical bioinformatics and human genome studies are followed by plant genetics and then by systems biology applications.

Bah et al. comprehensively reviewed genomics tools and databases allowing us to dissect the pathophysiology of bacterial and parasitic infection, spanning the species from Mycobacterium tuberculosis to Plasmodium falciparum. These databases provide the data and tools for in-depth investigations of disease outbreaks and pathophysiological mechanisms, genomic variation and coevolution of hosts and pathogens, diagnostic markers and vaccine targets, with special attention to the contributions of genomics and bioinformatics to the management of both common and neglected tropical diseases, including tuberculosis, dengue fever, malaria, and filariasis.

The Cancer Genome Atlas (TCGA) database was mined from a new technical viewpoint: identifying reference genes with stable mRNA levels for quantitative PCR studies of cancer cells (Krasnov et al.). A scoring system for the assessment of gene expression stability allowed the authors to highlight previously untried reference gene candidates specific to each cancer type, along with several more "universal," pan-cancer reference gene candidates, namely SF3A1, CIAO1, and SFRS4. An application to colon adenocarcinoma was presented by Fedorova et al. (2019), another work within the BGRS\SB conference series.

The study by Ivanov et al. highlighted methodological problems for an up-and-coming biomarker mining technique, sequencing of cell-free DNA (cfDNA) in human plasma. As fragmentation patterns of cfDNA are far from random, because nucleosome patterns reflect tissue-specific epigenetic signatures, these patterns may be used to guide the design of amplicon-based NGS panels. The sensitivity of mutation detection in liquid biopsy samples may therefore be much improved, reducing the amount of body fluid that must be collected from patients.

#### Edited and reviewed by:

*Richard D. Emes, University of Nottingham, United Kingdom*

\*Correspondence: *Yuriy L. Orlov orlov@d-health.institute Ancha V. Baranova abaranov@gmu.edu*

#### Specialty section:

*This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics*

Received: *17 May 2020* Accepted: *26 May 2020* Published: *28 July 2020*

#### Citation:

*Orlov YL and Baranova AV (2020) Editorial: Bioinformatics of Genome Regulation and Systems Biology. Front. Genet. 11:625. doi: 10.3389/fgene.2020.00625*


Khatun et al. work in the medical bioinformatics field; they developed a computational tool, PreAIP (Predictor of Anti-Inflammatory Peptides), aimed at augmenting the search for novel biologics. Wu et al. presented an integrative analysis of stomach carcinoma samples that pairs DNA methylation patterns with gene regulatory network topology. The authors showed conservation of epigenetic patterns across various stages of this important type of human malignancy.

Regulation of gene expression at the genome level is important in evolution and adaptation studies (Ponomarenko et al., 2017; Igoshin et al.). Igoshin et al. looked into the adaptation of humans to cold climate. They concentrated on the TRPM8 gene, which encodes a cold-sensing ion channel. In a population data set, they found a very promising single nucleotide polymorphism, rs7577262, with a signature of a selective sweep. Chadaeva et al. employed bioinformatics to discern behavioral patterns in mice and identify variants contributing to the dominance and subordination traits, continuing bioinformatics studies of behavior in laboratory animals (Bragin et al., 2017). Using the online prediction tool SNP\_TATA\_Comparator (Ponomarenko et al., 2017), they uncovered a set of candidate SNP markers contributing to dominance and subordination. Studies using the same SNP analysis tool were continued in Oshchepkov et al. (2019) and Ponomarenko et al. (2020).

Zverkov et al. considered the problem of genome reduction in primitive parasites. In two groups of microscopic parasitic invertebrates, the Dicyemida and the Orthonectida, overall morphological organization is much simplified, with tissues and organs almost absent. In these species, homeodomain transcription factors, G-protein-coupled receptors, and many other protein families have undergone a massive reduction. Interestingly, it seems that the dramatic simplification of body plans in dicyemids and orthonectids has evolved independently.

Das et al. discuss the application of ancestry informative markers (AIMs), previously developed for the inference of genomic ancestry in humans (Das and Upadhyai, 2018), to the delineation of gorilla lineages. Three of the four AIMs-determining approaches were successful for gorilla species (Das et al.).

The next group of papers in the Research Topic highlights findings in genome regulation related to plant genetics. Kovalev et al. developed a computational pipeline and a machine learning classifier of deleterious coding mutations in agricultural plants, with performance exceeding that of the popular PolyPhen-2 tool. The novel tool will improve the annotation of genes located in QTL and GWAS hit regions. This work was also initially discussed at the BGRS\SB-2018 plant biology session (Orlov et al., 2019c).

Zhang et al. studied abiotic stress in Populus euphratica and its sister species P. pruinosa, which differ in their adaptability to soil salt content. The authors performed transcriptome analyses of three seed germination phases in both species of desert poplar and presented their findings in the form of a database suitable for use by poplar breeders. Wang et al. also studied Populus euphratica, in this case to infer genetic mechanisms of crossover interference. Four-point linkage analysis allowed them to show the distribution of crossover interference throughout the genome of this tree, which is uniquely suited for survival in saline deserts.

The following work by Khassanova et al. continues the line of studies of salinity resistance by exploring expression profiles in chickpea (Cicer arietinum L.). They tested six accessions of chickpea ecotypes, all selected from field trials, for tolerance to abiotic stresses, found the involvement of the CaRabC gene, and developed markers for genotyping chickpea germplasm. Gene expression patterns in bread wheat exposed to drought were studied by Zotova et al. The authors identified the general transcription repressor TaDr1, a part of the TaDr1, TaDr1A, and TaDr1B gene set, with drought-dependent variable expression. It seems that TaDr1 controls expression of TaVrn1 and TaFT1 and, consequently, flowering time. These findings have direct implications for plant productivity in dry environments.

Flowering time in plants is an important agricultural trait determined by genetics and environment. Gursky et al. dissected the core genetic regulatory network canalizing the flowering signals to the decision to flower. Although it was discovered and extensively studied in the model plant Arabidopsis thaliana, the flowering model may hold in other species (Kozlov et al., 2019). When the authors built a model gene network in chickpea (Cicer arietinum), activation from the FLOWERING LOCUS T gene or its homologs to the flowering decision led to high expression of the meristem identity genes, including AP1. Different levels of activation from AP1 may explain the differences observed in the expression of the two homologs of the repressor gene TFL1 in the species compared. Zhao et al. worked on the tea plant (Camellia sinensis). In this plant, the development of new sprouts directly affects the yield and quality of the tea leaves by affecting the content of catechins, theanine, and caffeine. Using high-performance liquid chromatography-mass spectrometry, the authors showed that conserved miRNAs play a role in the primary metabolism of the tea plant during sprouting. Li et al. presented their study of the chloroplast genome of Vicia sepium, an important wild resource plant suitable for cultivation in extreme cold and dry conditions. The authors compared a new complete chloroplast genome of V. sepium with the chloroplast genomes of related genera belonging to the tribe Fabeae, then reconstructed the evolutionary history of the chloroplast genomes in these species.

Orlov M. et al. studied promoters of Mycoplasma gallisepticum, an intracellular parasite affecting the respiratory tract of poultry, and found that the vlhA promoters differ by carrying a variable GAA repeat region upstream of the transcription start site. These data have implications for studies of phase variation in M. gallisepticum. The computational approach to such promoter studies was continued in Orlov and Sorokin (2020).

Liu et al. presented their study of sex differences in the solitary parasitoid species Brachymeria lasus, which has been evaluated as a potential candidate for release to control the gypsy moth, Lymantria dispar, a pest of worldwide importance. Work by Qin et al. considers the polyploidy problem in vertebrates. They analyzed genome organization in the autotetraploid of the red crucian carp (Carassius auratus red var.). The loss of chromosomal loci, base variations in the nontranscribed spacer, and array recombination of repeat units were detected.

Overall, we are proud of the Research Topic we collated at Frontiers in Genetics. We hope that you will find this paper collection stimulating reading, will consider coming to the next BGRS\SB conferences in Novosibirsk, Russia, and will read the next "Bioinformatics of Genome Regulation" Research Topic in Frontiers (https://www.frontiersin.org/research-topics/14266/bioinformatics-of-genome-regulation).

#### REFERENCES


#### AUTHOR CONTRIBUTIONS

YO and AB organized the Research Topic as guest editors, supervised the reviewing of the manuscripts, and wrote this Editorial paper. All authors contributed to the article and approved the submitted version.

#### ACKNOWLEDGMENTS

The guest editors are grateful to the authors who contributed papers to this special issue and thank the reviewers who helped improve the manuscripts. The publication has been prepared with the support of the RUDN University Program 5-100.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Orlov and Baranova. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Dynamical Modeling of the Core Gene Network Controlling Flowering Suggests Cumulative Activation From the FLOWERING LOCUS T Gene Homologs in Chickpea

#### Edited by:

Yuriy L. Orlov, Institute of Cytology and Genetics (RAS), Russia

#### Reviewed by:

Inna N. Lavrik, Medizinische Fakultät, Universitätsklinikum Magdeburg, Germany

Filippo Geraci, Consiglio Nazionale Delle Ricerche (CNR), Italy

\*Correspondence:

Maria G. Samsonova m.g.samsonova@gmail.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 25 August 2018 Accepted: 26 October 2018 Published: 20 November 2018

#### Citation:

Gursky VV, Kozlov KN, Nuzhdin SV and Samsonova MG (2018) Dynamical Modeling of the Core Gene Network Controlling Flowering Suggests Cumulative Activation From the FLOWERING LOCUS T Gene Homologs in Chickpea. Front. Genet. 9:547. doi: 10.3389/fgene.2018.00547

Vitaly V. Gursky<sup>1,2</sup>, Konstantin N. Kozlov<sup>2</sup>, Sergey V. Nuzhdin<sup>2,3</sup> and Maria G. Samsonova<sup>2</sup>\*

<sup>1</sup> Theoretical Department, Ioffe Institute, Saint Petersburg, Russia, <sup>2</sup> Systems Biology and Bioinformatics Laboratory, Peter the Great Saint Petersburg Polytechnic University, Saint Petersburg, Russia, <sup>3</sup> Molecular and Computational Biology, University of Southern California, Los Angeles, CA, United States

Initiation of flowering moves plants from vegetative to reproductive development. The time when this transition happens (flowering time), an important indicator of productivity, depends on both endogenous and environmental factors. The core genetic regulatory network canalizing the flowering signals to the decision to flower has been studied extensively in the model plant Arabidopsis thaliana and has been shown to preserve its main regulatory blocks in other species. It integrates activation from the FLOWERING LOCUS T (FT) gene or its homologs to the flowering decision expressed as high expression of the meristem identity genes, including AP1. We elaborated a dynamical model of this flowering gene regulatory network and applied it to the previously published expression data from two cultivars of domesticated chickpea (Cicer arietinum), obtained for two photoperiod durations. Due to a large number of free parameters in the model, we used an ensemble approach analyzing the model solutions at many parameter sets that provide equally good fit to data. Testing several alternative hypotheses about regulatory roles of the five FT homologs present in chickpea revealed no preference in segregating individual FT copies as singled-out activators with their own regulatory parameters, thus favoring the hypothesis that the five genes possess similar regulatory properties and provide cumulative activation in the network. The analysis reveals that different levels of activation from AP1 can explain a small difference observed in the expression of the two homologs of the repressor gene TFL1. Finally, the model predicts highly reduced activation between LFY and AP1, thus suggesting that this regulatory block is not conserved in chickpea and needs other mechanisms. Overall, this study provides the first attempt to quantitatively test the flowering time gene network in chickpea based on data-driven modeling.

Keywords: chickpea, flowering time, FT genes, ICCV 96029, CDC Frontier, dynamical model

### INTRODUCTION

The depleted genetic diversity of many domesticated, agriculturally important plants is a common problem for breeders and an obstacle to developing new forms with desired features. One such feature important for domesticated chickpea (Cicer arietinum) is early flowering time, which enforces a more rapid transition from vegetative to reproductive growth. Because chickpea is highly sensitive to ascochyta blight, it is essential to shorten the full plant cycle, from sowing to maturation, so that it fits within relatively short growing seasons with dry weather and, hence, low disease pressure (Kumar and Abbo, 2001). These growing seasons are quite short in major chickpea growing regions, pushing breeders to develop chickpea lines with early flowering time. Thus, it is important to identify key genes regulating floral transition and to quantitatively understand the behavior of the flowering time gene network.

The floral transition has been intensively studied in model organisms, such as Arabidopsis (Arabidopsis thaliana) (Srikanth and Schmid, 2011; Andrés and Coupland, 2012), and in other plants, including important crops and legumes (Kumar and Abbo, 2001; Dong et al., 2012; Shrestha et al., 2014; Blümel et al., 2015; Peng et al., 2015; Weller and Ortega, 2015; Zhang et al., 2016; Ridge et al., 2017). Flowering starts in response to various environmental signals, including photoperiod and vernalization, and endogenous signals, such as the autonomous pathway and the circadian clock. Molecular pathways have been identified that conduct these signals to the core gene network, which integrates them into a binary decision to flower. Despite the high complexity of these pathways and many unknown regulators, it has been shown that key genes regulating the process are conserved between species. In particular, the flowering signals lead to the elevated expression of the floral pathway integrator gene FLOWERING LOCUS T (FT), or its homologs, in the leaves (Kardailsky et al., 1999; Kobayashi et al., 1999; Pin and Nilsson, 2012; Jaeger et al., 2013).

In Arabidopsis, the understanding of the core gene network integrating the flowering signals transmitted via the expression of FT has evolved to the general scheme illustrated in **Figure 1A** (Jaeger et al., 2013). FT is a mobile factor transported from the leaves to the apical meristem, where it forms a complex with the transcription factor FD. This complex activates the meristem identity genes LEAFY (LFY) and APETALA1 (AP1), which also activate each other. The expression of AP1 activates genes controlling flower development and thus can be considered the output of the network specifying the floral transition (Kaufmann et al., 2010). To keep the center of the shoot apical meristem in a vegetative state, the key floral repressor TERMINAL FLOWER1 (TFL1) inhibits expression of LFY and AP1 in this region. The resulting gene interaction graph takes the form shown in **Figure 1B**, incorporating evidence for some additional interactions: TFL1 acts as a repressor in a complex with FD, LFY activates FD, and AP1 represses TFL1. As many genes are omitted, each node in the graph in fact represents a group of genes (Jaeger et al., 2013).
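The interactions named in the paragraph above can be written down as a small signed graph. The following is a minimal sketch, not a reproduction of Figure 1B: it encodes only the edges stated in the text, and it makes the simplifying assumption that the FT-FD complex is represented by separate FT and FD edges onto each target. The names `EDGES` and `regulators_of` are hypothetical.

```python
from typing import Dict, Tuple

# Signed regulatory edges: +1 is activation, -1 is repression.
# Assumption: the FT-FD complex is represented by separate FT and FD
# edges onto its targets; only interactions named in the text are listed.
EDGES: Dict[Tuple[str, str], int] = {
    ("FT", "LFY"): +1,
    ("FT", "AP1"): +1,
    ("FD", "LFY"): +1,
    ("FD", "AP1"): +1,
    ("LFY", "AP1"): +1,   # LFY and AP1 activate each other
    ("AP1", "LFY"): +1,
    ("TFL1", "LFY"): -1,  # TFL1 inhibits the meristem identity genes
    ("TFL1", "AP1"): -1,
    ("LFY", "FD"): +1,
    ("AP1", "TFL1"): -1,
}


def regulators_of(target: str) -> Dict[str, int]:
    """Return {regulator: sign} for every edge pointing at `target`."""
    return {src: sign for (src, dst), sign in EDGES.items() if dst == target}


if __name__ == "__main__":
    for gene in ("LFY", "AP1", "TFL1", "FD"):
        print(gene, regulators_of(gene))
```

Listing the regulators per target in this way makes it easy to see that AP1, the network output, receives both the activating inputs and the TFL1 repression.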

The knowledge about the regulatory interactions between the genes from **Figure 1** has been obtained via extensive genetic studies, and it provides a unique opportunity for computational modeling of this gene regulatory network when experimental data on the system behavior are available. Modeling makes it possible to gain mechanistic insights into specific properties of the floral transition system and to produce testable predictions. Jaeger et al. (2013) elaborated a dynamical model of the core network from **Figure 1** based on flowering time data for a set of wild type and mutant Arabidopsis genotypes. They showed that the floral transition dynamics can be explained by splitting the network into several feedback and feed-forward loops, each bearing a clear functional role (Pullen et al., 2013). Leal Valentim et al. (2015) studied a similar gene network, particularly considering that the FT-FD complex activates LFY via the intermediate transcription factors SOC1 and AGL24. They measured the expression dynamics of all genes involved and used these data to calibrate a dynamical model. Using this data-driven approach, they tested various hypotheses about regulation of LFY by SOC1 and AGL24 and showed that perturbations can spread through the network in a nonlinear way.

The possibility of extending these results to chickpea depends on what we know about the inflorescence genes in this species. We concentrate on two chickpea cultivars in this study, CDC Frontier and ICCV 96029. CDC Frontier is a photoperiod-sensitive kabuli chickpea cultivar developed at the University of Saskatchewan (Warkentin et al., 2005), exhibiting relatively late flowering (Daba et al., 2016; Ridge et al., 2017). The reference genome sequence was obtained for this cultivar (Varshney et al., 2013). ICCV 96029 is a photoperiod-insensitive desi chickpea cultivar developed by the International Crops Research Institute for the Semi-Arid Tropics, India, representing the earliest flowering chickpea cultivar currently known. Quantitative trait loci associated with early flowering were investigated, and it was shown that a single recessive allele with some additional modifiers confers early flowering of ICCV 96029 (Kumar and van Rheenen, 2000; Gaur et al., 2015; Upadhyaya et al., 2015; Mallikarjuna et al., 2017). Ridge et al. (2017) provided evidence that a mutation in an ortholog of the key circadian gene ELF3 can be associated with earliness in ICCV 96029 under short day growth conditions, but their analysis of the expression of clock genes in ICCV 96029 did not reveal any clear differences for this cultivar.

In contrast to the single FT gene in Arabidopsis, Ridge et al. (2017) identified five FT homologs in chickpea: FTa1, FTa2, FTa3, FTb, and FTc, named according to affiliation with one of the three clades (FTa, FTb, and FTc). They also found two chickpea orthologs of TFL1 (TFL1a and TFL1c). Furthermore, Ridge et al. (2017) measured the expression dynamics of the homologs of all genes from the core gene network for CDC Frontier and ICCV 96029 under two growth conditions (short day, SD, and long day, LD) and identified specific differences in expression between these genotypes. In particular, they noted that the upregulation of FT and AP1 expression was synchronous with floral bud initiation, thus confirming that regulation of floral transition in chickpea occurs via the FT gene family.

We aimed to investigate whether the core gene network from **Figure 1** can be extended to chickpea. Assuming this network is conserved, we developed a dynamical model of gene expression and applied it to the previously published expression time series (Ridge et al., 2017). We used the resultant model to dissect interactions in which targets were found insensitive to regulator action. Such interactions point to chickpea-specific deviations in regulation of floral transition. We also studied whether the TFL1 homologs are mutually distinguishable in the context of the model. Finally, we tested several hypotheses about how the FT-like genes combine in their activation of the meristem identity genes.

#### RESULTS

#### Model

We modeled the flowering time gene network shown in **Figure 1**. We formulated the model in terms of ordinary differential equations in which the rates of change of gene product concentrations are regulated by the activators and inhibitors via Hill-type regulation functions (the model equations (1–5) are described in detail in section Materials and Methods). The formulation of the model equations depends on how we combine the activation from the FT-like genes. The baseline model (model, or hypothesis, H0) assumes that the five FT homologs are mutually indistinguishable in their activation of the meristem identity genes (LFY and AP1). In this model, FD forms a complex with the total FT concentration, equal to the sum of the protein concentrations from each FT homolog. The activation of LFY by the FT-FD complex is characterized in the model equations by a regulation function containing the following regulatory parameters: one Michaelis–Menten constant (K<sub>8</sub>), one Hill parameter (n<sub>8</sub>), and one maximal synthesis rate (v<sub>6</sub>) (see equation (6) in section Materials and Methods); a similar set of regulatory parameters quantifies the activation of AP1 by the total FT concentration. An alternative model (H1) assumes that only one of the five FT's is enough to activate transition to flowering, so the concentration of only that FT participates in the FT-FD complex and activates LFY and AP1 (see equation (7) in section Materials and Methods for the case of LFY activation). In another alternative model (H2), we tried to distinguish a single FT gene from the other four, assuming that this singled-out gene has regulatory parameters distinct from those of the rest of the FT genes, while the remaining FT's still activate cumulatively (as in model H0). 
The activation from the singled-out FT gene and the activation from the total concentration of the rest of the FT genes are represented in the model by two distinct regulation functions (see equation (8) in section Materials and Methods for the case of LFY activation). Models H1 and H2 have five possible versions, where each version is associated with one FT homolog separated from the other FT-like genes. We tested only four of them, excluding FTa3 from the analysis due to its very low expression in both growth conditions.

We applied the models to describe the previously published dynamic expression data for all genes from the core network measured in two chickpea cultivars, ICCV 96029 and CDC Frontier (Ridge et al., 2017). We failed to find a good model solution for the expression data from CDC Frontier (the best solution is shown in **Supplementary Figure 1**; we also discuss possible reasons in Discussion). Therefore, the rest of the paper describes modeling results for ICCV 96029.

#### Parameter Estimation and Model Solutions for ICCV 96029

Models H0 and H1 have the same number of free parameters (k = 31), and model H2 has six parameters more (k = 37). We estimated values of these parameters by minimizing the weighted sum of squared residuals quantifying the difference between the model solution and the ICCV 96029 data for the two growth conditions (SD and LD) simultaneously (section Materials and Methods). The data comprised expression levels of five genes (TFL1a, TFL1c, FD, LFY, and AP1) in ICCV 96029 on 7 days under SD and LD, with the total number of data points equal to m = 70. After estimating the parameter values, we applied the Akaike information criterion corrected for small data samples for model comparison, as described further in the text.
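A weighted least-squares cost of this kind can be sketched as follows. This is a minimal illustration, not the fitting code used in the study: the inverse-variance weighting and the function name `weighted_ssr` are assumptions, and the weights actually used are those defined in section Materials and Methods.

```python
def weighted_ssr(model, data, sd):
    """Weighted sum of squared residuals between model values and data.

    Assumption: each residual is scaled by the measurement s.d.
    (inverse-variance weighting); the weighting used in the paper may differ.
    """
    if not (len(model) == len(data) == len(sd)):
        raise ValueError("model, data and sd must have the same length")
    return sum(((m - d) / s) ** 2 for m, d, s in zip(model, data, sd))


# Hypothetical numbers: three time points of a single gene.
cost = weighted_ssr(model=[1.0, 2.0, 4.0], data=[1.5, 2.0, 3.0], sd=[0.5, 1.0, 2.0])
```

In the two-conditions fitting, such a sum would run over all m = 70 data points (five genes, seven days, SD and LD) at once.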

As k was relatively large, we chose not to estimate the parameter values by fitting the model to the data from one condition (either LD or SD) and testing on the data from the other condition. In that case, the number of parameters k in model H2 would exceed the number of data points (m = 35 in LD or SD) and k in the other model versions would be close to m, which would complicate the application of the Akaike information criterion for model comparison. As a control, we performed the fitting to the LD data and tested on the SD data in model H0 and made sure that the corresponding solutions were qualitatively similar to the two-conditions fitting results (**Supplementary Figure 2**).

We further mitigated the overfitting potential of the two-conditions fitting by applying the ensemble approach in the analysis of model behavior (Samee et al., 2015). In this approach, all sets of parameter values and solutions resulting from the fitting procedure were considered equally suited for biological conclusions, and the conclusions were derived from the analysis of the whole ensemble of solutions and optimized parameter values.

The parameter optimization under hypothesis H0 resulted in the model solutions of very similar quality (**Figure 2**; distributions of the estimated parameter values are shown in **Supplementary Figure 3**). The model correctly reproduces the main characteristics of the data. The dynamic increase of LFY and AP1 concentrations can be explained by activation from the rising expression of the FT genes. LFY activates FD, resulting in the dynamic increase of its expression. Finally, the floral repressors TFL1a and TFL1c decrease in time due to repression by AP1.

#### Reduced LFY and AP1 Activation

The solution in **Figure 2** shows somewhat insufficient expression levels of both LFY under SD and AP1 under LD. The analysis of the expression data reveals that LFY behaves rather counterintuitively under SD as compared with LD and differs in this behavior from AP1. Namely, LFY is down-regulated in LD compared to SD, despite the increased activation from the rising expression of the FT genes in LD compared to SD, and this holds both for ICCV 96029 and CDC Frontier (**Figure 3**). In contrast, the integral expression of AP1 increases from SD to LD in accordance with the rising activation from FT. This anticorrelation between LFY and its sole activators (FT and AP1) observed in the data hampers the model in finding a better solution.

We analyzed how LFY and other transcription factors are involved in their regulations in the model for ICCV 96029 by plotting average values of the Hill functions which implement in the model equations each regulatory interaction from the gene network (**Figure 4**). An active regulation tends to keep the Hill function value between 0 and 1, while the limit values (0 or 1) indicate that the interaction between genes is saturated, with no sensitivity to specific expression levels of the regulators. This type of saturation occurs for activation of LFY by AP1, with the corresponding Hill function values pushed to zero. Activation of AP1 by LFY is also characterized by the Hill function values close to zero, but the analysis of the Jacobian values of the right-hand side of the model equations for this regulation still shows relatively high LFY influence on AP1 (**Supplementary Figure 4**). Another saturated regulation involving LFY is activation of FD. At the same time, LFY is sensitive to its repressors (the complexes TFL1a-FD and TFL1c-FD), in contrast to the saturated repression of AP1 by these complexes (**Figure 4**). Overall, this analysis of the model and expression data suggests that there are regulators of LFY missing in the core gene network under study.

data was available, 100 expression values were sampled from the normal distribution with the mean and s.d. presented at this temporal point in the data. These values then were interpolated across time, producing a set of 100 expression dynamics, and these dynamics were integrated over time. The chart and error bars show means and standard deviations, respectively, over this set of the integral values.
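The resampling procedure described in the figure caption text above (sampling expression values, interpolating, and integrating over time) can be sketched as follows. This is an illustration under stated assumptions: linear interpolation between measurement days (so that the integral reduces to the trapezoidal rule) and independent sampling at each day are our choices here, and the function name `integral_expression` is hypothetical.

```python
import random
import statistics


def integral_expression(days, means, sds, n_samples=100, seed=0):
    """Mean and s.d. of the time-integrated expression over sampled dynamics.

    Assumptions: values at each day are drawn independently from a normal
    distribution, and dynamics are linearly interpolated between days.
    """
    rng = random.Random(seed)
    integrals = []
    for _ in range(n_samples):
        # One sampled expression dynamics: one draw per measured day.
        y = [rng.gauss(m, s) for m, s in zip(means, sds)]
        # Trapezoidal rule = exact integral of the linear interpolant.
        area = sum(
            0.5 * (y[i] + y[i + 1]) * (days[i + 1] - days[i])
            for i in range(len(days) - 1)
        )
        integrals.append(area)
    return statistics.mean(integrals), statistics.stdev(integrals)
```

The two returned numbers correspond to the bar height and the error bar, respectively, for one gene and one growth condition.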

**Figure 4** shows four regulations characterized by the average Hill function values that are considerably far from the saturation limits: activation of LFY and AP1 by FT and repression of TFL1a and TFL1c by AP1. This fact allows us to use the model for testing various alternative hypotheses about these regulations.
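The link between saturation and loss of sensitivity can be illustrated numerically. The sketch below evaluates the slope of a Hill-type activation function at three regulator levels; the parameter values (K = 100, n = 2) and function names are hypothetical and chosen only for illustration. In the full model the sensitivity of a target also scales with its maximal synthesis rate, which is not included here.

```python
def hill_activation(x, K, n):
    """Hill-type activation x**n / (K**n + x**n), bounded between 0 and 1."""
    return x**n / (K**n + x**n)


def hill_sensitivity(x, K, n, rel_step=1e-6):
    """Central-difference estimate of d(hill_activation)/dx at x."""
    h = rel_step * max(abs(x), 1.0)
    return (hill_activation(x + h, K, n) - hill_activation(x - h, K, n)) / (2 * h)


# Hypothetical parameter values: K = 100, n = 2.
mid = hill_sensitivity(100.0, 100.0, 2)     # regulator near K: responsive regime
low = hill_sensitivity(1.0, 100.0, 2)       # Hill value near 0: saturated
high = hill_sensitivity(10000.0, 100.0, 2)  # Hill value near 1: saturated
```

The slope is largest when the regulator concentration is close to K and decreases toward both limits, which is why regulations with average Hill values far from 0 and 1 are the ones suitable for hypothesis testing.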

#### Difference in TFL1a and TFL1c Expression can be Explained by Different Regulation by AP1

We tested a hypothesis that a small difference in TFL1a and TFL1c expression observed in the data (**Figure 5**) can be explained by different regulation by AP1. Because of this difference in the expression, we included TFL1a and TFL1c in the model as two distinct dynamical variables whose dynamics are under control of the following four parameters per factor (equations (1–2) in section Materials and Methods): maximal expression rate v<sub>i</sub>, dissociation constant K<sub>i</sub>, cooperativity parameter n<sub>i</sub>, and degradation rate λ<sub>i</sub> (i = 1, 2). If the model fitting produced no significant difference in these parameters between TFL1a and TFL1c, there would be no means to distinguish between these factors in the model and we would have to consider a single dynamical variable TFL1 = TFL1a + TFL1c instead. If the difference in parameter values exists, there is an interesting question about whether this difference can be explained by different regulation from AP1. If AP1 is indeed involved, a statistically significant difference should exist between values of the regulatory parameters K<sub>1</sub> and K<sub>2</sub> and/or between values of n<sub>1</sub> and n<sub>2</sub>, because these parameters are associated with repression of TFL1a and TFL1c by AP1. A possible difference in v<sub>i</sub> and/or λ<sub>i</sub> should be attributed to other, AP1-independent, factors.

The optimized parameter values for TFL1a and TFL1c form two clearly separated clusters, which correspond to the main box ("main cluster") and the outliers ("outlying cluster") in the AP1→TFL1a and AP1→TFL1c parts of **Figure 4**, and it is already seen in this figure that the regulation by AP1 differs between the analyzed target genes within the main cluster. The Hill exponents n<sub>i</sub> are the same in the main cluster for both TFL1a and TFL1c (n<sub>i</sub> = 1, i = 1, 2), but we see a significant difference in K<sub>i</sub> values in this cluster: K<sub>1</sub> = 561.14 ± 0.13 (TFL1a) and K<sub>2</sub> = 401.14 ± 0.08 (TFL1c) (p-value = 2 × 10<sup>−9</sup>). Therefore, the model suggests different regulatory properties of AP1 in its action on the genes TFL1a and TFL1c, possibly linked to different association kinetics at their promoters. The outlying cluster is characterized by a small influence of AP1 and contains only 5 to 6 parameter sets with very similar K<sub>i</sub> and n<sub>i</sub> values, so we consider this cluster as not relevant.

#### Model Suggests Cumulative Activation by the FT Homologs

We tested whether an individual FT gene stands out against the other FT homologs by fitting the three versions of the model (models H0, H1, and H2) described above and in Materials and Methods, with subsequent comparison of their fitting quality. We considered only four of the five FT genes in the tests excluding FTa3, since its expression was small relative to the other ones (**Figure 6A**).

We first checked if a single FT gene can provide the full activation from the FT gene family in the network, thus serving as a unique transmitter of the flowering signal (model H1). Under this assumption, we replaced the sum of FT concentrations in the model equations by the concentration of one of the four FT's and fitted each resulting version of the model to the expression data for ICCV 96029. For each tested FT gene, model H1 demonstrated worse fitting quality as compared to the baseline model with the cumulative activation from all FT genes (model H0) (**Figure 6B**; p-value = 3 × 10<sup>−7</sup> for FTa1 as the sole activator; 7 × 10<sup>−9</sup>, FTa2; 2 × 10<sup>−5</sup>, FTb; 10<sup>−4</sup>, FTc). Breaking the cost function into the separate SD- and LD-related components reveals that all versions of model H1 have worse quality in description of the LD data and all except the FTa2- and FTc-related models have worse description of the SD data (**Supplementary Figure 5**). Since models H0 and H1 have the same number of parameters, neither of them is prone to overfitting to a larger extent than the other one, and, hence, we can conclude that model H0 is more relevant based on the fitting quality comparison alone, without applying additional quality measures.

As several FT genes are required for better description of the expression data, the question remains whether different FT's activate the meristem identity genes differently in terms of their regulatory parameters. We implemented this possibility in model H2 by singling an FT out from the other four and adding a new regulation function to the model equations representing the activating action of this FT with its own regulatory parameters (v, K, and n), while preserving in the equations the activation from the sum of the other FT concentrations. Model H2 exhibited a better fitting quality than H0 for the singled-out genes FTa1 (p-value = 0.005) and FTc (p-value = 0.0004), with no improvement for the other two FT genes (p-value = 0.09 for the singled-out FTa2 and 0.12 for FTb) (**Figure 6C**). Both FTa1 and FTc-related models H2 demonstrate better fit to the LD data, with no significant improvements in fits to the SD data (**Supplementary Figure 6**).

shown. The use of a more conventional form of AICc yields a similar figure (Supplementary Figure 8 and Supplementary Text).

We can try to find features in the expression of FTa2 and FTb that can be attributed to their worse individual performance in the model. **Figure 6A** shows that the expression dynamics of FTa2 is almost identical under SD and LD for a long time and becomes down-regulated under LD at later days, contrary to the behavior of all other FT's and to the up-regulation of AP1 in LD (**Figure 3**). At the other extreme, the up-regulation of FTb in LD is the strongest among the FT genes, and this rise in expression might be too large to represent the difference between SD and LD adequately. However, model H1 with FTb as the only FT activator performs best among all FT genes on average (**Figure 6B**), and both FTb-related models (H1 and H2) provide the lowest cost function values among all models, including H0 (see the minimal cost values in **Figures 6B,C**), which hints at possible importance of this gene.

The observed better performance of models H2 with the singled-out genes FTa1 and FTc can be related to overfitting, since model H2 has six parameters more than the baseline model H0. We controlled for this by evaluating the Akaike information criterion corrected for small data samples (AICc; equation (10) in section Materials and Methods), which assesses the quality of a model applied to data by combining the fitting quality of the model and its complexity in terms of the number of free parameters. Smaller values of this measure correspond to better models. AICc evaluation reveals that its value for each version of model H2 is more than four times larger than for model H0 (**Figure 6D**), which suggests that the complexity added to model H2 is not justified by the resulting improvement in fitting. Therefore, we conclude that the model with the cumulative activation from all FT genes (model H0) is the most relevant for the given expression data.
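The trade-off between fit and complexity can be made concrete with the conventional least-squares form of AICc, AICc = m ln(RSS/m) + 2k + 2k(k + 1)/(m − k − 1). The sketch below uses this textbook form as an assumption; it is not necessarily identical to equation (10), and the function names are hypothetical.

```python
import math


def aicc_penalty(m, k):
    """Complexity part of the conventional AICc: 2k + 2k(k+1)/(m-k-1)."""
    if m - k - 1 <= 0:
        raise ValueError("AICc correction is undefined when m - k - 1 <= 0")
    return 2 * k + 2 * k * (k + 1) / (m - k - 1)


def aicc_least_squares(rss, m, k):
    """Conventional AICc for a least-squares fit with residual sum rss."""
    return m * math.log(rss / m) + aicc_penalty(m, k)


# Penalties for the parameter counts quoted in the text (m = 70 data points).
penalty_h0 = aicc_penalty(70, 31)  # models H0 and H1
penalty_h2 = aicc_penalty(70, 37)  # model H2
```

Under this conventional form, the penalty grows from about 114.2 for k = 31 to about 161.9 for k = 37, so the six extra parameters of H2 must lower m ln(RSS/m) by roughly 47.7 to be justified. The same expression shows why single-condition fitting is problematic: with m = 35 and k = 37 the correction term is undefined.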

#### DISCUSSION

We presented a computational model of the core gene network controlling the floral transition and investigated its ability to describe the expression data in two chickpea cultivars. We were able to find good model solutions for ICCV 96029, which suggests a general conservation of the core gene network from **Figure 1** in this chickpea cultivar. On the other hand, the modeling results were negative for CDC Frontier. A possible reason for this could be related to the specific choice of the modeling formalism. This explanation does not seem likely, since the modeling formalism is quite general and has been successfully applied to the same gene network in Arabidopsis (Leal Valentim et al., 2015). Another explanation which we find more probable is that this gene network is more perturbed in CDC Frontier than in ICCV 96029.

Several key differences between CDC Frontier and ICCV 96029 were reported based on the analysis of the expression data (Ridge et al., 2017): ICCV 96029 exhibits much earlier and much stronger up-regulation of the expression of AP1, according to the earlier appearance of visible floral buds as compared to CDC Frontier. The floral repressors TFL1a and TFL1c have lower expression levels in ICCV 96029 than in CDC Frontier, also in accordance with the early flowering of the former. On the other hand, the differences in expression of FD and LFY are not as visible between the cultivars.

The expression levels of the FT genes in the data are significantly different for the two cultivars, and the total FT concentration in CDC Frontier can be estimated as close to the background levels (**Figure 7A**). This can partially explain why the model is not feasible for the expression data from CDC Frontier. Such small FT levels could possibly be related to the observed fact that the first floral buds, which appeared in CDC Frontier at 31 days after sowing in SD and at 32 days in LD, were abortive, although the low expression of some of these genes persisted for a much longer time (Ridge et al., 2017). Furthermore, investigation of the autocorrelation functions of the FT expression time series reveals very different patterns in the FT signals between the cultivars (**Figure 7B**), and these patterns are translated to the rest of the core network genes almost without changes (**Figure 7C**). It is interesting to note a periodic signal in the FT dynamics in CDC Frontier with a period of two days, although this signal may still be an experimental artifact related to low expression levels.
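An autocorrelation analysis of this kind can be sketched as follows. The code uses the standard sample autocorrelation estimator and assumes equally spaced time points; both choices, as well as the function names, are assumptions made for illustration and may differ from the procedure used to produce **Figure 7**.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a time series at a non-negative integer lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def average_acf(series_list, max_lag):
    """Average ACF over several series (e.g., FT genes and growth conditions)."""
    return [
        sum(autocorrelation(s, lag) for s in series_list) / len(series_list)
        for lag in range(max_lag + 1)
    ]
```

A signal that alternates between consecutive time points gives a negative value at lag 1 and a positive value at lag 2, which is the signature of a period-two oscillation.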

Another important difference between the cultivars that we see in the data and that might contribute to the difference in the modeling results concerns the dependence between concentrations of TFL1a/TFL1c and LFY/AP1. TFL1a and TFL1c repress LFY and AP1, and AP1 represses the TFL1-like genes (Ratcliffe et al., 1999; Kaufmann et al., 2010). Therefore, we expect these two groups of transcripts to avoid coexistence in the data and, hence, to exhibit a negative correlation over time. We do see this correlation in the data from ICCV 96029, but not from CDC Frontier (**Table 1**). Moreover, **Table 1** shows that these mutual repressors tend to show a positive correlation in the CDC Frontier data. Regardless of whether this inconsistency in the CDC Frontier data should be attributed to an artifact or hints at alternative regulations between the TFL1-like genes and the inflorescence identity genes in this cultivar, this property evidently impedes the modeling success under the given assumptions.
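A rank correlation with a one-tailed permutation test, as used for **Table 1**, can be sketched as follows. The exhaustive enumeration of permutations (feasible for short time series) and the direction of the alternative hypothesis (negative correlation) are assumptions made here for illustration; the function names are hypothetical.

```python
from itertools import permutations


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i : j + 1]:
            r[idx] = (i + j) / 2 + 1
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_permutation_test(x, y):
    """Spearman rho and one-tailed p-value for the alternative rho < 0."""
    rx, ry = ranks(x), ranks(y)
    rho = pearson(rx, ry)
    hits = total = 0
    for perm in permutations(ry):  # exhaustive: n! rearrangements of the ranks
        total += 1
        if pearson(rx, perm) <= rho + 1e-12:
            hits += 1
    return rho, hits / total
```

For seven time points the enumeration covers 5040 permutations; for longer series a random subset of permutations would be the usual substitute.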

It has been shown that LFY is involved in positive regulation of AP1 and is positively regulated by AP1 in Arabidopsis (Wagner et al., 1999; Jaeger et al., 2013; Leal Valentim et al., 2015). Our modeling results suggest that additional factors should exist to account for the insufficient activation of these genes in the model for chickpea. The counterintuitive increase in the integral expression of LFY under SD as compared with LD, contrary to the decreasing activation from the FT-like genes, may indicate that additional activators of LFY participate under SD and compensate for the missing activation. We believe that the absence of such factors in the core gene network considered in our model and, as a consequence, the inability to properly handle the LD vs. SD changes in expression is the reason why AP1 is almost excluded as an activator of LFY in the model solutions. In other words, this allows for the hypothesis that the LFY-AP1 regulation module is not conserved in chickpea. However, we should also consider the possibility that the higher expression of LFY under SD than under LD is due to insufficient quality of the data. Future work, both modeling and experimental, should clarify this point.

FIGURE 7 | Difference in FT behavior between ICCV 96029 and CDC Frontier, based on the expression data from (Ridge et al., 2017). (A) The dynamics of the sum of concentrations of all five FT transcripts, for the two cultivars and two growth conditions. Developing floral buds were first detected at 15 days (under SD) and 13 days (LD) in ICCV 96029 and at 31 days (SD) and 32 days (LD) in CDC Frontier (Ridge et al., 2017). (B) Autocorrelation function (ACF) for the expression data time series of the FT genes. ACF estimates similarity (correlation) between data points as a function of the time lag between them. For each time lag value, an ACF value was calculated for the expression time series for each FT gene and growth condition (SD and LD), and then an average ACF was calculated over the FT genes and conditions. (C) The same as in (B) but for the expression dynamics of the genes TFL1a, TFL1c, FD, LFY, and AP1.

TABLE 1 | Correlations between the expression dynamics of TFL1a/TFL1c and LFY/AP1 in the data from (Ridge et al., 2017).

The Spearman rank correlation coefficient ρ was calculated for each cultivar (CDC Frontier and ICCV 96029) and growth condition (SD and LD). The p-values (P) were calculated by a one-tailed permutation test, and the p-values below 0.05 are marked with an asterisk.

Since ICCV 96029 is day length neutral and floral transition is conferred via the FT genes, we might expect no difference in FT expression between SD and LD treatments in this cultivar. However, the expression data by Ridge et al. (2017) shows an essential difference in expression of these genes (**Figures 6**, **7A**), and it is important that this difference is transferred to the SD/LD difference in expression of AP1 (**Figure 3**), so that the key gene specifying flower meristem identity exhibits sensitivity to photoperiod according to the data. This expression data was collected from plants whose first visible floral buds appeared at 15 days after sowing in SD and 13 days in LD (Ridge et al., 2017), which gives a two-day difference in floral bud initiation time between SD and LD. This two-day difference diverges from previous measurements showing no difference in this time in ICCV 96029 (19 days from seeding ± 0.0) (Daba et al., 2016), but it qualitatively matches the observed difference in expression.

Irrespective of whether this match is reliable, the observed rise in expression of the FT genes and AP1 in LD suggests that some compensatory mechanisms, or missing repressors, should exist that diminish the influence of that extra expression on the time to flower. It is reasonable to presume that these mechanisms should operate in the post-inductive phase of flower development, as they take the increased expression of floral meristem identity genes as the input. However, this conjecture does not agree with the previously observed fact that ICCV 96029 does not exhibit photoperiod sensitivity in any of the pre-inductive, inductive, or post-inductive phases of flower development (Daba et al., 2016). We believe this expression-based photoperiod sensitivity effect in ICCV 96029 is a fascinating subject for further studies.

An important difference between Arabidopsis and legumes, as well as other species, is the presence of multiple orthologs of the inflorescence genes, such as FT, that are present in a single copy in Arabidopsis (Pin and Nilsson, 2012). The regulatory roles of individual copies can sometimes be separated from the others; for example, FTb has been shown to have the leading role in pea (Hecht et al., 2011). The main purpose of our modeling approach was to infer possible differences in regulatory roles or other properties associated with the five FT homologs and two TFL1 homologs in chickpea (Ridge et al., 2017). It is important that the model and expression data in principle allow such inference, as the fitting results reveal that both FT- and TFL1-like genes are involved in active regulations.

AP1 was shown to repress TFL1-like genes (Liljegren et al., 1999; Kaufmann et al., 2010; Jaeger et al., 2013), and we found that this repression can be different for TFL1a and TFL1c in chickpea. As this difference concerns only the values of the equilibrium dissociation constant K, we can suggest that AP1 has different binding properties to the promoters of TFL1a and TFL1c.

Visual comparison between the expression of the five FT-like genes in ICCV 96029 does not help in differentiating their regulatory properties. Our modeling results support the cumulative activation model, in which all FT proteins have very similar regulatory properties and activation of the meristem identity genes occurs via the total FT concentration. Analyzing their expression data, Ridge et al. pointed at FTb as particularly important for induction of flowering (Ridge et al., 2017). However, this gene becomes indistinguishable from the others if we put it in the modeling context. The ensemble of model fits in which this gene is singled out does not improve the model, and we get the same conclusions using the Akaike information criterion to assess the relative performance of the model. On the other hand, we found that singling FTb out produced the lowest values of the minimal cost in all types of the computational experiments, suggesting that its potential of being the leading FT activator is not exhausted and is not seen only due to possible imperfections of the model and/or data.

As with any modeling approach, our model has limitations. Perhaps the most important one concerns the large number of free parameters. We tackled this inevitable problem by utilizing the ensemble approach in the analysis of the model behavior (Samee et al., 2015). Despite the existing interdependence between the model parameters, the optimized parameter values led to a set of very similar solutions for ICCV 96029. We drew conclusions only from the average over the ensemble of the optimized parameter values, thus utilizing the "wisdom of the crowd" principle. We note that, for example, both the model with the single FTb and the model with the singled-out FTb provide the minimal costs among all alternative models, while they do not perform better on average. Even with the given number of free parameters, the model was not able to reproduce the expression data from CDC Frontier, which, in particular, indicates that the model cannot fit arbitrary data. Therefore, we believe that the ensemble approach increases the confidence of our results.

#### MATERIALS AND METHODS

#### Model Equations

We model the expression of TFL1a, TFL1c, FD, LFY, and AP1 with the following set of differential equations:

$$\frac{du_{\mathrm{TFL1a}}}{dt} = v_1 \frac{K_1^{n_1}}{K_1^{n_1} + u_{\mathrm{AP1}}^{n_1}} - \lambda_1 u_{\mathrm{TFL1a}}, \tag{1}$$

$$\frac{du_{\mathrm{TFL1c}}}{dt} = v_2 \frac{K_2^{n_2}}{K_2^{n_2} + u_{\mathrm{AP1}}^{n_2}} - \lambda_2 u_{\mathrm{TFL1c}}, \tag{2}$$

$$\frac{du_{\mathrm{FD}}}{dt} = v_3 \frac{u_{\mathrm{LFY}}^{n_3}}{K_3^{n_3} + u_{\mathrm{LFY}}^{n_3}} - \lambda_3 u_{\mathrm{FD}}, \tag{3}$$

$$\begin{split} \frac{du_{\mathrm{LFY}}}{dt} &= \left(v_4 \frac{u_{\mathrm{AP1}}^{n_4}}{K_4^{n_4} + u_{\mathrm{AP1}}^{n_4}} + f_{FT \to LFY}(t)\right) \times \\ &\quad \left(\frac{K_5^{n_5}}{K_5^{n_5} + \left[u_{\mathrm{FD}}\left(u_{\mathrm{TFL1a}} + u_{\mathrm{TFL1c}}\right)\right]^{n_5}}\right) - \lambda_4 u_{\mathrm{LFY}}, \end{split} \tag{4}$$

$$\begin{split} \frac{du_{\mathrm{AP1}}}{dt} &= \left(v_5 \frac{u_{\mathrm{LFY}}^{n_6}}{K_6^{n_6} + u_{\mathrm{LFY}}^{n_6}} + f_{FT \to AP1}(t)\right) \times \\ &\quad \left(\frac{K_7^{n_7}}{K_7^{n_7} + \left[u_{\mathrm{FD}}\left(u_{\mathrm{TFL1a}} + u_{\mathrm{TFL1c}}\right)\right]^{n_7}}\right) - \lambda_5 u_{\mathrm{AP1}}, \end{split} \tag{5}$$

where u's describe the protein concentrations, v<sub>i</sub> are the maximal protein synthesis rates, K<sub>i</sub> are the Michaelis–Menten constants (which can be seen as the equilibrium dissociation constants for the regulators binding the target gene promoters in the case of a direct transcriptional regulation), n<sub>i</sub> are the Hill constants (accounting for the cooperative effects), and λ<sub>i</sub> are the protein degradation constants. We do not model translation explicitly, but instead assume that protein concentrations are proportional to mRNA concentrations for simplicity.

The specific form of the equations is chosen according to the regulatory graph in **Figure 1** and can be read as follows. The last terms on the right-hand side of all the equations represent degradation of each protein. The first term on the right-hand side of equation (1) is the regulation function describing repression of TFL1a by AP1. The same regulation function but with different parameters describes repression of TFL1c by AP1 in equation (2). The first term on the right-hand side of equation (3) represents activation of FD by LFY. The first pair of brackets in equation (4) contains the sum of the activating inputs to LFY expression from AP1 (the first term in the sum) and the FT homologs (the function f<sub>FT→LFY</sub>(t), described below). This input is multiplied by the regulation function in the second pair of brackets of this equation, representing repression of LFY by the FD-TFL1 complex. This repression is represented under the assumption that TFL1a and TFL1c have equivalent regulatory properties, and the concentration of the complex is proportional to the product of the FD concentration (u<sub>FD</sub>) and the total concentration of TFL1a and TFL1c (u<sub>TFL1a</sub> + u<sub>TFL1c</sub>). The first pair of brackets in equation (5) contains the sum of the activating inputs to AP1 expression from LFY (the first term in the sum) and the FT homologs (the function f<sub>FT→AP1</sub>(t), described below). This input is multiplied by the regulation function in the second pair of brackets of this equation, representing repression of AP1 by the FD-TFL1 complex.
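The reading of equations (1–5) given above can be transcribed into code as follows. This is a minimal sketch, not the implementation used in the study: the dictionary-based data layout and the names `hill_act`, `hill_rep`, and `rhs` are hypothetical, and the FT inputs are passed in as precomputed numbers.

```python
def hill_act(x, K, n):
    """Activating Hill function x^n / (K^n + x^n)."""
    return x**n / (K**n + x**n)


def hill_rep(x, K, n):
    """Repressing Hill function K^n / (K^n + x^n)."""
    return K**n / (K**n + x**n)


def rhs(u, f_ft_lfy, f_ft_ap1, v, K, n, lam):
    """Right-hand sides of equations (1)-(5).

    u: dict of concentrations keyed by gene name.
    v, K, n, lam: dicts keyed by the integer indices used in the equations.
    f_ft_lfy, f_ft_ap1: current values of the FT input functions.
    """
    # Concentration of the FD-TFL1 complex (up to a proportionality factor).
    fd_tfl1 = u["FD"] * (u["TFL1a"] + u["TFL1c"])
    return {
        "TFL1a": v[1] * hill_rep(u["AP1"], K[1], n[1]) - lam[1] * u["TFL1a"],
        "TFL1c": v[2] * hill_rep(u["AP1"], K[2], n[2]) - lam[2] * u["TFL1c"],
        "FD": v[3] * hill_act(u["LFY"], K[3], n[3]) - lam[3] * u["FD"],
        "LFY": (v[4] * hill_act(u["AP1"], K[4], n[4]) + f_ft_lfy)
        * hill_rep(fd_tfl1, K[5], n[5])
        - lam[4] * u["LFY"],
        "AP1": (v[5] * hill_act(u["LFY"], K[6], n[6]) + f_ft_ap1)
        * hill_rep(fd_tfl1, K[7], n[7])
        - lam[5] * u["AP1"],
    }
```

Keeping the activation sum and the repression factor as separate subexpressions mirrors the two pairs of brackets in equations (4) and (5).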

We test three alternative hypotheses (H0, H1, and H2) about functions f<sub>FT→LFY</sub> and f<sub>FT→AP1</sub>. Under the null hypothesis H0, we assume regulatory equivalence of the five FT homologs, so the total concentration of all FT proteins forms the complex with FD and activates LFY and AP1 with a single Michaelis–Menten constant and a single Hill constant, according to the following expression:

$$H0\colon\ f_{FT \to LFY}(t) = v_6 \frac{\left[u_{\mathrm{FD}} \sum_{i=1}^{5} u_i(t - \tau)\right]^{n_8}}{K_8^{n_8} + \left[u_{\mathrm{FD}} \sum_{i=1}^{5} u_i(t - \tau)\right]^{n_8}}, \tag{6}$$

u<sub>1</sub> = u<sub>FTa1</sub>, u<sub>2</sub> = u<sub>FTa2</sub>, u<sub>3</sub> = u<sub>FTa3</sub>, u<sub>4</sub> = u<sub>FTb</sub>, u<sub>5</sub> = u<sub>FTc</sub>,

and a similar expression for the function f<sub>FT→AP1</sub> with the AP1-related constants v<sub>7</sub>, K<sub>9</sub>, and n<sub>9</sub>. The FT concentrations in equation (6) are calculated with a time delay τ, which represents the time taken to transport FT from the leaves to the apical meristem.

In the hypothesis H1, we assume that a single FT gene (with index k) is capable to fully represent the FT-mediated activation of LFY and AP1:

$$\text{HII: } \begin{aligned} \text{pH: } \quad &f\_{\text{FF}\to\text{LFY}}\left(t\right) = \nu\_{\text{6}} \frac{[\boldsymbol{u}\_{\text{FD}}\boldsymbol{u}\_{\text{k}}\left(t-\tau\right)]^{\text{nR}}}{K\_{\text{8}}\text{"}+[\boldsymbol{u}\_{\text{FD}}\boldsymbol{u}\_{\text{k}}\left(t-\tau\right)]^{\text{nR}}}, \end{aligned} \tag{7}$$

and a similar expression for the function f<sub>FT→AP1</sub> with the same u<sub>k</sub> and with the AP1-related constants ν<sub>7</sub>, K<sub>9</sub>, and n<sub>9</sub>.

Under hypothesis H2, we assume that one member u<sub>k</sub> of the FT family is distinguishable from the other four members of the family in terms of regulation of LFY and AP1, so that we can separate it into a distinct regulation function with its own regulatory constants as follows:

$$\begin{aligned} H2\colon\ f\_{FT \to LFY}(t) = {} & \nu\_6 \frac{\left[u\_{FD} \sum\_{i=1,\, i \neq k}^{5} u\_i(t - \tau)\right]^{n\_8}}{K\_8^{n\_8} + \left[u\_{FD} \sum\_{i=1,\, i \neq k}^{5} u\_i(t - \tau)\right]^{n\_8}} \\ & + \nu\_7 \frac{\left[u\_{FD}\, u\_k(t - \tau)\right]^{n\_9}}{K\_9^{n\_9} + \left[u\_{FD}\, u\_k(t - \tau)\right]^{n\_9}}, \end{aligned}\tag{8}$$

and a similar expression for the function f<sub>FT→AP1</sub> with the AP1-related constants ν<sub>8</sub>, ν<sub>9</sub>, K<sub>10</sub>, K<sub>11</sub>, n<sub>10</sub>, and n<sub>11</sub>. The first term in equation (8) describes the cumulative activation from the four FT proteins distinct from the FT protein with index k, whose activating input is represented by the second term in this equation. Depending on which gene of the FT family is singled out in this way, we have five possible forms of f<sub>FT→LFY</sub> and f<sub>FT→AP1</sub> to test under hypothesis H2.
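The three alternative forms of the FT input can be written compactly in code. The following Python 3 sketch is illustrative only: the function names and the 0-based index `k` are our own choices, and `u_ft` is assumed to hold the five FT concentrations already evaluated at the delayed time t − τ.

```python
def hill(x, v, K, n):
    """Hill-type activation v * x^n / (K^n + x^n)."""
    xn = x ** n
    return v * xn / (K ** n + xn)


def f_ft_h0(u_fd, u_ft, v6, K8, n8):
    """H0: all five FT homologs are pooled into a single input."""
    return hill(u_fd * sum(u_ft), v6, K8, n8)


def f_ft_h1(u_fd, u_ft, k, v6, K8, n8):
    """H1: only the homolog with index k (0-based here) contributes."""
    return hill(u_fd * u_ft[k], v6, K8, n8)


def f_ft_h2(u_fd, u_ft, k, v6, K8, n8, v7, K9, n9):
    """H2: homolog k has its own constants; the other four are pooled."""
    rest = sum(u for i, u in enumerate(u_ft) if i != k)
    return hill(u_fd * rest, v6, K8, n8) + hill(u_fd * u_ft[k], v7, K9, n9)
```

Looping `k` over the five homologs in `f_ft_h2` gives the five variants of the model tested under H2.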

We solved equations (1–5) numerically, replacing the concentrations of all regulators on the right-hand side of the equations with their expression data values interpolated in time. This effectively splits the model into four independent parts that share no parameters: single equations for TFL1a, TFL1c, and FD, and the system of two equations for LFY and AP1 sharing the common parameter τ. The initial conditions for all proteins except TFL1a and TFL1c were equal to the value of each transcript at the first available day from the expression data (Ridge et al., 2017). Setting the initial conditions for TFL1a and TFL1c in the same way led to undesirable artifacts in the solutions resulting from the fitting procedure (**Supplementary Figure 7**); therefore, the initial conditions for these proteins were set to zero at t = 0, and the functions on the right-hand side of the model equations were obtained by interpolating the data values back to zero concentrations at t = 0. The numerical solution was obtained using either the ode23s solver in Octave or the NDSolve function in Wolfram Mathematica.
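The decoupling described above can be illustrated with a minimal Python 3 sketch for a single equation driven by interpolated data. Several things here are assumptions rather than the published implementation: the right-hand side is taken to be a Hill-type repression term minus linear degradation, the interpolation is piecewise linear, and a fixed-step Runge–Kutta scheme stands in for the ode23s and NDSolve solvers named in the text.

```python
from bisect import bisect_right


def interp(t, ts, ys):
    """Piecewise-linear interpolation of (ts, ys), clamped at both ends."""
    if t <= ts[0]:
        return ys[0]
    if t >= ts[-1]:
        return ys[-1]
    j = bisect_right(ts, t)
    w = (t - ts[j - 1]) / (ts[j] - ts[j - 1])
    return ys[j - 1] + w * (ys[j] - ys[j - 1])


def rhs(t, u, ts, ap1, v, K, n, lam):
    """Assumed form: Hill-type repression by AP1 minus linear degradation."""
    a = interp(t, ts, ap1)  # regulator taken from data, not from the model
    return v * K ** n / (K ** n + a ** n) - lam * u


def rk4(u0, t0, t1, steps, *args):
    """Classical fixed-step Runge-Kutta integration from t0 to t1."""
    h = (t1 - t0) / steps
    t, u = t0, u0
    for _ in range(steps):
        k1 = rhs(t, u, *args)
        k2 = rhs(t + h / 2, u + h * k1 / 2, *args)
        k3 = rhs(t + h / 2, u + h * k2 / 2, *args)
        k4 = rhs(t + h, u + h * k3, *args)
        u += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return u
```

Because the regulator enters only through `interp`, each such equation can be fitted on its own, which is the splitting into independent parts described above.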

#### Parameter Estimation

The model contains 31 free parameters (7 ν<sub>i</sub>'s, 9 K<sub>i</sub>'s, 9 n<sub>i</sub>'s, 5 λ<sub>i</sub>'s, and τ) under hypothesis H0 and in each version of the model under hypothesis H1, and there are six more parameters in H2. For the ICCV 96029 cultivar, the parameter values were found by minimizing the following weighted residual sum of squares (wRSS):

$$\text{wRSS} = \sum\_{g=1}^{5} \sum\_{k=1}^{T} \frac{\left(u\_g(t\_k) - u\_g^{data}(t\_k)\right)^{2}}{\sigma\_{g,k}^{2}},\tag{9}$$

in which the squared difference between the model solution u<sub>g</sub> for gene g and the data u<sub>g</sub><sup>data</sup> is summed over all genes and over the T time points at which data are available; σ<sub>g,k</sub> is the standard deviation of the data for gene g at time t<sub>k</sub>. For fits to the CDC Frontier data, wRSS was additionally complemented with a penalty term equal to the covariance between the model solution and data.
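As a concrete reading of equation (9), the sketch below computes wRSS from nested lists indexed by gene and time point. The function name and the list-of-lists layout are illustrative assumptions; the penalty term used for the CDC Frontier fits is not included.

```python
def wrss(model, data, sigma):
    """Weighted residual sum of squares over genes (rows) and time points (columns)."""
    total = 0.0
    for u_row, d_row, s_row in zip(model, data, sigma):
        for u, d, s in zip(u_row, d_row, s_row):
            total += (u - d) ** 2 / s ** 2
    return total
```

For a joint fit to two growth conditions, the same function would be evaluated once per condition and the two values summed.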

The model fitting was performed either to the LD data only (with the SD data used for testing) or to the joint LD and SD data, in which case wRSS from equation (9) was calculated for the two growth conditions and summed. In the case of the LD fits, there were 35 data points in total for ICCV and 75 data points for CDC Frontier. In the case of fits to the joint SD and LD data, there were 70 and 145 data points for ICCV and CDC Frontier, respectively. The expression data for the five modeled genes and the five FT homologs in chickpea were obtained from **Figure 5** of the paper by Ridge et al. (2017). The figure was digitized with the web-based tool WebPlotDigitizer (Rohatgi, 2018; the extracted expression data are available at https://zenodo.org, DOI:10.5281/zenodo.1451748). The cost functional was minimized by differential evolution, a global parameter search method, using either a wolframscript program utilizing the NMinimize function in Wolfram Mathematica or an entirely parallelized version of the method implemented in the DEEP software (Kozlov et al., 2016).

We assessed the quality of the alternative models H0–H2 using the Akaike information criterion adjusted for small data samples:

$$AICc = 2k - 2\log \hat{L} + \frac{2k^2 + 2k}{m - k - 1},\tag{10}$$

where k is the number of parameters in a model, m is the number of data points used for model fitting, and L̂ is the maximum value of the likelihood function. In our case, 2 log L̂ = −wRSS<sub>min</sub>, where wRSS<sub>min</sub> is the minimal value of the wRSS functional from equation (9) over the set of model fits (see **Supplementary Text** for the derivation of L̂). We also used a classical likelihood function appearing in least squares fitting.
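Equation (10) combined with the relation 2 log L̂ = −wRSS<sub>min</sub> gives a one-line computation. The sketch below is a minimal illustration; the function name and the guard against a non-positive denominator are our additions.

```python
def aicc(wrss_min, k, m):
    """AICc from equation (10), using -2 log L = wRSS_min as stated in the text."""
    if m - k - 1 <= 0:
        raise ValueError("AICc requires m > k + 1")
    return 2 * k + wrss_min + (2 * k ** 2 + 2 * k) / (m - k - 1)
```

With the parameter counts given above (k = 31 under H0 or H1, k = 37 under H2) and m = 70 data points for the joint ICCV fit, the small-sample correction term alone is about 52.2 and 87.9, respectively, so H2 has to lower wRSS<sub>min</sub> substantially to be preferred.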

#### AUTHOR CONTRIBUTIONS

MS and SN conceived and coordinated the project. VG and KK conducted the computational experiments. VG analyzed and summarized the results and wrote the first draft of the manuscript. All the authors participated in finalizing the manuscript.

#### FUNDING

The work was supported by the Russian Science Foundation, grant 16-16-00007.

#### ACKNOWLEDGMENTS

We thank Stephen Ridge for valuable discussions about expression data and Sergey Rukolaine for helpful advice on model inference.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00547/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Gursky, Kozlov, Nuzhdin and Samsonova. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Comparative Analysis of Mycoplasma gallisepticum vlhA Promoters

#### Mikhail Orlov<sup>1</sup> \*, Irina Garanina<sup>2</sup> \*, Gleb Y. Fisunov<sup>2</sup> and Anatoly Sorokin<sup>1</sup>

<sup>1</sup> Institute of Cell Biophysics, Russian Academy of Sciences, Pushchino, Russia, <sup>2</sup> Federal Research and Clinical Center of Physical-Chemical Medicine, Federal Medical-Biological Agency, Moscow, Russia

Mycoplasma gallisepticum is an intracellular parasite of the class Mollicutes that affects the respiratory tract of poultry. M. gallisepticum features numerous variable lipoprotein hemagglutinin genes (vlhA) that play a role in immune escape. The vlhA promoters have a set of distinct properties in comparison to promoters of the other genes. They carry a variable region of GAA repeats approximately 40 nts upstream of the transcription start site. The promoters have been considered active only in the presence of exactly 12 GAA repeats. The mechanisms of vlhA expression regulation and GAA number variation have not been described. Here we tried to understand these mechanisms using different computational methods. We conducted a comparative analysis among several M. gallisepticum strains. Nucleotide sequence analysis showed the presence of highly conserved regions flanking the repeated trinucleotides that are not linked to GAA number variation. VlhA genes with 12 GAA repeats and their orthologs in 12 M. gallisepticum strains are more conserved than other vlhA genes and have a narrower GAA number distribution. We conducted a comparative analysis of physicochemical profiles of M. gallisepticum vlhA and sigma-70 promoters. Stress-induced duplex destabilization (SIDD) profiles showed that the sigma-70 group is characterized by the sharp maxima common to prokaryotic promoters, while vlhA promoters are hardly destabilized, with the region between the GAA repeats and the transcription start site having zero opening probability. Electrostatic potential profiles of vlhA promoters indicate the presence of distinct patterns that appear to govern the initial stages of specific DNA-protein recognition. Open state dynamics profiles of vlhA promoters demonstrate a pattern that might facilitate transcription bubble formation. The obtained data could be the basis for experimental identification of the mechanisms of phase variation in M. gallisepticum.

Keywords: Mycoplasma gallisepticum, promoter, transcription regulation, DNA physics, vlhA

#### Edited by:

Yuriy L. Orlov, Russian Academy of Sciences, Russia

#### Reviewed by:

Mikhail P. Ponomarenko, Russian Academy of Sciences, Russia Enrique Medina-Acosta, Universidade Estadual do Norte Fluminense Darcy Ribeiro, Brazil

#### \*Correspondence:

Mikhail Orlov orlovmikhailanat@gmail.com Irina Garanina irinagaranina24@gmail.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 08 August 2018 Accepted: 06 November 2018 Published: 21 November 2018

#### Citation:

Orlov M, Garanina I, Fisunov GY and Sorokin A (2018) Comparative Analysis of Mycoplasma gallisepticum vlhA Promoters. Front. Genet. 9:569. doi: 10.3389/fgene.2018.00569

#### INTRODUCTION

Mycoplasmas are genome-reduced bacteria without a cell wall and with a parasitic lifestyle. Mycoplasmas parasitize diverse animal and plant species as well as humans. Like other intracellular parasites, they need to adapt to the host's immune system. One of the main mechanisms Mycoplasmas employ is changing the repertoire of surface lipoproteins (phase variation) (Rosengarten and Wise, 1990). Other pathogenic bacteria, including Haemophilus, Chlamydia, and Streptococcus species, also use phase variation to escape host defense mechanisms (Noormohammadi, 2007). Phase variation in Mycoplasmas can occur spontaneously or in response to an immune attack; it is important for the persistence and survival of Mycoplasmas in a host (Markham et al., 1998; Glew et al., 2000; Ma et al., 2015; Czurda et al., 2017; Chopra-Dewasthaly et al., 2017). Numerous mechanisms of phase variation have been described for Mycoplasmas (Citti et al., 2010). Usually, the mechanisms of variation are species-specific and occur in one species or in closely related Mycoplasmas. They include DNA slippage, site-specific recombination, reciprocal recombination, and gene conversion (Citti et al., 2010). However, the phase variation system of Mycoplasma gallisepticum is unique and has not been described so far. Therefore, studying phase variation genes can reveal novel mechanisms of gene expression regulation in bacteria.

Mycoplasma gallisepticum is a major bacterial pathogen that induces widespread respiratory disease in poultry and wild birds, leading to significant economic losses throughout the world (Bencina, 2002). Phase variation in M. gallisepticum involves switching of variable lipoprotein and hemagglutinin (vlhA) gene expression (Markham et al., 1992). The exact function of the vlhA proteins is still unknown. They are involved in hemagglutination (Bencina, 2002; Noormohammadi, 2007), and data obtained on avian Mycoplasmas suggest that vlhA proteins participate in host cell adhesion and invasion (May et al., 2014; Matyushkina et al., 2016; Hegde et al., 2018). VlhA genes are organized into 3–5 cassettes, with ten genes per cassette (Baseggio et al., 1996). The promoter structure of these genes differs significantly from that of the other M. gallisepticum genes. VlhA genes lack the conserved sigma-70 promoter sequence and often have a GTG start codon (Markham et al., 1994). They are proposed to employ an alternative sigma factor that binds the GCGAAAAT sequence (Fisunov et al., 2016). Long regions of GAA repeats are located upstream of vlhA genes (Markham et al., 1994). In general, the GAA repeats can be considered short-sequence repeats (SSRs). SSRs have been found in all eukaryotic and many prokaryotic genomes (Mrázek et al., 2007; Avvaru et al., 2017). In bacteria, SSRs have been identified in genes coding for bacterial virulence factors, including lipopolysaccharide-modifying enzymes and adhesins (Mrázek, 2006; Wei et al., 2015). Thus, SSRs provide genetic and, therefore, phenotypic variability. Changes in the number of repeated units and/or in the repeat unit itself may arise from recombination processes or polymerase errors, including slipped-strand mispairing (SSM), either alone or in combination with DNA repair deficiencies (van Belkum et al., 1998; Rocha, 2003; Torres-Cruz and van der Woude, 2003).

Early experiments showed that M. gallisepticum expresses only one vlhA family member at a time and that expression depends on the presence of exactly 12 GAA trinucleotide repeats upstream of the gene (Glew et al., 1995, 1998; Liu et al., 2002). Recently it was shown that expression of the gene preceded by 12 GAA repeats exceeds that of the other vlhA genes, but genes with a different number of repeats are also expressed, some of them at a high level (Matyushkina et al., 2016; Pflaum et al., 2016; Butenko et al., 2017). In vivo experiments showed the non-stochastic character of vlhA switching during infection: the vlhA expression pattern changes as the infection progresses and differs between strains (Pflaum et al., 2016, 2018). Thus, vlhA expression is determined by GAA repeats, but additional expression control mechanisms probably exist. An interesting question is how the cell determines which promoter needs to be activated. One explanation is the existence of a hemagglutinin activator protein (HAP) recognizing 12-GAA repeats (Liu et al., 2002).

Another question is the mechanism of GAA repeat variation in M. gallisepticum. It would be interesting to find out how many repeats change at a time and whether the change depends on the number of repeats of a given gene or on the sequences surrounding the GAA repeats and their physicochemical properties. In the present study we used computational methods to analyze the genomes of several M. gallisepticum strains and to shed light on the mechanism of phase variation and vlhA expression control. For this purpose, we used comparative bioinformatics analysis of the sequences of vlhA promoters and genes. We assumed that the nonstandard structure of vlhA promoters may be related to the physicochemical properties of their sequences. Using computational methods, we predicted these properties for the DNA of vlhA promoters and compared them with the corresponding properties of experimentally identified sigma-70 promoters of M. gallisepticum S6.

## MATERIALS AND METHODS

### Bioinformatics Analysis of (GAA)n and vlhA Genes

We used 12 complete genomes of M. gallisepticum strains of various levels of virulence, isolated from chickens and house finches, that were available for download from the GenBank database in June 2018 (Papazisi et al., 2003; Szczepanek et al., 2010; Fisunov et al., 2011; Tulman et al., 2012; Fleming-Davies et al., 2018). The list of the genomes and their characteristics (size, GC content, and number of genes) is provided in **Supplementary Table S1**. We obtained the sequences of vlhA promoters of all 12 strains to study GAA number variation. For comparison of physicochemical properties, we retrieved the sequences of sigma-70 promoters of the S6 strain. The exact coordinates of the transcription start sites of M. gallisepticum S6 were obtained from our published work, in which 5′-end-enriched RNA-seq was conducted (Mazin et al., 2014).

The GAA repeats were defined as 4–27 non-interspaced trinucleotides repeated in a row. Shorter tracts appeared to be non-specific, and no tracts of 28 or more repeats were detected. We reasoned that, for a putative GAA-recognizing protein, the length of the GAA tract should be more important than substitutions in individual repeats inside the (GAA)n. Therefore, units with substitutions inside the (GAA)n were counted as intact units, whereas a unit with at least one substitution located at the end of the (GAA)n was excluded, which shortened the tract. We did not detect GAA tracts containing more than two damaged GAA units inside the tract. For sequence retrieval and GAA counting we used a custom Python 2.7 script.
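The counting rules above can be made concrete with a short sketch. This is not the custom script used in the study but a hypothetical Python 3 reimplementation under stated assumptions: a "damaged" unit is taken to be a trinucleotide with exactly one substitution relative to GAA, a tract must begin with an intact unit, only the given strand is scanned, and all function and parameter names are our own.

```python
def unit_kind(unit):
    """Classify a trinucleotide: 'ok' for GAA, 'sub' for one substitution, else None."""
    if unit == "GAA":
        return "ok"
    if len(unit) == 3 and sum(a != b for a, b in zip(unit, "GAA")) == 1:
        return "sub"
    return None


def find_gaa_tracts(seq, min_units=4, max_damaged=2):
    """Return (start, n_units) for each (GAA)n tract found on the given strand."""
    seq = seq.upper()
    tracts, i = [], 0
    while i + 3 <= len(seq):
        if seq[i:i + 3] != "GAA":        # a tract must start with an intact unit
            i += 1
            continue
        kinds, j = [], i
        while j + 3 <= len(seq):
            kind = unit_kind(seq[j:j + 3])
            if kind is None or (kind == "sub" and kinds.count("sub") == max_damaged):
                break
            kinds.append(kind)
            j += 3
        while kinds[-1] == "sub":         # trim damaged units at the tract end
            kinds.pop()
        if len(kinds) >= min_units:
            tracts.append((i, len(kinds)))
        i += 3 * len(kinds)               # continue after the tract or lone GAA
    return tracts
```

A damaged unit inside the tract is counted toward its length, while a damaged unit at the end is trimmed, mirroring the two cases distinguished in the text.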

To analyze GAA number variation we classified vlhA genes into orthologous groups. Not all vlhA genes are clearly annotated; most are annotated as hypothetical proteins. Since we were interested only in vlhA genes under the control of (GAA)n-containing promoters, we first mapped the GAA repeats and then found the corresponding vlhA genes. In several cases we observed short GAA repeats in coding regions of other genes or GAA repeats not associated with vlhA; these cases were corrected manually. The ProteinOrtho program (version V5.16) was used to compute orthologous vlhA proteins (Lechner et al., 2011), with the parameters identity = 70% and minimum coverage of best BLAST alignments = 50%. The Fisher exact test was performed using the fisher.test() function in R with a two-sided alternative hypothesis.

To reconstruct the phylogenetic tree of vlhA genes for **Figure 3**, we obtained consensus sequences of orthologous clusters by applying the Biopython command dumb\_consensus() to the orthologous group alignments (Cock et al., 2009). VlhA proteins and their consensus sequences were aligned by the T-Coffee program implemented in the JalView software (version 2.10.5) with default parameters (Waterhouse et al., 2009; Di Tommaso et al., 2011). The phylogenetic tree of consensus sequences was constructed with the Phylogeny.fr tool, which implements the maximum-likelihood method (Dereeper et al., 2008). The histogram and distributions of GAA numbers were constructed in R.

### Analysis of (GAA)n Flanks

For analysis of the (GAA)n flanking regions, we extracted 50-nucleotide sequences upstream and downstream of the (GAA)n. We aligned the upstream and downstream flanks independently by the T-Coffee program implemented in the JalView software (version 2.10.5) with default parameters (Waterhouse et al., 2009; Di Tommaso et al., 2011) and merged the corresponding aligned flanks using the Biopython library for Python 2.7 (Cock et al., 2009). The alignment of the flanks is provided in **Supplementary Materials**. WebLogo was used to construct sequence logos (Crooks et al., 2004).
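Flank extraction amounts to simple slicing once the tract position is known. The helper below is a hypothetical Python 3 illustration that assumes 0-based coordinates and a tract given by its start and number of units; flanks are truncated at the sequence ends rather than padded.

```python
def flanks(seq, start, n_units, size=50):
    """Return (upstream, downstream) flanks of a tract of n_units trinucleotides."""
    end = start + 3 * n_units
    return seq[max(0, start - size):start], seq[end:end + size]
```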

To compare the (GAA)n flanking sequences between 12-GAA and the other vlhA genes, we used t-SNE (t-Distributed Stochastic Neighbor Embedding), a non-linear dimension reduction algorithm. t-SNE maps high-dimensional objects into two- or three-dimensional space for visualization. It presents the data in a compact and clear view and has advantages over other dimension reduction methods, such as PCA (van der Maaten and Hinton, 2008). The alignment was transformed into a table in which nucleotides and gaps are encoded as numbers; columns correspond to positions in the alignment and rows to individual genes. We employed the PCA algorithm with default parameters and the t-SNE algorithm with the perplexity parameter set to 30, both implemented in the sklearn library for Python 2.7 (Pedregosa et al., 2011).

### Calculation of Physicochemical Properties of Promoters

Stress-induced duplex destabilization (SIDD) is a theoretical method developed to analyze denaturation in superhelical DNA of a specified sequence (Benham, 1990). SIDD profile analysis predicts the DNA positions where the DNA duplex becomes susceptible to separation when under superhelical stress (Benham, 1990). SIDD calculation was carried out as implemented by its authors (Zhabinskaya et al., 2015). The conformational and thermodynamic parameters were derived from the endonuclease digestion experiments on superhelical DNA (Kowalski et al., 1988; Benham, 1992). Theoretical calculations using these parameters were consistent with experimental data (Benham, 1992).

For SIDD calculations, 1000-nt intervals with the transcription start site (TSS) at the center were considered; using long DNA regions takes the broader genomic context into account. We filtered out nucleotide sequences containing more than one promoter. SIDD profiles were obtained by means of a Perl script. The SIDD calculation was performed using default settings (superhelicity level 0.06, energy threshold 12, and ionic strength 0.01). The temperature was set to the average chicken body temperature (314 K). The difference between SIDD profile maximum values was tested by the non-parametric Mann–Whitney U test implemented in R using the wilcox.test() function with the parameter paired = FALSE.

The distribution of electrostatic potential is a feature of the DNA duplex that contributes to the initial stages of DNA–protein interactions (Jones et al., 2003). The electrostatic profiles were obtained using a method suitable for genome-wide application (Polozov et al., 1999). The approach is based on the Coulomb formula and allows electrostatic profiles of promoters to be analyzed within the electrostatic map of whole-genome DNA. It is widely used in studies of electrostatic patterns of bacterial and phage promoters (Polozov et al., 1999; Kamzolova et al., 2005, 2006, 2009; Sorokin et al., 2006; Osypov et al., 2010). Finally, we calculated the dynamical properties of DNA open states, including their activation energy (E<sub>0</sub>) and size (d). These properties are believed to affect transcription bubble formation and to carry information additional to that encoded by steady-state DNA properties. The model equation was derived from the sine-Gordon equation by adding two terms that more accurately take into account the heterogeneous nature of the DNA sequence. The profiles were shown to agree with the function of the corresponding DNA regions: promoters form open states most easily, while terminators are likely to stop the transcription bubble (Grinevich et al., 2015). In summary, SIDD profiles were obtained by means of a Perl script, electrostatic profiles were calculated using the algorithm implemented in R, and the dynamical properties of DNA open states were obtained using the algorithm implemented in Matlab 9.2.

## RESULTS

### VlhA Promoters Share Conserved GAA-Flanking Sequences Irrespective of GAA Units Number

A comparative analysis of the number of GAA repeats in vlhA genes of different strains was conducted to identify possible patterns of variation. All vlhA genes from 12 strains were clustered into orthologous groups according to sequence similarity. Previous studies revealed that vlhA transcription is activated if 12 GAA repeats are present within the promoter. Flanking regions of the GAA repeats were also found to be essential for vlhA expression (Liu et al., 2000). Here we analyzed the conservation of GAA flanks among different M. gallisepticum strains and vlhA orthologous groups to identify the mechanism of vlhA expression activation. For each vlhA gene, sequences upstream and downstream of the (GAA)n were obtained. In total, 368 promoters were included in the analysis. GAA tracts were defined as repeat regions containing 4 or more GAA trinucleotides without substitutions at the ends of the (GAA)n. The constructed logos demonstrate conserved sequences both upstream and downstream of the GAA repeats (**Figure 1**). The conservation level varies among positions of the motifs. We searched the NCBI nucleotide collection for similar sequences with the blastn program and did not find any matches in other species. Thus, these sequences show no homology with other sequenced genomes and appear to occur in the M. gallisepticum genome only. The sequences contain neither the repetitive sequences nor the palindromes that are often present in regulatory motifs.

We compared the flanking sequences of 12-GAA tracts with those of other vlhA promoters. First, we examined the logos of 12-GAA and non-12-GAA flanks (**Figure 1**). No traceable distinction was found between the two groups. For a more precise comparison, we visualized the sequences in three-dimensional space using the t-SNE method (**Figure 2**). This method represents sequence similarity as a distance in two- or three-dimensional space. No clustering of promoters with 12 GAA repeats was identified by t-SNE or by the similar method PCA (**Supplementary Figure S1**). Thus, analysis of the GAA flanking regions revealed conserved positions around the GAA tract but did not show a correlation between the presence of 12 GAA units in the (GAA)n and the sequence of the (GAA)n flanks.

To consider the flanking sequences in more detail, we constructed their alignments and phylogenetic trees for genes belonging to the same orthologous groups. Here we describe two representative examples of trees (**Figure 3**) and the alignments of flanks of orthologous groups (**Supplementary Materials**). The identity level between vlhA proteins of these two orthologous groups is higher than 90% for all protein pairs. The first tree was built from the merged (GAA)n flanks of the orthologous cluster containing 4 genes with 12-GAA repeats. This is the largest orthologous group, containing proteins represented in all strains. The alignment and tree show that the sequences are conserved within the groups of strains isolated from different host species: strains F, S6, Rlow, and Rhigh were obtained from chickens, and the remaining strains from house finches. Genomes of finch strains are almost identical, with a low number of substitutions, but differences exist (Tulman et al., 2012; Kristensen et al., 2017). Chicken strains are less similar to each other than strains from finches according to data from the ATGC database (Tulman et al., 2012). In this case, one would therefore expect slight differences between the (GAA)n flanks of individual strains, but the sequences for the orthologous group are completely identical within each of the two groups. Interestingly, the flanks and the corresponding genes are located in different vlhA cassettes: the genes from chicken strains are located in the first cassette, and the finch genes are located in the third and fourth cassettes. Thus, relocation to another cassette did not affect the sequences of the (GAA)n flanks. The orthologous group includes 4 genes with 12-GAA repeats, and no differences between them and the other genes are noticeable. We observed that the number of repeats within the ortholog cluster varied, while the sequences of the repeats were conserved. This suggests that the change in the number of GAA repeats does not depend on the sequences flanking them. **Figure 3B** shows the tree of another orthologous group, which also contains 12-GAA repeat genes. The tree confirms the lack of connection between the number of repeats and the sequence of the flanks. These flanking sequences are less conserved than those of the first group. Thus, analysis of the trees and alignments of particular orthologous groups showed no connection between the (GAA)n number and the flanking sequences.

### Number of GAA Repeats Varies Among vlhA Orthologs and Different Strains of M. gallisepticum

A comparative analysis of the number of GAA repeats in vlhA genes from different strains was conducted to identify possible patterns of GAA number variation. All vlhA genes from 12 strains were clustered into orthologous groups (**Figure 4A**). The distribution of GAA tract lengths shows that the majority of values reside within a narrow range of 6–12 repeats. We divided the vlhA orthologous clusters into two groups: those containing 12 repeats in at least one strain and the rest. The distribution within the 12-GAA-containing group is even narrower, varying from 8 to 12 repeats. This may indicate that the GAA number changes by the gain or loss of a small number of repeats.

The number of 12-GAA promoters varies across the strains from zero to three per genome. We found a positive correlation between gene conservation level and the presence of 12-GAA repeats within an ortholog cluster. Genes with 12 repeats occur more frequently in full ortholog clusters, i.e., clusters comprising genes that are represented in all strains (Fisher exact test p-value = 0.0248).
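For readers unfamiliar with the test, the two-sided Fisher exact p-value for a 2×2 table can be computed from the hypergeometric distribution. The Python 3 sketch below is a generic illustration, not the R call used in the study; the contingency counts behind the reported p-value are not given in the text, so no attempt is made to reproduce it. The small relative tolerance when comparing probabilities is an implementation choice to guard against rounding.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, c1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
```

Here the rows could stand for clusters with and without a 12-GAA gene and the columns for full and partial ortholog clusters.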

The number of repeats varies within one genome as well as within one orthologous cluster. We analyzed the distribution of the GAA repeat number among the strains and ortholog clusters (**Figures 4B,C**). The data show that the prevalent GAA repeat number is 8 and that the frequency decreases as the number of repeats increases. Genes with 12 GAA repeats follow the common trend and have no exceptional frequencies. Comparison of the dispersion in repeat number among the strains and ortholog clusters showed that the number of repeats is more conserved within one strain than within one ortholog cluster. The majority of the strains follow this trend, except for the S6 strain, which exhibited the most variable repeat number. Certain ortholog clusters are more conserved than others, which may indicate differences in VlhA expression among strains. Overall, analysis of the GAA repeat number did not reveal any traceable patterns in the distribution of repeats. We suggest that such patterns might be established after considering a larger set of strains.

FIGURE 1 | The motif of (GAA)n flanking sequences in vlhA promoters. Logos show identity of motifs for promoters with different GAA number. Sequences of 50 bp length were aligned by the T-Coffee program; gaps are included in the alignment. Logos were constructed by WebLogo 3.6.0. (A,B) Logos for upstream flanks; (C,D) logos for downstream flanks; (A,C) logos of 22 sequences of 12-GAA promoters; (B,D) logos of 344 non-12-GAA promoters.

FIGURE 2 | t-SNE analysis plot of (GAA)n flanking motifs. Points represent individual vlhA genes of all analyzed strains; the analysis was performed on concatenated left and right (GAA)n flanking sequences. Black points show 12-GAA promoters. The analysis used the t-SNE algorithm implemented in the sklearn Python library with the perplexity parameter set to 30.


FIGURE 3 | GAA repeats number statistics for 12 Mycoplasma gallisepticum strains and vlhA orthologous clusters. (A) Heatmap showing the number of GAA repeats for each vlhA promoter. The number of repeats is indicated by colors; 12-GAA repeats are shown in red. One strain corresponds to three rows (three is the maximum number of vlhA paralogs observed for a strain). Names of the strains are shown in the heatmap center. Orthologous clusters correspond to columns. The tree was constructed by the Phylogeny.fr software based on the T-Coffee protein alignment of consensus sequences of orthologous groups, using the maximum-likelihood method for phylogeny reconstruction. (B) Histogram of the number of GAA repeats. The dark gray bar shows 12-GAA promoters. (C) Distribution of dispersion of GAA repeats number among strains and orthologous clusters.

### VlhA Promoters Have Lowest Opening Probability Under Superhelical Stress (SIDD Profiles) While Non-vlhA Promoters Are Highly Destabilized

Red numbers on branches display branch support values. (A) The tree of the biggest orthologous group, which is depicted in the last column in Figure 3A; (B) the tree of another orthologous group, consisting of four 12-GAA genes. The group is depicted in column 41 in Figure 3A.

To describe the possible role of physicochemical interactions in phase variation of M. gallisepticum, several DNA properties of promoter regions were obtained in the form of profiles. SIDD as a DNA parameter shows a robust correlation with various regulatory DNA loci including promoters, replication origins, etc. The promoters of E. coli can be classified into SIDD-dependent and SIDD-independent groups according to their SIDD profile, which seems to correlate with their functional specialization (Wang and Benham, 2006). In the present article we analyzed SIDD profiles for vlhA promoters from various M. gallisepticum strains as well as for standard sigma-70 promoters experimentally identified in the S6 strain (Mazin et al., 2014). Promoters of both types feature the same GC-content of 0.3, which is the average GC-content of the M. gallisepticum genome. Sigma-70 promoters are substantially more destabilized, with the profile maxima located in the vicinity of the TSS, while vlhA promoters showed no tendency to melt under the considered conditions (**Figure 5**). Peaks of the vlhA promoter profiles do not overlap the TSS region, and the sequence adjacent to the GAA repeats has zero melting probability. At the same time, the majority of sigma-70 promoters demonstrate sharp maxima in the upstream region [−100; −50] nts (Mann-Whitney test p-value <0.05) (**Figure 6**). Because both promoter types share the same GC-content, this observation to some extent supports the notion that there is no direct correlation between SIDD profiles and GC-content of a DNA segment.
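The Mann-Whitney comparison mentioned above can be illustrated with a minimal, standard-library sketch. It is not the authors' script: the function name `mann_whitney_u` and the opening-probability values are hypothetical, and the p-value uses the normal approximation without a tie correction, so it is only approximate for small samples.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns the U statistic of sample x and an approximate p-value.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the block of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value

# Hypothetical maximal opening probabilities in the [-100; -50] window.
sigma70_maxima = [0.90, 0.80, 0.85, 0.70, 0.95, 0.75]
vlha_maxima = [0.10, 0.00, 0.20, 0.05, 0.15, 0.30]

u_stat, p = mann_whitney_u(sigma70_maxima, vlha_maxima)
print(f"U = {u_stat:.1f}, approximate p = {p:.4f}")
```

For real data, an exact or tie-corrected implementation from a statistics library would be preferable to this approximation.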

### Dynamical Properties of DNA Open States and Electrostatic Potential Profiles of vlhA Promoters Show Distinct Patterns

Dynamics of DNA open states was shown to be important for transcription bubble formation (Grinevich et al., 2015). The lower the open states activation energy, the more prone the DNA duplex is to open, thus facilitating transcription initiation. Open states activation energy profiles, as well as open state size profiles, were calculated for vlhA and sigma-70 promoters. We found that the transition of vlhA promoters to an open state occurred more efficiently in the region downstream of the TSS. The activation energy for this promoter group in the interval [−70; 20] nts has a decreasing slope that starts at the right boundary of the GAA repeats. It is tempting to suggest that the slope facilitates the directed movement of RNA polymerase along the promoter. By contrast, no traceable patterns were detected for sigma-70 promoters (**Figures 7**, **8**).

Distribution of electrostatic potential (EP) around the DNA duplex is a physical property that could be recognized by other molecules at a distance and prior to their direct interaction. It appears to be crucial at the initial stages of promoter recognition by RNA polymerase (Polozov et al., 1999). Promoters of vlhA genes show a characteristic EP pattern with a peak about 30 nt downstream of the TSS. Neither visual assessment nor clustering revealed traceable patterns in the profiles of sigma-70 promoters (**Figure 9**).

### DISCUSSION

The promoters of vlhA genes feature a remarkable mechanism of transcriptional regulation. It includes two functional components: transcriptional activation at 12-GAA containing promoters and variation of the GAA repeat number. In this article we analyzed the conservation, GAA number distribution, and physicochemical properties of vlhA promoters in M. gallisepticum. We propose that physicochemical properties of promoters, including SIDD, DNA open states dynamics, and electrostatic potential, could be connected to the regulation of vlhA gene expression.

We demonstrated that the GAA repeats in vlhA promoters are flanked by highly conserved sequences with distinct structure. Altogether the regulatory region takes more than 50 nt. Sequences of such length are generally too large for binding a typical bacterial transcription factor (Rodionov, 2007). Regulatory sequences of this length are unique in bacteria.

It is possible that M. gallisepticum has unique DNA binding proteins with the unknown spatial structure of the DNA binding region that standard annotation programs cannot identify. The hypothesis is supported by the fact that Mycoplasmas have a large number of orphan genes with unknown functions (Tatarinova et al., 2016).

Most of the analyzed strains are isolated from wild birds and are pathogenic for the host. We observed that 12-GAA vlhA genes can occur more than once in a genome. The data obtained imply that the presence of a single 12-GAA vlhA gene is not the only possible combination enabling pathogenicity manifestation. The closely related strains Rlow and Rhigh demonstrate similar distributions of 12-GAA genes but have distinct virulence potential (Szczepanek et al., 2010). The vaccine strain F, which has a low level of pathogenicity, has the maximum number of genes with 12-GAA repeats and lacks numerous vlhA genes. One can speculate that the inability to switch vlhA properly may result in a decrease of pathogenicity.

We found that the distribution of the GAA number resides within the narrow range of 8–12 repeats only when orthologous clusters with at least one 12-GAA promoter were considered. We hypothesize that there is a "working range" of GAA repeats within which the number can iterate while having a considerable chance to get back to 12. Promoters that occasionally go out of range are not functional, although they may still remain conserved. The corresponding genes will never be activated again. The orthologous clusters lacking 12-GAA promoters are distributed in considerably fewer strains, which is consistent with the idea that they have lost function and represent a decaying group of vlhA.
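The "working range" idea can be made concrete with a short sketch that counts the GAA units in a promoter and checks them against the 8–12 range. The helper names and the toy sequences are hypothetical; only the repeat unit, the 12-repeat active state and the 8–12 range come from the text.

```python
import re

GAA_TRACT = re.compile(r"(?:GAA)+")

def longest_gaa_run(sequence):
    """Number of GAA units in the longest uninterrupted (GAA)n tract."""
    tracts = GAA_TRACT.findall(sequence.upper())
    return max((len(tract) // 3 for tract in tracts), default=0)

def classify_promoter(n_repeats, low=8, high=12, active=12):
    """Label a promoter by its repeat count relative to the working range."""
    if n_repeats == active:
        return "12-GAA (active state)"
    if low <= n_repeats <= high:
        return "within working range"
    return "outside working range"

# Toy promoters with made-up flanks; only the (GAA)n tract matters here.
promoters = {
    "toy_promoter_1": "TTTAT" + "GAA" * 12 + "CCTA",
    "toy_promoter_2": "TTTAT" + "GAA" * 9 + "CCTA",
    "toy_promoter_3": "TTTAT" + "GAA" * 5 + "CCTA",
}

for name, seq in promoters.items():
    n = longest_gaa_run(seq)
    print(name, n, classify_promoter(n))
```

The sketch takes the longest tract per sequence, which is an assumption about how interrupted repeats should be handled.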

Calculation of physical properties of vlhA promoters and sigma-70 promoters of the S6 strain allowed us to identify distinct patterns in open states dynamics and electrostatic potential profiles. We hypothesize that the former could facilitate transcription bubble formation, thus stimulating processive transcription, while the latter could contribute to the initial stage of DNA-protein recognition. By contrast, SIDD profiles of vlhA promoters are hardly destabilized and have zero opening probability near the TSS, while sigma-70 promoters have overall high destabilization levels with maxima associated with the TSS position. This is consistent with the idea that an alternative sigma factor rather than sigma-70 is utilized for transcription of vlhA. One can speculate that the zero opening probability of vlhA promoters under superhelical stress reflects the fact that these loci are wrapped around an activator complex, i.e., are at a high local degree of negative supercoiling. At the same time, improper transcription should not be facilitated from vlhA promoters since their −10 boxes show a substantial degree of similarity with those of sigma-70.

#### CONCLUSION

Analysis of promoters of vlhA indicates the presence of conserved sequences upstream and downstream of the GAA repeats. Sequences of (GAA)n flanks are not connected with the number of GAA repeats. The distribution of (GAA)n length among the strains of M. gallisepticum shows a preferred range within which this number iterates: 6–12 repeats. The distribution of the number of GAA units varies among strains and orthologous groups. VlhA orthologous groups having at least one 12-GAA gene in the group have a narrower distribution of GAA number, with values within the range 8–12, and are more conserved among strains than other orthologous groups. As compared to sigma-70 promoters of M. gallisepticum, promoters of vlhA feature distinct and characteristic profiles of physical properties including opening probability under superhelical stress, open state activation energy, and electrostatic potential.

#### DATA AVAILABILITY

The datasets analyzed and the scripts for this study can be found at https://github.com/FVortex/Orlov\_et\_al.\_Frontiers\_in\_Genetics\_Mycoplasma\_gallispeticum\_script.

#### AUTHOR CONTRIBUTIONS

IG contributed to the analysis of genomes and GAA repeats, clustering, and writing the manuscript. MO contributed to the analysis of physicochemical properties and writing the manuscript. GF and AS wrote the manuscript.

#### FUNDING

This work was funded by the Russian Science Foundation grant 14-24-00159 "Systems research of minimal cell on a Mycoplasma gallisepticum model".

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00569/full#supplementary-material

TABLE S1 | Description of strains used in the study.

TABLE S2 | Data on vlhA genes, sequences of genes, their promoters and GAA repeats.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Orlov, Garanina, Fisunov and Sorokin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Highlights on the Application of Genomics and Bioinformatics in the Fight Against Infectious Diseases: Challenges and Opportunities in Africa

Saikou Y. Bah<sup>1,2</sup>\*, Collins Misita Morang'a<sup>1</sup>†, Jonas A. Kengne-Ouafo<sup>1</sup>†, Lucas Amenga–Etego<sup>1</sup> and Gordon A. Awandare<sup>1</sup>\*

#### Edited by:

Ancha Baranova, George Mason University, United States

#### Reviewed by:

David John Studholme, University of Exeter, United Kingdom Sandeep Kumar Dhanda, La Jolla Institute for Allergy and Immunology (LJI), United States

#### \*Correspondence:

Saikou Y. Bah sbah@ug.edu.gh; sabah@mrc.gm Gordon A. Awandare gawandare@ug.edu.gh

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 18 June 2018 Accepted: 08 November 2018 Published: 27 November 2018

#### Citation:

Bah SY, Morang'a CM, Kengne-Ouafo JA, Amenga–Etego L and Awandare GA (2018) Highlights on the Application of Genomics and Bioinformatics in the Fight Against Infectious Diseases: Challenges and Opportunities in Africa. Front. Genet. 9:575. doi: 10.3389/fgene.2018.00575

<sup>1</sup> West African Centre for Cell Biology of Infectious Pathogens, University of Ghana, Accra, Ghana, <sup>2</sup> Vaccine and Immunity Theme, MRC Unit The Gambia at London School of Hygiene & Tropical Medicine, Banjul, Gambia

Genomics and bioinformatics are increasingly contributing to our understanding of infectious diseases caused by bacterial pathogens such as Mycobacterium tuberculosis and parasites such as Plasmodium falciparum. This ranges from investigations of disease outbreaks and pathogenesis, host and pathogen genomic variation, and host immune evasion mechanisms to identification of potential diagnostic markers and vaccine targets. High throughput genomics data generated from pathogens and animal models can be combined with host genomics and patients' health records to give advice on treatment options as well as potential drug and vaccine interactions. However, despite accounting for the highest burden of infectious diseases, Africa has the lowest research output on infectious disease genomics. Here we review the contributions of genomics and bioinformatics to the management of infectious diseases of serious public health concern in Africa including tuberculosis (TB), dengue fever, malaria and filariasis. Furthermore, we discuss how genomics and bioinformatics can be applied to identify drug and vaccine targets. We conclude by identifying challenges to genomics research in Africa and highlighting how these can be overcome where possible.

Keywords: bioinformatics, genomics, infectious diseases, antimicrobial resistant, diagnosis

### INTRODUCTION: OMICS AND BIOINFORMATICS IN INFECTIOUS DISEASES

Genomics and bioinformatics have contributed immensely to our understanding of infectious diseases: from disease pathogenesis, mechanisms and the spread of antimicrobial resistance, to host immune responses. Herein, we review some of the major contributions of genomics and bioinformatics in infectious disease research using examples of three diseases that account for large proportions of morbidity and mortality as well as a neglected tropical disease. Specifically, we review M. tuberculosis, which causes TB, a disease responsible for approximately two million deaths globally per year. Dengue virus (DENV) causes Dengue fever, a re-emerging mosquito-borne viral disease responsible for more than 350 million cases annually (WHO, 2017; World Health Organization Western Pacific Region, 2018). Plasmodium falciparum causes malaria, a parasitic disease that accounts for the highest morbidity and mortality in Sub-Saharan Africa, especially in children under five and pregnant women (WHO, 2018b). Finally, we consider filariasis, which is a neglected tropical disease. **Figure 1** shows a circular wheel of genomics/bioinformatics as it can be applied in infectious diseases as discussed herein, ranging from understanding host and pathogen genome biology to genome-wide association studies (GWAS) as well as the identification of drug targets and drug resistance surveillance to patient management. This encompasses molecular techniques, bioinformatics and clinical applications (**Figure 1**). We also highlight the application of genomics and bioinformatics to the identification of vaccine targets and drug discovery. We conclude by highlighting some challenges of conducting bioinformatics research in resource-limited countries in sub-Saharan Africa.

#### OMICS OF TUBERCULOSIS PATHOGENS AND HOST RESPONSES

Tuberculosis, caused by members of the M. tuberculosis complex, is a leading cause of death, with about 9 million cases and two million deaths per year globally (WHO, 2018a). The mycobacterial genome was first sequenced in 1998 and many more M. tuberculosis genomes have since been sequenced (Cole et al., 1998; Guerra-Assunção et al., 2015; Yun et al., 2016). These genomes provide great avenues for the genomic characterization, development of improved diagnostic tools, drug susceptibility testing, and molecular epidemiology of circulating mycobacterial strains. Host-pathogen genomics and transcriptomics have over the past decade enhanced our understanding of human-mycobacterium interactions and supported the identification of potential diagnostic and prognostic markers (Anderson et al., 2014; Maertzdorf et al., 2015).

An understanding of the M. tuberculosis genome biology is invaluable in the control of TB. The M. tuberculosis genome is GC rich and consists of about 4000 genes and, unlike other bacteria, a large proportion of its genome encodes proteins and enzymes involved in lipogenesis and lipolysis (Cole et al., 1998), reflecting its thick lipid cell wall. TB control is hampered by antimycobacterial resistance, multidrug resistance (MDR) and, recently, extensively drug resistant (XDR) mycobacterial strains (Leisching et al., 2016). Genomics analysis has immensely contributed to the identification of drug resistance-conferring mutations and surveillance (Köser et al., 2013). Whole genome analyses have demonstrated that mycobacterial drug resistance is largely attributed to single nucleotide polymorphisms (SNPs); for example, rifampicin (RIF) resistance arises from mutations in the rpoB gene and mutations in the katG and inhA lead to isoniazid resistance (da Silva et al., 2011). Newly characterized genetic mutations in M. tuberculosis genomes have also been shown to play key roles in the emergence of antimycobacterial drug resistance (Sun et al., 2012). Analyses of 161 drug resistant M. tuberculosis genomes identified 72 genes, 28 intergenic regions and 21 SNPs with strong and consistent associations with drug resistance (Zhang et al., 2013). Genomic analysis has also identified lineage mutation rate differences and predicted the emergence of antimycobacterial resistance (Ford et al., 2013). A retrospective analysis of thousands of M. tuberculosis genomes collected from African and European patients identified 120 resistance-determining mutations for first and second line antimycobacterial drugs, which could be valuable in developing new assays for drug susceptibility testing (Walker et al., 2015). 
Furthermore, GWAS has been used to identify novel mutations associated with resistance to cycloserine, ethionamide, and para-aminosalicylic acid, suggesting the involvement of efflux pumps in the emergence of resistance (Coll et al., 2018). A number of genomics-based tools have been developed to detect drug resistance, including Mykrobe Predictor, PhyResSE, and TB-Profiler, which are easy to use by researchers with no bioinformatics expertise and can predict drug resistance within minutes after obtaining sequences (Bradley et al., 2015; Coll et al., 2015; Feuerriegel et al., 2015). Mykrobe Predictor has a sensitivity and specificity of 82.6 and 98.5%, respectively (Bradley et al., 2015). TB-Profiler was developed using a mutation library consisting of 1,325 mutations in different genes associated with resistance to 15 anti-tuberculosis drugs and had more than 75% sensitivity as well as more than 90% specificity for all drugs tested (Coll et al., 2015). A recent study evaluating the performance of these tools showed that their sensitivity ranges from 74 to 80%, with a specificity of more than 95% (van Beek et al., 2018). However, there is still a need for optimization of analysis pipelines to make them applicable in field settings where the disease burden is usually the highest.
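To make the sensitivity and specificity figures above concrete, the sketch below predicts resistance from a toy gene-level catalogue and scores the predictions against phenotypes. It is an illustration under stated assumptions, not the logic of Mykrobe Predictor, PhyResSE, or TB-Profiler: real tools match specific mutations rather than whole genes, the isolates are invented, and only the gene-drug pairs (rpoB and rifampicin; katG, inhA and isoniazid) come from the text.

```python
# Toy catalogue restricted to the gene-drug pairs named in the text.
CATALOGUE = {"rpoB": "rifampicin", "katG": "isoniazid", "inhA": "isoniazid"}

def predict_resistance(mutated_genes):
    """Drugs to which an isolate is predicted resistant (gene-level toy rule)."""
    return {CATALOGUE[gene] for gene in mutated_genes if gene in CATALOGUE}

def sensitivity_specificity(predicted, observed):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)
    fn = sum(1 for p, o in pairs if not p and o)
    tn = sum(1 for p, o in pairs if not p and not o)
    fp = sum(1 for p, o in pairs if p and not o)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical isolates: (genes carrying mutations, phenotypic rifampicin resistance).
ISOLATES = [
    ({"rpoB"}, True),
    ({"rpoB", "katG"}, True),
    (set(), True),        # resistant phenotype with no catalogued mutation
    (set(), False),
    ({"katG"}, False),
    ({"rpoB"}, False),    # catalogued gene mutated but phenotype susceptible
    (set(), False),
    (set(), False),
]

predicted = ["rifampicin" in predict_resistance(genes) for genes, _ in ISOLATES]
observed = [resistant for _, resistant in ISOLATES]
sens, spec = sensitivity_specificity(predicted, observed)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Missed resistance (false negatives) lowers sensitivity, which is one way to read the 74 to 80% range reported for the published tools.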

Genomics analysis has also been used to determine the evolutionary history and spread of mycobacterial strains such as the Beijing strain, demonstrating its spread from the Far East (Merker et al., 2015). An investigation of M. tuberculosis transmission dynamics is important in monitoring outbreaks; Mehaffy et al. (2014) demonstrated that whole genome analysis can be used to monitor infections to decipher transmission dynamics. Genomics has also been applied to decipher transmission dynamics of M. tuberculosis in Vietnam, suggesting that SNPs in the ESX-5 type VII secreted protein EsxW could potentially contribute to enhancing transmission (Holt et al., 2018). In addition, genomics has been applied to investigate TB outbreaks, to genotype the outbreak-associated lineages, and to follow their evolution during the outbreak (Jamieson et al., 2014; Stucki et al., 2015). Indeed, analysis tools have been developed for the prediction of M. tuberculosis spoligotypes from raw sequence reads which, in combination with other analysis tools, also determine antibiotic resistance as well as transmission dynamics (Coll et al., 2012; Bradley et al., 2015). Some genomics methods can also be employed to identify mixed infections as well as infections with a single strain and have recently been applied to clinical isolates from Malawi (Sobkowiak et al., 2018).
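One simple way whole genome data can inform transmission analysis is through pairwise SNP distances between isolates. The sketch below is a hypothetical illustration rather than the method of any study cited above: the isolate names, the toy aligned sequences and the distance threshold are all assumptions.

```python
from itertools import combinations

def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different bases.

    Positions with anything other than A, C, G or T (e.g. N or gaps) are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != b and a in "ACGT" and b in "ACGT"
    )

def putative_links(alignment, threshold):
    """Pairs of isolates whose SNP distance is at or below the threshold."""
    links = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(alignment.items()), 2):
        distance = snp_distance(seq_a, seq_b)
        if distance <= threshold:
            links.append((name_a, name_b, distance))
    return links

# Toy alignment of variable sites only (hypothetical isolates).
alignment = {
    "iso_A": "ACGTACGTAC",
    "iso_B": "ACGTACGTAT",
    "iso_C": "TCGAACCTAG",
}

print(putative_links(alignment, threshold=2))
```

A small distance is compatible with recent transmission but does not prove it; epidemiological data are still needed to interpret such links.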

Genome-wide association studies (GWAS) have also been used to identify candidate gene variants associated with susceptibility to active tuberculosis. GWAS analyses in African patients from Ghana, Gambia, Uganda and Tanzania identified TB disease-associated SNPs located on three chromosomal loci: 18q11, 11p13, and 5q33 (Thye et al., 2010, 2012; Sobota et al., 2016). Similarly, GWAS have also been done in Europe, identifying SNPs in the ASAP1 gene on chromosome 8q24 and in a genomic region in which class II human leucocyte antigen (HLA II) is encoded (Curtis et al., 2015; Sveinbjornsson et al., 2016). Recently, a GWAS in a Han Chinese population also found SNPs in mitofusin-2 (MFN2), regulator of G protein signaling 12 (RGS12) and HLA II beta chain to be associated with active TB (Qi et al., 2017). This highlights that host genetics play significant roles in susceptibility to active TB and may explain why some individuals remain latently infected while some develop active TB despite having similar exposure levels. Furthermore, based on host genetic variants, GWAS analysis could be applied to identify latently infected individuals who are at a high risk of developing active TB for preventative interventions. Once validated, identified SNPs can be used to develop point-of-care diagnostics to identify high-risk people for mass preventative treatment.

potential applications. This includes: (1) molecular techniques such as whole genome sequencing by methods like Illumina to generate sequence reads, which are needed for the (2) bioinformatics application to study host systemic responses, pathogen genomics, and transmission dynamics. Further, bioinformatics can be applied to determine genetic diversity, investigation of drug resistance mechanisms and surveillance, and the identification of vaccine targets in systems vaccinology. Finally (3), all this information can be integrated to define treatment guidelines and patient management.

Host transcriptomics are increasingly being used to understand systemic responses to infections and to identify diagnostic and prognostic markers. Mistry et al. (2007) were among the first to use microarray technology to study host systemic response to TB, identifying a nine-gene signature with potential for TB diagnosis. Jacobsen et al. (2007) applied microarray analysis to investigate the host pathway biology and potential diagnostic biomarkers. Analyzing peripheral blood mononuclear cells (PBMCs), they found a monocyte-derived gene expression signature identifying CD64, lactoferrin and Ras-Associated GTPase-33A as potential diagnostic biomarkers, which were further validated in another independent study population in South Africa (Maertzdorf et al., 2011). Applying gene set enrichment analysis to microarray gene expression identified metabolic pathways such as insulin metabolism, immune cell differentiation and inflammation in TB (Lesho et al., 2011). A neutrophil-driven interferon signature consisting of both type I and type II interferon during TB was also identified using microarray analysis (Berry et al., 2010). The type I interferon pathway was also observed by Ottenhoff et al. (2012), who identified IL15RA, UBE2L6, and GBP4 as the main molecules involved. A 393-transcript signature for active TB and an 86-transcript signature with a potential for distinguishing TB from other inflammatory diseases were also identified (Berry et al., 2010). In addition, a biosignature consisting of 27 transcripts to distinguish active from latent TB and one of 44 transcripts to distinguish active TB from other diseases were recently identified (Kaforou et al., 2013).

Microarrays have also been used to demonstrate that host transcriptional responses to M. africanum and M. tuberculosis differ following treatment (Tientcheu et al., 2015), which could be important in the management of patients infected with the different mycobacterial strains. Furthermore, host gene expression has also been used to monitor treatment responses and predict treatment outcome, which will be valuable in testing new drug regimens and new antimycobacterial drugs (Thompson et al., 2017). These studies demonstrate the potential of host genomics in providing a better understanding of disease pathophysiology, prognosis and host pathway biology in response to an infectious agent.

In addition, arrays have also been applied to childhood TB, to identify signatures for active tuberculosis and a signature that distinguishes active tuberculosis from other diseases in sub-Saharan Africa (Anderson et al., 2014). Similarly, a 9-gene signature was also identified in Warao Amerindian children, further highlighting the potential of using host biomarkers for TB diagnosis (Verhagen et al., 2013). Host transcriptional analysis is moving from array-based technologies to RNA sequencing, which has been used to derive a 16-gene signature that identified people at high risk of developing TB 2 years before diagnosis in sub-Saharan Africa (Zak et al., 2016). However, it is noteworthy that identified biosignatures have a variable number of genes, from about 10 to more than 100, and there is very little overlap between some signatures. It will be valuable to conduct a meta-analysis of available datasets to increase statistical power and identify high-confidence signatures across studies regardless of circulating pathogens and local environmental factors. In such an analysis, confounders due to technologies, age and circulating endemic pathogens can be accounted for to give a robust diagnostic and prognostic signature. These studies highlight the potential application of genomics and bioinformatics to interrogate host response for the diagnosis and prognosis of TB, which will contribute immensely to curbing TB morbidity and mortality.
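The limited overlap between published signatures can be quantified with a simple set-based measure such as the Jaccard index. The sketch below uses placeholder gene identifiers (`G1`, `G2`, ...) and hypothetical signature names, since the actual gene lists are not reproduced here.

```python
from itertools import combinations

def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Placeholder signatures; real ones would be read from the published gene lists.
signatures = {
    "signature_A": {"G1", "G2", "G3", "G4"},
    "signature_B": {"G3", "G4", "G5"},
    "signature_C": {"G6", "G7"},
}

for (name_a, genes_a), (name_b, genes_b) in combinations(sorted(signatures.items()), 2):
    shared = sorted(genes_a & genes_b)
    print(f"{name_a} vs {name_b}: Jaccard = {jaccard(genes_a, genes_b):.2f}, shared = {shared}")
```

A meta-analysis would go further than this overlap check by re-analyzing the underlying expression data with study-level confounders taken into account.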

#### DENGUE VIRUS RESEARCH IN THE ERA OF BIOINFORMATICS

Dengue virus (DENV) is a pathogenic single-stranded RNA virus that belongs to the flavivirus genus, which comprises other known pathogenic viruses such as West Nile, yellow fever, Japanese encephalitis, St. Louis encephalitis, tick-borne encephalitis, Omsk hemorrhagic fever and Zika virus (Gould and Solomon, 2008). The re-emergence, evolution, diversity and geographic distribution of flaviviruses make them interesting pathogens (Moureau et al., 2015). Phylogenetic analysis of divergence times suggests that flaviviruses originated from a common ancestor about 100,000 years ago and later split into mosquito-borne and tick-borne flaviviruses about 40,000 years ago (Holbrook, 2017). Approximately 40% of the world population is at risk of DENV infection, with more than 350 million cases reported annually.

Illumina SNP genotyping and SNPs identified through whole genome analysis have been used in case-control GWAS statistical analysis to identify SNPs that predispose to or confer protection against DENV infection (de Carvalho et al., 2017). In a GWAS analysis of SNPs in a cohort of 2008 pediatric cases, DENV shock syndrome (DSS) was shown to have a strong association (P < 0.5 × 10<sup>−8</sup>) with the human major histocompatibility complex (MHC) (rs3132468) on chromosome 6 and phospholipase C (rs3740360 and rs3765524) on chromosome 10 (Khor et al., 2011). Dang et al. replicated the study in 917 Thai children with DSS and confirmed that alleles rs3132468 [MHC class I chain-related protein B (MICB)] and rs3765524 [phospholipase C epsilon 1 (PLCE1)] predispose Southeast Asians to DSS (Dang et al., 2014). In contrast, Whitehorn et al. (2013) genotyped 3,961 confirmed cases and 5,968 controls and found that the rs3132468 (MICB) and rs3740360 (PLCE1) alleles were associated with less severe phenotypes of DENV infection in both infants and adults. This implies that the effect of these SNPs could be population-specific. Other candidate genes include dendritic cell-specific intracellular adhesion molecule (ICAM)-3 grabbing non-integrin (DC-SIGN), C-Type Lectin Domain Containing 5A (CLEC5A), immunoglobulin gamma constant fragment receptor (FCGRIIA), Toll-like receptors (TLRs), tumor necrosis factor (TNF), interferons (IFNs), 2′-5′-oligoadenylate synthases (OASs), Janus kinase (JAK), Stimulator of Interferon Genes (STING), cytokines, chemokines, ICAM-1 and tryptase 1 proteases (de Carvalho et al., 2017).
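The case-control association tests behind such GWAS results can be sketched for a single SNP as a chi-square test on a 2×2 table of allele counts. The counts below are invented and the function name is hypothetical; the significance threshold of 5 × 10<sup>−8</sup> is the commonly used genome-wide convention and is an assumption here, not a value taken from the cited studies.

```python
import math

def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele count table.

    Returns (chi-square statistic, p-value, odds ratio of the alternative allele).
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    cross = case_alt * ctrl_ref - case_ref * ctrl_alt
    margins = (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    chi2 = n * cross ** 2 / margins
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio

# Hypothetical allele counts for one SNP (two alleles per genotyped individual).
chi2, p, odds = allelic_association(case_alt=600, case_ref=1400, ctrl_alt=400, ctrl_ref=1600)

GENOME_WIDE_THRESHOLD = 5e-8  # assumed conventional threshold
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, OR = {odds:.2f}")
print("genome-wide significant:", p < GENOME_WIDE_THRESHOLD)
```

An odds ratio above 1 marks a risk allele and one below 1 a protective allele. Opposite directions in different cohorts are one way population-specific effects of the kind discussed above would appear.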

Whole genome sequencing (WGS) and phylogenetic methods have been used to investigate DENV outbreaks. Faria et al. (2017), analyzing 92 viral genomes from DENV patients during the 2012 outbreak in Rio de Janeiro, found that at least two thirds of infections went unnoticed, and their analysis highlighted the scale of the epidemic spread of DENV after the outbreak. Ahn et al. (2015) investigated the genetic variations in 8,826 whole-genome DENV nucleotide sequences and demonstrated that there was a distinctive genetic pattern between the four DENV subtypes across different regions (the Americas, Oceania, Asia, and Africa).

Analyses of envelope-encoding nucleotide sequences from India have shown a shift from DENV subtype III to subtype IV, suggesting some level of positive selection (Manakkadan et al., 2013). These phylodynamic methods, which indicate the evolutionary process or patterns of genetic diversity of DENV, have also been reconciled with the virus epidemiology so as to decrease the variation between the two methods that are mainly used to study the population dynamics or viral behaviors (Pybus et al., 2012; Rasmussen et al., 2014). Due to the importance of genomics and bioinformatics in viral research, a range of tools has been developed to analyze viral genomes and make inferences (Stamatakis, 2014; Brody et al., 2017).

The use of RNA folding, structural predictions and functional studies has shown that genetic variation of DENV occurs in nature due to high rates of recombination and error-prone RNA polymerases. A deleterious DENV genome was first shown by Aaskov et al. (2006), whereby a stop codon in the envelope coding region resulted in a defective DENV. Li et al. (2011) also discovered defective interfering viral particles by analyzing short fragments of DENV, suggesting that they may be part of a broader disease-attenuating process mediated by the deleterious virus. The defective interfering particles are also important in viral replication, thereby enhancing the overall transmission capability of DENV (Li and Aaskov, 2014). Structural RNA predictions have implicated other elements in modulating replication of the virus, such as the downstream cyclization sequence (Friebe et al., 2012), cis-acting elements occurring in the capsid coding region (de Borba et al., 2015), and elements in the promoter Stem Loop A (SLA) and non-structural protein 5 (NS5) regions (Gebhard et al., 2011).

Understanding intra- and inter-host genetic diversity was previously hampered by experimental and analytical methods that did not fully account for errors in viral amplifications. Thai et al. (2012) used various statistical approaches to correct for the artefactual mutations resulting from PCR amplifications and Sanger sequencing, and showed that the genetic diversity index (Pi) of DENV was low, ranging from 0 to 0.0013. This suggested sequence conservation, but they were able to show mixed infections and phylogenetically distinct DENV lineages present within the same host. Furthermore, genome-wide scans for patterns of intra-host diversity in DENV identified variants between genes, suggesting significant differences in intra-host diversity of the virus in the Nicaraguan population (Parameswaran et al., 2012). Functional annotation of the variants showed the impact of viral mutations on protein function, which strongly suggested purifying selection across transmission events.
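The diversity index Pi mentioned above is commonly computed as the average number of pairwise differences per site among aligned sequences. The following is a minimal sketch with invented toy sequences; it does not reproduce the error-correction steps used by Thai et al. (2012).

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Average number of pairwise nucleotide differences per site (pi)."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(1 for a, b in zip(seq_a, seq_b) if a != b)
        for seq_a, seq_b in pairs
    )
    return total_differences / (len(pairs) * length)

# Toy intra-host sample: three identical clones and one single-site variant.
clones = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAT",
]

print(f"pi = {nucleotide_diversity(clones):.3f}")
```

Uncorrected PCR or sequencing errors would add spurious differences and inflate this estimate, which is why error correction matters when Pi is as low as the values reported.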

Deep sequencing, RNA structural analysis and fitness evaluation have been used to determine processes that DENV employs for host specialization (mosquito or human) using RNA elements in the 3′-UTR (Villordo et al., 2015). A host adaptable stem loop structure was found to be duplicated, which DENV uses to accumulate mutations that are beneficial in one host and deleterious in another host, but the duplication confers a robust mechanism during host switching (Villordo et al., 2015). Recently, Waman et al. (2016) used population genetics methods to compute the genotype diversity and evolution of 990 DENV genomes, and revealed that the DENV-2 population is subdivided into 15 lineages. Their study also indicated the presence of intragenotype diversity and that the population structure of DENV-2 is spatiotemporal, shaped by episodic positive selection and viral recombination (Waman et al., 2016). The application of genomics and bioinformatics in the study of DENV shows the complexity of the virus biology, which can be exploited in target identification for drug discovery and vaccine development (Guy et al., 2016; Low et al., 2017).

#### PROGRESS IN MALARIA GENOMICS

Malaria incidence and mortality rates decreased by 21 and 29%, respectively, between 2010 and 2015 (WHO, 2018b). The genetic landscape of P. falciparum, the main cause of malaria, is increasingly being unraveled by using deep sequencing to identify polymorphisms and structural and copy number variations, which are fundamental for parasite evolution (Kwiatkowski, 2015). Sequencing consortia such as MalariaGEN improve our understanding of the genomics of both the Anopheles vector and the Plasmodium species<sup>1</sup>. A recent study on genotyping accuracy using deep sequencing of Plasmodium parental generations and their progenies revealed that polymorphism frequencies can be used as markers of high recombination rates (Miles et al., 2016), which are an important contributor to enhancing immune evasion and drug resistance. Using whole genome deep sequencing and microarray analysis, a study observed 18 deletions in regions encoding multigene families that are associated with immune evasion (Bopp et al., 2013). The authors showed the presence of chromosomal crossovers in six of the deletions and were able to estimate mutation rates of P. falciparum (Bopp et al., 2013).

Bioinformatics has contributed to our understanding of the mechanisms of resistance to previous drugs such as chloroquine and of the emerging resistance to artemisinin-based combination therapies (ACT). Robinson et al. deployed next generation sequencing to investigate multi-clonality, population genetics and drug-resistant genotypes (Robinson et al., 2011). More recently, WGS was used to discover that mutations in the Kelch propeller domain (K-13) are associated with ACT resistance in Cambodia (Ariey et al., 2014; Straimer et al., 2015). Profiling of the drug resistance genes P. falciparum chloroquine resistance transporter (pfcrt), P. falciparum multidrug resistance (pfmdr1), P. falciparum dihydrofolate reductase (dhfr), P. falciparum dihydropteroate synthetase (dhps) and P. falciparum Kelch protein 13 (pfk13) using Illumina next generation sequencing demonstrated that the resistance-associated K-13 variants were largely absent in Africa (MalariaGEN Plasmodium falciparum Community Project, 2016; Nag et al., 2017).

Furthermore, bioinformatics tools have been used to demonstrate multi-locus linkage disequilibrium and local diversity, recent selection through integrated haplotype scores, regional gene flow and allele frequency differentiation (Duffy et al., 2017). Intra-host diversity can now be statistically characterized using the Fws metric because sequencing platforms are able to generate read count data. Auburn et al. characterized within-host diversity in 64 samples from West Africa, capturing multiplicity of infection, the number and ratio of clones, clonal variation and within-host diversity (Auburn et al., 2012). Bioinformatics analysis of deep sequencing data revealed large-scale genetic variation in P. falciparum (86,158 SNPs), as well as genome-wide allele frequencies, population structure, linkage disequilibrium and intra-host diversity (Manske et al., 2012). The genetic diversity of P. falciparum is dependent on directional and balancing selection, whereby drug pressure and host immunity are the major selective agents, respectively (Mobegi et al., 2014; Duffy et al., 2015).
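To make the Fws idea concrete, the following is a minimal Python sketch, assuming the commonly used definition Fws = 1 − Hw/Hs, where Hw is the within-host heterozygosity estimated from read counts and Hs is the heterozygosity of the local parasite population. Published analyses typically bin sites by allele frequency before estimating the ratio; the ratio of sums used here is a simplification, and the function names and input layout are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fws(sample_counts, population_freqs):
    """Simplified Fws = 1 - sum(Hw) / sum(Hs) over biallelic SNPs.

    sample_counts: list of (ref_reads, alt_reads) for one infection.
    population_freqs: reference-allele frequency of each SNP in the population.
    Both inputs are illustrative assumptions, not a published data format.
    """
    hw_total = 0.0
    hs_total = 0.0
    for (ref, alt), p_pop in zip(sample_counts, population_freqs):
        depth = ref + alt
        if depth == 0:
            continue  # skip sites without coverage in this sample
        hw_total += heterozygosity(ref / depth)
        hs_total += heterozygosity(p_pop)
    if hs_total == 0:
        raise ValueError("population heterozygosity is zero at all covered sites")
    return 1.0 - hw_total / hs_total


# A clonal infection carries a single allele at every site.
clonal = fws([(50, 0), (0, 40), (30, 0)], [0.5, 0.5, 0.7])
# A mixed infection mirrors the population allele frequencies.
mixed = fws([(25, 25), (20, 20), (15, 15)], [0.5, 0.5, 0.5])
```

Under this definition, values near 1 suggest a single dominant clone, whereas values near 0 suggest that the infection is as diverse as the surrounding parasite population.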

Genomics has been used to discover novel malaria resistance loci in humans, which provide 33% protection from severe malaria (Malaria Genomic Epidemiology Network, 2015). In Ghana, GWAS identified two previously unknown genetic loci associated with severe malaria: 1q32 within the ATPase Plasma Membrane Ca<sup>2+</sup> Transporting 4 (ATP2B4) gene and 16q22.2, linked to a tight junction protein known as MARVELD3 (Timmann et al., 2012). Most recently, GWAS was used in longitudinal surveillance to detect K-13 signatures, which led to the identification of a Kelch variant that is suggested to be a potential modulator of artemisinin resistance (Cerqueira et al., 2017).

<sup>1</sup>https://www.malariagen.net/

The pathophysiology of Plasmodium is increasingly being explored using transcriptomics and proteomics. Bioinformatics and statistical models have been used to describe the genome-wide translational dynamics of P. falciparum, showing that parasite transcription and translation are tightly coupled and presenting a broad, high-resolution view of parasite gene expression profiles (Caro et al., 2014). ChIP-Seq and RNA sequencing have been used for polysome profiling to understand the regulation of Plasmodium gene expression in humans. Bunnik et al. (2013) observed a delay in peak polysomal transcript abundance for several genes as compared to the mRNA fraction, which they reported to be alternative polysomal mRNA splicing events of non-coding transcripts.

DNA microarray technologies have been used to describe the gene expression patterns of P. falciparum during the intraerythrocytic stage (Bozdech et al., 2003), gametocyte (Young et al., 2005), sporozoite (Siau et al., 2008) and liver stages (Tarun et al., 2008), and even between three different strains (Llinás et al., 2006). Recently, microarrays have been used to characterize parasite transcriptomes during cerebral and asymptomatic malaria, which revealed some differentially expressed genes encoding proteins involved in protein trafficking, Maurer's cleft proteins, transcription factors and several hypothetical proteins (Almelli et al., 2014). RNA sequencing has also been used to describe P. falciparum expression profiles at different time points and has found novel gene transcripts, alternative splicing events and predicted untranslated regions of some genes, providing further information on the parasite biology (Otto et al., 2010). Yamagishi et al. (2014) simultaneously analyzed the human host and the parasite transcriptomes using RNA sequencing, and showed that several human and parasite genes such as Toll-like receptor 2 and TIR domain-containing adapter molecule 2 (TICAM2) correlated with clinical symptoms. RNA sequencing has also been employed to study the transcriptome of P. vivax, which revealed a hotspot of vir genes on chromosome 2, new gene transcripts and the presence of species-specific genes (Zhu et al., 2016). It would be valuable to compare these data with similar data from other related Plasmodium species to identify species-specific transcriptomes. Analysis of the transcriptomes of chloroquine-sensitive and chloroquine-resistant parasites identified 89 upregulated genes and 227 downregulated genes that were associated with resistance (Antony et al., 2016). These differentially expressed genes are involved in immune evasion mechanisms, pathogenesis, and various host-parasite interactions and could be targeted for drug and vaccine development.

Currently, single-cell RNA sequencing is revolutionizing the study of cell-to-cell heterogeneity. For example, the use of this method led to the discovery of novel variations in the expression of specific gene families that are involved in host-parasite interactions among asexual populations (Reid et al., 2018). Altogether, these studies demonstrate the profound impact of malaria parasite transcriptomics and genomics on our understanding of the parasite (Lee et al., 2017), and identify possible candidate targets for drugs, vaccines and diagnostics (Ludin et al., 2012; Hoo et al., 2016).

#### GENOMICS RESEARCH IN FILARIASIS

Filariasis is a neglected chronic disease caused by tissue-dwelling nematodes (filariae). Its two main forms, onchocerciasis and lymphatic filariasis (LF), cause significant health concerns, with a cumulative disease burden approaching 86 million (WHO/Department of Control of Neglected Tropical Diseases, 2016). Onchocerciasis is caused by Onchocerca volvulus while LF is caused by three different parasites, namely Wuchereria bancrofti, Brugia malayi, and Brugia timori (Taylor et al., 2010). Elimination of filariasis is challenging because of the unavailability of sensitive diagnostic tools, lack of appropriate treatments and inadequate control measures in resource-limited countries.

The W. bancrofti and O. volvulus genomes have been sequenced, providing opportunities for further genomic analyses (Desjardins et al., 2013; Cotton et al., 2016). Bioinformatics revealed the presence of genes coding for host immune system regulators such as human-like autoantigens as well as serine and cysteine protease inhibitors (Molehin et al., 2012; Cotton et al., 2016).

Molecular studies coupled with computational analyses have demonstrated an association between human host factors and filariasis clinical manifestations. LF infections have been shown to cluster in some families using pedigree studies (Cuenco et al., 2004; Chesnais et al., 2016). These studies show that genetic factors are involved in the regulation of LF infections and affect both the presence and intensity of microfilariae. However, a GWAS would be more comprehensive to demonstrate this genetic susceptibility to LF as has been the case for a tropical lymphedema (Podoconiosis) of non-filarial origin (Tekola Ayele et al., 2012). It is worth mentioning that lymphedema, or elephantiasis, is one of the main features of LF and normally occurs as a result of a compromised lymphatic system (Addiss, 2010). As opposed to LF, which is infectious, Podoconiosis is a non-communicable disease caused by soil particles such as aluminum and silica predominant in volcanic regions (Price, 1976; Davey et al., 2007). A comparative genomics-based study of LF would help to better understand these clinical manifestations.

Most of the pathological features of LF are associated with human immunogenetics (Taylor, 2003; Junpee et al., 2010), which has been investigated using genomics and bioinformatics. Candidate gene-based genomics studies carried out in Thailand revealed that polymorphisms in the TLR-2 gene (−196 to −173 deletion, +597 T > C and +1350 T > C) are in strong linkage disequilibrium and were associated with increased risk of asymptomatic LF (Junpee et al., 2010). In a functional study, individuals with the −196 to −173 deletion were found to have significantly lower transcription levels than those with the wild-type gene (Junpee et al., 2010). Further analyses showed a strong association of a mutation (M196A) in human tumor necrosis factor (TNF) receptor-II with hydrocele development, while the A288S mutation of endothelin-1 (ET-1) correlated with low ET-1 plasma levels and elephantiasis (Panda et al., 2011).
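As an illustration of how linkage disequilibrium between two polymorphisms can be quantified, the sketch below computes D, |D′| and r² from phased haplotype counts at two biallelic loci. It assumes that haplotypes have already been phased; the counts used are invented for illustration and are not taken from the studies cited above.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at locus 2
    d = n_AB / n - p_A * p_B         # deviation from independent assortment
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("at least one locus is monomorphic")
    r2 = d * d / denom
    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime, r2


# Hypothetical counts: two alleles that always travel together.
complete = linkage_disequilibrium(50, 0, 0, 50)
# Hypothetical counts: alleles assorting independently.
independent = linkage_disequilibrium(25, 25, 25, 25)
```

With unphased genotype data, haplotype frequencies would first have to be estimated (for example with an expectation-maximization procedure), which is outside the scope of this sketch.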

Population genetics is very important for assessing and understanding the epidemiology and transmission dynamics of filarial diseases (Small et al., 2016; Doyle et al., 2017). Population genomics of O. volvulus samples collected from different geographical zones – West Africa (WA), Uganda and Ecuador – demonstrated some level of population structure between WA and other populations (Choi et al., 2016). Furthermore, phylogenetic signals indicative of gene flow and genetic admixture between WA forest and savanna populations were identified. These signals could serve as markers to delineate forest from savanna populations and/or sort out admixed populations (Choi et al., 2016). A study using both nuclear and mitochondrial sequences identified regions in the W. bancrofti genome that exhibited an arrangement which was consistent with both balancing and directional selection (Small et al., 2016).
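Population structure of the kind described above is often summarized with the fixation index F<sub>ST</sub>. The sketch below is a minimal illustration for one biallelic locus, assuming Wright's definition F<sub>ST</sub> = (H<sub>T</sub> − H<sub>S</sub>)/H<sub>T</sub> with equally weighted populations. The statistics actually used in the cited studies are not specified here, and the allele frequencies are invented.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(pop_freqs):
    """Wright's F_ST for one biallelic locus across equally weighted populations.

    pop_freqs: allele frequency of the same allele in each population.
    """
    if len(pop_freqs) < 2:
        raise ValueError("need at least two populations")
    # H_S: mean heterozygosity within populations.
    h_s = sum(heterozygosity(p) for p in pop_freqs) / len(pop_freqs)
    # H_T: heterozygosity of the pooled population.
    p_bar = sum(pop_freqs) / len(pop_freqs)
    h_t = heterozygosity(p_bar)
    if h_t == 0:
        return 0.0  # locus is fixed for the same allele everywhere
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in two populations.
no_structure = fst([0.5, 0.5])
moderate_structure = fst([0.8, 0.2])
fixed_difference = fst([1.0, 0.0])
```

Genome-wide estimates usually combine many loci and correct for sample size; this single-locus version only shows the underlying contrast between within-population and total heterozygosity.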

The control of filariasis in general is difficult due to the complex parasite life cycle. In an attempt to demystify the complex life cycle of the parasite, RNA sequencing has been used to investigate gene expression profiles of different developmental stages of Brugia malayi (Choi et al., 2011). Transcriptomics analyses revealed stage-specific gene expression correlating with stage-specific pathway activation. Upregulated proteins included cathepsin L and Z-like cysteine proteases that were previously demonstrated to be essential for larval molting in O. volvulus (Lustigman et al., 2004) and cuticle and eggshell remodeling in filarial nematodes in general (Guiliano et al., 2004). Another study using a filarial microarray chip composed of 18,104 gene probes revealed that gene expression in B. malayi infective larvae (L3s) depends on environmental factors (Li et al., 2009). The gene expression patterns in irradiated L3s, laboratory-adapted L3s and those collected from mosquitoes were found to be different. Gene Ontology analyses showed that upregulated genes in laboratory-adapted and mosquito-derived L3s were mostly involved in growth and invasion, whereas those in irradiated L3s were enriched with immunogenic proteins and proteins involved in radiation repair (Li et al., 2009). Such high throughput genomics analysis is important for understanding the biology/development, invasion, and immune evasion mechanisms of the parasite and could help improve disease control measures (Choi et al., 2011).

Mass drug treatment with ivermectin (IVM, Mectizan<sup>®</sup>) and albendazole is the main strategy for filariasis control in Africa and has been going on for decades (Amazigo, 2008). However, cases of drug resistance have been reported and genomic methods are increasingly being used to investigate mechanisms of resistance. Genotyping and sequencing studies have shown an association between SNPs in some O. volvulus genes (P-glycoprotein-like protein, β-tubulin) and the development of resistance (Nana-Djeunga et al., 2012; Osei-Atweneboana et al., 2012). P-glycoprotein was recently demonstrated to be associated with resistance to IVM in equine nematodes (cyathostomins): transcript levels measured by RNA-Seq and confirmed by RT-qPCR were significantly higher in the resistant than in the sensitive worm population (Peachey et al., 2017). Moreover, GWAS demonstrated that reduced sensitivity of O. volvulus to IVM is accounted for by genetic drift and soft selective sweeps. Pooled next generation sequencing of O. volvulus worms collected from Ghana and Cameroon, repeatedly treated with IVM and phenotypically characterized as poor responder (PR) or good responder (GR) parasites, identified genetic variants that considerably delineate GR and PR parasites. One of these variants (SNP, OM1b\_7179218) was common in both Cameroon and Ghana worm populations, whereas the others were country-specific (Nana-Djeunga et al., 2014; Doyle et al., 2017). These variants were found to be grouped in quantitative trait loci (QTLs) in which published genes associated with IVM resistance were scarcely found. Gene Ontology<sup>2</sup> analysis revealed that genes found in those QTL regions were linked to pathways involved in neurotransmission, development, and stress responses (Harris et al., 2004; Doyle et al., 2017). The involvement of neurotransmission is a promising finding here because one of the main targets of IVM is a ligand-gated channel at neuromuscular junctions (Cully et al., 1994).

The molecular mechanism of Ivermectin is not clearly understood and has been investigated using bioinformatics approaches. RNA-Seq analyses of ivermectin-challenged B. malayi adult female worms revealed that genes involved in cell division (meiosis) and oxidative phosphorylation were drastically downregulated as early as 24 h post-exposure (Ballesteros et al., 2016). A similar study in which the worms were instead challenged with flubendazole (FLBZ), a potential macrofilaricide, demonstrated the effect of FLBZ on embryogenesis and cuticle integrity (O'Neill et al., 2016a). Expression of cuticle-related genes and those involved in mitosis or meiosis were notably affected by the treatment. These studies further elucidate the drug-induced inhibition of embryogenesis and microfilarial release from the female worm uterus during larval development as previously demonstrated (O'Neill et al., 2015, 2016b). Knowledge of this mechanism could help in drug repurposing whereby drugs known to have a similar mode of action or mechanism, but are used for the treatment of other parasitic diseases, could be tested for their efficacy on filarial parasites.

#### APPLICATION OF OMICS TO VACCINE TARGET IDENTIFICATION AND DRUG DISCOVERY

The availability of whole genome sequences of both the host and pathogens in different databases such as GenBank<sup>3</sup> (Benson et al., 2004), EuPathDB<sup>4</sup> (formerly ApiDB), WormBase<sup>5</sup> and the Virus Pathogen Database and Analysis Resource (ViPR) has led to tremendous advances in the search for new drug and vaccine targets (Yan et al., 2015; Xia, 2017). This enables high throughput in silico screening for the identification of vaccine and drug targets, thus focusing expensive laboratory screening on selected high affinity targets.

<sup>2</sup>http://geneontology.org/

<sup>3</sup>http://www.ncbi.nlm.nih.gov

<sup>4</sup>http://EuPathDB.org

<sup>5</sup>http://www.wormbase.org

Though not yet fully implemented in Africa, omics technologies and bioinformatics analyses have aided significantly in the generation of new knowledge toward drug and vaccine target discovery (Yan et al., 2015; Xia, 2017). Genomic, transcriptomic and proteomic analyses of pathogens such as filarial parasites have identified new potential biomarkers that can be invaluable in diagnostics, vaccine and drug development (Armstrong et al., 2016; Bennuru et al., 2017). Kumar et al. (2007), using genome-wide C. elegans RNA-interference data as a proxy, identified a set of 3,059 essential genes in the B. malayi genome, of which 589 were characterized as potential drug targets. Their prioritization algorithm helps to predict the efficacy, selectivity and tractability of each target.
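The general shape of such an in silico prioritization can be sketched as a filter followed by a ranking. The Python example below is purely illustrative: the criteria (essentiality, dissimilarity to human proteins as a proxy for selectivity, and a druggability score as a proxy for tractability), the threshold, the weights and the gene identifiers are all assumptions, not the published method of Kumar et al. (2007).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    gene_id: str            # hypothetical identifier
    essential: bool         # e.g. lethal RNAi phenotype in the C. elegans ortholog
    human_identity: float   # identity to the closest human protein, 0 to 1
    druggability: float     # tractability score, 0 to 1


def prioritize(candidates, max_human_identity=0.4,
               w_selectivity=0.5, w_druggability=0.5):
    """Keep essential, parasite-selective genes and rank them by a weighted score."""
    ranked = []
    for c in candidates:
        # Filter: non-essential genes and close human homologs are dropped.
        if not c.essential or c.human_identity > max_human_identity:
            continue
        score = (w_selectivity * (1.0 - c.human_identity)
                 + w_druggability * c.druggability)
        ranked.append((score, c.gene_id))
    # Highest score first; ties are broken alphabetically by gene identifier.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [(gene_id, round(score, 3)) for score, gene_id in ranked]


candidates = [
    Candidate("gene_a", essential=True, human_identity=0.1, druggability=0.8),
    Candidate("gene_b", essential=True, human_identity=0.9, druggability=0.9),
    Candidate("gene_c", essential=False, human_identity=0.1, druggability=0.9),
    Candidate("gene_d", essential=True, human_identity=0.3, druggability=0.9),
]
shortlist = prioritize(candidates)
```

In this toy input, `gene_b` is excluded as a likely human homolog and `gene_c` as non-essential, which leaves a short list for laboratory follow-up.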

Phylogenomic analyses across Plasmodium spp. and comparative genomic studies in humans have led to the identification of new drug targets in P. falciparum. Identification of essential genes (targets) responsive to specific inhibitors led to the discovery of 40 potential drug targets, which include known ones such as calcium dependent protein kinase and previously unknown ones such as phosphoisomerase and carboxylase (Ludin et al., 2012). Comparing the transcriptomes of six Plasmodium spp. during blood stage infection revealed about 800 genes that have similar expression patterns across species, among which 240 were predicted to be druggable by online drug target prioritization databases (Hoo et al., 2016). Similarly, genomic and transcriptomic analyses have been carried out with other pathogens with encouraging results in fungi (Kaltdorf et al., 2016), bacteria (Turab Naqvi et al., 2017), and viruses (Dapat and Oshitani, 2016).

In vaccine target identification, pathogen genomes are being scanned in a bid to identify genes encoding proteins or molecules with vaccine candidate properties such as low antigenic variation, low polymorphism, and high immunogenicity (Masignani et al., 2002; De Groot et al., 2008). Despite the success of whole-organism vaccines such as those for polio, whole-organism vaccines for pathogens such as Plasmodium spp., Mycobacterium spp. and HIV remain a challenge (Doolan et al., 2014; Proietti and Doolan, 2015). Genomics offers a potential way around this challenge through the discovery of immunogenic antigens using whole-genome scans (Doolan et al., 2014; Proietti and Doolan, 2015). Here, omics techniques and bioinformatics tools are used to determine genes or proteins that are involved in the virulence of the pathogen and pathogenesis of the disease by comparing, for example, attenuated and pathogenic disease agents. Algorithms can be used to predict T cell epitopes, or regions with high affinity for HLA molecules, in translated peptides found in databases (Grubaugh et al., 2013; Davies et al., 2015) in order to inform the choice of the right antigens for vaccine design. Omics technologies have been reviewed in the context of vaccine target identification by He (2012).

Most of the tools used for epitope identification rely on statistics and machine learning. They include servers that predict MHC-binding peptides, namely RANKPEP (Reche et al., 2004), which uses Position Specific Scoring Matrices (PSSMs), and nHLAPred<sup>6</sup> (Bhasin and Raghava, 2007), which is based on Artificial Neural Networks (ANNs) and quantitative matrices, among others. Some servers are specific for B-cell epitope prediction, such as Bcepred<sup>7</sup> (Saha and Raghava, 2004), ABCpred<sup>8</sup> (Saha and Raghava, 2006), and BepiPred<sup>9</sup> (Jespersen et al., 2017). These tools work based on the physicochemical properties and location of the peptides. They function alongside epitope-containing databases such as Swiss-Prot, SYFPEITHI, and IEDB (Fleri et al., 2017). The list of tools, methods and databases mentioned here is not exhaustive; they have been extensively reviewed elsewhere (Soria-Guerra et al., 2015).
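The PSSM principle behind matrix-based predictors can be illustrated with a short Python sketch. It builds a log-odds matrix from a set of aligned peptides of equal length and then scores every window of a protein sequence. The training peptides, the protein, the uniform background frequencies and the pseudocount are all assumptions made for illustration; this is not the profile construction used by RANKPEP or any other server named above.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(peptides, pseudocount=1.0):
    """Build a log2-odds PSSM from aligned, equal-length peptides."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (an assumption)
    pssm = []
    for pos in range(length):
        # Pseudocounts keep unseen residues from producing log(0).
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for peptide in peptides:
            counts[peptide[pos]] += 1.0
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_windows(protein, pssm):
    """Score every window of the protein; return (score, start, window), best first."""
    width = len(pssm)
    hits = []
    for start in range(len(protein) - width + 1):
        window = protein[start:start + width]
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        hits.append((score, start, window))
    return sorted(hits, reverse=True)


# Hypothetical binders and protein; only the 20 standard residues are supported.
binders = ["ALAKV", "ALGKV", "ALAKI"]
pssm = build_pssm(binders)
best_score, best_start, best_window = score_windows("MKTALAKVGG", pssm)[0]
```

Windows that resemble the training peptides receive positive scores, and in practice a score threshold calibrated on known binders would decide which windows are reported as candidate epitopes.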

Owing to advances in computer science, genomics, proteomics, bioinformatics and the management of patients' health records, there seems to be a paradigm shift from generalized medicine to personalized therapy (Sorber et al., 2017). For example, many drugs are metabolized by cytochrome P450 enzymes, with drug action depending on the expressed gene variant (BlueCross and BlueShield Association, 2004; Daly et al., 2006). Moreover, malaria patients with glucose-6-phosphate dehydrogenase (G6PD) deficiency have been reported with severe complications such as cardiotoxicity and acute hemolytic anemia following treatment with quinidine gluconate (Damhoff et al., 2014). These complications have been described as a consequence of inherited (X-linked trait) mutations in the G6PD gene (Luzzatto and Seneca, 2014). These mutations do not cause the complete loss of the G6PD enzyme but instead affect its stability and level in red blood cells (Luzatto et al., 2001). Along the same lines, rifampicin (RIF), which is the drug of choice for TB treatment, is transported after administration by a human anion transporter encoded by the SLCO1B1 gene. Studies have shown that mutations in the SLCO1B1 gene, namely rs11045819 and rs4149032, are associated with decreased RIF plasma levels in South African populations (Weiner et al., 2010; Chigutsa et al., 2011; Gengiah et al., 2014). However, this finding could not be replicated in Malawian and South Indian populations, implying that the effect could be population-specific (Ramesh et al., 2016; Sloan et al., 2017). In a nutshell, these examples show the implications of genomics and bioinformatics for drug discovery and precision therapy (Hamburg and Collins, 2010; Rabbani et al., 2016).

#### CHALLENGES AND OPPORTUNITIES IN CONDUCTING OMICS AND BIOINFORMATICS STUDIES IN AFRICA

Bioinformatics is increasingly becoming an important cornerstone in contemporary research on infectious diseases (Mulder et al., 2017), where Africa has the highest morbidity and mortality but less genomics research output compared to other regions of the world (Fatumo et al., 2014; Karikari, 2015). This slow pace of genomics research output is due to several challenges in omics and bioinformatics research facilities in Africa; three of the major ones are briefly discussed.

<sup>6</sup>http://www.imtech.res.in/raghava/nhlapred/

<sup>7</sup>http://www.imtech.res.in/raghava/bcepred/

<sup>8</sup>http://www.imtech.res.in/raghava/abcpred/

<sup>9</sup>http://www.cbs.dtu.dk/services/BepiPred/

#### Inadequate Infrastructure

Bioinformatics and genomics analyses require powerful computers and a reliable source of electricity for large data storage and high throughput analyses (H3Africa Consortium et al., 2014). With the exception of some South African universities, most sub-Saharan African universities lack high-performance computing facilities (Karikari et al., 2015; Mulder et al., 2016). Access to high-speed internet for sharing data and accessing bioinformatics databases and repositories is also limited (Fatumo et al., 2014; Karikari, 2015). This hinders the use of cloud-based web services, which could otherwise circumvent the need for local high-performance computing facilities (Navale and Bourne, 2018). Furthermore, few research institutions in Africa have sequencing facilities, and most therefore resort to sequencing abroad through collaborations. Such collaborations often result in a loss of ownership of the data, and the resulting publications usually have the external collaborators as lead and corresponding authors. Notable efforts being made to bridge this infrastructural gap include the installation of high-performance computers (HPCs) at The Developing Excellence in Leadership and Genetics Training for Malaria Elimination in sub-Saharan Africa (DELGEME) at the University of Science Technique and Technologies of Bamako, Mali, the West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Ghana and the Medical Research Council Unit, The Gambia at the London School of Hygiene and Tropical Medicine, to support storage and high throughput analyses of genomic data. These HPC facilities are complemented by NGS facilities at WACCBIP and MRC in addition to some institutions in East Africa such as the International Livestock Research Institute (ILRI-Kenya). This infrastructural development, and pressure from initiatives such as Human Heredity and Health in Africa (H3Africa), will hopefully serve as a springboard for Africa to increase her involvement in the study design, sample collection, analysis and ownership of data rather than just collecting samples for international collaborators.

#### Lack of Training Opportunities and Well-Structured Bioinformatics Courses

Until the recent introduction of bioinformatics training courses by H3ABioNet, there were limited bioinformatics training courses in Africa. Such training programs were mostly short courses organized by local bioinformaticians with support from experts in the field across Africa and other external collaborators (Gurwitz et al., 2017). Very few African universities have structured bioinformatics courses; most of these universities are South African, some are North African and few are in the rest of sub-Saharan Africa (Bishop et al., 2015). DELGEME, through funding from the Wellcome Trust, is also providing funding for Master of Science courses in bioinformatics, which are mostly taken in South Africa. The other form of bioinformatics training is local capacity building, which institutions organize for staff, usually with support through North-South collaborations and transfer of expertise. However, the downside of short courses is that there is no mentorship beyond the course, which hinders consolidation of the knowledge gained. In addition, some organizations working predominantly on crop production, such as the International Institute of Tropical Agriculture Bioscience Center<sup>10</sup> and the Consultative Group on International Agricultural Research institute<sup>11</sup>, offer short bioinformatics training opportunities to African scholars. Some students from Africa receive training at European universities, but most of these trainees do not return to join local institutions because of poor infrastructure. Furthermore, there is a disconnect between biologists and other scientific disciplines such as computer science, statistics and mathematics in most African universities. This affects multidisciplinary research, which is crucial in modern-day infectious disease research. Ultimately, the lack of well-structured bioinformatics curricula hampers the development and retention of much-needed experts in the field in Africa, since they often move to Europe and North America for better career prospects.

#### Limited Research Funding

A major challenge to research on the African continent is the lack of funding for biomedical research. Current research is mainly funded by international donors, with limited or no funding from national governments and African regional bodies such as the African Union (Hamburg and Collins, 2010; Karikari, 2015). However, a few countries such as South Africa, through South Africa's National Research Foundation and Medical Research Council, do provide funding for genomics research projects (Karikari et al., 2015). Until the initiation of H3Africa, through funding from the National Institutes of Health (United States) and the Wellcome Trust (United Kingdom), there was limited to no funding for genomics and bioinformatics in Africa (Adoga et al., 2014; Mulder et al., 2017).

#### CONCLUSION AND PERSPECTIVE

Herein we highlight how genomics and bioinformatics have contributed to our understanding of infectious diseases of significant health concern, ranging from bacterial and viral to parasitic infections, as well as their applications to drug and vaccine target identification. These contributions range from understanding pathogenesis, host systemic responses and host-pathogen interactions to the identification of prognostic and diagnostic markers. However, in Africa, despite the high morbidity and mortality due to infectious diseases, there is limited expertise in the field of bioinformatics and hence limited bioinformatics research output in terms of publications. Thus, there is a need to strengthen training and capacity building in bioinformatics in Africa to improve infectious disease genomics and host-pathogen genomics on the continent. This can be achieved through the establishment of well-structured courses, mentorship for junior and trainee bioinformaticians and better career prospects to retain trained bioinformaticians on the continent.

<sup>10</sup>http://bioscience.iita.org/index.php/en/services/bioinformatics

<sup>11</sup>https://www.cgiar.org/

#### AUTHOR CONTRIBUTIONS

All authors listed contributed substantially to the intellectual content, writing and editing of the manuscript, and approved it for publication.

#### FUNDING

All authors were supported by a DELTAS Africa grant (DEL-15-007: GA). The DELTAS Africa Initiative is an independent funding scheme of the African Academy of Sciences (AAS)'s Alliance for Accelerating Excellence in Science in Africa (AESA) and is supported by the New Partnership for Africa's Development Planning and Coordinating Agency (NEPAD Agency), with funding from the Wellcome Trust (107755/Z/15/Z: GA) and the United Kingdom Government. The views expressed in this publication are those of the author(s) and not necessarily those of AAS, NEPAD Agency, Wellcome Trust or the United Kingdom Government.

#### ACKNOWLEDGMENTS

We are grateful to WACCBIP for providing us with the funding and conducive environment to do quality research.

#### REFERENCES


implications. Antimicrob. Agents Chemother. 55, 4122–4127. doi: 10.1128/AAC.01833-10


of systemic host-pathogen interactions in severe malaria. Sci. Transl. Med. 10:eaar3619. doi: 10.1126/scitranslmed.aar3619



falciparum using RNA-Seq. Mol. Microbiol. 76, 12–24. doi: 10.1111/j.1365-2958.2009.07026.x


bancrofti from mosquitoes. Mol. Ecol. 25, 1465–1477. doi: 10.1111/mec.13574


of M. africanum- and M. tuberculosis-infected patients after, but not before, drug treatment. Genes Immun. 16, 347–355. doi: 10.1038/gene.2015.21



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Bah, Morang'a, Kengne-Ouafo, Amenga–Etego and Awandare. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Pipeline for Classifying Deleterious Coding Mutations in Agricultural Plants

Maxim S. Kovalev<sup>1</sup>, Anna A. Igolkina<sup>1</sup>\*, Maria G. Samsonova<sup>1</sup>\* and Sergey V. Nuzhdin<sup>1,2</sup>

<sup>1</sup> Department of Applied Mathematics, Peter the Great St.Petersburg Polytechnic University, St. Petersburg, Russia, <sup>2</sup> Program Molecular & Computational Biology, Dornsife College of Letters Arts and Science, University of Southern California, Los Angeles, CA, United States

#### Edited by:

Yuriy L. Orlov, Institute of Cytology and Genetics (RAS), Russia

#### Reviewed by:

Vasily Ramensky, Moscow Institute of Physics and Technology, Russia

Konstantin Vladimirovich Gunbin, Institute of Cytology and Genetics (RAS), Russia

#### \*Correspondence:

Anna A. Igolkina igolkinaanna11@gmail.com

Maria G. Samsonova m.g.samsonova@gmail.com

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Plant Science

Received: 18 September 2018; Accepted: 08 November 2018; Published: 28 November 2018

#### Citation:

Kovalev MS, Igolkina AA, Samsonova MG and Nuzhdin SV (2018) A Pipeline for Classifying Deleterious Coding Mutations in Agricultural Plants. Front. Plant Sci. 9:1734. doi: 10.3389/fpls.2018.01734

The impact of deleterious variation on both plant fitness and crop productivity is not completely understood and remains a topic of active debate. Deleterious mutations in plants have so far been predicted solely with sequence conservation methods rather than function-based classifiers, owing to the lack of well-annotated mutational datasets for these organisms. Here, we developed a machine learning classifier based on a dataset of deleterious and neutral mutations in Arabidopsis thaliana by extracting 18 informative features that discriminate deleterious mutations from neutral ones, including 9 novel features not used in previous studies. We examined linear SVM, Gaussian SVM, and Random Forest classifiers, with the latter performing best. Random Forest classifiers exhibited markedly higher accuracy than the popular PolyPhen-2 tool on the Arabidopsis dataset. Additionally, we tested whether the Random Forest, trained on the Arabidopsis dataset, accurately predicts deleterious mutations in Oryza sativa and Pisum sativum and observed satisfactory accuracy (87% and 93%, respectively), higher than that obtained with PolyPhen-2. Application of Transfer Learning in the classifiers did not improve their performance. To additionally test the performance of the Random Forest classifier across different angiosperm species, we applied it to annotate deleterious mutations in Cicer arietinum and validated them using population frequency data. Overall, we devised a classifier with the potential to improve the annotation of putative functional mutations in QTL and GWAS hit regions, as well as the evolutionary analysis of the proliferation of deleterious mutations during plant domestication, thus optimizing breeding improvement and the development of new cultivars.

Keywords: deleterious mutation, random forest (bagging) and machine learning, Oryza, Pisum, Cicer

### INTRODUCTION

New mutations continuously arise in populations. Some of them are neutral, but many are deleterious (Grossman et al., 2010). Under most circumstances, natural selection is effective in keeping strongly deleterious mutations at a low level; however, mildly deleterious variants may reach considerable frequency in populations due to hitchhiking and population bottlenecks. Deleterious variants may affect phenotypic traits and decrease organismal fitness. In maize, by contrast, intermediate and weakly deleterious alleles are involved in heterosis (Yang et al., 2017). In humans, rare deleterious SNPs are associated with common diseases and cancer (Taylor et al., 2015). It is therefore no wonder that estimating the prevalence of deleterious mutations in different species is a topic of keen interest.

Theoretical predictions place the fraction of deleterious mutations in barley, soybean, rice, maize, and Arabidopsis genomes at approximately 20% to 40% (Günther and Schmid, 2010; Mezmouk and Ross-Ibarra, 2014; Kono et al., 2016). Deleterious alleles are usually at low frequency, an observation that is in agreement with the action of weak purifying selection. The prevalence of deleterious alleles differs between wild species, landraces, and elite cultivars. Using rice sequences, Günther and Schmid (2010) found fewer deleterious substitutions in wild than in cultivated rice. In comparison with traditional landraces, elite maize inbreds show an increase in the proportion of deleterious variants fixed within the population, but a much smaller proportion of segregating deleterious variants (Yang et al., 2017). This is explained by bottlenecks during modern breeding, which result in fixation of the majority of mutations and therefore reduce the fraction of segregating variation.

The issue of deleterious variation in plant genotypes is particularly important for crop improvement, because crop productivity may be reduced by the persistence of deleterious variants at moderate frequency. Indeed, Yang et al. (2017) found that deleterious variants may contribute substantially to variation in fitness-related quantitative traits in maize and that incorporating information about deleterious mutations may improve existing genomic prediction frameworks.

NGS technologies open a way to annotate the functional effect of individual SNPs. As the regulatory code responsible for gene activity still remains a puzzle, only genetic variants in the coding regions are considered. The general belief is that non-synonymous substitutions may change protein structure and that many of them should therefore have a deleterious effect on protein function, which in turn manifests as biochemical or morphological mutations. The methods for predicting deleterious effects of non-synonymous substitutions in proteins can be subdivided into two groups. Methods of the first group exploit sequence conservation and are based on the assumption that SNPs in evolutionarily conserved regions are likely to be deleterious. Some of them, like SIFT, use a simple cut-off to discriminate deleterious variants from neutral ones (Sim et al., 2012), while others, like MAPP (Stone and Sidow, 2005) and GERP++ (Davydov et al., 2010), additionally employ phylogenetic information.

Machine learning algorithms form the foundation of the second group of methods. Of these, the most widely used is PolyPhen-2 (Adzhubei et al., 2010). This method is trained on rigorously annotated datasets of human disease-causing mutations, which underlies its high predictive accuracy. As a machine learning method, PolyPhen-2 consists of three steps: first, a set of features that characterize a mutation is extracted using sequence characteristics, multiple alignment scores, and information about the 3D structure of the resulting protein; next, training and cross-validation are performed; finally, mutations are classified with a naïve Bayes approach. It should be noted that, although trained on human data, PolyPhen-2 is sometimes applied to predict deleterious mutations in other species. There is, however, little consensus about the validity of such a direct knowledge transfer. Indeed, it is known that alleles annotated as deleterious in humans correspond to normal alleles in other mammals in about 15% of cases (Kondrashov et al., 2002). It follows that, to achieve more accurate predictions, training might have to be executed separately, species by species. However, for many species, the information required for classifier training might be substantially more limited than for humans. Accordingly, the question arises whether it is possible to use the information obtained for one species to search for harmful mutations in another, perhaps phylogenetically close, species.

This question has long been discussed in machine learning in the following formulation: how to transfer knowledge from one object to another, considered to be close (in the sense of data sampling distribution), to solve a specific problem (whether classification or regression). A set of methods that provide the methodology for solving such problems is denoted Transfer Learning (TL). These methods have found broad application in many practical problems. For instance, Lagunas and Garces (2017) classify the painted images of various objects using their naturalistic form (photos). Closer to home, Transfer Learning was used for evaluating the quality of protein models (Hurtado et al., 2018), the localization of proteins in the cell based on ontology databases (Mei et al., 2011) and the search for associations between the genome and the phenotype (Petegrosso et al., 2018).

Up to now, most publications predicting deleterious mutations in plants have used sequence conservation methods, mostly due to the lack of well-annotated datasets of deleterious and neutral mutations in these organisms. Recently, however, Kono et al. (2016) assembled a validated database of 2,910 function-altering mutations in Arabidopsis, which opens the way for the development of machine learning methods specifically tailored for plants. Here, we developed a Random Forest classifier that, when tested on two plant species for which a sufficient number of neutral and functional mutations is known (Oryza sativa and Pisum sativum), showed substantially better performance than PolyPhen-2. We also attempted to improve our classifier using Transfer Learning approaches, as this technique could provide knowledge transfer from one species for which a lot of information is available to a close species with limited information. Finally, we validated this classifier using population data on single nucleotide allele frequencies available for Cicer arietinum (Plekhanova et al., 2017). We believe our classifier will be helpful in plant research for prioritizing mutations in QTL and GWAS support intervals for functional validation, for developing GRN-based models to solve the genotype-to-phenotype problem, and for the improvement of breeding programs.

## MATERIALS AND METHODS


#### Arabidopsis Training Database

The list of amino acid substitutions in Arabidopsis thaliana proteins was obtained from the database created by Kono et al. (2018). The database contains 13,707 replacements, of which 4,409 are labeled mutations in 994 proteins: 2,894 deleterious and 1,515 neutral. The protein sequences were downloaded from "The Arabidopsis Information Resource."

#### Oryza sativa and Pisum sativum Test Datasets

The sets of deleterious mutations in rice (O. sativa) and pea (P. sativum) were extracted from the UniProt mutation database (The UniProt Consortium, 2017). To construct a set of neutral mutations in rice and pea, the BLASTp program (Altschul et al., 1997) was used to align each protein sequence against the SwissProt sequence database (Bairoch, 1996), and proteins with more than 95% identity to a query sequence were selected. At the next step, the selected sequences were multiply aligned with Clustal Omega (Sievers and Higgins, 2014), and a set of neutral mutations was generated under the following rule. We considered amino acid substitutions that have no known phenotype, are not part of a continuous block of substituted residues (i.e., are isolated), and are independent (i.e., there are no other substitutions in the same sequence of the alignment). This rule makes it possible to avoid the phenomenon of correlated mutational behavior between columns in a multiple sequence alignment (Kowarsch et al., 2010). In addition, we considered only alignment columns that have no more than one substitution. To balance the datasets, neutral mutations were randomly downsampled so that their number was equal to the number of deleterious mutations. Overall, the dataset for rice contained 764 mutations in 400 proteins (382 deleterious and 382 neutral); the pea dataset contained 136 mutations in 60 proteins (68 deleterious and 68 neutral).
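The selection rule for neutral substitutions can be illustrated with a minimal sketch. It assumes an already computed alignment of equal-length strings with `-` for gaps; the function name `candidate_neutral_substitutions` and the toy sequences are hypothetical, and the reading of the rule (a homolog must differ from the query at exactly one residue, and that column must be substituted in no other homolog) is one possible interpretation rather than the authors' implementation.

```python
def candidate_neutral_substitutions(query, homologs):
    """Return (column, wild_type, variant) tuples with 1-based alignment columns.

    Assumes all sequences are aligned to the same length, with '-' for gaps.
    """
    # Substituted columns for every homolog, ignoring gapped positions.
    per_homolog = []
    for seq in homologs:
        diffs = [i for i, (q, h) in enumerate(zip(query, seq))
                 if q != h and q != "-" and h != "-"]
        per_homolog.append(diffs)

    # Number of homologs that differ from the query in each column.
    column_hits = {}
    for diffs in per_homolog:
        for i in diffs:
            column_hits[i] = column_hits.get(i, 0) + 1

    result = []
    for seq, diffs in zip(homologs, per_homolog):
        # "Independent": the homolog carries exactly one substitution,
        # which also makes it "isolated" (no neighbouring substitutions).
        if len(diffs) != 1:
            continue
        i = diffs[0]
        # Keep only columns that are substituted in a single homolog.
        if column_hits[i] == 1:
            result.append((i + 1, query[i], seq[i]))
    return sorted(result)


# Toy alignment (invented sequences, not taken from the datasets).
query = "MKTAYIAKQR"
homologs = [
    "MKSAYIAKQR",  # single substitution T3S: kept
    "MKTAYLGKQR",  # two adjacent substitutions: rejected
    "MRTAYIAKQR",  # K2R, but the column is also substituted below: rejected
    "MRTAYIAKQK",  # two substitutions in one sequence: rejected
]
neutral = candidate_neutral_substitutions(query, homologs)
print(neutral)
```

Filtering for the absence of a known phenotype and the random downsampling to balance the classes would follow this step and are not shown.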

### Cicer arietinum Target Dataset

433 Cicer arietinum landraces from the N. I. Vavilov All-Russian Institute for Genetic Resources (VIR collection) were genotyped by GBS sequencing, and variants were called and filtered following standard criteria; overall, 56855 SNPs were identified (Plekhanova et al., 2017). Identification of SNPs in protein-coding regions and their classification into synonymous and non-synonymous classes was done with the SnpEff tools (Cingolani et al., 2012): 3023 synonymous and 3467 non-synonymous replacements were determined within 2569 proteins.

#### Classifier Features

The set of classification features was aggregated by different methods. To extract a set of features characterizing substitutions, the PolyPhen-2 web service (Adzhubei et al., 2010) was used. Additional servers and sources of information were also involved, such as the PfamScan (Finn et al., 2014) and the PCI-SS (Green et al., 2009). The former was used to check whether the amino acid substitution locates within a protein domain of the Pfam database. Features obtained with the latter service incorporate information about the secondary structure of the protein in the loci of the substitution. Since information about the three-dimensional structure of a target protein is not always known, these features played the role of alternative structural characteristics. PCI-SS server indicates a protein secondary structure – α-helix, β-sheet, or non-regular structure – which contains the substitution of interest, and also provides three quantitative characteristics about the structural state of the target amino acid in the protein based on the mean-square error between the models considered in the PCI-SS algorithms. To evaluate the physicochemical nature of amino acid substitutions, several measures were used: the Grantham distance (Grantham, 1974), the Sneath index (Sneath, 1966), the Epstein's coefficient of difference (Epstein, 1967), and the Miyata distance (Miyata et al., 1979). The quantitative evaluation of the amino acid substitution by the matrix of BLOSUM62 substitutions was added as an extra feature (Henikoff and Henikoff, 1992).

Two additional features were constructed that take into account the amino acid context around the mutation position. The first feature was defined as the mean distance over the Grantham matrix between the wild-type amino acid in the mutation position and each of the two neighboring amino acids. The second feature was calculated in the same way but considering two amino acids from a mutant position at a distance of one. The construction of these features was based on the following hypothesis: if amino acids that are very different in their physicochemical properties are next to each other, this is most likely dictated by constraints on the function to be performed. Therefore, the more an amino acid differs physicochemically from its context, the more likely a mutation at its position is to be harmful.
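A minimal sketch of how such context features could be computed is given below. The distance table is a toy, made-up stand-in for the Grantham matrix, the names `pair_distance` and `context_feature` are hypothetical, and reading the second feature as "the residues two positions away on either side" is an assumption about the description above.

```python
# Toy, made-up distances for three residues; NOT the published Grantham values.
TOY_DISTANCE = {
    frozenset("AG"): 60.0,
    frozenset("AW"): 150.0,
    frozenset("GW"): 180.0,
}


def pair_distance(a, b, table=TOY_DISTANCE):
    """Symmetric physicochemical distance between two residues."""
    if a == b:
        return 0.0
    return table[frozenset((a, b))]


def context_feature(sequence, position, offset, table=TOY_DISTANCE):
    """Mean distance between the residue at `position` (0-based) and the
    residues `offset` steps to its left and right.

    Neighbours that fall outside the sequence are skipped.
    """
    centre = sequence[position]
    neighbours = [sequence[j]
                  for j in (position - offset, position + offset)
                  if 0 <= j < len(sequence)]
    if not neighbours:
        raise ValueError("no neighbours available at this offset")
    return sum(pair_distance(centre, n, table) for n in neighbours) / len(neighbours)


seq = "GAWAG"                                 # wild-type W at index 2
neighb1 = context_feature(seq, 2, offset=1)   # W against A and A
neighb2 = context_feature(seq, 2, offset=2)   # W against G and G
print(neighb1, neighb2)
```

With the real Grantham matrix substituted for `TOY_DISTANCE`, the two calls would correspond to the features labeled Neighb1 and Neighb2 in Figure 1, under the interpretation stated above.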

#### Classifiers

To classify mutations as deleterious or neutral, three classifiers were tested: Support Vector Machines with a linear kernel (Linear SVM), Support Vector Machines with a Gaussian kernel (Gaussian SVM) (Cristianini and Shawe-Taylor, 2000), and Random Forest (RF) (Breiman, 2001). The Linear SVM method is based on the search for a separating hyperplane with the maximum margin between the classes. To allow a non-linear separation of classes, the Gaussian SVM was examined; it uses the Gaussian kernel instead of the scalar product used in the Linear SVM (Cristianini and Shawe-Taylor, 2000). The RF uses the ideas of bagging, or Bootstrap Aggregating (a composition of independent classifiers, in this case decision trees), and the method of random subspaces (description of objects using subspaces of the feature space) (Breiman, 2001).

The choice of hyperparameter values for the classifiers was carried out on the Arabidopsis dataset. For each classifier, the traditional procedure, a grid search with fivefold cross-validation, was performed to find the optimal values of the hyperparameters. The values that provide the highest cross-validation score are selected, which helps to prevent overfitting. The optimal hyperparameters were then used for training the classifiers. No overfitting effect was observed (**Supplementary Figure S1**). Cross-validation was performed with tools from the scikit-learn Python module<sup>1</sup>.

Accuracy was chosen as the characteristic by which the best values of the hyperparameters were selected, calculated by the following formula: Accuracy = (TP + TN)/N, where N is the size of the sample for which the classification was made, and TP and TN are the numbers of correctly identified deleterious and neutral mutations, respectively. To select the best classifier, the data for A. thaliana were divided into training and validation sets (3409 and 1000 samples, respectively). Classifiers were first trained, and then the classification on the validation set was performed. We used the Linear SVM, Gaussian SVM, and RF methods from the scikit-learn Python module (see footnote 1); the pipeline for tuning, training and testing the classifiers is available at the GitHub repository https://github.com/kovmax/DelMut.
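The tuning and validation procedure described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code from the repository: the data are synthetic (generated with `make_classification` to mimic an 18-feature table), and the hyperparameter grid is a hypothetical example rather than the grid used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 18-feature mutation table (not the real data).
X, y = make_classification(n_samples=400, n_features=18, n_informative=8,
                           random_state=0)

# Hold out a validation set, analogous to the 3409/1000 split in the text.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=100, random_state=0, stratify=y)

# Hypothetical grid; the values are illustrative only.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}

# Grid search with fivefold cross-validation, scored by accuracy.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Accuracy = (TP + TN) / N on the held-out validation set.
y_pred = search.predict(X_valid)
validation_accuracy = accuracy_score(y_valid, y_pred)
manual_accuracy = (y_pred == y_valid).sum() / len(y_valid)

print(search.best_params_, round(search.best_score_, 3), validation_accuracy)
```

Because `GridSearchCV` refits the best configuration on the whole training set by default, `search.predict` uses the tuned classifier directly; `manual_accuracy` simply restates the accuracy formula to show that it matches `accuracy_score`.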

#### Transfer Learning

Transfer learning (TL) is a machine learning technique that improves a model for the target data by transferring knowledge from related and usually larger source data (Pan and Yang, 2010). In our study, we applied TL for training classifiers to predict deleterious mutations in the rice and pea datasets (target data) based on the knowledge about deleterious mutations in the A. thaliana dataset (source data). We examined Transductive Transfer Learning, which assumes that the source data are labeled (classes of samples are known) but the target data are not; accordingly, labels for the target data were not used until the final validation of the predictions. To implement Transductive TL, we assigned a weight (W) to each sample from the source data, which depends inversely on the distance in the feature space from this sample to the mean of the target data domain:

$$W = \exp\left(-\left\| x_i^{S} - m^{t} \right\|^2\right)$$

where x<sub>i</sub><sup>S</sup> is the i-th sample from the source data and m<sup>t</sup> represents the mean values of the target dataset features (Pan and Yang, 2010; Lapin et al., 2014). The Transductive TL classifier predicts classes of the target dataset and learns on the weighted source data: the closer a sample from the source data is to the target dataset, the more significant it is for training. We applied the Transductive TL technique to the Linear SVM, Gaussian SVM, and RF classifiers with the hyperparameter values estimated for these classifiers without TL. Methods were implemented with tools of the scikit-learn Python module (see footnote 1); all datasets and scripts are available at the GitHub repository https://github.com/kovmax/DelMut.
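The weighting scheme can be sketched in a few lines of plain Python. The helper names `target_mean` and `source_weights` and the two-feature toy data are hypothetical; the sketch only evaluates the weight formula given above and makes no claim about how the weights were passed to the classifiers in the original scripts.

```python
import math


def target_mean(target_rows):
    """Feature-wise mean m^t of the (unlabeled) target samples."""
    n = len(target_rows)
    return [sum(column) / n for column in zip(*target_rows)]


def source_weights(source_rows, target_rows):
    """W = exp(-||x_i^S - m^t||^2) for every source sample x_i^S."""
    m_t = target_mean(target_rows)
    weights = []
    for x in source_rows:
        squared_distance = sum((xi - mi) ** 2 for xi, mi in zip(x, m_t))
        weights.append(math.exp(-squared_distance))
    return weights


# Toy two-feature data (invented values).
source = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
target = [[0.5, 1.5], [1.5, 0.5]]          # mean is [1.0, 1.0]
weights = source_weights(source, target)
print(weights)
```

In scikit-learn, weights of this kind can be supplied through the `sample_weight` argument of `fit`, which both the SVM and Random Forest classifiers accept. Since the weights decay exponentially with squared distance, the scale of the features strongly affects them, so some form of feature standardization would presumably be needed in practice; this is an assumption rather than a detail stated in the text.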

#### RESULTS

#### Feature Extraction

To develop a method for predicting damaging missense mutations in plants, we used a machine learning approach and three annotated datasets of non-synonymous deleterious and neutral mutations in A. thaliana, O. sativa, and P. sativum (see Materials and Methods). The method employs classification algorithms, and therefore we need to characterize the datasets with a set of features able to discriminate between classes. In total, 18 features were selected that characterize the impact of the substitution of the wild-type allele by the mutant allele on protein sequence and structure. As **Figure 1** shows, the distributions of all the features differ between the subsets of neutral and deleterious mutations in A. thaliana, which points to their utility for discriminating between these subsets.

<sup>1</sup>http://scikit-learn.org

### Best Classifier for the Arabidopsis thaliana Dataset

The dataset was divided into training and test samples. The test sample was selected at random, contained 357 neutral and 643 deleterious mutations, and was used to compare the accuracy of the predictions of the four classifiers (PolyPhen-2, Linear SVM, Gaussian SVM, and Random Forest). The results (see **Table 1**) showed that all three trained classifiers (Linear SVM, Gaussian SVM, and Random Forest) were more accurate than PolyPhen-2. The most accurate one was Random Forest: it had the highest accuracy and AUC values (ROC curves are presented in **Supplementary Figures S2**–**S4**) and the lowest False Negative and False Positive Rates.

#### Classification of Oryza sativa and Pisum sativum With and Without Transfer Learning

Each classifier was trained on the Arabidopsis training samples and applied for prediction in two settings: direct prediction or prediction additionally involving Transfer Learning. Since there is an element of randomization in the Random Forest classification method, estimates for this method were obtained by choosing the best prediction of 300 trained classifiers (**Figure 2**). By comparing the predicted and annotated class values for the rice and pea mutations, we concluded that the best of the proposed classifiers is Random Forest without the addition of Transfer Learning (**Table 2**). PolyPhen-2 predictions were better only in terms of the False Positive Rate, whereas in terms of the False Negative Rate PolyPhen-2 performed considerably worse. Overall, the Random Forest classifier makes fewer errors when predicting truly deleterious mutations. The predictions of the classifiers with and without Transfer Learning did not exhibit significant differences. Moreover, for the best Random Forest classifier, the mode with Transfer Learning turned out to be less accurate.

#### Classification of Non-synonymous Mutations in Cicer arietinum

based on five corresponding distance matrices. The second row represents the scores obtained with the PolyPhen-2 service: pph2\_Score1 and pph2\_dScore reflect PSIC scores; pph2\_IdPmax, pph2\_IdQmin, and pph2\_Nobs represent specific features based on the multiple protein alignments. The third row contains features of the secondary protein structure: two features of belonging to helix or strand (helix, strand), and three scores obtained with PCI-SS service (E\_dist, T\_dist, H\_dist). The last row includes two features of the amino acid context around the substitution of interest (Neighb1, Neighb2) and belonging to known Pfam domains (PfamHit). The detailed explanation of the features is presented in Supplementary Table S1.

TABLE 1 | Performance of four classifiers: PolyPhen-2, Linear SVM, Gaussian SVM and Random Forest on the Arabidopsis thaliana dataset.

To test whether our classifiers perform reasonably across different angiosperm species, we chose to annotate deleterious mutations in chickpea, C. arietinum. Classification was carried out with both PolyPhen-2 and the Random Forest classifier, which demonstrated the best discriminating ability on the rice and pea datasets (see **Figure 2**). One may observe (**Table 3**) that there is a general correspondence between the annotations, with 1923 mutations designated as neutral and 851 as deleterious by both classifiers. However, there were also appreciable differences, as may be observed from the alternative classifications for 517 mutations. Overall, concordance between the two classification results was 84.3%.
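The reported concordance can be checked directly from the three counts quoted in the text. The short calculation below assumes that these counts (agreement on neutral, agreement on deleterious, and disagreement) together cover every mutation classified by both tools; the variable names are illustrative.

```python
both_neutral = 1923       # neutral according to both classifiers
both_deleterious = 851    # deleterious according to both classifiers
discordant = 517          # classified differently by the two tools

total = both_neutral + both_deleterious + discordant
concordance = (both_neutral + both_deleterious) / total

print(f"{total} mutations, concordance = {concordance:.1%}")
# prints "3291 mutations, concordance = 84.3%"
```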

Due to the lack of annotated missense mutations in chickpea, only circumstantial evidence could be used to demonstrate the validity of predictions in this species. To this end, we analyzed the population frequencies of the classified polymorphisms in the dataset of 433 chickpea accessions (see Materials and Methods). We calculated the frequencies of synonymous (mostly neutral), predicted neutral, and predicted deleterious mutations. Due to the large amount of missing data, only those genome positions that were called in at least 300 accessions were retained for analysis. Overall, there were 1028 non-synonymous (672 neutral and 356 deleterious) and 901 synonymous polymorphisms (**Table 4**).

Applying the Wilcoxon rank sum test with continuity correction, we showed that there was no statistically significant difference between frequencies of neutral and synonymous substitutions; however, the frequency of deleterious mutations is statistically significantly lower than the frequency of mutations from other classes (one sided test, P < 0.05) (**Table 5**). These results are fully consistent with previous studies on deleterious mutations in other species (Günther and Schmid, 2010; Mezmouk and Ross-Ibarra, 2014) and could be explained by the action of weak purifying selection that sweeps deleterious mutations away. We conclude that our classifier appears to be working across a broad range of angiosperm species.
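For readers who wish to reproduce this kind of comparison, a minimal pure-Python sketch of a one-sided Wilcoxon rank sum test with continuity correction (normal approximation, with a tie correction in the variance) is shown below. The function names and the toy frequencies are invented for illustration; the published analysis was presumably run with a standard statistics package, and for small samples an exact test would be preferable to the approximation used here.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1      # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def rank_sum_less(x, y):
    """One-sided test of H1: values in x tend to be smaller than in y.

    Returns (U, p) using the normal approximation with continuity correction.
    """
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if variance == 0:
        raise ValueError("all observations are identical")
    z = (u - n1 * n2 / 2 + 0.5) / math.sqrt(variance)
    p = 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z
    return u, p


# Toy allele frequencies (invented values, not the chickpea data).
deleterious = [0.01, 0.02, 0.03, 0.05]
neutral = [0.10, 0.20, 0.25, 0.30, 0.40]
u_stat, p_value = rank_sum_less(deleterious, neutral)
print(u_stat, p_value)
```

A small p-value from `rank_sum_less(deleterious, neutral)` supports the alternative that the first group has lower frequencies, which is the direction tested for deleterious versus other mutation classes.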

TABLE 3 | Comparison of the number of deleterious and neutral mutations predicted by PolyPhen-2 and Random Forest classifier in Cicer arietinum.

TABLE 4 | Mean frequencies of non-synonymous deleterious and neutral mutations, as well as synonymous mutations in chickpea dataset.

TABLE 5 | Results of the Wilcoxon rank sum test for mutation frequencies comparison.

### DISCUSSION

Here we aimed to develop a classifier specifically tailored for plant datasets that classifies coding non-synonymous mutations into neutral versus functionally deleterious. We trained the Random Forest classifier on the A. thaliana mutation dataset using 18 selected features and achieved substantially better performance than PolyPhen-2 for two plant species, O. sativa and P. sativum, for which a sufficient number of neutral and functional mutations is known. The accuracy of our Random Forest classifier versus PolyPhen-2 was 87% versus 81% for rice and 93% versus 90% for pea. The new classifier also exhibited a superior balance of type I versus type II errors.

PPh2, PolyPhen-2; lSVM, linear SVM; gSVM, Gaussian SVM; RF, random forest; TL, transfer learning.

We also attempted to improve our classifier using the approaches of Transfer Learning (TL). This was justified by the following considerations. The task of calling a mutation neutral or deleterious can be set as a classification problem and solved by various methods of machine learning. In mammals, it appeared that the same nucleotide might be deleterious in one species but neutral in another (Kondrashov et al., 2002). Accordingly, training might have to be executed separately, species by species. TL appears to be a suitable methodology to implement species-specific training, as it could provide knowledge transfer from one species for which a lot of information is available to a close species with limited information. However, here we failed to improve the classifier performance with TL. In fact, the performance of our best Random Forest-based classifier dropped by between 1% and 2% for both species, O. sativa and P. sativum. The reason why TL does not improve classifier performance is not clear. There might be unknown technical reasons, but also some biological considerations. It is known, for instance, that alleles annotated as deleterious in humans correspond to normal alleles in other mammals in about 15% of cases (Kondrashov et al., 2002). That is to say, as GRNs and proteins diverge between species, the functional importance of different amino acids may also diverge. This might partially be explained by a highly epistatic landscape of amino acid substitutions, as best documented for green fluorescent protein (Sarkisyan et al., 2016). When species with diverged GRNs and proteins mate, their progeny suffer from F1 incompatibility and F2 hybrid breakdown because of epistatic incompatibilities (Turelli and Orr, 2000; Rieseberg and Willis, 2007; Coyne, 2016). It is interesting to note that hybrids between different angiosperm species are much more frequently viable, even at higher phylogenetic distances, than hybrids between mammals. In fact, rather than suffering from incompatibilities, plant hybrids may exhibit remarkable hybrid vigor (Garcia et al., 2008; Charlesworth and Willis, 2009), raising the question of whether the patterns of GRN and protein divergence in plants are functionally equivalent to those in mammals. It might imply that amino acid substitutions in plant proteins and GRNs are less epistatic, which is to say that whether an amino acid substitution is deleterious or not may change only weakly between angiosperm species, unlike in mammals. If so, then TL should result in substantial improvements when applied to mammals but not to angiosperms. Of course, at this moment, this consideration is nothing more than speculation, but one that deserves attention and a specially designed analysis to try the TL methodology in mammals.

Although somewhat disappointing, the finding that the classifier works well for different species without the need for species-specific learning also has a positive aspect: the classifier does not have to be retrained before being applied across angiosperms. To test whether our classifier would work with a new species, we utilized the data on population polymorphisms available for C. arietinum. Our hypothesis was that, once these chickpea polymorphisms are annotated, the population frequencies of neutral non-synonymous mutations would be indistinguishable from the frequencies of synonymous mutations, while the frequencies of functional (i.e., mostly deleterious) mutations would be significantly lower, as these mutations are actively removed by natural selection. This hypothesis was strongly supported, and thus the use of our classifier is justified for broad use with flowering plants.

Overall, our advances open the path to multiple future directions of research. For instance, it would be interesting to infer how different domesticated plants are from their wild progenitors at the genomic level. While it might be assumed that only a few loci contribute to the process of domestication (Gross and Olsen, 2010), domestication can also indirectly affect the entire genome by interfering with natural selection. First, there is strong selection fixing segregating and novel functional alleles. Second, there is an extensive relaxation of natural selection on characters that are important in the wild but not in cultivation, including relaxation due to population size reduction. The selective spread of beneficial mutations, but also a consequent build-up of deleterious mutations (especially those closely linked to selective sweeps), has been well documented in plants, including rice (Günther and Schmid, 2010) and maize (Pyhäjärvi et al., 2013). However, whether the build-up of deleterious mutations is a minor nuisance or a major drag on yield remains incompletely understood and can now be researched. This will help to understand whether 'cleaning out' such adverse mutations, for instance with CRISPR-based tools, might contribute to substantial gains in yield. Further, it opens the way to prioritizing these mutations for being edited out, which may be of substantial value to the workflow of future agricultural advances.

#### AUTHOR CONTRIBUTIONS

MK and AI have contributed equally to this work. MS and SN supervised the study.

#### FUNDING

This work was supported by RSF (Russian Science Foundation) Grant No. 16-16-00007.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2018.01734/full#supplementary-material

#### REFERENCES

Bairoch, A. (1996). The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic Acids Res. 24, 21–25. doi: 10.1093/nar/24.1.21


polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92. doi: 10.4161/fly.19695


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Kovalev, Igolkina, Samsonova and Nuzhdin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Characterization of DNA Methylation Associated Gene Regulatory Networks During Stomach Cancer Progression

Jun Wu<sup>1</sup> , Yunzhao Gu<sup>2</sup> , Yawen Xiao<sup>3</sup> , Chao Xia<sup>2</sup> , Hua Li<sup>2</sup> , Yani Kang<sup>2</sup> , Jielin Sun<sup>4</sup> , Zhifeng Shao<sup>2</sup> , Zongli Lin<sup>5</sup> \* and Xiaodong Zhao<sup>4</sup> \*

<sup>1</sup> School of Life Sciences, East China Normal University, Shanghai, China, <sup>2</sup> Bio-ID Center, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China, <sup>3</sup> Department of Automation, Shanghai Jiao Tong University, Shanghai, China, <sup>4</sup> Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai, China, <sup>5</sup> Charles L. Brown Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA, United States

#### Edited by:

Yuriy L. Orlov, Institute of Cytology and Genetics (RAS), Russia

#### Reviewed by:

Sheng Liu, Indiana University, United States Anna Kudryavtseva, Engelhardt Institute of Molecular Biology (RAS), Russia Leonid Olegovich Bryzgalov, Independent Researcher, Novosibirsk, Russia

> \*Correspondence: Zongli Lin zl5y@virginia.edu Xiaodong Zhao xiaodongzhao@sjtu.edu.cn

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 16 October 2018 Accepted: 18 December 2018 Published: 04 February 2019

#### Citation:

Wu J, Gu Y, Xiao Y, Xia C, Li H, Kang Y, Sun J, Shao Z, Lin Z and Zhao X (2019) Characterization of DNA Methylation Associated Gene Regulatory Networks During Stomach Cancer Progression. Front. Genet. 9:711. doi: 10.3389/fgene.2018.00711

DNA methylation plays a critical role in tumorigenesis through regulating oncogene activation and tumor suppressor gene silencing. Although extensively analyzed, the implication of DNA methylation in gene regulatory network is less characterized. To address this issue, in this study we performed an integrative analysis on the alteration of DNA methylation patterns and the dynamics of gene regulatory network topology across distinct stages of stomach cancer. We found the global DNA methylation patterns in different stages are generally conserved, whereas some significantly differentially methylated genes were exclusively observed in the early stage of stomach cancer. Integrative analysis of DNA methylation and network topology alteration yielded several genes which have been reported to be involved in the progression of stomach cancer, such as IGF2, ERBB2, GSTP1, MYH11, TMEM59, and SST. Finally, we demonstrated that inhibition of SST promotes cell proliferation, suggesting that DNA methylation-associated SST suppression possibly contributes to the gastric cancer progression. Taken together, our study suggests the DNA methylation-associated regulatory network analysis could be used for identifying cancer-related genes. This strategy can facilitate the understanding of gene regulatory network in cancer biology and provide a new insight into the study of DNA methylation at system level.

Keywords: DNA methylation, gene regulation network, stomach cancer, tumor stages, system level

### INTRODUCTION

DNA methylation plays a critical role in tumorigenesis through regulating oncogene activation and tumor suppressor gene silencing (He et al., 2008), and has attracted extensive attention in the past decade. It has been shown that tumor initiation and development are associated with aberrant DNA methylation patterns, as documented in stomach cancer development (Tahara and Arisawa, 2015; Yamamoto et al., 2016). Aberrant DNA methylation patterns are a hallmark of the cancer genome (Baylin et al., 2000; Bergman and Cedar, 2013) and are involved in malignant progression (Jones et al., 2013). Although critically involved in malignancy, the implication of DNA methylation in tumorigenesis at the system level is less characterized.

Gene regulatory network-based analysis is regarded as a powerful way to understand the mechanism of tumorigenesis at the system level (Kreeger and Lauffenburger, 2010), and various robust gene regulatory network inference algorithms based on machine learning have been proposed for such analysis (Haury et al., 2012; Slawek and Arodz, 2013; Wu et al., 2016). In addition, the rapid development of deep sequencing technologies has generated a tremendous amount of sequencing data, and an increasing number of network-based methods have recently been applied to understand the molecular mechanism of tumor formation and progression (Anglani et al., 2014; Yang et al., 2014; Bicker et al., 2015).

To further investigate the role of DNA methylation in tumorigenesis at the system level, in this study we analyzed the DNA methylation-associated topology dynamics of the gene regulatory network in stomach cancer. We observed that although the DNA methylation patterns are generally conserved, locus-specific DNA methylation patterns can be identified, especially in the early stage. Comparison of the topology of gene regulatory networks derived from different stages yielded several genes, such as IGF2, ERBB2, GSTP1, MYH11, TMEM59, and SST, whose regulatory relationships were found to be most severely disrupted. To evaluate the biological relevance, we performed an siRNA assay against SST in the gastric epithelial cell line GES-1 and found that down-regulation of SST significantly promotes gastric cell proliferation. Collectively, these results suggest that the integrative analysis of DNA methylation and gene regulatory networks across different stages of stomach cancer can be used to identify genes involved in stomach cancer initiation and development, and provides a new insight into the understanding of DNA methylation in carcinogenesis at the system level.

### RESULTS

### Probe-Gene Pairs Assignment

The DNA methylation datasets downloaded from The Cancer Genome Atlas (TCGA) data portal were generated using two Illumina Infinium DNA methylation bead arrays (HM27 and HM450). Considering the incompleteness of the DNA methylation data, we focused our study on the probes located in gene promoter regions. Technically, more than one probe was generally designed for a given gene promoter region, and it remains unclear which probe-hit methylated region actually affects the expression of the target gene. To address this issue, distance and correlation criteria were used to assign the proper probes to a gene (see Materials and Methods for further details).

It has been well recognized that DNA hyper-methylation at the promoter region is associated with gene suppression (Bell et al., 2011; Jones, 2012). Due to the unavailability of DNA methylation data and the matched RNA-seq data in normal tissues, we examined the correlation between the expression level of a given gene and the DNA methylation level of probes located in its promoter region in each tumor stage. Not surprisingly, we observed that negatively correlated pairs outnumber the positively correlated ones (**Figure 1A**). In particular, among the significantly correlated pairs, almost all probe-gene pairs were negatively correlated (**Figure 1B**). A probe-gene pair was assigned if the DNA methylation level of the probe and the expression level of the gene were significantly negatively correlated in at least one of the four tumor stages. With these criteria, 10,777 probe-gene pairs, which consist of 9,830 probes and 7,546 genes, were defined and then used for the downstream analysis.

### Global Conserved and Locus Specific DNA Methylation Patterns Across Different Stomach Cancer Stages

With the selected probe-gene pairs, we first examined the global methylation patterns across all stomach cancer stages and the normal samples. We classified the probes into unmethylated, hemi-methylated and fully methylated groups using an approach similar to that of Lokk et al. (2012). To determine proper thresholds, we examined the distributions of the methylation level in all five phenotypes (**Figure 2A**). We found that the distributions of the methylation level in all five phenotypes are very similar. More than half of the probes were unmethylated and only about 15% of the probes were fully methylated in all samples. The dynamics of the methylation patterns across the five phenotypes were also analyzed. We found that the conservation between every two phenotypes was higher than 80% (**Figure 2B**), indicating that the DNA methylation patterns are globally conserved across all five phenotypes. Additionally, we found that DNA methylation patterns are relatively more conserved among the tumor stages.
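The three-group classification and the pairwise conservation measure described above can be sketched as follows. This is only an illustration under stated assumptions: the cut-offs `LOW` and `HIGH` are hypothetical (the article derives its thresholds from Figure 2A and does not state them in the text), conservation is assumed to be the fraction of shared probes that fall in the same group in both phenotypes, and the probe identifiers and beta values are made-up toy numbers.

```python
from typing import Dict

# Hypothetical cut-offs; the article's actual thresholds are read off
# Figure 2A and are not given numerically in the text.
LOW, HIGH = 0.25, 0.75


def classify(beta: float, low: float = LOW, high: float = HIGH) -> str:
    """Assign a probe to one of three methylation groups by its beta value."""
    if beta < low:
        return "unmethylated"
    if beta > high:
        return "fully methylated"
    return "hemi-methylated"


def conservation(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Fraction of shared probes assigned to the same group in both phenotypes."""
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("the two phenotypes share no probes")
    same = sum(classify(a[p]) == classify(b[p]) for p in shared)
    return same / len(shared)


# Toy per-probe methylation levels (illustrative only, not TCGA data).
normal = {"probe1": 0.05, "probe2": 0.50, "probe3": 0.90, "probe4": 0.10}
stage1 = {"probe1": 0.08, "probe2": 0.80, "probe3": 0.95, "probe4": 0.12}
print(conservation(normal, stage1))
```

In the toy data only `probe2` changes group (hemi-methylated to fully methylated), so three of the four probes are conserved.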

Although the overall patterns are considerably conserved, phenotype-specific methylation presumably plays an important role in the initiation and progression of stomach cancer. To test this presumption, we examined the presence of both the unmethylated and fully methylated probe-linked genes in the five phenotypes. Interestingly, we found that both the unmethylated and fully methylated probe-linked genes were significantly more numerous in normal samples than in tumor samples (**Figure 3**). We next performed gene ontology (GO) analysis of these genes with DAVID (Huang et al., 2009a,b). The results showed that the fully methylated probe-linked genes in normal samples were enriched in the GO terms defense response to bacterium and innate immune response (**Supplementary Table S1**), including LPO and S100A8, which have been reported to be activated in the H. pylori-infected gastric mucosa (Semper et al., 2014; Zhuang et al., 2015).

To further understand the biological relevance of the DNA methylation in different stages of stomach cancer, we compared the samples in stages I–IV with the normal samples and identified the significantly differentially methylated probes. We found 1,059, 716, 673 and 635 genes linked to significantly differentially methylated probes in stages I–IV samples, respectively. The top 20 significantly differentially methylated probe-linked genes with the largest positive and negative mean differences are shown in **Figure 4**, in which we found that several oncogenes and tumor suppressor genes were at the top of the lists (positive and negative directions, respectively) in all four tumor stages, including ITGA4, FGF2, FLI1, EGFR, ERBB2, VIM, and DAPK1. ITGA4 encodes a member of the integrin alpha chain family that may play a role in cell motility and migration, and the promoter of ITGA4 was reported to be hyper-methylated in various cancers, such as colorectal cancer (Gerecke et al., 2015), breast cancer (Lian et al., 2012) and gastric cancer (Kim et al., 2009). DAPK1, a positive mediator of gamma-interferon induced programmed cell death, was reported to be fully hypo-methylated or up-regulated in several types of cancer, including fistula associated mucinous type anal adenocarcinoma (Sen et al., 2010), nasopharyngeal carcinoma (Luo et al., 2011) and gastric cancer (Zhang et al., 2006).

FIGURE 1 | Distribution of correlations between the probe methylation level and the expression of target genes. (A): Distribution of Spearman correlation of all potential probe-gene pairs in the four stomach cancer stages. (B): Distribution of Spearman correlation of all significantly correlated potential probe-gene pairs in the four stomach cancer stages.

represent the thresholds used for dividing the probes into three groups. (B): The conservation between every two phenotypes.

The Venn diagram of genes with significant differential methylation is shown in **Figure 5**. We found that most genes were shared among stages II–IV but not with stage I. The GO analysis (**Supplementary Table S2**) shows that the commonly hyper-methylated probe-linked genes are mainly involved in carcinogenesis-related biological processes, such as cell motion, cell death and cell migration, while the commonly hypo-methylated probe-linked genes are mainly involved in development and differentiation processes (**Supplementary Table S3**). We also found some genes exclusively present in stage I, suggesting that they are presumably associated with the early stage of stomach cancer. The GO analysis results revealed that both the specifically hyper-methylated genes and the specifically hypo-methylated genes are involved in cell adhesion and transmembrane transport. The difference is that the genes linked to the specifically hyper-methylated probes are particularly involved in eating behavior and positive regulation of appetite (**Supplementary Table S4**), while the genes linked to the specifically hypo-methylated probes are particularly involved in immune response, response to bacterium and negative regulation of the Wnt signaling pathway (**Supplementary Table S5**).

FIGURE 3 | Venn diagrams of genes linked to the fully methylated and unmethylated probes. (A): The Venn diagram of fully methylated probe-linked genes with respect to the five phenotypes. (B): The Venn diagram of unmethylated probe-linked genes with respect to the five phenotypes.

FIGURE 4 | Differential methylation analysis between the four tumor stages and the normal phenotype. (A): Stage I vs. Normal; (B): Stage II vs. Normal; (C): Stage III vs. Normal; (D): Stage IV vs. Normal. Left: Mean difference between the methylation level in the tumor samples and the normal samples. Right: Distributions of methylation level, with black vertical lines showing medians. The top 20 largest positive and negative mean differences with an adjusted p-value less than 0.05 are shown.
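The stage comparison behind the Venn diagram of differentially methylated genes reduces to set operations. The sketch below is a hedged illustration only: the function name, the stage labels and the placeholder gene names `g1` to `g9` are hypothetical and do not come from the article's data.

```python
from typing import Dict, Set, Tuple


def shared_and_exclusive(
    stage_genes: Dict[str, Set[str]], focus: str
) -> Tuple[Set[str], Set[str]]:
    """Return (genes common to all other stages, genes found only in `focus`)."""
    others = [genes for stage, genes in stage_genes.items() if stage != focus]
    shared_by_others = set.intersection(*others)
    exclusive_to_focus = stage_genes[focus] - set.union(*others)
    return shared_by_others, exclusive_to_focus


# Placeholder gene sets per tumor stage (illustrative only).
stage_genes = {
    "I": {"g1", "g2", "g9"},
    "II": {"g2", "g3", "g4"},
    "III": {"g3", "g4", "g5"},
    "IV": {"g3", "g4"},
}
shared, stage1_only = shared_and_exclusive(stage_genes, focus="I")
print(sorted(shared), sorted(stage1_only))
```

Genes in the second set are the candidates for early-stage-specific methylation changes, which is the group the GO analysis above examines separately.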

## Regulation Gain or Loss Induced by DNA Methylation Alteration

DNA methylation is one of the key epigenetic mechanisms involved in regulation of gene expression. To further understand the role of DNA methylation alteration during the stomach cancer development, we constructed a DNA methylation associated gene regulatory network for each phenotype and analyzed the topology differences among these networks.

To examine the regulation alteration affected by the DNA methylation changes, we screened the target genes based on the assumption that hyper-methylation reduces the affinity between the TFs and their binding regions and may therefore cause a loss of regulation, whereas hypo-methylation causes a gain (Yao et al., 2016). We calculated the in-degree of each target gene and retained the genes whose in-degree increased and that were linked to hypo-methylated probes, as well as the genes whose in-degree decreased and that were linked to hyper-methylated probes. The in-degree of each target gene in each network pair is shown in **Figure 6**. After filtering, 57%, 52%, 59%, and 54% of target genes were retained in stages I–IV, respectively.
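The in-degree filter can be illustrated with a short sketch. It assumes that each network is already available as a list of (TF, target) edges and that each target gene carries a single "hypo" or "hyper" label relative to normal; both representations, as well as the function names and toy edges, are hypothetical simplifications rather than the authors' implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (transcription factor, target gene)


def in_degree(edges: Iterable[Edge]) -> Counter:
    """Count the incoming edges of every target gene."""
    return Counter(target for _, target in edges)


def retain_consistent(
    normal_edges: Iterable[Edge],
    tumor_edges: Iterable[Edge],
    methylation_change: Dict[str, str],
) -> List[str]:
    """Keep targets whose in-degree change agrees with the methylation change.

    `methylation_change` maps a gene to "hypo" or "hyper" (tumor vs. normal).
    Hypo-methylated genes must gain regulators; hyper-methylated genes must
    lose regulators.
    """
    deg_normal, deg_tumor = in_degree(normal_edges), in_degree(tumor_edges)
    kept = []
    for gene, direction in methylation_change.items():
        delta = deg_tumor[gene] - deg_normal[gene]
        if (direction == "hypo" and delta > 0) or (direction == "hyper" and delta < 0):
            kept.append(gene)
    return kept


# Toy networks (illustrative only).
normal_net = [("TF1", "G1"), ("TF2", "G2"), ("TF3", "G2")]
tumor_net = [("TF1", "G1"), ("TF2", "G1"), ("TF2", "G2"), ("TF1", "G3")]
changes = {"G1": "hypo", "G2": "hyper", "G3": "hyper"}
print(retain_consistent(normal_net, tumor_net, changes))
```

Here `G3` is dropped because it gains a regulator although its probe is hyper-methylated, which contradicts the assumption stated above.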

To further investigate the regulation alteration in the four tumor stages compared to the normal phenotype, we constructed differential regulatory networks by subtracting the normal weight matrix from each of the tumor weight matrices. The 1,000 regulatory relationships with the largest absolute weight differences were regarded as true alterations. Finally, for each tumor stage we obtained a differential regulatory network consisting of 1,000 edges that point to 172, 172, 189, and 176 target genes in the four tumor stages. The numbers of edges pertaining to gain or loss of regulation are listed in **Table 1**, in which we observed that the gain number is larger than the loss number in each of the four tumor networks.
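A minimal sketch of the differential network step follows. It assumes that each weight matrix is stored as a mapping from (TF, target) edges to weights, that a missing edge has weight zero, and that a positive difference counts as a gain and a negative one as a loss; these conventions, the function names and the toy weights are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (transcription factor, target gene)


def differential_edges(
    tumor: Dict[Edge, float], normal: Dict[Edge, float], top_k: int = 1000
) -> List[Tuple[Edge, float]]:
    """Subtract normal from tumor weights and keep the top_k largest |differences|."""
    all_edges = set(tumor) | set(normal)
    diffs = {e: tumor.get(e, 0.0) - normal.get(e, 0.0) for e in all_edges}
    ranked = sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return ranked[:top_k]


def count_gain_loss(ranked: List[Tuple[Edge, float]]) -> Tuple[int, int]:
    """Count edges with a positive (gain) or negative (loss) weight difference."""
    gain = sum(1 for _, d in ranked if d > 0)
    loss = sum(1 for _, d in ranked if d < 0)
    return gain, loss


# Toy weights (illustrative only).
tumor_w = {("TF1", "geneA"): 0.9, ("TF2", "geneB"): 0.1, ("TF3", "geneC"): 0.50}
normal_w = {("TF1", "geneA"): 0.2, ("TF2", "geneB"): 0.8, ("TF3", "geneC"): 0.45}
top = differential_edges(tumor_w, normal_w, top_k=2)
print(count_gain_loss(top))
```

With `top_k=2` the small change on `geneC` is discarded, leaving one gained and one lost regulation.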

For the differential regulatory network in stages I–IV, we ranked the target genes according to the number of gained or lost regulations, respectively. We found that several genes were at the top in all the tumor stages. The top 10 target genes (listed in **Supplementary Table S6**) with the largest number of regulation alterations are shown in **Figure 7**. In these subgraphs we found that IGF2, ERBB2, and GSTP1 rank top in the number of regulations gained in all four differential regulatory networks, and MYH11, SST, and TMEM59 rank top in the number of regulations lost in all four differential regulatory networks. IGF2 is an imprinted gene and plays an essential role in embryonic development. However, activation of IGF2 stimulates the proliferation of tumor cells and prevents damaged cells from being destroyed. It was reported that overexpression of IGF2 plays an important role in carcinogenesis of diffuse type gastric cancer (Wu et al., 1997). MYH11 belongs to a group of proteins called myosins, which are involved in cell movement and the transport of material within and between cells. It was reported that MYH11 is not expressed in gastric cancer cell lines (Saeki et al., 2015) and that down-regulated MYH11 correlates with poor prognosis in stage II and stage III colorectal cancer (Wang et al., 2014). These results indicate that the methylation-mediated network analysis facilitates the identification of the key genes involved in tumorigenesis.

To evaluate the biological relevance of the genes identified through our network analysis, we performed an siRNA assay against SST in the gastric epithelial cell line GES-1. Compared with the control, SST suppression resulted in an increase of cells in the S and G2/M phases and a decrease of cells in the G0/G1 phase (**Figure 8**), indicating that SST down-regulation promotes cell proliferation. This result suggests that DNA methylation-associated SST suppression possibly contributes to gastric cancer progression.

### DISCUSSION

It has been recognized that aberrant DNA methylation plays an important role in tumorigenesis. However, the implication of DNA methylation in gene regulatory networks is less characterized. Thus, we performed an integrative analysis of DNA methylation and gene regulatory networks with the RNA-seq and DNA methylation data to understand the role of DNA methylation change in the gene regulatory network alteration across different stomach cancer stages.

We first assigned appropriate probes to each gene according to both the location information and the correlation relationship. We found that the DNA methylation pattern was globally conserved across all phenotypes except for some locus-specific DNA methylation patterns in the normal phenotype. The differential methylation analysis was also performed to identify the significantly differentially methylated genes in the samples of each tumor stage. Interestingly, we found more specific alterations in the stage I phenotype compared to the other tumor stages, and the GO analysis results showed that these genes are particularly involved in biological processes closely related to cancer initiation.

To identify the gene regulation alteration affected by the DNA methylation change, we constructed a DNA methylation associated gene regulatory network in each phenotype and subtracted the normal network from the four tumor networks, respectively. The differential network analysis results showed that the number of regulations gained was larger than that of regulations lost in each of the four tumor networks. We ranked the target genes according to the number of altered regulations and obtained several genes that rank top in all the tumor stages. For example, IGF2, ERBB2, and GSTP1 ranked top in the number of regulations gained, and MYH11, TMEM59, and SST ranked top in the number of regulations lost. To examine the biological relevance of the genes identified, we selected SST for functional evaluation. We found that inhibition of SST can significantly promote cell proliferation, which suggests that down-regulation of SST is involved in stomach cancer progression.

In brief, our study demonstrated that integrative analysis of the regulatory network and DNA methylation allows the identification of cancer-related genes. The strategy proposed here provides new insight into the role of DNA methylation in disease at the system level.

## MATERIALS AND METHODS

#### Data Collection and Differentially Methylated Sites Identification

The DNA methylation data, gene expression data and clinical data were downloaded from the TCGA data portal. The DNA methylation data consist of 302 samples, which were generated using two Illumina Infinium DNA methylation bead arrays, HumanMethylation27 (HM27) and HumanMethylation450 (HM450). The HM27 array contains 27,578 probes that target CpG sites located in proximity to the transcription start sites and the HM450 array contains 482,421 probes that target CpG sites throughout the genome. For ease of description, in the following sections of this article we use probes to represent the corresponding CpG sites.

As neither the HM27 nor the HM450 data contain enough samples for analysis of each phenotype, we only took probes located in gene promoters into account, even though the DNA methylation of transcriptional enhancers was also reported to be closely associated with carcinogenesis (Aran and Hellman, 2013). We adopted the strategy described in a previous report (Bass et al., 2014) to preprocess the DNA methylation data. Briefly, the probes shared by both the HM27 and HM450 platforms were selected, and the probes that overlap with SNPs or repeats, or that have any "NA"-masked data points, were removed. The probes that hit the X and Y chromosomes were also removed. After that, we obtained 19,736 probes for further analysis. The gene expression data of 272 samples and 26,540 genes were generated using RNA-seq. The DNA methylation samples and the gene expression samples were further divided into five phenotypes, namely normal and tumor stages I–IV, according to the clinical data. Sample numbers for all phenotypes are listed in **Table 2**.
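The probe filtering rules above can be expressed as a single predicate. The sketch below is a simplified illustration: the `Probe` record, its field names, the chromosome labels and the toy values are hypothetical, and real array annotations would have to be mapped onto these fields.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Probe:
    probe_id: str
    chrom: str
    overlaps_snp: bool
    overlaps_repeat: bool
    betas: List[Optional[float]]  # None stands for an "NA"-masked data point


def keep_probe(probe: Probe, shared_ids: Set[str]) -> bool:
    """Apply the four filters: platform overlap, SNP/repeat, NA values, sex chromosomes."""
    if probe.probe_id not in shared_ids:
        return False
    if probe.overlaps_snp or probe.overlaps_repeat:
        return False
    if any(beta is None for beta in probe.betas):
        return False
    if probe.chrom in {"chrX", "chrY"}:
        return False
    return True


# Toy probes (illustrative only).
shared = {"p1", "p2", "p3", "p4"}  # probes present on both arrays
probes = [
    Probe("p1", "chr1", False, False, [0.1, 0.2]),
    Probe("p2", "chr2", True, False, [0.3, 0.4]),    # overlaps a SNP
    Probe("p3", "chrX", False, False, [0.5, 0.6]),   # sex chromosome
    Probe("p4", "chr3", False, False, [0.7, None]),  # NA-masked value
    Probe("p5", "chr4", False, False, [0.8, 0.9]),   # not on both arrays
]
kept = [p.probe_id for p in probes if keep_probe(p, shared)]
print(kept)
```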

We did not expect all cases to be from a single molecular subtype, and we sought to identify methylation changes within cases from the same molecular subtype. Therefore, to identify the significantly differentially methylated probes, we excluded, for each probe, the 10% of samples with the lowest methylation and the 10% of samples with the highest methylation, and used the Wilcoxon rank-sum test to measure significance. Probes with a BH-adjusted p-value less than 0.05 and an absolute methylation difference greater than 0.2 were regarded as significantly differentially methylated.
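The trimming, testing and multiple-testing steps can be sketched in plain Python. Several points are assumptions rather than details confirmed by the text: trimming is applied within each group, the methylation difference is taken as the difference of trimmed means, and the rank-sum p-value uses a normal approximation without continuity or tie-variance correction. The sample values are invented for illustration.

```python
import math
from statistics import mean
from typing import List, Sequence


def trim(values: Sequence[float], frac: float = 0.10) -> List[float]:
    """Drop the lowest and highest `frac` of samples (assumed: within each group)."""
    ordered = sorted(values)
    k = int(len(ordered) * frac)
    return ordered[k:len(ordered) - k]


def rank_sum_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided Wilcoxon rank-sum p-value via a plain normal approximation."""
    pooled = sorted((v, group) for group, grp in enumerate((x, y)) for v in grp)
    rank_sum_x, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average 1-based rank of the tied block
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(z) / math.sqrt(2))


def bh_adjust(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda idx: pvals[idx])
    adjusted, running = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running = min(running, pvals[idx] * m / rank)
        adjusted[idx] = running
    return adjusted


def is_differential(adj_p: float, tumor: Sequence[float], normal: Sequence[float]) -> bool:
    """Apply the two cut-offs: adjusted p < 0.05 and |mean difference| > 0.2."""
    diff = mean(trim(tumor)) - mean(trim(normal))
    return adj_p < 0.05 and abs(diff) > 0.2


# Toy methylation levels for one probe (illustrative only).
tumor = [0.62, 0.70, 0.71, 0.73, 0.75, 0.76, 0.78, 0.80, 0.81, 0.95]
normal = [0.05, 0.20, 0.22, 0.25, 0.27, 0.28, 0.30, 0.31, 0.33, 0.60]
p = rank_sum_p(trim(tumor), trim(normal))
print(is_differential(bh_adjust([p])[0], tumor, normal))
```

In practice the adjustment would be applied to the p-values of all probes at once; a single probe is used here only to keep the example short.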

### Assigning DNA Methylation Sites to the Target Gene

TABLE 2 | Number of samples in each phenotype for the RNA-seq and DNA methylation data.

In general, more than one DNA methylation probe on the array was designed for a given gene promoter region. Thus, it remains unclear which probes actually affect the expression of the target gene. To address this issue, we used two criteria to assign the DNA methylation probes to each gene. We initially assigned a probe to a gene if the probe is located in the promoter region of the gene. The promoter region of a gene is defined as the ±2 kb region around the transcription start site of the gene. The relationship between a probe and a gene is then confirmed with the aid of gene expression, based on the evidence that DNA methylation can repress transcription when it occurs in the promoter region. The samples with matched gene expression data and methylation data were used for the analysis. For each candidate, we tested the significance of the correlation between the DNA methylation level of the probe and the expression level of the gene. Spearman's coefficient was used as the measure of correlation. The correlation significance was obtained with a t-test and the t statistic was calculated as:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$

where r is the correlation between the methylation and gene expression and n is the number of samples. The probe-gene pairs were finally confirmed if the BH-adjusted p-value is less than 0.05 and the correlation is less than zero.
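As a worked illustration of the correlation test, the sketch below computes Spearman's coefficient as the Pearson correlation of mid-ranks and then the standard statistic t = r√(n−2)/√(1−r²), which follows a Student's t distribution with n−2 degrees of freedom under the null hypothesis. The p-value lookup is omitted because it needs a t-distribution routine from a statistics library; the function names and the toy measurements are hypothetical.

```python
import math
from typing import List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's coefficient: Pearson correlation of the mid-ranks."""
    return pearson(midranks(x), midranks(y))


def t_statistic(r: float, n: int) -> float:
    """t = r * sqrt(n - 2) / sqrt(1 - r^2); infinite for a perfect correlation."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Toy matched samples for one probe-gene pair (illustrative only).
methylation = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
expression = [9.0, 7.5, 8.0, 5.0, 4.0, 2.0]
rho = spearman(methylation, expression)
t = t_statistic(rho, len(methylation))
print(round(rho, 4), round(t, 2))
```

A pair like this one, with a negative coefficient, would be retained only if its BH-adjusted p-value also fell below 0.05.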

#### DNA Methylation Associated Gene Regulatory Network Construction

To construct the DNA methylation associated gene regulatory network, the potential TFs that may bind to the DNA methylated regions should be identified. We first obtained JASPAR-2014 motif position weight matrices (PWMs) and ENCODE motif PWMs from the R package MotifDb, and 2,182 motif PWMs were used for further analysis (ENCODE Project Consortium, 2004; Mathelier et al., 2014). The potential TFs bound to each target gene were predicted according to sequence affinity. We used FIMO (Grant et al., 2011) to scan a ±100 bp sequence around each probe in search of instances of the selected PWMs. A TF was regarded as a potential regulator of a probe-linked gene if the p-value of its motif is less than 1E-4. However, a high sequence affinity only indicates that the TF has a high chance of binding to the regulatory region. It remains unclear whether the gene related to the regulatory element is actually bound by the TF.
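Turning motif hits into candidate TF-to-gene edges can be sketched as below. The motif scan itself is assumed to have been run already; the `MotifHit` record is a hypothetical in-memory structure and does not mirror the output format of any particular scanner, and the coordinate convention of `scan_window` is likewise an assumption.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class MotifHit(NamedTuple):
    tf: str          # transcription factor whose PWM matched
    probe_id: str    # probe whose flanking window contains the match
    p_value: float   # match p-value reported by the motif scanner


def scan_window(probe_pos: int, flank: int = 100) -> Tuple[int, int]:
    """Window of +/- `flank` bp around a probe position (clipped at zero)."""
    return max(0, probe_pos - flank), probe_pos + flank


def candidate_edges(
    hits: Iterable[MotifHit], probe_to_gene: Dict[str, str], max_p: float = 1e-4
) -> Set[Tuple[str, str]]:
    """Keep (TF, gene) pairs whose motif match p-value is below `max_p`."""
    edges = set()
    for hit in hits:
        gene = probe_to_gene.get(hit.probe_id)
        if gene is not None and hit.p_value < max_p:
            edges.add((hit.tf, gene))
    return edges


# Toy hits and probe-gene assignments (illustrative only).
probe_to_gene = {"p1": "geneA", "p2": "geneB"}
hits = [
    MotifHit("TF1", "p1", 2e-5),
    MotifHit("TF2", "p1", 5e-3),  # too weak, discarded
    MotifHit("TF2", "p2", 9e-5),
    MotifHit("TF3", "p9", 1e-6),  # probe not assigned to any gene
]
print(sorted(candidate_edges(hits, probe_to_gene)))
```

These candidate edges only encode sequence affinity, which is why a separate expression-based weight is still needed.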

To measure the confidence of such regulatory relationships, we assigned a weight to the edge from a potential TF to the target gene using our previously proposed gene regulatory network inference method (Wu et al., 2016) with the RNA-seq data. Briefly, we assumed that the expression level of a target gene can be formulated as an unknown function of the expression of TFs. We first solved the individual regression problems with the guided regularized random forest algorithm; a q-norm normalization was then employed to reduce the bias among different regression results, and the final results were obtained through refining the previous results according to the sparsity property of large scale gene regulatory networks.

#### REFERENCES

Anglani, R., Creanza, T. M., Liuzzi, V. C., Piepoli, A., Panza, A., Andriulli, A., et al. (2014). Loss of connectivity in cancer co-expression networks. PLoS One 9:e87075. doi: 10.1371/journal.pone.0087075

#### RNA Interference and Cell Cycle Analysis

RNA interference assays were performed as reported previously. siRNAs for SST, or negative control, were synthesized by Shanghai GenePharma Co., Ltd. Cells were transfected with SST siRNA or control siRNA using Lipofectamine™ 2000 Transfection Reagent (11668027, Invitrogen) according to the manufacturer's protocol. To measure the efficacy of the gene knockdown, quantitative real-time reverse transcription polymerase chain reaction (RT-qPCR) was used. Total RNA was extracted using TRIzol Reagent (15596-018, Invitrogen) and resuspended in RNase-free water. Reverse transcription of 1 µg RNA was performed using the oligo-dT primer and SuperScript® III Reverse Transcriptase (18080-044, Invitrogen) according to the manufacturer's protocol. Expression levels were determined by real-time PCR using an ABI StepOnePlus (Applied Biosystems, United States). β-actin was used as a control gene for normalization. The relative level of mRNA was calculated as $2^{-\Delta\Delta Ct}$ (means ± SEM, n = 3). The SST-targeting siRNA, primer sequences and the RT-qPCR results are provided in **Supplementary Table S7**.
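The relative quantification can be illustrated numerically. In the sketch below, ΔCt is the Ct of the target gene minus the Ct of the reference gene (β-actin in this study) and ΔΔCt is the ΔCt of the siRNA-treated sample minus that of the control; the function names and all Ct values are invented for illustration and are not the article's measurements.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def relative_expression(
    ct_target_treated: float,
    ct_ref_treated: float,
    ct_target_control: float,
    ct_ref_control: float,
) -> float:
    """Relative mRNA level 2^(-ddCt), normalized to a reference gene."""
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_treated - delta_control)


def mean_and_sem(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across replicates."""
    return mean(values), stdev(values) / math.sqrt(len(values))


# Three hypothetical replicates:
# (target Ct treated, reference Ct treated, target Ct control, reference Ct control)
replicates = [
    (26.0, 18.0, 24.0, 18.0),  # ddCt = 2 -> 0.25
    (25.0, 18.0, 24.0, 18.0),  # ddCt = 1 -> 0.5
    (27.0, 18.0, 24.0, 18.0),  # ddCt = 3 -> 0.125
]
levels = [relative_expression(*rep) for rep in replicates]
avg, sem = mean_and_sem(levels)
print(levels, round(avg, 3), round(sem, 3))
```

A mean level well below 1 indicates that the target transcript is reduced relative to the control, which is the readout used to confirm knockdown efficacy.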

#### AUTHOR CONTRIBUTIONS

XZ and JW conceived and designed the project. JW wrote the manuscript. YG and YK performed the experiments. JW, YX, CX, and HL performed the analysis and interpretation of data. JS, XZ, ZL, and ZS made substantial contributions to the design and revisions of the manuscript. All authors have read and approved the final version of the manuscript.

### FUNDING

This work was partially funded by the National Natural Science Foundation of China (31671299, 81720108017, and 31801118), the Medicine and Engineering cooperation project of Shanghai Jiao Tong University (YG2017ZD15 and YG2015MS33), the Development Program for Basic Research of China (2014YQ09070904), and the Shanghai Science and Technology Committee Program (17JC1400804).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00711/full#supplementary-material

Aran, D., and Hellman, A. (2013). DNA methylation of transcriptional enhancers and cancer predisposition. Cell 154, 11–13. doi: 10.1016/j.cell.2013.06.018

Bass, A. J., Thorsson, V., Shmulevich, I., Reynolds, S. M., Miller, M., Bernard, B., et al. (2014). Comprehensive molecular characterization of gastric adenocarcinoma. Nature 513, 202–209. doi: 10.1038/nature13480


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wu, Gu, Xiao, Xia, Li, Kang, Sun, Shao, Lin and Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Intracellular Vesicle Trafficking Genes, RabC-GTP, Are Highly Expressed Under Salinity and Rapid Dehydration but Down-Regulated by Drought in Leaves of Chickpea (Cicer arietinum L.)

Gulmira Khassanova<sup>1</sup> , Akhylbek Kurishbayev<sup>1</sup> , Satyvaldy Jatayev<sup>1</sup> , Askar Zhubatkanov<sup>1</sup> , Aybek Zhumalin<sup>1</sup> , Arysgul Turbekova<sup>1</sup> , Bekzak Amantaev<sup>1</sup> , Sergiy Lopato<sup>2</sup> , Carly Schramm<sup>2</sup> , Colin Jenkins<sup>2</sup> , Kathleen Soole<sup>2</sup> , Peter Langridge<sup>3,4</sup> and Yuri Shavrukov<sup>2</sup> \*

<sup>1</sup> Faculty of Agronomy, S. Seifullin Kazakh AgroTechnical University, Astana, Kazakhstan, <sup>2</sup> Biological Sciences, College of Science and Engineering, Flinders University, Bedford Park, SA, Australia, <sup>3</sup> School of Agriculture, Food and Wine, University of Adelaide, Adelaide, SA, Australia, <sup>4</sup> Wheat Initiative, Julius-Kühn-Institute, Berlin, Germany

#### Edited by:

Yuriy L. Orlov, Russian Academy of Sciences, Russia

#### Reviewed by:

Awais Rasheed, International Maize and Wheat Improvement Center, Mexico Mehar Hasan Asif, National Botanical Research Institute (CSIR), India

#### \*Correspondence:

Yuri Shavrukov yuri.shavrukov@flinders.edu.au

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 25 October 2018 Accepted: 18 January 2019 Published: 07 February 2019

#### Citation:

Khassanova G, Kurishbayev A, Jatayev S, Zhubatkanov A, Zhumalin A, Turbekova A, Amantaev B, Lopato S, Schramm C, Jenkins C, Soole K, Langridge P and Shavrukov Y (2019) Intracellular Vesicle Trafficking Genes, RabC-GTP, Are Highly Expressed Under Salinity and Rapid Dehydration but Down-Regulated by Drought in Leaves of Chickpea (Cicer arietinum L.). Front. Genet. 10:40. doi: 10.3389/fgene.2019.00040

Intracellular vesicle trafficking genes, Rab, encoding small GTP binding proteins, have been well studied in medical research, but there is little information concerning these proteins in plants. Some sub-families of the Rab genes have not yet been characterized in plants, such as RabC, otherwise known as Rab18 in yeast and animals. Our study aimed to identify all CaRab gene sequences in chickpea (Cicer arietinum L.) using bioinformatics approaches, with a particular focus on the CaRabC gene sub-family since it featured in an SNP database. Five isoforms of the CaRabC gene were identified and studied: CaRabC-1a, -1b, -1c, -2a and -2a<sup>∗</sup> . Six accessions of both Desi and Kabuli ecotypes, selected from field trials, were tested for tolerance to abiotic stresses, including salinity, drought and rapid dehydration, and compared to plant growth under control conditions. Expression analysis of total and individual CaRabC isoforms in leaves of control plants revealed a very high level of expression, with the greatest contribution made by CaRabC-1c. Salinity stress (150 mM NaCl, 12 days in soil) caused a 2- to 3-fold increase in expression of total CaRabC compared to controls, with the highest expression in isoforms CaRabC-1c, -2a<sup>∗</sup> and -1a. Significantly decreased expression of all five isoforms of CaRabC was observed under drought (12 days withheld water) compared to controls. In contrast, both total CaRabC and the CaRabC-1a isoform showed very high expression (up to eight-fold) in detached leaves over 6 h of dehydration. The results suggest that the CaRabC gene is involved in plant growth and response to abiotic stresses. It was highly expressed in leaves of non-stressed plants and was down-regulated after drought, but salinity and rapid dehydration caused up-regulation to high and very high levels, respectively. The isoforms of CaRabC were differentially expressed, with the highest levels recorded for CaRabC-1c in controls and under salinity stress, and for CaRabC-1a in rapidly dehydrated leaves. Genotypic variation in CaRabC-1a, comprising eleven SNPs, was found through sequencing of the local chickpea cultivar Yubileiny and germplasm ICC7255 in comparison to the two fully sequenced reference accessions, ICC4958 and Frontier. Amplifluor-like markers based on one of the identified SNPs in CaRabC-1a were designed and successfully used for genotyping chickpea germplasm.

Keywords: abiotic stresses, Amplifluor-like SNP markers, bioinformatics, CaRab gene, differential gene expression, gene isoforms

### INTRODUCTION

Plant genomes include a superfamily of genes that encode small GTP-binding proteins (Guanosine triphosphatases) that are classified into four groups: Arf, Rab, Ran and Rho; and an additional Ras-GTP gene group is found only in yeast and animals (Ma, 2007). Small GTP-binding proteins were first described in medical research, where the term "Ras" stemmed from their association with rat sarcoma (Chang et al., 1982; Bishop, 1985; Chavrier et al., 1990). The remaining three-letter names are not related in structure or function to the genes but rather refer to their product or some other feature (Coffin et al., 1981). Small GTP-binding proteins are known to be involved in a diverse range of activities in eukaryotes that are vital for growth, development and repair; from cytoskeletal organization, vacuolar storage and signaling, to modulation of gene expression (Takai et al., 2001). The mechanism for the regulation of GTP-binding proteins is conserved in all organisms and involves cycling between active (GTP-bound) and inactive (GDP-bound) forms, so they are often described as "molecular switches" that are turned "on" or "off" via the hydrolysis of GTP (Marshall, 1993). Activation requires the dissociation of GDP, which can be either stimulated by a regulatory factor named GEP (GDP/GTP Exchange Protein) or inhibited by GDI (GDP Dissociation Inhibitor; Takai et al., 2001; Liu et al., 2015; Martín-Davison et al., 2017).

Rab proteins, encoded by Rab-GTP genes, are normally prenylated at their carboxyl terminus. The hydrophobic prenyl groups facilitate attachment to membranes and are therefore integral to the biological role performed by Rab proteins in vesicle trafficking via endocytic and exocytic pathways between the endoplasmic reticulum, Golgi membrane network, endosome, plasma membrane and all intracellular membranes (Alory and Balch, 2003). Rab proteins are highly conserved across kingdoms, from yeast to animals and plants (Haubruck et al., 1987; Marcote et al., 2000), but are most often present as a small family of highly similar genes. They are divided into either nine (Ma, 2007) or 18 clades (Agarwal et al., 2009) based on their structure, with only eight clades represented in plants. Historically, different nomenclatures were adopted for identification of Rab genes in plants compared to animals. For example, in plants, eight capitalized letters from A to H were used in the names of Rab genes, while the numbers 1 to 11 were applied in human, animal and yeast research. In the absence of a universal system of nomenclature for Rab genes and their proteins, a list of all known genes and their respective identifiers for both nomenclatures is given later in the text.

The genes for Rab GTP-binding proteins should not be confused with the similarly named Dehydrin genes in plants, which are also known as RAB, meaning "Responsive to ABA" (Abscisic acid). Dehydrins encode proteins belonging to the large but very different group of Late embryogenesis abundant proteins, LEA (Hundertmark and Hincha, 2008). For example, AtRAB18 (or AtRab18) was described and studied in Arabidopsis thaliana in response to various abiotic stresses and ABA treatment (Lång and Palva, 1992; Rushton et al., 2012; Hernández-Sánchez et al., 2017). Despite the identical name, this gene is neither structurally nor functionally related to the Rab-GTP genes, and care must be taken to clearly distinguish between the two. Confusion between these two different types of genes is unfortunately apparent in recent publications. For example, Jiang et al. (2017) studied the correctly designated TaRab18 (=TaRabC1) gene in response to stripe rust in bread wheat, but this gene was incorrectly compared with RAB18 (Responsive to ABA) in Arabidopsis, rice and maize. As a result, the authors wrongly cited work by Lång and Palva (1992) and others on the Dehydrin AtRab18 to support their findings on the sensitivity of TaRab18 (=TaRabC1) to ABA.

In plants, Rab-GTP genes are reportedly involved in multiple physiological processes (Borg et al., 1997; Rehman and Sansebastiano, 2014; He et al., 2018; Lawson et al., 2018) and are often highly expressed in response to biotic and abiotic stresses (Marcote et al., 2000; Stenmark and Olkkonen, 2001; Zerial and McBride, 2001; Rutherford and Moore, 2002; Ma, 2007; Woollard and Moore, 2008; Agarwal et al., 2009). However, despite the numerous links, little is known about the precise molecular mechanisms underlying their involvement in plant stress responses.

One of the first studies to report a link between Rab protein and abiotic stress was that of O'Mahony and Oliver (1999), who found increased transcript levels of the Rab2 gene (otherwise known as RabB) in the desiccation-tolerant grass Sporobolus stapfianus in response to dehydration, but decreased transcript levels after rehydration. This suggested the involvement of SsRab2 in both the short-term response and later recovery from desiccation. SsRab2 was found to share 90% similarity to Rab2 proteins found in rice, maize, Arabidopsis, Lotus japonicus and soybean (O'Mahony and Oliver, 1999). Since that time, links to various stresses have been established for genes encoding Rab proteins in numerous plants, and especially in species with high abiotic stress-tolerance such as Lilium formolongi – LfRabB (Howlader et al., 2017), poplar – PtRabE1b (Zhang et al., 2018), and Mesembryanthemum crystallinum – McRab5b (=McRabF) (Bolte et al., 2000). Interestingly, many plant species were studied for RabG genes and their corresponding proteins including the halophyte species, Aeluropus lagopoides – AlRab7 (=AlRabG) (Rajan et al., 2015) and food grain crop, Pennisetum glaucum – PgRab7 (=PgRabG) (Agarwal et al., 2008), as well as more stress susceptible crops such as rice, Oryza sativa – OsRab7 (=OsRabG) (Nahm et al., 2003) and peanut, Arachis hypogaea – AhRabG (Sui et al., 2017), and the model species A. thaliana – AtRab7 (=AtRabG) (Mazel et al., 2004). A comprehensive analysis of all MpRab genes was reported for the liverwort, Marchantia polymorpha (Minamino et al., 2018).

Rab transcripts are often found to show different responses to abiotic stresses. For example, in rice, dehydration triggered a strong increase in OsRab7 (=OsRabG) transcript after 4 h and then a decrease after 10 h. However, no significant changes were found in response to cold or salinity stress (Nahm et al., 2003). Similarly, in the halophytic grass A. lagopoides, AlRab7 (=AlRabG) was upregulated by dehydration, but salinity stress caused no significant increase in transcript levels (Rajan et al., 2015). In another halophyte, M. crystallinum, expression of McRab5b (=McRabF) was higher after 2 h and continued to rise over 3 days of very strong salt stress (400 mM NaCl), but wilting or osmotic stress triggered no change in expression (Bolte et al., 2000). These differences reflect the various roles of the intracellular membrane system in the response to abiotic stresses and may provide the key to uncovering the precise molecular mechanisms underlying differential plant susceptibility or tolerance to an environmental stress.

A number of studies have used a transgenic approach to shed light on the mechanisms explaining the link between Rab proteins and plant stress and to explore how Rab proteins could play a role in the breeding of more stress-tolerant crops. For example, Mazel et al. (2004) constitutively overexpressed AtRabG3e (=AtRab7) in Arabidopsis. The transgenic plants accumulated more sodium in vacuoles and showed greater tolerance to salinity and osmotic stress. Evidence was also found for increased endocytosis in roots and leaves and entry of Reactive oxygen species into the cell to trigger signaling and subsequent activation of stress tolerance mechanisms (Mazel et al., 2004; Baral et al., 2015). AhRabG, OsRab7 (=OsRabG) and OsRab11 (=OsRabA) were also overexpressed in transgenic peanut and rice, respectively, producing plants that showed relatively higher salinity tolerance compared to wild-type plants (Peng et al., 2014; Sui et al., 2017; Chen and Heo, 2018). In transgenic peanut plants, of 132 genes differentially expressed, most were identified as transcription factors (TF) relating to salinity tolerance (Sui et al., 2017).

The aim of this study was to identify and analyze a possible candidate gene involved in the tolerance to drought, salinity and rapid dehydration in chickpea, C. arietinum, a species that is becoming increasingly popular as a cash crop in agricultural areas with the requirements for moderate tolerance to high temperatures, drought and salinity stress during the growing season. A candidate gene CaRabC1, belonging to the family of Rab-GTP genes, was identified from an SNP database using bioinformatic and molecular genetic analyses. Currently, the only report concerning chickpea Rab-GTP genes was published by Muñoz et al. (2001), who identified a Rab-specific GDI in chickpea seedlings showing 96% homology to MsRab11f (=MsRabG), a GDI in Medicago truncatula (Yaneva and Niehaus, 2005). Our study therefore represents the first report of the Rab-GTP family of genes in C. arietinum. We present the results of bioinformatic analyses of the identified genes and tests conducted to assess the expression of all isoforms of the CaRabC gene family in response to salinity, drought and rapid dehydration in selected chickpea genotypes. Amplifluor-like markers based on one of the identified SNPs in CaRabC-1a were used for genotyping of chickpea germplasm.

### MATERIALS AND METHODS

#### Plant Material

A germplasm collection comprising 250 chickpea (C. arietinum L.) samples from the ICRISAT Reference set plus local accessions was tested over 3 years in field trials in Northern and Central Kazakhstan. Six accessions were selected during field trials for further molecular analyses, as listed in **Table 1**. The first accession, cv. Yubileiny, originated from Krasnokutskaya Breeding Station, in the Saratov region (Russia), and is used as a Standard for local field trials with chickpea accessions. The remaining five chickpea lines were selected from the original 230 collected in the ICRISAT Reference set, to represent diverse gene-pool sources.

#### Identification of the Gene of Interest Using Bioinformatics and Molecular Phylogenetic Comparative Analysis

Bioinformatics and systems biology methods were applied in this study to identify a target gene or "Gene of Interest" (GoI) with a potential role in tolerance to abiotic stresses in chickpea. Initially, the SNP database for C. arietinum L.<sup>1</sup> was used to search and select one suitable SNP with a short fragment of sequence for further study. The full-length nucleotide sequence of the GoI and its corresponding polypeptide sequence were retrieved from the same database and used for both BLASTN and BLASTP in NCBI and in GenomeNet Database Resources, hosted by Kyoto University, Japan<sup>2</sup>. All chickpea gene sequences with KEGG and NCBI identification and the encoded proteins were downloaded from GenomeNet and NCBI databases, while chromosome locations were checked using LIS, Legume Information System database<sup>3</sup>. The A. thaliana genes displaying the highest level of similarity to each GoI within the gene family were identified using alignments from the same database.

Multiple sequence alignments of nucleotide sequences for the Rab family of genes were conducted in CLUSTALW using GenomeNet Database Resources<sup>4</sup>, while CLC Main Workbench software<sup>5</sup> was used for protein amino acid sequence alignment.

<sup>1</sup>https://www.ncbi.nlm.nih.gov/snp

<sup>2</sup>https://www.genome.jp/tools/blast

<sup>3</sup>https://legumeinfo.org/organism/Cicer/arietinum

TABLE 1 | List and short description of six selected chickpea germplasm accessions used for molecular analyses.

The molecular dendrogram was constructed using BLASTP at GenomeNet Database Resources (See footnote 2) with the function of ETE3 v3.0.0b32 (Huerta-Cepas et al., 2016) and MAFFT v6.861b applied using the default options (Katoh and Standley, 2013). The FastTree v2.1.8 program with default parameters was used for phylogenetic tree preparation (Price et al., 2009).

#### Abiotic Stress Treatments: Salinity, Drought and Rapid Dehydration

Three experiments applying abiotic stress treatments (salinity, slow drought and rapid dehydration) were carried out in chickpea for RT-qPCR analyses using the same conditions as described earlier in our publication for wheat (Zotova et al., 2018). The size of containers used, number of plants, soil type and growth conditions were all as described and no artificial inoculation of rhizobium was applied.

For salt stress, twenty-four uniform seedlings in each of six accessions were grown for one month in two separate containers. On "Day 0," three plants from each accession (three biological replicates) were randomly selected from each container, before the salt stress was applied. The two youngest fully developed leaves were collected from each selected plant into a 10-ml plastic tube and immediately frozen in liquid nitrogen and stored at –80°C until RNA extraction. Subsequently, 200 ml of 150 mM NaCl was applied to the container, covering the entire soil surface but avoiding any direct contact with the plants. The NaCl treatments were applied four times, on every third day following Day 0 (over 12 days in total) in treatment containers, while the same volume of tap-water without NaCl was used under the same schedule in the control containers. No solution was lost through drainage from any container. No supplementary CaCl<sub>2</sub> was added despite the recommended requirements in experiments with hydroponics. This is because the soil mix used contained sufficient available calcium and no symptoms of Ca deficiency were apparent in the treated plants. After 12 days, as on Day 0, the two youngest fully developed leaves were collected from each of three plants both in salt treatments and controls. Leaf samples were immediately frozen in liquid nitrogen and stored at –80°C for RNA extraction.
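The salt dose implied by this schedule can be checked with a short calculation. The sketch below is only a worked arithmetic example: it takes the stated volume (200 ml), concentration (150 mM) and number of applications (four) and uses the standard molar mass of NaCl (58.44 g/mol); the function name is illustrative.

```python
# Worked arithmetic for the NaCl dose per container (illustrative only).
NACL_MOLAR_MASS_G_PER_MOL = 58.44  # standard molar mass of NaCl


def nacl_mass_g(volume_ml: float, conc_mm: float) -> float:
    """Mass of NaCl (g) dissolved in volume_ml of a conc_mm (mmol/L) solution."""
    moles = (volume_ml / 1000.0) * (conc_mm / 1000.0)
    return moles * NACL_MOLAR_MASS_G_PER_MOL


per_application = nacl_mass_g(200, 150)  # one 200-ml application of 150 mM NaCl
total = 4 * per_application              # four applications over 12 days
print(f"{per_application:.2f} g per application, {total:.2f} g in total")
```

Because no solution drained from the containers, this total (about 7 g of NaCl per container) remains in the soil; the resulting soil-solution concentration depends on soil volume and water content, which are not restated here.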

Experiments with slowly droughted plants and rapid dehydration of detached leaves were carried out using exactly the same schedule as described in Experiments 1 and 2 in our previous paper on wheat (Zotova et al., 2018).

#### RNA Extraction, cDNA Synthesis and qPCR Analysis

Frozen leaf samples were ground as described below for DNA extraction. TRIzol-like reagent was used for RNA extraction following the protocol described by Shavrukov et al. (2013) and all other steps for RNA extraction and cDNA synthesis were as described previously (Zotova et al., 2018). The procedures included DNase treatment (Qiagen, Germany), and the use of a MoMLV Reverse Transcriptase kit (Biolabmix, Novosibirsk, Russia). All cDNA samples were checked for quality control using PCR and yielded bands of the expected size on agarose gels.

Diluted (1:2) cDNA samples were used for qPCR analyses using either a QuantStudio-7 Real-Time PCR instrument (Thermo Fisher Scientific, United States) at S. Seifullin Kazakh AgroTechnical University, Astana, Kazakhstan, or Real-Time qPCR system, Model CFX96 (BioRad, Gladesville, NSW, Australia) at Flinders University, Australia. The qPCR protocol was similar in both instruments as published earlier (Shavrukov et al., 2016), wherein the total volume of 10 µl q-PCR reactions included either 5 µl of 2xBiomaster HS-qPCR SYBR Blue (Biolabmix, Novosibirsk, Russia) for experiments in Kazakhstan or 5 µl of 2xKAPA SYBR FAST (KAPA Biosystems, United States) for experiments in Australia, 4 µl of diluted cDNA, and 1 µl of two gene-specific primers (3 µM of each primer) (**Supplementary material 1**). Expression data for the target genes were calculated relative to the average expression of the two reference genes: CAC, Clathrin adaptor complexes, medium subunit (Reddy et al., 2016) and GAPDH, Glyceraldehyde-3 phosphate dehydrogenase (Garg et al., 2010) (**Supplementary material 1**). At least three biological and two technical replicates were used in each qPCR experiment.
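The normalization to two reference genes can be illustrated with a minimal sketch. The text does not state the exact formula used, so the example assumes the common 2<sup>−ΔCt</sup> approach with the arithmetic mean Ct of the reference genes and roughly 100% amplification efficiency; the function name and all Ct values are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target: float, ct_refs: list[float]) -> float:
    """Relative expression as 2^-(Ct_target - mean Ct of reference genes).

    Assumes ~100% amplification efficiency for all primer pairs.
    """
    delta_ct = ct_target - mean(ct_refs)
    return 2.0 ** (-delta_ct)


# Hypothetical Ct values for one cDNA sample (not data from the study).
ct_cac, ct_gapdh = 22.0, 24.0  # the two reference genes
ct_total_carabc = 19.0         # target amplified with the common primers

print(relative_expression(ct_total_carabc, [ct_cac, ct_gapdh]))  # 2^4 = 16.0
```

Under this convention, a value of 1 means the target is expressed at the same level as the average of the two reference genes, which is how the "expression units" in the Results can be read.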

#### DNA Extraction, Sequencing and SNP Identification

Plants were grown in control (non-stressed) conditions in containers with soil as described above. Five uniform one-month-old individual plants were selected from each accession and five leaves were collected and bulked for leaf samples. Leaf samples frozen in liquid nitrogen were ground in 10-ml tubes with two 9-mm stainless ball bearings using a Vortex mixer. DNA was extracted from the bulked leaf samples with phenol-chloroform as described in our earlier papers (Shavrukov et al., 2016; Zotova et al., 2018). One microliter of DNA was checked on a 0.8% agarose gel to assess quality, and concentration was measured by NanoDrop (Thermo Fisher Scientific, United States).

<sup>4</sup>https://www.genome.jp/tools-bin/clustalw

<sup>5</sup>https://www.qiagenbioinformatics.com/products/clc-main-workbench

To identify SNPs in the GoI and compare them with annotated accessions in databases, primers were designed in exon regions flanking introns (**Supplementary material 1**). PCR was performed in 60 µl volume reactions containing 8 µl of template DNA adjusted to 20 ng/µl, and with the following components in the final concentrations listed: 1xPCR Buffer, 2.2 mM MgCl<sub>2</sub>, 0.2 mM each of dNTPs, 0.25 µM of each primer and 4.0 units of Taq-DNA polymerase in each reaction (Maxima, Thermo Fisher Scientific, United States). PCR was conducted on a SimpliAmp Thermal Cycler (Thermo Fisher Scientific, United States), using a program recommended by the Taq-polymerase manufacturer, with the following steps: initial denaturation, 95°C, 4 min; 35 cycles of 95°C for 20 s, 55°C for 20 s, 72°C for 1 min, and final extension, 72°C for 5 min. Single bands of the expected size were confirmed after visualization of 5 µl of the PCR product in 1% agarose gel. The remaining PCR reaction volume (55 µl) was purified using FavorPrep PCR Purification kit (Favorgene Biotec Corp., Taiwan) following the manufacturer's protocol. The concentrations of purified PCR products were measured using NanoDrop (Thermo Fisher Scientific, United States) and later used as a template (100 ng) in a sequencing reaction with the Beckman Coulter Sequencing kit, following the manufacturer's protocol. Sanger sequencing and analysis of results were performed on a Beckman Coulter Genetic Analysis System, Model CEQ 8000 (Beckman Coulter, United States) following the manufacturer's protocol and software at S. Seifullin Kazakh AgroTechnical University, Astana (Kazakhstan). The identified SNPs were used to design allele-specific primers that were applied in Amplifluor-like SNP analysis. Two fully sequenced chickpea accessions, ICC4958 of the Desi ecotype, and Frontier of the Kabuli ecotype, were used as the reference genomes<sup>6</sup>.

#### SNP Amplifluor Analysis

Amplifluor-like SNP analysis was carried out using a QuantStudio-7 Real-Time PCR instrument (Thermo Fisher Scientific, United States) as described previously (Jatayev et al., 2017; Zotova et al., 2018) with the following modifications: Each reaction contained 3 µl of template DNA adjusted to 20 ng/µl, 5 µl of Hot-Start 2xBioMaster (MH020-400, Biolabmix, Novosibirsk, Russia<sup>7</sup>) with all other components as recommended by the manufacturers, including MgCl<sub>2</sub> (2.0 mM). One microliter of a mixture of two fluorescently labeled Universal probes was added (0.25 µM each) and 1 µl of allele-specific primer mix (0.15 µM of each of two forward primers and 0.78 µM of the common reverse primer). Four microliters of Low ROX (Thermo Fisher Scientific, United States) was added as a passive reference label to the Master-mix as prescribed for the qPCR instrument. Assays were performed in 96-well microplates. Sequences of the Universal probes and primers as well as the sizes of amplicons are presented in **Supplementary material 1**.

<sup>6</sup>http://www.cicer.info/databases.php

<sup>7</sup>http://biolabmix.ru/en/products

PCR was conducted using a program adjusted from those published earlier (Jatayev et al., 2017; Zotova et al., 2018): initial denaturation, 95°C, 2 min; 14 "doubled" cycles of 95°C for 10 s, 60°C for 10 s, 72°C for 20 s, 95°C for 10 s, 55°C for 20 s and 72°C for 50 s; with recording of allele-specific fluorescence after each cycle. Genotyping by SNP calling was determined automatically by the instrument software, but each SNP result was also checked manually using amplification curves and final allele discrimination. Experiments were repeated twice over different days, where two technical replicates confirmed the confidence of SNP calls.
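As a quick check of the program length, the programmed hold times can be summed. This is a small arithmetic sketch based only on the step durations listed above; it ignores temperature ramping and fluorescence-reading time, so the actual run is longer.

```python
# Hold times (seconds) of the Amplifluor-like PCR program described above.
INITIAL_DENATURATION_S = 2 * 60
DOUBLED_CYCLE_S = [10, 10, 20, 10, 20, 50]  # 95/60/72 °C, then 95/55/72 °C
N_DOUBLED_CYCLES = 14

total_s = INITIAL_DENATURATION_S + N_DOUBLED_CYCLES * sum(DOUBLED_CYCLE_S)
print(f"{total_s} s = {total_s / 60:.0f} min of programmed hold time")
```

Each "doubled" cycle contains two denaturation steps, so the 14 doubled cycles correspond to 28 rounds of amplification.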

#### Statistical Analysis

IBM SPSS Statistical software was applied to calculate means, standard errors, and to estimate the probabilities for significance using ANOVA tests.
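For readers without access to SPSS, the quantities named above can be sketched with the Python standard library. This is a minimal illustration, not the authors' SPSS procedure: it computes a group mean, its standard error, and the F statistic of a one-way ANOVA. The replicate values are hypothetical, and converting F to a probability would additionally require the F distribution (for example from a statistics package).

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(values):
    """Return (mean, standard error of the mean) for one group."""
    return mean(values), stdev(values) / sqrt(len(values))


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group MS / within-group MS."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical relative-expression values, three biological replicates each.
control = [1.0, 2.0, 3.0]
salt = [4.0, 5.0, 6.0]
print(mean_and_se(control))
print(one_way_anova_f([control, salt]))  # F = 13.5 for these toy values
```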

### RESULTS

#### Bioinformatics and Comparative Phylogenetic Analysis

During the initial screening of SNP No. 2103, rs853191221 [C. arietinum] within the chickpea SNP database (**Supplementary material 2**), NCBI BLAST analysis revealed the closest nucleotide accession to be XM\_012715175.1, encoding a Ras-related protein in C. arietinum with the corresponding RabC1-like gene (LOC101496214, transcript variant X2, mRNA). We designated this gene as the isoform CaRabC-1a.

To identify the full list of all members of CaRab genes in chickpea, bioinformatics approaches were used to search and analyze annotated sequences and whole genome sequences available in public databases using comparisons to the reference genome of A. thaliana. As a result, eight sub-families of CaRab gene were identified, with 54 isoforms. The corresponding accession IDs for the genes and proteins, as well as references to Arabidopsis genes with the highest level of similarity are shown in **Table 2**.

The sequences of all 54 isoforms of CaRab genes identified in chickpea were used to construct a phylogenetic tree (**Figure 1**). Eight distinct clades were identified in the molecular dendrogram, and the letter corresponding to each sub-family name is used to distinguish the corresponding clade. The biggest and most diverse was Clade A, the CaRabA gene sub-family while Clades B and F contained only two accessions each. Clades D, G, H and F are molecularly similar, but most distanced from other sub-families of the CaRab gene. The sub-family CaRabC contained five isoforms with the closest sub-families being CaRabD and CaRabE (**Figure 1**).

TABLE 2 | The eight identified sub-families (RabA – RabH) of the chickpea CaRab, with 54 isoforms and their corresponding accession ID listed for genes and proteins as well as reference to closest genes in Arabidopsis.

The sub-family CaRabC studied in this paper is indicated in bold type.

Protein sequence analysis of five isoforms from sub-family CaRabC (**Figure 2**) showed distinct separation of CaRabC-1 from CaRabC-2. The closest molecular similarity was found between CaRabC-1b and CaRabC-1c with the next greatest similarity shared with CaRabC-1a, while CaRabC-2a and CaRabC-2a<sup>∗</sup> were the most diverged from all others (**Figure 2**).
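The notion of pairwise similarity underlying such a comparison can be sketched as percent identity over aligned columns. The example below is an assumption-laden illustration, not the method implemented in CLC Main Workbench: the function name is hypothetical and the aligned fragments are toy sequences, not the real CaRabC proteins.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identical residues over aligned columns, skipping gap-gap columns."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


# Toy aligned fragments (hypothetical, NOT the real CaRabC sequences).
aligned = {
    "iso_1": "MGSSKILLIG",
    "iso_2": "MGSSKVLLIG",
    "iso_3": "MASS-VLMIG",
}
for (n1, s1), (n2, s2) in combinations(aligned.items(), 2):
    print(n1, n2, f"{percent_identity(s1, s2):.1f}%")
```

A distance such as 100 minus percent identity, computed for every pair, is the kind of matrix from which a UPGMA tree like the one in Figure 2B can be built.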

#### RT-qPCR and Gene Expression Analysis

Primers for RT-qPCR analysis were designed based on the alignment and comparison of CDS sequences of five identified CaRabC isoforms listed in **Table 2**. To estimate the total expression level of all five CaRabC genes combined, common primers with degenerate nucleotides were designed based on the longest consensus regions in the alignments. In addition, the 3'-ends of gene-specific primers were placed on isoform-specific SNPs to maximize the specificity of qPCR analysis for each of the five isoforms of the CaRabC gene (**Supplementary material 3**).

Initially, the expression level of the CaRabC gene was determined in control plants grown under favorable conditions for all isoforms combined, as well as for each of them separately (**Figure 3A**). All six studied chickpea accessions, 3 Kabuli and 3 Desi (dark green and dark blue, respectively, in **Figure 3A**), showed a very high level of total CaRabC gene expression, ranging from 11.2 to 18.4 relative expression units, with non-significant differences among the six studied genotypes. A single isoform, CaRabC-1c, made the largest contribution (63–88%) to total CaRabC gene expression. Two isoforms, CaRabC-2a and -2a<sup>∗</sup>, both showed very similar levels of 1.9–2.5 expression units. A level of around 1 expression unit was observed in the isoform CaRabC-1a, similar to the average level for the two reference genes used in this study. An extremely low level of expression (approximately 10-fold lower than both reference genes) was shown for the last isoform CaRabC-1b (**Figure 3A**).

FIGURE 2 | A comparison of amino acid sequences (A), and Rooted UPGMA phylogenetic tree with branch length (B) of the five isoforms of CaRabC proteins identified in chickpea. Multiple sequence alignment was conducted and presented using CLC Main Workbench software.

FIGURE 3 | RT-qPCR analysis of CaRabC gene family expression in chickpea leaves: (A) In favorable, non-stressed conditions (Controls) for 3 Kabuli and 3 Desi (dark green and dark blue, respectively); and the relative gene expression compared to Controls under: (B) Gradual salt stress application, 150 mM NaCl, 7 days; (C) Slowly developing drought in pots with soil, 12 days; and (D) Rapid dehydration of detached leaves, 6 h, room temperature. All isoforms of CaRabC gene combined (darker colors) and the five separate isoforms (lighter colors) of the CaRabC-1a, -1b, -1c, -2a and -2a<sup>∗</sup> (for corresponding gene family) were analyzed separately. Each set contained six chickpea accessions, including three Kabuli ecotypes, shown in yellow (1, Yubileiny; 2, ICC-7255; and 3, ICC-4841), and three Desi ecotypes, shown in pink (4, ICC-1392; 5, ICC-4918; and 6, ICC-12726). Data were normalized using an average for two reference genes, calculated with ANOVA, and are presented as means for three biological and two technical replicates ± SE, shown as error bars. Significant differences (at least for P > 0.95) for each gene isoform and within each set of chickpea accessions are shown by different letters according to ANOVA tests.

accessions against those indicated in blue for the two reference accessions.

For salinity stress (**Figure 3B**), a high level of expression of the total CaRabC gene family was observed, with 2–3.3-fold higher expression relative to Controls, but no significant differences were found within each set of six studied accessions due to relatively wide variability between replicates. In all studied genotypes, the isoform CaRabC-1c made the highest contribution to the gene expression (around 1.5–2-fold above the Controls). Only two accessions, No. 2 (ICC-7255, Kabuli) and No. 6 (ICC-12726, Desi), showed a higher level of CaRabC-2a<sup>∗</sup> isoform expression (2.2- and 2.6-fold, respectively) but these data were quite variable. Significant genetic variation was found for expression levels of CaRabC-1a and CaRabC-2a<sup>∗</sup>. Expression levels of two isoforms, CaRabC-1b and CaRabC-2a, did not differ from Controls (**Figure 3B**).

A different expression pattern for the CaRabC gene family was found for the drought experiment, where total expression was down-regulated by 0.3–0.4-fold compared to Controls (**Figure 3C**). The highest contribution to gene expression was made by the isoform CaRabC-1b. There was no significant genetic variation for CaRabC-1a and CaRabC-1b among the studied germplasm while the other three isoforms were quite variable (**Figure 3C**).

In contrast, rapid dehydration of detached leaves resulted in an up to 8-fold increase in expression of the total CaRabC gene family, as well as of isoform CaRabC-1a, compared to controls (**Figure 3D**). With the exception of CaRabC-1b, significant genetic variation was observed among the studied chickpea accessions for all other isoform expression profiles.

#### Amplicon Sequencing Showed an SNP in the Candidate Gene CaRabC-1a

The initial SNP discovered was annotated at position 516 from the start-codon in the identified CDS, LOC101496214, based on the reverse-complement order in the SNP-containing fragment. The full nucleotide sequence of the accession and position of this initial SNP is presented in **Supplementary material 2**.
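Because the SNP-containing fragment lies in reverse-complement orientation relative to the CDS, locating the SNP requires converting both the sequence and the coordinate. The following is a minimal sketch of that conversion; the function names and the eight-base fragment are hypothetical and do not reproduce the actual database record.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def position_on_opposite_strand(pos: int, length: int) -> int:
    """Map a 1-based position in a fragment to the reverse-complemented fragment."""
    return length - pos + 1


# Toy SNP-containing fragment (hypothetical), SNP at 1-based position 3.
fragment = "AACGTTGC"
snp_pos = 3
rc = reverse_complement(fragment)
rc_pos = position_on_opposite_strand(snp_pos, len(fragment))
print(rc, rc_pos, rc[rc_pos - 1])
```

In this toy case the C at position 3 of the fragment becomes a G at position 6 of the reverse complement; the same logic applies when a database SNP reported on one strand is mapped onto a CDS annotated on the other.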

To check for the presence/absence of the initial SNP in the studied chickpea accessions, several pairs of primers were designed flanking the SNP. The most successful primer pair, F5&R5, amplified a fragment of 1148 bp. A fragment of the alignment showing polymorphic amplicon sequences from the studied germplasm compared to two fully sequenced reference chickpea accessions (ICC4958, Desi ecotype and Frontier, Kabuli ecotype) in CaRabC-1a is presented in **Figure 4**. The sequencing of the amplified fragments revealed the presence of 11 new SNPs in two chickpea accessions, Yubileiny and ICC7255, both Kabuli ecotypes (**Table 1**), compared to the two reference accessions. All 11 identified SNPs recorded high scores, and clear nucleotide peaks at the SNP positions were assessed manually. Interestingly, the initial SNP recorded in the database was monomorphic among the two reference accessions and two genotypes sequenced in our study.

FIGURE 5 | Allele discrimination using the Amplifluor-like SNP marker KATU-C22. X- and Y-axes show Relative amplification units, ΔRn, for FAM and VIC fluorescence signals, respectively. Red dots represent homozygote (aa) genotypes with allele 1 (FAM), and blue dots represent homozygote (bb) genotypes for allele 2 (VIC) identified with automatic SNP calling. The black square shows the no template control (NTC) using water instead of template DNA.

#### SNP Screening in CaRabC-1a Using Amplifluor-Like Markers

Allele-specific primers, KATU-C22-F&R, were designed for one of the selected SNPs from the 11 identified in the studied fragment of isoform CaRabC-1a to use with Amplifluor-like genotyping analysis. Details on the design of primers and positions of the studied SNPs are presented in **Supplementary material 4**. The example in **Figure 5** shows allele discrimination using Amplifluor-like SNP marker KATU-C22, where allele 1 (FAM) has been identified in chickpea accessions with SNP genotypes similar to reference accessions ICC4958 and Frontier but allele 2 (VIC) was found in germplasm similar to Yubileiny and ICC7255 (**Figure 5**).
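The logic of allele discrimination from two fluorescence channels can be sketched as follows. This is a simplified illustration and not the algorithm used by the instrument software: the function name, the angle thresholds, the minimum-signal cutoff and the well values are all hypothetical. A heterozygote class is included only for completeness.

```python
from math import atan2, degrees, hypot


def call_genotype(fam: float, vic: float,
                  min_signal: float = 0.5,
                  low_angle: float = 30.0, high_angle: float = 60.0) -> str:
    """Classify one well from its FAM (x) and VIC (y) signals.

    Thresholds are hypothetical illustration values, not instrument settings.
    """
    if hypot(fam, vic) < min_signal:
        return "no call"  # e.g. the no-template control
    angle = degrees(atan2(vic, fam))  # 0 deg = pure FAM, 90 deg = pure VIC
    if angle < low_angle:
        return "aa (allele 1, FAM)"
    if angle > high_angle:
        return "bb (allele 2, VIC)"
    return "ab (heterozygote)"


# Hypothetical end-point signals for three wells.
wells = {"sample_1": (3.2, 0.4), "sample_2": (0.3, 2.9), "NTC": (0.1, 0.1)}
for name, (fam, vic) in wells.items():
    print(name, call_genotype(fam, vic))
```

Classifying by angle rather than by raw signal makes the call robust to differences in overall amplification between wells, which is one reason manual inspection of the cluster plot remains a useful check on automatic calls.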

### DISCUSSION


Rab-GTP proteins are well known in oncology studies in humans and animals, but in plants there is increasing evidence that they play a central role in the tolerance to abiotic and biotic stresses. Nevertheless, it appears that the mechanism of membrane trafficking with which they are associated is similar in cells of both humans and plants. Most Rab genes of the eight clades represented in the molecular phylogenetic tree in plants have similar corresponding groups of genes in human and other animal genomes. A greater or lesser diversity of isoforms for each clade of Rab genes simply reflects the differing outcomes of evolution in the plant and animal kingdoms.

In plants, the most studied groups of Rab genes are from Clades G and H, where multiple vacuolar trafficking pathway components were demonstrated (Vernoud et al., 2003; Peng et al., 2014; Uemura and Ueda, 2014; Brillada and Rojas-Pierce, 2017). These types of Rab genes encode proteins that have been associated with a response to salinity and osmotic stresses, and are thought to associate with pre-vacuolar vesicles. Thus, Rab proteins may enhance relocation of Na<sup>+</sup> ions to the vacuole, after they reach a toxic level in the cytoplasm of cells. Whilst there has been less attention placed on other groups of Rab genes, including the diverse Clade A with its many isoforms and the non-diverse Clade B with only two gene members, there is practically nothing known about Clade C of Rab in plants (Vernoud et al., 2003; Jha et al., 2014; Rehman and Sansebastiano, 2014; Lawson et al., 2018). Despite the strong similarity between A. thaliana and C. arietinum, our bioinformatic results show significant differences in the number of Rab isoforms in most clades.

In the work described here, 54 isoforms of CaRab genes were identified in chickpea, indicating an evolutionary reorganization when compared to A. thaliana, where 57 AtRab isoforms have been identified (Vernoud et al., 2003). Clade C in the chickpea dendrogram has not been previously identified, described or studied, and contains the five isoforms: CaRabC-1a, -1b, -1c, -2a and -2a<sup>∗</sup> . The first three isoforms show similarity to AtRabC-1 (At1g43890, **Table 1**) while the latter two isoforms in chickpea were similar to another single isoform AtRabC-2a (At5g03530). The isoform AtRabC-2b (At3g09910), listed in a comprehensive analysis of the Rab genes in A. thaliana (Vernoud et al., 2003), has no ortholog in the C. arietinum genome. To avoid any misunderstanding with the classification of CaRabC-2a and -2a<sup>∗</sup> isoforms, we have used an asterisk instead of another letter, to indicate its very similar polypeptide structure.

Following the bioinformatics study, the expression analyses of total CaRabC for all five isoforms revealed high levels of expression of the gene family in leaves of non-stressed young chickpea plants compared to two reference genes (**Figure 3A**). More importantly, a single isoform, CaRabC-1c, made the major contribution to the gene expression, indicating a very active role of this isoform in chickpea plant development under nonstressed conditions. In the absence of other reports comparing expression of individual and combined (bulk) isoforms of Rab genes in plants, our conclusions await further verification and discussion.

Under salt stress, the dominance of the CaRabC-1c isoform in expression profiles was not as pronounced as under control conditions and was more comparable to other isoforms in some of the studied chickpea accessions, particularly CaRabC-1a and CaRabC-2a<sup>∗</sup> . Therefore, at least three isoforms of CaRabC were salinity-responsive and the two latter ones were strongly genotype-dependent (**Figure 3B**).

An unexpected result was found in the comparison of CaRabC gene expression in response to slowly progressing drought of whole plants and rapid dehydration of detached leaves. Only a few reports have described expression of different genes in parallel experiments with drought and dehydration. For example, a peroxisomal isoform of APX, Ascorbate peroxidase, was downregulated under strong drought but up-regulated in desiccated leaves in a cultivar of cowpea, Vigna unguiculata (D'Arcy-Lameta et al., 2006). Similar results were reported for two genes associated with loss of water during slow drought progression compared to rapid dehydration of barley leaves: HvMT2, a metallothioneinlike protein, and 2HvLHCB, Chlorophyll a-b binding protein of LHCII type III (Gürel et al., 2016). Therefore, there are examples of genes related to drought and dehydration that can be down- and up-regulated, in several plant species. However, our results show for the first time that all isoforms of CaRabC were strongly down-regulated under the slowly developing drought, but very strongly up-regulated in rapidly dehydrated leaves (**Figures 3C,D**).

Amplifluor-like SNP markers and other molecular markers are very helpful in identifying genetic polymorphisms in diverse germplasm accessions. In the current study, the molecular marker KATU-C22 was useful for genotyping one isoform CaRabC-1a (**Figure 5**). This allows for tracking of the different variants of this gene and the possibility of linking variants with an associated phenotype. Additional markers are now needed for all other isoforms of CaRabC and other GoI, but this will require further investment in sequencing in the future. It also may be worth looking for SNPs in the upstream promoter regions of the gene family, since this could explain the variation in expression between the genotypes.

CaRabC is just one sub-family from a large CaRab gene family involved in controlling cell membrane trafficking, and like the other Rab genes investigated to date (reviewed in Flowers et al., 2018), it is responsive and potentially associated with the adaptation of plants to abiotic stresses. For comparison, in the bacteria Salmonella, the Rab18 protein (related to RabC in plants) is actively involved in endocytosis and is localized in the early endocytic compartment of cells (Hashim et al., 2000). In plants, there is increasing evidence for the role of endocytosis under salinity and osmotic stress (Martín-Davison et al., 2017). The implication of increased endocytosis during these stresses would be a reduction in total plasma-membrane area, thereby limiting water loss from the cell through a decrease in the number of aquaporins. Additionally, it may represent a mechanism to obtain Na<sup>+</sup> ions directly from outside the cell for accumulation in the vacuole, thus keeping the cytoplasmic level of Na<sup>+</sup> low (Baral et al., 2015). In future work, we hope to explore the role of CaRabC in endocytosis and Na<sup>+</sup> compartmentalization. There has been very little work published to date concerning endocytosis and extended drought. The different responses in expression observed in this study between salinity and dehydration (both components of osmotic stress) are intriguing and probably indicative of the underlying biological role of RabC proteins themselves.

Further research is required in several selected chickpea accessions to assess tolerance to salinity, drought and rapid dehydration. This would allow us to explore possible associations between sequence variants and levels of stress tolerance. The genotype-dependent role of each isoform of CaRabC as well as other genes from the gene family will be studied, and we plan to carry out these experiments in the near future. These new experiments should elaborate on the mechanism and clarify the suggested roles of these proteins in cell polarization and recycling to the plasma membrane, as suggested by Vernoud et al. (2003) and Rutherford and Moore (2002), respectively. Hopefully, our study of CaRabC extends the knowledge of Rab gene family structure and function in plants.

#### AUTHOR CONTRIBUTIONS

GK conducted the experiments with chickpea germplasm and the genotyping with Amplifluor-like SNP analysis. AK and SJ supervised the experiments and interpreted the results. AsZ conducted the experiments with plant stresses and sampling. AyZ carried out sequencing. AT worked with plants in the field trial. BA coordinated the experiments in the field. SL analyzed gene sequences in databases and wrote the corresponding sections. CS analyzed the results, and revised and edited the manuscript. CJ analyzed the qRT-PCR data and revised the corresponding section. KS coordinated the qRT-PCR study and revised other sections. PL supervised the study and revised the final version of the manuscript. YS coordinated all experiments and wrote the first version of the manuscript.

#### FUNDING

This study was supported by the Ministry of Education and Science (Kazakhstan), Research Program BR05236500 (SJ).

#### ACKNOWLEDGMENTS

We would like to thank the staff and students of S. Seifullin Kazakh AgroTechnical University, Astana (Kazakhstan) and Flinders University, SA (Australia) for their support in this research and help with critical comments to the manuscript. The results of this study were presented at the International Conference 'Bioinformatics and Computational Biology', August 2018, Novosibirsk, Russia. The Authors acknowledge the Organizing Committee for their support in the presentation and publication of this work.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00040/full#supplementary-material

#### REFERENCES

protein AtRab7 (AtRabG3e). Plant Physiol. 134, 118–128. doi: 10.1104/pp.103.025379



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Khassanova, Kurishbayev, Jatayev, Zhubatkanov, Zhumalin, Turbekova, Amantaev, Lopato, Schramm, Jenkins, Soole, Langridge and Shavrukov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Using Ancestry Informative Markers (AIMs) to Detect Fine Structures Within Gorilla Populations

#### Ranajit Das <sup>1</sup> \*, Ria Roy <sup>2</sup> and Neha Venkatesh<sup>3</sup>

*<sup>1</sup> Manipal Centre for Natural Sciences, Manipal Academy of Higher Education, Manipal, India, <sup>2</sup> Department of Biotechnology Engineering, Sahrdaya College of Engineering and Technology, Kodakara, India, <sup>3</sup> Department of Genetics, University of Mysore, Mysore, India*

Knowledge of ancestral origin is crucial in the conservation of endangered animals since it can aid in the preservation of population-level genetic integrity and prevent inbreeding among related individuals. Despite the maintenance of studbooks, the biogeographical affiliation of most captive gorillas is largely unknown, which has constrained the management of captive gorillas aimed at maximizing genetic diversity at the population level. In recent years, ancestry informative markers (AIMs) have been successfully employed for the inference of genomic ancestry in a wide range of studies in evolutionary genetics, biomedical research, genetic stock identification, introgression analysis, and forensic analysis. In this study, we sought to derive the AIMs yielding the most cohesive and faithful understanding of the biogeographical affiliation of query gorillas. To this end, we compared three commonly used AIMs-determining methods, namely Infocalc, *FST*, and Smart Principal Component Analysis (SmartPCA), with ADMIXTURE, using gorilla genome data available through the Great Ape Genome Project database. Our findings suggest that the SNPs detected by at least three of the four AIMs-determining approaches (*N* = 1,531) are likely the most suitable for delineation of gorilla AIMs. This panel recapitulated the finer structure within western lowland gorilla genomes with a high degree of precision. We further validated the robustness of our results using a randomized negative control containing the same number of SNPs. To the best of our knowledge, this is the first report of an AIMs panel for gorillas that may aid in developing cost-effective resources for large-scale demographic analyses and greatly help in the conservation of this charismatic megafauna.

Keywords: ancestry informative marker (AIM), gorilla ancestry, conservation genetic management, admixture, informativeness of SNPs

#### Edited by:

*Yuriy L. Orlov, Russian Academy of Sciences, Russia*

#### Reviewed by:

*GaneshPrasad Arun ArunKumar, SASTRA University, India Luciana Werneck Zuccherato, Universidade Federal de Minas Gerais, Brazil*

> \*Correspondence: *Ranajit Das ranajit.das@manipal.edu*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

> Received: *03 September 2018* Accepted: *21 January 2019* Published: *08 February 2019*

#### Citation:

*Das R, Roy R and Venkatesh N (2019) Using Ancestry Informative Markers (AIMs) to Detect Fine Structures Within Gorilla Populations. Front. Genet. 10:43. doi: 10.3389/fgene.2019.00043*

#### BACKGROUND

Effective conservation of endangered animals with unknown ancestral origin entails delineation of the biogeographic affinities of their ancestors in order to facilitate preservation of the population level integrity of genomic signal. The knowledge of ancestral origin could be particularly relevant for planned re-introduction of animals to wild habitats and management of captive breeding programs in order to avoid inbreeding depression.

Gorillas, the largest living apes, were listed as critically endangered on the IUCN Red List in 2007 (Walsh et al., 2008). Since the gorilla population is rapidly dwindling in the wild as a result of severe habitat encroachment and the illegal bushmeat trade, effective management of captive breeding programs has become crucial both to increase their numbers and to protect them from inbreeding. Overall, 283 wild gorillas were imported to North America until the 1970s; imports subsequently stopped owing to the introduction of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) in 1975 (Nsubuga et al., 2010). It is noteworthy that, despite the maintenance of studbooks, insufficient information is available on the biogeographic origin of the majority of captive gorillas in the USA (Wharton, 2009), which has likely constrained management of captive gorillas aimed at maximizing genetic diversity at the population level. Proper knowledge of ancestry is of great importance in captive breeding programs of gorillas in order to avoid inbreeding depression and at the same time to conserve the genomic integrity of the native gorilla populations.

While whole genome approaches can efficiently resolve the biogeographical affiliation of gorillas by measuring genomic ancestry and the level of admixture among various gorilla populations, they are not cost-effective and depend on the quality of DNA samples: lower-quality DNA (such as DNA extracted through non-invasive techniques) can hamper genome re-sequencing to a considerable extent. A cost-effective alternative to whole genome approaches is the estimation of genomic ancestry using a small set of highly informative Single Nucleotide Polymorphisms (SNPs), ranging from a few hundred to a few thousand. These highly informative SNPs, which exhibit large differences in allele frequencies between ancestral populations, are commonly referred to as Ancestry Informative Markers (AIMs) (Rosenberg et al., 2003; Shriver et al., 2003; Nassir et al., 2009).

Over the years AIMs panels have been successfully used for inferring biogeographical ancestry of humans (Rosenberg et al., 2003; Shriver et al., 2003; Kosoy et al., 2009; Nassir et al., 2009; Kidd et al., 2011; Tandon et al., 2011; Galanter et al., 2012; Huckins et al., 2014; Vongpaisarnsin et al., 2015), detection of illegal trade and translocation of wild animals (Frantz et al., 2006), food forensics (Wilkinson et al., 2012), genetic stock identification and introgression analysis (Munoz et al., 2015), forensic analysis (Phillips et al., 2016) to name a few. Recently, 9,000 genetic markers have been identified which are unique to a specific subspecies of chimpanzee and gorilla, and around 40,000 markers have been detected that are specific to each hominoid species or lineage (Hormozdiari et al., 2013).

In this study, we compared three strategies previously used for AIMs determination, namely the Infocalc algorithm (Paschou et al., 2007; Kosoy et al., 2009), Wright's FST (Tian et al., 2007; Kidd et al., 2011; Nievergelt et al., 2013), and Smart Principal Component Analysis (SmartPCA) (Patterson et al., 2006), with a novel ADMIXTURE-based approach (Alexander et al., 2009). We used them to interrogate previously published whole genome data of 31 gorillas available in the Great Ape Genome Project (GAGP) (Prado-Martinez et al., 2013), corresponding to two subspecies of western gorillas (Gorilla gorilla), namely the western lowland gorilla (Gorilla gorilla gorilla) and the Cross River gorilla (Gorilla gorilla diehli), as well as the eastern lowland gorilla (Gorilla beringei graueri). Our aim was to delineate an AIMs panel that can reproducibly capture the genomic ancestry of gorillas at the population level and aid in identification of gorillas at the individual level.

We performed our analysis in three steps. In the first step we evaluated the performance of the four AIMs determining approaches (Wright's FST, Infocalc, SmartPCA and ADMIXTURE) by comparing them with complete SNP sets (CSS). Subsequently, we developed a consensus dataset, incorporating the SNPs that are common to at least three of the four AIMs-determining strategies. Finally, we developed a negative control dataset (randomly chosen SNPs from CSS) containing the same number of SNPs as the consensus dataset and re-evaluated the performance of the consensus dataset and four AIMs determining approaches. The consideration of the consensus SNPs as the AIMs panel for gorilla was robust since it balanced out the limitations of each individual AIMs determining method and at the same time recapitulated the ancestry information of query gorillas with high precision.

## METHODS

### Dataset

The dataset employed in this study comprised 31 gorilla genomes available in GAGP, which overall sequenced 79 great ape individuals to a mean coverage of 25X on an Illumina HiSeq 2000 platform (Prado-Martinez et al., 2013; Das and Upadhayai, 2018): western lowland gorilla (Gorilla gorilla gorilla, N = 27), eastern lowland gorilla (Gorilla beringei graueri, N = 3), and Cross River gorilla (Gorilla gorilla diehli, N = 1). As indicated previously (Prado-Martinez et al., 2013; Das and Upadhayai, 2018), the western lowland gorilla genomes employed in this study belong to three distinct wild populations: Cameroonian, Congolese, and Equatorial Guinean. The biogeographical origin of each gorilla genome, as recorded in the studbook and as predicted by the Geographical Population Structure (GPS) algorithm, is given in **Supplemental Table 1**. We used the same dataset of 354,080 markers that was recently used for tracing the ancestry of gorillas (Das and Upadhayai, 2018).

### Population Clustering and Admixture Analysis Employing the CSS

Principal component analysis (PCA) was performed in PLINK v1.9 using the --pca command. The ancestry of the gorilla genomes was estimated using unsupervised clustering as implemented in ADMIXTURE v1.3 (Alexander et al., 2009). Similar to our recent study (Das and Upadhayai, 2018), we chose K = 3 for all downstream analysis to differentiate the western gorilla genomes into the Congolese and Cameroonian clusters and detection of AIMs for identification of genomic ancestry of gorillas at the population level. PCA and Admixture plots were generated in R v3.2.3.

#### Determination of AIMs

In order to deduce the SNP markers that are able to infer the genomic ancestry of gorilla samples with accuracy comparable to that of the CSS of 354,080 SNPs, we evaluated four AIMs determining approaches enumerated below.

#### 1. Infocalc

The first method employed was the Infocalc algorithm (Rosenberg et al., 2003), implemented in Infocalc v1.1, which determines the amount of information multiallelic markers provide regarding an individual's ancestry by calculating the informativeness (I) of each marker individually. Infocalc determines I based on the mathematical expression described previously (Rosenberg et al., 2003):

$$I = \sum_{j=1}^{N} \left( -p_j \log p_j + \sum_{i=1}^{K} \frac{p_{ij}}{K} \log p_{ij} \right)$$

where N is the number of alleles at the marker, p<sub>j</sub> is the mean frequency of allele j over all populations, p<sub>ij</sub> is the relative frequency of allele j in population i, and K is the total number of populations.

Infocalc v1.1 compatible input files were generated by adding the structure modifier to the PLINK v1.9 --recode command. From the Infocalc v1.1 output file, we selected the top 10,000 most informative markers based on the informativeness column (I\_n) (**Supplemental Figure S1**).
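As an illustration only, the informativeness statistic of Rosenberg et al. (2003) and the top-ranked marker selection can be sketched in Python. This is not the Infocalc v1.1 implementation; the function names `informativeness` and `top_markers` and the toy allele frequencies are hypothetical, natural logarithms are assumed, and terms with zero frequency are taken to contribute zero.

```python
import math

def informativeness(freqs):
    """Informativeness I of one marker.

    freqs[i][j] is the relative frequency of allele j in population i
    (K populations, N alleles). Zero frequencies contribute 0 (assumption:
    the usual 0 * log 0 = 0 convention).
    """
    K = len(freqs)
    N = len(freqs[0])
    total = 0.0
    for j in range(N):
        # Mean frequency of allele j over all populations.
        p_j = sum(freqs[i][j] for i in range(K)) / K
        if p_j > 0:
            total -= p_j * math.log(p_j)
        for i in range(K):
            p_ij = freqs[i][j]
            if p_ij > 0:
                total += (p_ij / K) * math.log(p_ij)
    return total

def top_markers(marker_freqs, n):
    """Return the n marker IDs with the highest informativeness."""
    scores = {m: informativeness(f) for m, f in marker_freqs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical biallelic markers in two populations.
toy = {
    "snp_a": [[1.0, 0.0], [0.0, 1.0]],   # fixed for different alleles
    "snp_b": [[0.5, 0.5], [0.5, 0.5]],   # identical frequencies
    "snp_c": [[0.9, 0.1], [0.2, 0.8]],
}
ranked = top_markers(toy, 2)
```

Under these assumptions a marker fixed for different alleles in two populations reaches the maximum value log K, whereas a marker with identical frequencies in all populations scores zero, which is why ranking by I favors markers with large between-population frequency differences.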

#### 2. Wright's FST

FST (Sewall Wright, 2006) measures the degree of differentiation among populations, likely arising from genetic structure within them. Given a set of populations, PLINK estimated the fixation index (FST) separately for each of the 354,080 markers under evaluation in this study using the --fst command. The Family ID (FID) was used as the indicator of the geographical affinity of the gorilla genomes to different wild populations as mentioned previously (Prado-Martinez et al., 2013) and/or estimated through our recent biogeographical analysis (Das and Upadhayai, 2018).

The 10,000 SNPs with highest FST values were selected for subsequent analyses (**Supplemental Figure S2**).
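To make the quantity concrete, a minimal sketch of a per-SNP fixation index is given below. It uses the textbook definition FST = (H<sub>T</sub> − H<sub>S</sub>)/H<sub>T</sub> for a biallelic marker with equally weighted populations, which is a simplification and, to our understanding, not the estimator that PLINK reports; the function name `wright_fst` and the example frequencies are hypothetical.

```python
def wright_fst(pop_freqs):
    """Textbook FST for one biallelic SNP.

    pop_freqs holds the frequency of one allele in each population.
    Populations are weighted equally (assumption; no sample-size correction).
    """
    k = len(pop_freqs)
    p_bar = sum(pop_freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0                         # monomorphic marker: no signal
    h_s = sum(2 * p * (1 - p) for p in pop_freqs) / k   # mean within-population
    return (h_t - h_s) / h_t

fixed_difference = wright_fst([1.0, 0.0])  # populations fixed for different alleles
no_difference = wright_fst([0.3, 0.3])     # identical frequencies
```

A marker fixed for alternative alleles in two populations yields FST = 1 and a marker with identical frequencies yields 0, so ranking by FST, like ranking by informativeness, favors SNPs with large allele frequency differences between populations.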

#### 3. ADMIXTURE

Analyzing the ADMIXTURE output file with SNP information (P file) for K of 3, we identified 10,662 SNPs with high K (column to column) variance (≥ 0.15).
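The filtering step can be sketched as follows, assuming the P file holds one row per SNP with K whitespace-separated allele frequencies, one per inferred ancestral population. The function name `high_variance_snps`, the SNP identifiers, and the toy rows are hypothetical, and the use of the sample variance (rather than the population variance) is an assumption, since the text does not specify which was used.

```python
import statistics

def high_variance_snps(p_lines, snp_ids, threshold=0.15):
    """Return IDs of SNPs whose across-column variance meets the threshold.

    p_lines: rows of an ADMIXTURE-style P file (K frequencies per row).
    snp_ids: marker IDs in the same order as the rows.
    """
    selected = []
    for snp_id, line in zip(snp_ids, p_lines):
        freqs = [float(x) for x in line.split()]
        # Sample variance across the K ancestral populations (assumption).
        if statistics.variance(freqs) >= threshold:
            selected.append(snp_id)
    return selected

# Hypothetical P-file rows for K = 3.
rows = ["0.99 0.01 0.50", "0.40 0.45 0.50", "0.05 0.95 0.90"]
picked = high_variance_snps(rows, ["rs1", "rs2", "rs3"])
```

Because a threshold rather than a fixed rank cut-off is applied, the number of selected SNPs is not predetermined, which is consistent with this approach yielding 10,662 rather than exactly 10,000 markers.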

#### 4. SmartPCA

In order to determine the most informative markers, SNP weightings for each principal component (PC) were calculated using the "SmartPCA" algorithm implemented in EIG v7.2.1 (Patterson et al., 2006; Price et al., 2006). SmartPCA, which is especially designed for analysis of genomic data, employs PCA to determine whether the test samples come from one homogenous population or there is any signature of population structure and outputs principal components (eigenvectors) and eigenvalues. In addition to these two files SmartPCA generates a "snpwt" file, depicting the weight of all 354,080 markers for each principal component.

The 10,000 SNPs with the highest "weights" for the first principal component (PC1) were selected for subsequent analyses (**Supplemental Figure S3**).

#### Estimation of Candidate AIMs Panels

To determine the optimal AIMs-determining strategy for gorilla genomes, we first compared the datasets comprising the top 10,000 SNPs generated through FST, Infocalc, and SmartPCA and the 10,662 SNPs detected through ADMIXTURE, both qualitatively (via Admixture analysis and PCA) and quantitatively (by computing the Euclidean distances between the admixture components of the query datasets and the CSS).

We then developed a consensus dataset containing SNPs that are common to all four AIMs-determining strategies (FST, Infocalc, Admixture, and SmartPCA-based). Only 37 SNPs were found to be common to all four approaches evaluated in this study, which was insufficient to recapitulate the intraspecific ancestry information of the query gorillas (data not shown). Therefore, to generate a consensus SNP panel that is likely to be sufficient to detect the fine structure within western gorilla populations, we developed a dataset comprising the 1,531 SNPs that were common to at least three of the four AIMs-determining methods (**Supplemental Figure S4**). Finally, to assess the predictive accuracy of the candidate AIMs datasets, we developed a negative control dataset by randomly sampling 1,531 SNPs from the CSS and compared it with the datasets comprising the top 1,531 SNPs extracted through the FST, Infocalc, Admixture, and SmartPCA-based methods, and with the consensus.
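The consensus and negative control construction described above can be sketched as follows. The function names `consensus_panel` and `random_control`, the fixed random seed, and the toy SNP identifiers are hypothetical; this is a minimal illustration of the set logic, not the scripts used in the study.

```python
import random
from collections import Counter

def consensus_panel(panels, min_methods=3):
    """SNPs selected by at least `min_methods` of the candidate panels."""
    counts = Counter(snp for panel in panels for snp in set(panel))
    return {snp for snp, c in counts.items() if c >= min_methods}

def random_control(all_snps, size, seed=0):
    """Negative control: `size` SNPs drawn at random from the full set."""
    rng = random.Random(seed)               # seeded for reproducibility
    return set(rng.sample(sorted(all_snps), size))

# Hypothetical panels from four AIMs-determining methods.
fst = {"s1", "s2", "s3"}
infocalc = {"s1", "s2", "s4"}
admixture = {"s1", "s3", "s4"}
smartpca = {"s1", "s2", "s5"}

consensus = consensus_panel([fst, infocalc, admixture, smartpca])
strict = consensus_panel([fst, infocalc, admixture, smartpca], min_methods=4)
control = random_control(fst | infocalc | admixture | smartpca, len(consensus))
```

Relaxing the requirement from four to three methods enlarges the panel (from 37 to 1,531 SNPs in this study), and sizing the random control to match the consensus keeps the comparison between the two panels independent of marker count.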

#### RESULTS

## ADMIXTURE Analyses

#### Qualitative Analysis

The ancestry of 31 gorilla genomes was estimated using unsupervised clustering as implemented in ADMIXTURE v1.3 (Alexander et al., 2009). For CSS, at K = 3 the eastern lowland gorillas were homogeneously assigned to a unique cluster (blue) while most western gorillas appeared to be a genomic admixture of Cameroonian (green) and Congolese (red) components in varying proportions (**Figure 1A**, **Supplemental Figure S5A**). While the entire genomes of Akiba-Beri, Choomba, Paki, Oko, Kolo and Amani consisted of the Cameroonian admixture component, Katie (B650) and Katie (KB4986) also appeared to be pure-bred, with genomes composed entirely of the Congolese admixture component.

At K = 3, the dataset comprising the top 10,000 Infocalc SNPs (Infocalc-10,000) performed the best, precisely capturing the population structure of gorilla genomes as depicted by the CSS. It homogenously assigned Akiba-Beri, Choomba, Paki, Oko, Kolo and Amani to Cameroon and the Katies (B650 and KB4986) to Congo. Further, similar to the CSS, this dataset revealed fractions of eastern lowland ancestry (blue) in Kokomo, Mimi, Delphi, Coco, Carolyn, and Porta. However, unlike the CSS, Infocalc-10,000 revealed minor fractions (<1%) of eastern lowland ancestry in Kowali and Azizi (**Figure 1B**, **Supplemental Figure S5B**).

admixture analysis at *K* = 3 using ADMIXTURE v1.3 and plotted in R v3.2.3. Each individual is represented by a vertical line partitioned into colored segments whose lengths are proportional to the contributions of the ancestral components to the genome of the individual. Blue represents eastern lowland ancestry component while green and red represent Cameroonian and Congolese ancestral components, respectively.

The dataset comprising the top 10,662 Admixture SNPs (Admixture-10,000) appeared to be the second best. In concordance with CSS, Admixture-10,000 homogenously assigned Akiba-Beri, Choomba, Oko and Amani to Cameroon and the Katies (B650 and KB4986) to Congo. However, unlike the CSS, this dataset depicted ∼2, 3, and 4% Congolese ancestral component in the Cross River gorilla Nyango, Kolo and Paki, respectively, and eastern lowland ancestral component in Helen and Anthal, which can be attributed to the likely loss of resolution (**Supplemental Figure S5C**).

The remaining two datasets, comprising 10,000 SNPs generated using SmartPCA and FST-based approaches (SmartPCA-10,000 and FST-10,000, respectively), performed moderately. While SmartPCA-10,000 homogenously assigned Akiba-Beri, Choomba, Paki, Oko, Kolo and Amani to Cameroon and the Katies (B650 and KB4986) to Congo, it additionally assigned Delphi, Carolyn and Porta homogenously to Congo and thus failed to capture their discernible proportions of Cameroonian ancestry (**Supplemental Figure S5D**). Among the four approaches, FST-10,000 performed the worst. In addition to incorrectly assigning Delphi, Carolyn and Porta homogenously to Congo, FST-10,000 revealed Congolese ancestry in Kolo, Akiba-Beri and Paki, which were otherwise homogenously assigned to Cameroon by all AIMs-determining approaches (**Supplemental Figure S5E**).

Among the datasets comprising the top 1,531 SNPs deduced via FST, Infocalc, Admixture, and SmartPCA, the 1,531 SNPs derived using Infocalc (Infocalc-1,531) were superior to the rest and most comparable to the CSS in recapitulating the population structure of the query gorillas (**Figure 1B**). This panel was closely followed by the panel of 1,531 SNPs generated as a consensus of at least three of the four AIMs-determining strategies (Consensus-1,531) (**Figure 1F**) and by the SNPs detected using Admixture (Admixture-1,531) (**Figure 1C**). Here we note that, among all 1,531-SNP datasets, only Consensus-1,531 and Infocalc-1,531 captured the eastern lowland ancestry in the Cross River gorilla, Nyango, as revealed by the CSS. In contrast, the SNP panels inferred using SmartPCA (SmartPCA-1,531) and FST (FST-1,531) completely failed to capture the population structure revealed by the CSS (**Figures 1D,E**). Finally, the negative control dataset comprising 1,531 random SNPs (Random-1,531) was, as expected, unsuccessful in capturing the ancestry information of the query gorillas, underscoring the superiority of the AIMs over randomly selected markers in delineating ancestry information (**Figure 1G**).

#### Quantitative Analysis

For comparing the test datasets quantitatively, we computed Euclidean distances between the three admixture components (eastern lowland, Cameroonian and Congolese) of all datasets and the CSS. The shortest mean Euclidean distance (µ = 0.022) was found between Admixture-10,000 and the CSS, closely followed by Infocalc-10,000 and the CSS (µ = 0.064) (**Figure 2**). Among other 10,000 SNP panels, the longest Euclidean distance was found between the CSS and FST-10,000, followed by the CSS and SmartPCA-10,000 (0.154 and 0.108, respectively).
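The distance computation can be sketched as follows, assuming each individual is described by its three admixture proportions and that the components of every panel have already been matched to those of the CSS (ADMIXTURE does not guarantee a consistent column order between runs). The function names and the toy proportions are hypothetical.

```python
import math
from statistics import mean

def euclidean(a, b):
    """Euclidean distance between two admixture-proportion vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distances_to_reference(panel_q, reference_q):
    """Per-individual distance between a panel and the reference (CSS).

    Both arguments map individual -> (eastern, Cameroonian, Congolese)
    proportions, with components in the same order (assumption).
    """
    return [euclidean(panel_q[ind], reference_q[ind]) for ind in reference_q]

# Hypothetical proportions for two individuals.
css = {"g1": (0.0, 1.0, 0.0), "g2": (0.0, 0.5, 0.5)}
panel = {"g1": (0.0, 1.0, 0.0), "g2": (0.0, 0.2, 0.8)}

dists = distances_to_reference(panel, css)
mean_dist = mean(dists)
```

The mean (or median) of these per-individual distances then summarizes how closely a reduced panel reproduces the ancestry proportions obtained with the complete SNP set, with smaller values indicating closer agreement.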

Among the 1,531 panels, the shortest distance was revealed between Admixture-1,531 and the CSS (µ = 0.059). Consensus-1,531 appeared as the second most sensitive approach (µ = 0.087), closely followed by Infocalc-1,531 (µ = 0.095). All three aforesaid 1,531 panels highly significantly outperformed all the remaining datasets including the random dataset (Tukey's post hoc test; p-value < 0.0001). Congruent with our results from qualitative analyses in their inability to capture the accurate population structure for query gorilla genomes, the SmartPCA and FST-based datasets appeared to be the farthest from the CSS (µ = 0.75 in both cases) and performed similar to the Random-1,531 dataset (Tukey's post hoc test; p-value = 0.94 and 0.95, respectively). Here further we note that, although Admixture-1,531 had the shortest mean Euclidean distance from the CSS, its performance was statistically very similar to Consensus-1,531 and Infocalc-1,531 (Tukey's post hoc test; p-value = 0.99).

Overall, our results indicate that while Infocalc-1,531 turned out to be the best method in the qualitative ADMIXTURE analysis, Admixture-1,531 was superior to all other approaches in the quantitative analysis. However, in both cases, Consensus-1,531 was a close second and its performance was statistically similar to the other two. Additionally, Consensus-1,531 had a discernibly smaller median Euclidean distance from the CSS (0.032) than both Infocalc-1,531 (0.078) and Admixture-1,531 (0.043), which further supports its candidacy as the AIMs panel for gorillas.

#### Principal Component Analysis (PCA)

Principal Component Analysis (PCA) was performed in PLINK v1.9 and the top two PCs were plotted in R v3.2.3. The PCA results for the CSS were consistent with previous observations of an eastern gorilla-western gorilla contrast along the horizontal principal component (PC1) and a vertical delineation (PC2) among western gorilla genomes (Prado-Martinez et al., 2013; Das and Upadhayai, 2018) (**Figure 3A**, **Supplemental Figure S6A**). Further, as observed previously, two distinct clusters were found among western gorillas along PC2: one predominantly composed of Cameroonian gorillas and the other predominantly of Congolese gorillas. Also, as found previously, Coco, the only Equatorial Guinean gorilla employed in our study, clustered with the Cameroonian gorillas owing to its genomic proximity to the latter (Das and Upadhayai, 2018).

Similar to ADMIXTURE analysis, Infocalc-10,000 (**Supplemental Figure S6B**) and Admixture-10,000 (**Supplemental Figure S6C**) best replicated the population clusters depicted by CSS-based dataset (**Supplemental Figure S6A**) with high precision. Both datasets successfully recapitulated the overlap of some of the Cameroonian and Congolese gorillas at the center of PC2 and the genomic proximity of the cross river gorilla Nyango to Cameroonian gorillas. Among the remaining datasets, SmartPCA-10,000 could recapitulate the overlap of Cameroonian and Congolese gorillas along PC2, but it failed to recapture the high genomic proximity of Nyango with Cameroonian gorillas as depicted by the CSS (**Supplemental Figure S6D**). Finally, FST-10,000 portrayed two distinct clusters of Cameroonian and Congolese gorillas and failed to replicate the overlap of some of the Cameroonian and Congolese gorillas at the center of the vertical principal component (PC2) (**Supplemental Figure S6E**).

Among the 1,531 SNP panels, Infocalc-1,531 was superior to all other AIMs-determining strategies in replicating the population structure of query gorillas depicted by the CSS (**Figure 3B**). Coherent with the ADMIXTURE analysis, Consensus-1,531 turned out to be the second best (**Figure 3F**), followed by Admixture-1,531 (**Figure 3C**). Among the remaining datasets, SmartPCA-1,531 and FST-1,531 performed discernibly worse and completely failed to depict any contrast among the western gorilla genomes along PC2 (**Figures 3D,E**). Finally, in concordance with the ADMIXTURE analysis, Random-1,531 was completely unsuccessful in capturing population structure of all query gorillas, such that it even failed to depict the eastern gorilla-western gorilla contrast along the horizontal principal component (PC1) (**Figure 3G**). The failure of the random dataset once again underscored the superiority of the AIMs over randomly selected markers in portraying population structure of query genomes.

Taken together, our analyses revealed that while Infocalc performed better than other approaches in the qualitative analysis, the Admixture-based approach turned out to be the best in the quantitative analysis. This indicates that no single AIMs-determining strategy may be sufficient to recapitulate the ancestry information of gorillas. We therefore propose that Consensus-1,531, which performed consistently well in both qualitative and quantitative analyses (ranked 2nd in both), should be adopted as the AIMs panel for gorillas, as it emerged as the smallest set of SNPs that delineates the ancestry information and population structure of gorillas with optimum precision. Further, we have generated a set of the 262 most informative SNPs from the 1,531 AIMs panel, which can be detected through common genotyping techniques and are powerful enough to detect fine structure within gorilla populations (**Supplemental Table 2**).

#### DISCUSSION

Over the years, gorillas, with dwindling population sizes and an increasingly reduced and restricted distribution in the wild, have faced serious threats to their survival. As a consequence, conservation of wild as well as captive gorillas and preservation of unique gorilla gene pools have garnered a lot of attention in recent years. Gorilla breeding programs, which aim to increase genetic diversity in order to avoid inbreeding depression, have been restricted by insufficient information about the ancestry of the gorillas (Wharton, 2009; Nsubuga et al., 2010; Simons et al., 2012; Prado-Martinez et al., 2013). Hence, the determination of the biogeographical affiliation of gorillas can be invaluable for fostering their population-level (intra-specific) management and the preservation of unique gorilla gene pools.

In this study we sought to compare three strategies previously used for AIMs determination, namely the Infocalc algorithm (Paschou et al., 2007; Kosoy et al., 2009), Wright's FST (Tian et al., 2007; Kidd et al., 2011; Nievergelt et al., 2013), and Smart Principal Component Analysis (SmartPCA) (Patterson et al., 2006), with a novel ADMIXTURE-based approach (Alexander et al., 2009) to delineate an AIMs panel that can reproducibly capture the genomic ancestry of gorillas at the population level and aid in the identification of gorillas at the individual level. To this end, we developed the first AIMs panel for gorillas, containing 1,531 SNPs that were common to at least three out of four AIMs-determining approaches. Our results indicate that this AIMs panel can recapitulate the ancestry information of query gorillas with high precision and can help in population-level identification of gorillas, which can be invaluable in the preservation of unique gorilla gene pools and the selection of individuals for captive breeding programs.

Our AIMs panel (Consensus-1,531) consisted of 1,531 SNPs, generated as a consensus of at least three of the four aforesaid AIMs-determining strategies, and thus likely balanced out the limitations of each individual approach (Wilkinson et al., 2011). Here we note that out of 1,531 SNPs, 1,359 SNPs were common among FST, ADMIXTURE and SmartPCA and were not detected by the Infocalc-based method (**Figure 2**). The great extent of overlap of top-ranked AIMs of the aforementioned strategies indicates that these three strategies essentially captured the same information regarding the ancestry of query gorillas. Further, while the two worst-performing approaches, SmartPCA and FST, revealed the highest number of overlapping SNPs (>26%), Infocalc generated the highest number of exclusive SNPs (94%), followed by ADMIXTURE (66%). These results indicate a likely relationship between the exclusiveness of a SNP and its ability to recapture the ancestry information.
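The consensus rule described above (retain SNPs selected by at least three of the four strategies) can be sketched as follows. This is an illustrative sketch only: the function name `consensus_panel` and the toy SNP identifiers are hypothetical and do not come from the study's data.

```python
from collections import Counter

def consensus_panel(snps_by_method, min_methods=3):
    """Return SNPs selected by at least `min_methods` of the strategies."""
    counts = Counter(
        snp for snps in snps_by_method.values() for snp in set(snps)
    )
    return {snp for snp, n in counts.items() if n >= min_methods}

# Hypothetical top-ranked SNP sets, one per AIMs-determining strategy.
toy_sets = {
    "Infocalc": {"snp1", "snp2", "snp9"},
    "FST": {"snp1", "snp3", "snp4"},
    "SmartPCA": {"snp1", "snp3", "snp4"},
    "ADMIXTURE": {"snp1", "snp3", "snp5"},
}
panel = consensus_panel(toy_sets)
```

Here `snp1` is shared by all four strategies and `snp3` by three, so both enter the panel, whereas `snp4` (two strategies) does not.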

Overall, our qualitative and quantitative analyses concur that Consensus-1,531 could recapitulate the ancestry information of query gorillas with high precision. While Consensus-1,531 had the shortest median Euclidean distance from the CSS (0.032), it appeared as the second most sensitive approach in terms of the mean Euclidean distance from the same (µ = 0.087), indicating its high precision in recapitulating the ancestral information depicted by the whole dataset. Further, quantitative assessment showed that the performance of Consensus-1,531 was indistinguishable from that of the larger 10,000-SNP datasets (p-value > 0.99) and that it had the highest number of individuals (N = 9) with zero Euclidean distance from the CSS. However, we note that while Consensus-1,531 successfully replicated the ancestry information of most query gorillas employed in this study, it failed to capture the Cameroonian ancestry component for Carolyn, Delphi and Porta and homogeneously assigned them to Congo (**Figure 1**). It thus appeared to be the second-most sensitive in the qualitative assessment, falling short of the number-matched Infocalc-derived panel.
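The distance summaries quoted above can be illustrated with a short sketch. It assumes, as a labeled assumption, that each individual is represented by a vector of ancestry proportions and that the distance is taken between the vector obtained from a panel and the one obtained from the CSS. The individual names and proportions are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from statistics import mean, median

def distances_from_reference(panel_props, reference_props):
    """Per-individual Euclidean distance between two sets of ancestry vectors."""
    return {
        name: math.dist(panel_props[name], reference_props[name])
        for name in reference_props
    }

# Hypothetical ancestry proportions (two components per individual).
css_props = {"ind1": (1.0, 0.0), "ind2": (0.5, 0.5), "ind3": (0.0, 1.0)}
panel_props = {"ind1": (1.0, 0.0), "ind2": (0.8, 0.2), "ind3": (0.0, 1.0)}

dists = distances_from_reference(panel_props, css_props)
summary = {
    "median": median(dists.values()),
    "mean": mean(dists.values()),
    "n_zero": sum(1 for d in dists.values() if d == 0.0),
}
```

A panel that reproduces the reference exactly for most individuals yields a median of zero even when a few individuals deviate, which is why the median and the mean can rank panels differently.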

Among the remaining approaches, we note that FST was the poorest in capturing the fine-scale population structure of query gorillas, closely followed by the SmartPCA-based approach (**Figures 1**–**3**), suggesting the ineffectiveness of these two strategies in recapitulating the ancestral history of gorillas. We further note that most AIMs-determining approaches employed in this study (except FST and SmartPCA) and their consensus appeared to be superior to the randomly selected markers in capturing the population structure delineated by the CSS (**Figures 1**–**3**), advocating the usefulness of AIMs over randomized SNPs in tracing the biogeographical origin of organisms.

Here we note that the goal of this study was to develop AIMs that can be used to tell apart various populations within western lowland gorilla (below subspecies level). Eastern and western lowland gorillas are considered to be different species and are genetically so distinct from each other that they can be differentiated through most markers present in the complete SNP set (CSS). Despite our restriction in terms of sample size and data availability, since most gorilla genomes used in this study belong to various western gorilla populations (27 out of 31), our results should reflect our intended outcome of deducing AIMs that can differentiate western gorillas below subspecies level.

The quest to develop an AIMs panel for gorillas is not new. A previous study has developed polymorphic MEIs, including those that can be considered ancestry-informative markers and MEIs corresponding to regions of incomplete lineage sorting (ILS) (Hormozdiari et al., 2013). However, to the best of our knowledge, this is the first study to have developed an AIMs panel for gorillas, which can recapitulate their ancestry information with high precision. With limited availability of funding, conservation geneticists need to strike a balance between the costs of genotyping multiple loci and the inadequacy of information when a limited number of loci are genotyped. Comprising only 1,531 SNPs, the gorilla AIMs panel described here can become a likely cost-effective solution to this problem. Our AIMs panel can resolve the ancestry information of gorillas with highest resolution power and can detect fine structures within gorilla populations below subspecies level at a highly affordable cost.

#### CONCLUSIONS

Effective conservation of gorilla populations requires the delineation of their ancestry information to facilitate preservation of the population-level integrity of genomic signal and avoidance of inbreeding depression. To this end, we have developed an AIMs panel comprising 1,531 SNPs that can recapitulate the ancestry information of gorillas with high precision. Our AIMs panel can afford a cost-effective alternative to whole genome sequencing and/or large-scale genotyping of gorillas for large-scale biogeographic analysis and conservation genetics studies.

To the best of our knowledge this is the first AIMs panel developed for gorillas that can bolster their efficient management and aid in the conservation of their genetic integrity.

#### AUTHOR CONTRIBUTIONS

RD conceived the idea of the project, wrote the manuscript and helped with the analysis. RR and NV performed all the analyses.

#### FUNDING

This work was supported by Manipal Academy of Higher Education, Manipal, India.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00043/full#supplementary-material


single-nucleotide polymorphisms in a global set of 119 population samples. Investig. Genet. 2:1. doi: 10.1186/2041-2223-2-1


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Das, Roy and Venkatesh. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The General Transcription Repressor TaDr1 Is Co-expressed With TaVrn1 and TaFT1 in Bread Wheat Under Drought

Lyudmila Zotova<sup>1</sup> , Akhylbek Kurishbayev<sup>1</sup> , Satyvaldy Jatayev<sup>1</sup> , Nikolay P. Goncharov<sup>2</sup> , Nazgul Shamambayeva<sup>1</sup> , Azamat Kashapov<sup>1</sup> , Arystan Nuralov<sup>1</sup> , Ainur Otemissova<sup>1</sup> , Sergey Sereda<sup>3</sup> , Vladimir Shvidchenko<sup>1</sup> , Sergiy Lopato<sup>4</sup> , Carly Schramm<sup>4</sup> , Colin Jenkins<sup>4</sup> , Kathleen Soole<sup>4</sup> , Peter Langridge<sup>5,6</sup> and Yuri Shavrukov<sup>4</sup> \*

<sup>1</sup> Faculty of Agronomy, S.Seifullin Kazakh AgroTechnical University, Astana, Kazakhstan, <sup>2</sup> Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia, <sup>3</sup> A.F.Khristenko Karaganda Agricultural Experimental Station, Karaganda, Kazakhstan, <sup>4</sup> Biological Sciences, College of Science and Engineering, Flinders University, Bedford Park, SA, Australia, <sup>5</sup> School of Agriculture, Food and Wine, University of Adelaide, Adelaide, SA, Australia, <sup>6</sup> Wheat Initiative, Julius Kühn-Institut, Berlin, Germany

#### Edited by:

Yuriy L. Orlov, Russian Academy of Sciences, Russia

#### Reviewed by:

Yin-Gang Hu, Northwest A&F University, China Sintho Wahyuning Ardie, Bogor Agricultural University, Indonesia

\*Correspondence: Yuri Shavrukov yuri.shavrukov@flinders.edu.au

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

> Received: 13 November 2018 Accepted: 24 January 2019 Published: 08 February 2019

#### Citation:

Zotova L, Kurishbayev A, Jatayev S, Goncharov NP, Shamambayeva N, Kashapov A, Nuralov A, Otemissova A, Sereda S, Shvidchenko V, Lopato S, Schramm C, Jenkins C, Soole K, Langridge P and Shavrukov Y (2019) The General Transcription Repressor TaDr1 Is Co-expressed With TaVrn1 and TaFT1 in Bread Wheat Under Drought. Front. Genet. 10:63. doi: 10.3389/fgene.2019.00063

The gene for the general transcription repressor, TaDr1, was identified during screening of a wheat SNP database using the Amplifluor-like SNP marker KATU-W62. Together with two genes described earlier, TaDr1A and TaDr1B, it forms a set of three homeologous genes in the wheat genome. Under drought, the total expression profiles of all three genes varied between different bread wheat cultivars. Plants of four high-yielding cultivars exposed to drought showed a 2.0–2.4-fold increase in TaDr1 expression compared to controls. A weaker but significant 1.3–1.8-fold up-regulation of the TaDr1 transcript levels was observed in four low-yielding cultivars. TaVrn1 and TaFT1, which control the transition to flowering, showed expression profiles similar to that of TaDr1. Expression levels of all three genes correlated well with grain yields of the evaluated cultivars growing in the field under water-limited conditions. The results could indicate the involvement of all three genes in the same regulatory pathway, where the general transcription repressor TaDr1 may control expression of TaVrn1 and TaFT1 and, consequently, flowering time. The strength of expression of these genes can lead to phenological changes that affect plant productivity and hence explain differences in the adaptation of the examined wheat cultivars to the dry environment of Northern and Central Kazakhstan. The Amplifluor-like SNP marker KATU-W62 used in this work can be applied to the identification of wheat cultivars differing in alleles at the TaDr1 locus and in screening hybrids.

Keywords: Amplifluor-like SNP marker, bioinformatics, drought, general repressor of transcription, TaDr1, TaFT1, TaVrn1

### INTRODUCTION

Amongst the many types of abiotic stresses, drought or water limitation is one of the most important challenges for native plants and crops. There are several genetic and breeding strategies aimed at improving tolerance to drought in crops (Reviewed in: Ingram and Bartels, 1996; Yordanov et al., 2000; Tuberosa and Salvi, 2006; Valliyodan and Nguyen, 2006; Shanker et al., 2014; Berger et al., 2016; Kaur and Asthir, 2017). One potential approach is the modulation of flowering time, where wheat plants grow faster and complete their life-cycles a few days earlier, therefore minimizing interruption from oncoming, terminal drought (Reviewed in: Shavrukov et al., 2017). Genetic polymorphism and the introgression of novel alleles from wheat progenitors, relatives and wild species from the genus Triticum is a very powerful tool to enrich the genome of modern cultivars (Reviewed in: Arzani and Ashraf, 2017; Mwadzingeni et al., 2017; Wang et al., 2018).

Molecular markers are used widely for the identification of novel and existing alleles, and to track specific alleles in elite wheat breeding lines and introgression from landraces or wild species. Analysis of SNP (Single nucleotide polymorphism) is a rapidly developing technology with a diverse range of methods and applications (Reviewed in: Schramm et al., 2019). Amplifluor SNP markers are well-established and have been successfully applied in the recent genotyping of candidate genes for various plant species (Absattar et al., 2018; Yerzhebayeva et al., 2018; Khassanova et al., 2019). This includes research in bread wheat, where alleles of candidate genes for drought tolerance, TaDREB5 and TaNFYC-A7, were identified using Amplifluor SNP markers. These genes demonstrate differential expression in high- and low-yielding wheat cultivars from Kazakhstan under a progressive drought and rapid dehydration (Shavrukov et al., 2016b; Zotova et al., 2018). In other studies, overexpression of the transcription factors TaNFYA-B1 and TaNF-YB3;l led, respectively, to increased yield, nitrogen uptake and quicker root development, and to improved tolerance to drought, compared with controls (Qu et al., 2015; Yang et al., 2017). Similarly, the rice genes OsNF-YA7 and OsNF-YB1 were reported to be responsive to drought. Over-expression of OsNF-YA7 increased drought tolerance in transgenic rice plants (Lee et al., 2015), and OsNF-YB1 controls grain filling, resulting in improved yield (Xu et al., 2016).

Transcription factor (TF) Nuclear Factor Y (NF-Y) is a synonym of CCAAT Binding Factor (CBF) and Heme Activator Protein (HAP). Three subunits (A, B, and C) usually function in a single protein complex of NF-Y, and each of the three components is essential for binding to cis-elements in the promoter regions of target genes (Siefers et al., 2009; Petroni et al., 2012). In plants, the functions of NF-Y proteins are quite diverse, but, for the purposes of this paper, we will focus on just three: (1) regulation of flowering time; (2) response to abiotic stress, particularly drought; and (3) overall productivity in different plants (Gusmaroli et al., 2001; Nelson et al., 2007; Petroni et al., 2012; Kuromori et al., 2014; Swain et al., 2017; Zhao et al., 2017) including bread wheat (Qu et al., 2015; Yadav et al., 2015; Zotova et al., 2018).

In Arabidopsis, the C subunits of NF-Y factor, AtNF-YC3, AtNF-YC4, and AtNF-YC9, are involved in the regulation of photoperiod-mediated flowering time through the GA signaling pathway by binding to RGA (Repressor of ga1-3) and RGL2 (RGA-like2) proteins (Hou et al., 2014; Liu et al., 2016). Overexpression of many individual NF-YC subunits (such as NF-YC1, NF-YC2, NF-YC3, NF-YC4, and NF-YC9) alters flowering time. Individual subunits of the NF-Y complex can affect the transcript levels of Flowering locus T (FT). This gene encodes the protein that is the key integrator in the flowering time pathway, and up- or down-regulation of FT in interaction with the NF-Y complex leads to either early or late flowering in Arabidopsis (Kumimoto et al., 2010; Cao et al., 2014; Hou et al., 2014; Xu et al., 2016).

The flowering time trait has a complicated, multi-level control. Transcriptional up-regulation of two genes, Vrn (Vernalisation) and FT, is strongly required for the transition from the vegetative to reproductive stage, largely determining time to flowering (Reviewed in: Greenup et al., 2009; Jung and Müller, 2009; Yan, 2009; Jarillo and Piñeiro, 2011; Song et al., 2013; Milec et al., 2014; Blümel et al., 2015). In wheat, one of the most important crops, the genetic control of the flowering time trait has been extensively studied (Reviewed in: Li and Dubcovsky, 2008; Craufurd and Wheeler, 2009; Distelfeld et al., 2009; Campoli and Korff, 2014; Kamran et al., 2014). The main regulatory control of flowering time in wheat is through the up-regulation of TaFT1 – TaVrn3 and TaVrn1 genes (Li and Dubcovsky, 2008; Distelfeld et al., 2009).

Interestingly, flowering time is controlled not only by genes during ontogenesis, but is strongly impacted by abiotic stresses (Reviewed in: Kazan and Lyons, 2016; Takeno, 2016). Plants of various species have been reported to alter their development and flowering time in response to different types of abiotic stresses, ranging from osmotic stress in Arabidopsis (Chen et al., 2007), to soil pH in a native population of Corydalis sheareri, Papaveraceae (Huang et al., 2017). However, drought has been shown to be one of the major abiotic factors affecting development of flowering in various plant species such as tea, Camellia sinensis (Sharma and Kumar, 2005), litchi, Litchi chinensis (Shen et al., 2016) and lemon (Li et al., 2017). The genetic control of reproductive development and time to flowering in response to various abiotic stresses is well studied in cereals (Gol et al., 2017), where the influence of cold (Li et al., 2018) and drought (Pinto et al., 2010; Gudys et al., 2018), in particular, affects grain yields. Early flowering as a drought escape strategy in wheat and other species was reviewed recently (Shavrukov et al., 2017).

In bread wheat, the TaVrn1 gene was mapped to the long arm of chromosome 5A, tightly linked with the Q gene controlling spike morphology (Kato et al., 1998). The Q gene belongs to the large AP2/ERF family of TF (Konopatskaia et al., 2016), which includes DREB genes responsive to drought and dehydration, and reports have shown that the Q gene is also regulated by drought (Gürsoy et al., 2012). Therefore, flowering time and spike morphology seem to have a shared regulatory framework with TaVrn1 and Q genes, and a strong response to drought.

The gene sequence and structure of the general repressor of transcription, Dr1 (alternative name – NC2β), is conserved among various eukaryotes. It operates as a heterodimeric complex with the product of another gene, DrAp1 (alternative name – NC2α), and strongly represses the transcriptional activity of RNA polymerase II and III, but not RNA polymerase I (Kim et al., 1997). Originally, Dr1/DrAp1 was identified in human cells as an unknown factor that was able to inhibit TBP-dependent basal transcription in vitro (Inostroza et al., 1992). Mammalian DrAp1 itself cannot repress transcription and therefore it is considered as an enhancer of Dr1 repression activity (Mermelstein et al., 1996; Kim et al., 1997; Yeung et al., 1997). In Drosophila, Dr1/DrAp1 represses the transcription from TATA-containing promoters and activates the transcription from promoters without TATA-boxes (Willy et al., 2000).

In plants, Dr1 was originally discovered in Arabidopsis (Kuromori and Yamamoto, 1994). Later, the rice OsDr1 and OsDrAp1 genes were cloned, and formation of the heterodimeric complex, interaction of the protein complex with DNA, and repressive activities of the subunits and protein complex were characterized using the Y2H system, in vitro methods, and a transient expression assay (Song et al., 2002). These authors demonstrated several differences between the properties of Dr1 and DrAp1 in mammals and rice. Firstly, the plant DrAp1 protein was found to be larger than the mammalian and yeast proteins, and both plant Dr1 and DrAp1 contained a greater number of domains/motifs than their mammalian counterparts. Secondly, OsDrAp1 alone showed stronger repression activity than OsDr1, therefore in plants, OsDr1 most likely plays the co-repressor role and enhances the activity of OsDrAp1 (Song et al., 2002). This differs from mammals and yeast, where Dr1 is the repressor and DrAp1 plays the role of a regulatory subunit (Inostroza et al., 1992; Kim et al., 1997; Prelich, 1997).

Two homologous Dr1 genes from bread wheat, TaDr1A and TaDr1B, were identified and their expression patterns were reported in different wheat tissues under control and drought conditions (Stephenson et al., 2007). Transcripts of both TaDr1 homologs were abundant in all tested plant tissues and strongly up-regulated in leaves under drought.

In yeast, a 71% similarity between Dr1 and CBF-A (=NF-YB) was reported (Sinha et al., 1996). In bread wheat, TaDr1 and TaDr2 proteins (accessions AF464903 and BT009234, respectively), showed a "high degree of similarity" with TaNF-YB3 amino acid residues (Stephenson et al., 2007). Therefore, the authors suggested that the Dr1/DrAp1 complex could, potentially, inhibit transcription by acting as antagonist to all or to particular NF-YB and NF-YC subunits, thus preventing subunit association and subsequent binding of the activation NF-Y complex (Stephenson et al., 2007). This could be a possible mechanism to explain TaDr-mediated global repression of transcription.

The aims of this work were: (1) to compare flowering time and time to grain maturity of high-yielding and low-yielding wheat cultivars from Kazakhstan; (2) to analyze the genetic polymorphism of the TaDr1 gene in eight selected bread wheat cultivars, and in an F<sub>3</sub> segregating population 18-6 originating from a complex interspecies hybridisation; (3) to study TaDr1, TaVrn1 and TaFT1 gene expression in response to drought in leaves of selected wheat cultivars; and (4) to assess the co-expression of TaDr1, TaVrn1, and TaFT1 genes and grain yields of wheat cultivars in the dry conditions of Northern and Central Kazakhstan.

#### MATERIALS AND METHODS

#### Plant Material, Conditions of Plant Growth and Drought Application

Eight wheat cultivars, representing two groups with contrasting yields, were selected from local varieties tested in field trials, based on their grain yields under the dry conditions in Northern Kazakhstan (current study) and Central Kazakhstan, described earlier by Shavrukov et al. (2016b). Descriptions of plant materials and all experiments were as reported earlier (Zotova et al., 2018). These descriptions cover seed sources, conditions of plant growth in the research field in Central Kazakhstan, and the controlled conditions in the "Phytotron" experiments on gradual drought using plants in soil-filled containers over 12 days (Experiment 1) (Zotova et al., 2018).

A small outdoor trial was conducted in the research field of S.Seifullin Kazakh AgroTechnical University, Astana in Northern Kazakhstan in the dry season of 2017. Total rainfall was 107 mm during the vegetative growth period, lower than the average of 166 mm that was observed over many years in this region, and the August temperature recorded that year was 3°C higher than average (20.3°C compared to the average, 17.3°C). Two-row plots were sown, 1 m in length with 5 cm between plants in rows and 20 cm between rows, and four randomized replicates were used. The number of days between sowing and first flowering of 50% of plants in each plot was counted as "Days to flowering" (DF), while "Days to maturity" (DM) was recorded when all plants in each plot reached the ripening stage. Grain yield was measured for each plot and re-calculated in "g/m<sup>2</sup>" with statistical treatment as described below.

A complex interspecific cross [♀ Triticum spelta, k-53660 × ♂ (T. aestivum, Novosibirskaya 67 / T. dicoccum, k-25516)] was produced by one of the authors, Nikolay Goncharov, at the Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk (Russia). F<sub>3</sub> plants from the hybridisation were grown in pots with soil in a "Phytotron" with controlled conditions as mentioned above.

#### Identification of the "Gene of Interest" Using Bioinformatics and Molecular Phylogenetic Comparative Analysis

The cereals SNP database<sup>1</sup> was used to search and select a single target gene or "Gene of Interest" (GoI) for further research. BLAST analysis of the genetic fragments containing a SNP was applied to identify the full-length GoI using the Nucleotide collection of bread wheat in the NCBI database<sup>2</sup> .

Bioinformatics and systems biology methods were applied in this study to identify the full-length nucleotide sequence of the GoI, TaDr1. This sequence and its corresponding polypeptide sequence were used for BLASTN and BLASTP searches, respectively, in NCBI and in GenomeNet Database Resources, Kyoto University, Japan<sup>3</sup> . All wheat gene sequences with KEGG identification and their encoded proteins were retrieved from GenomeNet databases. Multiple sequence alignments of nucleotide sequences for the TaDr1A and TaDr1B genes were conducted in CLUSTALW using the CLC Main Workbench software<sup>4</sup> .

<sup>1</sup>http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB

<sup>2</sup>https://blast.ncbi.nlm.nih.gov

<sup>3</sup>https://www.genome.jp/tools/blast

<sup>4</sup>https://www.qiagenbioinformatics.com/products/clc-main-workbench

Chromosome locations of all TaDr1 homeologous genes in the wheat genome were found using BLAST analysis with high confidence annotated genes of the IWGSC database at the Gramene web-site<sup>5</sup> .

The molecular dendrogram of polypeptides of TaDr1 from bread wheat and other monocot plants was constructed using SplitsTree4 program<sup>6</sup> (Huson and Bryant, 2006), with Phylogram Splits and Tree Selector option.

### DNA Extraction and SNP Amplifluor Analysis

Plants were grown in control (non-stressed) conditions in containers with soil as described above. Five uniform, 1-month-old individual plants were selected from each accession and five leaves were collected and bulked for leaf samples. Leaf samples frozen in liquid nitrogen were ground in 10 ml tubes with two 9-mm stainless ball bearings using a Vortex mixer. DNA was extracted from the bulked leaves with phenol-chloroform as described in our earlier papers (Shavrukov et al., 2016b; Zotova et al., 2018). One µl of DNA was loaded on a 0.8% agarose gel to assess quality, and concentration was measured by NanoDrop (ThermoFisher, United States).

Amplifluor-like SNP analysis was carried out using a QuantStudio-7 Real-Time PCR instrument (ThermoFisher Scientific, United States) as described previously (Jatayev et al., 2017; Zotova et al., 2018) with the following adjustment for wheat genotyping. Each reaction contained 3 µl of template DNA adjusted to 20 ng/µl, 5 µl of Hot-Start 2xBioMaster (MH020-400, Biolabmix, Novosibirsk, Russia<sup>7</sup> ) with all other components as recommended by the manufacturers, including MgCl<sub>2</sub> (2.0 mM). One µl of the two fluorescently labeled Universal probes was added (0.125 µM each) and 1 µl of allele-specific primer mix (0.075 µM of each of two forward primers and 0.39 µM of the common reverse primer). Four µl of Low ROX (ThermoFisher, United States) was added as a passive reference label to the Master-mix as prescribed for the qPCR instrument. Assays were performed in 96-well microplates. The annotated SNP sites were used to design allele-specific primers. Sequences of the Universal probes and primers and sizes of amplicons generated are presented in **Supplementary Material 1**.

PCR was conducted using a program adjusted from those published earlier (Jatayev et al., 2017; Zotova et al., 2018): initial denaturation, 95°C, 2 min; 20 "doubled" cycles of 95°C for 10 s, 60°C for 10 s, 72°C for 20 s, 95°C for 10 s, 55°C for 20 s and 72°C for 50 s; with recording of allele-specific fluorescence after each cycle. Genotyping by SNP calling was determined automatically by the instrument software, but each SNP result was also checked manually using amplification curves and final allele discrimination. Experiments were repeated twice over different days, where two technical replicates confirmed the confidence of SNP calls.
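For readers who wish to reproduce the cycling conditions, the program above can be written down as plain data. The sketch below only encodes the temperatures and hold times stated in the text and sums the hold times; ramp times and fluorescence-read time are not given in the text and are therefore excluded. The variable and function names are hypothetical.

```python
# (temperature in degrees C, hold time in seconds), as stated in the text.
INITIAL_DENATURATION = (95, 120)
DOUBLED_CYCLE = [
    (95, 10), (60, 10), (72, 20),  # first half of the "doubled" cycle
    (95, 10), (55, 20), (72, 50),  # second half, lower annealing temperature
]
N_CYCLES = 20

def total_hold_seconds(initial, cycle, n_cycles):
    """Sum of programmed hold times; instrument ramping is not included."""
    return initial[1] + n_cycles * sum(hold for _, hold in cycle)

total_s = total_hold_seconds(INITIAL_DENATURATION, DOUBLED_CYCLE, N_CYCLES)
total_min = total_s / 60
```

Each doubled cycle holds for 120 s, so the programmed holds add up to 2,520 s (42 min) before any instrument overhead.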

#### RNA Extraction, cDNA Synthesis and qPCR Analysis

Plants were grown in the controlled conditions of a "Phytotron" at S.Seifullin Kazakh AgroTechnical University, Astana, Kazakhstan, as described earlier in Experiment 1 (Zotova et al., 2018). In brief, for mild drought stress with 1-month-old plants, watering was withdrawn in one of two soil-filled containers for 12 days until wilted leaves were observed. Control plants in similar containers were watered continuously. Five individual plants were used for each cultivar in drought-affected and well-watered containers. All leaves were collected from each plant in plastic tubes as separate biological replicates, frozen immediately in liquid nitrogen and kept at −80°C until RNA extraction. Three samples were used for RNA extraction in each cultivar and treatment, while two additional samples were used as replacements in case of failed extraction or poor RNA quality.

Frozen leaf samples were ground as described above for DNA extraction. TRIzol-like reagent was used for RNA extraction following the protocol described by Shavrukov et al. (2013) and all other steps for RNA extraction and cDNA synthesis were as described previously (Zotova et al., 2018) including DNase treatment (Qiagen, Germany), and the use of a MoMLV Reverse Transcriptase kit (Biolabmix, Novosibirsk, Russia). The quality of all cDNA samples was confirmed by PCR with products of the expected size.

Samples of cDNA diluted with water (1:2) were used for qPCR analyses using both a QuantStudio-7 Real-Time PCR instrument (ThermoFisher Scientific, United States) at Kazakh AgroTechnical University, Astana, Kazakhstan, and Real-Time qPCR system, Model CFX96 (BioRad, Gladesville, NSW, Australia) at Flinders University, Australia. Similar qPCR protocols were used in both instruments, as described earlier (Shavrukov et al., 2016b). Differences between protocols were: the total volume of 10 µl q-PCR reactions included either 5 µl of 2xBiomaster HS-qPCR SYBR Blue (Biolabmix, Novosibirsk, Russia) for experiments in Kazakhstan or 5 µl of 2xKAPA SYBR FAST (KAPA Biosystems, United States) for experiments in Australia, 4 µl of diluted cDNA, and 1 µl of two gene-specific primers (3 µM of each primer) (**Supplementary Material 2**). Expression data for the target genes were calculated relative to the average expression of the two reference genes: Ta22845, ATP-dependent 26S proteasome and Ta54825, actin (Paolacci et al., 2009). At least three biological and two technical replicates were used in each qPCR experiment.
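The normalization to two reference genes can be illustrated with a minimal sketch. It assumes a delta-Ct calculation with the arithmetic mean of the two reference-gene Ct values and an amplification efficiency of 2 (perfect doubling per cycle); the text does not state the exact formula used, so these are assumptions, and all Ct values below are hypothetical.

```python
from statistics import mean

def relative_expression(ct_target, ct_references, efficiency=2.0):
    """Expression of a target gene relative to the mean of reference genes.

    Assumes delta-Ct normalization: efficiency ** -(Ct_target - mean Ct_ref).
    """
    delta_ct = ct_target - mean(ct_references)
    return efficiency ** (-delta_ct)

# Hypothetical Ct values for one cultivar; the references stand in for
# the two reference genes (Ta22845 and Ta54825) named in the text.
control = relative_expression(26.0, [20.0, 22.0])
drought = relative_expression(25.0, [20.0, 22.0])
fold_change = drought / control
```

With these toy values the target Ct drops by one cycle under drought while the references are unchanged, which corresponds to a two-fold up-regulation.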

#### Statistical Analysis

IBM SPSS Statistics software was used to calculate means and standard errors, to analyze them using ANOVA, and to estimate the probabilities for significance using Student's t-test. A correlation analysis was performed using Tests of Between-Subjects Effects (IBM SPSS Statistics Desktop 25.0.0.0).

<sup>5</sup>http://www.gramene.org

<sup>6</sup>http://www.splitstree.org

<sup>7</sup>http://biolabmix.ru/en/products

### RESULTS


### Phenological Characteristics and Grain Yield of Studied Wheat Cultivars

To assess the relative grain yield performance of the bread wheat cultivars in the dry conditions of Northern and Central Kazakhstan, eight wheat cultivars were selected from our previously published paper (Shavrukov et al., 2016b), and tested in the field during the dry season of 2017. The group of four cultivars (1. Aktyubinka; 2. Albidum 188; 3. Altayskaya 110; and 4. Saratovskaya 60) performed as expected, confirming their high-yielding status, which was significantly higher than the group with low-yield (5. Vera; 6. Volgouralskaya; 7. Yugo-Vostochnaya 2; and 8. Zhenis) (**Table 1**).

The superior high-yielding cultivar Aktyubinka (240 g/m<sup>2</sup>) had the shortest DF (39 days) and therefore the earliest start to flowering, while its DM was about average for this group (66 days). In contrast, the lowest-yield cultivar, Yugo-Vostochnaya 2, with more than two-fold lower grain yield than Aktyubinka, started flowering after a 3-day delay (42 days) but was only 1 day shorter in DM (65 days) compared to Aktyubinka. On average, the four high-yielding cultivars started flowering a significant 2.5 days earlier compared to the low-yielding group, while a less pronounced and non-significant difference (1.8 days) was found in DM between the two groups of cultivars (**Table 1**).

#### Genotyping of Wheat Accessions for the TaDr1 Gene Using an Amplifluor SNP Marker

During screening of annotated SNPs in bread wheat, the contig BC000036325 was identified for the drought-responsive candidate gene (TaDr1) using the publicly available database Cereal DB (see text footnote 1). The SNP marker KATU-W62 was developed to target the annotated SNP [W = A/T] in the 3′-UTR (untranslated region) based on the sequence of BC000036325. Both selected wheat cultivars and the segregating population 18-6 showed genetic polymorphism, with the more common allele being the nucleotide "A" and the rarer allele "T" at the SNP position (**Figure 1**).

TABLE 1 | Phenological characteristics of eight wheat cultivars grown in the Akmola region, Northern Kazakhstan, in the dry season of 2017.

Number of Days to flowering (DF) was counted when 50% of plants in the plot started flowering, while number of Days to maturity (DM) was recorded once all plants in each plot reached the ripening stage. Grain yield was calculated in g/m<sup>2</sup>, as the average of four replicates ± SE. Different letters in superscripts and asterisks (<sup>∗</sup>) indicate significant differences (p < 0.05) using ANOVA.

Genotyping of plants from the eight studied cultivars using the Amplifluor SNP marker KATU-W62 revealed clear discrimination of homozygote genotypes "aa" in all four high-yielding cultivars (1–4) while low-yielding cultivars (5–8) were characterized by a mixture of "bb" (**5.** Vera; and **7.** Yugo-Vostochnaya 2) and "ab" (**6.** Volgouralskaya; and **8.** Zhenis) genotypes (**Figure 1A**). At this stage, it remains unclear whether the "ab" genotypes of cultivars Volgouralskaya and Zhenis represent true heterozygotes, a mixture of several genotypes, or both.
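As an illustration of how two-dye allelic discrimination data can be turned into genotype calls, the sketch below classifies each sample by the angle of its (FAM, VIC) signal pair. This is not the instrument's automatic calling algorithm, which the text does not describe; the function name `call_genotype`, the angle and no-call thresholds, and the signal values are all hypothetical.

```python
import math


def call_genotype(fam, vic, ntc_threshold=0.5, low_deg=30.0, high_deg=60.0):
    """Call a biallelic genotype from two baseline-corrected fluorescence signals.

    fam: signal for allele 1 (FAM, "a"); vic: signal for allele 2 (VIC, "b").
    All thresholds are hypothetical and would need tuning for each run.
    """
    if math.hypot(fam, vic) < ntc_threshold:
        return "no call"  # indistinguishable from the no-template control
    # 0 degrees = pure FAM signal, 90 degrees = pure VIC signal
    angle = math.degrees(math.atan2(vic, fam))
    if angle < low_deg:
        return "aa"
    if angle > high_deg:
        return "bb"
    return "ab"


# Hypothetical (FAM, VIC) end-point signals, not data from the paper.
samples = {
    "plant_1": (3.0, 0.2),
    "plant_2": (0.3, 2.8),
    "plant_3": (1.9, 2.0),
    "NTC": (0.1, 0.1),
}
calls = {name: call_genotype(fam, vic) for name, (fam, vic) in samples.items()}
print(calls)
```

A call of "ab" from such a rule cannot by itself separate a true heterozygote from a DNA sample pooled from mixed genotypes, which matches the caution expressed above for Volgouralskaya and Zhenis.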

Segregation of genotypes for the SNP marker KATU-W62 was observed in the F<sub>3</sub> population 18-6 (**Figure 1B**) originating from the complex cross, where the favorable allele "a" was inherited from the paternal line. The analysis of the entire hybrid population is still ongoing and will include progeny analyses in the next generation.

#### Bioinformatic Characterisation of the TaDr1 Candidate Gene and Protein

BLASTN results at NCBI<sup>8</sup> for bread wheat gene sequences revealed two accessions, BT009234 for TaDr1B, and AF464903 for TaDr1A, published and described earlier (Stephenson et al., 2007), with 96% identity in both genes, and covering 96% and 89% of the sequences, respectively.

Genomic DNA analysis using high confidence genes annotated by the IWGSC database revealed that TaDr1A and TaDr1B are located on homeologous chromosomes 3A and 3D, in the positions 689,352,814-689,357,320 and 552,949,442-552,953,939, on the forward strands of the physical map, respectively. These genes, TraesCS3A02G450700 and TraesCS3D02G443500, contained five exons, produced 1,536 and 1,565 bp long transcripts which encoded 291 and 298 amino acid long proteins, respectively. The sequence of contig BC000036325, which contained the identified SNP, had the highest level of identity (99.7%) with the gene TraesCS3B02G487800, located in the position 733,818,973-733,823,767, on the forward strand of the physical map of the homeologous chromosome 3B. The gene presented in the BC000036325 contig also contained five exons, transcribed a single 1,317 bp long transcript and encoded a 296 amino acid long protein. Therefore, the two annotated genes TaDr1A and TaDr1B, and the BC000036325 contig from the SNP database, together represent the three homeologous genes of TaDr1 in wheat genomes A, D and B, respectively.

The protein encoded by BC000036325 shared 99.3% and 85.% identity with TaDr1B and TaDr1A, respectively, while a low similarity score and only 18.9% identity was found compared to TaNF-YB3, accession BT009265 (**Figure 2**). This result shows that accession BC000036325 from the B genome used in this work has much stronger similarity to TaDr1B and to the corresponding gene TaDr1B from the D genome of wheat.
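For readers unfamiliar with how a percent identity figure is obtained from a pairwise alignment, a minimal sketch follows. Alignment tools differ in which columns they count in the denominator, and the text does not state which definition was used; the convention here (ungapped aligned columns only), the function name `percent_identity`, and the toy sequences are assumptions for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignment (not TaDr1 sequences): one gap, one substitution.
seq_1 = "MKKLDTRFPA-RIK"
seq_2 = "MKKLDTRFPAARVK"
identity = percent_identity(seq_1, seq_2)
print(f"{identity:.1f}% identity")
```

Counting gapped columns in the denominator instead would lower the value, which is one reason identity figures from different programs are not always directly comparable.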

<sup>8</sup>https://www.ncbi.nlm.nih.gov

Y-axes show relative amplification units, ΔRn, for FAM and VIC fluorescence signals, respectively. Red dots represent homozygote (aa) genotypes with allele 1 (FAM) associated with the high yielding cultivars, blue dots represent homozygote (bb) genotypes for allele 2 (VIC), and green dots represent heterozygote (ab) or mixed genotypes identified with automatic SNP calling. The black squares show the no template control (NTC) using water instead of template DNA.

## Molecular Dendrogram of the TaDr1 Gene

The phylogenetic tree was constructed based on a BLASTX search of the NCBI database for proteins similar to TaDr1 (BC000036325) in cereal plant species, with a group of TaNF-YB TFs included for comparison. The sequences of all Dr1 proteins are distinct from all TaNF-YB TFs. Among Dr1 sequences, bread wheat (Triticum aestivum) and the diploid progenitor of the A genome (T. urartu) form the first sub-clade, while cultivated rice (Oryza sativa) and a closely related native grass from tropical Africa (O. brachyantha) are isolated in the second sub-clade. All other cereal species are joined together in the third sub-clade, including sorghum (Sorghum bicolor), maize (Zea mays), foxtail millet (Setaria italica), and Hall's panicgrass (Panicum hallii) (**Figure 3**).

### Expression Analysis of the TaDr1 in Leaves of Control Plants and Plants Exposed to Drought

Expression profiles for TaDr1 were recorded as the total of all three homeologous genes, TaDr1A, TaDr1B and BC000036325, using primers designed for the most conserved regions of these genes. Reference genes used in this study were stable across all genotypes in control and treatment conditions (**Figure 4A**). In plants exposed to drought, our results revealed significant up-regulation of TaDr1 in all eight studied wheat cultivars (**Figure 4B**). Four high-yielding cultivars increased production of TaDr1 transcripts 2–2.4 fold, while expression levels in plants of low-yielding cultivars were also increased compared to controls but not as strongly as in plants of high-yielding cultivars (**Figure 4B**).

Both flowering time regulators, TaVrn1 and TaFT1, showed drought responsive expression similar to the expression of TaDr1. High-yielding cultivars (1–4) had higher expression levels of TaVrn1 and TaFT1 than low-yielding cultivars (5–8), although differences for some cultivars were not significant. These results show genotype-dependent co-expression following the same trend in all three studied genes, TaDr1, TaVrn1, and TaFT1, in leaves of plants grown under drought (**Figures 4B–D**).

Statistical analysis using Tests of Between-Subjects Effects for the gene expressions presented in **Figures 4B–D** shows a very low correlation between groups of high-yielding cultivars (1–4) and low-yielding cultivars (5–8), with R<sup>2</sup> = 0.081, 0.123 and 0.118, respectively. In contrast, strong correlations (R<sup>2</sup> = 0.897 and R<sup>2</sup> = 0.957) were found between cultivars within each group, 1–4 and 5–8, for the three studied genes TaDr1, TaVrn1, and TaFT1, respectively (**Table 2**).
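The text does not specify how the SPSS procedure was configured, so the exact model behind these R<sup>2</sup> values is not known. As a rough illustration only, the sketch below computes the R<sup>2</sup> that a general linear model with a single categorical factor reports, namely the share of total variation explained by the factor (SS<sub>between</sub> / SS<sub>total</sub>). The function name `r_squared_one_way` and the expression values are hypothetical.

```python
def r_squared_one_way(groups):
    """R^2 of a one-factor linear model: SS_between / SS_total."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total


# Hypothetical relative expression values (not data from the paper).
high_yielding = [2.0, 2.2, 2.4, 2.2]
low_yielding = [1.4, 1.6, 1.8, 1.6]

r2 = r_squared_one_way([high_yielding, low_yielding])
print(f"R^2 = {r2:.3f}")
```

In this toy case most of the variation lies between the two groups, so R<sup>2</sup> is high; values close to zero would indicate that the grouping factor explains little of the variation.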

#### DISCUSSION

Flowering time is a very important trait in wheat, and it has been documented that flowering earlier by just a few days can increase the likelihood that plants minimize the impact of terminal drought and ultimately improve their yield performance compared to wheat genotypes with later flowering times (reviewed in Shavrukov et al., 2017). Terminal or late season drought is the most common form of drought stress under most wheat production environments. In the current work, we compared the flowering time of four high-yielding and four low-yielding wheat cultivars and the expression of some genes related to flowering time. In a population of recombinant breeding lines of durum wheat (Triticum durum Desf.) in diverse environments with drought, one QTL for heading date was identified on chromosome 2A. However, this QTL had minimal or no effect on grain yield (Maccaferri et al., 2008). Different results were reported for synthetic bread wheat lines, in which early heading correlated with higher grain yield under dry conditions compared to controls (Inagaki et al., 2007). The authors concluded that genes from the D genome could make an important contribution to the correlation in bread wheat, which is absent in tetraploid durum wheat.

The TaDr1 gene was selected from a SNP database for genetic polymorphism analysis using molecular markers. This gene encodes a protein belonging to the group of general transcription repressors and is an important part of the plant regulatory system.

Two of the three homologous genes, TaDr1A and TaDr1B, were identified earlier in wheat (Stephenson et al., 2007). Together with a third TaDr1 gene, identified in the current study under the temporary name of contig BC000036325, they were localized in the A, D and B genomes of bread wheat, respectively. Alignment of TaDr1 proteins with TaNF-YB3 reveals a high level of identity in the histone fold domain responsible for protein-protein and protein-DNA interactions (**Figure 2**). This result is in agreement with the previously published statement about the "high degree of similarity between TaDr1A, TaDr1B and TaNF-YB subunit members" (Stephenson et al., 2007).

The expression analysis of all three homeologous genes of TaDr1 comprised an important part of the study of gene function, as published by Stephenson et al. (2007). However, analysis of the primer design for qPCR analysis of the genes, TaDr1A and TaDr1B, in Stephenson et al. (2007) did not reveal sufficient discrimination between these genes (**Supplementary Material 2**). One pair of primers published by Stephenson et al. (2007) was based on BT009234 and targeted the TaDr1B sequence for qPCR analysis, but it shows full consensus between the two genes, with no mismatches (indicated in green, **Supplementary Material 2**). Therefore, the use of these primers gave total (combined) expression for both genes, TaDr1A and TaDr1B. The second pair of primers, used and reported by Stephenson et al. (2007), was based on AF464903, where the reverse primer was again designed in the conserved region which is identical in both genes. Only a single nucleotide insertion and one SNP were found in the sequence of the TaDr1A-Fd primer (indicated in pink, **Supplementary Material 2**). We estimate that it contributes about 90–95% of the studied TaDr1A isoform specificity, so in the results presented by Stephenson et al. (2007), TaDr1B was over-estimated and represented the total expression of both genes combined, TaDr1A and TaDr1B (TaDr1).

of eight wheat cultivars in response to drought. The expression levels of Ta22845 (A), TaDr1 (B), TaVrn1 (C), and TaFT1 (D) were calculated under drought relative to the corresponding controls in well-watered conditions. Eight wheat cultivars were studied, high-yielding are shown as darker boxes (1. Aktyubinka; 2. Albidum 188; 3. Altayskaya 110; and 4. Saratovskaya 60), and the four low-yielding cultivars are shown as framed light filled boxes (5. Vera; 6. Volgouralskaya; 7. Yugo-Vostochnaya 2; and 8. Zhenis). With the exception of Panel A, expression data were normalized using the averages of two reference genes, Ta22845 and Ta54825 (Actin), and presented as the average ± SE of three biological and two technical replicates for each genotype, experiment and treatment. Different letters above the bars indicate significant differences (p < 0.05) within each experiment calculated using ANOVA.

In this context, we similarly measured total expression of all three homeologous genes TaDr1 with qPCR primers based on the sequence BC000036325. Two mismatches at the 5′-end of the reverse primer (indicated in blue, **Supplementary Material 2**) can affect the specificity of the amplified mRNA of both genes, TaDr1A and TaDr1B, but only at an equal rate due to perfect consensus between AF464903 and BT009234 sequences in the primer-binding region.

In this work, the associations of an individual GoI with complex traits, such as flowering time and performance under drought, were studied in bread wheat cultivars. The regulatory gene, TaDr1, is clearly involved in the plant's response to drought and its expression pattern correlates with the expression patterns of two other regulatory genes, TaVrn1 and TaFT1, which are well-known regulators of flowering time. The existence of small differences in flowering time between high- and low-yielding wheat cultivars under moderate drought was also demonstrated.

TABLE 2 | Correlation analysis between groups of high-yielding and low-yielding cultivars for expression of the three genes, TaDr1, TaVrn1, and TaFT1 (right column), and between cultivars within each group (bottom row).

Data represent the average of the relative expression units for four cultivars, with three biological replicates in each (n = 12) ± SE, extracted from Figure 4. The R<sup>2</sup> correlation coefficient was calculated using Tests of Between-Subjects.

In addition, over-expression of regulatory transgenes, TaNF-YB4, TaDREB3, or TaSHN1, as was shown in our earlier papers, activated sets of downstream genes and this led to significantly improved drought tolerance and/or increased grain yield of transgenic wheat plants (Yadav et al., 2015; Shavrukov et al., 2016a; Bi et al., 2018). These results confirm the relevance of the "single-gene for single-trait" approach in studying complex regulatory gene networks, such as, for instance, the response of bread wheat under limited water conditions.

The eight local wheat cultivars from Kazakhstan used in our study were separated into two groups representing high- and low-yielding varieties in the dry conditions of Northern and Central Kazakhstan, as discussed in our previous paper (Shavrukov et al., 2016b) and confirmed in the current study (**Table 1**). Under drought, the two groups of wheat cultivars showed quite variable expression profiles of TaDr1, with 2–2.4-fold and 1.3–1.8-fold higher expression of TaDr1 in the first and second groups of cultivars, respectively (**Figure 4B**). The expression of TaDr1, identified as TaDr1B in cv. Babax (Stephenson et al., 2007), was reported to be about 2.3-fold above the level of controls, which is close to the highest level of the first group of wheat cultivars in the current study.

Our results indicate that the expression of TaDr1 is dependent on wheat genotype. Four high-yielding cultivars showed very high expression of TaDr1, while gene expression was moderate in all four low-yielding cultivars compared to controls under drought treatment.

The two TFs, TaVrn1 and TaFT1, are well studied and are known to strongly regulate the flowering time trait in wheat. Abiotic stresses, such as drought, can affect plant growth and development including flowering. In our recent paper, we reported that the TaNFYC-A7 gene was differentially expressed under drought in the same cultivars studied here (Zotova et al., 2018). It is suggested that the TaDr1 protein could bind one or both of the TaNF-YB and TaNF-YC type subunits and consequently prevent their interactions or binding to the third subunit, TaNF-YA. It can therefore act as a repressor of the trimeric NF-Y transcription factor. We can extend this hypothesis and speculate that TaNF-Y, which is affected (deactivated) by TaDr1, can release the activity of TaVrn1 and TaFT1 promoters. This in turn leads to earlier flowering and ultimately improved performance of wheat genotypes grown in the dry environment of Northern and Central Kazakhstan. The proposed signaling pathway from TaDr1 to TaVrn1 and TaFT1 is supported by the three genes' co-expression results in the current study in wheat plants under drought. High expression of TaDr1 was accompanied by significant upregulation of TaVrn1 and TaFT1 transcripts. In experiments with drought stress, co-expression patterns in TaDr1, TaVrn1, and TaFT1 were genotype-dependent and highly correlated, being much stronger in the four high-yielding wheat cultivars and less pronounced, but still significant, in the four low-yielding cultivars. Further strong evidence will be required to support or reject this hypothesis, including direct "protein-protein" interactions in the studied wheat genotypes.

The application of the Amplifluor-like SNP marker, KATU-W62, like other molecular markers, is a helpful tool for wheat genotyping of both modern cultivars and interspecific hybrids with wild relatives or species related to the genus Triticum. In this study, we were able to show that the markers can be deployed in tracking the different alleles in an F<sub>3</sub> population resulting from a complex cross. This population will be used to assess the value of the marker in screening for enhanced drought tolerance under production conditions in Northern Kazakhstan. If our hypothesis is correct, we expect lines carrying the "a" allele to perform better under drought, with the strongest improvement shown for homozygotes "aa" in the presented study.

Identification of the TaDr1 alleles can result in a better understanding of genetic polymorphism in the control of down-stream genes, like TaVrn1 and TaFT1, which regulate vernalisation and flowering time. Together with the Q gene, the combined regulatory system can change the reproductive architecture of wheat plants and improve their tolerance to abiotic stresses, primarily drought.

### AUTHOR CONTRIBUTIONS

LZ conducted the experiments with eight wheat cultivars and the genotyping with Amplifluor-like SNP analysis. AkK and SJ supervised experiments and interpreted results. NG supervised works with vernalisation and flowering time genes, and analysis of interspecific hybrid. NS, AzK, and AN conducted experiments with plant stresses and sampling. AO carried out work and analysis of interspecific hybrid. SS worked with plants in the field trial. VS coordinated experiments in the field. SL analyzed gene sequences in databases and wrote the corresponding section. CS analyzed results, and revised and edited the manuscript. CJ analyzed qRT-PCR data and revised the corresponding section. KS coordinated the qRT-PCR study and revised other sections. PL supervised the study and revised the final version of the manuscript. YS coordinated all experiments and wrote the first version of the manuscript.

### FUNDING

This research has been supported by the Ministry of Education and Science, Kazakhstan, Research Program BR05236500 (SJ). The study for genes affecting plant architectonics was funded by the Russian Science Foundation (Russia), grant no. 16-16-10021 (NG). Preliminary evaluation of growth habit phenotypes (spring vs. winter) of varieties from the collection of the accessions was carried out within the framework of the Budget project no. 0324-2018-0018 (NG).

#### ACKNOWLEDGMENTS

We want to thank the staff and students of S.Seifullin Kazakh AgroTechnical University, Astana (Kazakhstan), Flinders University of South Australia, SA (Australia), and Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk (Russia) for their support in this research and help with critical comments to the manuscript. The results of this study were presented at the International Conference 'Bioinformatics and Computational Biology', August 2018, Novosibirsk, Russia. The authors acknowledge the Organizing Committee for their support in the presentation and publication of this work.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00063/full#supplementary-material

#### REFERENCES

Absattar, T., Absattarova, A., Fillipova, N., Otemissova, A., and Shavrukov, Y. (2018). Diversity array technology (DArT) 56K analysis, confirmed by SNP markers, distinguishes one crested wheatgrass Agropyron species from two others found in Kazakhstan. Mol. Breed. 38:37. doi: 10.1007/s11032-018-0792-3

Arzani, A., and Ashraf, M. (2017). Cultivated ancient wheats (Triticum spp.): a potential source of health-beneficial food products. Compr. Rev. Food Sci. Food Saf. 16, 477–488. doi: 10.1111/1541-4337.12262

Berger, J., Palta, J., and Vadez, V. (2016). Review: an integrated framework for crop adaptation to dry environments: responses to transient and terminal drought. Plant Sci. 253, 58–67. doi: 10.1016/j.plantsci.2016.09.007


and lead to improved corn yields on water-limited acres. Proc. Natl. Acad. Sci. U. S. A. 104, 16450–16455. doi: 10.1073/pnas.0707193104


factors in Triticum aestivum. Plant Mol. Biol. 65, 77–92. doi: 10.1007/s11103-007-9200-9


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zotova, Kurishbayev, Jatayev, Goncharov, Shamambayeva, Kashapov, Nuralov, Otemissova, Sereda, Shvidchenko, Lopato, Schramm, Jenkins, Soole, Langridge and Shavrukov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Natural Selection Equally Supports the Human Tendencies in Subordination and Domination: A Genome-Wide Study With in silico Confirmation and in vivo Validation in Mice

#### Edited by:

Ancha Baranova, George Mason University, United States

#### Reviewed by:

Dusanka Savic Pavicevic, University of Belgrade, Serbia Harinder Singh, J. Craig Venter Institute, United States

#### \*Correspondence:

Mikhail Ponomarenko pon@bionet.nsc.ru

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 09 August 2018 Accepted: 28 January 2019 Published: 20 February 2019

#### Citation:

Chadaeva I, Ponomarenko P, Rasskazov D, Sharypova E, Kashina E, Kleshchev M, Ponomarenko M, Naumenko V, Savinkova L, Kolchanov N, Osadchuk L and Osadchuk A (2019) Natural Selection Equally Supports the Human Tendencies in Subordination and Domination: A Genome-Wide Study With in silico Confirmation and in vivo Validation in Mice. Front. Genet. 10:73. doi: 10.3389/fgene.2019.00073

Irina Chadaeva<sup>1,2</sup>, Petr Ponomarenko<sup>3</sup>, Dmitry Rasskazov<sup>2</sup>, Ekaterina Sharypova<sup>2</sup>, Elena Kashina<sup>2</sup>, Maxim Kleshchev<sup>1</sup>, Mikhail Ponomarenko<sup>1,2</sup>\*, Vladimir Naumenko<sup>1</sup>, Ludmila Savinkova<sup>2</sup>, Nikolay Kolchanov<sup>1,2</sup>, Ludmila Osadchuk<sup>1</sup> and Alexandr Osadchuk<sup>1</sup>

<sup>1</sup> Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia, <sup>2</sup> Novosibirsk State University, Novosibirsk, Russia, <sup>3</sup> University of La Verne, La Verne, CA, United States

We proposed the following heuristic decision-making rule: "IF {an excess of a protein relating to the nervous system is an experimentally known physiological marker of low pain sensitivity, fast postinjury recovery, or aggressive, risk/novelty-seeking, anesthetic-like, or similar agonistic-intolerant behavior} AND IF {a single nucleotide polymorphism (SNP) causes overexpression of the gene encoding this protein} THEN {this SNP can be a SNP marker of the tendency in dominance} WHILE {underexpression corresponds to subordination} AND vice versa." Using this decision-making rule, we analyzed 231 human genes of neuropeptidergic, non-neuropeptidergic, and neurotrophinergic systems that encode neurotrophic and growth factors, interleukins, neurotransmitters, receptors, transporters, and enzymes. These proteins are known as key factors of human social behavior. We analyzed all the 5,052 SNPs within the 70 bp promoter region upstream of the position where the protein-coding transcript starts, which were retrieved from databases Ensembl and dbSNP using our previously created public Web service SNP\_TATA\_Comparator (http://beehive.bionet.nsc.ru/cgi-bin/mgs/tatascan/start.pl). This definition of the promoter region includes all TATA-binding protein (TBP)-binding sites. A total of 556 and 552 candidate SNP markers contributing to the dominance and the subordination, respectively, were uncovered. On this basis, we determined that 231 human genes under study are subject to natural selection against underexpression (significance p < 0.0005), which equally supports the human tendencies in domination and subordination such as the norm of a reaction (plasticity) of the human social hierarchy. These findings explain vertical transmission of domination and subordination traits previously observed in rodent models. Thus, the results of this study equally support both sides of the century-old unsettled scientific debate on whether both aggressiveness and the social hierarchy among humans are inherited (as suggested by Freud and Lorenz) or are due to non-genetic social education, when the children are influenced by older individuals across generations (as proposed by Berkowitz and Fromm).

Keywords: gene, promoter, TBP, TATA-box, SNP, expression change, social hierarchy, candidate SNP marker

#### INTRODUCTION

Social dominance-subordination hierarchy is a set of structured relationships between individuals. These relationships ensure coexistence of individuals by reducing mutual aggression and increasing order in the competition for limited environmental resources as well as elevating their reproductive potential (Hinde, 1970; Rowell, 1974). In animals, such intraspecies hierarchy is a result of agonistic aggressive behavior defined by ethologists as an innate form of action to protect oneself, shelter, progeny, and territory (Lorenz, 2002). Artificial selection of animals for either aggressiveness (Kulikov et al., 2016) or domestication (Belyaev, 1979) has demonstrated the contribution of genetic factors to the phenotypic manifestation of aggressiveness (Ehrman and Parsons, 1981; Moore, 2013). Finally, a genome-wide search for genetic factors of both fear and aggressive behaviors has been conducted on model animals, e.g., in canines, which were artificially selected for both domestication and agonistic behavior (Zapata et al., 2016).

In humans, the reference genome (Colonna et al., 2014) and the full set of single-nucleotide polymorphisms (SNPs) are available in the public databases Ensembl (Zerbino et al., 2015) and dbSNP (Sherry et al., 2001). In humans, genetic polymorphism exemplifies the results of natural rather than artificial selection. Dobzhansky (1963) concluded: "man is genetically specialized to be unspecialized," meaning that human behavioral tolerance to social and environmental challenges is broad. The recent genome-wide comparison between humans and apes (Gunbin et al., 2018) indicated that the origin of the human species coincided with a reliable increase in the plasticity of the transcription regulation of neuronal genes, while in apes the regulatory plasticity of these genes was reduced. This observation points to the action of destabilizing (disruptive) natural selection rather than directional or stabilizing natural selection (Belyaev, 1979). Notably, comprehensive multifactorial regression analysis of healthy young athletes (i.e., boxers, kick boxers, and karate fighters) revealed a significant positive correlation between their aggression and anxiety rates, which helps to achieve top combat levels owing to the prevention of injuries under extreme conditions in the arena (Tiric-Campara et al., 2012). Finally, there is the century-old unsettled scientific dispute where one side – e.g., Freud (1921, 1930) and Lorenz (1964, 2002) – explains both human aggressiveness and social hierarchy as a consequence of their genetic predisposition, while the other side – e.g., Fromm (1941, 1973), Berkowitz (1962, 1993), and Skinner (Rogers and Skinner, 1956; Skinner, 1981) – explains this by the continuous non-genetic social education which continues from childhood to the oldest age (Markel, 2016).

Notably, the social dominance-subordination hierarchy in social species (e.g., humans) limits the permissible aggression range, which is under pressure of natural selection as a norm of a reaction (plasticity) to aggressive behavior (Eldakar and Gallup, 2011). Conditions, quality, and the lifespan of an individual depend on his/her rank within the social hierarchy (Michopoulos et al., 2012). In murine micropopulations as combinations of inbred and hybrid individuals, manifestation of the social dominance phenotype reliably depends on some behavioral features taken together with a genotype (Serova et al., 1991). As for human aggressiveness as a target of some antipsychotic drugs [e.g., olanzapine (Ellingrod et al., 2005)], there are a number of biomedical SNP markers that represent statistically significant differences between the reference human genome and the individual genome of patients having either a certain psychiatric disease or resistance/susceptibility to certain treatments of this disease.

Each discovery of an SNP marker associated with a human phenotypic trait was a unique success in the pregenomic era, whereas now this task is one of the major aims of the largest scientific project, "1000 genomes" (Colonna et al., 2014). The main results of this project are publicly available within two regularly synchronized and updated databases: Ensembl (Zerbino et al., 2015), which is the reference human genome consisting of the most frequent (ancestral) nucleotides at each DNA position, and dbSNP (Sherry et al., 2001), the human variome containing all the carefully verified SNPs. These databases now contain a carefully curated extract that summarizes information on more than 10,000 individual human genomes and more than 100 million SNPs (Telenti et al., 2016). As for all the 8.58 billion possible human whole-genome SNPs, creation of a relevant database, dbWGFP, has already been reported (Wu et al., 2016); this database is designed to compile all the available information about each of these SNPs so that it can be used in the near future to handle requests from people who want to sequence their own individual genome and then obtain individual benefits from it.

Because biomedical SNP markers may be used for diagnosis and selection of treatments for humans, there is only one acceptable approach to identify them: that is, to estimate the statistical significance of differences in the prevalence of a given SNP in the representative cohorts of individuals with the phenotypic trait of interest (Varzari et al., 2018). It is unlikely that this extremely time-consuming and expensive procedure is applicable to each of the 8.58 billion possible human SNPs (Abbas et al., 2006). Moreover, both Haldane's dilemma (Haldane, 1957) and Kimura's theory of neutral evolution (Kimura, 1968) predict neutrality of the absolute majority of human SNPs.

These neutral SNPs should be discarded by computer-based calculations in order to reduce the total cost of biomedical SNP markers. Currently, there are many public Web services (e.g., Bendl et al., 2016) that predict candidate SNP markers and eliminate the most probable neutral SNPs while taking into account various similarity measures for genome-wide data during infections (Leschner et al., 2012) or diseases (Hu et al., 2013) as well as after treatment (Hein and Graver, 2013) and in health (Ni et al., 2012). The accuracy of these similarity-based predictions increases with the increase in diversity of available genome-wide data, in agreement with our predictions (Ponomarenko et al., 1999) based on the Central Limit Theorem.

The best accuracy of these bioinformatics predictions corresponds to SNPs in the protein-coding regions owing to their reliable manifestation as protein damage, whereas in the case of SNPs in the regulatory regions of genes, none of the proteins is damaged (Amberger et al., 2015). Notably, the 70 bp promoter regions in front of the transcription start sites (TSSs) contain the majority of the clinically verified regulatory SNP markers (Ponomarenko et al., 2013) due to the TATA-binding protein (TBP)-binding site (e.g., TATA-box), which is obligatory for the primary initiation of gene transcription (Martianov et al., 2002). Finally, Mogno et al. (2010) experimentally found that the increase in TBP-binding affinity for the TBP-binding sites altered by SNPs causes overexpression of the appropriate genes whereas underexpression corresponds to a decrease in the affinity.

In our previous works, we created a public Web service SNP\_TATA\_Comparator (see text footnote 1) (Ponomarenko et al., 2015) for selecting the statistically significant SNP-caused alterations in TBP's affinity for the promoter regions 70 bp upstream of the protein-coding TSSs. This Web service is based on our three-step model of the TBP–promoter binding to each other (Ponomarenko et al., 2008), namely: (i) TBP slides along DNA ↔ (ii) TBP stops at a putative TBP-binding site ↔ (iii) the TBP–promoter complex is fixed by the DNA bending at a right angle, as was experimentally discovered (Delgadillo et al., 2009). Using SNP\_TATA\_Comparator, we predicted candidate SNP markers – within TBP-binding sites of the human gene promoters – associated with obesity, chronopathology, aggressiveness, and autoimmune and Alzheimer's diseases (for review, see Ponomarenko P. et al., 2017). Recently, we preliminarily studied (Chadaeva et al., 2017) the possibility to predict candidate SNP markers for social hierarchy using a short representative set of 21 human genes homologous to the animal genes encoding the known physiological markers of aggressiveness, which represent nervous, endocrine, immune, respiratory, vascular, muscular, and other systems of the human body.

In this work, due to our observation (Bragin et al., 2006) of domination of adult male BALB/cLac mice over CBA/Lac mice, we made a genome-wide prediction for the human tendencies in dominance and subordination within the framework of the neuropeptidergic, non-neuropeptidergic, and neurotrophinergic systems and verified it using a mouse model of human inheritance. We discuss how our results fit both genetic (e.g., Freud and Lorenz) and non-genetic (e.g., Berkowitz and Fromm) irreconcilable sides of the century-old scientific debate about the origin of both aggressiveness and social hierarchy in humans.

#### MATERIALS AND METHODS

#### Animals

This study was carried out in accordance with the recommendations of Directive 2010/63/EU of the European Parliament and of the Council of September 22, 2010, on the protection of animals used for scientific purposes. Manipulations of animals and experimental procedures were performed in compliance with the international rules according to the "Guidelines for the care and use of mammals in neuroscience and behavioral research"<sup>1</sup>. The research protocol was approved by the Interinstitutional Commission on Bioethics at the ICG SB RAS, 10 Lavrentyev Avenue, Novosibirsk, Russia.

Analysis of the inheritance of agonistic behavior indicators and social dominance levels was conducted on 230 adult male mice that are diallelic crosses of a set of five maternal inbred mouse strains (i.e., PT, DD, YT, A/He, and C57BL/6J) with two analytic inbred paternal strains (BALB/cLac and CBA/Lac) of the murine tendencies in dominance and subordination, respectively, as determined experimentally previously (Bragin et al., 2006).

All the mice were maintained under standard conditions of a conventional animal facility of the ICG SB RAS.

#### Identification of Inheritance of the Mouse Tendencies in Dominance and Subordination

One can see all the 230 diallelic crosses in **Table 1**, where five rows and two columns present F1 males. In each row of this table, there are descendants of mothers of the same inbred strain. Thus, the maternal non-genetic (pre- and postnatal) and cytoplasmic effects are the same for males of the same row of this table. To exclude non-genetic paternal postnatal effects on offspring, pregnant female mice were isolated from male mice.

We made up groups of F1 hybrid male mice with the minimal society size, namely, two males each, one from each column of the same row of **Table 1**. In each pair, both male mice had identical age, weight, and body size, but visually differed from each other in color. This approach allowed us to estimate the influence of the paternal genotype on the social dominance level of the appropriate F1 crosses.

<sup>1</sup>https://grants.nih.gov/grants/olaw/National\_Academies\_Guidelines\_for\_Use\_and\_Care.pdf

The number of male mice for each of the 10 F1 hybrids is indicated in parentheses.

<sup>1</sup>http://beehive.bionet.nsc.ru/cgi-bin/mgs/tatascan/start.pl

A total of 115 experimental pairs (230 F1 hybrids) were distributed into five groups, corresponding to the maternal inbred strains (see **Table 1**). For each mouse male pair tested, we performed 14 observations (20 min each) during 5 days. Each observation was recorded using a video camera in automatic mode with a fixed period. Next, we analyzed these video recordings using the protocols of software The Observer XT 7.0 (version: 7.0, Noldus Information Technology, license No. OB070-03670). This way, we identified the social rank for each male within the appropriate pair according to asymmetry in agonistic behavior, in particular, by means of attacks and submissive poses as described in the Supplementary Experiment (**Supplementary File S6**).

#### The Basic Decision-Making Rule

Both domesticated and laboratory animals are artificially selected using the known target traits (Belyaev, 1979; Kulikov et al., 2016), which can help in any computer-based genome-wide analysis of these animals (e.g., Zapata et al., 2016) in contrast to the human genome, which is the result of natural selection in favor of unknown unspecializing target traits (Dobzhansky, 1963). Hence, on the basis of our preliminary work (Chadaeva et al., 2017), we proposed the following heuristic decision-making rule: "IF {an excess of a protein relating to the nervous system is an experimentally known physiological marker of low pain sensitivity, fast post-injury recovery, or aggressive, fearless, impulsive, anxious, exploratory, risk/novelty-seeking, anesthetic-like, or similar agonistic-intolerant behavior} AND IF {a given SNP can cause overexpression of a gene encoding this protein} THEN {this SNP can be a SNP marker of predisposition to social dominance} WHILE {the underexpression corresponds to subordination} AND vice versa." This whole study is devoted to evaluation of this decision-making rule.
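The heuristic rule quoted above can be written down as a small lookup. The following is only a minimal Python sketch, not the authors' implementation: the function name `predict_tendency` and its two string arguments are hypothetical, and the "vice versa" clause of the rule is read here as a symmetric inversion for proteins whose excess marks the opposite kind of behavior.

```python
def predict_tendency(excess_marks, expression_change):
    """Apply the heuristic decision-making rule to one SNP (illustrative only).

    excess_marks: "dominance-like" if an excess of the protein is a known
        marker of low pain sensitivity, fast post-injury recovery, or
        agonistic-intolerant behavior; "subordination-like" if an excess
        marks the opposite kind of behavior (assumed reading of "vice versa").
    expression_change: "over" or "under", the predicted effect of the SNP
        on expression of the gene encoding this protein.
    """
    if excess_marks not in ("dominance-like", "subordination-like"):
        raise ValueError("unknown marker type: " + repr(excess_marks))
    if expression_change not in ("over", "under"):
        raise ValueError("unknown expression change: " + repr(expression_change))

    overexpressed = expression_change == "over"
    if excess_marks == "dominance-like":
        return "dominance" if overexpressed else "subordination"
    return "subordination" if overexpressed else "dominance"


# Overexpression of a gene whose protein excess marks fearless behavior
print(predict_tendency("dominance-like", "over"))
```

Encoding the marker type as an explicit argument keeps the literature-derived judgment (what an excess of the protein means) separate from the computed prediction (over- or underexpression).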

#### DNA Sequences

Using the aforementioned basic decision-making rule (see subsection "The Basic Decision-Making Rule"), we analyzed all the 5052 SNPs retrieved from the dbSNP database (build 150, Sherry et al., 2001), which are found within the 70 bp promoter regions upstream of the protein-coding transcripts of all the 231 human genes of the neuropeptidergic, non-neuropeptidergic, and neurotrophinergic systems retrieved from database Ensembl (GRCh38/hg38 assembly, Zerbino et al., 2015), which are listed in the alphabetic order in the first columns of **Supplementary Tables S1–S3**, respectively (hereinafter: see **Supplementary Files S1–S3**, respectively). These genes encode proteins that are known as key factors altering human social behavior, namely, neurotrophic and growth factors, interleukins, neurotransmitters, receptors, transporters, and enzymes.
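Selecting SNPs within the 70 bp region upstream of a transcription start site is a simple interval filter once the strand is taken into account. The sketch below is an illustration under stated assumptions: coordinates are 1-based and inclusive, the window excludes the TSS position itself, and the function names, the dictionary layout of a SNP record, and all coordinates are hypothetical rather than taken from dbSNP or Ensembl.

```python
def promoter_window(tss, strand, length=70):
    """Return the (start, end) of the region immediately upstream of a TSS.

    Assumes 1-based inclusive genomic coordinates. On the minus strand,
    "upstream" means larger coordinates.
    """
    if strand == "+":
        return tss - length, tss - 1
    if strand == "-":
        return tss + 1, tss + length
    raise ValueError("strand must be '+' or '-'")


def snps_in_window(snps, tss, strand, length=70):
    """Keep the SNP records (dicts with 'id' and 'pos') inside the window."""
    start, end = promoter_window(tss, strand, length)
    return [snp for snp in snps if start <= snp["pos"] <= end]


# Hypothetical records on one chromosome; the IDs are placeholders
snps = [
    {"id": "snp_a", "pos": 929},
    {"id": "snp_b", "pos": 930},
    {"id": "snp_c", "pos": 999},
    {"id": "snp_d", "pos": 1000},
]
print([s["id"] for s in snps_in_window(snps, tss=1000, strand="+")])
```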

Using our public Web service SNP\_TATA\_Comparator (Ponomarenko et al., 2015), we compared the DNA sequences of the ancestral (wt) and minor (min) alleles of SNPs of the 70 bp promoter region of these genes. We applied it together with the public Web service UCSC Genome Browser (Haeussler et al., 2015) and two public databases dbSNP (Sherry et al., 2001) and ClinVar (Landrum et al., 2014), as described in the Supplementary Web-service (**Supplementary File S5**). As a result, we obtained two pairs of values, $-\ln(K_{D(wt)}) \pm \delta_{(wt)}$ and $-\ln(K_{D(min)}) \pm \delta_{(min)}$, of TBP affinity for these alleles of the promoter being studied according to contextual, conformational, and physicochemical changes in its B-helical DNA under the influence of a given SNP, as described in the Supplementary Method (**Supplementary File S4**). Next, we calculated Fisher's Z-score as follows: $Z = \left|\ln\left(K_{D(min)}/K_{D(wt)}\right)\right| / \left[\delta_{(min)}^{2} + \delta_{(wt)}^{2}\right]^{1/2}$, and in turn found the p-value of statistical significance of this score using package R (Waardenberg et al., 2015).
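The Z-score defined above can be illustrated with a short Python sketch. It is not the code of SNP\_TATA\_Comparator: the function names are hypothetical, the input values are invented, the p-value is taken from a two-sided standard normal tail as an assumption (the text only states that an R package was used), and the 0.05 threshold is likewise assumed. Because affinity grows as $K_D$ falls, a larger $-\ln(K_D)$ for the minor allele is treated as an affinity increase.

```python
import math


def tbp_z_score(neg_ln_kd_wt, delta_wt, neg_ln_kd_min, delta_min):
    """Z = |ln(KD_min / KD_wt)| / sqrt(delta_min^2 + delta_wt^2).

    Inputs are the -ln(KD) estimates and their errors for both alleles,
    so ln(KD_min / KD_wt) = (-ln KD_wt) - (-ln KD_min).
    """
    log_ratio = neg_ln_kd_wt - neg_ln_kd_min
    return abs(log_ratio) / math.sqrt(delta_min ** 2 + delta_wt ** 2)


def two_sided_p(z):
    """Two-sided tail probability of a standard normal (assumed null model)."""
    return math.erfc(z / math.sqrt(2))


def predict_expression(neg_ln_kd_wt, delta_wt, neg_ln_kd_min, delta_min,
                       threshold=0.05):
    """Return 'excess', 'deficit', or 'insignificant' (hypothetical labels)."""
    z = tbp_z_score(neg_ln_kd_wt, delta_wt, neg_ln_kd_min, delta_min)
    if two_sided_p(z) >= threshold:
        return "insignificant"
    # Higher -ln(KD) for the minor allele means stronger TBP binding
    return "excess" if neg_ln_kd_min > neg_ln_kd_wt else "deficit"


# Invented values for illustration only
print(predict_expression(19.0, 0.1, 19.3, 0.1))
```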

Finally, using this p-value, we discarded all the SNPs the effects of which were estimated as insignificant; otherwise, using decisions on the SNP-caused significant increase and decrease of the binding affinity of TBP for the analyzed promoters, we predicted the candidate SNP markers for over- or underexpression of the appropriate genes, respectively, as demonstrated experimentally (Mogno et al., 2010). Readers can find all our predictions within the columns "KD, nM, prediction" of **Supplementary Tables S1–S3**. Their subcolumns "wt" and "min" contain $K_D$ values of TBP's binding affinity for the ancestral and minor alleles of the appropriate promoters, respectively. Furthermore, subcolumns "Δ" and "α" correspond to the human gene expression alterations and their statistical significance levels α, which are equal to (1 − p). In addition, subcolumn "ρ" presents a heuristic rank of our predictions varying in alphabetical order from the "best" (A) to the "worst" (E). Finally, **Table 2** contains total numbers of our predictions (NRES) as well as the numbers of the candidate SNP markers for either overexpression (N>) or underexpression (N<) of the human genes, as predicted by this work.

#### The Keyword Search in the PubMed Database

For each candidate SNP marker predicted, we manually performed a two-step keyword search in the PubMed database (Lu, 2011) as shown in **Figure 1**.

As presented in this figure, we handled each candidate SNP marker independently of the others, one by one. First of all, we checked whether the SNP in question was annotated by database ClinVar (Landrum et al., 2014) as depicted in **Supplementary Figure S1C** (hereinafter: see **Supplementary File S5** "Supplementary Web service") and boldfaced in both the first and third rightmost columns of **Supplementary Tables S1–S3**.

When this database associated the SNP under study with the human diseases, we manually carried out a primary keyword search for the literature data on the known physiological marker of these diseases, which corresponds to the gene expression alteration predicted for this SNP as described elsewhere (Lu, 2011). **Figure 1** depicts this procedure as two boxes consisting of dashed lines. In the case of a successful finding of such a publication, the clinical data taken from database ClinVar (Landrum et al., 2014) indicated the adequacy of our predictions for the SNP under consideration. These confirmations of our predictions are italicized in both the first and third rightmost columns of **Supplementary Tables S1–S3**.

TABLE 2 | Predictions of candidate SNP markers that can statistically significantly alter the TATA-binding protein (TBP)-binding sites of the human gene promoters of all the protein-coding transcripts relating to neuropeptidergic, non-neuropeptidergic, and neurotrophinergic systems.

NGENE and NSNP, total numbers of the human genes and their SNPs (single nucleotide polymorphisms) within the 70 bp promoter region for the protein-coding transcripts, respectively, in this study; NRES, the total number of the candidate SNP markers predicted in this work that can increase (N>) or decrease (N<) the TATA-binding protein (TBP) binding affinity for these promoters and, correspondingly, the expression of these genes; N↑ and N↓, the total numbers of the candidate SNP markers for the human tendencies in dominance and subordination, respectively; P(H<sub>0</sub>), the estimate of a probability for the acceptance of this H<sub>0</sub> hypothesis, according to the binomial distribution.

Finally, two dotted boxes in **Figure 1** depict our secondary keyword search for the known physiological markers for pain sensitivity, post-injury repair efficiency, or agonistic behavior, which correspond to underexpression of the human gene containing this SNP. This way, we tested the basic decision-making rule of this work (see subsection "The Basic Decision-Making Rule"). As the main bioinformatic results, we predicted the candidate SNP markers for the human tendencies in dominance and subordination, which are in both the first and third rightmost columns of **Supplementary Tables S1–S3**. **Table 2** contains the total number of these candidate SNP markers (N↑ and N↓, respectively).

The section "References" lists the articles cited in **Supplementary Tables S1–S3** and in section "Supplementary Method."

#### Statistical Analysis

We analyzed dichotomies via the equiprobable binomial distribution and χ<sup>2</sup> criteria taken from the standard statistical package Statistica (StatSoft™, Tulsa, United States).

In the genome-wide study in silico, using only Fisher's Z-score test, we predicted the candidate SNP markers, the numbers of which for the human gene overexpression and underexpression were compared with one another using the binomial distribution as well as in the case of the human tendencies in dominance and subordination.

During in vivo validation in mice, by means of the χ<sup>2</sup> criterion, we compared the actual numbers of dominants and subordinates among male mice, which were the F1 hybrids of crossing females from inbred strains of an unknown tendency in social hierarchy with males from two inbred strains BALB/cLac and CBA/Lac of the previously experimentally identified tendencies in dominance and subordination, respectively (Bragin et al., 2006).
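A comparison of two observed counts by the χ<sup>2</sup> criterion can be sketched in a few lines of Python. The exact design of the authors' test in Statistica is not given in this section, so the sketch below assumes the simplest variant: a goodness-of-fit test with one degree of freedom against an equal split. The function name and the counts are hypothetical.

```python
import math


def chi2_equal_split(n_dominant, n_subordinate):
    """Chi-square goodness-of-fit test against a 50:50 split (1 d.f.).

    Returns (chi2, p). For one degree of freedom the upper tail of the
    chi-square distribution equals erfc(sqrt(chi2 / 2)).
    """
    total = n_dominant + n_subordinate
    expected = total / 2
    chi2 = ((n_dominant - expected) ** 2
            + (n_subordinate - expected) ** 2) / expected
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Invented counts of dominants and subordinates among one group of F1 males
chi2, p = chi2_equal_split(30, 16)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```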

#### RESULTS AND DISCUSSION

Our analysis of 5052 SNPs of the TBP-binding regions of 231 human neuron-related genes uncovered 1108 candidate SNP markers for the human tendencies in dominance and subordination (**Table 2**). These predictions are shown in **Supplementary Tables S1–S3** and exemplified in **Figures 2**, **3** and **Supplementary Figure S1**. For 36 of the 231 genes (16%), namely: ADRA1B, ADRA2A, ADRA2B, ADRB1, AVP, AVPR1A, CHRNB2, CNR2, FGF15, FGF16, FGF2, FGF23, FGF7, FIGF, FLT3, GABARAPL3, GABRA3, GABRA4, GABRQ, GMFA, GRIA3, GRIK4, GRIN2B, GRM6, IGF2R, IL27RA, KDR, LIF, MANF, MAOA, MAOB, NGF, OXT, TACR3, TGFBRAP1, and VEGFC, no candidate SNP markers were found (data not shown). Let us focus our analysis of our results on the candidate SNP markers that have independent clinical information within database ClinVar (Landrum et al., 2014) to both verify and discuss their relevance to the human genes under study.

#### Candidate SNP Markers Near TBP-Binding Sites in the Promoter of the Human Genes Encoding Neuropeptidergic-System-Related Proteins (e.g., Neurotransmitters)

We applied our experimentally verified public Web service (Ponomarenko et al., 2015) to analyze 395 SNPs in 70 bp proximal promoter regions of 27 human genes encoding neuropeptidergic-system–related proteins, namely: arginine vasopressin receptors (AVPRs), C-X-C motif chemokine receptors (CXCRs), neuropeptide Y and its receptors (NPYs), opioid growth factor receptor (OGFR), opioid receptors (OPRs), oxytocin and its receptor (OXTs), prodynorphin (PDYN), proenkephalin (PENK), prepronociceptin (PNOC), proopiomelanocortin (POMC), and tachykinins together with their precursors and receptors (TACs). The results obtained can be found in **Supplementary Table S1**.

The human PDYN gene encodes the opioid polypeptide hormone prodynorphin, a basic building block of endogenous opioid neuropeptides (so-called endorphins) that can inhibit pain signals peripherally and, when acting in the brain, cause a feeling of euphoria as neurotransmitters of happiness and joy. SNP rs886056538 of this gene's promoter was annotated within database ClinVar (Landrum et al., 2014), where it is associated with spinocerebellar ataxia as shown in **Supplementary Figure S1C**. **Supplementary Figure S1D** illustrates our prediction for this SNP, which is the line "Decision: excess significant" accompanied by the line "Z-score = 2.51, p > 0.95" within the textbox "Result." This outcome means that this SNP can statistically significantly cause overexpression of this gene. Our primary keyword search (hereinafter: two dashed boxes in **Figure 1**) produced an original experiment (Smeets et al., 2015) involving a mouse model of the human diseases, which has identified the prodynorphin excess as a physiological marker for spinocerebellar ataxia. As one can see, these in vivo experimental data independently support our prediction for SNP rs886056538 (**Supplementary Figure S1**). This observation indicates the suitability of our Web service (Ponomarenko et al., 2015) for computer-based analysis of the human genes encoding neuropeptidergic-system–related proteins as italicized in **Supplementary Table S1**.

After this validation, we manually conducted our secondary keyword search (hereinafter: two dotted boxes in **Figure 1**) and found the original experiment (Szklarczyk et al., 2012) in a mouse model of human behavior, which associated the prodynorphin excess with reduced conditioned fear. Using our basic decision-making rule within the limitations of the above experimental model of human behavior (Szklarczyk et al., 2012), we predicted that the analyzed SNP rs886056538 can be a candidate SNP marker for the human tendency in dominance (**Supplementary Table S1**).

Near this clinically characterized SNP marker, we found two unannotated SNPs (rs371345545 and rs557431815), which can also cause overexpression of the human PDYN gene (hereinafter: according to our predictions shown in **Supplementary Tables S1–S3**). That is why we suggest them as two candidate SNP markers of the same genetic tendencies, namely: spinocerebellar ataxia with limitations (Smeets et al., 2015) and social dominance within the framework of the model (Szklarczyk et al., 2012) as presented in **Supplementary Table S1**.

This way, we predicted 66 and 31 candidate SNP markers for excess and deficiency of the proteins of the human neuropeptidergic system, respectively, which are also 51 and 46 candidate SNP markers predicted by this work for the human tendencies in dominance and subordination (**Table 2** and **Supplementary Table S1**). First of all, readers can see that the numbers of the candidate SNP markers predicted for the human tendencies in dominance and subordination are not statistically significantly different from one another according to the equiprobable binomial distribution criterion ($P(N_{\uparrow} \equiv N_{\downarrow} \equiv N_{RES}/2) > 0.6$). This finding is in agreement with our preliminary estimate (Chadaeva et al., 2017), namely: $P(N_{\uparrow} \equiv N_{\downarrow} \equiv N_{RES}/2) > 0.9$.
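The equiprobable binomial criterion used above can be reproduced approximately with an exact two-sided binomial test. The Python sketch below is an assumption about how such a probability may be computed (the authors report using Statistica); the function name is hypothetical, and the only data used are the counts 51 and 46 quoted in the preceding paragraph.

```python
from math import comb


def equiprobable_binomial_p(n_a, n_b):
    """Two-sided exact binomial probability under H0: both outcomes
    are equally likely (p = 0.5 each).

    Sums the upper tail starting at the larger count and doubles it,
    capping the result at 1.
    """
    n = n_a + n_b
    k = max(n_a, n_b)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


# Dominance vs. subordination markers of the neuropeptidergic system
p = equiprobable_binomial_p(51, 46)
print(f"P = {p:.2f}")
```

With these counts the probability stays well above conventional significance thresholds, which is in line with the statement that the two numbers do not differ significantly.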

On the contrary, the numbers of the candidate SNP markers predicted for excess and deficiency of the proteins of the human neuropeptidergic system are significantly different from one another according to the equiprobable binomial distribution criterion ($P(N_{>} \equiv N_{<} \equiv N_{RES}/2) < 0.0005$) in line with our preliminary observations (Chadaeva et al., 2017), as presented in **Table 2**: $N_{>} = 66$, $N_{<} = 26$ ($P(N_{>} \equiv N_{<} \equiv N_{RES}/2) < 0.0005$). According to a number of studies, various molecular phenomena can shift frequencies of mutations (e.g., influence of the nucleotide context on the occurrence and repair of premutational damage to genomic DNA, gene conversion, and pleiotropic and epistatic effects). Kasowski et al. (2010) first noticed that SNPs decreasing the protein–DNA affinity are much more frequent than SNPs increasing this affinity within the human genome. Next, the 1000 Genomes Project Consortium et al. (2012) quantitatively characterized this mutational shift, namely: there are ∼800 SNPs damaging the transcription factor binding sites and ∼200 SNPs improving these sites per random individual human genome as shown in **Table 2**. According to Haldane's dilemma (Haldane, 1957) and neutral evolution theory (Kimura, 1968), this genome-wide estimate can correspond to the neutral mutational drift as a norm. Indeed, we observed 37 clinically proven SNP markers of the human hereditary diseases, which decrease the TBP–promoter affinity, and 14 such SNP markers increasing this affinity (Ponomarenko et al., 2015) in agreement with the above-mentioned genome-wide estimate (**Table 2**). This pattern matches the commonly accepted opinion on these diseases as a genetic load of the neutral mutational drift in the norm.

Nevertheless, in the case of human reproductive potential, which is considered the target of natural selection, we observed a diametrically opposite pattern, namely: five candidate SNP markers were decreasing the TBP–promoter affinity and 19 candidate SNP markers were increasing this affinity (Chadaeva et al., 2018). Besides, we found (Ponomarenko P. et al., 2017) only a minority (12 of 28) of candidate SNP markers of familial Alzheimer's disease that can decrease the TBP–promoter affinity; this finding is consistent with natural selection for its very slow pathogenesis, whose clinical manifestation is observed only at the age of over 65 (**Table 2**). In addition, in the case of core genes of the circadian clock (Ponomarenko et al., 2016), which are naturally selected for continuous coordination between the functioning of systems of the human body and daily fluctuations of the environment, we found 13 candidate SNP markers that can decrease the TBP–promoter affinity and 39 candidate SNP markers increasing this affinity (**Table 2**).

Looking through **Table 2**, we noticed that our predictions for the neuropeptidergic gene system are more similar to those for natural selection cases than to those for neutral drift within the normal range. That is why here we predict that the human genes encoding neuropeptidergic-system-related proteins are under natural selection pressure, which equally supports the human tendencies in subordination and domination, as was preliminarily estimated elsewhere (Chadaeva et al., 2017). This way, we followed the semicentennial bioinformatic tradition to compare the actual frequencies of natural mutations within their various dichotomies [e.g., transitions versus transversions (Kimura, 1980) as well as synonymous versus non-synonymous changes (Li et al., 1985)].
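The comparison drawn from **Table 2** can be made concrete by computing, for each data set quoted in the text, the share of SNPs that decrease the TBP–promoter (or transcription factor) affinity. The Python sketch below uses only counts stated in the preceding paragraphs; the dictionary keys and the helper name are hypothetical, and the 50% dividing line is an illustrative assumption rather than a criterion used by the authors.

```python
# (decreasing affinity, increasing affinity) counts quoted in the text
counts = {
    "genome-wide TF sites (approx. per genome)": (800, 200),
    "hereditary disease markers": (37, 14),
    "reproductive potential": (5, 19),
    "familial Alzheimer's disease": (12, 16),   # 12 of 28 decrease
    "circadian clock core genes": (13, 39),
    "neuropeptidergic system": (31, 66),        # deficiency 31, excess 66
}


def share_decreasing(decreasing, increasing):
    """Fraction of SNPs that decrease the binding affinity."""
    return decreasing / (decreasing + increasing)


shares = {name: share_decreasing(*pair) for name, pair in counts.items()}
for name, share in shares.items():
    side = "drift-like" if share > 0.5 else "selection-like"
    print(f"{name}: {share:.2f} ({side})")
```

Under this reading, the neuropeptidergic system falls on the same side as the reproductive, Alzheimer's, and circadian data sets, and on the opposite side from the genome-wide and hereditary disease estimates.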

#### Candidate SNP Markers Near TBP-Binding Sites in the Promoter of the Human Genes Encoding Proteins Related to the Non-neuropeptidergic System (e.g., Receptors)

Using our public Web service (Ponomarenko et al., 2015), we analyzed 2226 SNPs located within the TBP-binding regions of 109 human genes encoding proteins that are related to the non-neuropeptidergic system, e.g., adenosine receptors (ADORs), adrenoceptors (ADRs), muscarinic cholinergic receptors (CHRMs), nicotinic cholinergic receptors (CHRNs), central cannabinoid receptor 1 (CNR1), catechol-O-methyltransferase (COMT), dopamine D receptors (DRDs), GABA type A receptor-associated proteins (GABARAPs), γ-aminobutyric acid type B receptor subunits (GABBRs), γ-aminobutyric acid type A receptor subunits (GABRs), G protein–coupled receptors (GRPs), glutamate ionotropic receptor AMPA-type subunits (GRIAs), glutamate ionotropic receptor NMDA-type subunits (GRINs), glutamate metabotropic receptors (GRMs), 5-hydroxytryptamine (serotonin) receptors (HTRs), dopamine transporter DAT (SLC6A3), Na<sup>+</sup>/Cl<sup>−</sup>-dependent serotonin transporter SERT (SLC6A4), tyrosine hydroxylase (TH), and tryptophan hydroxylase 2 (TPH2). **Table 2** and **Supplementary Table S2** list the results.

The human COMT gene for catechol-O-methyltransferase has, in its promoter, a clinically annotated SNP, rs777650793, whose association with human cardiovascular disease was documented by database ClinVar (Landrum et al., 2014). **Figure 2A** presents our prediction for this SNP, which is an excess of this protein. As a non-statistical validation of this prediction, we manually performed our primary keyword search, which resulted in an experimental study (He et al., 2011) on a rat model of human pathologies, which has identified COMT overexpression as a physiological marker of cerebral vasospasm. This correspondence between our prediction (**Figure 2A**) and these experimental data (He et al., 2011) can support the suitability of the results of our Web service (Ponomarenko et al., 2015) in the case of a study of the human non-neuropeptidergic system as italicized in **Supplementary Table S2**.

As for our secondary keyword search, it resulted in an in vivo experiment in a rat model of human behavior (Wilhelm et al., 2013), where a catechol-O-methyltransferase excess was a physiological marker of depression. Within the framework of the behavioral animal model (Wilhelm et al., 2013), we predicted the candidate SNP marker of the human tendency in subordination (**Supplementary Table S2**).

The human DRD3 gene (dopamine receptor D3) carries SNP rs36211802 annotated by database ClinVar (Landrum et al., 2014), which associates it with hereditary essential tremor. This SNP can cause an excess of this receptor, according to our prediction given in **Figure 2B**. We validated this prediction by our primary keyword search, which found the original experimental data (Kosmowska et al., 2016) on resistance to the high-dose DRD3-agonist treatment of tremor in a laboratory rat model of this human pathology as italicized in **Supplementary Table S2**.

In addition, our secondary search revealed (**Supplementary Table S2**) that a DRD3 excess reduced both motor activity and behavioral motivation in a mouse model of human motor activity (Ikeda et al., 2013). This finding allows us to predict rs36211802 as a candidate SNP marker of the human tendency in subordination (**Supplementary Table S2**).

The human HTR2C gene encodes 5-hydroxytryptamine (serotonin) receptor 2C and carries SNP rs3813929, manifestation of which is an abnormal response to olanzapine (antipsychotic) according to database ClinVar (Landrum et al., 2014). For this SNP, we predict an excess of this serotonin receptor as shown in **Figure 2C**. Our primary keyword search pointed to the clinical data (Ellingrod et al., 2005) on an HTR2C excess caused by this SNP, whose manifestation is a resistance to olanzapine-caused increase in body mass. It is noteworthy that Tecott et al. (1995) reported that knockout mice (5HT2C<sup>(−)/(−)</sup>) are obese, whereas Stahl (1998) observed eating behavior downregulation with a 5HT2C level increase. With this in mind, our prediction of the rs3813929-related 5HT2C excess (**Figure 2C**) fits the clinical observation of the rs3813929-related resistance to olanzapine-caused increase in body mass (Ellingrod et al., 2005). This agreement between our prediction shown in **Figure 2C** and the clinical observations (Tecott et al., 1995; Stahl, 1998; Ellingrod et al., 2005) is consistent with our verification of our predictions of this type by electrophoretic mobility shift assays (EMSAs) under equilibrium (Savinkova et al., 2013) and non-equilibrium conditions (Drachkova et al., 2014) in vitro. Besides, this result is in agreement with our verification of our predictions on this subject using biosensor ProteON™ (Bio-Rad Lab, United States) (Drachkova et al., 2012) and stopped-flow spectrometer SX.20 (Applied Photophysics, United Kingdom) (Arkova et al., 2014, 2017) in real-time mode. In addition, it fits our verification of our analogous predictions using human cell lines transfected with the pGL 4.10 vector (Promega, United States) (for a review, see Ponomarenko M. et al., 2017). Finally, it is in line with our verification of our predictions on this subject using independent data from 60 experiments (for a review, see Ponomarenko et al., 2010) and by means of 43 known clinical SNP markers of human diseases (Ponomarenko et al., 2009) and 38 known genetic SNP markers of the breeding traits of animals and plants (Suslov et al., 2010). All these verification data can be a reason for the applicability of our Web service (Ponomarenko et al., 2015) when the human genes relating to the non-neuropeptidergic system are studied, as italicized in **Supplementary Table S2**.

Our secondary keyword search yielded empirical data on two laboratory rat strains, which were bred for 60 generations for the presence and absence of high levels of stress-evoked aggression toward humans (Popova et al., 2010). According to these data, increases in both mRNA and protein levels of this receptor were seen in the brains of nonaggressive rats in comparison with the aggressive ones (**Supplementary Table S2**). On this basis, we propose rs3813929 as a candidate SNP marker for the human tendency in subordination (**Supplementary Table S2**).

In total, we predicted 342 and 163 candidate SNP markers that can increase and decrease, respectively, the expression of the human proteins related to the non-neuropeptidergic system. Besides, these 505 predictions can be clustered as 240 and 265 candidate SNP markers for the human tendencies in dominance and subordination (**Table 2** and **Supplementary Table S2**). As readers can see in **Table 2**, these results are again consistent with our preliminary estimates (Chadaeva et al., 2017) that natural selection equally supports the human tendencies in dominance and subordination.

#### Candidate SNP Markers Near TBP-Binding Sites in the Promoter of the Human Genes Encoding Neurotrophinergic-System-Related Proteins (e.g., Growth Factors, Receptors)

We applied our public Web service (Ponomarenko et al., 2015) to study 2431 SNPs in 70 bp regions in front of the TSSs of 95 human genes encoding neurotrophinergic-system–related proteins, namely, adenylate cyclase-activating polypeptide 1 and its receptor (ADCYAP1s), artemin (ARTN), brain-derived neurotrophic factor (BDNF), cerebral dopamine neurotrophic factor (CDNF), ciliary neurotrophic factor (CNTF), fibroblast growth factors and their receptors (FGFs), Fms-related tyrosine kinases and their ligand (FLTs), glial-cell-derived neurotrophic factor (GDNF), GDNF family receptors (GFRs), glia maturation factors (GMFs), insulin-like growth factors and their receptors (IGFs), interleukins as well as their receptors and signal transducers (ILs), leukemia-inhibitory factor (IL6-family cytokine) and its receptor (LIFs), nerve growth factor and its receptor (NGFs), neuregulins (NRGs), neuropilins (NRPs), neurturin (NRTN), neurotrophins (NTFs), neurotrophic receptor tyrosine kinases (NTRKs), oncostatin M and its receptor (OSMs), platelet-derived growth factor subunits and receptors (PDGFs), placental growth factor (PGF), persephin (PSPN), Ret receptor tyrosine kinase (RET), transforming growth factors β, their receptors and associated protein 1 (TGFBs), and vascular endothelial growth factors (VEGFs). We show our results in **Table 2** and **Supplementary Table S3**.

The human FGFR2 gene (fibroblast growth factor receptor 2) contains two SNPs rs387906677 and rs886046768, which were clinically detected in patients with bent bone dysplasia syndrome and craniosynostosis, respectively, as documented by database ClinVar (Landrum et al., 2014). Readers can see in **Figures 2B**, **3A** how we predicted the FGFR2 deficiency in the case of rs387906677, whereas rs886046768 corresponds to an FGFR2 excess.

At first, our primary keyword search revealed an experimental report (Merrill et al., 2012) on a mouse model of human embryonic development, which linked bent bone dysplasia with reduced levels of FGFR2. Next, in the same way, we found the original experiment (Mansukhani et al., 2000) on mouse osteoblast cell culture ex vivo that points to FGFR2 as an inducer of apoptosis in these cells and an inhibitor of their differentiation, hyperactivity of which causes craniosynostosis-linked alterations in cell culture. As depicted in the figures, these independent findings confirm the validity of our predictions (**Figures 3A,B**) in the case of the neurotrophinergic system analysis, as italicized in **Supplementary Table S3**.

After this validation, our secondary keyword search yielded an article (Meyer et al., 2012) on FGFR2 deficiency as a physiological marker of delayed post-injury skin wound healing. Analogously, we found a biomedical paper (Baatar et al., 2002) on the injections of recombinant human FGFR2 around ulcers, which have accelerated ulcer healing in rats as an animal model of the human pathologies. On this basis, we predicted rs387906677 and rs886046768 as candidate SNP markers of the human tendencies in subordination and dominance, respectively (**Supplementary Table S3**).

The human PDGFRA gene encodes platelet-derived growth factor receptor α and contains SNP rs183431225 annotated by database ClinVar (Landrum et al., 2014) in connection with both idiopathic hypereosinophilic syndrome and gastrointestinal stromal tumor. **Figure 3C** presents our prediction for this SNP: overexpression of this receptor. Our primary keyword search revealed two biomedical papers, one of which (Score et al., 2006) reports the PDGFRA excess as a marker of patients with hypereosinophilia, and another one (Hayashi et al., 2015) reveals reduced proliferation of gastrointestinal stromal tumor cells under the influence of a selective inhibitor of PDGFRA. Thus, these independent literature data support applicability of our predictions to the study of human genes encoding neurotrophinergic-system–related proteins as italicized in **Supplementary Table S3**.

Then, we did our secondary keyword search and found a mouse model of human behavior indicating that the PDGFRA overexpression causes oligodendrocyte-associated nociceptive hypersensitivity to neuropathic pain (Shi et al., 2016). That is why we assumed that rs183431225 is a candidate SNP marker of the human tendency in subordination (**Supplementary Table S3**).

The human RET gene codes for the Ret proto-oncogene, where two SNPs (rs10900297 and rs10900296) have been associated with three human diseases (renal adysplasia, Hirschsprung disease, and pheochromocytoma) as documented in database ClinVar (Landrum et al., 2014). As readers can see in **Figures 3D,E**, our predictions for these SNPs surprisingly correspond to over- and underexpression of this gene. Nevertheless, using our primary keyword search, we learned that both an excess (Sarin et al., 2014) and deficit (Bridgewater et al., 2008) of RET are known as physiological markers of renal adysplasia. In addition, both overexpression (Ishii et al., 2013) and underexpression (Zhan et al., 1999) of the RET gene can contribute to the pathogenesis of Hirschsprung disease. Finally, both increased (Huang et al., 2003) and decreased (Moore and Zaahl, 2012) levels of this proto-oncogene are often seen in pheochromocytoma. Thus, the above publications additionally validate our results (**Figures 3D,E**) as italicized in **Supplementary Table S3**.

Accordingly, we conducted a secondary keyword search and thus selected two animal models of human behavior. The rat model (Wang et al., 2017) associated the RET excess with hypersensitivity to neuropathic pain. In the mouse model (Golden et al., 2010), the RET deficit reduced epidermal innervation. Within the limitations of these models, we predicted two candidate SNP markers (rs10900297 and rs10900296) of the human tendency in subordination (**Supplementary Table S3**).

The human TGFBR2 gene (transforming growth factor β receptor 2) contains SNP rs138010137, which occurs in patients with thoracic aortic aneurysm as documented in database ClinVar (Landrum et al., 2014). According to our prediction illustrated in **Figure 3F**, this SNP can reduce levels of receptor TGFBR2 in humans. Using a primary keyword search, we found an original work about the TGFBR2-deficient aortic aneurysm and aortic dissection as the specific forms of these pathologies (Angelov et al., 2017). As one can see, this is one more argument in favor of the applicability of our Web service (Ponomarenko et al., 2015) to research on the human genes related to the neurotrophinergic system as italicized in **Supplementary Table S3**.

Next, our secondary keyword search yielded a transgenic mouse model of human health (Martinez-Ferrer et al., 2010), in which the TGFBR2 deficit accelerates healing, closure, and resurfacing of skin wounds. For this reason, we suggest rs138010137 as a candidate SNP marker of the human tendency in dominance (**Supplementary Table S3**).

Summarizing all the above, we can see 506 candidate SNP markers predicted by this work in the case of human genes encoding the neurotrophinergic-system-related proteins (**Table 2** and **Supplementary Table S3**). These predictions can be grouped into 346 and 160 candidate SNP markers of the excess and deficiency of these proteins, respectively, as well as into 265 and 241 candidate SNP markers of the human tendencies in dominance and subordination (**Table 2**). Notably, the first of these dichotomies of SNPs in the human genome is statistically significantly uneven, whereas the second one is uniform. This is one more actual piece of evidence for the pressure of natural selection on the human neuron-specific genes, which equally supports the human tendencies in dominance and subordination, in agreement with our preliminary estimates (Chadaeva et al., 2017) as well as with all the other predictions of this work.

#### In silico Validation of All the Genome-Wide Predictions Made in This Work

Altogether, we analyzed 5052 SNPs within all the TBP-binding regions of all the promoters in front of all the protein-coding transcripts of all the 231 known human neuron-specific genes and selected 1108 candidate SNP markers that can significantly affect the affinity of TBP for these promoters (22%) as shown in the bottom row of **Table 2**. This result of our exhaustive whole-genome analysis of three systems of the human body (neuropeptidergic, non-neuropeptidergic, and neurotrophinergic) is consistent with both Haldane's dilemma (Haldane, 1957) and Kimura's neutral evolution theory (Kimura, 1968). Our in silico fivefold reduction in the number of unannotated SNPs for their subsequent in vivo studies is in line with the current need for reducing the cost of both experimental and clinical searches for valuable SNP markers in the human genome by trial and error through preliminary computer analysis of the known SNPs (Deplancke et al., 2016).

With this in mind, we selected all the 10 among 1,108 candidate SNP markers predicted in this work (**Figures 2**, **3** and **Supplementary Figure S1**), which are currently linked to the human diseases by public database ClinVar (Landrum et al., 2014). As described above, we non-statistically validated this set of our selected predictions by our primary keyword search in the public PubMed database (Lu, 2011). Essentially, this match between our 10 selected predictions and the found literature data is statistically significant at the level of α < 0.001 according to the criterion of the equiprobable binomial distribution.
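The quoted significance level can be reproduced with a one-line calculation, under the assumption that each of the 10 selected predictions is an independent trial that would agree with the literature by chance with probability 1/2 (one reading of the "equiprobable binomial distribution"; the symbol $k$ for the number of matches is introduced here only for illustration):

$$P(k = 10 \mid n = 10,\ p = 1/2) = \left(\frac{1}{2}\right)^{10} = \frac{1}{1024} \approx 0.00098 < 0.001$$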

It is important to note that most of the candidate SNP markers that were marked in database ClinVar (Landrum et al., 2014) had a "Clinically insignificant" label because the number of patients with these candidate SNP markers varied from one to six, whereas for clinical significance it is necessary to use cohorts of several hundred patients. This observation supports subsequent verification (using clinical protocols) of the candidate SNP markers predicted by this work. In this way, genotyping for the elite combat athletes in addition to the widely used textual psychological questionnaires for them (Tiric-Campara et al., 2012) could enrich personalized sports medicine.

In addition, we used the semicentennial bioinformatic tradition of comparing the actual frequencies of mutations for their various dichotomies [transitions versus transversions (Kimura, 1980), synonymous versus non-synonymous changes (Li et al., 1985), etc.]. To this end, we grouped all the 1108 predictions into 754 and 354 candidate SNP markers for the increase and decrease in the TBP binding affinity for promoters of the human neuron-related proteins, respectively (**Table 2**: N<sub>RES</sub>, N<sub>&gt;</sub>, and N<sub>&lt;</sub>). This dichotomy contradicts the binomial distribution of the whole-genome ratio 4:1 of the SNPs reducing versus SNPs increasing affinity of the transcription factors for the human gene promoters (1000 Genomes Project Consortium et al., 2012) as neutral drift according to Haldane's dilemma (Haldane, 1957) and neutral evolution theory (Kimura, 1968), **Table 2**: p(N<sub>&lt;</sub> = 4N<sub>&gt;</sub> = 4N<sub>RES</sub>/5) < 0.000001. This significant contradiction points to adaptive pressure of natural selection on the human neuron-specific genes, in line with the commonly accepted opinion about the adaptive role of both the nervous system and social behavior in the course of human origin and evolution. That is one more evolutionary argument for the reliability of our predictions made in this work.
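As a rough check of this comparison, the following minimal sketch computes a one-sided exact binomial tail for the observed number of affinity-decreasing markers under the neutral 4:1 expectation. The helper names are hypothetical, and the choice of a one-sided exact tail is an assumption; the test actually used for **Table 2** may differ in detail.

```python
import math

def binom_log_pmf(k, n, p):
    """Log of the binomial probability P(X = k) for X ~ Bin(n, p)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Bin(n, p), summed term by term in log space."""
    return sum(math.exp(binom_log_pmf(i, n, p)) for i in range(k + 1))

n_res = 1108        # all candidate SNP markers predicted in this work
n_decrease = 354    # markers decreasing the TBP-promoter affinity
p_neutral = 4 / 5   # share of affinity-decreasing SNPs expected under the 4:1 ratio

expected_decrease = n_res * p_neutral
p_value = binom_lower_tail(n_decrease, n_res, p_neutral)
print(f"expected {expected_decrease:.1f}, observed {n_decrease}, p = {p_value:.3g}")
```

Working in log space avoids overflow of the binomial coefficients, which exceed the floating-point range for n = 1108.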

Finally, by the same reasoning, we grouped all the 1,108 predictions into 556 and 552 candidate SNP markers for the human tendencies in dominance and subordination, respectively (**Table 2**: N<sub>RES</sub>, N<sub>↑</sub>, and N<sub>↓</sub>). In contrast to the above dichotomy, this one corresponds to the highly probable H<sub>0</sub> hypothesis about the equiprobable binomial distribution of these candidate SNP markers for human social hierarchy [**Table 2**: p(H<sub>0</sub>: N<sub>↑</sub> = N<sub>↓</sub> = N<sub>RES</sub>/2) > 0.9]. This correspondence means that the pressure of natural selection proven above equally supports the human tendencies in dominance and subordination.
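The size of this agreement can be illustrated with a normal approximation to the equiprobable binomial distribution (an approximation chosen here for brevity; the symbol $z$ is introduced only for this illustration):

$$z = \frac{N_{\uparrow} - N_{\text{RES}}/2}{\sqrt{N_{\text{RES}} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2}}} = \frac{556 - 554}{\sqrt{277}} \approx 0.12$$

A deviation of about 0.12 standard deviations from the expected value corresponds to a two-sided p-value of roughly 0.9, in line with the estimate given in **Table 2**.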

Notably, for natural selection to control the human tendencies in dominance and subordination, these tendencies must be heritable, that is, passed from parents to offspring from generation to generation. That is why we validated our in silico predictions in vivo in a mouse model of human inheritance as described below.

#### In vivo Validation of Our Predictions Using a Mouse Model of Human Inheritance

Each public Web service addresses a specific sort of regulatory SNP analysis (e.g., Bendl et al., 2016), and each has its specific advantages and disadvantages. Therefore, a comparison between the particular predictions and experimental data as an independent commonly accepted uniform platform (rather than between predictions of various Web services) needs to be a necessary step for prediction of candidate SNP markers in silico (Yoo et al., 2015; Ponomarenko M. et al., 2017). Keeping this in mind, we in vivo validated our in silico predictions on the equal natural-selection support of the human tendencies in dominance and subordination using a mouse model of human inheritance as described in the section "Materials and Methods." The obtained results are given in **Figure 4** and **Table 3**.

**Figure 4** indicates that we completely reproduced the temporal pattern of both formation and maintenance of the social hierarchy in mouse pairs by means of both the number and duration of attacks and submissive poses.

As one can see in the first row "PT" of **Table 3**, 21 of 31 male F1 hybrids carrying the PT × BALB/cLac genotype dominated over the male F1 hybrids of the PT × CBA/Lac genotype, whereas the PT × CBA/Lac males were dominant in the remaining 10 pairs of the same combination. This difference between the F1 male hybrids PT × BALB/cLac and PT × CBA/Lac is characterized by a χ<sup>2</sup> score of 3.9, which is statistically significant at the level of α < 0.05. In addition, we observed the same significant dominance of the BALB/cLac-related F1 hybrids over the CBA/Lac-related ones when the maternal inbred strains were DD and YP (**Table 3**). In the cases of maternal inbred strains C57BL/6J and A/He, we found only a tendency for the same dominance, which was insignificant, possibly because of the insufficient number of mouse male pairs studied for these maternal genotypes. Finally, the last row of **Table 3** represents the final result: the statistically significant majority of 79 among 115 BALB/cLac-related male hybrids achieved their dominant social status within their pairs with the CBA/Lac-related males of the same maternal inbred strains. This finding means that this mouse model of human inheritance reveals an ability of the tendencies in dominance and subordination to be inherited from generation to generation from parents to offspring and, therefore, to be an object of natural selection. This is the main genetic in vivo argument in favor of the reliability of our in silico predictions in this work.
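The χ<sup>2</sup> score quoted for the "PT" row can be reproduced with a goodness-of-fit test against an equal split of dominant males between the two paternal genotypes. This is a minimal sketch under that assumption; the function name is hypothetical, and the exact form of the test is not specified in the text.

```python
import math

def chi2_equal_split(wins_a, wins_b):
    """Pearson chi-square statistic (1 d.f.) against a 50:50 split, with its p-value."""
    expected = (wins_a + wins_b) / 2
    chi2 = ((wins_a - expected) ** 2 + (wins_b - expected) ** 2) / expected
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2_pt, p_pt = chi2_equal_split(21, 10)           # maternal strain PT: 21 vs 10 pairs
chi2_all, p_all = chi2_equal_split(79, 115 - 79)   # all maternal strains pooled
print(f"PT: chi2 = {chi2_pt:.1f}, p = {p_pt:.3f}")
print(f"pooled: chi2 = {chi2_all:.1f}, p = {p_all:.2g}")
```

Under the same assumption, the pooled row (79 of 115) gives a considerably larger statistic than the "PT" row alone, which agrees with the pooled result being reported as statistically significant.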

Finally, looking through **Figure 4**, one can see that, in contrast to the first day of microsocial observation of a pair of adult male mice, which was characterized by numerous and lasting attacks of one mouse on the other, by the end of the second day a social hierarchy is established, with rare short-term ritualized attacks of a dominant and/or ritualized submissive poses of a subordinate without any injuries and dangers for their lives and health (Lorenz, 2002). This is the main ecological benefit of establishing and maintaining social hierarchy, as a result of which natural selection equally supports the human tendencies for both dominance and subordination.

Note to **Table 3**: the number of male mice that dominated over their neighbors within their pair is indicated; statistically significant results are boldfaced.

### CONCLUSION

In this work, we analyzed only how SNPs can alter TBP's binding affinity for the human gene promoters, whereas more than 2500 human DNA-binding proteins are already known (Babu et al., 2004). Consequently, now there is a huge variety of Web services for studying the effects of SNPs on the binding affinity of the human gene promoters for these proteins and the respective phenotypic manifestations (e.g., Bendl et al., 2016). Their use can significantly expand the research capabilities in comparison with the use of our Web service alone (Ponomarenko et al., 2015).

The main finding of this work is that natural selection equally supports the human tendencies in dominance and subordination, which can be inherited from parents to offspring. The results of the current study could be seen as an argument in favor of the genetic side within the century-old irreconcilable scientific debate on the nature of both aggressiveness and social hierarchy in humans [e.g., Freud (1921, 1930) and Lorenz (1964, 2002)]. Nevertheless, in the case of a random individual, these human tendencies can define the possible ranges (plasticity) of his/her aggressiveness and social rank rather than their actual levels, which depend on his/her continuous non-genetic social education from childhood to the oldest age (Markel, 2016). Certainly, this is an argument in favor of the other (non-genetic) side of the debate in question [e.g., Fromm (1941, 1973), Berkowitz (1962, 1993), Skinner (Rogers and Skinner, 1956; Skinner, 1981)]. According to recent reports on epigenetics (e.g., Merkulov et al., 2017), various stressors may cause epigenetic reprogramming of the individual genome and, in this way, modulate the actual levels of both individual aggressiveness and social status. Moreover, this reprogrammed pattern of the human genome is inherited from parents to offspring across at least two generations. Definitely, this notion equally supports both sides of the above debate as does our main finding in this work.

Finally, there are social mechanisms of transfer of the hierarchy status from parents to their offspring, previously described in macaques (Prud'Homme and Chapais, 1993), deer (Dusek et al., 2007), and hyenas (Engh et al., 2000). Clearly, the real effects of inherited genotypes on the human social hierarchy are much more complex, diverse, richer, brighter, and more interesting than our maximally simplified decision-making rule (see subsection "The Basic Decision-Making Rule"). Nevertheless, at least a somewhat valid decision-making rule is necessary for application of the bioinformatic calculations to the genome-wide analysis in silico. In any case, as a computer-based prediction, each candidate SNP marker of the human tendencies in dominance and subordination predicted by this work should be experimentally verified in the studies of large human cohorts.

### AUTHOR CONTRIBUTIONS

NK contributed to concept. DR and PP contributed to software. IC contributed to data compilation. ES and LS contributed to data analysis. MK, EK, LO, and AO performed the in vivo experiment. MP wrote the manuscript. VN performed the revised manuscript study design.

### FUNDING

Manuscript writing was supported by the Russian Ministry of Science and Education within the 5-100 Excellence Program (for MP). The software development was supported by the project #0324-2019-0040 from the Russian Government Budget (for DR). The concept was supported by the integration project #0324-2018-0021 from the Presidium of the Siberian Branch of the Russian Academy of Sciences (for NK). The data compilation was supported by project #18-34-00496 from the Russian Foundation for Basic Research (for IC). The data analysis was supported by project #0324-2019-0042 from the Russian Government Budget (for ES and LS). The study design was supported by project #16-54-12016 from the Russian Foundation for Basic Research (for VN). The in vivo experiment on animals was supported by a publicly funded project #0324-2019-0041 from the Russian Government Budget (for MK, EK, LO, and AO) and implemented using the equipment of the Center for Genetic Resources of Laboratory Animals at ICG SB RAS, supported by the Russian Ministry of Education and Science (unique identifier of the project RFMEFI62117X0015).

#### ACKNOWLEDGMENTS

We are grateful to Shevchuk Editing<sup>2</sup> (Brooklyn, NY, United States) for English translation and editing.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00073/full#supplementary-material

<sup>2</sup> http://www.shevchuk-editing.com






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Chadaeva, Ponomarenko, Rasskazov, Sharypova, Kashina, Kleshchev, Ponomarenko, Naumenko, Savinkova, Kolchanov, Osadchuk and Osadchuk. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Pan-Cancer Analysis of TCGA Data Revealed Promising Reference Genes for qPCR Normalization

George S. Krasnov\*, Anna V. Kudryavtseva, Anastasiya V. Snezhkina, Valentina A. Lakunina, Artemy D. Beniaminov, Nataliya V. Melnikova and Alexey A. Dmitriev\*

*Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia*

#### Edited by:

*Yuriy L. Orlov, Institute of Cytology and Genetics (RAS), Russia*

#### Reviewed by:

*Alexey V. Pindyurin, Institute of Molecular and Cellular Biology (RAS), Russia Vladimir Kiselev, Wellcome Trust Sanger Institute (WT), United Kingdom Shengjie Yang, NorthShore University HealthSystem, United States*

#### \*Correspondence:

*George S. Krasnov gskrasnov@mail.ru Alexey A. Dmitriev alex\_245@mail.ru*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *31 October 2018* Accepted: *29 January 2019* Published: *01 March 2019*

#### Citation:

*Krasnov GS, Kudryavtseva AV, Snezhkina AV, Lakunina VA, Beniaminov AD, Melnikova NV and Dmitriev AA (2019) Pan-Cancer Analysis of TCGA Data Revealed Promising Reference Genes for qPCR Normalization. Front. Genet. 10:97. doi: 10.3389/fgene.2019.00097*

Quantitative PCR (qPCR) remains the most widely used technique for gene expression evaluation. Obtaining reliable data using this method requires reference genes (RGs) with stable mRNA level under experimental conditions. This issue is especially crucial in cancer studies because each tumor has a unique molecular portrait. The Cancer Genome Atlas (TCGA) project provides RNA-Seq data for thousands of samples corresponding to dozens of cancers and presents the basis for assessment of the suitability of genes as reference ones for qPCR data normalization. Using TCGA RNA-Seq data and previously developed CrossHub tool, we evaluated mRNA level of 32 traditionally used RGs in 12 cancer types, including those of lung, breast, prostate, kidney, and colon. We developed an 11-component scoring system for the assessment of gene expression stability. Among the 32 genes, *PUM1* was one of the most stably expressed in the majority of examined cancers, whereas *GAPDH*, which is widely used as a RG, showed significant mRNA level alterations in more than a half of cases. For each of 12 cancer types, we suggested a pair of genes that are the most suitable for use as reference ones. These genes are characterized by high expression stability and absence of correlation between their mRNA levels. Next, the scoring system was expanded with several features of a gene: mutation rate, number of transcript isoforms and pseudogenes, participation in cancer-related processes on the basis of Gene Ontology, and mentions in PubMed-indexed articles. All the genes covered by RNA-Seq data in TCGA were analyzed using the expanded scoring system that allowed us to reveal novel promising RGs for each examined cancer type and identify several "universal" pan-cancer RG candidates, including *SF3A1*, *CIAO1*, and *SFRS4*. The choice of RGs is the basis for precise gene expression evaluation by qPCR. Here, we suggested optimal pairs of traditionally used RGs for 12 cancer types and identified novel promising RGs that demonstrate high expression stability and other features of reliable and convenient RGs (high expression level, low mutation rate, non-involvement in cancer-related processes, single transcript isoform, and absence of pseudogenes).

Keywords: cancer, gene expression, reference genes, quantitative PCR, data normalization, RNA-Seq, TCGA, CrossHub

## INTRODUCTION

Quantitative PCR (qPCR) is the most widely used technique for quantification of gene expression. qPCR is rapid, has a very high dynamic range of mRNA level quantification and provides a measurement of even small gene expression alterations in a large number of samples. The most common and convenient approach for qPCR data normalization assumes mRNA quantification of a reference gene (RG) with stable expression level between the samples under study (Huggett et al., 2005). It is a bottleneck of qPCR, and the reliability of qPCR results strongly depends on the selection of appropriate RGs. This issue becomes more acute when it comes to assessing the moderate changes in the mRNA level of target genes (<2-fold).

The problem of selecting appropriate RGs is especially crucial in cancer studies because of the presence of several molecular subtypes within a histological type and, moreover, a unique molecular portrait of each tumor (Janssens et al., 2004). Although almost 30 years have passed since the issue of picking appropriate RGs first arose, there is still no consensus (Janssens et al., 2004; Rubie et al., 2005; Gur-Dedeoglu et al., 2009; Ibusuki et al., 2013; Zhao et al., 2018). Many studies indicate that the most frequently used RGs (GAPDH, ACTB, B2M, etc.) have a wide but limited field of applicability: they should not be used indiscriminately for a wide spectrum of diseases or stress conditions (Barber et al., 2005; Rubie et al., 2005; Kozera and Rapacz, 2013; Chapman and Waldenstrom, 2015). To increase the reliability of qPCR data, one should use at least two RGs that are not co-expressed with each other (Chapman and Waldenstrom, 2015). The most rigorous approach is to analyze a panel of 5–20 RGs and choose those with the most stable expression for a current study. Several tools have been developed for these purposes: geNorm (Vandesompele et al., 2002), NormFinder (Andersen et al., 2004), BestKeeper (Pfaffl et al., 2004). However, the vast majority of researchers do not analyze RG suitability and just rely on the existing literature data concerning the object of study (Chapman and Waldenstrom, 2015).

Whole-transcriptomic data allow us to look at the problem from the other side. RNA-Seq opens up great opportunities for a complex expression analysis and identifying trends in the mRNA level changes of groups of genes between the samples. RNA-Seq data are free of bias that comes from the instability of RG expression. The most common RNA-Seq data normalization strategy is based on the assumption that the mRNA level of the majority of genes is stable. This method is implemented in popular RNA-Seq differential expression analysis packages, including edgeR [trimmed mean of M-values method, TMM; Robinson et al., 2010], DESeq2 (Love et al., 2014), and others. There are other normalization strategies: by total read count, by upper quartile or median values, FPKM/RPKM, TPM, "remove unwanted variation" (RUV) (Risso et al., 2014); as well as machine-learning approaches: RNA-Seq by Expectation-Maximization (RSEM) (Li and Dewey, 2011) and Sailfish (Patro et al., 2014). Despite the diversity of the methods, in most cases, they give rather similar results, which differ by 20–30%, with the exception of some cases when the expression of half or more of genes is changed significantly (Dillies et al., 2013; Li et al., 2015; Zyprych-Walczak et al., 2015; Evans et al., 2018).

Analysis of highly representative RNA-Seq and microarray datasets is very attractive for identifying stably expressed RGs for human (Popovici et al., 2009; Tilli et al., 2016; Chen et al., 2017; Chim et al., 2017; Hoang et al., 2017) or other organisms (Alexander et al., 2012; Carmona et al., 2017; Zhou et al., 2017). This approach is valuable for the detection of novel housekeeping gene candidates with constitutively stable mRNA level.

In 2016, Tilli et al. suggested a strategy including the large-scale screening of potential RGs from RNA-Seq data with further validation by qPCR and applied it to breast cancer (Tilli et al., 2016). The authors analyzed datasets of The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) and found that several non-traditional RGs (CCSER2, SYMPK, ANKRD17) as well as the known RG PUM1 demonstrated the least expression variability in breast cancer samples and normal tissues (Tilli et al., 2016). A similar approach was implemented by Chen et al. for the identification of reference mRNA and miRNA suitable for human esophageal squamous cell carcinoma studies (Chen et al., 2017). It allowed the authors to identify non-standard RG candidates: DDX5, LAPTM4A, P4HB, and RHOA.

TCGA is the largest resource in the field of cancer biology that is aimed at the discovery of the molecular features of various cancer types (https://cancergenome.nih.gov/). The TCGA database includes genomic, transcriptomic, and epigenetic data for 33 human cancer types represented with more than 11,000 individual samples. In the present work, we analyzed TCGA transcriptome sequencing data in order to evaluate the expression stability of widely used RGs and identify novel RG candidates in the 12 most common cancer types. The use of representative TCGA sample sets allows us to pay extra attention to the overall stability of mRNA level and the presence of outliers, that is, cases of a dramatic expression "blow up" or drop in single samples. Besides the data on mRNA level, we took into account whether a gene is well studied (by evaluating the number of mentions in PubMed-indexed titles/abstracts) and whether it is involved in cancer-associated biological processes like cell cycle, differentiation, and adhesion (using Gene Ontology). Additionally, we evaluated whether a gene is highly mutated (using TCGA data on somatic mutations), which indicates its implication in cancer. Also, we tried to minimize the number of pseudogenes and alternatively spliced transcripts in order to improve usability: the presence of pseudogenes makes it difficult to pick up cDNA-specific primer pairs, and the presence of alternative transcripts complicates the expression analysis and may lead to flawed results. We integrated all the parameters listed above into a single scoring system. Finally, we looked for genes that demonstrate cross-tissue expression stability and may represent "universal" pan-cancer RGs.

### MATERIALS AND METHODS

In the present work, we focused on TCGA data for 12 cancer types for which RNA-Seq data were available for representative sample sets: at least 100 tumor (T) and 20 normal (N) tissue samples. The data were processed with a modified version of CrossHub (Krasnov et al., 2016), a tool for the multi-way analysis of TCGA transcriptomic and genomic data. Read count data were downloaded from the TCGA data portal (https://portal.gdc.cancer.gov/), normalized using the TMM method, and then rescaled to a library size of 1 million reads. The derived CPM (read counts per million) values were used as a measure of mRNA level of a gene for further expression stability analysis.
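The rescaling step can be sketched as follows. This is a minimal illustration, not the CrossHub implementation: the function name and input format are hypothetical, the counts are made up, and the TMM factor is assumed to have been estimated elsewhere (for example, with `calcNormFactors` in edgeR) and is simply passed in.

```python
def counts_to_cpm(counts, norm_factor=1.0):
    """Convert one sample's raw read counts to counts per million (CPM).

    counts      -- dict mapping gene name to raw read count (hypothetical input format)
    norm_factor -- TMM scaling factor for this sample; 1.0 means no TMM adjustment
    """
    library_size = sum(counts.values())
    # edgeR-style effective library size: raw library size times the TMM factor.
    effective_size = library_size * norm_factor
    return {gene: count / effective_size * 1e6 for gene, count in counts.items()}

sample = {"PUM1": 1500, "GAPDH": 48000, "ACTB": 50500}   # made-up counts
cpm = counts_to_cpm(sample, norm_factor=1.0)
print(cpm["PUM1"])
```

With a factor of 1.0 the CPM values of a sample sum to one million; a TMM factor other than 1.0 shifts all genes of that sample by the same ratio.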

In order to assess gene expression stability, we developed a scoring system, which included several components (S<sub>i</sub>) responsible for T-N expression level difference, expression level stability within pools of N and T samples, and correlations of mRNA level with clinical and pathological characteristics [disease stage, TNM (tumor, node, metastasis) classification, follow-up status]. Each scoring component S<sub>i</sub> takes values from 0 to 100. All S<sub>i</sub> are taken with different weights (W<sub>i</sub>), which reflect the importance of each component. The overall expression score S<sup>Exp</sup> is calculated as follows:

$$S^{\text{Exp}} = \left(\prod_{i=1}^{N} \left(S_{i} + \text{CA}_{i}\right)^{W_{i}}\right)^{1/\sum_{i=1}^{N} W_{i}}$$

where:


Values of these parameters are presented in **Table 1**.
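The overall score is a weighted geometric mean of the shifted components, which can be sketched as below. The function name is hypothetical, the component values and weights are placeholders rather than entries of **Table 1**, and CA<sub>i</sub> is treated simply as the additive term that appears in the formula.

```python
import math

def overall_score(scores, weights, additions):
    """Weighted geometric mean of (S_i + CA_i) with weights W_i."""
    terms = [s + ca for s, ca in zip(scores, additions)]
    if any(t == 0 for t in terms):
        return 0.0  # a single zero factor makes the whole product zero
    log_sum = sum(w * math.log(t) for t, w in zip(terms, weights))
    return math.exp(log_sum / sum(weights))

# Placeholder values for three components (not taken from Table 1).
s = [50.0, 100.0, 80.0]
w = [4.0, 1.0, 1.0]
ca = [0.0, 0.0, 0.0]
print(round(overall_score(s, w, ca), 1))
```

Because the mean is geometric, a low value in a heavily weighted component pulls the overall score down more strongly than it would in a weighted arithmetic mean.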

Each individual component S<sub>i</sub> is calculated with a common parametric formula:

$$S\_i = \frac{100}{1 + \text{Sq} \times \left(\frac{\max(\text{x} - \text{IV}; 0)}{\text{IP} - \text{IV}}\right)^{\text{CS}}}$$

This formula provides a (1-sigma)-like function with a customizable inflection point, tilt, and region of maximal score values. The function takes values from 0 to 100. Here:


All scoring components S<sub>i</sub> and parameters (IV, IP, CS, Sq) are presented in **Table 1**. The derived scoring functions are shown in **Figure 1**.
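The behavior of the parametric scoring formula can be checked with a short sketch. The default Sq = 1 is an assumption; the parameter values in the examples are the ones quoted in the text for the T-N difference components (IV = 0.05, IP = 0.25, CS = 2.5).

```python
def component_score(x, iv, ip, cs, sq=1.0):
    """Score from 0 to 100: maximal while x <= IV, decreasing as x grows.

    sq=1.0 is an assumed default; with it the score is exactly 50 at x = IP.
    """
    excess = max(x - iv, 0.0) / (ip - iv)
    return 100.0 / (1.0 + sq * excess ** cs)


at_ideal = component_score(0.03, iv=0.05, ip=0.25, cs=2.5)       # below IV
at_inflection = component_score(0.25, iv=0.05, ip=0.25, cs=2.5)  # x equal to IP
far_off = component_score(1.0, iv=0.05, ip=0.25, cs=2.5)         # strong deviation
```

Values of x at or below IV receive the full score, the score crosses 50 at the inflection point, and CS controls how steeply it falls beyond that point.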

Two components, SDP and SDL, are responsible for the T-N expression level difference, which is the major factor of RG suitability. SDP is calculated for pooled samples and SDL for paired samples. We therefore applied the strictest scoring parameters (IV = 0.05, IP = 0.25, CS = 2.5) and assigned a high weight (W = 4) to these two components. SDP (or SDL) equals 50 if the absolute value of the average log2FC<sup>P</sup> (or log2FC<sup>L</sup>) is equal to IP = 0.25, i.e., if the fold change between tumor and normal is about 20%. We chose IV = 0.05–0.1 for all the components that are responsible for expression level (SDP, SDL, SDoO, SDoU, SDLc, SEStD, SEoH, SEoL). This means that 5–10% mRNA level changes are ignored.

SDP and SDL are calculated using the trimmed means of either CPM (pooled samples) or log2FC<sup>L</sup> (paired samples): only values between the 10th and 90th percentiles are included. To take T-N expression outliers into account, we added two other scorings, SDoO and SDoU, which are responsible for the upper and lower deciles of log2FC<sup>L</sup>. For these components, we assigned an increased IP value (IP = 0.7), since Abs[Average(log2FC<sup>L</sup>)<sub>90–100</sub>], calculated for the 90–100th percentiles of log2FC<sup>L</sup>, is expected to be much greater than the same value calculated for the 10–90th percentiles.
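The separation of the trimmed core from the outlier deciles can be sketched as follows. The rank-based percentile cut used here is an assumption (the text does not state how percentiles are computed), and the log2FC values are hypothetical.

```python
def percentile_slice(values, lower, upper):
    """Return the sorted values whose rank lies between two percentiles (0-100)."""
    ordered = sorted(values)
    n = len(ordered)
    start = int(n * lower / 100)
    stop = int(n * upper / 100)
    return ordered[start:stop]


def mean(values):
    return sum(values) / len(values)


# Hypothetical paired log2 fold changes: 16 stable samples and 4 outliers.
log2fc = [-0.1, 0.1] * 8 + [-2.0, -1.5, 1.5, 3.0]

trimmed_mean = mean(percentile_slice(log2fc, 10, 90))   # core, 10-90th percentiles
upper_decile = mean(percentile_slice(log2fc, 90, 100))  # overexpression outliers
lower_decile = mean(percentile_slice(log2fc, 0, 10))    # underexpression outliers
```

In this example the trimmed mean is close to zero while the decile averages are large in absolute value, which is why the outlier components are scored with a larger IP than the trimmed components.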

SEStD, SEoH, SEoL are responsible for evaluating expression stability within pools of normal and tumor samples. SEStD scores trimmed standard deviation of CPM values (10–90th percentiles), and SEoH (or SEoL) is responsible for outliers with high (or low) mRNA level (in terms of CPM). Additionally, we included scoring for average expression level (SEA) and set high weight (W = 6) for this component in order to completely exclude genes with low mRNA level from the analysis.

Finally, we added scorings for correlations between gene expression and six clinical and pathological characteristics: pathologic TNM classification (separately for the T, N, and M indexes), pathologic stage, follow-up person neoplasm cancer status, and follow-up treatment success status. SCr is the component responsible for Spearman's correlation coefficient, and SCp for the correlation p-value. IV values were chosen in such a way that cases with p > 0.25 and |r<sub>s</sub>| < 0.1 have a score of 100. In total, each of these two components is taken 18 times: 6 clinical characteristics are analyzed for associations with CPM in tumor samples, CPM in normal samples, and T-N expression fold change (paired samples). Hence, we assigned low weights: W = 0.2 and 0.3 for SCr and SCp, respectively.

#### TABLE 1 | Components of the scoring function.

\**Percentiles, which were taken into calculation, are indicated as a subscript.*

*IV, ideal value; IP, inflection point; CS, curve slope; Sq, "squeeze"; CA, constant add; W, weight; Abs (…), absolute value; Average (…), mean value; CPM, counts per million, gene expression level; FCP, ratio of the average CPM in a pool of tumor samples to the average CPM in a pool of normal samples; FCL, ratio of CPM values between tumor and matched normal tissue (per each paired sample); StDev (…), standard deviation; rs, Spearman's correlation coefficient.*

Besides a stable and sufficiently high expression level, an appropriate RG should also demonstrate a low mutation rate, a single transcript isoform, and an absence of pseudogenes in order to avoid problems with PCR priming and ensure the rigorous evaluation of mRNA level. The mutation rate of a gene was assessed using TCGA data on somatic mutations. The number of transcript isoforms (per gene) was obtained from the Ensembl human genome annotation (hg38, release 88). The number of pseudogenes (per gene) was derived from psiCube (Sisu et al., 2014). Therefore, we extended the scoring system with three additional components, "anti-scorings" (**Table 1** and **Figure 1**). The resulting score S<sup>Final</sup> is calculated as follows:

$$S^{\text{Final}} = S^{\text{Exp}} \cdot S^{\text{Mut}} \cdot S^{\text{Isoforms}} \cdot S^{\text{Pseudogenes}}$$

Next, we tried to find RGs that are stably expressed across multiple tissues and cancer types. For this purpose, we calculated the pan-cancer score as follows:

$$S^{\text{Final}}_{\text{Pan-cancer}} = S^{\text{Exp\&Mut}}_{\text{Pan-cancer}} \cdot S^{\text{Isoforms}} \cdot S^{\text{Pseudogenes}}$$

where:

$$S^{\text{Exp\&Mut}}_{\text{Pan-cancer}} = \left(\frac{\sum_{j=1}^{M} \left(S_{j}^{\text{Exp}} \cdot S_{j}^{\text{Mut}} + \text{CA}\right)^{k}}{M}\right)^{1/k}$$

where M = 12 (the number of cancer types analyzed); k = −0.4 (a negative k value makes the pan-cancer score a harmonic-like power mean of the individual scores, so that low scores in single cancer types weigh most heavily); CA = 12 (a constant add).
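The pan-cancer aggregation is a power mean with a negative exponent, which can be sketched as below. The function name and the input values are hypothetical; each input stands for the product of the expression and mutation scores in one cancer type.

```python
def pan_cancer_score(scores, k=-0.4, ca=12.0):
    """Power mean with exponent k of (score + CA) over all cancer types.

    `scores` holds one combined expression-and-mutation score per cancer
    type; the positive constant add keeps every term positive for k < 0.
    """
    m = len(scores)
    return (sum((s + ca) ** k for s in scores) / m) ** (1.0 / k)


uniform = pan_cancer_score([50.0] * 12)          # identical scores in all 12 types
skewed = pan_cancer_score([80.0] * 11 + [0.0])   # one cancer type with a zero score
```

For identical inputs the result is simply the input plus CA. With a single failing cancer type the result falls below the arithmetic mean of the shifted scores, so a gene cannot compensate for instability in one tissue with high scores elsewhere.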

Finally, we assessed the involvement of a gene in cancer-related processes on the basis of Gene Ontology (GO; The Gene Ontology, 2017) data and mentions in the articles indexed by PubMed (titles and abstracts).

An RG should not be involved in cellular processes that are frequently altered in cancer. A penalty system based on GO data was developed. We evaluated the involvement of a gene in 6 groups of cancer-associated biological processes: cell cycle, differentiation, stress response, immune response, angiogenesis, and adhesion and cell communication. The relation of a gene to each of these groups was followed by the assignment of penalty points (from 2 to 5). Finally, these points were summed up. According to this system, a gene is penalized (1) with 5 points if its GO annotation contains at least one keyword related to the cell cycle process: cell cycle, cell division, cell growth, cell proliferation, apoptosis, apoptotic process, cell death, MAPK cascade, tumor, oncogenic, apoptotic; (2) with 4 points if the GO annotation contains a keyword related to cell differentiation: cell differentiation, epithelial to mesenchymal transition, mesenchymal to epithelial transition, stem cell, fetal, embryonic, embryonal, embryo, gastrulation, tissue development, cellular developmental process, organ development; (3) with 3 points for stress response related processes: response to stress, DNA damage, DNA repair; (4) with 2 points for inflammation and immune response: inflammation, inflammatory, immune response, T cell activation, macrophage activation, antigen; (5) with 2 points for angiogenesis: angiogenesis; (6) with 2 points for intercellular interactions: cell communication, cell-cell signaling, cell adhesion, cell motility, cell migration. Thus, a gene may have a maximum of 5 + 4 + 3 + 2 + 2 + 2 = 18 penalty points.

The more thoroughly a gene is annotated, the more likely one of the keywords is to be found in its annotation. Therefore, the GO penalty should be normalized taking into account the number of GO terms assigned to the gene. On the other hand, the better a gene is annotated, the more extensively it has been studied, and such genes represent more attractive candidates. In order to keep a balance between these two factors, we introduced a normalization coefficient equal to the total number of GO terms assigned to the gene raised to the power of 0.3. If a gene lacked sufficient GO annotation (<3 GO terms), we assigned it 10 penalty points.
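The GO penalty system can be sketched as follows. Several details are assumptions rather than statements of the source: keywords are matched as case-insensitive substrings of GO term names, the raw penalty is divided by the normalization coefficient, and the fixed 10-point penalty for sparsely annotated genes is returned without normalization.

```python
# Penalty points and keywords for the six process groups listed in the text.
PENALTY_GROUPS = [
    (5, ["cell cycle", "cell division", "cell growth", "cell proliferation",
         "apoptosis", "apoptotic process", "cell death", "mapk cascade",
         "tumor", "oncogenic", "apoptotic"]),
    (4, ["cell differentiation", "epithelial to mesenchymal transition",
         "mesenchymal to epithelial transition", "stem cell", "fetal",
         "embryonic", "embryonal", "embryo", "gastrulation",
         "tissue development", "cellular developmental process",
         "organ development"]),
    (3, ["response to stress", "dna damage", "dna repair"]),
    (2, ["inflammation", "inflammatory", "immune response",
         "t cell activation", "macrophage activation", "antigen"]),
    (2, ["angiogenesis"]),
    (2, ["cell communication", "cell-cell signaling", "cell adhesion",
         "cell motility", "cell migration"]),
]


def go_penalty(go_terms, min_terms=3, sparse_penalty=10.0, exponent=0.3):
    """Normalized GO penalty for a gene given the names of its GO terms."""
    if len(go_terms) < min_terms:
        return sparse_penalty  # insufficient annotation
    names = [term.lower() for term in go_terms]
    # Each group is counted at most once, however many keywords match.
    raw = sum(points for points, keywords in PENALTY_GROUPS
              if any(keyword in name for name in names for keyword in keywords))
    # Assumption: normalization means division by (number of GO terms) ** 0.3.
    return raw / len(go_terms) ** exponent


penalty = go_penalty(["cell cycle", "DNA repair", "protein binding"])
```

In the usage line, the hypothetical gene matches the cell cycle group (5 points) and the stress response group (3 points), giving a raw penalty of 8 that is then divided by 3 raised to the power of 0.3.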

The number of PubMed-indexed articles mentioning a gene name or its aliases was evaluated to assess how well a gene is studied. Next, within this pool of gene-related publications, the number of cancer-related articles was also evaluated. One of the following words should be present in an article title for it to be treated as cancer-related: cancer, tumor, \*carcinoma, sarcoma, glioma, glioblastoma, and other keywords.

The described components (GO and PubMed) were not included in the main scoring and were only used for manual exclusion of cancer-associated genes. Besides, functional annotations from RefSeqGene (https://www.ncbi.nlm.nih.gov/refseq/rsg/) were added to each gene.

When revealing optimal RG pairs for each of examined cancer types, we paid special attention to the co-expression of RG candidates to avoid genes with a pronounced correlation between their mRNA levels. To implement the scoring system, we modified our previously developed CrossHub tool (the updated version can be downloaded at https://sourceforge.net/projects/crosshub/).

### RESULTS

We performed the analysis of 12 cancer types from the TCGA project that have RNA-Seq data for representative sample sets: 285-1095 tumor and 19-113 matched normal tissues. These are: breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), kidney renal cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), prostate adenocarcinoma (PRAD), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), liver hepatocellular carcinoma (LIHC), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA), and bladder urothelial carcinoma (BLCA). For the remaining TCGA cancer types, RNA-Seq data were available only for a few normal tissue samples, and this makes it impossible to use such datasets for the discovery of reliable RGs.

First, we assessed the expression stability of a set of 32 frequently used RGs in the 12 selected cancer types: ACTB, ALAS1, B2M, CDKN1A, G6PD, GAPDH, GUSB, HBB, HMBS, HPRT1, HSP90AB1, IPO8, LDHA, NONO, PGK1, POP4, PPIA, PPIH, PSMC4, PUM1, RPL13A, RPL30, RPLP0, RPS17, RPS18, SDHA, TBP, TFRC, UBC, YWHAZ, TUBB, RPN1. This set was composed from commercially available RG panels: Roche "Human Reference Gene Panel, 384" (Switzerland), TATAA "Reference Gene Panel Human" (Sweden), and Bio-Rad "Reference Genes H384" (USA). In total, the panels include 31 unique genes, to which we added the RPN1 gene, identified by us earlier as a reliable RG for lung, kidney, and colorectal cancers (Krasnov et al., 2011; Fedorova et al., 2015). Expression stability scores were calculated for each gene in each examined cancer type. The results for the top 5 genes are presented in **Table 2**, and the full data in **Supplementary Table 1**. In almost every cancer type, there were 1–10 genes with an expression score of about 70 or more (with a theoretical maximum of 100), which can be considered a moderately high score. PRAD and THCA demonstrated the highest numbers of genes with a stable mRNA level: 10 and 7, respectively. Only in BLCA did all the genes have scores below 70, possibly because of bias due to a small number of matched normal tissues (19, the smallest number among the cancer types examined). The cross-tissue analysis of 12 cancer types revealed that the most stably expressed genes were PUM1 (S<sup>Exp</sup> = 70), IPO8 (S<sup>Exp</sup> = 61), UBC (S<sup>Exp</sup> = 60), ACTB (S<sup>Exp</sup> = 55), and RPN1 (S<sup>Exp</sup> = 54). GAPDH, one of the most frequently used RGs, showed one of the lowest mRNA level stabilities: position 25 out of 32 (S<sup>Exp</sup> = 32). According to the obtained results, GAPDH can be reasonably applied as an RG only in prostate and stomach adenocarcinomas. The RPN1 gene suggested by us demonstrated a high expression stability score in lung, renal, colon, liver, thyroid, and prostate cancers.

Next, for each of the 12 cancer types, we searched for a pair of the most suitable RGs, focusing on S<sup>Exp</sup> values and the correlation between the mRNA levels of the genes in a pair. As a result, we revealed 12 optimal pairs of RGs with S<sup>Exp</sup> above 65 for each gene and an absence of co-expression (**Table 2** and **Supplementary Table 1**). PUM1 was included in the pair of RGs for 9 out of 12 cancer types.

It should be noted that genes with high S<sup>Exp</sup> values may be inconvenient in practice because of the presence of numerous pseudogenes, alternatively spliced transcripts, or a high mutation rate. Among the traditionally used RGs with high expression scores, only 3 genes met the requirements: PUM1, IPO8, and RPN1. These genes have no pseudogenes, one (RPN1) or two (PUM1 and IPO8) transcript isoforms, and a relatively low mutation rate in the examined cancer types.

Using the expanded scoring system (**Figure 2**), which included 3 "anti-scorings" accounting for the mutation rate and the numbers of transcript isoforms and pseudogenes, we analyzed the complete list of human genes in order to reveal the most prominent pan-cancer RG candidates (**Supplementary Table 2**). The top 10 pan-cancer RG candidates included MBTPS1, HNRNPA0, SF3A1, SF3B2, GGNBP2, HNRNPUL2, SFRS3, RTF1, CIAO1, TM9SF3. All these genes had a stable and sufficiently high mRNA level and a low mutation rate in most of the 12 cancer types, only one annotated transcript isoform, and no pseudogenes. Taking into account the PubMed article search, GO annotations, and RefSeqGene information, we selected the three most promising RG candidates: SF3A1, CIAO1, and SFRS4.

#### DISCUSSION

The use of inappropriate RGs leads to unreliable data and nullifies the potentially high accuracy of the qPCR technique in the evaluation of differential gene expression. The search for an RG with a stable mRNA level under experimental conditions represents a separate object of research and is rarely performed during the original studies. The RNA-Seq data of the TCGA project offer a great opportunity for evaluating gene expression stability. Using our CrossHub tool, we developed a complex scoring system that allowed us to assess the suitability of 32 traditionally used RGs for qPCR data normalization in 12 cancer types characterized by high morbidity and mortality rates. Alterations of mRNA level in the examined cancer types were shown for a number of these genes, including the most frequently used one, GAPDH. The analysis across 12 cancer types revealed that the PUM1 and IPO8 genes demonstrate the most stable expression among the 32 genes.

*Optimal pairs of reference genes for each cancer type are shown in bold.*

PUM1 (Pumilio RNA Binding Family Member 1) serves as a translational regulator of specific mRNAs by binding to their 3'-UTRs. It may be involved in translational regulation of embryogenesis, cell development, and differentiation. Several of its functions call into question its applicability as an RG. After growth factor stimulation, PUM1 binds to the 3'-UTR of the CDKN1B/p27 tumor suppressor, inhibits its expression, and promotes a rapid entry into the cell cycle (Kedde et al., 2010). PUM1 is capable of repressing many mitotic, DNA repair, and DNA replication factors (Lee et al., 2016). Moreover, some authors reported that PUM1 promotes ovarian cancer proliferation, migration, and invasion (Guan et al., 2018). However, PUM1 has been identified as one of the most stably expressed genes in uterine cervical cancer (Tan et al., 2017), endometrial carcinoma (Ayakannu et al., 2015), gallbladder (Yu et al., 2015), leiomyoma (Almeida et al., 2014), breast (Ibusuki et al., 2013; Kilic et al., 2014), and non-small cell lung (Soes et al., 2013) cancers. This gene has only 2 transcript isoforms and no pseudogenes, which makes it even more attractive for use as a reference gene.

Recently, Tilli et al. performed a screening of breast cancer RNA-Seq datasets from the International Cancer Genome Consortium (ICGC), GEO, and TCGA repositories. The authors found that PUM1, along with the "novel" RGs CCSER2, SYMPK, and ANKRD17, had the most stable mRNA level (Tilli et al., 2016). This agrees with previous qPCR analyses of RG expression stability in breast carcinomas (Ibusuki et al., 2013; Kilic et al., 2014).

IPO8 (importin 8), which has 2 transcript isoforms and no pseudogenes, is the second in the cross-tissue stability list, but its mRNA level is much less stable than that of PUM1 according to TCGA data. IPO8 mediates nuclear import of proteins with a classical nuclear localization signal. Previously, IPO8 was found to be suitable for data normalization in endometrial (Ayakannu et al., 2015) and ovarian carcinomas (Kolkova et al., 2013), colon adenocarcinoma cell lines (Krzystek-Korpacka et al., 2016), non-small cell lung cancer (Soes et al., 2013), and other tissues and diseases: brain edema (Du et al., 2017), heart cavities (Molina et al., 2018), T cells, and neutrophils (Ledderose et al., 2011).

The RPN1 gene (0 pseudogenes, 1 transcript isoform), which was previously suggested by us for normalization of qPCR data in LUAD, LUSC, KIRC, KIRP, and COAD (Krasnov et al., 2011; Fedorova et al., 2015), demonstrates stable expression in these cancer types as well as in PRAD, LIHC, and THCA.

The majority of the remaining genes from the set of 32 genes, even if they demonstrate stable mRNA level in certain cancer types, have many pseudogenes or high mutation rate (for example, UBC is above the 99th percentile in BRCA). The presence of pseudogenes is a weakness of such widely used RGs as GAPDH and ACTB (67 and 64, respectively) (Sun et al., 2012), or genes encoding ribosomal proteins, including RPL13A and RPS17 (Tonner et al., 2012).

Next, we tried to find out novel reliable and convenient RGs suitable for most cancer types. As it was described above, for this purpose, we evaluated expression and mutation scorings for each examined cancer type, calculated pan-cancer scoring values given the "anti-scorings" for the number of transcript isoforms and pseudogenes, and selected the promising candidates taking into account information on functions of the genes and their involvement in carcinogenesis.

Along with SFRS4 (number 13 in the top list of "universal" reference genes), three genes that participate in pre-mRNA splicing and processing pathways (SF3A1, SF3B2, and SFRS3) are present in the top 10 of promising pan-cancer RGs. The splicing machinery (namely spliceosome) is the largest molecular machine so far described. It is composed of five small nuclear ribonucleoproteins (snRNPs U1, U2, U4, U5, and U6) and more than 100 different polypeptides (Ghigna et al., 2008). Aberrant splicing in cancer provides a way to generate alternatively spliced transcripts encoding proteins with distinct functions (Ghigna et al., 2008). There are at least two ways resulting in splicing aberrations in cancer: mutations in the affected genes, e.g., in their splice sites (cis-effect), and altered expression and/or activity of the elements of splicing machinery (trans-effect). Some of the splicing factors are known to be deregulated in cancer, by means of mRNA level alterations, mutations or posttranslational modifications (Stickeler et al., 1999; Blaustein et al., 2005; Ghigna et al., 2008). On the other hand, some of the splicing factors are considered as potential RGs. This may be explained by the complexity of the splicing machinery and various roles of its elements (David and Manley, 2010).

SF3A1 and SF3B2 encode the subunits of splicing factors 3a and 3b. These two splicing factors together with 12S RNA unit form the U2 small nuclear ribonucleoproteins complex, which binds pre-mRNA upstream of the intron's branch site and may anchor the U2 snRNP to the pre-mRNA (Will et al., 2002). SF3A1 is considered as a RG in sarcoma (Aggerholm-Pedersen et al., 2014), its expression was found to be stable in breast cancer (Maltseva et al., 2013), colorectal adenocarcinoma Caco-2 cells under exposure to food products (Vreeburg et al., 2011), white blood cells under treatment with growth hormone (Castigliego et al., 2010), bovine blastocysts produced by different methods (Luchsinger et al., 2014), bovine granulosa cells of dominant follicles during follicular growth and aging (Khan et al., 2016).

Considering the other splicing machinery gene, SFRS4 (serine and arginine rich splicing factor 4), some authors earlier demonstrated that its mRNA level is stable in hepatocellular carcinoma (HCC) cell lines (Liu et al., 2017) and patients with alcoholic liver disease (Boujedidi et al., 2012). SFRS4 remains stably expressed in hepatitis C virus-induced HCC, whereas ACTB and GAPDH are significantly deregulated (Waxman and Wurmbach, 2007).

CIAO1 (number 9 in the top list) is a key component of the cytosolic iron-sulfur protein assembly (CIA) complex. This is a multiprotein complex that mediates the incorporation of iron-sulfur clusters into extramitochondrial Fe/S proteins (provided by GeneCards; Stelzer et al., 2016). CIAO1 was not previously described as an RG. To date, there is only one article describing the possible role of the encoded WD40 protein in cancer development, namely its interaction with the tumor suppressor protein WT1 (Johnstone et al., 1998). Besides this, there are almost no data on the association of this gene with cancer.

#### CONCLUSIONS

To reveal reliable RGs for qPCR data normalization, a comprehensive analysis of TCGA data was performed. We took into account expression stability, average mRNA level, expression correlation with clinical and pathological characteristics, number of pseudogenes and transcript isoforms, mutation rate, GO terms, and mentions of a gene in titles/abstracts of articles from PubMed. The most reliable pairs of traditionally used RGs were suggested for each of the 12 examined cancer types, and the unsuitability of some frequently used RGs was shown. Pan-cancer analysis revealed promising RG candidates with a stable and sufficiently high expression level and a low mutation rate across 12 cancer types. Besides, these genes have only one known transcript isoform and no pseudogenes.

### DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the supplementary files.

#### AUTHOR CONTRIBUTIONS

GK, AK, NM, and AD conceived and designed the work. GK, AK, AS, VL, AB, NM, and AD performed data analysis. GK and AD wrote the manuscript. All authors agreed with the final version of the manuscript and all aspects of the work.

#### FUNDING

This work was financially supported by the Russian Science Foundation, grant 17-74-20064.

#### ACKNOWLEDGMENTS

This work was performed using the equipment of the Genome center of Engelhardt Institute of Molecular Biology (http://www.eimb.ru/rus/ckp/ccu_genome_c.php).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00097/full#supplementary-material

in a panel of 72 human tissues. Physiol. Genomics 21, 389–395. doi: 10.1152/physiolgenomics.00025.2005


quantitative data normalization in lung and kidney cancer. Mol. Biol. 45, 211–220. doi: 10.1134/S0026893311020129


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Krasnov, Kudryavtseva, Snezhkina, Lakunina, Beniaminov, Melnikova and Dmitriev. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# PreAIP: Computational Prediction of Anti-inflammatory Peptides by Integrating Multiple Complementary Features

#### Mst. Shamima Khatun1†, Md. Mehedi Hasan1† and Hiroyuki Kurata1,2 \*

*<sup>1</sup> Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Fukuoka, Japan, <sup>2</sup> Biomedical Informatics R&D Center, Kyushu Institute of Technology, Fukuoka, Japan*

#### Edited by:

*Yuriy L. Orlov, Institute of Cytology and Genetics (RAS), Russia*

#### Reviewed by:

*Deepak Singla, Punjab Agricultural University, India Hifzur Rahman Ansari, King Abdullah International Medical Research Center KAIMRC, Saudi Arabia*

> \*Correspondence: *Hiroyuki Kurata kurata@bio.kyutech.ac.jp*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics*

> Received: *25 July 2018* Accepted: *06 February 2019* Published: *05 March 2019*

#### Citation:

*Khatun MS, Hasan MM and Kurata H (2019) PreAIP: Computational Prediction of Anti-inflammatory Peptides by Integrating Multiple Complementary Features. Front. Genet. 10:129. doi: 10.3389/fgene.2019.00129*

The treatment of numerous inflammatory diseases and autoimmune disorders with therapeutic peptides has received substantial attention; however, the exploration of anti-inflammatory peptides via biological experiments is often a time-consuming and expensive task. The development of novel *in silico* predictors is desired to classify potential anti-inflammatory peptides prior to *in vitro* investigation. Herein, an accurate predictor called PreAIP (Predictor of Anti-Inflammatory Peptides) was developed by integrating multiple complementary features. We systematically investigated different types of features, including primary sequence, evolutionary, and structural information, through a random forest classifier. The final PreAIP model achieved an AUC value of 0.833 on the training dataset in a 10-fold cross-validation test, which was better than that of existing models. Moreover, PreAIP achieved an AUC value of 0.840 on a test dataset, outperforming the two existing methods. These results indicate that PreAIP is an accurate predictor for identifying AIPs and contributes to the development of AIP therapeutics and biomedical research. The curated datasets and the PreAIP are freely available at http://kurata14.bio.kyutech.ac.jp/PreAIP/.

Keywords: inflammatory disease, anti-inflammatory peptides prediction, feature encoding, feature selection, random forest

### INTRODUCTION

Inflammatory responses occur under normal conditions when tissues are damaged by bacteria, toxins, trauma, heat, or any other cause (Ferrero-Miliani et al., 2007). These responses can cause chronic autoimmune and inflammatory disorders, including neurodegenerative disease, asthma, psoriasis, cancer, rheumatoid arthritis, diabetes, and multiple sclerosis (Zouki et al., 2000; Steinman et al., 2012; Tabas and Glass, 2013; Patterson et al., 2014; Hernández-Flórez and Valor, 2016). Numerous inflammation mechanisms are crucial for the upkeep of the state of tolerance (Miele et al., 1988; Corrigan et al., 2015). Numerous endogenous peptides recognized through inflammatory reactions function as anti-inflammatory agents and can be employed in new therapies for autoimmune and inflammatory illnesses (Gonzalez-Rey et al., 2007; Delgado and Ganea, 2008). The immunotherapeutic aptitude of these anti-inflammatory peptides (AIPs) has various clinical applications, such as the generation of regulatory T cells and the inhibition of antigen-specific T(H)1-driven responses (Delgado and Ganea, 2008). Moreover, certain synthetic AIPs act as effective therapeutic agents for autoimmune and inflammatory disorders (Zhao et al., 2016). For instance, chronic adenoidal direction of human amyloid-β peptide causes Alzheimer's disease. Mouse models show compact deposition of amyloid-β peptides, which is a pathological marker of Alzheimer's disease, as well as astrocytosis, microgliosis, and neuritic dystrophy in the brain (Boismenu et al., 2002; Gonzalez et al., 2005; Kempuraj et al., 2017). The present therapy for autoimmune and inflammatory disorders involves the use of non-specific anti-inflammatory drugs and other immunosuppressants, which are frequently associated with side effects, such as an increased risk of infectious diseases, and with ineffectiveness against inflammatory disorders (Tabas and Glass, 2013).

Notwithstanding the increasing number of AIPs experimentally examined in vivo, the molecular mechanism of AIP specificity remains largely unknown. Moreover, large-scale experimental analysis of AIPs is time-consuming, laborious, and expensive. An alternative, computational approach that provides accurate and reliable prediction of AIPs is required to complement the experimental efforts and to enable the prompt identification of potential AIPs prior to their synthesis. To date, two in silico methods have been proposed to predict AIPs (Gupta et al., 2017; Manavalan et al., 2018). In 2017, Gupta et al. employed hybrid features with a support vector machine (SVM) classifier to develop the AntiInflam predictor (Gupta et al., 2017). Manavalan et al. developed the AIPpred predictor by using primary sequence encoding features with a random forest (RF) classifier (Manavalan et al., 2018). Both methods used primary sequence information without considering any evolutionary or structural features.

Nonetheless, the performance of the abovementioned predictors is not sufficient and remains to be improved. In this study, we have developed an accurate predictor named PreAIP (Predictor of Anti-Inflammatory Peptides) by integrating multiple complementary features. We investigated different types of sequence features, including primary sequence, evolutionary, and structural information, through an RF classifier. The PreAIP achieved higher performance on both the training and test datasets than the existing methods. In addition, we obtained valuable insights into the essential sequence patterns of AIPs.

### MATERIALS AND METHODS

#### Dataset Collection

To construct the PreAIP, we collected training and test datasets from a recently published article on AIPpred (Manavalan et al., 2018) and from the IEDB database (Vita et al., 2019). A peptide was considered anti-inflammatory (positive sample) if it induced any one of the anti-inflammatory cytokines IL-10, IL-4, IL-13, IL-22, TGFb, and IFN-a/b in T-cell assays of mouse and human (Marie et al., 1996; Jin et al., 2014). Linear peptides that tested negative for anti-inflammatory cytokines were considered non-AIPs (i.e., negative samples). To reduce the risk of overfitting the prediction model, CD-HIT was employed with a sequence identity threshold of 0.8 (Huang et al., 2010). After eliminating redundant peptides, the same training and test samples as in the AIPpred predictor were retrieved (Manavalan et al., 2018). More reliable performance would be achieved by using a more stringent criterion of 0.3 or 0.4, as in (Hasan et al., 2016, 2017a). However, this study did not use such a stringent criterion, because the length of the currently available AIPs is between 4 and 25. If we applied a criterion more stringent than 0.8, the number of available AIPs would be greatly reduced, so that we could not retrieve the datasets employed by the previous predictor (Manavalan et al., 2018). The collected training dataset contains 1,258 positive and 1,887 negative samples, and the test dataset contains 420 positive and 629 negative samples. All curated datasets are available on our web server.

#### Computational Framework

The overall computational framework of the proposed PreAIP is shown in **Figure 1**. After collecting the positive and negative samples from the AIPpred server (Manavalan et al., 2018), their sequences were transformed into primary sequence, evolutionary, and structural features. We considered polypeptides with 1 to 25 natural amino acids. When a peptide contains fewer than 25 residues, our scheme fills the missing positions with gaps (-) to bring the peptide to a length of 25. To encode the primary sequence features, we employed two encoding methods: the composition of k-spaced amino acid pairs (KSAAP) and AAindex properties. An evolutionary feature was encoded by using the position specific encoding matrix, i.e., the profile-based composition of k-spaced amino acid pairs (pKSAAP). The structural feature (SF) was encoded by using the SPIDER2 (Yang et al., 2017) and PEP2D (http://crdd.osdd.net/raghava/pep2d/) bioinformatics tools. The resulting five types of descriptors were independently fed into RF models to produce five independent RF prediction scores. These RF scores were linearly combined using weight coefficients to obtain the final prediction score. A web server was developed to implement the PreAIP.
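Two steps of this framework, gap padding and the linear combination of the five RF scores, can be sketched as follows. The padding side, the example scores, and the equal weights are illustrative assumptions; the actual weight coefficients used by PreAIP are not given in this section.

```python
def pad_peptide(sequence, length=25, gap="-"):
    """Fill a short peptide with gap characters up to a fixed length.

    Padding on the right is an assumption; the text only states that
    missing residues are replaced by gaps.
    """
    if len(sequence) > length:
        raise ValueError("peptide is longer than the fixed length")
    return sequence + gap * (length - len(sequence))


def combine_scores(scores, weights):
    """Linear combination of per-descriptor RF scores."""
    return sum(weights[name] * scores[name] for name in scores)


# Hypothetical RF scores for one peptide and hypothetical equal weights.
rf_scores = {"AAindex": 0.6, "KSAAP": 0.7, "pKSAAP": 0.8, "SPIDER2": 0.5, "PEP2D": 0.4}
rf_weights = {name: 0.2 for name in rf_scores}

padded = pad_peptide("ACDK")
final_score = combine_scores(rf_scores, rf_weights)
```

With weights that sum to one, the final score stays on the same scale as the individual RF scores, so a single decision threshold can be applied to it.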

#### Feature Encoding

The PreAIP was formulated as a binary classification problem (AIPs vs. non-AIPs) solved with RF algorithms. Extracting a set of relevant features is a crucial step in building a classifier, and a high-quality peptide encoding method is necessary to generate informative feature vectors. Instead of the simple binary representation, we adopted five more elaborate feature encoding methods: AAindex, KSAAP, SPIDER2, PEP2D, and pKSAAP, which are briefly described in the following subsections.

### Amino Acid Index Properties

Numerical physicochemical properties of amino acids exist in the AAindex database (version 9.1) (Kawashima et al., 2008). After assessing different types of AAindex indices, we selected 8 types of high indices (HI) and ordered them from HI1 to HI8 (**Table S1**). In a peptide sequence with length L, a (L × 20) feature vector was generated through the AAindex encoding.

#### KSAAP Encoding

The KSAAP encoding descriptor is widely used in bioinformatics research (Carugo, 2013; Hasan et al., 2018a,b). The procedure of KSAAP is briefly described as follows. For every single k, where k denotes the number of residues separating two amino acids, a peptide sequence can contain (20 × 20) = 400 types of amino acid pairs (i.e., AA, AC, AD, ..., YY). The spacing k was varied from 0 to kmax = 4 to generate a (20 × 20 × 5) = 2,000 dimensional feature vector for each peptide sequence. Details of the KSAAP encoding method are described elsewhere (Hasan et al., 2015).
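As an illustration of the encoding described above, the following is a minimal sketch and not the authors' implementation. The function name `ksaap` is hypothetical, and normalizing each pair count by the number of position pairs at spacing k is an assumption; pairs that involve a gap or non-standard symbol are simply not counted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered amino acid pairs: AA, AC, ..., YY
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def ksaap(seq: str, k_max: int = 4) -> list:
    """Composition of k-spaced amino acid pairs for k = 0..k_max."""
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = 0
        # Residues at positions i and i + k + 1 are separated by k residues.
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip pairs containing gaps or unknown symbols
                counts[pair] += 1
            n_windows += 1
        # Assumed normalization: frequency among all position pairs at this k.
        features.extend(
            counts[p] / n_windows if n_windows else 0.0 for p in PAIRS
        )
    return features
```

With the default `k_max = 4`, the returned vector has 5 × 400 = 2,000 entries, matching the dimensionality given in the text.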

#### Structural Features

#### Protein-Based SF

The protein-based SFs were generated by the SPIDER2 software, which is widely used in bioinformatics research (Yang et al., 2017; López et al., 2018). Three types of features were generated by SPIDER2: accessible surface area (ASA), backbone torsion angles (BTA), and secondary structure (SS). The ASA generated one feature vector, the BTA generated four feature vectors (phi, psi, theta, and tau), and the SS generated three feature vectors (helix, strand, and coil). In total, eight types of feature vectors were generated by SPIDER2. For each peptide sequence, (L × 8) dimensional feature vectors were generated, where L was the length of a given AIP.

#### Peptide-Based SF

We employed PEP2D to generate a peptide structure prediction feature (http://crdd.osdd.net/raghava/pep2d/). The PEP2D generated three types of probability scores: Helix Prob, Sheet Prob, and Coil Prob. For each peptide sequence, (L × 3) dimensional feature vectors were generated, where L was the length of a given AIP.

#### pKSAAP Encoding

In protein or peptide sequence analysis, the PSSM provides useful evolutionary information. This matrix measures the replacement probability of each residue in a protein with all the residues of the genetic code. The PSSM profile was created by using PSI-BLAST (version 2.2.26+) against the whole Swiss-Prot NR90 database (version of December 2010) with two default parameters, an e-value cutoff of 1.0 × 10<sup>−4</sup> and an iteration number of 3 (Hasan et al., 2015). Then, we extracted the feature vectors using the given peptide sequences. After generating the PSSM profile, we generated the possible k-spaced pair composition from the PSSM, i.e., pKSAAP, in the same manner as the previous study of protein pupylation site prediction (Hasan et al., 2015). With k ranging from 0 to 4, a (5 × 20 × 20 = 2,000) dimensional feature vector was generated.

Moreover, we utilized the similarity-search-based tool BLAST (version ncbi-blast-2.2.25+) (Altschul et al., 1997; Bhasin and Raghava, 2004) to investigate whether a query peptide belongs to AIPs or not. BLASTP with an e-value of 1.0 × 10<sup>−2</sup> was used against the whole Swiss-Prot NR90 database (version of December 2010).

#### Feature Selection

To find the top-ranking features for predicting AIPs, a well-established, supervised method for feature dimensionality reduction, Information Gain (IG) (Azhagusundari and Thanamani, 2013; Huang, 2015; Manavalan et al., 2018), was used through the WEKA package (Frank et al., 2004). A large IG value indicates that the corresponding feature has a great impact on prediction performance. The IG measures the decrease in entropy when given information is used to group values of an alternative (class) feature. The entropy of feature U is defined as

$$H(U) = -\sum_{i} P(u_i) \log_2 P(u_i) \tag{1}$$

where u<sub>i</sub> denotes a value of U and P(u<sub>i</sub>) is the prior probability of u<sub>i</sub>. The conditional entropy H(U|V), given another feature V, is defined as

$$H(U|V) = -\sum_{j} P(v_j) \sum_{i} P(u_i|v_j) \log_2 P(u_i|v_j) \tag{2}$$

where P(u<sub>i</sub>|v<sub>j</sub>) is the posterior probability of u<sub>i</sub> given the value v<sub>j</sub> of V. The IG is defined as the decrease in entropy, calculated by subtracting the conditional entropy of U given V from the entropy of U, as follows.

$$IG(U|V) = H(U) - H(U|V) \tag{3}$$
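Equations (1)–(3) can be checked numerically with a short sketch. This is not the WEKA implementation; the function names are hypothetical, and both U and V are assumed to be discrete (continuous features would first need to be discretized).

```python
from collections import Counter
from math import log2


def entropy(values):
    """H(U) = -sum_i P(u_i) log2 P(u_i), Equation (1)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(u, v):
    """H(U|V) = sum_j P(v_j) * H(U | V = v_j), Equation (2)."""
    n = len(u)
    total = 0.0
    for v_j, count in Counter(v).items():
        subset = [u_i for u_i, v_i in zip(u, v) if v_i == v_j]
        total += (count / n) * entropy(subset)
    return total


def information_gain(u, v):
    """IG(U|V) = H(U) - H(U|V), Equation (3)."""
    return entropy(u) - conditional_entropy(u, v)
```

When V determines U completely, the conditional entropy is zero and the IG equals H(U); when V is independent of U, the IG is zero.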

#### Random Forest

The RF is a supervised machine learning algorithm (Breiman, 2001) and is widely used for various biological problems (Manavalan et al., 2017, 2018; Bhadra et al., 2018; Hasan and Kurata, 2018). In brief, the following steps are carried out to construct the n trees of the RF model. First, a new dataset of N samples is drawn from the training set by random selection with replacement. This procedure is repeated n times to obtain n different datasets, and n decision trees are built from them. During tree construction, out of K input features, k (k << K) features are selected randomly, where k is held constant throughout the construction of the RF. Each node is split on the selected features using the Gini impurity criterion. Each decision tree is grown fully without pruning. After the n decision trees are obtained, the class with the most votes is the final prediction (Breiman, 2001). The randomForest R package was used to train the proposed model (https://cran.r-project.org/web/packages/randomForest/). We set n to 1000 through the 10-fold cross-validation (CV) test, which is large enough to obtain stable prediction.
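The three randomization and aggregation steps described above (sampling with replacement, random feature selection, and majority voting) can be sketched as follows. This is only an illustration of the ingredients, not the randomForest R package; the helper names are hypothetical, and tree growing itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw N sample indices with replacement from a training set of size N."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, k, rng):
    """Select k of the K input features at random, without repetition."""
    return rng.sample(range(n_features), k)


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

A forest of n trees would call `bootstrap_sample` once per tree, use `random_feature_subset` when choosing split candidates, and apply `majority_vote` to the n per-tree predictions for each query peptide.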

#### Other Machine Learning Algorithms

The performance of the RF was characterized in comparison to three commonly used machine learning algorithms: Naive Bayes (NB) (Lowd, 2005), SVM (Hearst, 1998), and artificial neural network (ANN) (Michalski et al., 2013). We used the NB and ANN algorithms of the WEKA software (Frank et al., 2004) and the SVM algorithm with a radial basis function (RBF) kernel of the LIBSVM package (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). In the NB algorithm, we set the batch size to 1,000 through the 10-fold CV via the WEKA software. For the ANN algorithm, we used "MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a" via the WEKA software. For the SVM model, the optimized cost and gamma parameters were 8 and 0.03125 for KSAAP, respectively, via the LIBSVM package. Similarly, the cost and gamma parameters were set to 2 and 0.0123 for AAindex, 32 and 0.0625 for pKSAAP, 16 and 0.125 for SPIDER2, and 8 and 0.015625 for PEP2D.

#### Combined Method

To build an efficient and robust prediction model, it is generally essential to optimize how the different feature encodings are combined. We linearly combined the RF scores of the five encoding methods: AAindex, KSAAP, SPIDER2, PEP2D, and pKSAAP, using the following formula (Hasan et al., 2017b):

$$\text{Combined} = w_1 \times \text{SPIDER2} + w_2 \times \text{PEP2D} + w_3 \times \text{KSAAP} + w_4 \times \text{AAindex} + w_5 \times \text{pKSAAP} \tag{4}$$

where w<sub>1</sub>, w<sub>2</sub>, w<sub>3</sub>, w<sub>4</sub>, and w<sub>5</sub> are the weight coefficients indicating the strength of the five descriptors, and their sum is 1. We adjusted each weight from 0 to 1 with an interval of 0.05. When w<sub>1</sub>, w<sub>2</sub>, w<sub>3</sub>, w<sub>4</sub>, and w<sub>5</sub> were 0.00, 0.00, 0.15, 0.25, and 0.6, respectively, the AUC value on the CV of the training dataset was maximal. Therefore, "Combined" was effectively the linear combination of the three RF models of KSAAP, AAindex, and pKSAAP.
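The weight search can be pictured with a small sketch that enumerates every weight vector on a 0.05 grid whose entries sum to 1 and applies the linear combination. The function names are hypothetical, and the exhaustive enumeration is an assumption about how such a grid could be scanned; the paper does not describe its search procedure in more detail.

```python
from itertools import product


def weight_grid(n_weights=5, step=0.05):
    """Yield all weight tuples on a `step` grid that sum to exactly 1."""
    units = round(1 / step)  # 20 units of 0.05
    for combo in product(range(units + 1), repeat=n_weights - 1):
        rest = units - sum(combo)
        if rest >= 0:  # the last weight takes whatever remains
            yield tuple(c / units for c in combo) + (rest / units,)


def combine(scores, weights):
    """Weighted sum of per-descriptor RF scores (same order as the weights)."""
    return sum(w * s for w, s in zip(weights, scores))
```

In a full search, each candidate from `weight_grid()` would be scored by the cross-validated AUC of the combined scores, and the best candidate kept. With five weights the grid contains 10,626 candidates.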

#### Performance Assessment

To investigate the performance of the PreAIP, threshold-dependent and threshold-independent indices were measured. As the threshold-dependent indices, four widely used statistical measures were considered: accuracy (Ac), specificity (Sp), sensitivity (Sn), and Matthews correlation coefficient (MCC). The four measures are defined by the following formulas,

$$\text{Ac} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}} \tag{5}$$

$$\text{Sn} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{6}$$

$$\text{Sp} = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{7}$$

$$\text{MCC} = \frac{\text{(TP} \times \text{TN)} - \text{(FP} \times \text{FN)}}{\sqrt{\text{(TN} + \text{FN)} \times (\text{TP} + \text{FP}) \times (\text{TN} + \text{FP}) \times (\text{TP} + \text{FN})}} \tag{8}$$

where TP denotes the number of correctly predicted positive samples; TN the number of correctly predicted negative samples; FP the number of negative samples incorrectly predicted as positive; and FN the number of positive samples incorrectly predicted as negative. Furthermore, as the threshold-independent index, we used the receiver operating characteristics (ROC) curve (Sn vs. 1-Sp plot) to evaluate the area under the ROC curve (AUC) (Centor, 1991; Gribskov and Robinson, 1996).
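Equations (5)–(8) and the AUC translate directly into a few lines of code. The sketch below uses hypothetical function names; the AUC is computed with the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve, and returning an MCC of 0 when the denominator vanishes is a convention adopted here, not taken from the paper.

```python
from math import sqrt


def confusion_metrics(tp, tn, fp, fn):
    """Ac, Sn, Sp and MCC from confusion-matrix counts, Equations (5)-(8)."""
    ac = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = sqrt((tn + fn) * (tp + fp) * (tn + fp) * (tp + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return {"Ac": ac, "Sn": sn, "Sp": sp, "MCC": mcc}


def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

For instance, with TP = 40, TN = 45, FP = 5, and FN = 10, the sketch gives Ac = 0.85, Sn = 0.8, and Sp = 0.9.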

Since the balance between the correctly predicted AIPs and non-AIPs is critically responsible for accurate prediction, Sp and Sn are intuitive, intelligible measures. Typically, high Sp decreases Sn. In this study, the prediction performance of the PreAIP for the training dataset was evaluated with a stepwise change in Sp. We calculated Sn, Ac, and MCC at high (0.903), moderate (0.801) and low (0.709) levels of Sp. These three levels of Sp were given by setting the high (0.468), moderate (0.388), and low (0.342) thresholds of the RF score. In the same manner, we measured the performance of the individual encoding scheme of KSAAP, AAindex, SPIDER2, PEP2D, and pKSAAP at each level of Sp. When the same threshold values of the RF score were applied to prediction of the test dataset, the high, moderate and low levels of Sp were calculated as 0.871, 0.747, and 0.636, respectively.
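The link between a score threshold and a target Sp level can be sketched as follows. The function names are hypothetical, and the rule "predict positive when the score is at or above the threshold" is an assumption; the paper reports the thresholds it used but not the procedure by which they were located.

```python
def specificity(neg_scores, threshold):
    """Fraction of negative samples scoring below the threshold."""
    return sum(s < threshold for s in neg_scores) / len(neg_scores)


def sensitivity(pos_scores, threshold):
    """Fraction of positive samples scoring at or above the threshold."""
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)


def threshold_for_specificity(neg_scores, target_sp):
    """Smallest candidate threshold whose Sp reaches `target_sp`."""
    # Candidates are the observed negative scores; infinity guarantees Sp = 1.
    candidates = sorted(set(neg_scores)) + [float("inf")]
    for t in candidates:
        if specificity(neg_scores, t) >= target_sp:
            return t
```

Raising the threshold increases Sp and lowers Sn, which is the trade-off the three operating points (high, moderate, and low Sp) expose to the user.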

To assess the performance of the PreAIP using the measures of Ac, Sp, Sn, MCC, and AUC, a 10-fold CV test was used. For the 10-fold CV, the original training samples were randomly partitioned into 10 equal-sized subsets. One subset was singled out as the test sample, and the remaining 9 subsets were used as the training sample. Then we computed all performance measures for each predictor. We repeated this procedure 10 times, each time using a different subset as the test sample. Finally, we calculated the average value of each performance measure for each predictor.
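The partitioning scheme of the 10-fold CV can be sketched as an index generator. The function name and the fixed seed are illustrative assumptions; when the sample count is not divisible by 10, fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging a performance measure over the 10 folds uses every training sample once for evaluation.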

#### RESULTS AND DISCUSSION

#### Sequence Preference Analysis of AIPs

To investigate the amino acid preference of positive and negative AIPs, we performed sequence compositional preference analysis using the amino acids from the 1 to 15 N-terminal residues of training sets. The length of the AIPs ranged between 4 and 25 amino acid residues in this study. The average length of AIPs was 15 amino acids. Since Ialenti et al. suggested that the AIP activity is located in the N-terminal region of the molecule (Ialenti et al., 2001), we investigated the 1 to 15 N-terminal amino acids by the sequence compositional preference analysis. A non-existing residue was coded by "O" to fill the corresponding position of the AIPs.

First, we submitted the N-terminal residues 1 to 15 of positive and negative AIPs to the Two Sample Logo online server (http://www.twosamplelogo.org/) to generate the sequence logo representations (**Figure 2**). The height of each amino acid was in proportion to the percentage of positive (over-represented) or negative (under-represented) peptides. The logos were scaled according to their statistical significance threshold of p < 0.05 by Welch's t-test. Leucine (L) at positions 5, 7, 10, 11, and 15, cysteine (C) at positions 7 and 10, isoleucine (I) at positions 2 and 7, arginine (R) at position 5, phenylalanine (F) at position 8, and lysine (K) at position 15 were significantly overrepresented compared with other amino acids, while aspartic acid (D) at positions 4, 5, 10, 13, and 15, threonine (T) at positions 3 and 7, and valine (V) at position 15 were significantly underrepresented. In addition, tyrosine (Y) at positions 4 and 5 was overrepresented, while Y at positions 5 and 10 was underrepresented. These results suggested that positive and negative AIPs are significantly different.

Secondly, we examined the evolutionary conservation features of the PreAIP using the average PSSM value (APV) for each amino acid within the N-terminal residues 1 to 15 of AIPs. The evolutionary conservation information of the APV of both the positive and negative AIPs is illustrated in **Figure 3**. Some amino acid positions of positive and negative AIPs showed significantly different scores. Furthermore, a nonparametric Kruskal–Wallis (KW) test was used to examine whether positive and negative AIPs were significantly dissimilar. The p-values were calculated and adjusted by the Bonferroni correction (**Table S2**).

FIGURE 2 | Sequence logo representation of positive and negative AIPs. The upper portion (enriched) represents positive AIPs, while the lower portion (depleted) represents negative AIPs. The statistically significant local sequence within the N-terminal 15 residues of AIPs was plotted with *p* < 0.05 by Welch's *t*-test.

Thirdly, we examined the AAindex encoding features of PreAIP. Eight types of informative amino acid indices from the AAindex database were used as the input feature vectors and named HI1 to HI8. We examined these HI amino acid properties of both the positive and negative AIPs. As illustrated in **Figure 4**, the average values of the eight indices were renamed as AVHI1 to AVHI8. These indices represented the amino acid compositions of intracellular proteins. Some of the AIPs had distinct amino acid compositions in the eight high-quality amino acid indices between the two samples of AIPs (**Figure 4**). The KW test was used to examine whether the two samples of AIPs were significantly dissimilar with respect to the eight HI properties. The p-values were calculated and adjusted by the Bonferroni correction (**Table S3**). Significantly different AAindex values with p-value <0.05 appeared at some positions of AIPs, as marked with "<sup>∗</sup>" in **Figure 4**.

Finally, we examined the difference in the 8 types of SFs by SPIDER2 between the positive and negative AIPs, as shown in **Figure 5**. We calculated the average value of the 8 types of SFs for SPIDER2: ASA, phi, psi, theta, tau, coil, strand, and helix of both the positive and negative AIPs. The average features were represented as AVAS, AVPhi, AVPsi, AVThe, AVTau, AVCoil, AVSta, and AVHel (**Figure 5**). We plotted these average values of SFs with respect to the N-terminal residues 1–15 of AIPs. Distinct differences were observed between the positive and negative samples of AIPs. The KW test was employed to examine whether the two samples of AIPs were significantly dissimilar among the eight SFs. The p-values were calculated and adjusted by the Bonferroni correction (**Table S4**). Significantly different SFs were observed at some positions of AIPs, with a p-value <0.05, as indicated with "<sup>∗</sup>" in **Figure 5**.

The above analysis of residue preference between the positive and negative AIPs suggested that the combination of the primary sequence, evolutionary, and structural amino acid occurrences achieves a precise prediction.

#### Overall Prediction Performance of PreAIP

The five selected descriptors (AAindex, KSAAP, SPIDER2, PEP2D, and pKSAAP) were separately used for prediction of AIPs. Optimization of multiple encoded features is generally essential in the training model to reduce dimensionality while retaining the significant features. To achieve this, we performed multiple rounds of experiments to select appropriate feature vectors using the IG feature selection via a 10-fold CV test on the training set; however, it turned out that the IG feature selection did not improve prediction performance. Thus, the IG was used to collect significant features and to interpret the superiority of the KSAAP encoding.

We assessed the performance of the training models of the five encoding methods (AAindex, KSAAP, SPIDER2, PEP2D, and pKSAAP) through a 10-fold CV test using the RF classifier. The prediction results by each of the five encoding features and the "Combined features" are shown in **Figure 6A**. The AUCs of AAindex, KSAAP, SPIDER2, PEP2D, and pKSAAP were 0.774, 0.813, 0.739, 0.734, and 0.789, respectively. The KSAAP performed best among the 5 single encoding approaches in terms of Sn, MCC and AUC (**Table 1**). The "Combined features" (PreAIP) showed better performance, with an AUC of 0.833, than any single feature. Note that "Combined features" means a linear combination of the RF scores (Materials and Methods). Moreover, the PreAIP presented the highest AUC value (0.840) in the test dataset (**Figure 6B**). The performance of PreAIP was effective and reasonable for all the tested cases (**Figure 6**) and was best in the AIP prediction.

To present the known AIPs in the training dataset, we used BLAST to search the (weak) homologs, and ranked them to obtain the best hit e-value (Bhasin and Raghava, 2004). In total, 256 positive and 397 negative hits were found out of 1,258 positive and 1,887 negative samples by BLASTP with an e-value of 1.0 × 10<sup>−2</sup>. The reduced numbers of the samples may be due to the peptide length of 5–25. Then, we measured the BLAST performance through a 10-fold CV test. The prediction performances of Sp, Sn, Ac, MCC, and AUC were 0.752, 0.269, 0.563, 0.159, and 0.632, respectively, which were lower than those by the other sequence encoding-based models. Therefore, we did not consider BLAST for final prediction.

TABLE 1 | AUC values for prediction performance of the training dataset by 10-fold CV test.

and AAindex methods. High AUC values show accurate performance.

\**PreAIP is the linear combination of the RF scores estimated by SPIDER2, PEP2D, KSAAP, AAindex, and pKSAAP encoding schemes and their weight coefficients are 0.00, 0.00, 0.15, 0.25, and 0.6, respectively. A p-value was computed based on the final model of AUC values by using a Wilcoxon matched-pair signed test.*

In addition, we found that KSAAP performed best among the five single encoding methods. To investigate the most significant residue pairs of the KSAAP method, the top 20 amino acid pairs of AIPs were examined through the IG feature selection. The top 20 significant residue pair scores and their corresponding positions are listed in **Table S5**. These significant features are also presented using a radar diagram (**Figure 7A**). For example, the sequence motif "L×L," which represents a 1-spaced residue pair of "LL," is the most important residue pair, where "×" stands for any amino acid. The feature "L×××L" represented the second most enriched motif in positive samples of AIPs. Similarly, the feature "LL," which represents a 0-spaced residue pair of "LL," is important and enriched in the negative samples of AIPs. The other k-spaced amino acid pairs from KSAAP are denoted in the same way. Residue preference analysis demonstrated that "L," "Y," "C," "D," and "I" residues frequently appear in AIPs (**Figures 2**, **7A**). These residues are expected to play a key role in the recognition of AIPs. To characterize the top 20 KSAAP-specific features, we compared the numbers of positive and negative AIPs. **Figure 7B** shows the average value of feature scores (AVFS) of the top 20 features by the IG. The average of the top 20 features was significantly different between the two samples of AIPs with p < 0.05, suggesting the effectiveness of the KSAAP encoding. The significant residue pair scores are listed in **Table S5**, which provides some insights into the sequence patterns of the AIPs. They deserve further experimental validation.

### Comparison of PreAIP With Existing Predictors Using Test Dataset

We evaluated the performance of PreAIP along with that of existing predictors on the test dataset. We submitted the test set to the AIPpred (Manavalan et al., 2018) and AntiInflam (Gupta et al., 2017) servers to assess their performance. Note that the AntiInflam server provides different threshold values. We used two threshold values of −0.3 and 0.5 and named the resulting models less accurate (LA) and more accurate (MA) (Gupta et al., 2017), respectively. The AIPpred represents the state-of-the-art predictor available. The average performances of the LA, MA, AIPpred, and PreAIP are shown in **Table 2**. The LA showed the highest Sp (0.892) with the lowest Sn (0.258), MCC (0.197), and AUC (0.647) among all the predictors. The PreAIP with the high threshold presented much higher Sn (0.618), Ac (0.770), MCC (0.512), and AUC (0.840) than LA, while it provided Sp (0.871) comparable to LA. The PreAIP with the low threshold showed the highest Sn (0.863), while keeping Sp, Ac, MCC, and AUC at a high level. While the AIPpred presented considerably high values for all the measures of Sp, Sn, Ac, MCC, and AUC, the PreAIP with the moderate threshold outperformed the AIPpred, presenting well-balanced, high prediction performances. The performance improvement of the PreAIP on the test dataset was found to be significant by the Wilcoxon matched-pair signed test, demonstrating its ability to predict unseen peptides.

### Comparison of PreAIP With AIPpred Using Training Dataset

We compared the performance of the proposed PreAIP with the AIPpred using the same training dataset. In this study, the same dataset as the AIPpred set was used to make a fair comparison for prediction performance of AIPs. As shown in **Table 3**, the PreAIP achieved a better performance than the AIPpred in terms of Ac, Sp, Sn, MCC, and AUC. The AUC value was nearly 3% higher than that of the AIPpred predictor. The PreAIP performance (AUC) improvement over the AIPpred was demonstrated on the training set by the Wilcoxon matched-pair signed test (**Table 3**).

FIGURE 7 | Top 20 amino acid pairs selected by the IG feature of the KSAAP method. (A) The radar diagram is represented by the composition of each amino acid pair whose length is proportional to the composition of KSAAP features. (B) Box plot shows the top 20 average value of feature scores (AVFS) by the IG. Red color denotes the positive AIPs, while gray color denotes the negative AIPs. The *p*-value is computed by two-sample *t*-test.

TABLE 2 | Performance comparison with existing predictors using test dataset.

*A p-value was computed based on AUC values by using a Wilcoxon matched-pair signed test and p < 0.05 indicates a statistically significant difference between the proposed PreAIP and each selected method. The performances of AntiInflam LA and MA methods were computed using default threshold (server) values of −0.3 and 0.5, respectively. The AIPpred threshold was the same as given by its server.*

### Comparison of Different Machine Learning Algorithms

The performance of the RF was compared to the three widely used machine learning algorithms, NB, SVM, and ANN, by using the same training datasets and features, as shown in **Table 4**. The AUC values of the prediction by the four algorithms were calculated by a 10-fold CV test, using the SPIDER2, PEP2D, AAindex, KSAAP, and pKSAAP encodings and their combined method. The RF provided a higher AUC than any other algorithm for all the encoding methods and their combined method.

TABLE 3 | Performance comparison of PreAIP with AIPpred using training dataset.


*A p-value was computed based on AUC values by using a Wilcoxon matched-pair signed test and p < 0.05 indicates a statistically significant difference between the proposed PreAIP and AIPpred.*

## The Effect of Peptide Redundancy on the Predictive Model

Peptide redundancy may lead to overestimation of the predictive performance. Therefore, we performed CD-HIT with a 60% identity cutoff at the peptide level (Huang et al., 2010). After removing the 60% sequence redundancy, we re-assembled a training dataset that contained 1,098 positive and 1,226 negative samples, and a test dataset that contained 308 positive and 275 negative samples. While the overall performance (AUC = 0.821) of the PreAIP by the 10-fold CV test decreased slightly (**Table S6**), the PreAIP could still achieve the best performance on the independent test dataset (**Figure S1**). The PreAIP achieved 6 and 8% higher AUC values than the AntiInflam and the AIPpred, respectively, demonstrating that the PreAIP with the 60% peptide redundancy removal provides a stable or competitive performance compared with the other predictors, as it does with the 80% peptide redundancy removal.

TABLE 4 | AUC values of AIP prediction by different machine learning algorithms based on a 10-fold CV test.


*"Combined" indicates the performance of the optimized combined features. The combined score of RF was given as the weighted sum of the five SPIDER2, PEP2D, AAindex, KSAAP, and pKSAAP features with weight values of 0.00, 0.00, 0.15, 0.25, and 0.6, respectively. In the same way, the weight values of NB, SVM, and ANN were given as (0.00, 0.00, 0.10, 0.35, and 0.55), (0.00, 0.00, 0.22, 0.45, and 0.33), and (0.00, 0.00, 0.18, 0.5, and 0.32), respectively.*

#### Advantages of PreAIP

From a theoretical viewpoint, the comparison of the proposed PreAIP with existing predictors is summarized as follows. (1) The PreAIP investigated primary sequence, physicochemical property, structural, and evolutionary features, whereas the AIPpred and AntiInflam predictors used only primary sequence encoding methods. For instance, the AntiInflam method (Gupta et al., 2017) studied hybrid features based on primary sequence encoding schemes such as amino acid composition (AAC), dipeptide composition (DPC), and tripeptide composition with the SVM algorithm. The AIPpred (Manavalan et al., 2018) studied individual compositions (AAC, AAindex, DPC, and chain-transition-composition) through multiple machine learning algorithms. (2) Since existing prediction tools did not control the Sp level, users cannot tell from their servers how confidently a peptide is predicted as positive or negative. On the other hand, the PreAIP controlled Sp at high, moderate, and low levels by changing the threshold of the RF scores, based on the 10-fold CV test results. A limitation of the PreAIP is that the employed dataset is still small, but we believe that the dataset will grow to enable intensive identification of AIPs. In addition, the calculation speed remains to be improved. The processing time of the PreAIP was <3 min for one peptide sequence, because the generation of PSSM profiles is time-consuming.

#### Server of PreAIP

A web server of the PreAIP has been developed and is publicly accessible at http://kurata14.bio.kyutech.ac.jp/PreAIP/. The web application was implemented in JavaScript, Perl, R, CGI scripts, PHP, and HTML. After a query sequence is submitted, the server generates the consecutive feature vectors. Then, the server optimizes the performances through RFs. After completing the submission job, the server returns the result in the output webpage, which consists of the job ID and probability scores of the predicted AIPs in a tabular form. A user gets a job ID like "2018032900067" and can save this ID for a future query. The server stores this job ID for one month. The input peptide sequence must be in the FASTA format. Each of the 20 types of standard amino acids must be written as one uppercase letter. See the test example on the server. The length of an AIP sequence is limited to 1 to 25 residues. If users submit 200 amino acids, the PreAIP analyzes only the first 25 residues. When the peptide contains fewer than 25 residues, the PreAIP fills the missing positions with gaps (–) to reach a peptide length of 25.

### CONCLUSIONS

We have designed an accurate and efficient computational predictor for identifying potential AIPs. It outperforms the existing methods and is effective in understanding some mechanisms of AIP identification. An IG-based feature selection method was carried out to suggest sequence motifs of AIPs from the KSAAP encoding. A user-friendly web server was developed and is freely available for academic users.

### AUTHOR CONTRIBUTIONS

MK, MH, and HK conceived and designed the study. MK and MH collected data and performed the analyses. MH, MK, and HK wrote the manuscript. All authors discussed the prediction results and commented on the manuscript.

#### ACKNOWLEDGMENTS

This work was supported by the Grant-in-Aid for Challenging Exploratory Research with JSPS KAKENHI Grant Number 17K20009. This research is partially supported by the developing key technologies for discovering and manufacturing pharmaceuticals used for next-generation treatments and diagnoses both from the Ministry of Economy, Trade and Industry, Japan (METI) and from Japan Agency for Medical Research and Development (AMED).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00129/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Khatun, Hasan and Kurata. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Sexual Transcription Differences in Brachymeria lasus (Hymenoptera: Chalcididae), a Pupal Parasitoid Species of Lymantria dispar (Lepidoptera: Lymantriidae)

Peng-Cheng Liu<sup>1,2</sup>, Shuo Tian<sup>1,2</sup> and De-Jun Hao<sup>1,2</sup>\*

<sup>1</sup> Co-Innovation Center for the Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing, China, <sup>2</sup> The College of Forestry, Nanjing Forestry University, Nanjing, China

#### Edited by:

Ancha Baranova, George Mason University, United States

#### Reviewed by:

Juan Pedro M. Camacho, University of Granada, Spain Nakatada Wachi, University of the Ryukyus, Japan

> \*Correspondence: De-Jun Hao djhao@njfu.edu.cn

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

> Received: 12 October 2018 Accepted: 18 February 2019 Published: 05 March 2019

#### Citation:

Liu P-C, Tian S and Hao D-J (2019) Sexual Transcription Differences in Brachymeria lasus (Hymenoptera: Chalcididae), a Pupal Parasitoid Species of Lymantria dispar (Lepidoptera: Lymantriidae). Front. Genet. 10:172. doi: 10.3389/fgene.2019.00172

Sex differences in gene expression have been extensively documented, but little is known about these differences in parasitoid species that are widely applied to control pests. Brachymeria lasus is a solitary parasitoid species and has been evaluated as a potential candidate for release to control Lymantria dispar. In this study, gender differences in B. lasus were investigated using Illumina-based transcriptomic analysis. The resulting 37,453 unigene annotations provided a large amount of useful data for molecular studies of B. lasus. A total of 1416 differentially expressed genes were identified between females and males, and the majority of the sex-biased genes were female biased. Gene Ontology (GO) and Pathway enrichment analyses showed that (1) the functional categories DNA replication, fatty acid biosynthesis, and metabolism were enhanced in females and that (2) the only pathway enriched in males was phototransduction, while the GO subcategories enriched in males were those involved in membrane and ion transport. In addition, thirteen genes involving transient receptor potential (TRP) channels were annotated in B. lasus. We further explored and discussed the functions of TRPs in sensory signaling of light and temperature. In general, this study provides new molecular insights into the biological and sexually dimorphic traits of parasitoids, which may improve the application of these insects to the biological control of pests.

Keywords: sexually dimorphic, Brachymeria lasus, transcriptomic analysis, sex determination, venom protein, transient receptor potential channels

## INTRODUCTION

Parasitoids are animals that parasitize other organisms (Godfray, 1994). All invertebrate life stages, such as egg, larva or nymph, pupa and adult, can be attacked by oviposition on or in the host or by depositing a larva on or near a host (Boulton et al., 2015). Based on the number of offspring reared in a host, parasitoid wasps are classified as solitary (one parasitoid per host), quasi-gregarious (one parasitoid per host, but hosts are spatially clumped, such as a clutch of eggs on a leaf), or gregarious (multiple parasitoids per host). The vast majority of parasitoids are solitary wasps (Mayhew, 1998).

Parasitoids can also be classified as koinobionts (in which hosts continue to develop and grow to some extent) or idiobionts (in which hosts do not grow further). Parasitoid wasps are haplodiploid: males develop from unfertilized eggs and are haploid, while females develop from fertilized eggs and are diploid (Cook, 1993; Heimpel and de Boer, 2008). Parasitoid species (e.g., Sclerodermus harmandi, Trichogramma) are important insects and have been extensively applied to reduce the population size of pest species (Hassan, 1993; Li, 1994; Terayama, 1999; Zhishan et al., 2003; Parra and Zucchi, 2004; Lim et al., 2006). In addition to having important applications, parasitoid and mutualistic Chalcidoidea, such as jewel (Nasonia vitripennis) and fig (Pleistodontes froggatti) wasps, have been important study models of behavioral ecology and evolutionary biology for such traits as their sexual dimorphism in longevity, body size, and dispersal (Hamilton, 1967; Charnov, 1982; Yan et al., 1989; Godfray, 1994).

Animals from a broad range of taxa show sex differences, which include behavioral (Breedlove, 1992), physiological (Bardin and Catterall, 1981), and morphological dimorphisms (Darwin, 1871). It is often assumed that the majority of sexually dimorphic traits arise from differences in the expression of genes present in both sexes (Connallon and Knowles, 2005; Rinn and Snyder, 2005). Sex-biased gene expression has been documented in brown algae (Lipinska et al., 2015), birds (Pointer et al., 2013), nematodes (Albritton et al., 2014), Daphnia pulex (Eads et al., 2007), and multiple insect species, including Drosophila (Jin et al., 2001; Arbeitman et al., 2002; Ranz et al., 2003; Chang et al., 2011), Anopheles gambiae (Hahn and Lanzaro, 2005; Marinotti et al., 2006; Baker et al., 2011), Tribolium castaneum (Prince et al., 2010), vespid wasps (Hunt and Goodisman, 2010), and Bemisia tabaci (Wen et al., 2014). However, few studies of sex differences in gene expression have been done in Hymenoptera insects, and these studies have focussed mainly on social species (e.g., honeybee; Cameron et al., 2013) and model organisms of parasitoids, e.g., jewel wasp N. vitripennis (Wang et al., 2015), which is a classic gregarious species. Most species of parasitoid wasps are thought of as solitary species (Mayhew, 1998), but their sexual transcription differences have not been addressed.

The gypsy moth, Lymantria dispar, is a worldwide pest, and its pupal stage can be parasitized by Brachymeria lasus. B. lasus is a solitary parasitoid species and has been evaluated as a potential candidate for release to control L. dispar (Simser and Coppel, 1980), Homona magnanima (Mao and Kunimi, 1991) and Sylepta derogata (Kang et al., 2006). In addition, B. lasus has a wide host range, including many Lepidoptera species (e.g., Mythimna separata, Hyphantria cunea, and Cnaphalocrocis medinalis) (Habu, 1960). Male and female B. lasus differ in many important biological traits, including longevity (Mao and Kunimi, 1994b); development time in the egg, larval and pupal stages (Mao and Kunimi, 1994a); secondary symbionts; and body size (Yan et al., 1989). Because B. lasus is a classic solitary species with many documented sex differences, none of which has yet been examined at the gene expression level, it was used as the experimental material in this study. To reveal B. lasus sex differences at the transcriptional level, we carried out an Illumina-based transcriptomic analysis. This study aimed to provide comprehensive insight into the sexually dimorphic traits of parasitoid wasps at the transcriptome level, both to improve our understanding of other biological traits and to improve the application of parasitoids to the biological control of pest species.

## MATERIALS AND METHODS

### Insect Cultures

In northern China, in addition to L. dispar, B. lasus is also an important pupal parasitoid of H. cunea, for which the parasitism ratio is approximately 1.06–3.39% in the field (Yang et al., 2001). To acquire B. lasus adults, we collected the pupae of H. cunea, which may be parasitized by B. lasus and other parasitoid species (e.g., Coccygomimus disparis Viereck; Chouioia cunea Yang), from a field in Xuzhou City, Jiangsu Province, China, in March 2016. After collection, we isolated the pupae individually in polyethylene tubes (height: 7.5 cm; diameter: 1 cm) whose openings were covered by a cotton ball and incubated them at a temperature of 28 ± 0.5 °C, a relative humidity (RH) of 70 ± 5% and a photoperiod of 14 L:10 D. We observed and selected B. lasus after adult eclosion.

### Transcriptomic Analyses

For the transcriptomic experiment, only 1-day-old B. lasus adults were selected, and the sex was determined under a microscope (Leica M205A, Germany). Then, five adults of the same sex were pooled into a plastic tube (1.5 ml), snap-frozen in liquid nitrogen, and transferred to a −80 °C freezer for long-term storage. RNA from each sample group (whole bodies of female and male adults) was extracted with TRIzol reagent (Invitrogen; United States). Each group had three replicates. The quality of the isolated RNA was assessed using a NanoDrop (Thermo Fisher Scientific NanoDrop 2000, United States), and the A260/280 values were all above 2.0. A total of 3 µg total RNA from each sample was converted into cDNA using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, United States). In total, six cDNA libraries were constructed and subsequently sequenced with the Illumina HiSeq 2000 platform by Beijing Biomarker Technologies Co., Ltd, resulting in raw reads. The raw sequence data were deposited in the Sequence Read Archive (SRA) database of NCBI under accession no. PRJNA513855. Clean reads were obtained by removing adapter-containing reads, poly-N reads and low-quality reads from the raw data using FASTX-Toolkit<sup>1</sup>, and these clean reads were used for further analysis. Then, transcriptome assembly was performed using Trinity (v2.5.1) with the default parameters (Grabherr et al., 2011). For functional annotation, pooled assembled unigenes were searched using BLASTX (v2.2.31) against five public databases, Clusters of Orthologous Groups (COG), Swiss-Prot, NCBI non-redundant protein sequences (nr), KEGG Ortholog database (KO) and GO, with an E-value cutoff of 10<sup>−5</sup>. Using our assembled transcriptome as a reference, we identified putative genes expressed in males and females by RSEM (Li and Dewey, 2011), using the reads per kb per million reads (RPKM) method. Genes with at least 2-fold changes (i.e., |log<sub>2</sub> FC| ≥ 1) and a false discovery rate (FDR) < 0.01, as determined by the DESeq R package (1.10.1), were considered differentially expressed. The GOseq R package (Young et al., 2010) and KOBAS software (Mao et al., 2005) were used to test for statistical enrichment of differentially expressed genes (DEGs) in GO terms and KEGG pathways, respectively, and an adjusted Q-value <0.05 was chosen as the significance cutoff.

<sup>1</sup>http://hannonlab.cshl.edu/fastx\_toolkit/
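The DEG criterion above (at least a 2-fold change and FDR < 0.01) can be illustrated with a minimal Python sketch. It is not the DESeq workflow used in the study: the function name, the gene names, the input values, and the sign convention (log<sub>2</sub> of female over male expression) are all hypothetical and chosen only for illustration.

```python
def classify_gene(log2_fc, fdr, min_abs_log2_fc=1.0, max_fdr=0.01):
    """Classify one gene with the thresholds quoted in the text.

    Assumption: log2_fc is log2(female / male), so positive values mean
    higher expression in females. The study's own sign convention may differ.
    """
    if fdr < max_fdr and abs(log2_fc) >= min_abs_log2_fc:
        return "female-biased" if log2_fc > 0 else "male-biased"
    return "unbiased"


# Hypothetical (log2 fold change, FDR) pairs, not values from the study.
example_genes = {
    "gene_a": (2.4, 0.001),    # passes both thresholds, higher in females
    "gene_b": (-1.8, 0.0005),  # passes both thresholds, higher in males
    "gene_c": (0.8, 0.0001),   # significant, but less than 2-fold
    "gene_d": (3.0, 0.2),      # large change, but FDR too high
}

calls = {name: classify_gene(fc, fdr) for name, (fc, fdr) in example_genes.items()}
for name, call in calls.items():
    print(name, call)
```

Both conditions must hold at once, which is why a gene with a large fold change but a weak FDR (gene_d) is still treated as unbiased.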

#### Validation by mRNA Expression and Behavior

Based on the transcriptomic data, the transient receptor potential gene (trp; Leung et al., 2000), which is involved in the phototransduction pathway enriched only in males (ko: 04745; **Supplementary Figure S1-d**), was down-regulated in females, which may lead to a reduction in their light response (Leung et al., 2000; Popescu et al., 2006). Therefore, we checked this result at the mRNA expression and behavioral levels.

#### Quantitative Real-Time PCR (qRT-PCR) Analysis

Total RNA was extracted from the whole bodies of five female or five male adults reared on the pupae of H. cunea using TRIzol (Invitrogen; United States) according to the manufacturer's protocols, then resuspended in nuclease-free water. The RNA concentration was then measured using a NanoDrop (Thermo Fisher Scientific NanoDrop 2000; United States). Each group had four replicates. Approximately 0.5 mg of total RNA was used as template to synthesize the first-strand cDNA using a PrimeScript RT Reagent Kit (TaKaRa; Japan) following the manufacturer's protocols. The resultant cDNA was diluted to 0.1 mg/ml for further qRT-PCR analysis (ABI StepOne Plus; United States) using SYBR Green Real-Time PCR Master Mix (TaKaRa; Japan). Primers (**Supplementary Table S1**) for the trp gene were designed using Primer Express 2.0 software. The cycling parameters were 95 °C for 30 s followed by 40 cycles of 95 °C for 5 s and 62 °C for 34 s, ending with a melting curve analysis (65 to 95 °C in increments of 0.5 °C every 5 s) to check for nonspecific product amplification. Relative gene expression was calculated by the 2<sup>−ΔΔCt</sup> method using the housekeeping gene GAPDH as a reference to eliminate sample-to-sample variations in the initial cDNA samples.
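The relative-expression calculation named above can be sketched in a few lines of Python. This is only an illustration of the standard 2<sup>−ΔΔCt</sup> arithmetic, assuming roughly 100% amplification efficiency for both the target and the reference gene; the function name and the Ct values are hypothetical and are not taken from the study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_calibrator, ct_ref_calibrator):
    """Fold change of a target gene in a sample relative to a calibrator.

    Each delta-Ct normalises the target gene to the reference gene
    (GAPDH in the text); delta-delta-Ct then compares sample and calibrator.
    """
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: males as the sample, females as the calibrator.
fold_change = relative_expression(
    ct_target_sample=24.0, ct_ref_sample=18.0,          # delta-Ct = 6
    ct_target_calibrator=25.0, ct_ref_calibrator=18.0,  # delta-Ct = 7
)
print(fold_change)  # delta-delta-Ct = -1, so the fold change is 2.0
```

A lower Ct means more template, so a negative ΔΔCt corresponds to higher expression in the sample than in the calibrator.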

#### Phototaxis Assays

A glass Y-maze (main arm: 12 cm; two side arms: 5 cm; inner diameter: 1.5 cm; angle between two side arms: 75°) was used for phototaxis assays in a completely dark room (<10 lux, measured by illuminometer, LX-9621, China) at a temperature of 22–26 °C. One 1-day-old B. lasus adult (female or male) began the trial in a tube at the base of the apparatus and faced a choice between two tubes, one of which was dark and the other of which was lighted with a 40-watt bulb (approximately 600 lux). After 1 min, the choice was recorded. The sample sizes of the male and female groups were 18 and 24, respectively. After each test, the Y-maze was washed and dried, and the two side arms were changed for the new test.

#### Statistical Analysis

Prior to analysis, the raw data were tested for normality and homogeneity of variances with the Kolmogorov-Smirnov test and Levene's test, respectively, and the data were log-transformed if necessary. The qRT-PCR data comparing gene expression in females and males were analyzed with the independent t-test. In phototaxis assays, the preferences for light and dark were analyzed using sign tests, and the differences in female and male phototaxis were analyzed by the chi-square test. All analyses were performed using SPSS v.20 (IBM SPSS, Armonk, NY, United States).
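As an illustration of the chi-square comparison of female and male phototaxis, the following Python sketch computes Pearson's statistic for a 2 × 2 table without a continuity correction. The analyses in the study were run in SPSS; the function name and the counts below are hypothetical and are not the observed data.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2 x 2 table.

    Returns (statistic, p_value). No Yates continuity correction is applied.
    For one degree of freedom the upper tail of the chi-square distribution
    equals erfc(sqrt(x / 2)), so only the standard library is needed.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: rows are sexes, columns are (chose light, chose dark).
counts = [
    [10, 10],  # females
    [16, 4],   # males
]
statistic, p_value = chi_square_2x2(counts)
print(round(statistic, 3), round(p_value, 3))
```

With small expected counts, an exact test or a continuity correction may be more appropriate than the uncorrected statistic sketched here.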

## RESULTS AND DISCUSSION

Sexual dimorphism is the condition where the two sexes of the same species exhibit different characteristics (e.g., size, color, behavior) beyond the differences in their sexual organs (Bonduriansky, 2007). Most sexually dimorphic traits are assumed to arise from differences in the expression of genes present in both sexes (Connallon and Knowles, 2005; Rinn and Snyder, 2005). To reveal B. lasus sex differences at the transcriptional level, we carried out an Illumina-based transcriptomic analysis.

### Transcriptome Sequencing, Read Assembly and Annotation

All high-quality reads (101,945,678) from the six samples were pooled and assembled by using Trinity with the default parameters, and a total of 254,656 transcripts with lengths longer than 200 bp were generated. The N50 size was 2706 bp with 57,605 sequences longer than 1 kb. We chose the longest isoform of each gene to construct the unigene set. After isoforms were considered, these assembled transcripts were predicted to be produced from a total of 164,709 unigenes. The N50 size of the unigenes was approximately 814 bp, and their mean length was 572.08 bp (**Supplementary Table S2**). For annotation, the pooled assembled unigenes were searched using BLASTX against five public databases with an E-value cutoff of 10<sup>−5</sup>. A total of 37,453 unigenes were successfully annotated, as shown in **Table 1**, including 17,248 genes in GO, 13,491 genes in COG, 35,427 genes in nr, 18,195 genes in Swiss-Prot, and 15,133 genes in KEGG.
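For readers unfamiliar with the N50 statistic reported above, a minimal Python sketch of its usual definition follows: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases. The function name and the toy lengths are illustrative only; assembly tools may handle ties and rounding slightly differently.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("lengths must contain at least one sequence")


# Toy example: total = 54 bp, half = 27 bp; 10 + 9 + 8 = 27, so N50 = 8.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Because the statistic is weighted by length, a few long transcripts can give a high N50 even when the mean length is much lower, which is consistent with an N50 exceeding the mean as reported for the unigenes.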

In the GO analysis, 17,248 unigenes were successfully annotated and classified into three major GO categories: molecular function (MF), cellular component (CC), and biological process (BP), then assigned to 56 subcategories based on GO level 2. The dominant subcategories for the classified genes were catalytic activity and binding for the MF category; cell and cell part for the CC category; and metabolic process, cellular process, and single-organism process for the BP category (**Supplementary Table S3**). A total of 15,133 KEGG-annotated unigenes were classified into 190 pathways (>10 associated unigenes). Among these pathways, the ten most highly represented were ribosome, carbon metabolism, protein processing in endoplasmic reticulum, oxidative phosphorylation, biosynthesis of amino acids, spliceosome, RNA transport, purine metabolism, peroxisome, and ubiquitin mediated proteolysis (**Supplementary Table S4**).

TABLE 1 | Annotation of a pooled assembly including both male and female B. lasus transcriptomes.

#### Sex-Biased Genes


Although in most species the male and female genomes differ by a few genes located on sex-specific chromosomes (such as the Y chromosome of mammals), the vast majority of sexually dimorphic traits result from the differential expression of genes that are present in both sexes (Connallon and Knowles, 2005; Rinn and Snyder, 2005; Ellegren and Parsch, 2007), and this is especially true in hymenopteran insects. Because sex determination in hymenopteran species is haplodiploid, females and males are nearly identical genetically (Ellegren and Parsch, 2007). Such DEGs include those that are expressed exclusively in one sex (sex-specific expression) and those that are expressed in both sexes but at a higher level in one sex (sex-biased expression). These sex-biased genes can be further separated into male-biased and female-biased genes, depending on which sex shows higher expression. Genes with equal expression in the two sexes are referred to as unbiased (Ellegren and Parsch, 2007).

Using our assembled transcriptome as a reference, we identified putative genes expressed in males and females using the RPKM method, and genes with at least 2-fold changes and FDR < 0.01 were defined as DEGs. By comparing female and male transcriptomes, 1416 DEGs were found in B. lasus, of which 442 genes were annotated in GO, 420 in COG, 1024 in nr, 613 in Swiss-Prot, and 396 in KEGG (**Table 1**). Among these DEGs, 986 were up-regulated in females and 430 were up-regulated in males (**Supplementary Table S5**).
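The RPKM normalization mentioned above can be written out as a short Python sketch. It shows the standard formula only (read count scaled per kilobase of transcript and per million mapped reads) and does not reproduce the RSEM pipeline; the function name and the numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads),
    which is the read count divided by (length / 1e3) and by (total / 1e6).
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp unigene in a library of
# 10 million mapped reads gives 500 / 2 kb / 10 M = 25 RPKM.
example_rpkm = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(example_rpkm)
```

Dividing by length keeps long unigenes from appearing more highly expressed simply because they collect more reads, and dividing by library size makes the six libraries comparable.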

#### GO Enrichment Analyses

In the GO enrichment analyses, 12 and five subcategories were enriched in females and males, respectively. In females, the enriched subcategories were microtubule cytoskeleton, cytoskeletal part, MCM complex, nucleus, protein complex, kinesin complex, and nucleosome for the CC category; DNA replication initiation, cell division and protein phosphorylation for the BP category; and alpha-1,4-glucosidase activity and zinc ion binding for the MF category (**Figure 1A**). Consistent with results in flies, mosquitoes, and Daphnia (Ranz et al., 2003; Hahn and Lanzaro, 2005; Eads et al., 2007), as well as in the hymenopteran Nasonia (Wang et al., 2015), most of these categories were related to DNA replication, and the underlying genes are probably expressed to support egg production in females (Spradling, 1993; Parisi et al., 2004). The over-representation of transcripts from genes required for DNA replication may reflect nurse cell polyploidization or the rapid division of embryonic cells, which rely on maternally deposited gene products (Spradling, 1993; Parisi et al., 2004).

In males, the enriched subcategories were integral component of membrane, cell junction, and postsynaptic membrane for the CC category; ion transport for the BP category; and potassium channel activity for the MF category (**Figure 1B**). This pattern is consistent with a study in D. melanogaster (Parisi et al., 2004) and may be mainly related to spermatogenesis (Fuller, 1993). For example, the enrichment of membrane-associated subcategories was likely due to the requirements of the sperm axoneme structure (Parisi et al., 2004). However, in the parasitoid N. vitripennis, the subcategories most highly enriched in males are related to sex-pheromone biosynthetic enzymes (Wang et al., 2015). This difference may reflect a difference in the timing of sexual maturity. Sexual maturity in many gregarious and quasi-gregarious males (e.g., N. vitripennis) is reached before eclosion, and these males can mate with females immediately after eclosion and near the emergence site (Boulton et al., 2015), whereas solitary B. lasus acquire mating ability some days after eclosion (Yan et al., 1989).

#### KEGG Pathway Enrichment Analyses

Consistent with the results of GO enrichment in females, pathway enrichment tests revealed that DNA replication (ko: 03030; **Supplementary Figure S1-a**) was enriched in B. lasus females. The functional categories enriched in females also included fatty acid biosynthesis (ko: 00061; **Supplementary Figure S1-b**) and fatty acid metabolism (ko: 01212; **Supplementary Figure S1-c**). The fatty acid synthase gene (FASN), which encodes the enzyme catalyzing fatty acid synthesis (Jayakumar et al., 1994, 1995; Persson et al., 2008), is probably crucial for egg yolk production and thus female fecundity. In some insects (for example, the yellow fever mosquito Aedes aegypti and the brown planthopper Nilaparvata lugens; Alabaster et al., 2011; Li et al., 2016), the number of oviposited eggs decreases significantly when FAS expression decreases in females.

Only the phototransduction-fly pathway (ko: 04745; **Supplementary Figure S1-d**), which is associated with the perception of light signals (Leung et al., 2000), was enriched in males. Its potential functions are discussed below.

#### Annotated Genes Involved in Venom Proteins

In terms of biological control, parasitoid species have been extensively applied for reducing pest species population sizes (Hassan, 1993; Li, 1994; Terayama, 1999; Zhishan et al., 2003; Parra and Zucchi, 2004; Lim et al., 2006) because parasitoids can propagate on or in other arthropods. The venom of parasitoid wasps, which is injected into a host by females before or at oviposition, is important for the successful development of the progeny. Parasitoid venoms have diverse physiological effects on hosts, including developmental arrest; alteration in growth and physiology; suppression of immune responses; induction of paralysis, oncosis, or apoptosis; and alteration of host behavior (Edwards et al., 2006; Price et al., 2009; Tian et al., 2010; Kryukova et al., 2011). In total, three female-biased genes (c100635.graph\_c0, c101314.graph\_c0, c101670.graph\_c0; **Supplementary Table S5**) were annotated as venom proteins in this study; they were related to known venom proteins of N. vitripennis and belonged to previously described insect venom families, such as serine proteases (Graaf et al., 2010; Werren et al., 2010). Despite the large diversity of parasitoid wasp species, only a small number of venom proteins have been described from wasps. A wealth of unexplored biomolecules is present in parasitoid venoms; these proteins are of value in basic evolutionary studies, venom biology, host-parasite interactions, and the study of the evolution of life strategies, and they may contain components that could be used in biological control and pharmacology (Moreau and Asgari, 2015).

FIGURE 1 | GO enrichment analysis of (A) female- and (B) male-biased genes. GOSeq explicitly takes into account gene selection bias due to differences in gene length and thus the numbers of overlapping sequencing reads. GOSeq was used for the GO enrichment analysis, and an adjusted Q-value <0.05 was chosen as the significance cutoff.

TABLE 2 | TRP channel genes in the B. lasus transcriptome.

#### Annotation of Genes in the TRP Channel Family and Function Validation

Transient receptor potential (TRP) channels are cation channels that are mainly regarded as unique polymodal cell sensors; TRPs can be subdivided into six main subfamilies: the TRPC (canonical), TRPV (vanilloid), TRPM (melastatin), TRPP (polycystin), TRPML (mucolipin), and TRPA (ankyrin) groups (Gees et al., 2010). Functionally, TRP channels cause cell depolarization when activated, which may trigger many voltage-dependent ion channels. Upon stimulation, Ca<sup>2+</sup>-permeable TRP channels generate changes in the intracellular Ca<sup>2+</sup> concentration, [Ca<sup>2+</sup>]<sub>i</sub>, due to Ca<sup>2+</sup> entry via the plasma membrane. However, evidence is increasing that TRP channels are also located in intracellular organelles and serve as intracellular Ca<sup>2+</sup> release channels (Berridge et al., 2000; Bootman et al., 2001; Gees et al., 2010). TRP channels in Drosophila are involved in the perception of sensory signals such as light, temperature, humidity, pheromones, sound, and touch (Lin et al., 2005). In our study, we found 13 TRP channel genes in B. lasus; Nasonia and honey bee contain 12 and 11 genes, respectively, indicating that the number of TRP channels seems to be well conserved in Hymenoptera (Werren et al., 2010). Of the TRP channel genes in B. lasus, most belong to two subfamilies: TRPC and TRPA (**Table 2**).

In Drosophila, TRPC plays an important role in the perception of light signals, i.e., the phototransduction pathway (Leung et al., 2000) (ko: 04745; **Supplementary Figure S1-d**), which was enriched in B. lasus male adults. In Drosophila, a number of genes in the visual signal transduction pathway have been characterized, with functions including rhodopsin activation, phosphoinositide signaling, and the opening of TRP and TRPL channels (Wolff and Ready, 1993; Zuker, 1996; Leung et al., 2000; Wang and Montell, 2007). Our transcriptional analyses (**Figure 2A**: FDR < 0.01, log<sub>2</sub> FC = 1.62) and qRT-PCR results (**Figure 2B**: t = −3.169, df = 6, p = 0.019) showed that the gene corresponding to trp (c103240.graph\_c0) was more highly expressed in B. lasus males, consistent with the phototaxis test. Although both females and males tended to move toward light (**Figure 2C**: female, Z = −1.34, p < 0.05; male, Z = −1.6, p < 0.05), the preference for light differed significantly between the sexes (**Figure 2C**: χ<sup>2</sup> = 4.17, df = 1, p < 0.05), with males showing the stronger preference. This result is similar to findings for trp mutants in Drosophila, which had altered phenotypes, including a reduction in light response (Leung et al., 2000; Popescu et al., 2006). The weaker light response of females might be due to the long periods they spend in the dark searching for hosts and ovipositing in them, as most host species (e.g., pupae of L. dispar or H. cunea) hide in dark environments, such as the litter horizon (Yan et al., 1989; Yang et al., 2001).

Surprisingly, five members of the TRPA subfamily, which is involved in sensing environmental temperature, were annotated in our study. Animals must maintain thermal homeostasis and avoid prolonged contact with harmfully hot or cold objects (Caterina, 2007; Karashima et al., 2009). Unlike most parasitoid species, which overwinter in their hosts as eggs or larvae, B. lasus lives through the winter in its adult stage (Yan et al., 1989). Thus, TRPA may be essential for B. lasus adults, allowing them to sense harmful cold during winter. In addition, intraspecific aggregations in B. lasus have been observed in previous research, and an active component that elicited the aggregation response was isolated and identified as 3-hexanone (Mohamed and Coppel, 1987). The functions of aggregation behavior include mating, host attack, defense, and thermoregulation, and in this species, a previous study suggested that aggregation increases reproductive success by increasing the probability of mate location, as well as offering the possibility of mate choice (Mohamed and Coppel, 1987). Taken together with the above results, however, adults may also aggregate at a site for purposes of thermoregulation, especially in winter, in response to cold. Further studies are required to elucidate the nature of this cue.

FIGURE 2 | ... the reads per kb per million reads (RPKM) method. Quantitative real-time PCR (qRT-PCR) analysis was used to calculate the relative gene expression to further check the transcriptomic data, in which the differences between females and males were analyzed by the independent t-test. There was a highly significant correlation coefficient of 0.885 between transcriptomic data and qRT-PCR data. Behavioral responses of Brachymeria lasus adults to dark or light were tested with phototaxis assays. The differences in female (n = 24) and male (n = 18) phototaxis were analyzed by the chi-square test. <sup>∗</sup> indicates p < 0.05. The error bars indicate standard errors.

## CONCLUSION

Brachymeria lasus is a solitary parasitoid species and has been evaluated as a potential candidate for release to control L. dispar. Whereas previous studies have focussed on the application of parasitoids and their sex differences in phenotypes, this study focussed mainly on sex differences in gene expression. We studied B. lasus as a representative solitary species, which enriches our understanding of sexual transcription differences in parasitoid wasps, especially solitary species. Here, we performed transcriptome assembly using the Trinity program, which provided a large amount of useful information for molecular studies of B. lasus, including information on venom proteins and the perception of sensory signals. In addition to sex-biased genes, epigenetic processes, such as DNA methylation, are known to play important roles in differentiating phenotypes and have been widely studied in hymenopteran insects, for example, in the female morphs (queens and workers) of the honeybee, Apis mellifera (Kucharski et al., 2008; Lyko et al., 2010), although these processes do not appear to play such a role in Nasonia (Wang et al., 2015). Future research will aim to better understand the molecular mechanisms underlying sex differences in the biological traits of B. lasus and to better apply this parasitoid to the biological control of pests.

#### DATA AVAILABILITY

Publicly available datasets were analyzed in this study. These data can be found here: https://dataview.ncbi.nlm.nih.gov/object/PRJNA513855.

### ETHICS STATEMENT

There was no requirement to seek ethical approval to carry out the work described above. However, the use of insects in the above experiments was kept to a minimum.

#### AUTHOR CONTRIBUTIONS

P-CL conceived and designed the experiments. P-CL and ST performed the experiments. P-CL and D-JH wrote the manuscript. All the authors reviewed the manuscript.

### FUNDING

This work was funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD). It was also supported by the Doctorate Fellowship Foundation of Nanjing Forestry University and the Natural Science Foundation of Jiangsu Province (BK20131421).

#### ACKNOWLEDGMENTS


We gratefully acknowledge undergraduates Ju Luo, Min Li, and Chenxi Zhao of the Nanjing Forestry University for their assistance.

#### REFERENCES


#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00172/full#supplementary-material


channels in Drosophila. J. Neurosci. 20, 6797–6803. doi: 10.1523/JNEUROSCI.20-18-06797.2000


receptor potential by eye protein kinase C in Drosophila. J. Neurosci. 26, 8570–8577. doi: 10.1523/JNEUROSCI.1478-06.2006


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Liu, Tian and Hao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Utility of cfDNA Fragmentation Patterns in Designing the Liquid Biopsy Profiling Panels to Improve Their Sensitivity

Maxim Ivanov<sup>1</sup>\*, Polina Chernenko<sup>2</sup>, Valery Breder<sup>2</sup>, Konstantin Laktionov<sup>2</sup>, Ekaterina Rozhavskaya<sup>3,4</sup>, Sergey Musienko<sup>3</sup>, Ancha Baranova<sup>1,3,5,6</sup> and Vladislav Mileyko<sup>3</sup>

<sup>1</sup> Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, Russia, <sup>2</sup> N.N. Blokhin Russian Cancer Research Center, Moscow, Russia, <sup>3</sup> Atlas Oncology Diagnostics, Ltd., Moscow, Russia, <sup>4</sup> Vavilov Institute of General Genetics, Moscow, Russia, <sup>5</sup> Research Centre for Medical Genetics, Moscow, Russia, <sup>6</sup> School of Systems Biology, George Mason University, Fairfax, VA, United States

#### Edited by:

Richard D. Emes, University of Nottingham, United Kingdom

#### Reviewed by:

Vladimir B. Teif, University of Essex, United Kingdom Kuo-Ping Chiu, Academia Sinica, Taiwan Tatiana V. Tatarinova, University of La Verne, United States

\*Correspondence:

Maxim Ivanov maksim.v.ivanov@phystech.edu

#### Specialty section:

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics

Received: 19 October 2018 Accepted: 25 February 2019 Published: 12 March 2019

#### Citation:

Ivanov M, Chernenko P, Breder V, Laktionov K, Rozhavskaya E, Musienko S, Baranova A and Mileyko V (2019) Utility of cfDNA Fragmentation Patterns in Designing the Liquid Biopsy Profiling Panels to Improve Their Sensitivity. Front. Genet. 10:194. doi: 10.3389/fgene.2019.00194

Genotyping of cell-free DNA (cfDNA) in plasma samples has the potential to allow for a noninvasive assessment of tumor biology, avoiding the inherent shortcomings of tissue biopsy. Next generation sequencing (NGS), a leading technology for liquid biopsy analysis, continues to be hindered by several major issues with cfDNA samples, including low cfDNA concentration and high fragmentation. In this study, by employing Ion Torrent PGM semiconductor technology, we performed a comparison between two multi-biomarker amplicon-based NGS panels characterized by a substantial difference in average amplicon length. In the course of the analysis of peripheral blood from 13 diagnostic non-small cell lung cancer patients, the equivalence of the two panels in terms of overall diagnostic sensitivity and specificity was shown. A pairwise comparison of the allele frequencies for the same somatic variants obtained from the pairs of panel-specific amplicons demonstrated an identical analytical sensitivity for amplicons in the range of 140 to 170 bp in size. Further regression analysis between amplicon length and coverage illustrated that NGS sequencing of plasma cfDNA equally tolerates amplicons with lengths in the range of 120 to 170 bp. To increase the sensitivity of mutation detection in cfDNA, we performed a computational analysis of the features associated with genome-wide nucleosome maps, evident from the data on the prevalence of cfDNA fragments of certain sizes and their fragmentation patterns. By leveraging a support vector machine-based machine learning approach, we showed that combining nucleosome map-associated features with GC content increases the accuracy of predicting high inter-sample sequencing coverage variation (area under the receiver operating characteristic curve: 0.75, 95% CI: 0.750–0.752 vs. 0.65, 95% CI: 0.63–0.67). Thus, nucleosome-guided fragmentation should be utilized as a guide to design amplicon-based NGS panels for the genotyping of cfDNA samples.

Keywords: NGS, cfDNA, liquid biopsy, cancer, DNA fragmentation, nucleosome, amplicon, primer design

## INTRODUCTION

fgene-10-00194 March 8, 2019 Time: 17:25 # 2

In an approach known as "liquid biopsy," cell-free DNA (cfDNA) which circulates in the plasma may be used for a diagnostic detection of tumor-specific mutations (Dawson et al., 2013; Pupilli et al., 2013; Xi et al., 2016). In the frame of the Lab-Developed Tests (LDT) paradigm, analysis of cfDNA has already gained approval for a number of common indications, including the detection of the resistance mutation T790M in the EGFR encoding gene (Malapelle et al., 2016), which commonly emerges in lung adenocarcinomas treated with tyrosine kinase inhibitors.

At their inception, cfDNA-based LDTs commonly exploited one or another conventional DNA analysis technique, including real-time PCR, droplet digital PCR, and BEAMing (beads, emulsions, amplification, and magnetics) digital PCR (Dawson et al., 2013; Oxnard et al., 2014; Siravegna et al., 2015; Thress et al., 2015; Sacher et al., 2016). Many studies showed that the concordance of liquid biopsy and tissue-based analysis is relatively high; nevertheless, these approaches are not free of limitations. Typically, PCR-based and hybridization-based cfDNA profiling techniques are developed to detect particular DNA variants, which most commonly underlie one or another previously described pathophysiological process. These and other variant-specific techniques are not suitable for the exploratory analysis of cfDNA, which is necessary for the acquisition of knowledge concerning non-conventional, emerging resistance pathways, for co-detection of the mismatch repair phenotype, and for off-label prescribing of anticancer medications commonly required for personalized treatment of metastatic tumors (Tafe et al., 2015; Wei et al., 2016; Zehir et al., 2017). These limitations are readily surmounted by the advent of sequencing-based technologies, including whole exome sequencing or, more applicable to cfDNA analysis, amplicon-based panels, which are limited to their target genes but are still exploration-permissive.

With a reported sensitivity of more than 80% and a specificity of 98 to 100% (Krishnamurthy et al., 2017), next generation sequencing (NGS) analysis of cfDNA has already joined the ranks of the commonly used LDTs. Nevertheless, further improvement of the sensitivity of liquid biopsy-based tests is warranted. The most common way to improve the sensitivity of mutation detection in liquid biopsy samples is to increase the coverage, which in turn leads to a substantial increase in the cost of an assay. Deep or ultradeep coverage is necessary to account for the low concentrations of total cfDNA in plasma samples, which are compounded by the dilution of tumor-specific cfDNA fragments with substantial amounts of non-tumoral cfDNA fragments (Hellwig et al., 2018).

Another physical characteristic of cfDNA, the distribution of the sizes of its fragments, is relevant to the detection of DNA variants both by sequencing and by PCR. Recent whole-genome sequencing (WGS) studies of cfDNA demonstrated that the distribution of the sizes of plasma-derived DNA fragments is far from the typical lognormal distribution that reflects the patterning of DNA in formalin-fixed paraffin-embedded samples or snap-frozen tissues. In fact, cfDNA exhibits a predominant peak at a fragment length of ∼167 bp accompanied by a second, significantly less pronounced extremum at around 350 bp (Ma et al., 2017). These observations mean that the majority of these fragments are suitable for assessment by techniques that rely on conventional lengths of PCR amplicons. It is of note that tumor-derived cfDNA fragments are even shorter than those that originate from healthy cells of the same origin (Jiang et al., 2015). In the domain of conventional systems for the detection of DNA variants, these characteristics of cfDNA have prompted the development of ultra-short amplicon PCR, which allows for a substantial increase in the analytical and, as a consequence, diagnostic sensitivity of these assays.

Moreover, recent studies have shown that the fragmentation pattern of cfDNA is not random. As cfDNA degradation is guided by nucleosome patterns defined by epigenetic regulation within particular loci (Ivanov et al., 2015), recurrent underrepresentation of some regions in cfDNA introduces systematic bias in the PCR-based enrichment of target amplicons and undermines the sensitivity at a local scale.

In this study, we investigated the effect of the amplicon length on the diagnostic and analytical sensitivity of mutation detection, using two amplicon-based NGS panels with diverse amplicon lengths. We also describe ways to utilize the knowledge of cfDNA fragmentation patterns to increase the sensitivity of mutation detection in a liquid biopsy setting.

## MATERIALS AND METHODS

#### Sample Collection

The sequencing was performed on cfDNA fragments extracted from previously collected plasma samples of 13 non-small cell lung cancer (NSCLC) patients treated at the Blokhin Russian Cancer Research Centre in 2014 to 2015. For each patient, tumor tissue-based EGFR mutation status was assessed using the therascreen EGFR RGQ PCR Kit (Qiagen, Milan, Italy) according to the manufacturer's protocol.

For nucleosome-guided cfDNA fragmentation pattern analysis we used publicly available, anonymized WGS data of cfDNA, described by Snyder et al. (2016) and included in dataset [PRJNA291063].

The present study was approved by the Atlas Biomed Internal Review Board. All subjects gave written informed consent in accordance with the Declaration of Helsinki.

#### DNA Extraction and Sample Quality Control

For each NSCLC patient, a peripheral blood sample was collected into an EDTA-containing vacutainer tube (BD). Samples were fractionated into plasma and blood cells by centrifugation at 400 g for 15 min within 4 h after venipuncture, followed by a secondary spin at 1200 g for 20 min. Resultant plasma samples were frozen in aliquots and stored at −80 °C until DNA isolation. Circulating DNA was extracted from 4 ml of plasma using the Blood Plasma DNA Isolation Kit (BioSilica Ltd., Russia) according to the manufacturer's instructions, eluted with 120 µl of nuclease-free water, mixed with 3 µl of glycogen (20 mg/ml, Fermentas, Lithuania) and 1/10 volume of 50 mM triethylamine, and then precipitated with 5 volumes of acetone (Bryzgunova et al., 2011). After reconstitution in 30–50 µl of water, cfDNA concentrations were measured using the Qubit fluorometer.

#### Library Preparation and Quality Control

Sequencing libraries were prepared according to the manufacturer's protocol for the Ion AmpliSeq Cancer Hotspot Panel (ITCHP2), designed to amplify 207 target regions across 50 cancer-related genes. Additionally, a custom panel, the Atlas Clinical Panel (AODCP, 55 target regions), was designed to cover the following genes: EGFR, IDH2, NRAS, KIT, BRAF, TP53, PDGFRA, PTEN, IDH1, KRAS, PIK3CA, ERBB2, and CTNNB1. The custom panel was designed using the Ion AmpliSeq Designer server (pipeline version 5.2). The two panels had several loci in common, allowing for their comparison.

#### Sequencing and Data Analysis

Pooled libraries were sequenced employing Ion Torrent PGM, according to the manufacturer's protocol. As low frequency mutant alleles were expected, initial analysis was performed using Ion Torrent Suite software (version 5.2.0) on low stringency settings. In order to exclude false negative single nucleotide variant (SNV) calls, a concomitant Bowtie2-Strelka pipeline analysis was carried out. After aligning all reads to the genome (GRCh37) (Bowtie2 parameters: --rdg 5,2 --rfg 5,2 -N 1 -L 17), off-target reads were removed, while the remaining reads were realigned on target sequences. Primer sequences were excluded from reads employing in-house software (Ivanov et al., 2018). Somatic variant calling was performed employing Strelka (maxInputDepth set to −1; indelMaxRefRepeat set to 6; indelMaxWindowFilteredBasecallFrac set to 0.4; indelMaxIntHpolLength set to 6; lower quality bound for SNVs and indels set to 9 and 2, respectively). Variants supported by fewer than 20 reads in total were discarded. If fewer than four reads supported the alternative allele, the variant was omitted. Mutation hotspots were defined as nucleotide variations identified in ten or more COSMIC (Forbes et al., 2010) samples. Detected variants located within mutation hotspots were considered confidently somatic. Variants outside mutation hotspots with a minor allele frequency in the general population, as defined by the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015), of 5% or more were considered confidently germline. Further analysis was limited to confidently somatic and confidently germline variants. Preprocessed fastq files were additionally screened for mutation hotspots by inputting wild type and expected mutant reads into the Poisson distribution statistical model with complexity-dependent variable expectation probability of SNVs and indels.
Somatic variant calls were verified manually in the Tablet (version 1.16.09.06) read alignment visualization tool (Milne et al., 2010). Variant allele frequencies were quantified within raw read sets as the ratio of reads confirming the mutation to the total count of qualified reads covering the mutation site. Normalization of mutation allele frequencies to amplicon coverage was performed by bootstrapping. The genome variation analysis was limited to the nucleotide changes affecting the protein sequence, unless otherwise specified. Publicly available software and database versions used were Bowtie2 v. 2.1.0 (Langmead and Salzberg, 2012), Strelka v. 1.0.14 (Saunders et al., 2012), and SAMtools v. 0.1.19 (Li, 2011). COSMIC and dbSNP databases were accessed in December 2017.
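The read-support and annotation thresholds described above can be expressed as a small decision function. The following is a minimal sketch rather than the in-house pipeline: the `Variant` record and its field names are hypothetical, and "reads in total" is assumed here to mean the total read depth at the variant site.

```python
from dataclasses import dataclass
from typing import Optional

MIN_TOTAL_READS = 20      # variants supported by fewer reads in total are discarded
MIN_ALT_READS = 4         # variants with fewer alternative-allele reads are omitted
HOTSPOT_MIN_COSMIC = 10   # hotspot: variation seen in ten or more COSMIC samples
GERMLINE_MIN_MAF = 0.05   # 1000 Genomes minor allele frequency of 5% or more


@dataclass
class Variant:
    # Hypothetical record; the field names are illustrative only.
    total_reads: int                  # assumed: read depth at the variant site
    alt_reads: int                    # reads supporting the alternative allele
    cosmic_samples: int               # COSMIC samples carrying this variation
    population_maf: Optional[float]   # 1000 Genomes MAF, None if not reported


def classify(v: Variant) -> Optional[str]:
    """Return 'somatic', 'germline', or None when the variant is dropped."""
    if v.total_reads < MIN_TOTAL_READS or v.alt_reads < MIN_ALT_READS:
        return None
    if v.cosmic_samples >= HOTSPOT_MIN_COSMIC:
        return "somatic"
    if v.population_maf is not None and v.population_maf >= GERMLINE_MIN_MAF:
        return "germline"
    # Neither confidently somatic nor confidently germline: excluded from analysis.
    return None


def allele_frequency(v: Variant) -> float:
    """Ratio of mutation-confirming reads to all reads covering the site."""
    return v.alt_reads / v.total_reads
```

Checking the hotspot rule before the population-frequency rule mirrors the wording of the text, where the germline criterion applies only to variants outside mutation hotspots.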

GC content normalization for linear regression analysis was performed leveraging a simple adjustment according to the equation $\tilde{r}_i = r_i \cdot \frac{m}{m_{GC}}$, where $r_i$ stands for the read count of the $i$th amplicon, $m_{GC}$ is the median read count of all windows with the same GC content as the $i$th amplicon, and $m$ is the overall median of all the amplicons. Deviation of coverage from the mean was calculated for 5% GC content bands rather than for individual percentages of 0, 1, 2, 3, . . ., 100%. Linear regression analysis was performed employing simple least squares fitting.
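As an illustration of the median-ratio adjustment described above, the sketch below groups amplicons into 5% GC bands and rescales each read count by the ratio of the overall median to the band median. The function names and the handling of an all-zero band are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import median


def gc_band(gc_percent: float, band_width: int = 5) -> int:
    """Index of the GC band for a GC content given in percent (0 to 100)."""
    n_bands = 100 // band_width
    return min(int(gc_percent // band_width), n_bands - 1)


def gc_normalize(read_counts, gc_percents):
    """Rescale each read count r_i by m / m_GC for its GC band."""
    overall_median = median(read_counts)                     # m
    by_band = defaultdict(list)
    for r, gc in zip(read_counts, gc_percents):
        by_band[gc_band(gc)].append(r)
    band_median = {b: median(v) for b, v in by_band.items()}  # m_GC per band

    normalized = []
    for r, gc in zip(read_counts, gc_percents):
        m_gc = band_median[gc_band(gc)]
        # Assumption: a band whose median is zero is left at zero
        # rather than raising a division error.
        normalized.append(r * overall_median / m_gc if m_gc else 0.0)
    return normalized
```

After this adjustment the median of every populated GC band equals the overall median, which removes the band-level GC trend before the regression on amplicon length.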

Nucleosome-guided cfDNA fragmentation patterns were analyzed in publicly available sequences obtained from plasma samples pooled from an unknown number of healthy individuals (GSM1833219). The details of the DNA extraction, library preparation and sequencing are provided in Snyder et al. (2016). Briefly, cfDNA libraries underwent paired-end sequencing with Illumina sequencing-by-synthesis technology, generating reads of 101 bp in size. Importantly, at the library preparation stage, plasma DNA samples did not undergo fragmentation by sonication and, thus, the original cfDNA molecules were preserved, granting the opportunity to investigate their fragmentation patterns. The fastq read sequences were aligned to the human genome (aforementioned reference build) with BWA-mem v. 0.7.12 (Li and Durbin, 2009). The cfDNA fragment length may exceed the sequencing read length; however, paired-end sequencing makes it possible to capture both the start and end positions of the fragment. Paired reads, thus, continued to represent WGS fragments. Nucleosome position stringencies were calculated essentially as described by Valouev et al. (2011), using the NuMap software with standard parameters. NuMap performs nucleosome mapping based on a kernel-smoothed read count calculation (Valouev et al., 2011).

For ITCHP2 and AODCP panel amplicons, fragment counts were generated in silico after matching both primers with the fragment amplified and sequenced experimentally. To understand the patterns of amplicon coverage by experimentally observed fragments, the fragments were generated using paired reads, then further filtered by length to include only fragments in the range of 80 to 250 bp. Dinucleosome fragments were therefore excluded. To improve resolution, resulting fragments were trimmed by 40 bp around dyads to generate a set of equal-length fragments. For each sequenced nucleotide position, counts of overlapping fragments were recorded. Generated data were subjected to a lowpass filter with a square pulse kernel with a width of 21 base pairs, then the resulting coverage plots were mapped to the genomic positions of the amplicons.
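The coverage track described above can be sketched in a few lines. This is an illustrative reading of the procedure, not the code used in the study: the dyad is assumed to sit at the fragment midpoint, "trimmed by 40 bp around dyads" is interpreted as keeping 40 bp on each side of it, and coordinates are assumed to be half-open.

```python
def fragment_coverage(fragments, region_start, region_end,
                      min_len=80, max_len=250, half_width=40):
    """Per-base counts of dyad-centred, equal-length fragment cores.

    fragments: iterable of (start, end) half-open genomic coordinates.
    """
    size = region_end - region_start
    diff = [0] * (size + 1)            # difference array for fast interval counting
    for start, end in fragments:
        length = end - start
        if not (min_len <= length <= max_len):
            continue                   # drops dinucleosome-sized and very short fragments
        dyad = start + length // 2     # assumption: dyad at the fragment midpoint
        lo = max(dyad - half_width, region_start)
        hi = min(dyad + half_width + 1, region_end)
        if lo < hi:
            diff[lo - region_start] += 1
            diff[hi - region_start] -= 1
    coverage, running = [], 0
    for d in diff[:size]:
        running += d
        coverage.append(running)
    return coverage


def boxcar_smooth(signal, width=21):
    """Low-pass filter with a square pulse kernel; edges use the available window."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Trimming every fragment to the same core length means that peaks in the smoothed track reflect where dyads cluster rather than how long individual fragments happen to be.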

Statistical analysis was performed using R, version 3.2.3. For machine learning, we used the open source library Orange (Demsar et al., 2013). Five machine learning algorithms were evaluated to find the best model, demonstrating the highest prediction accuracy based on all descriptors [support vector machine (SVM), neural network, multiple linear regression, naïve Bayes, and random forest].

## RESULTS


#### Sample Sequencing and Mutation Analysis

In this study, fourteen cfDNA samples collected from patients with NSCLC were analyzed using the screening panels ITCHP2 and AODCP. The mean sequencing coverages across all experiments were 1150× for the AODCP panel and 802× for the ITCHP2 panel, with corresponding medians of 1002× and 674×, respectively.

Variant detection results were completely concordant for the two panels across 18 identified somatic mutations. Plasma variant detection results were concordant with baseline tissue analysis in 9 samples (69%). False negative samples were limited to cases characterized by a low plasma DNA concentration (**Figure 1**). In addition to mutations identified by tissue analysis at baseline, namely, those in EGFR and RAS, the sequencing of 13 plasma cfDNA samples revealed five additional somatic missense mutations, including those in the PIK3CA and TP53 genes (**Figure 1**).

#### Significance of Amplicon Length for Mutation Detection Sensitivity and Specificity

The average length of amplicons in panel AODCP was much shorter than that in panel ITCHP2 (**Figure 2A**), with median amplicon lengths, including primer sequences, of 137 and 156 bp, respectively. Despite the difference in amplicon sizes, variant calling results obtained for each panel were completely concordant, with a total of 51 either somatic or germline variants detected. Therefore, the diagnostic sensitivity and specificity of these two detection systems were the same at the study power.

In order to explore possible influences of the amplicon length on the limits of detection and, therefore, on the analytical sensitivity to the presence of mutations in liquid biopsy, we performed a pairwise comparison of the frequencies for the same mutated allele in reads obtained from pairs of panel-specific amplicons. For the synonymous germline variant, namely, EGFR p.Gln787= with a total of 15 alleles identified (1000 Genomes MAF 0.43), allele frequencies extracted from the analysis of AODCP and ITCHP2 amplicons were equivalent (Wilcoxon signed rank test p-value = 0.88). On the other hand, analysis of somatic mutations, which are typically present in a relatively small fraction of the reads, showed Pearson's correlation coefficients of 0.88 (p-value = 0.02; Wilcoxon signed rank test p-value = 0.44) for point mutations in genes EGFR, TP53, and PIK3CA, and 0.95 for the deletions of the EGFR exon 19 (p-value = 0.001; Wilcoxon signed rank test p-value = 0.53) (**Figure 3**). Since EGFR deletions further reduce the length of amplified fragments by 15 or more bp, their presence should, at least in theory, increase the analytical sensitivity of the detection system (**Figure 2B**). Notably, the geometric mean ratio of the allele frequency of the EGFR exon 19 deletions detected by the two panels was 1.16 (95% CI, 0.72–1.88; p-value > 0.1). This indicates that the analytical sensitivity of this assay is unlikely to change even if the difference in the average sizes of amplicons were to increase further.
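The two summary statistics used in this comparison, Pearson's correlation and the geometric mean ratio of paired allele frequencies, can be computed as follows. This is a minimal standard-library sketch; the function names are illustrative, and confidence intervals and the Wilcoxon signed rank test are left to a statistics package.

```python
import math


def geometric_mean_ratio(af_panel_a, af_panel_b):
    """Geometric mean of the per-variant ratios AF_a / AF_b (paired inputs)."""
    logs = [math.log(a / b) for a, b in zip(af_panel_a, af_panel_b)]
    return math.exp(sum(logs) / len(logs))


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Averaging on the log scale treats a twofold excess and a twofold deficit symmetrically, so a ratio near 1 indicates that neither panel systematically reports higher allele frequencies.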

Finally, we performed a regression analysis to estimate the relationship between amplicon length and its average coverage across samples for the ITCHP2 panel, which represents a wider spectrum of amplicon lengths. After normalization for GC content and overall sample read count, linear regression analysis employing the least squares fitting approach demonstrated a negative slope with a Student t-test p-value of 0.0063. However, regression analysis across the set of amplicons with a length of 170 bp or less yielded a non-significant slope coefficient (p-value 0.69) (**Figures 4A,B**). Regression analysis between amplicon length and its coverage covariance demonstrated no significant correlation in any amplicon length range (data not shown). Considering that amplicons with a length of 120 bp or less comprise only 5% of that set, this indicates that NGS of plasma cfDNA equally tolerates amplicons with a length in the 120–170 bp range.
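A simple least squares fit of coverage on amplicon length, with the t-statistic of the slope, might look like the sketch below. The function name is hypothetical, and converting the t-statistic into a p-value requires the Student t distribution with n − 2 degrees of freedom, which is not implemented here.

```python
import math


def ols_slope(lengths, coverage):
    """Least squares fit coverage = intercept + slope * length.

    Returns (slope, intercept, t_statistic of the slope).
    """
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(coverage) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, coverage))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(lengths, coverage))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    # A perfect fit has zero standard error; report an infinite t in that case.
    t_stat = slope / std_err if std_err > 0 else math.copysign(math.inf, slope)
    return slope, intercept, t_stat
```

Running the same function once on all amplicons and once on the subset of 170 bp or shorter reproduces the structure of the comparison reported above.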

#### Nucleosome-Guided Pattern May Facilitate Primer Panel Design

According to the most commonly cited hypothesis, plasma cfDNA originates from apoptotic cells where genomic DNA is digested by a set of nucleases (Ma et al., 2017). Wrapping around nucleosomes protects some of the DNA fragments from digestion; that is why cfDNA fragments correspond primarily to the mononucleosome-bound regions. Originally supported only by a unimodal distribution of cfDNA fragment sizes (Fan et al., 2008; Lo et al., 2010), this hypothesis has recently been validated in several studies (Chandrananda et al., 2015; Snyder et al., 2016; Ulz et al., 2016). In particular, employing whole exome sequencing of cfDNA fragments to infer the read depth coverage allowed the construction of plasma genome-wide nucleosome maps. Mapping the fragments covered by the ITCHP2 panel to these nucleosome maps showed that the positions of the ITCHP2 primers were selected in a non-optimal way with respect to nucleosome positioning (p-value for the intersection of nucleosome peaks and amplicons, 0.36). An amplicon covering KRAS exon 4 serves as a good illustration of a non-optimal selection of primers, which fall in between two peaks (**Figure 5A**). Because of that, the number of spanning cfDNA fragments is much lower than for primers selected to amplify a fragment located within a single peak. A similar situation may be observed for EGFR exon 21; shifting the positions of the primers by on the order of 100 nucleotides may result in an increase in the depth and the uniformity of the coverage, without compromising amplification of the clinically relevant, mutation-harboring locus.

At the next stage of the analysis, we asked whether the efficiency of targeted resequencing of cfDNA samples depends on the pattern of DNA fragmentation. To perform this analysis, for all amplicons represented in the ITCHP2 panel, the fragmentation patterns were extracted from the repository of reads obtained after shotgun sequencing of cfDNA fragments purified from the pool of plasma samples of healthy individuals and from five individual patients with solid tumors (**Figure 5B**).

It is known that both the nucleosome positioning (Struhl and Segal, 2013), which, in turn, guides the fragmentation of cfDNA (Ma et al., 2017), and the depth and the uniformity of the coverage by sequencing reads (Benjamini and Speed, 2012), are influenced by the GC content. In the following analysis, we aimed at finding out whether any characteristic related to the fragmentation pattern of cfDNA within the locus of interest may influence the depth and the uniformity of coverage with amplification-based sequencing reads.

FIGURE 1 | Samples used for data analysis as well as mutations identified during NGS sequencing and allele frequencies thereof (plasma EGFR status). Mutations identified employing a conventional sequencing method are indicated in the tumor alteration column, while their match (green) or mismatch (red) with NGS results is specified in the plasma EGFR status column.

used. Fills in the bottom panel demonstrate the spectrum extension for two panels, respectively, in case of the 15 bp exon 19 deletion mutation.

For the ITCHP2 panel, each amplicon was matched to an individual nucleosome map and evaluated according to four features: (i) absolute count of experimentally observed continuous cfDNA fragments spanning the whole amplicon (Feature A), (ii) read signal amplitude within the amplicon (Feature B), (iii) read signal change at the boundaries of amplicon (Feature C), and (iv) read signal shape defined as the area between its linear approximation and itself (Feature D) (**Figure 5C**). Uniformity of the coverage was defined as a coefficient of inter-individual variation in read coverage between all cfDNA samples. To calculate the robustness of the nucleosome mapping, we assessed the inter-sample variance of the defined features calculated for each amplicon. Averaged coefficients of the variation of features D, B and C were at 390, 68, and 38%, respectively, pointing at significant inter-sample variation.
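One possible reading of the four features and of the coefficient of variation is sketched below. Several details are assumptions, since the text does not fix them: Feature C is taken as the difference between the signal at the two amplicon boundaries, the "linear approximation" in Feature D is taken as the straight line joining the boundary values, and the sample standard deviation is used for the coefficient of variation.

```python
from statistics import mean, stdev


def spanning_fragments(fragments, amp_start, amp_end):
    """Feature A: count of fragments (start, end) covering the whole amplicon."""
    return sum(1 for s, e in fragments if s <= amp_start and e >= amp_end)


def signal_features(signal):
    """Features B, C and D from the read signal within one amplicon.

    signal: per-base read signal across the amplicon (at least two positions).
    """
    n = len(signal)
    amplitude = max(signal) - min(signal)        # Feature B
    boundary_change = signal[-1] - signal[0]     # Feature C (assumed definition)
    shape = 0.0                                  # Feature D (assumed definition)
    for i, value in enumerate(signal):
        line = signal[0] + boundary_change * i / (n - 1)
        shape += abs(value - line)               # area between signal and line
    return amplitude, boundary_change, shape


def coefficient_of_variation(values):
    """Inter-sample coefficient of variation, in percent."""
    return stdev(values) / mean(values) * 100
```

With these definitions a flat signal has zero amplitude and zero shape, whereas an amplicon centred on a nucleosome peak has a large shape value and a boundary change close to zero.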

Further, we estimated the feature quality, employing the RReliefF method (Robnik-Sikonja and Kononenko, 2003) estimating how well their values distinguish between target variables that are near to each other. Despite previously demonstrated low robustness of the nucleosome associated features, the count of spanning fragments (Feature A) was ranked even higher than the GC content, while the other three features, B, C, and D, closely followed feature A and the GC content (**Figure 6A**). This finding indicates that uniformity of the locus coverage, with amplified sequencing reads, may depend on the underlying pattern of cfDNA fragmentation.

Univariate polynomial regression of the sequencing coverage depth and its coefficient of variation on the GC content with a second degree polynomial yielded coefficients of determination of 0.29 and 0.19, respectively. Furthermore, GC content equal-frequency discretization (four groups) and analysis of variance of both dependent variables between groups yielded a p-value of less than 1e-6. Thus, a strong non-linear correlation between the GC content and both sequencing coverage and its uniformity (**Figures 6C,D**) was detected. Despite a significant linear correlation between counts of spanning fragments and the GC content (**Figure 6B**), no similar relationship between this feature and sequencing coverage was seen (**Figures 7A,B**). In contrast, for coverage uniformity, both the spanning fragment count and the read depth coverage shape demonstrated a correlation with it (ANOVA test p-values of 0.037 and 0.013, respectively) (**Figures 7C,D**). No correlation was seen for depth change or depth range (data not shown).

FIGURE 3 | Pairwise comparison of the frequencies for the same mutated allele in reads obtained from pairs of panel-specific amplicons across detected somatic variants.

Finally, we tested the performance of the SVM classifier for its prediction of coverage depth and coverage uniformity by employing the GC content either as a single feature or in combination with the other features analyzed above. Following three-group equal-frequency discretization, the target classes were defined as coverage depth in the lowest tertile and coverage uniformity in the highest tertile. For predicting the depth of coverage, GC content in combination with depth change (Feature C) was selected. To predict the uniformity of coverage, GC content in combination with the spanning fragment counts (Feature A) and read depth shape (Feature D) was selected. An SVM classifier with a radial basis function (RBF) kernel was then applied, using threefold cross-validation. Performance of the SVM classifiers built upon several features for predicting coverage uniformity was better than that of the GC content-only classifiers (areas under the receiver operating curve (AUROCs) of 0.75, 95% CI: 0.750–0.752 vs. 0.65, 95% CI: 0.63–0.67; precision: 0.74 vs. 0.68). This indicates that non-GC content features may aid in the prediction of amplicons with a high coverage variation across samples. For coverage depth, however, applying a similar strategy did not result in a significant improvement (AUROCs of 0.69, 95% CI: 0.68–0.70 vs. 0.70, 95% CI: 0.70–0.71; precision: 0.69 vs. 0.69) (**Figures 7E,F**).
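The definition of the target classes by equal-frequency discretization can be illustrated as follows. This sketch covers only the labeling step, not the classifier itself; the function name is hypothetical, and assigning any remainder to the middle group when the number of amplicons is not divisible by three is an assumption.

```python
def tertile_labels(values, target="highest"):
    """Binary labels: 1 for amplicons in the target tertile, 0 otherwise.

    target: "highest" (e.g. coverage variation) or "lowest" (e.g. coverage depth).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices, ascending by value
    k = n // 3                                          # size of the target tertile
    chosen = order[:k] if target == "lowest" else order[n - k:]
    labels = [0] * n
    for i in chosen:
        labels[i] = 1
    return labels
```

The resulting labels, together with a feature matrix such as GC content plus Features A and D, could then be passed to any RBF-kernel SVM implementation and evaluated with threefold cross-validation.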

## DISCUSSION

The share of cfDNA fragments originating from tumor rather than normal tissues may vary greatly among patients. In early-stage disease, the share could be as low as 0.01% of the total cfDNA (Thierry et al., 2017). Because of that, the detection of low frequency mutant alleles represents one of the biggest technical challenges to the development of diagnostic and prognostic assays involving the sequencing of cfDNA. In this study we examined various approaches to increase the diagnostic and analytical sensitivity of the detection of somatic mutations in liquid biopsy samples.

In a heterogeneous cohort of patients, the liquid biopsy was performed at baseline, at disease progression and/or within the framework of disease monitoring. The overall diagnostic sensitivity of NGS to detect EGFR mutations in cfDNA was 83%. Of note, when we limited the sample set to the plasma specimens with a DNA concentration of 20 ng/ml and higher, the false negative rate was reduced from 17 to 0%. This observation points to low concentrations of cfDNA as a primary contributor to the imperfect sensitivity of liquid biopsy assays and to a necessity either to improve the recovery of tumor DNA fragments, or to require cfDNA profiling labs to introduce more stringent QC metrics, which may render many samples ineligible for downstream processing.

The sensitivity of cfDNA-based mutation detection assays may be aided by an improvement of amplification efficiency. Plasma cfDNA is known to be highly fragmented (Fleischhacker et al., 2011; Klevebring et al., 2014; **Figure 2B**). Therefore, it is commonly recognized that an increase in the length of PCR amplicons may result in the elimination of a majority of the extracted DNA fragments as possible templates. In this study we sought to dissect how much the amplicon length influences the sensitivity of subsequent mutation detection. For this we performed, to the best of our knowledge, the first comparison of two amplicon-based NGS panels characterized by a substantial difference in average amplicon length (**Figure 2A**). The comparison was performed in relation to the panels' diagnostic and analytical sensitivity. Surprisingly, the yield of both the germline and somatic mutations was completely concordant between the two panels, pointing to an irrelevance of amplicon size within the specified short range to the diagnostic sensitivity of the resultant assays.

As a particular example defying "the shorter amplicon, the better amplification efficiency" logic, we dissected the detection of EGFR exon 19 deletion alleles by amplicons of 138 and 168 nt in length. Based on the area under the fragment length distribution curves (**Figure 2B**), mutant alleles should be amplified 1.45 times more efficiently than wild-type ones by the panel with larger amplicons, while the panel with shorter amplicons would be 1.04 times more efficient for mutant cfDNA fragments. Considering that tumor-derived cfDNA fragments are even shorter than normal tissue-derived ones (Jiang et al., 2018), these rates would increase to 1.84 and 1.16, respectively (**Figure 3**). This should result in an approximately 1.6-fold increase in the mutant allele frequencies detected with the larger-amplicon panel as compared to the smaller-amplicon panel. In our experiment, no statistically significant difference in mutant allele frequencies was noted, with the observed trend being the opposite of what was expected, indicating that the size of the amplicons does not contribute to the analytical sensitivity of cfDNA assays.
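The expected fold difference quoted above follows directly from the two relative efficiencies. As a worked check, using only the values stated in the text:

$$
\frac{1.84}{1.16} \approx 1.59 \approx 1.6, \qquad \frac{1.45}{1.04} \approx 1.39 .
$$

The first ratio accounts for the shorter size of tumor-derived fragments; the second is the corresponding expectation without that correction. Either way, the larger-amplicon panel would be predicted to report higher mutant allele frequencies.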

Notably, our observations contradict some previous work (Chan et al., 2004; Koide et al., 2005), which showed a length-dependent decrease in the efficiency of amplification of cfDNA templates in the fragment range of up to 250 nt, which corresponds to the mononucleosome fraction representing approximately 85% of all cfDNA fragments (**Figure 2B**). In these previous studies, the yield of DNA dropped by almost 30 and 60% when using amplicons with a size of 145 nt instead of 105 nt and 201 nt instead of 145 nt, respectively, while for amplicons with larger sizes no pronounced effect was observed. Furthermore, another study demonstrated that increases in the DNA yield may be observed at a lower amplicon size range: a direct digital PCR comparison of the 50 bp to the 84 bp amplicon resulted in significant favoring of the shorter amplicon (Koide et al., 2005; Sikora et al., 2010). It is important, however, to note that the reported observations were obtained in the course of analysis of cfDNA samples collected either from healthy individuals or in the setting of prenatal diagnostics aimed at amplifying fetal cfDNA and, therefore, cannot be directly projected onto the templates of tumor-derived cfDNA, which is known for the shorter sizes of its fragments (Pinzani et al., 2011; Mouliere and Rosenfeld, 2015) and lower integrity (Underhill et al., 2016). The studies of cfDNA specimens collected from patients with tumors show that 60 bp fragments are almost five times more abundant than 150 bp ones, thus pointing to the necessity of using amplicons with sizes of 100 bp or lower (Mouliere et al., 2011).

Importantly, in many cases, reaping the benefit of a shorter amplicon size may not be possible due to complications arising from the necessity of precise positioning of the primers, which restricts optimization of their GC content, matching of melting temperatures and prevention of oligonucleotide dimerization. While designing PCR systems for select loci may still be possible, with EGFR analysis being the common example (Reckamp et al., 2016), the introduction of ultra-short amplicons into highly multiplexed systems aiming at a broader molecular profiling of human tumors may not be feasible. Particular concerns about this multiplexing-precluding approach to amplicon design stem from the recent observations of a wide mutational spectrum in the liquid biopsies of metastatic cancer patients and its relevance to possible inclusion in clinical trials (Rothé et al., 2014; Frenel et al., 2015). In light of an obvious necessity for multiplexing, the finding that varying amplicon sizes in a range from 140 up to 170 nt does not influence analytical sensitivity is significant, as it shifts the attention of panel designers from minimizing the length of the amplicons to optimizing the compatibility of oligonucleotides.

Additionally, cfDNA as a template for a designed PCR-based assay may introduce a set of additional restraints. Both the prevalence of cfDNA fragments of certain sizes and the fragmentation patterns depend on the positioning of the nucleosomes within the tissue of origin. To describe this novel complex variable depicting nucleosome positioning, we introduced four features, namely, a spanning fragment count, a read depth change, a read depth range and a read depth shape (**Figure 5C**), which collectively portray the coverage of a select amplicon by experimentally obtained WGS reads. When read coverage maps of WGS-sequenced cfDNA fragments from pooled plasma of healthy patients were aligned to the amplicons employed for liquid biopsy analysis of patients with NSCLC, these four features were utilized to determine the extent of the influence of nucleosome positioning on two dependent variables: sequencing coverage and coverage uniformity. An SVM-based classifier demonstrated that combining the GC content with spanning fragment counts and read depth shape results in an increased accuracy of prediction of both dependent variables. Therefore, this variable should be taken into consideration when designing PCR primer systems.

Nevertheless, the overall robustness of nucleosome positioning remains unclear. It is known that several regulatory events defining gene expression require the strict positioning of nucleosomes; these events are typically associated with promoter regions (Hesson et al., 2014; Lövkvist et al., 2018). However, nucleosome positioning is not absolute, and even with major shifts in gene expression, some cells fail to change nucleosome configuration (Small et al., 2014), indicating an underlying complexity of nucleosome positioning. Importantly, the majority of clinically relevant mutations are located within exons, which, according to the current view of cfDNA nucleosome maps, do not retain a strict pattern of cfDNA fragmentation. Therefore, nucleosome arrangement within such exons may be variable, either between molecular subtypes of the same disease or even between normal tissue specimens. Nevertheless, despite a potential for low robustness, the substantial correlation observed between nucleosome maps revealed by unbiased read coverage in cfDNA from healthy patients and the sequencing coverage and its uniformity in amplicons obtained from cfDNA of patients with NSCLC indicates that the efficiency of amplification may be improved if the unbiased read coverages are taken into account.

In conclusion, low plasma cfDNA concentration remains the major factor that limits the sensitivity of liquid biopsy assays. We showed above that the design of a highly multiplexed system equally tolerates amplicons in the range of 140–170 bp in size, thus allowing attention to shift toward the melting temperature, GC clamps, cross-homology and other controllable variables. We have also provided evidence that nucleosome placement in the tissue of origin and the resultant genome-wide cfDNA fragmentation pattern may be used as a guide for primer positioning to improve both the sequencing coverage and its uniformity.

#### DATA AVAILABILITY

The datasets generated for this study can be found in the Sequence Read Archive under accession number SRP167082 (https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP167082). The additional datasets analyzed for this study can be found in the Sequence Read Archive under accession number SRP061633.

#### AUTHOR CONTRIBUTIONS

VM, AB, MI, and SM designed the work. PC, KL, and VB collected the samples. ER performed the experiments. All authors participated in the interpretation of the results and in writing the article.

#### FUNDING

This study was supported by the Ministry of Science and Education, Russia (Project No. RFMEFI60714X0098).

#### ACKNOWLEDGMENTS

The authors wish to gratefully acknowledge technical support from the Laboratory of Epigenetics, Medical Genetic Science Center RAMS. Special thanks to Drs. Strelnikov and Tanas for their technical assistance and helpful discussions of the results presented here.

#### REFERENCES




**Conflict of Interest Statement:** ER, SM, VM, and AB were employed by the company Atlas Oncology Diagnostics, Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer, TT declared a past co-authorship with one of the authors AB to the handling Editor.

Copyright © 2019 Ivanov, Chernenko, Breder, Laktionov, Rozhavskaya, Musienko, Baranova and Mileyko. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Transcriptomic Analysis of Seed Germination Under Salt Stress in Two Desert Sister Species (Populus euphratica and P. pruinosa)

Caihua Zhang<sup>1</sup>†, Wenchun Luo<sup>1</sup>†, Yanda Li<sup>2</sup>, Xu Zhang<sup>1</sup>, Xiaotao Bai<sup>1</sup>, Zhimin Niu<sup>1</sup>, Xiao Zhang<sup>1</sup>, Zhijun Li<sup>3</sup> and Dongshi Wan<sup>1</sup>\*

<sup>1</sup> State Key Laboratory of Grassland Agro-Ecosystem, School of Life Sciences, Lanzhou University, Lanzhou, China, <sup>2</sup> Computer Science and Engineering Department, University of California, San Diego, La Jolla, CA, United States, <sup>3</sup> Xinjiang Production & Construction Corps, Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin, College of Life Sciences, Tarim University, Xinjiang, China

#### Edited by:

Ancha Baranova, George Mason University, United States

#### Reviewed by:

Petronia Carillo, Università degli Studi della Campania Luigi Vanvitelli Caserta, Italy

Andrés A. Borges, Spanish National Research Council (CSIC), Spain

#### \*Correspondence:

Dongshi Wan wandsh@lzu.edu.cn

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

Received: 16 October 2018; Accepted: 04 March 2019; Published: 25 March 2019

#### Citation:

Zhang C, Luo W, Li Y, Zhang X, Bai X, Niu Z, Zhang X, Li Z and Wan D (2019) Transcriptomic Analysis of Seed Germination Under Salt Stress in Two Desert Sister Species (Populus euphratica and P. pruinosa). Front. Genet. 10:231. doi: 10.3389/fgene.2019.00231

As a major abiotic stress, soil salinity limits seed germination and plant growth, development and production. Seed germination is closely related not only to the seedling survival rate but also to subsequent vegetative growth. Populus euphratica and P. pruinosa are closely related species that show a distinguished adaptability to salinity stress. In this study, we performed an integrative transcriptome analysis of three seed germination phases from P. euphratica and P. pruinosa under salt stress. The two-dimensional data set of this study provides a comprehensive view of the dynamic biochemical processes that underpin seed germination and salt tolerance. Our analysis identified 12831 differentially expressed genes (DEGs) for seed germination processes and 8071 DEGs for salt tolerance in the two species. Furthermore, we identified the expression profiles and main pathways in each growth phase. For seed germination, a large number of DEGs, including those involved in energy production and hormonal regulation pathways, were transiently and specifically induced in the late phase. In the comparison of salt tolerance between the two species, the flavonoid and brassinosteroid pathways were significantly enriched. More specifically, in the flavonoid pathway, FLS and F3′5′H exhibited significant differential expression. In the brassinosteroid pathway, the expression levels of DWF4, BR6OX2 and ROT3 were notably higher in P. pruinosa than in P. euphratica. Our results describe transcript dynamics and highlight secondary metabolite pathways involved in the response to salt stress during the seed germination of two desert poplars.

Keywords: transcriptome, salt stress, seed germination, differentially expressed gene, desert poplar species

## INTRODUCTION

Soil salinization is caused by many factors and conditions, such as unsuitable irrigation practices, irrigation with salinized water and seasonal effects (Ottow et al., 2005; Annunziata et al., 2017). As one of the most prominent abiotic stresses, salinity stress is considered the greatest threat to crop production and environmental conservation (Ottow et al., 2005; Arbona et al., 2013). Salinity stress leads to osmotic and ionic stress, which reduces cell and tissue expansion, and to ion excesses that change the osmotic potential of plant cells and induce nutritional imbalances (Munns, 2002), in turn affecting plant growth, development and survival (Carillo et al., 2019). To solve the serious problem of soil salinization, various efforts have been made; these efforts mainly concentrate on enhancing the salt resistance of economically important salt-sensitive plants through traditional breeding and biotechnological approaches or on the use of plants that naturally display high salt tolerance (Flowers, 2004).

Populus euphratica and its sister species P. pruinosa are naturally distributed in China's western desert region; due to their extraordinary adaptability to desert environments (Chen et al., 2002; Hukin et al., 2005), both species are also called desert poplars. The distinguished adaptability of these species provides beneficial ecological effects in northwest China. Currently, both poplars are considered important genetic resources in tree breeding and in research elucidating the physiological and molecular mechanisms of stress tolerance in trees (Tuskan et al., 2006; Wullschleger et al., 2013). Since the genome data of P. euphratica became available (Ma et al., 2013), the resistance mechanisms of both poplars have been revealed at multiple levels; e.g., a phylogenetic analysis shows that the two species diverged approximately 1–2 million years ago (Wang J. et al., 2011), and ancient polymorphisms contributed to their genomic divergence (Ma et al., 2018). In addition to differing in leaf morphology and leaf trichomes (Ma et al., 2016), the two poplars occupy different ecological habitats. P. pruinosa prefers desert areas with high groundwater levels, while P. euphratica can grow in desert areas where the groundwater levels are low (Ottow et al., 2005). These differences between the sister poplars result from differences in genetic mechanisms, such as the adaptive evolution of genes (Ma et al., 2013; Zhang et al., 2014) and gene expression divergences among orthologs (Qiu et al., 2011; Zhang et al., 2013).

Seed germination is constrained significantly by soil salinity (Kaya et al., 2003). Soil salinity creates an osmotic potential around the outside of seeds, resulting in decreased water uptake during germination and excessive uptake of ions, which exposes the seeds to the toxic effects of Na<sup>+</sup> and Cl<sup>−</sup> ions (Murillo-Amador et al., 2002; Khajeh-Hosseini et al., 2003). Therefore, salt stress can inhibit or delay seed germination (Almansouri et al., 2001). However, studies focusing on the genetic mechanism of seed germination under salt stress are limited.

Seed germination begins with imbibition and ends with the embryonic axis breaking through the seed coat (Bewley et al., 2012). Seed germination includes three phases. In phase I, the seed begins to expand, with a rapidly increasing water content. Then, the seed enters a plateau phase (phase II), in which the water uptake remains at a stable level. In phase III, the water uptake increases rapidly. Phase III ceases as the embryonic axis breaks through the seed coat, upon which seed germination is complete (Bewley, 1997). Energy production and respiration play important roles in the seed germination process. In the early stage, anaerobic respiration provides the main energy source, and then respiratory activity increases with oxygen uptake. Subsequently, plant hormones, such as gibberellins (GA), abscisic acid (ABA), brassinosteroids (BRs), ethylene, auxins, and cytokinins, are widely involved in determining the physiological state of a seed and regulating the germination process (Kucera et al., 2005; Holdsworth et al., 2008; Müller et al., 2009; North et al., 2010). Furthermore, numerous complex networks, including those related to gene expression and regulation commanded by various transcription factors (Chen et al., 2002), ion transporting processes, such as NHX (Na<sup>+</sup>/H<sup>+</sup> antiporter), SOS (salt overly sensitive) (Zhu, 2001), and HKT (high-affinity K<sup>+</sup> transporter) (Ren et al., 2005) processes, and secondary metabolism are all involved in the response to salt stress (Bewley, 1982, 1997; Bewley and Black, 1984; Biligetu et al., 2011). Recently, transcriptomic analyses of several poplar species under various stresses have been extensively conducted (Chen et al., 2002, 2012; Brinker et al., 2010; Janz et al., 2010; Qiu et al., 2011; Wang J. et al., 2011; Ma et al., 2013; Ziemann et al., 2013). These data provide us with a basic understanding of seed germination. However, detailed transcriptomic dynamics and physiological mechanisms under salt stress during seed germination have not yet been revealed. Such an exploration might be useful for identifying genes that could improve poplar salt tolerance through biotechnological manipulation. Moreover, most genes associated with seed germination are poorly understood due to the complexity of the germination process.

Here, we present a comprehensive transcriptome study encompassing the whole process of seed germination for two species under salt stress, which provides a valuable gene resource for genetic manipulation in poplar breeding.

## MATERIALS AND METHODS

#### Plant Materials and Growth Conditions

We collected three replicate samples of the seeds of the two studied species from a total of 18 trees in the Tarim Basin (Xinjiang, China) and stored the seeds at 4 °C. For germination, vigorous seeds were imbibed in distilled water (control) or in 0.2%, 0.4%, 0.6%, 0.8%, 1.0%, 1.2%, 1.4%, and 1.6% NaCl, and then germinated on wet filter paper in 9 cm diameter Petri dishes in a plant growth incubator (21 °C, 200 µmol m<sup>−2</sup> s<sup>−1</sup>, 16 h:8 h light/dark photoperiod). The germination rate was measured using the Chinese national standard test (GB2772-1999). Each sample contained 50 seeds and had three replicates (Wang et al., 2013). The germinating seeds were scanned and photographed using a stereo microscope (Nikon SM Z1500, Japan) to record their morphology. The moisture content of the seed samples was measured after oven drying the seeds at 75 °C to a constant weight. The moisture content [g (g FW)<sup>−1</sup>] was calculated as (FW − DW)/FW, where FW is the fresh weight and DW is the dry weight.
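The moisture content formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the example weights are hypothetical and are not data from the study.

```python
def moisture_content(fresh_weight_g, dry_weight_g):
    """Return moisture content in g water per g fresh weight: (FW - DW) / FW."""
    if fresh_weight_g <= 0:
        raise ValueError("fresh weight must be positive")
    if not 0 <= dry_weight_g <= fresh_weight_g:
        raise ValueError("dry weight must lie between 0 and the fresh weight")
    return (fresh_weight_g - dry_weight_g) / fresh_weight_g


# Hypothetical weights for one seed sample (illustrative only).
example = moisture_content(fresh_weight_g=0.50, dry_weight_g=0.45)
print(f"{example:.3f} g water per g FW")
```

Because the result is normalized by fresh weight, it always lies between 0 and 1, which makes samples of different mass directly comparable.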

The percentage of seeds with two cotyledons turning green or with emerging radicles (>1 mm) was considered the germination rate (Wang Y. et al., 2011). For the germination percentage, counts were made until no additional germination was observed for 72 h (Bradford, 1990). To elucidate the threshold salinity for the two species under the salt treatments, we measured relative indexes, including GR, RGP, GT, GI, K, and RSH (Imit et al., 2015). For RNA isolation, the seeds were imbibed in 1.0% NaCl (to expose them to salt stress) and then removed after 4, 12, 24, 48, and 72 h for RNA preparation. The control samples were collected from dry seeds (0 h). We rapidly transferred all the samples to storage at −80 °C before RNA extraction.
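The counting rule described above (stop once no additional germination has been observed for 72 h) can be sketched as follows. The function name, the observation schedule and the counts are hypothetical; only the 72 h rule and the 50 seeds per replicate come from the text.

```python
def final_germination_percentage(observations, n_seeds, quiet_hours=72):
    """Return the germination percentage once counts have been stable for quiet_hours.

    observations: list of (hours, cumulative germinated count), sorted by time.
    """
    last_change_time, last_count = observations[0]
    for hours, count in observations[1:]:
        if count > last_count:
            # New seeds germinated: remember when the count last changed.
            last_count = count
            last_change_time = hours
        elif hours - last_change_time >= quiet_hours:
            # No additional germination for the required period: stop counting.
            break
    return 100.0 * last_count / n_seeds


# Hypothetical cumulative counts for one replicate of 50 seeds.
counts = [(0, 0), (24, 10), (48, 30), (72, 41), (96, 41), (120, 41), (144, 41)]
print(final_germination_percentage(counts, n_seeds=50))
```

In this illustrative series the count last changes at 72 h, so counting stops at 144 h with 41 of 50 seeds germinated.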

### Reactive Oxygen Species (ROS) Level and Enzyme Activity Determination

For germination, seeds were imbibed in 0%, 0.4%, 0.8% and 1.0% NaCl as described above for 24 h. The ROS level and the activities of superoxide dismutase (SOD) and catalase (CAT) were measured with kits from Suzhou Comin Biotechnology following the standard protocol.

### RNA Extraction and Quality Determination


Using the CTAB procedure, we extracted and purified total RNA three times from each sample set (Chang et al., 1993). The A260/A280 ratios of all the RNA samples ranged from 1.9 to 2.0. We examined the integrity of all RNA samples with an Agilent 2100 Bioanalyzer; all RNA integrity number (RIN) values ranged from 7 to 10.

### cDNA Library Construction and RNA Sequencing

Construction of the cDNA library and RNA sequencing were performed by BIOMARKER (Beijing, China) using the Illumina (San Diego, CA, United States) Genome Analyzer platform in accordance with the manufacturer's protocols. Paired-end sequencing was performed using a HiSeq 2500 (Illumina) platform with a read length of 125 bp.

### Initial Mapping of Reads

We trimmed reads by removing adapter sequences, reads with too many (>5%) unknown base calls (N), low-complexity sequences, and low-quality reads (i.e., sequences for which >65% of the bases had a quality score ≤7). HISAT2 (Kim et al., 2015) was used to align all reads of the two species to the P. euphratica genome (Ma et al., 2013). Because the intrinsic divergence between the species could result in poor mapping, we did not map RNA-seq reads from the two species onto their own genomes. Next, StringTie (Pertea et al., 2015) assembled multiple isoforms of genes and estimated the gene expression levels as FPKM (Trapnell et al., 2010) during assembly. To reduce the effects of background transcription, only genes with FPKM ≥ 1 were used for the subsequent analysis. To evaluate repeatability, we calculated the Pearson correlation coefficient between biological replicates from the expression data using R.
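The two read-level thresholds quoted above (more than 5% unknown bases, or more than 65% of bases with a quality score of 7 or lower) can be illustrated with a small filter. This is a minimal sketch under stated assumptions, not the pipeline actually used: the function name and the example reads are hypothetical, and adapter and low-complexity filtering are not shown.

```python
from typing import Sequence


def keep_read(sequence: str, qualities: Sequence[int],
              max_n_fraction: float = 0.05,
              low_quality_cutoff: int = 7,
              max_low_quality_fraction: float = 0.65) -> bool:
    """Return True if a read passes the N-content and base-quality thresholds."""
    if not sequence or len(sequence) != len(qualities):
        return False
    # Fraction of unknown base calls (N) in the read.
    n_fraction = sequence.upper().count("N") / len(sequence)
    # Fraction of bases whose quality score is at or below the cutoff.
    low_fraction = sum(q <= low_quality_cutoff for q in qualities) / len(qualities)
    return n_fraction <= max_n_fraction and low_fraction <= max_low_quality_fraction


# Hypothetical 20-base reads (illustrative only).
print(keep_read("ACGT" * 5, [30] * 20))              # clean read
print(keep_read("NN" + "A" * 18, [30] * 20))         # 10% N
print(keep_read("A" * 20, [5] * 14 + [30] * 6))      # 70% low-quality bases
```

Keeping the thresholds as parameters makes it easy to see how sensitive the number of retained reads is to each cutoff.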

### Analysis of DEGs

We applied Ballgown (Frazee et al., 2015) to determine which transcripts were differentially expressed between two or more experiments, confirming their significance with an F-test. Ballgown allows both time-course and fixed-condition differential expression analyses. Therefore, two methods were employed to identify DEGs: (1) time as the main variable and species as the covariate; (2) species as the main variable and time as the covariate.

### Hierarchical Clustering and Gene Co-expression Analysis

Using normalized log<sub>2</sub>(FPKM+1) values, hierarchical clustering was performed with the pvclust package. Based on the normalized FPKM values, K-means clustering was performed with the K-Means/K-Medians Support (KMS) module embedded in MEV 4.9<sup>1</sup>.
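To make the clustering step concrete, the sketch below applies the log<sub>2</sub>(FPKM+1) transformation and a plain Lloyd-style K-means to a few expression profiles. It is a generic illustration only: it does not reproduce the KMS module or the bootstrap procedure of pvclust, and the profiles, the choice of two clusters and the initial centroids are all hypothetical.

```python
import math


def log2_fpkm(profile):
    """Transform a list of FPKM values to log2(FPKM + 1)."""
    return [math.log2(value + 1.0) for value in profile]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, initial_centroids, n_iter=20):
    """Plain K-means with fixed initial centroids, so the result is deterministic."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(p, centroids[k]))
                  for p in profiles]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical FPKM profiles over six time points (0, 4, 12, 24, 48, 72 h).
fpkm = [
    [63, 31, 15, 7, 3, 1],    # decreasing, "early" pattern
    [31, 31, 7, 7, 1, 1],     # decreasing, "early" pattern
    [1, 3, 7, 15, 31, 63],    # increasing, "late" pattern
    [1, 1, 7, 7, 31, 31],     # increasing, "late" pattern
]
profiles = [log2_fpkm(p) for p in fpkm]
labels, centroids = kmeans(profiles, initial_centroids=[profiles[0], profiles[-1]])
print(labels)
```

Fixing the initial centroids keeps the toy example reproducible; real analyses typically use several random starts and compare the resulting partitions.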

### Gene Functional Enrichment and qRT-PCR Analysis

GO and KEGG enrichment analyses of the two differentially expressed transcript data sets were performed using a modified Chi-square test and Fisher's exact test in R (p-value < 0.01 and false discovery rate < 0.05). Transcription levels of genes were quantified with an MX3005P Real-Time PCR Detection System (Agilent) based on the 2<sup>−ΔΔC<sub>T</sub></sup> method (Livak and Schmittgen, 2001). The experiment was performed in a 20 µL volume reaction system containing 10 µL 2 × SYBR Premix ExTaq (TaKaRa) with the intercalating dye SYBR Green. All primers were designed using PRIMER5.0 software and are listed in **Supplementary Table S4**.
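For readers unfamiliar with the relative quantification described by Livak and Schmittgen (2001), the calculation can be sketched as below. The sketch assumes a reference gene and roughly 100% amplification efficiency, neither of which is specified in this section; the function name and the threshold cycle values are hypothetical.

```python
def relative_expression(ct_target_treated, ct_reference_treated,
                        ct_target_control, ct_reference_control):
    """Return the fold change 2 ** (-ddCt) of a target gene, treated vs. control."""
    # Normalize the target gene to the reference gene within each condition.
    delta_ct_treated = ct_target_treated - ct_reference_treated
    delta_ct_control = ct_target_control - ct_reference_control
    # Compare the normalized values between conditions.
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical threshold cycles (illustrative only).
fold_change = relative_expression(
    ct_target_treated=22.0, ct_reference_treated=18.0,
    ct_target_control=24.0, ct_reference_control=18.0,
)
print(fold_change)  # 4.0: the target is four-fold higher in the treated sample
```

A fold change above 1 indicates higher expression in the treated sample, and a value below 1 indicates lower expression.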

## RESULTS

### Physiological and Morphological Changes During Seed Germination

To evaluate the effect of salt stress on seed germination, we followed the traditional division of germination progress into three phases based on seed water uptake during imbibition (Nonogaki et al., 2010). The first phase (phase I) occurs from 0 h to 36 h; the plateau phase (phase II) occurs from 36 h to 64 h; and phase III extends from 64 h to 120 h during the transition to seedling growth (**Figure 1A**). We investigated the relationship between the germination rate and NaCl concentration. Seeds exhibiting high germination rates were selected and cultured in distilled water with a gradient of NaCl concentrations (0%, 0.2%, 0.4%, 0.6%, 0.8%, 1.0%, 1.2%, 1.4%, and 1.6%) (**Figure 1B**). With increasing NaCl concentration, seed germination was significantly inhibited. At different NaCl concentrations, the seed germination rates of P. pruinosa were higher than those of P. euphratica. The relative germination percentage of the two species exceeded 80% in the 0.4% NaCl solution, whereas the value approached zero in 2.4% NaCl. We considered the salt concentrations corresponding to relative germination percentages of 75%, 50%, and 25% to be suitable, critical and limiting for seed germination, respectively. In our study, for P. euphratica, the suitable, critical and limiting values were 0.602%, 1.161%, and 1.72%, respectively, whereas for P. pruinosa, these values were 0.599%, 1.179%, and 1.759%, respectively, suggesting that the threshold salinity for the two species differed. We also measured the germination index, salt tolerance index, relative salt harm rate and germination energy (**Figure 1B**). The average germination percentage, subordinate function values, and threshold salinity for P. pruinosa were higher than those for P. euphratica. Based on the results, a 1.0% NaCl concentration was selected for the subsequent salt treatment. The seed phenotypes of the two species were observed at four time points (**Figure 1C**). In the controls, the radicle emergence was completed within 12 h, and the hypocotyl and cotyledons emerged from the seed coat by 24 h. The cotyledons started to open by 36 h and opened fully and turned green by 48 h. In contrast, under the salt treatment, the seeds were still in the imbibition stage at 12 h, the radicle emergence stage was completed by 24 h, and the subsequent stages were all delayed by 12 h.

<sup>1</sup>http://www.tm4.org/mev
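The text does not state how the suitable, critical and limiting concentrations were derived from the germination data. The reported values are evenly spaced for each species (steps of 0.559% for P. euphratica and 0.580% for P. pruinosa), which is consistent with a straight-line fit of relative germination percentage against NaCl concentration, so the sketch below assumes a simple least-squares line. The function names and the data points are hypothetical and are not the measurements from the study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_at(rgp_target, slope, intercept):
    """Invert the fitted line to find the concentration giving a target RGP."""
    return (rgp_target - intercept) / slope


# Hypothetical relative germination percentages (RGP) along a NaCl gradient.
nacl_percent = [0.0, 0.4, 0.8, 1.2, 1.6]
rgp_percent = [100.0, 84.0, 68.0, 52.0, 36.0]

slope, intercept = fit_line(nacl_percent, rgp_percent)
thresholds = {
    label: concentration_at(target, slope, intercept)
    for label, target in [("suitable", 75.0), ("critical", 50.0), ("limiting", 25.0)]
}
for label, value in thresholds.items():
    print(f"{label}: {value:.3f}% NaCl")
```

Under a linear model the three thresholds are necessarily equally spaced, because each corresponds to the same 25-point drop in relative germination percentage.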

### RNA-Seq and Mapping of Illumina-Solexa Sequencing Reads

To systematically investigate the transcriptome dynamics of the two species' seeds during germination under salt stress, we obtained 36 transcriptome samples. After removing low-quality sequences and trimming adapter sequences, 3–6 GB of 125-bp paired-end clean reads were generated from each library (**Supplementary Table S1**). Approximately 80% of the reads matched the genome (**Supplementary Table S2**). All the genes and transcripts were reassembled (**Table 1**).

TABLE 1 | Numbers of assembled genes and transcripts.

To detect minor differences in gene expression between time points and between the two species, we used three biological replicates (**Supplementary Figures S1**, **S2**) to assess our data quality. The expression values of biological replicates from the same samples were highly correlated (average R<sup>2</sup> > 0.8). For more than 73% of the genes, FPKM values ranged from 1 to 100 at each time point (**Supplementary Figure S3A**). We used the average FPKM of the biological replicates as the expression quantity. To examine the divergence in gene expression between the two species under salt stress in more detail, we performed a hierarchical clustering analysis for all the expressed genes from P. euphratica and P. pruinosa at each time point using bootstrapping (**Supplementary Figure S3B**). The correlation dendrogram in **Supplementary Figure S3B** shows that samples collected at 0, 4 and 12 h clustered together, while those collected at 24, 48 and 72 h clustered into another group. This result indicates that one set of genes was activated during the early stress and germination stages, while another set of genes was differentially expressed after 48 h. Therefore, based on a Spearman correlation analysis, the germinating seed samples from 0, 4 and 12 h were assigned to the early phase, the sample from 24 h to the middle phase, and the samples from 48 h and 72 h to the late phase of the germination process.
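The replicate check described above can be illustrated with a short sketch that computes the squared Pearson correlation between two replicates and the per-gene mean across replicates. The function names and the FPKM values are hypothetical; the study itself performed this calculation in R.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long value lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant values")
    return sxy / math.sqrt(sxx * syy)


def mean_expression(replicates):
    """Average expression per gene across replicates (one list per replicate)."""
    return [sum(values) / len(values) for values in zip(*replicates)]


# Hypothetical FPKM values of four genes in three biological replicates.
rep_a = [1.2, 15.0, 48.0, 96.0]
rep_b = [1.0, 17.0, 45.0, 99.0]
rep_c = [1.5, 14.0, 50.0, 92.0]

r_squared = pearson_r(rep_a, rep_b) ** 2
print(f"R^2 between replicates a and b: {r_squared:.3f}")
print(mean_expression([rep_a, rep_b, rep_c]))
```

Squaring the coefficient gives the R<sup>2</sup> value that is compared against the 0.8 level mentioned in the text.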

### Identification of DEGs, Temporal Expression Trends and GO Functional Enrichment

To identify global transcriptional changes that occurred during seed germination under salt stress, we identified two data sets of DEGs in the two species: 12831 DEGs and 19004 differentially expressed transcripts (DETs) for seed germination processes, and 8071 DEGs and 19000 DETs for salt tolerance. The DEGs were grouped into ten clusters (designated K1–K10) (**Supplementary Figure S4**) to examine the temporal expression trends of seed germination processes. To better understand the functions of the DEGs and obtain a view of functional transitions across time during seed germination in the two species, GO category enrichment analysis was performed (**Supplementary Figure S5**) to identify important events (biophysical, biochemical, and cellular processes) during seed germination.

According to the cluster analysis results, all the clusters of P. pruinosa and P. euphratica could be divided into early (0–12 h), middle (24 h), and late (48–72 h) phases (**Supplementary Figure S3B**). Genes in the early-phase clusters (K1 to K4) were strongly expressed at 0–12 h and gradually downregulated between 12 and 72 h in the two species. Based on the GO enrichment results, genes related to "adenyl nucleotide binding," "adenyl ribonucleotide binding," "purine ribonucleoside binding," and "purine nucleoside binding" were increasingly expressed after imbibition (**Supplementary Figure S5**). Moreover, some genes associated with "structural molecule activity," "structural constituent of cytoskeleton," "intracellular non-membrane-bounded organelle," "non-membrane-bounded organelle" and "cellular structure restoration" were enriched (**Supplementary Figure S5**). In addition, some genes associated with "ATP binding" were enriched (**Supplementary Figure S5**).

Genes in cluster K5 were highly expressed from 0 to 24 h and downregulated from 48 to 72 h. In the middle phase, the enriched genes included genes associated with "catalytic activity," "mitochondrial part," "nutrient reservoir activity," "electron transport chain," and "respiratory electron transport chain" (**Supplementary Figure S5**). The remaining five co-expression modules (K6 to K10) of the two species could be roughly assigned to the late phase. Transcripts of these modules were significantly upregulated during at least the last two time points. Many genes of this stage were typified by the enriched functions of "catabolic process," "generation of precursor metabolites and energy," "lipid metabolic process," "carbohydrate metabolic process," "hydrolase activity," and "catalytic activity" (**Supplementary Figure S5**). Moreover, some upregulated genes of this stage were associated with "cellular nitrogen compound biosynthetic process" and "NAD binding" (**Supplementary Figure S5**).

### Functional Regulatory Network Analysis (KEGG Pathway Enrichment) of Seed Germination Process DEGs

To further elucidate the biochemical pathways associated with the seed germination process DEGs, we performed a KEGG pathway enrichment analysis. A total of 3847 of the 12831 DEGs were mapped to 328 pathways, of which 58 were significantly (p-value ≤ 0.01) overrepresented during seed germination (**Supplementary Figure S6**).
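The idea behind testing a pathway for overrepresentation can be sketched with a one-sided Fisher's exact test, which is equivalent to the upper tail of the hypergeometric distribution. This is an illustration of the principle only: the study used a modified Chi-square test together with Fisher's exact test in R, the choice of background gene set is an assumption, and the function name and counts below are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test p-value, P(X >= k).

    N: genes in the background, K: background genes annotated to the pathway,
    n: DEGs in the background, k: DEGs annotated to the pathway.
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the hypergeometric probabilities of observing k or more pathway DEGs.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical toy example: 20 background genes, 5 in the pathway,
# 6 DEGs in total, 4 of which belong to the pathway.
p_value = enrichment_p_value(k=4, n=6, K=5, N=20)
print(f"p = {p_value:.4f}")
```

When many pathways are tested at once, the resulting p-values would normally be adjusted for multiple testing, in line with the false discovery rate threshold given in the methods.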

The early phase was exemplified by an observed statistically significant enrichment of "ribosome," "proteasome," and "protein processing in endoplasmic reticulum" pathways (**Supplementary Figure S6**). The middle phase exhibited the enrichment of "flavonoid biosynthesis," "oxidative phosphorylation," "ribosome," "proteasome" and "spliceosome" pathways (**Supplementary Figure S6**). While many genes related to the metabolism of free amino acids were enriched in phase III (**Supplementary Figure S6**), most of the major pathways were enriched in the late phase, including "carbon metabolism," "glycolysis/gluconeogenesis," "starch and sucrose metabolism," "oxidative phosphorylation," "photosynthesis," "porphyrin and chlorophyll metabolism," and "carotenoid biosynthesis" (**Supplementary Figure S6**). "Oxidative phosphorylation" provides ATP for other metabolism pathways, such as mitochondrial repair and differentiation (Weitbrecht et al., 2011). The glyoxylate pathway contains a key step in the conversion of fatty acids to sucrose (Pritchard et al., 2002).

#### DEGs Related to Energy Production for Seed Germination Processes

During the preliminary phase, because photosynthesis is inactive, the degradation of storage reserves for energy production via processes such as glycolysis, the glyoxylate cycle, and the tricarboxylic acid (TCA) cycle largely determines germination vigor. We defined the relevant functional categories as "carbon metabolism," "glyoxylate and dicarboxylate metabolism," "glycolysis/gluconeogenesis," and "starch and sucrose metabolism." Then, we identified the four major energy production processes, i.e., fermentation, the TCA cycle, the glyoxylate cycle and glycolysis, representing significantly overrepresented functional pathways, and we examined the expression patterns of the related DEGs (**Figure 2**). Here, ten gene families participating in the TCA cycle were differentially expressed over time in the two species. With respect to glycolysis, numerous gene families were upregulated, such as GALM, PFK, FBP, ALDO, GAPDH, and PK. In anaerobic respiration, three related gene families, PDC, ADH, and LDH, were all upregulated in the two species.

counterpart genes in the four pathways. For details of abbreviations, see Supplementary Table S3.

### Hormonal Regulation of Seed Germination in P. euphratica and P. pruinosa Under Salt Stress

In our study, 100 genes associated with "plant hormone signal transduction" were differentially expressed over time in the two species. We identified the key hormone signal transduction genes and further compared the expression profiles of the multistep signaling pathways of ABA, GA and ethylene (**Figure 3** and **Supplementary Figure S7**). Genes related to ABA signal transduction, e.g., PYL/PYR1, the negative regulator PP2C and the positive regulator SnRK2, exhibited similar expression patterns in the two poplars. The expression level of PP2C was high at 0 and 4 h but decreased after 12 h. In the GA signaling pathway, DEGs exhibited different expression patterns between the species during germination under salt stress. Specifically, the expression of the DELLA protein gene was upregulated from 0 to 12 h in P. euphratica but remained continuously high in P. pruinosa. GID1 was strongly upregulated during the middle and late phases of seed germination, while the expression of specific genes differed between the two species. Furthermore, most GA signal transduction-related genes were upregulated in the middle and late phases. We also identified the genes involved in ethylene signaling, as shown in **Figure 3**. The expression pattern analysis indicated that most of the DEGs exhibited similar expression patterns in the two species for ETR and EIN3 (**Supplementary Figure S7**). CTR expression was upregulated in the late phase, while ETR was highly expressed after the early phase of germination in the two species.

### Transcription Factors and Genes Involved in Salt Responses During Seed Germination

Numerous transcription factors that regulate the response to salt stress in desert poplars have been identified (Trapnell et al., 2010). Here, a total of 1582 and 1573 expressed transcripts were categorized as transcription factors in P. euphratica and P. pruinosa, respectively (**Figure 4A**). In total, 1480 transcription factors were expressed in both species (**Figure 4B**). Relatively few genes displayed species-specific expression. MYBs, bZIPs, WRKYs, and ERFs, as key response factors to abiotic stresses, were all induced by salt stress (**Figure 4B**), and the changes in their expression dynamics may reveal their critical functions in response to salt stress (Yamaguchi-Shinozaki and Shinozaki, 2005). Furthermore, some proteins regulating Na<sup>+</sup>/H<sup>+</sup> transport and controlling ion homeostasis, such as NHXs, SOS1, SOS2, SOS3, and HKTs, were induced by salt stress (**Figure 4**). These results support the view that the genes related to ion transport and chloride channels play vital roles in maintaining and re-establishing homeostasis in the cytoplasm (Hasegawa et al., 2000; Wang et al., 2008; Sun et al., 2009; Ye et al., 2009; Qiu et al., 2011). BCH1 and ZEP, which are involved in the biosynthesis of ABA, were highly upregulated in salt-stressed samples in the two species (**Figure 4B**). In addition, the expression of BADH and GolS, which are involved in critical solute biosynthesis processes that help plants maintain high osmotic pressure under salt stress (Taji et al., 2002; Bartels and Sunkar, 2005), was induced by the salt treatment. Overall, the expression patterns of genes responding to salt stress in P. euphratica were consistent with those in P. pruinosa, indicating extensive transcriptional consistency between the two species in their responses to salt stress.
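The statement that relatively few transcription factors were species-specific follows from simple arithmetic on the reported counts. The sketch below assumes that the 1480 shared transcription factors are the intersection of the 1582 and 1573 expressed in each species; the function name is hypothetical.

```python
def overlap_summary(n_a, n_b, n_shared):
    """Summarize two overlapping sets given their sizes and the shared count."""
    if n_shared > min(n_a, n_b):
        raise ValueError("shared count cannot exceed either set size")
    union = n_a + n_b - n_shared
    return {
        "only_a": n_a - n_shared,
        "only_b": n_b - n_shared,
        "union": union,
        "shared_fraction": n_shared / union,
    }


# Counts reported in the text: P. euphratica, P. pruinosa, and both species.
summary = overlap_summary(n_a=1582, n_b=1573, n_shared=1480)
print(summary["only_a"], summary["only_b"], summary["union"])  # 102 93 1675
print(f"{summary['shared_fraction']:.1%} of all expressed transcription factors are shared")
```

Under this assumption, 102 transcription factors would be expressed only in P. euphratica and 93 only in P. pruinosa.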

### GO Functional Enrichment Between the Two Species Over the Time Series

The temporal expression trends of DEGs between the two species during germination were clearly different, suggesting that the two desert poplars might have evolved different gene expression patterns to adapt to different salty desert habitats. To obtain a better view of the functional differences between the species over the course of germination, GO enrichment analysis was employed, comparing the two species in two phases (the middle phase had only one DEG) (**Supplementary Figure S8**). The results indicated that in the early phase, 2766 DEGs were enriched, mainly in the functional classifications "ribosomes," "amide biosynthetic process," "cellular macromolecule biosynthetic process," "protein activity," and "ATP binding." In the late phase, 5305 enriched DEGs were associated with "response to stress," "response to oxidative stress," "response to abiotic stress," "response to stimulus," "ATP metabolic process," "photosystem," "photosynthesis," "growth," "developmental process," "ion binding," "calcium ion binding," and "oxidoreductase activity."

### KEGG Functional Enrichment in the Two Species Over the Time Series

To further elucidate the different enriched biochemical pathways, DEGs of the two species were mapped to 352 pathways, 11 of which were significantly (p-value ≤ 0.01) enriched, including "flavonoid biosynthesis," "stilbenoid, diarylheptanoid and gingerol biosynthesis," "brassinosteroid biosynthesis," "phenylpropanoid biosynthesis," "diterpenoid biosynthesis," and "monoterpenoid biosynthesis" (**Supplementary Figure S9**). The results indicate that many antioxidants, antioxidases and secondary metabolites are involved in the adaptation of these two species to salt stress (Burritt and Mackenzie, 2003). Secondary metabolites in the flavonoid pathway play vital roles in stress protection, and their biosynthesis is regulated by key enzymes (Winkel-Shirley, 2002). In this study, PAL was induced at 12 h in P. euphratica seeds and at 48 h in P. pruinosa seeds (**Figure 5**). CHS, whose five gene copies had different expression patterns between the two species, initiates flavonoid biosynthesis. Furthermore, FLS expression in P. pruinosa was higher than that in P. euphratica. Specifically, FLS was highly expressed in the early phase in P. euphratica and was significantly and highly expressed during the seed germination process. In addition, the expression levels of F3′5′H and CHS in P. pruinosa were significantly higher than those in P. euphratica (**Figure 5**). F3H converts naringenin to dihydrokaempferol, which is further converted to kaempferol and quercetin by FLS. The duplication of FLS may allow diversification of the types and amounts of flavonols produced in different tissues and under different stresses (Winkel-Shirley, 2002).
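The enrichment test behind the p-value threshold is not specified in this section. As a minimal sketch, assuming a one-sided hypergeometric test (a common choice for KEGG and GO over-representation analysis, but an assumption here), the p-value for one pathway could be computed as follows. The function name and all gene counts in the usage lines are hypothetical and are not taken from the study.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_degs, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_background: annotated genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_degs:       differentially expressed genes drawn from the background
    n_overlap:    DEGs that fall in the pathway
    """
    max_overlap = min(n_pathway, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_degs - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts for illustration only.
p_value = hypergeom_enrichment_p(
    n_background=20000, n_pathway=50, n_degs=2000, n_overlap=15
)
is_enriched = p_value <= 0.01  # threshold used in the text
print(p_value, is_enriched)
```

In practice such p-values are usually corrected for multiple testing across all 352 pathways; whether and how that was done here is not stated in this section.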

Brassinosteroids have an extensive range of effects in plants, including effects on cell division, cell expansion, xylem differentiation and seed germination (Kagale et al., 2007). In the two species, six gene families related to brassinosteroid metabolism were enriched, including DET2, DWF4, BR6OX1, BRox2, ROT3 and BAS1. Among them, DET2 and BAS1 were highly expressed in P. euphratica and exhibited relatively low expression in P. pruinosa, while ROT3 was highly expressed in P. pruinosa but was not detected in P. euphratica (**Figure 6**). Moreover, there were two copies of both DWF4 and BR6OX2, and each copy exhibited a different expression pattern between the two poplars.

### ROS Level and Enzyme Activity Determination

We measured ROS levels and related enzyme activities. The quantification assay indicated that more hydrogen peroxide accumulated in P. euphratica than in P. pruinosa under the various salt conditions, especially in 1.0% NaCl, where the levels were approximately 2-fold higher in P. euphratica than in P. pruinosa (**Figure 7**). Correspondingly, SOD activities were significantly higher in P. pruinosa than in P. euphratica after treatment with 0.4% NaCl solution. The CAT activities in the treatment with 1.0% NaCl solution demonstrated a similar pattern.

### Verification of Expression Patterns by qRT-PCR

To validate the RNA-seq results, qRT-PCR analysis was conducted at different time points during seed germination in the two species (**Figure 8**). The expression patterns of the genes examined by qRT-PCR, including genes in the plant hormone signal transduction (PYL and GID1), flavonoid biosynthesis (PAL) and brassinosteroid biosynthesis (ROT3 and DWF4) pathways, were all similar to the RNA-seq results.
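This section does not say how the qRT-PCR signal was converted to relative expression. A minimal sketch, assuming the widely used 2<sup>−ΔΔCt</sup> method with a single reference gene and roughly 100% amplification efficiency, is shown below. The function name and the Ct values are hypothetical and serve only to illustrate the calculation.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^(-ddCt) method.

    Each Ct is normalized to the reference gene in the same sample,
    and the sample is then compared with the control (e.g., 0 h).
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_sample - delta_control
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: the target crosses threshold two cycles
# earlier (relative to the reference gene) in the sample than in the control.
fold_change = relative_expression(
    ct_target_sample=24.0, ct_ref_sample=18.0,
    ct_target_control=26.0, ct_ref_control=18.0,
)
print(fold_change)
```

Fold changes obtained this way at each time point can then be compared with the direction and magnitude of the FPKM changes from RNA-seq.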

## DISCUSSION

### P. pruinosa Showed a Higher Salt Tolerance Than P. euphratica at the Three Seed Germination Stages

Populus euphratica and P. pruinosa diverged from a recent common ancestor between 1 and 2 million years ago (Wang J. et al., 2011) and exhibited different ecological adaptations to desert habitats. In this context, the two desert poplars have evolved different genetic strategies (Ma et al., 2013; Zhang et al., 2013). However, it is not known whether these genetic variations also underlie differences in seed germination.

In the present study, seed germination was faster in P. pruinosa than in P. euphratica (**Figure 1C**). Based on the seed moisture content, the seed germination time courses for the two species, upon the transfer of seeds to water, can be divided into three phases, which agree with the three classical phases of seed germination (Nonogaki et al., 2010; **Figure 1A**). We also investigated the relationship between the germination rate and NaCl concentration. The average germination percentage, subordinate function values, and threshold salinity of P. pruinosa were higher than those of P. euphratica. Based on transcriptome analysis, approximately 80% of the reads matched the genome, and the proportion of mapped genes in each library was 52%–70%, indicating that most genes were expressed in the seeds of the two species under salt stress. The correlation dendrogram is consistent with the separate phases classified by seed water uptake (**Supplementary Figure S3B**), suggesting that, in two stages (the early and late phases), the sister species evolved divergent regulatory and metabolic pathways associated with seed germination in different salt habitats.

### Biochemical Processes of Poplar Seeds Are Regulated by Highly Coordinated Transcript Dynamics

The expression data in the three seed germination phases showed high reproducibility in both species, and each phase was clearly distinguished by its expression dynamics. The DEGs induced early in seed germination (early phase) appeared to be associated with the repair of genetic material, the cellular structure and the resumption of energy metabolism. During seed germination, the free amino acids involved in protein synthesis are provided by storage protein degradation induced by osmopriming in the first hours of imbibition (Wang et al., 2012). Accordingly, proteases are newly synthesized and accumulate during imbibition (Yang et al., 2007). Therefore, we speculate that in P. euphratica and P. pruinosa, amino acid biosynthesis genes are expressed after 48 h of the seed germination process, and their products provide for the synthesis and metabolism of de novo proteins in the growing embryo (Joosen et al., 2013). Thus, the stored proteins in seeds act not only as important sources of amino acids but also as a source of energy (Angelovici et al., 2011). The middle phase was associated with the active nutrient reservoir, amino acid metabolism and catalytic activity. In this stage, establishing a redox state is likely a primary function of the fast recovery of cellular metabolism at the beginning of imbibition (Rosental et al., 2014). The enriched genes function not only in energy production but also in promoting the activity of essential enzymes to support the completion of germination (Van Dongen et al., 2011). Moreover, flavonoids can delay germination and play important roles in protection against diverse stresses (D'Auria and Gershenzon, 2005). Most flavonoid genes in this study were more highly expressed in P. pruinosa in the middle phase, indicating that the "flavonoid biosynthesis" pathway might contribute to the difference in the seed germination rate between P. pruinosa and P. euphratica.

Reserves used for the germination of seeds are primarily stored in the form of starch, lipids and proteins in the embryo or endosperm (Yang et al., 2009). Proteins related to hydrolase activity contribute to starch and protein degradation (Rosental et al., 2014), while catalytic activity proteins may increase enzyme activities or provide the energy required during seed germination. Nitrogen-containing compounds release seeds from dormancy, presumably leading to the oxidation of NADPH and therefore providing an increased carbon flow through the glycolytic and oxidative pentose phosphate pathways (PPP) (Roberts, 1964; Hendricks and Taylorson, 1974; Roberts and Lord, 1979; Cohn et al., 1983; Hilhorst and Karssen, 1989). NADP, as a coenzyme of glucose-6-phosphate dehydrogenase, plays a key role in linking the glycolysis pathway and PPP (Nonogaki et al., 2010). Here, we investigated the DEGs involved in fermentation, the TCA cycle, glyoxylate and glycolysis during seed germination. We found that energy production mainly occurred in the later phase and increased gradually. Interestingly, many genes of key enzymes in the TCA cycle were expressed during the early phase, which would lead to the accumulation of many key enzymes during early germination (Weitbrecht et al., 2011). In the late phase, some of the enriched genes were linked to "the photosystem II oxygen evolving complex," "photosystem," and "photosynthesis," suggesting that the seeds had already started to photosynthesize, contributing to the energy supply and powering the productivity of the seed (Ruuska et al., 2004; Goffman et al., 2005; Allen et al., 2009). Meanwhile, "glyoxylate metabolism" activity was enriched in the last phase of germination, which suggests that lipid metabolism is also an important energy source for seed germination. 
The results demonstrated that activation of energy metabolism during early germination is necessary for seed germination; however, energy production is more complex in the late phase than in earlier phases. In addition to energy metabolism, genes associated with detoxification were also involved in responses to salt stress in the two desert poplars (Zhu, 2001); these included genes associated with "glutathione metabolism," "flavonoid biosynthesis" and "cytochrome P450," all of which are tightly correlated with seed tolerance to salt stress (Alscher, 1989). These observations suggest that P. euphratica and P. pruinosa quickly establish energetic and developmental balances under salt stress. Germination is actuated by a large number of cellular processes, such as transcription, translation, repair mechanisms, responses to various stresses, organelle reassembly and cellular structure reconstruction. All of these processes are supported by metabolism for energy generation. Together, the above results indicate that the transition of primary biochemical processes over time during the seed germination of the two studied poplars is produced partly by highly coordinated transcript dynamics. As the two desert poplars have adapted to different salty desert habitats, these species may have developed different genetic pathways under salt stress during seed germination.

### Hormonal Regulation Contributes to the Difference in Seed Germination Phases

Some genes were enriched in the "plant hormone signal transduction" functional category, which is key for physiological state determination and the regulation of seed germination, especially the GA-ABA balance (Meyer et al., 2009). ABA positively regulates the induction of dormancy and negatively regulates germination. Here, the genes related to ABA signal transduction exhibited similar expression patterns in the two poplars. For example, PYL/PYR1, which are considered ABA receptors, exhibited upregulated expression during the first stage, suggesting that the ABA content of the dry seeds was high and decreased during imbibition (Preston et al., 2009). In addition, the negative regulator PP2C has been found to be a major core component of ABA signaling; its expression level was high at 0 and 4 h but decreased after 12 h (Fujii and Zhu, 2009; Umezawa et al., 2009).

GAs play an important role in the promotion of germination and the release of dormancy (Kucera et al., 2005) by stimulating ABA degradation. Here, many DEGs associated with the GA signaling pathway exhibited different expression patterns during germination under salt stress. Specifically, DELLA proteins, which belong to the GRAS family and act as negative regulators of the GA signaling pathway (Sun and Gubler, 2004), were upregulated from 0 to 12 h in P. euphratica, while they were continuously expressed at high levels in P. pruinosa. GID1, encoding a soluble GA receptor, was strongly upregulated during the middle and late phases of seed germination. The GID1 protein can interact with DELLA when bioactive GAs are present (Ueguchi-Tanaka et al., 2007). Furthermore, most GA signaling transcription-related genes were upregulated in the middle and late phases, which corresponds to the results of a previous study showing that the GA content of seeds increased during phase II of germination.

Ethylene is implicated in the promotion of germination in many species. Here, we identified the DEGs involved in ethylene signaling in seed germination. We found that most of these DEGs, alongside ETR and EIN3, exhibited similar expression patterns in the two species (**Supplementary Figure S7**). In the absence of ethylene, ETR1 activates CTR1, which negatively regulates downstream signaling components and is inactive in the presence of ethylene. CTR expression was upregulated in the late phase of germination in the two species, while ETR was highly expressed after the early phase of germination. These proteins are regulated by ethylene levels during seed germination by the inactivation of a MAPK cascade comprising SIMKK and MPK6, which are positive regulators of the ethylene response pathway (Ouaked et al., 2003). EIN3 and EIN3-LIKE proteins bind to the promoter of the ERF1 (ethylene responsive factor 1) gene and thereby confer a hierarchy of transcription factors involved in ethylene signaling (Lee and Kim, 2003). Most importantly, the expression patterns of DEGs related to the ethylene pathway were different between P. euphratica and P. pruinosa, indicating that ETR and EIN3 distinctly regulate ethylene signal transcription pathways during seed germination.

Overall, GAs increase and counteract ABA inhibition in the early and late phases of germination (North et al., 2010). Ethylene counteracts ABA inhibition by interfering with ABA signaling during the late phase of germination, while the ABA content is regulated by an equilibrium between the biosynthesis and catabolism of ABA (Nambara and Marion-Poll, 2005). Thus, many of the DEGs exhibited analogous expression patterns in the two species in the models for GA, ABA and ethylene in response to salinity stress but exhibited completely different expression patterns during seed germination.

### The Fine Regulation of the Synthesis of Flavonoids and Brassinosteroids in Desert Poplars Contributes to Their Environmental Adaptation

Flavonoids have an extensive range of biological functions, including protecting plants under various stresses (Winkel-Shirley, 2002). Flavonoids are synthesized by the phenylpropanoid pathway and found in most seeds and grains; the major types of flavonoids in seeds are flavonols, anthocyanins, phlobaphenes, isoflavones and proanthocyanidins (Lepiniec et al., 2006). Several genes that encode key enzymes in the flavonoid biosynthetic pathway were expressed differently between the seeds of P. euphratica and P. pruinosa under salt stress (**Figure 5**). We suggest that the phenylpropanoid pathway, especially the flavonoid metabolism pathway, is widely involved in protection from salt stress in both desert poplars. In general, salt stress is often accompanied by an oxidative burst in plants. In this study, hydrogen peroxide (H<sub>2</sub>O<sub>2</sub>) accumulation in P. euphratica was higher than that in P. pruinosa under the various salt conditions, and approximately 2-fold higher in 1.0% NaCl (**Figure 7**), suggesting that salt treatment might induce oxidative stress in the seeds of P. euphratica. The unavoidable accumulation of H<sub>2</sub>O<sub>2</sub> and the activity of the scavenging pathways should be kept in balance, so that H<sub>2</sub>O<sub>2</sub> can either perform a signaling role or remain at a nontoxic level in plants under salt stress conditions. To alleviate and eliminate highly reactive oxygen species, plants have evolved a battery of antioxidative mechanisms, and the antioxidant defense system includes hydrophilic and hydrophobic antioxidants and enzymes such as SOD and CAT (Shalata and Tal, 1998). SOD activities in P. pruinosa were significantly higher than those in P. euphratica when the seeds were exposed to concentrations of NaCl above 0.4%, while CAT activities in P. pruinosa were also higher than those in P. euphratica when seeds were treated with 1.0% NaCl. Both antioxidases could play a crucial role in scavenging excess ROS (H<sub>2</sub>O<sub>2</sub>) induced by salt stress.
Altogether, a significant proportion of the antioxidants induced by salt stress were secondary metabolites, including a wide range of compounds derived primarily from the phenylpropanoid pathway (Dixon and Paiva, 1995).

Brassinosteroids are involved in a wide range of growth and development aspects in plants (Kagale et al., 2007). One of the most interesting influences of brassinosteroids is their ability to confer resistance to various abiotic stresses. Several brassinosteroid biosynthesis genes have been identified by molecular genetic analysis and reverse genetic analysis (Takahashi et al., 2005). Among the gene families enriched in the brassinosteroid pathway, DET2 and BAS1 were highly expressed in P. euphratica and exhibited relatively low expression in P. pruinosa, while ROT3 was highly expressed in P. pruinosa but was not detected in P. euphratica (**Figures 6**, **8**). Moreover, DWF4 and BR6OX2 each contain two copies, and each copy exhibited a different expression pattern between the two poplars. The results suggest that the fine regulation of the synthesis of brassinosteroids in desert poplars contributes to their environmental adaptation.

## CONCLUSION

In this study, a multidimensional transcriptome dataset allowed us to discern highly dynamic and coordinated gene expression, as well as functional and regulatory shifts exhibited by the germinating seeds of two species in response to continuous salinity stress. Based on these results, we conclude that the fine regulation of the synthesis of flavonoids and brassinosteroids in desert poplars contributes to their environmental adaptation.

#### DATA AVAILABILITY

The Illumina sequencing data sets are available at the NCBI Sequence Read Archive (SRA) database with the project accession number: PRJNA484685.

#### AUTHOR CONTRIBUTIONS

DW conceived and designed the experiments. CZ, WL, and YL conducted the bioinformatic work and wrote the manuscript. XuZ, XB, and ZN contributed to conducting experiments for physiology and transcript analysis. XiZ and ZL provided assistance in sample collection. All authors read, revised and approved the final manuscript.

#### FUNDING

This research was supported by the National Science Foundation of China (Nos. 31470620 and 31870580).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00231/full#supplementary-material

FIGURES S1, S2 | Reproducibility of each trio of biological replicates. The samples were collected at different time points, and total RNA was isolated to construct RNA-seq libraries for them independently. FPKM values of all the genes expressed in at least one of the 36 sequenced samples are shown in scatter plots and were used as input for the Pearson product-moment correlation coefficient analysis. The correlations between the biological replicates were high in both species [average r = 0.945 in P. euphratica (Supplementary Figure S1) and r = 0.939 in P. pruinosa (Supplementary Figure S2)].

FIGURE S3 | Number of genes expressed at each time point (A) and hierarchical clustering of six time points for P. euphratica and P. pruinosa (B).

FIGURE S4 | Hierarchical clustering of six time points for P. euphratica and P. pruinosa. (A,B) The expression patterns of co-expression modules of P. euphratica (A) and P. pruinosa (B), ordered according to the sample time points of their peak expression. (C) The gene numbers and the expression fitted curves of all the modules in (A,B). For each gene, the FPKM value normalized by the maximum value of all FPKM values of the gene over all time points is shown.

FIGURE S5 | GO function enrichment of the DEGs for seed germination processes.

FIGURE S6 | KEGG function enrichment of the DEGs for seed germination processes.

FIGURE S7 | The expression patterns of the hormone-related genes in the two poplars. Normalized expression levels of genes related to ethylene, GA and ABA are shown.

FIGURE S8 | GO function enrichment of the DEGs for salt tolerance variety of the two species.

FIGURE S9 | KEGG function enrichment of the DEGs for salt tolerance variety of the two species.

TABLE S1 | Overview of the data size of all the samples of P. euphratica and P. pruinosa.

TABLE S2 | Summary of the Illumina sequencing reads and the matches in P. euphratica and P. pruinosa.

TABLE S3 | The details of abbreviations.

TABLE S4 | Primers used for real-time quantitative PCR in this study.
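The two FPKM calculations described in the captions for Figures S1, S2, and S4 (Pearson correlation between replicates, and normalization of each gene by its maximum FPKM over all time points) can be sketched as follows. This is an illustrative implementation in plain Python, not the authors' pipeline; the function names and the toy FPKM values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def normalize_by_max(fpkm_by_time):
    """Scale one gene's FPKM values so that its peak time point equals 1."""
    peak = max(fpkm_by_time)
    if peak == 0:
        return [0.0 for _ in fpkm_by_time]
    return [value / peak for value in fpkm_by_time]


# Toy FPKM values for the same genes in two biological replicates.
replicate_1 = [1.0, 2.0, 3.0]
replicate_2 = [2.0, 4.0, 6.0]
r = pearson_r(replicate_1, replicate_2)

# Toy FPKM values for one gene across three time points.
scaled = normalize_by_max([2.0, 8.0, 4.0])
print(r, scaled)
```

Scaling each gene to its own maximum places all genes on a common 0 to 1 range, so that modules can be ordered by the time point of peak expression regardless of absolute expression level.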




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Luo, Li, Zhang, Bai, Niu, Zhang, Li and Wan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Conserved MicroRNA Act Boldly During Sprout Development and Quality Formation in Pingyang Tezaocha (Camellia sinensis)

#### Lei Zhao<sup>1,2†</sup>, Changsong Chen<sup>3</sup>, Yu Wang<sup>1</sup>, Jiazhi Shen<sup>4</sup> and Zhaotang Ding<sup>1</sup>\*

<sup>1</sup> Qingdao Key Laboratory of Genetic Improvement and Breeding in Horticultural Plants, College of Horticulture, Qingdao Agricultural University, Qingdao, China, <sup>2</sup> Department of Plant Science and Landscape Architecture, University of Maryland, College Park, MD, United States, <sup>3</sup> Tea Research Institute, Fujian Academy of Agricultural Sciences, Fu'an, China, <sup>4</sup> College of Horticulture, Nanjing Agricultural University, Nanjing, China

#### Edited by:

Yuriy L. Orlov, Institute of Cytology and Genetics (RAS), Russia

#### Reviewed by:

Lidiia Samarina, Russian Research Institute of Floriculture and Subtropical Crops (RRIFSC), Russia; Oksana Gennadèvna Belous, Russian Research Institute of Floriculture and Subtropical Crops (RRIFSC), Russia; Weiwei Wen, Huazhong Agricultural University, China

\*Correspondence: Zhaotang Ding dzttea@163.com orcid.org/0000-0002-6814-3038

†Lei Zhao orcid.org/0000-0003-1019-3814

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

Received: 15 October 2018 Accepted: 04 March 2019 Published: 28 March 2019

#### Citation:

Zhao L, Chen C, Wang Y, Shen J and Ding Z (2019) Conserved MicroRNA Act Boldly During Sprout Development and Quality Formation in Pingyang Tezaocha (Camellia sinensis). Front. Genet. 10:237. doi: 10.3389/fgene.2019.00237

Tea tree [Camellia sinensis (L.) O. Kuntze] is an important commercial plant whose leaves (and sometimes tender stems) are harvested, and it has many medicinal uses. The development of new sprouts directly affects the yield and quality of tea products, which is especially significant for Pingyang Tezaocha (PYTZ), a cultivar that takes up a large share of the early spring tea market. MicroRNAs (miRNAs), particularly conserved miRNAs, often sit at the center of subtle and complex gene regulatory systems and, together with other factors, precisely control biological processes in a spatio-temporal pattern. Here, the quality-determining metabolites catechins, theanine and caffeine were quantified in PYTZ sprouts, including buds (sBud) and different developmental stages of leaves (sL1, sL2) and stems (sS1, sS2). A total of 15 miRNA libraries, with three replicates for each of the same tissues, were constructed to explore the miRNAs vital to the biological processes of development and quality formation. We analyzed the whole miRNA profiles during sprout development and defined conserved miRNA families in the tea plant. The differentially expressed miRNAs related to the developmental stages of buds, leaves, and stems were described. Twenty-one miRNAs and eight miRNA-TF pairs that most likely participate in regulating development, and at least two miRNA-TF-metabolite triplets that participate in both development and quality formation, were screened. Our results indicate that conserved miRNAs act boldly during important biological processes: they (i) are more likely to be linked with morphological function in primary metabolism during sprout development, (ii) hold an important position in secondary metabolism during quality formation in the tea plant, and (iii) coordinate with transcription factors in forming networks of complex multicellular organism regulation.

Keywords: conserved miRNA, sprouts development, quality formation, transcription factors, Camellia sinensis (L.) O. Kuntze

## INTRODUCTION

Originally produced in China, green tea is now a star among the most popular beverages owing to its good taste, health benefits, and mysterious process, and it brings considerable economic benefit to planting and exporting countries such as China, India, Kenya, and Sri Lanka. Based on reports from the China Tea Marketing Association 2017 (http://www.stats.gov.cn/), approximately 10.3 million tons of fresh tea leaves were harvested to produce various tea products. As young leaves and tender stems from the tea tree are processed to prepare "tea", their developmental characteristics are expected to have a direct and significant bearing on the yield and quality of tea products.

MicroRNAs (miRNAs) are endogenous, single-stranded, noncoding small RNAs that can regulate their target messenger RNAs (mRNAs) at the chromatin level and can also bind perfectly or imperfectly to their targets to suppress translation by cleaving them at a complementary site (Rubio-Somoza and Weigel, 2011; Zheng et al., 2015). Plant miRNA families have thus been placed at the center of gene expression programs, typically with small numbers per cell and large amounts of transcripts (Voinnet, 2009), yet they have powerful effects on developmental regulation, morphogenesis, and stress responses (Axtell and Bowman, 2008; De Lima et al., 2012; Jones-Rhoades, 2012; Yang et al., 2013). Still, gene regulation can seldom be fully described without considering transcription factors. Many transcription factors in the plant kingdom are highly conserved even across large evolutionary distances, and some of them share similar developmental roles in diverse species (Zhao et al., 2013; Xu et al., 2016). It has been hypothesized that miRNA binding sites evolve faster than transcription factor binding sites, as there are many more ways to repress a gene than to activate one (Chen and Rajewsky, 2007).
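The distinction between perfect and imperfect miRNA-target binding mentioned above can be illustrated with a small sketch that counts Watson-Crick mismatches between a miRNA and a candidate target site. The sequences are toy examples invented for illustration, the function names are hypothetical, and G:U wobble pairs are deliberately treated as mismatches for simplicity; real target prediction tools apply position-dependent scoring that this sketch does not attempt.

```python
# Watson-Crick pairing partners for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def count_mismatches(mirna, target_site):
    """Count positions where the miRNA cannot form a Watson-Crick pair.

    Both sequences are written 5'->3' and must have the same length.
    A perfectly complementary site is the reverse complement of the miRNA.
    """
    if len(mirna) != len(target_site):
        raise ValueError("miRNA and target site must have the same length")
    perfect_site = reverse_complement(mirna)
    return sum(1 for a, b in zip(perfect_site, target_site) if a != b)


# Toy 21-nt miRNA (not a real annotated sequence).
toy_mirna = "AUGGCUUACGGAUCCGUAAGC"
perfect_site = reverse_complement(toy_mirna)

# Introduce a single substitution at position 4 of the site.
substitute = "A" if perfect_site[3] != "A" else "C"
imperfect_site = perfect_site[:3] + substitute + perfect_site[4:]

print(count_mismatches(toy_mirna, perfect_site))
print(count_mismatches(toy_mirna, imperfect_site))
```

A site with zero mismatches corresponds to perfect binding, while a small number of mismatches corresponds to the imperfect binding described in the text.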

However, some miRNAs seem extremely well conserved (Lu et al., 2006). A large portion of the conserved miRNAs and their conventional targets, transcription factors as well as F-box proteins, play pivotal roles in governing plastic behavior during development, such as phase change and plant architecture (Kidner, 2010; Rubio-Somoza and Weigel, 2011), making miRNA-TF mRNA pairs all the more fascinating. At least seven kinds of miRNAs have been widely reported to act in the three stages of leaf development and in leaf morphology. The initiation stage, in which leaf primordia form when a group of cells localized on the flanks of the shoot apical meristem (SAM) lose their indeterminacy, is commonly considered the key stage in the leaf development process (Micol and Hake, 2003). During this stage, the miR390/ARF pathway has been described in the regulation of leaf polarity (Braybrook and Kuhlemeier, 2010). miR165/166 regulates leaf polarity by targeting the HD-ZIP genes and thus controls adaxial cell fate (Rubio-Somoza and Weigel, 2011; Sun, 2012). A recent study revealed that the leaf dorsoventral (adaxial-abaxial) polarity signals, which may cause mechanical heterogeneity of the cell wall, are linked to the methylesterification of cell-wall pectins in tomato and Arabidopsis (Qi et al., 2017). The shape and architecture of the leaf require the orchestration of auxin, KNOX genes and miRNA regulation. KNOX genes can be down-regulated by CUC transcriptional regulators, which are important for building organ boundaries (Takada and Tasaka, 2002; Chen, 2009), floral patterning, and leaf morphogenesis (Micol and Hake, 2003; Engstrom et al., 2004). NAC (NAM, CUC1/2-like) is one branch of the CUC gene family regulated by miR164. MiR164/GOB (a CUC2 ortholog), well studied in tomato, is necessary for controlling leaf polarity and determining whether the leaf margins are serrated or smooth (Berger et al., 2009).
MiR319, encoded by three loci including miR-JAW (miR319a) in Arabidopsis, regulates five TEOSINTE BRANCHED/CYCLOIDEA/PCF (TCP) family members (Palatnik et al., 2003, 2007), which can in turn lead to the regulation of CUC genes. Overexpression of miR319 or loss of function of these five TCP genes results in crinkly leaves (Palatnik et al., 2003; Liu et al., 2018). TCPs regulate growth and senescence via the jasmonic acid synthesis pathway (Schommer et al., 2008). Cell number and cell size, which are reported to be under precise spatial and temporal control (Usami et al., 2009), are mainly regulated by a SQUAMOSA PROMOTER BINDING PROTEIN-LIKE (SPL)-dependent pathway (Ferreira e Silva et al., 2014; Xu et al., 2016). In Arabidopsis, miR156 targets 11 of the 17 SPL genes, among which SPL3, 4, and 5 accelerate the juvenile-to-adult phase change, while SPL9 and SPL15 regulate plastochron length (Wang et al., 2008; Wu et al., 2009; Xu et al., 2016). MiR396 plays an important role in plant leaf growth and development, most likely by repressing Growth-Regulating Factor (GRF) genes in Arabidopsis. Transgenic miR396-overexpressing plants have narrow-leaf phenotypes due to a reduction in cell number (Liu et al., 2009).

During a cultivation history of more than 2,000 years in China, numerous elite tea varieties have been bred for different characteristics such as early germination, high yield, good performance under environmental stress, and distinctive aroma or flavor. Camellia sinensis (L.) O. Kuntze "Pingyang Tezaocha" (PYTZ), an elite cultivar with short internodes selected in Zhejiang Province in the late century, is now popular in the northern tea area of China owing to its high yield of about 3 tons of green tea products per hectare [data from e-China tea, Tea Research Institute, Chinese Academy of Agricultural Sciences (TRI, CAAS)] (http://www.e-chinatea.cn/other\_shujuku.aspx) (Zhao et al., 2017). Moreover, its early germination in April helps it take up a large share of the early spring tea market annually (Yang, 2015). In the tea plant, phenolic compounds are among the most important secondary metabolites, accounting for 18% to 36% of the dry weight of fresh leaves and tender stems (Jiang et al., 2013); they are also the main flavor components and functional ingredients and have been intensely studied in the past decades for their effective and extensive pharmacological activities (Zhao et al., 2013). Accordingly, the accumulation of some phenolics, such as the nongalloylated catechins epigallocatechin (EGC) and epicatechin (EC) (Zhao et al., 2017), quinic acid and flavonol glycosides, gradually increases with the developmental stages (Jiang et al., 2013).
Moreover, some key enzyme genes involved in the biosynthetic pathway of phenolic compounds have similar expression patterns in different organs and in leaves at different developmental stages, such as CsDHQ/DHS2 (DHQ/DHS, 3-dehydroquinate synthase), CsCHS1 (CHS, chalcone synthase), CsUGT78E1 (UGT, uridine diphosphate glycosyltransferase), Cs4CL1 (4CL, 4-coumaroyl-CoA ligase), and CsF3′H1 (F3′H, flavonoid 3′-hydroxylase), as do some TF genes such as Sg4 of the CsMYB family (Jiang et al., 2013; Li et al., 2017a), CsMYB5-1 and bHLH24-3 (Jiang et al., 2013). Beyond catechins, theanine and caffeine are the other two characteristic constituents that determine tea quality (Xia et al., 2017). Whether primary or secondary, metabolism undoubtedly proceeds in step with plant growth and development and is under precise spatio-temporal network control (Chen and Rajewsky, 2007).

Here, the contents of three main kinds of taste compounds (catechins, theanine, and caffeine) in different tissues of PYTZ spring sprouts, including the bud, two stages of leaves, and stems, were quantified by high-performance liquid chromatography-mass spectrometry (HPLC-MS/MS). miRNA libraries of the same tissues were constructed and sequenced with Illumina HiSeq technology to explore how miRNAs act between development and quality formation. Key miRNAs involved in regulating sprout development were inferred from computational expression analysis. Conserved miRNA families in the tea plant were identified and were the main focus of the study. The extent to which the conserved miRNAs might be linked with morphogenesis during sprout development was further investigated in six other morphologically different tea cultivars. The regulation of metabolic pathways by conserved miRNAs together with their target genes, especially transcription factor genes that ultimately determine tea quality, is studied and discussed. Consistency between developmental performance and quality needs better mutual understanding and balance in the subsequent screening of tea cultivars.

## MATERIALS AND METHODS

### Plant Material and RNA Isolation

Four-year-old tea plants of the cultivar Camellia sinensis (L.) O. Kuntze "Pingyang Tezaocha" (PYTZ) were grown in the Germplasm of Qingdao Tea Repository at the Tea Research Institute located in Qingdao (35°N, 119°E, Qingdao city, China) under natural light conditions. To capture the continuity of gene expression during development of the new branch, we collected samples of the bud, leaves, and stems in order from the top downwards (**Figure 1A**) (Shen et al., 2019). For normalization, the following tissues were measured and collected: the bud, about 3 cm long (sBud); the first leaf below the bud, about 3.5 cm long (sL1); the second, more mature leaf below the bud, 4.5 cm long (sL2); the stem between the first and second leaves, about 1.5 cm long (sS1); and the more mature stem between the second leaf and the fish leaf, 2.1 cm long (sS2). For RNA samples, healthy buds, leaves, and stems at different developmental stages were collected, frozen immediately in liquid nitrogen, and stored at −80°C before use (Fan et al., 2015). Three biological replicates were collected and pooled from at least five individuals, and each biological replicate contained more than five buds, leaves, and stems. Total RNA for each sample was extracted using TRIzol reagent (Invitrogen, Burlington, ON, Canada). The quality, purity, concentration, and integrity of the total RNA were checked using 1% agarose gel electrophoresis, a NanoDrop Photometer Spectrophotometer (IMPLEN, Westlake Village, CA, USA), the Qubit RNA Assay Kit on a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA), and the RNA Nano 6000 Assay Kit of the Bioanalyser 2100 system (Agilent Technologies, Santa Clara, CA, USA), respectively. RNA samples with a 260/280 ratio between 1.8 and 2.0, a 260/230 ratio between 2.0 and 2.5, and an RNA integrity number above 8.0 were used for the sequencing and quantitative PCR analyses described below.
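The RNA acceptance criteria stated above can be expressed as a simple filter. The following is a minimal sketch, not the authors' pipeline: the `RnaSample` class, the sample names, and all readings are hypothetical, and only the three thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass
class RnaSample:
    """Hypothetical container for the QC readings of one RNA extract."""
    name: str
    a260_280: float  # absorbance ratio 260/280
    a260_230: float  # absorbance ratio 260/230
    rin: float       # RNA integrity number


def passes_qc(sample: RnaSample) -> bool:
    """Apply the thresholds given in the text:
    260/280 in [1.8, 2.0], 260/230 in [2.0, 2.5], RIN above 8.0."""
    return (
        1.8 <= sample.a260_280 <= 2.0
        and 2.0 <= sample.a260_230 <= 2.5
        and sample.rin > 8.0
    )


# Illustrative readings only (not data from the study).
samples = [
    RnaSample("sBud_rep1", 1.95, 2.20, 8.7),
    RnaSample("sS2_rep3", 1.72, 2.10, 8.9),  # fails the 260/280 window
]
accepted = [s.name for s in samples if passes_qc(s)]
print(accepted)
```

Whether the interval endpoints are inclusive is not stated in the text; the sketch treats the ratio windows as inclusive and the integrity threshold as strict.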

### Extraction and Quantification of Catechins, Caffeine, and Theanine

Catechins and caffeine were extracted following a previously described method with minor modifications (Jiang et al., 2013; Wang et al., 2018): 0.2 g of each fresh sample (sBud, sL1, sL2, sS1, and sS2) was ground in liquid nitrogen and extracted with an extraction solution (80% methanol and 20% water), followed by vortexing and sonication for 30 min at a low temperature. The samples were then centrifuged at 3,500 g for 15 min, and the residues were re-extracted twice as described above until the final volume of the pooled supernatants was 2 mL. The supernatants were then extracted three times with chloroform and three times with ethyl acetate. The pooled supernatant was concentrated under vacuum at a low temperature to remove the ethyl acetate. Finally, the product was dissolved in 200 µL methanol for quantification. Theanine was extracted by the method of Jeon et al. (2017) with some modifications. One gram of each finely ground sample was mixed with 100 mL boiling distilled water and brewed for 10 min with magnetic stirring. After cooling, all extracts were filtered through a 0.45 µm nylon membrane, and approximately 1 mL of each sample solution was centrifuged at 13,000 rpm for 10 min prior to HPLC analysis.

The isolation and detection of the quality-related metabolites catechins, caffeine, and theanine in PYTZ sprouts were performed by high-performance liquid chromatography-mass spectrometry (HPLC-MS/MS). HPLC analyses were performed on an Agilent 1298 LC system (Agilent, Santa Clara, CA, USA), and MS/MS detection was carried out using an Agilent 6460 Series Triple Quadrupole instrument (Agilent). Caffeine, theanine, and six major tea catechin standards, namely (-)-epigallocatechin gallate (EGCG), (-)-epigallocatechin (EGC), (-)-epicatechin gallate (ECG), (-)-epicatechin (EC), (-)-gallocatechin (GC), and (+)-catechin (C), were purchased from Sigma (St Louis, MO, USA). An Agilent ZORBAX RRHD Eclipse Plus C18 column (particle size: 1.8 µm, length: 100 mm, internal diameter: 2.1 mm) was used at a flow rate of 1 mL min<sup>−1</sup>. For catechins and caffeine, the mobile phase consisted of 0.4% acetic acid in water and 100% acetonitrile; the proportion of the latter increased linearly from 0 to 10% (v/v) within 5 min, then changed to 35% at 20 min, 10% at 21 min, and 1% at 25 min. For theanine, the mobile phase consisted of HPLC water and acetonitrile; the proportion of the former was held at 100% for 10 min, decreased linearly from 100% to 20% (v/v) by 12 min, was held at 20% until 20 min, returned to 100% at 22 min, and was held at 100% until 40 min. Mass spectra were acquired simultaneously using electrospray ionization in the positive and negative ionization modes over the range of m/z 100 to 2000. A drying gas flow of 6 L min<sup>−1</sup>, a drying gas temperature of 350°C, a nebulizer pressure of 45 psi, and a capillary voltage of 3,500 V were used. The compounds were identified qualitatively using LC-MS by comparing the retention times (tR), wavelengths of maximum absorbance (λmax), protonated/deprotonated molecules ([M+H]<sup>+</sup>/[M–H]<sup>−</sup>), and major fragment ions with those of the authentic standards and published literature (Jiang et al., 2013; Jeon et al., 2017; Wang et al., 2018).
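The two gradient programs described above can be written down as breakpoint tables, which makes the solvent composition at any time point easy to check. This is an illustrative sketch only: the breakpoints are transcribed from the text, while the function name and the assumption that every segment between breakpoints is a linear ramp are ours.

```python
# Breakpoints as (time in min, percentage of the named solvent), from the text.
CATECHIN_ACETONITRILE = [(0, 0), (5, 10), (20, 35), (21, 10), (25, 1)]
THEANINE_WATER = [(0, 100), (10, 100), (12, 20), (20, 20), (22, 100), (40, 100)]


def percent_at(t_min, program):
    """Linearly interpolate the solvent percentage at time t_min.

    Assumption: each segment between two breakpoints is a linear ramp.
    Times outside the program are clamped to the first or last value.
    """
    if t_min <= program[0][0]:
        return float(program[0][1])
    for (t0, p0), (t1, p1) in zip(program, program[1:]):
        if t0 <= t_min <= t1:
            return p0 + (p1 - p0) * (t_min - t0) / (t1 - t0)
    return float(program[-1][1])


# Acetonitrile share halfway through the first ramp of the catechin program.
print(percent_at(2.5, CATECHIN_ACETONITRILE))
# Water share in the middle of the 10 to 12 min step of the theanine program.
print(percent_at(11, THEANINE_WATER))
```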

### Library Construction and Small RNA Sequencing

For sRNA library construction, 3 µg of total RNA per sample was used as input. Sequencing libraries were generated using the NEBNext Multiplex Small RNA Library Prep Set for Illumina (NEB, USA). The libraries were sequenced on an Illumina HiSeq™ 2500 sequencer by Gene Denovo Biotechnology Co. (Guangzhou, China). The resulting 50 bp single-end reads were filtered to remove impure sequences (adaptor sequences and low-quality reads), and cellular structural RNAs such as rRNA, snoRNA, snRNA, and tRNA were removed based on alignment with small RNAs in the GenBank database (Release 209.0) and the Rfam database (11.0). The clean reads were mapped to the tea tree genome without mismatches to analyze their expression and distribution (NCBI Sequence Read Archive Database under accession PRJNA381277). Tags that mapped to exons, introns, or repeat sequences were also removed.

### Identification of Known miRNAs and Novel miRNA

Because tea miRNAs were not included in miRBase, the clean tags were subjected to a Blastn search against miRBase 21.0, allowing two mismatches, to identify and annotate known miRNAs by homology to the miRNAs of other plants. All known miRNAs were further checked for their presence across 72 plant species to assess their conservation. The unannotated tags were aligned with the tea tree genome to identify novel miRNA candidates according to their genome positions and hairpin structures predicted by the software Mireap (https://github.com/liqb/mireap, version 0.20).

### miRNA Expression Profiles and Prediction of Target mRNAs

The expression levels of both known and novel miRNAs in each sample were calculated and normalized to transcripts per million (TPM) (Wu et al., 2017b) using the formula $\mathrm{TPM} = \dfrac{\text{actual miRNA counts}}{\text{total counts of clean tags}} \times 10^{6}$. Meanwhile, the correlation coefficient between every pair of replicates was calculated to evaluate repeatability between samples. Differential expression analysis across samples was performed using the DEGseq (2010) R package. miRNAs with p < 0.05 and log2 fold change ≥ 2 in a comparison were considered significantly differentially expressed miRNAs (DEM). Candidate target genes were predicted using the software PatMatch (Version 1.2) against the tea tree genome under the following stringent criteria: no more than four mismatches between sRNA and target (G-U pairs count as 0.5 mismatches); in the miRNA/target duplex, counted from the 5′ end of the miRNA, (a) no more than two adjacent mismatches, (b) no adjacent mismatches in positions 2–12, (c) no mismatches in positions 10–11, and (d) no more than 2.5 mismatches in positions 1–12; and the minimum free energy (MFE) of the miRNA/target duplex should be no less than 60% of the MFE of the miRNA bound to its perfect complement (Yan et al., 2005; Wu et al., 2017a).
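The TPM normalization and the DEM thresholds described above can be sketched in a few lines. This is a hedged illustration rather than the DEGseq procedure: the p-value is assumed to be supplied by the differential expression tool, and the pseudocount and the use of the absolute log2 fold change (so that both up- and down-regulated miRNAs pass) are our assumptions, not statements from the text.

```python
import math


def tpm(mirna_count, total_clean_tags):
    """TPM = miRNA count / total clean tags * 10^6, as defined in the text."""
    return mirna_count / total_clean_tags * 1e6


def log2_fold_change(tpm_a, tpm_b, pseudocount=0.01):
    """log2(B / A); the pseudocount (an assumption) avoids division by zero."""
    return math.log2((tpm_b + pseudocount) / (tpm_a + pseudocount))


def is_dem(log2fc, p_value, fc_cutoff=2.0, p_cutoff=0.05):
    """Thresholds from the text: p < 0.05 and log2 fold change >= 2.
    Assumption: the cutoff applies to the absolute log2 fold change."""
    return p_value < p_cutoff and abs(log2fc) >= fc_cutoff


# Illustrative numbers only (not data from the study).
a = tpm(120, 12_000_000)   # 10 TPM in tissue A
b = tpm(600, 10_000_000)   # 60 TPM in tissue B
lfc = log2_fold_change(a, b)
print(round(a, 2), round(b, 2), is_dem(lfc, p_value=0.003))
```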

### Functional Enrichment Analysis of Target mRNAs

Gene Ontology (GO) enrichment analysis and KEGG pathway analysis were performed on the target mRNAs of DEM to characterize their biological functions comprehensively. All DEM target genes were mapped to GO terms in the Gene Ontology database (http://www.geneontology.org/), and the significantly enriched GO terms relative to the tea tree genome background (FDR ≤ 0.05, derived from the calculated p-value) were categorized into three levels: "biological process," "cellular component," and "molecular function". KEGG is the major public pathway-related database (Kanehisa et al., 2008) and helps to further understand how genes interact with each other to perform certain biological functions. The calculation is the same as that in the GO analysis. KEGG pathway enrichment analysis identifies significantly enriched metabolic pathways or signal transduction pathways (Liu et al., 2014). Some online platforms or commercial services, based on the same or different mathematical algorithms, can help to reconstruct gene networks, for example STRING (https://string-db.org/), Pathway Commons (https://www.pathwaycommons.org/) (Luna et al., 2016), ANDSystem (Ivanisenko et al., 2019), and others (Saik et al., 2018). Functional enrichment was carried out both for the target genes of miRNAs in single samples and for the target genes of DEM in each comparison group. Here, STRING was used to show the enrichment networks.
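The text does not state the enrichment formula. A common choice for GO and KEGG over-representation analysis is the one-sided hypergeometric test, and the sketch below assumes that test purely for illustration; the function name, argument names, and numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(background, background_in_term, targets, targets_in_term):
    """P(X >= targets_in_term) under a hypergeometric model.

    background          N: annotated genes in the genome background
    background_in_term  M: background genes annotated to the term
    targets             n: DEM target genes with annotation
    targets_in_term     m: target genes annotated to the term
    Assumption: this is the test used; the source does not specify it.
    """
    upper = min(background_in_term, targets)
    total = comb(background, targets)
    tail = sum(
        comb(background_in_term, i) * comb(background - background_in_term, targets - i)
        for i in range(targets_in_term, upper + 1)
    )
    return tail / total


# Toy example: 4 of 10 background genes carry the term, 2 of 3 targets do.
print(hypergeom_enrichment_p(10, 4, 3, 2))
```

Exact integer arithmetic via `math.comb` keeps the tail sum free of rounding error until the final division, at the cost of speed for genome-scale inputs.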

Trend analysis was applied to the expression of all miRNAs across consecutive tissue samples (Bud/L1/L2 and Bud/S1/S2) to cluster genes with similar expression patterns. It was carried out with the software Short Time-series Expression Miner (Ernst and Bar-Joseph, 2006) (parameters -pro 20 -ratio 1.0). GO and KEGG pathway enrichment analyses were then performed on the target genes of the miRNAs in each trend, and p-values were obtained by hypothesis testing. GO terms and KEGG pathways with a Q value ≤ 0.05 were defined as significant, where the Q value is the p-value corrected by FDR (Benjamini and Hochberg, 1995).
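The FDR correction that turns p-values into Q values can be sketched as follows. This is a minimal pure-Python version of the Benjamini-Hochberg step-up adjustment; the function name and the example p-values are illustrative, and the enrichment software used in the study may differ in implementation details.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (Q values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone Q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


# Illustrative p-values for four GO terms (not data from the study).
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in qvals]
print(qvals, significant)
```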

### Quantitative PCR for miRNAs and mRNAs

The expression profiles of mature miRNAs and their potential target mRNAs were further validated by quantitative PCR. First-strand cDNA of miRNAs was synthesized with the Mir-X™ miRNA First-Strand Synthesis Kit (Cat. No. 638313, Clontech Laboratories, Inc., CA, USA), with 5.8S rRNA serving as the internal control. First-strand cDNA of mRNAs was synthesized using the PrimeScript™ RT reagent Kit with gDNA Eraser (Perfect Real Time) (Code No. RR047A, Takara, Tokyo, Japan), with the glyceraldehyde-3-phosphate dehydrogenase (GAPDH) gene used for normalization. The primers for miRNAs and mRNAs are listed in **Supplementary Table 1**. Quantitative PCR was carried out with the SYBR Premix Ex Taq™ II Kit (Tli RNase H Plus) (Code No. RR820A, Takara, Tokyo, Japan) on a LightCycler 480 instrument (Roche Molecular Systems, Inc., Indianapolis, IN, USA). The miRNA amplification program was 95°C for 10 min, followed by 40 cycles of 95°C for 15 s and 60°C for 1 min (Zheng et al., 2015). The mRNA amplification program was 94°C for 10 s, 58°C for 10 s, and 72°C for 10 s (Li et al., 2017b). Each reaction was performed in triplicate, with 5.8S rRNA and GAPDH used as the endogenous controls for miRNAs and mRNAs, respectively. CT values obtained through quantitative PCR were analyzed using the 2<sup>−ΔΔCT</sup> method to calculate relative fold change values.
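The relative quantification step can be illustrated with a short calculation. This is a minimal sketch of the standard 2<sup>−ΔΔCT</sup> formula: the reference gene would be 5.8S rRNA or GAPDH as in the text, whereas the choice of calibrator sample and all CT values below are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Fold change by the 2^(-ddCT) method.

    Each argument is a list of technical-replicate CT values:
    target and reference gene in the sample of interest, then the
    same two genes in the calibrator sample.
    """
    delta_sample = mean(ct_target) - mean(ct_reference)
    delta_calibrator = mean(ct_target_cal) - mean(ct_reference_cal)
    delta_delta = delta_sample - delta_calibrator
    return 2 ** (-delta_delta)


# Hypothetical triplicate CT values (not data from the study).
fold = relative_expression(
    ct_target=[22.0, 22.0, 22.0],
    ct_reference=[15.0, 15.0, 15.0],
    ct_target_cal=[24.0, 24.0, 24.0],
    ct_reference_cal=[15.0, 15.0, 15.0],
)
print(fold)  # ddCT = (22 - 15) - (24 - 15) = -2, so the fold change is 4.0
```

The formula assumes roughly equal, near-100% amplification efficiency for the target and the reference gene; when that does not hold, an efficiency-corrected model is usually preferred.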

In addition, fresh spring sprouts were plucked from seven tea varieties in an experimental tea garden of the Tea Research Institute, Fujian Academy of Agricultural Sciences, in Fu'an, China (27°10′N, 119°35′E): Camellia sinensis "Jinfenghuang" (JFH), Camellia sinensis "Pingyang Tezaocha" (PYTZ), Camellia sinensis "Zhengdayin" (ZDY), Camellia sinensis "Dayewulong" (DYWL), Camellia sinensis "Huangdan" (HD), Camellia sinensis "Jiukeng 6" (JK), and Camellia sinensis "Queshe" (QS). The bud, leaves, and stems were sampled at the same positions described above for quantitative PCR analysis. The relative expression results are presented as a heatmap generated with TBtools (Chen et al., 2018).

### Transcription Factor Prediction and miRNA-mRNA-Metabolite Network Construction

All target mRNAs of DEM predicted above were searched by BLAST against the plant transcription factor (TF) database (http://planttfdb.cbi.pku.edu.cn/, version 4.0) to annotate potential TFs. The resulting target TF genes were classified by tissue for analysis and were further searched by BLAST against the tea tree genome (Xia et al., 2017) (NCBI Sequence Read Archive Database No. PRJNA381277). The methods for TF BLAST searches and expression analysis followed Zhao et al. (2017). Different expression profiles were finally grouped into profile 1, 2, and so on. The network analysis was based on the method of Savoi et al. (2016). mRNA-metabolite associations were obtained from the Pearson correlation coefficient between the content of each metabolite and the expression level of each mRNA, and pairs were retained when the absolute value of the correlation was larger than 0.9 and the p-value was smaller than 0.05. miRNA-mRNA associations were obtained from Spearman's correlation coefficient of their expression levels in TPM, and pairs were retained when the correlation was no larger than −0.5 and the p-value was smaller than 0.05. The network was visualized with Cytoscape (V3.6.0) (Praneenararat et al., 2012).
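The two correlation filters used to build the network can be sketched with the standard library alone. This is an illustration under stated assumptions: the helper names and the expression values are hypothetical, p-values are assumed to come from a statistics package and are passed in rather than computed, and "filtered" is read as "retained".

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def keep_mrna_metabolite(r, p_value):
    return abs(r) > 0.9 and p_value < 0.05   # thresholds from the text


def keep_mirna_mrna(rho, p_value):
    return rho <= -0.5 and p_value < 0.05    # thresholds from the text


# Toy profiles across five tissues (not data from the study).
mirna_tpm = [120, 80, 40, 30, 10]
mrna_expr = [5, 9, 20, 26, 40]
rho = spearman(mirna_tpm, mrna_expr)
print(rho, keep_mirna_mrna(rho, p_value=0.01))
```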

### Availability of Supporting Data

Clean Illumina sequencing reads of the 15 small RNA libraries of PYTZ sprouts have been deposited in the NCBI Sequence Read Archive Database under accession PRJNA510482.

## RESULTS

### The Contents of Quality-Related Metabolites in Tea Sprouts

The buds, the first leaves, and the first stems, sometimes together with the second leaves, are the most common raw materials for producing green tea. The contents of the main quality-related metabolites in tea sprouts that contribute to flavor and health-promoting functions were quantified by HPLC-MS/MS (**Figures 1B–J**). Among the three kinds of characteristic metabolites, total catechins accounted for the largest proportion, and their distribution differed significantly among tender leaves and stems (**Figure 1B**). The galloylated catechins EGCG (**Figure 1F**) and ECG (**Figure 1H**) made up 67.4% of the total catechin concentration, which was obtained by summing the individual components and ranged from 37.3 to 119.2 mg/g (**Figure 1B**). Both EGCG and ECG had relatively low concentrations in sS2. The levels of GC (**Figure 1C**), C (**Figure 1E**), and EC (**Figure 1G**) were low in the bud and leaves, especially for C. Caffeine concentrations ranged from 4.7 to 9.7 mg/g, with the highest level in sBud and the lowest in sS2 (**Figure 1I**). Theanine, however, showed its highest accumulation in the stems of PYTZ sprouts (**Figure 1J**).

### Overview of microRNA Profile and Its Mapping to Tea Tree Genome

To determine what role miRNAs play in the formation of characteristic metabolites during development of the PYTZ sprout, 15 sRNA-Seq libraries covering buds, leaves, and stems were separately sequenced on the Illumina HiSeq™ 2500 platform, generating a total of 292,653,360 raw reads. After removal of dirty reads containing adapters and low-quality bases, an average of 14,080,519 clean tags for sBud (bud), 12,776,739 for sL1 (the 1st leaf), 11,830,623 for sL2 (the 2nd leaf), 11,484,341 for sS1 (the younger stem), and 9,822,915 for sS2 (the older stem) were retained. The filtering data for each step are listed in **Supplementary Table 2**. Most clean tags were 21–24 nt long, with 24 nt sRNAs being the most abundant (**Figure 2A**). The proportions of tags of different lengths did not differ obviously among the five sample groups and generally increased up to 24 nt and then decreased. Notably, the number of 24 nt sRNAs was lowest in sS1 and highest in sL1.

About 76.97% of clean tags mapped perfectly to the tea tree genome (NCBI Sequence Read Archive Database No. PRJNA381277), indicating credible sequencing quality, and genome mapping rates were similar across samples. We removed tags mapped to exons on the positive-sense strand, which might be fragments of degraded mRNA (for mapping statistics, see **Supplementary Table 3**). Tags mapped to repeat sequences were also excluded. The clean tags were then aligned with the Rfam database (11.0), and the percentages of annotation are summarized in **Supplementary Table 4**. On average, rRNA, snRNA, snoRNA, and tRNA accounted for 18.11, 0.19, 0.56, and 1.01% of the tags in the samples, respectively.

### Known miRNAs Identification and Novel microRNAs Prediction in PYTZ

To identify known miRNAs in tea, all unannotated unique tags were searched by BLAST against plant miRNAs in miRBase (Release 21.0, June 2014). In total, 1,928,678 miRNA clean tags were identified from the 15 libraries (**Table 1**), and 156 known miRNAs were identified (**Supplementary Table 5**). Most of the identified known miRNAs (81.97%) belonged to 21 nt miRNA families, and the remaining ones belonged to 18–24 nt miRNA families (**Supplementary Table 5**). For the 156 known miRNAs, 122 precursors and 99 kinds of characteristic hairpin structures were identified. Precursor length varied from 71 to 288 nt, with an average of 144 nt, and the average minimum free energy (MFE) was −57.23 kcal/mol, ranging from −22.3 to −90.2 kcal/mol (**Supplementary Table 6**). miRNA abundance, including that of novel miRNAs, was higher in leaf than in stem (ratio value in **Table 1**) and was lowest in sS1.

Among the four nucleotides, cytosine (C) (32.09%) and uracil (U) (29.65%) were more frequent than guanine (G) (19.26%) and adenine (A) (19.00%). Across the five samples, U was frequent at the 1st, 17th, 22nd, and 23rd positions, averaging 84.34, 61.80, 54.57, and 52.48%, respectively (**Supplementary Figure 1**). C occupied a very high percentage (87.44%) at the 19th position. A had a relatively high proportion at the 9th position in stem (sS1 and sS2) and at the 17th position in leaf (sL1 and sL2); in contrast, A was seldom present at the 2nd, 13th, 18th, and 20th positions in any of the five tissues. In the first-nucleotide bias analysis, U was absolutely predominant in miRNAs 20, 21, and 22 nt long (**Supplementary Figure 2**).
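Position-specific nucleotide bias of the kind reported above can be computed by tallying bases at a given position. The sketch below is illustrative only: the function name and the toy sequences are ours, and it counts each distinct sequence once, whereas the study may have weighted sequences by read abundance.

```python
from collections import Counter


def position_frequencies(sequences, position):
    """Fraction of A, C, G, and U at a 1-based position.

    Sequences shorter than `position` are skipped.
    Assumption: each sequence counts once (no abundance weighting).
    """
    bases = [seq[position - 1] for seq in sequences if len(seq) >= position]
    if not bases:
        return {}
    counts = Counter(bases)
    return {base: counts[base] / len(bases) for base in "ACGU"}


# Toy mature miRNA sequences for illustration (not the study's data set).
toy_mirnas = [
    "UGACAGAAGAGAGUGAGCAC",
    "UCGGACCAGGCUUCAUUCCCC",
    "AUGC",
]
first = position_frequencies(toy_mirnas, 1)
fifth = position_frequencies(toy_mirnas, 5)
print(first["U"], fifth["A"])
```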

The remaining reads that could not be mapped to known miRNAs were used to identify novel miRNAs. A total of 384,768 novel miRNA tags were identified from the 15 libraries (**Table 1**), and 1,186 novel miRNAs were identified by predicting the hairpin structures of their precursor sequences (**Supplementary Table 7**). The novel miRNAs ranged from 18 to 27 nt in length; unlike the known miRNAs, the 22 nt miRNAs were the most abundant (46.71%), followed by the 21 nt miRNAs (41.23%). These novel miRNAs derived from 1,130 hairpin precursors. Precursor length varied from 65 to 373 nt, with an average of 178 nt. The average MFE was −56.26 kcal/mol, ranging from −18.1 to −292.3 kcal/mol (**Supplementary Table 7**). The number of novel miRNAs was highest in sL1 and lowest in sS1, the same trend as for known miRNAs.

### Whole miRNA Expression Characters in PYTZ

To characterize overall miRNA expression patterns in the spring sprouts of PYTZ, we first evaluated the reliability of the replicate experiments and the operational stability. The expression levels of all miRNAs, known and novel, from the 15 libraries were normalized to TPM, which was then used to compute correlation coefficients. The strong correlation between every pair of biological replicates within all five sample groups indicated that the sequencing results are highly reliable (**Supplementary Figure 3**). The correlations between sL1 and sL2 and between sS1 and sS2 were substantially higher than those between other groups, suggesting closely associated processes within the separate development of leaf and stem.

To pinpoint the miRNAs that might be responsible for tea shoot development, we first defined sBud/sL1/sL2 as Group 1 (G1) and sBud/sS1/sS2 as Group 2 (G2) to examine overall changes in miRNA expression. In G1, a total of 226 miRNAs were differentially expressed among the three tissues and were classified into 8 profiles according to their trends. Overall, miRNAs with a down-expression trend accounted for the larger share (69.5%) and belonged to profile 3 (75 miRNAs), profile 0 (67 miRNAs), and profile 1 (15 miRNAs). The miRNAs with an up-expression trend, which were most abundantly expressed in sL2, belonged to profile 6 (40 miRNAs), profile 7 (12 miRNAs), and profile 4 (1 miRNA). A further 9 and 7 miRNAs had their highest and lowest expression levels in sL1, respectively (**Figure 3A**).

In G2, a total of 273 miRNAs were differentially expressed among the three tissues and were classified into 8 profiles according to their trends. Overall, 91 miRNAs showed a down-expression trend, belonging to profile 3 (42 miRNAs), profile 1 (34 miRNAs), and profile 0 (15 miRNAs), and 64 miRNAs showed an up-expression trend, belonging to profile 6 (35 miRNAs), profile 4 (25 miRNAs), and profile 7 (4 miRNAs). Notably, 95 and 23 miRNAs showed their lowest and highest expression levels in sS1, respectively (**Figure 3B**).
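The grouping of miRNAs into down-trend, up-trend, and peak or valley patterns across three consecutive tissues can be illustrated with a simple shape classifier. This is a deliberately simplified sketch, not the STEM algorithm, which assigns genes to model profiles and tests their significance; the function name, labels, miRNA names, and TPM values are hypothetical.

```python
from collections import Counter


def trend(first, second, third):
    """Label the shape of an expression series over three ordered tissues."""
    if first == second == third:
        return "flat"
    if first >= second >= third:
        return "down"      # includes plateaus followed by a drop
    if first <= second <= third:
        return "up"
    if second > first and second > third:
        return "peak"      # highest in the middle tissue
    return "valley"        # lowest in the middle tissue


# Toy TPM values in the order bud, first tissue, second tissue.
toy_profiles = {
    "mirna_a": (90.0, 40.0, 10.0),
    "mirna_b": (5.0, 20.0, 60.0),
    "mirna_c": (10.0, 50.0, 12.0),
    "mirna_d": (30.0, 4.0, 28.0),
}
labels = {name: trend(*values) for name, values in toy_profiles.items()}
print(Counter(labels.values()))
```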

### Conserved miRNAs Families and Tissue-Specific miRNAs

The 156 known miRNAs belonged to 125 families, of which 27 were well conserved, being present in more than 10 of the 72 plant species (**Table 2**). miR156 was the most widely conserved, found in 51 plant species, followed by miR396 and miR166, which were conserved in 47 and 45 plant species, respectively.

#### TABLE 3 | Tissue specific miRNAs of PYTZ.

miRNA expression can reveal much about the regulation of new shoot elongation and development. To filter the tissue-specific miRNAs in PYTZ, we merged the DEM from sL1 and sL2 into sL and those from sS1 and sS2 into sS, removing duplicates in each case. The Venn diagram (**Figure 2B**) showed that most miRNAs were present in all tissues (82.49%) or in at least one tissue, regardless of whether their expression abundance was relatively high or medium. Interestingly, some miRNAs were expressed only in specific tissues. Five miRNAs were bud-specific, 31 were leaf-specific, and 28 were stem-specific, as summarized in **Table 3**.
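The merging of leaf and stem stages and the search for tissue-specific miRNAs amount to set operations. The following is a minimal sketch under our own assumptions: the function name is hypothetical, the miRNA identifiers are made up for illustration, and "expressed" is taken as simple membership in a set.

```python
def tissue_specific(expressed):
    """Map each tissue to the miRNAs found in that tissue and no other."""
    specific = {}
    for tissue, mirnas in expressed.items():
        others = set().union(
            *(m for t, m in expressed.items() if t != tissue)
        )
        specific[tissue] = mirnas - others
    return specific


# Toy sets; set union merges the two stages and removes duplicates.
sL1 = {"mir_a", "mir_b", "mir_c"}
sL2 = {"mir_b", "mir_d"}
sS1 = {"mir_a", "mir_e"}
sS2 = {"mir_e", "mir_f"}
expressed = {
    "sBud": {"mir_a", "mir_g"},
    "sL": sL1 | sL2,
    "sS": sS1 | sS2,
}
specific = tissue_specific(expressed)
shared_by_all = set.intersection(*expressed.values())
print(specific, shared_by_all)
```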

### Differentially Expressed miRNAs (DEM)

Differentially expressed miRNAs (DEM) were identified by pairwise comparison among sBud, sL, and sS, requiring more than a 2-fold change in expression and p ≤ 0.05, with the aim of finding key miRNAs during development. The numbers of DEM between tissues are summarized in **Figures 2C–G**. Notably, following the developmental order, there were 235 down-trend and 79 up-trend DEM in sL1 compared with sS1 (referred to as sL1 vs. sS1), against only 67 down-trend and 5 up-trend DEM in sL1 vs. sL2, suggesting distinct regulatory changes, and thus differences in metabolite accumulation, between L1 and S1 (**Figure 2C**). DEM with up- and down-trends in G1 (**Figures 2D,F**) and in G2 (**Figures 2E,G**) were also classified. The continuously changing ones appear to be potentially interesting regulators, such as the 5 DEM (miR390-x, novel-m0578-5p, novel-m0634-5p, novel-m0503-3p, novel-m0531-5p) with a downtrend in G1 (**Figure 2F**), miR396-x with an uptrend in G1 (**Figure 2D**), and novel-m0331-5p with an uptrend in G2 (**Figure 2E**).

### GO Enrichment and KEGG Pathway Analyses of DEM

miRNA sequences were searched against tea tree genomic sequences using the plant miRNA potential target finder to predict target mRNAs. The target unigenes of DEM were annotated based on GO enrichment and KEGG analyses. In this study, a total of 5,501 potential unigenes were predicted to be targeted by 934 miRNAs, including 138 conserved and 796 novel miRNAs. Among the miRNAs, miR5385-x targeted the most unigenes (770), followed by miR5658-x (533) and miR8577-x (223). A total of 280 miRNAs targeted a single unigene, while most miRNAs targeted multiple sites. Conversely, a single unigene could be targeted by several miRNAs: 1,351 unigenes could be regulated by more than one miRNA, 58 of which were targeted by no fewer than 10 conserved miRNAs (the complete list of target genes of all miRNAs is given in **Supplementary Table 8**).
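The many-to-many relationship between miRNAs and unigenes described above can be summarized from a list of predicted pairs. This is an illustrative sketch: the function name is ours, and the toy pairs (including the identifier novel-m0001-5p and the specific pairings) are made up and do not reproduce Supplementary Table 8.

```python
from collections import defaultdict


def summarize_targets(pairs):
    """From (miRNA, unigene) pairs, return the miRNAs with exactly one
    target and the unigenes targeted by more than one miRNA."""
    targets_of = defaultdict(set)
    regulators_of = defaultdict(set)
    for mirna, gene in pairs:
        targets_of[mirna].add(gene)
        regulators_of[gene].add(mirna)
    single_target = sorted(m for m, genes in targets_of.items() if len(genes) == 1)
    multi_regulated = sorted(g for g, mirnas in regulators_of.items() if len(mirnas) > 1)
    return single_target, multi_regulated


# Toy prediction output for illustration only.
toy_pairs = [
    ("miR156-x", "CSA011373"),
    ("miR156-x", "CSA019508"),
    ("miR319-y", "CSA036087"),
    ("novel-m0001-5p", "CSA011373"),
]
single_target, multi_regulated = summarize_targets(toy_pairs)
print(single_target, multi_regulated)
```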

Gene Ontology (GO) enrichment analysis provides a strictly defined, dynamically updated controlled vocabulary to describe the properties of the target genes and identify their main biological functions. Within the biological process category, the GO terms associated with these target genes in all tissues (**Supplementary Figure 4**) were most often related to "metabolic process," followed by "cellular process" and "single-organismal process". Within the cellular component category, the unigenes were similarly represented, mainly in "cell," "cell part," "membrane," "organelle," and their parts. Within the molecular function category, the top two GO terms were "catalytic activity" and "binding".

### Key DEM Involved in Growth and Development

We focused on miRNA expression to filter candidates that participate in growth and development, and found that all miRNAs previously reported to be associated with growth and development followed up- or down-trend patterns, except miR172. We therefore first narrowed the set to miRNAs with similar expression trends in the aforementioned G1 and G2 (down-trends comprising profiles 3, 0, and 1; up-trends comprising profiles 6, 7, and 4), and then screened further by GO and KEGG pathway analysis. Twenty-one miRNAs, including 6 novel miRNAs, were identified as potentially important for development in PYTZ (**Table 4**). The mature sequences of these miRNAs and their target genes in the tea genome are also listed in **Table 4**. A heat map of the transcript levels (in TPM) of the 21 miRNAs across the samples is shown in **Figure 4**. Not every miRNA had the same expression pattern in G1 as in G2: 3 miRNAs changed only in G1 (1 uptrend, 2 downtrends); 4 changed only in G2 (2 uptrends, 2 downtrends); and 11 had the same trend in both G1 and G2 (3 uptrends, 8 downtrends). Interestingly, miR319-y showed a downtrend in G1 and an uptrend in G2. Some other miRNAs may have been excluded from these trends not because of their expression trends or levels, but because the differences in expression among developmental stages did not reach significance (P < 0.05, log2 fold change ≥ 2).

### Potential Transcription Factor Target Genes

As transcription factors have been intensely studied for their numerous important roles in plant growth and development in many species (Ramachandran et al., 1994; Zhang et al., 2009; Chen et al., 2010), we analyzed transcription factor genes to identify key TFs performing this function. All 5,501 predicted target genes were searched by BLAST against the Plant Transcription Factor Database (http://planttfdb.cbi.pku.edu.cn/), which detected 46 kinds of transcription factors involving 352 mRNAs (the full list of identified TFs is provided in **Supplementary Table 9**). Then, the types and numbers of TF genes targeted by miRNAs in each tissue (sBud, sL1, sL2, sS1, and sS2) were analyzed and summarized in **Figure 5**. Overall, the number of TF genes targeted by miRNAs was highest in sL2 and lowest in sS1. MYB showed the widest involvement, followed by HD-ZIP and bHLH (basic Helix-Loop-Helix) transcription factors. Some kinds of transcription factors were not targeted in every tissue, such as the CAMTA family of calmodulin-binding TF genes in sS2 and GRF (Growth Regulating Factor) in sL1.

TABLE 4 | Potential developmentally important miRNAs in Pingyang tezaocha.

The underlined genes are target mRNAs that are TF genes. The gray-shaded genes have expression profiles reciprocal to those of the corresponding miRNAs.

Quantitative PCR was further performed to validate the 21 growth- and development-associated miRNAs mentioned above (**Figure 6**) and their predicted target mRNAs (**Figure 7**), 14 of which were TF genes (underlined in **Table 4**). Although plant miRNAs may be involved in many complicated and diverse functions within complex regulatory networks, their fundamental role is to suppress the expression of target genes (Tang and Chu, 2017). Here, we obtained eight miRNA-TF gene pairs with reciprocal expression profiles (gray-shaded mRNAs in **Table 4**); the complementarity of each miRNA to its target site is shown in **Supplementary Figure 5**. The miRNA-TF gene pairs were miR156-x-CSA011373 (SBP), miR156-x-CSA019508 (SBP), miR156-x-CSA023442 (SBP), miR156-x-CSA031667 (SBP), miR165-y-CSA023057 (Class III HD-Zip), miR165-y-CSA030874 (HD-Zip), miR319-y-CSA036087 (MYB), and miR8577-x-CSA030921 (bHLH).

### Expression Profiles of Reported Morphological miRNA in Different Tea Varieties

Having originated in the Yunnan and Tibet region of China, the tea tree has evolved over thousands of years, and at least 246 cultivars with significant differences in morphology and physiology have now been bred. For example, PYTZ, the cultivar studied here, has an oblong leaf shape, a blunt tip, tightly dentate margins, and shorter internodes. Such characteristics are among the indicators used to screen and distinguish cultivars. Plant growth and development are accompanied by morphogenesis, and some regulators, including small RNAs and TFs, may participate in both biological processes simultaneously. As a basic step toward linking development-associated miRNAs with morphology, the expression profiles of reported morphological miRNAs were examined in six other tea varieties with obvious differences in leaf morphology. The 21 miRNAs of interest were again checked by quantitative PCR for their expression levels in these representative cultivars. The bud, the 1st leaf, the 2nd leaf, the younger stem, and the older stem were sampled from each variety for quantitative PCR analysis (**Figure 8**). For each miRNA, the relative expression in JFH was set as the reference to allow comparison of expression levels among varieties. In general, most of the miRNAs had similar expression patterns in ZDY, PYTZ, DYWL, and QS, with high expression levels in the 1st leaf followed by the 2nd leaf. Other miRNAs tended to have high expression levels in the bud, such as miR160-x in HD and miR156-y in JFH.

### Network Analysis on miRNA, Target mRNA, and Quality-Related Metabolites

Correlation analyses were conducted to determine the extent to which the 21 conserved miRNAs mentioned above participate in quality formation (**Figure 9**). Surprisingly, compared with catechins and caffeine, theanine had strong correlations with a relatively large number of target mRNAs and could therefore be regarded as the central regulated metabolite, at least during the development of PYTZ sprouts. Among the galloylated catechins, ECG was associated with many more mRNAs than EGCG. The numbers of mRNAs associated with each of the non-galloylated catechins were approximately equal. Among the miRNAs, miR8577-x, miR160-x, and novel-m0187-3p were the most highly connected, whereas the neighborhoods of miR156-x and miR319-y appeared rather simple.

## DISCUSSION

### Different miRNAs Were Involved in Different Tissues and Stages During the Sprout Development in PYTZ

How a plant builds leaves from a few cells that grow, divide, and differentiate to form a complex organ has been well studied, as has the mechanical regulation of this process (Braybrook and Kuhlemeier, 2010; Chen, 2012; Qi et al., 2017). miRNAs, which typically belong to multigene families that allow subtle and complex control of different regulatory processes, have been described as factors in many aspects of plant development (Kidner, 2010).

Because tea is a cash crop harvested for its leaves, leaf development is of particular practical significance. We therefore examined the function of miRNAs in different tissues and developmental stages separately. In the comparison of G1 and G2 described above, the miRNAs that participate in leaf and stem development differ, hinting that distinct sets of miRNAs act in coordination during leaf and stem development, respectively. For example, both novel-m0155-3p and novel-m0331-5p were stem-specific miRNAs with higher expression levels in sS2 than in sS1, and they are therefore likely to be involved in stem elongation. Likewise, a single miRNA may function differently in the development of different tissues. Of the 78 miRNAs present in both leaf and stem (**Figure 2B**), 28 were differentially expressed (P < 0.05, log2 fold change ≥ 1.5), namely miR5049-y and 27 novel miRNAs; 23 of them shared one similar expression pattern (**Supplementary Figure 6**), 27 had higher expression levels in sL1 than in sL2 (**Supplementary Figures 6A–C**), and 24 had higher expression levels in sS2 than in sS1 (**Supplementary Figures 6A,D**). The non-conserved miR5049, included in profile 1 (**Supplementary Figure 6A**), has been reported to be a drought-stress-responsive miRNA in the roots of a drought-tolerant wheat cultivar (Akdogan et al., 2016).

Interestingly, down-trend miRNAs greatly outnumbered up-trend miRNAs both in G1 (157 from profiles 3, 0, and 1 vs. 53 from profiles 6, 7, and 4) (**Figure 3A**) and in G2 (91 vs. 64) (**Figure 3B**), and the same was true of their abundance. After discarding miRNAs with TPM lower than 100, down-trend miRNAs accounted for 79.04% and 54.30% of the abundance in G1 and G2, respectively (**Supplementary Table 10**). These percentages suggest that during development, especially in the leaf, a large proportion of the mRNAs regulated by miRNAs tend to increase in expression. This trend is consistent with tea leaf transcriptome data showing that 72% of genes were up-regulated in the second leaf stage compared with the first leaf stage (Guo et al., 2017). The coherence between the expression levels of the regulators and the content of secondary metabolites is particularly notable, a point we return to below.

#### Evolutionarily Conserved miRNAs Were Closely Connected to Morphogenesis Functions During the Sprout Development

miRNAs are usually classified as conserved or non-conserved according to whether they are present in all, or at least most, species; this division can therefore be influenced by sampling and by the phylogenetic diversity of the species in which miRNAs have been characterized and annotated (Baldrich et al., 2018). Among the 21 development-associated miRNAs filtered out in this study (**Table 4**), apart from 6 novel miRNAs, 11 of the known miRNAs are conserved miRNAs that could be found in at least 33 of 72 plant species (**Table 2**). Notably, none of the miRNAs from tea could be found in Chlamydomonas reinhardtii, a conventional model of the photosynthetic cell used to study photosynthesis (Funes et al., 2007), abiotic stress (Hema et al., 2007), the circadian clock (Ral et al., 2006), and so on. This result echoes, to some extent, the conservation of miRNAs within one kingdom; no miRNA has been found to be conserved between green algae and land plants (Baldrich et al., 2018). Interestingly, no tea miRNA was found in Populus euphratica either, which is an ideal woody-plant model system for research into abiotic stress resistance (Li et al., 2009), such as resistance to drought (Li et al., 2011) and salt (Li et al., 2013). Previous studies on Populus euphratica reported that only 9 of 21 miRNA families (miR156, miR163, miR172, miR398, miR393, miR171, miR408, miR169, and miR472) were conserved in other plants, whereas the other 12 candidate miRNA families showed no homology in Populus, Arabidopsis, or Oryza (Li et al., 2009); the species can thus be considered quite ancient and independently evolved.

The seven tea cultivars studied above are well-known cultivars frequently used to produce fermented or non-fermented tea in China. In the tea processing industry, processing suitability and tea quality are basically determined by the main characteristic metabolic compounds, which are directly linked to the development and morphogenesis of the tea sprouts (Xia et al., 2017). Among the 11 known conserved tea miRNAs, miR156 was the most widespread, being found in 51 plant species, followed by miR396 and miR166, which were conserved in 47 and 45 plant species, respectively. Conserved miRNAs were also highly abundant (miR166, miR319, miR396, miR160, and miR390 were the five most abundant miRNAs), which suggests that conservation reflects not only low sequence variation across diverse plant species but also membership of large, older miRNA families with abundant copies and targets (Chavez Montes et al., 2014), which helps guarantee tightly constrained functional roles and less gene loss in the regulatory network (Shi et al., 2017). Mature plant miRNAs often have multiple target genes with similar complementary sequences, and evolutionarily conserved miRNAs and their predominant target genes characteristically play essential roles in developmental regulation, morphogenesis, and stress responses (Axtell and Bowman, 2008; Yang et al., 2013). Even more noteworthy, the tissue-specific miRNAs (**Table 2**) may contribute to the development of a specific tissue, but this does not mean that they are the dominant or the conserved ones. For example, miR319 has been reported to increase the number of longitudinal small veins and might thus account for leaf blade width (Yang et al., 2013), and miR159 is involved in stem elongation (Tsuji et al., 2006), yet both are conserved miRNAs expressed in bud, leaves, and stems. Only miR164, miR393, and miR2111 were both leaf-specific and conserved. Not all of the conserved miRNAs had similar expression patterns in the investigated cultivars (**Figure 8**), which might reflect the flexibility of these "fine-tuners" in enabling a fast evolutionary response (Muleo, 2012).

Conservation in sequence does not always imply functional conservation (Ason et al., 2006). Although there is less evidence for miRNAs in plants than in animals, it is widely accepted that plant miRNA genes evolved independently of those in the animal kingdom. It is thus believed that the larger a miRNA family is, that is, the more paralogous copies of one miRNA a plant carries, the more essential its function in development may be, even though multiple copies also allow a more flexible evolutionary rate and a greater possibility of diversification. Further research is needed to combine metabolite contents with the function of conserved miRNAs in species-level phenotypic differences.

#### Development Associate miRNAs Might Play Crucial Roles in the Quality Formation Together With Their Potential Target Transcription Factor Genes

Higher plants have evolved precise and robust spatio-temporal gene regulatory systems, among which transcription factors and miRNAs are two of the best-studied regulatory mechanisms, acting at the transcriptional and post-transcriptional levels, respectively (Chen and Rajewsky, 2007). TFs and miRNAs generally do not work in isolation; instead, together with co-regulators in the same or in different layers, they form large cooperating and interacting networks in complex multicellular organisms (Dawid, 2006). They are usually positioned at the center of the regulation of many aspects of developmental plasticity throughout the life cycle (Rubio-Somoza and Weigel, 2011). We performed network and enrichment analysis of the 352 TF genes targeted by miRNAs in STRING (Szklarczyk et al., 2017) and marked the development-related processes (**Supplementary Figure 7**). Most of the TF genes clustered in pathways regulating gene expression and primary metabolic processes; the involvement of TFs in primary metabolism is readily understood, given its fundamental role in sustaining the plant. To narrow the scope, we further performed GO analysis of the DEM in G1 and G2 described above. Many of the enriched pathways were related to developmental and morphological processes (**Supplementary Figure 8**), especially for the up-trend expression profiles in G1, which also supports the effectiveness of our approach to filtering key miRNAs. The eight miRNA-TF mRNA pairs verified by their reciprocal expression relationships are particularly likely to participate in developmental regulation, although this does not imply that only these eight pairs are involved.

TABLE 5 | GO pathway enrichment analysis of the target mRNA genes of DEM between each pair of samples from Bud, sL1, sL2, sS1, and sS2.

Metabolism proceeds alongside development. Plant miRNAs have been reported to be widely involved in the regulation of quality formation (Wu et al., 2014; Liu et al., 2017b). Of the three kinds of characteristic metabolites that ultimately determine the quality of tea (Xia et al., 2017; Wei et al., 2018), catechins mainly confer an astringent taste, theanine contributes umami and sweet tastes, and caffeine offers a bitter taste (Wei et al., 2018). We performed GO pathway enrichment analysis of the target mRNA genes of DEM between every two samples from Bud, sL1, sL2, sS1, and sS2; the participants involved are shown in **Table 5**. The contents of catechins, theanine, caffeine, and soluble matter together ultimately affect the sensory evaluation of green tea taste. In our study, theanine turned out to be particularly active in the network (**Figure 9**). Target mRNAs belonging to TF genes were further selected and, on the basis of their correlations, used to construct a more metabolite-oriented network (**Supplementary Figure 9**). TF genes such as CSA013022 (HD ZIP) and CSA029222 (ARF) had a strong positive relationship with the biosynthesis of theanine, and the latter was also positively associated with EC; these TF genes correspond to miR165-y, miR166-z, and miR160-x, respectively. CSA031667, a TF gene belonging to the SBP family, had a positive correlation with C and is controlled by miR156-x. Taking the eight miRNA-TF pairs mentioned above into consideration, at least two triplets participate in both development and quality formation: miR156-x-CSA031667 (SBP)-C and miR319-y-CSA036087 (MYB)-theanine. The molecular mechanisms of sprout development and metabolite accumulation should be gradually uncovered following the release of the tea tree genome (Xia et al., 2017; Wei et al., 2018) and tea organ transcriptomes (Zheng et al., 2015, 2016; Guo et al., 2017; Liu et al., 2017a). Further connections involving small RNAs will be studied to improve the efficiency of breeding better cultivars with higher quality.

## DATA AVAILABILITY

The datasets generated for this study can be found in Sprouts development of tea plants, PRJNA510482.

### ETHICS STATEMENT

The authors declare that we have complied with all relevant ethical regulations.

## AUTHOR CONTRIBUTIONS

ZD conceived the study. LZ, CC, JS, and YW performed the experiment and analyzed the data. LZ wrote the paper. All authors read and approved the final manuscript.

## FUNDING

This work was supported by the Special Foundation for Distinguished Taishan Scholar of Shandong Province (Ts201712057), the Natural Science Foundation of China (31600557 and 31470027), Science and Technology Plan Projects in Colleges and Universities of Shandong Province (J15LF02), School Fund Project of Qingdao Agricultural University (631412), and Qingdao Applied Basic Research Program (grant 15-9-1-45-jch). This work was also supported by China Scholarship Council (201708370012).

### ACKNOWLEDGMENTS

We acknowledge Gene de novo Co., Ltd. and RuiBo Co., Ltd. at Guangzhou for their assistance in original data processing.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00237/full#supplementary-material

Supplementary Figure 1 | The nucleotide frequency of known miRNAs at each position in PYTZ. sBud (A), sL1 (B), sL2 (C), sS1 (D), and sS2 (E). The frequencies of cytosine (C) (32.09%) and uracil (U) (29.65%) are higher than those of guanine (G) (19.26%) and adenine (A) (19.00%). U appeared frequently at the 1st, 17th, 22nd, and 23rd positions, with averages of 84.34, 61.80, 54.57, and 52.48%, respectively.

Supplementary Figure 2 | The nucleotide bias of 18nt-30nt length known miRNAs at the 1st position in PYTZ. sBud (A), sL1 (B), sL2 (C), sS1 (D), and sS2 (E).

Supplementary Figure 3 | Related coefficients of the 15 miRNA libraries across the five sample groups with three replicates of each group.

Supplementary Figure 4 | Level 2 GO terms of all target genes from PYTZ sprouts.

Supplementary Figure 5 | Complementary correspondence of developmental miRNA toward the target sites.

Supplementary Figure 6 | (A–E) The expression profiles of miRNA that exist both in sL (leaf) and sS (stem).

Supplementary Figure 7 | Relative process of the targeted TF genes by STRING analysis. (STRING: https://string-db.org/).

Supplementary Figure 8 | Level 3 GO analysis of DEM in G1 and G2.

Supplementary Figure 9 | The miRNA-TF-metabolite networks. The rounded rectangles in the center are metabolites, the ellipses in the middle layer are potential target mRNAs that belong to TF genes, and the diamonds in the outermost layer are miRNAs. Line thickness represents the strength of the relationship. Red lines represent positive correlation coefficients and blue lines negative ones.

Supplementary Table 1 | The primers sequences of miRNA and mRNA genes used in Real-time PCR.

Supplementary Table 2 | The filtering data of 15 sRNA-Seq libraries from Camellia sinensis cv. Pingyang Tezaocha.


Supplementary Table 3 | Statistics of sRNA-Seq libraries mapping to tea tree genome.

Supplementary Table 4 | Statistics of sRNA-Seq libraries mapping to Rfam.

Supplementary Table 5 | The lengths, sequences, and expressions of 156 known miRNAs.

Supplementary Table 6 | 122 precursors and 99 kinds of characteristic hairpin structures, with their length and energy information.

Supplementary Table 7 | 1186 novel miRNAs and their 1130 hairpin structures, with their length and energy information.

Supplementary Table 8 | The complete list of target genes of all miRNA.

Supplementary Table 9 | The annotation and the corresponding miRNAs of the total 352 target transcription factor genes.

Supplementary Table 10 | Down-trend and up-trend miRNAs in G1 and G2 with TPM more than 100.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhao, Chen, Wang, Shen and Ding. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Genomic Landscape of Crossover Interference in the Desert Tree Populus euphratica

Ping Wang<sup>1</sup> , Libo Jiang<sup>1</sup> \*, Meixia Ye<sup>1</sup> , Xuli Zhu<sup>1</sup> and Rongling Wu<sup>1,2</sup>

*<sup>1</sup> Center for Computational Biology, College of Biological Sciences and Biotechnology, Beijing Forestry University, Beijing, China, <sup>2</sup> Center for Statistical Genetics, The Pennsylvania State University, Hershey, PA, United States*

Crossover (CO) interference is a universal phenomenon by which the occurrence of one CO event inhibits the simultaneous occurrence of other COs along a chromosome. Because of its critical role in the evolution of genome structure and organization, the cytological and molecular mechanisms underlying CO interference have been extensively investigated. However, the genome-wide distribution of CO interference and its interplay with sex-, stress-, and age-induced differentiation remain poorly understood. Multi-point linkage analysis has proven to be a powerful tool for landscaping CO interference, especially within species for which CO mutants are rarely available. We implemented four-point linkage analysis to landscape a detailed picture of how CO interference is distributed through the entire genome of *Populus euphratica*, the only forest tree that can survive and grow in saline desert. We identified an extensive occurrence of CO interference, and found that its strength depends on the length of chromosomes and the genomic locations within the chromosome. We detected high-order CO interference, possibly suggesting a highly complex mechanism crucial for *P. euphratica* to grow, reproduce, and evolve in its harsh environment.

#### Edited by:

*Ancha Baranova, George Mason University, United States*

#### Reviewed by:

*Xiyin Wang, North China University of Science and Technology, China Longjiang Fan, Zhejiang University, China*

#### \*Correspondence:

*Libo Jiang libojiang@bjfu.edu.cn*

#### Specialty section:

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

Received: *11 October 2018* Accepted: *29 April 2019* Published: *15 May 2019*

#### Citation:

*Wang P, Jiang L, Ye M, Zhu X and Wu R (2019) The Genomic Landscape of Crossover Interference in the Desert Tree Populus euphratica. Front. Genet. 10:440. doi: 10.3389/fgene.2019.00440*

Keywords: euphrates poplar, genetic interference, mapping population, meiotic crossover, four-point analysis

### INTRODUCTION

Crossovers (COs) are recombination events involving a reciprocal exchange of genetic material. During meiotic prophase, COs are essential for the accurate segregation of homologous chromosomes (Hillers, 2004). In most organisms, the abundance and distribution of COs are tightly regulated by a universal mechanism referred to as CO interference or genetic interference: the presence of one CO interferes with the occurrence of other COs on the same chromosome. Because of such interference, chiasmata are more evenly spaced along chromosomes than would be expected by chance (Hillers, 2004; Hultén, 2011). Moreover, CO interference is ubiquitous in eukaryotes and plays a crucial role in their evolution. However, our understanding of the mechanisms of CO interference and of its distribution in biota remains very limited.

Sturtevant and Muller constructed a Drosophila genetic map and found that COs were more evenly spaced than would be expected from random placement (Lam et al., 2005). CO interference is widespread in eukaryotes and can confer selective advantages. The extent of CO interference decreases with the genetic distance between COs; however, for a given distance, it is stronger within the same chromosomal arm than across different arms (Berchowitz and Copenhaver, 2010). The variability of CO interference within a specific chromosome region is affected by the overall size and structure of the chromosome (Hillers, 2004), and CO interference is regulated by the anti-recombinase RTEL-1 in Caenorhabditis elegans (Youds et al., 2010). A reduction in CO interference can result from a lack of the DNA-damage-response kinase Tel1/ATM (Anderson et al., 2015). Links between CO interference and sex differences (Jan et al., 2007; Szatkiewicz et al., 2013), stress-induced adaptation (Yant et al., 2013; Aggarwal et al., 2015), and aging (Campbell et al., 2015; Wang Z. et al., 2016) have been discovered, highlighting the multifaceted role of COs in mediating biological processes. As an evolving phenotype, CO interference varies with biotic and abiotic factors such as sex, age, and stress. For example, in mice and cattle, interference is stronger in females than in males (Szatkiewicz et al., 2013; Wang Z. et al., 2016). However, the opposite is found in humans, where interference is stronger in males than in females, although this pattern varies by chromosome (Campbell et al., 2015).

Many methods have been used to study the mechanisms of CO interference, including the count-location model, the gamma model, and multi-point linkage analysis. Initially, CO interference was genetically defined and characterized by cytology, the location of protein complexes, and chromosomal CO events. Recent studies have explored the mechanistic basis of CO interference using cytogenetic and molecular methods, whereas more traditional interference studies use the coefficient of coincidence (CoC) between two disjoint intervals on a genetic map. For each pair of intervals, the CoC is defined as the ratio of the observed frequency of gametes carrying a CO in both intervals (double COs) to the frequency expected if COs in the two intervals occurred independently (Waterworth, 2000). Traditional models of interference suggest that the occurrence of a CO produces a signal or substance that prevents additional CO events and spreads along the chromosome over a similar distance on both sides (Housworth and Stahl, 2003). The polymerization model states that early recombination events are distributed independently of each other and then have the same chance per unit of time of initiating bidirectional aggregation events (King and Mortimer, 1990).
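The CoC for a pair of intervals can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the gamete counts are hypothetical, and the expected double-CO frequency is taken as the product of the two single-interval recombination frequencies, which is the usual independence assumption.

```python
def coefficient_of_coincidence(n_double, n_first, n_second, n_total):
    """Return the CoC for two marker intervals.

    n_first and n_second are the numbers of gametes recombinant in the first
    and second interval (each count includes the double recombinants);
    n_double is the number of gametes recombinant in both intervals.
    """
    observed = n_double / n_total
    # Expected double-CO frequency if the two intervals recombine independently.
    expected = (n_first / n_total) * (n_second / n_total)
    return observed / expected


# Hypothetical counts: 1000 gametes, 100 recombinant in interval 1,
# 200 recombinant in interval 2, and 8 recombinant in both.
coc = coefficient_of_coincidence(n_double=8, n_first=100, n_second=200, n_total=1000)
interference = 1 - coc  # the classical interference measure
print(f"CoC = {coc:.2f}, interference = {interference:.2f}")
```

With these counts, fewer double COs are observed (8) than expected under independence (20), so the CoC falls below 1, which indicates positive interference.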

More recently, many model and non-model systems have been developed to characterize CO interference. CO interference has been investigated mainly by tracking DNA markers on a single chromosome of parents during a specific period under electron fluorescence microscopy. The gamma model has recently received attention; it suggests that the shape parameter of the gamma distribution is an indicator of the uniformity of CO spacing and an indirect indicator of interference (Lam et al., 2005). The mechanical stress model assumes that each CO event releases stress over a specific distance along the chromosome, which prevents the occurrence of nearby COs (Wang et al., 2015). At present, multi-point linkage analysis has proven advantageous for genetic distance estimation and gene ordering, and it has a strong ability to discern and quantify CO interference.
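The role of the shape parameter in the gamma model can be illustrated by simulation. The sketch below assumes that distances between successive COs are drawn from a gamma distribution with a fixed mean (0.5 Morgan, chosen here purely for illustration); the function names and the seed are hypothetical. A shape of 1 gives exponentially distributed distances, that is, no interference, whereas larger shapes give more even spacing.

```python
import random
import statistics


def simulate_intercrossover_distances(shape, n, mean_distance=0.5, seed=0):
    """Draw n inter-CO distances from a gamma distribution with the given shape."""
    rng = random.Random(seed)
    scale = mean_distance / shape  # gamma mean = shape * scale
    return [rng.gammavariate(shape, scale) for _ in range(n)]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean; lower means more even spacing."""
    return statistics.pstdev(values) / statistics.fmean(values)


for shape in (1.0, 4.0, 16.0):
    distances = simulate_intercrossover_distances(shape, n=20000)
    cv = coefficient_of_variation(distances)
    print(f"shape = {shape:>4}: coefficient of variation of spacing = {cv:.2f}")
```

For a gamma distribution the coefficient of variation equals one over the square root of the shape, so the simulated spacing becomes more uniform as the shape grows, which is why the shape parameter serves as an indirect measure of interference.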

Despite numerous theoretical and empirical studies, our understanding of how interference is distributed across genomes remains limited (Housworth and Stahl, 2003). This can be attributed to several reasons. First, traditional genetic screens for mutations affecting interference require numerous meiotic progeny in order to capture meiotic COs in multiple intervals along a chromosome (Berchowitz and Copenhaver, 2010). Second, most of the mutations that modify interference affect chromosomal proteins, which not only mediate interference but also play a role in CO formation (Joshi et al., 2009). Thus, mutations that abolish interference also reduce or eliminate CO events. Third, many mutants differ in CO frequency across loci and environments (Getz et al., 2008). Therefore, combining multi-point analysis with cytological tools, which are widely used for locating and sequencing genes, can increase the ability to detect interference (Broman and Weber, 2000). The multi-point statistical model, which is based on linkage analysis of genetic maps, can describe CO interference that takes place not only between two adjacent chromosome intervals but also across multiple consecutive intervals. Additionally, multi-point analysis provides a quantitative method to estimate CO interference (Zickler and Kleckner, 2016). In particular, by assessing the chromosomal distribution of CO interference, multi-point analysis can extend the use of linkage mapping as a routine genetic tool to investigate further dimensions of genomic structure and organization (Lu et al., 2004).

Populus euphratica is the only tree species in arid and semiarid regions and plays an important role in maintaining the ecological balance of desert regions. Due to the impact of climate change and anthropogenic activities, the area of P. euphratica in northwest China has declined sharply, and the ecological security and agricultural production of the region are facing severe challenges (Qiu et al., 2011). The goals of this study were to identify the distribution of CO interference in P. euphratica at a whole-genome scale using multi-point analysis of a full-sib family, and to study the relationship between overall CO interference strength and chromosome length as well as chromosome region. By using four-point linkage analysis to analyze CO interference in P. euphratica, we can describe its distribution within the genome in detail, which will provide a theoretical basis for follow-up forest genetic research and molecular marker-assisted breeding. This work is also of great significance for understanding the genetic diversity and evolutionary history of P. euphratica and for identifying its core germplasm resources.

## MATERIALS AND METHODS

#### Plant Material and Genetic Linkage Map

One male and one female P. euphratica individual were randomly selected along the Tarim River in the Korla region of Xinjiang, China. The individuals were located 31 km from one another, ensuring a large genetic difference between them. Male and female flowering branches from the individuals were planted in an artificial climate chamber at Beijing Forestry University. After cultivation was completed, a series of experimental treatments, including dehydration, thinning, and freezing with liquid nitrogen, were performed on the selected materials. Finally, the F<sup>1</sup> progeny of 408 individuals were obtained. DNA was extracted using the TIANGEN plant genomic DNA extraction kit (Beijing, China). The quality of all samples was assessed and RAD technology was used for high-throughput DNA sequencing (Conesa et al., 2005). The genetic map of P. euphratica was constructed from the resultant sequence data.

#### Multi-Point Linkage Analysis

A four-point analysis was developed so that four consecutive markers could be analyzed simultaneously (Wang J. et al., 2016). Going beyond three-point analysis, it can characterize CO interference that takes place not only between two adjacent chromosomal intervals but also over multiple successive intervals (we refer to interference spanning the intervals defined by more than three markers as high dimensional CO interference). We used the CoC, the ratio of the observed number of double recombinants to the number expected if recombination events in different intervals were independent, to describe interference. In practice, recombination events occurring in different marker intervals are not independent; thus, the extent to which this coefficient deviates from 1 reflects the strength of CO interference.

In the full-sib family of P. euphratica, two heterozygous F<sup>1</sup> individuals, ABCD/abcd and ABCD/abcd, were crossed to produce a segregating F<sup>2</sup> population. Each F<sup>1</sup> parent produced 16 gametes, divided into eight types (**Table 1**). The frequencies of the gamete types are denoted by g000,..., g111, where the three subscript digits indicate whether a CO occurred (1) or did not occur (0) in the marker intervals A-B, B-C, and C-D, respectively. Based on the genetic map of P. euphratica, we grouped single-nucleotide polymorphism markers on the 19 linkage groups into sets of four consecutive markers. The frequencies of the gamete types were calculated by counting the genotypes among the 408 individuals for each set. Four consecutive markers (i.e., A-B-C-D) form six possible marker pairs. From the gamete-type frequencies, we expressed the recombination fraction of each marker pair, denoted by rAB, rBC, rCD, rAC, rBD, and rAD, as follows:

$$\begin{aligned}
r_{AB} &= g_{111} + g_{110} + g_{101} + g_{100} \\
r_{BC} &= g_{111} + g_{110} + g_{011} + g_{010} \\
r_{CD} &= g_{111} + g_{101} + g_{011} + g_{001} \\
r_{AC} &= g_{101} + g_{100} + g_{011} + g_{010} \\
r_{BD} &= g_{110} + g_{010} + g_{101} + g_{001} \\
r_{AD} &= g_{111} + g_{010} + g_{100} + g_{001}
\end{aligned} \tag{1}$$
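Equation (1) follows from a parity rule: two markers appear recombinant whenever an odd number of COs falls in the intervals between them. The sketch below, with a hypothetical function name and hypothetical gamete-type frequencies, computes the six recombination fractions in this way and reproduces the sums in Equation (1).

```python
def recombination_fractions(g):
    """Compute the six pairwise recombination fractions of Equation (1).

    g maps the keys '000' ... '111' to gamete-type frequencies; the three
    digits flag a CO in intervals A-B, B-C, and C-D, respectively.
    """
    def total(is_recombinant):
        return sum(freq for key, freq in g.items()
                   if is_recombinant(*(int(ch) for ch in key)))

    return {
        "AB": total(lambda i, j, k: i == 1),
        "BC": total(lambda i, j, k: j == 1),
        "CD": total(lambda i, j, k: k == 1),
        # Non-adjacent markers recombine when the CO count between them is odd.
        "AC": total(lambda i, j, k: (i + j) % 2 == 1),
        "BD": total(lambda i, j, k: (j + k) % 2 == 1),
        "AD": total(lambda i, j, k: (i + j + k) % 2 == 1),
    }


# Hypothetical gamete-type frequencies (they sum to 1).
gametes = {"000": 0.715, "100": 0.090, "010": 0.100, "001": 0.080,
           "110": 0.005, "011": 0.005, "101": 0.004, "111": 0.001}
r = recombination_fractions(gametes)
for pair, value in r.items():
    print(f"r_{pair} = {value:.3f}")
```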

Denote the coefficients of coincidence (a measure of crossover interference) between double marker intervals A-B and B-C, double marker intervals B-C and C-D, double marker intervals A-B and C-D, and triple marker intervals A-B, B-C, and C-D by C1, C2, C3, and C4, respectively (Sun et al., 2017). Wang J. et al. (2016) formulated the relationship between different recombination fractions based on the CoC and derived a process to estimate and test each coefficient, as follows:

$$\begin{aligned}
C_1 &= \frac{g_{111} + g_{110}}{r_{AB}\,r_{BC}} \\
C_2 &= \frac{g_{111} + g_{011}}{r_{BC}\,r_{CD}} \\
C_3 &= \frac{g_{111} + g_{101}}{r_{AB}\,r_{CD}} \\
C_4 &= \frac{g_{111}}{r_{AB}\,r_{BC}\,r_{CD}}
\end{aligned} \tag{2}$$

providing a method to characterize the genomic distribution of CO interference along the chromosome.
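As a check on Equation (2), the sketch below evaluates the four coefficients for two hypothetical limiting cases: gamete frequencies built under independence of the three intervals, and gamete frequencies with no multiple recombinants at all. The function names and input values are illustrative assumptions, not part of the original analysis pipeline.

```python
def coincidence_coefficients(g):
    """Evaluate C1..C4 of Equation (2) from gamete-type frequencies g."""
    r_ab = g["111"] + g["110"] + g["101"] + g["100"]
    r_bc = g["111"] + g["110"] + g["011"] + g["010"]
    r_cd = g["111"] + g["101"] + g["011"] + g["001"]
    return {
        "C1": (g["111"] + g["110"]) / (r_ab * r_bc),  # intervals A-B and B-C
        "C2": (g["111"] + g["011"]) / (r_bc * r_cd),  # intervals B-C and C-D
        "C3": (g["111"] + g["101"]) / (r_ab * r_cd),  # intervals A-B and C-D
        "C4": g["111"] / (r_ab * r_bc * r_cd),        # all three intervals
    }


def independent_gametes(r_ab, r_bc, r_cd):
    """Gamete-type frequencies when COs in the three intervals are independent."""
    g = {}
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                g[f"{i}{j}{k}"] = ((r_ab if i else 1 - r_ab)
                                   * (r_bc if j else 1 - r_bc)
                                   * (r_cd if k else 1 - r_cd))
    return g


no_interference = coincidence_coefficients(independent_gametes(0.1, 0.2, 0.3))
complete_interference = coincidence_coefficients(
    {"000": 0.7, "100": 0.1, "010": 0.1, "001": 0.1,
     "110": 0.0, "011": 0.0, "101": 0.0, "111": 0.0})
print(no_interference)
print(complete_interference)
```

Under independence every coefficient equals 1 (up to floating-point rounding), and when multiple recombinants are entirely absent every coefficient equals 0, which brackets the range expected under positive interference.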

For the F<sup>2</sup> family of P. euphratica, the cross of the two F<sup>1</sup> parents produces 136 diploid combinations of gametes, which fall into 81 identifiable genotypes. This situation is more complex than that of a backcross population and requires the Expectation Maximization algorithm to be implemented (Dempster et al., 1977). **Table 2** provides the frequencies of these 81 genotypes, as well as the corresponding numbers. The frequencies of heterozygous genotypes are mixtures of products of gamete-type frequencies (Wang J. et al., 2016). Subsequently, the P. euphratica data were analyzed by multi-point analysis to obtain the CoC values representing CO interference strength. A CoC value of 1 indicates that interference is absent, values below 1 indicate positive interference, and a value of 0 indicates complete interference.
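The counts of 136 and 81 can be reproduced by enumeration. The sketch below assumes two alleles per marker and treats the two gametes of a diploid as unordered; the variable names are illustrative.

```python
from itertools import combinations_with_replacement, product

# The 16 possible four-marker gametes (haplotypes), e.g. 'ABCD', 'ABCd', ...
gametes = ["".join(alleles) for alleles in product("Aa", "Bb", "Cc", "Dd")]

# Unordered pairs of gametes: the distinct diploid combinations.
diplotypes = list(combinations_with_replacement(gametes, 2))


def genotype(pair):
    """Collapse a pair of gametes to the observable genotype at each marker."""
    first, second = pair
    return tuple("".join(sorted(a + b)) for a, b in zip(first, second))


genotypes = {genotype(pair) for pair in diplotypes}

# Diploid combinations hidden behind the fully heterozygous genotype.
quadruple_het = [pair for pair in diplotypes
                 if genotype(pair) == ("Aa", "Bb", "Cc", "Dd")]

print(len(gametes), len(diplotypes), len(genotypes), len(quadruple_het))
```

Because several diploid combinations can share one observable genotype (eight of them for the quadruple heterozygote), the gamete-type frequencies cannot be counted directly, which is the reason an iterative estimation procedure such as the Expectation Maximization algorithm is needed.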

#### The Relationship Between Overall High Dimensional CO Interference Strength and Chromosome Length

Differences in CO interference strength are affected by the overall size of the chromosome (Albini, 2010). Through four-point linkage analysis, we obtained the recombination rates between marker intervals on each linkage group and the corresponding CoC values. To study the relationship between chromosome length and overall high-order CO interference strength, we assumed that the length of the linkage group on the genetic map was the length of the chromosome. Next, the distribution of high-dimensional CO interference strength on the 19 chromosomes was characterized by a boxplot displaying the maximum, minimum, median, and upper and lower quartiles of the data. Because chromosomes differ in their structural characteristics and many factors affect the strength of CO interference, the mean CO interference strength on each chromosome was calculated to account for the relationship between chromosome size and overall CO interference strength. Because the distribution for chromosome 1 deviated more from those of the other chromosomes, it was treated as an outlier and removed from the dataset. Subsequently, chromosomes 2, 3, 4, and 6 were fitted with a linear model (blue line), and the remaining chromosomes were fitted with a trend line (red line). From the fitted curves, the distribution of the overall high-dimensional CO interference strength on different chromosomes was observed.

TABLE 2 | Continued

φ *refers to the ratio of the frequency of each gamete genotype to the corresponding genotype frequency.*

#### Ratio Variance in High Dimensional CO Interference Strength Between Different Chromosome Regions

CO rates are closely related to chromosome region (Giraut et al., 2011), so differences in CO interference strength among regions can be explored. In this study, each chromosome was divided into three equal parts by genetic distance, and the three sections were labeled NO.1, NO.2, and NO.3, respectively. The CO interference strengths of the sections were then compared pairwise. NO.1-NO.2, NO.2-NO.3, and NO.1-NO.3 denote the difference ratio (the sum of the differences between corresponding CO interference strength intervals) between the first (NO.1) and second (NO.2) parts, the second and third (NO.3) parts, and the first and third parts, respectively. This allowed differences in the distribution of CO interference strength between the regions of the chromosome to be seen.

To display the impact of the three regions (NO.1, NO.2, and NO.3) of the chromosome on the CO interference strength distribution, we used δ to quantify the difference between the CO interference strength distributions of two chromosome sections, which can be calculated by

$$\delta = \sum\_{i=1}^{N} \left| p\_i^1 - p\_i^2 \right| \tag{3}$$

where N is the total number of intervals of the CO interference strength value, and p<sub>i</sub><sup>1</sup> and p<sub>i</sub><sup>2</sup> represent the proportion of values in the ith interval in the two chromosome regions being compared. We further derived the range of δ:

$$0 \le \delta = \sum\_{i=1}^{N} |p\_i^1 - p\_i^2| \le \sum\_{i=1}^{N} p\_i^1 + \sum\_{i=1}^{N} p\_i^2 = 2 \tag{4}$$

When the CO interference strength distributions in both regions 1 and 2 were the same, δ was equal to 0, whereas δ reached the maximum of 2 when there was no overlapping region between the CO interference strength distributions of two regions. In all other cases, δ is larger than 0 and smaller than 2. δ reflects the difference of two different CO interference strength distributions.
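The behavior of δ at its two extremes can be checked with a minimal sketch of Eq. (3). The binning helper and its bin edges are assumptions introduced for illustration; the original text does not specify how the strength intervals were chosen.

```python
def histogram_proportions(values, edges):
    """Proportion of values in each bin [edges[i], edges[i+1]).

    The last bin also includes its upper edge. The bin edges are an
    assumption of this sketch, not a choice documented in the source.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            below_upper = v < edges[i + 1] or (i == last and v == edges[-1])
            if edges[i] <= v and below_upper:
                counts[i] += 1
                break
    n = sum(counts)
    if n == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / n for c in counts]


def delta(p1, p2):
    """Eq. (3): sum of absolute differences between two binned distributions."""
    if len(p1) != len(p2):
        raise ValueError("both distributions must use the same intervals")
    return sum(abs(a - b) for a, b in zip(p1, p2))


if __name__ == "__main__":
    # Identical distributions give 0; non-overlapping distributions give 2.
    print(delta([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))
    print(delta([1.0, 0.0], [0.0, 1.0]))
```

Because each distribution sums to 1, the result always lies between 0 and 2, matching Eq. (4).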

### RESULTS

In this study, we first used a four-point linkage analysis model to quantitatively analyze CO interference in a full-sib population of P. euphratica. The genetic map contained 8,305 markers on 19 linkage groups. The total genetic distance was 4574.89 cM for the entire genetic map; the shortest linkage group was linkage group 19 (LG19), with a genetic distance of 130.26 cM, and the longest linkage group was LG1, with a distance of 530.03 cM. The average distance between markers on individual linkage groups was 0.40–0.66 cM (Zhang et al., 2017).

The recombination rates rAB, rBC, rCD, rAC, rBD, and rAD and the corresponding C1, C2, C3, and C4 between every four consecutive markers were obtained by four-point linkage analysis (**Table 3**). According to the CoC (**Table 3**) and the genetic distance of each linkage group, we determined the CO interference between two adjacent intervals, the CO interference between intervals one interval apart, and the high-dimensional CO interference among triple marker intervals. CO interference was ubiquitous within the genome: interference between two adjacent marker intervals was distributed throughout the genome and varied with the length of the chromosome (**Figure 1A**), making the distribution of COs across each linkage group more even. However, interference between two non-adjacent marker intervals occurred less frequently and at lower intensities than interference between adjacent intervals (**Figure 1B**). Interestingly, high-dimensional CO interference was found across the 19 linkage groups and was widely distributed within the genome (**Figure 1C**). By comparison, high-dimensional CO interference was densely distributed on linkage group 4 (LG4) and linkage group 5 (LG5), whereas its distribution density on linkage group 11 (LG11) was lower.

We plotted the high-dimensional CO interference for the first eight of the 19 linkage groups to visualize its distribution more directly (**Figure 2**). Although the chromosome length varied, high-dimensional CO interferences were evenly distributed within each chromosome, and their amplitudes were larger and denser than those of the other two types of interference. Additionally, the locations of the markers where CO interference occurred could be seen (**Figure 2**). The density of high-dimensional CO interference was clearly related to chromosome length, with different chromosome lengths resulting in different distributions of high-dimensional CO interference.

We analyzed the correlation between the genetic distance of chromosomes and overall high-dimensional CO interference strength. The median of the overall CO interference strength was concentrated between 0 and 1, and the interquartile range (IQR) was variable and dependent on chromosome length. The IQR of chromosome 5 was the largest, reaching 41.63 cM; the IQR of chromosome 11 was the smallest, about 1.74 cM; the IQRs of the remaining chromosomes were similar to that of chromosome 1, which was about 16.94 cM (**Figure 3**). In other words, the overall strength of CO interference was related to the genetic distance of the chromosome (**Figure 4**). Chromosomes 2, 3, 4, and 6 were locally linearly fitted (blue line) with an adjusted R<sup>2</sup> of 0.71. The other chromosomes were likewise fitted (red line) with an adjusted R<sup>2</sup> of 0.85 (**Figure 4**). Although the two fitted lines had different slopes, both increased with the length of the chromosome. These results suggest that the correlation between the genetic distance of chromosomes and the overall high-dimensional CO interference strength was significant.
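To make the adjusted R<sup>2</sup> values above concrete, the following sketch fits a straight line by ordinary least squares and reports the adjusted coefficient of determination. Ordinary least squares is an assumption here, since the text does not state the fitting procedure, and the numbers in the usage example are hypothetical placeholders rather than the study's data.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = intercept + slope * x.

    Returns (slope, intercept, r2, adjusted_r2). The adjusted R^2 uses one
    predictor: 1 - (1 - R^2) * (n - 1) / (n - 2).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, adjusted_r2


if __name__ == "__main__":
    # Hypothetical values: linkage-group length (cM) versus mean strength.
    lengths = [150.0, 200.0, 250.0, 300.0, 350.0]
    mean_strength = [0.8, 1.1, 1.3, 1.8, 2.0]
    print(linear_fit(lengths, mean_strength))
```

With only four points, as for chromosomes 2, 3, 4, and 6, the adjusted value can fall well below the unadjusted R<sup>2</sup>, so the two should not be compared directly.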

We plotted the first three of the 19 chromosomes to visualize the distribution of high-dimensional CO interference on different chromosome parts (NO.1, NO.2, and NO.3) (**Figure 5**). The CO interference strength of each chromosome part differed in terms of intensity interval. For example, on chromosome 1, there was no CO interference in the first part (interval of 60–80 cM), whereas chromosome 2 exhibited CO interference. Therefore, different intervals along the chromosome contained different strengths and distributions of CO interference.

The difference ratio was used to compare the differences among the three intervals on each chromosome and study the distribution of high-dimensional CO interference strength in different regions of the chromosome. The difference ratios of NO.1-NO.2, NO.2-NO.3, and NO.1-NO.3 in each chromosome were 0.1429–0.9474, 0.0952–1.1250, and 0.2353–0.8750, respectively (**Figure 6**). Moreover, fluctuations of CO interference strength between the first region and the third region were small, whereas the CO interference strength between the second region and the third region fluctuated greatly (**Figure 6**). The high-dimensional CO interference strength between the middle region and both side regions on the chromosome was very different. Thus, the overall strength of high-dimensional CO interference was not only related to the length of the chromosome, but also varied among chromosome regions.

### DISCUSSION

The phenomenon of CO interference has been observed in most organisms. Within eukaryotes, interference can act over long distances. For example, in the nematode C. elegans, interference can span a fusion chromosome of 50 Mb (Lian et al., 2008). The results of this study provide strong evidence for the existence of high-order CO interference. We assessed CO interference in the full-sib family of P. euphratica by mapping the distributions of CO interferences in different dimensions along 19 chromosomes. We observed that high-dimensional CO interference existed to varying degrees on all 19 chromosomes, and found that these high-dimensional interferences were even stronger than one- or two-dimensional CO interferences. The discovery of CO interference in the full-sib family of P. euphratica and the relationship between the strength of the overall CO interference and the chromosome structure can not only help identify and quantify CO interference in the entire genome, but also has the potential to inform further inference on the genome structure, organization, and evolution of P. euphratica populations.

We correlated the genetic length of the chromosome with the strength of the overall high-dimensional CO interference, and found that the mean CO interference strength on each chromosome had a linear relationship with the genetic length of the chromosome. CO rates and chromosome lengths were previously found to be related in other eukaryotic species, including humans, mice, Arabidopsis, and zebrafish (Kleckner et al., 2003). In addition, CO interference affects the CO rate and is affected by the length of the chromosome. In some species, such as yeast, dogs, mice, and pigeons, small chromosomes often have a higher CO density (Froenicke et al., 2002; Basheva et al., 2008; Mancera et al., 2008). Surprisingly, the CO interference strength in this study increased with chromosome length, with longer chromosomes containing a higher CO interference density and a correspondingly smaller CO density. This finding has far-reaching implications for biological evolution. Because of CO interference, the occurrence of CO events is regulated accordingly (Broman et al., 2002). The length of chromosomes indirectly affects the total strength of CO interference, thereby affecting genetic diversity and having important implications for evolution.

middle third of the chromosome; NO.3 indicates the distribution strength of the last third of the chromosome.

According to previous studies, the occurrence of CO events is closely related to the central and terminal regions of chromosomes (Chelysheva et al., 2007). Meanwhile, CO interference has variable intensities and distributions in different regions of the chromosome. Moreover, CO interference can have different regulatory effects on a CO event in the corresponding region and exerts subtle influences on biological inheritance and evolution. We further studied the distribution and difference of CO interference between different regions on the chromosome, finding that the distribution of CO interference strength differed among regions. By defining the range of difference ratios, we found a difference in CO interference strength among chromosome regions. Studies of Arabidopsis chromosomes have shown that CO rates correlate with different genomic features associated with chromosome structure, such as the GC content and CpG ratio. Therefore, the differences in CO interference may also be related to these factors.

In this study, we used multi-point analysis methods to measure CO interference in the full-sib family of P. euphratica, extending traditional linkage analysis to analyze multiple markers simultaneously. Previous studies have demonstrated that this method is a powerful tool for identifying and estimating CO interference (Wang J. et al., 2016). Accurate estimates of high-dimensional CO interference have significant implications in genomic research (Weeks et al., 1994). First, previous studies of interference in experimental organisms generally only involved adjacent interval groups, whereas multi-point analysis can accurately estimate the recombination rate not only between two adjacent markers but also between multiple marker intervals, and can provide additional information about genomic structure and organization. Second, using this method, the strength and distribution of CO interferences between adjacent intervals along a chromosome can be estimated and the results can be used to study the relationship with the structure of the chromosome.

An increasing number of studies have investigated the phenomenon of CO interference. It has been found that CO interference is highly related to many evolutionary and developmental processes, such as gender differences, heterogeneity, senescence, and stress tolerance. The distribution of recombination achieved by CO interference can be determined by genetic background, gender, and many environmental factors, such as temperature and age. However, most genetic mapping studies have not considered CO interference. Regardless, multi-point analysis using genetic mapping has been used to estimate the degree of correlation between CO interference and evolution, and can capture this important phenomenon without extra cost. Similarly, Aggarwal et al. (2015) used multi-point analysis to determine the rules of recombination frequency and CO interference in fruit flies that were selected for desiccation, hypoxia, or hyperoxia tolerance. Here, we have expanded the research on CO interference, allowing future studies to explore the molecular mechanism of CO in the P. euphratica genome by combining multi-point analysis with cytology, clarify the development and evolution of COs, and investigate whether specific genes regulate CO interference.

### AUTHOR CONTRIBUTIONS

PW performed data analysis. PW, MY, and XZ interpreted the results. RW and LJ conceived of the idea and designed the model. PW and LJ wrote the manuscript.

### FUNDING

This work is supported by Fundamental Research Funds for the Central Universities (NO. BLYJ201605, NO. BLX201715, NO. 2015ZCQ-SW-06), grant 31700576 from National Natural Science Foundation of China, grant 31600536 from National Natural Science Foundation of China, grant 201404102 from the State Administration of Forestry of China, NSF/IOS award No. 0923975, and the Thousand-person Plan Award.

### REFERENCES

Aggarwal, D. D., Rashkovetsky, E., Michalak, P., Cohen, I., Ronin, Y., Zhou, D., et al. (2015). Experimental evolution of recombination and crossover interference in Drosophila caused by directional selection for stress-related traits. BMC. Biol. 13:101. doi: 10.1186/s12915-015-0206-5

Albini, S. M. (2010). A karyotype of the Arabidopsis thaliana genome derived from synaptonemal complex analysis at prophase I of meiosis. Plant. J. 5, 665–672. doi: 10.1111/j.1365-313X.1994.00665.x


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Jiang, Ye, Zhu and Wu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Molecular Organization and Chromosomal Localization Analysis of 5S rDNA Clusters in Autotetraploids Derived From Carassius auratus Red Var. (♀) × Megalobrama amblycephala (♂)

QinBo Qin, QiWen Liu, ChongQing Wang, Liu Cao, YuWei Zhou, Huan Qin, Chun Zhao and ShaoJun Liu\*

State Key Laboratory of Developmental Biology of Freshwater Fish, College of Life Sciences, Hunan Normal University, Changsha, China

#### Edited by:

Ancha Baranova, George Mason University, United States

#### Reviewed by:

Lenin Arias Rodriguez, Universidad Juárez Autónoma de Tabasco, Mexico Jingou Tong, Institute of Hydrobiology (CAS), China

> \*Correspondence: ShaoJun Liu lsj@hunnu.edu.cn

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

> Received: 08 September 2018 Accepted: 29 April 2019 Published: 15 May 2019

#### Citation:

Qin QB, Liu QW, Wang CQ, Cao L, Zhou YW, Qin H, Zhao C and Liu SJ (2019) Molecular Organization and Chromosomal Localization Analysis of 5S rDNA Clusters in Autotetraploids Derived From Carassius auratus Red Var. (♀) × Megalobrama amblycephala (♂). Front. Genet. 10:437. doi: 10.3389/fgene.2019.00437

The autotetraploid fish (4n = 200, RRRR) (abbreviated as 4nRR) resulted from the whole genome duplication of red crucian carp (Carassius auratus red var., 2n = 100, RR) (abbreviated as RCC). To investigate the influence of polyploidization on the organization and evolution of the 5S rDNA multigene family, the molecular organization and chromosomal localization of the 5S rDNA were characterized in autotetraploid fish. By sequence analysis of the coding region (5S) and adjacent non-transcribed spacer (NTS), three distinct 5S rDNA units (type I: 203 bp; type II: 340 bp; and type III: 477 bp) were identified and characterized in 4nRR. These 5S rDNA units were inherited from the female parent (RCC), but showed obvious base variations in the NTS and array recombination of repeat units. Using fluorescence in situ hybridization with the different 5S rDNA units as probes, the 5S rDNA clusters were localized on chromosomes of 4nRR and showed obvious loss of chromosomal loci (type I and type II). Our data revealed genetic variation of the 5S rDNA multigene family in the genome of autopolyploid fish. Furthermore, the results provided new insights into the evolutionary patterns of this vertebrate multigene family.

Keywords: autotetraploid line, distant hybridization, 5S rDNA, FISH, chromosomal loci

### INTRODUCTION

Polyploidy is a significant mode of speciation in eukaryotes (Mallet, 2007; Otto, 2007), especially in vertebrates. Ohno (1970) proposed the genome duplication hypothesis, in which two rounds of whole genome duplication occurred during early vertebrate evolution. Polyploids are generally divided into two categories depending on their chromosomal composition and their manner of formation. Autopolyploids have chromosome sets coming from the genome of one species (e.g., AAAA) and exhibit multivalent pairing during meiosis, while allopolyploids result from the combination of sets of chromosomes from two or more different taxa (e.g., AABB) and predominantly form bivalent pairings (Comai, 2005). Notably, multivalent pairing may cause meiotic irregularities and result in reduced fertility compared with diploid progenitors (Jackson, 1982; Parisod et al., 2010). Thus, vertebrate autopolyploids are relatively rare compared with allopolyploids, and the influence of autopolyploidization on intragenomic variation is poorly understood.

**Abbreviations:** 4nRB, allotetraploid hybrids; 4nRR, autotetraploid fish; BSB, blunt snout bream; FISH, fluorescence in situ hybridization; ICRs, internal control regions; NTS, non-transcribed spacer; PCR, polymerase chain reaction; RCC, red crucian carp.

5S ribosomal RNA (rRNA) is a component of the large ribosomal subunit in all ribosomes. In vertebrates, the 5S ribosomal DNA (5S rDNA) is organized in tandem arrays with repeat units composed of a 120-bp coding sequence (5S) that encodes the 5S rRNA and a highly variable non-transcribed spacer (NTS) (Korn and Brown, 1978; Nielsen et al., 1993; Hallenberg and Frederiksen, 2001; Pasolini et al., 2006). Molecular organization and chromosomal localization of the 5S rDNA have been extensively characterized in bony fish (Iue et al., 1989; Rocco et al., 2005; Qin et al., 2010; Danillo et al., 2011). Polyploidization plays an important role in the evolution of fish. However, the features of the 5S rDNA have rarely been reported in polyploid fish. Previously, we successfully obtained fertile allotetraploid hybrids (4n = 148, RRBB) (abbreviated as 4nRB) from the first generation of Carassius auratus red var. (2n = 100, RR) (♀) × Megalobrama amblycephala (2n = 48, BB) (♂) hybrids (Qin et al., 2014a). The abnormal chromosomal behavior of allotetraploid hybrids during meiosis leads to the formation of autodiploid sperm and autodiploid ova, and the fertilization of these ova by these sperm in turn produces autotetraploid fish (4nRR) (Qin et al., 2014b, 2015a). Autotetraploids produce diploid ova and diploid sperm and maintain the formation of the autotetraploid line (F1–F10), which could be used as a new model system for investigating the influence of polyploidy on the organization and evolution of the multigene family of 5S rDNA. In this paper, the molecular organization and chromosomal localization of the 5S rDNA were characterized in the autotetraploid fish and their parent (RCC). Obvious loss of chromosomal loci, base variations in the NTS, and array recombination of repeat units were found in the newly established autotetraploid genomes. Our results extend the knowledge of the influence of polyploidy on the organization and evolution of 5S rDNA of fish, and are also useful in clarifying aspects of vertebrate genome evolution.

### MATERIALS AND METHODS

#### Source of Samples

All fish were cultured in ponds and fed with artificial feed at the Protection Station of Polyploidy Fish, Hunan Normal University. Fish treatments were carried out according to the regulations for protected wildlife and the Administration of Affairs Concerning Animal Experimentation, and approved by the Science and Technology Bureau of China. Approval from the Department of Wildlife Administration was not required for the experiments conducted in this paper. The fish were deeply anesthetized with 100 mg/L MS-222 (Sigma-Aldrich, St. Louis, MO, United States) before dissection.

#### Animals and Crosses

During the reproductive season (April to June) in 2012, the first generation (4nRB) of C. auratus red var. (♀) × M. amblycephala (♂) was produced. During the reproductive season (April to June) of 2014, the second generation (4nRR) was produced by self-crossing of 4nRB.

#### Preparation of Chromosome Spreads

Chromosome counts were performed using kidney tissue from 10 RCC and 10 4nRR. After culture for 1–3 days at a water temperature of 18–22°C, the samples were injected with concanavalin one to three times at a dose of 2–8 mg/g body weight. The interval between injections was 12–24 h. Six hours prior to dissection, each sample was injected with colchicine at a dose of 2–4 mg/g body weight. The excised kidney tissue was ground in 0.9% NaCl, followed by hypotonic treatment with 0.075 M KCl at 37°C for 40–60 min, and then fixed in 3:1 methanol–acetic acid with three changes. The cells were dropped onto cold, wet slides and stained for 30 min in 4% Giemsa. The shape and number of chromosomes were analyzed under a microscope. For each type of fish, 200 metaphase spreads (20 metaphase spreads from each sample) of chromosomes were analyzed. The preparations were examined under an oil lens at a magnification of 3330×.

#### PCR Amplification and Sequencing of 5S rDNA Sequences

Total genomic DNA was isolated from peripheral blood cells according to the standard phenol:chloroform extraction procedure described by Sambrook et al. (1989). To acquire preliminary information on the organization of the 5S rDNA repeat variants, and to test for the possible coexistence of different repeat units in the same array, DNA samples of 3 RCC and 3 4nRR were amplified with primers 5SP1-5SP2R (5′-GCTATGCCCGATCTCGTCTGA-3′ and 5′-CAGGTTGGTATGGCCGTAAGC-3′) and with primers 5SNT1-5SNT2R (5′-GGCGAGTAGATTGGCTGAACA-3′ and 5′-CAATCTAATCGCCAGTACATTATAT-3′). The PCR reaction was performed in a volume of 25 µL with approximately 20 ng of genomic DNA, 1.5 mM of MgCl2, 200 µM of each dNTP, 0.4 µM of each primer, and 1.25 U of Taq polymerase (Takara). The temperature profile was as follows: an initial denaturation step at 94°C for 5 min, followed by 30 cycles of 94°C for 30 s, 56°C for 30 s, and 72°C for 1 min, with a final extension step at 72°C for 10 min. Amplification products were separated on a 3.0% agarose gel using TBE buffer. The DNA fragments were purified using a gel extraction kit (Sangon) and ligated into pMD18-T (Takara). Plasmids were transformed into Escherichia coli DH5α, propagated, and then purified. The cloned DNA fragments were sequenced using an automated DNA sequencer (ABI PRISM 3730). Sequence homology and variation among the fragments amplified from 3 RCC and 3 4nRR were analyzed using ClustalW software<sup>1</sup>.

<sup>1</sup>http://www.ebi.ac.uk/clustalw/intex.html
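Because the primer sequences are easily garbled in transcription, a small sanity check can be useful. The sketch below records the four primers as given in the text (pairing each name with the sequence listed in the same order is an assumption) and computes length, GC fraction, and reverse complement; the helper names are illustrative only.

```python
# Primer sequences as listed in the text, written 5' to 3'.
# Assumption: names map to sequences in the order they are given.
PRIMERS = {
    "5SP1": "GCTATGCCCGATCTCGTCTGA",
    "5SP2R": "CAGGTTGGTATGGCCGTAAGC",
    "5SNT1": "GGCGAGTAGATTGGCTGAACA",
    "5SNT2R": "CAATCTAATCGCCAGTACATTATAT",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_valid_dna(seq):
    """True when the sequence contains only unambiguous bases A, C, G, T."""
    return len(seq) > 0 and set(seq.upper()) <= set("ACGT")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence, returned 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        print(name, len(seq), f"GC={gc_fraction(seq):.2f}", reverse_complement(seq))
```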

#### Fluorescence in situ Hybridization

The probes for fluorescence in situ hybridization (FISH) for the 5S gene were constructed for RCC and amplified by PCR using the primers 5′-GCTATGCCCGATCTCGTCTGA-3′ and 5′-CAGGTTGGTATGGCCGTAAGC-3′. The FISH probes were produced by Dig-11-dUTP labeling (using a Nick Translation Kit, Roche, Germany) of purified PCR products, and hybridization was performed according to the method described by Yi et al. (2003) with minor modifications. After treatment with 30 µg/ml RNase A in 2 × SSC for 30 min at 37°C, the slides with chromosome metaphase spreads were denatured in 70% deionized formamide/2 × SSC for 2 min at 70°C, dehydrated in a 70, 90, and 100% ethanol series for 5 min each (1 × SSC is 0.15 M NaCl/0.015 M sodium citrate, pH 7.6), and then air-dried. Next, 4 µl of the hybridization mixture (approximately 100 ng of labeled probes, 50% formamide, 10 mg dextran sulfate/ml and 2 × SSC) was denatured for 10 min in boiling water, applied to the air-dried slides carrying denatured metaphase chromosomes under a 22 × 22 mm coverslip, and sealed with rubber cement. The slides were then placed in a moist chamber and incubated overnight at 37°C.

Following overnight incubation, the coverslips were removed and the slides were rinsed at 43°C in 2 × SSC with 50% formamide (twice, 15 min each), 2 × SSC (5 min), and 1 × SSC (5 min), and then air-dried. Hybridization signals were detected by applying 8 µl of 5 µg/ml FITC-conjugated anti-digoxigenin antibody from sheep (Roche, Germany), followed by a final incubation in the humidity chamber at 37°C. After a series of washes with TNT (containing 0.1 M Tris–HCl, 0.15 M NaCl, 0.05% Tween 20) at 43°C, the slides were mounted in antifade solution containing 2 µg/ml 4′,6-diamidino-2-phenylindole (DAPI) for 5 min. Slides were viewed under a Leica inverted CW4000 microscope and a Leica LCS SP2 confocal image system (Leica, Germany). Metaphase spreads of chromosomes were analyzed in 10 RCC and 10 4nRR (20 metaphase spreads in each sample).

### RESULTS

#### Molecular Organization of the 5S rDNA Classes

Using the primers 5SP1 and 5SP2R, fragments of approximately 200, 340, and 500 bp were generated from RCC and 4nRR (**Figure 1A**). All fragments proved to be 5S rDNA sequences, each of which included the 3′ end of the coding region (pos. 1-21), the whole NTS region, and a large 5′ portion of the coding region of the adjacent unit (pos. 22-120; see **Figure 1B**). In RCC, the three 5S rDNA classes (designated type I: 203 bp; type II: 340 bp; and type III: 477 bp) were characterized by distinct NTS types (designated NTS-I, NTS-II, and NTS-III for the 83-, 220-, and 357-bp sequences, respectively; **Figure 2**). 4nRR had three 5S rDNA classes, which were completely inherited from RCC (type I, type II, and type III; **Figure 2**). All 5S rDNA sequences have been submitted to GenBank, and their accession numbers are listed in **Table 1**.
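The reported unit lengths follow directly from the fragment layout described above: 21 bp of coding sequence, the whole NTS, and 99 bp of coding sequence from the adjacent unit. A minimal sketch of that arithmetic, with constant and function names chosen here for illustration, is:

```python
# Segment lengths in bp, taken from the text.
CODING_SEGMENT_A = 21   # coding positions 1-21 in the fragment
CODING_SEGMENT_B = 99   # coding positions 22-120 in the fragment
NTS_LENGTHS = {"type I": 83, "type II": 220, "type III": 357}


def amplicon_length(nts_length):
    """Fragment length: one coding segment, the whole NTS, the other segment."""
    return CODING_SEGMENT_A + nts_length + CODING_SEGMENT_B


if __name__ == "__main__":
    for unit, nts in NTS_LENGTHS.items():
        print(unit, amplicon_length(nts), "bp")
```

The two coding segments together account for the full 120-bp coding region, so each class differs only in the length of its NTS.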

Comparison of the 120-bp coding regions of 5S rDNA from RCC and 4nRR revealed high similarity (**Figure 3**). No nucleotide variation was detected among the internal control regions (ICRs, i.e., the promoters for transcription) in 4nRR (**Figure 3**). A comparison of NTS-I revealed six base substitutions among the sequences (**Figure 4A**). A comparison of NTS-II showed five base substitutions and a deletion-insertion at position -177 (**Figure 4B**). A comparison of NTS-III elements showed nine base substitutions and a deletion-insertion at position -164 (**Figure 4C**). These results indicate that obvious nucleotide variations occurred in the NTS sequences of 4nRR. In addition, characterization of the NTS upstream region showed that the TATA control element, the regulatory region for 5S gene transcription, was identifiable in the NTS of RCC and 4nRR (at -29 in all NTS sequences, where it was modified to TAAA; **Figure 4**), suggesting that all sequences analyzed here were likely to correspond to functional genes.

markers (100 bp increments); lane 1, three DNA fragments (approximately 200, 340, and 500 bp) from RCC; lane 2, three DNA fragments (approximately 200, 340, and 500 bp) from the 4nRR; (B) Arrangement of higher eukaryotic 5S rRNA genes (red) intercalated with non-transcribed DNA segments (NTS; black); (C) Representative sequences of 5S rDNA type I, II, and III from RCC and 4nRR; red indicates 5S rRNA genes; blue, green, and purple indicate distinct NTS sequences.

FIGURE 2 | Representative sequences of 5S rDNA from RCC and 4nRR. Complete 5S coding regions are shaded; the NTS upstream TATA elements are underlined.

#### Array Recombination of the 5S rDNA Repeat Units

TABLE 1 | GenBank accession numbers of the 5S rDNA sequences in RCC and 4nRR.

RCC, Carassius auratus red var.; 4nRR, autotetraploid fish.

FIGURE 3 | Comparison of 5S coding regions from RCC and 4nRR. Internal control regions of the coding region are shaded.

Thirty clones of the 500 bp fragment from 4nRR were analyzed, and the sequence analysis revealed that five clones were dimeric 5S rDNA formed by 5S rDNA type I (the 99 bp gene sequence, 83 bp of NTS, and 21 bp gene sequence) and 5S rDNA type II (the 99 bp gene sequence, 220 bp of NTS, and 21 bp gene sequence) (**Figure 1C** and **Supplementary Figure 1**). To verify whether the different 5S rDNA classes (type I and type II) were associated within the same tandem array, we designed the primers 5SNT1-5SNT2R. Using these primers, the PCR yielded a single band of 352 bp in 4nRR, but no band in RCC and 4nRB (**Figure 5**). Sequence analysis revealed that this fragment was formed by 72 bp of type I (51 bp of NTS and a 21 bp gene sequence) and 280 bp of type II (a 99 bp gene sequence and 181 bp of NTS) (**Figure 5** and **Supplementary Figure 2**). The PCR amplification products of the two primers provided direct evidence that the type I and type II repeats were associated within the same tandem array in 4nRR, suggesting that recombination of chromosomes occurred in the autotetraploid genome.
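The segment lengths reported above can be checked for internal consistency with a few lines of arithmetic. The sketch below only re-adds the numbers given in the text; the helper name `fragment_length` and the segment labels are illustrative and not part of the original analysis.

```python
def fragment_length(segments):
    """Sum the lengths (bp) of consecutive (label, length) segments."""
    return sum(length for _label, length in segments)

# Junction fragment amplified with the primers 5SNT1-5SNT2R (lengths from the text).
junction_type_i = [("type I NTS", 51), ("type I gene segment", 21)]
junction_type_ii = [("type II gene segment", 99), ("type II NTS", 181)]
pcr_product = junction_type_i + junction_type_ii

# Monomeric repeat units found in the dimeric clones (lengths from the text).
monomer_type_i = [("gene segment", 99), ("NTS", 83), ("gene segment", 21)]
monomer_type_ii = [("gene segment", 99), ("NTS", 220), ("gene segment", 21)]

print(fragment_length(junction_type_i))   # 72
print(fragment_length(junction_type_ii))  # 280
print(fragment_length(pcr_product))       # 352
print(fragment_length(monomer_type_i))    # 203
print(fragment_length(monomer_type_ii))   # 340
```

The two junction parts add up to the 352 bp band, and the monomer totals (203 and 340 bp) agree with the fragments of approximately 200 and 340 bp seen on the gel.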

#### Chromosomal Loci of 5S rDNA

Hybridization with the type I 5S rDNA probe showed eight 5S gene loci in RCC metaphase chromosomes (**Figure 6A** and **Table 2**). Sixteen 5S gene loci were expected in 4nRR metaphase chromosomes, but only twelve were found (**Figure 6B** and **Table 2**). Using type II 5S rDNA as a probe, a pair of large 5S gene loci was identified on homologous submetacentric chromosomes in RCC, and a pair of small 5S gene loci was localized on homologous subtelocentric chromosomes (**Figure 6C** and **Table 2**). In 4nRR, one pair of large 5S gene loci was found on homologous submetacentric chromosomes, whereas the other expected pair of large loci was lost; two pairs of small 5S gene loci were localized on homologous subtelocentric chromosomes (**Figure 6D** and **Table 2**). FISH with the type III 5S rDNA probe yielded eight 5S gene loci on RCC metaphase chromosomes (**Figure 6E** and **Table 2**) and, as expected, sixteen 5S gene loci on 4nRR metaphase chromosomes (**Figure 6F** and **Table 2**). These results indicate that chromosomal loci of type I and type II 5S rDNA were lost in 4nRR.
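The comparison between expected and observed loci can be made explicit. The following sketch assumes, as the text does for type I and type III, that an autotetraploid derived from RCC is expected to carry twice the number of loci seen in RCC; the dictionary keys and the function name `missing_loci` are illustrative.

```python
# Loci counted by FISH, as reported in the text (Figure 6, Table 2).
rcc_loci = {"type I": 8, "type II large": 2, "type II small": 2, "type III": 8}
observed_4nrr = {"type I": 12, "type II large": 2, "type II small": 4, "type III": 16}

def missing_loci(reference, observed, fold=2):
    """Expected loci (fold x reference) minus observed loci, per probe."""
    return {probe: fold * reference[probe] - observed[probe] for probe in reference}

lost = missing_loci(rcc_loci, observed_4nrr)
print(lost)
# {'type I': 4, 'type II large': 2, 'type II small': 0, 'type III': 0}
```

Under this doubling assumption, four type I loci and one pair of large type II loci are missing in 4nRR, while the small type II loci and the type III loci are fully retained.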

### DISCUSSION

The evolution of 5S rDNA is driven by birth-and-death processes with strong purifying selection (Nei and Rooney, 2005; Pinhal et al., 2011; Vizoso et al., 2011), which can lead to the existence of different types of NTS (Pinhal et al., 2011). In teleosts, two distinct 5S rDNA classes are characterized by distinct NTS types and base substitutions in the 5S rRNA gene (Pendas et al., 1994; Moran et al., 1996; Martins et al., 2000; Wasko et al., 2001; Pinhal et al., 2009). Thus, possession of two 5S rDNA classes seems to be a general trend for the organization of these sequences in the genomes of fish (Martins and Galetti, 2001). As an ancient polyploid fish, RCC has undergone an additional round of whole-genome duplication (Qin et al., 2016b). The origin of genic variants has been attributed to events of genome duplication followed by processes that result in the divergence of the duplicated sequences. Thus, RCC possess three distinct 5S rDNA classes that are characterized by distinct types of NTS (Qin et al., 2010). In the current study, 4nRR, derived from the distant hybridization of C. auratus red var. (2n = 100, RR) (♀) × M. amblycephala (2n = 48, BB) (♂), possess four sets of RCC-derived chromosomes and exhibit stability in chromosome number (or ploidy) over consecutive generations (F1–F10) (Qin et al., 2014b). 4nRR have three distinct 5S rDNA classes that are completely inherited from RCC, and no new types of 5S rDNA class were found, suggesting that divergence of the duplicated 5S rDNA sequences was not yet established in the early generations of the autotetraploid fish.

Because of incompatibility between parental chromosomes, allopolyploidization can increase genomic changes (Pontes et al., 2004). Our previous study revealed the influence of allopolyploidy on 5S rDNA in fish, including parental genome specific loss, substitutions, and insertions-deletions in the NTS sequence (Qin et al., 2010, 2016a). Theoretically, homologous chromosomes should have high compatibility in autotetraploids. In this paper, however, obvious base variation and insertions-deletions in the NTS were also observed in 4nRR, suggesting that autotetraploidization could lead to genetic variation in newly established autotetraploid genomes. Although there was genetic variation in the NTS of 5S rDNA, all sequences analyzed here were likely to correspond to functional genes, because they exhibited all the necessary features for correct gene expression: three ICRs (box A, internal element, and box C), a TATA control element, and a T-rich tail.

Autopolyploids are traditionally used to demonstrate multivalent pairing during meiosis. However, the coexistence of four homologous chromosome sets does not result in multivalent formation during meiosis in 4nRR, and diploid-like chromosome pairing was restored (Qin et al., 2019). The presence of two distinct 5S rDNA sequence types organized in different chromosomal regions or even on different chromosomes has been described for several fish (Pendas et al., 1994; Moran et al., 1996; Sajdak et al., 1998; Martins et al., 2002; Rodrigues et al., 2012; Qin et al., 2015b). In the current study, the different 5S rDNA classes (type I and type II) were associated within the same tandem array in 4nRR. In addition, type I and type II 5S rDNA clusters were localized on the chromosomes of 4nRR and showed obvious loss of chromosomal loci. These findings are clear evidence that elimination of repetitive sequences and recombination of chromosomes occurred in newly established autotetraploid genomes. A positive linear relationship was found between increased bivalent pairing and elimination of specific, low-copy DNA sequences (Wendel, 2000). Thus, we speculate that the elimination of DNA sequences or recombination of chromosomes might generate immediate divergence between homologous chromosomes, providing a physical basis for diploid-like chromosome pairing in 4nRR.

showed twelve 5S gene loci of type I (white arrows) in 4nRR; (C) Type II as probe showed two big (yellow arrows) and two small 5S gene loci (white arrows) in RCC; (D) Type II as probe showed two big (yellow arrows) and four small 5S gene loci (white arrows) in 4nRR; (E) Type III as probe showed eight 5S gene loci (white arrows) in RCC; (F) Type III as probe showed sixteen 5S gene loci (white arrows) in 4nRR. Bar = 3 µm.

RCC, Carassius auratus red var.; 4nRR, autotetraploid fish.

#### ETHICS STATEMENT


Fish treatments were carried out according to the regulations for protected wildlife and the Administration of Affairs Concerning Animal Experimentation, and approved by the Science and Technology Bureau of China. Approval from the Department of Wildlife Administration was not required for the experiments conducted in this manuscript. The fish were deeply anesthetized with 100 mg/L MS-222 (Sigma-Aldrich, St. Louis, MO, United States) before dissection.

#### AUTHOR CONTRIBUTIONS

QQ and SL designed the experiments. QL, CW, LC, YZ, HQ, and CZ performed the experiments. QQ and QL performed the statistical analysis. QQ wrote the manuscript. All authors read and approved the final manuscript.

#### REFERENCES


#### FUNDING

This research was financially supported by grants from the Natural Science Foundation of Hunan Province for Distinguished Young Scholars (Grant No. 2017JJ1022), the National Natural Science Foundation of China (Grant Nos. 31430088 and 31210103918), the Major Program of the Educational Commission of Hunan Province (Grant No. 17A133), the State Key Laboratory of Developmental Biology of Freshwater Fish, the Cooperative Innovation Center of Engineering and New Products for Developmental Biology of Hunan Province (20134486), the Earmarked Fund for China Agriculture Research System (CARS-45), and the Construction Project of Key Disciplines of Hunan Province and China.

#### ACKNOWLEDGMENTS

We sincerely thank the many researchers who helped to complete this manuscript, including Drs. Yao Zhanzhou and Zhao Rurong.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00437/full#supplementary-material

and chromosome loci. Cytogenet. Genome Res. 98, 78–85. doi: 10.1159/000068542



(Female) × Megalobrama amblycephala (Male). Biol. Reprod. 91:93. doi: 10.1095/biolreprod.114.122283


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Qin, Liu, Wang, Cao, Zhou, Qin, Zhao and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Dicyemida and Orthonectida: Two Stories of Body Plan Simplification

Oleg A. Zverkov<sup>1</sup>, Kirill V. Mikhailov<sup>1,2</sup>, Sergey V. Isaev<sup>1,3</sup>, Leonid Y. Rusin<sup>1,4</sup>, Olga V. Popova<sup>2</sup>, Maria D. Logacheva<sup>1,2,5</sup>, Alexey A. Penin<sup>1,2</sup>, Leonid L. Moroz<sup>6</sup>, Yuri V. Panchin<sup>1,2</sup>, Vassily A. Lyubetsky<sup>1</sup> and Vladimir V. Aleoshin<sup>1,2</sup>\*

<sup>1</sup> Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia, <sup>2</sup> A.N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia, <sup>3</sup> Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia, <sup>4</sup> Faculty of Biology, Lomonosov Moscow State University, Moscow, Russia, <sup>5</sup> Skolkovo Institute of Science and Technology, Moscow, Russia, <sup>6</sup> Department of Neuroscience, McKnight Brain Institute, University of Florida, Gainesville, FL, United States

Two enigmatic groups of morphologically simple parasites of invertebrates, the Dicyemida (syn. Rhombozoa) and the Orthonectida, have usually been considered since the 19th century as two classes of the phylum Mesozoa. Early molecular evidence suggested their relationship within the Spiralia (=Lophotrochozoa); however, high rates of dicyemid and orthonectid sequence evolution led to contradicting phylogeny reconstructions. Genomic data for orthonectids revealed that they are highly simplified spiralians and possess a reduced set of genes involved in metazoan development and body patterning. Acquiring genomic data for dicyemids, however, remains a challenge due to complex genome rearrangements, including chromatin diminution and generation of extrachromosomal circular DNAs, which are reported to occur during the development of somatic cells. We performed genomic sequencing of one species of Dicyema, and obtained transcriptomic data for two Dicyema spp. Homeodomain (homeobox) transcription factors, G-protein-coupled receptors, and many other protein families have undergone a massive reduction in dicyemids compared to other animals. There is also an apparent reduction of the bilaterian gene complements encoding components of the neuromuscular systems. We constructed and analyzed a large dataset of predicted orthologous proteins from three species of Dicyema and a set of spiralian animals, including the newly sequenced genome of the orthonectid Intoshia linei. Bayesian analyses recovered the orthonectid lineage within the Annelida. In contrast, dicyemids form a separate clade with weak affinity to the Rouphozoa (Platyhelminthes plus Gastrotricha) or (Entoprocta plus Cycliophora), suggesting that the historically proposed Mesozoa is a polyphyletic taxon. Thus, dramatic simplification of body plans in dicyemids and orthonectids, as well as their intricate life cycles that combine metagenesis and heterogony, evolved independently in these two lineages.

#### Keywords: Mesozoa, Dicyemida, Orthonectida, genome, mitochondrial DNA, phylogeny

#### Edited by:

Ancha Baranova, George Mason University, United States

#### Reviewed by:

Andreas Hejnol, University of Bergen, Norway; Denis Baurain, University of Liège, Belgium; Max Telford, University College London, United Kingdom

\*Correspondence: Vladimir V. Aleoshin, aleshin@genebee.msu.su

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

Received: 03 November 2018; Accepted: 29 April 2019; Published: 24 May 2019

#### Citation:

Zverkov OA, Mikhailov KV, Isaev SV, Rusin LY, Popova OV, Logacheva MD, Penin AA, Moroz LL, Panchin YV, Lyubetsky VA and Aleoshin VV (2019) Dicyemida and Orthonectida: Two Stories of Body Plan Simplification. Front. Genet. 10:443. doi: 10.3389/fgene.2019.00443

#### INTRODUCTION

In spite of more than one hundred years of studies, the evolutionary relationships of the Mesozoa are still elusive. The name of this taxon reflects the traditional view of mesozoans as organisms with intermediate organization between unicellular protozoans and multicellular metazoans (Van Beneden, 1876; Hyman, 1940). Indeed, the two groups of microscopic parasitic invertebrates, the Dicyemida and the Orthonectida, display a remarkably simple morphological organization and a nearly complete absence of tissues and organs (Malakhov, 1990). Adult dicyemids inhabit the renal sacs of cephalopod mollusks, consist of just about 40 somatic cells, and lack recognized muscular, nervous, and sensory cells and the organs typical for eumetazoans (Furuya et al., 2004). Dicyemids do not have a morphologically recognized basal membrane (Czaker, 2000), and never develop "true" tissues throughout their complex life cycle (Furuya and Tsuneki, 2003). The trophic stage of orthonectids is a syncytial plasmodium, which resides inside the invertebrate host and generates ephemeral ciliated organisms that exit the host for reproduction (Slyusarev, 2008). These organisms are composed of several hundred somatic cells without anatomically recognized digestive, circulatory, or excretory systems. Before the discovery of muscular and nervous systems in the swimming stages of orthonectids (Slyusarev and Starunov, 2015), they were thought to have a planula-like organization and were grouped with dicyemids in the Mesozoa as multicellular animals with an incredibly simple body plan, perhaps the simplest among all Metazoa and comparable to that of placozoans.

The intricate life cycles of dicyemids and orthonectids exhibit the alternation of asexual and sexual generations, termed metagenesis. Ameiotic generative cells (agametes) develop inside the dicyemid axial cell and later produce the next vermiform generation possessing gametic cells that undergo self-fertilization. In orthonectids, agametes develop inside the parasitic plasmodium and produce the free-living diecious (or hermaphroditic) generation (Cheng, 1986; Slyusarev, 2008). The phenomenon of successive sexual parthenogenetic and amphimictic generations is termed heterogony. In this sense, orthonectids and dicyemids, as well as parasitic flatworms, combine metagenesis and heterogony in their life cycles. In particular, trematode sporocysts and rediae that parasitize gastropod mollusks produce the next generation from ameiotic generative cells (Dobrovolskij and Ataev, 2003; Ataev, 2017). Similarities in life cycles have long sustained the hypothesis of a close relationship of dicyemids and orthonectids with digenetic trematodes. On the other hand, the intracellular localization of generative cells relates dicyemids and orthonectids to myxozoans rather than trematodes. Such an intricate combination of traits makes the life strategies of dicyemids and orthonectids unique among animals.

The phylogenetic affinity of dicyemids and orthonectids has been called into question on the grounds of morphology (Kozloff, 1990; Brusca and Brusca, 2003; Ruppert et al., 2004). Molecular data conclusively demonstrated that both dicyemids and orthonectids are in fact bilaterians (Katayama et al., 1995; Hanelt et al., 1996; Pawlowski et al., 1996; Aruga et al., 2007) and belong to the diverse clade of Lophotrochozoa (=Spiralia) (Kobayashi et al., 1999, 2009; Petrov et al., 2010; Suzuki et al., 2010; Mikhailov et al., 2016; Lu et al., 2017; Schiffer et al., 2018), thus implying that their simple organization evolved as the result of their parasitic lifestyle.

In molecular phylogenetic analyses, dicyemid and orthonectid lineages display extremely high levels of divergence, and their exact placement among the spiralians remains ambiguous and potentially prone to long branch attraction artifacts. Complicating the matter is the uncertainty in relationships between other spiralian taxa, including the Annelida, Mollusca, Nemertea, Brachiopoda, Entoprocta, and Bryozoa (Kocot, 2016). Recent phylogenomic analyses lead to conflicting conclusions regarding the mesozoan phylogeny. Lu et al. (2017) using a dataset of 348 orthologs (58,124 alignment positions) from 23 spiralian species, including an orthonectid and a dicyemid, report the monophyly of the Mesozoa either as a sister group to the Rouphozoa (Platyhelminthes + Gastrotricha) or within the Gastrotricha. Alternatively, Schiffer et al. (2018) using a dataset of 469 orthologs (190,027 alignment positions) from 29 spiralian species, including an orthonectid and two dicyemids, conclude that Orthonectida and Dicyemida evolved independently within the Lophotrochozoa, with the orthonectids exhibiting clear affinity to annelids, and dicyemids occupying an isolated position within Lophotrochozoa. Here, we obtained transcriptomic and genomic data for dicyemid species to resolve this contradiction.

The dicyemid genome is distinguished by uncommon features, such as the genome rearrangements during the life cycle and generation of circular DNAs (Noto et al., 2003), including those that encode mitochondrial proteins and rRNAs (Watanabe et al., 1999; Catalano et al., 2015). It is not yet established if the mitochondrial protein-coding genes are encoded only by small circular DNA molecules (Watanabe et al., 1999) or whether they are produced during the dicyemid development from a precursor mitochondrial DNA with a more typical metazoan organization (Awata et al., 2006). Using high-throughput genomic sequencing we sought to find any properties of dicyemid sequences that would reveal their genome organization. We also estimated the extent of gene losses due to the simplification of dicyemid morphological organization, and analyzed whether losses in particular gene families and regulatory pathways are the same or different compared to an orthonectid Intoshia linei.

### RESULTS AND DISCUSSION

## Genomic Sequencing and Assembly of Dicyema sp.

Direct assembly of a dicyemid genome from whole DNA extracts using standard approaches is an extremely challenging problem due to the genome rearrangements that occur in dicyemids during development. Previous studies have demonstrated that somatic cells of dicyemids undergo drastic genome rearrangements and chromatin elimination (Noto et al., 2003), and suggested that selective and whole genome amplification takes place at different stages of their development (Awata et al., 2006). Accordingly, the sequencing of whole DNA extracts from Dicyema sp. resulted in a highly fragmented assembly with uneven coverage and N50 of 942 bp, where the largest contig was only around 20 Kb. The total size of the assembly is 858 Mbp in nearly 1 million contigs over the length of 500 bp, and includes contaminating cephalopod sequences. Due to the significant genetic difference between the dicyemid host Enteroctopus dofleini and the available genomic sequence of Octopus bimaculoides, the filtering of the assembly was performed at the level of predicted gene products. Only predictions identifiable by hits against the InterPro database were retained for the subsequent comparative analyses, and these were filtered for cephalopod contamination using the best hit approach with BLAST searches against the NCBI nr database. Out of 38,410 predictions with InterPro hits, 43% were discarded as contamination, resulting in 21,842 putative dicyemid genes with 71% complete and 12% fragmented universal eukaryotic orthologs evaluated by BUSCO (**Table 1**). Similar values are obtained for gene predictions after normalizing on the number of BUSCOs found in at least one filtered transcriptome: 76% complete and 12% fragmented. The total percentage of BUSCOs recovered by at least one sequencing library, including genomic and transcriptomic filtered data, approaches values seen in typical metazoan genomes: 91% complete and 3% fragmented. For all analyses in Sections 2.4–2.11 we used original genomic data on Dicyema sp., and the three transcriptomes, including the two originally obtained and the one of Dicyema japonicum available from the published source (Lu et al., 2017).
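For readers less familiar with the N50 statistic quoted above, a minimal sketch of its usual definition follows: the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative and are not data from this study.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total of contigs, sorted
    from longest to shortest, first reaches half of the total length."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

# Toy example (hypothetical lengths in bp): total = 32, half = 16.
toy_assembly = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_assembly))  # 8
```

A low N50 relative to the total assembly size, as reported here, indicates that most of the assembled sequence sits in short contigs.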

The dicyemid genes display miniaturization of spliceosomal introns: the median length of introns is 27 bp, and approximately two thirds of predicted introns are under the length of 30 bp (**Figure 1**). This agrees with an earlier survey that revealed extreme intron shortening in a set of 40 genes from D. japonicum (Ogino et al., 2010). The estimated intron density in Dicyema sp. is 4.9 introns/gene for predictions with intact start and stop codons, which is also similar to the 5.3 introns/gene reported for D. japonicum. A similar intron density is seen in the genome of the orthonectid I. linei (Mikhailov et al., 2016). Notably, the orthonectid genes also harbor short spliceosomal introns, but the majority of its introns are longer than 30 bp, and the median size is 57 bp, considerably exceeding the intron lengths observed in dicyemid genes.
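The two intron statistics used above (median length and the fraction of introns under 30 bp) are simple to compute from a list of intron lengths. The sketch below uses toy values that are not data from this study; the function name `intron_summary` and the 30 bp cutoff parameter are illustrative.

```python
from statistics import median

def intron_summary(intron_lengths, cutoff=30):
    """Return (median length, fraction of introns shorter than cutoff)."""
    short = sum(1 for length in intron_lengths if length < cutoff)
    return median(intron_lengths), short / len(intron_lengths)

# Toy intron lengths in bp (hypothetical).
toy_introns = [24, 25, 26, 27, 27, 28, 31, 45, 60]
med, frac_short = intron_summary(toy_introns)
print(med)                   # 27
print(round(frac_short, 3))  # 0.667
```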

#### "Circular" Contigs in Genomic Assembly of Dicyema sp.

Using the genomic assembly we have identified 24,065 "circular" contigs (see section "Materials and Methods"). The distribution of circular contig lengths in the assembly is multimodal (**Figure 2A**). The first abundant pool of sequences is formed from contigs less than 500 bp. The second pool, which includes sequences of a length over 500 bp, consists of 3,220 contigs with the median length of 702 bp. The properties of the sequences in this pool (such as length and abundance) are consistent with previous data of DNA gel electrophoresis, EM and PCR experiments (Noto et al., 2003), which supports the conjecture that these sequences are circular DNA rather than direct repeats. "Short" circles (up to 500 bp length) were shown to possess 38.1% low complexity regions, whereas "long" circles had only 2.9%. This observation might suggest that a fraction of predicted short circles represents direct repeats. Following this rationale, we considered the two sub-pools separately in analyses.

<sup>∗</sup>Genomic predictions were filtered by retaining only hits to the InterPro database and cleaned from the cephalopod contamination with BLAST searches against the NCBI nr database.

<sup>∗∗</sup>Transcriptome assemblies were filtered with BLAST searches against the RefSeq database as detailed in Materials and Methods, Section "Assembly and Filtering of Dicyemid Sequences."

FIGURE 2 | Scatter plot of circular contigs in the Dicyema sp. genome. Different markers stand for circles with the specific signal and with mitochondrial genes (see text for clarification). (A) The distributions for both length and coverage logarithm are non-uniform: there is a pool of "long" contigs with a high level of coverage in the set. (B) The distribution of "long" contigs with a high level of coverage is, in turn, non-uniform; the distance between two adjacent peaks is approximately 10.44 bp. On both scatterplots the upper axis shows the distribution of contig lengths, and the right axis shows the coverage. Blue dots correspond to circular contigs, and green dots are linear contigs. The scale is logarithmic. Both the coverage and length of circular contigs are obviously less uniform compared to those of linear contigs.

The lengths of sequences from the second pool of circular contigs are distributed non-uniformly, which is particularly evident within the 600–800 bp range (**Figure 2B**). The average distance between two adjacent peaks of this distribution is 10.44 bp, which closely corresponds to the number of base pairs in one turn of B-DNA. A multimodal distribution was also observed (Kolmogorov–Smirnov test p-value is 0.999) when performing assembly with varying k-mer sizes (55 or 77) and with another assembly method (**Supplementary Figure S1**). The presence of this pattern is unexpected, and presumably could be attributed to the greater stability of circles, or a greater tendency to circularize, for molecules with an integer number of turns of a relaxed form of DNA. A similar effect has also been observed in short (<200 bp) sequences as a result of rolling circle replication bias (Joffroy et al., 2018). This distribution can result from the random ligation of linear molecules cut from the genome, as it leads to a reduction in DNA supercoiling. Alternatively, replicating mini-circular DNA molecules can be selected in length to reduce their supercoiling. **Figure 3** shows that the coverage value for "long" circular contigs is not lower than for linear ones, which casts doubt on the proposed diminution of circular molecules during ontogenesis.
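The peak-spacing statistic mentioned above can be illustrated with a short sketch: count contigs per length, call a length a peak when it is a local maximum of that histogram, and average the gaps between neighboring peaks. This is only one plausible way to obtain such a number and is not necessarily the procedure used in this study; the function names, the `min_count` threshold, and the synthetic lengths are all assumptions made for illustration.

```python
from collections import Counter

def length_peaks(lengths, min_count=3):
    """Lengths whose contig count is a strict local maximum of the histogram."""
    counts = Counter(lengths)  # a missing length counts as 0
    return [
        length
        for length in sorted(counts)
        if counts[length] >= min_count
        and counts[length] > counts[length - 1]
        and counts[length] > counts[length + 1]
    ]

def mean_peak_spacing(peaks):
    """Average distance (bp) between adjacent peaks."""
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return sum(gaps) / len(gaps)

# Synthetic data (hypothetical): contig lengths clustered around integer
# multiples of 10.44 bp within the 600-800 bp range.
centers = [round(10.44 * turns) for turns in range(58, 77)]
synthetic_lengths = []
for center in centers:
    synthetic_lengths += [center] * 5 + [center - 1] * 2 + [center + 1] * 2

peaks = length_peaks(synthetic_lengths)
spacing = mean_peak_spacing(peaks)
print(len(peaks), round(spacing, 2))
```

Because contig lengths are integers, individual gaps between peaks are 10 or 11 bp, and only their average approaches the helical repeat.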

Long circular contigs are mostly dissimilar in nucleotide sequence. Only 15% of them have at least one fairly similar contig, and only three families of contigs contain more than 10 members (**Figure 4**).

Two independent motif detection methods (Bailey and Elkan, 1994; Rubanov et al., 2016) have been applied to the circles of length 600–800 bp with a coverage logarithm of over 3 (2,031 sequences). In 1,871 sequences (92.12% of sequences in the analysis) common motifs have been found (E-value: 4.8e-82, see **Figure 5A**).

At a p-value < 10<sup>−5</sup>, the most common motif occurs on average once every 874 bp in "long" circles and once every 21,863 bp throughout the entire assembly (the difference is statistically significant by the chi-squared test: p-value < 0.001). The search for highly conserved sequences in various subsets of genome sequences has demonstrated that less common motifs with high information content can also be found in circles (**Figures 5B–E**).
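The densities quoted above (once every 874 bp versus once every 21,863 bp, roughly a 25-fold difference) are of the form "total sequence length divided by number of motif occurrences". When sequences are circular, an occurrence may span the junction between the end and the start of the contig, so the scan has to wrap around. The sketch below shows this with exact string matching on one strand, which is a deliberate simplification of the profile-based searches used in the study; the function names and the toy sequences are hypothetical.

```python
def count_circular(seq, motif):
    """Count exact motif occurrences in a circular sequence (one strand),
    including occurrences that span the end/start junction."""
    n, m = len(seq), len(motif)
    if m == 0 or m > n:
        return 0
    extended = seq + seq[: m - 1]  # wrap the first m-1 bases around
    return sum(1 for i in range(n) if extended[i : i + m] == motif)

def bp_per_occurrence(seqs, motif):
    """Total length divided by total occurrences; None if the motif is absent."""
    hits = sum(count_circular(s, motif) for s in seqs)
    total = sum(len(s) for s in seqs)
    return total / hits if hits else None

# Toy circles (hypothetical). In the first one the motif is only found
# across the junction: ...GATT + ACA...
toy_circles = ["ACAGGGATT", "GATTACAGATTACA"]
print(count_circular(toy_circles[0], "GATTACA"))  # 1
print(count_circular(toy_circles[1], "GATTACA"))  # 2
print(bp_per_occurrence(toy_circles, "GATTACA"))
```

A linear scan of the first toy sequence would report no occurrence, which is why ignoring circularity can underestimate motif density in short circles.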

The search for conserved domains in circular contigs recovered only domains of mtDNA-encoded proteins (10 conserved domains, 13 contigs including paralogs). These sequences are presumably transcribed as they are also found in the RNA-seq data (blastn search, E-value < 1e-30).

#### Mitochondrial DNA of Dicyema sp.

Genomic data confirm the localization of mitochondrial genes of dicyemids on circular DNA molecules (Watanabe et al., 1999; Catalano et al., 2015). The search for mtDNA genes in the genomic data found 21 circles with length varying from 344 to 1605 bp. The following gene sequences were found: cox1-3, cob, nad1-5, atp6, rrnL, rrnS, trnH, trnI, trnK, trnL1, trnN, trnP, trnQ, trnR, trnS2, and trnY (**Figure 6**). In earlier studies the dicyemid mitochondrial contigs were found to carry either one protein coding gene (Watanabe et al., 1999) or a protein coding gene and a tRNA gene (Robertson et al., 2018). We found one circle that contains two genes, cox2 and rrnS, and three circles that contain two tRNA genes each. Protein identity between mitochondrial predictions for Dicyema sp. and the earlier published D. japonicum (Robertson et al., 2018) varies from 39% (nad2) to 75% (cox1). The majority of mitochondrial genes can also be found in the transcriptomic data, except for atp6 and nad5. The mtDNA circles also contain the motif described above (**Figure 5A**) (p-value < 10<sup>−5</sup>).

The nad2 and atp6 genes were found in two different variants in the genomic data. Two paralogs of nad2 with lengths of 215 and 252 amino acids have 42% identity at the amino acid level. Two paralogs of atp6 with lengths of 117 and 149 amino acids have 89% identity at the amino acid level, and share two long deletions with other dicyemids. These deletions are specific for dicyemids and are not found in other taxa including Orthonectida. According to the alignment of atp6, both dicyemid deletions are located outside of the transmembrane helices: the first one, with a length of 16 amino acids, is located in the region facing the mitochondrial matrix, and the second one, with a length of 17 amino acids, is located in the region facing the intermembrane space.

We predicted 11 mitochondrial tRNA genes in Dicyema sp., including two paralogs of the glutamine tRNA gene (**Supplementary Figure S2**). Both dicyemid glutamine tRNAs have similar secondary structures and lack a T-arm. The dicyemid arginine tRNA also lacks a T-arm, and the lysine tRNA lacks a D-arm. Other mitochondrial tRNAs maintain the typical cloverleaf structure, although several tRNA genes have single nucleotide insertions and/or non-complementary pairs in stems. Experimental evidence is needed to confirm all the predicted tRNA genes, and to decide whether the numerous unlisted tRNA-like sequences with p-values below the threshold are functional genes.

Read mapping to the genomic assembly revealed no read pairs that would facilitate mtDNA scaffolding. Whenever one read from a pair mapped to a circular mitochondrial contig, the other mapped to the same contig or had a sequence of low complexity. Thus, our genomic data fail to confirm the hypothesized existence of an unprocessed mtDNA precursor, which would generate the mtDNA circles (Awata et al., 2006).
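The logic of this check can be sketched as follows: a larger precursor molecule would be expected to leave read pairs whose two mates map to different contigs, at least one of which is mitochondrial. The sketch below assumes a simplified, hypothetical representation of mapping results as (pair id, contig of mate 1, contig of mate 2) tuples, with `None` standing for an unmapped or low-complexity mate; it is not the pipeline used in this study.

```python
def cross_contig_links(pairs, mito_contigs):
    """Return read pairs that connect a mitochondrial contig to a different
    contig; such pairs would be the evidence usable for scaffolding."""
    links = []
    for pair_id, contig_a, contig_b in pairs:
        if contig_a is None or contig_b is None:
            continue  # mate unmapped or discarded as low complexity
        touches_mito = contig_a in mito_contigs or contig_b in mito_contigs
        if touches_mito and contig_a != contig_b:
            links.append((pair_id, contig_a, contig_b))
    return links

# Hypothetical contig names and read pairs.
mito = {"circ_cox1", "circ_nad2"}
observed_like = [
    ("pair1", "circ_cox1", "circ_cox1"),  # both mates on the same circle
    ("pair2", "circ_nad2", None),         # mate of low complexity
    ("pair3", "linear_17", "linear_17"),  # unrelated nuclear contig
]
print(cross_contig_links(observed_like, mito))  # []

with_precursor = observed_like + [("pair4", "circ_cox1", "circ_nad2")]
print(cross_contig_links(with_precursor, mito))
# [('pair4', 'circ_cox1', 'circ_nad2')]
```

An empty result, as in the first call, corresponds to the situation described in the text; absence of such pairs does not by itself exclude a rare precursor molecule.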

The presence of common sequence motifs in circles both with and without mtDNA genes is surprising. It can be interpreted as a consequence of a similar mechanism of generation and maintenance of circles irrespective of their function.

FIGURE 5 | Motifs specific to Dicyema sp. circular contigs. (A) Signal found in the majority of contigs. (B–E) Highly conserved signals found in a group of circular contigs of high length and coverage; (B) found in 541 contigs; (C) found in 284 contigs; (D) found in 234 contigs; (E) found in 438 contigs. All counts are provided at a p-value < 10<sup>−4</sup>.

The partitioning of mtDNA into circular molecules is a rare feature for the animal mitochondrial genomes (Odintsova and Yurina, 2005; Burger et al., 2012; Kolesnikov and Gerasimov, 2012; Smith and Keeling, 2015; Lavrov and Pett, 2016, for review). In bilaterians, the mtDNA is fragmented into a large number of mini-chromosomes in the cyst-forming nematodes Globodera spp. (Armstrong et al., 2000; Gibson et al., 2007) and sucking lice (Shao et al., 2009). Notably, the mitochondrial DNAs from the orthonectids Intoshia linei, Intoshia variabili, and Rhopalura ophiocomae retain the typical structure for metazoans and encode the full set of mitochondrial genes on a single circular molecule (Robertson et al., 2018; Bondarenko et al., 2019). The reason why the mitochondrial genome of Dicyema spp. is fragmented is unknown. Earlier, the fragmentation of the mitochondrial genome of sucking lice was considered (Shao et al., 2009) as an adaptation to the high rate of molecular evolution, which is even more characteristic of Dicyema spp. It is possible that under conditions of high mutagenesis, a set of uncorrupted genes is easier to assemble from individual molecules than from concatenated ones.

Analysis of mitochondrial DNA suggests an explanation of the multiple observed circular contigs. For searches with the tblastx algorithm we used proteins from the annotated mitochondrial contigs as the query and all 3,220 "long" circle contigs as the database. The searches returned many circular contigs that encode highly diverged genes cox3, nad2, and nad4 (**Figure 7**). The cox3 homolog is largely diverged, while nad2 and nad4 contain stop codons and frame shifts. These contigs therefore represent mitochondrial pseudogenes.

Previous publications and our new data confirm the presence of two unusual features of the Dicyema genome. First, mitochondrial genes in Dicyema are not located on a single long DNA molecule as in most animals, but are partitioned into smaller circular molecules. The second interesting feature of Dicyema is the presence of thousands of non-coding circular DNA sequences. Both types of circular DNA molecules fall in a similar range of size and coverage in DNA assembly and bear a common set of similar 12–20 bp DNA patterns, which might be hypothetical signal sequences. We assume that all circular DNAs in Dicyema may have a common origin, although experimental evidence is necessary. We speculate that the presence of multiple mtDNA mini-rings instead of one long molecule might have produced serious problems in mitochondrial division. This requires special mechanisms to ensure the correct distribution of multiple minicircular DNA molecules upon mitochondrion division so that both descendants obtain a complete set of genes. Specific signal patterns like the ones we observe could be used to support the duplication of circular mtDNA molecules, their protection against elimination, or their correct distribution between descendant mitochondria. When such mechanisms are established, it is possible that rings carrying mutated (pseudo)genes or other selfish non-coding circular DNA elements acquire similar signal sequences that ensure their preservation, in a way similar to parasitic mobile genetic elements.

#### Homeobox Transcription Factors

Homeodomain (homeobox) transcription factors are crucial regulators of animal development that play central roles in tissue differentiation and axial body patterning. Bilaterian genomes encode from around 60 to over 300 homeobox genes. The genome of the orthonectid I. linei was found to possess one of the smallest repertoires of homeoboxes (Mikhailov et al., 2016), which matches the reduced complexity of its organization. To determine how the extreme simplification of body plan seen in dicyemids relates to their homeobox gene content, we searched for these genes in the dicyemid genomic and transcriptomic data. For analyses with HMMER, we used gene predictions coming from the genomic assembly of Dicyema sp. (PRJNA527259; designated as Dicyema sp. 1) and transcriptome assemblies of Dicyema sp. (SRR827581; designated as Dicyema sp. 2) and Dicyema japonicum (DRR057371). HMMER searches using the homeodomain profile identified 38, 39, and 55 homeoboxes in the three dicyemids after filtering out contaminating cephalopod sequences. Phylogenetic inference suggests that dicyemid homeoboxes form up to 39 families, and each dicyemid was found to contain 31 or 34 families (**Table 2**). A high level of sequence divergence complicates classification of the dicyemid homeoboxes. Although most dicyemid sequences could be assigned to one of the homeobox classes (Zhong and Holland, 2011), their attribution to known families is inconclusive.

Phylogenetic inference with PRD class homeoboxes reveals six dicyemid families, including the previously identified orthologs of the Pax6 and Otx (Aruga et al., 2007; Kobayashi et al., 2009). Five dicyemid sequence groups were found among the LIM homeoboxes, two of which are grouped with the Lhx6/8 and Islet family sequences. An additional LIM class homeobox of the Lhx2/9 family was found in D. japonicum, but could neither be confirmed by data from the other dicyemids nor discarded as contamination. Another five dicyemid gene groups belong to the POU class homeoboxes, but branch outside of any known families. The dicyemids form at least 5 TALE class sequence groups, with one group branching within the Pbx family. An additional single member of the TALE Tgif family was found only in the genomic data. Four SINE class families were found among the dicyemid sequences, with one grouping with the Six3/6 family. The dicyemids also possess a zinc finger homeobox, and a group of Onecut family sequences, which can be subdivided into 2 dicyemid-specific families.

Reconstructions with the ANTP class homeoboxes recover 8 dicyemid sequence groups (**Figure 8**). Three of these groups fall within the central Hox sequences. One of the dicyemid central Hox groups corresponds to orthologs of DoxC – a dicyemid member of the spiralian Lox5 family (Kobayashi et al., 1999), which was shown to have an expression pattern consistent with defining anterior–posterior boundaries in the developing dicyemids (Aruga et al., 2007). The analysis suggests that dicyemids possess another member of the Lox5 family – all three dicyemids were found to encode a paralog of DoxC. The paralog displays greater sequence divergence, but similar to other members of the family retains a Lox5-specific motif flanking the C-terminus of the homeodomain (de Rosa et al., 1999). The third dicyemid group within the central Hox sequences is found outside the Lox5 family and tends to group with the Lox2/Lox4 families, but beyond that does not lend itself to classification. A single posterior type Hox gene was found in D. japonicum, but once again could not be verified using other dicyemid sequences or rejected as contamination by BLAST searches. The dicyemid ANTP class homeoboxes also include members of the Evx, Dlx, and Nk2 families, and a conspicuous group of Hox-like sequences (**Figure 8**). Sequences within the dicyemid Hox-like assemblage share a common ancestor and retain a YPWM motif, which is essential for binding Hox cofactors (Prince et al., 2008), but this group is too divergent to be classified with any family, and is placed with the longest branch of Hox-like genes – the ParaHox Cdx family.

The survey of dicyemid genes suggests that overall they possess fewer homeoboxes than the orthonectid I. linei and their sequences are also markedly more diverged. Unlike the orthonectid, no ParaHox or anterior Hox families could be readily identified in the dicyemid data. Reduction of homeobox transcription factors in dicyemids is consistent with extreme simplification of their body plan. Unexpectedly, the dicyemids also experience several lineage-specific expansions of homeoboxes, notably the duplication of central Hox gene DoxC, which opposes the general trend of regulatory gene loss.

#### Basement Membrane

The basement membrane is a structure that enables the compartmentalization of cells to form tissues and organs. It is present in the majority of metazoans, with the exception of sponges, placozoans, and acoelomorphs. The reported loss of a morphologically recognized basement membrane in dicyemids would indicate unprecedented simplification in this animal group. Even though this topic has been studied (Czaker, 2000), it is still unclear whether dicyemids have a basement membrane during any of their life cycle stages. The basement membrane consists of a set of "basement membrane toolkit" proteins, of which the most important are collagen IV and laminin (Fidler et al., 2017). Both laminin and type IV collagen are multidomain proteins that include specific domains (LamNT for laminin and C4 in the case of collagen type IV) and nonspecific domains (EGF-like and other). The BLAST and Pfam searches showed that these domains of the canonical molecules forming the basement membrane are absent from the sequenced genome of Dicyema sp., therefore supporting the proposed secondary loss of this trait in dicyemids. The apparent absence of a recognizable basement membrane parallels the reported loss of muscular and nervous systems in these animals. Indeed, in bilaterians, the basement membrane supports the maintenance of the muscular and nervous system architecture, their development and compartmentalization, and growth factor signaling gradients, among other functions.

The complete life cycle of dicyemids is not entirely understood, and the existence of more complex, as yet unobserved transitional life forms cannot be excluded. An unknown stage may exist between the infusorioform larva that exits the host and the vermiform embryos found in cephalopods. The lack or reduced representation of genes encoding key elements of the basement membrane or other mediators of organ formation further supports the idea that dicyemids are secondarily simplified to an extreme degree.

#### Membrane Receptor Proteins

Cell surface membrane receptors act in cell signaling and allow communication between the cell and the extracellular space. Their diversity reflects the complexity of the organism and its ability to respond to different external signals. The number of genes encoding receptor proteins in dicyemids is exceptionally low. We found only two PF00001 domain hits corresponding to the 7 pass transmembrane receptor proteins of the rhodopsin family in Dicyema sp. This family of G-protein-coupled receptors (GPCRs) is ubiquitously present and abundant in metazoans and contains tens to hundreds of members in different species. The minimum number of rhodopsin family genes (six per genome) is detected in the sponge Amphimedon queenslandica; even the genome of the simplified orthonectid I. linei contains 32 genes of the rhodopsin family. The actual specificity of these GPCR proteins is unknown, although a BLAST search shows best similarity to the rhodopsin family neuropeptide receptors from other animals. Four proteins from another GPCR 7 pass transmembrane receptor family, the secretin family (PF00002), were predicted in the Dicyema sp. data. This is fewer than in most metazoans, yet some flatworms have even fewer (Zamanian et al., 2011), and the Orthonectida have no such proteins. We found one putative metabotropic glutamate receptor with a PF00003 domain. Curiously, this metabotropic glutamate receptor also contains a (LIVBP)-like domain that is characteristic of ionotropic glutamate receptors. Two ionotropic glutamate receptors (iGluRs), which are ligand-gated ion channels activated by the neurotransmitter glutamate, with the Lig\_chan (PF00060) domain were identified in Dicyema sp. One of them also carries a PF10613 (Lig\_chan-Glu\_bd) domain and the other a PF01094 (ANF\_receptor) domain. Thus, both distinct types of glutamate receptors (ionotropic and metabotropic) are present in Dicyema sp. It is well known that glutamate is often associated with non-neuronal signaling and is highly abundant in some animals that lack nervous systems (such as sponges and Trichoplax). Previously, we reported that iGluRs are absent in the genome of the orthonectid (Mikhailov et al., 2016). Glutamate receptors are also found in plants and many other eukaryotes outside Metazoa (Turano et al., 2001).

Another large group of ionotropic receptors is the Cys-loop ligand-gated ion channel superfamily, which is composed of nicotinic acetylcholine, GABA-A, GABA-A-ρ, glycine, 5-HT3, and zinc-activated (ZAC) receptors. We found 8 genes for this superfamily in Dicyema sp., identified by the specific transmembrane region domain (PF02932) and the ligand binding domain (PF02931). All these receptors are predicted to be nicotinic acetylcholine-like receptors.

#### Ion Channels

Despite the reported absence of muscles and neurons, tetrameric ion channels that are often associated with cellular electrical excitability are present in Dicyema sp. in numbers similar to the orthonectid I. linei (33 and 36 sequences with PF00520, and 11 and 9 with PF07885 in Dicyema and the orthonectid, respectively). Although, unlike in Orthonectida, no signatures of voltage-gated sodium ion channels (Na\_trans\_assoc PF06512) were detected in Dicyema, Pfam analysis (for Ca\_chan\_IQ PF08763) and reciprocal BLAST searches indicate the presence of voltage-gated calcium ion channels in this animal group. The presence of such channels together with tetrameric potassium ion channels implies that electrical excitability in the form of action potentials might exist in dicyemid cells. **Figure 9** provides a hypothetical schema of the intercellular communication and an analog of the neuromuscular junction in dicyemids. This structure may potentially assemble from key predicted proteins typical of many other metazoans.

#### Genes Encoding Putative Contractile/Muscular Elements

"True" muscle cells are absent in dicyemids, which makes the detection of muscle-specific genes in these animals interesting. Most of the core muscle proteins, including a type II myosin heavy chain (MyHC) motor protein, were already present in unicellular eukaryotes before the origin of multicellular animals (Steinmetz et al., 2012). At the same time, the troponin complex appears to be a universal innovation of bilaterians. Troponin is a complex of three proteins (troponin C, troponin I, and troponin T). These proteins are detected in the dicyemid data by BLAST search, and the troponin domain PF00992 is found by Pfam search. The troponin complex is characteristic of skeletal and cardiac muscles, but not of smooth muscles. It appears that throughout the radical simplification in dicyemids that resulted in massive gene loss (including most genes encoding molecules of the extracellular matrix, ECM) and in the absence of specialized muscle cells the troponins remain essential. The presence of troponins relates dicyemids to all other bilaterians with one remarkable exception – the orthonectid. In contrast to dicyemids and other bilaterians, the genome of the orthonectid I. linei has no troponins despite having specialized muscles. Morphological data suggest that muscles in I. linei are similar to smooth muscles, so troponin was likely lost in I. linei, and its absence is a derived feature. At the same time another bilaterian hallmark – the myogenic regulatory factor (Myogenic Basic domain PF01586) – is present in the genome of I. linei, but was not detected in dicyemids. These findings support the mosaic evolution of many bilaterian traits and the possibility of independent simplifications in these two parasitic lineages.

FIGURE 9 | Hypothetical schema of intercellular communication in Dicyemida. Dicyemids have no recognizable neurons and muscles, and yet they have key elements of the neuromuscular system. Metabotropic (mGluRs) or ionotropic glutamate receptors (iGluRs) activate the "presynaptic" cell (left). Voltage-gated tetrameric calcium (VGCCs) and potassium (VGKCs) ion channels generate propagating action potentials. Ca++ (blue dots) enters the cytoplasm via VGCCs and triggers the vesicular acetylcholine (ACh, shown by asterisks) release. Activation of nicotinic acetylcholine-like receptors (nAChR) increases the Ca++ level in the "postsynaptic muscular" cell (right) directly or by depolarization of the plasma membrane and VGCC opening. Ca++ promotes activation of contractile elements via the Ca++-dependent troponin-tropomyosin-actin-myosin mechanism. Additional interaction of the two cells can occur via gap junctions (GJ).

#### Gap Junctions and Adhesion Molecules

Gap junctions are a distinct type of intercellular communication channel. In Metazoa, the gap junction proteins belong to two unrelated families – connexins and pannexins (also known as innexins). The connexins are only found in chordates, while the pannexin family is widespread in invertebrates. The presence of gap junctions and innexin/pannexins in dicyemids was demonstrated earlier by transmission electron microscopy (TEM) (Furuya et al., 1997) and molecular cloning (Suzuki et al., 2010). BLAST and Pfam searches with our dicyemid data detected 21 hits with the innexin/pannexin-specific Pfam domain (PF00876) and no connexins. The number of dicyemid pannexins is similar to other invertebrates (25 in Caenorhabditis elegans, 13 in Drosophila melanogaster). It appears that unlike the highly reduced chemical signaling, direct intercellular communication via gap junctions is conserved in dicyemids.

Other hallmarks of multicellularity, the adhesion molecules and adherens junctions, are retained in dicyemids and were demonstrated in these organisms earlier by TEM (Furuya et al., 1997). The universal metazoan proteins Integrin alpha and Integrin beta are detected in dicyemids in single copies; the immunoglobulin domain is present in 6 sequences and Cadherin in 18 copies.

#### Axon Guidance Molecules and Their Receptors

The simplicity of the nervous system in Orthonectida is associated with a reduction of genes encoding components of axon guidance and synapse formation (Mikhailov et al., 2016). Dicyemids presumably lack a nervous system entirely and follow the same trend of gene loss. Both animal groups lack genes encoding semaphorins, important neuronal pathfinding signaling molecules, and their receptors (plexins). Genes potentially involved in nervous system development, such as Netrin, Ephrins, and Ephrin receptors, are present in Orthonectida but were not identified in dicyemids. Interestingly, the fasciclin domain (PF02469) is absent in the genome of I. linei, but we found three sequences carrying it in Dicyema. Fasciclin (FAS1 domain) is a cell adhesion domain found in neural cell adhesion molecules involved in axonal guidance in insects (Grenningloh and Goodman, 1992).

#### Peroxisome

The proteins and Pfam domains specific to peroxisome organelles, found in most metazoans, are absent from the dicyemid data. The peroxisomal proteins PEX3, PEX10, PEX12, and PEX19, mandatory for peroxisome function, are apparently missing. Failure to detect these genes strongly suggests the absence of the organelle. Eight Pfam domains (PF01756, PF04088, PF04614, PF04882, PF05648, PF07163, PF09262, and PF12634) linked to peroxisome in the GO database<sup>1</sup> were not detected in Dicyema spp. In this respect, dicyemids are similar to Orthonectida and parasitic flatworms (Tsai et al., 2013).

<sup>1</sup>http://geneontology.org/external2go/pfam2go

#### Phylogenetic Analyses

To clarify the relationships of the two mesozoan groups, Orthonectida and Dicyemida, we used the sequenced transcriptomes of two unidentified species of Dicyema. We included the gene predictions of the orthonectid I. linei (Mikhailov et al., 2016) in the set of orthologous genes based on the dataset published by Struck et al. (2014). Given the high uncertainty in phylogenetic affinities of orthonectids and dicyemids, we extended taxonomic sampling by adding 30 spiralian taxa from available transcriptomes (see section "Materials and Methods"). Although the data broadly cover the spiralian diversity, several taxonomic groups are still missing or underrepresented in the complement of sequenced genes. To minimize missing data, we merged closely related species within several operational taxonomic units (OTUs) and produced the final matrix with 73 OTUs (69 OTUs for spiralian species) and 87,610 aa positions from 452 individual protein alignments. The proportion of missing data in the concatenated alignment is 40%.
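The proportion of missing data in a concatenated matrix can be illustrated with a minimal Python sketch. It counts missing cells over all OTUs and positions of a toy alignment. The function name, the toy OTU labels, and the set of symbols treated as missing (`-`, `?`, `X`) are assumptions made for illustration; the source does not state how missing positions were encoded or counted.

```python
# Minimal sketch: fraction of missing cells in a concatenated alignment.
# Assumption: '-', '?' and 'X' all denote missing data (not stated in the text).
MISSING_SYMBOLS = set("-?X")


def missing_fraction(alignment):
    """Return the share of missing cells in a dict of OTU name -> aligned sequence."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    total_cells = sum(len(seq) for seq in alignment.values())
    missing_cells = sum(
        ch in MISSING_SYMBOLS for seq in alignment.values() for ch in seq.upper()
    )
    return missing_cells / total_cells


# Hypothetical toy matrix with 3 OTUs and 4 positions (3 of 12 cells missing).
toy = {"OTU_A": "MK-L", "OTU_B": "M??L", "OTU_C": "MKVL"}
print(missing_fraction(toy))
```

Applied to a matrix of 73 OTUs and 87,610 positions, the same ratio would be taken over 73 × 87,610 cells.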

Highly divergent sequences of mesozoans pose a formidable challenge for inference methods due to the confounding effect of long-branched taxa on phylogenetic reconstructions. A recognized approach to tackle the long branch attraction (LBA) problem is to use a site-heterogeneous model of sequence evolution (Rodríguez-Ezpeleta et al., 2007). In the Bayesian tree constructed with PhyloBayes (Lartillot et al., 2013) under the site-heterogeneous CAT-GTR model, the dicyemid and orthonectid lineages form the longest branches, yet they do not group together, thus contradicting monophyly of the Mesozoa (**Figure 10**). We recovered the position of the orthonectid Intoshia within annelids with a posterior probability of 1.0. Specifically, the orthonectid forms a branch of the Pleistoannelida that comprises the annelid groups Errantia and Sedentaria (Weigert et al., 2014), while Owenia, Magelona, Chaetopteridae, and Phascolosoma (Sipuncula) occupy more basal positions in the annelid subtree.

The same analysis placed the dicyemid lineage near the base of a group uniting the Rouphozoa (Platyhelminthes, Gastrotricha) and Entoprocta + Cycliophora. However, the position of dicyemids in Bayesian inference is unstable. In about one-third of trees dicyemids were recovered as a sister group to the clade uniting Annelida, Nemertea, Lophophorata (Brachiopoda + Phoronida + Bryozoa), and Mollusca. In about 10% of trees the dicyemids branch off at the base of this group plus (Platyhelminthes + Gastrotricha) plus (Entoprocta + Cycliophora) (**Figure 11**, green branch). The grouping of Intoshia linei and Pleistoannelida has been observed in all summed trees. However, the exact position of the orthonectids relative to pleistoannelids is less certain in our analyses. The basal placement of the orthonectids is observed in 50% of trees, and the orthonectids were recovered as a sister group of Sedentaria or Errantia in 38 and 11% of trees, respectively.

The consensus Bayesian tree was obtained from four independent chains. The majority of bipartitions are shared across chains, but convergence on a single topology was not observed. Topologies in each chain uniquely reflect the competing hypotheses of spiralian relationships (Kocot, 2016; Kocot et al., 2017). All four chains of our Bayesian analysis are consistent in several major assemblages, including the Rouphozoa (Platyhelminthes, Gastrotricha), Gnathifera (Gnathostomulida, Micrognathozoa, and Syndermata), the sister relationship of Rouphozoa and Entoprocta + Cycliophora, and the basal position of Gnathifera relative to other spiralians. Importantly, in all topologies, the orthonectid is nested within the Annelida, and the dicyemid lineage is inferred sister to the assemblage of Platyhelminthes, Gastrotricha, Entoprocta, and Cycliophora (with the inclusion of Bryozoa in some topologies, **Figure 11**).

Alternative groupings obtained in our analysis include the Lophophorata (Brachiopoda, Phoronida, and Bryozoa) versus Polyzoa (Entoprocta, Cycliophora, and Bryozoa), and Vermizoa (Annelida, Nemertea) versus Nemertea + Mollusca (**Figure 11**). A comparison of topologies across chains based on site-wise likelihoods computed with PhyloBayes (the "sitelogl" option of the PhyloBayes readpb\_mpi) under the CAT-GTR model and the approximately unbiased (AU) test (Shimodaira, 2002) shows that the difference in likelihoods of topologies in chains 1–3 is not significant, but the topology likelihood in chain 4 is significantly lower (p-value = 0.01). Chain 2 (**Figure 11**) converges on a topology identical to the consensus four-chain topology (**Figure 10**), but its likelihood is lower than in chain 1 (non-significantly). Excluding chain 4, which failed the AU test, and constructing the consensus with the three remaining chains does not affect the topology itself but only the node supports, because the effects of the non-monophyletic Lophophorata in chain 4 are eliminated (**Supplementary Figure S3**).

FIGURE 10 | … run with the CAT + GTR + Γ4 evolutionary model. Nodes with posterior probabilities below 1.0 are marked with red dots, and those with 1.0 with black dots. Chimeric operational taxonomic units include names of merged species marked with an asterisk. The tree is rooted with four ecdysozoan lineages.

FIGURE 11 | Tree topologies in the four chains of the PhyloBayes run. Each panel summarizes the topology obtained in a single chain of the analysis. The monophyly of almost all clades and all major spiralian phyla receives posterior probability of 1.0 in each chain (even if they differ between the chains). In contrast, the position of the dicyemid lineage receives moderate support in each chain. The pie charts reflect the portion of trees where the dicyemid lineage occupies one of the three observed positions in the consensus (represented with color). Topologies in each chain were compared with the approximately unbiased (AU) test using the "sitelogl" option of PhyloBayes; AU test p-values are shown above each topology.

The best scoring topology supports the monophyletic Lophophorata, the grouping of Annelida and Nemertea, and also the monophyly of macrodasyid and chaetonotid gastrotrichs, which frequently find themselves separate in our analyses (**Figure 10**). Maximum likelihood (ML) analyses of the same dataset with RAxML (Stamatakis, 2014) and IQ-TREE (Nguyen et al., 2015) produce a different view on the phylogeny of Mesozoa. The dicyemids and the orthonectid form a monophyletic group in ML trees with maximal support. In the RAxML analysis under the GTR model the monophyletic Mesozoa branch off with chaetonotid gastrotrichs, similarly to the result obtained in the recent phylogenomic analysis (Lu et al., 2017), but the support of the group is minimal (**Supplementary Figure S4**). In the IQ-TREE run under the C60 profile mixture model the monophyletic Mesozoa are found at the base of the Rouphozoa, again with weak support (64% of ultrafast bootstrap replicates) (see **Supplementary Figure S5**).

Although ML analyses disagree with the result of Bayesian inference, modeling of site-heterogeneity by the IQ-TREE profile mixture model does shed light on some spurious cases in spiralian relationships. The divergent annelid Myzostoma is correctly grouped with other annelids in the IQ-TREE analysis, in contrast with the RAxML tree where it forms a clade with long branches of the Rouphozoa, Gnathifera, and Mesozoa. The clustering of Rouphozoa and Gnathifera, referred to as the Platyzoa, receives maximal support in the RAxML analysis but was previously shown to be artefactual (Struck et al., 2014; Laumer et al., 2015). This grouping is inferred by neither the IQ-TREE nor the Bayesian analyses.

In contrast with IQ-TREE and PhyloBayes, the RAxML tree supports monophyletic Polyzoa, a group uniting Entoprocta, Cycliophora, and Bryozoa, which was also suggested to be erroneous and caused by the compositional bias in amino acid sequences (Nesnidal et al., 2013).

To test for expected LBA effects, particularly to exclude the possibility of the orthonectid being attracted to annelids by the divergent Myzostoma, we conducted additional analyses excluding each of the long-branched lineages. Additional datasets were generated by removing Myzostoma, Myzostoma and both dicyemids, or Myzostoma and Intoshia. Bayesian analyses of the additional datasets recovered the placement of Intoshia within annelids in the absence of Myzostoma (**Supplementary Figures S6**, **S7**). The position of dicyemids is also unaffected by the exclusion of other long-branched taxa – the dicyemids occupy a basal position within the Lophotrochozoa after the divergence of Gnathifera in all analyses of the additional datasets (**Supplementary Figures S6**, **S8**).
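The construction of such reduced datasets can be sketched in a few lines of Python. The sketch removes named OTUs from an alignment and then drops columns that no longer contain any residue. The OTU labels, the dataset names, the missing-data symbols, and the column-pruning step are all assumptions made for illustration and are not described in the source.

```python
# Minimal sketch: derive taxon-exclusion datasets from one alignment.
# Assumptions: OTU labels below are hypothetical; '-', '?', 'X' mean missing;
# pruning all-missing columns after exclusion is an illustrative choice.
MISSING = set("-?X")

EXCLUSION_SETS = {
    "no_Myzostoma": {"Myzostoma"},
    "no_Myzostoma_no_dicyemids": {"Myzostoma", "Dicyema_sp1", "Dicyema_sp2"},
    "no_Myzostoma_no_Intoshia": {"Myzostoma", "Intoshia_linei"},
}


def drop_taxa(alignment, excluded):
    """Remove the excluded OTUs, then remove columns left without any residue."""
    kept = {name: seq for name, seq in alignment.items() if name not in excluded}
    if not kept:
        return kept
    length = len(next(iter(kept.values())))
    informative = [
        i for i in range(length)
        if any(seq[i].upper() not in MISSING for seq in kept.values())
    ]
    return {name: "".join(seq[i] for i in informative) for name, seq in kept.items()}


def build_datasets(alignment):
    """Return one reduced alignment per exclusion set."""
    return {label: drop_taxa(alignment, taxa) for label, taxa in EXCLUSION_SETS.items()}
```

Each reduced alignment would then be analyzed independently under the same model as the full dataset.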

We also tested our dataset for the effects of compositional heterogeneity by discarding highly heterogeneous alignments and utilizing the data recoding approach (Susko and Roger, 2007). Bayesian inference with a concatenate of 150 protein alignments retained after discarding highly compositionally heterogeneous alignments from the original dataset recovers the same groupings of the mesozoan taxa as the analysis of the full dataset. The orthonectid is nested within the annelid clade (1.0 posterior probability) and the dicyemids branch off at the base of the Rouphozoa + Entoprocta + Cycliophora clade with weak support (0.46 posterior probability) (**Supplementary Figure S9**). Similarly, inference with the Dayhoff-recoded alignment groups the orthonectid with annelids, while leaving the position of the dicyemids uncertain within the Lophotrochozoa (**Supplementary Figure S10**). Remarkably, the PhyloBayes run with recoded data shows adequate convergence between chains (maxdiff = 0.17) and infers the monophyletic Gastrotricha. Several conventional groupings, such as the Rouphozoa, are not recovered. Consistent with the proposed artefactual nature of the grouping of Bryozoa and Entoprocta due to compositional heterogeneity (Nesnidal et al., 2013), both test datasets support the monophyletic Lophophorata, whereas the alternative Polyzoa was frequently observed for the complete and non-recoded data. The lack of support for monophyletic Rouphozoa in the analysis with the recoded dataset was similarly obtained in a recent study of the spiralian phylogeny, which aimed at counteracting the impact of compositional heterogeneity (Marlétaz et al., 2019).
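Data recoding reduces the amino acid alphabet to a few classes of chemically similar residues, so that substitutions within a class become invisible to the model. The Python sketch below illustrates the commonly used six-state Dayhoff scheme. That six states were used here, the digit labels, and the treatment of gaps and ambiguous symbols as missing are assumptions made for illustration.

```python
# Minimal sketch of six-state Dayhoff recoding of a protein sequence.
# Assumption: six classes with digit labels; anything else becomes missing.
DAYHOFF6 = {
    "AGPST": "0",
    "DENQ": "1",
    "HKR": "2",
    "ILMV": "3",
    "FWY": "4",
    "C": "5",
}
RECODE = {aa: code for group, code in DAYHOFF6.items() for aa in group}


def dayhoff_recode(sequence, missing="-"):
    """Map each residue to its Dayhoff class; unknown symbols become `missing`."""
    return "".join(RECODE.get(ch, missing) for ch in sequence.upper())


print(dayhoff_recode("MKC-"))  # M -> 3, K -> 2, C -> 5, gap stays missing
```

Because the recoded alphabet has fewer states, compositional differences between taxa are reduced, at the cost of some phylogenetic signal.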

Schiffer et al. (2018) selected proteins that support annelid monophyly as an approach to verify the orthonectid position. We also selected 111 protein alignments that contain the annelid signal, but with a different method, and used those for Bayesian inference with the PhyloBayes program. In contrast to other Bayesian runs, the consensus presents a stable topology (maxdiff value 0.15). In this tree, the orthonectid I. linei stabilizes inside the annelids [posterior probability (PP) 1.0], whereas the species of Dicyema are not attracted to annelids (**Figure 12**). The position of Dicyema remains uncertain within the lineage of long-branched taxa (Platyhelminthes, Gastrotricha, Entoprocta, Cycliophora). Lophophorata and Gastrotricha are reconstructed with PP 1.0 (as in the case of the Dayhoff-recoded dataset mentioned above and the non-recoded full dataset in chain 1 that reaches the highest likelihood). Bayesian topologies obtained in chain 1 (**Figure 11**) and both the sub-sampled datasets of 111 proteins with the annelid signal and the 150 proteins with low compositional heterogeneity (**Supplementary Figure S9**) reconstruct the sister relationship of annelids with nemertines. Marlétaz et al. (2019) also report the grouping of annelids and nemertines, with the inclusion of Platyhelminthes.

The lack of convergence in most PhyloBayes analyses precludes strong assertions regarding problematic areas of the spiralian tree. Nevertheless, some clades are reconstructed consistently. We do not observe monophyly of the Mesozoa in any of the chains, in contrast to the recent study by Lu et al. (2017) and in agreement with the evidence from Schiffer et al. (2018), nor do we observe their direct relations with Platyhelminthes.
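The maxdiff values quoted above summarize how strongly independent chains disagree about bipartition frequencies. The Python sketch below shows the idea: for every bipartition seen in any chain, it takes the spread of its posterior frequency across chains and reports the largest spread. The function name, the toy bipartitions, and the use of the maximum-minus-minimum spread for more than two chains are assumptions made for illustration, and the sketch is not the PhyloBayes implementation.

```python
# Minimal sketch of a maxdiff-style convergence statistic.
# Assumption: each chain is a dict mapping a bipartition (frozenset of taxa on
# one side of a split) to its posterior frequency; absent means frequency 0.
def maxdiff(chains):
    """Largest across-chain spread in frequency over all observed bipartitions."""
    bipartitions = set().union(*chains)
    return max(
        (
            max(chain.get(b, 0.0) for chain in chains)
            - min(chain.get(b, 0.0) for chain in chains)
            for b in bipartitions
        ),
        default=0.0,
    )


# Hypothetical toy example with two chains.
chain1 = {frozenset({"A", "B"}): 1.0, frozenset({"C", "D"}): 0.60}
chain2 = {
    frozenset({"A", "B"}): 1.0,
    frozenset({"C", "D"}): 0.45,
    frozenset({"C", "E"}): 0.10,
}
print(round(maxdiff([chain1, chain2]), 2))
```

A value near 0 means that the chains sample essentially the same bipartitions at the same frequencies, whereas a value of 1 means that at least one bipartition is always present in one chain and never in another.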

The orthonectid I. linei occupies a stable position within the annelid part of the tree. Its placement is among the major conflicts between ML and Bayesian topologies, which likely indicates the impact of the more complex CAT-GTR model in the presence of long branches of highly divergent lineages like orthonectids and dicyemids. Notably, polyphyletic Mesozoa and the proposed affinity of orthonectids to annelids were also recovered by Schiffer et al. (2018) in Bayesian analyses of a dataset with a less extensive representation of the lophotrochozoan diversity.

The orthonectids share certain morphological features with annelids: the presence of a microvillar cuticle, metameric muscles, gonochory (Slyusarev, 2008), and the dorsal ganglion in adult specimens (Slyusarev and Starunov, 2015). Cases of dramatic morphological reduction in annelids are known in archiannelids (Andrade et al., 2015) and lobatocerebrids (Laumer et al., 2015), and especially in dwarf males of the echiurid Bonellia, the dinophilid Dinophilus gyrociliatus, the spionid Scolelepis laonicola (Vortsepneva et al., 2008), and siboglinids (Worsaae and Rouse, 2010). Adaptations of orthonectids that led to the complete loss of the coelomic cavity, gonad wall, chaetae, gastral system, nephridia, trochophore larva, and spiral cleavage further demonstrate the extent of morphological regress in the evolution of annelids.

The dicyemid lineage in our analyses exhibits affinity to the Rouphozoa clade, in congruence with Lu et al. (2017), with the intercalation of Entoprocta or Polyzoa, which were not included in analyses by Lu et al. (2017). Schiffer et al. (2018) report an uncertain position of dicyemids at the base of the Lophotrochozoa. We did not recover the position of the dicyemids as part of the Platyhelminthes. A previous analysis of innexin genes (Suzuki et al., 2010) also rejects the kinship of the dicyemids and Platyhelminthes. Furthermore, rhabditophoran platyhelminths are known to possess a unique non-standard mitochondrial genetic code, which was shown not to be the case for the studied mesozoans (Telford et al., 2000; Schiffer et al., 2018). Dicyemids were placed within Spiralia in various taxonomic contexts in molecular phylogenetic studies (Pawlowski et al., 1996; Petrov et al., 2010; Suzuki et al., 2010; Lu et al., 2017) but the interpretation of their body plan remains enigmatic. Although they are among the simplest known bilaterians, they possess multiciliated epithelia, which is not a primitive trait and suggests secondary evolutionary regress, and they do not display evident synapomorphies with other animal phyla. Dicyemids might represent a relict lineage of lophotrochozoan animals with no direct relatives that has survived to the present day.

FIGURE 12 | … Convergence value of maxdiff = 0.15.

#### Conclusion

We confirm that orthonectids are extremely simplified annelids and do not form a monophyletic group with dicyemids. Mesozoa is a polyphyletic taxon. Dramatic simplification of their body plan, as well as the metagenetic life cycle, evolved independently in the two lineages. Many conserved bilaterian genes are absent in the genomes of Dicyemida and Orthonectida. At the same time, the pattern of their loss and presence is different, which supports the conclusion that these animal groups are not close relatives and have simplified independently. Analyses of genes related to the basement membrane, neuronal and muscular systems expose the extreme simplicity of dicyemids. Intriguingly, dicyemids lack muscle cells and the genetic factors of muscle cell differentiation but possess the troponin complex specific for striated muscles. Taken together with the detection of a relatively large set of nicotinic acetylcholine receptors, often associated with neuromuscular signaling, and the presence of voltage-gated ion channels, this fact urges reevaluation of the traditional view that dicyemids completely lost the neuromuscular system. It would be appealing to test experimentally whether contractility and movements can be induced in dicyemids by signal molecules such as acetylcholine or glutamate, and whether their cells show electrical excitability in the form of a propagated calcium action potential. Small circular extrachromosomal molecules are present in total DNA extracts of dicyemids. Mitochondrial rRNA, tRNA, protein-coding genes and pseudogenes are located on circular molecules. There are short nucleotide sequence motifs confined specifically to circular DNAs in Dicyema sp.

## MATERIALS AND METHODS

### Biological Material, Genome and Transcriptome Sequencing

The original live material on Dicyema sp. 1 was collected at the Vostok marine biological station of the Institute for Marine Biology of the Russian Academy of Sciences (the Vostok Bay of the Sea of Japan, Vladivostok, Russia) from dissected kidneys of the giant Pacific octopus E. dofleini. Live dicyemids were rinsed individually in filtered marine water and fixed in the RNAlater stabilization solution (Ambion). Total DNA was isolated from tissue samples by Diatom DNA Prep (IsoGene). The sequencing of dicyemid genomic data was performed with an Illumina HiSeq2000 system, generating 140 million paired-end reads.

Total RNA was isolated with the TRIzol kit (Invitrogen) and further used for ds cDNA synthesis using the SMART approach (Zhu et al., 2001). SMART-prepared amplified cDNA was then normalized using the DSN normalization method (Zhulidov et al., 2004). Normalization included cDNA denaturation/reassociation, treatment by the duplex-specific nuclease (Shagin et al., 2002), and PCR amplification of the normalized fraction (8 PCR cycles: 95°C for 7 s; 65°C for 20 s; 72°C for 3 min). Normalized cDNA libraries were sequenced using the Roche 454 sequencing technology, producing about 480,000 reads with an average length of 444.6 bases.

Specimens of Dicyema sp. 2 were collected at the Friday Harbor Laboratories (Friday Harbor, WA, United States) from the circulatory system and kidneys of the octopus E. dofleini. All individual animals were washed 3–5 times in 0.2 µm filtered seawater. Then RNA was extracted from individual animals and processed as described elsewhere (Moroz and Kohn, 2013) for Illumina HiSeq 2000 sequencing.

The sequences are deposited in the NCBI: BioProject PRJNA527259 (Dicyema sp. 1) and SRA SRP021079 (Dicyema sp. 2).

### Assembly and Filtering of Dicyemid Sequences

The reads obtained from the DNA library for Dicyema sp. 1 were trimmed for adapters with Trimmomatic (Bolger et al., 2014) and assembled by SPAdes (Nurk et al., 2013) using k-mer values of 21, 33, 55, and 77. We also performed genome assembly with the Newbler GS De Novo Assembler software (v. 2.9) (using 1/10 of all reads) as a control for our method of identifying circular contigs. Gene prediction was performed with Augustus (Stanke and Waack, 2003) after constructing a training set of 200 dicyemid sequences identified in the genomic assembly. The predicted genes were queried against the InterPro database (Finn et al., 2017) with InterProScan (Jones et al., 2014), and genes with InterPro hits were screened for cephalopod sequences with BLAST (Altschul et al., 1997) searches against the NCBI nr database. Predictions whose best hit was a cephalopod sequence were discarded from the gene set. Completeness estimates were performed with BUSCO (Waterhouse et al., 2017) using the eukaryota\_odb9 ortholog set (Zdobnov et al., 2017). HMMER (Eddy, 2011) searches were carried out with Pfam (Finn et al., 2016) Homeodomain (PF00046) and Homeobox\_KN (PF05920) profiles to identify homeobox transcription factors in the data. Phylogeny reconstructions for homeobox sequences were performed with IQ-TREE (Nguyen et al., 2015) using the LG + C20 + G4 model of sequence evolution or with PhyloBayes (Lartillot et al., 2013) using the LG + CAT + G4 model.

The reads obtained from the cDNA library for Dicyema sp. 1 were trimmed for adapters, non-coding RNA, low-quality and low-complexity sequences with the SeqClean software (Dana-Farber Cancer Institute<sup>2</sup>), and about 430,000 reads were retained. The data were further assembled with the original 454 Newbler GS De Novo Assembler software (v. 2.9) utilizing flowgram quality data and settings that maximize contig overlap. The "-urt" option was invoked to improve contigging in low-depth portions of the assembly. Fusions of transcripts that can potentially occur with low-depth assembly extensions in densely packed genomes are subsequently eliminated in our experimental design by alignment filtering at the supermatrix construction step. The obtained assembly contained 19,641,638 bases in 22,082 isotigs with an average size of 889 bases, an N50 of 1,081 bases, and a largest isotig of 9,199 bases. Protein-coding regions were predicted using TransDecoder (Haas et al., 2013) with settings that maximize the sensitivity of capturing ORFs regardless of the predicted coding likelihood score by accounting for homology to known proteins in the Pfam (Finn et al., 2014) and UniProtKB/Swiss-Prot (UniProt Consortium, 2015) curated databases. Coding region prediction with TransDecoder was set to a minimal predicted protein length of 80 aa. The predicted proteome contained 15,227 unique coding regions.
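The assembly statistics quoted above follow standard definitions, and a minimal Python sketch of the N50 calculation is given here for readers who want to recompute them. The function name and the toy lengths are illustrative assumptions and are not taken from the actual assembly. As a consistency check, 19,641,638 bases divided by 22,082 isotigs gives a mean of about 889 bases, in line with the reported average.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or isotig) lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths supplied")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy, hypothetical lengths (not the real isotigs).
toy_lengths = [100, 200, 300, 400, 500]
print("N50 of toy set:", n50(toy_lengths))

# Mean isotig length implied by the totals reported in the text.
print("mean isotig length:", 19_641_638 / 22_082)
```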

The second dicyemid transcriptome sequenced using the Illumina platform was assembled with Trinity (Grabherr et al., 2011). Before assembly the reads were processed with the SeqClean software, and the prediction of coding regions was performed by TransDecoder, similarly to the transcriptome of the first dicyemid.

The transcriptomes of dicyemids were derived from samples contaminated with their cephalopod host. Therefore, we took special care to avoid mixing dicyemid and cephalopod sequences in the phylogenetic analysis. The transcriptomes of dicyemids were first screened for cephalopod sequences by performing BLAST (Altschul et al., 1997) searches against the NCBI RefSeq database (O'Leary et al., 2016). The two dicyemid transcriptomes were processed independently. In the first step of decontamination we filtered out proteins whose best hit in RefSeq belonged to prokaryotes (with at least 50% identity). This led to the rejection of only 35 proteins for Dicyema sp. 1 and 363 proteins for Dicyema sp. 2. In the second step we removed all dubious proteins whose local alignment score with any cephalopod protein was higher than with proteins of all the other considered species (for the same query protein). Sequences with best hits to cephalopods were discarded from the transcriptomes if the sequence identity exceeded 70%. For the third filtering step we queried the proteins of O. bimaculoides combined with several transcriptomes of Octopus vulgaris (NCBI BioProject PRJNA79361 and Sequence Read Archive entries SRR331946, SRR1507221) against a custom database containing 9 metazoan proteomes (4 molluscs, 2 annelids, a brachiopod Lingula anatina, an ecdysozoan Limulus polyphemus, and a deuterostome Danio rerio) and the dicyemid transcriptome, and inspected dicyemid sequences that produced hits with the highest match to the cephalopod queries among the 10 metazoans. All dubious sequences (hits with at least 80% identity) captured by this method were discarded from the dicyemid transcriptomes as potential cephalopod contamination.
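The second filtering rule (discard a protein when its best hit is a cephalopod sequence and the identity exceeds 70%) can be illustrated with a short sketch. This is only an assumed reading of the rule: the `Hit` record, the taxon labels and the use of bit score to pick the best hit are hypothetical simplifications, not the scripts used in the study.

```python
from collections import namedtuple

# Hypothetical, simplified hit record; real BLAST tabular output has more fields.
Hit = namedtuple("Hit", "query taxon identity bitscore")


def flag_cephalopod_like(hits, identity_cutoff=70.0):
    """Return query IDs whose best-scoring hit is a cephalopod sequence
    with percent identity above identity_cutoff."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return {
        query
        for query, hit in best.items()
        if hit.taxon == "cephalopod" and hit.identity > identity_cutoff
    }


# Toy hits, invented for illustration only.
hits = [
    Hit("prot1", "cephalopod", 92.0, 410.0),
    Hit("prot1", "annelid", 55.0, 180.0),
    Hit("prot2", "cephalopod", 48.0, 150.0),  # best hit, but low identity
    Hit("prot3", "bivalve", 61.0, 220.0),
    Hit("prot3", "cephalopod", 75.0, 200.0),  # high identity, not the best hit
]
print(sorted(flag_cephalopod_like(hits)))
```

In this toy input only `prot1` is flagged, because it is the only query whose top-scoring hit is both cephalopod and above the identity cutoff.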

### Search for "Circular" Contigs, Signals, and Mitochondrial Sequences

The contigs constructed from shotgun fragments display special characteristics emerging from the genome assembly algorithms based on the De Bruijn graph of k-mers. This approach results in "circular" contigs starting and ending with the same k-mer. After assembly, terminal repeats equal in length to the k-mer were cut off. Contigs analyzed in sections "Circular Contigs in Genomic Assembly of Dicyema sp." and "Mitochondrial DNA of Dicyema sp.," and the NCBI submission data have been cleaned of the terminal repeats. In this study, a contig was considered "circular" if it had terminal direct repeats ≥ 77 nt in length (k77). The length distributions of contigs assembled by different methods (Newbler and SPAdes) were compared with the two-sample Kolmogorov–Smirnov test implemented in the SciPy package in Python 3. Here the null hypothesis is that contig lengths come from the same distribution. A high p-value in this case means that the data provide no evidence against this hypothesis. Low complexity regions were detected with the DUST algorithm from the MEME Suite (Bailey and Elkan, 1994) with standard settings.

<sup>2</sup>https://sourceforge.net/projects/seqclean

MEME and ChIPMunk (Kulakovskiy et al., 2010) tools with the default parameters were applied to the task of finding specific motifs. The reverse lookup for the signal presence was done via FIMO (from the MEME Suite) with a p-value threshold of 10<sup>−4</sup>. Moreover, highly conserved elements of circles in dicyemids were found utilizing the technique borrowed from Rubanov et al. (2016). The method identifies highly conserved DNA elements by finding dense subgraphs in a specially built multipartite graph (whose parts correspond to genomes). Specifically, the algorithm relies neither on genome alignments nor on pre-identified perfectly conserved elements; instead, it performs a fast search for pairs of words (in different genomes) of maximum length with the difference below the specified edit distance. Each such pair defines an edge whose weight equals the maximum (or total) length of the words assigned to its ends. The graph composed of these edges is then compacted by merging some of its edges and vertices. The dense subgraphs are identified by a cellular automaton-like algorithm; each subgraph defines a cluster composed of similar inextensible words from different genomes (Rubanov et al., 2016).
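The "circular" contig criterion described earlier in this section (terminal direct repeats of at least 77 nt) can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names are hypothetical, and the choice to ignore repeats longer than half of the contig is an added assumption that prevents low-complexity sequences from overlapping with themselves.

```python
def longest_terminal_repeat(seq):
    """Length of the longest prefix of seq that is also its suffix,
    restricted to repeats no longer than half of the contig."""
    n = len(seq)
    if n == 0:
        return 0
    # Prefix function (KMP failure table): border[i] is the length of the
    # longest proper prefix of seq[:i + 1] that is also its suffix.
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and seq[i] != seq[k]:
            k = border[k - 1]
        if seq[i] == seq[k]:
            k += 1
        border[i] = k
    repeat = border[-1]
    while repeat > n // 2:  # fall back to shorter, non-overlapping repeats
        repeat = border[repeat - 1]
    return repeat


def is_circular(seq, min_repeat=77):
    """True if the contig carries a terminal direct repeat >= min_repeat nt."""
    return longest_terminal_repeat(seq) >= min_repeat


def trim_terminal_repeat(seq, min_repeat=77):
    """Drop one copy of the terminal repeat so the circle is represented once."""
    repeat = longest_terminal_repeat(seq)
    return seq[:-repeat] if repeat >= min_repeat else seq


# Toy contig with a 6-nt terminal repeat; min_repeat is lowered for the demo.
toy = "ACGTAC" + "GGGTTT" + "ACGTAC"
print(is_circular(toy, min_repeat=6), trim_terminal_repeat(toy, min_repeat=6))
```

With the default `min_repeat=77` the same toy contig is reported as linear and returned unchanged.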

The HMMER3 package (Eddy, 2011) along with the Pfam-A database was used to find the circles containing protein-coding sequences, and an additional verification step was performed with BLAST. The search itself was conducted against a database composed of six-frame translated circular sequences. The search for genes coding for mitochondrial proteins was conducted with BLAST using mitochondrial protein-coding gene sequences from flatworms as queries, with MITOS (Bernt et al., 2013), and with HMMER3 using HMM profiles from the Pfam-A database. Mitochondrial rrnS genes in dicyemids are highly diverged and poorly detected with BLAST. Their detection was conducted with HMMER3 using HMM profiles generated beforehand from the set of 140 rrnS alignments from other organisms (140 species of bilaterians, cnidarians, and placozoans). All findings were verified using blastp or blastn against the NCBI nr database. It was proposed that the dicyemid small mitochondrial circular DNA molecules are generated from the usual long multigene mitochondrial DNA (Awata et al., 2006). If such long mtDNA exists together with mitochondrial mini-circles, we can expect cases in which one read from the sequencing library corresponds to a particular mitochondrial mini-circle while its paired read maps elsewhere. To search for hypothetical high-molecular-weight mtDNA, blastn with the minimal word size was used to map raw paired-end reads to circular contigs coding for mitochondrial genes. Read-pair analysis was then conducted to find the reads whose mate does not map to the initial circular contig. Mitochondrial tRNA secondary structures were predicted using the MiTFi program (Jühling et al., 2012).
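The read-pair logic described above can be expressed as a small sketch: given, for each read pair, the contig to which each mate maps, report pairs in which one mate lies on a mitochondrial circle and the other does not map to that same circle. The input layout, the contig names and the function name are assumptions made for illustration; the parsing of blastn output is omitted.

```python
def find_discordant_pairs(pair_hits, mito_circles):
    """Return (pair_id, circle, mate_contig) for read pairs in which one mate
    maps to a mitochondrial circle and the other maps elsewhere or not at all.

    pair_hits maps a pair ID to (contig_of_mate1, contig_of_mate2);
    None stands for an unmapped mate.
    """
    discordant = []
    for pair_id, (contig1, contig2) in pair_hits.items():
        for anchor, mate in ((contig1, contig2), (contig2, contig1)):
            if anchor in mito_circles and mate != anchor:
                discordant.append((pair_id, anchor, mate))
                break  # report each pair once
    return discordant


# Toy mapping results with hypothetical contig names.
pair_hits = {
    "p1": ("circ_cox1", "circ_cox1"),  # concordant: both mates on one circle
    "p2": ("circ_cox1", "contig_42"),  # mate maps to a different contig
    "p3": ("circ_nad1", None),         # mate is unmapped
    "p4": ("contig_7", "contig_7"),    # not mitochondrial
}
mito_circles = {"circ_cox1", "circ_nad1"}
for record in find_discordant_pairs(pair_hits, mito_circles):
    print(record)
```

Pairs such as `p2` are the ones of interest here: if many of them pointed to the same non-circular contig, that contig would be a candidate for a long multigene mtDNA.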

#### Taxonomic Expansion of Alignments

The starting set of orthologous genes used in this work is based on a dataset for phylogenetic reconstructions within Spiralia assembled by Struck et al. (2014) that was later expanded with sequences of orthonectid I. linei (Mikhailov et al., 2016). The base set of orthologs contained 469 alignments with a total of 62 spiralian species and four ecdysozoan species. To extend the taxonomic sampling of Spiralia and minimize the missing data in the dataset we obtained predicted proteins from several genomic projects accessible through public databases and collected transcriptomic data from the NCBI Sequence Read Archive. The annotations for the genomes of Clonorchis sinensis, Echinococcus granulosus, L. anatina, O. bimaculoides, Priapulus caudatus were obtained from the GenBank database, and the proteins of Adineta vaga were obtained from the Genoscope database. The NCBI Sequence Read Archive was used to extract raw sequence data of another 31 spiralian species (see **Supplementary Table S1**).

The assemblies of the SRA transcriptome data were performed with Trinity (Grabherr et al., 2011) after cleaning the reads with SeqClean (Dana-Farber Cancer Institute<sup>3</sup>) from adapter sequences using the UniVec\_Core database<sup>4</sup> and filtering ribosomal RNA sequences using a database of eukaryotic rRNAs. The prediction of proteins in the assembled transcripts was performed with TransDecoder (Haas et al., 2013), which was assisted with searches against the Pfam (Finn et al., 2014) and UniProtKB/Swiss-Prot (UniProt Consortium, 2015) databases.

The addition of proteins from the newly assembled data to orthologous groups featured in the base set of alignments was performed using the procedure for mapping genes to existing orthologous groups (Fischer et al., 2011) of the OrthoMCL database (Chen et al., 2006). The genes from the initial dataset and novel transcriptomic and genomic data were assigned to orthologous groups of OrthoMCL-DB, and the genes within the same orthologous group were extracted and aligned together using MUSCLE (Edgar, 2004). When more than one sequence per organism was assigned to the same group of orthologs, only the sequence scoring highest against the orthologous group in the initial dataset was selected for the alignment.
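The rule for resolving multiple candidates (keep, for each organism, only the sequence that scores highest against the orthologous group) is simple enough to sketch directly. The tuple layout, the identifiers and the scores below are hypothetical; how the score against the group was computed is not specified here and is left outside the sketch.

```python
def pick_best_per_organism(candidates):
    """Keep one sequence per organism: the one with the highest score.

    candidates is an iterable of (organism, sequence_id, score) tuples for a
    single orthologous group; ties keep the first sequence encountered.
    """
    best = {}
    for organism, seq_id, score in candidates:
        if organism not in best or score > best[organism][1]:
            best[organism] = (seq_id, score)
    return {organism: seq_id for organism, (seq_id, _) in best.items()}


# Toy candidates for one orthologous group (invented IDs and scores).
candidates = [
    ("Dicyema sp. 1", "isotig00012", 310.0),
    ("Dicyema sp. 1", "isotig04711", 522.5),
    ("Lingula anatina", "XP_000001", 640.0),
]
print(pick_best_per_organism(candidates))
```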

#### Phylogenetic Analyses

The concatenation of individual gene alignments was performed with Scafos (Roure et al., 2007) using the option to construct chimeric sequences for several closely related taxa. The following 15 chimeric taxa were constructed for the analysis: Aplysia californica + Biomphalaria glabrata, Brachionus plicatilis + B. manjavacas, Chiton olivaceus + Chaetopleura apiculata, Clonorchis sinensis + Opisthorchis viverrini, Dugesia japonica + Dugesia ryukyuensis, Echinococcus granulosus + Echinococcus multilocularis, Echinorhynchus gadi + Echinorhynchus truttae, Euprymna scolopes + Idiosepius paradoxus, Lepadella patella + Lecane inermis, Pedicellina sp. + P. cernua, Protodrilloides symbioticus + P. chaetifer, Schistosoma mansoni + S. japonicum, Spiochaetopterus sp. + Chaetopterus variopedatus, Stenostomum leucops + Stenostomum sthenum, Symbion pandora + S. americanus. Another ten species that were present in the starting set of alignments were removed due to poor representation in the final alignment: Alcyonidium diaphanum, Fasciola gigantica, Flustra foliacea, Lumbricus rubellus, Philodina roseola, Rotatoria rotatoria, Spirometra erinacei, Stylochoplana maculata, Taenia solium, Turbanella ambronensis. The final number of operational taxonomic units featured in the analysis is 73.

<sup>3</sup>https://sourceforge.net/projects/seqclean

<sup>4</sup>http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/

Before concatenation, the alignments were trimmed with TrimAl (Capella-Gutierrez et al., 2009) to remove poorly aligned regions. The trimming was performed with a gap threshold of 0.5 and a similarity threshold of 0.001. After the removal of invariant positions, the length of the concatenated alignment totaled 87,610 positions, with 40% missing data. Compositional heterogeneity in the alignment partitions (i.e., individual protein alignments after masking) was evaluated using the relative composition frequency variability (RCFV) metric (Zhong et al., 2011). The RCFV values were calculated using BaCoCa (Kuck and Struck, 2014). The low compositional heterogeneity dataset was generated by discarding 302 partitions (referred to in the paper simply as protein alignments) with an RCFV value exceeding 0.115.
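To make the RCFV filter more concrete, the sketch below computes the metric for a toy partition and applies the 0.115 cutoff. It assumes the commonly cited form of the metric, in which the absolute deviation of each amino acid frequency in each taxon from the mean frequency of that amino acid across taxa is summed and divided by the number of taxa. The exact handling of gaps and ambiguous residues in BaCoCa may differ, and all names below are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Relative amino acid frequencies of one sequence (gaps and X ignored)."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    total = len(residues)
    return {aa: (residues.count(aa) / total if total else 0.0) for aa in AMINO_ACIDS}


def rcfv(sequences):
    """Relative composition frequency variability of one alignment partition.

    Assumed form: sum over amino acids i and taxa j of
    |freq(i, j) - mean_over_taxa(freq(i))| / number_of_taxa.
    """
    comps = [composition(s) for s in sequences]
    n = len(comps)
    if n == 0:
        raise ValueError("empty partition")
    value = 0.0
    for aa in AMINO_ACIDS:
        mean = sum(c[aa] for c in comps) / n
        value += sum(abs(c[aa] - mean) for c in comps) / n
    return value


# Toy partitions (invented); keep those not exceeding the 0.115 cutoff.
partitions = {
    "geneA": ["MKTAYIAK", "MKTAYIAK", "MKTAYLAK"],
    "geneB": ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"],
}
kept = [name for name, aln in partitions.items() if rcfv(aln) <= 0.115]
print(kept)
```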

The phylogenetic reconstructions were performed with PhyloBayes-MPI 1.7 (Lartillot et al., 2013), RAxML (Stamatakis, 2014), and IQ-TREE (Nguyen et al., 2015). The RAxML analysis was carried out utilizing the complete analysis function (-f a) with 150 rapid bootstrap replicates and the PROTCATGTR model of evolution. The IQ-TREE analysis was performed using the LG + C60 + F + G4 evolutionary model, and node support was calculated using the ultrafast bootstrap approximation (Minh et al., 2013) with 1,000 replicates. The Bayesian inference with PhyloBayes was carried out using the CAT + GTR + Γ4 model, and the analyses were run with four chains. For the main dataset, the majority rule consensus tree was reconstructed after 30,000 cycles using one out of ten cycles with a 60% burn-in. PhyloBayes analyses of the additional datasets were conducted similarly to the main dataset; the consensus trees were reconstructed after 5,000 or 15,000 cycles with a 50% burn-in. Analysis of the recoded alignment was performed with PhyloBayes utilizing the recode option and the Dayhoff recoding scheme with six amino acid groups (Dayhoff et al., 1978). The convergence of the chains was assessed by comparing bipartitions using the bpcomp utility from the PhyloBayes package.
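For readers unfamiliar with amino acid recoding, the six-state Dayhoff scheme mentioned above replaces each residue by the label of its group, which reduces compositional heterogeneity at the cost of some signal. The sketch below uses the grouping usually given for Dayhoff-6 (AGPST, DENQ, HKR, ILMV, FWY, C). The digit labels and the function name are arbitrary choices for illustration; PhyloBayes performs this recoding internally through its recode option.

```python
# Dayhoff six-state grouping as usually stated; digit labels are arbitrary.
DAYHOFF6_GROUPS = ("AGPST", "DENQ", "HKR", "ILMV", "FWY", "C")
RECODE = {aa: str(idx) for idx, group in enumerate(DAYHOFF6_GROUPS) for aa in group}


def recode_dayhoff6(seq, missing="-"):
    """Replace each amino acid by its Dayhoff-6 group label.

    Gaps, X and any other symbol are written as the missing-data character.
    """
    return "".join(RECODE.get(residue, missing) for residue in seq.upper())


print(recode_dayhoff6("MKTAYIAKQR"))
```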

Comparison of topologies in the four chains of the Bayesian inference of the main dataset was performed using the CONSEL program (Shimodaira and Hasegawa, 2001) and the "sitelogl" option of the PhyloBayes readpb\_mpi program. The site-specific marginal log likelihoods were computed for each chain across 10 data points sampled over 2,000 cycles after a 20,000 cycle burn-in.

Alignment partitions (i.e., individual protein alignments after masking) with a strong annelid signal were selected as follows. In a protein alignment we define two sets of sequences: G<sub>1</sub> (ingroup) and G<sub>2</sub> (outgroup). Only alignment positions in which no more than half of the data are missing (gaps or X's) in each of the two sets are considered. For each such position i, the value q(i) is determined as the maximum, over amino acids, of the difference between the frequency of the amino acid at this position in G<sub>1</sub> and in G<sub>2</sub>. Missing data are ignored. The maximum q(i) value is 1, reached when G<sub>1</sub> consists of only one character and G<sub>2</sub> does not contain this character. For any q(i) > 1/2 there exists a character a(i) observed in more than half of the taxa from G<sub>1</sub> but much less frequently in G<sub>2</sub> [the frequency difference is q(i) > 1/2]. In the phylogenetic context, when G<sub>1</sub> + G<sub>2</sub> constitute a monophyletic clade, and G<sub>1</sub> is a narrower natural clade, high q(i) values can be interpreted as the presence of a synapomorphy against G<sub>2</sub>. Notably, in this analysis q(i) values are used only to select partitions and not for alignment editing or removal of positions. In our case of detecting the annelid signal, G<sub>1</sub> contained all annelids except the orthonectid, and G<sub>2</sub> contained all non-annelid taxa except dicyemids, in order to obtain q(i) estimates unbiased with respect to the lineages under study.
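The definition of q(i) translates almost directly into code. The sketch below takes the frequency difference as the frequency in the ingroup minus the frequency in the outgroup, which is the reading implied by the statement that q(i) > 1/2 guarantees a character present in more than half of the ingroup taxa. The function names and the toy alignment are hypothetical, and the way per-position values were aggregated to rank whole partitions is not reproduced here.

```python
MISSING = {"-", "X"}


def q_value(col_g1, col_g2):
    """q(i) for one alignment column, or None if the column is skipped.

    A column is skipped when more than half of the entries are missing in
    either group. Frequencies are computed over non-missing entries only.
    """
    obs1 = [c for c in col_g1 if c not in MISSING]
    obs2 = [c for c in col_g2 if c not in MISSING]
    if not obs1 or not obs2:
        return None
    if 2 * len(obs1) < len(col_g1) or 2 * len(obs2) < len(col_g2):
        return None
    # Residues absent from the ingroup can only give non-positive differences.
    return max(
        obs1.count(a) / len(obs1) - obs2.count(a) / len(obs2) for a in set(obs1)
    )


def signal_positions(g1_seqs, g2_seqs, threshold=0.5):
    """Indices of columns whose q(i) exceeds the threshold."""
    columns = zip(zip(*g1_seqs), zip(*g2_seqs))
    hits = []
    for i, (c1, c2) in enumerate(columns):
        q = q_value(c1, c2)
        if q is not None and q > threshold:
            hits.append(i)
    return hits


# Toy aligned sequences (invented): ingroup G1 and outgroup G2.
g1 = ["AKL", "AKM", "ARL"]
g2 = ["CKL", "CKV", "GRL"]
print(signal_positions(g1, g2))
```

In the toy alignment only the first column qualifies: every ingroup sequence carries A there and no outgroup sequence does, so q(i) = 1.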

#### AUTHOR CONTRIBUTIONS

OZ performed most of the computations, analyzed the data, and drafted the manuscript. KM, YP, SI, OP, and LR performed additional computations, analyzed the data, and wrote the manuscript. LR obtained original RNA-Seq data, assembled the transcriptome of Dicyema sp. 1. ML and AP obtained original DNA-Seq data. LM obtained original RNA-Seq data on Dicyema sp. 2. VL supervised the computational part of the work. VA designed and supervised the research. All authors read and approved the manuscript.

### FUNDING

This research was performed at IITP RAS and supported by the Russian Science Foundation, project no. 14-50-00150. Sequencing of the Dicyema sp. 2 transcriptome was supported by the Government of the Russian Federation, grant #14.W03.31.0015. The phylogenetic analyses were supported by the Russian Foundation for Basic Research grant nos. 18-29-13014 and 18-29-13037. The computations were carried out on MVS-10P at the Joint Supercomputer Center of the Russian Academy of Sciences (JSCC RAS).

#### ACKNOWLEDGMENTS

We thank V. P. Kuznetsov for graphic design in figures. We are deeply grateful to all reviewers for the productive dialogue that led to the enrichment of the paper.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00443/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zverkov, Mikhailov, Isaev, Rusin, Popova, Logacheva, Penin, Moroz, Panchin, Lyubetsky and Aleoshin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Searching for Signatures of Cold Climate Adaptation in *TRPM8* Gene in Populations of East Asian Ancestry

*Alexander V. Igoshin1\*, Konstantin V. Gunbin2,3,4, Nikolay S. Yudin3,5 and Mikhail I. Voevoda6*

*1 Sector of the Genetics of Industrial Microorganisms, The Federal Research Center Institute of Cytology and Genetics, The Siberian Branch, The Russian Academy of Sciences, Novosibirsk, Russia, 2 Center of Brain Neurobiology and Neurogenetics, The Federal Research Center Institute of Cytology and Genetics, The Siberian Branch, The Russian Academy of Sciences, Novosibirsk, Russia, 3 V. Zelman Institute for Medicine and Psychology Novosibirsk State University, Novosibirsk, Russia, 4 Center for Mitochondrial Functional Genomics, Institute of Living Systems, Immanuel Kant Baltic Federal University, Kaliningrad, Russia, 5 Laboratory of Livestock Molecular Genetics and Breeding, The Federal Research Center Institute of Cytology and Genetics, The Siberian Branch, The Russian Academy of Sciences, Novosibirsk, Russia, 6 Laboratory of Human Molecular Genetics, The Federal Research Center Institute of Cytology and Genetics, The Siberian Branch, The Russian Academy of Sciences, Novosibirsk, Russia*

#### *Edited by:*

*Ancha Baranova, George Mason University, United States*

#### *Reviewed by:*

*Margarida Matos, University of Lisbon, Portugal Toni Gossmann, University of Sheffield, United Kingdom*

> *\*Correspondence: Alexander V. Igoshin igoshin@bionet.nsc.ru*

#### *Specialty section:*

*This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics*

*Received: 30 October 2018 Accepted: 17 July 2019 Published: 23 August 2019*

#### *Citation:*

*Igoshin AV, Gunbin KV, Yudin NS and Voevoda MI (2019) Searching for Signatures of Cold Climate Adaptation in TRPM8 Gene in Populations of East Asian Ancestry. Front. Genet. 10:759. doi: 10.3389/fgene.2019.00759*

Dispersal of *Homo sapiens* across the globe during the last 200,000 years was accompanied by adaptation to local climatic conditions, with severe winter temperatures being probably one of the most significant selective forces. The *TRPM8* gene codes for a cold-sensing ion channel, and adaptation to low temperatures is the major determinant of its molecular evolution. Here, our aim was to search for signatures of cold climate adaptation in the *TRPM8* gene using a combined data set of 19 populations of East Asian ancestry from the 1000 Genomes Project and Human Genome Diversity Project. Out of a total of 60 markers under study, none showed significant association with the average winter temperatures at the locations of the studied populations once multiple testing thresholds were applied. This suggests that the principal mode of *TRPM8* evolution may differ from widespread models, where adaptive alleles are additive, dominant or recessive, at least in populations with the predominant East Asian component. For example, evolution by means of selectively preferable epistatic interactions among amino acids may have taken place. Despite the lack of strong signals of association, however, a very promising single nucleotide polymorphism (SNP) was found. The SNP rs7577262 is considered the best candidate based on its allelic correlations with winter temperatures, signatures of selective sweep and physiological evidence. The second top SNP, rs17862920, may participate in adaptation as well. Additionally, to assist in interpreting the nominal associations reached by the other markers, we performed SNP prioritization based on functional evidence found in the literature and on evolutionary conservation.

Keywords: TRPM8, environmental correlation analysis, SNP, cold adaptation, East Asian ancestry

## INTRODUCTION

Recent paleoanthropological evidence shows the presence of anatomically modern humans in Africa as early as 300 kya (Hublin et al., 2017), with the earliest known "Out of Africa" migration event dating back to 200 kya (Hershkovitz et al., 2018). Dispersal of *Homo sapiens* across the globe during the last 200,000 years was accompanied by adaptation to local environments. Spatial variations in selective pressures have ultimately led to the observable geographic distribution of many physiological and anatomical traits in present-day humans. For example, the low level of UV radiation at higher latitudes is now considered to be the major cause of the evolution of depigmented skin (Jablonski and Chaplin, 2010).

Since the mid-2000s, there has been significant progress in genotyping technologies, followed by the appearance of publicly available databases of human genetic variation. This has helped population geneticists to discover signatures of human local adaptation from genome-wide genotyping data. Human microevolution driven by the action of low temperatures has long attracted the attention of the scientific community. A number of studies have now been dedicated to this issue both at the genome level (e.g., Hancock et al., 2011a; Cardona et al., 2014; Valverde et al., 2015) and at the level of selected regions or genes (Hancock et al., 2008; Ohashi et al., 2011; Hancock et al., 2011b; Sazzini et al., 2014; Quagliarello et al., 2017).

Probably the best-known gene in terms of its possible role in adaptation to cold climate is *TRPM8*, located on human chromosome 2. This gene codes for an ion channel functioning as a thermal sensor, detecting temperatures in the range from 15 to 30°C (Fernández et al., 2011). There is evidence supporting its physiological role in thermoregulation, and in fact, TRPM8 is the only well-established cold receptor in mammals (Bautista et al., 2007; Colburn et al., 2007; Dhaka et al., 2007). In addition, there are data on associations of its single nucleotide polymorphisms (SNPs) with sensitivity to cold (Kozyreva et al., 2011), the respiratory system response to cooling (Kozyreva et al., 2014), blood lipids, and anthropometric parameters in humans (Potapova et al., 2014). The *TRPM8* gene was suggested to underlie genetic adaptation to cold in ground squirrel and hamster (Matos-Cruz et al., 2017), sheep (Fariello et al., 2014; Liu et al., 2016), and humans (Cardona et al., 2014; Key et al., 2018). According to modern views, adaptation to low temperatures is the major determinant of *TRPM8* molecular evolution (Myers et al., 2009; Majhi et al., 2015).

To our knowledge, the study by Key and colleagues (2018) is the only one to use environmental data to search for signatures of cold climate adaptation in the *TRPM8* gene. The authors used latitudes and annual average temperatures at the locations of the populations of the Old World as predictors for SNP allele frequency distributions. They found evidence that SNP rs10166942 had undergone climate-mediated selection, which raised its derived allele frequency from south to north.

In our opinion, focusing on closely related populations is preferable to using large population sets for the following reasons. First, the ability to survive in a severely cold climate is supposed to be highly polygenic, as many biological processes like vasoconstriction, nonshivering thermogenesis, regulation of adipocyte differentiation, and thermoception are expected to be involved. It is known that the adaptation to cold can be associated with quite different genetic bases (Yudin et al., 2017). Because different branches of *Homo sapiens* are likely to have had distinct genetic backgrounds before and during the process of climate-driven selection, it is possible that in phylogenetically distant groups, adaptation may have recruited different genes. Second, even in the case when selection is acting on the same gene, variants involved in adaptation may differ in different branches. Our supposition is supported by the example of variants associated with lactase persistence. Within European populations, the activity of the lactase enzyme in adulthood is connected with the C/T-13910 variant in the enhancer region of the *LCT* gene, whereas in sub-Saharan Africa, this trait is mainly correlated with the presence of the G/C-14010 mutation (Tishkoff et al., 2007). Therefore, it would be sensible to search for microevolution in clusters of related populations.

For the above reasons, the aim of this study was to search for signatures of adaptation to low temperatures in the *TRPM8* gene under various null hypotheses of population structure and dynamics using a combined data set of 19 populations of East Asian ancestry from the Human Genome Diversity Project (HGDP) and the 1000 Genomes (1000G) Project with the assistance of environmental correlation analysis techniques. The locations of the chosen populations are characterized by a large range of average winter temperatures (from −37 to +27°C), implying substantial differences in selection pressures.

#### MATERIALS AND METHODS

#### Genotypic and Environmental Data

In this study, we used genotypic data on 656 individuals from 19 HGDP (Cann et al., 2002) and 1000G Project (1000 Genomes Project Consortium et al., 2010) populations (**Supplementary Table 1**) having predominantly the East Asian genetic component. Data on SNPs belonging to the *TRPM8* gene were obtained from NCBI dbSNP (https://www.ncbi.nlm.nih.gov/snp/), resulting in 60 polymorphic markers (minor allele frequency >0.01) at the intersection of the HGDP and 1000G sets. The missing genotypes in the HGDP data were imputed using fastPHASE v.1.4.8 software (Scheet and Stephens, 2006) with default parameters. Genotypic information from the HGDP and VCF formats was combined using a custom Python 3 script, so that any inconsistency of DNA strands between the databases was resolved using the 1000G VCF as a reference. Besides *TRPM8* SNPs, we used 5 Mb regions upstream and downstream of this gene (1,309 markers at r<sup>2</sup> < 0.7) to infer the phylogenetic tree used in PGLS, to estimate the covariance matrix used in Bayenv2-BLM and Bayenv2-SRC, to correct for background levels of the population structure in LFMM and to make population inferences in BayScEnv more precise. This set of SNPs was also used to construct a null distribution for empirical p-value calculation. Information on latitudes and longitudes for 1000G populations was taken from Key et al. (2018). Latitudes and longitudes for HGDP populations were taken from Cann et al. (2002), and average winter temperature values were obtained from the ClimateCharts.net database (https://climatecharts.net) using the corresponding coordinates. We believe the average winter temperature to be a more pertinent predictor for the distribution of cold-adaptive alleles than the annual average temperature, as regions with a continental climate may have cold winters and hot summers.
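Resolving strand inconsistencies against a reference typically amounts to checking whether the observed alleles match the reference alleles directly or only after complementation. The sketch below shows one way this could be done. It is not the script used in this study; the function name is hypothetical, and the decision to treat A/T and C/G SNPs as unresolvable from alleles alone is an added assumption.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize_strand(alleles, ref, alt):
    """Return the alleles expressed on the reference strand.

    alleles: observed alleles at one SNP (e.g., from an HGDP-style record).
    ref, alt: alleles of the same SNP in the reference VCF.
    Returns None when the strand cannot be resolved from alleles alone.
    """
    if any(a not in COMPLEMENT for a in alleles):
        return None  # unexpected symbol, e.g., a missing genotype code
    if COMPLEMENT[ref] == alt:
        return None  # A/T or C/G SNP: both strands look identical
    expected = {ref, alt}
    if set(alleles) <= expected:
        return list(alleles)  # already on the reference strand
    flipped = [COMPLEMENT[a] for a in alleles]
    if set(flipped) <= expected:
        return flipped  # reported on the opposite strand
    return None  # alleles inconsistent with the reference


print(harmonize_strand(["T", "C", "C"], ref="A", alt="G"))
```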

#### Construction of a Phylogenetic Tree and Statistical Analysis

As our sample consisted of phylogenetically close populations, we first performed a conventional Spearman's rank correlation test that does not account for the sample structure (Spearman, 1904) at the population level (i.e., using allele frequencies) and the individual level (i.e., using allele dosages).
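As a reminder of what the conventional test computes, the following Python sketch calculates Spearman's rho as the Pearson correlation of average ranks. It is a plain illustration, not the software used in the study, and the function names are chosen here for convenience.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

At the population level, `x` would hold one winter temperature per population and `y` the corresponding allele frequencies; at the individual level, `y` would hold allele dosages (0, 1, or 2), which produces many ties and makes the average-rank handling relevant.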

PGLS analysis (R package 'ape,' Paradis et al., 2004) was carried out at the individual level using the simplest Brownian motion model. We selected this type of analysis because it represents the opposite extreme (phyletic evolution) to the conventional Spearman's rank correlation test. The phylogenetic tree used in this test was reconstructed with IQ-TREE v. 1.5.5 based on the best nucleotide substitution model selected by its ModelFinder subprogram (Kalyaanamoorthy et al., 2017).

Our primary aim was to test the association between the climatic factor and allele frequencies. For this purpose, we chose two independent approaches: the Bayesian linear model from the Bayenv2 software (Günther and Coop, 2013), further referred to as Bayenv2-BLM, and LFMM (Frichot et al., 2013), each building a regression model relating allele frequencies to environmental values. To minimize false-positive associations between allele frequencies and environment caused by population structure, both methods take into account allele frequency correlations across populations, although they do so in different ways.

In addition, we used Spearman's rank correlation test from Bayenv2 (further referred to as Bayenv2-SRC) that uses allele frequencies standardized to have no covariance. It is less powerful than Bayenv2-BLM but more robust to outliers and can detect monotonic relationships.

BayScEnv test (de Villemereuil and Gaggiotti, 2015) was used as an alternative to Bayenv2-BLM and LFMM. This method assumes that all populations are independent and exchange genes through the limited migrant pool; it includes a locus-specific effect unrelated to the environmental variable, taking into consideration locus-specific deviations from a neutral model. BayScEnv software was also used to calculate Fst distances for each locus averaged over populations (**Supplementary Table 2**).

In addition to the correlation techniques, we tried the XP-CLR test (Chen et al., 2010) as a complementary approach. This test is designed for detecting selective sweeps on the basis of joint modeling of the multilocus allele frequency differentiation between two populations. The method does not require information on the ancestral/derived status at each SNP (Chen et al., 2010; Vatsiou et al., 2016).

For more details on phylogenetic tree construction and statistical analysis, please see the Supplementary Material.

#### SNP Prioritization

We prioritized SNPs based on three types of evidence found in the literature: "association with trait relevant to survival in a cold climate," "evidence for cold-mediated selection," and "association with any other phenotype or risk." These categories were given weights of 3, 2, and 1, respectively, and the weights were summed into a score for each SNP (**Supplementary Table 3**). The key assumption behind this prioritization is that, because of the pleiotropic nature of the *TRPM8* gene, allelic substitutions with any functional manifestation are more likely to affect survival in cold climate conditions than those without any known effects.
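The scoring scheme can be sketched in a few lines of Python. The category keys are hypothetical labels introduced here, and the sketch assumes that each category contributes its weight at most once per SNP; the article does not state how multiple reports within one category were counted.

```python
# Weights taken from the text; the category keys are illustrative labels.
EVIDENCE_WEIGHTS = {
    "cold_trait_association": 3,   # trait relevant to survival in a cold climate
    "cold_mediated_selection": 2,  # evidence for cold-mediated selection
    "other_phenotype_or_risk": 1,  # any other phenotype or risk
}


def evidence_score(categories):
    """Sum the weights of the distinct evidence categories found for a SNP.

    Assumption: a category counts once, however many papers support it.
    """
    return sum(EVIDENCE_WEIGHTS[c] for c in set(categories))
```

Under this assumption, a SNP with a reported cold-related trait association and one unrelated disease association would score 3 + 1 = 4, and a SNP with no literature evidence would score 0.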

Additionally, we obtained the PhyloP100way vertebrate conservation score for each SNP (**Supplementary Table 4**) from the UCSC Genome Browser (Casper et al., 2018). It is currently thought that genetic drift plays a minor role in the evolution of conserved sites, and that the relatively rare allele replacements occurring therein are mostly driven by positive selection (Andolfatto, 2005; Cai et al., 2009; Halligan et al., 2011; Bazykin and Kondrashov, 2012). Therefore, a significant allelic correlation with an environmental gradient, supported by a high conservation score, is a promising sign of local adaptation.

#### RESULTS

Contrary to what we had expected, only three SNPs out of a total of 60 markers under study showed nominally significant association with the average winter temperatures at the locations of the studied populations by any two types of analysis (the results of tests carried out using default/recommended parameters are shown in **Table 1**; for full results, see **Supplementary Tables 5 and 6**). When considering the multiple testing threshold, however, none of them is significant (adjusted p values not shown).

TABLE 1 | Genic and upstream *TRPM8* variants showing nominally significant (in bold) association in at least two correlation tests with default/recommended parameters (K = 2 for LFMM and pi = 0.1/p = 0.5 for BayScEnv).

*\*The SNP evidence score was not calculated for rs7577262 because of a high level of confidence in its adaptive role.*

SNP rs11682848 was previously reported as associated with the prognosis of colorectal cancer (Walther, 2010). Interestingly, such a connection of climate-associated loci with cancer has already been noticed by other researchers (Hancock et al., 2011a). Furthermore, a combined analysis of 247 genome-wide association studies has recently shown that cold-selected genes are enriched with cancer-associated genes (Voskarides, 2018). Ironically, SNP rs11682848 has the lowest conservation score among the 60 markers under study. This means that rs11682848 is either a false-positive finding or linked to some functional variant.

SNP rs17862920 has been associated with migraine susceptibility (Freilinger et al., 2012; Meng et al., 2018). The rs17862920-C allele predisposing to migraine is more prevalent in northern latitudes. Also, rs17862920 has been shown to be associated with sensing cold pain in Finnish and Norwegian individuals, with C allele carriers being more susceptible (Kaunisto et al., 2013). Migraine has been reported to be related to increased pain perception of nonnoxious cold temperatures (Burstein et al., 2000). Unlike that of rs11682848, the PhyloP100way conservation score for rs17862920 is more promising, ranking 20/60 while still being negative. It is possible that allele substitution in rs17862920 has a functional effect: rs17862920 was predicted to regulate *TRPM8* transcription by TFsearch and GoldenPath in the F-SNP bioinformatics tool (Ghosh et al., 2013).

As for rs6723922, this SNP is a genetic risk factor for severe cutaneous adverse drug reactions (Park et al., 2018). One could hypothesize that a common mechanism underlies both the altered cold sensation and the increased cutaneous susceptibility to chemicals. This would be no surprise given that the TRPM8 channel is activated by a variety of chemical ligands (Beccari et al., 2017). Like rs11682848, SNP rs6723922 has a low conservation score (a rank of 55/60), implying conclusions for this marker similar to those for rs11682848.

The results of the XP-CLR test are more encouraging (**Supplementary Figure S1**). It appears that there is a pronounced trend for several pairs of populations to show the signature of a selective sweep 10 ± 6 Kb upstream from the *TRPM8* gene. The direction of selection in this region is seen when reversing tested and reference populations in pairs (e.g., compare "JPT vs. KHV" and "KHV vs. JPT"). The strongest XP-CLR peaks within this putative sweep are mainly located near rs10929317 and rs7577262 SNP loci. The former was removed from the analysis because of high LD with rs17862920: r2 = 0.966/*D'* = 0.995 in East Asian populations (LDlink tool; Machiela and Chanock, 2015) and is therefore expected to be as significant as rs17862920. The latter was used in the control set of 1,309 markers. Surprisingly, this SNP demonstrates significant association with our climatic variable in almost all of the correlation tests (**Table 1**). Also, rs7577262 has been reported to be associated with susceptibility to migraine (Anttila et al., 2013) and blood pressure response to the cold pressor test (He et al., 2013).

### DISCUSSION

A variety of facts have led us to think of the *TRPM8* gene as being under intense positive selection. We expected that a large number of SNPs in *TRPM8* would demonstrate strong signals, either because they are directly under selection or because they are linked to causal variants. However, this is not the case. Furthermore, the SNPs detected do not pass the threshold corrected for multiple testing. Among possible explanations are the following hypotheses:


Despite the lack of strong signals of association, however, a very promising candidate SNP was found. SNP rs7577262 is 7.1 kb upstream of the transcription start site for *TRPM8* mRNA, implying its possible involvement in transcriptional regulation. In addition to correlations and signatures of a sweep, physiological data contribute equally to the evidence in favor of selection acting on rs7577262. The rs7577262-G allele is associated with a higher blood pressure response to the cold pressor test (He et al., 2013). It is known that the blood pressure response to the cold pressor test primarily stems from alpha-adrenergically mediated peripheral vasoconstriction (Leppäluoto and Hassi, 1991; Larra et al., 2015), which is, in turn, one of the basic mechanisms of cold adaptation (Daanen and Lichtenbelt, 2016). Given that this allele is more prevalent in northern latitudes (**Figure 1**), its adaptive role may be assumed.

Another SNP, rs17862920, is in linkage disequilibrium with rs7577262 (r2 = 0.59 in East Asians), which probably explains the correlation shown by the former. Both SNPs are risk loci for migraine. At the same time, as mentioned above, rs17862920 is associated with sensing cold pain in Finnish and Norwegian individuals, with C allele carriers (more prevalent in northern latitudes) being more susceptible. It can be assumed that both loci are independently involved in adaptation to low temperatures. In that case, however, the adaptive role of the rs17862920-C allele is hard to explain. A possible mechanism of differential survival might be avoidance of potentially lethal hypothermia by carriers of the C allele.

As for the rs6723922 and rs11682848 loci, neither shows any sign of a selective sweep in the XP-CLR test. These are probably false positives or, at least, linked to an unobserved variant under selection. It is also worth noting that the SNP evidence scores for these loci are quite low.

In addition to the search for signatures of selection, we would like to note some details on the BayScEnv test not published anywhere (as far as we know).

Changing model parameters drastically affects the output in BayScEnv (see **Supplementary Table 6**). For example, significant results (q value <0.05) were obtained when using model parameters pi = 0.5/p = 0.1 (SNP rs11682848 being significant) or pi = 0.9/p = 0.1 (18/60 SNPs being significant). At the same time, empirical p values are more stable. Thus, we suggest using them in hypothesis-driven studies of local adaptation (with default model parameters) and choosing the significance threshold based on expert opinion rather than relying on FDR outputs.

Counterintuitively, a reduction in the number of tests in BayScEnv does not lead to a greater number of statistically significant FDR outputs (the posterior error probability and the q value). In our case, given parameters pi = 0.9/p = 0.1, 18 out of a total of 60 SNPs reach the significance level when analyzing 1,369 markers, whereas none is significant when using only the 60 SNPs. This discrepancy might be explained by a less precise construction of the null model of population structure when fewer markers are available.

#### CONCLUSIONS

Several lines of evidence point to the possible involvement of rs7577262 in cold adaptation. This SNP is considered the best candidate based on its allelic correlations with winter temperatures, signatures of a selective sweep, and physiological evidence. The second top SNP, rs17862920, may participate in adaptation as well. As for the rs6723922 and rs11682848 loci, these appear to be false positives or at least linked to some unobserved selected variant.

### AUTHOR CONTRIBUTIONS

NY and MV conceived the project. KG supervised the project. AI and KG processed and analyzed the data. AI, KG, and NY drafted the manuscript.

### REFERENCES


#### FUNDING

This study was supported by budget from project No. 0324-2019-0041 of the Federal Research Center «Institute of Cytology and Genetics» SB RAS (ICG SB RAS).

### ACKNOWLEDGMENTS

The Common Use Center «Bioinformatics» (ICG SB RAS) is gratefully acknowledged for providing computer facilities.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00759/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Igoshin, Gunbin, Yudin and Voevoda. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Initial Characterization of the Chloroplast Genome of Vicia sepium, an Important Wild Resource Plant, and Related Inferences About Its Evolution

#### Chaoyang Li<sup>1</sup>, Yunlin Zhao<sup>1</sup>, Zhenggang Xu<sup>1,2</sup>\*, Guiyan Yang<sup>1</sup>, Jiao Peng<sup>1</sup> and Xiaoyun Peng<sup>2</sup>

<sup>1</sup> Hunan Research Center of Engineering Technology for Utilization of Environmental and Resources Plant, Central South University of Forestry and Technology, Changsha, China, <sup>2</sup> Hunan Urban and Rural Ecological Planning and Restoration Engineering Research Center, Hunan City University, Yiyang, China

Lack of complete genomic information concerning Vicia sepium (Fabaceae: Fabeae) precludes investigations of the evolution and population diversity of this perennial high-protein forage plant suitable for cultivation in extreme conditions. Here, we present the complete and annotated chloroplast genome of this important wild resource plant. The V. sepium chloroplast genome includes 76 protein-coding genes, 29 tRNA genes, 4 rRNA genes, and 1 pseudogene. Its 124,095-bp sequence lacks one inverted repeat (IR). The GC content of the whole genome and of the protein-coding, intron, tRNA, rRNA, and intergenic spacer regions was 35.0%, 36.7%, 34.6%, 52.3%, 54.2%, and 29.2%, respectively. Comparative analyses with plastid genomes from related genera belonging to Fabeae demonstrated that the greatest variation in the V. sepium genome length occurred in protein-coding regions. In these regions, some genes and introns were lost or gained; for example, ycf4, clpP intron, and rpl16 intron deletions and rpl20 and ORF292 insertions were observed. Twelve highly divergent regions, 66 simple sequence repeats (SSRs) and 27 repeat sequences were also found in these regions. Detailed evolutionary rate analysis of protein-coding genes showed that Vicia species exhibit additional interesting characteristics, including positive selection of ccsA, clpP, rpl32, rpl33, rpoC1, rps15, rps2, rps4, and rps7 and significantly accelerated evolutionary rates of atpA, accD, and rps2. These genes are important candidate genes for understanding the evolutionary strategies of Vicia and other genera in Fabeae. The phylogenetic analysis showed that Vicia and Lens are included in the same clade and that Vicia is paraphyletic. These results provide evidence regarding the evolutionary history of the chloroplast genome.

Keywords: chloroplast genome, comparative analysis, phylogenetic analysis, positive selection, Vicia sepium

#### Edited by:

Ancha Baranova, George Mason University, United States

#### Reviewed by:

Aleksandar M. Mikich, Independent Researcher, Novi Sad, Serbia; Tatiana V Tatarinova, University of La Verne, United States

\*Correspondence: Zhenggang Xu rssq198677@163.com

#### Specialty section:

This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics

Received: 08 October 2018; Accepted: 22 January 2020; Published: 20 February 2020

#### Citation:

Li C, Zhao Y, Xu Z, Yang G, Peng J and Peng X (2020) Initial Characterization of the Chloroplast Genome of Vicia sepium, an Important Wild Resource Plant, and Related Inferences About Its Evolution. Front. Genet. 11:73. doi: 10.3389/fgene.2020.00073


### INTRODUCTION

Complete chloroplast sequences are indispensable for analyzing genome evolution and phylogenetics (Sabir et al., 2014; Moner et al., 2018). These sequences offer two advantages over genomic ones, namely, a high degree of conservation and a relatively compact gene alignment, resulting from symbiotic horizontal transfer (Timmis et al., 2004). In angiosperms, the chloroplast is a uniparentally inherited organelle. It originated from a cyanobacterium-like organism through an endosymbiosis event. Compared to the nuclear genome, chloroplast genomes, with a quadripartite circular structure, exhibit highly conserved sizes, structures and gene contents across photosynthetic plants (Wicke et al., 2011). Nuclear genomes are highly complex because of the high frequency of the loss and gain of genetic material at any time (Wolfe et al., 1987), making the identification of orthologous genes difficult. Evolutionary and phylogenetic analyses based on complete chloroplast sequences can provide more valuable information of a higher quality than that obtained by analysis of one or more gene loci (Martin et al., 2005). Complete chloroplast sequence datasets contain all site patterns (or all genes) for the reconstruction of evolutionary history. The comparison of complete genomes can reduce the sampling error inherent in analyses of only one or a few genes. That is not to say that we oppose the use of one or a few genes in evolutionary studies, but we instead suggest the investigation of conflicts between complete chloroplast genomes and analyses of one or a few genes that may indicate crucial evolutionary events. Another advantage of the chloroplast genome is that it contributes to structural diversity at low taxonomic levels and among basal lineages. Although genome organization is relatively well conserved in angiosperms, several types of structural diversity have been found. 
This structural diversity, including the loss of one copy of IRs, gene and intron gains or losses, large inversions, expansions, contractions and localized hypermutable phenomena, provides a powerful tool for evaluating genomic evolutionary history. For example, the loss of one IR is observed in the inverted-repeat-lacking clade (IRLC) (Sabir et al., 2014); the loss of accD, psaI, ycf4, rpl33, clpP, and rps16 resulting in gene function loss is observed in various legume lineages; a 36-kb inversion is observed in the Genistoid clade; a 39-kb inversion is observed in Robinia (Keller et al., 2017); and hypermutation of ycf4 is observed in Lathyrus (Magee et al., 2010). With the development of high-throughput sequencing, more than 800 complete chloroplast genomes have been made available in the National Center for Biotechnology Information (NCBI) database (Asaf et al., 2017a).

The Fabaceae family, especially the Papilionoideae subfamily, is considered a model system for understanding the mechanisms of chloroplast genome evolution due to the presence of major genome rearrangements in this group such as loss of one IR, gene and intron gains and losses, large inversions, expansions, contractions and localized hypermutable regions (Sabir et al., 2014; Keller et al., 2017). However, the mechanisms of these chloroplast genome rearrangements are not known (Sveinsson and Cronk, 2016). Some scholars believe that these genome rearrangements within the Fabaceae chloroplast genomes may be derived from the loss of one copy of IRs; however, Medicago and Cicer species, which exhibit the typical conserved quadripartite structure found in angiosperms (Jansen et al., 2005), also present extensive chloroplast genome rearrangements (Jansen et al., 2008; Sveinsson and Cronk, 2016). Therefore, further in-depth research on the mechanisms of chloroplast genome evolution is needed.

Previous research on Fabaceae chloroplast genomes demonstrated that the deletion or addition of genes and introns, inversions, repeats, and nucleotide variability can result in significant changes in genome length, GC content, and gene composition and orientation (Lei et al., 2016; Yin et al., 2017; Wang et al., 2018). In these genomes, coding regions are better conserved than intergenic spacer (IGS) regions (Sabir et al., 2014; Asaf et al., 2017b; Yin et al., 2018). However, it is unclear whether a consistent pattern of genomic variation can be observed in species of the tribe Fabeae, which belongs to Fabaceae. One possible reason for this uncertainty is the lack of complete genomic information for Fabeae. To date, 21 complete Fabeae chloroplast genomes have been sequenced (including 18 in the last four years), mainly from the genus Lathyrus (13) and a few from the genera Lens (1), Pisum (4) and Vicia (3). Another possible reason is the structural diversity among Papilionoideae (Jansen et al., 2008; Sabir et al., 2014; Sveinsson and Cronk, 2016). For example, even within the same genus, the Trifolium subterraneum (Fabaceae) chloroplast genome exhibits 14-18 inversions, while there are only 3 inversions in Trifolium grandiflorum and Trifolium aureum (Sabir et al., 2014). Therefore, the study of the genomic variation and phylogeny of Fabeae species can provide a basis for understanding chloroplast genome evolution.

Vicia sepium (bush vetch), belonging to the tribe Fabeae, is an important wild resource plant with a wide distribution area (Maxted, 1995), flowering periods ranging from May to November, abundant protein, and suitability for cultivation in extreme cold and dry conditions (Maršalkienė, 2016), which makes it a promising perennial forage. Additionally, compared with other legumes, V. sepium provides herbage for a long period because of its perennial habit (Maršalkienė, 2016). This plant also produces extrafloral nectaries to attract ants, which act as plant defenders by preying on arthropod herbivores or interrupting their oviposition or feeding (Lenoir and Pihlgren, 2006). However, previous studies on V. sepium have mainly focused on the morphological characteristics (Maršalkienė, 2016) and classification (Schaefer et al., 2012; Jaaska, 2015) of this plant and the relationship between plants and insects (Kruess and Tscharntke, 2000; Lenoir and Pihlgren, 2006). Therefore, little is known regarding the nutrient content, genetic resources, and forage value of this species. As a result, no plant materials of V. sepium have been released for commercial production. In contrast, another Vicia species, Vicia sativa, has been widely used as forage and for hay and silage production. A key difficulty in the use of V. sativa is the presence of a neurotoxic compound in its seeds (Huang et al., 2017). Therefore, the expansion of forage resources based on Vicia species is necessary.

Another difficulty in the utilization of V. sepium is that the taxonomy of some taxa in Fabeae remains controversial (Schaefer et al., 2012; Jaaska, 2015; Iberite et al., 2017) because of the high morphological variability among species. Notably, some variation in morphological characteristics is genetically fixed. For example, cultivation tests by Iberite et al. (2017) conducted in V. sativa, Vicia barbazitae, Vicia grandiflora and V. sepium showed that the characteristics of the leaf margins are maintained through successive generations. Recent molecular phylogenetic studies have focused on multi-tribe legume or tribe-level analyses of Fabeae (Schaefer et al., 2012). These studies have suggested that some genera in Fabeae are not monophyletic. However, these phylogenetic studies did not use the complete chloroplast genome, relying instead on individual markers, such as the plastid matK, trnL, and rbcL sequences and the nuclear ribosomal internal transcribed spacer (ITS) sequences. Therefore, it is necessary to acquire comprehensive knowledge regarding the organization and evolution of the V. sepium chloroplast genome.

Here, we present a new complete chloroplast genome of V. sepium, from the genus Vicia. We compare it with chloroplast genomes from related genera (Lens, Pisum, Lathyrus) belonging to tribe Fabeae. The aim of this work is to reveal the genome variation and phylogeny of Fabeae and the genus Vicia and to provide evidence regarding the history of chloroplast genome evolution.

### MATERIALS AND METHODS

#### Plant Material

The sample was collected from the Dongting Lake region (28°48′46.06″N, 112°21′10.19″E) and stored at the Hunan Research Center of Engineering Technology for Utilization of Environmental and Resources Plant, China, under accession number 20170707JJ. Plant sampling was performed in areas that were not privately owned or protected in any way, and no specific permits were required for this study. We collected mature V. sepium leaves and placed them in a liquid nitrogen container. Leaf samples were stored at -80°C until sequencing. Extraction of total chloroplast DNA was carried out with the Plant Chloroplast Purification Kit and Column Plant DNA Extraction Kit (Beijing Baiaolaibo Technology, Co., Ltd., China). The chloroplast DNA of V. sepium was fragmented using a Covaris M220 (Covaris, USA) instrument. Whole-genome sequencing and paired-end (PE) library construction were performed according to the method described by Zhang et al. (2017). Raw data were obtained through next-generation sequencing with PE 150-bp reads. Then, N-containing sequences and adapter sequences were removed. Sequences with a Q value less than 20 or an average quality over a four-base window of less than 20 were also removed. Finally, reads shorter than 50 nt were removed. All the above filtering steps were performed using Trimmomatic v 0.32 (Bolger et al., 2014), and clean data for subsequent analysis were obtained. Then, all high-quality paired reads were assembled into contigs by using SOAPdenovo2 (Luo et al., 2012) and scaffolded by using SSPACE (Boetzer et al., 2011) to obtain the whole-genome sequence. In this process, different K-mers were first tested for assembly, and the best K-mer, k=25, was chosen to obtain the assemblies. This K parameter was determined on the basis of a K-mer curve and experience. Finally, one contig of 124,095 bp was obtained.
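The read-filtering criteria above can be illustrated with a simplified Python sketch. It is not Trimmomatic's implementation and does not reproduce the exact command used in the study; it only mimics the stated thresholds (a four-base window with mean quality 20, a minimum length of 50 nt, and removal of N-containing reads), roughly in the spirit of Trimmomatic's `SLIDINGWINDOW` and `MINLEN` steps. Function names and the handling of edge cases are assumptions.

```python
def sliding_window_trim(quals, window=4, min_mean_q=20):
    """Return the number of leading bases to keep.

    The read is cut at the start of the first window whose mean
    Phred quality falls below min_mean_q (simplified rule).
    """
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean_q:
            return start
    return len(quals)


def filter_read(seq, quals, min_len=50):
    """Return the trimmed (seq, quals) pair, or None if the read is dropped."""
    if "N" in seq.upper():
        return None                      # N-containing reads are removed
    keep = sliding_window_trim(quals)
    if keep < min_len:
        return None                      # too short after trimming
    return seq[:keep], quals[:keep]
```

With these settings, a 70-nt read whose last ten bases have quality 10 is cut once the four-base window mean drops below 20 and survives only if at least 50 nt remain.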

#### Genome Annotation and Sequence Architecture

Our previous study used the programs CpGAVAS (Liu et al., 2012) and DOGMA (Wyman et al., 2004) to annotate the complete chloroplast genome of V. sepium (Li et al., 2018). Here, to examine genomic evolution between V. sepium and its related species in Fabeae, the same V. sepium genome was annotated in Plann (Huang and Cronk, 2015) against the V. sativa genome (NC027155). Gene mapping and relative synonymous codon usage (RSCU) analysis were performed in OGDRAW v1.2 (Lohse et al., 2013) and DAMBE6 (Xia, 2017), respectively, according to Dong's method (Dong et al., 2019).
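For readers unfamiliar with RSCU, the quantity can be sketched as follows: for each codon, the observed count is divided by the count expected if all synonymous codons for that amino acid were used equally. The Python sketch below uses the standard codon assignments and is only an illustration of the definition; it is not DAMBE's implementation, and the function name `rscu` is chosen here.

```python
from collections import Counter

# Standard codon assignments, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds):
    """Return {codon: RSCU} for amino acids present in an in-frame CDS."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        if aa == "*" or total == 0:
            continue  # skip stop codons and amino acids not observed
        for c in family:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(family) / total
    return values
```

An RSCU above 1 indicates a codon used more often than expected under equal usage, and a value below 1 indicates the opposite; codons with a single synonym (ATG, TGG) always have RSCU 1 when present.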

#### SSRs and Repeated Sequences Analysis

We detected SSRs by referring to the method of Lei et al. (2016) using the MISA Perl Script (Thiel et al., 2003) with minimum repeat numbers of 8 for mono-, 4 for di- and tri-, and 3 for tetra-, penta- and hexa-nucleotide SSRs. Forward, palindromic, reverse, and complement sequences were identified as described by Cauz-Santos et al. (2017) using REPuter (Kurtz et al., 2001) with 90% or greater sequence identity and a length of 30 bp or longer. Tandem repeats were identified using Tandem Repeats Finder version 4.09 (Benson, 1999) with default parameters.
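The SSR thresholds above can be illustrated with a small regular-expression scan in Python. This is a simplified stand-in for MISA, not its actual algorithm: it reports perfect repeats only, ignores compound SSRs, and may miss a repeat that overlaps a longer run of a shorter motif. The function names are assumptions.

```python
import re

# Minimum repeat numbers per motif length, as stated in the text.
MIN_REPEATS = {1: 8, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3}


def is_compound_unit(unit):
    """True if the motif is itself a repeat of a shorter motif (e.g. 'AA')."""
    n = len(unit)
    return any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A motif of length k followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_compound_unit(unit):
                continue  # e.g. 'AA' x4 is reported as mono-nucleotide 'A'
            hits.append((m.start(), unit, len(m.group(0)) // k))
    return sorted(hits)
```

Start positions are 0-based. A run of nine adenines is reported once as a mononucleotide SSR rather than additionally as an "AA" dinucleotide repeat, because motifs that decompose into shorter units are skipped.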

#### Comparative Analysis

Blast ring image generator (BRIG) (Alikhan et al., 2011) and mVISTA (Frazer et al., 2004) software were used to compare the complete chloroplast genome variation in all available Fabeae chloroplast genomes using the V. sepium annotation as a reference. BRIG focuses on variation in protein-coding segments, whereas mVISTA aligns the whole chloroplast genome without discrimination. The analysis included the following twenty-one Fabeae species and one Cicereae species (Cicer arietinum), listed with the corresponding GenBank accession numbers: V. sepium, V. sativa (NC027155), V. faba (KF042344), Pisum abyssinicum (NC037830), Pisum sativum (NC014057), Pisum sativum subsp. elatius (NC039371), Pisum fulvum (NC036828), Lens culinaris (NC027152), Lathyrus pubescens (NC027079), Lathyrus venosus (NC027080), Lathyrus palustris (NC027078), Lathyrus japonicus (NC027075), Lathyrus ochroleucus (NC027077), Lathyrus davidii (NC027073), Lathyrus littoralis (NC027076), Lathyrus inconspicuus (NC027149), Lathyrus graminifolius (NC027074), Lathyrus tingitanus (NC027151), Lathyrus clymenum (NC027148), Lathyrus sativus (NC014063), Lathyrus odoratus (NC027150), and C. arietinum (NC011163). Genome rearrangement relative to V. sepium was analyzed in Mauve (Darling et al., 2004).

#### Phylogenetic Analysis

To determine the phylogenetic position of V. sepium within Fabeae, four datasets were used to construct the following phylogenetic trees for Fabeae: (I) the complete chloroplast genomes of 21 Fabeae species and C. arietinum (that is, the same 22 species as in the comparative analysis); (II) the conserved chloroplast protein-coding sequences of 21 Fabeae species and C. arietinum (that is, the same 22 species as in the comparative analysis); (III) the rbcL gene sequences of 50 Fabeae species, Trifolium pratense and T. repens; and (IV) the matK gene sequences of 62 Fabeae species, T. pratense and T. repens. The names of the species included in the four phylogenetic analyses can be found in Table S1.

Specifically, the conserved chloroplast protein-coding sequence of each species comprised 70 concatenated homologous genes shared among twenty-two related species. These genes were atpA, atpB, atpE, atpF, atpH, atpI, ccsA, cemA, clpP, matK, ndhA, ndhB, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK, petA, petB, petD, petG, petL, petN, psaA, psaB, psaC, psaJ, psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbK, psbL, psbM, psbN, psbT, psbZ, rbcL, rpl14, rpl16, rpl2, rpl20, rpl23, rpl32, rpl33, rpl36, rpoA, rpoB, rpoC1, rpoC2, rps11, rps12, rps14, rps15, rps19, rps2, rps3, rps4, rps7, rps8, ycf1, ycf2, and ycf3.

All datasets were aligned using MAFFT v7.380 (Katoh and Standley, 2013) under the FFT-NS-2 default setting. All alignments were then used to construct phylogenetic trees via the neighbor joining (NJ) method in MEGA7.0 (Kumar et al., 2016) under the default settings, yielding four NJ trees.

In addition, we used another method, the maximum likelihood (ML) method, to construct a phylogenetic tree based on conserved chloroplast protein-coding sequences. The aim of this work was to test the effects of different methods on the phylogenetic relationships of Fabeae species. First, we used MAFFT v7.380 to align twenty-two conserved chloroplast protein-coding sequences under the FFT-NS-2 default settings. Second, ModelTest was employed to find the best model in MEGA7.0. Finally, the tree was constructed using the ML method with the GTR+G+I model and 1,000 bootstrap replicates. C. arietinum was selected as the outgroup.

#### Evolutionary Rate Analysis

To determine the sequence divergence of the complete chloroplast genomes, the average pairwise sequence distances of twenty-one Fabeae species and one Cicereae species (that is, the same 22 species as in the comparative analysis) were calculated. After alignment with MAFFT v7.380, the average pairwise sequence distances (K2P rate) of these species were computed in MEGA7 according to Asaf's method (Kimura, 1980; Asaf et al., 2017b).
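For reference, the Kimura two-parameter (K2P) distance between two aligned sequences is d = −(1/2) ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitional and transversional differences. The Python sketch below illustrates this formula only; skipping columns with gaps or ambiguous bases is an assumption made here and does not necessarily match the gap treatment chosen in MEGA7.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura (1980) two-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumed treatment)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Averaging `k2p_distance` over all species pairs in the alignment would give the mean pairwise distance described above.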

Additionally, the synonymous (Ks) and nonsynonymous (Ka) nucleotide substitution rates as well as the Ka/Ks ratio were used to measure the sequence divergence of homologous protein-coding regions. All twenty-one available chloroplast genomes belonging to the genera Vicia, Pisum, Lens, and Lathyrus were selected for this analysis. These species were divided into two groups: (I) within Vicia: V. sepium, V. sativa, V. faba; (II) outside of Vicia (or other genera): V. sepium, P. abyssinicum, P. sativum, P. sativum subsp. elatius, P. fulvum, L. pubescens, L. venosus, L. palustris, L. japonicus, L. ochroleucus, L. davidii, L. littoralis, L. inconspicuus, L. graminifolius, L. tingitanus, L. clymenum, L. sativus, and L. odoratus. A total of 71 homologous genes (Table S2) from these species were selected and examined separately. After aligning each gene using the ClustalW (Codons) program in MEGA7, the Ks, Ka, and Ka/Ks values between V. sepium and other species were determined according to Dong's method (Dong et al., 2019) with the program from the PAML package (Yang and Nielsen, 1998). A two independent-samples t-test was used to examine the significance of the sequence divergence between Vicia and other genera. Levene's test for equality of variances determined which p-value was reported: if the Levene's test result was less than 0.05, we used the p-value from the unequal-variance t-test; otherwise, we used the p-value from the equal-variance t-test.

Once Vicia showed a significantly higher Ka/Ks ratio than the other genera, codon-based likelihood analysis based on the branch model test in CodeML from the PAML package was carried out to identify the lineages in Fabeae that exhibited significantly high evolutionary rates. This test employed a user-defined topology of Fabeae with five lineages: A0 (Cicer), A1 (Pisum and Lathyrus), A2 (Lens and Vicia), A3 (Lens), and A4 (Vicia). This topology was constructed based on the concatenated DNA sequences of matK and rbcL (Figure S1) using the ML method with the GTR+G50 model in MEGA7.0. The method was the same as that used for the phylogenetic analysis described previously. A one-ratio model (model = 0) and a two-ratio model (model = 2) were used to calculate the Ka/Ks ratio for each branch. A one-ratio model, or null model (model = 0), is one in which all clades (or all lineages) exhibit the same Ka/Ks ratio. A two-ratio model, or alternative model (model = 2), is one in which one or more clades present different Ka/Ks ratios. The transition/transversion and Ka/Ks ratios were estimated automatically. Codon frequencies were set according to the F3×4 model. The hypotheses of the two-ratio model are described in Table S3. The likelihood ratio test (LRT) was used to find the best model (P < 0.05) through comparison of two different models. From the best model, we could infer whether a homologous gene showed accelerated evolution in Vicia. In addition, all genes exhibiting accelerated evolution were compared with two genes showing nonaccelerated evolution (matK and rbcL), in two ways. First, we compared their synonymous and nonsynonymous nucleotide substitution rates in Ks trees and Ka trees. The branch lengths representing the substitutions per synonymous site or nonsynonymous site were determined from the best model. Second, we compared their amino acid sequence differences.
Amino acid sequence alignment was performed in Jalview v2.10.5 (Waterhouse et al., 2009).
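The likelihood ratio test described above compares the log-likelihoods that CodeML reports for the null and alternative models: the statistic 2(lnL_alt - lnL_null) is referred to a chi-square distribution whose degrees of freedom equal the difference in the number of free parameters. The Python sketch below is an illustration under the assumption that the two-ratio model adds exactly one Ka/Ks parameter (df = 1), which need not hold for every hypothesis in Table S3; the function name `lrt_p_value` is hypothetical.

```python
import math
from typing import Tuple


def lrt_p_value(lnl_null: float, lnl_alt: float) -> Tuple[float, float]:
    """Return (LRT statistic, p-value) for nested models differing by one parameter."""
    # The alternative model cannot fit worse than the nested null model;
    # small negative values can still arise from optimisation noise.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of the chi-square distribution with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

With this rule, the two-ratio model would be preferred over the one-ratio model when the returned p-value is below 0.05, matching the threshold stated above.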

### RESULTS

#### Chloroplast Genome Characteristics and Structure of V. sepium

The image data obtained by next-generation sequencing were converted into sequence reads by CASAVA base calling, yielding 10,808,365 raw reads (3.24 gigabytes of raw data). A total of 7,696,368 clean reads (2.31 gigabytes of clean data) with an average length of 150 bp were obtained after the adapter sequences and low-quality reads were removed. A single long contig of 124,095 bp was assembled de novo from the clean data, forming a loop representing the whole chloroplast genome sequence of V. sepium. The V. sepium chloroplast genome, deposited under GenBank accession number NC_039595, showed the loss of one IR and contained 76 protein-coding genes, 29 tRNA genes, four rRNA genes and one pseudogene (rpl23 Ψ). In particular, one unannotated protein-coding gene, ORF292, was identified (Table 1). The gene map of these 110 genes is presented in Figure 1.

TABLE 1 | Genes predicted in the chloroplast genome of V. sepium.

One open reading frame, ORF292, could not be annotated. <sup>a</sup> pseudogene; <sup>b</sup> trans-splicing gene; <sup>c</sup> duplicated gene.

Among these protein-coding genes, nine genes (ndhA, ndhB, rpl2, rpl16, petD, petB, atpF, rpoC1, clpP) contained a single intron, while one gene, ycf3, contained two introns (Table 2). Additionally, four tRNA genes containing one intron were identified: trnV-UAC, trnA-UGC, trnI-GAU, and trnL-UAA. As observed in most legumes, the infA, rpl22, and rps16 genes were lost (Lei et al., 2016). The overall GC content of the V. sepium chloroplast genome was 35.0%, whereas that of the protein-coding, intron, tRNA, rRNA and IGS regions was 36.7%, 34.6%, 52.3%, 54.2%, and 29.2%, respectively (Table S4). The RSCU result revealed that the V. sepium protein-coding sequences showed codon usage bias, with all preferred synonymous codons ending with A/T nucleotides and a high AT content at the 3rd codon positions (72.2%) (Figure S2, Table S4).
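The RSCU of a codon is its observed count divided by the count expected if all synonymous codons for the same amino acid were used equally, so values above 1 mark preferred codons. The Python sketch below illustrates this calculation together with the AT content at third codon positions. It assumes the standard genetic code and in-frame coding sequences, and the function names `rscu` and `at3_content` are hypothetical; it is not the program used in this study.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def iter_codons(cds_sequences):
    """Yield every complete, unambiguous codon from in-frame coding sequences."""
    for cds in cds_sequences:
        cds = cds.upper().replace("U", "T")
        for pos in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[pos:pos + 3]
            if codon in CODON_TABLE:
                yield codon


def rscu(cds_sequences):
    """Relative synonymous codon usage for each sense codon of observed amino acids."""
    counts = Counter(iter_codons(cds_sequences))
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != "*":  # stop codons are excluded
            families[amino_acid].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


def at3_content(cds_sequences):
    """Fraction of codons whose third position is A or T."""
    codons = list(iter_codons(cds_sequences))
    return sum(c[2] in "AT" for c in codons) / len(codons)
```

For example, in a toy sequence containing only the alanine codons GCT (three times) and GCA (once), GCT receives an RSCU of 3.0 and GCA an RSCU of 1.0, while the unused GCC and GCG receive 0.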

#### SSRs and Repeats in V. sepium

We analyzed the presence of SSRs and repeats in V. sepium. SSRs, which are regarded as useful gene markers, exhibit a high mutation rate. In this study, a total of 201 SSRs were found in the chloroplast genome of V. sepium (Figure 2). A majority of the SSRs were composed of mono-nucleotide and di-nucleotide repeat motifs. Characterization of the SSR types revealed that the mono-nucleotide repeats mainly consisted of A/T (98.5%) and that the di-nucleotide repeats mainly consisted of AT/TA (86.8%). A total of 116 and 66 V. sepium SSRs were distributed in the IGS and CDS regions, respectively (Figure 2).
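An SSR search of this kind amounts to scanning the sequence for a short motif repeated in tandem at least a minimum number of times. The Python sketch below shows the idea with regular expressions for mono-, di-, and tri-nucleotide motifs. The minimum repeat numbers in `MIN_REPEATS` are placeholder assumptions, not the thresholds used in this study, and the function name `find_ssrs` is hypothetical.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative values only).
MIN_REPEATS = {1: 10, 2: 5, 3: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Only motif lengths 1 to 3 are supported, because the homopolymer check
    below is sufficient to reject redundant motifs only for those lengths.
    """
    if any(unit_len not in (1, 2, 3) for unit_len in min_repeats):
        raise ValueError("only motif lengths 1, 2 and 3 are supported")
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # "AA" or "AAA" runs are mono-nucleotide SSRs, not di/tri motifs.
            if unit_len > 1 and len(set(motif)) == 1:
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)
```

Tallying the returned motifs by length and base composition would then give summaries of the kind reported above, such as the share of A/T mono-nucleotide repeats.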

Repeat sequences play an important role in genome rearrangements and phylogenetic reconstruction (Cavalier-Smith, 2002), as well as in indel and substitution variation (Yi et al., 2013). Sixty-two repeats, including 46 forward repeats, 4 palindromic repeats, and 12 tandem repeats, were found in the chloroplast genome of V. sepium. The lengths of the palindromic repeats were 45, 50, 54, and 155 bp, and the lengths of the forward repeats and tandem repeats ranged from 45 to 222 bp and 32 to 229 bp, respectively (Table S5). In addition, the largest number of repeats (n = 49) was located in IGS regions, followed by CDSs (n = 27) (Table S5). We also found that most of these repeats were located in the psaB-rps14 (n = 20), ycf1-trnN-GUU (n = 10), accD (n = 6) and rps14 (n = 5) regions.

#### Comparative Analyses of the Chloroplast Genomes of Fabeae Species

Twenty complete chloroplast genomes from within Fabeae were selected for comparison with V. sepium. One Cicereae species, C. arietinum, was set as the outgroup (Table 3). Chloroplast genome length in these species ranged from 120,289 bp (L. odoratus) to 126,421 bp (L. pubescens), and the greatest variation in length relative to V. sepium was 3.0% in the protein-coding region of L. culinaris, followed by the IGS region (2.8%) of L. culinaris. An average difference in length of only 0.1% was found in the tRNA and rRNA gene regions. Additionally, the GC content of the twenty-two complete chloroplast genomes ranged from 33.9% to 35.2%, exhibiting little change. After comparing V. sepium genes with those of twenty-one other Fabeae species, we found a unique inserted gene, the unannotated protein-coding gene ORF292, between rps12 and rps4 in V. sepium. Moreover, the rps12 to rps4 region in V. sativa also contained an inserted duplicated rpl20 gene (not mentioned in the table). From the genome rearrangement analysis, we infer that inversion events may result in gene insertion (Figure S3). We also found a pseudogene, rpl23, in V. sepium, V. sativa, P. abyssinicum, P. sativum, P. sativum subsp. elatius and L. sativus. Analysis of gene and intron losses showed that all twenty-two species have lost the infA, rpl22, and rps16 genes, similar to most of the IR-lacking species. The ycf4 gene was found only in V. sepium, V. faba, P. sativum, and L. sativus. Moreover, one intron of the clpP and rpl16 genes was lost in L. graminifolius and V. faba, respectively (Table 3).

FIGURE 1 | Gene map of the complete chloroplast genome of V. sepium. Genes inside the circle are transcribed clockwise, and those outside are transcribed counterclockwise. The different colors of the blocks represent different functional groups. The darker gray color of the inner circle corresponds to the GC content, and the lighter gray color corresponds to the AT content.

The sequence identity of the chloroplast genomes of V. sepium and twenty-one other Fabaceae species was visualized (Figures 3 and S4), and the results revealed that coding regions are more highly conserved than noncoding regions. Usually, regions with 50% or less sequence identity can be regarded as highly divergent regions. In coding regions, ycf1, ycf2, rpl23, rps3, rps18, accD, rpoC1, clpP, ORF292, ycf4, psaI, and rpl32 contained relatively low identity regions. In addition, the highly divergent noncoding regions include rps15-ycf1, ycf1-trnN-GUU, rrn16-rps12, ycf2-trnI-CAU, trnI-CAU-rpl23, rpl16 intron, rpl14-rps8, rps8-rpl36, psbB-petL, accD-trnQ-UUG, trnQ-UUG-psbK, psbE-clpP, clpP-rps12, psaB-rps14, psbD-trnT-GUU, ycf4-psaI, psaI-trnL-UAA, and rpl32-ndhF (Figures 3 and S4).

#### Evolutionary Rate of Fabeae Species


The pairwise distances (K2P rates) of complete chloroplast genome sequences from twenty-one Fabeae species and one Cicereae species were calculated (Table S6). The results showed that the nucleotide variability rate ranged from 0.001 to 0.248 (L. sativus vs C. arietinum). Compared with V. sepium, the lowest K2P rate was 0.027 (V. sativa) while the highest K2P rate was found in C. arietinum (0.246) (Table S6). The mean K2P rate between Pisum and V. sepium was 0.217. The mean K2P rate between Lathyrus and V. sepium was 0.193. Specifically, the K2P rate between V. faba and V. sepium was 0.207, which was higher than the rate between V. sepium and some Lathyrus species. We hypothesized that V. sepium and V. sativa were located in the same clade and showed different evolutionary directions compared with V. faba.

Ka and Ks nucleotide substitution rates, as well as the Ka/Ks ratio, were calculated within Vicia and outside of Vicia with V. sepium as the reference (Table S2, Figure 4). The Ka/Ks ratio is an important parameter for determining the selective constraint acting on each gene (Keller et al., 2017). Ka/Ks > 1 indicates that a gene is under positive selection, whereas Ka/Ks = 1 and Ka/Ks < 1 indicate neutral evolution and purifying selection, respectively (Kimura, 1980). The mean Ks between V. sepium and twenty Fabeae species ranged from 0.0058 (petN) to 0.2375 (ycf1), and the mean Ka ranged from 0 (petG, psbF) to 0.1846 (clpP) (Table S2). Within the genus Vicia, nine genes (ccsA, clpP, rpl32, rpl33, rpoC1, rps15, rps2, rps4 and rps7) with a Ka/Ks ratio >1 (Figure 4) evolved under beneficial mutations, and 60 genes evolved under purifying selection, including sixteen genes that evolved almost neutrally, showing a ratio range of 0.5 to 1. Twelve conserved genes (atpH, petG, petN, psaC, psbA, psbD, psbF, psbH, psbK, psbL, psbM and rpl36), with Ka/Ks = 0, were under very strong purifying selective pressure. Comparison of sequence divergence between Vicia and other genera showed that the Ka/Ks ratios of eight genes (accD, atpA, matK, rpl32, rpl33, rps2, rps4, ycf1) were significantly higher (P < 0.05) in Vicia, and among these genes, the ratios of accD, atpA, rpl32, rps2 and rps4 were highly significantly higher (P < 0.01).
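The interpretation rule for Ka/Ks stated above can be written as a small helper. This Python sketch is only an illustration: the function name `classify_selection`, the optional `tolerance` band around 1, and the decision to report the ratio as undefined when Ks is zero are assumptions, not part of the analysis in this study.

```python
def classify_selection(ka: float, ks: float, tolerance: float = 0.0) -> str:
    """Classify a gene by its Ka/Ks ratio.

    Returns "positive" (Ka/Ks > 1), "neutral" (Ka/Ks = 1 within the tolerance),
    "purifying" (Ka/Ks < 1), or "undefined" when Ks is zero.
    """
    if ka < 0 or ks < 0:
        raise ValueError("substitution rates cannot be negative")
    if ks == 0:
        return "undefined"  # the ratio cannot be computed without synonymous changes
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "positive" if ratio > 1.0 else "purifying"
```

A gene with Ka = 0 and Ks > 0 has a ratio of 0 and is therefore classified as being under purifying selection, as for the twelve conserved genes listed above.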

Codon-based likelihood analysis (Table S3; Figure S1) was performed to compare the Ka/Ks ratios of the accD, atpA, rpl32, rps2, and rps4 genes across different Fabeae lineages. C. arietinum was set as the reference. The null model (H0) hypothesized that the A0 (Cicer), A1 (Pisum and Lathyrus), A2 (Lens and Vicia), A3 (Lens), and A4 (Vicia) clades exhibit the same Ka/Ks ratio. The alternative model hypothesized that one or more clades present different Ka/Ks ratios. Comparison of the two models by the likelihood ratio test showed that the best models for accD, atpA, rpl32, rps2, and rps4 are H2, H3, H0, H2, and H0, respectively (Table S3). A higher Ka/Ks ratio in a specific clade is considered to indicate accelerated evolution of the clade. The Ka/Ks ratios of accD, atpA and rps2 in the Vicia clade were higher than those in the Cicer clade, and the Ka/Ks ratios of rpl32 and rps4 were the same in the two clades. The results revealed that evolutionary rates increased in atpA, rps2 and accD of the Vicia lineage but exhibited no change in rpl32 or rps4 (Table S3). The Ka/Ks ratios of rps2 in the Vicia clade were higher than those in the Pisum and Lathyrus clade, but the Vicia clade presented lower Ka/Ks ratios in the accD and atpA genes. The results revealed that rps2 exhibited a higher evolutionary rate in Vicia, while atpA and accD in Pisum and Lathyrus evolved much faster. We also compared the synonymous and nonsynonymous nucleotide substitution rates of genes that evolved rapidly (accD, atpA, and rps2) in different Fabeae lineages to the rates observed in genes that did not evolve rapidly (rbcL and matK) based on codon-based ML phylogenetic analysis. As shown in Figure 5, in the Ka and Ks trees, the substitutions per nonsynonymous site of rps2 accumulated much faster in Vicia than in other Fabeae species, but no similar acceleration was observed in rbcL and matK. In addition, all Fabeae lineages showed accelerated evolution in the accD gene, with high synonymous and nonsynonymous nucleotide substitution rates compared to rbcL and matK. This result supplements the findings of Magee et al. (2010). We also detected amino acid differences in the accD, atpA, rps2, matK, and rbcL genes within and outside of Vicia by aligning the sequences from Fabeae species (Figures S5–S9). Notably, there is less amino acid sequence conservation in accD (83.03% identity between Vicia species) and rps2 (91.98% identity between Vicia species) than in matK (94.23% identity between Vicia species) and rbcL (99.16% identity between Vicia species). The lengths of the accD amino acid sequences ranged from 165 to 1,141 residues.
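Percent identity values such as those quoted above depend on how gaps are treated, which the text does not specify. The Python sketch below shows one common convention as an assumption: columns where both sequences have a gap are ignored, and a residue aligned to a gap counts as a mismatch. The function name `percent_identity` is hypothetical, and the sketch does not reproduce the Jalview calculation.

```python
def percent_identity(aligned1: str, aligned2: str, gap: str = "-") -> float:
    """Percent identity between two rows of an amino acid alignment."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    identical = columns = 0
    for a, b in zip(aligned1.upper(), aligned2.upper()):
        if a == gap and b == gap:
            continue  # column carries no information for this pair
        columns += 1
        if a == b:
            identical += 1
    if columns == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * identical / columns
```

Under this convention, length differences such as those seen in accD lower the identity, because every residue opposite a gap is counted as a difference.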

#### Phylogenetic Analysis of V. sepium

Considering the rather limited number of complete Vicia chloroplast genomes (only 3), it is difficult to determine whether Vicia is paraphyletic. Therefore, in addition to the complete chloroplast genomes and conserved chloroplast protein-coding sequences, we constructed phylogenetic trees of Vicia using two widely sequenced chloroplast genes, namely, rbcL and matK, to support our result. Detailed information regarding these four datasets can be found in Table S1. Upon comparing the four NJ trees, we found that V. sepium, V. sativa, and V. faba were located in the same evolutionary branch with support rates of 100% in the protein-coding sequence tree, 99% in the matK tree, and 49% in the rbcL tree. However, in the whole-genome tree, the result was different, with V. sepium and V. sativa located in the same clade and V. faba located in another clade. These results indicated that the evolutionary histories of V. sepium and V. sativa were similar but different from that of V. faba (Figures S10–S12). Both the rbcL and matK phylogenetic trees showed that Vicia species were included in different clades, which supports our hypothesis that Vicia is paraphyletic (Figures S11 and S12).

Both the NJ and ML phylogenetic trees for homologous protein-coding sequences showed that Vicia and Lens were included in the same clade, together with Pisum and Lathyrus (Figure 6), but the ML tree presented a higher support rate for the Vicia and Lens clade than the NJ tree.

TABLE 3 | Characteristics of twenty-one Fabeae species and Cicer arietinum.

sativum subsp. elatius; P. ful, P. fulvum; L. cul, L. culinaris; L. pub, L. pubescens; L. ven, L. venosus; L. pal, L. palustris; L. jap, L. japonicus; L. och, L. ochroleucus; L. dav, L. davidii; L. lit, L. littoralis; L. inc, L. inconspicuus; L. gra, L. graminifolius; L. tin, L. tingitanus; L. cly, L. clymenum; L. sat, L. sativus; L. odo, L. odoratus; C. ari, C. arietinum. \*pseudogenes: rpl23 in V. sepium, V. sativa, P. abyssinicum, P. sativum, P. sativum subsp. elatius and L. sativus; ycf1 in L. culinaris; ycf4 in P. sativum. \*\*intron gains: one intron added to tRNA-Gly (V. faba, P. sativum, L. sativus, V. faba, C. arietinum) and ycf2 (P. fulvum, L. pubescens, L. venosus, L. palustris, L. japonicus, L. ochroleucus, L. littoralis, L. inconspicuus, L. graminifolius, L. clymenum, L. odoratus). intron losses: one intron missing in clpP (L. graminifolius) and rpl16 (V. faba).

Methods section. All trees were drawn to the same scale representing the number of substitutions per synonymous or nonsynonymous site.

### DISCUSSION

#### Beneficial Gene Mutations Observed in the Protein-Coding Regions

In our study, within genus Vicia, ccsA, clpP, rpl32, rpl33, rpoC1, rps15, rps2, rps4, and rps7 showed positive selection, with a Ka/Ks ratio >1 (Figure 4). None of these genes are related to photosynthesis (psa, psb, ndh, pet, atp). In fact, genes related to photosynthesis were under less selection pressure than other types of genes (Du et al., 2016; Li et al., 2017; Gao et al., 2018). Such positive selection is also found in other species, as observed for two genes flanking ycf4 (accD and cemA) in Lathyrus (Magee et al., 2010); accD, ycf1, and atpA in seed plants (Zheng et al., 2017); rps14 in Dodonaea viscosa and Sapindus mukorossi (Saina et al., 2018); and the atpF gene in two deciduous Quercus species (Yin et al., 2018). In general, genes under selection pressures are mainly identified by comparing the synonymous and nonsynonymous nucleotide substitution rates in related species. Thus, genes under positive selection pressure in different lineages can be identified. However, the positive selection acting on genes in a specific lineage contrasts with the silent molecular clock hypothesis, according to which the point mutation rate in all regions of the same genome is almost constant (Ochman and Wilson, 1987). The factors causing a higher Ka/Ks ratio in some sequences than in the rest of the genome remain unclear. Here, we consider two explanations for this difference. One possible explanation for this phenomenon is that a greater number of nucleotide substitutions are associated with gene duplications and gene losses. Erixon found that positive selection acting on the clpP gene in various plant lineages is related to repeated duplication (Erixon and Oxelman, 2008). Magee showed that the Ka/Ks ratios of cemA and accD flanking ycf4 are >1 in Lathyrus. This may occur because the increase in the nucleotide mutation rate near the hypermutational ycf4 gene affects the purifying selection acting on the amino acid sequence (Magee et al., 2010). 
Another possibility is that differential selection may act on gene divergence. For example, research on oak species showed that the atpF gene is highly divergent (Ka/Ks > 1) in the comparison between deciduous oak and evergreen sclerophyllous oak because the former loses its leaves in the cold and dry seasons (Yin et al., 2018). Another study on seed plants suggested that genes affected by positive selection, such as accD, ycf1 and atpA, are always involved in plant adaptation (Zheng et al., 2017).

We also found that atpA, accD, and rps2 of Vicia showed significantly accelerated evolution (Figures 4, 5, S5–S7, Table S3). Rps2, encoding the ribosomal protein S2, is retained in almost all plants. The exceptions mainly occur in Apocynaceae. For example, in milkweeds, a 2.4-kb mitochondrial DNA sequence was horizontally transferred to the rps2-rpoC2 plastid intergenic region, resulting in two pseudogenes, namely, rps2 and rpoC2, contained in plastomes (Straub et al., 2013). However, such plastome insertion is rare. A relatively common type of evolution is the point mutation described in our study. For example, Ka and Ks rates are elevated in parasitic Scrophulariaceae and Orobanchaceae, which provide suitable material for studying the evolution of hemi- and holoparasitic plant lineages (dePamphilis et al., 1997). In Gossypium, the Yrp8 and Cys11 sites of rps2 and the other nine genes are undergoing protein sequence evolution, which may aid the adaptation of cotton species to diverse environments (Wu et al., 2018). The accelerated evolution of atpA (participating in ATP synthesis) has also been found in other species, such as Dipsacales (Fan et al., 2018) and Urophysa (Xie et al., 2018) species. Consistent with our study, only one to three sites show positive selection. AccD is essential for plant leaf development and has been lost in some angiosperm lineages. It is believed that accD was functionally transferred to the nucleus (Magee et al., 2010; Sabir et al., 2014).

At present, Vicia is the only known legume genus in which so many genes show positive selection and accelerated evolution in the chloroplast genome. Therefore, a comprehensive understanding of the mechanism underlying the increased nucleotide substitution of homologous protein-coding genes is necessary, and Vicia species may be suitable model systems for such studies.

#### Genome Variation in the Chloroplast Genomes of V. sepium

To detect the genome variation in the chloroplast genome of V. sepium, we compared V. sepium with related genera in the tribe Fabeae. Our results revealed that the greatest variation in genome length relative to V. sepium was located in protein-coding regions (Table 3). This finding is consistent with Zheng's research (Zheng et al., 2017), showing that chloroplast gene length is an important factor affecting chloroplast genome size based on phylogenetic signals. The length variation of protein-coding regions may result from gene loss and gain or differences in the lengths of homologous genes. Ycf4, encoding a photosystem I assembly protein, is the most easily deleted gene in Fabeae species (Table 3). This result supports previous findings revealing that ycf4 has been lost in many species of Lathyrus and Pisum due to its functional transfer to the nuclear genome (Magee et al., 2010). Furthermore, gene insertion events involving one new unannotated protein-coding gene, namely, ORF292 (879 bp), and one duplicated gene, namely, rpl20 (354 bp), were found in V. sepium and V. sativa, respectively. One pseudogene, rpl23, was identified in V. sepium and V. sativa (Table 3). This indicates that the evolutionary histories of V. sepium and V. sativa are similar and that V. faba may be located in a different evolutionary clade. In general, a chloroplast gene cannot be lost arbitrarily unless the function of the gene is transferred to the nuclear genome or replaced by that of a nuclear gene (Magee et al., 2010). Therefore, the mechanism of loss of the rpl23 gene in V. sepium and V. sativa requires further in-depth research. In addition to gene loss, one intron was also missing in clpP (L. graminifolius) and rpl16 (V. faba) (Table 3). The clpP gene normally contains two introns in angiosperms (Jansen et al., 2007; Jansen et al., 2008). Jansen determined that the IRLC lineage (in which Fabeae is included) has lost one intron of clpP (Jansen et al., 2008).
However, the loss of two introns observed in clpP is rare; Sabir's research (Sabir et al., 2014) on the IRLC lineage (in which Fabeae is included) showed that this phenomenon has only occurred in Glycyrrhiza glabra, and our findings are complementary to this previous work. V. faba was the only species found to have lost the intron of rpl16 in the tribe Fabeae, and the rpl16 intron shows high divergence in Chusquea (Kelchner and Clark, 1997), Gleditsia (Schnabel and Wendel, 1998), and Cacteae (Butterworth et al., 2002). This result indicates that different evolutionary clades exist in Vicia. In addition to gene loss and gain, differences in the lengths of homologous genes are also found in Fabeae species (ranging from 495 to 3,423, 36 to 537, and 3,879 to 5,403 in accD, rps12 and ycf1, respectively). In seed plants, the length difference in atpA, accD, and ycf1 is the main reason for chloroplast genome size variation (Zheng et al., 2017).

In addition to protein-coding region expansion and contraction in V. sepium, protein-coding sequence divergence also exists. In our study, the GC content of the chloroplast genome of V. sepium was found to be lower than that of other species, such as Chikusichloa mutica [tribe rice (Wu et al., 2017)], Arabidopsis thaliana [Brassicaceae (Asaf et al., 2017a)], and Quercus aquifolioides [Fagaceae (Yin et al., 2018)], which exhibit a conserved structure and evolution of the chloroplast genome (Table S4). Normally, a higher GC content indicates a more stable genome sequence (Wu et al., 2017). Therefore, to consider the genome variation in V. sepium protein-coding regions, we surveyed SSRs, repeat loci, highly divergent regions and pairwise sequence divergence. Many SSRs and repeat loci appeared in the protein-coding regions (CDSs) (Table S5, Figure 2). These results are consistent with previous reports on Astragalus membranaceus (Lei et al., 2016). Because of the slippage of DNA strands, SSRs, regarded as useful gene markers, present a high mutation rate (Huang et al., 2018). Repeated sequences are believed to result in aberrant replication and repair pathways (Sabir et al., 2014). The genes ycf1, ycf2, rpl23, rps3, rpl18, accD, rpoC1, clpP, ORF292, ycf4, psaI, and rpl32 share relatively low identity (Figures 3 and S4). V. sepium showed considerable differences from other Fabeae species (with the exception of V. sativa), even V. faba. Therefore, Vicia presents profound genome variation, which is significant for the evolutionary history of the chloroplast genome.

#### Evolution in Vicia

The phylogenetic analysis conducted with the conserved chloroplast protein-coding sequences of rbcL and matK showed that Vicia and Lens were included in the same clade (Figures 6 and S12). This result is also supported by the synapomorphy that is observable in the currently available research. Vicia and Lens both produce the phytoalexin wyerone, which is not found in Pisum and Lathyrus (Schaefer et al., 2012), and show high average protein richness and in vitro protein digestibility (Pastor-Cavada et al., 2014). However, even within Vicia, different evolutionary directions can be found, resulting in the paraphyly of Vicia. For example, in our study, the pairwise distance between V. sepium and V. sativa was much smaller than that between V. sepium and V. faba (0.027 vs. 0.207; Table S6). The former pair also showed gene insertions in the rps12 to rps4 region (Figure S3) and an accelerated evolutionary rate in accD (Figure 5). In addition to chloroplast genome characteristics, the life form, stylar characteristics, and chromosome numbers of these species support this result. Ancestral Vicia species originating from the Mediterranean shared an annual life form, a basic chromosome number of 2n=14 and evenly hairy styles. However, the recent evolutionary reconstruction of Vicia indicates that a perennial life form, a chromosome number of 2n=12 (or 10, 24, 28, 42) and adaxially/abaxially hairy styles have arisen in Vicia (Schaefer et al., 2012). In the comparison of Vicia species in our study, all of the species were found to produce adaxially hairy styles, but V. sepium has evolved a perennial life form, while V. sativa and V. faba share the same characteristic of an annual life form. Nevertheless, the evolution of the life form of Vicia verified that V. sepium and V. sativa had a shared evolutionary history.
Therefore, we can infer from all of these results that Vicia species may adopt different evolutionary strategies and that the chloroplast genome provides ideal material for reconstructing the evolutionary history of Vicia.

In summary, a new chloroplast genomic resource for an important wild resource plant, V. sepium, is presented. This study fills the gap in V. sepium genomic resources and provides novel insights into evolutionary dynamics in a poorly studied Vicia clade. Our results reveal that Vicia species may have experienced many instances of positive selection in the chloroplast genome and accelerated evolution of protein-coding genes, which is rare, being found in only a few angiosperm species. Detailed surveys show that V. sepium presents profound genomic variation in terms of ORF292 gene insertion, rpl23 pseudogene detection, lower GC content, CDS length variation, and accelerated evolution of the atpA, accD, and rps2 genes. Analysis of the phylogenetic relationships shows that Vicia and Lens are included in the same clade and that the evolutionary direction of V. sepium and V. sativa is different from that of V. faba. Therefore, Vicia species may be a suitable model system for understanding the mechanisms of chloroplast genome evolution. This study is expected to attract researchers toward Vicia species, leading to the identification of further evidence regarding the evolutionary history of the chloroplast genome.

## DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/Supplementary Material.

### AUTHOR CONTRIBUTIONS

CL, ZX, and GY conceived the study. All authors collected field samples. CL, JP, and XP analyzed the final data. YZ acquired funds (2016NK2148, 2016TP2007) for this study. CL wrote the original manuscript, and all authors commented on an early draft of the manuscript.

#### FUNDING

This work was supported by the Major Science and Technology Program of Hunan Province (2017NK1014), Key Technology R&D Program of Hunan Province (2016NK2148, 2016TP2007, 2017TP2006), Forestry Science and Technology Project of Hunan Province (XLK201825, XLK201920) and Natural Science Foundation of Hunan Province (2019JJ50027).

#### ACKNOWLEDGMENTS

We would like to thank Wu Liang for providing insightful writing assistance. We would also like to thank the anonymous reviewers for their valuable comments.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00073/full#supplementary-material

FIGURE S1 | Topology of Fabeae lineages obtained from a concatenated data set consisting of matK and rbcL. C. arietinum was selected as the outgroup.

FIGURE S2 | Codon usage and relative synonymous codon usage (RSCU) of the V. sepium chloroplast genome. The color of the histogram corresponds to the color of the codon. The size of the histogram corresponds to the RSCU of the codon. The X-axis represents different amino acids and the associated codons.

FIGURE S3 | Genomic rearrangement of six Fabeae species relative to V. sepium. Locally collinear blocks (LCBs) are colored to indicate syntenic regions. Blocks below the center line indicate regions that align in the reverse complement (inverse) orientation. The small boxes below the LCBs of each chloroplast genome are represented as genes.

FIGURE S4 | Alignment visualization of twenty-two Fabaceae complete chloroplast genomes using V. sepium as a reference. The vertical scale indicates the percent identity, ranging from 50% to 100%. Arrows indicate the annotated genes and their transcriptional direction. The different colored boxes correspond to exons, tRNA or rRNA, and noncoding sequences (CNSs).

FIGURE S5 | Alignments of the accD protein sequences from Fabeae species.

FIGURE S6 | Alignments of the atpA protein sequences from Fabeae species.

FIGURE S7 | Alignments of the rps2 protein sequences from Fabeae species.

FIGURE S8 | Alignments of the matK protein sequences from Fabeae species.

FIGURE S9 | Alignments of the rbcL protein sequences from Fabeae species.

FIGURE S10 | Phylogenetic relationships based on the complete chloroplast genomes of twenty-two related species obtained by the neighbor joining (NJ) method. C. arietinum was selected as the outgroup.

FIGURE S11 | Phylogenetic relationships based on rbcL gene sequences of 50 Fabeae species, T. pratense and T. repens obtained by the neighbor joining (NJ) method. T. pratense and T. repens were selected as the outgroup.

FIGURE S12 | Phylogenetic relationships based on matK gene sequences of 62 Fabeae species, T. pratense and T. repens obtained by the neighbor joining (NJ) method. T. pratense and T. repens were selected as the outgroup.


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Li, Zhao, Xu, Yang, Peng and Peng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
